AI Reviewer is an automated document analysis system designed to assist academic peer review by systematically evaluating the relationship between claims and their supporting evidence in research documents. The project’s goal is to apply recent LLMs, agent-based workflows, and techniques from the recent literature to help researchers, reviewers, and academics improve the rigor and quality of their work.
This project is funded by RAND’s CAST Center (RAND Center on AI, Security, and Technology).
This page outlines the project’s scientific and technical approach and presents its results, including real input/output examples. For development setup and usage instructions, see the README and DEVELOPMENT files in the GitHub repository.
Automated scholarly paper review (ASPR) is an emerging field that applies artificial intelligence and natural language processing to assist the peer review process. As the volume of academic publications continues to grow exponentially, traditional manual review processes face increasing challenges in scalability, consistency, and timeliness [1]. ASPR systems aim to augment human reviewers by automating various aspects of document evaluation, including claim verification, citation analysis, and evidence assessment.
Recent surveys on LLMs for ASPR [2] indicate that large language models have shown transformative potential for the full-scale implementation of automated review systems. LLMs are being widely adopted across the academic review process, where they improve review efficiency, generate high-quality structured comments, validate checklists, and check for technical errors. The incorporation of LLMs has enabled new capabilities such as long-text modeling, multi-modal input processing, and advanced prompt engineering techniques that address many of the technological bottlenecks that previously limited ASPR systems. However, this integration also introduces new challenges, including concerns about bias, inaccuracies, privacy risks, and the need for transparent disclosure of AI usage in the review process.
This project focuses on providing an end-to-end, ready-to-use open source tool that leverages commercial large language models (LLMs) and various methods found in the most recent literature and state-of-the-art research. The system addresses critical aspects of scholarly document quality through the systematic evaluation of claim-evidence relationships: ensuring that claims are properly substantiated by their cited references, identifying gaps in evidentiary support, and recommending improvements to strengthen the document’s foundation.
The system addresses these primary research questions:
For organizational quality assurance workflows, the system also includes experimental analyses:

The system accepts two primary inputs: a main document to be reviewed and a set of supporting documents/references that provide the evidentiary foundation. These inputs are processed by AI Reviewer, which orchestrates a series of specialized agents to analyze the document. The output is a comprehensive table containing all extracted elements (files, chunks, claims, citations, and their verification results) along with a detailed analysis summary. The web interface provides multiple views (Summary, Explorer, Files, Chunks, Citations) to navigate the results and assess the quality of claim substantiation throughout the document.
The system processes documents through a multi-stage pipeline implemented using LangGraph, which orchestrates a series of specialized AI agents:
Document Conversion: Input documents (PDF, DOCX, Markdown) are converted to structured Markdown using MarkItDown, preserving their semantic structure.
Document Chunking: Documents are segmented into semantically coherent chunks using NLTK-based sentence splitting with LLM fallback for complex text. The chunking process:
Claim Extraction: An LLM-based agent extracts factual claims from each chunk. Claims are defined as decontextualized propositions—assertions that can be understood and verified independently of their surrounding context. The extraction process considers:
Citation Detection: Citations are identified and mapped to their corresponding references in the document’s bibliography. The system handles various citation formats and associates citations with claims based on proximity and paragraph-level context.
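The proximity-based association described above can be illustrated with a minimal sketch. It assumes numeric bracketed citations such as [1] or [2, 3] and claim positions given as character spans; the names `detect_citations`, `gap`, and `associate` are hypothetical and are not taken from the project's code, which handles more citation formats and uses richer paragraph-level context.

```python
import re
from dataclasses import dataclass

# Assumption: numeric bracketed citations such as [3] or [1, 4].
NUMERIC_CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


@dataclass
class Citation:
    ref_number: int  # index of the entry in the bibliography
    position: int    # character offset of the marker in the paragraph


def detect_citations(paragraph: str) -> list[Citation]:
    """Find every numeric citation marker and expand grouped ones like [2, 3]."""
    found = []
    for match in NUMERIC_CITATION.finditer(paragraph):
        for number in match.group(1).split(","):
            found.append(Citation(int(number.strip()), match.start()))
    return found


def gap(position: int, span: tuple[int, int]) -> int:
    """Character distance from a marker to a claim span (0 if inside it)."""
    start, end = span
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))


def associate(paragraph: str, claim_spans: dict[str, tuple[int, int]]) -> dict[str, list[int]]:
    """Attach each citation to the nearest claim within the same paragraph."""
    links: dict[str, list[int]] = {claim: [] for claim in claim_spans}
    if not claim_spans:
        return links
    for citation in detect_citations(paragraph):
        nearest = min(claim_spans, key=lambda c: gap(citation.position, claim_spans[c]))
        links[nearest].append(citation.ref_number)
    return links


paragraph = "Review volume is rising [1]. LLMs can draft structured comments [2, 3]."
claims = ["Review volume is rising", "LLMs can draft structured comments"]
spans = {c: (paragraph.find(c), paragraph.find(c) + len(c)) for c in claims}
links = associate(paragraph, spans)
```

In this example the first claim is linked to reference 1 and the second to references 2 and 3. A purely distance-based rule like this one is also what makes distant citations easy to misattribute.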
Reference Extraction: Bibliographic references are extracted using section detection and windowed extraction, enabling mapping between in-text citations and their full reference entries.
Reference File Matching: Supporting documents are matched to extracted references to enable verification against full-text sources.
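One simple way to picture this matching step is fuzzy comparison between a reference title and the uploaded file names. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the source does not say how the project performs matching, so the function name, the normalization, and the 0.5 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/underscores to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_reference_to_file(
    reference_title: str,
    file_names: list[str],
    threshold: float = 0.5,  # assumed cut-off, would need tuning on real data
) -> Optional[str]:
    """Return the best-matching file name, or None if nothing is close enough."""
    best_name, best_score = None, 0.0
    for name in file_names:
        stem = name.rsplit(".", 1)[0]  # drop the file extension
        score = SequenceMatcher(None, normalize(reference_title), normalize(stem)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


files = ["automated_scholarly_paper_review.pdf", "llm_survey_2025.pdf"]
matched = match_reference_to_file(
    "Automated Scholarly Paper Review: Concepts, Technologies, and Challenges", files
)
unmatched = match_reference_to_file("Quantum gravity notes", files)
```

Here `matched` is the first file and `unmatched` is `None`. References without a matching file cannot be verified against full text.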
Claim Categorization: Extracted claims are classified into six categories:
Each category determination includes an assessment of whether external verification is required, filtering out common knowledge claims that do not necessitate citation.
Claim Verification: Claims are verified against supporting documents using RAG-based verification:
text-embedding-3-large embeddings
Inference Validation: Claims identified as inferential or interpretive are analyzed using the Toulmin model of argumentation to detect potential logical fallacies, unsupported leaps, or missing intermediate reasoning steps. The system examines claims, data/grounds, warrants, qualifiers, rebuttals, and backing to identify invalid inferences.
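The six Toulmin components named above can be represented as a small data structure. The following is a sketch under stated assumptions: the class `ToulminArgument` and the rule-based `structural_gaps` check are hypothetical illustrations of what "missing intermediate reasoning steps" could mean structurally, whereas the system itself relies on an LLM-based agent for this analysis.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToulminArgument:
    claim: str                                        # the assertion being argued
    grounds: list[str] = field(default_factory=list)  # data/evidence offered
    warrant: Optional[str] = None    # why the grounds support the claim
    qualifier: Optional[str] = None  # strength of the claim ("likely", "in most cases")
    rebuttal: Optional[str] = None   # acknowledged exceptions
    backing: Optional[str] = None    # support for the warrant itself


def structural_gaps(arg: ToulminArgument) -> list[str]:
    """List structural weaknesses; an empty list does not prove the inference valid."""
    gaps = []
    if not arg.grounds:
        gaps.append("no data/grounds offered for the claim")
    if arg.warrant is None:
        gaps.append("missing warrant linking grounds to claim")
    elif arg.backing is None:
        gaps.append("warrant has no backing")
    if arg.qualifier is None:
        gaps.append("claim is unqualified")
    return gaps


argument = ToulminArgument(
    claim="Automated review will replace human reviewers",
    grounds=["LLMs can generate structured review comments"],
)
gaps = structural_gaps(argument)
```

For this argument the check reports a missing warrant and an unqualified claim, which is the kind of unsupported leap the inference validation step is meant to surface.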
Reference Validation: Uses web search to check if each reference from the document is available online and matches author, title, year, and publisher against public internet sources. Useful for detecting fabricated or hallucinated references.
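The field-by-field comparison can be sketched as follows. The web search itself is not shown: the `found` record stands in for whatever metadata a search returns, and the function name, status labels, and sample values are hypothetical placeholders rather than the project's actual schema.

```python
from typing import Optional

# The four metadata fields named in the text above.
FIELDS = ("author", "title", "year", "publisher")


def _norm(value) -> str:
    """Case-insensitive, whitespace-insensitive form for comparison."""
    return " ".join(str(value or "").casefold().split())


def compare_reference(cited: dict, found: Optional[dict]) -> dict:
    """Compare a cited reference with the record found online, if any."""
    if found is None:
        # Nothing matching was found online: possibly fabricated.
        return {"status": "not_found", "mismatches": []}
    mismatches = [f for f in FIELDS if _norm(cited.get(f)) != _norm(found.get(f))]
    return {"status": "mismatch" if mismatches else "match", "mismatches": mismatches}


# Placeholder data, not a real publication.
cited = {"author": "A. Author", "title": "Review Automation: A Survey",
         "year": 2024, "publisher": "Example Press"}
found = {"author": "A. Author", "title": "Review Automation: A Systematic Survey",
         "year": "2024", "publisher": "Example Press"}

title_problem = compare_reference(cited, found)
missing = compare_reference(cited, None)
```

The first call flags only the `title` field, and the second returns `not_found`. Exact matching after normalization is deliberately strict; a real implementation would likely tolerate abbreviations and author-name variants.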
Literature Review: The system conducts automated literature reviews by:
Citation Suggestion: For claims lacking citations, the system suggests relevant references from the document’s bibliography or external sources, considering:
Methodological Alignment: Analyzes the methodology used in the document against typical methods used in the field, using web search to find field methods context.
Results Extraction: Extracts main results from the document and assesses their reproducibility.
For organizational compliance and quality assurance:
Advocacy & Tone Detection: Uses a two-layer detection approach:
Preface Validation: Validates preface/introduction sections against configurable requirements:
Author Biography Validation: Validates author biographies for:
Agent-Based Design: The system employs a registry-based agent architecture where specialized agents handle distinct tasks. Each agent implements a common protocol, enabling dynamic composition and replacement of components.
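A registry-based design with a shared protocol might look like the sketch below. The `Agent` protocol, the `register` decorator, and the two toy agents are assumptions made for illustration; the project's actual protocol, agent names, and state shape are not given in this page.

```python
from typing import Callable, Protocol


class Agent(Protocol):
    """Common interface every agent implements (hypothetical shape)."""
    name: str

    def run(self, state: dict) -> dict: ...


# Maps an agent name to a factory that builds it.
AGENT_REGISTRY: dict[str, Callable[[], Agent]] = {}


def register(name: str):
    """Class decorator that adds an agent class to the registry."""
    def decorator(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return decorator


@register("claim_extractor")
class SentenceClaimExtractor:
    name = "claim_extractor"

    def run(self, state: dict) -> dict:
        # Toy stand-in for LLM extraction: one "claim" per sentence.
        claims = [s.strip() for s in state["text"].split(".") if s.strip()]
        return {**state, "claims": claims}


@register("claim_counter")
class ClaimCounter:
    name = "claim_counter"

    def run(self, state: dict) -> dict:
        return {**state, "claim_count": len(state["claims"])}


def run_pipeline(agent_names: list[str], state: dict) -> dict:
    """Compose agents by name; swapping one means registering a replacement."""
    for agent_name in agent_names:
        state = AGENT_REGISTRY[agent_name]().run(state)
    return state


result = run_pipeline(["claim_extractor", "claim_counter"], {"text": "A holds. B follows."})
```

Because the pipeline looks agents up by name, replacing a component only requires registering a different class under the same key.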
Workflow Orchestration: LangGraph manages the execution flow, supporting:
Vector Storage: Supporting documents are indexed in PostgreSQL with pgvector extension:
State Management: The workflow maintains a comprehensive state object that tracks:
The system uses GPT-5 (via LangChain) for all agent operations, configured with:
lib/config/llm_models.py
The system includes evaluation capabilities for:
Evaluation datasets are maintained in YAML format with ground truth annotations for systematic testing. Results can be exported and visualized in a dedicated frontend evaluation viewer.
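Scoring system output against ground-truth annotations can be as simple as computing precision and recall over sets of labeled items. The sketch below is an assumption about what such a comparison could look like: the claim identifiers are invented, and an in-memory set stands in for a parsed YAML dataset, whose actual schema is not described here.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision and recall of predicted items against ground-truth items."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: claims the system flagged as unsupported versus
# claims annotated as unsupported in the ground truth.
flagged_by_system = {"claim-1", "claim-2"}
flagged_in_ground_truth = {"claim-1", "claim-2", "claim-3", "claim-4"}

precision, recall = precision_recall(flagged_by_system, flagged_in_ground_truth)
```

In this example every flagged claim is correct (precision 1.0) but half of the annotated problems are missed (recall 0.5).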

The system follows a containerized architecture consisting of three primary containers and integration with external providers. The App Container hosts a Next.js frontend that provides the user interface. This frontend communicates with the Server Container, which houses the core processing engine built on FastAPI and LangGraph. LangGraph orchestrates the agent-based workflow as a directed graph, where each node represents a specialized processing step (claim extraction, verification, citation detection, etc.). The Database Container runs PostgreSQL with the pgvector extension, storing workflow state, execution history, and vector embeddings for semantic search. The Server Container maintains bidirectional communication with the database for both workflow persistence and retrieval-augmented generation (RAG) operations. Finally, the system integrates with External Providers, including OpenAI, Anthropic, Google, and others, for large language model inference, as well as web search capabilities for literature review tasks. This architecture enables flexible deployment, horizontal scaling of processing components, and provider-agnostic LLM integration through a unified interface.
Note: The following examples are excerpts from complete document analyses conducted during actual system evaluations. The excerpts are presented in isolation for clarity, but the agents operate within the full document context, where paragraph-level and document-level information significantly influences claim extraction, citation association, and verification outcomes.
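The idea of a workflow as a directed graph of processing steps can be shown without any framework. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library instead of the LangGraph API, so it illustrates only the ordering concept; the node names and dependency edges are assumptions loosely based on the pipeline stages described earlier.

```python
from graphlib import TopologicalSorter


def make_node(name: str):
    """Build a toy node that records its execution in the shared state."""
    def node(state: dict) -> dict:
        return {**state, "trace": state["trace"] + [name]}
    return node


# Each key depends on the steps in its set (assumed edges, for illustration).
DEPENDENCIES = {
    "chunk": {"convert"},
    "extract_claims": {"chunk"},
    "detect_citations": {"chunk"},
    "verify": {"extract_claims", "detect_citations"},
}

NODES = {
    name: make_node(name)
    for name in ["convert", "chunk", "extract_claims", "detect_citations", "verify"]
}


def run_workflow(state: dict) -> dict:
    """Run every node once, in an order that respects the dependencies."""
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        state = NODES[name](state)
    return state


final_state = run_workflow({"trace": []})
```

Conversion and chunking always run first and verification always runs last, while the two middle steps have no dependency on each other and could run in either order or in parallel.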
The example below demonstrates the system’s capability to assess claim-evidence alignment across different levels of substantiation. Three variations of a sentence extracted from a research document are evaluated: (1) the original sentence, classified as “partially supported” due to a minor overstatement in its claims; (2) a modified version containing explicit contradictions with the cited evidence, correctly identified as “unsupported”; and (3) a refined version with softened language that aligns more precisely with the evidence, classified as “supported”. This illustrates the system’s sensitivity to subtle variations in claim strength and its ability to distinguish between different degrees of evidentiary support.

The following example demonstrates the system’s reference validation capabilities when presented with a fabricated bibliographic entry. The validation agent systematically evaluates the reference’s metadata fields (author, title, publication year, publisher) against online sources and correctly identifies the reference as invalid due to the absence of corresponding published work.

The subsequent example illustrates a more nuanced validation scenario involving a legitimate reference with verifiable online presence. The system detects a discrepancy between the title field in the provided reference and the actual publication title found in online databases.

The following example demonstrates the system’s capability to identify claims that lack appropriate evidentiary support. The system evaluates sentences containing assertions that require citation but are not substantiated by references, and distinguishes these from universally accepted common knowledge that does not necessitate citation. The system classifies such claims as “unsupported” when they represent factual assertions, empirical findings, or domain-specific knowledge that would typically require attribution. Notably, the system performs granular claim-level analysis, as illustrated in the second example where multiple distinct claims are extracted from a single sentence and evaluated independently, enabling precise identification of unsupported assertions within complex statements.

The following example demonstrates the system’s capability to validate inferential and interpretive claims by analyzing their argument structure according to the Toulmin model of argumentation. The system evaluates claims that go beyond direct factual assertions to assess whether they contain logical fallacies, unsupported leaps in reasoning, or missing intermediate steps that would strengthen the argument. For claims identified as inferential or interpretive, the system examines the logical structure connecting the claim to its supporting evidence, identifying potential weaknesses in the reasoning chain and flagging areas where additional justification or intermediate reasoning steps may be required. The first sentence, shown with its Inference Validation analysis, is the original and was marked as valid by the system. The second sentence is a modified version that introduces a logical inconsistency into the claim, which the system correctly flagged.

The images below show examples of output for the citation suggestion and literature review agents.


The following example demonstrates the system’s “live reports” capabilities for published documents. The system analyzed RAND’s research article “Understanding the Artificial Intelligence Diffusion Framework” (published January 2025) and successfully identified that the framework was rescinded on May 13, 2025. This illustrates the system’s ability to detect post-publication changes, retractions, and evolving information that may affect the document’s current validity or relevance.

LLM Dependencies: Verification quality depends on the underlying LLM’s reasoning capabilities and may exhibit biases or errors inherent to the model.
Reference Availability: Citation-based verification requires access to full-text versions of cited references. When unavailable, the system marks claims as unverifiable.
Semantic Retrieval: RAG-based verification relies on semantic similarity, which may retrieve passages that are topically related but do not substantiate specific claims. The verification agent filters these, but false positives are possible.
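The limitation is easy to reproduce with cosine similarity over toy vectors. In the sketch below the three-dimensional vectors are invented stand-ins for real embeddings, and the passages, the `top_k` helper, and the 0.5 threshold are hypothetical; the point is only that a topically related passage can clear a similarity threshold without supporting the claim.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], passages: dict[str, list[float]],
          k: int = 2, min_score: float = 0.5) -> list[tuple[str, float]]:
    """Return up to k passages scoring at least min_score, best first."""
    scored = sorted(((cosine(query, vec), text) for text, vec in passages.items()),
                    reverse=True)
    return [(text, score) for score, text in scored if score >= min_score][:k]


# Toy vectors, not real embeddings.
claim_vector = [1.0, 1.0, 0.0]  # "Drug X reduced mortality"
passages = {
    "Drug X reduced mortality in the trial": [1.0, 0.9, 0.0],    # supports the claim
    "Drug X is produced at two facilities": [1.0, 0.2, 0.0],     # same topic only
    "Rainfall varied across the study region": [0.0, 0.0, 1.0],  # unrelated
}

hits = top_k(claim_vector, passages)
```

Both Drug X passages are retrieved, although only the first one substantiates the claim. This is why a separate verification step has to read the retrieved text rather than trust the similarity score.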
Common Knowledge Boundaries: The distinction between claims requiring citation and common knowledge is domain- and audience-dependent. The system’s categorization may not align with all disciplinary conventions.
Citation Proximity: The system associates citations with claims based on paragraph-level proximity. In cases where citations are distant from their claims, associations may be incorrect.
Processing Scale: Large documents with many claims require significant computational resources. The system supports selective re-evaluation of specific chunks to optimize resource usage.
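One possible way to decide which chunks need re-evaluation is to compare content hashes between runs. This page does not say how the system selects chunks, so the hashing approach and both function names below are assumptions offered purely as an illustration of selective re-evaluation.

```python
import hashlib


def chunk_digest(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_reevaluate(chunks: list[str], previous_digests: dict[int, str]) -> list[int]:
    """Indices of chunks that are new or whose content changed since the last run."""
    return [
        index
        for index, text in enumerate(chunks)
        if previous_digests.get(index) != chunk_digest(text)
    ]


first_run = ["Intro paragraph.", "Claim about X [1].", "Conclusion."]
digests = {i: chunk_digest(text) for i, text in enumerate(first_run)}

# Second run: chunk 1 was edited and a new chunk was appended.
second_run = ["Intro paragraph.", "Revised claim about X [1].", "Conclusion.", "Appendix."]
stale = chunks_to_reevaluate(second_run, digests)
```

Only chunks 1 and 3 would be sent back through the agents, leaving the cached results for the unchanged chunks in place.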
Web Search Dependency: Literature review, reference validation, and methodological alignment analyses require web search access. Results depend on search engine availability and the indexed web content.
QA Screener Customization: The experimental QA screener workflows (advocacy tone, preface validation, author bios) are configurable via YAML but require tuning for different organizational requirements.
[1] Lin, J., Song, J., Zhou, Z., Chen, Y., & Shi, X. (2023). Automated Scholarly Paper Review: Concepts, Technologies, and Challenges. arXiv preprint arXiv:2111.07533. https://arxiv.org/pdf/2111.07533
[2] Zhuang, Z., Chen, J., Xu, H., Jiang, Y., & Lin, J. (2025). Large language models for automated scholarly paper review: A survey. arXiv preprint arXiv:2501.10326. https://arxiv.org/html/2501.10326v1