Comparative Analysis of AI-Generated vs. Journal Peer Review: A Comprehensive Methodological Report for Replication
Version 2.0 — January 2026
This report presents a detailed methodology for conducting standardized comparative analyses of AI-generated peer reviews against actual journal peer reviews. The study analyzed five manuscripts submitted to The BMJ between 2021 and 2023, comparing reviews from PeerGenius.ai (an AI-powered peer review platform) with the original human expert reviews. The methodology employs a 10-dimensional scoring framework, issue detection analysis, and critical issue tracking to provide a comprehensive assessment of AI performance. This document serves as a complete replication guide, including all formulas, scoring rubrics, data collection procedures, and analysis steps. The comparison was conducted using Manus.im, an autonomous AI agent platform, with AI reviews generated by PeerGenius.ai's Premier Review tier. The BMJ was selected due to its unique open peer review policy, which provides public access to the complete peer review history, including the original submitted manuscripts. All five selected manuscripts were published under a CC-BY license, which is more permissive than the standard CC-BY-NC license used for most BMJ articles, ensuring full legal compliance for this research.
The peer review process is a cornerstone of scientific publishing, serving as the primary mechanism for quality control and validation of research findings. However, traditional peer review faces numerous challenges, including delays, inconsistency, bias, and the increasing burden on volunteer reviewers. Recent advances in artificial intelligence, particularly large language models (LLMs), have created new opportunities to augment the peer review process.
This study was designed to rigorously evaluate the performance of an AI-powered peer review system (PeerGenius.ai) against the established benchmark of human expert review at a prestigious medical journal (The BMJ). The objective was to determine whether AI has achieved parity with human reviewers, identify complementary strengths and weaknesses, and provide evidence-based recommendations for the development of hybrid human-AI peer review systems.
This was a retrospective comparative analysis of five manuscripts submitted to The BMJ. For each manuscript, we compared the AI-generated review (from PeerGenius.ai) with the original journal peer review (from The BMJ's human expert reviewers). Both reviews evaluated the exact same version of the manuscript (the initial submission, pre-revision), ensuring a fair and direct comparison.
PeerGenius.ai's Premier Review employs a sophisticated multi-agent system consisting of seven specialized AI reviewers and one Editor-in-Chief. Each agent is powered by a frontier large language model and serves a distinct function in the review process. This architecture is designed to replicate the diversity of perspectives typically found in a multi-reviewer journal peer review process.
The Seven Specialized Reviewers each assess the manuscript from a distinct perspective; their outputs are then consolidated by the eighth agent:
Editor-in-Chief — Synthesizes all reviewer feedback into a comprehensive editorial decision letter, weighs different perspectives, and provides clear guidance on revisions.
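PeerGenius.ai's internal implementation is proprietary, so the following is only a generic sketch of the multi-agent pattern described above: several specialized reviewer agents each produce a report, and an Editor-in-Chief agent synthesizes them into a decision letter. The role labels, prompts, and the call_llm helper are hypothetical placeholders, not the platform's actual code.

```python
from dataclasses import dataclass

def call_llm(system_prompt: str, manuscript: str) -> str:
    # Placeholder standing in for a call to a frontier LLM; not PeerGenius.ai's actual interface.
    return f"[model output for prompt: {system_prompt[:40]}...]"

@dataclass
class ReviewerAgent:
    role: str           # hypothetical role label, e.g. "statistics" or "methodology"
    system_prompt: str  # instructions defining this reviewer's distinct function

    def review(self, manuscript: str) -> str:
        return call_llm(self.system_prompt, manuscript)

def premier_review(manuscript: str, reviewers: list[ReviewerAgent]) -> str:
    """Run each specialized reviewer, then have an Editor-in-Chief agent synthesize
    the reports into a single decision letter with guidance on revisions."""
    reports = [agent.review(manuscript) for agent in reviewers]
    editor_prompt = (
        "You are the Editor-in-Chief. Weigh the reviewer reports below and write a "
        "comprehensive editorial decision letter with clear guidance on revisions.\n\n"
        + "\n\n".join(reports)
    )
    return call_llm(editor_prompt, manuscript)
```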
The BMJ operates an open peer review system for many of its article types, which makes it uniquely suitable for comparative studies of peer review quality. Under this policy, the complete peer review history, including the reviewer reports and the originally submitted (pre-revision) manuscript, is made publicly available alongside the published article. The five manuscripts selected for analysis are summarized in Table 1.
Table 1: Selected Manuscripts
| Manuscript | Year | Study Design | Pages |
|---|---|---|---|
| Hippisley-Cox et al. | 2022 | Prediction Model (QCovid4) | 49 |
| Mok et al. | 2024 | Pharmacoepidemiology (Antipsychotics) | 231 |
| Morales et al. | 2023 | Interrupted Time Series (QOF) | 23 |
| Rees et al. | 2022 | Cohort Study (Shoulder Surgery) | 25 |
| Woolf et al. | 2023 | Mendelian Randomization (Sildenafil) | 51 |
Manuscript 1: Hippisley-Cox et al. (2022)
Manuscript 2: Mok et al. (2024)
Manuscript 3: Morales et al. (2023)
Manuscript 4: Rees et al. (2022)
Manuscript 5: Woolf et al. (2023)
All five manuscripts selected for this study were published under the Creative Commons Attribution (CC-BY) license. This is a critical detail, as it provides the legal basis for using these manuscripts in this research, including processing them with an AI system. The CC-BY license is more permissive than the CC-BY-NC (Non-Commercial) license under which most BMJ articles are published. The CC-BY license allows for unrestricted reuse, redistribution, and modification, for both commercial and non-commercial purposes, as long as appropriate attribution is given.
The entire comparative analysis was conducted within the Manus.im platform. A detailed prompt (see Appendix A) was provided to the Manus agent, which then autonomously executed the analysis, including: reading and extracting content from both review documents, applying the 10-dimensional scoring framework, categorizing all identified issues, tracking critical issues, generating structured data files (CSV/JSON), creating comprehensive visualizations, and writing detailed analysis reports.
The analytical framework consists of three main components: (1) dimensional scoring, (2) issue detection analysis, and (3) critical issue tracking. Each component is described in detail below, including all formulas and scoring rubrics.
Both the AI and journal reviews were evaluated across ten standardized dimensions on a 0–10 scale. The framework was designed to capture both the technical rigor and the practical utility of peer review feedback.
1. Statistical Rigor: Identification of statistical flaws, multiple testing issues, power analysis, and appropriateness of methods.
Scoring Rubric:
2. Methodological Standards: Enforcement of reporting guidelines (CONSORT/STROBE), completeness of methods, and reproducibility requirements.
Scoring Rubric:
3. Clinical/Domain Context: Field-specific knowledge, clinical interpretation, and understanding of real-world practice.
Scoring Rubric:
4. Study Design Critique: Evaluation of design appropriateness, confounding, bias, and generalizability.
Scoring Rubric:
5. Data Quality & Verification: Detection of numerical errors, inconsistencies, and impossible values.
Scoring Rubric:
6. Interpretive Depth: Evaluation of conclusions, identification of over-reaching claims, and causality assessment.
Scoring Rubric:
7. Systematic Completeness: Comprehensive coverage of all manuscript sections; thoroughness.
Scoring Rubric:
8. Actionability & Structure: Clarity of feedback, organization (required/recommended/optional), and specificity.
Scoring Rubric:
9. Tone & Constructiveness: Balance of criticism and encouragement; professionalism.
Scoring Rubric:
10. Editorial Judgment: Appropriateness of the decision (accept/revise/reject) and calibration of severity.
Scoring Rubric:
The Overall Quality Score is calculated as the arithmetic mean of all ten dimensional scores:
Overall Quality Score = (Σ Dimensional Scores) / 10
Where: Dimensional Scores are the ten individual scores (0-10) assigned to the dimensions described above.
Quality Rating Categories:
- 9.0-10.0: Outstanding
- 8.0-8.9: Excellent
- 7.0-7.9: Good
- 6.0-6.9: Adequate
- Below 6.0: Fair/Poor
Parity is defined as achieving an overall quality score within 1.0 point of the comparator. This threshold was chosen to represent meaningful equivalence while allowing for minor differences.
Parity Achieved = |AI Score - Journal Score| < 1.0
Where: AI Score and Journal Score are the overall quality scores (0-10) of the AI-generated review and the journal review, respectively.
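As a minimal sketch (not part of the study's actual pipeline), the scoring and parity rules above can be expressed as follows, using the rating bands and the 1.0-point parity threshold defined in this section:

```python
def overall_quality_score(dimensional_scores: list[float]) -> float:
    """Arithmetic mean of the ten dimensional scores (each on a 0-10 scale)."""
    assert len(dimensional_scores) == 10
    return sum(dimensional_scores) / 10

def quality_rating(score: float) -> str:
    """Map an overall quality score to the rating categories listed above."""
    if score >= 9.0:
        return "Outstanding"
    if score >= 8.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 6.0:
        return "Adequate"
    return "Fair/Poor"

def parity_achieved(ai_score: float, journal_score: float, threshold: float = 1.0) -> bool:
    """Parity: the absolute difference between overall scores is less than the threshold."""
    return abs(ai_score - journal_score) < threshold
```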
All unique issues identified in both reviews were systematically categorized to assess the degree of overlap and complementarity between the AI and human approaches.
Each unique issue was categorized into one of three mutually exclusive categories: identified by both reviews (convergence), identified by the journal review only, or identified by the AI review only.
Operational Definition of “Same Issue”: Two issues were considered the “same” if they referred to the same specific problem, even if expressed differently. For example, “multiple testing correction not applied” and “no adjustment for multiple comparisons” would be considered the same issue.
The Complementarity Score quantifies the degree to which the two reviews identified different sets of issues. A higher score indicates greater complementarity (i.e., the reviews are more synergistic and less redundant).
Complementarity Score = [(N_Journal_Only + N_AI_Only) / N_Total_Unique] × 100%
Where: N_Journal_Only is the number of issues identified only by the journal review, N_AI_Only is the number of issues identified only by the AI review, and N_Total_Unique is the total number of unique issues across both reviews (Journal Only + AI Only + Both).
Interpretation:
- 70-100%: Very high complementarity (the reviews catch very different issues)
- 50-69%: High complementarity (substantial differences)
- 30-49%: Moderate complementarity (some overlap)
- Below 30%: Low complementarity (high overlap)
Example Calculation:
Suppose a comparison yields 14 journal-only issues, 10 AI-only issues, and 6 issues identified by both reviews (30 total unique issues):
Complementarity Score = [(14 + 10) / 30] × 100% = 80.0%
This would be interpreted as “very high complementarity,” indicating that the two reviews identified substantially different sets of issues and are highly synergistic.
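The same calculation in code form, reproducing the example above (a minimal sketch rather than the study's actual implementation):

```python
def complementarity_score(n_journal_only: int, n_ai_only: int, n_both: int) -> float:
    """Percentage of unique issues identified by only one of the two reviews."""
    n_total_unique = n_journal_only + n_ai_only + n_both
    return (n_journal_only + n_ai_only) / n_total_unique * 100

# Example from the text: 14 journal-only, 10 AI-only, 6 identified by both (30 unique issues).
print(complementarity_score(14, 10, 6))  # 80.0 -> very high complementarity
```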
Critical issues were defined as those that could fundamentally invalidate the study's findings or conclusions. These were tracked separately to assess the ability of each review to identify the most serious methodological or statistical flaws.
For each manuscript, a binary matrix was created to track which critical issues were detected by each review:
Table 2: Critical Issue Detection Matrix (Example)
| Critical Issue | Journal Detected | AI Detected |
|---|---|---|
| Multiple testing correction not applied | No | Yes |
| Time-varying confounding not addressed | No | Yes |
| Autocorrelation not handled | No | Yes |
| Graphing error (predicted = actual) | Yes | No |
From this matrix, we calculated the number and proportion of critical issues detected by each review, as well as the critical issues caught by one review but missed by the other.
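A minimal sketch of how the binary matrix in Table 2 can be represented and summarized; the counts below are derived from the example rows above, not from the study's results:

```python
# Binary detection matrix from Table 2 (example): issue -> (journal_detected, ai_detected)
critical_issues = {
    "Multiple testing correction not applied": (False, True),
    "Time-varying confounding not addressed": (False, True),
    "Autocorrelation not handled": (False, True),
    "Graphing error (predicted = actual)": (True, False),
}

n_total = len(critical_issues)
journal_detected = sum(j for j, _ in critical_issues.values())
ai_detected = sum(a for _, a in critical_issues.values())
caught_by_both = sum(j and a for j, a in critical_issues.values())

print(f"Journal detected {journal_detected}/{n_total} critical issues")
print(f"AI detected {ai_detected}/{n_total} critical issues")
print(f"Detected by both: {caught_by_both}/{n_total}")
```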
For each manuscript, the peer review history was accessed through The BMJ's website.
From the peer review history PDF, the following information was extracted for each reviewer:
For each manuscript, the original submitted version (pre-revision) was obtained from The BMJ's peer review history. This ensured that the AI reviewed the exact same version that the human reviewers evaluated.
The AI review included:
For each review (both AI and journal), a comprehensive set of notes was created, including:
All unique issues were extracted from both reviews and compiled into a master list. Each issue was coded with its detection category (Both, Journal Only, or AI Only), its issue type (statistical, methodological, clinical, etc.), and its severity (Critical, Major, or Minor).
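As an illustration of what such a coded record might look like in the structured exports, the sketch below uses the category, type, and severity fields named in this report; the exact schema is illustrative rather than the one used in the study:

```python
from dataclasses import dataclass, asdict
from typing import Literal

@dataclass
class CodedIssue:
    description: str
    detection: Literal["Both", "Journal Only", "AI Only"]
    issue_type: Literal["Statistical", "Data Quality", "Methodological", "Design", "Interpretive"]
    severity: Literal["Critical", "Major", "Minor"]

# Example record, drawn from the critical-issue examples in Table 2.
issue = CodedIssue(
    description="Multiple testing correction not applied",
    detection="AI Only",
    issue_type="Statistical",
    severity="Critical",
)
print(asdict(issue))
```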
The analysis was conducted in seven sequential steps:
Step 1: Document Review. Both reviews were read in full; all issues, concerns, and recommendations were extracted, and the editorial decision from each review was noted.
Step 2: Dimensional Scoring. The 10-dimensional framework was applied to each review independently, specific evidence was documented for each score, and overall quality scores were calculated.
Step 3: Issue Categorization. A master list of all unique issues across both reviews was compiled; each issue was categorized as Both, Journal Only, or AI Only, classified by severity (Critical, Major, Minor), and the complementarity score was calculated.
Step 4: Critical Issue Analysis. The critical issues caught by each review were identified, detection was classified as explicit, implied, or absent, and the impact on manuscript validity was assessed.
Step 5: Data Export. All CSV/JSON files were generated with structured, machine-readable data for future analysis.
Step 6: Visualization. The comprehensive multi-panel figure was created, including the key findings summary box, with all panels clearly labeled.
Step 7: Report Writing. An executive summary of the main findings was written, each dimension was analyzed in detail, and actionable recommendations were generated.
For each manuscript, the analysis generated the following deliverables:
A multi-panel figure (typically 7–8 panels) including: overall quality scores, dimensional performance comparison, issue detection overlap, critical issue detection, a review characteristics profile, AI version improvement (where applicable), a key findings summary box, and the complementarity score. Full panel specifications are given in Appendix A.
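To illustrate the figure specification, the sketch below draws one panel (the dimensional performance comparison) with matplotlib, following the blue-for-Journal, green-for-AI color scheme from Appendix A; the scores are placeholders, not results from this study:

```python
import matplotlib.pyplot as plt
import numpy as np

# The 10 standardized dimensions described in the scoring framework above.
dimensions = [
    "Statistical Rigor", "Methodological Standards", "Clinical/Domain Context",
    "Study Design Critique", "Data Quality & Verification", "Interpretive Depth",
    "Systematic Completeness", "Actionability & Structure",
    "Tone & Constructiveness", "Editorial Judgment",
]
# Placeholder scores for illustration; replace with the scores from Step 2 for a real manuscript.
journal_scores = [7, 7, 9, 8, 6, 8, 7, 7, 8, 8]
ai_scores = [10, 10, 7, 8, 9, 7, 9, 9, 8, 8]

x = np.arange(len(dimensions))
width = 0.38

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(x - width / 2, journal_scores, width, label="Journal", color="tab:blue")
ax.bar(x + width / 2, ai_scores, width, label="AI", color="tab:green")
ax.set_ylabel("Score (0-10)")
ax.set_ylim(0, 10)
ax.set_title("Panel 2: Dimensional Performance Comparison")
ax.set_xticks(x)
ax.set_xticklabels(dimensions, rotation=45, ha="right")
ax.legend()
fig.tight_layout()
fig.savefig("panel2_dimensional_comparison.png", dpi=150)
```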
After analyzing all five manuscripts, a cross-study synthesis was generated, including:
Researchers seeking to replicate this study should consider:
All materials used in this study (journal peer reviews and submitted manuscripts) were obtained from publicly available sources (The BMJ's open peer review history). No confidential or proprietary information was used.
This methodology report provides complete transparency about all procedures, formulas, and decision rules, enabling full replication by other researchers.
Researchers conducting similar studies should disclose any financial or professional relationships with AI peer review platforms or journals.
This methodology provides a rigorous, standardized, and replicable framework for comparing AI-generated peer reviews with human expert reviews. The 10-dimensional scoring system, issue detection analysis, and critical issue tracking provide a comprehensive assessment of AI performance. The use of The BMJ's open peer review history ensures that comparisons are fair and direct, as both AI and human reviewers evaluated the exact same manuscript version. This approach can be used by other researchers to evaluate different AI peer review systems, different journals, or different types of research.
The following prompt was provided to the Manus.im autonomous AI agent to conduct each comparative analysis. Researchers can use this exact prompt to replicate the study with their own manuscripts.
I want you to conduct a standardized comparative analysis of an AI-generated peer review (from my SaaS application) against an actual journal peer review for an academic manuscript.
## TASK OVERVIEW
Compare the AI peer-review with the journal peer-review to:
1. Evaluate the AI's performance using a standardized framework
2. Identify what the AI caught vs. missed compared to the journal review
3. Assess complementarity and convergence between the two approaches
4. Generate actionable insights for improving the AI system
## EVALUATION FRAMEWORK
Evaluate both reviews using these 10 standardized dimensions (0-10 scale):
1. **Statistical Rigor**: Identification of statistical flaws, multiple testing issues, power analysis, appropriate methods
2. **Methodological Standards**: Enforcement of reporting guidelines (CONSORT/STROBE), completeness of methods, reproducibility requirements
3. **Clinical/Domain Context**: Field-specific knowledge, clinical interpretation, understanding of real-world practice
4. **Study Design Critique**: Evaluation of design appropriateness, confounding, bias, generalizability
5. **Data Quality & Verification**: Detection of numerical errors, inconsistencies, impossible values
6. **Interpretive Depth**: Evaluation of conclusions, identification of over-reaching claims, causality assessment
7. **Systematic Completeness**: Comprehensive coverage of all manuscript sections, thoroughness
8. **Actionability & Structure**: Clarity of feedback, organization (required/recommended/optional), specificity
9. **Tone & Constructiveness**: Balance of criticism and encouragement, professionalism
10. **Editorial Judgment**: Appropriateness of decision (accept/revise/reject), calibration of severity
**Overall Quality Score** = Average of all 10 dimensions
**Quality Ratings**:
- 9.0-10.0: Outstanding
- 8.0-8.9: Excellent
- 7.0-7.9: Good
- 6.0-6.9: Adequate
- Below 6.0: Fair/Poor
## ISSUE DETECTION ANALYSIS
Categorize all identified issues into:
- **Issues identified by BOTH** (convergence) - list each issue
- **Issues identified by JOURNAL only** - list each issue
- **Issues identified by AI only** - list each issue
- **Total unique issues**
Calculate **Complementarity Score** = (Journal Only + AI Only) / Total Unique Issues × 100%
**Interpretation**:
- 70-100%: Very high complementarity (reviews catch very different issues)
- 50-69%: High complementarity (substantial differences)
- 30-49%: Moderate complementarity (some overlap)
- Below 30%: Low complementarity (high overlap)
## CRITICAL ISSUE TRACKING
For each review, identify and categorize critical issues:
1. **Statistical Flaws**: Multiple testing, power, assumptions, model violations
2. **Data Quality Errors**: Numerical impossibilities, inconsistencies, missing data
3. **Methodological Gaps**: Missing methods section, incomplete reporting, non-reproducible
4. **Design Limitations**: Confounding, bias, generalizability issues
5. **Interpretive Issues**: Causality claims, over-reaching conclusions, misinterpretation
## DELIVERABLES
### 1. Comprehensive Visualization (PNG)
Generate a multi-panel figure including:
**IMPORTANT**:
- Use bar charts (grouped or stacked) for all comparisons. DO NOT use heatmaps.
- All panels should use clear bar chart visualizations that are easy to interpret at a glance.
- Layout: Arrange panels in a grid format (e.g., 3 rows × 3 columns or 2 rows × 3 columns)
- The Key Findings Box should be prominent and easily readable (bottom-right or bottom-center)
- Use consistent color scheme: Blue for Journal, Green for AI
**Panel 1: Overall Quality Scores** (bar chart)
- Journal review score
- AI review score
- Highlight if parity achieved (scores within 0.5 points)
**Panel 2: Dimensional Performance Comparison** (grouped bar chart)
- All 10 dimensions side-by-side
- Color-code by winner (blue=journal, green=AI)
**Panel 3: Issue Detection Overlap** (bar chart)
- Journal Only
- Both (Convergence)
- AI Only
- Display complementarity percentage
**Panel 4: Critical Issues Detection** (grouped bar chart)
- Show which critical issues each review detected
- Use grouped bars for Journal vs AI
- Use detection levels: "Explicit", "Implied", "Not Detected" (or binary: Detected=1, Not Detected=0)
- Highlight issues caught by both (convergence) with background shading
**Panel 5: Review Characteristics Profile** (radar chart)
- Overlay journal and AI profiles
- Show complementary strengths visually
**Panel 6: AI Version Improvement** (grouped bar chart - if comparing AI versions)
- Show key dimensions where AI improved
- Previous AI version vs Updated AI version
- Focus on 4-5 most important dimensions (Completeness, Error Detection, Tone, Judgment)
**Panel 7: Key Findings Box** (text summary)
Include a comprehensive text box with:
KEY FINDINGS: [Year] JOURNAL vs [Year] AI
✓ PARITY ACHIEVED (or NOT ACHIEVED): [Summary statement about overall quality]
CRITICAL CONVERGENCE:
□ [Issue 1]: BOTH caught [description]
□ [Issue 2]: BOTH caught [description]
□ [Issue 3]: BOTH caught [description]
JOURNAL ADVANTAGES (Still Superior):
□ [Advantage 1]: [Description]
□ [Advantage 2]: [Description]
□ [Advantage 3]: [Description]
AI ADVANTAGES (Now Superior):
□ [Advantage 1]: [Description]
□ [Advantage 2]: [Description]
□ [Advantage 3]: [Description]
DRAMATIC IMPROVEMENT FROM PREVIOUS AI VERSION (if applicable):
- [Improvement 1]
- [Improvement 2]
- [Improvement 3]
STRATEGIC RECOMMENDATION:
[One clear, actionable recommendation for the AI system]
**Panel 8: Complementarity Score** (circular badge)
- Large percentage display
- Interpretation label
### 2. Data Export Files (CSV/JSON)
**File 1: dimensional_scores.csv**
Dimension,Journal_Score,AI_Score,Winner,Gap
Statistical Rigor,7,10,AI,3
Methodological Standards,7,10,AI,3
...
Overall,8.5,9.0,AI,0.5
**File 2: issue_detection.json**
{
"manuscript": "Author et al. Year",
"convergence": ["Issue 1", "Issue 2", ...],
"journal_only": ["Issue 1", "Issue 2", ...],
"ai_only": ["Issue 1", "Issue 2", ...],
"complementarity_score": 73.3,
"total_unique_issues": 15
}
**File 3: critical_issues.csv**
Issue_Type,Issue_Description,Journal_Detected,AI_Detected,Severity
Statistical Flaws,Multiple testing correction,No,Yes,Critical
Data Quality,Impossible confidence intervals,Yes,Yes,Critical
...
### 3. Detailed Analysis Documents (Markdown)
**File 1: standardized_scores.md**
- Complete dimensional analysis with evidence and examples for each score
- Justification for each rating
- Specific quotes from reviews supporting the scores
**File 2: issue_breakdown.md**
- Comprehensive list of all issues identified by each review
- Categorization by type (statistical, methodological, clinical, etc.)
- Severity classification (critical, major, minor)
**File 3: comparison_report.md**
- Executive summary
- Dimensional performance analysis
- Critical issue detection summary
- Complementarity analysis
- Strategic recommendations for AI improvement
- Implications for development priorities
### 4. Summary Table (Markdown)
| Dimension | Journal Score | AI Score | Winner | Gap |
|-----------|---------------|----------|--------|-----|
| Statistical Rigor | X | X | X | X |
...
| **Overall** | **X.XX** | **X.XX** | **X** | **X.XX** |
## ANALYSIS APPROACH
### Step 1: Document Review
- Read both reviews thoroughly
- Extract all issues, concerns, and recommendations from each
- Note the editorial decision from each review
### Step 2: Dimensional Scoring
- Apply the 10-dimensional framework to each review independently
- Document specific evidence supporting each score
- Calculate overall quality scores
### Step 3: Issue Categorization
- Create a master list of all unique issues identified across both reviews
- Categorize each issue as: Both, Journal Only, or AI Only
- Classify severity: Critical, Major, Minor
- Calculate complementarity score
### Step 4: Critical Issue Analysis
- Identify which critical issues each review caught
- Determine if detection was explicit, implied, or absent
- Assess impact on manuscript validity
### Step 5: Data Export
- Generate all CSV/JSON files with structured data
- Ensure data is machine-readable for future analysis
### Step 6: Visualization
- Create comprehensive multi-panel figure
- Include key findings summary box
- Ensure all panels are clearly labeled
### Step 7: Report Writing
- Write executive summary highlighting main findings
- Provide detailed analysis of each dimension
- Generate actionable recommendations
## KEY PRINCIPLES
1. **Consistency**: Use the same dimensions and scoring criteria across all manuscripts
2. **Evidence-based**: Support all scores with specific examples from the reviews
3. **Balanced**: Acknowledge strengths and weaknesses of both approaches
4. **Actionable**: Provide clear, prioritized recommendations for improvement
5. **Complementarity focus**: Emphasize how the two approaches work together
6. **Data-driven**: Export all data in structured formats for future analysis
7. **Visual clarity**: Make the key findings immediately apparent in the visualization
## SPECIAL FOCUS AREAS
When comparing AI vs. Journal reviews, pay particular attention to:
1. **What the AI caught that the journal missed**: These are the AI's unique value propositions
2. **What the journal caught that the AI missed**: These are the AI's development priorities
3. **Convergence on critical issues**: Evidence that AI has achieved human-level capability
4. **Complementarity patterns**: Understanding the systematic differences between approaches
## DOCUMENTS I WILL PROVIDE
I will upload:
- **Journal peer-review document**: The actual review(s) received from the journal
- **AI peer-review document**: The review generated by my SaaS application
- **Manuscript context** (optional): Title, study type, field
Please confirm you understand the framework and are ready to receive the documents.
Detailed scoring rubrics for each of the 10 dimensions are provided in Section 3.1.1 above.
Manuscript: Hippisley-Cox et al. (2022)
Journal Only Issues (12):
AI Only Issues (21):
Both (Convergence) Issues (9):
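Applying the complementarity formula to these counts: Complementarity Score = [(12 + 21) / 42] × 100% ≈ 78.6%, i.e. very high complementarity.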
Example data file formats used in this analysis:
File 1: dimensional_scores.csv
Dimension,Journal_Score,AI_Score,Winner,Gap
Statistical Rigor,7,10,AI,3
Methodological Standards,7,10,AI,3
...
Overall,8.5,9.0,AI,0.5
File 2: issue_detection.json
{
"manuscript": "Author et al. Year",
"convergence": ["Issue 1", "Issue 2", ...],
"journal_only": ["Issue 1", "Issue 2", ...],
"ai_only": ["Issue 1", "Issue 2", ...],
"complementarity_score": 73.3,
"total_unique_issues": 15
}
File 3: critical_issues.csv
Issue_Type,Issue_Description,Journal_Detected,AI_Detected,Severity
Statistical Flaws,Multiple testing correction,No,Yes,Critical
Data Quality,Impossible confidence intervals,Yes,Yes,Critical
...
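A minimal sketch of how these exports can be re-loaded and cross-checked, assuming the file names shown above and that the literal "..." placeholder rows have been replaced with real data:

```python
import csv
import json

# Load the per-dimension scores exported in Step 5.
with open("dimensional_scores.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Recompute the overall quality scores as a consistency check against the "Overall" row.
dim_rows = [r for r in rows if r["Dimension"] != "Overall"]
journal_overall = sum(float(r["Journal_Score"]) for r in dim_rows) / len(dim_rows)
ai_overall = sum(float(r["AI_Score"]) for r in dim_rows) / len(dim_rows)

# Load the issue-detection summary and recompute the complementarity score.
with open("issue_detection.json") as f:
    issues = json.load(f)
complementarity = (len(issues["journal_only"]) + len(issues["ai_only"])) / issues["total_unique_issues"] * 100

print(f"Journal overall: {journal_overall:.2f}, AI overall: {ai_overall:.2f}")
print(f"Complementarity: {complementarity:.1f}% (reported: {issues['complementarity_score']}%)")
```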
Important Note
This analysis is based on a preliminary comparison of five manuscripts submitted to The BMJ between 2021 and 2023. While the results provide encouraging evidence, the sample size is limited and findings should be interpreted with appropriate caution.
PeerGenius recommends a complementary hybrid approach: AI review as a first-pass screening for statistical and methodological rigor, combined with human expert review for clinical context, interpretive depth, and domain-specific judgment. AI review complements but does not replace traditional peer review.