Comparative Analysis of AI-Generated vs. Journal Peer Review: A Comprehensive Methodological Report for Replication
Version 2.0 — January 2026
This report presents a detailed methodology for conducting standardized comparative analyses of AI-generated peer reviews against actual journal peer reviews. The study analyzed five manuscripts submitted to The BMJ between 2021 and 2023, comparing reviews from PeerGenius.ai (an AI-powered peer review platform) with the original human expert reviews. The methodology employs a 10-dimensional scoring framework, issue detection analysis, and critical issue tracking to provide a comprehensive assessment of AI performance. This document serves as a complete replication guide, including all formulas, scoring rubrics, data collection procedures, and analysis steps. The comparison was conducted using Manus.im, an autonomous AI agent platform, with AI reviews generated by PeerGenius.ai's Premier Review tier. The BMJ was selected due to its unique open peer review policy, which provides public access to the complete peer review history, including the original submitted manuscripts. All five selected manuscripts were published under a CC-BY license, which is more permissive than the standard CC-BY-NC license used for most BMJ articles, ensuring full legal compliance for this research.
The peer review process is a cornerstone of scientific publishing, serving as the primary mechanism for quality control and validation of research findings. However, traditional peer review faces numerous challenges, including delays, inconsistency, bias, and the increasing burden on volunteer reviewers. Recent advances in artificial intelligence, particularly large language models (LLMs), have created new opportunities to augment the peer review process.
This study was designed to rigorously evaluate the performance of an AI-powered peer review system (PeerGenius.ai) against the established benchmark of human expert review at a prestigious medical journal (The BMJ). The objective was to determine whether AI has achieved parity with human reviewers, identify complementary strengths and weaknesses, and provide evidence-based recommendations for the development of hybrid human-AI peer review systems.
This was a retrospective comparative analysis of five manuscripts submitted to The BMJ. For each manuscript, we compared the AI-generated review (from PeerGenius.ai) with the original journal peer review (from The BMJ's human expert reviewers). Both reviews evaluated the exact same version of the manuscript (the initial submission, pre-revision), ensuring a fair and direct comparison.
PeerGenius.ai's Premier Review employs a sophisticated multi-agent system consisting of seven specialized AI reviewers and one Editor-in-Chief. Each agent is powered by a frontier large language model and serves a distinct function in the review process. This architecture is designed to replicate the diversity of perspectives typically found in a multi-reviewer journal peer review process.
The Seven Specialized Reviewers each assess the manuscript from a distinct perspective; their outputs are then consolidated by the eighth agent:
Editor-in-Chief — Synthesizes all reviewer feedback into a comprehensive editorial decision letter, weighs different perspectives, and provides clear guidance on revisions.
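PeerGenius.ai's internal implementation is proprietary, so the following is only a generic sketch of the multi-agent pattern described above: several specialized reviewer agents each produce a report, and an Editor-in-Chief agent synthesizes them into a decision letter. The role labels, prompts, and the call_llm helper are hypothetical placeholders, not the platform's actual code.

```python
from dataclasses import dataclass

def call_llm(system_prompt: str, manuscript: str) -> str:
    # Placeholder standing in for a call to a frontier LLM; not PeerGenius.ai's actual interface.
    return f"[model output for prompt: {system_prompt[:40]}...]"

@dataclass
class ReviewerAgent:
    role: str           # hypothetical role label, e.g. "statistics" or "methodology"
    system_prompt: str  # instructions defining this reviewer's distinct function

    def review(self, manuscript: str) -> str:
        return call_llm(self.system_prompt, manuscript)

def premier_review(manuscript: str, reviewers: list[ReviewerAgent]) -> str:
    """Run each specialized reviewer, then have an Editor-in-Chief agent synthesize
    the reports into a single decision letter with guidance on revisions."""
    reports = [agent.review(manuscript) for agent in reviewers]
    editor_prompt = (
        "You are the Editor-in-Chief. Weigh the reviewer reports below and write a "
        "comprehensive editorial decision letter with clear guidance on revisions.\n\n"
        + "\n\n".join(reports)
    )
    return call_llm(editor_prompt, manuscript)
```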
The BMJ operates an open peer review system for many of its article types, which makes it uniquely suitable for comparative studies of peer review quality. Under this policy, the complete peer review history, including the reviewer reports and the originally submitted (pre-revision) manuscript, is made publicly available alongside the published article. The five manuscripts selected for analysis are summarized in Table 1.
Table 1: Selected Manuscripts
| Manuscript | Year | Study Design | Pages |
|---|---|---|---|
| Hippisley-Cox et al. | 2022 | Prediction Model (QCovid4) | 49 |
| Mok et al. | 2024 | Pharmacoepidemiology (Antipsychotics) | 231 |
| Morales et al. | 2023 | Interrupted Time Series (QOF) | 23 |
| Rees et al. | 2022 | Cohort Study (Shoulder Surgery) | 25 |
| Woolf et al. | 2023 | Mendelian Randomization (Sildenafil) | 51 |
Manuscript 1: Hippisley-Cox et al. (2022)
Manuscript 2: Mok et al. (2024)
Manuscript 3: Morales et al. (2023)
Manuscript 4: Rees et al. (2022)
Manuscript 5: Woolf et al. (2023)
All five manuscripts selected for this study were published under the Creative Commons Attribution (CC-BY) license. This is a critical detail, as it provides the legal basis for using these manuscripts in this research, including processing them with an AI system. The CC-BY license is more permissive than the CC-BY-NC (Non-Commercial) license under which most BMJ articles are published. The CC-BY license allows for unrestricted reuse, redistribution, and modification, for both commercial and non-commercial purposes, as long as appropriate attribution is given.
The entire comparative analysis was conducted within the Manus.im platform. A detailed prompt (see Appendix A) was provided to the Manus agent, which then autonomously executed the analysis, including: reading and extracting content from both review documents, applying the 10-dimensional scoring framework, categorizing all identified issues, tracking critical issues, generating structured data files (CSV/JSON), creating comprehensive visualizations, and writing detailed analysis reports.
The analytical framework consists of three main components: (1) dimensional scoring, (2) issue detection analysis, and (3) critical issue tracking. Each component is described in detail below, including all formulas and scoring rubrics.
Both the AI and journal reviews were evaluated across ten standardized dimensions on a 0–10 scale. The framework was designed to capture both the technical rigor and the practical utility of peer review feedback.
1. Statistical Rigor: Identification of statistical flaws, multiple testing issues, power analysis, and appropriateness of methods.
Scoring Rubric:
2. Methodological Standards: Enforcement of reporting guidelines (CONSORT/STROBE), completeness of methods, and reproducibility requirements.
Scoring Rubric:
3. Clinical/Domain Context: Field-specific knowledge, clinical interpretation, and understanding of real-world practice.
Scoring Rubric:
4. Study Design Critique: Evaluation of design appropriateness, confounding, bias, and generalizability.
Scoring Rubric:
5. Data Quality & Verification: Detection of numerical errors, inconsistencies, and impossible values.
Scoring Rubric:
6. Interpretive Depth: Evaluation of conclusions, identification of over-reaching claims, and causality assessment.
Scoring Rubric:
7. Systematic Completeness: Comprehensive coverage of all manuscript sections; thoroughness.
Scoring Rubric:
8. Actionability & Structure: Clarity of feedback, organization (required/recommended/optional), and specificity.
Scoring Rubric:
9. Tone & Constructiveness: Balance of criticism and encouragement; professionalism.
Scoring Rubric:
10. Editorial Judgment: Appropriateness of the decision (accept/revise/reject) and calibration of severity.
Scoring Rubric:
The Overall Quality Score is calculated as the arithmetic mean of all ten dimensional scores:
Overall Quality Score = (Σ Dimensional Scores) / 10
Where: Dimensional Scores are the ten individual scores (0-10) assigned to the dimensions described above.
Quality Rating Categories:
- 9.0-10.0: Outstanding
- 8.0-8.9: Excellent
- 7.0-7.9: Good
- 6.0-6.9: Adequate
- Below 6.0: Fair/Poor
Parity is defined as achieving an overall quality score within 1.0 point of the comparator. This threshold was chosen to represent meaningful equivalence while allowing for minor differences.
Parity Achieved = |AI Score - Journal Score| < 1.0
Where: AI Score and Journal Score are the overall quality scores (0-10) of the AI-generated review and the journal review, respectively.
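As a minimal sketch (not part of the study's actual pipeline), the scoring and parity rules above can be expressed as follows, using the rating bands and the 1.0-point parity threshold defined in this section:

```python
def overall_quality_score(dimensional_scores: list[float]) -> float:
    """Arithmetic mean of the ten dimensional scores (each on a 0-10 scale)."""
    assert len(dimensional_scores) == 10
    return sum(dimensional_scores) / 10

def quality_rating(score: float) -> str:
    """Map an overall quality score to the rating categories listed above."""
    if score >= 9.0:
        return "Outstanding"
    if score >= 8.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 6.0:
        return "Adequate"
    return "Fair/Poor"

def parity_achieved(ai_score: float, journal_score: float, threshold: float = 1.0) -> bool:
    """Parity: the absolute difference between overall scores is less than the threshold."""
    return abs(ai_score - journal_score) < threshold
```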
All unique issues identified in both reviews were systematically categorized to assess the degree of overlap and complementarity between the AI and human approaches.
Each unique issue was categorized into one of three mutually exclusive categories: identified by both reviews (convergence), identified by the journal review only, or identified by the AI review only.
Operational Definition of “Same Issue”: Two issues were considered the “same” if they referred to the same specific problem, even if expressed differently. For example, “multiple testing correction not applied” and “no adjustment for multiple comparisons” would be considered the same issue.
The Complementarity Score quantifies the degree to which the two reviews identified different sets of issues. A higher score indicates greater complementarity (i.e., the reviews are more synergistic and less redundant).
Complementarity Score = [(N_Journal_Only + N_AI_Only) / N_Total_Unique] × 100%
Where: N_Journal_Only is the number of issues identified only by the journal review, N_AI_Only is the number of issues identified only by the AI review, and N_Total_Unique is the total number of unique issues across both reviews (Journal Only + AI Only + Both).
Interpretation:
- 70-100%: Very high complementarity (the reviews catch very different issues)
- 50-69%: High complementarity (substantial differences)
- 30-49%: Moderate complementarity (some overlap)
- Below 30%: Low complementarity (high overlap)
Example Calculation:
Suppose a comparison yields 14 journal-only issues, 10 AI-only issues, and 6 issues identified by both reviews (30 total unique issues):
Complementarity Score = [(14 + 10) / 30] × 100% = 80.0%
This would be interpreted as “very high complementarity,” indicating that the two reviews identified substantially different sets of issues and are highly synergistic.
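The same calculation in code form, reproducing the example above (a minimal sketch rather than the study's actual implementation):

```python
def complementarity_score(n_journal_only: int, n_ai_only: int, n_both: int) -> float:
    """Percentage of unique issues identified by only one of the two reviews."""
    n_total_unique = n_journal_only + n_ai_only + n_both
    return (n_journal_only + n_ai_only) / n_total_unique * 100

# Example from the text: 14 journal-only, 10 AI-only, 6 identified by both (30 unique issues).
print(complementarity_score(14, 10, 6))  # 80.0 -> very high complementarity
```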
Critical issues were defined as those that could fundamentally invalidate the study's findings or conclusions. These were tracked separately to assess the ability of each review to identify the most serious methodological or statistical flaws.
For each manuscript, a binary matrix was created to track which critical issues were detected by each review:
Table 2: Critical Issue Detection Matrix (Example)
| Critical Issue | Journal Detected | AI Detected |
|---|---|---|
| Multiple testing correction not applied | No | Yes |
| Time-varying confounding not addressed | No | Yes |
| Autocorrelation not handled | No | Yes |
| Graphing error (predicted = actual) | Yes | No |
From this matrix, we calculated the number and proportion of critical issues detected by each review, as well as the critical issues caught by one review but missed by the other.
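A minimal sketch of how the binary matrix in Table 2 can be represented and summarized; the counts below are derived from the example rows above, not from the study's results:

```python
# Binary detection matrix from Table 2 (example): issue -> (journal_detected, ai_detected)
critical_issues = {
    "Multiple testing correction not applied": (False, True),
    "Time-varying confounding not addressed": (False, True),
    "Autocorrelation not handled": (False, True),
    "Graphing error (predicted = actual)": (True, False),
}

n_total = len(critical_issues)
journal_detected = sum(j for j, _ in critical_issues.values())
ai_detected = sum(a for _, a in critical_issues.values())
caught_by_both = sum(j and a for j, a in critical_issues.values())

print(f"Journal detected {journal_detected}/{n_total} critical issues")
print(f"AI detected {ai_detected}/{n_total} critical issues")
print(f"Detected by both: {caught_by_both}/{n_total}")
```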
For each manuscript, the peer review history was accessed through The BMJ's website.
From the peer review history PDF, the following information was extracted for each reviewer:
For each manuscript, the original submitted version (pre-revision) was obtained from The BMJ's peer review history. This ensured that the AI reviewed the exact same version that the human reviewers evaluated.
The AI review included:
For each review (both AI and journal), a comprehensive set of notes was created, including:
All unique issues were extracted from both reviews and compiled into a master list. Each issue was coded with its detection category (Both, Journal Only, or AI Only), its issue type (statistical, methodological, clinical, etc.), and its severity (Critical, Major, or Minor).
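As an illustration of what such a coded record might look like in the structured exports, the sketch below uses the category, type, and severity fields named in this report; the exact schema is illustrative rather than the one used in the study:

```python
from dataclasses import dataclass, asdict
from typing import Literal

@dataclass
class CodedIssue:
    description: str
    detection: Literal["Both", "Journal Only", "AI Only"]
    issue_type: Literal["Statistical", "Data Quality", "Methodological", "Design", "Interpretive"]
    severity: Literal["Critical", "Major", "Minor"]

# Example record, drawn from the critical-issue examples in Table 2.
issue = CodedIssue(
    description="Multiple testing correction not applied",
    detection="AI Only",
    issue_type="Statistical",
    severity="Critical",
)
print(asdict(issue))
```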
The analysis was conducted in seven sequential steps:
Step 1: Document Review. Both reviews were read in full; all issues, concerns, and recommendations were extracted, and the editorial decision from each review was noted.
Step 2: Dimensional Scoring. The 10-dimensional framework was applied to each review independently, specific evidence was documented for each score, and overall quality scores were calculated.
Step 3: Issue Categorization. A master list of all unique issues across both reviews was compiled; each issue was categorized as Both, Journal Only, or AI Only, classified by severity (Critical, Major, Minor), and the complementarity score was calculated.
Step 4: Critical Issue Analysis. The critical issues caught by each review were identified, detection was classified as explicit, implied, or absent, and the impact on manuscript validity was assessed.
Step 5: Data Export. All CSV/JSON files were generated with structured, machine-readable data for future analysis.
Step 6: Visualization. The comprehensive multi-panel figure was created, including the key findings summary box, with all panels clearly labeled.
Step 7: Report Writing. An executive summary of the main findings was written, each dimension was analyzed in detail, and actionable recommendations were generated.
For each manuscript, the analysis generated the following deliverables:
A multi-panel figure (typically 7–8 panels) including: overall quality scores, dimensional performance comparison, issue detection overlap, critical issue detection, a review characteristics profile, AI version improvement (where applicable), a key findings summary box, and the complementarity score. Full panel specifications are given in Appendix A.
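To illustrate the figure specification, the sketch below draws one panel (the dimensional performance comparison) with matplotlib, following the blue-for-Journal, green-for-AI color scheme from Appendix A; the scores are placeholders, not results from this study:

```python
import matplotlib.pyplot as plt
import numpy as np

# The 10 standardized dimensions described in the scoring framework above.
dimensions = [
    "Statistical Rigor", "Methodological Standards", "Clinical/Domain Context",
    "Study Design Critique", "Data Quality & Verification", "Interpretive Depth",
    "Systematic Completeness", "Actionability & Structure",
    "Tone & Constructiveness", "Editorial Judgment",
]
# Placeholder scores for illustration; replace with the scores from Step 2 for a real manuscript.
journal_scores = [7, 7, 9, 8, 6, 8, 7, 7, 8, 8]
ai_scores = [10, 10, 7, 8, 9, 7, 9, 9, 8, 8]

x = np.arange(len(dimensions))
width = 0.38

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(x - width / 2, journal_scores, width, label="Journal", color="tab:blue")
ax.bar(x + width / 2, ai_scores, width, label="AI", color="tab:green")
ax.set_ylabel("Score (0-10)")
ax.set_ylim(0, 10)
ax.set_title("Panel 2: Dimensional Performance Comparison")
ax.set_xticks(x)
ax.set_xticklabels(dimensions, rotation=45, ha="right")
ax.legend()
fig.tight_layout()
fig.savefig("panel2_dimensional_comparison.png", dpi=150)
```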
After analyzing all five manuscripts, a cross-study synthesis was generated, including:
Researchers seeking to replicate this study should consider:
All materials used in this study (journal peer reviews and submitted manuscripts) were obtained from publicly available sources (The BMJ's open peer review history). No confidential or proprietary information was used.
This methodology report provides complete transparency about all procedures, formulas, and decision rules, enabling full replication by other researchers.
Researchers conducting similar studies should disclose any financial or professional relationships with AI peer review platforms or journals.
This methodology provides a rigorous, standardized, and replicable framework for comparing AI-generated peer reviews with human expert reviews. The 10-dimensional scoring system, issue detection analysis, and critical issue tracking provide a comprehensive assessment of AI performance. The use of The BMJ's open peer review history ensures that comparisons are fair and direct, as both AI and human reviewers evaluated the exact same manuscript version. This approach can be used by other researchers to evaluate different AI peer review systems, different journals, or different types of research.
The following prompt was provided to the Manus.im autonomous AI agent to conduct each comparative analysis. Researchers can use this exact prompt to replicate the study with their own manuscripts.
I want you to conduct a standardized comparative analysis of an AI-generated peer review (from my SaaS application) against an actual journal peer review for an academic manuscript.
## TASK OVERVIEW
Compare the AI peer-review with the journal peer-review to:
1. Evaluate the AI's performance using a standardized framework
2. Identify what the AI caught vs. missed compared to the journal review
3. Assess complementarity and convergence between the two approaches
4. Generate actionable insights for improving the AI system
## EVALUATION FRAMEWORK
Evaluate both reviews using these 10 standardized dimensions (0-10 scale):
1. **Statistical Rigor**: Identification of statistical flaws, multiple testing issues, power analysis, appropriate methods
2. **Methodological Standards**: Enforcement of reporting guidelines (CONSORT/STROBE), completeness of methods, reproducibility requirements
3. **Clinical/Domain Context**: Field-specific knowledge, clinical interpretation, understanding of real-world practice
4. **Study Design Critique**: Evaluation of design appropriateness, confounding, bias, generalizability
5. **Data Quality & Verification**: Detection of numerical errors, inconsistencies, impossible values
6. **Interpretive Depth**: Evaluation of conclusions, identification of over-reaching claims, causality assessment
7. **Systematic Completeness**: Comprehensive coverage of all manuscript sections, thoroughness
8. **Actionability & Structure**: Clarity of feedback, organization (required/recommended/optional), specificity
9. **Tone & Constructiveness**: Balance of criticism and encouragement, professionalism
10. **Editorial Judgment**: Appropriateness of decision (accept/revise/reject), calibration of severity
**Overall Quality Score** = Average of all 10 dimensions
**Quality Ratings**:
- 9.0-10.0: Outstanding
- 8.0-8.9: Excellent
- 7.0-7.9: Good
- 6.0-6.9: Adequate
- Below 6.0: Fair/Poor
## ISSUE DETECTION ANALYSIS
Categorize all identified issues into:
- **Issues identified by BOTH** (convergence) - list each issue
- **Issues identified by JOURNAL only** - list each issue
- **Issues identified by AI only** - list each issue
- **Total unique issues**
Calculate **Complementarity Score** = (Journal Only + AI Only) / Total Unique Issues × 100%
**Interpretation**:
- 70-100%: Very high complementarity (reviews catch very different issues)
- 50-69%: High complementarity (substantial differences)
- 30-49%: Moderate complementarity (some overlap)
- Below 30%: Low complementarity (high overlap)
## CRITICAL ISSUE TRACKING
For each review, identify and categorize critical issues:
1. **Statistical Flaws**: Multiple testing, power, assumptions, model violations
2. **Data Quality Errors**: Numerical impossibilities, inconsistencies, missing data
3. **Methodological Gaps**: Missing methods section, incomplete reporting, non-reproducible
4. **Design Limitations**: Confounding, bias, generalizability issues
5. **Interpretive Issues**: Causality claims, over-reaching conclusions, misinterpretation
## DELIVERABLES
### 1. Comprehensive Visualization (PNG)
Generate a multi-panel figure including:
**IMPORTANT**:
- Use bar charts (grouped or stacked) for all comparisons. DO NOT use heatmaps.
- All panels should use clear bar chart visualizations that are easy to interpret at a glance.
- Layout: Arrange panels in a grid format (e.g., 3 rows × 3 columns or 2 rows × 3 columns)
- The Key Findings Box should be prominent and easily readable (bottom-right or bottom-center)
- Use consistent color scheme: Blue for Journal, Green for AI
**Panel 1: Overall Quality Scores** (bar chart)
- Journal review score
- AI review score
- Highlight if parity achieved (scores within 0.5 points)
**Panel 2: Dimensional Performance Comparison** (grouped bar chart)
- All 10 dimensions side-by-side
- Color-code by winner (blue=journal, green=AI)
**Panel 3: Issue Detection Overlap** (bar chart)
- Journal Only
- Both (Convergence)
- AI Only
- Display complementarity percentage
**Panel 4: Critical Issues Detection** (grouped bar chart)
- Show which critical issues each review detected
- Use grouped bars for Journal vs AI
- Use detection levels: "Explicit", "Implied", "Not Detected" (or binary: Detected=1, Not Detected=0)
- Highlight issues caught by both (convergence) with background shading
**Panel 5: Review Characteristics Profile** (radar chart)
- Overlay journal and AI profiles
- Show complementary strengths visually
**Panel 6: AI Version Improvement** (grouped bar chart - if comparing AI versions)
- Show key dimensions where AI improved
- Previous AI version vs Updated AI version
- Focus on 4-5 most important dimensions (Completeness, Error Detection, Tone, Judgment)
**Panel 7: Key Findings Box** (text summary)
Include a comprehensive text box with:
KEY FINDINGS: [Year] JOURNAL vs [Year] AI
✓ PARITY ACHIEVED (or NOT ACHIEVED): [Summary statement about overall quality]
CRITICAL CONVERGENCE:
□ [Issue 1]: BOTH caught [description]
□ [Issue 2]: BOTH caught [description]
□ [Issue 3]: BOTH caught [description]
JOURNAL ADVANTAGES (Still Superior):
□ [Advantage 1]: [Description]
□ [Advantage 2]: [Description]
□ [Advantage 3]: [Description]
AI ADVANTAGES (Now Superior):
□ [Advantage 1]: [Description]
□ [Advantage 2]: [Description]
□ [Advantage 3]: [Description]
DRAMATIC IMPROVEMENT FROM PREVIOUS AI VERSION (if applicable):
- [Improvement 1]
- [Improvement 2]
- [Improvement 3]
STRATEGIC RECOMMENDATION:
[One clear, actionable recommendation for the AI system]
**Panel 8: Complementarity Score** (circular badge)
- Large percentage display
- Interpretation label
### 2. Data Export Files (CSV/JSON)
**File 1: dimensional_scores.csv**
Dimension,Journal_Score,AI_Score,Winner,Gap
Statistical Rigor,7,10,AI,3
Methodological Standards,7,10,AI,3
...
Overall,8.5,9.0,AI,0.5
**File 2: issue_detection.json**
{
"manuscript": "Author et al. Year",
"convergence": ["Issue 1", "Issue 2", ...],
"journal_only": ["Issue 1", "Issue 2", ...],
"ai_only": ["Issue 1", "Issue 2", ...],
"complementarity_score": 73.3,
"total_unique_issues": 15
}
**File 3: critical_issues.csv**
Issue_Type,Issue_Description,Journal_Detected,AI_Detected,Severity
Statistical Flaws,Multiple testing correction,No,Yes,Critical
Data Quality,Impossible confidence intervals,Yes,Yes,Critical
...
### 3. Detailed Analysis Documents (Markdown)
**File 1: standardized_scores.md**
- Complete dimensional analysis with evidence and examples for each score
- Justification for each rating
- Specific quotes from reviews supporting the scores
**File 2: issue_breakdown.md**
- Comprehensive list of all issues identified by each review
- Categorization by type (statistical, methodological, clinical, etc.)
- Severity classification (critical, major, minor)
**File 3: comparison_report.md**
- Executive summary
- Dimensional performance analysis
- Critical issue detection summary
- Complementarity analysis
- Strategic recommendations for AI improvement
- Implications for development priorities
### 4. Summary Table (Markdown)
| Dimension | Journal Score | AI Score | Winner | Gap |
|-----------|---------------|----------|--------|-----|
| Statistical Rigor | X | X | X | X |
...
| **Overall** | **X.XX** | **X.XX** | **X** | **X.XX** |
## ANALYSIS APPROACH
### Step 1: Document Review
- Read both reviews thoroughly
- Extract all issues, concerns, and recommendations from each
- Note the editorial decision from each review
### Step 2: Dimensional Scoring
- Apply the 10-dimensional framework to each review independently
- Document specific evidence supporting each score
- Calculate overall quality scores
### Step 3: Issue Categorization
- Create a master list of all unique issues identified across both reviews
- Categorize each issue as: Both, Journal Only, or AI Only
- Classify severity: Critical, Major, Minor
- Calculate complementarity score
### Step 4: Critical Issue Analysis
- Identify which critical issues each review caught
- Determine if detection was explicit, implied, or absent
- Assess impact on manuscript validity
### Step 5: Data Export
- Generate all CSV/JSON files with structured data
- Ensure data is machine-readable for future analysis
### Step 6: Visualization
- Create comprehensive multi-panel figure
- Include key findings summary box
- Ensure all panels are clearly labeled
### Step 7: Report Writing
- Write executive summary highlighting main findings
- Provide detailed analysis of each dimension
- Generate actionable recommendations
## KEY PRINCIPLES
1. **Consistency**: Use the same dimensions and scoring criteria across all manuscripts
2. **Evidence-based**: Support all scores with specific examples from the reviews
3. **Balanced**: Acknowledge strengths and weaknesses of both approaches
4. **Actionable**: Provide clear, prioritized recommendations for improvement
5. **Complementarity focus**: Emphasize how the two approaches work together
6. **Data-driven**: Export all data in structured formats for future analysis
7. **Visual clarity**: Make the key findings immediately apparent in the visualization
## SPECIAL FOCUS AREAS
When comparing AI vs. Journal reviews, pay particular attention to:
1. **What the AI caught that the journal missed**: These are the AI's unique value propositions
2. **What the journal caught that the AI missed**: These are the AI's development priorities
3. **Convergence on critical issues**: Evidence that AI has achieved human-level capability
4. **Complementarity patterns**: Understanding the systematic differences between approaches
## DOCUMENTS I WILL PROVIDE
I will upload:
- **Journal peer-review document**: The actual review(s) received from the journal
- **AI peer-review document**: The review generated by my SaaS application
- **Manuscript context** (optional): Title, study type, field
Please confirm you understand the framework and are ready to receive the documents.
Detailed scoring rubrics for each of the 10 dimensions are provided in Section 3.1.1 above.
Manuscript: Hippisley-Cox et al. (2022)
Journal Only Issues (12):
AI Only Issues (21):
Both (Convergence) Issues (9):
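Applying the complementarity formula to these counts: Complementarity Score = [(12 + 21) / 42] × 100% ≈ 78.6%, i.e. very high complementarity.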
Example data file formats used in this analysis:
File 1: dimensional_scores.csv
Dimension,Journal_Score,AI_Score,Winner,Gap
Statistical Rigor,7,10,AI,3
Methodological Standards,7,10,AI,3
...
Overall,8.5,9.0,AI,0.5
File 2: issue_detection.json
{
"manuscript": "Author et al. Year",
"convergence": ["Issue 1", "Issue 2", ...],
"journal_only": ["Issue 1", "Issue 2", ...],
"ai_only": ["Issue 1", "Issue 2", ...],
"complementarity_score": 73.3,
"total_unique_issues": 15
}
File 3: critical_issues.csv
Issue_Type,Issue_Description,Journal_Detected,AI_Detected,Severity
Statistical Flaws,Multiple testing correction,No,Yes,Critical
Data Quality,Impossible confidence intervals,Yes,Yes,Critical
...
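A minimal sketch of how these exports can be re-loaded and cross-checked, assuming the file names shown above and that the literal "..." placeholder rows have been replaced with real data:

```python
import csv
import json

# Load the per-dimension scores exported in Step 5.
with open("dimensional_scores.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Recompute the overall quality scores as a consistency check against the "Overall" row.
dim_rows = [r for r in rows if r["Dimension"] != "Overall"]
journal_overall = sum(float(r["Journal_Score"]) for r in dim_rows) / len(dim_rows)
ai_overall = sum(float(r["AI_Score"]) for r in dim_rows) / len(dim_rows)

# Load the issue-detection summary and recompute the complementarity score.
with open("issue_detection.json") as f:
    issues = json.load(f)
complementarity = (len(issues["journal_only"]) + len(issues["ai_only"])) / issues["total_unique_issues"] * 100

print(f"Journal overall: {journal_overall:.2f}, AI overall: {ai_overall:.2f}")
print(f"Complementarity: {complementarity:.1f}% (reported: {issues['complementarity_score']}%)")
```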
Important Note
This analysis is based on a preliminary comparison of five manuscripts submitted to The BMJ between 2021 and 2023. While the results provide encouraging evidence, the sample size is limited and findings should be interpreted with appropriate caution.
PeerGenius recommends a complementary hybrid approach: AI review as a first-pass screening for statistical and methodological rigor, combined with human expert review for clinical context, interpretive depth, and domain-specific judgment. AI review complements but does not replace traditional peer review.