Common Statistical Mistakes in Research Papers

By Ray W. Shiraishi, Ph.D. | 10 min read

Common statistical mistakes in research papers go far beyond p-value misinterpretation. In medical and surveillance research, the mistakes that cause the most damage are structural: ignoring clustering, mishandling design effects, and treating correlated data as independent.

Multi-site studies, household surveys, cluster-randomized trials, and surveillance systems all produce data where observations within groups are correlated. Standard methods assume independence. When that assumption is wrong, your standard errors are too small, your confidence intervals are too narrow, and your p-values are too small. The result is conclusions that look stronger than the data actually support.

Here are the statistical mistakes I see most often in manuscripts that use clustered or surveillance data, with corrective code in R, Stata, and Python.

1. Ignoring Clustering: The Most Common Statistical Mistake

The problem is that researchers collect data from patients nested within clinics, students nested within schools, or households nested within communities, then analyze it as if every observation were independent. This is the single most consequential statistical error in multi-site research.

Patients within the same clinic share providers, protocols, and local context. Their outcomes are correlated. A standard logistic regression treats 500 patients from 10 clinics the same as 500 independent patients. It is not the same. The effective sample size is smaller than the nominal sample size, sometimes substantially so.
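
To see why, here is a toy simulation in plain Python. The cluster counts and variance values are hypothetical, chosen only to illustrate the gap between the naive standard error and one that respects the clustering:

```python
import random
import statistics

random.seed(1)
n_clusters, m = 10, 50            # 10 clinics, 50 patients each (hypothetical)
sigma_b, sigma_e = 1.0, 1.0       # assumed between- and within-clinic SDs

observations, clinic_means = [], []
for _ in range(n_clusters):
    b = random.gauss(0, sigma_b)  # effect shared by everyone in the clinic
    ys = [b + random.gauss(0, sigma_e) for _ in range(m)]
    observations.extend(ys)
    clinic_means.append(statistics.mean(ys))

# Naive SE pretends all 500 observations are independent
naive_se = statistics.stdev(observations) / len(observations) ** 0.5

# Cluster-level SE treats the 10 clinic means as the units of information
cluster_se = statistics.stdev(clinic_means) / n_clusters ** 0.5
```

With a nontrivial between-clinic variance, the cluster-aware standard error comes out several times larger than the naive one, which is exactly the "effective sample size is smaller than the nominal sample size" problem.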

The fix: Use a method that accounts for clustering. The two main options are mixed-effects (random-effects) models and generalized estimating equations (GEE). For a binary outcome with patients nested in clinics:

# R: Random-effects logistic regression
library(lme4)
model <- glmer(outcome ~ treatment + age + sex + (1 | clinic_id),
               data = df, family = binomial)

# Stata
melogit outcome treatment age sex || clinic_id:

# Python: statsmodels fits the mixed logistic model via a Bayesian approximation
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
model = BinomialBayesMixedGLM.from_formula(
    "outcome ~ treatment + age + sex",
    {"clinic": "0 + C(clinic_id)"}, data=df)
result = model.fit_vb()

2. Using Logistic Regression When You Need a Random-Effects Model

This is a specific and extremely common form of mistake #1. A multi-center trial enrolls patients at 15 hospitals. The analyst runs a standard logistic regression with hospital as a fixed-effect covariate (or worse, ignores hospital entirely).

Here's the issue. If hospital is a random sample from a larger population of hospitals, it should be modeled as a random effect. A fixed effect for hospital uses up degrees of freedom and only estimates differences between those specific hospitals. A random effect estimates the variance across hospitals, which is usually what you actually want to report: how much the outcome varies by site, and what the treatment effect is after accounting for that variation.

The fix: Ask yourself whether the sites in your study represent the specific sites you care about (fixed) or a sample from a broader population of sites (random). In most multi-center medical studies, sites are random.

# Wrong: standard logistic regression (ignores clustering)
glm(outcome ~ treatment + age, data = df, family = binomial)

# Wrong: hospital as fixed effect (wastes df, not generalizable)
glm(outcome ~ treatment + age + factor(hospital), data = df, family = binomial)

# Right: hospital as random intercept
library(lme4)
glmer(outcome ~ treatment + age + (1 | hospital),
      data = df, family = binomial)

3. Sample Size Mistakes: Ignoring the Design Effect

Surveillance surveys and population-based health surveys almost never use simple random sampling. They use stratified, multi-stage, or cluster sampling. This produces a design effect (DEFF): the ratio of the variance under the actual sampling design to the variance you would get under simple random sampling.

A DEFF of 2.0 means your effective sample size is half the nominal sample size. If you analyze the data without accounting for the survey design, your standard errors will be too small by a factor of roughly sqrt(DEFF). Your confidence intervals will be too narrow. Your p-values will be too small.
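
The arithmetic is worth making concrete. A minimal sketch in Python, using the DEFF of 2.0 from the example above:

```python
n = 1000          # nominal sample size (hypothetical)
deff = 2.0        # design effect from the sampling design

effective_n = n / deff          # half the nominal sample size
se_inflation = deff ** 0.5      # naive SEs are too small by about this factor
```

With DEFF = 2.0, the 1,000 nominal observations carry the information of 500 independent ones, and unadjusted standard errors are understated by a factor of roughly 1.41.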

The fix: Use survey-weighted analysis that accounts for strata, clusters (primary sampling units), and sampling weights:

# R: Survey-weighted logistic regression
library(survey)
design <- svydesign(id = ~psu, strata = ~stratum,
                     weights = ~weight, data = df, nest = TRUE)
model <- svyglm(outcome ~ exposure + age + sex,
                 design = design, family = quasibinomial)

# Stata
svyset psu [pweight=weight], strata(stratum)
svy: logistic outcome exposure age sex

# Check design effects
estat effects

If you are using data from DHS, PHIA, MICS, or similar household surveys, survey-weighted analysis is not optional. It is a requirement.

4. Not Reporting the Intraclass Correlation Coefficient

The ICC (intraclass correlation coefficient, or rho) quantifies how much of the total variance in your outcome is attributable to the clustering variable. An ICC of 0.05 in a cluster-randomized trial might seem small, but with 50 individuals per cluster, the design effect is 1 + (50 - 1)(0.05) = 3.45. Your effective sample size is less than a third of what you enrolled.

Many manuscripts report the results of a mixed model but never report the ICC. Reviewers and readers have no way to assess how much clustering mattered.

The fix: Always report the ICC for your primary clustering variable. Interpret it in context. Report the design effect if your study uses cluster sampling or cluster randomization.

# R: Extract ICC from a mixed model
library(lme4)
library(performance)
model <- glmer(outcome ~ treatment + (1 | clinic_id),
               data = df, family = binomial)
icc(model)

# Stata: ICC after mixed model
melogit outcome treatment || clinic_id:
estat icc

# Design effect: DEFF = 1 + (m - 1) * ICC
# where m = average cluster size
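
The DEFF formula in the comment above is easy to check in plain Python, using the ICC example from this section:

```python
def design_effect(m, icc):
    """DEFF = 1 + (m - 1) * ICC, for average cluster size m."""
    return 1 + (m - 1) * icc

deff = design_effect(m=50, icc=0.05)   # 1 + 49 * 0.05 = 3.45
effective_fraction = 1 / deff          # under a third of enrollment remains
```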

5. GEE vs. Mixed Models: Choosing the Wrong One

Both GEE (generalized estimating equations) and mixed-effects models handle clustering. They answer different questions, and the choice matters.

GEE estimates population-averaged effects: the average difference in the outcome across the entire population, averaging over clusters. Mixed models estimate cluster-specific (conditional) effects: the difference within a given cluster. For binary outcomes, these give different odds ratios. Whenever the random-intercept variance is nonzero, the mixed-model OR is farther from the null (an OR of 1) than the GEE OR for the same data.

Neither is universally "better." If your research question is about population-level policy (e.g., "does this intervention reduce prevalence nationally?"), GEE is often the right choice. If your question is about individual-level effects within clusters (e.g., "does this treatment help a patient at a given hospital?"), a mixed model is more appropriate.
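
The gap between the two odds ratios can even be approximated in closed form. For a random-intercept logistic model with intercept variance sigma^2, the Zeger-Liang-Albert approximation relates the marginal (GEE-style) coefficient to the conditional one; the OR and variance below are hypothetical numbers for illustration:

```python
import math

# Attenuation constant c^2 = (16 * sqrt(3) / (15 * pi))^2, about 0.346
C2 = (16 * math.sqrt(3) / (15 * math.pi)) ** 2

def marginal_beta(conditional_beta, sigma2):
    """Approximate population-averaged logit coefficient from a conditional one."""
    return conditional_beta / math.sqrt(1 + C2 * sigma2)

beta_c = math.log(2.0)   # conditional (mixed-model) OR of 2.0, hypothetical
or_m = math.exp(marginal_beta(beta_c, sigma2=1.0))
# the population-averaged OR is pulled toward 1 (here, roughly 1.8)
```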

The fix: State which approach you used and why. Justify the choice based on your research question.

# R: GEE with exchangeable correlation (population-averaged)
library(geepack)
model <- geeglm(outcome ~ treatment + age + sex,
                 id = clinic_id, data = df,
                 family = binomial, corstr = "exchangeable")

# Stata: GEE (declare the panel structure first)
xtset clinic_id
xtgee outcome treatment age sex, family(binomial) ///
    link(logit) corr(exchangeable)

# Compare: Mixed model (cluster-specific)
library(lme4)
model_re <- glmer(outcome ~ treatment + age + sex + (1 | clinic_id),
                   data = df, family = binomial)

6. Data Interpretation Errors: The Ecological Fallacy

The ecological fallacy is drawing individual-level conclusions from group-level associations. A study finds that clinics with higher average patient age have higher mortality rates and concludes that older patients are at higher risk of death. That may or may not be true: the association at the clinic level does not necessarily reflect the association at the patient level.

In surveillance data, this shows up when district-level or country-level aggregates are used to draw conclusions about individual risk. A country with higher average BMI and higher cardiovascular mortality does not prove that BMI causes cardiovascular death in individuals within that country.

The fix: Analyze the data at the level that matches your inference. If you want to make claims about individual risk, use individual-level data. If you only have aggregate data, acknowledge the limitation explicitly. Multi-level models can estimate both within-cluster and between-cluster effects separately:

# R: Separate within- and between-cluster effects
library(lme4)

# Create cluster mean and deviation
df$age_clinic_mean <- ave(df$age, df$clinic_id, FUN = mean)
df$age_within <- df$age - df$age_clinic_mean

model <- glmer(outcome ~ age_within + age_clinic_mean +
               (1 | clinic_id), data = df, family = binomial)
# age_within  = individual-level effect (within clinic)
# age_clinic_mean = contextual effect (between clinics)
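
The same cluster-mean decomposition is a few lines in plain Python; the clinic IDs and ages here are hypothetical toy values:

```python
from collections import defaultdict

rows = [("A", 30), ("A", 40), ("A", 50),   # (clinic_id, age)
        ("B", 60), ("B", 70), ("B", 80)]

by_clinic = defaultdict(list)
for clinic, age in rows:
    by_clinic[clinic].append(age)
clinic_mean = {c: sum(ages) / len(ages) for c, ages in by_clinic.items()}

# Each age splits into a between-clinic part (the clinic mean)
# and a within-clinic part (the deviation from that mean)
decomposed = [(c, clinic_mean[c], age - clinic_mean[c]) for c, age in rows]
```

Within each clinic the deviations sum to zero, so the two components carry non-overlapping information: the clinic mean identifies the contextual (between-clinic) effect, the deviation identifies the individual-level (within-clinic) effect.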

7. Study Design Errors: Underpowered Cluster-Randomized Trials

Standard power calculations assume independent observations. A trial that randomizes 20 clinics (10 per arm) with 50 patients each has a nominal sample size of 1,000. But if the ICC is 0.05, the design effect is 3.45, and the effective sample size is closer to 290. The study that looked adequately powered is actually underpowered by a factor of three.

The problem is that many sample size calculations in cluster-randomized trials either ignore the ICC entirely or use an ICC from a different outcome, population, or setting.

The fix: Use a cluster-adjusted power calculation. You need the ICC (from pilot data or similar studies), the average cluster size, the number of clusters, and the expected effect size:

# R: Power for cluster-randomized trial
library(clusterPower)
# nclusters is per arm; leave power = NA so cpa.binary() solves for it
cpa.binary(alpha = 0.05, power = NA,
           nclusters = 10, nsubjects = 50,
           p1 = 0.30, p2 = 0.20, ICC = 0.05)

# Stata: Power for cluster RCT
power twoproportions 0.30 0.20, cluster ///
    k1(10) k2(10) m1(50) m2(50) rho(0.05)

# Manual DEFF calculation
# DEFF = 1 + (m - 1) * ICC = 1 + 49 * 0.05 = 3.45
# Effective n = 1000 / 3.45 ≈ 290
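
The same underpowering can be checked with a quick normal-approximation sketch in plain Python. This is a rough two-proportion calculation under standard assumptions, not a substitute for the cluster-aware routines above:

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_arm, deff=1.0, alpha=0.05):
    """Approximate power for a two-sided, two-arm comparison of proportions."""
    nd = NormalDist()
    n_eff = n_per_arm / deff                       # deflate by the design effect
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_eff) ** 0.5
    return nd.cdf(abs(p1 - p2) / se - nd.inv_cdf(1 - alpha / 2))

naive = power_two_proportions(0.30, 0.20, n_per_arm=500)
clustered = power_two_proportions(0.30, 0.20, n_per_arm=500,
                                  deff=1 + (50 - 1) * 0.05)
# ignoring clustering the trial looks well powered; with DEFF = 3.45
# power collapses to roughly a coin flip
```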

Report the ICC you assumed, its source, and the resulting design effect in your methods section. Reviewers will look for this.

8. Research Methodology Mistakes with Repeated Measures

Surveillance systems often collect data from the same sites, facilities, or populations at multiple time points. Analyzing each round of data collection as if it were independent inflates the sample size and understates uncertainty.

This also applies to longitudinal clinical studies where patients are measured at baseline, 6 months, and 12 months. Running three separate cross-sectional analyses and comparing the results is not the same as modeling the trajectory over time. The correlation between repeated measures on the same individual contains information. Ignoring it wastes that information and produces incorrect standard errors.

The fix: Use longitudinal methods that account for the correlation between repeated measures. Options include mixed models with random intercepts and slopes, or GEE with an autoregressive or unstructured correlation matrix:

# R: Longitudinal mixed model (random intercept + slope)
library(lme4)
model <- glmer(outcome ~ time + treatment + time:treatment +
               (1 + time | subject_id),
               data = df, family = binomial)

# Stata: Longitudinal mixed model
melogit outcome c.time##i.treatment || subject_id: time

# R: GEE for repeated measures
library(geepack)
model <- geeglm(outcome ~ time + treatment + time:treatment,
                 id = subject_id, data = df,
                 family = binomial, corstr = "ar1")
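
The "ar1" working correlation assumed in the GEE call above has a simple closed form, corr(y_s, y_t) = rho^|s - t|. A quick sketch, with a hypothetical rho of 0.6 over three visits:

```python
def ar1_corr(n_times, rho):
    """AR(1) working correlation matrix: corr(y_s, y_t) = rho ** |s - t|."""
    return [[rho ** abs(s - t) for t in range(n_times)] for s in range(n_times)]

R = ar1_corr(3, 0.6)
# adjacent visits correlate at 0.6, visits two apart at 0.36
```

The geometric decay is why AR(1) suits evenly spaced longitudinal measurements; when spacing is irregular or decay is implausible, an unstructured matrix is the safer choice.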

The Common Thread

Every mistake on this list comes back to the same issue: treating correlated data as if it were independent. The methods for handling clustering have been available for decades. The software implementations are mature. The problem is not that researchers lack access to the right tools. The problem is that the wrong method (standard logistic regression, standard chi-square, unweighted prevalence estimates) is easier to run and produces results that look more precise than they actually are.

Reviewers know this. It is one of the first things a statistical reviewer checks: does the analysis match the study design? If your data have clustering and your methods do not account for it, expect a major revision request. For a broader look at what triggers rejection before your paper even reaches review, see our pre-submission guides.

Want to preview the kind of feedback reviewers give on statistical issues? Try the free Reviewer 2 Generator to see how your abstract holds up. For a full statistical review with corrective R, Stata, and Python code, PeerGenius's Statistical Methods Expert examines your analysis for design-method mismatches. Not as a replacement for understanding your study design, but as a systematic check before submission.

Get a dedicated statistical review with corrective R, Python, and Stata code, plus feedback from 6 other specialist reviewers.

Try PeerGenius

Disclosure: This article was drafted with AI assistance. The analysis, positions, and conclusions are the author's own. All content was reviewed, edited, and approved before publication.