In June 2020, one of the most-read papers in The Lancet’s history was retracted thirteen days after publication. The study, by Mehra, Desai, Ruschitzka, and Patel, claimed to draw on data from 671 hospitals across six continents and concluded that hydroxychloroquine raised mortality in COVID-19 patients. Within days, WHO paused related clinical trials. Within two weeks, three of the four authors (Mehra, Ruschitzka, and Patel) said they could no longer vouch for the underlying data, because Surgisphere, the company that supplied it, would not share it with independent reviewers. The paper came down. The trials restarted. It was not an isolated case. According to the Retraction Watch database, more than 10,000 research papers were retracted in 2023, the highest annual count on record.
That figure is not a reason to distrust science. It is a reason to read it more carefully.
Why this skill matters more than it used to
Most people who need to use research findings (health journalists, frontline workers, policymakers, graduate students) never see the full paper. They read a press release, a headline, or a summary in a policy brief. Each step away from the original text adds another layer of interpretation, and sometimes distortion. The Surgisphere paper had already shaped public discourse and halted global trials before a single independent scientist had checked whether the dataset was real.
The Critical Appraisal Skills Programme (CASP), based in Oxford, has trained healthcare professionals to evaluate evidence for over 25 years, and its checklists were updated in 2024. Its approach rests on a straightforward sequence: before engaging with any findings, establish whether the study was valid, what the results actually showed, and whether those results apply to your population. Everything else is detail.
Start with study design, not the findings
Before reading a single result, identify what kind of study you are looking at. This one step filters out a significant share of interpretive errors that circulate in policy and media.
A randomised controlled trial (RCT) randomly assigns participants to intervention or control groups, which is the strongest available protection against confounding. A cohort study follows a group over time without randomisation. A cross-sectional study measures exposure and outcome at the same point in time, which usually makes it impossible to determine which came first. A systematic review synthesises findings across multiple studies and, when combined with meta-analysis, produces a pooled estimate of the effect.
Each design has a ceiling on what it can prove. A cross-sectional study can show that a correlation exists. It cannot establish that one thing caused another. When a headline reads “Researchers find eating X causes Y,” the first question to ask is: what type of study was this?
Who was studied, and does that population include yours?
Sample size matters, but so does sample composition. A trial conducted in urban tertiary hospitals in high-income countries may tell you something about those settings. It may tell you very little about community health workers managing the same condition in rural districts of South Asia or sub-Saharan Africa.
Look at inclusion and exclusion criteria in the methods section. If a study on maternal mortality excluded women under 18 or over 40, its findings do not apply to those groups, even if the paper never says so explicitly. India carries some of the world’s highest burdens of anaemia, preterm birth, and neonatal mortality, yet many foundational studies in global maternal health were conducted in populations where those burdens look structurally different. Applying findings across that gap without adjustment is where evidence-based policy starts to drift.
What was actually measured?
Outcomes sound clear until you notice that two studies using the word “adherence” defined it in completely different ways. One counted any antenatal visit at any gestational age. The other required four visits at specified weeks. Same word. Different data.
Check the operational definitions in the methods section, not the abstract. The abstract will say “improved outcomes.” The methods section will say what outcomes were measured and by what means. If outcomes were self-reported, consider whether social desirability bias could have inflated them. If they were clinician-assessed, ask whether assessors knew which group the patient was in. That problem is usually called detection bias or observer bias, and it can make modest effects appear larger than they are.
Confounding: the variable the paper did not feature
A confounder is a third variable that is linked to both the exposure and the outcome, and so can produce an association between them even when neither causes the other. Studies showing an association between higher educational attainment and lower infant mortality are not straightforwardly showing that education causes better neonatal outcomes. Wealth, access to healthcare, nutrition, and geography are all entangled in that relationship.
The GRADE framework, described by Guyatt and colleagues in the BMJ in 2008 and now used by WHO, the CDC, and the UK’s NICE, rates the certainty of evidence partly on risk of bias, which includes how well confounding has been controlled. Its four certainty levels (high, moderate, low, and very low) apply to a body of evidence, not to an individual study. When reading a single paper, the practical question is: what plausible confounders exist, and did the authors adjust for them? The methods section will list the covariates included in the regression model. The variables that are absent from that list tell you as much as the ones that appear.
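To see why adjustment matters, it can help to work through a small stratified table by hand. The sketch below uses counts invented purely for illustration (they come from no real study) in which maternal education appears strongly protective in the pooled data, yet makes no difference once births are split by household wealth. The group names, the counts, and the choice of wealth as the single confounder are all assumptions.

```python
# Hypothetical counts, invented for illustration only.
# Each entry is (infant deaths, births) for one exposure group in one wealth stratum.
strata = {
    "lower wealth": {"educated": (20, 200), "not_educated": (80, 800)},
    "higher wealth": {"educated": (16, 800), "not_educated": (4, 200)},
}


def risk_ratio(exposed, unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    (e_events, e_total), (u_events, u_total) = exposed, unexposed
    return (e_events / e_total) / (u_events / u_total)


def pooled(group):
    """Add up events and totals for one exposure group, ignoring wealth."""
    events = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    return events, total


# Crude comparison: wealth is ignored, as in an unadjusted analysis.
crude_rr = risk_ratio(pooled("educated"), pooled("not_educated"))

# Stratified comparison: educated vs not educated within the same wealth level.
stratum_rr = {
    name: risk_ratio(s["educated"], s["not_educated"]) for name, s in strata.items()
}

print(f"crude RR: {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"RR within {name}: {rr:.2f}")
```

In this made-up table the crude risk ratio is well below 1, while the ratio inside each wealth stratum is 1. Real data are rarely this clean, and regression adjustment handles many covariates at once, but the logic is the same: compare like with like before crediting the exposure.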
Statistical significance is not the whole story
A p-value below 0.05 tells you that a result at least this extreme would be unlikely if there were truly no effect. It does not tell you the effect is large enough to matter in practice. A study with 50,000 participants can detect a statistically significant difference in systolic blood pressure of 1 mmHg between groups. That finding is real. It is not clinically meaningful.
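The arithmetic behind that blood pressure example can be sketched in a few lines. The code below is a rough illustration, not a substitute for a proper analysis: it uses a simple two-sample z-test, assumes a standard deviation of 15 mmHg in both groups (a hypothetical figure chosen for the example), and splits the 50,000 participants evenly. The function names are made up for this sketch.

```python
import math


def z_for_mean_difference(diff, sd, n_per_group):
    """z statistic for a difference in means, assuming equal SD and equal group sizes."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return diff / standard_error


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Same 1 mmHg difference and same assumed SD; only the sample size changes.
for n in (50, 25_000):
    z = z_for_mean_difference(diff=1.0, sd=15.0, n_per_group=n)
    p = two_sided_p_from_z(z)
    print(f"n per group = {n:>6}: z = {z:.2f}, p = {p:.3g}")
```

Under these assumptions the small study returns a p-value nowhere near 0.05 and the large one returns a p-value far below it, even though the difference between groups is an identical 1 mmHg. The p-value is tracking sample size here, not clinical importance.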
I find myself returning to this every time I review a paper forwarded as evidence for a program decision. The confidence interval around an effect estimate carries as much information as the point estimate itself. A relative risk of 1.4 with a 95% confidence interval running from 0.9 to 2.1 includes 1.0, the value that means no effect at all, so the data are compatible with anything from a slight protective effect to a doubling of risk. A narrow interval around a modest effect is often more useful to a planner than a wide interval around a large one.
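For readers who want to check an interval themselves, here is a minimal sketch of one common way to compute a relative risk and its 95% confidence interval from a two-by-two table, using the standard error of the log relative risk. The event counts are hypothetical, chosen so that the smaller study lands close to the relative risk of 1.4 and the 0.9 to 2.1 interval mentioned above. A paper may well use a different method, so treat this as an approximation.

```python
import math


def relative_risk_ci(events_exposed, n_exposed, events_control, n_control, z=1.96):
    """Relative risk with an approximate 95% CI (log method; z=1.96 by default)."""
    rr = (events_exposed / n_exposed) / (events_control / n_control)
    # Standard error of ln(RR) for two independent groups.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: the same underlying risks, at two different study sizes.
studies = {
    "smaller study": (49, 900, 35, 900),
    "ten times larger": (490, 9000, 350, 9000),
}

for label, counts in studies.items():
    rr, lower, upper = relative_risk_ci(*counts)
    crosses_one = lower <= 1.0 <= upper
    print(f"{label}: RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}, "
          f"includes 1.0: {crosses_one}")
```

Both hypothetical studies have the same point estimate. Only the larger one has an interval that excludes 1.0, which is why the interval, not the headline number, should drive how much weight a finding carries.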
Funding and conflicts of interest
This is the section most readers skip. It is also the one that modifies every other finding on the page.
A systematic review by Lundh et al., published in the Cochrane Database of Systematic Reviews in 2017, found that industry-funded studies more often reported favourable results and conclusions than independently funded ones, across multiple medical specialties. The pattern did not necessarily reflect fraud. It reflected which studies were initiated, which outcomes were pre-specified, which thresholds were set for success, and which results made it into the published record.
CASP’s 2024 checklists explicitly ask: “Is there a clear statement of funding?” and “Did the researchers declare potential conflicts of interest?” If the paper is silent on both, treat the findings with additional scrutiny. Not dismissal. Just a raised baseline of caution.
A practical place to start
CASP publishes free, study-design-specific checklists at casp-uk.net, updated in 2024. There is one for cohort studies, one for RCTs, one for systematic reviews, one for qualitative studies, and several others. Each runs to roughly ten questions. None of them requires a doctorate to use productively.
One habit worth building, regardless of which checklist you use: read the abstract once, then set it aside and go directly to the methods section before you look at the results. The abstract was written to be persuasive. The methods section was written to be accurate. When the two descriptions of the same study do not match, the methods section is the one to trust.
This article is for educational purposes only and does not constitute medical advice, diagnosis, or treatment recommendations. Consult a qualified healthcare provider for any health concerns.
Sources
- Mehra MR, Ruschitzka F, Patel AN. Retraction — Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. The Lancet. 2020. DOI: 10.1016/S0140-6736(20)31324-6
- Retraction Watch Database. 2023 annual retraction statistics. Available at: retractionwatch.com
- Critical Appraisal Skills Programme (CASP). Critical Appraisal Checklists. 2024. Available at: casp-uk.net
- Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924–926. PMID: 18436948. DOI: 10.1136/bmj.39489.470347.AD
- Lundh A, Lexchin J, Mintzes B, Schroll JB, Bero L. Industry sponsorship and research outcome. Cochrane Database of Systematic Reviews. 2017;(2):MR000033. DOI: 10.1002/14651858.MR000033.pub3