Most health journalism collapses three different evidentiary claims into one paragraph: a mechanism, a mouse study, and a press release. Pulling them apart is the entire game. The skill is not memorizing statistics. It is recognizing which kind of evidence is in front of you, what that kind of evidence can and cannot say, and where the obvious failure modes are.
This article is the literacy floor. If you read it, you should be able to look at a typical health headline, click through to the underlying paper, and decide within 5 minutes whether the claim is supported, partly supported, or oversold.
How do you rank study designs by evidence quality?
Not all studies are equal. The hierarchy is not arbitrary; each tier controls one more source of confounding than the tier below it.
| Tier | Design | What it controls for | Typical interpretation |
|---|---|---|---|
| 1 | Meta-analysis of RCTs | Random error, single-trial bias | Strongest non-mechanistic evidence available |
| 2 | Single large RCT | Confounding via randomization | Direction + magnitude credible if pre-registered |
| 3 | Prospective cohort | Recall bias, reverse causation | Association, not cause; effect sizes inflate |
| 4 | Retrospective cohort | Some confounding | Hypothesis-generating |
| 5 | Case-control | Selection of controls is everything | Useful only for rare outcomes |
| 6 | Case series | Almost nothing | Pattern-spotting; do not act on it |
| 7 | Anecdote / mechanism | Nothing | Mechanism is a hypothesis, not an outcome |
Two practical consequences. First, observational data inflate effect sizes by roughly 30% to 50% relative to the RCTs that follow them, because residual confounding survives every statistical adjustment. The 2002 hormone replacement therapy reversal is the canonical case: 20 years of cohort data suggested HRT cut cardiovascular risk; the WHI RCT showed no benefit and a small harm signal. The cohort women who chose HRT were healthier at baseline. Second, an underpowered RCT is not necessarily better than a large cohort. A 40-person RCT can be more biased than a 40,000-person cohort if the randomization is poorly executed.
Reading move: when a paper claims an effect, scroll to the methods. The first sentence usually tells you the design. If the design is "we surveyed 800 people about their diet," the headline is an association, no matter what the title implies.
What is effect size and why does it matter more than p-values?
Statistical significance answers "is the effect probably non-zero?" Effect size answers "is the effect big enough to care about?" These are different questions, and most health writing conflates them.
Three effect-size frames you need:
Cohen's d for continuous outcomes (blood pressure, body weight, bench press 1RM). The convention from Jacob Cohen's 1988 textbook: 0.2 is small, 0.5 is moderate, 0.8 is large. A d of 0.05 means the average treated person is at the 52nd percentile of the placebo distribution. That is a real effect. It is also indistinguishable from noise to any single human (the conversion is sketched in code after the third frame).
Risk ratio (RR) and odds ratio (OR) for binary outcomes (heart attack, death, infection). RR of 0.80 means a 20% relative risk reduction. The trap: the absolute risk reduction depends on the baseline rate. A 20% relative reduction on a 1% baseline event rate moves the absolute rate from 1.0% to 0.8%. The number needed to treat (NNT) is 1 divided by the absolute risk reduction; in this case, 500. That is, 500 people take the drug for one to benefit.
Number needed to treat (NNT) is the cleanest patient-facing summary. Statins for primary prevention have NNTs around 100 over 5 years for non-fatal MI. Semaglutide (a GLP-1 agonist) at 2.4 mg weekly produced a 20% MACE reduction in 17,604 SELECT participants ( Lincoff et al. (SELECT) 2023, n=17604 ), translating to an NNT around 67 over 40 months. REDUCE-IT showed icosapent ethyl cut MACE 25% in 8,179 participants ( Bhatt et al. (REDUCE-IT) 2019, n=8179 ), NNT around 21 over 5 years.
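Both conversions above (Cohen's d to a percentile, and a relative risk to an absolute risk reduction and NNT) are one-liners. A minimal sketch in Python using the illustrative numbers from the text, not trial data:

```python
from scipy.stats import norm

def d_to_percentile(d: float) -> float:
    """Percentile of the average treated person within the placebo
    distribution, assuming roughly normal outcomes with equal spread."""
    return norm.cdf(d) * 100

def abs_risk_reduction(baseline_rate: float, relative_risk: float) -> float:
    """Absolute risk reduction: baseline event rate minus treated event rate."""
    return baseline_rate * (1 - relative_risk)

def nnt(baseline_rate: float, relative_risk: float) -> float:
    """Number needed to treat = 1 / absolute risk reduction."""
    return 1 / abs_risk_reduction(baseline_rate, relative_risk)

# Cohen's d benchmarks from the text, plus the imperceptible 0.05.
for d in (0.05, 0.2, 0.5, 0.8):
    print(f"d = {d}: percentile ~ {d_to_percentile(d):.0f}")

# The worked RR example: RR 0.80 (a 20% relative reduction) on a 1% baseline.
print(abs_risk_reduction(0.01, 0.80))  # 0.002 -> 0.2 percentage points
print(nnt(0.01, 0.80))                 # ~500 treated for one to benefit
```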
Reading move: every time a press release says "X cuts risk by Y percent," ask whether Y is a relative or absolute reduction, and what the baseline event rate was. Most of the time, the headline number is relative and the absolute number is small.
What does a p-value actually tell you?
A p-value is the probability of seeing the observed data (or more extreme) under the null hypothesis. It is not the probability the hypothesis is true. It is not the probability of replication. It is not a measure of effect size.
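One way to internalize the definition: simulate many trials in which the null is exactly true and count how often p dips below 0.05. A minimal sketch on simulated data; no real trial is involved:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# 10,000 simulated "trials" in which treatment and placebo are drawn from
# the same distribution, so the null hypothesis is true by construction.
p_values = []
for _ in range(10_000):
    treated = rng.normal(loc=0.0, scale=1.0, size=50)
    placebo = rng.normal(loc=0.0, scale=1.0, size=50)
    p_values.append(ttest_ind(treated, placebo).pvalue)

# Roughly 5% of null trials land under 0.05. That is all alpha = 0.05 means;
# it says nothing about the probability that any particular hypothesis is true.
print(np.mean(np.array(p_values) < 0.05))
```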
Three failure modes compress most of the p-value abuse you encounter:
Multiple comparisons. If you run 20 outcome tests at alpha = 0.05, you expect 1 false positive even if no effect exists (the arithmetic is sketched in code below). A trial that pre-specifies 1 primary endpoint and finds p = 0.04 is meaningful. A trial that runs 47 outcomes and reports the 3 with p under 0.05 in the abstract is fishing.
Pre-specification vs post-hoc. Pre-registered outcomes (typed into clinicaltrials.gov before recruitment) are credible. Outcomes added after the data come in are not. Subgroup analyses (men over 60 with diabetes, etc.) inflate false-positive rates further; they are hypothesis-generating, never confirmatory.
Statistical vs clinical significance. With 100,000 participants, almost any tiny effect crosses p < 0.05. Cardiorespiratory fitness associates with mortality at p effectively zero in the n=122,007 Mandsager cohort, but the same cohort can show statistically significant associations of magnitude too small to act on. Always check the confidence interval, not just the p-value. A 95% CI of [0.95, 1.20] on a relative risk includes the null: whatever story the point estimate seems to tell, the data cannot rule out no effect.
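The multiple-comparisons arithmetic from the first failure mode, made explicit. A minimal sketch assuming independent tests; real trial outcomes are usually correlated, which changes the exact numbers but not the lesson:

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(expected_false_positives(20))  # 1.0  -> the "expect 1 false positive" above
print(familywise_error_rate(20))     # ~0.64 -> better than a coin flip that something "works"
print(familywise_error_rate(47))     # ~0.91 -> 47 outcomes all but guarantees a hit
```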
Ioannidis 2005 (PLoS Medicine) formalized this with the prior-probability argument: when a research field has low base rates of true effects (most novel biology), low statistical power (small trials), and high analytical flexibility (many tests), the majority of published "positive" findings are false ( Ioannidis 2005 ). The fix is pre-registration, replication, and triangulation across designs, not p-value chasing.
Reading move: before reading any p-value, find the pre-registration. ClinicalTrials.gov is the registry of record for clinical work. If the published primary endpoint matches the pre-registered primary endpoint, the p-value is meaningful. If it does not, treat the result as exploratory.
Reading a meta-analysis
A meta-analysis is not the average of two trials. A proper meta-analysis pools effect sizes across studies, weights each study by its inverse variance, and reports a summary estimate alongside heterogeneity statistics. Three things to check:
I-squared (heterogeneity). Higgins 2003 (BMJ) introduced this measure: the percentage of variance across studies attributable to real differences rather than chance ( Higgins, Thompson, Deeks & Altman 2003 ). As a rule of thumb, 25% is low, 50% is moderate, and 75% is high (sketched in code below). A meta-analysis with I-squared above 75% is averaging studies that disagree, and the summary estimate is suspect.
Funnel plot for publication bias. Plot effect size on the x-axis against precision (1/SE) on the y-axis. Trials that find positive effects get published; null trials end up in the file drawer. An asymmetric funnel suggests the published literature is skewed.
GRADE confidence rating. The Grading of Recommendations Assessment, Development and Evaluation framework (Guyatt 2008, BMJ) rates evidence as high, moderate, low, or very low ( Guyatt, Oxman, Vist, Kunz, Falck-Ytter, Alonso-Coello & Schunemann 2008 ). It downgrades for risk of bias, inconsistency, indirectness, imprecision, and publication bias. Cochrane reviews routinely carry a GRADE rating. If a meta-analysis does not, it has skipped a step.
The Cholesterol Treatment Trialists meta of 186,854 statin participants is a worked example of how this looks done well: per-mmol/L LDL-C reduction cut major vascular events by 22%, with low heterogeneity across 28 RCTs ( Cholesterol Treatment Trialists Collaboration 2019, n=186854 ). The Brown 1999 fiber meta pooled 67 controlled trials and produced a clean dose-response ( Brown, Rosner, Willett & Sacks 1999 ). Both ship with their I-squared and GRADE-equivalent disclosures.
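For readers who want to see what inverse-variance weighting and I-squared actually compute, here is a minimal fixed-effect sketch; a real random-effects meta-analysis adds a between-study variance term, but the logic is the same. The per-study effect sizes and standard errors below are invented for illustration, not taken from any trial cited here.

```python
import numpy as np

# Hypothetical per-study log risk ratios and their standard errors.
log_rr = np.array([-0.25, -0.10, -0.30, -0.05])
se = np.array([0.10, 0.08, 0.15, 0.12])

# Fixed-effect inverse-variance pooling: more precise studies get more weight.
w = 1 / se**2
pooled = np.sum(w * log_rr) / np.sum(w)

# Cochran's Q and Higgins' I-squared: the share of between-study variance
# beyond what sampling error alone would produce.
Q = np.sum(w * (log_rr - pooled) ** 2)
df = len(log_rr) - 1
i_squared = max(0.0, (Q - df) / Q) * 100

print(f"pooled RR ~ {np.exp(pooled):.2f}")
print(f"I-squared ~ {i_squared:.0f}%")
```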
Reconciling conflicting trials
When trials of the "same" intervention disagree, the resolution is almost always a methodological difference, not a biological contradiction. The omega-3 case is the textbook example. REDUCE-IT (n=8,179) tested icosapent ethyl, a purified EPA, against a mineral oil placebo and cut MACE 25% ( Bhatt et al. (REDUCE-IT) 2019, n=8179 ). STRENGTH (n=13,078) tested an EPA+DHA combination against a corn oil placebo and showed no benefit ( Nicholls et al. (STRENGTH) 2020, n=13078 ).
Three differences explain almost all the gap. The molecule was different (pure EPA vs EPA+DHA mix). The placebo was different (mineral oil may have actively raised LDL in the REDUCE-IT placebo arm, exaggerating the apparent benefit). The dose was different (4 g vs ~3.36 g). None of this means EPA does not work; it means the EPA-specific dose-and-formulation question is not yet settled by these two trials alone.
A second case: cold water immersion. Roberts 2015 showed CWI after resistance training cut hypertrophy gains roughly 40% over 12 weeks. Other trials show CWI reduces inflammation and DOMS without compromising hypertrophy. Resolution: timing relative to the workout. CWI within 1 hour of resistance training blunts the anabolic signal; CWI 6 hours later does not. The biology is the same; the protocol is different.
Reading move: when two trials disagree, list the population, the dose, the comparator, the duration, and the primary endpoint side by side. The cause of the disagreement is almost always one of those rows.
Red flags
Some patterns reliably correlate with overstated claims:
- Surrogate endpoints sold as outcomes. "Lowered LDL by 20%" is a surrogate. "Reduced MACE by 15%" is an outcome. Many surrogates predict outcomes (LDL, blood pressure, A1c). Some do not (HDL has been a famously poor surrogate; raising it pharmacologically has failed in trial after trial).
- Industry funding plus null-result amnesia. Industry-funded trials are not inherently bad, but pre-registration plus published null results are the only protection against selective reporting. Check for at least 2 published null results in the same compound's history; absence of any is a warning.
- Post-hoc subgroup analyses headlining. "Worked best in women over 50" appearing in the abstract when the primary endpoint was null is a fishing expedition.
- Single-trial sensationalism. No single trial is the final word. Hormone trials, antidepressant trials, and nutrition trials have all been overturned by replication. Wait for the second trial.
- n under 30 in a clinical trial. Cohen's d at 0.5 needs roughly 64 participants per arm at 80% power (see the sketch after this list). Smaller trials cannot detect modest effects reliably; reported "positive" results are often regression to the mean or noise.
- Animal data presented as human evidence. Mouse lifespan studies are mechanism, not protocol. The translation rate from rodent longevity intervention to human clinical benefit is roughly 1 in 8 in the modern era.
- No confidence intervals reported. A point estimate without uncertainty is propaganda, not data.
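A sketch of the sample-size arithmetic behind the "n under 30" flag above, using statsmodels; the d = 0.5 and 80% power figures are the ones quoted in the list:

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per arm to detect a given Cohen's d at 80% power,
# two-sided alpha of 0.05, two equal-sized arms.
analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n_per_arm = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                     ratio=1.0, alternative="two-sided")
    print(f"d = {d}: ~{n_per_arm:.0f} per arm")
# d = 0.5 -> roughly 64 per arm; a 30-person trial cannot see moderate effects.
```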
What this looks like at biologicalx
We codify these distinctions in our editorial gates. Every article carries an evidenceTier of robust, moderate, preliminary, or insufficient, with a one-sentence justification (see methodology). Every clinical claim more specific than "consider X" requires an inline <StudyCite /> resolving to our citation registry. Every recommendation comes with a named dissenter when the evidence is contested. We re-examine each article against the current literature on a 180-day cycle; the dashboard at research-index flags overdue articles publicly.
The result you should expect: when we say "robust," we mean a meta-analysis or two large RCTs agreeing on direction and magnitude. When we say "preliminary," we mean a single small trial, an open-label study, or a strong mechanistic case without confirmatory human data. When we say "insufficient," we mean we are documenting a claim, not endorsing it. The grade lives at the top of every article so you can decide whether to read further.
For a worked catalog of every meta-analysis and systematic review we cite, see the meta-analysis master list. It is the fastest way to find the strongest available evidence on any topic we cover.
Counter-view
John Ioannidis has argued that even the standard hierarchy understates the rot: pre-registration is patchy, replication is rare, and the publication system rewards novel positive findings over confirmatory negative ones. Ben Goldacre and the AllTrials movement have made a similar case: the unpublished trial is the silent statistic. Both are correct; the response is to read with calibrated skepticism, not to discard the literature. The alternative (anecdote, mechanism, marketing) is worse.
A different counter-view comes from Bayesian statisticians like Andrew Gelman: p-values themselves are the wrong tool, and confidence intervals plus pre-registered analyses without a hard 0.05 threshold would serve readers better. We use frequentist framing here because that is what 95% of the published literature uses; the Bayesian critique is real and worth understanding.