Most health journalism collapses three different evidentiary claims into one paragraph: a mechanism, a mouse study, and a press release. Pulling them apart is the entire game. The skill is not memorizing statistics. It is recognizing which kind of evidence is in front of you, what that kind of evidence can and cannot say, and where the obvious failure modes are.
This article is the literacy floor. If you read it, you should be able to look at a typical health headline, click through to the underlying paper, and decide within 5 minutes whether the claim is supported, partly supported, or oversold.
How do you rank study designs by evidence quality?
Not all studies are equal. The hierarchy is not arbitrary; each tier controls one more source of confounding than the tier below it.
| Tier | Design | What it controls for | Typical interpretation |
|---|---|---|---|
| 1 | Meta-analysis of RCTs | Random error, single-trial bias | Strongest non-mechanistic evidence available |
| 2 | Single large RCT | Confounding via randomization | Direction + magnitude credible if pre-registered |
| 3 | Prospective cohort | Recall bias, reverse causation | Association, not cause; effect sizes inflate |
| 4 | Retrospective cohort | Some confounding | Hypothesis-generating |
| 5 | Case-control | Selection of controls is everything | Useful only for rare outcomes |
| 6 | Case series | Almost nothing | Pattern-spotting; do not act on it |
| 7 | Anecdote / mechanism | Nothing | Mechanism is a hypothesis, not an outcome |
Two practical consequences. First, observational data inflate effect sizes by roughly 30% to 50% relative to the RCTs that follow them, because residual confounding survives every statistical adjustment. The 2002 hormone replacement therapy reversal is the canonical case: 20 years of cohort data suggested HRT cut cardiovascular risk; the WHI RCT showed no benefit and a small harm signal. The cohort women who chose HRT were healthier at baseline. Second, an underpowered RCT is not necessarily better than a large cohort. A 40-person RCT can be more biased than a 40,000-person cohort if the randomization is poorly executed.
Reading move: when a paper claims an effect, scroll to the methods. The first sentence usually tells you the design. If the design is "we surveyed 800 people about their diet," the headline is an association, no matter what the title implies.
What is effect size and why does it matter more than p-values?
Statistical significance answers "is the effect probably non-zero?" Effect size answers "is the effect big enough to care about?" These are different questions, and most health writing conflates them.
Three effect-size frames you need:
Cohen's d for continuous outcomes (blood pressure, body weight, bench press 1RM). The convention from Jacob Cohen's 1988 textbook: 0.2 is small, 0.5 is moderate, 0.8 is large. A d of 0.05 means the average treated person is at the 52nd percentile of the placebo distribution. That is a real effect. It is also indistinguishable from noise to any single human (the conversion is sketched in code after the third frame).
Risk ratio (RR) and odds ratio (OR) for binary outcomes (heart attack, death, infection). RR of 0.80 means a 20% relative risk reduction. The trap: the absolute risk reduction depends on the baseline rate. A 20% relative reduction on a 1% baseline event rate moves the absolute rate from 1.0% to 0.8%. The number needed to treat (NNT) is 1 divided by the absolute risk reduction; in this case, 500. That is, 500 people take the drug for one to benefit.
Number needed to treat (NNT) is the cleanest patient-facing summary. Statins for primary prevention have NNTs around 100 over 5 years for non-fatal MI. Semaglutide (a GLP-1 agonist) at 2.4 mg weekly produced a 20% MACE reduction in 17,604 SELECT participants ( Lincoff et al. (SELECT) 2023, n=17604 ), translating to an NNT around 67 over 40 months. REDUCE-IT showed icosapent ethyl cut MACE 25% in 8,179 participants ( Bhatt et al. (REDUCE-IT) 2019, n=8179 ), NNT around 21 over 5 years.
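Both conversions above (Cohen's d to a percentile, and a relative risk to an absolute risk reduction and NNT) are one-liners. A minimal sketch in Python using the illustrative numbers from the text, not trial data:

```python
from scipy.stats import norm

def d_to_percentile(d: float) -> float:
    """Percentile of the average treated person within the placebo
    distribution, assuming roughly normal outcomes with equal spread."""
    return norm.cdf(d) * 100

def abs_risk_reduction(baseline_rate: float, relative_risk: float) -> float:
    """Absolute risk reduction: baseline event rate minus treated event rate."""
    return baseline_rate * (1 - relative_risk)

def nnt(baseline_rate: float, relative_risk: float) -> float:
    """Number needed to treat = 1 / absolute risk reduction."""
    return 1 / abs_risk_reduction(baseline_rate, relative_risk)

# Cohen's d benchmarks from the text, plus the imperceptible 0.05.
for d in (0.05, 0.2, 0.5, 0.8):
    print(f"d = {d}: percentile ~ {d_to_percentile(d):.0f}")

# The worked RR example: RR 0.80 (a 20% relative reduction) on a 1% baseline.
print(abs_risk_reduction(0.01, 0.80))  # 0.002 -> 0.2 percentage points
print(nnt(0.01, 0.80))                 # ~500 treated for one to benefit
```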
Reading move: every time a press release says "X cuts risk by Y percent," ask whether Y is a relative or absolute reduction, and what the baseline event rate was. Most of the time, the headline number is relative and the absolute number is small.
What does a p-value actually tell you?
A p-value is the probability of seeing the observed data (or more extreme) under the null hypothesis. It is not the probability the hypothesis is true. It is not the probability of replication. It is not a measure of effect size.
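One way to internalize the definition: simulate many trials in which the null is exactly true and count how often p dips below 0.05. A minimal sketch on simulated data; no real trial is involved:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# 10,000 simulated "trials" in which treatment and placebo are drawn from
# the same distribution, so the null hypothesis is true by construction.
p_values = []
for _ in range(10_000):
    treated = rng.normal(loc=0.0, scale=1.0, size=50)
    placebo = rng.normal(loc=0.0, scale=1.0, size=50)
    p_values.append(ttest_ind(treated, placebo).pvalue)

# Roughly 5% of null trials land under 0.05. That is all alpha = 0.05 means;
# it says nothing about the probability that any particular hypothesis is true.
print(np.mean(np.array(p_values) < 0.05))
```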
Three failure modes compress most of the p-value abuse you encounter:
Multiple comparisons. If you run 20 outcome tests at alpha = 0.05, you expect 1 false positive even if no effect exists (the arithmetic is sketched in code below). A trial that pre-specifies 1 primary endpoint and finds p = 0.04 is meaningful. A trial that runs 47 outcomes and reports the 3 with p under 0.05 in the abstract is fishing.
Pre-specification vs post-hoc. Pre-registered outcomes (typed into clinicaltrials.gov before recruitment) are credible. Outcomes added after the data come in are not. Subgroup analyses (men over 60 with diabetes, etc.) inflate false-positive rates further; they are hypothesis-generating, never confirmatory.
Statistical vs clinical significance. With 100,000 participants, almost any tiny effect crosses p < 0.05. Cardiorespiratory fitness associates with mortality at p effectively zero in the n=122,007 Mandsager cohort, but the same cohort can show statistically significant associations of magnitude too small to act on. Always check the confidence interval, not just the p-value. A 95% CI of [0.95, 1.20] on a relative risk includes the null: whatever story the point estimate seems to tell, the data cannot rule out no effect.
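The multiple-comparisons arithmetic from the first failure mode, made explicit. A minimal sketch assuming independent tests; real trial outcomes are usually correlated, which changes the exact numbers but not the lesson:

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(expected_false_positives(20))  # 1.0  -> the "expect 1 false positive" above
print(familywise_error_rate(20))     # ~0.64 -> better than a coin flip that something "works"
print(familywise_error_rate(47))     # ~0.91 -> 47 outcomes all but guarantees a hit
```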
Ioannidis 2005 (PLoS Medicine) formalized this with the prior-probability argument: when a research field has low base rates of true effects (most novel biology), low statistical power (small trials), and high analytical flexibility (many tests), the majority of published "positive" findings are false ( Ioannidis 2005 ). The fix is pre-registration, replication, and triangulation across designs, not p-value chasing.
Reading move: before reading any p-value, find the pre-registration. ClinicalTrials.gov is the registry of record for clinical work. If the published primary endpoint matches the pre-registered primary endpoint, the p-value is meaningful. If it does not, treat the result as exploratory.
Reading a meta-analysis
A meta-analysis is not the average of two trials. A proper meta-analysis pools effect sizes across studies, weights each study by its inverse variance, and reports a summary estimate alongside heterogeneity statistics. Three things to check:
I-squared (heterogeneity). Higgins 2003 (BMJ) introduced this measure: the percentage of variance across studies attributable to real differences rather than chance ( Higgins, Thompson, Deeks & Altman 2003 ). As a rule of thumb, 25% is low, 50% is moderate, and 75% is high (sketched in code below). A meta-analysis with I-squared above 75% is averaging studies that disagree, and the summary estimate is suspect.
Funnel plot for publication bias. Plot effect size on the x-axis against precision (1/SE) on the y-axis. Trials that find positive effects get published; null trials end up in the file drawer. An asymmetric funnel suggests the published literature is skewed.
GRADE confidence rating. The Grading of Recommendations Assessment, Development and Evaluation framework (Guyatt 2008, BMJ) rates evidence as high, moderate, low, or very low ( Guyatt, Oxman, Vist, Kunz, Falck-Ytter, Alonso-Coello & Schunemann 2008 ). It downgrades for risk of bias, inconsistency, indirectness, imprecision, and publication bias. Cochrane reviews routinely carry a GRADE rating. If a meta-analysis does not, it has skipped a step.
The Cholesterol Treatment Trialists meta of 186,854 statin participants is a worked example of how this looks done well: per-mmol/L LDL-C reduction cut major vascular events by 22%, with low heterogeneity across 28 RCTs ( Cholesterol Treatment Trialists Collaboration 2019, n=186854 ). The Brown 1999 fiber meta pooled 67 controlled trials and produced a clean dose-response ( Brown, Rosner, Willett & Sacks 1999 ). Both ship with their I-squared and GRADE-equivalent disclosures.
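For readers who want to see what inverse-variance weighting and I-squared actually compute, here is a minimal fixed-effect sketch; a real random-effects meta-analysis adds a between-study variance term, but the logic is the same. The per-study effect sizes and standard errors below are invented for illustration, not taken from any trial cited here.

```python
import numpy as np

# Hypothetical per-study log risk ratios and their standard errors.
log_rr = np.array([-0.25, -0.10, -0.30, -0.05])
se = np.array([0.10, 0.08, 0.15, 0.12])

# Fixed-effect inverse-variance pooling: more precise studies get more weight.
w = 1 / se**2
pooled = np.sum(w * log_rr) / np.sum(w)

# Cochran's Q and Higgins' I-squared: the share of between-study variance
# beyond what sampling error alone would produce.
Q = np.sum(w * (log_rr - pooled) ** 2)
df = len(log_rr) - 1
i_squared = max(0.0, (Q - df) / Q) * 100

print(f"pooled RR ~ {np.exp(pooled):.2f}")
print(f"I-squared ~ {i_squared:.0f}%")
```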
Reconciling conflicting trials
When trials of the "same" intervention disagree, the resolution is almost always a methodological difference, not a biological contradiction. The omega-3 case is the textbook example. REDUCE-IT (n=8,179) tested icosapent ethyl, a purified EPA, against a mineral oil placebo and cut MACE 25% ( Bhatt et al. (REDUCE-IT) 2019, n=8179 ). STRENGTH (n=13,078) tested an EPA+DHA combination against a corn oil placebo and showed no benefit ( Nicholls et al. (STRENGTH) 2020, n=13078 ).
Three differences explain almost all the gap. The molecule was different (pure EPA vs EPA+DHA mix). The placebo was different (mineral oil may have actively raised LDL in the REDUCE-IT placebo arm, exaggerating the apparent benefit). The dose was different (4 g vs ~3.36 g). None of this means EPA does not work; it means the EPA-specific dose-and-formulation question is not yet settled by these two trials alone.
A second case: cold water immersion. Roberts 2015 showed CWI after resistance training cut hypertrophy gains roughly 40% over 12 weeks. Other trials show CWI reduces inflammation and DOMS without compromising hypertrophy. Resolution: timing relative to the workout. CWI within 1 hour of resistance training blunts the anabolic signal; CWI 6 hours later does not. The biology is the same; the protocol is different.
Reading move: when two trials disagree, list the population, the dose, the comparator, the duration, and the primary endpoint side by side. The cause of the disagreement is almost always one of those rows.
Red flags
Some patterns reliably correlate with overstated claims:
- Surrogate endpoints sold as outcomes. "Lowered LDL by 20%" is a surrogate. "Reduced MACE by 15%" is an outcome. Many surrogates predict outcomes (LDL, blood pressure, A1c). Some do not (HDL has been a famously poor surrogate; raising it pharmacologically has failed in trial after trial).
- Industry funding plus null-result amnesia. Industry-funded trials are not inherently bad, but pre-registration plus published null results are the only protection against selective reporting. Check for at least 2 published null results in the same compound's history; absence of any is a warning.
- Post-hoc subgroup analyses headlining. "Worked best in women over 50" appearing in the abstract when the primary endpoint was null is a fishing expedition.
- Single-trial sensationalism. No single trial is the final word. Hormone trials, antidepressant trials, and nutrition trials have all been overturned by replication. Wait for the second trial.
- n under 30 in a clinical trial. Cohen's d at 0.5 needs roughly 64 participants per arm at 80% power (see the sketch after this list). Smaller trials cannot detect modest effects reliably; reported "positive" results are often regression to the mean or noise.
- Animal data presented as human evidence. Mouse lifespan studies are mechanism, not protocol. The translation rate from rodent longevity intervention to human clinical benefit is roughly 1 in 8 in the modern era.
- No confidence intervals reported. A point estimate without uncertainty is propaganda, not data.
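A sketch of the sample-size arithmetic behind the "n under 30" flag above, using statsmodels; the d = 0.5 and 80% power figures are the ones quoted in the list:

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per arm to detect a given Cohen's d at 80% power,
# two-sided alpha of 0.05, two equal-sized arms.
analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    n_per_arm = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                     ratio=1.0, alternative="two-sided")
    print(f"d = {d}: ~{n_per_arm:.0f} per arm")
# d = 0.5 -> roughly 64 per arm; a 30-person trial cannot see moderate effects.
```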
What this looks like at biologicalx
We codify these distinctions in our editorial gates. Every article carries an evidenceTier of robust, moderate, preliminary, or insufficient, with a one-sentence justification (see methodology). Every clinical claim more specific than "consider X" requires an inline <StudyCite /> resolving to our citation registry. Every recommendation comes with a named dissenter when the evidence is contested. We re-examine each article against the current literature on a 180-day cycle; the dashboard at research-index flags overdue articles publicly.
The result you should expect: when we say "robust," we mean a meta-analysis or two large RCTs agreeing on direction and magnitude. When we say "preliminary," we mean a single small trial, an open-label study, or a strong mechanistic case without confirmatory human data. When we say "insufficient," we mean we are documenting a claim, not endorsing it. The grade lives at the top of every article so you can decide whether to read further.
For a worked catalog of every meta-analysis and systematic review we cite, see the meta-analysis master list. It is the fastest way to find the strongest available evidence on any topic we cover.
Counter-view
John Ioannidis has argued that even the standard hierarchy understates the rot: pre-registration is patchy, replication is rare, and the publication system rewards novel positive findings over confirmatory negative ones. Ben Goldacre and the AllTrials movement have made a similar case: the unpublished trial is the silent statistic. Both are correct; the response is to read with calibrated skepticism, not to discard the literature. The alternative (anecdote, mechanism, marketing) is worse.
A different counter-view comes from Bayesian statisticians like Andrew Gelman: p-values themselves are the wrong tool, and confidence intervals plus pre-registered analyses without a hard 0.05 threshold would serve readers better. We use frequentist framing here because that is what 95% of the published literature uses; the Bayesian critique is real and worth understanding.