P Values vs Estimates of Association With Confidence Intervals
Peter Cummings, MD, MPH;
Thomas D. Koepsell, MD, MPH
Arch Pediatr Adolesc Med. 2010;164(2):193-196.
Since 1988, the International Committee of Medical Journal Editors has used this language in their guidelines for authors: "When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information."1-2 Hundreds of biomedical journals, including the Archives,3 endorse these guidelines. What concerns do editors have about P values and hypothesis testing?
Consider the hypothetical data in Table 1. Two randomized trials were conducted among hospital patients who had a urinary catheter inserted. Each trial compared an ordinary catheter with a catheter impregnated with an antibiotic drug. The study outcome was a new urinary infection while in the hospital. Usually (though not always) authors report 2-sided P values that test the hypothesis of no association between an exposure (such as a treatment) and an outcome (such as infection). Both trials followed this convention and reported P values of about .8. A 2-sided P of .8 for the null hypothesis means that if there were no treatment-related outcome difference in the population from which the study subjects were drawn,4 the probability of drawing subjects with the observed test statistic (a χ² statistic in this example), or a more extreme test statistic, is 8 in 10. To put this more plainly, if less precisely, a large P value tells us the observed risk ratios of 0.86 and 1.03 may easily differ from the null risk ratio of 1.0 (no treatment difference) owing to what we loosely call chance: variation in infection frequency from one random sample to the next.
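As a rough illustration of how a 2-sided P value of this kind can be obtained, the sketch below computes a Pearson chi-square statistic (without continuity correction) for a 2 × 2 table and converts it to a P value. The counts are invented for illustration and are not the Table 1 data; the function names are likewise hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    a, b = treated subjects with / without the outcome
    c, d = control subjects with / without the outcome
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def two_sided_p(chi2):
    """P value for a chi-square statistic with 1 degree of freedom.

    With 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts (an assumption, NOT the Table 1 data):
# 6 of 50 treated patients and 7 of 50 controls developed an infection.
chi2 = chi_square_2x2(6, 44, 7, 43)
p_value = two_sided_p(chi2)
risk_ratio = (6 / 50) / (7 / 50)
print(f"risk ratio = {risk_ratio:.2f}, chi-square = {chi2:.3f}, P = {p_value:.2f}")
```

With counts this small, a risk ratio noticeably different from 1.0 still yields a large P value, which is the situation the two trials in the text describe.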
Table 1. Hypothetical Data for 2 Randomized Trials of Urinary Catheter Type and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness
MISUSE OF P VALUES: CLAIMS OF NO DIFFERENCE
Authors sometimes use wording that suggests that a large P value means that there is no exposure-related difference in the outcomes of the observed subjects and/or the unobserved people in the population from which study subjects were drawn. Such wording confuses lack of a statistically significant difference with lack of any difference.4-12
In Table 1, the investigators observed differences in cumulative infection incidence in both trials; the risk ratios were 0.86 and 1.03, not 1.0. This is a matter of description and has nothing to do with P values. While a large P value indicates that any observed difference may be due entirely to chance, it cannot tell us that it is due entirely or partly to chance. It is possible that both risk ratios in Table 1 are accurate estimates of the true association. We can never prove the null hypothesis; no study can exclude the possibility of some true, albeit possibly small, difference between 2 groups. A claim of no difference should not be based on a P value. Incorrect use of P values is not a fault of P values, but because P values are not needed for understanding most study results, misuse can be remedied by avoiding them entirely or not using them to interpret results.
P VALUES DO NOT MEASURE STRENGTH OF ASSOCIATION
If all else is equal, a P value will be smaller when there is (1) a larger observed difference between 2 groups, (2) a larger sample size, or (3) less variation within treatment group for a continuous exposure or outcome variable.4, 13 P value size is also affected by the proportion of subjects who are exposed and the proportion with the outcome.
Imagine hypothetical data (Table 2) for 4 trials of a drug for high blood pressure. In trial 1, the average systolic pressure was 10 mm Hg lower among those treated compared with controls: P = .12. Compared with trial 1, the P value was smaller in trial 2 because the mean blood pressure difference was larger; the P value was smaller in trial 3 because the number of subjects was larger; and the P value was smaller in trial 4 because blood pressure varied less in that trial. Because P value size depends partly on sample size and within-group variation in exposure or outcome, it cannot be expected to measure the strength of an association, here reflected by the difference in average systolic blood pressure between treatment groups. Trials 1, 3, and 4 all found the same association between treatment and mean blood pressure, a 10 mm Hg difference, despite having different P values.
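The dependence of P on the observed difference, the sample size, and the within-group variation can be sketched with a simple large-sample z test for a difference in means. The inputs below are invented and are not the Table 2 values; the test itself (equal group sizes, a common standard deviation, a normal approximation) is an assumption chosen for brevity.

```python
import math


def two_sided_p_for_mean_difference(diff, sd, n_per_group):
    """Large-sample z test for a difference in means, 2 equal-sized groups."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs (assumptions, not the Table 2 data), in mm Hg.
baseline = two_sided_p_for_mean_difference(diff=10, sd=20, n_per_group=20)
larger_difference = two_sided_p_for_mean_difference(diff=20, sd=20, n_per_group=20)
larger_sample = two_sided_p_for_mean_difference(diff=10, sd=20, n_per_group=80)
less_variation = two_sided_p_for_mean_difference(diff=10, sd=10, n_per_group=20)

for label, p in [
    ("baseline", baseline),
    ("larger difference", larger_difference),
    ("larger sample", larger_sample),
    ("less variation", less_variation),
]:
    print(f"{label:18s} P = {p:.4f}")
```

The last two scenarios keep the same 10 mm Hg difference as the baseline yet give smaller P values, which is why P cannot stand in for the size of the association.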
Table 2. Hypothetical Data for 4 Clinical Trials of Drugs to Reduce Systolic Blood Pressure
ESTIMATES OF ASSOCIATION AND CONFIDENCE INTERVALS
Some limitations and misuses of P values can be avoided if authors instead report and interpret data using estimates of association with intervals that reflect the precision of those estimates.4, 14-19 To describe the direction and size of an exposure-outcome association, authors can use risk ratios, rate ratios, hazard ratios, risk differences, rate differences, or mean differences. Odds ratios can be added to this list when they come from a case-control study; their value in studies in which they lack an interpretation as either a risk ratio or a rate ratio is a topic of debate.20-21
To account for chance (outcome variation between finite subject samples), confidence intervals can be used. P values and confidence intervals are related.4, 22-24 In a randomized trial of drug D, death was less common among treated persons than controls (Table 3): risk ratio, 0.5; 95% confidence interval, 0.22-1.13; P = .09. Using the data from this trial, we can compute P for the hypothesis that any particular risk ratio is true in the population from which the study subjects came. In a plot of these P values (Figure 1), P is 1.0 for the hypothesis that the true risk ratio is 0.5; these data are perfectly compatible with this risk ratio. For the hypothesis that the true risk ratio is the null value of 1.0, P is .09. Risk ratios from 0.22 to 1.13 have 2-sided P values of .05 or greater; these are the 95% confidence limits for the effect of drug D.
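The link between P values and confidence limits can be sketched numerically. The code below assumes a normal approximation on the log risk ratio scale and backs the standard error out of the rounded published limits (0.22 and 1.13), so its output should agree only approximately with the figures quoted for drug D; the method used to produce the published values is not stated here, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def se_log_rr_from_ci(lower, upper, z=Z_95):
    """Approximate standard error of log(RR) recovered from 95% limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def p_value_for_hypothesized_rr(observed_rr, se_log_rr, hypothesized_rr):
    """2-sided P for the hypothesis that the true risk ratio is hypothesized_rr."""
    z = (math.log(observed_rr) - math.log(hypothesized_rr)) / se_log_rr
    return math.erfc(abs(z) / math.sqrt(2))


observed_rr = 0.5
se = se_log_rr_from_ci(0.22, 1.13)

# Evaluate P at the lower limit, the point estimate, the null value,
# the upper limit, and one value outside the interval.
for hypothesized in (0.22, 0.5, 1.0, 1.13, 2.0):
    p = p_value_for_hypothesized_rr(observed_rr, se, hypothesized)
    print(f"hypothesized RR = {hypothesized:<5} P = {p:.3f}")
```

Evaluating the same function over a fine grid of risk ratios would trace out a curve of the kind described for Figure 1: P is 1.0 at the observed estimate, close to .05 at each confidence limit, and smaller beyond them.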
Table 3. Hypothetical Data From the First 2 Randomized Controlled Trials of Treatment and the Outcome of Death Among Patients With Septic Shock
Figure 1. Plot of 2-sided P values for a set of risk ratios based on data from the trial of drug D (Table 3). Each P value is for a hypothesis test that each risk ratio from 0.125 to 2.0 is true in the population from which the study subjects came. The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05. The 95% confidence limits are risk ratios 0.22 and 1.13.
When we see a confidence interval, what should we be confident about? If we test drug D (Table 3) an infinite number of times, drawing subjects randomly from the same population each time, the 95% confidence interval will include the true effect in 95% of the trials.4 For any 1 trial, we cannot be completely confident that the true value falls within the stated bounds; in 5% of the trials, the true risk ratio will lie outside of the 95% interval.
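The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the true control risk (0.2), true risk ratio (0.5), group size, and number of simulated trials are all invented, and the interval is the usual large-sample interval on the log risk ratio scale.

```python
import math
import random

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def simulate_coverage(risk_control, true_rr, n_per_group, n_trials, seed=1):
    """Fraction of simulated trials whose 95% interval contains the true RR."""
    rng = random.Random(seed)
    risk_treated = risk_control * true_rr
    covered = 0
    for _ in range(n_trials):
        events_treated = sum(rng.random() < risk_treated for _ in range(n_per_group))
        events_control = sum(rng.random() < risk_control for _ in range(n_per_group))
        if events_treated == 0 or events_control == 0:
            # No finite log risk ratio; count the trial as a miss.
            continue
        log_rr = math.log(events_treated / events_control)  # equal group sizes
        se = math.sqrt(
            1 / events_treated - 1 / n_per_group
            + 1 / events_control - 1 / n_per_group
        )
        lower = log_rr - Z_95 * se
        upper = log_rr + Z_95 * se
        if lower <= math.log(true_rr) <= upper:
            covered += 1
    return covered / n_trials


# Hypothetical population values (assumptions for illustration only).
coverage = simulate_coverage(risk_control=0.2, true_rr=0.5,
                             n_per_group=300, n_trials=2000)
print(f"Intervals containing the true risk ratio: {coverage:.1%}")
```

The proportion should land near 95%, although any single simulated trial either contains the true risk ratio or does not.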
We should not consider all risk ratios within the 95% interval to be equally compatible with the data and all estimates outside of the interval as excluded by the data.4, 22-23 Confidence limits are just 2 points on a continuum and 95% limits are just a convention. A P value plot (Figure 1) peaks at the observed effect estimate. The confidence interval helps us visualize the continuous curves that fall from that peak to effect estimates with progressively less support from the data.
The studies of drugs A and B had the same P value (Table 1), but the evidence was quite different for the 2 drugs. For drug A, we might summarize by writing:
The cumulative incidence of infection was less among treated persons compared with controls, with a risk ratio 0.86, a 14% reduction. But the 95% confidence interval extended from a beneficial 0.31 to a hazardous 2.37. This trial provides little information about the utility of impregnating urinary catheters with drug A. If there is reason to think that A may be beneficial, a larger trial is needed. Our data can help to estimate the sample size for a larger study by furnishing estimates of the infection incidence to be expected among subjects with an ordinary catheter.
Note, however, that sample size or power calculations for the larger study should be based on the smallest difference in outcomes that would be of practical or theoretical importance, rather than on an imprecise preliminary estimate of association observed in a small pilot study.25 For drug B, we could write:
The risk ratio of 1.03 is consistent with a small harmful effect. The 95% confidence interval (0.79-1.34) suggests that a strong benefit is unlikely. B is probably not useful for prevention of urinary infection.
Imagine a larger, second trial of drug A (Table 4) with a statistically significant reduction in new urinary tract infections (risk ratio, 0.970; 95% confidence interval, 0.945-0.996; P = .02). The risk reductions within the 95% interval (Figure 2), 0.4% to 5.5%, are small. The 14% risk reduction in the first trial (Table 1) now seems an unlikely estimate of the true effect, because it lies far outside the 95% confidence interval observed in the second trial. In the new trial, 333 patients were treated for each prevented infection. The number needed to treat = 1/(risk among controls – risk among treated) = 1/(.1 – .097) = 333. Unless drug A is easy to administer, free of adverse effects, and nearly free of cost, it probably has little clinical utility.
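The arithmetic in this paragraph can be reproduced in a few lines, using only the figures quoted above (risk among controls of 0.1, risk ratio of 0.970, and 95% limits of 0.945 and 0.996); the variable names are illustrative.

```python
# Figures quoted in the text for the second trial of drug A.
risk_controls = 0.100
risk_ratio = 0.970
ci_lower, ci_upper = 0.945, 0.996

# Risk among treated patients implied by the risk ratio.
risk_treated = risk_controls * risk_ratio

# Number needed to treat = 1 / absolute risk reduction.
nnt = 1 / (risk_controls - risk_treated)

# Relative risk reductions at each confidence limit, as percentages.
largest_reduction = (1 - ci_lower) * 100
smallest_reduction = (1 - ci_upper) * 100

print(f"risk among treated = {risk_treated:.3f}")
print(f"number needed to treat = {nnt:.0f}")
print(f"risk reduction range = {smallest_reduction:.1f}% to {largest_reduction:.1f}%")
```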
Table 4. Hypothetical Data for a Randomized Trial of Urinary Catheters Impregnated With Drug A and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness
Figure 2. Plots of 2-sided P values for risk ratios from 2 studies. The solid P value curve is based on data from a small study of catheters impregnated with drug A (Table 1) and the dashed curve is based on data from a large study of drug A catheters (Table 4). The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05.
How would you interpret the trials for drugs C and D (Table 3) for patients in septic shock? Both had a P value of .09. Imagine these drugs are inexpensive, free of adverse effects, and available for other purposes. If a patient had septic shock today, should C be given? Are additional trials of C indicated? What about D?
CONFIDENCE INTERVALS, POWER, AND SAMPLE SIZE
Sometimes a difference is not quite statistically significant (Table 3, for example). Authors or reviewers then may wonder: What power did the study have to detect the reported result? Power calculations are appropriate in the study design stage, but they are no longer relevant once results are known.24, 26-30 More can be learned from the confidence interval, which reveals the range of possible effects that are reasonably compatible with the observed data.24, 26-27
CONFIDENCE INTERVALS AND META-ANALYSIS
Meta-analysis requires that each study reports the effect estimate and its standard error, or information that can be used to calculate estimates of these. Routinely providing estimates of association with confidence intervals makes future meta-analyses possible.
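To show why a reported estimate and interval are enough for pooling, the sketch below recovers an approximate standard error from each published 95% interval and combines the 2 hypothetical drug A trials with fixed-effect, inverse-variance weighting. That pooling method is an assumption chosen for illustration (it is one common approach, not one prescribed here), and the standard errors are approximate because they come from rounded limits.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def log_rr_and_se(rr, lower, upper, z=Z_95):
    """Log risk ratio and its approximate SE, recovered from 95% limits."""
    return math.log(rr), (math.log(upper) - math.log(lower)) / (2 * z)


def pool_fixed_effect(studies, z=Z_95):
    """Inverse-variance pooled risk ratio with a 95% confidence interval."""
    estimates = [log_rr_and_se(*study) for study in studies]
    weights = [1 / se ** 2 for _, se in estimates]
    pooled_log = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )


# (risk ratio, lower limit, upper limit) as quoted for the 2 trials of drug A.
drug_a_trials = [
    (0.86, 0.31, 2.37),     # small first trial
    (0.970, 0.945, 0.996),  # larger second trial
]

pooled_rr, pooled_lower, pooled_upper = pool_fixed_effect(drug_a_trials)
print(f"pooled RR = {pooled_rr:.3f} ({pooled_lower:.3f}-{pooled_upper:.3f})")
```

Because the weights are inversely proportional to the variance, the large trial dominates and the pooled estimate sits almost exactly on its risk ratio; a P value alone would not have allowed this calculation.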
THREE CAUTIONS
First, confidence intervals should not be used to simply judge estimates as statistically significant or not. Second, confidence intervals tell us about how large a role chance may play, but they reveal nothing about bias. Interpretations of point estimates should consider possible bias. Last, confidence intervals are not the only way to estimate precision; Bayesian and likelihood intervals are available.4, 31-34
CONCLUSIONS
Most Archives articles present effect estimates and confidence intervals, but some still use P values for interpretation. We suggest that articles would benefit by omitting most P values,35 reserving a few only for specialized purposes such as testing for a trend in outcome across several ordered exposure levels or testing the significance of differences in associations across levels of a third factor. An alternative would be to present some P values in tables, but not use them for interpretation or present them in the text. We also suggest that authors consider framing their research aim in their article's introduction in terms of estimating the size of an association rather than in terms of testing for the presence or absence of an association.
AUTHOR INFORMATION
Correspondence: Dr Cummings, Department of Epidemiology, University of Washington, 250 Grandview Dr, Bishop, CA 93514 (peterc{at}u.washington.edu).
Author Contributions: Study concept and design: Cummings and Koepsell. Drafting of the manuscript: Cummings. Critical revision of the manuscript for important intellectual content: Cummings and Koepsell. Statistical analysis: Cummings and Koepsell. Administrative, technical, and material support: Cummings and Koepsell.
Financial Disclosure: None reported.
Author Affiliations: Harborview Injury Prevention and Research Center (Dr Cummings), and School of Public Health (Drs Cummings and Koepsell), University of Washington, Seattle.
REFERENCES
1. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. JAMA. 1997;277(11):927-934.
2. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals: writing and editing for biomedical publications. http://www.icmje.org/. Accessed May 14, 2009.
3. Archives of Pediatrics & Adolescent Medicine. Instructions for Authors. http://archpedi.ama-assn.org/misc/ifora.dtl. Accessed May 22, 2009.
4. Rothman KJ, Greenland S, Lash TL. Precision and statistics in epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008:148-167.
5. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med. 1978;299(13):690-694.
6. Edwards AWF. Likelihood: Expanded Edition. Baltimore, MD: The Johns Hopkins University Press; 1992:179-180.
7. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
8. Sterne JA, Davey Smith G. Sifting the evidence-what's wrong with significance tests? BMJ. 2001;322(7280):226-231.
9. Alderson P, Chalmers I. Survey of claims of no effect in abstracts of Cochrane reviews. BMJ. 2003;326(7387):475.
10. Alderson P. Absence of evidence is not evidence of absence. BMJ. 2004;328(7438):476-477.
11. Hauer E. The harm done by tests of significance. Accid Anal Prev. 2004;36(3):495-500.
12. Gigerenzer G. Mindless statistics. J Socio-Economics. 2004;33(5):587-606.
13. Lang JM, Rothman KJ, Cann CI. That confounded P-value [editorial]. Epidemiology. 1998;9(1):7-8.
14. Rothman KJ. A show of confidence. N Engl J Med. 1978;299(24):1362-1363.
15. Rothman KJ. Significance questing. Ann Intern Med. 1986;105(3):445-447.
16. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292(6522):746-750.
17. Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol. 1993;7(2):95-100.
18. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. 2nd ed. London, England: BMJ Publishing Group; 2000.
19. Altman D, Bland JM. Confidence intervals illuminate absence of evidence [letter]. BMJ. 2004;328(7446):1016-1017.
20. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987;125(5):761-768.
21. Cummings P. The relative merits of risk ratios and odds ratios. Arch Pediatr Adolesc Med. 2009;163(5):438-445.
22. Poole C. Confidence intervals exclude nothing. Am J Public Health. 1987;77(4):492-493.
23. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77(2):195-199.
24. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3(5):449-452.
25. Kraemer HC, Mintz J, Noda A, Tinklenberg J, Yesavage JA. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006;63(5):484-489.
26. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-206.
27. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19-24.
28. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ. 2002;324(7348):1271-1273.
29. Bacchetti P. Author's thoughts on power calculations [letter]. BMJ. 2002;325(7362):491.
30. Senn SJ. Power is indeed irrelevant in interpreting completed studies [letter]. BMJ. 2002;325(7375):1304.
31. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993;137(5):485-501.
32. Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med. 1999;130(12):995-1004.
33. Goodman SN. Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med. 1999;130(12):1005-1013.
34. Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman & Hall/CRC; 1997.
35. The value of P. Epidemiology. 2001;12(3):286.