Vol. 164 No. 2, February 2010

P Values vs Estimates of Association With Confidence Intervals

Peter Cummings, MD, MPH; Thomas D. Koepsell, MD, MPH

Arch Pediatr Adolesc Med. 2010;164(2):193-196.

Since 1988, the International Committee of Medical Journal Editors has used this language in their guidelines for authors: "When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information."1-2 Hundreds of biomedical journals, including the Archives,3 endorse these guidelines. What concerns do editors have about P values and hypothesis testing?

Consider the hypothetical data in Table 1. Two randomized trials were conducted among hospital patients who had a urinary catheter inserted. Each trial compared an ordinary catheter with a catheter impregnated with an antibiotic drug. The study outcome was a new urinary infection while in the hospital. Usually (though not always) authors report 2-sided P values that test the hypothesis of no association between an exposure (such as a treatment) and an outcome (such as infection). Both trials followed this convention and reported P values of about .8. A 2-sided P of .8 for the null hypothesis means that if there were no treatment-related outcome difference in the population from which the study subjects were drawn,4 the probability of drawing subjects with the observed test statistic (a χ² statistic in this example), or a more extreme test statistic, is 8 in 10. To put this more plainly, if less precisely, a large P value tells us the observed risk ratios of 0.86 and 1.03 may easily differ from the null risk ratio of 1.0 (no treatment difference) owing to what we loosely call chance: variation in infection frequency from one random sample to the next.


Table 1. Hypothetical Data for 2 Randomized Trials of Urinary Catheter Type and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness
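
As a rough illustration of how a 2-sided P value of this kind arises, the Python sketch below computes a Pearson χ² statistic (1 degree of freedom, no continuity correction) and its P value for a 2 x 2 table. The counts used (6 infections among 50 treated patients and 7 among 50 controls) are assumptions chosen only because they give a risk ratio close to the 0.86 quoted in the text; they are not taken from Table 1, and the function name is hypothetical.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) and 2-sided P.

    a, b = treated patients with / without the outcome
    c, d = control patients with / without the outcome
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X >= chi2) equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Assumed counts (NOT the Table 1 data): 6 of 50 infected with the
# impregnated catheter, 7 of 50 infected with the ordinary catheter.
a, b, c, d = 6, 44, 7, 43
risk_ratio = (a / (a + b)) / (c / (c + d))
chi2, p = chi2_p_value_2x2(a, b, c, d)
print(f"risk ratio = {risk_ratio:.2f}, chi-square = {chi2:.3f}, P = {p:.2f}")
```

With these assumed counts the P value should land in the neighborhood of .8, which is the situation the paragraph above describes: an observed risk ratio that differs from 1.0 but could easily have arisen by chance.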


MISUSE OF P VALUES: CLAIMS OF NO DIFFERENCE



Authors sometimes use wording that suggests that a large P value means that there is no exposure-related difference in the outcomes of the observed subjects and/or the unobserved people in the population from which study subjects were drawn. Such wording confuses lack of a statistically significant difference with lack of any difference.4-12

In Table 1, the investigators observed differences in cumulative infection incidence in both trials; the risk ratios were 0.86 and 1.03, not 1.0. This is a matter of description and has nothing to do with P values. While a large P value indicates that any observed difference may be due entirely to chance, it cannot tell us whether the difference is in fact due entirely, or only partly, to chance. It is possible that both risk ratios in Table 1 are accurate estimates of the true association. We can never prove the null hypothesis; no study can exclude the possibility of some true, albeit possibly small, difference between 2 groups. A claim of no difference should not be based on a P value. Incorrect use of P values is not the fault of P values themselves. But because P values are not needed for understanding most study results, misuse can be remedied by omitting them entirely or by not using them to interpret results.


P VALUES DO NOT MEASURE STRENGTH OF ASSOCIATION

If all else is equal, a P value will be smaller when there is (1) a larger observed difference between 2 groups, (2) a larger sample size, or (3) less variation within treatment group for a continuous exposure or outcome variable.4, 13 P value size is also affected by the proportion of subjects who are exposed and the proportion with the outcome.

Imagine hypothetical data (Table 2) for 4 trials of a drug for high blood pressure. In trial 1, the average systolic pressure was 10 mm Hg lower among those treated compared with controls: P = .12. Compared with trial 1, the P value was smaller in trial 2 because the mean blood pressure difference was larger; the P value was smaller in trial 3 because the number of subjects was larger; and the P value was smaller in trial 4 because blood pressure varied less in that trial. Because P value size depends partly on sample size and within-group variation in exposure or outcome, it cannot be expected to measure the strength of an association, here reflected by the difference in average systolic blood pressure between treatment groups. Trials 1, 3, and 4 all found the same association between treatment and mean blood pressure, a 10 mm Hg difference, despite having different P values.


Table 2. Hypothetical Data for 4 Clinical Trials of Drugs to Reduce Systolic Blood Pressure
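
The dependence of P on the size of the difference, the sample size, and the within-group variation can be sketched in a few lines of Python. The inputs below (mean difference, standard deviation, and subjects per group) are assumed values chosen for illustration, not the Table 2 data, and the calculation uses a normal (z) approximation rather than a t test to stay within the standard library.

```python
import math

def two_sided_p(mean_diff, sd, n_per_group):
    """2-sided P for a difference in means, normal (z) approximation."""
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = abs(mean_diff) / se
    return math.erfc(z / math.sqrt(2))

# Assumed inputs (NOT the Table 2 values):
# (mean difference in mm Hg, within-group SD, subjects per group)
trials = {
    "trial 1 (reference)": (10, 20, 20),
    "trial 2 (larger difference)": (20, 20, 20),
    "trial 3 (more subjects)": (10, 20, 80),
    "trial 4 (less variation)": (10, 10, 20),
}
p_values = {name: two_sided_p(*args) for name, args in trials.items()}
for name, p in p_values.items():
    print(f"{name}: difference = {trials[name][0]} mm Hg, P = {p:.4f}")
```

In this sketch, trials 1, 3, and 4 share the same 10 mm Hg difference yet produce different P values, which is the point of the paragraph above: P reflects precision as well as effect size.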



ESTIMATES OF ASSOCIATION AND CONFIDENCE INTERVALS

Some limitations and misuses of P values can be avoided if authors instead report and interpret data using estimates of association with intervals that reflect the precision of those estimates.4, 14-19 To describe the direction and size of an exposure-outcome association, authors can use risk ratios, rate ratios, hazard ratios, risk differences, rate differences, or mean differences. Odds ratios can be added to this list when they come from a case-control study; their value in studies in which they lack an interpretation as either a risk ratio or a rate ratio is a topic of debate.20-21

To account for chance (outcome variation between finite subject samples), confidence intervals can be used. P values and confidence intervals are related.4, 22-24 In a randomized trial of drug D, death was less common among treated persons than controls (Table 3): risk ratio, 0.5; 95% confidence interval, 0.22-1.13; P = .09. Using the data from this trial, we can compute P for the hypothesis that any particular risk ratio is true in the population from which the study subjects came. In a plot of these P values (Figure 1), P is 1.0 for the hypothesis that the true risk ratio is 0.5; these data are perfectly compatible with this risk ratio. For the hypothesis that the true risk ratio is the null value of 1.0, P is .09. Risk ratios from 0.22 to 1.13 have 2-sided P values of .05 or greater; these are the 95% confidence limits for the effect of drug D.


Table 3. Hypothetical Data From the First 2 Randomized Controlled Trials of Treatment and the Outcome of Death Among Patients With Septic Shock



Figure 1. Plot of 2-sided P values for a set of risk ratios based on data from the trial of drug D (Table 3). Each P value is for a hypothesis test that each risk ratio from 0.125 to 2.0 is true in the population from which the study subjects came. The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05. The 95% confidence limits are risk ratios 0.22 and 1.13.
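
A P value curve of this kind can be approximated from the reported results alone. The Python sketch below recovers the standard error of the log risk ratio from the reported 95% limits for drug D (0.22 and 1.13) and then computes a 2-sided Wald P value for any hypothesized risk ratio. The Wald (normal approximation) method and the function names are assumptions of this sketch; the article does not state how its curve was computed.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 95% interval

def p_value_function(rr_hat, lower, upper):
    """Return a function giving the 2-sided Wald P for the hypothesis RR = rr0."""
    # Standard error of log(RR), recovered from the width of the 95% interval.
    se_log = (math.log(upper) - math.log(lower)) / (2 * Z_95)

    def p_for(rr0):
        z = abs(math.log(rr_hat) - math.log(rr0)) / se_log
        return math.erfc(z / math.sqrt(2))

    return p_for

# Drug D as reported in the text: risk ratio 0.5, 95% limits 0.22 and 1.13.
p_for = p_value_function(0.5, 0.22, 1.13)
for rr0 in (0.125, 0.22, 0.5, 1.0, 1.13, 2.0):
    print(f"hypothesized RR = {rr0:<5}  P = {p_for(rr0):.3f}")
```

The curve peaks at P = 1 for the observed risk ratio and falls to about .05 at the two confidence limits. At the null value the Wald P may differ slightly from the reported .09, presumably because the reported value came from a different test statistic.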


When we see a confidence interval, what should we be confident about? If we test drug D (Table 3) an infinite number of times, drawing subjects randomly from the same population each time, the 95% confidence interval will include the true effect in 95% of the trials.4 For any 1 trial, we cannot be completely confident that the true value falls within the stated bounds; in 5% of the trials, the true risk ratio will lie outside of the 95% interval.
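
The repeated-sampling interpretation can be checked with a small simulation. The Python sketch below draws many hypothetical trials from a population with a known risk ratio, computes a 95% Wald interval for each, and counts how often the interval contains the true value. The true risk ratio (0.5), control risk (0.2), arm size (200), number of simulated trials (2000), and random seed are all assumptions chosen for illustration.

```python
import math
import random

def wald_ci(a, n1, c, n0, z=1.959964):
    """95% Wald interval for a risk ratio from event counts a/n1 and c/n0."""
    log_rr = math.log((a / n1) / (c / n0))
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

def coverage(true_rr=0.5, control_risk=0.2, n=200, trials=2000, seed=1):
    """Share of simulated trials whose 95% interval contains true_rr."""
    rng = random.Random(seed)
    covered = used = 0
    for _ in range(trials):
        # Simulate the number of outcomes in the treated and control arms.
        a = sum(rng.random() < control_risk * true_rr for _ in range(n))
        c = sum(rng.random() < control_risk for _ in range(n))
        if a == 0 or c == 0:
            continue  # interval undefined with zero events; skipped in this sketch
        lo, hi = wald_ci(a, n, c, n)
        used += 1
        covered += lo <= true_rr <= hi
    return covered / used

observed_coverage = coverage()
print(f"share of intervals containing the true risk ratio: {observed_coverage:.3f}")
```

The printed share should be close to 0.95, though not exactly so, because the Wald interval is an approximation and the number of simulated trials is finite.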

We should not consider all risk ratios within the 95% interval to be equally compatible with the data and all estimates outside of the interval as excluded by the data.4, 22-23 Confidence limits are just 2 points on a continuum and 95% limits are just a convention. A P value plot (Figure 1) peaks at the observed effect estimate. The confidence interval helps us visualize the continuous curves that fall from that peak to effect estimates with progressively less support from the data.

The studies of drugs A and B had the same P value (Table 1), but the evidence was quite different for the 2 drugs. For drug A, we might summarize by writing:

The cumulative incidence of infection was lower among treated persons than among controls, with a risk ratio of 0.86, a 14% reduction. But the 95% confidence interval extended from a beneficial 0.31 to a hazardous 2.37. This trial provides little information about the utility of impregnating urinary catheters with drug A. If there is reason to think that A may be beneficial, a larger trial is needed. Our data can help to estimate the sample size for a larger study by furnishing estimates of the infection incidence to be expected among subjects with an ordinary catheter.

Note, however, that sample size or power calculations for the larger study should be based on the smallest difference in outcomes that would be of practical or theoretical importance, rather than on an imprecise preliminary estimate of association observed in a small pilot study.25 For drug B, we could write:

The risk ratio of 1.03 is consistent with a small harmful effect. The 95% confidence interval (0.79-1.34) suggests that a strong benefit is unlikely. B is probably not useful for prevention of urinary infection.

Imagine a larger, second trial of drug A (Table 4) with a statistically significant reduction in new urinary tract infections (risk ratio, 0.970; 95% confidence interval, 0.945-0.996; P = .02). The risk reductions within the 95% interval (Figure 2), 0.4% to 5.5%, are small. The 14% risk reduction in the first trial (Table 1) now seems an unlikely estimate of the true effect, because it lies far outside the 95% confidence interval observed in the second trial. In the new trial, 333 patients were treated for each prevented infection. The number needed to treat = 1/(risk among controls – risk among treated) = 1/(.1 – .097) = 333. Unless drug A is easy to administer, free of adverse effects, and nearly free of cost, it probably has little clinical utility.
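
The number-needed-to-treat arithmetic can be written out as a short Python sketch. It uses the control risk of 0.1 and the risk ratio and confidence limits reported in the text. Applying the same formula to the confidence limits while holding the control risk fixed at 0.1 is a simplifying assumption of this sketch, not a calculation from the article.

```python
def number_needed_to_treat(control_risk, risk_ratio):
    """NNT = 1 / (risk among controls - risk among treated)."""
    treated_risk = control_risk * risk_ratio
    return 1 / (control_risk - treated_risk)

control_risk = 0.1  # risk among controls, as used in the text
estimates = (
    ("point estimate", 0.970),
    ("lower confidence limit", 0.945),
    ("upper confidence limit", 0.996),
)
for label, rr in estimates:
    nnt = number_needed_to_treat(control_risk, rr)
    print(f"{label}: risk ratio = {rr}, NNT = {nnt:.0f}")
```

Under that assumption, the range of risk ratios in the 95% interval corresponds to treating roughly 180 to 2500 patients per prevented infection, which reinforces the conclusion that the clinical benefit is small.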


Table 4. Hypothetical Data for a Randomized Trial of Urinary Catheters Impregnated With Drug A and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness



Figure 2. Plots of 2-sided P values for risk ratios from 2 studies. The solid P value curve is based on data from a small study of catheters impregnated with drug A (Table 1) and the dashed curve is based on data from a large study of drug A catheters (Table 4). The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05.


How would you interpret the trials for drugs C and D (Table 3) for patients in septic shock? Both had a P value of .09. Imagine these drugs are inexpensive, free of adverse effects, and available for other purposes. If a patient had septic shock today, should C be given? Are additional trials of C indicated? What about D?


CONFIDENCE INTERVALS, POWER, AND SAMPLE SIZE

Sometimes a difference is not quite statistically significant (Table 3, for example). Authors or reviewers then may wonder: What power did the study have to detect the reported result? Power calculations are appropriate in the study design stage, but they are no longer relevant once results are known.24, 26-30 More can be learned from the confidence interval, which reveals the range of possible effects that are reasonably compatible with the observed data.24, 26-27


CONFIDENCE INTERVALS AND META-ANALYSIS

Meta-analysis requires that each study report the effect estimate and its standard error, or information from which these can be calculated. Routinely providing estimates of association with confidence intervals makes future meta-analyses possible.
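
To show why a reported estimate and confidence interval are enough, the Python sketch below recovers the standard error of each log risk ratio from its 95% limits and pools the 2 drug A trials described in this article with inverse-variance weights. The fixed-effect method and the function names are assumptions chosen for illustration; the article does not prescribe a pooling method, and pooling these 2 hypothetical trials is only a demonstration.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 95% interval

def log_rr_and_se(rr, lower, upper):
    """Recover log(RR) and its standard error from a reported 95% interval."""
    return math.log(rr), (math.log(upper) - math.log(lower)) / (2 * Z_95)

def pooled_fixed_effect(studies):
    """Inverse-variance (fixed-effect) pooled risk ratio with 95% limits."""
    pairs = [log_rr_and_se(*study) for study in studies]
    weights = [1 / se ** 2 for _, se in pairs]
    pooled_log = sum(w * est for w, (est, _) in zip(weights, pairs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    # Return (estimate, lower limit, upper limit) on the risk ratio scale.
    return tuple(math.exp(pooled_log + k * pooled_se) for k in (0, -Z_95, Z_95))

# The two drug A trials as reported in the text: (RR, lower limit, upper limit).
drug_a_trials = [(0.86, 0.31, 2.37), (0.970, 0.945, 0.996)]
pooled_rr, pooled_lo, pooled_hi = pooled_fixed_effect(drug_a_trials)
print(f"pooled RR = {pooled_rr:.3f} (95% CI, {pooled_lo:.3f}-{pooled_hi:.3f})")
```

Because the weights are proportional to precision, the pooled result is dominated by the large trial; the small trial, with its wide interval, contributes very little.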


THREE CAUTIONS

First, confidence intervals should not be used to simply judge estimates as statistically significant or not. Second, confidence intervals tell us about how large a role chance may play, but they reveal nothing about bias. Interpretations of point estimates should consider possible bias. Last, confidence intervals are not the only way to estimate precision; Bayesian and likelihood intervals are available.4, 31-34


CONCLUSIONS

Most Archives articles present effect estimates and confidence intervals, but some still use P values for interpretation. We suggest that articles would benefit by omitting most P values,35 reserving a few only for specialized purposes such as testing for a trend in outcome across several ordered exposure levels or testing the significance of differences in associations across levels of a third factor. An alternative would be to present some P values in tables, but not use them for interpretation or present them in the text. We also suggest that authors consider framing their research aim in their article's introduction in terms of estimating the size of an association rather than in terms of testing for the presence or absence of an association.


AUTHOR INFORMATION

Correspondence: Dr Cummings, Department of Epidemiology, University of Washington, 250 Grandview Dr, Bishop, CA 93514 (peterc@u.washington.edu).

Author Contributions: Study concept and design: Cummings and Koepsell. Drafting of the manuscript: Cummings. Critical revision of the manuscript for important intellectual content: Cummings and Koepsell. Statistical analysis: Cummings and Koepsell. Administrative, technical, and material support: Cummings and Koepsell.

Financial Disclosure: None reported.

Author Affiliations: Harborview Injury Prevention and Research Center (Dr Cummings), and School of Public Health (Drs Cummings and Koepsell), University of Washington, Seattle.


REFERENCES

1. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. JAMA. 1997;277(11):927-934.
2. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals: writing and editing for biomedical publications. http://www.icmje.org/. Accessed May 14, 2009.
3. Archives of Pediatrics & Adolescent Medicine. Instructions for Authors. http://archpedi.ama-assn.org/misc/ifora.dtl. Accessed May 22, 2009.
4. Rothman KJ, Greenland S, Lash TL. Precision and statistics in epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008:148-167.
5. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med. 1978;299(13):690-694.
6. Edwards AWF. Likelihood: Expanded Edition. Baltimore, MD: The Johns Hopkins University Press; 1992:179-180.
7. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
8. Sterne JA, Davey Smith G. Sifting the evidence-what's wrong with significance tests? BMJ. 2001;322(7280):226-231.
9. Alderson P, Chalmers I. Survey of claims of no effect in abstracts of Cochrane reviews. BMJ. 2003;326(7387):475.
10. Alderson P. Absence of evidence is not evidence of absence. BMJ. 2004;328(7438):476-477.
11. Hauer E. The harm done by tests of significance. Accid Anal Prev. 2004;36(3):495-500.
12. Gigerenzer G. Mindless statistics. J Socio-Economics. 2004;33(5):587-606.
13. Lang JM, Rothman KJ, Cann CI. That confounded P-value [editorial]. Epidemiology. 1998;9(1):7-8.
14. Rothman KJ. A show of confidence. N Engl J Med. 1978;299(24):1362-1363.
15. Rothman KJ. Significance questing. Ann Intern Med. 1986;105(3):445-447.
16. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292(6522):746-750.
17. Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol. 1993;7(2):95-100.
18. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. 2nd ed. London, England: BMJ Publishing Group; 2000.
19. Altman D, Bland JM. Confidence intervals illuminate absence of evidence [letter]. BMJ. 2004;328(7446):1016-1017.
20. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987;125(5):761-768.
21. Cummings P. The relative merits of risk ratios and odds ratios. Arch Pediatr Adolesc Med. 2009;163(5):438-445.
22. Poole C. Confidence intervals exclude nothing. Am J Public Health. 1987;77(4):492-493.
23. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77(2):195-199.
24. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3(5):449-452.
25. Kraemer HC, Mintz J, Noda A, Tinklenberg J, Yesavage JA. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006;63(5):484-489.
26. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-206.
27. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19-24.
28. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ. 2002;324(7348):1271-1273.
29. Bacchetti P. Author's thoughts on power calculations [letter]. BMJ. 2002;325(7362):491.
30. Senn SJ. Power is indeed irrelevant in interpreting completed studies [letter]. BMJ. 2002;325(7375):1304.
31. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993;137(5):485-501.
32. Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med. 1999;130(12):995-1004.
33. Goodman SN. Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med. 1999;130(12):1005-1013.
34. Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman & Hall/CRC; 1997.
35. The value of P. Epidemiology. 2001;12(3):286.


© 2010 American Medical Association. All Rights Reserved.