

The clinician’s guide to p values, confidence intervals, and magnitude of effects

  • Mark R. Phillips   ORCID: orcid.org/0000-0003-0923-261X 1   na1 ,
  • Charles C. Wykoff 2 , 3 ,
  • Lehana Thabane   ORCID: orcid.org/0000-0003-0355-9734 1 , 4 ,
  • Mohit Bhandari   ORCID: orcid.org/0000-0001-9608-4808 1 , 5 &
  • Varun Chaudhary   ORCID: orcid.org/0000-0002-9988-4146 1 , 5

for the Retina Evidence Trials InterNational Alliance (R.E.T.I.N.A.) Study Group

Eye volume 36, pages 341–342 (2022)


  • Outcomes research

A Correction to this article was published on 19 January 2022


Introduction

There are numerous statistical and methodological considerations within every published study, and the ability of clinicians to appreciate the implications and limitations associated with these key concepts is critically important. These implications often have a direct impact on the applicability of study findings – which, in turn, often determines whether the results should lead to modification of practice patterns. Because it can be challenging and time-consuming for busy clinicians to break down the nuances of each study, herein we provide a brief summary of three important topics that every ophthalmologist should consider when interpreting evidence.

p -values: what they tell us and what they don’t

Perhaps the most universally recognized statistic is the p-value. Most individuals understand the notion that (usually) a p-value <0.05 signifies a statistically significant difference between the two groups being compared. While this understanding is widely shared, it is far more important to understand what a p-value does not tell us. Attempting to inform clinical practice patterns through interpretation of p-values alone is overly simplistic and fraught with potential for misleading conclusions. A p-value represents the probability that the observed result (the difference between the groups being compared)—or one more extreme—would occur by random chance, assuming that the null hypothesis (the scenario, counter to the study’s hypothesis, that there is no difference between the groups being compared) is true. For example, a p-value of 0.04 indicates that the observed difference between the groups would have a 4% probability of occurring by random chance if no true difference existed. When this probability is small, it becomes less likely that the null hypothesis is accurate—or, alternatively, more likely that a real difference between the groups exists [1]. Studies use a predefined threshold to determine when a p-value is sufficiently small to support the study hypothesis. This threshold is conventionally a p-value of 0.05; however, there are reasons and justifications for studies to use a different threshold if appropriate.
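To make the definition concrete, the following minimal sketch (not from the editorial; group means, standard deviations, and sample sizes are invented) uses SciPy's two-sample t-test to obtain a p-value for a hypothetical visual-acuity comparison.

```python
# Minimal sketch (hypothetical data, not from the editorial): the p-value of a
# two-sample comparison is the probability of a difference at least this
# extreme arising by chance alone if the null hypothesis were true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=62, scale=10, size=100)  # invented letter scores
control = rng.normal(loc=58, scale=10, size=100)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```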

What a p-value cannot tell us is the clinical relevance or importance of the observed treatment effects [1]. Specifically, a p-value does not provide details about the magnitude of effect [2, 3, 4]. Despite a significant p-value, it is quite possible for the difference between the groups to be small. This phenomenon is especially common with larger sample sizes, in which comparisons may yield statistically significant differences that are not clinically meaningful. For example, a study may find a statistically significant difference (p < 0.05) in visual acuity outcomes between two groups, while the difference between the groups may amount to one letter or less. While this may in fact be a statistically significant difference, it is likely not large enough to be meaningful to patients. Thus, p-values lack vital information on the magnitude of effects for the assessed outcomes [2, 3, 4].
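A small simulation can illustrate this point. The sketch below (all numbers are assumed for illustration) shows that with a very large sample, a true difference of only one letter reaches p < 0.05 even though it would fall well short of any plausible clinically meaningful threshold.

```python
# Illustrative sketch with invented numbers: a 1-letter true difference in
# visual acuity becomes "statistically significant" once the sample is large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20_000                                            # hypothetical very large trial
group_a = rng.normal(loc=60.0, scale=12.0, size=n)
group_b = rng.normal(loc=61.0, scale=12.0, size=n)    # true difference: 1 letter

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"mean difference = {group_b.mean() - group_a.mean():.2f} letters, "
      f"p = {p_value:.2e}")   # tiny p-value despite a trivially small effect
```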

Overcoming the limitations of interpreting p -values: magnitude of effect

To overcome this limitation, it is important to consider both (1) whether the p-value of a comparison is significant according to the pre-defined statistical plan, and (2) the magnitude of the treatment effects (commonly reported as an effect estimate with 95% confidence intervals) [5]. The magnitude of effect is most often represented as the mean difference between groups for continuous outcomes, such as visual acuity on the logMAR scale, and as the risk or odds ratio for dichotomous/binary outcomes, such as the occurrence of adverse events. These measures indicate the observed effect that was quantified by the study comparison. As suggested in the previous section, understanding the actual magnitude of the difference in the study comparison provides an understanding of the results that an isolated p-value cannot [4, 5]. Interpretation of study results should therefore shift from a binary judgement of significant versus not significant towards a more critical appraisal of the clinical relevance of the observed effect [1].
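As a rough illustration of reporting magnitude of effect rather than a p-value alone, the sketch below (hypothetical letter scores and event counts, not data from any trial) computes a mean difference with a 95% confidence interval for a continuous outcome and an odds ratio with a 95% confidence interval for a binary outcome.

```python
# Sketch with invented data: effect estimates with 95% confidence intervals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treat = rng.normal(loc=8.0, scale=15.0, size=150)   # change in letters (invented)
ctrl = rng.normal(loc=2.0, scale=15.0, size=150)

# Continuous outcome: mean difference and its 95% CI (normal approximation)
diff = treat.mean() - ctrl.mean()
se = np.sqrt(treat.var(ddof=1) / treat.size + ctrl.var(ddof=1) / ctrl.size)
ci = diff + np.array([-1, 1]) * stats.norm.ppf(0.975) * se
print(f"mean difference = {diff:.1f} letters (95% CI {ci[0]:.1f} to {ci[1]:.1f})")

# Binary outcome: odds ratio for a hypothetical adverse event (2x2 counts)
table = np.array([[12, 138],    # treatment: events, non-events
                  [25, 125]])   # control:   events, non-events
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
log_se = np.sqrt((1.0 / table).sum())
or_ci = np.exp(np.log(odds_ratio) + np.array([-1, 1]) * 1.96 * log_se)
print(f"odds ratio = {odds_ratio:.2f} (95% CI {or_ci[0]:.2f} to {or_ci[1]:.2f})")
```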

There are a number of important metrics, such as the Minimally Important Difference (MID), which help to determine whether a difference between groups is large enough to be clinically meaningful [6, 7]. When a clinician can identify (1) the magnitude of effect within a study and (2) the MID (the smallest change in the outcome that a patient would deem meaningful), they are far better placed to understand the effects of a treatment and to articulate its pros and cons to patients with reference to treatment effects that can be considered clinically valuable.

The role of confidence intervals

Confidence intervals are estimates that provide a lower and upper bound on the estimate of the magnitude of effect. By convention, 95% confidence intervals are most typically reported. These intervals represent the range within which we can, with 95% confidence, expect the true treatment effect to fall. For example, a mean difference in visual acuity of 8 (95% confidence interval: 6 to 10) suggests that the best estimate of the difference between the two study groups is 8 letters, and that we have 95% certainty that the true value lies between 6 and 10 letters. When interpreting this clinically, one can consider the different clinical scenarios at each end of the confidence interval: if the patient’s outcome were the most conservative, in this case an improvement of 6 letters, would its importance to the patient differ from the most optimistic outcome, 10 letters in this example? When the clinical value of the treatment effect does not change when considering the lower versus the upper confidence limit, there is enhanced certainty that the treatment effect will be meaningful to the patient [4, 5]. In contrast, if the clinical merits of a treatment appear different at the lower versus the upper confidence limit, one may be more cautious about the benefits to be anticipated with treatment [4, 5].
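This style of reasoning can be written out explicitly. The sketch below uses the example above; the MID of 5 letters is an assumption chosen purely for illustration, not a published value.

```python
# Sketch: checking both confidence limits against a minimally important
# difference (MID). The MID of 5 letters is a hypothetical value.
mean_difference = 8.0              # letters, as in the example above
ci_lower, ci_upper = 6.0, 10.0     # 95% confidence interval
mid = 5.0                          # hypothetical MID

if ci_lower >= mid:
    print("Even the most conservative plausible effect meets the MID.")
elif ci_upper < mid:
    print("Even the most optimistic plausible effect falls short of the MID.")
else:
    print("The interval spans the MID; the clinical value remains uncertain.")
```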

There are a number of important details for clinicians to consider when interpreting evidence. Through this editorial, we hope to provide practical insights into fundamental methodological principles that can help guide clinical decision making. P-values are only one component to consider when interpreting study results; a much deeper appreciation of the results is gained when the treatment effects and their associated confidence intervals are also taken into consideration.

Change history

19 January 2022

A Correction to this paper has been published: https://doi.org/10.1038/s41433-021-01914-2

Li G, Walter SD, Thabane L. Shifting the focus away from binary thinking of statistical significance and towards education for key stakeholders: revisiting the debate on whether it’s time to de-emphasize or get rid of statistical significance. J Clin Epidemiol. 2021;137:104–12. https://doi.org/10.1016/j.jclinepi.2021.03.033


Gagnier JJ, Morgenstern H. Misconceptions, misuses, and misinterpretations of p values and significance testing. J Bone Joint Surg Am. 2017;99:1598–603. https://doi.org/10.2106/JBJS.16.01314

Goodman SN. Toward evidence-based medical statistics. 1: the p value fallacy. Ann Intern Med. 1999;130:995–1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008


Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. https://doi.org/10.1007/s10654-016-0149-3


Phillips M. Letter to the editor: editorial: threshold p values in orthopaedic research-we know the problem. What is the solution? Clin Orthop. 2019;477:1756–8. https://doi.org/10.1097/CORR.0000000000000827

Devji T, Carrasco-Labra A, Qasim A, Phillips MR, Johnston BC, Devasenapathy N, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714. https://doi.org/10.1136/bmj.m1714

Carrasco-Labra A, Devji T, Qasim A, Phillips MR, Wang Y, Johnston BC, et al. Minimal important difference estimates for patient-reported outcomes: a systematic survey. J Clin Epidemiol. 2020;0. https://doi.org/10.1016/j.jclinepi.2020.11.024


Author information

Authors and Affiliations

Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada

Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary

Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA

Charles C. Wykoff

Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA

Biostatistics Unit, St. Joseph’s Healthcare-Hamilton, Hamilton, ON, Canada

Lehana Thabane

Department of Surgery, McMaster University, Hamilton, ON, Canada

Mohit Bhandari & Varun Chaudhary

NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK

Sobha Sivaprasad

Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Peter Kaiser

Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA

David Sarraf

Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA

Sophie J. Bakri

The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA

Sunir J. Garg

Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Rishi P. Singh

Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA

Department of Ophthalmology, University of Bonn, Bonn, Germany

Frank G. Holz

Singapore Eye Research Institute, Singapore, Singapore

Tien Y. Wong

Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore

Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia

Robyn H. Guymer

Department of Surgery (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia


  • Varun Chaudhary
  • Mohit Bhandari
  • Charles C. Wykoff
  • Sobha Sivaprasad
  • Lehana Thabane
  • Peter Kaiser
  • David Sarraf
  • Sophie J. Bakri
  • Sunir J. Garg
  • Rishi P. Singh
  • Frank G. Holz
  • Tien Y. Wong
  • Robyn H. Guymer

Contributions

MRP was responsible for conception of idea, writing of manuscript and review of manuscript. VC was responsible for conception of idea, writing of manuscript and review of manuscript. MB was responsible for conception of idea, writing of manuscript and review of manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript.

Corresponding author

Correspondence to Varun Chaudhary .

Ethics declarations

Competing interests.

MRP: Nothing to disclose. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Genentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB—unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed – unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis – unrelated to this study.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: In this article the middle initial in author name Sophie J. Bakri was missing.


About this article

Cite this article.

Phillips, M.R., Wykoff, C.C., Thabane, L. et al. The clinician’s guide to p values, confidence intervals, and magnitude of effects. Eye 36 , 341–342 (2022). https://doi.org/10.1038/s41433-021-01863-w

Download citation

Received : 11 November 2021

Revised : 12 November 2021

Accepted : 15 November 2021

Published : 26 November 2021

Issue Date : February 2022

DOI : https://doi.org/10.1038/s41433-021-01863-w





Best Practices in Science

The Null Hypothesis


The null hypothesis, as described by Anthony Greenwald in ‘Consequences of Prejudice Against the Null Hypothesis,’ is the hypothesis of no difference between treatment effects or of no association between variables. Unfortunately, in academia the ‘null’ is often associated with ‘insignificant,’ ‘no value,’ or ‘invalid.’ This association stems from the bias of journals against papers that accept the null hypothesis. The prejudice of journals toward accepting only papers that show ‘significant’ results (that is, papers rejecting the null hypothesis) puts added pressure on those working in academia, especially because their relevance and salaries often depend on publications. This pressure may also be correlated with increased scientific misconduct, which you can read more about elsewhere on this website. If you would like to read publications, journal articles, and blogs about the null hypothesis, views on rejecting and accepting the null, and journal bias against the null hypothesis, please see the resources linked below.

Most scientific journals are prejudiced against papers that demonstrate support for null hypotheses and are unlikely to publish such papers and articles. This phenomenon leads to selective publishing of papers and ensures that the portion of articles that do get published is unrepresentative of the total research in the field.

Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. The journal of wildlife management , 912-923.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society . Series B (Methodological), 289-300.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American statistical Association , 82 (397), 112-122.

Blackwelder, W. C. (1982). “Proving the null hypothesis” in clinical trials. Controlled clinical trials , 3 (4), 345-353.

Dirnagl, U. (2010). Fighting publication bias: introducing the Negative Results section. Journal of cerebral blood flow and metabolism: official journal of the International Society of Cerebral Blood Flow and Metabolism , 30 (7), 1263.

Dickersin, K., Chan, S. S., Chalmers, T. C., Sacks, H. S., & Smith, H. (1987). Publication bias and clinical trials. Controlled clinical trials , 8 (4), 343-353.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association , 99 (465), 96-104.

Fanelli, D. (2010). Do pressures to publish increase scientists’ bias? An empirical support from US States Data. PloS one , 5 (4), e10271.

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics , 90 (3), 891-904.

Greenwald, A. G. (1975). Consequences of Prejudice Against the Null Hypothesis. Psychological Bulletin , 82 (1).

Hubbard, R., & Armstrong, J. S. (1997). Publication bias against null results. Psychological Reports , 80 (1), 337-338.

I’ve Got Your Impact Factor Right Here (Science, February 24, 2012)

Johnson, R. T., & Dickersin, K. (2007). Publication bias against negative results from clinical trials: three of the seven deadly sins. Nature Clinical Practice Neurology , 3 (11), 590-591.

Keep negativity out of politics. We need more of it in journals (STAT, October 14, 2016)

Knight, J. (2003). Negative results: Null and void. Nature , 422 (6932), 554-555.

Koren, G., & Klein, N. (1991). Bias against negative studies in newspaper reports of medical research. Jama , 266 (13), 1824-1826.

Koren, G., Shear, H., Graham, K., & Einarson, T. (1989). Bias against the null hypothesis: the reproductive hazards of cocaine. The Lancet , 334 (8677), 1440-1442.

Krantz, D. (2012).  The Null Hypothesis Testing Controversy in Psychology. Journal of American Statistical Association .

Lash, T. (2017). The Harm Done to Reproducibility by the Culture of Null Hypothesis Significance Testing. American Journal of Epidemiology .

Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive therapy and research , 1 (2), 161-175.

Matosin, N., Frank, E., Engel, M., Lum, J. S., & Newell, K. A. (2014). Negativity towards negative results: a discussion of the disconnect between scientific worth and scientific culture.

Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological methods , 5 (2), 241.

No result is worthless: the value of negative results in science (BioMed Central, October 10, 2012)

Negative Results: The Dark Matter of Research (American Journal Experts)

Neil Malhotra: Why No News Is Still Important News in Research (Stanford Graduate School of Business, October 27, 2014)

Null Hypothesis Definition and Example (Statistics How To, November 5, 2012)

Null Hypothesis Glossary Definition (Statlect Digital Textbook)

Opinion: Publish Negative Results (The Scientist, January 15, 2013)

Positives in negative results: when finding ‘nothing’ means something (The Conversation, September 24, 2014)

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic bulletin & review , 16 (2), 225-237.

Unknown Unknowns: The War on Null and Negative Results (social science space, September 19, 2014)

Valuing Null and Negative Results in Scientific Publishing (Scholastica, November 4, 2015)

Vasilev, M. R. (2013). Negative results in European psychology journals. Europe’s Journal of Psychology , 9 (4), 717-730

Where have all the negative results gone? (bioethics.net, December 4, 2013)

Where to publish negative results (BitesizeBio, November 27, 2013)

Why it’s time to publish research “failures” (Elsevier, May 5, 2015)

Woolson, R. F., & Kleinman, J. C. (1989). Perspectives on statistical significance testing. Annual review of public health , 10 (1), 423-440.

Would you publish your negative results? If no, why? (ResearchGate, October 26, 2012)

  • Correspondence
  • Open access
  • Published: 04 March 2014

The thresholds for statistical and clinical significance – a five-step procedure for evaluation of intervention effects in randomised clinical trials

  • Janus Christian Jakobsen 1 , 2 ,
  • Christian Gluud 1 ,
  • Per Winkel 1 ,
  • Theis Lange 3 &
  • Jørn Wetterslev 1  

BMC Medical Research Methodology volume 14, Article number: 34 (2014)


Thresholds for statistical significance are insufficiently demonstrated by 95% confidence intervals or P -values when assessing results from randomised clinical trials. First, a P -value only shows the probability of getting a result assuming that the null hypothesis is true and does not reflect the probability of getting a result assuming an alternative hypothesis to the null hypothesis is true. Second, a confidence interval or a P -value showing significance may be caused by multiplicity. Third, statistical significance does not necessarily result in clinical significance. Therefore, assessment of intervention effects in randomised clinical trials deserves more rigour in order to become more valid.

Several methodologies for assessing the statistical and clinical significance of intervention effects in randomised clinical trials were considered. Balancing simplicity and comprehensiveness, a simple five-step procedure was developed.

For a more valid assessment of results from a randomised clinical trial we propose the following five steps: (1) report the confidence intervals and the exact P-values; (2) report Bayes factor for the primary outcome, being the ratio of the probability that a given trial result is compatible with a ‘null’ effect (corresponding to the P-value) divided by the probability that the trial result is compatible with the intervention effect hypothesised in the sample size calculation; (3) adjust the confidence intervals and the statistical significance threshold if the trial is stopped early or if interim analyses have been conducted; (4) adjust the confidence intervals and the P-values for multiplicity due to the number of outcome comparisons; and (5) assess the clinical significance of the trial results.

Conclusions

If the proposed five-step procedure is followed, this may increase the validity of assessments of intervention effects in randomised clinical trials.


Clinical experience and observational studies cannot and should not be used to validate intervention effects [ 1 ]. The randomised clinical superiority trial remains the mainstay of modern clinical intervention research and is needed for a valid assessment of possible causality between interventions and outcomes [ 1 ].

Most commonly, the statistical analyses in randomised clinical trials are performed under the frequentist paradigm. In this approach, a significant difference in effect is declared when a value of a test statistic exceeds a specified threshold showing that it is unlikely that the trial results are produced by zero difference in effect between the compared interventions, i.e., that the null hypothesis is true [ 2 ]. A P -value less than 5% has been the most commonly used threshold for statistical significance in clinical intervention research since Fisher warned against exactly that in 1955 [ 3 – 5 ]. P -values are easily calculated but are often misinterpreted [ 6 , 7 ] and misused [ 8 – 10 ].

In the following we describe the methodological limitations of focusing too much on confidence intervals and P -values, and suggest a five-step procedure for a more valid assessment of results of intervention effects in randomised clinical superiority trials [ 11 ]. Our recommendations do not solve all problems of interpreting results from randomised clinical trials, but we aim to present a valid, practical, relatively simple, and yet comprehensive assessment tool to be used by trial investigators and clinical research consumers. The five following sections of the manuscript will correspond to each step of the proposed five-point assessment.

Methods and results

The confidence interval and the P-value

Due to stochastic variation (‘play of chance’) in biomedical data, statistical analyses are needed to clarify if the results demonstrate a genuine difference in effect between the compared interventions in a randomised clinical trial [ 1 ]. The P -value describes the probability of obtaining an observed or larger difference in intervention effect purely by ‘play of chance’ assuming that there is no intervention effect (i.e., assuming that the ‘null hypothesis’ is true) (Additional file 1 : Table S1). Trialists can and should report the calculated confidence intervals and the exact P- values, but their exclusive relation to the null hypothesis should be kept in mind.

Confidence intervals not containing 1.0 for binary outcomes (or for hazard ratios for survival data) or 0.0 for continuous outcomes are, like the corresponding P-value, often used as thresholds for statistical significance. Reporting confidence intervals has rightfully been claimed to be a more appropriate and understandable demonstration of the statistical uncertainty [12, 13]. However, confidence intervals do not essentially provide more information than is implicitly given by the estimated effect and the P-value. The confidence interval and the observed effect size can be derived from the P-value — and vice versa [14, 15]. We believe it is informative to report both the confidence interval and the corresponding exact P-value, because the former explicitly demonstrates the range of uncertainty of the intervention effect and the latter tells how likely the results are assuming the null hypothesis is true.
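This interconvertibility can be made explicit. The sketch below (a normal-approximation calculation with invented numbers, not the authors' code) recovers a 95% confidence interval from an effect estimate and its two-sided P-value, and recovers the P-value back from the interval.

```python
# Sketch (normal approximation, invented numbers): converting between a
# two-sided P-value and a 95% confidence interval for a given effect estimate.
import numpy as np
from scipy import stats

estimate = 4.0            # hypothetical mean difference
p_value = 0.02            # two-sided P-value reported for it

z = stats.norm.isf(p_value / 2)            # |z| implied by the P-value
se = abs(estimate) / z                     # implied standard error
ci = estimate + np.array([-1, 1]) * stats.norm.isf(0.025) * se
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")

# ...and back: from the interval to the P-value
se_back = (ci[1] - ci[0]) / (2 * stats.norm.isf(0.025))
p_back = 2 * stats.norm.sf(abs(estimate) / se_back)
print(f"recovered P = {p_back:.3f}")
```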

Sample size estimation, the alternative hypothesis to the null hypothesis, and Bayes factor

Before conducting a randomised clinical trial one should estimate the required sample size based on the primary outcome [ 16 – 18 ]. A sample size calculation estimates the number of trial participants necessary to demonstrate or discard a specific a priori anticipated intervention effect with specified error probabilities [ 16 ]. In order to calculate a sample size relating to one specified primary outcome, it is necessary:

To define an anticipated difference in intervention effect (i.e., a hypothesis alternative to the null hypothesis) between the compared intervention groups. This intervention effect could, e.g., be a mean difference, an odds ratio, or a hazard ratio [ 16 ]. This hypothesised difference in effect should be based on the most realistic intervention effect as suggested by a meta-analysis of prior evidence with low risks of bias (Additional file 1 : Table S1) [ 19 , 20 ], but may also be defined as a ‘minimal relevant clinical difference’ (see Statistical significance and clinical significance).

To estimate the variability of the anticipated difference in intervention effect (e.g., a standard deviation of a mean difference or a proportion of participants with an outcome of interest in the control intervention group).

To decide on an acceptable risk of falsely rejecting the null hypothesis (alpha or type I error) (most investigators choose 5%, see ‘Discussion’) and an acceptable risk of falsely accepting the null hypothesis (beta or type II error) (most investigators choose 10% or 20%).

The lower the anticipated intervention effect is and the lower the above acceptable risks are, the larger the sample size becomes.
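As a concrete illustration of these ingredients, the following sketch (using statsmodels; the 30% control event proportion, the anticipated reduction to 20%, and the 5%/10% error risks are assumptions chosen for illustration) estimates the number of participants per group for a binary primary outcome.

```python
# Sketch of a sample size estimate for a binary primary outcome (statsmodels).
# Anticipated proportions and error risks below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.30, 0.20)   # anticipated difference in effect
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,              # acceptable risk of type I error
    power=0.90,              # 1 - acceptable risk of type II error (10%)
    alternative="two-sided",
)
print(f"approximately {n_per_group:.0f} participants per group")
```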

When the estimated sample size has been obtained, the null hypothesis can be tested and rejected if P is below 5%. However, a low exact P -value may be misleading if there, at the same time, is a low probability of the trial results being compatible with the intervention effect hypothesised in the sample size calculation. To compensate for this deficiency of the P- value it is helpful to calculate Bayes factor [ 21 , 22 ], which is the ratio between the probability of getting the result assuming the null hypothesis (H 0 ) is true divided by the probability of getting the result assuming the alternative hypothesis (H A ) is true [ 21 ]. In the following we have chosen to quantify the alternative hypothesis (H A ) (Additional file 1 : Table S1) as the intervention effect hypothesised in the sample size calculation, but Bayes factor can (and in some instances should) be defined differently. For an in-depth discussion of Bayesian methods and principles see reference [ 21 ].

Figure  1 depicts Bayes factor as a function of the observed effect size, where the observed effect size is expressed as fractions of ‘1.0’. When Bayes factor is 1.0, the likelihoods of the null hypothesis and the alternative hypothesis are the same, i.e., the observed effect size is exactly half way between null effect and the hypothesised effect size. When Bayes factor is less than 1.0, the trial results are more compatible with the alternative hypothesis than the null hypothesis. When Bayes factor is larger than 1.0, the trial results are more compatible with the null hypothesis than the alternative hypothesis.

Figure 1. How Bayes factor changes with different observed effects. The red left vertical line represents the null hypothesis (an effect of null); the right green vertical line represents an alternative hypothesis to the null hypothesis with an effect of 1.0. The black curve shows that Bayes factor will be 1.0 when the observed effect size is exactly half of the effect size of the alternative hypothesis, and that Bayes factor will decrease with increasing observed effect sizes.

Confidence intervals not containing 1.0 for binary outcomes (and hazard ratios) or 0.0 for continuous outcomes and low exact P -values do not necessarily correspond to a low Bayes factor — and confidence intervals and P -values may in some circumstances misleadingly indicate evidence for an intervention effect [ 19 , 21 ]. A low Bayes factor (e.g., less than 0.1) together with a low P value (e.g., less than 0.05) will correspond to a high probability of an intervention effect similar to or even greater than the hypothesised intervention effect used in the sample size calculation (Figure  1 ) (Additional file 1 : Table S1).
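Our reading of this definition can be written down compactly. The sketch below (a normal-approximation reconstruction, not the authors' code; the standard error of 0.3 is an arbitrary illustrative value) computes Bayes factor as the likelihood of the observed estimate under the null hypothesis divided by its likelihood under the hypothesised effect, reproducing the behaviour described for Figure 1.

```python
# Sketch (our reconstruction under a normal approximation, not the authors'
# code): Bayes factor = P(result | H0) / P(result | HA).
from scipy import stats

def bayes_factor(observed, se, hypothesised):
    """Likelihood of the observed effect under H0 divided by that under HA."""
    return (stats.norm.pdf(observed, loc=0.0, scale=se)
            / stats.norm.pdf(observed, loc=hypothesised, scale=se))

# With a hypothesised effect of 1.0 (and an arbitrary SE of 0.3), BF is 1.0
# when the observed effect is half the hypothesised effect, and it falls as
# the observed effect grows, as described for Figure 1.
for observed in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(observed, round(bayes_factor(observed, se=0.3, hypothesised=1.0), 3))
```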

The intervention effect hypothesised in the sample size calculation should be based on what prior evidence indicates, provided prior evidence exists. By calculating Bayes factor as above, new trial results will be related to former evidence. If a new trial result demonstrates an intervention effect closer to zero than prior evidence indicates, a high Bayes factor will demonstrate that there is a low probability that the intervention effect indicated by prior evidence is compatible with the new trial result. On the other hand, the validity of a trial result will increase if a new trial result with a low Bayes factor shows intervention effects similar to (or larger than) what prior evidence has indicated, i.e., if the new trial result is compatible with the intervention effect indicated by prior evidence.

If no former trials have been conducted, an anticipated intervention effect cannot be estimated based on empirical high-quality data. A realistic anticipated intervention effect may still be chosen based on knowledge about other analogous interventions’ effects on the same disease or condition [23, 24], but the uncertainty related to the choice of an anticipated intervention effect prior to trial conduct and the subsequent estimation of a sufficient sample size remains a problem. The possibility to adjust the hypothesised intervention effect to obtain low values of Bayes factor makes Bayes factor sensitive to biased post hoc analyses. When Bayes factor is to be calculated, it is therefore essential to define the intervention effect hypothesised in the sample size calculation a priori, so that biased post hoc analyses can be avoided.

Trialists might be tempted to perform a sample size calculation based on unrealistically large anticipated intervention effects in order to reduce the necessary number of participants in a trial (relatively few patients are needed to demonstrate or discard large intervention effects) [25, 26]. However, sample size estimations based on unrealistically large anticipated intervention effects increase the risk of erroneous estimation of intervention effects — as trials with too small sample sizes (relative to the actual effect) have been shown to have an increased risk of either overestimating or underestimating both effect size and variability [27]. This also means that a Bayes factor calculated before a realistic sample size has been reached will be relatively unreliable, because the observed effect size used to calculate Bayes factor might be erroneous (Additional file 1: Table S1). Bayes factor assessed after the sample size has been reached will increase if the trial results show an intervention effect smaller than the intervention effect hypothesised in the sample size calculation. The use of Bayes factor might, therefore, be an incentive for a more realistic and smaller estimation of anticipated intervention effects, leading to more trials with sufficient power and fewer trials either overestimating or underestimating intervention effects. However, if trial results confirm unrealistically large anticipated intervention effects by ‘play of chance’ there is evidently a great risk of misleading trial results. Intervention effects hypothesised in sample size calculations should therefore preferably be based on results from systematic reviews of randomised clinical trials with low risk of bias, which to some extent will ensure that realistic hypothesised intervention effects are used in the sample size calculation. If the intervention effect hypothesised in the sample size calculation is not based on results from systematic reviews of randomised clinical trials, then we recommend calculating an additional Bayes factor using a smaller (‘sceptical’) hypothesised intervention effect, e.g., a relative risk halfway between the intervention effect hypothesised in the sample size calculation and 1.0.

Adaptive trial design has been proposed to account for the uncertainty of estimating a sample size [28]. An adaptive trial design enables sample size re-estimation at interim-analysis time points during the trial [29]. At these time points the sample size can either be increased or decreased. The adaptive trial design is complex and is probably less efficient than a sequential design with a predefined realistic sample size [29]. Furthermore, to implement an adaptive design it should be possible, practically and financially, to expand the originally estimated sample size, which rarely occurs in trials not financed by industry.

Assurance is another valid method that has been proposed to estimate a sample size to achieve a desired power (assurance), rather than to achieve a desired power conditional on an assumed treatment effect [ 30 ].

Adjustment of the confidence interval and the P -value when a trial is stopped before reaching the planned sample size

The majority of randomised clinical trials have difficulties in obtaining the stipulated sample size [10, 31, 32]. A trial that is stopped prematurely with an effect that is significant (e.g., P < 5%) may reach this significance level because the estimated difference in effect between the compared trial interventions is larger than anticipated or because the estimated variance is lower than anticipated — or both (see Section 2 about sample size estimation) [27, 29, 33]. Deviations of intervention effects far from the anticipated values should a priori be regarded as unlikely, and this is one reason for using a lower statistical threshold to stop a trial before the planned sample size has been reached [33]. If, e.g., a sample size calculation has shown that a total of 500 patients are needed in a trial and the trial is stopped after only 250 participants are included, it might be necessary to use 1‰ instead of 5% as the statistical threshold for significance in order to avoid undue declarations of statistical significance due to early random high intervention effects or low variance [34]. As mentioned, trials with too small sample sizes often show intervention effect sizes far from the effect sizes shown in larger trials and systematic reviews with meta-analyses [27, 35]. As pointed out by Lindley, the apparent paradox of small trials seemingly contributing evidence of large intervention effects, while large trials tend to rule out smaller intervention effects and thereby also larger intervention effects, is bound to confuse the average clinical researcher and reader [36]. If trialists are allowed to assess statistical significance continuously during a trial (i.e., to conduct interim analyses) and stop at different time points without adjusting the level of statistical significance, this will inevitably increase the risk of falsely rejecting the null hypothesis [37]. This is due to sparse data and to repetitive testing on accumulating data, both of which increase the risk of random errors. Therefore, the threshold of statistical significance should be related to the fraction of the pre-planned number of participants randomised and the number of tests conducted (see also Problems with multiplicity due to multiple outcome comparisons) [38-40] — and a number of different methods have been developed for this purpose [41-44]. One example is the O’Brien-Fleming boundaries [41, 42] (and the corresponding adjusted thresholds of the confidence intervals and the P-values), which show the adjusted thresholds for significance if a sample size has not been reached [41, 45].
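A small simulation illustrates why unadjusted interim analyses are problematic. In the sketch below (an assumed set-up: both groups drawn from the same distribution, four equally spaced looks, no threshold adjustment), the chance of at least one falsely "significant" result rises well above the nominal 5%.

```python
# Sketch (assumed simulation set-up): repeated unadjusted testing on
# accumulating data inflates the risk of falsely rejecting a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, looks = 2000, (125, 250, 375, 500)   # four interim looks
false_positives = 0
for _ in range(n_trials):
    a = rng.normal(size=looks[-1])   # both groups share the same distribution,
    b = rng.normal(size=looks[-1])   # i.e., the null hypothesis is true
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
        false_positives += 1
print(f"type I error with 4 unadjusted looks ~ {false_positives / n_trials:.1%}")
```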

Any outcome should only be assessed using the thresholds used in the sample size calculation if there are sufficient data, i.e., if a sample size based on appropriate acceptable risks of type I and type II errors has been reached. It is, therefore, necessary to perform power calculations for all secondary outcomes (based on an anticipated intervention effect, a variance, and a risk of type I error) before randomisation begins. If an analysis of a secondary outcome has a power of less than 80%, then either the secondary outcome should be classified as an exploratory outcome or the confidence interval and P-value thresholds for significance should be adjusted, just as the thresholds are adjusted if a sample size has not been reached.

In conclusion, it is imperative to estimate a sufficient sample size before a trial is conducted, and proper adjustments of the thresholds for significance should be performed if a trial is stopped early or if interim analyses are conducted [ 17 , 34 ].

Problems with multiplicity due to multiple outcome comparisons

If a randomised clinical trial assesses more than one outcome, compares more than two intervention groups, or assesses an outcome at more than one time point, then the overall risk of falsely rejecting the null hypothesis for at least one of the outcomes (e.g., a family-wise error rate intended to be less than 5%) may increase with the number of outcome comparisons [39]. Problems with multiplicity have major implications for the interpretation of the confidence interval and the P-value, and this is one reason why it should be mandatory to report a predefined outcome hierarchy, including a clear definition of a primary outcome, before conducting a randomised clinical trial [17, 40, 46]. The conclusion about trial intervention effects should always be related to the result on the primary outcome (or outcomes), limiting the risk of falsely declaring a trial intervention effective. The outcome hierarchy and a specific, detailed description of every other relevant aspect of the trial methodology should be described in a protocol, which should be registered (e.g., at http://www.clinicaltrials.gov ) and published in a journal, preferably before randomisation begins [17, 40, 46].

How adjustment for multiplicity is done should depend on the design of the trial, i.e., the chosen outcomes and their relative importance, etc. — and different statistical methods have been proposed to adjust the observed confidence intervals and P-values to obtain strong control [47, 48] of this risk of type 1 error when multiple outcome comparisons are used. Under weak control, the type 1 error rate is controlled only under the global null hypothesis that all null hypotheses are true. Under strong control, which should be required in a clinical trial, the type 1 error rate is controlled under any partial configuration of true and false null hypotheses [47, 48]. Most methods (see paragraph below) have focused on threshold adjustments and adjustments of the P-value, but adjusted confidence intervals can often be calculated based on an adjusted P-value and an effect estimate, just as adjusted P-values can often be calculated based on adjusted confidence intervals and an effect estimate [14, 49].

Adjustments of the P-value due to multiplicity can be obtained using Bonferroni adjustment. This simple method multiplies the P-value by the number of outcome comparisons when only one of the chosen outcome comparisons must be significant in order to reject the overall null hypothesis, i.e., to declare that the trial intervention is effective [50]. The Bonferroni procedure tends to be rather conservative if the number of tests is large or if the outcomes are positively correlated. As most outcomes are dependent (e.g., incidence of cancer mortality and overall mortality in the same sample of participants are evidently positively correlated outcomes), Bonferroni adjustment is obviously too conservative a method to account for multiple testing, and corresponding methods that are more powerful are available [51]. Hommel’s method deals with all of the chosen outcomes as a group, using a data-driven adjustment of the P-values [52]. An alternative method (the fixed-sequence procedure) is to specify the sequence of the hypothesis testing (primary outcome, first secondary, second secondary, etc.) [53]. Each test is then done at the chosen level of significance in the specified order (here both the confidence interval and the P-value can be used to demonstrate the threshold), but as soon as a test is non-significant the remaining null hypotheses are accepted. A fourth approach is the so-called ‘fallback procedure’, where the fixed hypothesis-testing sequence is also used [54]. However, if a test is non-significant using the fallback procedure, the procedure does not stop; instead, the next hypothesis is tested at a reduced threshold for significance. This procedure also allows one to weight the hypotheses according to their importance and likelihood of being rejected. Other more complex methods taking correlation of the P-values into account are also available [55, 56]. It might not be necessary to include P-value adjustments for outcomes pre-specified as exploratory or hypothesis generating — but such P-values must always be interpreted conservatively.
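For concreteness, the following sketch (five invented P-values) applies three of the adjustments mentioned above using statsmodels' multipletests; which method is appropriate depends on the trial design and outcome hierarchy.

```python
# Sketch with invented P-values: common multiplicity adjustments (statsmodels).
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.030, 0.041, 0.200, 0.650]   # hypothetical outcome comparisons
for method in ("bonferroni", "holm", "hommel"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method:10s} adjusted p = {[round(p, 3) for p in p_adjusted]}, "
          f"reject = {list(reject)}")
```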

Analysing results from interim analyses, it may still be unknown how stable a low Bayes factor is, i.e., how often a Bayes factor once it is low will increase after additional patients have been randomised and change from below to above a certain threshold (e.g., 0.1). Full Bayesian statistics may be able to account for problems of multiplicity due to interim analyses, multiple outcomes, or comparisons of the same outcome at multiple times [ 57 – 59 ]. However, this may imply integration of fairly complicated models and software in the final analyses of the trial results [ 57 – 59 ].

Statistical significance and clinical significance

When surrogate outcomes or continuous outcomes are used to assess intervention effects, it is often unclear whether a given statistically significant effect has any patient-relevant clinical significance. Moreover, if a large number of trial participants are assessed, small and clinically irrelevant intervention effects may achieve statistical significance, leading to rejection of the null hypothesis [60, 61]. Statins have, e.g., been widely accepted as evidence-based treatment for high cholesterol levels in the blood [62], but it has recently been shown that decades of statin intake may only prolong life by an average of a few months [63]. For clinically relevant outcomes such as mortality, it is difficult to delineate a ‘minimal relevant clinical difference’ (Section 2). Any prevention of patient-important outcomes, however small, may seem relevant. Nevertheless, the significance of the clinical benefit of statins may be questioned when adverse effects and other costs of the statins are taken into consideration [63, 64].

In spite of a statistically significant effect with even very low P-values and correspondingly narrow confidence intervals, clinical significance can often be questioned. Relating trial results to the ‘minimal relevant clinical difference’ used to calculate the predefined sample size, as well as calculating Bayes factor based on this ‘minimal relevant clinical difference’, provides indications about the clinical significance of intervention effects (see Sample size estimation, the alternative hypothesis to the null hypothesis, and Bayes factor). However, to assess the clinical significance of intervention effects it is important to perform a thorough overall assessment of the balance between beneficial and harmful effects [65, 66]. Even rare serious adverse effects may rule out the rational use of an otherwise beneficial intervention [67].

It has been suggested that the ‘minimal relevant clinical difference’ should be defined as what patients perceive as important [69]. However, patients tend to regard even the smallest effect sizes as clinically important [70]. We therefore suggest that clinical researchers, in close cooperation with patients and relatives, must somehow agree on the quantification of the ‘minimal relevant clinical differences’ as well as on the relevant outcomes to be assessed. The latter work is dealt with by research groups within The Cochrane Collaboration, The James Lind Alliance, and the COMET Initiative [14, 68, 71-73].

Ideally, the ‘threshold’ effect size delimiting clinical significance from lack of clinical significance should, like the rest of the trial methodology, be predefined [68]. To avoid erroneous interpretations, clinical significance should only be assessed if statistical significance and a Bayes factor of less than 0.1 have been obtained.

The five-step procedure described aims to improve the validity of results from randomised clinical trials. The five-step procedure has the strength that it is based on well-established methodology and provides a ratio of the probability that a trial result is compatible with the null hypothesis divided by the probability that the result is compatible with the intervention effect hypothesised in the sample size calculation. Our procedure adjusts for problems with multiplicity and also forces investigators and consumers of clinical research to judge clinical significance. A potential drawback of Bayesian statistical analyses is that it can be difficult to verify modelling assumptions, e.g., whether the assumed distributions in the analysis are appropriate [74]. A strength of our simplified approach is that if the assumptions behind the initial analysis methods (e.g., logistic regression or survival analysis) are fulfilled, then our five-point assessment can be used validly without further testing.

The five-step procedure has limitations. First, we have provided our recommendations for understanding the results of a single randomised clinical trial in the light of usually sparse prior evidence. It has been shown that it is often unwise to base diagnostic, prognostic, preventive, or therapeutic interventions on data from one or few trials [ 1 , 9 , 10 , 26 , 75 ], and our recommendations do not in any way change this. Our aim is to introduce a simple assessment procedure which we believe will improve the validity of the assessment of results from a single randomised clinical trial, but our procedure does not solve all problems. Clinical decision-making should primarily be based on systematic reviews of all randomised clinical trials with low risk of bias including meta-analyses, trial sequential analyses, and obtained consensus of clinical significance [ 9 , 45 , 68 , 76 – 78 ]. Also in a scenario of a systematic review, calculation of Bayes factor and assessment of clinical significance may become pivotal. We will address these issues in a forthcoming article.

Second, our recommended methodology as well as our definition of Bayes factor is simplified. Alternatively, Bayes factors could be based on the ratio of the probability that the trial result is compatible with the null hypothesis divided by the probability that the result is compatible with a range of realistic alternative hypotheses. A full Bayesian analysis could also be used to analyse trial results, which focuses on the calculation of the posterior odds that an alternative hypothesis to the null hypothesis is true, given the observed data and any available prior information [2, 74, 79]. There are a number of major methodological advantages of using full Bayesian statistics compared to frequentist statistics [19, 80], and results from a full Bayesian analysis might in some circumstances reliably show a low posterior probability for the alternative hypothesis while a low Bayes factor wrongly indicates the opposite. However, Bayesian statistical analyses increase the methodological complexity [19, 80]; can make research results sensitive to apparently innocuous assumptions, which hinders taking possible trial results into account [80, 81]; and will, in essence, require a methodological paradigm shift, including use of detailed Bayesian statistical analysis plans and Bayesian statistical software such as WinBUGS [82].

Third, it is necessary to define some kind of alternative hypothesis to the null hypothesis when calculating the Bayes factor. The definition of the alternative hypothesis often involves an element of subjectivity, and it is for this reason that many trialists do not use the Bayesian approach [2, 79]. It has been suggested that the alternative hypothesis might be defined using ‘uniformly most powerful Bayesian tests’, where the alternative hypothesis is defined as an average value of any hypothetical intervention effect resulting in a Bayes factor below a given threshold [2, 79]. This procedure is appealing because no subjective assumptions have to be made about the alternative hypothesis — but it is a problem that potentially important information about intervention effects shown in former randomised trials or systematic reviews of such trials cannot be included in the definition of the alternative hypothesis. Furthermore, the method is primarily for one-parameter exponential family models and has, in essence, no methodological advantages compared to only using the P-value as a threshold for significance [2, 79]. The researcher behind the ‘uniformly most powerful Bayesian tests’ suggests using lower P-value thresholds (0.005 or 0.001) to avoid false positive significant results [2], which clearly seems to be a valid alternative to our calculation and use of Bayes factor. We have chosen the intervention effect hypothesised in the sample size calculation as the alternative hypothesis, firmly relating the pre-planned trial design to the interpretation of the trial result. Most trials already include a predetermined sample size calculation, which includes estimation of an anticipated intervention effect. New assumptions are therefore, in essence, not needed to calculate Bayes factor. However, it is still a clear limitation that Bayes factor can be influenced by post hoc adjustments and erroneous quantifications of the alternative hypothesis.

Fourth, our procedure is based on already well-established methodology. However, there is no empirical evidence so far assessing the validity of the procedure. We will also address this issue in a forthcoming article.

To assess the statistical significance and the clinical significance of results from randomised clinical superiority trials, we propose a five-step procedure: (1) Calculate and report the confidence intervals and the exact P-values for all pre-specified outcome comparisons. A P-value less than 0.05 may be chosen as the threshold for statistical significance for the primary outcome only if 0.05 has been used as the acceptable risk of type I error in the sample size calculation and the sample size has been reached. (2) Calculate and report the Bayes factor for the primary outcome (or outcomes) based on the hypothesised intervention effect used in the sample size estimation. If the intervention effect hypothesised in the sample size calculation is not based on results from systematic reviews or randomised clinical trials, then calculate an additional sceptical Bayes factor using a smaller hypothesised intervention effect, e.g., a relative risk halfway between 1.0 and the intervention effect hypothesised in the sample size calculation. A Bayes factor less than 0.1, indicating a ten-fold higher likelihood of compatibility with the alternative hypothesis than with the null hypothesis, may be chosen as the threshold for supporting the alternative hypothesis. More research is needed to assess if this threshold is optimal. (3) If the a priori estimated sample size has not been reached or interim analyses have been performed, then adjust the confidence intervals and the P-values accordingly. (4) If more than one outcome is used, if more than two intervention groups are compared, or if the primary outcome is assessed at multiple time points (and just one of these outcome comparisons must be significant to reject the overall null hypothesis), then the confidence intervals and the P-values should be adjusted accordingly. (5) Assess and report clinical significance of the results if all of the first four steps of the five-point procedure have shown statistical significance.

Table  1 summarises our suggestions for a more valid assessment of intervention effects in randomised clinical superiority trials, and we have included three examples of how the five-step assessment can be used to assess statistical significance and clinical significance of results from a randomised clinical trial (see Example 1, Example 2, Example 3). We have, for simplicity, only assessed the primary outcome results in the three examples.

A trial published in JAMA 2012 examined the effects of multivitamins in the prevention of cancer [ 83 ]. The conclusion of the trial was that multivitamin supplementation significantly reduced the risk of total cancer (HR 0.92; 95% CI, 0.86 to 0.998; P = 0.04). We will use our five-step procedure to assess the statistical and clinical significance of the trial results:

Report the confidence interval and the exact P-value.

Our assessment: The hazard ratio, the 95% confidence interval, and the exact P -value are reported in the publication (HR 0.92; 95% CI, 0.86 to 0.998; P = 0.04).

Calculate and report the Bayes factor for the primary outcome. A Bayes factor less than 0.1 may be chosen as threshold for significance.

Our assessment: First, to calculate Bayes factor we need to calculate log odds ratio and the standard error of the log odds ratio of the trial result: odds ratio 0.92, log odds ratio −0.08, and standard error of the log odds ratio 0.04.

Second, we need to calculate the log odds ratio of the sample size calculation. The statistical power for showing a 20% and a 30% reduction in the risk of total cancer was calculated in the protocol [ 84 ]: odds ratio 0.8 and log odds ratio −0.22.

Bayes factor = 53.10 if a risk reduction of 20% is used as the anticipated intervention effect, which is considerably greater than 0.1.

Bayes factor = 3,009,380,258 if a risk reduction of 30% is used as the anticipated intervention effect, again considerably greater than 0.1.
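For readers who want to reproduce these figures, the following is a minimal Python sketch, assuming the Bayes factor is computed as the ratio of the normal likelihood of the observed log odds ratio under the null hypothesis (log odds ratio = 0) to its likelihood under the hypothesised alternative, using the trial's standard error (the formal definition is given in Additional file 1). The odds ratio of 0.7 used below for the 30% risk reduction is our assumption, since the text only states the odds ratio for the 20% reduction, and the small discrepancies from the reported 53.10 and 3,009,380,258 arise from rounding of the inputs.

```python
import math

def bayes_factor(log_or_observed, se, log_or_alternative):
    """Ratio of the normal likelihood of the observed log odds ratio under the
    null hypothesis (log OR = 0) to its likelihood under the alternative
    hypothesis (the log OR hypothesised in the sample size calculation)."""
    def normal_pdf(x, mean, sd):
        return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))
    return normal_pdf(log_or_observed, 0.0, se) / normal_pdf(log_or_observed, log_or_alternative, se)

# Example 1 (multivitamins and total cancer): observed OR 0.92, SE(log OR) approx. 0.04.
log_or_obs = math.log(0.92)
se = 0.04

# Alternative hypothesis: 20% risk reduction (OR 0.8).
print(bayes_factor(log_or_obs, se, math.log(0.8)))  # approx. 51, close to the reported 53.10
# Alternative hypothesis: 30% risk reduction (OR 0.7 assumed here).
print(bayes_factor(log_or_obs, se, math.log(0.7)))  # approx. 1.6e9, the same order as the reported value
```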

If the a priori estimated sample size has not been reached or if interim analyses have been performed, then adjust the confidence intervals and the P-values accordingly.

Our assessment: The sample size estimation was based on a total of 15,000 participants, and 14,641 participants were randomised in the trial. The planned sample size was very nearly reached, so no adjustment seems necessary.

If more than one outcome is used, if more than two intervention groups are compared, or if the primary outcome is assessed multiple times (and just one of these outcome comparisons must be significant to reject the overall null hypothesis), then the confidence intervals and the P-values should be adjusted accordingly .

Our assessment: In the published protocol [ 84 ] it is reported that the trial is a randomised, double-blind, placebo-controlled trial of the balance of benefits and risks of beta-carotene, vitamin E, vitamin C, and a multivitamin in the prevention of total and prostate cancer, cardiovascular disease, and the age-related eye diseases cataract and macular degeneration. No clear primary outcome is defined in the protocol [ 84 ]; five outcomes (total cancer, prostate cancer, important cardiovascular events, age-related macular degeneration, and cataract) are mentioned. In the trial publication, however, total cancer is reported as the primary outcome. The P-value of 0.04 should therefore have been adjusted for multiplicity due to the many outcome comparisons, as illustrated in the sketch below.
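The sketch below applies two standard multiplicity corrections (Bonferroni and Holm) across five outcomes. The P-value of 0.04 for total cancer is taken from the publication; the other four P-values are purely hypothetical placeholders, and these two corrections are only examples of possible adjustment methods, not necessarily the ones the trial authors would have chosen.

```python
def bonferroni(p_values, alpha=0.05):
    """True/False for each hypothesis: rejected under a Bonferroni correction?"""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """True/False for each hypothesis: rejected under the Holm step-down procedure?"""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one hypothesis fails, all larger P-values fail too
    return rejected

# Total cancer (P = 0.04, reported) plus four hypothetical P-values for the other outcomes.
p_values = [0.04, 0.20, 0.30, 0.50, 0.60]
print(bonferroni(p_values))  # all False: 0.04 > 0.05 / 5 = 0.01
print(holm(p_values))        # all False under Holm as well
```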

If statistical significance has been shown according to all of the above points then assess clinical significance of the trial results.

Our assessment: Statistical significance was not adequately addressed and, had it been, it is highly unlikely that statistical significance would have been attained. It is therefore not relevant to assess clinical significance.

Interpretation: Our five-step assessment demonstrates that the results from this randomised clinical trial should be interpreted with great caution: the results from this single trial indicate that the observed effect of multivitamins is 53 times more compatible with the null hypothesis than with the hypothesis of a 20% relative risk reduction of total cancer. Our assessment of this trial is in agreement with results from systematic reviews with meta-analysis and trial sequential analysis on all-cause mortality, gastrointestinal cancers, and other cancers [ 85 – 87 ].

A trial published in The Lancet in 2010 examined the effects of tranexamic acid versus placebo in trauma patients with significant haemorrhage [ 88 ]. The conclusion of the trial was that tranexamic acid significantly reduced all-cause mortality. We will use our five-step procedure to assess the statistical and clinical significance of the trial results:

Our assessment: The authors reported a relative risk of 0.91 (95% CI 0.85 to 0.97; P = 0.0035).

Our assessment: First, to calculate Bayes factor we need to calculate the log odds ratio and the standard error of the log odds ratio of the trial result: odds ratio 0.89, log odds ratio −0.12, and standard error of the log odds ratio 0.04.

Second, we need to calculate the log odds ratio of the intervention effect hypothesised in the sample size calculation. The sample size calculation was based on an assumed risk of death of 20% in the control group, a relative risk of 0.90, and it was planned to randomise 2 × 10,000 participants. This corresponds to an odds ratio of 0.89 and a log odds ratio of −0.11.

Bayes factor = 0.01, which is 10 times less than the suggested threshold of 0.1. Accordingly, there seems to be good support for the postulated intervention effect (see the sketch below).
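These numbers can be checked with the same hypothetical bayes_factor helper sketched under Example 1 (restated compactly below), together with a conversion from relative risk to odds ratio given the assumed 20% control-group risk. Small differences from the published figures are due to rounding of the inputs.

```python
import math

# Compact restatement of the bayes_factor sketch from Example 1.
def bayes_factor(log_or_obs, se, log_or_alt):
    return math.exp(((log_or_obs - log_or_alt) ** 2 - log_or_obs ** 2) / (2 * se ** 2))

def rr_to_or(rr, control_risk):
    """Convert a relative risk to an odds ratio for a given control-group risk."""
    p1 = rr * control_risk
    return (p1 / (1 - p1)) / (control_risk / (1 - control_risk))

print(round(rr_to_or(0.90, 0.20), 2))             # approx. 0.88, close to the 0.89 quoted above
print(bayes_factor(math.log(0.89), 0.04, -0.11))  # approx. 0.015, the same order as the reported 0.01
```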

If the a priori estimated sample size has not been reached or if interim analyses have been performed, then adjust the confidence intervals and the exact P-values accordingly

Our assessment: The sample size estimation was based on a total of 20,000 participants, and 20,211 were randomised in the trial. The planned sample size was reached.

If more than one outcome is used, if more than two intervention groups are compared, or if the primary outcome is assessed multiple times (and just one of these outcome comparisons must be significant to reject the overall null hypothesis), then the confidence intervals and the P-values should be adjusted accordingly.

Our assessment: Only one primary outcome is reported in the published protocol (http://www.thelancet.com/protocol-reviews/05PRT-1), and this outcome is only assessed at one time point. There is no need for an adjustment of the confidence interval or the P-value.

If statistical significance has been shown according to all of the above four points, then assess clinical significance of the trial results.

Our assessment: Statistical significance was reached according to all of the first four steps of the five-step assessment. Clinical significance for a dichotomous outcome can be assessed by calculating the number-needed-to-treat to save one life. The number-needed-to-treat is 35 participants, indicating a relatively large clinical benefit of tranexamic acid. A further assessment of the balance between beneficial and harmful effects should also be performed.

Interpretation: Our five-step assessment demonstrates that the results from the randomised clinical trial are about 100 times more compatible with a 10% relative risk reduction than with a null effect of tranexamic acid on all-cause mortality. However, before this promising treatment is introduced into clinical practice, a systematic review of all randomised clinical trials should assess the benefits and harms of tranexamic acid. Such a review should include a thorough bias risk assessment, meta-analyses, trial sequential analyses, and reports on harm from observational studies [ 9 , 45 , 68 , 76 – 78 ].

A trial published in The New England Journal of Medicine in 2012 examined the effects of hydroxyethyl starch versus Ringer’s acetate in severe sepsis [ 89 ]. The conclusion of the trial was that the primary outcome, death or dependence on dialysis at 90 days after randomisation, occurred in 202 patients (51%) in the starch group as compared with 173 patients (43%) in the Ringer’s acetate group (relative risk, 1.17; 95% CI, 1.01 to 1.36; P = 0.03).

We will use our five-step procedure to assess the statistical and clinical significance of the trial results:

Our assessment: The confidence interval and the P-value are reported in the publication (relative risk, 1.17; 95% CI, 1.01 to 1.36; P = 0.03).

Our assessment: First, to calculate Bayes factor we need to calculate the log odds ratio and the standard error of the log odds ratio of the trial result: odds ratio 1.35, log odds ratio 0.30, and standard error of the log odds ratio 0.142.

Second, we need to calculate the log odds ratio of the sample size calculation. The sample size calculation reported in the published protocol showed that a total of 800 participants was needed to show a 20% relative risk reduction in either death or end-stage kidney failure (the primary outcome), assuming a 50% incidence of either death or end-stage kidney failure in the control group. This corresponds to an odds ratio of 0.67 and a log odds ratio of −0.40.

Bayes factor = 20,306, which is considerably greater than 0.1 (see the sketch below).
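As before, the reported Bayes factor can be reproduced approximately with the hypothetical helper from Example 1; note that the observed effect points in the opposite direction to the hypothesised benefit, which is what drives the very large value.

```python
import math

# Compact restatement of the bayes_factor sketch from Example 1.
def bayes_factor(log_or_obs, se, log_or_alt):
    return math.exp(((log_or_obs - log_or_alt) ** 2 - log_or_obs ** 2) / (2 * se ** 2))

# Observed: OR 1.35 (harm), SE(log OR) 0.142; alternative: 20% risk reduction, OR 0.67.
print(bayes_factor(math.log(1.35), 0.142, math.log(0.67)))  # approx. 2 x 10**4, in line with the reported 20,306
```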

It must be noted that the trialists anticipated a beneficial effect of hydroxyethyl starch but found a harmful effect compared with Ringer's acetate. This results in a large Bayes factor, demonstrating that the trial results are far more compatible with a null (or harmful) effect than with the 20% relative risk reduction in mortality hypothesised in the sample size calculation.

If the a priori estimated sample size has not been reached or if interim analyses have been performed, then adjust the confidence intervals and the exact P-values accordingly.

Our assessment: The sample size estimation is based on a total of 800 participants, and 804 participants were randomised. The sample size was reached.

Our assessment: The same single primary outcome (either death or end-stage kidney failure) was reported in the published protocol [ 90 ] and in the trial publication [ 89 ]. The primary outcome was only planned to be analysed at one time point. There is no need for any adjustment of the threshold for significance.

If statistical significance has been shown according to the above four points, then assess clinical significance of the trial results.

Our assessment: The first four steps of the five-step assessment clearly showed that hydroxyethyl starch does not seem to have a beneficial effect. For dichotomous outcomes, clinical significance can be assessed by calculating the number-needed-to-treat or the number-needed-to-harm. The number-needed-to-harm is 13, i.e., for every 13 patients with severe sepsis treated with hydroxyethyl starch rather than Ringer's acetate, one extra patient will die or develop end-stage renal disease (see the sketch below). A further assessment of the balance between beneficial and harmful effects would normally also be performed, but is not relevant in this trial given the absence of benefit.
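The number-needed-to-harm quoted above follows directly from the reported event risks (51% with hydroxyethyl starch versus 43% with Ringer's acetate); the same one-line calculation gives the number-needed-to-treat for a beneficial intervention. A minimal sketch:

```python
import math

def number_needed(risk_intervention, risk_control):
    """Number needed to treat (or harm): 1 / absolute risk difference, rounded up."""
    return math.ceil(1 / abs(risk_intervention - risk_control))

print(number_needed(0.51, 0.43))  # 13, as stated above
```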

Interpretation: Our five-step assessment confirms that hydroxyethyl starch does not seem to have a beneficial effect compared with Ringer's acetate in the treatment of severe sepsis. Our assessment is in agreement with results from systematic reviews with meta-analysis and trial sequential analysis [ 91 ].

Jakobsen JC, Gluud C: The necessity of randomized clinical trials. Br J Med Res. 2013, 3 (4): 1453-1468.


Johnson VE: Revised standards for statistical evidence. Proc Natl Acad Sci USA. 2013, 110 (48): 19313-19317. 10.1073/pnas.1313476110.


Fisher R: Statistical methods and scientific induction. J R Stat Soc Ser B. 1955, 17 (1): 69-78.


Gigerenzer G: Mindless statistics. J Socio Econ. 2004, 33 (5): 587-606. 10.1016/j.socec.2004.09.033.

Hald A: A history of mathematical statistics from 1750 to 1930. 1998, New York: John Wiley & Sons

Goodman S: A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008, 45: 135-140. 10.1053/j.seminhematol.2008.04.003.


Oliveri RS, Gluud C, Wille-Jørgensen PA: Hospital doctors' self-rated skills in and use of evidence-based medicine - a questionnaire survey. J Eval Clin Pract. 2004, 10 (2): 219-226. 10.1111/j.1365-2753.2003.00477.x.

Bassler D, Briel M, Montori VM, Lane M, Glasziou P, Zhou Q, Heels-Ansdell D, Walter SD, Guyatt GH, Flynn DN, Elamin MB, Murad MH, Abu Elnour NO, Lampropulos JF, Sood A, Mullan RJ, Erwin PJ, Bankhead CR, Perera R, Ruiz Culebro C, You JJ, Mulla SM, Kaur J, Nerenberg KA, Schunemann H, Cook DJ, Lutz K, Ribic CM, Vale N, Malaga G, Akl EA, et al: Stopping randomized trials early for benefit and estimation of treatment effects: systematic review and meta-regression analysis. JAMA. 2010, 303: 1180-1187. 10.1001/jama.2010.310.


Thorlund K, Devereaux PJ, Wetterslev J, Guyatt G, Ioannidis JP, Thabane L, Gluud LL, Als-Nielsen B, Gluud C: Can trial sequential monitoring boundaries reduce spurious inferences from meta-analyses?. Int J Epidemiol. 2009, 38 (1): 276-286. 10.1093/ije/dyn179.

Ioannidis JP: Why most published research findings are false. PLoS Med. 2005, 2 (8): e124-10.1371/journal.pmed.0020124.


Garattini S, Bertele V: Non-inferiority trials are unethical because they disregard patients' interests. Lancet. 2007, 370 (9602): 1875-1877. 10.1016/S0140-6736(07)61604-3.

Sterne JA: Teaching hypothesis tests–time for significant change?. Stat Med. 2002, 21: 985-999. 10.1002/sim.1129.

Ranstam J: Why the P-value culture is bad and confidence intervals a better alternative. Osteoarthritis Cartilage. 2012, 20: 805-808. 10.1016/j.joca.2012.04.001.

Williamson PR, Altman DG, Blazeby JM, Clarke M, Gargon E: The COMET (Core Outcome Measures in Effectiveness Trials) Initiative. Trials. 2011, 12 (Suppl 1): A70-10.1186/1745-6215-12-S1-A70.


Altman DG, Bland JM: How to obtain the confidence interval from a P value. BMJ. 2011, 343: d2090-10.1136/bmj.d2090.

Chow S-C, Shao J, Wang H: Sample Size Calculations in Clinical Research, Second Edition. 2008, Boca Raton, Florida: Chapman and Hall/CRC

Schulz KF, Altman DG, Moher D: CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann Int Med. 2010, 152 (11): 726-732. 10.7326/0003-4819-152-11-201006010-00232.

Scales DC, Rubenfeld GD: Estimating sample size in critical care clinical trials. J Crit Care. 2005, 20 (1): 6-11. 10.1016/j.jcrc.2005.02.002.

Spiegelhalter DJ, Abrams KR, Myles JP: Bayesian approaches to clinical trials and health-care evaluation (Statistics in Practice). 2004, West Sussex, England: John Wiley & Sons

Roloff V, Higgins JP, Sutton AJ: Planning future studies based on the conditional power of a meta-analysis. Stat Med. 2013, 32 (1): 11-24. 10.1002/sim.5524.

Goodman SN: Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials. 2005, 2: 282-378. 10.1191/1740774505cn098oa.

Goodman SN: Toward evidence-based medical statistics. 2: The Bayes factor. Ann Int Med. 1999, 130 (12): 1005-1013. 10.7326/0003-4819-130-12-199906150-00019.

Pogue JM, Yusuf S: Cumulating evidence from randomized trials: utilizing sequential monitoring boundaries for cumulative meta-analysis. Control Clin Trials. 1997, 18 (6): 580-593. 10.1016/S0197-2456(97)00051-2.

Higgins JP, Whitehead A: Borrowing strength from external trials in a meta-analysis. Stat Med. 1996, 15 (24): 2733-2749. 10.1002/(SICI)1097-0258(19961230)15:24<2733::AID-SIM562>3.0.CO;2-0.

Fayers PM, Cuschieri A, Fielding J, Craven J, Uscinska B, Freedman LS: Sample size calculation for clinical trials: the impact of clinician beliefs. Br J Cancer. 2000, 82 (1): 213-219. 10.1054/bjoc.1999.0902.

Thorlund K, Imberger G, Walsh M, Chu R, Gluud C, Wetterslev J, Guyatt G, Devereaux PJ, Thabane L: The number of patients and events required to limit the risk of overestimation of intervention effects in meta-analysis-a simulation study. PLoS One. 2011, 6: e25491-10.1371/journal.pone.0025491.

Pereira TV, Horwitz RI, Ioannidis JP: Empirical evaluation of very large treatment effects of medical interventions. JAMA. 2012, 308: 1676-1684. 10.1001/jama.2012.13444.

Mehta CR, Pocock SJ: Adaptive increase in sample size when interim results are promising: a practical guide with examples. Stat Med. 2011, 30 (28): 3267-3284. 10.1002/sim.4102.

Jennison C, Turnbull BW: Efficient group sequential designs when there are several effect sizes under consideration. Stat Med. 2005, 25: 917-932.

O'Hagan A, Stevens JW, Campbell MJ: Assurance in clinical trial design. Pharm Stat. 2005, 4 (3): 187-201. 10.1002/pst.175.

Turner RM, Bird SM, Higgins JP: The impact of study size on meta-analyses: examination of underpowered studies in Cochrane reviews. PLoS One. 2013, 8 (3): e59202-10.1371/journal.pone.0059202.

Sully BG, Julious SA, Nicholl J: A reinvestigation of recruitment to randomised, controlled, multicenter trials: a review of trials funded by two UK funding agencies. Trials. 2013, 14: 166-10.1186/1745-6215-14-166.

Levin GP, Emerson SC, Emerson SS: Adaptive clinical trial designs with pre-specified rules for modifying the sample size: understanding efficient types of adaptation. Stat Med. 2012, 32 (8): 1259-1275.

DeMets DL, Lan KK: Interim analysis: the alpha spending function approach. Stat Med. 1994, 13 (13–14): 1341-1356.

Bassler D, Montori VM, Briel M, Glasziou P, Walter SD, Ramsay T, Guyatt G: Reflections on meta-analyses involving trials stopped early for benefit: is there a problem and if so, what is it?. Stat Methods Med Res. 2013, 22 (2): 159-168. 10.1177/0962280211432211.

Lindley DV: A statistical paradox. Biometrika. 1957, 44 (1/2): 187-192. 10.2307/2333251.

Guyatt GH, Briel M, Glasziou P, Bassler D, Montori VM: Problems of stopping trials early. BMJ. 2012, 344: e3863-10.1136/bmj.e3863.

Wald A: Sequential tests of statistical hypotheses. Ann Math Stat. 1945, 16: 117-186. 10.1214/aoms/1177731118.

Zhang J, Quan H, Ng J, Stepanavage ME: Some statistical methods for multiple endpoints in clinical trials. Control Clin Trials. 1997, 18: 204-221. 10.1016/S0197-2456(96)00129-8.

Imberger G, Vejlby AD, Hansen SB, Møller AM, Wetterslev J: Statistical multiplicity in systematic reviews of anaesthesia interventions: a quantification and comparison between Cochrane and non-Cochrane reviews. PLoS One. 2011, 6: e28422-10.1371/journal.pone.0028422.

Pocock SJ: When to stop a clinical trial. BMJ. 1992, 305 (6847): 235-240. 10.1136/bmj.305.6847.235.

Jennison C, Turnbull BW: Repeated confidence intervals for group sequential clinical trials. Control Clin Trials. 1984, 5 (1): 33-45. 10.1016/0197-2456(84)90148-X.

Todd S, Whitehead J, Facey KM: Point and interval estimation following a sequential clinical trial. Biometrika. 1996, 83 (2): 453-461. 10.1093/biomet/83.2.453.

Jennison C, Turnbull BW: Group Sequential Methods with Applications to Clinical Trials (Chapman & Hall/CRC Interdisciplinary Statistics). 1999, : Chapman and Hall/CRC

Thorlund K, Engstrøm J, Wetterslev J, Brok J, Imberger G, Gluud C: User manual for trial sequential analysis (TSA). 2011, Copenhagen, Denmark: Copenhagen Trial Unit, Centre for Clinical Intervention Research, 1-115. Available from http://www.ctu.dk/tsa

Equator Network: Enhancing the QUAlity and Transparency Of health Research. Available at: http://www.equator-network.org/. 2014

Yang Q, Cui J, Chazaro I, Cupples LA, Demissie S: Power and type I error rate of false discovery rate approaches in genome-wide association studies. BMC Genet. 2005, 6 (Suppl 1): S134-10.1186/1471-2156-6-S1-S134.

Bretz F, Hothorn T, Westfall P: Multiple Comparisons Using R. 2010, Boca Raton, Florida: Chapman and Hall/CRC


Altman DG, Bland JM: How to obtain the P value from a confidence interval. BMJ. 2011, 343: d2304-10.1136/bmj.d2304.

Abdi H: The Bonferroni and Šidák corrections for multiple comparisons. In: Salkind NJ (ed): Encyclopedia of Measurement and Statistics. 2007, Thousand Oaks (CA): Sage, 103-107

Holm S: A simple sequentially rejective multiple test procedure. Scand J Statist. 1979, 6: 65-70.

Dmitrienko A, Ajit C, Tamhane AC, Bretz F: Multiple testing problems in pharmaceutical statistics (Chapman & Hall/CRC Biostatistics Series). 2009, Boca Raton, Florida: Chapman and Hall/CRC


Tu YH, Cheng B, Cheung YK: A note on confidence bounds after fixed-sequence multiple tests. J Stat Plan Inference. 2012, 142 (11): 2993-2998. 10.1016/j.jspi.2012.05.002.

Wiens BL, Dmitrienko A: The fallback procedure for evaluating a single family of hypotheses. J Biopharm Stat. 2005, 15 (6): 929-942. 10.1080/10543400500265660.

Korn EL, Li MC, McShane LM, Simon R: An investigation of two multivariate permutation methods for controlling the false discovery proportion. Stat Med. 2007, 26 (24): 4428-4440. 10.1002/sim.2865.

Westfall PH, Young S: Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment (Wiley Series in Probability and Statistics). 1993, New York: Wiley-Interscience

Yu J, Hutson AD, Siddiqui AH, Kedron MA: Group sequential control of overall toxicity incidents in clinical trials - non-Bayesian and Bayesian approaches. Stat Methods Med Res. 2012, Epub ahead of print

Thall PF, Simon RM, Shen Y: Approximate Bayesian evaluation of multiple treatment effects. Biometrics. 2000, 56: 213-219. 10.1111/j.0006-341X.2000.00213.x.

Zhang X, Cutter G: Bayesian interim analysis in clinical trials. Contemp Clin Trials. 2008, 29: 751-755. 10.1016/j.cct.2008.05.007.

Jakobsen JC, Lindschou Hansen J, Storebø OJ, Simonsen E, Gluud C: The effects of cognitive therapy versus 'treatment as usual' in patients with major depressive disorder. PLoS One. 2011, 6 (8): e22890-10.1371/journal.pone.0022890.

Knorr U, Vinberg M, Kessing LV, Wetterslev J: Salivary cortisol in depressed patients versus control persons: a systematic review and meta-analysis. Psychoneuroendocrinol. 2010, 35: 1275-1286. 10.1016/j.psyneuen.2010.04.001.


Downs JR, Clearfield M, Weis S, Whitney E, Shapiro DR, Beere PA, Langendorfer A, Stein EA, Kruyer W, Gotto AM: Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. Air Force/Texas Coronary Atherosclerosis Prevention Study. JAMA. 1998, 279 (20): 1615-1622. 10.1001/jama.279.20.1615.

Stovring H, Harmsen CG, Wisloff T, Jarbol DE, Nexoe J, Nielsen JB, Kristiansen IS: A competing risk approach for the European Heart SCORE model based on cause-specific and all-cause mortality. Eur J Prev Cardiol. 2012, 20 (5): 827-836.

Prasad V, Vandross A: Cardiovascular primary prevention: how high should we set the bar?. Arch Int Med. 2012, 172: 656-659. 10.1001/archinternmed.2012.812.

Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, Falck-Ytter Y, Glasziou P, DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schunemann HJ: GRADE guidelines: 1. Introduction-GRADE evidence profiles and summary of findings Tables. J Clin Epidemiol. 2011, 64 (4): 383-394. 10.1016/j.jclinepi.2010.04.026.

Guyatt G, Oxman AD, Sultan S, Brozek J, Glasziou P, Alonso-Coello P, Atkins D, Kunz R, Montori V, Jaeschke R, Rind D, Dahm P, Akl EA, Meerpohl J, Vist G, Berliner E, Norris S, Falck-Ytter Y, Schunemann HJ: GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. J Clin Epidemiol. 2013, 66 (2): 151-157. 10.1016/j.jclinepi.2012.01.006.

Jüni P, Nartey L, Reichenbach S, Sterchi R, Dieppe PA, Egger M: Risk of cardiovascular events and rofecoxib: cumulative meta-analysis. Lancet. 2004, 364 (9450): 2021-2029. 10.1016/S0140-6736(04)17514-4.

Higgins JPT, Green S: The Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0. 2011, The Cochrane Collaboration, Available from http://www.cochrane-handbook.org

Johnston BC, Thorlund K, Schunemann HJ, Xie F, Murad MH, Montori VM, Guyatt GH: Improving the interpretation of quality of life evidence in meta-analyses: the application of minimal important difference units. Health Qual Life Outcomes. 2010, 8: 116-10.1186/1477-7525-8-116.

Halvorsen PA, Kristiansen IS: Decisions on drug therapies by numbers needed to treat: a randomized trial. Arch Int Med. 2005, 165: 1140-1146. 10.1001/archinte.165.10.1140.

Chalmers I, Milne I, Trohler U, Vandenbroucke J, Morabia A, Tait G, Dukan E: The James Lind Library: explaining and illustrating the evolution of fair tests of medical treatments. J R Coll Physicians Edinb. 2008, 38 (3): 259-264.


The Library and Information Services Department, The Royal College of Physicians of Edinburgh: James Lind Library. Available online at: http://www.jameslindlibrary.org/. 2003

The Cochrane Collaboration: The Cochrane Collaboration. http://www.cochrane.org

Garthwaite P, Kadane JB, O'Hagan A: Statistical Methods for Eliciting Probability Distributions. J Am Stat Assoc. 2012, 100 (470):

Ioannidis J: Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005, 294 (2): 218-228. 10.1001/jama.294.2.218.

Wetterslev J, Thorlund K, Brok J, Gluud C: Trial sequential analysis may establish when firm evidence is reached in cumulative meta-analysis. J Clin Epidemiol. 2008, 61 (1): 64-75. 10.1016/j.jclinepi.2007.03.013.

Higgins JP, Whitehead A, Simmonds M: Sequential methods for random-effects meta-analysis. Stat Med. 2011, 30 (9): 903-921. 10.1002/sim.4088.

Keus F, Wetterslev J, Gluud C, van Laarhoven CJ: Evidence at a glance: error matrix approach for overviewing available evidence. BMC Med Res Methodol. 2010, 10: 90-10.1186/1471-2288-10-90.

Johnson VE: Uniformly most powerful Bayesian tests. Ann Stat. 2013, 41: 1716-1741. 10.1214/13-AOS1123.

Higgins JP, Spiegelhalter DJ: Being sceptical about meta-analyses: a Bayesian perspective on magnesium trials in myocardial infarction. Int J Epidemiol. 2002, 31 (1): 96-104. 10.1093/ije/31.1.96.

Korn EL, Freidlin B: The likelihood as statistical evidence in multiple comparisons in clinical trials: no free lunch. Biom J. 2006, 48 (3): 346-355. 10.1002/bimj.200510216.

Lunn D, Spiegelhalter D, Thomas A, Best N: The BUGS project: Evolution, critique and future directions. Stat Med. 2009, 28 (25): 3049-3067. 10.1002/sim.3680.

Gaziano JM, Sesso HD, Christen WG, Bubes V, Smith JP, MacFadyen J, Schvartz M, Manson JE, Glynn RJ, Buring JE: Multivitamins in the prevention of cancer in men: the Physicians' Health Study II. JAMA. 2012, 308 (18): 1871-1880. 10.1001/jama.2012.14641.

Christen WG, Gaziano JM, Hennekens CH: Design of Physicians' Health Study II–a randomized trial of beta-carotene. Ann Epidemiol. 2000, 10 (2): 125-134. 10.1016/S1047-2797(99)00042-3.

Bjelakovic G, Nikolova D, Gluud LL, Simonetti RG, Gluud C: Antioxidant supplements for prevention of mortality in healthy participants and patients with various diseases. Cochrane Database Syst Rev. 2012, 3: CD007176

Bjelakovic G, Nikolova D, Simonetti RG, Gluud C: Antioxidant supplements for preventing gastrointestinal cancers. Cochrane Database Syst Rev. 3: CD004183

Cortés-Jofré M, Rueda JR, Corsini-Muñoz G, Fonseca-Cortés C, Caraballoso M, Bonfill Cosp X: Drugs for preventing lung cancer in healthy people. Cochrane Database Syst Rev. 10: CD002141

Shakur H, Roberts I, Bautista R, Caballero J, Coats T, Dewan Y, El-Sayed H, Gogichaishvili T, Gupta S, Herrera J, Hunt B, Iribhogbe P, Izurieta M, Khamis H, Komolafe E, Marrero MA, Mejia-Mantilla J, Miranda J, Morales C, Olaomi O, Olldashi F, Perel P, Peto R, Ramana PV, Ravi RR, Yutthakasemsunt S: Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet. 2010, 376 (9734): 23-32.

Perner A, Haase N, Guttormsen AB, Tenhunen J, Klemenzson G, Aneman A, Madsen KR, Moller MH, Elkjaer JM, Poulsen LM, Bendtsen A, Winding R, Steensen M, Berezowicz P, Soe-Jensen P, Bestle M, Strand K, Wiis J, White JO, Thornberg KJ, Quist L, Nielsen J, Andersen LH, Holst LB, Thormar K, Kjaeldgaard AL, Fabritius ML, Mondrup F, Pott FC, Moller T, et al: Hydroxyethyl starch 130/0.42 versus Ringer's acetate in severe sepsis. N Eng J Med. 2012, 367 (2): 124-134. 10.1056/NEJMoa1204242.

Perner A, Haase N, Wetterslev J, Aneman A, Tenhunen J, Guttormsen AB, Klemenzson G, Pott F, Bodker KD, Badstolokken PM, Bendtsen A, Soe-Jensen P, Tousi H, Bestle M, Pawlowicz M, Winding R, Bulow HH, Kancir C, Steensen M, Nielsen J, Fogh B, Madsen KR, Larsen NH, Carlsson M, Wiis J, Petersen JA, Iversen S, Schoidt O, Leivdal S, Berezowicz P, et al: Comparing the effect of hydroxyethyl starch 130/0.4 with balanced crystalloid solution on mortality and kidney failure in patients with severe sepsis (6S--Scandinavian Starch for Severe Sepsis/Septic Shock trial): study protocol, design and rationale for a double-blinded, randomised clinical trial. Trials. 2011, 12 (1): 24-10.1186/1745-6215-12-24.

Haase N, Perner A, Hennings LI, Siegemund M, Lauridsen B, Wetterslev M, Wetterslev J: Hydroxyethyl starch 130/0.38-0.45 versus crystalloid or albumin in patients with sepsis: systematic review with meta-analysis and trial sequential analysis. BMJ. 2013, 346: f839-10.1136/bmj.f839.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/34/prepub


Acknowledgements

We thank two peer reviewers, Lisa McShane and Steven Julious, for excellent comments.

Author information

Authors and Affiliations

Copenhagen Trial Unit, Centre for Clinical Intervention Research, Department 7812 Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark

Janus Christian Jakobsen, Christian Gluud, Per Winkel & Jørn Wetterslev

Emergency Department, Holbæk Hospital, Holbæk, Denmark

Janus Christian Jakobsen

Department of Biostatistics, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Theis Lange


Corresponding author

Correspondence to Janus Christian Jakobsen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JCJ wrote the first draft. All authors were substantially involved in revising the manuscript and all authors have given final approval of the present version to be published.

Electronic supplementary material

Additional file 1: Table S1: Different statistical terms and calculation of Bayes factor. (DOCX 185 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Jakobsen, J.C., Gluud, C., Winkel, P. et al. The thresholds for statistical and clinical significance – a five-step procedure for evaluation of intervention effects in randomised clinical trials. BMC Med Res Methodol 14 , 34 (2014). https://doi.org/10.1186/1471-2288-14-34

Download citation

Received : 11 February 2014

Accepted : 20 February 2014

Published : 04 March 2014

DOI : https://doi.org/10.1186/1471-2288-14-34


Keywords

  • Randomised clinical trial
  • Threshold for significance
  • Bayes factor
  • Confidence interval




Hypothesis and hypothesis testing in the clinical trial

Affiliation

  • 1 Department of Psychiatry and Mental Health and Neuroscience Clinical Research Center, University of North Carolina School of Medicine, Chapel Hill 27599-7160, USA.
  • PMID: 11379832

The hypothesis provides the justification for the clinical trial. It is antecedent to the trial and establishes the trial's direction. Hypothesis testing is the most widely employed method of determining whether the outcome of clinical trials is positive or negative. Too often, however, neither the hypothesis nor the statistical information necessary to evaluate outcomes, such as p values and alpha levels, is stated explicitly in reports of clinical trials. This article examines 5 recent studies comparing atypical antipsychotics with special attention to how they approach the hypothesis and hypothesis testing. Alternative approaches are also discussed.



Superiority trials: raising the bar of null hypothesis statistical testing

Volume 20, Issue 5

https://doi.org/10.1136/ebmed-2015-110280

Latest Developments in “Adaptive Enrichment” Clinical Trial Designs in Oncology

  • Open access
  • Published: 13 September 2024

Cite this article



  • Yue Tu 1 &
  • Lindsay A. Renfro 1  

As cancer has become better understood on the molecular level with the evolution of gene sequencing techniques, considerations for individualized therapy using predictive biomarkers (those associated with a treatment’s effect) have shifted to a new level. In the last decade or so, randomized “adaptive enrichment” clinical trials have become increasingly utilized to strike a balance between enrolling all patients with a given tumor type, versus enrolling only a subpopulation whose tumors are defined by a potential predictive biomarker related to the mechanism of action of the experimental therapy. In this review article, we review recent innovative design extensions and adaptations to adaptive enrichment designs proposed during the last few years in the clinical trial methodology literature, both from Bayesian and frequentist perspectives.


Introduction

As cancer has become better understood on the molecular level with the evolution of gene sequencing techniques, considerations for individualized therapy using predictive biomarkers (those associated with a treatment’s effect) have shifted to a new level. Traditional randomized trial designs tend to either oversimplify or overlook differences in patients’ genetic and molecular profiles, either by fully enriching eligibility to a marker subgroup or by enrolling all-comers without prospective use of potentially predictive biomarkers. In the former case of marker enrichment, one cannot learn about a marker’s true predictive ability from the trial’s conduct (as marker-negative patients are excluded); in the latter case, ignoring the biomarker may “wash out” the treatment effect when a predictive marker truly does exist within the sampled patient population.

In the last decade or so, randomized “adaptive enrichment” clinical trials have become increasingly utilized to strike a balance between enrolling all patients with a given tumor type, versus enrolling only a subpopulation whose tumors are defined by a potential predictive biomarker related to the mechanism of action of the experimental therapy (see for example [ 1 , 2 ]). On a high level, adaptive enrichment designs take the form of a clinical trial that begins by randomizing participants to a targeted versus a control therapy regardless of marker value, then adapts through a series of one or more interim analyses to potentially limit subsequent trial recruitment to a marker-defined patient subpopulation that is showing early signals of enhanced treatment benefit.

In this review article, we first discuss the “traditional” presentation of both enrichment and adaptive enrichment designs and their decision rules and describe statistical or practical challenges associated with each. Next, we introduce innovative design extensions and adaptations to adaptive enrichment designs proposed during the last few years in the clinical trial methodology literature, both from Bayesian and frequentist perspectives. Finally, we review articles in which different designs within this class are directly compared or features are examined, and we conclude with some comments on future research directions.

Enrichment Trial Designs

To motivate discussion of adaptive enrichment designs and why they are useful, it is helpful to first understand enrichment trial designs, or designs that focus only on a subset of the patient population from the beginning.

Design Details: In the setting of targeted therapies with strong prior evidence or clinical rationale supporting efficacy only within a biomarker-selected subgroup, “marker-enriched” or enrichment trial designs are used to confirm signal or efficacy only in that selected subgroup. In these types of trials, patients are screened and classified into prespecified marker positive and negative subgroups at or prior to enrollment, with only marker positive patients eligible to remain on study and receive protocol-directed targeted therapy. This usually takes the form of a small, single-arm phase II study without a randomized comparator, but in some settings, comparisons against a randomized non-targeted standard of care therapy might be made (see Fig.  1 ).

Figure 1: Enrichment trial schema with a single arm.

Example: An example of a clinical trial with an enrichment design is the Herceptin Adjuvant (HERA) trial. HERA was a phase III, randomized, three-arm trial that studied the efficacy of 1 year versus 2 years of adjuvant trastuzumab versus control (no additional treatment) in women with human epidermal growth factor receptor 2 (HER2)-positive early breast cancer after completion of locoregional therapy and chemotherapy [ 3 ]. HER2 is overexpressed in 15–25% of breast cancers, and trastuzumab, a monoclonal antibody, binds the HER2 extracellular receptor [ 4 , 5 ]. The primary outcome was disease-free survival, and in an intention-to-treat analysis, a significant treatment benefit was demonstrated for 1 year of trastuzumab compared to the control arm.

Limitations: One important limitation of enrichment designs is that a marker’s predictive ability to select patients for treatment is assumed to already be known and cannot be validated from the trial itself. It is theoretically possible that a pre-defined marker-negative subgroup might also benefit from the targeted treatment, but that knowledge will not be gained from an enrichment design. For example, a pre-clinical study found that trastuzumab can decrease cancer cell proliferation in breast cancer cell lines that are HER2-negative but positive for HER2 phosphorylation at tyrosine Y877, with an effect comparable to that seen in HER2-positive cell lines, suggesting that the HER2-negative subpopulation may also benefit from trastuzumab [ 6 ]. Around the same time, however, the randomized study B-47 conducted by the National Surgical Adjuvant Breast and Bowel Project (NSABP) group showed no effect of trastuzumab in HER2-low patients [ 7 ].

Another limitation of enrichment trial designs is the necessity of establishing predefined subgroups during the study planning phase, which becomes complicated when dealing with biomarkers that are measured on a continuous scale, like expression levels or laboratory values. Determining an appropriate threshold to divide patients into “positive” and “negative” groups is not always straightforward, validated, or effective in distinguishing the effect of the targeted treatment. Selecting an incorrect threshold during trial design can result in an ineffective or underpowered study, and revising the decision once the trial has begun accrual is not advisable.

Adaptive Enrichment Trial Designs

Adaptive enrichment trial designs, on the other hand, are an attractive solution to the inherent weaknesses of a fully enriched trial design.

Design Details: An adaptive enrichment trial design initially enrolls patients with any marker value(s) and randomizes them to experimental targeted versus standard (non-targeted) therapy. As the trial progresses, accrual may be refined or restricted to patients with the marker values showing early signals of efficacy, on the basis of one or more interim analyses. This design is randomized out of necessity, so that treatment-by-marker interactions may be computed and adaptations based on differential treatment effects across marker subgroups can be facilitated. At the interim analyses, according to pre-specified decision rules, a trial may stop early for futility or efficacy, either overall or within a marker-defined subgroup. If the biomarker of interest is not naturally dichotomous, the same interim analyses may also be used to select or revise marker cutpoints (see Fig. 2; an illustrative sketch of such an interim decision rule follows the figure).

Figure 2: Adaptive enrichment trial schema with a binary biomarker.
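To make the interim decision logic concrete, the following is a deliberately simplified, hypothetical sketch for a binary marker; it is not taken from any specific published design. It assumes approximately normal treatment-effect estimates in the marker-positive and marker-negative subgroups and uses a z-test for the treatment-by-marker interaction, with purely illustrative thresholds.

```python
import math

def interim_decision(effect_pos, se_pos, effect_neg, se_neg,
                     z_efficacy=2.8, z_futility=0.0, z_interaction=1.96):
    """Toy interim rule for an adaptive enrichment trial with a binary marker.

    effect_pos / effect_neg are estimated treatment effects (larger = better)
    in the marker-positive and marker-negative subgroups, with standard errors
    se_pos / se_neg. All thresholds are illustrative placeholders."""
    z_pos = effect_pos / se_pos
    z_neg = effect_neg / se_neg
    # z-statistic for the treatment-by-marker interaction.
    z_int = (effect_pos - effect_neg) / math.sqrt(se_pos ** 2 + se_neg ** 2)

    if z_pos >= z_efficacy and z_neg >= z_efficacy:
        return "stop early for efficacy in the full population"
    if z_pos >= z_efficacy and z_int >= z_interaction:
        return "stop early for efficacy in the marker-positive subgroup"
    if z_pos <= z_futility and z_neg <= z_futility:
        return "stop early for futility"
    if z_int >= z_interaction and z_neg <= z_futility:
        return "restrict (enrich) further accrual to marker-positive patients"
    return "continue enrolling all-comers"

# Promising signal only in marker-positive patients -> enrich subsequent accrual.
print(interim_decision(effect_pos=0.18, se_pos=0.07, effect_neg=-0.02, se_neg=0.07))
```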

Example: One real-world example of an adaptive enrichment design is the Morphotek Investigation in Colorectal Cancer: Research of MORAb-004 (MICRO) trial, an adaptive, two-stage, phase II study assessing the effect of ontuxizumab versus placebo in patients with advanced metastatic colorectal cancer [ 8 , 9 ]. Ontuxizumab, a monoclonal antibody targeting endosialin function, was expected to be more effective in patients with endosialin-related biomarkers. Since the biomarkers were continuous in nature and the optimal cutoffs were unknown, the study included an assessment to determine the best cutoffs at an interim analysis, with progression-free survival (PFS) serving as the primary endpoint. Initially, the goal was to demonstrate the treatment effect of ontuxizumab either overall or within subgroups defined by biomarkers. However, the interim analysis revealed that none of the biomarkers had a predictive relationship with treatment outcome, and the design consequently shifted to a non-marker-driven comparison. Additionally, the interim analysis showed early futility for ontuxizumab compared to placebo overall, and the trial was terminated early due to lack of efficacy. In summary, this adaptive enrichment design concluded both the biomarker assessment and the evaluation of the therapy early, and additional resources and patients were spared. It is worth noting, however, that the trial may have been underpowered to identify modestly-sized interaction effects, had they been present.

Limitations: Adaptive enrichment trial designs do have some statistical challenges, including limitations faced in the design of the MICRO trial. These include estimation of subgroup-specific treatment effects, particularly when the marker prevalence is low, as a sufficiently large sample size is required to have enough patient-level information at interim analysis for informative subgroup selection. As a practical consideration, the primary endpoint must be quickly observed relative to the pace of accrual, to allow time for impactful adaptations based on observed outcomes relatively early in the trial. Another challenge is how exactly one should select cutpoints for adaptation of accrual. In the MICRO trial, at the interim analysis, a series of Cox proportional hazards models were fit over a grid of possible cutpoints, and the significance of a marker-by-treatment interaction term was evaluated. A pre-specified level of statistical significance for the interaction, along with a clinically meaningful effect in the marker “positive” group defined by the interaction, would warrant potential accrual restriction; however, this approach treated truly continuous biomarkers as binary in its implementation, which results (at least theoretically) in a loss of information and potential loss of power. A simplified sketch of this kind of cutpoint search is given below.
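As a purely hypothetical illustration of such a cutpoint search (not the MICRO trial's actual analysis code, with invented data, sample size, and cutpoint grid), one can scan candidate cutpoints, fit a Cox model with a treatment-by-marker interaction at each, and inspect the interaction P-values; the sketch below uses the lifelines package.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 400
marker = rng.uniform(0, 1, n)                 # continuous biomarker
treat = rng.integers(0, 2, n)                 # 1 = targeted therapy, 0 = control
# Simulated PFS times: treatment helps only when the marker exceeds 0.6 (the "true" cutpoint).
hazard = np.exp(-0.8 * treat * (marker > 0.6))
time = rng.exponential(1.0 / hazard)
event = rng.uniform(0, 1, n) < 0.8            # roughly 80% of progressions observed

# Scan a grid of candidate cutpoints; the interaction P-values are illustrative only.
for cut in np.arange(0.2, 0.9, 0.1):
    df = pd.DataFrame({
        "time": time,
        "event": event.astype(int),
        "treat": treat,
        "marker_pos": (marker > cut).astype(int),
    })
    df["interaction"] = df["treat"] * df["marker_pos"]
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    p = cph.summary.loc["interaction", "p"]
    print(f"cutpoint {round(cut, 1)}: interaction P = {p:.3f}")
```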

Several groups have attempted to extend or modify the standard adaptive enrichment trial design in various ways to address statistical shortcomings or tailor the strategy to various applications. The remainder of this paper provides an overview of some of these recent developments. While we admit such designations are rather arbitrary, we present this work separately by Bayesian and frequentist approaches, so that structural similarities among them may be readily described and compared.

Recent Developments and Extensions in Adaptive Enrichment Trial Designs

Bayesian Approaches

Xu et al. proposed an adaptive enrichment randomized two-arm design that combines exploration of treatment benefit subgroups and estimation of subgroup-specific effects in the context of a multilevel target product profile, where both minimal and targeted treatment effect thresholds are investigated [ 10 ]. This adaptive subgroup-identification enrichment design (ASIED) opens for all-comers first, and subgroups identified as having enhanced treatment effects are selected at an interim analysis, where pre-set minimum and targeted treatment effects are evaluated against a set of decision criteria for futility or efficacy stopping for all-comers or possible subgroups. A Bayesian random partition (BayRP) model for subgroup identification is incorporated into ASIED, based on models proposed by Xu et al. and Guo et al. [ 11 , 12 ]. Due to the flexibility of the BayRP model, biomarkers can be continuous, binary, categorical, or ordinal, and the primary endpoint can be binary, categorical, or continuous. Per the authors, extensions to count or survival outcomes are also possible. BayRP was implemented due to its robustness, but other Bayesian subgroup identification methods could be used as well, like Bayesian additive regression trees (BART) or random forests for larger sample sizes [ 13 ]. A tree-type random partition of biomarkers is used as a prior, and an equally spaced k-dimensional grid constructed from k biomarkers is used to represent possible biomarker profiles. The operating characteristics of ASIED as a trial design were evaluated by simulations with 4 continuous biomarkers, a total sample size of 180, an interim analysis after 100 patients were enrolled, a minimum desired treatment effect of 2.37, and a target treatment effect of 3.08 on a continuous score scale. ASIED’s recommendations were close to the expected results. However, the number of simulated trials was only 100, which could yield lower precision of the estimated operating characteristics. Another limitation is that the partition of the biomarker profile was limited to at most four biomarker subgroups due to the small sample size in each partition.

Another Bayesian randomized group-sequential adaptive enrichment two-arm design (AED) incorporating multiple baseline biomarkers was proposed by Park et al. [ 14 ]. The design’s primary endpoint is time-to-event, while a binary early response acts as a surrogate endpoint assisting with biomarker pruning and enrichment to a sensitive population at each interim analysis. Initially, the study is open for all-comers and the baseline biomarkers can be binary, continuous, or categorical. The first step at each interim analysis is to jointly select covariates based on both the surrogate and final endpoints by checking each treatment-by-covariate interaction. The second step is to recalculate the personalized benefit index (PBI), which is a weighted average posterior probability indicating patients with selected biomarkers who benefit more from the experimental treatment. The refitted regression from the variable selection step redefines the treatment-sensitive patients, and only patients with PBI values larger than some pre-specified cutoff continue to be enrolled in the trial. The third step is to test for futility and efficacy stopping by a Bayesian group sequential test procedure for the previously identified treatment-sensitive subgroups. In simulations, AED was compared with the group-sequential enrichment designs InterAdapt and GSED, an adaptive enrichment design, and an all-comers group sequential design [ 15 , 16 , 17 ]. The maximum sample size considered was 400, and patients were accrued by a Poisson process with 100 patients per year. Two interim analyses took place after 200 and 300 patients enrolled, and 10 baseline biomarkers were considered. Across each of the seven scenarios, prevalence of the treatment-sensitive group was set to be 0.65, 0.50, or 0.35. While nearly all the designs controlled the nominal Type I error at 0.05, AED had higher probabilities of identifying the sensitive subgroup and correctly concluding efficacy than the other designs. Also, 1000 future patients were simulated and treated by each design’s suggested treatment, and AED had the longest median survival time overall. One stated limitation of this work was its inability to handle high-dimensional baseline biomarker covariates, as the authors suggest considering no more than 50 baseline covariates in total. Also, biomarkers in this design are assumed to be independent, though selection adjustment for correlated predictors is mentioned. It is worth noting that early response (as used by this design) has not been validated as a good surrogate for longer-term clinical endpoints.

To address the scenario of a single continuous predictive biomarker where the marker-treatment relationship is continuous instead of a step function, Ohwada and Morita proposed a Bayesian adaptive patient enrollment restriction (BAPER) design that can restrict the subsequent enrollment of treatment insensitive biomarker-based subgroups based on interim analyses [ 18 ]. The primary endpoint is assumed to be time-to-event, and the relationship between the biomarker and treatment effect is assumed to increase monotonically and is modeled via a four-parameter change-point model within a proportional hazard model. Parameters are assumed to follow non-informative priors, and the posterior distributions are calculated using the partial likelihood of the Cox proportional hazard model. At each interim analysis, decisions can be made for a subgroup or the overall cohort. In addition, treatment-sensitive patients can be selected based on a biomarker cutoff value, which is determined by searching over the range of biomarker values and picking the one with the highest conditional posterior probability of achieving the target treatment effect. Simulations were conducted to compare the proposed method against both a similar method without enrichment and a design using a step-function to model marker-treatment interaction effects without enrichment. The maximum sample size considered was 240 with two interim analyses, and the assumed target hazard ratio was 0.6. The results show that the proposed BAPER method decreases the average number of enrolled patients who will not experience the targeted treatment effect, compared to designs without patient selection. Also, BAPER has a higher probability of correctly identifying the cutoff point that achieves the target hazard ratio. However, BAPER has certain restrictions: the biomarker cannot be prognostic, as the main effect for the biomarker is excluded from the proportional hazard model. Also, the design does not consider the distribution of the biomarker values themselves, so a larger sample size is required when the prevalence of the treatment sensitive (or insensitive) population is small.

Focusing on an optimal decision threshold for a binary biomarker that is either potentially predictive or both prognostic and predictive, Krisam and Kieser proposed a new class of interim decision rules for a two-stage, two-arm adaptive enrichment design [ 19 ]. This approach is an extension of Jenkins et al.’s design but with a binary endpoint instead of a time-to-event outcome [ 20 ]. Initially, their trial randomizes all patients from two distinct subgroups (i.e., a binary biomarker), assuming one subgroup will have greater benefit, and the sample size is fixed per stage and treatment group. At the first interim analysis, the trial might stop early for futility, continue enrolling only the marker-positive group, or continue enrolling the full population, using Hochberg multiplicity-corrected p-values for these decisions. When the full population proceeds to the second stage, it remains possible that efficacy testing will be performed both overall and in the treatment-sensitive subgroup if the biomarker is found to be predictive or prognostic, or only within the total population if the biomarker is not predictive. The critical boundaries for subgroup decisions minimize the Bayes risk of a quadratic loss function by setting the roots of partial derivatives as optimal thresholds, assuming the estimated treatment effects follow bivariate normal distributions with design parameters from uniform prior distributions. A relevance threshold for the effect size, which serves as the minimal clinically meaningful effect, also needs to be prespecified. Optimal decision threshold tables are presented for a biomarker that is predictive, both predictive and prognostic, or non-informative, with sample sizes ranging from 20 to 400 and subgroup prevalence values of 0.1, 0.25, and 0.5 considered. In their simulations, the sample size is 200 per group per stage (for a total trial sample size of 800), the treatment effect (response rate) in one of the subgroups is 0.15, and the biomarker is both predictive and prognostic. Optimal decision rules with three different assumptions for the biomarkers (predictive, predictive and prognostic, non-informative) and subgroup prevalence are compared with a rule based only on relevance thresholds. Power is increased under the proposed decision rules when the correct biomarker assumption is made. Since the decision thresholds incorporate sample size and subgroup prevalence information, one major limitation is that knowledge about the biomarkers must be strong enough pre-trial to prespecify the required parameters.

Nesting frequentist testing procedures within a Bayesian framework, Simon and Simon proposed a group-sequential randomized adaptive enrichment trial design that uses frequentist hypothesis tests to control Type I error but Bayesian modeling to select treatment-sensitive subgroups and estimate effect sizes [17]. The primary endpoint in their models is binary, and multiple continuous biomarkers are allowed, comprising a vector of covariates for each patient. Patients are sequentially enrolled in a total of K blocks, and enrollment criteria for the next block are refined by a decision function, building on the block adaptive enrichment design of Simon and Simon [21]. The final analysis is based on inverse normal combination test statistics using data from the entire trial. A prior for the response rate in each arm, which depends on the biomarker covariates, needs to be prespecified, along with a utility function. Different utility functions can be applied according to the trial's goal; the one adopted here is the expected future patient outcome penalized by accrual time. Using the posterior conditional on information from the previous blocks, simulations are conducted to find the optimal enrollment criteria under the utility function. The expected treatment effect given covariates can be estimated from the posterior predictive distribution for the response rate at the end of the trial. In the presented simulation study, there are two continuous biomarkers and 300 patients accrued in two or three enrollment blocks, with three logistic and three cutpoint models for the biomarker-response relationships. An unenriched design and an adaptive enrichment strategy with prespecified fixed cutpoints are compared with the proposed design. The two adaptive enrichment designs have higher power than the unenriched design to detect a treatment-sensitive subgroup, and the enrichment designs have higher power with three versus two enrollment blocks. Compared with the fixed-cutpoint enrichment method, the proposed design generally identifies the treatment-sensitive subgroup correctly while avoiding non-ideal predetermined cutoff points for subsequent enrollment criteria. Though the effect size estimation is biased under the proposed design, the bias is more severe under the unenriched design.
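
The final analysis mentioned above relies on an inverse normal combination of stage-wise test statistics. A minimal sketch of that combination rule, with illustrative one-sided p-values and equal prespecified weights (w1^2 + w2^2 = 1), is:

```python
from scipy.stats import norm

def inverse_normal_combination(p1, p2, w1, w2):
    """Combine two stage-wise one-sided p-values with prespecified weights."""
    z = w1 * norm.isf(p1) + w2 * norm.isf(p2)   # norm.isf(p) = Phi^{-1}(1 - p)
    return z, norm.sf(z)                        # combined z-score and one-sided p-value

z, p = inverse_normal_combination(p1=0.04, p2=0.02, w1=0.5 ** 0.5, w2=0.5 ** 0.5)
print(f"combined z = {z:.3f}, combined one-sided p = {p:.4f}")
```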

Graf et al. proposed to optimize design decisions using utility functions from the sponsor and public health points of view in the context of a two-stage adaptive enrichment design with a continuous biomarker [22]. Similar to Simon and Simon's method, the proposed design's decisions are based on frequentist hypothesis tests, while the utility functions are evaluated under a Bayesian approach. In this design, patients are classified into marker-positive and marker-negative groups at enrollment, and decisions can be made with respect to the full population or the marker-positive subgroup only. Closed testing procedures along with Hochberg tests are used to control the family-wise Type I error rate. Parameters called "gain," which quantify the benefit rendered by the trial to the sponsor and society, need to be prespecified. The utility function under the sponsor view is the sum of the gains multiplied by the respective probabilities of claiming treatment efficacy in the full population or in the marker-positive subgroup only. In addition to gains and success probabilities, the public health utility function also considers the true effect sizes in the subgroups, with safety risk as a penalization parameter. Prior distributions are used to model treatment effects in each subgroup to account for uncertainty, but the authors assume that only the marker-negative group can be ineffective, and only point priors are used, which leads to a single probability that the treatment is effective in just the marker-positive subgroup rather than in the full population. This optimized adaptive design is compared with a non-adaptive design with the same total sample size. The adaptive design provides larger expected utility under both utility functions only for intermediate values of the gain from treatment efficacy and of the prior point probability. One limitation is that these utility functions can only compare designs with the same total sample size, and the cost of running the trial is not included.
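
The structure of the sponsor-view utility can be sketched as prespecified gains weighted by the probabilities of the corresponding efficacy claims. The gains and probabilities below are illustrative placeholders, not values from the paper.

```python
def sponsor_utility(gain_full, gain_subgroup, prob_claim_full, prob_claim_subgroup_only):
    """Expected sponsor utility: gains weighted by the probabilities of each claim."""
    return gain_full * prob_claim_full + gain_subgroup * prob_claim_subgroup_only

print(sponsor_utility(gain_full=1000, gain_subgroup=400,
                      prob_claim_full=0.55, prob_claim_subgroup_only=0.20))
```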

Serving as an extension of Graf et al.'s work by incorporating a term for trial cost in the utility functions, Ondra et al. derived an adaptive two-stage partial enrichment design for a normally distributed outcome with subgroup selection and optimization of the second-stage sample size [23]. In a partial enrichment design, the proportion of marker-positive subjects enrolled does not need to match the true prevalence. At the interim analysis, the trial can be stopped for futility, or continued in only the marker-positive population or in the full population. The final analysis is based on the weighted inverse normal combination function with Bonferroni correction. The utility functions used for optimization take societal or sponsor perspectives. Expected utility is calculated by numerical integration over the joint sampling distribution of the two stage-wise test statistics, with prior distributions for the treatment effect in each subgroup. The optimal second-stage sample size maximizes the conditional expected utility given the first-stage test statistics and sample size, and the optimal first-stage sample size maximizes the utility using the solved optimal second-stage numbers. The optimization is solved recursively by dynamic programming, yielding the optimal design in terms of sample size. The optimized adaptive enrichment design is compared with an optimized single-stage design for subgroup prevalence ranging from 10 to 90%, with both weak and strong predictive biomarker priors considered. Expected utilities are higher under both the sponsor and societal views for the adaptive design. Also, even if the prior distribution for the effect size used in the design differs from the true distribution, the proposed adaptive design is robust in terms of expected utility when the biomarker's prevalence is high enough. One limitation is that the endpoint needs to be observed immediately, which might be addressed by a short-term surrogate endpoint, though validated short-term surrogate endpoints remain rare in oncology.
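
The second-stage optimization idea can be sketched with a toy grid search: choose the stage-two sample size that maximizes gain times the probability of success minus the cost of enrollment, conditional on the interim effect estimate. This uses a normal-approximation power formula and illustrative numbers, not the authors' dynamic-programming solution or their actual utility functions.

```python
import numpy as np
from scipy.stats import norm

def stage_two_power(delta, sigma, n_per_arm, alpha=0.025):
    """Approximate power of a one-sided two-sample z-test with n_per_arm per group."""
    return norm.cdf(delta / sigma * np.sqrt(n_per_arm / 2) - norm.isf(alpha))

delta_hat, sigma = 0.30, 1.0           # interim effect estimate and (known) SD
gain, cost_per_patient = 10_000, 5     # prespecified gain and per-patient cost
candidates = np.arange(50, 1001, 10)   # candidate per-arm stage-two sample sizes

utility = gain * stage_two_power(delta_hat, sigma, candidates) - cost_per_patient * 2 * candidates
print(f"utility-maximizing stage-two sample size per arm: {candidates[np.argmax(utility)]}")
```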

Frequentist Approaches

Fisher et al. proposed an adaptive multi-stage enrichment design that allows subgroup selection at an interim analysis with continuous or binary outcomes [24]. Two subpopulations are predefined, and the goal is to claim treatment efficacy in one of the subpopulations or in the full population. The cumulative test statistics for the subgroups and the full population are calculated at each interim analysis and compared against efficacy and non-binding futility boundaries. To control the family-wise Type I error rate (FWER), two methods for constructing efficacy boundaries are presented. One, proposed by Rosenblum et al., spends alpha based on the covariance matrix of the test statistics across populations (the two subpopulations and the full population) and interim stages [16]. The other is the alpha reallocation approach [25, 26]. The design parameters, including the sample size per stage and the futility boundaries, are optimized to minimize the expected number enrolled or the expected trial duration using simulated annealing, with constraints on power and Type I error. If the resulting design does not meet the power requirement, the total sample size is increased until it does. The optimized adaptive design is compared with a single-stage design, an optimized single-stage design, and a multi-stage group sequential design with O'Brien-Fleming or Pocock boundaries using actual trial data from MISTIE [27] and ADNI [28]. For the MISTIE trial, the proposed designs are optimized for the expected number enrolled, which is lower than for the optimized single-stage design and the group-sequential design, although the maximum number enrolled remains lowest in the simple single-stage design. In the ADNI trial, when the expected trial duration is optimized, the proposed design has a slightly shorter expected duration but a longer maximum duration than the optimized single-stage design.
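
As a generic illustration of alpha spending for group-sequential efficacy boundaries (the comparison designs above use O'Brien-Fleming or Pocock boundaries), the Lan-DeMets O'Brien-Fleming-type spending function can be computed as below. This is only an illustration; the proposed design spends alpha jointly across populations using the covariance of the test statistics.

```python
from scipy.stats import norm

def obrien_fleming_spending(t, alpha=0.025):
    """Cumulative one-sided alpha spent at information fraction t (0 < t <= 1)."""
    return 2 * (1 - norm.cdf(norm.isf(alpha / 2) / t ** 0.5))

for t in (0.25, 0.5, 0.75, 1.0):
    print(f"information fraction {t:.2f}: cumulative alpha spent = {obrien_fleming_spending(t):.5f}")
```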

Similar to the aforementioned Bayesian approaches without predefined subpopulations, Zhang et al. proposed a two-stage adaptive enrichment design that does not require predefined subgroups [29]. The primary outcome is binary, and a collection of baseline covariates, including biomarkers and demographics, is used to define a treatment-sensitive subgroup. The selection criteria are based on a prespecified function modeling the treatment effect and the marker-by-treatment interaction using first-stage data. The final treatment effect estimate is a weighted average of the estimates from each stage. To minimize the resubstitution bias from using first-stage data in subsequent subgroup selection and inference, four methods for estimating the first-stage treatment effect and its variance are discussed: a naive approach, cross-validation, the nonparametric bootstrap, and the parametric bootstrap. To compare these estimation methods, ECHO [30] and THRIVE [31] trial data are used for simulation with a total sample size of 1000. The first stage has 250, 500, or 750 subjects, and outcomes are simulated from a logistic regression model. The results show that the bootstrap method is more favorable than both the naive estimate (which has a large empirical bias) and the cross-validation method (which is overly conservative). The weight for each stage and the first-stage sample size need to be selected carefully to reach a small root mean squared error (RMSE) and close-to-nominal one-sided coverage. Though a trial can stop due to inability to recruit to a subset resulting from restricted enrollment, the proposed method does not include an early stopping rule for futility or efficacy.
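
The resubstitution bias that these estimators try to correct can be seen in a toy simulation: when the same stage-one data are used both to pick the "best" biomarker cutoff and to estimate the effect in the selected subgroup, the naive estimate is optimistic even when no treatment effect exists. The cutoffs, sample sizes, and outcome model below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 250, 2000
naive_estimates = []
for _ in range(n_sims):
    biomarker = rng.uniform(size=n)
    treatment = rng.integers(0, 2, size=n)
    outcome = rng.normal(size=n)                     # no true treatment effect anywhere
    # Pick the cutoff whose "marker-positive" subgroup shows the largest observed effect.
    best_effect = -np.inf
    for cut in np.quantile(biomarker, [0.25, 0.5, 0.75]):
        sel = biomarker >= cut
        effect = outcome[sel & (treatment == 1)].mean() - outcome[sel & (treatment == 0)].mean()
        best_effect = max(best_effect, effect)
    naive_estimates.append(best_effect)
print(f"mean naive estimate in the selected subgroup: {np.mean(naive_estimates):.3f} (true effect = 0)")
```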

In order to reduce sample size while assessing the treatment effect in the full population, Matsui and Crowley proposed a two-stage subgroup-focused sequential design for time-to-event outcomes, which could extend to multiple stages [ 32 ]. In this design, patients are classified into two subgroups by a dichotomized predictive marker, with the assumption that the experimental treatment is more efficacious in the marker-positive subgroup. The trial can proceed to the second stage with one of the subgroups, or the full population, but treatment efficacy is only tested in the marker-positive group or the full population at the final analysis. Choices of testing procedures are fixed-sequence and split-alpha. At the interim analysis, a superiority boundary for the marker-positive subgroup and a futility boundary for the marker-negative subgroup are constructed. The superiority boundary is calculated to control the study-wide alpha level, while the futility boundary is based on a Bayesian posterior probability of efficacy with a non-informative prior. The required sample sizes for each subgroup are calculated separately, and the hazard ratio for the marker-positive subgroup is recommended to be 0.05–0.70 under this application. The proposed design is compared with a traditional all-comers design, an enriched design with only marker-positive subjects, a two-stage enriched design, and a traditional marker-stratified design. Different scenarios are considered including those with no treatment effect, constant treatment effect in both groups with hazard ratio (HR) = 0.75, a nearly qualitative interaction with HRs = 0.65 and 1, and a quantitative interaction with HRs = 0.7 and 0.8. The marker prevalence is set to 0.4, and the accrual rate is 200 patients per year. When using the split-alpha test, the proposed design has greater than 80% power to reject any null hypothesis in the alternative cases, but the traditional marker-stratified design also provides enough power under all cases. The number screened and the number randomized are reduced for the proposed design compared to the traditional marker stratified design, but the reduction is only moderate.
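
Subgroup-specific sample sizes for time-to-event outcomes of this kind are typically driven by the required number of events. A minimal sketch using the standard Schoenfeld approximation for a one-sided log-rank test under 1:1 randomization is shown below; the exact calculation in the paper may differ.

```python
from math import ceil, log
from scipy.stats import norm

def required_events(hr, alpha=0.025, power=0.80):
    """Schoenfeld approximation: events needed to detect hazard ratio `hr` (1:1 allocation)."""
    return ceil(4 * (norm.isf(alpha) + norm.isf(1 - power)) ** 2 / log(hr) ** 2)

print(required_events(hr=0.70))   # target effect in the marker-positive subgroup
print(required_events(hr=0.75))   # constant-effect scenario from the comparison
```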

To determine whether the full population or only the biomarker-positive subgroup benefits more from the experimental treatment, Uozumi and Hamada proposed a two-stage adaptive population selection design for a time-to-event outcome, an extension of methods from Brannath et al. and Jenkins et al. [20, 33, 34]. The main extension is that the decision-making strategy at the interim analysis incorporates both progression-free survival (PFS) and overall survival (OS) information. Also, OS is decomposed into time-to-progression (TTP) and post-progression survival (PPS) when tumor progression has occurred, to account for the correlation between OS and PFS. The combination test approach is used for the final analysis based on Simes' procedure [35]. The hypothesis rejection rule for each population is a weighted inverse normal combination function with prespecified weights based on the expected number of OS events in each stage. At the interim analysis, a statistical model from Fleischer et al. under the semi-competing risks framework is applied to account for the correlation between OS and PFS [36, 37]. The interim decision rule uses the predictive power approach in each population, extending Brannath et al.'s method from a single endpoint to multiple endpoints, with higher weight on the PFS data because PFS is observed more quickly. In the simulation, a dichotomized biomarker is used with 50% prevalence. Four scenarios are considered, where the hazard ratio in the marker-positive subgroup is always 0.5 and the hazard ratios are higher in the marker-negative subgroup. For simplicity, the HR is the same for TTP, PPS, and death. The FWER is controlled in all cases, but the procedure is somewhat conservative when the treatment is effective. The proposed design has a higher probability of identifying the treatment-sensitive population at the interim analysis, particularly when the PPS effect is large; when the PFS effect is small, these probabilities are similar whether OS alone, PFS alone, or the combined endpoints are used. One limitation of this design is that sample size calculations are not considered.
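
The Simes procedure underlying the combination-test approach can be sketched for two populations (full population and marker-positive subgroup); the p-values are illustrative.

```python
def simes_global_p(pvalues):
    """Simes' combined p-value for the intersection (global) null hypothesis."""
    m = len(pvalues)
    return min(m * p / (i + 1) for i, p in enumerate(sorted(pvalues)))

print(simes_global_p([0.03, 0.20]))   # e.g., full-population and subgroup p-values
```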

Instead of a single primary endpoint, Sinha et al. suggested a two-stage Phase III design with population enrichment for two binary co-primary endpoints, an extension of Magnusson and Turnbull's work to co-primary endpoints [15, 38]. The two binary endpoints are assumed to be independent, and the efficacy goal must be reached for both endpoints. With two distinct predefined subgroups, a set of decision rules stops enrollment of non-responsive subgroups using efficient score statistics. The futility and efficacy boundary values, which do not depend on the marker prevalence, are the same for both endpoints because of the independence assumption. The lower and upper stopping boundaries are calculated via alpha spending functions, and the FWER is strongly controlled. Simulations were conducted assuming biomarker prevalences of 0.25 or 0.75 and weighted subgroup effect sizes of 0, 1, and 2 as the means of the efficient score statistics under a normal distribution. The results show that the proposed design can reduce false-negative results when treatment effects are heterogeneous between subgroups. The authors note that the design could be extended to a bivariate continuous outcome, while an extension to bivariate survival outcomes would be more challenging.

Published Comparisons and Examination of Features

Kimani, Todd, and Stallard derived a uniformly minimum variance unbiased point estimator (UMVUE) of the treatment effect in an adaptive two-arm, two-stage enrichment design with a binary biomarker [39]. Based on the Rao-Blackwell theorem, the UMVUE for the treatment effect conditional on the selected subgroup is derived both with and without prior information on marker prevalence. The proposed estimator is compared with the naive estimator, which is biased but has a lower mean squared error (MSE) when prevalence is known. The proposed estimator is robust both with and without prior information on marker prevalence.

Kimani et al. developed estimators for a two-stage adaptive enrichment design with a normally distributed outcome [40]. A predictive continuous biomarker is used to partition the full population into a prespecified number of subgroups, and the cutoff values are determined at the interim analysis based on stage I observations. To estimate the treatment effect after enrichment for the selected subgroup, a naive estimator, a uniformly minimum variance conditional unbiased estimator (UMVCUE), an unbiased estimator, single-iteration and multiple-iteration bias-adjusted estimators, and two shrinkage estimators are derived and compared. Though no estimator is superior in terms of both bias and MSE across all scenarios, the UMVUE is recommended by the authors due to its mean unbiasedness.

Tang et al. evaluated several proposed adaptive enrichment designs with a binary biomarker against a traditional group sequential design (GSD) for a time-to-event outcome [41]. Type I error is controlled, and the subpopulation is selected by Bayesian predictive power. Adaptive design A selects the subgroup after considering the futility and efficacy stopping decisions. Design B selects the subgroup when the targeted number of events is observed in the full population, which can be earlier than the interim analysis. Design C selects the subgroup only after the full population has reached a futility rule. Design D, proposed by Wang et al. [42], proceeds with the subgroup or the full population by checking the treatment effect in the complementary subgroup. When an enhanced treatment effect exists in the subpopulation, all of these adaptive designs can improve study power compared to the GSD. Furthermore, among the adaptive designs, Design C generally provides the highest power across scenarios.

Benner and Kieser explored how the timing of the interim analysis affects power in adaptive enrichment designs with a fixed total sample size for a continuous outcome and a binary marker [43]. Two subgroup selection rules are considered: one based on the estimated difference in treatment effect between the subgroup and the full population (rather than the subgroup's complement), and one based on the absolute estimated treatment effect. Under the difference-based rule, early timing increases power when the marker prevalence and marker cutoff values are low; however, the timing of the interim analysis has little impact on power when marker prevalence is high. If the absolute treatment effect is used instead, earlier timing generally leads to power loss. Power depends more strongly on the marker threshold, prevalence, and treatment effect size when the interim analysis is conducted later, after half of the total sample size have observed outcomes.

Kunzmann et al. investigated the performance of six different estimators besides the maximum likelihood estimator (MLE) for a two-stage adaptive enrichment design with a continuous outcome [44]. These estimators are the empirical Bayes estimator (EBE) [45, 46], a parametric bootstrap estimator [47], the conditional moment estimator (CME) [48], and the UMVCUE, together with two hybrid estimators combining the UMVCUE with the MLE and with the CME [49]. The hybrid of the UMVCUE and the CME reduces bias across all considered scenarios and is recommended by the authors, though at the cost of a larger RMSE.

Conclusions and Future Needs in Adaptive Enrichment Trial Designs

In this review article, we have given an overview of traditional enrichment and adaptive enrichment designs, outlined their limitations, and described recent extensions and modifications to adaptive enrichment design strategies. Both Bayesian and frequentist perspectives in handling statistical issues of these designs were discussed in detail, along with important considerations for design parameters.

Although the adaptive enrichment designs we have reviewed offer theoretical benefits such as early subgroup identification and early decision-making resulting in sample size reduction, we caution that selection and implementation of any of these designs requires acceptance of substantial additional trial complexity, and special consideration of the disease setting, endpoints, and markers at hand. For any of these trial designs to have advantages over a simple randomized design followed by retrospective biomarker-focused analyses, the following should be true: the primary endpoint should be quickly observable relative to the pace of accrual; a sample size large enough to detect moderately sized subgroup effects of clinical interest must be achievable in a reasonable time frame; and the experimental treatment under study must have sufficiently strong preliminary evidence (e.g., from earlier-phase studies) of a mechanism of action related to the candidate biomarker(s). If any of these criteria are not met, one runs the serious risk of conducting a study that is far less efficient than a standard design that is not biomarker-driven. In considering use of any design reviewed here, a trial biostatistician should meet with trial investigators and stakeholders to discuss the assumptions and requirements of the different design options. The statistician should also prospectively understand and quantify the impact of any potential deviations from these assumptions while still in the trial planning stage (e.g., by using simulation studies).

Each of the designs we discussed also has associated pros and cons and is more suitable for application in some settings than others. To guide selection of a particular design for a particular context, we summarize design attributes (e.g., applicable primary endpoint types, number of biomarkers, decision rules, and other structural differences) as well as pros and cons in Table 1. For example, if there is no predefined biomarker subgroup and predictive biomarker discovery is required, the designs proposed by Xu et al. and Zhang et al. could be considered [10, 29]. Where Bayesian methods for estimation and interim decision-making using utility functions are desired but final frequentist hypothesis testing is necessary, e.g., for regulatory purposes, the designs by Simon and Simon, Graf et al., or Ondra et al. may be appropriate [17, 22, 23]. Where strong control of the Type I error rate is required (e.g., in a later-phase application), the designs by Matsui and Crowley, Fisher et al., and Uozumi and Hamada may be referenced [24, 32, 34].

Overall, adaptive enrichment trial designs tend to increase study efficiency while minimizing subsequent study participation among patients showing a low likelihood of benefit based on early trial results [ 21 ]. Biomarker-driven designs that reliably identify or validate predictive biomarker relationships and their thresholds with sufficient power to achieve phase II or III objectives continue to be of interest and warrant further development. Designs that make better use of truly continuous (versus dichotomous) marker-efficacy relationships are essential for future research.

Data Availability

No datasets were generated or analysed during the current study.

Mittendorf EA, Zhang H, Barrios CH, Saji S, Jung KH, Hegg R, Koehler A, Sohn J, Iwata H, Telli ML, Ferrario C, Punie K, Penault-Llorca F, Patel S, Duc AN, Liste-Hermoso M, Maiya V, Molinero L, Chui SY, Harbeck N. Neoadjuvant atezolizumab in combination with sequential nab-paclitaxel and anthracycline-based chemotherapy versus placebo and chemotherapy in patients with early-stage triple-negative breast cancer (IMpassion031): a randomised, double-blind, phase 3 trial. Lancet. 2020;396(10257):1090–100. https://doi.org/10.1016/S0140-6736(20)31953-X.

Jones RL, Ravi V, Brohl AS, et al. Efficacy and safety of TRC105 plus pazopanib vs pazopanib alone for treatment of patients with advanced angiosarcoma: a randomized clinical trial. JAMA Oncol. 2022;8(5):740–7.

Gianni L, Dafni U, Gelber RD, Azambuja E, Muehlbauer S, Goldhirsch A, et al. Treatment with trastuzumab for 1 year after adjuvant chemotherapy in patients with HER2-positive early breast cancer: a 4-year follow-up of a randomised controlled trial. Lancet Oncol. 2011;12(3):236–44.

Slamon DJ, Godolphin W, Jones LA, Holt JA, Wong SG, Keith DE, et al. Studies of the HER-2/NEU proto-oncogene in human breast and ovarian cancer. Science. 1989;244(4905):707–12.

Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N Engl J Med. 2001;344(11):783–92.

Burguin A, Furrer D, Ouellette G, Jacob S, Diorio C, Durocher F. Trastuzumab effects depend on HER2 phosphorylation in HER2-negative breast cancer cell lines. PLoS ONE. 2020;15(6):e0234991.

Fehrenbacher L, Cecchini RS, Geyer CE, Rastogi P, Costantino JP, Atkins JN, et al. NSABP B-47/NRG Oncology Phase III randomized trial comparing adjuvant chemotherapy with or without trastuzumab in high-risk invasive breast cancer negative for HER2 by FISH and with IHC 1+ or 2+. J Clin Oncol. 2019;38(5):444–53.

Grothey A, Strosberg JR, Renfro LA, Hurwitz HI, Marshall JL, Safran H, et al. A randomized, double-blind, placebo-controlled phase ii study of the efficacy and safety of monotherapy ontuxizumab (MORAb-004) plus best supportive care in patients with chemorefractory metastatic colorectal cancer. Clin Cancer Res. 2018;24(2):316–25.

Morphotek investigation in colorectal cancer: research of MORAb-004 (MICRO). 2012. https://clinicaltrials.gov/study/NCT01507545

Xu Y, Constantine F, Yuan Y, Pritchett YL. ASIED: a Bayesian adaptive subgroup-identification enrichment design. J Biopharm Stat. 2020;30(4):623–38.

Guo W, Ji Y, Catenacci DV. A subgroup cluster-based Bayesian adaptive design for precision medicine. Biometrics. 2017;73(2):367–77.

Xu Y, Trippa L, Müller P, Ji Y. Subgroup-based adaptive (SUBA) designs for multi-arm biomarker trials. Stat Biosci. 2016;8:159–80.

Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4(1):266–98.

Park Y, Liu S, Thall PF, Yuan Y. Bayesian group sequential enrichment designs based on adaptive regression of response and survival time on baseline biomarkers. Biometrics. 2022;78(1):60–71.

Magnusson BP, Turnbull BW. Group sequential enrichment design incorporating subgroup selection. Stat Med. 2013;32(16):2695–714.

Rosenblum M, Luber B, Thompson RE, Hanley D. Group sequential designs with prospectively planned rules for subpopulation enrichment. Stat Med. 2016;35(21):3776–91.

Simon N, Simon R. Using Bayesian modeling in frequentist adaptive enrichment designs. Biostatistics. 2018;19(1):27–41.

Ohwada S, Morita S. Bayesian adaptive patient enrollment restriction to identify a sensitive subpopulation using a continuous biomarker in a randomized phase 2 trial. Pharm Stat. 2016;15(5):420–9.

Krisam J, Kieser M. Optimal decision rules for biomarker-based subgroup selection for a targeted therapy in oncology. Int J Mol Sci. 2015;16(5):10354–75.

Jenkins M, Stone A, Jennison C. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharm Stat. 2011;10(4):347–56.

Simon N, Simon R. Adaptive enrichment designs for clinical trials. Biostatistics. 2013;14(4):613–25.

Graf AC, Posch M, Koenig F. Adaptive designs for subpopulation analysis optimizing utility functions. Biom J. 2015;57(1):76–89.

Ondra T, Jobjörnsson S, Beckman RA, Burman C-F, König F, Stallard N, Posch M. Optimized adaptive enrichment designs. Stat Methods Med Res. 2019;28(7):2096–111.

Fisher A, Rosenblum M; Alzheimer's Disease Neuroimaging Initiative. Stochastic optimization of adaptive enrichment designs for two subpopulations. J Biopharm Stat. 2018;28(5):966–82.

Bretz F, Maurer W, Brannath W, Posch M. A graphical approach to sequentially rejective multiple test procedures. Stat Med. 2009;28(4):586–604.

Maurer W, Bretz F. Multiple testing in group sequential trials using graphical approaches. Stat Biopharm Res. 2013;5(4):311–20.

Morgan T, Zuccarello M, Narayan R, Keyl P, Lane K, Hanley D. Preliminary findings of the minimally-invasive surgery plus rtPA for intracerebral hemorrhage evacuation (MISTIE) clinical trial. In: Cerebral haemorrhage. 2008. pp. 147–51

Alzeheimer’s disease neuroimaging intiative (ADNI). 2017. https://adni.loni.usc.edu

Zhang Z, Chen R, Soon G, Zhang H. Treatment evaluation for a data-driven subgroup in adaptive enrichment designs of clinical trials. Stat Med. 2018;37(1):1–11.

Molina J-M, Cahn P, Grinsztejn B, Lazzarin A, Mills A, Saag M, et al. Rilpivirine versus efavirenz with tenofovir and emtricitabine in treatment-naive adults infected with HIV-1 (ECHO): a phase 3 randomised double-blind active-controlled trial. Lancet. 2011;378(9787):238–46.

Cohen CJ, Andrade-Villanueva J, Clotet B, Fourie J, Johnson MA, Ruxrungtham K, et al. Rilpivirine versus efavirenz with two background nucleoside or nucleotide reverse transcriptase inhibitors in treatment-naive adults infected with HIV-1 (THRIVE): a phase 3, randomised, non-inferiority trial. Lancet. 2011;378(9787):229–37.

Matsui S, Crowley J. Biomarker-stratified phase III clinical trials: enhancement with a subgroup-focused sequential design. Clin Cancer Res. 2018;24(5):994–1001.

Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, Racine-Poon A. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Stat Med. 2009;28(10):1445–63.

Uozumi R, Hamada C. Interim decision-making strategies in adaptive designs for population selection using time-to-event endpoints. J Biopharm Stat. 2017;27(1):84–100.

Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73(3):751–4.

Fine JP, Jiang H, Chappell R. On semi-competing risks data. Biometrika. 2001;88(4):907–19.

Fleischer F, Gaschler-Markefski B, Bluhmki E. A statistical model for the dependence between progression-free survival and overall survival. Stat Med. 2009;28(21):2669–86.

Sinha AK, Moye L III, Piller LB, Yamal J-M, Barcenas CH, Lin J, Davis BR. Adaptive group-sequential design with population enrichment in phase 3 randomized controlled trials with two binary co-primary endpoints. Stat Med. 2019;38(21):3985–96.

Kimani PK, Todd S, Stallard N. Estimation after subpopulation selection in adaptive seamless trials. Stat Med. 2015;34(18):2581–601.

Kimani PK, Todd S, Renfro LA, Stallard N. Point estimation following two-stage adaptive threshold enrichment clinical trials. Stat Med. 2018;37(22):3179–96.

Tang R, Ma X, Yang H, Wolf M. Biomarker-defined subgroup selection adaptive design for phase III confirmatory trial with time-to-event data: comparing group sequential and various adaptive enrichment designs. Stat Biosci. 2018;10:371–404.

Wang S-J, O’Neill RT, Hung HJ. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharm Stat. 2007;6(3):227–44.

Benner L, Kieser M. Timing of the interim analysis in adaptive enrichment designs. J Biopharm Stat. 2018;28(4):622–32.

Kunzmann K, Benner L, Kieser M. Point estimation in adaptive enrichment designs. Stat Med. 2017;36(25):3935–47.

Carreras M, Brannath W. Shrinkage estimation in two-stage adaptive designs with midtrial treatment selection. Stat Med. 2013;32(10):1677–90.

Hwang JT. Empirical Bayes estimation for the means of the selected populations. Sankhya. 1993;55:285–304.

Pickard MD, Chang M. A flexible method using a parametric bootstrap for reducing bias in adaptive designs with treatment selection. Stat Biopharm Res. 2014;6(2):163–74.

Luo X, Li M, Shih WJ, Ouyang P. Estimation of treatment effect following a clinical trial with adaptive design. J Biopharm Stat. 2012;22(4):700–18.

Cohen A, Sackrowitz HB. Two stage conditionally unbiased estimators of the selected mean. Stat Probab Lett. 1989;8(3):273–8.

Funding

Open access funding provided by SCELC, Statewide California Electronic Library Consortium. Funding was provided by the National Cancer Institute, United States (5U10CA180899-11).

Author information

Authors and Affiliations

Division of Biostatistics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, USA

Yue Tu & Lindsay A. Renfro

Contributions

Yue Tu: primary researcher and writer. Lindsay Renfro: mentored Yue Tu, guided the content and organization of the review article, and performed editing.

Corresponding author

Correspondence to Lindsay A. Renfro.

Ethics declarations

Conflict of interest.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Tu, Y., Renfro, L.A. Latest Developments in “Adaptive Enrichment” Clinical Trial Designs in Oncology. Ther Innov Regul Sci (2024). https://doi.org/10.1007/s43441-024-00698-3

Received : 20 March 2024

Accepted : 30 August 2024

Published : 13 September 2024

DOI : https://doi.org/10.1007/s43441-024-00698-3

Keywords

  • Biomarker-driven trials
  • Adaptive enrichment
  • Trial design
  • Cancer clinical trials

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.

Type I and Type II Errors and Statistical Power.

Jacob Shreffler; Martin R. Huecker.

Last Update: March 13, 2023.

  • Definition/Introduction

Healthcare professionals, when determining the impact of patient interventions in clinical studies or research endeavors that provide evidence for clinical practice, must distinguish well-designed studies with valid results from studies with research design or statistical flaws. This article will help providers determine the likelihood of type I or type II errors and judge the adequacy of statistical power (see Table. Type I and Type II Errors and Statistical Power). Then one can decide whether or not the evidence provided should be implemented in practice or used to guide future studies.

  • Issues of Concern

Having an understanding of the concepts discussed in this article will allow healthcare providers to accurately and thoroughly assess the results and validity of medical research. Without an understanding of type I and II errors and power analysis, clinicians could make poor clinical decisions without evidence to support them.

Type I and Type II Errors

Type I and Type II errors can lead to confusion as providers assess medical literature. A vignette that illustrates the errors is the Boy Who Cried Wolf. First, the citizens commit a type I error by believing there is a wolf when there is not. Second, the citizens commit a type II error by believing there is no wolf when there is one.

A type I error occurs in research when we reject the null hypothesis and erroneously state that the study found significant differences when there was in fact no difference. In other words, it is equivalent to saying that the groups or variables differ when, in fact, they do not; this is a false positive. [1]  An example of a research hypothesis is below:

Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

For our example, if we were to state that Drug 23 significantly reduced symptoms of Disease A compared to Drug 22 when it did not, this would be a type I error. Committing a type I error can be very grave in specific scenarios. For example, if we moved ahead with Drug 23 based on our research findings even though there was actually no difference between groups, and the drug costs significantly more money for patients or has more side effects, then we would raise healthcare costs, cause iatrogenic harm, and fail to improve clinical outcomes. If a p-value is used to examine type I error, the lower the p-value, the lower the likelihood of a type I error occurring.
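
This interpretation can be illustrated with a small simulation: when there is truly no difference between Drug 22 and Drug 23, about 5% of trials will still produce p < 0.05 purely by chance, which is the type I error rate. The numbers below are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_trials, n_per_group = 10_000, 50
false_positives = 0
for _ in range(n_trials):
    drug22 = rng.normal(loc=0.0, scale=1.0, size=n_per_group)   # no real difference
    drug23 = rng.normal(loc=0.0, scale=1.0, size=n_per_group)   # between the two groups
    if ttest_ind(drug23, drug22).pvalue < 0.05:
        false_positives += 1
print(f"false-positive rate: {false_positives / n_trials:.3f}")  # close to 0.05
```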

A type II error occurs when we declare no differences or associations between study groups when, in fact, there were. [2]  As with type I errors, type II errors in certain situations cause problems. Picture an example with a new, less invasive surgical technique that was developed and tested in comparison to the more invasive standard care. Researchers would seek to show no differences between patients receiving the two treatment methods in health outcomes (a noninferiority study). If, however, the less invasive procedure actually resulted in less favorable health outcomes and the study failed to detect this, it would be a severe error. Table 1 provides a depiction of type I and type II errors.

(See Type I and Type II Errors and Statistical Power Table 1) 

A concept closely aligned with type II error is statistical power. Statistical power is a crucial part of the research process that is most valuable in the design and planning phases of studies, though it requires assessment when interpreting results. Power is the ability to correctly reject a null hypothesis that is indeed false. [3]  Unfortunately, many studies lack sufficient power and should be presented as having inconclusive findings. [4]  Power is the probability that a study will make correct decisions or detect an effect when one exists. [3] [5]

The power of a statistical test depends on the level of significance set by the researcher, the sample size, and the effect size, or the extent to which the groups differ based on treatment. [3]  Statistical power is critical for healthcare providers to decide how many patients to enroll in clinical studies. [4]  Power is strongly associated with sample size; when the sample size is large, power will generally not be an issue. [6]  Thus, when conducting a study with a low sample size, and ultimately low power, researchers should be aware of the likelihood of a type II error. The greater the N within a study, the more likely it is that a researcher will reject the null hypothesis. The concern with this approach is that a very large sample could show a statistically significant finding due to the ability to detect small differences in the dataset; thus, using p values alone based on a large sample can be troublesome.

It is essential to recognize that power can be deemed adequate with a smaller sample if the effect size is large. [6]  What is an acceptable level of power? Many researchers agree upon a power of 80% or higher as credible enough for determining the actual effects of research studies. [3]  Ultimately, studies with lower power will find fewer true effects than studies with higher power; thus, clinicians should be aware of the likelihood of a power issue resulting in a type II error. [7]  Unfortunately, many researchers, and providers who assess medical literature, do not scrutinize power analyses. Studies with low power may inhibit future work because they lack the ability to detect actual effects between variables; this could lead to potential impacts remaining undiscovered or being labeled as ineffective when they may in fact be effective. [7]

Medical researchers should invest time in conducting power analyses to sufficiently distinguish a difference or association. [3]  Luckily, there are many tables of power values as well as statistical software packages that can help to determine study power and guide researchers in study design and analysis. If choosing to utilize statistical software to calculate power, the following are necessary for entry: the predetermined alpha level, the proposed sample size, and the effect size the investigator(s) is aiming to detect. [2]  By utilizing power calculations on the front end, researchers can determine an adequate sample size to detect the effect and can confirm, based on the statistical findings, that sufficient power was actually achieved. [2]
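
As an example of such a calculation (illustrative numbers only), the per-group sample size for a two-sample t-test can be solved from the alpha level, the target power, and a standardized effect size (Cohen's d):

```python
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                          alternative='two-sided')
print(f"required sample size per group: {n_per_group:.1f}")   # roughly 64 per group
```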

  • Clinical Significance

By limiting type I and type II errors, healthcare providers can ensure that decisions based on research outputs are safe for patients. [8]  Additionally, while power analysis can be time-consuming, making inferences from low-powered studies can be inaccurate and irresponsible. Through the use of adequately designed studies that balance the likelihood of type I and type II errors, and through an understanding of power, providers and researchers can determine which studies are clinically significant and should, therefore, be implemented into practice.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts of Type I and II errors and power. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. They will also more effectively work in teams with other professionals. 


Table. Type I and Type II Errors and Statistical Power. Contributed by M Huecker, MD and J Shreffler, PhD

Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this page: Shreffler J, Huecker MR. Type I and Type II Errors and Statistical Power. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


