What Is Replication in Psychology Research?

Examples of replication in psychology.

  • Why Replication Matters
  • How It Works
  • What If Replication Fails?
  • The Replication Crisis
  • How Replication Can Be Strengthened

Replication refers to the repetition of a research study, generally with different situations and subjects, to determine if the basic findings of the original study can be applied to other participants and circumstances.

In other words, when researchers replicate a study, it means they reproduce the experiment to see if they can obtain the same outcomes.

Once a study has been conducted, researchers might be interested in determining if the results hold true in other settings or for other populations. In other cases, scientists may want to replicate the experiment to further demonstrate the results.

At a Glance

In psychology, replication is defined as reproducing a study to see if you get the same results. It's an important part of the research process that strengthens our understanding of human behavior. It's not always a perfect process, however, and extraneous variables and other factors can interfere with results.

For example, imagine that health psychologists perform an experiment showing that hypnosis can be effective in helping middle-aged smokers kick their nicotine habit. Other researchers might want to replicate the same study with younger smokers to see if they reach the same result.

Exact replication is not always possible. Ethical standards may prevent modern researchers from replicating studies that were conducted in the past, such as Stanley Milgram's infamous obedience experiments.

That doesn't mean that researchers don't perform replications; it just means they have to adapt their methods and procedures. For example, researchers have replicated Milgram's study using lower shock thresholds and improved informed consent and debriefing procedures.

Why Replication Is Important in Psychology

When studies are replicated and achieve the same or similar results as the original study, it gives greater validity to the findings. If a researcher can replicate a study’s results, it is more likely that those results can be generalized to the larger population.

Human behavior can be inconsistent and difficult to study. Even when researchers are cautious about their methods, extraneous variables can still create bias and affect results. 

That's why replication is so essential in psychology. It strengthens findings, helps detect potential problems, and improves our understanding of human behavior.

How Do Scientists Replicate an Experiment?

When conducting a study or experiment, it is essential to have clearly defined operational definitions. In other words, what is the study attempting to measure?

When replicating an earlier study, experimenters follow the same procedures but with a different group of participants. If the researcher obtains the same or similar results in follow-up experiments, it means that the original results are less likely to be a fluke.

The steps involved in replicating a psychology experiment often include the following:

  • Review the original experiment : The goal of replication is to use the exact methods and procedures the researchers used in the original experiment. Reviewing the original study to learn more about the hypothesis, participants, techniques, and methodology is important.
  • Conduct a literature review : Review the existing literature on the subject, including any other replications or previous research. Considering these findings can provide insights into your own research.
  • Perform the experiment : The next step is to conduct the experiment. During this step, keeping your conditions as close as possible to the original experiment is essential. This includes how you select participants, the equipment you use, and the procedures you follow as you collect your data.
  • Analyze the data : As you analyze the data from your experiment, you can better understand how your results compare to the original results (a sketch of such a comparison follows this list).
  • Communicate the results : Finally, you will document your processes and communicate your findings. This is typically done by writing a paper for publication in a professional psychology journal. Be sure to carefully describe your procedures and methods, describe your findings, and discuss how your results compare to the original research.
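To make the comparison in the "analyze the data" step concrete, here is a minimal Python sketch that contrasts a replication's effect size with the effect reported in a hypothetical original paper. All numbers, including the original effect size of d = 0.60, are invented for illustration and are not drawn from any study discussed here.

```python
# Hypothetical sketch: comparing a replication's effect to an original study.
# All numbers below are made up for illustration.
import numpy as np
from scipy import stats

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                         (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

rng = np.random.default_rng(seed=1)

# Simulated replication data: treatment vs. control scores on some outcome.
treatment = rng.normal(loc=52.0, scale=10.0, size=80)
control = rng.normal(loc=48.0, scale=10.0, size=80)

t_stat, p_value = stats.ttest_ind(treatment, control)
replication_d = cohens_d(treatment, control)

original_d = 0.60  # effect size reported in the (hypothetical) original paper

print(f"Replication: d = {replication_d:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"Original:    d = {original_d:.2f}")
print(f"Effect size ratio (replication / original): {replication_d / original_d:.2f}")
```

Comparing effect sizes, not just significance, is useful because a replication can reach significance while still suggesting a much smaller effect than originally reported.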

So what happens if the original results cannot be reproduced? Does that mean that the experimenters conducted bad research or that, even worse, they lied or fabricated their data?

In many cases, non-replicated research is caused by differences in the participants or in other extraneous variables that might influence the results of an experiment. Sometimes the differences might not be immediately clear, but other researchers might be able to discern which variables could have impacted the results.

For example, minor differences in things like the way questions are presented, the weather, or even the time of day the study is conducted might have an unexpected impact on the results of an experiment. Researchers might strive to perfectly reproduce the original study, but variations are expected and often impossible to avoid.

Are the Results of Psychology Experiments Hard to Replicate?

In 2015, a group of 271 researchers published the results of their five-year effort to replicate 100 different experimental studies previously published in three top psychology journals. The replicators worked closely with the original researchers of each study in order to replicate the experiments as closely as possible.

The results were less than stellar. Of the 100 experiments in question, 61% could not be replicated with the original results. While 97% of the original studies had reported statistically significant findings, only 36% of the replications obtained statistically significant results.

As one might expect, these dismal findings caused quite a stir. You may have heard this referred to as the "replication crisis" in psychology.

Later replication attempts have produced similar results. A study published in 2018 attempted to replicate 21 social and behavioral science studies; the researchers were able to successfully reproduce the original results only about 62% of the time.

So why are psychology results so difficult to replicate? Writing for The Guardian, John Ioannidis suggested a number of reasons why this might happen, including competition for research funds and the powerful pressure to obtain significant results. There is little incentive to retest, so many results obtained purely by chance are simply accepted without further research or scrutiny.

The American Psychological Association suggests that the problem stems partly from the research culture. Academic journals are more likely to publish novel, innovative studies rather than replication research, creating less of an incentive to conduct that type of research.

Reasons Why Research Cannot Be Replicated

The project's authors suggest three potential reasons why the original findings could not be replicated (the first two are illustrated in the simulation sketch after this list):

  • The original results were a false positive.
  • The replicated results were a false negative.
  • Both studies were correct but differed due to unknown differences in experimental conditions or methodologies.
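The first two scenarios can be illustrated with a small simulation. The Python sketch below uses invented sample sizes, an invented effect size, and the conventional alpha of .05 to show how often a study finds a "significant" result when there is no true effect (a false positive) and how often an underpowered study misses a real effect (a false negative).

```python
# Hypothetical simulation of why original and replication results can disagree.
# The parameters (sample sizes, effect size, alpha) are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_sims, n_per_group, alpha = 10_000, 30, 0.05

def significant(effect_size):
    """Run one two-group study and report whether p < alpha."""
    a = rng.normal(effect_size, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    return stats.ttest_ind(a, b).pvalue < alpha

# Scenario 1: no true effect -- any significant original is a false positive.
false_positives = sum(significant(0.0) for _ in range(n_sims)) / n_sims

# Scenario 2: a real but modest effect (d = 0.3) with a small sample --
# a non-significant replication here is a false negative.
power = sum(significant(0.3) for _ in range(n_sims)) / n_sims

print(f"False-positive rate with no true effect: {false_positives:.2%}")
print(f"Chance a real d = 0.3 effect reaches significance (power): {power:.2%}")
print(f"Chance such a replication 'fails' despite a real effect: {1 - power:.2%}")
```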

The Nobel Prize-winning psychologist Daniel Kahneman has suggested that because published studies are often too vague in describing the methods used, replication attempts should involve the authors of the original studies so that the original methods and procedures can be mirrored more closely.

In fact, one investigation found that replication rates are much higher when original researchers are involved.

While some might be tempted to look at the results of such replication projects and assume that psychology is more art than science, many suggest that such findings actually help make psychology a stronger science. Human thought and behavior are remarkably subtle and ever-changing subjects to study.

In other words, it's normal and expected for variations to exist when observing diverse populations and participants.

Some research findings might be wrong, but digging deeper, pointing out the flaws, and designing better experiments helps strengthen the field. The APA notes that replication research represents a great opportunity for students; it can help strengthen research skills and contribute to science in a meaningful way.

Nosek BA, Errington TM. What is replication? PLoS Biol. 2020;18(3):e3000691. doi:10.1371/journal.pbio.3000691

Burger JM. Replicating Milgram: Would people still obey today? Am Psychol. 2009;64(1):1-11. doi:10.1037/a0010932

Makel MC, Plucker JA, Hegarty B. Replications in psychology research: How often do they really occur? Perspect Psychol Sci. 2012;7(6):537-542. doi:10.1177/1745691612460688

Aarts AA, Anderson JE, Anderson CJ, et al. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. doi:10.1126/science.aac4716

Camerer CF, Dreber A, Holzmeister F, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637-644. doi:10.1038/s41562-018-0399-z

American Psychological Association. Leaning into the replication crisis: Why you should consider conducting replication research.

Kahneman D. A new etiquette for replication. Soc Psychol. 2014;45(4):310-311.

By Kendra Cherry, MSEd Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."

Annual Review of Psychology

Volume 73, 2022. Review article: Replicability, Robustness, and Reproducibility in Psychological Science

  • Brian A. Nosek 1,2 , Tom E. Hardwicke 3 , Hannah Moshontz 4 , Aurélien Allard 5 , Katherine S. Corker 6 , Anna Dreber 7 , Fiona Fidler 8 , Joe Hilgard 9 , Melissa Kline Struhl 2 , Michèle B. Nuijten 10 , Julia M. Rohrer 11 , Felipe Romero 12 , Anne M. Scheel 13 , Laura D. Scherer 14 , Felix D. Schönbrodt 15 , and Simine Vazire 16
  • Affiliations: 1 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA; 2 Center for Open Science, Charlottesville, Virginia 22903, USA; 3 Department of Psychology, University of Amsterdam, 1012 ZA Amsterdam, The Netherlands; 4 Addiction Research Center, University of Wisconsin–Madison, Madison, Wisconsin 53706, USA; 5 Department of Psychology, University of California, Davis, California 95616, USA; 6 Psychology Department, Grand Valley State University, Allendale, Michigan 49401, USA; 7 Department of Economics, Stockholm School of Economics, 113 83 Stockholm, Sweden; 8 School of Biosciences, University of Melbourne, Parkville VIC 3010, Australia; 9 Department of Psychology, Illinois State University, Normal, Illinois 61790, USA; 10 Meta-Research Center, Tilburg University, 5037 AB Tilburg, The Netherlands; 11 Department of Psychology, Leipzig University, 04109 Leipzig, Germany; 12 Department of Theoretical Philosophy, University of Groningen, 9712 CP Groningen, The Netherlands; 13 Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands; 14 University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA; 15 Department of Psychology, Ludwig Maximilian University of Munich, 80539 Munich, Germany; 16 School of Psychological Sciences, University of Melbourne, Parkville VIC 3052, Australia
  • Vol. 73:719-748 (Volume publication date January 2022) https://doi.org/10.1146/annurev-psych-020821-114157
  • First published as a Review in Advance on October 19, 2021
  • Copyright © 2022 by Annual Reviews. All rights reserved

Replication—an important, uncommon, and misunderstood practice—is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance knowledge. Assessing replicability can be productive for generating and testing hypotheses by actively confronting current understandings to identify weaknesses and spur innovation. For psychology, the 2010s might be characterized as a decade of active confrontation. Systematic and multi-site replication projects assessed current understandings and observed surprising failures to replicate many published findings. Replication efforts highlighted sociocultural challenges such as disincentives to conduct replications and a tendency to frame replication as a personal attack rather than a healthy scientific practice, and they raised awareness that replication contributes to self-correction. Nevertheless, innovation in doing and understanding replication and its cousins, reproducibility and robustness, has positioned psychology to improve research practices and accelerate progress.


  • Perspective
  • Open access
  • Published: 25 July 2023

The replication crisis has led to positive structural, procedural, and community changes

Max Korbmacher, Flavio Azevedo, Charlotte R. Pennington, Helena Hartmann, Madeleine Pownall, Kathleen Schmidt, Mahmoud Elsherif, Nate Breznau, Olly Robertson, Tamara Kalandadze, Shijun Yu, Bradley J. Baker, Aoife O'Mahony, Jørgen Ø.-S. Olsnes, John J. Shaw, Biljana Gjoneska, Yuki Yamada, Jan P. Röer, Jennifer Murphy, Shilaan Alzahawi, Sandra Grinschgl, Catia M. Oliveira, Tobias Wingen, Siu Kit Yeung, Meng Liu, Laura M. König, Nihan Albayrak-Aydemir, Oscar Lecuona, Leticia Micheli & Thomas Evans

Communications Psychology, volume 1, Article number: 3 (2023)


The emergence of large-scale replication projects yielding success rates substantially lower than expected caused the behavioural, cognitive, and social sciences to experience a so-called 'replication crisis'. In this Perspective, we reframe this 'crisis' through the lens of a credibility revolution, focusing on positive structural, procedural, and community-driven changes. We then outline a path to expand ongoing advances and improvements. The credibility revolution has been an impetus to several substantive changes which will have a positive, long-term impact on our research environment.


Introduction

After several notable controversies in 2011 1,2,3 , skepticism regarding claims in psychological science increased and inspired the development of projects examining the replicability and reproducibility of past findings 4 . Replication refers to the process of repeating a study or experiment with the goal of verifying effects or generalising findings across new models or populations, whereas reproducibility refers to assessing the accuracy of the research claims based on the original methods, data, and/or code (see Table 1 for definitions).

In one of the most impactful replication initiatives of the last decade, the Open Science Collaboration 5 sampled studies from three prominent journals representing different sub-fields of psychology to estimate the replicability of psychological research. Out of 100 independently performed replications, only 39% were subjectively labelled as successful replications, and on average, the effects were roughly half the original size. Putting these results into a wider context, a minimum replicability rate of 89% should have been expected if all of the original effects were true (and not false positives; ref. 6). Pooling the Open Science Collaboration 5 replications with 207 other replications from recent years resulted in a higher estimate: 64% of effects successfully replicated, with effect sizes being 32% smaller than the original effects 7. While estimations of replicability may vary, they nevertheless appear to be sub-optimal, an issue that is not exclusive to psychology and is found across many other disciplines (e.g., animal behaviour 8,9,10; cancer biology 11; economics 12), and symptomatic of persistent issues within the research environment 13,14. The 'replication crisis' has introduced a number of considerable challenges, including compromising the public's trust in science 15 and undermining the role of science and scientists as reliable sources to inform evidence-based policy and practice 16. At the same time, the crisis has provided a unique opportunity for scientific development and reform. In this narrative review, we focus on the latter, exploring the replication crisis through the lens of a credibility revolution 17 to provide an overview of recent developments that have led to positive changes in the research landscape (see Fig. 1).
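As a rough illustration of why expected replication rates depend on statistical power, the Python sketch below uses the statsmodels library to compute the chance that a replication detects a given true effect. The effect size (d = 0.4) and sample sizes are hypothetical and are not taken from ref. 6 or from any of the projects described above; the sketch only shows the general logic.

```python
# Illustrative sketch (not the analysis from ref. 6): if an original effect is
# real, the expected chance of a "successful" replication is roughly the
# replication study's statistical power to detect that effect.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Hypothetical numbers: true effect d = 0.4, replication with 100 per group.
power = analysis.power(effect_size=0.4, nobs1=100, alpha=0.05, ratio=1.0)
print(f"Expected replication rate if the effect is real: {power:.0%}")

# Sample size per group needed to reach 89% power for the same effect.
n_needed = analysis.solve_power(effect_size=0.4, power=0.89, alpha=0.05, ratio=1.0)
print(f"Participants per group needed for 89% power: {n_needed:.0f}")
```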

Figure 1: This figure presents an overview of the three modes of change proposed in this article: structural change is often evoked at the institutional level and expressed by new norms and rules; procedural change refers to behaviours and sets of commonly used practices in the research process; community change encompasses how work and collaboration within the scientific community evolves.

Recent discussions have outlined various reasons why replications fail (see Box  1 ). To address these replicability concerns, different perspectives have been offered on how to reform and promote improvements to existing research norms in psychological science 18 , 19 , 20 . An academic movement collectively known as open scholarship (incorporating Open Science and Open Research) has driven constructive change by accelerating the uptake of robust research practices while concomitantly championing a more diverse, equitable, inclusive, and accessible psychological science 21 , 22 .

These reforms have been driven by a diverse range of institutional initiatives, grassroots bottom-up efforts, and individuals. The extent of this impact led Vazire 17 to reframe the replicability crisis as a credibility revolution, acknowledging that the term crisis reflects neither the intense self-examination of research disciplines in the last decade nor the various advances that have been implemented as a result.

Scientific practices are behaviours 23 and can be changed, especially when structures (e.g., funding agencies), environments (e.g., research groups), and peers (e.g., individual researchers) facilitate and support them. Most attempts to change the behaviours of individual researchers have concentrated on identifying and eliminating problematic practices and improving training in open scholarship 23. Efforts to change individuals’ behaviours have ranged from the creation of grassroots communities that support individuals in incorporating open scholarship practices into their research and teaching (e.g., ref. 24) to infrastructural change, such as the creation of open tools that foster the uptake of improved norms (e.g., the software StatCheck 25, which identifies statistical inconsistencies en masse), high-quality, modularized training in the skills needed for transparent and reproducible data preparation and analysis 26, and transparent documentation of contributions and author roles 27.

The replication crisis has highlighted the need for a deeper understanding of the research landscape and culture, and for a concerted effort from institutions, funders, and publishers to address the substantive issues. New open access journals have been created, but they still face challenges in gaining acceptance given the prevailing reputation- and prestige-based publishing market. These stakeholders have made significant efforts, but their impact remains isolated and is rarely assessed. As a result, although there have been positive developments, progress toward a systemic transformation in how science is considered, actioned, and structured is still in its infancy.

In this article, we take the opportunity to reflect upon the scope and extent of positive changes resulting from the credibility revolution. To capture these different levels of change in our complex research landscape, we differentiate between (a) structural, (b) procedural, and (c) community change. Our categorisation is not informed by any given theory, and there are overlaps and similarities across the outlined modes of change. However, this approach allows us to consider change in different domains: (a) embedded norms, (b) behaviours, and (c) interactions, which we believe helps demonstrate the scope of positive changes and can empower and retain change-makers working towards further scientific reform.

Box 1 Why research fails to replicate. This box outlines some explanations for the low replicability of research at the individual and structural levels. A more exhaustive overview can be found in ref. 14

Individual level . Questionable research practices (QRPs) at the individual researcher level may help explain the replicability crisis. Researchers have significant flexibility in processing and analysing data, including the exclusion of outliers and the running of multiple tests on subsets of the data, which can lead to false positive results 203, 204. QRPs also involve measurement, such as omitting to report psychometric properties when they are found to be unsatisfactory, with similar negative effects for replicability 205, 206. Researchers continue to employ a variety of QRPs, which are significant contributors to low replicability 202, 204, 207, 208, 209, and a lack of transparency in reporting can mask and exacerbate these issues 14.

Structural level . Characteristics of the academic system also contribute substantially to low replicability. For example, misaligned incentives can influence researchers trying to obtain career stability or advancement and encourage unethical research behaviours 210 such as QRPs 103, 211. More generally, the incentivisation of research is conveyed through common academic aphorisms such as ‘publish or perish’ 212. Many of these incentives are driven by research institutions and their governing agencies, but also by publishers and funding agencies 213. For example, the emphasis on arbitrary publication metrics, such as the impact factor 214 or the h-index, creates perverse incentives 215 that ultimately reinforce the prioritization of research quantity over quality 85, 92, 216, with little-to-no requirements for transparency in the publishing process 201. Such emphasis is particularly evident in publication practices, where novel and hypothesis-supporting research has historically been viewed more positively by editors and reviewers, and thus published more frequently 217. Many journals prevent or resist the publication of null findings and replications 218, often due to a perceived ‘lack of contribution’ or a prioritisation of novelty over incremental developments 219, 220, 221. This selective publishing creates ‘publication bias’, a distortion of the literature that over-represents positive findings and under-represents negative ones, giving misleading representations of existing effects 50, 78, 222. Negative findings often end up in the ‘file-drawer’ and are never published 223.

Challenges with replications . While replicability is an essential feature of scientific research, its application can vary depending on the research field, with standardization and control of covariates varying for different objects of study 200. Moreover, replication is not conclusive in ensuring scientific quality due to the heterogeneity of the concept 224. Differences in replicability can be attributed to several factors, such as variations in procedures, measurement characteristics, or evidence strength between original studies and their replications 7, 225, 226, 227. Additionally, the characteristics of the effects themselves may vary in size depending on time or context 228. Finally, low reliability and systematic errors in original studies 229, missing formalization of verbal claims, for example of concepts, definitions, and results 174, the inappropriate use and interpretation of statistical tests 102, 103, 170, and weak theoretical development underpinning hypothesized effects 230 can lead to mismatches between statistical tests and their interpretations, ultimately influencing replication rates.

Structural change

In the wake of the credibility revolution, structural change is seen as crucial to achieving the goals of open scholarship, with new norms and rules often being developed at the institutional level. In this context, there has been increasing interest in embedding open scholarship practices into the curriculum and incentivizing researchers to adopt improved practices. In the following, we describe and discuss examples of structural change and its impact.

Embedding replications into the curriculum

Higher Education instructors and programmes have begun integrating open science practices into the curriculum at different levels. Most notably, some instructors have started including replications as part of basic research training and course curricula 28 , and there are freely available, curated materials covering the entire process of executing replications with students 29 (see also forrt.org/reversals). In one prominent approach, the Collaborative Replications and Education Project 30 , 31 integrates replications in undergraduate courses as coursework with a twofold goal: educating undergraduates to uphold high research standards whilst simultaneously advancing the field with replications. In this endeavour, the most cited studies from the most cited journals in the last three years serve as the sample from which students select their replication target. Administrative advisors then rate the feasibility of the replication to decide whether to run the study across the consortium of supervisors and students. After study completion, materials and data are submitted and used in meta-analyses, for which students are invited as co-authors.

In another proposed model 32 , 33 , 34 , graduate students complete replication projects as part of their dissertations. Early career researchers (ECRs) are invited to prepare the manuscripts for publication 35 , 36 and, in this way, students’ research efforts for their dissertation are utilised to contribute to a more robust body of literature, while being formally acknowledged. An additional benefit is the opportunity for ECRs to further their career by publishing available data. Institutions and departments can also profit from embedding these projects as these not only increase the quality of education on research practices and transferable skills but also boost research outputs 21 , 37 .

If these models are to become commonplace, developing a set of standards regarding authorship is beneficial. In particular, the question of what merits authorship can become an issue when student works are further developed, potentially without further involvement of the student. Such conflicts occur with other models of collaboration (see Community Change, below; ref. 38 ) but may be tackled by following standardized authorship templates, such as the Contributor Roles Taxonomy (CRediT), which helps detail each individual’s contributions to the work 27 , 39 , 40 .

Wider embedding into curricula

In addition to embedding replications, open scholarship should be taught as a core component of Higher Education. Learning about open scholarship practices has been shown to influence student knowledge, expectations, attitudes, and engagement toward becoming more effective and responsible researchers and consumers of science 41. It is therefore essential to adequately address open scholarship in the classroom and to promote the creation and maintenance of open educational resources supporting teaching staff 21, 29, 37. Gaining increased scientific literacy early on may have significant long-term benefits for students, including the opportunity to make a rigorous scientific contribution, a critical understanding of the scientific process and the value of replication, and a commitment to research integrity values such as openness and transparency 24, 41, 42, 43. Embedding open research practices into education further shapes personal values that are connected to research, which will be crucial in later stages of both academic and non-academic careers 44. This creates a path towards open scholarship values and practices becoming the norm rather than the exception. It also links directly to existing social movements often embraced among university students to foster greater equity and justice, and helps break down status hierarchies and power structures (e.g., decolonisation, diversity, equity, inclusion, and accessibility efforts) 45, 46, 47, 48, 49.

Various efforts to increase the adoption of open scholarship practices into the curriculum are being undertaken by pedagogical teams with the overarching goal of increasing research rigour and transparency over time. While these changes are structural, they are often driven by single or small groups of individuals, who are usually in the early stages of their careers and receive little recognition for their contributions 21. An increasing number of grassroots open science organisations contribute to different educational roles and provide resources, guidelines, and community. The breadth of tasks required in the pedagogic reform towards open scholarship is exemplified by the Framework for Open and Reproducible Research Training (FORRT), which focuses on reform and meta-scientific research to advance research transparency, reproducibility, rigour, social justice, and ethics 24. The FORRT community is currently running more than 15 initiatives, which include summaries of open scholarship literature, a crowdsourced glossary of open scholarship terms 50, a literature review of the impact on students of integrating open scholarship into teaching 41, out-of-the-box lesson plans 51, a team working on bridging neurodiversity and open scholarship 46, 52, and a living database of replications and reversals 53. Other examples of organisations providing open scholarship materials are ReproducibiliTea 54, the RIOT Science Club, the Turing Way 55, Open Life Science 56, OpenSciency 57, the Network of Open Science Initiatives 58, Course Syllabi for Open and Reproducible Methods 59, the Carpentries 60, the Embassy of Good Science 61, the Berkeley Initiative for Transparency in the Social Sciences 62, the Institute for Replication 63, Reproducibility for Everyone 64, the International Network of Open Science 65, and Reproducibility Networks such as the UKRN 66.

These and other collections of open-source teaching and learning materials (such as podcasts, how-to guides, courses, labs, networks, and databases) can facilitate the integration of open scholarship principles into education and practice. Such initiatives not only raise awareness of open scholarship but also level the playing field for researchers from countries or institutions with fewer resources, such as the Global South and low- and middle-income countries (i.e., regions outside of Western Europe and North America that are often politically and culturally marginalised, such as parts of Asia, Latin America, and Africa) 51, 67.

Scientific practice has been characterized by problematic rewards, such as prioritizing research quantity over quality and emphasizing statistically significant results. To foster a sustained integration of open scholarship practices, it is essential to revise incentive structures. Current efforts have focused on developing incentives that target various actors, including students, academics, faculties, universities, funders, and journals 68 , 69 , 70 . However, as each of these actors has different—and sometimes competing—goals, their motivations to engage in open scholarship practices can vary. In the following, we discuss recently developed incentives that specifically target researchers, academic journals, and funders.

Targeting researchers

Traditional incentives for academics to advance in their career are publishing articles, winning grants, and signalling the quality of the published work (e.g., perceived journal prestige) 45, 71. In some journals, researchers are now given direct incentives, in the form of (open science) badges, for preregistering study plans and analyses before study execution and for openly sharing data and materials, with the aim of signalling study quality 23. However, the extent to which badges can increase open scholarship behaviours remains unclear; while one study 72 reports increased data-sharing rates among articles published in Psychological Science with badges, a recent randomized controlled trial shows no evidence for badges increasing data sharing 73, suggesting that more effective incentives or complementary workflows are required to motivate researchers to engage in open research practices 74.

Furthermore, incentives are provided for other open scholarship practices, such as using the Registered Report publishing format 75, 76. Here, authors submit research protocols for peer review before data collection or analysis (in the case of secondary data). Registered Reports meeting high scientific standards are given provisional acceptance (‘in-principle acceptance’) before the results are known. This format shifts the focus from research outcomes to methodological quality and realigns incentives by providing researchers with the certainty of publication when adhering to the preregistered protocol 75, 76. Empirical evidence has also found that Registered Reports are perceived to be higher in research quality than regular articles, as well as equivalent in creativity and importance 77, while also allowing more negative results to be reported 78, which may provide further incentives for researchers to adopt this format.

Targeting journals and funders

Incentives are not limited to individual researchers but also extend to the broader research infrastructure. One example is academic journals, which are attempting to implement open science standards to remain current and competitive by significantly increasing open publishing options and formulating new guidelines reinforcing and enforcing these changes 79. For example, the Center for Open Science introduced the Transparency and Openness Promotion (TOP) Guidelines 80, comprising eight modular standards that reflect journals’ transparency standards: citation standards, data transparency, analytic methods transparency, research materials transparency, design and analysis transparency, study preregistration, analysis plan preregistration, and replication. Building on these guidelines, the TOP Factor quantifies the degree to which journals implement these standards, providing researchers with a guide for selecting journals accordingly. Based on TOP and other open science best practices, guides for editors are also available (e.g., ref. 81). Similarly, organisations such as NASA 82, UNESCO 83, and the European Commission 84 have publicly supported open scholarship efforts at an international level. There are, moreover, efforts to open up funding options through Registered Reports funding schemes 85. Here, funding allocation and publication review are combined into a single process, reducing both the burden on reviewers and opportunities for questionable research practices. Finally, large-scale policies supporting open scholarship practices are being implemented, such as Plan S 86, 87, which mandates open publishing when research is funded by public grants. The increase in open access options illustrates how journals are being effectively incentivized to expand their repertoire and normalize open access 88. At the same time, article processing charges have increased, causing new forms of inequity between the haves and have-nots and the exclusion of researchers from low-resource universities across the globe 88, 89. As Plan S shapes the decision space of journals and researchers, it is an incentive with the promise of long-term change.

Several initiatives aim to re-design systems such as peer review and publishing. Community peer review (e.g., PeerCommunityIn 90) is a relatively new system in which experts review and recommend preprints to journals. Future developments in this direction might include increased use of overlay journals, meaning that the journals themselves do not manage their own content (including peer review) but rather select and curate content. Peer review procedures can also be changed, as shown by the recent editorial plans of the journal eLife to abolish accept/reject decisions during peer review 91, and by the recommendation-based model of community peer review.

Evaluation of researchers in academic settings has historically focused on the quantity of papers they publish in high-impact journals 45, despite criticism of the impact factor metric 71, and on their ability to secure grants 85, 92. In response to this narrow evaluation, a growing number of research stakeholders, including universities, have signed declarations such as the San Francisco Declaration on Research Assessment (DORA) or the agreement of the Coalition for Advancing Research Assessment (CoARA). These initiatives aim to broaden the criteria used for hiring, promotion, and funding decisions by considering all research outputs, including software and data, and by considering the qualitative impact of research, such as its policy or practice implications. To promote sustained change towards open scholarship practices, some institutions have also modified the requirements for hiring committees to consider such practices 69 (see, for example, the Open Hiring Initiative 93). Such initiatives incentivize researchers to adopt open scholarship principles for career advancement.

Procedural change

Procedural change refers to behaviours and sets of commonly used practices in the research process. We describe and discuss prediction markets, statistical assessment tools, multiverse analysis, and systematic reviews and meta-analysis as examples of procedural changes.

Prediction markets of research credibility

In recent years, researchers have employed prediction markets to assess the credibility of research findings 94 , 95 , 96 , 97 , 98 , 99 . Here, researchers invite experts or non-experts to estimate the replicability of different studies or claims. Large prediction market projects such as the repliCATS project have yielded replicability predictions with high classification accuracy (between 61% and 86% 96 , 97 ). The repliCATS project implemented a structured, iterative evaluation procedure to solicit thousands of replication estimates which are now being used to develop prediction algorithms using machine learning. Though many prediction markets are composed of researchers or students with research training, even lay people seem to perform better than chance in predicting replicability 100 . Replication markets are considered both an alternative and complementary approach to replication since certain conditions may favour one approach over the other. For instance, replication markets may be advantageous in cases where data collection is resource-intensive but less so when study design is especially complex. Therefore, replication markets offer yet another tool for researchers to assess the credibility of existing and hypothetical works. In that sense, it is an ongoing discussion whether low credibility estimates from replication markets can be used to inform decisions on which articles to replicate 101 .
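As a simple, hypothetical illustration of how such forecasts can be evaluated (the numbers below are invented, not repliCATS data), crowd-sourced replication probabilities can be scored against observed replication outcomes using classification accuracy and the Brier score:

# Hedged sketch: scoring forecasts of replication outcomes with invented data.
import numpy as np

forecast = np.array([0.85, 0.20, 0.65, 0.40, 0.90, 0.15])  # predicted probability of replication
outcome = np.array([1, 0, 1, 0, 1, 1])                      # 1 = replicated, 0 = did not replicate

accuracy = np.mean((forecast >= 0.5) == outcome)   # simple classification accuracy
brier = np.mean((forecast - outcome) ** 2)         # lower values indicate better-calibrated probabilities

print(f"classification accuracy = {accuracy:.2f}, Brier score = {brier:.3f}")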

Statistical assessment tools

Failure to control error rates and to design high-powered studies can contribute to low replication rates 102, 103. In response, researchers have developed various quantitative methods to assess the expected distributions of statistical estimates (i.e., p-values), such as p-curve 104, z-curve 105, and others. P-curve assesses publication bias by plotting the distribution of p-values across a set of studies and measuring the deviation from the uniform distribution of p-values expected under a true null hypothesis 104. Like p-curve, z-curve assesses the distribution of test statistics while considering the power of statistical tests and the false discovery rate within a body of literature 105. Additionally, and perhaps most importantly, such estimations of bias in the literature identify selective reporting trends and help establish a better estimate of whether replication failures may be due to features of the original study or features of the replication study. Advocates of these methods argue for decreasing α-levels (i.e., the threshold probability of committing a type I error, or false positive) when the likelihood of publication bias is high, to increase confidence in reported findings. Other researchers have called for reducing α-levels for all tests (e.g., from 0.05 to 0.005 106), rethinking null hypothesis significance testing (NHST) and considering exploratory NHST 107, or abandoning NHST altogether 108 (see, for an example, ref. 109). However, these approaches are not panaceas and are unlikely to address all the highlighted concerns 19, 110. Instead, researchers have recommended simply justifying the alpha for tests with regard to the relative costs of Type I versus Type II (false negative) errors 110. In this context, equivalence testing 111 and Bayesian analyses 112 have been proposed as suitable approaches to directly assess evidence for the alternative hypothesis against evidence for the null hypothesis 113. Graphical user interface (GUI) based statistical software packages, like JASP 114 and Jamovi 115, have played a significant role in making statistical methods such as equivalence tests and Bayesian statistics accessible to a broader audience. The promotion of these methods, including practical walkthroughs and interactive tools like Shiny apps 111, 112, has further contributed to their increased adoption.
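For readers unfamiliar with equivalence testing, the following minimal sketch shows the two one-sided tests (TOST) logic on simulated data; the equivalence bounds of plus or minus 0.5 raw units are an arbitrary assumption for illustration only:

# Hedged TOST sketch on simulated data; bounds and data are assumptions, not from any cited study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=1.0, size=60)   # simulated group A
b = rng.normal(loc=0.1, scale=1.0, size=60)   # simulated group B
low, high = -0.5, 0.5                          # smallest effect of interest (raw units)

# Test 1: is the mean difference clearly above the lower equivalence bound?
t_low = stats.ttest_ind(a - low, b, alternative='greater')
# Test 2: is the mean difference clearly below the upper equivalence bound?
t_high = stats.ttest_ind(a - high, b, alternative='less')

p_tost = max(t_low.pvalue, t_high.pvalue)      # equivalence requires both one-sided tests to reject
print(f"TOST p = {p_tost:.3f}; equivalence at alpha = .05: {p_tost < 0.05}")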

Single-study statistical assessments

A range of useful tools has been developed to support open and rigorous research practices. For example, the accuracy of reported findings may be assessed by running simple, automated error checks, such as StatCheck 25. Validation studies 25 reported high sensitivity (above 83%), specificity (above 96%), and accuracy (above 92%) for this tool. Other innovations include the Granularity-Related Inconsistency of Means (GRIM) test 116, which evaluates whether reported means of integer data (e.g., from Likert-type scales) are consistent with the reported sample size and number of items. Another is Sample Parameter Reconstruction via Iterative TEchniques (SPRITE), which reconstructs samples and estimates of item value distributions based on reported descriptive statistics 117. Adopting these tools can serve as an initial step in reviewing existing literature, ensuring that findings are not the result of statistical errors or potential falsification. Researchers themselves can use these tools to check their own work and identify potential errors. With greater awareness and use of such tools, we can increase accessibility and enhance our ability to identify unsubstantiated claims.
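The core arithmetic behind a GRIM-style check is simple; the sketch below (hypothetical values, single-item integer scale) tests whether a reported mean is achievable as an integer total divided by the sample size:

# Hedged GRIM-style consistency check with invented values.
def grim_consistent(reported_mean, n, decimals=2):
    """Return True if reported_mean could arise from n integer-valued responses.

    For multi-item averages, n should be the sample size multiplied by the number of items.
    """
    candidate = round(reported_mean * n)               # nearest achievable integer total
    for total in (candidate - 1, candidate, candidate + 1):
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

print(grim_consistent(3.48, 25))   # True: 87 / 25 = 3.48
print(grim_consistent(3.51, 25))   # False: no integer total over 25 responses yields 3.51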

Multiverse analysis

The multitude of researcher degrees of freedom—i.e., the decisions researchers can make when using data—has been shown to influence the outcomes of analyses performed on the same data 118, 119. In one investigation, 70 independent research teams analysed the same nine hypotheses with one neuroimaging dataset, and data cleaning and statistical inferences varied considerably between teams: no two groups used the same pipeline to pre-process the imaging data, which ultimately influenced both the results and the inferences drawn from them 118. Another systematic effort, comprising 161 researchers in 73 teams who independently investigated and tested a hypothesis central to an extensive body of scholarship using identical cross-country survey data—that more immigration will reduce public support for government provision of social policies—revealed a hidden universe of uncertainty 120. The study highlights that the scientific process involves numerous analytical decisions that are often taken for granted as nondeliberate actions following established procedures but whose cumulative effect is far from insignificant 119. It also illustrates that, even in a research setting with high accuracy motivation and unbiased incentives, reliability between researchers may remain low, regardless of researchers’ methodological expertise 119. The upshot is that idiosyncratic uncertainty may be a fundamental aspect of the scientific process that is not easily attributable to specific researcher characteristics or analytical decisions. However, increasing transparency regarding researcher degrees of freedom is still crucial, and multiverse analyses provide a useful tool to achieve this. By considering a range of feasible and reasonable analyses, researchers can test the same hypothesis across various scenarios and determine the stability of effects as they navigate the often large ‘garden of forking paths’ 121. Multiverse analyses (and their sibling, sensitivity analyses) can now be performed more routinely to provide robust evidence on substantive research questions 122, 123, 124, 125, 126, 127. Such an approach helps researchers determine the robustness of a finding by pooling evidence from a range of appropriate analyses.
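A toy multiverse sketch (simulated data; the analytic choices are arbitrary assumptions) illustrates the basic idea of re-estimating the same effect under every combination of defensible decisions and reporting the spread of estimates rather than a single value:

# Hedged multiverse sketch: simulated data, arbitrary outlier rules and transforms.
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.2 * x + rng.normal(size=500)    # simulated outcome with a modest true association

outlier_cutoffs = [2.5, 3.0, None]    # exclude |values| above the cutoff; None keeps all cases
transforms = [lambda v: v, np.tanh]   # e.g., raw outcome vs. a bounded transform

estimates = []
for cutoff, transform in itertools.product(outlier_cutoffs, transforms):
    keep = np.ones_like(x, dtype=bool) if cutoff is None else (np.abs(x) < cutoff) & (np.abs(y) < cutoff)
    r = np.corrcoef(x[keep], transform(y[keep]))[0, 1]
    estimates.append(r)

print(f"{len(estimates)} specifications; r ranges from {min(estimates):.3f} to {max(estimates):.3f}")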

Systematic review and meta-analysis

Systematic reviews and meta-analyses are used to synthesise findings from several primary studies 128, 129, which can reveal nuanced aspects of the research while keeping a bird’s-eye perspective, for example by presenting the range of effect sizes and resulting power estimates. Methods have been developed to assess the extent of publication bias in meta-analyses and, to an extent, correct for it, such as funnel plot asymmetry tests 130. However, there are additional challenges influencing the results of meta-analyses and systematic reviews, and hence their replicability, such as researcher degrees of freedom in determining inclusion criteria, methodological approaches, and the rigour of the primary studies 131, 132. Thus, researchers have developed best practices for open and reproducible systematic reviews and meta-analyses, such as the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 133, the Non-Interventional, Reproducible, and Open (NIRO) systematic review guidelines 134, the Generalized Systematic Review Registration Form 135, and PROSPERO, a register of systematic review protocols 136. These guides and resources provide opportunities for more systematic accounts of research 137. They often include a risk-of-bias assessment, where different biases are assessed and reported 138. Yet most systematic reviews and meta-analyses do not follow standardized reporting guidelines, even when required by the journal and stated in the article, reducing the reproducibility of primary and pooled effect size estimates 139. An evaluated trial of enhanced requirements by some journals as part of the submission process 140 did lead to a slowly increasing uptake of such practices 141, and later findings indicate that protocol registration increased the quality of associated meta-analyses 142. Optimistically, continuous efforts to increase transparency appear to have already contributed to researchers more consistently reporting eligibility criteria, effect size information, and synthesis techniques 143.
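To make these tools concrete, the following hedged sketch (invented effect sizes and standard errors) pools studies with a DerSimonian–Laird random-effects model and runs an Egger-type regression as one common funnel-plot asymmetry check (scipy 1.7 or later is assumed for the intercept standard error):

# Hedged meta-analysis sketch with invented effect sizes and standard errors.
import numpy as np
from scipy import stats

yi = np.array([0.30, 0.45, 0.12, 0.60, 0.25, 0.05])   # study effect sizes (invented)
se = np.array([0.10, 0.20, 0.08, 0.25, 0.12, 0.07])   # their standard errors (invented)

# DerSimonian-Laird random-effects pooling
w = 1 / se**2
mu_fixed = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - mu_fixed) ** 2)
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (len(yi) - 1)) / C)               # between-study variance estimate
w_re = 1 / (se**2 + tau2)
mu_re = np.sum(w_re * yi) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"pooled effect = {mu_re:.3f} (SE {se_re:.3f}), tau^2 = {tau2:.3f}")

# Egger-type regression: an intercept far from zero suggests small-study (publication) asymmetry
res = stats.linregress(1 / se, yi / se)
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), df=len(yi) - 2)
print(f"Egger intercept = {res.intercept:.2f}, p = {p_int:.3f}")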

Community change

Community change encompasses how work and collaboration within the scientific community evolves. We describe two of these recent developments: Big Team Science and adversarial collaborations.

Big Team Science

The credibility revolution has undoubtedly driven the formation and development of various large-scale, collaborative communities 144 . Community examples of such approaches include mass replications, which can be integrated into research training 30 , 31 , 34 , 145 , and projects conducted by large teams and organisations such as the Many Labs studies 146 , 147 , the Hagen Cumulative Science Project 148 , the Psychological Science Accelerator 149 , and the Framework for Open and Reproducible Research Training (FORRT) 24 .

A promising development to accelerate scientific progress is Big Team Science—i.e., large-scale collaborations of scientists working on a scholarly common goal and pooling resources across labs, institutions, disciplines, cultures, and countries 14 , 150 , 151 . Replication studies are often the focus of such collaborations, with many of them sharing their procedural knowledge and scientific insights 152 . This collaborative approach leverages the expertise of a consortium of researchers, increases research efficiency by pooling resources such as time and funding, and allows for richer cross-cultural samples to draw conclusions from 150 , 153 . Big Team Science emphasizes various practices to improve research quality, including interdisciplinary internal reviews, incorporating multiple perspectives, implementing uniform protocols across participating labs, and recruiting larger and more diverse samples 41 , 50 , 149 , 150 , 152 , 154 . The latter also extends to researchers themselves; Big Team Science can increase representation, diversity, and equality and allow researchers to collaborate by either coordinating data collection efforts at their respective institutions or by funding the data collection of researchers who may not have access to funds 155 .

Big Team Science represents a prime opportunity to advance open scholarship goals that have proven the most difficult to achieve, including diversity, equity, inclusion, accessibility, and social justice in research. Through a collaborative approach that prioritizes the inclusion of disenfranchised researchers, coalition building, and the redistribution of expertise, training, and resources from the most to the least affluent, Big Team Science can contribute to a more transparent and robust science that is also more inclusive, diverse, accessible, and equitable. This participatory approach creates a better and more just science for everyone 52. However, it is important to critically examine some of the norms, practices, and culture associated with Big Team Science to identify areas for improvement. Big Team Science projects are often led by researchers from Anglo-Saxon and Global North institutions, while the contributions of researchers from the Global South are oftentimes diluted in the ordering of authors—i.e., authors from the Global North tend to occupy positions of prestige such as the first, corresponding, and last author (e.g., refs. 146, 147, 156, 157, 158, 159, 160, 161, 162), while researchers from low- and middle-income countries are compressed in the middle. Moreover, there are challenges associated with collecting data in low- and middle-income countries that are often not accounted for, such as limited access to polling infrastructure or technology and gaping inequities in resources, funding, and educational opportunities. Furthermore, journals and research institutions do not always recognize contributions to Big Team Science projects, which can disproportionately harm the academic careers of already marginalized researchers. For example, some prominent journals prefer mentioning consortium or group names instead of accommodating complete lists of author names in the byline, with the result that immediate author visibility decreases (see, for example, ref. 156). To promote social justice in Big Team Science practices, it is crucial to set norms that redistribute credit, resources, funds, and expertise rather than preserving the status quo of extractive intellectual labour. These complex issues must be examined carefully—and now—to ensure current customs do not unintentionally sustain the colonialist, extractivist, and racist research practices that pervade academia and society at large. Big Team Science stakeholders must ensure that, at a minimum, Big Team Science does not perpetuate existing inequalities and power structures.

Adversarial collaborations

Scholarly critique typically occurs after research has been completed, for example during peer review or in back-and-forth commentaries on published work. With some exceptions 161, 163, 164, 165, 166, researchers who support contradictory theoretical frameworks rarely work together to formulate research questions and design studies to test them. ‘Adversarial collaborations’ of this kind are arguably one of the most important procedural developments for advancing research because they allow for a consensus-based resolution of scientific debates and facilitate more efficient knowledge production and self-correction by reducing bias 3, 167. An example is the Transparent Psi Project 168, which united teams of researchers both supportive and critical of the idea of extra-sensory perception, allowing for constructive dialogue and a more agreeable consensus in its conclusions.

A related practice is that of ‘red teams’, which can be applied by both larger and smaller teams of researchers playing ‘devil’s advocate’ with one another. Red teams work together to constructively criticise each other’s work and to find errors during (but preferably early in) the research process, with the overarching goal of maximising research quality (Lakens, 2020). By catching errors before it is too late, red teams have the potential to save large amounts of resources 150. However, whether these initiatives contribute to research that is less biased in its central assumptions depends on wherein the ‘adversarial’ nature lies. For example, two researchers may hold the same biased negative views towards one group but opposing views on the implications of group membership for secondary outcomes. An adversarial collaboration from the same starting point, pitting different methodological approaches and hypotheses about factors associated with group membership against each other, would still suffer from the same fundamental biases.

Expanding structural, procedural and community changes

To expand the developments discussed and to address current challenges in the field, we now highlight a selection of areas that can benefit from the previously described structural, procedural, and community changes, namely: (a) generalizability, (b) theory building, (c) open scholarship for qualitative research, and (d) diversity and inclusion as an area that must be considered in the context of open scholarship.

Generalizability

In extant work, the generalizability of effects is a serious concern (e.g., refs. 169 , 170 ). Psychological researchers have traditionally focused on individual-level variability and failed to consider variables such as stimuli, tasks, or contexts over which they wish to generalise. While accounting for methodological variation can be partially achieved through statistical estimation (e.g., including random effects of stimuli in models) or acknowledging and discussing study limitations, unmeasured variables, stable contexts, and narrow samples still present substantive challenges to the generalizability of results 170 .
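As one illustration of what such statistical estimation can look like (a generic sketch, not a model from any cited study), a design in which participants p respond to stimuli s can include crossed random effects for both sampling dimensions,

y_{ps} = \beta_0 + \beta_1 x_{ps} + u_p + w_s + \varepsilon_{ps}, \quad u_p \sim \mathcal{N}(0, \sigma_u^2), \; w_s \sim \mathcal{N}(0, \sigma_w^2), \; \varepsilon_{ps} \sim \mathcal{N}(0, \sigma^2),

so that conclusions are intended to generalise over the populations of both participants and stimuli rather than over participants alone.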

Possible solutions may lie in Big Team Science and large-scale collaborations. Scientific communities such as the Psychological Science Accelerator (PSA) have aimed to test the generalizability of effects across cultures and beyond the Global North (i.e., the affluent regions of the world, for example North America, Europe, and Australia 149). However, Big Team Science projects tend to be conducted voluntarily with very few resources in order to understand the diversity of a specific phenomenon (e.g., ref. 171). The large samples required to detect small effects may make it difficult for single researchers from specific countries to achieve adequate power for publication. Large global collaborations, such as the PSA, can therefore help avoid wasted resources by conducting large studies instead of many small-sample studies 149. At the same time, large collaborations might offer a chance to counteract geographical inequalities in research outputs 172. However, such projects also tend to recruit only the most accessible (typically student) populations from their countries, thereby potentially perpetuating issues of representation and diversity. Yet increased efforts by international teams of scientists offer opportunities to provide greater diversity in both the research team and the research samples, potentially increasing generalisability at various stages of the research process.

Formal theory building

Researchers have suggested that the replication crisis is, in fact, a “theory crisis” 173. Low rates of replicability may be explained in part by a lack of formalism and first principles 174. One example is the improper testing of theory or failure to identify auxiliary theoretical assumptions 103, 170. The verbal formulation of psychological theories and hypotheses cannot always be directly tested with inferential statistics; hence, generalisations provided in the literature are not always supported by the data used. Yarkoni 170 has recommended moving away from broad, unspecific claims and theories towards specific quantitative tests that are interpreted with caution, together with an increased weighting of qualitative and descriptive research. Others have suggested formalising theories as computational models and engaging in theory testing rather than null hypothesis significance testing 173. Indeed, many researchers may not even be at a stage where they are ready or able to test hypotheses 175. Hence, one suggested approach to solving methodological problems is to (1) define the variables, population parameters, and constants involved in the problem, including model assumptions, and then (2) formulate a formal mathematical problem statement. The results are (3) used to interrogate the problem. If the claims are valid, (4a) examples can be used to demonstrate practical relevance, while (4b) possible extensions and limitations are presented. Finally, (5) policy-making recommendations can be given. Additional discussion of improving psychological theory and its evaluation is needed to advance the credibility revolution, and such discussions reassessing the application of statistics (in the context of statistical theory) are important steps in improving research quality 174.
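As a hedged illustration of steps (1) and (2) above (a made-up example, not drawn from the cited work), a verbal claim such as “practice improves recall” can be formalised as a specific functional statement whose fit, not merely its sign, is evaluated:

P(\text{recall}_i = 1 \mid t_i) = \sigma(\beta_0 + \beta_1 \log(1 + t_i)), \qquad \beta_1 > 0,

where t_i is the amount of practice for item i and \sigma is the logistic function; the theory is then tested by how well this functional form describes the data rather than by rejecting \beta_1 = 0 alone.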

Qualitative research

Open scholarship research has focused primarily on quantitative data collection and analyses, with substantially less consideration of compatibility with qualitative or mixed methods 51, 176, 177, 178. Qualitative research presents methodological, ontological, epistemological, and ethical challenges that need to be considered to increase openness while preserving the integrity of the research process. The unique, context-dependent, and labour-intensive features of qualitative research can create barriers, for example, to preregistration or data sharing 179, 180. Similarly, some of the tools, practices, and concerns of open scholarship are simply not compatible with many qualitative epistemological approaches (e.g., a concern for replicability; ref. 41). Thus, a one-size-fits-all approach to qualitative or mixed methods data sharing and engagement with other open scholarship tools may not be appropriate for safeguarding the fundamental principles of qualitative research (see review 181). However, there is a growing body of literature describing how to engage in open scholarship practices when executing qualitative studies 6, 179, 182, 183, 184, and protocols are being developed specifically for qualitative research, such as preregistration templates 185, practices increasing transparency 186, and approaches for curating 187 and reusing qualitative data 188. Work is also ongoing to better represent open scholarship practices as a buffet from which researchers can choose, depending on each project’s limitations and opportunities 6, 189. Such an approach is reflected in various studies describing the tailored application of open scholarship protocols in qualitative studies 182, 184.

It is important to note that qualitative research also has dishes to add to the buffet of open science 178, 190. Qualitative research includes practices that realize forms of transparency that are currently lacking in quantitative work. One example is the practice of reflexivity, which aims to make transparent the positionality of the researcher(s) and their role in the production and interpretation of the data (see, e.g., refs. 191, 192, 193). A different form of transparency is ‘member checking’ 194, which makes the participants in a study part of the analysis process by asking them to comment on a preliminary report of the analysis. These practices—e.g., member checking, positionality, reflexivity, critical team discussions, and external audits—would likely hold benefits for strictly quantitative research as well, by promoting transparency and contextualization 178, 193.

Overall, validity, transparency, ethics, reflexivity, and collaboration can be fostered by engaging in qualitative open science; open practices which allow others to understand the research process and its knowledge generation are particularly impactful here 179 , 183 . Irrespective of the methodological and epistemological approach, then, transparency is key to the effective communication and evaluation of results from both quantitative and qualitative studies, and there have been promising developments within qualitative and mixed research towards increasing the uptake of open scholarship practices.

Diversity and inclusion

An important point to consider when encouraging change is that the playing field is not equal for all actors, underlining the need for flexibility that takes into account regional differences and marginalised groups, as well as differences in resource allocation, when implementing open science practices 51. For example, there are clear differences in the availability of resources by geographic region 195, 196 and across social groups, for example by ethnicity 197 or sex and gender 51, 198, 199. Resource disparities are also self-sustaining: for instance, funding increases the chances of conducting research at or beyond the state of the art, which in turn increases the chances of obtaining future funding 195. Choosing (preferably free) open access options, including preprints and post-prints, is one step that allows scholars to access resources irrespective of their privileges. An additional possibility is to waive article processing charges for researchers from low- and middle-income countries. Other options are pooled funding applications, redistribution of resources within international teams of researchers, and international collaborations. Big Team Science is a promising avenue for producing high-quality research while embracing diversity 150; yet the predominantly volunteer-based nature of such team science might exclude researchers who do not have allocated hours or funding for such team efforts. Hence, beyond these procedural and community changes, structural change aiming to foster diversity and inclusion is essential.

Outlook: what can we learn in the future?

Evidenced by the scale of developments discussed, the replication crisis has motivated structural, procedural, and community changes that would previously have been considered idealistic, if not impractical. While developments within the credibility revolution were originally fuelled by failed replications, these are not in themselves the only issue under discussion within the credibility revolution. Furthermore, replication rates alone may not be the best measure of research quality. Instead of focusing purely on replicability, we should strive to maximize transparency, rigour, and quality in all aspects of research 18, 200. To do so, we must treat structural, procedural, and community processes as intertwined drivers of change, and implement actionable changes at all levels. It is crucial that actors in different domains take responsibility for improvements and work together to ensure that high-quality outputs are incentivized and rewarded 201. If one is fixed without the other (e.g., researchers focus on high-quality outputs [individual level] but are incentivised to focus on novelty [structural level]), then the problems will prevail and meaningful reform will fail. In outlining the multiple positive changes already implemented and embedded, we hope to provide our scientific community with hope, and a structure, to make further advances in the crises and revolutions to come.

Bem, D. Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. J. Pers. Soc. Psychol. 100 , 407 (2011).

Crocker, J. The road to fraud starts with a single step. Nature 479 , 151–151 (2011).

Wagenmakers, E.-J., Wetzels, R., Borsboom, D. & Van Der Maas, H. L. Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). J. Pers. Soc. Psychol. 100, 426–432 (2011).

Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1 , 1–9 (2017).

Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349 , aac4716 (2015). This study was one of the first large-scale replication projects showing lower replication rates and smaller effect sizes among “successful” replicated findings.

Field, S. M., Hoekstra, R., Bringmann, L. & van Ravenzwaaij, D. When and why to replicate: as easy as 1, 2, 3? Collabra Psychol . 5 , 46 (2019).

Nosek, B. A. et al. Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73 , 719–748 (2022). This paper highlights the importance of addressing issues related to replicability, robustness, and reproducibility in psychological research to ensure the validity and reliability of findings.

Farrar, B. G., Boeckle, M. & Clayton, N. S. Replications in comparative cognition: what should we expect and how can we improve? Anim. Behav. Cognit. 7 , 1 (2020).

Farrar, B. G., Voudouris, K. & Clayton, N. S. Replications, comparisons, sampling and the problem of representativeness in animal cognition research. Anim. Behav. Cognit. 8 , 273 (2021).

Farrar, B. G. et al. Reporting and interpreting non-significant results in animal cognition research. PeerJ 11 , e14963 (2023).

Errington, T. M. et al. Investigating the replicability of preclinical cancer biology. Elife 10 , e71601 (2021).

Camerer, C. F. et al. Evaluating replicability of laboratory experiments in economics. Science 351 , 1433–1436 (2016).

Frith, U. Fast lane to slow science. Trends Cog. Sci. 24 , 1–2 (2020).

Pennington, C. A Student’s Guide to Open Science: Using the Replication Crisis to Reform Psychology (Open University Press, 2023).

Hendriks, F., Kienhues, D. & Bromme, R. Replication crisis = trust crisis? the effect of successful vs failed replications on laypeople’s trust in researchers and research. Public Underst. Sci. 29 , 270–288 (2020).

Sanders, M., Snijders, V. & Hallsworth, M. Behavioural science and policy: where are we now and where are we going? Behav. Public Policy 2 , 144–167 (2018).

Vazire, S. Implications of the credibility revolution for productivity, creativity, and progress. Perspect. Psychol. Sci. 13 , 411–417 (2018). This paper explores how the rise of the credibility revolution, which emphasizes the importance of evidence-based knowledge and critical thinking, can lead to increased productivity, creativity, and progress in various fields .

Freese, J., Rauf, T. & Voelkel, J. G. Advances in transparency and reproducibility in the social sciences. Soc. Sci. Res. 107 , 102770 (2022).

Trafimow, D. et al. Manipulating the alpha level cannot cure significance testing. Front. Psychol. 9 , 699 (2018).

Loken, E. & Gelman, A. Measurement error and the replication crisis. Science 355 , 584–585 (2017).

Azevedo, F. et al. Towards a culture of open scholarship: the role of pedagogical communities. BMC Res. Notes 15 , 75 (2022). This paper details (a) the need to integrate open scholarship principles into research training within higher education; (b) the benefit of pedagogical communities and the role they play in fostering an inclusive culture of open scholarship; and (c) call for greater collaboration with pedagogical communities, paving the way for a much needed integration of top-down and grassroot open scholarship initiatives.

Grahe, J. E., Cuccolo, K., Leighton, D. C. & Cramblet Alvarez, L. D. Open science promotes diverse, just, and sustainable research and educational outcomes. Psychol. Learn. Teach. 19, 5–20 (2020).

Norris, E. & O’Connor, D. B. Science as behaviour: using a behaviour change approach to increase uptake of open science. Psychol. Health 34 , 1397–1406 (2019).

Azevedo, F. et al. Introducing a framework for open and reproducible research training (FORRT). Preprint at https://osf.io/bnh7p/ (2019). This paper describes the importance of integrating open scholarship into higher education, its benefits and challenges, as well as FORRT initiatives aiming to support educators in this endeavor.

Nuijten, M. B. & Polanin, J. R. “statcheck”: Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Res. Synth. Methods 11 , 574–579 (2020).

McAleer, P. et al. Embedding data skills in research methods education: preparing students for reproducible research. Preprint at https://psyarxiv.com/hq68s/ (2022).

Holcombe, A. O., Kovacs, M., Aust, F. & Aczel, B. Documenting contributions to scholarly articles using CRediT and tenzing. PLoS ONE 15 , e0244611 (2020).

Koole, S. L. & Lakens, D. Rewarding replications: a sure and simple way to improve psychological science. Perspect. Psychol. Sci. 7 , 608–614 (2012).

Bauer, G. et al. Teaching constructive replications in the social sciences. Preprint at https://osf.io/g3k5t/ (2022).

Wagge, J. R. et al. A demonstration of the Collaborative Replication and Education Project: replication attempts of the red-romance effect. Collabra Psychol. 5, 5 (2019). This paper presents a multi-institutional effort to replicate and teach research methods by collaboratively conducting and evaluating replications of three psychology experiments.

Wagge, J. R. et al. Publishing research with undergraduate students via replication work: the collaborative replications and education project. Front. Psychol. 10 , 247 (2019).

Quintana, D. S. Replication studies for undergraduate theses to improve science and education. Nat. Hum. Behav. 5 , 1117–1118 (2021).

Button, K. S., Chambers, C. D., Lawrence, N. & Munafò, M. R. Grassroots training for reproducible science: a consortium-based approach to the empirical dissertation. Psychol. Learn. Teach. 19 , 77–90 (2020). The article argues that improving the reliability and efficiency of scientific research requires a cultural shift in both thinking and practice, and better education in reproducible science should start at the grassroots, presenting a model of consortium-based student projects to train undergraduates in reproducible team science and reflecting on the pedagogical benefits of this approach.

Feldman, G. Replications and extensions of classic findings in Judgment and Decision Making. https://doi.org/10.17605/OSF.IO/5Z4A8 (2020). A team of early career researchers, active mainly in 2018–2023, focused on: (1) a mass-scale project completing over 120 replications and extensions of classic findings in social psychology and judgment and decision making, and (2) building collaborative resources (tools, templates, and guides) to assist others in implementing open science.

Efendić, E. et al. Risky therefore not beneficial: replication and extension of Finucane et al.’s (2000) affect heuristic experiment. Soc. Psychol. Personal. Sci 13 , 1173–1184 (2022).

Ziano, I., Yao, J. D., Gao, Y. & Feldman, G. Impact of ownership on liking and value: replications and extensions of three ownership effect experiments. J. Exp. Soc. Psychol. 89 , 103972 (2020).

Pownall, M. et al. Embedding open and reproducible science into teaching: a bank of lesson plans and resources. Schol. Teach. Learn. Psychol. (in press) (2021). To support open science training in higher education, FORRT compiled lesson plans and activities, and categorized them based on their theme, learning outcome, and method of delivery, which are made publicly available as FORRT’s Lesson Plans.

Coles, N. A., DeBruine, L. M., Azevedo, F., Baumgartner, H. A. & Frank, M. C. ‘Big team’ science challenges us to reconsider authorship. Nat. Hum. Behav. 7, 665–667 (2023).

Allen, L., O’Connell, A. & Kiermer, V. How can we ensure visibility and diversity in research contributions? how the Contributor Role Taxonomy (CRediT) is helping the shift from authorship to contributorship. Learn. Publ. 32 , 71–74 (2019).

Allen, L., Scott, J., Brand, A., Hlava, M. & Altman, M. Publishing: Credit where credit is due. Nature 508 , 312–313 (2014).

Pownall, M. et al. The impact of open and reproducible scholarship on students’ scientific literacy, engagement, and attitudes towards science: a review and synthesis of the evidence. R. Soc. Open Sci. 10, 221255 (2023). This review article describes the available (empirical) evidence of the impact (and importance) of integrating open scholarship into higher education, and its benefits and challenges, in three specific areas: students’ (a) scientific literacy; (b) engagement with science; and (c) attitudes towards science.

Chopik, W. J., Bremner, R. H., Defever, A. M. & Keller, V. N. How (and whether) to teach undergraduates about the replication crisis in psychological science. Teach. Psychol. 45 , 158–163 (2018).

Frank, M. C. & Saxe, R. Teaching replication. Perspect. Psychol. Sci. 7 , 600–604 (2012). In this perspective article, Frank and Saxe advocate for incorporating replication as a fundamental component of research training in psychology and other disciplines .

Levin, N. & Leonelli, S. How does one “open” science? questions of value in biological research. Sci. Technol. Human Values 42 , 280–305 (2017).

Van Dijk, D., Manor, O. & Carey, L. B. Publication metrics and success on the academic job market. Curr. Bio. 24 , R516–R517 (2014).

Elsherif, M. M. et al. Bridging neurodiversity and open scholarship: how shared values can guide best practices for research integrity, social justice, and principled education. Preprint at https://osf.io/preprints/metaarxiv/k7a9p/ (2022). The authors describe systematic barriers, issues with disclosure, prevalence and stigma, and the intersection of neurodiversity and open scholarship, and provide recommendations for personal and systemic changes to improve acceptance of neurodivergent individuals. They also present the perspectives of neurodivergent authors, the majority of whom have personal lived experience of neurodivergence, and discuss possible improvements in research integrity, inclusivity, and diversity.

Onie, S. Redesign open science for Asia, Africa and Latin America. Nature 587 , 35–37 (2020).

Roberts, S. O., Bareket-Shavit, C., Dollins, F. A., Goldie, P. D. & Mortenson, E. Racial inequality in psychological research: trends of the past and recommendations for the future. Perspect. Psychol. Sci. 15 , 1295–1309 (2020). Roberts et al. highlight historical and current trends of racial inequality in psychological research and provide recommendations for addressing and reducing these disparities in the future .

Steltenpohl, C. N. et al. Society for the improvement of psychological science global engagement task force report. Collabra Psychol. 7 , 22968 (2021).

Parsons, S. et al. A community-sourced glossary of open scholarship terms. Nat. Hum. Behav. 6, 312–318 (2022). In response to the varied and plural new terminology introduced by the open scholarship movement, which has transformed academia’s lexicon, FORRT members have produced a community- and consensus-based glossary to facilitate education and effective communication between experts and newcomers.

Pownall, M. et al. Navigating open science as early career feminist researchers. Psychol. Women Q. 45 , 526–539 (2021).

Gourdon-Kanhukamwe, A. et al. Opening up understanding of neurodiversity: a call for applying participatory and open scholarship practices. Preprint at https://osf.io/preprints/metaarxiv/jq23s/ (2022).

Leech, G. Reversals in psychology. Behavioural and Social Sciences (Nature Portfolio) at https://socialsciences.nature.com/posts/reversals-in-psychology (2021).

Orben, A. A journal club to fix science. Nature 573 , 465–466 (2019).

Arnold, B. et al. The turing way: a handbook for reproducible data science. Zenodo https://doi.org/10.5281/zenodo.3233986 (2019).

Open Life Science. A mentoring & training program for Open Science ambassadors. https://openlifesci.org/ (2023).

Almarzouq, B. et al. Opensciency—a core open science curriculum by and for the research community (2023).

Schönbrodt, F. et al. Netzwerk der Open-Science-Initiativen (NOSI). https://osf.io/tbkzh/ (2016).

Ball, R. et al. Course Syllabi for Open and Reproducible Methods. https://osf.io/vkhbt/ (2022).

The Carpentries. https://carpentries.org/ (2023).

The Embassy of Good Science. https://embassy.science/wiki/Main_Page (2023).

Berkeley Initiative for Transparency in the Social Sciences. https://www.bitss.org/ (2023).

Institute for Replication. https://i4replication.org/ (2023).

Reproducibility for Everyone. https://www.repro4everyone.org/ (2023).

Armeni, K. et al. Towards wide-scale adoption of open science practices: The role of open science communities. Sci. Public Policy 48 , 605–611 (2021).

Welcome to the UK Reproducibility Network The UK Reproducibility Network (UKRN). https://www.ukrn.org/ (2023).

Collyer, F. M. Global patterns in the publishing of academic knowledge: Global North, global South. Curr. Soc. 66 , 56–73 (2018).

Ali-Khan, S. E., Harris, L. W. & Gold, E. R. Motivating participation in open science by examining researcher incentives. Elife 6 , e29319 (2017).

Robson, S. G. et al. Promoting open science: a holistic approach to changing behaviour. Collabra Psychol. 7 , 30137 (2021).

Coalition for Advancing Research Assessment. https://coara.eu/ (2023).

Vanclay, J. K. Impact factor: outdated artefact or stepping-stone to journal certification? Scientometrics 92 , 211–238 (2012).

Kidwell, M. C. et al. Badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. PLoS Bio. 14 , e1002456 (2016).

Rowhani-Farid, A., Aldcroft, A. & Barnett, A. G. Did awarding badges increase data sharing in BMJ Open? A randomized controlled trial. Roy. Soc. Open Sci. 7 , 191818 (2020).

Thibault, R. T., Pennington, C. R. & Munafo, M. Reflections on preregistration: core criteria, badges, complementary workflows. J. Trial & Err. https://doi.org/10.36850/mr6 (2022).

Chambers, C. D. Registered reports: a new publishing initiative at Cortex. Cortex 49 , 609–610 (2013).

Chambers, C. D. & Tzavella, L. The past, present and future of registered reports. Nat. Hum. Behav. 6 , 29–42 (2022).

Soderberg, C. K. et al. Initial evidence of research quality of registered reports compared with the standard publishing model. Nat. Hum. Behav. 5 , 990–997 (2021).

Scheel, A. M., Schijen, M. R. & Lakens, D. An excess of positive results: comparing the standard Psychology literature with Registered Reports. Adv. Meth. Pract. Psychol. Sci. 4 , 25152459211007467 (2021).

Renbarger, R. et al. Champions of transparency in education: what journal reviewers can do to encourage open science practices. Preprint at https://doi.org/10.35542/osf.io/xqfwb .

Nosek, B. A. et al. Transparency and Openness Promotion (TOP) Guidelines. Center for Open Science project. https://osf.io/9f6gx/ (2022).

Silverstein, P. et al. A Guide for Social Science Journal Editors on Easing into Open Science. (2023). Preprint at https://doi.org/10.31219/osf.io/hstcx .

NASA. Transform to Open Science (TOPS). https://github.com/nasa/Transform-to-Open-Science (2023).

UNESCO. UNESCO Recommendation on Open Science. https://unesdoc.unesco.org/ark:/48223/pf0000379949.locale=en (2021).

European University Association. https://eua.eu (2023).

Munafò, M. R. Improving the efficiency of grant and journal peer review: registered reports funding. Nicotine Tob. Res. 19 , 773–773 (2017).

Else, H. A guide to Plan S: the open-access initiative shaking up science publishing. Nature (2021).

Mills, M. Plan S–what is its meaning for open access journals and for the JACMP? J. Appl. Clin. Med. Phys. 20 , 4 (2019).

Zhang, L., Wei, Y., Huang, Y. & Sivertsen, G. Should open access lead to closed research? the trends towards paying to perform research. Scientometrics 127 , 7653–7679 (2022).

McNutt, M. Plan S falls short for society publishers—and for the researchers they serve. Proc. Natl Acad. Sci. USA 116 , 2400–2403 (2019).

PeerCommunityIn. https://peercommunityin.org/ (2023).

Elife. https://elifesciences.org/for-the-press/b2329859/elife-ends-accept-reject-decisions-following-peer-review (2023).

Nosek, B. A., Spies, J. R. & Motyl, M. Scientific utopia II: Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7 , 615–631 (2012).

Schönbrodt, F. https://www.nicebread.de/open-science-hiring-practices/ (2016).

Delios, A. et al. Examining the generalizability of research findings from archival data. Proc. Natl Acad. Sci. USA 119 , e2120377119 (2022).

Dreber, A. et al. Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl Acad. Sci. USA 112 , 15343–15347 (2015).

Fraser, H. et al. Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process. PLoS ONE 18 , e0274429 (2023).

Gordon, M., Viganola, D., Dreber, A., Johannesson, M. & Pfeiffer, T. Predicting replicability-analysis of survey and prediction market data from large-scale forecasting projects. PLoS ONE 16 , e0248780 (2021).

Tierney, W. et al. Creative destruction in science. Organ. Behav. Hum. Decis. Process 161 , 291–309 (2020).

Tierney, W. et al. A creative destruction approach to replication: implicit work and sex morality across cultures. J. Exp. Soc. Psychol. 93 , 104060 (2021).

Hoogeveen, S., Sarafoglou, A. & Wagenmakers, E.-J. Laypeople can predict which social-science studies will be replicated successfully. Adv. Meth. Pract. Psychol. Sci. 3 , 267–285 (2020).

Lewandowsky, S. & Oberauer, K. Low replicability can support robust and efficient science. Nat. Commun. 11 , 358 (2020).

Button, K. S. & Munafò, M. R. in Psychological Science under Scrutiny: Recent Challenges and Proposed Solutions 22–33 (2017).

Świątkowski, W. & Dompnier, B. Replicability crisis in social psychology: looking at the past to find new pathways for the future. Int. Rev. Soc. Psychol. 30 , 111–124 (2017).

Simonsohn, U., Nelson, L. D. & Simmons, J. P. P-curve: a key to the file-drawer. J. Exp. Psychol. Gen. 143 , 534 (2014).

Brunner, J. & Schimmack, U. Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychol . 4 , 1–22 (2020).

Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2 , 6–10 (2018).

Rubin, M. & Donkin, C. Exploratory hypothesis tests can be more compelling than confirmatory hypothesis tests. Philos. Psychol . (in-press) 1–29 (2022).

Amrhein, V. & Greenland, S. Remove, rather than redefine, statistical significance. Nat. Hum. Behav. 2 , 4–4 (2018).

Trafimow, D. & Marks, M. Editorial in basic and applied social psychology. Basic Appl. Soc. Psych. 37 , 1–2 (2015).

Lakens, D. et al. Justify your alpha. Nat. Hum. Behav. 2 , 168–171 (2018).

Lakens, D., Scheel, A. M. & Isager, P. M. Equivalence testing for psychological research: a tutorial. Adv. Meth. Prac. Psychol. Sci. 1 , 259–269 (2018).

Verhagen, J. & Wagenmakers, E.-J. Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. Gen. 143 , 1457 (2014).

Dienes, Z. Bayesian versus orthodox statistics: Which side are you on?. Perspect. Psychol. Sci. 6 , 274–290 (2011).

Love, J. et al. JASP: Graphical statistical software for common statistical designs. J. Stat. Soft. 88 , 1–17 (2019).

Şahin, M. & Aybek, E. Jamovi: an easy to use statistical software for the social scientists. Int. J. Assess. Tools Educ. 6 , 670–692 (2019).

Brown, N. J. & Heathers, J. A. The GRIM test: a simple technique detects numerous anomalies in the reporting of results in psychology. Soc. Psychol. Personal. Sci. 8 , 363–369 (2017).

Heathers, J. A., Anaya, J., van der Zee, T. & Brown, N. J. Recovering data from summary statistics: Sample parameter reconstruction via iterative techniques (SPRITE). PeerJ Preprints 6 , e26968v1 (2018).

Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582 , 84–88 (2020).

Breznau, N. et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc. Natl Acad. Sci. USA 119 , e2203150119 (2022).

Breznau, N. et al. The Crowdsourced Replication Initiative: investigating immigration and social policy preferences: executive report. https://osf.io/preprints/socarxiv/6j9qb/ (2019).

Gelman, A. & Loken, E. The statistical crisis in science: data-dependent analysis (a ‘garden of forking paths’) explains why many statistically significant comparisons don’t hold up. Am. Sci. 102, 460 (2014).

Azevedo, F. & Jost, J. T. The ideological basis of antiscientific attitudes: effects of authoritarianism, conservatism, religiosity, social dominance, and system justification. Group Process. Intergroup Relat. 24 , 518–549 (2021).

Heininga, V. E., Oldehinkel, A. J., Veenstra, R. & Nederhof, E. I just ran a thousand analyses: benefits of multiple testing in understanding equivocal evidence on gene-environment interactions. PLoS ONE 10 , e0125383 (2015).

Liu, Y., Kale, A., Althoff, T. & Heer, J. Boba: Authoring and visualizing multiverse analyses. IEEE Trans. Vis. Comp. Graph. 27 , 1753–1763 (2020).

Harder, J. A. The multiverse of methods: extending the multiverse analysis to address data-collection decisions. Perspect. Psychol. Sci. 15 , 1158–1177 (2020).

Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11 , 702–712 (2016).

Azevedo, F., Marques, T. & Micheli, L. In pursuit of racial equality: identifying the determinants of support for the black lives matter movement with a systematic review and multiple meta-analyses. Perspect. Politics , (in-press), 1–23 (2022).

Borenstein, M., Hedges, L. V., Higgins, J. P. & Rothstein, H. R. Introduction to Meta-analysis (John Wiley & Sons, 2021).

Higgins, J. P. et al. Cochrane Handbook for Systematic Reviews of Interventions (John Wiley & Sons, 2022).

Carter, E. C., Schönbrodt, F. D., Gervais, W. M. & Hilgard, J. Correcting for bias in psychology: a comparison of meta-analytic methods. Adv. Meth. Pract. Psychol. Sci. 2 , 115–144 (2019).

Nuijten, M. B., Hartgerink, C. H., Van Assen, M. A., Epskamp, S. & Wicherts, J. M. The prevalence of statistical reporting errors in psychology (1985–2013). Behav. Res. Methods 48 , 1205–1226 (2016).

Van Assen, M. A., van Aert, R. & Wicherts, J. M. Meta-analysis using effect size distributions of only statistically significant studies. Psychol. Meth. 20 , 293 (2015).

Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Int. J. Surg. 88 , 105906 (2021).

Topor, M. K. et al. An integrative framework for planning and conducting non-intervention, reproducible, and open systematic reviews (NIRO-SR). Meta-Psychol . (In Press) (2022).

Van den Akker, O. et al. Generalized systematic review registration form. Preprint at https://doi.org/10.31222/osf.io/3nbea (2020).

Booth, A. et al. The nuts and bolts of PROSPERO: an international prospective register of systematic reviews. Sys. Rev. 1 , 1–9 (2012).

Cristea, I. A., Naudet, F. & Caquelin, L. Meta-research studies should improve and evaluate their own data sharing practices. J. Clin. Epidemiol. 149 , 183–189 (2022).

Knobloch, K., Yoon, U. & Vogt, P. M. Preferred reporting items for systematic reviews and meta-analyses (PRISMA) statement and publication bias. J. Craniomaxillofac Surg. 39 , 91–92 (2011).

Lakens, D., Hilgard, J. & Staaks, J. On the reproducibility of meta-analyses: Six practical recommendations. BMC Psychol. 4 , 1–10 (2016).

Editors, P. M. Best practice in systematic reviews: the importance of protocols and registration. PLoS Med. 8 , e1001009 (2011).

Tsujimoto, Y. et al. Majority of systematic reviews published in high-impact journals neglected to register the protocols: a meta-epidemiological study. J. Clin. Epidemiol. 84 , 54–60 (2017).

Xu, C. et al. Protocol registration or development may benefit the design, conduct and reporting of dose-response meta-analysis: empirical evidence from a literature survey. BMC Med. Res. Meth. 19 , 1–10 (2019).

Polanin, J. R., Hennessy, E. A. & Tsuji, S. Transparency and reproducibility of meta-analyses in psychology: a meta-review. Perspect. Psychol. Sci. 15 , 1026–1041 (2020).

Uhlmann, E. L. et al. Scientific utopia III: Crowdsourcing science. Perspect. Psychol. Sci. 14 , 711–733 (2019).

So, T. Classroom experiments as a replication device. J. Behav. Exp. Econ. 86 , 101525 (2020).

Ebersole, C. R. et al. Many labs 3: evaluating participant pool quality across the academic semester via replication. J. Exp. Soc. Psychol. 67 , 68–82 (2016).

Klein, R. et al. Investigating variation in replicability: a “many labs” replication project. Soc. Psychol . 45 , 142–152 (2014).

Glöckner, A. et al. Hagen Cumulative Science Project. Project overview at osf.io/d7za8 (2015).

Moshontz, H. et al. The psychological science accelerator: advancing psychology through a distributed collaborative network. Adv. Meth. Pract. Psychol. Sci. 1 , 501–515 (2018).

Forscher, P. S. et al. The Benefits, Barriers, and Risks of Big-Team Science. Perspect. Psychol. Sci. 17456916221082970 (2020). The paper discusses the advantages and challenges of conducting large-scale collaborative research projects, highlighting the potential for increased innovation and impact, as well as the difficulties in managing complex collaborations and addressing issues related to authorship and credit.

Lieck, D. S. N. & Lakens, D. An Overview of Team Science Projects in the Social Behavioral Sciences. https://doi.org/10.17605/OSF.IO/WX4ZD (2022).

Jarke, H. et al. A roadmap to large-scale multi-country replications in psychology. Collabra Psychol. 8 , 57538 (2022).

Pennington, C. R., Jones, A. J., Tzavella, L., Chambers, C. D. & Button, K. S. Beyond online participant crowdsourcing: the benefits and opportunities of big team addiction science. Exp. Clin. Psychopharmacol . 30 , 444–451 (2022).

Disis, M. L. & Slattery, J. T. The road we must take: multidisciplinary team science. Sci. Trans. Med. 2 , 22cm9–22cm9 (2010).

Ledgerwood, A. et al. The pandemic as a portal: reimagining psychological science as truly open and inclusive. Perspect. Psychol. Sci. 17 , 937–959 (2022).

Legate, N. et al. A global experiment on motivating social distancing during the COVID-19 pandemic. Proc. Natl Acad. Sci. USA 119 , e2111091119 (2022).

Predicting attitudinal and behavioral responses to COVID-19 pandemic using machine learning. PNAS Nexus 1, 1–15 (2022).

Van Bavel, J. J. et al. National identity predicts public health support during a global pandemic. Nat. Commun. 13 , 517 (2022).

Buchanan, E. M. et al. The psychological science accelerator’s COVID-19 rapid-response dataset. Sci. Data 10 , 87 (2023).

Azevedo, F. et al. Social and moral psychology of covid-19 across 69 countries. Nat. Sci. Dat . https://kar.kent.ac.uk/99184/ (2022).

Wang, K. et al. A multi-country test of brief reappraisal interventions on emotions during the COVID-19 pandemic. Nat. Hum. Behav. 5 , 1089–1110 (2021).

Dorison, C. A. et al. In COVID-19 health messaging, loss framing increases anxiety with little-to-no concomitant benefits: Experimental evidence from 84 countries. Affect. Sci. 3 , 577–602 (2022).

Coles, N. A. et al. A multi-lab test of the facial feedback hypothesis by the many smiles collaboration. Nat. Hum. Behav. 6 , 1731–1742 (2022).

Coles, N. A., Gaertner, L., Frohlich, B., Larsen, J. T. & Basnight-Brown, D. M. Fact or artifact? demand characteristics and participants’ beliefs can moderate, but do not fully account for, the effects of facial feedback on emotional experience. J. Pers. Soc. Psychol. 124 , 287 (2023).

Cowan, N. et al. How do scientific views change? notes from an extended adversarial collaboration. Perspect. Psychol. Sci. 15 , 1011–1025 (2020).

Forscher, P. S. et al. Stereotype threat in black college students across many operationalizations. Preprint at https://psyarxiv.com/6hju9/ (2019).

Kahneman, D. & Klein, G. Conditions for intuitive expertise: a failure to disagree. Am. Psychol. 64 , 515 (2009).

Kekecs, Z. et al. Raising the value of research studies in psychological science by increasing the credibility of research reports: the transparent Psi project. Roy. Soc. Open Sci. 10 , 191375 (2023).

Henrich, J., Heine, S. J. & Norenzayan, A. The weirdest people in the world? Behav. Brain Sci. 33 , 61–83 (2010).

Yarkoni, T. The generalizability crisis. Behav. Brain Sci. 45 , e1 (2022).

Ghai, S. It’s time to reimagine sample diversity and retire the WEIRD dichotomy. Nat. Hum. Behav. 5 , 971–972 (2021). The paper argues that the reliance on WEIRD (Western, educated, industrialized, rich, and democratic) samples in psychological research limits the generalizability of findings and suggests reimagining sample diversity to ensure greater external validity.

Nielsen, M. W. & Andersen, J. P. Global citation inequality is on the rise. Proc. Natl Acad. Sci. USA 118 , e2012208118 (2021).

Oberauer, K. & Lewandowsky, S. Addressing the theory crisis in psychology. Psychon. Bull. Rev. 26 , 1596–1618 (2019).

Devezer, B., Navarro, D. J., Vandekerckhove, J. & Ozge Buzbas, E. The case for formal methodology in scientific reform. Roy. Soc. Open Sci. 8 , 200805 (2021).

Scheel, A. M., Tiokhin, L., Isager, P. M. & Lakens, D. Why hypothesis testers should spend less time testing hypotheses. Perspect. Psychol. Sci. 16 , 744–755 (2021).

Chauvette, A., Schick-Makaroff, K. & Molzahn, A. E. Open data in qualitative research. Int. J. Qual. Meth. 18 , 1609406918823863 (2019).

Field, S. M., van Ravenzwaaij, D., Pittelkow, M.-M., Hoek, J. M. & Derksen, M. Qualitative open science—pain points and perspectives. Preprint at https://doi.org/10.31219/osf.io/e3cq4 (2021).

Steltenpohl, C. N. et al. Rethinking transparency and rigor from a qualitative open science perspective. J. Trial & Err. https://doi.org/10.36850/mr7 (2023).

Branney, P. et al. Three steps to open science for qualitative research in psychology. Soc. Pers. Psy. Comp . 17 , 1–16 (2023).

VandeVusse, A., Mueller, J. & Karcher, S. Qualitative data sharing: Participant understanding, motivation, and consent. Qual. Health Res. 32 , 182–191 (2022).

Çelik, H., Baykal, N. B. & Memur, H. N. K. Qualitative data analysis and fundamental principles. J. Qual. Res. Educ . 8 , 379–406 (2020).

Class, B., de Bruyne, M., Wuillemin, C., Donzé, D. & Claivaz, J.-B. Towards open science for the qualitative researcher: from a positivist to an open interpretation. Int. J. Qual. Meth. 20 , 16094069211034641 (2021).

Humphreys, L., Lewis Jr, N. A., Sender, K. & Won, A. S. Integrating qualitative methods and open science: five principles for more trustworthy research. J. Commun. 71 , 855–874 (2021).

Steinhardt, I., Bauer, M., Wünsche, H. & Schimmler, S. The connection of open science practices and the methodological approach of researchers. Qual. Quant . (in-press) 1–16 (2022).

Haven, T. L. & Van Grootel, L. Preregistering qualitative research. Account. Res. 26 , 229–244 (2019).

Frohwirth, L., Karcher, S. & Lever, T. A. A transparency checklist for qualitative research. Preprint at https://doi.org/10.31235/osf.io/wc35g (2023).

Demgenski, R., Karcher, S., Kirilova, D. & Weber, N. Introducing the qualitative data repository’s curation handbook. J. eSci. Librariansh. 10 , 1–11 (2021).

Karcher, S., Kirilova, D., Pagé, C. & Weber, N. How data curation enables epistemically responsible reuse of qualitative data. Qual. Rep . 26 , 1996–2010 (2021).

Bergmann, C. How to integrate open science into language acquisition research? Student workshop at BUCLD 43 (2018).

Bergmann, C. The buffet approach to open science. https://cogtales.wordpress.com/2023/04/16/the-buffet-approach-to-open-science/ (2023).

Field, S. M. & Derksen, M. Experimenter as automaton; experimenter as human: exploring the position of the researcher in scientific research. Eur. J. Philos. Sci. 11 , 11 (2021).

Chenail, R. J. Communicating your qualitative research better. Fam. Bus. Rev. 22 , 105–108 (2009).

Levitt, H. M. et al. The meaning of scientific objectivity and subjectivity: from the perspective of methodologists. Psychol. Methods 27 , 589–605 (2020).

Candela, A. G. Exploring the function of member checking. Qual. Rep. 24 , 619–628 (2019).

Petersen, O. H. Inequality of research funding between different countries and regions is a serious problem for global science. Function 2 , zqab060 (2021).

Puthillam, A. et al. Guidelines to improve internationalization in psychological science. Preprint at https://psyarxiv.com/2u4h5/ (2022).

Taffe, M. & Gilpin, N. Equity, diversity and inclusion: racial inequity in grant funding from the US National Institutes of Health. eLife 10 , e65697 (2021).

Burns, K. E., Straus, S. E., Liu, K., Rizvi, L. & Guyatt, G. Gender differences in grant and personnel award funding rates at the Canadian Institutes of Health Research based on research content area: a retrospective analysis. PLoS Med. 16 , e1002935 (2019).

Sato, S., Gygax, P. M., Randall, J. & Schmid Mast, M. The leaky pipeline in research grant peer review and funding decisions: challenges and future directions. High. Educ. 82 , 145–162 (2021).

Guttinger, S. The limits of replicability. Eur. J. Philos. Sci. 10 , 10 (2020).

Evans, T. Developments in open data norms. J. Open Psychol. Data 10 , 1–6 (2022).

John, L. K., Loewenstein, G. & Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23 , 524–532 (2012).

Simmons, J. P., Nelson, L. D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).

Wicherts, J. M. et al. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front. Psychol. 7 , 1832 (2016).

Flake, J. K., Pek, J. & Hehman, E. Construct validation in social and personality research: current practice and recommendations. Soc. Psychol. Personal. Sci. 8 , 370–378 (2017).

Flake, J. K. & Fried, E. I. Measurement schmeasurement: questionable measurement practices and how to avoid them. Adv. Meth. Pract. Psychol. Sci. 3 , 456–465 (2020).

Agnoli, F., Wicherts, J. M., Veldkamp, C. L., Albiero, P. & Cubelli, R. Questionable research practices among Italian research psychologists. PLoS ONE 12 , e0172792 (2017).

Fiedler, K. & Schwarz, N. Questionable research practices revisited. Soc. Psychol. Personal. Sci 7 , 45–52 (2016).

Kerr, N. L. HARKing: Hypothesizing after the results are known. Pers. Soc. Psychol. Rev. 2 , 196–217 (1998).

Molléri, J. S. Research Incentives in Academia Leading to Unethical Behavior. in Research Challenges in Information Science: 16th International Conference, RCIS 2022, Barcelona, Spain, May 17–20, 2022, Proceedings 744–751 (Springer, 2022).

Gerrits, R. G. et al. Occurrence and nature of questionable research practices in the reporting of messages and conclusions in international scientific Health Services Research publications: a structured assessment of publications authored by researchers in the Netherlands. BMJ Open 9 , e027903 (2019).

Checchi, D., De Fraja, G. & Verzillo, S. Incentives and careers in academia: theory and empirical analysis. Rev. Econ. Stat. 103 , 786–802 (2021).

Grove, L. The Effects of Funding Policies on Academic Research . Ph.D. thesis, University College London (2017).

Frias-Navarro, D., Pascual-Soler, M., Perezgonzalez, J., Monterde-i Bort, H. & Pascual-Llobell, J. Spanish Scientists’ Opinion about Science and Researcher Behavior. Span. J. Psychol. 24 , e7 (2021).

Bornmann, L. & Daniel, H.-D. The state of h index research: is the h index the ideal way to measure research performance? EMBO Rep. 10 , 2–6 (2009).

Munafò, M. et al. Scientific rigor and the art of motorcycle maintenance. Nat. Biotechn. 32 , 871–873 (2014).

Primbs, M. A. et al. Are small effects the indispensable foundation for a cumulative psychological science? A reply to Götz et al. (2022). Perspect. Psychol. Sci. 18 , 508–512 (2022).

Martin, G. & Clarke, R. M. Are psychology journals anti-replication? A snapshot of editorial practices. Front. Psychol. 8 , 523 (2017).

Cohen, B. A. How should novelty be valued in science? Elife 6 , e28699 (2017).

Tijdink, J. K., Vergouwen, A. C. & Smulders, Y. M. Publication pressure and burn out among Dutch medical professors: a nationwide survey. PLoS ONE 8 , e73381 (2013).

Tijdink, J. K., Verbeke, R. & Smulders, Y. M. Publication pressure and scientific misconduct in medical scientists. J. Empir. Res. Hum. Res. Ethics 9 , 64–71 (2014).

Laitin, D. D. et al. Reporting all results efficiently: a RARE proposal to open up the file drawer. Proc. Natl Acad. Sci. USA 118 , e2106178118 (2021).

Franco, A., Malhotra, N. & Simonovits, G. Publication bias in the social sciences: unlocking the file drawer. Science 345 , 1502–1505 (2014).

Matarese, V. Kinds of replicability: different terms and different functions. Axiomathes 1–24 (2022).

Maxwell, S. E., Lau, M. Y. & Howard, G. S. Is psychology suffering from a replication crisis? what does “failure to replicate” really mean? Am. Psychol. 70 , 487 (2015).

Ulrich, R. & Miller, J. Questionable research practices may have little effect on replicability. Elife 9 , e58237 (2020).

Devezer, B. & Buzbas, E. Minimum viable experiment to replicate (2021). Preprint at http://philsci-archive.pitt.edu/21475/ .

Stroebe, W. & Strack, F. The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9 , 59–71 (2014).

Feest, U. Why replication is overrated. Phil. Sci. 86 , 895–905 (2019).

Eronen, M. I. & Bringmann, L. F. The theory crisis in psychology: how to move forward. Perspect. Psychol. Sci. 16 , 779–788 (2021).

Acknowledgements

We thank Ali H. Al-Hoorie for his help with reference formatting and for commenting on the manuscript, and Sriraj Aiyer, Crystal Steltenpohl, Maarten Derksen, Rinske Vermeij, and Esther Plomp for their comments on an earlier version of the manuscript. No funding was received for this work.

Author information

Flavio Azevedo

Present address: Department of Social Psychology, University of Groningen, Groningen, The Netherlands

Authors and Affiliations

Department of Health and Functioning, Western Norway University of Applied Sciences, Bergen, Norway

Max Korbmacher

NORMENT Centre for Psychosis Research, University of Oslo and Oslo University Hospital, Oslo, Norway

Mohn Medical Imaging and Visualisation Center, Bergen, Norway

Department of Psychology, University of Cambridge, Cambridge, UK

School of Psychology, Aston University, Birmingham, UK

Charlotte R. Pennington

Department of Neurology, University of Essen, Essen, Germany

Helena Hartmann

School of Psychology, University of Leeds, Leeds, UK

Madeleine Pownall

Department of Psychology, Ashland University, Ashland, USA

Kathleen Schmidt

School of Psychology, University of Birmingham, Birmingham, UK

Mahmoud Elsherif & Shijun Yu

SOCIUM Research Center on Inequality and Social Policy, University of Bremen, Bremen, Germany

Nate Breznau

Department of Psychiatry, University of Oxford, Oxford, UK

Olly Robertson

Department of Education, ICT and Learning, Ostfold University College, Halden, Norway

Tamara Kalandadze

Department of Sport and Recreation Management, Temple University, Philadelphia, USA

Bradley J. Baker

School of Psychology, Cardiff University, Cardiff, UK

Aoife O’Mahony

Kavli Institute for Systems Neuroscience, Norwegian University of Science and Technology, Trondheim, Norway

Jørgen Ø. -S. Olsnes

Division of Psychology, De Montfort University, Leicester, UK

John J. Shaw

Macedonian Academy of Sciences and Arts, Skopje, North Macedonia

Biljana Gjoneska

Faculty of Arts and Science, Kyushu University, Fukuoka, Japan

Yuki Yamada

Department of Psychology and Psychotherapy, Witten/Herdecke University, Witten, Germany

Jan P. Röer

Department of Applied Science, Technological University Dublin, Dublin, Ireland

Jennifer Murphy

Graduate School of Business, Stanford University, Stanford, USA

Shilaan Alzahawi

Institute of Psychology, University of Graz, Graz, Austria

Sandra Grinschgl

Department of Psychology, University of York, York, UK

Catia M. Oliveira

Institute of General Practice and Family Medicine, University of Bonn, Bonn, Germany

Tobias Wingen

Department of Psychology, Chinese University of Hong Kong, Hong Kong, China

Siu Kit Yeung

Faculty of Education, University of Cambridge, Cambridge, UK

Faculty of Life Sciences: Food, Nutrition and Health, University of Bayreuth, Bayreuth, Germany

Laura M. König

Open Psychology Research Centre, Open University, Milton Keynes, UK

Nihan Albayrak-Aydemir

Department of Psychological and Behavioural Science, London School of Economics and Political Science, London, UK

Department of Psychology, Universidad Rey Juan Carlos, Madrid, Spain

Oscar Lecuona

Faculty of Psychology, Universidad Autónoma de Madrid, Madrid, Spain

Institute of Psychology, Leiden University, Leiden, The Netherlands

Leticia Micheli

School of Human Sciences, University of Greenwich, Greenwich, UK

Thomas Evans

Institute for Lifecourse Development, University of Greenwich, Greenwich, UK

Contributions

Author contributions and roles are laid out based on the CRediT (Contributor Roles Taxonomy). Please visit https://www.casrai.org/credit.html for details and definitions of each of the roles listed below 27 . Conceptualization: M.K., F.A. and T.R.E. Methodology: M.K., F.A. and T.R.E. Project administration: M.K. and F.A. Visualisation: M.K. and F.A. Writing—original draft: M.K., F.A., T.R.E., C.R.P. Writing—review and editing: M.K., F.A., H.H., M.M.E., N.B., A.O.M., T.K., J.Ø.S.O., C.R.P., J.J.S., M.P., B.G., Y.Y., J.P.R., J.M., S.A., B.J.B., S.G., C.M.O., T.W., S.K.Y., M.L., L.M.K., N.A.A., O.L., T.R.E., O.R., L.M., K.S. and S.Y.

Corresponding author

Correspondence to Flavio Azevedo .

Ethics declarations

Competing interests.

This work is an initiative of The Framework for Open and Reproducible Research Training (FORRT; https://forrt.org). All authors are advocates of open scholarship with different open science organisational affiliations, including FORRT: M.K., F.A., H.H., M.M.E., A.O.M., B.S., T.K., C.R.P., J.J.S., Y.Y., J.P.R., J.M., S.A., B.J.B., S.G., C.M.O., T.W., S.K.Y., M.L., L.M.K., N.A.A., O.L., T.E., O.R., L.M., K.S. and S.Y. ReproducibiliTEA journal clubs: Y.Y., J.S., C.R.P., M.M.E. and O.L. NASA TOPs: F.A. and S.A. Student Initiative for Open Science: M.K. Open Applied Linguistics: M.L. Psychological Science Accelerator: M.K., F.A., T.K., C.R.P., M.P., K.S., M.E., N.B., B.B., A.O.M., B.G., Y.Y., J.R.P., S.A., S.K.Y., M.L., N.A.A., T.E. and M.M.E. ManyMany: B.G. and M.M.E. ManyBabies: M.M.E. The United Kingdom Reproducibility Network: F.A., C.R.P., T.E., O.R. and A.O.M. Reproducible Research Oxford: O.R. The Sports Science Replication Centre: J.M. Norway’s Reproducibility Network: T.K. and M.K. Interessengruppe Offene und Reproduzierbare Forschung: H.H. Arbeitsgruppe Open Science der Fachgruppe Gesundheitspsychologie in der Deutschen Gesellschaft für Psychologie: L.M.K. Reproducible Interpretable Open Transparent (RIOT) Science Club: T.K. Center for Open Science: S.A. Open Sciency: F.A. and S.A. Society for the Improvement of Psychological Science: F.A. and S.A. Stanford Center for Open and Reproducible Science: S.A.

Peer review

Peer review information.

Communications Psychology thanks Rima-Maria Rahal and Ulf Toelch for their contribution to the peer review of this work. Primary Handling Editor: Marike Schiffer. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Korbmacher, M., Azevedo, F., Pennington, C.R. et al. The replication crisis has led to positive structural, procedural, and community changes. Commun Psychol 1 , 3 (2023). https://doi.org/10.1038/s44271-023-00003-2

Received : 20 January 2023

Accepted : 22 May 2023

Published : 25 July 2023

DOI : https://doi.org/10.1038/s44271-023-00003-2

This article is cited by

A year of growth.

Communications Psychology (2024)

Hyper-ambition and the Replication Crisis: Why Measures to Promote Research Integrity can Falter

  • Yasemin J. Erden

Journal of Academic Ethics (2024)

Reducing intervention- and research-induced inequalities to tackle the digital divide in health promotion

  • Rebecca A. Krukowski
  • Max J. Western

International Journal for Equity in Health (2023)

HYPOTHESIS AND THEORY article

Replication, falsification, and the crisis of confidence in social psychology.

Brian D. Earp* and David Trafimow

  • 1 Uehiro Centre for Practical Ethics, University of Oxford, Oxford, UK
  • 2 Department of History and Philosophy of Science, University of Cambridge, Cambridge, UK
  • 3 Department of Psychology, New Mexico State University, Las Cruces, NM, USA

The (latest) crisis in confidence in social psychology has generated much heated discussion about the importance of replication, including how it should be carried out as well as interpreted by scholars in the field. For example, what does it mean if a replication attempt “fails”—does it mean that the original results, or the theory that predicted them, have been falsified? And how should “failed” replications affect our belief in the validity of the original research? In this paper, we consider the replication debate from a historical and philosophical perspective, and provide a conceptual analysis of both replication and falsification as they pertain to this important discussion. Along the way, we highlight the importance of auxiliary assumptions (for both testing theories and attempting replications), and introduce a Bayesian framework for assessing “failed” replications in terms of how they should affect our confidence in original findings.

“Only when certain events recur in accordance with rules or regularities, as in the case of repeatable experiments, can our observations be tested—in principle—by anyone.… Only by such repetition can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.”

– Karl Popper (1959, p. 45)

Introduction

Scientists pay lip-service to the importance of replication. It is the “coin of the scientific realm” ( Loscalzo, 2012 , p. 1211); “one of the central issues in any empirical science” ( Schmidt, 2009 , p. 90); or even the “demarcation criterion between science and nonscience” ( Braude, 1979 , p. 2). Similar declarations have been made about falsifiability , the “demarcation criterion” proposed by Popper in his seminal work of 1959 (see epigraph). As we will discuss below, the concepts are closely related—and also frequently misunderstood. Nevertheless, their regular invocation suggests a widespread if vague allegiance to Popperian ideals among contemporary scientists, working from a range of different disciplines ( Jordan, 2004 ; Jost, 2013 ). The cosmologist Hermann Bondi once put it this way: “There is no more to science than its method, and there is no more to its method than what Popper has said” (quoted in Magee, 1973 , p. 2).

Experimental social psychologists have fallen in line. Perhaps in part to bolster our sense of identity with the natural sciences ( Danzinger, 1997 ), we psychologists have been especially keen to talk about replication. We want to trade in the “coin” of the realm. As Billig (2013) notes, psychologists “cling fast to the belief that the route to knowledge is through the accumulation of [replicable] experimental findings” (p. 179). The connection to Popper is often made explicit. One recent example comes from Kepes and McDaniel (2013) , from the field of industrial-organizational psychology: “The lack of exact replication studies [in our field] prevents the opportunity to disconfirm research results and thus to falsify [contested] theories” (p. 257). They cite The Logic of Scientific Discovery .

There are problems here. First, there is the “ lack ” of replication noted in the quote from Kepes and McDaniel. If replication is so important, why isn't it being done? This question has become a source of crisis-level anxiety among psychologists in recent years, as we explore in a later section. The anxiety is due to a disconnect: between what is seen as being necessary for scientific credibility—i.e., careful replication of findings based on precisely-stated theories—and what appears to be characteristic of the field in practice ( Nosek et al., 2012 ). Part of the problem is the lack of prestige associated with carrying out replications ( Smith, 1970 ). To put it simply, few would want to be seen by their peers as merely “copying” another's work (e.g., Mulkay and Gilbert, 1986 ); and few could afford to be seen in this way by tenure committees or by the funding bodies that sponsor their research. Thus, while “a field that replicates its work is [seen as] rigorous and scientifically sound”—according to Makel et al. (2012) —psychologists who actually conduct those replications “are looked down on as bricklayers and not [as] advancing [scientific] knowledge” (p. 537). In consequence, actual replication attempts are rare.

A second problem is with the reliance on Popper—or, at any rate, a first-pass reading of Popper that seems to be uninformed by subsequent debates in the philosophy of science. Indeed, as critics of Popper have noted, since the 1960s and consistently thereafter, neither his notion of falsification nor his account of experimental replicability seem strictly amenable to being put into practice (e.g., Mulkay and Gilbert, 1981 ; see also Earp, 2011 )—at least not without considerable ambiguity and confusion. What is more, they may not even be fully coherent as stand-alone “abstract” theories, as has been repeatedly noted as well (cf. Cross, 1982 ).

The arguments here are familiar. Let us suppose that—at the risk of being accused of laying down bricks—Researcher B sets up an experiment to try to “replicate” a controversial finding that has been reported by Researcher A. She follows the original methods section as closely as she can (assuming that this has been published in detail; or even better, she simply asks Researcher A for precise instructions). She calibrates her equipment. She prepares the samples and materials just so. And she collects and then analyzes the data. If she gets a different result from what was reported by Researcher A—what follows? Has she “falsified” the other lab's theory? Has she even shown the original result to be erroneous in some way?

The answer to both of these questions, as we will demonstrate in some detail below, is “no.” Perhaps Researcher B made a mistake (see Trafimow, 2014 ). Perhaps the other lab did. Perhaps one of B's research assistants wrote down the wrong number. Perhaps the original effect is a genuine effect, but can only be obtained under specific conditions—and we just don't know yet what they are ( Cesario, 2014 ). Perhaps it relies on “tacit” ( Polanyi, 1962 ) or “unofficial” ( Westen, 1988 ) experimental knowledge that can only be acquired over the course of several years, and perhaps Researcher B has not yet acquired this knowledge ( Collins, 1975 ).

Or perhaps the original effect is not a genuine effect, but Researcher A's theory can actually accommodate this fact. Perhaps Researcher A can abandon some auxiliary hypothesis, or take on board another, or re-formulate a previously unacknowledged background assumption—or whatever (cf. Lakatos, 1970 ; Cross, 1982 ; Folger, 1989 ). As Lakatos (1970) once put it: “given sufficient imagination, any theory… can be permanently saved from ‘refutation’ by some suitable adjustment in the background knowledge in which it is embedded” (p. 184). We will discuss some of these potential “adjustments” below. The upshot, however, is that we simply do not know, and cannot know, exactly what the implications of a given replication attempt are, no matter which way the data come out. There are no critical tests of theories; and there are no objectively decisive replications.

Popper (1959) was not blind to this problem. “In point of fact,” he wrote, in an under-appreciated passage of his famous book, “no conclusive disproof of a theory can ever be produced, for it is always possible to say that the experimental results are not reliable, or that the discrepancies which are asserted to exist between the experimental results and the theory are only apparent” (p. 50, emphasis added). Hence as Mulkay and Gilbert (1981) explain:

… in relation to [actual] scientific practice, one can only talk of positive and negative results, and not of proof or disproof. Negative results, that is, results which seem inconsistent with a given hypothesis [or with a putative finding from a previous experiment], may incline a scientist to abandon [the] hypothesis but they will never require him to abandon it… Whether or not he does so may depend on the amount and quality of positive evidence, on his confidence in his own and others' experimental skills and on his ability to conceive of alternative interpretations of the negative findings. (p. 391)

Drawing hard and fast conclusions, therefore, about “negative” results—such as those that may be produced by a “failed” replication attempt—is much more difficult than Kepes and McDaniel seem to imagine (see e.g., Chow, 1988 for similarly problematic arguments). This difficulty may be especially acute in the field of psychology. As Folger (1989) notes, “Popper himself believed that too many theories, particularly in the social sciences , were constructed so loosely that they could be stretched to fit any conceivable set of experimental results, making them… devoid of testable content” (p. 156, emphasis added). Furthermore, as Collins (1985) has argued, the less secure a field's foundational theories—and especially at the field's “frontier”—the more room there is for disagreement about what should “count” as a proper replication 1 .

Related to this problem is that it can be difficult to know in what specific sense a replication study should be considered to be “the same” as the original (e.g., Van IJzendoorn, 1994 ). Consider that the goal for these kinds of studies is to rule out flukes and other types of error. Thus, we want to be able to say that the same experiment , if repeated one more time, would produce the same result as was originally observed. But an original study and a replication study cannot, by definition, be identical—at the very least, some time will have passed and the participants will all be new 2 —and if we don't yet know which differences are theory-relevant, we won't be able to control for their effects. The problem with a field like psychology, whose theoretical predictions are often “constructed so loosely,” as noted above, is precisely that we do not know—or at least, we do not in a large number of cases—which differences are in fact relevant to the theory.

Finally, human behavior is notoriously complex. We are not like billiard balls, or beavers, or planets, or paramecia (i.e., relatively simple objects or organisms). This means that we should expect our behavioral responses to vary across a “wide range of moderating individual difference and experimental context variables” ( Cesario, 2014 , p. 41)—many of which are not yet known, and some of which may be difficult or even impossible to uncover ( Meehl, 1990a ). Thus, in the absence of “well-developed theories for specifying such [moderating] variables, the conclusions of replication failures will be ambiguous” ( Cesario, 2014 , p. 41; see also Meehl, 1978 ).

Summing up the Problem

Hence we have two major points to consider. First, due to a lack of adequate incentives in the reward structure of professional science (e.g., Nosek and the Open Science Collaboration, 2012 ), actual replication attempts are rarely carried out. Second, to the extent that they are carried out, it can be well-nigh impossible to say conclusively what they mean, whether they are “successful” (i.e., showing similar, or apparently similar, results to the original experiment) or “unsuccessful” (i.e., showing different, or apparently different, results to the original experiment). Thus, Collins (1985) came to the conclusion that, in physics at least, disputes over contested findings are likelier to be resolved by social and reputational negotiations —over, e.g., who should be considered a competent experimenter—than by any “objective” consideration of the experiments themselves. Meehl (1990b) drew a similar conclusion about the field of social psychology, although he identified sheer boredom (rather than social/reputational negotiation) as the alternative to decisive experimentation:

… theories in the “soft areas” of psychology have a tendency to go through periods of initial enthusiasm leading to large amounts of empirical investigation with ambiguous over-all results. This period of infatuation is followed by various kinds of amendment and the proliferation of ad hoc hypotheses. Finally, in the long run, experimenters lose interest rather than deliberately discard a theory as clearly falsified. (p. 196)

So how shall we take stock of what has been said? A cynical reader might conclude that—far from being a “demarcation criterion between science and nonscience”—replication is actually closer to being a waste of time. Indeed, if even replications in physics are sometimes not conclusive, as Collins (1975 , 1981 , 1985) has convincingly shown, then what hope is there for replications in psychology?

Our answer is simply as follows. Replications do not need to be “conclusive” in order to be informative . In this paper, we highlight some of the ways in which replication attempts can be more, rather than less, informative, and we discuss—using a Bayesian framework—how they can reasonably affect a researcher's confidence in the validity of an original finding. The same is true of “falsification.” Whilst a scientist should not simply abandon her favorite theory on account of a single (apparently) contradictory result—as Popper himself was careful to point out 3 (1959, pp. 66–67; see also Earp, 2011 )—she might reasonably be open to doubt it, given enough disconfirmatory evidence, and assuming that she had stated the theory precisely. Rather than being a “waste of time,” therefore, experimental replication of one's own and others' findings can be a useful tool for restoring confidence in the reliability of basic effects—provided that certain conditions are met. The work of the latter part of this essay is to describe and to justify at least a few of those essential conditions. In this context, we draw a distinction between “conceptual” or “reproductive” replications (cf. Cartwright, 1991 )—which may conceivably be used to bolster confidence in a particular theory —and “direct” or “close” replications, which may be used to bolster confidence in a finding ( Schmidt, 2009 ; see also Earp et al., 2014 ). Since it is doubt about the findings that seems to have prompted the recent “crisis” in social psychology, it is the latter that will be our focus. But first we must introduce the crisis.
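
To preview the Bayesian point in concrete terms, the sketch below shows how a single "failed" replication might shift, without settling, a researcher's confidence that an original effect is real. The prior, the replication's assumed power, and the alpha level are hypothetical values chosen purely for illustration; they are not part of the framework developed in this article.

```python
# Illustrative sketch: how a "failed" replication can shift, without settling,
# one's confidence that an original effect is real. All numbers are assumptions
# chosen for illustration, not estimates from this article.

def posterior_effect_is_real(prior, power, alpha=0.05):
    """Posterior probability that the effect is real after ONE failed replication.

    prior : prior probability that the original effect is real
    power : probability the replication detects the effect if it is real
    alpha : probability of a (spurious) significant result if it is not real
    """
    p_fail_given_real = 1 - power        # miss despite a real effect
    p_fail_given_null = 1 - alpha        # correctly non-significant
    numerator = p_fail_given_real * prior
    denominator = numerator + p_fail_given_null * (1 - prior)
    return numerator / denominator

# A well-powered, faithful replication that fails is more diagnostic
# than an under-powered one -- but neither drives the posterior to zero.
for power in (0.35, 0.60, 0.90):
    print(f"power={power:.2f}: "
          f"P(effect real | failed replication) = "
          f"{posterior_effect_is_real(prior=0.5, power=power):.2f}")
```

The better powered and more faithful the replication, the more a null result should lower one's confidence; yet even then the posterior does not collapse to zero, which is the sense in which replications can be informative without being conclusive.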

The (Latest) Crisis in Social Psychology and Calls for Replication

“Is there currently a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is.” So write Pashler and Wagenmakers (2012 , p. 529) in a recent issue of Perspectives on Psychological Science . The “crisis” is not unique to psychology; it is rippling through biomedicine and other fields as well ( Ioannidis, 2005 ; Loscalzo, 2012 ; Earp and Darby, 2015 )—but psychology will be the focus of this paper, if for no other reason than that the present authors have been closer to the facts on the ground.

Some of the causes of the crisis are fairly well known. In 2011, an eminent Dutch researcher confessed to making up data and experiments, producing a résumé-full of “findings” that he had simply invented out of whole cloth ( Carey, 2011 ). He was outed by his own students, however, and not by peer review nor by any attempt to replicate his work. In other words, he might just as well have not been found out, had he only been a little more careful ( Stroebe et al., 2012 ). An unsettling prospect was thus aroused: Could other fraudulent “findings” be circulating—undetected, and perhaps even undetectable—throughout the published record? After an exhaustive analysis of the Dutch fraud case, Stroebe et al. (2012) concluded that the notion of self-correction in science was actually a “myth” (p. 670); and others have offered similar pronouncements ( Ioannidis, 2012a ).

But fraud, it is hoped, is rare. Nevertheless, as Ioannidis (2005 , 2012a) and others have argued, the line between explicitly fraudulent behavior and merely “questionable” research practices is perilously thin, and the latter are probably common. John et al. (2012) conducted a massive, anonymous survey of practicing psychologists and showed that this conjecture is likely correct. Psychologists admitted to such questionable research practices as failing to report all of the dependent measures for which they had collected data (78%) 4 , collecting additional data after checking to see whether preliminary results were statistically significant (72%), selectively reporting studies that “worked” (67%), claiming to have predicted an unexpected finding (54%), and failing to report all of the conditions that they ran (42%). Each of these practices alone, and even more so when combined, reduces the interpretability of the final reported statistics, casting doubt upon any claimed “effects” (e.g., Simmons et al., 2011 ).

The motivation behind these practices, though not necessarily conscious or deliberate, is also not obscure. Professional journals have long had a tendency to publish only or primarily novel, “statistically significant” effects, to the exclusion of replications—and especially “failed” replications—or other null results. This problem, known as “publication bias,” leads to a file-drawer effect whereby “negative” experimental outcomes are simply “filed away” in a researcher's bottom drawer, rather than written up and submitted for publication (e.g., Rosenthal, 1979 ). Meanwhile, the “questionable research practices” carry on in full force, since they increase the researcher's chances of obtaining a “statistically significant” finding—whether it turns out to be reliable or not.
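
The mechanics of the file-drawer effect can be made concrete with a toy simulation. In the sketch below, the true effect size, the per-group sample size, and the rule that only significant positive results get "published" are all assumptions chosen for illustration rather than figures taken from the literature.

```python
# Illustrative sketch (not from the article): why a file drawer of "negative"
# results biases the published record. We simulate many small studies of a
# modest true effect and "publish" only those reaching p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n_per_group, n_studies = 0.2, 20, 10_000

published = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    d = treatment.mean() - control.mean()   # population SD = 1, so roughly Cohen's d
    if p < 0.05 and d > 0:                  # only significant, positive results get "published"
        published.append(d)

print(f"true effect size:              {true_d:.2f}")
print(f"mean published effect size:    {np.mean(published):.2f}")
print(f"share of studies 'published':  {len(published) / n_studies:.0%}")
```

Because under-powered studies reach significance only when sampling error happens to exaggerate the effect, the "published" estimates come out substantially larger than the true effect, even though no individual researcher has done anything fraudulent.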

To add insult to injury, in 2012, an acrimonious public skirmish broke out in the form of dueling blog posts between the distinguished author of a classic behavioral priming study 5 and a team of researchers who had questioned his findings ( Yong, 2012 ). The disputed results had already been cited more than 2000 times—an extremely large number for the field—and even been enshrined in introductory textbooks. What if they did turn out to be a fluke? Should other “priming studies” be double-checked as well? Coverage of the debate ensued in the mainstream media (e.g., Bartlett, 2013 ).

Another triggering event resulted in “widespread public mockery” ( Pashler and Wagenmakers, 2012 , p. 528). In contrast to the fraud case described above, which involved intentional, unblushing deception, the psychologist Daryl Bem relied on well-established and widely-followed research and reporting practices to generate an apparently fantastic result, namely evidence that participants' current responses could be influenced by future events ( Bem, 2011 ). Since such paranormal precognition is inconsistent with widely-held theories about “the fundamental nature of time and causality” (Lebel and Peters, p. 371), few took the findings seriously. Instead, they began to wonder about the “well-established and widely-followed research and reporting practices” that had sanctioned the findings in the first place (and allowed for their publication in a leading journal). As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359) 6 .

The main culprit for this phenomenon is what Simmons et al. (2011) identified as researcher degrees of freedom:

In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?… It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance” and to then report only what “worked.” (p. 1359)

One unfortunate consequence of such a strategy—involving, as it does, some of the very same questionable research practices later identified by John et al. (2012) in their survey of psychologists—is that it inflates the probability of producing a false positive (or a Type 1 error). Since such practices are “common” and even “accepted,” the literature may be replete with erroneous results. Thus, as Ioannidis (2005) declared after performing a similar analysis in his own field of biomedicine, “most published research findings” may be “false” (p. 0696, emphasis added). This has led to the “unprecedented level of doubt” referred to by Pashler and Wagenmakers (2012) in the opening quote to this section.

This is not the first crisis for psychology. Giner-Sorolla (2012) points out that “crises” of one sort or another “have been declared regularly at least since the time of Wilhelm Wundt”—with turmoil as recent as the 1970s inspiring particular déjà vu (p. 563). Then, as now, a string of embarrassing events—including the publication in mainstream journals of literally unbelievable findings 7 —led to “soul searching” amongst leading practitioners. Standard experimental methods, statistical strategies, reporting requirements, and norms of peer review were all put under the microscope; numerous sources of bias were carefully rooted out (e.g., Greenwald, 1975 ). While various calls for reform were put forward—some more energetically than others—a single corrective strategy seemed to emerge from all the din: the need for psychologists to replicate their work. Since “all flawed research practices yield findings that cannot be reproduced,” critics reasoned, replication could be used to separate the wheat from the chaff ( Koole and Lakens, 2012 , p. 608, emphasis added; see also Elms, 1975 ).

The same calls reverberate today. “For psychology to truly adhere to the principles of science,” write Ferguson and Heene (2012) , “the need for replication of research results [is] important… to consider” (p. 556). LeBel and Peters (2011) put it like this: “Across all scientific disciplines, close replication is the gold standard for corroborating the discovery of an empirical phenomenon” and “the importance of this point for psychology has been noted many times” (p. 375). Indeed, “leading researchers [in psychology]” agree, according to Francis (2012) , that “experimental replication is the final arbiter in determining whether effects are true or false” (p. 585).

We have already seen that such calls must be heeded with caution: replication is not straightforward, and the outcome of replication studies may be difficult to interpret. Indeed, they can never be conclusive on their own. But we suggested that replications could be more or less informative; and in the following sections we discuss some strategies for making them “more” rather than “less.” We begin with a discussion of “direct” vs. “conceptual” replication.

Increasing Replication Informativeness: “Direct” vs. “Conceptual” Replication

In a systematic review of the literature, encompassing multiple academic disciplines, Gómez et al. (2010) identified 18 different types of replication. Three of these were from Lykken (1968) , who drew a distinction between “literal,” “operational,” and “constructive”—which Schmidt (2009) then winnowed down (and re-labeled) to arrive at “direct” and “conceptual” in an influential paper. As Makel et al. (2012) have pointed out, it is Schmidt's particular framework that seems to have crystallized in the field of psychology, shaping most of the subsequent discussion on this issue. We have no particular reason to rock the boat; indeed these categories will suit our argument just fine.

The first step in making a replication informative is to decide what specifically it is for. “Direct” replications and “conceptual” replications are “for” different things; and assigning them their proper role and function will be necessary for resolving the crisis. First, some definitions:

A “direct” replication may be defined as an experiment that is intended to be as similar to the original as possible ( Schmidt, 2009 ; Makel et al., 2012 ). This means that along every conceivable dimension—from the equipment and materials used, to the procedure, to the time of day, to the gender of the experimenter, etc.—the replicating scientist should strive to avoid making any change. The purpose here is to “check” the original results. Some changes will be inevitable, of course; but the point is that only the inevitable changes (such as the passage of time between experiments) are ideally tolerated in this form of replication. In a “conceptual” replication, by contrast, at least certain elements of the original experiment are intentionally and (ideally) systematically altered, in order to achieve a very different sort of purpose—namely, to see whether a given phenomenon, assuming that it is reliable, might obtain across a range of variable conditions. But as Doyen et al. (2014) note in a recent paper:

The problem with conceptual replication in the absence of direct replication is that there is no such thing as a “conceptual failure to replicate.” A failure to find the same “effect” using a different operationalization can be attributed to the differences in method rather than to the fragility of the original effect. Only the successful conceptual replications will be published, and the unsuccessful ones can be dismissed without challenging the underlying foundations of the claim. Consequently, conceptual replication without direct replication is unlikely to change beliefs about the underlying effect (p. 28).

In simplest terms, therefore, a “direct” replication seeks to validate a particular fact or finding; whereas a “conceptual” replication seeks to validate the underlying theory or phenomenon—i.e., the theory that has been proposed to “predict” the effect that was obtained by the initial experiment—as well as to establish the boundary conditions within which the theory holds true ( Nosek et al., 2012 ). The latter is impossible without the former. In other words, if we cannot be sure that our finding is reliable to begin with (because it turns out to have been a coincidence, or else a false alarm due to questionable research practices, publication bias, or fraud), then we are in no position to begin testing the theory by which it is supposedly explained ( Cartwright, 1991 ; see also Earp et al., 2014 ).

Of course both types of replication are important, and there is no absolute line between them. Rather, as Asendorpf et al. (2013) point out, “direct replicability [is] one extreme pole of a continuous dimension extending to broad generalizability [via ‘conceptual’ replication] at the other pole, ranging across multiple, theoretically relevant facets of study design” (p. 139). Collins made a similar point in 1985 (e.g., p. 37). But so long as we remain largely ignorant about exactly which “facets of study design” are “theoretically relevant” to begin with—as is the case with much of current social psychology ( Meehl, 1990b ), and nearly all of the most heavily-contested experimental findings—we need to orient our attention more toward the “direct” end of the spectrum 8 .

How else can replication be made more informative? Brandt et al.'s (2014) “Replication Recipe” offers several important factors, one of which must be highlighted to begin with. This is their contention that a “convincing” replication should be carried out outside the lab of origin. Clearly this requirement shifts away from the “direct” extreme of the replication gradient that we have emphasized so far, but such a change from the original experiment, in this case, is justified. As Ioannidis (2012b) points out, replications by the original researchers—while certainly important and to be encouraged as a preliminary step—are not sufficient to establish “convincing” experimental reliability. This is because allegiance and confirmation biases apply especially to the original team, and would be less of an issue for independent replicators.

Partially against this view, Schnall (2014, n.p.) argues that “authors of the original work should be allowed to participate in the process of having their work replicated.” On the one hand, this might have the desirable effect of ensuring that the replication attempt faithfully reproduces the original procedure. It seems reasonable to think that the original author would know more than anyone else about how the original research was conducted—so her viewpoint is likely to be helpful. On the other hand, however, too much input by the original author could compromise the independence of the replication: she might have a strong motivation to make the replication a success, which could subtly influence the results (see Earp and Darby, 2015 ). Whichever position one takes on the appropriate degree of input and/or oversight from the original author, however, Schnall (2014, n.p.) is certainly right to note that “the quality standards for replications need to be at least as high as for the original findings. Competent evaluation by experts is absolutely essential, and is especially important if replication authors have no prior expertise with a given research topic.”

Other ingredients in increasing the informativeness of replication attempts include: (1) carefully defining the effects and methods that the researchers intend to replicate; (2) following as exactly as possible the methods of the original study (as described above); (3) having high statistical power (i.e., an adequate sample size to detect an effect if one is really present); (4) making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves); and (5) evaluating the replication results, comparing them critically to the results of the original study ( Brandt et al., 2014 , p. 218, paraphrased). This list is not exhaustive, but it gives a concrete sense of how “stabilizing” procedures (see Radder, 1992 ) can be employed to bolster the quality and informativeness of replication efforts.

Replication, Falsification, and Auxiliary Assumptions

Brandt et al.'s (2014) “replication recipe” provides a vital tool for researchers seeking to conduct high-quality replications. In this section, we offer an additional “ingredient” to the discussion by highlighting the role of auxiliary assumptions in increasing replication informativeness, specifically as these pertain to the relationship between replication and falsification. Consider the logical fallacy of affirming the consequent, which provided an important basis for Popper's falsification argument:

If the theory is true, an observation should occur (T → O). (Premise 1)

The observation occurs (O). (Premise 2)

Therefore, the theory is true (T). (Conclusion)

Obviously, the conclusion does not follow. Any number of things might have led to the observation that have nothing to do with the theory being proposed (see Earp, 2015 for a similar argument). On the other hand, denying the consequent (modus tollens) does invalidate the theory, strictly according to the logic given:

If the theory is true, an observation should occur (T → O). (Premise 1)

The observation does not occur (~O). (Premise 2)

Therefore, the theory is not true (~T). (Conclusion)

Given this logical asymmetry, then, between affirming and denying the consequent of a theoretical prediction (see Earp and Everett, 2013 ), Popper opted for the latter. By doing so, he famously defended a strategy of disconfirming rather than confirming theories. Yet if the goal is to disconfirm theories, then the theories must be capable of being disconfirmed in the first place; hence, a basic requirement of scientific theories (in order to count as properly scientific) is that they be falsifiable.

As we hinted at above, however, this basic framework is an oversimplification. As Popper himself noted, and as was made particularly clear by Lakatos (1978 ; also see Duhem, 1954 ; Quine, 1980 ), scientists do not derive predictions only from a given theory, but rather from a combination of the theory and auxiliary assumptions. The auxiliary assumptions are not part of the theory proper, but they serve several important functions. One of these functions is to show the link between the sorts of outcomes that a scientist can actually observe (i.e., by running an experiment), and the non-observable, “abstract” content of the theory itself. To pick one classic example from psychology, according to the theory of reasoned action (e.g., Fishbein, 1980 ), attitudes determine behavioral intentions. One implication of this theoretical assumption is that researchers should be able to obtain strong correlations between attitudes and behavioral intentions. But this assumes, among other things, that a check mark on an attitude scale really indicates a person's attitude, and that a check mark on an intention scale really indicates a person's intention. The theory of reasoned action has nothing to say about whether check marks on scales indicate attitudes or intentions; these are assumptions that are peripheral to the basic theory. They are auxiliary assumptions that researchers use to connect non-observational terms such as “attitude” and “intention” to observable phenomena such as check marks. Fishbein and Ajzen (1975) recognized this and took great pains to spell out, as well as possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables (see also Ajzen and Fishbein, 1980 ).

The existence of auxiliary assumptions complicates the project of falsification. This is because the major premise of the modus tollens argument—denying the consequent of the theoretical prediction—must be stated somewhat differently. It must be stated like this: “If the theory is true and a set of auxiliary assumptions is true, an observation should occur.” If the second premise (~O) is kept the same, the conclusion is that either the theory is not true or at least one auxiliary assumption is not true, as the following syllogism (in symbols only) illustrates.

T & (A₁ & A₂ & … & Aₙ) → O (Premise 1)

~O (Premise 2)

∴ ~T or ~(A₁ & A₂ & … & Aₙ), i.e., ~T or ~A₁ or ~A₂ or … or ~Aₙ (Conclusion)

Consider an example. It often is said that Newton's gravitational theory predicted where planets would be at particular times. But this is not precisely accurate. It would be more accurate to say that such predictions were derived from a combination of Newton's theory and auxiliary assumptions not contained in that theory (e.g., about the present locations of the planets). To return to our example about attitudes and intentions from psychology, consider the mini-crisis in social psychology from the 1960s, when it became clear to researchers that attitudes—the kingly construct—failed to predict behaviors. Much of the impetus for the theory of reasoned action (e.g., Fishbein, 1980 ) was Fishbein's realization that there was a problem with attitude measurement at the time: when this problem was fixed, strong attitude-behavior (or at least attitude-intention) correlations became the rule rather than the exception. This episode provides a compelling illustration of a case in which attention to the auxiliary assumptions that bore on actual measurement played a larger role in resolving a crisis in psychology than debates over the theory itself.

What is the lesson here? Due to the fact that failures to obtain a predicted observation can be blamed either on the theory itself or on at least one auxiliary assumption, absolute theory falsification is about as problematic as is absolute theory verification. In the Newton example, when some of Newton's planetary predictions were shown to be wrong, he blamed the failures on incorrect auxiliary assumptions rather than on his theory, arguing that there were additional but unknown astronomical bodies that skewed his findings—which turned out to be a correct defense of the theory. Likewise, in the attitude literature, the theoretical connection between attitudes and behaviors turned out to be correct (as far as we know) with the problem having been caused by incorrect auxiliary assumptions pertaining to attitude measurement.

There is an additional consequence to the necessity of giving explicit consideration to one's auxiliary assumptions. Suppose, as often happens in psychology, that a researcher deems a theory to be unfalsifiable because he or she does not see any testable predictions. Is the theory really unfalsifiable, or is the problem that the researcher has not been sufficiently thorough in identifying the necessary auxiliary assumptions that would lead to falsifiable predictions? Given that absolute falsification is impossible, and that researchers are therefore limited to some kind of “reasonable” falsification, Trafimow (2009) has argued that many allegedly unfalsifiable theories are reasonably falsifiable after all: it is just a matter of researchers having to be more thoughtful about considering auxiliary assumptions. Trafimow documented examples of theories that had been described as unfalsifiable but that could in fact be falsified by proposing better auxiliary assumptions than previous researchers had imagined.

The notion that auxiliary assumptions can vary in quality is relevant for replication. Consider, for example, the case alluded to earlier regarding a purported failure to replicate Bargh et al.'s (1996) famous priming results. In the replication attempt of this well-known “walking time” study ( Doyen et al., 2012 ), laser beams were used to measure the speed with which participants left the laboratory, rather than having students time them with stopwatches. Undoubtedly, this adjustment was made on the basis of a reasonable auxiliary assumption that methods of measuring time that are less susceptible to human idiosyncrasies would be superior to methods that are more susceptible to them. Does the fact that the failed replication was not exactly like the original experiment render it invalid? At least with regard to this particular feature of this particular replication attempt, the answer is clearly “no.” If a researcher uses a better auxiliary assumption than in the original experiment, this should add to the replication's validity rather than subtract from it 9 .

But suppose, for a particular experiment, that we are not in a good position to judge the superiority of alternative auxiliary assumptions. We might invoke what Meehl (1990b) termed the ceteris paribus (all else equal) assumption. This idea, applied to the issue of direct replications, suggests that for researchers to be confident that a replication attempt is a valid one, the auxiliary assumptions in the replication have to be sufficiently similar to those in the original experiment that any differences in findings cannot reasonably be attributed to differences in the assumptions. Put another way, all of the unconsidered auxiliary assumptions should be indistinguishable in the relevant way: that is, all have to be sufficiently equal or sufficiently right or sufficiently irrelevant so as not to matter to the final result.

What makes it allowable for a researcher to make the ceteris paribus assumption? In a strict philosophical sense, of course, it is not allowable. To see this, suppose that Researcher A has published an experiment, Researcher B has replicated it, but the replication failed. If Researcher A claims that Researcher B made a mistake in performing the replication, or just got unlucky, there is no way to disprove Researcher A's argument absolutely. But suppose that Researchers C, D, E, and F also attempt replications, and also fail. It becomes increasingly difficult to support the contention that Researchers B–F all “did it wrong” or were unlucky, and that we should continue to accept Researcher A's version of the experiment. Even if a million researchers attempted replications, and all of them failed, it is theoretically possible that Researcher A's version is the unflawed one and all the others are flawed. But most researchers would conclude (and in our view, would be right to conclude) that it is more likely that it is Researcher A who got it wrong and not the million researchers who failed to replicate the observation. Thus, we are not arguing that replications, whether successful or not, are definitive. Rather, our argument is that replications (of sufficient quality) are informative.

Introducing a Bayesian Framework

To see why this is the case, we shall employ a Bayesian framework similar to that of Trafimow (2010). Suppose that an aficionado of Researcher A believes that the prior probability of anything Researcher A said or did being correct is very high. Researcher B attempts a replication of an experiment by Researcher A and fails. The aficionado might continue confidently to believe in Researcher A's version, but the aficionado's confidence likely would be decreased slightly. As more replication failures accumulate, the aficionado's confidence would continue to decrease accordingly, and at some point it would drop below the 50% mark, in which case the aficionado would put more credence in the replication failures than in the success obtained by Researcher A.

In the foregoing scenario, we would want to know the probability that the original result is actually true given Researcher B's replication failure [p(T|F)]. As Equation (1) shows, this depends on the aficionado's prior level of confidence that the original result is true [p(T)], the probability of failing to replicate given that the original result is true [p(F|T)], and the overall probability of failing to replicate [p(F)].
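Written out with the quantities just listed (the notation here simply makes the relationship explicit), Equation (1) is Bayes' theorem:

\[
p(T \mid F) \;=\; \frac{p(F \mid T)\, p(T)}{p(F)} \qquad (1)
\]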

Alternatively, we could frame what we want to know in terms of a confidence ratio that the original result is true or not true given the failure to replicate [p(T|F)/p(~T|F)]. This would be a function of the aficionado's prior confidence ratio about the truth of the finding [p(T)/p(~T)] and the ratio of probabilities of failing given that the original result is true or not [p(F|T)/p(F|~T)]. Thus, Equation (2) gives the posterior confidence ratio.
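Written out in the same way, Equation (2) is the odds form of Bayes' theorem: the posterior confidence ratio is the prior confidence ratio multiplied by the ratio of the probabilities of a replication failure under the two hypotheses:

\[
\frac{p(T \mid F)}{p(\sim T \mid F)} \;=\; \frac{p(T)}{p(\sim T)} \times \frac{p(F \mid T)}{p(F \mid \sim T)} \qquad (2)
\]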

Suppose that the aficionado is a very strong one, so that the prior confidence ratio is 50. In addition, the probability ratio pertaining to failing to replicate is 0.5. It is worthwhile to clarify two points about this probability ratio. First, we assume that the probability of failing to replicate is less if the original finding is true than if it is not true, so that the ratio ought to be substantially less than 1. Second, how much less than 1 this ratio will be depends largely on the quality of the replication; as the replication becomes closer to meeting the ideal ceteris paribus condition, the ratio will deviate increasingly from 1. Put more generally, as the quality of the auxiliary assumptions going into the replication attempt increases, the ratio will decrease. Given these two ratios of 50 and 0.5, the posterior confidence ratio is 25. Although this is a substantial decrease in confidence from 50, the aficionado still believes that the finding is extremely likely to be true. But suppose there is another replication failure and the probability ratio is 0.8. In that case, the new confidence ratio is (25)(0.8) = 20. The pattern should be clear here: As there are more replication failures, a rational person, even if that person is an aficionado of the original researcher, will experience continually decreasing confidence as the replication failures mount.

If we imagine that there are N attempts to replicate the original finding that fail, the process described in the foregoing paragraph can be summarized in a single equation that gives the ratio of posterior confidences in the original finding, given that there have been N failures to replicate. This is a function of the prior confidence ratio and the probability ratios in the first replication failure, the second replication failure, and so on.
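Writing F₁, F₂, …, F_N for the N successive failures (a notational convenience introduced here), Equation (3) multiplies the prior confidence ratio by each of the corresponding probability ratios:

\[
\frac{p(T \mid F_1, \ldots, F_N)}{p(\sim T \mid F_1, \ldots, F_N)} \;=\; \frac{p(T)}{p(\sim T)} \times \prod_{i=1}^{N} \frac{p(F_i \mid T)}{p(F_i \mid \sim T)} \qquad (3)
\]

This treats the successive replication attempts as independent, conditional on the truth or falsity of the original finding, which is how the running example below proceeds.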

For example, staying with our aficionado with a prior confidence ratio of 50, imagine a set of 10 replication failures, with the following probability ratios: 0.5, 0.8, 0.7, 0.65, 0.75, 0.56, 0.69, 0.54, 0.73, and 0.52. The final confidence ratio then follows from Equation (3).
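Multiplying the prior confidence ratio of 50 by each of the ten probability ratios in turn gives

\[
50 \times 0.5 \times 0.8 \times 0.7 \times 0.65 \times 0.75 \times 0.56 \times 0.69 \times 0.54 \times 0.73 \times 0.52 \;\approx\; 0.54.
\]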

Note the following. First, even with an extreme prior confidence ratio (we had set it at 50 for the aficionado), it is possible to overcome it with a reasonable number of replication failures, provided that the person tallying the replication failures is a rational Bayesian (and there is reason to think that those attempting the replications are sufficiently competent in the subject area and methods to be qualified to undertake them). Second, it is possible to go from a state of extreme confidence to one of substantial lack of confidence. To see this in the example, take the reciprocal of the final confidence ratio (0.54), which equals approximately 1.85. In other words, the Bayesian aficionado now believes that the finding is roughly 1.85 times as likely to be not true as true. If we imagine yet more failed attempts to replicate, it is easy to foresee that the belief that the original finding is not true could eventually become as powerful as, or more powerful than, the prior belief that the original finding is true.
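The same bookkeeping can be sketched in a few lines of code. The snippet below is an illustrative sketch only (the function name and structure are chosen here for exposition, not drawn from any published implementation); it applies Equation (3) one failure at a time and reproduces the numbers used in this example:

```python
def update_confidence_ratio(prior_ratio, probability_ratios):
    """Return the running confidence ratio p(T | failures) / p(~T | failures)
    after each successive replication failure, following Equation (3)."""
    ratio = prior_ratio
    history = []
    for r in probability_ratios:
        ratio *= r  # each failure multiplies the odds by p(F_i | T) / p(F_i | ~T)
        history.append(ratio)
    return history

# The aficionado's example from the text: prior odds of 50, followed by
# ten replication failures with the stated probability ratios.
ratios = [0.5, 0.8, 0.7, 0.65, 0.75, 0.56, 0.69, 0.54, 0.73, 0.52]
history = update_confidence_ratio(50, ratios)

print(round(history[-1], 2))      # ~0.54: the finding is now judged less likely true than not
print(round(1 / history[-1], 2))  # ~1.85: odds that the finding is not true
```

Because multiplication is commutative, the order in which the failures arrive does not affect the final confidence ratio; only the product of the probability ratios matters.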

In summary, auxiliary assumptions play a role not only in original theory-testing experiments but also in replications—even in replications concerned only with the original finding and not with the underlying theory. A particularly important auxiliary assumption is the ever-present ceteris paribus assumption, and the extent to which it applies influences the “convincingness” of the replication attempt. Thus, a change in confidence in the original finding is influenced by both the quality and the quantity of the replication attempts, as Equation (3) illustrates.

In presenting Equations (1–3), we reduced the theoretical content as much as possible (indeed, more than is realistic in actual research 10 ) when considering so-called “direct” replications. As replications come to serve other purposes, as with “conceptual” replications, the amount of theoretical content is likely to increase. To link that theoretical content to the replication attempt, more auxiliary assumptions will become necessary. For example, in a conceptual replication of an experiment finding that attitudes influence behavior, the researcher might use a different attitude manipulation or a different behavior measure. How do we know that the different manipulation and measure are sufficiently theoretically unimportant that the conceptual replication really is a replication (i.e., a test of the underlying theory)? We need new auxiliary assumptions linking the new manipulation and measure to the corresponding constructs in the theory, just as an original set of auxiliary assumptions was necessary in the original experiment to link the original manipulation and measure to the corresponding constructs in the theory. Auxiliary assumptions always matter—and they should be made explicit so far as possible. In this way, it will be easier to identify where in the chain of assumptions a “breakdown” must have occurred, in attempting to explain an apparent failure to replicate.

Replication is not a silver bullet. Even carefully-designed replications, carried out in good faith by expert investigators, will never be conclusive on their own. But as Tsang and Kwan (1999) point out:

If replication is interpreted in a strict sense, [conclusive] replications or experiments are also impossible in the natural sciences.… So, even in the “hardest” science (i.e., physics) complete closure is not possible. The best we can do is control for conditions that are plausibly regarded to be relevant. (p. 763)

Nevertheless, “failed” replications, especially, might be dismissed by an original investigator as being flawed or “incompetently” performed—but this sort of accusation is just too easy. The original investigator should be able to describe exactly what parameters she sees as being theoretically relevant, and under what conditions her “effect” should obtain. If a series of replications is carried out, independently by different labs, and deliberately tailored to the parameters and conditions so described—yet they reliably fail to produce the original result—then this should be considered informative. At the very least, it will suggest that the effect is sensitive to theoretically-unspecified factors, whose specification is sorely needed. At most, it should throw the existence of the effect into doubt, possibly justifying a shift in research priorities. Thus, while “falsification” can in principle be avoided ad infinitum, with enough creative effort by one who wishes to defend a favored theory, scientists should not seek to “rescue” a given finding at any empirical cost 11 . Informative replications can reasonably factor into scientists' assessment about just what that cost might be; and they should pursue such replications as if the credibility of their field depended on it. In the case of experimental social psychology, it does.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Thanks are due to Anna Alexandrova for feedback on an earlier draft.

1. ^ There are two steps to understanding this idea. First, because the foundational theories are so insecure, and the field's findings so under dispute, the “correct” empirical outcome of a given experimental design is unlikely to have been firmly established. Second, and insofar as the first step applies, the standard by which to judge whether a replication has been competently performed is equally unavailable—since that would depend upon knowing the “correct” outcome of just such an experiment. Thus, a “competently performed” experiment is one that produces the “correct” outcome; while the “correct” outcome is defined by whatever it is that is produced by a “competently performed” experiment. As Collins (1985) states: “Where there is disagreement about what counts as a competently performed experiment, the ensuing debate is coextensive with the debate about what the proper outcome of the experiment is” (p. 89). This is the infamously circular experimenter's regress. Of course, a competently performed experiment should produce satisfactory (i.e., meaningful, useful) results on “outcome neutral” tests.

2. ^ Assuming that it is a psychology experiment. Note that even if the “same” participants are run through the experiment one more time, they'll have changed in at least one essential way: they'll have already gone through the experiment (opening the door for practice effects, etc.).

3. ^ On Popper's view, one must set up a “falsifying hypothesis,” i.e., a hypothesis specifying how another experimenter could recreate the falsifying evidence. But then, Popper says, the falsifying hypothesis itself should be severely tested and corroborated before it is accepted as falsifying the main theory. Interestingly, as a reviewer has suggested, the distinction between a falsifying hypothesis and the main theory may also correspond to the distinction between direct vs. conceptual replications that we discuss in a later section. On this view, direct replications (attempt to) reproduce what the falsifying hypothesis states is necessary to generate the original predicted effect, whereas conceptual replications are attempts to test the main theory.

4. ^ The percentages reported here are the geometric mean of self-admission rates, prevalence estimates by the psychologists surveyed, and prevalence estimates derived by John et al. from the other two figures.

5. ^ Priming has been defined a number of different ways. Typically, it refers to the ability of subtle cues in the environment to affect an individual's thoughts and behavior, often outside of her awareness or control (e.g., Bargh and Chartrand, 1999 ).

6. ^ Even more damning, Trafimow (2003 ; Trafimow and Rice, 2009 ; Trafimow and Marks, 2015 ) has argued that the standard significance tests used in psychology are invalid even when they are done “correctly.” Thus, even if psychologists were to follow the prescriptions of Simmons et al.—and reduce their researcher degrees of freedom (see the discussion following this footnote)—this would still fail to address the core problem that such tests should not be used in the first place.

7. ^ For example, a “study found that eating disorder patients were significantly more likely than others to see frogs in a Rorschach test, which the author interpreted as showing unconscious fear of oral impregnation and anal birth…” ( Giner-Sorolla, 2012 , p. 562).

8. ^ Asendorpf et al. (2013) explain why this is so: “[direct] replicability is a necessary condition for further generalization and thus indispensible for building solid starting points for theoretical development. Without such starting points, research may become lost in endless fluctuation between alternative generalization studies that add numerous boundary conditions but fail to advance theory about why these boundary conditions exist” (p. 140, emphasis added).

9. ^ There may be other reasons why the “failed” replication by Doyen et al. should not be considered conclusive, of course; for further discussion see, e.g., Lieberman (2012) .

10. ^ Indeed, we have presented our analysis in this section in abstract terms so that the underlying reasoning could be seen most clearly. However, this necessarily raises the question of how to go about implementing these ideas in practice. As a reviewer points out, to calculate probabilities, the theory being tested would need to be represented as a probability model; then in effect one would have Bayes factors to deal with. We note that both Dienes (2014) and Verhagen and Wagenmakers (2014) have presented methods for assessing the strength of evidence of a replication attempt (i.e., in confirming the original result) along these lines, and we refer the reader to their papers for further consideration.

11. ^ As Doyen et al. (2014 , p. 28, internal references omitted) recently argued: “Given the existence of publication bias and the prevalence of questionable research practices, we know that the published literature likely contains some false positive results. Direct replication is the only way to correct such errors. The failure to find an effect with a well-powered direct replication must be taken as evidence against the original effect. Of course, one failed direct replication does not mean the effect is non-existent—science depends on the accumulation of evidence. But, treating direct replication as irrelevant makes it impossible to correct Type 1 errors in the published literature.”

Ajzen, I., and Fishbein, M. (1980). Understanding Attitudes and Predicting Social Behavior . Englewood Cliffs, NJ: Prentice-Hall.

Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., et al. (2013). Replication is more than hitting the lottery twice. Euro. J. Person . 27, 108–119. doi: 10.1002/per.1919

Bargh, J. A., and Chartrand, T. L. (1999). The unbearable automaticity of being. Am. Psychol . 54, 462–479. doi: 10.1037/0003-066X.54.7.462

Bargh, J. A., Chen, M., and Burrows, L. (1996). Automaticity of social behavior: direct effects of trait construct and stereotype activation on action. J. Person. Soc. Psychol . 71, 230–244. doi: 10.1037/0022-3514.71.2.230

Bartlett, T. (2013). Power of suggestion. The Chronicle of Higher Education . Available online at: http://chronicle.com/article/Power-of-Suggestion/136907

Bem, D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. J. Person. Soc. Psychol . 100, 407–425. doi: 10.1037/a0021524

Billig, M. (2013). Learn to Write Badly: How to Succeed in the Social Sciences . Cambridge: Cambridge University Press.

Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., et al. (2014). The replication recipe: what makes for a convincing replication? J. Exp. Soc. Psychol . 50, 217–224. doi: 10.2139/ssrn.2283856

Braude, S. E. (1979). ESP and Psychokinesis: a Philosophical Examination . Philadelphia, PA: Temple University Press.

Carey, B. (2011). Fraud case seen as a red flag for psychology research. N. Y. Times . Available online at: http://www.nytimes.com/2011/11/03/health/research/noted-dutch-psychologist-stapel-accused-of-research-fraud.html

Cartwright, N. (1991). Replicability, reproducibility, and robustness: comments on Harry Collins. Hist. Pol. Econ . 23, 143–155. doi: 10.1215/00182702-23-1-143

Cesario, J. (2014). Priming, replication, and the hardest science. Perspect. Psychol. Sci . 9, 40–48. doi: 10.1177/1745691613513470

Chow, S. L. (1988). Significance test or effect size? Psychol. Bullet . 103, 105–110. doi: 10.1037/0033-2909.103.1.105

Collins, H. M. (1975). The seven sexes: a study in the sociology of a phenomenon, or the replication of experiments in physics. Sociology 9, 205–224. doi: 10.1177/003803857500900202

Collins, H. M. (1981). Son of seven sexes: the social destruction of a physical phenomenon. Soc. Stud. Sci . 11, 33–62. doi: 10.1177/030631278101100103

Collins, H. M. (1985). Changing Order: Replication and Induction in Scientific Practice . Chicago, IL: University of Chicago Press.

Cross, R. (1982). The Duhem-Quine thesis, Lakatos and the appraisal of theories in macroeconomics. Econ. J . 92, 320–340. doi: 10.2307/2232443

Danzinger, K. (1997). Naming the Mind . London: Sage.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Front. Psychol . 5:781. doi: 10.3389/fpsyg.2014.00781

Doyen, S., Klein, O., Pichon, C. L., and Cleeremans, A. (2012). Behavioral priming: it's all in the mind, but whose mind? PLoS ONE 7:e29081. doi: 10.1371/journal.pone.0029081

Doyen, S., Klein, O., Simons, D. J., and Cleeremans, A. (2014). On the other side of the mirror: priming in cognitive and social psychology. Soc. Cognit . 32, 12–32. doi: 10.1521/soco.2014.32.supp.12

Duhem, P. (1954). The Aim and Structure of Physical Theory. Transl. P. P. Wiener . Princeton, NJ: Princeton University Press.

Earp, B. D. (2011). Can science tell us what's objectively true? New Collect . 6, 1–9. Available online at: https://www.academia.edu/625642/Can_science_tell_us_whats_objectively_true

Earp, B. D. (2015). Does religion deserve a place in secular medicine? J. Med. Ethics . E-letter. doi: 10.1136/medethics-2013-101776. Available online at: https://www.academia.edu/11118590/Does_religion_deserve_a_place_in_secular_medicine

Earp, B. D., and Darby, R. J. (2015). Does science support infant circumcision? Skeptic 25, 23–30. Available online at: https://www.academia.edu/9872471/Does_science_support_infant_circumcision_A_skeptical_reply_to_Brian_Morris

Earp, B. D., and Everett, J. A. C. (2013). Is the N170 face-specific? Controversy, context, and theory. Neuropsychol. Trends 13, 7–26. doi: 10.7358/neur-2013-013-earp

Earp, B. D., Everett, J. A. C., Madva, E. N., and Hamlin, J. K. (2014). Out, damned spot: can the “Macbeth Effect” be replicated? Basic Appl. Soc. Psychol . 36, 91–98. doi: 10.1080/01973533.2013.856792

Elms, A. C. (1975). The crisis of confidence in social psychology. Am. Psychol . 30, 967–976. doi: 10.1037/0003-066X.30.10.967

Ferguson, C. J., and Heene, M. (2012). A vast graveyard of undead theories: publication bias and psychological science's aversion to the null. Perspect. Psychol. Sci . 7, 555–561. doi: 10.1177/1745691612459059

Fishbein, M. (1980). “Theory of reasoned action: some applications and implications,” in Nebraska Symposium on Motivation, 1979 , eds H. Howe and M. Page (Lincoln, NE: University of Nebraska Press), 65–116.

Fishbein, M., and Ajzen, I. (1975). Belief, Attitude, Intention and Behavior: an Introduction to Theory and Research . Reading, MA: Addison-Wesley.

Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychol. Bullet . 106, 155–160. doi: 10.1037/0033-2909.106.1.155

Francis, G. (2012). The psychology of replication and replication in psychology. Perspect. Psychol. Sci . 7, 585–594. doi: 10.1177/1745691612459520

Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspect. Psychol. Sci . 7, 562–571. doi: 10.1177/1745691612457576

Gómez, O. S., Juzgado, N. J., and Vegas, S. (2010). “Replications types in experimental disciplines,” in Proceedings of the International Symposium on Empirical Software Engineering and Measurement (Bolzano: ESEM).

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychol. Bullet . 82, 1–20. doi: 10.1037/h0076157

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med . 2:e124. doi: 10.1371/journal.pmed.0020124

Ioannidis, J. P. (2012a). Why science is not necessarily self-correcting. Perspect. Psychol. Sci . 7, 645–654. doi: 10.1177/1745691612464056

Ioannidis, J. P. (2012b). Scientific inbreeding and same-team replication: type D personality as an example. J. Psychosom. Res . 73, 408–410. doi: 10.1016/j.jpsychores.2012.09.014

John, L. K., Loewenstein, G., and Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci . 23, 524–532. doi: 10.1177/0956797611430953

Jordan, G. (2004). Theory Construction in Second Language Acquisition . Philadelphia, PA: John Benjamins.

Jost, J. (2013). Introduction to: an Additional Future for Psychological Science. Perspect. Psychol. Sci . 8, 414–423. doi: 10.1177/1745691613491270

Kepes, S., and McDaniel, M. A. (2013). How trustworthy is the scientific literature in industrial and organizational psychology? Ind. Organi. Psychol . 6, 252–268. doi: 10.1111/iops.12045

Koole, S. L., and Lakens, D. (2012). Rewarding replications: a sure and simple way to improve psychological science. Perspect. Psychol. Sci . 7, 608–614. doi: 10.1177/1745691612462586

Lakatos, I. (1970). “Falsification and the methodology of scientific research programmes,” in Criticism and the growth of knowledge , eds I. Lakatos and A. Musgrave (London: Cambridge University Press), 91–196.

Lakatos, I. (1978). The Methodology of Scientific Research Programmes . Cambridge: Cambridge University Press.

LeBel, E. P., and Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's 2011 evidence of psi as a case study of deficiencies in modal research practice. Rev. Gen. Psychol . 15, 371–379. doi: 10.1037/a0025172

Lieberman, M. (2012). Does thinking of grandpa make you slow? What the failure to replicate results does and does not mean. Psychol. Today . Available online at http://www.psychologytoday.com/blog/social-brain-social-mind/201203/does-thinking-grandpa-make-you-slow

Loscalzo, J. (2012). Irreproducible experimental results: causes, (mis) interpretations, and consequences. Circulation 125, 1211–1214. doi: 10.1161/CIRCULATIONAHA.112.098244

Lykken, D. T. (1968). Statistical significance in psychological research. Psychol. Bullet . 70, 151–159. doi: 10.1037/h0026141

Magee, B. (1973). Karl Popper . New York, NY: Viking Press.

Makel, M. C., Plucker, J. A., and Hegarty, B. (2012). Replications in psychology research: how often do they really occur? Perspect. Psychol. Sci . 7, 537–542. doi: 10.1177/1745691612460688

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J. Consult. Clin. Psychol . 46, 806–834. doi: 10.1037/0022-006X.46.4.806

Meehl, P. E. (1990a). Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant using it. Psychol. Inquiry 1, 108–141. doi: 10.1207/s15327965pli0102_1

Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychol. Reports 66, 195–244.

Mulkay, M., and Gilbert, G. N. (1981). Putting philosophy to work: Karl Popper's influence on scientific practice. Philos. Soc. Sci . 11, 389–407. doi: 10.1177/004839318101100306

Mulkay, M., and Gilbert, G. N. (1986). Replication and mere replication. Philos. Soc. Sci . 16, 21–37. doi: 10.1177/004839318601600102

Nosek, B. A., and the Open Science Collaboration (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspect. Psychol. Sci . 7, 657–660. doi: 10.1177/1745691612462588

Nosek, B. A., Spies, J. R., and Motyl, M. (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci . 7, 615–631. doi: 10.1177/1745691612459058

Pashler, H., and Wagenmakers, E. J. (2012). Editors' introduction to the special section on replicability in psychological science: a crisis of confidence? Perspect. Psychol. Sci . 7, 528–530. doi: 10.1177/1745691612465253

Polanyi, M. (1962). Tacit knowing: Its bearing on some problems of philosophy. Rev. Mod. Phys . 34, 601–615. doi: 10.1103/RevModPhys.34.601

Popper, K. (1959). The Logic of Scientific Discovery . London: Hutchison.

Quine, W. V. O. (1980). “Two dogmas of empiricism,” in From a Logical Point of View, 2nd Edn. , ed W. V. O. Quine (Cambridge, MA: Harvard University Press), 20–46.

Radder, H. (1992). “Experimental reproducibility and the experimenters' regress,” in PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association (Chicago, IL: University of Chicago Press).

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychol. Bullet . 86, 638–641. doi: 10.1037/0033-2909.86.3.638

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev. Gen. Psychol . 13, 90–100. doi: 10.1037/a0015108

Schnall, S. (2014). Simone Schnall on her experience with a registered replication project. SPSP Blog . Available online at: http://www.spspblog.org/simone-schnall-on-her-experience-with-a-registered-replication-project/

Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci . 22, 1359–1366. doi: 10.1177/0956797611417632

Smith, N. C. (1970). Replication studies: a neglected aspect of psychological research. Am. Psychol . 25, 970–975. doi: 10.1037/h0029774

Stroebe, W., Postmes, T., and Spears, R. (2012). Scientific misconduct and the myth of self-correction in science. Perspect. Psychol. Sci . 7, 670–688. doi: 10.1177/1745691612460687

Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: surprising insights from Bayes's theorem. Psychol. Rev . 110, 526–535. doi: 10.1037/0033-295X.110.3.526

Trafimow, D. (2009). The theory of reasoned action: a case study of falsification in psychology. Theory Psychol . 19, 501–518. doi: 10.1177/0959354309336319

Trafimow, D. (2010). On making assumptions about auxiliary assumptions: reply to Wallach and Wallach. Theory Psychol . 20, 707–711. doi: 10.1177/0959354310374379

Trafimow, D. (2014). Editorial. Basic Appl. Soc. Psychol . 36, 1–2, doi: 10.1080/01973533.2014.865505

Trafimow, D., and Marks, M. (2015). Editorial. Basic Appl. Soc. Psychol . 37, 1–2. doi: 10.1080/01973533.2015.1012991

Trafimow, D., and Rice, S. (2009). A test of the NHSTP correlation argument. J. Gen. Psychol . 136, 261–269. doi: 10.3200/GENP.136.3.261-270

Tsang, E. W., and Kwan, K. M. (1999). Replication and theory development in organizational science: a critical realist perspective. Acad. Manag. Rev . 24, 759–780. doi: 10.2307/259353

Van IJzendoorn, M. H. (1994). A Process Model of Replication Studies: on the Relation between Different Types of Replication . Leiden University Library. Available online at: https://openaccess.leidenuniv.nl/bitstream/handle/1887/1483/168_149.pdf?sequence=1

Verhagen, J., and Wagenmakers, E. J. (2014). Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol . 143, 1457–1475. doi: 10.1037/a0036731

Westen, D. (1988). Official and unofficial data. New Ideas Psychol . 6, 323–331. doi: 10.1016/0732-118X(88)90044-X

Yong, E. (2012). A failed replication attempt draws a scathing personal attack from a psychology professor. Discover Magazine . Available online at http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/#.VVGC-M6Gjds

Keywords: replication, falsification, falsifiability, crisis of confidence, social psychology, priming, philosophy of science, Karl Popper

Citation: Earp BD and Trafimow D (2015) Replication, falsification, and the crisis of confidence in social psychology. Front. Psychol . 6 :621. doi: 10.3389/fpsyg.2015.00621

Received: 05 March 2015; Accepted: 27 April 2015; Published: 19 May 2015.

Copyright © 2015 Earp and Trafimow. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Brian D. Earp, Uehiro Centre for Practical Ethics, University of Oxford, Suite 8, Littlegate House, St. Ebbes Street, Oxford OX1 1PT, UK, [email protected]; https://twitter.com/@briandavidearp

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Publication bias and the failure of replication in experimental psychology

Affiliation.

  • 1 Department of Psychological Sciences, Purdue University, 703 Third Street, West Lafayette, IN 47906, USA. [email protected]
  • PMID: 23055145
  • DOI: 10.3758/s13423-012-0322-y

Replication of empirical findings plays a fundamental role in science. Among experimental psychologists, successful replication enhances belief in a finding, while a failure to replicate is often interpreted to mean that one of the experiments is flawed. This view is wrong. Because experimental psychology uses statistics, empirical findings should appear with predictable probabilities. In a misguided effort to demonstrate successful replication of empirical findings and avoid failures to replicate, experimental psychologists sometimes report too many positive results. Rather than strengthen confidence in an effect, too much successful replication actually indicates publication bias, which invalidates entire sets of experimental findings. Researchers cannot judge the validity of a set of biased experiments because the experiment set may consist entirely of type I errors. This article shows how an investigation of the effect sizes from reported experiments can test for publication bias by looking for too much successful replication. Simulated experiments demonstrate that the publication bias test is able to discriminate biased experiment sets from unbiased experiment sets, but it is conservative about reporting bias. The test is then applied to several studies of prominent phenomena that highlight how publication bias contaminates some findings in experimental psychology. Additional simulated experiments demonstrate that using Bayesian methods of data analysis can reduce (and in some cases, eliminate) the occurrence of publication bias. Such methods should be part of a systematic process to remove publication bias from experimental psychology and reinstate the important role of replication as a final arbiter of scientific findings.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Front Psychol

Replication, falsification, and the crisis of confidence in social psychology

Brian D. Earp

1 Uehiro Centre for Practical Ethics, University of Oxford, Oxford, UK

2 Department of History and Philosophy of Science, University of Cambridge, Cambridge, UK

David Trafimow

3 Department of Psychology, New Mexico State University, Las Cruces, NM, USA

The (latest) crisis in confidence in social psychology has generated much heated discussion about the importance of replication, including how it should be carried out as well as interpreted by scholars in the field. For example, what does it mean if a replication attempt “fails”—does it mean that the original results, or the theory that predicted them, have been falsified? And how should “failed” replications affect our belief in the validity of the original research? In this paper, we consider the replication debate from a historical and philosophical perspective, and provide a conceptual analysis of both replication and falsification as they pertain to this important discussion. Along the way, we highlight the importance of auxiliary assumptions (for both testing theories and attempting replications), and introduce a Bayesian framework for assessing “failed” replications in terms of how they should affect our confidence in original findings.

“Only when certain events recur in accordance with rules or regularities, as in the case of repeatable experiments, can our observations be tested—in principle—by anyone.… Only by such repetition can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.” – Karl Popper (1959, p. 45)

Introduction

Scientists pay lip-service to the importance of replication. It is the “coin of the scientific realm” (Loscalzo, 2012 , p. 1211); “one of the central issues in any empirical science” (Schmidt, 2009 , p. 90); or even the “demarcation criterion between science and nonscience” (Braude, 1979 , p. 2). Similar declarations have been made about falsifiability , the “demarcation criterion” proposed by Popper in his seminal work of 1959 (see epigraph). As we will discuss below, the concepts are closely related—and also frequently misunderstood. Nevertheless, their regular invocation suggests a widespread if vague allegiance to Popperian ideals among contemporary scientists, working from a range of different disciplines (Jordan, 2004 ; Jost, 2013 ). The cosmologist Hermann Bondi once put it this way: “There is no more to science than its method, and there is no more to its method than what Popper has said” (quoted in Magee, 1973 , p. 2).

Experimental social psychologists have fallen in line. Perhaps in part to bolster our sense of identity with the natural sciences (Danzinger, 1997 ), we psychologists have been especially keen to talk about replication. We want to trade in the “coin” of the realm. As Billig ( 2013 ) notes, psychologists “cling fast to the belief that the route to knowledge is through the accumulation of [replicable] experimental findings” (p. 179). The connection to Popper is often made explicit. One recent example comes from Kepes and McDaniel ( 2013 ), from the field of industrial-organizational psychology: “The lack of exact replication studies [in our field] prevents the opportunity to disconfirm research results and thus to falsify [contested] theories” (p. 257). They cite The Logic of Scientific Discovery .

There are problems here. First, there is the “ lack ” of replication noted in the quote from Kepes and McDaniel. If replication is so important, why isn't it being done? This question has become a source of crisis-level anxiety among psychologists in recent years, as we explore in a later section. The anxiety is due to a disconnect: between what is seen as being necessary for scientific credibility—i.e., careful replication of findings based on precisely-stated theories—and what appears to be characteristic of the field in practice (Nosek et al., 2012 ). Part of the problem is the lack of prestige associated with carrying out replications (Smith, 1970 ). To put it simply, few would want to be seen by their peers as merely “copying” another's work (e.g., Mulkay and Gilbert, 1986 ); and few could afford to be seen in this way by tenure committees or by the funding bodies that sponsor their research. Thus, while “a field that replicates its work is [seen as] rigorous and scientifically sound”—according to Makel et al. ( 2012 )—psychologists who actually conduct those replications “are looked down on as bricklayers and not [as] advancing [scientific] knowledge” (p. 537). In consequence, actual replication attempts are rare.

A second problem is with the reliance on Popper—or, at any rate, a first-pass reading of Popper that seems to be uninformed by subsequent debates in the philosophy of science. Indeed, as critics of Popper have noted, since the 1960s and consistently thereafter, neither his notion of falsification nor his account of experimental replicability seem strictly amenable to being put into practice (e.g., Mulkay and Gilbert, 1981 ; see also Earp, 2011 )—at least not without considerable ambiguity and confusion. What is more, they may not even be fully coherent as stand-alone “abstract” theories, as has been repeatedly noted as well (cf. Cross, 1982 ).

The arguments here are familiar. Let us suppose that—at the risk of being accused of laying down bricks—Researcher B sets up an experiment to try to “replicate” a controversial finding that has been reported by Researcher A. She follows the original methods section as closely as she can (assuming that this has been published in detail; or even better, she simply asks Researcher A for precise instructions). She calibrates her equipment. She prepares the samples and materials just so. And she collects and then analyzes the data. If she gets a different result from what was reported by Researcher A—what follows? Has she “falsified” the other lab's theory? Has she even shown the original result to be erroneous in some way?

The answer to both of these questions, as we will demonstrate in some detail below, is “no.” Perhaps Researcher B made a mistake (see Trafimow, 2014 ). Perhaps the other lab did. Perhaps one of B's research assistants wrote down the wrong number. Perhaps the original effect is a genuine effect, but can only be obtained under specific conditions—and we just don't know yet what they are (Cesario, 2014 ). Perhaps it relies on “tacit” (Polanyi, 1962 ) or “unofficial” (Westen, 1988 ) experimental knowledge that can only be acquired over the course of several years, and perhaps Researcher B has not yet acquired this knowledge (Collins, 1975 ).

Or perhaps the original effect is not a genuine effect, but Researcher A's theory can actually accommodate this fact. Perhaps Researcher A can abandon some auxiliary hypothesis, or take on board another, or re-formulate a previously unacknowledged background assumption—or whatever (cf. Lakatos, 1970 ; Cross, 1982 ; Folger, 1989 ). As Lakatos ( 1970 ) once put it: “given sufficient imagination, any theory… can be permanently saved from ‘refutation’ by some suitable adjustment in the background knowledge in which it is embedded” (p. 184). We will discuss some of these potential “adjustments” below. The upshot, however, is that we simply do not know, and cannot know, exactly what the implications of a given replication attempt are, no matter which way the data come out. There are no critical tests of theories; and there are no objectively decisive replications.

Popper ( 1959 ) was not blind to this problem. “In point of fact,” he wrote, in an under-appreciated passage of his famous book, “no conclusive disproof of a theory can ever be produced, for it is always possible to say that the experimental results are not reliable, or that the discrepancies which are asserted to exist between the experimental results and the theory are only apparent” (p. 50, emphasis added). Hence as Mulkay and Gilbert ( 1981 ) explain:

… in relation to [actual] scientific practice, one can only talk of positive and negative results, and not of proof or disproof. Negative results, that is, results which seem inconsistent with a given hypothesis [or with a putative finding from a previous experiment], may incline a scientist to abandon [the] hypothesis but they will never require him to abandon it… Whether or not he does so may depend on the amount and quality of positive evidence, on his confidence in his own and others' experimental skills and on his ability to conceive of alternative interpretations of the negative findings. (p. 391)

Drawing hard and fast conclusions, therefore, about “negative” results—such as those that may be produced by a “failed” replication attempt—is much more difficult than Kepes and McDaniel seem to imagine (see e.g., Chow, 1988 for similarly problematic arguments). This difficulty may be especially acute in the field of psychology. As Folger ( 1989 ) notes, “Popper himself believed that too many theories, particularly in the social sciences , were constructed so loosely that they could be stretched to fit any conceivable set of experimental results, making them… devoid of testable content” (p. 156, emphasis added). Furthermore, as Collins ( 1985 ) has argued, the less secure a field's foundational theories—and especially at the field's “frontier”—the more room there is for disagreement about what should “count” as a proper replication 1 .

Related to this problem is that it can be difficult to know in what specific sense a replication study should be considered to be “the same” as the original (e.g., Van IJzendoorn, 1994 ). Consider that the goal for these kinds of studies is to rule out flukes and other types of error. Thus, we want to be able to say that the same experiment , if repeated one more time, would produce the same result as was originally observed. But an original study and a replication study cannot, by definition, be identical—at the very least, some time will have passed and the participants will all be new 2 —and if we don't yet know which differences are theory-relevant, we won't be able to control for their effects. The problem with a field like psychology, whose theoretical predictions are often “constructed so loosely,” as noted above, is precisely that we do not know—or at least, we do not in a large number of cases—which differences are in fact relevant to the theory.

Finally, human behavior is notoriously complex. We are not like billiard balls, or beavers, or planets, or paramecia (i.e., relatively simple objects or organisms). This means that we should expect our behavioral responses to vary across a “wide range of moderating individual difference and experimental context variables” (Cesario, 2014 , p. 41)—many of which are not yet known, and some of which may be difficult or even impossible to uncover (Meehl, 1990a ). Thus, in the absence of “well-developed theories for specifying such [moderating] variables, the conclusions of replication failures will be ambiguous” (Cesario, 2014 , p. 41; see also Meehl, 1978 ).

Summing up the problem

Hence we have two major points to consider. First, due to a lack of adequate incentives in the reward structure of professional science (e.g., Nosek and the Open Science Collaboration, 2012 ), actual replication attempts are rarely carried out. Second, to the extent that they are carried out, it can be well-nigh impossible to say conclusively what they mean, whether they are “successful” (i.e., showing similar, or apparently similar, results to the original experiment) or “unsuccessful” (i.e., showing different, or apparently different, results to the original experiment). Thus, Collins ( 1985 ) came to the conclusion that, in physics at least, disputes over contested findings are likelier to be resolved by social and reputational negotiations —over, e.g., who should be considered a competent experimenter—than by any “objective” consideration of the experiments themselves. Meehl ( 1990b ) drew a similar conclusion about the field of social psychology, although he identified sheer boredom (rather than social/reputational negotiation) as the alternative to decisive experimentation:

… theories in the “soft areas” of psychology have a tendency to go through periods of initial enthusiasm leading to large amounts of empirical investigation with ambiguous over-all results. This period of infatuation is followed by various kinds of amendment and the proliferation of ad hoc hypotheses. Finally, in the long run, experimenters lose interest rather than deliberately discard a theory as clearly falsified. (p. 196)

So how shall we take stock of what has been said? A cynical reader might conclude that—far from being a “demarcation criterion between science and nonscience”—replication is actually closer to being a waste of time. Indeed, if even replications in physics are sometimes not conclusive, as Collins ( 1975 , 1981 , 1985 ) has convincingly shown, then what hope is there for replications in psychology?

Our answer is simply as follows. Replications do not need to be “conclusive” in order to be informative . In this paper, we highlight some of the ways in which replication attempts can be more, rather than less, informative, and we discuss—using a Bayesian framework—how they can reasonably affect a researcher's confidence in the validity of an original finding. The same is true of “falsification.” Whilst a scientist should not simply abandon her favorite theory on account of a single (apparently) contradictory result—as Popper himself was careful to point out 3 (1959, pp. 66–67; see also Earp, 2011 )—she might reasonably be open to doubt it, given enough disconfirmatory evidence, and assuming that she had stated the theory precisely. Rather than being a “waste of time,” therefore, experimental replication of one's own and others' findings can be a useful tool for restoring confidence in the reliability of basic effects—provided that certain conditions are met. The work of the latter part of this essay is to describe and to justify at least a few of those essential conditions. In this context, we draw a distinction between “conceptual” or “reproductive” replications (cf. Cartwright, 1991 )—which may conceivably be used to bolster confidence in a particular theory —and “direct” or “close” replications, which may be used to bolster confidence in a finding (Schmidt, 2009 ; see also Earp et al., 2014 ). Since it is doubt about the findings that seems to have prompted the recent “crisis” in social psychology, it is the latter that will be our focus. But first we must introduce the crisis.

The (latest) crisis in social psychology and calls for replication

“Is there currently a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field? It would certainly appear that there is.” So write Pashler and Wagenmakers ( 2012 , p. 529) in a recent issue of Perspectives on Psychological Science . The “crisis” is not unique to psychology; it is rippling through biomedicine and other fields as well (Ioannidis, 2005 ; Loscalzo, 2012 ; Earp and Darby, 2015 )—but psychology will be the focus of this paper, if for no other reason than that the present authors have been closer to the facts on the ground.

Some of the causes of the crisis are fairly well known. In 2011, an eminent Dutch researcher confessed to making up data and experiments, producing a résumé-full of “findings” that he had simply invented out of whole cloth (Carey, 2011 ). He was outed by his own students, however, and not by peer review nor by any attempt to replicate his work. In other words, he might just as well have not been found out, had he only been a little more careful (Stroebe et al., 2012 ). An unsettling prospect was thus aroused: Could other fraudulent “findings” be circulating—undetected, and perhaps even undetectable—throughout the published record? After an exhaustive analysis of the Dutch fraud case, Stroebe et al. ( 2012 ) concluded that the notion of self-correction in science was actually a “myth” (p. 670); and others have offered similar pronouncements (Ioannidis, 2012a ).

But fraud, it is hoped, is rare. Nevertheless, as Ioannidis ( 2005 , 2012a ) and others have argued, the line between explicitly fraudulent behavior and merely “questionable” research practices is perilously thin, and the latter are probably common. John et al. ( 2012 ) conducted a massive, anonymous survey of practicing psychologists and showed that this conjecture is likely correct. Psychologists admitted to such questionable research practices as failing to report all of the dependent measures for which they had collected data (78%) 4 , collecting additional data after checking to see whether preliminary results were statistically significant (72%), selectively reporting studies that “worked” (67%), claiming to have predicted an unexpected finding (54%), and failing to report all of the conditions that they ran (42%). Each of these practices alone, and even more so when combined, reduces the interpretability of the final reported statistics, casting doubt upon any claimed “effects” (e.g., Simmons et al., 2011 ).

The motivation behind these practices, though not necessarily conscious or deliberate, is also not obscure. Professional journals have long had a tendency to publish only or primarily novel, “statistically significant” effects, to the exclusion of replications—and especially “failed” replications—or other null results. This problem, known as “publication bias,” leads to a file-drawer effect whereby “negative” experimental outcomes are simply “filed away” in a researcher's bottom drawer, rather than written up and submitted for publication (e.g., Rosenthal, 1979 ). Meanwhile, the “questionable research practices” carry on in full force, since they increase the researcher's chances of obtaining a “statistically significant” finding—whether it turns out to be reliable or not.

To add insult to injury, in 2012, an acrimonious public skirmish broke out in the form of dueling blog posts between the distinguished author of a classic behavioral priming study 5 and a team of researchers who had questioned his findings (Yong, 2012 ). The disputed results had already been cited more than 2000 times—an extremely large number for the field—and even been enshrined in introductory textbooks. What if they did turn out to be a fluke? Should other “priming studies” be double-checked as well? Coverage of the debate ensued in the mainstream media (e.g., Bartlett, 2013 ).

Another triggering event resulted in "widespread public mockery" (Pashler and Wagenmakers, 2012 , p. 528). In contrast to the fraud case described above, which involved intentional, unblushing deception, the psychologist Daryl Bem relied on well-established and widely-followed research and reporting practices to generate an apparently fantastic result, namely evidence that participants' current responses could be influenced by future events (Bem, 2011 ). Since such paranormal precognition is inconsistent with widely-held theories about "the fundamental nature of time and causality" (LeBel and Peters, 2011 , p. 371), few took the findings seriously. Instead, they began to wonder about the "well-established and widely-followed research and reporting practices" that had sanctioned the findings in the first place (and allowed for their publication in a leading journal). As Simmons et al. ( 2011 ) concluded—reflecting broadly on the state of the discipline—"it is unacceptably easy to publish 'statistically significant' evidence consistent with any hypothesis" (p. 1359) 6 .

The main culprit for this phenomenon is what Simmons et al. ( 2011 ) identified as researcher degrees of freedom :

In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?… It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance” and to then report only what “worked.” (p. 1359)

One unfortunate consequence of such a strategy—involving, as it does, some of the very same questionable research practices later identified by John et al. ( 2012 ) in their survey of psychologists—is that it inflates the possibility of producing a false positive (or a Type 1 error). Since such practices are “common” and even “accepted,” the literature may be replete with erroneous results. Thus, as Ioannidis ( 2005 ) declared after performing a similar analysis in his own field of biomedicine, “ most published research findings” may be “false” (p. 0696, emphasis added). This has led to the “unprecedented level of doubt” referred to by Pashler and Wagenmakers ( 2012 ) in the opening quote to this section.

This is not the first crisis for psychology. Giner-Sorolla ( 2012 ) points out that "crises" of one sort or another "have been declared regularly at least since the time of Wilhelm Wundt"—with turmoil as recent as the 1970s inspiring particular déjà vu (p. 563). Then, as now, a string of embarrassing events—including the publication in mainstream journals of literally unbelievable findings 7 —led to "soul searching" amongst leading practitioners. Standard experimental methods, statistical strategies, reporting requirements, and norms of peer review were all put under the microscope; numerous sources of bias were carefully rooted out (e.g., Greenwald, 1975 ). While various calls for reform were put forward—some more energetically than others—a single corrective strategy seemed to emerge from all the din: the need for psychologists to replicate their work . Since " all flawed research practices yield findings that cannot be reproduced," critics reasoned, replication could be used to separate the wheat from the chaff (Koole and Lakens, 2012 , p. 608, emphasis added; see also Elms, 1975 ).

The same calls reverberate today. “For psychology to truly adhere to the principles of science,” write Ferguson and Heene ( 2012 ), “the need for replication of research results [is] important… to consider” (p. 556). LeBel and Peters ( 2011 ) put it like this: “Across all scientific disciplines, close replication is the gold standard for corroborating the discovery of an empirical phenomenon” and “the importance of this point for psychology has been noted many times” (p. 375). Indeed, “leading researchers [in psychology]” agree, according to Francis ( 2012 ), that “experimental replication is the final arbiter in determining whether effects are true or false” (p. 585).

We have already seen that such calls must be heeded with caution: replication is not straightforward, and the outcome of replication studies may be difficult to interpret. Indeed they can never be conclusive on their own. But we suggested that replications could be more or less informative ; and in the following sections we discuss some strategies for making them “more” rather than “less.” We begin with a discussion of “direct” vs. “conceptual” replication.

Increasing replication informativeness: “direct” vs. “conceptual” replication

In a systematic review of the literature, encompassing multiple academic disciplines, Gómez et al. ( 2010 ) identified 18 different types of replication. Three of these were from Lykken ( 1968 ), who drew a distinction between “literal,” “operational,” and “constructive”—which Schmidt ( 2009 ) then winnowed down (and re-labeled) to arrive at “direct” and “conceptual” in an influential paper. As Makel et al. ( 2012 ) have pointed out, it is Schmidt's particular framework that seems to have crystallized in the field of psychology, shaping most of the subsequent discussion on this issue. We have no particular reason to rock the boat; indeed these categories will suit our argument just fine.

The first step in making a replication informative is to decide what specifically it is for. “Direct” replications and “conceptual” replications are “for” different things; and assigning them their proper role and function will be necessary for resolving the crisis. First, some definitions:

A “direct” replication may be defined as an experiment that is intended to be as similar to the original as possible (Schmidt, 2009 ; Makel et al., 2012 ). This means that along every conceivable dimension—from the equipment and materials used, to the procedure, to the time of day, to the gender of the experimenter, etc.—the replicating scientist should strive to avoid making any kind of change or alteration. The purpose here is to “check” the original results. Some changes will be inevitable, of course; but the point is that only the inevitable changes (such as the passage of time between experiments) are ideally tolerated in this form of replication. In a “conceptual” replication, by contrast, at least certain elements of the original experiment are intentionally altered, (ideally) systematically so, toward the end of achieving a very different sort of purpose—namely to see whether a given phenomenon, assuming that it is reliable, might obtain across a range of variable conditions. But as Doyen et al. ( 2014 ) note in a recent paper:

The problem with conceptual replication in the absence of direct replication is that there is no such thing as a “conceptual failure to replicate.” A failure to find the same “effect” using a different operationalization can be attributed to the differences in method rather than to the fragility of the original effect. Only the successful conceptual replications will be published, and the unsuccessful ones can be dismissed without challenging the underlying foundations of the claim. Consequently, conceptual replication without direct replication is unlikely to change beliefs about the underlying effect (p. 28) .

In simplest terms, therefore, a "direct" replication seeks to validate a particular fact or finding ; whereas a "conceptual" replication seeks to validate the underlying theory or phenomenon —i.e., the theory that has been proposed to "predict" the effect that was obtained by the initial experiment—as well as to establish the boundary conditions within which the theory holds true (Nosek et al., 2012 ). The latter is impossible without the former. In other words, if we cannot be sure that our finding is reliable to begin with (because it turns out to have been a coincidence, or else a false alarm due to questionable research practices, publication bias, or fraud), then we are in no position to begin testing the theory by which it is supposedly explained (Cartwright, 1991 ; see also Earp et al., 2014 ).

Of course both types of replication are important, and there is no absolute line between them. Rather, as Asendorpf et al. ( 2013 ) point out, “direct replicability [is] one extreme pole of a continuous dimension extending to broad generalizability [via ‘conceptual’ replication] at the other pole, ranging across multiple, theoretically relevant facets of study design” (p. 139). Collins made a similar point in 1985 (e.g., p. 37). But so long as we remain largely ignorant about exactly which “facets of study design” are “theoretically relevant” to begin with—as is the case with much of current social psychology (Meehl, 1990b ), and nearly all of the most heavily-contested experimental findings—we need to orient our attention more toward the “direct” end of the spectrum 8 .

How else can replication be made more informative ? Brandt et al. ( 2014 )'s “Replication Recipe” offers several important factors, one of which must be highlighted to begin with. This is their contention that a “convincing” replication should be carried out outside the lab of origin . Clearly this requirement shifts away from the “direct” extreme of the replication gradient that we have emphasized so far, but such a change from the original experiment, in this case, is justified. As Ioannidis ( 2012b ) points out, replications by the original researchers—while certainly important and to be encouraged as a preliminary step—are not sufficient to establish “convincing” experimental reliability. This is because allegiance and confirmation biases, which may apply especially to the original team, would be less of an issue for independent replicators.

Partially against this view, Schnall ( 2014 , np) argues that “authors of the original work should be allowed to participate in the process of having their work replicated.” On the one hand, this might have the desirable effect of ensuring that the replication attempt faithfully reproduces the original procedure. It seems reasonable to think that the original author would know more than anyone else about how the original research was conducted—so her viewpoint is likely to be helpful. On the other hand, however, too much input by the original author could compromise the independence of the replication: she might have a strong motivation to make the replication a success, which could subtly influence the results (see Earp and Darby, 2015 ). Whichever position one takes on the appropriate degree of input and/or oversight from the original author, however, Schnall ( 2014 , np) is certainly right to note that “the quality standards for replications need to be at least as high as for the original findings. Competent evaluation by experts is absolutely essential, and is especially important if replication authors have no prior expertise with a given research topic.”

Other ingredients in increasing the informativeness of replication attempts include: (1) carefully defining the effects and methods that the researchers intend to replicate; (2) following as exactly as possible the methods of the original study (as described above); (3) having high statistical power (i.e., an adequate sample size to detect an effect if one is really present); (4) making complete details about the replication available, so that interested experts can fully evaluate the replication attempt (or attempt another replication themselves); and (5) evaluating the replication results, comparing them critically to the results of the original study (Brandt et al., 2014 , p. 218, paraphrased). This list is not exhaustive, but it gives a concrete sense of how "stabilizing" procedures (see Radder, 1992 ) can be employed to give greater credence to the quality and informativeness of replication efforts.
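Ingredient (3), adequate statistical power, is the one that lends itself most directly to calculation. The sketch below is a minimal illustration, not part of Brandt et al.'s recipe, of how a replicating lab might size a two-group study; the effect size of d = 0.4, the 0.05 alpha level, and the 90% power target are assumed values chosen for the example.

```python
# A minimal sketch of a power analysis for planning a direct replication.
# The effect size (Cohen's d = 0.4), alpha level, and target power are
# assumed, illustrative values; in practice d would be taken from the
# original study, ideally adjusted downward for possible publication bias.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.4,          # effect size attributed to the original study
    alpha=0.05,               # two-sided significance level
    power=0.90,               # chance of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 132
```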

Replication, falsification, and auxiliary assumptions

Brandt et al.'s ( 2014 ) “replication recipe” provides a vital tool for researchers seeking to conduct high quality replications. In this section, we offer an additional “ingredient” to the discussion, by highlighting the role of auxiliary assumptions in increasing replication informativeness, specifically as these pertain to the relationship between replication and falsification. Consider the logical fallacy of affirming the consequent that provided an important basis for Popper's falsification argument.

If the theory (T) is true,
        an observation (O) should occur (T → O)    (Premise 1)
The observation occurs (O)    (Premise 2)
Therefore, the theory is true (T)    (Conclusion)

Obviously, the conclusion does not follow. Any number of things might have led to the observation that have nothing to do with the theory being proposed (see Earp, 2015 for a similar argument). On the other hand, denying the consequent ( modus tollens ) does invalidate the theory, strictly according to the logic given:

If the theory (T) is true,
        an observation (O) should occur (T → O)    (Premise 1)
The observation does not occur (~O)    (Premise 2)
Therefore, the theory is not true (~T)    (Conclusion)

Given this logical asymmetry, then, between affirming and denying the consequent of a theoretical prediction (see Earp and Everett, 2013 ), Popper opted for the latter. By doing so, he famously defended a strategy of disconfirming rather than confirming theories. Yet if the goal is to disconfirm theories, then the theories must be capable of being disconfirmed in the first place; hence, a basic requirement of scientific theories (in order to count as properly scientific) is that they have this feature: they must be falsifiable .

As we hinted at above, however, this basic framework is an oversimplification. As Popper himself noted, and as was made particularly clear by Lakatos ( 1978 ; also see Duhem, 1954 ; Quine, 1980 ), scientists do not derive predictions only from a given theory, but rather from a combination of the theory and auxiliary assumptions . The auxiliary assumptions are not part of the theory proper, but they serve several important functions. One of these functions is to show the link between the sorts of outcomes that a scientist can actually observe (i.e., by running an experiment), and the non-observable, "abstract" content of the theory itself. To pick one classic example from psychology, according to the theory of reasoned action (e.g., Fishbein, 1980 ), attitudes (together with subjective norms) determine behavioral intentions. One implication of this theoretical assumption is that researchers should be able to obtain strong correlations between attitudes and behavioral intentions. But this assumes, among other things, that a check mark on an attitude scale really indicates a person's attitude, and that a check mark on an intention scale really indicates a person's intention. The theory of reasoned action has nothing to say about whether check marks on scales indicate attitudes or intentions; these are assumptions that are peripheral to the basic theory. They are auxiliary assumptions that researchers use to connect non-observational terms such as "attitude" and "intention" to observable phenomena such as check marks. Fishbein and Ajzen ( 1975 ) recognized this and took great pains to spell out, as well as possible, the auxiliary assumptions that best aid in measuring theoretically relevant variables (see also Ajzen and Fishbein, 1980 ).

The existence of auxiliary assumptions complicates the project of falsification. This is because the major premise of the modus tollens argument—denying the consequent of the theoretical prediction—must be stated somewhat differently. It must be stated like this: “If the theory is true and a set of auxiliary assumptions is true , an observation should occur.” Keeping the second premise the same implies that either the theory is not true or that at least one auxiliary assumption is not true, as the following syllogism (in symbols only) illustrates.

T & (A1 & A2 & … & An) → O    (Premise 1)
~O    (Premise 2)
∴ ~T ∨ ~(A1 & A2 & … & An) =
  ~T ∨ ~A1 ∨ ~A2 ∨ … ∨ ~An    (Conclusion)

Consider an example. It often is said that Newton's gravitational theory predicted where planets would be at particular times. But this is not precisely accurate. It would be more accurate to say that such predictions were derived from a combination of Newton's theory and auxiliary assumptions not contained in that theory (e.g., about the present locations of the planets). To return to our example about attitudes and intentions from psychology, consider the mini-crisis in social psychology from the 1960s, when it became clear to researchers that attitudes—the kingly construct—failed to predict behaviors. Much of the impetus for the theory of reasoned action (e.g., Fishbein, 1980 ) was Fishbein's realization that there was a problem with attitude measurement at the time: when this problem was fixed, strong attitude-behavior (or at least attitude-intention) correlations became the rule rather than the exception. This episode provides a compelling illustration of a case in which attention to the auxiliary assumptions that bore on actual measurement played a larger role in resolving a crisis in psychology than debates over the theory itself.

What is the lesson here? Due to the fact that failures to obtain a predicted observation can be blamed either on the theory itself or on at least one auxiliary assumption, absolute theory falsification is about as problematic as is absolute theory verification. In the Newton example, when some of Newton's planetary predictions were shown to be wrong, he blamed the failures on incorrect auxiliary assumptions rather than on his theory, arguing that there were additional but unknown astronomical bodies that skewed his findings—which turned out to be a correct defense of the theory. Likewise, in the attitude literature, the theoretical connection between attitudes and behaviors turned out to be correct (as far as we know) with the problem having been caused by incorrect auxiliary assumptions pertaining to attitude measurement.

There is an additional consequence to the necessity of giving explicit consideration to one's auxiliary assumptions. Suppose, as often happens in psychology, that a researcher deems a theory to be unfalsifiable because he or she does not see any testable predictions. Is the theory really unfalsifiable or is the problem that the researcher has not been sufficiently thorough in identifying the necessary auxiliary assumptions that would lead to falsifiable predictions? Given that absolute falsification is impossible, and that researchers are therefore limited to some kind of “reasonable” falsification, Trafimow ( 2009 ) has argued that many allegedly unfalsifiable theories are reasonably falsifiable after all: it is just a matter of researchers having to be more thoughtful about considering auxiliary assumptions. Trafimow documented examples of theories that had been described as unfalsifiable that one could in fact falsify by proposing better auxiliary assumptions than had been imagined by previous researchers.

The notion that auxiliary assumptions can vary in quality is relevant for replication. Consider, for example, the case alluded to earlier regarding a purported failure to replicate Bargh et al.'s ( 1996 ) famous priming results. In the replication attempt of this well-known "walking time" study (Doyen et al., 2012 ), laser beams were used to measure the speed with which participants left the laboratory, rather than students with stopwatches. Undoubtedly, this adjustment was made on the basis of a reasonable auxiliary assumption that methods of measuring time that are less susceptible to human idiosyncrasies would be superior to methods that are more susceptible to them. Does the fact that the failed replication was not exactly like the original experiment render it invalid? At least with regard to this particular feature of this particular replication attempt, the answer is clearly "no." If a researcher uses a better auxiliary assumption than in the original experiment, this should add to its validity rather than subtract from it 9 .

But suppose, for a particular experiment, that we are not in a good position to judge the superiority of alternative auxiliary assumptions. We might invoke what Meehl ( 1990b ) termed the ceteris paribus (all else equal) assumption. This idea, applied to the issue of direct replications, suggests that for researchers to be confident that a replication attempt is a valid one, the auxiliary assumptions in the replication have to be sufficiently similar to those in the original experiment that any differences in findings cannot reasonably be attributed to differences in the assumptions. Put another way, all of the unconsidered auxiliary assumptions should be indistinguishable in the relevant way: that is, all have to be sufficiently equal or sufficiently right or sufficiently irrelevant so as not to matter to the final result.

What makes it allowable for a researcher to make the ceteris paribus assumption? In a strict philosophical sense, of course, it is not allowable. To see this, suppose that Researcher A has published an experiment, Researcher B has replicated it, but the replication failed. If Researcher A claims that Researcher B made a mistake in performing the replication, or just got unlucky, there is no way to disprove Researcher A's argument absolutely. But suppose that Researchers C, D, E, and F also attempt replications, and also fail. It becomes increasingly difficult to support the contention that Researchers B–F all “did it wrong” or were unlucky, and that we should continue to accept Researcher A's version of the experiment. Even if a million researchers attempted replications, and all of them failed, it is theoretically possible that Researcher A's version is the unflawed one and all the others are flawed. But most researchers would conclude (and in our view, would be right to conclude) that it is more likely that it is Researcher A who got it wrong and not the million researchers who failed to replicate the observation. Thus, we are not arguing that replications, whether successful or not, are definitive. Rather, our argument is that replications (of sufficient quality) are informative.

Introducing a bayesian framework

To see why this is the case, we shall employ a Bayesian framework similar to Trafimow ( 2010 ). Suppose that an aficionado of Researcher A believes that the prior probability of anything Researcher A said or did is very high. Researcher B attempts a replication of an experiment by Researcher A and fails. The aficionado might continue confidently to believe in Researcher A's version, but the aficionado's confidence likely would be decreased slightly. Well then, as there are more replication failures, the aficionado's confidence would continue to decrease accordingly, and at some point the decrease in confidence would push the aficionado's confidence below the 50% mark, in which case the aficionado would put more credence in the replication failures than on the success obtained by Researcher A.

In the foregoing scenario, we would want to know the probability that the original result is actually true given Researcher B's replication failure, p(T|F). As Equation (1) shows, this depends on the aficionado's prior level of confidence that the original result is true, p(T), the probability of failing to replicate given that the original result is true, p(F|T), and the overall probability of failing to replicate, p(F):

p(T|F) = p(T) × p(F|T) / p(F)    (1)

Alternatively, we could frame what we want to know in terms of a confidence ratio that the original result is true or not true given the failure to replicate, p(T|F)/p(~T|F). This would be a function of the aficionado's prior confidence ratio about the truth of the finding, p(T)/p(~T), and the ratio of the probabilities of failing to replicate given that the original result is true or not true, p(F|T)/p(F|~T). Thus, Equation (2) gives the posterior confidence ratio:

p(T|F)/p(~T|F) = [p(T)/p(~T)] × [p(F|T)/p(F|~T)]    (2)

Suppose that the aficionado is a very strong one, so that the prior confidence ratio is 50. In addition, the probability ratio pertaining to failing to replicate is 0.5. It is worthwhile to clarify two points about this probability ratio. First, we assume that the probability of failing to replicate is less if the original finding is true than if it is not true, so that the ratio ought to be substantially less than 1. Second, how much less than 1 this ratio will be depends largely on the quality of the replication; as the replication becomes closer to meeting the ideal ceteris paribus condition, the ratio will deviate increasingly from 1. Put more generally, as the quality of the auxiliary assumptions going into the replication attempt increases, the ratio will decrease. Given these two ratios of 50 and 0.5, the posterior confidence ratio is 25. Although this is a substantial decrease in confidence from 50, the aficionado still believes that the finding is extremely likely to be true. But suppose there is another replication failure and the probability ratio is 0.8. In that case, the new confidence ratio is (25)(0.8) = 20. The pattern should be clear here: As there are more replication failures, a rational person, even if that person is an aficionado of the original researcher, will experience continually decreasing confidence as the replication failures mount.

If we imagine that there are N attempts to replicate the original finding that fail, the process described in the foregoing paragraph can be summarized in a single equation that gives the ratio of posterior confidences in the original finding, given that there have been N failures to replicate. This is a function of the prior confidence ratio and the probability ratios for the first replication failure, the second replication failure, and so on:

p(T|F1,…,FN)/p(~T|F1,…,FN) = [p(T)/p(~T)] × [p(F1|T)/p(F1|~T)] × [p(F2|T)/p(F2|~T)] × … × [p(FN|T)/p(FN|~T)]    (3)

For example, staying with our aficionado with a prior confidence ratio of 50, imagine a set of 10 replication failures, with the following probability ratios: 0.5, 0.8, 0.7, 0.65, 0.75, 0.56, 0.69, 0.54, 0.73, and 0.52. The final confidence ratio, according to Equation (3), would be:

50 × 0.5 × 0.8 × 0.7 × 0.65 × 0.75 × 0.56 × 0.69 × 0.54 × 0.73 × 0.52 ≈ 0.54

Note the following. First, even with an extreme prior confidence ratio (we had set it at 50 for the aficionado), it is possible to overcome it with a reasonable number of replication failures, provided that the person tallying the replication failures is a rational Bayesian (and there is reason to think that those attempting the replications are sufficiently competent in the subject area and methods to be qualified to undertake them). Second, it is possible to go from a state of extreme confidence to one of substantial lack of confidence. To see this in the example, take the reciprocal of the final confidence ratio (0.54), which equals about 1.85. In other words, the Bayesian aficionado now believes that the finding is roughly 1.85 times as likely to be not true as true. If we imagine yet more failed attempts to replicate, it is easy to foresee that the belief that the original finding is not true could eventually become as strong as, or stronger than, the prior belief that the original finding is true.
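To make the updating arithmetic concrete, here is a minimal sketch of Equations (1–3) in code; the prior confidence ratio of 50 and the per-failure probability ratios are the same hypothetical values used in the aficionado example above, not values drawn from any real study.

```python
# A minimal sketch of the Bayesian confidence-ratio updating described above.
# The prior odds and the per-failure probability ratios p(F_i|T)/p(F_i|~T)
# are the hypothetical values from the aficionado example.

def posterior_confidence_ratio(prior_ratio, failure_ratios):
    """Multiply the prior odds p(T)/p(~T) by the probability ratio for each
    failed replication, giving the posterior odds that the finding is true."""
    ratio = prior_ratio
    for r in failure_ratios:
        ratio *= r
    return ratio

prior = 50.0  # the aficionado's prior odds that the original finding is true
failure_ratios = [0.5, 0.8, 0.7, 0.65, 0.75, 0.56, 0.69, 0.54, 0.73, 0.52]

posterior = posterior_confidence_ratio(prior, failure_ratios)
print(f"Posterior confidence ratio: {posterior:.2f}")       # ~0.54
print(f"Odds the finding is false:  {1 / posterior:.2f}")   # ~1.85
```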

In summary, auxiliary assumptions play a role, not only for original theory-testing experiments but also in replications—even in replications concerned only with the original finding and not with the underlying theory. A particularly important auxiliary assumption is the ever-present ceteris paribus assumption, and the extent to which it applies influences the “convincingness” of the replication attempt. Thus, a change in confidence in the original finding is influenced both by the quality and quantity of the replication attempts, as Equation (3) illustrates.

In presenting Equations (1–3), we reduced the theoretical content as much as possible, and more than is realistic in actual research 10 , in considering so-called “direct” replications. As the replications serve other purposes, such as “conceptual” replications, the amount of theoretical content is likely to increase. To link that theoretical content to the replication attempt, more auxiliary assumptions will become necessary. For example, in a conceptual replication of an experiment finding that attitudes influence behavior, the researcher might use a different attitude manipulation or a different behavior measure. How do we know that the different manipulation and measure are sufficiently theoretically unimportant that the conceptual replication really is a replication (i.e., a test of the underlying theory)? We need new auxiliary assumptions linking the new manipulation and measure to the corresponding constructs in the theory, just as an original set of auxiliary assumptions was necessary in the original experiment to link the original manipulation and measure to the corresponding constructs in the theory. Auxiliary assumptions always matter—and they should be made explicit so far as possible. In this way, it will be easier to identify where in the chain of assumptions a “breakdown” must have occurred, in attempting to explain an apparent failure to replicate.

Replication is not a silver bullet. Even carefully-designed replications, carried out in good faith by expert investigators, will never be conclusive on their own. But as Tsang and Kwan ( 1999 ) point out:

If replication is interpreted in a strict sense, [conclusive] replications or experiments are also impossible in the natural sciences.… So, even in the “hardest” science (i.e., physics) complete closure is not possible. The best we can do is control for conditions that are plausibly regarded to be relevant. (p. 763)

Nevertheless, “failed” replications, especially, might be dismissed by an original investigator as being flawed or “incompetently” performed—but this sort of accusation is just too easy. The original investigator should be able to describe exactly what parameters she sees as being theoretically relevant, and under what conditions her “effect” should obtain. If a series of replications is carried out, independently by different labs, and deliberately tailored to the parameters and conditions so described—yet they reliably fail to produce the original result—then this should be considered informative . At the very least, it will suggest that the effect is sensitive to theoretically-unspecified factors, whose specification is sorely needed. At most, it should throw the existence of the effect into doubt, possibly justifying a shift in research priorities. Thus, while “falsification” can in principle be avoided ad infinitum, with enough creative effort by one who wished to defend a favored theory, scientists should not seek to “rescue” a given finding at any empirical cost 11 . Informative replications can reasonably factor into scientists' assessment about just what that cost might be; and they should pursue such replications as if the credibility of their field depended on it. In the case of experimental social psychology, it does.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Thanks are due to Anna Alexandrova for feedback on an earlier draft.

1 There are two steps to understanding this idea. First, because the foundational theories are so insecure, and the field's findings so under dispute, the "correct" empirical outcome of a given experimental design is unlikely to have been firmly established. Second, and insofar as the first step applies, the standard by which to judge whether a replication has been competently performed is equally unavailable—since that would depend upon knowing the "correct" outcome of just such an experiment. Thus, a "competently performed" experiment is one that produces the "correct" outcome; while the "correct" outcome is defined by whatever it is that is produced by a "competently performed" experiment. As Collins ( 1985 ) states: "Where there is disagreement about what counts as a competently performed experiment, the ensuing debate is coextensive with the debate about what the proper outcome of the experiment is" (p. 89). This is the infamously circular experimenter's regress . Of course, a competently performed experiment should produce satisfactory (i.e., meaningful, useful) results on "outcome neutral" tests.

2 Assuming that it is a psychology experiment. Note that even if the “same” participants are run through the experiment one more time, they'll have changed in at least one essential way: they'll have already gone through the experiment (opening the door for practice effects, etc.).

3 On Popper's view, one must set up a “falsifying hypothesis,” i.e., a hypothesis specifying how another experimenter could recreate the falsifying evidence. But then, Popper says, the falsifying hypothesis itself should be severely tested and corroborated before it is accepted as falsifying the main theory. Interestingly, as a reviewer has suggested, the distinction between a falsifying hypothesis and the main theory may also correspond to the distinction between direct vs. conceptual replications that we discuss in a later section. On this view, direct replications (attempt to) reproduce what the falsifying hypothesis states is necessary to generate the original predicted effect, whereas conceptual replications are attempts to test the main theory.

4 The percentages reported here are the geometric mean of self-admission rates, prevalence estimates by the psychologists surveyed, and prevalence estimates derived by John et al. from the other two figures.

5 Priming has been defined a number of different ways. Typically, it refers to the ability of subtle cues in the environment to affect an individual's thoughts and behavior, often outside of her awareness or control (e.g., Bargh and Chartrand, 1999 ).

6 Even more damning, Trafimow ( 2003 ; Trafimow and Rice, 2009 ; Trafimow and Marks, 2015 ) has argued that the standard significance tests used in psychology are invalid even when they are done “correctly.” Thus, even if psychologists were to follow the prescriptions of Simmons et al.—and reduce their researcher degrees of freedom (see the discussion following this footnote)—this would still fail to address the core problem that such tests should not be used in the first place.

7 For example, a “study found that eating disorder patients were significantly more likely than others to see frogs in a Rorschach test, which the author interpreted as showing unconscious fear of oral impregnation and anal birth…” (Giner-Sorolla, 2012 , p. 562).

8 Asendorpf et al. ( 2013 ) explain why this is so: “[direct] replicability is a necessary condition for further generalization and thus indispensible for building solid starting points for theoretical development. Without such starting points, research may become lost in endless fluctuation between alternative generalization studies that add numerous boundary conditions but fail to advance theory about why these boundary conditions exist” (p. 140, emphasis added).

9 There may be other reasons why the “failed” replication by Doyen et al. should not be considered conclusive, of course; for further discussion see, e.g., Lieberman ( 2012 ).

10 Indeed, we have presented our analysis in this section in abstract terms so that the underlying reasoning could be seen most clearly. However, this necessarily raises the question of how to go about implementing these ideas in practice. As a reviewer points out, to calculate probabilities, the theory being tested would need to be represented as a probability model; then in effect one would have Bayes factors to deal with. We note that both Dienes ( 2014 ) and Verhagen and Wagenmakers ( 2014 ) have presented methods for assessing the strength of evidence of a replication attempt (i.e., in confirming the original result) along these lines, and we refer the reader to their papers for further consideration.

11 As Doyen et al. ( 2014 , p. 28, internal references omitted) recently argued: “Given the existence of publication bias and the prevalence of questionable research practices, we know that the published literature likely contains some false positive results. Direct replication is the only way to correct such errors. The failure to find an effect with a well-powered direct replication must be taken as evidence against the original effect. Of course, one failed direct replication does not mean the effect is non-existent—science depends on the accumulation of evidence. But, treating direct replication as irrelevant makes it impossible to correct Type 1 errors in the published literature.”

  • Ajzen I., Fishbein M. (1980). Understanding Attitudes and Predicting Social Behavior. Englewood Cliffs, NJ: Prentice-Hall.
  • Asendorpf J. B., Conner M., De Fruyt F., De Houwer J., Denissen J. J., Fiedler K., et al. (2013). Replication is more than hitting the lottery twice. Euro. J. Person. 27, 108–119. 10.1002/per.1919
  • Bargh J. A., Chartrand T. L. (1999). The unbearable automaticity of being. Am. Psychol. 54, 462–479. 10.1037/0003-066X.54.7.462
  • Bargh J. A., Chen M., Burrows L. (1996). Automaticity of social behavior: direct effects of trait construct and stereotype activation on action. J. Person. Soc. Psychol. 71, 230–244. 10.1037/0022-3514.71.2.230
  • Bartlett T. (2013). Power of suggestion. Chron. High. Educ. Available online at: http://chronicle.com/article/Power-of-Suggestion/136907
  • Bem D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. J. Person. Soc. Psychol. 100, 407–425. 10.1037/a0021524
  • Billig M. (2013). Learn to Write Badly: How to Succeed in the Social Sciences. Cambridge: Cambridge University Press.
  • Brandt M. J., IJzerman H., Dijksterhuis A., Farach F. J., Geller J., Giner-Sorolla R., et al. (2014). The replication recipe: what makes for a convincing replication? J. Exp. Soc. Psychol. 50, 217–224. 10.2139/ssrn.2283856
  • Braude S. E. (1979). ESP and Psychokinesis: a Philosophical Examination. Philadelphia, PA: Temple University Press.
  • Carey B. (2011). Fraud case seen as a red flag for psychology research. N. Y. Times. Available online at: http://www.nytimes.com/2011/11/03/health/research/noted-dutch-psychologist-stapel-accused-of-research-fraud.html
  • Cartwright N. (1991). Replicability, reproducibility, and robustness: comments on Harry Collins. Hist. Pol. Econ. 23, 143–155. 10.1215/00182702-23-1-143
  • Cesario J. (2014). Priming, replication, and the hardest science. Perspect. Psychol. Sci. 9, 40–48. 10.1177/1745691613513470
  • Chow S. L. (1988). Significance test or effect size? Psychol. Bullet. 103, 105–110. 10.1037/0033-2909.103.1.105
  • Collins H. M. (1975). The seven sexes: a study in the sociology of a phenomenon, or the replication of experiments in physics. Sociology 9, 205–224. 10.1177/003803857500900202
  • Collins H. M. (1981). Son of seven sexes: the social destruction of a physical phenomenon. Soc. Stud. Sci. 11, 33–62. 10.1177/030631278101100103
  • Collins H. M. (1985). Changing Order: Replication and Induction in Scientific Practice. Chicago, IL: University of Chicago Press.
  • Cross R. (1982). The Duhem-Quine thesis, Lakatos and the appraisal of theories in macroeconomics. Econ. J. 92, 320–340. 10.2307/2232443
  • Danzinger K. (1997). Naming the Mind. London: Sage.
  • Dienes Z. (2014). Using Bayes to get the most out of non-significant results. Front. Psychol. 5:781. 10.3389/fpsyg.2014.00781
  • Doyen S., Klein O., Pichon C. L., Cleeremans A. (2012). Behavioral priming: it's all in the mind, but whose mind? PLoS ONE 7:e29081. 10.1371/journal.pone.0029081
  • Doyen S., Klein O., Simons D. J., Cleeremans A. (2014). On the other side of the mirror: priming in cognitive and social psychology. Soc. Cognit. 32, 12–32. 10.1521/soco.2014.32.supp.12
  • Duhem P. (1954). The Aim and Structure of Physical Theory. Transl. P. P. Wiener. Princeton, NJ: Princeton University Press.
  • Earp B. D. (2011). Can science tell us what's objectively true? New Collect. 6, 1–9. Available online at: https://www.academia.edu/625642/Can_science_tell_us_whats_objectively_true
  • Earp B. D. (2015). Does religion deserve a place in secular medicine? J. Med. Ethics. E-letter. Available online at: https://www.academia.edu/11118590/Does_religion_deserve_a_place_in_secular_medicine 10.1136/medethics-2013-101776
  • Earp B. D., Darby R. J. (2015). Does science support infant circumcision? Skeptic 25, 23–30. Available online at: https://www.academia.edu/9872471/Does_science_support_infant_circumcision_A_skeptical_reply_to_Brian_Morris
  • Earp B. D., Everett J. A. C. (2013). Is the N170 face-specific? Controversy, context, and theory. Neuropsychol. Trends 13, 7–26. 10.7358/neur-2013-013-earp
  • Earp B. D., Everett J. A. C., Madva E. N., Hamlin J. K. (2014). Out, damned spot: can the “Macbeth Effect” be replicated? Basic Appl. Soc. Psychol. 36, 91–98. 10.1080/01973533.2013.856792
  • Elms A. C. (1975). The crisis of confidence in social psychology. Am. Psychol. 30, 967–976. 10.1037/0003-066X.30.10.967
  • Ferguson C. J., Heene M. (2012). A vast graveyard of undead theories: publication bias and psychological science's aversion to the null. Perspect. Psychol. Sci. 7, 555–561. 10.1177/1745691612459059
  • Fishbein M. (1980). Theory of reasoned action: some applications and implications, in Nebraska Symposium on Motivation, 1979, eds Howe H., Page M. (Lincoln, NE: University of Nebraska Press), 65–116.
  • Fishbein M., Ajzen I. (1975). Belief, Attitude, Intention and Behavior: an Introduction to Theory and Research. Reading, MA: Addison-Wesley.
  • Folger R. (1989). Significance tests and the duplicity of binary decisions. Psychol. Bullet. 106, 155–160. 10.1037/0033-2909.106.1.155
  • Francis G. (2012). The psychology of replication and replication in psychology. Perspect. Psychol. Sci. 7, 585–594. 10.1177/1745691612459520
  • Giner-Sorolla R. (2012). Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspect. Psychol. Sci. 7, 562–571. 10.1177/1745691612457576
  • Gómez O. S., Juzgado N. J., Vegas S. (2010). Replications types in experimental disciplines, in Proceedings of the International Symposium on Empirical Software Engineering and Measurement (Bolzano: ESEM).
  • Greenwald A. G. (1975). Consequences of prejudice against the null hypothesis. Psychol. Bullet. 82, 1–20. 10.1037/h0076157
  • Ioannidis J. P. (2005). Why most published research findings are false. PLoS Med. 2:e124. 10.1371/journal.pmed.0020124
  • Ioannidis J. P. (2012a). Why science is not necessarily self-correcting. Perspect. Psychol. Sci. 7, 645–654. 10.1177/1745691612464056
  • Ioannidis J. P. (2012b). Scientific inbreeding and same-team replication: type D personality as an example. J. Psychosom. Res. 73, 408–410. 10.1016/j.jpsychores.2012.09.014
  • John L. K., Loewenstein G., Prelec D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532. 10.1177/0956797611430953
  • Jordan G. (2004). Theory Construction in Second Language Acquisition. Philadelphia, PA: John Benjamins.
  • Jost J. (2013). Introduction to: an Additional Future for Psychological Science. Perspect. Psychol. Sci. 8, 414–423. 10.1177/1745691613491270
  • Kepes S., McDaniel M. A. (2013). How trustworthy is the scientific literature in industrial and organizational psychology? Ind. Organi. Psychol. 6, 252–268. 10.1111/iops.12045
  • Koole S. L., Lakens D. (2012). Rewarding replications: a sure and simple way to improve psychological science. Perspect. Psychol. Sci. 7, 608–614. 10.1177/1745691612462586
  • Lakatos I. (1970). Falsification and the methodology of scientific research programmes, in Criticism and the Growth of Knowledge, eds Lakatos I., Musgrave A. (London: Cambridge University Press), 91–196.
  • Lakatos I. (1978). The Methodology of Scientific Research Programmes. Cambridge: Cambridge University Press.
  • LeBel E. P., Peters K. R. (2011). Fearing the future of empirical psychology: Bem's 2011 evidence of psi as a case study of deficiencies in modal research practice. Rev. Gen. Psychol. 15, 371–379. 10.1037/a0025172
  • Lieberman M. (2012). Does thinking of grandpa make you slow? What the failure to replicate results does and does not mean. Psychol. Today. Available online at: http://www.psychologytoday.com/blog/social-brain-social-mind/201203/does-thinking-grandpa-make-you-slow
  • Loscalzo J. (2012). Irreproducible experimental results: causes, (mis)interpretations, and consequences. Circulation 125, 1211–1214. 10.1161/CIRCULATIONAHA.112.098244
  • Lykken D. T. (1968). Statistical significance in psychological research. Psychol. Bullet. 70, 151–159. 10.1037/h0026141
  • Magee B. (1973). Karl Popper. New York, NY: Viking Press.
  • Makel M. C., Plucker J. A., Hegarty B. (2012). Replications in psychology research: how often do they really occur? Perspect. Psychol. Sci. 7, 537–542. 10.1177/1745691612460688
  • Meehl P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J. Consult. Clin. Psychol. 46, 806–834. 10.1037/0022-006X.46.4.806
  • Meehl P. E. (1990a). Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant using it. Psychol. Inquiry 1, 108–141. 10.1207/s15327965pli0102_1
  • Meehl P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychol. Reports 66, 195–244.
  • Mulkay M., Gilbert G. N. (1981). Putting philosophy to work: Karl Popper's influence on scientific practice. Philos. Soc. Sci. 11, 389–407. 10.1177/004839318101100306
  • Mulkay M., Gilbert G. N. (1986). Replication and mere replication. Philos. Soc. Sci. 16, 21–37. 10.1177/004839318601600102
  • Nosek B. A., the Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspect. Psychol. Sci. 7, 657–660. 10.1177/1745691612462588
  • Nosek B. A., Spies J. R., Motyl M. (2012). Scientific utopia II: restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631. 10.1177/1745691612459058
  • Pashler H., Wagenmakers E. J. (2012). Editors' introduction to the special section on replicability in psychological science: a crisis of confidence? Perspect. Psychol. Sci. 7, 528–530. 10.1177/1745691612465253
  • Polanyi M. (1962). Tacit knowing: its bearing on some problems of philosophy. Rev. Mod. Phys. 34, 601–615. 10.1103/RevModPhys.34.601
  • Popper K. (1959). The Logic of Scientific Discovery. London: Hutchison.
  • Quine W. V. O. (1980). Two dogmas of empiricism, in From a Logical Point of View, 2nd Edn., ed Quine W. V. O. (Cambridge, MA: Harvard University Press), 20–46.
  • Radder H. (1992). Experimental reproducibility and the experimenters' regress, in PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association (Chicago, IL: University of Chicago Press).
  • Rosenthal R. (1979). The file drawer problem and tolerance for null results. Psychol. Bullet. 86, 638–641. 10.1037/0033-2909.86.3.638
  • Schmidt S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev. Gen. Psychol. 13, 90–100. 10.1037/a0015108
  • Schnall S. (2014). Simone Schnall on her experience with a registered replication project. SPSP Blog. Available online at: http://www.spspblog.org/simone-schnall-on-her-experience-with-a-registered-replication-project/
  • Simmons J. P., Nelson L. D., Simonsohn U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366. 10.1177/0956797611417632
  • Smith N. C. (1970). Replication studies: a neglected aspect of psychological research. Am. Psychol. 25, 970–975. 10.1037/h0029774
  • Stroebe W., Postmes T., Spears R. (2012). Scientific misconduct and the myth of self-correction in science. Perspect. Psychol. Sci. 7, 670–688. 10.1177/1745691612460687
  • Trafimow D. (2003). Hypothesis testing and theory evaluation at the boundaries: surprising insights from Bayes's theorem. Psychol. Rev. 110, 526–535. 10.1037/0033-295X.110.3.526
  • Trafimow D. (2009). The theory of reasoned action: a case study of falsification in psychology. Theory Psychol. 19, 501–518. 10.1177/0959354309336319
  • Trafimow D. (2010). On making assumptions about auxiliary assumptions: reply to Wallach and Wallach. Theory Psychol. 20, 707–711. 10.1177/0959354310374379
  • Trafimow D. (2014). Editorial. Basic Appl. Soc. Psychol. 36, 1–2. 10.1080/01973533.2014.865505
  • Trafimow D., Marks M. (2015). Editorial. Basic Appl. Soc. Psychol. 37, 1–2. 10.1080/01973533.2015.1012991
  • Trafimow D., Rice S. (2009). A test of the NHSTP correlation argument. J. Gen. Psychol. 136, 261–269. 10.3200/GENP.136.3.261-270
  • Tsang E. W., Kwan K. M. (1999). Replication and theory development in organizational science: a critical realist perspective. Acad. Manag. Rev. 24, 759–780. 10.2307/259353
  • Van IJzendoorn M. H. (1994). A Process Model of Replication Studies: on the Relation between Different Types of Replication. Leiden University Library. Available online at: https://openaccess.leidenuniv.nl/bitstream/handle/1887/1483/168_149.pdf?sequence=1
  • Verhagen J., Wagenmakers E. J. (2014). Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. 143, 1457–1475. 10.1037/a0036731
  • Westen D. (1988). Official and unofficial data. New Ideas Psychol. 6, 323–331. 10.1016/0732-118X(88)90044-X
  • Yong E. (2012). A failed replication attempt draws a scathing personal attack from a psychology professor. Discover Magazine. Available online at: http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/#.VVGC-M6Gjds

Publication bias and the failure of replication in experimental psychology

  • Gregory Francis
  • Theoretical Review
  • Published: 04 October 2012
  • Psychonomic Bulletin & Review, Volume 19, pages 975–991 (2012)
  • DOI: 10.3758/s13423-012-0322-y


Replication of empirical findings plays a fundamental role in science. Among experimental psychologists, successful replication enhances belief in a finding, while a failure to replicate is often interpreted to mean that one of the experiments is flawed. This view is wrong. Because experimental psychology uses statistics, empirical findings should appear with predictable probabilities. In a misguided effort to demonstrate successful replication of empirical findings and avoid failures to replicate, experimental psychologists sometimes report too many positive results. Rather than strengthen confidence in an effect, too much successful replication actually indicates publication bias, which invalidates entire sets of experimental findings. Researchers cannot judge the validity of a set of biased experiments because the experiment set may consist entirely of type I errors. This article shows how an investigation of the effect sizes from reported experiments can test for publication bias by looking for too much successful replication. Simulated experiments demonstrate that the publication bias test is able to discriminate biased experiment sets from unbiased experiment sets, but it is conservative about reporting bias. The test is then applied to several studies of prominent phenomena that highlight how publication bias contaminates some findings in experimental psychology. Additional simulated experiments demonstrate that using Bayesian methods of data analysis can reduce (and in some cases, eliminate) the occurrence of publication bias. Such methods should be part of a systematic process to remove publication bias from experimental psychology and reinstate the important role of replication as a final arbiter of scientific findings.


Introduction

Imagine that you read about a set of 10 experiments that describe an effect (call it effect “A”). The experiments appear to be conducted properly, and across the set of experiments, the null hypothesis was rejected 9 times out of 10. Next, you read about another set of 19 experiments that describe effect “B.” Again, the experiments appear to be conducted properly, and the null hypothesis was rejected 10 times out of 19 experiments. I suspect most experimental psychologists would express stronger belief in effect A than in effect B. After all, effect A is so strong that almost every test was statistically significant, while effect B rejected the null hypothesis only about half of the time. Replication is commonly used to weed out false effects and verify scientific truth. Unfortunately, faith in replication is unfounded, at least as science is frequently practiced in experimental psychology.

Effect A is based on the series of experiments by Bem ( 2011 ) that reported evidence of people using Psi ability to gain knowledge from the future. Even after hearing about this study’s findings, most psychologists do not believe that people can get information from the future. This persistent disbelief raises serious questions about how people should and do interpret experimental findings in psychology. If researchers remain skeptical of a finding that has a 90% successful replication rate, it would seem that they should be skeptical of almost all findings in experimental psychology. If so, one must consider whether it is worthwhile to spend time and money on experiments that do not change anyone’s beliefs.

Effect B is based on a meta-analysis of a set of experiments that describe the bystander effect (Fischer et al., 2011 ), which is the empirical observation that people are less likely to help someone in distress if there are other people around. I do not know of anyone who doubts that the bystander effect is real, and it is frequently discussed in introductory psychology textbooks (e.g., Nairne, 2009 ). Given the poor replicability of this effect, the persistent belief in the phenomenon is curious.

Contrary to its central role in other sciences, it appears that successful replication is sometimes not related to belief about an effect in experimental psychology. A high rate of successful replication is not sufficient to induce belief in an effect (Bem, 2011 ), nor is a high rate of successful replication necessary for belief (Fischer et al., 2011 ). The insufficiency of replication is a very serious issue for the practice of science in experimental psychology. The scientific method is supposed to be able to reveal truths about the world, and the reliability of empirical findings is supposed to be the final arbiter of science; but this method does not seem to work in experimental psychology as it is currently practiced.

Counterintuitive as it may seem, I will show that the order of beliefs for effects A and B is rational based largely on the numbers of reported successful replications. In a field like psychology that depends on techniques of null hypothesis significance testing (NHST), there can be too much successful replication, as well as too little. Recognizing this property of the field is central to sorting out which effects should be believed or doubted.

Part of the problem is that replication’s ability to sort out the truth from a set of experiments is undermined by publication bias (Hedges & Olkin, 1985 ; Rosenthal, 1984 ). This bias may be due to selective reporting of results that are consistent with the desires of a researcher, or it may be due to the desires of editors and reviewers who want to publish only what they believe to be interesting findings. As is shown below, a publication bias can also be introduced by violating the procedures of NHST. Such a publication bias can suggest that a false effect is true and can overestimate the size of true effects. The following sections show how to detect publication bias, provide evidence that it contaminates experimental psychology, and describe how future studies can partly avoid it by adopting Bayesian data analysis methods.

A publication bias test

Ioannidis and Trikalinos ( 2007 ) described a test for whether a set of experimental findings contains an excess of statistically significant results. Because this test, which I call the publication bias test, is central to the present discussion, the method will be described in detail.

Judging whether repeated experiments provide compelling evidence for the validity of an effect requires considering the statistical power of those experiments. If all of the experiments have high power (the probability of rejecting the null hypothesis when it is false), multiple experiments that reject the null hypothesis will indeed be strong evidence for the validity of an effect. However, the proportion of times a set of experiments rejects the null hypothesis needs to reflect the underlying power of those experiments. Even populations with strong effects should have some experiments that do not reject the null hypothesis. Such null findings should not be interpreted as failures to replicate, because if the experiments are run properly and reported fully, such nonsignificant findings are an expected outcome of random sampling. As is shown below, some researchers in experimental psychology appear to misunderstand this fundamental characteristic of their science, and they engage in a misguided effort to publish more successful replications than are believable. If there are not enough null findings in a set of moderately powered experiments, the experiments were either not run properly or not fully reported. If experiments are not run properly or not reported fully, there is no reason to believe the reported effect is real.

The relationship between power and the frequency of significant results has been pointed out many times (e.g., Cohen, 1988; Gelman & Weakliem, 2009; Hedges, 1984; Sterling, Rosenbaum, & Weinkam, 1995), but there has not been a systematic way to test for violations of this relationship. The publication bias test identified by Ioannidis and Trikalinos (2007) uses the reported effect sizes to estimate the power of each experiment and then uses those power measures to predict how often one would expect to reject the null hypothesis. If the number of observed rejections is substantially larger than what was expected, the test indicates evidence for some kind of publication bias. In essence, the test is a check on the internal consistency of the number of reported rejections, the reported effect size, and the power of the tests to detect that effect size.

A difference between the expected and observed numbers of experiments that reject the null hypothesis can be analyzed with a χ² test:

χ² = (O − E)²/E + (O − E)²/(N − E),     (1)

where O and E refer to the observed and expected numbers of studies that reject the null hypothesis, and N is the total number of studies in the set of reported experiments. This analysis actually tests for both too many and too few observed rejections of the null hypothesis, relative to the expected number of rejections. The latter probabilities will be negligible in most situations considered here.
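
To make the comparison concrete, below is a minimal sketch in Python (not the article's own code; the article's power estimates were computed in R) of this χ² check, assuming O, E, and N have already been obtained as described above and treating the statistic as χ² with one degree of freedom, which is an assumption of this sketch. The example numbers are arbitrary.

```python
from scipy import stats

def excess_significance_test(O: int, E: float, N: int, criterion: float = 0.1):
    """Compare the observed number of rejections (O) with the number expected
    from the power analysis (E) across a set of N reported experiments."""
    chi2 = (O - E) ** 2 / E + (O - E) ** 2 / (N - E)
    p = stats.chi2.sf(chi2, df=1)          # assumed: one degree of freedom
    biased = (p <= criterion) and (O > E)  # flag only an excess of significant results
    return chi2, p, biased

# Arbitrary illustration: 12 rejections observed where power predicts about 5 of 20.
print(excess_significance_test(O=12, E=5.2, N=20))
```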

The observed number of rejections is easily counted as the number of reported experiments that reject the null hypothesis. The expected number of rejections is found by first estimating the effect size of a phenomenon across a set of experiments. This article will consider only mean-based effect sizes (such as Hedges’ g ), but a similar analysis could be used for correlation-based effect sizes. Meta-analytic methods (Hedges & Olkin, 1985 ) weight an experiment-wise effect size by its inverse variance to produce the pooled estimate of the effect size.
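
As a concrete illustration of the pooling step, the sketch below computes an inverse-variance weighted Hedges' g, using a common large-sample approximation for the variance of g. This is one standard way of implementing the Hedges and Olkin (1985) approach, not necessarily a transcription of the exact estimator used in the article, and the numbers are made up.

```python
import numpy as np

def pooled_hedges_g(g, n1, n2):
    """Inverse-variance weighted pooled Hedges' g for a set of two-sample experiments.

    Uses a common large-sample approximation for the variance of g:
        var(g) ~ (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    """
    g, n1, n2 = map(np.asarray, (g, n1, n2))
    var_g = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))
    weights = 1.0 / var_g                  # inverse-variance weights
    return float(np.sum(weights * g) / np.sum(weights))

# Made-up effect sizes and group sizes for five experiments.
print(pooled_hedges_g(g=[0.45, 0.20, 0.35, 0.10, 0.55],
                      n1=[20, 35, 28, 50, 18],
                      n2=[20, 35, 28, 50, 18]))
```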

With the pooled estimated effect size and the sample size(s) of each experiment, it is easy to determine the power of each experiment to reject the null hypothesis. Power is the complement of the type II error rate, β (the probability of failing to reject the null hypothesis when the effect size is not zero). For a set of N experiments with type II error values β_i, the expected number of times the set of experiments would reject the null hypothesis is

E = ∑_{i=1}^{N} (1 − β_i),     (2)

which simply adds up the power values across all of the experiments.
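
The sketch below shows one way to compute these power values for two-sample, two-tailed t tests using the noncentral t distribution, taking the pooled effect size as if it were the true standardized mean difference; the expected number of rejections is then just the sum of the powers. The sample sizes listed are arbitrary.

```python
import numpy as np
from scipy import stats

def ttest_power(effect_size: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Power of a two-sample, two-tailed t test for a standardized effect size."""
    df = n1 + n2 - 2
    ncp = effect_size * np.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)           # two-tailed critical value
    # Probability of landing in either rejection region under the alternative.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

def expected_rejections(effect_size: float, sample_sizes) -> float:
    """E = the sum of the power values across the experiments in the set."""
    return sum(ttest_power(effect_size, n1, n2) for n1, n2 in sample_sizes)

# Arbitrary example: six experiments, pooled effect size 0.3.
sizes = [(15, 15), (20, 20), (25, 25), (30, 30), (40, 40), (50, 50)]
print(expected_rejections(0.3, sizes))
```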

For relatively small experiment sets, an exact test can be used to compute the probability of getting an observed number of rejections (or more) for a given set of experiments. Imagine a binary vector, a = [a_1, …, a_N], that indicates whether each of N experiments rejects the null hypothesis (1) or not (0). Then the probability of a particular pattern of rejections and nonrejections can be computed from the power and type II error values:

P(a) = ∏_{i=1}^{N} (1 − β_i)^{a_i} β_i^{1 − a_i}.     (3)

The equation is simply the product of the power and type II error values for the experiments that reject the null hypothesis or fail to reject the null hypothesis, respectively. If every experiment rejects the null hypothesis, the term will simply be the product of all the power values. Likewise, if no experiment rejects the null hypothesis, the term will be the product of all the type II error values.

An exact test can then be used to compute the probability of the observed number of rejections, O, or more rejections. If the vectors that describe the different combinations of experiments are designated by an index, j, then the probability of a set of experiments having O or more rejections out of a set of N experiments is

P(O or more rejections) = ∑_{k=O}^{N} ∑_{j=1}^{NCk} P(a_{k,j}),     (4)

where NCk indicates N choose k, the number of different combinations of k rejections from a set of N experiments, and j indexes those different combinations. If all of the experiments in a reported set reject the null hypothesis, there is only one term under the summations, and Eq. 4 becomes the product of the power values.
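
A direct transcription of Eq. 4 is feasible for small experiment sets: enumerate every rejection pattern with O or more rejections and add up the pattern probabilities from Eq. 3. The sketch below does exactly that; the power values in the example are made up.

```python
from itertools import combinations
from math import prod

def prob_at_least(powers, O):
    """Probability that O or more of the N experiments reject the null hypothesis,
    given each experiment's power (1 - beta_i).  Mirrors Eqs. 3 and 4 directly;
    fine for modest N, since it enumerates all N-choose-k rejection patterns."""
    N = len(powers)
    betas = [1.0 - p for p in powers]
    total = 0.0
    for k in range(O, N + 1):
        for pattern in combinations(range(N), k):   # the j index of Eq. 4
            rejected = set(pattern)
            total += prod(powers[i] if i in rejected else betas[i] for i in range(N))
    return total

# Made-up power values: five reported experiments, all of which were significant.
print(prob_at_least([0.60, 0.55, 0.65, 0.58, 0.62], O=5))  # equals the product of the powers
```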

Following the standard logic of hypothesis testing, if the probability of the experiments is small under the hypothesis that there is no bias, there is evidence of publication bias in the set of experiments. It is not entirely clear what the appropriate criterion of “small” is for a publication bias test. Using the traditional .05 criterion seems odd, since the purpose of such a low criterion for NHST is to ensure that researchers do not mistakenly conclude that they have evidence for an effect that does not really exist. In contrast, when the probability is below the criterion for a publication bias test, one concludes that the set of experiments does not have evidence for an effect. Thus, the more conservative approach, relative to the conclusion about the existence of an empirical effect, is to use a larger criterion value. Of course, if the criterion is too large, the test will frequently report evidence of publication bias where it does not exist. As a compromise between these competing demands, tests of publication bias frequently use a criterion of .1 (Begg & Mazumdar, 1994; Ioannidis & Trikalinos, 2007; Sterne, Gavaghan, & Egger, 2000). As will be shown below, the publication bias test is quite conservative, so this criterion is often larger than the type I error rate of the test.

The following sections demonstrate how the publication bias test works by looking at simulated experiments. The advantage of starting with simulated experiments is that there is a known ground truth about the presence or absence of publication bias.

File drawer bias

A file drawer bias is based on the idea that researchers more frequently publish experimental findings that reject the null hypothesis (Rosenthal, 1984 ; Scargle, 2000 ; Sterling, 1959 ). The impact of such a bias on the interpretation of a set of experiments and the ability of the publication bias test to identify a file drawer bias were investigated with simulation studies.

In a simulated experiment of a two-sample t -test, a common sample size for each group was chosen randomly to be between 15 and 50. For a control sample, n 1 scores were drawn from a normal distribution with a mean of zero and a standard deviation of one. For an experimental sample, n 2 = n 1 scores were drawn from a normal distribution with a mean of 0.3 and a standard deviation of one. A two-sample, two-tailed t -test was then computed for the samples with α = .05. The experiment was repeated 20 times with different sample sizes and random samples, and for each experiment, an estimate of effect size was computed as Hedges g with a correction for bias in effect size (Del Re, 2010 ; Hedges & Olkin, 1985 ).
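
A compact sketch of this simulation is shown below. It assumes the usual small-sample correction factor J = 1 − 3/(4·df − 1) for Hedges' g, which may differ in minor details from the routine (Del Re, 2010) cited in the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_experiment(true_effect: float = 0.3):
    """One simulated two-sample experiment: returns (p value, bias-corrected Hedges' g, n per group)."""
    n = int(rng.integers(15, 51))                  # common group size between 15 and 50
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_effect, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)     # two-sample, two-tailed t test
    df = 2 * n - 2
    pooled_sd = np.sqrt(((n - 1) * control.var(ddof=1) + (n - 1) * treatment.var(ddof=1)) / df)
    d = (treatment.mean() - control.mean()) / pooled_sd
    g = d * (1 - 3 / (4 * df - 1))                 # small-sample bias correction
    return p, g, n

experiments = [simulate_experiment() for _ in range(20)]
print(sum(p <= 0.05 for p, _, _ in experiments), "of 20 experiments rejected the null hypothesis")
```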

Table  1 summarizes the key statistics for this set of experiments. The fourth column gives the estimated value of Hedges’ g for each experiment. Because of random sampling, it varies around the true effect size. The next three columns give estimates of experimental power (Champely, 2009 ; R Development Core Team, 2011 ) for three different effect sizes, which are given below the power values. The first power column is for the true effect size. The second power column is for a meta-analytic pooled effect size across all 20 experiments. This estimated effect size is quite close to the true effect size, so the power values for the true and pooled effect size columns are also quite similar. For each column, the sum of the power values is the expected number of times the null hypothesis would be rejected. Not surprisingly, these values are close to the observed five rejections out of 20 experiments.

The situation is different if the experiments are filtered by a file drawer publication bias. Suppose that the calculations above ignored the null findings (because they were not published) and, instead, were based only on the five experiments that rejected the null hypothesis. Using the effect size pooled from only those experiments, the penultimate column in Table  1 provides their power values. The experiments that reject the null hypothesis tend to have larger t values and experiment-wise effect sizes, so their pooled effect size is almost twice the true effect size. As a result, the estimated power of each experiment that rejects the null hypothesis is substantially larger than the true power. Thus, one side effect of a file drawer publication bias is that both the effect size and experimental power are overestimated (Hedges, 1984 ; Ioannidis, 2008 ; Lane & Dunlap, 1978 ). This overestimation undermines the publication bias test, but there is another effect that overcomes this problem. The sum of the overestimated power values is the expected number of times the null hypothesis would be rejected if the experiments were fully reported. Because the experiments were not reported fully, this sum includes only the five experiments that reject the null hypothesis. Since the power values from the unreported experiments are not included, the sum is substantially smaller than for the fully reported cases. In fact, even though the effect size and power values are overestimated, the calculations would lead one to expect that around three of the five experiments should reject the null hypothesis. The probability that five out of five experiments like these would reject the null hypothesis is found by multiplying the power values, and this gives .081, which is below the .1 criterion used for the publication bias test. Thus, the test correctly identifies that something is amiss with the biased set of experiments.

The probability of five rejections out of the 20 experiments can also be computed for the fully reported set of experiments using the true effect size and the pooled effect size. As the bottom row of Table  1 shows, the probability of five or more rejections out of the 20 experiments is around .4, which was measured using an exact test for the 1,042,380 different combinations of experiments that could have five or more rejections of the null hypothesis out of these 20 reported experiments.

The example in Table  1 demonstrates a single case where the publication bias test works well. It is still necessary to show that the test can generally discriminate experiment sets with a file drawer bias from experiment sets that do not contain any bias. Simulations validated the publication bias test by repeating the experiments in Table  1 , while varying the size of the experiment set being considered and the probability of null results being published.

Designate the size of the experiment set being tested for publication bias as M . Using the same parameters for each experiment as for the findings in Table  1 , 1,000 experiment sets of size M were simulated. The publication bias test was then applied to each of the experiment sets. In order to ease the computations, the χ 2 test for publication bias was used instead of the exact test (these different methods were verified to give essentially the same result for a subset of the experiment sets). If the p value for the χ 2 test was less than or equal to .1 and if the number of published experiments that rejected the null hypothesis was larger than expected, the experiment set was classified as having a publication bias. A publication bias was not reported for the relatively rare cases where the χ 2 test found evidence of fewer than expected statistically significant experiments, which tended to happen only when there was no bias against publishing null results. The ideal publication bias test has a low proportion of experiment sets indicating bias when the experiments are fully published (no bias) and a high proportion indicating bias when some (or all) of the experiments that fail to reject the null hypothesis are not published.

The simulations varied the size of the experiment set, M , between 10 and 80 in steps of 5. The simulations also varied the probability, q , of publishing an experiment that did not reject the null hypothesis. When q = 0, only the experiments that reject the null hypothesis are published and, thereby, considered in the publication bias test analysis. When q = 1, both significant and null findings are fully published, and there is no publication bias. The simulations also considered intermediate probabilities of publication, q = .125, .25, .5, which will produce experiment sets with differing amounts of publication bias.
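
Putting the pieces together, a rough sketch of this validation loop might look as follows: simulate M experiments, publish each null result with probability q, pool the effect size from the published set, and apply the χ² check. The constants, the approximation used for the variance of g, and the number of repetitions (reduced here for speed) are choices of this sketch, not the article's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ALPHA, CRITERION, TRUE_EFFECT = 0.05, 0.10, 0.3

def power(d, n):
    """Power of a two-sample, two-tailed t test with n per group and effect size d."""
    df, ncp = 2 * n - 2, d * np.sqrt(n / 2)
    tc = stats.t.ppf(1 - ALPHA / 2, df)
    return stats.nct.sf(tc, df, ncp) + stats.nct.cdf(-tc, df, ncp)

def one_experiment():
    """Simulate one experiment; return (p value, Hedges' g, n per group)."""
    n = int(rng.integers(15, 51))
    a, b = rng.normal(0, 1, n), rng.normal(TRUE_EFFECT, 1, n)
    _, p = stats.ttest_ind(b, a)
    df = 2 * n - 2
    sp = np.sqrt(((n - 1) * a.var(ddof=1) + (n - 1) * b.var(ddof=1)) / df)
    g = (b.mean() - a.mean()) / sp * (1 - 3 / (4 * df - 1))
    return p, g, n

def bias_detected(published):
    """Apply the chi-square publication bias test to a published experiment set."""
    gs = np.array([g for _, g, _ in published])
    ns = np.array([n for _, _, n in published])
    var_g = 2 / ns + gs ** 2 / (4 * ns)            # var(g) approximation for n1 = n2 = n
    pooled = np.sum(gs / var_g) / np.sum(1 / var_g)
    powers = np.array([power(pooled, n) for n in ns])
    O, E, N = sum(p <= ALPHA for p, _, _ in published), powers.sum(), len(published)
    chi2 = (O - E) ** 2 / E + (O - E) ** 2 / (N - E)
    return O > E and stats.chi2.sf(chi2, df=1) <= CRITERION

def detection_rate(M, q, reps=200):
    """Proportion of simulated experiment sets flagged as biased."""
    hits = 0
    for _ in range(reps):
        experiments = [one_experiment() for _ in range(M)]
        published = [e for e in experiments if e[0] <= ALPHA or rng.random() < q]
        if published:
            hits += bias_detected(published)
    return hits / reps

for q in (0.0, 0.25, 1.0):
    print(f"M = 20, q = {q}: bias reported in {detection_rate(20, q):.2f} of sets")
```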

Figure  1 plots the proportion of times that the test reports evidence of publication bias as a function of the number of experiments in the set. The different curves are for the different probabilities of publishing a null finding. The proportion of times the test falsely reported a publication bias when the experiment set actually published all experiments ( q = 1) was at most .02. The publication bias test does not have many false alarms when the data are fully reported.

Fig. 1 Results of simulated experiments exploring the ability of the publication bias test to discriminate biased and unbiased experiment sets. When all experiments are published (q = 1), the test almost never reports evidence of bias. When only significant experiments are published (q = 0), the proportion of times the test reports bias increases with the size of the experiment set. The values in parentheses are the average number of experiments that rejected the null hypothesis in the experiment set.

For experiment sets with bias, the test becomes ever more accurate as the number of experiments increases. Under an extreme file drawer bias ( q = 0), only the experiments that reject the null hypothesis are published, and for the parameters used in these simulations, that corresponds to around 2.2 rejections for every 10 experiments. These values are given in parentheses on the x -axis of Fig.  1 . As more null findings are published ( q > 0), the test becomes less likely to report the presence of a publication bias.

The test is very conservative for small experiment sets. The proportion of times the test detects publication bias does not reach 50% until the experiment set contains between 25 and 30 experiments. This corresponds to an average of between 5.5 and 6.6 reported experiments that reject the null hypothesis. Thus, if an extreme file drawer bias really is present, the test has a fairly low chance of detecting the bias if fewer than 5 experiments are reported. Even so, the low false alarm rate means that when evidence of publication bias is found, the test is very likely to be correct.

It might be tempting to dismiss the simulation findings as being purely theoretical because few researchers would purposely suppress null findings while reporting significant findings. However, there are at least two situations where this kind of behavior may be quite common for researchers who do not understand how methodological choices can produce this kind of bias (see also Simmons, Nelson, & Simonsohn, 2011 ).

Multiple measures

For some experiments, it is common to measure a wide variety of characteristics for each participant. For some areas of research, there are easily more than 20 such items, and researchers can run multiple statistical analyses but report only the findings that are statistically significant. This type of selective reporting is very similar to a file drawer publication bias. (It is not quite the same, because the measures often have dependencies.) Such an approach violates the basic principles of hypothesis testing and the scientific method, but I have heard more than one academic talk in which it appeared that this kind of selective reporting was being done.

Improper pilot studies

A single reported experiment that rejects the null hypothesis is sometimes the end result of a series of pilot experiments that failed to reject the null hypothesis. Oftentimes, this kind of pilot work is valid exploratory research that leads to a well-designed final study. If such final reported studies have large power values, the publication bias test will not report a problem. On the other hand, some researchers may believe that they are running various pilot studies, but they are really just repeating an experiment with minor variations until they happen to reject the null hypothesis. A set of such experiments with relatively low power will show the pattern exhibited by the biased reports in Table 1 and will likely be detected by the publication bias test.

Data-peeking bias

Traditional hypothesis testing requires a fixed sample size for its inferences to be valid. However, it is easy to violate this requirement and artificially increase the likelihood of rejecting the null hypothesis. Suppose a researcher runs a two-sample t -test by gathering data from control and experimental groups with 15 participants. If the difference between groups is statistically significant, the experiment stops, and the researcher reports the result. If the difference between groups is not statistically significant, the researcher gathers data from one more participant in each group and repeats the statistical analysis. This process can be iterated until the difference between groups is found to be statistically different or the experimenter gives up. Concerns about this kind of data peeking (also called optional stopping) have been raised many times (Berger & Berry, 1988 ; Kruschke, 2010a ; Simmons et al . , 2011 ; Strube, 2006 ; Wagenmakers, 2007 ), but many researchers seem unaware of how subversive it can be. If the null hypothesis is true, the type I error rate for experiments using this procedure for several iterations can dramatically increase from the expected .05, and it can effectively approach 1.0 if the experimenter is willing to gather enough data. There are many variations of this approach, and the increase in type I error produced by any of these approaches depends on a variety of experimental design factors, such as the number of initial participants, the number of participants added after each peek at the data, and the criteria for continuing or stopping the experiment. All of these approaches introduce a publication bias because they end up rejecting the null hypothesis more frequently than would happen if proper NHST rules were followed. An inflated rejection rate occurs even if the null hypothesis is false.

To explore the effects of publication bias due to data peeking, simulated experiments were generated with a data-peeking method. Each experiment started with a random sample of 15 scores from a control (mean of zero) and an experimental (mean of 0.3) normal distribution, each with a standard deviation of one. A two-tailed t-test was performed; the experiment was stopped and the result judged significant if p ≤ .05. If the result was not significant, one additional score from each distribution was added to the data set, and the analysis was run again. This process continued until the sample size reached 50 or the experiment rejected the null hypothesis.
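
A minimal sketch of this optional-stopping procedure is given below; the random seed and the summary printed at the end are incidental choices of the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def data_peeking_experiment(true_effect: float = 0.3, n_start: int = 15,
                            n_max: int = 50, alpha: float = 0.05):
    """Add one observation per group after each non-significant test until
    p <= alpha or the per-group sample size reaches n_max."""
    control = list(rng.normal(0.0, 1.0, n_start))
    treatment = list(rng.normal(true_effect, 1.0, n_start))
    while True:
        _, p = stats.ttest_ind(treatment, control)   # two-tailed t test at each peek
        if p <= alpha or len(control) >= n_max:
            return p, len(control)
        control.append(rng.normal(0.0, 1.0))
        treatment.append(rng.normal(true_effect, 1.0))

results = [data_peeking_experiment() for _ in range(20)]
print(sum(p <= 0.05 for p, _ in results), "of 20 experiments ended significant")
```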

Table  2 shows the statistical properties of 20 such experiments. The sample size values reflect the number of data points at the time the experiment ended. Unlike a file drawer bias, data peeking does not much alter the pooled effect size. For the particular data in Table  2 , the pooled estimated effect size from the fully published data set is 0.323, which is only slightly larger than the true effect size of 0.3.

However, the number of experiments that reject the null hypothesis is artificially high under data peeking. The last column of Table  2 lists the power of each experiment for the pooled effect size and the given sample sizes. The sum of these power values is E = 5.191, which is much smaller than the observed O = 12 significant findings. An exact test (across 263,950 different experiment combinations) reveals that the probability of finding 12 or more experiments that reject the null hypothesis out of 20 experiments with these power values is slightly less than .001. Thus, there is very strong evidence of a publication bias due to data peeking.

The final column of Table 2 shows the estimated power values related to a situation where data peeking was used for each study and a file drawer bias was added so that only the positive findings were reported. As was discussed for the file drawer bias in Table 1, the effect size and power values are substantially overestimated, but because fewer experiments are considered, one would expect only about E = 6.45 of the 12 reported experiments to reject the null hypothesis. The probability that all 12 reported experiments would reject the null hypothesis is the product of the power values, which is only .0003.

Table  3 summarizes a similar simulation analysis when the effect size is zero, which means that the null hypothesis is true. In the simulations, the experiment continued until the null hypothesis was rejected in the right-hand tail (evidence in the negative direction was ignored, under the idea that a researcher was looking for a particular direction of an effect). To be sure to get enough experiments that rejected the null hypothesis, the maximum number of data samples was increased to 100.

Because the null hypothesis was true for these simulated experiments, every rejection of the null hypothesis was a type I error. If the experiments were run properly (without data peeking), 1 out of the 20 experiments would be expected to reject the null hypothesis. In fact, with data peeking, the null hypothesis was rejected four times (type I error rate of 0.2). The pooled effect size is quite small (0.052) because most of the experiments that reject the null hypothesis have small sample sizes, as compared with the experiments that do not reject the null hypothesis, and large samples have more influence on the pooled effect size estimate than do small samples. With such a small effect size, the power values are also quite small, and the expected number of times these 20 experiments will reject the null hypothesis is only 1.284. An exact test considered all 1,047,225 combinations of ways that 4 or more experiments might reject the null hypothesis for these 20 experiments, and the probability of any of those combinations is only .036. Thus, there is clear evidence of publication bias.

The publication bias test continues to properly detect a problem when a file drawer bias is combined with a data-peeking bias under the situation where the null hypothesis is true. The penultimate column in Table  3 shows the power values for the pooled effect size that was computed from only those experiments that rejected the null hypothesis in a positive direction. As in the previous simulations with a file drawer bias, the effect size and power values are dramatically overestimated. Even so, the expected number of times these experiments should reject the null hypothesis is about half the actual number of reported rejections. The probability that four out of four such experiments would reject the null hypothesis is .047, which indicates publication bias.

It would be good to know how well the publication bias test discriminates between the presence and absence of a data-peeking bias. The performance of the publication bias test should improve as the number of experiments in the set increases and as the effect of peeking introduces more bias. The most biased case of data peeking comes when a single data point is added to each group after each peek at the data. With more data points added after each peek, there are fewer peeks (assuming a fixed upper number of data points), and each addition of a set of data points tends to be less variable. In new simulations, 10,000 simulated experiments like the one in Table  3 (where the null hypothesis is true) were created for each of five data-peeking cases with different numbers of data points added after each peek. From this pool of simulated experiments for a given peek addition number, M experiments were drawn at random, and the set was tested for publication bias (there was no file drawer bias, since experiments were fully reported). The test was repeated 1,000 times, and Fig.  2 plots the proportion of times the test reports evidence of publication bias as a function of the experiment set size, M . The different curves are for variations in the number of added data points after each peek.

Fig. 2 The results of simulated data-peeking experiments exploring the ability of the publication bias test to discriminate between biased and unbiased experiment sets. When there is no data peeking, the test rarely reports evidence of bias. In the most extreme form of data peeking, the test generally detects the bias more than 50% of the time.

The bottom curve (no peeking) verifies that if there is no data peeking (the sample size was always n 1 = n 2 = 15), it is rare for the test to report that publication bias was present. The maximum proportion of false alarms was .06 when the experiment set contained only 10 experiments. Just as for the file drawer bias, the publication bias test makes few false alarms.

All of the other curves show results for varying the number of data points added after each peek. The test more frequently detects bias in experiment sets with smaller numbers of data points added after each peek. In the strongest bias case (adding 1 data point for each group after each data peek), the publication bias test detects the bias better than 50% of the time even when the experiment set size is as small as 20. In the least biased case reported here (adding 40 data points to each group after each data peek), there are only three opportunities to test the data, and it is difficult for the test to detect such a bias.

Other sources of publication bias

One of the main strengths of the publication bias test is that it detects biases from a variety of different sources. Bias can be introduced by seemingly innocuous decisions at many different levels of research (Simmons et al., 2011). Given the high frequency of errors in reporting statistical findings (Bakker & Wicherts, 2011), researchers can introduce a publication bias by being very careful when the results are contrary to what was expected but not double-checking results that agree with their expectations. Likewise, data from a participant who performs poorly (relative to the experimenter's hopes) might be discarded if there is some external cause, such as noise in a nearby room; but data from a participant who happens to perform well under the same circumstances might be kept. The key property of the publication bias test is that many of these choices leave evidence of their influence by producing an excess of significant findings, relative to what the estimated effect sizes suggest should be found.

Overall, the publication bias test properly detects evidence for several different types of bias that produce an excess of significant results. One shortcoming of the test is that it is conservative, especially when the number of significant findings is small (less than five), because using the pooled estimated effect size from a set of published experiments gives a generous benefit of the doubt, which often leads to an overestimation of the power values when there really is bias. Thus, once evidence for publication bias is found, the magnitude of the bias is probably larger than what the test directly indicates.

Publication bias in experimental psychology

The previous section described the Ioannidis and Trikalinos ( 2007 ) test for publication bias and demonstrated that it works properly for simulated experiments. The test has a low false alarm rate and is conservative about reporting the presence of publication bias. In this section, the test is applied to several sets of studies in experimental psychology. The examples are chosen to highlight different properties of publication bias, but it is also valuable to know that a particular set of experiments has a publication bias, because those findings should be considered nonscientific and anecdotal.

Previous applications of the test have indicated publication bias in individual reports (Francis, 2012a , 2012b , in press ) and in a meta-analysis (Renkewitz, Fuchs, & Fiedler, 2011 ). Schimmack ( in press ) independently developed a similar analysis. Francis ( 2012b ) found evidence that the precognition studies of Bem ( 2011 ), which correspond to effect “A” in the introduction, are contaminated with publication bias. Thus, despite the high replication rate of those studies, the findings appear unbelievable. Below, this conclusion will be contrasted with a similar analysis of effect “B,” which had a low replication rate.

It is important to be clear that the indication of publication bias in a set of experiments does not necessarily imply that the reported effect is false. Rather, the conclusion is that the set of experiments does not provide scientific information about the validity of the effect. The proper interpretation of a biased set of experiments is a level of skepticism equivalent to what was held before the experiments were reported. Future studies (without bias) may be able to demonstrate that the effect is true or false.

Washing away your sins

Zhong and Liljenquist ( 2006 ) reported four experiments with evidence that cleansing activities were related to moral purity. For example, they showed that people subject to moral impurity (e.g., recalling unethical behavior in their past) were more likely than controls to complete a word fragment task with cleanliness-related words. Each of the four experiments rejected the null hypothesis, thereby providing converging evidence of a connection between moral purity and cleanliness.

A meta-analytic approach was used to compute the power of the experiments, as summarized in Table  4 . For the four experiments reported in Zhong and Liljenquist ( 2006 ), the pooled effect size is 0.698, and the penultimate column in Table  4 shows the power of the studies to detect this pooled effect size. The sum of the power values is E = 2.25, which is small, as compared with the O = 4 rejections. The product of the four power values is the probability that four experiments like these would all reject the null hypothesis if the studies were run properly and reported fully. This probability is .089, which is below the .1 threshold used to conclude evidence of publication bias. Thus, readers should be skeptical about the validity of this work.

Additional positive replications of the effect will alleviate the publication bias only if the effect size of the new experiments is much larger than in the original experiments. One might wonder if a failure to replicate the effect in new experiments might make evidence of publication bias disappear. This would set up a counterintuitive situation where a scientific effect becomes more believable if it fails to replicate in new experiments. Mathematically, such a situation seems possible, but it would occur only if the new experiments just barely fail to replicate the effect. The findings of the new experiments contribute to a new meta-analytic estimate of the effect size, and because the experiments do not reject the null hypothesis, their estimate of the effect size will usually be smaller than the previous estimate (assuming similar sample sizes). As a result, the pooled effect size will be smaller, and the power of all the experiments will take smaller values.

In fact, two of the findings of Zhong and Liljenquist ( 2006 ) did fail to replicate in a follow-up study (Fayard, Bassi, Bernstein, & Roberts, 2009 ). The replications reported small effect sizes, as shown in the last two rows of Table  4 . Since Fayard et al. used larger sample sizes, their estimates of effect size are weighted more heavily than any individual experiment from Zhong and Liljenquist. Indeed, the pooled effect size across all six experiments is 0.306, which is less than half the previous estimate. As is shown in the final column of Table  4 , the power of the experiments to reject the null hypothesis for this new estimated effect size is fairly small. The sum across the power values ( E = 1.62) is the expected number of times experiments like these six experiments would reject the null hypothesis if they were run properly and reported fully. An exact test considered all possible ways that O = 4 or more of the six experiments might reject the null hypothesis and computed the probability of each of the 22 possible combinations. The sum of these probabilities is only 0.036, which means there is still evidence of publication bias in this set of experiments. One cannot draw statistical conclusions from any particular pattern of results, but it is curious that in the reported findings, all of the low-powered experiments rejected the null hypothesis, while the moderately powered experiments failed to reject the null hypothesis.
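
The exact test can be implemented by enumerating every combination of rejecting experiments. The helper below is a minimal sketch of that enumeration; the power vector is a placeholder standing in for the final column of Table 4.

# Exact probability of observing O or more rejections among k experiments,
# given each experiment's power to detect the pooled effect size.
exact_excess_prob <- function(power, O) {
  k <- length(power)
  total <- 0
  for (m in O:k) {
    sets <- combn(k, m)                        # all ways to pick m rejecting experiments
    for (j in seq_len(ncol(sets))) {
      reject <- sets[, j]
      total  <- total + prod(power[reject]) * prod(1 - power[-reject])
    }
  }
  total
}

pow6 <- c(0.45, 0.30, 0.35, 0.40, 0.60, 0.55)  # placeholder power values
exact_excess_prob(pow6, O = 4)
# With the power values in the final column of Table 4, the 22 terms sum to .036.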

Bystander effect

One common view of science is that although any given research lab may have its own biases, these tend to be offset by the differing biases of other labs. Thus, it might seem that although publication bias could creep into a few studies on specific topics, the major, widely studied effects in experimental psychology would be largely immune to such bias. We can investigate this possibility by looking at a meta-analysis of one such effect.

Fischer et al. ( 2011 ) used meta-analytic techniques to compare the bystander effect for dangerous and nondangerous situations. The bystander effect is the observation that an individual is less likely to help someone in need if there are other people who might provide assistance. The bystander effect has been widely studied, and the meta-analysis reported 105 independent effect sizes that were based on more than 7,700 participants.

Fischer et al . ( 2011 ) broke down the investigations of the bystander effect by several different variables. One important classification was whether the context of the study involved an emergency or a nonemergency situation. (The nonemergency situation is effect “B” from the introduction.) Table  5 shows the results of a publication bias test for each of these two contexts. Given the large number of studies, it is not surprising that evidence for the bystander effect is not always found. Only 24 of the 65 experiments in the emergency situation rejected the null hypothesis (assuming two-tailed t -tests with α = .05), and only 10 of 19 experiments in the nonemergency situation rejected the null hypothesis. Fischer et al. identified some additional experiments with other contexts that are not considered here.

To know whether there is a publication bias in either of these experiment sets, one computes the power of each experiment for the pooled estimated effect size. For the nonemergency situation, the expected number of experiments that report evidence of the bystander effect (the sum of the power values across experiments) is just a bit less than 11, which is pretty close to the observed number of experiments that reject the null hypothesis. In contrast, the power analysis for the 65 experiments in the emergency situation expects that around 10 of the experiments should find evidence of the bystander effect (most of the experiments have fairly small sample sizes), which is much smaller than the 24 observed findings. The results of the χ 2 tests are shown in the bottom two rows of Table  5 .
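
For experiment sets this large, exact enumeration is impractical, so the comparison of O with E can be made with a chi-square statistic. The sketch below uses the standard two-cell goodness-of-fit form, which is my reading of the Ioannidis and Trikalinos test; the E values are the approximate expected counts quoted in the text rather than values taken from Table 5.

# Chi-square comparison of the observed (O) and expected (E) numbers of
# rejections among n experiments. Only an excess (O > E) is interpreted as
# evidence of publication bias.
excess_chisq <- function(O, E, n) {
  A <- (O - E)^2 * (1 / E + 1 / (n - E))   # two-cell goodness-of-fit statistic
  c(chisq = A, p = pchisq(A, df = 1, lower.tail = FALSE))
}

excess_chisq(O = 24, E = 10, n = 65)   # emergency studies: O far exceeds E
excess_chisq(O = 10, E = 11, n = 19)   # nonemergency studies: O close to E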

Even though fewer than half of the studies with an emergency situation reported evidence of the bystander effect, there is strong evidence of a publication bias, so researchers should be skeptical about the validity of this effect. It should be pointed out that this observation is consistent with the general ideas of Fischer et al. ( 2011 ), who argued that the bystander effect should be attenuated in emergency situations.

It is noteworthy that there is no evidence of a publication bias for the nonemergency situation studies. The nonemergency situation studies had power values ranging from 0.2 (for several experiments with only 24 participants) to nearly 1.0 (for one experiment with 2,500 participants). The key characteristic of this set of experiments is that the number of studies that reject the null hypothesis is generally consistent with the properties of the experiments and the pooled effect size. This self-consistency is missing in the experiment sets with publication bias.

Altered vision near the hands

The previous two investigations of published data sets were drawn from the subfield of social psychology. There may be different standards and experimental methods for different subfields, but there is no reason to believe that publication bias is a problem only for social psychology. Indeed, data sets that appear to be biased are readily found in mainstream publications for cognitive psychology. For example, Abrams, Davoli, Du, Knapp, and Paull ( 2008 ) presented evidence that visual processing is different when a person’s hands are near the stimuli. In one representative experiment, they noted that search times increased with display size more quickly when the hands were placed on either side of a computer monitor than when the hands were on the lap. In a set of five experiments, Abrams et al. explored this effect with several different experimental designs and concluded that the effect of hand position is due to an influence on disengagement of attention. Table  6 shows the effect sizes of this phenomenon across the five experiments, which all used a within-subjects design.

The pooled effect size across all five experiments is 0.543. The last column of Table  6 gives the power of each experiment to detect this pooled effect size if the experiments were run properly and reported fully. The sum of the power values across the five experiments is E = 2.84, and this expected number of rejections is notably smaller than the reported O = 5 rejections. Indeed, the probability that experiments like these would all reject the null hypothesis is the product of the power values, which is .048.

At least one follow-up experiment showed a similar pattern. Davoli and Abrams (2009) reported that the effect was found even with imagined placement of the hands. Their reported result (n = 16, g = 0.567) is very similar to those in Abrams et al. (2008), but the power of this experiment to detect the pooled effect size is only 0.53. Despite the relatively low power, Davoli and Abrams did reject the null hypothesis. If one considers this finding to be another replication of the effect, the probability of six such experiments all rejecting the null hypothesis is roughly .025. As long as the findings from Abrams et al. remain in the pooled set, further successful replications will only strengthen the evidence for publication bias, so researchers interested in this phenomenon are advised to set those findings aside and start over with new, unbiased investigations.
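
For these within-subjects designs, the power calculation uses a one-sample t test on the difference scores. The numbers below (n = 16 and the pooled effect size of 0.543) come from the text, so this small sketch simply reproduces the reported power of about .53.

# Power of a within-subjects (paired) design to detect the pooled effect size.
power_paired <- function(d, n, alpha = 0.05) {
  df    <- n - 1
  ncp   <- d * sqrt(n)
  tcrit <- qt(1 - alpha / 2, df)
  1 - pt(tcrit, df, ncp) + pt(-tcrit, df, ncp)
}

power_paired(0.543, 16)   # approximately 0.53 for Davoli and Abrams (2009)
# Multiplying this value by the product of the five power values in Table 6
# (.048) gives the joint probability of roughly .025 quoted above.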

Bayesian data analysis can reduce publication bias

The previous section makes it clear that the publication bias test is not just a theoretical curiosity. There is evidence for publication bias in articles and topics that are published in the top journals in the field and have been investigated by dozens of different labs. It is not yet clear how pervasive the problem might be, but such studies were not difficult to identify. A systematic investigation of publication bias across the field may be necessary in order to weed out the unbelievable findings.

Although no analytical technique is completely immune to all forms of publication bias, this section shows that Bayesian data analysis methods have features that allow them to avoid some forms of publication bias. These properties add to the already excellent motivations to analyze data with Bayesian methods (Dienes, 2011; Kruschke, 2010a; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, 2007).

There are several different approaches to Bayesian data analysis, but in relation to the effects of publication bias, they all have the properties described in this section. The most flexible Bayesian approach involves specifying a parametric model for a data set and then using Gibbs sampling (a Markov chain Monte Carlo method) to estimate the posterior distributions of the model parameters given the observed data. Readers interested in this approach are encouraged to look at Kruschke (2010b) for an introduction to how experimental psychologists can use these techniques.

Although flexible, the Gibbs sampling approach is computationally intensive. This is usually not a problem for analyzing any particular data set, but for the simulated experiments described below, the computations would be cumbersome. An alternative approach is to accept the assumptions appropriate for a traditional t -test and then compute a Bayes factor, which is the ratio of the probability of the observed data for the null and alternative hypotheses. The main advantage of this approach is that Rouder et al. ( 2009 ) derived a formula for computing the Bayes factor using a standard objective prior distribution that can be applied to t -tests. This formula requires only the sample size(s) and the t value that is used in traditional NHST approaches. A complete understanding of the calculations of the Bayes factor is not necessary for the following discussion, but interested readers should see Rouder et al. for details on the t -test and Rouder, Morey, Speckman, and Province ( in press ) for Bayesian analysis methods of ANOVA designs. An additional advantage of using the Bayes factor computations is that Rouder and Morey ( 2011 ) derived a meta-analytic version that pools information across multiple experiments. Again, the meta-analytic calculations require only the sample size(s) and t value for each experiment.
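
As a rough sketch of how such a Bayes factor can be obtained from only a t value and the sample size(s), one can integrate the noncentral t likelihood over a Cauchy prior on the standardized effect size; this route should agree with the Rouder et al. (2009) formula up to numerical error. The function names, the prior scale r = 1, and the simplified meta-analytic pooling below are my own illustrative choices, not code from the original analyses.

# Cauchy-prior (JZS-style) Bayes factor from a t value and sample size(s).
bf10_t <- function(t, n1, n2 = NULL, r = 1) {
  if (is.null(n2)) {            # one-sample or paired design
    N  <- n1
    df <- n1 - 1
  } else {                      # two-sample design
    N  <- n1 * n2 / (n1 + n2)
    df <- n1 + n2 - 2
  }
  # marginal likelihood of t when delta ~ Cauchy(0, r)
  alt  <- integrate(function(delta) {
            dt(t, df, ncp = delta * sqrt(N)) * dcauchy(delta, 0, r)
          }, -Inf, Inf)$value
  null <- dt(t, df)             # likelihood of t when delta = 0
  alt / null
}

bf10_t(t = 2.5, n1 = 20, n2 = 20)   # example call with a hypothetical t value

# Simplified analogue of the meta-analytic Bayes factor: a single common delta
# is assumed across experiments. t, N (effective sample sizes), and df are
# vectors with one entry per experiment. For large sets, work on the log scale
# to avoid numerical underflow.
meta_bf10 <- function(t, N, df, r = 1) {
  alt <- integrate(function(delta) {
           sapply(delta, function(d) prod(dt(t, df, ncp = d * sqrt(N))) * dcauchy(d, 0, r))
         }, -Inf, Inf)$value
  alt / prod(dt(t, df))
}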

As a ratio of probabilities given the null and alternative hypotheses, a Bayes factor of one corresponds to equal evidence for both hypotheses. It is arbitrary whether to place the probability corresponding to the null hypothesis in the numerator or denominator of the ratio. I have put the null hypothesis in the denominator and use the term BF 10 to indicate that this is the Bayes factor with hypothesis 1 (alternative) in the numerator and hypothesis 0 (null) in the denominator. A Bayes factor larger than one indicates evidence for the alternative hypothesis, while a Bayes factor less than one indicates evidence for the null hypothesis.

Bayesian analysts have identified thresholds that are conventionally used to characterize the evidence in favor of the null or alternative hypotheses (Kass & Raftery, 1995). When \( BF_{10} > \sqrt{10} \approx 3.16 \), there is said to be “substantial” evidence for the alternative hypothesis. Evidence values between 1 and 3.16 are said to be anecdotal or inconclusive. Other category thresholds include above 10 for “strong” evidence and above 100 for “decisive” evidence. A Bayesian analysis can also provide evidence for the null hypothesis, with thresholds set by the inverse of the alternative category boundaries. Thus, when \( BF_{10} < 0.316 \), there is said to be “substantial” evidence for the null hypothesis. Unlike NHST, none of these threshold categories are sacrosanct, and one frequently finds the “substantial” evidence thresholds rounded to 3 and 1/3. None of these choices will alter the general discussion below.

At the risk of offending my Bayesian colleagues, the discussion below is going to treat these thresholds with more respect than they deserve. Simulated experiments will investigate how file drawer and data-peeking publication biases affect the frequency of finding “substantial” evidence for the null and alternative hypotheses in a meta-analysis. From the viewpoint of a Bayesian analysis, this focus ignores a lot of useful information. Generally, a Bayesian data analysis is interested in the evidence itself, rather than identifying where the evidence falls within some quasi-arbitrary categories. Nevertheless, I hope to introduce the main issues to the majority of experimental psychologists, who I suspect will look for concepts analogous to “rejecting the null hypothesis,” and these Bayesian categories play a somewhat similar role. The main arguments apply equally well to consideration of evidence itself, without the categories.

The last column of Table  1 shows the results of a Bayesian analysis of the previously investigated set of 20 simulated experiments that were generated with a true effect size of 0.3. Two of the experiments report substantial evidence for the alternative hypothesis, and five experiments report substantial evidence for the null hypothesis. The remaining experiments do not provide substantial evidence for either the null or the alternative hypothesis. However, the meta-analytic Bayes factor that considers all of the experiments finds decisive evidence for the alternative hypothesis ( BF 10 = 44,285).

It might seem that there would be no reason to publish the inconclusive experiments, but they actually provide a lot of information. The meta-analytic Bayes factor based only on the seven experiments with substantial evidence for either of the hypotheses is BF 10 = 0.534, which is not convincing evidence for either hypothesis but slightly favors the null. Thus, introducing a file drawer publication bias radically alters the Bayesian interpretation of this set of findings. By not reporting the findings from inconclusive experiments, the meta-analysis would fail to recognize the decisive evidence for the effect. Indeed, pooling nonsignificant findings across a large set of experiments in order to extract significant effects is one of the primary motivations for meta-analytic approaches (Hedges & Olkin, 1985 ).

To explore the effects of publication bias further, the simulated experiment set was repeated 100 times. After gathering data from 20 experiments, the experiment set was subjected to one of four publication bias conditions, described below. The published data from the 20 experiments were then analyzed with the meta-analytic Bayes factor, and the output was classified as substantial evidence for the alternative hypothesis, substantial evidence for the null hypothesis, or inconclusive. The whole process was then repeated for experiments where the effect size equaled zero (the null hypothesis was true). Table  7 reports the number of times the meta-analytic Bayes factor reached each of the decisions.

When there was no publication bias (first-row pair), the meta-analytic Bayesian analysis almost always makes the correct decision, and it never reports substantial evidence for the wrong hypothesis. The second-row pair corresponds to a publication bias where only the experiments that reached a definitive conclusion (substantial evidence for the alternative or null hypothesis) were published. This is the most natural version of a file drawer bias, and it has little effect when the null hypothesis is true. However, this bias has a striking negative effect when the true effect size is 0.3. Frequently, the meta-analytically pooled evidence ends up being inconclusive or supportive of the null hypothesis. As was described above, the inconclusive experiments actually contain evidence for the alternative hypothesis, and ignoring that information with a publication bias sometimes leads to the wrong conclusion in the meta-analysis.
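
A stripped-down version of this second bias condition can be sketched as follows. It reuses bf10_t() and meta_bf10() from the sketch above; the per-group sample size of 30 is a placeholder, and the sketch is far simpler than the full set of conditions summarized in Table 7.

# Generate 20 two-sample experiments, "publish" only those whose individual
# Bayes factor is conclusive in either direction, and pool the published
# experiments with the meta-analytic Bayes factor.
simulate_file_drawer <- function(true_d = 0.3, n = 30, n_exp = 20) {
  tvals <- replicate(n_exp, {
    x <- rnorm(n, mean = true_d)
    y <- rnorm(n, mean = 0)
    unname(t.test(x, y, var.equal = TRUE)$statistic)
  })
  bf_each   <- sapply(tvals, bf10_t, n1 = n, n2 = n)
  published <- bf_each > sqrt(10) | bf_each < 1 / sqrt(10)   # the file drawer filter
  N_eff     <- rep(n * n / (n + n), sum(published))
  df        <- rep(2 * n - 2, sum(published))
  meta_bf10(tvals[published], N_eff, df)
}

# set.seed(1); replicate(10, simulate_file_drawer())
# Under the file drawer filter, the pooled Bayes factor tends to understate the
# evidence for the effect, in line with the pattern described in the text.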

The third bias condition supposes that a researcher does not publish any experiment that fails to provide evidence for a positive alternative hypothesis. It might seem that such a bias will lead to artificially enhanced evidence for the alternative hypothesis, and this is indeed the impact if the null hypothesis is true. The impact is not overwhelming, however, and the meta-analytically pooled decision is most often that the evidence is inconclusive. Interestingly, a similar effect is found when the alternative hypothesis is true. As compared with fully publishing all experiments, publishing only those experiments that find evidence for a positive alternative reduces the number of times the meta-analytic Bayes factor data analysis correctly concludes that the alternative hypothesis is true. Again, this is because the unreported experiments contain evidence for the alternative hypothesis and not reporting those findings reduces the ability of the meta-analysis to draw the proper conclusion.

Finally, the simulation considered the impact of the file drawer bias on the basis of the results of a traditional NHST t -test. An experiment was reported only if the t -test rejected the null hypothesis. This kind of bias has little impact on the meta-analytic Bayesian analysis if the alternative hypothesis is true, but it greatly diminishes the ability of the meta-analysis to correctly identify when the null hypothesis is true. Instead, the most common decision is that the data are inconclusive.

One main conclusion from the simulations of the meta-analytic Bayes factor data analysis is that a file drawer publication bias undermines the ability of experiments to demonstrate evidence for the alternative hypothesis. Bayesian analysis methods do not reduce the impact of a file drawer bias, but they make the problems with such a bias more obvious. If finding evidence for nonzero effects is of interest to researchers who use a Bayesian analysis, they should avoid publication bias. Moreover, researchers who introduce a publication bias will probably not be very productive, because a common outcome of such bias is to produce experiment sets with inconclusive meta-analysis findings.

A benefit of Bayesian data analysis is that it almost entirely avoids publication bias introduced by data peeking. The last column of Table  3 shows the Bayes factor data analysis for the experiments generated when the null hypothesis was true, but data peeking was used such that the experiment stopped when the null hypothesis was rejected with a positive mean difference. This type of data peeking exaggerates the NHST type I error rate (from 0.05 to 0.2), which leads to a publication bias.

However, Kerridge ( 1963 ) proved that the frequency of type I errors with a Bayesian analysis has an upper bound that depends on the criterion used to indicate evidence of an effect. As a result, Bayesian data analysis is mostly insensitive to the presence of data peeking. This insensitivity is evident in Table  3 , since not one of the experiments finds substantial evidence for the alternative hypothesis. In fact, 14 out of the 20 experiments (correctly) provide substantial evidence that the null hypothesis is true. When all of the experimental results are considered, BF 10 = 0.092, which is strong evidence for the null hypothesis.

In part, the insensitivity of the Bayesian data analysis to data peeking is due to a stricter criterion for evidence of the alternative hypothesis, relative to NHST approaches. But there are other factors as well. In an additional simulation, data peeking was implemented so that data points were added until the Bayes factor calculation gave a conclusive result for either the null or the alternative hypothesis. Table  8 shows that this approach most often leads to substantial evidence for the null hypothesis, and only one experiment reports substantial evidence for the alternative hypothesis. When pooled across all of the experiments, the meta-analytic Bayes factor is 0.145, which is substantial evidence for the null. An important characteristic of the Bayesian approach, as compared with NHST, is that an experiment can stop with substantial evidence for the alternative hypothesis or for the null hypothesis. Notably, the Bayesian data-peeking approach reaches a conclusion with fairly small sample sizes. Bayesian data analysis fits in well with the general idea that a researcher should gather data until a definitive answer is found.
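
A sketch of this kind of Bayesian stopping rule appears below. The starting sample size, the one-observation-per-group step, and the cap on the sample size are placeholder choices; the evidence thresholds are the “substantial” boundaries discussed earlier, and bf10_t() is the helper from the earlier sketch.

# Bayesian data peeking: after each new observation per group, recompute the
# Bayes factor and stop as soon as it is conclusive in either direction or an
# upper limit on the sample size is reached.
bayes_peek <- function(true_d = 0, n_start = 10, n_max = 100, r = 1) {
  x <- rnorm(n_start, mean = true_d)
  y <- rnorm(n_start, mean = 0)
  repeat {
    tval <- unname(t.test(x, y, var.equal = TRUE)$statistic)
    bf   <- bf10_t(tval, n1 = length(x), n2 = length(y), r = r)
    if (bf > sqrt(10) || bf < 1 / sqrt(10) || length(x) >= n_max) {
      return(list(n_per_group = length(x), bf10 = bf))
    }
    x <- c(x, rnorm(1, mean = true_d))
    y <- c(y, rnorm(1, mean = 0))
  }
}

# replicate(20, bayes_peek(true_d = 0)$bf10)
# When the null is true, most runs stop with evidence favoring the null.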

A Bayesian data analysis is even immune to a data-peeking publication bias that appears to be blatantly deceptive. Table  9 shows experimental data generated with a data-peeking process that continued adding one data point after each peek until the experiment found evidence for a positive alternative hypothesis or the upper limit to the number of data points was reached. In contrast to the data-peeking approach in Table  8 , which quickly converged on evidence for the null hypothesis, experiments searching for Bayesian evidence of the positive alternative hypothesis are mostly futile, because the null hypothesis is actually true in these simulations. Eighteen of the 20 experiments reach the upper limit of the number of data points and then report substantial evidence for the null hypothesis. One experiment stops very early and reports substantial evidence for the positive alternative hypothesis. The remaining experiment reaches the upper limit and reports an inconclusive Bayes factor. The meta-analytic Bayes factor for all experiments gives a value 0.067, which is strong evidence for the null hypothesis.

Finally, Table  10 summarizes the results of simulations that repeated the analyses above and reports on the final decision of the meta-analytic Bayesian analysis under different types of data peeking. The main finding is that data peeking does not introduce a substantial bias for a Bayesian meta-analysis.

Conclusions

Science is difficult, and anyone who believes that it is easy to gather good scientific data in a discipline like experimental psychology is probably doing it wrong. The various ways of doing it wrong undermine the ability of replication to verify experimental findings. The publication bias test proposed by Ioannidis and Trikalinos ( 2007 ) provides a means of identifying the influence of some of these mistakes, and it should be frequently used as a check on the tendency to report too many significant findings. Experimental psychologists can mitigate the influence of publication bias by using Bayesian data analysis techniques. With these improvements, experimental psychology can fully take advantage of replication to reveal scientific evidence about the world.

Some people may mistakenly believe that to avoid a publication bias, researchers must publish the outcome of every experiment. If taken too far, this view leads to the absurd idea that every experiment deserves to be published, even though it may be poorly conceived, designed, or executed. Likewise, some people may believe that all pilot experiments need to be reported to avoid a file drawer bias. Calls for a registry of planned experiments, against which reported outcomes could later be checked, follow this line of thinking (e.g., Banks & McDaniel, 2011; Munafò & Flint, 2010; Schooler, 2011). This approach would indeed address issues of publication bias, but at the cost of flooding the field with low-quality experiments. While well intentioned, such registries would probably introduce more trouble than they are worth.

It may seem counterintuitive, but an easier way to avoid a publication bias is to be more selective about which experiments to publish, rather than more liberal. The selection criteria must focus on the quality of the scientific investigation, rather than the findings (Greenwald, 1975 ; Sterling et al., 1995 ). Studies with unnecessarily small sample sizes should not be published. Pilot studies and exploratory research should not be published as scientific findings. The publication bias test is not going to report a problem with a set of selective experiments as long as the power of those reported experiments is high, relative to the number of reported experiments that reject the null hypothesis.

As was noted above, Bayesian hypothesis testing is better than NHST, but better still is an approach that largely abandons hypothesis testing. Hypothesis testing is good for making a decision, such as whether to pursue an area of research, so it has a role to play in drawing conclusions about pilot studies. The more important scientific work is to measure an effect (whether in substantive or standardized units) to a desired precision. When an experiment focuses on measurement precision and lets nature determine an effect’s magnitude, there is little motivation for publication bias. Better experiments give more precise estimates of effects, and a meta-analysis that pools effects across experiments allows for still more precise estimates. Moreover, one can practice data peeking (e.g., gather data until the confidence interval width for g is less than 0.5) with almost no bias for the final measurement magnitude. Such an approach can use traditional confidence interval construction techniques (Cumming, 2012 ) or Bayesian equivalents (Kruschke, 2010b ). The latter has advantages beyond the issues of publication bias.
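
A minimal sketch of such precision-targeted sampling is given below. The target width of 0.5 comes from the example in the text; the standard error formula is a common large-sample approximation for a standardized mean difference (without the small-sample correction), and the starting and step sizes are placeholders.

# Collect data until the approximate 95% confidence interval for the
# standardized effect size is narrower than the target width.
sample_to_precision <- function(true_d = 0.3, n_start = 10, step = 5, target_width = 0.5) {
  x <- rnorm(n_start, mean = true_d)
  y <- rnorm(n_start, mean = 0)
  repeat {
    n1 <- length(x); n2 <- length(y)
    sp <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
    g  <- (mean(x) - mean(y)) / sp
    se <- sqrt((n1 + n2) / (n1 * n2) + g^2 / (2 * (n1 + n2)))   # approximate SE
    if (2 * 1.96 * se < target_width) {
      return(list(n_per_group = n1, g = g, ci = g + c(-1.96, 1.96) * se))
    }
    x <- c(x, rnorm(step, mean = true_d))
    y <- c(y, rnorm(step, mean = 0))
  }
}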

The past year has been a difficult one for the field of psychology, with several high-profile cases of fraud. It is appropriate to be outraged when falsehoods are presented as scientific evidence. On the other hand, the large number of scientists who unintentionally introduce bias into their studies (John, Loewenstein, & Prelec, 2012 ) probably causes more harm than the fraudsters. As scientists, we need to respect, and explicitly describe, the uncertainty in our measurements and our conclusions. When considering publication, authors, reviewers, and editors need to know the difference between good and poor studies and be honest about the work. The publication bias test is available to identify cases where authors and the peer review process make mistakes.

Abrams, R. A., Davoli, C. C., Du, F., Knapp, W. H., III, & Paull, D. (2008). Altered vision near the hands. Cognition, 107, 1035–1047.

Bakker, M., & Wicherts, J. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678.

Banks, G. C., & McDaniel, M. (2011). The kryptonite of evidence-based I-O psychology. Industrial and Organizational Psychology, 4, 40–44.

Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–1101.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425.

Berger, J., & Berry, D. (1988). The relevance of stopping rules in statistical inference (with discussion). In S. S. Gupta & J. Berger (Eds.), Statistical decision theory and related topics IV (Vol. 1, pp. 29–72). New York: Springer.

Champely, S. (2009). pwr: Basic functions for power analysis. R package version 1.1.1. http://CRAN.R-project.org/package=pwr

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis . New York: Routledge.

Davoli, C. C., & Abrams, R. A. (2009). Reaching out with the imagination. Psychological Science, 20 (3), 293–295.

Del Re, A.C. (2010). compute.es: Compute Effect Sizes. R package version 0.2. http://CRAN.R-project.org/web/packages/compute.es/

Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.

Fayard, J. V., Bassi, A. K., Bernstein, D. M., & Roberts, B. W. (2009). Is cleanliness next to godliness? Dispelling old wives’ tales: Failure to replicate Zhong and Liljenquist (2006). Journal of Articles in Support of the Null Hypothesis, 6, 21–28.

Fischer, P., Krueger, J. I., Greitemeyer, T., Vogrincic, C., Kastenmüller, A., Frey, D., Heene, M., Wicher, M., & Kainbacher, M. (2011). The bystander-effect: A meta-analytic review on bystander intervention in dangerous and non-dangerous emergencies. Psychological Bulletin, 137, 517–537.

Francis, G. (2012a). The same old New Look: Publication bias in a study of wishful seeing. i-Perception, 3(3), 176–178.

Francis, G. (2012b). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156.

Francis, G. (in press). Publication bias in “Red, Rank, and Romance in Women Viewing Men” by Elliot et al. (2010). Journal of Experimental Psychology: General .

Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power. American Scientist, 97, 310–316.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20.

Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational Statistics, 9, 61–85.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis . San Diego, CA: Academic Press.

Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19, 640–648.

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kerridge, D. (1963). Bounds for the frequency of misleading Bayes inferences. Annals of Mathematical Statistics, 34, 1109–1110.

Kruschke, J. K. (2010a). Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1 (5), 658–676. doi: 10.1002/wcs.72

Kruschke, J. K. (2010b). Doing Bayesian Data Analysis: A Tutorial with R and BUGS . Academic Press/Elsevier Science.

Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31, 107–112.

Munafò, M. R., & Flint, J. (2010). How reliable are scientific studies? The British Journal of Psychiatry, 197, 257–258.

Nairne, J. S. (2009). Psychology (5th ed.). Belmont, CA: Thomson Learning.

R Development Core Team. (2011). R: A language and environment for statistical computing . R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/

Renkewitz, F., Fuchs, H. M., & Fiedler, S. (2011). Is there evidence of publication biases in JDM research? Judgment and Decision Making, 6, 870–881.

Rosenthal, R. (1984). Applied Social Research Methods Series, Vol. 6. Meta-analytic procedures for social research . Newbury Park, CA: Sage Publications.

Rouder, J. N., & Morey, R. D. (2011). A Bayes factor meta-analysis of Bem’s ESP claim. Psychonomic Bulletin & Review, 18, 682–689.

Rouder, J. N., Morey R. D., Speckman P. L., & Province J. M. (in press). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology .

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237.

Scargle, J. D. (2000). Publication bias: The “File-Drawer” problem in scientific inference. Journal of Scientific Exploration, 14 (1), 91–106.

Schimmack, U. (in press). The ironic effect of significant results on the credibility of multiple study articles. Psychological Methods .

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54, 30–34.

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Sterne, J. A., Gavaghan, D., & Egger, M. (2000). Publication and related bias in meta-analysis: Power of statistical tests and prevalence in the literature. Journal of Clinical Epidemiology, 53, 1119–1129.

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Zhong, C., & Liljenquist, K. (2006). Washing away your sins: Threatened morality and physical cleansing. Science, 313, 1451–1452.

Author information

Authors and Affiliations

Department of Psychological Sciences, Purdue University, 703 Third Street, West Lafayette, IN, 47906, USA

Gregory Francis

Corresponding author

Correspondence to Gregory Francis .

About this article

Francis, G. Publication bias and the failure of replication in experimental psychology. Psychon Bull Rev 19, 975–991 (2012). https://doi.org/10.3758/s13423-012-0322-y

Published: 04 October 2012

Issue Date: December 2012

DOI: https://doi.org/10.3758/s13423-012-0322-y

Keywords

  • Bayesian methods
  • Hypothesis testing
  • Meta-analysis
  • Publication bias
  • Replication

