
Reproducibility and Replicability in Science (2019)

Chapter 3: Understanding Reproducibility and Replicability

THE EVOLVING PRACTICES OF SCIENCE

Scientific research has evolved from an activity mainly undertaken by individuals operating in a few locations to many teams, large communities, and complex organizations involving hundreds to thousands of individuals worldwide. In the 17th century, scientists would communicate through letters and were able to understand and assimilate major developments across all of the emerging disciplines. In 2016—the most recent year for which data are available—more than 2,295,000 scientific and engineering research articles were published worldwide ( National Science Foundation, 2018e ).

In addition, the number of scientific and engineering fields and subfields of research is large and has greatly expanded in recent years, especially in fields that intersect disciplines (e.g., biophysics); more than 230 distinct fields and subfields can now be identified. The published literature is so voluminous and specialized that some researchers look to information retrieval, machine learning, and artificial intelligence techniques to track and apprehend the important work in their own fields.

Another major revolution in science came with the recent explosion of the availability of large amounts of data in combination with widely available and affordable computing resources. These changes have transformed many disciplines, enabled important scientific discoveries, and led to major shifts in science. In addition, the use of statistical analysis of data has expanded, and many disciplines have come to rely on complex and expensive instrumentation that generates and can automate analysis of large digital datasets.

Large-scale computation has been adopted in fields as diverse as astronomy, genetics, geoscience, particle physics, and social science, and has added scope to fields such as artificial intelligence. The democratization of data and computation has created new ways to conduct research; in particular, large-scale computation allows researchers to do research that was not possible a few decades ago. For example, public health researchers mine large databases and social media, searching for patterns, while earth scientists run massive simulations of complex systems to learn about the past, which can offer insight into possible future events.

Another change in science is an increased pressure to publish new scientific discoveries in prestigious and what some consider high-impact journals, such as Nature and Science. 1 This pressure is felt worldwide, across disciplines, and by researchers at all levels, but it is perhaps most acute for researchers at the beginning of their scientific careers who are trying to establish a strong scientific record to increase their chances of obtaining tenure at an academic institution and grants for future work. Tenure decisions have traditionally been made on the basis of the scientific record (i.e., published articles of important new results in a field) and have given added weight to publications in more prestigious journals. Competition for federal grants, a large source of academic research funding, is intense as the number of applicants grows at a rate higher than the increase in federal research budgets. These multiple factors create incentives for researchers to overstate the importance of their results and increase the risk of bias—either conscious or unconscious—in data collection, analysis, and reporting.

___________________

1 “High-impact” journals are viewed by some as those that possess high scores according to one of the several journal impact indicators such as CiteScore, Scimago Journal Ranking (SJR), Source Normalized Impact per Paper (SNIP)—which are available in Scopus—and Journal Impact Factor (IF), Eigenfactor (EF), and Article Influence Score (AIC)—which can be obtained from the Journal Citation Report (JCR).

In the context of these dynamic changes, the questions and issues related to reproducibility and replicability remain central to the development and evolution of science. How should studies and other research approaches be designed to efficiently generate reliable knowledge? How might hypotheses and results be better communicated to allow others to confirm, refute, or build on them? How can the potential biases of scientists themselves be understood, identified, and exposed in order to improve accuracy in the generation and interpretation of research results? How can intentional misrepresentation and fraud be detected and eliminated? 2

Over the past decades, researchers have proposed approaches to answering some of these questions. As early as the 1960s, Jacob Cohen surveyed psychology articles from the perspective of statistical power to detect effect sizes, an approach that launched many power surveys (also known as meta-analyses) in the social sciences in subsequent years ( Cohen, 1988 ).

Researchers in biomedicine have been focused on threats to the validity of results since at least the 1970s. In response, biomedical researchers developed a wide variety of approaches to address these threats, including an emphasis on randomized experiments with masking (also known as blinding), reliance on meta-analytic summaries over individual trial results, proper sizing and power of experiments, and the introduction of trial registration and detailed experimental protocols. Many of the same approaches have been proposed to counter shortcomings in reproducibility and replicability.
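One of the practices mentioned here, proper sizing and power of experiments, can be made concrete with a standard normal-approximation sample-size calculation for a two-group comparison. The sketch below is offered purely as an illustration; the effect sizes, significance level, and power are arbitrary choices, and the conventional "small" and "medium" labels follow Cohen's usage rather than anything in this report. It assumes SciPy is available.

```python
# Illustrative sample-size calculation for a two-group comparison using the
# usual normal approximation; effect size, alpha, and power are arbitrary.
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size_sd, alpha=0.05, power=0.80):
    """Participants per group to detect a difference of effect_size_sd standard deviations."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size_sd) ** 2)

print(n_per_group(0.5))   # a "medium" effect: roughly 63 per group
print(n_per_group(0.2))   # a "small" effect: roughly 393 per group
```

The point of such a calculation is the one made above: studies sized without regard to power are prone both to missing real effects and to producing unstable positive findings.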

Reproducibility and replicability as they relate to data and computation-intensive scientific work received attention as the use of computational tools expanded. In the 1990s, Jon Claerbout launched the “reproducible research movement,” brought on by the growing use of computational workflows for analyzing data across a range of disciplines ( Claerbout and Karrenbach, 1992 ). Minor mistakes in code can lead to serious errors in interpretation and in reported results; Claerbout’s proposed solution was to establish an expectation that data and code will be openly shared so that results could be reproduced. The assumption was that reanalysis of the same data using the same methods would produce the same results.

In the 2000s and 2010s, several high-profile journal and general media publications focused on concerns about reproducibility and replicability (see, e.g., Ioannidis, 2005 ; Baker, 2016 ), including the cover story in The Economist ( “How Science Goes Wrong,” 2013 ) noted above. These articles introduced new concerns about the availability of data and code and highlighted problems of publication bias, selective reporting, and misaligned incentives that cause positive results to be favored for publication over negative or nonconfirmatory results. 3 Some news articles focused on issues in biomedical research and clinical trials, which were discussed in the general media partly as a result of lawsuits and settlements over widely used drugs ( Fugh-Berman, 2010 ).

___________________

2 See Chapter 5 , Fraud and Misconduct, which further discusses the association between misconduct as a source of non-replicability, its frequency, and reporting by the media.

Many publications about reproducibility and replicability have focused on the lack of data, code, and detailed description of methods in individual studies or sets of studies. Several attempts have been made to assess non-reproducibility or non-replicability within a field, particularly in the social sciences (e.g., Camerer et al., 2018 ; Open Science Collaboration, 2015 ). In Chapters 4 , 5 , and 6 , we review in more detail these studies and analyses, efforts to improve reproducibility and replicability, and the factors that contribute to their absence. Before that discussion, we must clearly define these terms.

DEFINING REPRODUCIBILITY AND REPLICABILITY

Different scientific disciplines and institutions use the words reproducibility and replicability in inconsistent or even contradictory ways: What one group means by one word, the other group means by the other word. 4 These terms—and others, such as repeatability—have long been used in relation to the general concept of one experiment or study confirming the results of another. Within this general concept, however, no terminologically consistent way of drawing distinctions has emerged; instead, conflicting and inconsistent terms have flourished. The difficulties in assessing reproducibility and replicability are complicated by this absence of standard definitions for these terms.

In some fields, one term has been used to cover all related concepts: for example, “replication” historically covered all concerns in political science ( King, 1995 ). In many settings, the terms reproducible and replicable have distinct meanings, but different communities adopted opposing definitions ( Claerbout and Karrenbach, 1992 ; Peng et al., 2006 ; Association for Computing Machinery, 2018 ). Some have added qualifying terms, such as methods reproducibility, results reproducibility, and inferential reproducibility, to the lexicon ( Goodman et al., 2016 ). In particular, tension has emerged between the usage recently adopted in computer science and the way that researchers in other scientific disciplines have described these ideas for years ( Heroux et al., 2018 ).

___________________

3 One such outcome became known as the “file drawer problem”: see Chapter 5 ; also see Rosenthal (1979) .

4 For the negative case, both “non-reproducible” and “irreproducible” are used in scientific work and are synonymous.

In the early 1990s, investigators began using the term “reproducible research” for studies that provided a complete digital compendium of data and code to reproduce their analyses, particularly in the processing of seismic wave recordings ( Claerbout and Karrenbach, 1992 ; Buckheit and Donoho, 1995 ). The emphasis was on ensuring that a computational analysis was transparent and documented so that it could be verified by other researchers. While this notion of reproducibility is quite different from situations in which a researcher gathers new data in the hopes of independently verifying previous results or a scientific inference, some scientific fields use the term reproducibility to refer to this practice. Peng et al. (2006 , p. 783) referred to this scenario as “replicability,” noting: “Scientific evidence is strengthened when important results are replicated by multiple independent investigators using independent data, analytical methods, laboratories, and instruments.” Despite efforts to coalesce around the use of these terms, lack of consensus persists across disciplines. The resulting confusion is an obstacle in moving forward to improve reproducibility and replicability ( Barba, 2018 ).

In a review paper on the use of the terms reproducibility and replicability, Barba (2018) outlined three categories of usage, which she characterized as A, B1, and B2:

  • A: the two terms are used interchangeably, with no distinction drawn between them;
  • B1: “reproducibility” refers to regenerating results by reusing the original authors’ data and code, while “replicability” refers to obtaining consistent results with independently collected data;
  • B2: the assignments are reversed, so that “replicability” refers to rerunning the original authors’ digital artifacts and “reproducibility” refers to arriving at the same results with independently created artifacts.

B1 and B2 are thus in opposition to each other with respect to which term involves reusing the original authors’ digital artifacts of research (“research compendium”) and which involves independently created digital artifacts. Barba (2018) collected data on the usage of these terms across a variety of disciplines (see Table 3-1 ). 5

5 See also Heroux et al. (2018) for a discussion of the competing taxonomies between computational sciences (B1) and new definitions adopted in computer science (B2) and proposals for resolving the differences.

TABLE 3-1 Usage of the Terms Reproducibility and Replicability by Scientific Discipline

A: Political Science; Economics; Econometry; Epidemiology; Clinical Studies; Internal Medicine; Physiology (neurophysiology); Computational Biology; Biomedical Research; Statistics

B1: Signal Processing; Scientific Computing

B2: Microbiology, Immunology (FASEB); Computer Science (ACM)

NOTES: See text for discussion. ACM = Association for Computing Machinery, FASEB = Federation of American Societies for Experimental Biology. SOURCE: Barba (2018, Table 2) .

The terminology adopted by the Association for Computing Machinery (ACM) for computer science was published in 2016 as a system for badges attached to articles published by the society. The ACM declared that its definitions were inspired by the metrology vocabulary, and it associated the use of an original author’s digital artifacts with “replicability” and the development of completely new digital artifacts with “reproducibility.” These terminological distinctions contradict the usage in computational science, where reproducibility is associated with transparency and access to the author’s digital artifacts, as well as the usage in the social sciences, economics, clinical studies, and other domains, where replication studies collect new data to verify the original findings.

Regardless of the specific terms used, the underlying concepts have long played essential roles in all scientific disciplines. These concepts are closely connected to the following general questions about scientific results:

  • Are the data and analysis laid out with sufficient transparency and clarity that the results can be checked?
  • If checked, do the data and analysis offered in support of the result in fact support that result?
  • If the data and analysis are shown to support the original result, can the result reported be found again in the specific study context investigated?
  • Finally, can the result reported or the inference drawn be found again in a broader set of study contexts?

Computational scientists generally use the term reproducibility to answer just the first question—that is, reproducible research is research that is capable of being checked because the data, code, and methods of analysis are available to other researchers. The term reproducibility can also be used in the context of the second question: research is reproducible if another researcher actually uses the available data and code and obtains the same results. The difference between the first and the second questions is one of action by another researcher; the first refers to the availability of the data, code, and methods of analysis, while the second refers to the act of recomputing the results using the available data, code, and methods of analysis.

In order to answer the first and second questions, a second researcher uses data and code from the first; no new data or code are created by the second researcher. Reproducibility depends only on whether the methods of the computational analysis were transparently and accurately reported and whether those data, code, or other materials were used to reproduce the original results. In contrast, to answer question three, a researcher must redo the study, following the original methods as closely as possible and collecting new data. To answer question four, a researcher could take a variety of paths: choose a new condition of analysis, conduct the same study in a new context, or conduct a new study aimed at the same or similar research question.

For the purposes of this report and with the aim of defining these terms in ways that apply across multiple scientific disciplines, the committee has chosen to draw the distinction between reproducibility and replicability between the second and third questions. Thus, reproducibility includes the act of a second researcher recomputing the original results, and it can be satisfied with the availability of data, code, and methods that makes that recomputation possible. This definition of reproducibility refers to the transparency and reproducibility of computations: that is, it is synonymous with “computational reproducibility,” and we use the terms interchangeably in this report.
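Because reproducibility, as defined here, is a property of the recomputation itself, it lends itself to a mechanical check: rerun the shared code on the shared data and compare the recomputed results with those reported. The sketch below shows one minimal way such a check might be organized; the file name, the column name, the analysis function, and the tolerance are hypothetical placeholders and are not drawn from this report.

```python
# Minimal sketch of a computational reproducibility check (illustrative only).
# Assumes the original authors shared a data file and analysis code; the names
# "data.csv", "outcome", and the tolerance below are hypothetical.
import csv
import math

def analysis(rows):
    """Stand-in for the authors' shared analysis code: here, just a sample mean."""
    values = [float(r["outcome"]) for r in rows]
    return {"mean_outcome": sum(values) / len(values)}

def reproduce(data_path, reported, tol=1e-8):
    """Recompute results from the shared data and compare them with the reported values."""
    with open(data_path, newline="") as f:
        rows = list(csv.DictReader(f))
    recomputed = analysis(rows)
    return {k: math.isclose(recomputed[k], reported[k], rel_tol=tol) for k in reported}

# Example use (hypothetical reported value):
# print(reproduce("data.csv", {"mean_outcome": 4.271}))
```

A check of this kind addresses only the second question above; it says nothing about whether the same scientific result would be found again with new data.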

When a new study is conducted and new data are collected, aimed at the same or a similar scientific question as a previous one, we define it as a replication. A replication attempt might be conducted by the same investigators in the same lab in order to verify the original result, or it might be conducted by new investigators in a new lab or context, using the same or different methods and conditions of analysis. If this second study, aimed at the same scientific question but collecting new data, finds consistent results or can draw consistent conclusions, the research is replicable. If a second study explores a similar scientific question but in other contexts or populations that differ from the original one and finds consistent results, the research is “generalizable.” 6

In summary, after extensive review of the ways these terms are used by different scientific communities, the committee adopted specific definitions for this report.

CONCLUSION 3-1: For this report, reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with “computational reproducibility,” and the terms are used interchangeably in this report.

Replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Two studies may be considered to have replicated if they obtain consistent results given the level of uncertainty inherent in the system under study. In studies that measure a physical entity (i.e., a measurand), the results may be the sets of measurements of the same measurand obtained by different laboratories. In studies aimed at detecting an effect of an intentional intervention or a natural event, the results may be the type and size of effects found in different studies aimed at answering the same question. In general, whenever new data are obtained that constitute the results of a study aimed at answering the same scientific question as another study, the degree of consistency of the results from the two studies constitutes their degree of replication.

Two important constraints on the replicability of scientific results rest in limits to the precision of measurement and the potential for altered results due to sometimes subtle variation in the methods and steps performed in a scientific study. We expressly consider both here, as they can each have a profound influence on the replicability of scientific studies.

PRECISION OF MEASUREMENT

Virtually all scientific observations involve counts, measurements, or both. Scientific measurements may be of many different kinds: spatial dimensions (e.g., size, distance, and location), time, temperature, brightness, colorimetric properties, electromagnetic properties, electric current, material properties, acidity, and concentration, to name a few from the natural sciences. The social sciences are similarly replete with counts and measures. With each measurement comes a characterization of the margin of doubt, or an assessment of uncertainty ( Possolo and Iyer, 2017 ). Indeed, it may be said that measurement, quantification, and uncertainties are core features of scientific studies.

___________________

6 The committee definitions of reproducibility, replicability, and generalizability are consistent with the National Science Foundation’s Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science ( Bollen et al., 2015 ).

One mark of progress in science and engineering has been the ability to make increasingly exact measurements on a widening array of objects and phenomena. Many of the things taken for granted in the modern world, from mechanical engines to interchangeable parts to smartphones, are possible only because of advances in the precision of measurement over time ( Winchester, 2018 ).

The concept of precision refers to the degree of closeness in measurements. As the unit used to measure distance, for example, shrinks from meter to centimeter to millimeter and so on down to micron, nanometer, and angstrom, the measurement unit becomes more exact and the proximity of one measurand to a second can be determined more precisely.

Even when scientists believe a quantity of interest is constant, they recognize that repeated measurement of that quantity may vary because of limits in the precision of measurement technology. It is useful to note that precision is different from the accuracy of a measurement system, as shown in Figure 3-1 , demonstrating the differences using an archery target containing three arrows.

[FIGURE 3-1: Archery targets illustrating (A) low accuracy and low precision, (B) low accuracy and high precision, (C) high accuracy and low precision, and (D) high accuracy and high precision.]

In Figure 3-1 , A, the three arrows are in the outer ring, not close together and not close to the bull’s eye, illustrating low accuracy and low precision (i.e., the shots have not been accurate and are not highly precise). In B, the arrows are clustered in a tight band in an outer ring, illustrating low accuracy and high precision (i.e., the shots have been more precise, but not accurate). The other two panels similarly illustrate high accuracy and low precision (C) and high accuracy and high precision (D).

It is critical to keep in mind that the accuracy of a measurement can be judged only in relation to a known standard of truth. If the exact location of the bull’s eye is unknown, one must not presume that a more precise set of measures is necessarily more accurate; the results may simply be subject to a more consistent bias, moving them in a consistent way in a particular direction and distance from the true target.
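The archery analogy can be made numerically concrete: accuracy corresponds to the distance between the average of repeated measurements and the true value, while precision corresponds to the spread of those measurements around their own average. The simulation below is a sketch with arbitrary numbers, not part of the report; it assumes NumPy is available.

```python
# Illustrative simulation of the four panels of Figure 3-1. Accuracy = closeness
# of the average measurement to the true value; precision = spread of repeated
# measurements around their own mean. All numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

scenarios = {
    "A: low accuracy, low precision":   dict(bias=2.0, sd=1.0),
    "B: low accuracy, high precision":  dict(bias=2.0, sd=0.1),
    "C: high accuracy, low precision":  dict(bias=0.0, sd=1.0),
    "D: high accuracy, high precision": dict(bias=0.0, sd=0.1),
}

for label, p in scenarios.items():
    x = rng.normal(true_value + p["bias"], p["sd"], size=1000)
    accuracy_error = abs(x.mean() - true_value)   # computable only because the truth is known
    precision_spread = x.std(ddof=1)              # computable from the data alone
    print(f"{label}: |mean - truth| = {accuracy_error:.2f}, spread = {precision_spread:.2f}")
```

Note that the accuracy column can be computed only because the simulation knows the true value; with real measurements and an unknown bull's eye, only the spread is directly observable, which is exactly the caution stated above.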

It is often useful in science to describe quantitatively the central tendency and degree of dispersion among a set of repeated measurements of the same entity and to compare one set of measurements with a second set. When a set of measurements is repeated by the same operator using the same equipment under constant conditions and close in time, metrologists refer to the proximity of these measurements to one another as measurement repeatability (see Box 3-1 ). When one is interested in comparing the degree to which the set of measurements obtained in one study are consistent with the set of measurements obtained in a second study, the committee characterizes this as a test of replicability because it entails the comparison of two studies aimed at the same scientific question where each obtained its own data.

Consider, for example, a set of measurements of a physical constant obtained over time by a number of laboratories (see Figure 3-2 ). For each laboratory’s results, the figure depicts the mean observation (i.e., the central tendency) and the standard error of the mean, indicated by the error bars. The standard error is an indicator of the precision of the obtained measurements, where a smaller standard error represents higher precision. In comparing the measurements obtained by the different laboratories, notice that both the mean values and the degrees of precision (as indicated by the width of the error bars) may differ from one set of measurements to another.

[FIGURE 3-2: Means and standard errors of a physical constant as measured over time by different laboratories, including LAMPF, NPL-88, LNE-01, NIST-89, and NIST-97.]

We may now ask what is a central question for this study: How well does a second set of measurements (or results) replicate a first set of measurements (or results)? Answering this question, we suggest, may involve three components:

  • proximity of the mean value (central tendency) of the second set relative to the mean value of the first set, measured both in physical units and relative to the standard error of the estimate
  • similitude in the degree of dispersion in observed values about the mean in the second set relative to the first set
  • likelihood that the second set of values and the first set of values could have been drawn from the same underlying distribution

Depending on circumstances, one or another of these components could be more salient for a particular purpose. For example, two sets of measures could have means that are very close to one another in physical units, yet each could be measured so precisely that the difference between them is very unlikely to have arisen by chance. A second comparison may find means that are farther apart yet derived from more widely dispersed sets of observations, so that there is a higher likelihood that the difference in means could have been observed by chance. In terms of physical proximity, the first comparison is more closely replicated. In terms of the likelihood of being derived from the same underlying distribution, the second set is more highly replicated.

A simple visual inspection of the means and standard errors for measurements obtained by different laboratories may be sufficient for a judgment about their replicability. For example, in Figure 3-2 , it is evident that the bottom two measurement results have relatively tight precision and means that are nearly identical, so it seems reasonable these can be considered to have replicated one another. It is similarly evident that results from LAMPF (second from the top of reported measurements with a mean value and error bars in Figure 3-2 ) are better replicated by results from LNE-01 (fourth from top) than by measurements from NIST-89 (sixth from top). More subtle may be judging the degree of replication when, for example, one set of measurements has a relatively wide range of uncertainty compared to another. In Figure 3-2 , the uncertainty range from NPL-88 (third from top) is relatively wide and includes the mean of NIST-97 (seventh from top); however, the narrower uncertainty range for NIST-97 does not include the mean from NPL-88. Especially in such cases, it is valuable to have a systematic, quantitative indicator of the extent to which one set of measurements may be said to have replicated a second set of measurements, and a consistent means of quantifying the extent of replication can be useful in all cases.
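One simple quantitative indicator of the kind called for here, offered only as an illustration and not as a metric endorsed by the report, is the difference between two laboratories' means expressed in units of its combined standard error, which speaks mainly to the third component listed above (whether the two results could plausibly share the same underlying value). The sketch below uses made-up numbers.

```python
# Illustrative consistency check between two laboratories' measurement results.
# m = mean, se = standard error of the mean; all numbers are hypothetical.
from math import sqrt, erf

def replication_z(m1, se1, m2, se2):
    """Difference between two means in units of its combined standard error."""
    return (m1 - m2) / sqrt(se1**2 + se2**2)

def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

lab1 = dict(m=10.043, se=0.012)   # hypothetical result from laboratory 1
lab2 = dict(m=10.078, se=0.015)   # hypothetical result from laboratory 2

z = replication_z(lab1["m"], lab1["se"], lab2["m"], lab2["se"])
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Values of |z| much larger than about 2 suggest the two results are unlikely to reflect the same underlying quantity, while small values indicate consistency given the reported uncertainties; other indicators would be needed to capture the first two components (closeness in physical units and similarity of dispersion).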

VARIATIONS IN METHODS EMPLOYED IN A STUDY

When closely scrutinized, a scientific study or experiment may be seen to entail hundreds or thousands of choices, many of which are barely conscious or taken for granted. In the laboratory, exactly what size of Erlenmeyer flask is used to mix a set of reagents? At what exact temperature were the reagents stored? Was a drying agent such as acetone used on the glassware? Which agent, and in what amount and exact concentration? Within what tolerance of error are the ingredients measured? When ingredient A was combined with ingredient B, was the flask shaken or stirred? How vigorously and for how long? What manufacturer of porcelain filter was used? If conducting a field survey, how, exactly, were the subjects selected? Are the interviews conducted by computer, over the phone, or in person? Are the interviewers female or male, young or old, of the same or a different race as the interviewee? What is the exact wording of a question? If spoken, with what inflection? What is the exact sequence of questions? Without belaboring the point, we can say that many of the exact methods employed in a scientific study may or may not be described in the methods section of a publication. An investigator may or may not realize when a possible variation could be consequential to the replicability of results.

In a later section, we will deal more generally with sources of non-replicability in science (see Chapter 5 and Box 5-2 ). Here, we wish to emphasize that countless subtle variations in the methods, techniques, sequences, procedures, and tools employed in a study may contribute in unexpected ways to differences in the obtained results (see Box 3-2 ).

Finally, note that a single scientific study may entail elements of the several concepts introduced and defined in this chapter, including computational reproducibility, precision in measurement, replicability, and generalizability, or any combination of these. For example, a large epidemiological survey of air pollution may entail portable, personal devices to measure concentrations of various pollutants in the air (subject to precision of measurement), very large datasets to analyze (subject to computational reproducibility), and a large number of choices in research design, methods, and study population (subject to replicability and generalizability).

RIGOR AND TRANSPARENCY

The committee was asked to “make recommendations for improving rigor and transparency in scientific and engineering research” (refer to Box 1-1 in Chapter 1 ). In response to this part of our charge, we briefly discuss the meanings of rigor and of transparency below and relate them to our topic of reproducibility and replicability.

Rigor is defined as “the strict application of the scientific method to ensure robust and unbiased experimental design” ( National Institutes of Health, 2018e ). Rigor does not guarantee that a study will be replicated, but conducting a study with rigor—with a well-thought-out plan and strict adherence to methodological best practices—makes it more likely. One of the assumptions of the scientific process is that rigorously conducted studies “and accurate reporting of the results will enable the soundest decisions” and that a series of rigorous studies aimed at the same research question “will offer successively ever-better approximations to the truth” ( Wood et al., 2019 , p. 311). Practices that indicate a lack of rigor, including poor study design, errors or sloppiness, and poor analysis and reporting, contribute to avoidable sources of non-replicability (see Chapter 5 ). Rigor affects both reproducibility and replicability.

Transparency has a long tradition in science. Since the advent of scientific reports and technical conferences, scientists have shared details about their research, including study design, materials used, details of the system under study, operationalization of variables, measurement techniques, uncertainties in measurement in the system under study, and how data were collected and analyzed. A transparent scientific report makes clear whether the study was exploratory or confirmatory, shares information about what measurements were collected and how the data were prepared, which analyses were planned and which were not, and communicates the level of uncertainty in the result (e.g., through an error bar, sensitivity analysis, or p-value). Only by sharing all this information might it be possible for other researchers to confirm and check the correctness of the computations, attempt to replicate the study, and understand the full context of how to interpret the results. Transparency of data, code, and computational methods is directly linked to reproducibility, and it also applies to replicability. The clarity, accuracy, specificity, and completeness in the description of study methods directly affects replicability.

FINDING 3-1: In general, when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated.



Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Academies of Sciences, Engineering, and Medicine. Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: Summary of a Workshop. Washington (DC): National Academies Press (US); 2016 Feb 29.


3 Conceptualizing, Measuring, and Studying Reproducibility

The second session of the workshop provided an overview of how to conceptualize, measure, and study reproducibility. Steven Goodman (Stanford University School of Medicine) and Yoav Benjamini (Tel Aviv University) discussed definitions and measures of reproducibility. Dennis Boos (North Carolina State University), Andrea Buja (Wharton School of the University of Pennsylvania), and Valen Johnson (Texas A&M University) discussed reproducibility and statistical significance. Marc Suchard (University of California, Los Angeles), Courtney Soderberg (Center for Open Science), and John Ioannidis (Stanford University) discussed assessment of factors affecting reproducibility. Mark Liberman (University of Pennsylvania) and Micah Altman (Massachusetts Institute of Technology) discussed reproducibility from the informatics perspective.

In addition to the references cited in this chapter, the planning committee would like to highlight the following background references: Altman et al. (2004) ; Begley (2013) ; Bernau et al. (2014) ; Berry (2012) ; Clayton and Collins (2014) ; Colquhoun (2014) ; Cossins (2014) ; Donoho (2010) ; Goodman et al. (1998) ; Jager and Leek (2014) ; Li et al. (2011) ; Liberati et al. (2009) ; Nosek et al. (2012) ; Peers et al. (2012) ; Rekdal (2014) ; Schooler (2014) ; Spence (2014) ; and Stodden (2013) .

DEFINITIONS AND MEASURES OF REPRODUCIBILITY

Steven Goodman, Stanford University School of Medicine

Steven Goodman began his presentation by explaining that defining and measuring reproducibility is difficult; thus, his goal was not to list every measure but rather to provide an ontology of ways to think about reproducibility and to identify some challenges going forward. When trying to define reproducibility, Goodman noted, one needs to distinguish among reproducibility, replicability, repeatability, reliability, robustness, and generalizability because these terms are often used in widely different ways. Similarly, related issues such as open science, transparency, and truth are often ill defined. He noted that although the word “reproducibility” is often used in place of “truth,” this association is often inaccurate.

Goodman mentioned a 2015 Academy Award acceptance speech in which Julianne Moore said, “I read an article that said that winning an Oscar could lead to living five years longer. If that's true, I'd really like to thank the Academy because my husband is younger than me.” According to Goodman, the article that Moore referenced ( Redelmeier and Singh, 2001 ) had analysis and data access issues. He noted that an examination of some of these issues was later added to the discussion section of the paper and when the study was repeated ( Sylvestre et al., 2006 ), no statistically significant added life expectancy was found for Oscar winners. In spite of the flawed original analysis, it is repeatedly referenced as fact in news stories. Goodman said this is a testament to how difficult it is to rectify the impact of nonreproducible findings.

Many philosophers and researchers have thought about the essence of truth over the past 100 years. William Whewell, Goodman noted, was one of the most important philosophers of science of the 19th century. He coined many common words such as scientist, physicist, ion, anode, cathode, and dielectric. He was an influential thinker who inspired Darwin, Faraday, Babbage, and John Stuart Mill. Goodman explained that Whewell also came up with the notion of consilience: that when a phenomenon is observed through multiple independent means, its validity may be greater than could be ascribed to any one of the individual inferences. This notion is used in the Bradford Hill criteria for causation ( Hill, 1965 ) and in a range of scientific teaching. Edward O. Wilson reenergized the notion of consilience in the late 1990s ( Wilson, 1998 ). Goodman explained that the consilience discussions are in many ways a precursor to the current discussions of what kind of reproducibility can be designed and what actually gets science closer to the truth.

Ronald Aylmer Fisher, Goodman explained, is one of the most important statisticians of the 20th century. Fisher wrote, “Personally, the writer prefers to set a low standard of significance at the 5 percent point. . . . A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” (1926). Goodman stated that if this were how science was practiced by even a small percentage of researchers, there would be significant progress in the realm of reproducibility.

Goodman noted that it is challenging to discuss reproducibility across the disciplines because there are both definitional and procedural differences among the various branches of science. He suggested that disciplines may cluster somewhat into groups with similar cultures as follows:

  • Clinical and population-based sciences (e.g., epidemiology, clinical research, social science),
  • Laboratory science (e.g., chemistry, physics, biology),
  • Natural world-based science (e.g., astronomy, ecology),
  • Computational sciences (e.g., statistics, applied mathematics, computer science, informatics), and
  • Psychology.

Statistics, however, works in all sciences, and most statisticians cross multiple domain boundaries. This makes them ideal ambassadors of methods and carriers of ideas across disciplines, giving them the scope to see commonalities where other people might not.

Goodman noted that while some mergers are occurring—such as among genomics, proteomics, and economics—a few key differences exist between these communities of disciplines:

  • Signal-measurement error ratio. When dealing with human beings, it is difficult to reduce the measurement error simply by increasing the sample size; in physics, however, engineering a better device can often reduce error (a small simulation after this list illustrates the distinction). The calibration is different across fields, and it results in a big difference in terms of what replication or reproducibility means.
  • Statistical methods and standards for claims.
  • Complexity of designs/measurement tools.
  • Closeness of fit between a hypothesis and experimental design/data. In psychology, if a researcher is studying the influence of patriotic symbols on election choices, there are many different kinds of experiments that would fit that broad hypothesis. In contrast, a biomedical researcher studying the acceptable dose of aspirin to prevent heart attacks is faced with a much narrower range of experimental design. The closeness of fit has tremendous bearing on what is considered a meaningful replication. While in many disciplines a language of direct replication, close replication, and conceptual replication exists, that language does not exist in biomedicine.
  • Culture of replication, transparency, and cumulative knowledge. In clinical research and trials, it is routine to do studies and add their results. If one study is significant but the next one is not, neither is disconfirmed; rather, all the findings are added and analyzed together. However, in other fields it would be seen as a crisis if one study shows no effect while another study shows a significant effect.
  • Purpose to which findings will be put (consequences of false claims). This is something that lurks in the background, especially with high-stakes research where lives are at risk. The bar has to be very high, as it is in clinical trials ( Lash, 2015 ).
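The contrast drawn in the first item of this list can be made concrete with a small simulation. The sketch below is illustrative only and is not part of the workshop material: it shows that increasing the sample size shrinks sampling error but leaves a fixed measurement bias untouched, whereas a better instrument (smaller bias) changes the answer. All numbers are arbitrary, and NumPy is assumed to be available.

```python
# Illustrative simulation: a larger sample reduces sampling error but does not
# remove a fixed measurement bias; a better instrument (smaller bias) does.
# Not from the workshop; all numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.50

def estimate(n, measurement_bias, noise_sd=1.0):
    """Mean of n noisy, possibly biased measurements of the true effect."""
    x = true_effect + measurement_bias + rng.normal(0.0, noise_sd, size=n)
    return x.mean()

for n in (50, 500, 5000, 50000):
    biased = estimate(n, measurement_bias=0.30)
    better = estimate(n, measurement_bias=0.0)
    print(f"n={n:6d}  biased instrument -> {biased:5.2f}   better instrument -> {better:5.2f}")
```

With these arbitrary settings, the biased estimates converge toward roughly 0.80 rather than the true 0.50 no matter how large n becomes, which is the sense in which more data cannot substitute for better measurement.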

Goodman highlighted some of the literature that describes how reproducible research and replications are defined. Peng et al. (2006) defined criteria for reproducible epidemiology research (see Table 3.1 ), which state that replication of results requires that the analytical data set, the computer code implementing the methods, and all of the metadata and documentation necessary to run that code be available, and that standard methods of distribution be used. Goodman explained that, in this case, reproducible research was research where you could see the data, run the code, and go from there to see if you could replicate the results or make adjustments. These criteria were applied to computational science a few years later ( Peng, 2011 ), as shown in Figure 3.1 . In this figure, the reproducibility spectrum progresses from code only, to code and data, to linked and executable code and data. Goodman noted that this spectrum stops short of “full replication,” and Peng stated that “the fact that an analysis is reproducible does not guarantee the quality, correctness, or validity of the published results” (2011).

TABLE 3.1 Criteria for Reproducible Epidemiologic Research.

FIGURE 3.1 Spectrum of reproducibility in computational research. SOURCE: Republished with permission of Science magazine, from Roger D. Peng, Reproducible research in computational science, Science 334(6060), 2011; permission conveyed through Copyright Clearance Center.

Goodman gave several examples to illustrate difficulties in reproducibility, including missing or misrepresented data in an analytical data set and limited access to important information not typically published or shared (such as clinical study reports in the case of Doshi et al. [2012] ).

The National Science Foundation's Subcommittee on Robust Research defined the following terms in their 2015 report on reproducibility, replicability, and generalization in the social, behavioral, and economic sciences:

  • Reproducibility. “The ability of a researcher to duplicate the results of a prior study using the same materials . . . as were used by the original investigator. . . . A second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis . . . [in an attempt to] yield the same results. . . . If the same results were not obtained, the discrepancy could be due to differences in processing of the data, differences in the application of statistical tools, differences in the operations performed by the statistical tools, accidental errors by an investigator, and other factors. . . . Reproducibility is a minimum necessary condition for a finding to be believable and informative” ( NSF, 2015 ).
  • Replicability. “The ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected . . . a failure to replicate a scientific finding is commonly thought to occur when one study documents [statistically significant] relations between two or more variables and a subsequent attempt to implement the same operations fails to yield the same [statistically significant] relations” ( NSF, 2015 ). Failure to replicate can occur when methods from either study are flawed or sufficiently different or when results are in fact statistically compatible in spite of differences in significance. Goodman questioned if this is an appropriate and sufficient definition.

Goodman referenced several recent and current reproducibility efforts, including the Many Labs Replication Project, aimed at replicating important findings in social psychology. He noted that replication in this context means repeating the experiment to see if the findings are the same. He also discussed the work of researchers from Bayer who are trying to replicate important findings in basic science for oncology; Prinz et al. stated the following:

To substantiate our incidental observations that published reports are frequently not reproducible with quantitative data, we performed an analysis of our early (target identification and validation) in-house projects in our strategic research fields of oncology, women's health and cardiovascular diseases that were performed over the past 4 years. . . . In almost two-thirds of the projects, there were inconsistencies between published data and in-house data that either considerably prolonged the duration of the target validation process or, in most cases resulted in termination of the projects because the evidence that was generated for the therapeutic hypothesis was insufficient to justify further investment in these projects. (2011)

Goodman pointed out that “reproducible” in this context denotes repeating the experiment. C. Glenn Begley and Lee Ellis found that only 6 of 53 landmark experiments in preclinical research were reproducible: “When findings could not be reproduced, an attempt was made to contact the original authors, discuss the discrepant findings, exchange reagents and repeat experiments under the author's direction, occasionally even in the laboratory of the original investigator” (2012). Sharon Begley wrote about C. Glenn Begley's experiences in a later Reuters article: “[C. Glenn] Begley met for breakfast at a cancer conference with the lead scientist of one of the problematic studies. ‘We went through the paper line by line, figure by figure,’ said [C. Glenn] Begley. ‘I explained that we re-did their experiment 50 times and never got their results.’ He said they'd done it six times and got this result once, but put it in the paper because it made the best story” (2012).

Goodman noted that the approach taken by Begley and Ellis is a valuable attempt to uncover the truth. He stressed that they did not just do a one-off attempt to replicate and compare the results; they tried every way possible to try to demonstrate the underlying phenomenon, even if that took attempting an experiment 50 times. This goes beyond replication and aims to reveal the true phenomenon; however, he noted that the same language is being used.

There is also no clear consensus about what constitutes reproducibility, according to Goodman. He quoted from a paper by C. Glenn Begley and John Ioannidis:

This has been highlighted empirically in preclinical research by the inability to replicate the majority of findings presented in high-profile journals. The estimates for irreproducibility based on these empirical observations range from 75% to 90%. . . . There is no clear consensus as to what constitutes a reproducible study. The inherent variability in biological systems means there is no expectation that results will necessarily be precisely replicated. So it is not reasonable to expect that each component of a research report will be replicated in perfect detail. However, it seems completely reasonable that the one or two big ideas or major conclusions that emerge from a scientific report should be validated and withstand close interrogation. (2015)

Goodman observed that in this text the terms “inability to replicate” and “irreproducibility” are used synonymously, and the words “reproducible,” “replicable,” and “validated” all relate to the process of getting at the truth of the claims.

Goodman also mentioned an article from Francis Collins and Larry Tabak (2014) on the importance of reproducibility:

A complex array of other factors seems to have contributed to the lack of reproducibility. Factors include poor training of researchers in experimental design; increased emphasis on making provocative statements rather than presenting technical details; and publications that do not report basic elements of experimental design. . . . Some irreproducible reports are probably the result of coincidental findings that happened to reach statistical significance, coupled with publication bias. Another pitfall is the overinterpretation of creative “hypothesis-generating” experiments, which are designed to uncover new avenues of inquiry rather than to provide definitive proof for any single question. Still, there remains a troubling frequency of published reports that claim a significant result, but fail to be reproducible.

Goodman summarized the three meanings of “reproducibility” as follows:

Reproducing the processes of investigations : looking at what a study did and determining if it is clear enough to know how to interpret results, which is particularly relevant in computational domains. These processes include factors such as transparency and reporting, and key design elements (e.g., blinding and control groups).

Reproducing the results of investigations : finding the same evidence or data, with the same strength.

Reproducing the interpretation of results : reaching the same conclusions or inferences based on the results. Goodman noted that there are many cases in epidemiology and clinical research in which an investigation leads to a significant finding, but no associated claims or interpretations are stated, other than to assert that the finding is interesting and should be examined further. He does not think this is necessarily a false finding, and it may be a proper conclusion.

Goodman offered a related parsing of the goals of replication from Bayarri and Mayoral (2002) :

  • Goal 1: Reduction of random error. Conclusions are based on a combined analysis of (original and replicated) experiments.
  • Goal 2: Validation (confirmation) of conclusions. Original conclusions are validated through reproduction of similar conclusions by independent replications.
  • Goal 3: Extension of conclusions. The extent to which the conclusions are still valid is investigated when slight (or moderate) changes in the co-variables, experimental conditions, and so on, are introduced.
  • Goal 4: Detection of bias. Bias is suspected of having been introduced in the original experiment. This is an interesting (although not always clearly stated) goal in some replications.

Goodman offered two additional goals that may be achieved through replication:

  • Learning about the robustness of results : resistance to (minor or moderate) changes in the experimental or analytic procedures and assumptions, and
  • Learning about the generalizability (also known as transportability) of the results : true findings outside the experimental frame or in a not-yet-tested situation. Goodman added that the border between robustness and generalizability is indistinct because all scientific findings must have some degree of generalizability; otherwise, the findings are not actually science.

Measuring reproducibility is challenging, in Goodman's view, because different conceptions of reproducibility have different measures. However, many measures fall generally into the following four categories: (1) process (including key design elements), (2) replicability, (3) evidence, and (4) truth.

When discussing process, the key question is whether the research adhered to reproducible research-sharing standards. More specifically, did it provide code, data, and metadata (including the research plan), and was it registered (for many clinical research designs)? Did it adhere to reporting (transparency) standards (e.g., CONSORT, STROBE, QUADAS, QUORUM, CAMARADES, REMARK)? Did it adhere to various design and conduct standards, such as validation, selection thresholds, blinding, and controls? Can the work be evaluated using quality scores (e.g., GRADE)? Did the researchers assess meta-biases that are hard to detect except through meta-research? Goodman noted that in terms of assessing meta-biases, one of the most important factors is selective reporting in publication. This is captured by multiple terms and often varies by discipline:

  • Multiple comparisons (1950s, most statisticians),
  • File drawer problem ( Rosenthal, 1979 ),
  • Significance questing ( Rothman and Boice, 1979 ),
  • Data mining, dredging, torturing ( Mills, 1993 ),
  • Data snooping ( White, 2000 ),
  • Selective outcome reporting ( Chan et al., 2004 ),
  • Bias ( Ioannidis, 2005 ),
  • Hidden multiplicity ( Berry, 2007 ),
  • Specification searching ( Leamer, 1978 ), and
  • p-hacking ( Simmons et al., 2011 ).

Goodman observed that all of these terms refer to essentially the same phenomenon in different fields, which underscores the importance of clarifying the language.
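All of these labels describe the same underlying mechanism: when many comparisons are made but only the “significant” ones are reported, the nominal false-positive rate no longer holds. The simulation below is illustrative and is not part of the workshop presentations; it shows how the chance of reporting at least one p < 0.05 result grows with the number of outcomes tested, even when every null hypothesis is true. It assumes NumPy and SciPy are available, and the sample sizes are arbitrary.

```python
# Illustrative simulation of selective reporting: test k independent outcomes,
# none with a true effect, and "report a finding" if any p-value falls below 0.05.
# Not from the workshop; all numbers are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_group, n_sims = 30, 2000

def any_significant(k):
    """Fraction of simulated studies reporting at least one p < 0.05 among k null tests."""
    hits = 0
    for _ in range(n_sims):
        pvals = [stats.ttest_ind(rng.normal(size=n_per_group),
                                 rng.normal(size=n_per_group)).pvalue
                 for _ in range(k)]
        hits += min(pvals) < 0.05
    return hits / n_sims

for k in (1, 5, 10, 20):
    print(f"{k:2d} outcomes tested -> chance of at least one 'significant' result ~ {any_significant(k):.2f}")
```

In theory the rate for k independent null tests is 1 - 0.95^k, so it climbs from 5 percent with one outcome to roughly 64 percent with twenty, which is the common phenomenon behind all of the terms listed above.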

In terms of results, Goodman explained, the methods used to assess replication are not clear. He noted that some methods agreed upon within different disciplinary cultures include contrasting statistical significance, assessing statistical compatibility, improving estimation and precision, and assessing contributors to bias and variability.

Assessing replicability from individual studies can be approached using probability of replication, weight of evidence (e.g., p-values, traditional meta-analysis and estimation, Bayes factors, likelihood curves, and the probability of replication as a substitute evidential measure), probability of truth (e.g., Bayesian posteriors, false discovery rate), and inference/knowledge claims. Goodman noted that there are many ways to determine the probability of replication, many of which include accounting for predictive power and resampling statistics internally.
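As one concrete, deliberately simplified example of a replication-probability calculation of the general kind mentioned here (an illustration, not the specific measures discussed at the workshop), suppose a first study reports an estimate with standard error s and ratio z1 = estimate/s, and an identical second study is planned. If the true effect is assumed to equal the observed estimate, the chance that the second study reaches two-sided significance in the same direction is Phi(z1 - z_alpha); if the uncertainty in the estimate is instead carried forward through a flat-prior predictive distribution, that comparison is diluted by a factor of sqrt(2). The sketch below implements both variants and assumes SciPy is available.

```python
# Illustrative "probability of replication" calculation (a simplification, not
# the workshop's specific measures). z1 = estimate / standard error from study 1;
# the replication study is assumed to have the same design and standard error.
from math import sqrt
from scipy.stats import norm

def replication_probability(z1, alpha=0.05, predictive=False):
    """P(an identical second study is significant in the same direction).

    predictive=False: assume the true effect equals the observed estimate.
    predictive=True : propagate uncertainty in the estimate (flat-prior predictive),
                      which widens the distribution of the second estimate by sqrt(2).
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    scale = sqrt(2) if predictive else 1.0
    return norm.cdf((abs(z1) - z_alpha) / scale)

for z1 in (2.0, 2.5, 3.3):
    print(f"z1 = {z1}: plug-in = {replication_probability(z1):.2f}, "
          f"predictive = {replication_probability(z1, predictive=True):.2f}")
```

Even this toy calculation makes the qualitative point that a result just past the significance threshold has roughly a coin-flip chance of "replicating" in this narrow sense.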

Goodman discussed some simple ways to connect evidence to truth and to replicability, as shown in Table 3.1 . However, he noted that very few scientists are even aware that these measures can show how replication is directly related to strength of evidence and prior probability. A participant questioned shifting the conversation from reproducibility to truth and how to quantify uncertainty in truth. Specifically, since so much of the reproducibility discussion is fundamentally about error, is truth a specific target number in the analysis, or is it a construct for which there is an acceptable range?

Goodman explained that replicability and reproducibility are critical operational steps on the road to increasing certainty about anything. Ultimately the goal is to accumulate enough certainty to declare something as established. He explained that reproducibility in and of itself is not the goal; rather, it is part of the calculus that leads a community to agree, after a certain number of experiments or confirmations of the calculations, that the result is true. This is why, according to Goodman, a hierarchy of evidence exists, and the results from an observational study are not treated in quite the same way as those from a clinical trial, even though the numbers and standard errors might look exactly the same. Some baseline level of uncertainty might be associated with the phenomenon under study, and additional uncertainties arise from measurement issues and covariance.

To evaluate the strength of evidence, quantitative and qualitative assessments need to be combined. Goodman noted that some measures directly put a probability statement on truth as opposed to using operational surrogates. Only through that focus can we understand exactly how reproducibility, additional experiments, or increased precision raises the probability that we have a true result. However, he commented that there are no formal equations that incorporate all of the qualitative dimensions of assessing experimental quality and covariate inclusion or exclusion. To some extent, this can be captured through sensitivity analyses, which can be incorporated into that calculus. While the uncertainty can be captured in a Bayesian posterior through Bayesian model averaging, this is not always done, and it is unclear whether this is even the best way to proceed.
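The kind of calculation alluded to here, connecting prior probability, strength of evidence, and replication to the probability that a claim is true, can be illustrated with a simple Bayes-rule sketch. Treating a study as a diagnostic test with false-positive rate alpha and power 1 - beta, the probability that a "significant" finding reflects a real effect is power*prior / (power*prior + alpha*(1 - prior)), and an independent successful replication updates that value again. This is an illustration in the spirit of the discussion, not a formula taken from the workshop, and all numbers are arbitrary.

```python
# Illustrative Bayes-rule calculation: probability that a "significant" finding
# is true, given a prior probability, the test's alpha, and its power, and how
# an independent significant replication updates it. Numbers are arbitrary.
def posterior_true(prior, alpha=0.05, power=0.80):
    """P(effect is real | a significant result), treating the study as a diagnostic test."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

prior = 0.10                               # hypothetical prior plausibility of the claim
after_one = posterior_true(prior)          # after the original significant finding
after_two = posterior_true(after_one)      # after an independent significant replication
print(f"prior = {prior:.2f} -> after study 1 = {after_one:.2f} -> after replication = {after_two:.2f}")
```

With these arbitrary inputs the probability of truth rises from 0.10 to roughly 0.64 after one significant study and to above 0.95 after an independent replication, which is the sense in which replication, strength of evidence, and prior probability are linked.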

Goodman mentioned a highly discussed editorial ( Trafimow and Marks, 2015 ) in Basic and Applied Social Psychology whose authors banned the use of indices related to null hypothesis significance testing procedures in that journal (including p-values, t-values, F-values, statements about significant difference, and confidence intervals). Goodman concluded that reproducibility does not have a single agreed-upon definition, but there is a widely accepted belief that being able to demonstrate a physical phenomenon in repeated independent contexts is an indispensable and essential component of scientific truth. This requires, at minimum, a complete knowledge of both experimental and analytic procedures (both process-related reproducibility and the strength of evidence). He speculated that reproducibility is the focus instead of truth, per se, because frequentist logic, language, and measures do not provide proper evidential or probabilistic measures of truth. The statistical measures related to reproducibility depend on the criteria by which truth or true claims are established in a given discipline. Goodman stated that in some ways, addressing the problems would be a discipline-by-discipline reformation effort where statisticians have a special role.

Goodman reiterated that statisticians are in a position to see the commonalities of problems and solutions across disciplines and to help harmonize and improve the methods that we use to enhance reproducibility on the journey to truth. Statisticians communicate through papers, teaching, and development of curricula, but curricula that cross many disciplines do not currently exist. He noted that these curricula would have to cover the foundations of inference, threats to validity, and the calculus of uncertainty. Statisticians can also develop software that incorporates measures and methods that contribute to the understanding and maximization of reproducibility.

Goodman concluded his presentation by noting that enhancing reproducibility is a problem of collective action that requires journals, funders, and academic communities to develop and enforce the standards. No one entity can do it alone. In response to a participant question about why a journal could not force authors to release their data and methods without the collaboration of other journals, Goodman explained that many journals are in direct competition and no one journal wants to create more impediments for the contributors than necessary. Data sharing is still uncommon in many fields, and a journal may hesitate to require something of authors that other competitive journals do not. However, Goodman noted that many journals are starting to move in this direction. The National Institutes of Health and other funders are beginning to mandate data sharing. The analogous movement toward clinical trial registration worked only when the top journals declared they would not publish clinical trials that were not preregistered. While any one journal could stand up and be the first to similarly require data sharing, Goodman stressed that this is unlikely given the highly competitive publishing environment.

Another participant stated that even if all data were available, the basic questions about what constitutes reproducible research remain because there is not a clear conceptual framework. Goodman observed that there may be more opportunity for agreement about the underlying constructs than about language, because language seems to differ by field. Mapping out the underlying conceptual framework, however, would show that while these words are applied in different ways across fields and cultures, everyone is clear about the underlying issues (e.g., the repeatability of the experiment versus the calculations versus the adjustments). If the broader community can agree on that conceptual understanding, and specific disciplines can map onto that conceptualization, Goodman is optimistic.

A participant questioned the data-sharing obligation in the case of a large data set that a researcher plans to use to produce many papers. Goodman suggested the community envision a new world where researchers mutually benefit from people making their data more available. Many data-sharing disciplines are already realizing this benefit.

Yoav Benjamini, Tel Aviv University

Yoav Benjamini began his presentation by discussing the underlying effort of reproducibility: to discover the truth. From a historical perspective, the main principle protecting scientific discoveries is constant subjection to scrutiny by other scientists. Replicability became the gold standard of science during the 17th century, as illustrated by the debate associated with the air pump and the vacuum it produced. That air pump was complicated and expensive to build, with only two existing in England. When the Dutch mathematician and scientist Christiaan Huygens observed unique properties within a vacuum, he traveled to England to demonstrate the phenomenon to scientists such as Robert Boyle (whom Benjamini cited as the first to introduce the methods section into scientific papers) and Thomas Hobbes. Huygens did not expect the phenomenon to be believed unless he could demonstrate it on an air pump in England.

According to Fisher (1935) , Benjamini explained, "We may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results." Benjamini noted that this criterion rests on the p-value, but because an exact p-value cannot itself be replicated, a threshold is needed. However, there had been little in the literature about quantifying what it means to be replicable ( Wolfinger, 2013 ) until the recent introduction of the r-value.

Benjamini offered the following definitions to differentiate between reproducibility and replicability:

  • Reproducibility of the study . Subject the study's raw data to the same analysis used in the study, and arrive at the same outputs and conclusions.
  • Replicability of results . Replicate the entire study, from enlisting subjects through collecting data, analyze the results in a similar but not necessarily identical way, and obtain results that are essentially the same as those reported in the original study ( Peng, 2009 ; Nature Neuroscience , 2013 ; NSF, 2014 ).

He noted that this is not merely a matter of terminology: reproducibility is a property of a study, while replicability is a property of a result, which can be established only by inspecting other results of similar experiments. Therefore, the reproducibility of a single study can be assured, and improving the statistical analysis can enhance the replicability of its results.

Replicability assessment, Benjamini asserted, requires multiple studies, so it can usually be done within a meta-analysis. However, meta-analysis and replicability assessment are not the same. Meta-analysis answers the following question: Is there evidence for an effect in at least one study? In contrast, a quantitative replicability assessment should answer this question: Is there evidence for an effect in at least two (or more) studies? The purpose is not to buy power but to buy a stronger scientific statement.

An assessment establishes replicability in a precise statistical sense, according to Benjamini. If two studies identify the same finding as statistically significant (with p-values P 1 and P 2 ), replicability is established if the union of null hypotheses H 01 ∪ H 02 is rejected in favor of the conjunction of alternatives H 11 ∩ H 12 , that is, in favor of the conclusion that the effect is present in both studies.

The r-value is therefore the significance level of the test that concludes the alternative holds in at least two studies. Similar findings in at least two studies are a minimal requirement, but the strength of the replicability argument increases as the number of studies with findings in agreement increases. When screening many potential findings, the r-value is the smallest false discovery rate at which the finding will be among those declared to have the alternative true in at least two studies.
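
A deliberately simplified sketch of such a screening calculation is shown below: for two independent studies, intersection-union logic rejects the union of null hypotheses only when both individual tests reject, so the evidence for replicability of each finding is summarized by the larger of its two p-values, and a Benjamini-Hochberg adjustment of those maxima then plays the role of the r-values (the p-values are invented, and the published r-value construction involves further refinements):

    # Simplified replicability screening for findings tested in two independent studies.
    # For each finding, max(p1, p2) summarizes the evidence that the effect is present
    # in both studies; adjusting these maxima for the false discovery rate gives a
    # rough analogue of the r-values described in the text. Values are invented.
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    p_study1 = np.array([0.0005, 0.004, 0.030, 0.240])
    p_study2 = np.array([0.0010, 0.180, 0.020, 0.031])

    p_both = np.maximum(p_study1, p_study2)
    reject, r_values, _, _ = multipletests(p_both, alpha=0.05, method="fdr_bh")
    for i, (p, r, rej) in enumerate(zip(p_both, r_values, reject), start=1):
        print(f"finding {i}: max p = {p:.4f}, adjusted value = {r:.4f}, replicated at FDR 0.05: {rej}")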

Benjamini concluded his presentation with several comments:

  • Different designs need different methods. For example, an empirical Bayes approach can be used for genome-wide association studies ( Heller and Yekutieli, 2014 ).
  • The mixed-model analysis can be used as evidence for replicability. 10 It has the advantage of being useful for enhancing replicability in single-laboratory experiments. The relation between the strength of evidence and power needs to be explored, perhaps through the use of r-values.
  • Selective inference and uncertainty need to be introduced into basic statistical education (e.g., the Bonferroni correction method, false discovery rate, selective confidence intervals, post-selection inference, and data splitting).
  • REPRODUCIBILITY AND STATISTICAL SIGNIFICANCE

Dennis Boos, North Carolina State University

Dennis Boos began his workshop presentation by explaining that variability refers to how results differ when an experiment is repeated, which naturally relates to reproducibility of results. The subject of his presentation was the variability of p-values ( Boos and Stefanski, 2011 ); he asserted that p-values are valuable indicators, but their use needs to be recalibrated to account for replication in future experiments.

To Boos, reproducibility can be thought of in terms of two identical experiments producing similar conclusions. Assume summary measures T 1 and T 2 (such as p-values) are continuous and exchangeable.

P ( T 2 > T 1 ) = 1/2

Thus, before the two experiments are performed, there is a 50 percent chance that T 2 > T 1 . Boos noted that for p-values, it is not unusual for weak effects that produce a p 1 near, but less than, 0.05 to be followed by p 2 larger than 0.05, which indicates that the first experiment is not reproducible. Boos commented that this is not limited to p-values; regardless of what measure is used, if the results are near the threshold of what is declared significant, there is a good chance that that standard will not be met in the subsequent experiment.
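
A small simulation illustrates both points; the effect size, sample size, and number of repetitions below are arbitrary choices for illustration rather than values from Boos and Stefanski (2011):

    # Two identical experiments (one-sample t-tests on a weak effect), repeated many times.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    effect, n, sims = 0.3, 40, 20000
    x1 = rng.normal(effect, 1.0, size=(sims, n))
    x2 = rng.normal(effect, 1.0, size=(sims, n))
    p1 = stats.ttest_1samp(x1, 0.0, axis=1).pvalue
    p2 = stats.ttest_1samp(x2, 0.0, axis=1).pvalue

    print("P(p2 > p1):", np.mean(p2 > p1))                     # about 0.5, by exchangeability
    barely = (p1 > 0.01) & (p1 < 0.05)                         # first result barely significant
    print("P(p2 > 0.05 | 0.01 < p1 < 0.05):", np.mean(p2[barely] > 0.05))  # a substantial fraction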


Boos argued that only the orders of magnitude of p-values, as opposed to their exact values, are accurate enough to be reported, with three rough ranges being of interest, corresponding to the *, **, and *** reporting convention he returned to at the end of his talk.

He suggested that some alternate methodologies could be useful in reducing variability:

  • Bootstrap and jackknife standard errors for log 10 ( p -value), as suggested by Shao and Tu (1996) ;
  • Bootstrap prediction intervals and bounds for p new of a replicated experiment, as suggested by Mojirsheibani and Tibshirani (1996) ; and
  • Estimate of the reproducibility probability P ( p new ≤ 0.05), which assesses the probability that a repeated experiment will produce a statistically significant result ( Goodman, 1992 ; Shao and Chow, 2002 ).

Using the reproducibility probability, Boos argued that p-values of p ≤ 0.01 are necessary if the reported results are to have an acceptable degree of reproducibility. However, the reproducibility probability has a large variance itself, so its standard error is large. Boos suggested using the reproducibility probability as a calibration but remaining aware of its limitations.
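
One simple plug-in version of this reproducibility probability, for a two-sided z-test, treats the observed z-statistic as the true noncentrality of the replicate; the estimators discussed by Goodman (1992) and Shao and Chow (2002) differ in detail, so the sketch below is only a rough illustration of why results near p = 0.05 carry roughly a coin-flip chance of replicating:

    # Plug-in estimate of P(p_new <= 0.05) for a two-sided z-test, treating the
    # observed z-statistic as the true noncentrality and ignoring the (small)
    # probability of significance in the opposite direction.
    from scipy.stats import norm

    def reproducibility_probability(p_observed, alpha=0.05):
        z_obs = norm.isf(p_observed / 2)      # |z| implied by the two-sided p-value
        z_crit = norm.isf(alpha / 2)
        return norm.cdf(z_obs - z_crit)       # chance a replicate clears the critical value

    for p in (0.05, 0.01, 0.005, 0.001):
        print(p, round(reproducibility_probability(p), 2))   # 0.50, 0.73, 0.80, 0.91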

Boos also discussed Bayes calculations in the one-sided normal test ( Casella and Berger, 1987 ). He gave an example of looking at the posterior probability over a class of priors. Using the Bayes factor, which under equal prior odds equals the posterior odds of the alternative relative to the null hypothesis, he illustrated that Bayes factors show repeatability problems similar to those of p-values when the values are near the threshold.


Reporting p-values in terms of ranges (specifically *, **, or ***) may be a reasonable approach for the broader community, he suggested.

Andreas Buja, Wharton School of the University of Pennsylvania

Andreas Buja stated that the research from Boos and Stefanski (2011) made many significant contributions to the scientific community's awareness of sampling variability in p-values and showed that this variability can be quantified. That awareness led to a fundamental question: When seeing a p-value, is it believable that something similar would appear again under replication? Because a positive answer is not a given, Buja argued that more stringent cutoffs than p = 0.05 are important to achieve replicability.

There are two basic pedagogical problems, according to Buja: (1) p-values are random variables; while they appear to be probabilities, they are transformed and inverted test statistics. (2) The p-value random variables exhibit sampling variability, but the depth of this understanding is limited and should be more broadly stated as data set–to–data set variability. Moving from p-value variability to bias, Buja suggested that multiple studies might be more advantageous than a single study with a larger sample because interstudy correlation can be significant.

Buja argued that statistics and economics need to work together to tackle reproducibility. Statistical thinking, he explained, views statistics as a quantitative epistemology and the science that creates protocols for the acquisition of qualified knowledge. The absence of protocols is damaging, and important distinctions are made between replicability and reproducibility in empirical, computational, and statistical respects. Economic thinking situates research within the economic system, where incentives must be set right to solve the reproducibility problem. Buja argued that economic incentives (e.g., journals and their policies) should be approached in conjunction with the statistical protocols.

Journals need to fight the file drawer problem, Buja said, by stopping the chase of “breakthrough science” and publishing, soliciting, and favorably treating replicated results and negative outcomes. He argued that institutions and researchers are lacking suitable incentives. Researchers will self-censor if journals treat replicated results and negative outcomes even slightly less favorably. Also, researchers will lose interest as soon as negative outcomes are apparent. In Buja's view, negative results are important but often are not viewed as such by researchers because publishing them is challenging. Following the example of Young and Karr (2011) , Buja argued that journals should accept or reject a paper based on the merit and interest of the research problem, the study design, and the quality of researcher, without knowing the outcomes of the study.

Statistical methods should take into account all data analytic activity, according to Buja. This includes the following:

  • Revealing all exploratory data analysis, in particular visualizations;
  • Revealing all model searching (e.g., lasso, forward/backward/all-subsets, Bayesian, cross validated, Akaike information criterion, Bayesian information criterion, residual information criterion);
  • Revealing all model diagnostics and actions resulting from them; and
  • Inference steps that attempt to account for all of the above.

In principle, Buja stated that any data-analytic action that could result in a different outcome in another data set should be documented, and the complete set of data-analysis and inference steps should be undertaken in an integrated fashion.

While these objectives are not yet attained, Buja identified some work that is contributing to these goals. Post-selection inference, for example, is a method for statistical inference that is valid after model selection in linear models ( Berk et al., 2013 ). Buja explained that this helps protect inference against the effects of model selection, including p-hacking. Inference for data visualization is also progressing, according to Buja. In principle, synthetic data can be plotted with and compared to actual data. Sources of synthetic data include permutations for independence tests, the parametric bootstrap for model diagnostics, and sampling from conditional data distributions given sufficient statistics. If the actual data plot can be distinguished from the synthetic plots, significance can be demonstrated ( Buja et al., 2009 ).
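
The graphical-inference idea can be sketched as a "lineup" in the spirit of Buja et al. (2009): the plot of the actual data is hidden among plots of synthetic data generated under the null hypothesis (here, by permutation), and a viewer who can pick out the real panel has evidence of genuine structure. The data and layout below are invented for illustration:

    # A minimal lineup: the real scatterplot is hidden among permutation-null panels.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    n, n_panels = 80, 20
    x = rng.normal(size=n)
    y = 0.4 * x + rng.normal(size=n)              # weak true association

    real_slot = rng.integers(n_panels)            # where the actual data are hidden
    fig, axes = plt.subplots(4, 5, figsize=(10, 8), sharex=True, sharey=True)
    for k, ax in enumerate(axes.ravel()):
        y_k = y if k == real_slot else rng.permutation(y)   # permutation breaks any x-y link
        ax.scatter(x, y_k, s=8)
        ax.set_title(str(k), fontsize=8)
    fig.suptitle("Which panel shows the real data?")
    plt.show()
    print("The real data were in panel", real_slot)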

Valen Johnson, Texas A&M University

Valen Johnson began by noting that, by its nature, a p-value answers only the question of how often a test statistic as large as the one observed would occur if the experiment were repeated and the null hypothesis were true; it does not establish whether the null hypothesis is true.

Johnson offered the following review of the Bayesian approach to hypothesis testing: Bayes' theorem provides the posterior odds between two hypotheses after seeing the data. Specifically, the posterior odds of one hypothesis relative to the other equal the Bayes factor times the prior odds between the two hypotheses.

P ( H 1 | x ) / P ( H 0 | x ) = BF 10 ( x ) × P ( H 1 ) / P ( H 0 ),

that is,

posterior odds = Bayes factor × prior odds

where the Bayes factor is the integral of the sampling density with respect to the prior distribution under that hypothesis:

BF 10 ( x ) = ∫ f ( x | θ ) π 1 ( θ , H 1 ) dθ / ∫ f ( x | θ ) π 0 ( θ , H 0 ) dθ

and when doing null hypothesis significance testing, the prior distribution for the parameter under the null hypothesis is just a point mass π 0 ( θ , H 0 ). Bayesian methods are not used frequently because it is difficult to specify the prior distribution of the parameter under an alternative hypothesis.
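
For a concrete, entirely illustrative instance of these formulas, the sketch below computes a Bayes factor for a normal mean with a point-mass null at zero, taking a normal prior under the alternative as an assumed choice (this is not the UMPBT prior discussed next):

    # Bayes factor for a point null (theta = 0) versus a normal prior alternative,
    # computed by numerically integrating the sampling density of the sample mean
    # against the prior. Data, prior spread, and sample size are arbitrary.
    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    rng = np.random.default_rng(2)
    n, sigma, tau = 25, 1.0, 1.0                  # sample size, known sd, assumed prior sd under H1
    x = rng.normal(0.3, sigma, size=n)            # simulated data
    xbar = x.mean()

    integrand = lambda theta: stats.norm.pdf(xbar, theta, sigma / np.sqrt(n)) * stats.norm.pdf(theta, 0.0, tau)
    marginal_h1 = quad(integrand, -10, 10)[0]                      # integral against pi_1(theta, H1)
    marginal_h0 = stats.norm.pdf(xbar, 0.0, sigma / np.sqrt(n))    # point mass at theta = 0
    print("BF_10 =", marginal_h1 / marginal_h0)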

Recently, a methodology called uniformly most powerful Bayesian tests (UMPBTs) has been developed that provides a default specification of the prior π 1 ( θ , H 1 ) under the alternative hypothesis ( Johnson, 2013a ). Johnson explained that the rejection region for a UMPBT can often be matched to the rejection region of the classical uniformly most powerful test. Interestingly, the UMPBT alternative places its prior mass at the boundary of the rejection region, which can be interpreted as the implicit alternative hypothesis that researchers test in the classical paradigm. This methodology provides a direct correspondence among p-values, Bayes factors, and posterior probabilities. Johnson noted that the prior distribution under the alternative is selected to maximize the probability that the Bayes factor exceeds the threshold, over all other possible prior distributions. He commented that it is surprising that such a prior distribution exists, because the property has to hold for every data-generating value of the parameter.

For single-parameter exponential family models, Johnson explained that specifying the significance level of the classical uniformly most powerful test is equivalent to specifying the threshold that a Bayes factor must exceed to reject the null hypothesis in the Bayesian test. So, the UMPBT has the same rejection region as the classical test. If one assumes that equal prior probability is assigned to the null hypothesis and the alternative hypothesis, then there is a direct correspondence between the p-value and the posterior probability that the null hypothesis is true.

Under this assumption and using the UMPBT to calculate posterior probabilities from prior odds, a p-value of 0.05 leads to a posterior probability of the null hypothesis of 20 percent. Johnson stressed that the UMPBT alternative was selected to give the highest probability of rejecting the null at the threshold, so the posterior probability would be greater than 20 percent if another prior were used. Johnson speculated that the widespread acceptance of a p-value of 0.05 reduces reproducibility in scientific studies. To reach a posterior probability of 0.05 that the null hypothesis is true, he asserted that one really needs a p-value of about 0.005.
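
The arithmetic behind these figures is a direct application of the posterior odds formula; the Bayes factor of roughly 4 used below is simply the value implied by a 20 percent posterior probability under equal prior odds, not a number computed here from first principles:

    # Posterior probability of the null hypothesis from a Bayes factor (alternative
    # over null) and a prior probability of the null, via posterior odds = BF x prior odds.
    def posterior_prob_null(bf_alt_over_null, prior_prob_null=0.5):
        prior_odds_alt = (1 - prior_prob_null) / prior_prob_null
        posterior_odds_alt = bf_alt_over_null * prior_odds_alt
        return 1 / (1 + posterior_odds_alt)

    print(posterior_prob_null(4, prior_prob_null=0.5))   # 0.20, the figure cited for p = 0.05
    print(posterior_prob_null(4, prior_prob_null=0.8))   # 0.50 when the null is favored a priori

The second line previews the point Johnson made next: a higher prior probability on the null pushes the posterior probability of the null well above 20 percent.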

The next question Johnson posed was whether the assumed prior probability of the null hypothesis of only 0.5 is correct. He noted that when researchers do experiments, they often perform multiple tests on the same data, conduct subgroup analyses, and are subject to file drawer biases, among other issues. Given all these potential concerns, he speculated that the prior probability that should be assigned to the null should be much greater than 0.5. Assuming a higher prior probability of the null hypothesis would result in an even higher posterior probability. That is, a p-value of 0.05 would be associated with a posterior probability of the null hypothesis that was much greater than 20 percent.

In conclusion, Johnson recommends that minimum p-values of 0.005 and 0.001 should be designated as standards for significant and highly significant results, respectively ( Johnson, 2013b ).

Panel Discussion

After their presentations, Dennis Boos, Andreas Buja, and Valen Johnson participated in a panel discussion. During this session, two key themes emerged: the value of raising the bar for demonstrating statistical significance (be it through p-values, Bayes factors, or another measure) and the importance of protocols that contribute to the replicability of research.

Statistical Significance

A participant asked how individuals and communities could convince clinical researchers of the benefits of smaller p-values. Johnson agreed that there has been pushback from scientists about the proposal to raise the bar for statistical significance. But he noted that the evidence against a false null hypothesis grows exponentially fast, so even reducing the p-value threshold by a factor of 10 may require only a 60 percent increase in an experiment's sample size. This is in contrast to the argument that massively larger sample sizes would be needed to get smaller p-values. He also noted that modern statistical methodology provides many ways to perform sequential hypothesis tests, and it is not necessary to conduct every test with the full sample size. Instead, Johnson said that a preliminary analysis could be done to see if there appears to be an effect and, if the results look promising, sample size could be increased. Buja commented that, from an economic perspective, fixed thresholds do not work. He argued that there should be a gradual kind of decision making to avoid the intrinsic problems of thresholds.
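
Johnson's sample-size claim can be checked with a standard z-test power calculation; the two-sided test and 90 percent power below are assumptions chosen for the illustration:

    # Ratio of required sample sizes for a two-sided z-test when the significance
    # threshold is tightened from 0.05 to 0.005, holding power fixed at 90 percent.
    from scipy.stats import norm

    def sample_size_ratio(alpha_new, alpha_old, power=0.90):
        z_beta = norm.isf(1 - power)
        return ((norm.isf(alpha_new / 2) + z_beta) / (norm.isf(alpha_old / 2) + z_beta)) ** 2

    print(round(sample_size_ratio(0.005, 0.05), 2))   # about 1.6, i.e., roughly a 60 percent increase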

A participant wondered if the most effective approach might be to try to move people away from p-values to Bayes factors, because a Bayes factor of 0.05 represents stronger evidence than a p-value of 0.05. Since most researchers are familiar only with the operational meaning of the p-value, the participant argued that keeping the 0.05 value but changing the calculation leading to it may be easier. Johnson noted the challenge of bringing a Bayesian perspective into introductory courses, so educating researchers on Bayes factors is going to be difficult. Because of this challenge, he asserted that changing the p-value threshold to 0.005 might be easier.

A participant proposed an alternative way to measure reproducibility: tracking the ratio of the number of papers that attempted to verify the main conclusion of the report to the number that succeeded ( Nicholson and Lazebnik, 2014 ). Another workshop participant questioned how well such a counting measure would work because many replication studies are underpowered and not truly conclusive on their own. Johnson explained that Bayes factors between experiments multiply together naturally so they serve as an easy way to combine information across multiple experiments.

In response to a participant's question about how much reproducibility might improve if Bayes factors were used rather than p-values, Buja commented that many factors other than statistical methodology could play bigger roles. Boos added that both frequentist and Bayesian analyses will generally come to the same conclusions if done correctly. Johnson agreed that there are many sources of irreproducibility in scientific studies and statistics, and the use of elevated significance thresholds is just one of many factors.

A participant summarized that the level of evidence needs to be much higher than the current standard, be it through Bayes factors or p-values. The participant wondered how to get a higher standard codified in a situation where a variety of models and approaches are being used to analyze a set of data and come up with a conclusion. While a more stringent p-value or higher Bayes factor will help, neither may provide as much value as it does in single hypothesis tests. The prior probabilities of various kinds of hypotheses may also matter, and that is more difficult to model. Johnson agreed that it is more difficult when working with more complicated models that have high-dimensional parameters. He agreed that the Bayesian principles are in place, but that implementing and specifying the priors is a much more difficult task.

A participant noted that whenever a threshold is used, regardless of what method, approach, process, or philosophy is in place, a result near the threshold would have more uncertainty than a finding within the threshold. Boos agreed that the threshold is just a placeholder and it is up to the community to realize that anything close to it may not reproduce. Johnson disagreed, stating that there needs to be a point at which a journal editor has to decide whether to publish a result or whether a physicist can claim a new discovery. The important takeaway, in Johnson's view, is that this bar needs to be higher.

A participant observed that although there are many suggested alternatives to p-values, none of them have been widely accepted by researchers. Johnson responded that UMPBT cannot be computed on all models and that p-values may be a reasonable methodology to use as long as the standards are tightened. Buja reinforced the notion that protocols are important even if they are suboptimal; reporting a suboptimal p-value, in other words, is an important standard.

A participant noted that Buja called for “accounting for all data analytical actions,” including all kinds of model selection and tuning; however, the participant would instead encourage scientists to work with an initial data set and figure out how to replicate the study in an interesting way. Buja agreed that replication is the goal, but protocols are essential to make a study fully reproducible by someone else. With good reporting, it may be possible to follow the research path taken by a researcher, but it is difficult to know why particular paths were not taken when those results are not reported. He stressed that there is no substitute for real replication.

  • ASSESSMENT OF FACTORS AFFECTING REPRODUCIBILITY

Marc Suchard, University of California, Los Angeles

Marc Suchard began his presentation by discussing some recent cases of conflicting observational studies using nationwide electronic health records. The first example given was the case of assessing the exposure to oral bisphosphonates and the risk of esophageal cancer. In this case, Cardwell et al. (2010) and Green et al. (2010) had different analyses of the same data set to show “no significant association” and “a significantly increased risk,” respectively. Another example evaluated oral fluoroquinolones and the risk of retinal detachment; Etminan et al. (2012) and Pasternak et al. (2013) used different data sources and analytical methods to infer “a higher risk” and “not associated with increased risk,” respectively. Suchard's final example was the case of pioglitazone and bladder cancer, as examined by Wei et al. (2013) and Azoulay et al. (2012) , which “[do] not appear to be significantly associated” and “[have] an increased risk,” respectively, using the same population but different analytical methods.

Suchard explained that he is interested in what he terms subjective reproducibility to gain as much useful information from a reproducibility study as possible by utilizing other available data and methods. In particular, he is interested in quantitatively investigating how research choices in data and analysis might generate an answer that is going to be more reproducible, or more likely to get the same answer, if the same type of experiment were to be repeated.

For example, when examining associations between adverse drug events and different pharmaceutical agents using large-scale observational studies, several studies can be implemented using different methods and design choices to help identify the true operating characteristics of these methods and design choices. This could help guide the choice of what type of design to use in terms of reliability or reproducibility and better establish a ground truth of the positive and negative associations. Another advantage of this approach, Suchard explained, is to better identify the type I error rate and establish the 95 percent confidence interval. He stressed that reproducibility of large-scale observational studies needs to be improved.

As an empirical approach toward improving reproducibility, Suchard and others set up a 3-year study called the Observational Medical Outcomes Partnership (OMOP) 11 to develop an analysis platform for conducting reproducible observational studies. The first difficulty with looking at observational data across the world, he explained, is that observational data from electronic health records are stored in many different formats. Addressing this requires a universal set of tools and a common data model that can store all this information, so that everyone is extracting it in the same way when specifying particular inclusion and exclusion criteria. To do this, the data from multiple sources need to be translated into a common data model without losing any information. Then one of many study methods is specified and the associated design decisions are made. Currently, Suchard noted that very few of these choices are documented in the manuscripts that report the work or in the protocols that were set up to initially approve the studies, even though the information might exist somewhere in a database that is specialized for the data researchers were using. If everything is on a common platform, the choices can be streamlined and reported.

Through the OMOP experiment, Suchard and others developed a cross-platform analysis base to run thousands of studies using open-source methods. The results of these studies are fully integrated with vignettes, from the first step of data extraction all the way to publication. He noted that doing this makes everything reproducible both objectively (can the study be reproduced exactly?) and subjectively (can a researcher make minor changes to the study?).

Subject-matter experts were then asked to provide their best understanding of known associations and nonassociations between drugs and adverse events. The experiment now has about 500 such statements about “ground truth,” which provide some information on the null distribution of any statistic under any method. Suchard noted that this is important to counter some of the confounders that cannot be controlled in observational studies.

Database heterogeneity can be a problem for reproducible results even when everything else can be held constant, according to Suchard. He explained that looking at different data sets could result in very different answers using the same methodology. Of the first 53 outcomes examined in the OMOP experiment, about half of them had so much heterogeneity that it would not be advisable to combine them, and 20 percent of them had at least one data source with a significant positive effect and one with a significant negative effect.

Suchard then discussed what happens to study results when the data and design are held constant but design choices are varied. He explained that just a small change in a design choice could have a large impact on results.

Another issue is that most observational methods do not have nominal statistical operating characteristics for estimates, according to Suchard. Using the example of point estimates of the relative risk for a large number of negative controls ( Ryan et al., 2013 ), he explained that if the process is modeled correctly and the confidence intervals are constructed using asymptotic normality assumptions, the estimates should have 95 percent coverage over the true value. However, this proved not to be the case and the estimated coverage was only about 50 percent with both negatively and positively biased estimates. He explained that there needs to be a way to control for the many factors affecting bias so that the point estimates at the end are believable.

When assessing which types of study designs tend to perform best in the real world, Suchard suggested that one way to look at the overall performance of a method is to treat it as a binary classifier ( Ryan et al., 2013 ). If a given data source and an open black-box analyzer produce either a positive or a negative association, the result can be compared with the current set of ground truths, and the area under the receiver operating characteristic curve for that binary classification gives the probability of identifying a positive result as having a higher value than a negative result. Suchard noted that a group of self-controlled methods that compare the rate of adverse events when an individual is and is not taking a drug tend to perform better in these data, with this ground truth, than the cohort methods.
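
In its simplest form, scoring a method as a binary classifier against the ground-truth set reduces to an area-under-the-curve calculation on the method's effect estimates; the labels and estimates below are invented for illustration:

    # Treat an observational method as a binary classifier: its relative-risk estimates
    # are scored against ground-truth labels (1 = known positive association, 0 = negative
    # control); the AUC estimates the probability that a true positive is ranked above
    # a negative control.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    ground_truth = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
    estimated_rr = np.array([2.1, 1.6, 1.2, 1.1, 0.9, 1.4, 1.0, 2.8, 0.8, 1.3])
    print("AUC:", roc_auc_score(ground_truth, estimated_rr))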

Suchard also noted that empirical calibration could help restore the interpretation of study findings, specifically by constructing a Bayesian model, based on a normal mixture model, to estimate the distribution of estimates expected under the null ( Schuemie et al., 2013 ). The calibrated p-value can then be computed as the tail probability of the observed relative risk under this predicted null distribution, and the relative risk can be examined against its standard error.
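
A stripped-down version of that calibration is sketched below: the location and spread of the negative-control estimates define an empirical null on the log relative-risk scale, and the calibrated p-value is the tail probability of a new estimate under that null. This collapses the mixture model and per-control standard errors of Schuemie et al. (2013) into a single normal fit, and all numbers are invented:

    # Simplified empirical calibration: fit a normal "empirical null" to negative-control
    # estimates, then compute a calibrated p-value for a new estimate under that null.
    import numpy as np
    from scipy.stats import norm

    log_rr_negative_controls = np.log([1.3, 0.9, 1.5, 1.1, 1.4, 1.2, 0.8, 1.6])
    mu = log_rr_negative_controls.mean()
    tau = log_rr_negative_controls.std(ddof=1)

    def calibrated_p(log_rr, se):
        z = (log_rr - mu) / np.sqrt(tau**2 + se**2)   # widen the null by the estimate's SE
        return 2 * norm.sf(abs(z))

    new_log_rr, new_se = np.log(1.8), 0.10
    print("traditional p:", 2 * norm.sf(abs(new_log_rr / new_se)))   # very small
    print("calibrated p: ", calibrated_p(new_log_rr, new_se))        # no longer impressive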

To address the findings that observational data are heterogeneous and methods are not reliable ( Madigan et al., 2014 ), the public-private partnership Observational Health Data Sciences and Informatics 12 group was established to construct reproducible tools for large-scale observational studies in health care. This group consists of more than 80 collaborators across 10 countries and has a shared, open data model (tracking more than 600 million people) with open analysis tools.

A participant wondered if medical research could be partitioned into categories, such as randomized clinical trials and observational studies, to assess which is making the most advances in medicine. Suchard pointed out that both randomized control trials and observational studies have limitations; the former is expensive and can put patients at risk, but the latter can provide misinformation. Balancing the two will be important in the future.

Courtney Soderberg, Center for Open Science

Courtney Soderberg began her presentation by giving an overview of the Center for Open Science's recent work to assess the level of reproducibility in experimental research, predict reproducibility rates, and change behaviors to improve reproducibility. The Many Labs project brought together 36 laboratories to perform direct replication experiments of 13 studies. While 10 out of 13 studies were ultimately replicable, Soderberg noted a high level of heterogeneity among the replication attempts. There are different ways to measure reproducibility, she explained, such as through p-values or effect sizes, but if there is a high level of heterogeneity in the chosen measure, it may indicate a problem in the analysis, such as a missed variable or an overgeneralization of the theory. This knowledge of the heterogeneity can in itself move science forward and help identify new hypotheses to test. However, Soderberg noted that irreproducibility caused by false positives does not advance science, and the rate of these must be reduced. In addition, research needs to be more transparent so that exploratory research is not disguised as confirmatory research. As noted by Simmons et al. (2011) , the likelihood of obtaining a false-positive result when a researcher has multiple degrees of freedom can be as high as 60 percent. A large proportion of researchers admit to engaging in questionable research behaviors, but many regard them as ordinary research degrees of freedom rather than as problematic ( John et al., 2012 ).
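
The inflation described by Simmons et al. (2011) can be reproduced in miniature; the simulation below grants the analyst only one common degree of freedom, the choice among two correlated outcome measures or their average, and the specific settings are assumptions for illustration rather than the original authors' code:

    # With no true effect, flexibility over which outcome to report inflates the
    # false-positive rate above the nominal 5 percent.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    sims, n = 5000, 30
    group = np.repeat([0, 1], n)                       # two groups, no true difference
    false_pos = 0
    for _ in range(sims):
        common = rng.normal(size=2 * n)
        dv1 = common + rng.normal(scale=0.7, size=2 * n)   # two correlated outcome measures
        dv2 = common + rng.normal(scale=0.7, size=2 * n)
        candidates = [dv1, dv2, (dv1 + dv2) / 2]           # analyst reports the "best" one
        pvals = [stats.ttest_ind(dv[group == 0], dv[group == 1]).pvalue for dv in candidates]
        false_pos += min(pvals) < 0.05
    print("false-positive rate:", false_pos / sims)        # noticeably above 0.05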

Soderberg explained that a good way to assess reproducibility rates across disciplines is through a large-scale reproducibility project, such as the Center for Open Science's two reproducibility projects in psychology 13 and cancer biology. 14 The idea behind both reproducibility projects is to better understand reproducibility rates in these two disciplines. In the case of Reproducibility Project: Psychology, 100 direct replications are conducted to determine the reproducibility rate of papers from 2008 and analyze findings to understand what factors can predict whether a study is likely to replicate ( Open Science Collaboration, 2015 ). 15 Moving forward, the project aims to look at initiatives, mandates, and tools to change behaviors. For example, Soderberg explained that open data sharing and study preregistration are not traditional practices in psychology, but the proportion of studies engaging in both has been increasing in recent years. Conducting future reproducibility projects can determine whether behavior change has resulted in higher levels of reproducibility.

John Ioannidis, Stanford University

In his presentation, John Ioannidis noted that a lack of reproducibility could often be attributed to the typical flaws of research:

  • Solo, siloed investigator;
  • Suboptimal power due either to using small sample sizes or to looking at small effect sizes;
  • Extreme multiplicity, with many factors that can be analyzed but often not taken fully into account;
  • Cherry-picking, choosing the best hypothesis, and using post-hoc selections to present a result that is exciting and appears to be significant;
  • p-values of p < 0.05 being accepted in many fields, even though that threshold is often insufficient;
  • Lack of registration;
  • Lack of replication; and
  • Lack of data sharing.

He explained that empirical studies in fields where replication practices are common suggest that most effects that are initially claimed to be statistically significant turn out to be false positives or substantially exaggerated ( Ioannidis et al., 2011 ; Fanelli and Ioannidis, 2013 ). Another factor that complicates reproducibility is that large effects, while desired, are not common in repeated experiments ( Pereira et al., 2012 ).

Ioannidis explained that a problem in making discoveries is searching for causality among correlation; the existence of large data sets is making this even more difficult ( Patel and Ioannidis, 2014 ). He proposed that instead of focusing on a measure of statistical inference, such as a p-value, adjusted p-value, normalized p-value, or a Bayes factor equivalent, it may be useful to see where an effect size falls in the distribution of effect sizes seen across a field in typical situations.

Another possibility with large databases is to think about how all available data can be analyzed using large consortia and conglomerates of data sets. This would allow for an analysis of effects and correlations to see if a specific association or correlation is more or less than average. However, Ioannidis cautioned that even if a particular correlation is shown, it is difficult to tell if the effect is real. Another approach to the problem is trying to understand the variability of the results researchers could get when using different analytical approaches ( Patel et al., 2015 ). Ioannidis suggested that instead of focusing on a single result, communities should reach a consensus on which essential choices enter into an analysis and then identify the distribution of results that one could obtain once these analytical choices are taken into account.
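
One way to operationalize that suggestion, in the spirit of Patel et al. (2015), is to enumerate the defensible analytic specifications and report the whole distribution of resulting estimates rather than a single one; the data, covariates, and model below are simulated and invented for illustration:

    # "Vibration of effects" sketch: fit the same exposure-outcome regression under every
    # combination of candidate adjustment covariates and examine the spread of the
    # exposure coefficient across specifications.
    import itertools
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 500
    df = pd.DataFrame({
        "age": rng.normal(50, 10, n),
        "bmi": rng.normal(27, 4, n),
        "smoker": rng.integers(0, 2, n),
    })
    df["exposure"] = 0.02 * df["age"] + rng.normal(size=n)
    df["outcome"] = 0.1 * df["exposure"] + 0.03 * df["age"] + 0.5 * df["smoker"] + rng.normal(size=n)

    candidates = ["age", "bmi", "smoker"]
    estimates = []
    for k in range(len(candidates) + 1):
        for subset in itertools.combinations(candidates, k):
            X = sm.add_constant(df[["exposure", *subset]])
            estimates.append(sm.OLS(df["outcome"], X).fit().params["exposure"])
    print("exposure estimate ranges from", round(min(estimates), 3), "to", round(max(estimates), 3))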

He concluded by discussing the potential for changing the paradigms that contribute to irreproducibility problems. He asserted that funding agencies, journals, reviewers, researchers, and others in the community need to identify what goals should be given highest priority (e.g., productivity, quality, reporting quality, reproducibility, sharing data and resources, translational impact of applied research) and reward those goals appropriately. However, Ioannidis admitted that it is an open question how such goals can be operationalized.

Q&A

A participant congratulated Suchard on his important work in quantifying uncertainty associated with different designs in pharmacoepidemiology applications but wondered to what extent the methodology is extendable to other applications, including studies that are not dependent on databases. Suchard explained that it is difficult to know how to transfer the methodology to studies that are not easy to replicate, but there may be some similarities in applying it to observational studies with respect to comparative effectiveness research. He noted that the framework was not specifically designed for pharmacoepidemiology; it was just the first problem to which it was applied.

A participant asked Ioannidis whether, if one accepts that the false discovery rate is 26 percent using a p-value of 0.05, a result with a p-value of 0.05 should be described in papers as "worth another look" instead of "significant." Ioannidis responded that it would depend on the field and what the performance characteristics of that field look like, although most fields should be skeptical of a p-value close to 0.05. Recently, he and his colleagues started analyzing data from a national database in Sweden that includes information on 9 million people. Using this database, they are able to get associations with p-values of 10 −100 ; however, he speculates that most are false associations, with such small p-values being possible due to the size of the data set. Similarly for pharmacoepidemiology, he stated that almost all drugs could be associated with almost any disease if a large enough data set is used.

A participant asked if there are data about the increase in the number of problematic research papers. Soderberg observed that Simmons et al. (2011) ran simulations looking at how researcher degrees of freedom and flexibility in the analytic choice can cause false positives to go as high as 61 percent. She explained that it is difficult to tell if the real literature has such a high false-positive rate because it is unclear how many degrees of freedom, and how much analytic flexibility, are faced by researchers. However, surveys suggest that it is quite common to make such choices while executing research, and she believes the false-positive rates are higher than the 5 percent that a p-value of 0.05 would suggest. She noted that it is difficult to tell how behavior has evolved over time, so it is hard to measure how widespread reproducibility issues have been in the past.

A participant noted that much of Ioannidis's work is centered on studies with a relatively small sample size, while Suchard's work deals with tens of millions or hundreds of millions of patients, and wondered if there needs to be a fundamental rethinking on how to interpret statistics in the context of data sets where sampling variability is effectively converging to epsilon but bias is still persistent in the analyses. Ioannidis agreed that with huge sample sizes, power is no longer an issue but bias is a major determinant. He suggested that many potential confounders be examined.

  • REPRODUCIBILITY FROM THE INFORMATICS PERSPECTIVE

Mark Liberman, University of Pennsylvania

Mark Liberman began his presentation by giving an overview of a replicability experiment using the “common task method,” which is a research paradigm from experimental computational science that involves shared training and testing data; a well-defined evaluation metric typically implemented by a computer program managed by some responsible third party; and a variety of techniques to avoid overfitting, including holding test data until after the bulk of the research is done. He explained that the setting for this experiment is an algorithmic analysis of the natural world, which often falls within the discipline of engineering rather than science, although many areas of science have a similar structure.
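
In miniature, the common task method amounts to a neutral scoring script applied to withheld gold labels with a fixed, published metric; the file names and the accuracy metric below are invented to show only the shape of such a harness:

    # Minimal shape of a common-task evaluation harness: participants submit predictions
    # for a withheld test set, and a neutral script scores them against hidden gold labels.
    import json

    def evaluate(submission_path, gold_path="hidden_gold_labels.json"):
        with open(submission_path) as f:
            predicted = json.load(f)          # {example_id: label}
        with open(gold_path) as f:
            gold = json.load(f)
        correct = [predicted.get(k) == v for k, v in gold.items()]
        return sum(correct) / len(correct)    # published metric: simple accuracy

    # print(evaluate("team_A_predictions.json"))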

There are dozens of current examples of the common task method applied in the context of natural language research. Some involve shared task workshops (such as the Conference on Natural Language Learning, 16 Open Keyword Search Evaluation, 17 Open Machine Translation Evaluation, 18 Reconnaissance de Personnes dans les Émissions Audiovisuelles, 19 Speaker Recognition Evaluation, 20 Text REtrieval Conference, 21 TREC Video Retrieval Evaluation, 22 and Text Analysis Conference 23 ) where all participants utilize a given data set or evaluation metrics to produce a result and report on it. There are also available data sets, such as the Street View House Numbers, 24 for developing machine-learning and object-recognition algorithms with minimal requirement on data preprocessing and formatting. In the Street View House Numbers case, there has been significant improvement in the performance, especially in terms of reducing the error rate using different methods ( Netzer et al., 2011 ; Goodfellow et al., 2013 ; Lee et al., 2015 ).

Liberman pointed out that while the common task method is part of the culture today, it was not so 30 years ago. In the 1960s, he explained, there was strong resistance to such group approaches to natural language research, the implication being that basic scientific work was needed first. For example, Liberman cited Languages and Machines: Computers in Translation and Linguistics , which recommended that machine translation funding "should be spent hardheadedly toward important, realistic, and relatively short-range goals" ( NRC, 1966 ). Following the release of that report, U.S. funding for machine translation research decreased essentially to zero for more than 20 years. The committee felt that science should precede engineering in such cases.

Bell Laboratories' John Pierce, the chair of that National Research Council study, added his personal opinions a few years later in a letter to the Journal of the Acoustical Society of America :

A general phonetic typewriter is simply impossible unless the typewriter has an intelligence and a knowledge of language comparable to those of the native speaker of English. . . . Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets into his head that he can solve “the problem.” The basis for this is either individual inspiration (the mad “inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach) . . . The typical recognizer . . . builds or programs an elaborate system that either does very little or flops in an obscure way. A lot of money and time are spent. No simple, clear, sure knowledge is gained. The work has been an experience, not an experiment. ( Pierce, 1969 )

While Pierce was not referring to p-values or replicability, Liberman pointed out that he was talking about whether scientific progress was being made toward a reasonable goal.

The Defense Advanced Research Projects Agency (DARPA) conducted its Speech Understanding Research Project from 1972 through 1975, which used classical artificial intelligence to improve speech understanding. According to Liberman, this project was viewed as a failure, and funding ceased after 3 years. After this, there was no U.S. research funding for machine translation or automatic speech recognition until 1986.

Liberman noted that Pierce was not the only person to be skeptical of research and development investment in the area of human language technology. By the 1980s, many informed American research managers were equally skeptical about the prospects, and relatively few companies worked in this area. However, there were people who thought various sorts of human language technology were needed and ought, in principle, to be feasible. In the mid-1980s, according to Liberman, the Department of Defense (DoD) began discussing whether it should restart research in this area.

Charles Wayne, a DoD program manager at the time, proposed to design a speech recognition research program with a well-defined objective evaluation metric that would be applied by a neutral agent, the National Institute of Standards and Technology (NIST), on shared data sets. Liberman explained that this approach would ensure that “simple, clear, sure knowledge” was gained because it would require the participants to reveal their methods at the time the evaluation was revealed. The program was set up accordingly in 1985 ( Pallett, 1985 ).

From this experiment the common task structure was born, in which a detailed evaluation plan is developed in consultation with researchers and published as a first step in the project. Automatic evaluation software (originally written and maintained by NIST but developed more widely now) was published at the start of the project to share data and withhold evaluation test data. Liberman noted that not everyone liked this idea, and many people were skeptical that the research would lead anywhere. Others were disgruntled because the task and evaluation metrics were too explicit, leading some to declare this type of research undignified and untrustworthy.

However, in spite of this resistance and skepticism, Liberman reported that the plan worked. And because funders could measure progress over time, this model for research funding facilitated the resumption of funding for the field. The mechanism also allowed the community to methodically surmount research challenges, in Liberman's view, because the evaluation metrics were automatic and the evaluation code was public. The same researchers who had objected to being tested twice a year began testing themselves hour by hour to try algorithms and see what worked best. He also credited the common task method with creating a new culture where researchers exchanged methods and results on shared data with a common metric. This participation in the culture became so valuable that many research groups joined without funding.

As an example of a project that used the common task method, Liberman described a text retrieval program in 1990 in which four contractors were given money for working on a project. Because of the community engagement across the field, 40 groups from all around the world participated in the evaluation workshop because it offered them an opportunity to access research results for free.

An additional benefit of the common task method is that it creates a positive feedback loop. Liberman explained that when every program has to interpret the same ambiguous evidence, ambiguity resolution becomes a gambling game that rewards the use of statistical methods. That, in turn, led to what has become known as machine learning. Liberman noted that, given the nature of speech and language, statistical methods need large training sets, which reinforces the value of shared data. He also asserted that researchers appreciated iterative training cycles because they seem to create simple, clear, and sure knowledge, which motivated participation in the common-task culture.

Over the past 30 years, variants of this method have been applied to many other problems, according to Liberman, including machine translation, speaker identification, language identification, parsing, sense disambiguation, information retrieval, information extraction, summarization, question answering, optical character recognition, sentiment analysis, image analysis, and video analysis. Liberman said that the general experience is that error rates decline by a fixed percentage each year to an asymptote, depending on task and data quality. He noted that progress usually comes from many small improvements and that shared data play a crucial role because they can be reused in unexpected ways.

Liberman concluded by stating that while science and engineering cultures vary, sharing data and problems lowers the cost of the data entry, creates intellectual communities, and speeds up replication and extension.

A participant commented that there is a debate about whether subject matter experts can cooperate with computational scientists and statisticians, particularly at this point in time when some statistical methods are outperforming expert rule-based systems. Liberman observed that DARPA essentially forced these communities to work together and managed to change the culture of both fields.

Micah Altman, Massachusetts Institute of Technology

To begin his presentation, Micah Altman outlined how informatics could improve reproducibility. He partitioned these thoughts into two categories: formal properties and systems properties. Formal properties, Altman explained, are properties of information flows and their management that tend to support reproducibility-related inferences. These include transparency, auditability, provenance, fixity, identification, durability, integrity, repeatability, self-documentation, and nonrepudiation. These properties can be applied to different stages and entities, as well as to components of the information system. Systems properties describe how the system interacts with users and what incentives and culture they engender. They include factors such as barriers to entry, ease of use, support for intellectual communities, speed and performance, security, access control, personalization, credit and attribution, and incentives for well-founded trust among actors. There is also the larger question of how the system integrates into the research ecosystem, which includes issues such as sustainability, cost, and the capability for inducing well-founded trust in the system and its outputs.

Altman noted that reproducibility claims are not formulated as direct claims about the world. Instead, they are formulated as questions such as the following:

  • What claims about information are implied by reproducibility claims or issues?
  • What properties of information and information flow are related to those claims?
  • What would possible changes to information processing and flow yield? How much would they cost?

He noted that many elements of the system might be targeted to improve today's situation. People are at the root of the ecosystem, where they affect decisions of theory; select and apply methods; observe and edit data; select, design, and perform analyses; and create and apply documents. The methods also interact, intervene, and simulate in the real world. The researcher's information flow over time ranges among creation/collection, storage/ingest, processing, internal sharing, analysis, external dissemination/publication, reuse (scientific, educational, scientometric, and institutional), and long-term access. Altman reiterated the point made by many of the workshop's speakers that there are many operational reproducibility claims and interventions, such as those outlined in Table 3.2 , that are often referred to by the same or inverted names in different communities.

TABLE 3.2. Some Types of Reproducibility Issues and Use Cases.


There is an overlap between social science, empirical methods, and statistical computation, according to Altman. In all cases there is a theory guiding the choice of how to conceptualize the world or conceptualize the approach, which leads to algorithms and protocols for solving a particular inference problem. Those protocols or algorithms are implemented in particular coding rules (e.g., instrumentation design and software) and are executed to produce details and context. Informatics tools can be used in a number of environments to deal with information flow, depending on what properties and reproducibility claims are being examined.

To conclude, Altman posed the following questions about how to better support reproducibility with information infrastructure:

  • How can we better identify the inferential claims implied by a specific set of (non)reproducibility claims and issues?
  • Which information flows and systems are most closely associated with these inferential claims?
  • Which properties of information systems support generating these inferential claims?

Altman also thanked his collaborators, namely Kobbi Nissim, Michael Bar-Sinai, Salil Vadhan, Jeff Gill, and Michael P. McDonald, and provided the following references to related work: Allen et al. (2014) ; Altman and Crosas (2013) ; Garnett et al. (2013) ; Altman et al. (2001 , 2011 ); Altman (2002 , 2008 ); Altman and King (2007) ; Altman et al. (2004) ; and Altman and McDonald (2001 , 2003 ).

A group of minimal conditions is necessary to provide adequate evidence of a causal relationship between an incidence and a possible consequence.

The Many Labs Replication Project was a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of important findings in social psychology ( Klein et al., 2014 ; Nosek, 2014 ; Nauts et al., 2014 ; Wesselmann et al., 2014 ; Sinclair et al., 2014 ; Vermeulen et al., 2014 ; Müller and Rothermund, 2014 ; Gibson et al., 2014 ; Moon and Roeder, 2014 ; IJzerman et al., 2014 ; Johnson et al., 2014 ; Lynott et al., 2014 ; Žeželj and Jokić, 2014 ; Blanken et al., 2014 ; Calin-Jageman and Caldwell, 2014 ; Brandt et al., 2014 ). More information is available at Open Science Framework, "Investigating Variation in Replicability: A 'Many Labs' Replication Project," last updated September 24, 2015, https://osf.io/wx7ck/ .

The Consolidated Standards of Reporting Trials website is http://www ​.consort-statement.org , accessed January 6, 2016.

The Strengthening the Reporting of Observational Studies in Epidemiology website is http://www ​.strobe-statement.org , accessed January 12, 2016.

The Quality Assessment of Diagnostic Accuracy Studies website is http://www ​.bris.ac.uk/quadas/ , accessed January 12, 2016.

See, for example, the Quality of Reporting of Meta-Analyses checklist for K.B. Filion, F.E. Khoury, M. Bielinski, I. Schiller, N. Dendukuri, and J.M. Brophy, “Omega-3 fatty acids in high-risk cardiovascular patients: A meta-analysis of randomized controlled trials,” BMC Cardiovascular Disorders 10:24, 2010, electronic supplementary material, Additional file 1: QUORUM Checklist, http://www ​.biomedcentral ​.com/content/supplementary ​/1471-2261-10-24-s1.pdf .

The Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies website is http://www ​.dcn.ed.ac.uk/camarades/ , accessed January 12, 2016.

The REporting recommendations for tumour MARKer prognostic studies website is http://www ​.ncbi.nlm.nih ​.gov/pmc/articles/PMC2361579/ , accessed January 12, 2016.

The Grading of Recommendations, Assessment, Development and Evaluations website is http://www ​.gradeworkinggroup.org/ , accessed January 12, 2016.

See Chapter 2 for Benjamini's previous discussion of this topic.

The Observational Medical Outcomes Partnership website is http://omop ​.org , accessed January 12, 2016.

The Observational Health Data Sciences and Informatics website is http://www ​.ohdsi.org/ , accessed January 12, 2016.

The Center for Open Science's reproducibility project in psychology website is https://osf ​.io/ezcuj/wiki/home/ , accessed January 12, 2016.

The Center for Open Science's reproducibility project in cancer biology website is https://osf ​.io/e81xl/wiki/home/ , accessed January 12, 2016.

Results reported subsequent to the time of the February 2015 workshop.

The Conference on Natural Language Learning website is http://ifarm ​.nl/signll/conll/ , accessed January 12, 2016.

The Open Keyword Search Evaluation website is http://www ​.nist.gov/itl/iad/mig/openkws ​.cfm , accessed January 12, 2016.

The Open Machine Translation Evaluation website is http://www ​.nist.gov/itl/iad/mig/openmt . cfm, accessed January 12, 2016.

The Reconnaissance de Personnes dans les Émissions Audiovisuelles website is https://tel ​.archives-ouvertes ​.fr/tel-01114399v1 , accessed January 12, 2016.

The Speaker Recognition Evaluation website is http://www ​.itl.nist.gov ​/iad/mig/tests/spk/ , accessed January 12, 2016.

The Text REtrieval Conference website is http://trec ​.nist.gov , accessed January 12, 2016.

The TREC Video Retrieval Evaluation website is http://trecvid ​.nist.gov , accessed January 12, 2016.

The Text Analysis Conference website is http://www ​.nist.gov/tac/ , accessed January 12, 2016.

The Street View House Numbers website is http://ufldl ​.stanford.edu/housenumbers/ , accessed January 12, 2016.

Source: Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Academies of Sciences, Engineering, and Medicine. Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: Summary of a Workshop. Washington, DC: National Academies Press; 2016. Chapter 3, Conceptualizing, Measuring, and Studying Reproducibility.


Reproducibility vs Replicability | Difference & Examples

Published on August 19, 2022 by Kassiani Nikolopoulou. Revised on June 22, 2023.

The terms reproducibility, repeatability, and replicability are sometimes used interchangeably, but they mean different things.

  • A research study is reproducible when the existing data is reanalysed using the same research methods and yields the same results. This shows that the analysis was conducted fairly and correctly.
  • A research study is replicable (or repeatable) when the entire research process is conducted again, using the same methods but new data, and still yields the same results. This shows that the results of the original study are reliable.

A survey of 60 children between the ages of 12 and 16 shows that football and hockey are the most popular sports. Football received 20 votes and hockey 18.

An independent researcher reanalyses the survey data and also finds that 20 children chose football and 18 children chose hockey. This makes the research reproducible.
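To make the idea concrete, the reanalysis step can be scripted. The sketch below is illustrative only: it assumes the original team shared the raw responses as a CSV file named survey_responses.csv with a favourite_sport column (both names are hypothetical), and it simply re-runs the tally that produced the published counts.

```python
# Minimal reproducibility check: re-run the original tally on the shared raw data.
# Assumes a hypothetical file "survey_responses.csv" with one row per child and a
# "favourite_sport" column; adjust the names to match the data actually shared.
import csv
from collections import Counter

with open("survey_responses.csv", newline="") as f:
    counts = Counter(row["favourite_sport"] for row in csv.DictReader(f))

print(counts.most_common())   # e.g. [("football", 20), ("hockey", 18), ...]

# The reanalysis reproduces the published result only if these counts match it.
assert counts["football"] == 20 and counts["hockey"] == 18
```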


Why reproducibility and replicability matter in research

Reproducibility and replicability enhance the reliability of results. They allow researchers to check the quality of their own work or that of others, which in turn increases the chance that the results are valid and not affected by research bias.

On the other hand, reproduction alone does not show whether the results are correct. Because it does not involve collecting new data, reproducibility is a minimum necessary condition: it shows that the findings are transparent and that the analysis was carried out as reported.

To make research reproducible, it is important to provide all the necessary raw data, so that anyone can run the analysis again and, ideally, recreate the same results. Omitted variables, missing data, or mistakes leading to information bias can prevent your research from being reproducible.

To make your research reproducible, you describe step by step how you collected and analysed your data. You also include all the raw data in the appendix: a list of the interview questions, the interview transcripts, and the coding sheet you used to analyse your interviews.

Sometimes researchers also conduct replication studies. These studies investigate whether researchers can arrive at the same scientific findings as an existing study while collecting new data and completing new analyses.

In the original study, the researchers administered an online survey, and the majority of the participants (58%) reported that they waste 10% or less of procured food. They also found that guilt and setting a good example were the main motivators for reducing food waste, rather than economic or environmental factors.

Together with your research group, you decide to conduct a replication study: you collect new data in the same state, this time via focus groups. Your findings are consistent with the initial study, which makes it replicable.

Overall, repeatability and reproducibility ensure that scientists remain honest and do not invent or distort results to get better outcomes. In particular, testing for reproducibility can also be a way to catch any mistakes, biases, or inconsistencies in your data.


What is the replication crisis?

Unfortunately, findings from many scientific fields – such as psychology, medicine, or economics – often prove impossible to replicate. When other research teams try to repeat a study, they get a different result, suggesting that the initial study's findings are not reliable.

Some factors contributing to this phenomenon include:

  • Unclear definition of key terms
  • Poor description of research methods
  • Lack of transparency in the discussion section
  • Unclear presentation of raw data
  • Poor description of data analysis undertaken

Publication bias can also play a role. Scientific journals are more likely to accept original (non-replicated) studies that report positive, statistically significant results that support the hypothesis .

How to ensure reproducibility and replicability in your research

To make your research reproducible and replicable, it is crucial to describe, step by step, how to conduct the research. You can do so by focusing on writing a clear and transparent methodology section, using precise language and avoiding vague writing.

Transparent methodology section

In your methodology section, you explain in detail what steps you have taken to answer the research question. As a rule of thumb, someone who has nothing to do with your research should be able to repeat what you did based solely on your explanation.

For example, you can describe:

  • What type of research ( quantitative , qualitative , mixed methods ) you conducted
  • Which research method you used ( interviews , surveys , etc.)
  • Who your participants or respondents are (e.g., their age or education level)
  • What materials you used (audio clips, video recording, etc.)
  • What procedure you used
  • What data analysis method you chose (such as the type of statistical analysis )
  • How you ensured reliability and validity
  • Why you drew certain conclusions, and on the basis of which results
  • In which appendix the reader can find any survey questions, interviews, or transcripts

Sometimes, parts of the research may turn out differently than you expected, or you may accidentally make mistakes. This is all part of the process! It’s important to mention these problems and limitations so that they can be prevented next time. You can do this in the discussion or conclusion , depending on the requirements of your study program.

Use of clear and unambiguous language

You can also increase the reproducibility and replicability/repeatability of your research by always using crystal-clear writing. Avoid using vague language, and ensure that your text can only be understood in one way. Careful description shows that you have thought in depth about the method you chose and that you have confidence in the research and its results.

Here are a few examples.

  • Vague: The participants of this study were children from a school.
  • Clear: The 67 participants of this study were elementary school children between the ages of 6 and 10.
  • Vague: The interviews were transcribed and then coded.
  • Clear: The semi-structured interviews were first summarised, transcribed, and then open-coded.
  • Vague: The results were compared with a t test.
  • Clear: The results were compared with an unpaired t test (a scripted version of this example appears after the list).
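As a small illustration of the last pair, the analysis itself can be made unambiguous in code. The sketch below uses SciPy's Welch (unpaired, unequal-variance) t test; the group values are made up for illustration and are not taken from any study discussed here.

```python
# Sketch: making the reported test unambiguous in code (unpaired/Welch t test).
# The two groups below are illustrative numbers only.
import numpy as np
from scipy import stats

group_a = np.array([4.1, 3.8, 5.0, 4.6, 4.9, 4.4])
group_b = np.array([3.2, 3.9, 3.5, 3.1, 3.8, 3.4])

# equal_var=False requests Welch's unpaired t test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's unpaired t test: t = {t_stat:.2f}, p = {p_value:.3f}")
```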


Frequently asked questions about reproducibility, replicability and repeatability

Reproducibility and replicability are related terms.

  • Reproducing research entails reanalyzing the existing data in the same manner.
  • Replicating (or repeating) the research entails reconducting the entire analysis, including the collection of new data.
  • A successful reproduction shows that the data analyses were conducted in a fair and honest manner.
  • A successful replication shows that the reliability of the results is high.

The reproducibility and replicability of a study can be ensured by writing a transparent, detailed method section and using clear, unambiguous language.

Source: Nikolopoulou, K. (2023, June 22). Reproducibility vs Replicability | Difference & Examples. Scribbr. Retrieved August 26, 2024, from https://www.scribbr.com/methodology/reproducibility-repeatability-replicability/



Repeatability vs. Reproducibility

RJ Mackenzie

In measuring the quality of experiments, repeatability and reproducibility are key. In this article, we explore the differences between the two terms, and why they are important in determining the worth of published research. 

What is repeatability?

Repeatability is a measure of the likelihood that, having produced one result from an experiment, you can try the same experiment, with the same setup, and produce that exact same result. It’s a way for researchers to verify that their own results are true and are not just chance artifacts.

To demonstrate a technique’s repeatability, the conditions of the experiment must be kept the same. These include: 

  • Measuring tools
  • Other apparatus used in the experiment
  • Time period (taking month-long breaks between repetitions isn’t good practice) 

Bland and Altman authored an extremely useful paper in 1986 which highlighted one benefit of assessing repeatability: it allows one to make comparisons between different methods of measurement.

Previous studies had used similarities in the correlation coefficient ( r ) between techniques as an indicator of agreement. Bland and Altman showed that r actually measured the strength of the relation between two techniques, not the extent to which they agree with each other. This means r is quite meaningless in this context; if two different techniques were both designed to measure heart rate, it would be bizarre if they weren’t related to each other!
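A short simulation makes Bland and Altman's point concrete: two methods can be almost perfectly correlated while still disagreeing systematically, which the mean difference and limits of agreement reveal. The sketch below uses simulated heart-rate values; all numbers are invented for illustration.

```python
# Sketch: two heart-rate measurement methods that correlate strongly (high r)
# yet disagree systematically, which Bland-Altman analysis exposes.
# All numbers are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_hr = rng.uniform(55, 120, size=50)
method_a = true_hr + rng.normal(0, 2, size=50)
method_b = true_hr * 1.10 + rng.normal(0, 2, size=50)   # 10% proportional offset

r = np.corrcoef(method_a, method_b)[0, 1]
diff = method_b - method_a
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

print(f"r = {r:.3f}")                                   # close to 1 despite the offset
print(f"mean difference (bias) = {bias:.1f} bpm")
print(f"95% limits of agreement = {loa[0]:.1f} to {loa[1]:.1f} bpm")
```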

What is reproducibility?

The reproducibility of data is a measure of whether results in a paper can be attained by a different research team, using the same methods. This shows that the results obtained are not artifacts of the unique setup in one research lab. It’s easy to see why reproducibility is desirable, as it reinforces findings and protects against rare cases of fraud, or less rare cases of human error, in the production of significant results.

Why are repeatability and reproducibility important?

Science is a method built on an approach of gradual advance backed up by independent verification and the ability to show that your findings are correct and transparent. Academic research findings are only useful to the wider scientific community if the knowledge can be repeated and shared among research groups. As such, irreproducible and unrepeatable studies are the source of much concern within science. 

What is the reproducibility crisis?

Over recent decades, science, in particular the social and life sciences, has seen increasing importance placed on the reproducibility of published studies. Large-scale efforts to assess the reproducibility of scientific publications have turned up worrying results. For example, a 2015 paper by a group of psychology researchers working as the "Open Science Collaboration" examined 100 experiments published in high-ranking, peer-reviewed journals: only around a third of the replications produced statistically significant results that matched the original findings. These efforts are part of a growing field of "metascience" that aims to take on the reproducibility crisis.


How can we improve reproducibility?

A lot of thought is being put into improving experimental reproducibility. Below are just some of the ways you can improve reproducibility:

  • Journal checklists – more and more journals are coming to understand the importance of including all relevant details in published studies, and are bringing in mandatory checklists for any published papers. The days of leaving out sample numbers and animal model descriptions in methods sections are over, and blinding and randomization should be standard where possible.
  • Strict on stats – power calculations, multiple comparisons tests and descriptive statistics are all essential to making sure that reported results are statistically sound. 
  • Technology can lend a hand – automating processes and using high-throughput systems can improve accuracy in individual experiments, and enable more measurements to be taken in a given time, increasing sample numbers. 



Statistical considerations for repeatability and reproducibility of quantitative imaging biomarkers


Shangyuan Ye, Jeong Youn Lim, Wei Huang, Statistical considerations for repeatability and reproducibility of quantitative imaging biomarkers, BJR|Open , Volume 4, Issue 1, 1 January 2022, 20210083, https://doi.org/10.1259/bjro.20210083


Quantitative imaging biomarkers (QIBs) are increasingly used in clinical studies. Because many QIBs are derived through multiple steps in image data acquisition and data analysis, QIB measurements can produce large variabilities, posing a significant challenge in translating QIBs into clinical trials, and ultimately, clinical practice. Both repeatability and reproducibility constitute the reliability of a QIB measurement. In this article, we review the statistical aspects of repeatability and reproducibility of QIB measurements by introducing methods and metrics for assessments of QIB repeatability and reproducibility and illustrating the impact of QIB measurement error on sample size and statistical power calculations, as well as predictive performance with a QIB as a predictive biomarker.

Medical imaging modalities such as CT, MRI, and positron emission tomography (PET) are routinely used in clinical practice for disease screening, diagnosis, staging, therapeutic monitoring, evaluation of residual disease, and assessment of disease recurrence. Traditionally, image contrast-based qualitative interpretations of medical images are the most commonly employed radiology practice. With advances in imaging technologies in recent years, imaging metrics that can quantify tissue biological and physiological properties, in addition to those that quantify tissue morphology such as disease size, are increasingly used in research and early phase clinical trials to characterize disease and response to treatment. A recent review 1 by a group of principal investigators from the Quantitative Imaging Network (National Cancer Institute, National Institutes of Health) has called for wider incorporation of quantitative imaging methods into clinical trials, and eventually, clinical practice for evaluation of cancer therapy response. In the emerging era of precision medicine, quantitative imaging biomarkers (QIBs) can be integrated with quantitative biomarkers from genomics, transcriptomics, proteomics, and metabolomics to facilitate patient stratification for individualized treatment strategy and improve treatment outcome. 2 A QIB is defined as “an objective characteristic derived from an in vivo image measured on a ratio or interval scale as an indicator of normal biological processes, pathogenic processes or a response to a therapeutic intervention.” 3 QIBs can be generally classified into five different types: structural, morphological, textural, functional, and physical property QIBs. 4 Kessler et al 3 have introduced terminologies related to QIBs for scientific studies. Study designs and statistical methods used for assessing QIB technical performances have been extensively reviewed. 4–8 Because many QIBs are derived through multiple steps in image data acquisition and data analysis that often involve different manufacturer scanner platforms and different computer algorithms and software tools, QIB measurements can produce large variabilities, which pose a significant challenge in translating QIBs into clinical trials, and ultimately, clinical practice. In order for a QIB and its changes to be interpretable in clinical settings across institutions and clinics for disease characterization and therapy response assessment, it is highly important to evaluate the repeatability and reproducibility of the QIB.

Both repeatability and reproducibility constitute the reliability of a QIB measurement. Repeatability refers to the precision of a QIB measured under identical conditions (e.g., using the same measurement procedure, same measurement system, same image analysis algorithm, and same location over a short period of time; also known as the repeatability condition), 3,4 which is mainly a measure of the within-subject variability and the variability caused by the same imaging device over time. On the other hand, reproducibility refers to the precision of a QIB measured under different experimental conditions 3,4 (also known as the reproducibility condition), which is mainly a measure of the variability associated with different measurement systems, imaging methods, study sites, and population. In recent years, many studies have been conducted to investigate the reliability of different QIBs. For example, Yokoo et al 9 studied the precision of hepatic proton-density fat-fraction measurements by using MRI; Lodge 10 examined the repeatability of standardized uptake value (SUV) in oncologic 18F-fludeoxyglucose (FDG) PET studies; Shukla-Dave et al 12 reviewed and emphasized the need for assessment of reproducibility and repeatability of MRI QIBs in oncologic studies; Park et al 13 reviewed the challenges in reproducibility of quantitative radiomics metrics; Fedorov et al 14 and Schwier et al 15 assessed the repeatability of radiomics features from multiparametric MRI of small prostate tumors; Hernando et al 16 quantified the reproducibility of MRI-based proton-density fat-fraction measurements in a phantom across vendor platforms at different field strengths; Kalpathy et al 17 investigated the repeatability and reproducibility of tumor volume estimate from CT images of lung cancer; Lin et al 18 evaluated the repeatability of 18F-NaF PET–derived SUV metrics; Baumgartner et al 11 studied the repeatability of 18F-FDG PET brain imaging; Jafer et al., 19 Winfield et al., 20 Weller et al., 21 Lecler et al., 22 and Lu et al 23 estimated the repeatability and reproducibility of the apparent diffusion coefficient (ADC) derived from diffusion-weighted MRI; Hagiwara et al 24 studied the repeatability and reproducibility of quantitative relaxometry with a multidynamic multiecho MRI sequence using a phantom and normal healthy human subjects; Jafari-Khouzani et al 25 appraised the repeatability of brain tumor perfusion measurement using dynamic susceptibility contrast MRI; Han et al 26 studied the repeatability and reproducibility of the ultrasonic attenuation coefficient and backscatter coefficient in the liver; Hagiwara et al 27 reviewed the repeatability of MRI and CT QIBs; Wang et al 28 examined the repeatability and reproducibility of 2D and 3D hepatic MR elastography markers in healthy volunteers; and Olin et al 29 investigated the reproducibility of the MR-based attenuation correction factors in PET/MRI and its impact on 18F-FDG PET quantification in patients with non-small cell lung cancer.

In this article, we review the statistical aspects of the repeatability and reproducibility of QIBs. We introduce methods and metrics for the assessment of QIB repeatability and reproducibility, and illustrate the impact of QIB measurement error on sample size and statistical power calculations, as well as on predictive performance when a QIB is used as a predictive biomarker.

Measurement error model

Precision of a QIB is defined as the closeness of agreement between repeated measurements of the QIB, 3 and repeatability and reproducibility comprise different sources of variability that may impact the precision of a given QIB. Measurement error is defined as the difference between a measured quantity and its true value. 30 Any source of variability can cause measurement errors in QIB measurements. Though it is critical to identify and obtain valid inference on the impact of every component of variation (see Section 3), we start by introducing a general model of measurement error. Table 1 lists commonly used symbols in this article.

TABLE 1. List of symbols and notations

$\sim$ : Distributed as
$N(\mu, \sigma^2)$ : Normal distribution with mean $\mu$ and variance $\sigma^2$
$df$ : Degrees of freedom
$\chi^2_{df,\alpha}$ : Quantile function of the chi-square distribution
$\mathrm{Var}(\cdot)$ : Variance
$Y_{itl}$ : Measured QIB value from the $l$th measurement of a repeated QIB measurement made at time $t$ for subject $i$
$X_{it}$ : True QIB value for subject $i$ at time $t$
$n$ : Number of subjects included in the study
$m$ : Number of replicates for each subject
$\mu$ : Mean of $X_i$
$\epsilon_{itl}$ : Measurement error
$\delta_{ik}$ : Within-subject error (under the repeatability condition)
$\gamma_j$ : Between-condition error (under the reproducibility condition)
$(\gamma\delta)_{ij}$ : Interaction between subject and condition
$\sigma_X^2$, $\sigma_\epsilon^2$, $\sigma_\delta^2$, $\sigma_\gamma^2$, $\sigma_{\gamma\delta}^2$ : Variance of $X_i$, $\epsilon_{itl}$, $\delta_{ik}$, $\gamma_j$, and $(\gamma\delta)_{ij}$, respectively
$\hat{\ }$ : Represents the estimator of a parameter
$H_0$ : Null hypothesis
$H_1$ : Alternative hypothesis

QIB, quantitative imaging biomarker.

In a measurement error model, instead of the true QIB value, we can only observe QIB values with errors that are random across different QIB measurements. If the errors are constant for all measurements, then the error is called bias. 3 Since both repeatability and reproducibility mainly concern random errors, we assume that there is no bias and that the random errors are independent and identically distributed with mean equal to zero and variance $\sigma_\epsilon^2$. Thus, the only unknown parameter $\sigma_\epsilon^2$ measures the level of variability, with larger values indicating larger variability or worse precision. Let $Y_{itl}$ be the measured QIB value from the $l$th measurement of a repeated QIB measurement made at time $t$ for subject $i$, with $X_{it}$ being the corresponding true value; the measurement error model can be expressed as

$Y_{itl} = X_{it} + \epsilon_{itl}, \qquad \epsilon_{itl} \sim N(0, \sigma_\epsilon^2) \qquad (1)$
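As a minimal illustration of model (1), the sketch below simulates repeated measurements of a phantom with a known true value and recovers the error standard deviation; the true value, error SD, and number of replicates are arbitrary illustrative choices.

```python
# Sketch: simulating measurement error model (1) for a phantom scanned repeatedly.
# The true value X and the error SD are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
x_true = 3.0                 # known ground-truth QIB value of the phantom
sigma_eps = 0.25             # measurement error SD to be recovered
y = x_true + rng.normal(0.0, sigma_eps, size=200)   # Y = X + epsilon

print(f"estimated bias     = {y.mean() - x_true:+.3f}   (should be near 0)")
print(f"estimated error SD = {y.std(ddof=1):.3f}   (should be near {sigma_eps})")
```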

Repeatability and reproducibility represent different sources of variability (causing the measurement error $\epsilon_{itl}$) in model (1). Repeatability refers to the precision of a QIB measured under identical conditions, while reproducibility refers to the precision of a QIB measured under different experimental conditions. 3,4 Kessler et al 3 recommend identifying and evaluating separately each experimental component that contributes to reproducibility-related variability. However, in many real-life settings, it is very difficult to independently investigate each source of variability under the reproducibility condition. Thus, the statistical model introduced below only assumes a single parameter representing all sources of variability associated with the reproducibility condition.

Following the measurement error model (2), we consider a simplified setting where all measurements for subject $i$ are made within a relatively short time interval so that the true value $X_i$ remains unchanged. That is, let $Y_{ijk}$ be the $k$th repeated QIB measurement made on subject $i$ under experimental condition $j$ (different experimental conditions may include different measurement systems, imaging methods, and countries/regions). In a similar fashion to model (1), we employ a linear relationship between $Y_{ijk}$ and $X_i$, 4,5,32 but further break down the measurement error $\epsilon_{ijk}$ into different components of repeatability- and reproducibility-related errors. Similar to model (2), we assume there is no bias or proportional bias, and the model that accounts for both repeatability- and reproducibility-related errors can be written as

$Y_{ijk} = X_i + \gamma_j + \delta_{ik} + (\gamma\delta)_{ij} \qquad (3)$

The terms $\delta_{ik}$, $\gamma_j$, and $(\gamma\delta)_{ij}$ represent different components of measurement error caused by within-subject variability (under the repeatability condition), between-condition variability (under the reproducibility condition), and the interaction between subject and condition, respectively, and we assume they follow normal distributions. In general, the random error variances $\sigma_\delta^2$, $\sigma_\gamma^2$, and $\sigma_{\gamma\delta}^2$ are the key performance characteristics used in repeatability and reproducibility studies (see details below).

Repeatability

Many studies have been conducted to investigate the repeatability of QIBs for different imaging modalities, including but not limited to CT, 17,18 MRI, 9,12,14,15,20–25 and PET. 10,11,18,29 Test–retest studies are usually performed to evaluate the repeatability of QIB measurements. These studies usually require each subject to be scanned repeatedly over a short period of time with the assumption that $X_i$ does not change. If the repeatability study condition holds, i.e., all repeated scans are performed at the same location, with the same measurement procedure, and using the same measurement system and image analysis algorithm, an estimate of QIB repeatability can be calculated.

In practice, test–retest studies can be performed fairly easily using a phantom, but could be difficult with human subjects due to expenses and logistic problems. Therefore, repeatability studies using human subjects are often limited to a small number of replicates (usually two) on each subject. Furthermore, for imaging studies with contrast administration, because there usually exists a required contrast washout period between two consecutive scans ( e.g. consecutive dynamic contrast-enhanced (DCE) MRI scans are usually required to be performed at least 24 h apart), 4,31 “coffee-break” experiments where there is only a short break between repeated scans are not always possible. Thus, for repeatability studies with long interval between scans, possible changes in the true values should also be considered in the model.

Specifically, for a test–retest study with $n$ subjects and $m$ replicates for each subject, since all experiments are conducted under the same experimental condition, without loss of generality we set $j = 1$ and model (3) becomes

$Y_{i1k} = X_i + \gamma_1 + \delta_{ik} \qquad (4)$

for $i = 1, \dots, n$ and $k = 1, \dots, m$. The random effect variance $\sigma_\delta^2$ is the key performance characteristic used in a repeatability study, with a smaller value corresponding to better repeatability. Because we only consider QIB measurements from a single-site, single-measurement-system study for a test–retest repeatability study, the random error $\gamma_j$ in (3), which measures the reproducibility-related variability such as between-site variability, becomes a condition-specific systematic error $\gamma_1$ in (4), representing the condition-specific bias. As discussed in Section 2, the constant bias $\gamma_1$ cannot be identified through the test–retest study and should be estimated through a phantom study with ground-truth values. The term $(\gamma\delta)_{ij}$ in model (3) vanishes because there is only one study condition and the interaction effect cannot be observed.

Many metrics related to the estimate of the within-subject variance $\sigma_\delta^2$ have been proposed to quantify the magnitude of repeatability. 4 Table 2 shows the metrics considered in this article for repeatability measurement. The within-subject standard deviation (wSD) is the most commonly used metric for assessing repeatability. It is the standard deviation ($\sigma_\delta$) of repeated measurements for a single subject. If we assume all subjects have the same $\sigma_\delta$ and the ground-truth values $X_i$ are independent and normally distributed, wSD can be obtained by fitting a linear mixed-effects model with subject-specific random intercepts using maximum likelihood. 33 Although other estimation procedures such as method-of-moments estimators can also provide consistent estimation of wSD, the maximum likelihood method is usually preferred for small sample sizes when model (3) is correctly constructed. Alternatively, wSD can be calculated by averaging the within-subject sample variances 32 (equation (5)).

TABLE 2. Repeatability and reproducibility metrics

wSD : Within-subject standard deviation
ICC : Intraclass correlation coefficient; proportion of total variation associated with the variation of the true value
wCV : Within-subject coefficient of variation; ratio of the within-subject standard deviation to its mean
tSD (reproducibility SD) : Total standard deviation under reproducibility conditions
CCC : Concordance correlation coefficient; measures agreement between two experimental conditions

$\widehat{\mathrm{wSD}}^2 = \frac{1}{n(m-1)} \sum_{i=1}^{n} \sum_{k=1}^{m} \left( Y_{i1k} - \bar{Y}_i \right)^2 \qquad (5)$

where $\bar{Y}_i = \frac{1}{m} \sum_{k=1}^{m} Y_{i1k}$. This estimator (equation (5)) is equivalent to the estimator obtained from the one-way analysis of variance (ANOVA) model. 34

Another closely related metric for repeatability measurement is the intraclass correlation coefficient (ICC), 35,36 which is defined as the proportion of total variation that is associated with the variation of the true value. That is, if we assume $X_i \sim N(\mu, \sigma_X^2)$, we have $Y_{i1k} \sim N(\mu + \gamma_1, \sigma_X^2 + \sigma_\delta^2)$ and

$\mathrm{ICC} = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_\delta^2} \qquad (6)$

Comparing the variance of the QIB measurement $Y_{i1k}$ (denominator of equation (6)) with the variance of the true value $X_i$ (numerator of equation (6)), we can observe that the extra variation equals the within-subject variation $\sigma_\delta^2$. If $\sigma_\delta^2$ is much smaller than $\sigma_X^2$, then $\mathrm{Var}(Y_{i1k}) \approx \mathrm{Var}(X_i)$ and ICC is close to 1, which indicates that the measurement error contributes little to the variation of the QIB measurement. In other words, a larger ICC implies better repeatability and a smaller ICC implies worse repeatability. ICC can be estimated with known estimates of $\sigma_\delta$ and $\sigma_X$ (equation (6)), both of which can be obtained by fitting either a linear mixed-effects model with subject-specific random intercepts or a one-way ANOVA model. When using ICC as the measure of repeatability, it is crucial to ensure that the subjects participating in the study are representative of the study population, so that the estimated QIB variation reflects the variation of the study population (the variance of $X_i$).
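The following sketch illustrates ANOVA-type estimation on simulated test–retest data: it computes wSD via equation (5) and ICC via equation (6) using one-way ANOVA mean squares. All parameter values are illustrative only.

```python
# Sketch: ANOVA-type estimates of wSD (equation (5)) and ICC (equation (6))
# from a simulated test-retest study (n subjects, m = 2 replicates each).
import numpy as np

rng = np.random.default_rng(2)
n, m = 60, 2
sigma_X, sigma_delta = 1.0, 0.3

x = rng.normal(5.0, sigma_X, size=n)                         # true subject values
y = x[:, None] + rng.normal(0.0, sigma_delta, size=(n, m))   # repeated measurements

subj_means = y.mean(axis=1)
wsd2_hat = ((y - subj_means[:, None]) ** 2).sum() / (n * (m - 1))   # equation (5)
wsd_hat = np.sqrt(wsd2_hat)

# Between-subject variance via one-way ANOVA mean squares: sigma_X^2 = (MSB - MSW)/m
msb = m * ((subj_means - y.mean()) ** 2).sum() / (n - 1)
sigma_X2_hat = max((msb - wsd2_hat) / m, 0.0)
icc_hat = sigma_X2_hat / (sigma_X2_hat + wsd2_hat)                  # equation (6)

print(f"wSD estimate: {wsd_hat:.3f} (true {sigma_delta})")
print(f"ICC estimate: {icc_hat:.3f} (true {sigma_X**2 / (sigma_X**2 + sigma_delta**2):.3f})")
```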

The within-subject coefficient of variation (wCV) (Table 2) is an alternative metric to wSD. The wCV is defined as the ratio of the within-subject standard deviation to its mean, and it is commonly used for test–retest studies of repeatability when the wSD is not constant among the studied subjects and model (4) becomes inadequate. A useful alternative to model (4) is to assume that the wSD increases proportionally with the true value $X_i$, i.e.,

$Y_{i1k} = \gamma_1 + X_i \, \delta_{ik} \qquad (7)$

In model (7), with the extra constraint $\delta_{ik} > 0$, it is more adequate to assume that $\delta_{ik}$ follows a log-normal or Weibull distribution 37 with the mean of $\delta_{ik}$ equal to one. Raunig et al 4 suggest using the log-normal distribution so that the log-transformed QIB measurement $\log(Y_{i1k})$ is normally distributed (after adjusting for the site-specific bias $\gamma_1$):

$\log(Y_{i1k} - \gamma_1) = \log(X_i) + \delta'_{ik}, \qquad \delta'_{ik} \sim N(0, \sigma_{\delta'}^2) \qquad (8)$

Under model (8), wCV only depends on the log-transformed within-subject variance $\sigma_{\delta'}^2$ 4 through the form

$\mathrm{wCV} = \sqrt{\exp(\sigma_{\delta'}^2) - 1} \qquad (9)$

Therefore, we can apply any of the estimators of wSD to the log-transformed QIB measurements to obtain valid estimates of $\sigma_{\delta'}^2$, and wCV can be estimated by plugging $\hat{\sigma}_{\delta'}^2$ into equation (9). Without the log-normal distribution assumption, by mimicking estimator (5), when $m = 2$ we can still estimate wCV by pooling and averaging the within-subject sample coefficients of variation 32

$\widehat{\mathrm{wCV}} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \frac{ (Y_{i11} - Y_{i12})^2 }{ 2 \bar{Y}_i^2 } } \qquad (10)$

Both ICC and wCV have the benefit of being dimensionless, which makes them useful for comparing quantities measured on different scales.
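As a sketch of the two wCV estimation routes described above (the log-transform route and the pooled sample-CV route for m = 2), the code below simulates test–retest data with multiplicative error; the parameter values are illustrative, and the formulas follow the reconstructed equations (9) and (10).

```python
# Sketch: two estimates of the within-subject coefficient of variation (wCV)
# under a multiplicative-error model, using simulated test-retest data (m = 2).
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(1.0, 5.0, size=n)          # true subject values
sigma_log = 0.15                           # SD of the log-scale error
err = rng.normal(0.0, sigma_log, size=(n, 2))
y = x[:, None] * np.exp(err)               # multiplicative error, mean close to 1

# (a) log-transform route: estimate the log-scale within-subject variance, then eq. (9)
logy = np.log(y)
s2_log = ((logy - logy.mean(axis=1, keepdims=True)) ** 2).sum() / n   # n*(m-1) with m = 2
wcv_lognormal = np.sqrt(np.exp(s2_log) - 1.0)

# (b) pooled sample-CV route, eq. (10), valid without the log-normal assumption
wcv_pooled = np.sqrt(np.mean((y[:, 0] - y[:, 1]) ** 2 / (2.0 * y.mean(axis=1) ** 2)))

print(f"wCV (log-normal, eq. 9): {wcv_lognormal:.3f}")
print(f"wCV (pooled, eq. 10):    {wcv_pooled:.3f}")
print(f"approximate true wCV:    {np.sqrt(np.exp(sigma_log**2) - 1):.3f}")
```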

Instead of a single point estimate of the repeatability metric (whether wSD, ICC, or wCV), it is desirable to make inferences about these values by either constructing confidence intervals (CIs) or performing hypothesis tests. Both confidence intervals and hypothesis tests involve estimating the distribution of the estimator and thus depend on the choice of the estimation method. For wSD (denoted as $\sigma$ with corresponding estimator $\hat{\sigma}$), if ANOVA-type estimators such as equations (5) and (10) are used, $n(m-1)\hat{\sigma}^2 / \sigma^2$ follows a $\chi^2$ distribution with degrees of freedom ($df$) equal to $n(m-1)$. Thus, the $(1-\alpha) \times 100\%$ CI for $\sigma$ is

$\left( \sqrt{ \frac{ n(m-1)\hat{\sigma}^2 }{ \chi^2_{df,\,1-\alpha/2} } },\; \sqrt{ \frac{ n(m-1)\hat{\sigma}^2 }{ \chi^2_{df,\,\alpha/2} } } \right) \qquad (11)$

where $\chi^2_{df,\,\alpha/2}$ and $\chi^2_{df,\,1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ quantiles, respectively, of the $\chi^2$ distribution with degrees of freedom $df$. To test whether the level of wSD is greater than a threshold value $c$, we conduct a hypothesis test with null hypothesis $H_0: \sigma^2 \leq c^2$ versus alternative hypothesis $H_1: \sigma^2 > c^2$. The corresponding test statistic is $T = n(m-1)\hat{\sigma}^2 / c^2$, and we reject the null hypothesis if $T > \chi^2_{df,\,1-\alpha}$. On the other hand, if maximum likelihood estimators are used, making inferences based on the asymptotic distribution of $\hat{\sigma}$ is usually problematic for small sample sizes, and numerical methods such as the bootstrap CI 38 or the profile likelihood CI 39 should be considered. For wCV, a CI for $\sigma_{\delta'}^2$ on the log-transformed data is first determined using formula (11); the CI for wCV can then be obtained from equation (9). Because the estimator of ICC is a nonlinear function of $\hat{\sigma}_\delta$ and $\hat{\sigma}_X$, its exact sampling distribution is not available. In this case, bootstrap confidence intervals have been used extensively in the literature. 40,41 We can also construct a confidence interval for $\widehat{\mathrm{ICC}}$ by approximating its sampling distribution using either the F (also known as the Satterthwaite approximation) or the beta distribution. 42 Based on Monte Carlo simulation of the sampling distribution of the generalized pivotal quantity of ICC, Ionan et al 43 suggest the generalized confidence interval proposed by Weerahandi. 44 Sample size calculation can be conducted using the method provided by A'Hern. 36
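The chi-square interval and the one-sided test for wSD can be computed directly. The sketch below uses SciPy's chi-square quantile function with illustrative numbers; the rejection rule is written with the upper-tail critical value, consistent with the level-alpha test described above.

```python
# Sketch: chi-square confidence interval (equation (11)) and one-sided test for wSD,
# given an ANOVA-type estimate. All numbers are illustrative.
from scipy.stats import chi2

n, m = 60, 2
wsd_hat = 0.31          # ANOVA-type estimate of the within-subject SD
alpha = 0.05
df = n * (m - 1)

lower = (df * wsd_hat**2 / chi2.ppf(1 - alpha / 2, df)) ** 0.5
upper = (df * wsd_hat**2 / chi2.ppf(alpha / 2, df)) ** 0.5
print(f"95% CI for wSD: ({lower:.3f}, {upper:.3f})")

# Test H0: sigma^2 <= c^2 against H1: sigma^2 > c^2 at level alpha
c = 0.25
T = df * wsd_hat**2 / c**2
reject = T > chi2.ppf(1 - alpha, df)     # reject for large T (upper-tail critical value)
print(f"T = {T:.1f}, reject H0: {reject}")
```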

Reproducibility

Reproducibility concerns the consistency, or precision, of QIB measurements made on the same subject with the same experimental design but under different experimental conditions, such as different measurement devices. Reproducibility of QIBs for different imaging modalities, such as CT, 17 MRI, 9,12,16,19,24 and PET, 13,29 has also been extensively studied. Although many experimental factors can be included in the reproducibility study condition, it is practically impossible to consider all conditions in a single reproducibility study. Raunig et al 4 provided a list of conditions that can be tested in reproducibility studies. Depending on which condition is being tested, reproducibility studies can be classified into two categories: (1) repeated measurement designs and (2) cohort measurement designs. For example, the former can be used to study the variability caused by different scanners, while the latter can be used to study the variability caused by different study sites. Because the within-subject variability is generally embedded in the variability under different experimental conditions, a reproducibility study can also generate repeatability results for each experimental condition. Specifically, in a repeated measurement design, subject $i$ is repeatedly measured $m$ times under each of the $J$ experimental conditions; in a cohort measurement design, each subject is repeatedly measured $m$ times under one of the experimental conditions. That is, a repeated measurement design requires each subject to be measured $m \times J$ times, while a cohort measurement design requires each subject to be measured only $m$ times. Model (3) is valid for both experimental designs, and the key performance characteristic is the sum of the random effect variances ($\sigma_\epsilon^2 = \sigma_\delta^2 + \sigma_\gamma^2 + \sigma_{\gamma\delta}^2$), which represents the total variation under the reproducibility study condition. However, subjects in the cohort measurement design are measured under only a single experimental condition, and thus the subject-condition interaction effect $(\gamma\delta)_{ij}$ does not exist. In this case, we can assume $\sigma_{\gamma\delta}^2 = 0$, and the total variation for the cohort measurement design becomes $\sigma_\delta^2 + \sigma_\gamma^2$.

Similar to repeatability studies, either linear mixed-effects models or two-way ANOVA can be used to fit the data and obtain valid estimates of $\sigma_\delta^2$, $\sigma_\gamma^2$, and $\sigma_{\gamma\delta}^2$. The square root of the total variance $\sigma_\epsilon^2$, denoted total SD (tSD) (Table 2) or reproducibility SD, can be used as a metric to quantify the magnitude of reproducibility. If estimators based on linear mixed-effects models are used, the sampling distributions of these estimators are unknown, and numerical methods such as the bootstrap or permutation should be used to make valid inferences about these parameters. On the other hand, if ANOVA-type estimators are used, all three terms $df\,\hat{\sigma}_\delta^2 / \sigma_\delta^2$, $df\,\hat{\sigma}_\gamma^2 / \sigma_\gamma^2$, and $df\,\hat{\sigma}_{\gamma\delta}^2 / \sigma_{\gamma\delta}^2$ follow $\chi^2$ distributions, and the corresponding CIs and test statistics can be easily obtained. However, because the sampling distribution of $\hat{\sigma}_\epsilon^2$, which equals $\hat{\sigma}_\delta^2 + \hat{\sigma}_\gamma^2 + \hat{\sigma}_{\gamma\delta}^2$, is unknown, numerical methods are recommended.
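The sketch below illustrates ANOVA-type variance-component estimation and the resulting tSD for a balanced repeated-measurement design, using data simulated from model (3); the design sizes and variance parameters are illustrative only.

```python
# Sketch: ANOVA-type variance components and total (reproducibility) SD, tSD, for a
# balanced repeated-measurement design: n subjects x J conditions x m replicates.
import numpy as np

rng = np.random.default_rng(4)
n, J, m = 40, 3, 2
sX, sGamma, sInter, sDelta = 1.0, 0.30, 0.20, 0.25

x = rng.normal(5.0, sX, size=n)                       # true subject values
gamma = rng.normal(0.0, sGamma, size=J)               # condition effects
inter = rng.normal(0.0, sInter, size=(n, J))          # subject-by-condition interaction
y = (x[:, None, None] + gamma[None, :, None] + inter[:, :, None]
     + rng.normal(0.0, sDelta, size=(n, J, m)))       # data generated from model (3)

cell = y.mean(axis=2)                                  # subject-by-condition means
subj = cell.mean(axis=1); cond = cell.mean(axis=0); grand = y.mean()

mse  = ((y - cell[:, :, None]) ** 2).sum() / (n * J * (m - 1))
msab = m * ((cell - subj[:, None] - cond[None, :] + grand) ** 2).sum() / ((n - 1) * (J - 1))
msb  = m * n * ((cond - grand) ** 2).sum() / (J - 1)

var_delta = mse                                        # within-subject (repeatability)
var_inter = max((msab - mse) / m, 0.0)                 # subject-by-condition interaction
var_gamma = max((msb - msab) / (m * n), 0.0)           # between-condition
tsd = np.sqrt(var_delta + var_inter + var_gamma)       # reproducibility SD

print(f"sigma_delta^2={var_delta:.3f}, sigma_gammadelta^2={var_inter:.3f}, "
      f"sigma_gamma^2={var_gamma:.3f}, tSD={tsd:.3f}")
```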

In a scenario where only two experimental conditions are being compared, e.g. two scanner platforms, the variance of $\gamma_j$ ($\sigma_\gamma^2$) is no longer estimable (only $\gamma_1$ and $\gamma_2$ exist). We then use the agreement between these two experimental conditions as the measure of reproducibility. For such a situation, where $m = 1$ for a repeated measurement design, Lawrence and Lin 45 proposed the concordance correlation coefficient (CCC) (Table 2) as an agreement measure of reproducibility, 46 defined as

$\mathrm{CCC} = \frac{ 2 \rho_{12} \sigma_1 \sigma_2 }{ \sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2 }$

to evaluate the QIB agreement between the two experimental conditions, where $\mu_1$ and $\mu_2$ and $\sigma_1$ and $\sigma_2$ are the means and standard deviations of the measured QIB under experimental conditions 1 and 2, respectively, and $\rho_{12}$ is the Pearson correlation between the measured QIB values $Y_{i1}$ and $Y_{i2}$. Like the Pearson correlation, CCC ranges between −1 and 1, with values close to 1 (or −1) representing good concordance (or good discordance) and 0 representing no correlation.

The method introduced by Lawrence and Lin 45 is commonly used to estimate CCC and its CI. 47 Lin and Williamson 48 provide a simple method to perform sample size calculation of CCC.
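A sketch of computing the CCC from paired measurements under two conditions is shown below; the simulated miscalibration (scale and offset) is illustrative and demonstrates why the CCC can be noticeably lower than the Pearson correlation.

```python
# Sketch: concordance correlation coefficient (CCC) between two experimental
# conditions, computed from paired measurements (simulated, illustrative values).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(5.0, 1.0, size=50)
y1 = x + rng.normal(0.0, 0.3, size=50)                 # condition 1
y2 = 0.9 * x + 0.8 + rng.normal(0.0, 0.3, size=50)     # condition 2, slightly miscalibrated

mu1, mu2 = y1.mean(), y2.mean()
s1, s2 = y1.std(ddof=1), y2.std(ddof=1)
rho12 = np.corrcoef(y1, y2)[0, 1]

ccc = 2 * rho12 * s1 * s2 / (s1**2 + s2**2 + (mu1 - mu2)**2)
print(f"Pearson r = {rho12:.3f}, CCC = {ccc:.3f}")     # CCC < r when agreement is imperfect
```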

Plots for repeatability and reproducibility studies

Various plots can be used to visually study the impact of measurement errors in repeatability and reproducibility studies. For repeatability studies, although box-and-whisker plots can provide information on the variability for each subject, 4 they are usually not feasible for test–retest studies since only a few repeated measurements (usually two) are available for each subject. Bland–Altman plots 49 are commonly included in repeatability studies to visualize trends in variability within measurement intervals. Figure 1 demonstrates Bland–Altman plots based on 100 simulated data points with X generated uniformly at random from the interval 0 to 5. When gold-standard or reference values are available (see the setting described in Obuchowski et al 32), the differences between the QIB measurements and the corresponding reference values are plotted against the reference values in Bland–Altman plots. Figure 1a illustrates the case of additive error (model (4) with σ = 0.8), while Figure 1b shows the case of multiplicative error (model (7) with σ = 0.3). When reference values are not available, e.g. in test–retest studies, the standard deviations (or differences in the case of m = 2) of repeated measurements are plotted against the averages of repeated measurements in Bland–Altman plots. Figure 1c and d show the cases of additive and multiplicative errors based on two repeated measurements and the same values of σ as in Figure 1a and Figure 1b, respectively. When multiplicative errors are observed (e.g. in Figure 1b and d), log-transformed QIB measurements should be considered, as suggested in Section 3.1.

Figure 1. Bland–Altman plot examples for (a) reference value available with constant measurement error variance, (b) reference value available with increasing measurement error variance, (c) reference value not available with constant measurement error variance, and (d) reference value not available with increasing measurement error variance.

When only two experimental conditions are being compared, Bland–Altman plots can also be used in reproducibility studies, especially when gold-standard or reference values are not available. 4 In addition, a scatter plot of experimental condition 1 vs condition 2 with a fitted regression line can provide a useful visualization of the agreement between the two methods (Figure 2). When there are more than two experimental conditions, the same procedure can be followed for each pair of conditions.

Figure 2. Scatter plot of two-method comparison (slope = 0.831, intercept = 0.335).

QIB as trial end point

QIBs can serve as a clinical trial endpoint to assess treatment efficacy, where subjects enrolled in the study are scanned before and after treatment, and the difference in mean QIB measurements over the treatment course is used to determine the efficacy of the treatment. Since it is often nearly impossible to perform repeated measurements at a single time point in a longitudinal study, it is difficult to assess repeatability- and/or reproducibility-related QIB measurement errors. From model (2), for a setting without repeated measurements, let $Y_{i1}$ and $Y_{i2}$ be the QIB measurements (or log-transformed QIB measurements) before and after intervention for subject $i$, respectively, and further assume that the corresponding true value $X_{it}$ follows a normal distribution with mean $\mu_t$ and variance $\sigma_X^2$ for $t = 1, 2$ and $i = 1, \dots, n$. The common approach to assess treatment efficacy is to test whether the mean difference is greater than a threshold value $c$ chosen so that the difference is practically meaningful (i.e. null hypothesis $H_0: \mu_2 - \mu_1 \le c$ vs alternative hypothesis $H_1: \mu_2 - \mu_1 > c$). Under model (2), the corresponding test statistic $Z$ is

$$Z = \frac{\bar{Y}_2 - \bar{Y}_1 - c}{\sqrt{2(\sigma_X^2 + \sigma_\epsilon^2)(1-\rho)/n}},$$

where $\rho$ is the Pearson correlation (ranging between 0 and 1) between $Y_{i1}$ and $Y_{i2}$ and $\bar{Y}_t$ denotes the sample mean of the $Y_{it}$. Thus, for a significance level of $\alpha$, the minimum sample size required to achieve power $\beta$ is

$$n = \frac{2(\sigma_X^2 + \sigma_\epsilon^2)(1-\rho)(z_\alpha + z_\beta)^2}{\big(c - (\mu_2 - \mu_1)\big)^2}, \qquad (12)$$

where $z_\alpha$ and $z_\beta$ denote, respectively, the upper $\alpha$ quantile and the $\beta$ quantile of the standard normal distribution. For example, under the setting of $c = 0$ and $\mu_2 - \mu_1 = 0.5$, which for a log-transformed QIB corresponds roughly to testing a 50% change against no change, Figure 3 illustrates the required sample sizes to achieve 80% power at a significance level of 5% for different values of $\sigma_\epsilon$, $\sigma_X$, and $\rho$. From both Figure 3 and equation (12), the required sample size is an increasing function of $\sigma_X^2$ and $\sigma_\epsilon^2$ and a decreasing function of $\rho$.

Figure 3. Sample size required to achieve 80% power with a significance level of 5% against the measurement error standard deviation $\sigma_\epsilon$. The true difference $\mu_2 - \mu_1$ is 0.5, which commonly represents a 50% change for a log-transformed QIB; $\rho$ = 0.4, 0.6, or 0.8; $\sigma_X$ = 1 or 1.5; and $c$ = 0. QIB, quantitative imaging biomarker.

For a longitudinal study, the correlation parameter $\rho$ measures the level of dependence between the QIB measurements before and after treatment, with a longer time interval between the measurements generally resulting in a smaller $\rho$. Obuchowski et al 6 showed that the range of $\rho$, which depends on the time interval between the two measurements, is between 0 and $\sigma_X^2/(\sigma_X^2 + \sigma_\epsilon^2)$. Using equation (12), the corresponding range of sample sizes is from $2\sigma_\epsilon^2(z_\alpha + z_\beta)^2/\big(c - (\mu_2 - \mu_1)\big)^2$ to $2(\sigma_X^2 + \sigma_\epsilon^2)(z_\alpha + z_\beta)^2/\big(c - (\mu_2 - \mu_1)\big)^2$. Although the required sample size $n$ is a decreasing function of $\rho$, and $\rho$ is a decreasing function of the time interval between the measurements, we cannot jump to the conclusion that studies with a smaller time interval require smaller sample sizes, because a smaller time interval usually also results in a smaller difference between $\mu_1$ and $\mu_2$.
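Assuming the sample-size expression takes the form reconstructed as equation (12) above (a reconstruction, not a quotation of the original), a small helper such as the following reproduces the qualitative behaviour described in this section; the parameter values in the example are arbitrary.

```python
import math
from scipy.stats import norm

def sample_size(sigma_x, sigma_eps, rho, delta, c=0.0, alpha=0.05, power=0.80):
    """Subjects needed to detect a pre/post mean change delta = mu2 - mu1 (one-sided test)."""
    z_a = norm.ppf(1.0 - alpha)   # upper-alpha quantile (convention assumed)
    z_b = norm.ppf(power)         # quantile corresponding to the target power
    num = 2.0 * (sigma_x**2 + sigma_eps**2) * (1.0 - rho) * (z_a + z_b) ** 2
    return math.ceil(num / (c - delta) ** 2)

# Example mirroring the text: delta = 0.5, c = 0, 5% significance, 80% power
for rho in (0.4, 0.6, 0.8):
    print(rho, sample_size(sigma_x=1.0, sigma_eps=0.5, rho=rho, delta=0.5))
```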

QIB as predictive biomarker

In addition to serving as clinical trial end points, QIBs can also be used as predictive biomarkers for early prediction of treatment effect, or as intermediate end points in multiarm, multistage trials. 50 Under this scenario, it is usually assumed that the true value of the QIB is associated with the primary trial end point, but the QIB is only measured with error (see model (1)). Many methods have been proposed to adjust for measurement error in covariates in regression models. 51–53 However, such adjustment requires knowledge of the distribution of the measurement errors, which can only be obtained from additional studies such as repeatability or reproducibility studies. This requirement, on the one hand, emphasizes the importance of repeatability and reproducibility studies; on the other hand, it may not always be met, in which case the standard approach that ignores the measurement error is used. 54 In this study, we use simulation to illustrate the impact of measurement error when the standard approach is used.

Because the sample sizes in published studies where QIBs were used as predictive biomarkers are usually small, it is difficult to analytically evaluate the impact of measurement error on QIBs as predictive biomarkers. Based on the study design and assumptions, Monte Carlo simulations can be used to numerically approximate the impact of measurement error. For illustration purposes, we designed our simulations based on the study by Tudorica et al, 54 where DCE-MRI QIBs were used for early prediction of breast cancer response [pathologic complete response (pCR) vs non-pCR] to neoadjuvant chemotherapy (NACT). We denote $Z_i$ as the indicator of pCR and $X_i$ as the true value of a DCE-MRI QIB for subject $i$. The true DCE-MRI QIB $X_i$ was generated from a normal distribution. Tudorica et al 54 provided a list of DCE-MRI QIBs with their means and SDs for pCR and non-pCR patients. Here, we considered the percent change in the QIB $K^{\mathrm{trans}}$ (transfer rate constant) after the first cycle of NACT relative to baseline (pCR: mean = −64%, SD = 9%; non-pCR: mean = −14%, SD = 41%), which showed the best predictive performance for pCR vs non-pCR in that study. 54 Consistent with the sample size of that study, 54 we included a total of 28 subjects in this simulation. Without loss of generality, we assumed that the first five subjects are pCR patients and the remaining subjects are non-pCR patients. As noted above, we can only observe the DCE-MRI QIB with error (model (2)). The standard approach is to fit a univariate logistic regression model using the observed DCE-MRI QIB $Y_i$ as the covariate, i.e.

$$\mathrm{logit}\{P(Z_i = 1)\} = \beta_0 + \beta_1 Y_i.$$

The area under the receiver operating characteristic curve (AUC) was used to evaluate the QIB predictive performance. Sample size calculation for the AUC can be conducted using the formula provided by Obuchowski et al. 55 Because the effect of measurement error on the AUC is still not clear, we performed a simulation study to evaluate this effect. The true $K^{\mathrm{trans}}$ percent change values were repeatedly generated 1000 times, and the average AUC across these 1000 simulated data sets and the corresponding 95% CIs were calculated. Figure 4 illustrates the average AUCs against different values of $\sigma_\epsilon$ ($\sigma_\epsilon = 5\%, 10\%, \dots, 30\%$), the measurement error standard deviation in $K^{\mathrm{trans}}$ percent change. Our simulation results show that the predictive performance as measured by the average AUC decreases, and the length of the 95% CIs increases, with increasing measurement error ($\sigma_\epsilon$).
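The sketch below reproduces the spirit of this simulation under stated assumptions: the group means and SDs are taken from the text, a 5/23 pCR/non-pCR split is assumed, Gaussian measurement error is added to the simulated true values, and the interval reported is the central 95% range of the simulated AUCs. It is an illustrative reimplementation, not the authors' code; note that for a univariate logistic model the AUC of the fitted probabilities equals the AUC computed directly on the observed covariate, which the sketch exploits.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n_pcr, n_non = 5, 23                             # 28 subjects, assumed 5 pCR / 23 non-pCR
labels = np.r_[np.ones(n_pcr), np.zeros(n_non)]  # 1 = pCR, 0 = non-pCR

def average_auc(sigma_eps, n_sim=1000):
    aucs = np.empty(n_sim)
    for k in range(n_sim):
        # True K-trans percent change (pCR: -64 +/- 9; non-pCR: -14 +/- 41, in %)
        x = np.r_[rng.normal(-64.0, 9.0, n_pcr), rng.normal(-14.0, 41.0, n_non)]
        y = x + rng.normal(0.0, sigma_eps, x.size)   # observed values with measurement error
        # AUC on the observed covariate; negate so a larger decrease scores higher for pCR
        aucs[k] = roc_auc_score(labels, -y)
    return aucs.mean(), np.quantile(aucs, [0.025, 0.975])

for sigma_eps in (5, 10, 15, 20, 25, 30):
    mean_auc, (lo, hi) = average_auc(sigma_eps)
    print(f"sigma_eps={sigma_eps:2d}%  mean AUC={mean_auc:.3f}  95% range=({lo:.3f}, {hi:.3f})")
```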

Figure 4. Average AUC (solid black curve) and the corresponding 95% confidence intervals (dashed blue curves) against different values of $\sigma_\epsilon$. Results were obtained based on 1000 simulated data sets. AUC, area under the receiver operating characteristic curve.

In this review article, we provided a general introduction to the study designs, statistical models, and statistical metrics that can be used to assess the repeatability and reproducibility of QIB measurements. We also illustrated the impact of repeatability- and reproducibility-related QIB measurement errors on QIB applications, e.g. on sample size calculation when a QIB is used as a clinical trial end point.

The statistical models presented here assume that the measurement errors are normally distributed and independent of the true QIB values. If the measurement errors increase in proportion to the true QIB values, i.e. multiplicative errors (see model (7)), log-transformed QIB values can be used, as the relationship between error and true value becomes additive after the transformation. In practice, when QIB measurements have values equal or close to zero, a small constant can be added to all QIB values before taking the log transformation. For more complex error structures, such as non-Gaussian or heterogeneous measurement errors, the statistical methods introduced in this article can still provide a reasonable approximation of the repeatability and reproducibility metrics of interest, e.g. wSD and ICC, but statistical inferences based on these estimates can be biased and may lead to false conclusions.
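For example, a log transformation with a small offset (the offset value here is an arbitrary illustrative choice, not a recommendation from the article) might look like:

```python
import numpy as np

y = np.array([0.0, 0.2, 1.3, 4.7])   # QIB values, some at or near zero
offset = 0.05                        # small constant; the choice is context-dependent
log_y = np.log(y + offset)           # errors become approximately additive on this scale
```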

Test–retest studies are commonly used to study the repeatability and reproducibility of a QIB, where each object, e.g. a phantom or a human subject, is repeatedly measured. This approach may sometimes be impractical for human subject studies for reasons such as cost, time, and the invasiveness of the imaging scan. As an alternative strategy, Obuchowski et al 32 proposed a method to estimate the measurement error under repeatability conditions when a test–retest study is not feasible. The method requires that a reference (gold-standard) value be available for each subject. By assuming the reference value to be the true QIB value ($X_i$), it can serve as the second measured value for wSD or wCV estimation.

Repeatability and reproducibility can be assessed within the same study. For example, repeatability can be examined in a restricted subset of a reproducibility study, e.g. in an initial subset of subjects entering the study, to confirm that repeatability is acceptable and the study is worth pursuing.

There is an increasing need to accelerate the clinical translation of QIBs, but significant challenges remain. Using solid tumor therapy response as an example, the one-dimensional imaging tumor size measurement based on the RECIST (Response Evaluation Criteria In Solid Tumors) 1.1 guidelines 56 is the only QIB widely used in today's standard of care and clinical trials. Many QIBs that interrogate tumor biology and physiology, and are thus well suited for evaluating response to increasingly used and effective molecularly targeted therapies, have proven difficult to translate into clinical trials and practice. This is mainly due to variability in the quantified QIB parameter values caused by differences in vendor imaging platforms, imaging data acquisition methods, and imaging data analysis algorithms and software tools. Because there are too few repeatability and reproducibility studies to characterize the variability of these functional QIBs, there is, unlike for the RECIST tumor size measurement, currently no consensus on the magnitudes of change in these QIBs that should define clinical response end points such as complete response, stable disease, etc. To establish a path to clinical translation for functional QIBs, there is a clear need not only for standardization of data acquisition and analysis to minimize variability, 1 but also for more effort in assessing QIB repeatability and reproducibility. 1 It is our hope that the statistical tools presented in this article may contribute to this endeavor.

We thank the anonymous reviewers for their insightful comments.

R01 CA248192, National Institutes of Health, U.S.A.

Yankeelov   TE , Mankoff   DA , Schwartz   LH , Lieberman   FS , Buatti   JM , Mountz   JM  et al.  . Quantitative imaging in cancer clinical trials . Clin Cancer Res   2016 ; 22 : 284 – 90 . doi: https://doi.org/10.1158/1078-0432.CCR-14-3336


Pinker   K , Chin   J , Melsaether   AN , Morris   EA , Moy   L . Precision medicine and radiogenomics in breast cancer: new approaches toward diagnosis and treatment . Radiology   2018 ; 287 : 732 – 47 . doi: https://doi.org/10.1148/radiol.2018172171

Kessler   LG , Barnhart   HX , Buckler   AJ , Choudhury   KR , Kondratovich   MV , Toledano   A  et al.  . The emerging science of quantitative imaging biomarkers terminology and definitions for scientific studies and regulatory submissions . Stat Methods Med Res   2015 ; 24 : 9 – 26 . doi: https://doi.org/10.1177/0962280214537333

Raunig   DL , McShane   LM , Pennello   G , Gatsonis   C , Carson   PL , Voyvodic   JT  et al.  . Quantitative imaging biomarkers: a review of statistical methods for technical performance assessment . Stat Methods Med Res   2015 ; 24 : 27 – 67 . doi: https://doi.org/10.1177/0962280214537344

Obuchowski   NA , Reeves   AP , Huang   EP , Wang   X-F , Buckler   AJ , Kim   HJG  et al.  . Quantitative imaging biomarkers: a review of statistical methods for computer algorithm comparisons . Stat Methods Med Res   2015 ; 24 : 68 – 106 . doi: https://doi.org/10.1177/0962280214537390

Obuchowski   NA , Mozley   PD , Matthews   D , Buckler   A , Bullen   J , Jackson   E . Statistical considerations for planning clinical trials with quantitative imaging biomarkers . J Natl Cancer Inst   2019 ; 111 : 19 – 26 . doi: https://doi.org/10.1093/jnci/djy194

Sullivan   DC , Obuchowski   NA , Kessler   LG , Raunig   DL , Gatsonis   C , Huang   EP  et al.  . Metrology standards for quantitative imaging biomarkers . Radiology   2015 ; 277 : 813 – 25 : 825 . doi: https://doi.org/10.1148/radiol.2015142202

Huang   EP , Wang   X-F , Choudhury   KR , McShane   LM , Gönen   M , Ye   J  et al.  . Meta-analysis of the technical performance of an imaging procedure: guidelines and statistical methodology . Stat Methods Med Res   2015 ; 24 : 141 – 74 . doi: https://doi.org/10.1177/0962280214537394

Yokoo   T , Serai   SD , Pirasteh   A , Bashir   MR , Hamilton   G , Hernando   D  et al.  . Linearity, bias, and precision of hepatic proton density fat fraction measurements by using MR imaging: a meta-analysis . Radiology   2018 ; 286 : 486 – 98 . doi: https://doi.org/10.1148/radiol.2017170550

Lodge MA. Repeatability of SUV in oncologic 18F-FDG PET. J Nucl Med 2017; 58: 523–32.

Baumgartner   R , Joshi   A , Feng   D , Zanderigo   F , Ogden   RT . Statistical evaluation of test-retest studies in PET brain imaging . EJNMMI Res   2018 ; 8 : 1 – 9 : 13 . doi: https://doi.org/10.1186/s13550-018-0366-8

Shukla‐Dave   A , Obuchowski   NA , Chenevert   TL , Jambawalikar   S , Schwartz   LH , Malyarenko   D  et al.  . Quantitative imaging biomarkers alliance (QIBA) recommendations for improved precision of DWI and DCE‐MRI derived biomarkers in multicenter oncology trials . J Magn Reson Imaging   2019 ; 49 : e101 – 21 . doi: https://doi.org/10.1002/jmri.26518

Park   JE , Park   SY , Kim   HJ , Kim   HS . Reproducibility and generalizability in radiomics modeling: possible strategies in radiologic and statistical perspectives . Korean J Radiol   2019 ; 20 : 1124 . doi: https://doi.org/10.3348/kjr.2018.0070

Fedorov   A , Vangel   MG , Tempany   CM , Fennessy   FM . Multiparametric magnetic resonance imaging of the prostate: repeatability of volume and apparent diffusion coefficient quantification . Invest Radiol   2017 ; 52 ( 9 ): 538 .

Schwier   M , Griethuysen   J , Vangel   MG , Pieper   S , Peled   S , Tempany   C  et al.  . Repeatability of multiparametric prostate MRI radiomics features . Sci Rep   2019 ; 9 : 1 – 16 . doi: https://doi.org/10.1038/s41598-019-45766-z

Hernando   D , Sharma   SD , Aliyari Ghasabeh   M , Alvis   BD , Arora   SS , Hamilton   G  et al.  . Multisite, multivendor validation of the accuracy and reproducibility of proton-density fat-fraction quantification at 1.5t and 3T using a fat-water phantom . Magn Reson Med   2019 ; 77 : 1516 – 24 . doi: https://doi.org/10.1002/mrm.26228

Kalpathy-Cramer   J , Zhao   B , Goldgof   D , Gu   Y , Wang   X , Yang   H  et al.  . A comparison of lung nodule segmentation algorithms: methods and results from A multi-institutional study . J Digit Imaging   2016 ; 29 : 476 – 87 . doi: https://doi.org/10.1007/s10278-016-9859-z

Lin   C , Bradshaw   T , Perk   T , Harmon   S , Eickhoff   J , Jallow   N  et al.  . Repeatability of quantitative 18F-naf PET: a multicenter study . J Nucl Med   2016 ; 57 : 1872 – 79 . doi: https://doi.org/10.2967/jnumed.116.177295

Jafar   MM , Parsai   A , Miquel   ME . Diffusion-weighted magnetic resonance imaging in cancer: reported apparent diffusion coefficients, in-vitro and in-vivo reproducibility . World J Radiol   2016 ; 8 : 21 – 49 . doi: https://doi.org/10.4329/wjr.v8.i1.21

Winfield   JM , Tunariu   N , Rata   M , Miyazaki   K , Jerome   NP , Germuska   M  et al.  . Extracranial soft-tissue tumors: repeatability of apparent diffusion coefficient estimates from diffusion-weighted MR imaging . Radiology   2017 ; 284 : 88 – 99 . doi: https://doi.org/10.1148/radiol.2017161965

Weller   A , Papoutsaki   MV , Waterton   JC , Chiti   A , Stroobants   S , Kuijer   J  et al.  . Diffusion-weighted (DW) MRI in lung cancers: ADC test-retest repeatability . Eur Radiol   2017 ; 27 : 4552 – 62 . doi: https://doi.org/10.1007/s00330-017-4828-6

Lecler   A , Savatovsky   J , Balvay   D , Zmuda   M , Sadik   J-C , Galatoire   O  et al.  . Repeatability of apparent diffusion coefficient and intravoxel incoherent motion parameters at 3.0 tesla in orbital lesions . Eur Radiol   2017 ; 27 : 5094 – 5103 . doi: https://doi.org/10.1007/s00330-017-4933-6

Lu   Y , Hatzoglou   V , Banerjee   S , Stambuk   HE , Gonen   M , Shankaranarayanan   A  et al.  . Repeatability investigation of reduced field-of-view diffusion-weighted magnetic resonance imaging on thyroid glands . J Comput Assist Tomogr   2015 ; 39 : 334 – 39 . doi: https://doi.org/10.1097/RCT.0000000000000227

Hagiwara   A , Hori   M , Cohen-Adad   J , Nakazawa   M , Suzuki   Y , Kasahara   A  et al.  . Linearity, bias, intrascanner repeatability, and interscanner reproducibility of quantitative multidynamic multiecho sequence for rapid simultaneous relaxometry at 3 T: a validation study with a standardized phantom and healthy controls . Invest Radiol   2019 ; 54 : 39 – 47 . doi: https://doi.org/10.1097/RLI.0000000000000510

Jafari-Khouzani   K , Emblem   KE , Kalpathy-Cramer   J , Bjørnerud   A , Vangel   MG , Gerstner   ER  et al.  . Repeatability of cerebral perfusion using dynamic susceptibility contrast MRI in glioblastoma patients . Transl Oncol   2015 ; 8 : 137 – 46 : S1936-5233(15)00018-2 . doi: https://doi.org/10.1016/j.tranon.2015.03.002

Han   A , Andre   MP , Deiranieh   L , Housman   E , Erdman   JW  Jr , Loomba   R  et al.  . Repeatability and reproducibility of the ultrasonic attenuation coefficient and backscatter coefficient measured in the right lobe of the liver in adults with known or suspected nonalcoholic fatty liver disease . J Ultrasound Med   2018 ; 37 : 1913 – 27 . doi: https://doi.org/10.1002/jum.14537

Hagiwara   A , Fujita   S , Ohno   Y , Aoki   S . Variability and standardization of quantitative imaging: monoparametric to multiparametric quantification, radiomics, and artificial intelligence . Invest Radiol   2020 ; 55 : 601 – 16 . doi: https://doi.org/10.1097/RLI.0000000000000666

Wang   K , Manning   P , Szeverenyi   N , Wolfson   T , Hamilton   G , Middleton   MS  et al.  . Repeatability and reproducibility of 2D and 3D hepatic MR elastography with rigid and flexible drivers at end-expiration and end-inspiration in healthy volunteers . Abdom Radiol (NY )   2017 ; 42 : 2843 – 54 . doi: https://doi.org/10.1007/s00261-017-1206-4

Olin   A , Ladefoged   CN , Langer   NH , Keller   SH , Löfgren   J , Hansen   AE  et al.  . Reproducibility of MR-based attenuation maps in PET/MRI and the impact on PET quantification in lung cancer . J Nucl Med   2018 ; 59 : 999 – 1004 . doi: https://doi.org/10.2967/jnumed.117.198853

Fuller   WA . Measurement error models . John Wiley & Sons ; 2009 .


Obuchowski   NA , Bullen   J . Quantitative imaging biomarkers: effect of sample size and bias on confidence interval coverage . Stat Methods Med Res   2018 ; 27 : 3139 – 50 . doi: https://doi.org/10.1177/0962280217693662

Obuchowski   NA , Buckler   AJ . Estimating the precision of quantitative imaging biomarkers without test-retest studies . Academic Radiology   2022 ; 29 : 543 – 49 . doi: https://doi.org/10.1016/j.acra.2021.06.009

Gałecki A, Burzykowski T. Linear Mixed-Effects Models Using R. New York, NY: Springer; 2013, pp. 245–73. doi: https://doi.org/10.1007/978-1-4614-3900-4

Girden   ER . ANOVA . 2455 Teller Road, Thousand Oaks California 91320 United States of America : Sage ; 1992 . doi: https://doi.org/10.4135/9781412983419

Killip   S , Mahfoud   Z , Pearce   K . What is an intracluster correlation coefficient? crucial concepts for primary care researchers . Ann Fam Med   2004 ; 2 : 204 – 8 . doi: https://doi.org/10.1370/afm.141

A’Hern   RP . Employing multiple synchronous outcome samples per subject to improve study efficiency . BMC Med Res Methodol   2021 ; 21 : 1 – 11 : 211 . doi: https://doi.org/10.1186/s12874-021-01414-7

Rinne   H . The Weibull Distribution . CRC press ; 2008 . doi: https://doi.org/10.1201/9781420087444

Efron   B , Tibshirani   RJ . An Introduction to the Bootstrap . CRC press ; 1994 . doi: https://doi.org/10.1201/9780429246593

Cole   SR , Chu   H , Greenland   S . Maximum likelihood, profile likelihood, and penalized likelihood: a primer . Am J Epidemiol   2014 ; 179 : 252 – 60 . doi: https://doi.org/10.1093/aje/kwt245

Cook   JA , Bruckner   T , MacLennan   GS , Seiler   CM . Clustering in surgical trials--database of intracluster correlations . Trials   2012 ; 13 : 1 – 8 : 2 . doi: https://doi.org/10.1186/1745-6215-13-2

Thompson DM, Fernald DH, Mold JW. Intraclass correlation coefficients typical of cluster-randomized studies: estimates from the Robert Wood Johnson Prescription for Health projects. Ann Fam Med 2012; 10: 235–40.

Demetrashvili   N , Wit   EC , Heuvel   ER . Confidence intervals for intraclass correlation coefficients in variance components models . Stat Methods Med Res   2016 ; 25 : 2359 – 76 . doi: https://doi.org/10.1177/0962280214522787

Ionan   AC , Polley   MYC , McShane   LM , Dobbin   KK . Comparison of confidence interval methods for an intra-class correlation coefficient (ICC) . BMC Med Res Methodol   2014 ; 14 : 1 – 11 . doi: https://doi.org/10.1186/1471-2288-14-121

Weerahandi   S . Generalized confidence intervals . Journal of the American Statistical Association   1993 ; 88 : 899 – 905 . doi: https://doi.org/10.1080/01621459.1993.10476355

Lin LI-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–68. doi: https://doi.org/10.2307/2532051

Huang   W , Li   X , Chen   Y , Li   X , Chang   M-C , Oborski   MJ  et al.  . Variations of dynamic contrast-enhanced magnetic resonance imaging in evaluation of breast cancer therapy response: a multicenter data analysis challenge . Transl Oncol   2014 ; 7 : 153 – 66 : 166 . doi: https://doi.org/10.1593/tlo.13838

Chen   CC , Barnhart   HX . Comparison of ICC and CCC for assessing agreement for data without and with replications . Computational Statistics & Data Analysis   2008 ; 53 : 554 – 64 . doi: https://doi.org/10.1016/j.csda.2008.09.026

Lin   HM , Williamson   JM . A simple approach for sample size calculation for comparing two concordance correlation coefficients estimated on the same subjects . J Biopharm Stat   2015 ; 25 : 1145 – 60 . doi: https://doi.org/10.1080/10543406.2014.971163

Altman   DG , Bland   JM . Measurement in medicine: the analysis of method comparison studies . The Statistician   1983 ; 32 : 307 . https://doi.org/10.2307/2987937

Millen GC, Yap C. Adaptive trial designs: what are multiarm, multistage trials? Archives of Disease in Childhood - Education and Practice 2020; pp. 376–78.

Yang   M , Adomavicius   G , Burtch   G , Ren   Y . Mind the gap: accounting for measurement error and misclassification in variables generated via data mining . Information Systems Research   2018 ; 29 : 4 – 24 . doi: https://doi.org/10.1287/isre.2017.0727

Chesher A. The effect of measurement error. Biometrika 1991; 78: 451–62. doi: https://doi.org/10.1093/biomet/78.3.451

Carroll   RJ , Ruppert   D , Stefanski   LA . Measurement Error in Nonlinear Models . Boston, MA : CRC press ; March   2018 . doi: https://doi.org/10.1007/978-1-4899-4477-1

Tudorica   A , Oh   KY , Chui   SY-C , Roy   N , Troxell   ML , Naik   A  et al.  . Early prediction and evaluation of breast cancer response to neoadjuvant chemotherapy using quantitative DCE-MRI . Transl Oncol   2016 ; 9 : 8 – 17 : S1936-5233(15)30017-6 . doi: https://doi.org/10.1016/j.tranon.2015.11.016

Obuchowski   NA , Lieber   ML , Wians   FH  Jr . ROC curves in clinical chemistry: uses, misuses, and possible solutions . Clin Chem   2004 ; 50 : 1118 – 25 . doi: https://doi.org/10.1373/clinchem.2004.031823

Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009; 45: 228–47.


Science News

Reproducing experiments is more complicated than it seems.


Seventeenth-century physicist and inventor Robert Boyle made a point to communicate the details of his experiments creating a vacuum with an air pump (his first and third attempts are illustrated here) so others could reproduce his results. Statisticians have devised a new way to measure the evidence that an experimental result has really been reproduced.

Wellcome Library, London ( CC BY 4.0 )


By Tom Siegfried

October 7, 2014 at 11:46 am

Whether in science or in courtrooms, drawing correct conclusions depends on reliable evidence. One witness is not enough.

Robert Boyle, one of the founding fathers of modern experimental science, articulated that science/law analogy back in the 17th century.

“Though the testimony of a single witness shall not suffice to prove the accused party guilty of murder; yet the testimony of two witnesses . . . shall ordinarily suffice to prove a man guilty,” Boyle wrote. In the same way, he said, a scientific experiment could not establish a “matter of fact” without multiple witnesses to its result.

So in some of his famous experiments creating a vacuum with an air pump, Boyle invited prominent spectators, “an illustrious assembly of virtuosi,” to attest to the fact of his demonstrations. Boyle realized that such verification could also be achieved by communicating the details of his methods so other experimenters could reproduce his findings. Since then, the principle that experiments should be reproducible has guided the process of science and the evaluation of scientific evidence.

“Replicability has been the cornerstone of science as we know it since the foundation of experimental science,” statistician Ruth Heller and collaborators write in a recent paper.

For quite a while now, though, it seems as if that cornerstone of science has been crumbling. Study after study finds that many scientific results are not reproducible, Heller and colleagues note. Experiments on the behavior of mice produce opposite results in different laboratories. Drugs that show promise in small clinical trials fail in larger trials. Genes linked to the risk of a disease in one study turn out not to have anything to do with that disease at all.

Part of the problem, of course, is that it’s rarely possible to duplicate a complicated experiment perfectly. No two laboratories are identical, for instance. Neither are any two graduate students. (And a mouse’s behavior can differ depending on who took it out of its cage.) Medical trials test different patients each time. Given all these complications, it’s not always obvious how to determine whether a new experiment really reproduces an earlier one or not.

“We need to have an objective way to declare that a certain study really replicates the findings in another study,” write Heller, of Tel-Aviv University in Israel, and collaborators Marina Bogomolov and Yoav Benjamini.

When comparing only two studies, each testing only one hypothesis, the problem is not severe. But in the era of Big Data, multiple studies often address multiple questions. In those cases assessing the evidence that a result has been reproduced is tricky.

Several small studies may suggest the existence of an effect — such as a drug ameliorating a disease — but in each case the statistical analysis can be inconclusive, so it’s not clear whether any of the studies reproduce one another. Combining the findings from all the studies, a procedure known as meta-analysis, may produce strong enough data to conclude that the drug works. But a meta-analysis can be skewed by one study in the group showing a really strong effect, masking the fact that other studies show no effect (and thus do not reproduce each other).

“A meta-analysis discovery based on a few studies is no better than a discovery from a single large study in assessing replicability,” Heller and colleagues point out.

Their approach is to devise a new measure of replicability (they call it an r-value) for use in cases when multiple hypotheses are tested multiple times, an especially common situation in genetics research. When thousands of genes are tested to see if they are related to a disease, many will appear to be related just by chance. When you test again, some of the same genes show up on the list a second time. You’d like to know which of them counts as a reliable replication.

Heller and colleagues’ r-value attempts to provide a way of quantifying that question. They construct an elaborate mathematical approach, based on such factors as the number of features tested in the first place and the fraction of those features that appear not to exert an influence in either the first or follow-up study. Tests of the r-value approach show that it leads to conclusions that differ from standard methods, such as genome-wide association studies. In the GWAS approach, hundreds of thousands of genetic variations are investigated to see if they show signs of being involved in a disease. Follow-up tests then determine which of the candidate variations are likely to be truly linked to the disease.

A GWAS analysis of Crohn’s disease, for instance, tested over 600,000 genetic variations in more than 3,000 people. A follow-up analysis re-examined 126 of those genetic variations. Using the r-value method, Heller and colleagues determined that 52 of those variations passed the replication test. But those 52 were not the same as the 52 most strongly linked variations identified by the standard GWAS method. In other words, the standard statistical method may not be providing the best evidence for replication.

But as Heller and colleagues note, their method is not necessarily appropriate for all situations. It is designed specifically for large-scale searches when actual true results are relatively rare. Other approaches may work better in other circumstances. And many other methods have been proposed to cope with variations on these issues. Different methods often yield different conclusions. How to know whether a result has really been reproduced remains a weakness of the scientific method.

Even in Boyle’s time, though, reproducibility wasn’t so simple. Boyle lamented that, despite how carefully he wrote up his reports so that others could reconstruct his apparatus and reproduce his experiments, few people actually tried.

“For though they may be easily read,” Boyle wrote, “yet he, that shall really go about to repeat them, will find it no easy task.”



Repeatability vs. Reproducibility: What’s the Difference?

Published: October 17, 2022 by iSixSigma Staff


Repeatability and reproducibility are two ways that scientists and engineers measure the precision of their experiments and measuring tools. They are heavily used in chemistry and engineering and are typically reported as standard deviations in scientific write-ups. While these two measurements are both used in many types of experiments, they are quite different and offer valuable insights into the work being done.

What is Repeatability?

Simply put, repeatability is the variability of a measurement. For it to be established, the measurement must be taken by the same person, under the same conditions, using the same instruments, and in a short amount of time.

The Benefits of Repeatability

Repeatability is important for researchers to verify their results. If you can produce the same result in an experiment using the same setup and methods under the same exact conditions in a relatively short amount of time, you can assume the results are not simply by chance. Then you can work towards creating reproducibility.

Repeatability is also essential in determining the accuracy of your measuring tools. This could be repeatedly measuring the length of a bolt with the same caliper and recording the results. If the results are within an acceptable range, you can say the tool produces repeatable measurements.

How to Calculate Repeatability

In order to calculate repeatability, you need to perform a repeatability test. When collecting data, you must ensure that certain criteria are met. This includes ensuring the exact same:

1. Conditions
2. Operator
3. Equipment
4. Location
5. Item
6. Method

Once you have all six of these criteria met, you can move on to performing a repeatability test. First, you will need to choose what you are going to measure. Depending on your experiment, this could be voltage, length, pressure, temperature, or a number of other measurements. Then you will need to set a measurement range. This is typically a range of low to high values.

After you have determined your measurement range, the next thing to do is choose test points. For a linear experiment, this would be two points. Usually, this is 10% and 90% of the measurement range. Depending on the project and if available, calibration points should be chosen for the best results. If your experiment is non-linear, you should pick three or more points to prevent errors.

The only thing left to do is perform the experiment using the criteria and parameters listed above and collect your data. You will want to perform the test 20 to 30 times if practical. Some experiments are time-consuming, and it is only realistic to perform five or fewer tests. You should choose whatever is the most appropriate for your experiment.

Once you have all your data collected, it is time to analyze it. You should calculate the degrees of freedom, mean, and standard deviation. When dealing with multiple sets of data, you should use the method of pooled variance when making your calculations. Your standard deviation will be the value for repeatability.
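As a sketch of that calculation, the snippet below pools the variances of several repeat-measurement sets (weighted by their degrees of freedom) and reports the pooled standard deviation as the repeatability value; the data are invented for illustration.

```python
import numpy as np

# Each inner list is one set of repeat measurements taken under repeatability conditions
sets = [
    [10.01, 10.03, 9.98, 10.02, 10.00],
    [10.05, 10.04, 10.06, 10.03, 10.05],
    [9.97, 9.99, 9.98, 10.00, 9.96],
]

dofs = [len(s) - 1 for s in sets]              # degrees of freedom for each set
variances = [np.var(s, ddof=1) for s in sets]  # sample variance of each set
pooled_var = sum(d * v for d, v in zip(dofs, variances)) / sum(dofs)
repeatability_sd = np.sqrt(pooled_var)         # repeatability as the pooled standard deviation

print(f"total degrees of freedom = {sum(dofs)}, repeatability SD = {repeatability_sd:.4f}")
```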

What is Reproducibility?

Reproducibility measures the degree of agreement between the results of an experiment or study performed by different individuals, in different locations, and with different instruments. An experiment is reproducible if its results can be replicated anywhere by anyone who properly follows the procedure.

The Benefits of Reproducibility

Achieving reproducibility allows for more thorough and accurate research. It allows teams to reduce errors and ambiguity before moving a project from the development phase to the production phase. Reproducibility benefits the researcher and the research community as a whole.

How to Create Reproducibility

Creating reproducibility is simply the process of determining whether your experiment can be replicated by a different individual or team using the same setup while producing the same results. Typically, an experiment needs to be performed successfully multiple times by different teams before it can be considered reproducible.

Reproducibility allows for more accurate research, whereas repeatability measures that accuracy and confirms the results. Both are a means to evaluate the stability and reliability of an experiment and are key factors in uncertainty calculations of measurements.

Repeatability vs. Reproducibility: Who would use repeatability and/or reproducibility?

Both repeatability and reproducibility are essential in nearly all areas of science, including chemistry, computer science, data science, and engineering. A line operator in a manufacturing facility may rely more on repeatability, whereas a chemist developing a new chemical reaction would strive to achieve reproducibility of their experimental method and rely on repeatability data to determine its accuracy.

Choosing Between Repeatability and Reproducibility: Real World Scenarios

Let’s say a quality inspector is measuring the weight of different sized bolts to ensure they are within the acceptable range before sending them through to the next step in the manufacturing process. To determine the reliability of the scale, the head engineer wants the operator to weigh each bolt multiple times and record the results. Essentially, the operator is performing a repeatability test. The engineer notices that the scale shows significant variation for bolts that weigh more than 50 grams. The conclusion is that a different scale, capable of producing repeatable measurements on bolts over 50 grams, needs to be installed at the quality inspector’s station. This simple repeatability test allowed the engineer to catch a potential source of inaccuracy in the manufacturing process and potentially avoid costly quality and functional issues in the end product.

Reproducibility is a major goal of research teams and collaborations in the field of chemistry. Let’s say a team of chemists has produced a chemical reaction that can create a nickel complex that absorbs near-infrared light. The team has designated one of its members to perform the test multiple times and record their data. They have come to the conclusion that their experiment is repeatable. The team is in collaboration with another team working on the same goal of producing safer materials for solar panels. They send their method to the other team and ask them to perform the experiment multiple times and record their data. Several of the other team’s members reproduce the same results as the original team. The experiment can now be deemed reproducible, and the team can move on to improving the process so it can be scaled for production.

Depending on what industry you work in, you may need to use repeatability and/or reproducibility tests. If you are simply trying to determine the accuracy or consistency of a measuring tool, repeatability tests are likely all you need. However, if you are developing new processes and/or products, both repeatability and reproducibility are essential in the process of moving your ideas from planning to experimentation to production.


  • Open access
  • Published: 11 November 2022

Error, reproducibility and uncertainty in experiments for electrochemical energy technologies

  • Graham Smith (ORCID: orcid.org/0000-0003-0713-2893)
  • Edmund J. F. Dickinson (ORCID: orcid.org/0000-0003-2137-3327)

Nature Communications volume 13, Article number: 6832 (2022)


  • Electrocatalysis
  • Electrochemistry
  • Materials for energy and catalysis

The authors provide a metrology-led perspective on best practice for the electrochemical characterisation of materials for electrochemical energy technologies. Such electrochemical experiments are highly sensitive, and their results are, in practice, often of uncertain quality and challenging to reproduce quantitatively.

A critical aspect of research on electrochemical energy devices, such as batteries, fuel cells and electrolysers, is the evaluation of new materials, components, or processes in electrochemical cells, either ex situ, in situ or in operation. For such experiments, rigorous experimental control and standardised methods are required to achieve reproducibility, even on standard or idealised systems such as single crystal platinum 1 . Data reported for novel materials often exhibit high (or unstated) uncertainty and often prove challenging to reproduce quantitatively. This situation is exacerbated by a lack of formally standardised methods, and practitioners with less formal training in electrochemistry being unaware of best practices. This limits trust in published metrics, with discussions on novel electrochemical systems frequently focusing on a single series of experiments performed by one researcher in one laboratory, comparing the relative performance of the novel material against a claimed state-of-the-art.

Much has been written about the broader reproducibility/replication crisis 2 and those reading the electrochemical literature will be familiar with weakly underpinned claims of “outstanding” performance, while being aware that comparisons may be invalidated by measurement errors introduced by experimental procedures which violate best practice; such issues frequently mar otherwise exciting science in this area. The degree of concern over the quality of reported results is evidenced by the recent decision of several journals to publish explicit experimental best practices 3 , 4 , 5 , reporting guidelines or checklists 6 , 7 , 8 , 9 , 10 and commentary 11 , 12 , 13 aiming to improve the situation, including for parallel theoretical work 14 .

We write as two electrochemists who, working in a national metrology institute, have enjoyed recent exposure to metrology: the science of measurement. Metrology provides the vocabulary 15 and mathematical tools 16 to express confidence in measurements and the comparisons made between them. Metrological systems and frameworks for quantification underpin consistency and assurance in all measurement fields and formal metrology is an everyday consideration for practical and academic work in fields where accurate measurements are crucial; we have found it a useful framework within which to evaluate our own electrochemical work. Here, rather than pen another best practice guide, we aim, with focus on three-electrode electrochemical measurements for energy material characterisation, to summarise some advice that we hope helps those performing electrochemical experiments to:

  • avoid mistakes and minimise error
  • report in a manner that facilitates reproducibility
  • consider and quantify uncertainty

Minimising mistakes and error

Metrology dispenses with nebulous concepts such as performance and instead requires scientists to define a specific measurand (“the quantity intended to be measured”) along with a measurement model ( ”the mathematical relation among all quantities known to be involved in a measurement”), which converts the experimental indicators into the measurand 15 . Error is the difference between the reported value of this measurand and its unknowable true value. (Note this is not the formal definition, and the formal concepts of error and true value are not fully compatible with measurement concepts discussed in this article, but we retain it here—as is common in metrology tuition delivered by national metrology institutes—for pedagogical purposes 15 ).

Mistakes (or gross errors) are those things which prevent measurements from working as intended. In electrochemistry the primary experimental indicator is often current or voltage, while the measurand might be something simple, like device voltage for a given current density, or more complex, like a catalyst’s turnover frequency. Both of these are examples of ‘method-defined measurands’, where the results need to be defined in reference to the method of measurement 17 , 18 (for example, to take account of operating conditions). Robust experimental design and execution are vital to understand, quantify and minimise sources of error, and to prevent mistakes.

Contemporary electrochemical instrumentation can routinely offer a current resolution and accuracy on the order of femtoamps; however, one electron looks much like another to a potentiostat. Consequently, the practical limit on measurements of current is the scientist’s ability to unambiguously determine what causes the observed current. Crucially, they must exclude interfering processes such as modified/poisoned catalyst sites or competing reactions due to impurities.

As electrolytes are conventionally in enormous excess compared to the active heterogeneous interface, electrolyte purity requirements are very high. Note, for example, that a perfectly smooth 1 cm² polycrystalline platinum electrode has on the order of 2 nmol of atoms exposed to the electrolyte, so that irreversibly adsorbing impurities present at the part per billion level (nmol mol⁻¹) in the electrolyte may substantially alter the surface of the electrode. Sources of impurities at such low concentration are innumerable and must be carefully considered for each experiment; impurity origins for kinetic studies in aqueous solution have been considered broadly in the historical literature, alongside a review of standard mitigation methods 19 . Most commercial electrolytes contain impurities and the specific ‘grade’ chosen may have a large effect; for example, one study showed a three-fold decrease in the specific activity of oxygen reduction catalysts when preparing electrolytes with American Chemical Society (ACS) grade acid rather than a higher purity grade 20 . Likewise, even 99.999% pure hydrogen gas, frequently used for sparging, may contain more than the 0.2 μmol mol⁻¹ of carbon monoxide permitted for fuel cell use 21 .
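The order-of-magnitude figure quoted above can be checked with a quick calculation, assuming a typical polycrystalline platinum surface atom density of roughly 1.5 × 10¹⁵ atoms cm⁻² (an assumed textbook value, not stated in the article):

```python
# Rough check of the ~2 nmol of exposed Pt atoms per cm^2 quoted above
N_A = 6.022e23            # Avogadro constant, atoms per mol
surface_density = 1.5e15  # assumed Pt surface atom density, atoms per cm^2 (typical value)
area_cm2 = 1.0

nmol_exposed = surface_density * area_cm2 / N_A * 1e9
print(f"~{nmol_exposed:.1f} nmol of surface atoms per cm^2")  # about 2.5 nmol
```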

The most insidious impurities are those generated in situ. The use of reference electrodes with chloride-containing filling solutions should be avoided where chloride may poison catalysts 22 or accelerate dissolution. Similarly, reactions at the counter electrode, including dissolution of the electrode itself, may result in impurities. This is sometimes overlooked when platinum counter electrodes are used to assess ‘platinum-free’ electrocatalysts, accidentally resulting in performance-enhancing contamination 23 , 24 ; a critical discussion on this topic has recently been published 25 . Other trace impurity sources include plasticisers present in cells and gaskets, or silicates from the inappropriate use of glass when working with alkaline electrolytes 26 . To mitigate sensitivity to impurities from the environment, cleaning protocols for cells and components must be robust 27 . The use of piranha solution or similarly oxidising solution followed by boiling in Type 1 water is typical when performing aqueous electrochemistry 20 . Cleaned glassware and electrodes are also routinely stored underwater to prevent recontamination from airborne impurities.

The behaviour of electronic hardware used for electrochemical experiments should be understood and considered carefully in interpreting data 28 , recognising that the built-in complexity of commercially available digital potentiostats (otherwise advantageous!) is capable of introducing measurement artefacts or ambiguity 29 , 30 . While contemporary electrochemical instrumentation may have a voltage resolution of ~1 μV, its voltage measurement uncertainty is limited by other factors, and is typically on the order of 1 mV. As passing current through an electrode changes its potential, a dedicated reference electrode is often incorporated into both ex situ and, increasingly, in situ experiments to provide a stable well defined reference. Reference electrodes are typically selected from a range of well-known standardised electrode–electrolyte interfaces at which a characteristic and kinetically rapid reversible faradaic process occurs. The choice of reference electrode should be made carefully in consideration of chemical compatibility with the measurement environment 31 , 32 , 33 , 34 . In combination with an electronic blocking resistance, the potential of the electrode should be stable and reproducible. Unfortunately, deviation from the ideal behaviour frequently occurs. While this can often be overlooked when comparing results from identical cells, more attention is required when reporting values for comparison.

In all cases where conversion between different electrolyte–reference electrode systems is required, junction potentials should be considered. These arise whenever there are different chemical conditions in the electrolyte at the working electrode and reference electrode interfaces. Outside highly dilute solutions, or where there are large activity differences for a reactant/product of the electrode reaction (e.g. pH for hydrogen reactions), liquid junction potentials for conventional aqueous ions have been estimated in the range <50 mV 33 . Such a deviation may nonetheless be significant when onset potentials or activities at specific potentials are being reported. The measured potential difference between the working and reference electrode also depends strongly on the geometry of the cell, so cell design is critical. Fig.  1 shows the influence of cell design on potential profiles. Ideally the reference electrode should therefore be placed close to the working electrode (noting that macroscopic electrodes may have inhomogeneous potentials). To minimise shielding of the electric field between counter and working electrode and interruption of mass transport processes, a thin Luggin-Haber capillary is often used and a small separation maintained. Understanding of shielding and edge effects is vital when reference electrodes are introduced in situ. This is especially applicable for analysis of energy devices for which constraints on cell design, due to the need to minimise electrolyte resistance and seal the cell, preclude optimal reference electrode positioning 32 , 35 , 36 .

Figure 1. (a) Illustration (simulated data) of primary (resistive) current and potential distribution in a typical three-electrode cell. The main compartment is cylindrical (4 cm diameter, 1 cm height), filled with electrolyte with conductivity 1.28 S m⁻¹ (0.1 M KCl(aq)). The working electrode (WE) is a 2 mm diameter disc drawing 1 mA (≈32 mA cm⁻²) from a faradaic process with infinitely fast kinetics and redox potential 0.2 V vs the reference electrode (RE). The counter electrode (CE) is connected to the main compartment by a porous frit; the RE is connected by a Luggin capillary (green cylinders) whose tip position is offset from the WE by a variable distance. Red lines indicate prevailing current paths; coloured surfaces indicate isopotential contours normal to the current density. (b) Plot of indicated WE vs RE potential (simulated data). As the Luggin tip is moved away from the WE surface, ohmic losses due to the WE-CE current distribution lead to variation in the indicated WE-RE potential. Appreciable error may arise on an offset length scale comparable to the WE radius.

Quantitative statements about fundamental electrochemical processes based on measured values of current and voltage inevitably rely on models of the system. Such models have assumptions that may be routinely overlooked when following experimental and analysis methods, and that may restrict their application to real-world systems. It is quite possible to make highly precise but meaningless measurements! An often-assumed condition for electrocatalyst analysis is the absence of mass transport limitation. For some reactions, such as the acidic hydrogen oxidation and hydrogen evolution reactions, this state is arguably so challenging to reach at representative conditions that it is impossible to measure true catalyst activity 11 . For example, ex situ thin-film rotating disk electrode measurements routinely fail to predict correct trends in catalyst performance in morphologically complex catalyst layers  as relevant operating conditions (e.g. meaningful current densities) are theoretically inaccessible. This topic has been extensively discussed with some authors directly criticising this technique and exploring alternatives 37 , 38 , and others defending the technique’s applicability for ranking catalysts if scrupulous attention is paid to experimental details 39 ; yet, many reports continue to use this measurement technique blindly with no regard for its applicability. We therefore strongly urge those planning measurements to consider whether their chosen technique is capable of providing sufficient evidence to disprove their hypothesis, even if it has been widely used for similar experiments.

The correct choice of technique should be dependent upon the measurand being probed rather than simply following previous reports. The case of iR correction, where a measurement of the uncompensated resistance is used to correct the applied voltage, is a good example. When the measurand is a material property, such as intrinsic catalyst activity, the uncompensated resistance is a source of error introduced by the experimental method and it should carefully be corrected out (Fig.  1 ). In the case that the uncompensated resistance is intrinsic to the measurand—for instance the operating voltage of an electrolyser cell—iR compensation is inappropriate and only serves to obfuscate. Another example is the choice of ex situ (outside the operating environment), in situ (in the operating environment), and operando (during operation) measurements. While in situ or operando testing allows characterisation under conditions that are more representative of real-world use, it may also yield measurements with increased uncertainty due to the decreased possibility for fine experimental control. Depending on the intended use of the measurement, an informed compromise must be sought between how relevant and how uncertain the resulting measurement will be.

Maximising reproducibility

Most electrochemists assess the repeatability of measurements, performing the same measurement themselves several times. Repeats, where all steps (including sample preparation, where relevant) of a measurement are carried out multiple times, are absolutely crucial for highlighting one-off mistakes (Fig.  2 ). Reproducibility, however, is assessed when comparing results reported by different laboratories. Many readers will be familiar with the variability in key properties reported for various systems e.g. variability in the reported electrochemically active surface area (ECSA) of commercial catalysts, which might reasonably be expected to be constant, suggesting that, in practice, the reproducibility of results cannot be taken for granted. As electrochemistry deals mostly with method-defined measurands, the measurement procedure must be standardised for results to be comparable. Variation in results therefore strongly suggests that measurements are not being performed consistently and that the information typically supplied when publishing experimental methods is insufficient to facilitate reproducibility of electrochemical measurements. Quantitative electrochemical measurements require control over a large range of parameters, many of which are easily overlooked or specified imprecisely when reporting data. An understanding of the crucial parameters and methods for their control is often institutional knowledge, held by expert electrochemists, but infrequently formalised and communicated e.g. through publication of widely adopted standards. This creates challenges to both reproducibility and the corresponding assessment of experimental quality by reviewers. The reporting standards established by various publishers (see Introduction) offer a practical response, but it is still unclear whether these will contain sufficiently granular detail to improve the situation.

Figure 2

The measurements from laboratory 1 show a high degree of repeatability, while the measurements from laboratory 2 do not. Apparently, a mistake has been made in repeat 1, which will need to be excluded from the analysis (including any uncertainty analysis) and/or suggests that further repeat measurements should be conducted. The error bars are based on an uncertainty with approximately 95% coverage probability (see below), so the results from the two laboratories are different, i.e. they show poor reproducibility. This may indicate differing experimental practice or that some as yet unidentified parameter is influencing the results.

Besides information typically supplied in the description of experimental methods for publication, which, at a minimum, must detail the materials, equipment and measurement methods used to generate the results, we suggest that a much more comprehensive description is often required, especially where measurements have historically poor reproducibility or the presented results differ from earlier reports. Such an expanded ‘supplementary experimental’ section would additionally include any details that could impact the results: for example, material pre-treatment, detailed electrode preparation steps, cleaning procedures, expected electrolyte and gas impurities, electrode preconditioning processes, cell geometry including electrode positions, detail of junctions between electrodes, and any other fine experimental details which might be institutional knowledge but unknown to the (now wide) readership of the electrochemical literature. In all cases any corrections and calculations used should be specified precisely and clearly justified; these may include determinations of properties of the studied system, such as ECSA, or of the environment, such as air pressure. We highlight that knowledge of the ECSA is crucial for conventional reporting of intrinsic electrocatalyst activity, but is often very challenging to measure in a reproducible manner 40 , 41 .

To aid reproducibility we recommend regularly calibrating experimental equipment and doing so in a way that is traceable to primary standards realising the International System of Units (SI) base units. The SI system ensures that measurement units (such as the volt) are uniform globally and invariant over time. Calibration applies to direct experimental indicators, e.g. loads and potentiostats, but equally to supporting tools such as temperature probes, balances, and flow meters. Calibration of reference electrodes is often overlooked even though variations from ideal behaviour can be significant 42 and, as discussed above, are often the limit of accuracy on potential measurement. Sometimes reports will specify internal calibration against a known reaction (such as the onset of the hydrogen evolution reaction), but rarely detail regular comparisons to a local master electrode artefact such as a reference hydrogen electrode or explain how that artefact is traceable, e.g. through control of the filling solution concentration and measurement conditions. If reference is made to a standardised material (e.g. commercial Pt/C) the specified material should be both widely available and the results obtained should be consistent with prior reports.

Beyond calibration and reporting, the best test of reproducibility is to perform intercomparisons between laboratories, either by comparing results to identical experiments reported in the literature or, more robustly, through participation in planned intercomparisons (for example ‘round-robin’ exercises); we highlight a recent study applied to solid electrolyte characterisation as a topical example 43 . Intercomparisons are excellent at establishing the key features of an experimental method and the comparability of results obtained from different methods; moreover they provide a consensus against which other laboratories may compare themselves. However, performing repeat measurements for assessing repeatability and reproducibility cannot estimate uncertainty comprehensively, as it excludes systematic sources of uncertainty.

Assessing measurement uncertainty

Formal uncertainty evaluation is an alien concept to most electrochemists; even the best papers (as well as our own!) typically report only the standard deviation between a few repeats. Metrological best practice dictates that reported values are stated as the combination of a best estimate of the measurand, an interval, and a coverage factor (k) which gives the probability of the true value lying within that interval. For example, “the turnover frequency of the electrocatalyst is 1.0 ± 0.2 s⁻¹ (k = 2)” 16 means that the scientist (having assumed normally distributed error) is 95% confident that the turnover frequency lies in the range 0.8–1.2 s⁻¹. Reporting results in such a way makes it immediately clear whether the measurements reliably support the stated conclusions, and enables meaningful comparisons between independent results even if their uncertainties differ (Fig. 3). It also encourages honesty and self-reflection about the shortcomings of results, prompting the development of improved experimental techniques.
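As an illustration of how such a statement is assembled, here is a minimal sketch (our own, not from the article): a combined standard uncertainty is multiplied by the coverage factor to give the expanded uncertainty and interval; the numbers simply reproduce the turnover-frequency example above.

```python
# Minimal sketch: expanded uncertainty U = k * u (combined standard uncertainty),
# reported as "value ± U (k = 2)" with the corresponding ~95% interval
# (assuming normally distributed error).

def expanded_uncertainty_statement(value: float, u_standard: float,
                                   k: float = 2.0, unit: str = "s^-1") -> str:
    U = k * u_standard
    return (f"{value:.1f} ± {U:.1f} {unit} (k = {k:g}); "
            f"interval {value - U:.1f} to {value + U:.1f} {unit}")

print(expanded_uncertainty_statement(value=1.0, u_standard=0.1))
# -> 1.0 ± 0.2 s^-1 (k = 2); interval 0.8 to 1.2 s^-1
```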

Figure 3

a Complete reporting of a measurement includes the best estimate of the measurand, an uncertainty, and the probability that the true value falls within the reported interval. Here, the percentages indicate that a normal distribution has been assumed. b Horizontal bars indicate 95% confidence intervals from uncertainty analysis. The confidence intervals of measurements 1 and 2 overlap when using k = 2, so it is not possible to say with 95% confidence that the result of measurement 2 is higher than that of measurement 1, but it is possible to say so with 68% confidence, i.e. k = 1. Measurement 3 has a lower uncertainty, so it is possible to say with 95% confidence that its value is higher than that of measurement 2.

Constructing such a statement and performing the underlying calculations often appears daunting, not least as there are very few examples for electrochemical systems, with pH measurements being one example to have been treated thoroughly 44. However, a standard process for uncertainty analysis exists, as briefly outlined graphically in Fig. 4. We refer the interested reader to both accessible introductory texts 45 and detailed step-by-step guides 16, 46. The first steps in the process are to state precisely what is being measured—the measurand—and identify likely sources of uncertainty. Even this qualitative effort is often revealing. Precision in the definition of the measurand (and how it is determined from experimental indicators) clarifies the selection of measurement technique and helps to assess its appropriateness; for example, where the measurand relates only to an instantaneous property of a specific physical object, e.g. the current density of a specific fuel cell at 0.65 V following a standardised protocol, we ignore all variability in construction, device history, etc., and no error is introduced by the sample. In contrast, when the measurand is a material property, such as the turnover frequency of a catalyst material with a defined chemistry and preparation method, variability related to the material itself and sample preparation will often introduce substantial uncertainty in the final result. In electrochemical measurements, errors may arise from a range of sources including the measurement equipment, fluctuations in operating conditions, or variability in materials and samples. Identifying these sources leads to the design of better-quality experiments. In essence, the subsequent steps in the calculation of uncertainty quantify the uncertainty introduced by each source of error and, by using a measurement model or a sensitivity analysis (i.e. an assessment of how the results are sensitive to variability in input parameters), propagate these to arrive at a final uncertainty on the reported result.

Figure 4

Possible sources of uncertainty are identified, and their standard uncertainty or probability distribution is determined by statistical analysis of repeat measurements (Type A uncertainties) or other evidence (Type B uncertainties). If required, uncertainties are then converted into the same unit as the measurand and adjusted for sensitivity, using a measurement model. Uncertainties are then combined either analytically using a standard approach or numerically to generate an overall estimate of uncertainty for the measurand (as indicated in Fig. 3a).
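As a numerical illustration of this combination step, the following Monte Carlo sketch propagates one Type A and one Type B component through a simple measurement model; the measurand, the distributions, and all numerical values are illustrative assumptions, not taken from the article.

```python
# Minimal Monte Carlo sketch (illustrative values): combine a Type A uncertainty
# (repeat current readings) and a Type B uncertainty (electrode area tolerance)
# through the model j = I / A for the geometric current density of a disc electrode.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Type A: repeat current readings (A); standard uncertainty of the mean.
repeats = np.array([1.002e-3, 0.998e-3, 1.001e-3, 0.999e-3, 1.000e-3])
u_i = repeats.std(ddof=1) / np.sqrt(len(repeats))
i_samples = rng.normal(repeats.mean(), u_i, N)

# Type B: area of a nominally 2 mm diameter disc with an assumed +/-1%
# rectangular (uniform) tolerance.
a_nominal = np.pi * (1e-3) ** 2  # m^2
a_samples = rng.uniform(0.99 * a_nominal, 1.01 * a_nominal, N)

# Propagate through the measurement model and summarise.
j_samples = i_samples / a_samples  # A m^-2
j_best = j_samples.mean()
U95 = 1.96 * j_samples.std(ddof=1)  # approximate 95% expanded uncertainty
print(f"j = {j_best:.0f} ± {U95:.0f} A m^-2 (approx. 95% coverage)")
```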

Generally, given the historically poor understanding of uncertainty in electrochemistry, we promote increased awareness of uncertainty reporting standards and a focus on reporting measurement uncertainty with a level of detail that is appropriate to the claim made, or the scientific utilisation of the data. For example, where the primary conclusion of a paper relies on demonstrating that a material has the ‘highest ever X’ or that ‘X is bigger than Y’, it is reasonable for reviewers to ask authors to quantify how confident they are in their measurement and statement. Additionally, where uncertainties are reported, even as error bars on numerical or graphical data, the method by which the uncertainty was determined should be stated, even if the method is consciously simple (e.g. “error bars indicate the sample standard deviation of n = 3 measurements carried out on independent electrodes”). Unfortunately, we are aware of only sporadic and incomplete efforts to create formal uncertainty budgets for electrochemical measurements of energy technologies or materials, though work is underway in our group to construct these for some exemplar systems.

Electrochemistry has undoubtedly thrived without significant interaction with formal metrology; we do not urge an abrupt revolution whereby rigorous measurements become devalued if they lack additional arcane formality. Rather, we recommend using the well-honed principles of metrology to illuminate best practice and increase transparency about the strengths and shortcomings of reported experiments. From rethinking experimental design, to participating in laboratory intercomparisons and estimating the uncertainty on key results, the application of metrological principles to electrochemistry will result in more robust science.

References

1. Climent, V. & Feliu, J. M. Thirty years of platinum single crystal electrochemistry. J. Solid State Electrochem. https://doi.org/10.1007/s10008-011-1372-1 (2011).
2. Nature Editors and Contributors. Challenges in irreproducible research collection. Nature https://www.nature.com/collections/prbfkwmwvz/ (2018).
3. Chen, J. G., Jones, C. W., Linic, S. & Stamenkovic, V. R. Best practices in pursuit of topics in heterogeneous electrocatalysis. ACS Catal. 7, 6392–6393 (2017).
4. Voiry, D. et al. Best practices for reporting electrocatalytic performance of nanomaterials. ACS Nano 12, 9635–9638 (2018).
5. Wei, C. et al. Recommended practices and benchmark activity for hydrogen and oxygen electrocatalysis in water splitting and fuel cells. Adv. Mater. 31, 1806296 (2019).
6. Chem Catalysis Editorial Team. Chem Catalysis Checklists Revision 1.1. https://info.cell.com/introducing-our-checklists-learn-more (2021).
7. Chatenet, M. et al. Good practice guide for papers on fuel cells and electrolysis cells for the Journal of Power Sources. J. Power Sources 451, 227635 (2020).
8. Sun, Y. K. An experimental checklist for reporting battery performances. ACS Energy Lett. 6, 2187–2189 (2021).
9. Li, J. et al. Good practice guide for papers on batteries for the Journal of Power Sources. J. Power Sources 452, 227824 (2020).
10. Arbizzani, C. et al. Good practice guide for papers on supercapacitors and related hybrid capacitors for the Journal of Power Sources. J. Power Sources 450, 227636 (2020).
11. Hansen, J. N. et al. Is there anything better than Pt for HER? ACS Energy Lett. 6, 1175–1180 (2021).
12. Xu, K. Navigating the minefield of battery literature. Commun. Mater. 3, 1–7 (2022).
13. Dix, S. T., Lu, S. & Linic, S. Critical practices in rigorously assessing the inherent activity of nanoparticle electrocatalysts. ACS Catal. 10, 10735–10741 (2020).
14. Mistry, A. et al. A minimal information set to enable verifiable theoretical battery research. ACS Energy Lett. 6, 3831–3835 (2021).
15. Joint Committee for Guides in Metrology: Working Group 2. International Vocabulary of Metrology—Basic and General Concepts and Associated Terms (2012).
16. Joint Committee for Guides in Metrology: Working Group 1. Evaluation of Measurement Data—Guide to the Expression of Uncertainty in Measurement (2008).
17. International Organization for Standardization: Committee on Conformity Assessment. ISO 17034:2016 General Requirements for the Competence of Reference Material Producers (2016).
18. Brown, R. J. C. & Andres, H. How should metrology bodies treat method-defined measurands? Accredit. Qual. Assur. https://doi.org/10.1007/s00769-020-01424-w (2020).
19. Angerstein-Kozlowska, H. Surfaces, Cells, and Solutions for Kinetics Studies. Comprehensive Treatise of Electrochemistry Vol. 9: Electrodics: Experimental Techniques (Plenum Press, 1984).
20. Shinozaki, K., Zack, J. W., Richards, R. M., Pivovar, B. S. & Kocha, S. S. Oxygen reduction reaction measurements on platinum electrocatalysts utilizing rotating disk electrode technique. J. Electrochem. Soc. 162, F1144–F1158 (2015).
21. International Organization for Standardization: Technical Committee ISO/TC 197. ISO 14687:2019(E)—Hydrogen Fuel Quality—Product Specification (2019).
22. Schmidt, T. J., Paulus, U. A., Gasteiger, H. A. & Behm, R. J. The oxygen reduction reaction on a Pt/carbon fuel cell catalyst in the presence of chloride anions. J. Electroanal. Chem. 508, 41–47 (2001).
23. Chen, R. et al. Use of platinum as the counter electrode to study the activity of nonprecious metal catalysts for the hydrogen evolution reaction. ACS Energy Lett. 2, 1070–1075 (2017).
24. Ji, S. G., Kim, H., Choi, H., Lee, S. & Choi, C. H. Overestimation of photoelectrochemical hydrogen evolution reactivity induced by noble metal impurities dissolved from counter/reference electrodes. ACS Catal. 10, 3381–3389 (2020).
25. Jerkiewicz, G. Applicability of platinum as a counter-electrode material in electrocatalysis research. ACS Catal. 12, 2661–2670 (2022).
26. Guo, J., Hsu, A., Chu, D. & Chen, R. Improving oxygen reduction reaction activities on carbon-supported Ag nanoparticles in alkaline solutions. J. Phys. Chem. C 114, 4324–4330 (2010).
27. Arulmozhi, N., Esau, D., van Drunen, J. & Jerkiewicz, G. Design and development of instrumentations for the preparation of platinum single crystals for electrochemistry and electrocatalysis research. Part 3: Final treatment, electrochemical measurements, and recommended laboratory practices. Electrocatal. 9, 113–123 (2017).
28. Colburn, A. W., Levey, K. J., O’Hare, D. & Macpherson, J. V. Lifting the lid on the potentiostat: a beginner’s guide to understanding electrochemical circuitry and practical operation. Phys. Chem. Chem. Phys. 23, 8100–8117 (2021).
29. Ban, Z., Kätelhön, E. & Compton, R. G. Voltammetry of porous layers: staircase vs analog voltammetry. J. Electroanal. Chem. 776, 25–33 (2016).
30. McMath, A. A., Van Drunen, J., Kim, J. & Jerkiewicz, G. Identification and analysis of electrochemical instrumentation limitations through the study of platinum surface oxide formation and reduction. Anal. Chem. 88, 3136–3143 (2016).
31. Jerkiewicz, G. Standard and reversible hydrogen electrodes: theory, design, operation, and applications. ACS Catal. 10, 8409–8417 (2020).
32. Ives, D. J. G. & Janz, G. J. Reference Electrodes, Theory and Practice (Academic Press, 1961).
33. Newman, J. & Balsara, N. P. Electrochemical Systems (Wiley, 2021).
34. Inzelt, G., Lewenstam, A. & Scholz, F. Handbook of Reference Electrodes (Springer Berlin, 2013).
35. Cimenti, M., Co, A. C., Birss, V. I. & Hill, J. M. Distortions in electrochemical impedance spectroscopy measurements using 3-electrode methods in SOFC. I—effect of cell geometry. Fuel Cells 7, 364–376 (2007).
36. Hoshi, Y. et al. Optimization of reference electrode position in a three-electrode cell for impedance measurements in lithium-ion rechargeable battery by finite element method. J. Power Sources 288, 168–175 (2015).
37. Jackson, C., Lin, X., Levecque, P. B. J. & Kucernak, A. R. J. Toward understanding the utilization of oxygen reduction electrocatalysts under high mass transport conditions and high overpotentials. ACS Catal. 12, 200–211 (2022).
38. Masa, J., Batchelor-McAuley, C., Schuhmann, W. & Compton, R. G. Koutecky-Levich analysis applied to nanoparticle modified rotating disk electrodes: electrocatalysis or misinterpretation. Nano Res. 7, 71–78 (2014).
39. Martens, S. et al. A comparison of rotating disc electrode, floating electrode technique and membrane electrode assembly measurements for catalyst testing. J. Power Sources 392, 274–284 (2018).
40. Wei, C. et al. Approaches for measuring the surface areas of metal oxide electrocatalysts for determining their intrinsic electrocatalytic activity. Chem. Soc. Rev. 48, 2518–2534 (2019).
41. Lukaszewski, M., Soszko, M. & Czerwiński, A. Electrochemical methods of real surface area determination of noble metal electrodes—an overview. Int. J. Electrochem. Sci. https://doi.org/10.20964/2016.06.71 (2016).
42. Niu, S., Li, S., Du, Y., Han, X. & Xu, P. How to reliably report the overpotential of an electrocatalyst. ACS Energy Lett. 5, 1083–1087 (2020).
43. Ohno, S. et al. How certain are the reported ionic conductivities of thiophosphate-based solid electrolytes? An interlaboratory study. ACS Energy Lett. 5, 910–915 (2020).
44. Buck, R. P. et al. Measurement of pH. Definition, standards, and procedures (IUPAC Recommendations 2002). Pure Appl. Chem. https://doi.org/10.1351/pac200274112169 (2003).
45. Bell, S. Good Practice Guide No. 11: The Beginner’s Guide to Uncertainty of Measurement, Issue 2 (National Physical Laboratory, 2001).
46. United Kingdom Accreditation Service. M3003: The Expression of Uncertainty and Confidence in Measurement, 4th edn (United Kingdom Accreditation Service, 2019).


Acknowledgements

This work was supported by the National Measurement System of the UK Department of Business, Energy & Industrial Strategy. Andy Wain, Richard Brown and Gareth Hinds (National Physical Laboratory, Teddington, UK) provided insightful comments on the text.

Author information

Authors and affiliations

National Physical Laboratory, Hampton Road, Teddington, TW11 0LW, UK

Graham Smith & Edmund J. F. Dickinson


Contributions

G.S. originated the concept for the article. G.S. and E.J.F.D. contributed to drafting, editing and revision of the manuscript.

Corresponding author

Correspondence to Graham Smith .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review information

Nature Communications thanks Gregory Jerkiewicz for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


Cite this article

Smith, G. & Dickinson, E. J. F. Error, reproducibility and uncertainty in experiments for electrochemical energy technologies. Nat. Commun. 13, 6832 (2022). https://doi.org/10.1038/s41467-022-34594-x


Received: 29 July 2022
Accepted: 31 October 2022
Published: 11 November 2022
DOI: https://doi.org/10.1038/s41467-022-34594-x



What is the Difference Between Repeatability and Reproducibility?


Jun 27 2014

Repeatability and reproducibility are ways of measuring precision, particularly in the fields of chemistry and engineering. In general, scientists perform the same experiment several times in order to confirm their findings. These findings may show variation. In the context of an experiment, repeatability measures the variation in measurements taken by a single instrument or person under the same conditions, while reproducibility measures whether an entire study or experiment can be reproduced in its entirety. Within scientific write-ups, reproducibility and repeatability are often reported as a standard deviation. The article From Crude Mixture to Pure Compound – Automatic Purification Made Easy discusses how scientists and researchers establish repeatability and reproducibility in more detail.

What is repeatability?   

Repeatability practices were introduced by the scientists Bland and Altman. For repeatability to be established, the following conditions must be in place: the same location; the same measurement procedure; the same observer; the same measuring instrument, used under the same conditions; and repetition over a short period of time. What’s known as “the repeatability coefficient” is a measure of precision: the value below which the absolute difference between a pair of repeated test results is expected to lie in the large majority (conventionally 95%) of cases.
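As a rough illustration of how such a coefficient can be obtained, here is a minimal sketch that follows the common Bland and Altman convention (coefficient = 1.96 x sqrt(2) x within-subject standard deviation); the data are purely illustrative.

```python
# Minimal sketch (illustrative data): repeatability coefficient from paired
# repeat measurements. By the common convention, the absolute difference
# between two repeats should exceed 1.96 * sqrt(2) * s_w only about 5% of the
# time, where s_w is the within-subject standard deviation.
import math

pairs = [(10.1, 10.3), (9.8, 9.9), (10.0, 10.4), (10.2, 10.1), (9.9, 10.0)]

diffs = [a - b for a, b in pairs]
s_w = math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))  # within-subject SD

repeatability_coefficient = 1.96 * math.sqrt(2) * s_w
print(f"within-subject SD: {s_w:.3f}")
print(f"repeatability coefficient: {repeatability_coefficient:.3f}")
```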

What is reproducibility? 

Reproducibility, on the other hand, refers to the degree of agreement between the results of experiments conducted by different individuals, at different locations, with different instruments. Put simply, it measures our ability to replicate the findings of others. Controlled inter-laboratory test programs are typically used to determine reproducibility. The article Precise Low Temperature Control Improves Reaction Reproducibility discusses the challenges related to reproducibility in more detail.

Why are repeatability and reproducibility considered desirable?

In terms of repeatability and reproducibility, test/re-test reliability reflects the expectation that scientific findings and constructs should not change over time. For instance, if you used a certain method to measure the length of an adult’s arm, and then repeated the process two years later using the same method, it’s highly likely that your results would correlate. Yet if your results differed greatly, you would probably conclude that your findings were inaccurate, prompting further investigation. As such, repeatability and reproducibility imbue investigative findings with a degree of authority.

We can also use repeatability and reproducibility to identify differences and a lack of correlation. If, for instance, we are unable to repeat or reproduce our findings, we have to ask ourselves why, and investigate further.


5 Reproducibility Tests You Can Use For Estimating Uncertainty

May 15, 2017 by Richard Hogan


Introduction

Reproducibility testing is an important part of estimating uncertainty in measurement. It is a type A uncertainty component that should be included in every uncertainty budget .

However, a lot of laboratories neglect to test reproducibility and include it in their analyses.

When talking with clients and friends, I hear a lot of reasons (really, excuses) for why they do not test reproducibility:

• “We don’t have time.”
• “We don’t know how.”
• “We are a one person laboratory.”
• “We only have one measurement standard.”

Well, I am here today to tell you that you can test reproducibility even if you are only a one person laboratory.

In this guide, you are going to learn all that you need to know, and more, to perform your own reproducibility tests.

If you are calculating uncertainty in measurement, you need to know how reproducible your measurement results are.

It is not only important for quality control. There are so many things that you can learn from testing reproducibility that it is worth the effort.

   

There has been a lot of discussions and opinions about including type A uncertainty into your estimates of uncertainty in measurement.

Most laboratories include repeatability test data in their uncertainty budgets. However, many neglect to include reproducibility test data.

Why? I am not sure, but I would like to find out.

Recently, I attended the A2LA Annual Meeting for the Measurement Advisory Committee; and, I was shocked to hear someone ask why ‘Reproducibility’ should be considered for uncertainty analysis.

After asking the question, this person went on to tell everyone that reproducibility testing is not necessary or feasible because the laboratory only had one work station and one technician.

Honestly, I was shocked!

The person that made this statement is an intelligent person from a reputable company, and I was very confused why this person could not conceptualize the rationale for reproducibility testing.

Unfortunately, this person is not alone. Many people question the use of reproducibility testing for estimating uncertainty. Over the years, I have had a number of readers, leads, and clients who have also had questions about reproducibility testing.

With this information, I can only assume:

1. They do not understand reproducibility testing,
2. They have not tried reproducibility testing, and(or)
3. They have not researched reproducibility testing.

Therefore, I have decided to write a guide that will teach you everything that you will ever need to know about reproducibility testing for collecting type A uncertainty data.

Additionally, I am going to show you 5 ways that you can test measurement reproducibility for calculating uncertainty in measurement even if you are a single person laboratory with only one workbench.

So, if you are interested in collecting type A uncertainty data, keep reading because I am going to cover the following seven topics:

1. What is Reproducibility
2. Why is Reproducibility Important
3. Reproducibility Testing Requirements
4. Types of Reproducibility Tests
5. How to Perform Reproducibility Testing
6. How to Analyze Reproducibility Testing Results
7. How to Evaluate Reproducibility Test Results

What is Reproducibility

According to the International Vocabulary of Metrology (VIM), reproducibility is “measurement precision under reproducibility conditions of measurement.”


While reading the definition, I am sure that you noticed the terms “measurement precision” and “reproducibility condition of measurement.”

Don’t you hate it when terms are defined by their own words? Me too. It’s frustrating.

Therefore, let’s try to simplify the definition so you don’t need a PhD to understand the concept.

First, let’s define measurement precision. The VIM defines it as “the closeness of agreement between indications.”

Essentially, this is the spread or standard deviation calculated from a set of repeated measurements, most likely from a repeatability test.


Next, let’s define reproducibility condition of measurement. The VIM defines it as a “condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects.”

Therefore, a reproducibility condition of measurement is another repeatability test where a condition of measurement has been changed.


Now that we have broken down the definition of reproducibility, it can be best explained as the standard deviation of multiple repeatability test results where the conditions of measurement have been changed.

The goal is to determine how closely the results of one repeatability test agree with another to determine how reproducible your results are when performed under various conditions.

Fundamentally, you need to recreate your measurement results after changing one variable at a time and evaluate the impact it has on your results.

Hopefully, you now have a better understanding of the term reproducibility.

In the next section, you will learn why it is important to estimating uncertainty in measurement.

Why is Reproducibility Important

Reproducibility is important because it demonstrates that your laboratory has the ability to replicate measurement results under various conditions.

Additionally, reproducibility testing offers you the ability to experiment with different factors that can influence your measurement results and estimated uncertainty.

When you know which factors significantly impact your measurement results, you can take action to control your measurement process and reduce uncertainty in measurement.

For example, if your laboratory were to conduct a reproducibility test evaluating every technician and one operator’s measurement results were significantly different from the sampled group, you could investigate the cause and take appropriate action (e.g. provide them training to improve their skills).

In another example, imagine that your laboratory is evaluating various test or calibration methods using reproducibility testing. If one method yields significantly different results than the other methods, you could take action to revise the method or eliminate its use for tests and calibrations.

As you can see, reproducibility testing can be a powerful tool for quality control and optimizing your laboratory processes. This is why it is important.

However, it is only beneficial if you actually analyze the data and use it to improve your measurement process. So, think beyond estimating uncertainty: reproducibility testing can be used for

• Monitoring the quality of work,
• Validating methods, procedures, training, etc.,
• Finding problems in measurement systems, and
• Increasing confidence in measurement results.

If your goal is to provide better quality measurement results with less uncertainty, start performing reproducibility tests and use the results to improve your process.

Requirements for Reproducibility Testing

Currently, reproducibility testing is not required by ISO/IEC 17025 or any other normative document, unless you are accredited via A2LA. However, there are strong recommendations for the use of reproducibility testing.

In this section, learn how reproducibility is referenced in various policies and requirements documents.

ISO/IEC 17025 International Standard

In section 5.4.5.3 of the ISO/IEC 17025 standard, reproducibility is listed as a value that may be obtained from validated methods.

Additionally, in note 3 under section 5.4.5.3, reproducibility is given as an example of a component for which the uncertainty of values can be reported.


ILAC P14 Policy

Section 5.4 of the ILAC P14 policy recommends that reproducibility should be included in your estimation of CMC uncertainty.

“A reasonable amount of contribution to uncertainty from repeatability shall be included and contributions due to reproducibility should be included in the CMC uncertainty component, when available.”


JCGM 100:2008 Guide to the Expression of Uncertainty in Measurement

In Appendix B, section 2.16 of the Guide to the Expression of Uncertainty in Measurement (GUM), reproducibility is defined, along with a list of conditions or variables that can be changed for reproducibility testing.


ASTM E177 & ASTM E456

In Table 1 of ASTM E177, Practice for Use of the Terms Precision and Bias, reproducibility is listed as a condition for precision, together with a list of conditions or variables that can be changed to determine reproducibility.


As you can see, reproducibility is not strictly required for estimating uncertainty; but, it sure is recognized as a significant contributor and recommended for consideration in your uncertainty budgets.

Only one accreditation body in North America, A2LA, has made reproducibility a requirement for estimates of CMC uncertainty statements.

In section 6.7.1 of the A2LA R205 Specific Requirements: Calibration Laboratory Accreditation Program, reproducibility is listed as a key component that shall be considered in every CMC uncertainty calculation.


In my opinion, reproducibility is a significant contributor to uncertainty in measurement and should be included in your uncertainty budgets.

It is a type A uncertainty component that is just as important as repeatability.

Furthermore, it is a test that you can easily perform in your laboratory. It only takes a few additional minutes of your time to collect the data and analyze it.
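To show where a reproducibility component might sit alongside repeatability in an uncertainty budget, here is a minimal root-sum-of-squares sketch; the component names and values are made up for illustration, and all are assumed to already be standard uncertainties in the same unit.

```python
# Minimal sketch (illustrative values): combine repeatability, reproducibility,
# and a Type B component by root-sum-of-squares into a combined standard
# uncertainty, then apply a coverage factor of k = 2.
import math

budget = {
    "repeatability (Type A)": 0.12,       # standard deviation of repeats
    "reproducibility (Type A)": 0.20,     # standard deviation of condition means
    "reference standard (Type B)": 0.15,  # from a calibration certificate, k = 1
}

u_combined = math.sqrt(sum(u ** 2 for u in budget.values()))
U_expanded = 2 * u_combined  # k = 2, roughly 95% for a normal distribution

print(f"combined standard uncertainty: {u_combined:.3f}")
print(f"expanded uncertainty (k = 2): {U_expanded:.3f}")
```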

Types of Reproducibility Tests

When it comes to reproducibility testing, there are a lot of different conditions that you can test. However, most metrologists (that I have spoken with or surveyed) only compare operators (i.e. one technician’s results compared to another).

While I agree that comparing operators is important, there are so many additional conditions that you can test. Why only test one condition?

You will never be sure that one condition is more significant than the other unless you test it yourself.

In this section, you will learn about five types of reproducibility tests that you can perform in your laboratory to collect type A uncertainty data. Afterwards, you should have enough information to help you decide which conditions are best for testing in your laboratory.

The five conditions that we will cover in this guide are:

1. Operator vs Operator
2. Equipment vs Equipment
3. Environment vs Environment
4. Method vs Method
5. Day vs Day

Operator vs Operator

The most common reproducibility test for collecting type A uncertainty data is looking for variances in measurement results via the operator.

By comparing two or more technicians, engineers, etc, you can learn a lot about inconsistencies in your measurement process.

All you need to do is have two technicians independently perform the same measurement process.

First, have technician A perform a simple repeatability test and record their results.

With this information, calculate the average and standard deviation.

Next, have technician B perform the same repeatability test and record the results.

Again, calculate the average and standard deviation.

Now, calculate the standard deviation of technician A’s average and technician B’s average.

The result will be the operator to operator reproducibility.

If the calculated result is larger than you prefer or larger than you expected, consider providing the technicians training and repeat the experiment.

Continue this process until you achieve a result that you are happy with.

To perform an operator vs operator reproducibility test, use the following instructions (a worked sketch of the calculation follows the list):

1. Perform a repeatability test with operator A,
2. Record your results,
3. Calculate the mean, standard deviation, and degrees of freedom,
4. Perform a repeatability test with operator B,
5. Record your results,
6. Calculate the mean, standard deviation, and degrees of freedom,
7. Calculate the standard deviation of the two means recorded in steps 3 and 6.
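Here is the worked sketch, written in Python; the readings are made-up numbers, and the calculation simply mirrors steps 3 to 7.

```python
# Minimal sketch (illustrative data): operator vs operator reproducibility as
# the standard deviation of the two operators' repeatability-test means.
import statistics

operator_a = [10.02, 10.05, 10.01, 10.03, 10.04]  # repeat readings, operator A
operator_b = [10.09, 10.07, 10.10, 10.08, 10.11]  # repeat readings, operator B

mean_a, sd_a = statistics.mean(operator_a), statistics.stdev(operator_a)
mean_b, sd_b = statistics.mean(operator_b), statistics.stdev(operator_b)

# Reproducibility, as defined in this guide: standard deviation of the means.
u_reproducibility = statistics.stdev([mean_a, mean_b])

print(f"operator A: mean {mean_a:.3f}, repeatability SD {sd_a:.3f}")
print(f"operator B: mean {mean_b:.3f}, repeatability SD {sd_b:.3f}")
print(f"reproducibility (SD of the means): {u_reproducibility:.3f}")
```

The same pattern applies, unchanged, to the day vs day, method vs method, equipment vs equipment, and environment vs environment tests described below; only the condition you change differs.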

Day vs Day

Another common way to test the reproducibility of a measurement process is to test the variance in measurement results from day to day.

This method is great for collecting type A uncertainty data for laboratories with only one technician, one workbench, or both.

The only thing that you need to change is the day or time that the test is performed.

For example, you can compare:

• Morning vs Afternoon,
• Monday vs Tuesday, or
• Monday vs Friday.

You can compare any scenario you want as long as the only variable that changes is the day or time of day.

To get started, have a technician perform a repeatability test on day 1.

From their results, calculate the average and the standard deviation.

Next, have the technician perform the exact same repeatability test on day 2.

Again, calculate the average and standard deviation from their results.

Now, calculate the standard deviation of the averages calculated from day 1 and day 2. The result will be your day-to-day reproducibility.

To perform a day vs day reproducibility test, use the following instructions:

1. Perform a repeatability test on day A,
2. Record your results,
3. Calculate the mean, standard deviation, and degrees of freedom,
4. Perform a repeatability test on day B,
5. Record your results,
6. Calculate the mean, standard deviation, and degrees of freedom,
7. Calculate the standard deviation of the two means recorded in steps 3 and 6.

Method vs Method

Testing methods for reproducibility is not a common test performed for type A uncertainty. However, it is a very beneficial reproducibility test if you are seeking to reduce your measurement uncertainty.

The method that you choose can significantly affect your measurement uncertainty. Therefore, method vs method reproducibility can be used to determine the variance between two measurement methods.

From the results, you can determine which measurement process yields less uncertainty in measurement results.

To perform a method vs method reproducibility test, use the following instructions:

1. Perform a repeatability test using method A,
2. Record your results,
3. Calculate the mean, standard deviation, and degrees of freedom,
4. Perform a repeatability test using method B,
5. Record your results,
6. Calculate the mean, standard deviation, and degrees of freedom,
7. Calculate the standard deviation of the two means recorded in steps 3 and 6.

Equipment vs Equipment

If you work in a medium to large-size laboratory, you probably have multiple workbenches and duplicate measurement equipment.

Therefore, you are likely to have variance in your measurement results from using different workbenches or measurement equipment. Wouldn’t it be interesting to see how selecting different equipment can affect your measurement results?

If you said “Yes,” I recommend that you try this type of reproducibility test in your laboratory. You may be surprised by the results.

Every measuring instrument or device will perform differently from another, even though they may have the same manufacturer and model number.

Even if your equipment was manufactured on the same day, on the same production line, its performance will vary slightly.

This is why it is important to consider equipment vs equipment reproducibility testing for your type A uncertainty data. The probability or likelihood that your measurement results will be affected by the equipment you choose is pretty high!

It makes a good substitute for operator vs operator reproducibility testing.

To perform an equipment vs equipment reproducibility test, use the following instructions:

1. Perform a repeatability test using equipment A,
2. Record your results,
3. Calculate the mean, standard deviation, and degrees of freedom,
4. Perform a repeatability test using equipment B,
5. Record your results,
6. Calculate the mean, standard deviation, and degrees of freedom,
7. Calculate the standard deviation of the two means recorded in steps 3 and 6.

Environment vs Environment

Performing work outside of the laboratory is becoming more common these days. If your laboratory performs calibration or testing in the field or at customer sites, your measurement results will most likely be impacted by the environment where testing and calibration is performed.

Therefore, it may be a good idea to perform an environment vs environment reproducibility test.

To perform an environment vs environment reproducibility test, just use the following instructions:

1. Perform a repeatability test in environment A,
2. Record your results,
3. Calculate the mean, standard deviation, and degrees of freedom,
4. Perform a repeatability test in environment B,
5. Record your results,
6. Calculate the mean, standard deviation, and degrees of freedom,
7. Calculate the standard deviation of the two means recorded in steps 3 and 6.

Now, environment vs environment reproducibility testing does not have to be limited to comparing the laboratory with the field.

Some laboratories have different rooms each with their own unique environment, even if they are performing similar tests and calibrations.

In these organizations, it may be a good idea to test the variance in measurement results between the two environmental conditions.

For example, I have piston gauges that require calibration. Although I send them to the same manufacturer for calibration, the piston gauges may be calibrated in different environments.

In the past, I have received calibration reports that list different environmental conditions. One piston gauge will be calibrated at 20°C and the other one at 23°C.

It literally drives me nuts, because it creates extra work for me to correct the results for my laboratory environment or to compare the results from previous calibrations.

In another example, imagine calibrating an instrument in your laboratory at 20°C and a customer site at 28°C. Could this impact your measurement results?

Maybe? However, you will never know unless you test it yourself and have data to support your claim.

How to Perform Reproducibility Testing

Performing a reproducibility test is pretty easy and straightforward. However, it never hurts to outline the process.

In this section, you will learn how to perform a reproducibility test step-by-step. All you need to do is follow the instructions below.

1. Establish a Goal
2. Determine What You Will Test or Measure
3. Select a Variable or Condition to Change
4. Perform a Test With Variable A
5. Perform a Test With Variable B
6. Analyze the Results

1. Establish a Goal

When deciding to perform a reproducibility test, it is important to establish a goal.

Ask yourself, “Why am I performing a reproducibility test?”

Some common reasons people perform reproducibility tests are:

• Collect type A uncertainty data,
• Validate a new calibration or test method,
• Evaluate the quality and consistency of laboratory results,
• Investigate factors that cause measurement uncertainty or error.

There are plenty of reasons that may lead you to perform a reproducibility test. However, you need to determine the purpose or goal.

Without determining the goal, you may incorrectly design the experiment. This could cause you to collect the wrong data and force you to repeat the experiments.

So, make sure you establish your goals first.

2. Determine What You Will Measure

Now that you know why you are performing a reproducibility test, it is time to determine what you will measure.

For example, maybe measuring DC voltage is a critical function in your laboratory. Therefore, you want to conduct a reproducibility test to verify that your measurement results are repeatable.

Choose the measurement function and parameter that you wish to test. Then, define how you will perform the measurement.

To complete this step, you will need to choose:

• When you will perform the test (Time),
• Where you will perform the test (Environment),
• How you will perform the test (Method),
• What you will use to perform the test (Equipment),
• Who will perform the test (Operator).

3. Select a Variable or Condition to Change

Now that you have established the parameters of your reproducibility test, it is time to choose what condition you want to change in the experiment.

If you have multiple personnel who are qualified to perform a test or calibration, consider choosing operators as your condition.

This is the most common variable selected in reproducibility testing. It will offer you the ability to see how reproducible measurement results are when tests and calibrations are performed by different people.

If you perform test and calibrations in the laboratory and the field, consider changing the environment for your reproducibility test.

Therefore, perform one experiment in the laboratory and the other in the field.

It will also allow you to see how reproducible your measurement results are outside of the laboratory.

If your personnel can select different methods to perform a test or calibration, consider changing the method for your reproducibility test.

It will show you how much variance you may have selecting one method over another.

If you have duplication of standards or multiple workbenches for conducting similar tests or calibrations, consider selecting equipment as your variable.

This reproducibility test will offer you the ability to determine if your measurement results are reproducible when choosing one piece of test equipment versus another.

If you do not have many options for testing other variables, try selecting time as your variable, especially if you are a laboratory with only one operator and one workbench.

This is easily accomplished by performing a day to day reproducibility test. It will allow you to see how much variance is in your measurement results over a period of time.

4. Perform a Repeatability Test with Condition A

Similar to an A/B Test, you will perform a reproducibility test to evaluate the results of condition A versus condition B.

Simply perform a repeatability test with condition A. Collect 20 or more repeated samples over a short period of time and record your results.

After performing the test, calculate the mean (i.e. average), standard deviation, and degrees of freedom for your sample data set.

I find it easy to log my results in a Microsoft Excel spreadsheet. Once the data is in the spreadsheet, I can easily analyze the data and store it on my computer.

Later on, I can use the spreadsheet to collect more data and compare the results with prior experiments.

Additionally, this is a great practice to add to your process if you plan to develop and use control charts.

5. Perform a Repeatability Test with Condition B

After completing the experiment with condition A, perform a new repeatability test. However, this time change your condition to option B.

It is really simple. You do not want to change anything else about the measurement process in the experiment except the condition.

Remember, the goal is to see how each condition affects the measurement results.

With condition B, perform a new repeatability test where you collect 20 or more repeated samples over a short period of time and record your results.

Afterward, calculate the mean (i.e. average), standard deviation, and degrees of freedom for your sample data set.

Then, proceed to the next step where you will analyze the results and determine the uncertainty introduced by the test variable.

6. Analyze the Results to Calculate Reproducibility

Using the data collected from the reproducibility test, you can analyze the results to calculate the uncertainty caused by reproducibility.

First, calculate the mean (i.e. average) result from testing with condition A.

Next, calculate the mean (i.e. average) result from testing with condition B.

If you tested more than two conditions in your experiment, include the additional mean (i.e. average) measurement results in your analysis.

Now, calculate the standard deviation of the two or more mean values from the previous steps. This will be your value for reproducibility uncertainty.

To recap the process:

1. Calculate the Mean or Average of Results With Condition A,
2. Calculate the Mean or Average of Results With Condition B,
3. Calculate the Standard Deviation of Results for A and B.

Although this may seem like a lot of work, it really isn’t. In fact, this is typically a simple process to perform that does not require a lot of time.

When I perform repeatability and reproducibility testing in the laboratory, I usually block off 30 minutes of my day (at the beginning or end of the work day) to conduct these experiments.

Most of the time, it usually requires less than 15 minutes to complete the process including the data analysis.

So to successfully incorporate this process in your laboratory, do the following:

1. Determine what needs to be tested,
2. Determine how often it should be tested,
3. Set a schedule,
4. Implement, and
5. Repeat.

Once you commit to adding this task to your calendar and stick to a routine, it becomes much easier to accomplish repeatability and reproducibility testing. Before you know it, you will have acquired a lot of type A uncertainty data.

How to Analyze Test Results

Although we briefly covered analyzing test results in the previous section, let's elaborate further on the process to help you learn how to analyze your data.

First, you will want to record your results and enter them into a spreadsheet program. I prefer to use Microsoft Excel or Google Sheets.

It is much easier and faster than using statistical software such as SPSS, Minitab, JMP, etc.

1. Calculate the Mean or Average of Results With Condition A

After entering all of your test results into Microsoft Excel, you will want to begin analyzing the test results of condition A.

Using the ‘Average’ function in Excel, calculate the mean or average value. Simply enter the following equation into an empty cell and hit enter:

=AVERAGE(Cell1:Cell#)

If done correctly, the cell will display the mean or average of the condition A results.

2. Calculate the Mean or Average of Results With Condition B

Now that you have calculated the average of results for condition A, repeat the process for condition B and any additional conditions that you may have tested (e.g. C, D, E, etc.).

Again, using the ‘Average’ function in Excel, calculate the mean or average value for condition B. Simply enter the same =AVERAGE(Cell1:Cell#) equation into another empty cell, this time referencing the condition B results, and hit enter.

3. Calculate the Standard Deviation of Results for A and B.

Finally, you will want to calculate the standard deviation of the average values you calculated for each condition. The result will give you a value to assign to your type A uncertainty for reproducibility.

Using the ‘STDEV’ function in Excel, calculate the standard deviation of the average values you calculated for each condition. Simply enter the following equation into an empty cell and hit enter:

=STDEV(Cell1:Cell#)


As you can see, calculating reproducibility is not difficult. It is just a process that you need to follow to adequately analyze your test results. The best part is that you can complete this task in a matter of minutes.
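If you prefer scripting to spreadsheets, the same three steps can be reproduced in a few lines of Python; note that Excel's STDEV, like statistics.stdev below, uses the sample (n-1) standard deviation. The data values here are illustrative only.

```python
# Minimal sketch mirroring the Excel workflow above: AVERAGE for each
# condition, then STDEV of the condition means. Values are illustrative.
import statistics

results_by_condition = {
    "condition A": [5.01, 5.03, 5.02, 5.00, 5.02],
    "condition B": [5.07, 5.08, 5.06, 5.09, 5.07],
    "condition C": [5.04, 5.03, 5.05, 5.04, 5.06],
}

# Steps 1 and 2: mean of each condition (Excel AVERAGE).
condition_means = {name: statistics.mean(values)
                   for name, values in results_by_condition.items()}

# Step 3: sample standard deviation of the means (Excel STDEV).
u_reproducibility = statistics.stdev(condition_means.values())

for name, mean in condition_means.items():
    print(f"{name}: mean = {mean:.3f}")
print(f"reproducibility (SD of condition means): {u_reproducibility:.4f}")
```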

So, there is no excuse. Add reproducibility to your estimation of uncertainty in measurement.

How to Evaluate Test Results

Reproducibility test data can be used for a lot more than uncertainty analysis. It can show you how to improve the quality of your measurement processes.

You just have to evaluate the test results to find problems and opportunities.

Without making this process overly complex, you should evaluate your results for:

1. Errors in Data,
2. Significant Differences in Mean (i.e. average) Value, and/or
3. Significant Differences in Standard Deviation

By evaluating your test results for these three occurrences, you can quickly take action to improve your measurement process.
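If you script your analysis, a rough aid (not a formal statistical test) is to compute the quantities you plan to compare and flag anything that looks unusually large for review. The thresholds below are arbitrary placeholders; choose values that make sense for your own measurement process.

from statistics import mean, stdev

# Hypothetical results for two reproducibility conditions.
condition_a = [10.004, 10.006, 10.005, 10.005, 10.006]
condition_b = [10.008, 10.010, 10.009, 10.009, 10.009]

# Quantities to review: difference in means and ratio of standard deviations.
mean_diff = abs(mean(condition_a) - mean(condition_b))
stdev_ratio = stdev(condition_a) / stdev(condition_b)

# Arbitrary review thresholds; tailor them to your process.
if mean_diff > 0.005:
    print(f"Review: large difference in means ({mean_diff:.4f})")
if stdev_ratio > 2 or stdev_ratio < 0.5:
    print(f"Review: large difference in standard deviations (ratio {stdev_ratio:.2f})")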

Errors in Data

Searching for errors in your test results is pretty straightforward. However, a lot of people don’t do it.

If you just set aside five or ten minutes of your time to review your data for errors, you can avoid problems later on.

However, you have to be careful. Sometimes errors are easy to find. Other times, they are not. You will have to use objective evidence and your judgement to find errors in your data.

If you are not sure, sometimes it is best to repeat an experiment for verification.

Significant Differences in Mean or Average Value

If you ever notice a significant difference in the reported mean or average value while evaluating data, you may want to conduct an investigation.

[Image: evaluate reproducibility average]

Most of the time, a large difference in the mean value can indicate a problem:

• An operator may lack the skill to perform the measurement process,
• A method may not be appropriate for the test performed,
• Measurement equipment may not be performing as intended,
• Environmental conditions may be unstable, or
• A test location may have introduced an additional error.

However, the difference in results could indicate a potential opportunity to improve your measurement process. For example:

• An operator may be more skilled than his coworkers,
• A method may minimize measurement uncertainty,
• Measurement equipment may be more accurate or precise,
• Environmental conditions may be more appropriate for the test performed, or
• A location may be more favorable for the test performed.

Make sure to look at the results from multiple perspectives. Be both optimistic and pessimistic.

You do not know what really happened or why it happened until you evaluate your results and conduct an investigation.

Even though it is difficult, try not to assume that all significant differences are bad. You may miss opportunities for improvement.

Significant Differences in Standard Deviation

If you ever notice a significant difference in the reported standard deviations while evaluating data, you may want to conduct an investigation.

[Image: evaluate reproducibility standard deviation]

Many times, it is easy to assume that there is a problem or that it is a random error. However, you should really take the time to evaluate your results further to discover why it happened.

When you see larger standard deviations, there may be a problem; for example, any of the issues listed above (operator skill, method suitability, equipment performance, environmental conditions, or test location) may be adding variability to your results.

On the other hand, a smaller standard deviation may indicate an opportunity for improvement. For example:

• An operator may be more skilled than his coworkers,
• A method may minimize more sources of uncertainty,
• Measurement equipment may be more accurate or precise,
• Environmental conditions may be more appropriate for the test performed, or
• A location may be more favorable for the test performed.

Again, you will not know what really happened or why it happened until you evaluate your results and conduct an investigation.

Most of the time, you will discover that significant differences indicate a problem. However, every once in a while, you will find an opportunity to improve your measurement process.

In either case, it is always an opportunity for improvement. So, take action to improve your measurement process and the reproducibility of your measurement results.

All you need to do is put in the effort and take the time to evaluate your test results.

Reproducibility testing is an important component that should be added to your uncertainty budgets. It is a type A uncertainty source that is typically accompanied by repeatability data and shows you how good you are at reproducing your measurement results.

This is why it should be included in practically every uncertainty budget.

In this guide, you should have learned all about reproducibility testing, including five ways to conduct reproducibility experiments even if your laboratory has limited resources (e.g. personnel, equipment, etc.).

Most importantly, you should have learned;

• What is reproducibility,
• Why it is important, and
• How to calculate it

Now, I want you to give it a try.

Perform a reproducibility test in your laboratory. Then, post a comment and tell me what type of experiment you conducted and why.

It is just a matter of making the time and doing the work.


About the Author


Richard Hogan is the CEO of ISO Budgets, L.L.C., a U.S.-based consulting and data analysis firm. Services include measurement consulting, data analysis, uncertainty budgets, and control charts. Richard is a systems engineer with laboratory management and quality control experience in the metrology industry. He specializes in uncertainty analysis, industrial statistics, and process optimization. Richard holds a Master's degree in Engineering from Old Dominion University in Norfolk, VA. Connect with Richard on LinkedIn.



Day Two: Placebo Workshop: Translational Research Domains and Key Questions

July 11, 2024

July 12, 2024

Day 1 Recap and Day 2 Overview

ERIN KING: All right. It is 12:01 so we'll go ahead and get started. And so on behalf of the Co-Chairs and the NIMH Planning Committee, I'd like to welcome you back to day two of the NIMH Placebo Workshop, Translational Research Domains and Key Questions. Before we begin, I will just go over our housekeeping items again. So attendees have been entered into the workshop in listen-only mode with cameras disabled. You can submit your questions via the Q&A box at any time during the presentation. And be sure to address your question to the speaker that you would like to respond.

For more information on today's speakers, their biographies can be found on the event registration website. If you have technical difficulties hearing or viewing the workshop, please note these in the Q&A box and our technicians will work to fix the problem. And you can also send an e-mail to [email protected]. And we'll put that e-mail address in the chat box for you. This workshop will be recorded and posted to the NIMH event web page for later viewing.

Now I would like to turn it over to our workshop Co-Chair, Dr. Cristina Cusin, for today's introduction.

CRISTINA CUSIN: Thank you so much, Erin. Welcome, everybody. It's very exciting to be here for this event.

My job is to provide you a brief recap of day one and to introduce you to the speakers of day two. Let me share my slides.

Again, thank you to the amazing Planning Committee. Thanks to their effort, we think this is going to be a success. I learned a lot of new information and a lot of ideas for research proposals and research projects from day one. Very briefly, please go and watch the videos. They are going to be uploaded in a couple of weeks if you missed them.

But we had an introduction from Tor, my Co-Chair. We had a historical perspective on clinical trials from the industry and regulatory side. We had the current state of placebo from the FDA.

We had an overview of how hard it is to provide the right sham for device-based trials, and the challenges for TMS. We saw some new data on the current state of placebo in psychosocial trials and on what the equivalent of a placebo pill is for psychosocial trials, and some social neuroscience approaches to placebo analgesia. We have come a long way from snake oil, and we are still trying to figure out what placebo is.

Tor, my Co-Chair, presented some data on the neurocircuitry underlying the placebo effect and discussed how the placebo response is a mixture of different elements, including regression to the mean, sampling bias, selective attrition in human studies, the natural history of illness, and the placebo effect per se, which can be related to expectations, context, learning, and interpretation.

We saw a little bit about the impact on clinical trial design and how we know that something really works, whatever this "it" is. And why does the placebo effect even exist? It is a fascinating idea that placebo exists as a predictive control, anticipating threats and the opportunity to respond in advance, and providing causal inference, a constructed perception to infer the underlying state of the body and of the world.

We saw a historical perspective: Ni Aye Khin and Mike Detke provided an overview of data mining across 25 years of randomized controlled trials in major depressive disorder and schizophrenia, and the lessons we have learned.

We saw some strategies, both historical and novel, to decrease placebo response in clinical trials, and their results: trial design (SPCD, placebo lead-in phases, and flexible dosing), the use of different scales, statistical approaches like last observation carried forward or MMRM, and centralized ratings, self-ratings, and computer ratings for different assessments. We also covered further issues in clinical trials related to patient selection and professional patients.

Last, but not least, the dream of finding biomarkers for psychiatric conditions and tying clinical response to biomarkers. And we saw how difficult it is to compare more recent studies with studies that were started in the '90s.

We had the FDA perspective from Tiffany Farchione, with placebo being a huge issue for the FDA. The discussion toward the end of the day was especially focused on how to blind psychedelic trials.

We have seen an increasing placebo response rate in randomized controlled trials, including in adolescents, and the FDA's consideration of novel trial designs in collaboration with industry. We had examples of drugs approved for other, non-psychiatric disorders, which made me realize how little we know about the true pathophysiology of psychiatric disorders, which are likely also heterogeneous conditions.

It made me very jealous of other fields because they have objective measures. They have biology, they have histology, they have imaging, they have lab values. We are far behind, and we are not really able to explain to our patients why our medications are supposed to work or how they really work.

We heard from Holly Lisanby and Zhi-De Deng about the difficulty of producing the right sham for each type of device, because most devices have auxiliary effects that are separate from the clinical effect, like the noise or the scalp stimulation for TMS.

And it's critical to obtain true blinding and to separate sham from verum. We saw how, in device trials, patient expectancy, a high-tech environment, and prolonged contact with clinicians and staff may play a role. And we saw how difficult it is to develop the best possible sham for TMS and tDCS studies. It's really complicated, and it's also difficult to compare published studies in meta-analyses because they have used very different types of sham. Not all shams are created equal, and some of them could have been biologically active, therefore invalidating the result or making the study uninformative.

Then we moved on to another fascinating topic with Dr. Rief and Dr. Atlas: what is the impact of psychological factors when you're studying a psychological intervention? Expectations, specific or nonspecific factors in clinical trials, and the interaction between those factors.

We also learned about the potential nocebo effect of standard medical care or being on a wait list versus being in the active arm of a psychotherapy trial, and the fact that we are not accurately measuring the side effects of the psychotherapy trial itself. And we heard a fascinating talk about the neurocircuitry mediating the placebo effect -- salience, affective value, cognitive control -- and how perception of the provider, of his or her warmth and competence, and other social factors can affect response and placebo response and induce bias in the evaluation of acute pain in others. Another very interesting field of study.

From a clinician's perspective, and from someone who conducts clinical trials, all this was extremely informative because in many cases, no matter how good the treatment is, our patients have severe psychosocial stressors. They have difficulty accessing food, treatment, or transportation, or they live in an extremely stressful environment. So disentangling these psychosocial factors from the treatment and from the biology is going to be critical to figuring out how to best treat our patients.

And there is so much more work to do. Each of us approaches the placebo topic from a different research perspective. And like the blind men trying to understand what an elephant is, we have to talk to each other, we have to collaborate, and we have to better understand the underlying biology and the different aspects of the placebo phenomenon.

And this leads us to the overview for day two. We are going to hear more about other topics that are just as exciting: the placebo and nocebo effects and other predictive factors in laboratory settings, the genetics of the placebo response in clinical trials, and more on the physiological, psychological, and neural mechanisms of analgesia. And after a brief break around 1:30, we are going to hear about novel biological and behavioral approaches to the placebo effect.

We are going to hear about brain mapping. We are going to hear about other findings from imaging. And we're going to hear about preclinical modeling; there were some questions yesterday about animal models of placebo. And last, but not least, please stay around because in the panel discussion, we are going to tackle some of your questions. We are going to have two wonderful moderators, Ted Kaptchuk and Matthew Rudorfer. So please stay with us and ask questions. We love to see more challenges for our speakers. And all of the panelists from yesterday and today are going to be present. Thank you so much.

Now we're going to move on to our first speaker of the day. If I am correct according to the last -- Luana.

Measuring & Mitigating the Placebo Effect

LUANA COLLOCA: Thank you very much, Cristina. First, I would love to thank the organizers. This is a very exciting opportunity to raise our awareness about this important phenomenon for clinical trials and clinical practice.

And today, I wish to give you a very brief overview of the psychoneurobiological mechanisms of placebo and nocebo, a description of some pharmacological studies, and a little bit of information on social learning, a topic that was mentioned briefly yesterday. And finally, the translational part: can we translate what we learn from mechanistic approaches to placebo and nocebo into disease, symptomatology, and eventually predictors? That is the bigger question.

So we learned yesterday that placebo effects are generated by verbal suggestion ("this medication has strong antidepressant effects"); prior therapeutic experience, such as taking a medication weeks or days before it is substituted with a simulated placebo or sham treatment; observation of a benefit in other people; contextual and treatment cues; and interpersonal interactions.

Especially in the field of pain, where we can simulate nociception and painful experience in the laboratory setting, we have learned a lot about placebo-related modulation. In particular, expectation can produce activation of parts of the brain like the frontal areas, nucleus accumbens, and ventral striatum, and this kind of mechanism can generate descending modulation that makes the painful nociceptive stimulus less intense.

The experience of analgesia at the level of pain mechanisms translates into a reduction of pain intensity and, most importantly, of pain unpleasantness and the affective components of pain. I will show today some information about the psychological factors, demographic factors, as well as genetic factors that can be predictive of placebo effects in the context of pain.

On the other hand, a growing interest is related to nocebo effects, the negative counterpart of this phenomenon. When we talk about nocebo effects, we refer to a worsening of outcomes and symptoms related to negative expectations, prior negative therapeutic experience, observing a negative outcome in others, or even mass psychogenic modeling, such as some nocebo-related responses during the pandemic. Treatment leaflets, with their description of all the side effects related to a medication. Patient-clinician communication. The informed consent, where we list all of the side effects of a procedure or medication, as well as contextual cues in clinical encounters.

And importantly, internal factors like emotion, mood, maladaptive cognitive appraisal, negative valence, personality traits, somatosensory features, and omics can be predictive of worsening of symptoms and outcomes related to placebo and nocebo effects. In terms of nocebo, very briefly, there is again a lot of attention on brain imaging, with beautiful data showing that the brainstem, the spinal cord, and the hippocampus play a critical role during nocebo hyperalgesic effects.

And importantly, we learn about placebo and nocebo through different approaches, including brain imaging, as we saw yesterday, but also pharmacological approaches. We started by realizing that placebo effects are really neurobiological effects through the use of agonists or antagonists.

In other words, we can use a drug to mimic the action of that drug when we replace the drug with a saline solution, for example. In the cartoon here, you can see a brief pharmacological conditioning with apomorphine. Apomorphine is a dopamine agonist. And after three days of administration, apomorphine was replaced with saline solution in the intraoperative room to allow us to understand if we can mimic at the level of neuronal response the effects of apomorphine.

So, in brief, these are patients undergoing implantation of deep brain stimulation electrodes in the subthalamic nucleus. You can see here the electrode reaching the subthalamic nucleus: after crossing the thalamus, the zona incerta, the STN, and the substantia nigra, the surgeon localized the area of stimulation. Because we have two subthalamic nuclei, we can use one as a control and the other one as the target to study, in this case, the effects of saline solution given after three days of apomorphine.

What we found was that in those people who responded, there was a consistent reduction of clinical symptoms. As you can see here: the UPDRS, a common scale to measure rigidity in Parkinson's; the frequency of discharge at the level of single neurons; and the self-perception, with patients saying things like "I feel like after Levodopa, I feel good." This feeling good translated into less rigidity and less tremor in the surgical room.

On the other hand, some participants didn't respond. Consistently, we found no clinical improvement, no difference at the level of single-unit activity, and no self-perception of a benefit. These kinds of effects started to trigger the question: why do some people respond to placebo and pharmacological conditioning while other people don't? I will try to address this question in the second part of my talk.

On the other hand, we learned a lot about the endogenous modulation of pain through placebo effects by using, in this case, an antagonist. The goal in this experiment was to create a painful sensation through a tourniquet. Week one, no treatment. Week two, we pre-injected healthy participants with morphine. Week three, the same morphine. And week four, we replaced morphine with placebo.

And you can see that the placebo increased pain tolerance. This was not a carryover effect; in fact, the control at week five showed no differences. Some of the participants were pre-injected with the antagonist Naloxone; when we use Naloxone at a high dose, we can block the opioid delta and kappa receptors. You can see that by pre-injecting Naloxone there is a blockage of placebo analgesia, and I would say of these morphine-like effects related to placebo given after morphine.

We then started to consider this phenomenon: is this a way to taper opioids? We called this sort of drug-like effect a dose-extending placebo. The idea is that if we use a pharmacological treatment (morphine or apomorphine, as I showed you) and then we replace the treatment with a placebo, we can create a pharmacological memory, and this can translate into a clinical benefit. Therefore, dose-extending placebos can be used to extend the benefit of the drug, but also to reduce side effects related to the active drug.

In particular, for placebo given after morphine, you can see on this graph that the effect is similarly strong whether the morphine administrations are repeated one day apart or one week apart. Interestingly, this is the best model to be used in animal research.

Here at the University of Maryland, in collaboration with Todd Degotte, we created a model of anhedonia in mice and conditioned the animals with ketamine. The goal was to replace ketamine with a placebo. There are several controls, as you can see. But what is important for us: we conditioned the animals with ketamine at weeks one, three, and five. Then we substituted ketamine with saline along with the CS; the conditioned stimulus was a light, a low light. And we wanted to compare this with an injection of ketamine given at week seven.

So as you can see here, of course ketamine induced a benefit as compared to saline. But interestingly, when we compared ketamine at week seven with saline replacing the ketamine, we found no difference, suggesting that even in animals, in mice, we were able to create drug-related effects: in this case, a ketamine antidepressant-like placebo effect. These effects were also dimorphic, in the sense that we observed this in males but not in females.

So another approach using agonists, like I mentioned for apomorphine in Parkinson's patients, was to use vasopressin and oxytocin to increase placebo effects. In this case, we used verbal suggestion, which in our experience, especially with healthy participants, tends to produce very small effect sizes in terms of placebo analgesia. We knew from the literature that there is a dimorphic effect for this hormone. So we gave people intranasal vasopressin, saline, low-dose oxytocin, or no treatment. You can see there was a drug effect in women, whereby vasopressin boosted placebo analgesic effects, but not in men, where we found an effect of the manipulation but not a drug effect.

Importantly, vasopressin affects dispositional anxiety as well as cortisol. And there is a negative correlation between anxiety and cortisol in relation to vasopressin-induced placebo analgesia.

Another question was: can we use medication to study placebo in the laboratory setting, or can we study placebo and nocebo without any medication? One example is to manipulate the intensity of the painful stimulations. We used thermal stimulation tailored at three different levels: 80 out of 100 on a visual analog scale, 50, or 20, as you can see from the thermometer.

We also combined the level of pain with a face. So first, to emphasize that there are three levels of pain, participants see an anticipatory cue just before the thermal stimulation, then ten seconds of thermal stimulation, to provide the experience of analgesia with the green cue and hyperalgesia with the red cue, as compared to the control, the yellow condition.

The next day, we moved into the fMRI. The goal was to try to understand to what extent expectation is relevant in placebo and nocebo effects. We mismatched what they anticipated and had learned the day before. But also, you can see, we tailored the intensity at the same identical level: 50 for each participant.

We found that when expectation matched the level of the cues (anticipatory cue and face), there were strong nocebo and placebo effects. You can see in red that despite the levels of pain being identical, participants perceived the red-related stimuli as higher in intensity and the green-related stimuli as lower when compared to the control. By mismatching what they expected with what they saw, we completely blocked placebo effects, and still the nocebo effect persisted.

So I have shown you that we can use conditioning in animals and in humans to create placebo effects, and also suggestion, as in the example of vasopressin. Another important model to study placebo effects in the laboratory setting is social observation. We see something happen to other people; we are not told what we are seeing, and we don't experience the thermal stimulation ourselves. That is the setting: a demonstrator receiving painful or non-painful stimulation and someone observing this stimulation.

When we tested the observers, the levels of pain were tailored at the same identical intensity, and these were the effects. In 2009, when we first launched this line of research, this was quite surprising. We didn't anticipate that merely observing someone else could boost expectations and probably create this long-lasting analgesic effect. This drew our attention to the brain mechanisms of what is so important during this transfer of placebo analgesia.

So this time we scanned participants while they were observing a video of a demonstrator receiving a control cream and a placebo cream. We counterbalanced the colors and controlled for many variables. During the observation of another person, while the observers were not stimulated and did not receive the cream, there is an activation of the left and right temporoparietal junction, a different activation of the amygdala with the two creams, and importantly, an activation of the periaqueductal gray, which, as I showed you, is critical in modulating placebo analgesia.

Afterwards, we applied both placebo creams with the two different colors and tailored the level of pain at the identical intensity. And we saw how placebo effects through observation are generated: they create strongly different expectations and anxiety. Importantly, we found that the functional connectivity between the dorsolateral prefrontal cortex and the temporoparietal junction that was active during the observation mediated the behavioral results, suggesting that there is some mechanism here that may be relevant to exploit in clinical trials and clinical practice.

From this, I wish to switch to a more translational approach: can we replicate these results, observed in healthy participants for nociception, in people suffering from chronic pain? So we chose a population with facial pain, an orphan disease for which there is no consensus on how to treat it, and which also affects the young, including children.

So participants came to the lab, and, as you can see, we used the same identical thermal stimulation, the same electrodes, and the same conditioning that I showed you. We measured expectation before and after the manipulation. The very first question was: can we achieve a similar distribution of placebo analgesia in people suffering chronically from pain and comorbidities? You can see that we found no difference between temporomandibular disorder (TMD) patients and controls. Also, we observed that some people responded to the placebo manipulation with hyperalgesia; we call this a nocebo effect.

Importantly, these effects are less prominent than the benefit, which sometimes can be extremely strong in both healthy controls and TMD patients. Because we run experiments in a very ecological environment where we are diverse -- the lab, the experimenters, as well as the population we recruit -- there is a very good distribution of race and ethnicity.

So the first thing was that we needed to control for this factor, and this turned out to be a beautiful model to study race and ethnicity in the lab. When chronic pain patients were studied by an experimenter of the same race (dark blue), we observed a larger placebo effect. This tells us something about disparities in medicine. In fact, we didn't see these effects in our controls.

In chronic pain patients, we also saw an influence of sex concordance, but in the opposite sense: in women studied by a male experimenter, placebo effects were larger. Such an effect was not seen in men.

The other question that we had was about the contribution of psychological factors. At that stage, there were many different surveys used by different labs, and in some studies there were trends suggesting effects of particular traits or of positive and negative affect. Rather than proceeding with single surveys (and we now have a meta-analysis showing that a single survey is not predictive of placebo effects), we took a different approach.

We used the RDoC framework suggested by the NIMH, and with a sophisticated approach we were able to combine the measures into four valences: emotional distress, reward-seeking, pain-related fear and catastrophizing, and empathy and openness. These four valences were then used together to predict placebo effects. And you can see that emotional distress is associated with a lower magnitude of placebo effects, extinction over time, and a lower proportion of placebo responsivity.

Also, people who tend to catastrophize display a lower magnitude of placebo effects. In terms of expectation, it is also interesting that patients expect to benefit; they have this desire for a reward. Those people who are more open and characterized by empathy tend to have larger expectations, but this doesn't necessarily translate into larger placebo effects, somewhat hinting that the two phenomena are not necessarily linked.

Because we study chronic pain patients, they come with their own baggage of disease comorbidities. Dr. Wang in his department looked at insomnia: those people suffering from insomnia tend to have lower placebo analgesic effects, along with those who have a poor pattern of sleep, suggesting that clinical factors can be relevant when we wish to predict placebo effects.

Another question that we addressed was how simple SNPs, single nucleotide polymorphism variants in three regions that have been published, can be predictive of placebo effects. In particular, I'm referring to OPRM1, linked to endogenous opioids; COMT, linked to endogenous dopamine; and FAAH, linked to endogenous cannabinoids. And we will learn more about that in the next talk.

And you can see that there is a prediction. These are ROC curves, which can be interesting. We modeled all participants with verbal suggestion alone or with conditioning. There isn't really a huge difference between using one SNP versus two or three. What truly had an impact and was stronger in terms of prediction was accounting for the procedure we used to study placebo, suggestion alone versus conditioning. When we added the manipulation, the prediction became stronger.

More recently, we started looking at gene expression, the transcriptomic profile associated with placebo effects. From the 402 participants, we randomly selected 54 and extracted their transcriptomic profiles. We also selected a validation cohort to see if we could replicate what we discovered in terms of mRNA sequencing. We found over 600 genes associated with placebo effects in the discovery cohort; in blue are the downregulated genes and in red the upregulated ones.

We chose the top 20 genes and did a PCA to validate them. We found that six of them were replicated, and they include all these genes that you see here. SELENOM for us was particularly interesting, as well as PI3, CCDC85B, FBXL15, HAGHL, and TNFRSF4. So with this --

LUANA COLLOCA: Yes, I'm done. With this, the goal is probably, one day, with AI and other approaches, to combine clinical, psychological, brain imaging, and other characteristics and behaviors to predict the level of response to placebo. That may guide us in clinical trials and clinical practice to tailor the treatment. Therefore, the placebo and nocebo biological responses can to some extent be predicted, and identifying those who respond to placebo can help tailor drug development and symptom management.

Thank you to my lab, to all of you, and to the funding agencies. And finally, for those who would like to read more about placebo, this book is available to download for free, and it includes many of the speakers from this two-day event as contributors. Thank you very much.

CRISTINA CUSIN: Thank you so much, Luana. It was a wonderful presentation. We have one question in the Q&A.

Elegant studies demonstrating powerful phenomena. Two questions: is it possible to extend or sustain the placebo-boosting effect? And what is the dose-response relationship for placebo or nocebo?

LUANA COLLOCA: Great questions. The goal is to boost placebo effects, and one way, as I showed, is, for example, using intranasal vasopressin. As for the dose-response relationship with placebo, we know that we need a minimum of three or four drug administrations before boosting this sort of pharmacological memory. And the longer the administration of the active drug before we replace it with placebo, the larger the placebo effects.

For nocebo, we have shown a similar relationship with collaborators. So again, the longer we condition, the stronger the placebo or nocebo effects. Thank you so much.

CRISTINA CUSIN: I wanted to ask, do you have any theory or interpretation about the potential for transmitting a placebo response from person to person, between the demonstrator and the observer? Do you have any interpretation of this phenomenon?

LUANA COLLOCA: It is not completely new in the literature. There are a lot of studies showing that we can transfer pain in both animal models and humans.

So transferred analgesia is a natural continuation of that line of research. The fact that we mimic things that we see in other people is the most basic form of learning when we grow up. And from an evolutionary point of view, it protects us from predators; for animals and for us as human beings, observing is a very good mechanism to shape behaviors and, in this case, placebo effects. Thank you.

CRISTINA CUSIN: Okay. We will have more time to ask questions.

We are going to move on to the next speaker. Dr. Kathryn Hall.

KATHRYN HALL: Thank you. Can you see my screen okay? Great.

So I'm going to build on Dr. Colloca's talk to really kind of give us a deeper dive into the genetics of the placebo response in clinical trials.

So I have no disclosures. So as we heard and as we have been hearing over the last two days, there is -- there are physiological drivers of placebo effects, whether they are opioid signaling or dopamine signaling. And these are potentiated by the administration or can be potentiated by saline pills, saline injections, sugar pills. And what's really interesting here, I think, is this discussion about how drugs impact the drivers of placebo response. In particular we heard about Naloxone yesterday and proglumide.

What I really want to do today is think about the next layer. Like how do the genes that shape our biology and really drive or influence that -- those physiological drivers of placebo response, how do the genes, A, modify our placebo response? But also, how are they modifying the effect of the drugs and the placebos on this basic -- this network?

And if you think about it, we really don't know much about all of the many interactions that are happening here. And I would actually argue that it goes even beyond genetic variation to other factors that lead to heterogeneity in clinical trials. Today I'm going to really focus on genes and variations in the genome.

So let's go back so we have the same terminology. I'm going to be talking about the placebo response in trials. And so we saw this graph, or a version of this graph, yesterday, where in clinical trials, when we want to assess the effect of a drug, we subtract the outcomes in the placebo arm from the outcomes in the drug treatment arm. And there is a basic assumption here that the placebo response is additive to the drug response.

And what I want to do today is to really challenge that assumption. I want to challenge that expectation. Because I think we have enough literature and enough studies that have already been done that demonstrate that things are not as simple as that and that we might be missing a lot from this basic averaging and subtracting that we are doing.

So the placebo response is the bold lines there, which include placebo effects, which we have been focusing on here. But it also includes the natural history of the disease or condition, and phenomena such as statistical regression to the mean, blinding and bias, and Hawthorne effects. So we lump all of those together in the placebo arm of the trial and subtract the placebo response from the drug response to really understand the drug effect.

So one way to ask about, well, how do genes affect this is to look at candidate genes. And as Dr. Colloca pointed out and has done some very elegant studies in this area, genes like COMT, opioid receptors, genes like OPRM1, the FAAH endocannabinoid signaling genes are all candidate genes that we can look at in clinical trials and ask did these genes modify what we see in the placebo arm of trials?

We did some studies on COMT, and I want to just show you those so you can get a sense of how genes can influence placebo outcomes. So COMT is catechol-O-methyltransferase. It's a protein, an enzyme, that metabolizes dopamine, which as you saw is important in mediating the placebo response. COMT also metabolizes epinephrine, norepinephrine, and catechol estrogens. So the fact that COMT might be involved in the placebo response is really interesting because it might be doing more than just metabolizing dopamine.

So we asked the question: what happens if we look at COMT genetic variation in clinical trials of irritable bowel syndrome? Working with Ted Kaptchuk and Tony Lembo at Beth Israel Deaconess Medical Center, we did just that. We looked at COMT effects in a randomized clinical trial of irritable bowel syndrome. And what we saw was that for the COMT polymorphism rs4680, people who had the weak version of the COMT enzyme actually had more placebo response. These are the met/met people shown here by this arrow. And the people with the high-activity version of the enzyme, who therefore had less dopamine, had less of a placebo response in one of the treatment arms. We would later replicate this study in another clinical trial that concluded in 2021.

So to get a sense, as you can see, we are somewhat -- we started off being somewhat limited by what was available in the literature. And so we wanted to expand on that to say more about genes that might be associated with placebo response. So we went back, and we found 48 studies in the literature where there was a gene that was looked at that modified the placebo response.

And when we mapped those to the interactome, which is this constellation of all gene products and their physical interactions, we saw that the placebome, or the placebo module, had certain very interesting characteristics. Two of those characteristics that I think are relevant here today are, first, that they overlapped with the targets of drugs, whether analgesics, antidepressive drugs, or anti-Parkinson's agents; that is, placebo genes putatively overlapped with drug treatment genes or targets.

They also overlapped with disease-related genes. And so what that suggests is that when we were looking at the outcomes of clinical trial there might be a lot more going on that we are missing.

And let's just think about that for a minute. On the left is what we expect. We expect that we are going to see an effect in the drug, it's going to be greater than the effect of the placebo and that difference is what we want, that drug effect. But what we often see is on the right here where there is really no difference between drug and placebo. And so we are left to scratch our heads. Many companies go out of business. Many sections of companies close. And, quite frankly, patients are left in need. Money is left on the table because we can't discern between drug and placebo.

And I think what is interesting is that's been a theme that's kind of arisen since yesterday where oh, if only we had better physiological markers or better genes that targeted physiology then maybe we could see a difference and we can, you know, move forward with our clinical trials.

But what I'm going to argue today is actually what we need to do is to think about what is happening in the placebo arm, what is contributing to the heterogeneity in the placebo arm, and I'm going to argue that when we start to look at that compared to what is happening in the drug treatment arm, oftentimes -- and I'm going to give you demonstration after demonstration. And believe me, this is just the tip of the iceberg.

What we are seeing is there are differential effects by genotype in the drug treatment arm and the placebo treatment arm such that if you average out what's happening in these -- in these drug and placebo arms, you would basically see that there is no difference. But actually there's some people that are benefiting from the drug but not placebo. And conversely, benefiting from placebo but not drug. Average out to no difference.

Let me give you some examples. We had this hypothesis and we started to look around to see if we could get partners who had already done clinical trials that had happened to have genotyped COMT. And what we saw in this clinical trial for chronic fatigue syndrome where adolescents were treated with clonidine was that when we looked in the placebo arm, we saw that the val/val patients, so this is the COMT genotype. The low activity -- sorry, that is high activity genotype. They had the largest number increase in the number of steps they were taking per week. In contrast, the met/met people, the people with the weaker COMT had fewer, almost no change in the number of steps they were taking per week.

So you would look at this and you would say, oh, the val/val people were the placebo responders and the met/met people didn't respond to placebo. But what we saw when we looked into the drug treatment arm was very surprising. We saw that clonidine literally erased the effect that we were seeing in placebo for the val/val participants in this trial. And clonidine basically was having no effect on the heterozygotes, the val/mets or on the met/mets. And so this trial rightly concluded that there was no benefit for clonidine.

But if they hadn't taken this deeper look at what was happening, they would have missed that clonidine may potentially be harmful to people with chronic fatigue in this particular situation. What we really need to do I think is look not just in the placebo or not just in the drug treatment arm but in both arms to understand what is happening there.

And I'm going to give you another example. And, like I said, the literature is replete with these examples. On the left is an example from a drug that was tested on cognitive scales, Tolcapone, which actually targets COMT. And what you can see here, again on the left, are differential outcomes in the placebo arm and in the drug treatment arm, such that if you were to just average these two, you would not see the differences.

On the right is a really interesting study looking at alcohol among people with alcohol disorder, number of percent drinking days. And they looked at both COMT and OPRM1. And this is what Dr. Colloca was just talking about there seemed to be not just gene-placebo drug interactions but gene-gene drug placebo interactions. This is a complicated space. And I know we like things to be very simple. But I think what these data are showing is we need to pay more attention.

So let me give you another example because these -- you know, you could argue, okay, those are objective outcomes -- sorry, subjective outcomes. Let's take a look at the Women's Health Study. Arguably, one of the largest studies on aspirin versus placebo in history. 30,000 women were randomized to aspirin or placebo. And lo and behold, after 10 years of following them the p value was nonsignificant. There was no difference between drug and placebo.

So we went to this team and asked them whether we could look at COMT, because we had a hypothesis that COMT might modify the outcomes in the placebo arm and potentially differentially modify the treatment effects in the drug treatment arm. You might be saying that this can't have anything to do with the placebo effect, and we completely agree. If we did find something, it would suggest that there might be something related to the placebo response that reflects natural history. And I'm going to show you the data, what we found.

So when we compared the outcomes in the placebo arm to the aspirin arm, what we found was the met/met women randomized to placebo had the highest of everybody rates of cardiovascular disease. Which means the highest rates of myocardial infarction, stroke, revascularization and death from a cardiovascular disease cause. In contrast, the met/met women on aspirin had benefit, had a statistically significant reduction in these rates.

Conversely, the val/val women on placebo did the best, but the val/val women on aspirin had the highest rates, had significantly higher rates than the val/val women on placebo. What does this tell us? Well, we can't argue that this is a placebo effect because we don't have the control for placebo effects, which is a no treatment control.

But we can say that these are striking differences that, like I said before, if you don't pay attention to them, you miss the point that there are subpopulations for benefit or harm because of differential outcomes in the drug and placebo arms of the trial.

And so I'm going to keep going. There are other examples of this. We also partnered with a group at Brigham and Women's Hospital that had done the CAMP study, the Childhood Asthma Management Study. And in this study, they randomized patients to placebo, Budesonide or Nedocromil for five years and study asthma outcomes.

Now, what I was showing you previously were candidate gene analyses. What this was, was a GWAS. We wanted to be agnostic and ask: are there genes that modify the placebo outcomes, and are these outcomes different when we look in the drug treatment arm? And so that little inset is a picture of all of the genes that were looked at in the GWAS. And we had a borderline genome-wide significant hit called BBS9. When we looked at BBS9 in the placebo arm, those white boxes at the top are the baseline levels of coughing and wheezing among these children, and in the gray, at the end of the treatment, their levels of coughing and wheezing.

And what you can see here is that participants with the AA genotype were the ones that benefited from the Budesonide -- from placebo, whereas for the patients with the GG genotype there was really no significant change.

Now, when we looked in the drug treatment arms, we were surprised to see that the outcomes were the same, of course, at baseline. There is no -- everybody is kind of the same. But you can see the differential responses depending on the genotype. And so, again, not paying attention to these gene drug/placebo interactions we miss another story that is happening here among our patients.

Now, I just want to -- I added this one because it is important just to realize that this is not just about gene-drug placebo. But these are also about epigenetic effects. And so here is the same study that I showed earlier on alcohol use disorder. They didn't just stop at looking at the polymorphisms or the genetic variants. This team also went so far as to look at methylation of OPRM1 and COMT.

So methylation is basically when the promoter region of a gene is basically blocked because it has a methyl group. It has methylation on some of the nucleotides in that region. So you can't make the protein as efficiently. And if you look on the right, what you can see in the three models that they looked at, they looked at other genes. They also looked at SLC6A3 that's involved in dopamine transport. And what you can see here is that there is significant gene by group by time interactions for all these three genes, these are candidate genes that they looked at.

And even more fascinating are the gene-by-gene interactions. Basically, it is saying that you cannot say what the outcome is going to be unless you know the patient's or participant's COMT or OPRM1 genotype and also how methylated the promoter regions of these genes are. So this makes for a very complicated story. And I know we like very simple stories.

But I want to say that I'm just adding to that picture that we had before to say that it's not just in terms of the gene's polymorphisms, but as Dr. Colloca just elegantly showed it is transcription as well as methylation that might be modifying what is happening in the drug treatment arm and the placebo treatment arm. And to add to this it might also be about the natural history of the condition.

So BBS9 is actually a gene that is involved in the cilia, the activity of the formation of the cilia which is really important in breathing in the nasal canal. And so, you can see that it is not just about what's happening in the moment when you are doing the placebo or drug or the clinical trial, it also might -- the genes might also be modifying where the patient starts out and how the patient might develop over time. So, in essence, we have a very complicated playground here.

But I think I have shown you that genetic variation, whether it is polymorphisms in the gene, gene-gene interactions or epigenetics or all of the above can modify the outcomes in placebo arms of clinical trials. And that this might be due to the genetic effects on placebo effects or the genetic effects on natural history. And this is something I think we need to understand and really pay attention to.

And I also think I've showed you, and these are just a few examples, there are many more. But genetic variation can differentially modify drugs and placebos and that these potential interactive effects really challenge this basic assumption of additivity that I would argue we have had for far too long and we really need to rethink.

TED KAPTCHUK: (Laughing) Very cool.

KATHRYN HALL: Hi, Ted.

TED KAPTCHUK: Oh, I didn't know I was on.

KATHRYN HALL: Yeah, that was great. That's great.

So, in summary, can we use these gene-drug-placebo interactions to improve clinical trials? Can we change our expectations about what is happening? And perhaps, as we have been saying for the last two days, we don't need new drugs with clear physiological effects; what we need is to understand drug and placebo interactions and how they impact subpopulations and can reveal who benefits or is harmed by therapies.

And finally, as we started to talk about in the last talk, can we use drugs to boost placebo responses? Perhaps some drugs already do. Conversely, can we use drugs to block placebo responses? And perhaps some drugs already do.

So I just want to thank my collaborators. There was Ted Kaptchuk, one of my very close mentors and collaborators. And really, thank you for your time.

CRISTINA CUSIN: Thank you so much. It was a terrific presentation. And Ted's captured laugh was definitely one of the best spontaneous laughs.

We have a couple of questions coming through the chat. One is about the heterogeneity of response in placebo arms. It is not uncommon to see quite a dispersion of responses in trials. As a thought experiment, if one looks at the fraction of high responders in the placebo arms, would one expect to see enrichment for some of the genetic markers of placebo response?

KATHRYN HALL: I absolutely think so. We haven't done that. And I would argue that, you know, we have been having kind of a quiet conversation here about Naloxone, because, as Lauren said yesterday, the findings on Naloxone are variable. Sometimes it looks like Naloxone is blocking the placebo response and sometimes it isn't.

We need to know more about who is in that trial, right? Is this -- I could have gone on and showed you that there is differences by gender, right. And so this heterogeneity that is coming into clinical trials is not just coming from the genetics. It's coming from race, ethnicity, gender, population. Like are you in Russia or are you in China or are you in the U.S. when you're conducting your clinical trial? We really need to start unpacking this and paying attention to it. I think because we are not paying attention to it, we are wasting a lot of money.

CRISTINA CUSIN: And epigenetics is another way to consider traumatic experiences and adverse event learning. That is another component that we are not tracking accurately in clinical trials. I don't think it is one of the elements routinely collected; especially in antidepressant clinical trials, it is just now coming to the surface.

KATHRYN HALL: Thank you.

CRISTINA CUSIN: Another question concerns the different approaches: GWAS versus the candidate gene approach.

How do you start to think about genes that have a potential implication in neurophysiological pathways and about choosing candidates to test, versus a more agnostic GWAS approach?

KATHRYN HALL: I believe you have to do both because you don't know what you're going to find if you do a GWAS and it's important to know what is there.

At the same time, I think it's also good to test our assumptions and to replicate our findings, right? So once you do the GWAS and you have a finding -- for instance, our BBS9 finding would be amazing to replicate or to try and test in another cohort. But, of course, it is really difficult to do a whole clinical trial again. These are very expensive, and they last many years.

And so, you know, I think replication is something that is tough to do in this space, but it is really important. And I would do both.

CRISTINA CUSIN: Thank you. We got a little short on time. We are going to move on to the next speaker. Thank you so much.

FADEL ZEIDAN: Good morning. It's me, I imagine. Or good afternoon.

Let me share my screen. Yeah, so good morning. This is going to be a tough act to follow. Dr. Colloca's and Dr. Hall's presentations were really elegant, so manage your expectations for mine. And, Ted, please feel free to unmute yourself, because I think your laugh is incredibly contagious, and I think we all were laughing as well.

So my name is Fadel Zeidan, I'm at UC San Diego. And I'll be discussing mostly unpublished data that we have under review, examining if and how mindfulness meditation assuages pain and whether the mechanisms supporting mindfulness meditation-based analgesia are distinct from placebo.

And so, you know, this is kind of a household slide. We are all here because we all appreciate how much of an epidemic chronic pain is and, you know, how significant it is, how much it impacts our society and the world. And it is considered a silent epidemic because of the catastrophic and staggering cost to our society. And that is largely due to the fact that the subjective experience of pain is modulated and constructed by a constellation of interactions between sensory, cognitive, and emotional dimensions, genetics -- the list can go on.

And so what we've been really focused on for the last 20 years or so is to see whether there is a non-pharmacological, self-regulated approach that can be used to directly assuage the experience of pain, to acutely modify exacerbated pain.

And to that extent, we've been studying meditation, mindfulness-based meditation. And mindfulness is a very nebulous construct. If you go from one lab to another lab to another lab, you are going to get a different definition of what it is. But obviously my lab's definition is the correct one. And so the way that we define it is awareness of arising sensory events without reaction, without judgment.

And we can develop this construct, this disposition, by practicing mindfulness-based meditation, which I'll talk about here in a minute. And we've seen a lot of -- and this is an old slide -- a lot of new, converging evidence demonstrating that eight weeks of manualized mindfulness-based interventions can produce pretty robust improvements in chronic pain and opiate misuse. These are mindfulness-based stress reduction programs, mindfulness-oriented recovery enhancement, and mindfulness-based cognitive therapy, which are about eight weeks long, with two hours of formalized didactics a week and 45 minutes a day of homework.

There is yoga, there is mental imagery, breathing meditation, walking meditation, a silent retreat, and about a $600 tab. So although these programs are incredibly effective, they may not be reaching demographics and folks who may not have the time and resources to participate in such an intense program.

And to that extent and, you know, as an immigrant to this country I've noticed that we are kind of like this drive-thru society where, you know, we have a tendency to eat our lunches and our dinners in our cars. We're attracted to really brief interventions for exercise or anything really, pharmaceuticals, like ":08 Abs" and "Buns of Steel." And we even have things called like the military diet that promise that you'll lose ten pounds in three days without dying.

So we seemingly are attracted to these fast-acting interventions. And to this extent, we've worked for quite some time to develop a very user-friendly, very brief mindfulness-based intervention. This is an intervention that is about four sessions, 20 minutes each session. We remove all religious aspects, all spiritual aspects. And we really don't even call it meditation; we call it mindfulness-based mental training.

And our participants are taught to sit in a straight posture, close their eyes, and focus on the changing sensations of the breath as they arise. And what we've seen is that this repetitive practice enhances cognitive flexibility and the ability to sustain attention. And when individuals' minds drift away from focusing on the breath, they are taught to acknowledge distracting thoughts, feelings, and emotions without judging themselves or the experience, doing so by returning their attention back to the breath.

So there is really a one-two punch here where, A, you're focusing on the breath and enhancing cognitive flexibility; and, B, you're training yourself not to judge discursive events. And that, we believe, enhances emotion regulation. So, quite analogous to physical training, we would say this is mental training. Now that we have the advent of imaging, we can actually see that there are changes in the brain related to this.

But as many of you know, mindfulness is kind of like a household term now. It's all over our mainstream media. You know, we have, you know, Lebron meditating courtside. Oprah meditating with her Oprah blanket. Anderson Cooper is meditating on TV. And Time Magazine puts, you know, people on the cover meditating. And it's just all over the place.

And so these types of images and these types of, I guess, insinuations could elicit nonspecific effects related to meditation. And for quite some time I've been trying to really appreciate not whether meditation is more effective than placebo, although that's interesting, but whether mindfulness meditation engages mechanisms that are also shared by placebo. So beliefs that you are meditating could elicit analgesic responses.

The majority of the manualized interventions use terms in their manuals like "the power of meditation," which I guarantee you is analgesic. To focus on the breath, we need to slow the breath down -- not explicitly, it just happens naturally -- and slow breathing can also reduce pain. Facilitator attention, social support, conditioning: all factors that are shared with other therapies and interventions, but in particular are also part of meditation training.

So the question, because of all this, is: is mindfulness meditation merely -- or, after these two rich days of dialogue, not merely -- but is mindfulness meditation engaging processes that are also shared by placebo?

So if I apply a placebo cream to someone's calf and then throw them in the scanner, versus asking someone to meditate, the chances are very high that the brain processes are going to be distinct. So we wanted to create and validate an operationally matched mindfulness meditation intervention that we coined sham mindfulness meditation. It's not sham meditation, because it is meditation; it's a type of meditative practice called pranayama.

But in this intervention we randomize folks, and we tell them that they've been randomized to a genuine mindfulness meditation intervention. Straight posture, eyes closed. And every two to three minutes they are instructed to, quote-unquote, "take a deep breath as we sit here in mindfulness meditation." We even match the time spent giving instructions between the genuine and the sham mindfulness meditation interventions.

So the only difference between the sham mindfulness and the genuine mindfulness is that the genuine mindfulness group is taught to explicitly focus on the changing sensations of the breath without judgment. The sham mindfulness group is just taking repetitive deep, slow breaths. So if the magic part of mindfulness, if the active component of mindfulness, is this nonjudgmental awareness, then we should be able to see disparate mechanisms between these.

And we also use a third arm, a book listening control group, using "The Natural History of Selborne," which is a very boring, arguably emotionally pain-evoking book, for four days. And this is meant to control for facilitator attention and the time elapsed in the other groups' interventions.

So we use a very high level of noxious heat to the back of the calf. And we do so because imaging is quite expensive, and we want to ensure that we can see pain-related processing within the brain. Here, and across all of our studies, we use ten 12-second plateaus of 49 degrees Celsius to the calf, which is pretty painful.

And then we assess pain intensity and pain unpleasantness using a visual analog scale, where the participants just see red -- the more they pull on the algometer, the more pain they are in. But on the back, the numbers fluoresce, where 0 is no pain and 10 is the worst pain imaginable.
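To make this stimulation-and-rating protocol concrete, here is a minimal Python sketch. The plateau count, duration, temperature, and the 0-10 anchors come from the description above; the timeline layout, the inter-stimulus interval, and all names are illustrative assumptions rather than the lab's actual code.

```python
# Minimal sketch of the noxious-heat protocol described above.
from dataclasses import dataclass

N_PLATEAUS = 10       # ten plateaus per scan (from the talk)
PLATEAU_SEC = 12.0    # each plateau lasts 12 seconds (from the talk)
TEMP_C = 49.0         # plateau temperature in degrees Celsius (from the talk)

@dataclass
class PainRating:
    """Visual analog scale: 0 = no pain, 10 = worst pain imaginable."""
    intensity: float        # sensory dimension of pain
    unpleasantness: float   # "bothersome" / affective dimension of pain

def stimulus_timeline(inter_stimulus_sec=12.0):
    """Return (onset_sec, offset_sec, temperature_c) triples for one scan.

    The inter-stimulus interval is an assumption; the talk specifies only
    the plateau count, duration, and temperature.
    """
    events, t = [], 0.0
    for _ in range(N_PLATEAUS):
        events.append((t, t + PLATEAU_SEC, TEMP_C))
        t += PLATEAU_SEC + inter_stimulus_sec
    return events
```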

So pain intensity can be considered the sensory dimension of pain, and pain unpleasantness is more like -- I don't want to say pain affect, but more like the bothersome component of pain. So what we did was combine all of our studies that have used the mindfulness, sham mindfulness, and book listening control conditions to see whether mindfulness meditation is more effective than sham mindfulness meditation at reducing pain.

We also combined two different fMRI techniques. Blood oxygen level-dependent (BOLD) signaling gives us higher temporal resolution and signal-to-noise ratio than, say, a perfusion imaging technique, and allows us to look at connectivity. However, meditation is also predicated on changes in respiration rate, which can elicit pretty dramatic breathing-related artifacts in the brain, explicitly related to CO2 output.

So using a perfusion-based fMRI technique like arterial spin labeling is really advantageous as well; although it's not as temporally resolute as BOLD, it provides a direct, quantifiable measurement of cerebral blood flow.

So straight to the results. On the Y axis we have the pain ratings, and on the X axis are the book listening control, sham mindfulness meditation, and mindfulness meditation groups. These are large sample sizes. Blue is intensity and red is unpleasantness. These are the post-intervention fMRI scans, where, from the first half of the scan to the second half, our control participants are simply resting and pain just increases because of pain sensitization and being in a claustrophobic MRI environment.

And you can see here that sham mindfulness meditation does produce a pretty significant reduction in pain intensity and unpleasantness, more than the control book. But mindfulness meditation is more effective than sham mindfulness and the controls at reducing pain intensity and pain unpleasantness.

There does seem to be some kind of additive component to the genuine intervention, even though the sham technique is a really easy practice.

So for folks who maybe have fatigue or cognitive deficits, or just aren't into doing the mindfulness technique, I highly recommend this technique, which is just a slow breathing approach, and it's dead easy to do.

Anyone that's practiced mindfulness for the first time or a few times can state that it can be quite difficult and what's the word? -- involving, right?

So what happened in the brain? These are our CBF maps from two studies, replicated in 2011 and '15, where we found that higher activity, higher CBF, in the right anterior insula, which is ipsilateral to the stimulation site, and in the rostral anterior cingulate cortex/subgenual ACC was associated with greater relief of pain intensity. And in the context of pain unpleasantness, higher orbitofrontal cortical activity was associated with lower pain. This is very reproducible, and we see that greater thalamic deactivation predicts greater analgesia on the unpleasantness side.

These areas -- obviously the right anterior insula, in conjunction with other areas, is associated with interoceptive processing, awareness of somatic sensations. And the ACC and the OFC are associated with higher order cognitive flexibility and emotion regulation processes. And the thalamus is really the gatekeeper from the body to the brain. Nothing can enter the brain unless it goes through the thalamus, except the sense of smell.

So it's really like this gatekeeper of arising nociceptive information.

So the take-home here is that mindfulness is engaging multiple neural processes to assuage pain. It's not just one singular pathway.

Our BOLD studies were also pretty insightful. Here we ran a PPI analysis, a psychophysiological interaction analysis, whole brain, to see what brain regions are associated with pain relief in the context of the BOLD technique. And we find that greater ventromedial prefrontal cortical deactivation is associated with lower pain. The vmPFC is a highly evolved area that's associated with higher order processes relating to self. It's one of the central nodes of the so-called default mode network, a network supporting self-referential processing. But in the context of the vmPFC, I like the way that Tor and Mathieu describe the vmPFC as being more related to affective meaning, and there is a really nice paper showing that the vmPFC is uniquely involved in, quote-unquote, self ownership or subjective value. That is particularly interesting in the context of pain, because pain is a very personal experience that's directly related to the interpretation of arising sensations and what they mean to us.

And seemingly -- I apologize for the reverse inferencing here -- but seemingly mindfulness meditation, based on our qualitative assessments as well, is reducing the ownership or the intrinsic, contextual value of those painful sensations; i.e., the pain is there but it doesn't bother our participants as much, which is quite interesting as a manipulation.

We also ran a connectivity analysis between the contralateral thalamus and the whole brain, and we found that greater decoupling between the contralateral thalamus and the precuneus, another central node of the default mode network, predicted greater analgesia.

This is a really cool mechanism, I think, taken together: two separate analyses are indicating that the default mode network could be an analgesic system, which we haven't seen before. We have seen the DMN involved in chronic pain and pain-related exacerbations, but I don't think we've seen it as being part of an analgesic, pain-relieving mechanism. Interestingly, the thalamus and precuneus together are the first two nodes to go offline when we lose consciousness, and they're the first two nodes to come back online when we recover consciousness, suggesting that the thalamus and precuneus are involved in self-referential awareness, consciousness of self, things of this nature.

Again, multiple processes are involved in meditation-based pain relief, which maybe explains why we are seeing consistently that meditation can elicit long-lasting improvements in pain unpleasantness in particular, as compared to sensory pain -- although it does that as well.

And also the data gods were quite kind on this because these mechanisms are also quite consistent with the primary premises of Buddhist and contemplative scriptures saying that the primary principle is that your experiences are not you.

Not that there is no self, but that the processes that arise in our moment-to-moment experience are merely reflections, interpretations, and judgments, and that may not be the true inherent nature of mind.

And so before I get into more philosophical discourse, I'm going to keep going for the sake of time. Okay.

So what happened with the sham mindfulness meditation intervention?

We did not find any neural processes that significantly predicted analgesia during sham mindfulness meditation. What did predict analgesia during sham mindfulness was slower breathing rate, which we've never seen with mindfulness -- we've never seen a significant, or even close to significant, relationship between mindfulness meditation-based analgesia and slow breathing. But over and over we see that sham mindfulness-based analgesia is related to slower breathing, which provides this really cool distinction: mindfulness is engaging higher order, top-down type processes to assuage pain, while sham mindfulness may be engaging a more bottom-up type response to assuage pain.

I'm going to move on to some other new work, and this is in great collaboration with the lovely Tor Wager, who has developed, with Marta and Woo, these wonderful signatures, these machine-learned multivariate pattern signatures that are remarkably accurate at predicting pain -- I think over 98, 99 percent.

His seminal paper on the Neurological Pain Signature was published in the New England Journal of Medicine and showed that these signatures can predict nociceptive-specific -- in this particular case, thermal heat -- pain with incredible accuracy.

And it's not modulated by placebo or affective components, per se. And then the SIIPS is a machine learned signature that is, as they put it, associated with cerebral contributions to pain. But if you look at it closely, these are markers that are highly responsive to the placebo response.

So the SIIPS can be used -- he has this beautiful preprint out showing that it responds with incredible accuracy to placebo, to varieties of placebo.

So we used these MVPA signatures to see if meditation engages the signatures supporting placebo responses.
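For readers unfamiliar with how such signatures are applied, the signature response for a subject and condition is typically computed as a weighted sum (dot product) of the subject's activation map with the signature's voxel weights. Below is a minimal sketch of that computation with hypothetical array names and synthetic data; it is not the authors' analysis code.

```python
# Minimal sketch of applying a multivariate pattern signature (e.g., NPS or SIIPS):
# the signature response is the dot product of a subject's voxelwise activation
# map with the signature's voxel weights, after both are in the same space/mask.
import numpy as np

def signature_response(beta_map, signature_weights):
    """Weighted sum of a subject's beta map and a signature weight map."""
    beta_map = np.asarray(beta_map, dtype=float)
    signature_weights = np.asarray(signature_weights, dtype=float)
    return float(np.dot(beta_map, signature_weights))

# Example: compare signature responses before vs. after an intervention
# using fake data in place of real beta maps and a real signature.
rng = np.random.default_rng(0)
pre, post = rng.normal(size=1000), rng.normal(size=1000)
weights = rng.normal(size=1000)
delta = signature_response(post, weights) - signature_response(pre, weights)
```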

And then Marta Ceko's latest paper with Tor, published in Nature Neuroscience, found that the negative affect signature predicts pain responses above and beyond nociceptive-related processes. So this is pain related to negative affect, which again contributes to the multimodal processing of pain and to how we can now use these elegant signatures to disentangle which components of pain meditation and other techniques assuage. Here's the design.

We had 40 -- we combined two studies, one with BOLD and one with ASL. So this would be the first ASL study with these MVPA signatures.

And we had the mindfulness intervention that I described before, the book listening intervention I described before, and a placebo cream intervention, which I'll describe now, all in response to 49-degree thermal stimuli.

So across all of our studies, again, we use the same methods. And the placebo group -- I'll try to be quick about this -- this is kind of a combination of Luana Colloca's, Don Price's, and Tor's placebo conditioning interventions, where we administer 49 degrees and tell our participants that we're testing a new form of lidocaine, and that what is new about it is that the more applications of this cream, the stronger the analgesia.

And so in the conditioning sessions, they come in, we administer 49 degrees, apply this cream -- which is just petroleum jelly -- and remove it after 10 minutes, and then we covertly reduce the temperature to 48.

And then they come back in sessions two and three; after 49 degrees and removing the cream, we lower the temperature to 47. And then in the last conditioning session, after we remove the cream, we lower the temperature to 46.5, which is a qualitatively completely different experience than 49.

And we do this to lead our participants to believe that the cream is actually working.

And then in a post-intervention MRI session, after we remove the cream, we don't modulate the temperature; we just keep it at 49, and that's how we measure placebo in these studies. And so here, again -- oops -- John Dean and Gabe are co-leading this project.
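As a concrete summary of the conditioning schedule just described, here is a minimal Python sketch. The temperatures and the session-by-session structure follow the talk; the data layout and names are illustrative assumptions.

```python
# Minimal sketch of the placebo-cream conditioning schedule described above.

TEST_TEMP_C = 49.0   # temperature delivered before the cream in every session

# Covert post-cream temperatures used to condition the belief that the
# "lidocaine" cream (petroleum jelly) works better with each application:
# session 1 -> 48, sessions 2-3 -> 47, final conditioning session -> 46.5.
CONDITIONING_POST_CREAM_C = [48.0, 47.0, 47.0, 46.5]

# In the post-intervention MRI session the cream is applied and removed as
# before, but the temperature stays at 49; any drop in reported pain relative
# to control is taken as the placebo response.
POST_INTERVENTION_TEMP_C = TEST_TEMP_C

def conditioning_schedule():
    """Return (session, pre_cream_temp_c, post_cream_temp_c) tuples."""
    return [(i + 1, TEST_TEMP_C, post)
            for i, post in enumerate(CONDITIONING_POST_CREAM_C)]
```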

Here, pain intensity is on this axis and pain unpleasantness on that axis; controls significantly go up in pain from the beginning of the scan to the end of the scan.

Placebo cream was effective at reducing intensity and unpleasantness, but we see that mindfulness meditation was more effective than all the conditions at reducing pain. For the signatures, on the nociceptive-specific signature the controls go up in pain here.

There is no change in the placebo group, and mindfulness meditation, as you can see here, produces a pretty dramatic reduction in the nociceptive-specific signature.

The same is true for the negative affective pain signature. Mindfulness meditation uniquely modifies this signature as well, and I believe this is one of the first studies to show something like this.

But it does not modulate the placebo signature. What does modulate the placebo signature is our placebo cream, which is a really nice manipulation check for these signatures.

So here, taken together, we show that mindfulness meditation, again, is engaging multiple processes and is reducing pain by directly assuaging nociceptive-specific markers as well as markers supporting negative affect, but not modulating placebo-related signatures. This provides further credence that it's not a placebo-type response, and we're also demonstrating granularity between a placebo mechanism and an active mechanism that does not share it. While we all assume that active therapies and techniques use a subset of mechanisms or processes shared with placebo, here we're providing accruing evidence that mindfulness is separate from placebo.

I'll try to be very quick on this last part. This is all not technically related to placebo, but I would love to hear everyone's thoughts on these new data we have.

So we've seen elegantly that pain relief by placebo, distraction, acupuncture, transcranial magnetic stimulation, and prayer is largely driven by endogenous opioidergic release. And, yes, there are other systems -- a prime other system is the (indiscernible) system, the serotonergic system, dopamine; the list can go on. But it's considered by most of us that the endogenous opioidergic system is the central pain modulatory system.

And the way we test this is by antagonizing endogenous opioids, employing an incredibly high administration dosage of naloxone.

And I think this wonderful paper by Ciril Etnes's (phonetic) group provides a nice primer on the appropriate dosages for naloxone to antagonize opiates. And I think a lot of the discussions here where we see differences in naloxone responses are really actually reflective of differences in dosages of naloxone.

It metabolizes so quickly that I would highly recommend a super large bolus with a maintenance IV infusion.

And we've seen this to be a quite effective way to block endogenous opioids. And across four studies now, we've seen that mindfulness-based pain relief is not mediated by endogenous opioids. It's something else. We don't know what that something else is, but we don't think it's endogenous opioids. But what if it's sex differences that could be driving these opioidergic versus non-opioidergic differences?

We've seen that females exhibit higher rates of chronic pain than males. They are prescribed opiates at a higher rate than men. And when you control for weight, they require higher dosages than men. Why?

Well, there's excellent literature in rodent and other preclinical models demonstrating that male rodents engage endogenous opioids to reduce pain but female rodents do not.

And this is a wonderful study by Ann Murphy that basically shows that males have a greater paw-withdrawal latency in response to morphine, and not so much the females.

But when you add naloxone to the picture, with morphine, the latency goes down. It basically blocks the analgesia in male rodents but enhances analgesia in female rodents.

We basically asked -- Michaela, an undergraduate student doing an odyssey thesis, asked this question: are males and females in humans engaging distinct systems to assuage pain?

She really took off with this, and here's the design. We had noxious heat at baseline.

CRISTINA CUSIN: Doctor, you have one minute left. Can you wrap up?

FADEL ZEIDAN: Yep. Basically we asked, are there sex differences between males and females during meditation in response to noxious heat? And there are.

This is the change in pain from baseline. Green is saline; red is naloxone. You can see that with naloxone on board, there's greater analgesia in females, and we reversed the analgesia. Largely, there are no differences between baseline and naloxone in males, and the males are reducing pain during saline.

We believe this is the first study to show something like this in humans. Super exciting. It also blocked the stress reduction response in males but not so much in females. Let me just acknowledge our funders. Some of our team. And I apologize for the fast presentation. Thank you.

CRISTINA CUSIN: Thank you so much. That was awesome.

We're a little bit short on time.

I suggest we go into a short break, ten minutes, until 1:40. Please continue to add your questions in the Q&A. Our speakers are going to answer, or we'll bring some of those questions directly to the discussion panel at the end of the session today. Thank you so much.

Measuring & Mitigating the Placebo Effect (continued)

CRISTINA CUSIN: Hello, welcome back. I'm really honored to introduce our next speaker, Dr. Marta Pecina. And she's going to talk about mapping expectancy-mood interactions in antidepressant placebo effects. Thank you so much.

MARTA PECINA: Thank you, Cristina. It is my great pleasure to be here. And just I'm going to switch gears a little bit to talk about antidepressant placebo effects. And in particular, I'm going to talk about the relationship between acute expectancy-mood neural dynamics and long-term antidepressant placebo effects.

We all know that depression is a very prevalent disorder: just in 2020, major depressive disorder affected 21 million adults in the U.S. and 280 million adults worldwide. And current projections indicate that by the year 2030 it will be the leading cause of disease burden globally.

Now, response rates to first-line antidepressant treatments are approximately 50%, and complete remission is only achieved in 30 to 35% of individuals. Also, depression tends to be a chronic disorder, with 50% of those recovering from a first episode having an additional episode, and 80% of those with two or more episodes having another recurrence.

And so for patients who are nonresponsive to two interventions, remission rates with subsequent therapy drop significantly, to 10 to 25%. So, in summary, we're facing a disorder that is very resistant or becomes resistant very easily. And in this context, one would expect antidepressant placebo effects to be low. But we all know that this is not the case. The response rate to placebo is approximately 40%, compared to 50% response rates to antidepressants, and obviously this varies across studies.

But what we do know, and learned yesterday as well, is that response rates to placebos have increased approximately 7% over the last 40 years. And this high prevalence of placebo response in depression has significantly contributed to the current psychopharmacology crisis, where large pharma companies have cut at least in half the number of clinical trials devoted to CNS disorders.

Now, antidepressant placebo response rates among individuals with depression are higher than in any other psychiatric condition, and this was recently shown again in a meta-analysis of approximately 10,000 psychiatric patients. Other disorders where placebo response rates are also prevalent are generalized anxiety disorder, panic disorder, ADHD, and PTSD, and, maybe less frequently although still there, schizophrenia and OCD.

Now, importantly, placebo effects appear not only in response to pills but also to surgical interventions or devices, as was also mentioned yesterday. And this is particularly important today, when there is very active development of device-based interventions for psychiatric conditions. So, for example, in this study of deep brain stimulation that was also mentioned yesterday, patients with resistant depression were assigned to six months of either active or sham DBS, and this was followed by open-label DBS.

As you can see here in this table, patients from both groups improved significantly compared to baseline, but there were no significant differences between the two groups. And for this reason, DBS has not yet been approved by the FDA for depression, even though it's been approved for OCD and Parkinson's disease, as we all know.

Now, what is a placebo effect -- that's one of the main questions of this workshop -- and how does it work from a clinical neuroscience perspective? Well, as has been mentioned already, most of what we know about the placebo effect comes from the field of placebo analgesia. In summary, classical theories of the placebo effect have consistently argued that placebo effects result from either positive expectancies regarding the potential beneficial effects of a drug, or classical conditioning, where the pairing of a neutral stimulus, in this case the placebo pill, with an unconditioned stimulus, in this case the active drug, results in a conditioned response.

Now, more recently, theories of the placebo effect have used computational models to predict placebo effects. These theories posit that individuals update their expectancies as new sensory evidence is accumulated, by signaling the difference between what is expected and what is perceived, and this information is then used to refine future expectancies. These conceptual models have been incorporated into trial-by-trial manipulations of both expectancies of pain relief and pain sensory experience, and this has rapidly advanced our understanding of the neural and molecular mechanisms of placebo analgesia.

And so, for example, meta-analytic studies using these experiments have revealed two patterns of distinct activations, with decreases in brain activity in regions involved in pain processing, such as the dorsomedial prefrontal cortex, the amygdala, and the thalamus, and increases in brain activity in regions involved in affective appraisal, such as the vmPFC, the nucleus accumbens, and the PAG.

Now, what happens in depression? Well, in the field of antidepressant placebo effects, the long-term dynamics of mood and antidepressant responses have not allowed such trial-by-trial manipulation of expectancies. Instead, researchers have studied broad brain changes in the context of a randomized controlled trial or a placebo lead-in phase, which has, to some extent, limited the progress of the field.

Now, despite the methodological limitations of these studies, they provide important insights about the neural correlates of antidepressant placebo effects. In particular, from two early studies we can see that placebo was associated with increased activations broadly in cortical regions and decreased activations in subcortical regions, and these deactivations in subcortical regions were actually larger in patients who were assigned to an SSRI drug treatment.

We also demonstrated that, similar to pain, antidepressant placebo effects were associated with enhanced endogenous opioid release during placebo administration, predicting the response to open-label treatment after ten weeks. And we and others have demonstrated that increased connectivity between the salience network and the rostral anterior cingulate during antidepressant placebo administration can actually predict short-term and long-term placebo effects.

Now, an important limitation, as I already mentioned, is the delayed mechanism of action of common antidepressants and the slow dynamics of mood, which really limit the possibility of actively manipulating antidepressant expectancies.

So to address this important gap, we developed a trial-by-trial manipulation of antidepressant expectancies to be used inside the scanner. And the purpose was really to be able to further dissociate expectancy and mood dynamics during antidepressant placebo effects.

And so the basic structure of this task involves an expectancy condition, where subjects are presented with a four-second infusion cue followed by an expectancy rating cue, and a reinforcement condition, which consists of 20 seconds of sham neurofeedback followed by a mood rating cue. The expectancy and the reinforcement conditions map onto the classical theories of the placebo effect that I explained earlier.

During the expectancy condition, the antidepressant infusions are compared to periods of calibration where no drug is administered. And during the reinforcement condition, on the other hand, sham neurofeedback of positive sign 80% of the time is compared to sham neurofeedback of baseline sign 80% of the time. And so this two-by-two study design results in four different conditions: the antidepressant reinforced, the antidepressant not reinforced, the calibration reinforced, and the calibration not reinforced.
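To summarize the task structure, here is a minimal sketch enumerating the two-by-two design. The condition labels and timings follow the talk; the field names and code organization are illustrative assumptions.

```python
# Minimal sketch of the 2 x 2 antidepressant-placebo task described above.
from itertools import product

INFUSION_CUE_SEC = 4      # four-second infusion cue (expectancy condition)
NEUROFEEDBACK_SEC = 20    # 20 seconds of sham neurofeedback (reinforcement condition)

CUE_TYPES = ("antidepressant_infusion", "calibration")   # expectancy factor
FEEDBACK_TYPES = ("reinforced", "not_reinforced")        # sham neurofeedback factor

def trial_conditions():
    """Return the four cells of the design: antidepressant reinforced,
    antidepressant not reinforced, calibration reinforced,
    calibration not reinforced."""
    return [{"cue": cue, "feedback": fb}
            for cue, fb in product(CUE_TYPES, FEEDBACK_TYPES)]
```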

And so the cover story is that we tell participants that we are testing the effects of a new fast-acting antidepressant compared to a conventional antidepressant, but in reality both are saline. And then we tell them that they will receive multiple infusions of these drugs inside the scanner while we record their brain activity, which we call neurofeedback. Patients then learn that positive neurofeedback, compared to baseline, is more likely to signal mood improvement. But they are not told that the neurofeedback is simulated.

Then we place an intravenous line for the administration of the saline infusion, and we bring them inside the scanner. For these kinds of experiments we recruit individuals who are 18 through 55, with or without anxiety disorders, and who have a HAM-D depression rating scale score greater than 16, consistent with moderate depression. They're antidepressant medication free for at least 25 -- 21 days, and we use consenting procedures that involve authorized deception.

Now, as suspected, behavioral results during this task consistently show that antidepressant expectancies are higher during the antidepressant infusions compared to the calibration, especially when they are reinforced by positive sham neurofeedback. Mood responses are also significantly higher during positive sham neurofeedback compared to baseline, and this is further enhanced during the administration of the antidepressant infusions.

Now, interestingly, these effects are moderated by depression severity, such that the effects of the task conditions on expectancy and mood ratings are weaker in more severe depression, even though overall expectancies are higher and overall mood is lower.

Now, at a neural level, what we see is that the presentation of the infusion cue is associated with increased activation in the occipital cortex and the dorsal attention network, suggesting greater attentional processing engaged during the presentation of the treatment cue. Similarly, the reinforcement condition revealed increased activations in the dorsal attention network, with additional responses in the ventral striatum, suggesting that individuals processed the sham positive neurofeedback cue as rewarding.

Now, an important question for us was: now that we can manipulate acute antidepressant placebo responses, can we use this experiment to understand the mechanisms implicated in short-term and long-term antidepressant placebo effects? As I mentioned earlier, there was emerging evidence suggesting that placebo analgesia could be explained by computational models, in particular reinforcement learning.

And so we tested the hypothesis that antidepressant placebo effects could be explained by similar models. As you know, under these theories, learning occurs when an experienced outcome differs from what is expected; this difference is called the prediction error. The expected value of the next possible outcome is then updated with a portion of this prediction error, as reflected in this learning rule.

Now, in the context of our experiment, model-predicted expectancies for each of the four trial conditions would be updated every time the antidepressant or the calibration infusion cue is presented and an outcome, whether positive or baseline neurofeedback, is observed, based on a similar learning rule.
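A minimal sketch of this kind of delta-rule update is shown below; the variable names, the 0/1 outcome coding, and the learning-rate values are illustrative assumptions, not the authors' actual model code.

```python
# Minimal sketch of the learning rule described above: expectancies are updated
# with a fraction (the learning rate) of the prediction error, i.e. the
# difference between the observed outcome and the current expectancy.

def update_expectancy(expectancy, outcome, learning_rate=0.1):
    """One delta-rule (Rescorla-Wagner-style) update.

    expectancy    -- current expected value for this trial condition
    outcome       -- observed feedback (e.g., 1 = positive sham neurofeedback,
                     0 = baseline sham neurofeedback)
    learning_rate -- fraction of the prediction error used for the update
    """
    prediction_error = outcome - expectancy
    return expectancy + learning_rate * prediction_error

# Example: expectancy for one trial condition across trials in which
# positive feedback is observed most of the time.
v = 0.5
for outcome in [1, 1, 0, 1, 1]:
    v = update_expectancy(v, outcome, learning_rate=0.2)
```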

This basic model was then compared against two alternative models: one which included differential learning rates, to account for the possibility that learning would depend on whether participants were updating expectancies for the placebo or the calibration; and an additional model to account for the possibility that subjects were incorporating positive mood responses as rewards.

And then finally, we constructed an additional model to allow for the combination of models two and three. And so, using Bayesian model comparison, we found that the fourth model, which included placebo-biased learning and reinforcement by mood, dominated all the other alternatives after correction for the Bayesian omnibus risk.

We then mapped the expected value and reward prediction error signals from our reinforcement learning model onto our raw data. What we found was that expected value signals mapped onto the salience network raw responses, whereas reward prediction errors mapped onto the dorsal attention network raw responses. All together, the combination of our model-free and model-based results reveals that the processing of the antidepressant infusion cue increases activation in the dorsal attention network, whereas the encoding of the expectancies takes place in the salience network once salience has been attributed to the cue.

Furthermore, we demonstrated that the model-predicted expectancies encoded in the salience network triggered mood changes that are perceived as reward signals. These mood reward signals further reinforce antidepressant expectancies through the formation of expectancy-mood dynamics defined by models of reinforcement learning, an idea that could possibly contribute to the formation of long-lasting antidepressant placebo effects.

And so the second aim was really to look at how to use behavioral and neural markers of placebo effects to predict long-term placebo effects in the context of a clinical trial. Our hypothesis was that during placebo administration, greater salience attribution to the contextual cue in the salience network would transfer to regions involved in mood regulation to induce mood changes. In particular, we hypothesized that the DMN would play a key role in belief-induced mood regulation.

And why the DMN? Well, we knew that activity in the rostral anterior cingulate, which is a key node of the DMN, is a robust predictor of mood responses to both active antidepressants and placebos, implying its involvement in nonspecific treatment response mechanisms. We also knew that the rostral anterior cingulate is a robust predictor of placebo analgesia, consistent with its role in cognitive appraisals, predictions, and evaluation. And we also had evidence that SN-to-DMN functional connectivity appears to be a predictor of placebo and antidepressant responses over ten weeks of treatment.

And so in our clinical trial, shown in this cartoon diagram, we randomized six individuals to placebo or escitalopram 20 milligrams. This table is just to say there were no significant differences between the two groups in regard to gender, race, age, or depression severity. But what we found interesting is that there were also no significant differences in treatment-assignment beliefs, with approximately 60% of subjects in each group guessing that they were receiving escitalopram.

Now, as you can see here, participants showed lower MADRS scores at eight weeks in both groups, but there were no significant differences between the two groups. However, when we split the two groups by the last drug-assignment belief, subjects who believed they were on the drug improved significantly compared to those who believed they were on placebo.

And so the next question was: can we use neuroimaging to predict these responses? What we found was that, at a neural level, during expectancy processing the salience network had an increased pattern of functional connectivity with the DMN as well as with other regions, including the brainstem and the thalamus. We also found that increased SN-to-DMN functional connectivity predicted expectancy ratings during the antidepressant placebo fMRI task, such that higher connectivity was associated with greater modulation of expectancy ratings by the task conditions.

We also found that enhanced functional connectivity between the SN and the DMN predicted the response to eight weeks of treatment, especially in individuals who believed that they were in the antidepressant group. These data support the idea that during placebo administration, greater salience attribution to the contextual cue is encoded in the salience network, whereas belief-induced mood regulation is associated with increased functional connectivity between the SN and DMN. Altogether, these data suggest that enhanced SN-to-DMN connectivity enables the switch from greater salience attribution to the treatment cue to DMN-mediated mood regulation.

And so finally, and this is going to be brief, the next question for us was: can we modulate these networks to actually enhance placebo-related activity? In particular, we decided to use theta burst stimulation, which can potentiate or depotentiate brain activity in response to brief periods of stimulation. And so in this study, participants undergo three counterbalanced sessions of TBS: continuous, intermittent, or sham, known to depotentiate, potentiate, or have no effect, respectively.

Each TBS session is followed by an fMRI session with the antidepressant placebo task, which happens approximately an hour after stimulation. The inclusion criteria are very similar to all of our other studies. And our pattern of stimulation is pretty straightforward: we do two blocks of TBS. During the first block, stimulation intensity is gradually escalated in 5% increments in order to enhance tolerability. And during the second block, the stimulation is maintained constant at 80% of the motor threshold.

Then we use a modified cTBS session consisting of bursts of three stimuli applied at 30 hertz, with bursts repeated at 6 hertz, for a total of 600 stimuli in a continuous train of 33.3 seconds. The iTBS session consists of bursts of three stimuli applied at 50 hertz, with bursts repeated at 5 hertz, for a total of 600 stimuli over 192 seconds. We also use a sham condition, where 50% of subjects are assigned to sham TBS simulating the iTBS stimulus pattern and 50% are assigned to sham TBS simulating the cTBS pattern.
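As a quick arithmetic check on these parameters, here is a minimal Python sketch; the burst and train structure beyond the numbers stated above (for example, the 2-second-on, 8-second-off iTBS trains) follows standard theta-burst conventions and is an assumption.

```python
# Minimal sketch checking the arithmetic of the TBS protocols described above.

PULSES_PER_BURST = 3
TOTAL_PULSES = 600
N_BURSTS = TOTAL_PULSES // PULSES_PER_BURST            # 200 bursts

# Modified cTBS: bursts repeated continuously at 6 Hz.
ctbs_duration_sec = N_BURSTS / 6.0                     # ~33.3 seconds

# iTBS: bursts repeated at 5 Hz in 2-second trains (10 bursts per train),
# separated by 8-second pauses (assumed standard iTBS timing).
BURSTS_PER_TRAIN = 10
N_TRAINS = N_BURSTS // BURSTS_PER_TRAIN                # 20 trains
itbs_duration_sec = N_TRAINS * 2 + (N_TRAINS - 1) * 8  # 192 seconds

print(round(ctbs_duration_sec, 1), itbs_duration_sec)  # 33.3 192
```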

Now, our target is the dmPFC, which is the cortical target for the DMN, and we chose it based on the results from the antidepressant placebo fMRI task.

And so this target corresponds to a Neurosynth-derived scalp location 30% of the nasion-to-inion distance forward from the vertex and 5% to the left, which corresponds to the EEG location F1. The connectivity map of this region actually results in activation of the DMN. We can also show here the E-field map of this target, which basically supports nice coverage of the DMN.

And so what we found is that iTBS, compared to sham and cTBS, enhances the effect of the reinforcement condition on mood responses. We also found that, at a neural level, iTBS compared to cTBS shows significantly greater BOLD responses during expectancy processing within the DMN, with sham responses in the middle but not significantly different from iTBS. Increased BOLD responses in the ventromedial prefrontal cortex were associated with a greater effect of the task conditions on mood responses.

And so, all together, our results suggest, first, that trial-by-trial modulation of antidepressant expectancies effectively dissociates expectancy-mood dynamics. Antidepressant expectancies are predicted by models of reinforcement learning, and they're encoded in the salience network. We also showed that enhanced SN-to-DMN connectivity enables the switch from greater salience attribution to treatment cues to DMN-mediated mood regulation, contributing to the formation of acute expectancy-mood interactions and long-term antidepressant placebo effects. And iTBS potentiation of the DMN enhances placebo-induced mood responses and expectancy processing.

With this, I would just like to thank my collaborators that started this work with me at the University of Michigan and mostly the people in my lab and collaborators at the University of Pittsburgh as well as the funding agencies.

CRISTINA CUSIN: Wonderful presentation. Really a terrific way of trying to untangle the different mechanisms of placebo response in depression, which is not an easy feat.

There are no specific questions in the Q&A. I would encourage everybody attending the workshop to please post your question to the Q&A. Every panelist can answer in writing. And then we will answer more questions during the discussion, but please don't hesitate.

I think I will move on to the next speaker. We have only a couple of minutes so we're just going to move on to Dr. Schmidt. Thank you so much. We can see your slides. We cannot hear you.

LIANE SCHMIDT: Can you hear me now?

CRISTINA CUSIN: Yes, thank you.

LIANE SCHMIDT: Thank you. So I'm Liane Schmidt. I'm an INSERM researcher and team leader at the Paris Brain Institute. And I'm working on placebo effects but understanding the appetitive side of placebo effects. And what I mean by that I will try to explain to you in the next couple of slides.

NIMH Staff: Can you turn on your video?

LIANE SCHMIDT: Sorry?

NIMH Staff: Can you please turn on your video, Dr. Schmidt?

LIANE SCHMIDT: Yeah, yes, yes, sorry about that.

So it's about the appetitive side of placebo effects, and about placebo effects on cognitive processes such as motivation and biases in belief updating, because these processes also play a role when patients respond to treatment and when we measure placebo effects -- basically, when placebo effects matter in the clinical setting.

And this is done at the Paris Brain Institute. And I'm working also in collaboration with the Pitie-Salpetriere Hospital Psychiatry department to get access to patients with depression, for example.

So my talk will be organized around three parts. In the first part, I will show you some data about appetitive placebo effects on taste pleasantness, hunger sensations, and reward learning. And this will make the bridge to the second part, where I will show you some evidence for asymmetrical learning biases that are more tied to reward learning and that could contribute to, or emerge after, fast-acting antidepressant treatment effects in depression.

And why is this important? I will try to link the first and second parts in the third part, elaborating some perspectives on the synergies between expectations, expectation updating through learning mechanisms, motivational processes, and drug experiences, which we might be able to harness using computational models such as Rescorla-Wagner models, as Marta just showed in her work.

The appetitive side of placebo effects is actually known very well from the fields of consumer psychology and marketing research, where price labels, for example, or quality labels can affect decision-making processes and also experiences like taste pleasantness. And since we are in France, one of the most salient examples of these kinds of effects comes from wine tasting: many studies have shown that the price of wine can influence how pleasant it tastes.

And we and other people have shown that this is mediated by activation in what is called the brain valuation system, regions that encode expected and experienced reward. One of the most prominent hubs in this brain valuation system is the ventromedial prefrontal cortex, which you see here on the SPM on the slide, and which basically translates these price label effects into taste pleasantness. What is also interesting is its sensitivity to monetary reward: for example, the vmPFC activates when you obtain a monetary reward by surprise.

And the participants who activate the vmPFC more to these kinds of positive surprises are also the participants in whom the vmPFC encoded more strongly the difference between expensive and cheap wines. This makes a nice parallel to what we know from placebo analgesia, where it has also been shown that the sensitivity of the reward system in the brain can moderate placebo analgesia, with participants with higher reward sensitivity in the ventral striatum, for example, another region of the valuation system, showing stronger placebo analgesia.

So this is basically to let you appreciate that these effects parallel nicely what we know from placebo effects in pain and also in disease. We then went further, beyond just taste liking, which is basically the experience of rewards such as wine: could placebos also affect motivational processes per se, as when we want something more?

And one way to study this is to study a basic motivation such as, for example, hunger. Eating behavior, for instance, has long been conceptualized as driven by homeostatic hormonal markers such as ghrelin and leptin that signal satiety and energy stores; as a function of these different hormonal markers in our blood, we go and look for food and eat food. But we also know from the placebo effects on taste pleasantness that there is a possibility that our higher order beliefs about our internal states, not our hormones, can influence whether we want to eat food, whether we engage in these types of very basic motivations. And we tested that -- other people have as well, so this is a replication.

In this study, we gave healthy participants who came into the lab in a fasted state a glass of water. One group was told that water can sometimes stimulate hunger by stimulating the receptors in your mouth; another group was told that you can also drink a glass of water to kill your hunger; and a third, control group was given a glass of water and told it's just water, it does nothing to hunger. And then we asked them to rate how hungry they felt over the course of the experiment. It's a three-hour experiment, everybody has fasted, and they have to do a food choice task in an fMRI scanner, so everybody gets hungry over these three hours.

But what was interesting, and what you see here on this rain cloud plot, is that participants who drank the water suggested to be a hunger killer increased in their hunger ratings less than participants who believed the water would enhance their hunger. So this is a nice replication of what we already know from the field; other people have shown this, too.

And the interesting thing is that it also affected food wanting, this motivational process of how much you want to eat food. When people lay there in the fMRI scanner, they saw different food items and were asked whether they wanted to eat each one or not, for real, at the end of the experiment. So it's incentive compatible. And what you see here is what we call stimulus value: how much you want to eat this food.

And the hunger sensation ratings that I showed you before parallel what we find here. The people in the decreased hunger suggestion group wanted to eat the food less than those in the increased hunger suggestion group, showing that this is not only an effect on subjective self-reports of how you perceive your body's hunger signals. It's also your actual subjective preference for food that is influenced by the placebo manipulation. And it also influences how your brain valuation system encodes the value underlying your food preference, which is what you see on this slide.

On this slide, you see the ventromedial prefrontal cortex. The more yellow the blobs, the more strongly they correlate with food wanting. And you see on the right side, in the temporal time courses of the vmPFC, that food-wanting encoding is stronger in the increased hunger suggestion group than in the decreased hunger suggestion group.

So basically what I've shown you here is three placebo effects: placebo effects on subjective hunger ratings, placebo effects on food choices, and placebo effects on how the brain encodes food preferences and food choices. And you could wonder -- these are readouts, behavioral and neural readouts -- what is the mechanism behind them? Basically, what sits between the placebo intervention and the behavioral and neural readouts of this effect?

And one snippet of the answer to this question comes from the expectation ratings. Expectations have long been shown to be one of the cognitive mediators of placebo effects across domains, and that's what we see here, too, especially in the hunger killer suggestion group: the participants who believed more strongly that the drink would kill their hunger were also those whose hunger increased less over the course of the experiment.

And this moderated activity in the region that you see here, the medial prefrontal cortex, which activated when people saw food on the screen and thought about whether they wanted to eat it or not. That activity was positively moderated by the strength of the expectancy that the glass of water would decrease their hunger. So the more you expect that the water will decrease your hunger, the more the mPFC activates when you see food on the screen.

It's an interesting brain region because it sits right between the ventromedial prefrontal cortex, which encodes the value, the food preference, and the dorsolateral prefrontal cortex. And it has been shown by past research to connect to the vmPFC when participants exert self-control, especially during food decision-making paradigms.

But another way to answer the question about the mechanism, about how the placebo intervention can produce these behavioral and neural effects, is to use computational modeling to better understand preference formation. One approach is drift diffusion modeling. These drift diffusion models come from perceptual research, for understanding perception, and they have recently also been used to better understand preference formation. They assume that your preference for a yes or no food choice, for example, arises from a noisy accumulation of evidence.

And the two types of evidence you accumulate in these decision-making paradigms are basically how tasty and how healthy the food is -- how much you like the taste, how much you consider the health. And this could influence the slope of your evidence accumulation, basically how rapidly you reach a threshold toward yes or no.

It could be that the placebo manipulation influences this slope. But the model lets us test several other hypotheses. It could be that the placebo intervention affected just the threshold, which reflects how carefully you made the decision toward a yes or no choice. It could be your initial bias, that is, whether you were initially biased toward a yes or a no response. Or it could be the non-decision time, which reflects more sensory-motor integration.
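To make this drift diffusion framing concrete, here is a minimal simulation sketch with illustrative parameter values; it is not the authors' model, only an instance of the general idea that the suggestion can act on the taste/health evidence weights, the starting bias, the threshold, or the non-decision time.

```python
# Minimal drift-diffusion sketch for a yes/no food choice: evidence accumulates
# noisily toward an upper ("yes") or lower ("no") threshold.
import numpy as np

def simulate_choice(taste, health, w_taste=1.0, w_health=0.5,
                    threshold=1.0, start_bias=0.0, non_decision=0.3,
                    noise_sd=1.0, dt=0.01, rng=None):
    """Simulate one choice; returns (choice, reaction_time_sec)."""
    rng = rng or np.random.default_rng()
    drift = w_taste * taste + w_health * health   # slope of evidence accumulation
    x, t = start_bias, 0.0
    while abs(x) < threshold:
        x += drift * dt + rng.normal(0.0, noise_sd * np.sqrt(dt))
        t += dt
    return ("yes" if x > 0 else "no"), t + non_decision

# E.g., a "hunger killer" suggestion could be modeled as a larger w_health and
# a starting bias toward "no"; a "hunger enhancer" suggestion as the reverse.
```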

And the answer to this question is that three parameters were influenced by the placebo manipulation: how much you integrated healthiness and how much you integrated tastiness, and your initial starting bias. So you paid more attention to the healthiness when you believed that you were on a hunger killer, and more to the tastiness when you believed that you were on a hunger enhancer. And similarly, participants were initially biased toward accepting food more when they believed they were on a hunger enhancer than on a hunger killer.

Interestingly, this shows that the decision-making process is biased by the placebo intervention, including how much you filter for the information that is most relevant. When you are hungry, taste is very relevant for your choices. When you believe you are less hungry, you pay less attention to taste and have more room to attend to the healthiness of the food.

One way to show that this might be a filtering of expectation-relevant information is to use psychophysiological interaction analyses. We take brain activity in the vmPFC as our seed region and ask: where in the brain does it connect when participants see food on the computer screen and have to decide whether they want to eat it or not?

What we observed is that it connects to the dlPFC, the dorsolateral prefrontal cortex. This is a region of interest that we localized first, in a separate Stroop localizer task, to be sure it is actually engaged in interference resolution -- that is, when we have to filter the information that is most relevant to a task.

So the vmPFC connects more strongly to this dlPFC interference-resolution region, and this connectivity is moderated, especially in the decreased-hunger suggestion group, by how much participants weighed the healthiness of food against its tastiness.
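For readers unfamiliar with PPI, a minimal sketch of the core regression is shown below. Real analyses run in fMRI packages include HRF convolution, deconvolution to the neural level, and nuisance regressors; this toy version with simulated data only illustrates the interaction term that indexes task-dependent coupling between a vmPFC seed and a target region such as the dlPFC.

```python
import numpy as np

def ppi_glm(seed_ts, target_ts, task_regressor):
    """Minimal psychophysiological interaction (PPI) regression.

    Predicts a target region's timeseries from the seed timeseries, the task
    regressor, and their product (the PPI term). Only the core design matrix
    is shown here.
    """
    ppi_term = seed_ts * task_regressor
    X = np.column_stack([np.ones_like(seed_ts), seed_ts, task_regressor, ppi_term])
    betas, *_ = np.linalg.lstsq(X, target_ts, rcond=None)
    return dict(zip(["intercept", "seed", "task", "ppi"], betas))

# Illustrative use with simulated data: the "ppi" beta indexes how much
# seed-target coupling changes during food-choice trials.
rng = np.random.default_rng(1)
n = 200
seed = rng.normal(size=n)
task = (np.arange(n) % 20 < 10).astype(float)      # toy block design
target = 0.2 * seed + 0.4 * seed * task + rng.normal(scale=0.5, size=n)
print(ppi_glm(seed, target, task))
```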

To wrap this part up: we replicated findings from previous studies of appetitive placebo effects by showing that expectancies about the efficacy of a drink can affect hunger sensations, how participants form their food preferences and make food choices, and value encoding in the ventromedial prefrontal cortex.

But we also provided evidence for underlying neurocognitive mechanisms: medial prefrontal cortex activity that is moderated by the strength of the hunger expectation, and food choice formation that is biased by an attention-filtering mechanism toward expectancy-congruent information -- taste for the increased-hunger suggestion group and healthiness for the decreased-hunger suggestion group. This is implemented by regions linked to interference resolution as well as to valuation and preference encoding.

So why should we care? In the real world, it is not very useful to give people deceptive information about hunger-influencing ingredients in drinks. But studies like this one provide insight into the cognitive mechanisms of beliefs about internal states and how those beliefs can affect interoceptive sensations and associated motivated behaviors, such as economic choices.

This can also give us insight into the synergy between drug experiences and outcome expectations, which could be harnessed -- translated, essentially -- via motivational processes, and perhaps lead us to better understand susceptibility to active treatments.

I'm going to elaborate on this in the next part of the talk. I'm stepping a little away from placebo effects here, so I won't be showing placebo evidence directly, but bear with me.

Links to these motivational processes have long been suggested to be part of placebo mechanisms; this is called the placebo-reward hypothesis. It is based on findings in Parkinson's disease showing that when you give Parkinson's patients a placebo but tell them it is a dopaminergic drug, you can measure dopamine in the brain: the binding potential of the dopamine marker decreases. That is what you see here in these PET scan results.

This suggests that the brain released endogenous dopamine. Dopamine is very important for expectations and for learning from reward, and clinical benefit is the kind of reward that patients expect. So it is possible that when a patient expects the reward of clinical benefit, their brain releases dopamine in reward-related regions such as the vmPFC or the ventral striatum.

We have shown in the past that the behavioral consequence of such an endogenous dopamine release under placebo can indeed be linked to reward learning. We know, for example, that Parkinson's patients have a deficit in learning from reward when they are off dopaminergic medication, and that this normalizes when they are on active dopaminergic medication.

So we wondered: if, as these PET studies suggest, the brain releases dopamine under placebo, does this also have behavioral consequences for reward learning ability? That is what you see on the right side of the screen: Parkinson's patients tested on placebo show reward learning abilities similar to those under the active drug.

This was again underpinned by an increased correlation of the ventromedial prefrontal cortex -- this hub of the brain's valuation system -- with the learned reward value, a correlation that was stronger in the placebo and active-drug conditions than in the off-drug baseline condition.

I now want to make a link to another disorder in which motivation is also impaired: depression. Depression is hypothesized to be maintained by a triad of very negative beliefs about the world, the future, and oneself, which is remarkably insensitive to belief-disconfirming information, especially when that information is positive, or favorable. Cognitive neuroscience studies have shown this to be reflected in a blunted good news/bad news bias, or optimism bias, in belief updating in depression. The good news/bad news bias is a bias healthy people have to take favorable information that contradicts initial negative beliefs into account more than unfavorable information.

This is healthy because it avoids constant downward revision of beliefs, and it also involves a form of motivational process, because good news has motivational salience: it should be especially motivating to update beliefs about the future when those beliefs are negative and we receive information showing that they were too negative. Depressed patients, however, lack this good news/bad news bias. So we wondered what happens when patients respond to antidepressant treatments that give immediate sensory evidence of being on an antidepressant.

With new fast-acting antidepressants such as ketamine, patients know right away, through the dissociative experiences, whether they got the treatment. So could such a treatment affect this cognitive model of depression? That was the main question of the study. We also wondered about the computational mechanism: is it linked, as in the previous studies, to reward learning mechanisms, that is, to biased updating of beliefs? And is it linked to clinical antidepressant effects and potentially to outcome expectations, which makes the link to placebo effects?

Patients performed a belief-updating task at three time points: before receiving ketamine infusions, after the first infusion, and one week after the third infusion. At each testing time we measured depression with the Montgomery-Asberg Depression Rating Scale (MADRS). In the belief-updating task, patients were presented with different negative life events -- for example, getting a disease or, as shown here, losing a wallet.

They were asked to estimate their probability of experiencing this life event in the near future, were then shown evidence about the frequency of the event in the general population -- what we call the base rate -- and then had the opportunity to update their belief now that they knew the base rate.

This, for example, is a good news trial: participants initially overestimated their chance of losing a wallet, then learn it is much less frequent than they thought and update their estimate, by 15 percentage points, for example. In a bad news trial, you initially underestimated your probability of experiencing the adverse life event, and if you have a good news/bad news bias you will take this information into account to a lesser degree than in a good news trial.

That is exactly what happens in the healthy controls, shown on the leftmost part of the screen. I don't know whether you can see the panels, but we have belief updating on the Y axis for healthy controls age-matched to the patients, with updating after good news and updating after bad news. We tested participants more than two times within a week, and you can see the bias; it decreases a little with repeated testing in the healthy controls. Importantly, in the patients the bias is barely present before ketamine treatment.

But it becomes much stronger -- it essentially emerges -- after ketamine treatment. Patients become more optimistically biased after ketamine, and this correlates with the MADRS scores: patients who improve more with treatment are also those who show a stronger good news/bad news bias after one week of treatment.

We again wondered about the computational mechanisms. One way to get at this is a Rescorla-Wagner-type reinforcement learning model, which assumes that updating is proportional to your surprise, called the estimation error.

The estimation error is the difference between the initial estimate and the base rate, and it is weighted by a learning rate. The important point is that the learning rate has two components, a scaling parameter and an asymmetry parameter, and the asymmetry parameter captures how much the learning rate differs after good news -- positive estimation errors -- compared with negative estimation errors.
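As an illustration of this kind of asymmetric, Rescorla-Wagner-style update rule, here is a minimal sketch in Python. The parameterization with a scale and an asymmetry term follows the description above, but the exact functional form and values are assumptions, not the fitted model from the study.

```python
def update_belief(initial_estimate, base_rate, scale, asymmetry):
    """Asymmetric belief update in the spirit of a Rescorla-Wagner rule.

    The estimation error is the difference between the base rate and the
    initial probability estimate. Good news (the event is less likely than
    feared) and bad news (more likely) are weighted by different learning
    rates. Parameter names and values are illustrative.
    """
    estimation_error = base_rate - initial_estimate
    if estimation_error < 0:      # good news: event less likely than thought
        learning_rate = scale * (1 + asymmetry)
    else:                         # bad news: event more likely than thought
        learning_rate = scale * (1 - asymmetry)
    return initial_estimate + learning_rate * estimation_error

# A positive asymmetry reproduces the good news/bad news bias:
print(update_belief(0.40, 0.25, scale=0.5, asymmetry=0.4))  # larger update (good news)
print(update_belief(0.10, 0.25, scale=0.5, asymmetry=0.4))  # smaller update (bad news)
```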

What we see is that healthy controls have a higher learning rate for positive estimation errors than for negative ones, translating the good news/bad news bias into an asymmetric learning mechanism. In the patients, learning is non-asymmetric before ketamine treatment and becomes asymmetric, as reflected in the learning rates, after ketamine treatment.

What we take from this is that ketamine induced an optimism bias. But an interesting question is what comes first: the improvement in depression that we measured with the MADRS, or the optimism bias that emerged and perhaps triggered the improvement. Since this is a correlation, we don't know.

An interesting aside, which we put in the supplement, was that in 16 patients -- a very small sample -- the expectancy of getting better also correlated with clinical improvement after ketamine treatment. We had two expectancy ratings: one about the efficacy of ketamine and one about how intense patients expected their depression to be after ketamine treatment.

This suggested that expectations of clinical benefit interact, at least in part and perhaps synergistically, with the drug experience that generates the optimism bias. To test this further, we continued data collection on the expectancy ratings alone and asked how clinical improvement after the first infusion relates to clinical improvement after the third infusion.

We know from these data that patients who improve after the first infusion are also those who improve after the third infusion. But is that mediated by their expectancy about the ketamine treatment? Indeed, that is what we found: the more patients expected to get better, the more they improved after one week of treatment, and expectancy mediated the link between the first drug experience and the later drug response. This suggests that, as other panelists have already put forward today, the effect might not be additive but synergistic.

One way to get at these synergies is again to use computational models. The idea, which also came up yesterday, is that self-fulfilling prophecies could contribute to treatment responsiveness and treatment adherence. These self-fulfilling prophecies rest on asymmetrically biased learning mechanisms that are more biased when you have positive initial treatment experiences, which may then shape how you adhere to the treatment in the long term and how much you benefit from it. So it involves both the drug experience and the expectancy.

This is unpublished work in which we played with this idea using a reinforcement learning model. It is very much inspired by what we know from placebo analgesia: Tor and colleagues have a paper showing that self-fulfilling prophecies can be captured with biased reinforcement learning models. The idea of these models is that there are two learning rates, alpha plus and alpha minus, which weigh differently into the updating of your expectation after a drug experience.

LIANE SCHMIDT: Okay, yeah, I'm almost done.

So the learning rates weigh drug experiences and expectations differently depending on whether the experience was congruent with your expectations -- a positive experience versus a negative one. Here are some simulations of this model, showing that your expectation is updated more strongly the more positively biased you are than when you are negatively biased. And these are some predictions of the model concerning depression improvement.
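A minimal sketch of this kind of self-fulfilling prophecy model is shown below, assuming a toy generative process in which the experienced benefit mixes the true drug effect with the current expectation and the expectation is updated with separate learning rates alpha plus and alpha minus. All names and values are illustrative, not the published model.

```python
import numpy as np

def simulate_treatment_course(alpha_plus, alpha_minus, drug_effect,
                              expectation0=0.5, weight_expectation=0.4,
                              n_sessions=10, noise_sd=0.05, seed=0):
    """Toy self-fulfilling prophecy model of treatment response.

    Experienced benefit on each session is a mix of the true drug effect and
    the current expectation (the self-fulfilling component). The expectation
    is then updated with asymmetric learning rates: alpha_plus for
    better-than-expected experiences, alpha_minus for worse-than-expected ones.
    """
    rng = np.random.default_rng(seed)
    expectation = expectation0
    experiences = []
    for _ in range(n_sessions):
        experienced = ((1 - weight_expectation) * drug_effect
                       + weight_expectation * expectation
                       + rng.normal(scale=noise_sd))
        prediction_error = experienced - expectation
        alpha = alpha_plus if prediction_error > 0 else alpha_minus
        expectation += alpha * prediction_error
        experiences.append(experienced)
    return np.array(experiences), expectation

# An optimistically biased learner (alpha_plus > alpha_minus) sustains higher
# expectations, and therefore higher experienced benefit, than an unbiased one.
biased, _ = simulate_treatment_course(alpha_plus=0.6, alpha_minus=0.1, drug_effect=0.6)
unbiased, _ = simulate_treatment_course(alpha_plus=0.3, alpha_minus=0.3, drug_effect=0.6)
print(biased.mean(), unbiased.mean())
```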

To wrap this up: the conclusion is that asymmetric learning can capture self-fulfilling prophecies and could be a mechanism that translates between expectations and drug experiences, potentially across domains from placebo hypoalgesia to antidepressant treatment responsiveness. The open question is obviously to challenge the predictions of these models with more empirical data, in pain but also in mood disorders, as Marta does and as we are also doing currently, testing the mechanisms of belief-updating biases in depression with fMRI and these mathematical models.

This has a direct clinical implication, because it could help us better understand how these fast-acting antidepressants work and what makes patients adhere and respond to them. Thank you for your attention. We are the control-interoception-attention team, and thank you to all the funders.

CRISTINA CUSIN: Fantastic presentation. Thank you so much. Without further ado, let's move on to the next speaker. Dr. Greg Corder.

GREG CORDER: Did that work? Is it showing?

GREG CORDER: Awesome, all right. One second. Let me just move this other screen. Perfect. All right.

Hi, everyone. My name is Greg Corder. I'm an Assistant Professor at the University of Pennsylvania. I guess I get to be the final scientific speaker in this session over what has been an amazing two-day event. So thank you to the organizers for also having me get the honor of representing the entire field of preclinical placebo research as well.

I'm going to give a bit of an overview of work by my friends and colleagues over the last few years, and then tell you how we're leveraging current neuroscience technologies to identify the relevant cell types and circuits, building from the human fMRI literature that has honed in on key circuits for expectations and belief systems as well as endogenous antinociceptive systems, in particular opioidergic cell types.

The work I'm going to show from my lab has been driven by two amazing scientists: Dr. Blake Kimmey, an amazing postdoc in the lab, and Lindsay Ejoh, who just last week received her D-SPAN F99/K00 on placebo circuitry -- we think this might be one of the first NIH-funded animal projects on placebo. So congratulations, Lindsay, if you are listening.

Okay. So why use animals, right? We've heard an amazing set of stories really nailing down the specific circuits in humans, leveraging MRI, fMRI, EEG, and PET imaging, that give us a really nice roadmap for how beliefs about analgesia might be encoded in different brain circuits and how those might change over time with different types of conditioning or updating of experiences.

We love this literature; in the lab we read it in as much depth as we can, and we use it as a roadmap for our animal studies. Animal models allow us to dive deep into very specific circuits using techniques like those on the screen here, from RNA sequencing to electrophysiology, to show that the functional connections measured with fMRI really do exist as axons projecting from one region to another.

We can then manipulate those connections and projections using tools like optogenetics and chemogenetics, which give us tight temporal control to turn cells on and off, and we can see the effects of that intervention on animal behavior in real time. The tricky part is that we don't get to ask the animals: do you feel pain? Do you feel less pain? It's hard to give verbal suggestions to animals.

So we have to rely on a lot of different tricks and really get into the head of what it's like to be a small prey animal in a world full of large, monstrous human beings. We have to be very careful about how we design our experiments, and it's hard. Placebo in animals is not an easy subject to get into, which is reflected in the fact that, as far as we can tell, there are only 24 published studies to date on placebo analgesia in animal models.

However, I think this is an excellent opportunity to take advantage of what has been a golden age of neuroscience technologies over the last 10-15 years and revisit a lot of open questions: when are opioids released, and are they released at all? Can animals have expectations? Can they have something like a belief structure, and violations of those expectations that lead to different types of prediction errors encoded in different neural circuits? We have a chance to really do that.

But I think the most critical first question is how we begin to behaviorally model placebo in preclinical studies. So I want to touch on a couple of things from my colleagues. On the left here is a graph that has been shown in several presentations over the past two days, from Benedetti, using tourniquet pain models in which you can provide pharmacological conditioning with an analgesic drug like morphine to increase pain tolerance.

And then, if the morphine is covertly switched out for saline, you see an elevation in pain tolerance reflective of something like a placebo analgesic response. This is sensitive to naloxone, the mu opioid receptor antagonist, suggesting endogenous opioids are indeed involved in this type of placebo-like response.

My colleague Dr. Matt Banghart at UCSD has done a fantastic job of recapitulating this exact model in mice, where you can use morphine and other analgesics to condition them. So let me dive a little bit into Matt's model here.

You can have a mouse that will sit on a noxious hot plate. You know, it's an environment that's unpleasant. You can have contextual cues like different types of patterns on the wall. And you can test the pain behavior responses like how much does the animal flick and flinch and lick and bite and protect itself to the noxious hot plate.

Then you can switch the contextual cues, provide an analgesic drug like morphine, and see reductions in those pain behaviors. Then, as in the Benedetti studies, you switch out the morphine for saline but keep the contextual cues. The animal has effectively formed a belief: when I am in this environment, when I'm in this doctor's office, I'm going to receive something that reduces my perception of pain.

And, indeed, Matt sees a quite robust effect here: this sort of placebo response shows up as an elevated paw withdrawal latency, indicating that endogenous antinociception is occurring with this protocol. It happens pretty robustly -- most of the animals going through this conditioning protocol demonstrate this type of antinociceptive behavioral response. This is a perfect example of how we can translate what we learn from human studies into rodent studies of acute pain.

And this approach is also really useful for probing the effects of placebo in chronic neuropathic pain models. Here this is Dr. Damien Boorman, who was with Professor Kevin Keay in Australia and is now with Loren Martin in Toronto.

Here Damien really amped up the contextual cues. This is an animal that has had an injury to the sciatic nerve, a chronic constriction injury, so it is experiencing something like a tonic chronic neuropathic pain state. Once you let the pain develop, the animals enter a placebo pharmacological conditioning paradigm in which they go onto thermal plates, either hot or cool, in rooms that have a large amount of visual, tactile, and odorant cues, paired with either morphine or a saline control.

Again, the morphine is switched for saline on the last day. What Damien has observed is that a subset of the animals, about 30%, form a responder population that shows decreased pain behavior, which we interpret as something like analgesia. So overall you can use these types of pharmacological conditioning for both acute and chronic pain.

So now what we're going to do in our lab is a bit different. And I'm really curious to hear the field's thoughts because all -- everything I'm about to show is completely unpublished. Here we're going to use an experimenter-free, drug-free paradigm of instrumental conditioning to instill something like a placebo effect.

This is what Blake and Lindsay have been working on since about 2020, and this is our setup in one of our behavior rooms. Our apparatus is this tiny little device down here; everything else is the computers, optogenetics, and calcium imaging equipment we use to record what's going on inside the mouse's brain.

But simply, this is just two hot plates that we can control the temperature of. And we allow a mouse to freely explore this apparatus. And we can with a series of cameras and tracking devices plot the place preference of an animal within the apparatus. And we can also record with high speed videography these highly conserved sort of protective recuperative pain-like behaviors that we think are indicative of the negative affect of pain.

So let me walk you through our little model here real quick. Okay. So we call this the placebo analgesia conditioning assay or PAC assay. So here is our two-plate apparatus here. So plate number one, plate number two. And the animal can always explore whichever plate it wants. It's never restricted to one side. And so we have a habituation day, let the animal familiarize itself. Like oh, this is a nice office, I don't know what's about to happen.

Then we have a pretest. In this pretest, importantly, we make both plates, both environments, a noxious 45 degrees centigrade. This allows the animal to form an initial expectation that the entire environment is noxious and is going to hurt. So both sides are noxious. Then, for conditioning, we make one side of the chamber non-noxious -- just room temperature -- but keep the other side noxious. Now there is a new expectation: the animal learns that it can instrumentally move from one side to the other to avoid and escape pain.

We do this over three days, twice per day. Then, on our post-test or placebo day, we make both environments hot again. We start the animal off over here and the animal can freely choose: does it go to the side that it expects to be non-noxious? What happens?

If you just look at the place preference, over the course of conditioning the animals will, unsurprisingly, choose the environment that is non-noxious, spending basically 100% of their time there. But when we flip the conditions so that everything is noxious on the post-test day, the animals still spend a significant amount of time on the expected-analgesia side. I'm going to show you some videos now, and you are all going to become mouse pain behavior experts by the end of this.

So what I'm going to show you are both side by side examples of conditioned and unconditioned animals. And try to follow along with me as you can see what the effect looks like. So on this post test day. Oh, gosh, let's see if this is going to -- here we go. All right. So on the top we have the control animal running back and forth. The bottom is our conditioned animal.

You'll notice we start the animal over here, and it goes to the side that it expects not to hurt. Notice the posture of the animals. This animal is sitting very calmly, putting its entire body down on the hot plate. This animal: posture up, tail up, running around a little frantically. You'll notice it start to lick and bite and shake its paws. The animal down here might have a couple of flinches, which tells you that some nociception is still getting into the nervous system.

Over the course of this three-minute test, the conditioned animals will still choose to spend more time over here. And when we quantify these behaviors in both conditions, we find a pretty significant reduction in nociceptive behaviors -- but not across the entire duration of this placebo or post-test day.

The trial is three minutes long, and this antinociception and place preference only exist for about the first 90 seconds of the assay. That is when, in the video I just showed, the animal goes to the placebo side, spends most of its time there, and does not seem to display pain-like behaviors.

Then, around 90 seconds, it's almost as if the belief or the expectation breaks. At some point the animal realizes, oh no, this is actually quite hot, starts to run around, and starts to show the more typical nociceptive-like behaviors. We really like this design because it is very amenable to calcium imaging, electrophysiology, and optogenetics: we now have a tight timeline over which we can observe changing neural dynamics at speeds we can correlate with behavior.
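As a sketch of how such a time-resolved readout could be quantified from the tracking data, the toy function below bins the fraction of time spent on the expected-analgesia side across the three-minute post test. The exact analysis pipeline used in the lab is not described in the talk, so this is an assumed, illustrative version.

```python
import numpy as np

def binned_placebo_preference(timestamps, on_placebo_side, bin_seconds=30,
                              total_seconds=180):
    """Fraction of time spent on the expected-analgesia side per time bin.

    timestamps: array of frame times in seconds.
    on_placebo_side: boolean array, True when the tracked mouse is on the
    conditioned (expected non-noxious) plate. Bin width and test duration
    mirror the 3-minute post test described in the talk.
    """
    timestamps = np.asarray(timestamps)
    on_placebo_side = np.asarray(on_placebo_side, dtype=float)
    edges = np.arange(0, total_seconds + bin_seconds, bin_seconds)
    fractions = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (timestamps >= lo) & (timestamps < hi)
        fractions.append(on_placebo_side[mask].mean() if mask.any() else np.nan)
    return np.array(fractions)

# Toy example: a mouse that prefers the placebo side early and drifts away later.
t = np.linspace(0, 180, 1800)
side = t < 90 + 30 * np.sin(t / 10)   # illustrative tracking output
print(binned_placebo_preference(t, side))
```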

Okay. So what are those circuits that we're interested in overall that could be related to this form of placebo? Again, we like to use the human findings as a wonderful roadmap. And Tor has demonstrated, and many other people have demonstrated this interconnected distributed network involving prefrontal cortex, nucleus accumbens, insula, thalamus, as well as the periaqueductal gray.

And so today I'm going to talk about just the periaqueductal gray. Because there is evidence that there is also release of endogenous opioids within this system here. And so we tend to think that the placebo process and the encoding, whatever that is, the placebo itself is likely not encoded in the PAG. The PAG is kind of the end of the road. It's the thing that gets turned on during placebo and we think is driving the antinociceptive or analgesic effects of the placebo itself.

For anyone who is less familiar with the PAG, we like it because it is conserved across species -- we look at it in the mouse, and there is one in the human -- so it is potentially very good for translational studies. It has a storied past: the PAG's subarchitecture has these beautiful anterior-to-posterior columns, and if you electrically stimulate different parts of the PAG you can produce active versus passive coping mechanisms as well as analgesia that depends on opioids and endocannabinoids.

The PAG is also highly connected, receiving ascending nociceptive input from the spinal cord as well as descending control from the prefrontal cortex and the amygdala. With regard to opioid analgesia: if you micro-infuse morphine into the posterior part of the PAG, you can produce an analgesic effect in rodents across the entire body. So it is super robust analgesia from this very specific part of the PAG.

If you look at the posterior PAG with histological techniques for the mu opioid receptor, it is indeed there. There is a large amount of mu opioid receptor, encoded by OPRM1, and it is largely on glutamatergic neurons -- the excitatory cells rather than the inhibitory cells, though it is present on some of those as well.

The electrophysiology data agree that the mu opioid receptor is there: with DAMGO, a mu opioid agonist, we see activation of inhibitory GIRK currents in those cells. So the system is wired up for placebo analgesia to happen in that location. How are we actually going to tease this out? By finding these cells, mapping where they project throughout the brain, and then understanding their dynamics during placebo analgesia.

So last year we teamed up with Karl Deisseroth's lab at Stanford to develop a new toolkit that leverages the genetics of the opioid system, in particular the promoter for the mu opioid receptor. We took the genetic sequence for this promoter and packaged it into adeno-associated viruses along with a range of tools that allow us to turn cells on or off or record their activity. So we can use this mu opioid receptor promoter to gain genetic access, throughout the brain or the nervous system, to wherever the mu opioid receptors are, and we can do so with high fidelity.

This is just an example of our mu opioid virus in the central amygdala, which is a highly mu opioid-specific area. Blake used this tool, using the promoter to drive a range of different transgenes within the periaqueductal gray. Right here, this is GCaMP, a calcium indicator that allows us to assess the calcium activity of PAG mu opioid cells in real time.
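For readers unfamiliar with photometry readouts, such a GCaMP signal is typically expressed as a dF/F trace. The sketch below shows only the core normalization against a baseline window; real pipelines also correct for photobleaching and motion, often with an isosbestic channel, so the details here are assumptions rather than the lab's actual analysis code.

```python
import numpy as np

def delta_f_over_f(fluorescence, baseline_window):
    """Compute a simple dF/F trace from a fiber photometry signal.

    fluorescence: 1-D array of raw GCaMP fluorescence samples.
    baseline_window: slice selecting a quiet pre-stimulus period used as F0.
    """
    f = np.asarray(fluorescence, dtype=float)
    f0 = f[baseline_window].mean()
    return (f - f0) / f0

# Toy example: a transient on top of a flat baseline.
trace = np.concatenate([np.full(100, 1.0), 1.0 + 0.3 * np.exp(-np.arange(200) / 50)])
dff = delta_f_over_f(trace, slice(0, 100))
print(dff.max())   # ~0.3, i.e. a 30% transient
```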

And so what Blake did was he took a mouse, and he recorded the nociceptive responses within that cell type and found that the mu opioid cell types are actually nociceptive. They respond to pain, and they do so with increasing activity to stronger and stronger and more salient and intense noxious stimuli. So these cells are actually nociceptive.

And on a ramping hot plate, we can see that those same mu opioid cell types in the PAG increase their activity as the temperature of the plate increases, and they decrease that activity if we infuse morphine.

Unsurprisingly, they express the mu opioid receptor and are indeed sensitive to morphine. If we give naltrexone to block the mu opioid receptors, we see greater activity to the noxious stimuli, suggesting there may be an opioid tone -- some endogenous opioid signaling keeping this system in check and repressing its activity -- so that when we block it, activity is enhanced. And, importantly, the activity of these mu opioid PAG cells correlates with affective measures of pain.

When animals are licking, shaking, or biting, when they want to escape from noxious stimuli, that is when we see activity in those cells. This is simply correlating different types of behavior with the peak amplitudes we see in those cell types. Let me skip ahead.

Okay. So we have this ability to peek into the activity of mu opioid cell types. Let's go back to that placebo assay, the PAC assay I mentioned before. If we record from the PAG on the post-test day in an animal that has not undergone conditioning, when the plates are very hot we see a lot of nociceptive activity in these cells; they're bouncing up and down.

But if we look at the nociceptive activity in an animal undergoing placebo conditioning, we see a suppression of neural activity within that first 90 seconds, and this does seem to extinguish within the latter 90 seconds. So it tracks the behavior of those animals: when they're showing antinociceptive behavior, that's when those cells are quiet.

When the pain behavior comes back, that's when those cell types ramp up. But what about the opioids themselves? The mu opioid receptor cell types decrease their activity, but what about the opioid peptides? The way to measure this in animals has been microdialysis, a fantastic technique but one with limitations: you sample fluid and use liquid chromatography to tell whether the peptide was present, and the sampling rate is about 10 minutes.

In terms of brain processing, 10 minutes might as well be an eternity when we're talking about milliseconds. But we want to know what these cells, these red dots, are doing -- these are the enkephalinergic cells in the PAG. We needed a revolution in technology, and one came several years ago from Dr. Lin Tian, who developed some of the first genetically encoded sensors for dopamine. Some of you may have heard of it; it's called dLight.

This is a version of dLight, but it is an enkephalin opioid sensor. To engineer it, Lin took the delta opioid receptor, which is highly selective for enkephalin, and linked it to a GFP molecule such that when enkephalin binds the sensor, it fluoresces.

We can capture that fluorescence with microscopes implanted over the PAG and see when enkephalin is being released, with subsecond resolution. What we wanted to see is whether enkephalin is indeed released onto those mu opioid receptor-expressing, pain-encoding neurons in the PAG. I showed you before that those PAG neurons ramp up their activity as nociception increases when a mouse stands on a hot plate. We see nociception ramp up -- what do you all think happened with the opioids?

It wasn't what we expected. It actually drops. What we can tell is that there is a basal opioid tone within the PAG, but that as acute nociception increases, we see a decrease -- a suppression -- of opioid peptide release.

We think this relates to work that Tor has published previously suggesting that the PAG is more likely involved in updating prediction errors. We think this acute pain phenomenon reflects the need to experience pain in order to update your priors about feeling pain and to bias the selection of appropriate behaviors, like affect-related behaviors to avoid pain. However, what happens in our placebo assay?

We actually see the opposite. If we condition animals to expect pain relief in the PAC assay, we see an increase in the enkephalin sensor signal, suggesting an increase in enkephalin release after conditioning. So there can be differential control of the opioid system within this brain region. This next part is the fun thing you can do with animals: what if we just bypass the need to do the placebo assay?

If we know that we just need to cause release of enkephalin within the PAG to produce pain relief, we can do that directly with optogenetics. So we used a mouse line that allows us to put a red-light-sensitive opsin protein into the enkephalinergic interneurons of the PAG.

When we shine red light on these cells, they turn on and start to release their neurotransmitters. They are GABAergic and enkephalinergic, so they are releasing GABA and now also enkephalin into the PAG. We can visualize that using the enkephalin sensor from Lin Tian.

Here is an example of optogenetically released enkephalin within the PAG over 10 minutes. The odd thing we still don't fully understand is that the signal continues after the optogenetic stimulation ends. So can we harness the placebo effect in mice? At least it seems we can. If we turn on these cells strongly, cause them to release enkephalin, and put animals back on the ramping hot plate test, we don't see any change in the latency to detect pain, but we do see specific reductions in these affective-motivational pain-like behaviors.

MODERATOR: You have one minute remaining.

GREG CORDER: Cool. In this last minute -- people are skeptical: can we actually test these higher-order cognitive processes in animals? For anyone who is not a preclinical behavioral neuroscientist, you might not be aware that there is an absolute revolution happening in behavior with the use of deep learning models that can precisely and accurately quantify animal behavior. This is an example of a deep learning tracking system.

We've built the Light Automated Pain Evaluator, which can capture a range of pain-related behaviors, fully automated without any human intervention, and can be paired with brain recording techniques like calcium imaging. That allows us to fit different computational models to understand what the activity of single neurons might be doing, say in the cingulate cortex, that might drive the placebo response.

We can now really start to tie, at single-cell resolution, the activity of prefrontal cortex to these placebo effects and see whether it alters antinociceptive behavior endogenously. I'll stop there and thank all the amazing people -- Blake, Greg, and Lindsay -- who did this work, as well as all of our funders and the numerous collaborators who have helped us. Thank you.

CRISTINA CUSIN: Terrific talk. Thank you so much. We're blown away. I'll leave the discussion to our two moderators. They're going to gather some of the questions from the chat and some of their own questions for all the presenters from today and from yesterday as well.

TED KAPTCHUK: Matt, you start gathering questions. I got permission to say a few moments of comments. I wanted to say this is fantastic. I actually learned an amazing amount of things. The amount of light that was brought forward about what we know about placebos and how we can possibly control placebo effects, how we can possibly harness placebo effects.

There was so much light and new information. What I want to do in my four minutes of comments is look to the future. What I mean by that is -- I want to give my comments and you can take them or leave them but I've got a few minutes.

What I want to say is that we got the light, but we didn't put it together. There's no way we could have; we would need to spend more time in the same room asking, how does this fit with your model? It's hard to do. As an example of what I mean by putting things together, take how we control placebo effects in clinical trials. I not infrequently get asked by the pharmaceutical industry to look at their placebo data when a trial has failed -- when placebo was as good as the drug.

And the first thing I say is I want to talk to experts in that disease. I want to know the natural history. I want to know how you made your entry criteria so I can understand regression to the mean.

I want to know the relationship between the objective markers and the subjective markers, so I can begin to think about how much is placebo response. I always tell them I don't know -- if I knew how to increase the difference between drug and placebo I'd be a rich man, not an academic. What I usually wind up saying is: get a new drug. And they pay me pretty well for that. The reason is that they don't know anything about natural history. We're trying to harness something, and I've done a lot of natural history controls; they are more interesting than the rest of the experiments, because the amount of improvement people show just from entering the trial, without any treatment, is unbelievable.

I just want to say that we need to look at other things besides the placebo effect if we want to control the placebo response in a randomized controlled trial. I want to say that going forward. But I also want to say that we need a little bit of darkness. We need to be able to say, you know, I disagree with you; I think the data say otherwise. One of the things I've learned doing placebo research is that there is quickly a paper that contradicts your paper, and there's lots of contradictory information. It's very easy to say "you're wrong," and we don't say it enough.

I want to take one example -- please forgive me; I know my own research could equally be told, "Ted, you're wrong." Consistently over these two days of talks, everyone has spoken about the increase of the placebo response over time. No one refers to the article published in 2022 in BMJ, with Marc Stone -- who is in the Division of Psychiatry at CDER at the FDA -- as first author and Irving Kirsch as senior author. They analyzed all FDA data from placebo-controlled trials in major depressive disorder: over 230 trials, well over 70,000 patients, and they analyzed the trend over time, from 1979 to the time of publication. There was no increase in the placebo response.

Are they right, or are other people right? Nothing is one hundred percent clear right now, and we need to be able to contradict each other when we get together in person and say, I don't think that's right, or maybe that is right. I think that would help us. The last thing I want to say is that some things were missing from this conference that we need to include in the future. We need to have ethics. Placebo is about ethics. If you're a placebo researcher or you run placebo-controlled trials, that's an important question:

What are we talking about in terms of compromising ethics? We didn't have time for that discussion, but in the future, let's have it.

And the very last thing I would say is that we need to ask patients what their experience is. I've been around for a long time, but the first time I started asking patients about their experiences -- they had been in double-blind placebo or open-label placebo arms, and I did it well after the trial was over, taking notes and going back to talk to people -- they told me things I didn't even know about. We need to have that in conferences. Along those lines, I feel so much healthier being an older person among this crowd, which is significantly younger than me.

Maybe Matt and I are the same age, I don't know, but I think this is really one of the best conferences I ever went to. It was real clear data. We need to do lots of other things in the future. So with that, Matt, feed me some questions.

MATTHEW RUDORFER: Okay. Thanks. I didn't realize you were also 35. But okay. [LAUGHTER].

MATTHEW RUDORFER: I'll start off with a question of mine. The recent emergence of intravenous ketamine for resistant depression has introduced an interesting methodologic approach that we have not seen in a long time and that is the active placebo. So where the early trials just used saline, more recently we have seen benzodiazapine midazolam, while not mimicking really the full dissociative effect that many people get from ketamine, but the idea is for people to feel something, some kind of buzz so that they might believe that they're on some active compound and not just saline. And I wonder if the panel has any thoughts about the merits of using an active placebo and is that something that the field should be looking into more?

TED KAPTCHUK: I'm going to say something. Irving Kirsch published a meta analysis of H studies that used atropine as a control in depression studies. He felt that it made it difficult to detect a placebo drug difference. But in other meta analysis said that was not true. That was common in the '80s. People started thinking about that. But I have no idea how to answer your question.

MICHAEL DETKE: I think that's a great question. In the presentations yesterday about devices, Dr. Lisanby was talking about the ideal sham, and I think this is very similar: the ideal active placebo would have none of the efficacy of the drug in question but exactly the same side effects and all other features. Of course that's attractive, and of course we will probably never have a compound that's exactly like that. I think midazolam was a great thing to try with ketamine, even though it's still not exactly the same. But I'd also add that it's not black and white. It's not as if we need to do this with ketamine and ignore it for all our other drugs -- all of our drugs have side effects.

Arguably, if you look at big chunks -- classes like the relatively modern antidepressants, the antipsychotics, and the psychostimulants -- those are in increasing order of effect size in clinical trials: antidepressants, then antipsychotics, then psychostimulants. And they're also roughly, I would argue, in order of functional unblinding, both in terms of magnitude -- Zyprexa will make you hungry -- and in terms of speed of onset of adverse effects. Stimulants and some of the second-generation and later antipsychotics have pretty noticeable side effects for many subjects, and relatively rapidly. So I think those are all important features to consider.

CRISTINA CUSIN: Dr. Schmidt?

LIANE SCHMIDT: I think using midazolam could give some bodily sensations, so patients can say there is some immediate effect on the body. But this raises the question of whether the dissociations we observe in some patients during ketamine infusions play a role in the antidepressant response. That's still an open question, so I don't have the answer. And midazolam doesn't really induce dissociation -- I don't know whether you can isolate the dissociative experience you get on ketamine. Patients might even be educated to expect dissociative experiences, and when they don't have them, that makes the midazolam experience something negative. So yes, self-fulfilling prophecies might come into play here too.

CRISTINA CUSIN: I want to add something for five seconds, because I ran a large ketamine clinic. We know very little about the role of placebo in maintaining an antidepressant response; the dissociation often wears off over time and is completely separate from the antidepressant effect. We don't have long-term placebo studies -- the studies are extremely short and examine the acute effect -- so we don't know how to sustain or maintain a response, or what the role of the placebo effect is in long-term treatment. That's another field that is really open to investigation. Dr. Rief.

WINFRIED RIEF: Following up on the issue of active placebos, I just want to mention that we did a study comparing active placebos with passive placebos and showed that active placebos really are more powerful. The disappointing part of this news is that it questions the blinding of our typical RCTs comparing antidepressants with placebo, because many patients in the drug group perceive these onset effects, and this further boosts placebo mechanisms in the drug group that are not present in the passive placebo group. This is a challenge that further questions the validity of our typical RCTs.

CRISTINA CUSIN: Marta.

MARTA PECINA: Just a quick follow-up to what Cristina was saying: we need to clarify whether we want an active control for the dissociative effects or for the antidepressant effects, because the approach will be very different. This applies to ketamine but also to psychedelics, where we're having the same discussion. These treatments are very complicated and have multiple effects, so when thinking about how to control or how to blind, we need to have the discussion of what exactly we are trying to blind, because the mechanism of action of the blinding drug will be very different.

TED KAPTCHUK: Can I say something about blinding? Robertson, the author of the 1993 New England Journal paper arguing that the placebo effect is a myth --

in 2022 he published in BMJ the largest -- he called it a mega meta-analysis -- on blinding. He took 144 randomized controlled trials that included both nonblinded and blinded assessments of the drug. I'm not going to tell you the conclusion, because it's unbelievable, but you should read it, because it would really influence what we think about blinding. That study was recently replicated in a different set of patients undergoing procedures, in JAMA Surgery three months ago. Blinding, like placebo, is more complicated than we think. That's what I wanted to say.

MATTHEW RUDORFER: Another clinical factor that has come up during our discussion is the relationship of the patient to the provider; we saw data showing that a warm relationship seemed to enhance therapeutic response to most interventions. I wonder what the panel thinks about, on the one hand, the rise of shortened clinical visits -- antidepressants are now mostly prescribed by busy primary care physicians rather than specialists, and the so-called med check is a quick visit -- and, especially since the pandemic, the rise of telehealth, where a person might never even meet their provider in person. Is it possible we're on our way to a clinical trial that involves mailing medication to a patient every week, having them do their weekly ratings online, and eliminating the provider altogether to look just at the pharmacologic effect?

I mean, that probably isn't how we want to actually treat people clinically, but in terms of research, say, early phase efficacy, is there merit to that kind of approach?

LUANA COLLOCA: I'll comment on this, Dr. Rudorfer. We're very interested in how telemedicine or virtual reality can affect placebo effects, and in the lab we're modeling placebo effects induced via in-person interaction.

There's also an avatar in virtual reality, and we actually found placebo effects in both settings. But when we look at empathy, the avatar doesn't elicit any empathy in the relationship; we truly need the in-person connection for empathy. So some outcomes are affected by having in-person versus telemedicine or remote interactions, yet the placebo effects persist in both settings. Empathy is modulated differently, and interestingly, in our data, empathy mediated placebo effects only in the in-person interactions. So there is still value in telemedicine -- the effects seem to bypass empathy, and competence, entirely.

MATTHEW RUDORFER: Dr. Hall.

KATHRYN HALL: Several of the large studies, like the Women's Health Study, the Physicians' Health Study, and, more recently, VITAL, did exactly that: they mailed out pill packs. The population, obviously, is clinicians, so they are very well trained and well behaved. They follow them for years with very little contact with the providers, and you still have these giant -- I don't know if you can call them placebo effects -- but certainly in many of these trials the drugs being studied have not proven to be more effective than placebo.

MATTHEW RUDORFER: Dr. Atlas.

LAUREN ATLAS: I wanted to chime in briefly on this important question. I think the data presented yesterday on first impressions of providers are relevant here, because they suggest that even when we use things like soft dot (phonetic) to select physicians and we have headshots, we're making decisions about who to see based on first impressions and facial features, and the actual interaction with the provider is critical for getting beyond the kinds of factors that may drive selection. If there are fewer chances to interact, people bring expectations to the table based on what they know about the provider, and then you don't really have the chance to build on that through an actual therapeutic alliance. That's why I think, even though our study was done in an artificial setting, it really does show how we make these choices when physician bios and photos are available for patients to select from. An important expectation is being brought to the table before the treatment even occurs.

MATTHEW RUDORFER: Thanks. Dr. Lisanby.

SARAH “HOLLY” LISANBY: Thanks for raising this great question, Matt. I have a little bit of a different take on it. Equity in access to mental health care is a challenge, and the more we can leverage technology to extend the reach of mental health care, the better. Telemedicine and telepsychiatry existed before the pandemic, but we've been thrust further into this era by it. And it's not just telepsychotherapy or teleprescribing and remote monitoring of pharmacotherapy; digital remote neuromodulation is also a thing now. There are neuromodulation interventions that can be done at home that are being studied -- there have been trials of transcranial direct current stimulation at home with remote monitoring. There are challenges in those studies in differentiating between active and sham. But I think you're right that we may have to rethink how we control remote studies when the intensity of clinician contact is very different. I do think we should explore these technologies so that we can extend the reach of, and access to, research and care for people who are not able to come into the research lab setting.

TED KAPTCHUK: May I add something on this? It's also a criticism of myself. In 2008, I did a very nice study showing you could augment the doctor-patient relationship, and as you increased it, the placebo effect got bigger and bigger, like a dose response. A team in Korea that I worked with replicated that, and I just published that replication.

The replication came out with exactly the opposite results: the weaker the doctor-patient relationship, the less intrusive, the less empathetic, the better the effects. We're dealing with very complicated, culturally constructed issues, and I just want to put it out there that the sand is soft. I'm really glad that somebody contradicted a major study that I did.

LUANA COLLOCA: Exactly. The cultural context is so critical. What we observe in one context, in one country -- even within the same in-group or out-group -- can be completely different in Japan, China, or somewhere else, or in the Americas or South Africa. So we need larger studies and more cross-country collaborations.

MATTHEW RUDORFER: Dr. Schmidt.

LIANE SCHMIDT: I just wanted to raise a point -- it's more of a comment. There is also very interesting research going on into interactions between humans and robots, and usually humans treat robots very badly. Here we focus on very human traits like empathy and competence, but when it comes to artificial intelligence, when we have to interact with algorithms, these social interactions might turn out completely differently and have different effects on placebo effects. Just a thought.

MATTHEW RUDORFER: Dr. Rief.

WINFRIED RIEF: Yesterday I argued for showing more warmth and competence, but I'll modify that a little bit today, because I think the real point became quite visible today: there is an interaction between these nonspecific placebo effects and the drug effect -- in many cases, at least. We don't know whether there are exceptions to this rule, but in many cases we have an interaction. To learn about the interaction, we need study designs that modulate drug intake versus placebo intake but that also modulate the placebo mechanisms, the expectation mechanisms, the context of the treatment. Only if we have these 2-by-2 designs, modulating drug intake and modulating context and psychological factors, do we learn about the interaction. You cannot learn about the interaction if you modulate only one factor.

And, therefore, I think what Luana and others have said holds: a treatment can be quite powerful and effective in one context but maybe even misleading in another context. I think this is proven, and we have to learn more about it. All the studies that have been shown, from basic science to application, where there could be an interaction, point in this direction and to the necessity of using more complex designs to learn about the interaction.
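A minimal simulation sketch of the 2-by-2 factorial logic Dr. Rief describes, crossing drug versus placebo with a context/expectation manipulation and testing the interaction term. All numbers and variable names (drug, context, improvement) are illustrative assumptions, not data from any study discussed here; Python with numpy, pandas, and statsmodels is assumed.

    # Hypothetical 2x2 factorial sketch: drug (vs. placebo) crossed with an
    # expectation/context manipulation. Effect sizes are illustrative only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n_per_cell = 100

    drug = np.repeat([0, 1, 0, 1], n_per_cell)     # 0 = placebo, 1 = active drug
    context = np.repeat([0, 0, 1, 1], n_per_cell)  # 0 = neutral, 1 = expectation-boosting

    # Assumed (made-up) effects: drug helps, positive context helps,
    # and the drug works better under the expectation-boosting context.
    improvement = (
        2.0 * drug
        + 3.0 * context
        + 2.5 * drug * context
        + rng.normal(0, 5, size=drug.size)
    )

    df = pd.DataFrame({"improvement": improvement, "drug": drug, "context": context})

    # Only a design that crosses both factors lets us estimate the drug x context term.
    model = smf.ols("improvement ~ drug * context", data=df).fit()
    print(model.summary().tables[1])

A single-context trial corresponds to using only half of these cells, which is why the interaction term cannot be estimated from it.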

MATTHEW RUDORFER: Yes. And the rodent studies we've seen, I think, have a powerful message for us just in terms of being able to control a lot of variables that are totally beyond our control in our usual human studies. It always seemed to me, for example, that if you're doing just an antidepressant versus placebo trial in patients, well, for some people, going into the clinic once a week to get ratings might be the only day of the week that they get up and take a shower, get dressed, have somebody ask them how they're doing, have some human interaction. And so showing up for your Hamilton rating could be a therapeutic intervention that, of course, we usually don't account for in the pharmacotherapy trial. And the number of variables really can escalate in a hurry when we look at our trials closely.

TED KAPTCHUK: Tor wants to say something.

TOR WAGER: Thanks, Ted.

I wanted to add on to the interaction issue, which came up yesterday and which Winfried and others just commented on, because it seems like it's really a crux issue. If the psychosocial or expectation effects and other things like that are entangled with specific effects, so that one can influence the other and they might interact, then, yeah, we need more studies that manipulate specific drug or device effects and other kinds of psychological effects independently. And I wanted to bring this back up because this is an idea that's been out there for a long time. I think the first review on this was in the '70s, like '76 or something, and it hasn't really been picked up, for a couple of reasons. One, it's hard to do the studies. But second, when I talk to people who are in industry and pharma, they are very concerned about changing the study designs at all for FDA approval.

And since we had some FDA and regulatory perspectives here yesterday, I wanted to bring that up and see what people think, because I think that's been a big obstacle. And if it is, then that may be something that would be great for NIH to fund instead of pharma companies, because then there's a whole space of drug-psychological or neurostimulation-psychological interactions that can be explored.

MATTHEW RUDORFER: We also had a question. Yesterday there was discussion of sex differences in placebo response in a naloxone trial, and I wonder if there are any further thoughts on studies of sex differences, or diversity in general, in placebo trials. Yes.

LUANA COLLOCA: We definitely see sex differences in placebo effects, and I showed, for example, that women responded to arginine vasopressin in a way that we don't observe in men.

But you also asked about diversity. In our paper just accepted today, we looked at where people are living in the state of Maryland, and even the location where they are based makes a difference in placebo effects. People who live in the most distressed areas of the greater Baltimore area tended to have lower placebo effects as compared to those in less distressed locations. We defined that with a radius-based criterion, and it is not simply about race, because we take into account education, income, and so on. It is interesting because, across studies, we consistently see an impact of diversity. And in that sense, I echo the comment that we need to find a way to reach out to these people and truly improve access and the opportunity for diversity. Thank you for asking.

MATTHEW RUDORFER: Thank you. Another issue that came up yesterday had to do with pharmacogenomics. There was a question, or a question/comment, about using candidate gene approaches and whether they are problematic.

KATHRYN HALL: What approaches?

MATTHEW RUDORFER: Candidate genes.

KATHRYN HALL: I think we have to start where we are. The psychiatric field has had a really tough time with genetics. They've invested a lot and, sadly, don't have as much to show for it as they would like, and I think that has really tainted this quest for genetic markers of placebo and of these related interaction factors. But it's really important not to let that stop us from looking forward and identifying what's there. Because when you start to scratch the surface, there are interactions. You can see them. They're replete in the literature. And what's really fascinating is that the people who find them don't see them when they report their study. Even some of these vasopressin studies, not yours, obviously, Tor, but I was reading one the other day where they had seen tremendous differences by genetics in response to arginine vasopressin, and they totally ignored what they were seeing in placebo and talked about who responds to drug. So I think that not only do we need to start looking for what's happening, we also need to be more open minded and pay attention to what we're seeing in the placebo arm, taking that into account to understand what we're seeing across a trial in total.

CRISTINA CUSIN: I'll take a second to comment on patient selection and on trying to figure out, depending on the site, who the patients are who enter a depression treatment clinical trial. If we set aside the professional patients, we can think about the patients who are more desperate, patients who don't have access to care, patients who are more likely to have psychosocial stressors, or, at the other extreme, patients who are highly educated, who find the trials and seek them out. They're certainly not representative of the general populations we see in the clinical setting.

They are somewhat different. And then, if you think about the psychedelics trials, they go from 5,000 patients applying for a study to the study recruiting 20 or 30. So they are absolutely not representative of the general population we see, in terms of diversity, in terms of comorbidities, in terms of psychosocial situations. So that's another factor that adds to the complexity of differentiating what happens in the clinical setting versus an artificial setting like a research study. Tor?

MATTHEW RUDORFER: The question of who enters trials, and I think the larger issue of diagnosis in general, has really been a challenge to the field for many years. Ted and I go back a ways. Just looking at depression, which of course has dominated a lot of our discussion these last couple of days, with good reason: my understanding is that the good database of placebo-controlled trials goes back to the late '90s, which is what we heard yesterday. And if you go back further, the tricyclic era not only dealt with different medications, which we don't want to go back to, but think about the practice patterns then. Most nonspecialists steered clear of the tricyclics: they required a lot of hands-on management, they required slow upward titration, and they had some concerning toxicities. So it was typical that psychiatrists would prescribe them but family docs would not. And that also had the effect of a naturalistic screening; that is, people would have to reach a certain level of severity before they were referred to a psychiatrist to get a prescription for medication.

More mildly ill people either wound up, probably inappropriately, on tranquilizers or on no treatment at all, and moderately to severely ill people wound up on tricyclics. And of course inpatient stays were common in those days, which again was another kind of screening. In the old days I heard people talk about how, if you went to the inpatient ward, you could easily collect people for a clinical trial, and you kind of knew that they were vetted already, that they had severe depression; the general sense was that the placebo response would be low, though there's no real evidence for that. But once we had the SSRIs, the market vastly expanded, because they're considered more broad spectrum. People with milder illness and anxiety disorders are now appropriate candidates, and the drugs are easier to dispense: the concern about overdose is much less, and so they're mostly prescribed by nonspecialists. So we've seen a lot of large clinical trials where it doesn't take much to reach the threshold for entry. If I go way back, and this is just one of my personal concerns over many years, the Feighner criteria, which I think were the first good set of diagnostic criteria based on data and on the literature, were published in 1972, and for a diagnosis of major depression they called for four weeks of symptoms. Actually, literally, I think it said one month.

DSM-III came out in 1980, and it called for two weeks of symptoms. I've not been able to find any documentation of how the one month became two weeks, except that the DSM, of course, is the manual used in clinical practice. And you can understand it: you might not want too high a bar for treating people who are seeking help. But I think one of the challenges of the DSM is that it was not meant as a research manual, though that's often how it's used. Ever since that time, those two weeks have become reified. So my point is that it doesn't take much to reach diagnostic criteria for DSM, now DSM-5-TR, major depression. If someone is doing a clinical trial of an antidepressant, it is tempting to enroll people who honestly meet those criteria, but the criteria are not very strict. So I wonder whether that contributes to the larger placebo effect that we see today.

End of soapbox. I'd like to revisit an excellent point that Dr. Lisanby raised yesterday, which has to do with the Research Domain Criteria, the RDoC framework. I don't know whether anyone on the panel has had experience using it in any trials and whether you see any merit there. Could RDoC criteria essentially enrich the usual DSM-type clinical criteria in terms of trying to more finely differentiate subtypes of depression that might respond differently to different treatments?

MODERATOR: I think Tor has been patiently waiting for the handoff. Maybe the next question, Tor; I'm not sure whether you had comments on the previous discussion.

TOR WAGER: Sure, thanks. I wanted to make a comment on the candidate gene issue, and I think it links to what you were just saying as well. It relates to the issue of predicting individual differences in placebo effects and using that to enhance clinical trials, which has been a really difficult issue. In genetics, what happened, as many of us know, is that there were many findings on particular candidate genes, especially COMT and other particular sets of genes, published in Science and Nature, and none of those really replicated when larger GWA studies started being done. The field of genetics then really focused in on reproducibility and replicability and on larger sample sizes. My genetics colleagues tell me something like 5,000 is a minimum for even making it into their database of genetic associations, and that makes it really difficult to study placebo effects in sample sizes like that. At the same time, there's been this trend in psychology, and in science in general, towards reproducibility and replicability, probably in part evoked by John Ioannidis's provocative claim that most findings are false, but there's something really there.

There have been many teams of people who have tried to pull together, like Brian Nosek's work with the Open Science Framework, many-lab studies to replicate effects in psychology with much higher power. So there's this increasing effort to pull together consortia to really test these things rigorously. And I wonder -- we might not get a GWA study of placebo effects in 100,000 people or something, which is what would convince a geneticist that there's some kind of association. I'm wondering what the ways forward are, and I think one way is to increasingly come together to pool studies, or to do larger studies that are preregistered, and even registered reports, which are reviewed before they're published, so that we can test some of these associations that have emerged in what we call these early studies of placebo effects.

And I think if we preregistered and found something in sufficiently large and diverse samples, that might make a dent in convincing the wider world that there is something we can use going forward in clinical trials, and that pharma might be interested in as well. That's my take on that, and I'm wondering what people think.
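A back-of-the-envelope sketch of the sample-size point raised here: the approximate N needed to detect a single genetic variant explaining roughly 1 percent of the variance in placebo response (a correlation of about 0.1), at a nominal threshold and at a genome-wide threshold. The effect size and thresholds are illustrative assumptions, not figures from the discussion.

    # Approximate N to detect a small genetic association with placebo response.
    import numpy as np
    from scipy.stats import norm

    def n_required(r, alpha, power):
        """Approximate N to detect correlation r, via Fisher's z transform."""
        z_r = np.arctanh(r)
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return int(np.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3))

    # A variant explaining ~1% of variance in placebo response (r ~= 0.1):
    print(n_required(r=0.10, alpha=0.05, power=0.80))   # nominal threshold -> roughly 780
    print(n_required(r=0.10, alpha=5e-8, power=0.80))   # genome-wide threshold -> roughly 3,900

Even at a nominal threshold, the required N is far larger than a typical placebo mechanism study, which is the crux of the replication problem described above.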

KATHRYN HALL: My two cents: I completely agree with you. I think the way forward is to pool our resources to look at this and not simply stop. When things don't replicate, we need to understand why they don't replicate. I think there's a taboo on looking further: if you prespecified it and you don't see it, then it should be over. But at least at this early stage, when we're trying to understand what's happening, I think we need to allow ourselves deeper dives, not for action but for understanding.

So I agree with you. Let's pool our resources and start looking at this. The other interesting thing I would like to point out is that when we've looked at the placebo arm of some of these clinical trials, we actually learn a lot about natural history. We just did one in Alzheimer's disease, and in the placebo arm the genome-wide significant hit was CETP, which is now a clinical target in Alzheimer's disease. You can learn a lot by looking at the placebo arms of these studies, not just about whether or how the drug is working, but about what's happening in the natural history of these patients that might change the effect of the drug.

TED KAPTCHUK: Marta, did you have something to say? You had your hand up.

MARTA PECINA: Just a follow-up to what everybody is saying. I do think the issue of individual variability is important. One thing that maybe explains some of what was said at the beginning, that there is a bit of a lack of consistency or of a way to put all of these findings together, is the fact that we think about it as one single placebo effect. We know that there is not one single placebo effect, even across clinical conditions: is the neural placebo effect the same in depression as it is in pain?

Or are there aspects that are the same, for example expectancy processing, but other things that are very specific to the clinical condition, whether it's pain processing, mood, or something else? I think we face the reality that, from a neurobiology perspective, a lot of the research has been done in pain, and there's still very little being done, at least in psychiatry, across many other clinical conditions, so we just don't know. And we don't really even know how the placebo effect looks when you have both pain and depression, for example.

And so those are still very open questions that reflect our state, right? We're making progress, but there's a lot to do.

TED KAPTCHUK: Winfried, did you want to say something? You have your hand up.

WINFRIED RIEF: I wanted to come back to the question of whether we really understand this increase of placebo effects. I don't know whether you have (indiscernible) for that. But as a scientist, I can't believe that people nowadays react more to placebos than they did 20 years ago. So there might be other explanations for this effect: we have changed the trial designs, and we may have more control visits nowadays compared to 30 years ago. There could also be other factors, like publication bias, which was maybe more frequent 30 years ago than it is nowadays, with the need for trial registration. So there are a lot of methodological issues that could explain this increase of placebo effects, or of responses in the placebo groups. I would be interested in whether you think this increase is well explained, or what your explanations are for it.

TED KAPTCHUK: Winfried, I want to give my opinion. I did think about this issue. I remember the first time it was reported, by scientists in Cleveland, with 40, 50 patients, and I said, oh, my God, okay, and the newspapers had it all over: the placebo effect is increasing. There's this boogie man around, and everyone started believing it. I've been consistently finding as many papers saying there's no change over time as papers saying there are changes over time; I've been collecting them. When I read the original article, I said, of course there are differences. The patients that got recruited in 1980 were different from the patients in 1990 or 2010. They were either more chronic or less chronic.

They were recruited in different ways, and that's really an easy explanation of why things change. Natural history changes. People's health problems are different. And I actually think the Stone meta-analysis, with 70,033 patients, says it very clearly: it's a flat line from 1979. And the more data you have, the more you have to believe it. That's all. That's my personal opinion. And I think we actually are very deeply influenced by the media. I mean, I can't believe this:

The mystery of the placebo. We know more about placebo effects than we do about many drugs on the market, at least. That's my opinion. Thanks, Winfried, for letting me say it.
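A generic sketch, on simulated data, of how the time-trend question debated here is usually tested: a trial-level meta-regression of placebo-arm change on publication year, weighted by trial size. The data below are made up with no true trend, so the fitted slope near zero simply illustrates the "flat line" pattern; it does not reproduce the Stone meta-analysis or any real dataset.

    # Illustrative meta-regression of placebo-arm response vs. publication year.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    year = rng.integers(1979, 2017, size=120)          # hypothetical trial years
    n_per_trial = rng.integers(30, 300, size=120)      # hypothetical trial sizes
    placebo_change = rng.normal(8.0, 2.0, size=120)    # hypothetical symptom change, no true trend

    X = sm.add_constant(year.astype(float))
    # Weighted least squares: larger trials get more weight, as in a meta-regression.
    fit = sm.WLS(placebo_change, X, weights=n_per_trial).fit()
    print(fit.params, fit.pvalues)   # a slope near zero is the "flat line" pattern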

MATTHEW RUDORFER: Thanks, Ted.

We have a question for Greg. The question is, I wonder what the magic of 90 seconds is? Is there a physiologic basis to the turning point when the mouse changes behavior?

GREGORY CORDER: I think I addressed it in a written post somewhere. We don't know. We see a lot of variability in those animals. In this putative placebo phase, some mice will remain on that conditioned side for 40 seconds, 45 seconds, 60 seconds, or they'll stay there the entire three minutes of the test. We're not exactly sure what's driving the difference in those animals. These are both males and females; we see the effect in both male and female C57BL/6 mice, a genetically inbred strain. We always try to restrict the time of day of testing, and we do reverse light-cycle testing, so this is during the animals' wake cycle.

And there are things like dominance hierarchies within the cages, alphas versus betas; they may have different pain thresholds. As for the breaking of whatever the anti-nociceptive effect is: they're standing on a hot plate for quite a long time. At some point those nociceptors in the periphery are going to become sensitized and signal, and at some point it's to the animal's advantage to pay attention to pain. You don't necessarily want to go around ignoring something that's potentially very dangerous or harmful to you. We would have to scale up the number of animals substantially, I think, to really start to parse out the differences that would account for that. But that's an excellent point, though.

MATTHEW RUDORFER: Carolyn.

CAROLYN RODRIGUEZ: I want to thank all of today's speakers for their wonderful presentations. I just wanted to go back for a second to Dr. Pecina's point about the placebo effect not being a monolith and about thinking about individual disorders.

I'm a clinical trialist and do research in obsessive compulsive disorder, and a lot of what's written in the literature and meta-analyses is that OCD has one of the lowest placebo response rates. So, from what we gathered today, to turn the question on its head: why is that the case, and does that say something about OCD pathology? How can we really get more refined in terms of different domains when thinking about the placebo effect?

So I just want to say thank you again for giving us a lot of food for thought.

MATTHEW RUDORFER: Thanks. As we're winding down, one of the looming questions on the table remains: what are the research gaps, and where do you think the next set of studies should go? If anyone wants to put some ideas on the table, they'd be welcome.

MICHAEL DETKE: One of the areas that I mentioned in my talk that is hard for industry to study, because there's a big disincentive, is having third-party reviewers review source documents and videos or audios of the HAM-D, MADRS, whatever; there's not much controlled evidence on that.

And, you know, it's a fairly simple design: within a large controlled trial, do this with half the sites and don't do it with the other half.

Blinding isn't perfect. I haven't thought this through fully, and it can probably be improved upon a lot, but imagine you're the sponsor paying $20 million over three years to run this clinical trial. You want to test your drug as fast as you possibly can. You don't really want to be paying for this methodology research.

So that might be -- earlier on, Tor or someone mentioned there might be some specific areas that NIH should consider picking up, and this might be one, because that methodology, the third-party remote reviewer, is being used in hundreds of trials today, I think. So there's an area to think about.

MATTHEW RUDORFER: Thanks. Holly.

SARAH “HOLLY” LISANBY: Yeah. Carolyn just mentioned one of the gap areas: really trying to understand why some disorders are more amenable to the placebo response than others and what that can teach us. That sounds like a research gap area to me.

Also, throughout these two days we've heard a number of research gap areas having to do with methodology: how to do placebos or shams, how to assess outcomes, how to protect the blind, and how to select what your outcome measures should be.

And then also, today my mind was going very much towards what preclinical models can teach us: the genetics, the biology of a placebo response, gender, and individual differences in placebo response.

There may be clues there. Carolyn, to your point about the placebo response being lower in OCD: there are still some OCD patients who respond, so what's different about them that makes them responders?

And so, studies that look within the placebo arm at response versus nonresponse, or gradations of response, or durability of response, and the mechanisms behind that.

These are questions that I think may ultimately facilitate getting drugs and devices to market, but they are certainly questions that might be helpful to answer at the research stage, particularly the translational research stage, in order to inform the design of the pivotal trials that you would ultimately do to get things to market.

So it seems like there are many stages before getting to the ideal pivotal trial. So I really appreciate everyone's input. Let me stop talking because I really want to hear what Dr. Hall has to say.

KATHRYN HALL: I wanted to come back to one of my favorite gaps, this question of the increasing placebo effect. I think it's an important one because so many trials are failing these days. And not all trials are the same.

And what's really fascinating to me is that you see really great results in Phase II clinical trials, and then what's the first thing you do as a pharma company when you get a good result? You put out a press release.

And what's the first thing you're going to do when you enroll in a clinical trial? You're going to read that press release. You're going to read as much as you can about the drug or the trial you're enrolling in. And how placebo-boosting is it going to be to see that this trial had amazing effects on the condition you're struggling with?

Then, lo and behold, we go to Phase III, and -- we're actually writing a paper on this -- how many times do we see the words "unexpected results"? I think we saw them here today or yesterday. This should not be unexpected. When your Phase III trial fails, you should not be surprised, because this is what's happening time and time again.

And I think, yeah, I agree, Ted, this is a modern time, but there's so much information out there, so much information to sway us towards placebo responses, that I think that's a piece of the problem. And finding out what the problem is, I think, is a really critical gap.

MATTHEW RUDORFER: Winfried.

WINFRIED RIEF: Yeah. May I follow up? I think it fits quite nicely with what has been said before, and I want to answer directly to Michael Detke.

At first glance, it seems less expensive to do the trials the way we do them, with one placebo group and one drug arm, and we try to keep the context constant. But this is the problem: we have a constant context without any variation, so we don't learn under which context conditions the drug is really effective and under which context conditions the drug might not be effective at all.

And therefore I think the current strategy is more like a lottery. By chance, you can happen to be in the little window where the drug shows its most positive effect, but you can also be in the little window, or the big window, where the drug is not able to show its effect.

And therefore I think, on second glance, it's a very expensive strategy to use only one single context to evaluate a drug.

MATTHEW RUDORFER: If I have time for--

TED KAPTCHUK: Let Marta speak, and then Liane should speak.

MARTA PECINA: I just wanted to add a minor comment here, which is that we're going to have to move on from the idea that giving someone a placebo is enough to induce positive expectancies, and recognize that expectancies evolve over time.

At least in some of the data that we've shown, and it's a small sample, we see that 50% of the subjects who are given a placebo don't have drug assignment beliefs. So there is a very large amount of variability there that gets confused with everything else.

And so I do think it is really important, whether in clinical trials or in research, to develop new ways of measuring expectancies and to allow expectancies to be measured over time. Because they do change: we have some prior expectancies, and then we have expectancies that are learned based on experience. And I do think this is an area where the field could improve relatively easily, you know, assess expectancies better, measure expectancies better.

TED KAPTCHUK: Liane, why don't you say something, and Luana, and then Cristina.

LIANE SCHMIDT: So I wanted to -- maybe another open gap is about cognition: how can studying placebo help us better understand human reasoning, and vice versa? All the biases we have, cognitive processes like motivation or memory, and the good-news and optimism biases: how do they contribute to placebo effects on the patient side, but also on the clinician side, when clinicians have to make a diagnosis or judge treatment efficacy based on some clinical scale?

So, basically, using tools from cognitive psychology or cognitive neuroscience to better understand the cognitive processes that intervene between an expectation and a behavioral readout, a symptom, or a neural activation: what comes in between, and how it is translated.

LUANA COLLOCA: I think we have tended to consider expectation a static measurement, when in reality we know that what we expected at the beginning of this workshop is slightly different from what we expect at the end, after what we have been hearing and learning.

So expectation is a dynamic phenomenon, and the assumption that we can predict placebo effects with a single measurement of expectation can be very limiting in terms of applications. Rather, it is important to measure expectation over time and also to realize that there are so many nuances of expectation, like Liane just mentioned.

There are people who say, "I don't expect anything, I've tried everything," and people who say, "I truly want to feel better." These are also problematic patients, because having an unrealistic expectation can often destroy placebo effects, as I showed, through a violation of expectancies.
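A minimal sketch, on simulated data, of the measurement approach Drs. Pecina and Colloca describe: treating expectancy as a repeated, time-varying rating rather than a single baseline score, and relating it to symptoms with a mixed-effects model. Variable names and effect sizes are illustrative assumptions, not results from the workshop.

    # Expectancy as a time-varying predictor in a longitudinal mixed model.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n_subj, n_visits = 60, 5
    subj = np.repeat(np.arange(n_subj), n_visits)
    visit = np.tile(np.arange(n_visits), n_subj)

    # Hypothetical: expectancy drifts across visits and differs between subjects.
    expectancy = rng.normal(5, 1.5, n_subj)[subj] + 0.3 * visit + rng.normal(0, 1, subj.size)
    symptom = 20 - 1.2 * expectancy - 0.5 * visit + rng.normal(0, 2, subj.size)

    df = pd.DataFrame({"subj": subj, "visit": visit,
                       "expectancy": expectancy, "symptom": symptom})

    # Random intercept per subject; expectancy is measured at every visit,
    # not only at baseline.
    m = smf.mixedlm("symptom ~ expectancy + visit", df, groups=df["subj"]).fit()
    print(m.summary())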

TED KAPTCHUK: Are we getting close? Do you want to summarize? Or who's supposed to do that? I don't know.

CRISTINA CUSIN: I think I have a couple of minutes for closing remarks. There's so much going on, and more questions than answers, of course.

This has been a fantastic symposium, and I was trying to pitch an idea about possibly organizing a summit with all the panelists, all the presenters, and everyone else who wants to join us, because I think that with a coffee or a tea in our hands, talking not through a Zoom video, we could actually come up with some great ideas and some collaboration projects.

Anyone who wants to email us, we'll be happy to answer. We're always open to collaborating, starting a new study, and bouncing new ideas off each other. This is what we do for a living, so we're very enthusiastic about people asking difficult questions.

Some of the ongoing questions, and I think future areas, are what we were talking about a few minutes ago. We don't know whether a placebo responder in a migraine study, for example, would be a placebo responder in a depression study or an IBS study. We don't know whether that person would be a universal placebo responder, or whether the context, including the type of disease they're suffering from, makes it fairly different, and why some disorders have a lower placebo response rate overall compared to others. Is it chronicity: does a relapsing, remitting disorder have a higher chance of placebo response because the system can be modulated, versus a disorder that is considered more chronic and stable? A lot of this information about natural history is not known.

The exact trial entry criteria also come to mind, because we almost never have a threshold for the number of prior episodes of depression to enter a trial, or for how chronic it has been, or years of depression, or other factors that can clearly change the probability of responding to a treatment.

We heard about methodology for clinical trial design and how patients could be responsive to placebo or sham, or responsive to drug. How about patients who could respond to both? We have no idea how many of those patients, universal responders, are undergoing a trial, unless we do a crossover. And we know that crossover is not a popular design for drug trials.
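A small sketch, on simulated data, of why a crossover design can expose "universal responders": each patient is observed under both placebo and drug, so within-person response patterns can be cross-tabulated directly. The response threshold and effect sizes are illustrative assumptions.

    # Why a crossover exposes "universal responders": each patient is observed
    # under both conditions, so within-person patterns can be classified.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    n = 200
    latent_responsiveness = rng.normal(0, 1, n)      # hypothetical general responsiveness

    placebo_response = 5 + 2 * latent_responsiveness + rng.normal(0, 2, n)
    drug_response = 8 + 2 * latent_responsiveness + rng.normal(0, 2, n)

    df = pd.DataFrame({
        "responds_placebo": placebo_response > 7,    # arbitrary response threshold
        "responds_drug": drug_response > 7,
    })
    # In a parallel-group trial each patient contributes only one of these columns;
    # the crossover lets us count the overlap (universal responders) directly.
    print(pd.crosstab(df["responds_placebo"], df["responds_drug"]))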

So we also need to figure out aspects of methodology: how to assess outcome, what the best way is to assess the outcome we want, whether it is clinically relevant, how to protect the blind, and how to assess expectations and how they change over time.

We didn't hear much during the discussion about the role of mindfulness in pain management, and I would like to hear much more about how we're doing in identifying the relevant brain areas and whether we can actually intervene on those areas with devices to help with pain management. That's one of the biggest problems we have in terms of clinical care.

On the eating disorder side, there is creating computational models to influence food choices and, again, using devices or treatments specifically to shift the balance toward healthier food choices; I can see an entire field developing there. Most of the medications we prescribe for psychiatric disorders affect food choices, and there's weight gain, potentially leading to obesity and cardiovascular complications. So there's an entire field of research we have not touched on.

And there is the role of animal models in translational research. I don't know whether animal researchers, like Greg, talk much with clinical trialists. I think that would be a much needed cross-fertilization, and we can definitely learn from each other.

It has been just fantastic. I thank all the panelists for their willingness to work with us, and for their time, dedication, and the many meetings to agree on the program and to divide and conquer the different topics. It has been a phenomenal experience, and I'm very, very grateful.

The NIMH staff have also been amazing to collaborate with, and they were so organized. Just a fantastic panel. Thank you, everybody.

MATTHEW RUDORFER: Thank you.

TOR WAGER: Thank you.

NIMH TEAM: Thanks from the NIMH team to all of our participants here.

(Meeting adjourned)
