How to control confounding effects by statistical analysis

Mohamad Amin Pourhoseingholi

1 Department of Biostatistics, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Ahmad Reza Baghestani

2 Department of Mathematics, Islamic Azad University - South Tehran Branch, Iran

Mohsen Vahedi

3 Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran

A confounder is a variable whose presence affects the variables being studied so that the results do not reflect the actual relationship. There are various ways to exclude or control confounding variables, including randomization, restriction and matching, but all of these methods are applicable only at the design stage of a study. When experimental designs are premature, impractical, or impossible, researchers must rely on statistical methods to adjust for potentially confounding effects. Statistical models (especially regression models) offer a flexible way to eliminate the effects of confounders.

Introduction

Confounding variables, or confounders, are often defined as variables that correlate (positively or negatively) with both the dependent variable and the independent variable ( 1 ). A confounder is an extraneous variable whose presence affects the variables being studied so that the results do not reflect the actual relationship between the variables under study.

The aim of major epidemiological studies is to search for the causes of diseases, based on associations with various risk factors. There may also be other factors that are associated with the exposure and affect the risk of developing the disease, and these will distort the observed association between the disease and the exposure under study. A hypothetical example would be a study of the relationship between coffee drinking and lung cancer. If people who entered the study as coffee drinkers were also more likely to be cigarette smokers, and the study measured coffee drinking but not smoking, the results might seem to show that coffee drinking increases the risk of lung cancer, which may not be true. However, if a confounding factor (in this example, smoking) is recognized, adjustments can be made in the study design or the data analysis so that the effects of the confounder are removed from the final results. Simpson's paradox is another classic example of confounding ( 2 ): it refers to the reversal of the direction of an association when data from several groups are combined to form a single group.

Researchers therefore need to account for these variables, either through experimental design before data gathering or through statistical analysis after the data gathering process. In this case the researchers are said to control for their effects to avoid a false positive (Type I) error (a false conclusion that the dependent variable is in a causal relationship with the independent variable). Thus, confounding is a major threat to the validity of inferences made about cause and effect (internal validity). There are various ways to modify a study design to actively exclude or control confounding variables ( 3 ), including randomization, restriction and matching.

Randomization is the random assignment of study subjects to exposure categories, which breaks any links between exposure and confounders. This reduces the potential for confounding by generating groups that are fairly comparable with respect to known and unknown confounding variables.

Restriction eliminates variation in the confounder (for example, if an investigator only selects subjects of the same age or the same sex, the study eliminates confounding by age or sex). Matching involves selecting a comparison group that is similar to the index group with respect to the distribution of one or more potential confounders.

Matching is commonly used in case-control studies (for example, if age and sex are the matching variables, then a 45-year-old male case is matched to a male control of the same age).

But all these methods mentioned above are applicable at the time of study design and before the process of data gathering. When experimental designs are premature, impractical, or impossible, researchers must rely on statistical methods to adjust for potentially confounding effects ( 4 ).

Statistical Analysis to eliminate confounding effects

Unlike selection or information bias, confounding is one type of bias that can be adjusted for after data gathering, using statistical models. To control for confounding in the analysis, investigators must have measured the confounders in the study; researchers usually do this by collecting data on all known, previously identified confounders. There are essentially two options for dealing with confounders at the analysis stage: stratification and multivariate methods.

1. Stratification

The objective of stratification is to fix the level of the confounders and produce groups within which the confounder does not vary, and then to evaluate the exposure-outcome association within each stratum of the confounder. Within each stratum the confounder cannot confound, because it does not vary across the exposure-outcome relationship.

After stratification, the Mantel-Haenszel (M-H) estimator can be employed to provide an adjusted result across the strata. If the crude result differs from the adjusted result (produced from the strata), confounding is likely; if the crude result does not differ from the adjusted result, confounding is unlikely.
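
To make the computation concrete, here is a minimal Python sketch of the Mantel-Haenszel pooled odds ratio; the two strata and their counts are hypothetical and are not taken from any study discussed here.

```python
# Minimal sketch of the Mantel-Haenszel pooled odds ratio (hypothetical counts).
# Each stratum is ((a, b), (c, d)): a = exposed cases, b = exposed non-cases,
# c = unexposed cases, d = unexposed non-cases.

def mantel_haenszel_or(strata):
    """OR_MH = sum(a*d/n) / sum(b*c/n), summed over the strata."""
    numerator = sum(a * d / (a + b + c + d) for (a, b), (c, d) in strata)
    denominator = sum(b * c / (a + b + c + d) for (a, b), (c, d) in strata)
    return numerator / denominator

strata = [((12, 88), (30, 170)),   # stratum 1 (hypothetical)
          ((40, 60), (25, 75))]    # stratum 2 (hypothetical)
print(round(mantel_haenszel_or(strata), 2))  # pooled OR adjusted for the stratifying variable
```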

2. Multivariate Models

Stratified analysis works best when there are few strata and only one or two confounders have to be controlled. If the number of potential confounders, or the number of levels at which they are grouped, is large, multivariate analysis offers the only solution.

Multivariate models can handle large numbers of covariates (and also confounders) simultaneously. For example, in a study that aimed to measure the relation between body mass index and dyspepsia, one could control for other covariates such as age, sex, smoking, alcohol and ethnicity in the same model.

2.1. Logistic Regression

Logistic regression is a mathematical model that produces results that can be interpreted as odds ratios, and it is easy to fit with any statistical package. What makes logistic regression particularly useful is that it can control for numerous confounders simultaneously (given a large enough sample size). Thus logistic regression can give an odds ratio that is controlled for multiple confounders; this odds ratio is known as the adjusted odds ratio, because its value has been adjusted for the other covariates (including confounders).
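
As a sketch of how this looks in practice, the snippet below fits a logistic regression with statsmodels on simulated data and exponentiates the exposure coefficient to obtain an adjusted odds ratio; the variable names (outcome, exposure, age, sex, smoking) and the effect sizes are hypothetical.

```python
# Hedged sketch: adjusted odds ratio from a logistic regression on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "exposure": rng.integers(0, 2, n),
    "age": rng.normal(50, 10, n),
    "sex": rng.integers(0, 2, n),
    "smoking": rng.integers(0, 2, n),
})
# Simulate an outcome that depends on the exposure and some covariates.
linpred = -2 + 0.5 * df["exposure"] + 0.03 * df["age"] + 0.8 * df["smoking"]
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

model = smf.logit("outcome ~ exposure + age + sex + smoking", data=df).fit(disp=0)
adjusted_or = np.exp(model.params["exposure"])           # OR adjusted for the other covariates
ci_low, ci_high = np.exp(model.conf_int().loc["exposure"])
print(f"adjusted OR = {adjusted_or:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```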

2.2. Linear Regression

Linear regression is another statistical model that can be used to examine the association between multiple covariates and a numeric outcome. It can be employed as a multiple linear regression to see through confounding and isolate the relationship of interest ( 5 ). For example, in research seeking the relationship between LDL cholesterol level and age, multiple linear regression lets you answer the question: how does LDL level vary with age, after accounting for blood sugar and lipids (as confounding factors)? In multiple linear regression (as with logistic regression), investigators can include many covariates at one time. The process of accounting for covariates is also called adjustment, and comparing the results of simple and multiple linear regressions clarifies how much the confounders in the model distort the relationship between the exposure and the outcome.
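
The sketch below illustrates this comparison on simulated data; the variable names (ldl, age, blood_sugar, triglycerides) and the effect sizes are hypothetical rather than taken from a real study.

```python
# Hedged sketch: crude vs. adjusted coefficient in linear regression (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
age = rng.normal(45, 12, n)
blood_sugar = 80 + 0.6 * age + rng.normal(0, 10, n)            # correlated with age
triglycerides = 100 + 0.8 * blood_sugar + rng.normal(0, 20, n)
ldl = 60 + 0.4 * age + 0.3 * blood_sugar + rng.normal(0, 15, n)
df = pd.DataFrame({"ldl": ldl, "age": age,
                   "blood_sugar": blood_sugar, "triglycerides": triglycerides})

crude = smf.ols("ldl ~ age", data=df).fit()
adjusted = smf.ols("ldl ~ age + blood_sugar + triglycerides", data=df).fit()
print("crude age coefficient:   ", round(crude.params["age"], 2))     # inflated by blood sugar
print("adjusted age coefficient:", round(adjusted.params["age"], 2))  # closer to the simulated 0.4
```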

2.3. Analysis of Covariance

The Analysis of Covariance (ANCOVA) is a type of Analysis of Variance (ANOVA) that is used to control for potential confounding variables. ANCOVA is a statistical linear model with a continuous outcome variable (quantitative, scaled) and two or more predictor variables where at least one is continuous (quantitative, scaled) and at least one is categorical (nominal, non-scaled). ANCOVA is a combination of ANOVA and linear regression. ANCOVA tests whether certain factors have an effect on the outcome variable after removing the variance for which quantitative covariates (confounders) account. The inclusion of this analysis can increase the statistical power.
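
A minimal statsmodels sketch of an ANCOVA is shown below: a linear model with one categorical factor and one continuous covariate, followed by an ANOVA table that tests the factor after the covariate's variance has been removed. The group labels, the "baseline" covariate and the simulated effect are hypothetical.

```python
# Hedged ANCOVA sketch: categorical factor plus continuous covariate (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
group = np.repeat(["control", "treatment"], 40)
baseline = rng.normal(100, 15, 80)                # continuous covariate (potential confounder)
outcome = 0.5 * baseline + np.where(group == "treatment", 5.0, 0.0) + rng.normal(0, 8, 80)
df = pd.DataFrame({"group": group, "baseline": baseline, "outcome": outcome})

ancova = smf.ols("outcome ~ C(group) + baseline", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))   # group effect tested after adjusting for baseline
```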

Practical example

Suppose that, in a cross-sectional study, we are examining the relation between infection with Helicobacter pylori (HP) and dyspepsia symptoms. The study was conducted on 550 persons positive for HP and 440 persons without HP. The results appear in a 2×2 crude table (Table 1), which indicates an inverse association between HP infection and dyspepsia (OR = 0.60, 95% CI: 0.42-0.94). Now suppose that weight may be a potential confounder in this study. We therefore break the crude table down into two strata according to the subjects' weight (normal weight or overweight) and calculate the OR for each stratum. If the stratum-specific ORs were similar to the crude OR, there would be no evidence of confounding; in this example, however, the ORs differ across strata (normal weight group OR = 0.80, 95% CI: 0.38-1.69; overweight group OR = 1.60, 95% CI: 0.79-3.27).

Table 1. The crude contingency table of the association between H. pylori and dyspepsia

                        Dyspepsia (positive)   Dyspepsia (negative)
H. pylori positive              50                    500
H. pylori negative              60                    380

Table 2. The contingency table of the association between H. pylori and dyspepsia for persons in the normal weight group

                        Dyspepsia (positive)   Dyspepsia (negative)
H. pylori positive              10                     50
H. pylori negative              50                    200

Table 3. The contingency table of the association between H. pylori and dyspepsia for persons in the overweight group

                        Dyspepsia (positive)   Dyspepsia (negative)
H. pylori positive              40                    450
H. pylori negative              10                    180

This shows that there is a potential confounding effect, represented here by weight. This example is a type of Simpson's paradox, so the crude OR is not justified for this study. We calculated the Mantel-Haenszel (M-H) estimator as an alternative statistical analysis to remove the confounding effect (OR = 1.16, 95% CI: 0.71-1.90). A logistic regression model (in which weight is included in the multiple model) can also be fitted to control for the confounder; its result is similar to the M-H estimator (OR = 1.15, 95% CI: 0.71-1.89).
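
The snippet below reproduces these calculations in Python from the table counts (computed directly from the counts, the crude OR is about 0.63, close to the 0.60 quoted above); the Mantel-Haenszel estimate uses statsmodels' StratifiedTable.

```python
# Sketch reproducing the worked example from the table counts.
# Rows: H. pylori positive / negative; columns: dyspepsia positive / negative.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

crude = np.array([[50, 500], [60, 380]])
normal_weight = np.array([[10, 50], [50, 200]])
overweight = np.array([[40, 450], [10, 180]])

def odds_ratio(t):
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

print("crude OR:         ", round(odds_ratio(crude), 2))          # ~0.63
print("normal-weight OR: ", round(odds_ratio(normal_weight), 2))  # 0.80
print("overweight OR:    ", round(odds_ratio(overweight), 2))     # 1.60

mh = StratifiedTable([normal_weight, overweight])
print("Mantel-Haenszel OR:", round(mh.oddsratio_pooled, 2))       # ~1.16, as reported
```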

The results of this example clearly indicate that if the impact of confounders is not accounted for in the analysis, the results can deceive researchers with unjustified conclusions.

Confounders are common causes of both treatment/exposure and of response/outcome. Confounding is better taken care of by randomization at the design stage of the research ( 6 ).

A successful randomization minimizes confounding by unmeasured as well as measured factors, whereas statistical control addresses only confounding by measured factors and can itself introduce confounding through inappropriate control ( 7 – 9 ).

Confounding can persist even after adjustment. In many studies, confounders are not adjusted for because they were not measured during the process of data gathering. In some situations, confounder variables are measured with error or their categories are improperly defined (for example, age categories that are too broad to reflect its confounding nature) ( 10 ). There is also a possibility that variables controlled as confounders were actually not confounders.

Before applying a statistical correction method, one has to decide which factors are confounders, which is sometimes a complex issue ( 11 – 13 ). Common strategies for deciding whether a variable is a confounder that should be adjusted for rely mostly on statistical criteria, but the research strategy should be based on knowledge of the field and on a conceptual framework and causal model, so expert judgment should be involved in evaluating confounders. Statistical models (especially regression models) are a flexible way of investigating the separate or joint effects of several risk factors for disease or ill health ( 14 ). But researchers should note that wrong assumptions about the form of the relationship between the confounder and the disease can lead to wrong conclusions about exposure effects as well.

( Please cite as: Pourhoseingholi MA, Baghestani AR, Vahedi M. How to control confounding effects by statistical analysis. Gastroenterol Hepatol Bed Bench 2012;5(2):79-83.)

Confounding Variables in Psychology: Definition & Examples

Julia Simkus

A confounding variable is an unmeasured third variable that influences, or “confounds,” the relationship between an independent and a dependent variable by suggesting the presence of a spurious correlation.

Confounding Variables in Research

Due to the presence of confounding variables in research, we should never assume that a correlation between two variables implies causation.

When an extraneous variable has not been properly controlled and interferes with the dependent variable (i.e., results), it is called a confounding variable.

Confounding Variable

For example, there may be an association between an independent variable (IV) and a dependent variable (DV) that exists only because both variables are affected by a third variable (C); the apparent association between the IV and DV is then spurious.

Variable C would be considered the confounding variable in this example. We would say that the IV and DV are confounded by C whenever C causally influences both the IV and the DV.

In order to accurately estimate the effect of the IV on the DV, the researcher must reduce the effects of C.

If you identify a causal relationship between the independent variable and the dependent variable, that relationship might not actually exist because it could be affected by the presence of a confounding variable.

Even if the cause-and-effect relationship does exist, the confounding variable can still lead you to overestimate or underestimate the impact of the independent variable on the dependent variable.

Reducing Confounding Variables

It is important to identify all possible confounding variables and consider their impact on your research design in order to ensure the internal validity of your results.

Here are some techniques to reduce the effects of these confounding variables:
  • Random allocation : randomization will help eliminate the impact of confounding variables. You can randomly assign half of your subjects to a treatment group and the other half to a control group. This will ensure that confounders have the same effect on both groups, so they cannot correlate with your independent variable.
  • Control variables : This involves restricting the treatment group only to include subjects with the same potential for confounding factors. For example, you can restrict your subject pool by age, sex, demographic, level of education, or weight (etc.) to ensure that these variables are the same among all subjects and thus cannot confound the cause-and-effect relationship at hand.
  • Within-subjects design : In a within-subjects design, all participants participate in every condition.
  • Case-control studies : Case-control studies assign confounders to both groups (the experimental group and the control group) equally.

Suppose we wanted to measure the effects of caloric intake (IV) on weight (DV). We would have to try to ensure that confounding variables did not affect the results. These variables could include the following:

  • Metabolic rate : If you have a faster metabolism, you tend to burn calories more quickly.
  • Age : Age can affect weight gain differently, as younger individuals tend to burn calories quicker than older individuals.
  • Physical Activity : Those who exercise or are more active will burn more calories and could weigh less, even if they consume more.
  • Height : Taller individuals tend to need to consume more calories in order to gain weight.
  • Sex : Men and women have different caloric needs to maintain a certain weight.

Frequently asked questions

1. What is a confounding variable in psychology?

A confounding variable in psychology is an extraneous factor that interferes with the relationship between an experiment’s independent and dependent variables . It’s not the variable of interest but can influence the outcome, leading to inaccurate conclusions about the relationship being studied.

For instance, if studying the impact of studying time on test scores, a confounding variable might be a student’s inherent aptitude or previous knowledge.

2. What is the difference between an extraneous variable and a confounding variable?

A confounding variable is a type of extraneous variable . Confounding variables affect both the independent and dependent variables. They influence the dependent variable directly and either correlate with or causally affect the independent variable.

An extraneous variable is any variable that you are not investigating that can influence the dependent variable.

3. What is Confounding Bias?

Confounding bias is a bias that is the result of having confounding variables in your study design. If the observed association overestimates the effect of the independent variable on the dependent variable, this is known as a positive confounding bias.

If the observed association underestimates the effect of the independent variable on the dependent variable, this is known as a negative confounding bias.

Mastering the Control of Confounding Variables in Psychology Experiments

Confounding variables are often the hidden culprits that can skew the results of psychology experiments, leading to inaccurate conclusions. In order to ensure the validity and reliability of research findings, it is crucial to understand and control for these variables.

From participant variables to environmental factors, there are various types of confounding variables that can impact the outcome of an experiment. By implementing strategies such as random assignment and experimental design, researchers can effectively minimize the influence of these variables.

In this article, we will explore the importance of controlling confounding variables in psychology experiments, the types of confounding variables, strategies for controlling them, examples of their implementation, and common mistakes to avoid. Join us on this journey to mastering the control of confounding variables for more accurate and insightful results in psychological research.

  • Controlling confounding variables is crucial in psychology experiments to ensure accurate and reliable results.
  • Confounding variables can be participant, environmental, or task-related and can significantly impact the outcome of an experiment.
  • Strategies such as random assignment, matching, counterbalancing, and experimental design can effectively control confounding variables in experiments.

The Importance of Controlling Confounding Variables in Psychology Experiments

Controlling confounding variables in psychology experiments is crucial to ensure the accuracy and validity of research findings.

When conducting a study, researchers must identify and account for all possible influences that could impact the results besides the independent variable. Failure to account for these extraneous variables can lead to skewed results and inaccurate conclusions. For example, if studying the effect of a new teaching method on student performance, failing to control for factors like prior knowledge, motivation, or socioeconomic background could introduce bias. This can ultimately undermine the reliability and generalizability of the findings.

To address this, researchers employ various strategies such as randomization, matching, or statistical techniques like regression analysis to control for confounding variables. By carefully controlling these factors, researchers can isolate the true effect of the independent variable, leading to more robust and trustworthy results.

What Are Confounding Variables?

Confounding variables are extraneous factors that can distort the true relationship between the independent and dependent variables in a study.

For example, let’s consider a study analyzing the effects of a new drug on patients’ health outcomes. If age is not controlled for as a confounding variable, the observed differences in health outcomes between patients taking the drug and those not taking it may actually be due to age differences rather than the drug itself.

Identifying confounding variables is crucial to ensure the validity of research findings. Researchers can use various methods such as stratification, matching, or regression analysis to control for confounders and isolate the true effects of the independent variable.

Types of Confounding Variables

Confounding variables can be categorized into participant variables, environmental variables, and task variables, each posing unique challenges to establishing causal relationships.

Participant variables encompass characteristics such as age, gender, or education level, all of which can introduce biases affecting study outcomes. Environmental variables, which include factors like noise levels or lighting conditions, may also impact the results by influencing participant behavior. Task variables, on the other hand, refer to the specific methods and instructions given to participants during the research process, potentially leading to misinterpretations if not carefully controlled.

Controlling confounding variables is crucial to ensure the validity and reliability of research findings.

Participant Variables

Participant variables refer to individual characteristics that can introduce bias into study results, requiring researchers to carefully identify and control for these confounding factors.

Ensuring the accuracy and reliability of research findings heavily relies on minimizing the impact of participant variables.

One effective approach is random allocation of participants into experimental and control groups, which helps spread these variables evenly.

Another method involves using matching techniques, where participants are paired based on specific criteria to balance out potential biases.

Researchers also employ diverse statistical analyses to adjust for these variables during data interpretation, such as ANCOVA or multiple regression models.

By diligently addressing participant variables, researchers can enhance the validity and generalizability of their study findings.

Environmental Variables

Environmental variables encompass external factors that can impact study outcomes, necessitating researchers to control and account for these variables to maintain research validity.

These variables can include elements such as temperature, humidity, lighting, and noise levels, among others, which have the potential to introduce bias or confound study results. Researchers often use experimental design strategies to manipulate and control these variables, ensuring that the observed effects can be attributed to the intended interventions. By conducting pilot studies and implementing randomization techniques, researchers can minimize the influence of extraneous variables, enhancing the internal validity of their research outcomes.

Task Variables

Task variables involve the specific conditions or manipulations within an experiment that can affect the outcomes and findings of the study, emphasizing the need for controlled treatment and analysis of study variables.

Examining the influence of these task variables in a systematic manner is crucial for drawing accurate conclusions and generalizing results. Researchers often design experiments with careful consideration of factors such as timing, order of tasks, and difficulty level, to ensure that the observed effects are truly due to the intended manipulation. By manipulating these variables strategically, scientists can isolate the impact of each factor and reduce any confounding effects that may obscure the true relationship between variables.

How Do Confounding Variables Affect the Results of an Experiment?

Confounding variables can distort study results by introducing unintended influences that mask or create false associations between variables, impacting the validity and reliability of research studies.

In essence, confounding variables are extraneous factors that can interfere with the ability to draw accurate conclusions from research findings. For example, imagine a study exploring the relationship between coffee consumption and heart health. If the participants’ ages are not controlled for, the correlation may be misleading, as older individuals are more likely to have heart issues regardless of their coffee intake. It is crucial for researchers to identify and control for such variables to ensure the integrity of their results and the credibility of their conclusions.

Strategies for Controlling Confounding Variables

Effective control of confounding variables requires the implementation of strategic methods such as randomization, matching, counterbalancing, statistical control, and robust experimental design.

One widely used strategy in research to combat confounding variables is randomization, which involves assigning participants or subjects to different groups or conditions randomly. This helps ensure that any potential confounding variables are equally distributed across the groups, reducing their impact on the study outcomes. Matching is another technique where researchers pair subjects based on specific characteristics to create comparable groups.

Furthermore, counterbalancing is frequently utilized in experimental designs to address order effects, where the sequence of treatments or conditions is varied systematically among participants. Statistical control methods, such as ANCOVA (Analysis of Covariance) or using covariates , can help adjust for the influence of confounding variables statistically.

Random Assignment

Random assignment involves assigning participants to different groups or conditions randomly, reducing the potential for bias and enhancing control over variables related to the research question.

This method plays a crucial role in ensuring that any differences among the groups are due to the experimental manipulation rather than pre-existing characteristics of the participants. By assigning individuals randomly, researchers can minimize the impact of extraneous variables that may skew the results, thus increasing the internal validity of the study. This control mechanism is fundamental in various research designs, including experimental studies and clinical trials, where precise comparisons are necessary to draw valid conclusions.

Matching

Matching involves pairing participants in experimental and control groups based on specific characteristics to control for potential confounding variables and ensure the validity of treatment effects.

By selecting participants with similar characteristics and assigning them to respective groups, researchers aim to minimize the impact of variables that could potentially distort the treatment outcomes. This process of controlled pairing allows for a more accurate assessment of the true effectiveness of the treatment being studied. Matching helps researchers to isolate the specific impact of the treatment rather than being influenced by other external factors, thus enhancing the validity and reliability of the study results.

Counterbalancing

Counterbalancing involves varying the order of treatments or conditions across participants to mitigate the effects of confounding variables, enabling researchers to identify and reduce potential biases.

By systematically altering the sequence in which different levels of the treatments are administered, researchers can ensure that any observed effects are more likely attributable to the treatment itself rather than external factors. This technique helps in controlling for the influence of variables like participant fatigue or practice effects. Through counterbalancing, the impact of extraneous variables is minimized, enhancing the internal validity of the study. It is a crucial method to ascertain the true effects of the independent variable and improve the overall quality of research findings.
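
As a concrete sketch of one common counterbalancing scheme, the snippet below builds a cyclic Latin square of condition orders and rotates participants through them. The condition labels and participant IDs are placeholders, and a fully carryover-balanced Latin square would need a slightly different construction.

```python
# Sketch: counterbalancing condition order with a cyclic Latin square.
def latin_square_orders(conditions):
    """Each condition appears once in every ordinal position across the set of orders."""
    n = len(conditions)
    return [[conditions[(start + i) % n] for i in range(n)] for start in range(n)]

orders = latin_square_orders(["A", "B", "C", "D"])   # hypothetical conditions
participants = [f"P{i + 1}" for i in range(12)]      # hypothetical participants
# Rotate participants through the orders so each order is used equally often.
assignments = {p: orders[i % len(orders)] for i, p in enumerate(participants)}
for participant, order in assignments.items():
    print(participant, order)
```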

Statistical Control

Statistical control involves using statistical techniques to account for the influence of potential confounding variables, allowing researchers to isolate the effects of variables under study and enhance the validity of research findings.

By establishing statistical control, researchers aim to reduce the risk of drawing incorrect conclusions or attributing effects to variables erroneously. This practice involves meticulously managing variables that could impact the outcome of a study, ensuring that only the factors of interest are influencing the results. Through the careful application of statistical methods, such as regression analysis or analysis of variance, researchers can quantify the contribution of each variable and discern their individual impacts on the outcome.

Experimental Design

Robust experimental design plays a critical role in controlling potential confounding variables, enabling researchers to establish causal relationships and draw valid conclusions from their studies.

By carefully structuring the researcher interventions and study conditions, a well-planned experimental design minimizes the risk of unintended influences that could skew results. Through systematic allocation of participants into treatment and control groups, researchers can manage confounding variables effectively, ensuring that the observed outcomes are indeed attributable to the interventions being studied. This meticulous approach enhances the internal validity of the study, bolstering the confidence researchers have in the accuracy and reliability of their findings.

Examples of Controlling Confounding Variables in Psychology Experiments

Illustrative examples of controlling confounding variables in psychology experiments showcase the application of various methods and strategies to enhance the validity and reliability of study results.

In a classic experimental psychology study examining the effects of music on mood, researchers might control for confounding variables by ensuring that all participants are exposed to the same type and duration of music. This eliminates the potential influence of different musical genres or lengths on the participants’ mood responses, allowing the researchers to attribute any observed changes solely to the music manipulation. By closely monitoring and standardizing such variables, the study’s internal validity is strengthened, leading to more accurate conclusions about the impact of music on mood.

Common Mistakes in Controlling Confounding Variables

Avoiding common mistakes in controlling confounding variables is essential to prevent inaccuracies in research findings and ensure the integrity of study outcomes.

One prevalent error researchers often make is failing to identify and account for all potential confounding variables that could skew the results of their study. This oversight can lead to invalid conclusions and hinder the reliability of the research outcomes.

Improper handling of confounding variables may introduce bias, leading to misleading interpretations of the data and potentially impacting the generalizability of the findings.

To address these issues, researchers should prioritize thorough planning, meticulous data collection, and proper statistical analysis techniques to effectively control for confounding variables.

Conclusion: Mastering the Control of Confounding Variables for Accurate Results

Mastering the control of confounding variables is paramount to ensuring the accuracy and reliability of research results in psychology experiments.

Confounding variables, if left unchecked, can distort the true relationship between the independent and dependent variables, leading to erroneous conclusions. Therefore, by meticulously identifying, measuring, and controlling these variables, researchers can enhance the internal validity of their studies. This meticulous control ensures that any observed effects can be confidently attributed to the variables under investigation, rather than external factors.

Researchers must also implement robust study designs and statistical analyses to mitigate the impact of confounding variables. By employing randomization, blinding, and stratification techniques, they can reduce the influence of extraneous variables and produce more accurate and generalizable results. Conducting sensitivity analyses and controlling for potential confounders in regression models can further enhance the validity and reliability of research outcomes.

By acknowledging the significance of controlling confounding variables, researchers can strengthen the overall quality of their research, increase the trustworthiness of their findings, and contribute valuable insights to the field of psychology.

Frequently Asked Questions

1. What are confounding variables in psychology experiments?

Confounding variables are factors that can influence the outcome of an experiment, but are not the main variables being studied. They can lead to inaccurate or misleading results if not properly controlled.

2. Why is it important to master control of confounding variables?

Controlling for confounding variables allows researchers to confidently attribute any observed effects to the targeted variables, rather than other external factors. This increases the validity and reliability of the experiment’s results.

3. How can I identify potential confounding variables in my experiment?

One way to identify potential confounding variables is to conduct a thorough literature review and consider any factors that have been shown to impact the outcome of similar experiments. Consulting with experienced researchers and conducting pilot studies can also help identify potential confounding variables.

4. What are some techniques for controlling confounding variables?

One technique is randomization, where participants are randomly assigned to different experimental conditions. Another technique is matching, where participants are matched based on specific characteristics before being assigned to different conditions. Other techniques include counterbalancing and statistical control.

5. Can confounding variables ever be completely eliminated from an experiment?

It is difficult to completely eliminate all potential confounding variables from an experiment, but by using proper techniques and controls, their effects can be minimized. It is important to acknowledge and address any remaining confounds in the interpretation of the results.

6. How does controlling confounding variables improve the overall quality of psychological research?

By mastering control of confounding variables, researchers can establish a stronger cause-and-effect relationship between the variables being studied. This leads to more accurate and reliable results, which can contribute to the advancement of psychological knowledge and understanding.

Confounding Variable: Definition & Examples

By Jim Frost

Confounding Variable Definition

In studies examining possible causal links, a confounding variable is an unaccounted factor that impacts both the potential cause and effect and can distort the results. Recognizing and addressing these variables in your experimental design is crucial for producing valid findings. Statisticians also refer to confounding variables that cause bias as confounders, omitted variables, and lurking variables .

[Diagram: how confounding works.]

A confounding variable systematically influences both an independent and dependent variable in a manner that changes the apparent relationship between them. Failing to account for a confounding variable can bias your results, leading to erroneous interpretations. This bias can produce the following problems:

  • Overestimate the strength of an effect.
  • Underestimate the strength of an effect.
  • Change the direction of an effect.
  • Mask an effect that actually exists.
  • Create Spurious Correlations .

Additionally, confounding variables reduce an experiment’s internal validity, thereby reducing its ability to support causal inferences about treatment effects. You don’t want any of these problems!

In this post, you’ll learn about confounding variables, the problems they cause, and how to minimize their effects. I’ll provide plenty of examples along the way!

What is a Confounding Variable?

Confounding variables bias the results when researchers don’t account for them. How can variables you don’t measure affect the results for variables that you record? At first glance, this problem might not make sense.

Confounding variables influence both the independent and dependent variable, distorting the observed relationship between them. To be a confounding variable, the following two conditions must exist:

  • It must correlate with the dependent variable.
  • It must correlate with at least one independent variable in the experiment.

The diagram below illustrates these two conditions. There must be non-zero correlations (r) on all three sides of the triangle. X1 is the independent variable of interest while Y is the dependent variable. X2 is the confounding variable.

[Diagram: the conditions under which a confounding variable produces bias.]

The correlation structure can cause confounding variables to bias the results that appear in your statistical output. In short, the amount of bias depends on the strength of these correlations: strong correlations produce greater bias, and if the relationships are weak, the bias might not be severe. If any of the correlations are zero, the extraneous variable won’t produce bias even if the researchers don’t control for it.

Leaving a confounding variable out of a regression model can produce omitted variable bias .
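
As a quick illustration of that point, the simulation below (not data from this post) generates a confounder X2 that drives both X1 and Y while X1 has no true effect; omitting X2 makes X1 look influential.

```python
# Sketch: omitted variable bias in regression (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
x2 = rng.normal(size=n)              # the confounding variable
x1 = 0.7 * x2 + rng.normal(size=n)   # X1 correlates with X2
y = 0.8 * x2 + rng.normal(size=n)    # Y depends on X2 only; X1 has no true effect

without_x2 = sm.OLS(y, sm.add_constant(x1)).fit()
with_x2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("X1 coefficient, X2 omitted: ", round(without_x2.params[1], 2))  # spuriously non-zero
print("X1 coefficient, X2 included:", round(with_x2.params[1], 2))     # near the true value of 0
```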

Confounding Variable Examples

Exercise and Weight Loss

In a study examining the relationship between regular exercise and weight loss, diet is a confounding variable. People who exercise are likely to have other healthy habits that affect weight loss, such as diet. Without controlling for dietary habits, it’s unclear whether weight loss is due to exercise, changes in diet, or both.

Education and Income Level

When researching the correlation between the level of education and income, geographic location can be a confounding variable. Different regions may have varying economic opportunities, influencing income levels irrespective of education. Without controlling for location, you can’t be sure if education or location is driving income.

Exercise and Bone Density

I used to work in a biomechanics lab. For a bone density study, we measured various characteristics including the subjects’ activity levels, their weights, and bone densities among many others. Bone growth theories suggest that a positive correlation between activity level and bone density likely exists. Higher activity should produce greater bone density.

Early in the study, I wanted to validate our initial data quickly by using simple regression analysis to assess the relationship between activity and bone density. There should be a positive relationship. To my great surprise, there was no relationship at all!

Long story short, a confounding variable was hiding a significant positive correlation between activity and bone density. The offending variable was the subjects’ weights because it correlates with both the independent (activity) and dependent variable (bone density), thus allowing it to bias the results.

After including weight in the regression model, the results indicated that both activity and weight are statistically significant and positively correlate with bone density. Accounting for the confounding variable revealed the true relationship!

The diagram below shows the signs of the correlations between the variables. In the next section, I’ll explain how the confounder (Weight) hid the true relationship.

[Diagram of the bone density model.]

Related post : Identifying Independent and Dependent Variables

How the Confounder Hid the Relationship

The diagram for the Activity and Bone Density study indicates the conditions exist for the confounding variable (Weight) to bias the results because all three sides of the triangle have non-zero correlations. Let’s find out how leaving the confounding variable of weight out of the model masked the relationship between activity and bone density.

The correlation structure produces two opposing effects of activity. More active subjects get a bone density boost directly. However, they also tend to weigh less, which reduces bone density.

When I fit a regression model with only activity, the model had to attribute both opposing effects to activity alone. Hence, the zero correlation. However, when I fit the model with both activity and weight, it could assign the opposing effects to each variable separately.
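
To make the two opposing paths concrete, here is a small simulation of the same pattern; it is not the lab's actual data, and the coefficients are chosen only so that the direct and indirect effects of activity roughly cancel.

```python
# Sketch: a confounder masking a real effect (simulated, not the study's data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
activity = rng.normal(size=n)
weight = -0.8 * activity + rng.normal(size=n)                      # more active subjects weigh less
bone_density = 0.4 * activity + 0.5 * weight + rng.normal(size=n)  # both paths affect bone density

simple = sm.OLS(bone_density, sm.add_constant(activity)).fit()
multiple = sm.OLS(bone_density, sm.add_constant(np.column_stack([activity, weight]))).fit()
print("activity alone:      ", round(simple.params[1], 2))    # ~0: the two paths cancel
print("activity with weight:", round(multiple.params[1], 2))  # recovers ~0.4
```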

Now imagine if we didn’t have the weight data. We wouldn’t have discovered the positive correlation between activity and bone density. Hence, the example shows the importance of controlling confounding variables. Which leads to the next section!

Reducing the Effect of Confounding Variables

As you saw above, accounting for the influence of confounding variables is essential to ensure your findings’ validity . Here are four methods to reduce their effects.

Restriction

Restriction involves limiting the study population to a specific group or criteria to eliminate confounding variables.

For example, in a study on the effects of caffeine on heart rate, researchers might restrict participants to non-smokers. This restriction eliminates smoking as a confounder that can influence heart rate.

Matched Pairs

This process involves pairing subjects by matching characteristics pertinent to the study. Then, researchers randomly assign one individual from each pair to the control group and the other to the experimental group. This randomness helps eliminate bias, ensuring a balanced and fair comparison between groups. This process controls confounding variables by equalizing them between groups. The goal is to create groups as similar as possible except for the experimental treatment.

For example, in a study examining the impact of a new education method on student performance, researchers match students on age, socioeconomic status, and baseline academic performance to control these potential confounders.
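
A bare-bones sketch of the idea: subjects are paired on a single matching score and one member of each pair is randomly assigned to treatment. The subject data and the "baseline_score" variable are invented, and a real matched design would usually match on several characteristics at once.

```python
# Sketch: matched-pairs assignment on a single (hypothetical) matching variable.
import random

random.seed(0)
subjects = [{"id": i, "baseline_score": random.gauss(70, 10)} for i in range(20)]

# Sort by the matching variable and treat adjacent subjects as a matched pair.
subjects.sort(key=lambda s: s["baseline_score"])
pairs = [subjects[i:i + 2] for i in range(0, len(subjects), 2)]

assignments = {}
for a, b in pairs:
    treated, control = random.sample([a, b], 2)   # random assignment within the pair
    assignments[treated["id"]] = "treatment"
    assignments[control["id"]] = "control"
print(assignments)
```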

Learn more about Matched Pairs Design: Use & Examples .

Random Assignment

Randomly assigning subjects to the control and treatment groups helps ensure that the groups are statistically similar, minimizing the influence of confounding variables.

For example, in clinical trials for a new medication, participants are randomly assigned to either the treatment or control group. This random assignment helps evenly distribute variables such as age, gender, and health status across both groups.
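
A minimal sketch of simple randomization follows; the participant IDs are placeholders, and real trials typically use block or stratified randomization rather than a single shuffle.

```python
# Sketch: simple random assignment of participants to two arms.
import random

random.seed(1)
participants = [f"P{i:03d}" for i in range(1, 41)]            # hypothetical participant IDs
shuffled = random.sample(participants, k=len(participants))   # random permutation
half = len(shuffled) // 2
groups = {pid: ("treatment" if i < half else "control") for i, pid in enumerate(shuffled)}
print(sum(g == "treatment" for g in groups.values()), "treatment |",
      sum(g == "control" for g in groups.values()), "control")
```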

Learn more about Random Assignment in Experiments .

Statistical Control

Statistical control involves using analytical techniques to adjust for the effect of confounding variables in the analysis phase. Researchers can use methods like regression analysis to control potential confounders.

For example, I showed you how I controlled for weight as a confounding variable in the bone density study. Including weight in the regression model revealed the genuine relationship between activity and bone density.

Learn more about controlling confounders by using regression analysis .

By incorporating these strategies into research design and analysis, researchers can significantly reduce the impact of confounding variables, leading to more accurate results.

If you aren’t careful, the hidden hazards of a confounding variable can completely flip the results of your experiment!

Reader Interactions

January 15, 2024 at 10:02 am

To address this potential problem, I collect all the possible variables and create a correlation matrix to identify all the correlations, their direction, and their statistical significance, before regression.

January 15, 2024 at 2:54 pm

That’s a great practice for understanding the underlying correlation structure of your data. Definitely a good thing to do along with graphing the scatterplots for all those pairs because they’re good at displaying curved relationships that might not register with Pearson’s correlation.

It’s been awhile since I worked on the bone density study, but I’m sure I created that correlation & scatterplot matrix to get the lay of the land.

A couple of caveats:

Those correlations are pairwise relationships, equivalent to one predictor for a response (but without the directionality). So, those correlations can be affected by a confounding variable just like a simple regression model. Going back to the example in my post, if I did a pairwise correlation between all variables, including activity and bone density, that would’ve still been essentially zero–affected by the weight confounder in the same way as the regression model. At least with a correlation matrix, you’d be able to piece together that weight was a confounder likely affecting the other correlation.

And a confounder can exist outside your dataset. You might not have even measured a confounder, so it won’t be in your correlation matrix, but it can still impact your results. Hence, it’s always good to consider variables that you didn’t record as well.

I’m guessing you know all that, I’m more spelling it out for other readers.

And if I remember correctly, your background is more with randomized experiments. The random assignment process should break any correlation between a confounder and the outcome, making it essentially zero. Consequently, randomized experiments tend to prevent confounding variables from affecting the results.

July 17, 2023 at 11:11 am

Hi Jim, In multivariate regression, I have always removed variables that aren't significant. However, recently a reviewer said that this approach is unjustified. Is there a consensus about this? A reference article? Thanks, Ray

July 17, 2023 at 4:52 pm

Hi Raymond,

I don’t have an article handy to refer you to. But based on what happens to models when you retain and exclude variables, I recommend the following approach.

Deciding whether to eliminate an insignificant independent variable from a regression model requires a thorough understanding of the theoretical implications related to that variable. If there’s strong theoretical justification for its inclusion, it might be advisable to keep it within the model, despite its insignificance.

Maintaining an insignificant variable in the model does not typically degrade its overall performance. On the contrary, removing a theoretically justified but insignificant variable can lead to biased outcomes for the remaining independent variables, a situation known as omitted variable bias . Therefore, it can be beneficial to retain an insignificant variable within the model.

It’s vital to consider two major aspects when making this decision. Firstly, whether there’s strong theoretical support for retaining the insignificant variable, and secondly, whether excluding it has a significant impact on the coefficient estimates of the remaining variables. In short, if you remove an insignificant variable and the other coefficients change, you need to assess the situation.

If there are no theoretical reasons to retain an insignificant variable and removing it doesn’t appear to bias the result, then you probably should remove it because it might increase the precision of your model somewhat.

Consequently, I advise “considering” the removal of insignificant independent variables from the model, instead of asserting that you “should” remove them, as this decision depends on the aforementioned factors and is not a hard-and-fast rule. Of course, when you do the write-up, explain your reasoning for including insignificant variables along with everything else.

January 16, 2023 at 5:31 pm

Thank you very much! That helped a lot.

January 15, 2023 at 9:12 am

thank you for the interesting post. I would like to ask a question because I think that I am very much stuck into a discipline mismatch. I come from economics but I am now working in the social sciences field.

You describe the conditions for confounding bias: 1) there is a correlation between x1 and x2 (the OVB), 2) x1 associates with y, 3) x2 associates with y. I interpret 1) as meaning that sometimes x1 may determine x2, or the contrary.

However, I recently read a social stats paper in which they define confounding bias differently. 2) and 3) still hold, but 1) says that x2 -> x1, not the contrary. So, the direction of the relationship cannot go the other way around; otherwise that would be mediation.

I am a bit confused and think that this could be due to the different disciplines but I would be interested in knowing what you think.

Thank you. Best, Vero

January 16, 2023 at 12:56 am

Hi Veronica,

Some of your notation looks garbled in the comment, but I think I get the gist of your question. Unfortunately, the comments section doesn’t handle formatting well!

So, X1 and X2 are explanatory variables while Y is the outcome. The two X variables correlate with each other and the Y variable. In this scenario, yes, if you exclude X2, it will cause some degree of omitted variable bias. It is a confounding variable. The degree of bias depends on the collective strength of all three correlations.

Now, as for the question of the direction of the relationship between X1 and X2, that doesn't matter statistically. As long as the correlation is there, the potential for confounding bias exists. This is true whether the relationship between X1 and X2 is causal in either direction or totally non-causal. It just depends on the set of correlations existing.

I think you’re correct in that this is a difference between disciplines.

The social sciences define a mediator variable as explaining the process by which two variables are related, which gets to your point about the direction of a causal relationship. When X1 -> X2, I'd say that the social sciences would call that a mediator variable AND that X2 is still a confounder that will cause bias if it is omitted from the model. Both things are true.

I hope that helps!


October 10, 2022 at 11:07 am

Thanks in advance for your awesome content.

Regarding this question brought by Lucy, I want to ask the following: If introducing variables reduces the bias (because the model controls for it), why don’t we just insert all variables at once to see the real impact of each variable?

Let’s say I have a dataset of 150 observations and I want to study the impact of 20 variables (dummies and continuous). Is it advantageous to introduce everything at once and see which variables are significant? I got the idea that introducing variables is always positive because it forces the model to show the real effects (of course I am talking about theoretically grounded variables), but are there any caveats to doing so? Is it possible that some variables may in fact “hide” the significance of others because they will overshadow the other regressors? Usually it is said that, if the significance changes when introducing a variable, it was due to confounding. My question now is: is it possible that confounding was not the case and, in fact, the significance is just being hidden due to the presence of a much stronger predictor?

October 10, 2022 at 8:10 pm

In some ways, you’re correct. Generally speaking, it is better to include too many variables than too few. However, there is a cost for including more variables than necessary, particularly when they’re not significant. Adding more variables than needed increases the model’s variance, which reduces statistical power and precision of the estimates. Ideally, you want a balance of all the necessary variables, no more, and no less. I write about this tradeoff in my post about selecting the best model . That should answer a lot of your questions.

I think the approach of starting with a model with all possible variables has merit. You can always start removing the ones that are not significant. Just do that by removing one at a time and start by removing the least significant. Watch for any abrupt changes in coefficient signs and p-values as you remove each one.

As for caveats, there are rules of thumb as to how many independent variables you can include in a model based on how many observations you have. If you include too many, you can run into overfitting, which can produce whacky results. Read my post about overfitting models for information about that. So, in some cases, you just won’t be able to add all the potential variables at once, but that depends on the number of variables versus the number of observations. The overfitting post describes that.

And, to answer your last question, overfitting is another case where adding variables can change the significance that’s not due to confounding.
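To make that variance/overfitting cost concrete, here is a rough sketch (entirely simulated data; the sample size of 150 simply echoes the question above) showing that plain R-squared keeps creeping up as pure-noise predictors are added, while adjusted R-squared does not reward them:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
x_real = rng.normal(size=(n, 2))                       # two genuinely useful predictors
y = x_real @ np.array([1.5, -1.0]) + rng.normal(size=n)

for k in (0, 5, 10, 20):                               # number of pure-noise predictors added
    noise = rng.normal(size=(n, k))
    X = sm.add_constant(np.hstack([x_real, noise]))
    fit = sm.OLS(y, X).fit()
    print(k, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))
```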


January 20, 2022 at 8:10 am

Thanks for the clear explanation, it was really helpful! I do have a question regarding this sentence: “The important takeaway here is that leaving out a confounding variable not only reduces the goodness-of-fit (larger residuals), but it can also bias the coefficient estimates.”

Is it always the case that leaving out a confounding variable leads to a lesser fit? I was thinking about the case of positive bias: say variables x and y are both negatively correlated with the dependent variable, but x and y are positively correlated with each other. If a high value for x is caused by a high value of y both variables ‘convey the information’ of variable y. So adding variable x to a model wouldn’t add any additional information, and thus wouldn’t improve the fit of the model.

Am I making a mistake in my reasoning somewhere? Or does leaving out a confounding variable not lead to a worse fit in this case?

Thanks again for the article! Sterre

January 20, 2022 at 2:20 pm

Think about it this way. In general, adding an IV always causes R-squared to increase to some degree–even when it’s only a chance correlation. That still applies when you add a confounding variable. However, with a confounding variable, you know it’s an appropriate variable to add.

Yes, the correlation with the IV in the model might capture some of the confounder’s explanatory power, but you can also be sure that adding it will cause the model to fit better. And, again, it’s an entirely appropriate variable to include because of its relationship with the DV (i.e., you’re not adding it just to artificially inflate R-squared/goodness-of-fit). Additionally, unless there’s a perfect correlation between the included IV and the confounder, the included IV can’t contain all the confounder’s information. But, if there was a perfect correlation, you wouldn’t be able to add both anyway.

There are cases where you might not want to include the confounder. If you’re mainly interested in making predictions and don’t need to understand the role of each IV, you might not need to include the confounder if your model makes sufficiently precise predictions. That’s particularly true if the confounder is difficult/expensive to measure.

Alternatively, if there is a very high, but not perfect correlation, between the included IV and the confounder, adding the confounder might introduce too much multicollinearity , which causes its own problems. So, you might be willing to take the tradeoff between exchanging multicollinearity issues for omitted variable bias. However, that’s a very specific weighing of pros and cons given the relative degree of severity for both problems for your specific model. So, there’s no general advice for which way to go. It’s also important to note that there are other types of regression analysis (Ridge and LASSO) that can effectively handle multicollinearity, although at the cost of introducing a slight bias. Another possibility to balance!

But, to your main question, yes, if you add the confounder, you can expect the model fit to improve to some degree. It may or may not be an improvement that’s important in a practical sense. Even if the fit isn’t notably better, it’s often worthwhile adding the confounder to address the bias.
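A quick numerical check of Sterre’s scenario (simulated numbers, not the bone density data from the article): x1 and x2 are positively correlated with each other and both negatively related to y, yet adding the confounder x2 still improves the fit on top of fixing the biased x1 coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)    # positive x1-x2 correlation
y = -1.0 * x1 - 0.8 * x2 + rng.normal(size=n)    # both lower y

without_x2 = sm.OLS(y, sm.add_constant(x1)).fit()
with_x2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(without_x2.rsquared, with_x2.rsquared)     # the fit still improves with x2
print(without_x2.params[1], with_x2.params[1])   # and the x1 coefficient was biased
```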


May 2, 2021 at 4:23 pm

Jim, this was a great article, but I do not understand the table. I am sure it is easy, and I am missing something basic. What does it mean to be included and omitted: negative correlation, etc., in the 2-by-2 table? I cannot wrap my head around the titles and corresponding scenarios. Thanks, John

May 3, 2021 at 9:39 pm

When I refer to “included” and “omitted,” I’m talking about whether the variable in question is an independent variable IN the model (included) or a potential independent variable that is NOT in the model (omitted). After all, we’re talking about omitted variable bias, which is the bias caused by leaving an important variable out of the model.

The table allows you to determine the direction the coefficient estimate is being biased if you can determine the direction of the correlation between several variables.

In the example, I’m looking at a model where Activity (the included IV) predicts the bone density of the individual (the DV). The omitted confounder is weight. So, now we just need to assess the relationships between those variables to determine the direction of the bias. I explain the process of using the table with this example in the paragraph below the table, so I won’t retype it here. But, if you don’t understand something I write there, PLEASE let me know and I’ll help clarify it!

In the example, Activity = Included, Weight = Omitted, and Dependent = Bone Density. I use the signs from the triangle diagram that appears a ways before the table, which lists these three variables, to determine the column and row to use.

Again, I’m not sure which part is tripping you up!


April 27, 2021 at 2:23 am

Thank you Jim! The two groups are both people with illness; they differ only because the illnesses occur at different ages. The first illness group is younger, around 30, and the other is older, around 45. The overlap of ages between these groups is very minimal. By control group, I meant a third group of healthy people without illness, with ages uniformly distributed across the range represented in the two patient groups, so the group factor would now have three levels. I was wondering whether this can reduce the previous problem of directly comparing the young and old patient groups, where adding age as a covariate can cause a collinearity problem.

April 28, 2021 at 10:42 pm

Ah, ok. I didn’t realize that both groups had an illness. Usually a control group won’t have a condition.

I really wouldn’t worry about the type of multicollinearity you’re referring to. You’d want to include those two groups and age plus the interaction term, which you could remove if it’s not significant. If the two groups were completely distinct in age and had a decent gap between them, there are other model estimate problems to worry about, but that doesn’t seem to be the case. If age is a factor in this study area, you definitely don’t want to exclude it. Including it allows you to control for it. Otherwise, if you leave it out, the age effect will get rolled into the groups and, thereby, bias your results. Including age is particularly important in your case because you know the groups are unbalanced in age. You don’t want the model to attribute the difference in outcomes to the illness condition when it’s actually age that is unbalanced between those two conditions. I’d go so far as to say that your model urgently needs you to include age!
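As a hedged sketch of the kind of model being described (hypothetical, simulated data and invented column names, not the actual study), the group, age, and Group x Age interaction could be specified like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 60
group = np.repeat(["A", "B"], n // 2)                       # two illness groups
age = np.where(group == "A", rng.normal(30, 3, n), rng.normal(45, 3, n))
score = 50 + 0.4 * age - 5 * (group == "B") + rng.normal(0, 3, n)

df = pd.DataFrame({"score": score, "group": group, "age": age})

# main effects for group and age plus a Group x Age interaction;
# the interaction can be dropped later if it is not significant
model = smf.ols("score ~ C(group) * age", data=df).fit()
print(model.summary())
```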

That said, I would collect a true control group that has healthy people and ideally a broad range of ages that covers both groups. That will give you several benefits. Right now, you won’t know how your illness groups compare to a healthy group. You’ll only know how they compare to each other. Having that third group will allow you to compare each illness group to the healthy group. I’m assuming that’s useful information. Plus, having a full range of ages will allow the model to produce a better estimate of the age effect.

April 26, 2021 at 6:51 am

Hi Jim, thanks a lot for your intuitive explanations!!

I want to study the effect of two Groups of patients (X1) on y (a test performance score), in a GLM framework. Age (X2) and Education (X3) are potential confounders on y.

However, it’s not possible to match these two groups for age, as the illnesses occur in different age groups - one group is younger than the other. Hence the mean ages are significantly different between these groups.

I’m afraid adding age as a covariate could potentially cause a multicollinearity problem, as age is significantly different between groups, and make the estimation of the group effect (β1) erroneous, although it might improve the model. Is recruiting a control group with an age distribution comparable to the pooled patient groups, hence with a mean age mid-way between the two patient groups, a good idea to improve the statistical power of the study? In this case my group factor X1 will have three levels. Can this reduce the multicollinearity problem to an extent, as the ages of patients in the two patient groups are approximately represented in the control group also? Should I add an interaction term of Age*Group in the GLM to account for the age difference between groups? Thank you in advance. -Mohan

April 26, 2021 at 11:13 pm

I’d at least try including age to see what happens. If there’s any overlap in age between the two groups, I think you’ll be ok. Even if there is no overlap, age is obviously a crucial variable. My guess would be that excluding it from the model is doing more harm, given that it’s clearly important.

I’m a bit confused by what you’re suggesting for the control group. Isn’t one of your groups those individuals with the condition and the other without it?

It does sound possible that there would be an interaction effect in this case. I’d definitely try fitting and see what the results are! That interaction term would show whether the relationship between age and test score is different between the groups.


April 26, 2021 at 12:44 am

In the paragraph below the table, both weight and activity are referred to as included variables.

April 26, 2021 at 12:50 am

Hi Joshua, yes, you’re correct! A big thanks! I’ve corrected the text. In that example, activity is the included variable, weight is the omitted variable, and bone density is the dependent variable.


April 24, 2021 at 1:06 pm

Hi, Jim. Great article. However, is that a typo in the direction of omitted variable bias table? For the rows, it makes more sense to me if they were “correlation between dependent and omitted variables” instead of “correlation between dependent and included variables”.

April 25, 2021 at 11:21 pm

No, that’s not a typo!


April 22, 2021 at 9:53 am

Please let me know if this summary makes sense. Again, Thanks for the great posts !

Scenario 1: There are 10 IVs. They are modeled using OLS. We get the regression coefficients.

Scenario 2: One of the IVs is removed. It is not a confounder. The only impact is on the residuals (they increase). The coefficients obtained in Scenario 1 remain intact. Is that correct ?

Scenario 3: The IV that was removed in Scenario 2 is placed back into the mix. This time, another IV is removed. Now this one’s a confounder. OLS modeling is re-run. There are 3 results.

1) The residuals increase — because it is correlated with the dependent variable. 2) The coefficient of the other IV, to which this removed confounder is correlated, changes. 3) The coefficients of the other IVs remain intact.

Are these 3 scenarios an accurate summary, Jim? A reply would be much appreciated !

Again, do keep up the good work.

April 25, 2021 at 11:26 pm

Yes, that all sounds right on! 🙂

April 22, 2021 at 8:37 am

Great post, Jim !

Probably a basic question, but would appreciate your answer on this, since we have encountered this in practical scenarios. Thanks in advance.

What if we know of a variable that should be included on the IV side, but we don’t have data for it? We know (from domain expertise) that it is correlated with the dependent variable, but it is not correlated with any of the IVs… In other words, it is not a confounding variable in the strictest sense of the term (since it is not correlated with any of the IVs).

How do we account for such variables?

Here again the solution would be to use proxy variables? In other words, can we consider proxy variables to be a workaround for not just confounders, but also non-confounders of the above type ?

Thanks again !

April 23, 2021 at 11:20 pm

I discuss several methods in this article. The one I’d recommend if at all possible is identifying a proxy variable that stands in for the important variable that you don’t have. It sounds like in your case it’s not a confounder. So, it’s probably not biasing your other coefficients. However, your model is missing important information. You might be able to improve the precision using a proxy variable.


March 19, 2021 at 10:45 am

Hi Jim, that article is helping me a lot during my research project, thank you so much for that! However, there is one question for which I couldn’t find a satisfactory answer on the internet, so I hope that maybe you can shed some light on this: In my panel regression, my main independent variable is “Policy Uncertainty”, which captures uncertainty related to the possible impact of future government policies. It is based on an index that has a mean of 100. My dependent variable is whether a firm has received funding in quarter t (Yes = 1, No = 0), thus I want to estimate the impact of policy uncertainty on the likelihood of receiving external funding. In my baseline regression, the coefficient on policy uncertainty is insignificant, suggesting that policy uncertainty has no impact. When I now add a proxy for uncertainty related to financial markets (e.g. implied stock market volatility), then policy uncertainty becomes significant at the 1% level and the market uncertainty proxy is statistically significant at the 1% level too! The correlation between both is rather low, 0.2. Furthermore, both have opposite signs (policy uncertainty is positively associated with the likelihood of receiving funding); additionally, the magnitude of the coefficients is comparable.

Now I am wondering what this tells me… did the variable on policy uncertainty previously capture the effect of market uncertainty before including the latter in the regression? Would be great if you could help 🙂

March 19, 2021 at 2:56 pm

Thanks for writing with the interesting questions!

First, I’ll assume you’re using binary logistic regression because you have a binary dependent variable. For logistic regression, you don’t interpret the coefficients the same way as you do for, say, least squares regression. Typically, you’ll assess the odds ratios to understand each IV’s relationship to the binary DV.
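As a rough illustration of reading odds ratios rather than raw coefficients (simulated data and invented variable names, not the commenter’s dataset), a binary logistic model could be examined like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "policy_unc": rng.normal(100, 15, n),   # index with a mean around 100
    "market_unc": rng.normal(20, 5, n),     # stand-in for implied volatility
})
xb = -3 + 0.03 * df["policy_unc"] - 0.05 * df["market_unc"]
df["funded"] = rng.binomial(1, 1 / (1 + np.exp(-xb)))      # binary outcome

fit = smf.logit("funded ~ policy_unc + market_unc", data=df).fit(disp=0)
print(np.exp(fit.params))       # odds ratios: multiplicative change in the odds per unit
print(np.exp(fit.conf_int()))   # 95% confidence intervals for the odds ratios
```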

On to your example. It’s entirely possible that leaving out market uncertainty was causing omitted variable bias in the policy uncertainty coefficient. That might be what is happening. But, the positive sign of one and the negative sign of the other could be cancelling each other out when you only include the one. That is what happens in the example I use in this post. However, for that type of bias/confounding, you’d expect there to be a correlation between the two IVs, and you say it is low.

Another possibility is the fact that for each variable in a model, the significance refers to the Adj SS for the variable, which factors in all the other variables before entering the variable in question. So, the policy uncertainty in the model with market volatility is significant after accounting for the variance that the other variables explain, including market volatility. For the model without market volatility, the policy uncertainty is not significant in that different pool of remaining variability. Given the low correlation (0.2) between those two IVs, I’d lean towards this explanation. If there was a stronger correlation between the policy/market uncertainty, I’d lean towards omitted variable bias.

Also be sure that your model doesn’t have any other type of problems, such as overfitting or patterns in the residual plots . Those can cause weird things to happen with the coefficients.

It can be unnerving when the significance of one variable depends entirely on the presence of another variable. It makes choosing the correct model difficult! I’d let theory be your guide. I write about that towards the end of my post about selecting the correct regression model. That’s written in the context of least squares regression, but the same ideas about theory and other research apply here.

You should definitely investigate this mystery further!


February 11, 2021 at 12:31 am

Thank you for this blog. I have a question: If two independent variables are correlated, can we not convert one into the other and replace it in the model? For example, if Y = X1 + X2, and X2 = -0.5X1, then Y = 0.5X1. However, I don’t see that as a suggestion in the blog. The blog mentions that activity is related to weight, but then somehow both are finally included in the model, rather than replacing one with the other in the model. Will this not help with multicollinearity, too? I am sure I am missing something here that you can see, but I am unable to find that out. Can you please help?

Regards, Kushal Jain

February 11, 2021 at 4:45 pm

Why would you want to convert one to another? Typically, you want to understand the relationship between each independent variable and the dependent variable. In the model I talk about, I’d want to know the relationship between both activity and weight with bone density. Converting activity to weight does not help with that.

And, I’m not understanding what you mean by “then somehow both are finally included in the model.” You just include both variables in the model the normal way.

There’s no benefit to converting the variables as you describe and there are reasons not to do that!


November 25, 2020 at 2:22 pm

Hi Jim, I have been trying to figure out covariates for a study we are doing for some time. My colleague believes that if two covariates have a high correlation (>20%) then one should be removed from the model. I’m assuming this is true unless both are correlated to the dependent variable, per your discussion above? Also, what do you think about selecting covariates by using the 10% change method? Any thoughts would be helpful. We’ve had a heck of a time selecting covariates for this study. Thanks, Erin

November 27, 2020 at 2:06 am

It’s usually ok to have covariates that have a correlation greater than 20%. The exact value depends on the number of covariates and the strength of their correlations. But 20% is low and almost never a problem. When covariates are correlated, it’s known as multicollinearity. And, there’s a special measure, known as the VIF (variance inflation factor), that determines whether you have an excessive amount of correlation amongst your covariates. I have a post that discusses multicollinearity and how to detect and correct it.
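For anyone who wants to check this directly, here is a short sketch (hypothetical covariate names and simulated data) of computing VIFs with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "bmi": rng.normal(27, 4, n),
})
df["activity"] = -0.3 * df["age"] + rng.normal(0, 5, n)   # modest correlation with age

X = sm.add_constant(df)   # keep the intercept when computing VIFs
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)   # common rule of thumb: VIFs above roughly 5-10 flag problematic multicollinearity
```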

I have not used the 10% change method myself. However, I would suggest using that method only as one point of information. I’d really place more emphasis on theory and understanding the subject area. However, observing how much a covariate changes can provide useful information about whether bias is a problem or not. In general, if you’re uncertain, I’d err on the side of unnecessarily including a covariate than leaving it out. There are usually fewer problems associated with having an additional variable than omitting one. However, keep an eye out on the VIFs as you do that. And, having a number of unnecessary variables could lead to problems if taken to an extreme or if you have a really small sample size.

I wrote a post about model selection . I give some practical tips in it. Overall, I suggest using a mix of theory, subject area knowledge, and statistical approaches. I’d suggest reading that. It’s not specifically about controlling for confounders but the same principles apply. Also, I’d highly recommend reading about what researchers performing similar studies have done if that’s at all possible. They might have already addressed that issue!


November 5, 2020 at 6:29 am

Hi Jim, I’m not sure whether my problem fits under this category or not, so apologies if not. I am looking at whether an inflammatory biomarker (independent variable) correlates with a measure of cognitive function (dependent variable). It does if it’s just a simple linear regression; however, the biomarker (independent variable) is affected by age, sex and whether you’re a smoker or not. Correcting for these 3 covariables in the model shows that actually there is no correlation between the biomarker and cognitive function. I assume this was the correct thing to do but wanted to make sure seeing as a) none of the 3 covariables correlate with/predict my dependent variable, and b) as age correlates highly with the biomarker, does this not introduce collinearity? Thanks! Charlotte

November 6, 2020 at 9:46 pm

Hi Charlotte,

Yes, it sounds like you did the right thing. Including the other variables in the model allows the model to control for them.

The collinearity (aka multicollinearity or correlation between independent variables) between age and the biomarker is a potential concern. However, a little correlation, or a moderate amount of correlation is fine. What you really need to do is to assess the VIFs for your independent variables. I discuss VIFs and multicollinearity in my post about multicollinearity . So, your next step should be to determine whether you have problematic levels of multicollinearity.

One symptom of multicollinearity is a lack of statistical significance, which your model is experiencing. So, it would be good to check.

Actually, I’m noticing that at least several of your independent variables are binary. Smoker. Gender. Is the biomarker also binary? Present or not present? If so, that doesn’t change the rationale for including the other variables in the model, but it does mean VIFs won’t detect the multicollinearity.


October 28, 2020 at 9:33 pm

Thanks for the clarification, Jim. Best regards.

October 24, 2020 at 11:30 pm

I think the section on “Predicting the Direction of Omitted Variable Bias” has a typo on the first column, first two rows. It should state:

*Omitted* and Dependent: Negative Correlation

*Omitted* and Dependent: Positive Correlation

This makes it consistent with the two required conditions for Omitted Variable Bias to occur:

The *omitted* variable must correlate with the dependent variable. The omitted variable must correlate with at least one independent variable that is in the regression model.

October 25, 2020 at 12:24 am

Hi Humberto,

Thanks for the close reading of my article! The table is correct as it is, but you are also correct. Let’s see why!

There are the following two requirements for omitted variable bias to exist: *The omitted variable must correlate with an IV in the model. *That IV must correlate with the DV.

The table accurately depicts both those conditions. The columns indicate the relationship between the IV (included) and omitted variable. The rows indicate the nature of the relationship between the IV and DV.

If both those conditions are true, you can then infer that there is a correlation between the omitted variable and the dependent variable and the nature of the correlation, as you indicate. I could include that in the table, but it is redundant information.

We’re thinking along the same lines and portraying the same overall picture. Alas, I’d need to use a three dimensional matrix to portray those three conditions! Fortunately, using the two conditions that I show in the table, we can still determine the direction of bias. And you could use those two relationships to determine the relationship between the omitted variable and dependent variable if you so wanted. However, that information doesn’t change our understanding of the direction of bias because it’s redundant with information already in the table.

Thanks for the great comment and it’s always beneficial thinking through these things using a different perspective!


August 14, 2020 at 3:00 am

Thank you for the intuitive explanation, Jim! I would like to ask a query. Suppose I have two groups - one with a recently diagnosed lung disease and another with chronic lung disease - where I would like to do an independent t-test for the amount of lung damage. It happens that the two groups also significantly differ in their mean age. The group with recently diagnosed disease has a lower mean age than the group with chronic disease. Also, theory says age can cause some damage in the lung as a normal course too. So if I include age as a covariate in the model, won’t it regress out the effect of the DV and give an underestimated effect, as the IV (age) significantly correlates with the DV (lung damage)? How do we address this confounding effect of correlation between only the IV and DV? Should it be by having a control group without lung disease? If so, can one control group help? Or should there be 2 control groups with age-matching to the two study groups? Thank you in advance.

August 15, 2020 at 3:46 pm

Hi Vineeth,

First, yes, if you know age is a factor, you should include it as a covariate in the model. It won’t “regress out” the true effect between the two groups. I would think of it a little differently.

You have two groups and you suspect that something caused those two groups to have differing amounts of lung damage. You also know that age plays a role. And those groups have different ages. So, if you look only at the groups without factoring in age, the effect of age is still present but the model is incorrectly attributing it to the groups. In your case, it will make the effect look larger.

When you include age, yes, it will reduce the effect size between the groups, but it reveals the correct effect by accounting for age. So, yes, in your case, it’ll make the group difference look smaller, but don’t think of it as “regressing out” the effect; instead, it is removing the bias from the other results. In other words, you’re improving the quality of your results.

When you look at your model results for, say, the grouping variable, the model is already controlling for the age variable. So, you’re left with what you need: just the effect between the IV and DV that is not accounted for by the other variables in the model, such as age. That’s what you need!

A control group for any experiment is always a good idea if you can manage one. However, it’s not always possible. I write about these experimental design issues, randomized experiments, observational studies, how to design a good experiment, etc. among other topics in my Introduction to Statistics ebook , which you might consider. It’s also just now available in print on Amazon !


August 12, 2020 at 7:04 am

I was wondering whether it’s correct to check the correlation between the independent variables and the error term in order to check for endogeneity. If we assume that there is endogeneity then the estimated errors aren’t correct and so the correlation between the independent variables and those errors doesn’t say much. Am I missing something here?

best regards,


July 15, 2020 at 1:57 pm

I wanted to look at the effects of confounders on my study but I’m not sure what analysis(es) to use for dichotomous covariates. I have one categorical IV with two levels, two continuous DVs, and then the two dichotomous confounding variables. It was hard to find information for categorical covariates online. Thanks in advance Jim!


May 8, 2020 at 10:04 am

Thank you for your nice blog. I still have a question. Let’s say I want to determine the effect of one independent variable on a dependent variable with a linear regression analysis. I have selected a number of potential variables for this relationship based on literature, such as age, gender, health status and education level. How can I check (with statistical analyses) if these are indeed confounders? I would like to know which of them I should control for in my linear regression analysis. Can I create a correlation matrix beforehand to see if the potential confounder is correlated with both my independent and dependent variable? And what threshold for the correlation coefficient should be taken here? Is this every correlation coefficient except zero (for instance 0.004)? Are there scientific articles/books that endorse this threshold? Or is it maybe better to use a “change-in-estimate” criterion to see if my regression coefficient changes by a particular amount after adding my potential confounder to the linear regression model? What would be the threshold here?

I hope my question is clear. Thanks in advance!


April 29, 2020 at 2:47 am

Thanks for a wonderful website! I love your example with bone density, which does not appear to be correlated to physical activity if looked at alone, and needs to have weight added as an explanatory variable to make both of them appear as significantly correlated with bone density. I would love to use this example in my class, as I think it is very important to understand that there are situations where a single-parameter model can lead you badly astray (here into thinking activity is not correlated with bone density). Of course, I could make up some numbers for my students, but it would be even nicer if I could give them your real data. Could you by any chance make a file of real measurements of bone densities, physical activity and weight available? I would be very grateful, and I suppose a lot of other teachers/students would too!

best regards Martin

April 30, 2020 at 5:06 pm

When I wrote this post, I wanted to share the data. Unfortunately, it seems like I no longer have it. If I uncover it, I’ll add it to the post.


February 8, 2020 at 1:45 pm

The work you have done is amazing, and I’ve learned so much through this website. I am at a beginner level in SPSS and I would be grateful if you could answer my question. I have found that a medical treatment results in worse quality of life. But I know from crosstabs that people taking this treatment present with more severe disease (a continuous variable) that also correlates with quality of life. How can I test if it is the treatment or the severity that worsens quality of life?

February 8, 2020 at 3:16 pm

Hi Evangelia,

Thanks so much for your kind words, I really appreciate them! And, I’m glad my website has been helpful!

That’s a great question and a valid concern to have. Fortunately, in a regression model, the solution is very simple. Just include both the treatment and severity of the disease in the model as independent variables. Doing that allows the model to hold disease severity constant (i.e., controls for it) while it estimates the effect of the treatment.

Conversely, if you did not include severity of the disease in the model, and it correlates with both the treatment and quality of life, it is uncontrolled and will be a confounding variable. In other words, if you don’t include severity of disease, the estimate for the relationship between treatment and quality of life will be biased.

We can use the table in this post for estimating the direction of bias. Based on what you wrote, I’ll assume that the treatment condition and severity have a positive correlation. Those taking the treatment present a more severe disease. And, that the treatment condition has a negative correlation with quality of life. Those on the treatment have a lower quality of life for the reasons you indicated. That puts us in the top-right quadrant of the table, which indicates that if you do not include severity of disease as an IV, the treatment effect will be underestimated.

Again, simply by including disease severity in your model will reduce the bias!
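A toy simulation (made-up effect sizes, not the commenter’s data) of that exact situation shows the direction of the bias: when treatment and severity are positively related and severity lowers quality of life, omitting severity makes the treatment look worse than it really is.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 400
severity = rng.normal(size=n)
treatment = (severity + rng.normal(size=n) > 0).astype(float)   # sicker people tend to get treated
qol = 1.0 * treatment - 2.0 * severity + rng.normal(size=n)     # in this sketch, treatment actually helps

adjusted = sm.OLS(qol, sm.add_constant(np.column_stack([treatment, severity]))).fit()
unadjusted = sm.OLS(qol, sm.add_constant(treatment)).fit()
print(adjusted.params[1])    # near the true +1.0
print(unadjusted.params[1])  # biased downward; the treatment can even look harmful
```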


December 7, 2019 at 7:32 pm

Just a question about what you said about power. Will adding more independent variables to a regression model cause a loss of power? (at a fixed sample size). Or does it depend on the type of independent variable added: confounder vs. non confounder.


November 1, 2019 at 8:54 pm

You mention, “Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable.” How is it possible for two random variables (in this case the two factors) to correlate with each other if they are independent? If two random variables are independent, then the covariance is zero and therefore the correlation is zero.

Corr(X1, X2) = Cov(X1, X2) / (sqrt(Var(X1)) * sqrt(Var(X2))), and Cov(X1, X2) = E[X1*X2] - E[X1]*E[X2]. If X1 and X2 are independent, then E[X1*X2] = E[X1]*E[X2], and therefore the covariance is zero.

November 4, 2019 at 9:07 am

Ah, there’s a bit of confusion here. The explanatory variables in a regression model are often referred to as independent variables, as well as predictors, x-variables, inputs, etc. I was using “independent variable” as the name. You’re correct, if they were independent in the sense that you describe them, there would be no correlation. Ideally, there would be no correlation between them in a regression model. However, they can, in fact, be correlated. If that correlation is too strong, it will cause problems with the model.

“Independent variable” in the regression context refers to the predictors and describes their ideal state. In practice, they’ll often have some degree of correlation.

I hope this helps!


April 8, 2019 at 12:33 pm

Ah! Enlightenment!

I had taken your statement about the correlation of the independent variable with the residuals to be a statement about computed value of the correlation between them, that is, that cor(X1, resid) was nonzero. I believe that (in a model with a constant term), this is impossible.

But I think I get now that you were using the term more loosely, referring to a (nonlinear) pattern appearing between the values of X1 and the corresponding residuals, in the same way as you would see a parabolic pattern in a scatterplot of residuals versus X if you tried to make a linear fit of quadratic data. The linear correlation between X and the residuals would still compute out, numerically, to zero, so X1 and the residuals would technically be uncorrelated, but they would not be statistically independent. If the residuals are showing a nonlinear pattern when plotted against X, look for a lurker.

The Albany example was very helpful. Thanks so much for digging it up!

April 8, 2019 at 8:38 am

Hi, Jim! Thanks very much for you speedy reply!

I appreciate the clarity that you aim for in your writing, and I’m sorry if I wasn’t clear in my post. Let me try again, being a bit more precise, hopefully without getting too technical.

My problem is that I think that the very process used in finding the OLS coefficients (like minimizing the sum squared error of the residuals) results in a regression equation that satisfies two properties. First, that the sum (or mean) of the resulting residuals is zero. Second, that for any regressor Xi, Xi is orthogonal to the vector of residuals, which in turn leads to the covariance of the residuals with any regressor having to be zero. Certainly, the true error terms need not sum to zero, nor need they be uncorrelated with a regressor…but if I understand correctly, these properties of the _residuals_ are an automatic consequence of fitting OLS to a data set, regardless of whether the actual error terms are correlated to the regressor or not.

I’ve found a number of sources that seem to say this–one online example is on page two here: https://www.stat.berkeley.edu/~aditya/resources/LectureSIX.pdf . I’ll be happy to provide others on request.

I’ve also generated a number of my own data sets with correlated regressors X1 and X2 and Y values generated by a X1 + b X2 + (error), where a and b are constants and (error) is a normally distributed error term of fixed variance, independently chosen for each point in the data set. In each case, leaving X2 out of the model still left me with zero correlation between X1 and the residuals, although there was a correlation between X1 and the true error terms, of course.

If I have it wrong, I’d love to see a data set that demonstrates what you’re talking about. If you don’t have time to find one (which I certainly understand), I’d be quite happy with any reference you might point me to that talks about this kind of correlation between residuals and one of the regressors in OLS, in any context.

Thanks again for your help, and for making regression more comprehensible to so many people.

Scott Stevens

April 8, 2019 at 10:59 am

Unfortunately, the analysis doesn’t fix all possible problems with the residuals. It is possible to specify models where the residuals exhibit various problems. You mention that residuals will sum to zero. However, if you specify a model without a constant, the residuals won’t necessarily sum to zero-read about that here . If you have a time series model, it’s possible to have autocorrelation in the residuals if you leave out important variables. If you specify a model that doesn’t adequately model curvature in the data, you’ll see patterns in the residuals.

In a similar vein, if you leave out an important variable that is correlated both with the DV and another IV in the model, you can have residuals that correlate with an IV. The standard practice is to graph the residuals by the independent variable to look for that relationship because it might have a curved shape which indicates a relationship but not necessarily a linear one that correlation would detect.

As for references, any regression textbook should cover this assumption. Again, it’ll refer to error, but the key is to remember that residuals are the proxy for error.

Here’s a reference from the University of Albany about Omitted Variable Bias that goes into it in more detail from the standpoint of residuals and includes an example of graphing the residuals by the omitted variable.
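In that spirit, here is a small numerical check (simulated data, not the referenced example) of the distinction being discussed: with a constant in the model, the OLS residuals are numerically uncorrelated with the included regressor, yet they clearly correlate with the omitted variable, and the included coefficient is biased.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # the omitted variable, correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(x1)).fit()    # x2 is left out of the model
resid = fit.resid

print(np.corrcoef(x1, resid)[0, 1])   # essentially 0: least squares forces this
print(np.corrcoef(x2, resid)[0, 1])   # clearly nonzero: the omitted variable shows up in the residuals
print(fit.params[1])                  # biased away from the true 1.0 (expect about 1.5)
```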

April 7, 2019 at 11:17 am

Hi, Jim. I very much enjoy how you make regression more accessible, and I like to use your approaches with my own students. I’m confused, though by the matter brought up by SFDude.

I certainly see how the _error_ term in a regression model will be correlated with an independent variable when a confounding variable is omitted, but it seems to me that the normal equations that define the regression coefficients assure that an independent variable in the model will always be uncorrelated with the _residuals_ of that model, regardless of whether an omitted confounding variable exists or not. Certainly, “X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals” would not hold for any three variables X1 and X2 and R. For example, if A and B are independent, then “A correlates with A + B, A + B correlates with B. Ergo, A correlates with B” is a false statement.

If I’m missing something here, I’d very much appreciate a data set that demonstrates the kind of correlation between an independent variable and the residuals of the model that it seems you’re talking about.

Thanks! Scott Stevens

April 7, 2019 at 6:28 pm

Thanks for writing. And, I’m glad to hear that you find my website helpful!

The key thing to remember is that while the OLS assumptions refer to the error, we can’t directly observe the true error. So, we use the residuals as estimates of the error. If the error is correlated with an omitted variable, we’d expect the residuals to be correlated as well in approximately the same manner. Omitted variable bias is a real condition, and that description is simply getting deep into the nuts and bolts of how it works. But, it’s the accepted explanation. You can read it in textbooks. While the assumptions refer to error, we can only assess the residuals instead. They’re the best we’ve got!

When you say A and B are “independent”, if you mean they are not correlated, I’d agree that removing a truly uncorrelated variable from the model does not cause this type of bias. I mention that in this post. This bias only occurs when independent variables are correlated with each other to some degree, and with the dependent variable, and you exclude one of the IVs.

I guess I’m not exactly sure which part is causing the difficulty? The regression equations can’t ensure that the residuals are uncorrelated if the model is specified in such a way that it causes them to be correlated. It’s just like in time series regression models, where you have to be on the lookout for autocorrelation (correlated residuals) because the model doesn’t account for time-order effects. Incorrectly specified models can and do cause problems with the residuals, including residuals that are correlated with other variables and themselves.

I’ll have to see if I can find a dataset with this condition.


March 10, 2019 at 10:41 am

Hi Jim, I am involved in a study which involves looking into a number of clinical parameters, like platelet count and haemoglobin, for patients who underwent emergency change of a mechanical circulatory support device due to thrombosis or clotting of the actual device. The purpose is to see if there is a trend in these parameters in the time frame of 3 days before and 3 days after the change, and to establish if these parameters could be used as predictors of the event. My concern is that there is no control group for this study. But I don’t see the need for looking into a trend in a group which never had the event itself. Will not having a control group be considered a weakness for this study? Also, what would be the best statistical test for this? I was thinking of the generalized linear model. I would really appreciate your guidance here. Thank you


February 20, 2019 at 8:49 am

I’m looking at a published paper that develops clinical prediction rules by using logistic regression in order to help primary care doctors to decide who to refer to breast clinics for further investigation. The dependent variable is simply whether breast cancer is found to be present or not. The independent variables include 11 symptoms and age in (mostly) ten year increments (six separate age bands). The age bands were decided before the logistical regression was carried out. The paper goes on to use the data to create a scoring system based on symptoms and age. If this scoring system were to be used then above a certain score a woman would be referred, and below a certain score a woman would not be referred.

The total sample size is 6590 women referred to a breast clinic, of which 320 were found to have breast cancer. The sample itself is very skewed. In younger women, breast cancer is rare, and so in some categories the numbers are very low. So for instance, in the 18-29 age band there are 62 women referred, of whom 8 have breast cancer, and in the 30-39 age band there are 755 women referred, of whom only one has breast cancer. So my first question is: if there are fewer individuals in particular categories than symptoms, can the paper still use logistic regression to predict who to refer to a breast clinic based on a scoring system that includes both age and symptoms? My second question is: if there are meant to be at least 10 individuals per variable in logistic regression, are the numbers of women with breast cancer in these age groups too small for logistic regression to apply?

When I look at the total number of women in the sample (6590) and then the total number of symptoms (8616) there is a discrepancy. This means that some women have had more than one symptom recorded. (Or from the symptoms’ point of view, some women have been recorded more than once). So my third question is: does this mean that some of the independent variables are not actually independent of each other? (There is around a 30%-32% discrepancy in all categories. How significant is this?)

There are lots of other problems with the paper (the fact the authors only look at referred women rather than all the symptomatic women that a primary care doctor sees is a case in point) but I’d like to know whether the statistics are flawed too. If there are any other questions I need to ask about the data please do let me know.

With very best wishes,

Ms Susan Mitchell

February 20, 2019 at 11:23 pm

Offhand, I don’t see anything that screams to me that there is a definite problem. I’d have to read the study to be really sure. Here’s some thoughts.

I’m not in the medical field, but I’ve heard talks by people in that field and it sounds like this is a fairly common use for binary logistic regression. The analyst creates a model where you indicate which characteristics, risk factors, etc. apply to an individual. Then, the model predicts the probability of an outcome for them. I’ve seen similar models for surgical success, death, etc. The idea is that it’s fairly easy to use because someone can just enter the characteristics of the patient and the model spits out a probability. For any model of this type, you’d really have to check the residuals and see all the output to determine how well the model fits the data. But, there’s nothing inherently wrong with this approach.

I don’t see a problem with the sample size (6590) and the number of IVs (12). That’s actually a very good ratio of observations per IV.

It’s ok that there are fewer individuals in some categories. It’s better if you have a fairly equal number, but it’s not a show stopper. Categories with fewer observations will have less precise estimates. It can potentially reduce the precision of the model. You’d have to see how well the model fits the data to really know how well it works out. But, yes, if you have an extremely low number of individuals that have a particular symptom, you won’t get as precise of an estimate for that symptom’s effect. You might see a wider CI for its odds ratio. But, it’s hard to say without seeing all of that output and how the numbers break down by symptom. And, it’s possible that they selected the characteristics that apply to a sufficient number of women. Again, I wouldn’t be able to say. It’s an issue to consider for sure.

As for the number of symptoms versus the number of women, it’s ok that a woman can have more than one symptom. Each symptom is in its own column and will be coded with a 1 or 0. A row corresponds to one woman, and she’ll have a 1 for each characteristic that she has and 0s for the ones that she does not have. It’s possible these symptoms are correlated. These are categorical variables, so you couldn’t use Pearson’s correlation. You’d need to use something like the chi-square test of independence. And, some correlation is okay. Only very high correlation would be problematic. Again, I can’t say whether that’s a problem in this study or not because it depends on the degree of correlation. It might be, but it’s not necessarily a problem. You’d hope that the study strategically included a good set of IVs that aren’t overly correlated.
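For example, a chi-square test of independence on a hypothetical 2x2 table of symptom counts (the numbers below are invented, not from the paper) looks like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts: rows = symptom A (no/yes), columns = symptom B (no/yes)
table = np.array([[4200, 800],
                  [ 900, 690]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)   # a small p-value suggests the two symptoms are associated
```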

Regarding the referred women vs symptomatic women, that comes down to the population that is being sampled and how generalizable the results are. Not being familiar with the field, I don’t have a good sense for how that affects generalizability, but yes, that would be a concern to consider.

So, I don’t see anything that shouts to me that it’s a definite problem. But, as with any regression model, it would come down to the usual assessments of how well the model fits the data. You mention issues that could be concerns, but again, it depends on the specifics.

Sorry I couldn’t provide more detailed thoughts but evaluating these things requires real specific information. But, the general approach for this study seems sound to me.


February 17, 2019 at 3:48 pm

I have a question: how well can we evaluate whether a regression equation “fits” the data by examining the R-squared statistic, and test the statistical significance of the whole regression equation using the F-test?

February 18, 2019 at 4:56 pm

I have two blog posts that will be perfect for you!

Interpreting R-squared
Interpreting the F-test of Overall Significance

If you have questions about either one, please post it in the comments section of the corresponding post. But, I think those posts will go a long way in answering your questions!


January 18, 2019 at 7:00 pm

Mr. Frost, I know I need to run a regression model; however, I’m still unsure which one. I’m examining the effects of alcohol use on teenagers with 4 confounders.

January 19, 2019 at 6:47 pm

Hi Dahlia, to make the decision, I’d need to know what types of variables they all are (continuous, categorical, binary, etc.). However, if the outcome measuring the effect of alcohol is a continuous variable, then OLS linear regression is a great place to start!

Best of luck with your analysis!


January 5, 2019 at 2:39 am

Thank you very much Jim,

Very helpful. I think my problem is really the number of observations (25 obs). Yes, I have read that post also, and I always keep the theory in mind when analyzing the IVs.

My main objective is to show the existing relationship between X2 and Y, which is also supported by literature, however, if I do not control for X1 I will never be sure that the effect I have found is due to X2 or X1, because X1 and X2 are correlated.

I think correlation alone would be ok, since my number of observations is limited, and using regression limits the number of IVs that can be included in the model, which may force me to leave some other IVs out of the model, which is also bad.

Thank you again

Best regards!

January 4, 2019 at 9:40 am

Thank you for this very good post.

However, I have a question. What should I do if the IVs X1 and X2 are correlated (say 0.75) and both are correlated with Y (the DV) at 0.60? When I include X1 and X2 in the same model, X2 is not statistically significant, but when they are entered separately, both become statistically significant. On the other hand, the model with only X1 has higher explanatory power than the model with only X2.

Note: In individual models both meet the OLS assumptions, but together X2 becomes statistically insignificant (using stepwise regression, X2 is removed from the model) - what does this mean? In addition, I know from the literature that X2 affects Y, but I am testing X1, and X1 is showing a better fit than X2.

Thank you in advance, I hope you understand my question!

January 4, 2019 at 3:15 pm

Yes, I understand completely! This situation isn’t too unusual. The underlying problem is that because the two IVs are correlated, they’re supplying a similar type of predictive information. There isn’t enough unique predictive information for both of them to be statistically significant. If you had a larger sample size, it’s possible that both would be significant. Also, keep in mind that correlation is a pairwise measure and doesn’t account for other variables. When you include both IVs in the model, the relationship between each IV and the DV is determined after accounting for the other variables in the model. That’s why you can see a pairwise correlation but not a relationship in a regression model.

I know you’ve read a number of my posts, but I’m not sure if you’ve read the one about model specification. In that post, a key point I make is not to use statistical measures alone to determine which IVs to leave in the model. If theory suggests that X2 should be included, you have a very strong case for including it even if it’s not significant when X1 is in the model–just be sure to include that discussion in your write-up.

Conversely, just because X2 seems to provide a better fit statistically and is significant with or without X1 doesn’t mean you must include it in the model. Those are strong signs that you should consider including a variable in the model. However, as always, use theory as a guide and document the rationale for the decisions you make.

For your case, you might consider including both IVs in the model. If they’re both supplying similar information and X2 is justified by theory, chances are that X1 is as well. Again, document your rationale. If you include both, check the VIFs to be sure that you don’t have problematic levels of multicollinearity. If those are the only two IVs in your model, that won’t be problematic given the correlations you describe. But, it could be problematic if you have more IVs in the model that are also correlated with X1 and X2.

Another thing to look at is whether the coefficients for X1 and X2 vary greatly depending on whether you have one or both of the IVs in the model. If they don’t change much, that’s nice and simple. However, if they do change quite a bit, then you need to determine which coefficient values are likely to be closer to the correct value because that corresponds to the choice about which IVs to include! I’m sounding like a broken record, but if this is a factor, document your rationale and decisions.

I hope that helps! Best of luck with your analysis!


November 28, 2018 at 11:30 pm

Another great post! Thank you for truly making statistics intuitive. I learned a lot of this material back in school, but am only now understanding it more conceptually thanks to you. Super useful for my work in analytics. Please keep it up!

November 29, 2018 at 8:54 am

Thanks, Patrick! It’s great to hear that it was helpful!


November 12, 2018 at 12:54 pm

I think there may be a typo here – “These are important variables that the statistical model does include and, therefore, cannot control.” Shouldn’t it be “does not include”, if I understand correctly?

November 12, 2018 at 1:19 pm

Thanks, Jayant! Good eagle eyes! That is indeed a typo. I will fix it. Thanks for pointing it out!


November 3, 2018 at 12:07 pm

Mr. Jim, thank you for making me understand econometrics. I thought that an omitted variable is excluded from the model and that’s why it under/overestimates the coefficients. Somewhere in this article you mentioned that they are still included in the model but not controlled for. I find that very confusing; would you be able to clarify? Thanks a lot.

November 3, 2018 at 2:26 pm

You’re definitely correct. Omitted variable bias occurs when you exclude a variable from the model. If I gave the impression that it’s included, please let me know where in the text because I want to clarify that! Thanks!

By excluding the variable, the model does not control for it, which biases the results. When you include a previously excluded variable, the model can now control for it and the bias goes away. Maybe I wrote that in a confusing way?

Thanks! I always strive to make my posts as clear as possible, so I’ll think about how to explain this better.

September 28, 2018 at 4:31 pm

In addition to mean square error and adjusted R-squared, I use Cp, IC, HQC, and SBIC to decide the number of independent variables in multiple regression.

September 28, 2018 at 4:39 pm

I think there are a variety of good measures. I’d also add predicted R-squared, as long as you use them in conjunction with subject-area expertise. As I mention in this post, the entire set of estimated relationships must make theoretical sense. If they don’t, the statistical measures are not important.

September 28, 2018 at 4:13 pm

I have to read the article you named. Having said that, caution is needed when regression models describe systems or processes that are not in statistical control. Also, some processes have physical bounds that a regression model does not capture, so calculated predicted values may have no physical meaning. Further, models built from narrow ranges of independent variables may not be applicable outside those ranges.

September 28, 2018 at 4:19 pm

Hi Stan, those are all great points, and true. They all illustrate how you need to use your subject-area knowledge in conjunction with statistical analyses.

I talk about the issue of not going outside the range of the data, amongst other issues, in my post about Using Regression to Make Predictions .

I also agree about statistical control, which I think is underappreciated outside of the quality improvement arena. I’ve written about this in a post about using control charts with hypothesis tests .

September 28, 2018 at 2:30 pm

Valid confidence/prediction intervals are important if the regression model represents a process that is being characterized. When the prediction intervals are too wide, the model’s validity and utility are in question.

September 28, 2018 at 2:49 pm

You’re definitely correct! If the model doesn’t fit the data, your predictions are worthless. One minor caveat that I’d add to your comment.

The prediction intervals can be too wide to be useful yet the model might still be valid. It’s really two separate assessments: model validity and degree of precision. I write about this in several posts including the following: Understanding Precision in Prediction

September 26, 2018 at 9:13 am

Jim, does centering any independent explanatory variable require centering them all? Should you center the dependent and explanatory variables? I always make a normal probability plot of the deleted residuals as one test of the prediction capability of the fitted model. It is remarkable how good models give good normal probability plots. I also use the Shapiro-Wilk test to assess the deleted residuals for normality. Stan Alekman

September 26, 2018 at 9:46 am

Yes, you should center all of the continuous independent variables if your goal is to reduce multicollinearity and/or to be able to interpret the intercept. I’ve never seen a reason to center the dependent variable.
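A minimal sketch of what that looks like, assuming pandas and hypothetical column names; only the continuous independent variables are centered:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x1": rng.normal(50, 10, size=300),   # hypothetical continuous IVs
    "x2": rng.normal(100, 20, size=300),
})

centered = df - df.mean()                       # subtract each column's mean
interaction = centered["x1"] * centered["x2"]   # interaction built from centered terms

# After centering, the intercept is the prediction at average predictor values,
# and the interaction term is less collinear with the main effects.
print(centered.mean().round(6))
```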

It’s funny that you mention that about normally distributed residuals! I, too, have been impressed with how frequently that occurs even with fairly simple models. I’ve recently written a post about OLS assumptions and I mention how normal residuals are sort of optional. They only need to be normally distributed if you want to perform hypothesis tests and have valid confidence/prediction intervals. Most analysts want at least the hypothesis tests!


September 25, 2018 at 2:32 am

Hey Jim, your blogs are really helpful for me to learn data science. Here is the question in my assignment:

You have built a classification model with 90% accuracy, but your client is not happy because the false positive rate was very high. What will you do? Can we do something about it with precision or recall?

This is the question; nothing is given in the background,

though it should have been!
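One way to approach a question like this, sketched below under the assumption that scikit-learn is available and synthetic data stands in for the real problem, is to track precision and recall while moving the classification threshold; raising the threshold typically cuts false positives at the cost of recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = clf.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.7, 0.9):                 # stricter thresholds cut false positives
    pred = (probs >= threshold).astype(int)
    print(threshold,
          "precision:", round(precision_score(y_te, pred, zero_division=0), 3),
          "recall:", round(recall_score(y_te, pred, zero_division=0), 3))
```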


September 25, 2018 at 1:20 am

Thank you, Jim. Really interesting.

September 25, 2018 at 1:26 am

Hi Brahim, you’re very welcome! I’m glad it was interesting!


September 24, 2018 at 10:30 pm

Hey Jim, you are awesome.

September 24, 2018 at 11:04 pm

Aw, MG, thanks so much!! 🙂


September 24, 2018 at 10:59 am

Thanks for another great article, Jim!

Q: Could you expand with a specific plot example to explain more clearly, this statement: “We know that for omitted variable bias to exist, an independent variable must correlate with the residuals. Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and points us towards the solution. We know which independent variable correlates with the confounding variable.”

Thanks! SFdude

September 24, 2018 at 11:48 am

Hi, thanks!

I’ll try to find a good example plot to include soon. Basically, you’re looking for any non-random pattern. For example, the residuals might tend to either increase or decrease as the value of the independent variable increases. That relationship can follow a straight line or display curvature, depending on the nature of the relationship.
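As a rough illustration of the kind of non-random pattern to look for, the sketch below (matplotlib and statsmodels assumed) simulates a model that omits a squared term and plots the residuals against the predictor; the curvature is the warning sign:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=300)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=300)   # true model includes x**2

fit = sm.OLS(y, sm.add_constant(x)).fit()                # fitted model omits x**2
plt.scatter(x, fit.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Curvature in the residuals signals an omitted term")
plt.show()
```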


September 24, 2018 at 1:37 am

It’s been a long time since I heard from you, Jim. Missed your stats!

September 24, 2018 at 9:53 am

Hi Saketh, thanks, you’re too kind! I try to post here every two weeks at least. Occasionally, weekly!


Enago Academy

Demystifying the Role of Confounding Variables in Research


In the realm of scientific research, the pursuit of knowledge often involves complex investigations, meticulous data collection, and rigorous statistical analysis. Achieving accurate and reliable results is paramount, so researchers strive to design experiments and studies that isolate and scrutinize the specific variables they aim to investigate. However, hidden factors can obscure the true relationships between variables and lead to erroneous conclusions. These covert culprits are known as confounding variables, and their elusive nature can skew results and hinder the quest for truth.


What Are Confounding Variables

Confounding variables, also referred to as confounders or lurking variables, are variables that affect both the exposure and the outcome of a study but are not the variables of primary interest. They act as an unmeasured third variable, an extraneous factor that interferes with the interpretation of the relationship between the independent and dependent variables. Confounding variables in statistics can be categorical, ordinal, or continuous, and common forms of confounding include selection bias, information bias, time-related confounding, and age-related confounding.

Additionally, in the world of scientific inquiry, the term “confounding bias” is used to describe a systematic error or distortion in research findings that occurs when a confounder is not properly accounted for in a study. This can lead to misleading conclusions about the relationship between the independent variable(s) and the dependent variable, potentially introducing bias into the study’s results.

Key Characteristics of Confounding Variables

Key characteristics of confounding variables or confounding factors include:

[Figure: Characteristics of confounding variables]

Confounding factors can distort the relationship between independent and dependent variables in research. Thus, recognizing, controlling, and addressing them is essential to ensure the accuracy and validity of study findings.

Effects of Confounding Variables

Confounding variables play a crucial role in the internal validity of research. Understanding their effects is necessary for producing credible, applicable, and ethically sound research.

Here are some impacts of confounding variables in research.

1. Lack of Attribution of Cause and Effect

  • Confounding variables can lead researchers to erroneously attribute causation where there is none.
  • This happens when a confounder is mistaken for the independent variable, causing researchers to believe that a relationship exists between variables even when it does not.

2. Overestimate or Underestimate Effects

  • Confounding variables can distort the magnitude and direction of relationships between variables.
  • Additionally, they can either inflate or diminish the apparent effect, leading to inaccurate assessments of the true impact.
  • Furthermore, they can also hide genuine associations between variables.

3. Distort Results

  • Confounding variables can create false associations between variables.
  • In these cases, the observed relationship is driven by the confounder rather than any meaningful connection between the independent and dependent variables.
  • This distorts the relationship between the variables of interest, leading to incorrect conclusions.

4. Reduce Precision and Reliability

  • Confounding variables can introduce noise and variability in the data.
  • This can make it challenging to detect genuine effects or differences.
  • Furthermore, the results of a study may not generalize well to other populations or contexts as the impact of the confounders might be specific to the study sample or conditions.

5. Introduce Bias

  • If confounding variables are not properly addressed, the conclusions drawn from a study can be biased.
  • These biased conclusions can have real-world implications, especially in fields like medicine, public policy, and social sciences.
  • Studies plagued by confounding variables have reduced internal validity, which can hinder scientific progress and the development of effective interventions.

6. Introduce Ethical Implications

  • In certain cases, failing to control for confounding variables can have ethical implications.
  • For instance, if a study erroneously concludes that a particular group is more prone to a disease due to a confounder, it may lead to stigmatization or discrimination.

Researchers must identify these variables and employ rigorous methods to account for them, ensuring that their findings accurately reflect the relationships they intend to study.

Why Account for Confounding Variables

Accounting for confounding variables is crucial in research, as it helps researchers obtain more accurate results with broader applicability. Furthermore, controlling confounders helps maintain internal validity and establish causal relationships between variables.

Accounting for confounding variables also provides proper guidance for health interventions and policies, and it demonstrates scientific rigor and a commitment to producing high-quality, unbiased research. Researchers also have an ethical responsibility to accurately report and interpret their findings.

Researchers must recognize the potential impact of confounders and take adequate steps to identify, measure, and control for them in order to ensure the integrity and reliability of their research findings.

How to Identify Confounding Variables

Recognizing confounding variables is a crucial step in research. Researchers can employ various strategies to identify potential confounders.

[Figure: How to identify confounding variables]

Strategies to Control Confounding Variables

Controlling confounding variables helps researchers conduct more robust research, and employing appropriate strategies to mitigate them is necessary for reliable and accurate research reporting.

1. Randomization

Randomly assigning subjects to experimental and control groups can help distribute confounding variables evenly, reducing their impact.

2. Matching

Matching subjects based on key characteristics can minimize the influence of confounding variables. For example, in a drug trial, matching participants by age, gender, and baseline health status can help control for these factors.
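A minimal sketch of exact matching, assuming pandas and hypothetical participant tables with age-group and gender columns:

```python
import pandas as pd

# Hypothetical participant tables; all column names and values are illustrative.
treated = pd.DataFrame({
    "id": [1, 2, 3],
    "age_group": ["20-29", "30-39", "40-49"],
    "gender": ["F", "M", "F"],
})
controls = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "age_group": ["20-29", "30-39", "40-49", "40-49"],
    "gender": ["F", "M", "M", "F"],
})

# Pair each treated participant with a control sharing the same age group and gender.
pairs = treated.merge(controls, on=["age_group", "gender"],
                      suffixes=("_treated", "_control"))
matched = pairs.drop_duplicates(subset="id_treated")   # keep one match per treated case
print(matched)
```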

3. Statistical Control

Advanced statistical techniques like multiple regression analysis can help account for the influence of known confounding variables in data analysis.
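For instance, a minimal sketch using statsmodels on simulated data (the variable names age, coffee, and risk are illustrative) shows how including a confounder as a covariate changes the estimated effect of the exposure:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1000
age = rng.uniform(20, 80, size=n)                      # confounder
coffee = 0.05 * age + rng.normal(size=n)               # exposure depends on age
risk = 0.10 * age + rng.normal(size=n)                 # outcome depends on age only

df = pd.DataFrame({"age": age, "coffee": coffee, "risk": risk})
naive = smf.ols("risk ~ coffee", data=df).fit()
adjusted = smf.ols("risk ~ coffee + age", data=df).fit()

print(naive.params["coffee"])      # spuriously positive without the confounder
print(adjusted.params["coffee"])   # close to zero once age is held constant
```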

4. Conduct Sensitivity Analysis

Researchers should test the robustness of their findings by conducting sensitivity analyses, systematically varying assumptions about confounding variables to assess their impact on results.

Although these measures can control confounding variables effectively, addressing them ethically is crucial to maintaining research integrity.

Examples of Confounding Variables

Here are some examples of confounding variables:

1. Smoking and Lung Cancer:

In a study investigating the link between smoking and lung cancer, age can be a confounding variable. Older individuals are more likely to both smoke and develop lung cancer. Therefore, if age is not controlled for in the study, it could falsely suggest a stronger association between smoking and lung cancer than actually exists.

2. Education and Income:

Suppose a study is examining the relationship between education level and income. Occupation and years of experience could be confounding variables because certain jobs pay more. Without considering occupation and experience, the study might reach an incorrect conclusion.

3. Coffee Consumption and Heart Disease:

When studying the relationship between coffee consumption and heart disease, lifestyle habits such as exercise can be confounding variables. Unhealthy behaviors like smoking, poor diet, and lack of physical activity can contribute to heart disease. Failing to control for these factors could erroneously attribute heart disease risk solely to coffee consumption.

Controlling confounding variables through study design or statistical techniques is essential to ensure that research findings accurately represent the relationships being studied.

Statistical Approaches When Reporting And Discussing Confounding Variables

Statistical approaches for reporting and discussing confounding variables are essential to ensure the transparency, rigor, and validity of research findings. Here are some key statistical approaches and strategies to consider when dealing with confounding variables:

1. Descriptive Statistics

  • Begin by providing descriptive statistics for the confounding variables.
  • This includes measures such as mean, median, standard deviation, and frequency distribution.
  • This information helps to understand the characteristics of the confounders in your study.

2. Bivariate Analysis

  • Conduct bivariate analyses to examine the unadjusted relationships between the independent variable(s) and the dependent variable, as well as between the independent variable(s) and the confounding variables.

3. Stratification

  • Stratify your analysis by levels or categories of the confounding variable.
  • This allows you to examine the relationship between the independent variable and the dependent variable within each stratum.
  • It can help identify whether the effect of the independent variable varies across different levels of the confounder (see the sketch after this list).
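A minimal sketch of stratification, assuming pandas and simulated data in which smoking confounds a coffee-risk association:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 600
smoker = rng.integers(0, 2, size=n)              # confounder (0 = no, 1 = yes)
coffee = smoker + rng.normal(size=n)             # exposure tracks smoking
risk = 2.0 * smoker + rng.normal(size=n)         # outcome is driven by smoking

df = pd.DataFrame({"smoker": smoker, "coffee": coffee, "risk": risk})
print("pooled correlation:", round(df["coffee"].corr(df["risk"]), 3))

for stratum, g in df.groupby("smoker"):          # correlation within each stratum
    print("smoker =", stratum, round(g["coffee"].corr(g["risk"]), 3))
```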

4. Multivariate Analysis

  • Use multivariate statistical techniques, such as regression analysis, to control for confounding variables.
  • In regression analysis, you can include the confounding variables as covariates in the model.
  • This helps to isolate the effect of the independent variable(s) while holding the confounders constant.

5. Interaction Testing

  • Investigate potential interactions between the independent variable(s) and the confounding variable.
  • Interaction terms in regression models can help determine whether the effect of the independent variable(s) varies based on different levels of the confounder. Interaction tests assess whether the relationship between the independent variable and the dependent variable is modified by the confounder (see the sketch after this list).
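A minimal sketch of an interaction test, assuming statsmodels and simulated data; the formula y ~ x * z expands to the main effects plus the x:z interaction term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 800
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n)                    # potential confounder / effect modifier
y = 1.0 + 2.0 * x + 1.0 * z + 1.5 * x * z + rng.normal(size=n)

df = pd.DataFrame({"x": x, "z": z, "y": y})
model = smf.ols("y ~ x * z", data=df).fit()       # expands to x + z + x:z

print(model.params["x:z"], model.pvalues["x:z"])  # a significant x:z term means the
                                                  # effect of x differs across levels of z
```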

6. Model Fit and Goodness of Fit

  • Assess the fit of your statistical model. This includes checking for goodness-of-fit statistics and examining diagnostic plots.
  • A well-fitting model is important for reliable results.

7. Graphical Representation

  • Utilize graphical representations, such as scatter plots, bar charts, or forest plots, to visualize the relationships between variables and the impact of confounding variables on your results.

These statistical approaches help researchers control for confounding variables and provide a comprehensive understanding of the relationships between variables in their studies. Thorough and transparent reporting and discussion of confounding variables involve a combination of sound statistical analysis and a strong research design. Reporting these variables ethically is also crucial to acknowledging them effectively.

Ethical Considerations While Dealing with Confounding Variables

Ethical considerations play a significant role in dealing with confounding variables in research. Addressing confounding variables ethically is essential to ensure that research is conducted with integrity, transparency, and respect for participants and the broader community. Here are some ethical considerations to keep in mind:

1. Disclosure and Transparency

  • Researchers are ethically obliged to disclose potential confounding variables, as well as their plans for addressing them, in research proposals, publications, and presentations.
  • Moreover, transparent reporting allows readers to assess the study’s validity and the potential impact of confounding.

2. Informed Consent

  • When participants are involved in a study, they should be fully informed about the research objectives , procedures, and potential sources of bias, including confounding variables.
  • Informed consent should include explanations of how confounders will be addressed and why it is important.

3. Minimizing Harm

  • Researchers should take steps to minimize any potential harm to participants that may result from addressing confounding variables.
  • This includes ensuring that data collection and analysis procedures do not cause undue distress or discomfort.

4. Fair and Equitable Treatment

  • Researchers must ensure that the methods used to control for confounding variables are fair and equitable.
  • This means that any adjustments or controls should be applied consistently to all participants or groups in a study to avoid bias or discrimination.

5. Respect for Autonomy

  • Ethical research respects the autonomy of participants.
  • This includes allowing participants to withdraw from the study at any time if they feel uncomfortable with the research process or have concerns about how confounding variables are being managed.

6. Consider Community Impact

  • Consider the broader impact of research on the community.
  • Addressing confounding variables can help ensure that research results are accurate and relevant to the community, ultimately contributing to better-informed decisions and policies.

7. Avoiding Misleading Results

  • Ethical research avoids producing results that are misleading due to unaddressed confounding variables.
  • Misleading results can have serious consequences in fields like medicine and public health, where policies and treatments are based on research findings.

8. Ethical Oversight

  • Research involving human participants often requires ethical review and oversight by institutional review boards or ethics committees.
  • Researchers should follow the guidance and recommendations of these oversight bodies when dealing with confounding variables.

9. Continual Evaluation

  • Ethical research involves ongoing evaluation of the impact of confounding variables and the effectiveness of strategies to control them.
  • Additionally, researchers should be prepared to adjust their methods if necessary to maintain ethical standards.

Researchers must uphold these ethical principles to maintain the trust and credibility of their work within the scientific community and society at large.

The quest for knowledge is not defined solely by the variables you aim to study, but also by the diligence with which you address the complexities associated with confounding variables. This fosters clearer and more accurate reporting of research that is reliable and sound.


Frequently Asked Questions

Confounding bias is a type of bias that occurs when a third variable influences both the independent and dependent variables, leading to erroneous conclusions in research and statistical analysis. It occurs when a third variable (confounding variable) which is not considered in the research design or analysis, is related to both the dependent variable (the outcome of interest) and the independent variable (the factor being studied).

Controlling for confounding variables is a crucial aspect of designing and analyzing research studies. Some methods to control confounding variables are: 1. Randomization 2. Matching 3. Stratification 4. Multivariable Regression Analysis 5. Propensity Score Matching 6. Cohort Studies 7. Restriction 8. Sensitivity Analysis 9. Review Existing Literature 10. Expert Consultation

Some common types of confounding are selection bias, information bias, time-related confounding, age-related confounding, residual confounding, reverse causation, etc.

Confounding variables affect the credibility, applicability, and ethical soundness of a study. Their effects include: 1. Lack of Attribution of Cause and Effect 2. Overestimate or Underestimate Effects 3. Distort Results 4. Reduce Precision and Reliability 5. Introduce Bias 6. Introduce Ethical Implications. To produce valid research, researchers must identify and rigorously account for confounding variables, ensuring that their findings accurately reflect the relationships they intend to study.

Identifying confounding variables is a critical step in research design and analysis. Here are some strategies and approaches to help identify potential confounding variables: 1. Literature Review 2. Subject Matter Knowledge 3. Theoretical Framework 4. Pilot Testing 5. Consultation 6. Hypothesis Testing 7. Directed Acyclic Graphs (DAGs) 8. Statistical Software 9. Expert Review


Confounding Variables in Psychology Research


Confounding variables are external factors (typically a third variable) in research that can interfere with the relationship between dependent and independent variables .

At a Glance

A confounding variable alters the risk of the condition being studied and confuses the “true” relationship between the variables. The role of confounding variables in research is critical to understanding the causes of all kinds of physical, mental, and behavioral conditions and phenomena.

Real World Examples of Confounding Variables

Typical examples of confounding variables often relate to demographics and social and economic outcomes.

For instance, people who are relatively low in socioeconomic status during childhood tend to do, on average, worse financially than others when they reach adulthood, explains Glenn Geher, PhD, professor of psychology at State University of New York at New Paltz and author of “Own Your Psychology Major!” While he said we could simply think this is because poverty begets poverty, he also notes that there are other variables conflated with poverty.

People with lower economic means tend to have less access to high quality education, which is also related to fiscal success in adulthood, Geher explained. Furthermore, poverty is often associated with limited access to healthcare and, thus, with increased risk of adverse health outcomes. These factors can also play roles in fiscal success in adulthood.

“The bottom line here is that when looking to find factors that predict adult economic success, there are many variables that predict this outcome, and so many of these factors are confounded with one another,” Geher said. 

The Impact of Confounding Variables on Research

Psychology researchers must be diligent in controlling for confounding variables, because if they are not, they may draw inaccurate conclusions.

For example, during a research project, Geher’s team found the number of stitches one received in childhood predicted one’s sexual activity in adulthood.

However, Geher said "to conclude that getting stitches causes promiscuous behavior would be unwarranted and odd. In fact, it is much more likely that childhood health outcomes, such as getting stitches, predicts environmental instability during childhood, which has been found to indirectly bear on adult sexual and relationship outcomes,” said Geher.

In other words, the number of stitches is confounded with environmental instability in childhood. It's not that the number of stitches is directly correlated with sexual activity.

Another example that shows confounding variables is the idea that there is a positive correlation between ice cream sales and homicide rates. However, in fact, both these variables are confounded with time of year, said Geher. “They are both higher in summer when days are longer, days are hotter, and people are more likely to encounter others in social contexts because in the winter when it is cold people are more likely to stay home—so they are less likely to buy ice cream cones and to kill others,” he said. 

Both of these are examples of how it is in the best interest of researchers to ensure that they control for confounding variables to increase the likelihood that their conclusions are truly warranted.

Universal confounding variables across research on a particular topic can also be influential. In an evaluation of confounding variables that assessed the effect of alcohol consumption on the risk of ischemic heart disease, researchers found a large variation in the confounders considered across observational studies.

While 85 of 87 studies that the researchers analyzed made a connection between alcohol and ischemic heart disease, the confounding variables that could influence ischemic heart disease included smoking, age, BMI, height, and/or weight. This means that these factors, not just alcohol, could also have affected heart disease.

While most studies mentioned or alluded to “confounding” in their Abstract or Discussion sections, only one stated that their main findings were likely to be affected by confounding variables. The authors concluded that almost all studies ignored or eventually dismissed confounding variables in their conclusions.

Because study results and interpretations may be affected by the mix of potential confounders included within models, the researchers suggest that “efforts are necessary to standardize approaches for selecting and accounting for confounders in observational studies.”

Techniques to Identify Confounding Variables

The best way to control for confounding variables is to conduct “true experimental research,” which means researchers experimentally manipulate a variable that they think causes a certain outcome. They typically do this by randomly assigning study participants to different levels of the first variable, which is referred to as the “independent variable.”

For example, if researchers want to determine if, separate from other factors, receiving a full high-quality education, including a four-year college degree from a respected school, causes positive fiscal outcomes in adulthood, they would need to find a pool of participants, such as a group of young adults from the same broad socioeconomic group as one another. Once the group is selected, half of them would need to be randomly assigned to receive a free, high-quality education and the other half would need to be randomly assigned to not receive such an education.

“This methodology would allow you to see if there are fiscal outcomes on average for the two groups later in life and, if so, you could reasonably conclude that the cause of the differential fiscal outcomes is found in the educational differences across the two groups,” said Geher. “You can draw this conclusion because you randomly assigned the participants to these different groups—a process that naturally controls for confounding variables.”

However, with this process, different problems emerge. For instance, it would not be ethical or practical to randomly assign some participants to a “high-quality education” group and others to a “no-education” group.

“[Controlling] confounding variables via experimental manipulation is not always feasible,” Geher said. 

Because of this, there are also statistical ways to try to control for confounding variables, such as “partial correlation,” which looks at a correlation between two variables (e.g., childhood SES and adulthood SES) while factoring out the effects of a potential confounding variable (e.g., educational attainment).
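A minimal sketch of a residual-based partial correlation on simulated data (numpy and statsmodels assumed; the variable names mirror the SES example above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
education = rng.normal(size=n)                       # potential confounder
childhood_ses = 0.6 * education + rng.normal(size=n)
adult_ses = 0.7 * education + rng.normal(size=n)

def residualize(v, control):
    """Return the part of v not explained by the control variable."""
    return sm.OLS(v, sm.add_constant(control)).fit().resid

raw_r = np.corrcoef(childhood_ses, adult_ses)[0, 1]
partial_r = np.corrcoef(residualize(childhood_ses, education),
                        residualize(adult_ses, education))[0, 1]
print(round(raw_r, 3), round(partial_r, 3))   # the partial correlation shrinks toward zero
```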

However, statistical control is imperfect: confounders that are measured with error are only partially adjusted for, and controlling for an inappropriate variable can itself introduce bias.

“This statistically oriented process is definitely not considered the gold standard compared with true experimental procedures, but often, it is the best you can do given ethical and/or practical constraints,” said Geher.

The Importance of Addressing Confounding Variables in Research

Controlling for confounding variables is critical in research primarily because it allows researchers to make sure that they are drawing valid and accurate conclusions. 

“If you don’t correct for confounding variables, you put yourself at risk for drawing conclusions regarding relationships between variables that are simply wrong (at the worst) or incomplete (at the best),” said Geher.

Controlling for confounding variables includes a basic set of skills when it comes to the social and behavioral sciences, he added. 

The Role of Confounding Variables in Valid Research

Human behavior is highly complex and any single action often has a broad array of variables that underlie it. 

“Understanding the concept of confounding variables, as well as how to control for these variables, makes for better behavioral science with conclusions that are, simply, more valid than research that does not effectively take confounding variables into account,” Geher said.

Wallach JD, Serghiou S, Chu L, et al. Evaluation of confounding in epidemiologic studies assessing alcohol consumption on the risk of ischemic heart disease. BMC Med Res Methodol. 2020;20(1):64. https://doi.org/10.1186/s12874-020-0914-6

Pourhoseingholi MA, Baghestani AR, Vahedi M. How to control confounding effects by statistical analysis. Gastroenterol Hepatol Bed Bench. 2012;5(2):79-83. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4017459/

By Cathy Cassata, a freelance writer who specializes in stories around health, mental health, medical news, and inspirational people.


1.4.1 - Confounding Variables

Randomized experiments are typically preferred over observational studies or experimental studies that lack randomization because they allow for more control. A common problem in studies without randomization is that there may be other variables influencing the results. These are known as  confounding variables . A confounding variable is related to both the explanatory variable and the response variable. 

Confounding variable: a characteristic that varies between cases and is related to both the explanatory and response variables; also known as a lurking variable or a third variable.

Example: Ice Cream & Home Invasions

There is a positive relationship between ice cream sales and home invasions (i.e., as ice cream sales increase throughout the year so do home invasions). It is clear that increases in ice cream sales do not cause home invasions to increase, and home invasions do not cause an increase in ice cream sales. There is a third variable at play here: outdoor temperature. When the weather is warmer both ice cream sales and home invasions increase. In this case, outdoor temperature is a  confounding variable  because it is related to both ice cream sales and home invasions. 

Example: Weight & Preferred Beverage

Research question: Do adults who prefer to drink beer, wine, and water differ in terms of their mean weights?

Data were collected from a sample of World Campus students to address the research question above. The researchers found that adults who preferred beer tended to weigh more than those who preferred wine.

A confounding variable in this study was gender identity. Those who identified as men were more likely to prefer beer and those who identified as women were more likely to prefer wine. In the sample, men weighed more than women on average.

Confounding Variable: Simple Definition and Example


What is a Confounding Variable?

A confounding variable is an “extra” variable that you didn’t account for. They can ruin an experiment and give you useless results. They can suggest there is correlation when in fact there isn’t. They can even introduce bias . That’s why it’s important to know what one is, and how to avoid getting them into your experiment in the first place.


In an experiment, the independent variable typically has an effect on your dependent variable . For example, if you are researching whether lack of exercise leads to weight gain, then lack of exercise is your independent variable and weight gain is your dependent variable. Confounding variables are any other variables that also have an effect on your dependent variable. They are like extra independent variables that have a hidden effect on your dependent variable. Confounding variables can cause two major problems:

  • Increase variance
  • Introduce bias.

Let’s say you test 200 volunteers (100 men and 100 women). You find that lack of exercise leads to weight gain. One problem with your experiment is that it lacks any control variables: for example, the use of placebos, or random assignment to groups. So you really can’t say for sure whether lack of exercise leads to weight gain. One confounding variable is how much people eat. It’s also possible that men eat more than women; this could also make sex a confounding variable. Nothing was mentioned about starting weight, occupation, or age either. A poor study design like this could lead to bias. For example, if all of the women in the study were middle-aged and all of the men were aged 16, age would have a direct effect on weight gain. That makes age a confounding variable.

Confounding Bias

Technically, confounding isn’t a true bias, because bias is usually a result of errors in data collection or measurement. However, one definition of bias is “…the tendency of a statistic to overestimate or underestimate a parameter”, so in this sense, confounding is a type of bias.

Confounding bias is the result of having confounding variables in your model. It has a direction, depending on whether it over- or underestimates the effects of your model:

  • Positive confounding is when the observed association is biased away from the null. In other words, it overestimates the effect.
  • Negative confounding is when the observed association is biased toward the null. In other words, it underestimates the effect.

How to Reduce Confounding Variables

Make sure you identify all of the possible confounding variables in your study. Make a list of everything you can think of and, one by one, consider whether those listed items might influence the outcome of your study. Usually, someone has done a similar study before you, so check the academic databases for ideas about what to include on your list. Once you have figured out the variables, use one of the following techniques to reduce the effect of those confounding variables:

  • Bias can be eliminated with random samples.
  • Introduce control variables to control for confounding variables. For example, you could control for age by only measuring 30-year-olds.
  • Within-subjects designs test the same subjects each time. Anything could happen to the test subjects in the “between” period, so this doesn’t make for perfect immunity from confounding variables.
  • Counterbalancing can be used if you have paired designs. In counterbalancing, half of the group is measured under condition 1 and half is measured under condition 2.




Research Method


Confounding Variable – Definition, Method and Examples



Definition:

A confounding variable is an extraneous variable that is not the main variable of interest in a study but can affect the outcome of the study. Confounding variables can obscure or distort the true relationship between the independent and dependent variables being studied.

Confounding Variable Control Methods

Methods for controlling confounding variables in research are as follows:

Randomization

Randomization is a powerful method for controlling confounding variables in experimental research. By randomly assigning participants to different groups, researchers can ensure that any extraneous factors that could influence the outcome variable are evenly distributed across the groups.
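A minimal sketch of random assignment, assuming numpy and pandas with simulated participant data; after randomization the potential confounders should have roughly equal means in each group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(40, 12, size=n),        # potential confounder
    "smoker": rng.integers(0, 2, size=n),     # another potential confounder
})
df["group"] = rng.permutation(["treatment", "control"] * (n // 2))

# With random assignment, confounders average out across the groups.
print(df.groupby("group")[["age", "smoker"]].mean())
```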

Matching

Matching is a method used in observational studies to control for confounding variables. In this method, researchers match participants on one or more variables that could influence the outcome variable, such as age or gender.

Statistical Analysis

Statistical analysis is used to control for confounding variables in both experimental and observational studies. This can be achieved through the use of regression analysis, which allows researchers to control for the effects of confounding variables on the outcome variable.

Restriction

Restriction involves limiting the range of values for the confounding variable. For example, researchers might only include participants within a certain age range to control for age-related differences.

Stratification

Stratification involves dividing the sample into subgroups based on the confounding variable. Researchers can then compare the outcome variable across the subgroups to determine if the relationship holds for each subgroup.

Design Control

Design control refers to the process of carefully designing the study to minimize the potential for confounding variables. This can involve selecting a representative sample, controlling for extraneous variables, and using appropriate measures to assess the outcome variable.

Confounding Variable Examples

Confounding Variable Examples are as follows:

  • Age : Suppose that a study is investigating the effect of a new teaching method on student performance in a particular subject. If the students’ ages are not controlled for, age could be a confounding variable as older students may perform better due to greater maturity or prior knowledge.
  • Gender : Suppose a study is investigating the effect of a new medication on blood pressure. If the study does not control for gender, gender could be a confounding variable as women generally have lower blood pressure than men.
  • Socioeconomic status : Suppose a study is investigating the relationship between physical activity and health outcomes. If the study does not control for socioeconomic status, it could be a confounding variable as people with higher socioeconomic status may have better access to facilities for exercise and better nutrition.
  • Time of day: Suppose a study is investigating the effect of caffeine on alertness. If the study is conducted at different times of day, time of day could be a confounding variable as individuals may naturally be more alert at certain times of the day.
  • Environmental factors : Suppose a study is investigating the effect of a new air purifier on asthma symptoms. If the study does not control for environmental factors such as pollen or pollution levels, they could be a confounding variable as these factors could affect asthma symptoms independent of the air purifier.
  • Placebo effect : Suppose a study is investigating the effect of a new drug on pain relief. If the study does not control for the placebo effect, it could be a confounding variable as participants may experience a reduction in pain simply due to the belief that they are receiving a treatment.

Applications of Confounding Variable

Here are some applications of confounding variables:

  • Control for Confounding Variables : In experimental research, researchers try to control for confounding variables by holding them constant or statistically adjusting for them in the analysis. This helps to isolate the effects of the independent variable on the dependent variable.
  • Identifying Alternative Explanations : Confounding variables can help researchers identify alternative explanations for their findings. By examining the potential confounding variables, researchers can better understand the factors that may be contributing to the relationship between the independent and dependent variables.
  • Generalizability : Researchers can use confounding variables to improve the generalizability of their findings. By including a diverse range of participants and controlling for potential confounding variables, researchers can better understand how their findings apply to different populations.
  • Real-world Applications : Understanding confounding variables can have real-world applications. For example, in medical research, understanding the potential confounding variables can help clinicians better understand the effectiveness of treatments and improve patient outcomes.
  • Improving Study Design : By considering the potential confounding variables, researchers can improve the design of their studies to reduce the potential for confounding variables to impact their findings.

When to identify Confounding Variable

Identifying confounding variables is an essential step in designing and conducting research. Confounding variables are factors that may impact the relationship between the independent variable and the dependent variable, and they can potentially distort the study’s results. Here are some key points to consider when identifying confounding variables:

  • Before conducting the study: Researchers should identify potential confounding variables before the study begins. This allows them to design the study to control for or adjust for confounding variables to ensure that the results are reliable and valid.
  • During data collection: As researchers collect data, they may identify additional confounding variables that were not anticipated during the study’s design. In such cases, researchers may need to modify the study’s design or analysis to account for the newly identified confounding variables.
  • Statistical analysis: During the analysis, researchers should examine the relationship between the independent and dependent variables while controlling for potential confounding variables. This helps to isolate the effects of the independent variable on the dependent variable.
  • Reporting results: Researchers should report the potential confounding variables that were identified and how they were controlled for or adjusted for in the analysis. This helps other researchers to interpret and replicate the findings accurately.

Purpose of Confounding Variable

The purpose of identifying and controlling for confounding variables in research is to ensure that the relationship between the independent variable and the dependent variable is accurately measured. Confounding variables can introduce bias into a study, making it difficult to determine the true relationship between the variables of interest. By identifying and controlling for confounding variables, researchers can:

  • Improve the validity of the study: Confounding variables can introduce bias into a study, making it difficult to determine whether the results accurately reflect the relationship between the independent and dependent variables. By controlling for confounding variables, researchers can ensure that the results of their study are valid and accurately reflect the relationship between the variables of interest.
  • Improve the reliability of the study: Confounding variables can also affect the reliability of a study by making it more difficult to replicate the results. By controlling for confounding variables, researchers can ensure that their study is reliable and can be replicated by others.
  • Improve the generalizability of the study: Confounding variables can also affect the generalizability of a study by making it difficult to apply the results to other populations. By controlling for confounding variables, researchers can improve the generalizability of their study and increase the likelihood that the results can be applied to other populations.

Characteristics of Confounding Variable

Here are some characteristics of confounding variables:

  • Related to both the independent and dependent variables: Confounding variables are related to both the independent and dependent variables, meaning that they have an impact on both of these variables.
  • Associated with the outcome variable: Confounding variables are associated with the outcome variable or the dependent variable. This means that they can potentially affect the results of the study and make it difficult to determine the true relationship between the independent and dependent variables.
  • Not part of the study’s design: Confounding variables are not part of the study’s design, meaning that they are not intentionally measured or manipulated by the researcher.
  • Can introduce bias: Confounding variables can introduce bias into a study, making it difficult to determine the true effect of the independent variable on the dependent variable.
  • Can be controlled for: While confounding variables cannot be eliminated, they can be controlled for in the study’s design or statistical analysis. This helps to ensure that the true relationship between the independent and dependent variables is accurately measured.
  • Can affect generalizability : Confounding variables can also affect the generalizability of a study, making it difficult to apply the results to other populations or settings.

Advantages of Confounding Variable

Here are some advantages of confounding variables:

  • Improved accuracy of results: By controlling for confounding variables, researchers can improve the accuracy of their results. By isolating the effect of the independent variable on the dependent variable, researchers can determine the true relationship between these variables and avoid any distortions introduced by confounding variables.
  • More reliable results: Controlling for confounding variables can also lead to more reliable results. By minimizing the impact of confounding variables on the study, researchers can increase the likelihood that their findings are accurate and can be replicated by others.
  • Greater generalizability: Controlling for confounding variables can also increase the generalizability of the study. By minimizing the impact of confounding variables, researchers can increase the likelihood that their findings are applicable to other populations or settings.
  • Improved study design: The process of identifying and controlling for confounding variables can also improve the overall study design. By considering potential confounding variables during the study design phase, researchers can develop more robust studies that are better able to isolate the effect of the independent variable on the dependent variable.

Limitations of Confounding Variable

  • Identification : One limitation of confounding variables is that they may be difficult to identify. Confounding variables can come from a variety of sources and may be difficult to measure or control for in a study.
  • Time and resource constraints: Controlling for confounding variables can also be time-consuming and resource-intensive. This can limit the ability of researchers to fully control for all potential confounding variables.
  • Reduced sample size : Controlling for confounding variables may also require a larger sample size, which can be costly and time-consuming.
  • Limitations of statistical methods : While statistical methods can be used to control for confounding variables, there are limitations to these methods. For example, some statistical methods assume that the relationship between the independent and dependent variables is linear, which may not always be the case.
  • Potential for overadjustment : Controlling for too many confounding variables can also lead to overadjustment, where the relationship between the independent and dependent variables is obscured.

Disadvantages of Confounding Variable

Some disadvantages of confounding variables are as follows:

  • They can obscure or distort the true relationship between the independent and dependent variables, making it difficult to draw accurate conclusions.
  • They can make it challenging to replicate research findings because the confounding variable may not be accounted for in subsequent studies.
  • They can lead to incorrect conclusions about causality, as the observed relationship between the independent and dependent variables may be due to the confounding variable and not the independent variable.
  • They can reduce the precision of estimates and increase the variability of results.
  • They can lead to false associations or overestimation of the effect size of the independent variable.
  • They can also limit the generalizability of research findings to other populations or settings.

About the author: Muhammad Hassan, Researcher, Academic Writer, Web developer



Confounding Variables | Definition, Examples & Controls

Published on 4 May 2022 by Lauren Thomas. Revised on 12 April 2023.

In research that investigates a potential cause-and-effect relationship, a confounding variable is an unmeasured third variable that influences both the supposed cause and the supposed effect.

It’s important to consider potential confounding variables and account for them in your research design to ensure your results are valid .


Confounding variables (aka confounders or confounding factors) are a type of extraneous variable related to a study’s independent and dependent variables . A variable must meet two conditions to be a confounder:

  • It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.
  • It must be causally related to the dependent variable.


To ensure the internal validity of your research, you must account for confounding variables. If you fail to do so, your results may not reflect the actual relationship between the variables that you are interested in.

For instance, you may find a cause-and-effect relationship that does not actually exist, because the effect you measure is caused by the confounding variable (and not by your independent variable).

Even if you correctly identify a cause-and-effect relationship, confounding variables can result in over- or underestimating the impact of your independent variable on your dependent variable.

There are several methods of accounting for confounding variables. You can use the following methods when studying any type of subjects (humans, animals, plants, chemicals, etc). Each method has its own advantages and disadvantages.

Restriction

In this method, you restrict your treatment group by only including subjects with the same values of potential confounding factors.

Since these values do not differ among the subjects of your study, they cannot correlate with your independent variable and thus cannot confound the cause-and-effect relationship you are studying.

  • Pro: Relatively easy to implement
  • Con: Restricts your sample a great deal
  • Con: You might fail to consider other potential confounders

Matching

In this method, you select a comparison group that matches the treatment group. Each member of the comparison group should have a counterpart in the treatment group with the same values of potential confounders, but different independent variable values.

This allows you to eliminate the possibility that differences in confounding variables cause the variation in outcomes between the treatment and comparison group. If you have accounted for any potential confounders, you can thus conclude that the difference in the independent variable must be the cause of the variation in the dependent variable.
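As an illustrative sketch only, matching on measured confounders can be automated in R, for example with the MatchIt package. The data frame and column names below are hypothetical, and nearest-neighbour matching is just one of several possible approaches:

```r
# Minimal matching sketch (hypothetical data; MatchIt assumed installed).
library(MatchIt)

set.seed(42)
df <- data.frame(
  treated = rbinom(200, 1, 0.4),            # independent variable (exposure)
  age     = rnorm(200, mean = 40, sd = 10), # potential confounder
  income  = rnorm(200, mean = 50, sd = 15), # potential confounder
  outcome = rnorm(200)                      # dependent variable
)

# Pair each treated subject with the most similar untreated subject
# on the potential confounders.
m <- matchit(treated ~ age + income, data = df, method = "nearest")
matched_df <- match.data(m)

# Outcomes can then be compared within the matched sample, where the
# groups are balanced on the matched confounders.
summary(m)
```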

  • Pro: Allows you to include more subjects than restriction
  • Con: Can prove difficult to implement, since you need pairs of subjects that match on every potential confounding variable
  • Con: Other variables that you cannot match on might also be confounding variables

Statistical control

If you have already collected the data, you can include the possible confounders as control variables in your regression models ; in this way, you will control for the impact of the confounding variable.

Any effect that the potential confounding variable has on the dependent variable will show up in the results of the regression and allow you to separate the impact of the independent variable.
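As a minimal sketch of what this looks like in practice, the simulated data and variable names below are purely illustrative; the key point is that adding the confounder to the regression separates its effect from that of the independent variable:

```r
# Minimal sketch: controlling for a confounder by including it in the model.
# 'exposure', 'confounder', and 'outcome' are hypothetical variable names.
set.seed(1)
n <- 500
confounder <- rnorm(n)
exposure   <- 0.5 * confounder + rnorm(n)                # confounder influences the exposure
outcome    <- 0.3 * exposure + 0.8 * confounder + rnorm(n)
df <- data.frame(outcome, exposure, confounder)

# Unadjusted model: the exposure coefficient absorbs part of the confounder's effect.
coef(lm(outcome ~ exposure, data = df))

# Adjusted model: adding the confounder as a control variable separates
# its impact from that of the exposure.
coef(lm(outcome ~ exposure + confounder, data = df))
```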

  • Pro: Easy to implement
  • Pro: Can be performed after data collection
  • Con: You can only control for variables that you observe directly; other confounding variables you have not accounted for might remain

Randomisation

Another way to minimise the impact of confounding variables is to randomise the values of your independent variable. For instance, if some of your participants are assigned to a treatment group while others are in a control group , you can randomly assign participants to each group.

Randomisation ensures that with a sufficiently large sample, all potential confounding variables (even those you cannot directly observe in your study) will have the same average value between different groups. Since these variables do not differ by group assignment, they cannot correlate with your independent variable and thus cannot confound your study.

Since this method allows you to account for all potential confounding variables, which is nearly impossible to do otherwise, it is often considered to be the best way to reduce the impact of confounding variables.
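As a small illustration (the group labels and sample size are arbitrary), simple randomisation amounts to a random permutation of group assignments:

```r
# Randomly assign participants to treatment or control in equal numbers.
set.seed(7)
n <- 100
group <- sample(rep(c("treatment", "control"), each = n / 2))

# With a sufficiently large sample, measured and unmeasured participant
# characteristics should balance out across the two groups on average.
table(group)
```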

  • Pro: Allows you to account for all possible confounding variables, including ones that you may not observe directly
  • Pro: Considered the best method for minimising the impact of confounding variables
  • Con: Most difficult to carry out
  • Con: Must be implemented prior to beginning data collection
  • Con: You must ensure that only those in the treatment (and not control) group receive the treatment

A confounding variable , also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship.

A confounding variable is related to both the supposed cause and the supposed effect of the study. It can be difficult to separate the true effect of the independent variable from the effect of the confounding variable.

In your research design , it’s important to identify potential confounding variables and plan how you will reduce their impact.

There are several methods you can use to decrease the impact of confounding variables on your research: restriction, matching, statistical control, and randomisation.

In restriction , you restrict your sample by only including certain subjects that have the same values of potential confounding variables.

In matching , you match each of the subjects in your treatment group with a counterpart in the comparison group. The matched subjects have the same values on any potential confounding variables, and only differ in the independent variable .

In statistical control , you include potential confounders as variables in your regression .

In randomisation , you randomly assign the treatment (or independent variable) in your study to a sufficiently large number of subjects, which allows you to control for all potential confounding variables.

An extraneous variable is any variable that you’re not investigating that can potentially affect the dependent variable of your research study.

A confounding variable is a type of extraneous variable that not only affects the dependent variable, but is also related to the independent variable.

A confounding variable is closely related to both the independent and dependent variables in a study. An independent variable represents the supposed cause , while the dependent variable is the supposed effect . A confounding variable is a third variable that influences both the independent and dependent variables.

Failing to account for confounding variables can cause you to wrongly estimate the relationship between your independent and dependent variables.

To ensure the internal validity of your research, you must consider the impact of confounding variables. If you fail to account for them, you might over- or underestimate the causal relationship between your independent and dependent variables , or even find a causal relationship where none exists.



Confounding Variables

Travis Dixon | October 24, 2016 | Research Methodology


Sometimes factors other than the IV may influence the DV in an experiment. These unwanted influences are called confounding variables . In laboratory experiments, researchers attempt to minimize their influence by carefully designing their experiment so all conditions are exactly the same – the only thing that’s different is the independent variable.


Here are some confounding variables that you need to be looking out for in experiments:

  • Order Effects
  • Participant variability
  • Social desirability effect
  • Hawthorne effect
  • Demand characteristics
  • Evaluation apprehension

ORDER EFFECTS: In repeated measures experiments one must be careful of order effects. Sometimes the order in which a participant does a task may alter the results. For instance, they may get better with practice, or they may remember something from the first condition that alters their results in the second.

Counterbalancing is one way of controlling for order effects. With counterbalancing, a repeated measures design is still used, but half the group completes Condition A then Condition B, while the other half completes them in the opposite order. Using an independent samples design also controls for order effects.

IB Psych IA Tips: When explaining your Design in the IB Psych IA, try to identify one or more extraneous variables you’re controlling for. 

PARTICIPANT VARIABILITY is the extent to which participants differ from one another, and it is another potential factor that could influence an experiment’s results. For instance, in a study on the effects of a new training technique on fitness levels, the existing fitness of the participants might be quite varied. This can easily be controlled for through either random allocation or a matched pairs design.

DEMAND CHARACTERISTICS are the cues in a study (characteristics) that may lead the participant to figure out how they’re supposed to act (according to the demands of the researcher/experiment). It leads to participants behaving in a way that they think they’re supposed to, not how they would naturally.


The placebo effect is a type of participant expectancy effect.

PARTICIPANT EXPECTANCY EFFECT is the name given to the change in behaviour that results from participants behaving in a way that they think they’re expected to. In other words, demand characteristics in an experiment’s design might lead to a participant expectancy effect occurring. These terms are commonly used incorrectly (Read more: Demand characteristics: What are they really?).

SOCIAL DESIRABILITY EFFECT is when people change their behaviour because they have a natural desire to be liked by other people. Another factor that influences people’s behaviour is that they don’t act like they normally would simply because they are being watched by someone. This was first recorded in a study at the Hawthorne Electrical Plant in the USA and has become known as the HAWTHORNE EFFECT. In the original Hawthorne Plant research they found the workers were working harder simply because they were being watched. Doesn’t this happen in the classroom? Suddenly when the teacher starts walking around the room checking work you close YouTube, put away your phone, tuck away the love-letter, etc.

The terms “confounding variable” and “extraneous variable” are often used interchangeably. Technically speaking, an extraneous variable is any variable that could affect the results, whereas “confounding occurs when the influence of extraneous variables on the DVs cannot be separated and measured” (Street et al., 1995).

EVALUATION APPREHENSION might occur when participants are anxious about being evaluated on a particular task or skill (sometimes called the spotlight effect). This might change their behaviour. Think about your oral assignments in some of your subjects, for instance. If you weren’t being graded you might be OK talking in front of your class, but as soon as your teacher gets out their big red pen and begins giving you a grade on your work you’re likely to become nervous, and this will affect your performance. People are often nervous about being in an “experiment” because the word might conjure many scary thoughts.


Psychologists must balance validity, practicality and ethicality when designing experiments.

Some textbooks also mention maturation – when participants get better on the second or third trial simply because they have practiced the skill (like order effects). Information contamination is another term sometimes used. This is when outside information affects the results of the experiment.

You don’t have to know a confounding variable by name to evaluate an experiment. The following “experiment” has the independent variable of chewing gum. However, there are many flaws in this experiment. These flaws raise issues about the experiment’s validity (it’s really a commercial for gum so it’s heavily biased). What confounding variables and/or methodological limitations can you find in this experiment?

Street, D. L. (1995).  Controlling extraneous variables in experimental research: a research note. Accounting Education, 4(2), 169–188.



Confounding Variable / Third Variable

Confounding variables (aka third variables) are variables that the researcher failed to control or eliminate, damaging the internal validity of an experiment.


What are Confounding Variables?

A confounding variable, also known as a third variable, influences both the independent variable and the dependent variable (it should not be confused with a mediator variable, which lies on the causal pathway between them). Being unaware of or failing to control for confounding variables may cause the researcher to analyze the results incorrectly. The results may show a false correlation between the dependent and independent variables, leading to an incorrect rejection of the null hypothesis.


The Problem with Confounding Variables

For example, a research group might design a study to determine if heavy drinkers die at a younger age.

They proceed to design a study, and set about gathering data. Their results, and a battery of statistical tests , indeed show that people who drink excessively are likely to die younger.

Unfortunately, when the researchers gather data from their subjects’ non-drinking peers , they discover that they, too, die earlier than average. Maybe there is another factor, not measured , that influences both drinking and longevity?

The weakness in the experimental design was that they failed to take into account  confounding variables , and did not try to eliminate or control any other factors.

Imagine that in this case, there is in fact no relationship between drinking and longevity. But there may be other variables which bring about both heavy drinking and decreased longevity. If they are unaware of these variables, the researchers may assume that heavy drinking is causing reduced longevity, i.e. they’ll make what’s called a “spurious association.” In reality, decreased longevity may be better explained by a third, confounding variable.


For example, it is quite possible that the heaviest drinkers hailed from a different background or social group. This group might be, for unrelated reasons, shorter lived than other groups. Heavy drinkers may be more likely to smoke, or eat junk food, all of which could be factors in reducing longevity. In any case, it is the fact they belong to this group that is responsible for their decreased longevity, and not heavy drinking.

Without controlling for potential confounding variables, the internal validity of the experiment is undermined.

Extraneous Variables

Any variable that researchers are not deliberately studying in an experiment is an extraneous (outside) variable that could threaten the validity of the results. In the example above, these could include age and gender, junk food consumption or marital status.

An extraneous variable becomes a confounding variable when it varies along with the factors you are actually interested in. In other words, it becomes difficult to separate out which effect belongs to which variable, complicating the data.

To return to the example, age might be an extraneous variable. The researchers could control for age by making sure that everyone in the experiment is the same age. If they didn’t, age would become a confounding variable.

Any time there is another variable in an experiment that offers an alternative explanation for the outcome, it has the potential to become a confounding variable. Researchers must therefore control for these as much as possible.

Minimizing the Effects of Confounding Variables

A well-planned experimental design , and constant checks, will filter out the worst confounding variables.

For example, randomizing groups, utilizing strict controls, and sound operationalization practice all contribute to eliminating potential third variables.

After research, when the results are discussed and assessed by a group of peers, this is the area that stimulates the most heated debate. When you read stories of different foods increasing your risk of cancer, or hear claims about the next super-food, assess these findings carefully.

Many media outlets jump on sensational results, but never pay any regard to the possibility of confounding variables.

Mini-quiz: 

Imagine that a research project attempts to study the effect of a popular herbal antidepressant. They sample participants from an online alternative medicine group and ask them to take the remedy for a month. The participants complete a depression inventory before and after the month to measure whether they experience any improvement in their mood. The researchers do indeed find that the participants’ moods are better after a month of treatment.

Can you identify any variables which may have confounded this result? The answer is at the bottom of the page.

Correlation and Causation

The principle is closely related to the problem of correlation and causation .

For example, a scientist performs statistical tests, sees a correlation and incorrectly announces that there is a causal link between two variables.

The problem is that the research has not actually isolated a true cause and effect relationship. It is similar to a researcher who notices that the fewer storks there are in a country, the lower the birth rate is. They would be mistaken to assume that a decrease in storks causes a decrease in birth rate.

Though these factors might show some correlation, it doesn’t mean that one is causing the other. In fact, two variables may move with one another purely by coincidence!

Constant monitoring, before, during and after an experiment , is the only way to ensure that any confounding variables are eliminated.

Statistical tests, whilst excellent for detecting correlations, will readily flag associations that have no causal basis.

Human judgment is always needed to eliminate any underlying problems, ensuring that researchers do not jump to conclusions .

Mini-quiz Answer

The fact that the participants were sampled from a group with an interest in alternative medicine may mean that a third variable, their belief in the effectiveness of the remedy, was responsible. You may have thought of other confounding variables. For example, their mood might have improved for a number of other unrelated reasons, like a change in weather, holidays, or an improvement in personal circumstances.




Experimental vs Observational Studies: Differences & Examples

Understanding the differences between experimental vs observational studies is crucial for interpreting findings and drawing valid conclusions. Both methodologies are used extensively in various fields, including medicine, social sciences, and environmental studies. 

Researchers often use observational and experimental studies to gather comprehensive data and draw robust conclusions about their investigating phenomena. 

This blog post will explore what makes these two types of studies unique, their fundamental differences, and examples to illustrate their applications.

What is an Experimental Study?

An experimental study is a research design in which the investigator actively manipulates one or more variables to observe their effect on another variable. This type of study often takes place in a controlled environment, which allows researchers to establish cause-and-effect relationships.

Key Characteristics of Experimental Studies:

  • Manipulation: Researchers manipulate the independent variable(s).
  • Control: Other variables are kept constant to isolate the effect of the independent variable.
  • Randomization: Subjects are randomly assigned to different groups to minimize bias.
  • Replication: The study can be replicated to verify results.

Types of Experimental Study

  • Laboratory Experiments: Conducted in a controlled environment where variables can be precisely controlled.
  • Field Research: These are conducted in a natural setting but still involve manipulation and control of variables.
  • Clinical Trials: Used in medical research and the healthcare industry to test the efficacy of new treatments or drugs.

Example of an Experimental Study:

Imagine a study to test the effectiveness of a new drug for reducing blood pressure. Researchers would:

  • Randomly assign participants to two groups: receiving the drug and receiving a placebo.
  • Ensure that participants do not know their group (double-blind procedure).
  • Measure blood pressure before and after the intervention.
  • Compare the changes in blood pressure between the two groups to determine the drug’s effectiveness.

What is an Observational Study?

An observational study is a research design in which the investigator observes subjects and measures variables without intervening or manipulating the study environment. This type of study is often used when manipulating variables would be impractical or unethical.

Key Characteristics of Observational Studies:

  • No Manipulation: Researchers do not manipulate the independent variable.
  • Natural Setting: Observations are made in a natural environment.
  • Causation Limitations: It is difficult to establish cause-and-effect relationships due to the lack of control over variables.
  • Descriptive: Often used to describe characteristics or outcomes.

Types of Observational Studies: 

  • Cohort Studies: Follow a group of people over time to observe the development of outcomes.
  • Case-Control Studies: Compare individuals with a specific outcome (cases) to those without (controls) to identify factors that might contribute to the outcome.
  • Cross-Sectional Studies: Collect data from a population at a single point in time to analyze the prevalence of an outcome or characteristic.

Example of an Observational Study:

Consider a study examining the relationship between smoking and lung cancer. Researchers would:

  • Identify a cohort of smokers and non-smokers.
  • Follow both groups over time to record incidences of lung cancer.
  • Analyze the data to observe any differences in cancer rates between smokers and non-smokers.

Difference Between Experimental vs Observational Studies

| Topic | Experimental Studies | Observational Studies |
|---|---|---|
| Manipulation | Yes | No |
| Control | High control over variables | Little to no control over variables |
| Randomization | Yes, often random assignment of subjects | No random assignment |
| Environment | Controlled or laboratory settings | Natural or real-world settings |
| Causation | Can establish causation | Can identify correlations, not causation |
| Ethics and practicality | May involve ethical concerns and be impractical | More ethical and practical in many cases |
| Cost and time | Often more expensive and time-consuming | Generally less costly and faster |

Choosing Between Experimental and Observational Studies

The choice between an experimental and an observational design depends on the research question, on whether the variables of interest can be manipulated and randomised, and on the ethical and practical constraints of the study.

Use Experimental Studies When:

  • Causality is Important: If determining a cause-and-effect relationship is crucial, experimental studies are the way to go.
  • Variables Can Be Controlled: When you can manipulate and control the variables in a lab or controlled setting, experimental studies are suitable.
  • Randomization is Possible: When random assignment of subjects is feasible and ethical, experimental designs are appropriate.

Use Observational Studies When:

  • Ethical Concerns Exist: If manipulating variables is unethical, such as exposing individuals to harmful substances, observational studies are necessary.
  • Practical Constraints Apply: When experimental studies are impractical due to cost or logistics, observational studies can be a viable alternative.
  • Natural Settings Are Required: If studying phenomena in their natural environment is essential, observational studies are the right choice.

Strengths and Limitations

Experimental Studies

Strengths:

  • Establish Causality: Experimental studies can establish causal relationships between variables by controlling and using randomization.
  • Control Over Confounding Variables: The controlled environment allows researchers to minimize the influence of external variables that might skew results.
  • Repeatability: Experiments can often be repeated to verify results and ensure consistency.

Limitations:

  • Ethical Concerns: Manipulating variables may be unethical in certain situations, such as exposing individuals to harmful conditions.
  • Artificial Environment: The controlled setting may not reflect real-world conditions, potentially affecting the generalizability of results.
  • Cost and Complexity: Experimental studies can be costly and logistically complex, especially with large sample sizes.

Observational Studies

Strengths:

  • Real-World Insights: Observational studies provide valuable insights into how variables interact in natural settings.
  • Ethical and Practical: These studies avoid ethical concerns associated with manipulation and can be more practical regarding cost and time.
  • Diverse Applications: Observational studies can be used in various fields and situations where experiments are not feasible.
Limitations:

  • Lack of Causality: Without manipulation it is difficult to establish causation; results are limited to identifying correlations.
  • Potential for Confounding: Uncontrolled external variables may influence the results, leading to biased conclusions.
  • Observer Bias: Researchers may unintentionally influence outcomes through their expectations or interpretations of data.

Examples in Various Fields

Healthcare

  • Experimental Study: Clinical trials testing the effectiveness of a new drug against a placebo to determine its impact on patient recovery.
  • Observational Study: Studying the dietary habits of different populations to identify potential links between nutrition and disease prevalence.

Psychology

  • Experimental Study: Conducting a lab experiment to test the effect of sleep deprivation on cognitive performance by controlling sleep hours and measuring test scores.
  • Observational Study: Observing social interactions in a public setting to explore natural communication patterns without intervention.

Environmental Science

  • Experimental Study: Testing the impact of a specific pollutant on plant growth in a controlled greenhouse setting.
  • Observational Study: Monitoring wildlife populations in a natural habitat to assess the effects of climate change on species distribution.

How QuestionPro Research Can Help in Experimental vs Observational Studies

Choosing between experimental and observational studies is a critical decision that can significantly impact the outcomes and interpretations of a study. QuestionPro Research offers powerful tools and features that can enhance both types of studies, giving researchers the flexibility and capability to gather, analyze, and interpret data effectively.

Enhancing Experimental Studies with QuestionPro

Experimental studies require a high degree of control over variables, randomization, and, often, repeated trials to establish causal relationships. QuestionPro excels in facilitating these requirements through several key features:

  • Survey Design and Distribution: With QuestionPro, researchers can design intricate surveys tailored to their experimental needs. The platform supports random assignment of participants to different groups, ensuring unbiased distribution and enhancing the study’s validity.
  • Data Collection and Management: Real-time data collection and management tools allow researchers to monitor responses as they come in. This is crucial for experimental studies where data collection timing and sequence can impact the results.
  • Advanced Analytics: QuestionPro offers robust analytical tools that can handle complex data sets, enabling researchers to conduct in-depth statistical analyses to determine the effects of the experimental interventions.

Supporting Observational Studies with QuestionPro

Observational studies involve gathering data without manipulating variables, focusing on natural settings and real-world scenarios. QuestionPro’s capabilities are well-suited for these studies as well:

  • Customizable Surveys: Researchers can create detailed surveys to capture a wide range of observational data. QuestionPro’s customizable templates and question types allow for flexibility in capturing nuanced information.
  • Mobile Data Collection: For field research, QuestionPro’s mobile app enables data collection on the go, making it easier to conduct studies in diverse settings without internet connectivity.
  • Longitudinal Data Tracking: Observational studies often require data collection over extended periods. QuestionPro’s platform supports longitudinal studies, allowing researchers to track changes and trends.

Experimental and observational studies are essential tools in the researcher’s toolkit. Each serves a unique purpose and offers distinct advantages and limitations. By understanding their differences, researchers can choose the most appropriate study design for their specific objectives, ensuring their findings are valid and applicable to real-world situations.

Whether establishing causality through experimental studies or exploring correlations with observational research designs, the insights gained from these methodologies continue to shape our understanding of the world around us. 

Whether conducting experimental or observational studies, QuestionPro Research provides a comprehensive suite of tools that enhance research efficiency, accuracy, and depth. By leveraging its advanced features, researchers can ensure that their studies are well-designed, their data is robustly analyzed, and their conclusions are reliable and impactful.




What Is a Control Variable? Definition and Examples


A control variable is any factor that is controlled or held constant during an experiment . For this reason, it’s also known as a controlled variable or a constant variable. A single experiment may contain many control variables . Unlike the independent and dependent variables , control variables aren’t a part of the experiment, but they are important because they could affect the outcome. Take a look at the difference between a control variable and control group and see examples of control variables.

Importance of Control Variables

Remember, the independent variable is the one you change, the dependent variable is the one you measure in response to this change, and the control variables are any other factors you control or hold constant so that they can’t influence the experiment. Control variables are important because:

  • They make it easier to reproduce the experiment.
  • They increase confidence in the outcome of the experiment.

For example, if you conducted an experiment examining the effect of the color of light on plant growth, but you didn’t control temperature, it might affect the outcome. One light source might be hotter than the other, affecting plant growth. This could lead you to incorrectly accept or reject your hypothesis. As another example, say you did control the temperature. If you did not report this temperature in your “methods” section, another researcher might have trouble reproducing your results. What if you conducted your experiment at 15 °C? Would you expect the same results at 5 °C or 35 °C? Sometimes the potential effect of a control variable can lead to a new experiment!

Sometimes you think you have controlled everything except the independent variable, but still get strange results. This could be due to what is called a “ confounding variable .” Examples of confounding variables could be humidity, magnetism, and vibration. Sometimes you can identify a confounding variable and turn it into a control variable. Other times, confounding variables cannot be detected or controlled.

Control Variable vs Control Group

A control group is different from a control variable. You expose a control group to all the same conditions as the experimental group, except you change the independent variable in the experimental group. Both the control group and experimental group should have the same control variables.

Control Variable Examples

Anything you can measure or control that is not the independent variable or dependent variable has potential to be a control variable. Examples of common control variables include:

  • Duration of the experiment
  • Size and composition of containers
  • Temperature
  • Sample volume
  • Experimental technique
  • Chemical purity or manufacturer
  • Species (in biological experiments)

For example, consider an experiment testing whether a certain supplement affects cattle weight gain. The independent variable is the supplement, while the dependent variable is cattle weight. A typical control group would consist of cattle not given the supplement, while the cattle in the experimental group would receive the supplement. Examples of control variables in this experiment could include the age of the cattle, their breed, whether they are male or female, the amount of supplement, the way the supplement is administered, how often the supplement is administered, the type of feed given to the cattle, the temperature, the water supply, the time of year, and the method used to record weight. There may be other control variables, too. Sometimes you can’t actually control a control variable, but conditions should be the same for both the control and experimental groups. For example, if the cattle are free-range, weather might change from day to day, but both groups have the same experience. When you take data, be sure to record control variables along with the independent and dependent variable.



  • Open access
  • Published: 03 September 2024

RNAseqCovarImpute: a multiple imputation procedure that outperforms complete case and single imputation differential expression analysis

  • Brennan H. Baker, Sheela Sathyanarayana, Adam A. Szpiro, James W. MacDonald & Alison G. Paquette

Genome Biology, volume 25, Article number: 236 (2024)

Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. This method is easily implemented via an R Bioconductor package, RNAseqCovarImpute that integrates with the limma-voom pipeline for differential expression analysis.

Missing data is a common problem in observational studies, as modeling techniques such as linear regression cannot be fit to data with missing points. Missing data is frequently handled using complete case (CC) analyses in which any individuals with missing data are dropped from the study. Dropping participants can reduce statistical power and, in some cases, result in biased model estimates. A common technique to address these problems is to replace or “impute” missing data points with substituted values. Typically, for a given covariate, missing data points are imputed using a prediction model including other relevant covariates as independent variables. In single imputation (SI), a missing value is replaced with the most likely value based on the predictive model. Statistical efficiency can be improved by including the outcome in the predictive model in addition to covariates. However, in this setting, SI methods can result in biased coefficients and over-confident standard errors [ 1 ]. Multiple imputation (MI) addresses this problem by generating several predictions, thereby allowing for uncertainty about the imputed data to propagate through the analysis. In a typical MI procedure: (1) m imputed data sets are created, (2) each data set is analyzed separately (e.g., using linear regression), and (3) estimates and standard errors across the m analyses are pooled using Rubin’s rules [ 2 , 3 ].
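For reference, Rubin’s rules pool a quantity estimated in each of the $m$ imputed datasets as follows (standard formulas, restated here for clarity rather than taken from the paper):

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i - \bar{Q}\bigr)^2,
$$

$$
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B,
$$

where $\hat{Q}_i$ and $U_i$ are the estimate and its sampling variance from the $i$-th imputed dataset, $\bar{Q}$ is the pooled estimate, $B$ is the between-imputation variance, and $\sqrt{T}$ is the pooled standard error.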

To date, there has been no concerted effort to determine the most advantageous method for handling missing covariate data in transcriptomic studies. A large proportion of RNA-sequencing studies are conducted in in vitro or in vivo models and do not suffer from missing covariate data. Complete datasets are common in experimental studies with controlled conditions and a limited number of covariates. In an experimental setting, studies may employ two-group analyses with no additional variables or utilize covariates for which collecting data is trivial (e.g., sequencing batch and sex). However, the cost of sequencing has decreased over time [ 4 ], and transcriptomic data are already becoming more common in large human observational studies where missing data is a prevailing concern [ 5 , 6 ]. Therefore, guidelines for handling missing data in this context are critically needed to facilitate the integration of transcriptomic and epidemiologic approaches.

While SI methods must omit the outcome from the imputation predictive model to avoid bias, the opposite is true of MI [ 7 ]. However, including the outcome in the MI predictive model can be problematic in “omics” studies with high dimensional data. Fitting an imputation model where the number of independent variables is far greater than the number of individuals in the study is generally not feasible. For instance, in RNA-sequencing studies with tens of thousands of genes, an equal or greater number of participants may be needed to apply a standard MI procedure.

To ensure that outcome data are included in the predictive model (a requirement of MI to avoid bias [ 7 ]), one solution is to make one set of m imputed datasets per gene, where expression data for a single gene is included in the predictive model. Then, each set of imputed data can be used to estimate differential expression of the gene that was used in that set’s predictive modeling. However, the generation of tens of thousands of sets of imputed data is computationally intensive and may require an unfeasible amount of model checking and diagnostics. In epigenetic studies of DNA methylation at CpG sites, this approach has been modified to be less computationally intensive by using groups of CpG sites together to impute missing data [ 8 , 9 ]. We propose an alternative solution for applying MI to high dimensional gene expression data, which is to utilize principal component analysis (PCA) to reduce the dimensionality of the transcriptome. Then, the top PCs can be included in the MI prediction model when imputing missing covariates, satisfying the requirement that outcome information is included in the MI predictive models.

Here, we developed the first method to our knowledge to make MI compatible with high dimensional transcriptomic data. We created an R package (RNAseqCovarImpute) that is fully compatible with the popular limma-voom [ 10 , 11 , 12 ] differential expression analysis pipeline. We conducted a simulation study to compare the performance of MI as implemented in RNAseqCovarImpute with random forest SI and CC analyses. Finally, we applied RNAseqCovarImpute to two analyses involving (1) the placental transcriptome associated with maternal age, and (2) the blood platelet transcriptome associated with colorectal carcinoma.

Multiple imputation and differential expression analysis in the RNAseqCovarImpute package

The RNAseqCovarImpute package includes two methods accommodating the requirement of MI that the outcome data are included in the MI predictive models. The first method surmounts the problem of high-dimensional outcome data by binning genes into smaller groups to analyze pseudo-independently (MI Gene Bin method, see Additional file 1: Supplemental Methods). Analyzing smaller bins of genes independently lowers the dimensionality of the outcome gene expression data, allowing us to include it in the MI predictive modeling. However, binning genes into smaller groups is computationally inefficient, as it requires that the MI and limma-voom analysis is run many times (typically hundreds).

A second method uses PCA to avoid binning genes while still retaining outcome information in the MI models. The MI PCA method implements covariate MI in gene expression studies by (1) performing PCA on the normalized log-counts per million (logCPM) for all genes using the Bioconductor “PCAtools” package [ 13 ]; (2) creating m imputed datasets where the imputation predictor matrix includes all covariates and the optimum number of PCs to retain; (3) conducting the standard limma-voom differential expression analysis pipeline in R with the “limma::voom” followed by “limma::lmFit” followed by “limma::eBayes” functions [ 10 , 11 , 12 ] on each m imputed dataset; (4) pooling the results with Rubin’s rules to produce combined coefficients, standard errors, and P values; and (5) adjusting P values for multiplicity to account for false discovery rate (FDR) (Fig. 1; see “Methods” for details). Various methods can be used to determine the number of PCs to retain in the MI prediction model, for example Horn’s parallel analysis, which retains PCs with eigenvalues greater than the eigenvalues of random data [ 14 , 15 ], an 80% explained variation cutoff, or the elbow method.
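A condensed sketch of steps (1)–(4) is shown below using edgeR, mice, and limma directly. The object names are illustrative, this is not the RNAseqCovarImpute API (which wraps these steps in its own functions), and the number of retained PCs is simply fixed at 10 here rather than chosen by Horn’s parallel analysis:

```r
# Illustrative outline of the MI PCA idea using edgeR, mice, and limma directly;
# the actual RNAseqCovarImpute package wraps these steps in its own functions.
# 'counts' (gene x sample matrix) and 'covars' (data.frame with missing values,
# including the predictor of interest) are hypothetical example objects.
library(edgeR)
library(mice)
library(limma)

dge    <- DGEList(counts)
dge    <- calcNormFactors(dge)        # TMM effective library sizes
logcpm <- cpm(dge, log = TRUE)        # normalized log-counts per million

# (1) PCA on the logCPM matrix; the number of retained PCs is fixed at 10
#     purely for illustration.
pcs <- prcomp(t(logcpm), center = TRUE)$x[, 1:10]

# (2) Impute missing covariates m times, with the retained PCs in the predictor
#     matrix so that outcome information enters the imputation models.
imp <- mice(cbind(covars, pcs), m = 10, printFlag = FALSE)

# (3) Run the limma-voom pipeline separately within each imputed dataset.
fits <- lapply(seq_len(imp$m), function(i) {
  design <- model.matrix(~ ., data = complete(imp, i)[, names(covars)])
  v <- voom(dge, design)
  eBayes(lmFit(v, design))
})

# (4) Per-gene coefficients and standard errors from the m fits would then be
#     pooled with Rubin's rules (pooling step not shown here).
```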

Figure 1: Overview of RNAseqCovarImpute multiple imputation differential expression analysis. A Inputs are covariates, including the predictor of interest and adjustment variables, and RNA-sequencing counts that are filtered to remove low counts and normalized as log-counts per million (logCPM). The logCPM calculation uses the effective library sizes calculated using the weighted trimmed mean of M-values method. B Principal component analysis (PCA) is used to reduce the dimensionality of the count matrix and Horn’s parallel analysis determines the number of PCs (1-h) to retain. Retained PCs (PC1-PCh) are added to the input dataset of covariates (C1-Cn). C Multiple imputation imputes missing covariate data m times (RNA-sequencing data are not imputed). All covariates and all retained PCs are included in the imputation prediction models. D Associations are estimated between the covariates and gene expression, according to the user’s statistical model design of interest, separately within each m imputed dataset using voom followed by lmFit followed by eBayes functions. In this example, the design is a multivariable linear model including all covariates C1-Cn. E Combine across m sets of model results using Rubin’s rules to produce combined log fold changes, standard errors, and P values for each term in the design.

Three versions of MI PCA using different criteria to determine the number of retained PCs were compared with the MI Gene Bin approach (“  Methods ”). MI PCA using Horn’s parallel analysis performed better than the MI Gene Bin and other MI PCA methods. All methods had similar true positive rates (TPRs), while MI PCA horn had the lowest false positive rates (FPRs) across most scenarios (Additional file 1: Supplemental Results, Additional file 1: Figs. S2–S4). Among the methods for retaining PCs, results were relatively comparable when missing data were minimal. For instance, for the ECHO-PATHWAYS dataset, Horn’s parallel analysis retained 35 PCs, an 80% variance explained cutoff retained 213 PCs, and the elbow method retained 15 PCs. Despite substantial differences in the number of retained PCs, all methods had similarly high TPRs and good FPR control at approximately 0.05 when there was only 5–15% missing data (Additional file 1: Fig. S2). However, when levels of missing data were higher, FPRs were consistently controlled at 0.05 for Horn’s parallel analysis, but not for the 80% variance explained or elbow approaches (Additional file 1: Fig. S2). Thus, MI PCA using Horn’s parallel analysis was selected as the MI method of choice (hereinafter the “RNAseqCovarImpute” method).

Performance on three real datasets following simulations of missing covariate data

Three large real-world RNA-sequencing datasets encompassing multiple tissue types and a diverse range of covariates were utilized to compare RNAseqCovarImpute, SI, and CC differential expression analysis (“  Methods ”). 

The ECHO prenatal and early childhood pathways to health (ECHO-PATHWAYS) dataset (dbGaP phs003619.v1.p1 and phs003620.v1.p1) includes RNA-sequencing of placentas sampled at delivery from socioeconomically and racially/ethnically diverse participants from two regionally distinct birth cohorts in Washington (Seattle and Yakima) and Tennessee (Memphis), USA [ 5 ]. For the ECHO-PATHWAYS dataset (N = 994), maternal age served as the predictor of interest in differential expression analysis, while covariates included fetal sex, RNA-sequencing batch, maternal tobacco use during pregnancy, maternal alcohol use during pregnancy, and family income. 

The non-small cell lung cancer (NSCLC) dataset (EMBL-EBI: E-GEOD-81089) includes RNA-sequencing of both lung tumor and non-malignant tissues sampled from patients diagnosed with NSCLC being surgically treated from 2006 to 2010 at the Uppsala University Hospital, Sweden [ 16 ]. For the NSCLC dataset (N = 670), the predictor of interest was sex, while covariates included participant age, participant smoking status, and sampling site (tumor versus non-malignant).

The Epstein-Barr virus (EBV) dataset (EMBL-EBI: E-MTAB-7805) analyzed primary cultures of human B lymphocytes obtained from adenoid tissue [ 17 ]. For the EBV dataset ( N  = 384), the predictor of interest was time elapsed in culture, while covariates included EBV infection status and individual donor source.

The ECHO-PATHWAYS dataset included 14,026 genes after filtering, of which 2517 were significantly associated with maternal age in the full data model (true positives) while adjusting for covariates. Following the full data model, simulations to induce missingness in the covariate data were performed 10 times per level of missing data and missingness mechanism (“  Methods ”). Patterns of simulated missing data depended on the missingness mechanism. When data were simulated to be missing at random (MAR) or missing not at random (MNAR), the maternal alcohol and family income variables had strong influence over patterns of missing data as intended. For example, in the simulated datasets where 55% of individuals had at least one missing data point, the average rate of individuals with at least one missing data point was 91% among alcohol users but only 50% among those reporting no alcohol use. Family income also impacted missingness: the missing data rate was 37% among those in the bottom quartile of family income, but 75% among those in the top quartile. These patterns of missingness were identical between the MAR and MNAR mechanisms, the only difference being that SI and RNAseqCovarImpute had access to these variables (while imputing data) under MAR, while these variables were masked under MNAR. Thus, under MNAR, unobserved data influenced the patterns of missingness. When data were simulated to be missing completely at random (MCAR), missingness did not depend on alcohol use, family income, or any other covariate. For example, in the 10 simulated datasets where 55% of individuals had at least one missing data point, the rate of individuals with at least one missing data point was 55% among alcohol users, 53% among those reporting no alcohol use, 56% among those in the bottom quartile of family income, and 54% in the top quartile of family income.
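As a rough illustration of how covariate missingness of this kind can be induced in simulation (the variable names, sample size, and probabilities below are invented for the sketch, not the study’s actual simulation parameters):

```r
# Hedged sketch of inducing covariate missingness under MCAR vs MAR;
# 'alcohol', 'income', and 'smoking' are hypothetical covariate columns.
set.seed(123)
n <- 994
covars <- data.frame(
  alcohol = rbinom(n, 1, 0.1),
  income  = rnorm(n, 50, 20),
  smoking = rbinom(n, 1, 0.2)
)

# MCAR: every subject has the same probability of a missing smoking value.
p_mcar <- rep(0.55, n)
covars_mcar <- covars
covars_mcar$smoking[runif(n) < p_mcar] <- NA

# MAR: the probability of missingness depends on observed alcohol use and income.
p_mar <- plogis(-1 + 2 * covars$alcohol + 0.02 * covars$income)
covars_mar <- covars
covars_mar$smoking[runif(n) < p_mar] <- NA

# Under MNAR the same dependence would hold, but the driving variables would be
# withheld from the imputation models rather than made available to them.
c(mcar = mean(is.na(covars_mcar$smoking)), mar = mean(is.na(covars_mar$smoking)))
```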

In differential expression analysis using the ECHO-PATHWAYS dataset, RNAseqCovarImpute was the best performer, with the highest TPR, lowest FPR, and lowest mean absolute percentage error (MAPE) across most scenarios, especially with increasing levels of missing data (Fig.  2 ). For example, when 55% of participants had at least one missing data point, the TPR ranged from 0.713 to 0.994 for RNAseqCovarImpute, from 0.214 to 0.977 for SI, and from 0.006 to 0.391 for CC (Fig.  2 A). FPR was well-controlled under 0.05 in most scenarios, but more consistently so for RNAseqCovarImpute. For example, the median FPR was always < 0.05 for RNAseqCovarImpute, while there were some cases of high FPRs for the CC method when data were MCAR, and for the SI method when data were MNAR (Fig.  2 B). MAPE was lower for RNAseqCovarImpute in almost every scenario (Fig.  2 C).

Figure 2: Performance of missing data methods on ECHO-PATHWAYS dataset. A True positive rate (TPR), B false positive rate (FPR), and C mean absolute percentage error (MAPE) shown for complete case (CC), single imputation (SI), and RNAseqCovarImpute multiple imputation differential expression analyses on ten datasets with simulated missingness per missingness mechanism per level of missingness. Box (median and interquartile range) and whiskers (1.5× interquartile range) shown along with one point per simulation. Dashed line at target FPR of 0.05.

The NSCLC dataset included 12,353 genes after filtering, of which 5718 were significantly associated with sex in the full data model (true positives) while adjusting for covariates. After inducing missingness in the covariate data, RNAseqCovarImpute was the best performer with the highest TPR, lowest FPR, and lowest MAPE across most scenarios, especially with increasing levels of missing data (Fig.  3 ). For example, when 85% of participants had at least one missing data point, the TPR ranged from 0.933 to 0.987 for RNAseqCovarImpute, from 0.902 to 0.984 for SI, and from 0.296 to 0.604 for CC (Fig.  3 A). Many scenarios had FPR > 0.05 for CC, while FPR was well-controlled at approximately 0.05 for SI, and consistently below 0.05 for RNAseqCovarImpute (Fig.  3 B). As with FPR, MAPE was lowest for RNAseqCovarImpute, followed by SI and CC, respectively (Fig.  3 C).

Figure 3: Performance of missing data methods on NSCLC dataset. A True positive rate (TPR), B false positive rate (FPR), and C mean absolute percentage error (MAPE) shown for complete case (CC), single imputation (SI), and RNAseqCovarImpute multiple imputation differential expression analyses on ten datasets with simulated missingness per missingness mechanism per level of missingness. Box (median and interquartile range) and whiskers (1.5× interquartile range) shown along with one point per simulation. Dashed line at target FPR of 0.05.

The EBV dataset included 8677 genes after filtering, of which 7449 were significantly associated with time in the full data model (true positives) while adjusting for covariates. As with the datasets above, after inducing missingness in the covariate data, RNAseqCovarImpute was the best performer with the highest TPR, lowest FPR, and lowest MAPE across most scenarios (Fig.  4 ).

figure 4

Performance of missing data methods on EBV dataset. A True positive rate (TPR), B false positive rate (FPR), and C mean absolute percentage error (MAPE) shown for complete case (CC), single imputation (SI), and RNAseqCovarImpute multiple imputation differential expression analyses on ten datasets with simulated missingness per missingness mechanism per level of missingness. Box (median and interquartile range) and whiskers (1.5* interquartile range) shown along with one point per simulation. Dashed line at target FPR of 0.05

In addition to the real RNA-sequencing datasets above, four sets of synthetic RNA-sequencing data were used to compare performances of RNAseqCovarImpute, SI, and CC differential expression analysis. The NSCLC RNA-sequencing data were modified to add known signal using the seqgendiff package [ 18 ] (“  Methods ”). Compared with fully synthetic count data from theoretical distributions, this method better reflects realistic variability in RNA-sequencing data. Subsets of 25–99% of genes were randomly selected to have their coefficient of association (Log2 fold-changes) with sex set to zero. Distributions of gene expression coefficients associated with sex depended on the desired null gene rates for each synthetic dataset, but followed a similar form compared with the original NSCLC data as intended (Additional file 1: Fig. S5). Coefficients for the remaining genes were drawn randomly from a gamma distribution, and an additional diagnostic confirmed that the coefficients for each gene estimated from the limma-voom pipeline on the synthetic count tables closely matched these pre-defined coefficients input into the seqgendiff package (Additional file 1: Fig. S6). Applied to these synthetic RNA-sequencing datasets, RNAseqCovarImpute had higher TPRs (Additional file 1: Fig. S7) and lower FPRs (Additional file 1: Fig. S8) compared to CC and SI differential expression analysis across most scenarios. Moreover, the advantages of RNAseqCovarImpute were most apparent for synthetic datasets with weaker signals. For example, all methods had FPRs < 0.05 for the synthetic data with the strongest signal (i.e., 75% of genes associated with the predictor of interest and only 25% null genes). However, with 99% null genes and only 1% of genes modified to correlate with the predictor of interest, RNAseqCovarImpute maintained FPRs < 0.05 while the other methods did not (Additional file 1: Fig. S8).

Overall, RNAseqCovarImpute outperformed SI and CC methods in differential expression analysis by achieving higher TPRs, lower FPRs, and lower MAPEs across various real-world and synthetic RNA-sequencing datasets. Its advantages were most notable, especially with respect to controlling FPRs, in scenarios with high levels of missing data or predictors of interest that are only weakly associated with the RNA-sequencing data.

Computational benchmarks

Methods were benchmarked in an analysis of the ECHO-PATHWAYS dataset with 14,026 genes, 994 observations, 4 covariates in the model, and 55% missingness under MCAR on a Windows machine with 3.8 GHz processing speed and 16 GB random-access memory. MI methods were assessed with 10 imputed datasets. Over three iterations per method, memory allocations and median run times were 16.6 GB and 13.52 min for the RNAseqCovarImpute MI Gene Bin method, 4.32 GB and 2.68 min for the RNAseqCovarImpute MI PCA method, 7.58 GB and 17.46 s for SI, and 3.23 GB and 6.71 s for CC. Computation time was further assessed for the RNAseqCovarImpute MI PCA method over several combinations of sample size, number of genes, and number of imputed datasets (Additional file 1: Fig. S9).

Application of RNAseqCovarImpute in analysis of maternal age and placental transcriptome

In a real-world example, RNAseqCovarImpute was applied to the largest placental transcriptomic dataset to-date, which was generated by the ECHO prenatal and early childhood pathways to health (ECHO-PATHWAYS) consortium [ 5 ]. This analysis examined the association of maternal age with the placental transcriptome while adjusting for race, ethnicity, family income, maternal education, tobacco and alcohol use during pregnancy, delivery method, study site, fetal sex, and sequencing batch. The causal relationships among these variables are illustrated in Fig.  5 .

figure 5

Maternal age and placental transcriptome conceptual model. Conceptual model of association between maternal age (predictor) and the placental transcriptome (outcome). Confounders are upstream causes of both the predictor and outcome. Mediators are on the causal pathway between the predictor and outcome. Precision variables could affect the outcome but have no clear causal effect on the predictor

Among 1045 individuals included in this analysis, 6% (61) were missing data for at least one of the 10 covariates, mostly driven by 4% (41) of individuals missing family income data. There were no missing data for delivery method, study site, fetal sex, and sequencing batch, and a minimally adjusted analysis including these variables identified 1071 differentially expressed genes (DEGs) significantly associated with maternal age. Adjusting for all covariates resulted in fewer maternal age DEGs: in the CC, SI, and RNAseqCovarImpute MI analyses, maternal age was associated with 575, 214, and 399 DEGs, respectively (Fig.  6 A). The CC and SI analyses uncovered 91% (362) and 54% (214) of the significant DEGs from the MI method, respectively, while there were 32 DEGs exclusive to MI (Fig.  6 A). Additionally, CC analysis was repeated while omitting the family income covariate, which preserved sample size while allowing possible confounding by this variable. This analysis uncovered 334 DEGs, of which 68% (270) overlapped with the DEGs from the MI method (Additional file 1: Fig. S10). Although there were some differences, genes ranked from lowest to highest P value followed similar orders between the methods (Fig.  6 B). The most substantial differences compared with the P value rank order from the MI analysis were observed in the fully adjusted CC and CC omitting family income analyses (Fig.  6 B). Imputation diagnostics for family income following the RNAseqCovarImpute MI method indicated good convergence and reasonable imputed values (Additional file 1: Fig. S11).

figure 6

Maternal age and the placental transcriptome differential expression analysis. Venn diagram depicts shared and distinct differentially expressed genes for each method ( A ). P value rankings for each method for the top 10 genes with the lowest P values from the multiple imputation analysis ( B ). Volcano plots of maternal age associations with placental gene expression in complete case ( C ), single imputation ( D ), and multiple imputation ( E ) analyses. “Drop Income” indicates complete case analysis excluding the income covariate ( F ). Models include the following covariates: maternal race, ethnicity, education, tobacco and alcohol use during pregnancy, household income adjusted for region and inflation, delivery method, fetal sex, sequencing batch, and study site. Log 2 -adjusted fold-changes (LogFCs) shown for each 1 year increase in maternal age. Horizontal and vertical lines at P  = 0.05 and LogFC ± 0.04, respectively. HGNC gene symbols shown for significant genes with false discovery rate adjusted P value ( P -adj) < 0.05 and LogFC beyond 0.04 cutoff

Many of the top DEGs in the MI analysis, according to their significance and fold-change magnitude (Fig.  6 C–F), play roles in inflammatory processes and the immune response. S100A12 and S100A8 are pro-inflammatory calcium-, zinc-, and copper-binding proteins, CXCL8 (IL-8) and IL1R2 are pro-inflammatory cytokines/cytokine receptors, SAA1 and CASC19 are known to be expressed in response to inflammation, while LILRA5 , a leukocyte receptor gene, may play a role in triggering innate immune responses [ 19 ].

Pathway enrichment of the MI differential expression results revealed 32 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were downregulated in association with older maternal age (Fig.  7 ). Among these downregulated KEGG pathways, 11 belong to the immune system KEGG group and 6 belong to the signal transduction group. Antigen processing and presentation, an immune system KEGG pathway, was the most strongly downregulated pathway according to its enrichment effect size and P value (Fig.  7 ).

figure 7

Maternal age and the placental transcriptome pathway analysis. T -statistics (Log2FCs divided by standard error) from the differential expression analyses of maternal age were input into pathway analysis for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (excluding KEGG human disease pathways) using the generally applicable gene set enrichment (GAGE) method. Mean t -statistic of all genes in each KEGG pathway shown with corresponding P value from GAGE (larger points indicate smaller P values)

Application of RNAseqCovarImpute in analysis of colorectal carcinoma and the blood platelet transcriptome

In another real-world example, RNAseqCovarImpute was applied to a dataset of blood platelet RNA-sequencing from 42 individuals with colorectal carcinoma and 59 healthy donors [ 20 ]. This dataset included 3227 genes after filtering. There were 14 individuals with KRAS mutant tumors and 2 individuals with PIK3CA mutant tumors compared with 85 wild-type individuals. No data were missing for cancer status or genotype, while 34% (34) were missing data for sex, and 40% (40) were missing data for age. In the CC, SI, and RNAseqCovarImpute MI analyses, colorectal carcinoma was associated with 2491, 2403, and 2579 DEGs, respectively, while controlling for genotype, sex, and age (Additional file 1: Fig. S12A). The CC and SI analyses uncovered 94% (2422) and 92% (2360) of the significant DEGs from the MI method, respectively, while there were 98 DEGs exclusive to MI (Additional file 1: Fig. S12A). Genes ranked from lowest to highest P value followed similar orders between the methods (Additional file 1: Fig. S12B).

Many of the top DEGs in the MI analysis, according to their significance and fold-change magnitude (Additional file 1: Fig. S12C–E), have been previously shown to play roles in cancer etiology. For example, colorectal carcinoma was associated with the downregulation of GK5. Genes involved in glycerol metabolism, including GK5 , have been previously implicated in the etiology of cancers, including colorectal carcinoma [ 21 , 22 , 23 , 24 ]. Mutations in MDN1 , which was downregulated in association with colorectal cancer here, have been shown to correlate with elevated tumor mutation rates in breast and colorectal cancers [ 25 , 26 ]. Downregulation of TNFRSF1B (a member of the tumor necrosis factor gene family) in association with colorectal cancer here is consistent with prior studies showing lower mRNA expression of TNFRSF1B in lung cancer compared with normal lung tissue [ 27 ].

Discussion

We have shown that an MI procedure that includes PCA of the gene expression data in the imputation predictor matrix has a performance advantage relative to CC and SI methods in RNA-sequencing studies using the limma-voom pipeline. We found that our newly implemented MI method in RNAseqCovarImpute was the best performer with higher TPRs and lower FPRs and MAPEs across a wide range of missing data scenarios.

Similar methods allowing for the inclusion of high dimensional outcome data in MI models have been developed for epigenome wide association studies (EWAS) [ 8 , 9 ]. In EWAS, there appear to be tradeoffs between CC analysis and MI, with MI identifying more true positives but also more false positives [ 8 ]. In the simulations presented here, however, RNAseqCovarImpute identified more true DEGs with no false positive tradeoff. The application of MI to EWAS studies represents a more challenging problem owing to higher dimensionality of the methylome (850,000 CpG sites with Illumina’s EPIC array [ 28 ]) versus the transcriptome (typically tens of thousands of genes). The application of MI to EWAS studies might benefit from employing a PCA-based approach similar to the one used here. Moreover, future methods development could also tackle the additional challenges in applying MI to alternative epigenomic and transcriptomic methods such as differentially methylated region and pathway enrichment (discussed below) analyses.

To address the sparsity of single-cell RNA-sequencing data, imputation methods to fill in missing or zero RNA-sequencing counts have been extensively developed [ 29 ]. Little attention has been paid, however, to the imputation of missing covariate data in studies where gene expression is the outcome of interest. Methods for the treatment of missing data are well-established in observational epidemiology [ 30 ], with MI increasingly the method-of-choice [ 31 , 32 ]. Yet human observational studies of gene expression have often failed to report on the treatment of missing data, despite its prevalence. When missing data are explicitly addressed in this context, researchers typically utilize CC analyses [ 33 , 34 , 35 ], while SI is a less common alternative [ 6 ]. Despite its advantages, we are unaware of any studies with transcriptomic outcomes that have utilized MI for missing covariate data. The simulations presented here suggest that future observational transcriptomic studies may benefit from employing MI via RNAseqCovarImpute over CC or SI. Moreover, we developed an R package so that users may easily apply the methods presented here to their own data.

We applied RNAseqCovarImpute to a large observational study of maternal age and the placental transcriptome. This analysis assessed the association of maternal age with placental gene expression controlling for confounding variables such as maternal race, ethnicity, and socioeconomic status, and potential mediators such as alcohol and tobacco use during pregnancy. Although there was some overlap, the MI analysis uncovered a different set of differentially expressed genes compared with the CC and SI analyses. CC analysis uncovered a larger number of DEGs compared with MI, possibly indicating that CC may have higher power in some cases. However, the simulations suggest that CC analyses have higher FPRs across many scenarios compared with MI via RNAseqCovarImpute. Individuals with missing data could systematically differ from those with complete data, and dropping these individuals could result in bias. Although higher power of CC remains a possibility, it is more likely that the excess number of DEGs in the CC analysis could be explained by false positives owing to such bias. Ultimately, in a real-world example, the method for dealing with missing data matters, and our simulations suggest that MI should be the preferred approach.

Nevertheless, any of these missing data methods would be a better alternative than omitting covariate control entirely, a common albeit unsatisfactory approach in observational transcriptomic studies. Another reasonable alternative is to perform CC analysis while omitting variables with the most missing data. Compared with fully adjusted CC analysis, CC while omitting the family income variable was more similar to the MI analysis in terms of the total number of DEGs, but displayed more differences in terms of the P value rank order of DEGs. Researchers may also opt to only include covariates with complete data, which preserves sample size and avoids the need for imputation but may introduce bias due to uncontrolled variables. For the ECHO-PATHWAYS dataset, only including covariates with complete data resulted in a minimal model adjusting for delivery method, study site, fetal sex, and sequencing batch. This reduced model identified many more DEGs significantly associated with maternal age (1071) compared with the 214–575 DEGs in the CC, SI, and RNAseqCovarImpute MI analyses adjusting for all covariates. This larger number of DEGs was likely due to confounding by race, ethnicity, lifestyle, and socioeconomic status variables that were not controlled in this analysis. Failure to control for these variables in analyses of maternal age could lead to erroneous conclusions and even faulty clinical recommendations. For instance, studies have shown that the positive associations of young maternal age with child ADHD in unadjusted analyses are eliminated or even reversed following adjustment for confounding and mediating variables [ 36 , 37 , 38 ]. Thus, younger pregnancies do not confer increased ADHD risk because of the biology of aging per se but rather owing to other variables that are correlated with maternal age. Younger mothers are more likely to smoke during pregnancy, and prenatal tobacco exposure may impair neurodevelopment. If the link between younger pregnancy and adverse child development is mediated by increased tobacco exposure, then clinical efforts focusing on reducing tobacco exposures during pregnancy would be more effective than recommendations regarding the ideal age for childbearing [ 36 ].

Advanced maternal age is a well-known risk factor for preterm birth [ 39 , 40 ]. These analyses demonstrated associations of advanced maternal age with downregulation of individual genes (i.e., CXCL8 ) and pathways (i.e., 9 immune system and 4 signal transduction pathways) that were also downregulated in association with spontaneous preterm birth in a prior analysis [ 41 ]. Future studies should formally explore these overlapping results as putative mechanistic links between advanced maternal age and preterm birth.

To the best of our knowledge, MI via RNAseqCovarImpute is applicable to any RNA-sequencing study that has missing values in the paired covariate data. In addition to the placental transcriptomics data in ECHO-PATHWAYS, we applied RNAseqCovarImpute to three different datasets. In one example, we analyzed blood platelet RNA-sequencing data from individuals with colorectal carcinoma and healthy controls [ 20 ]. This analysis uncovered several differentially expressed genes with known roles in the etiology and progression of various cancers, including colorectal carcinoma. Future studies may utilize RNAseqCovarImpute to achieve higher power and lower FPRs in differential gene expression analyses of a wide range of factors and across diverse tissue types.

One limitation to RNAseqCovarImpute is that it depends on the number of PCs selected for the imputation model, and the user will need to define the optimal number of PCs to retain using established methods such as Horn’s parallel analysis or a variance explained cutoff. Our testing suggests that several popular methods perform well when missing data is minimal. At high levels of missing data, Horn’s parallel analysis was the best general method of choice, but performance could vary with different datasets. Another drawback to RNAseqCovarImpute is that its compatibility with pathway and gene set enrichment methods is currently limited, as many of these methods were developed without MI in mind. The RNAseqCovarImpute MI method produces one final list of genes with their associated t -statistics, log fold changes, and P values for differential expression. Thus, the method is compatible with gene set enrichment analyses that utilize gene rankings such as overrepresentation analysis, or gene level statistics such as camera [ 42 ] and GAGE (utilized here in the maternal age analysis) [ 43 ]. However, the final gene list produced by RNAseqCovarImpute is based on the combined analyses of the MI datasets. Although adapting them is theoretically possible, methods that require as input a gene expression matrix or data at the individual sample level are likely not compatible out of the box with RNAseqCovarImpute. Future work could adapt such methods to accommodate analysis of multiply imputed RNA-sequencing data. Additionally, the RNAseqCovarImpute package is also not compatible out of the box with all differential expression analysis methods, as it was designed to utilize the limma-voom pipeline. Another limitation was that, as an MI method, RNAseqCovarImpute required more processing time compared with CC or SI. Finally, data imputation is a rapidly evolving field, and emerging machine learning SI and MI methods [ 44 ] should be tested in the context of RNA-sequencing in future studies.

Conclusions

As the cost of sequencing decreases, studies of the transcriptome may experience a substantial shift from small-scale in vitro and in vivo experimental systems to larger-scale clinical and epidemiologic contexts where missing covariate data is prevalent. MI is a well-established method to handle missing covariate data in epidemiology, but was previously not compatible with transcriptomic outcome data. We developed an R package, RNAseqCovarImpute, to integrate limma-voom RNA-sequencing analysis with MI for missing covariate data, and demonstrated that this method has superior performance compared with SI and CC analyses. Ultimately, RNAseqCovarImpute represents a promising step towards harmonizing transcriptomic and epidemiologic approaches by addressing the critical need to accommodate missing covariate data in RNA-sequencing studies. Future studies may expand upon the MI methods developed here for RNA-sequencing to address problems associated with missing covariate data in other settings, including DNA methylation, proteomics, metabolomics, and other high dimensional data types.

Methods

RNAseqCovarImpute multiple imputation principal component analysis (MI PCA) method

A graphical overview of the RNAseqCovarImpute MI PCA method is shown in Fig.  1 , and the corresponding R code is available on GitHub and Bioconductor (see Availability of data and materials).

Filtering genes with low counts

A common filtering cutoff in RNA-sequencing is around 10 counts per gene, and filtering on log-counts per million (logCPM) values rather than raw counts is recommended to avoid giving preference to samples with large library sizes [ 45 , 46 ]. The filtering cutoff on the CPM scale is roughly equal to the count cutoff of 10 divided by the minimum library size in millions. Here, we filter to keep genes with an average CPM across all samples above this cutoff reflecting approximately 10 counts per gene.
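As a minimal sketch of this filter in R (assuming the counts are held in an edgeR DGEList named dge; the object name is illustrative, while cpm and the lib.size slot are standard edgeR components):

    library(edgeR)

    # CPM-scale cutoff: roughly 10 counts in the smallest library
    cpm_cutoff <- 10 / (min(dge$samples$lib.size) / 1e6)

    # Keep genes whose average CPM across all samples exceeds the cutoff
    keep <- rowMeans(cpm(dge)) > cpm_cutoff
    dge <- dge[keep, , keep.lib.sizes = FALSE]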

Data normalization

Following RNA-sequencing and mapping of reads to each gene or probe, an RNA-sequencing dataset consists of a matrix of counts for genes g = 1 to the total number of genes G, where counts are recorded for all samples i = 1 to the total number of samples n. The library size \(R_i\), which is the total number of reads for a given sample, is expressed as:
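$$R_i=\sum_{g=1}^{G}y_{gi}$$

where \(y_{gi}\) denotes the count for gene \(g\) in sample \(i\) (a notation introduced here for clarity).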

The library size \(R_i\) is additionally scale-normalized using the weighted trimmed mean of M-values (TMM) method [ 47 ], and logCPMs normalized to library size are computed, offsetting the counts by 0.5 to avoid taking the log of zero and offsetting the library size by 1:
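$$logCPM_{gi}={\log }_{2}\left(\frac{y_{gi}+0.5}{R_i+1}\times {10}^{6}\right)$$

(the standard limma-voom transformation, using the same notation as above).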

Principal component analysis (PCA)

Fitting an imputation model where the number of independent variables is far greater than the number of individuals in the study is generally not feasible. In RNA-sequencing studies with tens of thousands of genes, we can surmount this problem by reducing the dimensionality of the gene expression data with PCA and including a subset of PCs in the MI prediction models.

For the MI PCA method, we conduct PCA using Bioconductor’s PCAtools R package [ 13 ] on the gene expression data normalized as described above. There is no universally optimal approach to selecting the number of PCs to retain in PCA. Horn’s parallel analysis retains PCs with eigenvalues greater than eigenvalues of random data [ 14 , 15 ] and is regarded as one of the best empirical methods to determine component retention in PCA [ 48 ]. Performance of MI PCA was compared when using Horn’s parallel analysis, an 80% variance explained cutoff, and the elbow method, where all PCs are retained that come before the elbow point in the curve of variance explained by each successive PC. Methods for determining the number of retained PCs were implemented using Bioconductor’s PCAtools R package [ 13 ].
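A brief sketch of how the three retention rules could be compared (the logcpm matrix name is an assumption, and the calls follow the documented PCAtools interface rather than the package's internal code):

    library(PCAtools)

    # logcpm: genes x samples matrix of normalized logCPM values
    p <- pca(logcpm)

    # Horn's parallel analysis: keep PCs whose eigenvalues exceed those of random data
    n_pcs_horn <- parallelPCA(logcpm)$n

    # 80% cumulative variance explained cutoff
    n_pcs_var80 <- which(cumsum(p$variance) > 80)[1]

    # Elbow point of the variance-explained curve
    n_pcs_elbow <- findElbowPoint(p$variance)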

Data imputation

The retained PCs are added to the covariate data and utilized along with all covariates in the MI prediction model when creating m multiply imputed datasets. Data are imputed using the “mice” R package with its default predictive modeling methods, which are predictive mean matching, logistic regression, polytomous regression, and proportional odds modeling for continuous, binary, unordered categorical, and ordered categorical variables, respectively [ 49 ].
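A minimal sketch of this step (the covariates and pc_scores object names are assumptions; mice and complete are the standard functions from the mice package):

    library(mice)

    # covariates: data frame of study covariates containing missing values
    # pc_scores: retained principal component scores, one row per sample
    imp_data <- cbind(covariates, pc_scores)

    # Create m multiply imputed datasets with mice's default methods
    imp <- mice(imp_data, m = 10, printFlag = FALSE)

    # Extract the first completed dataset for a separate limma-voom run
    completed_1 <- complete(imp, action = 1)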

Differential expression analysis

The limma-voom pipeline is run on each of the m imputed datasets separately. This procedure fits weighted linear models for each gene that take into account individual-level precision weights based on the mean–variance trend [ 10 ]. A linear model is fit by ordinary least squares separately for each gene. The model includes an intercept \({\beta }_{0}\) and coefficients \({\beta }_{1}\) – \({\beta }_{n}\) for any number of covariates \({C}_{1}\) – \({C}_{n}\) .
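In this notation (with \(\varepsilon_{gi}\) denoting the residual error term, a symbol introduced here), the model for each gene takes the form:

$$logCPM_{gi}={\beta }_{0}+{\beta }_{1}{C}_{1i}+\cdots +{\beta }_{n}{C}_{ni}+\varepsilon_{gi}$$

with a separate set of coefficients estimated for every gene \(g\).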

The geometric mean of the library sizes plus one, \(\widetilde R\) , is computed. The average logCPM for each gene, \(\overline{\log CPM_g}\) , is computed and converted to an average log-count by:
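$$\widetilde{r}_{g}=\overline{\log CPM_{g}}+{\log }_{2}(\widetilde{R})-{\log }_{2}({10}^{6})$$

where \(\widetilde{r}_{g}\) denotes the average log-count for gene \(g\) (the standard voom conversion, with \(\widetilde{R}\) the geometric mean of the library sizes plus one as defined above).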

The regressions provide fitted logCPM values, \({\widehat{\mu }}_{gi}\) for each gene ( g ) and each sample ( i ) that are converted to fitted counts by:
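$$\widehat{\lambda }_{gi}=\widehat{\mu }_{gi}+{\log }_{2}(R_i+1)-{\log }_{2}({10}^{6})$$

where \(\widehat{\lambda }_{gi}\) denotes the fitted count for gene \(g\) and sample \(i\) on the log2 scale (the standard voom conversion).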

A LOWESS curve is fitted to the square root of the residual standard deviations from the regression models as a function of \(\widetilde{r}\) , the average log-counts. Interpolating the curve on the interval of library sizes \(\widetilde{R}\) defines the piecewise linear function lo() for predicting individual observation-level square-root standard deviations. The predicted square-root standard deviation of individual logCPM observations \({logCPM}_{gi}\) is equal to lo( \({\widehat{\lambda }}_{gi}\) ). Voom precision weights are defined as the predicted inverse variances, lo( \({\widehat{\lambda }}_{gi}\) )\(^{-4}\) .

Voom precision weights and logCPM values are input into the limma linear modeling framework which utilizes an empirical Bayes procedure to squeeze gene-wise variances towards a common value [ 11 , 12 ]. This procedure is run separately on each m set of imputed data to obtain coefficients and standard errors for each gene.

Pooling results

Rubin’s rules [ 2 ] are used to pool coefficients and standard errors, and the Barnard and Rubin adjusted degrees of freedom is calculated [ 50 ] (see [ 3 ] for more details). From the limma-voom pipeline above, the linear regression coefficient ( \(\beta\) ) and the Bayesian moderated standard error ( SE ) for each gene are extracted from each of the m models fit on the m imputed datasets. The Bayesian moderated degrees of freedom ( df ) are averaged across the m models. One gene at a time, results are pooled across the m models as follows.

Coefficients are pooled with the basic formula of taking the mean.
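That is, with \({\beta }_{i}\) denoting the coefficient estimated from the i-th imputed dataset:

$$\overline{\beta }=\frac{1}{m}\sum_{i=1}^{m}{\beta }_{i}$$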

Within-imputation variance ( V W ) is the sum of the squared standard errors ( SE s) divided by m , i.e., the average squared SE across the imputed datasets.
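In symbols:

$$V_{W}=\frac{1}{m}\sum_{i=1}^{m}SE_{i}^{2}$$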

Between imputation variance ( V B ) reflects extra variance due to missing data and is expected to be large when missing data is high. It is calculated as the sum of the squared differences between the pooled coefficient ( \(\overline{\beta }\) ) and each coefficient ( \({\beta }_{i}\) ) from each imputed dataset divided by m  − 1.
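That is:

$$V_{B}=\frac{\sum_{i=1}^{m}({\beta }_{i}-\overline{\beta })^{2}}{m-1}$$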

Total variance ( V Total ) is calculated, and its square root is taken as the pooled standard error ( SE Pooled ).
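Following the standard Rubin's rules formulation:

$$V_{Total}=V_{W}+V_{B}+\frac{V_{B}}{m},\qquad SE_{Pooled}=\sqrt{V_{Total}}$$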

The pooled coefficient ( \(\overline{\beta }\) ) divided by the pooled standard error ( SE Pooled ) is defined as the t -statistic ( t ) for significance testing.
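That is:

$$t=\frac{\overline{\beta }}{SE_{Pooled}}$$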

The degrees of freedom for significance testing also need adjustment. First, calculate lambda , the proportion of total variance due to missingness.
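In symbols:

$$\lambda =\frac{V_{B}+V_{B}/m}{V_{Total}}$$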

An older version of the degrees of freedom ( df Old ) proposed in Rubin (1987) is adjusted using the equations from Barnard and Rubin (1999). This MI adjusted degrees of freedom ( df Adjusted ) is the same degrees of freedom used in the “mice” R package.
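In that notation, with df denoting the averaged Bayesian moderated degrees of freedom described above:

$$df_{Old}=\frac{m-1}{\lambda^{2}},\qquad df_{Observed}=\frac{df+1}{df+3}\,df\,(1-\lambda),\qquad df_{Adjusted}=\frac{df_{Old}\times df_{Observed}}{df_{Old}+df_{Observed}}$$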

P values are derived from the t -distribution. In R, 2-sided P values can be calculated using the pt function, which returns the area under the Student’s t -distribution to the left of the t -statistic for a given degrees of freedom.
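For example, assuming t_stat holds the pooled t-statistic and df_adjusted the Barnard and Rubin adjusted degrees of freedom (both names are illustrative):

    # Two-sided P value from the pooled t-statistic
    p_value <- 2 * pt(-abs(t_stat), df = df_adjusted)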

After this pooling procedure is completed for every gene, P values for the linear model contrast of interest are adjusted for false-discovery-rate control [ 51 ].

Performance on three example datasets

Performance was evaluated in a simulation study using three real RNA-sequencing and covariate datasets and four synthetic sets of RNA-sequencing data (described below). Performance was compared between SI followed by the standard limma-voom differential expression analysis, CC limma-voom differential expression analysis, and the two RNAseqCovarImpute methods, MI Gene Bin (Additional file 1: Supplemental Methods) and MI PCA (described above).

Determining true differentially expressed genes (DEGs)

Differential expression analysis using the limma-voom pipeline was conducted on the entire set of observations with their complete covariate data (hereinafter “full data”). These models estimated the effect of a predictor of interest on gene expression while controlling for several covariates. Genes significantly associated with the predictor of interest at FDR < 0.05 in these full data models were defined as true DEGs.

Simulating missing data under different missingness mechanisms

Missingness was simulated using the ampute function from the “mice” package [ 49 ]. Missingness was simulated to emulate a common situation in scientific research where an investigator has complete data for a predictor of interest, but may have missing data for other important covariates. Therefore, missingness was only induced in adjustment covariates and not the predictor of interest. We explored scenarios with various levels of missing data ranging from 5 to 85% of participants having at least one missing data point, and under three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We simulated ten datasets for each missingness mechanism at each level of missingness before applying the SI, CC, MI, and MI PCA methods and comparing the results with the full data model.
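As a rough sketch of one such simulation run (covars_complete is an illustrative name for the complete covariate data, with the predictor of interest excluded so that it is never amputed; prop and mech are ampute's documented arguments, and ampute expects numeric data, so categorical covariates would need numeric coding first):

    library(mice)

    # Induce missingness in roughly 55% of individuals under a MAR mechanism
    amp <- ampute(covars_complete, prop = 0.55, mech = "MAR")

    # Amputed (now incomplete) covariates passed to the CC, SI, and MI methods
    covars_missing <- amp$amp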

One or two covariates (described in detail below for each dataset) were defined as MNAR variables: these variables were not included as adjustment covariates in the differential expression analysis, but had influence in determining the missingness in the other covariate data. Under the MAR mechanism, the data that explain the missingness are all available. Thus, for the MAR mechanism, the SI, MI, and MI PCA methods had access to these MNAR variables while imputing missing covariate data. Under the MNAR mechanism, patterns of missingness in the data are related to unobserved or unmeasured factors. Thus, for the MNAR mechanism, the SI, MI, and MI PCA methods did not have access to these MNAR variables while imputing missing covariate data. Under the MCAR mechanism, missingness in the data are completely random and do not depend on values of the covariates.

CC analyses dropped any individual with at least one missing data point, while SI imputed missing data using the missForest package [ 52 ]. The limma-voom pipeline was applied for CC and SI as described for the full data model.

Evaluating results

Our objective was to evaluate the ability of the SI, CC, MI, and MI PCA methods to identify true DEGs from the full data model as significant while limiting false positives. True DEGs from the full data model that were also identified as significant by a given method were defined as true positives. We report the true positive rate (TPR) as the proportion of true DEGs identified as significant for each method out of the total number of true DEGs from the full data model. Genes erroneously identified as significant by a given method that were not true DEGs from the full data analysis were defined as false positives. We report the false positive rate (FPR) as the proportion of false positives out of the total number of significant results for each method. We report the mean absolute percentage error ( MAPE ) across all true DEGs to characterize the ability of each method to reproduce gene expression coefficients from the full data model, where \({\beta }_{truth g}\) is defined as the true coefficient from a DEG in the full data model, and \({\beta }_{g}\) is defined as the coefficient for the same gene estimated using SI, CC, MI, or MI PCA following simulated missingness. MAPE was calculated as:
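$$MAPE=\frac{100\%}{n_{DEG}}\sum_{g=1}^{n_{DEG}}\left|\frac{{\beta }_{g}-{\beta }_{truth g}}{{\beta }_{truth g}}\right|$$

where \(n_{DEG}\) denotes the number of true DEGs from the full data model (a notation introduced here; this is the standard MAPE form).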

Three RNA-sequencing datasets with mapped reads were obtained and processed as described above in “  Data normalization ” and “  Filtering genes with low counts .” The first dataset was based on placental RNA-sequencing and covariate data from the ECHO prenatal and early childhood pathways to health (ECHO-PATHWAYS) consortium [ 5 ]. This study harmonized extant data from three pregnancy cohorts from diverse populations across the country. The consortium’s core aim is to explore the impact of chemical exposures and psychosocial stressors experienced by the mother during pregnancy on child development, and to assess potential underlying placental mechanisms. To investigate placental mechanisms, the study generated RNA-sequencing data for the CANDLE and GAPPS pregnancy cohort samples. All participants of the CANDLE and GAPPS studies provided informed consent upon enrollment and research protocols were approved by the Institutional Review Boards (IRBs) at the University of Tennessee Health Science Center (IRB approval: 17–05154-XP) as well as the Seattle Children’s Hospital (IRB approval: STUDY00000608) and the University of Washington (IRB approval: STUDY00000638). The generation of placental RNA-sequencing data for this study is described elsewhere [ 41 ]. Among the enrolled study sample of 1503, transcriptomic data are available for 1083 individuals. We excluded 18 placental abruptions and 20 individuals missing maternal age data, leaving a sample of 1045. We retained only protein-coding genes, processed pseudogenes, and lncRNAs.

Covariates from the ECHO-PATHWAYS dataset included in the simulation study were maternal age (continuous), child sex (male versus female), RNA-sequencing batch, maternal tobacco use during pregnancy (yes versus no), maternal alcohol use during pregnancy (yes versus no), and family income (continuous). The full data model was restricted to 994 individuals with complete data for these variables. Mothers self-reported alcohol use, while the positive tobacco exposure group included individuals with maternal urine cotinine above 200 ng/mL [ 53 ], as well as individuals who were below this cutoff but self-reported tobacco use during pregnancy. Maternal age was defined as the predictor of interest, while sex, prenatal tobacco exposure, and RNA-sequencing batch were modeled as covariates. Simulated missing data ranged from 5 to 55% of participants having at least one missing data point. Maternal alcohol use and family income served as MNAR variables. Levels of missingness according to different values of these MNAR variables were summarized to illustrate differences between MAR, MNAR, and MCAR missingness mechanisms.

Two additional datasets were selected based on their large sample sizes, public availability, and ample number of covariates that could be examined in covariate imputation analyses. The non-small cell lung cancer (NSCLC) dataset was downloaded from the European Molecular Biology Laboratory—European Bioinformatics Institute (EMBL-EBI: E-GEOD-81089) and is based on [ 16 ]. For the NSCLC dataset ( N  = 670), the association of sex (male versus female) with the transcriptome was examined, adjusting for participant age (continuous) and participant smoking status (smoker versus ex-smoker versus non-smoker). Sampling site (tumor versus non-malignant) served as an MNAR variable. The Epstein-Barr virus (EBV) dataset was downloaded from EMBL-EBI (E-MTAB-7805) and is based on [ 17 ]. For the EBV dataset ( N  = 384), the association of time (continuous days) with the transcriptome was examined, adjusting for infection status (EBV infected versus not infected). Donor source (categorical, three individuals) served as an MNAR variable. All methods performed better at recovering the full data model results for the EMBL-EBI datasets compared with the ECHO-PATHWAYS dataset, so analyses with these datasets examined 55–85% of participants having at least one missing data point.

Finally, four sets of synthetic RNA-sequencing data were also used to compare performances of RNAseqCovarImpute (MI PCA Horn method), SI, and CC differential expression analysis. The NSCLC RNA-sequencing data were modified to add known signal using the seqgendiff package [ 18 ]. The method relies on binomial thinning of the RNA-sequencing count matrix to closely match user defined coefficients. Rather than generating counts from theoretical distributions, thinning a real set of RNA-sequencing counts can better preserve realistic variability and inter-gene correlations typical of RNA-sequencing data [ 18 ]. Subsets of 25%, 50%, 75%, or 99% of genes were randomly selected to have their coefficient of association (Log2 fold-changes) with sex set to zero. The remaining coefficients were drawn randomly from a gamma distribution generated using rgamma(shape = 1) from the stats package in R.
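A condensed sketch of this thinning step (the counts matrix name and the 75% null-gene setting are illustrative; thin_2group and its arguments follow the seqgendiff interface, although tying the signal to an existing variable such as sex can alternatively be done with the package's thin_diff function and a user-supplied design):

    library(seqgendiff)

    # counts: RNA-sequencing count matrix (genes x samples)
    thinned <- thin_2group(mat = counts,
                           prop_null = 0.75,                # 75% of genes set to null
                           signal_fun = stats::rgamma,      # draw non-null log2 fold-changes
                           signal_params = list(shape = 1))

    new_counts <- thinned$mat       # thinned counts with known signal
    true_lfc   <- thinned$coefmat   # the log2 fold-changes injected into the data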

Application of RNAseqCovarImpute in analysis of maternal age and the placental transcriptome

This analysis examined the association of maternal age with the placental transcriptome while controlling for 10 covariates using the ECHO-PATHWAYS sample described above ( N  = 1045). Covariates included family income adjusted for region and inflation (USD), maternal race (Black vs. other), maternal ethnicity (Hispanic/Latino vs. not Hispanic/Latino), maternal education (< high school vs. high school completion vs. college or technical school vs. graduate/professional degree), study site, maternal alcohol during pregnancy (yes vs. no), maternal tobacco during pregnancy (yes vs. no), delivery method (vaginal vs. C-section), fetal sex (male vs. female), and RNA-sequencing batch. The causal relationships among these variables are illustrated in Fig.  5 . For the maternal race variable, American Indian/Alaska Native, multiple race, and other were collapsed along with White participants to avoid small or zero cell sizes in multivariable models. Only protein-coding genes, processed pseudogenes, and lncRNAs, and genes with average log-CPM > 0 (approximately 10 counts for this dataset) were retained, resulting in a final sample of 14,029 genes. DEGs associated with maternal age while adjusting for all 10 covariates were compared between the CC, SI, and RNAseqCovarImpute MI PCA methods. To retain the entire sample size without covariate imputation, a reduced model was fit by omitting any covariates with missing data. Additionally, an alternative CC analysis was performed while omitting family income, the variable with the most missing data.

T -statistics (Log2FCs divided by standard error) from the differential expression analyses were input into pathway analysis for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (excluding KEGG human disease pathways) using the generally applicable gene set enrichment (GAGE) method [ 43 ]. For pathways with GAGE FDR < 0.05, GAGE P values and the mean differential expression t -statistic for all genes in the pathway were plotted.

In another real-world example, RNAseqCovarImpute was applied to a dataset of blood platelet RNA-sequencing from 42 individuals with colorectal carcinoma and 59 healthy donors (EMBL-EBI: E-GEOD-68086) [ 20 ]. This analysis examined the association of colorectal carcinoma versus healthy cancer status with the transcriptome while controlling for genotype ( KRAS vs. PIK3CA vs. wild-type), sex, and age ( N  = 101). DEGs associated with colorectal carcinoma while adjusting for these covariates were compared between the CC, SI, and RNAseqCovarImpute MI PCA methods.

Availability of data and materials

The RNAseqCovarImpute R package is available at the Bioconductor repository under the GNU General Public License v3.0 ( https://doi.org/10.18129/B9.bioc.RNAseqCovarImpute ) [ 54 ]. Source code for simulating missing covariate data and evaluating different methods for handling missing data, including the complete case, single imputation, and RNAseqCovarImpute methods, is available at the Zenodo repository under the GNU General Public License v3.0 ( https://doi.org/10.5281/zenodo.13314514 ) [ 55 ]. The NSCLC data are available at EMBL-EBI accession E-GEOD-81089 [ 16 ]. The EBV data are available at EMBL-EBI accession E-MTAB-7805 [ 17 ]. The colorectal carcinoma data are available at EMBL-EBI accession E-GEOD-68086 [ 20 ]. The ECHO-PATHWAYS RNA-sequencing data are available at dbGaP study accessions phs003619.v1.p1 and phs003620.v1.p1. The ECHO-PATHWAYS covariate data are available upon reasonable request following the data sharing guidelines of the ECHO-PATHWAYS consortium, outlined in LeWinn et al. [ 5 ].

van Buuren S. Flexible Imputation of Missing Data. Second Edition (2nd ed.). Chapman and Hall/CRC; 2018. https://doi.org/10.1201/9780429492259 .

Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 2004.

Heymans M, Eekhout I. Applied missing data analysis with SPSS and (R) Studio. Amsterdam, The Netherlands: Heymans and Eekhout; 2019. Available online: https://bookdown.org/mwheymans/bookmi/ . Accessed 23 May 2020.

Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.

LeWinn KZ, Karr CJ, Hazlehurst M, Carroll K, Loftus C, Nguyen R, et al. Cohort profile: the ECHO prenatal and early childhood pathways to health consortium (ECHO-PATHWAYS). BMJ Open. 2022;12(10):e064288.

Eaves LA, Bulka CM, Rager JE, Gardner AJ, Galusha AL, Parsons PJ, et al. Metal mixtures modeling identifies birth weight-associated gene networks in the placentas of children born extremely preterm. Chemosphere. 2023;313:137469.

Little RJ. Regression with missing X’s: a review. J Am Stat Assoc. 1992;87(420):1227–37.

Mills HL, Heron J, Relton C, Suderman M, Tilling K. Methods for dealing with missing covariate data in epigenome-wide association studies. Am J Epidemiol. 2019;188(11):2021–30.

Wu C, Demerath EW, Pankow JS, Bressler J, Fornage M, Grove ML, et al. Imputation of missing covariate values in epigenome-wide analysis of DNA methylation data. Epigenetics. 2016;11(2):132–9.

Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):1–17.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47-e.

Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology. 2004;3(1).  https://doi.org/10.2202/1544-6115.1027 .

Blighe K, Lun A. PCAtools: PCAtools: Everything Principal Components Analysis. R package version 2.16.0; 2024. https://github.com/kevinblighe/PCAtools .

Buja A, Eyuboglu N. Remarks on parallel analysis. Multivar Behav Res. 1992;27(4):509–40.

Horn JL. A rationale and test for the number of factors in factor analysis. Psychometrika. 1965;30:179–85.

Djureinovic D, Hallström BM, Horie M, Mattsson JSM, La Fleur L, Fagerberg L, et al. Profiling cancer testis antigens in non–small-cell lung cancer. JCI Insight. 2016;1(10).  https://doi.org/10.1172/jci.insight.86837 .

Mrozek-Gorska P, Buschle A, Pich D, Schwarzmayr T, Fechtner R, Scialdone A, et al. Epstein-Barr virus reprograms human B lymphocytes immediately in the prelatent phase of infection. Proc Natl Acad Sci. 2019;116(32):16046–55.

Gerard D. Data-based RNA-seq simulations by binomial thinning. BMC Bioinformatics. 2020;21:1–14.

UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research. 2023;51(D1):D523–D31.  https://doi.org/10.1093/nar/gkac1052 .

Best MG, Sol N, Kooi I, Tannous J, Westerman BA, Rustenburg F, et al. RNA-Seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell. 2015;28(5):666–76.

Liu Y, Zhou F, Yang H, Zhang Z, Zhang J, He K, et al. Porphyromonas gingivalis promotes malignancy and chemo-resistance via GSK3β-mediated mitochondrial oxidative phosphorylation in human esophageal squamous cell carcinoma. Transl Oncol. 2023;32:101656.

Liu X-S, Chen Y-X, Wan H-B, Wang Y-L, Wang Y-Y, Gao Y, et al. TRIP6 a potential diagnostic marker for colorectal cancer with glycolysis and immune infiltration association. Sci Rep. 2024;14(1):4042.

Nishida N, Nagahara M, Sato T, Mimori K, Sudo T, Tanaka F, et al. Microarray analysis of colorectal cancer stromal tissue reveals upregulation of two oncogenic miRNA clusters. Clin Cancer Res. 2012;18(11):3054–70.

Lee CJ, Baek B, Cho SH, Jang TY, Jeon SE, Lee S, et al. Machine learning with in silico analysis markedly improves survival prediction modeling in colon cancer patients. Cancer Med. 2023;12(6):7603–15.

Liu Z, Zhang Y, Dang Q, Wu K, Jiao D, Li Z, et al. Genomic alteration characterization in colorectal cancer identifies a prognostic and metastasis biomarker: FAM83A| IDO1. Front Oncol. 2021;11:632430.

Hao S, Huang M, Xu X, Wang X, Huo L, Wang L, et al. MDN1 mutation is associated with high tumor mutation burden and unfavorable prognosis in breast cancer. Front Genet. 2022;13:857836.

Guo Y, Feng Y, Liu H, Luo S, Clarke JW, Moorman PG, et al. Potentially functional genetic variants in the TNF/TNFR signaling pathway genes predict survival of patients with non-small cell lung cancer in the PLCO cancer screening trial. Mol Carcinog. 2019;58(7):1094–104.

Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8(3):389–99.

Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 2020;21:1–30.

Lee KJ, Tilling KM, Cornish RP, Little RJ, Bell ML, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: the treatment and reporting of missing data in observational studies framework. J Clin Epidemiol. 2021;134:79–88.

Mackinnon A. The use and reporting of multiple imputation in medical research–a review. J Intern Med. 2010;268(6):586–93.

Hayati Rezvan P, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15:1–14.

Enquobahrie DA, MacDonald J, Hussey M, Bammler TK, Loftus CT, Paquette AG, et al. Prenatal exposure to particulate matter and placental gene expression. Environ Int. 2022;165:107310.

Paquette AG, MacDonald J, Lapehn S, Bammler T, Kruger L, Day DB, et al. A comprehensive assessment of associations between prenatal phthalate exposure and the placental transcriptomic landscape. Environ Health Perspect. 2021;129(9):097003.

Paquette AG, Lapehn S, Freije S, MacDonald J, Bammler T, et al. Placental transcriptomic signatures of prenatal exposure to Hydroxy-Polycyclic aromatic hydrocarbons. Environ Int. 2023;172:107763.

Baker BH, Joo YY, Park J, Cha J, Baccarelli AA, Posner J. Maternal age at birth and child attention-deficit hyperactivity disorder: causal association or familial confounding? J Child Psychol Psychiatry. 2023;64(2):299–310.

Hvolgaard Mikkelsen S, Olsen J, Bech BH, Obel C. Parental age and attention-deficit/hyperactivity disorder (ADHD). Int J Epidemiol. 2017;46(2):409–20.

Chang Z, Lichtenstein P, D’Onofrio BM, Almqvist C, Kuja-Halkola R, Sjölander A, et al. Maternal age at childbirth and risk for ADHD in offspring: a population-based cohort study. Int J Epidemiol. 2014;43(6):1815–24.

Waldenström U, Cnattingius S, Vixner L, Norman M. Advanced maternal age increases the risk of very preterm birth, irrespective of parity: a population-based register study. BJOG: An International Journal of Obstetrics and Gynaecology. 2017;124(8):1235–44.

Fuchs F, Monet B, Ducruet T, Chaillet N, Audibert F. Effect of maternal age on the risk of preterm birth: a large cohort study. PLoS One. 2018;13(1):e0191002.

Paquette AG, MacDonald J, Bammler T, Day DB, Loftus CT, Buth E, et al. Placental transcriptomic signatures of spontaneous preterm birth. Am J Obs Gynecol. 2023;228(1):73 e1-. e18.

Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic acids research. 2012;40(17):e133-e.

Luo W, Friedman MS, Shedden K, Hankenson KD, Woolf PJ. GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics. 2009;10:1–17.

Luo Y. Evaluating the state of the art in missing data imputation for clinical data. Briefings in Bioinformatics. 2022;23(1):bbab489.

Law CW, Alhamdoosh M, Su S et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved]. F1000Res. 2018;5:1408. https://doi.org/10.12688/f1000research.9005.3 .

Chen Y, Lun ATL and Smyth GK. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; peer review: 5 approved]. F1000Res. 2016;5:1438. https://doi.org/10.12688/f1000research.8987.2 .

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):1–9.

Dinno A. Exploring the sensitivity of Horn’s parallel analysis to the distributional form of random data. Multivar Behav Res. 2009;44(3):362–88.

Van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.

Barnard J, Rubin DB. Miscellanea. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995;57(1):289–300.

Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8.

Schick SF, Blount BC, Jacob P 3rd, Saliba NA, Bernert JT, El Hellani A, et al. Biomarkers of exposure to new and emerging tobacco delivery products. Am J Physiol Lung Cell Mol Physiol. 2017;313(3):L425–52.

Baker BH, Sathyanarayana S, Szpiro AA, MacDonald JW, Paquette AG. RNAseqCovarImpute: impute covariate data in RNA sequencing studies. R package version 1.2.0. 2024.  https://doi.org/10.18129/B9.bioc.RNAseqCovarImpute .

Baker BH. 2024. RNAseqCovarImpute source code for NSCLC data analysis. https://doi.org/10.5281/zenodo.13314514 .

Acknowledgements

The authors would like to thank the study staff, data teams, and co-investigators involved in the CANDLE and GAPPS cohorts as well as the ECHO-PATHWAYS consortium for their invaluable contributions to the ECHO-PATHWAYS dataset. We are also grateful to the study participants who generously volunteered their time for this study. We additionally thank the researchers and participants contributing to the publicly available NSCLC and EBV data utilized here. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This manuscript has been reviewed by PATHWAYS for scientific content and consistency of data interpretation with previous PATHWAYS publications.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2.

ECHO PATHWAYS is funded by the National Institutes of Health (NIH; 1UG3OD023271-01, 4UH3OD023271-03). The generation of RNA-sequencing data for this study was supported by the University of Washington Interdisciplinary Center for Exposures, Diseases, Genomics, and Environment funded by the National Institute of Environmental Health Sciences (NIEHS; 2P30ES007033). This research was conducted using specimens and data collected from the Conditions Affecting Neurocognitive Development and Learning in Early Childhood (CANDLE) study which was funded by the Urban Child Institute. This research was conducted using specimens and data collected and stored on behalf of the Global Alliance to Prevent Prematurity and Stillbirth (GAPPS) Repository. BB was supported by the UW NIEHS sponsored Biostatistics, Epidemiologic and Bioinformatic Training in Environmental Health (BEBTEH) Training Grant: NIEHS T32ES015459. AP was supported by NICHD K99/R00HD096112 and 1R01ES033785.

Author information

Adam A. Szpiro, James W. MacDonald, and Alison G. Paquette contributed equally to this work.

Authors and Affiliations

Department of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA, USA

Brennan H. Baker, Sheela Sathyanarayana, James W. MacDonald & Alison G. Paquette

Center for Child Health, Behavior, and Development, Seattle Children’s Research Institute, Seattle, WA, USA

Brennan H. Baker & Sheela Sathyanarayana

Department of Pediatrics, University of Washington, Seattle, WA, USA

Sheela Sathyanarayana & Alison G. Paquette

Department of Epidemiology, University of Washington, Seattle, WA, USA

Sheela Sathyanarayana

Department of Biostatistics, University of Washington, Seattle, WA, USA

Adam A. Szpiro

Center for Developmental Biology and Regenerative Medicine, Seattle Children’s Research Institute, Seattle, WA, USA

Alison G. Paquette

Contributions

BB: conceptualization, methodology, software, formal analysis, visualization, writing—original draft, writing—review and editing. SS: supervision, funding acquisition, writing—review and editing. AS: supervision, methodology, writing—review and editing. JM: supervision, methodology, software, writing—review and editing. AP: supervision, methodology, visualization, writing—review and editing.

Corresponding author

Correspondence to Brennan H. Baker .

Ethics declarations

Ethics approval and consent to participate.

All participants of the CANDLE and GAPPS studies provided informed consent upon enrollment and research protocols were approved by the IRBs at the University of Tennessee Health Science Center (IRB approval: 17-05154-XP) as well as the Seattle Children’s Hospital (IRB approval: STUDY00000608) and the University of Washington (IRB approval: STUDY00000638). All methods comply with the Helsinki Declaration.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplemental methods, supplemental results, and supplemental Figs. S1–S12.

Additional file 2: The review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Baker, B.H., Sathyanarayana, S., Szpiro, A.A. et al. RNAseqCovarImpute: a multiple imputation procedure that outperforms complete case and single imputation differential expression analysis. Genome Biol 25 , 236 (2024). https://doi.org/10.1186/s13059-024-03376-7


Received : 18 May 2023

Accepted : 23 August 2024

Published : 03 September 2024

DOI : https://doi.org/10.1186/s13059-024-03376-7


Keywords

  • RNA-sequencing
  • Gene expression
  • Multiple imputation
  • Missing data



Control Groups and Treatment Groups | Uses & Examples

Published on July 3, 2020 by Lauren Thomas. Revised on June 22, 2023.

In a scientific study, a control group is used to establish causality by isolating the effect of an independent variable.

Here, researchers change the independent variable in the treatment group and keep it constant in the control group. Then they compare the results of these groups.

Control groups in research

Using a control group means that any change in the dependent variable can be attributed to the independent variable. This helps prevent extraneous variables or confounding variables from impacting your work, as well as a few types of research bias, like omitted variable bias.

Table of contents

  • Control groups in experiments
  • Control groups in non-experimental research
  • Importance of control groups
  • Other interesting articles
  • Frequently asked questions about control groups

Control groups are essential to experimental design. When researchers are interested in the impact of a new treatment, they randomly divide their study participants into at least two groups:

  • The treatment group (also called the experimental group) receives the treatment whose effect the researcher is interested in.
  • The control group receives either no treatment, a standard treatment whose effect is already known, or a placebo (a fake treatment to control for placebo effect).

The treatment is any independent variable manipulated by the experimenters, and its exact form depends on the type of research being performed. In a medical trial, it might be a new drug or therapy. In public policy studies, it could be a new social policy that some receive and not others.

In a well-designed experiment, all variables apart from the treatment should be kept constant between the two groups. This means researchers can correctly measure the entire effect of the treatment without interference from confounding variables.

For example, suppose you want to test whether a financial incentive improves student grades:

  • You pay the students in the treatment group for achieving high grades.
  • Students in the control group do not receive any money.

Studies can also include more than one treatment or control group. Researchers might want to examine the impact of multiple treatments at once, or compare a new treatment to several alternatives currently available.

For example, in a trial of a new blood pressure medication:

  • The treatment group gets the new pill.
  • Control group 1 gets an identical-looking sugar pill (a placebo).
  • Control group 2 gets a pill already approved to treat high blood pressure.

Since the only variable that differs between the three groups is the type of pill, any differences in average blood pressure between the three groups can be credited to the type of pill they received.

  • The difference between the treatment group and control group 1 demonstrates the effectiveness of the pill as compared to no treatment.
  • The difference between the treatment group and control group 2 shows whether the new pill improves on treatments already available on the market.


Control groups in non-experimental research

Although control groups are more common in experimental research, they can be used in other types of research too. Researchers generally rely on non-experimental control groups in two cases: quasi-experimental designs and matching designs.

Control groups in quasi-experimental design

While true experiments rely on random assignment to the treatment or control groups, quasi-experimental design uses some criterion other than randomization to assign people.

Often, these assignments are not controlled by researchers, but are pre-existing groups that have received different treatments. For example, researchers could study the effects of a new teaching method that was applied in some classes in a school but not others, or study the impact of a new policy that is implemented in one state but not in the neighboring state.

In these cases, the classes that did not use the new teaching method, or the state that did not implement the new policy, is the control group.

Control groups in matching design

In correlational research, matching represents a potential alternative when you cannot use either true or quasi-experimental designs.

In matching designs, the researcher matches individuals who received the “treatment”, or independent variable under study, to others who did not: the control group.

Each member of the treatment group thus has a counterpart in the control group identical in every way possible outside of the treatment. This ensures that the treatment is the only source of potential differences in outcomes between the two groups.
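As a rough illustration of the idea, the sketch below performs exact matching in Python: each exposed subject is paired with a control that has identical values on two hypothetical matching variables (age band and sex). All column names and values are invented for illustration; real matching workflows usually rely on dedicated packages, and often on propensity scores, rather than a simple merge.

```python
import pandas as pd

# Hypothetical observational data: an exposure flag plus two potential confounders.
df = pd.DataFrame({
    "id":       [1, 2, 3, 4, 5, 6, 7, 8],
    "exposed":  [1, 1, 1, 0, 0, 0, 0, 0],
    "age_band": ["30s", "40s", "30s", "30s", "40s", "30s", "50s", "40s"],
    "sex":      ["F", "M", "F", "F", "M", "M", "F", "M"],
})

exposed = df[df["exposed"] == 1].drop(columns="exposed")
controls = df[df["exposed"] == 0].drop(columns="exposed")

# Exact matching: pair each exposed subject with a control that has identical
# values on the matching variables, keeping one control per exposed subject.
matched = (
    exposed.merge(controls, on=["age_band", "sex"], suffixes=("_exposed", "_control"))
           .drop_duplicates(subset="id_exposed")
)

print(matched[["id_exposed", "id_control", "age_band", "sex"]])
```

Note that this toy version matches with replacement (the same control can be paired with more than one exposed subject), which a real study design might not allow.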

Importance of control groups

Control groups help ensure the internal validity of your research. You might see a difference over time in your dependent variable in your treatment group. However, without a control group, it is difficult to know whether the change has arisen from the treatment. It is possible that the change is due to other variables.

If you use a control group that is identical in every other way to the treatment group, you know that the treatment–the only difference between the two groups–must be what has caused the change.

For example, people often recover from illnesses or injuries over time regardless of whether they’ve received effective treatment or not. Thus, without a control group, it’s difficult to determine whether improvements in medical conditions come from a treatment or just the natural progression of time.

Risks from invalid control groups

If your control group differs from the treatment group in ways that you haven’t accounted for, your results may reflect the interference of confounding variables instead of your independent variable.

Minimizing this risk

A few methods can aid you in minimizing the risk from invalid control groups.

  • Ensure that all potential confounding variables are accounted for , preferably through an experimental design if possible, since it is difficult to control for all the possible confounders outside of an experimental environment.
  • Use double-blinding . This will prevent the members of each group from modifying their behavior based on whether they were placed in the treatment or control group, which could then lead to biased outcomes.
  • Randomly assign your subjects into control and treatment groups. This method will allow you to not only minimize the differences between the two groups on confounding variables that you can directly observe, but also those you cannot.
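The random-assignment step itself is straightforward to implement. Below is a minimal sketch that uses a seeded random permutation to split a hypothetical list of subjects evenly into treatment and control groups; the subject IDs and group sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical subject IDs; in practice these would be your enrolled participants.
participants = [f"subject_{i:02d}" for i in range(1, 21)]

shuffled = rng.permutation(participants)   # random order, reproducible via the seed
half = len(shuffled) // 2

treatment_group = sorted(shuffled[:half])
control_group = sorted(shuffled[half:])

print("Treatment group:", treatment_group)
print("Control group:  ", control_group)
```

Fixing the seed keeps the allocation reproducible for auditing, while the permutation itself ensures that measured and unmeasured subject characteristics are, on average, balanced across the groups.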

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Student’s t-distribution
  • Normal distribution
  • Null and Alternative Hypotheses
  • Chi square tests
  • Confidence interval
  • Quartiles & Quantiles
  • Cluster sampling
  • Stratified sampling
  • Data cleansing
  • Reproducibility vs Replicability
  • Peer review
  • Prospective cohort study

Research bias

  • Implicit bias
  • Cognitive bias
  • Placebo effect
  • Hawthorne effect
  • Hindsight bias
  • Affect heuristic
  • Social desirability bias


Frequently asked questions about control groups

An experimental group, also known as a treatment group, receives the treatment whose effect researchers wish to study, whereas a control group does not. They should be identical in all other ways.

A true experiment (a.k.a. a controlled experiment) always includes at least one control group that doesn’t receive the experimental treatment.

However, some experiments use a within-subjects design to test treatments without a control group. In these designs, you usually compare one group’s outcomes before and after a treatment (instead of comparing outcomes between different groups).

For strong internal validity, it’s usually best to include a control group if possible. Without a control group, it’s harder to be certain that the outcome was caused by the experimental treatment and not by other variables.

A confounding variable, also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship.

A confounding variable is related to both the supposed cause and the supposed effect of the study. It can be difficult to separate the true effect of the independent variable from the effect of the confounding variable.

In your research design, it’s important to identify potential confounding variables and plan how you will reduce their impact.

There are several methods you can use to decrease the impact of confounding variables on your research: restriction, matching, statistical control and randomization.

In restriction, you restrict your sample by only including certain subjects that have the same values of potential confounding variables.

In matching, you match each of the subjects in your treatment group with a counterpart in the comparison group. The matched subjects have the same values on any potential confounding variables, and only differ in the independent variable.

In statistical control, you include potential confounders as variables in your regression.

In randomization, you randomly assign the treatment (or independent variable) in your study to a sufficiently large number of subjects, which allows you to control for all potential confounding variables.
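To illustrate the statistical-control option above, the sketch below simulates data in which a confounder (age) influences both the exposure (exercise) and the outcome (blood pressure), then compares a crude regression with one that includes the confounder as a covariate. The variable names, effect sizes, and the choice of the statsmodels library are illustrative assumptions, not a prescription.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Simulated data: age raises blood pressure and lowers the chance of exercising,
# so age confounds the exercise -> blood pressure relationship. The true effect
# of exercise on blood pressure is set to zero.
age = rng.normal(50, 10, n)
p_exercise = 1 / (1 + np.exp(0.1 * (age - 50)))        # older -> less likely to exercise
exercise = rng.binomial(1, p_exercise)
blood_pressure = 120 + 0.8 * (age - 50) + rng.normal(0, 5, n)

df = pd.DataFrame({"age": age, "exercise": exercise, "blood_pressure": blood_pressure})

crude = smf.ols("blood_pressure ~ exercise", data=df).fit()
adjusted = smf.ols("blood_pressure ~ exercise + age", data=df).fit()

print("Crude exercise coefficient:   ", round(crude.params["exercise"], 2))    # spuriously negative
print("Adjusted exercise coefficient:", round(adjusted.params["exercise"], 2)) # close to zero
```

The crude model attributes part of age's effect to exercise; once age enters the model, the exercise coefficient shrinks toward its true value of zero. Adjustment like this only removes confounding by variables that you have actually measured and included.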

Experimental design means planning a set of procedures to investigate a relationship between variables. To design a controlled experiment, you need:

  • A testable hypothesis
  • At least one independent variable that can be precisely manipulated
  • At least one dependent variable that can be precisely measured

When designing the experiment, you decide:

  • How you will manipulate the variable(s)
  • How you will control for any potential confounding variables
  • How many subjects or samples will be included in the study
  • How subjects will be assigned to treatment levels

Experimental design is essential to the internal and external validity of your experiment.

Cite this Scribbr article


Thomas, L. (2023, June 22). Control Groups and Treatment Groups | Uses & Examples. Scribbr. Retrieved September 3, 2024, from https://www.scribbr.com/methodology/control-group/


COMMENTS

  1. Confounding Variables

    Confounding variables (a.k.a. confounders or confounding factors) are a type of extraneous variable that are related to a study's independent and dependent variables. A variable must meet two conditions to be a confounder: It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.

  2. How to control confounding effects by statistical analysis

    To control for confounding in the analyses, investigators should measure the confounders in the study. Researchers usually do this by collecting data on all known, previously identified confounders. There are mostly two options for dealing with confounders at the analysis stage: stratification and multivariate methods.

  3. Confounding Variables in Psychology: Definition & Examples

    A confounding variable in psychology is an extraneous factor that interferes with the relationship between an experiment's independent and dependent variables. It's not the variable of interest but can influence the outcome, leading to inaccurate conclusions about the relationship being studied. For instance, if studying the impact of ...

  4. What is a Confounding Variable? (Definition & Example)

    Confounding variable: A variable that is not included in an experiment, yet affects the relationship between the two variables in an experiment. This type of variable can confound the results of an experiment and lead to unreliable findings. For example, suppose a researcher collects data on ice cream sales and shark attacks and finds that the ...

  5. Mastering the Control of Confounding Variables in Psychology Experiments

    Conclusion: Mastering the Control of Confounding Variables for Accurate Results. Mastering the control of confounding variables is paramount to ensuring the accuracy and reliability of research results in psychology experiments. Confounding variables, if left unchecked, can distort the true relationship between the independent and dependent ...

  6. Confounding Variable: Definition & Examples

    In studies examining possible causal links, a confounding variable is an unaccounted factor that impacts both the potential cause and effect and can distort the results. Recognizing and addressing these variables in your experimental design is crucial for producing valid findings. Statisticians also refer to confounding variables that cause ...

  7. Confounding Variables

    For example, in a drug trial, matching participants by age, gender, and baseline health status can help control for these factors. 3. Statistical Control. Advanced statistical techniques like multiple regression analysis can help account for the influence of known confounding variables in data analysis. 4.

  8. Confounding Variables in Psychology Research

    The best way to control for confounding variables is to conduct "true experimental research," which means researchers experimentally manipulate a variable that they think causes a certain outcome. They typically do this by randomly assigning study participants to different levels of the first variable, which is referred to as the ...

  9. 1.4.1

    1.4.1 - Confounding Variables. Randomized experiments are typically preferred over observational studies or experimental studies that lack randomization because they allow for more control. A common problem in studies without randomization is that there may be other variables influencing the results. These are known as confounding variables.

  10. Confounding: What it is and how to deal with it

    Keywords. Confounding, sometimes referred to as confounding bias, is mostly described as a 'mixing' or 'blurring' of effects. 1 It occurs when an investigator tries to determine the effect of an exposure on the occurrence of a disease (or other outcome), but then actually measures the effect of another factor, a confounding variable.

  11. Confounding Variable: Simple Definition and Example

    Introduce control variables to control for confounding variables. For example, you could control for age by only measuring 30 year olds. Within subjects designs test the same subjects each time. Anything could happen to the test subject in the "between" period so this doesn't make for perfect immunity from confounding variables.

  12. Confounding Variable

    Confounding variables can obscure or distort the true relationship between the independent and dependent variables being studied. Confounding Variable Control Methods. Methods for controlling confounding variables in research are as follows: Randomization. Randomization is a powerful method for controlling confounding variables in experimental ...

  13. What Is a Confounding Variable? Definition and Examples

    A confounding variable is a variable that influences both the independent variable and dependent variable and leads to a false correlation between them. A confounding variable is also called a confounder, confounding factor, or lurking variable. Because confounding variables often exist in experiments, correlation does not mean causation.

  14. Control Variables

    A control variable is anything that is held constant or limited in a research study. It's a variable that is not of interest to the study's objectives, but is controlled because it could influence the outcomes. Variables may be controlled directly by holding them constant throughout a study (e.g., by controlling the room temperature in an ...

  15. Confounding Variables

    Confounding variables (aka confounders or confounding factors) are a type of extraneous variable related to a study's independent and dependent variables. A variable must meet two conditions to be a confounder: It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.

  16. 1.5: Confounding Variables

    A confounding variable is a variable that may affect the dependent variable. This can lead to erroneous conclusions about the relationship between the independent and dependent variables. You deal with confounding variables by controlling them; by matching; by randomizing; or by statistical control. Due to a variety of genetic, developmental ...

  17. Guide to Experimental Design

    Table of contents. Step 1: Define your variables. Step 2: Write your hypothesis. Step 3: Design your experimental treatments. Step 4: Assign your subjects to treatment groups. Step 5: Measure your dependent variable. Other interesting articles. Frequently asked questions about experiments.

  18. Confounding Variables

    Learn how to identify and control for confounding variables in experiments and ensure valid results. Discover common factors that can influence results, such as order effects and participant variability. Improve your experimental design with our lesson plans and support packs. See more here.

  19. Confounding Variable / Third Variable

    A confounding variable, also known as a third variable or a mediator variable, influences both the independent variable and dependent variable. Being unaware of or failing to control for confounding variables may cause the researcher to analyze the results incorrectly. The results may show a false correlation between the dependent and ...

  20. Confounding variables

    A confounding variable is a variable, other than the independent variable that you're interested in, that may affect the dependent variable. This can lead to erroneous conclusions about the relationship between the independent and dependent variables. You deal with confounding variables by controlling them; by matching; by randomizing; or by ...

  21. Experimental vs Observational Studies: Differences & Examples

    Establish Causality: Experimental studies can establish causal relationships between variables by controlling and using randomization. Control Over Confounding Variables: The controlled environment allows researchers to minimize the influence of external variables that might skew results.

  22. What Is a Controlled Experiment?

    Revised on June 22, 2023. In experiments, researchers manipulate independent variables to test their effects on dependent variables. In a controlled experiment, all variables other than the independent variable are controlled or held constant so they don't influence the dependent variable. Controlling variables can involve:

  23. What Is a Control Variable? Definition and Examples

    Examples of confounding variables could be humidity, magnetism, and vibration. Sometimes you can identify a confounding variable and turn it into a control variable. ... Examples of control variables in this experiment could include the age of the cattle, their breed, whether they are male or female, the amount of supplement, the way the ...

  24. RNAseqCovarImpute: a multiple imputation procedure that outperforms

    Failure to control for these variables in analyses of maternal age could lead to erroneous conclusions and even faulty clinical recommendations. For instance, studies have shown that the positive associations of young maternal age with child ADHD in unadjusted analyses are eliminated or even reversed following adjustment for confounding and ...

  25. Control Groups and Treatment Groups

    A true experiment (a.k.a. a controlled experiment) always includes at least one control group that doesn't receive the experimental treatment. However, some experiments use a within-subjects design to test treatments without a control group. In these designs, you usually compare one group's outcomes before and after a treatment (instead of comparing outcomes between different groups).