Design and Analysis of Experiments and Observational Studies using R: A Volume in the Chapman & Hall/CRC Texts in Statistical Science Series

3 Comparing Two Treatments

3.1 Introduction

Consider the following scenario. Volunteers for a medical study are randomly assigned to two groups to investigate which group has a higher mortality rate. One group receives the standard treatment for the disease, and the other group receives an experimental treatment. Since people were randomly assigned to the two groups, the two groups of patients should be similar except for the treatment they received.

If the group receiving the experimental treatment lives longer on average, and the difference in survival is both practically meaningful and statistically significant, then, because of the randomized design, it's reasonable to infer that the new treatment caused patients to live longer. Randomization is supposed to ensure that the groups will be similar with respect to both measured and unmeasured factors that affect study participants' mortality.

Consider two treatments labelled A and B. In other words, interest lies in a single factor with two levels. Examples of study objectives that lead to comparing two treatments are:

  • Is fertilizer A or B better for growing wheat?
  • Is a new vaccine more effective than a placebo at preventing COVID-19 infection?
  • Will web page design A or B lead to different sales volumes?

These are all examples of comparing two treatments. In experimental design, treatments are the different procedures applied to experimental units: the plots, patients, or web pages to which we apply the treatments.

In the first example, the treatments are two fertilizers, and the experimental units might be plots of land. In the second example, the treatments are an active vaccine and a placebo (sham) vaccine to prevent COVID-19, and the experimental units are volunteers who consented to participate in a vaccine study. In the third example, the treatments are two web page designs, and the website visitors are the experimental units.

3.2 Treatment Assignment Mechanism and Propensity Score

In a randomized experiment, the treatment assignment mechanism is developed and controlled by the investigator, and the probability of an assignment of treatments to the units is known before data is collected. Conversely, in a non-randomized experiment, the assignment mechanism and probability of treatment assignments are unknown to the investigator.

Suppose, for example, that an investigator wishes to randomly assign two experimental units, unit 1 and unit 2, to two treatments (A and B). Table 3.1 shows all possible treatment assignments.

Table 3.1: All Possible Treatment Assignments: Two Units, Two Treatments
Treatment Assignment unit1 unit2
1 A A
2 B A
3 A B
4 B B

3.2.1 Propensity Score

The probability that an experimental unit receives a particular treatment is called the propensity score. In this case, the probability that an experimental unit receives treatment A (or B) is 1/2.

It’s important to note that the probability of a treatment assignment and propensity scores are different probabilities, although in some designs they may be equal.

In general, if there are \(N\) experimental units and two treatments then there are \(2^N\) possible treatment assignments.

3.2.2 Assignment Mechanism

There are four possible treatment assignments when there are two experimental units and two treatments, and the probability of a particular treatment assignment is 1/4. This probability is called the assignment mechanism: the probability that a particular treatment assignment will occur (see Section 7.2 for further discussion).

3.2.3 Computation Lab: Treatment Assignment Mechanism and Propensity Score

expand.grid() was used to compute Table 3.1. This function takes the possible treatments for each unit and returns a data frame containing one row for each combination. Each row corresponds to a possible randomization or treatment assignment.
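A minimal sketch of the computation (the column names unit1 and unit2 are assumptions):

trt <- c("A", "B")
expand.grid(unit1 = trt, unit2 = trt)  # one row per possible assignment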

3.3 Completely Randomized Designs

In the case where there are two units and two treatments, the experiment wouldn't be very informative if both units received A or both received B, so it makes sense to rule out these scenarios and assign treatments such that one unit receives A and the other receives B. This leaves two possible treatment assignments: treatment assignments 2 and 3 in Table 3.1. The probability of a treatment assignment is 1/2, and the probability that an individual unit receives treatment A (or B) is still 1/2.

A completely randomized experiment has the number of units assigned to treatment A, \(N_A\) , fixed in advance so that the number of units assigned to treatment B, \(N_B = N-N_A\) , is also fixed in advance. In such a design, \(N_A\) units are randomly selected, from a population of \(N\) units, to receive treatment A, with the remaining \(N_B\) units assigned to treatment B. Each unit has probability \(N_A/N\) of being assigned to treatment A.

How many ways can \(N_A\) experimental units be selected from \(N\) experimental units such that the order of selection doesn’t matter and replacement is not allowed (i.e., a unit cannot be selected more than once)? This is the same as the distinct number of treatment assignments. There are \(N \choose N_A\) distinct treatment assignments with \(N_A\) units out of \(N\) assigned to treatment A. Therefore, the assignment mechanism or the probability of any particular treatment assignment is \(1/{\binom{N}{N_A}}.\)

Example 3.1 (Comparing Fertilizers) Is fertilizer A better than fertilizer B for growing wheat? It is decided to take one large plot of land and divide it into twelve smaller plots, then treat some plots with fertilizer A and some with fertilizer B. How should we assign fertilizers (treatments) to plots of land (Table 3.2)?

Some of the plots get more sunlight than others, and the plots do not all have exactly the same soil composition, which may affect wheat yield. In other words, the plots are not identical. Nevertheless, we want to make sure that we can identify the treatment effect even though the plots are not identical. Statisticians sometimes state this as being able to identify the treatment effect (viz., the difference between fertilizers) in the presence of other sources of variation (viz., differences between plots).

Ideally, we would assign fertilizer A to six plots and fertilizer B to six plots. How can this be done so that the only difference between the two groups of plots is fertilizer type? One way to assign the two fertilizers to the plots is to use six playing cards labelled A (for fertilizer A) and six playing cards labelled B (for fertilizer B), shuffle the cards, and then assign the first card to plot 1, the second card to plot 2, and so on.

Table 3.2: Observed Treatment Assignment in Example 3.1 (plot, treatment, yield)
Plot 1: B, 11.4    Plot 4: A, 16.5    Plot 7: A, 26.9    Plot 10: B, 28.5
Plot 2: A, 23.7    Plot 5: A, 21.1    Plot 8: B, 26.6    Plot 11: B, 14.2
Plot 3: B, 17.9    Plot 6: A, 19.6    Plot 9: A, 25.3    Plot 12: B, 24.3

3.3.1 Computation Lab: Completely Randomized Experiments

How can R be used to assign treatments to plots in Example 3.1? Create cards as a vector of 6 A's and 6 B's, and use the sample() function to generate a random permutation (i.e., shuffle) of cards.
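A minimal sketch (the seed is an assumption; the shuffle in Table 3.2 is one such permutation):

cards <- rep(c("A", "B"), each = 6)  # 6 A's and 6 B's
set.seed(2021)                       # assumed seed for reproducibility
shuffle <- sample(cards)             # random permutation of the cards
shuffle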

This can be used to assign B to the first plot, and A to the second plot, etc. The full treatment assignment is shown in Table 3.2 .

3.4 The Randomization Distribution

The treatment assignment in Example 3.1 is the one that the investigator used to collect the data in Table 3.2 . This is one of the \({12 \choose 6}=\) 924 possible ways of allocating 6 A’s and 6 B’s to the 12 plots. The probability of choosing any of these treatment allocations is \(1/{12 \choose 6}=\) 0.001.

Table 3.3: Mean and Standard Deviation of Fertilizer Yield in Example 3.1

Treatment   Mean yield   Standard deviation of yield
A           22.18        3.858
B           20.48        6.999

The mean and standard deviation of the outcome variable, yield, under treatment A is \(\bar y_A^{obs}=\) 22.18, \(s_A^{obs}=\) 3.86, and under treatment B is \(\bar y_B^{obs}=\) 20.48, \(s_B^{obs}=\) 7. The observed difference in mean yield is \(\hat \delta^{obs} = \bar y_A^{obs} - \bar y_B^{obs}=\) 1.7 (see Table 3.3 ). The superscript \(obs\) refers to the statistic calculated under the treatment assignment used to collect the data or the observed treatment assignment.

The distribution of a sample can also be described by the empirical cumulative distribution function (ECDF) (see Figure 3.1):

\[{\hat F}(y)=\frac{\sum_{i = 1}^{n}I(y_i \le y)}{n},\]

where \(n\) is the number of sample points and \(I(\cdot)\) is the indicator function

\[ I(y_i \le y) = \left\{ \begin{array}{ll} 1 & \mbox{if } y_i \le y \\ 0 & \mbox{if } y_i > y \end{array} \right.\]

Figure 3.1: Distribution of Yield

Table 3.4: Random Shuffle of Treatment Assignment in Example 3.1 (plot, treatment, yield)
Plot 1: A, 11.4    Plot 4: B, 16.5    Plot 7: B, 26.9    Plot 10: B, 28.5
Plot 2: A, 23.7    Plot 5: A, 21.1    Plot 8: A, 26.6    Plot 11: A, 14.2
Plot 3: B, 17.9    Plot 6: B, 19.6    Plot 9: B, 25.3    Plot 12: A, 24.3

Is the difference in wheat yield due to the fertilizers or chance?

Assume that there is no difference in the average yield between fertilizer A and fertilizer B.

If there is no difference, then the yield would be the same even if a different treatment allocation occurred.

Under this assumption of no difference between the treatments, if a different one of the 924 treatment allocations (e.g., A, A, B, B, A, B, B, A, B, B, A, A) had been used, then the treatments assigned to plots would have been randomly shuffled, but the yield in each plot would be exactly the same as in Table 3.2. This shuffled treatment allocation is shown in Table 3.4, and the difference in mean yield for this allocation is \(\delta=\) -2.23 (recall that the observed treatment difference is \(\hat \delta^{obs} =\) 1.7).

A probability distribution for \(\delta = \bar y_A - \bar y_B\), called the randomization distribution, is constructed by calculating \(\delta\) for each possible randomization (i.e., treatment allocation).

Investigators are interested in determining whether fertilizer A produces a higher yield than fertilizer B, which corresponds to the null and alternative hypotheses

\[\begin{aligned} & H_0: \text {Fertilizers A and B have the same mean wheat yield,} \\ & H_1: \text {Fertilizer A has a greater mean wheat yield than fertilizer B.} \end{aligned}\]

3.4.1 Computation Lab: Randomization Distribution

The data from Example 3.1 is in the fertdat data frame. The code chunk below computes the randomization distribution.
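A sketch consistent with the description that follows, assuming fertdat$fert holds the twelve yields:

N <- choose(12, 6)                 # total number of randomizations
trt_assignments <- combn(1:12, 6)  # columns index the units given A
delta <- numeric(N)
for (i in 1:N) {
  delta[i] <- mean(fertdat$fert[trt_assignments[, i]]) -
    mean(fertdat$fert[-trt_assignments[, i]])
}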

N is the total number of possible treatment assignments or randomizations.

trt_assignments <- combn(1:12,6) generates all combinations of 6 elements taken from 1:12 (i.e., 1 through 12) as a \(6 \times 924\) matrix, where the \(i^{th}\) column trt_assignments[,i] , \(i=1,\ldots,924\) , represents the experimental units assigned to treatment A.

fertdat$fert[trt_assignments[,i]] selects fertdat$fert values indexed by trt_assignments[,i] . These values are assigned to treatment A. fertdat$fert[-trt_assignments[,i]] drops fertdat$fert values indexed by trt_assignments[,i] . These values are assigned to treatment B.

3.5 The Randomization p-value

3.5.1 One-sided Randomization p-value

Let \(T\) be a test statistic, such as the difference between treatment means or medians. The p-value of the randomization test of \(H_0: T=0\) is the probability of obtaining a value of the test statistic as extreme as or more extreme than the observed value \(t^{*}\) (i.e., in favour of \(H_1\)). It is computed as the proportion of randomizations that yield a value as extreme as or more extreme than \(t^{*}\).

Definition 3.1 (One-sided Randomization p-value) Let \(T\) be a test statistic and \(t^{*}\) the observed value of \(T\) . The one-sided p-value to test \(H_0:T=0\) is defined as:

\[\begin{aligned} P(T \ge t^{*})&= \sum_{i = 1}^{N \choose N_A} \frac{I(t_i \ge t^{*})}{{N \choose N_A}} \mbox{, if } H_1:T>0; \\ P(T \le t^{*})&=\sum_{i = 1}^{N \choose N_A} \frac{I(t_i \le t^{*})}{{N \choose N_A}} \mbox{, if } H_1:T<0. \end{aligned}\]

A hypothesis test to answer the question posed in Example 3.1 is \(H_0:\delta=0\) vs. \(H_1:\delta>0,\) where \(\delta=\bar y_A-\bar y_B.\) The observed value of the test statistic is \(t^{*}=1.7\).

3.5.2 Two-sided Randomization p-value

If we are using a two-sided alternative, then how do we calculate the randomization p-value? The randomization distribution may not be symmetric, so there is no justification for simply doubling the probability in one tail.

Definition 3.2 (Two-sided Randomization p-value) Let \(T\) be a test statistic and \(t^{*}\) the observed value of \(T\) . The two-sided p-value to test \(H_0:T=0 \mbox{ vs. } H_1:T \ne 0\) is defined as:

\[P(\left|T\right| \ge \left|t^{*}\right|) = \sum_{i = 1}^{N \choose N_A} \frac{I(\left|t_i\right| \ge \left|t^{*}\right|)}{{N \choose N_A}}.\]

The numerator counts the number of randomizations where \(t_i\) is at least as extreme as \(t^{*}\) in either direction, i.e., \(t_i \ge |t^{*}|\) or \(t_i \le -|t^{*}|\).

3.5.3 Computation Lab: Randomization p-value

The randomization distribution was computed in Section 3.4.1 and stored in delta. We want to compute the proportion of randomizations with a value of the test statistic at least as large as obs_diff.

delta >= obs_diff creates a logical vector that is TRUE if delta >= obs_diff and FALSE otherwise, and sum() applied to this vector counts the number of TRUE values.
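A minimal sketch, assuming delta from Section 3.4.1 and the observed assignment in fertdat$shuffle:

obs_diff <- mean(fertdat$fert[fertdat$shuffle == "A"]) -
  mean(fertdat$fert[fertdat$shuffle == "B"])  # 1.7
sum(delta >= obs_diff) / length(delta)        # one-sided p-value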

The p-value can be interpreted as the proportion of randomizations that would produce an observed mean difference between A and B of at least 1.7, assuming the null hypothesis is true. In other words, under the assumption that there is no difference between the treatment means, 30.3% of randomizations would produce a difference as extreme as or more extreme than the observed mean difference of 1.7.

The two-sided p-value to test if there is a difference between fertilizers A and B in Example 3.1 can be computed as
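A minimal sketch, reusing delta and obs_diff from above:

sum(abs(delta) >= abs(obs_diff)) / length(delta)  # two-sided p-value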

In this case, the randomization distribution is roughly symmetric, so the two-sided p-value is approximately double the one-sided p-value.

The R code to produce Figure 3.2, without annotations, is shown below. The plot displays the randomization distribution of \(\delta=\bar y_A - \bar y_B\) for Example 3.1. The left panel shows the distribution using \(1-\hat F_{\delta}\), with a dotted line indicating how to read the p-value from the graph, and the right panel shows a histogram in which the black bars mark the values more extreme than the observed value.
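A rough base-R sketch of the two panels (the book's annotated figure may be produced differently):

par(mfrow = c(1, 2))
plot(sort(delta), 1 - ecdf(delta)(sort(delta)), type = "s",
     xlab = expression(delta), ylab = expression(1 - hat(F)(delta)))
hist(delta, breaks = 30, main = "", xlab = expression(delta))
abline(v = obs_diff, lty = 2)  # observed difference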

Figure 3.2: Randomization Distribution of Difference of Means

3.5.4 Randomization Confidence Intervals

Consider a completely randomized design comparing two groups where the treatment effect is additive. In Example 3.1, suppose that the yields for fertilizer A were shifted by \(\Delta\); the shifted responses \(y_{i_A}-\Delta\), \(i=1,\ldots,6\), should then be similar to the \(y_{i_B}\), and the randomization test on these two sets of responses should not reject \(H_0\). In other words, the difference between the distributions of yield for fertilizers A and B can be removed by subtracting \(\Delta\) from each plot assigned to fertilizer A.

Loosely speaking, a confidence interval for the mean difference consists of all plausible values of the parameter \(\Delta\). A randomization confidence interval can be constructed by considering all values \(\Delta_0\) for which the randomization test does not reject \(H_0:\Delta=\Delta_0 \mbox{ vs. } H_a:\Delta \ne\Delta_0\).

Definition 3.3 (Randomization Confidence Interval) Let \(T_{\Delta}\) be the test statistic calculated using the treatment responses for treatment A shifted by \(\Delta\), let \(t^{*}_{\Delta}\) be its observed value, and let \(p(\Delta)=F_{T_{\Delta}}(t^{*}_{\Delta})=P(T_{\Delta}\leq t^{*}_{\Delta})\) be the observed value of the CDF as a function of \(\Delta\).

A \(100(1-\alpha)\%\) randomization confidence interval for \(\Delta\) can then be obtained by inverting \(p(\Delta)\). A two-sided \(100(1-\alpha)\%\) interval is \((\Delta_L,\Delta_U)\), where \(\Delta_L=\max_{\{\Delta:\, p(\Delta) \leq \alpha/2\}} \Delta\) and \(\Delta_U=\min_{\{\Delta:\, p(\Delta) \geq 1-\alpha/2\}} \Delta\).

3.5.5 Computation Lab: Randomization Confidence Intervals

Computing \(\Delta_L\) and \(\Delta_U\) involves recomputing the randomization distribution of \(T_{\Delta}\) for a series of values \(\Delta_1,\ldots,\Delta_k\). This can be done by trial and error, or by a search method (see, for example, Paul H. Garthwaite 30).

In this section, a trial and error method is implemented using a series of R functions.

The function randomization_dist() computes the randomization distribution for the mean difference in a randomized two-sample design.
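A hedged sketch of what randomization_dist() might look like (the argument names resp, M, and m are assumptions):

randomization_dist <- function(resp, M, m) {
  # difference in means over all choose(M, m) assignments of m of the
  # M responses to treatment A
  trt_assignments <- combn(1:M, m)
  apply(trt_assignments, 2,
        function(idx) mean(resp[idx]) - mean(resp[-idx]))
}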

The function randomization_pctiles() computes \(p(\Delta)\) for a sequence of trial values for \(\Delta\) .
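A hedged sketch of randomization_pctiles(), built on the sketch of randomization_dist() above:

randomization_pctiles <- function(yA, yB, M, m, delta) {
  # p(Delta) = P(T_Delta <= t*_Delta) for each trial value of Delta
  sapply(delta, function(d) {
    yAshift <- yA - d                 # shift the treatment-A responses
    tobs <- mean(yAshift) - mean(yB)  # observed value of T_Delta
    mean(randomization_dist(c(yAshift, yB), M, m) <= tobs)
  })
}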

The function randomization_ci() computes the \(\Delta_L,\Delta_U\) as well as the confidence level of the interval.
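A hedged sketch of randomization_ci(); the returned names follow the text of Example 3.2:

randomization_ci <- function(alpha, yA, yB, M, m, delta) {
  pdelta <- randomization_pctiles(yA, yB, M, m, delta)
  Lptile <- max(pdelta[pdelta <= alpha / 2])      # percentile giving Delta_L
  Uptile <- min(pdelta[pdelta >= 1 - alpha / 2])  # percentile giving Delta_U
  data.frame(Lptile = Lptile, Uptile = Uptile,
             conf_level = Lptile + (1 - Uptile),  # achieved significance level
             LCI = max(delta[pdelta <= alpha / 2]),      # Delta_L
             UCI = min(delta[pdelta >= 1 - alpha / 2]))  # Delta_U
}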

Example 3.2 (Confidence interval for wheat yield in Example 3.1) A 99% randomization confidence interval for the wheat data can be obtained using randomization_ci(). The data for the two groups are given by yA and yB, the significance level is alpha, and there are M total experimental units with m experimental units in one of the groups. The sequence of values for \(\Delta\) is found by trial and error, but it's important that the tails of the distribution are computed far enough out that we have values for the upper and lower \(\alpha/2\) percentiles.
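A hedged usage sketch; the yields are taken from Table 3.2, and the grid of trial values for \(\Delta\) is an assumption:

yA <- c(16.5, 23.7, 21.1, 19.6, 26.9, 25.3)  # fertilizer A yields
yB <- c(11.4, 17.9, 26.6, 28.5, 14.2, 24.3)  # fertilizer B yields
delta <- seq(-12, 18, by = 1)                # trial values for Delta
randomization_ci(alpha = 0.01, yA = yA, yB = yB, M = 12, m = 6,
                 delta = delta)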

A plot of \(p(\Delta)\) is shown in Figure 3.3. delta is selected so that pdelta is computed in the tails of the distribution of \(T_{\Delta}\).

Figure 3.3: Distribution of \(\Delta\) in Example 3.2

Lptile and Uptile are the lower and upper percentiles of the distribution of \(T_{\Delta}\) used for the confidence interval, conf_level is the achieved significance level (so the actual confidence level is 1 − conf_level), and LCI and UCI are the limits of the confidence interval. In this case, \((\Delta_L, \Delta_U)=(-8, 14)\) is a 99.03% confidence interval for the difference between the means of treatments A and B.

3.6 Randomization Distribution of a Test Statistic

Test statistics other than \(T={\bar y}_A-{\bar y}_B\) could be used to measure the effectiveness of fertilizer A in Example 3.1 . Investigators may wish to compare differences between medians, standard deviations, odds ratios, or other test statistics.

3.6.1 Computation Lab: Randomization Distribution of a Test Statistic

The randomization distribution of the difference in group medians can be obtained by modifying the randomization_dist() function (see Section 3.5.5) used to calculate the difference in group means. We can add func as an argument to randomization_dist() so that the type of difference can be specified, as in the sketch below.
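A hedged sketch of the modified function:

randomization_dist <- function(resp, M, m, func = mean) {
  # func specifies the summary statistic compared between the groups
  trt_assignments <- combn(1:M, m)
  apply(trt_assignments, 2,
        function(idx) func(resp[idx]) - func(resp[-idx]))
}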

The randomization distribution of the difference in medians is
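A minimal sketch, assuming fertdat from the earlier labs:

delta_med <- randomization_dist(fertdat$fert, 12, 6, func = median)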

The p-value of the randomization test comparing two medians is
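A minimal sketch of the one-sided p-value for the medians:

obs_med_diff <- median(fertdat$fert[fertdat$shuffle == "A"]) -
  median(fertdat$fert[fertdat$shuffle == "B"])
sum(delta_med >= obs_med_diff) / choose(12, 6)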

3.7 Computing the Randomization Distribution using Monte Carlo Sampling

Computation of the randomization distribution involves calculating the test statistic for every possible way to split the data into two samples of sizes \(N_A\) and \(N-N_A\). If \(N = 100\) and \(N_A = 50\), this would result in \({100 \choose 50} \approx 1.0089\times 10^{29}\) differences. These types of calculations are not practical unless the sample size is small.

Instead, we can resort to Monte Carlo sampling from the randomization distribution to estimate the exact p-value.

The data set can be randomly divided into two groups and the test statistic calculated. Several thousand test statistics are usually sufficient to get an accurate estimate of the exact p-value and sampling can be done without replacement.

If \(M\) test statistics, \(t_i\) , \(i = 1,...,M\) are randomly sampled from the permutation distribution, a one-sided Monte Carlo p-value for a test of \(H_0: \mu_T = 0\) versus \(H_1: \mu_T > 0\) is

\[ {\hat p} = \frac {1+\sum_{i = 1}^M I(t_i \ge t^{*})}{M+1}.\]

Including the observed value \(t^{*}\) there are \(M+1\) test statistics.

3.7.1 Computation Lab: Calculating the Randomization Distribution using Monte Carlo Sampling

Example 3.3 (What is the effect of caffeine on reaction time?) There is scientific evidence that caffeine reduces reaction time (Tom M. McLellan, John A. Caldwell, and Harris R. Lieberman 31). A study of the effects of caffeine on reaction time was conducted on a group of 100 high school students. The investigators randomly assigned an equal number of students to two groups: one group (CAFF) consumed a caffeinated beverage prior to taking the test, and the other group (NOCAFF) consumed the same amount of water. The research objective was to test the hypothesis that caffeine would reduce reaction time among high school students. The data from the study is in the data frame rtdat.

The data indicate that the difference in median reaction times between the CAFF and NOCAFF groups is 0.056 seconds. Is the observed difference due to random chance, or is there evidence it is due to caffeine? Let's try to calculate the randomization distribution using randomization_dist().

Currently, R can only support vectors of up to \(2^{52}\) elements, 32 so computing the full randomization distribution is infeasible here. In this case, Monte Carlo sampling provides a feasible way to approximate the randomization distribution and p-value.
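A hedged Monte Carlo sketch; the column names rt and group in rtdat, and the number of simulations, are assumptions:

M <- 10000
caff <- rtdat$rt[rtdat$group == "CAFF"]
nocaff <- rtdat$rt[rtdat$group == "NOCAFF"]
obs_diff <- median(caff) - median(nocaff)  # 0.056 in the text
pooled <- c(caff, nocaff)
meddiff <- replicate(M, {
  idx <- sample(seq_along(pooled), length(caff))  # random split
  median(pooled[idx]) - median(pooled[-idx])
})
(1 + sum(meddiff >= obs_diff)) / (M + 1)  # one-sided Monte Carlo p-value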

A p-value equal to 0.004 indicates that the observed median difference would be unusual if the null hypothesis were true. Thus, this study provides evidence that caffeine slows down reaction time.

3.8 Properties of the Randomization Test

The p-value of the randomization test must be a multiple of \(1/{\binom{N} {N_A}}\) . If a significance level of \(\alpha=k/{\binom{N} {N_A}}\) , where \(k = 1,...,{N \choose N_A}\) is chosen, then \(P(\text{type I}) = \alpha.\) In other words, the randomization test is an exact test.

If \(\alpha\) is not chosen as a multiple of \(1/{\binom {N}{N_A}}\) , but \(k/{\binom {N}{N_A}}\) is the largest p-value less than \(\alpha\) , then \(P(\text{type I}) = k/{\binom {N}{N_A}}< \alpha\) , and the randomization test is conservative. Either way, the test is guaranteed to control the probability of a type I error under very minimal conditions: randomization of the experimental units to the treatments. 33

3.9 The Two-sample t-test

Consider designing a study where the primary objective is to compare a continuous variable in each group. Let \(Y_{ik}\) be the observed outcome for the \(i^{th}\) experimental unit in the \(k^{th}\) treatment group, for \(i = 1,...,n_k\) and \(k= 1,2\). The outcomes in the two groups are assumed to be independent and normally distributed with different means but equal variance \(\sigma^2\): \(Y_{ik} \sim N(\mu_k,\sigma^2).\)

Let \(\theta=\mu_1-\mu_2\) be the difference in means between the two treatments. The hypotheses \(H_0:\theta =\theta_0 \mbox{ vs. }H_1:\theta \ne \theta_0\) specify a test to evaluate whether the evidence shows that the two treatments are different.

The sample mean for each group is given by \({\bar Y}_k = (1/n_k)\sum_{i = 1}^{n_k} Y_{ik}\) , \(k = 1,2\) , and the pooled sample variance is

\[S^2_p= \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1+n_2-2)},\] where \(S_k^2\) is the sample variance for group \(k=1,2.\)

The two-sample t statistic is given by

\[\begin{equation} T=\frac {{\bar Y}_1 - {\bar Y}_2 - \theta_0}{S_p \sqrt{(1/n_1+1/n_2)}} \tag{3.1}. \end{equation}\]

When \(H_0\) is true, \(T \sim t_{n_1+n_2-2}.\)

For example, the two-sided p-value for testing \(\theta\) is \(P\left(|t_{n_1+n_2-2}|>|T^{obs}|\right)\) , where \(T^{obs}\) is the observed value of (3.1) . The hypothesis testing procedure assesses the strength of evidence contained in the data against the null hypothesis. If the p-value is adequately small, say, less than 0.05 under a two-sided test, we reject the null hypothesis and claim that there is a significant difference between the two treatments; otherwise, there is no significant difference and the study is inconclusive.

In Example 3.1, \(H_0:\mu_A=\mu_B\) and \(H_1: \mu_A > \mu_B.\) The pooled sample variance and the observed value of the two-sample t-statistic for this example are:

\[S_p^2 = \frac{(n_A-1)s_A^2+(n_B-1)s_B^2}{n_A+n_B-2} = 31.93, \quad S_p = 5.65,\] and \[T^{obs} = \frac {{\bar y}_A - {\bar y}_B}{S_p \sqrt{(1/n_A+1/n_B)}} = \frac {22.18 - 20.48}{5.65 \sqrt{(1/6+1/6)}}=0.52.\]

The p-value is \(P\left(t_{10} > 0.52\right)=\) 0.31. There is little evidence that fertilizer A produces higher yields than B.

3.9.1 Computation Lab: Two-sample t-test

We can use R to compute the p-value of the two-sample t-test for Example 3.1 . Recall that the data frame fertdat contains the data for this example: fert is the yield and shuffle is the treatment.

The pooled variance \(s_p^2\) and observed value of the two-sample t statistic are:
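A minimal sketch, assuming the observed assignment is in fertdat$shuffle:

yA <- fertdat$fert[fertdat$shuffle == "A"]
yB <- fertdat$fert[fertdat$shuffle == "B"]
sp2 <- (5 * var(yA) + 5 * var(yB)) / 10  # pooled variance
tobs <- (mean(yA) - mean(yB)) / (sqrt(sp2) * sqrt(1/6 + 1/6))
tobs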

The observed value of the two-sample t-statistic is 0.5211.

Finally, the p-value for this test can be calculated using the CDF of the \(t_n\) distribution, where \(n = 6 + 6 - 2 = 10.\)
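A minimal sketch, reusing tobs from above:

1 - pt(tobs, df = 10)  # P(t_10 > tobs)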

These calculations are also implemented in stats::t.test() .
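A hedged usage sketch (the one-sided alternative matches the test above; factor levels are assumed to be "A" and "B"):

t.test(fert ~ shuffle, data = fertdat, var.equal = TRUE,
       alternative = "greater")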

The assumption of normality can be checked using normal quantile plots, although the t-test is robust against non-normality.

Figure 3.4: Normal Quantile Plot of Fertilizer Yield in Example 3.1

Figure 3.4 indicates that the normality assumption is satisfied, although the sample sizes are fairly small.

Notice that the p-value from the randomization test and the p-value from the two-sample t-test are almost identical, although the randomization test depends on neither normality nor independence. The randomization test does depend on Fisher's concept that, after randomization, if the null hypothesis is true, the two results obtained from each particular plot will be exchangeable. The randomization test tells you what you could say if exchangeability were true.

3.10 Blocking

Randomizing subjects to two treatments should produce two treatment groups where all the covariates are balanced (i.e., have similar distributions), but it doesn’t guarantee that the treatment groups will be balanced on all covariates. In many applications there may be covariates that are known a priori to have an effect on the outcome, and it’s important that these covariates be measured and balanced, so that the treatment comparison is not affected by the imbalance in these covariates. Suppose an important covariate is income level (low/medium/high). If income level is related to the outcome of interest, then it’s important that the two treatment groups have a balanced number of subjects in each income level, and this shouldn’t be left to chance. To avoid an imbalance between income levels in the two treatment groups, the design can be blocked by income group, by separately randomizing subjects in low, medium, and high income groups.

Example 3.4 Young-Woo Kim et al. 34 conducted a randomized clinical trial to evaluate hemoglobin levels (an important component of blood) after surgery to remove a cancer. Patients were randomized to receive a new treatment or placebo. The study was conducted at seven major institutions in the Republic of Korea. Previous research has shown that the amount of cancer in a person's body, measured by cancer stage (stage I: less cancer; stages II-IV: more cancer), has an effect on hemoglobin. 450 patients (225 per group) were required to detect a significant difference in the main study outcome at the 5% level with 90% power (see Chapter 4).

To illustrate the importance of blocking, consider a realistic, although hypothetical, scenario related to Example 3.4 . Suppose that among patients eligible for inclusion in the study, 1/3 have stage I cancer, and 225 (50%) patients are randomized to the treatment and placebo groups. Table 3.5 shows that the distribution of Stage in the placebo group is different than the distribution in the Treatment group. In other words, the distribution of cancer stage in each treatment group is unbalanced. The imbalance in cancer stage might create a bias when comparing the two treatment groups since it’s known a priori that cancer stage has an effect on the main study outcome (hemoglobin level after surgery). An unbiased comparison of the treatment groups would have Stage balanced between the two groups.

Table 3.5: Distribution of Cancer Stage by Treatment Group in Example using Unrestricted Randomization
Stage Placebo Treatment
Stage I 70 80
Stage II-IV 155 145

How can an investigator guarantee that Stage is balanced in the two groups? By randomizing separately within each cancer stage, that is, by blocking by cancer stage.

3.11 Treatment Assignment in Randomized Block Designs

If Stage were balanced between the two treatment groups in Example 3.4, then 50% of stage I patients would receive Placebo and 50% Treatment. Blocking, or separating, the randomizations by Stage yields treatment groups balanced by stage. There are \(150 \choose 75\) possible randomizations for the stage I patients and \(300 \choose 150\) for the stage II-IV patients. Table 3.6 shows the result of the blocked randomization.

Table 3.6: Distribution of Cancer Stage by Treatment Group in Example using Restricted Randomization
Stage Placebo Treatment
Stage I 75 75
Stage II-IV 150 150

3.11.1 Computation Lab: Generating a Randomized Block Design

Let’s return to Example 3.4 , and suppose that we are designing a study where 450 subjects will be randomized to two treatments, and 1/3 of the 450 subjects (150) have stage I cancer. cancerdat is a data frame containing a patient id and Stage information.

First, we create a data frame cancerdat_stageI of the patient ids with stage I cancer. Next, we randomly select 50% of the ids in this block using sample(cancerdat_stageI$id, floor(nrow(cancerdat_stageI)/2)) and assign these patients to Treatment, assigning the remainder to Placebo via treat = ifelse(id %in% trtids, "Treatment", "Placebo").

The treatment assignments for the first 4 patients with stage I cancer are shown below.
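A hedged sketch of the steps just described; the column names id and stage, and the seed, are assumptions:

set.seed(2021)
cancerdat_stageI <- subset(cancerdat, stage == "Stage I")
trtids <- sample(cancerdat_stageI$id, floor(nrow(cancerdat_stageI) / 2))
cancerdat_stageI$treat <- ifelse(cancerdat_stageI$id %in% trtids,
                                 "Treatment", "Placebo")
head(cancerdat_stageI, 4)  # first 4 stage I assignments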

3.12 Randomized Matched Pairs Design

A randomized matched pairs design arranges experimental units in pairs, and treatment is randomized within a pair. In other words, each experimental unit is a block. In Chapter 5 , we will see how this idea can be extended to compare more than two treatments using randomized block designs.

Example 3.5 (Wear of boys' shoes) Measurements on the amount of wear of the soles of shoes worn by 10 boys were obtained by the following design (this example is based on 3.2 in George EP Box, J Stuart Hunter, and William Gordon Hunter 35 ).

Each boy wore a special pair of shoes with the soles made of two different synthetic materials, A (a standard material) and B (a cheaper material). The left or right sole was randomly assigned to A or B, and the amount of wear after one week was recorded (a smaller value means less wear). During the test some boys scuffed their shoes more than others, but each boy’s shoes were subjected to the same amount of wear.

In this case, each boy is a block, and the two treatments are randomized within a block.

Material was randomized to the left or right shoe by flipping a coin. The observed treatment assignment is one of \(2^{10}=1,024\) equiprobable treatment assignments.

The observed mean difference is -1.3. Figure 3.5, a connected dot plot of wear for each boy, shows that material B had higher wear for most boys.

Figure 3.5: Boy’s Shoe Example

3.13 Randomized Matched Pairs versus Completely Randomized Design

Ultimately, the goal is to compare units that are similar except for the treatment they were assigned. So, if groups of similar units can be created before randomizing, then it’s reasonable to expect that there should be less variability between the treatment groups. Blocking factors are used when the investigator has knowledge of a factor before the study that is measurable and might be strongly associated with the dependent variable.

The most basic blocked design is the randomized pairs design. This design has \(n\) units where two treatments are randomly assigned to each unit, which results in a pair of observations \((X_i,Y_i), i=1,\ldots,n,\) on each unit. In this case, each unit is a block. Assume that the \(X\)'s and \(Y\)'s have means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma^2_X\) and \(\sigma^2_Y\), that the pairs are independently distributed, and that \(Cov(X_i,Y_i)=\sigma_{XY}\). An estimate of \(\mu_X-\mu_Y\) is \(\bar D = \bar X - \bar Y\). It follows that

\[\begin{align} \begin{split} E\left(\bar D \right) &= \mu_X-\mu_Y \\ Var\left(\bar D \right) &= \frac{1}{n}\left(\sigma^2_X + \sigma^2_Y - 2\rho\sigma_X\sigma_Y \right), \end{split} \tag{3.2} \end{align}\]

where \(\rho\) is the correlation between \(X\) and \(Y\) .

Alternatively, if \(n\) units had been assigned to each of two independent treatment groups (i.e., \(2n\) units in total), then \(Var\left(\bar D\right)=(1/n) \left(\sigma^2_X+\sigma^2_Y\right).\) Comparing the variances, we see that the variance of \(\bar D\) is smaller in the paired design if the correlation is positive, so pairing is a more effective experimental design in that case.

3.14 The Randomization Test for a Randomized Paired Design

Table 3.7: Possible Randomizations for Example 3.5

Side given A (observed)   L      L      R      R      R      L      R      R      L      L
Side given A (possible)   R      R      R      R      L      R      R      R      L      L
Wear (A)                  10.39  8.79   9.64   8.37   9.74   11.1   10.76  9.76   10.99  10.74
Wear (B)                  13.22  10.61  12.51  15.31  14.21  11.51  7.54   11.31  7.7    9.84

Table 3.7 shows the observed randomization and another possible randomization of sides assigned to material A in Example 3.5. If the other possible randomization had been observed, then \(\bar y_A - \bar y_B =\) -1.4.

The within-boy differences can be analyzed so that we have one response per boy. Under the null hypothesis, the wear of a boy's left or right shoe is the same regardless of which material was on its sole, and the material assigned is based on the result of, for example, a sequence of ten tosses of a fair coin (in R, this could be implemented by sample(x = c("L","R"), size = 10, replace = TRUE)). This means that, under the null hypothesis, if the possible randomization in Table 3.7 had been observed, then for the first boy the right side would have been assigned material A and the left side material B, but the amount of wear on his left and right shoes would be the same; the difference for the first boy would therefore have been 2.8 instead of -2.8, since his wear for materials A and B would have been 13.22 and 10.39, respectively.

The randomization distribution is obtained by calculating the 1,024 averages \(\bar y_A-\bar y_B = (\pm 2.8 \pm 1.8 \pm \cdots \pm 0.9)/10\), corresponding to each of the \(2^{10}=1,024\) possible treatment assignments.

3.14.1 Computation Lab: Randomization Test for a Paired Design

The data for Example 3.5 is in the shoedat_obs data frame.

The code chunk below generates the randomization distribution.

The \(2^{10}\) treatment assignments are computed using expand.grid() on a list of 10 vectors ( c(-1,1) ): each element of the list is the potential sign of the difference for one experimental unit (i.e., boy), and expand.grid() creates a data frame from all combinations of these 10 vectors.
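A hedged sketch; the column names in shoedat_obs are assumptions:

diff <- shoedat_obs$wearA - shoedat_obs$wearB  # within-boy differences, A - B
signs <- expand.grid(rep(list(c(-1, 1)), 10))  # 2^10 sign patterns
randdist <- as.matrix(signs) %*% diff / 10     # mean difference per pattern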

Figure 3.6: Randomization Distribution–Boys’ Shoes

The p-value for testing if B has more wear than A is

\[P(D \le d^{*})= \sum_{i = 1}^{2^{10}} \frac{I(d_i \le d^{*})}{2^{10}},\]

where \(D={\bar y_A}-{\bar y_B}\) , and \(d^{*}\) is the observed mean difference.

The value of \(d^{*}=\) -1.3 is not unusual under the null hypothesis, since only 111 (i.e., about 11%) of the 1,024 differences in the randomization distribution are less than or equal to -1.3. Therefore, there is no evidence of a significant increase in the amount of wear with the cheaper material B.

3.15 Paired t-test

If we assume that the differences from Example 3.5 are a random sample from a normal distribution, then \(t=\sqrt{10}{\bar d}/S_{\bar d} \sim t_{10-1},\) where, \(S_{\bar d}\) is the sample standard deviation of the paired differences. The p-value for testing if \({\bar D} < 0\) is \(P(t_{9}< t).\) In other words, this is the same as a one-sample t-test of the differences.

3.15.1 Computation Lab: Paired t-test

In Section 3.14.1 , diff is a vector of the differences for each boy in Example 3.5 . The observed value of the t-statistic for the one-sample test can be computed.
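A minimal sketch, reusing diff:

tobs <- sqrt(10) * mean(diff) / sd(diff)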

The p-value for testing \(H_0:{\bar D} = 0\) versus \(H_a:{\bar D} < 0\) is
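A minimal sketch:

pt(tobs, df = 9)  # P(t_9 < tobs)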

Alternatively, t.test() can be used.
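A hedged usage sketch:

t.test(diff, alternative = "less")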

3.16 Exercises

Exercise 3.1 Suppose \(X_1\sim N\left(10, 25\right)\) and \(X_2\sim N\left(5, 4\right)\) in a population. You randomly select 100 samples from the population and assign treatment A to half of the sample and B to the rest. Simulate the sample with treatment assignments and the covariates, \(X_1\) and \(X_2\) . Compare the distributions of \(X_1\) and \(X_2\) in the two treatment groups. Repeat the simulation one hundred times. Do you observe consistent results?

Exercise 3.2 Identify treatments and experimental units in the following scenarios.

City A would like to evaluate whether a new employment training program for the unemployed is more effective compared to the existing program. The City decides to run a pilot program for selected employment program applicants.

Marketing Agency B creates and places targeted advertisements on video-sharing platforms for its clients. The Agency decides to run an experiment to compare the effectiveness of placing advertisements before vs. during vs. after videos.

Paul enjoys baking and finds a new recipe for chocolate chip cookies. Paul decides to test it by bringing cookies baked using his current recipe and the new recipe to his study group. Each member of the group blindly tastes each kind and provides their ratings.

Exercise 3.3 A study has three experimental units and two treatments—A and B. List all possible treatment assignments for the study. How many are there? In general, show that there are \(2^N\) possible treatment assignments for an experiment with \(N\) experimental units and 2 treatments.

Exercise 3.4 Consider the scenario in Example 3.1 , and suppose that an investigator only has enough fertilizer A to use on four plots. Answer the following questions.

What is the probability that an individual plot receives fertilizer A?

What is the probability of choosing the treatment assignment A, A, A, A, B, B, B, B, B, B, B, B?

Exercise 3.5 Show that the one-sided p-value is \(1-\hat{F}_T\left(t^*\right)\) if \(H_1:T>0\) and \(\hat{F}_T\left(t^*\right)\) if \(H_1:T<0\) , where \(\hat{F}_T\) is the ECDF of the randomization distribution of \(T\) and \(t^*\) is the observed value of \(T\) .

Exercise 3.6 Show that the two-sided p-value is \(1-\hat{F}_T\left(\lvert t^*\rvert\right)+\hat{F}_T\left(-\lvert t^*\rvert\right)\) , where \(\hat{F}_T\) is the ECDF of the randomization distribution of \(T\) and \(t^*\) is the observed value of \(T\) .

Exercise 3.7 The achieved level conf_level does not equal the nominal level 0.01 in Example 3.2. Explain why.

Exercise 3.8 Consider Example 3.5 . For each of the 10 boys, we randomly assigned the left or right sole to material A and the remaining side to B. Use R’s sample function to simulate a treatment assignment.

Exercise 3.9 Recall that the randomization test for the data in Example 3.5 fails to find evidence of a significant increase in the amount of wear with material B. Does this mean that material B has equivalent wear to material A? Explain.

Exercise 3.10 Consider the study from Example 3.4 . Recall that the clinical trial consists of 450 patients. 150 of the patients have stage I cancer and the rest have stages II-IV cancer. In Computation Lab: Generating a Randomized Block Design , we created a balanced treatment assignment for the stage I cancer patients.

Create a balanced treatment assignment for the stage II-IV cancer patients.

Combine treatment assignments for stage I and stage II-IV. Show that the distribution of stage is balanced in the overall treatment assignment.

Exercise 3.11 Consider a randomized pair design with \(n\) units where two treatments are randomly assigned to each unit, resulting in a pair of observations \(\left(X_i,Y_i\right)\), for \(i=1,\ldots,n\), on each unit. Assume that \(E[X_i]=\mu_X\), \(E[Y_i]=\mu_Y\), and \(Var(X_i)=Var(Y_i)=\sigma^2\) for \(i=1,\dots,n\). Alternatively, we may consider an unpaired design where we assign two independent treatment groups to \(2n\) units.

Show that the ratio of the variances in the paired to the unpaired design is \(1-\rho\) , where \(\rho\) is the correlation between \(X_i\) and \(Y_i\) .

If \(\rho=0.5\) , how many subjects are required in the unpaired design to yield the same precision as the paired design?

Exercise 3.12 Suppose that two drugs A and B are to be tested on 12 subjects’ eyes. The drugs will be randomly assigned to the left eye or right eye based on the flip of a fair coin. If the coin toss is heads then a subject will receive drug A in their right eye. The coin was flipped 12 times and the following sequence of heads and tails was obtained:

\[\begin{array} {c c c c c c c c c c c c} T&T&H&T&H&T&T&T&H&T&T&H \end{array}\]

Create a table that shows how the treatments will be allocated to the 12 subjects’ left and right eyes.

What is the probability of obtaining this treatment allocation?

What type of experimental design has been used to assign treatments to subjects? Explain.

  • The randomization model for comparing two treatments (including a treatment against control), for quantitative responses. Alternative hypotheses: shift, dispersion, omnibus. The ticket model: two fixed numbers per individual. Strong and weak null hypotheses.
  • Permutation test based on the sample sum of the responses of the treatment group. Approximating P-values by simulation; connection to bootstrap tests.
  • The 2-sample t-test in the randomization model. The permutation t-test.
  • Fisher's Exact Test and its normal approximation; the Lady Tasting Tea experiment

References: Lehmann, E.L., 1998. Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River, N.J.: Prentice Hall; SticiGui Chapter 19.

Comparing two treatments (e.g., treatment and control) in the randomization model

There are N subjects. The subjects are given; they are not necessarily a sample from some larger population. We assign a simple random sample of size n of the N subjects to treatment, and the remaining m = N − n subjects to control. For each subject, we observe a quantitative response. There is no assumption about the values of that quantitative response; they need not follow any particular distribution. The null hypothesis is that treatment does not matter. Several alternatives are interesting. The most common are the shift alternative, the dispersion alternative, and the omnibus alternative. The shift alternative is that treatment changes the mean response. (There are left-sided, right-sided and two-sided versions of the shift alternative.) The dispersion alternative is that treatment changes the scatter of the responses. The omnibus alternative is that treatment changes the response in some way—any way whatsoever.

Because of the deliberate randomization, whether treatment affects the response within the group of N subjects can be addressed rigorously. Up to sampling error—which can be quantified—differences between the responses of the treatment and control groups must be due to the effect of treatment; the randomization tends to balance other factors that affect the responses, and that otherwise would lead to confounding. However, conclusions about the effect of treatment among the N subjects cannot be extrapolated to any other population, because we do not know where the subjects came from (how they came to be part of the experiment).

We model the experiment as follows: Each of the N subjects is represented by a ticket with two numbers on it, a left and a right number. The left number is the response the subject would have if assigned to the control group; the right number is the response the subject would have if assigned to the treatment group. These numbers are written on the tickets before the experiment starts. Assigning the subject to treatment or control only determines whether we observe the left or the right number for that subject. Let x_j be the left number on the j-th ticket and let y_j be the right number on the j-th ticket. The strong null hypothesis is

x_j = y_j,   j = 1, 2, …, N.

That is, the strong null hypothesis is that the left and right numbers on each ticket are equal. Subject by subject, treatment makes no difference at all. The weak null hypothesis is that the average of the left numbers equals the average of the right numbers:

(x_1 + x_2 + … + x_N)/N = (y_1 + y_2 + … + y_N)/N.

In the weak null hypothesis, treatment makes no difference on average: treatment might increase the responses of some individuals, provided it decreases the responses of other individuals by a balancing amount.

In this ticket model, if the strong null hypothesis is true, the assignment to treatment or control is an arbitrary random labeling of the subjects. The N responses are the same no matter which subjects are assigned to treatment. Let {Y_1, …, Y_n} denote the responses of the n treated subjects. Under the strong null hypothesis, x_j = y_j, j = 1, …, N, where x_j is the observed response of the j-th subject. Now

Y_j = x_{k_j},   j = 1, …, n,

where {k_j : j = 1, 2, …, n} is a subset of size n of the integers {1, 2, …, N}. Consider the sum of the treatment responses, the statistic

Y = Y_1 + Y_2 + … + Y_n.

Under the strong null hypothesis, the expected value of Y is n times the average of all N responses. (For a review of the probability distribution of the sample sum of random draws from a finite population of numbers, see Chapters 12-15 of SticiGui.) If treatment tends to increase the response, we would expect Y to be larger than that; if treatment tends to decrease the response, we would expect Y to be smaller than that. We can use Y as the test statistic to test the strong null hypothesis. For the alternative hypothesis that treatment increases the response, the test would be of the form

Reject the strong null hypothesis if Y ≥ c ,

where c is chosen so that the test has the desired significance level α, i.e., so that if the strong null hypothesis is true,

P(Y ≥ c) ≤ α.

This is a permutation test based on Y , the sum of the treatment responses. Note that the null distribution of Y (and thus the critical value c ) depends on the observed responses. Similarly, to test against the alternative that treatment decreases the response, the test would be of the form

Reject the strong null hypothesis if Y ≤ c ,

with c chosen to attain the desired significance level. To test against the alternative that treatment increases or decreases the response, we could use a test of the form

Reject the strong null hypothesis if Y < c_1 or Y > c_2.

There is some latitude in choosing c_1 and c_2; two standard choices are to pick them symmetrically about the expected value of Y under the null hypothesis, or to pick them so that under the null hypothesis

P(Y < c_1) is as close as possible to P(Y > c_2),

subject to the constraint that the significance level is attained. (Because the permutation distribution of Y is discrete, it can happen that no c or (c_1, c_2) yields exactly the desired significance level. Then one can either choose the largest attainable significance level that does not exceed the nominal significance level, or use a randomized test to attain the desired significance level exactly. We won't worry about this much.)

For example, suppose that N = 5, n = 2, m = 3, and the responses are as follows:

Under the strong null, the probability that Y is 7 or greater is 0.3; this is the P-value for a one-sided permutation test against the alternative that treatment increases the response, based on Y, the sum of the responses of the treatment group.
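As a concrete sketch with hypothetical responses (chosen so that P(Y ≥ 7) = 0.3, matching the text), the exact null distribution can be enumerated in R:

resp <- c(1, 2, 3, 3, 5)      # hypothetical pooled responses, N = 5
Ydist <- combn(resp, 2, sum)  # Y for each of the 10 possible assignments
mean(Ydist >= 7)              # P(Y >= 7) under the strong null: 0.3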

For larger data sets, working out the null distribution of Y analytically can be difficult. For example, if we have 100 subjects of whom 50 are to be assigned to treatment, there are about 10^29 possible assignments of subjects to treatment or control. Each assignment requires on the order of 50 floating point operations to compute Y. Using a computer that can calculate at the rate of 1 GFlop/s (10^9 floating point operations per second), it would take about 1.6×10^14 years to compute all the values of Y. The universe is about 1.5×10^10 years old. For N = 50, n = 25, there are (50 choose 25) possible ways to assign the subjects to treatment, which could give as many as 10^14 different values for Y; it would take a couple of months to compute all the values of Y at that rate. In such situations, calculating the null distribution exactly is impossible, but we still can approximate the null probability distribution of Y with a variety of numerical simulations.

The most straightforward approach conceptually is to generate simple random samples of size n from the N (pooled control and treatment) responses repeatedly, calculate Y for each, and find the empirical distribution of those "observed" values. This approximates the null distribution of Y for the permutation test. Alternatively, we might sample n of the responses with replacement; this leads to a bootstrap test. When N is large and n is small compared to N, it does not make much difference whether sampling is with or without replacement: the permutation and bootstrap tests should give similar P-values. When N is small and when n is an appreciable fraction of N, the tests differ. (See Romano, J.P., 1989. Bootstrap and randomization tests of some nonparametric hypotheses, Ann. Stat., 17, 141-159. More on this later.)

[Interactive sampling applet]

The applet also lets you simulate the null distribution of Y in situations in which working out the null distribution analytically would be impractical. How accurate is the simulation? Ignoring issues with the pseudorandom number generator (not all algorithms for simulating random numbers behave well), we can get a handle on the probable accuracy of the simulation as follows: The standard error of the empirical probability in k independent trials with the same true probability p of success is

(p(1−p)/k)^{1/2} ≤ 1/(2k^{1/2}).

For k =10,000 trials, this bound on the standard error is half a percent.

Pseudo-Random Number Generators

Most computers cannot generate truly random numbers, although there is special equipment that can (usually, these rely on a physical source of "noise," such as a resistor or a radiation detector). Most so-called random numbers generated by computers are really "pseudo-random" numbers, sequences generated by a software algorithm from a starting point, called a seed. Pseudo-random numbers behave much like random numbers for many purposes.

The seed of a pseudo-random number generator can be thought of as the state of the algorithm. Each time the algorithm produces a number, it alters its state deterministically. If you start a given algorithm from the same seed, you will get the same sequence of pseudo-random numbers. Each pseudo-random number generator has only finitely many states. Eventually, after the period of the generator, the generator gets back to its initial state and the sequence repeats.

Better generators have more states and longer periods, but that comes at a price: speed. There is a tradeoff between the computational efficiency of a pseudo-random number generator and the difficulty of telling that its output is not really random (measured, for example, by the number of bits one must examine). See http://csrc.nist.gov/rng/ for a suite of tests of pseudo-random number generators. Tests can be based on statistics such as the number of zero and one bits in a block or sequence, the number of runs in sequences of differing lengths, the length of the longest run, spectral properties, compressibility (the less random a sequence is, the easier it is to compress), and so on.

No pseudo-random number generator is best for all purposes. You should check which algorithm is used by any software package you rely on for simulations. The Linear Congruential Generator, which used to be quite common, is best avoided. (It tends to have a short period, and the sequences it generates have underlying regularity that can spoil its performance for many purposes.) For statistical simulations, a particularly good pseudo-random number generator is the Mersenne Twister. For cryptography, a higher level of randomness is needed than for most statistical simulations.

Here is a Matlab function to estimate by simulation the P-value for a permutation test based on the sum of the treatment responses. The alternative is one-sided: that treatment increases the response.

function p = simPermuTest(x, y, iter)
% function p = simPermuTest(x, y, iter)
%
% P.B. Stark, www.stat.berkeley.edu/~stark 9/12/05
% simulated P-value for a one-sided permutation test based on the sum of
% the treatment responses for the strong null hypothesis that treatment
% has no effect whatsoever
%
% x is the vector of control observations
% y is the vector of treatment observations
% the lengths of x and y determine the sizes of the control and treatment
%   groups
% iter is the number of replications.
%
ts = sum(y);                         % test statistic
z = [x y];                           % pooled responses
dist = zeros(1, iter);               % holds results of simulation
for i = 1:iter                       % loop over permutations of responses
    zp = z(randperm(length(z)));     % random permutation of responses
    dist(i) = sum(zp(1:length(y)));  % value of the test statistic here
end;
p = sum(dist >= ts)/iter;            % simulated P-value, one-sided test
return;

Here is a terse R version of the same algorithm:

simPermuTest <- function(x, y, iter) {
    # reconstructed from a truncated original: simulated P-value for a
    # one-sided permutation test based on the sum of treatment responses
    ts <- sum(y)                     # observed test statistic
    z <- c(x, y)                     # pooled responses
    sum(replicate(iter, sum(sample(z, length(y))) >= ts))/iter
}

We have been working with the sum Y of the treatment responses. We could just as well have worked with the difference between the mean of the treatment responses and the mean of the control responses:

D = (Y_1 + Y_2 + … + Y_n)/n − (X_1 + X_2 + … + X_m)/m,

where the X's are the control responses. This leads to a permutation test based on the difference in mean responses, instead of a permutation test based on the sum of the treatment responses. The two tests are equivalent (they reject for exactly the same data sets) because there is a monotonic, 1:1 mapping between values of Y and values of D: let x = x_1 + x_2 + … + x_N be the total of the responses of all N individuals. Then

D = Y/n − X/m

  = Y/n − (x − Y)/m

  = (1/n + 1/m)Y − x/m.

Thus D can be computed from Y and the total response, and it increases monotonically with Y. Specifying which n of the {x_j} are treatment responses determines the sum of the treatment responses, and also determines the difference of mean responses. Increasing the sum of the treatment responses increases the mean of the treatment responses, and that increase comes at the expense of the mean of the control responses.

Power of the permutation test based on Y

Generally, we have to make pretty strong assumptions to find the power of the permutation test, and it is easier to do in the population model we study in the next chapter. However, consider the power against the shift alternative that treatment increases every subject's response by the same amount d. This alternative is rather far-fetched, but it is strong enough to specify the distribution of the test statistic Y: under this alternative, the right number on each ticket is the left number on the ticket, plus d. Because we know either the left or right number for each ticket, we actually know both numbers on all the tickets: the right numbers on the tickets for the treatment group were observed, and the left numbers on those tickets are the right numbers minus d; the left numbers on the tickets for the control group were observed, and the right numbers on those tickets are the left numbers plus d. Under random assignment to control or treatment, we can find the probability distribution of the sum of the right numbers on n of the tickets. The difficulty in translating this sampling distribution into the power is that the critical value of Y for the test depends on the observations: the critical value is the 1−α quantile of the sampling distribution of Y on the assumption that the strong null hypothesis is true, conditional on the observed values, but that assumption leads to a different critical value depending on which subjects are assigned to treatment. The simulation thus needs to include the step of estimating the critical value, for each random assignment of subjects to treatment and control.

Here is a Matlab script to approximate the power of a permutation test based on Y against the shift alternative using nested simulations.

function p = simPermuSumPower(x, y, d, alpha, iter)
% function p = simPermuSumPower(x, y, d, alpha, iter)
%
% P.B. Stark, www.stat.berkeley.edu/~stark 9/11/05
% finds the approximate power of a one-sided level alpha permutation
% test based on the sample sum for the alternative hypothesis
% that treatment shifts the response by exactly d for all subjects.
%
% x is the vector of control observations
% y is the vector of treatment observations
% the lengths of x and y determine the sizes of the control and treatment
% groups
% alpha is the significance level
% iter is the number of replications of each simulation--note: there are
% two nested simulations.
%
p = 0;                                            % simulated power
z = [x y-d];                                      % control responses, under shift alternative
for i = 1:iter                                    % loop over sampling from responses
    dist = zeros(1,iter);                         % storage for empirical distribution
    zp = z(randperm(length(z)));                  % randomize
    xp = zp(1:length(x));                         % control responses in this iteration
    yp = zp(length(x)+1:length(x)+length(y)) + d; % treatment responses
    for j = 1:iter                                % loop to simulate critical value
        zp = [xp yp];                             % population of responses in this simulation
        zp = zp(randperm(length(zp)));            % randomize
        dist(j) = sum(zp(1:length(yp)));          % value of the test statistic here
    end;                                          % end loop to simulate critical value
    if (sum(dist >= sum(yp))/iter <= alpha)       % reconstructed: the original listing was truncated here;
        p = p + 1/iter;                           % the simulated P-value for this assignment is at most
    end;                                          % alpha, so this assignment counts toward the power
end;
return;

Here is an R version of the same algorithm:

# reconstructed from the Matlab version above; the original listing was garbled
simPermuSumPower <- function(x, y, d, alpha, iter) {
    z <- c(x, y - d)                         # responses with the treatment effect removed
    m <- length(x); n <- length(y)
    mean(replicate(iter, {                   # loop over random assignments
        zp <- sample(z)                      # randomize
        xp <- zp[1:m]                        # control responses in this iteration
        yp <- zp[(m + 1):(m + n)] + d        # treatment responses in this iteration
        ts <- sum(yp)                        # observed test statistic
        dist <- replicate(iter, sum(sample(c(xp, yp), n)))  # simulate the null distribution
        (sum(dist >= ts)/iter) <= alpha      # does the test reject at level alpha?
    }))
}

We can simulate the distribution of Y under the shift alternative using the sampling applet as follows: Add d to each control response. Pool those m numbers and the n treatment responses; these are the right numbers on all N = n + m tickets. Put these N numbers into the "box" on the right side of the applet. The distribution of the sample sum of n numbers drawn at random from this population of numbers is the distribution of Y under the shift alternative. Note that this alternative hypothesis—that treatment adds exactly the same amount to the response for each subject—is absurdly restrictive. In the population model, discussed later, we can find the power under more realistic assumptions.
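The same simulation is easy to carry out in R without the applet. A minimal sketch (the function name simShiftDist is ours):

simShiftDist <- function(x, y, d, iter) {
    box <- c(x + d, y)                       # the right (treatment) numbers on all N tickets
    replicate(iter, sum(sample(box, length(y))))   # sums of n tickets drawn at random
}
# e.g., hist(simShiftDist(x, y, d, 10000)) displays the distribution of Y under the shift alternative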

Comparison with the 2-sample t-test

It is common to use the t-test to compare two groups. In the randomization model, the nominal significance level of the t-test derived from Student's t distribution can be off by a lot, especially when the control and treatment groups are small. The null hypothesis for the t-test is that the control and treatment responses are an iid random sample from a normal distribution with unknown mean and variance. In the randomization model, there is no reason to think that the subjects are a random sample from any population, much less one in which responses have a normal distribution. And in the randomization model, responses are not independent, because putting a subject into the treatment group excludes that subject from the control group, and vice versa.

By simulation, we can approximate the significance level of a 2-sample t-test in the randomization model. Here is R code that does it.

# reconstruction: the original function body was lost; this implements the algorithm
# described above -- the fraction of random assignments, under the strong null,
# in which the nominal level-alpha two-sample t-test rejects
simPermuTSig <- function(x, y, alpha, iter) {
    z <- c(x, y)                             # pooled responses; under the strong null,
    m <- length(x); n <- length(y)           # the group labels are arbitrary
    mean(replicate(iter, {
        zp <- sample(z)                      # random assignment to control and treatment
        t.test(zp[(m + 1):(m + n)], zp[1:m], alternative = "greater")$p.value <= alpha
    }))                                      # fraction of assignments the t-test rejects
}

Of course, we could base a test on the sampling distribution of the 2-sample t-statistic under the randomization model, rather than on Student's t distribution. The resulting test is called a permutation t-test. It differs from permutation tests based on the sample sum or the difference of sample means, but often not by much.

Here is R code to find the simulated P-value of a 1-sided 2-sample permutation t-test.

# reconstructed from the description above; the original listing was garbled
simPermuTTest <- function(x, y, iter) {
    tstat <- function(xc, yt) t.test(yt, xc, alternative = "greater")$statistic
    ts <- tstat(x, y)                        # observed t statistic
    z <- c(x, y)                             # pooled responses
    m <- length(x); n <- length(y)
    dist <- replicate(iter, {                # null distribution under random assignment
        zp <- sample(z)
        tstat(zp[1:m], zp[(m + 1):(m + n)])
    })
    sum(dist >= ts)/iter                     # simulated one-sided P-value
}

Fisher's Exact Test and its Normal Approximation

A special case of the permutation test based on the sample sum occurs when the only possible responses are 0 and 1. The distribution of the test statistic under the strong null hypothesis is then hypergeometric, which leads to Fisher's exact test. The following material is adapted from SticiGui.

Suppose we own a start-up company that offers e-tailers a service for targeting their web advertising. Consumers register with our service by filling out a form indicating their likes and dislikes, gender, age, etc. We put "cookies" on their computers to keep track of who they are. When they get to the website of any of our clients, we use their likes and dislikes to select (from a collection of the client's ads) the ad we think they are most likely to respond to. The service is free to consumers; we charge the e-tailers.

We can raise venture capital if we can show that targeting makes e-tailers' advertisements more effective. To measure the effectiveness, we offer our service for free to a large e-tailer. The e-tailer has a collection of web ads that it usually uses in rotation: each time a consumer arrives at the site, the e-tailer's server selects the next ad in the sequence to show to the consumer, then starts over when it runs out of ads.

We install our software on the e-tailer's server, in the following way: each time a consumer arrives at the site, with probability 50% the server shows the consumer the ad our targeting software suggests, and with probability 50% the server shows the consumer the next ad in the rotation, the way the e-tailer used to choose which ad to show. For each consumer, the software records which strategy was used (target or rotation), and whether the consumer buys anything. We call the consumers who were shown the targeted ad the treatment group ; we call the other consumers the control group . If a consumer visits the site more than once during the trial period, we ignore all of that consumer's visits but the first.

Suppose that \(N\) consumers visit the site during the trial, that \(n\) of them are assigned to the treatment group, that \(m\) of them are assigned to the control group, and that \(N_S\) of the consumers buy something. In essence, we want to know whether there would have been more purchases if everyone had been shown the targeted ad than if everyone had been shown the control ad. Only some of the consumers saw the targeted ad, and only some saw the control ad, so answering this question involves extrapolating from the data to an hypothetical counterfactual situation. Of course, we really want to extrapolate further, to people who have not yet visited the site, to decide whether more product would be sold if those people are shown the targeted ad.

We can think of the experiment in the following way: the \(i\)th consumer has a ticket with two numbers on it: the first number (\(x_i\)) is 1 if the consumer would have bought something if shown the control ad, and 0 if not. The second number (\(y_i\)) is 1 if the consumer would have bought something if shown the targeted ad, and 0 if not. There are \(N\) tickets in all.

For each consumer \(i\) who visits the site, we observe either \(x_i\) or \(y_i\), but not both. The proportion of consumers who would have made purchases if every consumer had been shown the control ads is

\(p_c = (x_1 + x_2 + \cdots + x_N)/N.\)

Similarly, the proportion of consumers who would have made purchases if every consumer had been shown the targeted ads is

\(p_t = (y_1 + y_2 + \cdots + y_N)/N.\)

Let

\(\mu = p_t - p_c\)

be the difference between the rate at which consumers would have bought had all of them been shown the targeted ad, and the rate at which consumers would have bought had all of them been in the control group. The null hypothesis, that targeting does not make a difference, is that \(\mu = 0\). (The strong null hypothesis is that \(x_i = y_i\) for \(i = 1, 2, \ldots, N\).) The alternative hypothesis, that targeting helps, is that \(\mu > 0\). We would like to test the null hypothesis at significance level 5%.

Let \(m\) be the number of consumers in the control group, and let \(n\) be the number of consumers in the treatment group, so

\(N = m + n.\)

Let \(N_S\) be the total number of sales to the treatment and control groups, and let \(Y\) be the number of sales to consumers in the treatment group. Under the strong null hypothesis (which implies that \(\mu = 0\)), for any fixed value of \(N_S\), \(Y\) has an hypergeometric distribution with parameters \(N\), \(N_S\), and \(n\) (we consider \(N\) to be fixed):

\(P(Y = y) = \binom{N_S}{y} \binom{N - N_S}{n - y} \bigg/ \binom{N}{n}.\)

We cannot calculate the critical value \(y\) until we know \(N\), \(n\), and \(N_S\). Once we observe them, we can find the smallest value \(y\) such that the probability that \(Y\) is larger than \(y\) if the null hypothesis is true is at most 5%, the significance level we chose for the test. Our rule for testing the null hypothesis is then to reject the null hypothesis if \(Y > y\), and not to reject it otherwise. This is called Fisher's exact test for the equality of two percentages (against the one-sided alternative that treatment increases the response). It is a permutation test, and it is also essentially a (mid-)rank test (discussed in the next chapter), because there are only two possible values for each response.
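R's hypergeometric functions (and fisher.test) make the computation routine. A minimal sketch with purely illustrative counts:

N <- 500; n <- 260; m <- N - n           # visitors, treatment group, control group (hypothetical)
NS <- 65; Y <- 42                        # total sales, sales in the treatment group (hypothetical)
yCrit <- qhyper(0.95, NS, N - NS, n)     # smallest y with P(Y > y) <= 5% under the null
phyper(Y - 1, NS, N - NS, n, lower.tail = FALSE)   # P-value: P(Y >= observed value)
tab <- matrix(c(Y, n - Y, NS - Y, m - (NS - Y)), nrow = 2)   # the same test from the 2x2 table
fisher.test(tab, alternative = "greater")$p.value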

The Normal Approximation to Fisher's Exact Test

If \(N\) is large and \(n\) is neither close to zero nor close to \(N\), computing the hypergeometric probabilities will be difficult, but the normal approximation to the hypergeometric distribution should be accurate provided \(N_S\) is neither too close to zero nor too close to \(n\). To use the normal approximation, we need to convert to standard units, which requires that we know the expected value and standard error of \(Y\). The expected value of \(Y\) is

\(E(Y) = n \times N_S/N,\)

and the standard error of \(Y\) is

\(SE(Y) = f \times \sqrt{n} \times SD,\)

where \(f\) is the finite population correction

\(f = \sqrt{(N - n)/(N - 1)},\)

and \(SD\) is the standard deviation of a list of \(N\) values of which \(N_S\) equal one and \(N - N_S\) equal zero:

\(SD = \sqrt{(N_S/N) \times (1 - N_S/N)}.\)

In standard units, \(Y\) is

\(Z = (Y - E(Y))/SE(Y) = (Y - n \times N_S/N)/(f \times \sqrt{n} \times SD).\)

If the null hypothesis is true, the distribution of \(Z\) is approximately standard normal, so the chance that \(Z\) exceeds 1.645 is about 5%. The corresponding critical value is

\(y = E(Y) + 1.645 \times SE(Y) = n \times N_S/N + 1.645 \times f \times \sqrt{n} \times SD\)

in the original units. So if we reject the null hypothesis when

\(Z > 1.645\)

or, equivalently, when

\(Y > n \times N_S/N + 1.645 \times f \times \sqrt{n} \times SD,\)

we have an (approximate) 5% significance level test of the null hypothesis that ad targeting and ad rotation are equally effective. This is the normal approximation to Fisher's exact test; \(Z\) is called the Z statistic, and the observed value of \(Z\) is called the z score.
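A quick numerical check of the approximation, reusing the hypothetical counts from the sketch above:

EY <- n * NS / N                         # expected value of Y
f  <- sqrt((N - n)/(N - 1))              # finite population correction
SD <- sqrt((NS/N) * (1 - NS/N))          # SD of a list of NS ones and N - NS zeros
Z  <- (Y - EY)/(f * sqrt(n) * SD)        # z score of the observed Y
pnorm(Z, lower.tail = FALSE)             # approximate one-sided P-value
phyper(Y - 1, NS, N - NS, n, lower.tail = FALSE)   # exact hypergeometric tail, for comparison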

Fisher's Lady Tasting Tea Experiment

In his 1935 book, The Design of Experiments (London, Oliver and Boyd, 260pp.), Sir R.A. Fisher writes:

A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. … Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., … Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.

There are \(\binom{8}{4} = 70\) ways to distribute the four "milk-first" cups among the 8. Under the null hypothesis that the lady cannot taste any difference, her labeling of the 8 cups (4 milk-first and 4 tea-infusion-first) can be thought of as fixed in advance. The probability that her labeling exactly matches the truth is thus 1/70. A test that rejects the null hypothesis only when she matches all 8 cups has significance level 1/70 = 1.4%. If she misses one cup, she must in fact miss at least two, because she will have mislabeled a milk-first cup as tea-first, and vice versa. The possible numbers of "hits" are 0, 2, 4, 6, and 8. To get 6 hits, she must label as milk-first three of the four true milk-first cups, and must mislabel as milk-first one of the four tea-first cups. The number of arrangements that give exactly 6 hits is thus

\(\binom{4}{3} \times \binom{4}{1} = 16.\)

Thus if we reject the null hypothesis when she correctly identifies 6 of the 8 cups or all 8 of the 8 cups, the significance level of the test is (16+1)/70 = 24.3%. Such good performance is pretty likely to occur by chance—about as likely as getting two heads in two tosses of a fair coin—even if the lady and the tea are in different parlors.
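These calculations take one line each in R:

choose(8, 4)                                        # 70 equally likely labelings
(choose(4, 3) * choose(4, 1) + 1) / choose(8, 4)    # P(6 or 8 hits) = 17/70
# equivalently, the number of true milk-first cups she labels milk-first is
# hypergeometric, and 6 or 8 hits means 3 or 4 of them are labeled correctly:
phyper(2, 4, 4, 4, lower.tail = FALSE)              # P(3 or 4 correct) = 17/70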

There are other experiments we might construct to test this hypothesis. Lindley (1984, A Bayesian Lady Tasting Tea, in Statistics, an Appraisal, H.A. David and H.T. David, eds., Iowa State Univ. Press) lists two, which he attributes to Neyman:

  • Present the lady with n pairs of cups of tea with milk, where one cup in each pair (determined randomly) has milk added first and one has tea added first. Tell the lady that each pair has one of each kind of cup; ask her to identify which member of each pair had the milk added first. Count the number of pairs she categorizes correctly. (This approach is sometimes called two-alternative forced choice in the study of perception in the psychometric literature: each trial has two alternative responses, the subject is forced to pick one.)
  • Present the lady with n cups of tea with milk, each of which has probability 1/2 of having the milk added first and probability 1/2 of having the tea added first. Do not tell the lady how many of each sort of cup there are. Ask her to identify for each cup whether milk or tea was added first. Count the number of cups she categorizes correctly.

These experiments lead to different tests. A permutation test for the first of them is straightforward; to base a permutation test on the second requires conditioning on the number of cups she categorizes each way and on the number of cups that had the milk added first.
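For the first design, under the null hypothesis each of the n pairs is categorized correctly with probability 1/2, independently, so the number of correct pairs is Binomial(n, 1/2). A minimal sketch in R, with made-up numbers (9 of 10 pairs correct):

binom.test(9, 10, p = 0.5, alternative = "greater")$p.value   # one-sided P-value
sum(dbinom(9:10, 10, 0.5))                                    # the same, by hand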

Strong and weak null hypotheses

We are in the randomization model: each subject has a ticket with two numbers on it. Those \(2N\) numbers are the parameters of the model. If the subject is assigned to control, the number on the left is revealed; assignment to treatment reveals the number on the right. The assignment is by simple random sampling.

In the randomization model, the strong null hypothesis is that the two numbers on each ticket are equal: treatment makes no difference whatsoever, subject by subject.

This is not a very realistic hypothesis, but it leads to simple theory. If the strong null is true, when you see one number on each ticket, you know both numbers on each ticket—all 2N parameters in the model—because the two numbers on each ticket are equal. If the strong null is true, once the data have been collected, you know everything there is to know. Since tickets are put into treatment by simple random sampling, that completely specifies the probability distribution of every test statistic.

The strong null implies a variety of weaker null hypotheses—but those hypotheses do not completely specify the probability distribution of every test statistic. For example, if the strong null is true, so is the weaker null that the mean of all N subjects' responses to treatment is equal to the mean of all N subjects' responses to control: the mean of all the left numbers is equal to the mean of all the right numbers. That's a natural "weak null" when the alternative is that treatment tends to increase (or to decrease) the response. That weak null does not uniquely determine the probability distribution of the statistics we have considered, because it does not determine all the parameters in the model: under that weak null, we have N+1 constraints (we observe one number on each ticket, and we know that the two population means are the same)—but we need to have 2N constraints to determine all 2N parameters in the model.

Because the strong null plus the data determine all the parameters in the randomization model, under the strong null, it's easy to calculate the distribution of the test statistic, set critical values and find P-values, etc. If the strong null is true, so is the weak null—but the strong null can be false while the weak null is true. There are parameter values that satisfy the weak null but not the strong null. So, if we test the weak null using a test designed for the strong null, we may well reject too often—the significance level could be larger than we claim. For parameters that satisfy the weak null but not the strong null, the chance that the test statistic exceeds the nominal level-alpha critical value could be much larger than for any parameters that satisfy the strong null.

The "weakening" that is interesting depends partly on the alternative. The idea is that the weak null is implied by the strong null (hence, "weaker") but is false if the alternative is true (hence, would make a reasonable null in contrast to the alternative of interest).

If the alternative is that treatment tends to increase the response, a natural weakening is that the two population means are equal. (If the strong null holds, so does this weakening. If the alternative holds, this weak null is false.) If the alternative is that treatment tends to change the variability, a natural weakening is that the two population variances are equal. (Again, if the strong null is true, so is this weakening; but if the alternative holds, this weakening is false.) If the alternative is that treatment has any effect whatsoever on the distribution of responses, a natural weakening is that the cdfs of the two populations are equal. And so on.
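The possibility of over-rejection can be explored by simulation. Here is a minimal sketch using the simPermuTest function above, with hypothetical tickets that satisfy the weak null (equal means) but not the strong null:

left  <- rep(0, 10)                      # control responses: all zero
right <- rep(c(3, -3), each = 5)         # treatment responses: mean zero, but unequal to left
mean(replicate(1000, {
    trt <- sample(10, 5)                 # simple random sample into treatment
    simPermuTest(left[-trt], right[trt], 1000) <= 0.05
}))                                      # rejection rate of the nominal 5% test; may exceed 0.05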

A simplified guide to randomized controlled trials

PMID: 29377058 · DOI: 10.1111/aogs.13309

A randomized controlled trial is a prospective, comparative, quantitative study/experiment performed under controlled conditions with random allocation of interventions to comparison groups. The randomized controlled trial is the most rigorous and robust research method for determining whether a cause-effect relation exists between an intervention and an outcome. High-quality evidence can be generated by performing a randomized controlled trial when evaluating the effectiveness and safety of an intervention. Furthermore, randomized controlled trials lend themselves well to systematic review and meta-analysis, providing a solid base for synthesizing evidence generated by such studies. Evidence-based clinical practice improves patient outcomes and safety, and is generally cost-effective. Therefore, randomized controlled trials are becoming increasingly popular in all areas of clinical medicine, including perinatology. However, designing and conducting a randomized controlled trial, analyzing data, interpreting findings and disseminating results can be challenging as there are several practicalities to be considered. In this review, we provide simple descriptive guidance on planning, conducting, analyzing and reporting randomized controlled trials.

Keywords: Clinical trial; good clinical practice; random allocation; randomized controlled trial; research methods; study design.

© 2018 Nordic Federation of Societies of Obstetrics and Gynecology.



Control Groups and Treatment Groups | Uses & Examples

Published on July 3, 2020 by Lauren Thomas . Revised on June 22, 2023.

In a scientific study, a control group is used to establish causality by isolating the effect of an independent variable .

Here, researchers change the independent variable in the treatment group and keep it constant in the control group. Then they compare the results of these groups.

Control groups in research

Using a control group means that any change in the dependent variable can be attributed to the independent variable. This helps avoid extraneous variables or confounding variables from impacting your work, as well as a few types of research bias , like omitted variable bias .


Control groups are essential to experimental design . When researchers are interested in the impact of a new treatment, they randomly divide their study participants into at least two groups:

  • The treatment group (also called the experimental group ) receives the treatment whose effect the researcher is interested in.
  • The control group receives either no treatment, a standard treatment whose effect is already known, or a placebo (a fake treatment to control for placebo effect ).

The treatment is any independent variable manipulated by the experimenters, and its exact form depends on the type of research being performed. In a medical trial, it might be a new drug or therapy. In public policy studies, it could be a new social policy that some receive and not others.

In a well-designed experiment, all variables apart from the treatment should be kept constant between the two groups. This means researchers can correctly measure the entire effect of the treatment without interference from confounding variables .

For example, suppose you are testing whether paying students for achieving high grades improves their performance (a hypothetical illustration):

  • You pay the students in the treatment group for achieving high grades.
  • Students in the control group do not receive any money.

Studies can also include more than one treatment or control group. Researchers might want to examine the impact of multiple treatments at once, or compare a new treatment to several alternatives currently available.

For example, in a trial of a new blood-pressure pill:

  • The treatment group gets the new pill.
  • Control group 1 gets an identical-looking sugar pill (a placebo).
  • Control group 2 gets a pill already approved to treat high blood pressure.

Since the only variable that differs between the three groups is the type of pill, any differences in average blood pressure between the three groups can be credited to the type of pill they received.

  • The difference between the treatment group and control group 1 demonstrates the effectiveness of the pill as compared to no treatment.
  • The difference between the treatment group and control group 2 shows whether the new pill improves on treatments already available on the market.


Although control groups are more common in experimental research, they can be used in other types of research too. Researchers generally rely on non-experimental control groups in two cases: quasi-experimental or matching design.

Control groups in quasi-experimental design

While true experiments rely on random assignment to the treatment or control groups, quasi-experimental design uses some criterion other than randomization to assign people.

Often, these assignments are not controlled by researchers, but are pre-existing groups that have received different treatments. For example, researchers could study the effects of a new teaching method that was applied in some classes in a school but not others, or study the impact of a new policy that is implemented in one state but not in the neighboring state.

In these cases, the classes that did not use the new teaching method, or the state that did not implement the new policy, is the control group.

Control groups in matching design

In correlational research , matching represents a potential alternate option when you cannot use either true or quasi-experimental designs.

In matching designs, the researcher matches individuals who received the “treatment”, or independent variable under study, to others who did not: the control group.

Each member of the treatment group thus has a counterpart in the control group identical in every way possible outside of the treatment. This ensures that the treatment is the only source of potential differences in outcomes between the two groups.

Control groups help ensure the internal validity of your research. You might see a difference over time in your dependent variable in your treatment group. However, without a control group, it is difficult to know whether the change has arisen from the treatment. It is possible that the change is due to some other variables.

If you use a control group that is identical in every other way to the treatment group, you know that the treatment, the only difference between the two groups, must be what has caused the change.

For example, people often recover from illnesses or injuries over time regardless of whether they’ve received effective treatment or not. Thus, without a control group, it’s difficult to determine whether improvements in medical conditions come from a treatment or just the natural progression of time.

Risks from invalid control groups

If your control group differs from the treatment group in ways that you haven’t accounted for, your results may reflect the interference of confounding variables instead of your independent variable.

Minimizing this risk

A few methods can aid you in minimizing the risk from invalid control groups.

  • Ensure that all potential confounding variables are accounted for , preferably through an experimental design if possible, since it is difficult to control for all the possible confounders outside of an experimental environment.
  • Use double-blinding . This will prevent the members of each group from modifying their behavior based on whether they were placed in the treatment or control group, which could then lead to biased outcomes.
  • Randomly assign your subjects into control and treatment groups. This method will allow you to not only minimize the differences between the two groups on confounding variables that you can directly observe, but also those you cannot.


An experimental group, also known as a treatment group, receives the treatment whose effect researchers wish to study, whereas a control group does not. They should be identical in all other ways.

A true experiment (a.k.a. a controlled experiment) always includes at least one control group that doesn’t receive the experimental treatment.

However, some experiments use a within-subjects design to test treatments without a control group. In these designs, you usually compare one group’s outcomes before and after a treatment (instead of comparing outcomes between different groups).

For strong internal validity , it’s usually best to include a control group if possible. Without a control group, it’s harder to be certain that the outcome was caused by the experimental treatment and not by other variables.

A confounding variable , also called a confounder or confounding factor, is a third variable in a study examining a potential cause-and-effect relationship.

A confounding variable is related to both the supposed cause and the supposed effect of the study. It can be difficult to separate the true effect of the independent variable from the effect of the confounding variable.

In your research design , it’s important to identify potential confounding variables and plan how you will reduce their impact.

There are several methods you can use to decrease the impact of confounding variables on your research: restriction, matching, statistical control and randomization.

In restriction , you restrict your sample by only including certain subjects that have the same values of potential confounding variables.

In matching , you match each of the subjects in your treatment group with a counterpart in the comparison group. The matched subjects have the same values on any potential confounding variables, and only differ in the independent variable .

In statistical control , you include potential confounders as variables in your regression .

In randomization , you randomly assign the treatment (or independent variable) in your study to a sufficiently large number of subjects, which allows you to control for all potential confounding variables.

Experimental design means planning a set of procedures to investigate a relationship between variables . To design a controlled experiment, you need:

  • A testable hypothesis
  • At least one independent variable that can be precisely manipulated
  • At least one dependent variable that can be precisely measured

When designing the experiment, you decide:

  • How you will manipulate the variable(s)
  • How you will control for any potential confounding variables
  • How many subjects or samples will be included in the study
  • How subjects will be assigned to treatment levels

Experimental design is essential to the internal and external validity of your experiment.


Experimental Design: Types, Examples & Methods


Experimental design refers to how participants are allocated to different groups in an experiment. Types of design include repeated measures, independent groups, and matched pairs designs.

Probably the most common way to design an experiment in psychology is to divide the participants into two groups, the experimental group and the control group, and then introduce a change to the experimental group, not the control group.

The researcher must decide how he/she will allocate their sample to the different experimental groups.  For example, if there are 10 participants, will all 10 participants participate in both groups (e.g., repeated measures), or will the participants be split in half and take part in only one group each?

Three types of experimental designs are commonly used:

1. Independent Measures

Independent measures design, also known as between-groups , is an experimental design where different participants are used in each condition of the independent variable.  This means that each condition of the experiment includes a different group of participants.

This should be done by random allocation, ensuring that each participant has an equal chance of being assigned to one group.

Independent measures involve using two separate groups of participants, one in each condition.

  • Con : More people are needed than with the repeated measures design (i.e., more time-consuming).
  • Pro : Avoids order effects (such as practice or fatigue) as people participate in one condition only.  If a person is involved in several conditions, they may become bored, tired, and fed up by the time they come to the second condition or become wise to the requirements of the experiment!
  • Con : Differences between participants in the groups may affect results, for example, variations in age, gender, or social background.  These differences are known as participant variables (i.e., a type of extraneous variable ).
  • Control : After the participants have been recruited, they should be randomly assigned to their groups. This should ensure the groups are similar, on average (reducing participant variables).

2. Repeated Measures Design

Repeated Measures design is an experimental design where the same participants participate in each independent variable condition.  This means that each experiment condition includes the same group of participants.

Repeated Measures design is also known as within-groups or within-subjects design .

  • Pro : As the same participants are used in each condition, participant variables (i.e., individual differences) are reduced.
  • Con : There may be order effects. Order effects refer to the order of the conditions affecting the participants’ behavior.  Performance in the second condition may be better because the participants know what to do (i.e., practice effect).  Or their performance might be worse in the second condition because they are tired (i.e., fatigue effect). This limitation can be controlled using counterbalancing.
  • Pro : Fewer people are needed as they participate in all conditions (i.e., saves time).
  • Control : To combat order effects, the researcher counter-balances the order of the conditions for the participants.  Alternating the order in which participants perform in different conditions of an experiment.

Counterbalancing

Suppose we used a repeated measures design in which all of the participants first learned words in “loud noise” and then learned them in “no noise.”

We expect the participants to learn better in “no noise” because of order effects, such as practice. However, a researcher can control for order effects using counterbalancing.

The sample would be split into two groups, and the order of conditions counterbalanced: group 1 does ‘A’ then ‘B,’ and group 2 does ‘B’ then ‘A.’ This eliminates order effects.

Although order effects occur for each participant, they balance each other out in the results because they occur equally in both groups.
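A hypothetical R sketch of counterbalanced assignment, pairing a randomly ordered participant list with alternating condition orders:

participants <- sample(1:20)                       # random order of 20 participants
order <- rep(c("A then B", "B then A"), length.out = 20)
data.frame(participant = participants, order = order)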


3. Matched Pairs Design

A matched pairs design is an experimental design where pairs of participants are matched in terms of key variables, such as age or socioeconomic status. One member of each pair is then placed into the experimental group and the other member into the control group .

One member of each matched pair must be randomly assigned to the experimental group and the other to the control group.


  • Con : If one participant drops out, you lose two participants’ data.
  • Pro : Reduces participant variables because the researcher has tried to pair up the participants so that each condition has people with similar abilities and characteristics.
  • Con : Very time-consuming trying to find closely matched pairs.
  • Pro : It avoids order effects, so counterbalancing is not necessary.
  • Con : Impossible to match people exactly unless they are identical twins!
  • Control : Members of each pair should be randomly assigned to conditions. However, this does not solve all these problems.

Experimental design refers to how participants are allocated to an experiment’s different conditions (or IV levels). There are three types:

1. Independent measures / between-groups : Different participants are used in each condition of the independent variable.

2. Repeated measures /within groups : The same participants take part in each condition of the independent variable.

3. Matched pairs : Each condition uses different participants, but they are matched in terms of important characteristics, e.g., gender, age, intelligence, etc.

Learning Check

Read about each of the experiments below. For each experiment, identify (1) which experimental design was used; and (2) why the researcher might have used that design.

1 . To compare the effectiveness of two different types of therapy for depression, depressed patients were assigned to receive either cognitive therapy or behavior therapy for a 12-week period.

The researchers attempted to ensure that the patients in the two groups had similar severity of depressed symptoms by administering a standardized test of depression to each participant, then pairing them according to the severity of their symptoms.

2 . To assess the difference in reading comprehension between 7 and 9-year-olds, a researcher recruited each group from a local primary school. They were given the same passage of text to read and then asked a series of questions to assess their understanding.

3 . To assess the effectiveness of two different ways of teaching reading, a group of 5-year-olds was recruited from a primary school. Their level of reading ability was assessed, and then they were taught using scheme one for 20 weeks.

At the end of this period, their reading was reassessed, and a reading improvement score was calculated. They were then taught using scheme two for a further 20 weeks, and another reading improvement score for this period was calculated. The reading improvement scores for each child were then compared.

4 . To assess the effect of the organization on recall, a researcher randomly assigned student volunteers to two conditions.

Condition one attempted to recall a list of words that were organized into meaningful categories; condition two attempted to recall the same words, randomly grouped on the page.

Experiment Terminology

Ecological validity

The degree to which an investigation represents real-life experiences.

Experimenter effects

These are the ways that the experimenter can accidentally influence the participant through their appearance or behavior.

Demand characteristics

The clues in an experiment that lead the participants to think they know what the researcher is looking for (e.g., the experimenter’s body language).

Independent variable (IV)

The variable the experimenter manipulates (i.e., changes), which is assumed to have a direct effect on the dependent variable.

Dependent variable (DV)

Variable the experimenter measures. This is the outcome (i.e., the result) of a study.

Extraneous variables (EV)

All variables which are not independent variables but could affect the results (DV) of the experiment. Extraneous variables should be controlled where possible.

Confounding variables

Variable(s) that have affected the results (DV), apart from the IV. A confounding variable could be an extraneous variable that has not been controlled.

Random Allocation

Randomly allocating participants to independent variable conditions means that all participants should have an equal chance of taking part in each condition.

The principle of random allocation is to avoid bias in how the experiment is carried out and limit the effects of participant variables.

Order effects

Changes in participants’ performance due to their repeating the same or similar test more than once. Examples of order effects include:

(i) practice effect: an improvement in performance on a task due to repetition, for example, because of familiarity with the task;

(ii) fatigue effect: a decrease in performance of a task due to repetition, for example, because of boredom or tiredness.



Two-Group Experimental Designs

The simplest of all experimental designs is the two-group posttest-only randomized experiment. In design notation, it has two lines – one for each group – with an R at the beginning of each line to indicate that the groups were randomly assigned. One group gets the treatment or program (the X) and the other group is the comparison group and doesn’t get the program (note that you could alternatively have the comparison group receive the standard or typical treatment, in which case this study would be a relative comparison).

Notice that a pretest is not required for this design. Usually we include a pretest in order to determine whether groups are comparable prior to the program, but because we are using random assignment we can assume that the two groups are probabilistically equivalent to begin with and the pretest is not required (although you’ll see with covariance designs that a pretest may still be desirable in this context).

In this design, we are most interested in determining whether the two groups are different after the program. Typically we measure the groups on one or more measures (the O’s in design notation) and we compare them by testing for the differences between the means using a t-test or one-way Analysis of Variance (ANOVA).
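A minimal sketch of that comparison in R, with simulated data standing in for the posttest measures (all numbers hypothetical):

set.seed(1)
outcome <- c(rnorm(30, mean = 52), rnorm(30, mean = 50))   # treatment, then control
group   <- rep(c("treatment", "control"), each = 30)
t.test(outcome ~ group)                                    # two-sample t-test
summary(aov(outcome ~ group))                              # equivalent one-way ANOVA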

The posttest-only randomized experiment is strong against the single-group threats to internal validity because it’s not a single group design! (Tricky, huh?) It’s strong against all of the multiple-group threats except for selection-mortality. For instance, it’s strong against selection-testing and selection-instrumentation because it doesn’t use repeated measurement. The selection-mortality threat is especially salient if there are differential rates of dropouts in the two groups. This could result if the treatment or program is a noxious or negative one (e.g. a painful medical procedure like chemotherapy) or if the control group condition is painful or intolerable. This design is susceptible to all of the social interaction threats to internal validity. Because the design requires random assignment, in some institutional settings (e.g. schools) it is more likely to utilize persons who would be aware of each other and of the conditions they’ve been assigned to.

Threat type | Level of threat to Two-Group Experimental Designs
History | ✔ Weak
Maturation | ✔ Weak
Testing | ✔ Weak
Instrumentation | ✔ Weak
Mortality | ✔ Weak
Regression to the mean | ✔ Weak
Selection | ✔ Weak
Selection-history | ✔ Weak
Selection-maturation | ✔ Weak
Selection-testing | ✔ Weak
Selection-instrumentation | ✔ Weak
Selection-mortality | ❌ Severe
Selection-regression | ✔ Weak
Diffusion or imitation of treatment | ❌ Severe
Compensatory equalization of treatment | ❌ Severe
Compensatory rivalry | ❌ Severe
Resentful demoralization | ❌ Severe

The posttest-only randomized experimental design is, despite its simple structure, one of the best research designs for assessing cause-effect relationships. It is easy to execute and, because it uses only a posttest, is relatively inexpensive. But there are many variations on this simple experimental design. You can begin to explore these by looking at how we classify the various experimental designs .



Randomized Controlled Trial (RCT) Overview

By Jim Frost

What is a Randomized Controlled Trial (RCT)?

A randomized controlled trial (RCT) is a prospective experimental design that randomly assigns participants to an experimental or control group. RCTs are the gold standard for establishing causal relationships and ruling out confounding variables and selection bias. To use this design, researchers must be able to control who receives the treatment and who serves as a control.


Random assignment is crucial for ruling out other potentially explanatory factors that could have caused those outcome differences. This process in RCTs is so effective that it even works with potential confounders that the researchers don’t know about! Think age, lifestyle, or genetics. Learn more about Random Assignment in Experiments .

Scientists use randomized controlled trials most frequently in fields like medicine, psychology, and social sciences to rigorously test interventions and treatments.

In this post, learn how RCTs work, the various types, and their strengths and weaknesses.

Randomized Controlled Trial Example

Imagine testing a new drug against a placebo using a randomized controlled trial. We take a representative sample of 100 patients. 50 get the drug; 50 get the placebo. Who gets what? It’s random! Perhaps we flip a coin. For more complex designs, we’d probably use computers for random assignment.
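In R, computerized random assignment is a one-liner (a sketch; the labels are ours):

set.seed(42)
assignment <- sample(rep(c("drug", "placebo"), each = 50))   # shuffle 50 + 50 labels
table(assignment)                                            # 50 patients in each group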

After a month, we measure health outcomes. Did the drug help more than the placebo? That’s what we find out!

To read about several examples of top-notch RCTs in more detail, read my following posts:

  • How Effective Are Flu Shots?
  • COVID Vaccination Randomized Controlled Trial

Common Elements for Effective RCT Designs

While randomization springs to mind when discussing RCTs, other equally vital components shape these robust experimental designs. Most well-designed randomized controlled trials contain the following elements.

  • Control Group : Almost every RCT features a control group. This group might receive a placebo, no intervention, or standard care. You can estimate the treatment’s effect size by comparing the outcome in a treatment group to the control group. Learn more about Control Groups in an Experiment  and controlling for the Placebo Effect .
  • Blinding : Blinding hides group assignments from researchers and participants to prevent group assignment knowledge from influencing results. More on this shortly!
  • Pre-defined Inclusion and Exclusion Criteria : These criteria set the boundaries for who can participate based on specifics like age or health conditions.
  • Baseline Assessment : Before diving in, an initial assessment records participants’ starting conditions.
  • Outcome Measures : Clear, pre-defined outcomes, like symptom reduction or survival rates, drive the study’s goals.
  • Controlled, Standardized Environments : Ensuring variables are measured and treatments administered consistently minimizes external factors that could affect results.
  • Monitoring and Data Collection : Regular checks guarantee participant safety and uniform data gathering.
  • Ethical Oversight : Ensures participants’ rights and well-being are prioritized.
  • Informed Consent : Participants must know the drill and agree to participate before joining.
  • Statistical Plan : Detailing how statisticians will analyze the data before the RCT begins helps keep the evaluation objective and prevents p-hacking. Learn more about P-Hacking Best Practices .
  • Protocol Adherence : Consistency is critical. Following the plan ensures reliable results.
  • Analysis and Reporting : Once done, researchers share the results—good, bad, or neutral. Transparency builds trust.

These components ensure randomized controlled trials are both rigorous and ethically sound, leading to trustworthy results.

Common Variations of Randomized Controlled Trial Designs

Randomized controlled trial designs aren’t one-size-fits-all. Depending on the research question and context, researchers can apply various configurations.

Let’s explore the most common RCT designs:

  • Parallel Group : Participants are randomly put into an intervention or control group.
  • Crossover : Participants randomly receive both intervention and control at different times.
  • Factorial : Tests multiple interventions at once. Useful for combination therapies.
  • Cluster : Groups, not individuals, are randomized. For instance, researchers can randomly assign schools or towns to the experimental groups.

If you can’t randomly assign subjects and you want to draw causal conclusions about an intervention, consider using a quasi-experimental design .

Learn more about Experimental Design: Definition and Types .

Blinding in RCTs

Blinding is a standard protection in randomized controlled trials. The term refers to procedures that hide group assignments from those involved. While randomization ensures initial group balance, it doesn’t prevent uneven treatment or assessment as the RCT progresses, which could skew results.

So, what is the best way to sidestep potential biases?

Keep as many people in the dark about group assignments as possible. In a blinded randomized controlled trial, participants, and sometimes researchers, don’t know who gets the intervention.

There are three types of blinding:

  • Single : Participants don’t know if they’re in the intervention or control group.
  • Double : Both participants and researchers are in the dark.
  • Triple : Participants, researchers, and statisticians all don’t know.

It guards against sneaky biases that might creep into our RCT results. Let’s look at a few:

  • Confirmation Bias : Without blinding in a randomized controlled trial, researchers might unconsciously favor results that align with their expectations. For example, they might interpret ambiguous data as positive effects of a new drug if they’re hopeful about its efficacy.
  • Placebo Effect : Participants who know they’re getting the ‘real deal’ might report improved outcomes simply because they believe in the treatment’s power. Conversely, those aware they’re in the control group might not notice genuine improvements.
  • Observer Bias : If a researcher knows which participant is in which group, they might inadvertently influence outcomes. Imagine a physiotherapist unknowingly encouraging a participant more because they know they’re receiving the new treatment.

Blinding helps keep these biases at bay, making our results more reliable. It boosts confidence in a randomized controlled trial. Let’s close by summarizing the benefits and disadvantages of an RCT.

The Benefits of Randomized Controlled Studies

Randomized controlled trials offer a unique blend of strengths:

  • RCTs are best for identifying causal relationships.
  • Random assignment reduces both known and unknown biases.
  • Many RCT designs exist, tailored for different research questions.
  • Well-defined steps and controlled conditions ensure replicability across studies.
  • Internal validity tends to be high in a randomized controlled trial. You can be confident that other variables don’t affect or account for the observed relationship.

Learn more about Correlation vs. Causation: Understanding the Differences .

The Drawbacks of RCTs

While powerful, RCTs also come with limitations:

  • Randomized controlled trials can be expensive in time, money, and resources.
  • Ethical concerns can arise when withholding treatments from a control group.
  • Random assignment might not be possible in some circumstances.
  • External validity can be low in an RCT. Conditions can be so controlled that the results might not always generalize beyond the study.

For a good comparison, learn about the differences and tradeoffs between using Observational Studies and Randomized Experiments .

Learn more about Internal and External Validity in Experiments and see how they’re a tradeoff.

Study Design 101: Randomized Controlled Trial

A study design that randomly assigns participants into an experimental group or a control group. As the study is conducted, the only expected difference between the control and experimental groups in a randomized controlled trial (RCT) is the outcome variable being studied.

Advantages

  • Good randomization will "wash out" any population bias
  • Easier to blind/mask than observational studies
  • Results can be analyzed with well known statistical tools
  • Populations of participating individuals are clearly identified

Disadvantages

  • Expensive in terms of time and money
  • Volunteer biases: the population that participates may not be representative of the whole
  • Loss to follow-up attributed to treatment

Design pitfalls to look out for

An RCT should be a study of one population only.

Was the randomization actually "random", or are there really two populations being studied?

The variables being studied should be the only variables between the experimental group and the control group.

Are there any confounding variables between the groups?

Fictitious Example

To determine how a new type of short wave UVA-blocking sunscreen affects the general health of skin in comparison to a regular long wave UVA-blocking sunscreen, 40 trial participants were randomly separated into equal groups of 20: an experimental group and a control group. All participants' skin health was then initially evaluated. The experimental group wore the short wave UVA-blocking sunscreen daily, and the control group wore the long wave UVA-blocking sunscreen daily.

After one year, the general health of the skin was measured in both groups and statistically analyzed. In the control group, wearing long wave UVA-blocking sunscreen daily led to improvements in general skin health for 60% of the participants. In the experimental group, wearing short wave UVA-blocking sunscreen daily led to improvements in general skin health for 75% of the participants.
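
For illustration only, the fictitious percentages above can be checked with a two-sample test of proportions in R: 75% and 60% of 20 participants correspond to 15 and 12 improvements. This is a hedged sketch, not part of the original example:

prop.test(x = c(15, 12), n = c(20, 20))   # improved: experimental vs. control

With only 20 participants per group, a 15 percentage point difference is well within chance variation (the test returns a p-value well above 0.05), and the implied relative risk of improvement is 0.75 / 0.60 = 1.25, which connects to the Relative Risk formula listed below.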

Real-life Examples

van Der Horst, N., Smits, D., Petersen, J., Goedhart, E., & Backx, F. (2015). The preventive effect of the nordic hamstring exercise on hamstring injuries in amateur soccer players: a randomized controlled trial. The American Journal of Sports Medicine, 43 (6), 1316-1323. https://doi.org/10.1177/0363546515574057

This article reports on the research investigating whether the Nordic Hamstring Exercise is effective in preventing both the incidence and severity of hamstring injuries in male amateur soccer players. Over the course of a year, there was a statistically significant reduction in the incidence of hamstring injuries in players performing the NHE, but for those injured, there was no difference in severity of injury. There was also a high level of compliance in performing the NHE in that group of players.

Natour, J., Cazotti, L., Ribeiro, L., Baptista, A., & Jones, A. (2015). Pilates improves pain, function and quality of life in patients with chronic low back pain: a randomized controlled trial. Clinical Rehabilitation, 29 (1), 59-68. https://doi.org/10.1177/0269215514538981

This study assessed the effect of adding pilates to a treatment regimen of NSAID use for individuals with chronic low back pain. Individuals who included the pilates method in their therapy took fewer NSAIDs and experienced statistically significant improvements in pain, function, and quality of life.

Related Formulas

  • Relative Risk

Related Terms

Blinding/Masking

When the groups that have been randomly selected from a population do not know whether they are in the control group or the experimental group.

Causation

Being able to show that an independent variable directly causes the dependent variable. This is generally very difficult to demonstrate in most study designs.

Confounding Variables

Variables that cause or prevent an outcome from occurring outside of, or along with, the variable being studied. These variables can make it difficult or impossible to distinguish the relationship between the variable and outcome being studied.

Correlation

A relationship between two variables, but not necessarily a causation relationship.

Double Blinding/Masking

When the researchers conducting a blinded study do not know which participants are in the control group or the experimental group.

Null Hypothesis

The hypothesis that the relationship the researchers expect to find between the independent and dependent variables does not exist. To "reject the null hypothesis" is to conclude that there is a relationship between the variables.

Population/Cohort

A group that shares the same characteristics among its members (population).

Population Bias/Volunteer Bias

A sample may be skewed by those who are selected or self-selected into a study. If only certain portions of a population are considered in the selection process, the results of a study may have poor validity.

Randomization

Any of a number of mechanisms used to assign participants into different groups with the expectation that these groups will not differ in any significant way other than treatment and outcome.

Research (alternative) Hypothesis

The relationship between the independent and dependent variables that researchers believe they will prove through conducting a study.

Sensitivity

The relationship between what is considered a symptom of an outcome and the outcome itself; or the percent chance of not getting a false negative (see formulas).

Specificity

The relationship between not having a symptom of an outcome and not having the outcome itself; or the percent chance of not getting a false positive (see formulas).

Type 1 error

Rejecting a null hypothesis when it is in fact true. This is also known as an error of commission.

Type 2 error

The failure to reject a null hypothesis when it is in fact false. This is also known as an error of omission.

Now test yourself!

1. Having a volunteer bias in the population group is a good thing because it means the study participants are eager and make the study even stronger.

a) True b) False

2. Why is randomization important to assignment in an RCT?

a) It enables blinding/masking b) So causation may be extrapolated from results c) It balances out individual characteristics between groups. d) a and c e) b and c

A Two-Stage Inference Procedure for Sample Local Average Treatment Effects in Randomized Experiments

Abstract: In a given randomized experiment, individuals are often volunteers and can differ in important ways from a population of interest. It is thus of interest to focus on the sample at hand. This paper focuses on inference about the sample local average treatment effect (LATE) in randomized experiments with non-compliance. We present a two-stage procedure that provides an asymptotically correct coverage rate for the sample LATE in randomized experiments. The procedure uses a first-stage test to decide whether the instrument is strong or weak, and uses different confidence sets depending on the first-stage result. Proofs for the procedure are developed for the situation with and without regression adjustment and for two experimental designs (complete randomization and Mahalanobis-distance-based rerandomization). Finite sample performance of the methods is studied using extensive Monte Carlo simulations, and the methods are applied to data from a voter encouragement experiment.
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)


Introduction to Field Experiments and Randomized Controlled Trials

Have you ever been curious about the methods researchers employ to determine causal relationships among various factors, ultimately leading to significant breakthroughs and progress in numerous fields? In this article, we offer an overview of field experimentation and its importance in discerning cause and effect relationships. We outline how randomized experiments represent an unbiased method for determining what works. Furthermore, we discuss key aspects of experiments, such as intervention, excludability, and non-interference. To illustrate these concepts, we present a hypothetical example of a randomized controlled trial evaluating the efficacy of an experimental drug called Covi-Mapp.

Why experiments?

Every day, we find ourselves faced with questions of cause and effect. Understanding the driving forces behind outcomes is crucial, ranging from personal decisions like parenting strategies to organizational challenges such as effective advertising. This blog aims to provide a systematic introduction to experimentation, igniting enthusiasm for primary research and highlighting the myriad of experimental applications and opportunities available.

The challenge for those who seek to answer causal questions convincingly is to develop a research methodology that doesn't require identifying or measuring all potential confounders. Since no planned design can eliminate every possible systematic difference between treatment and control groups, random assignment emerges as a powerful tool for minimizing bias. In the contentious world of causal claims, randomized experiments represent an unbiased method for determining what works. Random assignment means participants are assigned to different groups or conditions in a study purely by chance. Basically, each participant has an equal chance to be assigned to a control group or a treatment group. 

Field experiments, or randomized studies conducted in real-world settings, can take many forms. While experiments on college campuses are often considered lab studies, certain experiments on campus – such as those examining club participation – may be regarded as field experiments, depending on the experimental design. Ultimately, whether a study is considered a field experiment hinges on the definition of "the field."

Researchers may employ two main scenarios for randomization. The first involves gathering study participants and randomizing them at the time of the experiment. The second capitalizes on naturally occurring randomizations, such as the Vietnam draft lottery. 

Intervention, Excludability, and Non-Interference

Three essential features of any experiment are intervention, excludability, and non-interference. In a general sense, the intervention refers to the treatment or action being tested in an experiment. The excludability principle is satisfied when the only difference between the experimental and control groups is the presence or absence of the intervention. The non-interference principle holds when the outcome of one participant in the study does not influence the outcomes of other participants. Together, these principles ensure that the experiment is designed to provide unbiased and reliable results, isolating the causal effect of the intervention under study.

Omitted Variables and Non-Compliance

To ensure unbiased results, researchers must randomize as much as possible to minimize omitted variable bias. Omitted variables are factors that influence the outcome but are not measured or are difficult to measure. These unmeasured attributes, sometimes called confounding variables or unobserved heterogeneity, must be accounted for to guarantee accurate findings.

Non-compliance can also complicate experiments. One-sided non-compliance occurs when individuals assigned to a treatment group don't receive the treatment (failure to treat), while two-sided non-compliance occurs when some subjects assigned to the treatment group go untreated or individuals assigned to the control group receive the treatment. Addressing these issues at the design level by implementing a blind or double-blind study can help mitigate potential biases.

Achieving Precision through Covariate Balance

To ensure the control and treatment groups are comparable in all relevant respects, particularly when the sample size (n) is small, it is essential to achieve covariate balance. A covariate is a pre-treatment factor that influences the outcome variable (not to be confused with covariance, which measures the association between two variables). By balancing covariates across groups, we can more accurately isolate the effect of the treatment, leading to improved precision in our findings.

Fictional Example of Randomized Controlled Trial of Covi-Mapp for COVID-19 Management

Let's explore a fictional example to better understand experiments: a one-week randomized controlled trial of the experimental drug Covi-Mapp for managing COVID-19. The control group receives the standard care for COVID-19 patients, while the treatment group receives the standard care plus Covi-Mapp. The outcome of interest is whether patients have cough symptoms on day 7, since subsiding cough symptoms are an encouraging sign in COVID-19 recovery. We'll measure the presence of cough on day 0 and day 7, the temperature on day 0 and day 7, and each patient's gender.

In this Covi-Mapp example, the intervention is the Covi-Mapp drug, the excludability principle is satisfied if the only difference in patient care between the groups is the drug administration, and the non-interference principle holds if one patient's outcome doesn't affect another's.

First, let's assume we have a dataset containing the relevant information for each patient, including cough status on day 0 and day 7, temperature on day 0 and day 7, treatment assignment, and gender. We'll read the data and explore the dataset:

library(data.table)

library(lmtest)     # provides coeftest()

library(sandwich)   # provides vcovHC()

d <- fread("../data/COVID_rct.csv")

names(d)


"temperature_day0"  "cough_day0"        "treat_drug"        "temperature_day7"  "cough_day7"        "male" 

Simple treatment effect of the experimental drug

Without any covariates, let's first look at the estimated effect of the treatment on the presence of cough on day 7. The estimated proportion of patients with a cough on day 7 in the control group (not receiving the experimental drug) is 0.847: about 84.7% of control patients are expected to have a cough on day 7. The estimated treatment effect is -0.238, meaning that, on average, receiving the experimental drug reduces the proportion of patients with a cough on day 7 by 23.8 percentage points relative to the control group.

covid_1 <- d[ , lm(cough_day7 ~ treat_drug)]

coeftest(covid_1, vcovHC)


                 Estimate Std. Error t value Pr(>|t|)    

(Intercept)       0.847458   0.047616  17.798  < 2e-16 ***

treat_drug       -0.237702   0.091459  -2.599  0.01079 *  

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We know that a patient's initial condition would affect the final outcome. If the patient has a cough and a fever on day 0, they might not fare well with the treatment. To better understand the treatment's effect, let's add these covariates:

covid_2 <- d[ , lm(cough_day7 ~ treat_drug +

                   cough_day0 + temperature_day0)]

coeftest(covid_2, vcovHC)


                  Estimate Std. Error t value Pr(>|t|)   

(Intercept)      -19.469655   7.607812 -2.5592 0.012054 * 

treat_drug        -0.165537   0.081976 -2.0193 0.046242 * 

cough_day0         0.064557   0.178032  0.3626 0.717689   

temperature_day0   0.205548   0.078060  2.6332 0.009859 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The output shows the results of a linear regression model, estimating the effect of the experimental drug (treat_drug) on the presence of cough on day 7, adjusting for cough on day 0 and temperature on day 0. The experimental drug significantly reduces the presence of cough on day 7 by approximately 16.6 percentage points compared to the control group (p-value = 0.046242). The presence of cough on day 0 does not significantly predict the presence of cough on day 7 (p-value = 0.717689). A one-degree increase in temperature on day 0 is associated with a 20.6 percentage point increase in the probability of cough on day 7, and this effect is statistically significant (p-value = 0.009859).

Should we add day 7 temperature as a covariate? No: temperature on day 7 is a post-treatment variable, since it can itself be affected by the treatment. Conditioning on it could absorb part of the treatment effect and bias the estimate, because we would be adjusting for something the intervention changed.

However, we'd like to investigate whether the treatment affects men and women differently. Since we collected gender as part of the study, we can check for a heterogeneous treatment effect (HTE) by interacting treatment with gender. In the model below, the experimental drug has a marginally significant effect for females (the baseline group), reducing the probability of cough on day 7 by approximately 23.1 percentage points (p-value = 0.05391).

covid_4 <- d[ , lm(cough_day7 ~ treat_drug * male +

                   cough_day0 + temperature_day0)]

coeftest(covid_4, vcovHC)


t test of coefficients:


                  Estimate Std. Error  t value  Pr(>|t|)    

(Intercept)      48.712690  10.194000   4.7786 6.499e-06 ***

treat_drug       -0.230866   0.118272  -1.9520   0.05391 .  

male              3.085486   0.121773  25.3379 < 2.2e-16 ***

cough_day0        0.041131   0.194539   0.2114   0.83301    

temperature_day0  0.504797   0.104511   4.8301 5.287e-06 ***

treat_drug:male  -2.076686   0.198386 -10.4679 < 2.2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Which group, those coded as male == 0 or male == 1, has better health outcomes (cough) in control? What about in treatment? How does this help to contextualize any heterogeneous treatment effect that might have been estimated?

Stargazer is a popular R package that enables users to create well-formatted tables and reports for statistical analysis results.

library(stargazer)   # formatted regression tables

covid_males   <- d[male == 1, lm(temperature_day7 ~ treat_drug)]

covid_females <- d[male == 0, lm(temperature_day7 ~ treat_drug)]


stargazer(covid_males, covid_females,

          title = "",

          type = 'text',

          dep.var.caption = 'Outcome Variable:',

          dep.var.labels = c('Temperature on Day 7'),

          se = list(

            sqrt(diag(vcovHC(covid_males))),

            sqrt(diag(vcovHC(covid_females))))

          )


===============================================================

                                 Outcome Variable:             

                               Temperature on Day 7            

                              (1)                   (2)        

treat_drug                 -2.591***              -0.323*      

                            (0.220)               (0.174)      

Constant                  101.692***             98.487***     

                            (0.153)               (0.102)      

Observations                  37                    63         

R2                           0.798                 0.057       

Adjusted R2                  0.793                 0.041       

Residual Std. Error     0.669 (df = 35)       0.646 (df = 61)  

F Statistic         138.636*** (df = 1; 35) 3.660* (df = 1; 61)

===============================================================

Note:                               *p<0.1; **p<0.05; ***p<0.01

Looking at this regression report, we see that males in control have an average temperature of about 101.7; females in control average about 98.5, which is very nearly a normal temperature. So, in control, males are worse off. In treatment, males average 101.7 - 2.59 = 99.1. While this is closer to a normal temperature, it is still elevated. Females in treatment average 98.5 - 0.32 = 98.2, slightly below a normal temperature. It appears that the treatment has a stronger effect among male participants than females because males are *more sick* at baseline.

In conclusion, experimentation offers a fascinating and valuable avenue for primary research, allowing us to address causal questions and enhance our understanding of the world around us. Covariate control helps to isolate the causal effect of the treatment on the outcome variable, ensuring that the observed effect is not driven by confounding factors. Proper control of covariates enhances the internal validity of the study and ensures that the estimated treatment effect is an accurate representation of the true causal relationship. By exploring and accounting for subgroups in the data, researchers can identify whether the treatment has different effects on different groups, such as men and women or younger and older individuals. This information can be critical for making informed policy decisions and developing targeted interventions that maximize the benefits for specific groups. The ongoing investigation of experimental methodologies and their potential applications represents a compelling and significant area of inquiry.

Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation . W. W. Norton.

“DALL·E 2.” OpenAI , https://openai.com/product/dall-e-2

“Data Science 241. Experiments and Causal Inference.” UC Berkeley School of Information , https://www.ischool.berkeley.edu/courses/datasci/241

A Refresher on Randomized Controlled Experiments

How to design the right kind of test.

In order to make smart decisions at work, we need data. Where that data comes from and how we analyze it depends on a lot of factors — for example, what we’re trying to do with the results, how accurate we need the findings to be, and how much of a budget we have. There is a spectrum of experiments that managers can do, from quick, informal ones, to pilot studies, to field experiments, and to lab research. One of the more structured experiments is the randomized controlled experiment.


Randomized Experiment

What is a Randomized Experiment?

A randomized experiment is one in which treatments are assigned to participants by a chance mechanism. Randomized experiments are used extensively in a wide variety of agricultural and biological experiments, including human clinical trials. They are also, less commonly, seen in other fields such as economics.

Randomized Experiment Stages

The experiments are usually conducted in two stages [2]:

  • Selection of a small sample of participants from a larger population , using a random sampling technique. This step ensures that the results will have external validity .
  • Random assignment to treatment and control groups. This step ensures that the observed effects have internal validity .

Benefits of Randomized Experiments

Using randomization has several benefits [3]:

  • It prevents selection bias and accidental bias , as well as bias in treatment assignments.
  • Homogeneous, comparable groups are created.
  • Probability methods, including hypothesis tests , can be used to assess whether the results could have happened by chance.

[1] World Bank. Randomized Experiments. Retrieved December 29, 2021 from: http://web.worldbank.org/archive/website01397/WEB/IMAGES/EXPERI-2.PDF [2] Munck, G. & Verkuilen, J. (2005). Research Designs. In Encyclopedia of Social Measurement, pages 385-395. [3] Suresh, K. (2011). An overview of randomization techniques: An unbiased assessment of outcome in clinical research. J Hum Reprod Sci, 4(1), 8-11.

Institution for Social and Policy Studies

Why Randomize?

About randomized field experiments. Randomized field experiments allow researchers to scientifically measure the impact of an intervention on a particular outcome of interest.

What is a randomized field experiment? In a randomized experiment, a study sample is divided into one group that will receive the intervention being studied (the treatment group) and another group that will not receive the intervention (the control group). For instance, a study sample might consist of all registered voters in a particular city. This sample will then be randomly divided into treatment and control groups. Perhaps 40% of the sample will be on a campaign’s Get-Out-the-Vote (GOTV) mailing list and the other 60% of the sample will not receive the GOTV mailings. The outcome measured (voter turnout) can then be compared in the two groups. The difference in turnout will reflect the effectiveness of the intervention.
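
Here is a hedged sketch of the 40/60 assignment just described, using complete randomization over a hypothetical voter list (the sample size is made up for illustration):

set.seed(7)

n_voters <- 10000                                        # hypothetical sample size

gotv <- seq_len(n_voters) %in% sample(n_voters, 0.4 * n_voters)

mean(gotv)                                               # exactly 40% assigned to GOTV mail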

What does random assignment mean? The key to randomized experimental research design is in the random assignment of study subjects – for example, individual voters, precincts, media markets or some other group – into treatment or control groups. Randomization has a very specific meaning in this context. It does not refer to haphazard or casual choosing of some and not others. Randomization in this context means that care is taken to ensure that no pattern exists between the assignment of subjects into groups and any characteristics of those subjects. Every subject is as likely as any other to be assigned to the treatment (or control) group. Randomization is generally achieved by employing a computer program containing a random number generator. Randomization procedures differ based upon the research design of the experiment. Individuals or groups may be randomly assigned to treatment or control groups. Some research designs stratify subjects by geographic, demographic or other factors prior to random assignment in order to maximize the statistical power of the estimated effect of the treatment (e.g., GOTV intervention). Information about the randomization procedure is included in each experiment summary on the site.

What are the advantages of randomized experimental designs? Randomized experimental design yields the most accurate analysis of the effect of an intervention (e.g., a voter mobilization phone drive or a visit from a GOTV canvasser) on voter behavior. By randomly assigning subjects to be in the group that receives the treatment or to be in the control group, researchers can measure the effect of the mobilization method regardless of other factors that may make some people or groups more likely to participate in the political process.

To provide a simple example, say we are testing the effectiveness of a voter education program on high school seniors. If we allow students from the class to volunteer to participate in the program, and we then compare the volunteers’ voting behavior against those who did not participate, our results will reflect something other than the effects of the voter education intervention. This is because there are, no doubt, qualities about those volunteers that make them different from students who do not volunteer. And, most important for our work, those differences may very well correlate with propensity to vote.

Instead of letting students self-select, or even letting teachers select students (as teachers may have biases in who they choose), we could randomly assign all students in a given class to be in either a treatment or control group. This would ensure that those in the treatment and control groups differ solely due to chance.

The value of randomization may also be seen in the use of walk lists for door-to-door canvassers. If canvassers choose which houses they will go to and which they will skip, they may choose houses that seem more inviting or houses that are placed closely together rather than those that are more spread out. These differences could conceivably correlate with voter turnout. Or if house numbers are chosen by selecting those on the first half of a ten-page list, they may be clustered in neighborhoods that differ in important ways from neighborhoods in the second half of the list.

Random assignment controls for both known and unknown variables that can creep in with other selection processes to confound analyses. Randomized experimental design is a powerful tool for drawing valid inferences about cause and effect. The use of randomized experimental design should allow a degree of certainty that the research findings cited in studies that employ this methodology reflect the effects of the interventions being measured and not some other underlying variable or variables.

Randomized Block Experiment: Example

This lesson shows how to use analysis of variance to analyze and interpret data from a randomized block experiment. To illustrate the process, we walk step-by-step through a real-world example.

Computations for analysis of variance are usually handled by a software package. For this example, however, we will do the computations "manually", since the gory details have educational value.

Prerequisites: The lesson assumes general familiarity with randomized block designs. If you are unfamiliar with randomized block designs or with terms like blocks , blocking , and blocking variables , review the previous lessons:

  • Randomized Block Designs
  • Randomized Block Experiments: Data Analysis

Problem Statement

As part of a randomized block experiment, a researcher tests the effect of three teaching methods on student performance. The researcher selects subjects randomly from a student population. The researcher assigns subjects to six blocks of three, such that students within the same block have the same (or similar) IQ. Within each block, each student is randomly assigned to a different teaching method.

At the end of the term, the researcher collects one test score (the dependent variable) from each subject, as shown in the table below:

Table 1. Dependent Variable Scores

IQ        Teaching Method
          A     B     C
91-95     84    85    85
96-100    86    86    88
101-105   86    87    88
106-110   89    88    89
111-115   88    89    89
116-120   91    90    91

In conducting this experiment, the researcher has two research questions:

  • Does teaching method have a significant effect on student performance (as measured by test score)?
  • How strong is the effect of teaching method on student performance?

To answer these questions, the researcher uses analysis of variance.

Analytical Logic

To implement analysis of variance with an independent groups, randomized block experiment, a researcher takes the following steps:

  • Specify a mathematical model to describe how main effects and the blocking variable influence the dependent variable.
  • Write statistical hypotheses to be tested by experimental data.
  • Specify a significance level for a hypothesis test.
  • Compute the grand mean and marginal means for the independent variable and for the blocking variable.
  • Compute sums of squares for each effect in the model.
  • Find the degrees of freedom associated with each effect in the model.
  • Based on sums of squares and degrees of freedom, compute mean squares for each effect in the model.
  • Find the expected value of the mean squares for each effect in the model.
  • Compute a test statistic for the independent variable and a test statistic for the blocking variable, based on observed mean squares and their expected values.
  • Find the P value for each test statistic.
  • Reject or fail to reject null hypotheses , based on P value and significance level.
  • Assess the magnitude of effect, based on sums of squares.

Below, we'll explain how to implement each step in the analysis.

Mathematical Model

For every experimental design, there is a mathematical model that accounts for all of the independent and extraneous variables that affect the dependent variable. Here is a mathematical model for an independent groups, randomized block experiment:

\(X_{ij} = \mu + \beta_i + \tau_j + \varepsilon_{ij}\)

where \(X_{ij}\) is the dependent variable score (in this example, the test score) for the subject in block \(i\) that receives treatment \(j\); \(\mu\) is the population mean; \(\beta_i\) is the effect of block \(i\); \(\tau_j\) is the effect of treatment \(j\); and \(\varepsilon_{ij}\) is the experimental error (i.e., the effect of all other extraneous variables).

For this model, it is assumed that \(\varepsilon_{ij}\) is normally and independently distributed with a mean of zero and a variance of \(\sigma^2_\varepsilon\). The mean \(\mu\) is constant.

Note: Unlike the model for a full factorial experiment, the model for a randomized block experiment does not include an interaction term. That is, the model assumes there is no interaction between block and treatment effects.

Statistical Hypotheses

With a randomized block experiment, it is possible to test both block (\(\beta_i\)) and treatment (\(\tau_j\)) effects. Here are the null hypotheses (\(H_0\)) and alternative hypotheses (\(H_1\)) for each effect.

Block effects:

\(H_0: \beta_i = 0\) for all \(i\)

\(H_1: \beta_i \neq 0\) for some \(i\)

Treatment effects:

\(H_0: \tau_j = 0\) for all \(j\)

\(H_1: \tau_j \neq 0\) for some \(j\)

With a randomized block experiment, the main hypothesis test of interest is the test of the treatment effect(s). For instance, in this example the experimenter is primarily interested in the effect of teaching method on student performance (i.e., test score).

Block effects are of less intrinsic interest, because a blocking variable is thought to be a nuisance variable that is only included in the experiment to control for a potential source of undesired variation. In this example, IQ is a potential nuisance variable.

Significance Level

The significance level (also known as alpha or α) is the probability of rejecting the null hypothesis when it is actually true. The significance level for an experiment is specified by the experimenter, before data collection begins. Experimenters often choose significance levels of 0.05 or 0.01. For this experiment, we'll assume that the experimenter chose 0.05 as the significance level.

A significance level of 0.05 means that there is a 5% chance of rejecting the null hypothesis when it is true. A significance level of 0.01 means that there is a 1% chance of rejecting the null hypothesis when it is true. The lower the significance level, the more persuasive the evidence needs to be before an experimenter can reject the null hypothesis.

Mean Scores

Analysis of variance for a randomized block experiment begins by computing a grand mean and marginal means for independent variables and for blocks. Here are computations for the various means, based on dependent variable scores from Table 1:

  • Marginal means for treatment levels. The mean for treatment level \(j\) is \(\bar{X}_{.j} = \frac{1}{n}\sum_{i=1}^{n} X_{ij}\). For this experiment: \(\bar{X}_{.1} = 87.33\), \(\bar{X}_{.2} = 87.50\), and \(\bar{X}_{.3} = 88.33\).
  • Marginal means for blocks. The mean for block \(i\) is \(\bar{X}_{i.} = \frac{1}{k}\sum_{j=1}^{k} X_{ij}\). For this experiment: \(\bar{X}_{1.} = 84.67\), \(\bar{X}_{2.} = 86.67\), \(\bar{X}_{3.} = 87.00\), \(\bar{X}_{4.} = 88.67\), \(\bar{X}_{5.} = 88.67\), and \(\bar{X}_{6.} = 90.67\).

In the equations above, \(n\) is the number of blocks and \(k\) is the number of treatment levels. The grand mean for this experiment is \(\bar{X} = 87.72\).

Sums of Squares

A sum of squares is the sum of squared deviations from a mean score. A randomized block design makes use of four sums of squares:

  • Sum of squares for treatments. The sum of squares for treatments (SSTR) measures variation of the treatment-level marginal means \(\bar{X}_{.j}\) around the grand mean \(\bar{X}\): \(SSTR = n \sum_{j=1}^{k} (\bar{X}_{.j} - \bar{X})^2 = 3.44\)
  • Sum of squares for blocks. The sum of squares for blocks (SSB) measures variation of the block marginal means \(\bar{X}_{i.}\) around the grand mean: \(SSB = k \sum_{i=1}^{n} (\bar{X}_{i.} - \bar{X})^2 = 64.28\)
  • Error sum of squares. The error sum of squares (SSE) measures variation of the scores \(X_{ij}\) attributable to extraneous variables: \(SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X})^2 = 3.89\)
  • Total sum of squares. The total sum of squares (SST) measures variation of all scores \(X_{ij}\) around the grand mean: \(SST = \sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij} - \bar{X})^2 = 71.61\)

In the formulas above, n is the number of blocks, and k is the number of treatment levels. And the total sum of squares is equal to the sum of the component sums of squares, as shown below:

SST = SSTR + SSB + SSE

SST = 3.44 + 64.28 + 3.89 = 71.61

Degrees of Freedom

The term degrees of freedom (df) refers to the number of independent sample points used to compute a statistic minus the number of parameters estimated from the sample points.

The degrees of freedom used to compute the various sums of squares for an independent groups, randomized block experiment are shown in the table below:

Sum of squares        Degrees of freedom
Treatment             k - 1 = 2
Block                 n - 1 = 5
Error                 (k - 1)(n - 1) = 10
Total                 nk - 1 = 17

Notice that there is an additive relationship between the various sums of squares. The degrees of freedom for total sum of squares (df TOT ) is equal to the degrees of freedom for the treatment sum of squares (df TR ) plus the degrees of freedom for the blocks sum of squares (df B ) plus the degrees of freedom for the error sum of squares (df E ). That is,

df TOT = df TR + df B + df E

df TOT = 2 + 5 + 10 = 17

Mean Squares

A mean square is an estimate of population variance. It is computed by dividing a sum of squares (SS) by its corresponding degrees of freedom (df), as shown below:

MS = SS / df

To conduct analysis of variance with a randomized block experiment, we are interested in three mean squares:

\(MS_T = SSTR / df_{TR} = 3.44 / 2 = 1.72\)

\(MS_B = SSB / df_B = 64.28 / 5 = 12.86\)

\(MS_E = SSE / df_E = 3.89 / 10 = 0.39\)

Expected Value

The expected value of a mean square is the average value of the mean square over a large number of experiments.

Statisticians have derived formulas for the expected value of mean squares, assuming the mathematical model described earlier is correct. Those formulas appear below:

Mean square   Expected value
\(MS_T\)      \(\sigma^2_\varepsilon + n\sigma^2_\tau\)
\(MS_B\)      \(\sigma^2_\varepsilon + k\sigma^2_\beta\)
\(MS_E\)      \(\sigma^2_\varepsilon\)

In the table above, \(MS_T\) is the mean square for treatments; \(MS_B\) is the mean square for blocks; and \(MS_E\) is the error mean square.

Test Statistics

The main data analysis goal for this experiment is to test the hypotheses that we stated earlier (see Statistical Hypotheses ). That will require the use of test statistics. Let's talk about how to compute test statistics for this study and how to interpret the statistics we compute.

How to Compute Test Statistics

Suppose we want to test the significance of an independent variable or a blocking variable in a randomized block experiment. We can use the mean squares to define a test statistic F for each source of variation, as shown in the table below:

Source          Expected mean square                          F ratio
Treatment (T)   \(\sigma^2_\varepsilon + n\sigma^2_\tau\)     \(F_T = MS_T / MS_E\)
Block (B)       \(\sigma^2_\varepsilon + k\sigma^2_\beta\)    \(F_B = MS_B / MS_E\)
Error           \(\sigma^2_\varepsilon\)

Using formulas from the table with data from this randomized block experiment, we can compute an F ratio for treatments (\(F_T\)) and an F ratio for blocks (\(F_B\)):

\(F_T = MS_T / MS_E = 1.72 / 0.39 = 4.4\)

\(F_B = MS_B / MS_E = 12.86 / 0.39 = 33.0\)

How to Interpret Test Statistics

Consider the F ratio for the treatment effect in this randomized block experiment, referring back to the table above that shows expected mean squares and F ratio formulas.

Notice that the numerator of the F ratio for the treatment effect should equal the denominator when the variation due to the treatment (\(\sigma^2_\tau\)) is zero (i.e., when the treatment does not affect the dependent variable). And the numerator should be bigger than the denominator when the variation due to the treatment is not zero (i.e., when the treatment does affect the dependent variable).

The F ratio for the blocking variable works the same way. When the blocking variable does not affect the dependent variable, the numerator of the F ratio should equal the denominator. Otherwise, the numerator should be bigger than the denominator.

Each F ratio is a convenient measure that we can use to test the null hypothesis about the effect of a source (the treatment or the blocking variable) on the dependent variable. Here's how to conduct the test:

  • When the F ratio is close to one, the numerator of the F ratio is approximately equal to the denominator. This indicates that the source did not affect the dependent variable, so we cannot reject the null hypothesis.
  • When the F ratio is significantly greater than one, the numerator is bigger than the denominator. This indicates that the source did affect the dependent variable, so we must reject the null hypothesis.

What does it mean for the F ratio to be significantly greater than one? To answer that question, we need to talk about the P-value.

Warning: Recall that this analysis assumes that the interaction between blocking variable and independent variable is zero. If that assumption is incorrect, the F ratio for a fixed-effects variable will be biased. It may indicate that an effect is not significant, when it truly is significant.

In an experiment, a P-value is the probability of obtaining a result more extreme than the observed experimental outcome, assuming the null hypothesis is true.

With analysis of variance for a randomized block experiment, the F ratios are the observed experimental outcomes that we are interested in. So, the P-value would be the probability that an F ratio would be more extreme (i.e., bigger) than the actual F ratio computed from experimental data.

How does an experimenter attach a probability to an observed F ratio? Luckily, the F ratio is a random variable that has an F distribution . The degrees of freedom (\(v_1\) and \(v_2\)) for the F ratio are the degrees of freedom associated with the mean squares used to compute the F ratio.

For example, consider the F ratio for a treatment effect. That F ratio (\(F_T\)) is computed from the following formula:

\(F_T = F(v_1, v_2) = MS_T / MS_E\)

\(MS_T\) (the numerator in the formula) has degrees of freedom equal to \(df_{TR}\); so for \(F_T\), \(v_1\) is equal to \(df_{TR}\). Similarly, \(MS_E\) (the denominator in the formula) has degrees of freedom equal to \(df_E\); so for \(F_T\), \(v_2\) is equal to \(df_E\). Knowing the F ratio and its degrees of freedom, we can use an F table or Stat Trek's free F distribution calculator to find the probability that an F ratio will be bigger than the actual F ratio observed in the experiment.
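
The same upper-tail probabilities can also be computed in R; a one-line sketch for each effect, using the degrees of freedom derived above:

pf(4.4, df1 = 2, df2 = 10, lower.tail = FALSE)   # P-value for treatments, about 0.04

pf(33, df1 = 5, df2 = 10, lower.tail = FALSE)    # P-value for blocks, about 0.00001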

To illustrate the process, let's find P-values for the treatment variable and for the blocking variable in this randomized block experiment.

Treatment Variable P-Value

From previous computations, we know the following:

  • The observed value of the F ratio for the treatment variable is 4.4.
  • The degrees of freedom (\(v_1\)) for the treatment variable mean square (\(MS_T\)) is 2.
  • The degrees of freedom (\(v_2\)) for the error mean square (\(MS_E\)) is 10.

Therefore, the P-value we are looking for is the probability that an F with 2 and 10 degrees of freedom is greater than 4.4. We want to know:

P [ F(2, 10) > 4.4 ]

Now, we are ready to use the F Distribution Calculator . We enter the degrees of freedom (v1 = 2) for the treatment mean square, the degrees of freedom (v2 = 10) for the error mean square, and the F value (4.4) into the calculator; and hit the Calculate button.

The calculator reports that the probability that F is greater than 4.4 equals about 0.04. Hence, the correct P-value for the treatment variable is 0.04.

Blocking Variable P-Value

The process to compute the P-value for the blocking variable is exactly the same as the process used for the treatment variable. From previous computations, we know the following:

  • The observed value of the F ratio for the blocking variable is 33.

\(F_B = F(v_1, v_2) = MS_B / MS_E\)

  • The degrees of freedom (\(v_1\)) for the blocking variable mean square (\(MS_B\)) is 5.

Therefore, the P-value we are looking for is the probability that an F with 5 and 10 degrees of freedom is greater than 33. We want to know:

P [ F(5, 10) > 33 ]

Now, we are ready to use the F Distribution Calculator . We enter the degrees of freedom (v1 = 5) for the block mean square, the degrees of freedom (v2 = 10) for the error mean square, and the F value (33) into the calculator; and hit the Calculate button.

The calculator reports that the probability that F is greater than 33 is about 0.00001. Hence, the correct P-value is 0.00001.

Interpretation of Results

Having completed the computations for analysis, we are ready to interpret results. We begin by displaying key findings in an ANOVA summary table. Then, we use those findings to (1) test hypotheses and (2) assess the magnitude of effects.

ANOVA Summary Table

It is traditional to summarize ANOVA results in an analysis of variance table. Here, filled with key results, is the analysis of variance table for the randomized block experiment that we have been working on.

Analysis of Variance Table

Source       SS     df    MS      F      P
Treatment    3.44    2    1.72    4.4    0.04
Block       64.28    5   12.86   33.0   <0.01
Error        3.89   10    0.39
Total       71.61   17

This ANOVA table provides all the information that we need to (1) test hypotheses and (2) assess the magnitude of treatment effects.

Hypothesis Test

Recall that the experimenter specified a significance level of 0.05 for this study. Once you know the significance level and the P-values, the hypothesis tests are routine. Here's the decision rule for accepting or rejecting a null hypothesis:

  • If the P-value is bigger than the significance level, fail to reject the null hypothesis.
  • If the P-value is equal to or smaller than the significance level, reject the null hypothesis.

A "big" P-value for a source of variation (an independent variable or a blocking variable) indicates that the source did not have a statistically significant effect on the dependent variable. A "small" P-value indicates that the source did have a statistically significant effect on the dependent variable.

The P-value (shown in the last column of the ANOVA table) is the probability that an F statistic would be more extreme (bigger) than the F ratio shown in the table, assuming the null hypothesis is true. When a P-value for an independent variable or a blocking variable is bigger than the significance level, we fail to reject the null hypothesis for the effect; when it is smaller, we reject the null hypothesis.

Based on the P-values in the table above, we can draw the following conclusions:

  • The P-value for treatments (i.e., the independent variable) is 0.04. Since the P-value is smaller than the significance level (0.05), we reject the null hypothesis that the independent variable (teaching method) has no effect on the dependent variable.
  • The P-value for the blocking variable is less than 0.01. Since this P-value is also smaller than the significance level (0.05), we reject the null hypothesis that the blocking variable (IQ) has no effect on the dependent variable.

In addition, two other points are worthy of note:

  • The fact that the blocking variable (IQ) is statistically significant is good news in a randomized block experiment. It confirms the suspicion that the blocking variable was a nuisance variable that could have obscured the effect of the independent variable on the dependent variable. And it justifies the decision to use a randomized block experiment to control nuisance effects of IQ.
  • The independent variable (teaching method) was also statistically significant with a P-value of 0.04. Had the experimenter used a different design that did not control the nuisance effect of IQ, the experiment might not have produced a significant effect for the independent variable.

Magnitude of Effect

The hypothesis tests tell us whether sources of variation in our experiment had a statistically significant effect on the dependent variable, but the tests do not address the magnitude of the effect. Here are some issues:

  • When the sample size is large, you may find that even small effects (indicated by a small F ratio) are statistically significant.
  • When the sample size is small, you may find that even big effects are not statistically significant.
  • When the blocking variable in a randomized block design is strongly correlated with the dependent variable, you may find that even small treatment effects are statistically significant.

With this in mind, it is customary to supplement analysis of variance with an appropriate measure of effect size. Eta squared (\(\eta^2\)) is one such measure. Eta squared is the proportion of variance in the dependent variable that is explained by a source of variation. The eta squared formula for an independent variable or a blocking variable is:

\(\eta^2 = SS_{SOURCE} / SST\)

where \(SS_{SOURCE}\) is the sum of squares for a source of variation (i.e., an independent variable or a blocking variable) and SST is the total sum of squares.

Using sum of squares entries from the ANOVA table, we can compute eta squared for the treatment variable (\(\eta^2_T\)) and for the blocking variable (\(\eta^2_B\)):

\(\eta^2_T = SSTR / SST = 3.44 / 71.61 = 0.05\)

\(\eta^2_B = SSB / SST = 64.28 / 71.61 = 0.90\)

The treatment variable (teaching method) accounted for about 5% of the variance in test performance, and the blocking variable (IQ) accounted for about 90% of the variance in test performance. Based on these findings, an experimenter might conclude:

  • IQ accounted for most of the variance in test performance.
  • Even though the teaching method effect was statistically significant, teaching method accounted for only a small proportion of test variation.

Note: Given the very strong nuisance effect of IQ, it is likely that a different experimental design would not have revealed a statistically significant effect for teaching method.

An Easier Option

In this lesson, we showed all of the hand calculations for analysis of variance with a randomized block experiment. In the real world, researchers seldom conduct analysis of variance by hand. They use statistical software. In the next lesson, we'll demonstrate how to conduct the same analysis of the same problem with Excel. Hopefully, we'll get the same result.
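
In that spirit, here is a minimal base R sketch of the same analysis using aov(). The data entry mirrors Table 1; the variable names are our own:

score  <- c(84, 85, 85,   # block 1 (IQ 91-95):  methods A, B, C
            86, 86, 88,   # block 2 (IQ 96-100)
            86, 87, 88,   # block 3 (IQ 101-105)
            89, 88, 89,   # block 4 (IQ 106-110)
            88, 89, 89,   # block 5 (IQ 111-115)
            91, 90, 91)   # block 6 (IQ 116-120)
method <- factor(rep(c("A", "B", "C"), times = 6))   # treatment within each block
block  <- factor(rep(1:6, each = 3))                 # blocking variable (IQ group)

summary(aov(score ~ method + block))

The resulting ANOVA table should match the hand computations above: sums of squares of 3.44 (method), 64.28 (block), and 3.89 (error), with F ratios of about 4.4 and 33.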


  16. Randomized Controlled Trial (RCT) Overview

    A randomized controlled trial (RCT) is a prospective experimental design that randomly assigns participants to an experimental or control group. RCTs are the gold standard for establishing causal relationships and ruling out confounding variables and selection bias. Researchers must be able to control who receives the treatments and who are the ...

  17. Research Guides: Study Design 101: Randomized Controlled Trial

    Definition. A study design that randomly assigns participants into an experimental group or a control group. As the study is conducted, the only expected difference between the control and experimental groups in a randomized controlled trial (RCT) is the outcome variable being studied.

  18. Heterogeneous effects of Medicaid coverage on cardiovascular risk

    Objectives To investigate whether health insurance generated improvements in cardiovascular risk factors (blood pressure and hemoglobin A1c (HbA1c) levels) for identifiable subpopulations, and using machine learning to identify characteristics of people predicted to benefit highly. Design Secondary analysis of randomized controlled trial. Setting Medicaid insurance coverage in 2008 for adults ...

  19. A Two-stage Inference Procedure for Sample Local Average Treatment

    In a given randomized experiment, individuals are often volunteers and can differ in important ways from a population of interest. It is thus of interest to focus on the sample at hand. This paper focuses on inference about the sample local average treatment effect (LATE) in randomized experiments with non-compliance. We present a two-stage procedure that provides asymptotically correct ...

  20. Experimental Design

    Experimentation An experiment deliberately imposes a treatment on a group of objects or subjects in the interest of observing the response. This differs from an observational study, which involves collecting and analyzing data without changing existing conditions.Because the validity of a experiment is directly affected by its construction and execution, attention to experimental design is ...

  21. Introduction to Field Experiments and Randomized Controlled Trials

    In this article, we offer an overview of field experimentation and its importance in discerning cause and effect relationships. We outline how randomized experiments represent an unbiased method for determining what works. Furthermore, we discuss key aspects of experiments, such as intervention, excludability, and non-interference.

  22. A Refresher on Randomized Controlled Experiments

    A Refresher on Randomized Controlled Experiments. In order to make smart decisions at work, we need data. Where that data comes from and how we analyze it depends on a lot of factors — for ...

  23. Randomized Experiment

    Randomized Experiment Stages. The experiments are usually conducted in two stages [2]: Selection of a small sample of participants from a larger population, using a random sampling technique. This step ensures that the results will have external validity. Random assignment to treatment and control groups. This step ensures that the observed ...

  24. Why randomize?

    In a randomized experiment, a study sample is divided into one group that will receive the intervention being studied (the treatment group) and another group that will not receive the intervention (the control group). For instance, a study sample might consist of all registered voters in a particular city. This sample will then be randomly ...

  25. Effects of Monochromatic Infrared Light on Painful Diabetic

    Purpose: To evaluate the effect of 890 nm Monochromatic Infrared Light (MIR) associated with a physical therapy protocol on pain in individuals with diabetic Distal Symmetric Polyneuropathy.Methods: Randomized, parallel, double-blind controlled trial conducted with individuals randomly allocated into two groups: an experimental group (EG) with the application of 890 nm MIR associated with ...

  26. Randomized Block Experiment: Example

    Statistical Hypotheses. With a randomized block experiment, it is possible to test both block ( β i ) and treatment ( τ j ) effects. Here are the null hypotheses (H 0) and alternative hypotheses (H 1) for each effect. H 0: β i = 0 for all i. H 1: β i ≠ 0 for some i. H 0: τ j = 0 for all j. H 1: τ j ≠ 0 for some j.