Sample Size Calculations (IACUC)
Boston University (BU) is committed to observing Federal policies and regulations and the Association for Assessment and Accreditation of Laboratory Animal Care (AAALAC) International standards for the humane care and use of animals. This policy provides guidance on sample size calculations.
Covered Parties
This policy covers all animals on BU premises used for research, teaching, training, breeding, and related activities, hereinafter referred to collectively as “activities”, and is applicable to all persons responsible for conducting activities involving live vertebrate animals at or under the auspices of Boston University.
University Policy
Sample-Size Calculations
Estimation of the number of subjects required to answer an experimental question is an important step in planning a study. On one hand, an excessive sample size can result in waste of animal life and other resources, including time and money, because equally valid information could have been gleaned from a smaller number of subjects. However, underestimates of sample size are also wasteful, since an insufficient sample size has a low probability of detecting a statistically significant difference between groups, even if a difference really exists. Consequently, an investigator might wrongly conclude that groups do not differ, when in fact they do.
What Is Involved In Sample Size Calculations
While the need to arrive at appropriate estimates of sample size is clear, many scientists are unfamiliar with the factors that influence determination of sample size and with the techniques for calculating estimated sample size. A quick look at how most textbooks of statistics treat this subject indicates why many investigators regard sample-size calculations with fear and confusion.
While sample-size calculations can become extremely complicated, it is important to emphasize, first, that all of these techniques produce estimates, and, second, that there are just a few major factors influencing these estimates. As a result, it is possible to obtain very reasonable estimates from some relatively simple formulae.
When comparing two groups, the major factors that influence sample size are:
- How large a difference you need to be able to detect.
- How much variability there is in the factor of interest.
- What “p” value you plan to use as a criterion for statistical “significance.”
- How confident you want to be that you will detect a “statistically significant” difference, assuming that a difference does exist.
An Intuitive Look at a Simple Example
Suppose you are studying subjects with renal hypertension, and you want to test the effectiveness of a drug that is said to reduce blood pressure. You plan to compare systolic blood pressure in two groups, one which is treated with a placebo injection, and a second group which is treated with the drug being tested. While you don’t yet know what the blood pressures will be in each of these groups, just suppose that if you were to test a ridiculously large number of subjects (say 100,000) treated with either placebo or drug, their systolic blood pressures would follow two clearly distinct frequency distributions, as shown in Figure 1.
As you would expect, both groups show some variability in blood pressure, and the frequency distribution of observed pressures conforms to a bell shaped curve. As shown here, the two groups overlap, but they are clearly different; systolic pressures in the treated group are an average of 20 mm Hg less than in the untreated controls.
Since there were 100,000 in each group, we can be confident that the groups differ. Now suppose that although we treated 100,000 of each, we obtained pressure measurements from only three in each group, because the pressure-measuring apparatus broke. In other words, we have a random sample of N=3 from each group, and their systolic pressures are as follows:
| Placebo Group | Treated Group |
|---|---|
| 160 | 155 |
| 150 | 140 |
| 140 | 140 |
Pressures are lower in the treated group, but we cannot be confident that the treatment was successful. There is a distinct possibility that the difference we see is just due to chance, since we took a small random sample. So the question is: how many would we have to measure (sample) in each group to be confident that any observed differences were not simply the result of chance?
How large a sample is needed depends on the four factors listed above. To illustrate this intuitively, suppose that the blood pressures in the treated and untreated subjects were distributed as shown in Figure 2 or in Figure 3.
The size of the sample you need also depends on the "p value" that you use. A "p value" of less than 0.05 is frequently used as the criterion for deciding whether observed differences are likely to be due to chance. If p<0.05, the probability of observing a difference this large by chance alone is less than 5%. If you want to use a more stringent criterion (say, p<0.01), you will need a larger sample. Finally, the size of the sample you will need also depends on "power," that is, the probability that you will observe a statistically significant difference, assuming that a difference really exists.
To summarize, in order to calculate a sample-size estimate, you need some estimate of how different the groups might be (or how large a difference you need to be able to detect), and you also need an estimate of how much variability there will be within groups. In addition, your calculations must also take into account what you want to use as a "p value" and how much "power" you want.
The Information You Need To Do Sample Size Calculations
Since you haven't actually done the experiment yet, you won't know how different the groups will be or what the variability (as measured by the standard deviation) will be. But you can usually make reasonable guesses. Perhaps from your experience (or from previously published information) you anticipate that the untreated hypertensive subjects will have a mean systolic blood pressure of about 160 mm Hg with a standard deviation of about ±10 mm Hg. You decide that a reduction in systolic blood pressure to a mean of 150 mm Hg would represent a clinically meaningful reduction. Since no one has ever done this experiment before, you don't know how much variability there will be in response, so you will have to assume that the standard deviation for the test group is at least as large as that in the untreated controls. From these estimates you can calculate an estimate of the sample size you need in each group.
Sample Size Calculations For A Difference In Means
The actual calculations can get a little cumbersome, and most people don't even want to see equations. Consequently, I have put together a spreadsheet (Lamorte's Power Calculations) which does all the calculations automatically. All you have to do is enter the estimated means and standard deviations for each group. In the example shown here, I assumed that my control group (group 1) would have a mean of 160 and a standard deviation of 10. I wanted to know how many subjects I would need in each group to detect a significant difference of 10 mm Hg. So, I plugged in a mean of 150 for group 2 and assumed that the standard deviation for this group would be the same as for group 1.
The format in this spreadsheet makes it easy to play “what if.” If you want to get a feel for how many subjects you might need if the treatment reduces pressures by 20 mm Hg, just change the mean for group 2 to 140, and all the calculations will automatically be redone for you.
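If you prefer code to a spreadsheet, the same "what if" exercise can be sketched in Python using the standard large-sample formula for comparing two means, n = 2(z α/2 + z β)²σ²/δ² per group. This is a generic sketch, not the Lamorte spreadsheet's internals, and the 90% power default is my assumption:

```python
from math import ceil
from statistics import NormalDist

def n_per_group_means(mean1, mean2, sd, alpha=0.05, power=0.90):
    """Subjects needed per group to detect mean1 - mean2 with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided p criterion
    z_beta = z.inv_cdf(power)            # quantile corresponding to the desired power
    delta = abs(mean1 - mean2)           # difference you need to detect
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# "What if" play, as with the spreadsheet:
print(n_per_group_means(160, 150, 10))  # detect a 10 mm Hg drop
print(n_per_group_means(160, 140, 10))  # detect a 20 mm Hg drop
```

Changing the group-2 mean from 150 to 140 here plays the same "what if" as editing the spreadsheet cell: the larger difference requires far fewer subjects.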
Sample Size Calculations For A Difference In Proportions
The bottom part of the same spreadsheet generates sample-size calculations for comparing differences in frequency of an event. Suppose, for example, that a given treatment was successful 50% of the time and you wanted to test a new treatment with the hope that it would be successful 90% of the time. All you have to do is plug these (as fractions) into the spreadsheet, and the estimated sample sizes will be calculated automatically as shown here:
The illustration from the spreadsheet below shows that to have a 90% probability of showing a statistically significant difference (using p< 0.05) in proportions this great, you would need about 22 subjects in each group.
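The proportion calculation can be sketched in Python with the standard unpooled two-proportion formula. This is a sketch, not the spreadsheet's exact method; because rounding and variance-pooling conventions differ, it returns 23 where the text reports about 22:

```python
from math import ceil
from statistics import NormalDist

def n_per_group_props(p1, p2, alpha=0.05, power=0.90):
    """Subjects per group to detect a difference between two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled binomial variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_group_props(0.5, 0.9))              # 50% vs. 90% success, 90% power
print(n_per_group_props(0.5, 0.9, power=0.80))  # same comparison at 80% power
```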
Spreadsheet
The Statistical Explanation Sample Spreadsheet described above can be found here.
Responsible Parties
Principal Investigators are responsible for: preparing and submitting applications; making modifications to applications in order to secure IACUC approval; ensuring adherence to approved protocols; and ensuring humane care and use of animals. It is the responsibility of the IACUC to ensure that the number of animals to be used in an animal use protocol is appropriate. The Animal Welfare Program and the Institutional Animal Care and Use Committee are responsible for overseeing implementation of and ensuring compliance with this policy.
Effective Date: 03/05/2024 Next Review Date: 03/04/2027
Korean J Anesthesiol. 2021 Feb;74(1)
General considerations for sample size estimation in animal study
Mun Jung Ko
Department of Biostatistics, Dongguk University College of Medicine, Goyang, Korea
Chi-Yeon Lim
The aim of this paper is to introduce basic concepts and methods for calculating sample size in animal studies. At the planning stage of clinical studies, determining the sample size is a very important process for demonstrating the validity, accuracy, and reliability of the study. However, not all studies require a sample size to be calculated. Before conducting the study, it is essential to determine whether the study is a pilot or exploratory study, as well as the purpose of testing the hypothesis of interest. Since most animal experiments are pilot or exploratory studies, it may be more appropriate to review other considerations for conducting an experiment while maintaining scientific and qualitative rigor, rather than to estimate sample size. Sample size is calculated in various situations in animal studies, and it can be estimated according to the situation and objectives through methods such as precision analysis and power analysis. In some cases, nonparametric methods can be employed if the assumption of normality is not met or only a small sample is available for the study.
Introduction
Calculating the sample size before beginning a clinical study is a crucial process for demonstrating its validity, accuracy, and reliability. However, not all studies require the calculation of a sample size. It is essential to determine whether or not the study is a pilot or exploratory study, along with the purpose of testing the hypothesis of interest. Since most animal experiments are pilot or exploratory studies, it may be more appropriate to consider other factors that can be tested while maintaining scientific and qualitative rigor than to estimate sample size. This is because new hypotheses may be established based on the results of pilot and exploratory studies. Additionally, even if a hypothesis exists in such a study, it may be redefined.
In some cases, an animal study for pre-defined hypothesis testing may not be able to perform the experiment with the estimated sample size. For example, if the subject of the experiment is a monkey, it can be difficult to conduct the research with even one monkey due to reasons such as cost or the experimental environment. Liu et al. [1] stated that it is difficult to obtain a larger number of animals, as the use of nonhuman primates is strictly regulated.
The number of subjects that should be studied is critical to show clinically meaningful differences and statistical power. When estimating the sample size, however, a limited budget or research environment may require a trade-off between cost-effectiveness and power [ 2 ].
Ethical issues should also be considered when determining the sample size in animal studies. Russell and Burch [3], in The Principles of Humane Experimental Technique (1959), proposed the 3Rs as ethical considerations to be applied by researchers and institutions to any animal experiment. The 3R principles harmonize science and ethics in the field of animal experimentation and comprise replacement, refinement, and reduction.
The purpose of this paper is to offer researchers a method for estimating the appropriate sample size in animal studies. Furthermore, this paper helps to explain how the sample size is calculated depending on the stage (pilot, exploratory, or confirmatory study) and the comparison type of the study.
Pilot and exploratory experiments
Pilot studies are performed to check the feasibility and measurement precision of the variables that are intended to be measured in the main study or pivotal study and to verify the logistics of the proposed experiment. The sample size of a pilot study is based on the researcher's previous experience or guesswork because previous data are not available. Exploratory studies are also conducted to create new hypotheses. In other words, the purpose of these studies is to determine the trend or pattern of responses; therefore, they do not require a significance test. The sample sizes for these studies are sometimes calculated based on previous studies. The data obtained from these studies (standard deviation, the mean difference between the two samples, etc.) are used to calculate the sample size for a pivotal study [4].
Confirmatory study
A confirmatory study is a controlled study in which the study hypotheses are stated in advance and well-designed. The hypothesis of interest follows directly from the primary objective of the study, is always pre-defined, and is the hypothesis that is subsequently tested after completing the trial [ 5 ].
In these studies, it is very important to estimate with due precision the size of the effects attributable to the treatment of interest and to relate these effects to their clinical significance. In confirmatory study, sample size calculation plays an important role in providing evidence to support the claims. Therefore, estimating a valid sample size for the study is particularly important.
General considerations prior to sample size calculation
Several factors must be considered when calculating the sample size, such as the study’s purpose, study phase, type of comparison, primary variable and its characteristic, clinically meaningful difference, experimental design, statistical test, number of controls, randomization ratio, dropouts, covariates, and so on.
Type of comparison
It is important to clearly state the objective of the intended study because the objective influences the hypothesis of the study. For the study objective, there are four types of comparisons: test for equality, superiority, non-inferiority, and equivalence. The equality test is a two-sided test, while the others are one-sided tests (the test for equivalence uses two one-sided tests). The test for equality is often used to demonstrate the intended objective in pilot, exploratory, and pre-clinical studies, such as animal studies. In other words, since confirmatory studies are commonly performed after many pilot/exploratory studies, the equality test is often conducted in pilot/exploratory studies, such as animal studies.
To demonstrate the objectives, hypotheses are usually formulated based on the primary study objectives. If it is explained using statistical notations to facilitate understanding, it is expressed as follows ( Table 1 ). H 1 and H 0 are the alternative and null hypotheses, respectively. Let μ t and μ c be the true mean of the test and control group and p t and p c be the true proportion of the test and control group, respectively. Additionally, let δ be the clinically significant difference in the equality test, the non-inferiority margin in the non-inferiority test, the superiority margin in the superiority test, and the equivalence margin in the equivalence test.
Hypotheses according to the Type of Comparison
| Type of comparison | Comparing Means: H1 | Comparing Means: H0 | Comparing Proportions: H1 | Comparing Proportions: H0 |
|---|---|---|---|---|
| Test for Equality | μ t – μ c ≠ 0 | μ t – μ c = 0 | p t – p c ≠ 0 | p t – p c = 0 |
| Test for Superiority | μ t – μ c > δ (δ > 0) | μ t – μ c ≤ δ | p t – p c > δ (δ > 0) | p t – p c ≤ δ |
| Test for Non-Inferiority | μ t – μ c > δ (δ < 0) | μ t – μ c ≤ δ | p t – p c > δ (δ < 0) | p t – p c ≤ δ |
| Test for Equivalence | \|μ t – μ c\| < δ | \|μ t – μ c\| ≥ δ | \|p t – p c\| < δ | \|p t – p c\| ≥ δ |
We assume that the difference ( μ t – μ c ) > 0 is considered an improvement of the test group as compared to the control group. A typical approach to compare the mean or proportion differences in a study with two independent samples (groups) is to test the following hypotheses shown in Table 1 .
Primary variable
The outcomes from animal studies are distinguished from quantitative variables, whose values result from counting or measuring something, and qualitative variables as categorical variables. Sample size calculation is often performed based on statistical inference of the primary variable [ 2 ]. This paper deals with continuous and categorical variables, including dichotomous variables which define one of the outcomes as a “success” and the other a “failure.”
The significance level ( α ) and statistical power (1 – β ) must be considered when calculating a sample size. The significance level is the maximum allowable value of the type I error. The type I error indicates the probability of rejecting the null hypothesis when it is true. Statistical power is the probability of rejecting H 0 when it is false. If the type II error is set to β , then the statistical power is set to 1 – β . The power analysis is a method of sample size calculation that can be used to estimate the sample size required for a study, given the significance level and statistical power.
Table 2 displays the four situations that can be considered for decision-making on unknown facts when testing the hypotheses.
Four Situations in Hypothesis Testing
| Decision-making about H 0 | Actual status: H 0 True | Actual status: H 0 False |
|---|---|---|
| Don't reject H 0 | Correct (confidence level, 1 – α) | Type II error (β) |
| Reject H 0 | Type I error (α) | Correct (power, 1 – β) |
Type I error = Pr (Reject H 0 | H 0 True), Confidence level = Pr (Don’t reject H 0 | H 0 True), Type II error = Pr (Don’t reject H 0 | H 0 False), Power = Pr (Reject H 0 | H 0 False).
Sample size calculation
Calculation of the sample size before beginning the study is needed to test the intended research objective. Too small a sample size lowers the sensitivity of the experiment to identify significant differences, whereas too large a sample size wastes time, cost, and resources [6]. In the latter case, a trade-off may often occur between cost-effectiveness and detecting power [2]. As such, it is difficult to determine the sample size for studies, especially confirmatory studies.
Several studies have introduced methods to easily calculate the sample size. Arifin and Zahiruddin [ 7 ] introduced a method to calculate the sample size in animal studies, which are pilot and exploratory in nature, through a simple formula using an ANOVA design. The sample size in animal studies can be calculated for various situations. The statistical approaches also vary including precision analysis, power analysis and so on.
Precision analysis
Precision analysis is one of the methods for calculating the sample size. This approach chooses the sample size in such a way that there is a desired precision at a fixed confidence level, that is, a fixed type I error. It is simple and easy to calculate but may have a small probability of detecting a true difference.
The precision of a 100(1 – α)% confidence interval depends on its width: a narrower interval is more precise. This method therefore controls the maximum half-width of the 100(1 – α)% confidence interval [2].
When σ² is known, the sample size required for a 100(1 – α)% confidence interval for μ can be chosen as

n ≥ (z α/2 · σ / E)²

where z α/2 is the upper (α/2)th quantile of the standard normal distribution, and E is the maximum error in the estimation of μ.
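As a quick sketch, the precision-analysis sample size n ≥ (z α/2 σ/E)² can be computed with Python's standard library (the illustrative values of σ and E below are my own, not from the paper):

```python
from math import ceil
from statistics import NormalDist

def n_precision(sigma, E, alpha=0.05):
    """Sample size so the 100(1 - alpha)% CI for the mean has half-width <= E."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    return ceil((z_alpha * sigma / E) ** 2)

# Example: known sigma = 10, desired maximum error E = 2
print(n_precision(sigma=10, E=2))              # 95% confidence
print(n_precision(sigma=10, E=2, alpha=0.01))  # 99% confidence needs more animals
```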
Power analysis
The power analysis method is usually used to estimate the sample size in a clinical research. It selects the required sample size to achieve the desired power for detecting a scientifically or clinically meaningful difference at a fixed type I error [ 2 ].
The simple illustration in Table 3 has several assumptions: (1) two-sample parallel design; (2) σ² is the known population variance; (3) the population variances of the test and control groups are equal to σ²; (4) μ t – μ c is the true mean difference between a test group (μ t) and a control group (μ c); (5) μ t – μ c > 0 is considered an indication of improvement of the test group as compared to the control group; (6) δ is the clinically significant difference in the equality test, the non-inferiority margin in the non-inferiority test, the superiority margin in the superiority test, and the equivalence margin in the equivalence test; (7) k is a constant for the allocation ratio; (8) n t is the sample size of the test group and n c is the sample size of the control group; and (9) z α/2, z α, z β, and z β/2 are the upper (α/2)th, αth, βth, and (β/2)th quantiles of the standard normal distribution, respectively.
Formulae in Various Types of Comparison for Comparing Two Group Means
| Type of comparison | H1 | H0 | Sample size for control group | Sample size for test group |
|---|---|---|---|---|
| Test for Equality | μ t – μ c ≠ 0 | μ t – μ c = 0 | n c = (1 + 1/k)(z α/2 + z β)² σ²/δ² | n t = k·n c |
| Test for Superiority | μ t – μ c > δ (δ > 0) | μ t – μ c ≤ δ | n c = (1 + 1/k)(z α + z β)² σ²/(μ t – μ c – δ)² | n t = k·n c |
| Test for Non-inferiority | μ t – μ c > δ (δ < 0) | μ t – μ c ≤ δ | n c = (1 + 1/k)(z α + z β)² σ²/(μ t – μ c – δ)² | n t = k·n c |
| Test for Equivalence | \|μ t – μ c\| < δ | \|μ t – μ c\| ≥ δ | n c = (1 + 1/k)(z α + z β/2)² σ²/(δ – \|μ t – μ c\|)² | n t = k·n c |
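Under the assumptions above, the equality-test case can be sketched in Python, including the allocation ratio k, using the standard formula n c = (1 + 1/k)(z α/2 + z β)²σ²/δ² with n t = k·n c (a minimal sketch; the numeric inputs are illustrative, not from the paper):

```python
from math import ceil
from statistics import NormalDist

def n_equality(sigma, delta, k=1.0, alpha=0.05, power=0.90):
    """Control- and test-group sizes for the two-sided equality test (n_t = k * n_c)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    z_beta = z.inv_cdf(power)           # upper beta quantile
    n_c = ceil((1 + 1 / k) * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
    return n_c, ceil(k * n_c)

print(n_equality(sigma=10, delta=10))        # equal allocation (k = 1)
print(n_equality(sigma=10, delta=10, k=2))   # two test animals per control
```

With unequal allocation (k = 2), the control group shrinks but the total number of animals rises, which is why equal allocation is most efficient for a fixed total.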
Other approaches
There are several methods besides precision and power analysis for calculating the sample size, such as probability assessment and reproducibility probability. These concepts are beyond the scope of this paper.
Other formulae of sample size calculation
Sample size for dichotomous data
Fleiss [ 8 ] provided an equation to compare the proportions in the two groups. Let an outcome be an event of interest, such as the occurrence of a disease or death, and proposed the following hypothesis:
H 0 : p c – p t = 0 versus H 1 : p c – p t ≠ 0
p c = r c / N c , p t = r t / N t
r c : the number of outcomes in the control group
r t : the number of outcomes in the test group
N c : the total number of animals in the control group
N t : the total number of animals in the test group
The sample size per group (n) needed to achieve power 1 – β can be obtained from the following equation:

n = C [p c (1 – p c ) + p t (1 – p t )] / d² + 2/d + 2
d = | p c – p t |
C : a constant that depends on the values chosen for α and β; for a two-sided test, C = (z α/2 + z β)²
Table 4 can be used to obtain the solution for the above formula and shows sample sizes per arm based on given C values, significance levels, and power, assuming p c is 0.5 and p t is 0.25.
Estimated Sample Size per Arm Using the Formula Suggested by Fleiss
| α | 1 – β | C | p c | p t | d | n |
|---|---|---|---|---|---|---|
| 0.05 | 0.8 | 7.85 | 0.5 | 0.25 | 0.25 | 65 |
| 0.05 | 0.9 | 10.51 | 0.5 | 0.25 | 0.25 | 84 |
| 0.01 | 0.8 | 11.68 | 0.5 | 0.25 | 0.25 | 92 |
| 0.01 | 0.9 | 14.88 | 0.5 | 0.25 | 0.25 | 115 |
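A continuity-adjusted formula consistent with the values in Table 4 is n = C[p c(1 – p c) + p t(1 – p t)]/d² + 2/d + 2 with C = (z α/2 + z β)². A Python sketch (the exact constants are computed rather than taken from a C table, so results can differ from published tables by rounding):

```python
from math import ceil
from statistics import NormalDist

def fleiss_n(p_c, p_t, alpha=0.05, power=0.80):
    """Per-group n for comparing two proportions, with a continuity adjustment."""
    z = NormalDist()
    C = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2  # two-sided constant
    d = abs(p_c - p_t)                                       # difference to detect
    n = C * (p_c * (1 - p_c) + p_t * (1 - p_t)) / d ** 2 + 2 / d + 2
    return ceil(n)

# Reproduce the four rows of Table 4 (p_c = 0.5, p_t = 0.25):
for alpha, power in [(0.05, 0.8), (0.05, 0.9), (0.01, 0.8), (0.01, 0.9)]:
    print(alpha, power, fleiss_n(0.5, 0.25, alpha, power))
```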
Sample size for comparing two group means
Snedecor and Cochran [ 9 ] suggested a method for estimating sample size by comparing the mean difference between two groups. To show the mean difference between two groups in a parallel design, the following hypotheses are considered:
H 0 : μ c – μ t = 0 versus H 1 : μ c – μ t ≠ 0
μ c : population mean of the control group
μ t : population mean of the test group
Then, the sample size per group needed to achieve a power of 1 – β can be obtained from the following formula:

n = 1 + 2C (s/d)²
s : standard deviation
d : the difference to be detected
C : a constant that depends on the values chosen for α and β; for a two-sided test, C = (z α/2 + z β)²
Table 5 can be used to obtain the sample size per arm for the above formula and shows sample sizes per arm based on given C values, significance levels, and power, assuming s is 4 and d is 3.
Estimated Sample Size per Arm Using the Formula Suggested by Snedecor and Cochran
| α | 1 – β | C | s | d | n |
|---|---|---|---|---|---|
| 0.05 | 0.8 | 7.85 | 4 | 3 | 29 |
| 0.05 | 0.9 | 10.51 | 4 | 3 | 39 |
| 0.01 | 0.8 | 11.68 | 4 | 3 | 43 |
| 0.01 | 0.9 | 14.88 | 4 | 3 | 54 |
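The formula n = 1 + 2C(s/d)², which is consistent with the values in Table 5, can be sketched in Python; C is computed from the normal quantiles rather than looked up (a sketch, not the authors' code):

```python
from math import ceil
from statistics import NormalDist

def snedecor_two_groups(s, d, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two independent means."""
    z = NormalDist()
    C = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2  # two-sided constant
    return ceil(1 + 2 * C * (s / d) ** 2)

# Reproduce Table 5 (s = 4, d = 3):
for alpha, power in [(0.05, 0.8), (0.05, 0.9), (0.01, 0.8), (0.01, 0.9)]:
    print(alpha, power, snedecor_two_groups(4, 3, alpha, power))
```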
Sample size for paired studies
The equation suggested by Snedecor and Cochran [ 9 ] can be used when comparing values in paired studies. The following hypotheses are considered:
H 0 : μ before – μ after = 0 versus H 1 : μ before – μ after ≠ 0
The sample size (number of pairs) needed to achieve power 1 – β can be obtained using the following equation:

n = 2 + C (s/d)²

where s is the standard deviation of the within-pair differences and d is the difference to be detected.
Table 6 can be used to obtain the number of pairs for the above formula and shows sample sizes based on given C values, significance levels, and power, assuming s is 4 and d is 3.
Estimated Sample Size Using the Formula Suggested by Snedecor and Cochran
| α | 1 – β | C | s | d | n |
|---|---|---|---|---|---|
| 0.05 | 0.8 | 7.85 | 4 | 3 | 16 |
| 0.05 | 0.9 | 10.51 | 4 | 3 | 21 |
| 0.01 | 0.8 | 11.68 | 4 | 3 | 23 |
| 0.01 | 0.9 | 14.88 | 4 | 3 | 29 |
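The paired-study formula n = 2 + C(s/d)², which is consistent with the values in Table 6, can be sketched the same way (a sketch, not the authors' code; s here is the standard deviation of the within-pair differences):

```python
from math import ceil
from statistics import NormalDist

def snedecor_paired(s, d, alpha=0.05, power=0.80):
    """Number of pairs for a two-sided test of a mean within-pair change."""
    z = NormalDist()
    C = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2  # two-sided constant
    return ceil(2 + C * (s / d) ** 2)

# Reproduce Table 6 (s = 4, d = 3):
for alpha, power in [(0.05, 0.8), (0.05, 0.9), (0.01, 0.8), (0.01, 0.9)]:
    print(alpha, power, snedecor_paired(4, 3, alpha, power))
```

Note that pairing roughly halves the required n relative to the two-group design in Table 5, because each animal serves as its own control.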
Nonparametric
In many cases, parametric methods are used to estimate the sample size. However, sample size estimation can also be done in a nonparametric way when it is not possible to use large samples, as in animal studies. Nonparametric estimation is applicable when the sample size is small or when the assumption of normality is not guaranteed. In some animal studies, the assumption of normality may not be fulfilled, and in practice the primary assumptions about the underlying population may not be satisfied. In such cases, nonparametric methods can be considered for testing differences in location.
Fig. 1 shows a comparison of the statistical power calculated using parametric and nonparametric methods through 1,000 simulations, as the sample size increases from 1 to 30 in steps of 1. For (A) and (B), paired t-tests and independent two-sample t-tests are applied, respectively, which are parametric methods. For (C) and (D), Wilcoxon's signed rank test and Wilcoxon's rank sum test (Mann-Whitney's U test) are applied, respectively, which are nonparametric methods. The nonparametric methods in (C) and (D) correspond to the parametric methods in (A) and (B), respectively. All alternative hypotheses are two-sided tests for equality with a significance level of 0.05, and the power is calculated with PASS 2020 [10].
Comparison of parametric and nonparametric methods in sample size estimation. (A) parametric method: paired t-test with equal variance, (B) parametric method: Student's t-test with equal variance, (C) nonparametric method: Wilcoxon's signed rank test, and (D) nonparametric method: Wilcoxon's rank sum test; δ 0 (= 0) is the null difference, δ 1 (= 1) is the actual difference, μ 2 (= 0) is the true mean of group 2, σ (= 1) is the population standard deviation, N1 (= N2) is the number of items sampled from each population, and N is the total sample size.
Fig. 1 shows that the statistical power of the parametric and nonparametric methods becomes consistent as the sample size increases. When estimating the sample size using nonparametric methods, there are some practical issues, as the power under the alternative hypothesis has not been fully studied for some tests. However, these nonparametric approaches can be helpful in exploratory studies.
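Fig. 1's power curves were produced with PASS; the general idea of estimating power by simulation can be sketched in Python for the simplest parametric case, a two-sample z-test with known σ. This is my own illustration, not the paper's simulation (the nonparametric tests would need a rank-test implementation or scipy):

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulated_power(n, delta=1.0, sigma=1.0, alpha=0.05, sims=2000, seed=1):
    """Monte Carlo power of a two-sided two-sample z-test (sigma known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)  # standard error of the difference in means
    hits = 0
    for _ in range(sims):
        g1 = [rng.gauss(0.0, sigma) for _ in range(n)]    # null group
        g2 = [rng.gauss(delta, sigma) for _ in range(n)]  # shifted group
        if abs(mean(g2) - mean(g1)) / se > z_crit:
            hits += 1
    return hits / sims

# Power climbs as the per-group sample size grows, as in Fig. 1:
for n in (5, 10, 22):
    print(n, simulated_power(n))
```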
Software for calculating sample size
The sample size can be easily calculated using various formulae. However, it can be difficult to calculate directly using formulae, and computer algorithms may be used instead. In some cases, computer simulations can be used to determine an appropriate sample size. Some well-known software packages that researchers can use for clinical research are:
- Power Analysis and Sample Size (PASS) software: sample size tools for over 965 statistical tests and confidence interval scenarios.
- nQuery Advisor 7.0 (Ireland): sample size and power calculations.
- G*Power 3 (Faul, Erdfelder, Lang, & Buchner): a flexible statistical power analysis program for the social, behavioral, and biomedical sciences.
- SAS® version 9.4 (SAS Institute Inc., USA): the POWER and GLMPOWER procedures.
- R version 3.6.5 (R Foundation for Statistical Computing, Austria): the "pwr" package, which is free and open source.
- Sample Power (SPSS Inc., USA): a tool for estimating the sample size in various statistical studies.
More detailed information for comparing software for sample size determination can be found in a paper written by Dattalo [ 11 ].
There is much commercial and free software available on the Internet besides those mentioned above. It is also important for the user to check the accuracy and validity of the sample size, that is, whether it is appropriately calculated according to the study objectives and whether the algorithm or formula provided is accurate.
When estimating the sample size, several assumptions and conditions are defined before beginning the study. Most animal studies are in the pilot and exploratory phase; therefore, it may be difficult to predefine the sample size for the study. Additionally, the ethical issues in animal studies and the sample size calculation in accordance with the 3R principles should be fully reviewed for any animal study. Nevertheless, at the planning stage, calculation of the sample size plays a very important role in clarifying the intended objectives of the study. Sometimes, a trade-off may arise when estimating the sample size. Careful attention to experimental design and statistics before data collection is the key to a successful experiment when conducting an animal study.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Author Contributions
Mun Jung Ko (Conceptualization; Investigation; Methodology; Resources; Writing – review & editing)
Chi-Yeon Lim (Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Resources; Validation; Visualization; Writing – original draft; Writing – review & editing)
Power and Sample Size Determination
Lisa Sullivan, PhD
Professor of Biostatistics
Boston University School of Public Health
Introduction
A critically important aspect of any study is determining the appropriate sample size to answer the research question. This module will focus on formulas that can be used to estimate the sample size needed to produce a confidence interval estimate with a specified margin of error (precision) or to ensure that a test of hypothesis has a high probability of detecting a meaningful difference in the parameter.
Studies should be designed to include a sufficient number of participants to adequately address the research question. Studies that have either an inadequate number of participants or an excessively large number of participants are both wasteful in terms of participant and investigator time, resources to conduct the assessments, analytic efforts and so on. These situations can also be viewed as unethical as participants may have been put at risk as part of a study that was unable to answer an important question. Studies that are much larger than they need to be to answer the research questions are also wasteful.
The formulas presented here generate estimates of the necessary sample size(s) required based on statistical criteria. However, in many studies, the sample size is determined by financial or logistical constraints. For example, suppose a study is proposed to evaluate a new screening test for Down Syndrome. Suppose that the screening test is based on analysis of a blood sample taken from women early in pregnancy. In order to evaluate the properties of the screening test (e.g., the sensitivity and specificity), each pregnant woman will be asked to provide a blood sample and in addition to undergo an amniocentesis. The amniocentesis is included as the gold standard and the plan is to compare the results of the screening test to the results of the amniocentesis. Suppose that the collection and processing of the blood sample costs $250 per participant and that the amniocentesis costs $900 per participant. These financial constraints alone might substantially limit the number of women that can be enrolled. Just as it is important to consider both statistical and clinical significance when interpreting results of a statistical analysis, it is also important to weigh both statistical and logistical issues in determining the sample size for a study.
Learning Objectives
After completing this module, the student will be able to:
- Provide examples demonstrating how the margin of error, effect size and variability of the outcome affect sample size computations.
- Compute the sample size required to estimate population parameters with precision.
- Interpret statistical power in tests of hypothesis.
- Compute the sample size required to ensure high power when hypothesis testing.
Issues in Estimating Sample Size for Confidence Interval Estimates
The module on confidence intervals provided methods for estimating confidence intervals for various parameters (e.g., μ, p, (μ1 - μ2), μd, (p1 - p2)). Confidence intervals for every parameter take the following general form:
Point Estimate ± Margin of Error
In the module on confidence intervals we derived the formula for the confidence interval for μ as X̄ ± Z(σ/√n).
In practice we use the sample standard deviation to estimate the population standard deviation. Note that there is an alternative formula for estimating the mean of a continuous outcome in a single population, and it is used when the sample size is small (n<30). It involves a value from the t distribution, as opposed to one from the standard normal distribution, to reflect the desired level of confidence. When performing sample size computations, we use the large sample formula shown here. [Note: The resultant sample size might be small, and in the analysis stage, the appropriate confidence interval formula must be used.]
The point estimate for the population mean is the sample mean, and the margin of error is E = Z(σ/√n).
In planning studies, we want to determine the sample size needed to ensure that the margin of error is sufficiently small to be informative. For example, suppose we want to estimate the mean weight of female college students. We conduct a study and generate a 95% confidence interval as follows: 125 ± 40 pounds, or 85 to 165 pounds. The margin of error is so wide that the confidence interval is uninformative. To be informative, an investigator might want the margin of error to be no more than 5 or 10 pounds (meaning that the 95% confidence interval would have a width (lower limit to upper limit) of 10 or 20 pounds). In order to determine the sample size needed, the investigator must specify the desired margin of error. It is important to note that this is not a statistical issue, but a clinical or a practical one. For example, suppose we want to estimate the mean birth weight of infants born to mothers who smoke cigarettes during pregnancy. Birth weights in infants clearly have a much more restricted range than weights of female college students. Therefore, we would probably want to generate a confidence interval for the mean birth weight that has a margin of error not exceeding 1 or 2 pounds.
The margin of error in the one sample confidence interval for μ can be written as follows:
E = Z(σ/√n)
Our goal is to determine the sample size, n, that ensures that the margin of error, E, does not exceed a specified value. We can take the formula above and, with some algebra, solve for n:
First, multiply both sides of the equation by the square root of n, then cancel out the square root of n from the numerator and denominator on the right side of the equation (since any number divided by itself is equal to 1). This leaves:
E√n = Zσ
Now divide both sides by E and cancel out E from the numerator and denominator on the left side. This leaves:
√n = Zσ/E
Finally, square both sides of the equation to get:
n = (Zσ/E)²
This formula generates the sample size, n, required to ensure that the margin of error, E, does not exceed a specified value. To solve for n, we must input Z, σ, and E:
- Z is the value from the table of probabilities of the standard normal distribution for the desired confidence level (e.g., Z = 1.96 for 95% confidence)
- E is the margin of error that the investigator specifies as important from a clinical or practical standpoint.
- σ is the standard deviation of the outcome of interest.
Sometimes it is difficult to estimate σ . When we use the sample size formula above (or one of the other formulas that we will present in the sections that follow), we are planning a study to estimate the unknown mean of a particular outcome variable in a population. It is unlikely that we would know the standard deviation of that variable. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study done in a different, but comparable, population. The sample size computation is not an application of statistical inference and therefore it is reasonable to use an appropriate estimate for the standard deviation. The estimate can be derived from a different study that was reported in the literature; some investigators perform a small pilot study to estimate the standard deviation. A pilot study usually involves a small number of participants (e.g., n=10) who are selected by convenience, as opposed to by random sampling. Data from the participants in the pilot study can be used to compute a sample standard deviation, which serves as a good estimate for σ in the sample size formula. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size is not too small.
Sample Size for One Sample, Continuous Outcome
In studies where the plan is to estimate the mean of a continuous outcome variable in a single population, the formula for determining sample size is given below:
n = (Zσ/E)²
where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), σ is the standard deviation of the outcome variable and E is the desired margin of error. The formula above generates the minimum number of subjects required to ensure that the margin of error in the confidence interval for μ does not exceed E .
An investigator wants to estimate the mean systolic blood pressure in children with congenital heart disease who are between the ages of 3 and 5. How many children should be enrolled in the study? The investigator plans on using a 95% confidence interval (so Z=1.96) and wants a margin of error of 5 units. The standard deviation of systolic blood pressure is unknown, but the investigators conduct a literature search and find that the standard deviation of systolic blood pressures in children with other cardiac defects is between 15 and 20. To estimate the sample size, we consider the larger standard deviation in order to obtain the most conservative (largest) sample size. The sample size is computed as n = (1.96 × 20/5)² = 61.5, round up to 62.
In order to ensure that the 95% confidence interval estimate of the mean systolic blood pressure in children between the ages of 3 and 5 with congenital heart disease is within 5 units of the true mean, a sample of size 62 is needed. [ Note : We always round up; the sample size formulas always generate the minimum number of subjects needed to ensure the specified precision.] Had we assumed a standard deviation of 15, the sample size would have been n=35. Because the estimates of the standard deviation were derived from studies of children with other cardiac defects, it would be advisable to use the larger standard deviation and plan for a study with 62 children. Selecting the smaller sample size could potentially produce a confidence interval estimate with a larger margin of error.
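The computation above is easy to script. The following is a minimal sketch in Python (the function name is illustrative, not from any particular library); note the round-up, since the formula gives the minimum number of subjects:

```python
import math

def sample_size_one_mean(z, sigma, e):
    """Minimum n so a CI for a single mean has margin of error <= e.
    Implements n = (z * sigma / e)**2, rounded up."""
    return math.ceil((z * sigma / e) ** 2)

# Example 1: 95% confidence (Z=1.96), margin of error E=5 units
n_conservative = sample_size_one_mean(1.96, sigma=20, e=5)  # larger SD -> larger n
n_optimistic = sample_size_one_mean(1.96, sigma=15, e=5)    # smaller SD -> smaller n
print(n_conservative, n_optimistic)
```

Running this reproduces the two sample sizes discussed above: 62 with the conservative standard deviation of 20, and 35 with the standard deviation of 15.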
An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams? Try to work through the calculation before you look at the answer.
Sample Size for One Sample, Dichotomous Outcome
In studies where the plan is to estimate the proportion of successes in a dichotomous outcome variable (yes/no) in a single population, the formula for determining sample size is:
n = p(1 - p)(Z/E)²
where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), E is the desired margin of error, and p is the proportion of successes in the population. Here we are planning a study to generate a 95% confidence interval for the unknown population proportion, p. The equation to determine the sample size seems to require knowledge of p, but this is a circular argument, because if we knew the proportion of successes in the population, then a study would not be necessary! What we really need is an approximate or anticipated value of p. The range of p is 0 to 1, and therefore the range of p(1-p) is 0 to 0.25; the value of p that maximizes p(1-p) is p=0.5. Consequently, if there is no information available to approximate p, then p=0.5 can be used to generate the most conservative, or largest, sample size.
Example 2:
An investigator wants to estimate the proportion of freshmen at his University who currently smoke cigarettes (i.e., the prevalence of smoking). How many freshmen should be involved in the study to ensure that a 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion?
Because we have no information on the proportion of freshmen who smoke, we use 0.5 to estimate the sample size as follows: n = 0.5(1 - 0.5)(1.96/0.05)² = 384.2, round up to 385.
In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 385 is needed.
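A short sketch of the same computation in Python (illustrative function name), showing how the conservative choice p = 0.5 drives the sample size:

```python
import math

def sample_size_one_proportion(z, p, e):
    """Minimum n so a CI for a single proportion has margin of error <= e.
    Implements n = p(1-p) * (z/e)**2, rounded up."""
    return math.ceil(p * (1 - p) * (z / e) ** 2)

# No prior information on p, so use the conservative value p = 0.5
n = sample_size_one_proportion(1.96, p=0.5, e=0.05)
print(n)
```

With p = 0.5 this returns 385, matching the worked example; any other value of p produces a smaller n.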
Suppose that a similar study was conducted 2 years ago and found that the prevalence of smoking was 27% among freshmen. If the investigator believes that this is a reasonable estimate of prevalence 2 years later, it can be used to plan the next study. Using this estimate of p, what sample size is needed (assuming that again a 95% confidence interval will be used and we want the same level of precision)?
An investigator wants to estimate the prevalence of breast cancer among women who are between 40 and 45 years of age living in Boston. How many women must be involved in the study to ensure that the estimate is precise? National data suggest that 1 in 235 women are diagnosed with breast cancer by age 40. This translates to a proportion of 0.0043 (0.43%) or a prevalence of 43 per 10,000 women. Suppose the investigator wants the estimate to be within 10 per 10,000 women with 95% confidence. The sample size is computed as follows: n = 0.0043(1 - 0.0043)(1.96/0.0010)² = 16,447.9, round up to 16,448.
A sample of size n=16,448 will ensure that a 95% confidence interval estimate of the prevalence of breast cancer is within 0.0010 (or within 10 women per 10,000) of its true value. This is a situation where investigators might decide that a sample of this size is not feasible. Suppose that the investigators thought a sample of size 5,000 would be reasonable from a practical point of view. How precisely can we estimate the prevalence with a sample of size n=5,000? Recall that the confidence interval formula to estimate prevalence is: p̂ ± Z √(p̂(1 - p̂)/n).
Assuming that the prevalence of breast cancer in the sample will be close to that based on national data, we would expect the margin of error to be approximately equal to the following: E = 1.96 √(0.0043(1 - 0.0043)/5,000) = 0.0018.
Thus, with n=5,000 women, a 95% confidence interval would be expected to have a margin of error of 0.0018 (or 18 per 10,000). The investigators must decide if this would be sufficiently precise to answer the research question. Note that the above is based on the assumption that the prevalence of breast cancer in Boston is similar to that reported nationally. This may or may not be a reasonable assumption. In fact, it is the objective of the current study to estimate the prevalence in Boston. The research team, with input from clinical investigators and biostatisticians, must carefully evaluate the implications of selecting a sample of size n = 5,000, n = 16,448 or any size in between.
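Both directions of this trade-off (required n for a target margin of error, and achievable margin of error for a fixed n) can be checked with a few lines of Python; the variable names here are illustrative:

```python
import math

z, p = 1.96, 0.0043  # 95% confidence; national prevalence estimate

# Direction 1: sample size needed for margin of error E = 0.0010 (10 per 10,000)
n_needed = math.ceil(p * (1 - p) * (z / 0.0010) ** 2)

# Direction 2: margin of error achievable with a feasible fixed sample, n = 5,000
e_5000 = z * math.sqrt(p * (1 - p) / 5000)
print(n_needed, e_5000)
```

This reproduces the two figures above: n = 16,448 for a margin of error of 10 per 10,000, versus a margin of error of about 0.0018 (18 per 10,000) with n = 5,000.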
Sample Sizes for Two Independent Samples, Continuous Outcome
In studies where the plan is to estimate the difference in means between two independent populations, the formula for determining the sample sizes required in each comparison group is given below:
ni = 2(Zσ/E)²
where ni is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used, and E is the desired margin of error. σ again reflects the standard deviation of the outcome variable. Recall from the module on confidence intervals that, when we generated a confidence interval estimate for the difference in means, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome (based on pooling the data), where Sp is computed as follows:
Sp = √(((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2))
If data are available on variability of the outcome in each comparison group, then Sp can be computed and used in the sample size formula. However, it is more often the case that data on the variability of the outcome are available from only one group, often the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated below.
Note that the formula for the sample size generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used.
An investigator wants to plan a clinical trial to evaluate the efficacy of a new drug designed to increase HDL cholesterol (the "good" cholesterol). The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. HDL cholesterol will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study over 12 weeks. A 95% confidence interval will be estimated to quantify the difference in mean HDL levels between patients taking the new drug as compared to placebo. The investigator would like the margin of error to be no more than 3 units. How many patients should be recruited into the study?
The sample sizes are computed as follows:
A major issue is determining the variability in the outcome of interest (σ), here the standard deviation of HDL cholesterol. To plan this study, we can use data from the Framingham Heart Study. In participants who attended the seventh examination of the Offspring Study and were not on treatment for high cholesterol, the standard deviation of HDL cholesterol is 17.1. We will use this value and the other inputs to compute the sample sizes as follows: ni = 2(1.96 × 17.1/3)² = 249.6, round up to 250.
Samples of size n 1 =250 and n 2 =250 will ensure that the 95% confidence interval for the difference in mean HDL levels will have a margin of error of no more than 3 units. Again, these sample sizes refer to the numbers of participants with complete data. The investigators hypothesized a 10% attrition (or drop-out) rate (in both groups). In order to ensure that the total sample size of 500 is available at 12 weeks, the investigator needs to recruit more participants to allow for attrition.
N (number to enroll) * (% retained) = desired sample size
Therefore N (number to enroll) = desired sample size/(% retained)
N = 500/0.90 = 556
If they anticipate a 10% attrition rate, the investigators should enroll 556 participants. This will ensure N=500 with complete data at the end of the trial.
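The per-group sample size and the attrition adjustment can be combined in one short Python sketch (function and variable names are illustrative):

```python
import math

def sample_size_two_means(z, sigma, e):
    """Minimum n per group so a CI for a difference in means has
    margin of error <= e. Implements n_i = 2 * (z*sigma/e)**2, rounded up."""
    return math.ceil(2 * (z * sigma / e) ** 2)

# HDL trial: Z=1.96, sigma=17.1 (Framingham), E=3 units
n_per_group = sample_size_two_means(1.96, sigma=17.1, e=3)

# Inflate total enrollment for an anticipated 10% attrition rate:
# (number enrolled) * (% retained) = desired sample size
n_to_enroll = math.ceil(2 * n_per_group / 0.90)
print(n_per_group, n_to_enroll)
```

This reproduces the numbers above: 250 per group with complete data, so 556 participants enrolled to allow for 10% loss to follow-up.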
An investigator wants to compare two diet programs in children who are obese. One diet is a low fat diet, and the other is a low carbohydrate diet. The plan is to enroll children and weigh them at the start of the study. Each child will then be randomly assigned to either the low fat or the low carbohydrate diet. Each child will follow the assigned diet for 8 weeks, at which time they will again be weighed. The number of pounds lost will be computed for each child. Based on data reported from diet trials in adults, the investigator expects that 20% of all children will not complete the study. A 95% confidence interval will be estimated to quantify the difference in weight lost between the two diets and the investigator would like the margin of error to be no more than 3 pounds. How many children should be recruited into the study?
Again the issue is determining the variability in the outcome of interest (σ), here the standard deviation in pounds lost over 8 weeks. To plan this study, investigators use data from a published study in adults. Suppose one such study compared the same diets in adults and involved 100 participants in each diet group. The study reported a standard deviation in weight lost over 8 weeks on a low fat diet of 8.4 pounds and a standard deviation in weight lost over 8 weeks on a low carbohydrate diet of 7.7 pounds. These data can be used to estimate the common standard deviation in weight lost as follows: Sp = √((99(8.4)² + 99(7.7)²)/(100 + 100 - 2)) = √64.9 = 8.06.
We now use this value and the other inputs to compute the sample sizes: ni = 2(1.96 × 8.06/3)² = 55.5, round up to 56.
Samples of size n 1 =56 and n 2 =56 will ensure that the 95% confidence interval for the difference in weight lost between diets will have a margin of error of no more than 3 pounds. Again, these sample sizes refer to the numbers of children with complete data. The investigators anticipate a 20% attrition rate. In order to ensure that the total sample size of 112 is available at 8 weeks, the investigator needs to recruit more participants to allow for attrition.
N = 112/0.80 = 140
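The pooled standard deviation and the resulting sample sizes for this example can be verified with the following Python sketch (variable names are illustrative):

```python
import math

# Pooled standard deviation from the published adult study (100 per diet group)
n1, s1 = 100, 8.4  # low fat diet: n and SD of pounds lost
n2, s2 = 100, 7.7  # low carbohydrate diet: n and SD of pounds lost
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Per-group sample size for a 95% CI with margin of error E = 3 pounds
n_per_group = math.ceil(2 * (1.96 * sp / 3) ** 2)

# Inflate total enrollment for an anticipated 20% attrition rate
n_to_enroll = math.ceil(2 * n_per_group / 0.80)
print(round(sp, 2), n_per_group, n_to_enroll)
```

This reproduces Sp ≈ 8.06, 56 children per diet group with complete data, and 140 children to recruit.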
Sample Size for Matched Samples, Continuous Outcome
In studies where the plan is to estimate the mean difference of a continuous outcome based on matched data, the formula for determining sample size is given below:
n = (Zσd/E)²
where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), E is the desired margin of error, and σ d is the standard deviation of the difference scores. It is extremely important that the standard deviation of the difference scores (e.g., the difference based on measurements over time or the difference between matched pairs) is used here to appropriately estimate the sample size.
Sample Sizes for Two Independent Samples, Dichotomous Outcome
In studies where the plan is to estimate the difference in proportions between two independent populations (i.e., to estimate the risk difference), the formula for determining the sample sizes required in each comparison group is:
ni = [p1(1 - p1) + p2(1 - p2)](Z/E)²
where n i is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), and E is the desired margin of error. p 1 and p 2 are the proportions of successes in each comparison group. Again, here we are planning a study to generate a 95% confidence interval for the difference in unknown proportions, and the formula to estimate the sample sizes needed requires p 1 and p 2 . In order to estimate the sample size, we need approximate values of p 1 and p 2 . The values of p 1 and p 2 that maximize the sample size are p 1 =p 2 =0.5. Thus, if there is no information available to approximate p 1 and p 2 , then 0.5 can be used to generate the most conservative, or largest, sample sizes.
Similar to the situation for two independent samples and a continuous outcome at the top of this page, it may be the case that data are available on the proportion of successes in one group, usually the untreated (e.g., placebo control) or unexposed group. If so, the known proportion can be used for both p 1 and p 2 in the formula shown above. The formula shown above generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used. Interested readers can see Fleiss for more details. 4
An investigator wants to estimate the impact of smoking during pregnancy on premature delivery. Normal pregnancies last approximately 40 weeks and premature deliveries are those that occur before 37 weeks. The 2005 National Vital Statistics report indicates that approximately 12% of infants are born prematurely in the United States. 5 The investigator plans to collect data through medical record review and to generate a 95% confidence interval for the difference in proportions of infants born prematurely to women who smoked during pregnancy as compared to those who did not. How many women should be enrolled in the study to ensure that the 95% confidence interval for the difference in proportions has a margin of error of no more than 4%?
The sample sizes (i.e., numbers of women who smoked and did not smoke during pregnancy) can be computed using the formula shown above. National data suggest that 12% of infants are born prematurely. We will use that estimate for both groups in the sample size computation: ni = [0.12(0.88) + 0.12(0.88)](1.96/0.04)² = 507.1, round up to 508.
Samples of size n 1 =508 women who smoked during pregnancy and n 2 =508 women who did not smoke during pregnancy will ensure that the 95% confidence interval for the difference in proportions who deliver prematurely will have a margin of error of no more than 4%.
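A brief Python sketch of the two-proportion sample size formula (the function name is illustrative); using the same planning value of p in both groups, as in the example:

```python
import math

def sample_size_two_proportions(z, p1, p2, e):
    """Minimum n per group so a CI for a risk difference has margin of
    error <= e. Implements n_i = [p1(1-p1) + p2(1-p2)] * (z/e)**2, rounded up."""
    return math.ceil((p1 * (1 - p1) + p2 * (1 - p2)) * (z / e) ** 2)

# Premature delivery example: p1 = p2 = 0.12 (national data), E = 0.04
n_per_group = sample_size_two_proportions(1.96, 0.12, 0.12, 0.04)
print(n_per_group)
```

This returns 508 per group, matching the result above.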
Is attrition an issue here?
Issues in Estimating Sample Size for Hypothesis Testing
In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H0 | H0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error, defined as the probability that we do not reject H0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H0 | H0 is false). In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H0 when it is false, i.e., power = 1 - β = P(Reject H0 | H0 is false). Power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).
Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α; the desired power of the test (equivalent to 1 - β); the variability of the outcome; and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.
The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.
Suppose we want to test the following hypotheses at α=0.05: H0: μ = 90 versus H1: μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see page 11 in the module on Probability) that for large n (here n=100 is sufficiently large), the distribution of the sample means is approximately normal, with mean μ = 90 and standard deviation σ/√n = 20/√100 = 2.
If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H 0 : μ = 90.
Rejection Region for Test H 0 : μ = 90 versus H 1 : μ ≠ 90 at α =0.05
The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing.
Now, suppose that the alternative hypothesis, H1, is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.
If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H0 | H0 is false), i.e., the probability of not rejecting the null hypothesis when the null hypothesis is in fact false. β is shown in the figure above as the area under the rightmost curve (H1) to the left of the vertical line (where we do not reject H0). Power is defined as 1 - β = P(Reject H0 | H0 is false) and is shown in the figure as the area under the rightmost curve (H1) to the right of the vertical line (where we reject H0).
Note that β and power are related to α, the variability of the outcome and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α = 0.10. The upper critical value would be 93.29 (= 90 + 1.645 × 2) instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).
β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.
Notice that there is much higher power when there is a larger difference between the mean under H 0 as compared to H 1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H 0 : μ = 90 and H 1 : μ = 94, if we observed a sample mean of 93, for example, it would not be as clear as to whether it came from a distribution whose mean is 90 or one whose mean is 94.
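The power values illustrated in these figures can be computed directly from the normal distribution. This Python sketch evaluates power against the upper critical value only, which is a good approximation here because the lower-tail rejection probability is negligible when the true mean is above 90 (function and variable names are illustrative):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu0, sigma, n = 90, 20, 100
se = sigma / math.sqrt(n)        # standard error of the sample mean = 2
upper_crit = mu0 + 1.96 * se     # upper critical value, 93.92

# Power = P(sample mean > 93.92) when the true mean is 94 or 98
power_94 = 1 - normal_cdf((upper_crit - 94) / se)
power_98 = 1 - normal_cdf((upper_crit - 98) / se)
print(round(power_94, 3), round(power_98, 3))
```

With a true mean of 94 the power is only about 0.52, while with a true mean of 98 it is about 0.98, quantifying why a larger effect size is so much easier to detect.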
Ensuring That a Test Has High Power
In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.
The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.
In studies where the plan is to perform a test of hypothesis comparing the mean of a continuous outcome variable in a single population to a known mean, the hypotheses of interest are:
H0: μ = μ0 and H1: μ ≠ μ0, where μ0 is the known mean (e.g., a historical control). The formula for determining sample size to ensure that the test has a specified power is given below:
n = ((Z1-α/2 + Z1-β)/ES)²
where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1 - α/2 below it. For example, if α = 0.05, then 1 - α/2 = 0.975 and Z = 1.960. 1 - β is the selected power, and Z1-β is the value from the standard normal distribution holding 1 - β below it. Sample size estimates for hypothesis testing are often based on achieving 80% or 90% power. The Z1-β values for these popular scenarios are given below:
- For 80% power, Z0.80 = 0.84
- For 90% power, Z0.90 = 1.282
ES is the effect size, defined as follows:
ES = |μ1 - μ0|/σ
where μ 0 is the mean under H 0 , μ 1 is the mean under H 1 and σ is the standard deviation of the outcome of interest. The numerator of the effect size, the absolute value of the difference in means | μ 1 - μ 0 |, represents what is considered a clinically meaningful or practically important difference in means. Similar to the issue we faced when planning studies to estimate confidence intervals, it can sometimes be difficult to estimate the standard deviation. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study performed in a different but comparable population. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size will not be too small.
Example 7:
An investigator hypothesizes that in people free of diabetes, fasting blood glucose, a risk factor for coronary heart disease, is higher in those who drink at least 2 cups of coffee per day. A cross-sectional study is planned to assess the mean fasting blood glucose levels in people who drink at least two cups of coffee per day. The mean fasting blood glucose level in people free of diabetes is reported as 95.0 mg/dL with a standard deviation of 9.8 mg/dL. 7 If the mean blood glucose level in people who drink at least 2 cups of coffee per day is 100 mg/dL, this would be important clinically. How many patients should be enrolled in the study to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.
The effect size is computed as: ES = |100 - 95|/9.8 = 0.51.
The effect size represents the meaningful difference in the population mean, here 95 versus 100, or a difference of 0.51 standard deviation units. We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n = ((1.96 + 0.84)/0.51)² = 30.1, round up to 31.
Therefore, a sample of size n=31 will ensure that a two-sided test with α =0.05 has 80% power to detect a 5 mg/dL difference in mean fasting blood glucose levels.
In the planned study, participants will be asked to fast overnight and to provide a blood sample for analysis of glucose levels. Based on prior experience, the investigators hypothesize that 10% of the participants will fail to fast or will refuse to follow the study protocol. Therefore, a total of 35 participants will be enrolled in the study to ensure that 31 are available for analysis (see below).
N (number to enroll) * (% following protocol) = desired sample size
N = 31/0.90 = 35.
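The arithmetic of Example 7 can be reproduced in a few lines of Python. This is our illustrative sketch, not part of the policy: the function name is ours, and the hard-coded Z values and two-decimal rounding of ES follow the text.

```python
import math

def sample_size_one_mean(mu0, mu1, sd, z_alpha=1.960, z_beta=0.84):
    """n = ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2 with ES = |mu1 - mu0| / sd.

    Defaults give a two-sided test at alpha = 0.05 with 80% power.
    """
    es = round(abs(mu1 - mu0) / sd, 2)  # the text rounds ES to 2 decimals (0.51)
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

n = sample_size_one_mean(95.0, 100.0, 9.8)  # Example 7: n = 31
enroll = math.ceil(n / 0.90)                # allow for 10% protocol failures: 35
```

Rounding the sample size is always done upward, since rounding down would leave the study slightly underpowered.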
Sample Size for One Sample, Dichotomous Outcome
In studies where the plan is to perform a test of hypothesis comparing the proportion of successes in a dichotomous outcome variable in a single population to a known proportion, the hypotheses of interest are:

H0: p = p0 versus H1: p ≠ p0

where p0 is the known proportion (e.g., a historical control). The formula for determining the sample size to ensure that the test has a specified power is given below:

n = ((Z1-α/2 + Z1-β) / ES)²
where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it, and ES is the effect size, defined as follows:

ES = |p1 − p0| / √(p0(1 − p0))

where p0 is the proportion under H0 and p1 is the proportion under H1. The numerator of the effect size, the absolute value of the difference in proportions |p1 − p0|, again represents what is considered a clinically meaningful or practically important difference in proportions.
Example 8:
A recent report from the Framingham Heart Study indicated that 26% of people free of cardiovascular disease had elevated LDL cholesterol levels, defined as LDL > 159 mg/dL. 9 An investigator hypothesizes that a higher proportion of patients with a history of cardiovascular disease will have elevated LDL cholesterol. How many patients should be studied to ensure that the power of the test is 90% to detect a 5% difference in the proportion with elevated LDL cholesterol? A two sided test will be used with a 5% level of significance.
We first compute the effect size:

ES = |p1 − p0| / √(p0(1 − p0)) = |0.31 − 0.26| / √(0.26(0.74)) = 0.11

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((Z1-α/2 + Z1-β) / ES)² = ((1.960 + 1.282) / 0.11)² = 868.6, round up to 869.
A sample of size n=869 will ensure that a two-sided test with α =0.05 has 90% power to detect a 5% difference in the proportion of patients with a history of cardiovascular disease who have an elevated LDL cholesterol level.
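Example 8 follows the same pattern with the dichotomous-outcome effect size. The sketch below is ours, with the text's Z values (90% power) and two-decimal rounding of ES hard-coded.

```python
import math

def sample_size_one_proportion(p0, p1, z_alpha=1.960, z_beta=1.282):
    """n = ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2 with
    ES = |p1 - p0| / sqrt(p0 (1 - p0)).

    Defaults give a two-sided test at alpha = 0.05 with 90% power.
    """
    es = round(abs(p1 - p0) / math.sqrt(p0 * (1 - p0)), 2)  # text rounds ES to 0.11
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

n = sample_size_one_proportion(0.26, 0.31)  # Example 8: n = 869
```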
A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance. (Do the computation yourself, before looking at the answer.)
Sample Sizes for Two Independent Samples, Continuous Outcome

In studies where the plan is to perform a test of hypothesis comparing the means of a continuous outcome variable in two independent populations, the hypotheses of interest are:

H0: μ1 = μ2 versus H1: μ1 ≠ μ2

where μ1 and μ2 are the means in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is:

ni = 2((Z1-α/2 + Z1-β) / ES)²
where ni is the sample size required in each group (i=1,2), α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, and 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it. ES is the effect size, defined as:

ES = |μ1 − μ2| / σ

where |μ1 − μ2| is the absolute value of the difference in means between the two groups expected under the alternative hypothesis, H1, and σ is the standard deviation of the outcome of interest. Recall from the module on Hypothesis Testing that, when we performed tests of hypothesis comparing the means of two independent groups, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome.
Sp is computed as follows:

Sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )
If data are available on variability of the outcome in each comparison group, then Sp can be computed and used to generate the sample sizes. However, it is more often the case that data on the variability of the outcome are available from only one group, usually the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that may have involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated.
Note also that the formula shown above generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used (see Howell 3 for more details).
Example 9:

An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure. The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this would represent a clinically meaningful reduction. How many patients should be enrolled in the trial to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.
In order to compute the effect size, an estimate of the variability in systolic blood pressures is needed. Analysis of data from the Framingham Heart Study showed that the standard deviation of systolic blood pressure was 19.0. This value can be used to plan the trial.
The effect size is:

ES = |μ1 − μ2| / σ = 5 / 19.0 = 0.26

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

ni = 2((1.960 + 0.84) / 0.26)² = 231.9, round up to 232.
Samples of size n1 = 232 and n2 = 232 will ensure that the test of hypothesis will have 80% power to detect a 5 unit difference in mean systolic blood pressures in patients receiving the new drug as compared to patients receiving the placebo. However, the investigators hypothesized a 10% attrition rate (in both groups), and to ensure that 232 participants per group are available for analysis they need to allow for attrition:

N (number to enroll) × (% retained) = desired sample size per group

N = 232/0.90 = 258.

The investigator must enroll 258 participants per group, or 516 in total, to be randomly assigned to receive either the new drug or placebo.
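The blood pressure trial arithmetic, including the attrition adjustment, can be checked with this short Python sketch (ours, not part of the policy; Z values and the two-decimal rounding of ES follow the text):

```python
import math

def sample_size_two_means(mean_diff, sd, z_alpha=1.960, z_beta=0.84):
    """Per-group n = 2 ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2,
    with ES = |mu1 - mu2| / sigma.

    Defaults give a two-sided test at alpha = 0.05 with 80% power.
    """
    es = round(abs(mean_diff) / sd, 2)  # text rounds ES to 2 decimals (0.26)
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

n_per_group = sample_size_two_means(5, 19.0)      # 232 per group
enroll_per_group = math.ceil(n_per_group / 0.90)  # allow for 10% attrition: 258
```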
An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.
(Answer below.)
Sample Size for Matched Samples, Continuous Outcome

In studies where the plan is to perform a test of hypothesis on the mean difference in a continuous outcome variable based on matched data, the hypotheses of interest are:

H0: μd = 0 versus H1: μd ≠ 0

where μd is the mean difference in the population. The formula for determining the sample size to ensure that the test has a specified power is given below:

n = ((Z1-α/2 + Z1-β) / ES)²
where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it, and ES is the effect size, defined as follows:

ES = μd / σd

where μd is the mean difference expected under the alternative hypothesis, H1, and σd is the standard deviation of the difference in the outcome (e.g., the difference based on measurements over time or the difference between matched pairs).
Example 10:
An investigator wants to evaluate the efficacy of an acupuncture treatment for reducing pain in patients with chronic migraine headaches. The plan is to enroll patients who suffer from migraine headaches. Each will be asked to rate the severity of the pain they experience with their next migraine before any treatment is administered. Pain will be recorded on a scale of 1-100 with higher scores indicative of more severe pain. Each patient will then undergo the acupuncture treatment. On their next migraine (post-treatment), each patient will again be asked to rate the severity of the pain. The difference in pain will be computed for each patient. A two sided test of hypothesis will be conducted, at α =0.05, to assess whether there is a statistically significant difference in pain scores before and after treatment. How many patients should be involved in the study to ensure that the test has 80% power to detect a difference of 10 units on the pain scale? Assume that the standard deviation in the difference scores is approximately 20 units.
First compute the effect size:

ES = μd / σd = 10 / 20 = 0.50

Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.960 + 0.84) / 0.50)² = 31.4, round up to 32.
A sample of size n=32 patients with migraine will ensure that a two-sided test with α =0.05 has 80% power to detect a mean difference of 10 points in pain before and after treatment, assuming that all 32 patients complete the treatment.
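Example 10 uses the one-sample formula applied to the within-pair differences. A minimal sketch (ours), with the text's Z values:

```python
import math

def sample_size_matched(mean_diff, sd_diff, z_alpha=1.960, z_beta=0.84):
    """n = ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2 with ES = mu_d / sigma_d.

    The one-sample formula applied to the paired differences;
    defaults give a two-sided test at alpha = 0.05 with 80% power.
    """
    es = round(abs(mean_diff) / sd_diff, 2)  # ES = 10 / 20 = 0.50
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

n = sample_size_matched(10, 20)  # Example 10: n = 32
```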
Sample Sizes for Two Independent Samples, Dichotomous Outcomes
In studies where the plan is to perform a test of hypothesis comparing the proportions of successes in two independent populations, the hypotheses of interest are:
H 0 : p 1 = p 2 versus H 1 : p 1 ≠ p 2
where p1 and p2 are the proportions in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is given below:

ni = 2((Z1-α/2 + Z1-β) / ES)²

where ni is the sample size required in each group (i=1,2), α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it, and 1-β is the selected power and Z1-β is the value from the standard normal distribution holding 1-β below it. ES is the effect size, defined as follows:

ES = |p1 − p2| / √(p(1 − p))

where |p1 − p2| is the absolute value of the difference in proportions between the two groups expected under the alternative hypothesis, H1, and p is the overall proportion, based on pooling the data from the two comparison groups (p can be computed by taking the mean of the proportions in the two comparison groups, assuming that the groups will be of approximately equal size).
Example 11:
An investigator hypothesizes that there is a higher incidence of flu among students who use their athletic facility regularly than their counterparts who do not. The study will be conducted in the spring. Each student will be asked if they used the athletic facility regularly over the past 6 months and whether or not they had the flu. A test of hypothesis will be conducted to compare the proportion of students who used the athletic facility regularly and got flu with the proportion of students who did not and got flu. During a typical year, approximately 35% of the students experience flu. The investigators feel that a 30% increase in flu among those who used the athletic facility regularly would be clinically meaningful. How many students should be enrolled in the study to ensure that the power of the test is 80% to detect this difference in the proportions? A two sided test will be used with a 5% level of significance.
We first compute the effect size by substituting the proportions of students in each group who are expected to develop flu, p1 = 0.46 (i.e., 0.35 × 1.30 = 0.46) and p2 = 0.35, and the overall proportion, p = 0.41 (i.e., (0.46 + 0.35)/2):

ES = |0.46 − 0.35| / √(0.41(0.59)) = 0.22

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

ni = 2((1.960 + 0.84) / 0.22)² = 323.97, round up to 324.
Samples of size n 1 =324 and n 2 =324 will ensure that the test of hypothesis will have 80% power to detect a 30% difference in the proportions of students who develop flu between those who do and do not use the athletic facilities regularly.
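Example 11's arithmetic can be reproduced with the sketch below (ours, not part of the policy; Z values and two-decimal rounding of ES follow the text):

```python
import math

def sample_size_two_proportions(p1, p2, z_alpha=1.960, z_beta=0.84):
    """Per-group n = 2 ((Z_{1-alpha/2} + Z_{1-beta}) / ES)^2,
    with ES = |p1 - p2| / sqrt(p (1 - p)) and p the mean of p1 and p2.

    Defaults give a two-sided test at alpha = 0.05 with 80% power.
    """
    p = (p1 + p2) / 2                                     # pooled proportion
    es = round(abs(p1 - p2) / math.sqrt(p * (1 - p)), 2)  # text rounds ES to 0.22
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

n_per_group = sample_size_two_proportions(0.46, 0.35)  # Example 11: 324 per group
```

The same function reproduces the donor feces calculation later in the document: `sample_size_two_proportions(0.9, 0.6)` gives 33 per group.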
Donor Feces? Really?

Clostridium difficile (also referred to as "C. difficile" or "C. diff.") is a bacterial species that can be found in the colon of humans, although its numbers are kept in check by other normal flora in the colon. Antibiotic therapy sometimes diminishes the normal flora in the colon to the point that C. difficile flourishes and causes infection with symptoms ranging from diarrhea to life-threatening inflammation of the colon. Illness from C. difficile most commonly affects older adults in hospitals or in long term care facilities and typically occurs after use of antibiotic medications. In recent years, C. difficile infections have become more frequent, more severe and more difficult to treat. Ironically, C. difficile is first treated by discontinuing antibiotics, if they are still being prescribed. If that is unsuccessful, the infection has been treated by switching to another antibiotic. However, treatment with another antibiotic frequently does not cure the C. difficile infection. There have been sporadic reports of successful treatment by infusing feces from healthy donors into the duodenum of patients suffering from C. difficile. (Yuk!) This re-establishes the normal microbiota in the colon, and counteracts the overgrowth of C. diff. The efficacy of this approach was tested in a randomized clinical trial reported in the New England Journal of Medicine (Jan. 2013). The investigators planned to randomly assign patients with recurrent C. difficile infection to either antibiotic therapy or to duodenal infusion of donor feces. In order to estimate the sample size that would be needed, the investigators assumed that the feces infusion would be successful 90% of the time, and antibiotic therapy would be successful in 60% of cases. How many subjects will be needed in each group to ensure that the power of the study is 80% with a level of significance α = 0.05?
Determining the appropriate design of a study is more important than the statistical analysis; a poorly designed study can never be salvaged, whereas a poorly analyzed study can be re-analyzed. A critical component in study design is the determination of the appropriate sample size. The sample size must be large enough to adequately answer the research question, yet not too large so as to involve too many patients when fewer would have sufficed. The determination of the appropriate sample size involves statistical criteria as well as clinical or practical considerations. Sample size determination involves teamwork; biostatisticians must work closely with clinical investigators to determine the sample size that will address the research question of interest with adequate precision or power to produce results that are clinically meaningful.
The following table summarizes the sample size formulas for each scenario described here. The formulas are organized by the proposed analysis, a confidence interval estimate or a test of hypothesis.
Situation | Sample Size for a Confidence Interval Estimate | Sample Size for a Test of Hypothesis |
---|---|---|
Continuous Outcome, One Sample: CI for μ, H0: μ = μ0 | n = (Zσ/E)² | n = ((Z1-α/2 + Z1-β)/ES)², ES = \|μ1 − μ0\|/σ |
Continuous Outcome, Two Independent Samples: CI for (μ1 − μ2), H0: μ1 = μ2 | ni = 2(Zσ/E)² | ni = 2((Z1-α/2 + Z1-β)/ES)², ES = \|μ1 − μ2\|/σ |
Continuous Outcome, Two Matched Samples: CI for μd, H0: μd = 0 | n = (Zσd/E)² | n = ((Z1-α/2 + Z1-β)/ES)², ES = μd/σd |
Dichotomous Outcome, One Sample: CI for p, H0: p = p0 | n = p(1 − p)(Z/E)² | n = ((Z1-α/2 + Z1-β)/ES)², ES = \|p1 − p0\|/√(p0(1 − p0)) |
Dichotomous Outcome, Two Independent Samples: CI for (p1 − p2), H0: p1 = p2 | ni = [p1(1 − p1) + p2(1 − p2)](Z/E)² | ni = 2((Z1-α/2 + Z1-β)/ES)², ES = \|p1 − p2\|/√(p(1 − p)) |
- Buschman NA, Foster G, Vickers P. Adolescent girls and their babies: achieving optimal birth weight. Gestational weight gain and pregnancy outcome in terms of gestation at delivery and infant birth weight: a comparison between adolescents under 16 and adult women. Child: Care, Health and Development. 2001; 27(2):163-171.
- Feuer EJ, Wun LM. DEVCAN: Probability of Developing or Dying of Cancer. Version 4.0 .Bethesda, MD: National Cancer Institute, 1999.
- Howell DC. Statistical Methods for Psychology. Boston, MA: Duxbury Press, 1982.
- Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley and Sons, Inc.,1981.
- National Center for Health Statistics. Health, United States, 2005 with Chartbook on Trends in the Health of Americans. Hyattsville, MD : US Government Printing Office; 2005.
- Plaskon LA, Penson DF, Vaughan TL, Stanford JL. Cigarette smoking and risk of prostate cancer in middle-aged men. Cancer Epidemiology Biomarkers & Prevention. 2003; 12: 604-609.
- Rutter MK, Meigs JB, Sullivan LM, D'Agostino RB, Wilson PW. C-reactive protein, the metabolic syndrome and prediction of cardiovascular events in the Framingham Offspring Study. Circulation. 2004;110: 380-385.
- Ramachandran V, Sullivan LM, Wilson PW, Sempos CT, Sundstrom J, Kannel WB, Levy D, D'Agostino RB. Relative importance of borderline and elevated levels of coronary heart disease risk factors. Annals of Internal Medicine. 2005; 142: 393-402.
- Wechsler H, Lee JE, Kuo M, Lee H. College binge drinking in the 1990s: a continuing problem. Results of the Harvard School of Public Health 1999 College Alcohol Study. Journal of American College Health. 2000; 48: 199-210.
Answers to Selected Problems
Answer to Birth Weight Question - Page 3
An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams?
n = (Zσ/E)² = (1.960 × 385 / 100)² = 56.9, round up to 57.

In order to ensure that the 95% confidence interval estimate of the mean birth weight is within 100 grams of the true mean, a sample of size 57 is needed. In planning the study, the investigator must consider the fact that some women may deliver prematurely. If women are enrolled into the study during pregnancy, then more than 57 women will need to be enrolled so that after excluding those who deliver prematurely, 57 with outcome information will be available for analysis. For example, if 5% of the women are expected to deliver prematurely (i.e., 95% will deliver full term), then 60 women must be enrolled to ensure that 57 deliver full term. The number of women that must be enrolled, N, is computed as follows:
N (number to enroll) * (% retained) = desired sample size
N (0.95) = 57
N = 57/0.95 = 60.
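The precision-based (confidence interval) calculation works the same way in code. The sketch below is ours; note the small epsilon guard, because 57/0.95 is exactly 60 and floating-point noise in the division must not push the ceiling up to 61.

```python
import math

def sample_size_ci_mean(sd, margin, z=1.960):
    """n = (Z sigma / E)^2 for estimating a mean to within margin E (95% CI)."""
    return math.ceil((z * sd / margin) ** 2)

n = sample_size_ci_mean(385, 100)    # 57 women with full-term deliveries needed
# Inflate for 5% expected premature deliveries; subtract a tiny epsilon so
# floating-point noise in 57/0.95 cannot push ceil() from 60 up to 61.
enroll = math.ceil(n / 0.95 - 1e-9)  # 60 women to enroll
```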
Answer to Freshmen Smoking - Page 4
In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 303 is needed. Notice that this sample size is substantially smaller than the one estimated above. Having some information on the magnitude of the proportion in the population will always produce a sample size that is less than or equal to the one based on a population proportion of 0.5. However, the estimate must be realistic.
Answer to Medical Device Problem - Page 7
A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance.
We first compute the effect size:

ES = |0.15 − 0.10| / √(0.10(0.90)) = 0.17

Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

n = ((1.960 + 1.282) / 0.17)² = 363.7, round up to 364.

A sample size of 364 stents will ensure that a two-sided test with α = 0.05 has 90% power to detect a 0.05, or 5%, difference in the proportion of defective stents produced.
Answer to Alcohol and GPA - Page 8
An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.
First compute the effect size:

ES = |μ1 − μ2| / σ = 0.25 / 0.42 = 0.60

Now substitute the effect size and the appropriate Z values for α and power to compute the sample size:

ni = 2((1.960 + 0.84) / 0.60)² = 43.6, round up to 44.

Sample sizes of n1 = 44 heavy drinkers and n2 = 44 students who drink fewer than five drinks on a typical drinking day will ensure that the test of hypothesis has 80% power to detect a 0.25 unit difference in mean grade point averages.
Answer to Donor Feces - Page 8
We first compute the effect size by substituting the proportions of patients expected to be cured with each treatment, p1 = 0.6 and p2 = 0.9, and the overall proportion, p = 0.75:

ES = |0.9 − 0.6| / √(0.75(0.25)) = 0.69

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size:

ni = 2((1.960 + 0.84) / 0.69)² = 32.9, round up to 33.

Samples of size n1 = 33 and n2 = 33 will ensure that the test of hypothesis will have 80% power to detect this difference in the proportions of patients who are cured of C. diff. by feces infusion versus antibiotic therapy.
In fact, the investigators enrolled 38 into each group to allow for attrition. Nevertheless, the study was stopped after an interim analysis. Of 16 patients in the infusion group, 13 (81%) had resolution of C. difficile–associated diarrhea after the first infusion. The 3 remaining patients received a second infusion with feces from a different donor, with resolution in 2 patients. Resolution of C. difficile infection occurred in only 4 of 13 patients (31%) receiving the antibiotic vancomycin.
MINI REVIEW article
How to calculate sample size in animal and human studies.
- 1 Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, University of California, San Diego, La Jolla, CA, United States
- 2 Department of Pediatrics, University of California, San Diego, La Jolla, CA, United States
- 3 Division of Gastroenterology, Hepatology and Nutrition, Rady Children's Hospital San Diego, San Diego, CA, United States
One of the most important statistical analyses when designing animal and human studies is the calculation of the required sample size. In this review, we define central terms in the context of sample size determination, including mean, standard deviation, statistical hypothesis testing, type I/II error, power, direction of effect, effect size, expected attrition, corrected sample size, and allocation ratio. We also provide practical examples of sample size calculations for animal and human studies based on pilot studies, on larger studies similar to the proposed study, or, if no previous studies are available, on estimated magnitudes of the effect size per Cohen and Sawilowsky.
Introduction
The sample size refers to the number of patients or animals included in a study, and determining it is one of the first and foremost questions to be answered when designing a human or animal study. A sample size smaller than necessary yields insufficient statistical power to answer the research question and reduces the chance of reaching statistical significance. However, bigger is not necessarily better. A large sample size better represents the population and hence provides more accurate results, but beyond a certain point the gain in accuracy becomes small and clinically irrelevant, and is not worth the effort and cost. In some studies, an excessively large sample size would expose more patients/animals than necessary to potentially toxic procedures, which would be unethical. Sample size determination depends on the study design and study aim. In most cases, the sample size can be determined through hypothesis testing, so that we can reject the null hypothesis with both statistical significance and practical relevance at reasonable statistical power. These procedures must consider the sizes of the type I and type II errors as well as the population variance and the effect size of the outcome of interest. There also exist cases, such as opinion surveys, in which sample size calculation targets an acceptably small margin of error irrespective of statistical power, type I/II error, and effect size. We focus on the former in this review.
Definitions
In this study, we use x 1 , x 2 , …., x n to denote the n data points for a given variable, and we mostly consider the case of a continuous variable.
Mean and standard deviation (SD)
The mean, or the average of all values of a specific group, x̄ = (1/n) Σᵢ xᵢ, is a summary of location. The SD describes the dispersion and variability of the variable, s = √( Σᵢ (xᵢ − x̄)² / (n − 1) ); specifically, it measures the average deviation of the data points from the mean.
Statistical hypothesis testing
Statistical hypothesis testing is a statistical inference tool that uses the collected data to determine whether there is strong evidence to reject a certain hypothesis, which we term the null hypothesis. Generally, the null hypothesis is a statement of no relevant association or effect. Alongside the null hypothesis we also specify an alternative hypothesis, which supports the existence of a relevant association or effect. In this review, we focus mainly on the case of comparing the means of two groups. The null hypothesis is then that the means of a continuous variable in the two groups are the same (μ1 = μ2). The alternative hypothesis is that there is a non-zero difference between the group means. Depending on the null hypothesis, a test statistic is calculated and compared to the critical value (at a given significance level, say α = 0.05) under the null hypothesis. The test statistic measures how unlikely the observed data are if the null hypothesis is true. A test statistic that is larger in absolute value than the critical value indicates that the observed data are unlikely under the null hypothesis, so we reject the null hypothesis in favor of the alternative.
Type I error
In statistical hypothesis testing, a type I error is the probability of rejecting a true null hypothesis, i.e., this is a “false positive” conclusion. This is the significance level (α) we choose to use in statistical hypothesis testing. Common choices of α are 0.05 or 0.01. It is worth noting that a type I error is determined prior to sample size calculation.
Type II error and power
In contrast to the type I error, the type II error, denoted as β, in statistical hypothesis testing refers to the probability of failing to reject a false null hypothesis, i.e., a "false negative" conclusion. The power of a statistical test (= 1 − type II error) is the probability of detecting a true association, i.e., of rejecting a false null hypothesis. Common choices of β are 0.2, 0.1, and 0.05, corresponding to 80%, 90%, and 95% power.
Direction of effect
This refers to the conditions under which the null hypothesis is rejected. In a two-tailed test, it is rejected if the mean of one group is different (either higher or lower; μ1 ≠ μ2) from the mean of the other group. In a one-tailed test, the null hypothesis is rejected only if the mean of one specific group is higher than that of the other (μ1 > μ2), not if it is lower. If we use a one-sided test, the critical value in the hypothesis testing is based on the top α percentile of the distribution of the test statistic; if we use a two-sided test, the critical value is based on the top α/2 percentile. Practically, a one-sided test requires a smaller sample size than a two-sided test (see below).
Effect size
The effect size is a value that measures the strength of the association being claimed, and it is closely tied to the statistical test used. For example, if we hypothesize that there is a difference between the means of a certain biomarker in a disease group and a healthy group, then Cohen's d is a commonly used effect size, defined as the difference between the two means divided by the pooled standard deviation of the data:

d = (x̄disease − x̄healthy) / s

where s is the pooled SD,

s = √( ((ndisease − 1)SD²disease + (nhealthy − 1)SD²healthy) / (ndisease + nhealthy − 2) )

or, in the case of equal sample sizes,

d = (x̄disease − x̄healthy) / √((SD²disease + SD²healthy) / 2).

The most important feature of the effect size is that it is not influenced by the sample size. The effect size can usually be calculated from preliminary data observed in a smaller-scale study or from the literature on similar studies. In practice, if practitioners have experience with the biomarker, it is helpful to define a clinically relevant effect size based on that experience. If there is no historical data or experience with the biomarker at hand, Cohen and Sawilowsky ( 1 , 2 ) laid out a general rule of thumb for magnitudes of d from 0.01 to 2.0, with small (d = 0.2), medium (d = 0.5), large (d = 0.8), and huge (d = 2) effect sizes (see Supplementary material 1 ). When we compare the proportions in two groups, which can also be viewed as comparing the means of binary outcomes in two groups, the effect size and hence the sample size can be calculated with analogous metrics designed specifically for proportions, such as Cohen's h or Cohen's ω ( 1 ).
If there is another type of association or hypothesis to be used, e.g., for comparing the means of multiple groups, a different type of effect size should be chosen, which we will briefly discuss in a later section.
Relating the statistical testing and sample size calculation
In a simplified setting of n disease = n healthy = n, we could roughly write the required sample size as

n ≈ 2 × ((Z1-α/2 + Z1-β) / d)²

for a two-sample two-sided t-test at the significance level of α with power 1−β, where Z1-α/2 and Z1-β are the (1−α/2)-th and (1−β)-th percentiles of a standard normal distribution (for more detailed calculations, see Supplementary material 2). Because this is an approximation, the formula may slightly underestimate the required sample size; we then round n up to the next integer. Using this simplified formula, we note a few generally true and useful relationships: (1) the required sample size is negatively related to the effect size, i.e., to detect a smaller effect size, we need a larger sample size; (2) if we decrease the pre-set tolerated type I (α) and type II (β) errors, or increase the intended power (1−β), then the required sample size is also larger; (3) in practice, we usually fix α, β, and the effect size d and calculate the required sample size n; however, it is also possible to fix α, β, and the available sample size n, calculate the detectable effect size d, and compare this detectable effect size to the clinically or practically relevant effect size.
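The simplified formula can be evaluated directly with Python's standard library (a sketch under the same normal approximation; `statistics.NormalDist().inv_cdf` supplies the percentiles, so no Z table is needed):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test:
    n ~ 2 * ((Z_{1-alpha/2} + Z_{1-beta}) / d)^2 (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # e.g., 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # e.g., 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Cohen's benchmark effect sizes: medium (d = 0.5) and small (d = 0.2)
print(n_per_group(0.5))  # 63 per group
print(n_per_group(0.2))  # 393 per group
```

As the relationships above predict, halving the detectable effect size roughly quadruples the required sample size.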
Expected attrition and corrected sample size
The calculated required sample size is the minimum number needed to achieve the pre-set parameters. In practice, there is often dropout over the study period. For example, if we expect a 10% dropout or attrition rate, the final corrected sample size is the minimum required sample size divided by 0.9 (that is, 100% − 10%), rounded up.
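A short sketch of the attrition correction (the required n of 100 is a hypothetical input):

```python
import math

def corrected_n(n_required, attrition=0.10):
    """Inflate the minimum required sample size for expected dropout."""
    return math.ceil(n_required / (1 - attrition))

print(corrected_n(100))  # 100 / 0.9 = 111.11 -> 112
```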
Allocation ratio
Although random assignment to experimental groups in animals or treatment arms in humans on a 1:1 basis has long been the standard ( 3 ), alternative allocation ratios such as 2:1 or 3:1 may be employed, in which two or three individuals receive the drug for each enrolled individual receiving placebo. In human trials, this is usually done to improve overall enrollment, since patients are more willing to participate when their chance of receiving the study drug is higher, or to learn more about the pharmacokinetics and adverse effects of the drug ( 4 ). However, a 2:1 allocation ratio requires 12% more subjects, and a 3:1 allocation ratio 33% more subjects, than a 1:1 allocation ratio to detect the same effect size with equivalent power ( 3 ) (also see Supplementary material 2 for justification).
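The stated 12% and 33% penalties follow from the total-sample-size multiplier $(1+k)^2/(4k)$ for a k:1 allocation relative to 1:1, which can be checked directly (a sketch under the usual equal-variance assumption):

```python
def total_n_inflation(k):
    """Multiplier on total sample size for k:1 allocation vs. 1:1, at fixed power."""
    return (1 + k) ** 2 / (4 * k)

print(total_n_inflation(2))  # 1.125     -> ~12% more subjects for 2:1
print(total_n_inflation(3))  # 1.333...  -> ~33% more subjects for 3:1
```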
Other types of tests and power calculation
For the discussions above, we mainly focused on comparing the means of two groups. If we have other scientific questions, e.g., comparing the means of a continuous variable in more than two groups, investigating the association between two continuous variables, or exploring the explained variance in multiple regression, then the corresponding tests are the F-test for analysis of variance (ANOVA), the Z-test for the Pearson correlation coefficient, and the F-test based on the R² of a multiple regression model. The corresponding effect sizes for the F-test and the Pearson correlation coefficient are Cohen's f² and the Pearson correlation coefficient R, respectively ( 5 ). Similar formulas can be developed for calculating the required sample size to detect these effect sizes.
There is a multitude of appropriate programs for calculating sample sizes, including G*Power ( 6 ), R statistical software ( 7 ), Epitools ( 8 ), OpenEpi ( 9 ), and Biomath ( 10 ). A simple and intuitive program is G*Power ( 6 ), which we use below to illustrate our animal and human examples of sample size calculation. As an alternative, we provide R code ( 7 ) for the same calculations in Supplementary material 3 .
Animal studies
In this section, we provide practical examples of sample size calculation for animal studies. One of the more difficult components of estimating the sample size for an animal study is determining the effect size, which depends on the outcome the researcher wants to examine. For example, in a mouse model of Western diet-induced liver disease, one of the more important outcomes is the liver triglyceride concentration ( 11 ). If a researcher aims to investigate the effect of a drug, e.g., a bile acid binder, on diet-induced liver disease, he/she can attempt to extrapolate outcomes, and hence the expected effect size, from a study similar to the proposed project. The bile acid binder colesevelam decreases the hepatic triglyceride concentration to 143.26 mg/g liver weight (standard deviation [SD] 54.50 mg/g) in mice after Western diet feeding, compared with 192.84 mg/g (SD 48.90 mg/g) in the Western diet-fed group not treated with the bile acid sequestrant ( 11 ). The effect size can be calculated with G*Power ( 6 ), other software, or manually ( 1 ): Cohen's $d = \frac{\bar{x}_{\mathrm{western\ diet}} - \bar{x}_{\mathrm{western\ diet+colesevelam}}}{\sqrt{(SD_{\mathrm{western\ diet}}^2 + SD_{\mathrm{western\ diet+colesevelam}}^2)/2}} = \frac{192.84 - 143.26}{\sqrt{(48.9^2 + 54.5^2)/2}} = 0.96$ ( Figure 1A ). With a two-tailed calculation and an effect size of 0.96, a type I error of 0.05, a power of 0.8, and an allocation ratio of 1:1, the raw sample size per group for the proposed new bile acid binder experiment is 19, and, with an attrition of 10%, the corrected sample size per group is 22 (19/0.9 = 21.11, rounded up) ( Figure 1A ). However, if another outcome is chosen, such as a marker of liver inflammation, e.g., gene expression of tumor necrosis factor (TNF), with 1.65 relative units (SD 0.85) in the colesevelam-treated group vs. 3.37 (SD 1.59) in the untreated group, the effect size is much higher at 1.35, resulting in a lower sample size of 10 per group ( Figure 1B ) and a corrected sample size of 12 per group to account for 10% expected attrition (10/0.9 = 11.11, rounded up). This shows that the calculated sample size depends markedly on the selected outcome variable. Furthermore, decreasing the tolerated type I error (e.g., from 0.05 to 0.01) or increasing the power (e.g., from 0.8 to 0.95) increases the required sample size per group (e.g., from 10 to 15 or from 10 to 16, respectively; Figures 1C, D ).
Figure 1 . Sample size calculations for select animal studies using G*Power. (A) Sample size calculation based on hepatic triglyceride concentration in colesevelam-treated and Western diet-fed mice with type I error of 0.05 and power of 0.8. (B–D) Sample size calculation based on gene expression of inflammatory marker tumor necrosis factor (TNF) in the liver in colesevelam-treated and Western diet-fed mice with (B) type I error of 0.05 and power of 0.8, (C) type I error of 0.01 and power of 0.8, or (D) type I error of 0.05 and power of 0.95.
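As a cross-check of the G*Power results above, a normal-approximation sketch with a small-sample correction term ($z_{1-\alpha/2}^2/4$, an adjustment we add here; G*Power itself solves the exact noncentral-t problem) reproduces the figures' numbers:

```python
import math
from statistics import NormalDist

def n_per_group(mean1, sd1, mean2, sd2, alpha=0.05, power=0.80, attrition=0.10):
    """Per-group (raw, attrition-corrected) n: normal approximation
    plus a z^2/4 small-sample correction (our assumption)."""
    z = NormalDist().inv_cdf
    d = abs(mean1 - mean2) / math.sqrt((sd1**2 + sd2**2) / 2)  # equal-n Cohen's d
    z_a = z(1 - alpha / 2)
    raw = math.ceil(2 * ((z_a + z(power)) / d) ** 2 + z_a ** 2 / 4)
    return raw, math.ceil(raw / (1 - attrition))

print(n_per_group(192.84, 48.9, 143.26, 54.5))  # hepatic triglycerides: (19, 22)
print(n_per_group(3.37, 1.59, 1.65, 0.85))      # TNF expression: (10, 12)
```

Tightening α to 0.01 or raising the power to 0.95 in this sketch likewise reproduces the 15 and 16 animals per group of Figures 1C, D.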
In addition to extrapolating expected results from similar experimental settings, the sample size for a larger experiment can also be calculated from a pilot experiment with a small sample size. If a pilot experiment over 9 months showed that a certain drug decreased tumor growth in five rats (4, 3, 6, 4, and 4 tumors/rat, respectively; mean 4.2 tumors/rat, SD 1.10) vs. five control rats (6, 5, 4, 7, and 5 tumors/rat, respectively; mean 5.4, SD 1.14; p = 0.13, Student's t -test), the effect size is 1.07, and the calculated sample size for a larger experiment is 15 rats per group using a two-tailed analysis. In this case, a one-tailed analysis could also be used, since the pilot experiment suggests that the drug might be protective against tumor growth, and the follow-up experiment would focus on whether the drug truly reduces the tumor burden relative to controls, not on whether the controls might have a lower tumor burden than the drug group. The calculated sample size would be 12 rats per group with a one-tailed analysis, potentially markedly decreasing the cost of maintaining the rodents over long experimental periods such as 9 months compared with the 15 rats per group required by a two-tailed analysis (prior to correcting for attrition).
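The pilot-study numbers above can be reproduced from the raw tumor counts. This sketch uses the same normal approximation with a z²/4 correction (our assumption) and a tails parameter to contrast one- and two-tailed designs:

```python
import math
from statistics import NormalDist, mean, stdev

drug    = [4, 3, 6, 4, 4]  # tumors/rat, drug group
control = [6, 5, 4, 7, 5]  # tumors/rat, control group

def n_per_group(d, alpha=0.05, power=0.80, tails=2):
    """Per-group n; normal approximation plus z^2/4 correction (our assumption)."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / tails)
    return math.ceil(2 * ((z_a + z(power)) / d) ** 2 + z_a ** 2 / 4)

# Equal-n Cohen's d from the pilot data (stdev is the sample SD)
d = abs(mean(drug) - mean(control)) / math.sqrt((stdev(drug)**2 + stdev(control)**2) / 2)
print(round(d, 2), n_per_group(d, tails=2), n_per_group(d, tails=1))  # 1.07 15 12
```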
Animal models oftentimes include four groups ( 12 – 14 ), two of which might be on a special diet (or have a specific genotype), while the other two groups are on a control diet (or are wild-type mice, etc.). One special diet group and one control diet group might then be treated with a drug, while the other two are not. The question of major interest is whether the drug improves the disease induced by the special diet, compared with the group fed the special diet but not treated with the drug. However, reviewers of submitted manuscripts or grants sometimes ask what the most appropriate sample size is for the control animals, that is, the two groups on the control diet, which are not of primary interest. Many articles use only five rodents or even fewer for those control groups, in particular in rodent models that cause a stark disease phenotype due to a special diet, genotype, or similar condition ( 13 , 15 – 18 ). A mouse model of high-fat diet-induced obesity might serve as an example, in which mice gained 15.75 g on average over 16 weeks on a high-fat diet (SD 7.63) vs. 2.5 g (SD 2.65) in control mice on a control diet ( 17 ). The estimated uncorrected sample size would be 3 in the control group and 9 in the high-fat diet group using a 3:1 allocation ratio and a two-tailed analysis; a one-tailed analysis would give an uncorrected sample size of 2 in the control group and 6 in the high-fat diet group with the same allocation ratio. Clear phenotypes can be achieved with rodent models of diet-induced metabolic diseases, including type 2 diabetes ( 19 ) and non-alcoholic steatohepatitis ( 20 ), as well as chemically induced diseases [e.g., dextran sulfate sodium-induced colitis ( 21 )] or (at least partially) genetic diseases [such as Alzheimer's disease ( 18 ) or autism ( 22 )].
In instances like these, with established rodent models and distinct phenotypes, untreated control groups of five or fewer rodents are acceptable with or without power calculations, because the primary interest is whether an intervention (such as a drug) changes the phenotype in the experimental group compared with an experimental group without the drug. For these experimental groups, on the other hand, it is of utmost importance that a power calculation be carried out, as described above.
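For the high-fat diet example above, the unequal-allocation case replaces the factor 2 in the sample-size formula with $(1 + 1/k)$ for the smaller group. The sketch below (again with a z²/4 correction, our assumption) reproduces the two-tailed 3:1 result of 3 control and 9 high-fat mice:

```python
import math
from statistics import NormalDist

def unequal_n(mean1, sd1, mean2, sd2, k=3, alpha=0.05, power=0.80):
    """Group sizes for k:1 allocation: returns (smaller group, larger group)."""
    z = NormalDist().inv_cdf
    d = abs(mean1 - mean2) / math.sqrt((sd1**2 + sd2**2) / 2)
    z_a = z(1 - alpha / 2)
    n_small = math.ceil((1 + 1 / k) * ((z_a + z(power)) / d) ** 2 + z_a ** 2 / 4)
    return n_small, k * n_small

# Weight gain: 15.75 g (SD 7.63) on high-fat diet vs. 2.5 g (SD 2.65) on control diet
print(unequal_n(15.75, 7.63, 2.5, 2.65, k=3))  # (3, 9): 3 controls, 9 high-fat mice
```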
Human studies
The sample size for human studies can be calculated analogously to that for mouse studies. For example, in a small pilot, drug A may decrease the inflammatory marker fecal calprotectin in humans with inflammatory bowel disease by 170 mcg/g (standard deviation [SD] 150 mcg/g) vs. 90 mcg/g (SD 100 mcg/g) in the placebo group. The effect size is then 0.63; with a type I error of 0.05, a power of 95%, and a 1:1 allocation ratio, the larger randomized controlled trial will require 67 subjects per group (or 75 subjects after accounting for 10% attrition). However, it can be difficult to calculate the effect size in human studies if no pilot studies have been done. In those cases, one can usually estimate effect sizes as small ( d = 0.2), medium ( d = 0.5), or large ( d = 0.8), as suggested by Cohen and Sawilowsky ( 1 , 2 ). This also highlights that effect sizes in human studies are usually much smaller, and sample sizes accordingly much larger, than in animal studies (where the aim is generally an effect size >1.0, see above).
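The calprotectin example can be verified with the same approximation (the z²/4 correction term is our addition; an exact t-test calculation may differ slightly):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n; normal approximation plus z^2/4 correction (our assumption)."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)
    return math.ceil(2 * ((z_a + z(power)) / d) ** 2 + z_a ** 2 / 4)

# Calprotectin decrease: 170 mcg/g (SD 150) with drug A vs. 90 mcg/g (SD 100) with placebo
d = (170 - 90) / math.sqrt((150**2 + 100**2) / 2)
n = n_per_group(d, power=0.95)
print(round(d, 2), n, math.ceil(n / 0.9))  # 0.63 67 75
```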
However, proportions are more commonly used to calculate sample sizes in human studies ( 23 – 25 ). In a human trial of rifaximin in irritable bowel syndrome, the sample size was calculated using the difference between two independent proportions ( 23 ). Improvement was estimated a priori to occur in 55% of the rifaximin group and 40% of the placebo group, which with 95% power and a significance level of 0.05 would require ~300 subjects per group ( 23 ), or more precisely 286 subjects per group by z-test ( Figure 2A ), plus 16 or 32 additional subjects per group when corrected for 5% or 10% attrition (286/0.95 = 301.05 or 286/0.9 = 317.78, rounded up), respectively. Effect sizes and proposed sample sizes can be arbitrary in human studies ( 26 ). However, as described above, it is recommended to base estimates on smaller pilot studies investigating the same drug, or on larger randomized controlled trials scrutinizing a similar drug in the same clinical context or the same drug in a slightly different clinical context. For example, a human study examined the effect of dexmedetomidine on acute kidney injury after aortic surgery ( 25 ), basing the estimated 54% incidence of postoperative acute kidney injury on a prior study ( 27 ) and estimating that the dexmedetomidine infusion would halve the incidence of postoperative acute kidney injury to 27%, similar to a study on acute kidney injury following valvular heart surgery ( 28 ). These proportions, with a statistical power of 80% and a type I error of 0.05, give a sample size of 51 subjects per group ( Figure 2B ), or 54 subjects per group after correcting for 5% attrition ( 25 ).
Figure 2 . Sample size calculations for select human studies using G*Power. (A) Sample size calculations based on expected proportions of response for rifaximin vs. placebo in irritable bowel syndrome with type I error of 0.05 and power of 0.95. (B) Sample size calculations based on the expected incidence of postoperative acute kidney injury with dexmedetomidine infusion vs. placebo with type I error of 0.05 and power of 0.8.
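For proportions, the same sample-size formula applies with Cohen's h, computed via the arcsine transform, substituted for d. This sketch reproduces the rifaximin calculation above (no small-sample correction is needed here, since the test is a z-test):

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Effect size for two independent proportions (arcsine transform)."""
    phi = lambda p: 2 * math.asin(math.sqrt(p))
    return abs(phi(p1) - phi(p2))

def n_per_group(h, alpha=0.05, power=0.95):
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / h) ** 2)

h = cohens_h(0.55, 0.40)  # expected responder proportions, rifaximin vs. placebo
n = n_per_group(h)
print(n, math.ceil(n / 0.95), math.ceil(n / 0.90))  # 286 302 318
```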
In conclusion, the appropriate calculation of the required sample size is central to designing animal and human studies, for reasons including ethical considerations and reducing costs, time, effort, and the use of other resources.
Author contributions
XZ and PH conceived and designed the study, performed the statistical analysis, wrote the first draft of the manuscript, and edited the manuscript. All authors approved the submitted version.
Funding
This study was supported by the National Institutes of Health (NIH) grant K12 HD85036, the University of California San Diego Altman Clinical and Translational Research Institute (ACTRI)/NIH grant KL2TR001444, the Pinnacle Research Award in Liver Diseases Grant #PNC22-159963 from the American Association for the Study of Liver Diseases Foundation, and the Pilot/Feasibility Grant P30 DK120515 from the San Diego Digestive Diseases Research Center (SDDRC) (to PH).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmed.2023.1215927/full#supplementary-material
Abbreviations
α, type I error; β, type II error; d , effect size; SD, standard deviation.
1. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Academic Press (2013). doi: 10.4324/9780203771587
2. Sawilowsky SS. New effect size rules of thumb. J Modern App Stat Methods. (2009) 8:26. doi: 10.22237/jmasm/1257035100
3. Hey SP, Kimmelman J. The questionable use of unequal allocation in confirmatory trials. Neurology. (2014) 82:77–9. doi: 10.1212/01.wnl.0000438226.10353.1c
4. Vozdolska R, Sano M, Aisen P, Edland SD. The net effect of alternative allocation ratios on recruitment time and trial cost. Clin Trials. (2009) 6:126–32. doi: 10.1177/1740774509103485
5. Steiger JH. Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychol Methods. (2004) 9:164. doi: 10.1037/1082-989X.9.2.164
6. Faul F, Erdfelder E, Buchner A, Lang AG. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav Res Methods. (2009) 41:1149–60. doi: 10.3758/BRM.41.4.1149
7. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria (2023). Available online at: https://www.R-project.org/ (accessed May 1, 2023).
8. Ausvet. Epitools. (2023). Available online at: https://epitools.ausvet.com.au/samplesize (accessed May 1, 2023).
9. Dean AG. OpenEpi: Open Source Epidemiologic Statistics for Public Health, Version 2.3.1. (2010). Available online at: http://www.openepi.com (accessed May 1, 2023).
10. Center for Biomathematics. Biomath. (2023). Available online at: http://www.biomath.info/power/index.html (accessed May 1, 2023).
11. Hartmann P, Duan Y, Miyamoto Y, Demir M, Lang S, Hasa E, et al. Colesevelam ameliorates non-alcoholic steatohepatitis and obesity in mice. Hepatol Int. (2022) 16:359–70. doi: 10.1007/s12072-022-10296-w
12. Wang L, Hartmann P, Haimerl M, Bathena SP, Sjöwall C, Almer S, et al. Nod2 deficiency protects mice from cholestatic liver disease by increasing renal excretion of bile acids. J Hepatol. (2014) 60:1259–67. doi: 10.1016/j.jhep.2014.02.012
13. Nishio T, Koyama Y, Liu X, Rosenthal SB, Yamamoto G, Fuji H, et al. Immunotherapy-based targeting of MSLN+ activated portal fibroblasts is a strategy for treatment of cholestatic liver fibrosis. Proc Nat Acad Sci. (2021) 118:e2101270118. doi: 10.1073/pnas.2101270118
14. Zeng S, Hartmann P, Park M, Duan Y, Lang S, Llorente C, et al. Malassezia restricta promotes alcohol-induced liver injury. Hepatol Commun. (2023) 7:2. doi: 10.1097/HC9.0000000000000029
15. Tsuchida T, Lee YA, Fujiwara N, Ybanez M, Allen B, Martins S, et al. A simple diet-and chemical-induced murine NASH model with rapid progression of steatohepatitis, fibrosis and liver cancer. J Hepatol. (2018) 69:385–95. doi: 10.1016/j.jhep.2018.03.011
16. Kang SS, Bloom SM, Norian LA, Geske MJ, Flavell RA, Stappenbeck TS, et al. An antibiotic-responsive mouse model of fulminant ulcerative colitis. PLoS Med. (2008) 5:e41. doi: 10.1371/journal.pmed.0050041
17. Hartmann P, Seebauer CT, Mazagova M, Horvath A, Wang L, Llorente C, et al. Deficiency of intestinal mucin-2 protects mice from diet-induced fatty liver disease and obesity. Am J Physiol Gastroint Liver Physiol. (2016) 310:G310–22. doi: 10.1152/ajpgi.00094.2015
18. Lee HY, Yoon S, Lee JH, Park K, Jung Y, Cho I, et al. Aryloxypropanolamine targets amyloid aggregates and reverses Alzheimer-like phenotypes in Alzheimer mouse models. Alzheimer's Res Therapy. (2022) 14:177. doi: 10.1186/s13195-022-01112-6
19. Morris JL, Bridson TL, Alim MA, Rush CM, Rudd DM, Govan BL, et al. Development of a diet-induced murine model of diabetes featuring cardinal metabolic and pathophysiological abnormalities of type 2 diabetes. Biol Open. (2016) 5:1149–62. doi: 10.1242/bio.016790
20. Demir M, Lang S, Hartmann P, Duan Y, Martin A, Miyamoto Y, et al. The fecal mycobiome in non-alcoholic fatty liver disease. J Hepatol. (2022) 76:788–99. doi: 10.1016/j.jhep.2021.11.029
21. Renes IB, Verburg M, Van Nispen DJ, Büller HA, Dekker J, Einerhand AW. Distinct epithelial responses in experimental colitis: implications for ion uptake and mucosal protection. Am J Physiol Gastroint Liver Physiol. (2002) 283:G169–79. doi: 10.1152/ajpgi.00506.2001
22. Kazdoba TM, Leach PT, Yang M, Silverman JL, Solomon M, Crawley JN. Translational Mouse Models of Autism: Advancing Toward Pharmacological Therapeutics . Cham: Springer International Publishing (2016). doi: 10.1007/7854_2015_5003
23. Pimentel M, Lembo A, Chey WD, Zakko S, Ringel Y, Yu J, et al. Rifaximin therapy for patients with irritable bowel syndrome without constipation. New Eng J Med. (2011) 364:22–32. doi: 10.1056/NEJMoa1004409
24. Makrides M, Gibson RA, McPhee AJ, Yelland L, Quinlivan J, Ryan P, DOMInO Investigative Team. Effect of DHA supplementation during pregnancy on maternal depression and neurodevelopment of young children: a randomized controlled trial. JAMA. (2010) 304:1675–83. doi: 10.1001/jama.2010.1507
25. Soh S, Shim JK, Song JW, Bae JC, Kwak YL. Effect of dexmedetomidine on acute kidney injury after aortic surgery: a single-centre, placebo-controlled, randomised controlled trial. Br J Anaesth. (2020) 124:386–94. doi: 10.1016/j.bja.2019.12.036
26. Bacchetti P. Current sample size conventions: flaws, harms, and alternatives. BMC Med. (2010) 8:1–7. doi: 10.1186/1741-7015-8-17
27. Roh GU, Lee JW, Nam SB, Lee J, Choi JR, Shim YH. Incidence and risk factors of acute kidney injury after thoracic aortic surgery for acute dissection. Ann Thorac Surg. (2012) 94:766–71. doi: 10.1016/j.athoracsur.2012.04.057
28. Cho JS, Shim JK, Soh S, Kim MK, Kwak YL. Perioperative dexmedetomidine reduces the incidence and severity of acute kidney injury following valvular heart surgery. Kidney Int. (2016) 89:693–700. doi: 10.1038/ki.2015.306
Keywords: sample size calculation, power, effect size, animal and human study, two sample comparison, type I error, type II error
Citation: Zhang X and Hartmann P (2023) How to calculate sample size in animal and human studies. Front. Med. 10:1215927. doi: 10.3389/fmed.2023.1215927
Received: 02 May 2023; Accepted: 18 July 2023; Published: 17 August 2023.
Copyright © 2023 Zhang and Hartmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Phillipp Hartmann, phhartmann@health.ucsd.edu