Chapter 9: Hypothesis Testing with Two Samples

Introduction to Chapter 9: Hypothesis Testing with Two Samples


Studies often compare two groups. For example, researchers are interested in the effect aspirin has in preventing heart attacks. Over the last few years, newspapers and magazines have reported various aspirin studies involving two groups. Typically, one group is given aspirin and the other group is given a placebo. Then, the heart attack rate is studied over several years.

There are other situations that deal with the comparison of two groups. For example, studies compare various diet and exercise programs. Politicians compare the proportion of individuals from different income brackets who might vote for them. Students are interested in whether SAT or GRE preparatory courses really help raise their scores.

You have learned to conduct hypothesis tests on single means and single proportions. You will expand upon that in this chapter. You will compare two means or two proportions to each other. The general procedure is still the same, just expanded.

To compare two means or two proportions, you work with two groups. The groups are classified either as independent or matched pairs. Independent groups consist of two samples that are independent; that is, sample values selected from one population are not related in any way to sample values selected from the other population. Matched pairs consist of two samples that are dependent. The parameter tested using matched pairs is the population mean. The parameters tested using independent groups are either population means or population proportions.

  • Test of two population means.
  • Test of two population proportions.
  • Test of the two population proportions by testing one population mean of differences.

Chapter 9 Hypothesis testing

The first unit was designed to prepare you for hypothesis testing. In the first chapter we discussed the three major goals of statistics:

  • Describe: connects to unit 1 with descriptive statistics and graphing
  • Decide: connects to unit 1 knowing your data and hypothesis testing
  • Predict: connects to hypothesis testing and unit 3

The remaining chapters will cover many different kinds of hypothesis tests connected to different inferential statistics. Needless to say, hypothesis testing is the central topic of this course. This lesson is important but that does not mean the same thing as difficult. There is a lot of new language we will learn about when conducting a hypothesis test. Some of the components of a hypothesis test are the topics we are already familiar with:

  • Test statistics
  • Probability
  • Distribution of sample means

Hypothesis testing is an inferential procedure that uses data from a sample to draw a general conclusion about a population. It is a formal approach and a statistical method that uses sample data to evaluate hypotheses about a population. When interpreting a research question and statistical results, a natural question arises as to whether the finding could have occurred by chance. Hypothesis testing is a statistical procedure for testing whether chance (random events) is a reasonable explanation of an experimental finding. Once you have mastered the material in this lesson you will be used to solving hypothesis testing problems and the rest of the course will seem much easier. In this chapter, we will introduce the ideas behind the use of statistics to make decisions – in particular, decisions about whether a particular hypothesis is supported by the data.

Logic and Purpose of Hypothesis Testing

The statistician Ronald Fisher explained the concept of hypothesis testing with a story of a lady tasting tea. Fisher was a statistician from London and is noted as the first person to formalize the process of hypothesis testing. His elegantly simple “Lady Tasting Tea” experiment demonstrated the logic of the hypothesis test.


Figure 1. A depiction of the lady tasting tea.

Fisher would often have afternoon tea during his studies. He usually took tea with a woman who claimed to be a tea expert. In particular, she told Fisher that she could tell which was poured first in the teacup, the milk or the tea, simply by tasting the cup. Fisher, being a scientist, decided to put this rather bizarre claim to the test. The lady accepted his challenge. Fisher brought her 8 cups of tea in succession; 4 cups would be prepared with the milk added first, and 4 with the tea added first. The cups would be presented in a random order unknown to the lady.

The lady would take a sip of each cup as it was presented and report which ingredient she believed was poured first. Using the laws of probability, Fisher determined that the chance of her guessing all 8 cups correctly was 1/70, or about 1.4%. In other words, if the lady was indeed guessing, there was a 1.4% chance of her getting all 8 cups correct. On the day of the experiment, Fisher had 8 cups prepared just as he had requested. The lady drank each cup and made her decisions for each one.
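The 1/70 figure can be checked with a short calculation. The sketch below assumes the lady knows that exactly 4 of the 8 cups had milk poured first, so a pure guess amounts to picking one set of 4 cups out of all possible sets of 4. The function name is ours, chosen for illustration.

```python
from math import comb

def chance_all_correct(total_cups: int, milk_first_cups: int) -> float:
    """Probability of naming exactly the right milk-first cups by guessing."""
    # Number of distinct ways to choose which cups to call "milk first"
    possible_selections = comb(total_cups, milk_first_cups)
    # Only one of those selections matches the true arrangement
    return 1 / possible_selections

p_guess = chance_all_correct(8, 4)
print(f"{p_guess:.4f}")  # 1/70, printed as 0.0143
```

There are 70 equally likely ways to pick 4 cups from 8, and only one of them is entirely correct, which gives the roughly 1.4% chance described above.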

After the experiment, it was revealed that the lady got all 8 cups correct! Remember, had she been truly guessing, the chance of getting this result was 1.4%. Since this probability was so low, Fisher instead concluded that the lady could indeed differentiate between the milk and the tea being poured first. Fisher’s original hypothesis that she was just guessing was judged implausible and was therefore rejected. The alternative hypothesis, that the lady could truly tell the cups apart, was then accepted.

This story demonstrates many components of hypothesis testing in a very simple way. For example, Fisher started with a hypothesis that the lady was guessing. He then determined that if she was indeed guessing, the probability of guessing all 8 right was very small, just 1.4%. Since that probability was so tiny, when she did get all 8 cups right, Fisher determined it was extremely unlikely she was guessing. A more reasonable conclusion was that the lady had the skill to tell the cups apart.

In hypothesis testing, we always begin by setting up a particular starting hypothesis. We then use probability to determine how likely our observed result would be if that hypothesis were correct. If the result would be very unlikely, we reject the original hypothesis and accept the alternative hypothesis. The alternative hypothesis is usually the opposite of our original hypothesis. In Fisher’s case, his original hypothesis was that the lady was guessing. His alternative hypothesis was that the lady was not guessing.

Consider a second example, in which James Bond claims he can tell whether a martini was shaken or stirred. Suppose he is given 16 taste tests and is correct on 13 of them. This result does not prove that he can tell the difference; it could be he was just lucky and guessed right 13 out of 16 times. But how plausible is the explanation that he was just lucky? To assess its plausibility, we determine the probability that someone who was just guessing would be correct 13 or more times out of 16. This probability can be computed to be 0.0106. This is a pretty low probability, and therefore someone would have to be very lucky to be correct 13 or more times out of 16 if they were just guessing. A low probability gives us more confidence that Bond can tell whether the drink was shaken or stirred. There is also still a chance that Mr. Bond was very lucky (more on this later!). The hypothesis that he was guessing is not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred.
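The value 0.0106 is a binomial tail probability, and it can be reproduced with a few lines of code. The sketch below assumes that each taste test is an independent guess with a 50% chance of being right; the function name is ours.

```python
from math import comb

def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) when X counts correct answers in `trials` independent guesses."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Probability of 13, 14, 15, or 16 correct out of 16 by guessing alone
p_value = binomial_upper_tail(13, 16)
print(round(p_value, 4))  # 0.0106
```

Note that the sum covers 13 correct and every more extreme result (14, 15, and 16 correct), not just exactly 13.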

You may notice some patterns here:

  • We have 2 hypotheses: the original (starting) hypothesis and the alternative
  • We collect data
  • We determine how likely or unlikely the observed result would be if the original hypothesis were true, based on probability.
  • We determine whether we have enough evidence to reject the original hypothesis and draw conclusions.

Now let’s bring in some specific terminology:

Null hypothesis: In general, the null hypothesis, written H0 (“H-naught”), is the idea that nothing is going on: there is no effect of our treatment, no relation between our variables, and no difference in our sample mean from what we expected about the population mean. The null hypothesis indicates that an apparent effect is due to chance. This is always our baseline starting assumption, and it is what we (typically) seek to reject. In mathematical notation, the null hypothesis is written with an equals sign (=).

Alternative hypothesis: If the null hypothesis is rejected, then we will need some other explanation, which we call the alternative hypothesis, HA or H1. The alternative hypothesis is simply the reverse of the null hypothesis. Thus, our alternative hypothesis is the mathematical way of stating our research question. In general, the alternative hypothesis is that there is an effect of treatment, a relation between variables, or a difference between a sample mean and a population mean. The alternative hypothesis essentially states that the findings are not due to chance. It is also called the research hypothesis, as this is the most common outcome a researcher is looking for: evidence of change, differences, or relationships. There are three options for setting up the alternative hypothesis, depending on where we expect the difference to lie. The alternative hypothesis always involves some kind of inequality (≠, >, or <).

  • If we expect a specific direction of change, difference, or relationship, which we call a directional hypothesis, then our alternative hypothesis takes its form from the research question itself. For example, one would expect a decrease in depression from taking an antidepressant. Or the direction could be an increase: for example, one might expect higher exam scores after completing a student success exam preparation module. These two directional hypotheses make up 2 of the 3 alternative hypothesis options. The third option is to state that there is a difference, change, or relationship without predicting its direction. This is a non-directional alternative hypothesis (typically written with ≠ in mathematical notation).

Probability value (p-value): the probability of a certain outcome assuming a certain state of the world. In statistics, it is conventional to refer to possible states of the world as hypotheses since they are hypothesized states of the world. Using this terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome. It is very important to understand precisely what the probability values mean. In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing. It is easy to mistake this probability of 0.0106 as the probability he cannot tell the difference. This is not at all what it means. The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing).

A low probability value casts doubt on the null hypothesis. How low must the probability value be in order to conclude that the null hypothesis is false? Although there is clearly no right or wrong answer to this question, it is conventional to conclude the null hypothesis is false if the probability value is less than 0.05 (p < .05). More conservative researchers conclude the null hypothesis is false only if the probability value is less than 0.01 (p < .01). When a researcher concludes that the null hypothesis is false, the researcher is said to have rejected the null hypothesis. The probability value below which the null hypothesis is rejected is called the α level or simply α (“alpha”). It is also called the significance level. If α is not explicitly specified, assume that α = 0.05.
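The decision rule itself is a single comparison. The sketch below applies it to the James Bond probability of 0.0106 at the two conventional α levels; the function name and the wording of the returned labels are ours.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the conventional rule: reject H0 only when p is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a probability value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

bond_p = 0.0106
print(decide(bond_p))              # reject H0 (0.0106 < 0.05)
print(decide(bond_p, alpha=0.01))  # fail to reject H0 (0.0106 is not below 0.01)
```

The same data lead to different decisions under the two thresholds, which is one reason α should be chosen before looking at the results.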

Decision-making is part of the process, and we have some language that goes along with that. Importantly, null hypothesis testing operates under the assumption that the null hypothesis is true unless the evidence shows otherwise. We (typically) seek to reject the null hypothesis, giving us evidence to support the alternative hypothesis. If the probability of the outcome given the hypothesis is sufficiently low, we have evidence that the null hypothesis is false. Note that all probability calculations for all hypothesis tests center on the null hypothesis. In the James Bond example, the null hypothesis is that he cannot tell the difference between shaken and stirred martinis. The probability of correctly identifying 13 or more of 16 martinis by guessing alone is low (0.0106), thus providing evidence that he can tell the difference. Note that we have not computed the probability that he can tell the difference.

The type of hypothesis testing reviewed here is known as null hypothesis statistical testing (NHST). We can break the process of null hypothesis testing down into a number of steps a researcher would use.

  • Formulate a hypothesis that embodies our prediction (before seeing the data)
  • Specify null and alternative hypotheses
  • Collect some data relevant to the hypothesis
  • Compute a test statistic
  • Identify the criterion probability (or compute the probability of the observed value of that statistic) assuming that the null hypothesis is true
  • Assess the “statistical significance” of the result and draw conclusions

Steps in hypothesis testing

Step 1: Formulate a hypothesis of interest.

In the Physicians’ Reactions study, the researchers hypothesized that physicians spend less time with obese patients. The researchers’ hypothesis was derived from an identified population. In creating a research hypothesis, we also have to decide whether we want to test a directional or non-directional hypothesis. Researchers typically will select a non-directional hypothesis for a more conservative approach, particularly when the outcome is unknown (more about why this is later).

Step 2: Specify the null and alternative hypotheses

Can you set up the null and alternative hypotheses for the Physicians’ Reactions study?

Step 3: Determine the alpha level.

For this course, alpha will be given to you as .05 or .01. Researchers decide on alpha and then determine the value of the test statistic associated with that alpha for their sample. Researchers in the Physicians’ Reactions study might set alpha at .05 and identify the value of the test statistic associated with .05 for the sample size. Researchers might take extra precautions to be more confident in their findings (more on this later).

Step 4: Collect some data

For this course, the data will be given to you.  Researchers collect the data and then start to summarize it using descriptive statistics. The mean time physicians reported that they would spend with obese patients was 24.7 minutes as compared to a mean of 31.4 minutes for normal-weight patients.

Step 5: Compute a test statistic

We next want to use the data to compute a statistic that will ultimately let us decide whether the null hypothesis is rejected or not. We can think of the test statistic as providing a measure of the size of the effect compared to the variability in the data. In general, this test statistic will have a probability distribution associated with it, because that allows us to determine how likely our observed value of the statistic is under the null hypothesis.

To assess the plausibility of the hypothesis that the difference in mean times is due to chance, we compute the probability of getting a difference as large or larger than the observed difference (31.4 – 24.7 = 6.7 minutes) if the difference were, in fact, due solely to chance.

Step 6: Determine the probability of the observed result under the null hypothesis 

Using methods presented in later chapters, the probability associated with the observed difference between the two groups in the Physicians’ Reactions study was computed to be 0.0057. Since this is such a low probability, we have confidence that the difference in times is related to the patient’s weight (obese or not) and is not due to chance. We can then reject the null hypothesis (that there is no difference, or that any difference seen is due to chance).

Keep in mind that the null hypothesis is typically the opposite of the researcher’s hypothesis. In the Physicians’ Reactions study, the researchers hypothesized that physicians would expect to spend less time with obese patients. The null hypothesis is that the two types of patients are treated identically. If the null hypothesis were true, a difference as large or larger than the sample difference of 6.7 minutes would be very unlikely to occur. Therefore, the researchers rejected the null hypothesis of no difference and concluded that in the population, physicians intend to spend less time with obese patients.

This is the step where NHST starts to violate our intuition. Rather than determining the likelihood that the null hypothesis is true given the data, we instead determine the likelihood under the null hypothesis of observing a statistic at least as extreme as one that we have observed — because we started out by assuming that the null hypothesis is true! To do this, we need to know the expected probability distribution for the statistic under the null hypothesis, so that we can ask how likely the result would be under that distribution. This will be determined from a table we use for reference or calculated in a statistical analysis program. Note that when I say “how likely the result would be”, what I really mean is “how likely the observed result or one more extreme would be”. We need to add this caveat as we are trying to determine how weird our result would be if the null hypothesis were true, and any result that is more extreme will be even more weird, so we want to count all of those weirder possibilities when we compute the probability of our result under the null hypothesis.
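One way to see what "the observed result or one more extreme, assuming the null hypothesis is true" means is to simulate it. The sketch below uses a permutation test: if group membership truly made no difference, the group labels could be shuffled freely, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one. This is only an illustration of the logic, not the method the researchers used, and the minutes listed are hypothetical values invented for the example, not data from the study.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """One-sided p-value: how often shuffled labels give a difference
    in means at least as large as the observed (group_a - group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under H0 the labels are interchangeable
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical visit lengths in minutes (made up for illustration only)
normal_weight = [31, 35, 28, 33, 30, 36, 29, 32]
obese = [24, 27, 22, 26, 25, 21, 28, 23]
print(permutation_p_value(normal_weight, obese))
```

The returned proportion plays the role of the p-value: it is the share of "null worlds" in which a difference this large or larger turned up by chance alone.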

Let’s review some considerations for null hypothesis statistical testing (NHST)!

Null hypothesis statistical testing (NHST) is commonly used in many fields. If you pick up almost any scientific or biomedical research publication, you will see NHST being used to test hypotheses, and in their introductory psychology textbook, Gerrig & Zimbardo (2002) referred to NHST as the “backbone of psychological research”. Thus, learning how to use and interpret the results from hypothesis testing is essential to understand the results from many fields of research.

It is also important for you to know, however, that NHST is flawed, and that many statisticians and researchers think that it has been the cause of serious problems in science, which we will discuss further in this unit. NHST is also widely misunderstood, largely because it violates our intuitions about how statistical hypothesis testing should work. Let’s look at an example to see this.

There is great interest in the use of body-worn cameras by police officers, which are thought to reduce the use of force and improve officer behavior. However, in order to establish this we need experimental evidence, and it has become increasingly common for governments to use randomized controlled trials to test such ideas. A randomized controlled trial of the effectiveness of body-worn cameras was performed by the Washington, DC government and DC Metropolitan Police Department in 2015-2016. Officers were randomly assigned to wear a body-worn camera or not, and their behavior was then tracked over time to determine whether the cameras resulted in less use of force and fewer civilian complaints about officer behavior.

Before we get to the results, let’s ask how you would think the statistical analysis might work. Let’s say we want to specifically test the hypothesis of whether the use of force is decreased by the wearing of cameras. The randomized controlled trial provides us with the data to test the hypothesis – namely, the rates of use of force by officers assigned to either the camera or control groups. The next obvious step is to look at the data and determine whether they provide convincing evidence for or against this hypothesis. That is: What is the likelihood that body-worn cameras reduce the use of force, given the data and everything else we know?

It turns out that this is not how null hypothesis testing works. Instead, we first take our hypothesis of interest (i.e. that body-worn cameras reduce use of force), and flip it on its head, creating a null hypothesis. In this case, the null hypothesis would be that cameras do not reduce use of force. Importantly, we then assume that the null hypothesis is true. We then look at the data, and determine how likely the data would be if the null hypothesis were true. If the data are sufficiently unlikely under the null hypothesis, then we can reject the null in favor of the alternative hypothesis, which is our hypothesis of interest. If there is not sufficient evidence to reject the null, then we say that we retain (or “fail to reject”) the null, sticking with our initial assumption that the null is true.

Understanding some of the concepts of NHST, particularly the notorious “p-value”, is invariably challenging the first time one encounters them, because they are so counter-intuitive. As we will see later, there are other approaches that provide a much more intuitive way to address hypothesis testing (but have their own complexities).

Step 7: Assess the “statistical significance” of the result. Draw conclusions.

The next step is to determine whether the p-value that results from the previous step is small enough that we are willing to reject the null hypothesis and conclude instead that the alternative is true. In the Physicians’ Reactions study, the probability value is 0.0057. Therefore, the effect of obesity is statistically significant and the null hypothesis that obesity makes no difference is rejected. It is very important to keep in mind that statistical significance means only that the null hypothesis of exactly no effect is rejected; it does not mean that the effect is important, which is what “significant” usually means. When an effect is significant, you can have confidence the effect is not exactly zero. Finding that an effect is significant does not tell you about how large or important the effect is.

How much evidence do we require and what considerations are needed to better understand the significance of the findings? This is one of the most controversial questions in statistics, in part because it requires a subjective judgment – there is no “correct” answer.

What does a statistically significant result mean?

There is a great deal of confusion about what p-values actually mean (Gigerenzer, 2004). Let’s say that we do an experiment comparing the means between conditions, and we find a difference with a p-value of .01. There are a number of possible interpretations that one might entertain.

Does it mean that the probability of the null hypothesis being true is .01? No. Remember that in null hypothesis testing, the p-value is the probability of the data given the null hypothesis. It does not warrant conclusions about the probability of the null hypothesis given the data.

Does it mean that the probability that you are making the wrong decision is .01? No. Remember as above that p-values are probabilities of data under the null, not probabilities of hypotheses.

Does it mean that if you ran the study again, you would obtain the same result 99% of the time? No. The p-value is a statement about the likelihood of a particular dataset under the null; it does not allow us to make inferences about the likelihood of future events such as replication.

Does it mean that you have found a practically important effect? No. There is an essential distinction between statistical significance and practical significance. As an example, let’s say that we performed a randomized controlled trial to examine the effect of a particular diet on body weight, and we find a statistically significant effect at p < .05. What this doesn’t tell us is how much weight was actually lost, which we refer to as the effect size (to be discussed in more detail). If we think about a study of weight loss, then we probably don’t think that the loss of one ounce (i.e. the weight of a few potato chips) is practically significant. Let’s look at our ability to detect a significant difference of 1 ounce as the sample size increases.

A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context and is often assessed using the effect size.

Many differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Be aware that the term effect size can be misleading because it suggests a causal relationship—that the difference between the two means is an “effect” of being in one group or condition as opposed to another. In other words, simply calling the difference an “effect size” does not make the relationship a causal one.

Figure 1 shows how the proportion of significant results increases as the sample size increases, such that with a very large sample size (about 262,000 total subjects), we will find a significant result in more than 90% of studies when there is a 1 ounce difference in weight loss between the diets. While these are statistically significant, most physicians would not consider a weight loss of one ounce to be practically or clinically significant. We will explore this relationship in more detail when we return to the concept of statistical power in Chapter X, but it should already be clear from this example that statistical significance is not necessarily indicative of practical significance.


Figure 1: The proportion of significant results for a very small change (1 ounce, which is about .001 standard deviations) as a function of sample size.
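The pattern in the figure, where a tiny difference becomes "significant" more and more often as the sample grows, can be sketched with a standard normal approximation for a two-sided test comparing two equal-sized groups. The standardized effect size used below (0.01 standard deviations) is a placeholder chosen for illustration, not the value behind the figure, and the function name is ours.

```python
from statistics import NormalDist

def approx_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate chance of a significant two-sided result for a true
    difference of `effect_size_d` standard deviations between two groups."""
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # cutoff for a two-sided test
    # Expected value of the test statistic when the effect is real
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either cutoff
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (100, 10_000, 1_000_000):
    print(n, round(approx_power(0.01, n), 3))
```

With no true effect at all (an effect size of 0), the same function returns α, the rate at which chance alone produces a significant result.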

Challenges with using p-values

Historically, the most common answer to the question of how much evidence we require has been that we should reject the null hypothesis if the p-value is less than 0.05. This comes from the writings of Ronald Fisher, who has been referred to as “the single most important figure in 20th century statistics” (Efron, 1998):

“If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 … it is convenient to draw the line at about the level at which we can say: Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials” (Fisher, 1925)

Fisher never intended p < 0.05 to be a fixed rule:

“no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” (Fisher, 1956)

Instead, it is likely that p < .05 became a ritual due to the reliance upon tables of p-values that were used before computing made it easy to compute p values for arbitrary values of a statistic. All of the tables had an entry for 0.05, making it easy to determine whether one’s statistic exceeded the value needed to reach that level of significance. Although we use tables in this class, statistical software examines the specific probability value for the calculated statistic.

Assessing Error Rate: Type I and Type II Error

Although there are challenges with p-values for decision making, we will examine a way we can think about hypothesis testing in terms of its error rate.  This was proposed by Jerzy Neyman and Egon Pearson:

“no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong” (Neyman & Pearson, 1933)

That is: We can’t know which specific decisions are right or wrong, but if we follow the rules, we can at least know how often our decisions will be wrong in the long run.

To understand the decision-making framework that Neyman and Pearson developed, we first need to discuss statistical decision-making in terms of the kinds of outcomes that can occur. There are two possible states of reality (H0 is true, or H0 is false), and two possible decisions (reject H0, or retain H0). There are two ways in which we can make a correct decision:

  • We can reject H0 when it is false (in the language of signal detection theory, we call this a hit)
  • We can retain H0 when it is true (somewhat confusingly in this context, this is called a correct rejection)

There are also two kinds of errors we can make:

  • We can reject H0 when it is actually true (we call this a false alarm, or Type I error). A Type I error means that we have concluded that there is a relationship in the population when in fact there is not. Type I errors occur because even when there is no relationship in the population, sampling error alone will occasionally produce an extreme result.
  • We can retain H0 when it is actually false (we call this a miss, or Type II error). A Type II error means that we have concluded that there is no relationship in the population when in fact there is.

Summing up, when you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in the following table:

Decision             H0 is true         H0 is false
Do not reject H0     Correct outcome    Type II error
Reject H0            Type I error       Correct outcome

Table 1. The four possible outcomes in hypothesis testing.

  • The decision is not to reject H0 when H0 is true (correct decision).
  • The decision is to reject H0 when H0 is true (incorrect decision known as a Type I error).
  • The decision is not to reject H0 when, in fact, H0 is false (incorrect decision known as a Type II error).
  • The decision is to reject H0 when H0 is false (correct decision).
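The four outcomes can also be written as a small lookup, which some readers find easier to remember than the table. This is only a sketch; the function name and the returned labels are ours.

```python
def classify_outcome(h0_is_true: bool, rejected_h0: bool) -> str:
    """Name the outcome for a given state of reality and a given decision."""
    if rejected_h0:
        # Rejecting a true H0 is a false alarm; rejecting a false H0 is a hit
        return "Type I error" if h0_is_true else "correct decision"
    # Retaining a true H0 is correct; retaining a false H0 is a miss
    return "correct decision" if h0_is_true else "Type II error"

for truth in (True, False):
    for rejected in (False, True):
        print(f"H0 true: {truth}, rejected: {rejected} -> {classify_outcome(truth, rejected)}")
```

In real research we never get to see the first argument, which is why the error rates discussed next are statements about the long run rather than about any single study.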

Neyman and Pearson coined two terms to describe the probability of these two types of errors in the long run:

  • P(Type I error) = α (alpha)
  • P(Type II error) = β (beta)

That is, if we set α (alpha) to .05, then in the long run we will mistakenly reject the null hypothesis 5% of the time when the null hypothesis is true. α is the significance level against which the p-value is compared, and again it is common to set α to .05. (This is why α is sometimes referred to as the “Type I error rate.”) In principle, it is possible to reduce the chance of a Type I error by setting α to something less than .05. Setting it to .01, for example, would mean that if the null hypothesis is true, then there is only a 1% chance of mistakenly rejecting it. But making it harder to reject true null hypotheses also makes it harder to reject false ones and therefore increases the chance of a Type II error.
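The long-run meaning of α can be illustrated by simulation. The sketch below runs many imaginary studies in which the null hypothesis is true by construction (every sample comes from a population with mean 0 and a known standard deviation of 1), applies a two-sided z-test to each, and reports the share of studies that rejected H0. The sample size, the number of studies, and the function name are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n=30, n_studies=4000, seed=1):
    """Share of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    rejections = 0
    for _ in range(n_studies):
        # H0 is true here: the population mean really is 0
        sample = [rng.gauss(0, 1) for _ in range(n)]
        # z = (sample mean - 0) / (sigma / sqrt(n)) with sigma = 1
        z_stat = (sum(sample) / n) * n ** 0.5
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_studies

print(type_i_error_rate(alpha=0.05))
print(type_i_error_rate(alpha=0.01))
```

Every rejection in this simulation is a Type I error, and the proportion of rejections should land close to whichever α was chosen.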

In practice, Type II errors occur primarily because the research design lacks adequate statistical power to detect the relationship (e.g., the sample is too small). Statistical power is the complement of the Type II error rate, that is, power = 1 − β. We will have more to say about statistical power shortly. The standard value for an acceptable level of β (beta) is .2; that is, we are willing to accept that 20% of the time we will fail to detect an effect that truly exists. It is possible to reduce the chance of a Type II error by setting α to something greater than .05 (e.g., .10). But making it easier to reject false null hypotheses also makes it easier to reject true ones and therefore increases the chance of a Type I error. This provides some insight into why the convention is to set α to .05. There is some agreement among researchers that this level of α keeps the rates of both Type I and Type II errors at acceptable levels.
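The link between α, β, and sample size can be illustrated with the power.t.test() function from R's built-in stats package. This is only a sketch: the two-sample t-test, the medium standardized effect of 0.5, and the group sizes of 30 and 64 are assumptions chosen for illustration, not values from this chapter.

```r
# Power (1 - beta) of a two-sample, two-sided t-test for an assumed effect size
power_at <- function(alpha, n = 30) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = alpha)$power
}

alphas <- c(0.01, 0.05, 0.10)
powers <- sapply(alphas, power_at)

# A looser alpha buys more power (a smaller beta) at the cost of more Type I errors
data.frame(alpha = alphas, power = round(powers, 3), beta = round(1 - powers, 3))

# A larger sample raises power without loosening alpha
power_n64 <- power_at(0.05, n = 64)
power_n64
```

Under these assumptions, 64 participants per group gives power of roughly .80, which corresponds to the conventional β of .2.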

The possibility of committing Type I and Type II errors has several important implications for interpreting the results of our own and others’ research. One is that we should be cautious about interpreting the results of any individual study because there is a chance that it reflects a Type I or Type II error. This is why researchers consider it important to replicate their studies. Each time researchers replicate a study and find a similar result, they rightly become more confident that the result represents a real phenomenon and not just a Type I or Type II error.

Test Statistic Assumptions

A last consideration, which we will revisit with each test statistic (e.g., t-test, z-test and ANOVA) in the coming chapters, is the set of assumptions behind the test. There are four main assumptions. These assumptions are often taken for granted when using the prescribed data for the course. In the real world, these assumptions would need to be examined, often by testing them with statistical software.

  • Assumption of random sampling. A sample is random when each person (or animal) in your population has an equal chance of being included in the sample; therefore selection of any individual happens by chance, rather than by choice. This reduces the chance that differences in materials, characteristics or conditions may bias results. Remember that random samples are more likely to be representative of the population, so researchers can be more confident interpreting the results. Note: there is no test that statistical software can perform which assures random sampling has occurred, but following good sampling techniques helps to ensure your samples are random.
  • Assumption of Independence. Statistical independence is a critical assumption for many statistical tests, including the 2-sample t-test and ANOVA. It is assumed that observations are independent of each other, but often this assumption is not met. Independence means the value of one observation does not influence or affect the value of other observations. Independent data items are not connected with one another in any way (unless you account for it in your study). Even the smallest dependence in your data can turn into heavily biased results (which may be undetectable) if you violate this assumption. Note: there is no test statistical software can perform that assures independence of the data, because this should be addressed during the research planning phase. Using a non-parametric test is often recommended if a researcher is concerned this assumption has been violated.
  • Assumption of Normality. Normality assumes that the continuous variables (dependent variable) used in the analysis are normally distributed. Normal distributions are symmetric around the center (the mean) and form a bell-shaped distribution. Normality is violated when sample data are skewed. With large enough sample sizes (n > 30) the violation of the normality assumption should not cause major problems (remember the central limit theorem) but there is a feature in most statistical software that can alert researchers to an assumption violation.
  • Assumption of Equal Variance. Variance refers to the spread of scores around the mean. Many statistical tests assume that although different samples can come from populations with different means, they have the same variance. Equality of variance (i.e., homogeneity of variance) is violated when variances across different groups or samples are significantly different. Note: there is a feature in most statistical software to test for this.
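As one illustration of the software checks mentioned for the last two assumptions, the sketch below uses two functions from R's built-in stats package: shapiro.test() for normality and var.test() for equality of two variances. R is our choice of software here, and the two groups of scores are simulated, so the group sizes, means, and standard deviations are assumptions made up for the example.

```r
set.seed(42)  # arbitrary seed, used only so the run can be repeated

# Two made-up groups of scores (assumed: n = 40 each, same spread, different means)
group_a <- rnorm(40, mean = 50, sd = 10)
group_b <- rnorm(40, mean = 55, sd = 10)

# Normality: Shapiro-Wilk test (H0: the scores come from a normal distribution)
normality_a <- shapiro.test(group_a)
normality_b <- shapiro.test(group_b)

# Equal variance: F test of two variances (H0: the ratio of the variances is 1)
equal_var <- var.test(group_a, group_b)

c(shapiro_a = normality_a$p.value,
  shapiro_b = normality_b$p.value,
  var_test  = equal_var$p.value)
```

In both tests a small p-value is read as evidence that the assumption is violated, while a large p-value simply means no violation was detected.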

We will use 4 main steps for hypothesis testing:

  • Step 1: State the hypotheses. Usually the hypotheses concern population parameters and predict the characteristics that a sample should have.
      • Null: The null hypothesis (H0) states that there is no difference, no effect, or no change between the population mean and the sample mean.
      • Alternative: The alternative hypothesis (H1 or HA) states that there is a difference or a change between the population and the sample. It is the opposite of the null hypothesis.
  • Step 2: Set criteria for a decision. In this step we must determine the boundary of our distribution at which the null hypothesis will be rejected. Researchers usually use either a 5% (.05) or a 1% (.01) critical boundary. Recall from our earlier story about Ronald Fisher that the lower the probability, the more confident he was that the Tea Lady was not guessing. We will apply this to z in the next chapter.
  • Step 3: Compare sample and population to decide if the hypothesis has support.
  • Step 4: Make a decision. When a researcher uses hypothesis testing, the individual is making a decision about whether the data collected is sufficient to state that the population parameters are significantly different.
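To make the decision criteria in the second step concrete, the sketch below converts the 5% and 1% boundaries into critical values. It assumes the test statistic follows a standard normal (z) distribution and uses R's built-in qnorm() function; both the choice of R and the choice of the z distribution are assumptions made for illustration.

```r
# Critical z values for the two conventional cutoffs
alphas <- c(0.05, 0.01)

critical <- data.frame(
  alpha      = alphas,
  one_tailed = qnorm(1 - alphas),      # all of alpha placed in one tail
  two_tailed = qnorm(1 - alphas / 2)   # alpha split evenly across both tails
)
round(critical, 3)
```

For α = .05 this gives the familiar boundaries of about 1.645 (one-tailed) and 1.96 (two-tailed); a sample result beyond the boundary falls in the region where the null hypothesis is rejected.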

Further considerations

1. The probability value is the probability of a result as extreme or more extreme given that the null hypothesis is true. It is the probability of the data given the null hypothesis. It is not the probability that the null hypothesis is false.

2. A low probability value indicates that the sample outcome (or one more extreme) would be very unlikely if the null hypothesis were true. We will learn more about assessing effect size later in this unit.

3. A non-significant outcome means that the data do not conclusively demonstrate that the null hypothesis is false. There is always a chance of error, and there are four possible outcomes associated with hypothesis testing.

  • It is important to take into account the assumptions for each test statistic.

Learning objectives

Having read the chapter, you should be able to:

  • Identify the components of a hypothesis test, including the parameter of interest, the null and alternative hypotheses, and the test statistic.
  • State the hypotheses and identify appropriate critical areas depending on how hypotheses are set up.
  • Describe the proper interpretations of a p-value as well as common misinterpretations.
  • Distinguish between the two types of error in hypothesis testing, and the factors that determine them.
  • Describe the main criticisms of null hypothesis statistical testing
  • Identify the purpose of effect size and power.

Exercises – Ch. 9

  • In your own words, explain what the null hypothesis is.
  • What are Type I and Type II Errors?
  • Why do we phrase null and alternative hypotheses with population parameters and not sample means?
  • If our null hypothesis is “H0: μ = 40”, what are the three possible alternative hypotheses?
  • Why do we state our hypotheses and decision criteria before we collect our data?
  • When and why do you calculate an effect size?

Answers to Odd-Numbered Exercises – Ch. 9

1. Your answer should include mention of the baseline assumption of no difference between the sample and the population.

3. Alpha is the significance level. It is the criterion we use when deciding whether to reject or fail to reject the null hypothesis, corresponding to a given proportion of the area under the normal distribution and a probability of finding extreme scores assuming the null hypothesis is true.

5. μ > 40; μ < 40; μ ≠ 40

7. We calculate effect size to determine the strength of the finding. Effect size should always be calculated when we have rejected the null hypothesis. Effect size can also be calculated for non-significant findings as a possible indicator of Type II error.

Introduction to Statistics for Psychology Copyright © 2021 by Alisa Beyer is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Chapter 9 Hypothesis Testing



Now that we’ve studied confidence intervals in Chapter 8 , let’s study another commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses. For example, in the upcoming “promotions” activity in Section 9.1 , you’ll study the data collected from a psychology study in the 1970s to investigate whether gender-based discrimination in promotion rates existed in the banking industry at the time of the study.

The good news is we’ve already covered many of the necessary concepts to understand hypothesis testing in Chapters 7 and 8 . We will expand further on these ideas here and also provide a general framework for understanding hypothesis tests. By understanding this general framework, you’ll be able to adapt it to many different scenarios.

The same can be said for confidence intervals. There is one general framework that applies to all confidence intervals, and the infer package was designed around it. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same.

We believe that this approach is much better for long-term learning than focusing on specific details for specific confidence intervals using theory-based approaches. As you’ll now see, we prefer this general framework for hypothesis tests as well.

If you’d like more practice or you’re curious to see how this framework applies to different scenarios, you can find fully worked-out examples for many common hypothesis tests and their corresponding confidence intervals in Appendix B. We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional theory-based methods like the \(t\)-test and normal-theory confidence intervals. You’ll see there that these traditional methods are just approximations for the computer-based methods we’ve been focusing on. However, they also require conditions to be met for their results to be valid. Computer-based methods using randomization, simulation, and bootstrapping have far fewer restrictions. Furthermore, they help develop your computational thinking, which is one big reason they are emphasized throughout this book.

Needed packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:

  • ggplot2 for data visualization
  • dplyr for data wrangling
  • tidyr for converting data to “tidy” format
  • readr for importing spreadsheet data into R
  • As well as the more advanced purrr , tibble , stringr , and forcats packages

If needed, read Section 1.3 for information on how to install and load R packages.

9.1 Promotions activity

Let’s start with an activity studying the effect of gender on promotions at a bank.

9.1.1 Does gender affect promotions at a bank?

Say you are working at a bank in the 1970s and you are submitting your résumé to apply for a promotion. Will your gender affect your chances of getting promoted? To answer this question, we’ll focus on data from a study published in the Journal of Applied Psychology in 1974. This data is also used in the OpenIntro series of statistics textbooks.

To begin the study, 48 bank supervisors were asked to assume the role of a hypothetical director of a bank with multiple branches. Every one of the bank supervisors was given a résumé and asked whether or not the candidate on the résumé was fit to be promoted to a new position in one of their branches.

However, these 48 résumés were identical in all respects except one: the name of the applicant at the top of the résumé. Of the supervisors, 24 were randomly given résumés with stereotypically “male” names, while 24 of the supervisors were randomly given résumés with stereotypically “female” names. Since only (binary) gender varied from résumé to résumé, researchers could isolate the effect of this variable on promotion rates.

While many people today (including us, the authors) disagree with such binary views of gender, it is important to remember that this study was conducted at a time when more nuanced views of gender were not as prevalent. Despite this imperfection, we decided to still use this example as we feel it presents ideas still relevant today about how we could study discrimination in the workplace.

The moderndive package contains the data on the 48 applicants in the promotions data frame. Let’s explore this data by looking at six randomly selected rows:

The variable id acts as an identification variable for all 48 rows, the decision variable indicates whether the applicant was selected for promotion or not, while the gender variable indicates the gender of the name used on the résumé. Recall that this data does not pertain to 24 actual men and 24 actual women, but rather 48 identical résumés of which 24 were assigned stereotypically “male” names and 24 were assigned stereotypically “female” names.

Let’s perform an exploratory data analysis of the relationship between the two categorical variables decision and gender . Recall that we saw in Subsection 2.8.3 that one way we can visualize such a relationship is by using a stacked barplot.

FIGURE 9.1: Barplot relating gender to promotion decision.

Observe in Figure 9.1 that it appears that résumés with female names were much less likely to be accepted for promotion. Let’s quantify these promotion rates by computing the proportion of résumés accepted for promotion for each group using the dplyr package for data wrangling. Note the use of the tally() function here which is a shortcut for summarize(n = n()) to get counts.

So of the 24 résumés with male names, 21 were selected for promotion, for a proportion of 21/24 = 0.875 = 87.5%. On the other hand, of the 24 résumés with female names, 14 were selected for promotion, for a proportion of 14/24 = 0.583 = 58.3%. Comparing these two rates of promotion, it appears that résumés with male names were selected for promotion at a rate 0.875 - 0.583 = 0.292 = 29.2 percentage points higher than résumés with female names. This is suggestive of an advantage for résumés with a male name on them.
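The counts and proportions just described can be reproduced with a few lines of R. The sketch below is not the chapter's own dplyr pipeline: so that it runs without any extra packages, it uses base R and a stand-in for the promotions data frame rebuilt from the counts stated above. The row order of that stand-in is an assumption; group_by() followed by tally() on the real data would give the same counts.

```r
# Stand-in for moderndive's promotions data, rebuilt from the reported counts:
# 21 of 24 "male" and 14 of 24 "female" résumés were promoted (row order assumed)
promotions <- data.frame(
  id       = 1:48,
  decision = rep(c("promoted", "not"), times = c(35, 13)),
  gender   = c(rep("male", 21), rep("female", 14),
               rep("male", 3),  rep("female", 10))
)

# Counts of each decision within each gender
counts <- table(promotions$gender, promotions$decision)
counts

# Proportion promoted within each gender, and the difference between them
rates    <- prop.table(counts, margin = 1)[, "promoted"]
obs_diff <- rates[["male"]] - rates[["female"]]
round(c(rates, difference = obs_diff), 3)
```

The last line returns 0.583 for female names, 0.875 for male names, and a difference of 0.292.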

The question is, however, does this provide conclusive evidence that there is gender discrimination in promotions at banks? Could a difference in promotion rates of 29.2% still occur by chance, even in a hypothetical world where no gender-based discrimination existed? In other words, what is the role of sampling variation in this hypothesized world? To answer this question, we’ll again rely on a computer to run simulations .

9.1.2 Shuffling once

First, try to imagine a hypothetical universe where no gender discrimination in promotions existed. In such a hypothetical universe, the gender of an applicant would have no bearing on their chances of promotion. Bringing things back to our promotions data frame, the gender variable would thus be an irrelevant label. If these gender labels were irrelevant, then we could randomly reassign them by “shuffling” them to no consequence!

To illustrate this idea, let’s narrow our focus to 6 arbitrarily chosen résumés of the 48 in Table 9.1 . The decision column shows that 3 résumés resulted in promotion while 3 didn’t. The gender column shows what the original gender of the résumé name was.

However, in our hypothesized universe of no gender discrimination, gender is irrelevant and thus it is of no consequence to randomly “shuffle” the values of gender . The shuffled_gender column shows one such possible random shuffling. Observe in the fourth column how the number of male and female names remains the same at 3 each, but they are now listed in a different order.

TABLE 9.1: One example of shuffling gender variable
résumé number | decision | gender | shuffled gender
1 | not | male | male
2 | not | female | male
3 | not | female | female
4 | promoted | male | female
5 | promoted | male | female
6 | promoted | female | male
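On a computer, one shuffle of this kind takes a single call to base R's sample() function, which returns a random reordering of its input when no other arguments are given. The sketch below re-enters the six résumés from Table 9.1 by hand; the seed is arbitrary, so the particular ordering it produces is just one of many possible shuffles and will generally differ from the one in the table.

```r
set.seed(9)  # arbitrary seed, used only so the run can be repeated

# The six résumés from Table 9.1
six <- data.frame(
  resume   = 1:6,
  decision = c("not", "not", "not", "promoted", "promoted", "promoted"),
  gender   = c("male", "female", "female", "male", "male", "female")
)

# Shuffle the gender labels; the decision column is left untouched
six$shuffled_gender <- sample(six$gender)
six

# The labels are reordered, not changed: still 3 "male" and 3 "female"
table(six$shuffled_gender)
```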

Again, such random shuffling of the gender label only makes sense in our hypothesized universe of no gender discrimination. How could we extend this shuffling of the gender variable to all 48 résumés by hand? One way would be by using a standard deck of 52 playing cards, which we display in Figure 9.2.

FIGURE 9.2: Standard deck of 52 playing cards.

Since half the cards are red (diamonds and hearts) and the other half are black (spades and clubs), by removing two red cards and two black cards, we would end up with 24 red cards and 24 black cards. After shuffling these 48 cards as seen in Figure 9.3 , we can flip the cards over one-by-one, assigning “male” for each red card and “female” for each black card.

FIGURE 9.3: Shuffling a deck of cards.

We’ve saved one such shuffling in the promotions_shuffled data frame of the moderndive package. If you compare the original promotions and the shuffled promotions_shuffled data frames, you’ll see that while the decision variable is identical, the gender variable has changed.

Let’s repeat the same exploratory data analysis we did for the original promotions data on our promotions_shuffled data frame. Let’s create a barplot visualizing the relationship between decision and the new shuffled gender variable and compare this to the original unshuffled version in Figure 9.4 .

FIGURE 9.4: Barplots of relationship of promotion with gender (left) and shuffled gender (right).

It appears that the gap between the “male names” and “female names” promotion rates has changed. Compared to the original data in the left barplot, the new “shuffled” data in the right barplot has promotion rates that are much more similar.

Let’s also compute the proportion of résumés accepted for promotion for each group:

So in this hypothetical universe of no discrimination, \(18/24 = 0.75 = 75\%\) of “male” résumés were selected for promotion. On the other hand, \(17/24 = 0.708 = 70.8\%\) of “female” résumés were selected for promotion.

Let’s next compare these two values. It appears that résumés with stereotypically male names were selected for promotion at a rate that was \(0.75 - 0.708 = 0.042 = 4.2\%\) different than résumés with stereotypically female names.

Observe how this difference in rates is not the same as the difference in rates of 0.292 = 29.2% we originally observed. This is once again due to sampling variation . How can we better understand the effect of this sampling variation? By repeating this shuffling several times!

9.1.3 Shuffling 16 times

We recruited 16 groups of our friends to repeat this shuffling exercise. They recorded these values in a shared spreadsheet ; we display a snapshot of the first 10 rows and 5 columns in Figure 9.5 .

FIGURE 9.5: Snapshot of shared spreadsheet of shuffling results (m for male, f for female).

For each of these 16 columns of shuffles , we computed the difference in promotion rates, and in Figure 9.6 we display their distribution in a histogram. We also mark the observed difference in promotion rate that occurred in real life of 0.292 = 29.2% with a dark line.

FIGURE 9.6: Distribution of shuffled differences in promotions.

Before we discuss the distribution of the histogram, we emphasize the key thing to remember: this histogram represents differences in promotion rates that one would observe in our hypothesized universe of no gender discrimination.

Observe first that the histogram is roughly centered at 0. Saying that the difference in promotion rates is 0 is equivalent to saying that both genders had the same promotion rate. In other words, the center of these 16 values is consistent with what we would expect in our hypothesized universe of no gender discrimination.

However, while the values are centered at 0, there is variation about 0. This is because even in a hypothesized universe of no gender discrimination, you will still likely observe small differences in promotion rates because of chance sampling variation . Looking at the histogram in Figure 9.6 , such differences could even be as extreme as -0.292 or 0.208.

Turning our attention to what we observed in real life: the difference of 0.292 = 29.2% is marked with a vertical dark line. Ask yourself: in a hypothesized world of no gender discrimination, how likely would it be that we observe this difference? While opinions here may differ, in our opinion not often! Now ask yourself: what do these results say about our hypothesized universe of no gender discrimination?
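The work our 16 groups of friends did by hand can be mimicked on a computer. The sketch below is only an approximation of that exercise: it uses base R, rebuilds the 48 decisions and gender labels from the counts reported in Section 9.1.1 (the row order is an assumption), and relies on an arbitrary seed, so its 16 differences will not match the ones in the shared spreadsheet. The helper diff_in_props() is a hypothetical name introduced for this example.

```r
set.seed(2024)  # arbitrary seed; the friends' actual shuffles are not reproduced

# Decisions and gender labels rebuilt from the reported counts (row order assumed)
decision <- rep(c("promoted", "not"), times = c(35, 13))
gender   <- c(rep("male", 21), rep("female", 14),
              rep("male", 3),  rep("female", 10))

# Hypothetical helper: promotion rate for "male" minus promotion rate for "female"
diff_in_props <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

obs_diff <- diff_in_props(decision, gender)  # observed difference, 0.292

# Shuffle the gender labels 16 times, recording the difference each time
shuffled_diffs <- replicate(16, diff_in_props(decision, sample(gender)))
sort(round(shuffled_diffs, 3))
```

Passing shuffled_diffs to hist() would draw a histogram in the spirit of Figure 9.6, and comparing its values with obs_diff is the same comparison the vertical dark line invites.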

9.1.4 What did we just do?

What we just demonstrated in this activity is the statistical procedure known as hypothesis testing using a permutation test . The term “permutation” is the mathematical term for “shuffling”: taking a series of values and reordering them randomly, as you did with the playing cards.

In fact, permutations are another form of resampling , like the bootstrap method you performed in Chapter 8 . While the bootstrap method involves resampling with replacement, permutation methods involve resampling without replacement.

Think of our exercise involving the slips of paper representing pennies and the hat in Section 8.1 : after sampling a penny, you put it back in the hat. Now think of our deck of cards. After drawing a card, you laid it out in front of you, recorded the color, and then you did not put it back in the deck.
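The contrast between the two kinds of resampling shows up directly in base R's sample() function through its replace argument. The six-card deck below is a made-up miniature used only for illustration.

```r
set.seed(1)  # arbitrary seed, used only so the run can be repeated

cards <- c("red", "red", "red", "black", "black", "black")

# Permutation: resampling WITHOUT replacement (the default) only reorders the cards
permuted <- sample(cards)

# Bootstrap: resampling WITH replacement can repeat some cards and omit others
bootstrapped <- sample(cards, replace = TRUE)

table(permuted)      # always 3 red and 3 black
table(bootstrapped)  # the red/black split can differ from 3 and 3
```

A permutation always keeps the original counts of each value, which is why shuffling leaves 24 “male” and 24 “female” labels in place, whereas a bootstrap resample need not.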

In our previous example, we tested the validity of the hypothesized universe of no gender discrimination. The evidence contained in our observed sample of 48 résumés was somewhat inconsistent with our hypothesized universe. Thus, we would be inclined to reject this hypothesized universe and declare that the evidence suggests there is gender discrimination.

Recall our case study on whether yawning is contagious from Section 8.6 . The previous example involves inference about an unknown difference of population proportions as well. This time, it will be \(p_{m} - p_{f}\) , where \(p_{m}\) is the population proportion of résumés with male names being recommended for promotion and \(p_{f}\) is the equivalent for résumés with female names. Recall that this is one of the scenarios for inference we’ve seen so far in Table 9.2 .

TABLE 9.2: Scenarios of sampling for inference
Scenario | Population parameter | Notation | Point estimate | Symbol(s)
1 | Population proportion | \(p\) | Sample proportion | \(\widehat{p}\)
2 | Population mean | \(\mu\) | Sample mean | \(\overline{x}\) or \(\widehat{\mu}\)
3 | Difference in population proportions | \(p_1 - p_2\) | Difference in sample proportions | \(\widehat{p}_1 - \widehat{p}_2\)

So, based on our sample of \(n_m\) = 24 “male” applicants and \(n_f\) = 24 “female” applicants, the point estimate for \(p_{m} - p_{f}\) is the difference in sample proportions \(\widehat{p}_{m} -\widehat{p}_{f}\) = 0.875 - 0.583 = 0.292 = 29.2%. This difference in favor of “male” résumés of 0.292 is greater than 0, suggesting discrimination in favor of men.

However, the question we asked ourselves was “is this difference meaningfully greater than 0?”. In other words, is that difference indicative of true discrimination, or can we just attribute it to sampling variation ? Hypothesis testing allows us to make such distinctions.

9.2 Understanding hypothesis tests

Much like the terminology, notation, and definitions relating to sampling you saw in Section 7.3, there is a lot of terminology, notation, and definitions related to hypothesis testing as well. Learning these may seem like a very daunting task at first. However, with practice, practice, and more practice, anyone can master them.

First, a hypothesis is a statement about the value of an unknown population parameter. In our résumé activity, our population parameter of interest is the difference in population proportions \(p_{m} - p_{f}\) . Hypothesis tests can involve any of the population parameters in Table 7.5 of the five inference scenarios we’ll cover in this book and also more advanced types we won’t cover here.

Second, a hypothesis test consists of a test between two competing hypotheses: (1) a null hypothesis \(H_0\) (pronounced “H-naught”) versus (2) an alternative hypothesis \(H_A\) (also denoted \(H_1\) ).

Generally the null hypothesis is a claim that there is “no effect” or “no difference of interest.” In many cases, the null hypothesis represents the status quo or a situation in which nothing interesting is happening. Furthermore, generally the alternative hypothesis is the claim the experimenter or researcher wants to establish or find evidence to support. It is viewed as a “challenger” hypothesis to the null hypothesis \(H_0\). In our résumé activity, an appropriate hypothesis test would be:

\[ \begin{aligned} H_0 &: \text{men and women are promoted at the same rate}\\ \text{vs } H_A &: \text{men are promoted at a higher rate than women} \end{aligned} \]

Note some of the choices we have made. First, we set the null hypothesis \(H_0\) to be that there is no difference in promotion rate and the “challenger” alternative hypothesis \(H_A\) to be that there is a difference. While it would not be wrong in principle to reverse the two, it is a convention in statistical inference that the null hypothesis is set to reflect a “null” situation where “nothing is going on.” As we discussed earlier, in this case, \(H_0\) corresponds to there being no difference in promotion rates. Furthermore, we set \(H_A\) to be that men are promoted at a higher rate, a subjective choice reflecting a prior suspicion we have that this is the case. We call such alternative hypotheses one-sided alternatives . If someone else however does not share such suspicions and only wants to investigate that there is a difference, whether higher or lower, they would set what is known as a two-sided alternative .

We can re-express the formulation of our hypothesis test using the mathematical notation for our population parameter of interest, the difference in population proportions \(p_{m} - p_{f}\) :

\[ \begin{aligned} H_0 &: p_{m} - p_{f} = 0\\ \text{vs } H_A&: p_{m} - p_{f} > 0 \end{aligned} \]

Observe how the alternative hypothesis \(H_A\) is one-sided with \(p_{m} - p_{f} > 0\) . Had we opted for a two-sided alternative, we would have set \(p_{m} - p_{f} \neq 0\) . To keep things simple for now, we’ll stick with the simpler one-sided alternative. We’ll present an example of a two-sided alternative in Section 9.5 .

Third, a test statistic is a point estimate/sample statistic formula used for hypothesis testing. Note that a sample statistic is merely a summary statistic based on a sample of observations. Recall we saw in Section 3.3 that a summary statistic takes in many values and returns only one. Here, the samples would be the \(n_m\) = 24 résumés with male names and the \(n_f\) = 24 résumés with female names. Hence, the point estimate of interest is the difference in sample proportions \(\widehat{p}_{m} - \widehat{p}_{f}\) .

Fourth, the observed test statistic is the value of the test statistic that we observed in real life. In our case, we computed this value using the data saved in the promotions data frame. It was the observed difference of \(\widehat{p}_{m} -\widehat{p}_{f} = 0.875 - 0.583 = 0.292 = 29.2\%\) in favor of résumés with male names.

Fifth, the null distribution is the sampling distribution of the test statistic assuming the null hypothesis \(H_0\) is true . Ooof! That’s a long one! Let’s unpack it slowly. The key to understanding the null distribution is that the null hypothesis \(H_0\) is assumed to be true. We’re not saying that \(H_0\) is true at this point, we’re only assuming it to be true for hypothesis testing purposes. In our case, this corresponds to our hypothesized universe of no gender discrimination in promotion rates. Assuming the null hypothesis \(H_0\) , also stated as “Under \(H_0\) ,” how does the test statistic vary due to sampling variation? In our case, how will the difference in sample proportions \(\widehat{p}_{m} - \widehat{p}_{f}\) vary due to sampling under \(H_0\) ? Recall from Subsection 7.3.2 that distributions displaying how point estimates vary due to sampling variation are called sampling distributions . The only additional thing to keep in mind about null distributions is that they are sampling distributions assuming the null hypothesis \(H_0\) is true .

In our case, we previously visualized a null distribution in Figure 9.6 , which we re-display in Figure 9.7 using our new notation and terminology. It is the distribution of the 16 differences in sample proportions our friends computed assuming a hypothetical universe of no gender discrimination. We also mark the value of the observed test statistic of 0.292 with a vertical line.

FIGURE 9.7: Null distribution and observed test statistic.

Sixth, the \(p\) -value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true . Double ooof! Let’s unpack this slowly as well. You can think of the \(p\) -value as a quantification of “surprise”: assuming \(H_0\) is true, how surprised are we with what we observed? Or in our case, in our hypothesized universe of no gender discrimination, how surprised are we that we observed a difference in promotion rates of 0.292 from our collected samples assuming \(H_0\) is true? Very surprised? Somewhat surprised?

The \(p\)-value quantifies this probability, or in the case of our 16 differences in sample proportions in Figure 9.7, what proportion had a result at least as “extreme”? Here, extreme is defined in terms of the alternative hypothesis \(H_A\) that “male” applicants are promoted at a higher rate than “female” applicants. In other words, how often was the discrimination in favor of men at least as pronounced as \(0.875 - 0.583 = 0.292 = 29.2\%\)?

In this case, 0 times out of 16, we obtained a difference in proportion greater than or equal to the observed difference of 0.292 = 29.2%. A very rare (in fact, not occurring) outcome! Given the rarity of such a pronounced difference in promotion rates in our hypothesized universe of no gender discrimination, we’re inclined to reject our hypothesized universe. Instead, we favor the hypothesis stating there is discrimination in favor of the “male” applicants. In other words, we reject \(H_0\) in favor of \(H_A\) .
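Sixteen shuffles give only a coarse picture of the null distribution. The sketch below repeats the same calculation with 1,000 computer shuffles in base R. It rests on several assumptions: the data are rebuilt from the counts reported in Section 9.1.1 (row order assumed), the number of shuffles and the seed are arbitrary, and diff_in_props() is a hypothetical helper introduced for this example rather than a function from any package.

```r
set.seed(76)  # arbitrary seed, used only so the run can be repeated

# Decisions and gender labels rebuilt from the reported counts (row order assumed)
decision <- rep(c("promoted", "not"), times = c(35, 13))
gender   <- c(rep("male", 21), rep("female", 14),
              rep("male", 3),  rep("female", 10))

# Hypothetical helper: the test statistic, p-hat (male) minus p-hat (female)
diff_in_props <- function(decision, gender) {
  mean(decision[gender == "male"] == "promoted") -
    mean(decision[gender == "female"] == "promoted")
}

obs_diff <- diff_in_props(decision, gender)  # observed test statistic, 0.292

# Null distribution: the test statistic under 1,000 random shuffles of gender
null_diffs <- replicate(1000, diff_in_props(decision, sample(gender)))

# One-sided p-value: share of shuffles at least as extreme as what we observed
# (the tiny tolerance guards against floating-point ties)
p_value <- mean(null_diffs >= obs_diff - 1e-12)
p_value
```

The value changes slightly from run to run because the shuffles are random, but with this many shuffles it should stay small, consistent with the 0 out of 16 seen above.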

Seventh and lastly, in many hypothesis testing procedures, it is commonly recommended to set the significance level of the test beforehand. It is denoted by the Greek letter \(\alpha\) (pronounced “alpha”). This value acts as a cutoff on the \(p\) -value, where if the \(p\) -value falls below \(\alpha\) , we would “reject the null hypothesis \(H_0\) .”

Alternatively, if the \(p\) -value does not fall below \(\alpha\) , we would “fail to reject \(H_0\) .” Note the latter statement is not quite the same as saying we “accept \(H_0\) .” This distinction is rather subtle and not immediately obvious. So we’ll revisit it later in Section 9.4 .

While different fields tend to use different values of \(\alpha\) , some commonly used values for \(\alpha\) are 0.1, 0.01, and 0.05; with 0.05 being the choice people often make without putting much thought into it. We’ll talk more about \(\alpha\) significance levels in Section 9.4 , but first let’s fully conduct the hypothesis test corresponding to our promotions activity using the infer package.

9.3 Conducting hypothesis tests

In Section 8.4 , we showed you how to construct confidence intervals. We first illustrated how to do this using dplyr data wrangling verbs and the rep_sample_n() function from Subsection 7.2.3 which we used as a virtual shovel. In particular, we constructed confidence intervals by resampling with replacement by setting the replace = TRUE argument to the rep_sample_n() function.

We then showed you how to perform the same task using the infer package workflow. While both workflows resulted in the same bootstrap distribution from which we can construct confidence intervals, the infer package workflow emphasizes each of the steps in the overall process in Figure 9.8 . It does so using function names that are intuitively named with verbs:

  • specify() the variables of interest in your data frame.
  • generate() replicates of bootstrap resamples with replacement.
  • calculate() the summary statistic of interest.
  • visualize() the resulting bootstrap distribution and confidence interval.

FIGURE 9.8: Confidence intervals with the infer package.

In this section, we’ll now show you how to seamlessly modify the previously seen infer code for constructing confidence intervals to conduct hypothesis tests. You’ll notice that the basic outline of the workflow is almost identical, except for an additional hypothesize() step between the specify() and generate() steps, as can be seen in Figure 9.9 .

FIGURE 9.9: Hypothesis testing with the infer package.

Furthermore, we’ll use a pre-specified significance level \(\alpha\) = 0.05 for this hypothesis test. Let’s leave discussion on the choice of this \(\alpha\) value until later on in Section 9.4 .

9.3.1 infer package workflow

1. specify variables.

Recall that we use the specify() verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, since we are interested in any potential effects of gender on promotion decisions, we set decision as the response variable and gender as the explanatory variable. We do so using formula = response ~ explanatory where response is the name of the response variable in the data frame and explanatory is the name of the explanatory variable. So in our case it is decision ~ gender .

Furthermore, since we are interested in the proportion of résumés "promoted" , and not the proportion of résumés not promoted, we set the argument success to "promoted" .
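As a sketch of what this step might look like, assuming the promotions data frame is loaded from the moderndive package and that infer is installed (the name promotions_specify is our own choice for illustration):

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)       # provides specify() and re-exports the %>% pipe

# Declare decision as the response and gender as the explanatory variable,
# treating "promoted" as the success level of decision
promotions_specify <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted")

promotions_specify
```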

Again, notice how the promotions data itself doesn’t change, but the Response: decision (factor) and Explanatory: gender (factor) meta-data do. This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Section 3.4 .

2. hypothesize the null

In order to conduct hypothesis tests using the infer workflow, we need a new step not present for confidence intervals: hypothesize() . Recall from Section 9.2 that our hypothesis test was

\[ \begin{aligned} H_0 &: p_{m} - p_{f} = 0\\ \text{vs. } H_A&: p_{m} - p_{f} > 0 \end{aligned} \]

In other words, the null hypothesis \(H_0\) corresponding to our “hypothesized universe” stated that there was no difference in promotion rates between résumés with “male” and “female” names, that is, no gender-based discrimination. We set this null hypothesis \(H_0\) in our infer workflow using the null argument of the hypothesize() function to either:

  • "point" for hypotheses involving a single sample or
  • "independence" for hypotheses involving two samples.

In our case, since we have two samples (the résumés with “male” and “female” names), we set null = "independence" .
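Continuing the sketch under the same assumptions (promotions from moderndive, infer installed; promotions_hypothesize is an illustrative name), the hypothesize() step could be added like this:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

# Same specify() step as before, followed by the null hypothesis of
# independence between decision and gender
promotions_hypothesize <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence")

promotions_hypothesize
```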

Again, the data has not changed yet. This will occur at the upcoming generate() step; we’re merely setting meta-data for now.

Where do the terms "point" and "independence" come from? These are two technical statistical terms. The term “point” relates to the fact that for a single group of observations, you will test the value of a single point. Going back to the pennies example from Chapter 8, say we wanted to test if the mean year of all US pennies was equal to 1993 or not. We would be testing the value of a “point” \(\mu\), the mean year of all US pennies, as follows

\[ \begin{aligned} H_0 &: \mu = 1993\\ \text{vs. } H_A&: \mu \neq 1993 \end{aligned} \]

The term “independence” relates to the fact that for two groups of observations, you are testing whether or not the response variable is independent of the explanatory variable that assigns the groups. In our case, we are testing whether the decision response variable is “independent” of the explanatory variable gender that assigns each résumé to either of the two groups.

3. generate replicates

After we hypothesize() the null hypothesis, we generate() replicates of “shuffled” datasets assuming the null hypothesis is true. We do this by repeating the shuffling exercise you performed in Section 9.1 several times. Instead of merely doing it 16 times as our groups of friends did, let’s use the computer to repeat this 1000 times by setting reps = 1000 in the generate() function. However, unlike for confidence intervals where we generated replicates using type = "bootstrap" resampling with replacement, we’ll now perform shuffles/permutations by setting type = "permute" . Recall that shuffles/permutations are a kind of resampling, but unlike the bootstrap method, they involve resampling without replacement.
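A sketch of the generate() step, under the same assumptions as before. The seed value is arbitrary and only there so that the shuffles can be reproduced:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

# 1000 shuffled (permuted) copies of the 48 résumés, generated under H0
promotions_generate <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")

nrow(promotions_generate)  # 1000 replicates of 48 rows each
```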

Observe that the resulting data frame has 48,000 rows. This is because we performed shuffles/permutations for each of the 48 rows 1000 times and \(48,000 = 1000 \cdot 48\) . If you explore the promotions_generate data frame with View() , you’ll notice that the variable replicate indicates which resample each row belongs to. So it has the value 1 48 times, the value 2 48 times, all the way through to the value 1000 48 times.

4. calculate summary statistics

Now that we have generated 1000 replicates of “shuffles” assuming the null hypothesis is true, let’s calculate() the appropriate summary statistic for each of our 1000 shuffles. From Section 9.2 , point estimates related to hypothesis testing have a specific name: test statistics . Since the unknown population parameter of interest is the difference in population proportions \(p_{m} - p_{f}\) , the test statistic here is the difference in sample proportions \(\widehat{p}_{m} - \widehat{p}_{f}\) .

For each of our 1000 shuffles, we can calculate this test statistic by setting stat = "diff in props" . Furthermore, since we are interested in \(\widehat{p}_{m} - \widehat{p}_{f}\) we set order = c("male", "female") . As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly.

Let’s save the result in a data frame called null_distribution :
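Putting the four steps together, a sketch of the full pipeline might look as follows (again assuming promotions comes from moderndive; the seed is arbitrary):

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

null_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  # One difference in sample proportions (male minus female) per shuffle
  calculate(stat = "diff in props", order = c("male", "female"))

null_distribution
```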

Observe that we have 1000 values of stat , each representing one instance of \(\widehat{p}_{m} - \widehat{p}_{f}\) in a hypothesized world of no gender discrimination. Observe as well that we chose the name of this data frame carefully: null_distribution . Recall once again from Section 9.2 that sampling distributions when the null hypothesis \(H_0\) is assumed to be true have a special name: the null distribution .

What was the observed difference in promotion rates? In other words, what was the observed test statistic \(\widehat{p}_{m} - \widehat{p}_{f}\) ? Recall from Section 9.1 that we computed this observed difference by hand to be 0.875 - 0.583 = 0.292 = 29.2%. We can also compute this value using the previous infer code but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_prop :
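A sketch of that shortened pipeline, under the same assumptions as the earlier code:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

# No hypothesize() or generate(): we summarize the observed data directly
obs_diff_prop <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

obs_diff_prop
```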

5. visualize the p-value

The final step is to measure how surprised we are by a promotion difference of 29.2% in a hypothesized universe of no gender discrimination. If the observed difference of 0.292 is highly unlikely, then we would be inclined to reject the validity of our hypothesized universe.

We start by visualizing the null distribution of our 1000 values of \(\widehat{p}_{m} - \widehat{p}_{f}\) using visualize() in Figure 9.10 . Recall that these are values of the difference in promotion rates assuming \(H_0\) is true. This corresponds to being in our hypothesized universe of no gender discrimination.
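A sketch of how such a plot could be produced. The choice of bins = 10 and the object name null_plot are our own assumptions, not requirements of infer:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

null_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Histogram of the 1000 simulated differences in promotion rates
null_plot <- visualize(null_distribution, bins = 10)
null_plot
```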

FIGURE 9.10: Null distribution.

Let’s now add what happened in real life to Figure 9.10 , the observed difference in promotion rates of 0.875 - 0.583 = 0.292 = 29.2%. However, instead of merely adding a vertical line using geom_vline() , let’s use the shade_p_value() function with obs_stat set to the observed test statistic value we saved in obs_diff_prop .

Furthermore, we’ll set direction = "right" to reflect our alternative hypothesis \(H_A: p_{m} - p_{f} > 0\), which states that there is a difference in promotion rates in favor of résumés with male names. “More extreme” here corresponds to differences that are “bigger” or “more positive” or “more to the right.”

On the other hand, had our alternative hypothesis \(H_A\) been the other possible one-sided alternative \(p_{m} - p_{f} < 0\) , suggesting discrimination in favor of résumés with female names, we would’ve set direction = "left" . Had our alternative hypothesis \(H_A\) been two-sided \(p_{m} - p_{f} \neq 0\) , suggesting discrimination in either direction, we would’ve set direction = "both" .
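A sketch of the shaded plot for our one-sided alternative, assuming the same data and packages as before (bins = 10 and the name shaded_plot are illustrative choices):

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

null_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

obs_diff_prop <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Shade everything at or to the right of the observed difference
shaded_plot <- visualize(null_distribution, bins = 10) +
  shade_p_value(obs_stat = obs_diff_prop, direction = "right")
shaded_plot
```

Swapping direction = "right" for "left" or "both" would shade the regions matching the other two alternatives described above.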

FIGURE 9.11: Shaded histogram to show \(p\) -value.

In the resulting Figure 9.11, the solid dark line marks 0.292 = 29.2%. However, what does the shaded region correspond to? This is the \(p\)-value. Recall the definition of the \(p\)-value from Section 9.2:

A \(p\) -value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true .

So judging by the shaded region in Figure 9.11 , it seems we would somewhat rarely observe differences in promotion rates of 0.292 = 29.2% or more in a hypothesized universe of no gender discrimination. In other words, the \(p\) -value is somewhat small. Hence, we would be inclined to reject this hypothesized universe, or using statistical language we would “reject \(H_0\) .”

What fraction of the null distribution is shaded? In other words, what is the exact value of the \(p\) -value? We can compute it using the get_p_value() function with the same arguments as the previous shade_p_value() code:
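A sketch of this computation under the same assumptions as before. Because the shuffles are random, the exact value depends on the seed, so we do not quote an output here:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

null_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

obs_diff_prop <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Fraction of simulated differences at least as large as the observed one
p_value <- null_distribution %>%
  get_p_value(obs_stat = obs_diff_prop, direction = "right")
p_value
```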

Keeping the definition of a \(p\)-value in mind, the probability of observing a difference in promotion rates at least as large as 0.292 = 29.2% due to sampling variation alone in the null distribution is 0.027 = 2.7%. Since this \(p\)-value is smaller than our pre-specified significance level \(\alpha\) = 0.05, we reject the null hypothesis \(H_0: p_{m} - p_{f} = 0\). In other words, this \(p\)-value is sufficiently small to reject our hypothesized universe of no gender discrimination. We instead have enough evidence to change our mind in favor of gender discrimination being a likely culprit here. Observe that whether we reject the null hypothesis \(H_0\) or not depends in large part on our choice of significance level \(\alpha\). We’ll discuss this more in Subsection 9.4.3.

9.3.2 Comparison with confidence intervals

One of the great things about the infer package is that we can jump seamlessly between conducting hypothesis tests and constructing confidence intervals with minimal changes! Recall the code from the previous section that creates the null distribution, which in turn is needed to compute the \(p\) -value:

To create the corresponding bootstrap distribution needed to construct a 95% confidence interval for \(p_{m} - p_{f}\) , we only need to make two changes. First, we remove the hypothesize() step since we are no longer assuming a null hypothesis \(H_0\) is true. We can do this by deleting or commenting out the hypothesize() line of code. Second, we switch the type of resampling in the generate() step to be "bootstrap" instead of "permute" .
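A sketch of the modified pipeline with those two changes applied, again assuming promotions comes from moderndive (the seed is arbitrary):

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

bootstrap_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  # Change 1: the hypothesize() step is removed
  # hypothesize(null = "independence") %>%
  # Change 2: resample with replacement instead of shuffling
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

bootstrap_distribution
```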

Using this bootstrap_distribution , let’s first compute the percentile-based confidence intervals, as we did in Section 8.4 :
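A sketch of the percentile method, under the same assumptions as before. The endpoints vary slightly with the seed, so they may not match the values quoted in the text exactly:

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

bootstrap_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Middle 95% of the bootstrap distribution
percentile_ci <- bootstrap_distribution %>%
  get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
```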

Using our shorthand interpretation for 95% confidence intervals from Subsection 8.5.2, we are 95% “confident” that the true difference in population proportions \(p_{m} - p_{f}\) is between 0.044 and 0.539. Let’s visualize bootstrap_distribution and this percentile-based 95% confidence interval for \(p_{m} - p_{f}\) in Figure 9.12.

FIGURE 9.12: Percentile-based 95% confidence interval.

Notice a key value that is not included in the 95% confidence interval for \(p_{m} - p_{f}\) : the value 0. In other words, a difference of 0 is not included in our net, suggesting that \(p_{m}\) and \(p_{f}\) are truly different! Furthermore, observe how the entirety of the 95% confidence interval for \(p_{m} - p_{f}\) lies above 0, suggesting that this difference is in favor of men.

Since the bootstrap distribution appears to be roughly normally shaped, we can also use the standard error method as we did in Section 8.4 . In this case, we must specify the point_estimate argument as the observed difference in promotion rates 0.292 = 29.2% saved in obs_diff_prop . This value acts as the center of the confidence interval.
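A sketch of the standard error method, under the same assumptions as the earlier code (se_ci is an illustrative name):

```r
library(moderndive)  # assumed source of the promotions data frame
library(infer)

set.seed(76)  # arbitrary seed, an assumption made for reproducibility

bootstrap_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

obs_diff_prop <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Interval centered at the observed difference, with a width based on the
# standard deviation of the bootstrap distribution
se_ci <- bootstrap_distribution %>%
  get_confidence_interval(level = 0.95, type = "se",
                          point_estimate = obs_diff_prop)
se_ci
```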

Let’s visualize bootstrap_distribution again, but now with the standard error-based 95% confidence interval for \(p_{m} - p_{f}\) in Figure 9.13. Again, notice how the value 0 is not included in our confidence interval, again suggesting that \(p_{m}\) and \(p_{f}\) are truly different!

FIGURE 9.13: Standard error-based 95% confidence interval.

Learning check

(LC9.1) Why does the following code produce an error? In other words, what about the response and predictor variables makes this computation impossible with the infer package?

(LC9.2) Why are we relatively confident that the distributions of the sample proportions will be good approximations of the population distributions of promotion proportions for the two genders?

(LC9.3) Using the definition of p-value , write in words what the \(p\) -value represents for the hypothesis test comparing the promotion rates for males and females.

9.3.3 “There is only one test”

Let’s recap the steps necessary to conduct a hypothesis test using the terminology, notation, and definitions related to sampling you saw in Section 9.2 and the infer workflow from Subsection 9.3.1 :

  • hypothesize() the null hypothesis \(H_0\) . In other words, set a “model for the universe” assuming \(H_0\) is true.
  • generate() shuffles assuming \(H_0\) is true. In other words, simulate data assuming \(H_0\) is true.
  • calculate() the test statistic of interest, both for the observed data and your simulated data.
  • visualize() the resulting null distribution and compute the \(p\) -value by comparing the null distribution to the observed test statistic.
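To see that these steps do not depend on infer, here is a minimal base R sketch of the same recipe. The helper names one_test() and diff_in_props() are hypothetical, and the data are rebuilt by hand assuming 24 résumés per group with 21 and 14 promotions, which matches the rates 0.875 and 0.583 seen earlier:

```r
# Hypothetical helper: simulate the test statistic under H0 and compare
one_test <- function(observed_stat, simulate_null_stat, reps = 1000) {
  # Test statistics from `reps` datasets simulated assuming H0 is true
  null_stats <- replicate(reps, simulate_null_stat())
  # One-sided p-value: share of simulated statistics at least as large
  mean(null_stats >= observed_stat)
}

# Assumed layout: 24 "male" and 24 "female" résumés, 21 and 14 promoted
gender   <- rep(c("male", "female"), each = 24)
promoted <- c(rep(c(TRUE, FALSE), times = c(21, 3)),
              rep(c(TRUE, FALSE), times = c(14, 10)))

# Test statistic: difference in sample proportions, male minus female
diff_in_props <- function(promoted, gender) {
  mean(promoted[gender == "male"]) - mean(promoted[gender == "female"])
}

set.seed(2024)  # arbitrary seed, an assumption made for reproducibility
obs_stat <- diff_in_props(promoted, gender)
p_value <- one_test(
  observed_stat      = obs_stat,
  # Shuffling the gender labels simulates a universe with no discrimination
  simulate_null_stat = function() diff_in_props(promoted, sample(gender))
)
p_value
```

Only the test statistic and the way we simulate data under \(H_0\) are specific to the problem; the comparison step stays the same for any test.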

While this is a lot to digest, especially the first time you encounter hypothesis testing, the nice thing is that once you understand this general framework, then you can understand any hypothesis test. In a famous blog post, computer scientist Allen Downey called this the “There is only one test” framework, for which he created the flowchart displayed in Figure 9.14 .

FIGURE 9.14: Allen Downey’s hypothesis testing framework.

Notice its similarity with the “hypothesis testing with infer” diagram you saw in Figure 9.9. That’s because the infer package was explicitly designed to match the “There is only one test” framework. So if you can understand the framework, you can easily generalize these ideas to all hypothesis testing scenarios, whether for population proportions \(p\), population means \(\mu\), differences in population proportions \(p_1 - p_2\), differences in population means \(\mu_1 - \mu_2\), or, as you’ll see in Chapter 10 on inference for regression, population regression slopes \(\beta_1\). In fact, the framework applies even more generally than these examples, to more complicated hypothesis tests and test statistics as well.

(LC9.4) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between the promotion rate of males and females using this study.

9.4 Interpreting hypothesis tests

Interpreting the results of hypothesis tests is one of the more challenging aspects of this method for statistical inference. In this section, we’ll focus on ways to help with deciphering the process and address some common misconceptions.

9.4.1 Two possible outcomes

In Section 9.2 , we mentioned that given a pre-specified significance level \(\alpha\) there are two possible outcomes of a hypothesis test:

  • If the \(p\) -value is less than \(\alpha\) , then we reject the null hypothesis \(H_0\) in favor of \(H_A\) .
  • If the \(p\) -value is greater than or equal to \(\alpha\) , we fail to reject the null hypothesis \(H_0\) .

Unfortunately, the latter result is often misinterpreted as “accepting the null hypothesis \(H_0\) .” While at first glance it may seem that the statements “failing to reject \(H_0\) ” and “accepting \(H_0\) ” are equivalent, there actually is a subtle difference. Saying that we “accept the null hypothesis \(H_0\) ” is equivalent to stating that “we think the null hypothesis \(H_0\) is true.” However, saying that we “fail to reject the null hypothesis \(H_0\) ” is saying something else: “While \(H_0\) might still be false, we don’t have enough evidence to say so.” In other words, there is an absence of enough proof. However, the absence of proof is not proof of absence.

To further shed light on this distinction, let’s use the United States criminal justice system as an analogy. A criminal trial in the United States is a similar situation to hypothesis tests whereby a choice between two contradictory claims must be made about a defendant who is on trial:

  • The defendant is truly either “innocent” or “guilty.”
  • The defendant is presumed “innocent until proven guilty.”
  • The defendant is found guilty only if there is strong evidence that the defendant is guilty. The phrase “beyond a reasonable doubt” is often used as a guideline for determining a cutoff for when enough evidence exists to find the defendant guilty.
  • The defendant is found to be either “not guilty” or “guilty” in the ultimate verdict.

In other words, not guilty verdicts are not suggesting the defendant is innocent , but instead that “while the defendant may still actually be guilty, there wasn’t enough evidence to prove this fact.” Now let’s make the connection with hypothesis tests:

  • Either the null hypothesis \(H_0\) or the alternative hypothesis \(H_A\) is true.
  • Hypothesis tests are conducted assuming the null hypothesis \(H_0\) is true.
  • We reject the null hypothesis \(H_0\) in favor of \(H_A\) only if the evidence found in the sample suggests that \(H_A\) is true. The significance level \(\alpha\) is used as a guideline to set the threshold on just how strong of evidence we require.
  • We ultimately decide to either “fail to reject \(H_0\) ” or “reject \(H_0\) .”

So while gut instinct may suggest “failing to reject \(H_0\)” and “accepting \(H_0\)” are equivalent statements, they are not. “Accepting \(H_0\)” is equivalent to finding a defendant innocent. However, courts do not find defendants “innocent,” but rather they find them “not guilty.” Putting things differently, defense attorneys do not need to prove that their clients are innocent; rather, they only need to show that their clients have not been proven “guilty beyond a reasonable doubt.”

So going back to our résumés activity in Section 9.3, recall that our hypothesis test was \(H_0: p_{m} - p_{f} = 0\) versus \(H_A: p_{m} - p_{f} > 0\) and that we used a pre-specified significance level of \(\alpha\) = 0.05. We found a \(p\)-value of 0.027. Since the \(p\)-value was smaller than \(\alpha\) = 0.05, we rejected \(H_0\). In other words, we found the level of evidence needed in this particular sample to say that \(H_0\) is false at the \(\alpha\) = 0.05 significance level. We also state this conclusion using non-statistical language: we found enough evidence in this data to suggest that there was gender discrimination at play.
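The decision rule itself is simple enough to write down as a tiny base R sketch. The helper name decide() is hypothetical; the values 0.027 and 0.05 are the ones from our résumés activity:

```r
# Hypothetical helper encoding the two possible outcomes of a test
decide <- function(p_value, alpha) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

decision_05 <- decide(p_value = 0.027, alpha = 0.05)
decision_01 <- decide(p_value = 0.027, alpha = 0.01)

decision_05  # "reject H0"
decision_01  # "fail to reject H0"
```

The second call shows that the very same \(p\)-value would not have led to a rejection had we pre-specified the stricter level \(\alpha\) = 0.01.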

9.4.2 Types of errors

Unfortunately, there is some chance a jury or a judge can make an incorrect decision in a criminal trial by reaching the wrong verdict. For example, a truly innocent defendant may be found “guilty,” or a truly guilty defendant may be found “not guilty.” This can often stem from the fact that prosecutors don’t have access to all the relevant evidence, but instead are limited to whatever evidence the police can find.

The same holds for hypothesis tests. We can make incorrect decisions about a population parameter because we only have a sample of data from the population and thus sampling variation can lead us to incorrect conclusions.

There are two possible erroneous conclusions in a criminal trial: either (1) a truly innocent person is found guilty or (2) a truly guilty person is found not guilty. Similarly, there are two possible errors in a hypothesis test: either (1) rejecting \(H_0\) when in fact \(H_0\) is true, called a Type I error or (2) failing to reject \(H_0\) when in fact \(H_0\) is false, called a Type II error . Another term used for “Type I error” is “false positive,” while another term for “Type II error” is “false negative.”

This risk of error is the price researchers pay for basing inference on a sample instead of performing a census on the entire population. But as we’ve seen in our numerous examples and activities so far, censuses are often very expensive and other times impossible, and thus researchers have no choice but to use a sample. Thus in any hypothesis test based on a sample, we have no choice but to tolerate some chance that a Type I error will be made and some chance that a Type II error will occur.

To help understand the concepts of Type I and Type II errors, we apply these terms to our criminal justice analogy in Figure 9.15.

FIGURE 9.15: Type I and Type II errors in criminal trials.

Thus a Type I error corresponds to incorrectly putting a truly innocent person in jail, whereas a Type II error corresponds to letting a truly guilty person go free. Let’s show the corresponding table in Figure 9.16 for hypothesis tests.

FIGURE 9.16: Type I and Type II errors in hypothesis tests.

9.4.3 How do we choose alpha?

If we are using a sample to make inferences about a population, we run the risk of making errors. For confidence intervals, a corresponding “error” would be constructing a confidence interval that does not contain the true value of the population parameter. For hypothesis tests, this would be making either a Type I or Type II error. Obviously, we want to minimize the probability of either error; we want a small probability of making an incorrect conclusion:

  • The probability of a Type I Error occurring is denoted by \(\alpha\) . The value of \(\alpha\) is called the significance level of the hypothesis test, which we defined in Section 9.2 .
  • The probability of a Type II Error is denoted by \(\beta\) . The value of \(1-\beta\) is known as the power of the hypothesis test.

In other words, \(\alpha\) corresponds to the probability of incorrectly rejecting \(H_0\) when in fact \(H_0\) is true. On the other hand, \(\beta\) corresponds to the probability of incorrectly failing to reject \(H_0\) when in fact \(H_0\) is false.

Ideally, we want \(\alpha = 0\) and \(\beta = 0\), meaning that the chance of making either error is 0. However, this can never be the case in any situation where we are sampling for inference. There will always be the possibility of making either error when we use sample data. Furthermore, all other things being equal, these two error probabilities are inversely related: as the probability of a Type I error goes down, the probability of a Type II error goes up.

What is typically done in practice is to fix the probability of a Type I error by pre-specifying a significance level \(\alpha\) and then try to minimize \(\beta\) . In other words, we will tolerate a certain fraction of incorrect rejections of the null hypothesis \(H_0\) , and then try to minimize the fraction of incorrect non-rejections of \(H_0\) .

So for example if we used \(\alpha\) = 0.01, we would be using a hypothesis testing procedure that in the long run would incorrectly reject the null hypothesis \(H_0\) one percent of the time. This is analogous to setting the confidence level of a confidence interval.
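We can check this long-run behavior with a small simulation. This is only a sketch: for brevity it uses base R's Welch two-sample t.test() on simulated normal data rather than the permutation approach of this chapter, and the sample sizes, seed, and number of repetitions are arbitrary assumptions:

```r
set.seed(9)  # arbitrary seed, an assumption made for reproducibility

alpha   <- 0.01
n_tests <- 2000

# Each repetition draws two samples from the SAME population, so H0 is
# true by construction and every rejection is a Type I error
p_values <- replicate(n_tests, {
  group_1 <- rnorm(30, mean = 0, sd = 1)
  group_2 <- rnorm(30, mean = 0, sd = 1)
  t.test(group_1, group_2)$p.value
})

# Fraction of tests that incorrectly rejected H0; should be near alpha
type_1_rate <- mean(p_values < alpha)
type_1_rate
```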

So what value should you use for \(\alpha\) ? Different fields have different conventions, but some commonly used values include 0.10, 0.05, 0.01, and 0.001. However, it is important to keep in mind that if you use a relatively small value of \(\alpha\) , then all things being equal, \(p\) -values will have a harder time being less than \(\alpha\) . Thus we would reject the null hypothesis less often. In other words, we would reject the null hypothesis \(H_0\) only if we have very strong evidence to do so. This is known as a “conservative” test.

On the other hand, if we used a relatively large value of \(\alpha\) , then all things being equal, \(p\) -values will have an easier time being less than \(\alpha\) . Thus we would reject the null hypothesis more often. In other words, we would reject the null hypothesis \(H_0\) even if we only have mild evidence to do so. This is known as a “liberal” test.
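The following sketch illustrates this trade-off by simulation. As an assumption for illustration only, it uses base R's t.test() on simulated normal data with a true difference in means of 0.5, so that \(H_0\) is false and every non-rejection is a Type II error:

```r
set.seed(10)  # arbitrary seed, an assumption made for reproducibility

n_tests <- 1000

# H0 is false by construction: the population means differ by 0.5
p_values <- replicate(n_tests, {
  group_1 <- rnorm(30, mean = 0.5, sd = 1)
  group_2 <- rnorm(30, mean = 0,   sd = 1)
  t.test(group_1, group_2)$p.value
})

# Compare the same p-values against increasingly conservative cutoffs
alphas    <- c(0.10, 0.05, 0.01, 0.001)
est_power <- sapply(alphas, function(alpha) mean(p_values < alpha))

# Estimated power (1 - beta) and estimated Type II error rate (beta)
data.frame(alpha = alphas, est_power = est_power, est_beta = 1 - est_power)
```

Since a \(p\)-value below a small \(\alpha\) is also below any larger \(\alpha\), the estimated power can only decrease as we move down the table, which means the estimated \(\beta\) can only increase.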

(LC9.5) What is wrong about saying, “The defendant is innocent.” based on the US system of criminal trials?

(LC9.6) What is the purpose of hypothesis testing?

(LC9.7) What are some flaws with hypothesis testing? How could we alleviate them?

(LC9.8) Consider two \(\alpha\) significance levels of 0.1 and 0.01. Of the two, which would lead to a more liberal hypothesis testing procedure? In other words, one that will, all things being equal, lead to more rejections of the null hypothesis \(H_0\) .

9.5 Case study: Are action or romance movies rated higher?

Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?”. IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.

9.5.1 IMDb ratings data

The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDb.com.

We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package. If you’re curious, you can look at the necessary data wrangling code to do this on GitHub .

The variables include the title and year the movie was filmed. Furthermore, we have a numerical variable rating , which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We are interested in whether Action or Romance movies got a higher rating on average.
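A quick way to inspect these variables, sketched here assuming the moderndive and dplyr packages are installed:

```r
library(moderndive)  # provides the movies_sample data frame
library(dplyr)       # provides glimpse()

# One row per movie, with the variables described above
glimpse(movies_sample)

n_movies <- nrow(movies_sample)
n_movies
```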

Let’s perform an exploratory data analysis of this data. Recall from Subsection 2.7.1 that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section 2.6 would be to use a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in Figure 9.17 .
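A sketch of how such a boxplot could be drawn with ggplot2 (the axis label and the name rating_boxplot are our own choices):

```r
library(moderndive)  # provides the movies_sample data frame
library(ggplot2)

# Categorical genre on the x-axis, numerical rating on the y-axis
rating_boxplot <- ggplot(data = movies_sample,
                         mapping = aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(y = "IMDb rating")
rating_boxplot
```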

FIGURE 9.17: Boxplot of IMDb rating vs. genre.

Eyeballing Figure 9.17, romance movies have a higher median sample rating. Do we have reason to believe, however, that there is a significant difference between the mean rating for action movies compared to romance movies? It’s hard to say just based on this plot, since there is a large amount of overlap between the boxes. Recall also that the median isn’t necessarily the same as the mean, depending on whether the distribution is skewed.

Let’s calculate some summary statistics split by the binary categorical variable genre : the number of movies, the mean rating, and the standard deviation split by genre . We’ll do this using dplyr data wrangling verbs. Notice in particular how we count the number of each type of movie using the n() summary function.
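A sketch of this computation, assuming moderndive and dplyr are installed (the column names n, mean_rating, and std_dev are our own choices):

```r
library(moderndive)  # provides the movies_sample data frame
library(dplyr)

genre_summary <- movies_sample %>%
  group_by(genre) %>%
  # n() counts the rows, i.e. the movies, within each genre
  summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))

genre_summary
```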

Observe that we have 36 romance movies with an average rating of 6.322 stars and 32 action movies with an average rating of 5.275 stars. The difference in these average ratings is thus 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. The question is, however, are these results indicative of a true difference for all romance and action movies? Or could we attribute this difference to chance sampling variation?

9.5.2 Sampling scenario

Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Subsection 7.3.1 . The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset.

Since this sample was randomly taken from the population movies , it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on movies_sample can generalize to the entire population. What are the relevant population parameter and point estimates ? We introduce the fourth sampling scenario in Table 9.3 .

TABLE 9.3: Scenarios of sampling for inference
Scenario | Population parameter | Notation | Point estimate | Symbol(s)
1 | Population proportion | \(p\) | Sample proportion | \(\widehat{p}\)
2 | Population mean | \(\mu\) | Sample mean | \(\overline{x}\) or \(\widehat{\mu}\)
3 | Difference in population proportions | \(p_1 - p_2\) | Difference in sample proportions | \(\widehat{p}_1 - \widehat{p}_2\)
4 | Difference in population means | \(\mu_1 - \mu_2\) | Difference in sample means | \(\overline{x}_1 - \overline{x}_2\) or \(\widehat{\mu}_1 - \widehat{\mu}_2\)

So, whereas the sampling bowl exercise in Section 7.1 concerned proportions, the pennies exercise in Section 8.1 concerned means, and both the case study on whether yawning is contagious in Section 8.6 and the promotions activity in Section 9.1 concerned differences in proportions, we are now concerned with differences in means.

In other words, the population parameter of interest is the difference in population mean ratings \(\mu_a - \mu_r\), where \(\mu_a\) is the mean rating of all action movies on IMDb and similarly \(\mu_r\) is the mean rating of all romance movies. Additionally, the point estimate/sample statistic of interest is the difference in sample means \(\overline{x}_a - \overline{x}_r\), where \(\overline{x}_a\) is the mean rating of the \(n_a\) = 32 action movies in our sample and \(\overline{x}_r\) is the mean rating of the \(n_r\) = 36 romance movies in our sample. Based on our earlier exploratory data analysis, our estimate \(\overline{x}_a - \overline{x}_r\) is \(5.275 - 6.322 = -1.047\).

So there appears to be a slight difference of -1.047 in favor of romance movies. The question is, however, could this difference of -1.047 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for all romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing.

9.5.3 Conducting the hypothesis test

We’ll be testing:

\[ \begin{aligned} H_0 &: \mu_a - \mu_r = 0\\ \text{vs } H_A&: \mu_a - \mu_r \neq 0 \end{aligned} \]

In other words, the null hypothesis \(H_0\) suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll assume is true. On the other hand, the alternative hypothesis \(H_A\) suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise \(H_A: p_m - p_f > 0\) , we are now considering a two-sided alternative of \(H_A: \mu_a - \mu_r \neq 0\) .

Furthermore, we’ll pre-specify a low significance level of \(\alpha\) = 0.001. By setting this value low, all things being equal, there is a lower chance that the \(p\) -value will be less than \(\alpha\) . Thus, there is a lower chance that we’ll reject the null hypothesis \(H_0\) in favor of the alternative hypothesis \(H_A\) . In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, only if we have quite strong evidence. This is known as a “conservative” hypothesis testing procedure.

Let’s now perform all the steps of the infer workflow. We first specify() the variables of interest in the movies_sample data frame using the formula rating ~ genre . This tells infer that the numerical variable rating is the outcome variable, while the binary variable genre is the explanatory variable. Note that unlike previously when we were interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the success argument.
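A minimal sketch of this first step. It assumes that movies_sample ships with the moderndive package and that the pipe comes from dplyr (the text names the data frame but neither package); movies_specified is a hypothetical name used only for illustration.

```r
library(moderndive)  # assumed source of the movies_sample data frame
library(infer)
library(dplyr)       # assumed source of the %>% pipe

# Declare rating as the response and genre as the explanatory variable
movies_specified <- movies_sample %>%
  specify(formula = rating ~ genre)

movies_specified
```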

Observe at this point that the data in movies_sample has not changed. The only change so far is the newly defined Response: rating (numeric) and Explanatory: genre (factor) meta-data .

We set the null hypothesis \(H_0: \mu_a - \mu_r = 0\) by using the hypothesize() function. Since we have two samples, action and romance movies, we set null to be "independence" as we described in Section 9.3 .
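The step just described could look like the following sketch, under the same assumptions as before (movies_sample from moderndive, the pipe from dplyr); movies_hypothesized is a hypothetical name.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

# Attach the null hypothesis of independence between rating and genre
movies_hypothesized <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence")

movies_hypothesized
```

As with specify(), this only records the hypothesis; the rating and genre values themselves are untouched.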

After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section 9.1 .

We’ll repeat this resampling without replacement of type = "permute" a total of reps = 1000 times. Feel free to run the code below to check out what the generate() step produces.
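A sketch of the generate() step, again assuming movies_sample comes from moderndive. The seed value and the name movies_shuffled are arbitrary choices made here so that the example is reproducible.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed so the shuffles are reproducible

# 1000 permuted copies of the sample, stacked and indexed by replicate
movies_shuffled <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")

nrow(movies_shuffled)  # rows in the sample times 1000 replicates
```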

Now that we have 1000 replicated “shuffles” assuming the null hypothesis \(H_0\) that both Action and Romance movies on average have the same ratings on IMDb, let’s calculate() the appropriate summary statistic for these 1000 replicated shuffles. From Section 9.2 , summary statistics relating to hypothesis testing have a specific name: test statistics . Since the unknown population parameter of interest is the difference in population means \(\mu_{a} - \mu_{r}\) , the test statistic of interest here is the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\) .

For each of our 1000 shuffles, we can calculate this test statistic by setting stat = "diff in means" . Furthermore, since we are interested in \(\overline{x}_{a} - \overline{x}_{r}\) , we set order = c("Action", "Romance") . Let’s save the results in a data frame called null_distribution_movies :
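Putting the four verbs together, one way to build this data frame is sketched below. The package source of movies_sample (moderndive) and the seed are assumptions; the arguments are the ones named in the text.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  # one difference in means (Action minus Romance) per shuffle
  calculate(stat = "diff in means", order = c("Action", "Romance"))

null_distribution_movies
```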

Observe that we have 1000 values of stat, each representing one instance of \(\overline{x}_{a} - \overline{x}_{r}\). The 1000 values form the null distribution, which is the technical term for the sampling distribution of the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\) assuming \(H_0\) is true. What happened in real life? What was the observed difference in mean ratings, that is, the observed test statistic \(\overline{x}_{a} - \overline{x}_{r}\)? Recall from our earlier data wrangling that this observed difference in means was \(5.275 - 6.322 = -1.047\). We can also achieve this using the code that constructed the null distribution null_distribution_movies but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_means :
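A sketch of that shortened pipeline, assuming as before that movies_sample is provided by moderndive:

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

# Same specify() and calculate() calls, without hypothesize() and generate()
obs_diff_means <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "diff in means", order = c("Action", "Romance"))

obs_diff_means
```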

Lastly, in order to compute the \(p\) -value, we have to assess how “extreme” the observed difference in means of -1.047 is. We do this by comparing -1.047 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the \(p\) -value in Figure 9.18 . Unlike our example in Subsection 9.3.1 involving promotions, since we have a two-sided \(H_A: \mu_a - \mu_r \neq 0\) , we have to allow for both possibilities for more extreme , so we set direction = "both" .
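The plot could be produced along these lines. The block repeats the earlier steps so that it runs on its own; bins = 10, the seed, and the name null_plot are illustrative choices, not values taken from the text.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Action", "Romance"))

obs_diff_means <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "diff in means", order = c("Action", "Romance"))

# Histogram of the null distribution with both tails shaded
null_plot <- visualize(null_distribution_movies, bins = 10) +
  shade_p_value(obs_stat = obs_diff_means, direction = "both")

null_plot
```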

FIGURE 9.18: Null distribution, observed test statistic, and \(p\) -value.

Let’s go over the elements of this plot. First, the histogram is the null distribution . Second, the solid line is the observed test statistic , or the difference in sample means we observed in real life of \(5.275 - 6.322 = -1.047\) . Third, the two shaded areas of the histogram form the \(p\) -value , or the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true .

What proportion of the null distribution is shaded? In other words, what is the numerical value of the \(p\) -value? We use the get_p_value() function to compute this value:
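A self-contained sketch of this computation follows. Because the shuffles are random, the exact value depends on the seed (an arbitrary choice here) and need not match the value quoted in the text.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Action", "Romance"))

obs_diff_means <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "diff in means", order = c("Action", "Romance"))

# Proportion of shuffled statistics at least as extreme in either tail
p_value_movies <- null_distribution_movies %>%
  get_p_value(obs_stat = obs_diff_means, direction = "both")

p_value_movies
```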

This \(p\) -value of 0.004 is very small. In other words, there is a very small chance that we’d observe a difference of 5.275 - 6.322 = -1.047 in a hypothesized universe where there was truly no difference in ratings.

But this \(p\) -value is larger than our (even smaller) pre-specified \(\alpha\) significance level of 0.001. Thus, we are inclined to fail to reject the null hypothesis \(H_0: \mu_a - \mu_r = 0\) . In non-statistical language, the conclusion is: we do not have the evidence needed in this sample of data to suggest that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies. We, thus, cannot say that a difference exists in romance and action movie ratings, on average, for all IMDb movies.

(LC9.9) Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?

(LC9.10) What conclusions can you make from viewing the faceted histogram looking at rating versus genre that you couldn’t see when looking at the boxplot?

(LC9.11) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies.

(LC9.12) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?

(LC9.13) Using the definition of \(p\) -value, write in words what the \(p\) -value represents for the hypothesis test comparing the mean rating of romance to action movies.

(LC9.14) What is the value of the \(p\) -value for the hypothesis test comparing the mean rating of romance to action movies?

(LC9.15) Test your data wrangling knowledge and EDA skills:

  • Use dplyr and tidyr to create the necessary data frame focused on only action and romance movies (but not both) from the movies data frame in the ggplot2movies package.
  • Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb.
  • Discuss how these plots compare to the similar plots produced for the movies_sample data.

9.6 Conclusion

9.6.1 Theory-based hypothesis tests

Much as we did in Subsections 7.6.2 and 8.7.2 when we showed you theory-based methods for computing standard errors and constructing confidence intervals that involved mathematical formulas, we now present an example of a traditional theory-based method to conduct hypothesis tests. This method relies on probability models, probability distributions, and a few assumptions to construct the null distribution. This is in contrast to the approach we’ve been using throughout this book where we relied on computer simulations to construct the null distribution.

These traditional theory-based methods have been used for decades mostly because researchers didn’t have access to computers that could run thousands of calculations quickly and efficiently. Now that computing power is much cheaper and more accessible, simulation-based methods are much more feasible. However, researchers in many fields continue to use theory-based methods. Hence, we make it a point to include an example here.

As we’ll show in this section, any theory-based method is ultimately an approximation to the simulation-based method. The theory-based method we’ll focus on is known as the two-sample \(t\) -test for testing differences in sample means. However, the test statistic we’ll use won’t be the difference in sample means \(\overline{x}_1 - \overline{x}_2\) , but rather the related two-sample \(t\) -statistic . The data we’ll use will once again be the movies_sample data of action and romance movies from Section 9.5 .

Two-sample t-statistic

A common task in statistics is the process of “standardizing a variable.” By standardizing different variables, we make them more comparable. For example, say you are interested in studying the distribution of temperature recordings from Portland, Oregon, USA and comparing it to that of the temperature recordings in Montreal, Quebec, Canada. Given that US temperatures are generally recorded in degrees Fahrenheit and Canadian temperatures are generally recorded in degrees Celsius, how can we make them comparable? One approach would be to convert degrees Fahrenheit into Celsius, or vice versa. Another approach would be to convert them both to a common “standardized” scale, like Kelvin units of temperature.

One common method for standardizing a variable from probability and statistics theory is to compute the \(z\) -score:

\[z = \frac{x - \mu}{\sigma}\]

where \(x\) represents one value of a variable, \(\mu\) represents the mean of that variable, and \(\sigma\) represents the standard deviation of that variable. You first subtract the mean \(\mu\) from each value of \(x\) and then divide \(x - \mu\) by the standard deviation \(\sigma\). These operations will have the effect of re-centering your variable around 0 and re-scaling your variable \(x\) so that it has what are known as “standard units.” Thus for every value that your variable can take, it has a corresponding \(z\)-score that gives how many standard units away that value is from the mean \(\mu\). If the variable itself is normally distributed, its \(z\)-scores are normally distributed with mean 0 and standard deviation 1. This curve is called a “\(z\)-distribution” or “standard normal” curve and has the common, bell-shaped pattern from Figure 9.19 discussed in Appendix A.2.

FIGURE 9.19: Standard normal z curve.
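To make the standardization idea concrete, here is a small base R sketch tied to the temperature example. The five readings are made up, and the sample mean and standard deviation stand in for \(\mu\) and \(\sigma\), which would not be known in practice.

```r
# Hypothetical temperature readings in degrees Fahrenheit
temps_f <- c(41, 50, 59, 68, 77)
# The same readings converted to degrees Celsius
temps_c <- (temps_f - 32) * 5 / 9

# z-scores: subtract the mean, then divide by the standard deviation
z_f <- (temps_f - mean(temps_f)) / sd(temps_f)
z_c <- (temps_c - mean(temps_c)) / sd(temps_c)

# Both scales give the same standardized values
all.equal(z_f, z_c)
```

Once standardized, the two sets of readings are directly comparable, whichever unit they were recorded in.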

Bringing these back to the difference of sample mean ratings \(\overline{x}_a - \overline{x}_r\) of action versus romance movies, how would we standardize this variable? By once again subtracting its mean and dividing by its standard deviation. Recall two facts from Subsection 7.3.3 . First, if the sampling was done in a representative fashion, then the sampling distribution of \(\overline{x}_a - \overline{x}_r\) will be centered at the true population parameter \(\mu_a - \mu_r\) . Second, the standard deviation of point estimates like \(\overline{x}_a - \overline{x}_r\) has a special name: the standard error.

Applying these ideas, we present the two-sample \(t\) -statistic :

\[t = \dfrac{ (\bar{x}_a - \bar{x}_r) - (\mu_a - \mu_r)}{ \text{SE}_{\bar{x}_a - \bar{x}_r} } = \dfrac{ (\bar{x}_a - \bar{x}_r) - (\mu_a - \mu_r)}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}} }\]

Oofda! There is a lot to try to unpack here! Let’s go slowly. In the numerator, \(\bar{x}_a-\bar{x}_r\) is the difference in sample means, while \(\mu_a - \mu_r\) is the difference in population means. In the denominator, \(s_a\) and \(s_r\) are the sample standard deviations of the action and romance movies in our sample movies_sample . Lastly, \(n_a\) and \(n_r\) are the sample sizes of the action and romance movies. Putting this together under the square root gives us the standard error \(\text{SE}_{\bar{x}_a - \bar{x}_r}\) .

Observe that the formula for \(\text{SE}_{\bar{x}_a - \bar{x}_r}\) has the sample sizes \(n_a\) and \(n_r\) in it. So as the sample sizes increase, the standard error goes down. We’ve seen this concept numerous times now, in particular (1) in our simulations using the three virtual shovels with \(n\) = 25, 50, and 100 slots in Figure 7.15, (2) in Subsection 8.5.3 where we studied the effect of using larger sample sizes on the widths of confidence intervals, and (3) in Subsection 7.6.2 where we studied the formula-based approximation to the standard error of the sample proportion \(\widehat{p}\).
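A quick base R check of this relationship, using the sample standard deviations 1.36 and 1.61 and the sample sizes 32 and 36 that appear in this section. The helper se_diff() is a hypothetical name, and quadrupling both sample sizes is purely illustrative.

```r
# Standard error of a difference in sample means
se_diff <- function(s_a, s_r, n_a, n_r) {
  sqrt(s_a^2 / n_a + s_r^2 / n_r)
}

se_now <- se_diff(1.36, 1.61, n_a = 32, n_r = 36)
se_4x  <- se_diff(1.36, 1.61, n_a = 4 * 32, n_r = 4 * 36)

# Quadrupling both sample sizes halves the standard error
c(se_now = se_now, se_4x = se_4x, ratio = se_4x / se_now)
```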

So how can we use the two-sample \(t\) -statistic as a test statistic in our hypothesis test? First, assuming the null hypothesis \(H_0: \mu_a - \mu_r = 0\) is true, the right-hand side of the numerator (to the right of the \(-\) sign), \(\mu_a - \mu_r\) , becomes 0.

Second, similarly to how the Central Limit Theorem from Subsection 7.5 states that sample means follow a normal distribution, it can be mathematically proven that the two-sample \(t\) -statistic follows a \(t\) distribution with degrees of freedom “roughly equal” to \(df = n_a + n_r - 2\) . To better understand this concept of degrees of freedom , we next display three examples of \(t\) -distributions in Figure 9.20 along with the standard normal \(z\) curve.

FIGURE 9.20: Examples of t-distributions and the z curve.

Begin by looking at the center of the plot at 0 on the horizontal axis. As you move up from the value of 0, follow along with the labels and note that the bottom curve corresponds to 1 degree of freedom, the curve above it is for 3 degrees of freedom, the curve above that is for 10 degrees of freedom, and lastly the dotted curve is the standard normal \(z\) curve.

Observe that all four curves have a bell shape, are centered at 0, and that as the degrees of freedom increase, the \(t\) -distribution more and more resembles the standard normal \(z\) curve. The “degrees of freedom” measures how different the \(t\) distribution will be from a normal distribution. \(t\) -distributions tend to have more values in the tails of their distributions than the standard normal \(z\) curve.

This “roughly equal” statement indicates that the equation \(df = n_a + n_r - 2\) is a “good enough” approximation to the true degrees of freedom. The true formula is a bit more complicated than this simple expression, but we’ve found the formula to be beyond the reach of those new to statistical inference and it does little to build the intuition of the \(t\) -test.

The message to retain, however, is that small sample sizes lead to small degrees of freedom and thus small sample sizes lead to \(t\) -distributions that are different than the \(z\) curve. On the other hand, large sample sizes correspond to large degrees of freedom and thus produce \(t\) distributions that closely align with the standard normal \(z\) -curve.
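One way to see this numerically is to compare upper-tail cutoffs in base R. The 0.975 quantile and the particular degrees of freedom below are illustrative choices; 66 is the simple approximation \(n_a + n_r - 2\) for our sample.

```r
dfs <- c(1, 3, 10, 66)

# Cutoffs that leave 2.5% of the area in the upper tail
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

# Heavier tails push the t cutoffs further out than the z cutoff,
# and the gap shrinks as the degrees of freedom grow
round(rbind(df = dfs, t_crit = t_crit, gap = t_crit - z_crit), 3)
```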

So, assuming the null hypothesis \(H_0\) is true, our formula for the test statistic simplifies a bit:

\[t = \dfrac{ (\bar{x}_a - \bar{x}_r) - 0}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}} } = \dfrac{ \bar{x}_a - \bar{x}_r}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}} }\]

Let’s compute the values necessary for this two-sample \(t\) -statistic. Recall the summary statistics we computed during our exploratory data analysis in Section 9.5.1 .

Using these values, the observed two-sample \(t\) -test statistic is

\[ \dfrac{ \bar{x}_a - \bar{x}_r}{ \sqrt{\dfrac{{s_a}^2}{n_a} + \dfrac{{s_r}^2}{n_r}} } = \dfrac{5.275 - 6.322}{ \sqrt{\dfrac{{1.36}^2}{32} + \dfrac{{1.61}^2}{36}} } = -2.906 \]
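The arithmetic can be verified in base R. This sketch plugs in the sample means 5.275 and 6.322 quoted in Section 9.5 together with the standard deviations and sample sizes from the formula; the variable names are chosen here for illustration.

```r
x_bar_a <- 5.275; s_a <- 1.36; n_a <- 32  # action movies
x_bar_r <- 6.322; s_r <- 1.61; n_r <- 36  # romance movies

# Standard error of the difference in sample means
se <- sqrt(s_a^2 / n_a + s_r^2 / n_r)

# Two-sample t-statistic under H0 (mu_a - mu_r = 0)
t_obs <- (x_bar_a - x_bar_r) / se
round(t_obs, 3)
```

Note that rounding the means to two decimals before dividing gives a slightly different result, so the more precise means are used here.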

Great! How can we compute the \(p\) -value using this theory-based test statistic? We need to compare it to a null distribution, which we construct next.

Null distribution

Let’s revisit the null distribution for the test statistic \(\bar{x}_a - \bar{x}_r\) we constructed in Section 9.5 . Let’s visualize this in the left-hand plot of Figure 9.21 .

The infer package also includes some built-in theory-based test statistics as well. So instead of calculating the test statistic of interest as the "diff in means" \(\bar{x}_a - \bar{x}_r\) , we can calculate this defined two-sample \(t\) -statistic by setting stat = "t" . Let’s visualize this in the right-hand plot of Figure 9.21 .
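A sketch of how this second null distribution could be built. Only the stat argument changes from the earlier pipeline; the moderndive source of movies_sample and the seed are assumptions.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  # two-sample t-statistic instead of the raw difference in means
  calculate(stat = "t", order = c("Action", "Romance"))

range(null_distribution_movies_t$stat)
```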

FIGURE 9.21: Comparing the null distributions of two test statistics.

Observe that while the shape of the null distributions of both the difference in means \(\bar{x}_a - \bar{x}_r\) and the two-sample \(t\) -statistics are similar, the scales on the x-axis are different. The two-sample \(t\) -statistic values are spread out over a larger range.

However, a traditional theory-based \(t\) -test doesn’t look at the simulated histogram in null_distribution_movies_t , but instead it looks at the \(t\) -distribution curve with degrees of freedom equal to roughly 65.85. This calculation is based on the complicated formula referenced previously, which we approximated with \(df = n_a + n_r - 2 = 32 + 36 - 2 = 66\) . Let’s overlay this \(t\) -distribution curve over the top of our simulated two-sample \(t\) -statistics using the method = "both" argument in visualize() .
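For readers curious where a value like 65.85 comes from, the sketch below assumes that the "complicated formula" is the Welch-Satterthwaite approximation, which the text does not name. Applied to the rounded summary statistics it lands very close to the quoted value.

```r
s_a <- 1.36; n_a <- 32  # action movies
s_r <- 1.61; n_r <- 36  # romance movies

# Estimated variance of each sample mean
v_a <- s_a^2 / n_a
v_r <- s_r^2 / n_r

# Welch-Satterthwaite degrees of freedom (assumed formula, see lead-in)
df_welch <- (v_a + v_r)^2 / (v_a^2 / (n_a - 1) + v_r^2 / (n_r - 1))

# Simple approximation used in the text
df_simple <- n_a + n_r - 2

c(df_welch = df_welch, df_simple = df_simple)
```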

FIGURE 9.22: Null distribution using t-statistic and t-distribution.

Observe that the curve does a good job of approximating the histogram here. To calculate the \(p\) -value in this case, we need to figure out how much of the total area under the \(t\) -distribution curve is at or “more extreme” than our observed two-sample \(t\) -statistic. Since \(H_A: \mu_a - \mu_r \neq 0\) is a two-sided alternative, we need to add up the areas in both tails.
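Adding up the two tail areas can be done directly with base R's pt() function. This is a sketch that takes the observed statistic of -2.906 and the degrees of freedom of 65.85 from the text as given; it is not the infer-based computation the chapter uses.

```r
t_obs <- -2.906
df    <- 65.85

# Area at or below -|t| plus, by symmetry, the area at or above +|t|
p_theory <- 2 * pt(-abs(t_obs), df = df)
p_theory
```

The result is small, but it can still be compared with the pre-specified \(\alpha\) = 0.001 in the usual way.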

We first compute the observed two-sample \(t\) -statistic using infer verbs. This shortcut calculation further assumes that the null hypothesis is true: that the population of action and romance movies have an equal average rating.
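A sketch of this computation, assuming again that movies_sample comes from moderndive:

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

# Observed two-sample t-statistic (Action minus Romance)
obs_two_sample_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "t", order = c("Action", "Romance"))

obs_two_sample_t
```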

We want to find the percentage of values that are at or below obs_two_sample_t \(= -2.906\) or at or above -obs_two_sample_t \(= 2.906\) . We use the shade_p_value() function with the direction argument set to "both" to do this:
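The shaded plot could be drawn as in the sketch below, which repeats the earlier steps so that it runs on its own. bins = 10, the seed, and the name t_plot are illustrative choices; depending on the infer version, the call may emit a warning about checking conditions.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("Action", "Romance"))

obs_two_sample_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "t", order = c("Action", "Romance"))

# Simulated histogram plus theoretical t curve, both tails shaded
t_plot <- visualize(null_distribution_movies_t, bins = 10, method = "both") +
  shade_p_value(obs_stat = obs_two_sample_t, direction = "both")

t_plot
```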

FIGURE 9.23: Null distribution using t-statistic and t-distribution with \(p\) -value shaded.

(We’ll discuss this warning message shortly.) What is the \(p\) -value? We apply get_p_value() to our null distribution saved in null_distribution_movies_t :
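A self-contained sketch of this call follows. The result depends on the random shuffles, so the seed below (an arbitrary choice) will not necessarily reproduce the authors' value.

```r
library(moderndive)  # assumed source of movies_sample
library(infer)
library(dplyr)       # assumed source of the %>% pipe

set.seed(76)  # arbitrary seed for reproducibility

null_distribution_movies_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("Action", "Romance"))

obs_two_sample_t <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  calculate(stat = "t", order = c("Action", "Romance"))

# Two-sided p-value based on the simulated t-statistics
p_value_t <- null_distribution_movies_t %>%
  get_p_value(obs_stat = obs_two_sample_t, direction = "both")

p_value_t
```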

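We have a very small \(p\)-value, so a test statistic this extreme would rarely arise from sampling variation alone if \(H_0\) were true. As with the simulation-based test in Section 9.5, the formal decision comes from comparing this \(p\)-value to the pre-specified significance level \(\alpha\) = 0.001: we reject \(H_0\) only if the \(p\)-value falls below it.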

Let’s come back to that earlier warning message: Check to make sure the conditions have been met for the theoretical method. {infer} currently does not check these for you. To be able to use the \(t\) -test and other such theoretical methods, there are always a few conditions to check. The infer package does not automatically check these conditions, hence the warning message we received. These conditions are necessary so that the underlying mathematical theory holds. In order for the results of our two-sample \(t\) -test to be valid, three conditions must be met:

  • Nearly normal populations or large sample sizes. A general rule of thumb that works in many (but not all) situations is that the sample size \(n\) should be greater than 30.
  • Both samples are selected independently of each other.
  • All observations are independent from each other.

Let’s see if these conditions hold for our movies_sample data:

  • This is met since \(n_a\) = 32 and \(n_r\) = 36 are both larger than 30, satisfying our rule of thumb.
  • This is met since we sampled the action and romance movies at random and in an unbiased fashion from the database of all IMDb movies.
  • Unfortunately, we don’t know how IMDb computes the ratings. For example, if the same person rated multiple movies, then those observations would be related and hence not independent.

Assuming all three conditions are roughly met, we can be reasonably certain that the theory-based \(t\)-test results are valid. If any of the conditions were clearly not met, we couldn’t put as much trust into any conclusions reached. On the other hand, in most scenarios, the only assumption that needs to be met in the simulation-based method is that the sample is selected at random. Thus, we prefer simulation-based methods: they have fewer assumptions, are conceptually easier to understand, and, now that computing power is easily accessible, can be run quickly. That being said, since much of the world’s research still relies on traditional theory-based methods, we also believe it is important to understand them.

You may be wondering why we chose reps = 1000 for these simulation-based methods. We’ve noticed that for most problems, around 1000 replicates of the null distribution or the bootstrap distribution are enough to get a general sense of how the statistic behaves. You can set reps to something like 10,000 if you would like even finer detail, but this will take more time to compute. Feel free to experiment with this value to get a better idea of the shape of the null and bootstrap distributions.

9.6.2 When inference is not needed

We’ve now walked through several different examples of how to use the infer package to perform statistical inference: constructing confidence intervals and conducting hypothesis tests. For each of these examples, we made it a point to always perform an exploratory data analysis (EDA) first; specifically, by looking at the raw data values, by using data visualization with ggplot2 , and by data wrangling with dplyr beforehand. We highly encourage you to always do the same. As a beginner to statistics, EDA helps you develop intuition as to what statistical methods like confidence intervals and hypothesis tests can tell us. Even as a seasoned practitioner of statistics, EDA helps guide your statistical investigations. In particular, is statistical inference even needed?

Let’s consider an example. Say we’re interested in the following question: Of all flights leaving a New York City airport, are Hawaiian Airlines flights in the air for longer than Alaska Airlines flights? Furthermore, let’s assume that 2013 flights are a representative sample of all such flights. Then we can use the flights data frame in the nycflights13 package we introduced in Section 1.4 to answer our question. Let’s filter this data frame to only include Hawaiian and Alaska Airlines using their carrier codes HA and AS :
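A sketch of this filtering step. The name flights_sample is the one used later in this subsection; loading dplyr for filter() and the pipe is an assumption.

```r
library(nycflights13)
library(dplyr)

# Keep only Hawaiian (HA) and Alaska (AS) Airlines flights
flights_sample <- flights %>%
  filter(carrier %in% c("HA", "AS"))

count(flights_sample, carrier)
```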

There are two possible statistical inference methods we could use to answer such questions. First, we could construct a 95% confidence interval for the difference in population means \(\mu_{HA} - \mu_{AS}\) , where \(\mu_{HA}\) is the mean air time of all Hawaiian Airlines flights and \(\mu_{AS}\) is the mean air time of all Alaska Airlines flights. We could then check if the entirety of the interval is greater than 0, suggesting that \(\mu_{HA} - \mu_{AS} > 0\) , or, in other words suggesting that \(\mu_{HA} > \mu_{AS}\) . Second, we could perform a hypothesis test of the null hypothesis \(H_0: \mu_{HA} - \mu_{AS} = 0\) versus the alternative hypothesis \(H_A: \mu_{HA} - \mu_{AS} > 0\) .

However, let’s first construct an exploratory visualization as we suggested earlier. Since air_time is numerical and carrier is categorical, a boxplot can display the relationship between these two variables, which we display in Figure 9.24 .
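Such a boxplot could be drawn with ggplot2 as sketched below. The axis labels and the name air_time_plot are illustrative choices, and flights with a missing air_time are dropped by ggplot2 with a warning.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

flights_sample <- flights %>%
  filter(carrier %in% c("HA", "AS"))

# Side-by-side boxplots of air time for the two carriers
air_time_plot <- ggplot(flights_sample, aes(x = carrier, y = air_time)) +
  geom_boxplot() +
  labs(x = "Carrier", y = "Air time")

air_time_plot
```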

FIGURE 9.24: Air time for Hawaiian and Alaska Airlines flights departing NYC in 2013.

This is what we like to call a “no PhD in Statistics needed” moment. You don’t have to be an expert in statistics to know that Alaska Airlines and Hawaiian Airlines have significantly different air times. The two boxplots don’t even overlap! Constructing a confidence interval or conducting a hypothesis test would frankly not provide much more insight than Figure 9.24.

Let’s investigate why we observe such a clear-cut difference between these two airlines using data wrangling. Let’s first group the rows of flights_sample not only by carrier but also by destination dest. Subsequently, we’ll compute two summary statistics: the number of observations using n() and the mean air time:
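A sketch of this wrangling step. The column names n and mean_time and the name flights_summary are chosen here for illustration, and na.rm = TRUE is an assumption made so that flights with a missing air_time do not turn the mean into NA.

```r
library(nycflights13)
library(dplyr)

flights_sample <- flights %>%
  filter(carrier %in% c("HA", "AS"))

# One row per carrier and destination pair
flights_summary <- flights_sample %>%
  group_by(carrier, dest) %>%
  summarize(n = n(),
            mean_time = mean(air_time, na.rm = TRUE),
            .groups = "drop")

flights_summary
```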

It turns out that in 2013, Alaska only flew to SEA (Seattle) from New York City (NYC) while Hawaiian only flew to HNL (Honolulu) from NYC. Given the clear difference in distance from New York City to Seattle versus New York City to Honolulu, it is not surprising that we observe such different (statistically significantly different, in fact) air times in flights.

This is a clear example of not needing to do anything more than a simple exploratory data analysis using data visualization and descriptive statistics to get an appropriate conclusion. This is why we highly recommend you perform an EDA of any sample data before running statistical inference methods like confidence intervals and hypothesis tests.

9.6.3 Problems with p-values

On top of the many common misunderstandings about hypothesis testing and \(p\) -values we listed in Section 9.4 , another unfortunate consequence of the expanded use of \(p\) -values and hypothesis testing is a phenomenon known as “p-hacking.” p-hacking is the act of “cherry-picking” only results that are “statistically significant” while dismissing those that aren’t, even if at the expense of the scientific ideas. There are lots of articles written recently about misunderstandings and the problems with \(p\) -values. We encourage you to check some of them out:

  • Misunderstandings of \(p\) -values
  • What a nerdy debate about \(p\) -values shows about science - and how to fix it
  • Statisticians issue warning over misuse of \(P\) values
  • You Can’t Trust What You Read About Nutrition
  • A Litany of Problems with p-values

Such issues were getting so problematic that the American Statistical Association (ASA) put out a statement in 2016 titled, “The ASA Statement on Statistical Significance and \(P\) -Values,” with six principles underlying the proper use and interpretation of \(p\) -values. The ASA released this guidance on \(p\) -values to improve the conduct and interpretation of quantitative science and to inform the growing emphasis on reproducibility of science research.

We as authors much prefer the use of confidence intervals for statistical inference, since in our opinion they are much less prone to large misinterpretation. However, many fields still exclusively use \(p\) -values for statistical inference and this is one reason for including them in this text. We encourage you to learn more about “p-hacking” as well and its implication for science.

9.6.4 Additional resources

An R script file of all R code used in this chapter is available here.

If you want more examples of the infer workflow for conducting hypothesis tests, we suggest you check out the infer package homepage, in particular, a series of example analyses available at https://infer.netlify.app/articles/.

9.6.5 What’s to come

We conclude with the infer pipeline for hypothesis testing in Figure 9.25 .

FIGURE 9.25: infer package workflow for hypothesis testing.

Now that we’ve armed ourselves with an understanding of confidence intervals from Chapter 8 and hypothesis tests from this chapter, we’ll now study inference for regression in the upcoming Chapter 10 .

We’ll revisit the regression models we studied in Chapter 5 on basic regression and Chapter 6 on multiple regression. For example, recall Table 5.2 (shown again here in Table 9.4 ), corresponding to our regression model for an instructor’s teaching score as a function of their “beauty” score.

TABLE 9.4: Linear regression table
term estimate std_error statistic p_value lower_ci upper_ci
intercept 3.880 0.076 50.96 0 3.731 4.030
bty_avg 0.067 0.016 4.09 0 0.035 0.099

We previously saw in Subsection 5.1.2 that the values in the estimate column are the fitted intercept \(b_0\) and fitted slope for “beauty” score \(b_1\) . In Chapter 10 , we’ll unpack the remaining columns: std_error which is the standard error, statistic which is the observed standardized test statistic to compute the p_value , and the 95% confidence intervals as given by lower_ci and upper_ci .

9.1 Null and Alternative Hypotheses

You are testing that the mean speed of your cable internet connection is more than three megabits per second. What is the random variable? Describe it in words.

You are testing that the mean speed of your cable internet connection is more than three megabits per second. State the null and alternative hypotheses.

The American family has an average of two children. What is the random variable? Describe in words.

The mean entry level salary of an employee at a company is $58,000. You believe it is higher for IT professionals in the company. State the null and alternative hypotheses.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the proportion is actually less. What is the random variable? Describe in words.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the claim is correct. State the null and alternative hypotheses.

In a population of fish, approximately 42 percent are female. A test is conducted to see if, in fact, the proportion is less. State the null and alternative hypotheses.

Suppose that a recent article stated that the mean time students spend doing homework each week is 2.5 hours. A study was then done to see if the mean time has increased in the new century. A random sample of 26 students was selected. The mean length of time the students spent on homework was 3 hours with a standard deviation of 1.8 hours. Suppose that it is somehow known that the population standard deviation is 1.5. If you were conducting a hypothesis test to determine if the mean length of homework has increased, what would the null and alternative hypotheses be? The distribution of the population is normal.

  • \(H_0\): ________
  • \(H_a\): ________

A random survey of 75 long-term marathon runners revealed that the mean length of time they've been running is 17.4 years with a standard deviation of 6.3 years. If you were conducting a hypothesis test to determine if the population mean time for these runners could likely be 15 years, what would the null and alternative hypotheses be?

  • \(H_0\): __________
  • \(H_a\): __________

Researchers published an article stating that in any one-year period, approximately 9.5 percent of American adults suffer from a particular type of disease. Suppose that in a survey of 100 people in a certain town, seven of them suffered from this disease. If you were conducting a hypothesis test to determine if the true proportion of people in that town suffering from this disease is lower than the percentage in the general adult American population, what would the null and alternative hypotheses be?

9.2 Outcomes and the Type I and Type II Errors

The mean price of mid-sized cars in a region is $32,000. A test is conducted to see if the claim is true. State the Type I and Type II errors in complete sentences.

A sleeping bag is tested to withstand temperatures of –15 °F. You think the bag cannot stand temperatures that low. State the Type I and Type II errors in complete sentences.

For Exercise 9.12, what are α and β in words?

In words, describe 1 – β for Exercise 9.12.

A group of doctors is deciding whether or not to perform an operation. Suppose the null hypothesis, H 0 , is: the surgical procedure will go well. State the Type I and Type II errors in complete sentences.

A group of doctors is deciding whether or not to perform an operation. Suppose the null hypothesis, H 0 , is: the surgical procedure will go well. Which is the error with the greater consequence?

The power of a test is 0.981. What is the probability of a Type II error?

A group of divers is exploring an old sunken ship. Suppose the null hypothesis, H 0 , is the sunken ship does not contain buried treasure. State the Type I and Type II errors in complete sentences.

A microbiologist is testing a water sample for E. coli. Suppose the null hypothesis, H 0 , is the sample does not contain E. coli. The probability that the sample does not contain E. coli, but the microbiologist thinks it does is 0.012. The probability that the sample does contain E. coli, but the microbiologist thinks it does not is 0.002. What is the power of this test?

A microbiologist is testing a water sample for E. coli. Suppose the null hypothesis, H 0 , is the sample contains E. coli. Which is the error with the greater consequence?

9.3 Distribution Needed for Hypothesis Testing

Which two distributions can you use for hypothesis testing for this chapter?

Which distribution do you use when the standard deviation is not known? Assume sample size is large.

Which distribution do you use when the standard deviation is not known and you are testing one population mean? Assume sample size is large.

A population mean is 13. The sample mean is 12.8, and the sample standard deviation is two. The sample size is 20. What distribution should you use to perform a hypothesis test? Assume the underlying population is normal.

A population has a mean of 25 and a standard deviation of five. The sample mean is 24, and the sample size is 108. What distribution should you use to perform a hypothesis test?

It is thought that 42 percent of respondents in a taste test would prefer Brand A . In a particular test of 100 people, 39 percent preferred Brand A . What distribution should you use to perform a hypothesis test?

You are performing a hypothesis test of a single population mean using a Student’s t -distribution. What must you assume about the distribution of the data?

You are performing a hypothesis test of a single population mean using a Student’s t -distribution. The data are not from a simple random sample. Can you accurately perform the hypothesis test?

You are performing a hypothesis test of a single population proportion. What must be true about the quantities of np and nq ?

You are performing a hypothesis test of a single population proportion. You find out that np is less than five. What must you do to be able to perform a valid hypothesis test?

You are performing a hypothesis test of a single population proportion. The data come from which distribution?

9.4 Rare Events, the Sample, and the Decision and Conclusion

When do you reject the null hypothesis?

The probability of winning the grand prize at a particular carnival game is 0.005. Is the outcome of winning very likely or very unlikely?

The probability of winning the grand prize at a particular carnival game is 0.005. Michele wins the grand prize. Is this considered a rare or common event? Why?

It is believed that the mean height of high school students who play basketball on the school team is 73 inches with a standard deviation of 1.8 inches. A random sample of 40 players is chosen. The sample mean was 71 inches, and the sample standard deviation was 1.5 inches. Do the data support the claim that the mean height is less than 73 inches? The p -value is almost zero. State the null and alternative hypotheses and interpret the p -value.

The mean age of graduate students at a university is at most 31 years with a standard deviation of two years. A random sample of 15 graduate students is taken. The sample mean is 32 years and the sample standard deviation is three years. Are the data significant at the 1 percent level? The p -value is 0.0264. State the null and alternative hypotheses and interpret the p -value.

Does the shaded region represent a low or a high p -value compared to a level of significance of 1 percent?

What should you do when α > p -value?

What should you do if α = p -value?

If you do not reject the null hypothesis, then it must be true. Is that statement correct? State why or why not in complete sentences.

Use the following information to answer the next seven exercises: Suppose that a recent article stated that the mean time students spend doing homework each week is 2.5 hours. A study was then done to see if the mean time has increased in the new century. A random sample of 26 students was taken. The mean length of time they did homework each week was three hours with a standard deviation of 1.8 hours. Suppose that it is somehow known that the population standard deviation is 1.5. Conduct a hypothesis test to determine if the mean length of time doing homework each week has increased. Assume the distribution of homework times is approximately normal.

Is this a test of means or proportions?

  • What symbol represents the random variable for this test?
  • In words, define the random variable for this test.

Is σ known and, if so, what is it?

Calculate the following:

  • x̄ = _______
  • s_x = _______

Since both σ and s_x are given, which should be used? In one to two complete sentences, explain why.

  • State the distribution to use for the hypothesis test.

A random survey of 75 long-term marathon runners revealed that the mean length of time they have been running is 17.4 years with a standard deviation of 6.3 years. Conduct a hypothesis test to determine if the population mean time is likely to be 15 years.

  • Is this a test of one mean or proportion?
  • State the null and alternative hypotheses. H 0 : ____________________ H a : ____________________
  • Is this a right-tailed, left-tailed, or two-tailed test?
  • Is the population standard deviation known and, if so, what is it?
  • x̄ = _____________
  • s = ____________
  • n = ____________
  • Which test should be used?
  • Find the p -value.
  • Reason for the decision:
  • Conclusion (write out in a complete sentence):

9.5 Additional Information and Full Hypothesis Test Examples

Assume H 0 : μ = 9 and H a : μ < 9. Is this a left-tailed, right-tailed, or two-tailed test?

Assume H 0 : μ ≤ 6 and H a : μ > 6. Is this a left-tailed, right-tailed, or two-tailed test?

Assume H 0 : p = 0.25 and H a : p ≠ 0.25. Is this a left-tailed, right-tailed, or two-tailed test?

Draw the general graph of a left-tailed test.

Draw the graph of a two-tailed test.

A bottle of water is labeled as containing 16 fluid ounces of water. You believe it is less than that. What type of test would you use?

Your friend claims that his mean golf score is 63. You want to show that it is higher than that. What type of test would you use?

A bathroom scale claims to be able to identify correctly any weight within a pound. You think that it cannot be that accurate. What type of test would you use?

You flip a coin and record whether it shows heads or tails. You know the probability of getting heads is 50 percent, but you think it is less for this particular coin. What type of test would you use?

If the alternative hypothesis has a not equals ( ≠ ) symbol, you know to use which type of test?

Assume the null hypothesis states that the mean is at least 18. Is this a left-tailed, right-tailed, or two-tailed test?

Assume the null hypothesis states that the mean is at most 12. Is this a left-tailed, right-tailed, or two-tailed test?

Assume the null hypothesis states that the mean is equal to 88. The alternative hypothesis states that the mean is not equal to 88. Is this a left-tailed, right-tailed, or two-tailed test?

Statistical Thinking for the 21st Century

Chapter 9 hypothesis testing.

In the first chapter we discussed the three major goals of statistics:

In this chapter we will introduce the ideas behind the use of statistics to make decisions – in particular, decisions about whether a particular hypothesis is supported by the data.

9.1 Null Hypothesis Statistical Testing (NHST)

The specific type of hypothesis testing that we will discuss is known (for reasons that will become clear) as null hypothesis statistical testing (NHST). If you pick up almost any scientific or biomedical research publication, you will see NHST being used to test hypotheses, and in their introductory psychology textbook, Gerrig & Zimbardo (2002) referred to NHST as the “backbone of psychological research”. Thus, learning how to use and interpret the results from hypothesis testing is essential to understand the results from many fields of research.

It is also important for you to know, however, that NHST is deeply flawed, and that many statisticians and researchers (including myself) think that it has been the cause of serious problems in science, which we will discuss in Chapter 18. For more than 50 years, there have been calls to abandon NHST in favor of other approaches (like those that we will discuss in the following chapters):

  • “The test of statistical significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research” (Bakan, 1966)
  • Hypothesis testing is “a wrongheaded view about what constitutes scientific progress” (Luce, 1988)

NHST is also widely misunderstood, largely because it violates our intuitions about how statistical hypothesis testing should work. Let’s look at an example to see this.

9.2 Null hypothesis statistical testing: An example

There is great interest in the use of body-worn cameras by police officers, which are thought to reduce the use of force and improve officer behavior. However, in order to establish this we need experimental evidence, and it has become increasingly common for governments to use randomized controlled trials to test such ideas. A randomized controlled trial of the effectiveness of body-worn cameras was performed by the Washington, DC government and DC Metropolitan Police Department in 2015/2016. Officers were randomly assigned to wear a body-worn camera or not, and their behavior was then tracked over time to determine whether the cameras resulted in less use of force and fewer civilian complaints about officer behavior.

Before we get to the results, let’s ask how you would think the statistical analysis might work. Let’s say we want to specifically test the hypothesis of whether the use of force is decreased by the wearing of cameras. The randomized controlled trial provides us with the data to test the hypothesis – namely, the rates of use of force by officers assigned to either the camera or control groups. The next obvious step is to look at the data and determine whether they provide convincing evidence for or against this hypothesis. That is: What is the likelihood that body-worn cameras reduce the use of force, given the data and everything else we know?

It turns out that this is not how null hypothesis testing works. Instead, we first take our hypothesis of interest (i.e. that body-worn cameras reduce use of force), and flip it on its head, creating a null hypothesis – in this case, the null hypothesis would be that cameras do not reduce use of force. Importantly, we then assume that the null hypothesis is true. We then look at the data, and determine how likely the data would be if the null hypothesis were true. If the data are sufficiently unlikely under the null hypothesis, then we can reject the null in favor of the alternative hypothesis, which is our hypothesis of interest. If there is not sufficient evidence to reject the null, then we say that we retain (or “fail to reject”) the null, sticking with our initial assumption that the null is true.

Understanding some of the concepts of NHST, particularly the notorious “p-value”, is invariably challenging the first time one encounters them, because they are so counter-intuitive. As we will see later, there are other approaches that provide a much more intuitive way to address hypothesis testing (but have their own complexities). However, before we get to those, it’s important for you to have a deep understanding of how hypothesis testing works, because it’s clearly not going to go away any time soon.

9.3 The process of null hypothesis testing

We can break the process of null hypothesis testing down into a number of steps:

  • Formulate a hypothesis that embodies our prediction ( before seeing the data )
  • Specify null and alternative hypotheses
  • Collect some data relevant to the hypothesis
  • Fit a model to the data that represents the alternative hypothesis and compute a test statistic
  • Compute the probability of the observed value of that statistic assuming that the null hypothesis is true
  • Assess the “statistical significance” of the result

For a hands-on example, let’s use the NHANES data to ask the following question: Is physical activity related to body mass index? In the NHANES dataset, participants were asked whether they engage regularly in moderate or vigorous-intensity sports, fitness or recreational activities (stored in the variable \(PhysActive\) ). The researchers also measured height and weight and used them to compute the Body Mass Index (BMI):

\[ BMI = \frac{weight(kg)}{height(m)^2} \]

9.3.1 Step 1: Formulate a hypothesis of interest

We hypothesize that BMI is greater for people who do not engage in physical activity, compared to those who do.

9.3.2 Step 2: Specify the null and alternative hypotheses

For step 2, we need to specify our null hypothesis (which we call \(H_0\) ) and our alternative hypothesis (which we call \(H_A\) ). \(H_0\) is the baseline against which we test our hypothesis of interest: that is, what would we expect the data to look like if there was no effect? The null hypothesis always involves some kind of equality (=, \(\le\) , or \(\ge\) ). \(H_A\) describes what we expect if there actually is an effect. The alternative hypothesis always involves some kind of inequality ( \(\ne\) , >, or <). Importantly, null hypothesis testing operates under the assumption that the null hypothesis is true unless the evidence shows otherwise.

We also have to decide whether we want to test a directional or a non-directional hypothesis. A non-directional hypothesis simply predicts that there will be a difference, without predicting which direction it will go. For the BMI/activity example, a non-directional null hypothesis would be:

\(H_0: BMI_{active} = BMI_{inactive}\)

and the corresponding non-directional alternative hypothesis would be:

\(H_A: BMI_{active} \neq BMI_{inactive}\)

A directional hypothesis, on the other hand, predicts which direction the difference would go. For example, we have strong prior knowledge to predict that people who engage in physical activity should weigh less than those who do not, so we would propose the following directional null hypothesis:

\(H_0: BMI_{active} \ge BMI_{inactive}\)

and directional alternative:

\(H_A: BMI_{active} < BMI_{inactive}\)

As we will see later, testing a non-directional hypothesis is more conservative, so this is generally to be preferred unless there is a strong a priori reason to hypothesize an effect in a particular direction. Hypotheses, including whether they are directional or not, should always be specified prior to looking at the data!

9.3.3 Step 3: Collect some data

In this case, we will sample 250 individuals from the NHANES dataset. Figure 9.1 shows an example of such a sample, with BMI shown separately for active and inactive individuals, and Table 9.1 shows summary statistics for each group.

Table 9.1: Summary of BMI data for active versus inactive individuals
PhysActive N mean sd
No 131 30 9.0
Yes 119 27 5.2

Figure 9.1: Box plot of BMI data from a sample of adults from the NHANES dataset, split by whether they reported engaging in regular physical activity.

9.3.4 Step 4: Fit a model to the data and compute a test statistic

We next want to use the data to compute a statistic that will ultimately let us decide whether the null hypothesis is rejected or not. To do this, the model needs to quantify the amount of evidence in favor of the alternative hypothesis, relative to the variability in the data. Thus we can think of the test statistic as providing a measure of the size of the effect compared to the variability in the data. In general, this test statistic will have a probability distribution associated with it, because that allows us to determine how likely our observed value of the statistic is under the null hypothesis.

For the BMI example, we need a test statistic that allows us to test for a difference between two means, since the hypotheses are stated in terms of mean BMI for each group. One statistic that is often used to compare two means is the t statistic, first developed by the statistician William Sealy Gosset, who worked for the Guinness Brewery in Dublin and wrote under the pen name “Student” - hence, it is often called “Student’s t statistic”. The t statistic is appropriate for comparing the means of two groups when the sample sizes are relatively small and the population standard deviation is unknown. The t statistic for comparison of two independent groups is computed as:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \]

where \(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two groups, \(S^2_1\) and \(S^2_2\) are the estimated variances of the groups, and \(n_1\) and \(n_2\) are the sizes of the two groups. Because the variance of a difference between two independent variables is the sum of the variances of each individual variable ( \(var(A - B) = var(A) + var(B)\) ), we add the variances for each group divided by their sample sizes in order to compute the standard error of the difference. Thus, one can view the t statistic as a way of quantifying how large the difference between groups is in relation to the sampling variability of the difference between means.

The t statistic is distributed according to a probability distribution known as a t distribution. The t distribution looks quite similar to a normal distribution, but it differs depending on the number of degrees of freedom. When the degrees of freedom are large (say 1000), then the t distribution looks essentially like the normal distribution, but when they are small then the t distribution has longer tails than the normal (see Figure 9.2). In the simplest case, where the groups are the same size and have equal variance, the degrees of freedom for the t test is the number of observations minus 2, since we have computed two means and thus given up two degrees of freedom. In this case it’s pretty clear from the box plot that the inactive group is more variable than the active group, and the numbers in each group differ, so we need to use a slightly more complex formula for the degrees of freedom, which is often referred to as a “Welch t-test”. The formula is:

\[ \mathrm{d.f.} = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1-1} + \frac{\left(S_2^2/n_2\right)^2}{n_2-1}} \]

This will be equal to \(n_1 + n_2 - 2\) when the variances and sample sizes are equal, and otherwise will be smaller, in effect imposing a penalty on the test for differences in sample size or variance. For this example, that comes out to 241.12, which is slightly below the value of 248 that one would get by subtracting 2 from the sample size.
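To make the two formulas concrete, here is a minimal Python sketch of the t statistic and the Welch degrees of freedom computed from summary statistics. The function names and the input numbers are hypothetical and chosen only for illustration (they are not the NHANES values); the point is simply to show that the degrees of freedom equal \(n_1 + n_2 - 2\) when the groups match in size and variance, and shrink otherwise.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic for two independent groups: difference in means / standard error."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error


def welch_df(var1, n1, var2, n2):
    """Welch degrees of freedom, following the formula in the text."""
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# Hypothetical summary statistics, chosen only for illustration.
equal_df = welch_df(var1=4.0, n1=10, var2=4.0, n2=10)
unequal_df = welch_df(var1=9.0, n1=12, var2=2.0, n2=8)
t_value = welch_t(mean1=12.0, var1=9.0, n1=12, mean2=10.0, var2=2.0, n2=8)

print(f"equal groups:   df = {equal_df:.2f}")    # same as n1 + n2 - 2 = 18
print(f"unequal groups: df = {unequal_df:.2f}")  # smaller than n1 + n2 - 2 = 18
print(f"t = {t_value:.2f}")
```

Note that plugging the rounded values from Table 9.1 into these functions would only approximate the results reported in the text, which were computed from the unrounded data.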

Figure 9.2: Each panel shows the t distribution (in blue dashed line) overlaid on the normal distribution (in solid red line). The left panel shows a t distribution with 4 degrees of freedom, in which case the distribution is similar but has slightly wider tails. The right panel shows a t distribution with 1000 degrees of freedom, in which case it is virtually identical to the normal.

9.3.5 Step 5: Determine the probability of the observed result under the null hypothesis

This is the step where NHST starts to violate our intuition. Rather than determining the likelihood that the null hypothesis is true given the data, we instead determine the likelihood under the null hypothesis of observing a statistic at least as extreme as one that we have observed — because we started out by assuming that the null hypothesis is true! To do this, we need to know the expected probability distribution for the statistic under the null hypothesis, so that we can ask how likely the result would be under that distribution. Note that when I say “how likely the result would be”, what I really mean is “how likely the observed result or one more extreme would be”. There are (at least) two reasons that we need to add this caveat. The first is that when we are talking about continuous values, the probability of any particular value is zero (as you might remember if you’ve taken a calculus class). More importantly, we are trying to determine how weird our result would be if the null hypothesis were true, and any result that is more extreme will be even more weird, so we want to count all of those weirder possibilities when we compute the probability of our result under the null hypothesis.

We can obtain this “null distribution” either using a theoretical distribution (like the t distribution), or using randomization. Before we move to our BMI example, let’s start with some simpler examples.

P-values: A very simple example

Let’s say that we wish to determine whether a particular coin is biased towards landing heads. To collect data, we flip the coin 100 times, and let’s say we count 70 heads. In this example, \(H_0: P(heads) \le 0.5\) and \(H_A: P(heads) > 0.5\), and our test statistic is simply the number of heads that we counted. The question that we then want to ask is: How likely is it that we would observe 70 or more heads in 100 coin flips if the true probability of heads is 0.5? We can imagine that this might happen very occasionally just by chance, but it doesn’t seem very likely. To quantify this probability, we can use the binomial distribution:

\[ P(X \le k) = \sum_{i=0}^k \binom{N}{i} p^i (1-p)^{(N-i)} \]

This equation will tell us the probability of a certain number of heads ( \(k\) ) or fewer, given a particular probability of heads ( \(p\) ) and number of events ( \(N\) ). However, what we really want to know is the probability of a certain number or more, which we can obtain by subtracting from one, based on the rules of probability:

\[ P(X \ge k) = 1 - P(X < k) \]

Figure 9.3: Distribution of numbers of heads (out of 100 flips) across 100,000 simulated runs with the observed value of 70 flips represented by the vertical line.

Using the binomial distribution, the probability of 69 or fewer heads given P(heads)=0.5 is 0.999961, so the probability of 70 or more heads is simply one minus that value (0.000039). This computation shows us that the likelihood of getting 70 or more heads if the coin is indeed fair is very small.
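If you would like to check this calculation yourself, the following is a minimal Python sketch that sums the binomial probabilities directly using only the standard library. The function name `binomial_cdf` is my own choice for illustration; most statistical packages provide an equivalent built-in function.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_flips, n_heads, p_heads = 100, 70, 0.5

# P(X >= 70) = 1 - P(X < 70) = 1 - P(X <= 69)
p_69_or_fewer = binomial_cdf(n_heads - 1, n_flips, p_heads)
p_70_or_more = 1 - p_69_or_fewer

print(f"P(X <= 69) = {p_69_or_fewer:.6f}")  # the text reports 0.999961
print(f"P(X >= 70) = {p_70_or_more:.6f}")   # the text reports 0.000039
```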

Now, what if we didn’t have a standard function to tell us the probability of that number of heads? We could instead determine it by simulation – we repeatedly flip a coin 100 times using a true probability of 0.5, and then compute the distribution of the number of heads across those simulation runs. Figure 9.3 shows the result from this simulation. Here we can see that the probability computed via simulation (0.000030) is very close to the theoretical probability (0.000039).

Computing p-values using the t distribution

Now let’s compute a p-value for our BMI example using the t distribution. First we compute the t statistic using the values from our sample that we calculated above, where we find that t = 3.86. The question that we then want to ask is: What is the likelihood that we would find a t statistic of this size, if the true difference between groups is zero or less (i.e. the directional null hypothesis)?

We can use the t distribution to determine this probability. Above we noted that the appropriate degrees of freedom (after correcting for differences in variance and sample size) was df = 241.12. We can use a function from our statistical software to determine the probability of finding a value of the t statistic greater than or equal to our observed value. We find that p(t > 3.86, df = 241.12) = 0.000072, which tells us that our observed t statistic value of 3.86 is relatively unlikely if the null hypothesis really is true.

In this case, we used a directional hypothesis, so we only had to look at one end of the null distribution. If we wanted to test a non-directional hypothesis, then we would need to be able to identify how unexpected the size of the effect is, regardless of its direction. In the context of the t-test, this means that we need to know how likely it is that the statistic would be as extreme in either the positive or negative direction. To do this, we multiply the observed t value by -1, since the t distribution is centered around zero, and then add together the two tail probabilities to get a two-tailed p-value: p(t > 3.86 or t< -3.86, df = 241.12) = 0.000145. Here we see that the p value for the two-tailed test is twice as large as that for the one-tailed test, which reflects the fact that an extreme value is less surprising since it could have occurred in either direction.
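As a rough check on these tail probabilities, here is a Python sketch that integrates the t density numerically using only the standard library. This is an illustrative stand-in rather than the method used in the text: in practice you would call your statistical software's t-distribution function, and the helper names `t_pdf` and `t_upper_tail` are hypothetical. Because it starts from the rounded values t = 3.86 and df = 241.12, the output should land close to, but not necessarily exactly on, the p-values quoted above.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_norm) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t_obs, df, n_intervals=2000):
    """P(T >= t_obs) for t_obs >= 0, using Simpson's rule on [0, t_obs].

    n_intervals must be even for Simpson's rule.
    """
    h = t_obs / n_intervals
    total = t_pdf(0.0, df) + t_pdf(t_obs, df)
    for i in range(1, n_intervals):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # half of the probability mass lies above zero


t_obs, df = 3.86, 241.12
p_one_tailed = t_upper_tail(t_obs, df)
p_two_tailed = 2 * p_one_tailed  # the two tails are equal by symmetry

print(f"one-tailed p = {p_one_tailed:.6f}")
print(f"two-tailed p = {p_two_tailed:.6f}")
```

The doubling in the last step works only because the t distribution is symmetric around zero; for an asymmetric null distribution the two tails would have to be computed separately.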

How do you choose whether to use a one-tailed versus a two-tailed test? The two-tailed test is always going to be more conservative, so it’s always a good bet to use that one, unless you had a very strong prior reason for using a one-tailed test. In that case, you should have written down the hypothesis before you ever looked at the data. In Chapter 18 we will discuss the idea of pre-registration of hypotheses, which formalizes the idea of writing down your hypotheses before you ever see the actual data. You should never make a decision about how to perform a hypothesis test once you have looked at the data, as this can introduce serious bias into the results.

Computing p-values using randomization

So far we have seen how we can use the t-distribution to compute the probability of the data under the null hypothesis, but we can also do this using simulation. The basic idea is that we generate simulated data like those that we would expect under the null hypothesis, and then ask how extreme the observed data are in comparison to those simulated data. The key question is: How can we generate data for which the null hypothesis is true? The general answer is that we can randomly rearrange the data in a particular way that makes the data look like they would if the null was really true. This is similar to the idea of bootstrapping, in the sense that it uses our own data to come up with an answer, but it does it in a different way.

Randomization: a simple example

Let’s start with a simple example. Let’s say that we want to compare the mean squatting ability of football players with cross-country runners, with \(H_0: \mu_{FB} \le \mu_{XC}\) and \(H_A: \mu_{FB} > \mu_{XC}\) . We measure the maximum squatting ability of 5 football players and 5 cross-country runners (which we will generate randomly, assuming that \(\mu_{FB} = 300\) , \(\mu_{XC} = 140\) , and \(\sigma = 30\) ). The data are shown in Table 9.2 .

Table 9.2: Squatting data for the two groups
group squat shuffledSquat
FB 265 125
FB 310 230
FB 335 125
FB 230 315
FB 315 115
XC 155 335
XC 125 155
XC 125 125
XC 125 265
XC 115 310

Figure 9.4: Left: Box plots of simulated squatting ability for football players and cross-country runners. Right: Box plots for subjects assigned to each group after scrambling group labels.

From the plot on the left side of Figure 9.4 it’s clear that there is a large difference between the two groups. We can do a standard t-test to test our hypothesis; for this example we will use the t.test() command in R, which gives the following result:

If we look at the p-value reported here, we see that the likelihood of such a difference under the null hypothesis is very small, using the t distribution to define the null.
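For readers who want to verify the test statistic by hand, here is a minimal Python sketch (not the R call used in the text) that computes t from the squat values in Table 9.2. It assumes the equal-variance form of the degrees of freedom, \(n_1 + n_2 - 2\), which is an assumption on my part that matches the df = 8 quoted later in this section; because the two groups are the same size, the t value itself is the same under either form.

```python
import math
from statistics import mean, variance

# Squat values from Table 9.2
football = [265, 310, 335, 230, 315]
cross_country = [155, 125, 125, 125, 115]

n_fb, n_xc = len(football), len(cross_country)
mean_diff = mean(football) - mean(cross_country)

# Standard error of the difference, from the sample variance of each group
standard_error = math.sqrt(variance(football) / n_fb + variance(cross_country) / n_xc)
t_obs = mean_diff / standard_error

# Equal-variance (Student) degrees of freedom; an assumption, see the note above
df = n_fb + n_xc - 2

print(f"difference in means = {mean_diff:.1f}")
print(f"t = {t_obs:.2f}, df = {df}")
```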

Now let’s see how we could answer the same question using randomization. The basic idea is that if the null hypothesis of no difference between groups is true, then it shouldn’t matter which group one comes from (football players versus cross-country runners) – thus, to create data that are like our actual data but also conform to the null hypothesis, we can randomly reorder the data for the individuals in the dataset, and then recompute the difference between the groups. The results of such a shuffle are shown in the column labeled “shuffledSquat” in Table 9.2, and the boxplots of the resulting data are in the right panel of Figure 9.4.
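The shuffling idea can be sketched in a few lines of Python. This is an illustrative sketch rather than the code behind the figures: the seed, the number of shuffles, and the helper name `t_statistic` are my own assumptions, and the exact p-value will vary a little from run to run and from seed to seed.

```python
import math
import random
from statistics import mean, variance


def t_statistic(group1, group2):
    """Difference in means divided by its standard error."""
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se


# Squat values from Table 9.2
football = [265, 310, 335, 230, 315]
cross_country = [155, 125, 125, 125, 115]
t_obs = t_statistic(football, cross_country)

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
pooled = football + cross_country
n_shuffles = 10_000
n_extreme = 0

for _ in range(n_shuffles):
    shuffled = pooled[:]
    rng.shuffle(shuffled)  # randomly reassign values to the two groups
    t_shuffled = t_statistic(shuffled[:5], shuffled[5:])
    # One-tailed test: count shuffles with t at least as large as observed
    # (the small tolerance guards against floating-point ties)
    if t_shuffled >= t_obs - 1e-12:
        n_extreme += 1

p_value = n_extreme / n_shuffles
print(f"observed t = {t_obs:.2f}")
print(f"randomization p-value = {p_value:.4f}")
```

With only ten observations there are just 252 distinct ways to split the values into two groups of five, so the randomization p-value cannot be smaller than about 1/252; larger datasets allow much finer resolution.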

Figure 9.5: Histogram of t-values for the difference in means between the football and cross-country groups after randomly shuffling group membership. The vertical line denotes the actual difference observed between the two groups, and the dotted line shows the theoretical t distribution for this analysis.

After scrambling the data, we see that the two groups are now much more similar, and in fact the cross-country group now has a slightly higher mean. Now let’s do that 10000 times and store the t statistic for each iteration; if you are doing this on your own computer, it will take a moment to complete. Figure 9.5 shows the histogram of the t values across all of the random shuffles. As expected under the null hypothesis, this distribution is centered at zero (the mean of the distribution is 0.007). From the figure we can also see that the distribution of t values after shuffling roughly follows the theoretical t distribution under the null hypothesis (with mean=0), showing that randomization worked to generate null data. We can compute the p-value from the randomized data by measuring how many of the shuffled values are at least as extreme as the observed value: p(t > 8.01, df = 8) using randomization = 0.00410. This p-value is very similar to the p-value that we obtained using the t distribution, and both are quite extreme, suggesting that the observed data are very unlikely to have arisen if the null hypothesis is true - and in this case we know that it’s not true, because we generated the data.

Randomization: BMI/activity example

Now let’s use randomization to compute the p-value for the BMI/activity example. In this case, we will randomly shuffle the PhysActive variable and compute the difference between groups after each shuffle, and then compare our observed t statistic to the distribution of t statistics from the shuffled datasets. Figure 9.6 shows the distribution of t values from the shuffled samples, and we can also compute the probability of finding a value as large or larger than the observed value. The p-value obtained from randomization (0.000000) is very similar to the one obtained using the t distribution (0.000075). The advantage of the randomization test is that it doesn’t require that we assume that the data from each of the groups are normally distributed, though the t-test is generally quite robust to violations of that assumption. In addition, the randomization test can allow us to compute p-values for statistics when we don’t have a theoretical distribution like we do for the t-test.

Figure 9.6: Histogram of t statistics after shuffling of group labels, with the observed value of the t statistic shown in the vertical line, and values at least as extreme as the observed value shown in lighter gray

We do have to make one main assumption when we use the randomization test, which we refer to as exchangeability. This means that all of the observations are distributed in the same way, such that we can interchange them without changing the overall distribution. The main place where this can break down is when there are related observations in the data; for example, if we had data from individuals in 4 different families, then we couldn’t assume that individuals were exchangeable, because siblings would be closer to each other than they are to individuals from other families. In general, if the data were obtained by random sampling, then the assumption of exchangeability should hold.
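The shuffling procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the code used to produce the figures: the data are made up, the helper names `t_statistic` and `permutation_p_value` are hypothetical, and the statistic is assumed to be Welch's t (the chapter does not say which variant it used).

```python
import random
import statistics


def t_statistic(a, b):
    """Welch's t statistic for two independent samples (an assumed choice)."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """One-sided p-value: share of shuffles whose t is at least the observed t."""
    rng = random.Random(seed)
    observed = t_statistic(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # scramble the group labels
        shuffled_a = pooled[:len(a)]
        shuffled_b = pooled[len(a):]
        if t_statistic(shuffled_a, shuffled_b) >= observed:
            count += 1
    return observed, count / n_shuffles


# Made-up data for illustration only (not the chapter's dataset).
group_a = [27.1, 30.4, 28.8, 31.2, 29.5, 32.0, 30.1, 28.3]
group_b = [25.2, 26.9, 27.4, 24.8, 26.1, 27.9, 25.5, 26.3]
observed_t, p_value = permutation_p_value(group_a, group_b)
print(f"observed t = {observed_t:.2f}, randomization p = {p_value:.4f}")
```

Because the p-value is a count divided by the number of shuffles, a result of exactly zero only tells us that the p-value is smaller than `1 / n_shuffles`; more shuffles give finer resolution.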

9.3.6 Step 6: Assess the “statistical significance” of the result

The next step is to determine whether the p-value that results from the previous step is small enough that we are willing to reject the null hypothesis and conclude instead that the alternative is true. How much evidence do we require? This is one of the most controversial questions in statistics, in part because it requires a subjective judgment – there is no “correct” answer.

Historically, the most common answer to this question has been that we should reject the null hypothesis if the p-value is less than 0.05. This comes from the writings of Ronald Fisher, who has been referred to as “the single most important figure in 20th century statistics” (Efron 1998):

“If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 … it is convenient to draw the line at about the level at which we can say: Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials” (R. A. Fisher 1925)

However, Fisher never intended \(p < 0.05\) to be a fixed rule:

“no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” (Ronald Aylmer Fisher 1956)

Instead, it is likely that p < .05 became a ritual due to the reliance upon tables of p-values that were used before computing made it easy to compute p values for arbitrary values of a statistic. All of the tables had an entry for 0.05, making it easy to determine whether one’s statistic exceeded the value needed to reach that level of significance.

The choice of statistical thresholds remains deeply controversial, and recently (Benjamin et al., 2018) it has been proposed that the default threshold be changed from .05 to .005, making it substantially more stringent and thus more difficult to reject the null hypothesis. In large part this move is due to growing concerns that the evidence obtained from a significant result at \(p < .05\) is relatively weak; we will return to this in our later discussion of reproducibility in Chapter 18.

Hypothesis testing as decision-making: The Neyman-Pearson approach

Whereas Fisher thought that the p-value could provide evidence regarding a specific hypothesis, the statisticians Jerzy Neyman and Egon Pearson disagreed vehemently. Instead, they proposed that we think of hypothesis testing in terms of its error rate in the long run:

“no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong” (J. Neyman and Pearson 1933)

That is: We can’t know which specific decisions are right or wrong, but if we follow the rules, we can at least know how often our decisions will be wrong in the long run.

To understand the decision making framework that Neyman and Pearson developed, we first need to discuss statistical decision making in terms of the kinds of outcomes that can occur. There are two possible states of reality ( \(H_0\) is true, or \(H_0\) is false), and two possible decisions (reject \(H_0\) , or retain \(H_0\) ). There are two ways in which we can make a correct decision:

  • We can reject \(H_0\) when it is false (in the language of signal detection theory, we call this a hit)
  • We can retain \(H_0\) when it is true (somewhat confusingly in this context, this is called a correct rejection)

There are also two kinds of errors we can make:

  • We can reject \(H_0\) when it is actually true (we call this a false alarm, or Type I error)
  • We can retain \(H_0\) when it is actually false (we call this a miss, or Type II error)

Neyman and Pearson coined two terms to describe the probability of these two types of errors in the long run:

  • P(Type I error) = \(\alpha\)
  • P(Type II error) = \(\beta\)

That is, if we set \(\alpha\) to .05, then in the long run we should make a Type I error 5% of the time when the null hypothesis is true. Whereas it’s common to set \(\alpha\) as .05, the standard value for an acceptable level of \(\beta\) is .2 - that is, we are willing to accept that 20% of the time we will fail to detect an effect when it truly exists. We will return to this in Section 10.3 when we discuss statistical power, which is the complement of the Type II error rate (power = \(1 - \beta\)).
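The long-run meaning of \(\alpha\) and \(\beta\) can be checked by simulating many studies. The sketch below is an illustration under stated assumptions rather than anything from the original text: it uses a two-sided z-test with known standard deviation, 25 observations per study, and a hypothetical true mean of 0.56 chosen so that the miss rate lands near .2. The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean


def rejects_null(sample, mu0, sigma, alpha):
    """Two-sided z-test of H0: mu = mu0 with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha


def rejection_rate(true_mu, n=25, sigma=1.0, alpha=0.05, n_studies=4000, seed=1):
    """Share of simulated studies in which H0: mu = 0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if rejects_null(sample, 0.0, sigma, alpha):
            rejections += 1
    return rejections / n_studies


# H0 is true: every rejection is a false alarm (Type I error).
type_1_rate = rejection_rate(true_mu=0.0)
# H0 is false (assumed true mean 0.56): every non-rejection is a miss (Type II error).
type_2_rate = 1 - rejection_rate(true_mu=0.56)
print(f"Type I error rate  ~ {type_1_rate:.3f}")
print(f"Type II error rate ~ {type_2_rate:.3f}")
```

The first rate should settle near the chosen \(\alpha\) regardless of sample size, whereas the second depends on both the sample size and the size of the true effect.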

9.3.7 What does a significant result mean?

There is a great deal of confusion about what p-values actually mean (Gigerenzer, 2004). Let’s say that we do an experiment comparing the means between conditions, and we find a difference with a p-value of .01. There are a number of possible interpretations that one might entertain.

Does it mean that the probability of the null hypothesis being true is .01?

No. Remember that in null hypothesis testing, the p-value is the probability of data at least as extreme as those observed, given the null hypothesis ( \(P(data|H_0)\) ). It does not warrant conclusions about the probability of the null hypothesis given the data ( \(P(H_0|data)\) ). We will return to this question when we discuss Bayesian inference in a later chapter, as Bayes’ theorem lets us invert the conditional probability in a way that allows us to determine the probability of the hypothesis given the data.

Does it mean that the probability that you are making the wrong decision is .01?

No. This would be \(P(H_0|data)\), but remember as above that p-values are probabilities of data under \(H_0\), not probabilities of hypotheses.

Does it mean that if you ran the study again, you would obtain the same result 99% of the time?

No. The p-value is a statement about the likelihood of a particular dataset under the null; it does not allow us to make inferences about the likelihood of future events such as replication.

Does it mean that you have found a practically important effect?

No. There is an essential distinction between statistical significance and practical significance. As an example, let’s say that we performed a randomized controlled trial to examine the effect of a particular diet on body weight, and we find a statistically significant effect at p<.05. What this doesn’t tell us is how much weight was actually lost, which we refer to as the effect size (to be discussed in more detail in Chapter 10). If we think about a study of weight loss, then we probably don’t think that the loss of one ounce (i.e. the weight of a few potato chips) is practically significant. Let’s look at our ability to detect a significant difference of 1 ounce as the sample size increases.

Figure 9.7 shows how the proportion of significant results increases as the sample size increases, such that with a very large sample size (about 262,000 total subjects), we will find a significant result in more than 90% of studies when there is a 1 ounce difference in weight loss between the diets. While these are statistically significant, most physicians would not consider a weight loss of one ounce to be practically or clinically significant. We will explore this relationship in more detail when we return to the concept of statistical power in Section 10.3, but it should already be clear from this example that statistical significance is not necessarily indicative of practical significance.

Figure 9.7: The proportion of significant results for a very small change (1 ounce, which is about .001 standard deviations) as a function of sample size.
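The way the proportion of significant results grows with sample size can be approximated without simulation. The sketch below uses a normal approximation for a two-sided, two-sample test; the function name `power_two_sample` is made up for this example, and the standardized effect size `d = 0.01` is a hypothetical value chosen for illustration, not the one behind Figure 9.7.

```python
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardized effect d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the z statistic
    # Probability that z falls beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


d = 0.01  # hypothetical effect size in standard deviation units
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"n per group = {n:>9,}: power ~ {power_two_sample(d, n):.3f}")
```

With no true effect (`d = 0`) the function returns \(\alpha\) for any sample size, while for any nonzero effect, however small, the proportion of significant results approaches one as the sample grows.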

9.4 NHST in a modern context: Multiple testing

So far we have discussed examples where we are interested in testing a single statistical hypothesis, and this is consistent with traditional science which often measured only a few variables at a time. However, in modern science we can often measure millions of variables per individual. For example, in genetic studies that quantify the entire genome, there may be many millions of measures per individual, and in the brain imaging research that my group does, we often collect data from more than 100,000 locations in the brain at once. When standard hypothesis testing is applied in these contexts, bad things can happen unless we take appropriate care.

Let’s look at an example to see how this might work. There is great interest in understanding the genetic factors that can predispose individuals to major mental illnesses such as schizophrenia, because we know that about 80% of the variation between individuals in the presence of schizophrenia is due to genetic differences. The Human Genome Project and the ensuing revolution in genome science have provided tools to examine the many ways in which humans differ from one another in their genomes. One approach that has been used in recent years is known as a genome-wide association study (GWAS), in which the genome of each individual is characterized at one million or more places to determine which letters of the genetic code they have at each location, focusing on locations where humans tend to differ frequently. After these have been determined, the researchers perform a statistical test at each location in the genome to determine whether people diagnosed with schizophrenia are more or less likely to have one specific version of the genetic sequence at that location.

Let’s imagine what would happen if the researchers simply asked whether the test was significant at p<.05 at each location, when in fact there is no true effect at any of the locations. To do this, we generate a large number of simulated t values from a null distribution, and ask how many of them are significant at p<.05. Let’s do this many times, and each time count up how many of the tests come out as significant (see Figure 9.8).

Figure 9.8: Left: A histogram of the number of significant results in each set of one million statistical tests, when there is in fact no true effect. Right: A histogram of the number of significant results across all simulation runs after applying the Bonferroni correction for multiple tests.

This shows that about 5% of all of the tests were significant in each run, meaning that if we were to use p < .05 as our threshold for statistical significance, then even if there were no truly significant relationships present, we would still “find” many genes that were seemingly significant in each study. The expected number of significant results is simply \(n \times \alpha\): about 500 for a study with 10,000 tests, and about 50,000 for a study with one million tests. That is because while we controlled for the error per test, we didn’t control the error rate across our entire family of tests (known as the familywise error), which is what we really want to control if we are going to be looking at the results from a large number of tests. Using p<.05, our familywise error rate in the above example is essentially one; that is, we are pretty much guaranteed to make at least one error in any particular study.

A simple way to control for the familywise error is to divide the alpha level by the number of tests; this is known as the Bonferroni correction, named after the Italian statistician Carlo Bonferroni. Using the data from our example above, we see in Figure 9.8 that only about 5 percent of studies show any significant results using the corrected alpha level of 0.000005 (that is, .05 divided by 10,000 tests) instead of the nominal level of .05. We have effectively controlled the familywise error, such that the probability of making any errors in our study is controlled at right around .05.
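A small simulation makes the effect of the correction concrete. The sketch below is not the code behind Figure 9.8: it assumes 10,000 independent tests per study and 300 simulated studies, and it draws p-values directly from a uniform distribution (which is how p-values behave when the null hypothesis is true) instead of simulating t values.

```python
import random

alpha = 0.05
n_tests = 10_000   # assumed number of tests per study
n_studies = 300    # assumed number of simulated studies
rng = random.Random(42)

uncorrected_counts = []
studies_with_bonferroni_hit = 0
for _ in range(n_studies):
    # Under the null hypothesis, p-values are uniformly distributed on [0, 1].
    p_values = [rng.random() for _ in range(n_tests)]
    uncorrected_counts.append(sum(p < alpha for p in p_values))
    # Bonferroni: compare every p-value to alpha divided by the number of tests.
    if min(p_values) < alpha / n_tests:
        studies_with_bonferroni_hit += 1

mean_uncorrected = sum(uncorrected_counts) / n_studies
bonferroni_fwer = studies_with_bonferroni_hit / n_studies
# Probability of at least one false positive across independent uncorrected tests.
analytic_fwer = 1 - (1 - alpha) ** n_tests

print(f"mean false positives per study at p < .05: {mean_uncorrected:.1f}")
print(f"uncorrected familywise error rate (analytic): {analytic_fwer:.6f}")
print(f"share of studies with any Bonferroni-significant test: {bonferroni_fwer:.3f}")
```

The Bonferroni correction assumes nothing about how the tests relate to one another, which makes it simple but conservative when the tests are strongly correlated.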

9.5 Learning objectives

  • Identify the components of a hypothesis test, including the parameter of interest, the null and alternative hypotheses, and the test statistic.
  • Describe the proper interpretations of a p-value as well as common misinterpretations.
  • Distinguish between the two types of error in hypothesis testing, and the factors that determine them.
  • Describe how resampling can be used to compute a p-value.
  • Describe the problem of multiple testing, and how it can be addressed.
  • Describe the main criticisms of null hypothesis statistical testing.

9.6 Suggested readings

  • Mindless Statistics, by Gerd Gigerenzer

Understandable Statistics (Charles Henry Brase)

Chapter 9: Hypothesis Testing

Introduction to Statistical Tests

Discuss each of the following topics in class or review the topics on your own. Then write a brief but complete essay in which you answer the following questions. (a) What is a null hypothesis $H_{0} ?$ (b) What is an alternate hypothesis $H_{1} ?$ (c) What is a type I error? a type II error? (d) What is the level of significance of a test? What is the probability of a type II error?

In a statistical test, we have a choice of a left-tailed test, a right-tailed test, or a two-tailed test. Is it the null hypothesis or the alternate hypothesis that determines which type of test is used? Explain your answer.

If we fail to reject (i.e., "accept") the null hypothesis, does this mean that we have proved it to be true beyond all doubt? Explain your answer.

If we reject the null hypothesis, does this mean that we have proved it to be false beyond all doubt? Explain your answer.

The body weight of a healthy 3-month-old colt should be about $\mu=60 \mathrm{~kg}$. (Source: The Merck Veterinary Manual, a standard reference manual used in most veterinary colleges.) (a) If you want to set up a statistical test to challenge the claim that $\mu=60 \mathrm{~kg}$, what would you use for the null hypothesis $H_{0}$? (b) In Nevada, there are many herds of wild horses. Suppose you want to test the claim that the average weight of a wild Nevada colt (3 months old) is less than $60 \mathrm{~kg}$. What would you use for the alternate hypothesis $H_{1}$? (c) Suppose you want to test the claim that the average weight of such a wild colt is greater than $60 \mathrm{~kg}$. What would you use for the alternate hypothesis? (d) Suppose you want to test the claim that the average weight of such a wild colt is different from $60 \mathrm{~kg}$. What would you use for the alternate hypothesis? (e) For each of the tests in parts (b), (c), and (d), would the area corresponding to the $P$ -value be on the left, on the right, or on both sides of the mean? Explain your answer in each case.

How much customers buy is a direct result of how much time they spend in the store. A study of average shopping times in a large national houseware store gave the following information (Source: Why We Buy: The Science of Shopping by P. Underhill): Women with female companion: $8.3 \mathrm{~min}$. Women with male companion: $4.5 \mathrm{~min}$. Suppose you want to set up a statistical test to challenge the claim that a woman with a female friend spends an average of $8.3$ minutes shopping in such a store. (a) What would you use for the null and alternate hypotheses if you believe the average shopping time is less than $8.3$ minutes? Is this a right-tailed, left-tailed, or two-tailed test? (b) What would you use for the null and alternate hypotheses if you believe the average shopping time is different from $8.3$ minutes? Is this a right-tailed, left-tailed, or two-tailed test? Stores that sell mainly to women should figure out a way to engage the interest of men! Perhaps comfortable seats and a big TV with sports programs. Suppose such an entertainment center was installed and you now wish to challenge the claim that a woman with a male friend spends only $4.5$ minutes shopping in a houseware store. (c) What would you use for the null and alternate hypotheses if you believe the average shopping time is more than $4.5$ minutes? Is this a right-tailed, left-tailed, or two-tailed test? (d) What would you use for the null and alternate hypotheses if you believe the average shopping time is different from $4.5$ minutes? Is this a right-tailed, left-tailed, or two-tailed test?

Weatherwise magazine is published in association with the American Meteorological Society. Volume 46, Number 6 has a rating system to classify Nor'easter storms that frequently hit New England states and can cause much damage near the ocean coast. A severe storm has an average peak wave height of $16.4$ feet for waves hitting the shore. Suppose that a Nor'easter is in progress at the severe storm class rating. (a) Let us say that we want to set up a statistical test to see if the wave action (i.e., height) is dying down or getting worse. What would be the null hypothesis regarding average wave height? (b) If you wanted to test the hypothesis that the storm is getting worse, what would you use for the alternate hypothesis? (c) If you wanted to test the hypothesis that the waves are dying down, what would you use for the alternate hypothesis? (d) Suppose you do not know if the storm is getting worse or dying out. You just want to test the hypothesis that the average wave height is different (either higher or lower) from the severe storm class rating. What would you use for the alternate hypothesis? (e) For each of the tests in parts (b), (c), and (d), would the area corresponding to the $P$ -value be on the left, on the right, or on both sides of the mean? Explain your answer in each case.

Consumer Reports stated that the mean time for a Chrysler Concorde to go from 0 to 60 miles per hour was $8.7$ seconds. (a) If you want to set up a statistical test to challenge the claim of $8.7$ seconds, what would you use for the null hypothesis? (b) The town of Leadville, Colorado, has an elevation over 10,000 feet. Suppose you wanted to test the claim that the average time to accelerate from 0 to 60 miles per hour is longer in Leadville (because of less oxygen). What would you use for the alternate hypothesis? (c) Suppose you made an engine modification and you think the average time to accelerate from 0 to 60 miles per hour is reduced. What would you use for the alternate hypothesis? (d) For each of the tests in parts (b) and (c), would the $P$ -value area be on the left, on the right, or on both sides of the mean? Explain your answer in each case.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. Let $x$ be a random variable representing dividend yield of Australian bank stocks. We may assume that $x$ has a normal distribution with $\sigma=2.4 \%$. A random sample of 10 Australian bank stocks gave the following yields. $\begin{array}{llllllllll}5.7 & 4.8 & 6.0 & 4.9 & 4.0 & 3.4 & 6.5 & 7.1 & 5.3 & 6.1\end{array}$ The sample mean is $\bar{x}=5.38 \%$. For the entire Australian stock market, the mean dividend yield is $\mu=4.7 \%$ (Reference: Forbes). Do these data indicate that the dividend yield of all Australian bank stocks is higher than $4.7 \%$ ? Use $\alpha=0.01$.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. Gentle Ben is a Morgan horse at a Colorado dude ranch. Over the past 8 weeks, a veterinarian took the following glucose readings from this horse (in $\mathrm{mg} / 100 \mathrm{ml}$ ). $\begin{array}{llllllll}93 & 88 & 82 & 105 & 99 & 110 & 84 & 89\end{array}$ The sample mean is $\bar{x} \approx 93.8$. Let $x$ be a random variable representing glucose readings taken from Gentle Ben. We may assume that $x$ has a normal distribution, and we know from past experience that $\sigma=12.5 .$ The mean glucose level for horses should be $\mu=85 \mathrm{mg} / 100 \mathrm{ml}$ (Reference: Merck Veterinary Manual). Do these data indicate that Gentle Ben has an overall average glucose level higher than 85? Use $\alpha=0.05$.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. Bill Alther is a zoologist who studies Anna's hummingbird (Calypte anna). (Reference: Hummingbirds, K. Long, W. Alther.) Suppose that in a remote part of the Grand Canyon, a random sample of six of these birds was caught, weighed, and released. The weights (in grams) were $\begin{array}{llllll}3.7 & 2.9 & 3.8 & 4.2 & 4.8 & 3.1\end{array}$ The sample mean is $\bar{x}=3.75$ grams. Let $x$ be a random variable representing weights of Anna's hummingbirds in this part of the Grand Canyon. We assume that $x$ has a normal distribution and $\sigma=0.70$ gram. It is known that for the population of all Anna's hummingbirds, the mean weight is $\mu=4.55$ grams. Do the data indicate that the mean weight of these birds in this part of the Grand Canyon is less than $4.55$ grams? Use $\alpha=0.01$.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. The price to earnings ratio $(\mathrm{P} / \mathrm{E})$ is an important tool in financial work. A random sample of 14 large U.S. banks (J. P. Morgan, Bank of America, and others) gave the following $\mathrm{P} / \mathrm{E}$ ratios (Reference: Forbes). $\begin{array}{lllllll}24 & 16 & 22 & 14 & 12 & 13 & 17 \\ 22 & 15 & 19 & 23 & 13 & 11 & 18\end{array}$ The sample mean is $\bar{x} \approx 17.1$. Generally speaking, a low $\mathrm{P} / \mathrm{E}$ ratio indicates a "value" or bargain stock. A recent copy of The Wall Street Journal indicated that the $\mathrm{P} / \mathrm{E}$ ratio of the entire $\mathrm{S\&P} 500$ stock index is $\mu=19 .$ Let $x$ be a random variable representing the $\mathrm{P} / \mathrm{E}$ ratio of all large U.S. bank stocks. We assume that $x$ has a normal distribution and $\sigma=4.5$. Do these data indicate that the $\mathrm{P} / \mathrm{E}$ ratio of all U.S. bank stocks is less than 19 ? Use $\alpha=0.05$.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. Nationally, about $11 \%$ of the total U.S. wheat crop is destroyed each year by hail (Reference: Agricultural Statistics, U.S. Department of Agriculture). An insurance company is studying wheat hail damage claims in Weld County, Colorado. A random sample of 16 claims in Weld County gave the following data (\% wheat crop lost to hail). $\begin{array}{rrrrrrrr}15 & 8 & 9 & 11 & 12 & 20 & 14 & 11 \\ 7 & 10 & 24 & 20 & 13 & 9 & 12 & 5\end{array}$ The sample mean is $\bar{x}=12.5 \%$. Let $x$ be a random variable that represents the percentage of wheat crop in Weld County lost to hail. Assume that $x$ has a normal distribution and $\sigma=5.0 \%$. Do these data indicate that the percentage of wheat crop lost to hail in Weld County is different (either way) from the national mean of $11 \% ?$ Use $\alpha=0.01$.

Please provide the following information. (a) What is the level of significance? State the null and alternate hypotheses. Will you use a left-tailed, right-tailed, or two-tailed test? (b) What sampling distribution will you use? Explain the rationale for your choice of sampling distribution. What is the value of the sample test statistic? (c) Find (or estimate) the $P$ -value. Sketch the sampling distribution and show the area corresponding to the $P$ -value. (d) Based on your answers in parts (a) to (c), will you reject or fail to reject the null hypothesis? Are the data statistically significant at level $\alpha$ ? (e) State your conclusion in the context of the application. Total blood volume (in ml) per body weight (in $\mathrm{kg}$ ) is important in medical research. For healthy adults, the red blood cell volume mean is about $\mu=28 \mathrm{ml} / \mathrm{kg}$ (Reference: Laboratory and Diagnostic Tests, F. Fischbach). Red blood cell volume that is too low or too high can indicate a medical problem (see reference). Suppose that Roger has had seven blood tests, and the red blood cell volumes were $\begin{array}{lllllll}32 & 25 & 41 & 35 & 30 & 37 & 29\end{array}$ The sample mean is $\bar{x} \approx 32.7 \mathrm{ml} / \mathrm{kg}$. Let $x$ be a random variable that represents Roger's red blood cell volume. Assume that $x$ has a normal distribution and $\sigma=4.75 .$ Do the data indicate that Roger's red blood cell volume is different (either way) from $\mu=28 \mathrm{ml} / \mathrm{kg}$ ? Use a $0.01$ level of significance.
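The exercises above that supply a known $\sigma$ all follow the same computational pattern, which can be sketched as a short Python helper. This is only an illustration of the arithmetic, using the Australian bank stock yields from the first of these exercises; the function name `z_test` and its `tail` argument are made up for this sketch, and the written parts of the exercise (stating hypotheses, sketching the distribution, interpreting the result) still have to be done by hand.

```python
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, tail):
    """One-sample z-test with known sigma; tail is 'left', 'right', or 'two'."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    cdf = NormalDist().cdf(z)
    if tail == "left":
        p = cdf
    elif tail == "right":
        p = 1 - cdf
    elif tail == "two":
        p = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, p


# Dividend yields (%) from the Australian bank stock exercise; sigma = 2.4, mu0 = 4.7.
yields = [5.7, 4.8, 6.0, 4.9, 4.0, 3.4, 6.5, 7.1, 5.3, 6.1]
z, p = z_test(yields, mu0=4.7, sigma=2.4, tail="right")
print(f"z = {z:.3f}, P-value = {p:.4f}, reject H0 at alpha = 0.01: {p < 0.01}")
```

The `tail` argument mirrors the choice made in part (a) of each exercise: it is determined by the alternate hypothesis, not by the data.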

Work in groups on these problems. You should try to answer the questions without referring to your textbook. If you get stuck, try asking another group for help.

Student Learning Outcome

  • The student will evaluate if there is a significant relationship between favorite type of snack and gender.

Collect the Data

NOTE: You may need to combine two food categories so that each cell has an expected value of at least five.

Favorite type of snack
  • Looking at Table , does it appear to you that there is a dependence between gender and favorite type of snack food? Why or why not?

Hypothesis Test

Conduct a hypothesis test to determine if the factors are independent:

  • \(H_{0}\): ________
  • \(H_{a}\): ________
  • What distribution should you use for a hypothesis test?
  • Why did you choose this distribution?
  • Calculate the test statistic.
  • Find the p -value.
  • State your decision.
  • State your conclusion in a complete sentence.
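The calculation asked for in the steps above can be sketched as follows. The counts are hypothetical placeholders, not class data, and the function name `chi_square_independence` is made up for this example; substitute the table collected in class. The sketch assumes the chi-square test of independence is the intended procedure, which the note about expected cell values of at least five suggests.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected


# Hypothetical counts (rows: two gender groups; columns: three snack types).
observed = [[20, 15, 25],
            [30, 20, 10]]
statistic, dof, expected = chi_square_independence(observed)
all_expected_ok = all(e >= 5 for row in expected for e in row)

# With exactly 2 degrees of freedom the chi-square tail area is exp(-x / 2);
# other degrees of freedom need a table or a statistics library.
p_value = math.exp(-statistic / 2) if dof == 2 else None

print(f"chi-square = {statistic:.3f}, df = {dof}, p-value = {p_value}")
print(f"all expected counts at least 5: {all_expected_ok}")
```

If any expected count falls below five, combine categories as described in the note under Collect the Data and rerun the calculation on the smaller table.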

Discussion Questions

  • Is the conclusion of your study the same as or different from your answer to question two under Collect the Data?
  • Why do you think that occurred?


  1. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9 9.5 The t Test The t-Test The t-Test is a test for hypotheses concerning the mean parameter in the normal distribution when the variance is also unknown. The test is based on the t distribution The setup for the next few slides: Let X1;:::;Xn be i.i.d. N( ;˙2) and consider the hypotheses H0: 0 vs. H1: > 0 (1)

  2. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9: Hypothesis Testing Sections 9.1 Problems of Testing Hypotheses Skip: 9.2 Testing Simple Hypotheses 9.3 Uniformly Most Powerful Tests Skip: 9.4 Two-Sided Alternatives 9.5 The t Test 9.6 Comparing the Means of Two Normal Distributions 9.7 The F Distributions 9.8 Bayes Test Procedures

  3. Ch. 9 Chapter Review

    To test a null hypothesis, find the p-value for the sample data and graph the results. When deciding whether or not to reject the null the hypothesis, keep these two parameters in mind: α > p-value, reject the null hypothesis. α ≤ p-value, do not reject the null hypothesis. 9.5 Additional Information and Full Hypothesis Test Examples. The ...

  4. 9: Hypothesis Testing

    The structure of the chapter is as follows. Firstly, I'll describe how hypothesis testing works, in a fair amount of detail, using a simple running example to show you how a hypothesis test is "built". ... 9.6: Reporting the Results of a Hypothesis Test; 9.7: Running the Hypothesis Test in Practice; 9.8: Effect Size, Sample Size and Power ...

  5. Introduction to Chapter 9: Hypothesis Testing with Two Samples

    Chapter 9: Hypothesis Testing with Two Samples. Introduction to Chapter 9: Hypothesis Testing with Two Samples Figure 10.1. If you want to test a claim that involves two groups (the types of breakfasts eaten east and west of the Mississippi River) you can use a slightly different technique when conducting a hypothesis test. (credit: Chloe Lim)

  6. 9 Chapter 9 Hypothesis testing

    Chapter 9 Hypothesis testing. The first unit was designed to prepare you for hypothesis testing. In the first chapter we discussed the three major goals of statistics: Describe: connects to unit 1 with descriptive statistics and graphing. Decide: connects to unit 1 knowing your data and hypothesis testing.

  7. 9.6: Chapter 9 Formulas

    Critical value method: reject H₀ when the test statistic is in the critical tail(s). Confidence interval method: reject H₀ when the hypothesized value (0) found in H₀ is outside the bounds of the confidence interval. The most important step in any method you use is setting up your null and alternative hypotheses.

  8. PDF STAT 201 Chapter 9.1-9.2 Hypothesis Testing for Proportion

    A hypothesis is a proposition assumed as a premise in an argument. It's a statement regarding a characteristic of one or more populations. Hypothesis testing is a procedure based on evidence found in a sample to test hypotheses. The null hypothesis, H₀, is a statement to be tested. The null hypothesis is a statement of no change, no effect or ...

  9. Ch. 9 Introduction

    In this chapter, you will conduct hypothesis tests on single means and single proportions. You will also learn about the errors associated with these tests. Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the data, and a conclusion. To perform a hypothesis test, a statistician will: Set up two ...

  10. Chapter 9 Hypothesis Testing

    Chapter 9 Hypothesis Testing. Now that we've studied confidence intervals in Chapter 8, let's study another commonly used method for statistical inference: hypothesis testing. Hypothesis tests allow us to take a sample of data from a population and infer about the plausibility of competing hypotheses.

  11. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9: Hypothesis Testing Sections 9.1 Problems of Testing Hypotheses Skip: 9.2 Testing Simple Hypotheses 9.3 Uniformly Most Powerful Tests Skip: 9.4 Two-Sided Alternatives 9.5 The t Test 9.6 Comparing the Means of Two Normal Distributions 9.7 The F Distributions 9.8 Bayes Test Procedures 9.9 Foundational Issues

  12. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9, 9.1 Problems of Testing Hypotheses. Introduction. Statistical inference: Given a probability model f(x | θ) (and possibly a prior p(θ)) we may be interested in parameter estimation (Chapters 7 and 8) or making decisions (hypothesis testing, Chapter 9). E.g. If the disease affects 2% or more of the population, the state

  13. Ch. 9 Practice

    Introduction; 9.1 Null and Alternative Hypotheses; 9.2 Outcomes and the Type I and Type II Errors; 9.3 Distribution Needed for Hypothesis Testing; 9.4 Rare Events, the Sample, and the Decision and Conclusion; 9.5 Additional Information and Full Hypothesis Test Examples; 9.6 Hypothesis Testing of a Single Mean and Single Proportion; Key Terms; Chapter Review; Formula Review

  14. PDF Chapter (9) Fundamentals of Hypothesis Testing: One-Sample Tests

    Hypothesis Testing Steps Objectives In this chapter, you learn: (Slide 9) ⚫ The basic principles of hypothesis testing ⚫ How to use hypothesis testing to test a mean or proportion ⚫ The assumptions of each hypothesis-testing procedure, how to evaluate them, and the consequences if they are seriously violated ⚫ Define Type I and Type II ...

  15. Chapter 9 Hypothesis testing

    Chapter 9. Hypothesis testing. In the first chapter we discussed the three major goals of statistics: Describe. Decide. Predict. In this chapter we will introduce the ideas behind the use of statistics to make decisions - in particular, decisions about whether a particular hypothesis is supported by the data.

  16. PDF Chapter 9: Hypothesis Testing with One Sample

    Step 3 Summarize the data into an appropriate test statistic. Step 4 Find the p-value by comparing the test statistic to the possibilities expected if the null hypothesis were true OR determine the critical value. Step 5 Decide whether the result is statistically significant based on the p-value.

  17. PPT Chapter 9 Hypothesis Testing

    Chapter 9 Hypothesis Testing 9.1 The Language of Hypothesis Testing Example: Illustrating Hypothesis Testing According to the National Center for Chronic Disease Prevention and Health Promotion, 73.8% of females between the ages of 18 and 29 years exercise. Kathleen believes that more women between the ages of 18 and 29 years are now exercising.
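A claim like Kathleen's is typically checked with an upper-tailed one-proportion z-test of H₀: p = 0.738 against H₁: p > 0.738. The sketch below assumes a large-sample normal approximation. The survey counts (750 of 1000) are hypothetical and do not come from the source; only the 73.8% baseline does.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Upper-tailed z-test of H0: p = p0 against H1: p > p0 (normal approximation)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    z = (p_hat - p0) / standard_error
    p_value = 1 - NormalDist().cdf(z)  # area in the upper tail beyond z
    return z, p_value


# Hypothetical survey result, for illustration only.
z, p_value = one_proportion_z_test(successes=750, n=1000, p0=0.738)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

The standard error uses p₀ rather than the sample proportion because the test statistic is evaluated under the assumption that the null hypothesis is true.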

  18. STA 270 Chapter 9

    Chapter 9: Hypothesis Tests for One Population Mean. Recall, confidence intervals use sample data to estimate an unknown population parameter. We can also use sample data to determine whether a claim or hypothesis made about a population parameter is plausible. Hypothesis Tests are based on a reductio ad absurdum form of argument.

  19. Notes

    CHAPTER 9 HYPOTHESIS TESTING. 9 Introduction to Hypothesis Testing. Hypothesis testing is the second type of inference we can make about population parameters. A formulated belief is a hypothesis. There are two competing hypotheses about a particular population of interest. Collect evidence to conclude which competing hypothesis is supported.

  20. Chapter 9, Hypothesis Testing Video Solutions ...

    Suppose that in a remote part of the Grand Canyon, a random sample of six of these birds was caught, weighed, and released. The weights (in grams) were: 3.7, 2.9, 3.8, 4.2, 4.8, 3.1. The sample mean is x̄ = 3.75 grams. Let x be a random variable representing weights of Anna's hummingbirds in this part of the Grand Canyon.

  21. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9, 9.8 Bayes Test Procedures. Bayesian test procedures: All inference about a parameter is based on the posterior distribution, including hypothesis testing. Let H₀: θ ∈ Ω₀ vs. H₁: θ ∈ Ω₁. Then we can obtain: P(θ ∈ Ω₀ | x) = probability that H₀ is true; P(θ ∈ Ω₁ | x) = probability that H₁ is true. A straightforward test procedure: Reject H₀ if P(θ ∈ Ω₀ | x) < P ...

  22. PDF Lecture Notes 15 Hypothesis Testing (Chapter 10) 1 Introduction

    Hypothesis Testing (Chapter 10) 1 Introduction. Let X₁, …, Xₙ ∼ p_θ(x). Suppose we want to know if θ = θ₀ or not, where θ₀ is a specific value of θ. For example, if we are ... ten times, people use hypothesis testing when it would be much more appropriate to use confidence intervals. Notation: Let Φ be the cdf of a standard Normal random variable Z ...

  23. PDF Chapter 9 Chapter 9: Hypothesis Testing

    Chapter 9, 9.1 Problems of Testing Hypotheses. Notes on hypothesis testing: Decisions are expressed in terms of H₀. "Do not reject H₀" does not mean that we should accept H₀ as true. Some use the phrase "There is no evidence that H₀ is not true". "Critical regions" vs. "rejection regions".

  24. 9.7.2: Chapter 7 Chi-Square

    This page titled 9.7.2: Chapter 7 Chi-Square - Test of Independence (Worksheet) is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by OpenStax via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. The student will evaluate if there ...
