CHAPTER FIVE SAMPLING DISTRIBUTIONS, STATISTICAL INFERENCE, AND NULL HYPOTHESIS TESTING


OBJECTIVES

To lay the groundwork for the procedures discussed in this book by examining the general theory of data analysis and describing specific concepts as they apply to confidence intervals, effect sizes, and hypothesis tests.

CONTENTS

5.1 BASIC CONCEPTS BEHIND CONFIDENCE INTERVALS, EFFECT SIZES, AND HYPOTHESIS TESTING
5.2 SAMPLING ERROR
5.3 SIMPLE EXAMPLES INVOLVING AUTHORITARIANISM AND VISUAL CUES TO MEMORY
5.4 SAMPLING DISTRIBUTIONS AND THE STANDARD ERROR
5.5 TEST STATISTICS AND THEIR SAMPLING DISTRIBUTIONS
5.6 MAKING DECISIONS ABOUT THE NULL HYPOTHESIS
5.7 TYPE I AND TYPE II ERRORS
5.8 ONE- AND TWO-TAILED TESTS
5.9 RETAINING OR REJECTING THE NULL HYPOTHESIS
5.10 SUMMARY

In Chapter 2 we examined a number of different statistics and saw how they might be used to describe a set of data or to represent the frequency of the occurrence of some event. Although the description of the data is important and fundamental to any analysis, it is not sufficient to answer many of the most interesting problems we encounter. In a typical experiment, we might treat one group of people in a special way and wish to see whether their scores differ from the scores of people in general. Or we might offer a treatment to one group but not to a control group and wish to compare the means of the two groups on some variable. Descriptive statistics will not tell us, for example, whether the difference between a sample mean and a hypothetical population mean, or the difference between two obtained sample means, is small enough to be explained by chance alone or whether it could represent a true difference that might be attributable to the effect of our experimental treatment(s). Nor will they tell us whether such a difference is meaningful, how much in error we could be in our estimates, or how this study might fit with other studies that have been conducted. The research paper that derives from an experiment must address each of these questions, and to do so requires an understanding of a number of different ways of examining data.

5.1 BASIC CONCEPTS BEHIND CONFIDENCE INTERVALS, EFFECT SIZES, AND HYPOTHESIS TESTING

Traditionally, psychology and the behavioral sciences in general have focused on what is generally referred to as hypothesis testing or, in more current terminology, Null Hypothesis Significance Tests (NHST). Not too long ago it was possible to write a chapter, and even a whole book, focusing almost exclusively on hypothesis testing. (I have done just that, and so have most other authors.)
There we would be interested in answering questions such as "Are the mean scores for these two groups sufficiently different to lead us to conclude, perhaps erroneously, that different treatments produce different results?" (And notice that my question focused almost exclusively on the mean, ignoring other statistics. In some cases we focused on the correlation coefficient, ignoring other interesting possibilities.) Fortunately, we have expanded the kinds of questions we ask and the statistics that they produce. But each of these questions, or ways of evaluating data, depends on the same underlying concepts: most importantly, they depend on what we will refer to as sampling distributions, one- and two-tailed tests, Type I and Type II errors, and the logic of hypothesis testing, each of which will be defined as we go along. Even for those who don't particularly approve of hypothesis testing itself, the underlying concepts are critically important.

Before I launch into a discussion of the whole issue surrounding hypothesis tests, their associated probability values (p), confidence intervals, and effect sizes, I need to lay out the basic concepts that are involved in a discussion of each of those. When I have explained each of the basic concepts, I will come back to the issue of null hypothesis significance testing and explain where it came from, what people have had to say about it, and why it gets people so excited. But I will also point to other statistics that we can use along with hypothesis testing to understand what our data have to say. Those alternative statistics will be elaborated much more in the next chapter. But before going off in that direction, we need a good understanding of the basic material.

5.2 SAMPLING ERROR

One of the most basic concepts in statistics is what statisticians call sampling error. It lies at the heart of all statistical procedures. Sampling error refers to the variability of some observation or statistic from one sample to another. In standard English we usually use the word "error" to refer to some kind of a mistake. That is not what we mean here. We simply mean random variability. In Chapter 3 we considered the distribution of Total Behavior Problem scores from Achenbach's Youth Self-Report form. Total Behavior Problem scores are nearly normally distributed in the population, with a population mean (μ) of 50 and a population standard deviation (σ) of 10. We know that different children show different levels of problem behaviors and therefore have different scores. We also know that if we took a sample of children, their scores would probably not equal exactly 50. One child might have a score of 49, while a second might have a score of 55. The actual scores would depend on the particular children who happened to be included in the sample. If we then go further and calculate the means of two samples of children, we would also expect those means to differ due to sampling error.
(They might also differ because of real differences due to some treatment effect, but that is not what we are talking about here. Here we are just referring to the part that represents random variability.) One mean might be 47.4, while another might be somewhat higher or lower. This expected variability from sample to sample is what is meant when we speak of variability due to chance, or error variance, or sampling error. The phrase refers to the fact that statistics (in this case, means) obtained from samples naturally vary from one sample to another. We need to understand sampling error if we are to evaluate how different groups respond to some experimental treatment, or how much confidence we can place in a statistic that we just computed. Sampling error is fundamental to the calculation of what we will call p values,

of confidence intervals, and of effect sizes, all of which will be defined in this chapter. You cannot ignore it. In examining sampling error, and in any statistical procedures which follow, we will be particularly interested in sampling distributions, which refer to the distributions of scores or, more often, statistics like the mean, and their associated sampling error. Such distributions tell us what kind of variability we can expect in sample means, for example, from one experiment to another. In other words, they plot sampling error. If we want to make meaningful estimates of population means, we need to have confidence in the stability of the sample means on which we base those estimates. Suppose that we have a sample mean of 68, and suppose that we can reasonably estimate that if we ran the same experiment again we would likely have a new sample mean of somewhere between 66 and 70. That looks as if we have a solid basis for concluding that the population mean is probably somewhere in the upper 60s. However, if we think that if we reran the experiment the new sample mean would be somewhere between 52 and 84, we would be much more cautious in our estimate of the true population mean. It is just this variability of sample means from one sample to another that we mean by the sampling distribution of a statistic. And although I have used the sample mean as the statistic of interest in this paragraph, I could just as well have spoken about the sampling distribution of a variance, a correlation coefficient, or a test statistic such as t or F. Every statistic has its own sampling distribution.

5.3 SIMPLE EXAMPLES INVOLVING AUTHORITARIANISM AND VISUAL CUES TO MEMORY

I want to begin with some examples that illustrate the issues we face. Roets, Au, & Van Hiel (2015) examined the relationship between authoritarianism and attitudes toward out-groups.
Many studies have found a negative relationship between these variables, with people high in authoritarianism tending to view minorities as bad, immoral, and deviant. (Gee, Donald Trump comes to mind!) However, the government of Singapore has had a long history of promoting multiculturalism. They have forced people from different cultures to live in the same neighborhoods, and have taken other measures to blend their communities. Roets and Van Hiel wondered if the approach taken by Singapore, which they refer to as an institutionalized intergroup ideology imposed on the people of Singapore, would alter this relationship between authoritarianism and attitudes toward out-groups. They examined two quite different groups. For a Belgian

group of 245 students, the correlation between authoritarianism and a measure of multicultural acceptance was -.28, which was in line with a large body of research. Correlations measure the degree of relationship between two or more variables. For Belgian students, the higher one's authoritarianism score, the more negative one's attitude about minorities. But for a group of 249 students from Singapore, this same relationship was positive, with a correlation of .26. Would we have expected different groups to have such different results if Singapore's approach really has no effect? Is this difference in correlations between two groups with quite different backgrounds toward multiculturalism large enough to indicate a real difference between the two groups and the influence of government policy? What can we say about the potential stability of this difference? Does it represent an important result of Singapore's efforts, or is it a minor effect that we can largely shove aside? Those are the important questions to be answered. Also note that the authors worked with college students. Perhaps future work might focus on a different age group. The point is that the study shouldn't end here: it is part of a body of research that should be pursued. Two correlation coefficients do not exhaust the area of study. Too often we present a significant result, imply that we have answered the question, and then move on to something else.

Now consider a second study. We all know how difficult it sometimes is to remember to do something, e.g., call Mom and wish her a happy birthday. Rogers & Milkman (2016) hypothesized that if you can link a distinctive visual cue to the intention to call, subsequently noticing that cue will facilitate calling. You want to remember to phone your mother when you get home to wish her a happy birthday.
First, think about the bottle of milk that you accidentally left out on the kitchen counter when you set off to class or work, and associate that with the phone call. Seeing that bottle of sour milk when you return home should remind you to make the call.1 They found that of those who were instructed to form such an association, 29/39 = 74% performed the behavior. Of those who were not instructed to form an association, only 16/38 = 42% performed the behavior. We will want to have some way to decide whether the difference between 74% and 42% can be explained away by normal sampling error, in which case having such cues doesn't seem to help. Alternatively, if the difference is sufficiently large that we cannot attribute it solely to sampling

1 I would not suggest that you tell your mother that seeing sour milk prompted your call to her.

error, then we have evidence that such cues do help and Rogers and Milkman are onto something important. Although the statistical calculations required to answer this question are different from those used to answer the one concerning the correlation between authoritarianism and attitudes toward out-groups, the underlying logic is fundamentally the same. We need to be explicit about what the problem is here. The reason for understanding sampling distributions and, from that, for calculating confidence intervals, effect sizes, and hypothesis tests is that data are ambiguous. When we collect data on attitudes toward minorities, for example, the data will vary from occasion to occasion, depending on who happens to be included in our sample. Similarly for data on memory for important tasks. But how large a difference do we need to lead us to conclude that something meaningful is going on? How do we try to assess the importance of that difference? How sure are we that we have estimated it reliably? Those are the problems we are beginning to explore, and those are the subjects of this chapter and the rest of the book.

5.4 SAMPLING DISTRIBUTIONS AND THE STANDARD ERROR

As I have said, the most basic concept underlying all statistical procedures is the sampling distribution of a statistic and its associated sampling error. It is fair to say that if we did not have sampling distributions, we would not have any confidence limits, statistical tests, or other important measures. Sampling distributions tell us what values we might (or might not) expect to obtain for a particular statistic under a set of predefined conditions (e.g., what the differences between our two samples might be expected to be if the true means of the populations from which those samples came are equal).
In addition, the standard deviation of that distribution of differences between sample means (known as the standard error of the distribution) reflects the variability that we would expect to find in the values of that statistic (in this case, differences between means) over repeated trials. Sampling distributions and their standard errors provide the opportunity to evaluate the likelihood, given the value of a sample statistic, that such predefined conditions actually exist. Sampling distributions are almost always derived mathematically, but it is easier to understand what they represent if we consider how they could, in theory, be derived empirically with a simple

sampling experiment. (In several places in this book I will refer to the fact that with the computing power we have available today, we can answer more and more questions by repeated sampling instead of by solving equations. Statistical procedures really do change over time.) We'll begin with the sampling distribution of the mean of a single group. We can then move on to the sampling distribution of the differences between means. The sampling distribution of the mean is the distribution of means of an infinite number of random samples drawn from one population. Suppose we have a population with a known mean and standard deviation. (Here we will suppose that the population mean is 35 and the population standard deviation is 15, though what the values are is not critical to the logic of our argument. In the general case we rarely know the population standard deviation, but for our example suppose that we do.) Further suppose that we draw a very large number (theoretically an infinite number, but I drew 10,000) of random samples from this population, each sample consisting of 100 scores. In this example, for each of the 10,000 samples the R code shown below drew N = 100 observations from a normally distributed population with a mean of 35 and a standard deviation of 15. (I could have sampled from a population that is not normally distributed, but I wanted to keep this example uncomplicated.) I then repeated that process 9,999 more times and stored away all of those 10,000 sample means. (You might profitably repeat this procedure using a larger or smaller sample size, looking to see how the difference influences the resulting sampling distribution.) When I finished drawing the samples, I plotted the distribution of the means. The histogram of this distribution is shown on the left of Figure 5.1, with the Q-Q plot on the right. The code for doing this in R follows.
R Code

# Sampling distribution shown in Figure 5.1
nreps <- 10000                 # Number of replications
n <- 100                       # Size of individual samples
xbar <- numeric(nreps)         # Variable to store the sample means
par(mfrow = c(1,2))            # Set up the graphics display
for (i in 1:nreps) {
  sample <- rnorm(n = n, mean = 35, sd = 15)
  xbar[i] <- mean(sample)
}
# xbar now holds 10,000 elements
Mean <- round(mean(xbar), digits = 2)
StDev <- round(sd(xbar), digits = 2)
cat("The mean of the means is \n", Mean, '\n')
cat("The standard deviation of the means is \n", StDev, '\n')
hist(xbar, breaks = 50, main = "Distribution of Means",
     xlab = "Mean")
legend(29, 500, paste("Mean = ", Mean), bty = "n")
legend(29, 400, paste("St.Dev = ", StDev), bty = "n")
qqnorm(xbar, main = "Q-Q Plot for Distribution \n of Sample Means",
       xlab = "Obtained quantiles", ylab = "Expected quantiles")
qqline(xbar)

Figure 5.1 Distribution of sample means, each based on 10,000 samples of N = 100

I don't think that there is much doubt that this distribution is normally distributed. The Q-Q plot clearly tells us that it is. The center of this distribution is at 34.99, which is almost exactly the population mean. We can see from the figure on the left that sample means between 32 and 38, for example, are quite likely to occur when we sample from this population. We also can see that it is extremely unlikely that we would draw samples from this population with means of 40 or more. The fact that we know the kinds of values to expect for the means of samples drawn from this one population is going to allow us to turn the question around and ask whether an

obtained sample mean can be taken as evidence in favor of the hypothesis that we actually are sampling from this population. In addition to authoritarianism and attitudes toward out-groups, and memory for future activities, we will add a third example, which is one to which we can all relate. It involves those annoying people who spend what seems to us an unreasonable amount of time vacating the parking space we are waiting for. Ruback and Juieng (1997) ran a simple study in which they divided drivers into two groups of 100 participants each: those who had someone waiting for their space and those who did not. They then recorded the amount of time that it took the driver to leave the parking space. For those drivers who had no one waiting, it took an average of 32.15 seconds to leave the space. For those who did have someone waiting, it took an average of 39.03 seconds. The average standard deviation of leaving times within these two groups was 14.6 seconds. Notice that a driver took 6.88 seconds (or nearly half a standard deviation) longer to leave a space when someone was waiting for it. (If you think about it, 6.88 seconds is a long time if you are the person doing the waiting.) Here we have a case where we have two means, and we want to know about the sampling distribution of the difference between two means. Using a program similar to the one above, I drew 10,000 pairs of samples from two identical populations. Both population means were set at 35.6 seconds (the average of the two group means). The standard deviation was set at 14.6 (the common standard deviation of the two groups). Because I was sampling from identical populations, the two groups have the same population mean and standard deviation. The differences between these means are plotted in Figure 5.2. Remember that this is a distribution created by drawing from a case where the hypothesis of equal population means is true: both population means are 35.6.
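The text says only that this was done "using a program similar to the one above," so here is a minimal sketch of what that second program might look like. The seed and the loop structure are my own choices, not necessarily the author's.

```r
# Sketch of the two-group simulation behind Figure 5.2: on each
# replication, draw two samples from identical populations
# (mean 35.6, sd 14.6) and store the difference between their means.
set.seed(42)                    # arbitrary seed, for reproducibility
nreps <- 10000                  # number of replications
n     <- 100                    # drivers per group
diffs <- numeric(nreps)         # storage for the mean differences
for (i in 1:nreps) {
  g1 <- rnorm(n, mean = 35.6, sd = 14.6)   # "no one waiting" group
  g2 <- rnorm(n, mean = 35.6, sd = 14.6)   # "someone waiting" group
  diffs[i] <- mean(g1) - mean(g2)          # difference when H0 is true
}
# How often does a true-null difference reach the observed 6.88 seconds?
mean(abs(diffs) >= 6.88)
```

Across the 10,000 replications only a tiny fraction of these null differences reach 6.88 seconds, which is exactly the kind of comparison used in the text.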

Figure 5.2 Distribution of differences between means

Ruback and Juieng (1997) found a difference of 6.88 seconds in leaving times between the two conditions. It is quite clear from Figure 5.2 that this is very unlikely to have occurred if the true population means were equal. In fact, my sampling study found only 6 cases out of 10,000 in which the mean difference was more extreme than 6.88, for a probability of .0006. We will certainly feel justified in concluding that people take longer to leave their space, for whatever reason, when someone is waiting for it. We have just run our first null hypothesis test.

You should now have a good understanding of three important concepts. There is sampling error, which is random variability from one sample to another, either in terms of individual observations or in terms of a statistic, such as the mean. There is the sampling distribution, which is just the distribution of, for example, a sample mean, or a sample mean difference, when samples are repeatedly drawn from some population. And there is the standard error, which is the standard deviation of the corresponding sampling distribution. Figure 5.2 illustrates a sampling distribution of mean differences, and the variability within that distribution is sampling error. (Similarly, the standard deviation of the distribution of sample means in Figure 5.1, which was 1.5, is the standard error of the mean.)

THE ROLE OF SAMPLING DISTRIBUTIONS AND STANDARD ERRORS

The reason that we need the concepts of sampling distributions and standard errors is that we use them to calculate measures that will help us better understand our data. The initial impetus came from the idea of testing a hypothesis, which, in the case of the Ruback and Juieng study, posited that a population of drivers who had someone waiting and a population of drivers who had no one waiting would have identical means. I will begin with hypothesis testing because it lies at the heart of what we have been doing for many years, but the field and the coverage here have moved well beyond that point to include confidence intervals and effect sizes, which will be defined shortly.

5.5 TEST STATISTICS AND THEIR SAMPLING DISTRIBUTIONS

Although I have not used the term, what we did in the previous example was to reject the null hypothesis. We said that if the null hypothesis (equal population means) were true, we would almost never find the difference we observed. So we rejected that hypothesis in favor of one that said that the population means were not equal. If we had instead found a sample mean difference of .5 seconds, we would not have rejected the null hypothesis of equal population means. (Note where .5 would fall in Figure 5.2.) Knowing what the terms rejection and non-rejection mean is all well and good. But how do we get to that point? What do we do with our data to come up with a probability value such as the one we found in this example? In the not too distant future, we may well do what we did here, which is to draw a huge number of samples from equal populations. But the far more traditional approach is to run a statistical test, compute a test statistic, and evaluate that statistic.
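As a preview of that traditional approach, R's built-in t.test() function carries out all of those steps in one call. The data below are simulated stand-ins whose means and standard deviation roughly mimic the parking study; they are not the actual data.

```r
# Hedged illustration, not the authors' analysis: a two-sample t test
# on simulated data resembling the parking-lot study.
set.seed(1)                                 # arbitrary seed
no_wait <- rnorm(100, mean = 32.15, sd = 14.6)
waiting <- rnorm(100, mean = 39.03, sd = 14.6)
result <- t.test(waiting, no_wait)          # Welch two-sample t test
result$statistic                            # the test statistic, t
result$p.value                              # probability of such a t if H0 were true
```

The function returns both the test statistic and the probability of obtaining a statistic at least that extreme when the null hypothesis is true, which is precisely the evaluation described above.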
We have been discussing the sampling distribution of the mean, but the discussion would have been essentially the same had we dealt instead with the median, the variance, the range, the correlation coefficient (as in our authoritarianism example), proportions (as in our calling-Mom example), or any other statistic you care to consider. (Technically the shapes of these distributions would be different, but I am deliberately ignoring such issues in this chapter.) The statistics just mentioned usually are referred to as sample statistics because they describe characteristics of samples. There is a whole different class of statistics called test statistics, which are associated with specific statistical procedures and which have their own sampling distributions. Test statistics are statistics such as t, F, and χ2, which you have probably run across in the past. (If you are not familiar with them, don't worry; we will consider them separately in later chapters.) This is not the place to go into a detailed explanation of

any test statistic, but it is the place to point out that the sampling distributions for test statistics are obtained and used in essentially the same way as the sampling distribution of the mean. As an illustration, consider the sampling distribution of the statistic t, which will be discussed in Chapter 6. For those who are not familiar with the t test, it is sufficient to say that the t test is often used, among other things, to examine whether two samples were drawn from populations with the same means. Let μ1 and μ2 represent the means of the populations from which the two samples were drawn. The null hypothesis is the hypothesis that the two population means are equal, in other words, H0: μ1 = μ2 (or μ1 - μ2 = 0). (This is what we had in the previous example.) If we wished, we could empirically obtain the sampling distribution of t when H0 is true by drawing an infinite number of pairs of samples, all from two identical populations, calculating t for each pair of samples (by methods to be discussed later), and plotting the resulting values of t. In that case H0 must be true because we forced it to be true by drawing the samples from identical populations. The resulting distribution is the sampling distribution of t when H0 is true. If we later had two samples that produced a particular value of t, we would evaluate the null hypothesis by comparing our obtained t to the sampling distribution of t. We would reject the null hypothesis in favor of our research (alternative) hypothesis if our obtained t did not look like the kinds of t values that the sampling distribution told us to expect when the null hypothesis is true. I could rewrite the preceding paragraph, substituting χ2, or F, or any other test statistic in place of t, with only minor changes dealing with how the statistic is calculated.
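That thought experiment is easy to carry out by simulation. Here is a sketch; the sample sizes, population values, and seed are arbitrary choices of mine, with a finite 10,000 replications standing in for the "infinite number" of pairs of samples.

```r
# Empirically building the sampling distribution of t when H0 is true:
# both samples always come from the same population, so H0 is forced
# to be true and every nonzero t reflects sampling error alone.
set.seed(7)
nreps <- 10000
tvals <- numeric(nreps)
for (i in 1:nreps) {
  s1 <- rnorm(20, mean = 50, sd = 10)
  s2 <- rnorm(20, mean = 50, sd = 10)      # identical population
  tvals[i] <- t.test(s1, s2, var.equal = TRUE)$statistic
}
# The theoretical t distribution on 38 df predicts that about 5% of
# these simulated values should fall beyond the two-tailed .05 cutoffs.
crit <- qt(.975, df = 38)
mean(abs(tvals) > crit)
```

The empirical proportion comes out very close to .05, confirming that the simulated distribution matches the mathematically derived one.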
Thus, you can see that all sampling distributions can be obtained in basically the same way (calculate and plot an infinite number of statistics by sampling from identical populations). At the moment we won't actually draw all of those samples and compute the relevant test statistic, but we could do it that way.

5.6 MAKING DECISIONS ABOUT THE NULL HYPOTHESIS

Figure 5.2 included a test of a null hypothesis concerning the time it takes to leave a parking space. You should recall that we first drew pairs of samples from a population with a mean of 35.6 and a standard deviation of 14.6. Then we calculated the differences between pairs of means in each of 10,000 replications and plotted those. Then we discovered that under those conditions a difference

as large as the one that Ruback and Juieng found (6.88) would happen only about 6 times out of 10,000 trials, for a probability of .0006. That is such an unlikely finding that we concluded that our two means did not come from populations with the same mean. That is a nice straightforward example of how we carry out a statistical test. At this point we have to become involved in the decision-making aspects of hypothesis testing. We must decide whether an event with a probability of .0006 is sufficiently unlikely to cause us to reject H0. Here we traditionally fall back on arbitrary conventions that have been established (perhaps too rigidly) over the years. The rationale, or lack thereof, for these conventions will become clearer as we go along, but for the time being keep in mind that they are merely conventions, and many people object to such conventions.2 One convention calls for rejecting H0 if the probability under H0 is less than or equal to .05 (p ≤ .05), while another convention, one that is more conservative with respect to the probability (p) of rejecting H0, calls for rejecting H0 whenever the probability under H0 is less than or equal to .01. These values of .05 and .01 are often referred to as p values and represent the rejection level, or the significance level, of the test. (When we say that a difference is statistically significant at the .05 level, we mean that a difference that large would occur less than 5% of the time if the null were true.) Whenever the probability obtained under H0 is less than or equal to our predetermined significance level, we will reject H0. Another way of stating this is to say that any outcome whose probability under H0 is less than or equal to the significance level falls in the rejection region, since such an outcome leads us to reject H0. The phrase p value has almost come to be a derogatory term for those who object to null hypothesis testing, but it has played, and continues to play, an important role in statistics.
Don't underestimate its importance. For the purpose of setting a standard level of rejection for this book, we will generally use the p ≤ .05 level of statistical significance, keeping in mind that some people would consider this level to be

2 Cortina and Landis (2011) point out that such conventions do have an important advantage. They take away my role as the experimenter in deciding whether p = .08 is close enough for rejecting H0 and substitute a standard (often p < .05) that has been more-or-less set by the research community. It helps keep me honest.

too lenient. But we will not simply report that such a difference is significant at p < .05 and walk away. There is much more that we need to do, and we need to think carefully about what our ultimate conclusion will be. For our particular example we obtained a probability of p = .0006, which is clearly less than .05. We will probably conclude that we have reasonable evidence to decide that the scores for the two conditions were drawn from populations with different means. But then, as researchers in the behavioral sciences, we should look to build on that result in future research. Ruback and Juieng included two additional studies in their paper that helped to confirm the results that I have given, which is important added information about the general conclusions of this paper. In a more casual replication of this study, McKenzie (2009) reported similar results. The original study is often cited in discussions of territoriality.

5.7 TYPE I AND TYPE II ERRORS

At this point you should have a reasonable understanding of what we mean by a null hypothesis and the methods we have at our disposal to retain or reject that hypothesis. But there are additional statistical issues that come into play. Whenever we reach a decision with a statistical test, there is always a chance that our decision is the wrong one. While this is true of almost all decisions, statistical or otherwise, the statistician has one point in her favor that other decision makers normally lack. She not only makes a decision by some rational process, but she can also specify the conditional probabilities of a decision's being in error. In everyday life we make decisions with only subjective feelings about what is probably the right choice. The statistician, however, can state quite precisely her estimate of the probability that she would make an erroneous rejection of H0 if it were true.
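A short simulation shows where that precision comes from: once the decision rule is fixed in advance, the long-run rate of erroneous rejections under a true null is determined by the rule itself. The population values below are those of the parking example; the seed is my own choice.

```r
# Simulate the null distribution of mean differences for the parking
# example, find the cutoff for the top 5%, and check the long-run
# rate at which true-null differences exceed it.
set.seed(3)
null_diffs <- replicate(10000, {
  mean(rnorm(100, mean = 35.6, sd = 14.6)) -
  mean(rnorm(100, mean = 35.6, sd = 14.6))
})
critical <- quantile(null_diffs, probs = .95)  # one-tailed 5% cutoff
# By construction, a difference exceeds this cutoff about 5% of the
# time even though the two population means are identical.
mean(null_diffs > critical)
```

The statistician can quote that 5% figure in advance precisely because she chose the cutoff to make it so.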
This ability to specify the probability of erroneously rejecting a true H0 follows directly from the logic of hypothesis testing. (You will soon see me back off from the statement that she can specify that probability precisely, but under the general interpretation of hypothesis testing, we operate as if that is the case.) Consider the parking lot example again, this time ignoring the difference in means that Ruback and Juieng found. The situation is diagrammed in Figure 5.3, in which the distribution is the distribution of differences in sample means when the null hypothesis is true, and the shaded

portion represents the upper 5% of the distribution. The actual score that cuts off the highest 5% is called the critical value. Critical values are those values of X (the dependent variable or the test statistic) that describe the boundary or boundaries of the rejection region(s). For this particular example the critical value is the point marking off the shaded area in Figure 5.3.

Figure 5.3 Upper 5% of differences in means

Assume that we have a decision rule that says to reject H0 whenever an outcome falls in the highest 5% of the distribution. This is the rejection level of the test. We will reject H0 whenever the difference in means falls in the shaded area; that is, whenever a difference as large as the one we found has a probability of .05 or less of coming from the situation where the population means are equal. (Here I have represented that probability with the Greek letter α (alpha), which is the traditional notation.) Yet by the very nature of our procedure, 5% of the differences in means, when the presence of a waiting car has no effect on the time to leave, will themselves fall in the shaded portion. Thus if we actually have a situation where the null hypothesis of no mean difference is true, we stand a 5% chance of an obtained sample mean difference being in the shaded tail of the distribution, causing us erroneously to reject the null hypothesis. This kind of error (rejecting H0 when in fact it is true) is called a Type I error, and its conditional probability (the probability of rejecting the null hypothesis given that it is true) is α, the size of the rejection region. In the future, whenever we represent a probability by α, we will be referring to the probability of a Type I error, that is, of erroneously rejecting the null hypothesis. Keep in mind the conditional nature of the probability of a Type I error. This means that you should be sure you understand that when we speak of a Type I error we mean the probability of

16 rejecting H given that it is true. We are not saying that we will reject H on 5% of the hypotheses we test. We would hope to run experiments on important and meaningful variables and, therefore, to reject H often. But when we speak of a Type I error, we are speaking only about erroneously rejecting H in those situations in which the null hypothesis happens to be true. You might feel that a 5% chance of making an error is too great a risk to take and suggest that we make our criterion much more stringent, by rejecting, for example, only the lowest 1% of the distribution. This procedure is perfectly legitimate, but realize that the more stringent you make your criterion, the more likely you are to make another kind of error failing to reject H when it is in fact false and H 1 is true. This type of error is called a Type II error, and its probability is symbolized by β (beta). The major difficulty in terms of Type II errors stems from the fact that if H is false, we almost never know what the true distribution (the distribution under H 1 ) would look like for the population from which our data came. In other words, we never know exactly how false the null hypothesis is. We know only the distribution of scores under H. Put in the present context, we know the distribution of differences in means when having someone waiting for a parking space makes no difference in response time, but we don't know what the difference would be if waiting did make a difference. This situation is illustrated in Figure 5.4, in which the distribution labeled H represents the distribution of mean differences when the null hypothesis is true, the distribution labeled H 1 represents our hypothetical distribution of differences when the null hypothesis is false, and the alternative hypothesis ( H 1 ) is true. Remember that the distribution for H1 is only hypothetical. 
We really do not know the location of that distribution, other than that it lies higher (greater differences) than the distribution under H0. (I have arbitrarily drawn that distribution so that its mean is 2 units above the mean under H0.)
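The arithmetic behind the Figure 5.3 cutoff is easy to sketch. The following Python fragment assumes a normal sampling distribution of mean differences centered at 0; the standard error of 2.1 is my own illustrative choice (the text never reports one), picked because it puts the upper 5% cutoff near the 3.5 used in the example:

```python
from statistics import NormalDist

# Sampling distribution of the difference in means under H0:
# centered at 0. The standard error of 2.1 is an illustrative
# assumption, not a value given in the text.
null_dist = NormalDist(mu=0.0, sigma=2.1)

alpha = 0.05                             # one-tailed rejection level
critical = null_dist.inv_cdf(1 - alpha)  # cutoff for the upper 5%

print(round(critical, 2))  # 3.45, close to the 3.5 in the example
```

Any obtained difference in means larger than this cutoff falls in the shaded rejection region, so under H0 we would (erroneously) reject 5% of the time.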

Figure 5.4 Distribution of mean differences under H0 and H1

As I have said, the darkly shaded portion in the top half of Figure 5.4 represents the rejection region. Any observation falling in that area (i.e., to the right of about 3.5) would lead to rejection of the null hypothesis. If the null hypothesis is true, we know that our observation will fall in this area 5% of the time. Thus, we will make a Type I error 5% of the time. The distribution labeled H1 represents the expected distribution of sample means if the two population means differ by two seconds. (And remember that this distribution would be displaced left or right if I had chosen a different mean difference for the population means.) The cross-hatched portion in the bottom half of Figure 5.4 represents the probability (β) of a Type II error. This is the situation in which having someone waiting does make a difference in leaving time, but the mean difference is not sufficiently large to cause us to reject H0. In the particular situation illustrated in Figure 5.4, where I made up the mean and variance, we can in fact calculate β by using the normal distribution to find the probability of obtaining a difference smaller than 3.5 (the critical value) when H1 is true. The actual calculation is not important for your understanding of β, because this chapter was designed specifically to avoid calculation. I will simply state that this probability (i.e., the area labeled β) is .76. Thus for this example, on 76% of the occasions when waiting times in the population actually differ by two seconds (i.e., H1 is actually true), we will make a Type II error by failing to reject H0 when it is false. From Figure 5.4 you can see that if we were to reduce the level of α (the probability of a Type I error) from .05 to .01 by moving the critical value to the right, we would reduce the probability of Type I errors but increase the probability of Type II errors. Setting α at .01 would mean that β = .92. Obviously there is room for debate over what level of significance to use. The decision rests primarily on your opinion concerning the relative importance of Type I and Type II errors for the kind of study you are conducting. If it were important to avoid Type I errors (such as falsely claiming that the average driver is rude), then you would set a stringent (i.e., small) level of α. If, on the other hand, you want to avoid Type II errors (patting everyone on the head for being polite when actually they are not), you might set a fairly high level of α. (Setting α = .20 in this example would reduce β to .46.)
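β itself is just an area under the assumed H1 distribution. Here is a sketch under the same illustrative assumptions as before (the mean difference of 2 and the critical value of 3.5 come from the figure; the standard error of 2.1 is again my own choice, not a value reported in the text, picked so the result lands near the β = .76 quoted above):

```python
from statistics import NormalDist

critical = 3.5  # critical value from the text's example

# Hypothetical H1 distribution: population means differ by 2 seconds.
# The standard error of 2.1 is an assumed value, not from the text.
h1_dist = NormalDist(mu=2.0, sigma=2.1)

beta = h1_dist.cdf(critical)  # P(fail to reject H0 | H1 true)
power = 1 - beta              # P(reject H0 | H1 true)

print(round(beta, 2), round(power, 2))  # 0.76 0.24
```

Sliding the critical value right (smaller α) makes this area, and hence β, larger, which is exactly the Type I/Type II tradeoff the figure illustrates.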
Unfortunately, in practice most of us choose an arbitrary level of α, such as .05 or .01, and simply ignore β. In many cases this may be all you can do. (In fact you will probably use the alpha level that your instructor recommends.) In other cases, however, there is much more you can do, as you will see in Chapter 8. I should stress again that Figure 5.4 is purely hypothetical. I was able to draw the figure only because I arbitrarily decided that the population means differed by 2 units and that the standard deviation of each population was 15. The answers would be different if I had chosen to draw it with a mean difference of 2.5 and/or a different standard deviation. In most everyday situations we do not know the mean and the variance of that distribution and can make only educated guesses, thus providing only crude estimates of β. On occasion, and this is especially true in medical research, we can select a value of μ under H1 that represents the minimum difference we would like to be able to detect, since larger differences will have even smaller βs. In this situation we don't care if a drug, for example, makes a very small difference that is of no practical importance. We want to look only for differences that are meaningful.

From this discussion of Type I and Type II errors we can summarize the decision-making process with a simple table. Table 5.1 presents the four possible outcomes of an experiment. The items in this table should be self-explanatory, but the one concept that we have not discussed is power. The power of a test is the probability of rejecting H0 when it is actually false. Because the probability of failing to reject a false H0 is β, power must equal 1 - β. I will discuss power and its calculation in Chapter 8.

Table 5.1 Possible outcomes of the decision-making process

                        True State of the World
  Decision              H0 True                        H0 False
  Reject H0             Type I error (p = α)           Correct decision (p = 1 - β = Power)
  Don't reject H0       Correct decision (p = 1 - α)   Type II error (p = β)

5.8 ONE- AND TWO-TAILED TESTS

We have one more concept to cover and then we can move on. The preceding discussion brings us to a consideration of one- and two-tailed tests. In our parking lot example we were concerned with whether people took longer when there was someone waiting, and we decided to reject H0 only if those drivers took longer. In fact, I chose that approach simply to make the example clearer. However, suppose our drivers were really very thoughtful and left several seconds sooner when someone was waiting. Although this is an extremely unlikely event to observe if the null hypothesis is true, it would not fall in the rejection region, which consisted solely of long times. As a result we find ourselves in the position of not rejecting H0 in the face of a piece of data that is very unlikely, but not in the direction expected. The question then arises as to how we can protect ourselves against this type of situation (if protection is thought necessary). One answer is to specify before we run the experiment that we are going to reject a given percentage (say 5%) of the extreme outcomes in each direction, both those that are extremely high and those that are extremely low. But if we reject the lowest 5% and the highest 5%, then we would in fact reject H0 a total of 10% of the time when it is actually true; that is, α = .10. That is not going to work, because we are rarely willing to work with α as high as .10 and prefer to see it set no higher than .05. The way to accomplish this is to reject the lowest 2.5% and the highest 2.5%, making a total of 5%. The situation in which we reject H0 for only the lowest (or only the highest) mean differences is referred to as a one-tailed, or directional, test. We make a prediction of the direction in which the individual will differ from the mean, and our rejection region is located in only one tail of the distribution. When we reject extremes in both tails, we have what is called a two-tailed, or nondirectional, test. It is important to keep in mind that while we gain something with a two-tailed test (the ability to reject the null hypothesis for extreme scores in either direction), we also lose something. A score that would fall in the 5% rejection region of a one-tailed test may not fall in the rejection region of the corresponding two-tailed test, because now we reject only 2.5% in each tail. In the parking example I chose a one-tailed test because it simplified the example.
But that is not a rational way of making such a choice for an actual experiment. In many situations we do not know which tail of the distribution is important (or both are), and we need to guard against extremes in either tail. Such a situation might arise when we are considering a campaign to persuade children not to start smoking. We might find that the campaign leads to a decrease in the incidence of smoking. Or, we might find that campaigns run by adults to persuade children not to smoke simply make smoking more attractive and exciting, leading to an increase in the number of children smoking. In either case we would want to reject H0.
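The one-tailed/two-tailed tradeoff shows up directly in the cutoffs. A sketch on the standard normal (z) scale, rather than in the parking-lot units, comparing the two rejection boundaries at α = .05:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution
alpha = 0.05

one_tailed = z.inv_cdf(1 - alpha)      # upper 5% cutoff
two_tailed = z.inv_cdf(1 - alpha / 2)  # upper 2.5% cutoff

print(round(one_tailed, 3))  # 1.645
print(round(two_tailed, 3))  # 1.96
# A z score between 1.645 and 1.96 would be significant with a
# one-tailed test (in the predicted direction) but not with a
# two-tailed test at the same overall alpha.
```

Splitting the 5% between the two tails pushes each cutoff further out, which is exactly what we "lose" with a two-tailed test.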

In general, two-tailed tests are far more common than one-tailed tests for several reasons. First, the investigator may have no idea what the data will look like and therefore has to be prepared for any eventuality. Although this situation is rare, it does occur in some exploratory work. Moreover, a number of people have suggested that when you are trying to replicate an experiment that you or someone else has already run, the original experiment should be evaluated with a two-tailed test, but the replication can use a one-tailed test because you now have a direction in mind. Another common reason for preferring two-tailed tests is that the investigators are reasonably sure the data will come out one way but want to cover themselves in the event that they are wrong. This type of situation arises more often than you might think. (Carefully formed hypotheses have an annoying habit of being phrased in the wrong direction, for reasons that seem so obvious after the event.) The smoking example is a case in point: there is some evidence that poorly contrived antismoking campaigns actually do more harm than good. A frequent question that arises when the data may come out the other way around is, "Why not plan to run a one-tailed test and then, if the data come out the other way, just change the test to a two-tailed test?" This kind of approach just won't work. If you start an experiment with the extreme 5% of the left-hand tail as your rejection region and then turn around and reject any outcome that happens to fall in the extreme 2.5% of the right-hand tail, you are working at the 7.5% level. In that situation you will reject 5% of the outcomes in one direction (assuming that the data fall in the desired tail), and you are also willing to reject 2.5% of the outcomes in the other direction (when the data are in the unexpected direction). There is no denying that 5% + 2.5% = 7.5%.

To put it another way, would you be willing to flip a coin for an ice cream cone if I have chosen heads but also reserved the right to switch to tails after I see how the coin lands? Or would you think it fair of me to shout "Two out of three!" when the coin toss comes up in your favor? (I used to do that all the time when I was a child, and I often got away with it. I guess that my playmates were not statisticians.) You would object to both of these strategies, and you should. By the same logic, the choice between a one-tailed test and a two-tailed one must be made before the data are collected. This is also one of the reasons that two-tailed tests are usually chosen. A third reason for two-tailed tests concerns cases where we can't really define a one-tailed test. One example is the case in which we have more than two groups. We will consider this situation at length when we discuss the analysis of variance. When we have more than two groups a one-tailed test is pretty much undefined, and we will actually have a multi-tailed test. And when we come to the chi-square test in Chapter 7, the way that the test statistic is defined precludes the idea of a one-tailed test unless we engage in additional steps, which I would usually not suggest. Although the preceding discussion argues in favor of two-tailed tests, and although in this book we generally confine ourselves to such procedures, there are no hard-and-fast rules. The final decision depends on what you already know about the relative severity of different kinds of errors. It is important to keep in mind that with respect to a given tail of a distribution, the difference between a one-tailed test and a two-tailed test is simply that the latter uses a different cutoff. A two-tailed test at α = .05 is more liberal than a one-tailed test at α = .01.

If you have a sound grasp of the logic of testing hypotheses by use of sampling distributions, the remainder of this course will be relatively simple. For any new statistic you encounter, you will need to ask only two basic questions:

1. How, and with what assumptions, is the statistic calculated?
2. What does the statistic's sampling distribution look like under H0?

If you know the answers to these two questions, your test is accomplished by calculating the test statistic for the data at hand and comparing the statistic to its sampling distribution. Because the relevant sampling distributions are tabled in the appendices, or are even available on your cell phone, all you really need to know is which test is appropriate for a particular situation and how to calculate its test statistic. (Of course there is far more to statistics than just hypothesis testing, so perhaps I'm doing a bit of overselling here. There is a great deal to understanding the field of statistics beyond how to calculate, and evaluate, a specific statistical test. Calculation is the easy part, especially with modern computer software.)

5.9 RETAINING OR REJECTING THE NULL HYPOTHESIS

As I have tried to make clear, one of the major goals for behavioral scientists is to evaluate the null hypothesis. No matter your view on using statistical tests to draw conclusions about the variables under study, almost every research paper retains that focus. There is much more to do, but this seems to be the first step. And to understand that, you really need to understand what the fuss is about. I raise the issue here because it applies to almost everything that follows in the book. It is not limited to a few statistical procedures.

NHST: DID THE EXPERIMENT WORK?

Ever since the fights in the 1920s and 1930s between Sir Ronald Fisher, on the one hand, and Neyman and Pearson, on the other, statistics has been involved in one way or another with a very messy issue called "hypothesis testing." As I said, the more current name for this debate is "null hypothesis significance testing" (NHST). What we have today is an amalgam of both sets of ideas, and it is an amalgam that seems to please no one. There have been many papers in the last few years debating the proper way to approach the analysis of data from an experiment, and the debate won't end any time soon, although it is encouraging that there really has been progress. Back in the 1990s the American Psychological Association formed the Task Force on Statistical Inference to deal with this topic. Some people hoped that the task force would suggest banning all statistical tests. Instead, the task force kept hypothesis testing alive, and did something even more useful. They published a report (Wilkinson et al., 1999) describing what people should do when examining and reporting data. They offered many very good, and very clear, suggestions on how an author should approach the whole problem of analyzing data and writing up a report. Hypothesis testing played only a small role in that discussion. When I went back and read it again, I was quite surprised at just how good a report it was. There is far more to conducting and reporting a study than people realize, and statistical testing is only a fairly small part of that. I strongly recommend that you look at that paper; it is available online. But first, a little history to put this issue in perspective.
Back in the 1920s Sir Ronald Fisher approached the problem of deciding whether experimental results were meaningful by positing the existence of a "null hypothesis." (Fisher did not use that term, but it is consistent with his ideas.) Suppose that you are interested in determining whether a new fertilizer produces more wheat than the old one that you have been using for years. (Fisher started his career in agriculture, which is why I chose this example.) You plant your wheat, let it grow, harvest it, and measure the result, for example in bushels per acre. Fisher imagined a null hypothesis that said that the new fertilizer did not differ from the old, and that the mean bushels of wheat it produced were the same as the mean bushels for the old fertilizer. We can abbreviate this as H0: μold = μnew, or equivalently H0: (μold - μnew) = 0, where the μ's refer to the population means.
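Fisher's null hypothesis is exactly what a modern two-sample test evaluates. As a hedged sketch (the yield numbers are made up, and the pooled two-sample t statistic shown here is a modern formulation rather than Fisher's own procedure):

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical bushels-per-acre yields for the two fertilizers
# (made-up data for illustration only).
old = [29, 31, 30, 28, 32]
new = [33, 35, 34, 32, 36]

# Pooled two-sample t statistic for H0: mu_old = mu_new.
n1, n2 = len(old), len(new)
pooled_var = ((n1 - 1) * variance(old) + (n2 - 1) * variance(new)) / (n1 + n2 - 2)
t = (mean(new) - mean(old)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(round(t, 2))  # 4.0
# The tabled two-tailed critical value of t for df = 8 at alpha = .05
# is 2.306, so these made-up data would lead us to reject H0.
```

The logic is the one laid out above: compute the statistic, then compare it with its sampling distribution under H0.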


More information

CHAPTER 16: IS SCIENCE LOGICAL?

CHAPTER 16: IS SCIENCE LOGICAL? INTERPRETATION AND CONCLUSIONS CHAPTER 16: IS SCIENCE LOGICAL? An earlier chapter revealed that all models are false. This chapter reveals another blemish on the face of science -- how we decide the fate

More information

Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 3 Correlated with Common Core State Standards, Grade 3

Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 3 Correlated with Common Core State Standards, Grade 3 Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 3 Common Core State Standards for Literacy in History/Social Studies, Science, and Technical Subjects, Grades K-5 English Language Arts Standards»

More information

THE ROLE OF COHERENCE OF EVIDENCE IN THE NON- DYNAMIC MODEL OF CONFIRMATION TOMOJI SHOGENJI

THE ROLE OF COHERENCE OF EVIDENCE IN THE NON- DYNAMIC MODEL OF CONFIRMATION TOMOJI SHOGENJI Page 1 To appear in Erkenntnis THE ROLE OF COHERENCE OF EVIDENCE IN THE NON- DYNAMIC MODEL OF CONFIRMATION TOMOJI SHOGENJI ABSTRACT This paper examines the role of coherence of evidence in what I call

More information

Grade 7 Math Connects Suggested Course Outline for Schooling at Home 132 lessons

Grade 7 Math Connects Suggested Course Outline for Schooling at Home 132 lessons Grade 7 Math Connects Suggested Course Outline for Schooling at Home 132 lessons I. Introduction: (1 day) Look at p. 1 in the textbook with your child and learn how to use the math book effectively. DO:

More information

CAUSATION 1 THE BASICS OF CAUSATION

CAUSATION 1 THE BASICS OF CAUSATION CAUSATION 1 A founder of the study of international relations, E. H. Carr, once said: The study of history is a study of causes. 2 Because a basis for thinking about international affairs is history, he

More information

This report is organized in four sections. The first section discusses the sample design. The next

This report is organized in four sections. The first section discusses the sample design. The next 2 This report is organized in four sections. The first section discusses the sample design. The next section describes data collection and fielding. The final two sections address weighting procedures

More information

Ace the Bold Face Sample Copy Not for Sale

Ace the Bold Face Sample Copy Not for Sale Ace the Bold Face Sample Copy Not for Sale GMAT and GMAC are registered trademarks of the Graduate Management Admission Council which neither sponsors nor endorses this product 3 Copyright, Legal Notice

More information

Religious affiliation, religious milieu, and contraceptive use in Nigeria (extended abstract)

Religious affiliation, religious milieu, and contraceptive use in Nigeria (extended abstract) Victor Agadjanian Scott Yabiku Arizona State University Religious affiliation, religious milieu, and contraceptive use in Nigeria (extended abstract) Introduction Religion has played an increasing role

More information

7AAN2004 Early Modern Philosophy report on summative essays

7AAN2004 Early Modern Philosophy report on summative essays 7AAN2004 Early Modern Philosophy report on summative essays On the whole, the essays twelve in all were pretty good. The marks ranged from 57% to 75%, and there were indeed four essays, a full third of

More information

On the Verge of Walking Away? American Teens, Communication with God, & Temptations

On the Verge of Walking Away? American Teens, Communication with God, & Temptations On the Verge of Walking Away? American Teens, Communication with God, & Temptations May 2009 1 On the Verge of Walking Away? American Teens, Communication with God, & Daily Temptations Recent studies reveal

More information

The Fifth National Survey of Religion and Politics: A Baseline for the 2008 Presidential Election. John C. Green

The Fifth National Survey of Religion and Politics: A Baseline for the 2008 Presidential Election. John C. Green The Fifth National Survey of Religion and Politics: A Baseline for the 2008 Presidential Election John C. Green Ray C. Bliss Institute of Applied Politics University of Akron (Email: green@uakron.edu;

More information

Grade 6 correlated to Illinois Learning Standards for Mathematics

Grade 6 correlated to Illinois Learning Standards for Mathematics STATE Goal 6: Demonstrate and apply a knowledge and sense of numbers, including numeration and operations (addition, subtraction, multiplication, division), patterns, ratios and proportions. A. Demonstrate

More information

Scientific errors should be controlled, not prevented. Daniel Eindhoven University of Technology

Scientific errors should be controlled, not prevented. Daniel Eindhoven University of Technology Scientific errors should be controlled, not prevented Daniel Lakens @Lakens Eindhoven University of Technology 1) Error control is the central aim of empirical science. 2) We need statistical decision

More information

Probability Foundations for Electrical Engineers Prof. Krishna Jagannathan Department of Electrical Engineering Indian Institute of Technology, Madras

Probability Foundations for Electrical Engineers Prof. Krishna Jagannathan Department of Electrical Engineering Indian Institute of Technology, Madras Probability Foundations for Electrical Engineers Prof. Krishna Jagannathan Department of Electrical Engineering Indian Institute of Technology, Madras Lecture - 1 Introduction Welcome, this is Probability

More information

THE TENDENCY TO CERTAINTY IN RELIGIOUS BELIEF.

THE TENDENCY TO CERTAINTY IN RELIGIOUS BELIEF. THE TENDENCY TO CERTAINTY IN RELIGIOUS BELIEF. BY ROBERT H. THOULESS. (From the Department of Psychology, Glasgow University.) First published in British Journal of Psychology, XXVI, pp. 16-31, 1935. I.

More information

Unit. Science and Hypothesis. Downloaded from Downloaded from Why Hypothesis? What is a Hypothesis?

Unit. Science and Hypothesis. Downloaded from  Downloaded from  Why Hypothesis? What is a Hypothesis? Why Hypothesis? Unit 3 Science and Hypothesis All men, unlike animals, are born with a capacity "to reflect". This intellectual curiosity amongst others, takes a standard form such as "Why so-and-so is

More information

Why Good Science Is Not Value-Free

Why Good Science Is Not Value-Free Why Good Science Is Not Value-Free Karim Bschir, Dep. of Humanities, Social and Political Sciences, ETH Zurich FPF 2017 Workshop, Zurich Scientific Challenges in the Risk Assessment of Food Contact Materials

More information

P 97 Personality and the Practice of Ministry

P 97 Personality and the Practice of Ministry P 97 Personality and the Practice of Ministry Statistical Tables Further Resources The accompanying Grove Pastoral booklet has been written as far as possible to make sense to readers who are unfamiliar

More information

How many imputations do you need? A two stage calculation using a quadratic rule

How many imputations do you need? A two stage calculation using a quadratic rule Sociological Methods and Research, in press 2018 How many imputations do you need? A two stage calculation using a quadratic rule Paul T. von Hippel University of Texas, Austin Abstract 0F When using multiple

More information

Marcello Pagano [JOTTER WEEK 5 SAMPLING DISTRIBUTIONS ] Central Limit Theorem, Confidence Intervals and Hypothesis Testing

Marcello Pagano [JOTTER WEEK 5 SAMPLING DISTRIBUTIONS ] Central Limit Theorem, Confidence Intervals and Hypothesis Testing Marcello Pagano [JOTTER WEEK 5 SAMPLING DISTRIBUTIONS ] Central Limit Theorem, Confidence Intervals and Hypothesis Testing Inference This is when the magic starts happening. Statistical Inference Use of

More information

But we may go further: not only Jones, but no actual man, enters into my statement. This becomes obvious when the statement is false, since then

But we may go further: not only Jones, but no actual man, enters into my statement. This becomes obvious when the statement is false, since then CHAPTER XVI DESCRIPTIONS We dealt in the preceding chapter with the words all and some; in this chapter we shall consider the word the in the singular, and in the next chapter we shall consider the word

More information

Introduction Questions to Ask in Judging Whether A Really Causes B

Introduction Questions to Ask in Judging Whether A Really Causes B 1 Introduction We live in an age when the boundaries between science and science fiction are becoming increasingly blurred. It sometimes seems that nothing is too strange to be true. How can we decide

More information

2.1 Review. 2.2 Inference and justifications

2.1 Review. 2.2 Inference and justifications Applied Logic Lecture 2: Evidence Semantics for Intuitionistic Propositional Logic Formal logic and evidence CS 4860 Fall 2012 Tuesday, August 28, 2012 2.1 Review The purpose of logic is to make reasoning

More information

Same-different and A-not A tests with sensr. Same-Different and the Degree-of-Difference tests. Outline. Christine Borgen Linander

Same-different and A-not A tests with sensr. Same-Different and the Degree-of-Difference tests. Outline. Christine Borgen Linander Same-different and -not tests with sensr Christine Borgen Linander DTU Compute Section for Statistics Technical University of Denmark chjo@dtu.dk huge thank to a former colleague of mine Rune H B Christensen.

More information

ANSWER SHEET FINAL EXAM MATH 111 SPRING 2009 (PRINT ABOVE IN LARGE CAPITALS) CIRCLE LECTURE HOUR 10AM 2PM FIRST NAME: (PRINT ABOVE IN CAPITALS)

ANSWER SHEET FINAL EXAM MATH 111 SPRING 2009 (PRINT ABOVE IN LARGE CAPITALS) CIRCLE LECTURE HOUR 10AM 2PM FIRST NAME: (PRINT ABOVE IN CAPITALS) ANSWER SHEET FINAL EXAM MATH 111 SPRING 2009 FRIDAY 1 MAY 2009 LAST NAME: (PRINT ABOVE IN LARGE CAPITALS) CIRCLE LECTURE HOUR 10AM 2PM FIRST NAME: (PRINT ABOVE IN CAPITALS) CIRCLE LAB DAY: TUESDAY THURSDAY

More information

I think, therefore I am. - Rene Descartes

I think, therefore I am. - Rene Descartes CRITICAL THINKING Sitting on top of your shoulders is one of the finest computers on the earth. But, like any other muscle in your body, it needs to be exercised to work its best. That exercise is called

More information

It is One Tailed F-test since the variance of treatment is expected to be large if the null hypothesis is rejected.

It is One Tailed F-test since the variance of treatment is expected to be large if the null hypothesis is rejected. EXST 7014 Experimental Statistics II, Fall 2018 Lab 10: ANOVA and Post ANOVA Test Due: 31 st October 2018 OBJECTIVES Analysis of variance (ANOVA) is the most commonly used technique for comparing the means

More information

HAS DAVID HOWDEN VINDICATED RICHARD VON MISES S DEFINITION OF PROBABILITY?

HAS DAVID HOWDEN VINDICATED RICHARD VON MISES S DEFINITION OF PROBABILITY? LIBERTARIAN PAPERS VOL. 1, ART. NO. 44 (2009) HAS DAVID HOWDEN VINDICATED RICHARD VON MISES S DEFINITION OF PROBABILITY? MARK R. CROVELLI * Introduction IN MY RECENT ARTICLE on these pages entitled On

More information

It Ain t What You Prove, It s the Way That You Prove It. a play by Chris Binge

It Ain t What You Prove, It s the Way That You Prove It. a play by Chris Binge It Ain t What You Prove, It s the Way That You Prove It a play by Chris Binge (From Alchin, Nicholas. Theory of Knowledge. London: John Murray, 2003. Pp. 66-69.) Teacher: Good afternoon class. For homework

More information

Classroom Voting Questions: Statistics

Classroom Voting Questions: Statistics Classroom Voting Questions: Statistics General Probability Rules 1. In a certain semester, 500 students enrolled in both Calculus I and Physics I. Of these students, 82 got an A in calculus, 73 got an

More information

MLLunsford, Spring Activity: Conditional Probability and The Law of Total Probability

MLLunsford, Spring Activity: Conditional Probability and The Law of Total Probability MLLunsford, Spring 2003 1 Activity: Conditional Probability and The Law of Total Probability Concepts: Conditional Probability, Independent Events, the Multiplication Rule, the Law of Total Probability

More information

PHI 1700: Global Ethics

PHI 1700: Global Ethics PHI 1700: Global Ethics Session 3 February 11th, 2016 Harman, Ethics and Observation 1 (finishing up our All About Arguments discussion) A common theme linking many of the fallacies we covered is that

More information

On the Relationship between Religiosity and Ideology

On the Relationship between Religiosity and Ideology Curt Raney Introduction to Data Analysis Spring 1997 Word Count: 1,583 On the Relationship between Religiosity and Ideology Abstract This paper reports the results of a survey of students at a small college

More information

Philosophy 148 Announcements & Such. Inverse Probability and Bayes s Theorem II. Inverse Probability and Bayes s Theorem III

Philosophy 148 Announcements & Such. Inverse Probability and Bayes s Theorem II. Inverse Probability and Bayes s Theorem III Branden Fitelson Philosophy 148 Lecture 1 Branden Fitelson Philosophy 148 Lecture 2 Philosophy 148 Announcements & Such Administrative Stuff I ll be using a straight grading scale for this course. Here

More information

I thought I should expand this population approach somewhat: P t = P0e is the equation which describes population growth.

I thought I should expand this population approach somewhat: P t = P0e is the equation which describes population growth. I thought I should expand this population approach somewhat: P t = P0e is the equation which describes population growth. To head off the most common objections:! This does take into account the death

More information

Six Sigma Prof. Dr. T. P. Bagchi Department of Management Indian Institute of Technology, Kharagpur. Lecture No. # 18 Acceptance Sampling

Six Sigma Prof. Dr. T. P. Bagchi Department of Management Indian Institute of Technology, Kharagpur. Lecture No. # 18 Acceptance Sampling Six Sigma Prof. Dr. T. P. Bagchi Department of Management Indian Institute of Technology, Kharagpur Lecture No. # 18 Acceptance Sampling Good afternoon, we begin today we continue with our session on Six

More information

Is it rational to have faith? Looking for new evidence, Good s Theorem, and Risk Aversion. Lara Buchak UC Berkeley

Is it rational to have faith? Looking for new evidence, Good s Theorem, and Risk Aversion. Lara Buchak UC Berkeley Is it rational to have faith? Looking for new evidence, Good s Theorem, and Risk Aversion. Lara Buchak UC Berkeley buchak@berkeley.edu *Special thanks to Branden Fitelson, who unfortunately couldn t be

More information

Okay, good afternoon everybody. Hope everyone can hear me. Ronet, can you hear me okay?

Okay, good afternoon everybody. Hope everyone can hear me. Ronet, can you hear me okay? Okay, good afternoon everybody. Hope everyone can hear me. Ronet, can you hear me okay? I can. Okay. Great. Can you hear me? Yeah. I can hear you. Wonderful. Well again, good afternoon everyone. My name

More information

In Our Own Words 2000 Research Study

In Our Own Words 2000 Research Study The Death Penalty and Selected Factors from the In Our Own Words 2000 Research Study Prepared on July 25 th, 2001 DEATH PENALTY AND SELECTED FACTORS 2 WHAT BRINGS US TOGETHER: A PRESENTATION OF THE IOOW

More information

The Birthday Problem

The Birthday Problem The Birthday Problem In 1939, a mathematician named Richard von Mises proposed what we call today the birthday problem. He asked: How many people must be in a room before the probability that two share

More information

Argument Writing. Whooohoo!! Argument instruction is necessary * Argument comprehension is required in school assignments, standardized testing, job

Argument Writing. Whooohoo!! Argument instruction is necessary * Argument comprehension is required in school assignments, standardized testing, job Argument Writing Whooohoo!! Argument instruction is necessary * Argument comprehension is required in school assignments, standardized testing, job promotion as well as political and personal decision-making

More information

II Plenary discussion of Expertise and the Global Warming debate.

II Plenary discussion of Expertise and the Global Warming debate. Thinking Straight Critical Reasoning WS 9-1 May 27, 2008 I. A. (Individually ) review and mark the answers for the assignment given on the last pages: (two points each for reconstruction and evaluation,

More information

Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 4 Correlated with Common Core State Standards, Grade 4

Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 4 Correlated with Common Core State Standards, Grade 4 Macmillan/McGraw-Hill SCIENCE: A CLOSER LOOK 2011, Grade 4 Common Core State Standards for Literacy in History/Social Studies, Science, and Technical Subjects, Grades K-5 English Language Arts Standards»

More information

CSSS/SOC/STAT 321 Case-Based Statistics I. Introduction to Probability

CSSS/SOC/STAT 321 Case-Based Statistics I. Introduction to Probability CSSS/SOC/STAT 321 Case-Based Statistics I Introduction to Probability Christopher Adolph Department of Political Science and Center for Statistics and the Social Sciences University of Washington, Seattle

More information

Computational Learning Theory: Agnostic Learning

Computational Learning Theory: Agnostic Learning Computational Learning Theory: Agnostic Learning Machine Learning Fall 2018 Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others 1 This lecture: Computational Learning Theory The

More information

Identity and Curriculum in Catholic Education

Identity and Curriculum in Catholic Education Identity and Curriculum in Catholic Education Survey of teachers opinions regarding certain aspects of Catholic Education Executive summary A survey instrument (Appendix 1), designed by working groups

More information

The Critical Mind is A Questioning Mind

The Critical Mind is A Questioning Mind criticalthinking.org http://www.criticalthinking.org/pages/the-critical-mind-is-a-questioning-mind/481 The Critical Mind is A Questioning Mind Learning How to Ask Powerful, Probing Questions Introduction

More information

Introduction Chapter 1 of Social Statistics

Introduction Chapter 1 of Social Statistics Introduction p.1/22 Introduction Chapter 1 of Social Statistics Chris Lawrence cnlawren@olemiss.edu Introduction p.2/22 Introduction In this chapter, we will discuss: What statistics are Introduction p.2/22

More information

Survey Report New Hope Church: Attitudes and Opinions of the People in the Pews

Survey Report New Hope Church: Attitudes and Opinions of the People in the Pews Survey Report New Hope Church: Attitudes and Opinions of the People in the Pews By Monte Sahlin May 2007 Introduction A survey of attenders at New Hope Church was conducted early in 2007 at the request

More information

How to Generate a Thesis Statement if the Topic is Not Assigned.

How to Generate a Thesis Statement if the Topic is Not Assigned. What is a Thesis Statement? Almost all of us--even if we don't do it consciously--look early in an essay for a one- or two-sentence condensation of the argument or analysis that is to follow. We refer

More information

Final Paper. May 13, 2015

Final Paper. May 13, 2015 24.221 Final Paper May 13, 2015 Determinism states the following: given the state of the universe at time t 0, denoted S 0, and the conjunction of the laws of nature, L, the state of the universe S at

More information

The World Wide Web and the U.S. Political News Market: Online Appendices

The World Wide Web and the U.S. Political News Market: Online Appendices The World Wide Web and the U.S. Political News Market: Online Appendices Online Appendix OA. Political Identity of Viewers Several times in the paper we treat as the left- most leaning TV station. Posner

More information

16 Free Will Requires Determinism

16 Free Will Requires Determinism 16 Free Will Requires Determinism John Baer The will is infinite, and the execution confined... the desire is boundless, and the act a slave to limit. William Shakespeare, Troilus and Cressida, III. ii.75

More information

Video: How does understanding whether or not an argument is inductive or deductive help me?

Video: How does understanding whether or not an argument is inductive or deductive help me? Page 1 of 10 10b Learn how to evaluate verbal and visual arguments. Video: How does understanding whether or not an argument is inductive or deductive help me? Download transcript Three common ways to

More information