Project: The Power of a Hypothesis Test

Let's revisit the basics of hypothesis testing for a bit here, shall we? Any hypothesis test contains two mutually exclusive hypotheses, H0 and H1 (AKA, HA). After conducting our hypothesis test, we will either (formally) reject H0, or fail to reject H0. We can see pretty quickly there are four outcomes possible here:

                        H0 TRUE                H0 FALSE
  Reject H0             Bad! (Type I error)    Good!
  Fail to Reject H0     Good!                  Bad! (Type II error)

Here's another, less formal way of looking at it:

The boy who said, "Wolf! Wolf! Everybody! There's a freaking WOLF here!" But there wasn't. Way to go, buddy. (False Positive)

The boy who said, "Nah, there's no wolf here." And there wasn't. So yeah. Back to work. (True Negative)

The boy who said, "Wolf! Wolf! Everybody! There's a freaking WOLF here!" And look at that! There is! Thanks, man! (True Positive)

The boy who said, "Nah, there's no wolf here." But there was. Yikes! (False Negative)

Now, those four cells in the contingency table aren't all equally likely; if they were, a hypothesis test would be as precise as flipping a coin! No, our job in good hypothesis testing is to maximize the good boxes while minimizing the bad (the errors). In our course, we have spent a great deal of time discussing both types of error. The probability of a Type I error, also called the significance level of a test, is often, by default, set to 5%. Why? Tradition, mostly. There are many references (dating well back to the beginnings of the formal study of inferential statistics) that mention 5% as a good starting point for significance (feel free to Google it; there are many, many references to it). We've also discussed, in class, that you might not always want to operate at 95% confidence. This project, in part, will explore why.

Part 1: The Relationship between Type I and Type II Error

As you work through this section, go ahead and open the spreadsheet "errors" that accompanies this project.
Familiarize yourself with a few aspects of this sheet (and hypothesis testing in general):

The chance of a Type I ("false positive") error is referred to, symbolically, as α. Its value is indicated by the orange area in the sheet. Similarly, the Type II ("false negative") rate is called β, and it's the blue area.
The Critical Value (I've left off which distribution we're using, as what we're discussing in this assessment is pretty general) is decided upon at the outset of the hypothesis test, and represents the cutoff for "beyond a reasonable doubt" (that is, any test statistic that lands beyond this value triggers belief in the research hypothesis).1

1. (2 points) Start by moving the critical value to the right and left (but leaving the sample size and difference in means alone). What do you notice about the areas that represent the errors? Circle the best phrase that completes the following sentence:

As the chance of a Type I error increases, the chance of a Type II error (decreases / stays the same / increases).

Cool! If you want to increase your confidence (without changing the parameters of your study), the ONLY way, mathematically, to do this is to accept a higher false negative rate! This actually leaches out into more general applications all the time. For example, if you've ever had your car recalled by the dealer, but nothing was found to be wrong with it, you got a false positive (and, most likely, lots of others did, too). That's because a car manufacturer is deathly afraid of a false negative (i.e., leaving unsafe cars on the road), so they accept a higher α in exchange for a lower β.

2. (2 points) We've been testing at 95% confidence all term. It's pretty industry standard. However, suppose you're planning a study where you really want to avoid a Type II error, but you can live comfortably with a Type I error. What should you do at the outset of the study? Lower the confidence / Raise the confidence

3. (2 points) Now, suppose you're planning a study where you need (dearly) to avoid a Type I error, but you can live with a Type II. Now what should you do at the outset of the study? Lower the confidence / Raise the confidence

4. (2 points each) OK, now place your critical value somewhere to the right of zero so that you see visible areas for the Type I and Type II errors.
Leave it there. Now, adjust the difference in means so that the research mean2 moves away from, and then toward, the null mean. Then, complete each of the following statements by circling the appropriate phrase:

As the difference between the null and research means increases,
a) the chance of a Type I error increases / stays the same / decreases
b) the chance of a Type II error increases / stays the same / decreases

1 In the spreadsheet you're looking at, I'm running a right-tailed test. The same argument, without loss of generality, holds for any direction of testing.
2 Remember, you never know, for certain, what this is!
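If you'd like to check the tradeoff you just explored with numbers instead of areas, here's a minimal Python sketch (my own illustration, not part of the project files). It assumes a right-tailed test where the test statistic is normal with standard error 1, the null mean is 0, and the research mean is 1.5; those numbers are illustrative assumptions, not values from the spreadsheet.

```python
from statistics import NormalDist

# Sampling distribution of the test statistic under H0 (null) and
# under H1 (research); the 1.5 difference is an assumed effect.
null = NormalDist(mu=0.0, sigma=1.0)
research = NormalDist(mu=1.5, sigma=1.0)

def error_rates(critical_value):
    """Return (alpha, beta) for a right-tailed test at this cutoff."""
    alpha = 1 - null.cdf(critical_value)   # Type I: area beyond the cutoff under H0
    beta = research.cdf(critical_value)    # Type II: area below the cutoff under H1
    return alpha, beta

for crit in (1.0, 1.645, 2.326):
    alpha, beta = error_rates(crit)
    print(f"critical value {crit:5.3f}: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

Running this, you'll see exactly what the spreadsheet shows: as the critical value slides right, α shrinks and β grows.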
Part 2. Looking Deeper: A Case Study in the Difference of Means3

Hopefully, the answer to your last question makes sense to you: if there's a large difference in means, then you should be able to (correctly) see it more easily than if there's only a small difference. This hypothesized difference (if it exists) is called the effect size. And, as it turns out, sample size is inherently tied to effect size. Let's do some experiments to talk about how.

Start by making sure your version of Excel is set up in iterative mode. Here's a video I made showing how to do it in Excel 2010 (it's similar in newer versions): https://www.youtube.com/watch?v=zlxmicraxpo. You only need to watch that short video; there are more that follow it, but they're for a different problem.

Next, once you're all set up and ready to go, open the sheet 4coin1.xlsx. This is a spreadsheet designed to (Monte Carlo) model a coin flipping: press the button, and each time you do, the coin flips. The graph you see is the progression of the empirical probability of heads ("empirical" because it doesn't assume that the probability of heads is any given number; it demonstrates its probability through experimentation). It keeps track of the number of heads that appear and the number of trials, and then divides to arrive at a probability. The location of the penny marks that empirical probability of heads.4

5. (2 points) Go ahead and hold the button down for a bit. When the coin is close enough for you to believe that the probability of heads has stabilized (and not deviating far from that percentage, in your opinion), stop, and write down how many trials it took you to believe it.

Now, in that last one, you would have failed to reject H0; that is, you were (most likely) expecting 50% heads, and that's about what you got, right? Now, we're going to do three experiments where we DO get a rejection; that is, we're going to see a rigged coin. Open the sheet 4coin2.xlsx.
I've set this sheet up to simulate a rigged (that is, NOT 50/50) coin. You're going to repeat the experiment you just did above, but this time, write down how many trials it took until you felt that the coin wasn't fair.

3 Means, proportions... basically, whatever statistic we're talking about, we're talking about center.
4 A HA! Another nonparametric demonstration! I ♥ Monte Carlo.
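The mechanism the 4coin sheets use can be mimicked in a few lines of Python. This sketch is my own stand-in for the spreadsheet (the 0.8 rigging and the 500 flips are illustrative assumptions, not the values hidden in 4coin2.xlsx): it flips a coin with a chosen probability of heads and tracks the running empirical probability, just like the penny on the graph.

```python
import random

def empirical_heads(p_heads, trials, seed=1):
    """Flip a (possibly rigged) coin and return the running empirical
    probability of heads after each trial, like the penny on the graph."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    heads = 0
    history = []
    for n in range(1, trials + 1):
        heads += rng.random() < p_heads   # one flip: heads with probability p_heads
        history.append(heads / n)         # empirical probability so far
    return history

fair = empirical_heads(0.5, 500)
rigged = empirical_heads(0.8, 500)
print(f"fair coin after 500 flips:   {fair[-1]:.3f}")
print(f"rigged coin after 500 flips: {rigged[-1]:.3f}")
```

Scan down the `rigged` history and you'll see the same thing the sheet shows: the further the true probability sits from 0.5, the sooner the running average gives the rigging away.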
6. (2 points) Hold the button down until you feel you've seen enough trials to have spotted the unfair coin. How many trials did it take? Note: I'm not asking you to hypothesize by how much it's rigged; just tell me when you think it's demonstrated it beyond a reasonable doubt.

Ready for another rigged coin? Open up 4coin3.xlsx. Yep, rigged again.

7. (2 points) Hold the button down until you feel you've seen enough trials to have spotted the unfair coin. How many trials did it take?

One last rigged coin, perhaps? Open up 4coin4.xlsx.

8. (2 points) Hold the button down until you feel you've seen enough trials to have spotted the unfair coin. How many trials did it take?

Note: please bring these numbers to class next time so we can collate our results!

So now you can see why you answered the way you did back at the end of Part 1! The greater the deviation from expected, the easier it is to spot! In reality, of course, you won't know if your coin is rigged or not, but you will be able to calculate how large a sample you would need to spot a deviation from null, if it indeed exists. The calculations are tedious, but thankfully, there's great software around that'll do them for you! Here's one of my faves, if you ever need one: http://www.gpower.hhu.de/en.html.5

Part 3. So, what is Power, anyway?

We've spent a lot of time talking about error, but what exactly is power? Quite simply, power is defined to be the complement of a Type II error (in symbols, 1 − β). The more powerful a statistical test is, the more we believe that it can catch a difference in means (if such a difference exists). You'll often hear statistical tests critiqued because of their "low power"; what people often mean is that the sample size was too small to allow for a small value of β (and, hence, a large 1 − β). I'll let you all get into statistical fistfights with colleagues later over how large is "large enough." For now, this document serves as an introduction to power, measuring Type II error, and its relationship to Type I error.
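To make 1 − β concrete, here's a sketch of the kind of calculation a tool like G*Power automates, for the simplest textbook case: a one-sample, right-tailed z-test with known σ. The effect size of 0.5, the 5% significance level, and the 80% power target are illustrative assumptions on my part, not numbers from this project.

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()  # standard normal

def power(effect_size, n, alpha=0.05):
    """Power (1 - beta) of a right-tailed one-sample z-test:
    the chance we reject H0 given that H1 is actually true."""
    crit = z.inv_cdf(1 - alpha)                 # critical value on the z scale
    return 1 - z.cdf(crit - effect_size * sqrt(n))

def sample_size(effect_size, target_power=0.8, alpha=0.05):
    """Smallest n giving at least the target power for this effect size."""
    n = (z.inv_cdf(1 - alpha) + z.inv_cdf(target_power)) ** 2 / effect_size ** 2
    return ceil(n)

print(f"power with effect size 0.5, n = 25:    {power(0.5, 25):.3f}")
print(f"n needed for 80% power at effect 0.5:  {sample_size(0.5)}")
```

Notice that power climbs when either the sample size or the effect size grows, which is exactly what the coin experiments demonstrated: a heavily rigged coin (big effect) reveals itself in far fewer flips.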
Part 4. What does it all mean to you?

All of these experiments and demonstrations are great6, but they don't address the real issue: as researchers, how would you least prefer to be wrong? You'll never know with 100% certainty; that's the problem. Making this decision, when you're starting out testing a set of hypotheses, helps you set your confidence and/or sample size. In all honesty, in your careers, you'll most likely deal with this rigorously when the time is right. But I want you thinking about it, critically (not mathematically), right now. Here's how we'll do it: I'll describe a situation that's going to be studied using the hypothesis testing methods we've been learning about in class. Your job is to decide which type of error would be worse, and explain to me why. Let's look at an example:

Example: a research company is designing a cholesterol-reducing drug. As part of the design process, they're testing for possible side effects. One side effect is loss of appetite while using the drug. They've created the hypotheses:

H0: The drug does not cause loss of appetite.
H1: The drug does cause loss of appetite.

We'll never know with 100% certainty which of these is true, so we aim to minimize the chance of the error we don't want. Here are two ways to answer "which error type would be worse?":

Type I error is worse: A false positive in this case would be saying (erroneously) that, yes, the drug does cause a loss of appetite when, actually, it doesn't. Why would this error type be worse? Well, maybe people would be hesitant to use the drug if it had this side effect (or, in this case, if they thought it had this side effect).

Type II error is worse: A false negative in this case would be saying (erroneously) that the drug does not cause a loss of appetite when, actually, it does. Why would this error type be worse? Suppose people think they're getting a cholesterol-lowering drug with the added benefit of not negatively affecting appetite! Great! Except, well, it actually is negatively affecting appetite, and you won't know it until it's too late.

So, you see, there isn't one correct answer; either type of error could be bad, depending on your context. The important thing here is to get thinking about it.

5 It's in German, so you know it works.
6 Well, I think so, anyway.
Let's try two of your own! For each of the following, tell me, in your opinion, which type of error would be worse and why.

9. (4 points) An outdoor goods manufacturer is testing its rock climbing harnesses for safety. Of particular concern is the belay loop, an integral part of the harness that (literally) holds the climber's life at one point. From the CE testing specs, I have learned that belay loops need to withstand a 15 kN force for 3 minutes (if it does, it's considered "safe"). Therefore, the company tests the following hypotheses:

H0: The harness will withstand a 15 kN force for 3 minutes (the harness is "safe").
H1: The harness will not withstand a 15 kN force for 3 minutes (the harness is "not safe").

10. (4 points) A lab is working on a new drug test to cut down on illegal blood doping in professional cycling. In particular, it will test for EPO (Erythropoietin)7 in such a way that the test will flag a rider whose blood has higher-than-permissible levels of EPO. The lab sets the drug test to use the following hypotheses:

H0: The test comes back negative (that is, it fails to detect EPO).
H1: The test comes back positive (that is, it detects EPO).

7 You might remember it as one of the substances Lance Armstrong finally admitted to using when he won his Tours de France.