Okay, good afternoon everybody. Hope everyone can hear me. Ronet, can you hear me okay?

Okay, good afternoon everybody. Hope everyone can hear me. Ronet, can you hear me okay? I can. Okay. Great. Can you hear me? Yeah. I can hear you. Wonderful.

Well again, good afternoon everyone. My name is Erin Farley, and I am one of JRSA's research associates. For those of you who may be less familiar, JRSA stands for the Justice Research and Statistics Association, and we are a national nonprofit organization dedicated to the use of research and analysis to inform criminal and juvenile justice decision making. We are comprised of a network of researchers and practitioners, which at the core includes directors and staff from state Statistical Analysis Centers. So it is my pleasure today to welcome you all to our webinar on simple linear regression, which will be presented by Dr. Ronet Bachman from the University of Delaware. Ronet is a professor in the Department of Sociology and Criminal Justice, and she's also coauthor of numerous texts, including Statistical Methods for Criminology and Criminal Justice, and a coeditor of Explaining Criminals and Crime: Essays in Contemporary Criminological Theory. Her most recent federally funded research was a mixed methods study that investigated the long-term trajectories of offending behavior using official data of a prison cohort released in the early 1990s.

So welcome Ronet, and before we go any further, I do want to thank our partners at the Bureau of Justice Statistics for helping make this webinar possible. I would also like to cover a few logistical items. We will be recording today's session for future playback. The link to the recording will be posted on JRSA's website, usually the day following the webinar, so it should be available tomorrow, if not by Monday. Today's webinar is being audio cast via both the speakers on your computer and teleconference. We recommend listening to the webinar using your computer speakers or headphones.
To access the audio conference, select Audio from the top menu bar, and then select Audio Conference. Once the audio conference window appears, you can view the teleconference call-in information or join the audio conference via your computer. If you have any questions for the presenter or would like to communicate with JRSA staff, please submit your questions to me, Erin Farley, using the chat feature on the right side of your screen. This session is scheduled for one and a half hours. If you have any technical difficulties or get disconnected during the session, you can reconnect using the same link that you used to join. You can also send any questions or technological issues to Jason Trask at jTrask@JRSA.org. In the last five minutes of today's webinar, we will ask you to complete a short survey. The information that you provide will help us plan and improve future webinars and meet our reporting requirements. And so that is it. With that, I will turn it over to Ronet. Welcome.

Welcome everybody, I'm Ronet and I'm so thrilled to be here. Let me know, I'm not screaming too loud, am I? Or no one can actually respond, I guess. No. Yeah, you sound great. Okay. Forgive me in advance: I've already taught a two-hour statistics class this morning and I'm coming off of a two-week horrible cold and cough, so my voice may go here and there. I've got a cup of tea beside me. And let me just say at the forefront that I am so honored to be here with you all today. Before I came back to teaching, I was actually a statistician at the Bureau of Justice Statistics, so I worked closely with all the state SACs at the time, and I'm just honored to be here. So thank you very much for having me. I'm thrilled to be able to share this information on ordinary least squares regression, which is very exciting information, I think, and hopefully you will think so too. Before I start, I'm going to assume that everyone in the audience already understands the basics of hypothesis testing. And so if you have any questions during the talk, just let Erin know and I'll answer them for you.
I'm also very used to a teaching style that is give and take rather than straight lecture. So I might almost habitually stop and wait for somebody to respond, so forgive me for that. But when we get into the SPSS component, where I'm actually showing you how to get this, I'm going to allow you to do stuff on your own, just for a brief 30 seconds, because I believe that by doing, you actually cement learning, rather than just sitting and listening to somebody blather on.

So how I'm setting this up is I have SPSS output within these PowerPoint slides, and then we're going to move to SPSS and I'll show you how I got them. So first we're going to concentrate on interpretation.

Ordinary least squares regression is probably the most utilized statistical tool for analyzing relationships there is, not just in criminology and criminal justice, but across all the social sciences and hard sciences. Every time you calculate an insurance rate, for example, this is what they're using, or the probability of something happening, recidivism or arrest rates. Ordinary least squares regression is a very versatile tool. The dependent variable, however, is typically interval-ratio, meaning that the numerical values can be added, subtracted, multiplied, and divided. They mean something. The independent variables typically can be about anything, although interval-ratio variables or dichotomies are the most often utilized. And that's what we're going to stick with today. I've got to figure out how to use this. It's not going down. Oh, it is, slowly, forgive me. Okay.

So here I refer to IVs as X variables, independent variables as X variables, and dependent variables as Y variables. I bounce back and forth between saying IV and DV just to get the terminology there. I'm going to start with the case where both the IV and the DV are interval-ratio, where the numbers actually mean something. This is just some hypothetical data, and I'm going to show you what correlation and regression are actually trying to do. If we have an X variable and a Y variable like this, a very important tool that you look at first in examining the linear relationship between two variables is called a scatterplot. And as you can see on the screen here, in the scatterplot the independent variable plots along the horizontal axis and the dependent or Y variable plots along the vertical axis.
And so you can see from this hypothetical data that the linear relationship here is what we call a positive relationship. That simply means that both variables are moving in the same direction: as the independent variable increases, so does the dependent variable in this case. In this next set of fictitious data, we have an independent and a dependent variable. And if we plot a scatterplot for these variables, we see a different linear pattern. In this case, on the horizontal axis, as the independent variable increases from low to high, we have a dependent variable that is actually decreasing. This is referred to as a negative relationship. There's no connotation attached to negative; it simply means that the variables are moving in opposite directions. So as the IV increases, typically, on average, the dependent variable will decrease.

And if you look at this next pattern of data, you see the X variable changing. And if you look across the values on Y, the dependent variable, you notice that they're not changing at all. If we plotted these data, this is the line we would see: no relationship. And this is what linear regression and correlation also do. When there is no relationship, the line that can be drawn through the bivariate data points is actually a flat line, sort of synonymous with flatline in the medical sense: no relationship.

Now, all three of these typical, actually non-typical but perfect, relationships that I've shown you obviously don't exist in the real world. You've all worked with data. It's messy; it looks more like a dart board. These are some state level data that I've brought down to illustrate relationships. The first relationship we're examining here does not include the District of Columbia, so it's states without DC. The independent variable along the X axis here is the percent of the state's population residing in rural areas, so percent rural in a state. And the dependent variable along the Y, vertical axis is motor vehicle theft rates. What correlation and regression do specifically is try to quantitatively calculate the best fitting line that would go through that scatter of data and describe it in a linear way, and it will plot the best fitting line to describe that quantitative relationship. And if you had to draw a line through the scatterplot... now this is a time where I'd ask my class, what would it be? So I'll answer it myself. It would be negative. The line I can see would be going through here. Can you see my cursor here or not? Probably not. Erin, can you see my cursor? I cannot see. Okay. Okay. So the line would be descending, indicating a negative relationship: as the percent of a state's population living in rural areas increases, motor vehicle theft rates decrease, and that's what you would expect theoretically.
I grew up in a very rural area and there's not a lot of people stealing tractors out there. So let's move on. I want to summarize what we see in scatterplots and the important things to know. First is the strength of the relationship. If you notice, back in the scatterplot, the closer the data points cluster around a line or a linear trend, the stronger the relationship. The other thing you want to notice, of course, is whether the relationship is positive or negative, whether there is an ascending line or a descending line indicating the direction of the relationship. And the third, and equally important, thing you want to examine is whether or not there are any bivariate outliers. I know you are aware of univariate outliers, and a bivariate outlier is simply that as well, only in the bivariate sense: it's some value or data point in your data that does not fall within the rest of the pattern or trend. So for example, here I've included the District of Columbia in my previous rural and motor vehicle theft rate scatterplot. And you can see that the District of Columbia has a very high motor vehicle theft rate, but it has zero percent of its population in rural areas. So there's nothing rural about it, and it's therefore a bivariate outlier. These sorts of outliers can really significantly influence your analysis. And I'm going to get back to this in a second.

Okay. So the first statistic that we're going to talk about is the standardized Pearson correlation coefficient. This is the most beautiful standardized coefficient, and it can tell us in a standardized way the co-variation between an X and a Y variable. Some of you in the audience may look and see that it looks a lot like the denominator or the numerator for a standard deviation; that is, it's looking at the variation around the X variable, and it's also looking at the variation around the Y variable. It's based on the co-variation, and the denominator then standardizes it so that you can compare correlations across different models. They mean something regardless of how the original IV or DV were measured. Let me give you a little chart to show you what I mean. It is standardized to go from zero, which indicates no relationship, to positive one, which indicates a perfect positive relationship, which that first graph I showed you would indicate, all the way to negative one, which indicates a perfect negative relationship. Now, there are no perfect relationships, at least in our social world of the phenomena we're interested in. So everything falls generally in between.
But it's a standard rule that the closer to zero the correlation coefficient, or r, is, the weaker the relationship. And so I've laid out these sort of willy-nilly adjectives to describe correlations in between zero and one. And remember, the positive and negative signs have nothing to do with indicating strength; they simply indicate the direction. So a correlation of .5 would indicate a moderate positive relationship. A correlation of negative .5 would indicate a moderate negative relationship.
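To make the idea of standardized co-variation concrete, here is a minimal Python sketch of the Pearson calculation. The function name `pearson_r` and the small data lists are hypothetical illustrations, not the state-level data from the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of X and Y, standardized by the variation in each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means (the co-variation).
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data, chosen only to show the three reference points on the scale.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # both variables rise together
print(pearson_r(x, [10, 8, 6, 4, 2]))  # Y falls as X rises
print(pearson_r(x, [3, 1, 2, 1, 3]))   # no linear trend
```

Because the denominator rescales the co-variation, the result stays between negative one and positive one no matter what units X and Y were measured in.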

So that's the beauty of Pearson's correlation coefficient. And this is SPSS correlation matrix output. I've just simply dumped in a couple of the variables, including the bivariate relationship I just showed you in the scatterplot. Because it's in matrix form, there's a diagonal going through the middle of the box that is indicating a correlation of one. For example, the first box tells you the correlation between murder and murder, which by definition is going to be one because it's a perfect correlation with itself. Because it's a matrix, that also means that it's a mirror image: the top of the matrix is a mirror image of the bottom of the matrix. So you only need to look at one part. The second bunch of rows, I guess, in the first column indicate the murder rate and the percent of individuals below poverty. This is still the state level data set. And you can see from the correlation coefficient there, Pearson's correlation, that it's a positive .62, indicating, in a subjective sense, a sort of moderate positive relationship between percent poor and the murder rate in states. And the significance... and I'm assuming that you guys again understand what the significance means. It's basically doing a null hypothesis test. The null hypothesis always states no relationship between whatever variables you're interested in. And in science, of course, we have to test this and either reject it or fail to reject it. When we reject a null hypothesis, that means we can conclude that in fact these two variables are significantly related, and typically the standard critical value, or critical alpha, or significance, is .05. Just as a little hint, when I look at these significance levels, I ask myself: if we're willing to be wrong five percent of the time, and that is, correct 95 percent of the time, and I see an alpha of .003, this is telling you that we're going to be wrong less than one percent of the time.
So we can safely reject the null in this case and conclude poverty does in fact increase rates of murder at the state level. Just a little summary, in case some of you are fuzzy on hypothesis testing. So let's go down to the next row, robbery rate and murder rate. Again, positively related. I'm going to go down again to percent rural and the murder rate, just to highlight a negative correlation here. The correlation between percent rural and the murder rate is negative .10. Now, that is fairly close to zero. So I would interpret that as a very weak relationship. And the significance, in fact, is .65, indicating we'd be wrong 65 percent of the time if we rejected. So we must fail to reject and conclude that the percent of a state's population living in rural areas does not in fact affect the murder rate. So that's an overview of the correlation matrix and how to interpret it.

I'm going to move on now to look at a few more scatterplots. In these scatterplots in these PowerPoints, I've provided the r and the significance of r as well. So this is what we just saw in the matrix. In fact, it is a positive relationship, indicating states with higher rates of poverty also tend to have higher rates of murder. This is percent living in rural areas and not the murder rate, but the robbery rate. And this too is a moderate relationship, indicated by r being -.66, and also the significance. So we can conclude that states that have higher populations living in rural areas also tend to have lower rates of robbery, which is what you would expect given theories about social disorganization and the such. So let me move on. Here is the divorce rate. Now the divorce rate is often used as an indicator of social disorganization, which decreases collective efficacy in communities, theories say. So you would expect the divorce rate to be related to the burglary rate, but as you see in this scatterplot, the line that you draw is virtually flat. If you had to draw a line that goes as close as possible to all these dots, it would essentially be flat. And that is confirmed by this correlation coefficient of almost zero: r is .05, that's almost zero. And the significance associated with that is .81, indicating you're going to be wrong almost 82 percent of the time. So this is indicating no relationship.

Okay. So I've talked about correlation and the subjective adjective terms to describe correlation as being weak, moderate, strong, etc., and whether or not they're significant. There's another, more precise way to interpret r, and that's r squared, otherwise known as the very important coefficient of determination. I just call it r squared.
Basically it tells us specifically the proportion of the variation in the dependent variable Y that is being explained by the independent variable X. So it gives us a more precise way of interpreting r. And clearly you only have 100 percent of your variation to explain in any given dependent variable. You want to explain as much variation as possible. And let me just say as a side note here, and I'm going to reiterate this when we get to individual level data: with aggregate level data like zip code, county, city, or state, there is typically much less measurement error, so the correlation and r squared coefficients are much higher than with individual level data. When we have individual level data, I've rarely seen a correlation or r squared that explains over 10 percent of the variation. So just an FYI, and I'm going to reiterate that in a minute.

But let me go through interpreting these. We've already seen the poverty rate and murder scatterplot, and we saw that they had a correlation of .62. R squared, which is simply, literally, the square of that correlation coefficient, or multiplying it by itself, is .38. When we multiply that by 100, it is interpreted as the percent of variation in Y explained by X. So in this case, we could say that 38 percent of the variation in murder rates in states can be explained by poverty rates. Remember, when you are interpreting analysis, you always make reference to your variables and to your units of analysis, which are how your variables are measured, and in this case we're at the state level. So for the rate of robbery, we have a correlation of -.66, and r squared tells us .44, which means that 44 percent, in this case, of the variation in robbery rates can be explained by knowing a state's percent rural population, which is again pretty good. And we know that those are both significant. And then we had the divorce rate, which had an r of .05, and this gives us an r squared of about .003. And so we can say the divorce rate explains well under one percent of the variation in rates of burglary at the state level. So again, this gives a much more precise, quantitative interpretation of the correlation coefficient than saying weak, moderate, strong. So it's a very handy tool. Now, the correlation and the coefficient of determination are wonderful tools to examine strength of relationship, and they allow us to make comparisons and assessments of strength beyond significance, which I'm going to show you in a minute.
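As a quick check of the arithmetic, the short Python sketch below squares the three correlations quoted in the talk. The dictionary labels and the helper name `percent_explained` are illustrative only. Note that squaring .05 gives roughly .0025, about a quarter of one percent.

```python
# Correlations quoted in the talk for the state-level examples.
correlations = {
    "poverty and murder": 0.62,
    "percent rural and robbery": -0.66,
    "divorce and burglary": 0.05,
}


def percent_explained(r):
    """Coefficient of determination expressed as a percentage: r squared times 100."""
    return r ** 2 * 100


for label, r in correlations.items():
    # The sign of r drops out when squaring; only the strength carries over.
    print(f"{label}: r = {r:+.2f}, r squared = {r ** 2:.4f}, "
          f"percent of variation explained = {percent_explained(r):.2f}")
```

Squaring removes the sign, which is why r squared speaks only to strength and never to direction.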
Now, I'm going to move into the ordinary least squares regression line, otherwise known as linear regression. OLS is the formal term. And why is it called ordinary least squares? I'm going to tell you why, because it makes intuitive sense. When most of you learned about descriptive statistics like the mean, and learned how to calculate the standard deviation, you learned that the difference scores, each X value minus the mean, would be equal to zero if we summed them, because the mean is the perfect balancing score. That's what makes it so beautiful, and that's why it's so versatile as an analysis tool. But to quantify that variability around the mean in any distribution of X values, you have to square those difference scores to get a quantity.

And that sum of those squared differences, those least squares, will be the minimum variance you could possibly describe. And that's how, of course, both the variance and the standard deviation are calculated. Well, this too is what ordinary least squares uses to draw the best fitting line through a linear scatter of data points: this beautiful balancing mechanism that the mean has.

So let me take you through a little exercise. We're changing now to units of analysis of students in middle school and high school. We have two variables. We have 20 students, and we have the independent variable of age. So we have each student's age, and then we have the dependent variable, their self-reported delinquency score. And if we plot this out, you can see there is a nice linear positive relationship. As age increases for this sample of 20, so does the number of delinquent acts, as you would expect. Now, imagine we were to draw the best fitting line through the scatter of data. The best and most efficient way we could do that would be to calculate conditional Y scores, that is, conditional means of self-reported delinquency for each age. And that conditional mean of Y would come closest to all other values of delinquency at any given age. That is exactly what the slope coefficient is doing when we calculate the ordinary least squares regression line. It's actually creating these conditional values at every X score, so that the line it fits to the data is the absolute quantitatively best fitting line that could possibly describe that linear relationship. It's really cool. So I plotted these conditional Ys, and you can see... get out your little rulers and see that they come closest to all the data points possible. And that's my little exercise to demonstrate to you why ordinary least squares is the best way we could ever summarize a linear relationship between two interval-ratio level variables.
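The balancing idea can be sketched in a few lines of Python. The ages and delinquency counts below are hypothetical stand-ins, not the 20 students on the slide. `ols_fit` and `sum_squared_errors` are illustrative helper names using the standard textbook least squares formulas.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: forces the line through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical ages and self-reported delinquent acts.
ages = [12, 13, 14, 15, 16, 17]
acts = [1, 2, 2, 4, 5, 7]

a, b = ols_fit(ages, acts)
residuals = [y - (a + b * x) for x, y in zip(ages, acts)]

print("intercept:", a, "slope:", b)
print("sum of residuals (balances out to about zero):", sum(residuals))
print("squared error, fitted line:", sum_squared_errors(ages, acts, a, b))
print("squared error, slope nudged up:", sum_squared_errors(ages, acts, a, b + 0.1))
print("squared error, intercept nudged up:", sum_squared_errors(ages, acts, a + 0.5, b))
```

Any nudge to the fitted slope or intercept raises the squared error, which is the sense in which the line is "least squares".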
So this is in fact the equation. The equation on the top is using the symbols for the population parameters, alpha and beta. And on the side I have the equation using the sample symbols a and b. What the regression equation actually is doing is predicting values of Y, the dependent variable, along a line, and that line crosses the Y axis at the constant or intercept. That's the symbol alpha, or a. That constant or intercept is just anchoring the line. What we're really interested in is the slope, b, or beta up there. The slope tells us exactly how much Y, the dependent variable, increases or decreases for every one unit increase in the independent variable.
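The slide itself is not reproduced in the transcript, so the block below is a reconstruction in standard textbook notation rather than a copy of what was shown. The error term and the estimation formulas are assumptions added for completeness.

```latex
\begin{align*}
  % Population regression line, with parameters alpha (intercept) and beta (slope)
  y &= \alpha + \beta x + \varepsilon \\
  % Sample regression line, with estimates a (constant) and b (slope)
  \hat{y} &= a + b x \\
  % Least squares estimates of the slope and the constant
  b &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
  \qquad a = \bar{y} - b\,\bar{x}
\end{align*}
```

Read this way, b is the change in the predicted Y for a one unit increase in X, and a is the predicted Y when X is zero.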

And so you say, Ronet, well that's all great, we already know from correlation that there's a positive or negative relationship, what's so great about this? The great thing about the ordinary least squares regression line is that it allows us to make predictions, specifically in terms of how the variables are measured. So when you're doing a correlation, you also need regression. They're two sides of the same coin, and you need them both when you're describing linear relationships.

So on this next slide, we have SPSS output, and this is again repetition, but I'm showing you the different ways you can get these statistics. This is where the independent variable is percent below poverty in a state, and the dependent variable is the murder rate. And again, it's the state level data. Those of you who work in SPSS know that it increasingly gives you about, I don't know, 70, 80 things that are useless. The first box up here just reiterates what the IV and DV are. The model summary gives you correlation information just like the correlation matrix does, so you have your r and your r squared and the adjusted r squared, which I'm going to talk about when we talk about multiple regression. For now, I'm just going to concentrate on the r and r squared. But one thing I want to note, and put your listening ears on: in the regression output, r never has a sign, positive or negative. And that will make sense when we get to multiple regression. It doesn't have a sign because in this particular analysis of regression, you can have many independent variables down here besides poverty predicting a particular dependent variable. Some of those variables will be positively related to the DV, some of them may be negative. So this r never has a sign. So simply because it looks positive now, don't infer that that is meaningful; it just gives you the value of r.
So I'm not going to reiterate this interpretation. [inaudible 00:30:41] box here is the F test, the null hypothesis significance test of whether the regression model is actually explaining a significant amount of variation in the dependent variable. And when you have a bivariate model, that is, only one independent variable predicting a dependent variable, it's redundant with the t test in the coefficients box. So I'm not going to spend any time explaining this ANOVA box either, because it's redundant, and I'll talk about that in a second. But for now, let's go to the coefficients box. The constant is the intercept. SPSS labels its intercepts constants. I don't know why, but that's what they labeled them. So I've put arrows there to indicate that. The slope coefficient, the unstandardized slope coefficient, is under the same column, B, and it is .53. And so because there's only one independent variable in this model, we could also look back to the correlation up in the model summary box and know that that too should be positive.

So the OLS regression equation that SPSS provides us for the data that we just saw in the scatterplot is: the murder rate, Y, is equal to the constant plus .53 times X. So you could insert any rate of poverty here for a state and predict the value of murder that you would obtain from this regression equation. I'm going to also interpret the [inaudible 00:32:34]. So unlike the correlation coefficient here, where you see it's .62 and you can say, oh, that's moderate, explains [inaudible 00:32:44] variation, with this slope coefficient we can say specifically that for every one unit increase in the percent of a state's population living below poverty, murder rates also increase .53 units. So that gives us precision. And I'm going to show you in a minute how to use this for prediction.

But let me highlight here that there's also a significance test. You have a t value. The t is used to test the null hypothesis that the slope coefficient is equal to zero, and of course if the slope coefficient is equal to zero, it simply means there's no relationship. So that's the same null hypothesis that we're always using in all of our statistical tests. It's the exact same significance that we got with the correlation coefficient. Notice the significance of getting this t is .003. So we can safely conclude, at least at the .05 level, that there is a significant relationship between poverty and murder rates. And notice that this significance of .003 is exactly like the significance of F above in the ANOVA box, .003. So the significance tests for the slope here and for the model here in the ANOVA box are identical when there's only one independent variable in the model. So you only really need to look at one. That of course will change when we look at multiple regression. Okay. I'm going to move along.
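The redundancy between the slope's t test and the model's F test in a one-predictor regression can be sketched numerically. The t value squared equals F, so the two tests carry the same significance. In the Python sketch below, r = .62 comes from the talk, while the sample size n = 20 is a hypothetical value, since the transcript does not state it. The function name is illustrative.

```python
from math import sqrt


def bivariate_t_and_f(r, n):
    """t for the slope and F for the model in a one-predictor regression, from r and n."""
    df_error = n - 2  # degrees of freedom left after estimating intercept and slope
    t = r * sqrt(df_error) / sqrt(1 - r ** 2)
    # F is explained variation per model df (1) over unexplained variation per error df.
    f = (r ** 2 / 1) / ((1 - r ** 2) / df_error)
    return t, f


# r is quoted in the talk; n = 20 is an assumption made only for illustration.
t, f = bivariate_t_and_f(0.62, 20)
print("t:", t)
print("t squared:", t ** 2)
print("F:", f)
```

With more than one independent variable this identity no longer holds, which is why the two boxes stop being redundant in multiple regression.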
I want to highlight a couple of assumptions, assumptions that we frequently violate, but let me just say that OLS regression is very robust, so it can take some violations pretty handily. The first assumption, of course, is that the data were randomly selected. If they're not randomly selected, we shouldn't be using probability theory to test hypotheses. We also assume that they're normally distributed. If you had very, very skewed distributions, that might cause problems. It's beyond the scope of this webinar to talk about them here. You're also assuming that the relationship is linear. And it's also beyond the scope to talk about what you do when you have a nonlinear relationship.

The other thing I want to highlight is this, a word that you can use in Scrabble. It will get you a lot of points. It's called homoscedasticity. Residuals, which we also call error, are the distances between your actual data points and the regression line that is drawn. You assume that the residuals, or the error component, are constant across all values of the independent variable. So what does that mean? If I show you here, these are not raw data; these are actually scatterplots of residuals. In this first graph, you have constant residuals, that is, constant error across all the X values. They're sort of consistent. There's not a lot of variation in the residuals. But in the second graph on the right side, you see much more error; the residuals are much greater at higher values of X. So you can do a simple plotting of residuals against your X to see if in fact there is greater variance at different values of X. And if you have that, you probably are going to have a problem with homoscedasticity that you're going to have to deal with, which is, again, beyond this webinar. But for now, you can use it in Scrabble.

Okay. So as I said, prediction. The regression equation I've just written here is exactly what we got from that SPSS output. And remember that the correlation was .62. When you're presenting something to an audience, especially when using prediction tools, you don't want to present a predicted value of Y in isolation. The question is always: compared to what?
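A rough numeric stand-in for eyeballing the residual plot described above is to fit the line, then compare the average squared residual in the lower and upper halves of X. The Python sketch below uses two made-up data sets, and splitting at the midpoint is a simplification chosen for illustration, not a formal test. The helper names are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mean_square(values):
    """Average of the squared values."""
    return sum(v * v for v in values) / len(values)


def residual_spread_by_half(xs, ys):
    """Mean squared residual for the lower and upper halves of X, as (low, high)."""
    intercept, slope = ols_fit(xs, ys)
    pairs = sorted(zip(xs, ys))  # order the cases by X before splitting
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    return mean_square(residuals[:half]), mean_square(residuals[half:])


xs = list(range(1, 11))
# Hypothetical data: same underlying line, with errors of steady size vs. errors that grow with X.
steady = [1 + 2 * x + (0.3 if x % 2 else -0.3) for x in xs]
fanning = [1 + 2 * x + (0.1 * x if x % 2 else -0.1 * x) for x in xs]

print("steady errors (low half, high half):", residual_spread_by_half(xs, steady))
print("fanning errors (low half, high half):", residual_spread_by_half(xs, fanning))
```

When the two halves show similar spread, the constant-error assumption looks plausible. A much larger spread at high X matches the second graph in the slide.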
So what I like to do for audiences to illustrate the effect of an independent variable on a dependent variable is to use a high and a low value within your independent variable to predict values of Y. So what I've done here is predict murder, a murder rate, for a state with a low value of poverty around four percent and for a high value of poverty around 24 percent. Those high and low values of four and 24 may not necessarily exist in your data. The low poverty rate is about that, and the high poverty rate is around 25 percent. And so it doesn't necessarily mean that the predicted value that you obtain is going to be equal to that particular poverty rate. But it's telling you based on that whole regression equation in the scatter of the data, this is the best predictive value of the Justice Research and Statistics Association Webinar Page 12 of 27

murder rate you could obtain. And I use in this equation y hat; it's a y with a little hat on it. That's usually what we use to denote that it's a predicted value of y, not the real value of y. So you can see here that the predicted value of murder for a state with a very low poverty rate, say four percent of its population, is .29. That's a very low rate of murder; .29, wouldn't that be great? And then we have the compared to what: the 24 percent rate of poverty. We plug that into the same regression equation, using 24 as our X value instead of four, and we get a much higher rate. So that's the beauty of the regression equation. It allows us to make predictions. And you can bet this is what every insurance company is doing to predict your health insurance and your auto insurance, and what financial analysts are doing all over right now to predict what's going to happen to the market. And it's what people all over, in fact, are doing to predict rates of recidivism and rates of crime and rates of showing up for trial and so on. So let's move on.

Okay. I've put in this PowerPoint slide so you have a mechanism to go back and relearn what I'm going through rather quickly here. I'm highlighting again, here in this slide, that the correlation coefficient in the model summary never shows a sign. In this case, we have the dependent variable of the robbery rate in states and the independent variable of percent living in rural areas. And I'm going to start with the regression coefficient and the regression line, just like the one above for murder. We see here that in this case, robbery, the dependent variable, is equal to the intercept, called the constant here, plus the slope times X, which happens to be 80 percent rural. So notice that the coefficient here, the slope, is negative.
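The point about the sign can be checked with a small sketch. The data below are simulated and the variable names are hypothetical; none of the numbers come from the webinar's state data. It shows that, with one independent variable, the correlation has the same sign as the slope, while an unsigned R is simply its absolute value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated (hypothetical) data with a negative relationship.
pct_rural = rng.uniform(5, 60, size=50)
robbery = 200 - 2.0 * pct_rural + rng.normal(0, 30, size=50)

slope, intercept = np.polyfit(pct_rural, robbery, 1)
r = np.corrcoef(pct_rural, robbery)[0, 1]

# An unsigned R, like the one described for the model summary,
# carries the magnitude of r but not its direction.
unsigned_r = abs(r)

print(f"slope = {slope:.2f}, r = {r:.2f}, unsigned R = {unsigned_r:.2f}")
```

The direction of the relationship has to be read from the slope or from a correlation matrix, not from the unsigned R.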
This tells you that this correlation up here in the model summary, because there's only one independent variable, should also be negative, which it would be if you ran a correlation matrix in SPSS. But again, it always shows no sign here, simply because it can accommodate multiple relationships. So let's interpret this slope coefficient. This tells us that for every one unit increase in the percent of the population living in rural areas in states, there is a corresponding decrease in robbery rates by the amount of the slope. Decrease, because it's a negative relationship. So you'll look at the significance of this. Again, the t test is testing the null hypothesis that there is no relationship between rural population and

rates of robbery, or that the slope in the population is actually equal to zero. And we would be wrong less than one percent of the time if we rejected that. So we can clearly, on safe grounds, conclude that there is a significant relationship. States with higher rates of rural population also tend to have lower rates of robbery.

Okay. And now let's go to the divorce rate. And this is what I want to illustrate. Look at this OLS regression equation. This time we're predicting burglary rates with the divorce rate in states. And you have Y, the burglary rates, equal to [inaudible 00:42:41] plus the slope times the divorce rate. So for every one unit increase in the divorce rate, the burglary rate increases by that slope. Now, you might say, Ronet, look at that coefficient, and, whoops... sorry, going the wrong way. And look at that percent rural coefficient back here, it's only two. This highlights what I said a little while ago: you cannot compare the value of a slope coefficient across models, or even across the same variables across models, because it is measured specifically to determine the change in your particular dependent variable for a one unit increase in your independent variable. So this [inaudible 00:43:41] coefficient was much smaller than this, but this is not significant. This is a very, very weak relationship. So what I'm trying to bring home is never compare the sizes of slope coefficients. The size tells you nothing about strength or significance; it's specific to how your IV and DV are measured. Okay, so again, this relationship between divorce and burglary rates, as we already determined back when we were looking at the correlation, is very weak. It explains less than one percent of the variation in burglary rates. And this t value here and its corresponding significance say we're going to be wrong almost 82 percent of the time if we reject the null that says there's no relationship.
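Why the size of a slope says nothing about strength can be shown with a short sketch. Everything here is simulated and hypothetical. The same independent variable is measured two ways, as a percent (0 to 100) and as a proportion (0 to 1). The slope changes by a factor of 100, while the correlation does not change at all.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated (hypothetical) data: one X, expressed in two different units.
x_percent = rng.uniform(0, 100, size=60)
y = 10 + 0.3 * x_percent + rng.normal(0, 5, size=60)
x_proportion = x_percent / 100

slope_percent = np.polyfit(x_percent, y, 1)[0]
slope_proportion = np.polyfit(x_proportion, y, 1)[0]

r_percent = np.corrcoef(x_percent, y)[0, 1]
r_proportion = np.corrcoef(x_proportion, y)[0, 1]

print(f"slope (percent units):    {slope_percent:.3f}")
print(f"slope (proportion units): {slope_proportion:.3f}")
print(f"r is the same either way: {r_percent:.3f} vs {r_proportion:.3f}")
```

The relationship is identical in both fits; only the unit of measurement moved the slope. That is why strength is judged from r and r squared, and significance from the t test, rather than from the size of the coefficient.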
So we have to fail to reject the null and conclude that divorce rates have no relationship, have no linear effect on rates of burglary in the states. Okay, let's move on.

Okay. At the beginning I said that OLS is extremely robust. Even though it was designed to determine linear relationships between two variables, it can handle dichotomies, but the dichotomies have to be coded zero and one for them to be easily interpreted. There are many, many times when you have a dichotomous independent variable, and there are also many more times when you have an ordinal level of measurement or a variable that has so

little variation that it just makes more sense to dichotomize it into zero and one. And I'm going to show you those in a second. For now, a dichotomy has to be coded zero and one, and I'm going to start with predicting murder using a southern location variable coded zero for states that are in the non-south and one for states that reside in the south, according to the US Census. I'm not going to show you a scatterplot of this relationship, obviously, because there's no scatter, there's just a zero and a one. But you can interpret the correlation like any other correlation. This correlation would indicate sort of a weak to moderate relationship of .44. And r squared tells us that, when multiplied by 100, 19.3 percent of the variation in murder rates can be explained by a state's southern location. Now, why is this? Well, a lot of reasons. There is a lot of gun ownership, there's the culture of honor. There are many reasons why people think the south has higher rates of homicide than other regions. And here we see that it does in fact explain variation in rates of murder. Now let's look at the regression equation, which I've put down here in the slide. Notice that this coefficient is just like any other coefficient, but we have been interpreting these slope coefficients as if the variables are interval ratio. That is, for every one unit increase in X, what happens to murder? In this case, we know that there is no interval ratio variable; it's simply a variable that's coded zero for states that reside in the non-south and one for states that reside in the south. So there are a couple of different ways to interpret a coefficient like this for a dichotomy. What I always do is start with the comparison to the zero category. So compared to states in the non-south, the murder rate increases by the amount of the coefficient for states located in the south. That's exactly how you interpret it. And I can show you.
If we use this regression equation for prediction, you can see exactly what that means mathematically. When you interpret a dichotomous coefficient, you have to know how the variable is coded: which value is coded zero and which value is coded one. If I use zero for states in the non-south and plug in zero in this regression equation, you will see that the zero cancels out that 2.4, and the predicted rate of murder is then just what the constant is, or the [inaudible 00:48:42] one. Now, if I plug in the value of X for states that reside in the south, and they are coded one, that murder rate increases

exactly by that coefficient, and the predicted rate of murder now becomes the constant plus the coefficient. So either way, I hope this illustrates that when you interpret a slope coefficient for a dichotomy, you can't say "for every one unit increase in X," because X only goes from zero to one. So be very careful when you interpret those. And so you can say here... let me go back to this equation. The significance of this is just at the line, .052, and my students say, "Well, is that significant?" I say that's significant. If it's .056, then you might have... but you can also say that at a one-tailed level, and I'm not going to get into one-tailed and two-tailed tests, that's significant. So what does that mean? You can conclude that states in the south have higher rates of murder than states in the non-south, and that's all you can say.

So let me go on. I want to bring in some other data, some homicide defendant data, because I want to try some regression analysis when the units of analysis are individuals. So these data were actually obtained from [inaudible 00:50:20] and are a little old, but they are still very useful. It's homicide defend [inaudible 00:50:26] analysis are murder defendants from 75 of the largest standard metropolitan statistical areas. And the dependent variable that I'm using here is sentence length received for convicted defendants. So all of the individuals in this particular model were convicted, and the dependent variable is incarceration term in days. The independent variable is another dichotomy: whether the trial was by jury or whether it was a plea. So those of you in the stats know, just from your own knowledge, that pleas, you're going to assume, are going to have lower incarceration terms than cases that went to trial and ended in conviction. So let's go through this regression equation for these individual level data, and notice that the r, the correlation coefficient in the model summary, is .273. And that's on the weak end of the r spectrum.
r squared indicates that 7.5 percent of the variation in the sentence length received is explained by whether the trial was by jury or by plea bargain. Now, you can say, well, Ronet, that's teeny weeny, that's less than 10 percent of the variation explained. And I will say that is pretty good for an individual level unit of analysis. Many of you who have looked at sentence lengths know that they're extremely variable, and there's a lot of measurement error. And so this is a fairly good model summary for an individual level data set.

So let's move on. And again, we see the coefficient box, and I'm putting out the OLS regression equation here. Sentence length in days, the dependent variable, is equal to the constant plus the slope times X, our independent variable, which is whether the trial was by jury or a plea. So how do we interpret this slope? Remember, we have to know what is coded zero and what is coded one. In this particular case, plea bargains are coded one and jury trials are coded... I mean, plea bargains are coded zero, forgive me, and jury trials are coded one. So compared to defendants who pled their cases, defendants who were convicted in a jury trial on average had 9,374 days added to their sentence. Okay, no surprise. Is that significant? Very. The significance is .000. Now, I don't know what Excel displays, but SPSS only displays three decimal places. So that doesn't mean that you're going to be wrong zero percent of the time or, on the other side of the coin, that you're going to be right 100 percent of the time. If I were in SPSS, and I'll show you in a second, and clicked on this, there are about 10 decimal places out there, so there's a one out there somewhere. It just means it's very, very significant: you're going to be wrong less than .1 percent of the time. So that's a very significant relationship, and I can illustrate this using prediction, just like I did for the murder rates. I plug in zero here when I'm predicting the sentence length for somebody who pled their case, and I divide by 365, the number of days in a year. The predicted sentence length for a defendant who pled their case would be about 24 years, or 8,947 days. In contrast, I can plug in my X value of one, indicating the defendant actually went to a jury trial. And that predicts a sentence length for that particular defendant exactly 9,374 days longer, or about 50.2 years.
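The arithmetic behind those two predictions can be written out as a short sketch. The two numbers are the ones quoted in the talk (8,947 days for a plea and 9,374 additional days for a jury trial); the function and constant names are hypothetical, chosen only for this illustration.

```python
# Values quoted in the talk; names are illustrative only.
INTERCEPT_DAYS = 8947  # predicted sentence when X = 0 (plea)
SLOPE_DAYS = 9374      # days added when X = 1 (jury trial)

def predicted_sentence_days(jury_trial: int) -> int:
    """y-hat = intercept + slope * X, where X is 0 (plea) or 1 (jury trial)."""
    return INTERCEPT_DAYS + SLOPE_DAYS * jury_trial

for label, x in (("plea", 0), ("jury trial", 1)):
    days = predicted_sentence_days(x)
    print(f"{label}: {days:,} days, about {days / 365:.1f} years")

# plea: 8,947 days, about 24.5 years
# jury trial: 18,321 days, about 50.2 years
```

With X coded zero the slope drops out and the prediction is just the constant; with X coded one the full slope is added. That is why the coefficient on a dichotomy reads as the difference between the two groups.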
So using regression like this to illustrate the effects of a particular independent variable on what you're trying to predict is very, very illuminating. And I think I'm going to do... or I'm actually not. I'm going to say a word about outliers here. Remember I said back in the [inaudible 00:55:34] that bivariate scatter plots tell you three things: the direction of the relationship, the strength of the relationship, that is, how closely those points are clustered around a straight line, and any idiosyncrasies or outliers that you have. This is real data that I just pulled down, state-level incarceration data, but it includes DC now. It's incarceration rate as the independent

variable and the murder rate as the dependent variable. And in this first scatterplot, I've included the District of Columbia. And just like it did with percent rural, it's not behaving itself, and that kind of happens with DC frequently. But I want to illustrate what happens to the value of the correlation coefficient when we include it. The correlation coefficient for incarceration and rates of murder is .49 with DC in. And that's sort of weak to moderate. But look what happens when we take DC out: the correlation, as you see on the scatterplot on the right, with the correlation noted above, dramatically increases to .68. So when you're analyzing your data, believe me, you want to make sure that you don't have anything unduly influencing the results, either so that you get the results that you expect or the results that you don't expect. So it's very, very important to inspect your scatter plots. And that goes for the univariate level as well. For example, in the sentence length data, all of the people who got the death penalty were coded with some astronomical sentence. I can't remember what it was, but all of those people who were sentenced to death are like [inaudible 00:57:41] up there. And that's a completely different sentence, and it really messed things up. So you really have to inspect your data from the very beginning.

So I'm going to go now to SPSS. Does anybody have any questions? I'm not seeing any questions. So I'm going to go... oh, wait. Wait, wait. I'm sorry, she needs to speak up closer to her mouth. She is coming through fine on our end. Yeah, you're good. There aren't any additional questions. Yeah. Okay. So how to do this. I need to click start and I need to share my desktop. Yay. Technology is so great when it actually works. I have a handout, if you will. I don't know how to give... I will give it to Erin. Okay.
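Going back to the DC example for a moment, the effect of one unusual case on a correlation can be sketched with simulated data. The numbers below are hypothetical and are not the state data from the talk; the single extreme point simply plays the role of an outlier.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated (hypothetical) data: 50 cases that follow a clear linear trend.
x = rng.uniform(200, 600, size=50)
y = 1 + 0.012 * x + rng.normal(0, 1, size=50)

# One extreme case that sits far off the trend.
x_with = np.append(x, 250.0)
y_with = np.append(y, 30.0)

r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_with, y_with)[0, 1]

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

A single case out of 51 is enough to pull the correlation down noticeably, which is why inspecting the scatterplot before trusting the coefficient matters.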
And I should have had it prior, I'm sorry, but it's going to take you through what I'm doing here now. It's been a rough week here on this end. So let me just go through here. This is SPSS, and I know Excel does this as well. I am using SPSS just because it's more user friendly and it matches the output that I showed in the

PowerPoints. I have two data sets up here. Okay. It's not letting me go in here to this. Maybe it will let me go here. Okay. It is. Okay. So this is what an SPSS window looks like. And I'm first going to take you through a couple of scatter plots to show you how to get them. Individual level data doesn't always show trends so well in a scatter plot, but it's still another important tool to use to see if there are any extreme outliers. So the first data set is the state level data. Instead of the sample of 20 that I used for the PowerPoint illustrations, this is all of the data, and it doesn't include DC. So the first relationship I'm going to look at, and when you're in... I guess I can't toggle like that, I have to go through here. When you have SPSS, it spits all of its output into an output window, and so I'm going to keep both the output window and the data window up here. And it is very user friendly, and there's always a help menu if you need help. In fact, there's not only a help menu, but it gives you sort of interpretation tools. So the first thing I'm going to look at is the relationship in states between mobility and the robbery rate. Now this is another indicator of... I think I mentioned social disorganization earlier. Social disorganization is this theory that says when communities are disorganized, with people moving in and out, divorce rates, etc., the residents in those communities have less collective efficacy and therefore cannot monitor and socially control their residents. So one indicator of social disorganization that criminologists have frequently used is mobility, and the census measures this by asking residents, have you moved in the last five years? And this is percent moved. These data are actually 2014, so it's capturing the mover rate, or the mobility rate, if you will. So to get a scatterplot, every graphical tool in SPSS is housed under the graph icon here.
And you see here you have a whole range of graphing options. I personally hate pie charts and bar charts, but scatter plots and box plots are great things. Histograms are okay too. For the scatter, you have several different options. I'm just going to do a simple scatter; you can also do a matrix. If you did a matrix scatter, it would be exactly like [inaudible 01:01:54] matrix that I described, only instead of showing all the different correlations, it would actually show all the different scatter plots, but they're teeny weeny, so it's hard to examine. So I just stick to a simple scatter.
