Gas Price regression...

This is based on data file GasolineMarket.mpj. Here is a schematic of the data file:

Year Expenditure Population GasPrice Income NewCars UsedCars PublicTrans Durables Nondurables Services
93 7.4 96 6.668 8883 47.2 26.7 6.8 37.7 29.7 9.4
94 7.8 6239 7.29 868 46. 22.7 8. 36.8 29.7 2.
9 8.6 627 7.2 937 44.8 2. 8. 36. 29. 2.4
96 9.4 6822 7.729 9436 46. 2.7 9.2 36. 29.9 2.9
97.2 7274 8.497 934 48. 23.2 9.9 37.2 3.9 2.8
98.6 744 8.36 9343. 24. 2.9 37.8 3.7 22.6
23 9.3 2973.4 26437 34.7 42.9 29.3 7. 6.3 26.
24 224. 2939 23.9 273 33.9 33.3 29. 4.8 72.2 222.8

This is certainly a time series. We can see very strong patterns in the correlation matrix. This comes out in this form...

Correlations: Expenditure, Population, GasPrice, Income, NewCars,...

             Expenditure  Population  GasPrice  Income
Population   .9
GasPrice     .978         .927
Income       .96          .993        .934
NewCars      .942         .9          .936      .96
UsedCars     .936         .946        .923      .93
PublicTrans  .966         .96         .927      .964
Durables     .92          .94         .939      .949
Nondurables  .979         .97         .963      .978
Services     .977         .96         .939      .97

             NewCars  UsedCars  PublicTrans  Durables
UsedCars     .994
PublicTrans  .98      .982
Durables     .993     .988      .98
Nondurables  .989     .982      .99          .977
Services     .978     .977      .998         .96

             Nondurables
Services     .994

Cell Contents: Pearson correlation
It looks nicer if we re-organize the layout:

             Expend-          Gas              New    Used   Public            Non-
             iture    Popn    Price   Income   Cars   Cars   Trans   Durables  durables
Population   .9
GasPrice     .978     .927
Income       .96      .993    .934
NewCars      .942     .9      .936    .96
UsedCars     .936     .946    .923    .93      .994
PublicTrans  .966     .96     .927    .964     .98    .982
Durables     .92      .94     .939    .949     .993   .988   .98
Nondurables  .979     .97     .963    .978     .989   .982   .99     .977
Services     .977     .96     .939    .97      .978   .977   .998    .96       .994

The problem is that everything is moving forward in time together. So what explains GasPrice? Let's try the run on just NewCars, UsedCars, Population.

Regression Analysis: GasPrice versus NewCars, UsedCars, Population

The regression equation is
GasPrice = - 8.2 + .4 NewCars - .42 UsedCars + .326 Population

Predictor   Coef SE Coef T P VIF
Constant    -8.9 2.88-3.84.
NewCars     .364.36 2.87.6 87.684
UsedCars    -.423.24 -.66.4 82.269
Population  .3263.23 2.7.9.23

S =.223 R-Sq = 89.7% R-Sq(adj) = 89.%

Analysis of Variance
Source      DF SS MS F P
Regression  3 4346 4487 38.9.
Error       48 6 4
Total       48467

Source      DF Seq SS
NewCars     42466
UsedCars    226
Population  768

Unusual Observations
Obs NewCars GasPrice Fit SE Fit Residual St Resid
29 94 84.2 9.9 2.76 24.43 2.48R
3 97 79.77 9.3.62 2.64 2.R
32 3 76. 6. 3.89 9.9 2.R
46 4 7.87 92.3 2.49-2.44-2.6R
2 34 23.9 98.36 4.24 2.4 2.7R
R denotes an observation with a large standardized residual.

Durbin-Watson statistic =.4399

With no critical thought, this looks great!
But... here are facts about the residuals. This was obtained through Stat > Regression > Regression > Graphs > Four in one.

[Figure: Residual Plots for GasPrice (four-in-one display): Normal Probability Plot, Versus Fits, Histogram, Versus Order]

The plot in sequence order is a clear indication that the residuals have some type of time-based dependence. Moreover, the Durbin-Watson statistic is very small; values near 2 are consistent with uncorrelated residuals, while values well below 2 point to positive autocorrelation.

As a side note, we'll record the residuals and then get the autocorrelation function plot. You'll get the residuals through Stat > Regression > Regression > Storage > Residuals. You can get the autocorrelation function of the residuals through Stat > Time Series > Autocorrelation.
Here's what that plot looks like:

[Figure: Autocorrelation Function for RESI (with 5% significance limits for the autocorrelations); autocorrelation plotted against lag]

And here are the autocorrelations:

Lag  ACF      T     LBQ
1    .7287    .7    27.2
2    .3998    2.    3.78
3    .2994    .39   4.67
4    .6749    .74   42.2
5    -.79     -.32  42.49
6    -.8789   -.84  44.6
7    -.79286  -.8   46.6
8    -.96     -.87  49.6
9    -.24262  -.6   2.9
10   -.26948  -.2   7.49

This is a common situation. We note that the first autocorrelation is large (.7) and statistically significant. The T is an ordinary t statistic. Values bigger than 2 or less than -2 indicate statistical significance. The LBQ refers to the Ljung-Box Q statistic, which tests the null hypothesis that the autocorrelations for all lags up to lag k are equal to zero. If you really wish to do that test, you can get the details from Minitab's Help.

So... there is a problem that has to be corrected.
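The quantities discussed above (the Durbin-Watson statistic, the residual autocorrelations, their t statistics, and the Ljung-Box Q) can be sketched in a few lines of Python. This is a minimal illustration, not Minitab's code: the residuals are made-up numbers, the function names are hypothetical, and the t statistic assumes the usual large-sample (Bartlett) standard error, which may differ in detail from what a given package reports.

```python
import math

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

def acf(resid, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [e - mean for e in resid]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def acf_t_stats(r, n):
    """t_k = r_k / SE(r_k), with SE(r_k) = sqrt((1 + 2 * sum_{j<k} r_j^2) / n).
    Assumption: Bartlett's large-sample standard error."""
    stats, cum = [], 0.0
    for rk in r:
        stats.append(rk / math.sqrt((1 + 2 * cum) / n))
        cum += rk * rk
    return stats

def ljung_box(r, n):
    """Q_k = n (n + 2) * sum_{j<=k} r_j^2 / (n - j), for k = 1, 2, ..."""
    q, s = [], 0.0
    for k, rk in enumerate(r, start=1):
        s += rk * rk / (n - k)
        q.append(n * (n + 2) * s)
    return q

# Made-up, smoothly varying residuals (NOT the GasPrice residuals).
resid = [1.0, 2.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -2.0, -1.0]
n = len(resid)

dw = durbin_watson(resid)
r = acf(resid, 3)
t = acf_t_stats(r, n)
q = ljung_box(r, n)

print("Durbin-Watson:", round(dw, 4))
for lag, (rk, tk, qk) in enumerate(zip(r, t, q), start=1):
    print(lag, round(rk, 4), round(tk, 2), round(qk, 2))
```

For these smooth made-up residuals the Durbin-Watson value is far below 2 and the lag-1 autocorrelation is large and positive, the same qualitative pattern as in the Minitab output.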
Correction Attempt 1: Use time itself as a predictor.

Regression Analysis: GasPrice versus NewCars, UsedCars,...

The regression equation is
GasPrice = - 26 + .937 NewCars - .38 UsedCars - .7 Population + . Year

Predictor   Coef SE Coef T P VIF
Constant    -26 3434 -.63.3
NewCars     .9368.3988 2.3.23.7
UsedCars    -.383.264 -.44. 87.74
Population  -.699.6638 -..97 38.27
Year        .2.84.6.47 364.94

S =.28 R-Sq = 89.8% R-Sq(adj) = 88.9%

Analysis of Variance
Source      DF SS MS F P
Regression  4 43 87 2.9.
Error       47 4967 6
Total       48467

Source      DF Seq SS
NewCars     42466
UsedCars    226
Population  768
Year        39

Unusual Observations
Obs NewCars GasPrice Fit SE Fit Residual St Resid
29 94 84.2 9.86 2.8 24.6 2.44R
32 3 76. 7.67 4.69 8.33 2.R
2 34 23.9 96.89 4.9 27.2 2.99R
R denotes an observation with a large standardized residual.

Durbin-Watson statistic =.4668

This has failed. The Durbin-Watson statistic is still very small. Plots involving the residuals are bad also, but they are not shown here.

Correction Attempt 2: Use the differenced data. The dependent variable and all the independent variables should be differenced. In Minitab, use Stat > Time Series > Differences. This will reduce the sample size by 1.
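The differencing step can be sketched as follows. This is only an illustration under stated assumptions: the `difference` helper and the column values are hypothetical (made-up numbers, not the GasolineMarket data), and the first observation is simply dropped, which is why the sample size falls by 1.

```python
def difference(series):
    """First differences: d_t = y_t - y_(t-1). The result has one fewer value."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

# Made-up columns standing in for the dependent and independent variables.
columns = {
    "GasPrice":   [20.0, 21.0, 23.0, 26.0, 30.0],
    "NewCars":    [45.0, 46.0, 48.0, 49.0, 53.0],
    "UsedCars":   [25.0, 27.0, 28.0, 30.0, 35.0],
    "Population": [160.0, 163.0, 166.0, 169.0, 172.0],
}

# Difference the dependent variable AND every independent variable.
diffed = {name + "Diff": difference(values) for name, values in columns.items()}

for name, values in diffed.items():
    print(name, values)
```

The regression is then run on the differenced columns (here `GasPriceDiff` on `NewCarsDiff`, `UsedCarsDiff`, `PopulationDiff`), each of which has one fewer row than the original.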
The plots look much better.

[Figure: Residual Plots for GasPriceDiff (four-in-one display): Normal Probability Plot, Versus Fits, Histogram, Versus Order]

The Durbin-Watson statistic is .46699, which is at the low end of borderline values.

Correction Attempt 3: Use the lagged version of the dependent variable. In Minitab, use Stat > Time Series > Lag. Again, this will drop the sample size by 1.

Regression Analysis: GasPrice versus NewCars, UsedCars,...

The regression equation is
GasPrice = - 49.3 + .497 NewCars - .43 UsedCars + .27 Population + .89 GasPriceLag

51 cases used, 1 cases contain missing values

Predictor    Coef SE Coef T P VIF
Constant     -49.28 4. -3.2.
NewCars      .4966.2248 2.2.32 92.46
UsedCars     -.4297.38-2.79.8 82.2
Population   .2673.78 2.6..292
GasPriceLag  .88997.964 9.26..66

S = 6.2837 R-Sq = 96.3% R-Sq(adj) = 96.%
Analysis of Variance
Source      DF SS MS F P
Regression  4 43 378 32.96.
Error       46 728 38
Total       4724

Source       DF Seq SS
NewCars      422
UsedCars     28
Population   82
GasPriceLag  329

Unusual Observations
Obs NewCars GasPrice Fit SE Fit Residual St Resid
28 88 7.9 63.34 2.723 2.64 2.22R
34 6.7 76.838.72-6.663-2.8R
46 4 7.874 86.463.628-4.89-2.47R
48 4. 8.88 2.38 8.92 3.29R
R denotes an observation with a large standardized residual.

Durbin-Watson statistic =.6793

This is not perfect either, but the Durbin-Watson statistic has crossed, just barely, into the zone at which we can accept ρ = 0. Here are the relevant plots:

[Figure: Residual Plots for GasPrice (four-in-one display): Normal Probability Plot, Versus Fits, Histogram, Versus Order]

This is tough to live with, but we could do it.
Note that the coefficient on GasPriceLag is .89, close to 1. The fitted equation was

GasPrice = - 49.3 + .497 NewCars - .43 UsedCars + .27 Population + .89 GasPriceLag

This can be rearranged as

GasPrice - .89 GasPriceLag = - 49.3 + .497 NewCars - .43 UsedCars + .27 Population

The left side is almost the same as GasPriceDiff. An objection to using the lagged variable on the right side of the equation is that we are mixing up dependent and independent variables.

Correction Attempt 4: Use the Cochrane-Orcutt method. There are several variations on this method. The essence of the concept is estimating the autocorrelation coefficient ρ and then computing $Y_i^* = Y_i - \hat{\rho}\, Y_{i-1}$ and doing the same thing for each independent variable.

There are several ways to get $\hat{\rho}$. Start with the initial regression of GasPrice on NewCars, UsedCars, and Population. In the printout from the autocorrelation, we found that the first autocorrelation was computed as .7287, and we can call this $\hat{\rho}$. Some like to use $\hat{\rho} = 1 - \frac{DW}{2}$; here this is 1 - .4399/2 = .7732. These are not all that far apart. Let's use $\hat{\rho}$ = .7. In Minitab, we will need to create the lagged variable, and then use Calc > Calculator to perform (original) - .7 (lagged).
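The transformation described above can be sketched in Python. This is a minimal illustration of one Cochrane-Orcutt step, not a full iterative implementation: the function names are hypothetical, the series is made up, and the value of ρ̂ used in the example is an arbitrary stand-in rather than the estimate from this data set.

```python
def rho_from_dw(dw):
    """Approximate first-order autocorrelation from Durbin-Watson: rho = 1 - DW/2."""
    return 1.0 - dw / 2.0

def quasi_difference(series, rho):
    """Cochrane-Orcutt adjustment: y*_i = y_i - rho * y_(i-1).
    The first observation has no predecessor, so the result is one value shorter."""
    return [series[i] - rho * series[i - 1] for i in range(1, len(series))]

# Made-up series and an arbitrary rho, for illustration only.
y = [10.0, 12.0, 15.0]
rho_hat = 0.5

y_adj = quasi_difference(y, rho_hat)   # apply the same step to every predictor too
y_diff = quasi_difference(y, 1.0)      # rho = 1 reduces to ordinary differencing

print("adjusted:   ", y_adj)
print("differenced:", y_diff)
print("rho implied by DW = 0.5:", rho_from_dw(0.5))
```

Setting ρ̂ = 1 reproduces plain differencing, which is why a lag coefficient close to 1 makes the lagged-variable regression behave much like the differenced one. A Durbin-Watson value of 2 corresponds to ρ̂ = 0, that is, no adjustment at all.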
Here are the results of the regression on the adjusted variables:

Regression Analysis: GasPriceAdj versus NewCarAdj, UsedCarAdj, PopnAdj

The regression equation is
GasPriceAdj = - 66.3 + .482 NewCarAdj - .76 UsedCarAdj + .2 PopnAdj

51 cases used, 1 cases contain missing values

Predictor   Coef SE Coef T P VIF
Constant    -66.3 3.6-4.86.
NewCarAdj   .482.832 2.63..22
UsedCarAdj  -.76.483 -.9.24 6.977
PopnAdj     .79.23.6. 6.36

S = 6.346 R-Sq = 7.7% R-Sq(adj) = 69.9%

Analysis of Variance
Source      DF SS MS F P
Regression  3 489.6 66. 39.78.
Error       47 897.9 4.4
Total       677.

Source      DF Seq SS
NewCarAdj   934.6
UsedCarAdj  282.2
PopnAdj     32.8

Unusual Observations
Obs NewCarAdj GasPriceAdj Fit SE Fit Residual St Resid
28 46.3 37.42 23.84 2.687 3.7 2.36R
46 34.9 4.69 29.248.3 -.79-2.46R
48 33.2 4.2 29.48.89.797 2.9R
2 33.9.293 3.98 3.3 4.33 2.9RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic =.27866

This has actually made the Durbin-Watson statistic a little worse (lower).
Here are the plots:

[Figure: Residual Plots for GasPriceAdj (four-in-one display): Normal Probability Plot, Versus Fits, Histogram, Versus Order]

This particular data set may have incurable issues. The series are all very smooth for the first (about) twenty years and then become irregular. The statistical word for this problem is non-stationarity.

By the way, the most commonly used correction is differencing, the second method illustrated here. It's simple and easy to understand. Using time as a predictor, the first method done here, seems promising, but it rarely works.