 The REG Procedure

## Polynomial Regression

Consider a response variable Y that can be predicted by a polynomial function of a regressor variable X. You can estimate $\beta_0$, the intercept; $\beta_1$, the slope due to X; and $\beta_2$, the slope due to $X^2$, in

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$$

for the observations i = 1, 2, ..., n.

Consider the following example on population growth trends. The population of the United States from 1790 to 1970 is fit to linear and quadratic functions of time. Note that the quadratic term, YearSq, is created in the DATA step; this is done since polynomial effects such as Year*Year cannot be specified in the MODEL statement in PROC REG. The data are as follows:

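    data USPopulation;
       input Population @@;          /* read several values from each data line */
       retain Year 1780;             /* keep Year across observations */
       Year=Year+10;                 /* first observation is 1790 */
       YearSq=Year*Year;             /* quadratic term for PROC REG */
       Population=Population/1000;   /* rescale (thousands to millions) */
    datalines;
    3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
    62947 75994 91972 105710 122775 131669 151325 179323 203211
    ;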

The following statements begin the analysis. (Influence diagnostics and autocorrelation information for the full model are shown in Figure 55.42 and Figure 55.55.)

    symbol1 c=blue;
    proc reg data=USPopulation;
       var YearSq;
       model Population=Year / r cli clm;
       plot r.*p. / cframe=ligr;
    run;

The DATA= option ensures that the procedure uses the intended data set. Any variable that you might add to the model but that is not included in the first MODEL statement must appear in the VAR statement. In the MODEL statement, three options are specified: R requests a residual analysis, CLI requests 95% confidence limits for an individual value, and CLM requests 95% confidence limits for the expected value of the dependent variable. You can request specific $100(1-\alpha)\%$ limits with the ALPHA= option in the PROC REG or MODEL statement. A plot of the residuals against the predicted values is requested by the PLOT statement.
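As an illustration of the ALPHA= option, the following sketch requests 90% rather than 95% limits for the same linear model. The value 0.10 is chosen only for illustration and is not part of the original example; the PLOT statement is omitted for brevity.

```sas
/* Sketch: same linear model, but with 90% limits (alpha=0.10 is illustrative) */
proc reg data=USPopulation;
   model Population=Year / r cli clm alpha=0.10;
run;
quit;
```

Specifying ALPHA= in the MODEL statement affects only that model, whereas specifying it in the PROC REG statement sets the default for the whole step.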

The ANOVA table is displayed in Figure 55.4.

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Population

                            Analysis of Variance

                                 Sum of          Mean
    Source             DF       Squares        Square    F Value    Pr > F
    Model               1         66336         66336     201.87    <.0001
    Error              17    5586.29253     328.60544
    Corrected Total    18         71923

    Root MSE           18.1275    R-Square    0.9223
    Dependent Mean     69.7675    Adj R-Sq    0.9178
    Coeff Var          25.9827

                            Parameter Estimates

                       Parameter      Standard
    Variable     DF     Estimate         Error    t Value    Pr > |t|
    Intercept     1  -1958.36630     142.80455     -13.71      <.0001
    Year          1      1.07879       0.07593      14.21      <.0001
Figure 55.4: ANOVA Table and Parameter Estimates

The Model F statistic is significant (F=201.873, p<0.0001), indicating that the model accounts for a significant portion of variation in the data. The R-Square indicates that the model accounts for 92% of the variation in population growth. The fitted equation for this model is

Population = -1958.37 + 1.08 × Year
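The summary statistics quoted above can be cross-checked from the ANOVA entries. The following is a sketch using the standard definitions, with n = 19 observations and p = 2 estimated parameters; because the displayed sums of squares are rounded, the results agree with the listing only approximately.

```latex
\begin{align*}
F &= \frac{\mathrm{MS}_{\mathrm{Model}}}{\mathrm{MS}_{\mathrm{Error}}}
   = \frac{66336}{328.60544} \approx 201.87 \\
R^2 &= \frac{\mathrm{SS}_{\mathrm{Model}}}{\mathrm{SS}_{\mathrm{Total}}}
   = \frac{66336}{71923} \approx 0.9223 \\
R^2_{\mathrm{adj}} &= 1 - \frac{\mathrm{SS}_{\mathrm{Error}}/(n-p)}{\mathrm{SS}_{\mathrm{Total}}/(n-1)}
   = 1 - \frac{5586.29253/17}{71923/18} \approx 0.9178
\end{align*}
```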

Figure 55.5 shows the confidence limits for both individual and expected values resulting from the CLM and CLI options.

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Population

                                                        Output Statistics

            Dep Var  Predicted     Std Error                                                        Std Error   Student
    Obs  Population      Value  Mean Predict      95% CL Mean          95% CL Predict     Residual   Residual  Residual    -2-1 0 1 2     Cook's D
      1      3.9290   -27.3240        7.9995   -44.2015   -10.4466   -69.1281    14.4800   31.2530     16.267     1.921  |      |***   |     0.446
      2      5.3080   -16.5361        7.3615   -32.0674    -1.0048   -57.8150    24.7428   21.8441     16.565     1.319  |      |**    |     0.172
      3      7.2390    -5.7481        6.7486   -19.9864     8.4901   -46.5582    35.0619   12.9871     16.824     0.772  |      |*     |     0.048
      4      9.6380     5.0398        6.1684    -7.9744    18.0540   -35.3594    45.4390    4.5982     17.046     0.270  |      |      |     0.005
      5     12.8660    15.8277        5.6309     3.9475    27.7080   -24.2206    55.8761   -2.9617     17.231    -0.172  |      |      |     0.002
      6     17.0690    26.6157        5.1497    15.7509    37.4805   -13.1432    66.3746   -9.5467     17.381    -0.549  |     *|      |     0.013
      7     23.1910    37.4036        4.7417    27.3996    47.4077    -2.1288    76.9360  -14.2126     17.496    -0.812  |     *|      |     0.024
      8     31.4430    48.1916        4.4273    38.8508    57.5324     8.8218    87.5614  -16.7486     17.579    -0.953  |     *|      |     0.029
      9     39.8180    58.9795        4.2275    50.0603    67.8987    19.7076    98.2514  -19.1615     17.628    -1.087  |    **|      |     0.034
     10     50.1550    69.7675        4.1587    60.9933    78.5416    30.5283   109.0067  -19.6125     17.644    -1.112  |    **|      |     0.034
     11     62.9470    80.5554        4.2275    71.6362    89.4746    41.2835   119.8273  -17.6084     17.628    -0.999  |     *|      |     0.029
     12     75.9940    91.3434        4.4273    82.0026   100.6842    51.9736   130.7131  -15.3494     17.579    -0.873  |     *|      |     0.024
     13     91.9720   102.1313        4.7417    92.1272   112.1354    62.5989   141.6637  -10.1593     17.496    -0.581  |     *|      |     0.012
     14    105.7100   112.9193        5.1497   102.0544   123.7841    73.1603   152.6782   -7.2093     17.381    -0.415  |      |      |     0.008
     15    122.7750   123.7072        5.6309   111.8269   135.5875    83.6589   163.7555   -0.9322     17.231   -0.0541  |      |      |     0.000
     16    131.6690   134.4951        6.1684   121.4810   147.5093    94.0959   174.8944   -2.8261     17.046    -0.166  |      |      |     0.002
     17    151.3250   145.2831        6.7486   131.0448   159.5214   104.4731   186.0931    6.0419     16.824     0.359  |      |      |     0.010
     18    179.3230   156.0710        7.3615   140.5397   171.6024   114.7921   197.3500   23.2520     16.565     1.404  |      |**    |     0.195
     19    203.2110   166.8590        7.9995   149.9816   183.7364   125.0550   208.6630   36.3520     16.267     2.235  |      |****  |     0.604

    Sum of Residuals                       0
    Sum of Squared Residuals         5586.29
    Predicted Residual SS (PRESS)     7619.9
Figure 55.5: Confidence Limits

The observed dependent variable is displayed for each observation along with its predicted value from the regression equation and the standard error of the mean predicted value. The 95% CL Mean columns are the confidence limits for the expected value of each observation. The 95% CL Predict columns are the confidence limits for the individual observations.
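The relationship between the two sets of limits can be illustrated with observation 10 of the linear model. This is a sketch that assumes the usual regression formulas with 17 error degrees of freedom and the tabulated critical value $t_{0.975,17} \approx 2.1098$; neither is stated explicitly in this section. Here $\hat{y}_i$ is the predicted value and $s(\hat{y}_i)$ is the standard error of the mean predicted value.

```latex
\begin{align*}
\text{CL Mean:}\quad
  & \hat{y}_i \pm t_{0.975,17}\, s(\hat{y}_i)
    = 69.7675 \pm 2.1098 \times 4.1587
    \approx (60.99,\ 78.54) \\
\text{CL Predict:}\quad
  & \hat{y}_i \pm t_{0.975,17}\, \sqrt{\mathrm{MSE} + s(\hat{y}_i)^2}
    = 69.7675 \pm 2.1098 \times \sqrt{328.60544 + 4.1587^2}
    \approx (30.53,\ 109.01)
\end{align*}
```

The prediction limits are wider because they include the error variance of a single new observation in addition to the uncertainty in the estimated mean.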

Figure 55.5 also displays the residual analysis requested by the R option.

The residual, its standard error, and the studentized residual are displayed for each observation. The studentized residual is the residual divided by its standard error. The magnitude of each studentized residual is shown in a plot. Studentized residuals approximately follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard errors from zero. Many observations having absolute studentized residuals greater than 2 may indicate an inadequate model. The wave pattern seen in this plot is also an indication that the model is inadequate; a quadratic term may be needed or autocorrelation may be present in the data. Cook's D is a measure of the change in the predicted values upon deletion of that observation from the data set; hence, it measures the influence of the observation on the estimated regression coefficients. A fairly close agreement between the PRESS statistic (see Table 55.5) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner, 1990).
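The residual diagnostics can be reproduced approximately from the other columns of the listing. The sketch below uses the standard definitions, with $e_i$ the residual, $s(\hat{y}_i)$ the standard error of the mean predicted value, $h_i = s(\hat{y}_i)^2/\mathrm{MSE}$ the leverage, and p = 2 parameters; the worked numbers are for observation 19 of the linear model.

```latex
\begin{align*}
s(e_i) &= \sqrt{\mathrm{MSE} - s(\hat{y}_i)^2}
        = \sqrt{328.60544 - 7.9995^2} \approx 16.267 \\
r_i &= \frac{e_i}{s(e_i)} = \frac{36.3520}{16.267} \approx 2.235 \\
D_i &= \frac{r_i^2}{p} \cdot \frac{s(\hat{y}_i)^2}{s(e_i)^2}
     = \frac{2.235^2}{2} \cdot \frac{63.992}{264.613} \approx 0.604 \\
\mathrm{PRESS} &= \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2
\end{align*}
```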

A plot of the residuals versus predicted values is shown in Figure 55.6.

Figure 55.6: Plot of Residual vs. Predicted Values

The wave pattern of the studentized residual plot is seen here again. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.

Using the interactive feature of PROC REG, the following commands add the variable YearSq to the independent variables and refit the model.

       add YearSq;
       print;
       plot / cframe=ligr;
    run;

The ADD statement requests that YearSq be added to the model, and the PRINT command displays the ANOVA table for the new model. The PLOT statement with no variables recreates the most recent plot requested, in this case a plot of residual versus predicted values.
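If the interactive refit is not needed, the same quadratic model can presumably be obtained by naming both regressors in a single MODEL statement. This is a sketch rather than part of the original example; fitted this way, the output would be labeled MODEL1 instead of MODEL1.1.

```sas
/* Sketch: fit the quadratic model directly instead of using ADD */
proc reg data=USPopulation;
   model Population=Year YearSq / r cli clm;
run;
quit;
```

The interactive ADD approach has the advantage that the linear and quadratic fits are produced in one PROC REG step, so the two sets of output can be compared side by side.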

Figure 55.7 displays the ANOVA table and estimates for the new model.

    The REG Procedure
    Model: MODEL1.1
    Dependent Variable: Population

                            Analysis of Variance

                                 Sum of          Mean
    Source             DF       Squares        Square    F Value    Pr > F
    Model               2         71799         35900    4641.72    <.0001
    Error              16     123.74557       7.73410
    Corrected Total    18         71923

    Root MSE           2.78102    R-Square    0.9983
    Dependent Mean     69.7675    Adj R-Sq    0.9981
    Coeff Var          3.98613

                            Parameter Estimates

                       Parameter      Standard
    Variable     DF     Estimate         Error    t Value    Pr > |t|
    Intercept     1        20450     843.47533      24.25      <.0001
    Year          1    -22.78061       0.89785     -25.37      <.0001
    YearSq        1      0.00635    0.00023877      26.58      <.0001
Figure 55.7: ANOVA Table and Parameter Estimates

The overall F statistic is still significant (F=4641.719, p<0.0001). The R-square has increased from 0.9223 to 0.9983, indicating that the model now accounts for 99.8% of the variation in Population. All effects are significant with p<0.0001 for each effect in the model.
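The contribution of the quadratic term can also be read from the two ANOVA tables. As a sketch, the drop in the error sum of squares when YearSq is added, divided by the new error mean square, gives a one-degree-of-freedom F statistic that should equal the square of the t value for YearSq, up to rounding in the displayed values.

```latex
\begin{align*}
\Delta \mathrm{SS} &= 5586.29253 - 123.74557 = 5462.54696 \\
F_{\mathrm{YearSq}} &= \frac{\Delta \mathrm{SS} / 1}{\mathrm{MSE}_{\mathrm{quadratic}}}
   = \frac{5462.54696}{7.73410} \approx 706.3 \\
t_{\mathrm{YearSq}}^2 &= 26.58^2 \approx 706.5
\end{align*}
```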

The fitted equation is now

Population = 20450 - 22.781 × Year + 0.00635 × YearSq
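A caution when using this equation by hand: the printed coefficients are rounded, and because YearSq is of order $10^6$, the rounding has a large effect on computed predictions. The following sketch evaluates the equation at Year = 1970 using the parameter estimates as printed for MODEL1.1 (20450, -22.78061, 0.00635).

```latex
\begin{align*}
\widehat{\mathrm{Population}}_{1970}
  &\approx 20450 - 22.78061 \times 1970 + 0.00635 \times 1970^2 \\
  &= 20450 - 44877.80 + 24643.72 \approx 215.9
\end{align*}
```

The predicted value that PROC REG reports for 1970, computed from the unrounded estimates, is 199.2215. Rounding the YearSq coefficient further, to 0.006, would shift the result by about $0.00035 \times 1970^2 \approx 1358$, so predictions are better taken from the procedure's output than from the rounded equation.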

The confidence limits and residual analysis for the second model are displayed in Figure 55.8.

    The REG Procedure
    Model: MODEL1.1
    Dependent Variable: Population

                                                        Output Statistics

            Dep Var  Predicted     Std Error                                                        Std Error   Student
    Obs  Population      Value  Mean Predict      95% CL Mean          95% CL Predict     Residual   Residual  Residual    -2-1 0 1 2     Cook's D
      1      3.9290     5.0384        1.7289     1.3734     8.7035    -1.9034    11.9803   -1.1094      2.178    -0.509  |     *|      |     0.054
      2      5.3080     5.0389        1.3909     2.0904     7.9874    -1.5528    11.6306    0.2691      2.408     0.112  |      |      |     0.001
      3      7.2390     6.3085        1.1304     3.9122     8.7047    -0.0554    12.6724    0.9305      2.541     0.366  |      |      |     0.009
      4      9.6380     8.8472        0.9571     6.8182    10.8761     2.6123    15.0820    0.7908      2.611     0.303  |      |      |     0.004
      5     12.8660    12.6550        0.8721    10.8062    14.5037     6.4764    18.8335    0.2110      2.641    0.0799  |      |      |     0.000
      6     17.0690    17.7319        0.8578    15.9133    19.5504    11.5623    23.9015   -0.6629      2.645    -0.251  |      |      |     0.002
      7     23.1910    24.0779        0.8835    22.2049    25.9509    17.8920    30.2638   -0.8869      2.637    -0.336  |      |      |     0.004
      8     31.4430    31.6931        0.9202    29.7424    33.6437    25.4832    37.9029   -0.2501      2.624   -0.0953  |      |      |     0.000
      9     39.8180    40.5773        0.9487    38.5661    42.5885    34.3482    46.8065   -0.7593      2.614    -0.290  |      |      |     0.004
     10     50.1550    50.7307        0.9592    48.6972    52.7642    44.4944    56.9671   -0.5757      2.610    -0.221  |      |      |     0.002
     11     62.9470    62.1532        0.9487    60.1420    64.1644    55.9241    68.3823    0.7938      2.614     0.304  |      |      |     0.004
     12     75.9940    74.8448        0.9202    72.8942    76.7955    68.6350    81.0547    1.1492      2.624     0.438  |      |      |     0.008
     13     91.9720    88.8056        0.8835    86.9326    90.6785    82.6197    94.9915    3.1664      2.637     1.201  |      |**    |     0.054
     14    105.7100   104.0354        0.8578   102.2169   105.8540    97.8658   110.2051    1.6746      2.645     0.633  |      |*     |     0.014
     15    122.7750   120.5344        0.8721   118.6857   122.3831   114.3558   126.7130    2.2406      2.641     0.848  |      |*     |     0.026
     16    131.6690   138.3025        0.9571   136.2735   140.3315   132.0676   144.5374   -6.6335      2.611    -2.540  | *****|      |     0.289
     17    151.3250   157.3397        1.1304   154.9434   159.7360   150.9758   163.7036   -6.0147      2.541    -2.367  |  ****|      |     0.370
     18    179.3230   177.6460        1.3909   174.6975   180.5945   171.0543   184.2377    1.6770      2.408     0.696  |      |*     |     0.054
     19    203.2110   199.2215        1.7289   195.5564   202.8865   192.2796   206.1633    3.9895      2.178     1.831  |      |***   |     0.704

    Sum of Residuals                 -5.8175e-11
    Sum of Squared Residuals             123.746
    Predicted Residual SS (PRESS)        188.549
Figure 55.8: Confidence Limits and Residual Analysis

The plot of the studentized residuals shows that the wave structure is gone. The PRESS statistic is much closer to the Sum of Squared Residuals now (188.549 versus 123.746, compared with 7619.9 versus 5586.29 for the linear model), and both statistics have been dramatically reduced. Most of the Cook's D statistics have also been reduced.

Figure 55.9: Plot of Residual vs. Predicted Values

The plot of residuals versus predicted values seen in Figure 55.9 has improved since a major trend is no longer visible.

To create a plot of the observed values, predicted values, and confidence limits against Year all on the same plot and to exert some control over the look of the resulting plot, you can submit the following statements.

       symbol1 v=dot     c=yellow h=.3;
       symbol2 v=square  c=red;
       symbol3 f=simplex c=blue  h=2 v='-';
       symbol4 f=simplex c=blue  h=2 v='-';
       plot (Population predicted. u95. l95.)*Year
            / overlay cframe=ligr;
    run;


Figure 55.10: Plot of Population vs Year with Confidence Limits

The SYMBOL statements request that the actual data be displayed as dots, the predicted values as squares, and the upper and lower 95% confidence limits for an individual value (sometimes called a prediction interval) as dashes. PROC REG provides the shorthand options CONF and PRED to request confidence and prediction intervals for simple regression models; see the "PLOT Statement" section for details.
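For the simple linear model, the CONF and PRED options might be used as follows. This is a minimal sketch, not part of the original example: it assumes a fresh PROC REG step with Year as the only regressor and omits the SYMBOL statements.

```sas
/* Sketch: confidence and prediction bands for the simple linear fit */
proc reg data=USPopulation;
   model Population=Year;
   plot Population*Year / conf pred;
run;
quit;
```

Because these options apply to simple regression models, the quadratic fit still needs the explicit overlay of predicted., u95., and l95. shown earlier.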

To complete an analysis of these data, you may want to examine influence statistics and, since the data are essentially time series data, examine the Durbin-Watson statistic. You might also want to examine other residual plots, such as the residuals vs. regressors.
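A sketch of how these follow-up diagnostics might be requested for the quadratic model is given below. It assumes a new PROC REG step and uses the INFLUENCE and DW options of the MODEL statement together with two residual-versus-regressor plot requests; the particular combination is illustrative.

```sas
/* Sketch: influence statistics, Durbin-Watson, and residuals vs. regressors */
proc reg data=USPopulation;
   model Population=Year YearSq / influence dw;
   plot r.*Year r.*YearSq;
run;
quit;
```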
