The REG Procedure

## Example 55.1: Aerobic Fitness Prediction

Aerobic fitness (measured by the ability to consume oxygen) is regressed on the results of some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen consumption measurements. Three model-selection methods are used: forward selection, backward selection, and MAXR selection. The following statements produce Output 55.1.1 through Output 55.1.3. (Collinearity diagnostics for the full model are shown in Figure 55.41.)

```
*-------------------Data on Physical Fitness-------------------*
| These measurements were made on men involved in a physical   |
| fitness course at N.C.State Univ. The variables are Age      |
| (years), Weight (kg), Oxygen intake rate (ml per kg body     |
| weight per minute), time to run 1.5 miles (minutes), heart   |
| rate while resting, heart rate while running (same time      |
| Oxygen rate measured), and maximum heart rate recorded while |
| running.                                                     |
| ***Certain values of MaxPulse were changed for this analysis.|
*--------------------------------------------------------------*;
data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;

proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=maxr;
run;
```

The FORWARD model-selection method begins with no variables in the model and adds RunTime, then Age,...

Output 55.1.1: Forward Selection Method: PROC REG

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 1 | 632.90010 | 632.90010 | 84.01 | <.0001 |
| Error | 29 | 218.48144 | 7.53384 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 82.42177 | 3.85530 | 3443.36654 | 457.05 | <.0001 |
| RunTime | -3.31056 | 0.36119 | 632.90010 | 84.01 | <.0001 |

Bounds on condition number: 1, 1

Forward Selection: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 2 | 650.66573 | 325.33287 | 45.38 | <.0001 |
| Error | 28 | 200.71581 | 7.16842 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 88.46229 | 5.37264 | 1943.41071 | 271.11 | <.0001 |
| Age | -0.15037 | 0.09551 | 17.76563 | 2.48 | 0.1267 |
| RunTime | -3.20395 | 0.35877 | 571.67751 | 79.75 | <.0001 |

Bounds on condition number: 1.0369, 4.1478

...then RunPulse, then MaxPulse,...

Forward Selection: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 3 | 690.55086 | 230.18362 | 38.64 | <.0001 |
| Error | 27 | 160.83069 | 5.95669 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 111.71806 | 10.23509 | 709.69014 | 119.14 | <.0001 |
| Age | -0.25640 | 0.09623 | 42.28867 | 7.10 | 0.0129 |
| RunTime | -2.82538 | 0.35828 | 370.43529 | 62.19 | <.0001 |
| RunPulse | -0.13091 | 0.05059 | 39.88512 | 6.70 | 0.0154 |

Bounds on condition number: 1.3548, 11.597

Forward Selection: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 4 | 712.45153 | 178.11288 | 33.33 | <.0001 |
| Error | 26 | 138.93002 | 5.34346 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 98.14789 | 11.78569 | 370.57373 | 69.35 | <.0001 |
| Age | -0.19773 | 0.09564 | 22.84231 | 4.27 | 0.0488 |
| RunTime | -2.76758 | 0.34054 | 352.93570 | 66.05 | <.0001 |
| RunPulse | -0.34811 | 0.11750 | 46.90089 | 8.78 | 0.0064 |
| MaxPulse | 0.27051 | 0.13362 | 21.90067 | 4.10 | 0.0533 |

Bounds on condition number: 8.4182, 76.851

...and finally, Weight. The final variable available to add to the model, RestPulse, is not added because it does not meet the 50% significance-level criterion for entry into the model (the default value of the SLE= option is 0.5 for FORWARD selection).

Forward Selection: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 5 | 721.97309 | 144.39462 | 27.90 | <.0001 |
| Error | 25 | 129.40845 | 5.17634 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 102.20428 | 11.97929 | 376.78935 | 72.79 | <.0001 |
| Age | -0.21962 | 0.09550 | 27.37429 | 5.29 | 0.0301 |
| Weight | -0.07230 | 0.05331 | 9.52157 | 1.84 | 0.1871 |
| RunTime | -2.68252 | 0.34099 | 320.35968 | 61.89 | <.0001 |
| RunPulse | -0.37340 | 0.11714 | 52.59624 | 10.16 | 0.0038 |
| MaxPulse | 0.30491 | 0.13394 | 26.82640 | 5.18 | 0.0316 |

Bounds on condition number: 8.7312, 104.83

No other variable met the 0.5000 significance level for entry into the model.

Summary of Forward Selection

| Step | Variable Entered | Number Vars In | Partial R-Square | Model R-Square | C(p) | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | RunTime | 1 | 0.7434 | 0.7434 | 13.6988 | 84.01 | <.0001 |
| 2 | Age | 2 | 0.0209 | 0.7642 | 12.3894 | 2.48 | 0.1267 |
| 3 | RunPulse | 3 | 0.0468 | 0.8111 | 6.9596 | 6.70 | 0.0154 |
| 4 | MaxPulse | 4 | 0.0257 | 0.8368 | 4.8800 | 4.10 | 0.0533 |
| 5 | Weight | 5 | 0.0112 | 0.8480 | 5.1063 | 1.84 | 0.1871 |
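The 0.5 entry threshold is lenient. As a minimal sketch, the SLENTRY= option (of which SLE= is the short form) can tighten it; the value 0.10 below is an arbitrary choice for illustration, not a recommendation made by this example.

```sas
proc reg data=fitness;
   /* Same candidate regressors as above.                        */
   /* slentry=0.10 is an illustrative threshold, not a default.  */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward slentry=0.10;
run;
quit;
```

Judging only from the summary above, Age entered at step 2 with a p-value of 0.1267, so under this stricter threshold the selection would be expected to stop with RunTime alone.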

The BACKWARD model-selection method begins with the full model.

Output 55.1.2: Backward Selection Method: PROC REG

The REG Procedure

Model: MODEL2

Dependent Variable: Oxygen

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 6 | 722.54361 | 120.42393 | 22.43 | <.0001 |
| Error | 24 | 128.83794 | 5.36825 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 102.93448 | 12.40326 | 369.72831 | 68.87 | <.0001 |
| Age | -0.22697 | 0.09984 | 27.74577 | 5.17 | 0.0322 |
| Weight | -0.07418 | 0.05459 | 9.91059 | 1.85 | 0.1869 |
| RunTime | -2.62865 | 0.38456 | 250.82210 | 46.72 | <.0001 |
| RunPulse | -0.36963 | 0.11985 | 51.05806 | 9.51 | 0.0051 |
| RestPulse | -0.02153 | 0.06605 | 0.57051 | 0.11 | 0.7473 |
| MaxPulse | 0.30322 | 0.13650 | 26.49142 | 4.93 | 0.0360 |

Bounds on condition number: 8.7438, 137.13

RestPulse is the first variable deleted,...

Backward Elimination: Step 1

Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 5 | 721.97309 | 144.39462 | 27.90 | <.0001 |
| Error | 25 | 129.40845 | 5.17634 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 102.20428 | 11.97929 | 376.78935 | 72.79 | <.0001 |
| Age | -0.21962 | 0.09550 | 27.37429 | 5.29 | 0.0301 |
| Weight | -0.07230 | 0.05331 | 9.52157 | 1.84 | 0.1871 |
| RunTime | -2.68252 | 0.34099 | 320.35968 | 61.89 | <.0001 |
| RunPulse | -0.37340 | 0.11714 | 52.59624 | 10.16 | 0.0038 |
| MaxPulse | 0.30491 | 0.13394 | 26.82640 | 5.18 | 0.0316 |

Bounds on condition number: 8.7312, 104.83

...followed by Weight. No other variables are deleted from the model because the remaining variables (Age, RunTime, RunPulse, and MaxPulse) are all significant at the 10% significance level (the default value of the SLS= option is 0.1 for the BACKWARD elimination method).

Backward Elimination: Step 2

Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 4 | 712.45153 | 178.11288 | 33.33 | <.0001 |
| Error | 26 | 138.93002 | 5.34346 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 98.14789 | 11.78569 | 370.57373 | 69.35 | <.0001 |
| Age | -0.19773 | 0.09564 | 22.84231 | 4.27 | 0.0488 |
| RunTime | -2.76758 | 0.34054 | 352.93570 | 66.05 | <.0001 |
| RunPulse | -0.34811 | 0.11750 | 46.90089 | 8.78 | 0.0064 |
| MaxPulse | 0.27051 | 0.13362 | 21.90067 | 4.10 | 0.0533 |

Bounds on condition number: 8.4182, 76.851

All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination

| Step | Variable Removed | Number Vars In | Partial R-Square | Model R-Square | C(p) | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | RestPulse | 5 | 0.0007 | 0.8480 | 5.1063 | 0.11 | 0.7473 |
| 2 | Weight | 4 | 0.0112 | 0.8368 | 4.8800 | 1.84 | 0.1871 |
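The stay criterion can be adjusted in the same spirit with the SLSTAY= option (SLS= is its short form). The following is a sketch only; the 0.05 threshold is a hypothetical value chosen to show the effect.

```sas
proc reg data=fitness;
   /* Backward elimination with a stricter, illustrative stay threshold. */
   /* The default for BACKWARD is 0.10; 0.05 is an assumption here.      */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward slstay=0.05;
run;
quit;
```

Reading the p-values already printed above, MaxPulse (0.0533) would be expected to be removed in a third step. That would leave Age, RunTime, and RunPulse, which are all significant at the 5% level in the three-variable fit shown in Output 55.1.1.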

The MAXR method tries to find the "best" one-variable model, the "best" two-variable model, and so on. For the fitness data, the one-variable model contains RunTime; the two-variable model contains RunTime and Age...

Output 55.1.3: Maximum R-Square Improvement Selection Method: PROC REG

The REG Procedure

Model: MODEL3

Dependent Variable: Oxygen

Maximum R-Square Improvement: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 1 | 632.90010 | 632.90010 | 84.01 | <.0001 |
| Error | 29 | 218.48144 | 7.53384 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 82.42177 | 3.85530 | 3443.36654 | 457.05 | <.0001 |
| RunTime | -3.31056 | 0.36119 | 632.90010 | 84.01 | <.0001 |

Bounds on condition number: 1, 1

The above model is the best 1-variable model found.

Maximum R-Square Improvement: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 2 | 650.66573 | 325.33287 | 45.38 | <.0001 |
| Error | 28 | 200.71581 | 7.16842 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 88.46229 | 5.37264 | 1943.41071 | 271.11 | <.0001 |
| Age | -0.15037 | 0.09551 | 17.76563 | 2.48 | 0.1267 |
| RunTime | -3.20395 | 0.35877 | 571.67751 | 79.75 | <.0001 |

Bounds on condition number: 1.0369, 4.1478

The above model is the best 2-variable model found.

...the three-variable model contains RunTime, Age, and RunPulse; the four-variable model contains Age, RunTime, RunPulse, and MaxPulse...

Maximum R-Square Improvement: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 3 | 690.55086 | 230.18362 | 38.64 | <.0001 |
| Error | 27 | 160.83069 | 5.95669 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 111.71806 | 10.23509 | 709.69014 | 119.14 | <.0001 |
| Age | -0.25640 | 0.09623 | 42.28867 | 7.10 | 0.0129 |
| RunTime | -2.82538 | 0.35828 | 370.43529 | 62.19 | <.0001 |
| RunPulse | -0.13091 | 0.05059 | 39.88512 | 6.70 | 0.0154 |

Bounds on condition number: 1.3548, 11.597

The above model is the best 3-variable model found.

Maximum R-Square Improvement: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 4 | 712.45153 | 178.11288 | 33.33 | <.0001 |
| Error | 26 | 138.93002 | 5.34346 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 98.14789 | 11.78569 | 370.57373 | 69.35 | <.0001 |
| Age | -0.19773 | 0.09564 | 22.84231 | 4.27 | 0.0488 |
| RunTime | -2.76758 | 0.34054 | 352.93570 | 66.05 | <.0001 |
| RunPulse | -0.34811 | 0.11750 | 46.90089 | 8.78 | 0.0064 |
| MaxPulse | 0.27051 | 0.13362 | 21.90067 | 4.10 | 0.0533 |

Bounds on condition number: 8.4182, 76.851

The above model is the best 4-variable model found.

...the five-variable model contains Age, Weight, RunTime, RunPulse, and MaxPulse; and finally, the six-variable model contains all the variables in the MODEL statement.

Maximum R-Square Improvement: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 5 | 721.97309 | 144.39462 | 27.90 | <.0001 |
| Error | 25 | 129.40845 | 5.17634 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 102.20428 | 11.97929 | 376.78935 | 72.79 | <.0001 |
| Age | -0.21962 | 0.09550 | 27.37429 | 5.29 | 0.0301 |
| Weight | -0.07230 | 0.05331 | 9.52157 | 1.84 | 0.1871 |
| RunTime | -2.68252 | 0.34099 | 320.35968 | 61.89 | <.0001 |
| RunPulse | -0.37340 | 0.11714 | 52.59624 | 10.16 | 0.0038 |
| MaxPulse | 0.30491 | 0.13394 | 26.82640 | 5.18 | 0.0316 |

Bounds on condition number: 8.7312, 104.83

The above model is the best 5-variable model found.

Maximum R-Square Improvement: Step 6

Variable RestPulse Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 6 | 722.54361 | 120.42393 | 22.43 | <.0001 |
| Error | 24 | 128.83794 | 5.36825 | | |
| Corrected Total | 30 | 851.38154 | | | |

| Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Intercept | 102.93448 | 12.40326 | 369.72831 | 68.87 | <.0001 |
| Age | -0.22697 | 0.09984 | 27.74577 | 5.17 | 0.0322 |
| Weight | -0.07418 | 0.05459 | 9.91059 | 1.85 | 0.1869 |
| RunTime | -2.62865 | 0.38456 | 250.82210 | 46.72 | <.0001 |
| RunPulse | -0.36963 | 0.11985 | 51.05806 | 9.51 | 0.0051 |
| RestPulse | -0.02153 | 0.06605 | 0.57051 | 0.11 | 0.7473 |
| MaxPulse | 0.30322 | 0.13650 | 26.49142 | 4.93 | 0.0360 |

Bounds on condition number: 8.7438, 137.13

The above model is the best 6-variable model found.

No further improvement in R-Square is possible.

Note that for all three of these methods, RestPulse contributes least to the model. In the case of forward selection, it is not added to the model. In the case of backward selection, it is the first variable to be removed from the model. In the case of MAXR selection, RestPulse is included only for the full model.

For the STEPWISE, BACKWARD, and FORWARD selection methods, you can control the amount of detail displayed by using the DETAILS= option. For example, the following statements display only the selection summary table for the FORWARD selection method.

```
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward details=summary;
run;
```

Output 55.1.4: Forward Selection Summary

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Summary of Forward Selection

| Step | Variable Entered | Number Vars In | Partial R-Square | Model R-Square | C(p) | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | RunTime | 1 | 0.7434 | 0.7434 | 13.6988 | 84.01 | <.0001 |
| 2 | Age | 2 | 0.0209 | 0.7642 | 12.3894 | 2.48 | 0.1267 |
| 3 | RunPulse | 3 | 0.0468 | 0.8111 | 6.9596 | 6.70 | 0.0154 |
| 4 | MaxPulse | 4 | 0.0257 | 0.8368 | 4.8800 | 4.10 | 0.0533 |
| 5 | Weight | 5 | 0.0112 | 0.8480 | 5.1063 | 1.84 | 0.1871 |
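The columns of the summary table can be checked by hand against the step-by-step tables in Output 55.1.1. The worked illustration below takes step 5 (Weight entered) and assumes the usual definitions. Here $n = 31$ is inferred from the corrected total DF of 30, $p = 6$ counts the five regressors plus the intercept, and $s^2 = 5.36825$ is the error mean square of the full six-variable model in Output 55.1.2.

$$
\begin{aligned}
\text{Partial } R^2 &= \frac{\text{Type II SS}_{\text{Weight}}}{\text{SS}_{\text{Corrected Total}}} = \frac{9.52157}{851.38154} \approx 0.0112 \\
\text{Model } R^2 &= \frac{\text{SS}_{\text{Model}}}{\text{SS}_{\text{Corrected Total}}} = \frac{721.97309}{851.38154} \approx 0.8480 \\
F &= \frac{\text{Type II SS}_{\text{Weight}}}{\text{MSE}} = \frac{9.52157}{5.17634} \approx 1.84 \\
C_p &= \frac{\text{SSE}_p}{s^2} - (n - 2p) = \frac{129.40845}{5.36825} - (31 - 12) \approx 24.1063 - 19 = 5.1063
\end{aligned}
$$

The partial R-square also equals the increase in model R-square over the previous step: 0.8480 - 0.8368 = 0.0112.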

Next, the RSQUARE model-selection method is used to request $R^2$ and $C_p$ statistics for all possible combinations of the six independent variables. The following statements produce Output 55.1.5.

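```
   /* PROC REG is interactive: this MODEL statement continues the */
   /* PROC REG step started by the preceding statements.          */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare cp;
   title 'Physical fitness data: all models';
run;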
```

Output 55.1.5: All Models by the RSQUARE Method: PROC REG

 Physical fitness data: all models

The REG Procedure

Model: MODEL2

Dependent Variable: Oxygen

R-Square Selection Method

| Number in Model | R-Square | C(p) | Variables in Model |
| --- | --- | --- | --- |
| 1 | 0.7434 | 13.6988 | RunTime |
| 1 | 0.1595 | 106.3021 | RestPulse |
| 1 | 0.1584 | 106.4769 | RunPulse |
| 1 | 0.0928 | 116.8818 | Age |
| 1 | 0.0560 | 122.7072 | MaxPulse |
| 1 | 0.0265 | 127.3948 | Weight |
| 2 | 0.7642 | 12.3894 | Age RunTime |
| 2 | 0.7614 | 12.8372 | RunTime RunPulse |
| 2 | 0.7452 | 15.4069 | RunTime MaxPulse |
| 2 | 0.7449 | 15.4523 | Weight RunTime |
| 2 | 0.7435 | 15.6746 | RunTime RestPulse |
| 2 | 0.3760 | 73.9645 | Age RunPulse |
| 2 | 0.3003 | 85.9742 | Age RestPulse |
| 2 | 0.2894 | 87.6951 | RunPulse MaxPulse |
| 2 | 0.2600 | 92.3638 | Age MaxPulse |
| 2 | 0.2350 | 96.3209 | RunPulse RestPulse |
| 2 | 0.1806 | 104.9523 | Weight RestPulse |
| 2 | 0.1740 | 105.9939 | RestPulse MaxPulse |
| 2 | 0.1669 | 107.1332 | Weight RunPulse |
| 2 | 0.1506 | 109.7057 | Age Weight |
| 2 | 0.0675 | 122.8881 | Weight MaxPulse |
| 3 | 0.8111 | 6.9596 | Age RunTime RunPulse |
| 3 | 0.8100 | 7.1350 | RunTime RunPulse MaxPulse |
| 3 | 0.7817 | 11.6167 | Age RunTime MaxPulse |
| 3 | 0.7708 | 13.3453 | Age Weight RunTime |
| 3 | 0.7673 | 13.8974 | Age RunTime RestPulse |
| 3 | 0.7619 | 14.7619 | RunTime RunPulse RestPulse |
| 3 | 0.7618 | 14.7729 | Weight RunTime RunPulse |
| 3 | 0.7462 | 17.2588 | Weight RunTime MaxPulse |
| 3 | 0.7452 | 17.4060 | RunTime RestPulse MaxPulse |
| 3 | 0.7451 | 17.4243 | Weight RunTime RestPulse |
| 3 | 0.4666 | 61.5873 | Age RunPulse RestPulse |
| 3 | 0.4223 | 68.6250 | Age RunPulse MaxPulse |
| 3 | 0.4091 | 70.7102 | Age Weight RunPulse |
| 3 | 0.3900 | 73.7424 | Age RestPulse MaxPulse |
| 3 | 0.3568 | 79.0013 | Age Weight RestPulse |
| 3 | 0.3538 | 79.4891 | RunPulse RestPulse MaxPulse |
| 3 | 0.3208 | 84.7216 | Weight RunPulse MaxPulse |
| 3 | 0.2902 | 89.5693 | Age Weight MaxPulse |
| 3 | 0.2447 | 96.7952 | Weight RunPulse RestPulse |
| 3 | 0.1882 | 105.7430 | Weight RestPulse MaxPulse |
| 4 | 0.8368 | 4.8800 | Age RunTime RunPulse MaxPulse |
| 4 | 0.8165 | 8.1035 | Age Weight RunTime RunPulse |
| 4 | 0.8158 | 8.2056 | Weight RunTime RunPulse MaxPulse |
| 4 | 0.8117 | 8.8683 | Age RunTime RunPulse RestPulse |
| 4 | 0.8104 | 9.0697 | RunTime RunPulse RestPulse MaxPulse |
| 4 | 0.7862 | 12.9039 | Age Weight RunTime MaxPulse |
| 4 | 0.7834 | 13.3468 | Age RunTime RestPulse MaxPulse |
| 4 | 0.7750 | 14.6788 | Age Weight RunTime RestPulse |
| 4 | 0.7623 | 16.7058 | Weight RunTime RunPulse RestPulse |
| 4 | 0.7462 | 19.2550 | Weight RunTime RestPulse MaxPulse |
| 4 | 0.5034 | 57.7590 | Age Weight RunPulse RestPulse |
| 4 | 0.5025 | 57.9092 | Age RunPulse RestPulse MaxPulse |
| 4 | 0.4717 | 62.7830 | Age Weight RunPulse MaxPulse |
| 4 | 0.4256 | 70.0963 | Age Weight RestPulse MaxPulse |
| 4 | 0.3858 | 76.4100 | Weight RunPulse RestPulse MaxPulse |
| 5 | 0.8480 | 5.1063 | Age Weight RunTime RunPulse MaxPulse |
| 5 | 0.8370 | 6.8461 | Age RunTime RunPulse RestPulse MaxPulse |
| 5 | 0.8176 | 9.9348 | Age Weight RunTime RunPulse RestPulse |
| 5 | 0.8161 | 10.1685 | Weight RunTime RunPulse RestPulse MaxPulse |
| 5 | 0.7887 | 14.5111 | Age Weight RunTime RestPulse MaxPulse |
| 5 | 0.5541 | 51.7233 | Age Weight RunPulse RestPulse MaxPulse |
| 6 | 0.8487 | 7.0000 | Age Weight RunTime RunPulse RestPulse MaxPulse |

The models in Output 55.1.5 are arranged first by the number of variables in the model and second by the magnitude of $R^2$ for the model. Before making a final decision about which model to use, you would want to perform collinearity diagnostics. Note that, because many different models have been fit and the choice of a final model is based on $R^2$, the statistics are biased and the p-values for the parameter estimates are not valid.
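As a sketch of the collinearity check mentioned above, the VIF, TOL, and COLLIN options in the MODEL statement request variance inflation factors, tolerances, and eigenvalue-based collinearity diagnostics. The four-variable model used here is only one plausible candidate, picked for illustration because it has the smallest C(p) in Output 55.1.5.

```sas
proc reg data=fitness;
   /* Candidate model chosen for illustration only (an assumption). */
   /* VIF and TOL add columns to the parameter estimates table;     */
   /* COLLIN prints a separate collinearity diagnostics table.      */
   model Oxygen=Age RunTime RunPulse MaxPulse / vif tol collin;
run;
quit;
```

RunPulse and MaxPulse are the pair to watch. In the selection output above, the bounds on the condition number rise from 1.3548, 11.597 to 8.4182, 76.851 at the step where MaxPulse joins RunPulse in the model.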