 The TRANSREG Procedure

## Example 65.1: Using Splines and Knots

This example illustrates some properties of splines. Splines are curves that are usually required to be continuous and smooth. Splines are usually defined as piecewise polynomials of degree n whose function values and first n-1 derivatives agree at the points where they join. The abscissa values of the join points are called knots. The term "spline" is also used for polynomials (splines with no knots) and piecewise polynomials with more than one discontinuous derivative. Splines with no knots are generally smoother than splines with knots, which are generally smoother than splines with multiple discontinuous derivatives. Splines with few knots are generally smoother than splines with many knots; however, increasing the number of knots usually improves the fit of the spline function to the data. Knots give the curve freedom to bend to follow the data more closely. Refer to Smith (1979) for an excellent introduction to splines.

In this example, an artificial data set is created with a variable Y that is a discontinuous function of X. See the first plot in Output 65.1.7. Notice that the function has four unconnected parts, each of which is a curve. Notice too that there is an overall quadratic trend: ignoring the shapes of the individual curves, the Y values at first tend to decrease as X increases and then tend to increase.

The first PROC TRANSREG analysis fits a linear regression model. The predicted values of Y given X are output and plotted to form the linear regression line. The R² for the linear regression is 0.10061, and it can be seen from the second plot in Output 65.1.7 that the linear regression model is not appropriate for these data. The following statements create the data set and perform the first PROC TRANSREG analysis. These statements produce Output 65.1.1.

```
title 'An Illustration of Splines and Knots';

* Create in Y a discontinuous function of X.
*
* Store copies of X in V1-V7 for use in PROC GPLOT.
* These variables are only necessary so that each
* plot can have its own x-axis label while putting
* four plots on a page.;

data A;
   array V[7] V1-V7;
   X=-0.000001;
   do I=0 to 199;
      if mod(I,50)=0 then do;
         C=((X/2)-5)**2;
         if I=150 then C=C+5;
         Y=C;
      end;
      X=X+0.1;
      Y=Y-sin(X-C);
      do J=1 to 7;
         V[J]=X;
      end;
      output;
   end;
run;

* Each of the PROC TRANSREG steps fits a
* different spline model to the data set created
* previously.  The TRANSREG steps build up a data set with
* various regression functions.  All of the functions
* are then plotted with the final PROC GPLOT step.
*
* The OUTPUT statements add new predicted values
* variables to the data set, while the ID statements
* save all of the previously created variables that
* are needed for the plots.;

proc transreg data=A;
model identity(Y) = identity(X);
title2 'A Linear Regression Function';
output out=A pprefix=Linear;
id V1-V7;
run;
```

Output 65.1.1: Fitting a Linear Regression Model with PROC TRANSREG

An Illustration of Splines and Knots

A Linear Regression Function

The TRANSREG Procedure

TRANSREG Univariate Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.00000 | 0.00000 | 0.10061 | | Converged |

Algorithm converged.
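
The DATA step that creates A can be summarized in formulas. This is only a reading of the code shown earlier, not a statement from the original text; the observation index i, the segment index s, and the symbols x_i and C_s are notation introduced here for illustration.

```latex
\begin{aligned}
x_i &= -10^{-6} + 0.1\,(i+1), \qquad i = 0, 1, \ldots, 199, \qquad s = \lfloor i/50 \rfloor \\
C_s &= \left(\tfrac{1}{2}\,x_{50s-1} - 5\right)^2 + 5\,[\,s = 3\,] \\
y_i &= C_s - \sum_{j=50s}^{i} \sin\left(x_j - C_s\right)
\end{aligned}
```

Under this reading, the four segment offsets C_s are approximately 25, 6.25, 0, and 11.25. Y is reset to C_s at the start of every block of 50 observations, which produces the jumps, and the falling and then rising offsets produce the overall quadratic trend.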

The second PROC TRANSREG analysis finds a degree two spline transformation with no knots, which is a quadratic polynomial. The spline is a weighted sum of a single constant, a single straight line, and a single quadratic curve. The R² increases from 0.10061, which is the linear fit value from before, to 0.40720. It can be seen from the third plot in Output 65.1.7 that the quadratic regression function does not fit any of the individual curves well, but it does follow the overall trend in the data. Since the overall trend is quadratic, a degree three spline with no knots (not shown) increases R² by only a small amount. The following statements perform the quadratic analysis and produce Output 65.1.2.

```
proc transreg data=A;
model identity(Y)=spline(X / degree=2);
title2 'A Quadratic Polynomial Regression Function';
output out=A pprefix=Quad;
id V1-V7 LinearY;
run;
```

Output 65.1.2: Fitting a Quadratic Polynomial

An Illustration of Splines and Knots

A Quadratic Polynomial Regression Function

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.82127 | 2.77121 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.40720 | 0.30659 | Converged |

Algorithm converged.
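
As with the spline steps later in this example, the quadratic fit can presumably be reproduced with a DATA step and PROC REG. The following is a minimal sketch, assuming the data set A created earlier still contains X and Y; the data set name Q and the variable name X2 are illustrative and do not appear in the original example.

```sas
data Q;                  /* Q is an illustrative data set name */
   set A(keep=X Y);
   X2=X**2;              /* X squared */
run;

proc reg data=Q;
   model Y=X X2;         /* constant + line + quadratic curve */
run;
quit;
```

If the two approaches are equivalent, the R-square reported by PROC REG should agree with the final R-Square in Output 65.1.2.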

The next step uses the default degree of three, for a piecewise cubic polynomial, and requests knots at the known break points, X=5, 10, and 15. This requests a spline that is continuous, has continuous first and second derivatives, and has a third derivative that is discontinuous at 5, 10, and 15. The spline is a weighted sum of a single constant, a single straight line, a single quadratic curve, a cubic curve for the portion of X less than 5, a different cubic curve for the portion of X between 5 and 10, a different cubic curve for the portion of X between 10 and 15, and another cubic curve for the portion of X greater than 15. The new R² is 0.61730, and it can be seen from the fourth plot (in Output 65.1.7) that the spline is less smooth than the quadratic polynomial and follows the data more closely. The following statements perform this analysis and produce Output 65.1.3:

```
proc transreg data=A;
model identity(Y) = spline(X / knots=5 10 15);
title2 'A Cubic Spline Regression Function';
title3 'The Third Derivative is Discontinuous at X=5, 10, 15';
output out=A pprefix=Cub1;
id V1-V7 LinearY QuadY;
run;
```

Output 65.1.3: Fitting a Piecewise Cubic Polynomial

An Illustration of Splines and Knots

A Cubic Spline Regression Function

The Third Derivative is Discontinuous at X=5, 10, 15

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.85367 | 3.88449 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.61730 | 0.51670 | Converged |

Algorithm converged.

The same model could be fit with a DATA step and PROC REG, as follows. (The output from the following code is not displayed.)

```
data B;            /* A is the data set used for transreg */
set a(keep=X Y);
X1=X;                       /* X                       */
X2=X**2;                    /* X squared               */
X3=X**3;                    /* X cubed                 */
X4=(X> 5)*((X-5)**3);       /* change in X**3 after  5 */
X5=(X>10)*((X-10)**3);      /* change in X**3 after 10 */
X6=(X>15)*((X-15)**3);      /* change in X**3 after 15 */
run;

proc reg;
model Y=X1-X6;
run;
```
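
The regression set up by the preceding DATA step can be written as a single formula. This is a sketch of the model implied by variables X1 through X6; the coefficient symbols b and c and the truncated power notation are introduced here and are not part of the original example.

```latex
\hat{y}(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3
           + \sum_{t \in \{5,\,10,\,15\}} c_t\,(x - t)_+^3 ,
\qquad
(u)_+^3 = \begin{cases} u^3, & u > 0 \\ 0, & u \le 0 \end{cases}
```

Each term (x - t)_+^3 is zero up to the knot and has zero first and second derivatives at the knot, so the fitted function and its first two derivatives stay continuous there. Only the third derivative changes, by 6c_t. Counting the intercept, the model has seven parameters.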

In the next step each knot is repeated three times, so the first, second, and third derivatives are discontinuous at X=5, 10, and 15, but the spline is required to be continuous at the knots. The spline is a weighted sum of

• a single constant
• a line for the portion of X less than 5
• a quadratic curve for the portion of X less than 5
• a cubic curve for the portion of X less than 5
• a different line for the portion of X between 5 and 10
• a different quadratic curve for the portion of X between 5 and 10
• a different cubic curve for the portion of X between 5 and 10
• a different line for the portion of X between 10 and 15
• a different quadratic curve for the portion of X between 10 and 15
• a different cubic curve for the portion of X between 10 and 15
• another line for the portion of X greater than 15
• another quadratic curve for the portion of X greater than 15
• and another cubic curve for the portion of X greater than 15

The spline is continuous since there is not a separate constant in the formula for the spline for each knot. Now the R² is 0.95542, and the spline closely follows the data, except at the knots. The following statements perform this analysis and produce Output 65.1.4:

```
proc transreg data=A;
model identity(Y) = spline(X / knots=5 5 5 10 10 10 15 15 15);
title3 'First - Third Derivatives Discontinuous at X=5, 10, 15';
output out=A pprefix=Cub3;
id V1-V7 LinearY QuadY Cub1Y;
run;
```

Output 65.1.4: Piecewise Polynomial with Discontinuous Derivatives

An Illustration of Splines and Knots

A Cubic Spline Regression Function

First - Third Derivatives Discontinuous at X=5, 10, 15

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.92492 | 3.50038 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.95542 | 0.85481 | Converged |

Algorithm converged.

The same model could be fit with a DATA step and PROC REG, as follows. (The output from the following code is not displayed.)

```
data B;
set a(keep=X Y);
X1=X;                        /* X                       */
X2=X**2;                     /* X squared               */
X3=X**3;                     /* X cubed                 */
X4=(X>5)   * (X- 5);         /* change in X    after  5 */
X5=(X>10)  * (X-10);         /* change in X    after 10 */
X6=(X>15)  * (X-15);         /* change in X    after 15 */
X7=(X>5)   * ((X-5)**2);     /* change in X**2 after  5 */
X8=(X>10)  * ((X-10)**2);    /* change in X**2 after 10 */
X9=(X>15)  * ((X-15)**2);    /* change in X**2 after 15 */
X10=(X>5)  * ((X-5)**3);     /* change in X**3 after  5 */
X11=(X>10) * ((X-10)**3);    /* change in X**3 after 10 */
X12=(X>15) * ((X-15)**3);    /* change in X**3 after 15 */
run;

proc reg;
model Y=X1-X12;
run;
```
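
Written as a formula, the model built from X1 through X12 adds a truncated linear and a truncated quadratic term at each knot. This is a sketch of what the preceding DATA step encodes; the coefficient symbols and the notation (u)_+^p, which equals u^p for u > 0 and 0 otherwise, are introduced here for illustration.

```latex
\hat{y}(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3
           + \sum_{t \in \{5,\,10,\,15\}}
             \left[ a_{t,1}\,(x - t)_+ + a_{t,2}\,(x - t)_+^2 + a_{t,3}\,(x - t)_+^3 \right]
```

Every added term is zero at its knot, so the fitted function remains continuous. At x = t the first derivative changes by a_{t,1}, the second by 2a_{t,2}, and the third by 6a_{t,3}. Counting the intercept, the model has 13 parameters.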

When the knots are repeated four times in the next step, the spline function is discontinuous at the knots and follows the data even more closely, with an R² of 0.99254. In this step, each separate curve is approximated by a cubic polynomial (with no knots within the separate polynomials). The following statements perform this analysis and produce Output 65.1.5:

```
proc transreg data=A;
model identity(Y) = spline(X / knots=5 5 5 5 10 10 10 10 15 15 15 15);
title3 'Discontinuous Function and Derivatives';
output out=A pprefix=Cub4;
id V1-V7 LinearY QuadY Cub1Y Cub3Y;
run;
```

Output 65.1.5: Discontinuous Function and Derivatives

An Illustration of Splines and Knots

A Cubic Spline Regression Function

Discontinuous Function and Derivatives

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.90271 | 3.29184 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.99254 | 0.89193 | Converged |

Algorithm converged.

To solve this problem with a DATA step and PROC REG, you would need to create all of the variables in the preceding DATA step (the B data set for the piecewise polynomial with discontinuous first through third derivatives), plus the following three variables:

```
X13=(X >  5);   /* intercept change after  5 */
X14=(X > 10);   /* intercept change after 10 */
X15=(X > 15);   /* intercept change after 15 */
```
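
Put together, the full DATA step and PROC REG call might look like the following. This is a sketch that combines the variables already shown in this example; the data set name B4 is illustrative, and the code assumes that A still contains X and Y.

```sas
data B4;                        /* B4 is an illustrative data set name */
   set A(keep=X Y);
   X1=X;                        /* X                         */
   X2=X**2;                     /* X squared                 */
   X3=X**3;                     /* X cubed                   */
   X4=(X>5)   * (X- 5);         /* change in X    after  5   */
   X5=(X>10)  * (X-10);         /* change in X    after 10   */
   X6=(X>15)  * (X-15);         /* change in X    after 15   */
   X7=(X>5)   * ((X-5)**2);     /* change in X**2 after  5   */
   X8=(X>10)  * ((X-10)**2);    /* change in X**2 after 10   */
   X9=(X>15)  * ((X-15)**2);    /* change in X**2 after 15   */
   X10=(X>5)  * ((X-5)**3);     /* change in X**3 after  5   */
   X11=(X>10) * ((X-10)**3);    /* change in X**3 after 10   */
   X12=(X>15) * ((X-15)**3);    /* change in X**3 after 15   */
   X13=(X>5);                   /* intercept change after  5 */
   X14=(X>10);                  /* intercept change after 10 */
   X15=(X>15);                  /* intercept change after 15 */
run;

proc reg data=B4;
   model Y=X1-X15;
run;
quit;
```

With the intercept, this model has 16 parameters, the same number as four separate cubic polynomials with four coefficients each, one for each of the four segments.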

The last two steps use the NKNOTS= t-option to specify the number of knots but not their location. NKNOTS=4 places knots at the quintiles, while NKNOTS=9 places knots at the deciles. The spline and its first two derivatives are continuous. The R² values are 0.74450 and 0.95256, respectively. Even though the knots are not placed at the break points, the spline can closely follow the data with NKNOTS=9. The following statements produce Output 65.1.6.

```
proc transreg data=A;
model identity(Y) = spline(X / nknots=4);
title3 'Four Knots';
output out=A pprefix=Cub4k;
id V1-V7 LinearY QuadY Cub1Y Cub3Y Cub4Y;
run;

proc transreg data=A;
model identity(Y) = spline(X / nknots=9);
title3 'Nine Knots';
output out=A pprefix=Cub9k;
id V1-V7 LinearY QuadY Cub1Y Cub3Y Cub4Y Cub4kY;
run;
```

Output 65.1.6: Specifying Number of Knots instead of Knot Location

An Illustration of Splines and Knots

A Cubic Spline Regression Function

Four Knots

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.90305 | 4.46027 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.74450 | 0.64389 | Converged |

Algorithm converged.

An Illustration of Splines and Knots

A Cubic Spline Regression Function

Nine Knots

The TRANSREG Procedure

TRANSREG MORALS Algorithm Iteration History for Identity(Y)

| Iteration Number | Average Change | Maximum Change | R-Square | Criterion Change | Note |
|---|---|---|---|---|---|
| 1 | 0.94832 | 3.03488 | 0.10061 | | |
| 2 | 0.00000 | 0.00000 | 0.95256 | 0.85196 | Converged |

Algorithm converged.
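
To see roughly where NKNOTS=4 puts its knots, you can compute the quintiles of X directly. The following is a sketch only: the data set name Knots4 and the prefix P are illustrative, and PROC UNIVARIATE may use a slightly different percentile definition than PROC TRANSREG, so the values should be read as approximate knot locations.

```sas
proc univariate data=A noprint;
   var X;
   /* 20th, 40th, 60th, and 80th percentiles, stored as P20 P40 P60 P80 */
   output out=Knots4 pctlpts=20 40 60 80 pctlpre=P;
run;

proc print data=Knots4 noobs;
run;
```

Because X is evenly spaced from about 0.1 to 20, the quintiles should fall near X = 4, 8, 12, and 16 rather than at the break points 5, 10, and 15.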

The following statements produce plots that show the data and fit at each step of the analysis. These statements produce Output 65.1.7.

```
goptions goutmode=replace nodisplay;
%let opts = haxis=axis2 vaxis=axis1 frame cframe=ligr;
/* Depending on your goptions, these plot options may work better: */
/* %let opts = haxis=axis2 vaxis=axis1 frame;                      */

proc gplot data=A;
title;
axis1 minor=none label=(angle=90 rotate=0);
axis2 minor=none;
plot Y*X=1              /        &opts name='tregdis1';
plot Y*V1=1 linearY*X=2 /overlay &opts name='tregdis2';
plot Y*V2=1 quadY  *X=2 /overlay &opts name='tregdis3';
plot Y*V3=1 cub1Y  *X=2 /overlay &opts name='tregdis4';
plot Y*V4=1 cub3Y  *X=2 /overlay &opts name='tregdis5';
plot Y*V5=1 cub4Y  *X=2 /overlay &opts name='tregdis6';
plot Y*V6=1 cub4kY *X=2 /overlay &opts name='tregdis7';
plot Y*V7=1 cub9kY *X=2 /overlay &opts name='tregdis8';
symbol1 color=blue   v=star i=none;
symbol2 color=yellow v=dot  i=none;
label V1      = 'Linear Regression'
V2      = 'Quadratic Regression Function'
V3      = '1 Discontinuous Derivative'
V4      = '3 Discontinuous Derivatives'
V5      = 'Discontinuous Function'
V6      = '4 Knots'
V7      = '9 Knots'
Y       = 'Y' LinearY = 'Y' QuadY  = 'Y' Cub1Y  = 'Y'
Cub3Y   = 'Y' Cub4Y   = 'Y' Cub4kY = 'Y' Cub9kY = 'Y';
run; quit;

goptions display;
proc greplay nofs tc=sashelp.templt template=l2r2;
igout gseg;
treplay 1:tregdis1 2:tregdis3 3:tregdis2 4:tregdis4;
treplay 1:tregdis5 2:tregdis7 3:tregdis6 4:tregdis8;
run; quit;
```

Output 65.1.7: Plots Summarizing Analysis for Spline Example
