The PLS Procedure

## Example 51.1: Examining Model Details

The following example, from Umetrics (1995), demonstrates different ways to examine a PLS model. The data come from the field of drug discovery. New drugs are developed from chemicals that are biologically active. Testing a compound for biological activity is an expensive procedure, so it is useful to be able to predict biological activity from cheaper chemical measurements. In fact, computational chemistry makes it possible to calculate certain chemical measurements without even making the compound. These measurements include size, lipophilicity, and polarity at various sites on the molecule. The following statements create a data set named penta, which contains these data.

```
data penta;
   input obsnam $ S1 L1 P1  S2 L2 P2
                  S3 L3 P3  S4 L4 P4
                  S5 L5 P5  log_RAI @@;
   n = _n_;
   datalines;
VESSK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
1.9607 -1.6324  0.5746  1.9607 -1.6324  0.5746
2.8369  1.4092 -3.1398                    0.00
VESAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
1.9607 -1.6324  0.5746  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.28
VEASK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  1.9607 -1.6324  0.5746
2.8369  1.4092 -3.1398                    0.20
VEAAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.51
VKAAK    -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.11
VEWAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    2.73
VEAAP    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
-1.2201  0.8829  2.2253                    0.18
VEHAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
2.4064  1.7438  1.1057  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    1.53
VAAAK    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                   -0.10
GEAAK     2.2261 -5.3648  0.3049  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                   -0.52
LEAAK    -4.1921 -1.0285 -0.9801  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.40
FEAAK    -4.9217  1.2977  0.4473  3.0777  0.3891 -0.0701
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.30
VEGGK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
2.8369  1.4092 -3.1398                   -1.00
VEFAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
-4.9217  1.2977  0.4473  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    1.57
VELAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
-4.1921 -1.0285 -0.9801  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.59
AAAAA     0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902                   -0.10
AAYAA     0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
-1.3944  2.3230  0.0139  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902                    0.46
AAWAA     0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902                    0.75
VAWAA    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902                    1.43
VAWAK    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    1.45
VKWAA    -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902                    1.71
VWAAK    -2.6931 -2.5271 -1.2871 -4.7548  3.6521  0.8524
0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
2.8369  1.4092 -3.1398                    0.04
VAAWK    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
0.0744 -1.7333  0.0902 -4.7548  3.6521  0.8524
2.8369  1.4092 -3.1398                    0.23
EKWAP     3.0777  0.3891 -0.0701  2.8369  1.4092 -3.1398
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
-1.2201  0.8829  2.2253                    1.30
VKWAP    -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
-1.2201  0.8829  2.2253                    2.35
RKWAP     2.8827  2.5215 -3.4435  2.8369  1.4092 -3.1398
-4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
-1.2201  0.8829  2.2253                    1.98
VEWVK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
-4.7548  3.6521  0.8524 -2.6931 -2.5271 -1.2871
2.8369  1.4092 -3.1398                    1.71
PGFSP    -1.2201  0.8829  2.2253  2.2261 -5.3648  0.3049
-4.9217  1.2977  0.4473  1.9607 -1.6324  0.5746
-1.2201  0.8829  2.2253                    0.90
FSPFR    -4.9217  1.2977  0.4473  1.9607 -1.6324  0.5746
-1.2201  0.8829  2.2253 -4.9217  1.2977  0.4473
2.8827  2.5215 -3.4435                    0.64
RYLPT     2.8827  2.5215 -3.4435 -1.3944  2.3230  0.0139
-4.1921 -1.0285 -0.9801 -1.2201  0.8829  2.2253
0.9243 -2.0921 -1.3996                    0.40
GGGGG     2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
2.2261 -5.3648  0.3049                     .
;
data ptrain; set penta; if (n <= 15);
data ptest ; set penta; if (n >  15);
run;
```

You would like to study the relationship between these measurements and the activity of the compound, represented by the logarithm of the relative Bradykinin activating activity (log_RAI). Notice that these data consist of many predictors relative to the number of observations. Partial least squares is especially appropriate in this situation because it finds a few underlying predictive factors that account for most of the variation in the response. Typically, the model is fit to part of the data (the "training" or "work" set), and the quality of the fit is judged by how well it predicts the other part of the data (the "test" or "prediction" set). For this example, the first 15 observations serve as the training set and the rest constitute the test set (refer to Ufkes et al. 1978, 1982).

When you fit a PLS model, you hope to find a few PLS factors that explain most of the variation in both predictors and responses. Factors that explain response variation provide good predictive models for new responses, and factors that explain predictor variation are well represented by the observed values of the predictors. The following statements fit a PLS model with two factors and save predicted values, residuals, and other information for each data point in a data set named outpls.

```
proc pls data=ptrain nfac=2;
   model log_RAI = S1-S5 L1-L5 P1-P5;
   output out=outpls predicted = yhat1
                     yresidual = yres1
                     xresidual = xres1-xres15
                     xscore    = xscr
                     yscore    = yscr;
run;
```
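The factor extraction that PROC PLS performs here can also be illustrated outside of SAS. The following Python sketch is not part of the SAS example: it runs the NIPALS algorithm for a single response on synthetic data of the same shape as the training set (15 observations, 15 predictors) and reports the percent of predictor and response variation each factor explains, analogous to the "Current" columns of Output 51.1.1. Note that this sketch only centers the variables, whereas PROC PLS both centers and scales them by default.

```python
import numpy as np

def pls1_nipals(X, y, nfac):
    """Illustrative NIPALS PLS for a single response: extract `nfac`
    factors and report the percent of X- and y-variation explained
    by each one (the "Current" columns of the SAS output)."""
    X = X - X.mean(axis=0)              # center predictors
    y = y - y.mean()                    # center response
    ssx0 = (X ** 2).sum()               # total X variation after centering
    ssy0 = (y ** 2).sum()               # total y variation after centering
    pct_x, pct_y = [], []
    for _ in range(nfac):
        w = X.T @ y
        w = w / np.linalg.norm(w)       # X-weight: unit direction maximizing cov(Xw, y)
        t = X @ w                       # X-score
        p = X.T @ t / (t @ t)           # X-loading
        q = (y @ t) / (t @ t)           # y-loading
        X = X - np.outer(t, p)          # deflate X by this factor
        y = y - q * t                   # deflate y by this factor
        pct_x.append(100 * (t @ t) * (p @ p) / ssx0)
        pct_y.append(100 * q ** 2 * (t @ t) / ssy0)
    return np.array(pct_x), np.array(pct_y)

# Synthetic stand-in for the ptrain data: 15 obs, 15 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 15))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=15)
px, py = pls1_nipals(X, y, nfac=2)
print(px.round(1), py.round(1))
```

Because each factor is removed from both X and y by deflation, the per-factor percentages accumulate toward (but never exceed) 100 for each side, just as in the SAS table.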

The PLS procedure displays the table shown in Output 51.1.1, which summarizes how much predictor and response variation each PLS factor explains.

Output 51.1.1: Amount of Training Set Variation Explained

The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors

| Number of Extracted Factors | Model Effects, Current | Model Effects, Total | Dependent Variables, Current | Dependent Variables, Total |
|---:|---:|---:|---:|---:|
| 1 | 16.9014 | 16.9014 | 89.6399 | 89.6399 |
| 2 | 12.7721 | 29.6735 | 7.8368 | 97.4767 |

From Output 51.1.1, note that two factors already explain about 97% of the response variation but only about 29% of the predictor variation.

Partial least squares algorithms choose successive orthogonal factors that maximize the covariance between each X-score and the corresponding Y-score. For a good PLS model, the first few factors show a high correlation between the X- and Y-scores. The correlation usually decreases from one factor to the next. You can plot the X-scores versus the Y-scores for the first PLS factor using the following SAS statements.

```
%let ifac = 1;
data pltanno; set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2' ysys '2'
          color 'blue' style 'swissb';
   text=%str(n); x=xscr&ifac; y=yscr&ifac;
run;
axis1 label=(angle=270 rotate=90 "Y-score &ifac")
      major=(number=5) minor=none;
axis2 label=("X-score &ifac") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot yscr&ifac*xscr&ifac=1
        / anno=pltanno vaxis=axis1 haxis=axis2 frame cframe=ligr;
run;
```

By changing the macro variable ifac to 2 instead of 1, you can use the same statements to plot the X-scores versus the Y-scores for the second PLS factor. The resulting plots are shown in Output 51.1.2 and Output 51.1.3. The numbers on the plot represent the observation number in the penta data set.

Output 51.1.2: First X- and Y-scores for Penta-Peptide Model 1

Output 51.1.3: Second X- and Y-scores for Penta-Peptide Model 1

For this example, the figures show high correlation between X- and Y-scores for the first factor but somewhat lower correlation for the second factor.
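The covariance-maximizing property described above can also be checked numerically. In the following Python sketch (illustrative random data, not part of the SAS example), the first PLS weight direction for a single centered response, which is proportional to X'y, attains at least as large a score covariance with the response as any other unit direction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 15))
y = rng.normal(size=15)
Xc = X - X.mean(axis=0)            # center predictors, as PROC PLS does
yc = y - y.mean()                  # center response

v = Xc.T @ yc
w_pls = v / np.linalg.norm(v)      # first PLS weight direction (PLS1)
best = abs(w_pls @ v)              # |cov| (up to a constant) of score Xc@w with yc

# By Cauchy-Schwarz, no other unit direction can give a larger score covariance
for _ in range(200):
    w = rng.normal(size=15)
    w = w / np.linalg.norm(w)
    assert abs(w @ v) <= best + 1e-12
print("first PLS weight maximizes |cov(Xw, y)| over unit directions w")
```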

You can also plot the X-scores against each other to look for irregularities in the data. You should look for patterns or clearly grouped observations. If you see a curved pattern, for example, you may want to add a quadratic term. Two or more groupings of observations indicate that it might be better to analyze the groups separately, perhaps by including classification effects in the model. The following SAS statements produce a plot of the first and second X-scores:

```
data pltanno; set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2' ysys '2'
          color 'blue' style 'swissb';
   text=%str(n); x=xscr1; y=xscr2;
run;
axis1 label=(angle=270 rotate=90 "X-score 2")
      major=(number=5) minor=none;
axis2 label=("X-score 1") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot xscr2*xscr1=1
        / anno=pltanno vaxis=axis1 haxis=axis2 frame cframe=ligr;
run;
```

The plot is shown in Output 51.1.4.

Output 51.1.4: First and Second X-scores for Penta-Peptide Model 1

This plot shows most of the observations close together, with a few spread out with larger positive X-scores for factor 2. There are no clear grouping patterns, but observation 13 stands out; note that this observation is the most extreme on all three plots so far. This observation may be overly influential in the PLS analysis; thus, you should check to make sure it is reliable.

Plots of the weights give the directions toward which each PLS factor projects. They show which predictors are most represented in each factor. Those predictors with small weights in absolute value are less important than those with large weights.

You can use the DETAILS option in the PROC PLS statement to display various model details, including the X-weights. You can then use the ODS statement to send the weights to an output data set, as follows:

```
ods output XWeights=xweights;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;
```

Once the X-weights are in a data set, you can use the following statements to plot the weights for the first two PLS factors against one another:

```
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out=xweights;
run;
data xweights; set xweights;
   rename col1=w1 col2=w2;
run;
data wt_anno; set xweights;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2' ysys '2'
          color 'blue' style 'swissb';
   text=%str(_name_); x=w1; y=w2;
run;

axis1 label=(angle=270 rotate=90 "X-weight 2")
      major=(number=5) minor=none;
axis2 label=("X-weight 1") minor=none;
symbol1 v=none i=none;
proc gplot data=xweights;
   plot w2*w1=1 / anno=wt_anno vaxis=axis1
                  haxis=axis2 frame cframe=ligr;
run; quit;
```

The plot of the X-weights is shown in Output 51.1.5.

Output 51.1.5: First and Second X-weights for Penta-Peptide Model 1

The weights plot shows a cluster of X-variables that are weighted at nearly zero for both factors. These variables add little to the model fit, and removing them may improve the model's predictive capability.

To explore further which predictors can be eliminated from the analysis, you can look at the regression coefficients for the standardized data. Predictors with small coefficients (in absolute value) make a small contribution to the response prediction. Another statistic summarizing the contribution a variable makes to the model is the Variable Importance for Projection (VIP) of Wold (1994). Whereas the regression coefficients represent the importance each predictor has in the prediction of just the response, the VIP represents the value of each predictor in fitting the PLS model for both predictors and response. If a predictor has a relatively small coefficient (in absolute value) and a small value of VIP, then it is a prime candidate for deletion. Wold (1994) considers a value less than 0.8 to be "small" for the VIP. The following statements produce coefficients and the VIP:

```
/*
/  Put coefficients, weights, and R**2's into data sets.
/-------------------------------------------------------*/
ods listing close;
ods output PercentVariation = pctvar
           XWeights         = xweights
           CenScaleParms    = solution;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1 L1 P1
                   S2 L2 P2
                   S3 L3 P3
                   S4 L4 P4
                   S5 L5 P5 / solution;
run;
ods listing;

/*
/  Just reformat the coefficients.
/-------------------------------------------------------*/
data solution; set solution;
   format log_RAI 8.5;
   if (RowName = 'Intercept') then delete;
   rename RowName = Predictor log_RAI = B;
run;

/*
/  Transpose weights and R**2's.
/-------------------------------------------------------*/
data xweights; set xweights; _name_='W'||trim(left(_n_));
run;
data pctvar  ; set pctvar  ; _name_='R'||trim(left(_n_));
run;
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out=xweights;
proc transpose data=pctvar(keep=_name_ CurrentYVariation)
               out=pctvar;
run;

/*
/  Sum the squared weights times the normalized R**2's.
/  The VIP is defined as the square root of this
/  weighted average times the number of predictors.
/-------------------------------------------------------*/
```

```
proc sql;
   create table vip as
      select *
      from xweights left join pctvar(drop=_name_) on 1;
quit;
data vip; set vip; keep _name_ vip;
   array w{2};
   array r{2};
   VIP = 0;
   do i = 1 to 2;
      VIP = VIP + r{i}*(w{i}**2) / sum(of r1-r2);
   end;
   VIP = sqrt(VIP * 15);
run;
data vipbpls; merge solution vip(drop=_name_);
run;
proc print data=vipbpls;
run;
```
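The SQL join and DATA step above compute each predictor's VIP as the square root of the number of predictors times a weighted average of that predictor's squared X-weights, where the averaging weights are each factor's share of explained response variation. The same formula can be sketched in plain Python; the weight values passed in below are made up for illustration, while the R² values are the "Current" response percentages from Output 51.1.1:

```python
from math import sqrt

def vip(weights, r2, p):
    """VIP for one predictor.
    weights: that predictor's X-weight for each factor (w1, w2, ...)
    r2:      percent y-variation explained by each factor
    p:       total number of predictors (15 in this example)"""
    total = sum(r2)
    # weighted average of squared weights, weighted by each factor's R**2 share
    avg = sum(r * w ** 2 for r, w in zip(r2, weights)) / total
    return sqrt(avg * p)

# R**2 values from Output 51.1.1; the weights 0.4 and 0.1 are hypothetical
r2 = [89.6399, 7.8368]
print(round(vip([0.4, 0.1], r2, 15), 4))
```

Because the factor shares are normalized, a predictor whose weight is the same constant c on every factor gets VIP = c * sqrt(p), which makes Wold's 0.8 cutoff a comparison against the "average" predictor.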

The output appears in Output 51.1.6.

Output 51.1.6: Estimated PLS Regression Coefficients and VIP (Model 1)

| Obs | Predictor | B | VIP |
|---:|:---|---:|---:|
| 1 | S1 | -0.13831 | 0.63084 |
| 2 | L1 | 0.05720 | 0.32874 |
| 3 | P1 | -0.19064 | 0.77412 |
| 4 | S2 | 0.12383 | 0.52061 |
| 5 | L2 | 0.05909 | 0.28007 |
| 6 | P2 | 0.09361 | 0.36941 |
| 7 | S3 | -0.28415 | 1.62992 |
| 8 | L3 | 0.47131 | 2.51518 |
| 9 | P3 | 0.26613 | 1.16839 |
| 10 | S4 | -0.09145 | 1.26075 |
| 11 | L4 | 0.12265 | 1.21771 |
| 12 | P4 | -0.04878 | 0.91090 |
| 13 | S5 | 0.03320 | 0.21989 |
| 14 | L5 | 0.03320 | 0.21989 |
| 15 | P5 | -0.03320 | 0.21989 |

For this data set, the variables L1, L2, P2, P4, S5, L5, and P5 have small absolute coefficients and small VIP. Looking back at the weights plot in Output 51.1.5, you can see that these variables tend to be the ones near zero for both PLS factors. You should consider dropping these variables from the model.