 The DISCRIM Procedure

## Example 25.1: Univariate Density Estimates and Posterior Probabilities

In this example, several discriminant analyses are run with a single quantitative variable, petal width, so that density estimates and posterior probabilities can be plotted easily. The example produces Output 25.1.1 through Output 25.1.5. The GCHART procedure is used to display the sample distribution of petal width in the three species. Note the overlap between species I. versicolor and I. virginica that the bar chart shows. These statements produce Output 25.1.1:

```
proc format;
value specname
1='Setosa    '
2='Versicolor'
3='Virginica ';
run;

data iris;
title 'Discriminant Analysis of Fisher (1936) Iris Data';
input SepalLength SepalWidth PetalLength PetalWidth
Species @@;
format Species specname.;
label SepalLength='Sepal Length in mm.'
SepalWidth ='Sepal Width in mm.'
PetalLength='Petal Length in mm.'
PetalWidth ='Petal Width in mm.';
symbol = put(Species, specname10.);
datalines;
50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
63 33 60 25 3 53 37 15 02 1
;

pattern1 c=red    /*v=l1   */;
pattern2 c=yellow /*v=empty*/;
pattern3 c=blue   /*v=r1   */;
axis1 label=(angle=90);
axis2 value=(height=.6);
legend1 frame label=none;

proc gchart data=iris;
vbar PetalWidth / subgroup=Species midpoints=0 to 25
raxis=axis1 maxis=axis2 legend=legend1 cframe=ligr;
run;
```

Output 25.1.1: Sample Distribution of Petal Width in Three Species
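To put numbers on the overlap that the bar chart shows, one could also summarize petal width within each species. The step below is an illustrative addition, not part of the original example; it assumes only the iris data set and the Species format created above.

```sas
/* Illustrative check (not in the original example):              */
/* count, mean, standard deviation, and range of petal width      */
/* within each species, to compare the class ranges numerically.  */
proc means data=iris n mean std min max;
   class Species;
   var PetalWidth;
run;
```

Species whose minimum-to-maximum ranges intersect here are the ones that a single-variable classifier cannot separate perfectly.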

In order to plot the density estimates and posterior probabilities, a data set called plotdata is created containing equally spaced values from -5 to 30, covering the range of petal width with a little to spare on each end. The plotdata data set is used with the TESTDATA= option in PROC DISCRIM.

```
data plotdata;
   do PetalWidth=-5 to 30 by .5;
      output;
   end;
run;
```
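As a quick sanity check on the grid, stepping from -5 to 30 by 0.5 should give (30 - (-5)) / 0.5 + 1 = 71 observations, which is the total reported for WORK.PLOTDATA in the test-data classification summaries. The step below is a sketch added for illustration; the variable name n is an assumption and does not appear in the original example.

```sas
/* Illustrative check (not in the original example):           */
/* write the number of grid points in plotdata to the SAS log. */
data _null_;
   set plotdata nobs=n;   /* n holds the observation count of plotdata */
   put 'Grid points in plotdata: ' n;
   stop;                  /* one pass is enough; do not read further   */
run;
```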

The same plots are produced after each discriminant analysis, so a macro is used to reduce the amount of typing required. The macro PLOT uses two data sets. The data set plotd, containing density estimates, is created by the TESTOUTD= option in PROC DISCRIM. The data set plotp, containing posterior probabilities, is created by the TESTOUT= option. For each data set, the macro PLOT sets uninteresting values (densities below 0.002 and posterior probabilities below 0.01) to missing and produces an overlay plot showing all three species on a single plot. The following statements create the macro PLOT:

```
%macro plot;
data plotd;
set plotd;
if setosa<.002 then setosa=.;
if versicolor<.002 then versicolor=.;
if virginica <.002 then virginica=.;
label PetalWidth='Petal Width in mm.';
run;

symbol1 i=join v=none c=red    l=1 /*l=21*/;
symbol2 i=join v=none c=yellow l=1 /*l= 1*/;
symbol3 i=join v=none c=blue   l=1 /*l= 2*/;
legend1 label=none frame;
axis1 label=(angle=90 'Density') order=(0 to .6 by .1);

proc gplot data=plotd;
plot setosa*PetalWidth
versicolor*PetalWidth
virginica*PetalWidth
/ overlay vaxis=axis1 legend=legend1 frame
cframe=ligr;
title3 'Plot of Estimated Densities';
run;

data plotp;
set plotp;
if setosa<.01 then setosa=.;
if versicolor<.01 then versicolor=.;
if virginica<.01 then virginica=.;
label PetalWidth='Petal Width in mm.';
run;

axis1 label=(angle=90 'Posterior Probability')
order=(0 to 1 by .2);

proc gplot data=plotp;
plot setosa*PetalWidth
versicolor*PetalWidth
virginica*PetalWidth
/ overlay vaxis=axis1 legend=legend1 frame
cframe=ligr;
title3 'Plot of Posterior Probabilities';
run;
%mend;
```

The first analysis uses normal-theory methods (METHOD=NORMAL) assuming equal variances (POOL=YES) in the three classes. The NOCLASSIFY option suppresses the resubstitution classification results of the input data set observations. The CROSSLISTERR option lists the observations that are misclassified under cross validation and displays cross validation error-rate estimates. The following statements produce Output 25.1.2:

```
proc discrim data=iris method=normal pool=yes
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Equal Variance';
run;
%plot
```
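As a sketch of what the plotted densities and posterior probabilities represent under this model, assume the standard normal-theory form with a common variance. The symbols below are notation introduced here for illustration, not names used in the statements above: $q_t$ is the prior probability of class $t$, $m_t$ its sample mean of petal width, and $s^2$ the pooled within-class variance.

```latex
f_t(x) = \frac{1}{\sqrt{2\pi s^2}}
         \exp\!\left( -\frac{(x - m_t)^2}{2 s^2} \right),
\qquad
p(t \mid x) = \frac{q_t \, f_t(x)}{\sum_{u} q_u \, f_u(x)}

\ln \frac{p(t \mid x)}{p(u \mid x)}
  = \ln \frac{q_t}{q_u}
  + \frac{m_t - m_u}{s^2} \, x
  - \frac{m_t^2 - m_u^2}{2 s^2}
```

Because the variance is shared, the quadratic terms in $x$ cancel and the log ratio of any two posteriors is linear in $x$, which is consistent with the "Linear Discriminant Function" label in the output. With equal priors, the $\ln(q_t/q_u)$ term is zero.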

Output 25.1.2: Normal Density Estimates with Equal Variance

 Discriminant Analysis of Fisher (1936) Iris Data Using Normal Density Estimates with Equal Variance

 The DISCRIM Procedure

| | | | |
|---|---|---|---|
| Observations | 150 | DF Total | 149 |
| Variables | 1 | DF Within Classes | 147 |
| Classes | 3 | DF Between Classes | 2 |

Class Level Information

| Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
|---|---|---|---|---|---|
| Setosa | Setosa | 50 | 50.0000 | 0.333333 | 0.333333 |
| Versicolor | Versicolor | 50 | 50.0000 | 0.333333 | 0.333333 |
| Virginica | Virginica | 50 | 50.0000 | 0.333333 | 0.333333 |

Classification Results for Calibration Data: WORK.IRIS

Cross-validation Results using Linear Discriminant Function

Posterior Probability of Membership in Species

| Obs | From Species | Classified into Species | Setosa | Versicolor | Virginica |
|---|---|---|---|---|---|
| 5 | Virginica | Versicolor \* | 0.0000 | 0.9610 | 0.0390 |
| 9 | Versicolor | Virginica \* | 0.0000 | 0.0952 | 0.9048 |
| 57 | Virginica | Versicolor \* | 0.0000 | 0.9940 | 0.0060 |
| 78 | Virginica | Versicolor \* | 0.0000 | 0.8009 | 0.1991 |
| 91 | Virginica | Versicolor \* | 0.0000 | 0.9610 | 0.0390 |
| 148 | Versicolor | Virginica \* | 0.0000 | 0.3828 | 0.6172 |

\* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS

Cross-validation Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| From Species | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Setosa | 50 (100.00) | 0 (0.00) | 0 (0.00) | 50 (100.00) |
| Versicolor | 0 (0.00) | 48 (96.00) | 2 (4.00) | 50 (100.00) |
| Virginica | 0 (0.00) | 4 (8.00) | 46 (92.00) | 50 (100.00) |
| Total | 50 (33.33) | 52 (34.67) | 48 (32.00) | 150 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Error Count Estimates for Species

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Rate | 0.0000 | 0.0400 | 0.0800 | 0.0400 |
| Priors | 0.3333 | 0.3333 | 0.3333 | |

Classification Summary for Test Data: WORK.PLOTDATA

Classification Summary using Linear Discriminant Function

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Total | 26 (36.62) | 18 (25.35) | 27 (38.03) | 71 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |
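The total error rate in the cross-validation Error Count Estimates for this analysis can be reproduced by hand from the class rates and priors. The worked arithmetic below is an illustration; $e_t$ (the cross-validation error rate for class $t$) and $q_t$ (its prior) are notation introduced here.

```latex
e_{\text{Versicolor}} = \frac{2}{50} = 0.0400,
\qquad
e_{\text{Virginica}} = \frac{4}{50} = 0.0800,
\qquad
e_{\text{Setosa}} = \frac{0}{50} = 0.0000

e_{\text{total}} = \sum_t q_t \, e_t
  = \tfrac{1}{3}\,(0.0000 + 0.0400 + 0.0800)
  = 0.0400
```

The same weighting applied to a Setosa rate of 0.0200 gives $\tfrac{1}{3}(0.0200 + 0.0400 + 0.0800) \approx 0.0467$.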

The next analysis uses normal-theory methods assuming unequal variances (POOL=NO) in the three classes. The following statements produce Output 25.1.3:

```
proc discrim data=iris method=normal pool=no
             testdata=plotdata testout=plotp testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Normal Density Estimates with Unequal Variance';
run;
%plot
```
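A sketch of how POOL=NO changes the model, assuming the usual normal-theory form: each class now has its own variance. The symbols $q_t$ (prior), $m_t$ (class mean), and $s_t^2$ (within-class variance of class $t$) are notation introduced here for illustration.

```latex
f_t(x) = \frac{1}{\sqrt{2\pi s_t^2}}
         \exp\!\left( -\frac{(x - m_t)^2}{2 s_t^2} \right)

\ln \frac{p(t \mid x)}{p(u \mid x)}
  = \ln \frac{q_t}{q_u}
  + \ln \frac{s_u}{s_t}
  - \frac{(x - m_t)^2}{2 s_t^2}
  + \frac{(x - m_u)^2}{2 s_u^2}
```

Unless $s_t = s_u$, the $x^2$ terms no longer cancel, so the log ratio of two posteriors is quadratic in $x$. This is consistent with the "Quadratic Discriminant Function" label in the output.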

Output 25.1.3: Normal Density Estimates with Unequal Variance

 Discriminant Analysis of Fisher (1936) Iris Data Using Normal Density Estimates with Unequal Variance

 The DISCRIM Procedure

| | | | |
|---|---|---|---|
| Observations | 150 | DF Total | 149 |
| Variables | 1 | DF Within Classes | 147 |
| Classes | 3 | DF Between Classes | 2 |

Class Level Information

| Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
|---|---|---|---|---|---|
| Setosa | Setosa | 50 | 50.0000 | 0.333333 | 0.333333 |
| Versicolor | Versicolor | 50 | 50.0000 | 0.333333 | 0.333333 |
| Virginica | Virginica | 50 | 50.0000 | 0.333333 | 0.333333 |

Classification Results for Calibration Data: WORK.IRIS

Cross-validation Results using Quadratic Discriminant Function

Posterior Probability of Membership in Species

| Obs | From Species | Classified into Species | Setosa | Versicolor | Virginica |
|---|---|---|---|---|---|
| 5 | Virginica | Versicolor \* | 0.0000 | 0.8740 | 0.1260 |
| 9 | Versicolor | Virginica \* | 0.0000 | 0.0686 | 0.9314 |
| 42 | Setosa | Versicolor \* | 0.4923 | 0.5073 | 0.0004 |
| 57 | Virginica | Versicolor \* | 0.0000 | 0.9602 | 0.0398 |
| 78 | Virginica | Versicolor \* | 0.0000 | 0.6558 | 0.3442 |
| 91 | Virginica | Versicolor \* | 0.0000 | 0.8740 | 0.1260 |
| 148 | Versicolor | Virginica \* | 0.0000 | 0.2871 | 0.7129 |

\* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS

Cross-validation Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| From Species | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Setosa | 49 (98.00) | 1 (2.00) | 0 (0.00) | 50 (100.00) |
| Versicolor | 0 (0.00) | 48 (96.00) | 2 (4.00) | 50 (100.00) |
| Virginica | 0 (0.00) | 4 (8.00) | 46 (92.00) | 50 (100.00) |
| Total | 49 (32.67) | 53 (35.33) | 48 (32.00) | 150 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Error Count Estimates for Species

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Rate | 0.0200 | 0.0400 | 0.0800 | 0.0467 |
| Priors | 0.3333 | 0.3333 | 0.3333 | |

Classification Summary for Test Data: WORK.PLOTDATA

Classification Summary using Quadratic Discriminant Function

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Total | 23 (32.39) | 20 (28.17) | 28 (39.44) | 71 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Two more analyses are run with nonparametric methods (METHOD=NPAR), specifically kernel density estimates with normal kernels (KERNEL=NORMAL). The first of these uses equal bandwidths, that is, equal smoothing parameters (POOL=YES), in each class. The use of equal bandwidths does not constrain the density estimates to be of equal variance. The value of the radius parameter that, assuming normality, minimizes an approximate mean integrated square error is 0.48 (see the "Nonparametric Methods" section). Choosing the smaller value r=0.4 gives a more detailed look at the irregularities in the data. The following statements produce Output 25.1.4:

```
proc discrim data=iris method=npar kernel=normal
             r=.4 pool=yes
             testdata=plotdata testout=plotp
             testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Equal Bandwidth';
run;
%plot
```
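As a sketch of the estimate behind KERNEL=NORMAL with a single variable, assume the general normal-kernel form described in the "Nonparametric Methods" section. The symbols are notation introduced here for illustration: $n_t$ is the number of training observations in class $t$, $y_{ti}$ the $i$th petal width in that class, $r$ the radius given by R=, and $s$ the standard deviation that scales the kernel.

```latex
\hat{f}_t(x) = \frac{1}{n_t} \sum_{i=1}^{n_t} K(x - y_{ti}),
\qquad
K(z) = \frac{1}{\sqrt{2\pi}\, r s}
       \exp\!\left( -\frac{z^2}{2 r^2 s^2} \right)
```

Under this reading, POOL=YES takes $s$ from the pooled within-class variance, so every class uses the same kernel width $r s$, while POOL=NO replaces $s$ with the class-specific $s_t$. In either case each $\hat{f}_t$ is an average of $n_t$ normal bumps centered on that class's own observations, so its overall spread follows the data. That is why equal bandwidths do not force the estimated densities to have equal variance.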

Output 25.1.4: Kernel Density Estimates with Equal Bandwidth

 Discriminant Analysis of Fisher (1936) Iris Data Using Kernel Density Estimates with Equal Bandwidth

 The DISCRIM Procedure

| | | | |
|---|---|---|---|
| Observations | 150 | DF Total | 149 |
| Variables | 1 | DF Within Classes | 147 |
| Classes | 3 | DF Between Classes | 2 |

Class Level Information

| Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
|---|---|---|---|---|---|
| Setosa | Setosa | 50 | 50.0000 | 0.333333 | 0.333333 |
| Versicolor | Versicolor | 50 | 50.0000 | 0.333333 | 0.333333 |
| Virginica | Virginica | 50 | 50.0000 | 0.333333 | 0.333333 |

Classification Results for Calibration Data: WORK.IRIS

Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

| Obs | From Species | Classified into Species | Setosa | Versicolor | Virginica |
|---|---|---|---|---|---|
| 5 | Virginica | Versicolor \* | 0.0000 | 0.8827 | 0.1173 |
| 9 | Versicolor | Virginica \* | 0.0000 | 0.0438 | 0.9562 |
| 57 | Virginica | Versicolor \* | 0.0000 | 0.9472 | 0.0528 |
| 78 | Virginica | Versicolor \* | 0.0000 | 0.8061 | 0.1939 |
| 91 | Virginica | Versicolor \* | 0.0000 | 0.8827 | 0.1173 |
| 148 | Versicolor | Virginica \* | 0.0000 | 0.2586 | 0.7414 |

\* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS

Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| From Species | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Setosa | 50 (100.00) | 0 (0.00) | 0 (0.00) | 50 (100.00) |
| Versicolor | 0 (0.00) | 48 (96.00) | 2 (4.00) | 50 (100.00) |
| Virginica | 0 (0.00) | 4 (8.00) | 46 (92.00) | 50 (100.00) |
| Total | 50 (33.33) | 52 (34.67) | 48 (32.00) | 150 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Error Count Estimates for Species

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Rate | 0.0000 | 0.0400 | 0.0800 | 0.0400 |
| Priors | 0.3333 | 0.3333 | 0.3333 | |

Classification Summary for Test Data: WORK.PLOTDATA

Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Total | 26 (36.62) | 18 (25.35) | 27 (38.03) | 71 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Another nonparametric analysis is run with unequal bandwidths (POOL=NO). These statements produce Output 25.1.5:

```
proc discrim data=iris method=npar kernel=normal
             r=.4 pool=no
             testdata=plotdata testout=plotp
             testoutd=plotd
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Unequal Bandwidth';
run;
%plot
```

Output 25.1.5: Kernel Density Estimates with Unequal Bandwidth

 Discriminant Analysis of Fisher (1936) Iris Data Using Kernel Density Estimates with Unequal Bandwidth

 The DISCRIM Procedure

| | | | |
|---|---|---|---|
| Observations | 150 | DF Total | 149 |
| Variables | 1 | DF Within Classes | 147 |
| Classes | 3 | DF Between Classes | 2 |

Class Level Information

| Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
|---|---|---|---|---|---|
| Setosa | Setosa | 50 | 50.0000 | 0.333333 | 0.333333 |
| Versicolor | Versicolor | 50 | 50.0000 | 0.333333 | 0.333333 |
| Virginica | Virginica | 50 | 50.0000 | 0.333333 | 0.333333 |

Classification Results for Calibration Data: WORK.IRIS

Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

| Obs | From Species | Classified into Species | Setosa | Versicolor | Virginica |
|---|---|---|---|---|---|
| 5 | Virginica | Versicolor \* | 0.0000 | 0.8805 | 0.1195 |
| 9 | Versicolor | Virginica \* | 0.0000 | 0.0466 | 0.9534 |
| 57 | Virginica | Versicolor \* | 0.0000 | 0.9394 | 0.0606 |
| 78 | Virginica | Versicolor \* | 0.0000 | 0.7193 | 0.2807 |
| 91 | Virginica | Versicolor \* | 0.0000 | 0.8805 | 0.1195 |
| 148 | Versicolor | Virginica \* | 0.0000 | 0.2275 | 0.7725 |

\* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS

Cross-validation Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| From Species | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Setosa | 50 (100.00) | 0 (0.00) | 0 (0.00) | 50 (100.00) |
| Versicolor | 0 (0.00) | 48 (96.00) | 2 (4.00) | 50 (100.00) |
| Virginica | 0 (0.00) | 4 (8.00) | 46 (92.00) | 50 (100.00) |
| Total | 50 (33.33) | 52 (34.67) | 48 (32.00) | 150 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |

Error Count Estimates for Species

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Rate | 0.0000 | 0.0400 | 0.0800 | 0.0400 |
| Priors | 0.3333 | 0.3333 | 0.3333 | |

Classification Summary for Test Data: WORK.PLOTDATA

Classification Summary using Normal Kernel Density

Number of Observations and Percent Classified into Species (count, with percent in parentheses)

| | Setosa | Versicolor | Virginica | Total |
|---|---|---|---|---|
| Total | 25 (35.21) | 18 (25.35) | 28 (39.44) | 71 (100.00) |
| Priors | 0.33333 | 0.33333 | 0.33333 | |
