
The DISCRIM Procedure

When the distribution within each group is assumed to be multivariate normal, a parametric method can be used to develop a discriminant function. The discriminant function, also known as a classification criterion, is determined by a measure of generalized squared distance (Rao 1973). The classification criterion can be based on either the individual within-group covariance matrices (yielding a quadratic function) or the pooled covariance matrix (yielding a linear function); it also takes into account the prior probabilities of the groups. The calibration information can be stored in a special SAS data set and applied to other data sets.
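The parametric criterion described above can be illustrated with a small Python sketch. It is not PROC DISCRIM code and does not reproduce the procedure's output; it assumes two variables, and the function names, group means, covariance matrices, and priors are all hypothetical. The generalized squared distance is taken to be the Mahalanobis distance to the group mean, plus the log determinant of the covariance matrix when within-group matrices are used (the quadratic case), minus twice the log of the prior.

```python
import math

def inv_det_2x2(m):
    """Return (inverse, determinant) of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def generalized_sq_distance(x, mean, cov, prior, quadratic):
    """Mahalanobis term, plus ln|cov| in the quadratic case, minus 2 ln(prior)."""
    inv, det = inv_det_2x2(cov)
    dx = [x[0] - mean[0], x[1] - mean[1]]
    maha = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    log_det = math.log(det) if quadratic else 0.0
    return maha + log_det - 2.0 * math.log(prior)

def posteriors(x, groups, pooled_cov=None):
    """Posterior probabilities; pass pooled_cov to get the linear criterion."""
    quadratic = pooled_cov is None
    d2 = {}
    for name, g in groups.items():
        cov = g["cov"] if quadratic else pooled_cov
        d2[name] = generalized_sq_distance(x, g["mean"], cov, g["prior"], quadratic)
    # Subtract the smallest distance before exponentiating for numerical stability.
    smallest = min(d2.values())
    weights = {name: math.exp(-0.5 * (v - smallest)) for name, v in d2.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical calibration information for two groups (illustrative values only).
groups = {
    "A": {"mean": [0.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]], "prior": 0.5},
    "B": {"mean": [2.0, 2.0], "cov": [[2.0, 0.0], [0.0, 2.0]], "prior": 0.5},
}
pooled = [[1.5, 0.0], [0.0, 1.5]]  # assumed pooled matrix, not estimated from data

linear_post = posteriors([1.0, 1.0], groups, pooled_cov=pooled)
quadratic_post = posteriors([1.0, 1.0], groups)
```

At the midpoint between the two means, the linear criterion cannot separate the groups, while the quadratic criterion favors the group with the smaller covariance matrix because of the log determinant term.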

When no assumptions can be made about the distribution within each group, or
when the distribution is assumed not to be multivariate normal, nonparametric
methods can be used to estimate the group-specific densities. These methods
include the kernel and *k*-nearest-neighbor methods (Rosenblatt 1956; Parzen
1962). The DISCRIM procedure uses uniform, normal, Epanechnikov, biweight, or
triweight kernels for density estimation.
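
As a rough illustration of the five kernel shapes named above, the following Python sketch uses the common univariate textbook forms. This is an assumption made for brevity: the procedure works with multivariate kernels and its own normalizing constants, which are not reproduced here. The function names and the radius parameter `r` are hypothetical.

```python
import math

def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def normal(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def biweight(u):
    return (15 / 16) * (1 - u * u) ** 2 if abs(u) <= 1 else 0.0

def triweight(u):
    return (35 / 32) * (1 - u * u) ** 3 if abs(u) <= 1 else 0.0

KERNELS = {
    "uniform": uniform,
    "normal": normal,
    "epanechnikov": epanechnikov,
    "biweight": biweight,
    "triweight": triweight,
}

def kernel_density(x, sample, r, kernel):
    """Univariate kernel density estimate at x with smoothing radius r."""
    return sum(kernel((x - xi) / r) for xi in sample) / (len(sample) * r)

def integrate(kernel, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule check that a kernel integrates to approximately 1."""
    h = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * h) for i in range(steps)) * h
```

Each kernel is a density in its own right, so `integrate` should return a value close to 1 for all five. Apart from the normal kernel, they vanish outside the unit radius, which means only observations within distance `r` of `x` contribute to the estimate.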

Either Mahalanobis or Euclidean distance can be used to determine proximity.
Mahalanobis distance can be based on either the full covariance matrix or the
diagonal matrix of variances. With a *k*-nearest-neighbor method, the pooled
covariance matrix is used to calculate the Mahalanobis distances. With a kernel
method, either the individual within-group covariance matrices or the pooled
covariance matrix can be used to calculate the Mahalanobis distances. With the
estimated group-specific densities and their associated prior probabilities, the
posterior probability estimates of group membership for each class can be
evaluated.
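
A minimal Python sketch of the *k*-nearest-neighbor case follows. It makes several simplifying assumptions: distances are either Euclidean or Mahalanobis with a diagonal matrix of variances (the full-covariance case is omitted), the group density at a point is taken to be proportional to the share of that group's observations among the *k* nearest neighbors, and all names and data are hypothetical.

```python
def sq_distance(x, y, variances=None):
    """Squared Euclidean distance, or squared Mahalanobis distance with a
    diagonal matrix of variances when `variances` is supplied."""
    if variances is None:
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))

def knn_posteriors(x, data, priors, k, variances=None):
    """Posterior estimates from the k nearest labeled points.

    data is a list of (point, label) pairs; priors maps label -> prior.
    The density of group t at x is taken as proportional to k_t / n_t,
    because the neighborhood volume is common to all groups and cancels.
    """
    nearest = sorted(data, key=lambda row: sq_distance(x, row[0], variances))[:k]
    n_t, k_t = {}, {}
    for _, label in data:
        n_t[label] = n_t.get(label, 0) + 1
    for _, label in nearest:
        k_t[label] = k_t.get(label, 0) + 1
    weights = {t: priors[t] * k_t.get(t, 0) / n_t[t] for t in priors}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

# Hypothetical training data: two well-separated groups.
data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
priors = {"A": 0.5, "B": 0.5}
post = knn_posteriors((0.2, 0.2), data, priors, k=3)
```

The choice of metric can change which observations count as neighbors: rescaling by the variances shrinks distances along high-variance variables, so a point that is far away in Euclidean terms may become the nearest neighbor.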

Canonical discriminant analysis is a dimension-reduction technique related to principal component analysis and canonical correlation. Given a classification variable and several quantitative variables, PROC DISCRIM derives canonical variables (linear combinations of the quantitative variables) that summarize between-class variation in much the same way that principal components summarize total variation. (See Chapter 21, "The CANDISC Procedure," for more information on canonical discriminant analysis.) A discriminant criterion is always derived in PROC DISCRIM. If you want canonical discriminant analysis without the use of a discriminant criterion, you should use the CANDISC procedure.

The DISCRIM procedure can produce an output data set containing various statistics such as means, standard deviations, and correlations. If a parametric method is used, the discriminant function is also stored in the data set to classify future observations. When canonical discriminant analysis is performed, the output data set includes canonical coefficients that can be rotated by the FACTOR procedure. PROC DISCRIM can also create a second type of output data set containing the classification results for each observation. When canonical discriminant analysis is performed, this output data set also includes canonical variable scores. A third type of output data set containing the group-specific density estimates at each observation can also be produced.

PROC DISCRIM evaluates the performance of a discriminant criterion by estimating error rates (probabilities of misclassification) in the classification of future observations. These error-rate estimates include error-count estimates and posterior probability error-rate estimates. When the input data set is an ordinary SAS data set, the error rate can also be estimated by cross validation.
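
The difference between a resubstitution error-count estimate and a cross validation estimate can be sketched in Python as follows. The classifier here is a deliberately simple stand-in (assign each point to the group with the nearest mean), not the criterion PROC DISCRIM derives, and cross validation is assumed to mean leave-one-out: each observation is classified by a criterion built from all other observations. The function names and data are hypothetical.

```python
def nearest_mean_classifier(train):
    """Build a classifier that assigns a point to the group with the closest mean."""
    sums, counts = {}, {}
    for point, label in train:
        s = sums.setdefault(label, [0.0] * len(point))
        for i, v in enumerate(point):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    means = {t: [v / counts[t] for v in s] for t, s in sums.items()}

    def classify(point):
        return min(means, key=lambda t: sum((a - b) ** 2 for a, b in zip(point, means[t])))

    return classify

def error_count_estimate(data, priors, cross_validate):
    """Return (per-group error rates, prior-weighted overall error rate)."""
    wrong = {t: 0 for t in priors}
    n = {t: 0 for t in priors}
    full = None if cross_validate else nearest_mean_classifier(data)
    for i, (point, label) in enumerate(data):
        # Leave-one-out: rebuild the classifier without observation i.
        clf = nearest_mean_classifier(data[:i] + data[i + 1:]) if cross_validate else full
        n[label] += 1
        wrong[label] += clf(point) != label
    per_group = {t: wrong[t] / n[t] for t in priors}
    overall = sum(priors[t] * per_group[t] for t in priors)
    return per_group, overall

# Hypothetical one-variable data; the point 4.0 sits near the group boundary.
data = [((0.0,), "A"), ((1.0,), "A"), ((4.0,), "A"),
        ((6.0,), "B"), ((7.0,), "B"), ((8.0,), "B")]
priors = {"A": 0.5, "B": 0.5}
resub_groups, resub_overall = error_count_estimate(data, priors, cross_validate=False)
loo_groups, loo_overall = error_count_estimate(data, priors, cross_validate=True)
```

In this example the resubstitution estimate is zero, while leave-one-out misclassifies the borderline observation once it no longer contributes to its own group mean. This illustrates why an estimate computed on the same data used to build the criterion tends to be optimistic.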

Do not confuse discriminant analysis with cluster analysis. All varieties of discriminant analysis require prior knowledge of the classes, usually in the form of a sample from each class. In cluster analysis, the data do not include information on class membership; the purpose is to construct a classification.

See Chapter 7, "Introduction to Discriminant Procedures," for a discussion of discriminant analysis and the SAS/STAT procedures available.


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.