Background
The following notation is used to describe the classification methods:
 x
 a p-dimensional vector containing the quantitative variables of an observation
 S_{p}
 the pooled covariance matrix
 t
 a subscript to distinguish the groups
 n_{t}
 the number of training set observations in group t
 m_{t}
 the p-dimensional vector containing variable means in group t
 S_{t}
 the covariance matrix within group t
 |S_{t}|
 the determinant of S_{t}
 q_{t}
 the prior probability of membership in group t
 p(t|x)
 the posterior probability of an observation x belonging to group t
 f_{t}
 the probability density function for group t
 f_{t}(x)
 the group-specific density estimate at x from group t
 f(x)
 ∑_{u} q_{u} f_{u}(x), the estimated unconditional density at x
 e_{t}
 the classification error rate for group t
Bayes' Theorem
Assuming that the prior probabilities of group membership are known and that the group-specific densities at x can be estimated, PROC DISCRIM computes p(t|x), the posterior probability of x belonging to group t, by applying Bayes' theorem:

p(t|x) = [(q_{t} f_{t}(x))/(f(x))]

PROC DISCRIM partitions a p-dimensional vector space into regions R_{t}, where the region R_{t} is the subspace containing all p-dimensional vectors y such that p(t|y) is the largest among all groups. An observation is classified as coming from group t if it lies in region R_{t}.
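Although PROC DISCRIM performs this computation internally, the application of Bayes' theorem can be illustrated with a short sketch (Python, not SAS; the group labels, priors, and density values below are hypothetical):

```python
# Bayes' theorem for posterior group probabilities:
# p(t|x) = q_t * f_t(x) / sum_u q_u * f_u(x)

def posteriors(priors, densities):
    """priors: {group: q_t}; densities: {group: f_t(x)} at a single point x."""
    f_x = sum(priors[t] * densities[t] for t in priors)  # unconditional density f(x)
    return {t: priors[t] * densities[t] / f_x for t in priors}

# Hypothetical priors and group-specific densities at one point x
post = posteriors({"A": 0.5, "B": 0.5}, {"A": 0.30, "B": 0.10})
# x falls in region R_t of the group with the largest posterior
best = max(post, key=post.get)
```

With equal priors, the posterior is driven entirely by the density ratio, so the observation here is assigned to group A.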
Parametric Methods
Assuming that each group has a multivariate normal distribution,
PROC DISCRIM develops a discriminant function or classification
criterion using a measure of generalized squared distance.
The classification criterion is based on either
the individual within-group covariance matrices
or the pooled covariance matrix; it also takes
into account the prior probabilities of the classes.
Each observation is placed in the class from which
it has the smallest generalized squared distance.
PROC DISCRIM also computes the posterior probability
of an observation belonging to each class.
The squared Mahalanobis distance from x to group t is

d_{t}^{2}(x) = (x - m_{t})' V_{t}^{-1} (x - m_{t})

where V_{t} = S_{t} if the within-group covariance matrices are used, or V_{t} = S_{p} if the pooled covariance matrix is used.
The group-specific density estimate at x from group t is then given by

f_{t}(x) = (2π)^{-p/2} |V_{t}|^{-1/2} exp( -0.5 d_{t}^{2}(x) )

Using Bayes' theorem, the posterior probability of x belonging to group t is

p(t|x) = [(q_{t} f_{t}(x))/(∑_{u} q_{u} f_{u}(x))]

where the summation is over all groups.
The generalized squared distance from x to group t is defined as

D_{t}^{2}(x) = d_{t}^{2}(x) + g_{1}(t) + g_{2}(t)

where

g_{1}(t) = ln |S_{t}| if the within-group covariance matrices are used, or g_{1}(t) = 0 if the pooled covariance matrix is used

and

g_{2}(t) = -2 ln(q_{t}) if the prior probabilities are not all equal, or g_{2}(t) = 0 if the prior probabilities are all equal.

The posterior probability of x belonging to group t is then equal to

p(t|x) = [(exp(-0.5 D_{t}^{2}(x)))/(∑_{u} exp(-0.5 D_{u}^{2}(x)))]

The discriminant scores are -0.5 D_{u}^{2}(x). An observation is classified into group u if setting t=u produces the largest value of p(t|x) or the smallest value of D_{t}^{2}(x). If this largest posterior probability is less than the threshold specified, x is classified into group OTHER.
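The parametric rule above can be sketched in a few lines (Python, not SAS). For simplicity the sketch assumes diagonal within-group covariance matrices, so the inverse is elementwise; the group names, means, variances, and priors are hypothetical:

```python
import math

# Generalized squared distance with diagonal within-group covariance:
# D_t^2(x) = d_t^2(x) + ln|S_t| - 2 ln(q_t)
def gen_sq_distance(x, mean, var, prior):
    d2 = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    g1 = sum(math.log(vi) for vi in var)   # ln|S_t| for a diagonal matrix
    g2 = -2.0 * math.log(prior)            # -2 ln(q_t)
    return d2 + g1 + g2

def classify(x, groups, threshold=0.0):
    # groups: {name: (mean, variances, prior)}; smallest D_t^2 wins
    D2 = {t: gen_sq_distance(x, m, v, q) for t, (m, v, q) in groups.items()}
    # posterior p(t|x) = exp(-0.5 D_t^2) / sum_u exp(-0.5 D_u^2)
    w = {t: math.exp(-0.5 * d) for t, d in D2.items()}
    total = sum(w.values())
    post = {t: wi / total for t, wi in w.items()}
    best = max(post, key=post.get)
    return best if post[best] >= threshold else "OTHER"

groups = {"A": ([0.0, 0.0], [1.0, 1.0], 0.5),
          "B": ([3.0, 3.0], [1.0, 1.0], 0.5)}
label = classify([0.2, -0.1], groups)
```

With equal covariance matrices and priors, g_{1} and g_{2} are the same for both groups, so the rule reduces to nearest Mahalanobis distance, as the text describes.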
Nonparametric Methods
Nonparametric discriminant methods are based on nonparametric estimates of group-specific probability densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a nonparametric density estimate in each group and to produce a classification criterion.
The kernel method uses uniform, normal, Epanechnikov,
biweight, or triweight kernels in the density estimation.
Either Mahalanobis distance or Euclidean
distance can be used to determine proximity.
When the k-nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covariance matrix. When a kernel method is used, the Mahalanobis distances are based on either the individual within-group covariance matrices or the pooled covariance matrix.
Either the full covariance matrix or the diagonal matrix of
variances can be used to calculate the Mahalanobis distances.
The squared distance between two observation vectors, x and y, in group t is given by

d_{t}^{2}(x,y) = (x - y)' V_{t}^{-1} (x - y)

where V_{t} has one of the following forms:
 V_{t} = S_{p}
 the pooled covariance matrix
 V_{t} = diag(S_{p})
 the diagonal matrix of the pooled covariance matrix
 V_{t} = S_{t}
 the within-group covariance matrix
 V_{t} = diag(S_{t})
 the diagonal matrix of the within-group covariance matrix
 V_{t} = I
 the identity matrix
The classification of an observation vector x is based on the estimated group-specific densities from the training set. From these estimated densities, the posterior probabilities of group membership at x are evaluated. An observation x is classified into group u if setting t=u produces the largest value of p(t|x). If there is a tie for the largest probability or if this largest probability is less than the threshold specified, x is classified into group OTHER.
The kernel method uses a fixed radius, r, and a
specified kernel, K_{t}, to estimate the group t
density at each observation vector x.
Let z be a p-dimensional vector. Then the volume of a p-dimensional unit sphere bounded by z'z = 1 is

v_{0} = [(π^{p/2})/(Γ([p/2] + 1))]

where Γ(·) represents the gamma function (refer to SAS Language Reference: Dictionary). Thus, in group t, the volume of a p-dimensional ellipsoid bounded by z' V_{t}^{-1} z = r^{2} is

v_{r}(t) = r^{p} |V_{t}|^{1/2} v_{0}
The kernel method uses one of the following
densities as the kernel density in group t.
Uniform Kernel

K_{t}(z) = [1/(v_{r}(t))] if z' V_{t}^{-1} z ≤ r^{2}
K_{t}(z) = 0 elsewhere

Normal Kernel (with mean zero, variance r^{2} V_{t})

K_{t}(z) = [1/(c_{0}(t))] exp( -[1/(2r^{2})] z' V_{t}^{-1} z )

where c_{0}(t) = (2π)^{p/2} r^{p} |V_{t}|^{1/2}.
Epanechnikov Kernel

K_{t}(z) = c_{1}(t) ( 1 - [1/(r^{2})] z' V_{t}^{-1} z ) if z' V_{t}^{-1} z ≤ r^{2}
K_{t}(z) = 0 elsewhere

where
c_{1}(t) = [1/(v_{r}(t))] ( 1 + [p/2] ).
Biweight Kernel

K_{t}(z) = c_{2}(t) ( 1 - [1/(r^{2})] z' V_{t}^{-1} z )^{2} if z' V_{t}^{-1} z ≤ r^{2}
K_{t}(z) = 0 elsewhere

where
c_{2}(t) = ( 1 + [p/4] ) c_{1}(t).
Triweight Kernel

K_{t}(z) = c_{3}(t) ( 1 - [1/(r^{2})] z' V_{t}^{-1} z )^{3} if z' V_{t}^{-1} z ≤ r^{2}
K_{t}(z) = 0 elsewhere

where
c_{3}(t) = ( 1 + [p/6] ) c_{2}(t).
The group t density at x is estimated by

f_{t}(x) = [1/(n_{t})] ∑_{y} K_{t}(x - y)

where the summation is over all observations y in group t, and K_{t} is the specified kernel function. The posterior probability of membership in group t is then given by

p(t|x) = [(q_{t} f_{t}(x))/(f(x))]

where f(x) = ∑_{u} q_{u} f_{u}(x) is the estimated unconditional density. If f(x) is zero, the observation x is classified into group OTHER.
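The kernel density estimate can be sketched as follows (Python, not SAS), using the normal kernel with a diagonal V_{t} so that the quadratic form and determinant reduce to elementwise operations; the training points, variances, and radius are hypothetical:

```python
import math

# Kernel density estimate f_t(x) = (1/n_t) * sum_y K_t(x - y) with the
# normal kernel K_t(z) = exp(-(1/(2 r^2)) z' V^{-1} z) / c_0,
# c_0 = (2*pi)^(p/2) * r^p * |V|^(1/2), for diagonal V (variances only).
def normal_kernel_density(x, train, var, r):
    p = len(x)
    c0 = (2 * math.pi) ** (p / 2) * r ** p * math.sqrt(math.prod(var))
    total = 0.0
    for y in train:
        quad = sum((xi - yi) ** 2 / vi for xi, yi, vi in zip(x, y, var))
        total += math.exp(-quad / (2 * r * r)) / c0
    return total / len(train)

# Hypothetical group t training observations and unit variances
f_t = normal_kernel_density([0.0, 0.0], [[0.1, 0.0], [-0.2, 0.1]],
                            [1.0, 1.0], r=1.0)
```

The resulting f_{t}(x) values feed directly into the Bayes posterior formula above.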
The uniform-kernel method treats K_{t}(z) as a multivariate uniform function with density uniformly distributed over the ellipsoid z' V_{t}^{-1} z ≤ r^{2}. Let k_{t} be the number of training set observations y from group t within the closed ellipsoid centered at x specified by d_{t}^{2}(x,y) ≤ r^{2}. Then the group t density at x is estimated by

f_{t}(x) = [(k_{t})/(n_{t} v_{r}(t))]

When the identity matrix or the pooled within-group covariance matrix is used in calculating the squared distance, v_{r}(t) is a constant, independent of group membership. The posterior probability of x belonging to group t is then given by

p(t|x) = [(q_{t} k_{t})/(n_{t} v_{r}(t) f(x))]

If the closed ellipsoid centered at x does not include any training set observations, f(x) is zero and x is classified into group OTHER.
When the prior probabilities are equal, p(t|x) is proportional to k_{t}/n_{t} and x is classified into the group that has the highest proportion of observations in the closed ellipsoid. When the prior probabilities are proportional to the group sizes, q_{t} = n_{t}/∑_{u} n_{u}, x is classified into the group that has the largest number of observations in the closed ellipsoid.
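The uniform-kernel count rule can be sketched directly (Python, not SAS), here with V_{t} = I so the ellipsoid is a Euclidean ball; the training data, priors, and radius are hypothetical:

```python
# Uniform-kernel rule: count training points of each group inside the
# closed ball of radius r around x; score q_t * k_t / n_t is
# proportional to the posterior p(t|x).
def uniform_kernel_classify(x, train, priors, r):
    # train: {group: list of observation vectors}; priors: {group: q_t}
    scores = {}
    for t, ys in train.items():
        k_t = sum(1 for y in ys
                  if sum((xi - yi) ** 2 for xi, yi in zip(x, y)) <= r * r)
        scores[t] = priors[t] * k_t / len(ys)
    if all(s == 0 for s in scores.values()):
        return "OTHER"        # empty ellipsoid means f(x) = 0
    return max(scores, key=scores.get)

train = {"A": [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]],
         "B": [[3.0, 3.0], [3.1, 2.9]]}
label = uniform_kernel_classify([0.1, 0.0], train, {"A": 0.5, "B": 0.5}, r=1.0)
```

With equal priors this reduces to comparing the proportions k_{t}/n_{t}, exactly as stated above.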
The nearest-neighbor method fixes the number, k, of training set points for each observation x. The method finds the radius r_{k}(x) that is the distance from x to the kth-nearest training set point in the metric V_{t}^{-1}. Consider a closed ellipsoid centered at x bounded by d_{t}^{2}(x,z) = r_{k}^{2}(x); the nearest-neighbor method is equivalent to the uniform-kernel method with a location-dependent radius r_{k}(x). Note that, with ties, more than k training set points can be in the ellipsoid.
Using the k-nearest-neighbor rule, the k (or more with ties) smallest distances are saved. Of these k distances, let k_{t} represent the number of distances that are associated with group t. Then, as in the uniform-kernel method, the estimated group t density at x is

f_{t}(x) = [(k_{t})/(n_{t} v_{k}(x))]
where v_{k}(x) is the volume of the ellipsoid bounded by d_{t}^{2}(x,z) = r_{k}^{2}(x). Since the pooled within-group covariance matrix is used to calculate the distances used in the nearest-neighbor method, the volume v_{k}(x) is a constant independent of group membership.
When k=1 is used in the nearest-neighbor rule, x is classified into the group associated with the y point that yields the smallest squared distance d_{t}^{2}(x,y). Prior probabilities affect nearest-neighbor results in the same way that they affect uniform-kernel results.
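The k-nearest-neighbor rule can be sketched as follows (Python, not SAS), with Euclidean distance standing in for the pooled-covariance metric; the training data, priors, and k are hypothetical:

```python
# k-nearest-neighbor rule: among the k nearest training points, let k_t
# be the count from group t; the score q_t * k_t / n_t is proportional
# to the posterior p(t|x).
def knn_classify(x, train, priors, k):
    dists = [(sum((xi - yi) ** 2 for xi, yi in zip(x, y)), t)
             for t, ys in train.items() for y in ys]
    dists.sort(key=lambda d: d[0])
    nearest = dists[:k]                   # the k smallest distances
    scores = {}
    for t, ys in train.items():
        k_t = sum(1 for _, g in nearest if g == t)
        scores[t] = priors[t] * k_t / len(ys)
    return max(scores, key=scores.get)

train = {"A": [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0]],
         "B": [[2.0, 2.0], [2.1, 2.0], [1.9, 2.1]]}
label = knn_classify([0.05, 0.05], train, {"A": 0.5, "B": 0.5}, k=3)
```

Setting k=1 recovers the nearest-neighbor special case described above.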
With a specified squared distance formula (METRIC=,
POOL=), the values of r and k determine the
degree of irregularity in the estimate of the density
function, and they are called smoothing parameters.
Small values of r or k produce jagged density estimates, and
large values of r or k produce smoother density estimates.
Various methods for choosing the smoothing parameters have been
suggested, and there is as yet no simple solution to this problem.
For a fixed kernel shape, one way to choose the smoothing
parameter r is to plot estimated densities with different
values of r and to choose the estimate that is most in
accordance with the prior information about the density.
For many applications, this approach is satisfactory.
Another way of selecting the smoothing parameter r
is to choose a value that optimizes a given criterion.
Different groups may have different sets of optimal values.
Assume that the unknown density has bounded and
continuous second derivatives and that the kernel
is a symmetric probability density function.
One criterion is to minimize an approximate mean integrated
square error of the estimated density (Rosenblatt 1956).
The resulting optimal value of r depends
on the density function and the kernel.
A reasonable choice for the smoothing parameter r is to
optimize the criterion with the assumption that group t
has a normal distribution with covariance matrix V_{t}.
Then, in group t, the resulting
optimal value for r is given by

r = [(A(K_{t}))/(n_{t})]^{[1/(p+4)]}
where the optimal constant A(K_{t}) depends
on the kernel K_{t} (Epanechnikov 1969).
For some useful kernels, closed-form expressions for the constants A(K_{t}) are available.
These selections of A(K_{t}) are derived under the
assumption that the data in each group are from a multivariate
normal distribution with covariance matrix V_{t}.
However, when the Euclidean distances are used in calculating the squared distance (V_{t} = I), the smoothing constant should be multiplied by s, where s is an estimate of standard deviations for all variables. A reasonable choice for s is

s = ( [1/p] ∑_{j} s_{jj} )^{1/2}

where the s_{jj} are the group t marginal variances.
The DISCRIM procedure uses only a single smoothing parameter for all groups.
However, with the selection of the matrix to be used in the
distance formula (using the METRIC= or POOL= option),
individual groups and variables can have different scalings.
When V_{t}, the matrix used in calculating the squared
distances, is an identity matrix, the kernel estimate on each
data point is scaled equally for all variables in all groups.
When V_{t} is the diagonal matrix of a covariance matrix, each variable in group t is scaled separately by its variance in the kernel estimation, where the variance can be the pooled variance (V_{t} = diag(S_{p})) or an individual within-group variance (V_{t} = diag(S_{t})).
When V_{t} is a full covariance matrix, the
variables in group t are scaled simultaneously
by V_{t} in the kernel estimation.
In nearest-neighbor methods, the choice of k is usually relatively uncritical (Hand 1982).
A practical approach is to try several different values of the smoothing parameters within the context of the particular application and to choose the one that gives the best cross-validated estimate of the error rate.
Classification Error-Rate Estimates
A classification criterion can be evaluated by its
performance in the classification of future observations.
PROC DISCRIM uses two types of error-rate estimates to evaluate the derived classification criterion based on parameters estimated by the training sample:
 error-count estimates
 posterior probability error-rate estimates.
The error-count estimate is calculated by applying the classification criterion derived from the training sample to a test set and then counting the number of misclassified observations. The group-specific error-count estimate is the proportion of misclassified observations in the group.
When the test set is independent of the
training sample, the estimate is unbiased.
However, it can have a large variance,
especially if the test set is small.
When the input data set is an ordinary SAS data set and no
independent test sets are available, the same data set can be
used both to define and to evaluate the classification criterion.
The resulting error-count estimate has an optimistic bias and is called an apparent error rate.
To reduce the bias, you can split the data into two
sets, one set for deriving the discriminant function
and the other set for estimating the error rate.
Such a split-sample method has the unfortunate effect of reducing the effective sample size.
Another way to reduce bias is cross validation (Lachenbruch
and Mickey 1968).
Cross validation treats n - 1 out of n training observations as a training set. It determines the discriminant functions based on these n - 1 observations and then applies them to classify the one observation left out. This is done for each of the n training observations.
The misclassification rate for each group is the proportion
of sample observations in that group that are misclassified.
This method achieves a nearly unbiased
estimate but with a relatively large variance.
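The leave-one-out procedure can be sketched as follows (Python, not SAS). A simple 1-nearest-neighbor classifier stands in for the discriminant function, and the data are hypothetical:

```python
# Leave-one-out cross validation of the group-specific error-count
# estimates: classify each observation using the other n-1 observations.
def loo_error_rates(data):
    # data: list of (vector, group) training observations
    errors, counts = {}, {}
    for i, (x, g) in enumerate(data):
        counts[g] = counts.get(g, 0) + 1
        rest = [data[j] for j in range(len(data)) if j != i]
        # nearest-neighbor prediction from the remaining observations
        pred = min(rest, key=lambda obs:
                   sum((a - b) ** 2 for a, b in zip(x, obs[0])))[1]
        if pred != g:
            errors[g] = errors.get(g, 0) + 1
    # proportion of misclassified observations per group
    return {g: errors.get(g, 0) / counts[g] for g in counts}

data = [([0.0, 0.0], "A"), ([0.1, 0.0], "A"), ([0.0, 0.1], "A"),
        ([2.0, 2.0], "B"), ([2.1, 2.0], "B"), ([1.0, 1.0], "B")]
rates = loo_error_rates(data)
```

Each observation is scored by a rule it did not help build, which is what removes the optimistic bias of the apparent error rate.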
To reduce the variance in an error-count estimate, smoothed error-rate estimates are suggested (Glick 1978). Instead of summing terms that are either zero or one as in the error-count estimator, the smoothed estimator uses a continuum of values between zero and one in the terms that are summed. The resulting estimator has a smaller variance than the error-count estimate.
The posterior probability error-rate estimates provided by the POSTERR option in the PROC DISCRIM statement (see the following section, "Posterior Probability Error-Rate Estimates") are smoothed error-rate estimates.
The posterior probability estimates for each group
are based on the posterior probabilities of the
observations classified into that same group.
The posterior probability estimates provide good estimates of
the error rate when the posterior probabilities are accurate.
When a parametric classification criterion (linear or quadratic discriminant function) is derived from a nonnormal population, the resulting posterior probability error-rate estimators may not be appropriate.
The overall error rate is estimated through a weighted average of the individual group-specific error-rate estimates, where the prior probabilities are used as the weights.
To reduce both the bias and the variance of the
estimator, Hora and Wilcox (1982) compute the posterior
probability estimates based on cross validation.
The resulting estimates are intended to have both
low variance from using the posterior probability
estimate and low bias from cross validation.
They use Monte Carlo studies on two-group multivariate normal distributions to compare the cross validation posterior probability estimates with three other estimators: the apparent error rate, cross validation estimator, and posterior probability estimator.
They conclude that the cross validation posterior probability
estimator has a lower mean squared error in their simulations.
Consider the plot shown in Figure 25.6 with two variables, X1 and X2, and two classes, A and B. The within-class covariance matrix is diagonal, with a positive value for X1 but zero for X2. Using a Moore-Penrose pseudo-inverse would effectively ignore X2 completely in doing the classification, and the two classes would have a zero generalized distance and could not be discriminated at all. The quasi-inverse used by PROC DISCRIM replaces the zero variance for X2 by a small positive number to remove the singularity. This allows X2 to be used in the discrimination and results correctly in a large generalized distance between the two classes and a zero error rate. It also allows new observations, such as the one indicated by N, to be classified in a reasonable way. PROC CANDISC also uses a quasi-inverse when the total-sample covariance matrix is considered to be singular and Mahalanobis distances are requested. This problem with singular within-class covariance matrices is discussed in Ripley (1996, p. 38). The use of the quasi-inverse is an innovation introduced by SAS Institute Inc.
Figure 25.6: Plot of Data with Singular Within-Class Covariance Matrix
Let S be a singular covariance matrix. The matrix S can be either a within-group covariance matrix, a pooled covariance matrix, or a total-sample covariance matrix. Let v be the number of variables in the VAR statement and the nullity n be the number of variables among them with (partial) R^{2} exceeding 1 - p, where p is the value of the SINGULAR= option. If the determinant of S (Testing of Homogeneity of Within Covariance Matrices) or the inverse of S (Squared Distances and Generalized Squared Distances) is required, a quasi-determinant or quasi-inverse is used instead. PROC DISCRIM scales each variable to unit total-sample variance before calculating this quasi-inverse. The calculation is based on the spectral decomposition S = Φ Λ Φ', where Λ is a diagonal matrix of eigenvalues λ_{j}, j = 1, ... , v, with λ_{i} ≥ λ_{j} when i < j, and Φ is a matrix with the corresponding orthonormal eigenvectors of S as columns. When the nullity n is less than v, set λ_{j}^{*} = λ_{j} for j = 1, ... , v-n, and λ_{j}^{*} = p λ̄ for j = v-n+1, ... , v, where

λ̄ = [1/(v-n)] ∑_{k=1}^{v-n} λ_{k}

When the nullity n is equal to v, set λ_{j}^{*} = p for j = 1, ... , v. A quasi-determinant is then defined as the product of the λ_{j}^{*}, j = 1, ... , v. Similarly, a quasi-inverse is then defined as S^{*} = Φ Λ^{*} Φ', where Λ^{*} is a diagonal matrix of the values [1/(λ_{j}^{*})], j = 1, ... , v.
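The eigenvalue adjustment can be sketched for the simple case of a diagonal matrix, which is already in spectral form (Python, not SAS; the tolerance value p below is a hypothetical stand-in for the SINGULAR= setting):

```python
# Quasi-determinant and quasi-inverse for a diagonal singular covariance
# matrix: zero eigenvalues are replaced by p times the mean of the
# nonzero eigenvalues; p is a small hypothetical singularity tolerance.
def quasi_eigenvalues(eigvals, p=1e-8):
    nonzero = [e for e in eigvals if e > 0]
    if not nonzero:                          # nullity equal to v
        return [p for _ in eigvals]
    lam_bar = sum(nonzero) / len(nonzero)    # mean of the retained eigenvalues
    return [e if e > 0 else p * lam_bar for e in eigvals]

# Diagonal covariance from the Figure 25.6 scenario: positive variance
# for X1, zero for X2
eig = quasi_eigenvalues([4.0, 0.0], p=1e-8)
quasi_det = eig[0] * eig[1]                  # product of adjusted eigenvalues
quasi_inv_diag = [1.0 / e for e in eig]      # diagonal of the quasi-inverse
```

Because the zero eigenvalue becomes a small positive number rather than being dropped, the X2 direction still contributes (strongly) to the generalized distance, which is the behavior the Figure 25.6 discussion describes.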
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.