The ACECLUS Procedure


It is well known from the literature on nonparametric statistics that variances and, hence, covariances can be computed from pairwise differences instead of deviations from means. (For example, Puri and Sen (1971, pp. 51-52) show that the variance is a U statistic of degree 2.) Let X = (x_{ij}) be the data matrix with n observations (rows) and v variables (columns), and let \bar{x}_j be the mean of the jth variable. The sample covariance matrix S = (s_{jk}) is usually defined as

s_{jk} = \frac{1}{n-1} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
The matrix S can also be computed as
s_{jk} = \frac{1}{n(n-1)} \sum_{i=2}^n \sum_{h=1}^{i-1} (x_{ij} - x_{hj})(x_{ik} - x_{hk})
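
The equivalence of the two formulas for S can be checked numerically. The following is a minimal sketch in plain Python, not part of PROC ACECLUS; the function names and the small data set are made up for illustration.

```python
from itertools import combinations

def cov_from_means(X):
    """Sample covariance matrix from deviations about the column means."""
    n, v = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(v)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)
             for k in range(v)] for j in range(v)]

def cov_from_pairs(X):
    """Same matrix from pairwise differences; no means are computed."""
    n, v = len(X), len(X[0])
    S = [[0.0] * v for _ in range(v)]
    for h, i in combinations(range(n), 2):   # each unordered pair once (h < i)
        diff = [X[i][j] - X[h][j] for j in range(v)]
        for j in range(v):
            for k in range(v):
                S[j][k] += diff[j] * diff[k]
    return [[s / (n * (n - 1)) for s in row] for row in S]

# Made-up illustrative data: 5 observations, 2 variables.
X = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [5.0, 3.0], [3.0, 4.0]]
S1 = cov_from_means(X)
S2 = cov_from_pairs(X)
```

Both functions return the same matrix up to rounding error, because the sum of squared pairwise differences equals n times the sum of squared deviations from the mean.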
Let W = (w_{jk}) be the pooled within-cluster covariance matrix, q be the number of clusters, n_c be the number of observations in the cth cluster, and
d''_{ic} = \begin{cases} 1 & \text{if observation } i \text{ is in cluster } c \\ 0 & \text{otherwise} \end{cases}
The matrix W is normally defined as
w_{jk} = \frac{1}{n-q} \sum_{c=1}^q \sum_{i=1}^n d''_{ic} (x_{ij} - \bar{x}_{cj})(x_{ik} - \bar{x}_{ck})
where \bar{x}_{cj} is the mean of the jth variable in cluster c. Let
d'_{ih} = \begin{cases} \frac{1}{n_c} & \text{if observations } i \text{ and } h \text{ are in cluster } c \\ 0 & \text{otherwise} \end{cases}
The matrix W can also be computed as
w_{jk} = \frac{1}{n-q} \sum_{i=2}^n \sum_{h=1}^{i-1} d'_{ih} (x_{ij} - x_{hj})(x_{ik} - x_{hk})
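
The two definitions of W can be compared in the same way when cluster membership is known. This sketch is illustrative only; the function names, labels, and data are assumptions introduced here, and the clusters are deliberately of unequal size to show that the 1/n_c weights handle that case.

```python
from itertools import combinations

def within_cov_from_means(X, labels):
    """Pooled within-cluster covariance from deviations about cluster means."""
    n, v = len(X), len(X[0])
    clusters = set(labels)
    W = [[0.0] * v for _ in range(v)]
    for c in clusters:
        rows = [x for x, lab in zip(X, labels) if lab == c]
        means = [sum(r[j] for r in rows) / len(rows) for j in range(v)]
        for r in rows:
            for j in range(v):
                for k in range(v):
                    W[j][k] += (r[j] - means[j]) * (r[k] - means[k])
    return [[w / (n - len(clusters)) for w in row] for row in W]

def within_cov_from_pairs(X, labels):
    """Same matrix from pairwise differences weighted by d'_{ih}."""
    n, v = len(X), len(X[0])
    sizes = {c: labels.count(c) for c in set(labels)}
    W = [[0.0] * v for _ in range(v)]
    for h, i in combinations(range(n), 2):
        if labels[h] != labels[i]:
            continue                        # d'_{ih} = 0 across clusters
        weight = 1.0 / sizes[labels[i]]     # d'_{ih} = 1 / n_c within cluster c
        diff = [X[i][j] - X[h][j] for j in range(v)]
        for j in range(v):
            for k in range(v):
                W[j][k] += weight * diff[j] * diff[k]
    return [[w / (n - len(sizes)) for w in row] for row in W]

# Made-up data: two clusters of unequal size (4 and 2 observations).
X = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.5], [2.0, 1.0], [10.0, 9.0], [11.0, 12.0]]
labels = ["a", "a", "a", "a", "b", "b"]
W1 = within_cov_from_means(X, labels)
W2 = within_cov_from_pairs(X, labels)
```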
If the clusters are not known, d'_{ih} cannot be determined. However, an approximation to W can be obtained by using instead
d_{ih} = \begin{cases} 1 & \text{if } \sum_{j=1}^v \sum_{k=1}^v m_{jk} (x_{ij} - x_{hj})(x_{ik} - x_{hk}) \leq u^2 \\ 0 & \text{otherwise} \end{cases}
where u is an appropriately chosen value and M = (m_{jk}) is an appropriate metric. Let A = (a_{jk}) be defined as
a_{jk} = \frac{ \sum_{i=2}^n \sum_{h=1}^{i-1} d_{ih} (x_{ij} - x_{hj})(x_{ik} - x_{hk}) }{ 2 \sum_{i=2}^n \sum_{h=1}^{i-1} d_{ih} }
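
The definition of A translates directly into a loop over pairs of observations. The sketch below is a plain-Python illustration rather than the procedure's implementation; the function name `approx_within_cov`, the data, and the cutoff are hypothetical. The data form two equal-sized, well-separated clusters, and the cutoff admits exactly the within-cluster pairs, so in this particular case A coincides with W.

```python
from itertools import combinations

def approx_within_cov(X, M, u):
    """Approximate within-cluster covariance A for metric M and cutoff u."""
    n, v = len(X), len(X[0])
    numerator = [[0.0] * v for _ in range(v)]
    close_pairs = 0
    for h, i in combinations(range(n), 2):
        diff = [X[i][j] - X[h][j] for j in range(v)]
        # Squared distance between observations i and h in the metric M.
        dist2 = sum(M[j][k] * diff[j] * diff[k]
                    for j in range(v) for k in range(v))
        if dist2 <= u * u:                  # d_{ih} = 1
            close_pairs += 1
            for j in range(v):
                for k in range(v):
                    numerator[j][k] += diff[j] * diff[k]
    if close_pairs == 0:
        raise ValueError("no pair of observations lies within the cutoff u")
    return [[x / (2 * close_pairs) for x in row] for row in numerator]

# Made-up data: two tight clusters of three observations each.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
     [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
A = approx_within_cov(X, identity, u=2.0)
```

For these data each diagonal element of A is 1/3 and each off-diagonal element is -1/6, which is also the pooled within-cluster covariance computed from the cluster means.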
If all of the following conditions hold, A equals W:

If the clusters are of unequal size, A gives more weight to large clusters than W does, but this discrepancy should be of little importance if the population within-cluster covariance matrices are equal. There may be large differences between A and W if the cutoff u does not discriminate between pairs in the same cluster and pairs in different clusters. Lack of discrimination may occur for one of the following reasons:

In the former case, little can be done to remedy the problem. The remaining question concerns how to choose M and u. Consider M first. The best choice for M is W^{-1}, but W is not known. The solution is to use an iterative algorithm:

  1. Obtain an initial estimate of A, such as the identity or the total-sample covariance matrix. (See the INITIAL= option in the PROC ACECLUS statement for more information.)
  2. Let M equal A^{-1}.
  3. Recompute A using the preceding formula.
  4. Repeat steps 2 and 3 until the estimate stabilizes.

Convergence is assessed by comparing values of A on successive iterations. Let A_i be the value of A on the ith iteration and A_0 be the initial estimate of A. Let Z be a user-specified v × v matrix. (See the METRIC= option in the PROC ACECLUS statement for more information.) The convergence measure is

e_i = \frac{1}{v} \| Z'(A_i - A_{i-1}) Z \|
where \| \cdot \| indicates the Euclidean norm, that is, the square root of the sum of the squares of the elements of the matrix. In PROC ACECLUS, Z can be the identity or an inverse factor of S or diag(S). Iteration stops when e_i falls below a user-specified value. (See the CONVERGE= option or the MAXITER= option in the PROC ACECLUS statement for more information.)
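
The iteration and its stopping rule can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the procedure: the cutoff u is held fixed, Z is the identity, the initial estimate is the identity, a singular A is not handled, and all function names and data are hypothetical.

```python
import math
from itertools import combinations

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(A)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def approx_within_cov(X, M, u):
    """Matrix A computed from pairs whose squared M-distance is at most u**2."""
    v = len(X[0])
    num = [[0.0] * v for _ in range(v)]
    count = 0
    for h, i in combinations(range(len(X)), 2):
        d = [X[i][j] - X[h][j] for j in range(v)]
        if sum(M[j][k] * d[j] * d[k] for j in range(v) for k in range(v)) <= u * u:
            count += 1
            for j in range(v):
                for k in range(v):
                    num[j][k] += d[j] * d[k]
    if count == 0:
        raise ValueError("no pair of observations lies within the cutoff u")
    return [[x / (2 * count) for x in row] for row in num]

def iterate_metric(X, u, A0, tol=1e-8, max_iter=20):
    """Repeat steps 2 and 3 until the convergence measure e_i drops below tol."""
    v = len(A0)
    A = A0
    for iteration in range(1, max_iter + 1):
        M = invert(A)                          # step 2: M = A^{-1}
        A_new = approx_within_cov(X, M, u)     # step 3: recompute A
        # Convergence measure e_i with Z = identity (Euclidean norm / v).
        e = math.sqrt(sum((A_new[j][k] - A[j][k]) ** 2
                          for j in range(v) for k in range(v))) / v
        A = A_new
        if e < tol:
            break
    return A, iteration

# Made-up data: two tight, well-separated clusters of three observations.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
     [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
A, n_iter = iterate_metric(X, u=3.0, A0=[[1.0, 0.0], [0.0, 1.0]])
```

With these data the same six within-cluster pairs are selected on both passes, so the estimate stops changing on the second iteration. Real data rarely separate this cleanly, which is why the cutoff and the iteration limit usually need experimentation.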

The remaining question of how to choose u has no simple answer. In practice, you must try several different values. PROC ACECLUS provides four different ways of specifying u:

In most cases, the analysis should begin with the last method using values of p between 0.5 and 0.01 and using the full covariance matrix as the initial estimate of A.

Proportions p are transformed to distances t using the formula

t^2 = 2v \left[ F^{-1}_{v,n-v}(p) \right]^{\frac{n-v}{n-1}}
where F^{-1}_{v,n-v} is the quantile (inverse cumulative distribution) function of an F random variable with v and n-v degrees of freedom. The squared Mahalanobis distance between a single pair of observations sampled from a multivariate normal distribution is distributed as 2v times an F random variable with v and n-v degrees of freedom. The distances between two pairs of observations are correlated if the pairs have an observation in common. The quantile function is raised to the power given in the preceding formula to compensate approximately for the correlations among distances between pairs of observations that share a member. Monte Carlo studies indicate that the approximation is acceptable if the number of observations exceeds the number of variables by at least 10 percent.
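
The transformation from a proportion to a distance can be illustrated for the special case v = 2, where the F quantile has a closed form and no statistical library is needed. The restriction to two variables, the function names, and the sample size are assumptions made for this sketch; for other values of v you would substitute a general F quantile function (for example, `scipy.stats.f.ppf`).

```python
import math

def f_quantile_df2(p, m):
    """Quantile of an F(2, m) variable.

    With 2 numerator degrees of freedom the CDF is
    1 - (1 + 2x/m)**(-m/2), which inverts in closed form.
    """
    return (m / 2.0) * ((1.0 - p) ** (-2.0 / m) - 1.0)

def proportion_to_distance(p, n):
    """Distance t for proportion p, valid for v = 2 variables only."""
    v = 2
    quantile = f_quantile_df2(p, n - v)
    # The quantile is raised to the power (n - v) / (n - 1), then scaled by 2v.
    t_squared = 2 * v * quantile ** ((n - v) / (n - 1))
    return math.sqrt(t_squared)

n = 50  # hypothetical number of observations
distances = {p: proportion_to_distance(p, n) for p in (0.5, 0.1, 0.01)}
```

Smaller proportions give smaller distances, so lowering p tightens the cutoff that decides which pairs contribute to A.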

If A becomes singular, step 2 in the iterative algorithm cannot be performed because A cannot be inverted. In this case, let Z be the matrix as defined in discussing the convergence measure, and let Z'AZ = R' \Lambda R, where R'R = RR' = I and \Lambda = (\lambda_{jk}) is diagonal. Let \Lambda^* = (\lambda_{jk}^*) be a diagonal matrix where \lambda_{jj}^* = \max(\lambda_{jj}, g \, \mathrm{trace}(\Lambda)), and 0 < g < 1 is a user-specified singularity criterion (see the SINGULAR= option in the PROC ACECLUS statement for more information). Then M is computed as ZR'(\Lambda^*)^{-1}RZ'.
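
The eigenvalue adjustment can be illustrated for a symmetric 2 × 2 matrix, for which the decomposition has a closed form. The sketch assumes Z is the identity, so that M reduces to R'(\Lambda^*)^{-1}R; the function name, the test matrices, and the value of g are made up for illustration and are not defaults of the procedure.

```python
import math

def regularized_metric(A, g):
    """Metric M for a symmetric 2x2 matrix A, with Z taken as the identity."""
    a, b, c = A[0][0], A[0][1], A[1][1]
    # Rotation angle that diagonalizes a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    R = [[cs, sn], [-sn, cs]]                 # rows are orthonormal eigenvectors
    lam = [a * cs * cs + 2.0 * b * cs * sn + c * sn * sn,
           a * sn * sn - 2.0 * b * cs * sn + c * cs * cs]
    # Raise any eigenvalue below g * trace(Lambda) up to that floor.
    floor = g * (lam[0] + lam[1])
    lam_star = [max(value, floor) for value in lam]
    # M = R' (Lambda*)^{-1} R
    return [[sum(R[e][j] * R[e][k] / lam_star[e] for e in range(2))
             for k in range(2)] for j in range(2)]

singular = [[1.0, 1.0], [1.0, 1.0]]           # eigenvalues 2 and 0
M = regularized_metric(singular, g=0.01)

well_conditioned = [[2.0, 0.0], [0.0, 1.0]]
M_plain = regularized_metric(well_conditioned, g=0.01)
```

For the singular matrix the zero eigenvalue is replaced by 0.02, so M is finite. For the well-conditioned matrix no eigenvalue falls below the floor and M is the ordinary inverse.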

The ACECLUS procedure differs from the method used by Art, Gnanadesikan, and Kettenring (1982) in several respects.

Analyses of Fisher's (1936) iris data, consisting of measurements of petal and sepal length and width for fifty specimens from each of three iris species, are summarized in Table 16.1. The number of misclassified observations out of 150 is given for four clustering methods:

Each hierarchical analysis is followed by the TREE procedure with NCL=3 to determine cluster assignments at the three-cluster level. Clusters with twenty or fewer observations are discarded by using the DOCK=20 option. The observations in a discarded cluster are considered unclassified.

Each method is applied to

Theoretically, the best results should be obtained by using the canonical variables from PROC CANDISC. PROC ACECLUS yields results comparable to PROC CANDISC for values of the PROPORTION= option ranging from 0.005 to 0.02. At PROPORTION=0.04, average linkage and the centroid method show some deterioration, but k-means and Ward's method continue to produce excellent classifications. At larger values of the PROPORTION= option, all methods perform poorly, although no worse than with four standardized principal components.

Table 16.1: Number of Misclassified and Unclassified Observations Using Fisher's (1936) Iris Data
                                        Clustering Method
Data                                    k-means  Ward's  Average Linkage  Centroid
raw data                                16*      16*     25+12**          14*
standardized data                       25       26      33+4             33+4
two standardized principal components   29       31      30+9             27+32
four standardized principal components  39       27      32+7             45+11
by ACECLUS P=0.32                       39       10+9    7+25
by ACECLUS P=0.16                       39       18+9    7+19             7+26
by ACECLUS P=0.08                       19       9       3+13             5+16
by ACECLUS P=0.04                       4        5       1+19             3+12
by ACECLUS P=0.02                       4        3       3                3
by ACECLUS P=0.01                       4        4       3                4
by ACECLUS P=0.005                      4        4       4                4
canonical variables                     3        5       4                4+1
* A single number represents misclassified observations with no unclassified observations.
** Where two numbers are separated by a plus sign, the first is the number of misclassified observations; the second is the number of unclassified observations.

This example demonstrates the following:

Although experience with the Art, Gnanadesikan, and Kettenring and PROC ACECLUS methods is limited, the results so far suggest that these methods help considerably more often than they hinder the subsequent cluster analysis, especially with normal-mixture techniques such as k-means and Ward's minimum variance method.


Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.