 The FASTCLUS Procedure

## Displayed Output

Unless the SHORT or SUMMARY option is specified, PROC FASTCLUS displays

• Initial Seeds, cluster seeds selected after one pass through the data
• Change in Cluster Seeds for each iteration, if you specify MAXITER=n with n > 1

If you specify the LEAST=p option with 1 < p < 2 and you omit the IRLS option, the Iteration History table includes an additional column. The column contains a character that identifies the method used in each iteration. At each iterative step, PROC FASTCLUS chooses the most efficient method for clustering the data, given the condition of the data, so the method chosen is data dependent. The possible values are as follows:

| Value | Method |
|-------|--------|
| N | Newton's method |
| I or L | iteratively weighted least squares (IRLS) |
| 1 | IRLS step, halved once |
| 2 | IRLS step, halved twice |
| 3 | IRLS step, halved three times |

PROC FASTCLUS displays a Cluster Summary, giving the following for each cluster:

• Cluster number
• Frequency, the number of observations in the cluster
• Weight, the sum of the weights of the observations in the cluster, if you specify the WEIGHT statement
• RMS Std Deviation, the root mean square across variables of the cluster standard deviations, which is equal to the root mean square distance between observations in the cluster
• Maximum Distance from Seed to Observation, the maximum distance from the cluster seed to any observation in the cluster
• Nearest Cluster, the number of the cluster with mean closest to the mean of the current cluster
• Centroid Distance, the distance between the centroids (means) of the current cluster and the nearest other cluster
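The Cluster Summary quantities listed above can be illustrated with a minimal Python sketch. It is not PROC FASTCLUS code: the function name `cluster_summary` and the sample data are hypothetical, observations are unweighted, each cluster standard deviation is assumed to use a divisor of (frequency - 1), and the cluster seed is assumed to coincide with the cluster mean, which need not hold if the procedure stops before the seeds converge.

```python
from math import dist, sqrt


def cluster_summary(points, labels):
    """Per-cluster summary for unweighted data with at least two clusters.

    Assumptions (not taken from the SAS documentation): the seed equals the
    cluster mean, and standard deviations use a divisor of (frequency - 1).
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    n_vars = len(points[0])

    # Group observations by cluster and compute each cluster mean (centroid).
    members = {k: [p for p, lab in zip(points, labels) if lab == k]
               for k in clusters}
    means = {k: tuple(sum(p[j] for p in pts) / len(pts) for j in range(n_vars))
             for k, pts in members.items()}

    summary = {}
    for k in clusters:
        pts, mean = members[k], means[k]
        freq = len(pts)
        if freq > 1:
            # Variance of each variable within the cluster.
            variances = [sum((p[j] - mean[j]) ** 2 for p in pts) / (freq - 1)
                         for j in range(n_vars)]
            # Root mean square, across variables, of the standard deviations.
            rms_std = sqrt(sum(variances) / n_vars)
        else:
            rms_std = None  # undefined for a single observation
        # Cluster whose mean is closest to the mean of the current cluster.
        nearest = min((o for o in clusters if o != k),
                      key=lambda o: dist(mean, means[o]))
        summary[k] = {
            "frequency": freq,
            "rms_std": rms_std,
            "max_distance": max(dist(p, mean) for p in pts),
            "nearest_cluster": nearest,
            "centroid_distance": dist(mean, means[nearest]),
        }
    return summary


# Hypothetical two-variable data with three well-separated clusters.
points = [(0, 0), (2, 0), (10, 0), (12, 0), (30, 0), (32, 0)]
labels = [1, 1, 2, 2, 3, 3]
for cluster, row in cluster_summary(points, labels).items():
    print(cluster, row)
```

For this sample, cluster 1 has frequency 2, an RMS standard deviation of 1.0, a maximum distance of 1.0, and cluster 2 as its nearest cluster at a centroid distance of 10.0.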

A table of statistics for each variable is displayed unless you specify the SUMMARY option. The table contains

• Total STD, the total standard deviation
• Within STD, the pooled within-cluster standard deviation
• R-Squared, the $R^2$ for predicting the variable from the cluster
• RSQ/(1 - RSQ), the ratio of between-cluster variance to within-cluster variance ($R^2 / (1 - R^2)$)
• OVER-ALL, all of the previous quantities pooled across variables
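As an illustration of the per-variable statistics above, the following Python sketch computes them for a single variable. It assumes unweighted observations and the usual analysis-of-variance definitions (total sum of squares divided by n - 1, within-cluster sum of squares divided by n - c). The function name `variable_stats` and the data are hypothetical, and the OVER-ALL row is not covered.

```python
from math import sqrt


def variable_stats(values, labels):
    """Total STD, Within STD, R-squared, and RSQ/(1 - RSQ) for one variable.

    Assumes unweighted data, more observations than clusters, and some
    within-cluster variation (otherwise the ratio is undefined).
    """
    n = len(values)
    clusters = sorted(set(labels))
    c = len(clusters)

    grand_mean = sum(values) / n
    ss_total = sum((x - grand_mean) ** 2 for x in values)

    # Pool the squared deviations from each cluster's own mean.
    ss_within = 0.0
    for k in clusters:
        members = [x for x, lab in zip(values, labels) if lab == k]
        mean_k = sum(members) / len(members)
        ss_within += sum((x - mean_k) ** 2 for x in members)

    r_squared = 1.0 - ss_within / ss_total
    return {
        "total_std": sqrt(ss_total / (n - 1)),
        "within_std": sqrt(ss_within / (n - c)),
        "r_squared": r_squared,
        "rsq_ratio": r_squared / (1.0 - r_squared),
    }


# Hypothetical data: one variable, two clusters of three observations each.
stats = variable_stats([1, 2, 3, 11, 12, 13], [1, 1, 1, 2, 2, 2])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Here the total sum of squares is 154 and the within-cluster sum of squares is 4, so the within standard deviation is 1 and the ratio is approximately 150/4 = 37.5.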

PROC FASTCLUS also displays

• Pseudo F Statistic,
$$\frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)}$$
where $R^2$ is the observed overall $R^2$, c is the number of clusters, and n is the number of observations. The pseudo F statistic was suggested by Calinski and Harabasz (1974). Refer to Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the use of the pseudo F statistic in estimating the number of clusters. See Example 23.2 in Chapter 23, "The CLUSTER Procedure," for a comparison of pseudo F statistics.
• Observed Overall R-Squared, if you specify the SUMMARY option
• Approximate Expected Overall R-Squared, the approximate expected value of the overall $R^2$ under the uniform null hypothesis assuming that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.
• Cubic Clustering Criterion, computed under the assumption that the variables are uncorrelated. The value is missing if the number of clusters is greater than one-fifth the number of observations.
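The pseudo F statistic in the list above is simple enough to check by hand. The following Python sketch evaluates the formula directly; the function name `pseudo_f` and the input values are hypothetical and are not part of PROC FASTCLUS.

```python
def pseudo_f(r_squared, n_clusters, n_obs):
    """Pseudo F = [R^2 / (c - 1)] / [(1 - R^2) / (n - c)]."""
    if not 1 < n_clusters < n_obs:
        raise ValueError("requires 1 < c < n")
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("requires 0 <= R^2 < 1")
    between = r_squared / (n_clusters - 1)        # numerator of the formula
    within = (1.0 - r_squared) / (n_obs - n_clusters)  # denominator
    return between / within


# Hypothetical inputs: overall R^2 = 0.75, c = 3 clusters, n = 103 observations.
print(pseudo_f(0.75, 3, 103))
```

With these inputs the numerator is 0.375 and the denominator is 0.0025, so the statistic is about 150.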

If you are interested in the approximate expected $R^2$ or the cubic clustering criterion but your variables are correlated, you should cluster principal component scores from the PRINCOMP procedure. Both of these statistics are described by Sarle (1983). The performance of the cubic clustering criterion in estimating the number of clusters is examined by Milligan and Cooper (1985) and Cooper and Milligan (1988).

• Distances Between Cluster Means, if you specify the DISTANCE option

Unless you specify the SHORT or SUMMARY option, PROC FASTCLUS displays

• Cluster Means for each variable
• Cluster Standard Deviations for each variable
