Introduction to Clustering Procedures

PROC CLUSTER is easier to use than PROC FASTCLUS because a single run produces solutions for every number of clusters, from one up to as many as you like. With PROC FASTCLUS, you must run the procedure once for each number of clusters you want to examine.
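
Repeated PROC FASTCLUS runs can be scripted with a macro loop. The following is a minimal sketch, not a recommended workflow: the macro name FASTCLUS_RANGE, the input data set MYDATA, and the output data set names CLUS2 through CLUS6 are all hypothetical.

```sas
/* Hypothetical helper: runs PROC FASTCLUS once for each number of clusters */
/* from &from to &to and saves the assignments in CLUS&k.                   */
%macro fastclus_range(data=, vars=, from=2, to=6);
   %local k;
   %do k = &from %to &to;
      proc fastclus data=&data maxclusters=&k out=clus&k noprint;
         var &vars;
      run;
   %end;
%mend fastclus_range;

/* Example call; MYDATA and the variables x, y, z are assumed to exist. */
%fastclus_range(data=mydata, vars=x y z, from=2, to=6);
```

Each OUT= data set contains the original observations together with the cluster assignment for that run, so the solutions can be compared afterward.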

The time required by PROC FASTCLUS is roughly proportional to the number of observations, whereas the time required by PROC CLUSTER with most methods varies with the square or cube of the number of observations. Therefore, you can use PROC FASTCLUS with much larger data sets than PROC CLUSTER.

If you want to hierarchically cluster a data set that is too large to use with PROC CLUSTER directly, you can have PROC FASTCLUS produce, for example, 50 clusters, and let PROC CLUSTER analyze these 50 clusters instead of the entire data set. The MEAN= data set produced by PROC FASTCLUS contains two special variables:

- The variable _FREQ_ gives the number of observations in the cluster.
- The variable _RMSSTD_ gives the root-mean-square across variables of the cluster standard deviations.
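
To see these variables, you can print a few rows of the MEAN= data set. This is only an illustrative sketch: the input data set name MYDATA is an assumption, and the analysis variables x, y, and z match the examples in this section.

```sas
/* Create 50 preliminary clusters; TEMP holds one observation per cluster. */
proc fastclus data=mydata maxclusters=50 mean=temp noprint;
   var x y z;
run;

/* Show the cluster number, its size, its pooled spread, and the cluster means. */
proc print data=temp(obs=5);
   var cluster _freq_ _rmsstd_ x y z;
run;
```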

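PROC CLUSTER uses these variables automatically when they are present in its input data set, so it gives correct results when clustering clusters. For example, you could specify Ward's minimum variance method:

    proc fastclus maxclusters=50 mean=temp;
       var x y z;
    run;

    /* With no DATA= option, PROC CLUSTER reads the most recently created data set (here, TEMP). */
    proc cluster method=ward outtree=tree;
       var x y z;
    run;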

or Wong's hybrid method (Wong 1982):

    proc fastclus maxclusters=50 mean=temp;
       var x y z;
    run;

    proc cluster method=density hybrid outtree=tree;
       var x y z;
    run;

More detailed examples are given in Chapter 23, "The CLUSTER Procedure."

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.