Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
The CLUSTER Procedure

Example 23.2: Crude Birth and Death Rates

The following example uses the SAS data set Poverty created in the "Getting Started" section. The data, from Rouncefield (1995), are birth rates, death rates, and infant death rates for 97 countries. Six cluster analyses are performed with eight methods. Scatter plots showing cluster membership at selected levels are produced instead of tree diagrams.

Each cluster analysis is performed by a macro called ANALYZE. The macro takes two arguments. The first, &METHOD, specifies the value of the METHOD= option to be used in the PROC CLUSTER statement. The second, &NCL, must be specified as a list of integers, separated by blanks, indicating the number of clusters desired in each scatter plot. For example, the first invocation of ANALYZE specifies the AVERAGE method and requests plots of 3 and 8 clusters. When two-stage density linkage is used, the K= and R= options are specified as part of the first argument.

The ANALYZE macro first invokes the CLUSTER procedure with METHOD=&METHOD, where &METHOD represents the value of the first argument to ANALYZE. This part of the macro produces the PROC CLUSTER output shown.

The %DO loop processes &NCL, the list of numbers of clusters to plot. The macro variable &K is a counter that indexes the numbers within &NCL. The %SCAN function picks out the &Kth number in &NCL, which is then assigned to the macro variable &N. When &K exceeds the number of numbers in &NCL, %SCAN returns a null string. Thus, the %DO loop executes while &N is not equal to a null string. In the %WHILE condition, a null string is indicated by the absence of any nonblank characters between the comparison operator (NE) and the right parenthesis that terminates the condition.

Within the %DO loop, the TREE procedure creates an output data set containing &N clusters. The GPLOT procedure then produces a scatter plot in which each observation is identified by the number of the cluster to which it belongs. The TITLE2 statement uses double quotes so that &N and &METHOD can be used within the title. At the end of the loop, &K is incremented by 1, and the next number is extracted from &NCL by %SCAN.

For this example, plots are obtained only for average linkage. To generate plots for other methods, follow the example shown in the first macro call. The following statements produce Output 23.2.1 through Output 23.2.7.

   title 'Cluster Analysis of Birth and Death Rates';

   %macro analyze(method,ncl);
   proc cluster data=poverty outtree=tree method=&method p=15 ccc pseudo;
      var birth death;
      title2;
   run;
   %let k=1;
   %let n=%scan(&ncl,&k);
   %do %while(&n NE);
      proc tree data=tree noprint out=out ncl=&n;
         copy birth death;
      run;
      legend1 frame cframe=ligr cborder=black 
              position=center value=(justify=center);
      axis1 label=(angle=90 rotate=0) minor=none;
      axis2 minor=none;
      proc gplot;
         plot death*birth=cluster / 
         frame cframe=ligr legend=legend1 vaxis=axis1 haxis=axis2;
         title2 "Plot of &n Clusters from METHOD=&METHOD";
      run;
      %let k=%eval(&k+1);
      %let n=%scan(&ncl,&k);
   %end;
   %mend;

   %analyze(average,3 8)
   %analyze(complete,3)
   %analyze(single,7 10)
   %analyze(two k=10,3)
   %analyze(two k=18,2)

For average linkage, the CCC has peaks at 3, 8, 10, and 12 clusters, but the 3-cluster peak is lower than the 8-cluster peak. The pseudo F statistic has peaks at 3, 8, and 12 clusters. The pseudo t2 statistic drops sharply at 3 clusters, continues to fall at 4 clusters, and has a particularly low value at 12 clusters. However, there are not enough data to seriously consider as many as 12 clusters. Scatter plots are given for 3 and 8 clusters. The results are shown in Output 23.2.1 through Output 23.2.3. In Output 23.2.3, the eighth cluster consists of the two outlying observations, Mexico and Korea.

Output 23.2.1: Clusters for Birth and Death Rates: METHOD=AVERAGE

Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Average Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix
  Eigenvalue Difference Proportion Cumulative
1 189.106588 173.101020 0.9220 0.9220
2 16.005568   0.0780 1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.127

Root-Mean-Square Distance Between Observations = 20.25399

Cluster History
NCL Clusters Joined FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Norm
RMS
Dist
T
i
e
15 CL27 CL20 18 0.0035 .980 .975 2.61 292 18.6 0.2325  
14 CL23 CL17 28 0.0034 .977 .972 1.97 271 17.7 0.2358  
13 CL18 CL54 8 0.0015 .975 .969 2.35 279 7.1 0.2432  
12 CL21 CL26 8 0.0015 .974 .966 2.85 290 6.1 0.2493  
11 CL19 CL24 12 0.0033 .971 .962 2.78 285 14.8 0.2767  
10 CL22 CL16 12 0.0036 .967 .957 2.84 284 17.4 0.2858  
9 CL15 CL28 22 0.0061 .961 .951 2.45 271 17.5 0.3353  
8 OB23 OB61 2 0.0014 .960 .943 3.59 302 . 0.3703  
7 CL25 CL11 17 0.0098 .950 .933 3.01 284 23.3 0.4033  
6 CL7 CL12 25 0.0122 .938 .920 2.63 273 14.8 0.4132  
5 CL10 CL14 40 0.0303 .907 .902 0.59 225 82.7 0.4584  
4 CL13 CL6 33 0.0244 .883 .875 0.77 234 22.2 0.5194  
3 CL9 CL8 24 0.0182 .865 .827 2.13 300 27.7 0.735  
2 CL5 CL3 64 0.1836 .681 .697 -.55 203 148 0.8402  
1 CL2 CL4 97 0.6810 .000 .000 0.00 . 203 1.3348  

Output 23.2.2: Plot of Three Clusters, METHOD=AVERAGE
clue2a3.gif (4459 bytes)

Output 23.2.3: Plot of Eight Clusters, METHOD=AVERAGE
clue2a8.gif (5108 bytes)

Complete linkage shows CCC peaks at 3, 8 and 12 clusters. The pseudo F statistic peaks at 3 and 12 clusters. The pseudo t2 statistic indicates 3 clusters.

The scatter plot for 3 clusters is shown. The results are shown in Output 23.2.4.

Output 23.2.4: Clusters for Birth and Death Rates: METHOD=COMPLETE

Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Complete Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix
  Eigenvalue Difference Proportion Cumulative
1 189.106588 173.101020 0.9220 0.9220
2 16.005568   0.0780 1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.127

Mean Distance Between Observations = 17.13099

Cluster History
NCL Clusters Joined FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Norm
Max
Dist
T
i
e
15 CL22 CL33 8 0.0015 .983 .975 3.80 329 6.1 0.4092  
14 CL56 CL18 8 0.0014 .981 .972 3.97 331 6.6 0.4255  
13 CL30 CL44 8 0.0019 .979 .969 4.04 330 19.0 0.4332  
12 OB23 OB61 2 0.0014 .978 .966 4.45 340 . 0.4378  
11 CL19 CL24 24 0.0034 .974 .962 4.17 327 24.1 0.4962  
10 CL17 CL28 12 0.0033 .971 .957 4.18 325 14.8 0.5204  
9 CL20 CL13 16 0.0067 .964 .951 3.38 297 25.2 0.5236  
8 CL11 CL21 32 0.0054 .959 .943 3.44 297 19.7 0.6001  
7 CL26 CL15 13 0.0096 .949 .933 2.93 282 28.9 0.7233  
6 CL14 CL10 20 0.0128 .937 .920 2.46 269 27.7 0.8033  
5 CL9 CL16 30 0.0237 .913 .902 1.29 241 47.1 0.8993  
4 CL6 CL7 33 0.0240 .889 .875 1.38 248 21.7 1.2165  
3 CL5 CL12 32 0.0178 .871 .827 2.56 317 13.6 1.2326  
2 CL3 CL8 64 0.1900 .681 .697 -.55 203 167 1.5412  
1 CL2 CL4 97 0.6810 .000 .000 0.00 . 203 2.5233  


clue2b3.gif (4453 bytes)

The CCC and pseudo F statistics are not appropriate for use with single linkage because of the method's tendency to chop off tails of distributions. The pseudo t2 statistic can be used by looking for large values and taking the number of clusters to be one greater than the level at which the large pseudo t2 value is displayed. For these data, there are large values at levels 6 and 9, suggesting 7 or 10 clusters.

The scatter plots for 7 and 10 clusters are shown. The results are shown in Output 23.2.5.

Output 23.2.5: Clusters for Birth and Death Rates: METHOD=SINGLE

Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Single Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix
  Eigenvalue Difference Proportion Cumulative
1 189.106588 173.101020 0.9220 0.9220
2 16.005568   0.0780 1.0000

Root-Mean-Square Total-Sample Standard Deviation = 10.127

Mean Distance Between Observations = 17.13099

Cluster History
NCL Clusters Joined FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Norm
Min
Dist
T
i
e
15 CL37 CL19 8 0.0014 .968 .975 -2.3 178 6.6 0.1331  
14 CL20 CL23 15 0.0059 .962 .972 -3.1 162 18.7 0.1412  
13 CL14 CL16 19 0.0054 .957 .969 -3.4 155 8.8 0.1442  
12 CL26 OB58 31 0.0014 .955 .966 -2.7 165 4.0 0.1486  
11 OB86 CL18 4 0.0003 .955 .962 -1.6 183 3.8 0.1495  
10 CL13 CL11 23 0.0088 .946 .957 -2.3 170 11.3 0.1518  
9 CL22 CL17 30 0.0235 .923 .951 -4.7 131 45.7 0.1593 T
8 CL15 CL10 31 0.0210 .902 .943 -5.8 117 21.8 0.1593  
7 CL9 OB75 31 0.0052 .897 .933 -4.7 130 4.0 0.1628  
6 CL7 CL12 62 0.2023 .694 .920 -15 41.3 223 0.1725  
5 CL6 CL8 93 0.6681 .026 .902 -26 0.6 199 0.1756  
4 CL5 OB48 94 0.0056 .021 .875 -24 0.7 0.5 0.1811 T
3 CL4 OB67 95 0.0083 .012 .827 -15 0.6 0.8 0.1811  
2 OB23 OB61 2 0.0014 .011 .697 -13 1.0 . 0.4378  
1 CL3 CL2 97 0.0109 .000 .000 0.00 . 1.0 0.5815  


clue2c7.gif (4894 bytes)

clue2c10.gif (5305 bytes)

For kth-nearest-neighbor density linkage, the number of modes as a function of k is as follows (not all of these analyses are shown):

k   modes
3 13
4 6
5-7 4
8-15 3
16-21 2
22+ 1

Thus, there is strong evidence of 3 modes and an indication of the possibility of 2 modes. Uniform-kernel density linkage gives similar results. For K=10 (10th-nearest-neighbor density linkage), the scatter plot for 3 clusters is shown; and for K=18, the scatter plot for 2 clusters is shown. The results are shown in Output 23.2.6.

Output 23.2.6: Clusters for Birth and Death Rates: METHOD=TWOSTAGE, K=10

Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
  Eigenvalue Difference Proportion Cumulative
1 189.106588 173.101020 0.9220 0.9220
2 16.005568   0.0780 1.0000

K = 10

Root-Mean-Square Total-Sample Standard Deviation = 10.127

Cluster History
NCL   FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Normalized
Fusion Density
Maximum Density
in Each Cluster
T
i
e
Clusters Joined Lesser Greater
15 CL16 OB94 22 0.0015 .921 .975 -11 68.4 1.4 9.2234 6.7927 15.3069  
14 CL19 OB49 28 0.0021 .919 .972 -11 72.4 1.8 8.7369 5.9334 33.4385  
13 CL15 OB52 23 0.0024 .917 .969 -10 76.9 2.3 8.5847 5.9651 15.3069  
12 CL13 OB96 24 0.0018 .915 .966 -9.3 83.0 1.6 7.9252 5.4724 15.3069  
11 CL12 OB93 25 0.0025 .912 .962 -8.5 89.5 2.2 7.8913 5.4401 15.3069  
10 CL11 OB78 26 0.0031 .909 .957 -7.7 96.9 2.5 7.787 5.4082 15.3069  
9 CL10 OB76 27 0.0026 .907 .951 -6.7 107 2.1 7.7133 5.4401 15.3069  
8 CL9 OB77 28 0.0023 .904 .943 -5.5 120 1.7 7.4256 4.9017 15.3069  
7 CL8 OB43 29 0.0022 .902 .933 -4.1 138 1.6 6.927 4.4764 15.3069  
6 CL7 OB87 30 0.0043 .898 .920 -2.7 160 3.1 4.932 2.9977 15.3069  
5 CL6 OB82 31 0.0055 .892 .902 -1.1 191 3.7 3.7331 2.1560 15.3069  
4 CL22 OB61 37 0.0079 .884 .875 0.93 237 10.6 3.1713 1.6308 100.0  
3 CL14 OB23 29 0.0126 .872 .827 2.60 320 10.4 2.0654 1.0744 33.4385  
2 CL4 CL3 66 0.2129 .659 .697 -1.3 183 172 12.409 33.4385 100.0  
1 CL2 CL5 97 0.6588 .000 .000 0.00 . 183 10.071 15.3069 100.0  

3 modal clusters have been formed.


clue2d3.gif (4478 bytes)

Output 23.2.7: Clusters for Birth and Death Rates: METHOD=TWOSTAGE, K=18

Cluster Analysis of Birth and Death Rates

The CLUSTER Procedure
Two-Stage Density Linkage Clustering

Eigenvalues of the Covariance Matrix
  Eigenvalue Difference Proportion Cumulative
1 189.106588 173.101020 0.9220 0.9220
2 16.005568   0.0780 1.0000

K = 18

Root-Mean-Square Total-Sample Standard Deviation = 10.127

Cluster History
NCL   FREQ SPRSQ RSQ ERSQ CCC PSF PST2 Normalized
Fusion Density
Maximum Density
in Each Cluster
T
i
e
Clusters Joined Lesser Greater
15 CL16 OB72 46 0.0107 .799 .975 -21 23.3 3.0 10.118 7.7445 23.4457  
14 CL15 OB94 47 0.0098 .789 .972 -21 23.9 2.7 9.676 7.1257 23.4457  
13 CL14 OB51 48 0.0037 .786 .969 -20 25.6 1.0 9.409 6.8398 23.4457 T
12 CL13 OB96 49 0.0099 .776 .966 -19 26.7 2.6 9.409 6.8398 23.4457  
11 CL12 OB76 50 0.0114 .764 .962 -19 27.9 2.9 8.8136 6.3138 23.4457  
10 CL11 OB77 51 0.0021 .762 .957 -18 31.0 0.5 8.6593 6.0751 23.4457  
9 CL10 OB78 52 0.0103 .752 .951 -17 33.3 2.5 8.6007 6.0976 23.4457  
8 CL9 OB43 53 0.0034 .748 .943 -16 37.8 0.8 8.4964 5.9160 23.4457  
7 CL8 OB93 54 0.0109 .737 .933 -15 42.1 2.6 8.367 5.7913 23.4457  
6 CL7 OB88 55 0.0110 .726 .920 -13 48.3 2.6 7.916 5.3679 23.4457  
5 CL6 OB87 56 0.0120 .714 .902 -12 57.5 2.7 6.6917 4.3415 23.4457  
4 CL20 OB61 39 0.0077 .707 .875 -9.8 74.7 8.3 6.2578 3.2882 100.0  
3 CL5 OB82 57 0.0138 .693 .827 -5.0 106 3.0 5.3605 3.2834 23.4457  
2 CL3 OB23 58 0.0117 .681 .697 -.54 203 2.5 3.2687 1.7568 23.4457  
1 CL2 CL4 97 0.6812 .000 .000 0.00 . 203 13.764 23.4457 100.0  

2 modal clusters have been formed.


clue2e2.gif (4305 bytes)

In summary, most of the clustering methods indicate 3 or 8 clusters. Most methods agree at the 3-cluster level, but at the other levels, there is considerable disagreement about the composition of the clusters. The presence of numerous ties also complicates the analysis; see Example 23.4.

Chapter Contents
Chapter Contents
Previous
Previous
Next
Next
Top
Top

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.