 The CLUSTER Procedure

## Example 23.6: Size, Shape, and Correlation

The following example shows the analysis of a data set in which size information is detrimental to the classification. Imagine that an archaeologist of the future is excavating a 20th-century grocery store. The archaeologist has discovered a large number of boxes of various sizes, shapes, and colors and wants to do a preliminary classification based on simple external measurements: height, width, depth, weight, and the predominant color of the box. It is known that a given product may have been sold in packages of different sizes, so the archaeologist wants to remove the effect of size from the classification. It is not known whether color is relevant to the use of the products, so the analysis should be done both with and without color information.

Unknown to the archaeologist, the boxes actually fall into six general categories according to the use of the product: breakfast cereals, crackers, laundry detergents, Little Debbie snacks, tea, and toothpaste. These categories are shown in the analysis so that you can evaluate the effectiveness of the classification.

Since there is no reason for the archaeologist to assume that the true categories have equal sample sizes or variances, the centroid method is used to avoid undue bias. Each analysis is done with Euclidean distances after suitable transformations of the data. Color is coded as five dummy variables with values of 0 or 1. The DATA step is as follows:

```
options ls=120;
title 'Cluster Analysis of Grocery Boxes';
data grocery2;
length name $35   /* name of product */
class $16  /* category of product */
unit $1    /* unit of measurement for weights:
g=gram
o=ounce
l=lb
all weights are converted to grams */
color $8   /* predominant color of box */
height 8   /* height of box in cm. */
width 8    /* width of box in cm. */
depth 8    /* depth of box (front to back) in cm. */
weight 8   /* weight of box in grams */
c_white c_yellow c_red c_green c_blue 4;
/* dummy variables */
retain class;
drop unit;

/*--- read name with possible embedded blanks ---*/
input name & @;

/*--- if name starts with "---",              ---*/
/*--- it's really a category value            ---*/
if substr(name,1,3) = '---' then do;
class = substr(name,4,index(substr(name,4),'-')-1);
delete;
return;
end;

/*--- read the rest of the variables ---*/
input height width depth weight unit color;

/*--- convert weights to grams ---*/
select (unit);
when ('l') weight = weight * 454;
when ('o') weight = weight * 28.3;
when ('g') ;
otherwise put 'Invalid unit ' unit;
end;

/*--- use 0/1 coding for dummy variables for colors ---*/
c_white  = (color = 'w');
c_yellow = (color = 'y');
c_red    = (color = 'r');
c_green  = (color = 'g');
c_blue   = (color = 'b');

datalines;

---Breakfast cereals---

Cheerios                            32.5 22.4  8.4  567 g y
Cheerios                            30.3 20.4  7.2  425 g y
Cheerios                            27.5 19    6.2  283 g y
Cheerios                            24.1 17.2  5.3  198 g y
Special K                           30.1 20.5  8.5   18 o w
Special K                           29.6 19.2  6.7   12 o w
Special K                           23.4 16.6  5.7    7 o w
Corn Flakes                         33.7 25.4  8     24 o w
Corn Flakes                         30.2 20.6  8.4   18 o w
Corn Flakes                         30   19.1  6.6   12 o w
Grape Nuts                          21.7 16.3  4.9  680 g w
Shredded Wheat                      19.7 19.9  7.5  283 g y
Shredded Wheat, Spoon Size          26.6 19.6  5.6  510 g r
All-Bran                            21.1 14.3  5.2 13.8 o y
Froot Loops                         30.2 20.8  8.5 19.7 o r
Froot Loops                         25   17.7  6.4   11 o r

---Crackers---

Wheatsworth                         11.1 25.2  5.5  326 g w
Ritz                                23.1 16    5.3  340 g r
Ritz                                23.1 20.7  5.2  454 g r
Premium Saltines                    11   25   10.7  454 g w
Waverly Wafers                      14.4 22.5  6.2  454 g g

---Detergent---

Arm & Hammer Detergent              38.8 30   16.9   25 l y
Arm & Hammer Detergent              39.5 25.8 11   14.2 l y
Arm & Hammer Detergent              33.7 22.8  7      7 l y
Arm & Hammer Detergent              27.8 19.4  6.3    4 l y
Tide                                39.4 24.8 11.3  9.2 l r
Tide                                32.5 23.2  7.3  4.5 l r
Tide                                26.5 19.9  6.3   42 o r
Tide                                19.3 14.6  4.7   17 o r

---Little Debbie---

Figaroos                            13.5 18.6  3.7   12 o y
Swiss Cake Rolls                    10.1 21.8  5.8   13 o w
Fudge Brownies                      11   30.8  2.5   12 o w
Marshmallow Supremes                 9.4 32    7     10 o w
Apple Delights                      11.2 30.1  4.9   15 o w
Snack Cakes                         13.4 32    3.4   13 o b
Nutty Bar                           13.2 18.5  4.2   12 o y
Lemon Stix                          13.2 18.5  4.2    9 o w
Fudge Rounds                         8.1 28.3  5.4  9.5 o w

---Tea---

Celestial Saesonings Mint Magic      7.8 13.8  6.3   49 g b
Celestial Saesonings Cranberry Cove  7.8 13.8  6.3   46 g r
Celestial Saesonings Sleepy Time     7.8 13.8  6.3   37 g g
Celestial Saesonings Lemon Zinger    7.8 13.8  6.3   56 g y
Bigelow Lemon Lift                   7.7 13.4  6.9   40 g y
Bigelow Plantation Mint              7.7 13.4  6.9   35 g g
Bigelow Earl Grey                    7.7 13.4  6.9   35 g b
Luzianne                             8.9 22.8  6.4    6 o r
Luzianne                            18.4 20.2  6.9    8 o r
Luzianne Decaffeinated               8.9 22.8  6.4 5.25 o g
Lipton Tea Bags                     17.1 20    6.7    8 o r
Lipton Tea Bags                     11.5 14.4  6.6 3.75 o r
Lipton Tea Bags                      6.7 10    5.7 1.25 o r
Lipton Family Size Tea Bags         13.7 24    9     12 o r
Lipton Family Size Tea Bags          8.7 20.8  8.2    6 o r
Lipton Family Size Tea Bags          8.9 11.1  8.2    3 o r
Lipton Loose Tea                    12.7 10.9  5.4    8 o r

---Paste, Tooth---

Colgate                              4.4 22    3.5    7 o r
Colgate                              3.6 15.6  3.3    3 o r
Colgate                              4.2 18.3  3.5    5 o r
Crest                                4.3 21.7  3.7  6.4 o w
Crest                                4.3 17.4  3.6  4.6 o w
Crest                                3.5 15.2  3.2  2.7 o w
Crest                                3.0 10.9  2.8  .85 o w
Arm & Hammer                         4.4 17    3.7    5 o w
;

data grocery;
length name $16;
set grocery2;
run;
```
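Before clustering, it can help to confirm that the DATA step picked up the category headers and converted the weights as intended. The following is a minimal sketch, not part of the original example; it assumes only the `grocery` data set created above. With the data lines shown, the six categories should total 63 boxes.

```sas
/*--- sanity check (illustrative, not part of the original example) ---*/
proc freq data=grocery;
   tables class / nocum;            /* number of boxes per true category */
run;

proc means data=grocery n min max;
   var height width depth weight;   /* weight should now be in grams */
run;
```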

The FORMAT procedure is used to define two formats that make the output easier to read. The STARS. format is used for graphical crosstabulations in the TABULATE procedure. The \$COLOR. format displays the names of the colors instead of just the first letter.

```
/*------ formats and macros for displaying ------*/
/*------ cluster results                   ------*/
proc format; value stars
0='               '
1='              #'
2='             ##'
3='            ###'
4='           ####'
5='          #####'
6='         ######'
7='        #######'
8='       ########'
9='      #########'
10='     ##########'
11='    ###########'
12='   ############'
13='  #############'
14=' ##############'
15-high='>##############';
run;

proc format; value $color
'w'='White'
'y'='Yellow'
'r'='Red'
'g'='Green'
'b'='Blue';
run;
```
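As a quick illustration of what the two formats produce, the following sketch writes a few formatted values to the log. It is not part of the original example; it assumes the PROC FORMAT steps above have already been run, and the loop values are arbitrary.

```sas
data _null_;
   length color $8;
   do n = 0, 3, 15, 40;
      put n= @10 n stars.;          /* counts of 15 or more share one label */
   end;
   do color = 'w', 'y', 'b';
      put color= @10 color $color.; /* one-letter code and its full name */
   end;
run;
```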

Since a full display of the results of each cluster analysis would be very long, a macro is used with five macro variables to select parts of the output. The macro variables are set to select only the PROC CLUSTER output and the crosstabulation of clusters and true categories for the first two analyses. The example could be run with different settings of the macro variables to show the full output or other selected parts.

```
%let cluster=1;   /* 1=show CLUSTER output, 0=don't */
%let tree=0;      /* 1=print TREE diagram, 0=don't */
%let list=0;      /* 1=list clusters, 0=don't */
%let crosstab=1;  /* 1=crosstabulate clusters and classes,
0=don't                              */
%let crosscol=0;  /* 1=crosstabulate clusters and colors,
0=don't                              */

/*--- define macro with options for TREE ---*/
%macro treeopt;
%if &tree %then h page=1;
%else noprint;
%mend;

/*--- define macro with options for CLUSTER ---*/
%macro clusopt;
%if &cluster %then pseudo ccc p=20;
%else noprint;
%mend;

/*------ macro for showing cluster results ------*/
%macro show(n); /* n=number of clusters
to show results for */

proc tree data=tree %treeopt n=&n out=out;
id name;
copy class height width depth weight color;
run;

%if &list %then %do;
proc sort;
by cluster;
run;

proc print;
var class name height width depth weight color;
by cluster clusname;
run;
%end;

%if &crosstab %then %do;
proc tabulate noseps /* formchar='           ' */;
class class cluster;
table cluster, class*n=' '*f=stars./rts=10 misstext=' ';
run;
%end;

%if &crosscol %then %do;
proc tabulate noseps /* formchar='           ' */;
class color cluster;
table cluster, color*n=' '*f=stars./rts=10 misstext=' ';
format color $color.;
run;
%end;
%mend;
```
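For instance, to see every part of the output for a given analysis, all five switches could be turned on before the PROC CLUSTER step and the %SHOW call are rerun. This is only a sketch of one possible setting. The MPRINT system option is an addition here (it does not appear in the original example); it makes the SAS statements generated by the macros visible in the log.

```sas
%let cluster=1;    /* show PROC CLUSTER output            */
%let tree=1;       /* print the tree diagram              */
%let list=1;       /* list the members of each cluster    */
%let crosstab=1;   /* crosstabulate clusters and classes  */
%let crosscol=1;   /* crosstabulate clusters and colors   */

options mprint;    /* echo macro-generated statements in the log */
```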

The first analysis uses the variables height, width, depth, and weight in standardized form to show the effect of including size information. The CCC, pseudo F, and pseudo t² statistics indicate 10 clusters. Most of the clusters do not correspond closely to the true categories, and four of the clusters have only one or two observations.

```
/**********************************************************/
/*                                                        */
/*       Analysis 1: standardized box measurements        */
/*                                                        */
/**********************************************************/
title2 'Analysis 1: Standardized data';
proc cluster data=grocery m=cen std %clusopt outtree=tree;
var height width depth weight;
id name;
copy class color;
run;

%show(10);
```

Output 23.6.1: Analysis of Standardized Data

 Cluster Analysis of Grocery Boxes Analysis 1: Standardized data

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Correlation Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.44512438 | 1.64456210 | 0.6113 | 0.6113 |
| 2 | 0.80056228 | 0.33149770 | 0.2001 | 0.8114 |
| 3 | 0.46906458 | 0.18381582 | 0.1173 | 0.9287 |
| 4 | 0.28524876 | | 0.0713 | 1.0000 |

 The data have been standardized to mean 0 and variance 1

 Root-Mean-Square Total-Sample Standard Deviation = 1

 Root-Mean-Square Distance Between Observations = 2.828427

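Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL22 | Lipton Family Si | 11 | 0.0028 | .974 | . | . | 85.4 | 4.5 | 0.3073 | |
| 19 | CL36 | Corn Flakes | 5 | 0.0026 | .972 | . | . | 83.7 | 15.3 | 0.3146 | |
| 18 | CL24 | CL41 | 12 | 0.0080 | .964 | . | . | 70.2 | 10.0 | 0.3316 | |
| 17 | CL18 | CL30 | 18 | 0.0144 | .949 | . | . | 53.8 | 12.7 | 0.3343 | |
| 16 | Marshmallow Supr | CL29 | 3 | 0.0024 | .947 | . | . | 55.8 | 4.7 | 0.3363 | |
| 15 | CL50 | CL33 | 7 | 0.0055 | .941 | . | . | 55.0 | 24.4 | 0.346 | |
| 14 | CL46 | CL15 | 10 | 0.0069 | .934 | . | . | 53.7 | 8.1 | 0.3192 | |
| 13 | CL27 | Lipton Family Si | 6 | 0.0035 | .931 | . | . | 56.1 | 6.3 | 0.362 | |
| 12 | CL31 | CL16 | 5 | 0.0075 | .923 | .861 | 8.03 | 55.8 | 6.6 | 0.4416 | |
| 11 | CL19 | CL23 | 7 | 0.0102 | .913 | .848 | 7.59 | 54.6 | 12.7 | 0.4713 | |
| 10 | Arm & Hammer Det | Tide | 2 | 0.0037 | .909 | .835 | 8.36 | 59.1 | . | 0.4781 | |
| 9 | CL11 | CL17 | 25 | 0.0393 | .870 | .819 | 4.72 | 45.2 | 19.3 | 0.4918 | |
| 8 | CL13 | CL14 | 16 | 0.0329 | .837 | .801 | 2.95 | 40.4 | 23.7 | 0.5215 | |
| 7 | CL8 | CL20 | 27 | 0.0629 | .774 | .779 | -.31 | 32.0 | 25.9 | 0.5467 | |
| 6 | CL7 | Crest | 28 | 0.0112 | .763 | .752 | 0.61 | 36.7 | 2.4 | 0.6003 | |
| 5 | CL9 | CL6 | 53 | 0.1879 | .575 | .718 | -5.9 | 19.6 | 43.4 | 0.6641 | |
| 4 | CL5 | CL21 | 55 | 0.0345 | .541 | .672 | -5.2 | 23.2 | 4.5 | 0.745 | |
| 3 | CL4 | CL12 | 60 | 0.1137 | .427 | .602 | -5.3 | 22.4 | 14.5 | 0.8769 | |
| 2 | CL3 | CL10 | 62 | 0.1511 | .276 | .471 | -4.3 | 23.2 | 15.8 | 1.5559 | |
| 1 | CL2 | Arm & Hammer Det | 63 | 0.2759 | .000 | .000 | 0.00 | . | 23.2 | 2.948 | |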

 class Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea CLUSTER ########### 1 2 ## # ### 3 ##### ## 4 ### ####### 5 ########### ## ### ## 6 ##### 7 # # 8 ## 9 # 10 #

The second analysis uses logarithms of height, width, depth, and the cube root of weight; the cube root is used for consistency with the linear measures. The rows are then centered to remove size information. Finally, the columns are standardized to have a standard deviation of 1. There is no compelling a priori reason to standardize the columns, but if they are not standardized, height dominates the analysis because of its large variance. The STANDARD procedure is used instead of the STD option in PROC CLUSTER so that a subsequent analysis can separately standardize the dummy variables for color.

```
/**********************************************************/
/*                                                        */
/*    Analysis 2: standardized row-centered logarithms    */
/*                                                        */
/**********************************************************/

title2 'Row-centered logarithms';
data shape;
set grocery;
array x height width depth weight;
array l l_height l_width l_depth l_weight;
/* logarithms */
weight=weight**(1/3);  /* take cube root to conform with
the other linear measurements */
do over l;             /* take logarithms */
l=log(x);
end;
mean=mean( of l(*));   /* find row mean of logarithms */
do over l;
l=l-mean;           /* center row */
end;
run;

title2 'Analysis 2: Standardized row-centered logarithms';
proc standard data=shape out=shapstan m=0 s=1;
var l_height l_width l_depth l_weight;
run;

proc cluster data=shapstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight;
id name;
copy class height width depth weight color;
run;

%show(8);
```
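Why row centering removes size can be seen from a short calculation. The notation is introduced only for this sketch: $x_1,\dots,x_4$ denote height, width, depth, and the cube root of weight for one box, and $k$ is a hypothetical factor by which all four are multiplied. This rests on the idealized assumption that weight is proportional to volume, so that the cube root of weight scales like a length.

$$
\begin{aligned}
\log(k x_j) - \frac{1}{4}\sum_{i=1}^{4}\log(k x_i)
  &= \log k + \log x_j - \frac{1}{4}\sum_{i=1}^{4}\bigl(\log k + \log x_i\bigr) \\
  &= \log x_j - \frac{1}{4}\sum_{i=1}^{4}\log x_i .
\end{aligned}
$$

The factor $k$ cancels, so two boxes with the same proportions but different overall size receive identical row-centered values. Real boxes satisfy this only approximately.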

The results of the second analysis are shown for eight clusters. Clusters 1 through 4 correspond fairly well to tea, toothpaste, breakfast cereals, and detergents. Crackers and Little Debbie products are scattered among several clusters.

Output 23.6.2: Analysis of Standardized Row-Centered Logarithms

 Cluster Analysis of Grocery Boxes Analysis 2: Standardized row-centered logarithms

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 1.94931049 | 0.34845395 | 0.4873 | 0.4873 |
| 2 | 1.60085654 | 1.15102358 | 0.4002 | 0.8875 |
| 3 | 0.44983296 | 0.44983296 | 0.1125 | 1.0000 |
| 4 | 0.00000000 | | 0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 1

 Root-Mean-Square Distance Between Observations = 2.828427

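Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL29 | All-Bran | 4 | 0.0017 | .977 | . | . | 94.7 | 2.9 | 0.2658 | |
| 19 | CL26 | CL27 | 8 | 0.0045 | .972 | . | . | 85.4 | 8.4 | 0.3047 | |
| 18 | Fudge Rounds | Crest | 2 | 0.0016 | .971 | . | . | 87.2 | . | 0.3193 | |
| 17 | Fudge Brownies | Snack Cakes | 2 | 0.0018 | .969 | . | . | 89.1 | . | 0.3331 | |
| 16 | Arm & Hammer Det | Lipton Loose Tea | 2 | 0.0019 | .967 | . | . | 91.3 | . | 0.3434 | |
| 15 | CL23 | CL18 | 5 | 0.0050 | .962 | . | . | 86.5 | 4.8 | 0.3587 | |
| 14 | CL37 | CL21 | 5 | 0.0051 | .957 | . | . | 83.5 | 10.4 | 0.3613 | |
| 13 | CL30 | CL24 | 9 | 0.0068 | .950 | . | . | 79.2 | 12.9 | 0.3682 | |
| 12 | CL32 | CL20 | 16 | 0.0142 | .936 | .892 | 5.75 | 67.6 | 29.3 | 0.3826 | |
| 11 | CL22 | Apple Delights | 4 | 0.0037 | .932 | .881 | 6.31 | 71.4 | 3.2 | 0.3901 | |
| 10 | CL11 | CL31 | 7 | 0.0090 | .923 | .869 | 6.17 | 70.8 | 6.3 | 0.4032 | |
| 9 | CL33 | CL13 | 11 | 0.0092 | .914 | .853 | 6.25 | 71.7 | 7.6 | 0.4181 | |
| 8 | CL19 | CL16 | 10 | 0.0131 | .901 | .835 | 6.12 | 71.4 | 10.9 | 0.503 | |
| 7 | CL14 | CL9 | 16 | 0.0297 | .871 | .813 | 4.63 | 63.1 | 15.6 | 0.5173 | |
| 6 | CL10 | CL15 | 12 | 0.0329 | .838 | .785 | 3.69 | 59.1 | 13.6 | 0.5916 | |
| 5 | CL6 | CL28 | 19 | 0.0557 | .783 | .748 | 2.01 | 52.2 | 15.8 | 0.6252 | |
| 4 | CL12 | CL8 | 26 | 0.0885 | .694 | .697 | -.16 | 44.6 | 48.8 | 0.6679 | |
| 3 | CL5 | CL17 | 21 | 0.0459 | .648 | .617 | 1.21 | 55.3 | 7.4 | 0.8863 | |
| 2 | CL4 | CL7 | 42 | 0.2841 | .364 | .384 | -.56 | 34.9 | 60.3 | 0.9429 | |
| 1 | CL2 | CL3 | 63 | 0.3640 | .000 | .000 | 0.00 | . | 34.9 | 0.8978 | |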

 class Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea CLUSTER # ########## 1 2 ####### 3 ############## ## 4 # ######## # 5 ## # ## 6 # #### 7 ## ##### 8 ##
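The zero fourth eigenvalue in this output is a consequence of the row centering. As a sketch, with notation introduced here only for illustration: let $l_{ij}$ be the row-centered logarithm of variable $j$ for box $i$, with column mean $m_j$ and column standard deviation $s_j$, and let $z_{ij}$ be the value after PROC STANDARD.

$$
\sum_{j=1}^{4} l_{ij} = 0 \ \text{for every } i
\;\Longrightarrow\; \sum_{j=1}^{4} m_j = 0,
\qquad
z_{ij} = \frac{l_{ij} - m_j}{s_j}
\;\Longrightarrow\;
\sum_{j=1}^{4} s_j\, z_{ij} = \sum_{j=1}^{4} l_{ij} - \sum_{j=1}^{4} m_j = 0 .
$$

The four standardized variables therefore satisfy one exact linear relation, so their covariance matrix has rank at most three and one eigenvalue is zero. The same argument applies whenever the rows are centered before the columns are standardized.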

The third analysis is similar to the second analysis except that the rows are standardized rather than just centered. There is a clear indication of seven clusters from the CCC, pseudo F, and pseudo t² statistics. The clusters are listed as well as crosstabulated with the true categories and colors.

```
/**********************************************************/
/*                                                        */
/*  Analysis 3: standardized row-standardized logarithms  */
/*                                                        */
/**********************************************************/

%let list=1;
%let crosscol=1;

title2 'Row-standardized logarithms';
data std;
set grocery;
array x height width depth weight;
array l l_height l_width l_depth l_weight;
/* logarithms */
weight=weight**(1/3); /* take cube root to conform with
the other linear measurements */
do over l;
l=log(x);          /* take logarithms */
end;
mean=mean( of l(*));  /* find row mean of logarithms */
std=std( of l(*));    /* find row standard deviation */
do over l;
l=(l-mean)/std;    /* standardize row */
end;
run;

title2 'Analysis 3: Standardized row-standardized logarithms';
proc standard data=std out=stdstan m=0 s=1;
var l_height l_width l_depth l_weight;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight;
id name;
copy class height width depth weight color;
run;

%show(7);
```

The output from the third analysis shows that cluster 1 contains 9 of the 17 teas. Cluster 2 contains all of the detergents plus Grape Nuts, a very heavy cereal. Cluster 3 includes all of the toothpastes and one Little Debbie product that is of very similar shape, although roughly twice as large. Cluster 4 has most of the cereals, Ritz crackers (which come in a box very similar to most of the cereal boxes), and Lipton Loose Tea (all the other teas in the sample come in tea bags). Clusters 5 and 6 each contain several Luzianne and Lipton teas and one or two miscellaneous items. Cluster 7 includes most of the Little Debbie products and two types of crackers. Thus, the crackers are not identified and the teas are broken up into three clusters, but the other categories correspond to single clusters. This analysis classifies toothpaste and Little Debbie products slightly better than the second analysis.

Output 23.6.3: Analysis of Standardized Row-Standardized Logarithms

 Cluster Analysis of Grocery Boxes Analysis 3: Standardized row-standardized logarithms

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.42684848 | 0.94583675 | 0.6067 | 0.6067 |
| 2 | 1.48101173 | 1.38887193 | 0.3703 | 0.9770 |
| 3 | 0.09213980 | 0.09213980 | 0.0230 | 1.0000 |
| 4 | -.00000000 | | -0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 1

 Root-Mean-Square Distance Between Observations = 2.828427

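Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL35 | CL33 | 8 | 0.0024 | .990 | . | . | 229 | 32.0 | 0.1923 | |
| 19 | CL22 | Ritz | 5 | 0.0010 | .989 | . | . | 224 | 2.9 | 0.2014 | |
| 18 | CL44 | CL27 | 6 | 0.0018 | .987 | . | . | 206 | 20.5 | 0.2073 | |
| 17 | CL18 | CL26 | 9 | 0.0025 | .985 | . | . | 187 | 6.4 | 0.1956 | |
| 16 | Fudge Rounds | Crest | 2 | 0.0009 | .984 | . | . | 192 | . | 0.24 | |
| 15 | CL24 | CL23 | 5 | 0.0029 | .981 | . | . | 177 | 7.8 | 0.2753 | |
| 14 | CL25 | Waverly Wafers | 4 | 0.0021 | .979 | . | . | 175 | 7.7 | 0.2917 | |
| 13 | CL30 | CL19 | 17 | 0.0101 | .969 | . | . | 130 | 41.0 | 0.2974 | |
| 12 | CL16 | CL31 | 9 | 0.0049 | .964 | .932 | 5.49 | 124 | 20.5 | 0.3121 | |
| 11 | CL21 | Lipton Family Si | 4 | 0.0029 | .961 | .924 | 5.81 | 129 | 8.2 | 0.3445 | |
| 10 | CL41 | CL11 | 6 | 0.0045 | .957 | .915 | 5.94 | 130 | 5.0 | 0.323 | |
| 9 | CL29 | Lipton Tea Bags | 4 | 0.0031 | .953 | .904 | 6.52 | 138 | 20.3 | 0.3603 | |
| 8 | CL14 | CL15 | 9 | 0.0101 | .943 | .890 | 6.08 | 131 | 10.7 | 0.3761 | |
| 7 | CL20 | Lipton Family Si | 9 | 0.0047 | .939 | .872 | 6.89 | 143 | 11.7 | 0.4063 | |
| 6 | CL13 | CL9 | 21 | 0.0272 | .911 | .848 | 5.23 | 117 | 30.0 | 0.5101 | |
| 5 | CL6 | CL17 | 30 | 0.0746 | .837 | .814 | 1.30 | 74.3 | 42.2 | 0.606 | |
| 4 | CL10 | CL7 | 15 | 0.0440 | .793 | .764 | 1.40 | 75.3 | 36.4 | 0.6152 | |
| 3 | CL8 | CL12 | 18 | 0.0642 | .729 | .681 | 2.02 | 80.6 | 44.0 | 0.6648 | |
| 2 | CL3 | CL4 | 33 | 0.2580 | .471 | .470 | 0.01 | 54.2 | 54.4 | 0.9887 | |
| 1 | CL5 | CL2 | 63 | 0.4707 | .000 | .000 | 0.00 | . | 54.2 | 0.9636 | |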

CLUSTER=1 CLUSNAME=CL7

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 1 | Tea | Bigelow Plantati | 7.7 | 13.4 | 6.9 | 3.27107 | g |
| 2 | Tea | Bigelow Earl Gre | 7.7 | 13.4 | 6.9 | 3.27107 | b |
| 3 | Tea | Celestial Saeson | 7.8 | 13.8 | 6.3 | 3.65931 | b |
| 4 | Tea | Celestial Saeson | 7.8 | 13.8 | 6.3 | 3.58305 | r |
| 5 | Tea | Bigelow Lemon Li | 7.7 | 13.4 | 6.9 | 3.41995 | y |
| 6 | Tea | Celestial Saeson | 7.8 | 13.8 | 6.3 | 3.82586 | y |
| 7 | Tea | Celestial Saeson | 7.8 | 13.8 | 6.3 | 3.33222 | g |
| 8 | Tea | Lipton Tea Bags | 6.7 | 10.0 | 5.7 | 3.28271 | r |
| 9 | Tea | Lipton Family Si | 8.9 | 11.1 | 8.2 | 4.39510 | r |

CLUSTER=2 CLUSNAME=CL17

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 10 | Detergent | Tide | 26.5 | 19.9 | 6.3 | 10.5928 | r |
| 11 | Detergent | Tide | 19.3 | 14.6 | 4.7 | 7.8357 | r |
| 12 | Detergent | Tide | 32.5 | 23.2 | 7.3 | 12.6889 | r |
| 13 | Breakfast cereal | Grape Nuts | 21.7 | 16.3 | 4.9 | 8.7937 | w |
| 14 | Detergent | Arm & Hammer Det | 33.7 | 22.8 | 7.0 | 14.7023 | y |
| 15 | Detergent | Arm & Hammer Det | 27.8 | 19.4 | 6.3 | 12.2003 | y |
| 16 | Detergent | Arm & Hammer Det | 38.8 | 30.0 | 16.9 | 22.4732 | y |
| 17 | Detergent | Tide | 39.4 | 24.8 | 11.3 | 16.1045 | r |
| 18 | Detergent | Arm & Hammer Det | 39.5 | 25.8 | 11.0 | 18.6115 | y |

CLUSTER=3 CLUSNAME=CL12

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 19 | Paste, Tooth | Colgate | 3.6 | 15.6 | 3.3 | 4.39510 | r |
| 20 | Paste, Tooth | Crest | 3.5 | 15.2 | 3.2 | 4.24343 | w |
| 21 | Paste, Tooth | Crest | 4.3 | 17.4 | 3.6 | 5.06813 | w |
| 22 | Paste, Tooth | Arm & Hammer | 4.4 | 17.0 | 3.7 | 5.21097 | w |
| 23 | Paste, Tooth | Colgate | 4.2 | 18.3 | 3.5 | 5.21097 | r |
| 24 | Paste, Tooth | Crest | 4.3 | 21.7 | 3.7 | 5.65790 | w |
| 25 | Paste, Tooth | Colgate | 4.4 | 22.0 | 3.5 | 5.82946 | r |
| 26 | Little Debbie | Fudge Rounds | 8.1 | 28.3 | 5.4 | 6.45411 | w |
| 27 | Paste, Tooth | Crest | 3.0 | 10.9 | 2.8 | 2.88670 | w |

CLUSTER=4 CLUSNAME=CL13

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 28 | Breakfast cereal | Cheerios | 27.5 | 19.0 | 6.2 | 6.56541 | y |
| 29 | Breakfast cereal | Froot Loops | 25.0 | 17.7 | 6.4 | 6.77735 | r |
| 30 | Breakfast cereal | Special K | 30.1 | 20.5 | 8.5 | 7.98644 | w |
| 31 | Breakfast cereal | Corn Flakes | 30.2 | 20.6 | 8.4 | 7.98644 | w |
| 32 | Breakfast cereal | Special K | 29.6 | 19.2 | 6.7 | 6.97679 | w |
| 33 | Breakfast cereal | Corn Flakes | 30.0 | 19.1 | 6.6 | 6.97679 | w |
| 34 | Breakfast cereal | Froot Loops | 30.2 | 20.8 | 8.5 | 8.23034 | r |
| 35 | Breakfast cereal | Cheerios | 30.3 | 20.4 | 7.2 | 7.51847 | y |
| 36 | Breakfast cereal | Cheerios | 24.1 | 17.2 | 5.3 | 5.82848 | y |
| 37 | Breakfast cereal | Corn Flakes | 33.7 | 25.4 | 8.0 | 8.79021 | w |
| 38 | Breakfast cereal | Special K | 23.4 | 16.6 | 5.7 | 5.82946 | w |
| 39 | Breakfast cereal | Cheerios | 32.5 | 22.4 | 8.4 | 8.27677 | y |
| 40 | Breakfast cereal | Shredded Wheat, | 26.6 | 19.6 | 5.6 | 7.98957 | r |
| 41 | Crackers | Ritz | 23.1 | 16.0 | 5.3 | 6.97953 | r |
| 42 | Breakfast cereal | All-Bran | 21.1 | 14.3 | 5.2 | 7.30951 | y |
| 43 | Tea | Lipton Loose Tea | 12.7 | 10.9 | 5.4 | 6.09479 | r |
| 44 | Crackers | Ritz | 23.1 | 20.7 | 5.2 | 7.68573 | r |

CLUSTER=5 CLUSNAME=CL10

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 45 | Tea | Luzianne | 8.9 | 22.8 | 6.4 | 5.53748 | r |
| 46 | Tea | Luzianne Decaffe | 8.9 | 22.8 | 6.4 | 5.29641 | g |
| 47 | Crackers | Premium Saltines | 11.0 | 25.0 | 10.7 | 7.68573 | w |
| 48 | Tea | Lipton Family Si | 8.7 | 20.8 | 8.2 | 5.53748 | r |
| 49 | Little Debbie | Marshmallow Supr | 9.4 | 32.0 | 7.0 | 6.56541 | w |
| 50 | Tea | Lipton Family Si | 13.7 | 24.0 | 9.0 | 6.97679 | r |

CLUSTER=6 CLUSNAME=CL9

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 51 | Tea | Luzianne | 18.4 | 20.2 | 6.9 | 6.09479 | r |
| 52 | Tea | Lipton Tea Bags | 17.1 | 20.0 | 6.7 | 6.09479 | r |
| 53 | Breakfast cereal | Shredded Wheat | 19.7 | 19.9 | 7.5 | 6.56541 | y |
| 54 | Tea | Lipton Tea Bags | 11.5 | 14.4 | 6.6 | 4.73448 | r |

CLUSTER=7 CLUSNAME=CL8

| Obs | class | name | height | width | depth | weight | color |
|---|---|---|---|---|---|---|---|
| 55 | Crackers | Wheatsworth | 11.1 | 25.2 | 5.5 | 6.88239 | w |
| 56 | Little Debbie | Swiss Cake Rolls | 10.1 | 21.8 | 5.8 | 7.16545 | w |
| 57 | Little Debbie | Figaroos | 13.5 | 18.6 | 3.7 | 6.97679 | y |
| 58 | Little Debbie | Nutty Bar | 13.2 | 18.5 | 4.2 | 6.97679 | y |
| 59 | Little Debbie | Apple Delights | 11.2 | 30.1 | 4.9 | 7.51552 | w |
| 60 | Little Debbie | Lemon Stix | 13.2 | 18.5 | 4.2 | 6.33884 | w |
| 61 | Little Debbie | Fudge Brownies | 11.0 | 30.8 | 2.5 | 6.97679 | w |
| 62 | Little Debbie | Snack Cakes | 13.4 | 32.0 | 3.4 | 7.16545 | b |
| 63 | Crackers | Waverly Wafers | 14.4 | 22.5 | 6.2 | 7.68573 | g |

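| CLUSTER | Breakfast cereal | Crackers | Detergent | Little Debbie | Paste, Tooth | Tea |
|---|---|---|---|---|---|---|
| 1 | | | | | | ######### |
| 2 | # | | ######## | | | |
| 3 | | | | # | ######## | |
| 4 | ############## | ## | | | | # |
| 5 | | # | | # | | #### |
| 6 | # | | | | | ### |
| 7 | | ## | | ####### | | |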

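| CLUSTER | Blue | Green | Red | White | Yellow |
|---|---|---|---|---|---|
| 1 | ## | ## | ### | | ## |
| 2 | | | #### | # | #### |
| 3 | | | ### | ###### | |
| 4 | | | ###### | ###### | ##### |
| 5 | | # | ### | ## | |
| 6 | | | ### | | # |
| 7 | # | # | | ##### | ## |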

The last several analyses include color. Obviously, the dummy variables must not be included in calculations to standardize the rows. If the five dummy variables are simply standardized to variance 1.0 and included with the other variables, color dominates the analysis. The dummy variables should be scaled to a smaller variance, which must be determined by trial and error. Four analyses are done using PROC STANDARD to scale the dummy variables to a standard deviation of 0.2, 0.3, 0.4, or 0.8. The cluster listings are suppressed.

Since dummy variables drastically violate the normality assumption on which the CCC depends, the CCC tends to indicate an excessively large number of clusters.
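To see why the scale of the dummy variables matters, consider their contribution to the squared Euclidean distance between two boxes. The symbols are introduced only for this sketch: $v_a$ is the sample variance of the raw 0/1 dummy variable for color $a$, and $s$ is the standard deviation requested from PROC STANDARD. After scaling, the dummy for color $a$ differs by $s/\sqrt{v_a}$ between a box of that color and any other box, so for two boxes of different colors $a$ and $b$:

$$
d^2_{\text{color}} = s^2\left(\frac{1}{v_a} + \frac{1}{v_b}\right),
\qquad
d^2_{\text{color}} = 0 \ \text{if the two boxes have the same color.}
$$

The variance of a 0/1 variable is at most about 1/4, so each term is at least roughly $4s^2$, and more for rare colors. With $s=1$ a color mismatch adds about 8 or more to the squared distance, which is comparable to the mean squared distance of 8 ($2.828427^2$) from the four standardized shape variables. With $s=0.2$ the minimum contribution drops to about 0.3.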

```
/************************************************************/
/*                                                          */
/* Analyses 4-7: standardized row-standardized logs & color */
/*                                                          */
/************************************************************/
%let list=0;
%let crosscol=1;

title2 'Analysis 4: Standardized row-standardized logarithms and color (s=.2)';
proc standard data=stdstan out=stdstan m=0 s=.2;
var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;

%show(7);

title2 'Analysis 5: Standardized row-standardized logarithms and color (s=.3)';
proc standard data=stdstan out=stdstan m=0 s=.3;
var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;

%show(6);

title2 'Analysis 6: Standardized row-standardized logarithms and color (s=.4)';
proc standard data=stdstan out=stdstan m=0 s=.4;
var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;

%show(3);

title2 'Analysis 7: Standardized row-standardized logarithms and color (s=.8)';
proc standard data=stdstan out=stdstan m=0 s=.8;
var c_:;
run;

proc cluster data=stdstan m=cen %clusopt outtree=tree;
var l_height l_width l_depth l_weight c_:;
id name;
copy class height width depth weight color;
run;

%show(10);
```
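The four steps above differ only in the S= value, the number of clusters shown, and the title, so they could also be generated by a small macro. The following is a sketch rather than part of the original example. The macro name %COLORSCALE and its parameters are hypothetical, and it relies on the %CLUSOPT and %SHOW macros and the STDSTAN data set defined earlier. Double quotes are used in the TITLE2 statement so that the macro variables resolve.

```sas
%macro colorscale(num, s, n);
   /* num = analysis number used in the title (hypothetical parameter) */
   /* s   = standard deviation for the color dummy variables           */
   /* n   = number of clusters passed to %show                         */
   title2 "Analysis &num: Standardized row-standardized logarithms and color (s=&s)";

   proc standard data=stdstan out=stdstan m=0 s=&s;
      var c_:;
   run;

   proc cluster data=stdstan m=cen %clusopt outtree=tree;
      var l_height l_width l_depth l_weight c_:;
      id name;
      copy class height width depth weight color;
   run;

   %show(&n)
%mend colorscale;

%colorscale(4, .2, 7)
%colorscale(5, .3, 6)
%colorscale(6, .4, 3)
%colorscale(7, .8, 10)
```

Trying a further scale value would then take a single additional call.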

Using PROC STANDARD on the dummy variables with S=0.2 causes four of the Little Debbie products to join the toothpastes. Using S=0.3 causes one of the tea clusters to merge with the breakfast cereals while three cereals defect to the detergents. Using S=0.4 produces three clusters consisting of (1) cereals and detergents, (2) Little Debbie products and toothpaste, and (3) teas, with crackers divided among all three clusters and a few other misclassifications. With S=0.8, ten clusters are indicated, each entirely monochrome. So, S=0.2 or S=0.3 degrades the classification, S=0.4 yields a good but perhaps excessively coarse classification, and higher values of the S= option produce clusters that are determined mainly by color.

Output 23.6.4: Analysis of Standardized Row-Standardized Logarithms and Color

 Cluster Analysis of Grocery Boxes Analysis 4: Standardized row-standardized logarithms and color (s=.2)

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.43584975 | 0.94791932 | 0.5800 | 0.5800 |
| 2 | 1.48793042 | 1.39363531 | 0.3543 | 0.9342 |
| 3 | 0.09429511 | 0.03686218 | 0.0225 | 0.9567 |
| 4 | 0.05743293 | 0.01036136 | 0.0137 | 0.9704 |
| 5 | 0.04707157 | 0.00489503 | 0.0112 | 0.9816 |
| 6 | 0.04217654 | 0.00693298 | 0.0100 | 0.9916 |
| 7 | 0.03524355 | 0.03524355 | 0.0084 | 1.0000 |
| 8 | -.00000000 | 0.00000000 | -0.0000 | 1.0000 |
| 9 | -.00000000 | | -0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 0.68313

 Root-Mean-Square Distance Between Observations = 2.898275

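Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL46 | Lemon Stix | 3 | 0.0016 | .968 | . | . | 67.5 | 11.9 | 0.2706 | |
| 19 | Luzianne | Lipton Family Si | 2 | 0.0014 | .966 | . | . | 69.7 | . | 0.2995 | |
| 18 | CL25 | CL37 | 6 | 0.0041 | .962 | . | . | 67.1 | 5.0 | 0.3081 | |
| 17 | CL33 | CL35 | 16 | 0.0099 | .952 | . | . | 57.2 | 16.7 | 0.3196 | |
| 16 | CL19 | Luzianne Decaffe | 3 | 0.0024 | .950 | . | . | 59.2 | 1.7 | 0.3357 | |
| 15 | CL30 | CL16 | 5 | 0.0042 | .946 | . | . | 59.5 | 2.7 | 0.3299 | |
| 14 | CL27 | CL18 | 8 | 0.0057 | .940 | . | . | 58.9 | 4.2 | 0.3429 | |
| 13 | CL20 | Fudge Brownies | 4 | 0.0031 | .937 | . | . | 61.7 | 3.6 | 0.3564 | |
| 12 | CL24 | Lipton Tea Bags | 4 | 0.0031 | .934 | .905 | 3.23 | 65.2 | 4.7 | 0.359 | |
| 11 | CL39 | CL28 | 6 | 0.0068 | .927 | .896 | 3.17 | 65.9 | 12.1 | 0.3743 | |
| 10 | CL13 | Snack Cakes | 5 | 0.0036 | .923 | .886 | 3.62 | 70.8 | 2.3 | 0.3755 | |
| 9 | CL11 | CL32 | 13 | 0.0176 | .906 | .874 | 2.70 | 64.8 | 16.0 | 0.4107 | |
| 8 | CL14 | Lipton Family Si | 9 | 0.0052 | .900 | .859 | 3.29 | 71.0 | 2.6 | 0.4265 | |
| 7 | Waverly Wafers | CL10 | 6 | 0.0052 | .895 | .841 | 4.09 | 79.8 | 2.4 | 0.4378 | |
| 6 | CL17 | CL12 | 20 | 0.0248 | .870 | .817 | 3.52 | 76.6 | 19.7 | 0.4898 | |
| 5 | CL15 | CL8 | 14 | 0.0326 | .838 | .783 | 3.08 | 75.0 | 14.0 | 0.5607 | |
| 4 | CL6 | CL21 | 30 | 0.0743 | .764 | .734 | 1.35 | 63.5 | 35.6 | 0.5877 | |
| 3 | CL9 | CL7 | 19 | 0.0579 | .706 | .653 | 2.17 | 72.0 | 22.8 | 0.6611 | |
| 2 | CL4 | CL3 | 49 | 0.3632 | .343 | .450 | -2.6 | 31.8 | 73.0 | 0.9838 | |
| 1 | CL2 | CL5 | 63 | 0.3426 | .000 | .000 | 0.00 | . | 31.8 | 0.9876 | |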

 class Breakfast cereal Crackers Detergent Little Debbie Paste, Tooth Tea CLUSTER ## ######## 1 2 # #### ######## 3 ############# ## # 4 # ### 5 # ##### 6 ######### 7 # ####

 color Blue Green Red White Yellow CLUSTER #### # ##### 1 2 ### ########## 3 ###### ###### #### 4 ### # 5 # # ## ## 6 ## ## ### ## 7 # ### #

 Cluster Analysis of Grocery Boxes Analysis 5: Standardized row-standardized logarithms and color (s=.3)

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.44752302 | 0.95026671 | 0.5500 | 0.5500 |
| 2 | 1.49725632 | 1.36701945 | 0.3365 | 0.8865 |
| 3 | 0.13023687 | 0.02135049 | 0.0293 | 0.9157 |
| 4 | 0.10888637 | 0.00867367 | 0.0245 | 0.9402 |
| 5 | 0.10021271 | 0.00628821 | 0.0225 | 0.9627 |
| 6 | 0.09392449 | 0.02196469 | 0.0211 | 0.9838 |
| 7 | 0.07195981 | 0.07195981 | 0.0162 | 1.0000 |
| 8 | 0.00000000 | 0.00000000 | 0.0000 | 1.0000 |
| 9 | -.00000000 | | -0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 0.703167

 Root-Mean-Square Distance Between Observations = 2.983287
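The three summary figures for Analysis 5 can be cross-checked against one another. The following is a sketch, assuming that the total variance equals the sum of the eigenvalues $\lambda_i$, that $p = 9$ variables enter the analysis (the four box measurements plus five color dummies), and that variances use the $n-1$ divisor; these assumptions are not stated in the listing itself, but the printed values agree with them:

$$
\begin{aligned}
\sum_{i=1}^{9} \lambda_i &\approx 4.4500 \\
\text{Proportion}_1 &= \frac{\lambda_1}{\sum_i \lambda_i} = \frac{2.44752302}{4.4500} \approx 0.5500 \\
\text{RMS total-sample standard deviation} &= \sqrt{\frac{\sum_i \lambda_i}{p}} = \sqrt{\frac{4.4500}{9}} \approx 0.703167 \\
\text{RMS distance between observations} &= \sqrt{2 \sum_i \lambda_i} = \sqrt{8.9000} \approx 2.983287
\end{aligned}
$$

The same relationships reproduce the values printed for Analyses 6 and 7, whose eigenvalues sum to approximately 4.8 and 7.2.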

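Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL24 | CL28 | 4 | 0.0038 | .953 | . | . | 45.7 | 2.7 | 0.3448 | |
| 19 | Grape Nuts | CL23 | 6 | 0.0033 | .950 | . | . | 46.0 | 3.5 | 0.3477 | |
| 18 | CL46 | Lemon Stix | 3 | 0.0027 | .947 | . | . | 47.1 | 21.9 | 0.3558 | |
| 17 | CL21 | Lipton Tea Bags | 4 | 0.0031 | .944 | . | . | 48.2 | 2.5 | 0.3577 | |
| 16 | CL39 | CL33 | 6 | 0.0064 | .937 | . | . | 46.9 | 12.1 | 0.3637 | |
| 15 | CL19 | CL29 | 14 | 0.0152 | .922 | . | . | 40.6 | 12.4 | 0.3707 | |
| 14 | CL18 | Fudge Brownies | 4 | 0.0035 | .919 | . | . | 42.5 | 2.5 | 0.3813 | |
| 13 | CL16 | CL25 | 13 | 0.0175 | .901 | . | . | 38.0 | 13.7 | 0.4103 | |
| 12 | CL22 | Lipton Family Si | 5 | 0.0049 | .896 | .875 | 1.76 | 40.0 | 3.2 | 0.4353 | |
| 11 | CL12 | CL37 | 7 | 0.0089 | .887 | .865 | 1.71 | 40.9 | 4.6 | 0.4397 | |
| 10 | CL20 | Luzianne Decaffe | 5 | 0.0056 | .882 | .854 | 2.02 | 43.9 | 2.5 | 0.4669 | |
| 9 | CL26 | CL17 | 16 | 0.0222 | .859 | .841 | 1.20 | 41.3 | 16.6 | 0.479 | |
| 8 | CL32 | CL11 | 9 | 0.0125 | .847 | .826 | 1.31 | 43.5 | 4.5 | 0.4988 | |
| 7 | CL14 | Snack Cakes | 5 | 0.0070 | .840 | .806 | 1.95 | 49.0 | 3.3 | 0.519 | |
| 6 | Waverly Wafers | CL7 | 6 | 0.0077 | .832 | .782 | 2.79 | 56.6 | 2.3 | 0.5366 | |
| 5 | CL9 | CL15 | 30 | 0.0716 | .761 | .749 | 0.54 | 46.1 | 28.3 | 0.5452 | |
| 4 | CL10 | CL8 | 14 | 0.0318 | .729 | .700 | 1.21 | 52.9 | 8.6 | 0.5542 | |
| 3 | CL5 | CL6 | 36 | 0.0685 | .660 | .622 | 1.50 | 58.3 | 14.2 | 0.6516 | |
| 2 | CL13 | CL4 | 27 | 0.2008 | .460 | .427 | 0.90 | 51.9 | 46.6 | 0.9611 | |
| 1 | CL3 | CL2 | 63 | 0.4595 | .000 | .000 | 0.00 | . | 51.9 | 0.9609 | |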
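The RSQ and PSF columns of the Analysis 5 history can be rebuilt from SPRSQ. This is a sketch, assuming the usual definitions: SPRSQ is the drop in $R^2$ caused by a merge, so the $R^2$ for $k$ clusters is the sum of the SPRSQ values for all merges that leave fewer than $k$ clusters, and PSF is the pseudo $F$ statistic with $n = 63$ observations (the FREQ of the final cluster):

$$
\begin{aligned}
R^2_{2} &= 0.4595 \approx .460 \\
R^2_{3} &= 0.4595 + 0.2008 = 0.6603 \approx .660 \\
\text{PSF}_{k} &= \frac{R^2_{k} / (k-1)}{(1 - R^2_{k}) / (n-k)} \\
\text{PSF}_{2} &= \frac{0.4595 / 1}{0.5405 / 61} \approx 51.9 \\
\text{PSF}_{3} &= \frac{0.6603 / 2}{0.3397 / 60} \approx 58.3
\end{aligned}
$$

Because the printed SPRSQ values are rounded to four decimals, sums taken over many rows can differ from the printed RSQ in the last digit.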

[Chart: CLUSTER (1 to 6) by class (Breakfast cereal; Crackers; Detergent; Little Debbie; Paste, Tooth; Tea)]

[Chart: CLUSTER (1 to 6) by color (Blue, Green, Red, White, Yellow)]

 Cluster Analysis of Grocery Boxes Analysis 6: Standardized row-standardized logarithms and color (s=.4)

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.46469435 | 0.95296119 | 0.5135 | 0.5135 |
| 2 | 1.51173316 | 1.28149311 | 0.3149 | 0.8284 |
| 3 | 0.23024005 | 0.04306536 | 0.0480 | 0.8764 |
| 4 | 0.18717469 | 0.01766446 | 0.0390 | 0.9154 |
| 5 | 0.16951023 | 0.01827481 | 0.0353 | 0.9507 |
| 6 | 0.15123542 | 0.06582379 | 0.0315 | 0.9822 |
| 7 | 0.08541162 | 0.08541162 | 0.0178 | 1.0000 |
| 8 | -.00000000 | 0.00000000 | -0.0000 | 1.0000 |
| 9 | -.00000000 | | -0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 0.730297

 Root-Mean-Square Distance Between Observations = 3.098387

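Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL29 | CL44 | 10 | 0.0074 | .955 | . | . | 47.7 | 8.2 | 0.3789 | |
| 19 | CL38 | Lipton Family Si | 3 | 0.0031 | .952 | . | . | 48.1 | 9.3 | 0.3792 | |
| 18 | CL25 | CL41 | 11 | 0.0155 | .936 | . | . | 38.8 | 36.7 | 0.4192 | |
| 17 | CL23 | CL43 | 10 | 0.0120 | .924 | . | . | 35.0 | 11.6 | 0.4208 | |
| 16 | Grape Nuts | CL26 | 6 | 0.0050 | .919 | . | . | 35.6 | 5.8 | 0.4321 | |
| 15 | CL19 | CL31 | 5 | 0.0074 | .912 | . | . | 35.4 | 5.3 | 0.4362 | |
| 14 | Premium Saltines | CL27 | 4 | 0.0046 | .907 | . | . | 36.8 | 2.9 | 0.4374 | |
| 13 | CL18 | CL20 | 21 | 0.0352 | .872 | . | . | 28.4 | 19.7 | 0.4562 | |
| 12 | CL13 | CL16 | 27 | 0.0372 | .835 | .839 | -.37 | 23.4 | 12.0 | 0.4968 | |
| 11 | CL21 | CL17 | 15 | 0.0289 | .806 | .828 | -1.5 | 21.6 | 13.6 | 0.5183 | |
| 10 | CL14 | CL15 | 9 | 0.0200 | .786 | .815 | -1.8 | 21.6 | 7.2 | 0.5281 | |
| 9 | Waverly Wafers | Luzianne Decaffe | 2 | 0.0047 | .781 | .801 | -1.2 | 24.1 | . | 0.5425 | |
| 8 | CL10 | CL24 | 12 | 0.0243 | .757 | .785 | -1.3 | 24.5 | 5.8 | 0.5783 | |
| 7 | CL12 | CL46 | 29 | 0.0224 | .735 | .765 | -1.3 | 25.8 | 5.3 | 0.6105 | |
| 6 | CL8 | CL37 | 14 | 0.0220 | .712 | .740 | -1.1 | 28.3 | 4.0 | 0.6313 | |
| 5 | CL6 | CL32 | 16 | 0.0251 | .687 | .707 | -.78 | 31.9 | 3.9 | 0.6664 | |
| 4 | CL11 | CL9 | 17 | 0.0287 | .659 | .660 | -.04 | 38.0 | 7.0 | 0.7098 | |
| 3 | CL4 | Snack Cakes | 18 | 0.0180 | .641 | .584 | 2.21 | 53.5 | 3.2 | 0.7678 | |
| 2 | CL3 | CL5 | 34 | 0.2175 | .423 | .400 | 0.67 | 44.8 | 31.4 | 0.8923 | |
| 1 | CL7 | CL2 | 63 | 0.4232 | .000 | .000 | 0.00 | . | 44.8 | 0.9156 | |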

[Chart: CLUSTER (1 to 3) by class (Breakfast cereal; Crackers; Detergent; Little Debbie; Paste, Tooth; Tea)]

[Chart: CLUSTER (1 to 3) by color (Blue, Green, Red, White, Yellow)]

 Cluster Analysis of Grocery Boxes Analysis 7: Standardized row-standardized logarithms and color (s=.8)

 The CLUSTER Procedure Centroid Hierarchical Cluster Analysis

Eigenvalues of the Covariance Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.61400794 | 0.93268930 | 0.3631 | 0.3631 |
| 2 | 1.68131864 | 0.77645948 | 0.2335 | 0.5966 |
| 3 | 0.90485916 | 0.22547234 | 0.1257 | 0.7222 |
| 4 | 0.67938683 | 0.00292216 | 0.0944 | 0.8166 |
| 5 | 0.67646466 | 0.12119211 | 0.0940 | 0.9106 |
| 6 | 0.55527255 | 0.46658428 | 0.0771 | 0.9877 |
| 7 | 0.08868827 | 0.08868827 | 0.0123 | 1.0000 |
| 8 | -.00000000 | 0.00000000 | -0.0000 | 1.0000 |
| 9 | -.00000000 | | -0.0000 | 1.0000 |

 Root-Mean-Square Total-Sample Standard Deviation = 0.894427

 Root-Mean-Square Distance Between Observations = 3.794733

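Cluster History

| NCL | Clusters Joined | | FREQ | SPRSQ | RSQ | ERSQ | CCC | PSF | PST2 | NormCentDist | Tie |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | CL29 | CL44 | 10 | 0.0049 | .970 | . | . | 72.7 | 8.2 | 0.3094 | |
| 19 | CL38 | Lipton Family Si | 3 | 0.0021 | .968 | . | . | 73.3 | 9.3 | 0.3096 | |
| 18 | CL21 | CL23 | 12 | 0.0153 | .952 | . | . | 53.0 | 15.0 | 0.4029 | |
| 17 | Waverly Wafers | Luzianne Decaffe | 2 | 0.0032 | .949 | . | . | 53.8 | . | 0.443 | |
| 16 | CL27 | CL24 | 6 | 0.0095 | .940 | . | . | 48.9 | 10.4 | 0.444 | |
| 15 | CL19 | CL16 | 9 | 0.0136 | .926 | . | . | 43.0 | 6.1 | 0.4587 | |
| 14 | CL41 | Grape Nuts | 7 | 0.0058 | .920 | . | . | 43.6 | 51.2 | 0.4591 | |
| 13 | CL26 | CL46 | 7 | 0.0105 | .910 | . | . | 42.1 | 22.0 | 0.4769 | |
| 12 | CL25 | CL13 | 12 | 0.0205 | .889 | .743 | 16.5 | 37.3 | 13.8 | 0.467 | |
| 11 | CL18 | Premium Saltines | 13 | 0.0093 | .880 | .726 | 16.7 | 38.2 | 4.0 | 0.5586 | |
| 10 | CL17 | CL37 | 4 | 0.0134 | .867 | .706 | 16.5 | 38.3 | 7.9 | 0.6454 | |
| 9 | CL14 | CL20 | 17 | 0.0567 | .810 | .684 | 11.0 | 28.8 | 52.6 | 0.6534 | |
| 8 | CL12 | CL9 | 29 | 0.0828 | .727 | .659 | 5.03 | 20.9 | 20.7 | 0.604 | |
| 7 | CL11 | CL43 | 16 | 0.0359 | .691 | .631 | 4.25 | 20.9 | 14.4 | 0.6758 | |
| 6 | CL15 | CL31 | 11 | 0.0263 | .665 | .598 | 4.24 | 22.6 | 8.0 | 0.7065 | |
| 5 | CL7 | CL6 | 27 | 0.1430 | .522 | .557 | -1.7 | 15.8 | 28.2 | 0.8247 | |
| 4 | CL8 | CL5 | 56 | 0.2692 | .253 | .507 | -9.1 | 6.6 | 31.5 | 0.7726 | |
| 3 | Snack Cakes | CL32 | 3 | 0.0216 | .231 | .435 | -6.6 | 9.0 | 46.0 | 1.0027 | |
| 2 | CL4 | CL10 | 60 | 0.1228 | .108 | .289 | -5.6 | 7.4 | 9.5 | 1.0096 | |
| 1 | CL2 | CL3 | 63 | 0.1083 | .000 | .000 | 0.00 | . | 7.4 | 1.0839 | |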

[Chart: CLUSTER (1 to 10) by class (Breakfast cereal; Crackers; Detergent; Little Debbie; Paste, Tooth; Tea)]

[Chart: CLUSTER (1 to 10) by color (Blue, Green, Red, White, Yellow)]
