The ACECLUS Procedure

**PROC ACECLUS** PROPORTION=*p* | THRESHOLD=*t* < options > **;**

| Task | Options | Description |
|---|---|---|
| Specify clustering options | METHOD= | specify the clustering method |
| | MPAIRS= | specify number of pairs for estimating within-cluster covariance (when you specify the option METHOD=COUNT) |
| | PROPORTION= | specify proportion of pairs for estimating within-cluster covariance |
| | THRESHOLD= | specify the threshold for including pairs in the estimation of the within-cluster covariance |
| Specify input and output data sets | DATA= | specify input data set name |
| | OUT= | specify output data set name |
| | OUTSTAT= | specify output data set name containing various statistics |
| Specify iteration options | ABSOLUTE | use absolute instead of relative threshold |
| | CONVERGE= | specify convergence criterion |
| | INITIAL= | specify initial estimate of within-cluster covariance matrix |
| | MAXITER= | specify maximum number of iterations |
| | METRIC= | specify metric in which computations are performed |
| | SINGULAR= | specify singularity criterion |
| Specify canonical analysis options | N= | specify number of canonical variables |
| | PREFIX= | specify prefix for naming canonical variables |
| Control displayed output | NOPRINT | suppress the display of the output |
| | PP | produce PP-plot of distances between pairs from last iteration |
| | QQ | produce QQ-plot of power transformation of distances between pairs from last iteration |
| | SHORT | omit all output except for iteration history and eigenvalue table |

**ABSOLUTE**-
causes the THRESHOLD= value or the threshold computed from the
PROPORTION= option to be treated absolutely rather than relative
to the root mean square distance between observations.
Use the ABSOLUTE option only when you are confident that the initial
estimate of the within-cluster covariance matrix is close
to the final estimate, such as when the INITIAL= option
specifies a data set created by a previous execution of
PROC ACECLUS using the OUTSTAT= option.
**CONVERGE=***c*-
specifies the convergence criterion.
By default, CONVERGE=0.001.
Iteration stops when the convergence measure falls below the
value specified by the CONVERGE= option or when the iteration
limit as specified by the MAXITER= option is exceeded,
whichever happens first.
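
The stopping rule described for CONVERGE= and MAXITER= can be sketched outside SAS as follows. This is a minimal illustration in Python, not code from PROC ACECLUS: the function name `stop_point` and the list of convergence measures are hypothetical, and the defaults are the documented CONVERGE=0.001 and MAXITER=10.

```python
from typing import Iterable, Tuple


def stop_point(measures: Iterable[float],
               converge: float = 0.001,
               maxiter: int = 10) -> Tuple[int, bool]:
    """Return (iterations used, converged?) for a sequence of convergence measures.

    Hypothetical helper: `measures` stands in for the convergence measure
    that the procedure would compute at each iteration.
    """
    used = 0
    for used, e in enumerate(measures, start=1):
        if e < converge:
            # Convergence measure fell below the CONVERGE= value.
            return used, True
        if used >= maxiter:
            # Iteration limit from MAXITER= reached without converging.
            break
    return used, False
```

With these defaults, a run whose measure first drops below 0.001 at the fourth iteration stops there, while a run that never converges stops at the tenth iteration.
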
**DATA=***SAS-data-set*-
specifies the SAS data set to be analyzed.
By default, PROC ACECLUS uses the most recently created SAS data set.
**INITIAL=***name*-
specifies the matrix for the initial estimate
of the within-cluster covariance matrix.
Valid values for *name* are as follows:

- DIAGONAL | D
  uses the diagonal matrix of sample variances as the initial estimate of the within-cluster covariance matrix.
- FULL | F
  uses the total-sample covariance matrix as the initial estimate of the within-cluster covariance matrix.
- IDENTITY | I
  uses the identity matrix as the initial estimate of the within-cluster covariance matrix.
- INPUT=*SAS-data-set*
  specifies a SAS data set from which to obtain the initial estimate of the within-cluster covariance matrix. The data set can be TYPE=CORR, COV, UCORR, UCOV, SSCP, or ACE, or it can be an ordinary SAS data set. (See Appendix 1, "Special SAS Data Sets," for descriptions of CORR, COV, UCORR, UCOV, and SSCP data sets. See the section "Output Data Sets" for a description of ACE data sets.)

If you do not specify the INITIAL= option, the default is the matrix specified by the METRIC= option. If neither the INITIAL= nor the METRIC= option is specified, INITIAL=FULL is used if there are enough observations to obtain a nonsingular total-sample covariance matrix; otherwise, INITIAL=DIAGONAL is used.
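
The default rule in the preceding paragraph can be written as a small decision function. The following Python sketch is only an illustration of that rule: the function `resolve_initial` and the flag `full_is_nonsingular` are hypothetical names, and the flag stands in for the procedure's own check that there are enough observations to obtain a nonsingular total-sample covariance matrix.

```python
from typing import Optional


def resolve_initial(initial: Optional[str],
                    metric: Optional[str],
                    full_is_nonsingular: bool) -> str:
    """Pick the matrix used as the initial within-cluster covariance estimate.

    `initial` and `metric` are the values of INITIAL= and METRIC=,
    or None when the option is not specified.
    """
    if initial is not None:
        return initial              # INITIAL= was given explicitly
    if metric is not None:
        return metric               # default is the METRIC= matrix
    # Neither option specified: FULL when possible, otherwise DIAGONAL.
    return "FULL" if full_is_nonsingular else "DIAGONAL"
```

For example, specifying only METRIC=IDENTITY yields the identity matrix as the initial estimate, and specifying neither option with too few observations yields DIAGONAL.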

**MAXITER=***n*-
specifies the maximum number of iterations.
By default, MAXITER=10.
**METHOD= COUNT | C** | **METHOD= THRESHOLD | T**-
specifies the clustering method.
The METHOD=THRESHOLD option requests a method (also the default)
that uses all pairs closer than a given cutoff value
to form the estimate at each iteration.
The METHOD=COUNT option requests a method that uses a number of pairs,
*m*, with the smallest distances to form the estimate at each iteration.
**METRIC=***name*-
specifies the metric in which the computations are performed,
implies the default value for the INITIAL= option,
and specifies the matrix **Z** used in the formula for the convergence measure $e_i$ and for checking singularity of the **A** matrix. Valid values for *name* are as follows:

- DIAGONAL | D
  uses the diagonal matrix of sample variances diag(**S**) and sets $\mathbf{Z} = \mathrm{diag}(\mathbf{S})^{-(1/2)}$, where the superscript $-(1/2)$ indicates an inverse factor.
- FULL | F
  uses the total-sample covariance matrix **S** and sets $\mathbf{Z} = \mathbf{S}^{-(1/2)}$.
- IDENTITY | I
  uses the identity matrix **I** and sets $\mathbf{Z} = \mathbf{I}$.

The option METRIC= is rather technical. It affects the computations in a variety of ways, but for well-conditioned data the effects are subtle. For most data sets, the METRIC= option is not needed.
**MPAIRS=***m*-
specifies the number of pairs to be included in the
estimation of the within-cluster covariance matrix
when METHOD=COUNT is requested.
The value of *m* must be greater than 0 but less than or equal to $(\mathit{totfq} \times (\mathit{totfq} - 1))/2$, where *totfq* is the sum of nonmissing frequencies specified in the FREQ statement. If there is no FREQ statement, *totfq* equals the total number of nonmissing observations.
**N=***n*-
specifies the number of canonical variables to be computed.
The default is the number of variables analyzed.
N=0 suppresses the canonical analysis.
**NOPRINT**-
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 15, "Using the Output Delivery System."
**OUT=***SAS-data-set*-
creates an output SAS data set that contains all the original
data as well as the canonical variables having an
estimated within-cluster covariance matrix equal to the identity
matrix.
If you want to create a permanent SAS data set,
you must specify a two-level name.
See Chapter 16, "SAS Data Files" in
*SAS Language Reference: Concepts* for information on permanent SAS data sets.
**OUTSTAT=***SAS-data-set*-
specifies a TYPE=ACE output SAS data set that contains means,
standard deviations, number of observations, covariances,
estimated within-cluster covariances, eigenvalues, and
canonical coefficients.
If you want to create a permanent SAS data set,
you must specify a two-level name.
See Chapter 16, "SAS Data Files" in
*SAS Language Reference: Concepts* for information on permanent SAS data sets.
**PROPORTION=***p* | **PERCENT=***p* | **P=***p*-
specifies the percentage of pairs to be included in the
estimation of the within-cluster covariance matrix.
The value of *p* must be greater than 0. If *p* is greater than or equal to 1, it is interpreted as a percentage and divided by 100; PROPORTION=0.02 and PROPORTION=2 are equivalent. When you specify METHOD=THRESHOLD, a threshold value is computed from the PROPORTION= option under the assumption that the observations are sampled from a multivariate normal distribution.

When you specify METHOD=COUNT, the number of pairs, *m*, is computed from PROPORTION=*p* as

$$m = \left\lfloor \frac{p}{2} \times \mathit{totfq} \times (\mathit{totfq} - 1) \right\rfloor$$

where *totfq* is the number of total nonmissing observations.
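
The arithmetic for METHOD=COUNT can be checked by hand with a short Python sketch. The function names below are hypothetical and are not part of SAS; the code only restates the documented rules (values of *p* of 1 or more are read as percentages, *m* is the floor of the product, and the MPAIRS= upper bound is the number of distinct pairs). It uses ordinary floating-point arithmetic, so results for products that land extremely close to an integer may differ from what the procedure reports.

```python
import math


def normalize_proportion(p: float) -> float:
    """Apply the PROPORTION= rule: p >= 1 is a percentage and is divided by 100."""
    if p <= 0:
        raise ValueError("PROPORTION= must be greater than 0")
    return p / 100.0 if p >= 1 else p


def pairs_from_proportion(p: float, totfq: int) -> int:
    """Number of pairs m = floor((p / 2) * totfq * (totfq - 1))."""
    frac = normalize_proportion(p)
    return math.floor(frac / 2.0 * totfq * (totfq - 1))


def max_pairs(totfq: int) -> int:
    """Upper bound on MPAIRS=: totfq * (totfq - 1) / 2 distinct pairs."""
    return totfq * (totfq - 1) // 2
```

For instance, with 100 nonmissing observations, PROPORTION=0.02 and PROPORTION=2 both give the same *m*, which is 2 percent of the 4950 possible pairs.
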
**PP**-
produces a PP probability plot of distances between pairs of
observations computed in the last iteration.
**PREFIX=***name*-
specifies a prefix for naming the canonical variables.
By default the names are CAN1, CAN2, ..., CAN*n*. If you specify PREFIX=ABC, the variables are named ABC1, ABC2, ABC3, and so on. The number of characters in the prefix plus the number of digits required to designate the variables should not exceed the name length defined by the VALIDVARNAME= system option. For more information on the VALIDVARNAME= system option, refer to *SAS Language Reference: Dictionary*.
**QQ**-
produces a QQ probability plot of a power transformation
of the distances between pairs of observations computed
in the last iteration.
**Caution:** The QQ plot may require an enormous amount of computer time.
**SHORT**-
omits all items from the standard output except for the
iteration history and the eigenvalue table.
**SINGULAR=***g* | **SING=***g*-
specifies a singularity criterion 0 < *g* < 1 for the total-sample covariance matrix **S** and the approximate within-cluster covariance estimate **A**. The default is SINGULAR=1E-4.
**THRESHOLD=***t* | **T=***t*-
specifies the threshold for including pairs of observations in the
estimation of the within-cluster covariance matrix. A pair of
observations is included if the Euclidean distance between them is
less than or equal to *t* times the root mean square distance computed over all pairs of observations.
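
The inclusion rule for THRESHOLD= can be illustrated with a single pass over a small set of points. The Python sketch below is an assumption-laden illustration, not the procedure's implementation: the function name `pairs_within_threshold` is hypothetical, it works on raw coordinates with equal weights, and it does not reproduce the iterative re-estimation that PROC ACECLUS performs. The `absolute` flag mimics the ABSOLUTE option by treating *t* as the cutoff itself rather than as a multiple of the root mean square distance.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def pairs_within_threshold(points: Sequence[Sequence[float]],
                           t: float,
                           absolute: bool = False) -> List[Tuple[int, int]]:
    """Return index pairs whose Euclidean distance is within the cutoff.

    Relative mode (default): cutoff = t * RMS distance over all pairs.
    Absolute mode: cutoff = t.
    """
    pairs = list(itertools.combinations(range(len(points)), 2))
    dists = {(i, j): math.dist(points[i], points[j]) for i, j in pairs}
    # Root mean square distance computed over all pairs of observations.
    rms = math.sqrt(sum(d * d for d in dists.values()) / len(dists))
    cutoff = t if absolute else t * rms
    return [pair for pair, d in dists.items() if d <= cutoff]
```

With the points (0, 0), (1, 0), and (10, 0), the pairwise distances are 1, 10, and 9, so a relative threshold of 0.5 keeps only the first two points as a pair.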

Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.