The CORRESP Procedure
Let N be the contingency table formed from those observations and variables that are not supplementary and from those observations that have no missing values and have a positive weight. This table is an (n_{r} ×n_{c}) rank q matrix of nonnegative numbers with nonzero row and column sums. If Z_{a} is the binary coding for variable A, and Z_{b} is the binary coding for variable B, then N = Z_{a}'Z_{b} is a contingency table. Similarly, if Z_{b,c} contains the binary coding for both variables B and C, then N = Z_{a}'Z_{b,c} can also be input to a correspondence analysis. With the BINARY option, N = Z, and the analysis is based on a binary table. In multiple correspondence analysis, the analysis is based on a Burt table, Z'Z.
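The construction N = Z_{a}'Z_{b} can be illustrated with a short Python sketch. The six observations below are hypothetical, and the helper names `indicator` and `cross_product` are introduced only for this illustration; they are not part of PROC CORRESP.

```python
def indicator(values):
    """Return (levels, Z): one row per observation, one 0/1 column per level."""
    levels = sorted(set(values))
    Z = [[1 if v == lev else 0 for lev in levels] for v in values]
    return levels, Z


def cross_product(Za, Zb):
    """Compute Za' Zb for two indicator matrices stored as nested lists."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(Za, Zb))
             for j in range(len(Zb[0]))]
            for i in range(len(Za[0]))]


# Hypothetical data: two categorical variables observed on six subjects.
hair = ["Blond", "Brown", "Brown", "Blond", "White", "Brown"]
sex = ["Female", "Male", "Female", "Female", "Male", "Male"]

hair_levels, Za = indicator(hair)   # binary coding Z_a (6 x 3)
sex_levels, Zb = indicator(sex)     # binary coding Z_b (6 x 2)
N = cross_product(Za, Zb)           # contingency table Z_a' Z_b (3 x 2)

for level, row in zip(hair_levels, N):
    print(level, row)
```

Each cell of N counts the observations that fall in the corresponding pair of categories, so the cells sum to the number of observations.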
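Let 1 be a vector of 1s of the appropriate order, let I be an identity matrix, and let diag(·) be a matrix-valued function that creates a diagonal matrix from a vector. Let
    f = 1'N1
    P = [1/f]N
    r = P1
    c = P'1
    D_{r} = diag(r)
    D_{c} = diag(c)
    R = D_{r}^{-1}P
    C' = D_{c}^{-1}P'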
The scalar f is the sum of all elements in N. The matrix P is a matrix of relative frequencies. The vector r contains row marginal proportions or row "masses." The vector c contains column marginal proportions or column masses. The matrices D_{r} and D_{c} are diagonal matrices of marginals.
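As a small numeric illustration of these quantities, the following Python sketch computes f, P, r, and c for a hypothetical 2 × 3 table, and also divides each row of P by its mass to obtain the row profile matrix R = D_{r}^{-1}P. The table values are invented for the example.

```python
# Hypothetical 2 x 3 contingency table.
N = [[20, 10, 5],
     [10, 30, 25]]

f = sum(sum(row) for row in N)               # grand total, 1'N1
P = [[x / f for x in row] for row in N]      # relative frequencies
r = [sum(row) for row in P]                  # row masses, P1
c = [sum(col) for col in zip(*P)]            # column masses, P'1

# Row profiles: each row of P divided by its row mass.
R = [[p / ri for p in row] for row, ri in zip(P, r)]

print("f =", f)
print("r =", r)
print("c =", c)
print("R =", R)
```

The masses in r and in c each sum to one, and every row of R sums to one.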
The rows of R contain the "row profiles." The elements of each row of R sum to one. Each (i,j) element of R contains the observed probability of being in column j given membership in row i. Similarly, the columns of C contain the column profiles. The coordinates in correspondence analysis are based on the generalized singular value decomposition of P,
    P = AD_{u}B'
where
    A'D_{r}^{-1}A = B'D_{c}^{-1}B = I
In multiple correspondence analysis,
    P = BD_{u}^{2}B'
The matrix A, which is the rectangular matrix of left generalized singular vectors, has n_{r} rows and q columns; the matrix D_{u}, which is a diagonal matrix of singular values, has q rows and columns; and the matrix B, which is the rectangular matrix of right generalized singular vectors, has n_{c} rows and q columns. The columns of A and B define the principal axes of the column and row point clouds, respectively.
The generalized singular value decomposition of P - rc', discarding the last singular value (which is zero) and the last left and right singular vectors, is exactly the same as a generalized singular value decomposition of P, discarding the first singular value (which is one), the first left singular vector, r, and the first right singular vector, c. The first (trivial) column of A and B and the first singular value in D_{u} are discarded before any results are displayed. You can obtain the generalized singular value decomposition of P - rc' from the ordinary singular value decomposition of D_{r}^{-1/2} (P - rc') D_{c}^{-1/2}.
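    D_{r}^{-1/2}(P - rc')D_{c}^{-1/2} = UD_{u}V' = (D_{r}^{-1/2}A)D_{u}(D_{c}^{-1/2}B)'
    P - rc' = D_{r}^{1/2}UD_{u}V'D_{c}^{1/2} = (D_{r}^{1/2}U)D_{u}(D_{c}^{1/2}V)' = AD_{u}B'
Hence, A = D_{r}^{1/2}U and B = D_{c}^{1/2}V.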
The default row coordinates are D_{r}^{-1}AD_{u}, and the default column coordinates are D_{c}^{-1}BD_{u}. Typically the first two columns of D_{r}^{-1}AD_{u} and D_{c}^{-1}BD_{u} are plotted to display graphically associations between the row and column categories. The plot consists of two overlaid plots, one for rows and one for columns. The row points are row profiles, rescaled so that distances between profiles can be displayed as ordinary Euclidean distances, then orthogonally rotated to a principal axes orientation. The column points are column profiles, rescaled so that distances between profiles can be displayed as ordinary Euclidean distances, then orthogonally rotated to a principal axes orientation. Distances between row points and other row points have meaning. Distances between column points and other column points have meaning. However, distances between column points and row points are not interpretable.
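The route from the ordinary singular value decomposition to the default coordinates can be sketched numerically. The following Python example assumes NumPy is available and uses a hypothetical 3 × 3 table; it is an illustration of the formulas in this section, not a description of how PROC CORRESP is implemented.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table.
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 30.0, 25.0],
              [5.0, 15.0, 30.0]])

P = N / N.sum()
r = P.sum(axis=1)            # row masses
c = P.sum(axis=0)            # column masses

# Ordinary SVD of D_r^{-1/2} (P - rc') D_c^{-1/2}.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, d, Vt = np.linalg.svd(S, full_matrices=False)

# Discard the last singular value (zero) and its singular vectors.
q = len(d) - 1
U, d, V = U[:, :q], d[:q], Vt.T[:, :q]

A = np.sqrt(r)[:, None] * U          # A = D_r^{1/2} U
B = np.sqrt(c)[:, None] * V          # B = D_c^{1/2} V

row_coords = (A / r[:, None]) * d    # D_r^{-1} A D_u
col_coords = (B / c[:, None]) * d    # D_c^{-1} B D_u

print("singular values:", d)
print("row coordinates:\n", row_coords)
print("column coordinates:\n", col_coords)
```

Plotting the first two columns of `row_coords` and `col_coords` gives the usual overlaid display. The signs of individual axes are arbitrary, because an SVD routine may return either sign for a pair of singular vectors.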
ROW= | Matrix Formula |
A | A |
AD | AD_{u} |
DA | D_{r}^{-1}A |
DAD | D_{r}^{-1}AD_{u} |
DAD1/2 | D_{r}^{-1}AD_{u}^{1/2} |
DAID1/2 | D_{r}^{-1}A(I + D_{u})^{1/2} |
COLUMN= | Matrix Formula |
B | B |
BD | BD_{u} |
DB | D_{c}^{-1}B |
DBD | D_{c}^{-1}BD_{u} |
DBD1/2 | D_{c}^{-1}BD_{u}^{1/2} |
DBID1/2 | D_{c}^{-1}B(I + D_{u})^{1/2} |
When PROFILE=ROW (ROW=DAD and COLUMN=DB), the row coordinates D_{r}^{-1}AD_{u} and column coordinates D_{c}^{-1}B provide a correspondence analysis based on the row profile matrix. The row profile (conditional probability) matrix is defined as R = D_{r}^{-1}P = D_{r}^{-1}AD_{u}B'. The elements of each row of R sum to one. Each (i,j) element of R contains the observed probability of being in column j given membership in row i. The "principal" row coordinates D_{r}^{-1}AD_{u} and "standard" column coordinates D_{c}^{-1}B provide a decomposition of D_{r}^{-1}AD_{u}B'D_{c}^{-1} = D_{r}^{-1}PD_{c}^{-1} = RD_{c}^{-1}. Since D_{r}^{-1}AD_{u} = RD_{c}^{-1}B, the row coordinates are weighted centroids of the column coordinates. Each column point, with coordinates scaled to standard coordinates, defines a vertex in (n_{c}-1)-dimensional space. All of the principal row coordinates are located in the space defined by the standard column coordinates. Distances among row points have meaning, but distances among column points and distances between row and column points are not interpretable.
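Written element by element, the centroid property for PROFILE=ROW reads as follows. Here b_{jk} denotes the (j,k) element of B and c_{j} the jth column mass, so b_{jk}/c_{j} is the standard coordinate of column j on axis k; the elementwise notation is introduced only for this illustration.

```latex
\[
\left( D_r^{-1} A D_u \right)_{ik}
  = \left( R\, D_c^{-1} B \right)_{ik}
  = \sum_{j=1}^{n_c} R_{ij}\, \frac{b_{jk}}{c_j},
\qquad
R_{ij} \ge 0, \quad \sum_{j=1}^{n_c} R_{ij} = 1 .
\]
```

Because the weights R_{ij} are nonnegative and sum to one within each row, every principal row coordinate is a convex combination of the standard column coordinates, which is why the row points lie inside the region spanned by the column vertices.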
The option PROFILE=COLUMN can be described as applying the PROFILE=ROW formulas to the transpose of the contingency table. When PROFILE=COLUMN (ROW=DA and COLUMN=DBD), the principal column coordinates D_{c}^{-1}BD_{u} are weighted centroids of the standard row coordinates D_{r}^{-1}A. Each row point, with coordinates scaled to standard coordinates, defines a vertex in (n_{r}-1)-dimensional space. All of the principal column coordinates are located in the space defined by the standard row coordinates. Distances among column points have meaning, but distances among row points and distances between row and column points are not interpretable.
The usual sets of coordinates are given by the default PROFILE=BOTH (ROW=DAD and COLUMN=DBD). All of the summary statistics, such as the squared cosines and contributions to inertia, apply to these two sets of points. One advantage to using these coordinates is that both sets (D_{r}^{-1}AD_{u} and D_{c}^{-1}BD_{u}) are postmultiplied by the diagonal matrix D_{u}, which has diagonal values that are all less than or equal to one. When D_{u} is a part of the definition of only one set of coordinates, that set forms a tight cluster near the centroid whereas the other set of points is more widely dispersed. Including D_{u} in both sets makes a better graphical display. However, care must be taken in interpreting such a plot. No correct interpretation of distances between row points and column points can be made.
Another property of this choice of coordinates concerns the geometry of distances between points within each set. The default row coordinates can be decomposed into D_{r}^{-1}AD_{u} = D_{r}^{-1}AD_{u}B'D_{c}^{-1}B = (D_{r}^{-1}P)(D_{c}^{-1/2})(D_{c}^{-1/2}B). The row coordinates are row profiles (D_{r}^{-1}P), rescaled by D_{c}^{-1/2} (rescaled so that distances between profiles are transformed from a chi-square metric to a Euclidean metric), then orthogonally rotated (with D_{c}^{-1/2}B) to a principal axes orientation. Similarly, the column coordinates are column profiles rescaled to a Euclidean metric and orthogonally rotated to a principal axes orientation.
The rationale for computing distances between row profiles using the non-Euclidean chi-square metric is as follows. Each row of the contingency table can be viewed as a realization of a multinomial distribution conditional on its row marginal frequency. The null hypothesis of row and column independence is equivalent to the hypothesis of homogeneity of the row profiles. A significant chi-square statistic is geometrically interpreted as a significant deviation of the row profiles from their centroid, c'. The chi-square metric is the Mahalanobis metric between row profiles based on their estimated covariance matrix under the homogeneity assumption (Greenacre and Hastie 1987). A parallel argument can be made for the column profiles.
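The link between the chi-square metric and the chi-square statistic can be checked numerically. The sketch below, for a hypothetical 3 × 3 table, relies on the standard identity that the mass-weighted sum of squared chi-square distances from each row profile to the centroid c' equals the Pearson chi-square statistic divided by f. The function name `chi2_dist_sq` is illustrative only.

```python
# Hypothetical 3 x 3 contingency table.
N = [[20, 10, 5],
     [10, 30, 25],
     [5, 15, 30]]

f = sum(map(sum, N))
r = [sum(row) / f for row in N]                    # row masses
c = [sum(col) / f for col in zip(*N)]              # column masses (centroid)
R = [[x / sum(row) for x in row] for row in N]     # row profiles


def chi2_dist_sq(p, q, c):
    """Squared chi-square distance between profiles p and q, weighted by 1/c."""
    return sum((pj - qj) ** 2 / cj for pj, qj, cj in zip(p, q, c))


# Pearson chi-square statistic for independence.
chi2 = sum((N[i][j] - f * r[i] * c[j]) ** 2 / (f * r[i] * c[j])
           for i in range(len(r)) for j in range(len(c)))

# Mass-weighted squared distances of the row profiles from their centroid.
inertia = sum(ri * chi2_dist_sq(Ri, c, c) for ri, Ri in zip(r, R))

print("chi-square / f =", chi2 / f)
print("total inertia  =", inertia)
```

The two printed values agree, which shows how a large chi-square statistic corresponds to row profiles that lie far from their centroid in the chi-square metric.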
When ROW=DAD1/2 and COLUMN=DBD1/2 (Gifi 1990; van der Heijden and de Leeuw 1985), the row coordinates D_{r}^{-1}AD_{u}^{1/2} and column coordinates D_{c}^{-1}BD_{u}^{1/2} are a decomposition of D_{r}^{-1}PD_{c}^{-1}.
In all of the preceding pairs, distances between row and column points are not meaningful. This prompted Carroll, Green, and Schaffer (1986) to propose that row coordinates D_{r}^{-1}A(I+D_{u})^{1/2} and column coordinates D_{c}^{-1}B(I+D_{u})^{1/2} be used. These coordinates are (except for a constant scaling) the coordinates from a multiple correspondence analysis of a Burt table created from two categorical variables. This standardization is available with ROW=DAID1/2 and COLUMN=DBID1/2. However, this approach has been criticized on both theoretical and empirical grounds by Greenacre (1989). The Carroll, Green, and Schaffer standardization relies on the assumption that the chi-square metric is an appropriate metric for measuring the distance between the columns of a bivariate indicator matrix. See the section "Types of Tables Used as Input" for a description of indicator matrices. Greenacre (1989) showed that this assumption cannot be justified.
A TABLES statement with a single variable list creates a Burt table. Thus, you can always specify the MCA option with this type of input. If you use the MCA option when reading an existing table with a VAR statement, you must ensure that the table is a Burt table.
If you perform MCA on a table that is not a Burt table, the results of the analysis are invalid. If the table is not symmetric, or if the sums of all elements in each diagonal partition are not equal, PROC CORRESP displays an error message and quits.
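The two conditions named above (symmetry, and equal sums over the diagonal partitions) can be sketched in Python. The data and the function names `burt_table` and `looks_like_burt` are hypothetical; the check mirrors only the two conditions stated in the text and is not PROC CORRESP's actual validation code. Passing it is necessary, but not sufficient, for a table to be a Burt table.

```python
def burt_table(columns):
    """Return (Z'Z, block sizes) for a list of categorical variables."""
    blocks, sizes = [], []
    for values in columns:
        levels = sorted(set(values))
        sizes.append(len(levels))
        blocks.append([[1 if v == lev else 0 for lev in levels]
                       for v in values])
    n_obs = len(columns[0])
    # Concatenate the per-variable indicator blocks side by side to form Z.
    Z = [sum((blk[i] for blk in blocks), []) for i in range(n_obs)]
    n_cat = len(Z[0])
    burt = [[sum(row[a] * row[b] for row in Z) for b in range(n_cat)]
            for a in range(n_cat)]
    return burt, sizes


def looks_like_burt(table, sizes):
    """Check symmetry and equal totals over the diagonal partitions."""
    n = len(table)
    symmetric = all(table[i][j] == table[j][i]
                    for i in range(n) for j in range(n))
    totals, start = [], 0
    for s in sizes:
        block = range(start, start + s)
        totals.append(sum(table[i][j] for i in block for j in block))
        start += s
    return symmetric and len(set(totals)) == 1


# Hypothetical data: two categorical variables observed on six subjects.
hair = ["Blond", "Brown", "Brown", "Blond", "White", "Brown"]
sex = ["Female", "Male", "Female", "Female", "Male", "Male"]

burt, sizes = burt_table([hair, sex])
print(looks_like_burt(burt, sizes))
```

Each diagonal partition of a Burt table is a diagonal matrix of category counts for one variable, so its total equals the number of observations; this is why the partition totals must all agree.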
A subset of the columns of a Burt table is not necessarily a Burt table, so in MCA it is not appropriate to designate arbitrary columns as supplementary. You can, however, designate all columns from one or more categorical variables as supplementary.
The results of a multiple correspondence analysis of a Burt table Z'Z are the same as the column results from a simple correspondence analysis of the binary (or fuzzy) matrix Z. Multiple correspondence analysis is not a simple correspondence analysis of the Burt table. It is not appropriate to perform a simple correspondence analysis of a Burt table. The MCA option is based on P = BD_{u}^{2}B', whereas a simple correspondence analysis of the Burt table would be based on P = BD_{u}B'.
Since the rows and columns of the Burt table are the same, no row information is displayed or written to the output data sets. The resulting inertias and the default (COLUMN=DBD) column coordinates are the appropriate inertias and coordinates for an MCA. The supplementary column coordinates, cosines, and quality of representation formulas for MCA differ from the simple correspondence analysis formulas because the design matrix column profiles and left singular vectors are not available.
The following statements create a Burt table and perform a multiple correspondence analysis:
   proc corresp data=Neighbor observed short mca;
      tables Hair Height Sex Age;
   run;
Both the rows and the columns have the same nine categories (Blond, Brown, White, Short, Tall, Female, Male, Old, and Young).
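For a Burt table constructed from m categorical variables, the Benzécri inertia adjustment is
    [m/(m-1)]^{2} × (u_{k} - [1/m])^{2} for u_{k} > [1/m]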
The Benzécri adjustment is available with the BENZECRI option.
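As a rough illustration, the following Python sketch applies the usual form of the Benzécri adjustment, [m/(m-1)]^{2} × (u_{k} - [1/m])^{2}, to those values u_{k} that exceed 1/m. It assumes that this usual form is the adjustment meant here, that m is the number of categorical variables in the Burt table, and that the u_{k} are the values compared with 1/m in the text. The input values and the function name `benzecri_adjusted` are hypothetical.

```python
def benzecri_adjusted(u, m):
    """Adjust each u_k greater than 1/m; values at or below 1/m are dropped."""
    factor = (m / (m - 1)) ** 2
    return [factor * (uk - 1 / m) ** 2 for uk in u if uk > 1 / m]


m = 4                        # for example, TABLES Hair Height Sex Age
u = [0.5, 0.3, 0.25, 0.1]    # hypothetical unadjusted values

adjusted = benzecri_adjusted(u, m)
print(adjusted)
```

Only the first two values exceed 1/m = 0.25, so only two adjusted inertias are produced.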
Greenacre (1994, p. 156) argues that the Benzécri adjustment overestimates the quality of fit. Greenacre proposes instead the following inertia adjustment:
for
The Greenacre adjustment is available with the GREENACRE option.
Ordinary unadjusted inertias are printed by default with MCA when neither the BENZECRI nor the GREENACRE option is specified. However, the unadjusted inertias are not printed by default when either the BENZECRI or the GREENACRE option is specified. To display both adjusted and unadjusted inertias, specify the UNADJUSTED option in addition to the relevant adjusted inertia option (BENZECRI, GREENACRE, or both).
ROW= | Matrix Formula |
A | [1/f]S_{o}D_{c}^{-1}BD_{u}^{-1} |
AD | [1/f]S_{o}D_{c}^{-1}B |
DA | R_{s}D_{c}^{-1}BD_{u}^{-1} |
DAD | R_{s}D_{c}^{-1}B |
DAD1/2 | R_{s}D_{c}^{-1}BD_{u}^{-1/2} |
DAID1/2 | R_{s}D_{c}^{-1}BD_{u}^{-1}(I + D_{u})^{1/2} |
COLUMN= | Matrix Formula |
B | [1/f]S_{v}D_{r}^{-1}AD_{u}^{-1} |
BD | [1/f]S_{v}D_{r}^{-1}A |
DB | C_{s}D_{r}^{-1}AD_{u}^{-1} |
DBD | C_{s}D_{r}^{-1}A |
DBD1/2 | C_{s}D_{r}^{-1}AD_{u}^{-1/2} |
DBID1/2 | C_{s}D_{r}^{-1}AD_{u}^{-1}(I + D_{u})^{1/2} |
MCA COLUMN= | Matrix Formula |
B | not allowed |
BD | not allowed |
DB | C_{s}D_{r}^{-1}BD_{u}^{-2} |
DBD | C_{s}D_{r}^{-1}BD_{u}^{-1} |
DBD1/2 | C_{s}D_{r}^{-1}BD_{u}^{-3/2} |
DBID1/2 | C_{s}D_{r}^{-1}BD_{u}^{-2}(I + D_{u})^{1/2} |
These statistics pertain to the default PROFILE=BOTH coordinates, no matter what values you specify for the ROW=, COLUMN=, or PROFILE= option. Let sq(·) be a matrix-valued function denoting elementwise squaring of the argument matrix. Let t be the total inertia (the sum of the elements in D_{u}^{2}).
In MCA, let D_{s} be the Burt table partition containing the intersection of the supplementary columns and the supplementary rows. The matrix D_{s} is a diagonal matrix of marginal frequencies of the supplemental columns of the binary matrix Z. Let p be the number of rows in this design matrix.
Statistic | Matrix Formula |
Row partial contributions to inertia | D_{r}^{-1}sq(A) |
Column partial contributions to inertia | D_{c}^{-1}sq(B) |
Row squared cosines | diag(sq(AD_{u}) 1)^{-1}sq(AD_{u}) |
Column squared cosines | diag(sq(BD_{u}) 1)^{-1}sq(BD_{u}) |
Row mass | r |
Column mass | c |
Row inertia | [1/t]D_{r}^{-1} sq(AD_{u})1 |
Column inertia | [1/t]D_{c}^{-1} sq(BD_{u})1 |
Supplementary row squared cosines | diag(sq(R_{s}-1c') D_{c}^{-1}1)^{-1}sq(R_{s}D_{c}^{-1}B) |
Supplementary column squared cosines | diag(sq(C_{s}-1r') D_{r}^{-1}1)^{-1}sq(C_{s}D_{r}^{-1}A) |
MCA supplementary column squared cosines | D_{s}(pI-D_{s})^{-1} sq(C_{s}D_{r}^{-1}BD_{u}^{-1}) |
The quality of representation in the DIMENS=n dimensional display of any point is the sum of its squared cosines over only the n dimensions. Inertia and mass are not defined for supplementary points.
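The squared cosine and quality computations can be sketched in a few lines of Python. The matrix below stands in for AD_{u} with all q = 3 dimensions present; its values are hypothetical, as are the function names `squared_cosines` and `quality`.

```python
def squared_cosines(coords):
    """Compute diag(sq(F)1)^{-1} sq(F): squared entries divided by row totals."""
    result = []
    for row in coords:
        total = sum(x * x for x in row)
        result.append([x * x / total for x in row])
    return result


def quality(cos2_row, n):
    """Quality of representation in an n-dimensional display."""
    return sum(cos2_row[:n])


# Hypothetical A D_u with all q = 3 dimensions (one row per point).
AD = [[0.30, -0.10, 0.05],
      [-0.20, 0.25, 0.00],
      [0.05, 0.05, -0.40]]

cos2 = squared_cosines(AD)
for i, row in enumerate(cos2, start=1):
    print("point", i, "quality with DIMENS=2:", quality(row, 2))
```

The denominators must use all q dimensions, not just the displayed ones; otherwise every point would appear to have a quality of one.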
A table that summarizes the table of partial contributions to inertia is also computed. The points that best explain the inertia of each dimension and the dimension to which each point contributes the most inertia are indicated. The output data set variable names for this table are Best1-Bestn (where DIMENS=n) and Best. The Best column contains the dimension number of the largest partial contribution to inertia for each point (the index of the maximum value in each row of D_{r}^{-1}sq(A) or D_{c}^{-1}sq(B)).
For each row, the Best1-Bestn columns contain either the corresponding value of Best if the point is one of the biggest contributors to the dimension's inertia or 0 if it is not. Specifically, Best1 contains the value of Best for the point with the largest contribution to dimension one's inertia. A cumulative proportion sum is initialized to this point's partial contribution to the inertia of dimension one. If this sum is less than the value for the MININERTIA= option, then Best1 contains the value of Best for the point with the second largest contribution to dimension one's inertia. Otherwise, this point's Best1 is 0. This point's partial contribution to inertia is added to the sum. This process continues for the point with the third largest partial contribution, and so on, until adding a point's contribution to the sum increases the sum beyond the value of the MININERTIA= option. This same algorithm is then used for Best2, and so on.
For example, the following table contains contributions to inertia and the corresponding Best variables. The contribution to inertia variables are proportions that sum to 1 within each column. The first point makes its greatest contribution to the inertia of dimension two, so Best for point one is set to 2 and Best1-Best3 for point one must all be 0 or 2. The second point also makes its greatest contribution to the inertia of dimension two, so Best for point two is set to 2 and Best1-Best3 for point two must all be 0 or 2, and so on.
Assume MININERTIA=0.8, the default. In dimension one, the largest contribution is 0.41302 for the fourth point, so its Best1 is set to 1, the value of Best for the fourth point. Because 0.41302 is less than 0.8, the second largest value (0.36456 for point five) is found and its Best1 is set to its Best value of 1. Because 0.41302 + 0.36456 = 0.77758 is less than 0.8, the third largest value (0.08820 for point eight) is found and its Best1 is set to 3, its value of Best, because that point's contribution to dimension three (0.16555) is greater than its contribution to dimension one (0.08820). The sum of the partial contributions is now 0.86578, which is greater than 0.8, so the remaining Best1 values are all 0.
Contr1 | Contr2 | Contr3 | Best1 | Best2 | Best3 | Best |
0.01593 | 0.32178 | 0.07565 | 0 | 2 | 2 | 2 |
0.03014 | 0.24826 | 0.07715 | 0 | 2 | 2 | 2 |
0.00592 | 0.02892 | 0.02698 | 0 | 0 | 0 | 2 |
0.41302 | 0.05191 | 0.05773 | 1 | 0 | 0 | 1 |
0.36456 | 0.00344 | 0.15565 | 1 | 0 | 1 | 1 |
0.03902 | 0.30966 | 0.11717 | 0 | 2 | 2 | 2 |
0.00019 | 0.01840 | 0.00734 | 0 | 0 | 0 | 2 |
0.08820 | 0.00527 | 0.16555 | 3 | 0 | 3 | 3 |
0.01447 | 0.00024 | 0.03851 | 0 | 0 | 0 | 3 |
0.02855 | 0.01213 | 0.27827 | 0 | 0 | 3 | 3 |
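The Best and Best1-Bestn columns in the table above can be reproduced with a short Python sketch of the algorithm as described in this section. The function name `best_columns` is hypothetical, and the code is an illustration of the description rather than PROC CORRESP's implementation; in particular, stopping as soon as the running sum reaches the MININERTIA= value is an assumption about the boundary case.

```python
def best_columns(contr, min_inertia=0.8):
    """Return (flags, best) for a points-by-dimensions contribution matrix."""
    n_points, n_dims = len(contr), len(contr[0])
    # Best: 1-based dimension holding each point's largest contribution.
    best = [row.index(max(row)) + 1 for row in contr]
    flags = [[0] * n_dims for _ in range(n_points)]
    for k in range(n_dims):
        # Visit points from largest to smallest contribution to dimension k.
        order = sorted(range(n_points), key=lambda i: contr[i][k], reverse=True)
        total = 0.0
        for i in order:
            flags[i][k] = best[i]
            total += contr[i][k]
            if total >= min_inertia:
                break
    return flags, best


# Contr1-Contr3 from the table above.
contr = [
    [0.01593, 0.32178, 0.07565],
    [0.03014, 0.24826, 0.07715],
    [0.00592, 0.02892, 0.02698],
    [0.41302, 0.05191, 0.05773],
    [0.36456, 0.00344, 0.15565],
    [0.03902, 0.30966, 0.11717],
    [0.00019, 0.01840, 0.00734],
    [0.08820, 0.00527, 0.16555],
    [0.01447, 0.00024, 0.03851],
    [0.02855, 0.01213, 0.27827],
]

flags, best = best_columns(contr)
for row, b in zip(flags, best):
    print(row, b)
```

Each printed row lists Best1, Best2, and Best3 followed by Best, in the same order as the rows of the table.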
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.