The CORR Procedure

Statistical Computations

PROC CORR computes several parametric and nonparametric correlation statistics as measures of association. The formulas for computing these measures and the associated probabilities follow.

The Pearson product-moment correlation is a parametric measure of association for two continuous random variables. The formula for the true Pearson product-moment correlation, denoted $\rho_{xy}$, is

\[
\rho_{xy} = \frac{\operatorname{Cov}(x,y)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(y)}}
          = \frac{E\bigl[(x-E(x))(y-E(y))\bigr]}{\sqrt{E\bigl[(x-E(x))^2\bigr]\,E\bigl[(y-E(y))^2\bigr]}}
\]

The sample correlation, such as a Pearson product-moment correlation or weighted product-moment correlation, estimates the true correlation. The formula for the Pearson product-moment correlation is

\[
r_{xy} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i (y_i-\bar{y})^2}}
\]

where $\bar{x}$ is the sample mean of $x$ and $\bar{y}$ is the sample mean of $y$.
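As an illustration outside of SAS, the sample formula can be sketched in Python (a hand-rolled helper, not PROC CORR's implementation):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson product-moment correlation r_xy."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))  # corrected crossproducts
    sxx = sum((a - xbar) ** 2 for a in x)                     # corrected sum of squares of x
    syy = sum((b - ybar) ** 2 for b in y)                     # corrected sum of squares of y
    return sxy / sqrt(sxx * syy)
```

An exact linear relation between the two variables yields $r_{xy}=1$ (or $-1$ for a decreasing relation).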

The formula for a weighted Pearson product-moment correlation is

\[
r_{xy} = \frac{\sum_i w_i (x_i-\bar{x}_w)(y_i-\bar{y}_w)}{\sqrt{\sum_i w_i (x_i-\bar{x}_w)^2 \sum_i w_i (y_i-\bar{y}_w)^2}}
\]

where

\[
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}
\]

Note that $\bar{x}_w$ is the weighted mean of $x$, $\bar{y}_w$ is the weighted mean of $y$, and $w_i$ is the weight.
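The weighted formula can be sketched the same way (again an illustrative helper, not the procedure itself). One useful check: an integer weight behaves like replicating an observation that many times.

```python
from math import sqrt

def weighted_pearson(x, y, w):
    """Weighted Pearson product-moment correlation with weights w."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - yw) ** 2 for wi, yi in zip(w, y))
    return sxy / sqrt(sxx * syy)
```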

When one variable is dichotomous (0,1) and the other variable is continuous, a Pearson correlation is equivalent to a point biserial correlation. When both variables are dichotomous, a Pearson correlation coefficient is equivalent to the phi coefficient.

Spearman rank-order correlation is a nonparametric measure of association based on the ranks of the data values. The formula is

\[
\theta = \frac{\sum_i (R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_i (R_i-\bar{R})^2 \sum_i (S_i-\bar{S})^2}}
\]

where $R_i$ is the rank of the $i$th $x$ value, $S_i$ is the rank of the $i$th $y$ value, $\bar{R}$ is the mean of the $R_i$ values, and $\bar{S}$ is the mean of the $S_i$ values.

PROC CORR computes Spearman's correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. When ties occur, averaged ranks are used.
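The rank-then-correlate approach can be sketched as follows; `average_ranks` assigns tied values the average of the ranks they occupy (an illustrative helper, not SAS code):

```python
from math import sqrt

def average_ranks(values):
    """Ranks 1..n, with tied values assigned the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):        # positions i..j hold equal values
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson formula applied to the tie-averaged ranks."""
    r, s = average_ranks(x), average_ranks(y)
    n = len(r)
    rbar, sbar = sum(r) / n, sum(s) / n
    num = sum((a - rbar) * (b - sbar) for a, b in zip(r, s))
    den = sqrt(sum((a - rbar) ** 2 for a in r) * sum((b - sbar) ** 2 for b in s))
    return num / den
```

Any strictly monotone relation between the variables gives a Spearman correlation of exactly 1 (or $-1$).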

Kendall's tau-b is a nonparametric measure of association based on the number of concordances and discordances in paired observations. Concordance occurs when paired observations vary together, and discordance occurs when paired observations vary differently. The formula for Kendall's tau-b is

\[
\tau = \frac{\sum_{i<j} \operatorname{sgn}(x_i-x_j)\,\operatorname{sgn}(y_i-y_j)}{\sqrt{(T_0-T_1)(T_0-T_2)}}
\]

where

\[
T_0 = \frac{n(n-1)}{2} \qquad T_1 = \sum_k \frac{t_k(t_k-1)}{2} \qquad T_2 = \sum_l \frac{u_l(u_l-1)}{2}
\]

and where $t_k$ is the number of tied $x$ values in the $k$th group of tied $x$ values, $u_l$ is the number of tied $y$ values in the $l$th group of tied $y$ values, $n$ is the number of observations, and $\operatorname{sgn}(z)$ is defined as

\[
\operatorname{sgn}(z) = \begin{cases} \phantom{-}1 & \text{if } z > 0 \\ \phantom{-}0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}
\]

PROC CORR computes Kendall's correlation by ranking the data and using a method similar to Knight (1966). The data are double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. PROC CORR computes Kendall's tau-b from the number of interchanges of the first variable and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
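A direct $O(n^2)$ evaluation of the formula above makes the tie corrections concrete; PROC CORR's Knight-style double sort computes the same quantity more efficiently. This is an illustrative sketch, not the procedure's algorithm:

```python
from math import sqrt
from collections import Counter

def sgn(z):
    return (z > 0) - (z < 0)

def kendall_tau_b(x, y):
    """Direct evaluation of the tau-b formula with tie corrections T1, T2."""
    n = len(x)
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    t0 = n * (n - 1) / 2
    t1 = sum(t * (t - 1) / 2 for t in Counter(x).values())  # tied x groups
    t2 = sum(u * (u - 1) / 2 for u in Counter(y).values())  # tied y groups
    return s / sqrt((t0 - t1) * (t0 - t2))
```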

Hoeffding's measure of dependence, D, is a nonparametric measure of association that detects more general departures from independence. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each set of $(x,y)$ values is a set of cut points for the classification. The formula for Hoeffding's D is

\[
D = 30\,\frac{(n-2)(n-3)D_1 + D_2 - 2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)}
\]

where

\[
D_1 = \sum_i (Q_i-1)(Q_i-2) \qquad D_2 = \sum_i (R_i-1)(R_i-2)(S_i-1)(S_i-2) \qquad D_3 = \sum_i (R_i-2)(S_i-2)(Q_i-1)
\]

$R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both $x$ and $y$ values less than the $i$th point. A point that is tied on only the $x$ value or the $y$ value contributes 1/2 to $Q_i$ if the other value is less than the corresponding value for the $i$th point. A point that is tied on both $x$ and $y$ contributes 1/4 to $Q_i$.

PROC CORR obtains the values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable.

When no ties occur among data set observations, the D statistic values are between -0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may result in a smaller value. That is, for a pair of variables with identical values, the Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may be less than -0.5 . For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).
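A brute-force evaluation of the D formula can be sketched as follows. The midrank tie handling for $R_i$ and $S_i$ is an assumption of this sketch; PROC CORR's double-sort algorithm is not reproduced here.

```python
def hoeffding_d(x, y):
    """Direct evaluation of Hoeffding's D (requires n >= 5)."""
    n = len(x)
    def midrank(v, i):
        # 1 + (number strictly less) + half the other values tied with v[i]
        return 1 + sum(vj < v[i] for vj in v) + 0.5 * (sum(vj == v[i] for vj in v) - 1)
    def bivariate_rank(i):
        q = 1.0
        for j in range(n):
            if j == i:
                continue
            if x[j] < x[i] and y[j] < y[i]:
                q += 1.0        # both coordinates strictly less
            elif (x[j] == x[i] and y[j] < y[i]) or (x[j] < x[i] and y[j] == y[i]):
                q += 0.5        # tied on one coordinate, less on the other
            elif x[j] == x[i] and y[j] == y[i]:
                q += 0.25       # tied on both coordinates
        return q
    R = [midrank(x, i) for i in range(n)]
    S = [midrank(y, i) for i in range(n)]
    Q = [bivariate_rank(i) for i in range(n)]
    d1 = sum((q - 1) * (q - 2) for q in Q)
    d2 = sum((r - 1) * (r - 2) * (s - 1) * (s - 2) for r, s in zip(R, S))
    d3 = sum((r - 2) * (s - 2) * (q - 1) for r, s, q in zip(R, S, Q))
    return 30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4))
```

With no ties, a completely dependent pair (for example, identical or exactly reversed orderings) gives $D = 1$.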

A partial correlation measures the strength of a relationship between two variables, while controlling the effect of one or more additional variables. The Pearson partial correlation for a pair of variables may be defined as the correlation of errors after regression on the controlling variables. Let $y=(y_1,y_2)$ be the set of variables to correlate. Also let $\alpha=(\alpha_1,\alpha_2)$ and $\beta=(\beta_1,\beta_2)$ be sets of regression parameters and $z=(z_1,z_2,\ldots,z_k)$ be the set of controlling variables, where $\alpha_j$ is the intercept, $\beta_j$ is the slope, and $j=1,2$. Suppose

\[
E(y) = \alpha + z\beta
\]

is a regression model for $y$ given $z$. The population Pearson partial correlation between the first and the second variables of $y$ given $z$ is defined as the correlation between errors $(y_1 - E(y_1))$ and $(y_2 - E(y_2))$.

If the exact values of $\alpha$ and $\beta$ are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the sets of unknown parameters $\alpha$ and $\beta$ using the least-squares estimators $\hat{\alpha}$ and $\hat{\beta}$. Then the fitted least-squares regression model is

\[
\hat{E}(y) = \hat{\alpha} + z\hat{\beta}
\]

The partial corrected sums of squares and crossproducts (CSSCP) of $y$ given $z$ are the corrected sums of squares and crossproducts of the residuals $y-\hat{E}(y)$. Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.

PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let $S$ be the partitioned CSSCP matrix between two sets of variables, $z$ and $y$:

\[
S = \begin{pmatrix} S_{zz} & S_{zy} \\ S_{zy}' & S_{yy} \end{pmatrix}
\]

PROC CORR calculates $S_{yy.z}$, the partial CSSCP matrix of $y$ after controlling for $z$, by applying the Cholesky decomposition algorithm sequentially on the rows associated with $z$, the variables being partialled out.

After applying the Cholesky decomposition algorithm to each row associated with variables $z$, PROC CORR checks all higher-numbered diagonal elements associated with $y$ for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than $\varepsilon$ times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion $\varepsilon$ using the SINGULAR= option. For Pearson partial correlations, a controlling variable $z$ is considered singular if the $R^2$ for predicting this variable from the variables that are already partialled out exceeds $1-\varepsilon$. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable $y$ is considered singular if the $R^2$ for predicting this variable from the controlling variables exceeds $1-\varepsilon$. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.

After the Cholesky decomposition algorithm is performed on all rows associated with $z$, the resulting matrix has the form

\[
T = \begin{pmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy.z} \end{pmatrix}
\]

where $T_{zz}$ is an upper triangular matrix with

\[
T_{zz}'\,T_{zz} = S_{zz} \qquad T_{zz}'\,T_{zy} = S_{zy}
\]

If $S_{zz}$ is positive definite, then the partial CSSCP matrix $S_{yy.z}$ is identical to the matrix derived from the formula

\[
S_{yy.z} = S_{yy} - S_{zy}'\,S_{zz}^{-1} S_{zy}
\]
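For a single controlling variable, the matrix formula reduces to scalar arithmetic. The sketch below forms a CSSCP matrix and eliminates the row and column of the controlling variable (a plain Schur-complement step for illustration, not the sequential Cholesky sweep PROC CORR uses):

```python
def csscp(cols):
    """Corrected sums of squares and crossproducts matrix for a list of columns."""
    n = len(cols[0])
    means = [sum(c) / n for c in cols]
    return [[sum((cols[i][k] - means[i]) * (cols[j][k] - means[j]) for k in range(n))
             for j in range(len(cols))] for i in range(len(cols))]

def partial_csscp(S, k):
    """Partial CSSCP after controlling for the variable at index k (one z)."""
    idx = [i for i in range(len(S)) if i != k]
    return [[S[i][j] - S[i][k] * S[k][j] / S[k][k] for j in idx] for i in idx]
```

Partialling a variable out of itself leaves a zero partial sum of squares, and partial variances never exceed the unpartialled ones.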

The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix. Another way to calculate Pearson partial correlations is to apply the Cholesky decomposition algorithm directly to the correlation matrix and use the correlation formula on the resulting matrix.

To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to that for Pearson partial correlations, except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from -1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.

When a correlation matrix (Pearson, Spearman, or Kendall tau-b correlation matrix) is positive definite, the resulting partial correlation between variables $x$ and $y$ after adjusting for a single variable $z$ is identical to that obtained from the first-order partial correlation formula

\[
r_{xy.z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1-r_{xz}^2)(1-r_{yz}^2)}}
\]

where $r_{xy}$, $r_{xz}$, and $r_{yz}$ are the appropriate correlations.

The formula for higher-order partial correlations is a straightforward extension of the preceding first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between $x$ and $y$ controlling for both $z_1$ and $z_2$ is identical to the second-order partial correlation formula

\[
r_{xy.z_1 z_2} = \frac{r_{xy.z_1} - r_{xz_2.z_1}\,r_{yz_2.z_1}}{\sqrt{(1-r_{xz_2.z_1}^2)(1-r_{yz_2.z_1}^2)}}
\]

where $r_{xy.z_1}$, $r_{xz_2.z_1}$, and $r_{yz_2.z_1}$ are first-order partial correlations among variables $x$, $y$, and $z_2$ given $z_1$.
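The recursion can be sketched in a few lines. The correlation values in the test are hypothetical, chosen only so the matrix is positive definite; a useful property check is that the conditioning order does not matter.

```python
from math import sqrt

def first_order(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy.z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def second_order(r_xy, r_xa, r_xb, r_ya, r_yb, r_ab):
    """r_xy.ab: partial out variable a first, then b, via first-order steps."""
    r_xy_a = first_order(r_xy, r_xa, r_ya)
    r_xb_a = first_order(r_xb, r_xa, r_ab)
    r_yb_a = first_order(r_yb, r_ya, r_ab)
    return first_order(r_xy_a, r_xb_a, r_yb_a)
```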

Analyzing latent constructs such as job satisfaction, motor ability, sensory recognition, or customer satisfaction requires instruments to accurately measure the constructs. Interrelated items may be summed to obtain an overall score for each participant. Cronbach's coefficient alpha estimates the reliability of this type of scale by determining the internal consistency of the test or the average correlation of items within the test (Cronbach 1951).

When a value is recorded, the observed value contains some degree of measurement error. Two sets of measurements on the same variable for the same individual may not have identical values. However, repeated measurements for a series of individuals will show some consistency. Reliability measures internal consistency from one set of measurements to another. The observed value $Y$ is divided into two components, a true value $T$ and a measurement error $E$. The measurement error is assumed to be independent of the true value, that is,

\[
Y = T + E \qquad \operatorname{Cov}(T,E) = 0
\]

The reliability coefficient of a measurement test is defined as the squared correlation between the observed value $Y$ and the true value $T$, that is,

\[
\rho^2(Y,T) = \frac{\operatorname{Var}(T)}{\operatorname{Var}(Y)}
\]

which is the proportion of the observed variance due to true differences among individuals in the sample. If $Y$ is the sum of several observed variables measuring the same feature, you can estimate $\operatorname{Var}(T)$. Cronbach's coefficient alpha, based on a lower bound for $\operatorname{Var}(T)$, is an estimate of the reliability coefficient.

Suppose $p$ variables are used with $Y_j = T_j + E_j$ for $j=1,2,\ldots,p$, where $Y_j$ is the observed value, $T_j$ is the true value, and $E_j$ is the measurement error. The measurement errors ($E_j$) are independent of the true values ($T_j$) and are also independent of each other. Let $Y_0 = \sum_{j=1}^p Y_j$ be the total observed score and $T_0 = \sum_{j=1}^p T_j$ be the total true score. Because

\[
(p-1) \sum_j \operatorname{Var}(T_j) \;\ge\; \sum_{i \ne j} \operatorname{Cov}(T_i, T_j)
\]

a lower bound for $\operatorname{Var}(T_0)$ is given by

\[
\frac{p}{p-1} \sum_{i \ne j} \operatorname{Cov}(T_i, T_j)
\]

With $\operatorname{Cov}(Y_i, Y_j) = \operatorname{Cov}(T_i, T_j)$ for $i \ne j$, a lower bound for the reliability coefficient $\operatorname{Var}(T_0)/\operatorname{Var}(Y_0)$ is then given by the Cronbach's coefficient alpha:

\[
\alpha = \frac{p}{p-1}\left(\frac{\sum_{i \ne j} \operatorname{Cov}(Y_i,Y_j)}{\operatorname{Var}(Y_0)}\right)
       = \frac{p}{p-1}\left(1 - \frac{\sum_j \operatorname{Var}(Y_j)}{\operatorname{Var}(Y_0)}\right)
\]

If the variances of the items vary widely, you can standardize the items to a standard deviation of 1 before computing the coefficient alpha. If the variables are dichotomous (0,1), the coefficient alpha is equivalent to the Kuder-Richardson 20 (KR-20) reliability measure.

When the correlation between each pair of variables is 1, the coefficient alpha has a maximum value of 1. With negative correlations between some variables, the coefficient alpha can have a value less than zero. The larger the overall alpha coefficient, the more likely that items contribute to a reliable scale. Nunnally (1978) suggests .70 as an acceptable reliability coefficient; smaller reliability coefficients are seen as inadequate. However, this varies by discipline.
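The raw (unstandardized) coefficient alpha can be sketched directly from the variance form of the formula (an illustrative helper, not PROC CORR's ALPHA computation):

```python
def cronbach_alpha(items):
    """Raw coefficient alpha; items is a list of p item-score columns, each of length n."""
    p, n = len(items), len(items[0])
    def var(v):
        m = sum(v) / n
        return sum((x - m) ** 2 for x in v) / (n - 1)  # sample variance
    total = [sum(scores) for scores in zip(*items)]    # total score per participant
    return p / (p - 1) * (1 - sum(var(item) for item in items) / var(total))
```

When every item is perfectly correlated with every other (for example, identical columns), alpha reaches its maximum of 1.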

To determine how each item reflects the reliability of the scale, you calculate a coefficient alpha after deleting each variable independently from the scale. The Cronbach's coefficient alpha from all variables except the $i$th variable is given by

\[
\alpha_i = \frac{p-1}{p-2}\left(1 - \frac{\sum_{j \ne i} \operatorname{Var}(Y_j)}{\operatorname{Var}\bigl(\sum_{j \ne i} Y_j\bigr)}\right)
\]

If the reliability coefficient increases after deleting an item from the scale, you can assume that the item is not highly correlated with other items in the scale. Conversely, if the reliability coefficient decreases, you can assume that the item is highly correlated with other items in the scale. See SAS Communications, 4th Quarter 1994, for more information on how to interpret Cronbach's coefficient alpha.
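Deleted-item alphas amount to recomputing alpha on the remaining items. A self-contained sketch (hypothetical helper names; requires at least three items so each reduced scale still has two):

```python
def alpha(items):
    """Cronbach's coefficient alpha for a list of item-score columns."""
    p, n = len(items), len(items[0])
    def var(v):
        m = sum(v) / n
        return sum((x - m) ** 2 for x in v) / (n - 1)
    total = [sum(s) for s in zip(*items)]
    return p / (p - 1) * (1 - sum(var(it) for it in items) / var(total))

def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn."""
    return [alpha(items[:i] + items[i + 1:]) for i in range(len(items))]
```

Deleting an item that is poorly correlated with the rest of the scale raises alpha, in line with the interpretation above.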

Listwise deletion of observations with missing values is necessary to correctly calculate Cronbach's coefficient alpha. PROC CORR does not automatically use listwise deletion when you specify ALPHA. Therefore, use the NOMISS option if the data set contains missing values. Otherwise, PROC CORR prints a warning message in the SAS log indicating the need to use NOMISS with ALPHA.

Probability values for the Pearson and Spearman correlations are computed by treating

\[
t = r\,\sqrt{\frac{n-2}{1-r^2}}
\]

as coming from a $t$ distribution with $n-2$ degrees of freedom, where $r$ is the appropriate correlation and $n$ is the number of observations.

Probability values for the Pearson and Spearman partial correlations are computed by treating

\[
t = r\,\sqrt{\frac{n-k-2}{1-r^2}}
\]

as coming from a $t$ distribution with $n-k-2$ degrees of freedom, where $r$ is the appropriate partial correlation and $k$ is the number of variables being partialled out.
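The two $t$ statistics differ only in the degrees-of-freedom term; a quick sketch (the function names are illustrative):

```python
from math import sqrt

def t_pearson(r, n):
    """t statistic for a Pearson or Spearman correlation; refer to t with n - 2 df."""
    return r * sqrt((n - 2) / (1 - r * r))

def t_partial(r, n, k):
    """t statistic for a partial correlation with k variables partialled out; n - k - 2 df."""
    return r * sqrt((n - k - 2) / (1 - r * r))
```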

Probability values for Kendall correlations are computed by treating

\[
\frac{s}{\sqrt{\operatorname{Var}(s)}}
\]

as coming from a normal distribution, where

\[
s = \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)
\]

$x_i$ are the values of the first variable, $y_i$ are the values of the second variable, and the function $\operatorname{sgn}(z)$ is defined as

\[
\operatorname{sgn}(z) = \begin{cases} \phantom{-}1 & \text{if } z > 0 \\ \phantom{-}0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}
\]

The formula for the variance of $s$, $\operatorname{Var}(s)$, is computed as

\[
\operatorname{Var}(s) = \frac{v_0 - v_t - v_u}{18} + \frac{v_1}{2n(n-1)} + \frac{v_2}{9n(n-1)(n-2)}
\]

where

\[
v_0 = n(n-1)(2n+5) \qquad v_t = \sum_k t_k(t_k-1)(2t_k+5) \qquad v_u = \sum_l u_l(u_l-1)(2u_l+5)
\]
\[
v_1 = \Bigl(\sum_k t_k(t_k-1)\Bigr)\Bigl(\sum_l u_l(u_l-1)\Bigr) \qquad v_2 = \Bigl(\sum_k t_k(t_k-1)(t_k-2)\Bigr)\Bigl(\sum_l u_l(u_l-1)(u_l-2)\Bigr)
\]

The sums are over groups of tied values, where $t_k$ is the number of tied $x$ values in the $k$th group and $u_l$ is the number of tied $y$ values in the $l$th group (Noether 1967). The sampling distribution of Kendall's partial tau-b is unknown; therefore, the probability values are not available.
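The statistic $s/\sqrt{\operatorname{Var}(s)}$ with the tie-corrected variance can be sketched as follows (a direct evaluation for illustration; `kendall_z` is an assumed helper name, not part of PROC CORR):

```python
from math import sqrt
from collections import Counter

def kendall_z(x, y):
    """s / sqrt(var(s)) for Kendall's tau, with the tie-corrected variance."""
    n = len(x)
    sgn = lambda z: (z > 0) - (z < 0)
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    t = list(Counter(x).values())   # sizes of tied x groups
    u = list(Counter(y).values())   # sizes of tied y groups
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
    vu = sum(ui * (ui - 1) * (2 * ui + 5) for ui in u)
    v1 = sum(ti * (ti - 1) for ti in t) * sum(ui * (ui - 1) for ui in u)
    v2 = sum(ti * (ti - 1) * (ti - 2) for ti in t) * sum(ui * (ui - 1) * (ui - 2) for ui in u)
    var_s = (v0 - vt - vu) / 18 + v1 / (2 * n * (n - 1)) + v2 / (9 * n * (n - 1) * (n - 2))
    return s / sqrt(var_s)
```

With no ties, the variance reduces to $n(n-1)(2n+5)/18$, and negating one variable flips only the sign of the statistic.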

The probability values for Hoeffding's D statistic are computed using the asymptotic distribution computed by Blum, Kiefer, and Rosenblatt (1961). The formula is

\[
\frac{(n-1)\pi^4}{60}\,D + \frac{\pi^4}{72}
\]

which comes from the asymptotic distribution. When the sample size is less than 10, see the tables for the distribution of D in Hollander and Wolfe (1973).