
SAS Procedures Guide

Populations and Parameters

A population of values can be described
in terms of its **cumulative distribution function**, which gives
the proportion of the population less than or equal to each possible value.
A discrete population can also be described by a **probability function**, which gives the proportion of the population equal to each possible
value. A continuous population can often be described by a **density
function**, which is the derivative of the cumulative distribution function.
A density function can be approximated by a histogram that gives the proportion
of the population lying within each of a series of intervals of values. A
probability density function is like a histogram with an infinite number of
infinitely small intervals.
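
The histogram approximation can be sketched directly. The following Python fragment (an illustrative aside; the chapter's own examples use SAS) tabulates the proportion of a large sample falling in each of ten equal intervals, which is exactly the histogram approximation described above:

```python
import random

# Draw a large sample from a uniform population on [0, 1) and approximate
# its density by the proportion of values in each of ten equal intervals.
random.seed(1)
values = [random.random() for _ in range(100_000)]

bins = 10
counts = [0] * bins
for v in values:
    counts[int(v * bins)] += 1

# For the uniform density, each interval should hold roughly 10 percent.
proportions = [c / len(values) for c in counts]
print([round(p, 3) for p in proportions])
```

As the intervals shrink and the sample grows, these proportions (divided by the interval width) approach the density function itself.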

In technical literature, when
the term **distribution** is used without qualification, it generally
refers to the cumulative distribution function. In informal writing, **distribution** sometimes means the density function instead. Often the
word **distribution** is used simply to refer to an abstract population
of values rather than some concrete population. Thus, the statistical literature
refers to many types of abstract distributions, such as normal distributions,
exponential distributions, Cauchy distributions, and so on. When a phrase
such as **normal distribution** is used, it frequently does not matter
whether the cumulative distribution function or the density function is intended.

It may be
expedient to describe a population in terms of a few measures that summarize
interesting features of the distribution. One such measure, computed from
the population values, is called a **parameter**. Many different
parameters can be defined to measure different aspects of a distribution.

The most commonly used parameter
is the (arithmetic) **mean**. If the population contains a finite
number of values, the population mean is computed as the sum of all the values
in the population divided by the number of elements in the population. For
an infinite population, the concept of the mean is similar but requires more
complicated mathematics.

E(**x**) denotes the mean of a population of values symbolized by **x**, such as
height, where E stands for **expected value**.
You can also consider expected values of derived functions of the original
values. For example, if **x** represents height, then
E(**x**²) is the expected value of height squared, that is, the mean
value of the population obtained by squaring every value in the population
of heights.
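
For a finite population, the expected-value computations are just averages. A small Python sketch (the height values are hypothetical):

```python
# Hypothetical finite population of heights (inches).
heights = [64, 66, 68, 70, 72]

# E(x): the sum of all values divided by the number of values.
e_x = sum(heights) / len(heights)

# E(x**2): the mean of the population obtained by squaring every value.
e_x2 = sum(h ** 2 for h in heights) / len(heights)

print(e_x)    # 68.0
print(e_x2)   # 4632.0
```

Note that E(x²) is not the same as E(x)² = 4624; here the difference, 8, is the population variance.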

Samples and Statistics

Samples
can be selected in a variety of ways. Most SAS procedures assume that the
data constitute a **simple random sample**, which means that the
sample was selected in such a way that all possible samples were equally likely
to be selected.

Statistics
from a sample can be used to make inferences, or reasonable guesses, about
the parameters of a population. For example, if you take a random sample
of 30 students from the high school, the mean height for those 30 students
is a reasonable guess, or **estimate**, of the mean height of all
the students in the high school. Other statistics, such as the standard error,
can provide information about how good an estimate is likely to be.

For any population parameter, several statistics can estimate it. Often, however, there is one particular statistic that is customarily used to estimate a given parameter. For example, the sample mean is the usual estimator of the population mean. In the case of the mean, the formulas for the parameter and the statistic are the same. In other cases, the formula for a parameter may be different from that of the most commonly used estimator. The most commonly used estimator is not necessarily the best estimator in all applications.

Measures of Location

Percentiles

The upper quartile of a distribution is the value below which 75 percent of the measurements fall (the 75th percentile). Twenty-five percent of the measurements fall below the lower quartile value.

In the following example, SAS artificially generates the data with a pseudorandom number function. The UNIVARIATE procedure computes a variety of quantiles and measures of location and outputs the values to a SAS data set. A DATA step then uses the SYMPUT routine to assign the values of the statistics to macro variables. The macro %FORMGEN uses these macro variables to produce value labels for the FORMAT procedure. PROC CHART uses the resulting format to display the values of the statistics on a histogram.

options nodate pageno=1 linesize=64 pagesize=52;
title 'Example of Quantiles and Measures of Location';

data random;
   drop n;
   do n=1 to 1000;
      X=floor(exp(rannor(314159)*.8+1.8));
      output;
   end;
run;

proc univariate data=random nextrobs=0;
   var x;
   output out=location
          mean=Mean mode=Mode median=Median
          q1=Q1 q3=Q3 p5=P5 p10=P10
          p90=P90 p95=P95 max=Max;
run;

proc print data=location noobs;
run;

data _null_;
   set location;
   call symput('MEAN',round(mean,1));
   call symput('MODE',mode);
   call symput('MEDIAN',round(median,1));
   call symput('Q1',round(q1,1));
   call symput('Q3',round(q3,1));
   call symput('P5',round(p5,1));
   call symput('P10',round(p10,1));
   call symput('P90',round(p90,1));
   call symput('P95',round(p95,1));
   call symput('MAX',min(50,max));
run;

%macro formgen;
   %do i=1 %to &max;
      %let value=&i;
      %if &i=&p5     %then %let value=&value P5;
      %if &i=&p10    %then %let value=&value P10;
      %if &i=&q1     %then %let value=&value Q1;
      %if &i=&mode   %then %let value=&value Mode;
      %if &i=&median %then %let value=&value Median;
      %if &i=&mean   %then %let value=&value Mean;
      %if &i=&q3     %then %let value=&value Q3;
      %if &i=&p90    %then %let value=&value P90;
      %if &i=&p95    %then %let value=&value P95;
      %if &i=&max    %then %let value=>=&value;
      &i="&value"
   %end;
%mend;

proc format print;
   value stat %formgen;
run;

options pagesize=42 linesize=64;

proc chart data=random;
   vbar x / midpoints=1 to &max by 1;
   format x stat.;
   footnote  'P5  = 5TH PERCENTILE';
   footnote2 'P10 = 10TH PERCENTILE';
   footnote3 'P90 = 90TH PERCENTILE';
   footnote4 'P95 = 95TH PERCENTILE';
   footnote5 'Q1  = 1ST QUARTILE';
   footnote6 'Q3  = 3RD QUARTILE';
run;

Measures of Variability

The sample variance is denoted by s². The difference between a value and the mean is called
a **deviation from the mean**. Thus, the variance approximates the
mean of the squared deviations.

When all the values lie close to the mean, the variance is small but never less than zero. When values are more scattered, the variance is larger. If all sample values are multiplied by a constant, the sample variance is multiplied by the square of the constant.
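
The scaling property is easy to verify. A minimal Python check (the data here are arbitrary):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample variance with the usual n - 1 divisor.
v = statistics.variance(data)

# Multiplying every value by a constant multiplies the sample
# variance by the square of that constant.
scaled = [3 * x for x in data]
v_scaled = statistics.variance(scaled)

print(v, v_scaled)
```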

Sometimes values other than n − 1 are used in the denominator. The VARDEF= option controls
what divisor the procedure uses.

Measures of Shape

Population skewness is defined as

E[(x − μ)³] / σ³

Because the deviations are cubed rather than squared, the signs of the
deviations are maintained. Cubing the deviations also emphasizes the effects
of large deviations. The formula includes a divisor of
σ³ to remove the effect of scale, so multiplying all values
by a constant does not change the skewness. Skewness can thus be interpreted
as a tendency for one tail of the population to be heavier than the other.
Skewness can be positive or negative and is unbounded.

Population kurtosis is defined as

E[(x − μ)⁴] / σ⁴ − 3

**Note:** Some statisticians omit the subtraction of 3.

Because the deviations are raised to the fourth power, positive and negative deviations make the same contribution, while large deviations are strongly emphasized. Because of the divisor σ⁴, multiplying each value by a constant has no effect on kurtosis.

Population kurtosis must lie between −2 and +∞, inclusive. If γ_{1} represents population skewness and γ_{2} represents population kurtosis, then γ_{2} ≥ γ_{1}² − 2.

Statistical
literature sometimes reports that kurtosis measures the **peakedness**
of a density. However, heavy tails have much more influence on kurtosis than
does the shape of the distribution near the mean (Kaplansky 1945; Ali 1974;
Johnson, et al. 1980).

Sample skewness and kurtosis are rather unreliable estimators of the corresponding parameters in small samples. They are better estimators when your sample is very large. However, large values of skewness or kurtosis may merit attention even in small samples because such values indicate that statistical methods that are based on normality assumptions may be inappropriate.
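
The moment formulas can be sketched in Python. This computes the simple (population-formula) versions of skewness and kurtosis for a small sample; note that these are not the bias-corrected statistics SAS procedures report by default:

```python
import statistics

data = [-7, -2, 1, 3, 6, 10, 15, 21, 30]

n = len(data)
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # population (divisor n) standard deviation

# Skewness: mean cubed deviation divided by sigma**3.
skewness = sum((x - mu) ** 3 for x in data) / n / sigma ** 3

# Kurtosis: mean fourth-power deviation divided by sigma**4, minus 3.
kurtosis = sum((x - mu) ** 4 for x in data) / n / sigma ** 4 - 3

print(round(skewness, 2), round(kurtosis, 2))

# Any distribution satisfies kurtosis >= skewness**2 - 2.
assert kurtosis >= skewness ** 2 - 2
```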

The Normal Distribution

Many statistical methods are designed under the assumption that the population being sampled is normally distributed. Nevertheless, most real-life populations do not have normal distributions. Before using any statistical method based on normality assumptions, you should consult the statistical literature to find out how sensitive the method is to nonnormality and, if necessary, check your sample for evidence of nonnormality.

In the following example, SAS generates a sample from
a normal distribution with a mean of 50 and a standard deviation of 10. The
UNIVARIATE procedure performs tests for location and normality. Because the
data are from a normal distribution, all **p**-values from the tests
for normality are greater than 0.15. The CHART procedure displays a histogram
of the observations. The shape of the histogram is a bell-like, normal density.

options nodate pageno=1 linesize=64 pagesize=52;
title '10000 Obs Sample from a Normal Distribution';
title2 'with Mean=50 and Standard Deviation=10';

data normaldat;
   drop n;
   do n=1 to 10000;
      X=10*rannor(53124)+50;
      output;
   end;
run;

proc univariate data=normaldat nextrobs=0 normal
                mu0=50 loccount;
   var x;
run;

proc format;
   picture msd
      20='20 3*Std' (noedit)
      30='30 2*Std' (noedit)
      40='40 1*Std' (noedit)
      50='50 Mean ' (noedit)
      60='60 1*Std' (noedit)
      70='70 2*Std' (noedit)
      80='80 3*Std' (noedit)
      other=' ';
run;

options linesize=64 pagesize=42;

proc chart;
   vbar x / midpoints=20 to 80 by 2;
   format x msd.;
run;

Sampling Distribution of the Mean

It can be proven mathematically that if the original population has mean μ
and standard deviation σ, then the sampling distribution of the mean
also has mean μ, but its standard deviation is σ/√n.
The standard deviation of the sampling distribution of
the mean is called the **standard error of the mean**. The standard
error of the mean provides an indication of the accuracy of a sample mean
as an estimator of the population mean.

If the original population has a normal distribution, then the sampling distribution of the mean is also normal. If the original distribution is not normal but does not have excessively long tails, then the sampling distribution of the mean can be approximated by a normal distribution for large sample sizes.
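
The convergence can be simulated directly. Here is a Python sketch of the same idea as the SAS programs in this section (an exponential population with mean 1 and standard deviation 1):

```python
import random
import statistics

# Draw 1000 samples of size n from an exponential population
# (mean 1, standard deviation 1) and record each sample mean.
random.seed(42)
n = 10
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(1000)]

# The sampling distribution of the mean has mean 1 and standard
# deviation sigma / sqrt(n) = 1 / sqrt(10), about 0.32.
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2))
```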

The following example consists of three separate programs that show how the sampling distribution of the mean can be approximated by a normal distribution as the sample size increases. The first DATA step uses the RANEXP function to create a sample of 1000 observations from an exponential distribution. The theoretical population mean is 1.00, while the sample mean is 1.01, to two decimal places. The population standard deviation is 1.00; the sample standard deviation is 1.04.

This is an example of a nonnormal distribution. The population skewness is 2.00, which is close to the sample skewness of 1.97. The population kurtosis is 6.00, but the sample kurtosis is only 4.80.

options nodate pageno=1 linesize=64 pagesize=42;
title '1000 Observation Sample';
title2 'from an Exponential Distribution';

data expodat;
   drop n;
   do n=1 to 1000;
      X=ranexp(18746363);
      output;
   end;
run;

proc format;
   value axisfmt
       .05='0.05'  .55='0.55'
      1.05='1.05' 1.55='1.55'
      2.05='2.05' 2.55='2.55'
      3.05='3.05' 3.55='3.55'
      4.05='4.05' 4.55='4.55'
      5.05='5.05' 5.55='5.55'
      other=' ';
run;

proc chart data=expodat;
   vbar x / axis=300 midpoints=0.05 to 5.55 by .1;
   format x axisfmt.;
run;

options pagesize=64;

proc univariate data=expodat nextrobs=0 normal mu0=1;
   var x;
run;

The next DATA step generates 1000 different samples from the same exponential distribution. Each sample contains ten observations. The MEANS procedure computes the mean of each sample. In the data set that is created by PROC MEANS, each observation represents the mean of a sample of ten observations from an exponential distribution. Thus, the data set is a sample from the sampling distribution of the mean for an exponential population.

PROC UNIVARIATE displays statistics for this sample of means. Notice that the mean of the sample of means is .99, almost the same as the mean of the original population. Theoretically, the standard deviation of the sampling distribution is 1/√10 ≈ .32, whereas the standard deviation of this sample from the sampling distribution is .30. The skewness (.55) and kurtosis (-.006) are closer to zero in the sample from the sampling distribution than in the original sample from the exponential distribution. This is so because the sampling distribution is closer to a normal distribution than is the original exponential distribution. The CHART procedure displays a histogram of the 1000 sample means. The shape of the histogram is much closer to a bell-like, normal density, but it is still distinctly lopsided.

options nodate pageno=1 linesize=64 pagesize=48;
title '1000 Sample Means with 10 Obs per Sample';
title2 'Drawn from an Exponential Distribution';

data samp10;
   drop n;
   do Sample=1 to 1000;
      do n=1 to 10;
         X=ranexp(433879);
         output;
      end;
   end;
run;

proc means data=samp10 noprint;
   output out=mean10 mean=Mean;
   var x;
   by sample;
run;

proc format;
   value axisfmt
       .05='0.05'  .55='0.55'
      1.05='1.05' 1.55='1.55'
      2.05='2.05'
      other=' ';
run;

proc chart data=mean10;
   vbar mean / axis=300 midpoints=0.05 to 2.05 by .1;
   format mean axisfmt.;
run;

options pagesize=64;

proc univariate data=mean10 nextrobs=0 normal mu0=1;
   var mean;
run;

In the following DATA step, the size of each sample from the exponential distribution is increased to 50. The standard deviation of the sampling distribution is smaller than in the previous example because the size of each sample is larger. Also, the sampling distribution is even closer to a normal distribution, as can be seen from the histogram and the skewness.

options nodate pageno=1 linesize=64 pagesize=48;
title '1000 Sample Means with 50 Obs per Sample';
title2 'Drawn from an Exponential Distribution';

data samp50;
   drop n;
   do sample=1 to 1000;
      do n=1 to 50;
         X=ranexp(72437213);
         output;
      end;
   end;
run;

proc means data=samp50 noprint;
   output out=mean50 mean=Mean;
   var x;
   by sample;
run;

proc format;
   value axisfmt
       .05='0.05'  .55='0.55'
      1.05='1.05' 1.55='1.55'
      2.05='2.05' 2.55='2.55'
      other=' ';
run;

proc chart data=mean50;
   vbar mean / axis=300 midpoints=0.05 to 2.55 by .1;
   format mean axisfmt.;
run;

options pagesize=64;

proc univariate data=mean50 nextrobs=0 normal mu0=1;
   var mean;
run;

Testing Hypotheses

Consider the universe of students in a college. Let the variable X be the number of pounds by which a student's weight deviates from the ideal weight for a person of the same sex, height, and build. You want to find out whether the population of students is, on the average, underweight or overweight. To this end, you have taken a random sample of X values from nine students, with results as given in the following DATA step:

title 'Deviations from Normal Weight';

data x;
   input X @@;
   datalines;
-7 -2 1 3 6 10 15 21 30
;
run;

You can define several hypotheses of interest. One hypothesis is that, on the average, the
students are of exactly ideal weight. If μ represents the population mean
of the X values, you can write this hypothesis, called the **null**
hypothesis, as H_{0}: μ = 0. The other two hypotheses, called **alternative**
hypotheses, are that the students are underweight on the average,
H_{1}: μ < 0, and that the students are overweight on the average,
H_{2}: μ > 0.

The null hypothesis is so called because in many situations it corresponds to the assumption of "no effect" or "no difference." However, this interpretation is not appropriate for all testing problems. The null hypothesis is like a straw man that can be toppled by statistical evidence. You decide between the alternative hypotheses according to which way the straw man falls.

A naive way to approach this problem would be to look at the sample mean x̄ and decide among the three hypotheses according to the following rule:

- If x̄ < 0, decide on H_{1}.
- If x̄ = 0, decide on H_{0}.
- If x̄ > 0, decide on H_{2}.

The trouble with this approach is that there may be a high probability
of making an incorrect decision. If H_{0} is true, you
are nearly certain to make a wrong decision because the chances of
x̄ being exactly zero are almost nil. If μ is slightly
less than zero, so that H_{1} is true, there may be nearly
a 50 percent chance that x̄ will be greater than zero in repeated sampling, so the
chances of incorrectly choosing H_{2} would also be nearly
50 percent. Thus, you have a high probability of making an error if
x̄ is near zero. In such cases, there is not enough evidence
to make a confident decision, so the best response may be to reserve judgment
until you can obtain more evidence.

The question is, how far from zero must x̄ be for you to be able to make a confident decision? The answer can be obtained by considering the sampling distribution of x̄. If X has a roughly normal distribution, then x̄ has an approximately normal sampling distribution. The mean of the sampling distribution of x̄ is μ. Assume temporarily that σ, the standard deviation of X, is known to be 12. Then the standard error of x̄ for samples of nine observations is σ/√n = 12/√9 = 4.

You know that about 95 percent of the values from a normal distribution are within two standard deviations of the mean, so about 95 percent of the possible samples of nine X values have a sample mean between μ − 8 and μ + 8 or, under the null hypothesis μ = 0, between −8 and 8. Consider the chances of making an error with the following decision rule:

- If x̄ < −8, decide on H_{1}.
- If −8 ≤ x̄ ≤ 8, reserve judgment.
- If x̄ > 8, decide on H_{2}.

If H_{0} is true, then in about 95 percent of the
possible samples x̄ will be between the **critical values**
−8 and 8, so you will reserve judgment. In these cases the
statistical evidence is not strong enough to fell the straw man. In the other
5 percent of the samples you will make an error; in 2.5 percent of the samples
you will incorrectly choose H_{1}, and in 2.5 percent
you will incorrectly choose H_{2}.
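
The 2.5-percent figures follow from the normal sampling distribution. A Python check using only the standard library (standard error 4, as derived above):

```python
from statistics import NormalDist

# Under H0, the sample mean is approximately normal with mean 0 and
# standard error 4, so the critical values -8 and 8 sit two standard
# errors from the mean.
sampling_dist = NormalDist(mu=0, sigma=4)

p_low = sampling_dist.cdf(-8)        # P(incorrectly choosing H1)
p_high = 1 - sampling_dist.cdf(8)    # P(incorrectly choosing H2)

print(round(p_low + p_high, 3))      # about 0.046, roughly 5 percent
```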

The price you pay for controlling the chances of making an error is
the necessity of reserving judgment when there is not sufficient statistical
evidence to reject the null hypothesis.

The decision rule
is a **two-tailed test** because the alternative hypotheses allow
for population means either smaller or larger than the value specified in
the null hypothesis. If you were interested only in the possibility of the
students being overweight on the average, you could use a **one-tailed
test**:

- If x̄ ≤ 8, reserve judgment.
- If x̄ > 8, decide on H_{2}.

For this one-tailed test, the type I error rate is 2.5 percent, half that of the two-tailed test.

The probability of rejecting the null hypothesis if it is false is called the **power** of the statistical test and is typically denoted as
1 − β. β is called the **Type II error rate**, which is
the probability of not rejecting a false null hypothesis. The power depends
on the true value of the parameter. In the example, assume the population
mean is 4. The power for detecting H_{2} is the probability
of getting a sample mean greater than 8. The critical value 8 is one standard
error higher than the population mean 4. The chance of getting a value at
least one standard deviation greater than the mean from a normal distribution
is about 16 percent, so the power for detecting the alternative hypothesis
H_{2} is about 16 percent. If the population mean were
8, the power for H_{2} would be 50 percent, whereas a
population mean of 12 would yield a power of about 84 percent.
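
These power values can be reproduced with the normal distribution function. A Python sketch, again using the standard error of 4 from the text:

```python
from statistics import NormalDist

# Power of the rule "decide on H2 when the sample mean exceeds 8"
# for several true population means.
std_err = 4
for true_mean in (4, 8, 12):
    sampling_dist = NormalDist(mu=true_mean, sigma=std_err)
    power = 1 - sampling_dist.cdf(8)   # P(sample mean > 8)
    print(true_mean, round(power, 2))
```

The printed powers, about 0.16, 0.50, and 0.84, match the percentages given above.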

The smaller the type I error rate is, the less the chance of making
an incorrect decision, but the higher the chance of having to reserve judgment.
In choosing a type I error rate, you should consider the resulting power for
various alternatives of interest.

The **t** statistic, t = (x̄ − μ_{0}) / (s/√n), is the difference between the sample
mean and the hypothesized mean μ_{0}
divided by the estimated standard error of the mean.

If the null hypothesis is true and the population is normally distributed,
then the **t** statistic has what is called a **Student's t distribution** with
n − 1 degrees of freedom. This distribution looks very similar
to a normal distribution, but the tails of the Student's **t** distribution
are heavier. As the sample size gets larger, the sample standard deviation
becomes a better estimator of the population standard deviation, and the **t** distribution gets closer to a normal distribution.

You can base a decision rule on the
**t** statistic:

- If t < −2.3, decide on H_{1}.
- If −2.3 ≤ t ≤ 2.3, reserve judgment.
- If t > 2.3, decide on H_{2}.

The value 2.3 was obtained from a table of Student's **t**
distribution to give a type I error rate of 5 percent for 8 (that is,
n − 1) degrees of freedom. Most common statistics texts contain
a table of Student's **t** distribution. If you do not have a statistics
text handy, you can use the DATA step and the TINV function to print any values
from the **t** distribution.

By default, PROC UNIVARIATE computes a **t** statistic for
the null hypothesis that μ = 0, along with related statistics. Use the MU0= option in
the PROC statement to specify another value for the null hypothesis.

This example uses the data on deviations from normal weight, which consist
of nine observations. First, PROC MEANS computes the **t** statistic
for the null hypothesis that μ = 0. Then, the TINV function in a DATA step computes the value
of Student's **t** distribution for a two-tailed test at the 5 percent
level of significance and 8 degrees of freedom.

title 'Deviations from Normal Weight';

data devnorm;
   input X @@;
   datalines;
-7 -2 1 3 6 10 15 21 30
;
run;

proc means data=devnorm maxdec=3 n mean std stderr t probt;
run;

title 'Student''s t Critical Value';

data _null_;
   file print;
   t=tinv(.975,8);
   put t 5.3;
run;

In the current example, the value of **Pr > |t|** in
the PROC MEANS output is .0606, so the null hypothesis could be rejected
at the 10 percent significance level but not at the 5 percent level.
A **p**-value is a measure of the strength of the evidence
against the null hypothesis. The smaller the **p**-value, the stronger
the evidence for rejecting the null hypothesis.
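
The **t** statistic reported by PROC MEANS can be checked by hand. A Python sketch using the nine weight deviations:

```python
import statistics

# Deviations-from-ideal-weight data from the example above.
x = [-7, -2, 1, 3, 6, 10, 15, 21, 30]

n = len(x)
mean = statistics.fmean(x)
std = statistics.stdev(x)        # sample standard deviation (n - 1 divisor)
std_err = std / n ** 0.5         # estimated standard error of the mean

# t statistic for the null hypothesis mu = 0.
t = (mean - 0) / std_err
print(round(t, 2))               # 2.18
```

With 8 degrees of freedom, this t of about 2.18 falls just below the 5-percent critical value of 2.306, consistent with the reported **Pr > |t|** of .0606.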


Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.