The MODEL Procedure 
As with any nonlinear estimation routine, there is no guarantee that the estimation will be successful for a given model and data. If the equations are linear with respect to the parameters, the parameter estimates always converge in one iteration. The methods that iterate the S matrix must iterate further for the S matrix to converge. Nonlinear models may not necessarily converge.
Convergence can be expected only with fully identified parameters, adequate data, and starting values sufficiently close to solution estimates.
Convergence and the rate of convergence may depend primarily on the choice of starting values for the estimates. This does not mean that a great deal of effort should be invested in choosing starting values. First, try the default values. If the estimation fails with these starting values, examine the model and data and rerun the estimation using reasonable starting values. It is usually not necessary that the starting values be very good, just that they not be very bad; choose values that seem plausible for the model and data.
For example, consider the following model:
y = a + b x^{c} + \epsilon
In this equation, Y is linearly related to a power transformation of X. The unknown parameters are a, b, and c, and \epsilon is an unobserved random error. Some simulated data were generated using the following SAS statements. In this simulation, a=10, b=2, and the use of the SQRT function corresponds to c=.5.
   data test;
      do i = 1 to 20;
         x = 5 * ranuni(1234);
         y = 10 + 2 * sqrt(x) + .5 * rannor(2345);
         output;
      end;
   run;
The following statements specify the model and give descriptive labels to the model parameters. Then the FIT statement attempts to estimate a, b, and c using the default starting value .0001.
   proc model data=test;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y;
   run;
PROC MODEL prints model summary and estimation problem summary reports and then prints the output shown in Figure 14.17.
By using the default starting values, PROC MODEL was unable to take even the first step in iterating to the solution. The change in the parameters that the Gauss-Newton method computes is extreme and makes the objective values worse instead of better. Even when this step is shortened by a factor of a million, the objective function is still worse, and PROC MODEL is unable to estimate the model parameters.
The problem is caused by the starting value of C. Using the default starting value C=.0001, the first iteration attempts to compute better values of A and B by what is, in effect, a linear regression of Y on the 10,000th root of X, which is almost the same as the constant 1. Thus the matrix that is inverted to compute the changes is nearly singular and affects the accuracy of the computed parameter changes.
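To see why the matrix is nearly singular, it can help to write out the partial derivatives of the model y = a + b x^c with respect to each parameter. The following is a worked sketch evaluated at the default starting values, not output from the procedure:

```latex
\begin{align*}
\frac{\partial y}{\partial a} &= 1, &
\frac{\partial y}{\partial b} &= x^{c}, &
\frac{\partial y}{\partial c} &= b\,x^{c}\ln x \\
% At c = 0.0001, x^{c} = e^{0.0001 \ln x} is approximately 1 for every x in (0, 5)
\left.\frac{\partial y}{\partial b}\right|_{c=0.0001} &\approx 1, &
\left.\frac{\partial y}{\partial c}\right|_{b=c=0.0001} &\approx 0.0001\,\ln x
\end{align*}
```

At this point the columns for A and B are almost identical and the column for C is scaled down by b = 0.0001, so the crossproducts matrix of these partial derivatives is close to singular.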
This is also illustrated by the next part of the output, which displays collinearity diagnostics for the crossproducts matrix of the partial derivatives with respect to the parameters, shown in Figure 14.18.

This output shows that the matrix is singular and that the partial derivatives of the residual with respect to A, B, and C are collinear at the point ( 0.0001, 0.0001, 0.0001 ) in the parameter space. See the section "Linear Dependencies" for a full explanation of the collinearity diagnostics.
The MODEL procedure next prints the note shown in Figure 14.19, which suggests that you try different starting values.

PROC MODEL then produces the usual printout of results for the nonconverged parameter values. The estimation summary is shown in Figure 14.20. The heading includes the reminder "(Not Converged)."

The nonconverged estimation results are shown in Figure 14.21.

Note that the R^{2} statistic is negative. An R^{2} < 0 results when the residual mean square error for the model is larger than the variance of the dependent variable. Negative R^{2} statistics may be produced when either the parameter estimates fail to converge correctly, as in this case, or when the correctly estimated model fits the data very poorly.
Starting values are specified with the START= option of the FIT statement or on a PARMS statement. For example, the following statements estimate the model parameters using the starting values A=.0001, B=.0001, and C=5.
   proc model data=test;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=5);
   run;
Using these starting values, the estimates converge in 16 iterations. The results are shown in Figure 14.22. Note that since the START= option explicitly declares parameters, the parameter C is placed first in the table.
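The same starting values can also be supplied on a PARMS statement, the other method mentioned above. The following is a minimal sketch of that form. It assumes that a number written after a parameter name on the PARMS statement is taken as that parameter's starting value, and it has not been run against the TEST data set here.

```sas
proc model data=test;
   /* a starting value follows each parameter name; C starts at 5 */
   parms a .0001 b .0001 c 5;
   y = a + b * x ** c;
   fit y;
run;
```

Because the PARMS statement declares the parameters in the order A, B, C, the Parameter Estimates table would be expected to list them in that order rather than with C first.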

For example, the following statements set C to 1 and compute starting values for A and B by estimating these parameters conditional on the fixed value of C. With C=1 this is equivalent to computing A and B by linear regression on X. A PARMS statement is used to declare the parameters in alphabetical order. The ITPRINT option is used to print the parameter values at each iteration.
   proc model data=test;
      parms a b c;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=1) / startiter itprint;
   run;
With better starting values, the estimates converge in only 5 iterations. Counting the 2 iterations required to compute the starting values for A and B, this is 5 fewer than the 12 iterations required without the STARTITER option. The iteration history listing is shown in Figure 14.23.

The results produced in this case are almost the same as the results shown in Figure 14.22, except that the PARMS statement causes the Parameter Estimates table to be ordered A, B, C instead of C, A, B. They are not exactly the same because the different starting values caused the iterations to converge at a slightly different place. This effect is controlled by changing the convergence criterion with the CONVERGE= option.
By default, the STARTITER option performs one iteration to find starting values for the parameters not given values. In this case the model is linear in A and B, so only one iteration is needed. If A or B were nonlinear, you could specify more than one "starting values" iteration by specifying a number for the STARTITER= option.
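As an illustration of giving the STARTITER= option a number, the following sketch requests three starting-value iterations. The value 3 is arbitrary and chosen only for the example; for this model, which is linear in A and B, the extra iterations should not be needed.

```sas
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   /* STARTITER=3: three starting-value iterations for A and B with C held at 1 */
   fit y start=(c=1) / startiter=3 itprint;
run;
```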
For example, the following statements try 5 different starting values for C: 10, 5, 2.5, -2.5, -5. For each value of C, values for A and B are estimated. The combination of A, B, and C values producing the smallest residual mean square is then used to start the iterative process.
   proc model data=test;
      parms a b c;
      y = a + b * x ** c;
      label a = "Intercept"
            b = "Coefficient of Transformed X"
            c = "Power Transformation Parameter";
      fit y start=(c=10 5 2.5 -2.5 -5) / startiter itprint;
   run;
The iteration history listing is shown in Figure 14.24. Using the best starting values found by the grid search, the OLS estimation requires only 2 iterations. However, since the grid search required 10 iterations, the total number of iterations in this case is 12.

Because no initial values for A or B were provided in the PARAMETERS statement or were read in with a PARMSDATA= or ESTDATA= option, A and B were given the default value of 0.0001 for the first iteration. At the second grid point, C=5, the values of A and B obtained from the previous iterations are used for the initial iteration. If initial values are provided for parameters, the parameters start at those initial values at each grid point.
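For example, consider the following logistic growth curve model of the U.S. population:
pop = \frac{a}{1 + \exp( b - c ( t - 1790 ) )}
where t is time in years. The model is estimated using decennial census data of the U.S. population in millions. If this simple but highly nonlinear model is estimated using the default starting values, the estimation fails to converge.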
To find reasonable starting values, first consider the meaning of a and c. Taking the limit as time increases, a is the limiting or maximum possible population. So, as a starting value for a, several times the most recent population known can be used, for example, one billion (1000 million).
Dividing the time derivative by the function to find the growth rate and taking the limit as t moves into the past, you can determine that c is the initial growth rate. You can examine the data and compute an estimate of the growth rate for the first few decades, or you can pick a number that sounds like a plausible population growth rate figure, such as 2%.
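The two limits described above can be written out explicitly. This sketch assumes the logistic form shown in its first line with c > 0, which matches the roles of a, b, and c described in the text (t is the year and 1790 is the base year):

```latex
\begin{align*}
\mathrm{pop}(t) &= \frac{a}{1 + e^{\,b - c\,(t-1790)}} \\
\lim_{t\to\infty} \mathrm{pop}(t) &= a \\
\frac{1}{\mathrm{pop}(t)}\,\frac{d\,\mathrm{pop}(t)}{dt}
  &= \frac{c\,e^{\,b - c\,(t-1790)}}{1 + e^{\,b - c\,(t-1790)}}
  \;\longrightarrow\; c \quad \text{as } t \to -\infty
\end{align*}
```

The second line gives a as the limiting population, and the third shows that the relative growth rate approaches c far in the past.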
To find a starting value for b, let t equal the base year used, 1790, which causes c to drop out of the formula for that year, and then solve for the value of b that is consistent with the known population in 1790 and with the starting value of a. This yields b = ln(a/3.9 - 1), or about 5.5, where a is 1000 and 3.9 is roughly the population for 1790 given in the data. The estimates converge using these starting values.
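Putting the three starting values together, the estimation might be set up as in the following sketch. The data set name USPOP and the variable names YEAR and POP are hypothetical, and the model equation is written in the logistic form implied by the discussion above rather than copied from a listing.

```sas
/* Hypothetical names: data set USPOP with variables YEAR and POP (millions) */
proc model data=uspop;
   parms a b c;
   pop = a / ( 1 + exp( b - c * (year - 1790) ) );
   /* a: 1000 million; b: log(1000/3.9 - 1), about 5.5; c: 2% growth rate */
   fit pop start=(a=1000 b=5.5 c=.02);
run;
```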
Failure of the algorithm to improve the objective value can be caused by a CONVERGE= value that is too small. Look at the convergence measures reported at the point of failure. If the estimates appear to be approximately converged, you can accept the NOT CONVERGED results reported, or you can try rerunning the FIT task with a larger CONVERGE= value.
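Rerunning the FIT task with a looser criterion could look like the following sketch, shown for the power transformation model used earlier. The value .01 is illustrative only; a suitable value should be chosen by looking at the convergence measures reported at the point of failure.

```sas
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   /* CONVERGE=.01 is an illustrative, looser convergence criterion */
   fit y start=(c=5) / converge=.01;
run;
```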
If the procedure fails to converge because it is unable to find a change vector that improves the objective value, check your model and data to ensure that all parameters are identified and data values are reasonably scaled. Then, rerun the model with different starting values. Also, consider using the Marquardt method if Gauss-Newton fails; the Gauss-Newton method can get into trouble if the Jacobian matrix is nearly singular or ill-conditioned. Keep in mind that a nonlinear model may be well-identified and well-conditioned for parameter values close to the solution values but unidentified or numerically ill-conditioned for other parameter values. The choice of starting values can make a big difference.
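The Marquardt method is requested with the METHOD= option of the FIT statement. The following sketch applies it to the power transformation model used earlier; whether it converges from the default starting values for that example has not been checked here.

```sas
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   /* request the Marquardt method instead of Gauss-Newton */
   fit y / method=marquardt;
run;
```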
When the estimates fail to converge, collinearity diagnostics for the Jacobian crossproducts matrix are printed if there are 20 or fewer parameters estimated. See "Linear Dependencies" later in this section for an explanation of these diagnostics.
There are many nonlinear functions for which the objective function is quite flat in a large region around the minimum point so that many quite different parameter vectors may satisfy a weak convergence criterion. By using different starting values, different convergence criteria, or different minimization methods, you can produce very different estimates for such models.
You can guard against this by running the estimation with different starting values and different convergence criteria and checking that the estimates produced are essentially the same. If they are not, use a smaller CONVERGE= value.
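One way to carry out this check is sketched below: the same model is fit from two different starting points with a tight convergence criterion, and the estimates are saved with the OUTEST= option and printed together. The data set names EST1, EST2, and COMPARE and the value 1E-8 are assumptions made for the illustration.

```sas
proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   fit y start=(c=5) / converge=1e-8 outest=est1;
run;

proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   fit y start=(c=1) / startiter converge=1e-8 outest=est2;
run;

/* stack the two sets of estimates so they can be compared by eye */
data compare;
   set est1 est2;
run;

proc print data=compare;
run;
```

If the two rows of estimates differ by more than is acceptable, that suggests the objective function is flat near the minimum.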
If the model equations or their derivatives contain discontinuities, the estimation will usually succeed, provided that the final parameter estimates lie in a continuous interval and that the iterations do not produce parameter values at points of discontinuity or parameter values that try to cross asymptotes.
One common case of discontinuities causing estimation failure is that of an asymptotic discontinuity between the final estimates and the initial values. For example, consider the following model, which is basically linear but is written with one parameter in reciprocal form:
y = a + b * x1 + x2 / c;
By placing the parameter C in the denominator, a singularity is introduced into the parameter space at C=0. This is not necessarily a problem, but if the correct estimate of C is negative while the starting value is positive (or vice versa), the asymptotic discontinuity at 0 will lie between the estimate and the starting value. This means that the iterations have to pass through the singularity to get to the correct estimates. The situation is shown in Figure 14.25.
Because of the incorrect sign of the starting value, the C estimate goes off towards positive infinity in a vain effort to get past the asymptote and onto the correct arm of the hyperbola. As the computer is required to work with ever closer approximations to infinity, the numerical calculations break down and an "objective function was not improved" convergence failure message is printed. At this point, the iterations terminate with an extremely large positive value for C. When the sign of the starting value for C is changed, the estimates converge quickly to the correct values.
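A sketch of the corrected run follows. It assumes that the correct estimate of C is negative, and the data set name MYDATA and the starting value -1 are hypothetical choices made for the illustration.

```sas
/* Hypothetical data set MYDATA with variables Y, X1, and X2 */
proc model data=mydata;
   parms a b c;
   y = a + b * x1 + x2 / c;
   /* start C on the negative side of the singularity at C=0 */
   fit y start=(c=-1);
run;
```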
For each parameter, the proportion of the variance of the estimate accounted for by each principal component is printed. The principal components are constructed from the eigenvalues and eigenvectors of the correlation matrix (scaled covariance matrix). When collinearity exists, a principal component is associated with a proportion of the variance of more than one parameter. The numbers reported are proportions, so they remain between 0 and 1. If two or more parameters have large proportion values associated with the same principal component, then two problems can occur: the computation of the parameter estimates is slow or nonconvergent, and the parameter estimates have inflated variances (Belsley 1980, pp. 105-117).
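For reference, the variance-decomposition proportions of Belsley, Kuh, and Welsch are usually defined as follows. This is the textbook form, stated here as an assumption about what the printed proportions represent. The symbols \lambda_j and v_{kj} denote the j-th eigenvalue of the scaled matrix and the k-th element of its eigenvector, and k indexes the parameters:

```latex
\begin{align*}
\phi_{kj} &= \frac{v_{kj}^{2}}{\lambda_{j}}, &
\pi_{jk} &= \frac{\phi_{kj}}{\sum_{i} \phi_{ki}}, &
\sum_{j} \pi_{jk} &= 1
\end{align*}
```

Each \pi_{jk} lies between 0 and 1. A small eigenvalue \lambda_j makes \phi_{kj} large, so values of \pi_{jk} near 1 for two or more parameters on the same component j indicate collinearity among those parameters.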
For example, the following cubic model is fit to a quadratic data set:
   proc model data=test3;
      exogenous x1;
      parms b1 a1 c1;
      y1 = a1 * x1 + b1 * x1 * x1 + c1 * x1 * x1 * x1;
      fit y1 / collin;
   run;
The collinearity diagnostics are shown in Figure 14.26.

Notice that the proportions associated with the smallest eigenvalue are almost 1. For this model, removing any of the parameters will decrease the variances of the remaining parameters.
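Since the data set is described as quadratic, one way to act on this is to drop the cubic term and refit. The following sketch assumes the same TEST3 data set and simply removes C1; it is one possible reformulation, not the only one.

```sas
proc model data=test3;
   exogenous x1;
   /* cubic term and its parameter C1 removed */
   parms b1 a1;
   y1 = a1 * x1 + b1 * x1 * x1;
   fit y1 / collin;
run;
```

The COLLIN option is kept so that the diagnostics for the reduced model can be compared with those for the cubic model.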
In many models the collinearity might not be clear cut. Collinearity is not necessarily something you remove. A model may need to be reformulated to remove the redundant parameterization, or the limitations on the estimability of the model can be accepted.
Collinearity diagnostics are also useful when an estimation does not converge. The diagnostics provide insight into the numerical problems and can suggest which parameters need better starting values. These diagnostics are based on the approach of Belsley, Kuh, and Welsch (1980).
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.