Tom25
Fluorite | Level 6

I used PROC IML to generate data 1000 times, with sample size n=25 for each replication: x from N(0,1) and y from a logistic regression model with beta_0=1 and beta_1=3.

I used the generated data to calculate the MLEs of the unknown coefficients of a logistic regression with a single predictor (x) by minimizing the negative log-likelihood function. The optimization uses the initial values {0 1} for beta_0 and beta_1. The following is the code:

proc iml;
/*Module for calculating the log-likelihood function for logistic regression and returns negative log likelihood*/
start SUMLOGLHF(beta) global(n,x,y);
z = x*t(beta);
z = max(-700, min(700, z));/*Prevents exp() overflow/underflow*/
prob=exp(z)/(1+exp(z));
llikelihood=sum(y#log(prob+1e-12) + (1-y)#log(1-prob+1e-12));/*Added 1e-12 to avoid log(0)*/
return (-llikelihood);/*Return negative log likelihood function*/
finish;

* Module to calculate gradient;
start driv(beta) global(n,x,y);
z = x*t(beta);
z = max(-700, min(700, z));/*Prevents exp() overflow/underflow*/
prob=exp(z)/(1+exp(z));
g=t(-t(x)*(y-prob));/* g is 1 by 2 vector*/
return(g);
finish;
/*Module to calculate Hessian Matrix*/ 
start HessLogLik(beta) global(n,x,y);
z = x*t(beta);
z = max(-700, min(700, z));/*Prevents exp() overflow/underflow*/
prob=exp(z)/(1+exp(z));
weight = prob#(1-prob);/*Var(Y)*/
/*Hessian*/
h = (t(x)*(diag(weight))*x); /* Simplified Hessian calculation */
h = h + I(2)*1e-4; /*Prevents the inversion of a singular matrix when the Hessian is not positive definite*/
return(h);
finish;

replicate=1000;/*1000 replications*/
n=25;/*Sample size for each replicate is 25*/
Est=j(replicate,2,0);/*Matrix that includes the estimates of beta_0 and beta_1*/

colMean=j(2,1,0);/*Vector that includes the mean values of the estimates of beta_0 and beta_1*/
colStanddev=j(2,1,0);/*Vector that includes the standard deviation values of the estimates of beta_0 and beta_1*/

/*Do loop for 1000 times*/
do k=1 to replicate;
x1=j(1,n);
x1t=t(x1);
call randseed(3205+k);/*Change seed each iteration by adding k to avoid getting the same result 1000 times*/
Call randgen(x1t, "NORMAL", 0, 1); /*Generates N(0,1)*/
x2=j(n,1,1);
x=x2||x1t;/*x is the design matrix*/ 
y=j(n,1);/*y is the vector that includes the generated y data*/
TruePar={1 3};/*Row vector that includes the true values for beta_0 and beta_1*/

/*Calculate ptrue which is the cdf of logit and use it to generate y data */
ptrue=exp(x*t(TruePar))/(1+exp(x*t(TruePar)));/* ptrue is based on Logistic (Sigmoid) distribution */
Call randgen(y,"Bernoulli",ptrue);/*Generates Bernoulli distribution for y with probability of success =ptrue*/
sumy=sum(y);

/*Initial values for beta_0 and beta_1 as a 1x2 row vector*/
initial_param={0 1};
opt={0};/*To get the estimates that minimize the negative log likelihood function introduced in the module*/ 

/* Call Optimizer */
call nlpnra(rc, res, "SUMLOGLHF", initial_param,opt, , , , ,"driv","HessLogLik");/*Use Newton Raphson's algorithm and the nlpnra routine that minimizes negative log likelihood function*/
Est[k,]=res;/*Matrix that includes 1000 estimates of beta_0_hat and beta_1_hat*/
end;
print Est;

colMean=t(mean(Est));
print colMean;

colStanddev=t(std(Est));
print colStanddev;
quit;

 

The means of beta_0_hat and beta_1_hat are not close to the true values of beta_0 and beta_1, which are 1 and 3, respectively.

(screenshot of the printed colMean and colStanddev output)

 

I got the following errors and notes in the log window:

ERROR: NEWRAP Optimization cannot be completed.

ERROR: NEWRAP needs more than 200 iterations or 500 function calls.

NOTE: FCONV convergence criterion satisfied.

NOTE: At least one element of the (projected) gradient is greater than 1e-3.

 

How can I fix the code to get rid of the errors and correctly compute the MLEs of beta_0 and beta_1?

 

Thank you!


sbxkoenk
SAS Super FREQ

There's a board dedicated to SAS/IML.
I will move this topic thread to that board.

Koen

 

** DONE **

Rick_SAS
SAS Super FREQ

The statement 

z = max(-700, min(700, z));/*Prevents exp() overflow/underflow*/

is wrong.

The MIN and MAX functions return a scalar value. So instead of z being a vector, it becomes a scalar.
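
For example, here is a small demonstration (using only built-in IML operators) of the difference:

z = {-900 -2 5 900};          /* a row vector of four values */
bad = max(-700, min(700, z)); /* MIN and MAX reduce all their arguments to a single scalar: bad = -700 */
good = -700 <> (z >< 700);    /* elementwise max (<>) and min (><): good = {-700 -2 5 700} */
print bad, good;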

The correct way to clip a vector of values into an interval is shown in the article
"Clip values into an interval"

The article contains this little IML helper function:

/* use the elementwise minimum (><) and elementwise maximum (<>) operators to 
   truncate elements of a vector into an interval. See
https://blogs.sas.com/content/iml/2026/02/04/clip-values.html
*/
start Clip(x, ab); a = ab[1]; b = ab[2]; return( a <> (x >< b) ); finish;

Use that function instead of your nested MAX-MIN statement. For example:

start SUMLOGLHF(beta) global(n,x,y);
z = x*t(beta);
z = Clip(z, {-700 700}); /*Prevents exp() overflow/underflow*/
...

There might be other issues, but fix this issue first.

 

Tom25
Fluorite | Level 6

Thank you for your response.

I did what you recommended, and the following is the code:

proc iml;
/* Define Clip module */
start Clip(x, ab);
   a = ab[1]; b = ab[2];
   return( a <> (x >< b) );
finish;

/*Module for calculating the log-likelihood function for logistic regression and returns negative log likelihood*/
start SUMLOGLHF(beta) global(n,x,y);
   z = x*t(beta);
   z = Clip(z, {-700 700}); /*Prevents exp() overflow/underflow*/
   prob = exp(z)/(1+exp(z));
   llikelihood = sum(y#log(prob+1e-12) + (1-y)#log(1-prob+1e-12)); /*Added 1e-12 to avoid log(0)*/
   return (-llikelihood); /*Return negative log likelihood function*/
finish;

/* Module to calculate gradient */
start driv(beta) global(n,x,y);
   z = x*t(beta);
   z = Clip(z, {-700 700}); /*Prevents exp() overflow/underflow*/
   prob = exp(z)/(1+exp(z));
   g = t(-t(x)*(y-prob)); /* g is 1 by 2 vector */
   return(g);
finish;

/*Module to calculate Hessian matrix (second derivative matrix)*/
start HessLogLik(beta) global(n,x,y);
   z = x*t(beta);
   z = Clip(z, {-700 700}); /*Prevents exp() overflow/underflow*/
   prob = exp(z)/(1+exp(z));
   weight = prob#(1-prob); /*Var(Y)*/
   h = (t(x)*(diag(weight))*x); /* Simplified Hessian calculation */
   h = h + I(2)*1e-4; /*Prevents the inversion of a singular matrix when the Hessian is not positive definite*/
   return(h);
finish;

replicate = 1000; /*1000 replications*/
n = 25;           /*Sample size for each replicate is 25*/
Est = j(replicate,2,0); /*Matrix that includes the estimates of beta_0 and beta_1*/
colMean = j(2,1,0);     /*Vector that includes the mean values of the estimates of beta_0 and beta_1*/
colStanddev = j(2,1,0); /*Vector that includes the standard deviation values of the estimates of beta_0 and beta_1*/

/*Do loop for 1000 replications*/
do k = 1 to replicate;
   x1 = j(1,n);
   x1t = t(x1);
   call randseed(3205+k); /*Change seed each iteration by adding k to avoid getting the same result 1000 times*/
   call randgen(x1t, "NORMAL", 0, 1); /*Generates N(0,1)*/
   x2 = j(n,1,1);
   x = x2||x1t; /*x is the design matrix*/
   y = j(n,1);  /*y is the vector that includes the generated y data*/
   TruePar = {1 3}; /*Row vector that includes the true values for beta_0 and beta_1*/

   /*Calculate ptrue, which is the CDF of the logistic distribution, and use it to generate the y data*/
   ptrue = exp(x*t(TruePar))/(1+exp(x*t(TruePar)));
   call randgen(y, "Bernoulli", ptrue); /*Generates Bernoulli y with probability of success = ptrue*/

   /*Initial values for beta_0 and beta_1 as a 1x2 row vector*/
   initial_param = {0 1};

   /* Set max iterations to 500 and max function calls to 1000 */
   opt = j(1, 10, .);
   opt[1] = 0;    /* Minimize */
   opt[2] = 1;    /* Change to 1 or 3 to print messages to log if needed */
   opt[5] = 500;  /* Increased Max Iterations */
   opt[6] = 1000; /* Increased Max Function Calls */

   /* Call Optimizer */
   call nlpnra(rc, res, "SUMLOGLHF", initial_param, opt, , , , , "driv", "HessLogLik"); /*Newton-Raphson via NLPNRA, minimizing the negative log likelihood*/
   Est[k,] = res; /*Matrix that includes 1000 estimates of beta_0_hat and beta_1_hat*/
end;

colMean = t(mean(Est));
print colMean;

colStanddev = t(std(Est));
print colStanddev;
quit;

 

I got the following error 3 times in the log window when n=25:

ERROR: NEWRAP Optimization cannot be completed.

ERROR: NEWRAP needs more than 200 iterations or 500 function calls.

 

I defined opt as follows to get rid of this error:

opt = j(1, 10, .);
opt[1] = 0;    /* Minimize */
opt[2] = 1;    /* Change to 1 or 3 to print messages to log if needed */
opt[5] = 500;  /* Increased Max Iterations */
opt[6] = 1000; /* Increased Max Function Calls */

However, I am still getting the same error. How can I fix this?

Rick_SAS
SAS Super FREQ

> I got the following error 3 times in the log window when n=25:

ERROR: NEWRAP Optimization cannot be completed.

ERROR: NEWRAP needs more than 200 iterations or 500 function calls.

How can I fix this?

 

This error message appears for 3 out of 1000 simulation trials. Errors like this appear in simulation studies when a random sample contains extreme values or is otherwise not representative. For those samples, the MLE step does not converge. This does not indicate a problem in your code, but rather a hazard when you are simulating small data sets (in your example, n=25). For details and a discussion, see Monitor convergence during simulation studies in SAS - The DO Loop
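
For intuition, here is a small illustration (a sketch, not from the blog post) of a common failure mode: complete separation. If a sample happens to be perfectly separated, the MLE does not exist, and the negative log-likelihood keeps decreasing toward zero as the slope grows, so Newton iteration has no minimum to find. The snippet reuses the SUMLOGLHF module from your program (note that it overwrites the global n, x, and y):

n = 6;
x = j(n,1,1) || {-2, -1, -0.5, 0.5, 1, 2}; /* intercept column and x1 */
y = {0,0,0,1,1,1};                         /* y=1 exactly when x1>0: complete separation */
do b1 = 5 to 45 by 20;
   beta = 0 || b1;                         /* try increasingly steep slopes */
   negLL = SUMLOGLHF(beta);
   print b1 negLL;                         /* negLL shrinks toward 0; it never attains a minimum */
end;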

 

There are two ways to address your concern:

  1. Unless you are specifically interested in Newton-Raphson optimization, you can switch optimization schemes. For example, the quasi-Newton optimization in NLPQN might be faster and more robust. Since quasi-Newton methods do not require Hessian computations, you can replace your call to NLPNRA with 
    call nlpqn(rc, res, "SUMLOGLHF", initial_param,opt) grd="driv"; /*Use quasi-Newton algorithm that minimizes negative log likelihood function */
    This is what I would do.

  2. If, for some reason, Newton iteration is important, you can keep track of the return code (rc) and remove the results from samples for which the MLE does not converge. This is the suggestion in the blog post, as applied to your IML program. This requires allocating a vector for the return codes, then using the LOC function to keep only the results from MLE that converged:

rc_status = j(replicate,1);
do k=1 to replicate;
...
   call nlpnra(rc, res, "SUMLOGLHF", initial_param, opt) grd="driv" hes="HessLogLik";
   Est[k,]=res;
   rc_status[k] = rc;
end;

/* find trials that converged */
good_idx = loc(rc_status > 0);
Est = Est[good_idx, ];
nSim = nrow(Est);
colMean=t(mean(Est));
colStanddev=t(std(Est));
print nSim colMean colStanddev;
Tom25
Fluorite | Level 6

Thank you for your response.

I have the following three questions.

  1. Based on the first way you suggested, all 1000 simulation trials converge. However, the means of the estimates of the betas are far from the true values. Is that acceptable?
  2. Based on the second way you suggested, are the means and standard deviations of the estimates of the unknown regression coefficients calculated from the 997 simulation trials that converged?
  3. Based on the second way you suggested, is it possible to generate three replacement samples for the ones where the MLE step does not converge, so that all 1000 simulation trials are used in the calculations? If so, how can I adjust the code?
Rick_SAS
SAS Super FREQ

First, I made a mistake. The statement that subsets the good indices should be 
Est = Est[good_idx, ];  /* Note the comma! */

To answer your questions:
> 1. the means of the estimates of the betas are far from the true values. Is that acceptable?

I don't know what "acceptable" means. What is your goal? Your estimates will be far from the true values regardless of which optimization method you use.

If your goal is to show that the regression estimates for this model are highly variable and should not be trusted for samples of size n=25, then, yes, I think this simulation demonstrates that fact very well. If you plot your parameter estimates, there are many extreme values, which will, of course, strongly influence the Monte Carlo mean and standard deviation. A more robust Monte Carlo estimate would be the median of the columns. You should always graph the parameter estimates to visualize their variability.

colMedian=t(median(Est));
print colMedian;

title "Graph of parameter estimates from NLPNRA";
call scatter(Est[,1], Est[,2]);

The graph might cause you to ask whether the computations in the simulation are wrong. You can answer that question by performing the same computation for larger sample sizes. Many of your "problems" with nonconvergence and highly variable estimates will go away if you increase the sample size to, say, n=100.
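
For example, here is a minimal sketch of that experiment. It assumes the Clip, SUMLOGLHF, and driv modules defined earlier in this thread are already in the session, and it uses the quasi-Newton routine NLPQN so that no Hessian module is needed:

replicate = 1000;
n = 100;                                 /* larger sample size */
TruePar = {1 3};
initial_param = {0 1};
opt = {0 0};                             /* minimize; no printed output */
Est = j(replicate, 2, .);                /* missing value = not converged */
call randseed(3205);
do k = 1 to replicate;
   x1t = j(n, 1);
   call randgen(x1t, "NORMAL", 0, 1);
   x = j(n, 1, 1) || x1t;                /* design matrix */
   ptrue = exp(x*t(TruePar)) / (1 + exp(x*t(TruePar)));
   y = j(n, 1);
   call randgen(y, "Bernoulli", ptrue);
   call nlpqn(rc, res, "SUMLOGLHF", initial_param, opt, , , , , "driv");
   if rc > 0 then Est[k,] = res;         /* keep only converged estimates */
end;
Est = Est[loc(Est[,1] ^= .), ];          /* drop nonconverged rows */
colMedian = t(median(Est));
print colMedian;                         /* should be much closer to {1, 3} */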

 

2. Yes, the Monte Carlo estimate is based on the 997 estimates that converged.

3. Yes, you can increase the value of the 'replicate' variable to obtain additional Monte Carlo samples. But the math doesn't care if you use 997 or 1000. The math says that the mean of the estimates (however many you have) is the Monte Carlo estimate.

 

In summary, your "problems" are caused because the sample size in the simulation is so small. You can "fix" your problems by using a larger sample size.

Tom25
Fluorite | Level 6

Thank you, Rick!