Solved: Re: How do I use modules within Do loop groups

timtimalo · Posted 04-15-2016 05:59 AM

Dear SAS Community,

I generated 5000 data sets from Weibull (a,b) to calculate MLEs for a and b using do loop. Inside this do loop I generated another 5000 data sets from Weibull (1,1) to calculate MLEs for Weibull (1,1). Then I divided the observed MLE of Weibull (a,b) by MLE for Weibull (1,1) and keep the results of division in a matrix. I ranked the items of the matrix from smallest to largest to pick only two elements from these ranked elements.

I got warnings and errors that resulted in stopping the run of the program. I checked the program several times but could not find out the mistake.

I attached the program and log files. Your help is very appreciated in this issue.

Best Regards’

Tom Timtimalo

Epidemiologist, UAE

Rick_SAS · Posted 05-06-2016 04:27 PM

I don't get that many errors, but as I said in my previous message, you can try to fiddle with the options vector for the NLP functions and see if that improves the situation.

I think you should mark this thread as "solved." If you have other questions later, open a new thread.

View solution in original post

IanWakeling · Posted 04-15-2016 07:31 AM

I have not tried running the code, but the error message relates to the fact that you are attempting to declare functions multiple times, by putting them within a loop. So try moving all of the code between "start f_weib2" and "finish g_weib2" inclusive to the top of the program.

timtimalo · Posted 04-16-2016 08:58 AM

I need to generate Weibull distribution with specific parameters (a, b) such that a=4 and b= 24.953 for 5000 times. For each group of data I would like to apply the modules to construct log-likelihood function and calculate MLEs using it and then keep the estimate of the second parameter "b" in a matrix c.

In SAS help, proc iml is recommended to define the matrix that contains data and then modules are applied for calculating MLEs of the constructed loglikelihood function of the given data. When I applied this protocol inside a do loop for 5000 times to get 5000 MLEs, I got the following warnings and notes:

WARNING: Starting a module while inside a DO group

WARNING: Finishing a module while inside a DO group.

NOTE: Module F_WEIB2 defined.

WARNING: Starting a module while inside a DO group.

WARNING: Finishing a module while inside a DO group.

NOTE: Module G_WEIB2 defined.

Based on your recommendation, I moved all codes between "start f_weib2" and "finish g_weib2" inclusive to the top of the program in a simpler program that contains only one do loop. I got errors and the program stop running.

How to overcome this problem to get MLEs for each generated group of data for 5000 times. Attached is the program and its log.

Rick_SAS · Posted 04-16-2016 11:10 AM

On line 11 you start a DO loop (do t = 1 to 10), but you never close that DO loop.

Do yourself a favor and call each function once before you try to optimize it. In other words, define the two modules and then test them out to make sure they work. After you have both functions working, you can try to use them in an optimization.

timtimalo · Posted 04-16-2016 03:03 PM

On line 50 do loop is closed since I moved all codes between "start f_weib2" and "finish g_weib2" inclusive to the top of the program based on your advice.

Based on your recommendation, I generated one group of data from Weibull (a,b) and defined the two modules. The program was run without warnings or errors in the log and I could get MLEs for this group of data.

When I generated data for 10 times and calculated MLEs each time inside do loop, I got MLEs for the 10 groups of generated data but the following warnings appeared in the log:

WARNING: Starting a module while inside a DO group

WARNING: Finishing a module while inside a DO group.

NOTE: Module F_WEIB2 defined.

WARNING: Starting a module while inside a DO group.

WARNING: Finishing a module while inside a DO group.

NOTE: Module G_WEIB2 defined.

When I moved all codes between "start f_weib2" and "finish g_weib2" inclusive to the top of the program, I got errors and the program stopped running.

Please, I want to know where I should put the do loop statement (do t = 1 to 10) and its end statement in the attached program so that I could get 10 MLEs from generated data without warnings or errors in the log.

Rick_SAS · Posted 04-15-2016 08:54 AM

Like Ian, I did not run your code. Another problem you are having is that you are trying to use the same module to define two different functions. One function uses x1 as a global variable, the other uses x2 as a global variable.

If the functions are different, then use two different names. To me, it looks like the functions are the same. Therefore I suggest you change the signature to

start f_weib2(x) global(g_x, thet); /* g_x is global variable, which can be x1 or x2 */

In the outer loop, you would assign

g_x = x1;

prior to calling the function. In the inner loop you would assign g_x = x2 prior to calling the function.

There are many other problems with your program, but Ian's tip and my suggestion will get you past the WARNINGs.

The next error you are likely to encounter is that the p variable is not defined inside the functions. Presumably you want to assign p = nrow(g_x);

Rick_SAS · Posted 04-18-2016 09:41 PM

Perhaps you are confusing DEFINING a module with calling it. You define a module once. You cancall it many times.

Probably the easiest thing for you to do is to define both your modules at the top of the program, then call them as needed.

proc iml;
start f_weib2(x) global(g_x,thet); /* x[1]=sigma and x[2]=c */ 
...
finish;

start g_weib2(x) global(x,thet); /* x[1]=sigma and x[2]=c */ 
...
finish;

/* rest of program. Before you call the function, be sure to define g_x and thet */
g_x = ...;
thet = ...;
x = ...;
f = f_weib2(x);
g = g_weib2(x);

You should get that much of the program working before you attempt to optimize anything or run a loop with 5000 iterations.

timtimalo · Posted 04-21-2016 05:43 PM

Thank you for your recommendation that solve the problems of using the modules. Howevere when I calculated MLES in case of a random sample of size 15 observations using do loop for 5000 times inside another do loop repeated for 5000 times consumes more than 5 hours.the following notes repeatedly appear several times in the log screen:

NOTE: GCONV convergence criterion satisfied.

NOTE: ABSGCONV convergence criterion satisfied

The following message appears on the screen during the run

Window is full and must be cleared. Select

F to file, P to print, S to save

To clear window without saving

Is it possible to overcome the problem of these notes such that they do not appear during the run process?

Calculating MLES in case of a random sample of size 15 observations using do loop for 5000 times inside another do loop repeated for 5000 times consumes more than 5 hours. Thus, the situation will be worse if a bigger sample size is used or ranked set sampling.

Is there a way to minimize time consumption for optimization procedure to calculate MLEs for Weibull in these situations?

Best Regards,

Tom

Rick_SAS · Posted 04-21-2016 05:59 PM

Glad it's working. Now you can to vectorize your objective function, which is being called millions of time. That should drop the time to seconds or minutes. For a vectorized version of your functions, see the SAS/IML documentation. For a discussion of vectorization tips, see How to vectorize.r.

I assume the "window full" message is because you are printing many many thousands of items? Supress printing, which will also make the program run faster. If the log is filling up, use

OPTIONS NONOTES;

to suppress the convergence notes.

While debugging, you might want to use 5 or 50 iterations instead of 5000.

timtimalo · Posted 04-22-2016 04:31 PM

Thanks for your quick response.

timtimalo · Posted 04-27-2016 04:58 PM

Hello Dr. Rick,

I used do loop for

I simulated data sets of Weibull (a, b) each includes 15 observations for 5000 times.
MLEs for each sample would be calculated using nonlinear optimization Newton Raphson subroutine (call nlpnra (rc,xres,"f_weib2",x,optn,con); which should be resulted in 5000 MLEs groups. Each MLEs group includes two estimates for a and b.

The problem is time consumption with nested do loop.

You recommend to use vectorization instead of do loop. I tried a simple program that includes only one do loop in different ways to change do loop to matrix work in different ways such that:

I simulated data sets of Weibull (a, b) each includes 15 observations for 5000 times by constructing a matrix (5000,15) such that each row represents a sample of 15 observations of Weibull (a,b).
I used a module to calculate a matrix of which likelihood functions “f” that would be used to get a vector of MLEs for 5000 samples “xopt”.

The program did not run and I got errors.

The question is how I can fix this program to get MLEs for 100 samples using vectorization instead of do loop.

The program is

proc iml;

*options nonotes;

a=12.1543;

b=4;

start f_weib2(x) global(g_x,thet); /* x[1]=b and x[2]=a and g_x is global variable, which can be x1 or x2*/

f = p*log(x[2]) - p*x[2]*log(x[1]) + (x[2]-1)*sum1 - sum2;

return(f);

finish;

count=0;

N = 15; /* number of obs in each sample */

NumSamples = 5000; /* number of samples */

call randseed(12345);/* set seed for random number stream */

x1 = j(NumSamples,N ); /* each row is sample of size N */

call randgen(x1, 'weibull', a, b); /* fill it*/

g_x = x1;

*print g_x;

n = 2; thet = 0.;

sum1=j(5000,1);

sum2=j(5000,1);

p = ncol(g_x);

temp = g_x - thet;

logtemp=log(temp);

x = j(1,n,.5); /* initial guess for solution (a, b) */

tempdev=(temp / x[1])##x[2];

sum1=logtemp[,+];/*a vector of sum over columns of the matrix logtemp to be used in LHF "f"*/

sum2=tempdev[,+];/*a vector of sum over columns of the matrix tempdev to be used in LHF "f"*/

*The following statements set up the initial conditions, options, and constraints, and they call the optimization

routine;

optn = {1 2}; /* 1: find max of a function and 2: print information*/

con = { 1.e-6 1.e-6 ,

. . }; /*lower bounds of parms >0 and upper bounds of parms < infinity (.)*/

call nlpnra(rc,xres,"f_weib2",x,optn,con);/*nonlinear optimization Newton Raphson subroutine*/

/*--- Save result in xopt, fopt ---*/

xopt = xres`; fopt = f_weib2(xopt);

quit;

run;

The log of the program is

NOTE: AUTOEXEC processing completed.

1 proc iml;

NOTE: IML Ready

2 *options nonotes;

3 a=12.1543;

4 b=4;

5 start f_weib2(x) global(g_x,thet);

5 ! /* x[1]=b and x[2]=a and g_x is global variable, which can

5 ! be x1 or x2*/

6 f = p*log(x[2]) - p*x[2]*log(x[1]) + (x[2]-1)*sum1 - sum2;

7 return(f);

8 finish;

NOTE: Module F_WEIB2 defined.

9 count=0;

10 N = 15;

10 ! /* number of obs in each sample */

11 NumSamples = 5000;

11 ! /* number of samples */

12 call randseed(12345);

12 ! /* set seed for random number stream */

13 x1 = j(NumSamples,N );

13 ! /* each row is sample of size N */

14 call randgen(x1, 'weibull', a, b);

14 ! /* fill it*/

15 g_x = x1;

16 *print g_x;

17 n = 2;

17 ! thet = 0.;

18 sum1=j(5000,1);

19 sum2=j(5000,1);

20 p = ncol(g_x);

21 temp = g_x - thet;

22 logtemp=log(temp);

23 x = j(1,n,.5);

23 ! /* initial guess for solution (a, b) */

24 tempdev=(temp / x[1])##x[2];

25 sum1=logtemp[,+];

25 ! /*a vector of sum over columns of the matrix logtemp to be used in LHF "f"*/

26 sum2=tempdev[,+];

26 ! /*a vector of sum over columns of the matrix tempdev to be used in LHF "f"*/

27 *The following statements set up the initial conditions, options, and constraints, and they

27 ! call the optimization

28 routine;

29 optn = {1 2};

29 ! /* 1: find max of a function and 2: print information*/

30 con = { 1.e-6 1.e-6 ,

31 . . };

31 ! /*lower bounds of parms >0 and upper bounds of parms < infinity

31 ! (.)*/

32 call nlpnra(rc,xres,"f_weib2",x,optn,con);

ERROR: (execution) Matrix has not been set to a value.

operation : * at line 6 column 6

operands : P, _TEM1002

P 0 row 0 col (type ?, size 0)

_TEM1002 1 row 1 col (numeric)

-0.693147

statement : ASSIGN at line 6 column 1

traceback : module F_WEIB2 at line 6 column 1

ERROR: (execution) Matrix has not been set to a value.

operation : NLPNRA at line 32 column 1

operands : *LIT1025, X, OPTN, CON

*LIT1025 1 row 1 col (character, size 7)

f_weib2

X 1 row 2 cols (numeric)

0.5 0.5

OPTN 1 row 2 cols (numeric)

1 2

CON 2 rows 2 cols (numeric)

1E-6 1E-6

. .

statement : CALL at line 32 column 1

32 ! /*nonlinear optimization Newton Raphson

32 ! subroutine*/

33 /*--- Save result in xopt, fopt ---*/

34 xopt = xres`;

ERROR: (execution) Matrix has not been set to a value.

operation : ` at line 34 column 12

operands : XRES

XRES 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 34 column 1

34 ! fopt = f_weib2(xopt);

ERROR: (execution) Matrix has not been set to a value.

operation : F_WEIB2 at line 34 column 29

operands : XOPT

XOPT 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 34 column 15

35 quit;

NOTE: Exiting IML.

NOTE: 5 workspace compresses.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: PROCEDURE IML used (Total process time):

real time 0.10 seconds

cpu time 0.10 seconds

36 run;

Rick_SAS · Posted 04-27-2016 08:17 PM

As I explained before, you should write and debug the objective functions before you start to optimize.

The error messages tell you what is going on:

ERROR: (execution) Matrix has not been set to a value.

operation : * at line 6 column 6

operands : P, _TEM1002

P 0 row 0 col (type ?, size 0)

The error says that P has not been set to a value. It is a 0x0 matrix. It also tells you that the error in during the multiplication operation (*) on line 6, which is statement in the module

f = p*log(x[2]) - p*x[2]*log(x[1]) + (x[2]-1)*sum1 - sum2;

Notice that P is not being passed into the module, nor is it defined in the module. Thus the error.

timtimalo · Posted 04-28-2016 09:24 AM

Thank you for your continuous help and sorry for bothering you.

I simulated 5000 data sets from Weibull (1,1) dist. inside 5000 simulated data sets from Weibull (12.1543 , 4) using do loops, which consumed huge time. You recommended using vectorization instead of do loops to save time. During the previous week, I tried to do that with the help of documents you suggested to read for that matter and other materials but I could not solve the problem.

I sent you what I got and you suggested defining P in the module and to write and debug the objective functions before starting optimization.

I defined p in the module and wrote the program to debug the objective functions before starting optimization. The program runs and there were no warnings or errors in the log of the program. I expected to find a vector of 5000 values of likelihood functions for the 5000 generated data sets from Weibull (12.1543 , 4) in the output of the program. However, the objective functions expressed as “LOGLIK” are all equal to 1. I tried to discover what is going on but I could not catch the problem. I appreciate your help to find the problem so I can run the program to have 5000 likelihood function values for the 5000 simulated data sets. The program and the log are attached.

If we have only one data set, we can optimize likelihood function to get MLEs as follows:

n=2;

p = j(1,n,.5); /* initial guess for solution */

opt = {1 4}; /* 1: find max of a function and 0: print no information*/

con = { 1.e-6 1.e-6 ,

. . }; /*lower bounds of parms >0 and upper bounds of parms < infinity (.)*/

call nlpnra(rc, result, "LogLik", p, opt, con);

/*--- Save result in xopt, fopt ---*/

xopt = result`; fopt = LogLik(xopt);

How we can change this code for optimization to work for the vector of objective functions “LogLik” to calculate MLEs for the 5000 simulated data sets? i.e. what will be the form of p, opt, and con for 5000 cases?

Rick_SAS · Posted 04-28-2016 08:58 PM

From what you have said, you want to generate 5000 random samples, and find 5000 ML estimates, thus obtaining a Monte Carlo distribution that approximate the sampling distribution for the two parameters.

The first step is to make sure that the objective function is working correctly. The objective function must return a scalar value. I see that you have tried to vectorize the objective function so that it handles all 5000 samples at once. That's a nice idea, but that won't work because the NLP routines require a SCALAR objective function. Thus you should generate ONE sample from Weib(12,4), find the MLEs for that sample, and then repeat 4999 more times. The loop will look something like this:

Est = j(NumSamples,2);  /* allocate space to store parameter estimates */
x1 = j(1, N);           /* each row is sample of size N */
options NONOTES;
do i = 1 to NumSamples;
   call randgen(x1, 'weibull', 12.1543, 4); /* i_th sample */
   g_x = x1;
   call nlpnra(rc, result, "LogLik", p, opt, con); /* i_th estiamtes */
   Est[i,] = result;    /* save i_th estimates */
end;
options NOTES;
call scatter(Est[,], Est[,2]);  /* if desired, plot estimates */

Maybe this will point you in the right direction. I don't understand what you are attempting when you say "simulate 5000 Weibull (1,1) dist inside 5000 Weibull (12.1543 , 4)." I don't see that anywhere in your program.

One clarification of terminology: You are generating 5000 random SAMPLES. There are no "data sets" in this problem.

There are many examples of this sort in my book Simulating Data with SAS. I suggest Chapter 4 "Simulating Data to Estimate Sampling Distributions" might be a useful chapter to review.

timtimalo · Posted 04-30-2016 06:23 PM

Hello Dr. Rick,

In an attempt to minimize the time of the run, I tried to get a vector of objective functions to calculate vector of MLEs. The vectorization did not work and when I consulted you for your advice, you told me “the objective function must return a scalar value. I see that you have tried to vectorize the objective function so that it handles all 5000 samples at once. That's a nice idea, but that won't work because the NLP routines require a SCALAR objective function.” I am stacked in this problem and I need to start over explaining my idea.

The idea of the attached program is:

(1) Simulate 5000 data from Weibull (a,b) and calculate their MLEs and the estimate of “a” is called wd1.

(2) For each generated data in step (1), I generated Weibull distribution (1,1) and calculate MLEs and the estimate of the second parameter is called wd2.

(3) Then, I calculated:

(a) ratio=wd1/wd2 using the value “wd1” calculated from step (1) and 5000 values calculated to “wd2” from step (2), then I used “ratio” to calculate:

(b) gam1=1+(1/ratio);

(c) gam2=1+(2/ratio);

(d) cove=(sqrt(gamma(gam2)-(gamma(gam1))##2))/gamma(gam1).

The values of cove are saved in a matrix “r” and the inner do loop is closed.

(4) Thus, “r” is a vector contains 5000 values. These values are ranked and I picked two values number 125 called “up” and number 4875 called “up”.

(5) The interval width is calculated and called “wid=up – low” for 5000 times (outer loop). The values of “wid” are saved in a matrix “s” Also, it is checked if calculated toi falls in the interval “wid” and if it is yes count will be increased by one. The outer loop is closed.

(6) calculate percent=count/5000 and mean interval widths is “meanwid”.

The problem is time consumption with nested do loop. The attached program consumed seven and half hours without finishing the run correctly and it gave the following Error in the log:

611 call nlpnra(rc, result, "f_weib2", x, optn, con);

611! /* i_th estiamtes */

ERROR: (execution) Unable to allocate sufficient memory. At least 262128 more bytes required.

operation : NLPNRA at line 611 column 1

operands : *LIT1054, X, OPTN, CON

*LIT1054 1 row 1 col (character, size 7)

f_weib2

X 1 row 2 cols (numeric)

0.5 0.5

OPTN 1 row 2 cols (numeric)

1 0

CON 2 rows 2 cols (numeric)

1E-6 1E-6

. .

statement : CALL at line 611 column 1

My problems are:

(1) How I can modify the attached program to save time such that it takes minutes instead of hours?

(2) I used “call nlpnra” and it was unable to allocate sufficient memory, how I can modify the attached program to allocate sufficient memory for calculations so that the program can complete the run to give the right answer? Should I change to using “call nlptr”?

Registration is open