Solved: Re: Converting a simulation from SAS data step to SAS/IML to make it f...

SimRock · Posted 05-31-2018 12:22 AM

Hi,

I desperately need some help to convert my SAS code (currently in Data step format) into a SAS IML. My simulations are exceedingly slow when I increase the numbers greatly and SAS experts such as Rick Wicklin have suggesting vectorizing and using SAS IML to make the code faster and more efficient. The problem is that I am not completely sure how to go about it. If anyone can help me write a SAS IML procedure to make the code super fast and efficient (especially when I increase the complexity i.e. many variables ~ 100, increased sample size ~1000,000, increased number of replication ~1000). For example I use 100,000 for my sample size, 10 timepoints and 100 replications and many variables, my simulations take about 5 hours to run.

Basically my simulation consists of a nested (hierarchical) type of simulation with individuals nested within neighborhoods and each individuals has different measurement of a particular variable (here age) at different time points. The simulation is then repeated a number of time. After which I reshape the data and compute the statistics. Any help would be greatly appreciated. Here's a sample code. Any help to make this code more efficient and faster would be greatly appreciated especially in SAS IML.

%let reps=10;
%let numNeigh=20;
%let sampsize=30;
%let start=0;
%let endpoint=3;

/*Storing parameters into a dataset*/
Data param;
	MeanAge0=32;
	MeanAge1=40;
	MeanAge2=58;
	MeanAge3=70;
run;

Data simwide;
	set param;
	call streaminit(053018);
		do replicate= 1 to &reps; /*Number of replications*/
			do Envid = 1 to &numNeigh; /*Number of neighborhoods*/

			NeighborhoodScore 	= 	rand('normal', 2, 1);
			array arAGE	{&start:&endpoint} AGE&start-AGE&endpoint;

				do id = 1 to &sampsize; /*size of each sample*/
					do t=&start to &endpoint; /*time points*/
					if t=&start then
						do;
						arAGE{t} =	rand('normal',MeanAge0,2);
						end;
						else if t=1 then
						    	do;
						arAGE{t} =	rand('normal',MeanAge1,3);
							end;
						else if t=2 then
						    	do;
						arAGE{t} =	rand('normal',MeanAge2,4);
							end;
						else if t=&endpoint then
						    	do;
						arAGE{t} =	rand('normal',MeanAge3,2);
							end;
					   end;
					output;
					end;
				end;
			end;
	run;

	/*Transposing the data in a long format*/

Data simlong;
retain replicate EnviD id t AGE;
	set simwide;
		array arAGE	{&start:&endpoint} AGE&start-AGE&endpoint;
	do t=&start to &endpoint;
				AGE =	arAGE{t};
	output;
	end;
		Drop 	AGE&start-AGE&endpoint;
run;

Data simlong2;
	set simlong;
		newID= id + (envid-1)*&sampsize; 
		/*to get a correct idenfification number for the entire population*/
run;

proc sort data=simlong2; by replicate envid id t; run;

/*Computing the statistics*/
/*ideally I would like to automate the process for many variables like age*/

proc sql;
	create table sum1 as
	select replicate, t, 
	mean(age), mean(age) as age
	from simlong2
	group by replicate, t
	order by t;
quit;

proc univariate data=sum1 noprint; 
by t;
	var age; 
	output out=sum2
	pctlpts=2.5 50 97.5 
	pctlpre	=var_p
	mean	=var_mean
	std		=var_SD;
run;

Data sum2;
retain varname;
	set sum2;
	varname="age";
run;
proc print data=sum2;
title1 "Distribution of simulated variable by timepoints";
run;

Rick_SAS · Posted 05-31-2018 08:38 AM

I don't recall suggesting that you should use "to make the code faster and more efficient." In general, there is no reason to think that an IML simulation will run faster than a well-written DATA step simulation. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

Data Param;
array arMEAN   {&start:&endpoint} (32, 40, 58, 70);
array arSTD    {&start:&endpoint} ( 2,  3,  4,  2);
run;

data simwide;
set Param;
array arMEAN   {&start:&endpoint};
array arSTD    {&start:&endpoint};
array arAGE    {&start:&endpoint};
call streaminit(1);

do id = 1 to &sampsize; /*size of each sample*/
   do t=&start to &endpoint; /*time points*/
      arAGE{t} =  rand('normal',arMEAN[t], arSTD[t]);
   end;
   output;
end;
run;

View solution in original post

Rick_SAS · Posted 05-31-2018 08:38 AM

I don't recall suggesting that you should use "to make the code faster and more efficient." In general, there is no reason to think that an IML simulation will run faster than a well-written DATA step simulation. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

Data Param;
array arMEAN   {&start:&endpoint} (32, 40, 58, 70);
array arSTD    {&start:&endpoint} ( 2,  3,  4,  2);
run;

data simwide;
set Param;
array arMEAN   {&start:&endpoint};
array arSTD    {&start:&endpoint};
array arAGE    {&start:&endpoint};
call streaminit(1);

do id = 1 to &sampsize; /*size of each sample*/
   do t=&start to &endpoint; /*time points*/
      arAGE{t} =  rand('normal',arMEAN[t], arSTD[t]);
   end;
   output;
end;
run;

SimRock · Posted 05-31-2018 07:14 PM

Thank you so much Rick_SAS

This is really helpful

So now I think I just need help eliminating the IF and Then statements. My code is a bit more complicated and I am wondering if you could help me eliminate them in the sample code below. I was not able to find the best wat to do it. especially because I am trying to simulate MI (myocardial infarction) among those who do not have it in the previous step. Thank you

Proc fcmp outlib=work.funcs.link; /*to create an Expit function*/
	function expit(LinearCombination);
		n = (1/(1 + exp(-(LinearCombination))));
	return (n);
	endsub;
quit;
/*End of function generation*/

options cmplib=work.funcs ;

Data param;
array arMEANAGE	{&start:&endpoint}(32, 40, 58);
array arMEANBP	{&start:&endpoint}(132, ., .);
array arMEANMI	{&start:&endpoint}(0, ., .);

array arInterBP	{&start:&endpoint}(., 80, 90);
array arInterMI	{&start:&endpoint}(., -2.75, -2.09);

array arBeta_BP_AGE {&start:&endpoint}(., 0.001, 0.002);
array arBeta_MI_AGE	{&start:&endpoint}(., 0.002, 0.001);
array arBeta_MI_BP	{&start:&endpoint}(., 0.003, 0.009);
run;



Data simwide;
	set param;
	call streaminit(053018);
		do replicate= 1 to &reps; /*Number of replications*/
			do Envid = 1 to &numNeigh; /*Number of neighborhoods*/

					NeighborhoodScore 	= 	rand('normal', 2, 1);

					array arAGE		{&start:&endpoint};
					array arBP		{&start:&endpoint};
					array arMI		{&start:&endpoint};

					array arMEANAGE	{&start:&endpoint};
					array arMEANBP	{&start:&endpoint};
					array arMEANMI	{&start:&endpoint};

					array arInterBP	{&start:&endpoint};
					array arInterMI	{&start:&endpoint};

					array arBeta_BP_AGE {&start:&endpoint};
					array arBeta_MI_AGE	{&start:&endpoint};
					array arBeta_MI_BP	{&start:&endpoint};


				do id = 1 to &sampsize; /*size of each sample*/
					do t=&start to &endpoint; /*time points*/
						if t=&start then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t}, 1);
								arBP{t} =	rand('normal',arMEANBP{t}, 1);
								arMI{t} =	arMEANMI{t};
	
							end;
						else if t=1 then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t-1} + 1, 2);
								arBP{t} =	rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);

								if (arMI{t-1}=0) then
									do;
										arMI{t} =	rand('bernoulli',expit(arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
								 	end;
								else
									do;
										arMI{t}=.;
									end;
							end;
	
						else if t=&endpoint then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t-1} + 1, 3);
								arBP{t} =	rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);

								if (arMI{t-1}=0) then
									do;
										arMI{t} =	rand('bernoulli',expit (arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
								 	end;
								else
									do;
										arMI{t}=.;
									end;
							end;
					   end;
					output;
					end;
				end;
			end;
	run;

Rick_SAS · Posted 06-01-2018 08:47 AM

This looks good. If you want different behaviors for each time point, you might not be able to avoid the IF-THEN.

A few comments:

1. You don't need to define your own EXPIT function. use the built-in LOGISTIC function.

2. Are you sure you want arMI to contain missing values? It looks like a response variable, which would be 0/1. If you use 0/1, then you might be able to get rid of the IF-THEN/ELSE statement for arMI.

3. Your code will be clearer if you use a separate variable for the linear predictors in your model. For example

mu = arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1};

...

eta = arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1});

4. Coding Tip: In many fonts and program editors, the curly braces look very similar to round parenteses: f{i} vs f(i). Although there is nothing wrong with using curly braces for indexing into an array, I prefer using square brackets, which show up better because they look different from the parentheses: f[i] vs f(i). For example, I would write:

eta = arInterMI[t] + arBeta_MI_AGE[t]*arAGE[t-1] + arBeta_MI_BP[t]*arBP[t-1];

arMI[t] = rand('bernoulli',logistic (eta));

5. For clarity, you can move the ARRAY statements out of the loops and after the SET PARAM statement.

SimRock · Posted 06-01-2018 04:19 PM

Thank you so much, Rick. This is very helpful. I have tried to find out where SAS is spending the most time and I figured out it was a SAS code that I am including to convert body mass index to z-score and percentiles. See https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm and https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/cdc-source-code.sas for more details (attached CDC codes). In this code, they use a couple of PROC SORT which I believe increases the time. Do you have some tips and strategies (e.g. use memory or clusters or convert CDC SAS codes etc...) you can suggest to circumvent the time hurdle while accomplishing what I want (i.e. convert body mass index to z-scores and percentiles using the CDC provided SAS codes)

Thank you for your help

By the way, does SAS have a log odds function such that logit(p) = p/(1-p) similar to the logistic function? Just wondering

I am currently using the following

Proc fcmp outlib=ref.funcs.link; /*logit function*/
	function logit(Pr);
		n = log(Pr/(1-Pr));
	return(n);
	endsub;
quit;

Rick_SAS · Posted 06-01-2018 04:48 PM

I don't know anything about CDC codes, so I'd rather not comment on that.

SAS does not have a built-in DATA step function for the logit.

SimRock · Posted 06-01-2018 04:49 PM

Thank you very much

Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Re: Converting a simulation from SAS data step to SAS/IML to make it faster and more efficient

Ready to join fellow brilliant minds for the SAS Hackathon?