BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SimRock
Obsidian | Level 7

Hi,

I desperately need some help to convert my SAS code (currently in Data step format) into a SAS IML. My simulations are exceedingly slow when I increase the numbers greatly and SAS experts such as Rick Wicklin have suggesting vectorizing and using SAS IML to make the code faster and more efficient. The problem is that I am not completely sure how to go about it. If anyone can help me write a SAS IML procedure to make the code super fast and efficient (especially when I increase the complexity i.e. many variables ~ 100, increased sample size ~1000,000, increased number of replication ~1000). For example I use 100,000 for my sample size, 10 timepoints and 100 replications and many variables, my simulations take about 5 hours to run.

 

Basically my simulation consists of a nested (hierarchical) type of simulation with individuals nested within neighborhoods and each individuals has different measurement of a particular variable (here age) at different time points. The simulation is then repeated a number of time. After which I reshape the data and compute the statistics. Any help would be greatly appreciated. Here's a sample code. Any help to make this code more efficient and faster would be greatly appreciated especially in SAS IML.

%let reps=10;
%let numNeigh=20;
%let sampsize=30;
%let start=0;
%let endpoint=3;

/*Storing parameters into a dataset*/
Data param;
	MeanAge0=32;
	MeanAge1=40;
	MeanAge2=58;
	MeanAge3=70;
run;

Data simwide;
	set param;
	call streaminit(053018);
		do replicate= 1 to &reps; /*Number of replications*/
			do Envid = 1 to &numNeigh; /*Number of neighborhoods*/

			NeighborhoodScore 	= 	rand('normal', 2, 1);
			array arAGE	{&start:&endpoint} AGE&start-AGE&endpoint;

				do id = 1 to &sampsize; /*size of each sample*/
					do t=&start to &endpoint; /*time points*/
					if t=&start then
						do;
						arAGE{t} =	rand('normal',MeanAge0,2);
						end;
						else if t=1 then
						    	do;
						arAGE{t} =	rand('normal',MeanAge1,3);
							end;
						else if t=2 then
						    	do;
						arAGE{t} =	rand('normal',MeanAge2,4);
							end;
						else if t=&endpoint then
						    	do;
						arAGE{t} =	rand('normal',MeanAge3,2);
							end;
					   end;
					output;
					end;
				end;
			end;
	run;

	/*Transposing the data in a long format*/

Data simlong;
retain replicate EnviD id t AGE;
	set simwide;
		array arAGE	{&start:&endpoint} AGE&start-AGE&endpoint;
	do t=&start to &endpoint;
				AGE =	arAGE{t};
	output;
	end;
		Drop 	AGE&start-AGE&endpoint;
run;

Data simlong2;
	set simlong;
		newID= id + (envid-1)*&sampsize; 
		/*to get a correct idenfification number for the entire population*/
run;

proc sort data=simlong2; by replicate envid id t; run;

/*Computing the statistics*/
/*ideally I would like to automate the process for many variables like age*/

proc sql;
	create table sum1 as
	select replicate, t, 
	mean(age), mean(age) as age
	from simlong2
	group by replicate, t
	order by t;
quit;

proc univariate data=sum1 noprint; 
by t;
	var age; 
	output out=sum2
	pctlpts=2.5 50 97.5 
	pctlpre	=var_p
	mean	=var_mean
	std		=var_SD;
run;

Data sum2;
retain varname;
	set sum2;
	varname="age";
run;
proc print data=sum2;
title1 "Distribution of simulated variable by timepoints";
run;
 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

I don't recall suggesting that you should use "to make the code faster and more efficient."  In general, there is no reason to think that an IML simulation will run faster than a well-written DATA step simulation. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

 

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

Data Param;
array arMEAN   {&start:&endpoint} (32, 40, 58, 70);
array arSTD    {&start:&endpoint} ( 2,  3,  4,  2);
run;

data simwide;
set Param;
array arMEAN   {&start:&endpoint};
array arSTD    {&start:&endpoint};
array arAGE    {&start:&endpoint};
call streaminit(1);

do id = 1 to &sampsize; /*size of each sample*/
   do t=&start to &endpoint; /*time points*/
      arAGE{t} =  rand('normal',arMEAN[t], arSTD[t]);
   end;
   output;
end;
run;

 

View solution in original post

6 REPLIES 6
Rick_SAS
SAS Super FREQ

I don't recall suggesting that you should use "to make the code faster and more efficient."  In general, there is no reason to think that an IML simulation will run faster than a well-written DATA step simulation. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

 

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

Data Param;
array arMEAN   {&start:&endpoint} (32, 40, 58, 70);
array arSTD    {&start:&endpoint} ( 2,  3,  4,  2);
run;

data simwide;
set Param;
array arMEAN   {&start:&endpoint};
array arSTD    {&start:&endpoint};
array arAGE    {&start:&endpoint};
call streaminit(1);

do id = 1 to &sampsize; /*size of each sample*/
   do t=&start to &endpoint; /*time points*/
      arAGE{t} =  rand('normal',arMEAN[t], arSTD[t]);
   end;
   output;
end;
run;

 

SimRock
Obsidian | Level 7

Thank you so much Rick_SAS

This is really helpful

So now I think I just need help eliminating the IF and Then statements. My code is a bit more complicated and I am wondering if you could help me eliminate them in the sample code below. I was not able to find the best wat to do it. especially because I am trying to simulate MI (myocardial infarction) among those who do not have it in the previous step. Thank you

 

Proc fcmp outlib=work.funcs.link; /*to create an Expit function*/
	function expit(LinearCombination);
		n = (1/(1 + exp(-(LinearCombination))));
	return (n);
	endsub;
quit;
/*End of function generation*/

options cmplib=work.funcs ;

Data param;
array arMEANAGE	{&start:&endpoint}(32, 40, 58);
array arMEANBP	{&start:&endpoint}(132, ., .);
array arMEANMI	{&start:&endpoint}(0, ., .);

array arInterBP	{&start:&endpoint}(., 80, 90);
array arInterMI	{&start:&endpoint}(., -2.75, -2.09);

array arBeta_BP_AGE {&start:&endpoint}(., 0.001, 0.002);
array arBeta_MI_AGE	{&start:&endpoint}(., 0.002, 0.001);
array arBeta_MI_BP	{&start:&endpoint}(., 0.003, 0.009);
run;



Data simwide;
	set param;
	call streaminit(053018);
		do replicate= 1 to &reps; /*Number of replications*/
			do Envid = 1 to &numNeigh; /*Number of neighborhoods*/

					NeighborhoodScore 	= 	rand('normal', 2, 1);

					array arAGE		{&start:&endpoint};
					array arBP		{&start:&endpoint};
					array arMI		{&start:&endpoint};

					array arMEANAGE	{&start:&endpoint};
					array arMEANBP	{&start:&endpoint};
					array arMEANMI	{&start:&endpoint};

					array arInterBP	{&start:&endpoint};
					array arInterMI	{&start:&endpoint};

					array arBeta_BP_AGE {&start:&endpoint};
					array arBeta_MI_AGE	{&start:&endpoint};
					array arBeta_MI_BP	{&start:&endpoint};


				do id = 1 to &sampsize; /*size of each sample*/
					do t=&start to &endpoint; /*time points*/
						if t=&start then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t}, 1);
								arBP{t} =	rand('normal',arMEANBP{t}, 1);
								arMI{t} =	arMEANMI{t};
	
							end;
						else if t=1 then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t-1} + 1, 2);
								arBP{t} =	rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);

								if (arMI{t-1}=0) then
									do;
										arMI{t} =	rand('bernoulli',expit(arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
								 	end;
								else
									do;
										arMI{t}=.;
									end;
							end;
	
						else if t=&endpoint then
						    do;
								arAGE{t} =	rand('normal',arMEANAGE{t-1} + 1, 3);
								arBP{t} =	rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);

								if (arMI{t-1}=0) then
									do;
										arMI{t} =	rand('bernoulli',expit (arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
								 	end;
								else
									do;
										arMI{t}=.;
									end;
							end;
					   end;
					output;
					end;
				end;
			end;
	run;
Rick_SAS
SAS Super FREQ

This looks good. If you want different behaviors for each time point, you might not be able to avoid the IF-THEN. 

A few comments:

1. You don't need to define your own EXPIT function. use the built-in LOGISTIC function.

2. Are you sure you want arMI to contain missing values? It looks like a response variable, which would be 0/1.  If you use 0/1, then you might be able to get rid of the IF-THEN/ELSE statement for arMI.

3. Your code will be clearer if you use a separate variable for the linear predictors in your model. For example

mu = arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1};

...

eta = arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1});

4. Coding Tip: In many fonts and program editors, the curly braces look very similar to round parenteses: f{i} vs f(i).  Although there is nothing wrong with using curly braces for indexing into an array, I prefer using square brackets, which show up better because they look different from the parentheses: f[i] vs f(i). For example, I would write:

eta = arInterMI[t] + arBeta_MI_AGE[t]*arAGE[t-1] + arBeta_MI_BP[t]*arBP[t-1];

arMI[t] =   rand('bernoulli',logistic (eta));

 

5. For clarity, you can move the ARRAY statements out of the loops and after the SET PARAM statement.

SimRock
Obsidian | Level 7

Thank you so much, Rick. This is very helpful. I have tried to find out where SAS is spending the most time and I figured out it was a SAS code that I am including  to convert body mass index to z-score and percentiles. See https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm and https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/cdc-source-code.sas for more details (attached CDC codes). In this code, they use a couple of PROC SORT which I believe increases the time. Do you have some tips and strategies (e.g. use memory or clusters or convert CDC SAS codes etc...) you can suggest to circumvent the time hurdle while accomplishing what I want (i.e. convert body mass index to z-scores and percentiles using the CDC provided SAS codes)

Thank you for your help

 

By the way, does SAS have a log odds function such that logit(p) = p/(1-p) similar to the logistic function? Just wondering

I am currently using the following

Proc fcmp outlib=ref.funcs.link; /*logit function*/
	function logit(Pr);
		n = log(Pr/(1-Pr));
	return(n);
	endsub;
quit;
Rick_SAS
SAS Super FREQ

I don't know anything about CDC codes, so I'd rather not comment on that.

 

SAS does not have a built-in DATA step function for the logit.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 6 replies
  • 1236 views
  • 1 like
  • 2 in conversation