Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 05-31-2018 12:22 AM
(1762 views)

Hi,

I desperately need some help to convert my SAS code (currently in Data step format) into a SAS IML. My simulations are exceedingly slow when I increase the numbers greatly and SAS experts such as Rick Wicklin have suggesting vectorizing and using SAS IML to make the code faster and more efficient. The problem is that I am not completely sure how to go about it. If anyone can help me write a SAS IML procedure to make the code super fast and efficient (especially when I increase the complexity i.e. many variables ~ 100, increased sample size ~1000,000, increased number of replication ~1000). For example I use 100,000 for my sample size, 10 timepoints and 100 replications and many variables, my simulations take about 5 hours to run.

Basically my simulation consists of a nested (hierarchical) type of simulation with individuals nested within neighborhoods and each individuals has different measurement of a particular variable (here age) at different time points. The simulation is then repeated a number of time. After which I reshape the data and compute the statistics. Any help would be greatly appreciated. Here's a sample code. Any help to make this code more efficient and faster would be greatly appreciated especially in SAS IML.

```
%let reps=10;
%let numNeigh=20;
%let sampsize=30;
%let start=0;
%let endpoint=3;
/*Storing parameters into a dataset*/
Data param;
MeanAge0=32;
MeanAge1=40;
MeanAge2=58;
MeanAge3=70;
run;
Data simwide;
set param;
call streaminit(053018);
do replicate= 1 to &reps; /*Number of replications*/
do Envid = 1 to &numNeigh; /*Number of neighborhoods*/
NeighborhoodScore = rand('normal', 2, 1);
array arAGE {&start:&endpoint} AGE&start-AGE&endpoint;
do id = 1 to &sampsize; /*size of each sample*/
do t=&start to &endpoint; /*time points*/
if t=&start then
do;
arAGE{t} = rand('normal',MeanAge0,2);
end;
else if t=1 then
do;
arAGE{t} = rand('normal',MeanAge1,3);
end;
else if t=2 then
do;
arAGE{t} = rand('normal',MeanAge2,4);
end;
else if t=&endpoint then
do;
arAGE{t} = rand('normal',MeanAge3,2);
end;
end;
output;
end;
end;
end;
run;
/*Transposing the data in a long format*/
Data simlong;
retain replicate EnviD id t AGE;
set simwide;
array arAGE {&start:&endpoint} AGE&start-AGE&endpoint;
do t=&start to &endpoint;
AGE = arAGE{t};
output;
end;
Drop AGE&start-AGE&endpoint;
run;
Data simlong2;
set simlong;
newID= id + (envid-1)*&sampsize;
/*to get a correct idenfification number for the entire population*/
run;
proc sort data=simlong2; by replicate envid id t; run;
/*Computing the statistics*/
/*ideally I would like to automate the process for many variables like age*/
proc sql;
create table sum1 as
select replicate, t,
mean(age), mean(age) as age
from simlong2
group by replicate, t
order by t;
quit;
proc univariate data=sum1 noprint;
by t;
var age;
output out=sum2
pctlpts=2.5 50 97.5
pctlpre =var_p
mean =var_mean
std =var_SD;
run;
Data sum2;
retain varname;
set sum2;
varname="age";
run;
proc print data=sum2;
title1 "Distribution of simulated variable by timepoints";
run;
```

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I don't recall suggesting that you should use "to make the code faster and more efficient." In general, **t****here is no reason to think that an IML simulation will run faster than a well-written DATA step simulation**. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

```
Data Param;
array arMEAN {&start:&endpoint} (32, 40, 58, 70);
array arSTD {&start:&endpoint} ( 2, 3, 4, 2);
run;
data simwide;
set Param;
array arMEAN {&start:&endpoint};
array arSTD {&start:&endpoint};
array arAGE {&start:&endpoint};
call streaminit(1);
do id = 1 to &sampsize; /*size of each sample*/
do t=&start to &endpoint; /*time points*/
arAGE{t} = rand('normal',arMEAN[t], arSTD[t]);
end;
output;
end;
run;
```

6 REPLIES 6

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I don't recall suggesting that you should use "to make the code faster and more efficient." In general, **t****here is no reason to think that an IML simulation will run faster than a well-written DATA step simulation**. Since you know how to write the code in the DATA step, you might try improving the efficiency there.

Here are some suggestions:

1. Time how long each component of the simulation takes: Simulation, transpose, sort, SQL, UNIVARIATE, etc. That will tell you where the time is being spent.

2. If possible, simulate the data in such a way as to eliminate the need for the transpose and sort. THIS IS PROBABLY THE BIGGEST TIME SAVER.

3. I am not an SQL expert, but I would see if PROC MEANS can generate the same statistics. Do you really need the ORDER BY t clause?

4. Putting an IF-THEN/ELSE statement inside the innermost loop is not efficient. Use array references instead to eliminate the IF logic. For example, the following example eliminates the IF-THEN logic and simplifies the program:

```
Data Param;
array arMEAN {&start:&endpoint} (32, 40, 58, 70);
array arSTD {&start:&endpoint} ( 2, 3, 4, 2);
run;
data simwide;
set Param;
array arMEAN {&start:&endpoint};
array arSTD {&start:&endpoint};
array arAGE {&start:&endpoint};
call streaminit(1);
do id = 1 to &sampsize; /*size of each sample*/
do t=&start to &endpoint; /*time points*/
arAGE{t} = rand('normal',arMEAN[t], arSTD[t]);
end;
output;
end;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you so much Rick_SAS

This is really helpful

So now I think I just need help eliminating the IF and Then statements. My code is a bit more complicated and I am wondering if you could help me eliminate them in the sample code below. I was not able to find the best wat to do it. especially because I am trying to simulate MI (myocardial infarction) among those who do not have it in the previous step. Thank you

```
Proc fcmp outlib=work.funcs.link; /*to create an Expit function*/
function expit(LinearCombination);
n = (1/(1 + exp(-(LinearCombination))));
return (n);
endsub;
quit;
/*End of function generation*/
options cmplib=work.funcs ;
Data param;
array arMEANAGE {&start:&endpoint}(32, 40, 58);
array arMEANBP {&start:&endpoint}(132, ., .);
array arMEANMI {&start:&endpoint}(0, ., .);
array arInterBP {&start:&endpoint}(., 80, 90);
array arInterMI {&start:&endpoint}(., -2.75, -2.09);
array arBeta_BP_AGE {&start:&endpoint}(., 0.001, 0.002);
array arBeta_MI_AGE {&start:&endpoint}(., 0.002, 0.001);
array arBeta_MI_BP {&start:&endpoint}(., 0.003, 0.009);
run;
Data simwide;
set param;
call streaminit(053018);
do replicate= 1 to &reps; /*Number of replications*/
do Envid = 1 to &numNeigh; /*Number of neighborhoods*/
NeighborhoodScore = rand('normal', 2, 1);
array arAGE {&start:&endpoint};
array arBP {&start:&endpoint};
array arMI {&start:&endpoint};
array arMEANAGE {&start:&endpoint};
array arMEANBP {&start:&endpoint};
array arMEANMI {&start:&endpoint};
array arInterBP {&start:&endpoint};
array arInterMI {&start:&endpoint};
array arBeta_BP_AGE {&start:&endpoint};
array arBeta_MI_AGE {&start:&endpoint};
array arBeta_MI_BP {&start:&endpoint};
do id = 1 to &sampsize; /*size of each sample*/
do t=&start to &endpoint; /*time points*/
if t=&start then
do;
arAGE{t} = rand('normal',arMEANAGE{t}, 1);
arBP{t} = rand('normal',arMEANBP{t}, 1);
arMI{t} = arMEANMI{t};
end;
else if t=1 then
do;
arAGE{t} = rand('normal',arMEANAGE{t-1} + 1, 2);
arBP{t} = rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);
if (arMI{t-1}=0) then
do;
arMI{t} = rand('bernoulli',expit(arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
end;
else
do;
arMI{t}=.;
end;
end;
else if t=&endpoint then
do;
arAGE{t} = rand('normal',arMEANAGE{t-1} + 1, 3);
arBP{t} = rand('normal',arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1}, 1);
if (arMI{t-1}=0) then
do;
arMI{t} = rand('bernoulli',expit (arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1}));
end;
else
do;
arMI{t}=.;
end;
end;
end;
output;
end;
end;
end;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

This looks good. If you want different behaviors for each time point, you might not be able to avoid the IF-THEN.

A few comments:

1. You don't need to define your own EXPIT function. use the built-in LOGISTIC function.

2. Are you sure you want arMI to contain missing values? It looks like a response variable, which would be 0/1. If you use 0/1, then you might be able to get rid of the IF-THEN/ELSE statement for arMI.

3. Your code will be clearer if you use a separate variable for the linear predictors in your model. For example

mu = arInterBP{t} + arBeta_BP_AGE{t}*arAGE{t-1};

...

eta = arInterMI{t} + arBeta_MI_AGE{t}*arAGE{t-1} + arBeta_MI_BP{t}*arBP{t-1});

4. Coding Tip: In many fonts and program editors, the curly braces look very similar to round parenteses: f{i} vs f(i). Although there is nothing wrong with using curly braces for indexing into an array, I prefer using square brackets, which show up better because they look different from the parentheses: f[i] vs f(i). For example, I would write:

eta = arInterMI[t] + arBeta_MI_AGE[t]*arAGE[t-1] + arBeta_MI_BP[t]*arBP[t-1];

arMI[t] = rand('bernoulli',logistic (eta));

5. For clarity, you can move the ARRAY statements out of the loops and after the SET PARAM statement.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you so much, Rick. This is very helpful. I have tried to find out where SAS is spending the most time and I figured out it was a SAS code that I am including to convert body mass index to z-score and percentiles. See https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm and https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/cdc-source-code.sas for more details (attached CDC codes). In this code, they use a couple of PROC SORT which I believe increases the time. Do you have some tips and strategies (e.g. use memory or clusters or convert CDC SAS codes etc...) you can suggest to circumvent the time hurdle while accomplishing what I want (i.e. convert body mass index to z-scores and percentiles using the CDC provided SAS codes)

Thank you for your help

By the way, does SAS have a log odds function such that logit(p) = p/(1-p) similar to the logistic function? Just wondering

I am currently using the following

```
Proc fcmp outlib=ref.funcs.link; /*logit function*/
function logit(Pr);
n = log(Pr/(1-Pr));
return(n);
endsub;
quit;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I don't know anything about CDC codes, so I'd rather not comment on that.

SAS does not have a built-in DATA step function for the logit.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you very much

**SAS Innovate 2025** is scheduled for May 6-9 in Orlando, FL. Sign up to be **first to learn** about the agenda and registration!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.