Hello all,
I am trying to estimate parameters through a custom-written likelihood function in PROC IML. The input data to the function ("use" in the code below) is akin to a panel with multiple observations for each unit (vector 'u' in the code below stores all the unique units). I compute the likelihood of observing the data for each unit (I call it LL in the code below). Then, I compute the log(LL) and sum log(LL) over all units. The summation is my objective function.
The code works fine for small datasets. For 1000 rows and 42 unique units in the dataset, it takes about a minute to execute. However, I have about 2 MN rows and 150K unique units in the data and each time I try executing, SAS gets hung. I even tried letting it run for 36 hours with no output. My guess is that the DO loop over all unique units is causing the time. My questions are as follows -
1. Is my guess correct?
2. If yes, do you have some advice on doing these computations without a DO Loop?
3. If no, do you have some advice on suggestions to increase the speed?
As always, very grateful for all your support.
Thank you,
Rakesh
%let extended_pos = 5;
proc iml;
/************************/
/*Read relevant variables from the data into a matrix X*/
varNames = {"seq" "position" "pred_usage" "residual_usage" "credits_r" "p_hat" "c_seller_fe" "c_pos_coef" "sl_fixed_effect" "pos_fixed_effect"};
use use;
read all var varNames into x;
close use ;
/************************/
start LogLik(param) global(x);
u = unique(x[,1]); /*Extract unique session Ids in vector u*/
s = j(1, ncol(u),.); /*Create an empty matrix to store probability of each session*/
do i = 1 to ncol(u);
idx = loc(x[,1]=u[i]); /*Get all the variables in each session*/
max_pos = max(x[,2][idx]`);
sess_pred_usage = x[,3][idx]`;
sess_res_usage = x[,4][idx]`;
sess_credits = max(x[,5][idx]`);
sess_phat = max(x[,6][idx]`);
sl_fe_choice = max(x[,7][idx]`);
sl_pos_choice = max(x[,8][idx]`);
sl_fe_scroll = max(x[,9][idx]`);
sl_pos_scroll = max(x[,10][idx]`);
max_pred_usage = sess_pred_usage[1,max_pos];
max_res_usage = sess_res_usage[1,max_pos];
pos_cut = max_pos + &extended_pos.;
pos = do(1, pos_cut, 1);
/************************/
/*Create the predicted and residual upto vectors for all positions upto the cut off*/
sess_pred_usage = sess_pred_usage||(max_pred_usage*j(1, &extended_pos.,1));
sess_res_usage = sess_res_usage||(max_res_usage*j(1, &extended_pos.,1));
/************************/
/************************/
/*Create a vector for probability of scrolling further at all positions
The seller always scrolls further from the first two positions*/
sc = logistic(sl_fe_scroll + sl_pos_scroll*pos + param[1] + param[2]*pos + param[3]*sess_pred_usage + param[4]*sess_res_usage + param[5]*sess_credits);/*param[2]*predicted_usage_prop + param[3]*residual_usage_prop +*/
sc[1,1] = 1; sc[1,2] = 1;
/************************/
/************************/
/*Create a vector for probability of ending at all positions greater then the maximum position
Up to the maximum position, the probability of ending is 0*/
ed = 1 - sc; ed[1, 1:max_pos-1] = 0;
/************************/
/************************/
/*Create a vector for computing the not choosing probality at each position.
Upto max_pos default this value to 1 as it does not affect the LL function. */
m1 = 1 - logistic(sl_fe_choice + sess_phat + sl_pos_choice*(max_pos + do(1, &extended_pos.,1)));
not_choosing_prob = j(1, max_pos,1)||m1;
/************************/
/************************/
/*Compute the probability of a sample path for all possible end points in a session. Store each probability in a vector called LL*/
LL = j(1, pos_cut,.);
LL[1,1] = 0; LL[1,2] = 0;
do m = 3 to pos_cut;
LL[1,m] = ed[1,m]*prod(sc[1,1:m-1])*prod(not_choosing_prob[1,1:m]);
end;
/************************/
s[i] = log(100*sum(LL)); /*Probability of aN OBSERVED session is the sum of all sample paths */
end;
return sum(s);
finish;
/************************/
/*Maximize sum(s)*/
param = {0 -0.0025 0.001 0.001 0};
optn = {1, /* 3. find max of function, and */ 4}; /* print moderate amount of output */
con = {. . . . .,
. . . . .};
call nlpnra(rc, xres, "LogLik", param, optn, con);
/************************/
quit;
It seems likely that it is inefficient to use the UNIQUE-LOC trick to loop over and process 150K unique groups from 2M records. Currently, EVERY call to the objective function must use the LOC function 150K times to extract the subgroups. This requires going through the 2M records 150K times for each call!
A more efficient method is to sort the data once OUTSIDE of the objective function and directly index and process each group. I suggest you use PROC SORT to sort the data by the SEQ variable. You can then use the UNIQUEBY function in SAS IML to obtain the beginning/ending index for each BY group. Make that vector available to the objective function by using the GLOBAL statement. Then you will be able to directly iterate over the subgroups without any calls to the LOC function. For a discussion, example, and sample code, see https://blogs.sas.com/content/iml/2011/11/07/an-efficient-alternative-to-the-unique-loc-technique.ht...
It seems likely that it is inefficient to use the UNIQUE-LOC trick to loop over and process 150K unique groups from 2M records. Currently, EVERY call to the objective function must use the LOC function 150K times to extract the subgroups. This requires going through the 2M records 150K times for each call!
A more efficient method is to sort the data once OUTSIDE of the objective function and directly index and process each group. I suggest you use PROC SORT to sort the data by the SEQ variable. You can then use the UNIQUEBY function in SAS IML to obtain the beginning/ending index for each BY group. Make that vector available to the objective function by using the GLOBAL statement. Then you will be able to directly iterate over the subgroups without any calls to the LOC function. For a discussion, example, and sample code, see https://blogs.sas.com/content/iml/2011/11/07/an-efficient-alternative-to-the-unique-loc-technique.ht...
Thank you, Rick.
This helped me reduce the run time for 10 unique groups from 92 seconds to 69 seconds, a huge improvement!
However, the run-time per session is still unfortunately high to be scaled up to the entire data. Do you see any more opportunities to improve efficiency?
The performance when you have many unique groups should be even better.
You could vectorize the inner loop, but I am not sure how much impact it will have. The basic idea is to use the CUPROD function to compute the cumulative products, rather than iterate over the partial products. I don't have your data, so I can't test it out, but the try replacing the statements
LL = j(1, pos_cut,.);
LL[1] = 0; LL[2] = 0;
do m = 3 to pos_cut;
LL[m] = ed[m]*prod(sc[1,1:m-1])*prod(not_choosing_prob[1,1:m]);
end;
with the new vectorized statements
lagsc = rowvec(lag(sc));
cuProdLagsc = cuprod(lagsc);
cuProdncp = cuprod(not_choosing_prob);
LL = ed # cuProdLagsc # cuProdncp;
You will, of course, want to test this code on a small example to ensure that my code computes the same quantities that yours does.
BTW, a more efficient version of
sess_pred_usage = x[,3][idx]`;
sess_res_usage = x[,4][idx]`;
etc
is
sess_pred_usage = x[idx,3]`;
sess_res_usage = x[idx,4]`;
etc
although it is mostly easier to read. I don't expect any noticeable performance improvement from this change.
Thank you very much, Rick. I am now able to execute the code for all the groups in about 45 minutes.
Wow! That's great! So, I think the performance of this problem improved as follows:
1. The original program required more than 36 hours.
2. After rewriting the out loop to use sort and UNIQUEBY (instead of performing a LOC lookup at each iteration) the program took .... HOW LONG? A few hours? One hour?
3. After rewriting both the outer and inner loop, the code took 45 minutes.
I did not do step 2 for the entire data. Below are the numbers for a sample of 10 unique groups -
1. Original problem took ~90 seconds
2. Using uniqueby reduces the time to ~65 seconds (I am sorting outside proc iml)
3. Vectorizing the inner loop reduces time to ~0.12 seconds
That is very COOL! Thank you, Rick.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.