rakeshallu
Obsidian | Level 7

Hello all, 

 

I am trying to estimate parameters through a custom-written likelihood function in PROC IML. The input data to the function ("use" in the code below) is akin to a panel with multiple observations for each unit (vector 'u' in the code below stores all the unique units). I compute the likelihood of observing the data for each unit (I call it LL in the code below). Then, I compute the log(LL) and sum log(LL) over all units. The summation is my objective function.
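
In symbols, with \(U\) the set of unique units and \(L_u(\theta)\) the likelihood of unit \(u\)'s observations, the objective is

\[
\ell(\theta) = \sum_{u \in U} \log L_u(\theta).
\]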

 

The code works fine for small datasets: for 1,000 rows and 42 unique units, it takes about a minute to execute. However, my full data has about 2 million rows and 150K unique units, and every time I try executing, SAS hangs; I even let it run for 36 hours with no output. My guess is that the DO loop over all unique units is the bottleneck. My questions are as follows:

 

1. Is my guess correct? (A quick timing check is sketched below.)

2. If yes, do you have some advice on doing these computations without a DO Loop?

3. If no, do you have suggestions for increasing the speed?
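
A quick way to test the guess in question 1 is to time a single evaluation of the objective function before handing it to NLPNRA. A minimal sketch (to be run after the FINISH statement and the PARAM definition in the code below):

t0 = time();               /* wall-clock time, in seconds */
f = LogLik(param);         /* one evaluation of the objective */
elapsed = time() - t0;
print elapsed[label="Seconds per objective-function call"];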

 

As always, very grateful for all your support. 

 

Thank you, 

Rakesh 

 

%let extended_pos = 5; 
proc iml;
/************************/
/*Read relevant variables from the data into a matrix X*/	
varNames = {"seq" "position" "pred_usage" "residual_usage" "credits_r" "p_hat" "c_seller_fe" "c_pos_coef" "sl_fixed_effect" "pos_fixed_effect"};
use use;
	read all var varNames into x;
close use;
/************************/

start LogLik(param) global(x);
u = unique(x[,1]); 																/*Extract unique session Ids in vector u*/ 
s = j(1, ncol(u),.);															/*Create an empty matrix to store probability of each session*/
do i = 1 to ncol(u);						   									
	idx = loc(x[,1] = u[i]);			/*Get the row indices for this session*/
	max_pos         = max(x[,2][idx]`);
	sess_pred_usage = x[,3][idx]`;
	sess_res_usage  = x[,4][idx]`;
	sess_credits    = max(x[,5][idx]`);
	sess_phat       = max(x[,6][idx]`);
	sl_fe_choice    = max(x[,7][idx]`);
	sl_pos_choice   = max(x[,8][idx]`);
	sl_fe_scroll    = max(x[,9][idx]`);
	sl_pos_scroll   = max(x[,10][idx]`);
	max_pred_usage  = sess_pred_usage[1, max_pos];
	max_res_usage   = sess_res_usage[1, max_pos];

	pos_cut = max_pos + &extended_pos.;
	pos = do(1, pos_cut, 1);

	/************************/
	/*Extend the predicted and residual usage vectors to all positions up to the cutoff*/
		sess_pred_usage = sess_pred_usage||(max_pred_usage*j(1, &extended_pos.,1)); 
		sess_res_usage = sess_res_usage||(max_res_usage*j(1, &extended_pos.,1)); 
	/************************/

	/************************/
	/*Create a vector for probability of scrolling further at all positions 
	The seller always scrolls further from the first two positions*/
		sc = logistic(sl_fe_scroll + sl_pos_scroll*pos + param[1] + param[2]*pos + param[3]*sess_pred_usage + param[4]*sess_res_usage + param[5]*sess_credits);/*param[2]*predicted_usage_prop + param[3]*residual_usage_prop +*/ 
		sc[1,1] = 1; sc[1,2] = 1;
	/************************/ 

	/************************/
	/*Create a vector for the probability of ending at all positions greater than the maximum position.
	Up to the maximum position, the probability of ending is 0*/
		ed = 1 - sc; ed[1, 1:max_pos-1] = 0;   
	/************************/ 

	/************************/
	/*Create a vector for computing the not-choosing probability at each position.
	 Up to max_pos, default this value to 1 as it does not affect the LL function. */
		m1 = 1 - logistic(sl_fe_choice + sess_phat + sl_pos_choice*(max_pos + do(1, &extended_pos.,1)));
		not_choosing_prob = j(1, max_pos,1)||m1;
	/************************/

	/************************/
	/*Compute the probability of a sample path for all possible end points in a session. Store each probability in a vector called LL*/	
		LL = j(1,  pos_cut,.);
		LL[1,1] = 0; LL[1,2] = 0;
		do m = 3 to  pos_cut;												
				LL[1,m] = ed[1,m]*prod(sc[1,1:m-1])*prod(not_choosing_prob[1,1:m]);
		end;
	/************************/
	s[i] = log(100*sum(LL));			/*Probability of an observed session is the sum over all sample paths*/
	end; 
return sum(s);
finish;

/************************/
/*Maximize sum(s)*/	
param = {0 -0.0025 0.001 0.001  0};
optn = {1, /* find maximum of the function */ 4}; /* print a moderate amount of output */
con = {. . . . ., 
	    . . . . .};
call nlpnra(rc, xres, "LogLik", param, optn, con);
/************************/
quit; 

 

 

 

  

Rick_SAS
SAS Super FREQ

It seems likely that it is inefficient to use the UNIQUE-LOC trick to loop over and process 150K unique groups from 2M records. Currently, EVERY call to the objective function must use the LOC function 150K times to extract the subgroups. This requires going through the 2M records 150K times for each call!

 

A more efficient method is to sort the data once OUTSIDE of the objective function and directly index and process each group. I suggest you use PROC SORT to sort the data by the SEQ variable. You can then use the UNIQUEBY function in SAS IML to obtain the beginning/ending index for each BY group. Make that vector available to the objective function by using the GLOBAL statement. Then you will be able to directly iterate over the subgroups without any calls to the LOC function. For a discussion, example, and sample code, see https://blogs.sas.com/content/iml/2011/11/07/an-efficient-alternative-to-the-unique-loc-technique.ht...
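
In outline, the pattern looks like this (a minimal sketch: the variable list is abbreviated to the group key, and passing b to the module via a GLOBAL statement is omitted):

/* one-time setup, outside the objective function: sort by the group key */
proc sort data=use; by seq; run;

proc iml;
use use;
	read all var {"seq"} into key;    /* key is now sorted */
close use;

b = uniqueby(key, 1);         /* b[i] = first row of the i-th BY group */
b = b // (nrow(key)+1);       /* append a sentinel one past the last row */
do i = 1 to nrow(b)-1;
	idx = b[i] : (b[i+1]-1);  /* contiguous rows of group i; no LOC call */
	/* ... extract the session's rows, e.g. x[idx, ], and compute its likelihood ... */
end;
quit;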

 

rakeshallu
Obsidian | Level 7

Thank you, Rick. 

This helped me reduce the run time for 10 unique groups from 92 seconds to 69 seconds, a huge improvement!

However, the run time per session is unfortunately still too high to scale up to the entire dataset. Do you see any more opportunities to improve efficiency?

 

Rick_SAS
SAS Super FREQ

The performance when you have many unique groups should be even better.

 

You could vectorize the inner loop, but I am not sure how much impact it will have. The basic idea is to use the CUPROD function to compute the cumulative products rather than iterating over the partial products. I don't have your data, so I can't test it, but try replacing the statements

 

LL = j(1,  pos_cut,.);
LL[1] = 0; LL[2] = 0;
do m = 3 to  pos_cut;
   LL[m] = ed[m]*prod(sc[1,1:m-1])*prod(not_choosing_prob[1,1:m]);
end;

 

 

with the new vectorized statements

lagsc = rowvec(lag(sc));
cuProdLagsc = cuprod(lagsc);
cuProdncp = cuprod(not_choosing_prob);
LL = ed # cuProdLagsc # cuProdncp;

You will, of course, want to test this code on a small example to ensure that my code computes the same quantities that yours does.
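
For example, a check on a toy session (run inside PROC IML; the numbers are made up, and the vectors stand in for one session's sc, ed, and not_choosing_prob) might look like:

/* toy vectors for one session (hypothetical values) */
sc  = {1 1 0.9 0.8 0.7};
ed  = {0 0 0.1 0.2 0.3};
not_choosing_prob = {1 1 0.95 0.9 0.85};
pos_cut = ncol(sc);

/* loop version */
LL1 = j(1, pos_cut, .);
LL1[1] = 0; LL1[2] = 0;
do m = 3 to pos_cut;
	LL1[m] = ed[m]*prod(sc[1,1:m-1])*prod(not_choosing_prob[1,1:m]);
end;

/* vectorized version */
lagsc = rowvec(lag(sc));
lagsc[1] = 1;              /* replace the leading missing value from LAG */
LL2 = ed # cuprod(lagsc) # cuprod(not_choosing_prob);

print LL1, LL2;            /* the two rows should agree */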

Rick_SAS
SAS Super FREQ

BTW, a more efficient version of

	sess_pred_usage = x[,3][idx]`;
	sess_res_usage  = x[,4][idx]`;
	etc.

is

	sess_pred_usage = x[idx,3]`;
	sess_res_usage  = x[idx,4]`;
	etc.

although the main benefit is readability; I don't expect any noticeable performance improvement from this change.

rakeshallu
Obsidian | Level 7

Thank you very much, Rick. I am now able to execute the code for all the groups in about 45 minutes.   

 

Rick_SAS
SAS Super FREQ

Wow! That's great! So, I think the performance of this problem improved as follows:

1. The original program required more than 36 hours.

2. After rewriting the outer loop to use sort and UNIQUEBY (instead of performing a LOC lookup at each iteration), the program took .... HOW LONG? A few hours? One hour?

3. After rewriting both the outer and inner loop, the code took 45 minutes.

rakeshallu
Obsidian | Level 7

I did not do step 2 for the entire data. Below are the numbers for a sample of 10 unique groups:

1. Original problem took ~90 seconds

2. Using uniqueby reduces the time to ~65 seconds (I am sorting outside proc iml)

3. Vectorizing the inner loop reduces time to ~0.12 seconds

 

That is very COOL! Thank you, Rick.
