rakeshallu
Obsidian | Level 7

Dear All, 

 

I am writing a custom objective function in the PROC IML environment to estimate 56 parameters. Of these, 51 are week fixed effects and the remaining 5 are parameters of interest. I want to conduct the estimation on a dataset with 4,037,641 rows. The code below works but takes very long: it has produced no output in over 16 hours. However, on a sub-sample of 23,809 rows, the code executes in 4:45.73 real time and 1:38.68 CPU time. I would like your suggestions on ways to optimize this code.

 

Thank you in advance. 

Best, 

Rakesh

proc iml;
/************************/
/*Read relevant variables from the data into a matrix X*/	
varNames = {"seq" "position" "usage_upto" "lag_mean_usage_after_p" "credits_left" "sl_fixed_effect" "pos_fixed_effect" "max_pos" "wk" "choice_p" "chosen" "week_fe_scroll"};
use scroll; 
	read all var varNames into x; 
close scroll; 
/************************/
b = uniqueby(x[,1],1)`; 														/*starting points in each session*/
u = unique(x[,1]);																/*unique session Ids*/
b = b ||(nrow(x[,1])+1);     													/* trick: append (n+1) to end of b for taking care of the last session*/

/*Initial parameter values: 5 covariate effects followed by 51 week fixed effects (week 52 serves as the baseline)*/
param = {3.354 -0.097 -2.538 0.146 0.002 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0};

start LogLik(param) global(x,b);
s = j(1, ncol(b)-1, .);															/*Empty vector to store the log-probability of each session*/
do i = 1 to ncol(b)-1;
	idx = b[1,i]:(b[1,i+1]-1);													/*Row indices of the observations in session i*/
	pos = 							x[idx,2]`;
	sess_usage_upto	=				x[idx,3]`;
	lag_use =						x[idx,4]`;
	sess_credits = 					max(x[idx,5]`);
	sl_fe_scroll =  				max(x[idx,6]`);
	sl_pos_scroll =  				max(x[idx,7]`);
	max_pos = 						max(max(x[idx,8]`),3);
	wk = 							max(x[idx,9]`);
	choice_p = 						x[idx,10]`;
	chosen = 						x[idx,11]`;
	wk_fe_scroll = 					x[idx,12]`;

	/************************/
	/*Probability of observing the choice at each position (choice_mult): choice_p where chosen = 1, and 1-choice_p where chosen = 0*/
	choice_mult = (choice_p##chosen) # ((1-choice_p)##(1-chosen)); 

	/************************/
	/*Create a vector for probability of scrolling further at all positions 
	The seller always scrolls further from the first two positions*/
	if wk = 52 then k = 0; else do; z = wk + 5; k = param[z]; end;
	sc = logistic(param[1] + param[2]*pos + param[3]*(sess_usage_upto/pos) + param[4]*lag_use + param[5]*sess_credits + k);
	sc[1,1] = 1; sc[1,2] = 1;
	choice_mult_sc = choice_mult # sc; /*multiply choice prob and scrolling prob at each position*/
	/************************/ 

	/************************/
	/*Create a vector for the probability of ending at each position.
	Before the maximum position, the probability of ending is 0*/
		ed = 1 - sc; 
		ed[1, 1:max_pos-1] = 0; 
		choice_mult_ed = choice_mult # ed; /*multiply choice prob and ending prob at each position*/
	/************************/ 

	/************************/
	/*Compute the probability of a sample path for all possible end points in a session. Store each probability in a vector called LL*/	
		lag_prod_sc = lag(cuprod(choice_mult_sc))`; 
		lag_prod_sc[1,1] = 1;
		LL = lag_prod_sc # choice_mult_ed;
		if LL < 1/(2**500) then LL = 1/(2**500);

	/************************/
	s[i] = log(10*sum(LL));													/*The probability of an observed session is the sum over all sample paths*/
	end; 
return sum(s);
finish;

/************************/
/*Maximize sum(s)*/	
optn = {1,	/* find the maximum of the function */
		2};	/* print a moderate amount of output */
con = {. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . , 
	   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .};
call nlpnra(rc, xres, "LogLik", param, optn, con);
/************************/
quit; 

 

ACCEPTED SOLUTION
Rick_SAS
SAS Super FREQ

I assume X is sorted by the first column?

1. Unless you have some very extreme outliers, the MLE estimates for 4 million observations are going to be very similar to estimates for a smaller subset of the data. If possible, use a smaller subset (like 8K-10K) to obtain preliminary estimates, then use those estimates as an initial guess for the MLE applied to the whole data set.

2. I think your logic is wrong in the line  

if LL < 1/(2**500) then LL = 1/(2**500);

LL is a vector, so the comparison applies to the whole vector: IML evaluates the statement as if all(LL < 1/(2**500)) then ..., which means the truncation happens only when EVERY element of LL is small.
See "IF-THEN logic with matrix expressions" on The DO Loop (sas.com).
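A minimal sketch of an elementwise fix, using the IML elementwise-maximum operator <> in place of the IF-THEN statement:

LL = LL <> 1/(2**500);		/* <> is the elementwise maximum, so each small element is truncated individually */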

3. The following parameters only need to be computed once prior to calling the optimizer. They do not need to be computed during each call to the log-likelihood (LL) function.

	sess_credits = 					max(x[idx,5]`);
	sl_fe_scroll =  				max(x[idx,6]`);
	sl_pos_scroll =  				max(x[idx,7]`);
	max_pos = 						max(max(x[idx,8]`),3);
	wk = 							max(x[idx,9]`);
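For example, a sketch of the precomputation, assuming X is sorted by session as in the original code (the vector names are illustrative):

nSess = ncol(b) - 1;
sess_credits = j(1, nSess, .);
max_pos      = j(1, nSess, .);
wk           = j(1, nSess, .);
do i = 1 to nSess;
	idx = b[1,i]:(b[1,i+1]-1);
	sess_credits[i] = max(x[idx,5]);		/* session-level constants computed once */
	max_pos[i]      = max(max(x[idx,8]), 3);
	wk[i]           = max(x[idx,9]);
end;
/* then declare: start LogLik(param) global(x, b, sess_credits, max_pos, wk); */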

4. I don't think you need to subset the data every time you call the LL. For example, if you are analyzing 3 groups, you could subset X into x1, x2, x3 before calling the optimizer. You would use x1, x2, and x3 on the GLOBAL clause instead of X. If you are analyzing many groups (or you don't know how many groups you will be analyzing), you can build a list of subsets L such that L$i is the i-th subset; see the sketch below.
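A sketch of the list approach (lists require SAS/IML 14.2 or later; the names are illustrative):

nSess = ncol(b) - 1;
L = ListCreate(nSess);					/* one item per session */
do i = 1 to nSess;
	idx = b[1,i]:(b[1,i+1]-1);
	call ListSetItem(L, i, x[idx,]);	/* store the rows for session i */
end;
/* inside LogLik, retrieve a subset with: xi = L$i; */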

 

I suspect that precomputing the subsets of x will improve the performance.


6 REPLIES

rakeshallu
Obsidian | Level 7
Thank you for the response, Rick.
Apologies for the delay in responding. At first, I wasn't sure of your recommendation. Having spent some time with it, I see what you are saying.

Will get back to you shortly with how these recommendations improve performance. Thank you!
rakeshallu
Obsidian | Level 7

Thank you again for the response, Rick. 

I implemented point 1, corrected the LL truncation from point 2, subset X before the LogLik function, and removed the CON matrix. The performance does improve, and it is promising. Below are some summaries:

 

Number of rows    Number of parameters    Old code real time    New code real time
30,000            5                       4.02                  3.32
30,000            55                      5:17.51               4:14.30

 

That said, the number of parameters seems to be the bigger bottleneck, and I do not know how to navigate it. Please let me know if you have any more suggestions. Relatedly, I am wondering whether the following step can be done better:

 

if wk = 52 then k = 0; else do; z = wk + 5; k = param[z]; end;
sc = logistic(param[1] + param[2]*pos + param[3]*(sess_usage_upto/pos) + param[4]*lag_use + param[5]*sess_credits + k);

As always, thank you. 

 

Best, 

Rakesh

 

Rick_SAS
SAS Super FREQ

1. You don't need to transpose data when you are taking the MAX, so all statements that look like max(vector`) can be rewritten as max(vector).

2. In fact, I don't know why you are transposing any of the vectors. You can use column vectors instead of row vectors and save 6*ncol(b) vector transposes per iteration.

3. It looks like sess_usage_upto is only used as the ratio sess_usage_upto/pos. Therefore, you can compute the ratio one time by modifying the 3rd column:

x[,3] = x[,3] / x[,2];

Then there is no need to form the ratio inside the LOGISTIC function argument.

4. Regarding the LOGISTIC call, if you stop transposing and use column vectors, you can use matrix multiplication inside the LOGISTIC call. The linear combination of the vectors is equal to x[idx,2:4]*param[2:4] or x[idx,2:4]*param[2:4]` (not sure which; it depends on whether param is a row or column vector). A sketch follows.
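For instance, a sketch with column vectors, keeping param as a 1 x 56 row vector as in the original code. It assumes x[,3] already holds the precomputed ratio from point 3, and the wkcoef lookup is one illustrative way to replace the IF-THEN week-effect step, not something in the original code:

wkcoef = param[,6:56] || {0};					/* week effects; week 52 fixed at 0 */
eta = param[1] + x[idx,2:4] * t(param[,2:4])
		+ param[5]*sess_credits + wkcoef[wk];
sc = logistic(eta);								/* column vector: no transposes needed */
sc[1:2] = 1;									/* the seller always scrolls from the first two positions */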

 

These tips might give you another 5-10% improvement, but the size of the data set is what is hurting the performance.

rakeshallu
Obsidian | Level 7
Thank you, Rick. I do see a performance improvement of ~15% in real time.

The dummy variables I was estimating were week-level fixed effects (51 coefficients); I've now switched to quarter fixed effects instead to get faster convergence. The improved code works well.

Thank you for all your inputs.
Rick_SAS
SAS Super FREQ

Also, if you are performing unconstrained optimization, you do not need to define the CON matrix. You can omit that argument instead of defining it as a matrix of missing values.
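A minimal sketch of the unconstrained call:

call nlpnra(rc, xres, "LogLik", param, optn);	/* the con argument is simply omitted */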
