Dear All,
I am writing a custom objective function in the PROC IML environment to estimate 56 parameters. Of these, 51 are week fixed effects and the remaining 5 are parameters of interest. I want to conduct the estimation on a dataset with 4,037,641 rows. The code below works but takes very long: it has produced no output in over 16 hours. However, on a sub-sample of 23,809 rows, the code executes in 4:45.73 real time and 1:38.68 CPU time. I would like your suggestions on ways to optimize this code.
Thank you in advance.
Best,
Rakesh
proc iml;
/************************/
/*Read relevant variables from the data into a matrix X*/
varNames = {"seq" "position" "usage_upto" "lag_mean_usage_after_p" "credits_left" "sl_fixed_effect" "pos_fixed_effect" "max_pos" "wk" "choice_p" "chosen" "week_fe_scroll"};
use scroll;
read all var varNames into x;
close scroll;
/************************/
b = uniqueby(x[,1],1)`; /*starting points in each session*/
u = unique(x[,1]);
b = b ||(nrow(x[,1])+1); /* trick: append (n+1) to end of b for taking care of the last session*/
param = {3.354 -0.097 -2.538 0.146 0.002 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}; /*set initial parameter values for models with CHOSEN VARIABLE*/
start LogLik(param) global(x,b);
s = j(1, ncol(b)-1,.); /*Create an empty matrix to store probability of each session*/
do i = 1 to ncol(b)-1;
idx = b[1,i]:(b[1,i+1]-1);
/*Extract unique session Ids in vector u*/
pos = x[idx,2]`;
sess_usage_upto = x[idx,3]`;
lag_use = x[idx,4]`;
sess_credits = max(x[idx,5]`);
sl_fe_scroll = max(x[idx,6]`);
sl_pos_scroll = max(x[idx,7]`);
max_pos = max(max(x[idx,8]`),3);
wk = max(x[idx,9]`);
choice_p = x[idx,10]`;
chosen = x[idx,11]`;
wk_fe_scroll = x[idx,12]`;
/************************/
/*Create a vector for transaction probability at all positions. Create the probability of observing the choice at each position (choice_mult)*/
choice_mult = (choice_p##chosen) # ((1-choice_p)##(1-chosen));
/************************/
/*Create a vector for probability of scrolling further at all positions
The seller always scrolls further from the first two positions*/
if wk = 52 then k = 0; else do; z = wk + 5; k = param[z]; end;
sc = logistic(param[1] + param[2]*pos + param[3]*(sess_usage_upto/pos) + param[4]*lag_use + param[5]*sess_credits + k);
sc[1,1] = 1; sc[1,2] = 1;
choice_mult_sc = choice_mult # sc; /*multiply choice prob and scrolling prob at each position*/
/************************/
/************************/
/*Create a vector for probability of ending at all positions greater than the maximum position
Up to the maximum position, the probability of ending is 0*/
ed = 1 - sc;
ed[1, 1:max_pos-1] = 0;
choice_mult_ed = choice_mult # ed; /*multiply choice prob and scrolling prob at each position*/
/************************/
/************************/
/*Compute the probability of a sample path for all possible end points in a session. Store each probability in a vector called LL*/
lag_prod_sc = lag(cuprod(choice_mult_sc))`;
lag_prod_sc[1,1] = 1;
LL = lag_prod_sc # choice_mult_ed;
if LL < 1/(2**500) then LL = 1/(2**500);
/************************/
s[i] = log(10*sum(LL)); /*The probability of an observed session is the sum over all sample paths*/
end;
return sum(s);
finish;
/************************/
/*Maximize sum(s)*/
optn = {1, /* find maximum of the function */ 2}; /* print moderate amount of output */
con = {. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .};
call nlpnra(rc, xres, "LogLik", param, optn, con);
/************************/
quit;
I assume X is sorted by the first column?
1. Unless you have some very extreme outliers, the MLE estimates for 4 million observations are going to be very similar to estimates for a smaller subset of the data. If possible, use a smaller subset (like 8K-10K) to obtain preliminary estimates, then use those estimates as an initial guess for the MLE applied to the whole data set.
2. I think your logic is wrong in the line
if LL < 1/(2**500) then LL = 1/(2**500);
LL is a vector, so the statement is interpreted as "if ALL the elements of LL are less than 2##(-500) then ..."
See IF-THEN logic with matrix expressions - The DO Loop (sas.com)
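If you want a per-element floor, one option is the elementwise maximum operator; a minimal sketch (untested against your data):
eps = 1/(2**500);
LL = LL <> eps;   /* <> is the elementwise maximum, so every element of LL is clamped from below */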
3. The following parameters only need to be computed once prior to calling the optimizer. They do not need to be computed during each call to the log-likelihood (LL) function.
sess_credits = max(x[idx,5]`);
sl_fe_scroll = max(x[idx,6]`);
sl_pos_scroll = max(x[idx,7]`);
max_pos = max(max(x[idx,8]`),3);
wk = max(x[idx,9]`);
4. I don't think you need to subset the data every time you call the LL. For example, if you are analyzing 3 groups, you could subset X into x1, x2, x3 before calling the optimizer. You would use x1, x2, and x3 on the GLOBAL clause instead of X. If you are analyzing many groups (or you don't know how many groups you will be analyzing), you can build a list of subsets L such that L$i is the i-th subset.
I suspect that precomputing the subsets of x will improve the performance.
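For example, a rough sketch of the list approach (untested; it assumes X is sorted by the session ID in column 1 and that your SAS/IML release supports lists):
b = uniqueby(x[,1], 1)`;              /* starting row of each session */
b = b || (nrow(x)+1);                 /* append n+1 to handle the last session */
nSess = ncol(b) - 1;
L = ListCreate(nSess);                /* one list item per session */
do i = 1 to nSess;
   idx = b[1,i]:(b[1,i+1]-1);
   call ListSetItem(L, i, x[idx,]);   /* L$i now holds the rows for session i */
   /* session-level constants such as sess_credits, max_pos, and wk could be precomputed here too */
end;
Then define the module as start LogLik(param) global(L); and use xi = L$i; inside the loop.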
Thank you again for the response, Rick.
I implemented point 1, corrected the LL statement in point 2, subsetted X before the LogLik function, and removed the CON matrix. The performance does improve and is promising. Below are some summaries:
Number of rows | Number of parameters | Old code real time | New code real time
30,000 | 5 | 4.02 | 3.32
30,000 | 55 | 5:17.51 | 4:14.30
That said, the number of parameters seems to be the bigger bottleneck, and I do not know how to navigate that. Please let me know if you have any more suggestions. Relatedly, I am wondering whether the following step can be done better:
if wk = 52 then k = 0; else do; z = wk + 5; k = param[z]; end;
sc = logistic(param[1] + param[2]*pos + param[3]*(sess_usage_upto/pos) + param[4]*lag_use + param[5]*sess_credits + k);
As always, thank you.
Best,
Rakesh
1. You don't need to transpose data when you are taking the MAX. So all statements that look like max(vector`) can be rewritten as max(vector)
2. In fact, I don't know why you are transposing any of the vectors. You can use column vectors instead of row vectors and save 6*ncol(b) vector transposes per iteration.
3. It looks like sess_usage_upto is only used as the ratio sess_usage_upto/pos. Therefore, you can compute the ratio one time by modifying the 3rd column:
x[,3] = x[,3] / x[,2];
Then there is no need to form the ratio inside the LOGISTIC function argument.
4. Regarding the LOGISTIC call, if you stop transposing and use column vectors, you can use matrix multiplication inside the LOGISTIC call. The linear combination of the vectors is equal to x[idx,2:4]*param[2:4] or x[idx,2:4]*param[2:4]` (not sure which); a sketch is below.
These tips might give you another 5-10% improvement, but the size of the data set is what is hurting the performance.
Also, if you are performing unconstrained optimization, you do not need to define the CON matrix. You can omit that argument instead of defining it as a matrix of missing values.
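Putting tips 2-4 together, the inside of the session loop might look something like this (a rough sketch, untested; xi and eta are just illustrative names, it assumes column 3 of x has already been overwritten with the ratio, and COLVEC sidesteps the row-versus-column question for param):
xi  = L$i;                                    /* submatrix for session i, columns as in varNames */
eta = param[1] + xi[,2:4]*colvec(param[2:4])  /* pos, usage ratio, and lag_use in one matrix multiply */
      + param[5]*sess_credits + k;
sc  = logistic(eta);                          /* column vector of scrolling probabilities */
sc[1:2] = 1;                                  /* the seller always scrolls from the first two positions */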