Re: How to save running time for a large sample regression

lichee · Posted 01-19-2024 10:16 AM

Hi all,

I wonder if there is a way to reduce the running time of a large-sample regression. There are nearly five million observations, and the event rate is about 6%. Currently, running a logit regression on the full sample takes more than 20 hours. Is there a way to make it run faster?

Thank you for your wisdom!

L.

PaigeMiller · Posted 01-19-2024 10:25 AM

How many X variables, plus interaction terms and power terms, are in the MODEL statement? Are some of the X variables in the CLASS statement? If so, how many variables and how many levels each?

Are you using the SELECTION= option? If so, what method did you select?

Are you using a BY statement?

--
Paige Miller

lichee · Posted 01-19-2024 11:52 AM

All of the X variables are categorical and are listed in the CLASS statement. There are interaction terms of 9 age bands and a female gender indicator and 15 other binary indicators. I will also need to include 49 state indicators if I can cut the running time. Thank you very much!

PaigeMiller · Posted 01-19-2024 12:10 PM

Having lots of class variables with lots of levels could be one reason why it takes 20 hours. Can you combine some levels (such that maybe you have only 5 age bands instead of 9)?
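
One low-effort way to try this, sketched below, is to regroup levels with a user-defined format instead of recoding the data, since PROC LOGISTIC builds CLASS levels from formatted values. The variable name age_band, its coding as integers 1 to 9, the format name, and the particular grouping are all assumptions for illustration; none of them come from the original post. The sketch also assumes age_band is already part of &XVAR_ref. and &XVAR.

```sas
/* Hypothetical: age_band is numeric and coded 1-9; collapse it to 5 groups */
proc format;
   value age5f
      1 - 2 = '1: bands 1-2'
      3 - 4 = '2: bands 3-4'
      5 - 6 = '3: bands 5-6'
      7 - 8 = '4: bands 7-8'
      9     = '5: band 9';
run;

proc logistic data=&datin.;
   format age_band age5f.;   /* CLASS levels are formed from the formatted values */
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR. / firth;
run;
```

Because the grouping lives in the format, different groupings can be timed against each other without creating new copies of the five-million-row data set.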

Please answer my other questions.

Also, are you using the EXACT statement in your PROC LOGISTIC?

--
Paige Miller

lichee · Posted 01-19-2024 12:17 PM

I can combine age bands. I did not use the EXACT statement or the SELECTION= option.

PaigeMiller · Posted 01-19-2024 12:33 PM

Please share your PROC LOGISTIC code so we don't have to guess what options are being used.

--
Paige Miller

lichee · Posted 01-19-2024 12:43 PM

Here is the code:
proc logistic data=&datin.;
   class &XVAR_ref. / PARAM=REF;
   model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;

PaigeMiller · Posted 01-19-2024 12:54 PM

So 17 class variables and you also want to add in state (with 49 levels). I think this is largely the problem and you should experiment with fewer variables and fewer levels of each variable.

Also, you will have to struggle with possible multicollinearity between the X variables, which I'm sure will be a problem if you want to interpret the results.

--
Paige Miller

lichee · Posted 01-19-2024 01:07 PM

I did check multicollinearity for the initial list of 25 class variables and ended up with the 17 class variables, which show no collinearity issues. However, the state indicators were not checked for collinearity, as I wanted to include them to control for state variation.

PaigeMiller · Posted 01-19-2024 02:04 PM

You might also want to try changing the algorithm convergence criteria in the MODEL statement.
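
As a rough sketch of what that could look like: the MODEL statement in PROC LOGISTIC accepts convergence options such as GCONV= (relative gradient criterion, default 1E-8), FCONV=, XCONV=, and MAXITER=. The values below are arbitrary illustrations, not recommendations. A looser criterion can stop the iterations earlier, at the cost of less precise estimates, so it is worth comparing the results against a run with the defaults on a subsample.

```sas
proc logistic data=&datin.;
   class &XVAR_ref. / param=ref;
   /* gconv= and maxiter= values are illustrative assumptions only */
   model Dependent(ref='0') = &XVAR. / firth parmlabel clodds=pl expb rsquare
                                       gconv=1e-6 maxiter=50;
run;
```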

--
Paige Miller

sbxkoenk · Posted 01-21-2024 08:33 AM

Large, complex models are much more likely to suffer from separation problems because the data becomes more sparse as the model becomes more complex.

So, I understand why you add the FIRTH option.
But in my experience, the FIRTH option will also inflate execution time.

(The FIRTH method uses an iterative maximum likelihood estimation algorithm to maximize a penalized likelihood function. The time needed depends on the number of parameters that must be estimated in each iteration and the number of iterations needed to achieve convergence. The amount of time needed will increase with each of these and cannot be known in advance. Note that both of these can be data dependent such that the same code applied to even slightly different data could result in very different time use.)

This usage note discusses the separation issue:
Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems
https://support.sas.com/kb/22/599.html

BR, Koen

Ksharp · Posted 01-21-2024 04:26 AM

Try PROC HPLOGISTIC.
Any PROC whose name starts with HP is meant for big data, such as PROC HPGENSELECT, PROC HPMIXED, and so on.
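
A minimal sketch of the same model in PROC HPLOGISTIC follows, reusing the macro variables from the code posted earlier in the thread. PROC HPLOGISTIC does not appear to offer a FIRTH option, so this fits ordinary maximum likelihood and is not a drop-in replacement for the penalized fit. The NTHREADS= value is an arbitrary assumption; set it to match the machine.

```sas
proc hplogistic data=&datin.;
   class &XVAR_ref. / param=reference;
   model Dependent(ref='0') = &XVAR.;
   /* nthreads=8 is a made-up value; DETAILS requests a timing breakdown */
   performance nthreads=8 details;
run;
```

If separation is a concern without FIRTH, check the log and the parameter estimates for very large coefficients or standard errors before trusting the output.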

Rick_SAS · Posted 01-21-2024 06:18 PM

1. Are you running on the data set that contains data for all 49 states?

2. Do you want to run each state's data independently of the others by using a BY statement?

If so, you might experiment with how long some of the smaller states take. For example, try something equivalent to this:

proc logistic data=&datin.;
   where State in ('DE');
   class &XVAR_ref. / PARAM=REF;
   model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;

It might be that the small states complete quickly. Even the larger states (TX, CA, FL) might take only 20 minutes or less. (Try it out!) After you get the preliminary timings, you might decide that you can run a BY-group analysis
BY state;

in a fraction of the time that it takes to run the full regression for all states combined.
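
Put together, the BY-group version might look like the sketch below. It assumes the state variable is named State, as in the WHERE example above, and that State itself is not in the CLASS or MODEL lists; the name of the sorted data set is made up. Keep in mind that this fits a separate model for each state, which is not the same analysis as one pooled model with state indicators.

```sas
/* BY-group processing needs the data sorted (or indexed) by State */
proc sort data=&datin. out=work.datin_sorted;   /* output name is hypothetical */
   by State;
run;

proc logistic data=work.datin_sorted;
   by State;
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR. / firth parmlabel clodds=pl expb rsquare;
run;
```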
