Hi all,
I wonder if there is a way to reduce the running time for a large-sample regression. There are nearly five million observations, and the event rate is about 6%. Currently, running the logistic regression on the full sample takes more than 20 hours. Is there a way to make it run faster?
Thank you for your wisdom!
L.
How many X variables, plus interaction terms and power terms, are in the MODEL statement? Are some of the X variables in the CLASS statement? If so, how many variables and how many levels each?
Are you using the SELECTION= option? If so, which method did you select?
Are you using a BY statement?
Having lots of CLASS variables with lots of levels could be one reason why it takes 20 hours. Can you combine some levels (so that, for example, you have only 5 age bands instead of 9)?
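One way to combine levels is to recode the variable in a DATA step before modeling. A minimal sketch, assuming a numeric variable coded 1-9; the variable names (age_band, age_band5) and the groupings here are hypothetical, not taken from the poster's data:

```
/* Hypothetical recode: collapse 9 age bands into 5 broader bands */
data recoded;
    set &datin.;
    select (age_band);               /* age_band is a placeholder name */
        when (1, 2) age_band5 = 1;
        when (3, 4) age_band5 = 2;
        when (5, 6) age_band5 = 3;
        when (7, 8) age_band5 = 4;
        otherwise   age_band5 = 5;
    end;
run;
```

Each CLASS level beyond the reference adds a dummy parameter, so fewer levels means fewer parameters to estimate on every iteration.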
Please answer my other questions.
Also, are you using the EXACT statement in your PROC LOGISTIC?
Please share your PROC LOGISTIC code so we don't have to guess what options are being used.
Here is the code:
proc logistic data=&datin.;
    class &XVAR_ref. / PARAM=REF;
    model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;
So 17 CLASS variables, and you also want to add state (with 49 levels). I think this is largely the problem, and you should experiment with fewer variables and fewer levels for each variable.
Also, you will have to struggle with possible multicollinearity between the X variables, which I'm sure will be a problem if you want to interpret the results.
You might also want to try changing the algorithm's convergence criteria in the MODEL statement.
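For example, the gradient convergence criterion can be loosened from its default of GCONV=1E-8, which may let the optimizer stop after fewer iterations. The value below is purely illustrative, not a recommendation; related options include FCONV=, ABSFCONV=, XCONV=, and MAXITER=:

```
proc logistic data=&datin.;
    class &XVAR_ref. / PARAM=REF;
    model Dependent(ref='0') = &XVAR. / firth
        gconv=1e-6;   /* looser than the 1E-8 default; illustrative value */
run;
```

Be aware that looser criteria trade precision of the estimates for speed, so check that the resulting coefficients are stable before relying on them.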
Large, complex models are much more likely to suffer from separation problems because the data becomes more sparse as the model becomes more complex.
So, I understand why you add the FIRTH option.
But in my experience, the FIRTH option also inflates execution time.
(The FIRTH method uses an iterative maximum likelihood estimation algorithm to maximize a penalized likelihood function. The time needed depends on the number of parameters that must be estimated in each iteration and the number of iterations needed to achieve convergence. The amount of time needed will increase with each of these and cannot be known in advance. Note that both of these can be data dependent such that the same code applied to even slightly different data could result in very different time use.)
This usage note discusses the separation issue:
Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems
https://support.sas.com/kb/22/599.html
BR, Koen
1. Are you running on the data set that contains data for all 49 states?
2. Do you want to run each state's data independently of the others by using a BY statement?
If so, you might experiment with how long some of the smaller states take. For example, try something equivalent to this:
proc logistic data=&datin.;
where State in ('DE');
class &XVAR_ref./PARAM=REF;
model Dependent(ref='0')=&XVAR./firth parmlabel CLODDS=PL EXPB rsquare;
run;
It might be that the small states complete quickly. Even the larger states (TX, CA, FL) might take only 20 minutes or less. (Try it out!) After you get the preliminary timings, you might decide that you can run a BY-group analysis
BY state;
in a fraction of the time that it takes to run the full regression for all states combined.
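Putting that together, a BY-group run might look like this. It reuses the macro variables from the code above and assumes the data set has a State variable; BY processing requires the data to be sorted (or indexed) by the BY variable:

```
/* Sort by State so PROC LOGISTIC can process BY groups */
proc sort data=&datin. out=by_state;
    by State;
run;

/* Fit one model per state instead of one model with a 49-level effect */
proc logistic data=by_state;
    by State;
    class &XVAR_ref. / PARAM=REF;
    model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;
```

Note that this fits 49 separate models rather than one pooled model with a state effect, so the estimates answer a different question; but each fit involves far fewer parameters and observations, which is why the total time can drop sharply.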