lichee
Quartz | Level 8

Hi all,

I wonder if there is a way to reduce the running time of a regression on a large sample. There are nearly five million observations, and the event rate is about 6%. Currently, running a logit regression on the full sample takes more than 20 hours. Is there a way to make it run faster?

Thank you for your wisdom!

L.

12 REPLIES
PaigeMiller
Diamond | Level 26

How many X variables, plus interaction terms and power terms, are in the MODEL statement? Are some of the X variables in the CLASS statement? If so, how many variables, and how many levels does each have?

Are you using the SELECTION= option? If so, which method did you select?

Are you using a BY statement?

--
Paige Miller
lichee
Quartz | Level 8
All of the X variables are categorical and are listed in the CLASS statement. There are interaction terms between 9 age bands and a female gender indicator, plus 15 other binary indicators. I will also need to include 49 state indicators if I can cut the running time. Thank you very much!
PaigeMiller
Diamond | Level 26

Having lots of class variables with lots of levels could be one reason why it takes 20 hours. Can you combine some levels (such that maybe you have only 5 age bands instead of 9)?
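
One way to try combining levels without rewriting the data is to apply a format, since the CLASS statement groups observations by formatted value. This is only a sketch: the variable name AgeBand, its 1-9 coding, the format name agegrp, and the groupings themselves are all hypothetical and would need to match the real data.

```sas
/* Hypothetical: AgeBand is a numeric variable coded 1-9.            */
/* The groupings below are arbitrary placeholders, not a suggestion. */
proc format;
   value agegrp
      1, 2 = 'Band A'
      3, 4 = 'Band B'
      5, 6 = 'Band C'
      7, 8 = 'Band D'
      9    = 'Band E';
run;

proc logistic data=&datin.;
   format AgeBand agegrp.;          /* CLASS levels follow the formatted values */
   class AgeBand / param=ref;
   model Dependent(ref='0') = AgeBand;
run;
```

With the format in place, AgeBand contributes 4 parameters instead of 8 under reference coding, and every interaction that involves it shrinks accordingly.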

 

Please answer my other questions.

Also, are you using the EXACT statement in your PROC LOGISTIC?

--
Paige Miller
lichee
Quartz | Level 8
I can combine age bands. I did not use the EXACT statement or the SELECTION= option.
PaigeMiller
Diamond | Level 26

Please share your PROC LOGISTIC code so we don't have to guess what options are being used.

--
Paige Miller
lichee
Quartz | Level 8

Here is the code:
proc logistic data=&datin.;
   class &XVAR_ref. / PARAM=REF;
   model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;

PaigeMiller
Diamond | Level 26

So you have 17 class variables, and you also want to add state (with 49 levels). I think this is largely the problem, and you should experiment with fewer variables and fewer levels of each variable.

Also, you will have to struggle with possible multicollinearity between the X variables, which I'm sure will be a problem if you want to interpret the results.

--
Paige Miller
lichee
Quartz | Level 8
I did check for multicollinearity in the initial list of 25 class variables and ended up with 17 class variables that show no collinearity issues. However, I did not check the state indicators for collinearity, because I wanted to include them to control for state variation.
PaigeMiller
Diamond | Level 26

You might also want to try changing the algorithm convergence criteria in the MODEL statement.

[Image: PaigeMiller_0-1705691047076.png]
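
A minimal sketch of what loosening the convergence criteria could look like. GCONV=, FCONV=, ABSFCONV=, XCONV=, and MAXITER= are documented MODEL statement options in PROC LOGISTIC, but the attached image is not reproduced here and may have pointed at different ones. The values shown are illustrative assumptions, not recommendations.

```sas
proc logistic data=&datin.;
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR.
         / firth parmlabel clodds=pl expb rsquare
           gconv=1e-6     /* relative gradient criterion; the default is 1E-8 */
           maxiter=50;    /* iteration cap; value chosen arbitrarily */
run;
```

A looser criterion trades some precision in the estimates for fewer iterations, so it is worth comparing the estimates against a run with the default settings on a smaller sample.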

 

--
Paige Miller
sbxkoenk
SAS Super FREQ

Large, complex models are much more likely to suffer from separation problems because the data become more sparse as the model becomes more complex.

So, I understand why you added the FIRTH option.
But in my experience, the FIRTH option will also inflate execution time.

(The FIRTH method uses an iterative maximum likelihood estimation algorithm to maximize a penalized likelihood function. The time needed depends on the number of parameters that must be estimated in each iteration and the number of iterations needed to achieve convergence. The amount of time needed will increase with each of these and cannot be known in advance. Note that both of these can be data dependent such that the same code applied to even slightly different data could result in very different time use.)
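
Because the cost cannot be known in advance, one option is to measure it on a subsample. The sketch below assumes a 5% simple random sample is large enough to be informative. The data set name work.sample, the sampling rate, and the seed are all arbitrary choices, and timings on a sample will not necessarily scale linearly to the full data.

```sas
options fullstimer;   /* write detailed timing for each step to the log */

/* Draw a 5% simple random sample; rate and seed are arbitrary. */
proc surveyselect data=&datin. out=work.sample
                  method=srs samprate=0.05 seed=12345;
run;

/* Fit with the Firth penalty. */
proc logistic data=work.sample;
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR. / firth;
run;

/* Fit without it; check the log for separation warnings. */
proc logistic data=work.sample;
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR.;
run;
```

If the second fit converges without a separation warning, that suggests (but does not prove) that the full-data model might not need FIRTH either.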

 

This usage note discusses the separation issue:
Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems
https://support.sas.com/kb/22/599.html

 

BR, Koen

Ksharp
Super User
Try PROC HPLOGISTIC.
Any procedure whose name starts with HP is meant for big data, such as PROC HPGENSELECT, PROC HPMIXED, and so on.
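
A rough sketch of what the HPLOGISTIC version might look like, reusing the macro variables from the code posted earlier in the thread. To my knowledge, HPLOGISTIC does not offer the FIRTH option or profile-likelihood confidence limits, so this is not a drop-in replacement. The contents of &XVAR_ref. may also need adjusting if they use CLASS options that HPLOGISTIC does not accept.

```sas
proc hplogistic data=&datin.;
   class &XVAR_ref.;                    /* may need editing, see the note above */
   model Dependent(event='1') = &XVAR.;
   performance details;                 /* request a timing breakdown in the output */
run;
```

Note that event='1' models the same outcome as ref='0' in the original code, provided Dependent takes only the values 0 and 1.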
Rick_SAS
SAS Super FREQ

1. Are you running on the data set that contains data for all 49 states?

2. Do you want to run each state's data independently of the others by using a BY statement?

 

If so, you might experiment with how long some of the smaller states take. For example, try something equivalent to this:

proc logistic data=&datin.;
   where State in ('DE');
   class &XVAR_ref. / PARAM=REF;
   model Dependent(ref='0') = &XVAR. / firth parmlabel CLODDS=PL EXPB rsquare;
run;

It might be that the small states complete quickly. Even the larger states (TX, CA, FL) might take only 20 minutes or less. (Try it out!) After you get the preliminary timings, you might decide that you can run a BY-group analysis
BY state;

in a fraction of the time that it takes to run the full regression for all states combined.
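
Put together, the BY-group run could look roughly like this. The sketch assumes the variable is named State, as in the WHERE example, and the output data set name work.by_state is made up. BY processing needs the data sorted (or indexed) by the BY variable.

```sas
/* Sort once so that BY-group processing is possible. */
proc sort data=&datin. out=work.by_state;
   by State;
run;

/* Fit a separate model for each state. */
proc logistic data=work.by_state;
   by State;
   class &XVAR_ref. / param=ref;
   model Dependent(ref='0') = &XVAR. / firth parmlabel clodds=pl expb rsquare;
run;
```

Keep in mind that this fits a separate set of coefficients for each state, which is a different model from a single regression with state indicators added as controls.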
