ciro
Quartz | Level 8

hi,

I have to estimate regression models on large datasets (15-20 million observations) with a very large number of fixed effects (1-2 million).

It is two-level data, with second-level units nested within first-level units.

the regression model is of the type:

y(ij) = d(j)*a + x(ij)*b + u(ij)

where d is the first-level indicator and x is the matrix of variables for the second-level units.

I am interested in estimating var(da), var(xb), var(u) and the covariance between the first two terms. 

I have searched the forum and the internet without success.

I have tried many procedures, including HPREG and HPMIXED, but ended up with a "too large number of fixed effects" error or memory shortage issues.

I was able to estimate the model only with PROC GLM with the ABSORB statement, but in that case the procedure does not produce predicted values or residuals.

Is there any other possibility? Any workaround?

Thank you very much in advance.

 

 

7 REPLIES
PaigeMiller
Diamond | Level 26

It's not clear to me from the explanation and the formula where the 1-2 million fixed effects come in. Are you really trying to fit a model where you have 1-2 million x variables? Or are you fitting 1-2 million different models?

--
Paige Miller
ciro
Quartz | Level 8

Sorry, maybe I was not clear.

I have to fit just one model on, say, 15 million observations (the second-level units) and, depending on the specification, about 100 variables.

One of these variables is an indicator (d in the formula I used) that says which first-level unit each observation belongs to.

Hope this is clearer.

PaigeMiller
Diamond | Level 26

@ciro wrote:

Sorry, maybe I was not clear.

I have to fit just one model on, say, 15 million observations (the second-level units) and, depending on the specification, about 100 variables.

One of these variables is an indicator (d in the formula I used) that says which first-level unit each observation belongs to.

Hope this is clearer.


This is quite different from fitting a model with 1-2 million variables, which is what you said originally.

 

Are all of these x variables categorical? And just to be 100% clear: these 100 x variables, are they really 100 subjects with nesting, or are there really 100 columns for each subject?

 

Earlier in my career I tried doing things like this (fitting a model with 100 categorical variables) but they essentially became models that could not be understood or interpreted, and were probably over-fitted as well. And of course, 100 variables are likely to be correlated with one another. So ...

 

My solution now is to adopt Partial Least Squares regression, which accounts for the correlations among the x variables; it is likely to be more interpretable, it fits better, and it doesn't require as much memory as models that need a matrix to be inverted. Its drawback is that the algorithm may not converge, but I would still give it a try.
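As a rough illustration, a PLS fit looks something like the sketch below; the data set and variable names are placeholders, and you would pick the cross-validation scheme and number of factors to suit your data.

proc pls data=mydata method=pls cv=split;
   /* correlated predictors are fine: PLS extracts a small number of latent factors */
   model y = x1-x100;
   /* keep predicted values and residuals for later inspection */
   output out=pls_out predicted=yhat yresidual=resid;
run;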

--
Paige Miller
ballardw
Super User

You might show at least the GLM code so we have a chance of seeing how many variables are actually involved.

Rick_SAS
SAS Super FREQ

As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.  A categorical variable that has K levels contributes K (or K-1) columns.   

 

If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, that matrix can become huge. For some examples, see this article about the memory required to store a matrix. The article mentions that if p=40,000 the crossproduct matrix consumes 12GB, and if p=100,000 it consumes 75GB. In your original post you suggested you wanted to use 1 million columns; such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the model would be impractical to use.
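If you want to check those numbers yourself, a back-of-the-envelope DATA step reproduces them (assuming nothing more than a dense matrix of 8-byte doubles):

data _null_;
   /* memory (in GB) for a dense p x p crossproduct matrix of 8-byte doubles */
   do p = 40000, 100000, 1000000;
      gb = 8 * p**2 / 1024**3;
      put p= comma12. gb= comma12.1;
   end;
run;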

 

ciro
Quartz | Level 8

Hi Rick, 

I see the point. I tried with a 10% sample and without the X variables (only the first-level fixed effect, variable d, with about 110,000 levels). In this case, after an increase in MEMSIZE, HPMIXED was able to produce the estimates (HPREG was not). It ran in less than 10 seconds.

When I add the X variables (about 150 columns after dummy coding), it took more than 20 hours. Any hint?

Moreover, is there some other algorithm able to estimate models with such a large number of fixed effects? Stata's areg command takes very little time to estimate the full model (variable d + X).

 

In any case, thanks to everyone on the forum for the help.

Rick_SAS
SAS Super FREQ

I honestly have no idea what you are trying to do. You have not supplied data or code. You talk about fixed effects, yet you say you are using PROC HPMIXED.

 

I am not familiar with STATA, but a quick internet search suggests that the 'areg command' that you mention might be similar to the ABSORB statement in PROC GLM, which can reduce memory and computing time for linear models when a classification variable has a large number of discrete levels. I suggest you read the documentation for the ABSORB statement and decide whether it applies to your analysis.  Good luck.
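To sketch the general pattern (the data set and variable names below are placeholders, so adapt them to your data):

proc sort data=have;
   by d;      /* the data must be sorted by the absorbed variable */
run;

proc glm data=have;
   absorb d;                       /* sweeps out the d effects without estimating them individually */
   model y = x1-x100 / solution;   /* estimates for the remaining x variables */
run;
quit;

Keep in mind the limitation you already ran into: when ABSORB is used, the procedure does not return predicted values or residuals.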

