ciro
Quartz | Level 8

hi,

I have to estimate regression models on large datasets (15-20 million obs) with a very large number of fixed effects (1-2 million).

It is two-level data, with second-level units nested within first-level units.

The regression model is of the type:

y(ij) = d(j)*a + x(ij)*b + u(ij)

where d is the first-level indicator and x is the matrix of variables for the second-level units.

I am interested in estimating var(da), var(xb), var(u) and the covariance between the first two terms. 
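
Just to be concrete, once the fitted components were available in a dataset, the quantities I need could be read off a covariance matrix, e.g. with something like this (the dataset and the variable names da_hat, xb_hat, u_hat are only placeholders):

proc corr data=fitted cov noprob;
   /* da_hat = d*a component, xb_hat = x*b component, u_hat = residual */
   var da_hat xb_hat u_hat;   /* the COV option prints var(da), var(xb), var(u) and cov(da,xb) */
run;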

I have searched the forum and the internet without success.

I have tried many procedures, including HPREG and HPMIXED, but ended up with a "too large number of fixed effects" error or a memory shortage issue.

I was able to estimate the model only with PROC GLM with the ABSORB statement, but in that case the procedure does not produce predicted values or residuals.

Is there any other possibility? Any workaround?

thank you very much in advance

 

 

7 REPLIES
PaigeMiller
Diamond | Level 26

It's not clear to me from the explanation and formula where the 1-2 million fixed effects come in. Are you really trying to fit a model with 1-2 million x variables? Or are you fitting 1-2 million different models?

--
Paige Miller
ciro
Quartz | Level 8

Sorry, maybe I was not clear.

I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.

One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.

Hope this is clearer.

PaigeMiller
Diamond | Level 26

@ciro wrote:

Sorry, maybe I was not clear.

I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.

One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.

Hope this is clearer.


This is quite different from fitting a model with 1-2 million variables, which is what you said originally.

 

Are all of these x variables categorical? And just to be 100% clear: these 100 x variables, are they really 100 subjects with nesting, or are there really 100 columns for each subject?

 

Earlier in my career I tried doing things like this (fitting a model with 100 categorical variables) but they essentially became models that could not be understood or interpreted, and were probably over-fitted as well. And of course, 100 variables are likely to be correlated with one another. So ...

 

My solution now is to use Partial Least Squares regression, which accounts for the correlations between the x variables; it is likely to be more interpretable, it fits better, and it doesn't require as much memory as models that need a crossproduct matrix to be inverted. It has the drawback that the algorithm may not converge, but I would still give it a try.
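
A minimal sketch of what that could look like with PROC PLS (the dataset and variable names are placeholders, and you would need a CLASS statement if some of the x variables are categorical):

proc pls data=have cv=split;
   model y = x1-x100;
   output out=pls_out predicted=yhat yresidual=resid;   /* keeps predicted values and residuals */
run;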

--
Paige Miller
ballardw
Super User

You might show at least the GLM code so we have a chance of seeing how many variables are actually involved.

Rick_SAS
SAS Super FREQ

As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.  A categorical variable that has K levels contributes K (or K-1) columns.   

 

If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, the crossproduct matrix can become huge. For some examples, see this article about the memory required to store a matrix.  The article mentions that if p=40,000 then the crossproduct matrix consumes 12GB. If p=100,000, the crossproduct matrix consumes 75GB.  In your original post you suggested you wanted to use 1 million columns. Such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the resulting model would be impractical to use.
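
For reference, the arithmetic behind those numbers is just p*p*8 bytes for a dense matrix of 8-byte doubles:

data _null_;
   /* memory (in GB) needed to store a dense p x p crossproduct matrix of doubles */
   do p = 40000, 100000, 1000000;
      GB = 8*p*p / 1024**3;
      put p= comma12. GB= comma12.1;
   end;
run;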

 

ciro
Quartz | Level 8

Hi Rick, 

I see the point. I have tried with a 10% sample and without the X variables (only the first-level unit fixed effect, variable d, with about 110,000 levels). In this case, after an increase in MEMSIZE, HPMIXED was able to produce the estimates (HPREG was not). It ran in less than 10 seconds.

When I add the X variables (about 150 variables after dummy coding), it took more than 20 hours. Any hint?
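
For reference, the call I am using is essentially of this form (the dataset and variable names are placeholders):

proc hpmixed data=sample10;
   class d;                            /* about 110,000 first-level units */
   model y = d x1-x150 / solution;     /* d plus the dummy-coded X variables */
run;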

Moreover, is it possible that some other algorithm can estimate models with such a large number of fixed effects? Stata with the areg command takes very little time to estimate the full model (variable d + X).

 

In any case thanks for the help to all the forum.

Rick_SAS
SAS Super FREQ

I honestly have no idea what you are trying to do. You have supplied neither data nor code. You talk about fixed effects, but you say you are using PROC HPMIXED.

 

I am not familiar with Stata, but a quick internet search suggests that the 'areg' command you mention might be similar to the ABSORB statement in PROC GLM, which can reduce memory and computing time for linear models when a classification variable has a large number of discrete levels. I suggest you read the documentation for the ABSORB statement and decide whether it applies to your analysis. Good luck.
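
In outline, the pattern is something like the following; note that the data must be sorted by the absorbed variable first (the dataset and variable names are placeholders):

proc sort data=have;
   by d;                     /* GLM requires the data to be sorted by the ABSORB variable */
run;

proc glm data=have;
   absorb d;                 /* absorbs the many first-level fixed effects */
   model y = x1-x100;
run;
quit;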
