<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: very large number of fixed effects in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</link>
    <description>&lt;P&gt;As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.&amp;nbsp; A&amp;nbsp;categorical variable that has K levels contributes K (or K-1) columns.&amp;nbsp; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, the crossproduct matrix can become huge. For some examples, see this article about &lt;A href="https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html" target="_self"&gt;the&amp;nbsp;memory required to store&amp;nbsp;a matrix&lt;/A&gt;.&amp;nbsp; The article mentions that if p=40,000 then the crossproduct matrix consumes 12GB.&amp;nbsp;If p=100,000, the crossproduct&amp;nbsp;matrix consumes 75GB.&amp;nbsp; In your original post you suggested you wanted to use 1 million columns. Such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the resulting model would be impractical to use.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 28 Sep 2017 15:14:47 GMT</pubDate>
    <dc:creator>Rick_SAS</dc:creator>
    <dc:date>2017-09-28T15:14:47Z</dc:date>
    <item>
      <title>very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399329#M20803</link>
      <description>&lt;P&gt;hi,&lt;/P&gt;
&lt;P&gt;I have to estimate regression models on large datasets (15-20 million obs) with a very large number of fixed effects (1-2 million).&lt;/P&gt;
&lt;P&gt;It is two-level data, with second-level units nested within first-level units.&lt;/P&gt;
&lt;P&gt;The regression model is of the type:&lt;/P&gt;
&lt;P&gt;y(ij)=d(j)a+x(ij)b+u(ij)&lt;/P&gt;
&lt;P&gt;where d is the first-level indicator and x the matrix of variables for the second-level units.&lt;/P&gt;
&lt;P&gt;I am interested in estimating var(da), var(xb), var(u), and the covariance between the first two terms.&lt;/P&gt;
&lt;P&gt;I have searched the forum and the internet without success.&lt;/P&gt;
&lt;P&gt;I have tried many procedures, including hpreg and hpmixed, but ended up with a "too large number of fixed effects" error or a memory shortage.&lt;/P&gt;
&lt;P&gt;I was able to estimate the model only with PROC GLM with the ABSORB statement, but in that case the procedure does not produce predicted values or residuals.&lt;/P&gt;
&lt;P&gt;Is there any other possibility? Any workaround?&lt;/P&gt;
&lt;P&gt;Thank you very much in advance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 20:14:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399329#M20803</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-09-27T20:14:46Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399334#M20804</link>
      <description>&lt;P&gt;It's not clear to me from the explanation and formula where the 1-2 million fixed effects come in. Are you really trying to fit a model where you 1-2 million x variables? Or are you fitting 1-2 million different models?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 20:37:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399334#M20804</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2017-09-27T20:37:43Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399349#M20805</link>
      <description>&lt;P&gt;You might show at least the GLM code so we have a chance of seeing how many variables are actually involved.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 22:15:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399349#M20805</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2017-09-27T22:15:51Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399424#M20807</link>
      <description>&lt;P&gt;Sorry, maybe I was not clear.&lt;/P&gt;
&lt;P&gt;I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.&lt;/P&gt;
&lt;P&gt;One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.&lt;/P&gt;
&lt;P&gt;Hope this is clearer.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 07:45:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399424#M20807</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-09-28T07:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399463#M20808</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/114"&gt;@ciro&lt;/a&gt; wrote:&lt;BR /&gt;
&lt;P&gt;Sorry, maybe I was not clear.&lt;/P&gt;
&lt;P&gt;I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.&lt;/P&gt;
&lt;P&gt;One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.&lt;/P&gt;
&lt;P&gt;Hope this is clearer.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is quite different from fitting a model with 1-2 million variables, which is what you said originally.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;All of these x variables are categorical? And just to be 100% clear, these 100 x variables, are they really 100 subjects with nesting, or are there really 100 columns for each subject?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Earlier in my career I tried doing things like this (fitting a model with 100 categorical variables) but they essentially became models that could not be understood or interpreted, and were probably over-fitted as well. And of course, 100 variables are likely to be correlated with one another. So ...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My solution now is to adopt Partial Least Squares regression, which accounts for the correlations among the x variables; it is likely to be more interpretable, fits better, and doesn't require as much memory as models that need a matrix to be inverted. It has the drawback that the algorithm may not converge, but I would still give it a try.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 13:20:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399463#M20808</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2017-09-28T13:20:17Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</link>
      <description>&lt;P&gt;As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.&amp;nbsp; A&amp;nbsp;categorical variable that has K levels contributes K (or K-1) columns.&amp;nbsp; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, the crossproduct matrix can become huge. For some examples, see this article about &lt;A href="https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html" target="_self"&gt;the&amp;nbsp;memory required to store&amp;nbsp;a matrix&lt;/A&gt;.&amp;nbsp; The article mentions that if p=40,000 then the crossproduct matrix consumes 12GB.&amp;nbsp;If p=100,000, the crossproduct&amp;nbsp;matrix consumes 75GB.&amp;nbsp; In your original post you suggested you wanted to use 1 million columns. Such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the resulting model would be impractical to use.&lt;/P&gt;
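The arithmetic behind those figures is easy to check: a dense p x p crossproduct matrix of 8-byte doubles occupies p * p * 8 bytes. A minimal sketch (illustrative Python, not SAS; the function name is made up):

```python
# Sketch of the memory arithmetic above: a dense p x p crossproduct
# matrix of 8-byte doubles occupies p * p * 8 bytes.
# (Illustrative Python, not SAS; the function name is made up.)

def crossprod_gib(p):
    """GiB needed to hold a dense p x p matrix of 8-byte floats."""
    return p * p * 8 / 2**30

for p in (40_000, 100_000, 1_000_000):
    print(f"p={p:>9,}: {crossprod_gib(p):10.1f} GB")
# p=40,000 gives about 12 GB, p=100,000 about 75 GB, and
# p=1,000,000 about 7,450 GB, matching the figures quoted above.
```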
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 15:14:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2017-09-28T15:14:47Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400868#M20903</link>
      <description>&lt;P&gt;Hi Rick,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I see the point. I have tried with a 10% sample and without the X variables (only the first-level fixed effect, variable d, with about 110,000 levels). In this case, after an increase in MEMSIZE, hpmixed was able to produce the estimates (hpreg was not). It ran in less than 10 seconds.&lt;/P&gt;
&lt;P&gt;When I add the X variables (about 150 variables after dummy coding) it took more than 20 hours. Any hint?&lt;/P&gt;
&lt;P&gt;Moreover, is it possible that some other algorithm can estimate models with such a large number of fixed effects? Stata with the areg command takes very little time to estimate the full model (variable d + X).&lt;/P&gt;
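For what it's worth, the speed of areg (and of PROC GLM's ABSORB statement) comes from the within transformation: demean y and each x within the levels of d, then fit a small regression on the demeaned data, so the huge dummy-variable design matrix for d is never built. A minimal sketch in plain Python (illustrative only, with toy data and made-up names; not Stata or SAS code):

```python
# Hedged sketch of the "absorption" / within transformation: demean y
# and x within each level of the absorbed variable d, then estimate the
# slope from the demeaned data. Toy data; all names are made up.
from collections import defaultdict

def within_demean(values, groups):
    """Subtract each group's mean from its members."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]

# Toy two-group data: y = 2*x + a group-specific fixed effect.
d = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2 * xi + (10.0 if g == "a" else -5.0) for xi, g in zip(x, d)]

xd, yd = within_demean(x, d), within_demean(y, d)
# The slope from the demeaned data recovers b without ever forming
# dummy columns for d.
b = sum(xi * yi for xi, yi in zip(xd, yd)) / sum(xi * xi for xi in xd)
print(b)  # -> 2.0
```

The group means absorb the fixed effects exactly, which is why this scales to millions of levels of d.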
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In any case, thanks to the whole forum for the help.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Oct 2017 07:31:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400868#M20903</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-10-04T07:31:54Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400940#M20907</link>
      <description>&lt;P&gt;I honestly have no idea what you are trying to do. You have not supplied data nor code. You talk about fixed effects but you claim you are using PROC HPMIXED.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not familiar with Stata, but a quick internet search suggests that the 'areg command' that you mention&amp;nbsp;might be similar to &lt;A href="http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_glm_syntax02.htm" target="_self"&gt;the ABSORB statement in PROC GLM&lt;/A&gt;, which can reduce memory and computing time for linear models when a classification&amp;nbsp;variable has a large number of discrete levels. I suggest you read the documentation for the ABSORB statement and decide whether it applies to your analysis.&amp;nbsp; Good luck.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Oct 2017 13:01:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400940#M20907</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2017-10-04T13:01:24Z</dc:date>
    </item>
  </channel>
</rss>

