<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: very large number of fixed effects in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</link>
    <description>&lt;P&gt;As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.&amp;nbsp; A&amp;nbsp;categorical variable that has K levels contributes K (or K-1) columns.&amp;nbsp; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, the crossproduct matrix can become huge. For some examples, see this article about &lt;A href="https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html" target="_self"&gt;the&amp;nbsp;memory required to store&amp;nbsp;a matrix&lt;/A&gt;.&amp;nbsp; The article mentions that if p=40,000 then the crossproduct matrix consumes 12GB.&amp;nbsp;If p=100,000, the crossproduct&amp;nbsp;matrix consumes 75GB.&amp;nbsp; In your original post you suggested you wanted to use 1 million columns. Such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the resulting model would be impractical to use.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 28 Sep 2017 15:14:47 GMT</pubDate>
    <dc:creator>Rick_SAS</dc:creator>
    <dc:date>2017-09-28T15:14:47Z</dc:date>
    <item>
      <title>very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399329#M20803</link>
      <description>&lt;P&gt;hi,&lt;/P&gt;
&lt;P&gt;I have to estimate regression models on large datasets (15-20 million obs) with a very large number of fixed effects (1-2 million).&lt;/P&gt;
&lt;P&gt;It is two-level data, with second-level units nested within first-level units.&lt;/P&gt;
&lt;P&gt;The regression model is of the type:&lt;/P&gt;
&lt;P&gt;y(ij)=d(j)a+x(ij)b+u(ij)&lt;/P&gt;
&lt;P&gt;where d is the first-level indicator and x the matrix of variables for the second-level units.&lt;/P&gt;
&lt;P&gt;I am interested in estimating var(da), var(xb), var(u), and the covariance between the first two terms.&lt;/P&gt;
&lt;P&gt;I have searched the forum and the internet without success.&lt;/P&gt;
&lt;P&gt;I have tried many procedures, including hpreg and hpmixed, but ended up with a "too large number of fixed effects" error or a memory shortage.&lt;/P&gt;
&lt;P&gt;I was able to estimate the model only with PROC GLM with the ABSORB statement, but in that case the procedure does not produce predicted values or residuals.&lt;/P&gt;
&lt;P&gt;Is there any other possibility? Any workaround?&lt;/P&gt;
&lt;P&gt;Thank you very much in advance.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 20:14:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399329#M20803</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-09-27T20:14:46Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399334#M20804</link>
      <description>&lt;P&gt;It's not clear to me from the explanation and formula where the 1-2 million fixed effects come in. Are you really trying to fit a model where you 1-2 million x variables? Or are you fitting 1-2 million different models?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 20:37:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399334#M20804</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2017-09-27T20:37:43Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399349#M20805</link>
      <description>&lt;P&gt;You might show at least the GLM code so we have a chance of seeing how many variables are actually involved.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 22:15:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399349#M20805</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2017-09-27T22:15:51Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399424#M20807</link>
      <description>&lt;P&gt;Sorry, maybe I was not clear.&lt;/P&gt;
&lt;P&gt;I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.&lt;/P&gt;
&lt;P&gt;One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.&lt;/P&gt;
&lt;P&gt;Hope this is clearer.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 07:45:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399424#M20807</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-09-28T07:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399463#M20808</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/114"&gt;@ciro&lt;/a&gt; wrote:&lt;BR /&gt;
&lt;P&gt;Sorry, maybe I was not clear.&lt;/P&gt;
&lt;P&gt;I have to fit just one model on, say, 15 million observations (second-level units) and, depending on the specification, about 100 variables.&lt;/P&gt;
&lt;P&gt;One of these variables is an indicator (d in the formula I used) that says to which first-level unit each observation belongs.&lt;/P&gt;
&lt;P&gt;Hope this is clearer.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;This is quite different from fitting a model with 1-2 million variables, which is what you said originally.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;All of these x variables are categorical? And just to be 100% clear, these 100 x variables, are they really 100 subjects with nesting, or are there really 100 columns for each subject?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Earlier in my career I tried doing things like this (fitting a model with 100 categorical variables) but they essentially became models that could not be understood or interpreted, and were probably over-fitted as well. And of course, 100 variables are likely to be correlated with one another. So ...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My solution now is to adopt Partial Least Squares regression, which accounts for the correlations among the x variables; it is likely to be more interpretable, fits better, and doesn't require as much memory as models that need a matrix to be inverted. It has the drawback that the algorithm may not converge, but I would still give it a try.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 13:20:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399463#M20808</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2017-09-28T13:20:17Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</link>
      <description>&lt;P&gt;As to the out-of-memory condition, the important quantity for memory is the number of columns in the design matrix. Each continuous variable contributes 1 column.&amp;nbsp; A&amp;nbsp;categorical variable that has K levels contributes K (or K-1) columns.&amp;nbsp; &amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the total number of columns is p, the linear regression must form a (p x p) crossproduct matrix. When p is very large, the crossproduct matrix can become huge. For some examples, see this article about &lt;A href="https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html" target="_self"&gt;the&amp;nbsp;memory required to store&amp;nbsp;a matrix&lt;/A&gt;.&amp;nbsp; The article mentions that if p=40,000 then the crossproduct matrix consumes 12GB.&amp;nbsp;If p=100,000, the crossproduct&amp;nbsp;matrix consumes 75GB.&amp;nbsp; In your original post you suggested you wanted to use 1 million columns. Such an analysis would require a crossproduct matrix that consumes 7450GB. Even if you could construct such a matrix and solve the resulting system, the resulting model would be impractical to use.&lt;/P&gt;
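The arithmetic behind those figures is easy to check: a dense p x p crossproduct matrix of 8-byte doubles occupies p * p * 8 bytes. A minimal sketch (illustrative Python, not SAS; the function name is made up):

```python
# Sketch of the memory arithmetic above: a dense p x p crossproduct
# matrix of 8-byte doubles occupies p * p * 8 bytes.
# (Illustrative Python, not SAS; the function name is made up.)

def crossprod_gib(p):
    """GiB needed to hold a dense p x p matrix of 8-byte floats."""
    return p * p * 8 / 2**30

for p in (40_000, 100_000, 1_000_000):
    print(f"p={p:>9,}: {crossprod_gib(p):10.1f} GB")
# p=40,000 gives about 12 GB, p=100,000 about 75 GB, and
# p=1,000,000 about 7,450 GB, matching the figures quoted above.
```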
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 15:14:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/399523#M20816</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2017-09-28T15:14:47Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400868#M20903</link>
      <description>&lt;P&gt;Hi Rick,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I see the point. I have tried with a 10% sample and without the X variables (only the first-level fixed effect, variable d, with about 110,000 levels). In this case, after an increase in MEMSIZE, hpmixed was able to produce the estimates (hpreg was not). It ran in less than 10 seconds.&lt;/P&gt;
&lt;P&gt;When I add the X variables (about 150 variables after dummy coding) it took more than 20 hours. Any hint?&lt;/P&gt;
&lt;P&gt;Moreover, is it possible that some other algorithm can estimate models with such a large number of fixed effects? Stata with the areg command takes very little time to estimate the full model (variable d + X).&lt;/P&gt;
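For what it's worth, the speed of areg (and of PROC GLM's ABSORB statement) comes from the within transformation: demean y and each x within the levels of d, then fit a small regression on the demeaned data, so the huge dummy-variable design matrix for d is never built. A minimal sketch in plain Python (illustrative only, with toy data and made-up names; not Stata or SAS code):

```python
# Hedged sketch of the "absorption" / within transformation: demean y
# and x within each level of the absorbed variable d, then estimate the
# slope from the demeaned data. Toy data; all names are made up.
from collections import defaultdict

def within_demean(values, groups):
    """Subtract each group's mean from its members."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]

# Toy two-group data: y = 2*x + a group-specific fixed effect.
d = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2 * xi + (10.0 if g == "a" else -5.0) for xi, g in zip(x, d)]

xd, yd = within_demean(x, d), within_demean(y, d)
# The slope from the demeaned data recovers b without ever forming
# dummy columns for d.
b = sum(xi * yi for xi, yi in zip(xd, yd)) / sum(xi * xi for xi in xd)
print(b)  # -> 2.0
```

The group means absorb the fixed effects exactly, which is why this scales to millions of levels of d.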
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In any case, thanks to the whole forum for the help.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Oct 2017 07:31:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400868#M20903</guid>
      <dc:creator>ciro</dc:creator>
      <dc:date>2017-10-04T07:31:54Z</dc:date>
    </item>
    <item>
      <title>Re: very large number of fixed effects</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400940#M20907</link>
      <description>&lt;P&gt;I honestly have no idea what you are trying to do. You have not supplied data nor code. You talk about fixed effects but you claim you are using PROC HPMIXED.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not familiar with Stata, but a quick internet search suggests that the 'areg command' that you mention&amp;nbsp;might be similar to &lt;A href="http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_glm_syntax02.htm" target="_self"&gt;the ABSORB statement in PROC GLM&lt;/A&gt;, which can reduce memory and computing time for linear models when a classification&amp;nbsp;variable has a large number of discrete levels. I suggest you read the documentation for the ABSORB statement and decide whether it applies to your analysis.&amp;nbsp; Good luck.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Oct 2017 13:01:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/very-large-number-of-fixed-effects/m-p/400940#M20907</guid>
      <dc:creator>Rick_SAS</dc:creator>
      <dc:date>2017-10-04T13:01:24Z</dc:date>
    </item>
  </channel>
</rss>

