04-20-2014 09:23 PM
I would like to run a regression that includes about 2500 dummy variables (or fixed effects). The data set includes about 450,000 observations, and it is very sparse: most observations have only one or two effects "turned on" -- in other words, only about 0.05% of the entries in the design matrix are ones.
(Interestingly, when I created this matrix in SAS 9.4 on a Windows machine, it produced a file of about 4.5 GB. When I transferred it to Unix, it turned into a 30 MB file. I was surprised that whatever magic sauce SAS uses to store the sparse matrix on Unix, it isn't using on Windows.)
I'm wondering what the best way is to estimate a model like this. Here are some possibilities that I'm aware of, and I'm looking for guidance on which is likely to be the most efficient approach:
1) Use proc hpmixed. Given the sparse nature of the data, this seemed like a good way to go. But I've been running this model for 11 hours and it hasn't finished yet, so I'm wondering if perhaps I've implemented it wrong. The relevant part of my code is (a fuller sketch of the call appears after this list):
model r = size fid dummy1-dummy2500;
2) Use IML. I thought perhaps I could read the sparse matrix into IML and use SOLVELIN to estimate the coefficients (a sketch of this route also appears below).
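For reference, here is roughly the full HPMIXED call I'm running. This is a minimal sketch: the data set name ("have") is a placeholder, and it assumes dummy1-dummy2500 are already numeric 0/1 columns in the data.

proc hpmixed data=have;
   /* No RANDOM statement: with fixed effects only, this is an OLS fit. */
   /* SOLUTION requests the fixed-effect estimates in the output.       */
   model r = size fid dummy1-dummy2500 / solution;
run;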
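And here is the kind of IML code I had in mind for option 2. To stick to calls I'm sure about, this sketch forms the normal equations densely and uses SOLVE rather than SOLVELIN; the crossproducts matrix is only about 2502 x 2502, so it fits comfortably in memory even stored dense. Again, "have" is a placeholder data set name.

proc iml;
   use have;                             /* placeholder data set name     */
      read all var {r} into y;           /* response                      */
      read all var {size fid} into W;    /* continuous regressors         */
      dnames = "dummy1":"dummy2500";     /* expand the dummy name range   */
      read all var dnames into D;        /* 0/1 dummy columns             */
   close have;

   X   = j(nrow(y), 1, 1) || W || D;     /* prepend an intercept column   */
   xpx = X` * X;                         /* normal equations: X'X b = X'y */
   xpy = X` * y;
   b   = solve(xpx, xpy);                /* assumes X'X is nonsingular;   */
                                         /* use ginv(xpx)*xpy instead if  */
                                         /* the dummies are collinear     */
quit;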
Is one of these likely to be the best approach? Are there other procedures that would work well?
04-21-2014 12:07 AM
04-21-2014 09:45 AM
Thanks, but that seems to be only about sparse matrices in text mining or the Enterprise Miner tools. I haven't seen anything about using them in a standard regression.
04-22-2014 09:44 AM
You should try the HPREG procedure. It is designed specifically for high-dimensional fixed-effects modeling, and it is only found in the newer releases of SAS.
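Something along these lines (an untested sketch on my part; the data set name "have" is a placeholder, and the variable names are taken from your post):

proc hpreg data=have;
   model r = size fid dummy1-dummy2500;
   performance nthreads=4;   /* HP procs can multithread; adjust to your hardware */
run;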
04-23-2014 03:48 PM
I would go with your first inclination toward HPMIXED, which employs sparse matrix algorithms. I have not tried HPREG, but the documentation for yet another high-performance proc (HPLMIXED) indicates that HPMIXED "is particularly suited for problems in which the [X Z]'[X Z] crossproducts matrix is sparse." That sounds exactly like what is going on here. And while HPREG offers a lot of capability, it looks like it depends more on multithreading/parallel processing than on sparse matrix techniques.
My question is this: dummy1-dummy2500 seems unwieldy. Are these dummies derived from more easily defined class variables, such that you could use the CLASS statement to "auto-populate" the levels? If not, and the data set is already prepped, I would go with your first inclination.
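For example, if those 2500 dummies really just encode the levels of one underlying identifier (call it group -- a hypothetical variable name), the whole block collapses to:

proc hpmixed data=have;
   class group;                          /* 2500 levels, expanded automatically */
   model r = size fid group / solution;
run;

That also saves you from building (and storing) the 2500 dummy columns yourself.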