stoffprof
Calcite | Level 5

I would like to run a regression that includes about 2500 dummy variables (or fixed effects). The data set includes about 450,000 observations, and it is very sparse: most observations only have one or two effects "turned on" -- in other words, only about 0.05% of the design matrix are ones.

(Interestingly, when I created this matrix in SAS 9.4 on a Windows machine it created a file that was about 4.5GB. When I transferred it to Unix it turned into a 30MB file. I was surprised that whatever magic sauce SAS is using to store the sparse matrix on Unix it isn't using on Windows.)

I'm wondering what the best way to estimate a model like this is. Here are the possibilities I'm aware of; I'm looking for guidance on which is likely to be the most efficient approach:

1) Use proc hpmixed. Given the sparse nature of the data this seemed like a good way to go. But I've been running this model for 11 hours and it hasn't finished yet. I'm wondering if perhaps I've implemented it wrong. My code is:

proc hpmixed;
  class fid;
  model r = size fid dummy1-dummy2500;
run;

2) Use IML. I thought perhaps I could read in the sparse matrix to IML and use solvelin to estimate the coefficients.

Is one of these likely to be the best approach? Are there other procedures that would work well?

Thanks!

Reeza
Super User

There was an article posted here in the past week about sparse matrices and how to use them.

Unfortunately I can't find the link but if you look through the past two weeks I'm sure you'll find it.

EDIT: Found the link:

stoffprof
Calcite | Level 5

Thanks, but it seems that this is only about sparse matrices in text mining or the Enterprise Miner tool. I haven't seen anything about their use in a standard regression.

lvm
Rhodochrosite | Level 12

You should try the HPREG procedure. It is designed specifically for high-dimensional fixed-effects modeling. It is available only in the newer releases of SAS.
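
For illustration, a minimal sketch of what that call might look like, reusing the variable names from the original post. The data set name work.reg is an assumption (the original code does not name one), and this has not been timed on data of this size:

```sas
/* Sketch only: work.reg is a hypothetical data set name.        */
/* Variable names (r, size, fid, dummy1-dummy2500) come from     */
/* the original post.                                            */
proc hpreg data=work.reg;
  class fid;                            /* fid treated as a classification effect   */
  model r = size fid dummy1-dummy2500;  /* no SELECTION statement: fit full model   */
run;
```

With no SELECTION statement, HPREG fits the model as specified rather than searching over subsets of effects.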

SteveDenham
Jade | Level 19

I would go with the first inclination towards HPMIXED, which employs sparse matrix algorithms.  I have not tried HPREG, but the documentation for yet another high performance proc (HPLMIXED) indicates that HPMIXED "is particularly suited for problems in which the [XZ]'[XZ] crossproducts matrix is sparse."  And that sounds exactly like what is going on here.  And while HPREG offers a lot of capability, it looks like it depends more on multithreading/parallel processing than on sparse matrix techniques.

My question concerns dummy1 to dummy2500, which seems unwieldy.  Are these dummies derived from more easily defined class variables, so that you could use the CLASS statement to "auto-populate" the levels?  If not, and the data set is already prepped, I would go with your first inclination.
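
To make the CLASS idea concrete, here is a rough sketch. It assumes the dummies are numeric 0/1 and that each observation has at most one of them turned on, so that a single categorical variable can replace them. The names work.reg, work.reg_class, and grp are hypothetical. If the dummies actually come from two or more underlying variables, this collapse loses information, and each source variable would need its own CLASS variable instead.

```sas
/* Sketch only: collapse dummy1-dummy2500 into one categorical   */
/* variable grp. Assumes at most one dummy equals 1 per row.     */
data work.reg_class;
  set work.reg;                      /* hypothetical input data set   */
  array d{*} dummy1-dummy2500;
  grp = 0;                           /* 0 = no dummy turned on        */
  do i = 1 to dim(d);
    if d{i} = 1 then do;
      grp = i;                       /* index of the first dummy set  */
      leave;                         /* stop at the first match       */
    end;
  end;
  drop i dummy1-dummy2500;
run;

/* Let the CLASS statement generate the levels of grp.           */
proc hpmixed data=work.reg_class;
  class fid grp;
  model r = size fid grp / solution; /* SOLUTION prints the estimates */
run;
```

The model statement then lists one effect rather than 2500 separate columns, and the procedure builds the design columns for grp itself.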

Steve Denham
