About stoffprof

stoffprof · ‎04-21-2014

This question is related to one I posted in the SAS/STAT forum. I have a large, sparse design matrix of ones and zeros (about 450k rows, 2500 columns). I'm trying to use it to run a regression, but the SAS/STAT procedures are taking too long. (I ran proc hpmixed and it had not produced any results after 24 hours.) If I read the data into Matlab as a sparse matrix, I can run a QR decomposition and invert the resulting matrix in less than 10 seconds. Is it possible to take this approach in IML? I feel like I must be missing something: this doesn't seem like a hard problem, so I expected SAS to handle it without trouble, but I seem to be going wrong somewhere. Thanks for any advice.
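For illustration only, here is a minimal sketch (in Python rather than IML) of why a problem this size can be cheap: with about 2500 columns, X'X is only 2500 x 2500, and it can be accumulated in one pass that touches just the nonzero entries of each row. This uses the normal equations, not the QR approach mentioned above, and normal equations are less numerically stable than QR. The names `normal_equations` and `solve` and the toy data are hypothetical, and the sketch assumes X has full column rank.

```python
def normal_equations(rows, y, p):
    """Accumulate X'X and X'y from sparse 0/1 rows.

    rows[i] lists the column indices equal to 1 in observation i,
    so each row costs only (number of ones)^2 updates.
    """
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for cols, yi in zip(rows, y):
        for a in cols:
            xty[a] += yi
            for b in cols:
                xtx[a][b] += 1.0
    return xtx, xty


def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-12:
            raise ValueError("X'X is singular; a dummy column is redundant")
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][k] * beta[k] for k in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta


# Toy data (assumed): 6 observations, 3 dummy columns, one or two ones per row.
rows = [[0], [0, 1], [1], [1, 2], [2], [0, 2]]
true_beta = [1.0, 2.0, 3.0]
y = [sum(true_beta[c] for c in cols) for cols in rows]

xtx, xty = normal_equations(rows, y, p=3)
beta = solve(xtx, xty)
print([round(v, 6) for v in beta])
```

Because the toy response is generated without noise, the recovered coefficients should match `true_beta` up to rounding. For real data a library QR or Cholesky routine would be the safer choice.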

stoffprof · ‎04-21-2014

Thanks, but it seems that this is only about sparse matrices in text mining or the Enterprise Miner tool. I haven't seen anything about using them in a standard regression.

stoffprof · ‎04-20-2014

I would like to run a regression that includes about 2500 dummy variables (or fixed effects). The data set includes about 450,000 observations, and it is very sparse: most observations only have one or two effects "turned on" -- in other words, only about 0.05% of the entries in the design matrix are ones. (Interestingly, when I created this matrix in SAS 9.4 on a Windows machine it created a file that was about 4.5GB. When I transferred it to Unix it turned into a 30MB file. I was surprised that whatever magic sauce SAS uses to store the sparse matrix on Unix isn't used on Windows.)

I'm wondering what the best way is to estimate a model like this. Here are some possibilities that I'm aware of, and I'm looking for guidance on which is likely to be the most efficient approach:

1) Use proc hpmixed. Given the sparse nature of the data this seemed like a good way to go. But I've been running this model for 11 hours and it hasn't finished yet. I'm wondering if perhaps I've implemented it wrong. My code is:

proc hpmixed;
   class fid;
   model r = size fid dummy1-dummy2500;
run;

2) Use IML. I thought perhaps I could read the sparse matrix into IML and use solvelin to estimate the coefficients.

Is one of these likely to be the best approach? Are there other procedures that would work well? Thanks!

stoffprof · ‎04-20-2014

Why can't you use proc model? For example, see here.

stoffprof · ‎04-16-2014

I am having trouble figuring out how to estimate a particular type of fixed effects regression. My data looks something like this:

Company  Date        Output  Manager 1  Manager 2  Manager 3
1        12/31/2002  500     1055       2291       .
1        3/31/2002   520     1055       2291       .
2        12/31/2002  180     2291       5538       7721
2        3/31/2002   178     2291       5538       7721
2        6/30/2002   188     5538       7721       .
3        9/30/2002   759     7721       .          .

There are a few thousand companies, each with about 100 observations on different dates. Each company has up to 3 managers, which are identified by a numeric code. I would like to estimate fixed effects for each manager in a regression of output on a number of controls (not shown). I'm having trouble thinking about how to do this given the structure of the data.

It's important to notice that the particular column where a manager ID is listed has no intrinsic meaning -- it's just a list of which managers are at the company, and there can be up to 3 at a time. (They're sorted, so if a lower-ID manager leaves, all the IDs shift to the left, as happens for the third observation of company 2 above.) There are about 2,000 different manager IDs, so there are many fixed effects. (That is, the numbers 1055, 2291, 5538, and 7721 are just four of the 2,000 possible values.)

I've been trying to think of how to estimate such a model. One approach would be to manually create a dummy variable matrix in SAS/IML and run the regression, but I'm running into memory problems. To be clear, for the data above, the dummy variables would look like this:

mgr1055  mgr2291  mgr5538  mgr7721
1        1        0        0
1        1        0        0
0        1        1        1
0        1        1        1
0        0        1        1
0        0        0        1

The other way I thought might work would be to reshape the data so that it's "long", so that, for example, the first observation would generate two observations:

Company  Date        Output  Manager
1        12/31/2002  500     1055
1        12/31/2002  500     2291

This would easily generate the fixed effects, but the standard errors would be wrong because the regression doesn't "know" that this is just one observation. Perhaps clustering could fix this, but it's not clear to me whether the regression would be correctly specified.

So... is there a way to do this with PROC GLM or some other procedure? Any guidance on this would be much appreciated!
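As a rough sketch of the dummy-variable construction described above (in Python rather than SAS/IML, for illustration only), the manager columns can be mapped to one dummy column per distinct manager ID, keeping one row per original observation. Storing only the column indices of the ones for each row avoids materializing the full matrix, which is one possible way around the memory problem. The helper names `build_dummies` and `to_dense` are hypothetical.

```python
# Toy rows from the post: (company, date, output, manager IDs present).
data = [
    (1, "12/31/2002", 500, [1055, 2291]),
    (1, "3/31/2002", 520, [1055, 2291]),
    (2, "12/31/2002", 180, [2291, 5538, 7721]),
    (2, "3/31/2002", 178, [2291, 5538, 7721]),
    (2, "6/30/2002", 188, [5538, 7721]),
    (3, "9/30/2002", 759, [7721]),
]


def build_dummies(records):
    """Return sorted manager IDs and, per row, the column indices that are 1."""
    ids = sorted({m for rec in records for m in rec[3]})
    col = {m: j for j, m in enumerate(ids)}  # manager ID -> column index
    sparse = [[col[m] for m in rec[3]] for rec in records]
    return ids, sparse


def to_dense(sparse, p):
    """Expand the sparse rows to a dense 0/1 matrix (small examples only)."""
    dense = []
    for cols in sparse:
        row = [0] * p
        for j in cols:
            row[j] = 1
        dense.append(row)
    return dense


ids, sparse = build_dummies(data)
dense = to_dense(sparse, len(ids))
print(["mgr%d" % m for m in ids])
for row in dense:
    print(row)
```

Which position a manager ID occupies in the original Manager 1 to Manager 3 columns plays no role here; only the set of IDs present in each row matters.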

stoffprof · ‎08-14-2013

Perfect, thanks Rick. I made sure to update to 9.4 (12.3) after your recent blog post.

stoffprof · ‎08-14-2013

I need to randomly choose k integers between 1 and T without replacement. Here's what I've come up with:

T = 60;
k = 15;
/* random vector for choosing integers */
r = j(T,1,0);
call randgen(r,'uniform');
/* find indices from sorting r */
call sortndx(integers, r, {1});
/* take first k integers */
draw = integers[1:k];

I need to do it many times, and I'd rather avoid looping through this, so I'm wondering if anyone has a better solution.
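For comparison, here is a minimal sketch of the same idea outside IML, written in Python purely as an illustration. `draw_by_sorting` mirrors the approach above (sort uniform random keys and keep the first k indices), while `draw_without_replacement` relies on the standard library's `random.sample`. Both function names, the seed, and the number of repetitions are assumptions made for the example.

```python
import random


def draw_by_sorting(T, k, rng):
    """Sort uniform keys and keep the k integers with the smallest keys."""
    keys = [rng.random() for _ in range(T)]
    order = sorted(range(1, T + 1), key=lambda i: keys[i - 1])
    return order[:k]


def draw_without_replacement(T, k, rng):
    """Draw k distinct integers from 1..T using the standard library."""
    return rng.sample(range(1, T + 1), k)


rng = random.Random(2013)  # fixed seed so the example is reproducible
one_draw = draw_by_sorting(60, 15, rng)
many_draws = [draw_without_replacement(60, 15, rng) for _ in range(1000)]
print(one_draw)
print(len(many_draws), "samples of", len(many_draws[0]), "integers each")
```

Both functions return k distinct integers between 1 and T. The sorting version does more work per draw because it generates and orders all T keys.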

stoffprof · ‎08-02-2013

Isn't that what we'd expect from the floor function? Anything less than 2 (even if just epsilon) is 1?

stoffprof · ‎08-02-2013

I've noticed that using a non-integer index on a vector returns a value. For example, if I set x = 1:5; y = x[{1.2 5.7}]; then y returns a 2-vector {1, 5}. It seems that the indexes are having the floor function applied before being used to find elements of x. Is this correct?

stoffprof · ‎07-24-2013

Thanks Rob. The only constraints are that the variables are integers bounded by 0 and some integer usually close to 60. (There are about 50-100 such variables.) I don't think there's a straightforward way to share my objective function completely; it's a complicated function of many variables in a dataset. In short, the function is f(x) = sum over all t of |a(t) - b(t)| where a(t) is observable and b(t) = g(x; many parameters), and g() is, as I said, pretty complicated. I've been minimizing this with a genetic algorithm in IML but am exploring other options with more flexibility. I recently heard about the GA procedure in SAS/OR, and wanted to try that as well. Are you aware of any sample code that shows how to use FCMP with PROC GA? In particular, I'm not clear on whether the functions have to be defined in exactly the same way as they are within GA (the first variable is an array). For example, suppose you wanted to write the function sumsq from this page within FCMP. How would that work, since I'm assuming you can't call ReadMember from FCMP?

stoffprof · ‎07-21-2013

Thanks Rob, especially for the super-fast answer (on a Sunday!). I'm trying to minimize a function I created in FCMP whose arguments take only integer values. As far as I can tell, the problem does not fit into a MILP or LP setup, and NLP doesn't seem to allow integers. Is there a way to minimize an arbitrary function of integer variables? I thought about modifying my program so it converts real numbers to integers, but I thought that might pose problems for a solver that looks at derivatives, because they'd all be quite flat.

stoffprof · ‎07-21-2013

Thanks. Can you give some guidance on the syntax? My attempt isn't working (optmodel can't find the function):

proc fcmp outlib=sasuser.funcs.tester;
   function rosenbrock(x[2]);
      f = 2 * (x[2] - x[1]**2)**2 + (1-x[1])**2;
      return(f);
   endsub;
run;

options cmplib=sasuser.funcs;

proc optmodel;
   var x {1..2};
   min rosenbrock;
   solve;
   print x;
quit;

stoffprof · ‎07-19-2013

Can OPTMODEL be used to minimize a function created with PROC FCMP? The documentation for FCMP says functions can be called by PROC NLP and a number of other procedures, but doesn't mention OPTMODEL. Is this correct? If NLP is the legacy procedure I assumed FCMP would be available to the newer procedure.

stoffprof · ‎06-27-2013

Another example: what does the random seed in GASETUP do if a user-written initialization module is specified? Does it do anything?

stoffprof · ‎06-27-2013

I've found the genetic algorithm documentation to be somewhat vague with regard to the details of its implementation. For example, in describing the GASETSEL call, nothing is said about how members are chosen from the population to participate in the tournament. Is it a roulette wheel implementation?
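For reference, the textbook form of tournament selection can be sketched as follows (in Python, for illustration only). This is not a description of how GASETSEL is implemented, which is exactly what the question above is asking. The sketch assumes contestants are drawn uniformly at random without replacement and that lower objective values are better; the name `tournament_select` and the toy fitness values are hypothetical.

```python
import random


def tournament_select(fitness, tournament_size, rng):
    """Return the index of the winner of one tournament (minimization).

    Contestants are drawn uniformly at random, so unlike roulette-wheel
    selection the draw itself does not depend on fitness; fitness only
    decides which contestant wins.
    """
    contestants = rng.sample(range(len(fitness)), tournament_size)
    return min(contestants, key=lambda i: fitness[i])


rng = random.Random(42)
fitness = [9.0, 4.5, 7.2, 1.3, 8.8, 3.1]  # assumed objective values
winner = tournament_select(fitness, 2, rng)
# When every member competes, the best member always wins.
best = tournament_select(fitness, len(fitness), rng)
print(winner, best)
```

Larger tournaments put more selection pressure on the population, since the best member is more likely to be among the contestants.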


Solving least squares problem with sparse matrix

Re: Regression with large number of fixed effects in a sparse matrix

Regression with large number of fixed effects in a sparse matrix

Re: GARCH(1,1) and IML

Regression with multiple fixed effects per observation

Re: Random draws without replacement

Random draws without replacement

Re: Non-integer index of a vector

Non-integer index of a vector

Re: FCMP and OPTMODEL

Re: FCMP and OPTMODEL

Re: FCMP and OPTMODEL

FCMP and OPTMODEL

Re: Tournament selection in genetic algorithm

Tournament selection in genetic algorithm