12-17-2014 05:55 PM
I have a .txt dataset which has 11,653 variables and 179 observations (see attachment). The variable "name" is just the ID, I don't need it. The variable "score" is the response variable and the remaining 11,651 variables are the explanatory variables. I know those 11,651 variables are not all important, so I need to select the most significant ones to build and fit a model. After I import the .txt file into SAS, I tried stepwise regression in PROC HPREG, but SAS reported insufficient memory. I cannot change the MEMSIZE option (now it is 2G) because I am running the code on my university's server.
Here is my code:
proc hpreg data = P2222;
model score = P12050301--P60598281; *P12050301 is the first explanatory variable, P60598281 is the last explanatory variable;
selection method = stepwise(select=sl sle=0.25 sls=0.25 maxeffects=170);
run;
Do I have other methods to select the best model? I am a total beginner. Thank you for your help!
12-17-2014 06:18 PM
You need to reduce your variables before you do a regression.
Ideally you'd know something about all your variables and then you could apply business knowledge as well as statistical techniques.
One way is to use forward selection (generally not recommended): regress each variable against the dependent variable and keep only those that are significant. You also need to check for correlation between your independent variables.
You'll also need to consider the variable types, i.e. categorical, numerical, ordinal and treat them appropriately in your variable selection method.
Because you have only 179 observations you'll want FEWER variables than that in your regression, otherwise you have a dimensionality problem and your matrices won't invert.
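To see the "matrices won't invert" problem concretely, here is a small sketch (Python/NumPy rather than SAS, purely for illustration; the sizes are made up) showing that with more variables than observations, least squares reproduces even a pure-noise response exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 observations, 25 candidate predictors: more columns than rows,
# a tiny-scale version of the 179 x 11,651 situation in this thread.
n_obs, n_vars = 10, 25
X = rng.standard_normal((n_obs, n_vars))
y = rng.standard_normal(n_obs)          # pure noise, unrelated to X

# With p > n the system is underdetermined: lstsq returns the minimum-norm
# solution, the fitted values match y exactly, and R-square is 1 by construction.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta
print(float(np.abs(residual).max()))    # essentially zero
```

A perfect in-sample fit here tells you nothing about prediction, which is why a perfect R-square from a huge pool of candidate variables should be treated as a red flag rather than a success.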
12-18-2014 02:46 PM
Hi, thank you for your reply. I understand what you mean here. All my variables are numerical. I don't know anything about those variables, so I cannot select them using prior knowledge. That's why I use stepwise regression and let SAS select the important ones. But right now, because of the memory problem, I cannot get the result.
12-17-2014 06:29 PM
I don't know the solution to the performance issue here, but...
If your explanatory variables (P12050301--P60598281) were independent random numbers, unrelated to score, you would certainly find a model that fits your data perfectly. Fitting that model is almost like trying to solve N linear equations for N variables.
Try setting apart a small set of, say, 20 observations. Build your model based on the rest of the data and then test the resulting model on the small set.
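The suggestion above, sketched in Python (plain standard library; the data-loading step is assumed and `rows` is just a placeholder for however your 179 observations are stored):

```python
import random

random.seed(1)

# Placeholder for the 179 observations; in practice this would be the
# rows of the imported dataset, in any convenient structure.
rows = list(range(179))

# Set aside about 20 observations for testing; build the model on the rest.
random.shuffle(rows)
test_rows, train_rows = rows[:20], rows[20:]

print(len(train_rows), len(test_rows))   # 159 20
```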
12-18-2014 02:58 PM
Hi, thank you for your reply. I may not quite understand you. I tried the stepwise regression but SAS reported insufficient memory. This means the explanatory variables are too many (11,651), so SAS doesn't have enough memory for this proc.
Did you mean I should separate the dataset into several small ones and build a model on each of the small datasets? Actually I tried it. I separated the whole dataset into 2 small ones, the first one has the first 5,826 explanatory variables and the second one has the remaining 5,825 variables. Then I ran the stepwise regression on both of them and built 2 models. The first model selected 10 variables and R square=0.24, the second model selected 155 variables and R square=1.0. But if I separate the whole dataset in a different way, say the first 5,000 variables and then 6,651 variables, the models select different variables and the R square is different.
Now I am really confused about it. How can I separate the dataset and get the "best" result?
12-18-2014 04:16 PM
Divide your dataset the other way by creating two classes of observations (not variables) with a weight variable:
data splitData;
set P2222;
if rand("UNIFORM") < (21/179) then weight = 0;
else weight = 1;
run;
proc reg data=splitData;
model score = ... ;
weight weight;
output out=outData p=predScore;
run;
proc corr data=outData(where=(weight=0)) pearson;
var score;
with predScore;
run;
proc sgplot data=outData(where=(weight=0));
scatter x=score y=predScore;
run;
"the second model selected 155 variables and R square=1.0". This is what I warned you about. Contrary to intuition, you have very little chance of creating a meaningful model from that many variables via a variable selection method. The biggest challenge you face is your lack of knowledge about these variables. They can probably be grouped into closely related subsets that could each be represented by one best variable or a single Principal Component.
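As a toy illustration of that last idea (Python/NumPy; the data, the 0.8 correlation threshold, and the greedy grouping are all invented for the example, not a prescription), variables that move together can be grouped and each group replaced by its first principal component:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: 9 observed variables that are really 3 underlying factors,
# each measured 3 times with a little noise.
n = 179
factors = rng.standard_normal((n, 3))
X = np.repeat(factors, 3, axis=1) + 0.2 * rng.standard_normal((n, 9))

# Greedy grouping: variables join a group when |correlation| exceeds 0.8.
R = np.corrcoef(X, rowvar=False)
groups, assigned = [], set()
for j in range(X.shape[1]):
    if j in assigned:
        continue
    g = [k for k in range(X.shape[1]) if k not in assigned and abs(R[j, k]) > 0.8]
    assigned.update(g)
    groups.append(g)

# Represent each group by its first principal component (via SVD).
representatives = []
for g in groups:
    Xg = X[:, g] - X[:, g].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xg, full_matrices=False)
    representatives.append(Xg @ Vt[0])

print(len(groups))   # 3 groups recovered from 9 variables
```

The 11,651-variable case would need a more careful grouping strategy, but the principle is the same: one representative per block of closely related variables shrinks the problem to something your 179 observations can support.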
12-18-2014 04:34 PM
This is an interesting way. Just to learn more about your recommended method, could you please provide more info on this, like why I am getting 155 variables?
12-19-2014 05:09 PM
Thanks again for your reply. Yes you are right. There are so many variables and I don't have enough knowledge about them. So simply using regression didn't give me good results.
I tried decision tree and PCA today. The decision tree selected 15 variables out of 11,651 and MSE is 0.139. For the PCA method, there are 160 components in the model and MSE is 0.0106. However, I didn't get the coefficients of these components, so I don't know how to interpret them.
You talked about dividing the original dataset into 2 small ones by separating the observations. One is used for building a model, the other is used for testing. Why can't I just use all the observations to build the model and then test it? Would that make the result more accurate and reasonable? And what does the weight variable mean, and what is its purpose?
12-19-2014 11:22 PM
Evaluating the performance of a model with the same data that served to build it is well known to provide an optimistic, sometimes very optimistic, evaluation. The trick is to sacrifice a part of your data (10% say) for evaluation. A good idea is to repeat this process many times with different sets of observations sacrificed randomly. This way you can also see if the chosen models are stable.
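A sketch of that repeated-holdout idea (Python/NumPy on small synthetic data; the 50 repeats, the 10% holdout, and the toy 8-variable model are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: 179 observations, 8 genuinely predictive variables.
n = 179
X = rng.standard_normal((n, 8))
true_beta = rng.standard_normal(8)
y = X @ true_beta + 0.5 * rng.standard_normal(n)

holdout_corr = []
for _ in range(50):
    idx = rng.permutation(n)
    test, train = idx[:18], idx[18:]        # sacrifice ~10% each round
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    pred = X[test] @ beta
    holdout_corr.append(np.corrcoef(pred, y[test])[0, 1])

# Consistently high correlations across repeats suggest a stable model;
# wildly varying ones are a warning sign.
print(round(float(np.mean(holdout_corr)), 2), round(float(np.std(holdout_corr)), 2))
```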
12-20-2014 12:14 AM
Observations with weight=0 are excluded from the regression (and variable selection) process but the REG procedure will provide predicted values for them in the out= dataset. It is a simple trick to split your dataset into two subsets, one for model building (weight=1) and the other for model evaluation (weight=0).
12-19-2014 01:53 PM
I like thinking about this one in the following way: The OP measured over 11K variables about 180 times, so we have nearly 60 times as many variables as cases. You can RANDOMLY select any 179 of the variables and get a PERFECT fit to the response variable. In fact there are (thank you Wolfram Alpha) :
roughly 1.78E400 possible perfect fits (a 401-digit number), and stepwise regression will pick one of those. Given that the total number of protons and electrons in the universe is on the order of 300 orders of magnitude less than this, you may consider stepwise methods as an exercise in futility. Finding the "right" one is impossible in P Log P time.
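The count itself is easy to reproduce (Python here; `math.comb` computes the binomial coefficient exactly, with no overflow):

```python
import math

# Number of ways to choose 179 of the 11,651 candidate variables;
# each such choice can reproduce the 179 observations exactly.
n_fits = math.comb(11651, 179)

print(len(str(n_fits)))   # 401 digits, i.e. on the order of 1.78E400
```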
So the best I could think of is to include the response variable into the mix and look at principal components. Find those that have large loadings on the dependent variable, and explain most of the variability.
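A minimal sketch of that approach (Python/NumPy; the toy data, with 3 of 10 variables actually driving the score, is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: 179 observations; only the first 3 of 10 variables drive score.
n, p = 179, 10
X = rng.standard_normal((n, p))
score = X[:, :3].sum(axis=1) + 0.3 * rng.standard_normal(n)

# Include the response in the matrix, standardize, then take an SVD:
# the rows of Vt are the principal-component loading vectors.
M = np.column_stack([score, X])
M = (M - M.mean(axis=0)) / M.std(axis=0)
_, _, Vt = np.linalg.svd(M, full_matrices=False)

# The loading of each component on the response is its first entry.
loadings_on_score = np.abs(Vt[:, 0])
top = int(np.argmax(loadings_on_score))
print(top, round(float(loadings_on_score[top]), 2))
```

Components that load heavily on score (and explain much of the variance) point, through their loadings on the X variables, at the subset of variables worth a closer look.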
Actually, the best idea is to find a subject matter expert and whittle the 11K plus variables down to, say, 8 or 10, which is about how many your 179 cases can accurately estimate.
12-19-2014 02:09 PM
Partial Least Squares Regression was designed for this case. It is often used when the number of variables far exceeds the number of data points, for example, in spectroscopy, where you might have measured the intensity at 10,000 wavelengths on 179 samples. You can find lots of examples in the literature of PLS models for spectroscopy that were similar to your case with many times more X variables than observations.
With PLS, you get data reduction in the sense that it will find linear combinations of your X variables that are predictive of Y. That is better than PCA or factor analysis, where you get linear combinations of your X variables that may well be non-predictive, since the Y values are not used in determining the PCA/factor analysis dimensions.
But PLS won't give you individual X variables that are in the model; it's not designed to, and as others have pointed out, there is no logical way to pick the individual Xs from your 11,651 that belong in the model. So give up the idea of using stepwise regression.
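SAS implements this in PROC PLS. To show the mechanics of what PLS actually does, here is a bare-bones NIPALS PLS1 sketch in Python/NumPy (toy data with far more variables than observations; an illustration of the algorithm, not production code):

```python
import numpy as np

def pls1(X, y, n_comp):
    """Bare-bones NIPALS PLS1: returns regression coefficients mapping
    centered X to centered y. For illustration only."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)           # direction in X most covarying with y
        t = Xc @ w                       # score: the latent linear combination
        p = Xc.T @ t / (t @ t)           # X loadings
        c = (yc @ t) / (t @ t)           # y loading
        Xc = Xc - np.outer(t, p)         # deflate: remove what this component explains
        yc = yc - t * c
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space

rng = np.random.default_rng(3)
n, p_vars = 40, 200                      # far more variables than observations
X = rng.standard_normal((n, p_vars))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(n)

beta = pls1(X, y, n_comp=3)
pred = X @ beta + y.mean() - X.mean(axis=0) @ beta
print(round(float(np.corrcoef(pred, y)[0, 1]), 2))
```

Note that the coefficients describe linear combinations of all the Xs, not a selected subset, and with p >> n the in-sample fit is still optimistic, so the holdout evaluation discussed earlier in the thread still applies.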