Hello,
The question is simple, but it has been driving me crazy. I have a large dataset (~150GB) with billions of observations. I would like to perform a linear regression on a list of both continuous and categorical variables (which contain thousands of categories), and keep the observations that make the X'X matrix non-singular (invertible).
In Stata it is simple because Stata automatically drops observations so that X'X is always invertible. All I need to do is to use e(sample) to get the observations that are used in the regression, like the following:
reg y x1 x2 i.x3
keep if e(sample)
However, in SAS it seems like most procedures keep all observations and, when X'X turns out to be singular, compute a generalized inverse instead. That is not what I want, but I cannot find any option in PROC REG or PROC GLM to suppress that behavior. I cannot do it in Stata either, because the dataset is too large to fit in memory.
Does anyone have suggestions on how to do this?
Thank you in advance.
If the matrix is singular because of the categorical variables ... which must mean you created the dummy variables for the categorical variables yourself, and the SAS PROC you are using did not create them ... then all you have to do is eliminate one level of each categorical variable, and the matrix becomes non-singular. (When SAS creates dummy variables for a categorical variable, it does not use all the levels; it uses the number of levels minus one.)
Otherwise, I stand by my comment that removing rows will not remove the fact that one (or more) column is a linear combination of other columns, and so you can't make the matrix non-singular by removing rows. If X3=X1+X2 in your data set (or some other similar linear combination holds), then removing rows won't help. Nor will eliminating rows using Cook's distance help. And I am skeptical that ridge regression will help: it is used when there are high correlations between variables, but I do not think it can help if you have a perfect linear combination among the x-variables.
I have never heard of such a thing.
Typically, a matrix is not invertible because columns (not observations) are linearly dependent. In that case, dropping observations will not change the fact that the columns are linearly dependent. So it doesn't sound to me like dropping observations to make the matrix invertible is even mathematically possible.
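To illustrate the point numerically (a NumPy sketch, not SAS; the variables x1 and x2 are made up for the example): when one column is an exact linear combination of the others, X'X stays singular no matter which observations you drop.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=10)
x2 = rng.normal(size=10)

# Third column is x1 + x2, so the columns of X are linearly dependent
X = np.column_stack([x1, x2, x1 + x2])

# X'X is 3x3 but has rank 2, hence singular
print(np.linalg.matrix_rank(X.T @ X))          # 2

# Dropping observations (rows) cannot raise the column rank
X_sub = X[:5]
print(np.linalg.matrix_rank(X_sub.T @ X_sub))  # still 2
```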
Thank you for your reply Paige.
As far as I know (correct me if I am wrong), since the row rank of a matrix always equals the column rank, I guess collinear observations and collinear columns are actually the same thing.
In this case, thinking in terms of columns, I am sure that the collinearity comes from the categorical variables. However, given that the original dataset is already so large, I cannot manually generate all the dummy variables and then check their collinearity.
In addition, I have tried to run the Stata code in the original post in a small sample of ~8.0M observations. There are no missing values, and also no duplicate observations. When I run the regression there are only ~7.8M obs left. So I am pretty sure that it is possible.
Adding: removing rows might help if you were inverting X * X', but regression inverts X' * X.
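A NumPy sketch of that asymmetry (illustrative only; the columns are made up): with a dependent third column, X'X stays singular under row deletion, while X * X' can be made invertible by keeping only linearly independent rows; but X * X' is not the matrix regression needs.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=8)
b = rng.normal(size=8)
X = np.column_stack([a, b, a + b])   # n=8 rows, p=3 dependent columns

# X'X (3x3) has rank 2: singular, and row removal cannot fix that
print(np.linalg.matrix_rank(X.T @ X))    # 2

# Keeping just 2 (independent) rows makes X X' a full-rank 2x2 matrix...
X2 = X[:2]
print(np.linalg.matrix_rank(X2 @ X2.T))  # 2 -> invertible
# ...while X'X for those same rows is still rank 2 out of 3: singular
print(np.linalg.matrix_rank(X2.T @ X2))  # 2
```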
Thank you very much for clarifying this.
It turns out that I totally misunderstood my problem. It is not that X'X is singular; it is just missing values (which I believed did not exist, because when I passed my subsample into PROC PHREG no observations were dropped).
Basically, what I am trying to do is make this dataset suitable for another package in R (h2o.coxph), because PROC PHREG is unbearably slow for me. That package simply told me that something was singular (in hindsight, because it does not drop observations with missing values either), which made me believe my problem was a singular X'X matrix.
Still, thanks for everyone's helpful answers.
Sorry for the mistake: it looks like the collinearity does not come from categorical variables, so maybe I can find a way to filter out collinear observations without generating category dummies.
Still, I am looking for a way to do so in SAS.
Would you like to use ridge regression via PROC REG?
I would consider dropping variables, not observations, if X'X were singular.
If you want to drop observations, consider Cook's distance in PROC REG.
And @Rick_SAS would have a good idea.