BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zkp1997
Calcite | Level 5

Hello,

 

The question is simple but it has been driving me crazy. I have a large dataset (~150GB) with billions of observations. I would like to perform a linear regression on a list of both continuous and categorical variables (which contains thousands of categories), and keep the observations that make X'X matrix non-singular (invertible).

 

In Stata it is simple because Stata automatically drops observations so that X'X is always invertible. All I need to do is to use e(sample) to get the observations that are used in the regression, like the following:

 

 

reg y x1 x2 i.x3
keep e(sample)

 

 

However, in SAS it seems like most procedures keep all observations and, when X'X appears to be singular, perform a generalized inverse. That is not what I want, but I cannot find any options in proc reg or proc glm to suppress such behavior. I cannot do it in Stata as well because the dataset is too large to fit in the memory. 

 

Does anyone have suggestions on how to do this?

 

Thank you in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

If the matrix is non-singular because of the categorical variables ... which must mean you created the dummy variables for the categorical variables, and the SAS PROC you are using did not create the dummy variables for the categorical variables ... then all you have to do is eliminate one level of the categorical variable and then the matrix becomes singular. (SAS, when creating dummy variables for categorical variables, does not use all levels of the categorical variables, it uses the number of levels minus one)

 

Otherwise, I stand by my comments that removing rows will not remove the fact that one (or more) column is a linear combination of other columns, and so you can't make the matrix singular by removing rows. If X3=X1+X2 in your data set (or some other similar linear combination holds), then removing rows won't help. Nor will eliminating rows using Cook's distance help. And I am skeptical that Ridge Regression will help, as it is used in the case where there are some high correlations between variables, but I do not think it can help if you have perfect linear combination between x-variables.

--
Paige Miller

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

I never heard of such a thing.

 

Typically, a matrix is not invertible because columns (not observations) are linearly dependent. In that case, dropping observations will not change the fact that columns are linearly dependent. So, it doesn't sound to me like such a thing like dropping observations to make the matrix invertible is even mathematically possible.

--
Paige Miller
zkp1997
Calcite | Level 5

Thank you for your reply Paige.

 

As far as I know (correct me if I am wrong), but since the row rank of a matrix always equals to the column rank, I guess collinear observations and collinear columns are actually one thing.

 

In this case, thinking in terms of columns, I am sure that collinearity comes from categorical variables. However, given that the original dataset is already large enough, I cannot manually generate all categorical variables and then check their collinearity.

 

In addition, I have tried to run the Stata code in the original post in a small sample of ~8.0M observations. There are no missing values, and also no duplicate observations. When I run the regression there are only ~7.8M obs left. So I am pretty sure that it is possible.

 

PaigeMiller
Diamond | Level 26

If the matrix is non-singular because of the categorical variables ... which must mean you created the dummy variables for the categorical variables, and the SAS PROC you are using did not create the dummy variables for the categorical variables ... then all you have to do is eliminate one level of the categorical variable and then the matrix becomes singular. (SAS, when creating dummy variables for categorical variables, does not use all levels of the categorical variables, it uses the number of levels minus one)

 

Otherwise, I stand by my comments that removing rows will not remove the fact that one (or more) column is a linear combination of other columns, and so you can't make the matrix singular by removing rows. If X3=X1+X2 in your data set (or some other similar linear combination holds), then removing rows won't help. Nor will eliminating rows using Cook's distance help. And I am skeptical that Ridge Regression will help, as it is used in the case where there are some high correlations between variables, but I do not think it can help if you have perfect linear combination between x-variables.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Adding

 

Removing rows might help if you were inverting X * X', but regression inverts X' * X

--
Paige Miller
zkp1997
Calcite | Level 5

Thank you very much for clarifying this.

 

It turns out that I totally misunderstood my problem. It is not that X'X is singular, it is just missing values (that I believed not exist because when I passed my subsample into PROC PHREG no observations were dropped). 

 

Basically what I am trying to do is to make this dataset suitable for another package in R (h2o.coxph) because PROC PHREG is unbearable slow for me. That package simply told me that something is singular (in hindsight simply because it does not drop missing values as well), which somehow made me believe that my problem is a singular X'X matrix.

 

Still, thanks for everyone's helpful answers.

zkp1997
Calcite | Level 5

Sorry for the mistake: it looks like the collinearity does not come from categorical variables, so maybe I can find a way to filter out collinear observations without generating category dummies. 

Still, I am looking for a way to do so in SAS. 

Ksharp
Super User

Would you like to use Ridge Regression by PROC REG ?

I would consider drop variables not obs if xx` was singular .

If you want drop obs ,consider COOK Distance in PROC REG .

And @Rick_SAS  would have good idea .

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1889 views
  • 1 like
  • 3 in conversation