mintbit
Obsidian | Level 7

Hello. I am learning to do a logistic regression and have a question. I am following this guide to check for collinearity: https://support.sas.com/resources/papers/proceedings17/1404-2017.pdf

  • I am a bit confused about what he is doing in the data step at the top of page 3. Why is the WHERE statement there? Is he sorting out some of the information?
  • Also, why is he putting _cat after the variable names to check for association? In the SAS course they only enter the variables as they are, and categorise them with the CLASS statement first in the LOGISTIC procedure.
1 ACCEPTED SOLUTION

Reeza
Super User

@mintbit wrote:

  • I am a bit confused about what he is doing in the data step at the top of page 3. Why is the WHERE statement there? Is he sorting out some of the information?
  • Also, why is he putting _cat after the variable names to check for association? In the SAS course they only enter the variables as they are, and categorise them with the CLASS statement first in the LOGISTIC procedure.

She's removing records that don't meet the criteria for the analysis, so yes, she's filtering out data. Since it's a survey, some responses are likely missing, or recorded with a special code such as 99, and those records need to be excluded from the analysis.
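As a rough illustration of that kind of filter (not the paper's actual code), here is a minimal sketch. The data set names, the variables q_smoke and q_alcohol, and the special code 99 are all made up for the example.

```sas
/* Hypothetical survey extract: 99 stands in for a "refused/don't know" code */
data survey_raw;
    input id q_smoke q_alcohol;
    datalines;
1 1 2
2 99 1
3 2 .
4 1 1
;
run;

data survey_clean;
    set survey_raw;
    /* WHERE filters rows as they are read from survey_raw:        */
    /* drop records that are missing or carry the special code 99  */
    where q_smoke is not missing and q_smoke ne 99
      and q_alcohol is not missing and q_alcohol ne 99;
run;

/* Only ids 1 and 4 remain */
proc print data=survey_clean;
run;
```

The WHERE statement only decides which rows are kept; it does not reorder or change the values in the rows that pass.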

 

_CAT is a naming convention: it helps you remember which variables are categorical and which are not. You can call a variable Age or Age_CAT, but if they hold the same values the name doesn't matter much. However, if you have both, with age continuous and age_cat categorical, a naming convention makes it easier to isolate the variables you need and keeps your work clean. It appears that the author checks both the categorical and the continuous versions of the variables for association. That isn't a bad idea, because it lets you see whether the effects are linear and constant.
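To make the convention concrete, here is a small sketch that keeps a continuous age and derives a categorical age_cat next to it. The data, the cut points (30 and 50) and the variable names are invented for illustration and are not taken from the paper.

```sas
/* Hypothetical data: age is continuous, outcome is a 0/1 response */
data people;
    input age outcome;
    /* age_cat is the categorical counterpart of age; cut points are arbitrary */
    if missing(age) then age_cat = .;
    else if age < 30 then age_cat = 1;
    else if age < 50 then age_cat = 2;
    else age_cat = 3;
    datalines;
22 0
27 1
35 0
41 1
48 0
56 1
63 1
69 0
;
run;

/* Check the association between the categorical version and the outcome */
proc freq data=people;
    tables age_cat * outcome / chisq;
run;
```

With both versions in the data set, the suffix tells you at a glance that age_cat is the one to list on a CLASS statement, while age can be entered as is.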


4 REPLIES
PaigeMiller
Diamond | Level 26

The WHERE statement eliminates some data from the analysis, according to the conditions stated in the WHERE statement.

 

I assume that variable names end with _CAT to indicate these are categorical variables.

 

By the way, in addition to Ridge Regression and Principal Components Regression, SAS has other methods to combat multicollinearity which include PROC PLS and PROC GLMSELECT. In fact, I wouldn't even bother with Principal Components Regression as I don't really find it useful from a logical point of view. And none of these are available in SAS for the Logistic case, as far as I know.

--
Paige Miller
StatDave
SAS Super FREQ

Collinearity in generalized linear models (GLMs) such as logistic models can cause the information matrix to become ill-conditioned and can inflate the standard errors of the parameters. However, the information matrix in a GLM is not the same as in a simple regression model on a normally distributed response. In a GLM the information matrix is a weighted matrix, so the concern is not collinearity among the predictors themselves, but among the weighted predictors. The weighted predictors should therefore be used when assessing collinearity with the diagnostics in PROC REG. This is all further discussed and illustrated in the collinearity section of this note.
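A minimal sketch of one way to do this for a binary logistic model follows. It assumes the usual weight for that model, p(1-p), where p is the fitted probability. The simulated data and all names (sim, pred, phat, w, x1-x3) are made up for the example, and x2 is deliberately built to be correlated with x1.

```sas
/* Simulated data (hypothetical): x2 is constructed to be collinear with x1 */
data sim;
    call streaminit(2024);
    do id = 1 to 500;
        x1 = rand('normal');
        x2 = 0.9*x1 + 0.3*rand('normal');
        x3 = rand('normal');
        p_true = 1 / (1 + exp(-(-0.5 + 0.8*x1 + 0.4*x3)));
        y = rand('bernoulli', p_true);
        output;
    end;
run;

/* Fit the logistic model and save the fitted probabilities */
proc logistic data=sim;
    model y(event='1') = x1 x2 x3;
    output out=pred p=phat;
run;

/* Weight from the fitted model: p(1-p) for a binary logistic model */
data pred;
    set pred;
    w = phat * (1 - phat);
run;

/* Collinearity diagnostics on the weighted predictors */
proc reg data=pred;
    weight w;
    model y = x1 x2 x3 / vif tol collin;
run;
quit;
```

The regression estimates from PROC REG are not of interest here; only the VIF, tolerance and COLLIN output are, since they depend on the predictors and the weights rather than on the response.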

ballardw
Super User

@mintbit wrote:

Hello. I am learning to do a logistic regression and have a question. I am following this guide to check for collinearity: https://support.sas.com/resources/papers/proceedings17/1404-2017.pdf

  • I am a bit confused about what he is doing in the data step at the top of page 3. Why is the WHERE statement there? Is he sorting out some of the information?


Yes, he is filtering out records. My supposition, based on working with YRBS data, is that he is removing the records with code values that indicate a "don't know", "refused" or similar response. The model will therefore use only "complete" records, whose answers to the questions of interest are more meaningful for modelling purposes.

 

  • Also, why is he putting _cat after the variable names to check for association? In the SAS course they only enter the variables as they are, and categorise them with the CLASS statement first in the LOGISTIC procedure.

I do not see any code where he is "putting _cat behind the variables". The names of the variables in the data set already end in _cat. You would need to be more familiar with the data to know why the _cat variables were created.
