David_M
Obsidian | Level 7

Hi ... I have the following regression model and also have many categorical and continuous predictors:

%let predictor_list = X1;

%let predictor_list = X1 X2;

%let predictor_list = X1 X3;

...

%let predictor_list = X1 X70;

PROC Logistic;
     CLASS  X2 X3 X5 X10 .... X20 /param=ref ref=first;
     model Y = &predictor_list / link=logit orpvalue clparm=PL clodds=PL;
run;

For now, I manually cycle through the predictor_list statements one by one and note the change-in-estimate (CIE) between the raw model (1st predictor) and the adjusted model (1st and 2nd predictors). If the CIE is 10% or greater, then I would keep that second variable for later modeling; else, it would be eliminated from further analysis.

 

A macro will be generated later to automate this entire procedure.

Also, I leave the big CLASS statement in the code, hoping SAS will ignore it for non-categorical predictors in my model statement.

 

The problem is:

 

1. The parameter estimates are different if the CLASS statement only has the X predictor (a categorical predictor) vs if the CLASS statement has all 20 categorical variables. Why?

 

2. When predictor X is NOT one of the class variables (i.e., it is a continuous predictor), the parameter estimates obtained with the full CLASS statement differ from those obtained with the CLASS statement commented out. Why?

 

I need to know why the CLASS statement distorts output results if it contains variables not in the model statement, and if there's a solution to this problem.

 

Best,

David

1 ACCEPTED SOLUTION

jiltao
SAS Super FREQ

Hi,

As explained in the documentation at the link below --

SAS Help Center: Levelization of Classification Variables

 

"When the MISSING option is not specified, or for procedures whose CLASS statement does not support this option, it is important to understand the implications of missing values for your statistical analysis. When a SAS/STAT procedure levelizes the CLASS variables, an observation for which a CLASS variable has a missing value is excluded from the analysis. This is true regardless of whether the variable is used to form the statistical model. Consider, for example, the case where some observations contain missing values for variable A but the records for these observations are otherwise complete with respect to all other variables in the statistical models. The analysis results from the following statements do not include any observations for which variable A contains missing values, even though A is not specified in the MODEL statement:

class A B;
model y = B x B*x;

Many statistical procedures print a "Number of Observations" table that shows the number of observations read from the data set and the number of observations used in the analysis. Pay careful attention to this table—especially when your data set contains missing values—to ensure that no observations are unintentionally excluded from the analysis."

 

I suspect this can explain the difference you are observing.

Thanks,

Jill
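The behavior described in the quoted documentation can be reproduced on simulated data. The following is a minimal sketch, not the original poster's data: the data set WORK.TOY and its contents are assumptions made up for illustration, and only the variable names follow the thread. X3 is set to missing in 20 of 200 rows and is never used in the MODEL statement.

```sas
/* Hypothetical toy data: 200 rows, X3 missing in every 10th row */
data work.toy;
   call streaminit(1);
   do i = 1 to 200;
      X1 = rand('normal');                 /* continuous predictor  */
      X2 = rand('bernoulli', 0.5);         /* categorical, complete */
      X3 = rand('bernoulli', 0.5);         /* categorical           */
      if mod(i, 10) = 0 then X3 = .;       /* 20 rows lose X3       */
      Y  = rand('bernoulli', 1 / (1 + exp(-0.5 * X1)));
      output;
   end;
   drop i;
run;

/* X3 is in CLASS but not in MODEL: rows with missing X3 are dropped */
proc logistic data=work.toy;
   class X2 X3 / param=ref ref=first;
   model Y = X1 X2;
run;

/* X3 removed from CLASS: all rows are kept */
proc logistic data=work.toy;
   class X2 / param=ref ref=first;
   model Y = X1 X2;
run;
```

In the "Number of Observations" table, the first step should report 200 observations read and 180 used, and the second 200 read and 200 used. The two fits are based on different samples, which is why the estimates differ.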


8 REPLIES 8
Quentin
Super User

If you have missing values for some of the CLASS variables, by default they will cause the record to be excluded from the analysis, even if that CLASS variable is not used in your model.  When that happens, you should see a note in your output:

NOTE: 1 observation was deleted due to missing values for the response or explanatory variables.

You could add the /missing option to your CLASS statement, which would keep observations that have missing values for a CLASS variable and give the missing value its own category. Or your macro could generate the appropriate list of CLASS variables.
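A minimal sketch of the second suggestion, where the macro builds the CLASS list from the predictors actually in the model. The macro name fit_model, its parameters, and the data set WORK.MYDATA are hypothetical, and the CLASS_ALL list in the calls is only an illustrative subset of the categorical variables. The sketch assumes CLASS_ALL is not empty.

```sas
%macro fit_model(predictors=, class_all=, data=work.mydata);
   %local i var class_used;
   %let class_used =;

   /* Keep only the categorical variables that appear in the predictor list */
   %do i = 1 %to %sysfunc(countw(&predictors, %str( )));
      %let var = %scan(&predictors, &i, %str( ));
      %if %sysfunc(indexw(%upcase(&class_all), %upcase(&var))) > 0 %then
         %let class_used = &class_used &var;
   %end;

   proc logistic data=&data;
      /* Emit a CLASS statement only when at least one predictor is categorical */
      %if %length(&class_used) > 0 %then %do;
         class &class_used / param=ref ref=first;
      %end;
      model Y = &predictors / link=logit orpvalue clparm=PL clodds=PL;
   run;
%mend fit_model;

/* Example calls: no CLASS statement for the first, CLASS X2 for the second */
%fit_model(predictors=X1,    class_all=X2 X3 X5 X10 X20)
%fit_model(predictors=X1 X2, class_all=X2 X3 X5 X10 X20)
```

With this approach a missing value in an unused categorical variable no longer removes the observation. Missing values in variables that are in the model still do, so the raw and adjusted models can still be fit on different numbers of observations.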


David_M
Obsidian | Level 7

One more thing ... the SAS documentation says when the missing option is applied, "missing values are treated as valid values". What is unclear is how these values are assigned and generated. Are they imputed from neighboring datapoints? Are they randomly generated?

 

Thx!

Tom
Super User

They are just another level.

 

So if you had a GENDER variable with M and F as the expected values and there were some observations that just had a space instead (SAS treats a character variable that only has spaces in it as "missing"), then you will have three levels instead of just two.

David_M
Obsidian | Level 7

In my case, all variables are numeric, including the categorical ones. So if the GENDER variable is coded as Male = 0 and Female = 1, what is the code for the empty or missing observation if the missing option is used?

Tom
Super User

SAS has 28 distinct missing values that can be stored in a numeric variable: the normal missing value, which is represented in code by a single period, and 27 special missing values, which are represented in code by a period followed by a single letter (.A through .Z) or an underscore (._).

 

So depending on how many of these distinct missing values appear in the dataset for that class variable, there could be up to 28 extra levels in addition to the 0 and 1 that were intended.

 

The normal missing value is usually printed as a single period and the special missing values as the corresponding letter (or underscore).

 

If you want them to be displayed in a more human friendly way you could add decodes for them to a user defined format. So perhaps something like this:

proc format;
  value gender 0='Male' 1='Female'
   .='Missing'
   .a = 'Refused'
   .b = 'Other'
 ;
run;
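As a rough illustration of how each distinct missing value becomes its own level, the sketch below builds a small made-up data set (WORK.DEMO is a hypothetical name) and tabulates it. It assumes the GENDER format from the PROC FORMAT step above has already been run in the same session.

```sas
/* Hypothetical demo data: two valid codes plus three distinct missing values */
data work.demo;
   gender = 0;  output;
   gender = 1;  output;
   gender = .;  output;   /* ordinary missing  */
   gender = .a; output;   /* special missing A */
   gender = .b; output;   /* special missing B */
run;

/* The MISSING option keeps missing values as rows of the table */
proc freq data=work.demo;
   tables gender / missing;
   format gender gender.;
run;
```

The table should show five rows, one for each label in the format: Missing, Refused, Other, Male, and Female.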
David_M
Obsidian | Level 7

Thank you...so they are not coded as numbers, and yet these 'new codes' become part of the model? I have verified that I get different parameter estimates in my regression model if I use the /missing option. I have missing values, coded as '.' for all my categorical and continuous variables. Which parameter estimates are true and should be believed? Those generated with or without the missing option?

 

Thanks!
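The thread does not settle which set of estimates to prefer. One way to at least make the raw and adjusted models comparable is to fit every model on the same observations. The following is only a sketch under stated assumptions: the data set names WORK.MYDATA and WORK.COMPLETE are hypothetical, and the variables are assumed to be named Y and X1 through X70 as in the question. It is a complete-case restriction, which can discard many rows and is not a statement about which estimates are "true".

```sas
/* Keep only rows with no missing value in the response or any candidate predictor */
data work.complete;
   set work.mydata;
   /* CMISS counts missing arguments, whether numeric or character */
   if cmiss(of Y X1-X70) = 0;
run;

/* Every model fit on WORK.COMPLETE now uses the same set of observations */
proc logistic data=work.complete;
   class X2 / param=ref ref=first;
   model Y = X1 X2 / link=logit orpvalue clparm=PL clodds=PL;
run;
```

On such a data set the contents of the CLASS statement no longer change which observations are used, so a change in estimate between two models reflects the added predictor rather than a change in sample.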
