acordes
Rhodochrosite | Level 12

My company's SAS Viya server is down at the moment, so I cannot provide data, code, or screenshots; I can only lay out my thoughts on the problem.

 

So I have an interval target variable regressed on interval input variables and one categorical variable with cardinality 10. 

(I simplify my real world problem for the sake of this example.) 

 

When I request the parameter estimates, I might see that 3 of the 9 or 10 levels (depending on the encoding used) have "similar" estimates; let's assume that they are all significant according to Pr > |t|.

So I could rearrange the categorical variable by grouping these 3 levels. I suppose that good grouping increases the significance and generalizability of the model.

 

Following that reasoning, my approach would be to one-hot encode the class variable and then apply variable clustering to these dummy variables. Does this make sense, and will it bring me closer to my goal?

Or apply WOE encoding plus some binning of the output? But this comes at the cost of less interpretability and I don't know if it only works for binary target variables. 

 

In my specific case, the informed grouping is done by the business expert, who says something like "Q3 and Q5 behave equally in terms of value retention", and we group them in a DATA step before running the regression model.

But I would like to let the data decide and receive a grouping proposal from the proc, etc. 

 

Do we have a dedicated proc / option / CAS action / Model Studio function to do this?

 

 

3 REPLIES
PaigeMiller
Diamond | Level 26

When I request the parameter estimates, I might see that 3 of the 9 or 10 levels (depending on the encoding used) have "similar" estimates; let's assume that they are all significant according to Pr > |t|. So I could rearrange the categorical variable by grouping these 3 levels.

 

Pr > |t| tests whether the estimates are significantly different from zero. This is not a criterion I would use to group. I would test whether the estimates of these 3 levels are significantly different from each other, which is a different test. If they are not significantly different, and it also makes sense to group them from a subject-matter point of view, then you can group them. (An example, perhaps ridiculous, of when subject matter does not support grouping: if lions and slugs were found to have similar estimates, I would probably still not group them.)
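A minimal sketch of such a test, assuming PROC GLM and using SASHELP.CARS purely as a stand-in (the model, the variables, and the three levels chosen are illustrative, not taken from the original problem). The CONTRAST statement jointly tests whether three levels of the class variable have the same effect. The coefficients follow the sorted order of the levels, so check them against the Class Level Information table before trusting the result.

```sas
/* Illustrative only: SASHELP.CARS stands in for the real data. */
proc glm data=sashelp.cars;
   /* Levels are expected to sort as: Hybrid SUV Sedan Sports Truck Wagon
      (verify in the Class Level Information table). */
   class type;
   model msrp = horsepower weight type / solution;

   /* Joint 2-df test of H0: SUV = Sedan = Wagon.
      Row 1 compares SUV with Sedan, row 2 compares SUV with Wagon. */
   contrast 'SUV = Sedan = Wagon'
            type 0 1 -1 0 0  0,
            type 0 1  0 0 0 -1;
run;
quit;
```

A large p-value for the contrast means the data give no evidence that the three levels differ; it does not prove they are equal, so the subject-matter check still applies.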

 

I suppose that good grouping increases the significance and generalizability of the model.

 

Maybe, maybe not.

 

 

Following on that reasoning my approach would be to one-hot encode the class variable and then apply variable clustering to these dummy variables.

 

Cluster dummy variables? I don't see how that adds anything to the grouping of the class variables. Empirically, you can't cluster the levels of class variables; the same applies to dummy variables created from those class variables.

--
Paige Miller
acordes
Rhodochrosite | Level 12
I've found the LINES option of the LSMEANS statement in, e.g., PROC GLM.
This seems to help me in this regard.
http://support.sas.com/kb/63/810.html
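A minimal sketch of that approach, again with SASHELP.CARS as an assumed stand-in for the real data. PDIFF requests all pairwise comparisons of the LS-means, and LINES summarizes which levels are not significantly different from each other.

```sas
/* Illustrative only: SASHELP.CARS stands in for the real data. */
proc glm data=sashelp.cars;
   class type;
   model msrp = horsepower weight type;

   /* PDIFF + ADJUST=TUKEY: multiplicity-adjusted pairwise comparisons.
      LINES: connects levels whose LS-means are not significantly different. */
   lsmeans type / pdiff adjust=tukey lines;
run;
quit;
```

Levels joined by a common line are candidates for grouping, subject to the usual subject-matter check.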
ballardw
Super User

The easiest way to accomplish such grouping for "what if" is to use one or more custom formats for the values and apply the format to the variable in the code. The groups created by formats are honored by the analysis and reporting procedures and generally for the graphing procedures (some issues with custom date/time/datetime formats).

 

Here is an example, using a data set you should have available for testing code (SASHELP.CLASS), that creates two formats and then applies them to the same variable in separate calls to PROC FREQ.

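proc format;
   /* First grouping: two age bands */
   value agegroup
      low - 12 = 'Pre-teen'
      13 - 18  = 'Teen'
   ;
   /* Second grouping: three age bands for the same variable */
   value secondgroup
      low - 12  = 'Pre-teen'
      13 - 15   = '13 to 15'
      16 - high = '16+'
   ;
run;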

proc freq data=sashelp.class;
   tables age;
   format age agegroup.;
run;

proc freq data=sashelp.class;
   tables age;
   format age secondgroup.;
run;
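The same idea carries over to a regression with a character class variable. The sketch below is an assumption-laden illustration: the format name $typegrp and the choice to merge 'SUV' and 'Truck' are hypothetical, and SASHELP.CARS stands in for the real data. Values not listed in the format pass through unchanged, so only the merged levels need to be named.

```sas
/* Hypothetical grouping: format name and merged levels are illustrative. */
proc format;
   value $typegrp
      'SUV', 'Truck' = 'SUV/Truck';   /* all other values keep their own label */
run;

proc glm data=sashelp.cars;
   class type;                        /* class levels are formed from formatted values */
   model msrp = horsepower weight type / solution;
   format type $typegrp.;             /* SUV and Truck now share one level */
run;
quit;
```

Swapping in a different format changes the grouping without touching the data, which makes it easy to compare several "what if" groupings.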

If you are using a CLASS statement in PROC LOGISTIC (or most procs that support the REF= option) and are specifying a REF= value, you need to use the formatted value. So you would likely have to change the REF= value in the CLASS statement to match the formatted values of the variable.
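A minimal sketch of that point, assuming the AGEGROUP format from the example above (repeated here so the step runs on its own). The model itself is only a placeholder to show where the formatted reference level goes.

```sas
/* Same AGEGROUP format as above, repeated so this step is self-contained. */
proc format;
   value agegroup
      low - 12 = 'Pre-teen'
      13 - 18  = 'Teen';
run;

proc logistic data=sashelp.class;
   /* REF= names the formatted value ('Teen'), not a raw age such as 13 */
   class age (ref='Teen') / param=ref;
   model sex (event='F') = age height;   /* placeholder model for illustration */
   format age agegroup.;
run;
```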
