How to cluster levels of a categorical variable in a linear regression...

acordes · Posted 08-24-2022 06:34 PM

My company's SAS Viya server is down at the moment, so I cannot provide data, code or screenshots; I can only lay out my thoughts on the problem.

So I have an interval target variable regressed on interval input variables and one categorical variable with cardinality 10.

(I simplify my real world problem for the sake of this example.)

When I request the parameter estimates I might see that 3 of the 9 or 10 levels (depending on the encoding used) have "similar" estimates, and let's assume that they are all significant according to Pr > |t|.

So I could rearrange the categorical variable by grouping these 3 levels. I suppose that good grouping increases the significance and generalizability of the model.

Following on from that reasoning, my approach would be to one-hot encode the class variable and then apply variable clustering to these dummy variables. Does this make sense, and will it bring me closer to my goal?

Or apply WOE encoding plus some binning of the output? But this comes at the cost of less interpretability and I don't know if it only works for binary target variables.

In my specific case at hand, the informed grouping is done by the business expert like saying "Q3 and Q5 behave equally in terms of value retention" and we would group them via data step before running the regression model.

But I would like to let the data decide and receive a grouping proposal from the proc, etc.

Do we have a dedicated proc / option / cas action / model studio function to do this?

PaigeMiller · Posted 08-24-2022 06:53 PM

When I request the parameter estimates I might see that 3 of the 9 or 10 levels (depending on the encoding used) have "similar" estimates, and let's assume that they are all significant according to Pr > |t|. So I could rearrange the categorical variable by grouping these 3 levels.

Pr > |t| tests whether each estimate is significantly different from zero. This is not a criterion I would use to group. I would test whether the estimates of these 3 levels are significantly different from each other, which is a different test. If they are not significantly different, and grouping them also makes sense from a subject matter point of view, then you can group them. (An example, perhaps ridiculous, of subject matter not supporting grouping: if lions and slugs were found to have similar estimates, I would probably still not group them.)
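
To make the distinction concrete, here is a minimal sketch of testing levels against each other rather than against zero. The data set (sashelp.cars), the model, and the three levels picked are illustrative assumptions, not the original poster's data.

```sas
/* Sketch only: sashelp.cars, the model and the chosen levels are assumptions */
proc glm data=sashelp.cars;
   class Type;
   model MSRP = Horsepower Weight Type / solution;

   /* All pairwise differences between the adjusted level means,
      with a Tukey adjustment for multiple comparisons */
   lsmeans Type / pdiff=all adjust=tukey;

   /* Joint 2-df test of H0: SUV = Sedan = Wagon.
      Coefficients must follow the level order shown in the
      "Class Level Information" table; on an ASCII system that is
      Hybrid, SUV, Sedan, Sports, Truck, Wagon */
   contrast 'SUV = Sedan = Wagon' Type 0 1 -1 0 0  0,
                                  Type 0 1  0 0 0 -1;
run;
quit;
```

A large p-value for the contrast only says the data do not show a difference among those levels; whether grouping them is sensible is still a subject matter decision.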

I suppose that good grouping increases the significance and generalizability of the model.

Maybe, maybe not.

Following on that reasoning my approach would be to one-hot encode the class variable and then apply variable clustering to these dummy variables.

Cluster dummy variables? I don't see how that adds anything to the grouping of the class variables. Empirically, you can't cluster the levels of class variables; the same applies to dummy variables created from those class variables.

--
Paige Miller

acordes · Posted 08-25-2022 01:58 AM

I've found the lines option of the lsmeans statement, e.g. in proc glm.
This seems to help me in this regard.
http://support.sas.com/kb/63/810.html
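
A minimal sketch of that option, assuming sashelp.cars as stand-in data (the variable choices are illustrative only):

```sas
/* Sketch only: data set and model are assumptions for illustration */
proc glm data=sashelp.cars;
   class Type;
   model MSRP = Horsepower Weight Type;

   /* LINES lists the LS-means in descending order and marks subsets
      of levels whose means are not significantly different */
   lsmeans Type / pdiff=all adjust=tukey lines;
run;
quit;
```

Levels that share a line are candidates for grouping, which is the kind of data-driven proposal asked for above.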

ballardw · Posted 08-24-2022 07:44 PM

The easiest way to accomplish such grouping for "what if" is to use one or more custom formats for the values and apply the format to the variable in the code. The groups created by formats are honored by the analysis and reporting procedures and generally for the graphing procedures (some issues with custom date/time/datetime formats).

Here is an example that uses a data set you should already have for testing code. It creates two formats and then applies them to the same variable in separate calls to proc freq.

proc format;
   /* first grouping: two age bands */
   value agegroup
      low - 12 = 'Pre-teen'
      13 - 18  = 'Teen'
   ;
   /* second grouping: three age bands */
   value secondgroup
      low - 12  = 'Pre-teen'
      13 - 15   = '13 to 15'
      16 - high = '16+'
   ;
run;

proc freq data=sashelp.class;
   tables age;
   format age agegroup.;
run;

proc freq data=sashelp.class;
   tables age;
   format age secondgroup.;
run;

If you are using a CLASS statement in Proc Logistic (or most procs that support the ref option) and are specifying a Ref= value, you need to use the formatted value. So you would likely have to change the CLASS statement to match the Format statement values of the variable.
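
The same idea carries over to a regression procedure. The following is a sketch rather than a tested recipe: the format name agegrp3 is hypothetical, the model is arbitrary, and it assumes a SAS/STAT release in which the PROC GLM CLASS statement accepts REF=.

```sas
proc format;
   value agegrp3   /* hypothetical format name */
      low - 12  = 'Pre-teen'
      13 - 15   = '13 to 15'
      16 - high = '16+'
   ;
run;

proc glm data=sashelp.class;
   /* REF= must name the formatted value, not the raw age */
   class Age (ref='Pre-teen');
   /* the format collapses the raw ages into three class levels */
   format Age agegrp3.;
   model Weight = Height Age / solution;
run;
quit;
```

Swapping in a different format regroups the levels without a data step, so several candidate groupings can be compared by changing only the FORMAT statement.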
