DATA Step, Macro, Functions and more

Collapse levels of a categorical variable

Reply
Regular Contributor
Posts: 183

Collapse levels of a categorical variable

Hi Team,

Is there a way we can collapse the levels of categorical variables in an automated manner? One way i know is to calculate the % of events falling in each category and run cluster analysis on it. Can this be automated for multiple variables considering only event rate, not cluster analysis?

For example, there is a variable called "Char A". I have calculated Event Rate for this variable, i.e. percentage of 1s appearing in dependent variable (let's say VarY). It is simply the mean of Y.

Char AEvent Rate
A49%
B67%
C2%
D87%
E4%
F3%

Next step is to combine categories of C,E and F as they have almost similar event rate (let's say, variation within 5%).

Method 2 : If the above methodology is complicated to automate, can we make it simple taking only the number of cases falling in Char A (not considering dependent variable). For example, if a categorical level contains atmost  5% observations, combine it with others which all have percentage less than 5%,

Char AColPct
C2%
F3%
E4%
A49%
B67%
D87%

Combine categories of C,E and F as they have column percentage less than 5%.


After using either of the 2 methodologies, we need to replace the levels with the combined category in a raw data file. The raw data file looks like below -


Char AY
C1
C1
F1
E0
E0
A1
B1
D1
E0
A0
B1
D1
C1



Any help would be highly appreciated! Thanks in anticipation!

Super User
Posts: 11,343

Re: Collapse levels of a categorical variable

The easiest way is usually a custom format as you don't have to change data and can specify any one the formats at run time.

with your example:

proc format;

value $MyABC

"C","E","F" = "CEF"; /* or any string you want. */

run;

This format only changes the listed values, others would appear as, from your example, A B D

Multiple groups are possible within one format. There are numeric range instructions but results are often odd when attempting to use those with character variables.

Then run your proc with the line below added:

format CharA $MyABC. ;

to the Freq or other analysis proc.

Regular Contributor
Posts: 183

Re: Collapse levels of a categorical variable

Thanks for your response. First I need to check the categories that need to be combined in an automated manner. In the example, i know which categories need to be collapsed. I want this to be implemented via code. In a raw data file, first i need to calculate column percentage of a categorical (character) variable and then decide the categories to be collapsed and then replace them with the combined category.

Super User
Posts: 19,770

Re: Collapse levels of a categorical variable

Sure, write a macro to take the number of parameters you have, and the number you want and evaluate the rules Smiley Happy

It could also be considered a decision tree analysis if you have SAS OR, and there may be macros for base SAS - try lexjansen.com

Ask a Question
Discussion stats
  • 3 replies
  • 1140 views
  • 0 likes
  • 3 in conversation