Hi Team,
Is there a way we can collapse the levels of categorical variables in an automated manner? One way i know is to calculate the % of events falling in each category and run cluster analysis on it. Can this be automated for multiple variables considering only event rate, not cluster analysis?
For example, there is a variable called "Char A". I have calculated Event Rate for this variable, i.e. percentage of 1s appearing in dependent variable (let's say VarY). It is simply the mean of Y.
Char A | Event Rate |
A | 49% |
B | 67% |
C | 2% |
D | 87% |
E | 4% |
F | 3% |
Next step is to combine categories of C,E and F as they have almost similar event rate (let's say, variation within 5%).
Method 2 : If the above methodology is complicated to automate, can we make it simple taking only the number of cases falling in Char A (not considering dependent variable). For example, if a categorical level contains atmost 5% observations, combine it with others which all have percentage less than 5%,
Char A | ColPct |
C | 2% |
F | 3% |
E | 4% |
A | 49% |
B | 67% |
D | 87% |
Combine categories of C,E and F as they have column percentage less than 5%.
After using either of the 2 methodologies, we need to replace the levels with the combined category in a raw data file. The raw data file looks like below -
Char A | Y |
C | 1 |
C | 1 |
F | 1 |
E | 0 |
E | 0 |
A | 1 |
B | 1 |
D | 1 |
E | 0 |
A | 0 |
B | 1 |
D | 1 |
C | 1 |
Any help would be highly appreciated! Thanks in anticipation!
The easiest way is usually a custom format as you don't have to change data and can specify any one the formats at run time.
with your example:
proc format;
value $MyABC
"C","E","F" = "CEF"; /* or any string you want. */
run;
This format only changes the listed values, others would appear as, from your example, A B D
Multiple groups are possible within one format. There are numeric range instructions but results are often odd when attempting to use those with character variables.
Then run your proc with the line below added:
format CharA $MyABC. ;
to the Freq or other analysis proc.
Thanks for your response. First I need to check the categories that need to be combined in an automated manner. In the example, i know which categories need to be collapsed. I want this to be implemented via code. In a raw data file, first i need to calculate column percentage of a categorical (character) variable and then decide the categories to be collapsed and then replace them with the combined category.
Sure, write a macro to take the number of parameters you have, and the number you want and evaluate the rules
It could also be considered a decision tree analysis if you have SAS OR, and there may be macros for base SAS - try lexjansen.com
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.