BookmarkSubscribeRSS Feed
Ujjawal
Quartz | Level 8

Hi Team,

Is there a way we can collapse the levels of categorical variables in an automated manner? One way i know is to calculate the % of events falling in each category and run cluster analysis on it. Can this be automated for multiple variables considering only event rate, not cluster analysis?

For example, there is a variable called "Char A". I have calculated Event Rate for this variable, i.e. percentage of 1s appearing in dependent variable (let's say VarY). It is simply the mean of Y.

Char AEvent Rate
A49%
B67%
C2%
D87%
E4%
F3%

Next step is to combine categories of C,E and F as they have almost similar event rate (let's say, variation within 5%).

Method 2 : If the above methodology is complicated to automate, can we make it simple taking only the number of cases falling in Char A (not considering dependent variable). For example, if a categorical level contains atmost  5% observations, combine it with others which all have percentage less than 5%,

Char AColPct
C2%
F3%
E4%
A49%
B67%
D87%

Combine categories of C,E and F as they have column percentage less than 5%.


After using either of the 2 methodologies, we need to replace the levels with the combined category in a raw data file. The raw data file looks like below -


Char AY
C1
C1
F1
E0
E0
A1
B1
D1
E0
A0
B1
D1
C1



Any help would be highly appreciated! Thanks in anticipation!

3 REPLIES 3
ballardw
Super User

The easiest way is usually a custom format as you don't have to change data and can specify any one the formats at run time.

with your example:

proc format;

value $MyABC

"C","E","F" = "CEF"; /* or any string you want. */

run;

This format only changes the listed values, others would appear as, from your example, A B D

Multiple groups are possible within one format. There are numeric range instructions but results are often odd when attempting to use those with character variables.

Then run your proc with the line below added:

format CharA $MyABC. ;

to the Freq or other analysis proc.

Ujjawal
Quartz | Level 8

Thanks for your response. First I need to check the categories that need to be combined in an automated manner. In the example, i know which categories need to be collapsed. I want this to be implemented via code. In a raw data file, first i need to calculate column percentage of a categorical (character) variable and then decide the categories to be collapsed and then replace them with the combined category.

Reeza
Super User

Sure, write a macro to take the number of parameters you have, and the number you want and evaluate the rules Smiley Happy

It could also be considered a decision tree analysis if you have SAS OR, and there may be macros for base SAS - try lexjansen.com

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 4307 views
  • 0 likes
  • 3 in conversation