Good morning,
I am trying to group "noisy", insignificant observations into a single category so that I can focus on the observations that really matter. I have been looking into whether I can use conditional logic with PROC FREQ. The logic I am trying to implement is the following:
"If a host appears fewer than five times, group it into 'other'."
In other words, regroup all hosts that appear fewer than five times into a single category named "other".
I get the table listed below, but I would like to group yahoo and aol into "other".
data host_list;
input host $;
datalines;
amazon
amazon
amazon
amazon
amazon
aol
yahoo
yahoo
;
run;
proc freq data=host_list order=freq;
tables host;
run;
host | Frequency | Cumulative Frequency
amazon | 5 | 5
(host name missing) | 5 | 10
yahoo | 2 | 12
aol | 1 | 13
Can anybody help?
Thank you
Something data-driven, wouldn't you say?
Message was edited by: data _null_ driveN
I don't think you can do it in PROC FREQ itself. You could add
if host in ('yahoo','aol') then host='other';
to your data step.
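For reference, here is a minimal sketch of that hard-coded recode in context. The output data set name host_recoded is made up for illustration, and the list of hosts still has to be maintained by hand, so this only suits a small, stable set of values.

```sas
/* Copy the input and collapse the hand-picked hosts into 'other'.   */
/* host_recoded is a hypothetical name; host_list is from the post.  */
data host_recoded;
  set host_list;
  if host in ('yahoo','aol') then host='other';
run;

/* Tabulate the recoded values, most frequent first. */
proc freq data=host_recoded order=freq;
  tables host;
run;
```

Because 'other' is shorter than the default length of host (8 characters, from input host $;), the assignment does not truncate anything here.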
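You could create a custom format that assigns the unwanted values to that category.
proc format;
value $oth (default=32)
'yahoo','aol'='other';
run;
/* DEFAULT=32 keeps values that are not listed in the format (such as amazon) from being truncated to the width of 'other'. The comparison is case-sensitive, so if your actual data has different capitalization, standardize it first (for example with LOWCASE in a data step); as far as I know the UPCASE option is available only for informats (INVALUE), not for VALUE. */
proc freq data=host_list order=freq;
format host $oth.;
run;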
Automating the solution is the hard part. Here's an approach ... I'm hoping someone else can come up with the code because I don't think I'll have the time to do it.
1. Run your PROC FREQ, but don't print the results. Instead, send the results to an output data set.
2. Subset the output data set, taking all those observations having COUNT < 5.
3. Prepare that subset to become a format: add FMTNAME and LABEL.
4. Create the format using the step 3 results as a CNTLIN= data set.
5. Re-run PROC FREQ, applying the format.
There will be smaller issues ... what will happen if all the original counts are 5 or more? And the order of the rows in the table might change. (It's even possible that the Other category will have the highest count and will print first.)
But at least there's an approach to work with.
Good luck.
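The five steps above could be sketched roughly as follows. This is an untested outline rather than a finished solution: the data set names host_counts and other_fmt are made up for illustration, the format name $oth is borrowed from the earlier reply, and the threshold of 5 comes from the original question.

```sas
/* Step 1: count each host without printing; counts go to host_counts. */
proc freq data=host_list noprint;
  tables host / out=host_counts;
run;

/* Steps 2-3: keep hosts with COUNT < 5 and shape them as CNTLIN input. */
/* START holds the raw value, LABEL the text to display instead.        */
data other_fmt;
  set host_counts(rename=(host=start));
  where count < 5;
  retain fmtname '$oth' type 'C' label 'other';
  keep fmtname type start label;
run;

/* Step 4: build the $OTH format from the control data set. */
proc format cntlin=other_fmt;
run;

/* Step 5: re-run PROC FREQ with the format applied.                    */
/* The explicit width (32) keeps hosts that are not in the format from  */
/* being truncated to the length of 'other'.                            */
proc freq data=host_list order=freq;
  tables host;
  format host $oth32.;
run;
```

On the issues mentioned above: if every host has a count of 5 or more, other_fmt is empty and no $oth format is created, so step 5 would fail with a format-not-found error unless you check for that case first. With ORDER=FREQ the "other" row is sorted by its combined count, so it can land anywhere in the table, including first.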