Hi Everyone,
Please can someone let me know the ways in which a continuous variable can be converted into bins (buckets)?
Suggest alternative coding ways either using proc sql, data step or sas macro.
Thank you!
Big advantages of using the format approach for this:
1) you don't have to add any variables, so no additional passes through the entire data
2) you can assign the format as needed for specific analysis
3) one format can be assigned to multiple variables that use the same rule
4) multiple formats to regroup data as needed, just change the format for the proc at run time.
5) as long as values of only a single variable are considered it can be much easier to construct then if/then/else logic and checks to ensure that one start value is treated the same all the time (exception: multilabel but that's a bit more advanced and only useable for some procs)
6) used with special missing assignments .A - .Z you can get descriptions for what that missing means such as not asked, original value out of bounds (and differentiate between high and low), instrument failure, skip pattern and a plethora of conditions. And the missing values are still excluded from calculations such as means.
And your example data in the form of a datastep, plus required output is? At a guess:
data have;
do i=1 to 1000;
output;
end;
run;
data want;
set have;
bucket=floor(_n_ / 50)+1;
run;
My preferred way is format.
proc format;
value age low-17 = '< 18 yrs'
18-64 = '18-65 yrs'
65-high = '> 65 yrs';
run;
data want;
set have;
length agegrp $ 15;
agegrp = put(age, age.);
run;
Hth,
Eric
I like Proc Format for continuous numeric data.
You can use the LT symbol to get up to a boundary
Proc format ;
value agegrp
low - <18 = '0-18'
18-<25 = '18-25'
25-<35 = '25-35'
...
85-high = '85 and over' ;
run ;
Ha! Eric beat me to it!
Big advantages of using the format approach for this:
1) you don't have to add any variables, so no additional passes through the entire data
2) you can assign the format as needed for specific analysis
3) one format can be assigned to multiple variables that use the same rule
4) multiple formats to regroup data as needed, just change the format for the proc at run time.
5) as long as values of only a single variable are considered it can be much easier to construct then if/then/else logic and checks to ensure that one start value is treated the same all the time (exception: multilabel but that's a bit more advanced and only useable for some procs)
6) used with special missing assignments .A - .Z you can get descriptions for what that missing means such as not asked, original value out of bounds (and differentiate between high and low), instrument failure, skip pattern and a plethora of conditions. And the missing values are still excluded from calculations such as means.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.