Dear all,
I have a huge dataset containing firms and how much their number of employees grow, the definition is so that the growth is always between -200 and 200 ( employees(t+1)-employees(t))/ average ).
What I wanna do is group the employee growths in, say, 50 bins of equal length and indicate that with a variable with the value of the center of the interval, in a smart way.
I would also like to know a nice way to do even if the bins doesn't have the same width so like they were: [-200,-190] , [-190,-180] ... [20,15],[15,10] etc.
An example to clarify, in the case of [-200,200] and 40 bins: All the firms with an employee growth of between say 10 and 20 I would like to give the value 15 in a new variable, all the ones between 20 and 30, 25 etc.
Hope my idea is clear and someone is willing to help (I made a really ugly macro which works, but now I have to change the bins and it's going to be hell) plus I wanna become better at SAS
best regards,
Henrik
Proc Rank might be of help for creating a specified number of groups.
Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */
var growth; /* or whatever your variable is*/
ranks growthgroup; /* what ever variable you would like to hold the group numbers*/
run;
The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.
You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.
Proc means data=<above output set>;
class growthgroup;
var growth;
output out=<another output set > (request statistics and name them here).
This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.
It is best to provide an example of input of output in an easily readable format to help people understand your problem alongside your verbal explanation.
Thank you for the tip! I can output some of the data, is there a smart way to insert it into my question?
You can edit the question or just add a new reply. You could used the advanced editor to manually enter of table of table or just copy and paste the information from somewhere. It should represent your real data as close as possible but definitely does not have to be your real data.
Are you looking for the statistic estimator MEAN or MEDIAN ?
Ksharp
Here's one approach to binning. Unfortunately, it requires processing the "huge" data set a few times. This process assumes you already have added GROWTH to every observation, and just assigns BIN. It is up to you to decide what additional information you want to assign to each BIN.
proc sort data=have;
by growth;
run;
data want;
set have nobs=_nobs_;
bin = ceil(50 * _n_ / _nobs_);
run;
I like to add a further wrinkle. Just in case multiple records have the same value for GROWTH, I like to group them all into the same BIN, even if it means the BIN sizes will be uneven. Here is that variation;
proc sort data=have;
by growth;
run;
data want;
set have nobs=_nobs_;
by growth;
if first.growth then bin = ceil(50 * _n_ / _nobs_);
retain bin;
run;
Good luck.
** If you want to control your bin ranges, a proc format works great and is simple and efficient;
** In the ranges, you can use the "less than" symbol either before or after the dash to control the endpoints when you know you have continuous numbers;
proc format ;
value bins
-200 -< -150 = '01'
-150 -< -125 = '02'
-125 -< 0 = '03'
0 -< 25 = '04'
25 -< 50 = '05'
50 -< 75 = '06'
75 -< 100 = '07'
100 -< 125 = '08'
125 -< 150 = '09'
150 -< 175 = '10'
175 - 200 = '11'
other = '99'
;
run ;
** the output of the format is character by definition ;
** but you could use either an implicit or explicit conversion to a numeric;
** if that is what you ultimately want ;
data new ;
set old ;
length bin_char $2 bin_num 8 ;
bin_char = put(growth, bins.) ;
** implicit conversion, as long as you define the variable with a length statement first ;
bin_num = put(growth, bins.) ;
run ;
I want to do exactly that, but for 240 groups
0- 0.5
0.5-1.0
1.0-1.5
and so on
119.5-120.0
Do you have any solution for that (without writing 240 lines)?
** by the way, this method is especially useful when you purposefully do NOT want equal-sized bins;
proc format ;
value bins
low -<0 = '-100'
0 = '0'
0 <-< 25 = '12.5'
25 -< 50 = '37.5'
50 -< 100 = '75'
100 -< 200 = '150'
200 - high = '1000'
;
run ;
Oh, I just realized that you want to assign the midpoint of the range - you could do that too. See above.
Although, if you want 40 bins, this process might be tedious, and a mathematical one might make more sense.
(I usually want a small number of bins if I do this.)
Message was edited by: Carla Wilson
Proc Rank might be of help for creating a specified number of groups.
Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */
var growth; /* or whatever your variable is*/
ranks growthgroup; /* what ever variable you would like to hold the group numbers*/
run;
The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.
You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.
Proc means data=<above output set>;
class growthgroup;
var growth;
output out=<another output set > (request statistics and name them here).
This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.