BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
zuki
Calcite | Level 5

Dear all,

I have a huge dataset containing firms and how much their number of employees grow, the definition is so that the growth is always between -200 and 200 ( employees(t+1)-employees(t))/ average ).

What I wanna do is group the employee growths in, say, 50 bins of equal length and indicate that with a variable with the value of the center of the interval, in a smart way.

I would also like to know a nice way to do even if the bins doesn't have the same width so like they were: [-200,-190] , [-190,-180] ... [20,15],[15,10] etc.

An example to clarify, in the case of [-200,200] and 40 bins: All the firms with an employee growth of between say 10  and 20  I would like to give the value 15 in a new variable, all the ones between 20 and 30, 25 etc.

Hope my idea is clear and someone is willing to help (I made a really ugly macro which works, but now I have to change the bins and it's going to be hell) plus I wanna become better at SAS Smiley Happy

best regards,

Henrik

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.


View solution in original post

9 REPLIES 9
FriedEgg
SAS Employee

It is best to provide an example of input of output in an easily readable format to help people understand your problem alongside your verbal explanation.

zuki
Calcite | Level 5

Thank you for the tip! I can output some of the data, is there a smart way to insert it into my question?

FriedEgg
SAS Employee

You can edit the question or just add a new reply.  You could used the advanced editor to manually enter of table of table or just copy and paste the information from somewhere.  It should represent your real data as close as possible but definitely does not have to be your real data.

Ksharp
Super User

Are you looking for the statistic estimator MEAN or MEDIAN ?

Ksharp

Astounding
PROC Star

Here's one approach to binning.  Unfortunately, it requires processing the "huge" data set a few times.  This process assumes you already have added GROWTH to every observation, and just assigns BIN.  It is up to you to decide what additional information you want to assign to each BIN.

proc sort data=have;

   by growth;

run;

data want;

   set have nobs=_nobs_;

   bin = ceil(50 * _n_ / _nobs_);

run;

I like to add a further wrinkle.  Just in case multiple records have the same value for GROWTH, I like to group them all into the same BIN, even if it means the BIN sizes will be uneven.  Here is that variation;

proc sort data=have;

   by growth;

run;

data want;

   set have nobs=_nobs_;

   by growth;

   if first.growth then bin = ceil(50 * _n_ / _nobs_);

   retain bin;

run;

Good luck.

cwilson
Calcite | Level 5

** If you want to control your bin ranges, a proc format works great and is simple and efficient;
** In the ranges, you can use the "less than" symbol either before or after the dash to control the endpoints when you know you have continuous numbers;

proc format ;
value bins
-200 -< -150 = '01'
-150 -< -125 = '02'
-125 -<    0 = '03'
   0 -<   25 = '04'
  25 -<   50 = '05'
  50 -<   75 = '06'
  75 -<  100 = '07'
100 -<  125 = '08'
125 -<  150 = '09'
150 -<  175 = '10'
175 -   200 = '11'
other       = '99'
;
run ;

** the output of the format is character by definition ;
** but you could use either an implicit or explicit conversion to a numeric;
** if that is what you ultimately want ;
data new ;
  set old ;
  length bin_char $2 bin_num 8 ;
  bin_char = put(growth, bins.) ;
  ** implicit conversion, as long as you define the variable with a length statement first ;
  bin_num  = put(growth, bins.) ;
run ;

TobbeNord
Obsidian | Level 7

I want to do exactly that, but for 240 groups

0- 0.5 

0.5-1.0

1.0-1.5 

and so on

119.5-120.0

 

Do you have any solution for that (without writing 240 lines)?

 

 

cwilson
Calcite | Level 5

** by the way, this method is especially useful when you purposefully do NOT want equal-sized bins;

proc format ;

value bins

low -<0 = '-100'

0 = '0'

0 <-< 25 = '12.5'

25 -< 50 = '37.5'

50 -< 100 = '75'

100 -< 200 = '150'

200 - high = '1000'

;

run ;

Oh, I just realized that you want to assign the midpoint of the range - you could do that too.  See above.

Although, if you want 40 bins, this process might be tedious, and a mathematical one might make more sense.

(I usually want a small number of bins if I do this.)

Message was edited by: Carla Wilson

ballardw
Super User

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.


sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 29843 views
  • 1 like
  • 7 in conversation