turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Creating bins in a dataset

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-18-2012 05:32 PM

Dear all,

I have a huge dataset containing firms and how much their number of employees grow, the definition is so that the growth is always between -200 and 200 ( employees(t+1)-employees(t))/ average ).

What I wanna do is group the employee growths in, say, 50 bins of equal length and indicate that with a variable with the value of the center of the interval, in a smart way.

I would also like to know a nice way to do even if the bins doesn't have the same width so like they were: [-200,-190] , [-190,-180] ... [20,15],[15,10] etc.

An example to clarify, in the case of [-200,200] and 40 bins: All the firms with an employee growth of between say 10 and 20 I would like to give the value 15 in a new variable, all the ones between 20 and 30, 25 etc.

Hope my idea is clear and someone is willing to help (I made a really ugly macro which works, but now I have to change the bins and it's going to be hell) plus I wanna become better at SAS

best regards,

Henrik

Accepted Solutions

Solution

3 weeks ago

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 12:59 PM

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-18-2012 05:40 PM

It is best to provide an example of input of output in an easily readable format to help people understand your problem alongside your verbal explanation.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-18-2012 05:55 PM

Thank you for the tip! I can output some of the data, is there a smart way to insert it into my question?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-18-2012 06:01 PM

You can edit the question or just add a new reply. You could used the advanced editor to manually enter of table of table or just copy and paste the information from somewhere. It should represent your real data as close as possible but definitely does not have to be your real data.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 12:39 AM

Are you looking for the statistic estimator MEAN or MEDIAN ?

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 09:29 AM

Here's one approach to binning. Unfortunately, it requires processing the "huge" data set a few times. This process assumes you already have added GROWTH to every observation, and just assigns BIN. It is up to you to decide what additional information you want to assign to each BIN.

proc sort data=have;

by growth;

run;

data want;

set have nobs=_nobs_;

bin = ceil(50 * _n_ / _nobs_);

run;

I like to add a further wrinkle. Just in case multiple records have the same value for GROWTH, I like to group them all into the same BIN, even if it means the BIN sizes will be uneven. Here is that variation;

proc sort data=have;

by growth;

run;

data want;

set have nobs=_nobs_;

by growth;

if first.growth then bin = ceil(50 * _n_ / _nobs_);

retain bin;

run;

Good luck.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 09:56 AM

** If you want to control your bin ranges, a proc format works great and is simple and efficient;

** In the ranges, you can use the "less than" symbol either before or after the dash to control the endpoints when you know you have continuous numbers;

proc format ;

value bins

-200 -< -150 = '01'

-150 -< -125 = '02'

-125 -< 0 = '03'

0 -< 25 = '04'

25 -< 50 = '05'

50 -< 75 = '06'

75 -< 100 = '07'

100 -< 125 = '08'

125 -< 150 = '09'

150 -< 175 = '10'

175 - 200 = '11'

other = '99'

;

run ;

** the output of the format is character by definition ;

** but you could use either an implicit or explicit conversion to a numeric;

** if that is what you ultimately want ;

data new ;

set old ;

length bin_char $2 bin_num 8 ;

bin_char = put(growth, bins.) ;

** implicit conversion, as long as you define the variable with a length statement first ;

bin_num = put(growth, bins.) ;

run ;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Tuesday

I want to do exactly that, but for 240 groups

0- 0.5

0.5-1.0

1.0-1.5

and so on

119.5-120.0

Do you have any solution for that (without writing 240 lines)?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 10:13 AM

** by the way, this method is especially useful when you purposefully do NOT want equal-sized bins;

proc format ;

value bins

low -<0 = '-100'

0 = '0'

0 <-< 25 = '12.5'

25 -< 50 = '37.5'

50 -< 100 = '75'

100 -< 200 = '150'

200 - high = '1000'

;

run ;

Oh, I just realized that you want to assign the midpoint of the range - you could do that too. See above.

Although, if you want 40 bins, this process might be tedious, and a mathematical one might make more sense.

(I usually want a small number of bins if I do this.)

Message was edited by: Carla Wilson

Solution

3 weeks ago

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-19-2012 12:59 PM

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.