Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- Creating bins in a dataset

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-18-2012 05:32 PM
(29411 views)

Dear all,

I have a huge dataset containing firms and how much their number of employees grow, the definition is so that the growth is always between -200 and 200 ( employees(t+1)-employees(t))/ average ).

What I wanna do is group the employee growths in, say, 50 bins of equal length and indicate that with a variable with the value of the center of the interval, in a smart way.

I would also like to know a nice way to do even if the bins doesn't have the same width so like they were: [-200,-190] , [-190,-180] ... [20,15],[15,10] etc.

An example to clarify, in the case of [-200,200] and 40 bins: All the firms with an employee growth of between say 10 and 20 I would like to give the value 15 in a new variable, all the ones between 20 and 30, 25 etc.

Hope my idea is clear and someone is willing to help (I made a really ugly macro which works, but now I have to change the bins and it's going to be hell) plus I wanna become better at SAS

best regards,

Henrik

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Are you looking for the statistic estimator MEAN or MEDIAN ?

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Here's one approach to binning. Unfortunately, it requires processing the "huge" data set a few times. This process assumes you already have added GROWTH to every observation, and just assigns BIN. It is up to you to decide what additional information you want to assign to each BIN.

proc sort data=have;

by growth;

run;

data want;

set have nobs=_nobs_;

bin = ceil(50 * _n_ / _nobs_);

run;

I like to add a further wrinkle. Just in case multiple records have the same value for GROWTH, I like to group them all into the same BIN, even if it means the BIN sizes will be uneven. Here is that variation;

proc sort data=have;

by growth;

run;

data want;

set have nobs=_nobs_;

by growth;

if first.growth then bin = ceil(50 * _n_ / _nobs_);

retain bin;

run;

Good luck.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

** If you want to control your bin ranges, a proc format works great and is simple and efficient;

** In the ranges, you can use the "less than" symbol either before or after the dash to control the endpoints when you know you have continuous numbers;

proc format ;

value bins

-200 -< -150 = '01'

-150 -< -125 = '02'

-125 -< 0 = '03'

0 -< 25 = '04'

25 -< 50 = '05'

50 -< 75 = '06'

75 -< 100 = '07'

100 -< 125 = '08'

125 -< 150 = '09'

150 -< 175 = '10'

175 - 200 = '11'

other = '99'

;

run ;

** the output of the format is character by definition ;

** but you could use either an implicit or explicit conversion to a numeric;

** if that is what you ultimately want ;

data new ;

set old ;

length bin_char $2 bin_num 8 ;

bin_char = put(growth, bins.) ;

** implicit conversion, as long as you define the variable with a length statement first ;

bin_num = put(growth, bins.) ;

run ;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I want to do exactly that, but for 240 groups

0- 0.5

0.5-1.0

1.0-1.5

and so on

119.5-120.0

Do you have any solution for that (without writing 240 lines)?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

** by the way, this method is especially useful when you purposefully do NOT want equal-sized bins;

proc format ;

value bins

low -<0 = '-100'

0 = '0'

0 <-< 25 = '12.5'

25 -< 50 = '37.5'

50 -< 100 = '75'

100 -< 200 = '150'

200 - high = '1000'

;

run ;

Oh, I just realized that you want to assign the midpoint of the range - you could do that too. See above.

Although, if you want 40 bins, this process might be tedious, and a mathematical one might make more sense.

(I usually want a small number of bins if I do this.)

Message was edited by: Carla Wilson

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Proc Rank might be of help for creating a specified number of groups.

Proc rank data=<your data set name> out=<output data set> groups=50;/*or any number of groups wanted */

var growth; /* or whatever your variable is*/

ranks growthgroup; /* what ever variable you would like to hold the group numbers*/

run;

The output set will have the group variable with values of 0 to n-1 where n is the requested number of groups.

You can then send that output through PROC MEANS or SUMMARY to get the min, max or any statistic wanted by group.

Proc means data=<above output set>;

class growthgroup;

var growth;

output out=<another output set > (request statistics and name them here).

This output set if using max and min could be used to create an format using MIN and MAX, you could even assign the Mean and or median as the label for the group.

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.