Can we use proc univariate for getting a clue on g...
07-11-2012 02:52 PM

Hi Colleagues,

I have the attached data set with a single variable.

Problem:

I need to categorize these observations into meaningful income groups that are driven not by business logic but by the distribution of the data. I have no idea what cut-off points to use for the groups; my employer simply gave me a huge data set.

So I ran the code below to get a sense of how to decide the "income group boundaries":

proc univariate data = have plots;

var income;

run;

Then I got the following "Quantiles".

| Quantile | Estimate |
| --- | --- |
| 100% Max | 393 |
| 99% | 393 |
| 95% | 392 |
| 90% | 360 |
| 75% Q3 | 138 |
| 50% Median | 41 |
| 25% Q1 | 0 |
| 10% | 0 |
| 5% | -9 |
| 1% | -10 |
| 0% Min | -10 |

Based on the quantiles above, I decided on the following boundaries.

data want;

length Income_Range $20;

set have;

/* missing() also catches special missing values (.A-.Z, ._); "Income = ." would let them fall into the lowest group */

if missing(Income) then Income_Range = 'Missing';

else if Income <= -9 then Income_Range = '<= -9';

else if Income <= -8 then Income_Range = '> -9 to -8';

else if Income <= -1 then Income_Range = '> -8 to -1';

else if Income <= 0 then Income_Range = '> -1 to 0';

else if Income <= 41 then Income_Range = '> 0 to 41';

else if Income <= 392 then Income_Range = '> 41 to 392';

else Income_Range = '> 392';

run;
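One way to check that a step like this neither drops nor mislabels observations is to tabulate the result. The following is only a sketch, assuming the data set `want` with the variables `Income` and `Income_Range` created above:

```sas
/* Count observations per label; MISSING keeps blank labels in the table. */
/* The total should equal the number of observations in HAVE.             */
proc freq data=want;
    tables Income_Range / missing;
run;

/* Show the actual range of Income that landed under each label, */
/* so a label that does not match its contents is easy to spot.   */
proc means data=want n nmiss min max;
    class Income_Range / missing;
    var Income;
run;
```

If the minimum and maximum shown for a group do not agree with its label, either the label or the boundary in the IF/ELSE chain needs correcting.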

Question:

1). Can we use Univariate approach like this to decide "boundaries" for income range if we do not have any clue what the boundary cut offs are?

2). My SAS code works fine for this small data set, but code like this sometimes dangerously omits some data when applied to a large data set. Could an expert check that this code is error-free?

Thank you for the help

Mirisage

Accepted Solutions

Solution

07-11-2012 11:22 PM

You can use the percentiles from proc univariate to create 10 (or 20) equal-sized groups.

One way is to use proc rank with the groups=10 option:

SAS Code:

proc rank data = have groups = 10 out = out1;

var income;

ranks predgroup;

run;

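Here out1 will contain the rank variable predgroup, which takes the values 0 to 9, i.e. 10 groups of roughly equal size (tied income values can make the groups unequal).

Another way is to use proc univariate with an output statement: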
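To see which income values ended up in each rank group, the PROC RANK output can be summarized. This is a small sketch that assumes `out1` is the data set produced by the PROC RANK step above:

```sas
/* One row per rank group: group size and the income range it covers. */
/* MISSING also shows observations whose income (and rank) is missing. */
proc means data=out1 n min max;
    class predgroup / missing;
    var income;
run;
```

The min and max columns give the empirical boundaries of each group, and the n column shows how evenly the observations are spread.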

SAS Code:

proc univariate data = have;

var income;

output out=out1 pctlpts = 10 to 100 by 10 pctlpre = inc;

run;

This gives you the 10th, 20th, ..., 90th and 100th percentiles as the variables inc10, inc20, ..., inc100 in a one-row data set.

You can then use proc format to build a format from these cut points, which can easily be reused across different programs.
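The percentile cut points can then be applied back to the data. The step below is only a sketch: it assumes `out1` is the one-row PROC UNIVARIATE output (variables inc10 to inc100), and the names `want_deciles`, `decile` and `i` are made up for illustration.

```sas
data want_deciles;                      /* hypothetical output name */
    if _n_ = 1 then set out1;           /* read the cut points once; SET variables are retained */
    set have;
    array cut{10} inc10 inc20 inc30 inc40 inc50 inc60 inc70 inc80 inc90 inc100;
    decile = .;                         /* stays missing when income is missing */
    if not missing(income) then do i = 1 to 10;
        if income <= cut{i} then do;
            decile = i;
            leave;                      /* stop at the first cut point that covers the value */
        end;
    end;
    /* List the helper variables explicitly: "drop inc:" would also drop income */
    drop i inc10 inc20 inc30 inc40 inc50 inc60 inc70 inc80 inc90 inc100;
run;
```

When several percentiles are equal (for example, many incomes of exactly 0), all tied values go to the lowest matching group, so some groups can be empty.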
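As an illustration of the format approach, here is a sketch that uses the cut points from the original question rather than computed percentiles. The format name `incgrp` is an assumption chosen for this example.

```sas
proc format;
    value incgrp
        low  -  -9  = '<= -9'
        -9  <-  -8  = '> -9 to -8'
        -8  <-  -1  = '> -8 to -1'
        -1  <-   0  = '> -1 to 0'
         0  <-  41  = '> 0 to 41'
        41  <- 392  = '> 41 to 392'
       392  <- high = '> 392'
        other       = 'Missing';   /* LOW excludes missing values, so they fall through to OTHER */
run;

/* Group on the fly: no new data set or character variable is needed */
proc freq data=have;
    tables income / missing;
    format income incgrp.;
run;
```

Because the ranges cover everything from low to high, every nonmissing value gets exactly one label, and changing the boundaries later only means editing the format.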

There are some advantages in using proc univariate:

1. A WEIGHT statement is available in proc univariate (not in proc rank).

2. You don't have to create the whole data set again, so it is simple.
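A sketch of the weighted variant follows. Note that the data set in the question has a single variable, so the weight variable `wt`, the output data set `wpctl` and the prefix `winc` are all hypothetical names used only for illustration.

```sas
proc univariate data=have noprint;
    var income;
    weight wt;      /* hypothetical weight variable, not part of the original data */
    output out=wpctl pctlpts=10 to 100 by 10 pctlpre=winc;
run;

/* One row holding the weighted cut points winc10, winc20, ..., winc100 */
proc print data=wpctl;
run;
```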

Thanks,

zilok

All Replies

07-12-2012 09:05 AM

Question:

1). Can we use Univariate approach like this to decide "boundaries" for income range if we do not have any clue what the boundary cut offs are?

Yes, of course, you can take SAS (or any other software) and chop up a continuous variable into "groups". But just because you can doesn't mean you should. The information contained in a continuous variable like income is destroyed by chopping it up into intervals. Depending on what you plan to do with this data, your best approach may be to leave the data as continuous, not groups. And even so, if you MUST have intervals, the proper intervals depend on what you plan to do with this data, and you haven't told us. You asked for "meaningful income groups", but your result is empirical groups, not meaningful groups.

I think you are going down a dangerous path. You are trying to impose a statistical solution to a problem that logically does not have a statistical solution (at least, not until we know more about what you plan to do, and then groups might not be the best answer).

07-13-2012 01:24 PM

You could also add a HISTOGRAM statement (histogram income;) after the VAR statement in proc univariate to see what bins SAS comes up with.
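In PROC UNIVARIATE the histogram is requested with its own HISTOGRAM statement. A minimal sketch, assuming the data set `have` from the question; the output data set name `bins` is made up for this example:

```sas
proc univariate data=have;
    var income;
    /* OUTHISTOGRAM= saves the bins SAS chose, so they can be inspected as data */
    histogram income / outhistogram=bins;
run;

proc print data=bins;
run;
```

The printed data set lists the bins behind the plot, which can serve as a starting point for candidate boundaries.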

07-13-2012 09:32 PM

Hi zilok, PaigeMiller and darrylovia,

Thanks very much to all of you. I learned a lot from your statistical and SAS expertise.

Warm regards

Mirisage