Mirisage
Obsidian | Level 7

Hi Colleagues,

I have the attached data set with a single variable.

Problem:

I need to categorize these observations into meaningful income groups that are driven not by business logic but by the distribution of the data. I have no idea what cut-off points to use for the groups. I was simply given a huge data set by my employer.

So I ran the code below to get a sense of how to decide the "income group boundaries":

proc univariate data = have plots;
  var income;
run;

Then I got the following "Quantiles".

Quantile      Estimate
100% Max      393
99%           393
95%           392
90%           360
75% Q3        138
50% Median    41
25% Q1        0
10%           0
5%            -9
1%            -10
0% Min        -10

Based on the above quantiles, I decided on the following boundaries.

data want;
  length Income_Range $20;
  set have;
  if Income = . then Income_Range = 'Missing';
  else if Income <= -9  then Income_Range = '<=-9';
  else if Income <= -8  then Income_Range = '-8 to -9';
  else if Income <= -1  then Income_Range = '-1 to -8';
  else if Income <= 0   then Income_Range = '-7 to 0';
  else if Income <= 41  then Income_Range = '1-41';
  else if Income <= 392 then Income_Range = '42-392';
  else Income_Range = '>392';
run;

Question:

1). Can we use the Univariate approach like this to decide "boundaries" for income ranges if we have no idea what the boundary cut-offs should be?

2). My SAS code looks fine for this small data set, but code like this sometimes dangerously omits data when applied to a large data set. Could an expert check that this code is error-free?

Thank you for the help

Mirisage


Accepted Solutions
zilok
Calcite | Level 5

You can use proc univariate (or proc rank) to create 10 or 20 equal-sized groups.

One way is to use proc rank with the groups=10 option:

SAS Code:

proc rank data = have groups = 10 out = out1;
  var income;
  ranks predgroup;
run;

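Here out1 will contain the rank variable predgroup, taking values 0 to 9, i.e. 10 roughly equal-sized groups (tied values are placed in the same group, so the sizes can be uneven).

Another way is to use proc univariate with an output statement: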
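As a quick check on how even those groups turn out, a frequency table and per-group minimum and maximum can help. This is only a minimal sketch, assuming the out1 data set and predgroup variable from the proc rank step above; with heavily tied values (such as many zero incomes) some groups may be much larger than others.

```sas
/* Count observations per rank group; MISSING also shows rows where income is missing */
proc freq data = out1;
  tables predgroup / missing;
run;

/* Show the income range that each rank group actually covers */
proc means data = out1 n min max;
  class predgroup;
  var income;
run;
```

The min and max per group are the boundaries the ranking implies, which can be compared with hand-picked cut-offs.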

SAS Code:

proc univariate data = have;
  var income;
  output out=out1 pctlpts = 10 to 100 by 10 pctlpre = inc;
run;

This gives you the 10%, 20%, ..., 90% and 100% percentile points, stored in out1 as the variables inc10, inc20, ..., inc100.

You can then use proc format to build a format from these cut points, which can easily be reused across different programs.

There are some advantages to using proc univariate:

1. The WEIGHT statement is available in proc univariate (it is not available in proc rank).

2. You don't have to create the whole data set again; you only need the cut points, which is simpler.
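One way the percentile-plus-format idea could look is sketched below. It is an illustration under assumptions, not a definitive implementation: the quartile cut points, the data set name cuts, the macro variables q1, med and q3, and the format name incgrp are all hypothetical choices, and the sketch assumes the three cut points are distinct (tied percentiles would produce an empty range).

```sas
/* Step 1: write the chosen percentiles to a one-row data set (names are assumptions) */
proc univariate data = have noprint;
  var income;
  output out = cuts pctlpts = 25 50 75 pctlpre = inc;
run;

/* Step 2: copy the cut points into macro variables */
data _null_;
  set cuts;
  call symputx('q1',  inc25);
  call symputx('med', inc50);
  call symputx('q3',  inc75);
run;

/* Step 3: define a reusable format; "<-" excludes the lower end of a range */
proc format;
  value incgrp
    .             = 'Missing'
    low  -  &q1   = 'Up to Q1'
    &q1  <- &med  = 'Q1 to median'
    &med <- &q3   = 'Median to Q3'
    &q3  <- high  = 'Above Q3';
run;

/* Step 4: apply the format without rebuilding the data set */
proc freq data = have;
  tables income / missing;
  format income incgrp.;
run;
```

Because the ranges run from low to high with no gaps, every non-missing income falls into exactly one group, which guards against silently dropped values.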

Thanks,

zilok


PaigeMiller
Diamond | Level 26

Question:

1). Can we use the Univariate approach like this to decide "boundaries" for income ranges if we have no idea what the boundary cut-offs should be?

Yes, of course, you can take SAS (or any other software) and chop up a continuous variable into "groups". But just because you can doesn't mean you should. The information contained in a continuous variable like income is destroyed by chopping it up into intervals. Depending on what you plan to do with this data, your best approach may be to leave the data as continuous, not groups. And even so, if you MUST have intervals, the proper intervals depend on what you plan to do with this data, and you haven't told us. You asked for "meaningful income groups", but your result is empirical groups, not meaningful groups.

I think you are going down a dangerous path. You are trying to impose a statistical solution to a problem that logically does not have a statistical solution (at least, not until we know more about what you plan to do, and then groups might not be the best answer).

--
Paige Miller
darrylovia
Quartz | Level 8

You could also add a HISTOGRAM statement after the VAR statement (var income; histogram income;) to see what bins SAS comes up with.
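A minimal sketch of that suggestion, assuming the same have data set. In proc univariate, HISTOGRAM is written as its own statement; the OUTHISTOGRAM= data set name bins is a hypothetical choice.

```sas
proc univariate data = have noprint;
  var income;
  /* Draw a histogram with SAS's default bins and save the bin information */
  histogram income / outhistogram = bins;
run;

/* List the bins SAS chose, with the percentage of observations in each */
proc print data = bins;
run;
```

The default bins are chosen for display, so they are a starting point for eyeballing the distribution rather than ready-made group boundaries.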

Mirisage
Obsidian | Level 7

Hi zilok, PaigeMiller and darrylovia,

Thanks very much to all of you. I learned a lot from your statistical and SAS expertise.

Warm regards

Mirisage
