Selecting data below 20th percentile

sap98fcs · Posted 06-11-2017 10:01 PM

Dear all,

I am handling the expenditure statistics of a city. I have about 1.5K sample, each of which is a household with different household sizes and is grossed up (i.e representing different no. of housholds). They look like these:

Serial household size Houshold expense Grossing up factor

0001 3                 8064.63    220
0002 3               16830.25    260
0003 6               3755.18   810
0004 3               16028.58    400
0005     3               22617.33    250
0006     2                   24844.37   320
0007    2                   32191.85   270
0008     2               22815.47   310
0009     3               16298.62   220
0010     2                 16850.27   320

For analysis, I have to select the 20% households with lowest expense of different household size.

Of course, I have to split the dataset with different household size and sort them according to their expense

e.g. for household with household size of 3

Serial household size Houshold expense Grossing up factor

0001 3 8064.63 220

0004 3 16028.58 400

0009 3 16298.62 220

0002 3 16830.25 260

0005 3 22617.33 250

Yet, how can I retain the samples of the lowest 20%, e.g only 0001 and 0004 as (250+260+220+400+220)/5=270 > 220 (0001) and <620 (sum of 0001 and 0004)? So that the dataset will have only the two sample left:

Serial household size Houshold expense Grossing up factor

0001 3 8064.63 220

0004 3 16028.58 400

I have think of using the sum of the grossing up factor and the accumulated sum of grossing up factor to do this

Serial household size Houshold expense Grossing up factor accumulated sum of grossing up factor

0001 3 8064.63 220 220

0004 3 16028.58 400 620

0009 3 16298.62 220 840

0002 3 16830.25 260 1100

0005 3 22617.33 250 1350

If acccumulated sum > 1350/5 and accumulated sum - grossing up factor >1350/5 then delete.

I will get the same result as above.

But the point is, how can I extract that 1350 directly from the dataset using SAS command, so that I not need read the result first and input the figure "1350" myself back into the command?

remarks: 620 is much large than one fifth of the total no. of households they represent (sum of 250+260+220+400+320) as shown above, but it will neglectable when running the full dataset.

Reeza · Posted 06-11-2017 10:07 PM

How does the 'grossing up factor' play into the calculation here?

If it's the lowest 20%, if you have 5 numbers only the first would be included, why is the second included in your example?

I'm not following your logic. Dividing by 5 implies some sort of average, but you say percent.

This is confusing.

Here's how you would keep the bottom 20% based on a single variable and grouped by sex.

proc sort data=sashelp.class out=class;
    by sex;

proc rank data=class out=classRanked groups=5;
    var household_expense;
    rank rank_Expense;
run;

data lowest_20;
  set classRanked;

  if rank_Expense=0;
run;

sap98fcs · Posted 06-11-2017 10:19 PM

Sorry for making it confusing

grossing up factor means the no. of household that sample represent

0001 3 8064.63 220

0004 3 16028.58 400

0009 3 16298.62 220

0002 3 16830.25 260

0005 3 22617.33 250

for example, 0001 represent 220 households in the city,

0004 represent 400 households in the city

0009 represent 220 households in the city

So, the above dataset represent a total of 1350 households in the city. In fact, there are more than thousands of samples for each household size group in the full dataset. I just select a few for easier illustration.

Say, if there are a total 1000 sample with a household size of 3, the lowest 20% households may not be exactly the first 200 obs after sort proc, as the total no. of household representing may be more or less than one-fifth of the total no. of household they represent.

That's why I have calculate the sum of the grossing up factor and divide by 5 to get the no. of households in a 20% group.

Reeza · Posted 06-12-2017 01:19 AM

Well, PROC RANK doesn't have a WEIGHT or FREQ statement, so you'll need to use PROC MEANS or UNIVARIATE to calculate the statistic (P20) and then merge it back in to summarize.

Weighted percentile lead to this post FYI

http://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

You could modify the solutuon here to account for BY groups.

https://gist.github.com/statgeek/31316a678433a1db8136

Or this one:

https://github.com/statgeek/SAS-Tutorials/blob/master/add_average_value_to_dataset

I dont seem to have a good reference for merging BY statistics back in ....