New Contributor
Posts: 2

# Selecting data below 20th percentile

Dear all,

I am handling the expenditure statistics of a city. I have about 1.5K sample, each of which is a household with different household sizes and is grossed up (i.e representing different no. of housholds). They look like these:

Serial household size    Houshold expense    Grossing up factor

0001       3                     8064.63                      220
0002       3                     16830.25                    260
0003       6                     3755.18                      810
0004       3                     16028.58                    400
0005       3                     22617.33                    250
0006       2                     24844.37                    320
0007       2                     32191.85                    270
0008       2                     22815.47                    310
0009       3                     16298.62                    220
0010       2                     16850.27                    320

For analysis, I have to select the 20% households with lowest expense of different household size.

Of course, I have to split the dataset with different household size and sort them according to their expense

e.g. for household with household size of 3

Serial household size    Houshold expense    Grossing up factor

0001       3                     8064.63                      220

0004       3                     16028.58                    400

0009       3                     16298.62                    220

0002       3                     16830.25                    260

0005       3                     22617.33                    250

Yet, how can I retain the samples of the lowest 20%, e.g only 0001 and 0004 as (250+260+220+400+220)/5=270 > 220 (0001) and <620 (sum of 0001 and 0004)? So that the dataset will have only the two sample left:

Serial household size    Houshold expense    Grossing up factor

0001       3                     8064.63                      220

0004       3                     16028.58                    400

I have think of using the sum of the grossing up factor and the accumulated sum of grossing up factor to do this

Serial household size    Houshold expense    Grossing up factor   accumulated sum of grossing up factor

0001       3                     8064.63                      220                               220

0004       3                     16028.58                    400                               620

0009       3                     16298.62                    220                               840

0002       3                     16830.25                    260                               1100

0005       3                     22617.33                    250                               1350

If acccumulated sum > 1350/5 and accumulated sum - grossing up factor >1350/5 then delete.

I will get the same result as above.

But the point is, how can I extract that 1350 directly from the dataset using SAS command, so that I not need read the result first and input the figure "1350" myself back into the command?

remarks: 620 is much large than one fifth of the total no. of households they represent (sum of 250+260+220+400+320) as shown above, but it will neglectable when running the full dataset.

Super User
Posts: 24,010

## Re: Selecting data below 20th percentile

How does the 'grossing up factor' play into the calculation here?

If it's the lowest 20%, if you have 5 numbers only the first would be included, why is the second included in your example?

I'm not following your logic. Dividing by 5 implies some sort of average, but you say percent.

This is confusing.

Here's how you would keep the bottom 20% based on a single variable and grouped by sex.

``````proc sort data=sashelp.class out=class;
by sex;

proc rank data=class out=classRanked groups=5;
var household_expense;
rank rank_Expense;
run;

data lowest_20;
set classRanked;

if rank_Expense=0;
run;``````
New Contributor
Posts: 2

## Re: Selecting data below 20th percentile

Sorry for making it confusing

grossing up factor means the no. of household that sample represent

0001       3                     8064.63                      220

0004       3                     16028.58                    400

0009       3                     16298.62                    220

0002       3                     16830.25                    260

0005       3                     22617.33                    250

for example, 0001 represent 220 households in the city,

0004 represent 400 households in the city

0009 represent 220 households in the city

So, the above dataset represent a total of 1350 households in the city. In fact, there are more than thousands of samples for each household size group in the full dataset. I just select a few for easier illustration.

Say, if there are a total 1000 sample with a household size of 3, the lowest 20% households may not be exactly the first 200 obs after sort proc, as the total no. of household representing may be more or less than one-fifth of the total no. of household they represent.

That's why I have calculate the sum of the grossing up factor and divide by 5 to get the no. of households in a 20% group.

Super User
Posts: 24,010

## Re: Selecting data below 20th percentile

Well, PROC RANK doesn't have a WEIGHT or FREQ statement, so you'll need to use PROC MEANS or UNIVARIATE to calculate the statistic (P20) and then merge it back in to summarize.

Weighted percentile lead to this post FYI

http://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

You could modify the solutuon here to account for BY groups.

https://gist.github.com/statgeek/31316a678433a1db8136

Or this one: