DATA Step, Macro, Functions and more

Selecting data below 20th percentile

Reply
New Contributor
Posts: 2

Selecting data below 20th percentile

Dear all,

 

      I am handling the expenditure statistics of a city. I have about 1.5K sample, each of which is a household with different household sizes and is grossed up (i.e representing different no. of housholds). They look like these:

 

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220
0002       3                     16830.25                    260 
0003       6                     3755.18                      810
0004       3                     16028.58                    400  
0005       3                     22617.33                    250  
0006       2                     24844.37                    320
0007       2                     32191.85                    270 
0008       2                     22815.47                    310
0009       3                     16298.62                    220
0010       2                     16850.27                    320

 

For analysis, I have to select the 20% households with lowest expense of different household size.

 

Of course, I have to split the dataset with different household size and sort them according to their expense

 

e.g. for household with household size of 3

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

0009       3                     16298.62                    220

0002       3                     16830.25                    260 

0005       3                     22617.33                    250  


Yet, how can I retain the samples of the lowest 20%, e.g only 0001 and 0004 as (250+260+220+400+220)/5=270 > 220 (0001) and <620 (sum of 0001 and 0004)? So that the dataset will have only the two sample left:

 

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

 

I have think of using the sum of the grossing up factor and the accumulated sum of grossing up factor to do this 

Serial household size    Houshold expense    Grossing up factor   accumulated sum of grossing up factor

0001       3                     8064.63                      220                               220

0004       3                     16028.58                    400                               620

0009       3                     16298.62                    220                               840

0002       3                     16830.25                    260                               1100

0005       3                     22617.33                    250                               1350  

 

If acccumulated sum > 1350/5 and accumulated sum - grossing up factor >1350/5 then delete.

I will get the same result as above. 

 

But the point is, how can I extract that 1350 directly from the dataset using SAS command, so that I not need read the result first and input the figure "1350" myself back into the command? 

 

remarks: 620 is much large than one fifth of the total no. of households they represent (sum of 250+260+220+400+320) as shown above, but it will neglectable when running the full dataset. 

 

 

Super User
Posts: 17,864

Re: Selecting data below 20th percentile

How does the 'grossing up factor' play into the calculation here?

 

If it's the lowest 20%, if you have 5 numbers only the first would be included, why is the second included in your example?

I'm not following your logic. Dividing by 5 implies some sort of average, but you say percent. 

This is confusing. 

 

Here's how you would keep the bottom 20% based on a single variable and grouped by sex. 

 

proc sort data=sashelp.class out=class;
    by sex;

proc rank data=class out=classRanked groups=5;
    var household_expense;
    rank rank_Expense;
run;

data lowest_20;
  set classRanked;

  if rank_Expense=0;
run;
New Contributor
Posts: 2

Re: Selecting data below 20th percentile

Sorry for making it confusing

 

grossing up factor means the no. of household that sample represent 

 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

0009       3                     16298.62                    220

0002       3                     16830.25                    260 

0005       3                     22617.33                    250  

 

for example, 0001 represent 220 households in the city,

0004 represent 400 households in the city

0009 represent 220 households in the city

 

So, the above dataset represent a total of 1350 households in the city. In fact, there are more than thousands of samples for each household size group in the full dataset. I just select a few for easier illustration. 

 

Say, if there are a total 1000 sample with a household size of 3, the lowest 20% households may not be exactly the first 200 obs after sort proc, as the total no. of household representing may be more or less than one-fifth of the total no. of household they represent. 

That's why I have calculate the sum of the grossing up factor and divide by 5 to get the no. of households in a 20% group.

 

Super User
Posts: 17,864

Re: Selecting data below 20th percentile

Well, PROC RANK doesn't have a WEIGHT or FREQ statement, so you'll need to use PROC MEANS or UNIVARIATE to calculate the statistic (P20) and then merge it back in to summarize. 

 

Weighted percentile lead to this post FYI 

http://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

 

You could modify the solutuon here to account for BY groups. 

https://gist.github.com/statgeek/31316a678433a1db8136

 

Or this one:

https://github.com/statgeek/SAS-Tutorials/blob/master/add_average_value_to_dataset

 

I dont seem to have a good reference for merging BY statistics back in ....

Ask a Question
Discussion stats
  • 3 replies
  • 112 views
  • 0 likes
  • 2 in conversation