BookmarkSubscribeRSS Feed
sap98fcs
Calcite | Level 5

Dear all,

 

      I am handling the expenditure statistics of a city. I have about 1.5K sample, each of which is a household with different household sizes and is grossed up (i.e representing different no. of housholds). They look like these:

 

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220
0002       3                     16830.25                    260 
0003       6                     3755.18                      810
0004       3                     16028.58                    400  
0005       3                     22617.33                    250  
0006       2                     24844.37                    320
0007       2                     32191.85                    270 
0008       2                     22815.47                    310
0009       3                     16298.62                    220
0010       2                     16850.27                    320

 

For analysis, I have to select the 20% households with lowest expense of different household size.

 

Of course, I have to split the dataset with different household size and sort them according to their expense

 

e.g. for household with household size of 3

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

0009       3                     16298.62                    220

0002       3                     16830.25                    260 

0005       3                     22617.33                    250  


Yet, how can I retain the samples of the lowest 20%, e.g only 0001 and 0004 as (250+260+220+400+220)/5=270 > 220 (0001) and <620 (sum of 0001 and 0004)? So that the dataset will have only the two sample left:

 

Serial household size    Houshold expense    Grossing up factor 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

 

I have think of using the sum of the grossing up factor and the accumulated sum of grossing up factor to do this 

Serial household size    Houshold expense    Grossing up factor   accumulated sum of grossing up factor

0001       3                     8064.63                      220                               220

0004       3                     16028.58                    400                               620

0009       3                     16298.62                    220                               840

0002       3                     16830.25                    260                               1100

0005       3                     22617.33                    250                               1350  

 

If acccumulated sum > 1350/5 and accumulated sum - grossing up factor >1350/5 then delete.

I will get the same result as above. 

 

But the point is, how can I extract that 1350 directly from the dataset using SAS command, so that I not need read the result first and input the figure "1350" myself back into the command? 

 

remarks: 620 is much large than one fifth of the total no. of households they represent (sum of 250+260+220+400+320) as shown above, but it will neglectable when running the full dataset. 

 

 

3 REPLIES 3
Reeza
Super User

How does the 'grossing up factor' play into the calculation here?

 

If it's the lowest 20%, if you have 5 numbers only the first would be included, why is the second included in your example?

I'm not following your logic. Dividing by 5 implies some sort of average, but you say percent. 

This is confusing. 

 

Here's how you would keep the bottom 20% based on a single variable and grouped by sex. 

 

proc sort data=sashelp.class out=class;
    by sex;

proc rank data=class out=classRanked groups=5;
    var household_expense;
    rank rank_Expense;
run;

data lowest_20;
  set classRanked;

  if rank_Expense=0;
run;
sap98fcs
Calcite | Level 5

Sorry for making it confusing

 

grossing up factor means the no. of household that sample represent 

 

0001       3                     8064.63                      220

0004       3                     16028.58                    400  

0009       3                     16298.62                    220

0002       3                     16830.25                    260 

0005       3                     22617.33                    250  

 

for example, 0001 represent 220 households in the city,

0004 represent 400 households in the city

0009 represent 220 households in the city

 

So, the above dataset represent a total of 1350 households in the city. In fact, there are more than thousands of samples for each household size group in the full dataset. I just select a few for easier illustration. 

 

Say, if there are a total 1000 sample with a household size of 3, the lowest 20% households may not be exactly the first 200 obs after sort proc, as the total no. of household representing may be more or less than one-fifth of the total no. of household they represent. 

That's why I have calculate the sum of the grossing up factor and divide by 5 to get the no. of households in a 20% group.

 

Reeza
Super User

Well, PROC RANK doesn't have a WEIGHT or FREQ statement, so you'll need to use PROC MEANS or UNIVARIATE to calculate the statistic (P20) and then merge it back in to summarize. 

 

Weighted percentile lead to this post FYI 

http://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

 

You could modify the solutuon here to account for BY groups. 

https://gist.github.com/statgeek/31316a678433a1db8136

 

Or this one:

https://github.com/statgeek/SAS-Tutorials/blob/master/add_average_value_to_dataset

 

I dont seem to have a good reference for merging BY statistics back in ....

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1268 views
  • 0 likes
  • 2 in conversation