DATA Step, Macro, Functions and more

re: Data Selection by groups

Reply
Regular Contributor
Posts: 222

re: Data Selection by groups

Hi....I have a dataset that I am  trying to select a certain percentile of records based on their Fees for each YearMonth and within each Level. For example, from the table below, for the YearMonth = 201501 and Level=A, I would like to select and keep 93% of the records with the lowest fees and exclude the remaining 7% of the records with the highest fees. That would mean if there were 1,000 records having YearMonth = 201501 and Level =A, then I am expecting to have 930 records with YearMonth = 201501 and Level =A while the remaining 70 are excluded from the new dataset. I have sorted the dataset by YearMonth, Level and Fees. I was wondering what be the best and most efficient approach to achiev this. Thanks.

 

YearMonth Level Percentile
201501 A 0.93
201501 B 0.95
201502 A 0.92
201503 A 0.92
201503 B 0.91
201503 C 0.94
Super User
Posts: 17,792

Re: re: Data Selection by groups

Do you have the percentiles already calculated?

If not, use PROC RANK with groups=100 and from the output select the desired records. I think it would be rank<92 because ranks go from 0 to 99. 

 

 

Regular Contributor
Posts: 222

Re: re: Data Selection by groups

Hi Reeza,

 

Yes the percentiles are already calculated and are specific for each unique YearMonth and Level.

Super User
Posts: 17,792

Re: re: Data Selection by groups

If the percentiles are already calculated isn't that just a WHERE or IF statement?

 

If not for some reason, detail your issue.

Super User
Posts: 10,490

Re: re: Data Selection by groups

Another approach would be to use Proc Surveyselect. You can use a data set with the percentiles as a control data set but the Percentile value must be in a variable named _RATE_.

The procedure call would look something like this:

proc survey select data=have out=want 
    samprate=Percentileset;
    by Yearmonth level;
run;

where Percentileset would be the name of the data set with YearMonth, Level and _rate.

 

Both your full dataset (fees?) and Percentileset would need to be sorted by Yearmonth and level.

This will give you a random selection of records.

Regular Contributor
Posts: 222

Re: re: Data Selection by groups

Hi Ballardw.....unfortunately I need to select the lowest X percentile of the records for each YearMonth and Level and not a random sample....unless I am missing something from your approach.

Ask a Question
Discussion stats
  • 5 replies
  • 98 views
  • 0 likes
  • 3 in conversation