## re: Data Selection by groups

Regular Contributor
Posts: 248

# re: Data Selection by groups

Hi....I have a dataset that I am  trying to select a certain percentile of records based on their Fees for each YearMonth and within each Level. For example, from the table below, for the YearMonth = 201501 and Level=A, I would like to select and keep 93% of the records with the lowest fees and exclude the remaining 7% of the records with the highest fees. That would mean if there were 1,000 records having YearMonth = 201501 and Level =A, then I am expecting to have 930 records with YearMonth = 201501 and Level =A while the remaining 70 are excluded from the new dataset. I have sorted the dataset by YearMonth, Level and Fees. I was wondering what be the best and most efficient approach to achiev this. Thanks.

 YearMonth Level Percentile 201501 A 0.93 201501 B 0.95 201502 A 0.92 201503 A 0.92 201503 B 0.91 201503 C 0.94
Super User
Posts: 23,958

## Re: re: Data Selection by groups

Do you have the percentiles already calculated?

If not, use PROC RANK with groups=100 and from the output select the desired records. I think it would be rank<92 because ranks go from 0 to 99.

Regular Contributor
Posts: 248

## Re: re: Data Selection by groups

Hi Reeza,

Yes the percentiles are already calculated and are specific for each unique YearMonth and Level.

Super User
Posts: 23,958

## Re: re: Data Selection by groups

If the percentiles are already calculated isn't that just a WHERE or IF statement?

If not for some reason, detail your issue.

Super User
Posts: 13,889

## Re: re: Data Selection by groups

Another approach would be to use Proc Surveyselect. You can use a data set with the percentiles as a control data set but the Percentile value must be in a variable named _RATE_.

The procedure call would look something like this:

```proc survey select data=have out=want
samprate=Percentileset;
by Yearmonth level;
run;```

where Percentileset would be the name of the data set with YearMonth, Level and _rate.

Both your full dataset (fees?) and Percentileset would need to be sorted by Yearmonth and level.

This will give you a random selection of records.

Regular Contributor
Posts: 248

## Re: re: Data Selection by groups

Hi Ballardw.....unfortunately I need to select the lowest X percentile of the records for each YearMonth and Level and not a random sample....unless I am missing something from your approach.

Discussion stats
• 5 replies
• 138 views
• 0 likes
• 3 in conversation