03-09-2017 04:16 PM
Hi....I have a dataset that I am trying to select a certain percentile of records based on their Fees for each YearMonth and within each Level. For example, from the table below, for the YearMonth = 201501 and Level=A, I would like to select and keep 93% of the records with the lowest fees and exclude the remaining 7% of the records with the highest fees. That would mean if there were 1,000 records having YearMonth = 201501 and Level =A, then I am expecting to have 930 records with YearMonth = 201501 and Level =A while the remaining 70 are excluded from the new dataset. I have sorted the dataset by YearMonth, Level and Fees. I was wondering what be the best and most efficient approach to achiev this. Thanks.
03-09-2017 04:23 PM
Do you have the percentiles already calculated?
If not, use PROC RANK with groups=100 and from the output select the desired records. I think it would be rank<92 because ranks go from 0 to 99.
03-09-2017 06:10 PM
Another approach would be to use Proc Surveyselect. You can use a data set with the percentiles as a control data set but the Percentile value must be in a variable named _RATE_.
The procedure call would look something like this:
proc survey select data=have out=want samprate=Percentileset; by Yearmonth level; run;
where Percentileset would be the name of the data set with YearMonth, Level and _rate.
Both your full dataset (fees?) and Percentileset would need to be sorted by Yearmonth and level.
This will give you a random selection of records.
03-09-2017 08:26 PM
Hi Ballardw.....unfortunately I need to select the lowest X percentile of the records for each YearMonth and Level and not a random sample....unless I am missing something from your approach.