11-28-2017 06:06 PM - edited 11-28-2017 06:08 PM
I have one column in a dataset with 5M rows of data. I'm trying to understand the range of values and figure out the general distribution. How would I select a percentage of the highest amounts and a percentage of the lowest amounts? Thanks!
11-28-2017 06:09 PM
I would recommend a histogram first, using PROC UNIVARIATE.
Its default output also lists the highest and lowest observations.
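Something like this is a minimal sketch of the PROC UNIVARIATE step. The dataset name WORK.HAVE and the variable name AMOUNT are placeholders, since the question doesn't give the real names. NEXTROBS= only changes how many extreme observations are listed at each end (the default is 5).

```sas
/* Placeholder names: WORK.HAVE is the dataset, AMOUNT is the column */
proc univariate data=work.have nextrobs=10;
   var amount;
   /* Draw a histogram of the variable */
   histogram amount;
run;
```

The Extreme Observations table in the output shows the lowest and highest values along with their observation numbers, which is a quick way to see the range before deciding on cutoffs.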
Then I would also recommend PROC RANK.
Rank the variable of interest into 100 groups (GROUPS=100), which numbers the groups 0 to 99. You can then find everything below X% by selecting the ranks less than X. Note how it handles tied values, though - that's one reason I prefer this method. It accounts for ties, whereas some of the manual approaches do not by default and need extra coding.
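As a rough sketch of the PROC RANK approach, assuming the same placeholder names (WORK.HAVE, AMOUNT) and an arbitrary 5% cutoff at each end. The output names WORK.RANKED, WORK.TAILS and AMOUNT_PCTL are made up for illustration.

```sas
/* Assign each row to one of 100 groups, numbered 0 to 99 */
proc rank data=work.have out=work.ranked groups=100;
   var amount;
   ranks amount_pctl;   /* new variable holding the group number */
run;

/* Keep roughly the lowest 5% and the highest 5% (cutoffs are an example) */
data work.tails;
   set work.ranked;
   /* Missing AMOUNT gets a missing rank, and missing sorts below any */
   /* number in SAS, so exclude it explicitly before comparing.       */
   if not missing(amount_pctl)
      and (amount_pctl < 5 or amount_pctl >= 95);
run;
```

Tied values end up in the same group, so the tails may hold slightly more or fewer rows than exactly 5%. The TIES= option on the PROC RANK statement controls how ties are ranked if the default doesn't suit.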