I have a dataset with 16 million records and 64 variables, two of which I want to use to subset the data. Of the two, call them x and y: x has duplicates, and for each unique value of x I want to keep the one record with the highest value of y. I know I can do this either with PROC SQL using a GROUP BY approach, or by sorting first and then using a SET statement with BY-group processing (first.x, etc.). My concern is which approach is generally considered more efficient? I used to avoid sorts entirely until I realized that a PROC SQL approach can sometimes be just as time-consuming. Any insights would be greatly appreciated.
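For concreteness, here is a sketch of the two approaches I mean, assuming the data are in WORK.BIG (the dataset name is just a placeholder; x and y are as described above). The PROC SQL version relies on SAS remerging the summary statistic back onto the detail rows:

```sas
/* Approach 1: PROC SQL with GROUP BY and a remerged MAX().
   Note: if two rows tie on the max y for an x, both are kept. */
proc sql;
  create table want_sql as
  select *
  from big
  group by x
  having y = max(y);
quit;

/* Approach 2: sort by x and descending y, then keep the first
   row of each BY group, which holds the highest y for that x. */
proc sort data=big out=sorted;
  by x descending y;
run;

data want_sort;
  set sorted;
  by x;
  if first.x;   /* first row per x = max y, given the descending sort */
run;
```

The sort-based version always returns exactly one row per x (ties broken arbitrarily by record order), while the SQL version can return duplicates on ties, so they are not strictly equivalent.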
Since you have only 16M rows and just 2 columns to de-duplicate on, I would recommend a hash table keyed on x that stores the row number of the preferred row (for re-loading the data). But you might get the logical equivalent with the TAGSORT option of PROC SORT, where the sort work area is needed only for the keys (and the tag, i.e. the record number).
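A minimal sketch of the hash idea, assuming the same WORK.BIG dataset with variables x and y (names are assumptions carried over from the question). This variant stores the per-x maximum of y rather than the row number, which avoids a re-load step; pass 1 builds the hash, pass 2 keeps only matching rows:

```sas
data want_hash;
  length maxy 8;                 /* host variable for the hash data item */
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('x');
    h.defineData('maxy');
    h.defineDone();
    /* pass 1: record the highest y seen for each x */
    do until (done);
      set big end=done;
      if h.find() ne 0 or y > maxy then
        rc = h.replace(key: x, data: y);  /* replace() also adds new keys */
    end;
  end;
  /* pass 2: keep only rows whose y equals the stored maximum */
  set big;
  rc = h.find();
  if y = maxy;
  drop rc maxy;
run;
```

Like the SQL GROUP BY / HAVING form, this keeps all tying rows. The TAGSORT alternative is just `proc sort data=big out=sorted tagsort; by x descending y; run;` followed by the usual `if first.x;` step; it trades sort work space (keys plus tags only) for extra I/O when the records are retrieved in order.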