
Many times we've had to sort a table by some variables, then run a second PROC SORT to eliminate duplicate observations based on a subset of those sorting variables, e.g.:

proc sort;
  by var1 var2 var3;
run;

proc sort nodupkey;
  by var1 var2;
run;

 

It would be really useful if we could merge those two steps into one, to reduce the read/write load. Especially when dealing with larger tables, every rewrite to disk takes a LOT of time.

The example above could be reduced to:

proc sort nodupkey=var2;
  by var1 var2 var3;
run;

 

Regards,

C.

5 Comments
Quentin
Super User

I find your proposed syntax confusing. 

 

Why are you sorting twice?  Is it because you want to keep the record that has the minimum value of var3 (or maybe maximum; I can never remember which record NODUPKEY keeps)?

cgmt
Fluorite | Level 6

Hello Quentin, thanks for commenting. Since English isn't my first language, I was afraid my post would be a bit confusing.

 

Let me give an example, to see if that makes it clearer:

We have a table with information on every item purchased in our store, and we want to get the first item for every customer. We would have to sort the table by customer, purchase, and item identifiers, and then sort again, adding the NODUPKEY option:

proc sort;
  by customer_id purchase_id item_id;
run;

proc sort nodupkey;
  by customer_id;
run;

As far as I know, currently there is no way to do that in a single step. Please enlighten me if I'm mistaken!

 

Following the example, we would end up with:

 

proc sort nodupkey=customer_id;
  by customer_id purchase_id item_id;
run;

This would sort by all three variables, and then output only the first item purchased for every customer.

 

Regards,

C.

Quentin
Super User

Thanks.  That is clear. 

 

 

I would probably sort only once, then use a DATA step to collapse, e.g.:

 

proc sort;
  by customer_id purchase_id item_id;
run;

data want;
  set have;
  by customer_id;
  if first.customer_id;
run;

 

 

Not sure that will be faster than the second sort, but I think it's likely.  I like the DATA step approach more than SORT NODUPKEY because it's explicit that you want to keep the first or last record from each group.  If my data are not too big, I will often make the BY statement BY CUSTOMER_ID PURCHASE_ID ITEM_ID, just to make clear that the data have been sorted and that the intent is to select the record with the minimum PURCHASE_ID per customer, and the minimum ITEM_ID within that.
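For instance, a sketch with that fuller BY statement (same placeholder names HAVE and WANT as above):

data want;
  set have;
  by customer_id purchase_id item_id;  /* documents the full sort order */
  if first.customer_id;                /* still keeps the first record per customer */
run;

The output is identical to BY CUSTOMER_ID alone, but the DATA step will now also throw an error if the data are not actually sorted by all three variables.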

 

If your values of purchase_id were distinct, you could do it in SQL with something like:

 

proc sql;
  create table want as
  select *
    from have
    group by customer_id
    having purchase_id = min(purchase_id)
  ;
quit;

I would think that could be extended to support multiple by variables.
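One way that extension might look is a nested query, so the minimum ITEM_ID is taken only among the rows that survive the first HAVING (this is a sketch relying on PROC SQL's summary-function remerging; HAVE and WANT are placeholder names):

proc sql;
  create table want as
  select *
    from (select *
            from have
            group by customer_id
            having purchase_id = min(purchase_id))
    group by customer_id
    having item_id = min(item_id)  /* ties on item_id would still leave duplicates, unlike NODUPKEY */
  ;
quit;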

 

 

I don't think I really like the idea of extending NODUPKEY this way, but if it were extended, maybe with a slightly different syntax, something like:

proc sort data=have out=want nodupkey;
  by customer_id purchase_id item_id;  *sort by these 3 vars;
  key customer_id;  *make this a unique key in output dataset;
run;
cgmt
Fluorite | Level 6

Thanks for the insights!

 

About replacing the second PROC SORT with a DATA step: the time spent is about the same, since the second PROC SORT doesn't really sort again (it checks the SORTEDBY attribute and, because the data are already sorted, effectively does the same work as the DATA step you posted).

 

Also, about the SQL: as you mentioned, it would only work as intended if we had no duplicated values of (customer_id, purchase_id, item_id), which isn't always guaranteed.

 

Lastly, the syntax you proposed for extending NODUPKEY looks great. Thank you! 🙂

 

Regards,

C.

paulkaefer
Lapis Lazuli | Level 10

I'm just chiming in to say that @Quentin's solution with the DATA step to "collapse" the dataset is also easier to read and understand. So it's better for the future, whether that's you (@cgmt) coming back and not remembering exactly what you intended, or the next code maintainer trying to understand your work.