An Idea Exchange for SAS software and services

Comments
by PROC Star
‎05-01-2017 02:23 PM - edited ‎05-01-2017 02:23 PM

I find your proposed syntax confusing. 

 

Why are your sorting twice?  Is it because you want to keep the record that has the minimum value of var3 (or maybe maximum, I can never remember which record nodupkey keeps). 

by New Contributor cgmt
‎05-01-2017 03:03 PM - edited ‎05-01-2017 03:03 PM

Hello Quentin, thanks for commenting. Since English isn't my first language, I was afraid my post would be a bit confussing.

 

Let me put an example, to see if that makes it more clear:

We have a table with information of every item purchased in our store, and we want to get the first one for every customer. We would have to sort the table by customer, purchase and item identifier, and then sort again, adding the nodupkey option:

proc sort;
  by customer_id purchase_id item_id;
run;

proc sort nodupkey;
  by customer_id;
run;

As far as I know, currently there is no way to do that in a single step. Please enlighten me if I'm mistaken!

 

Following the example, we would end up with:

 

proc sort nodupkey=customer_id;
by customer_id purchase_id item_id;
run;

Which would sort by all three variables, and then outputting only the first item purchased for every customer.

 

Regards,

C.

by PROC Star
on ‎05-02-2017 09:12 AM

Thanks.  That is clear. 

 

 

With a DATA step, I would probably only sort once, then use a DATA step to collapse, e.g.:

 

proc sort;
  by customer_id purchase_id item_id;
run;

data want;
  set have;
  by customer_id;
  if first.customer_id;
run;

 

 

Not sure that will be faster then the second sort, but I think it's likely.  I like the DATA step approach more than SORT NODUPKEY because it's explicit that you want to keep the first or last record from each group.  If my data is not too big, I will often make the BY statement BY CUSTOMER_ID PURCHASE_ID ITEM_ID, just to make clear that the data have been sorted and the intent is to select the first record with the minimum PURCHASE_ID per customer, and minimum ITEM_ID within that.

 

If your values of purchase_id were distinct, you could do it in SQL with something like:

 

proc sql;
  create table want as
  select *
    from have
    group by customer_id
    having purchase_id=min(purchase_id)
;

I would think that could be extended to support multiple by variables.

 

 

I don't think I really like the idea of extending nodupkey this way, but if it were extended, maybe an a slightly alternative syntax, something like:

proc sort data=have out=want nodupkey;
  by customer_id purchase_id item_id ;  *sort by these 3 vars;
  key customer_id ; *make this a unique key in output dataset;
run ;
by New Contributor cgmt
on ‎05-02-2017 09:35 AM

Thanks for the insights!

 

About switching the second proc sort with a data step, the time spent is about the same, since the proc sort doesn't really sort again (it looks for the sortedby property, and since it's already sorted, it just does the same as the data step you posted).

 

Also, about the SQL: as you mentioned, it would only work as intended if we had no duplicated values by (customer_id, purchase_id, item_id), which isn't always guaranteed.

 

At last, the syntax you proposed for the extension of the nodupkey seems great. Thank you! Smiley Happy

 

Regards,

C.

by Regular Contributor
on ‎05-02-2017 10:00 AM

I'm just chiming in to say that @Quentin's solution with the data step to "collapse" the dataset is also easier to read and understand. So better for the future, whether that be you (@cgmt) coming back, not remembering exactly what you intended, or the next code maintainer trying to understand your work.

Idea Statuses
Top Liked Authors