topic Re: Efficient way to identify unique records in SAS Programming

Efficient way to identify unique records

thesasuser — Sun, 20 May 2018 19:05:17 GMT

The unique records can be identified with the any of the following

Approach 1

if first.var and last.var

Approach 2

if first.var=1 and last.var=1

Is there any difference between them in terms of efficiency and time?

Re: Efficient way to identify unique records

novinosrin — Sun, 20 May 2018 19:52:01 GMT

in my opinion both does the same evaluation so no.

other than that the only thing i can think of is during tokenisation sas processes 2 more tokens i.e =1 before it sends to the compiler for execution, which is very insignificant

Re: Efficient way to identify unique records

Reeza — Sun, 20 May 2018 21:18:21 GMT

Try PROC SORT with the NODUPKEY option as well, since you often have to sort before using FIRST/LAST you can sometimes avoid extra steps which is more efficient.

@thesasuser wrote:

The unique records can be identified with the any of the following

Approach 1

if first.var and last.var

Approach 2

if first.var=1 and last.var=1

Is there any difference between them in terms of efficiency and time?

Re: Efficient way to identify unique records

art297 — Sun, 20 May 2018 21:24:58 GMT

@novinosrin already answered your question, but there are still other ways of expressing the same thing. Other than using sort, they all run as efficiently as the others.

Like @Reeza mentioned, proc sort can also be used but, for what you want, in a slightly different manner than @Reeza mentioned:

data have;
  set sashelp.class;
  do i=1 to 100000;
    if i eq 1 then id=_n_*100000;
    else id=_n_*100000+1;
    output;
  end;
run;
data want;
  set have;
  by id;
  if first.id and last.id;
run;
data want;
  set have;
  by id;
  if first.id eq 1 and last.id eq 1;
run;

data want;
  set have;
  by id;
  if first.id +last.id eq 2;
run;

data want;
  set have;
  by id;
  if first.id *last.id;
run;

data want;
  set have;
  by id;
  if first.id *last.id eq 1;
run;

proc sort data=have out=dontwant nouniquekey noequals UNIQUEOUT=want;
  by id;
run;

And, as @Reeza mentioned, if the file has to be sorted first, it may actually end up being the fastest method.

Art, CEO, AnalystFinder.com

Re: Efficient way to identify unique records

thesasuser — Sun, 20 May 2018 21:23:22 GMT

Thanks.

This was an interview question. The interviewer informed that one of them is more efficient.

To me both looked identically efficient.
Wanted to know form the community if I am correct

Re: Efficient way to identify unique records

novinosrin — Sun, 20 May 2018 21:25:21 GMT

Well well well, does the interviewer think processing 2 more tokens makes it less efficient? Jeez!!!

Re: Efficient way to identify unique records

Reeza — Sun, 20 May 2018 22:19:00 GMT

I’m not fond of this, but sometimes people will ask obviously wrong questions to see how a candidate deals with this. People who get super defensive about a mistake or flustered are what they’re looking for in this case.

Re: Efficient way to identify unique records

art297 — Sun, 20 May 2018 22:19:35 GMT

Interesting footnote to this question. Did you know that you can specify _null_ as an output file to proc sort? Before today, I didn't!

However, in trying to test the idea that the nouniquekey and uniqueout options might be faster than doing a sort and then using first. and last. in a datastep, I was bothered that I (thought I) had to create a dummy output file. Not so!

proc sort data=have out=_null_ nouniquekey noequals UNIQUEOUT=want;
  by id;
run;

On a 1.9 million record file, with 19 unique ids, the above only took 0.49 seconds.

The sort alone of the initial file (needed, of course in order to use first. and last. processing, took 0.68 seconds.

Not that I've ever needed a file of records that don't have duplicates, but nice to know which would be the most efficient process.

Art, CEO, AnalystFinder.com