
Efficient way to identify unique records

Contributor
Posts: 43

Efficient way to identify unique records

Unique records can be identified with any of the following:

Approach 1

if first.var and last.var

Approach 2

if first.var=1 and last.var=1

Is there any difference between them in terms of efficiency and time?
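
(For context, both fragments would sit inside a DATA step that reads data sorted by VAR and uses a BY statement; a minimal sketch with placeholder names:)

data unique;
  set have;                      /* assumed to be sorted by VAR */
  by var;
  if first.var and last.var;     /* Approach 1; Approach 2 writes the same test with =1 */
run;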




All Replies
Solution
05-20-2018 05:26 PM
PROC Star
Posts: 1,831

Re: Efficient way to identify unique records

Posted in reply to thesasuser

In my opinion both do the same evaluation, so no.

 

Other than that, the only thing I can think of is that during tokenization SAS processes two more tokens (i.e., =1) before it sends the statement to the compiler for execution, which is insignificant.
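
(If you want to check for yourself, a minimal sketch for timing the two forms, assuming a dataset HAVE already sorted by ID; compare the real and CPU times reported in the log:)

options fullstimer;

data unique1;
  set have;
  by id;
  if first.id and last.id;          /* Approach 1 */
run;

data unique2;
  set have;
  by id;
  if first.id=1 and last.id=1;      /* Approach 2 */
run;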

Super User
Posts: 23,765

Re: Efficient way to identify unique records

Posted in reply to thesasuser

Try PROC SORT with the NODUPKEY option as well; since you often have to sort before using FIRST./LAST. processing, you can sometimes avoid extra steps, which is more efficient.
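
(A minimal sketch of that approach, with placeholder dataset and variable names; note that NODUPKEY keeps one record per BY value, which is slightly different from the FIRST./LAST. subset that keeps only BY values occurring exactly once:)

proc sort data=have out=want nodupkey;
  by id;
run;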

 


@thesasuser wrote:

The unique records can be identified with the any of the following

Approach 1

if first.var and last.var

Approach 2

if first.var=1 and last.var=1

Is there any difference between them in terms of efficiency and time?


 

Contributor
Posts: 43

Re: Efficient way to identify unique records

Thanks.

This was an interview question. The interviewer said that one of them is more efficient.

To me both looked identically efficient.
Wanted to know from the community if I am correct.

PROC Star
Posts: 1,831

Re: Efficient way to identify unique records

Posted in reply to thesasuser

Well well well, does the interviewer think processing 2 more tokens makes it less efficient? Jeez!!!

PROC Star
Posts: 8,164

Re: Efficient way to identify unique records

Posted in reply to novinosrin

Interesting footnote to this question. Did you know that you can specify _null_ as an output file to proc sort? Before today, I didn't!

 

However, in trying to test the idea that the NOUNIQUEKEY and UNIQUEOUT options might be faster than doing a sort and then using first. and last. in a DATA step, I was bothered that I (thought I) had to create a dummy output file. Not so!

 

proc sort data=have out=_null_ nouniquekey noequals UNIQUEOUT=want;
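  /* Descriptive note (added): OUT=_NULL_ suppresses the main output data
     set, NOUNIQUEKEY drops observations whose BY key occurs only once from
     that output, UNIQUEOUT= writes those single-occurrence observations to
     WANT, and NOEQUALS skips preserving input order within BY groups. */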
  by id;
run;

On a 1.9-million-record file with 19 unique IDs, the above took only 0.49 seconds.

 

The sort alone of the initial file (needed, of course, in order to use first. and last. processing) took 0.68 seconds.
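
(For reference, the two-step baseline being compared against would look something like this sketch, using the same HAVE data set and ID variable:)

proc sort data=have;
  by id;
run;

data want;
  set have;
  by id;
  if first.id and last.id;
run;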

 

Not that I've ever needed a file of records that don't have duplicates, but it's nice to know which would be the most efficient process.

 

Art, CEO, AnalystFinder.com

 

 

Super User
Posts: 23,765

Re: Efficient way to identify unique records

Posted in reply to thesasuser

I’m not fond of this, but sometimes interviewers will ask an obviously wrong question to see how a candidate deals with it. What they’re watching for in that case is whether the candidate gets flustered or overly defensive about a mistake.

 

PROC Star
Posts: 8,164

Re: Efficient way to identify unique records

Posted in reply to thesasuser

@novinosrin already answered your question, but there are still other ways of expressing the same thing. Other than the one that uses PROC SORT, they all run about equally efficiently.

 

Like @Reeza mentioned, PROC SORT can also be used, but for what you want, in a slightly different manner than the one @Reeza described:

data have;
  set sashelp.class;
  do i=1 to 100000;
    if i eq 1 then id=_n_*100000;
    else id=_n_*100000+1;
    output;
  end;
run;
data want;
  set have;
  by id;
  if first.id and last.id;
run;
data want;
  set have;
  by id;
  if first.id eq 1 and last.id eq 1;
run;

data want;
  set have;
  by id;
  if first.id + last.id eq 2;
run;

data want;
  set have;
  by id;
  if first.id * last.id;
run;

data want;
  set have;
  by id;
  if first.id * last.id eq 1;
run;

proc sort data=have out=dontwant nouniquekey noequals UNIQUEOUT=want;
  by id;
run;

And, as @Reeza mentioned, if the file has to be sorted first, it may actually end up being the fastest method.

 

Art, CEO, AnalystFinder.com

 
