topic Re: How to find duplicate based on entire record using Data step in SAS Programming

How to find duplicate based on entire record using Data step

meenakshim — Thu, 23 May 2019 16:44:35 GMT

Hi,

I wanted to find duplicates based on the entire record of 55 columns. I have done it using SAS procedures like sort with NODUP/ NON UNIQUE KEY option. But i am wondering , can we able to accomplish this using the data step BY group processing.

can we able to use _ALL_ , _NUMERIC_ etc in BY processing. I have used this in the BY statement in procedures. Instead of typing in all 55 columns.

Or is there is any other way in data step processing that we can get the duplicate record(the entire record ) without any specific key field.

Thanks.

Re: How to find duplicate based on entire record using Data step

Tom — Thu, 23 May 2019 17:08:28 GMT

You need to sort. You might be able to avoid sorting if the data was small and could all be put into a hash.

So let's assume you already did this step to sort by all of the variables so that dups are next to each other.

proc sort data=have;
  by _all_;
run;

Now you want to read the data and detect duplicate rows. To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to add back one of the variables to the end.

Try this example using SASHELP.CLASS. I stuck AGE in before _ALL_ so that the second data step will show that duplicates can be found by use the FIRST. and LAST. variables.

proc sort data=sashelp.class out=class;
  by age _all_;
run;

data test;
  set class;
  by age _all_ age;
  if not (first.age and last.age);
run;

data test2;
  set class;
  by age ;
  if not (first.age and last.age);
run;

Re: How to find duplicate based on entire record using Data step

noling — Thu, 23 May 2019 17:13:37 GMT

It's definitely possible to do this in a datastep, but it likely wouldn't be as efficient as just using proc sort (which was built exactly for this type of job). You'd probably want to sort the dataset first by ALL the fields, then use a first-DOT-<Last field in BY statement> (or last-DOT of the same field). I recommend not using a datastep for this, but please let me know what you find out!

I'd be interested in:

1. Processing time using proc sort with a NODUP/ other option

2. Processing time using a datastep

3. Whatever the generalized approach for all fields is with the datastep. EDIT: Thanks @Tom!! - neat trick adding another field to the end of _all_!

Re: How to find duplicate based on entire record using Data step

PeterClemmensen — Thu, 23 May 2019 17:45:26 GMT

Here is what a hash object approach may look like

I create 20 mio rows of data with 10 variables for demonstration purposes. Then I create a data set of duplicates with PROC SORT Dupout method and Data Step Hash Method respectively..

data testdata1(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 1 min 17 sec  */
proc sort data=testdata1 dupout=test1 nodupkey;
   by _ALL_;
run;

data testdata2(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 28 sec */
data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5',
                  'var6', 'var7', 'var8', 'var9', 'var10');
      h.defineDone();
   end;
 
   set testdata2;

   if h.check() ne 0 then h.add();
   else output;
run;

Re: How to find duplicate based on entire record using Data step

FreelanceReinh — Thu, 23 May 2019 19:20:57 GMT

@PeterClemmensen: Thanks for sharing the run times. So, the DATA step using a hash object was significantly faster than PROC SORT -- on your computer. This is interesting because it was vice versa on mine (24 vs. 16 s), thus illustrating that general recommendations regarding performance are problematic even if the data are identical.

Re: How to find duplicate based on entire record using Data step

PeterClemmensen — Thu, 23 May 2019 20:18:22 GMT

@FreelanceReinh, thank you. And I agree completely 🙂

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.

Re: How to find duplicate based on entire record using Data step

FreelanceReinh — Thu, 23 May 2019 21:51:39 GMT

@PeterClemmensen wrote:

(...)

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.

Sure. Sorry for the ambiguity. My remark about recommendations was not related to your run time comparison.

Re: How to find duplicate based on entire record using Data step

Ksharp — Fri, 24 May 2019 12:45:18 GMT

Just one PROC SORT is suffice ;

proc sort data=have out=want nouniquekey ;
  by _all_;
run;

Re: How to find duplicate based on entire record using Data step

Patrick — Fri, 24 May 2019 14:14:03 GMT

@meenakshim

From a coding perspective using Proc Sort ... nodupkey; by _all_; run is by far the simplest approach. If you consult the docu for Proc Sort you'll also find that Proc Sort provides options to collect duplicates in a separate data set.

Re: How to find duplicate based on entire record using Data step

meenakshim — Fri, 24 May 2019 15:20:43 GMT

This method is new to me and i need to explore more. But it's interesting.

I could see only the duplicate records are outputting like in PROC SORT.

Re: How to find duplicate based on entire record using Data step

Tom — Fri, 24 May 2019 15:44:43 GMT

Yes. In groups with more than one record (ie duplicates in this case) FIRST.lastbyvar will be true only on the first record and LAST.lastbyvar will be true only on the last record. A unique record will be both the first and the last of its group.
The reason to use the data step method instead of just letting PROC SORT do the job is when you want something different. You can write your own logic to control exactly what happens.

Re: How to find duplicate based on entire record using Data step

meenakshim — Fri, 24 May 2019 15:48:57 GMT

I tried this before. But i couldn't figure it out what need to give in for FIRST and LAST BY variables to represent whole record. As you said ,by adding another variable to the end, it worked as expected. Thank u for the input.

My class data set is a copy of SASHELP.CLASS with 4 duplicates added.

proc sort data=class;
by _all_;
run;

data uniqueclass notuniqueclass;
set class;
by _all_ age;
if (first.age and last.age) then output uniqueclass;
if not (first.age and last.age) then output notuniqueclass;

run;

Thanks

meenakshi