meenakshim
Fluorite | Level 6

Hi,

 

I want to find duplicates based on the entire record of 55 columns. I have done it with PROC SORT using the NODUP / NOUNIQUEKEY options, but I am wondering whether we can accomplish this with DATA step BY-group processing.

 

Can we use _ALL_, _NUMERIC_, etc. in DATA step BY processing, instead of typing in all 55 columns? I have used these in the BY statement in procedures.

 

Or is there any other way in DATA step processing to get the duplicate records (the entire record) without any specific key field?

 

Thanks.

1 ACCEPTED SOLUTION

Tom
Super User

You need to sort.  You might be able to avoid sorting if the data was small and could all be put into a hash.

 

So let's assume you already did this step to sort by all of  the variables so that dups are next to each other.

proc sort data=have;
  by _all_;
run;

Now you want to read the data and detect duplicate rows.  To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to add back one of the variables to the end.

 

Try this example using SASHELP.CLASS. I stuck AGE in before _ALL_ so that the second data step will show that duplicates can be found by using the FIRST. and LAST. variables.

proc sort data=sashelp.class out=class;
  by age _all_;
run;

data test;
  set class;
  by age _all_ age;                  /* repeating AGE after _ALL_ makes FIRST.AGE/LAST.AGE describe the whole-record grouping */
  if not (first.age and last.age);   /* keep only records that are not unique across all variables */
run;

data test2;
  set class;
  by age;                            /* for comparison: groups on AGE alone */
  if not (first.age and last.age);
run;

 

 

 


11 REPLIES
meenakshim
Fluorite | Level 6

I tried this before, but I couldn't figure out what to give the FIRST. and LAST. BY variables so that they represent the whole record. As you said, by adding another variable to the end, it worked as expected. Thank you for the input.

 

My class data set is a copy of SASHELP.CLASS with 4 duplicates added.

 

proc sort data=class;
  by _all_;
run;

data uniqueclass notuniqueclass;
  set class;
  by _all_ age;
  if (first.age and last.age) then output uniqueclass;
  if not (first.age and last.age) then output notuniqueclass;
run;

 

Thanks

meenakshi

noling
SAS Employee

It's definitely possible to do this in a DATA step, but it likely wouldn't be as efficient as just using PROC SORT (which was built exactly for this type of job). You'd probably want to sort the data set first by ALL the fields, then test FIRST.<last field in BY statement> (or LAST. of the same field). I recommend not using a DATA step for this, but please let me know what you find out!
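For concreteness, a minimal sketch of that sort-then-FIRST./LAST. approach, assuming a data set named HAVE and repeating one of its variables (a hypothetical column NAME) after _ALL_ so that the FIRST./LAST. flags describe the whole-record grouping:

proc sort data=have out=have_sorted;
  by _all_;
run;

data not_unique;
  set have_sorted;
  by _all_ name;                      /* NAME stands in for any one variable that exists in HAVE */
  if not (first.name and last.name);  /* keep only rows that are not unique across every column */
run;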

 

I'd be interested in:

1. Processing time using proc sort with a NODUP/ other option

2. Processing time using a datastep

3. Whatever the generalized approach for all fields is with the DATA step. EDIT: Thanks @Tom!! Neat trick, adding another field to the end of _all_!

 



PeterClemmensen
Tourmaline | Level 20

Here is what a hash object approach may look like.

I create 20 million rows of data with 10 variables for demonstration purposes. Then I create a data set of duplicates with the PROC SORT DUPOUT method and with the DATA step hash method, respectively.

 

data testdata1(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 1 min 17 sec  */
proc sort data=testdata1 dupout=test1 nodupkey;
   by _ALL_;
run;

data testdata2(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 28 sec */
data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5',
                  'var6', 'var7', 'var8', 'var9', 'var10');
      h.defineDone();
   end;
 
   set testdata2;

   if h.check() ne 0 then h.add();
   else output;
run;
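With 55 real columns, typing every variable name into defineKey gets tedious. A possible variation (just a sketch, not part of the original reply, assuming the source table is WORK.HAVE) is to build the quoted key list from DICTIONARY.COLUMNS first:

/* collect every variable name of WORK.HAVE as a quoted, comma-separated list */
proc sql noprint;
   select quote(strip(name))
      into :keylist separated by ','
      from dictionary.columns
      where libname='WORK' and memname='HAVE';
quit;

data dups;
   if _N_ = 1 then do;
      declare hash h();
      h.defineKey(&keylist);   /* the macro variable resolves to "VAR1","VAR2",... before compilation */
      h.defineDone();
   end;

   set have;

   /* the first time a key combination is seen it is stored in the hash;
      any later identical record is a duplicate and is written out */
   if h.check() ne 0 then h.add();
   else output;
run;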

 

FreelanceReinh
Jade | Level 19

@PeterClemmensen: Thanks for sharing the run times. So, the DATA step using a hash object was significantly faster than PROC SORT -- on your computer. This is interesting because it was vice versa on mine (24 vs. 16 s), thus illustrating that general recommendations regarding performance are problematic even if the data are identical.

PeterClemmensen
Tourmaline | Level 20

@FreelanceReinh, thank you. And I agree completely 🙂

 

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.

FreelanceReinh
Jade | Level 19

@PeterClemmensen wrote:

(...)

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.


Sure. Sorry for the ambiguity. My remark about recommendations was not related to your run time comparison.

meenakshim
Fluorite | Level 6

This method is new to me and I need to explore it more, but it's interesting.

I could see that only the duplicate records are output, like with PROC SORT.

Tom
Super User
Yes. In groups with more than one record (ie duplicates in this case) FIRST.lastbyvar will be true only on the first record and LAST.lastbyvar will be true only on the last record. A unique record will be both the first and the last of its group.
The reason to use the data step method instead of just letting PROC SORT do the job is when you want something different. You can write your own logic to control exactly what happens.
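For example, a hedged sketch of that kind of custom logic, assuming the CLASS data sorted by AGE and _ALL_ from the accepted solution: count how many copies of each duplicated record exist and keep one row per group with its count.

data dup_counts;
  set class;
  by age _all_ age;
  if first.age then copies = 0;   /* reset the counter at the start of each identical-record group */
  copies + 1;                     /* sum statement, so COPIES is retained across the group */
  if last.age and copies > 1;     /* keep one row per duplicated record, carrying its count */
run;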
Ksharp
Super User

Just one PROC SORT suffices:

 

proc sort data=have out=want nouniquekey ;
  by _all_;
run;
Patrick
Opal | Level 21

@meenakshim 

From a coding perspective, using proc sort ... nodupkey; by _all_; run; is by far the simplest approach. If you consult the documentation for PROC SORT, you will also find that it provides options to collect the duplicates in a separate data set.
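A small sketch of those options, assuming a data set named HAVE: NODUPKEY keeps one copy of each record in WANT, and DUPOUT= collects the removed duplicates in a separate data set DUPS.

proc sort data=have out=want nodupkey dupout=dups;
  by _all_;
run;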


Discussion stats
  • 11 replies
  • 1251 views
  • 6 likes
  • 7 in conversation