meenakshim
Fluorite | Level 6

Hi,

 

I want to find duplicates based on the entire record of 55 columns. I have done it with PROC SORT using the NODUP / NOUNIQUEKEY options, but I am wondering whether we can accomplish this with DATA step BY-group processing.

 

Can we use _ALL_, _NUMERIC_, etc. in DATA step BY processing, instead of typing in all 55 columns? I have used these in the BY statement in procedures.

 

Or is there any other way in DATA step processing to get the duplicate records (the entire record) without any specific key field?

 

Thanks.

1 ACCEPTED SOLUTION

Tom
Super User

You need to sort.  You might be able to avoid sorting if the data was small and could all be put into a hash.

 

So let's assume you already did this step to sort by all of  the variables so that dups are next to each other.

proc sort data=have;
  by _all_;
run;

Now you want to read the data and detect duplicate rows.  To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to add back one of the variables to the end.

 

Try this example using SASHELP.CLASS. I stuck AGE in before _ALL_ so that the second data step will show that duplicates can be found by using the FIRST. and LAST. variables.

proc sort data=sashelp.class out=class;
  by age _all_;
run;

data test;
  set class;
  by age _all_ age;                  /* repeating AGE after _ALL_ makes FIRST.AGE/LAST.AGE describe the whole-record grouping */
  if not (first.age and last.age);   /* keep only records that are not unique across all variables */
run;

data test2;
  set class;
  by age;                            /* for comparison: groups on AGE alone */
  if not (first.age and last.age);
run;

 

 

 


11 REPLIES
meenakshim
Fluorite | Level 6

I tried this before, but I couldn't figure out what to give the FIRST. and LAST. BY variables so that they represent the whole record. As you said, by adding another variable to the end, it worked as expected. Thank you for the input.

 

My class data set is a copy of SASHELP.CLASS with 4 duplicates added.

 

proc sort data=class;
  by _all_;
run;

data uniqueclass notuniqueclass;
  set class;
  by _all_ age;
  if (first.age and last.age) then output uniqueclass;
  if not (first.age and last.age) then output notuniqueclass;
run;

 

Thanks

meenakshi

noling
SAS Employee

It's definitely possible to do this in a DATA step, but it likely wouldn't be as efficient as just using PROC SORT (which was built exactly for this type of job). You'd probably want to sort the data set first by ALL the fields, then test FIRST.<last field in BY statement> (or LAST. of the same field). I recommend not using a DATA step for this, but please let me know what you find out!
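For concreteness, a minimal sketch of that sort-then-FIRST./LAST. approach, assuming a data set named HAVE and repeating one of its variables (a hypothetical column NAME) after _ALL_ so that the FIRST./LAST. flags describe the whole-record grouping:

proc sort data=have out=have_sorted;
  by _all_;
run;

data not_unique;
  set have_sorted;
  by _all_ name;                      /* NAME stands in for any one variable that exists in HAVE */
  if not (first.name and last.name);  /* keep only rows that are not unique across every column */
run;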

 

I'd be interested in:

1. Processing time using proc sort with a NODUP/ other option

2. Processing time using a datastep

3. Whatever the generalized approach for all fields is with the DATA step. EDIT: Thanks @Tom!! Neat trick, adding another field to the end of _all_!

 



PeterClemmensen
Tourmaline | Level 20

Here is what a hash object approach may look like.

I create 20 million rows of data with 10 variables for demonstration purposes. Then I create a data set of duplicates with the PROC SORT DUPOUT method and with the DATA step hash method, respectively.

 

data testdata1(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 1 min 17 sec  */
proc sort data=testdata1 dupout=test1 nodupkey;
   by _ALL_;
run;

data testdata2(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 28 sec */
data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5',
                  'var6', 'var7', 'var8', 'var9', 'var10');
      h.defineDone();
   end;
 
   set testdata2;

   if h.check() ne 0 then h.add();
   else output;
run;
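With 55 real columns, typing every variable name into defineKey gets tedious. A possible variation (just a sketch, not part of the original reply, assuming the source table is WORK.HAVE) is to build the quoted key list from DICTIONARY.COLUMNS first:

/* collect every variable name of WORK.HAVE as a quoted, comma-separated list */
proc sql noprint;
   select quote(strip(name))
      into :keylist separated by ','
      from dictionary.columns
      where libname='WORK' and memname='HAVE';
quit;

data dups;
   if _N_ = 1 then do;
      declare hash h();
      h.defineKey(&keylist);   /* the macro variable resolves to "VAR1","VAR2",... before compilation */
      h.defineDone();
   end;

   set have;

   /* the first time a key combination is seen it is stored in the hash;
      any later identical record is a duplicate and is written out */
   if h.check() ne 0 then h.add();
   else output;
run;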

 

FreelanceReinh
Jade | Level 19

@PeterClemmensen: Thanks for sharing the run times. So, the DATA step using a hash object was significantly faster than PROC SORT -- on your computer. This is interesting because it was vice versa on mine (24 vs. 16 s), thus illustrating that general recommendations regarding performance are problematic even if the data are identical.

PeterClemmensen
Tourmaline | Level 20

@FreelanceReinh, thank you. And I agree completely 🙂

 

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.

FreelanceReinh
Jade | Level 19

@PeterClemmensen wrote:

(...)

I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.


Sure. Sorry for the ambiguity. My remark about recommendations was not related to your run time comparison.

meenakshim
Fluorite | Level 6

This method is new to me and I need to explore it more, but it's interesting.

I could see that only the duplicate records are output, like with PROC SORT.

Tom
Super User
Yes. In groups with more than one record (ie duplicates in this case) FIRST.lastbyvar will be true only on the first record and LAST.lastbyvar will be true only on the last record. A unique record will be both the first and the last of its group.
The reason to use the data step method instead of just letting PROC SORT do the job is when you want something different. You can write your own logic to control exactly what happens.
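For example, a hedged sketch of that kind of custom logic, assuming the CLASS data sorted by AGE and _ALL_ from the accepted solution: count how many copies of each duplicated record exist and keep one row per group with its count.

data dup_counts;
  set class;
  by age _all_ age;
  if first.age then copies = 0;   /* reset the counter at the start of each identical-record group */
  copies + 1;                     /* sum statement, so COPIES is retained across the group */
  if last.age and copies > 1;     /* keep one row per duplicated record, carrying its count */
run;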
Ksharp
Super User

Just one PROC SORT suffices:

 

proc sort data=have out=want nouniquekey ;
  by _all_;
run;
Patrick
Opal | Level 21

@meenakshim 

From a coding perspective, using proc sort ... nodupkey; by _all_; run; is by far the simplest approach. If you consult the documentation for PROC SORT, you will also find that it provides options to collect the duplicates in a separate data set.
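A small sketch of those options, assuming a data set named HAVE: NODUPKEY keeps one copy of each record in WANT, and DUPOUT= collects the removed duplicates in a separate data set DUPS.

proc sort data=have out=want nodupkey dupout=dups;
  by _all_;
run;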


Discussion stats
  • 11 replies
  • 1251 views
  • 6 likes
  • 7 in conversation