Hi,
I wanted to find duplicates based on the entire record of 55 columns. I have done it using SAS procedures like sort with NODUP/ NON UNIQUE KEY option. But i am wondering , can we able to accomplish this using the data step BY group processing.
can we able to use _ALL_ , _NUMERIC_ etc in BY processing. I have used this in the BY statement in procedures. Instead of typing in all 55 columns.
Or is there is any other way in data step processing that we can get the duplicate record(the entire record ) without any specific key field.
Thanks.
You need to sort. You might be able to avoid sorting if the data was small and could all be put into a hash.
So let's assume you already did this step to sort by all of the variables so that dups are next to each other.
proc sort data=have;
by _all_;
run;
Now you want to read the data and detect duplicate rows. To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to add back one of the variables to the end.
Try this example using SASHELP.CLASS. I stuck AGE in before _ALL_ so that the second data step will show that duplicates can be found by use the FIRST. and LAST. variables.
proc sort data=sashelp.class out=class;
by age _all_;
run;
data test;
set class;
by age _all_ age;
if not (first.age and last.age);
run;
data test2;
set class;
by age ;
if not (first.age and last.age);
run;
You need to sort. You might be able to avoid sorting if the data was small and could all be put into a hash.
So let's assume you already did this step to sort by all of the variables so that dups are next to each other.
proc sort data=have;
by _all_;
run;
Now you want to read the data and detect duplicate rows. To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to add back one of the variables to the end.
Try this example using SASHELP.CLASS. I stuck AGE in before _ALL_ so that the second data step will show that duplicates can be found by use the FIRST. and LAST. variables.
proc sort data=sashelp.class out=class;
by age _all_;
run;
data test;
set class;
by age _all_ age;
if not (first.age and last.age);
run;
data test2;
set class;
by age ;
if not (first.age and last.age);
run;
I tried this before. But i couldn't figure it out what need to give in for FIRST and LAST BY variables to represent whole record. As you said ,by adding another variable to the end, it worked as expected. Thank u for the input.
My class data set is a copy of SASHELP.CLASS with 4 duplicates added.
proc sort data=class;
by _all_;
run;
data uniqueclass notuniqueclass;
set class;
by _all_ age;
if (first.age and last.age) then output uniqueclass;
if not (first.age and last.age) then output notuniqueclass;
run;
Thanks
meenakshi
It's definitely possible to do this in a datastep, but it likely wouldn't be as efficient as just using proc sort (which was built exactly for this type of job). You'd probably want to sort the dataset first by ALL the fields, then use a first-DOT-<Last field in BY statement> (or last-DOT of the same field). I recommend not using a datastep for this, but please let me know what you find out!
I'd be interested in:
1. Processing time using proc sort with a NODUP/ other option
2. Processing time using a datastep
3. Whatever the generalized approach for all fields is with the datastep. EDIT: Thanks @Tom!! - neat trick adding another field to the end of _all_!
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
Here is what a hash object approach may look like
I create 20 mio rows of data with 10 variables for demonstration purposes. Then I create a data set of duplicates with PROC SORT Dupout method and Data Step Hash Method respectively..
data testdata1(drop=i j);
call streaminit(123);
array vars var1-var10;
do i=1 to 20e6;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 10);
end;
output;
end;
run;
/* Run Time: 1 min 17 sec */
proc sort data=testdata1 dupout=test1 nodupkey;
by _ALL_;
run;
data testdata2(drop=i j);
call streaminit(123);
array vars var1-var10;
do i=1 to 20e6;
do j=1 to dim(vars);
vars[j]=rand('integer', 1, 10);
end;
output;
end;
run;
/* Run Time: 28 sec */
data test2;
if _N_ = 1 then do;
declare hash h(hashexp:20);
h.defineKey('var1', 'var2', 'var3', 'var4', 'var5',
'var6', 'var7', 'var8', 'var9', 'var10');
h.defineDone();
end;
set testdata2;
if h.check() ne 0 then h.add();
else output;
run;
@PeterClemmensen: Thanks for sharing the run times. So, the DATA step using a hash object was significantly faster than PROC SORT -- on your computer. This is interesting because it was vice versa on mine (24 vs. 16 s), thus illustrating that general recommendations regarding performance are problematic even if the data are identical.
@FreelanceReinh, thank you. And I agree completely 🙂
I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.
@PeterClemmensen wrote:
(...)
I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.
Sure. Sorry for the ambiguity. My remark about recommendations was not related to your run time comparison.
This method is new to me and i need to explore more. But it's interesting.
I could see only the duplicate records are outputting like in PROC SORT.
Just one PROC SORT is suffice ;
proc sort data=have out=want nouniquekey ;
by _all_;
run;
From a coding perspective using Proc Sort ... nodupkey; by _all_; run is by far the simplest approach. If you consult the docu for Proc Sort you'll also find that Proc Sort provides options to collect duplicates in a separate data set.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.