Hi,
I try to generate a code that looks for duplicates in a dataset based by costumer and dates, here is a random test-data I created:
data testdata;
do year = 2014 to 2023;
date = input(cats(put(year, 4.), '1231'), yymmdd8.);
do i = 1 to 25;
customer = cats('Customer', (year - 2014) * 25 + i);
output;
if i <= 10 then do;
customer = cats('Customer', (year - 2014) * 25 + i);
output;
end;
end;
end;
format date yymmddn8.;
drop i year;
run;
But my orginal data have 300,000 obs,
and I want to look for duplicates between year=20201231,20211231,20221231 and 20231231, only. I tried this one:.. But it takes ages since it loop around for every costumer. I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.
macro check_duplicates(date_list, dataset); %do i = 1 %to %sysfunc(countw(&date_list)); %let date = %scan(&date_list, &i); %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do; data _duplicates_; set &dataset; where date_id = &date; by date_id; retain count; if first.date_id then count = 0; count + 1; if count > 1 then do; put coustumer; put date_id; output; /* Output observation if duplicate */ end; if last then put "No dup found for &date."; run; proc print data=_duplicates_ noobs; title "Dup found for date &date"; run; %end; %else %put No date found for &date.; %end; %mend; %check_duplicates(20201231 20221231 202331, lib.dataset);
Here is a suggestion, but not sure what the point is of the dataset _duplicates_ in your example, since it gets overridden for each date in the list.
You could try this code, which works from the testdata dataset you create. If you want the customers who are duplicates, you can find them in the _AllDups_ table:
%macro check_duplicates(date_list, dataset); proc sort data=&dataset.; by date customer; run; data _AllDups_; set &dataset.; by date customer; if first.customer then c=0; c+1; if last.customer and c>1 then output; run; %do i = 1 %to %sysfunc(countw(&date_list)); %let date = %scan(&date_list, &i); %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do; data _null_; dt=input("&date.", yymmdd8.); put dt= date9.; if _N_ = 1 then do; length date 8; declare hash stats(dataset:"AllDups"); stats.defineKey('date'); stats.defineDone(); end; rc=stats.Check(key:dt); if rc=0 then put "Dup found for &date."; else put "No dup found for &date."; run; %end; %end; %mend; %check_duplicates(20201231 20221231 20231231, testdata);
I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.
Why?
I find it almost impossible to imagine that a macro could work faster than the NODUPKEY option in PROC SORT.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.