BookmarkSubscribeRSS Feed
melhaf
Fluorite | Level 6

Hi,

I try to generate a code that looks for duplicates in a dataset based by costumer and dates, here is a random test-data I created:

 
data testdata;
    do year = 2014 to 2023;

        date = input(cats(put(year, 4.), '1231'), yymmdd8.);

        do i = 1 to 25;
            customer = cats('Customer', (year - 2014) * 25 + i);
            output;
            if i <= 10 then do;
                customer = cats('Customer', (year - 2014) * 25 + i);
                output;
            end;
        end;
    end;
    format date yymmddn8.;
    drop i year;
run;

But my orginal data have 300,000 obs, 

and I want to look for duplicates between year=20201231,20211231,20221231 and 20231231, only. I tried this one:.. But it takes ages since it loop around for every costumer. I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.

 

macro check_duplicates(date_list, dataset);

    %do i = 1 %to %sysfunc(countw(&date_list));
        %let date = %scan(&date_list, &i);
 
        %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do;
 
            data _duplicates_;
                set &dataset;
                where date_id = &date;
                by date_id;
                retain count;
                if first.date_id then count = 0;
                count + 1;
                if count > 1 then do;
                    put coustumer;
                    put date_id;
                    output; /* Output observation if duplicate */
                end;
                if last then put "No dup found for &date.";
            run;
 
            proc print data=_duplicates_ noobs;
                title "Dup found for date &date";
            run;
 
        %end;
        %else %put No date found for &date.;
 
    %end;
%mend;
 
%check_duplicates(20201231 20221231 202331, lib.dataset);

 

2 REPLIES 2
JB1_DK
Fluorite | Level 6

Here is a suggestion, but not sure what the point is of the dataset _duplicates_ in your example, since it gets overridden for each date in the list.

 

You could try this code, which works from the testdata dataset you create. If you want the customers who are duplicates, you can find them in the _AllDups_ table:

 

%macro check_duplicates(date_list, dataset);

   proc sort data=&dataset.; by date customer; run;

   data _AllDups_;
      set &dataset.; 
      by date customer; 
      if first.customer then c=0;
      c+1;
      if last.customer and c>1 then output;
   run;

    %do i = 1 %to %sysfunc(countw(&date_list));
        %let date = %scan(&date_list, &i);
 
        %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do;

        data _null_;
           dt=input("&date.", yymmdd8.);
           put dt= date9.;
           if _N_ = 1 then do;
               length date 8;
               declare hash stats(dataset:"AllDups");
               stats.defineKey('date');
               stats.defineDone();
            end;
            rc=stats.Check(key:dt);
            if rc=0 then put "Dup found for &date.";
            else put "No dup found for &date.";
         run;

         %end;
   %end;

%mend;

%check_duplicates(20201231 20221231 20231231, testdata);
PaigeMiller
Diamond | Level 26

I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.

 

Why?

 

I find it almost impossible to imagine that a macro could work faster than the NODUPKEY option in PROC SORT.

--
Paige Miller

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 423 views
  • 3 likes
  • 3 in conversation