BookmarkSubscribeRSS Feed
melhaf
Fluorite | Level 6

Hi,

I try to generate a code that looks for duplicates in a dataset based by costumer and dates, here is a random test-data I created:

 
data testdata;
    do year = 2014 to 2023;

        date = input(cats(put(year, 4.), '1231'), yymmdd8.);

        do i = 1 to 25;
            customer = cats('Customer', (year - 2014) * 25 + i);
            output;
            if i <= 10 then do;
                customer = cats('Customer', (year - 2014) * 25 + i);
                output;
            end;
        end;
    end;
    format date yymmddn8.;
    drop i year;
run;

But my orginal data have 300,000 obs, 

and I want to look for duplicates between year=20201231,20211231,20221231 and 20231231, only. I tried this one:.. But it takes ages since it loop around for every costumer. I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.

 

macro check_duplicates(date_list, dataset);

    %do i = 1 %to %sysfunc(countw(&date_list));
        %let date = %scan(&date_list, &i);
 
        %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do;
 
            data _duplicates_;
                set &dataset;
                where date_id = &date;
                by date_id;
                retain count;
                if first.date_id then count = 0;
                count + 1;
                if count > 1 then do;
                    put coustumer;
                    put date_id;
                    output; /* Output observation if duplicate */
                end;
                if last then put "No dup found for &date.";
            run;
 
            proc print data=_duplicates_ noobs;
                title "Dup found for date &date";
            run;
 
        %end;
        %else %put No date found for &date.;
 
    %end;
%mend;
 
%check_duplicates(20201231 20221231 202331, lib.dataset);

 

2 REPLIES 2
JB1_DK
Fluorite | Level 6

Here is a suggestion, but not sure what the point is of the dataset _duplicates_ in your example, since it gets overridden for each date in the list.

 

You could try this code, which works from the testdata dataset you create. If you want the customers who are duplicates, you can find them in the _AllDups_ table:

 

%macro check_duplicates(date_list, dataset);

   proc sort data=&dataset.; by date customer; run;

   data _AllDups_;
      set &dataset.; 
      by date customer; 
      if first.customer then c=0;
      c+1;
      if last.customer and c>1 then output;
   run;

    %do i = 1 %to %sysfunc(countw(&date_list));
        %let date = %scan(&date_list, &i);
 
        %if %sysfunc(inputn(&date, yymmdd8.)) ne . %then %do;

        data _null_;
           dt=input("&date.", yymmdd8.);
           put dt= date9.;
           if _N_ = 1 then do;
               length date 8;
               declare hash stats(dataset:"AllDups");
               stats.defineKey('date');
               stats.defineDone();
            end;
            rc=stats.Check(key:dt);
            if rc=0 then put "Dup found for &date.";
            else put "No dup found for &date.";
         run;

         %end;
   %end;

%mend;

%check_duplicates(20201231 20221231 20231231, testdata);
PaigeMiller
Diamond | Level 26

I know fo sure that it is easy to just use nodupkey (can't remember) but I want it to be a macro-loop.

 

Why?

 

I find it almost impossible to imagine that a macro could work faster than the NODUPKEY option in PROC SORT.

--
Paige Miller

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 238 views
  • 3 likes
  • 3 in conversation