Solved: What's the best way to see if a dataset has any duplicates?

cosmid · Posted 01-22-2024 09:41 PM

Hi,

Is PROC SORT with NODUPKEY or NODUP the best way to check for duplicates, or is there a better way to quickly check whether a variable has duplicated values in a dataset?

Thanks

yabwon · Posted 01-23-2024 02:46 AM

You can use an in-memory hash table to read the data and print an "error" on the first duplicate:

data have;
input x $1. @@;
if x ne " ";
cards;
qwertyuiopasdfghjklzxcvbnm1234567890q
;
run;
proc print;
run;


/* test for dups */
data _null_;
  declare hash H();
  H.defineKey("x");
  H.defineDone();

  do until(eof);
    set HAVE end=eof curobs=curobs;
    rc=H.add();
    if rc then 
      do;
        put "ERROR: Duplicate value: " x "detected in observation " curobs;
        stop;
      end;
  end;
stop;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

cosmid · Posted 01-23-2024 08:58 AM

I was really hoping for a much shorter solution, lol...but this will definitely help me learn how to use HASH, thanks!

andreas_lds · Posted 01-23-2024 02:58 AM

Another hash solution (using the data provided by @yabwon):

data _null_;
   if 0 then set have;
   declare hash h(dataset: 'have', duplicate: 'e');
   h.defineKey('x');
   h.defineDone();
   stop;
run;

yabwon · Posted 01-23-2024 03:10 AM

And the cool thing is that it can easily be extended from a single-variable check to whole-row duplicates:

data have;
input x $1. @@;
if x ne " ";
y=rank(x);
z=y*10;
cards;
qwertyuiopasdfghjklzxcvbnm1234567890q
;
run;
proc print;
run;

data _null_;
   if 0 then set have;
   declare hash h(dataset: 'have', duplicate: 'e');
   h.defineKey(all:'yes'); /* duplicated rows */
   h.defineDone();
   stop;
run;

Bart


cosmid · Posted 01-23-2024 08:57 AM

Hi andreas!

I didn't know there's a DUPLICATE that can be used with HASH.

I have seen a lot of programs with the IF statement:
if 0 then set data_set_name;

I always wondered how that statement can execute, because I thought the numeric value for FALSE is also 0. So the 0 here must mean something else?

cosmid · Posted 01-23-2024 06:01 PM

I understand the IF 0 now. It's used to set the PDV and skip reading in the observations.
Sorry, I wanted to follow up because I asked about it in an earlier reply and I don't know how to delete that reply. So, to avoid wasting more of your time answering it, I'll just explain here.
Thanks again for the help!
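
To make the IF 0 THEN SET point concrete, here is a minimal sketch, assuming the HAVE dataset from earlier in the thread. The compiler sees the SET statement and adds the variables of HAVE to the PDV, but the condition is never true, so no observation is read at run time:

```sas
data _null_;
  /* Compile time: SET adds the variables of HAVE to the PDV.   */
  /* Run time: 0 is false, so the SET statement never executes. */
  if 0 then set have;
  put _all_;  /* x exists in the PDV but is still missing (blank) */
  stop;       /* nothing ever reads to end-of-file, so end the step explicitly */
run;
```

This is why the hash examples above can name x in defineKey() without any separate LENGTH or ATTRIB statement.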

LinusH · Posted 01-23-2024 03:11 AM

Define "best".
Apart from the suggested hash techniques, you could also use PROC SQL with HAVING and COUNT.
Or apply a unique index and see whether the operation succeeds.

Data never sleeps
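
The PROC SQL approach mentioned above could look something like the following. This is only a sketch, assuming the HAVE dataset and the variable x from the earlier posts; the column name n_rows is made up for illustration:

```sas
/* List every value of x that occurs more than once, with its count */
proc sql;
  select x, count(*) as n_rows
    from have
    group by x
    having count(*) > 1;
quit;

/* SQLOBS holds the number of rows the SELECT returned; 0 means no duplicates */
%put NOTE: &sqlobs duplicated value(s) of x found.;
```

Unlike PROC SORT, this does not create an output dataset, and it reports all duplicated values rather than stopping at the first one.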

cosmid · Posted 01-23-2024 09:02 AM

Hi LinusH,

So, I was hoping for a built-in SAS function that I didn't know of, or something like a one-liner. The other solution I found besides PROC SORT was using FIRST. and LAST. and comparing them. PROC SORT creates another dataset, and FIRST. and LAST. involve more coding, so I was hoping for a shorter version of some sort. I thought one might exist, since checking for duplicates is such a common task.

SASKiwi · Posted 01-23-2024 07:39 PM

Personally I find it useful to create macros for common tasks like this. It means you can get your answer with just one statement. It also means the underlying method isn't so important.

%macro Find_Dups ( dataset = 
                  ,byvar   =
                  ,dupvar  = 
                 );

%if &dupvar = %then %let dupvar = &byvar; 

proc sort data = &dataset 
          out = sorted
           ;
  by &byvar;
run;

data dups;
  set sorted;
   by &byvar;
  if not (first.&dupvar and last.&dupvar);
run;

%mend Find_Dups;

Although you will notice that I prefer to create a table with the duplicate rows.
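
For illustration, calls to the macro above might look like this. It is a sketch that assumes the macro has already been compiled in the session and that HAVE is the three-variable dataset (x, y, z) from earlier in the thread:

```sas
/* Single key: rows of HAVE whose x value occurs more than once go to WORK.DUPS */
%Find_Dups(dataset = have, byvar = x)

/* Composite key: sort by x and y, then test duplicates at the last key (y), */
/* i.e. rows that share the same x/y combination                             */
%Find_Dups(dataset = have, byvar = x y, dupvar = y)
```

Note that each call overwrites WORK.SORTED and WORK.DUPS, so inspect or copy DUPS before calling the macro again.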

cosmid · Posted 02-25-2024 11:18 AM

Thanks for the code! Is there a way for SAS to take parameters at the command line? I'm referring to a Linux environment. For example, if I wanted to check whether dataset sample.sas7bdat has any duplicates, I could just run the program with a command like SAS PROG.SAS sample var
and the program would take the first parameter as the dataset and the second parameter as the BY variable.

Tom · Posted 02-25-2024 12:07 PM

This thread seems to be devolving into a general discussion.

Much better to post new questions on new threads. You can always include a link to some older topic.

You can use the old -sysparm option.

https://documentation.sas.com/doc/en/mcrolref/3.2/p0ajr6rtdhuhzbn199hhpkak2v8p.htm

Or you can take advantage of the new -set option.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/hostunx/n106qouqj0hfk5n1wgqpw8iovxy2.htm

SASKiwi · Posted 02-25-2024 05:06 PM

@cosmid - I suggest you follow @Tom's advice regarding the SET option, which creates environment variables you can read using %SYSGET or SYSGET in your SAS program.
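
As a rough sketch of how that could fit together: the environment variable names dsname and byvar below are made up for illustration, and the example assumes the Find_Dups macro from earlier in the thread is available to the program (for example via %INCLUDE or an autocall library) and that the dataset name resolves in the session.

```sas
/* Hypothetical invocation from the Linux shell:         */
/*   sas prog.sas -set dsname sample -set byvar var      */

/* prog.sas: read the environment variables created by -set */
%let dsname = %sysget(dsname);
%let byvar  = %sysget(byvar);

%Find_Dups(dataset = &dsname, byvar = &byvar)

/* Alternative with -sysparm, passing one string and splitting it by position: */
/*   sas prog.sas -sysparm "sample var"                                        */
/* %let dsname = %scan(&sysparm, 1);                                           */
/* %let byvar  = %scan(&sysparm, 2);                                           */
```

With -set each value gets its own name, which tends to be easier to maintain than relying on the position of words inside a single -sysparm string.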
