BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
cosmid
Lapis Lazuli | Level 10

Hi,

 

Is PROC SORT nodupkey or nodup the best way to check for duplicates or there a better way to quickly check to see if a variable has duplicated values in a dataset?

 

Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

you can use in memory hash table to read data and print "error" on the first duplicate:

data have;
input x $1. @@;
if x ne " ";
cards;
qwertyuiopasdfghjklzxcvbnm1234567890q
;
run;
proc print;
run;


/* test for dups */
data _null_;
  declare hash H();
  H.defineKey("x");
  H.defineDone();

  do until(eof);
    set HAVE end=eof curobs=curobs;
    rc=H.add();
    if rc then 
      do;
        put "ERROR: Duplicate value: " x "detected in observation " curobs;
        stop;
      end;
  end;
stop;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

12 REPLIES 12
yabwon
Onyx | Level 15

you can use in memory hash table to read data and print "error" on the first duplicate:

data have;
input x $1. @@;
if x ne " ";
cards;
qwertyuiopasdfghjklzxcvbnm1234567890q
;
run;
proc print;
run;


/* test for dups */
data _null_;
  declare hash H();
  H.defineKey("x");
  H.defineDone();

  do until(eof);
    set HAVE end=eof curobs=curobs;
    rc=H.add();
    if rc then 
      do;
        put "ERROR: Duplicate value: " x "detected in observation " curobs;
        stop;
      end;
  end;
stop;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



cosmid
Lapis Lazuli | Level 10
I was really hoping for a much shorter solution, lol...but this will definitely help me learn how to use HAS, thanks!
andreas_lds
Jade | Level 19

Another hash-solution (using the data provided by @yabwon 😞

 

data _null_;
   if 0 then set have;
   declare hash h(dataset: 'have', duplicate: 'e');
   h.defineKey('x');
   h.defineDone();
   stop;
run;
yabwon
Onyx | Level 15

And cool thing is that it can be easily extended from only single variable check to row duplicates;

data have;
input x $1. @@;
if x ne " ";
y=rank(x);
z=y*10;
cards;
qwertyuiopasdfghjklzxcvbnm1234567890q
;
run;
proc print;
run;

data _null_;
   if 0 then set have;
   declare hash h(dataset: 'have', duplicate: 'e');
   h.defineKey(all:'yes'); /* duplicated rows */
   h.defineDone();
   stop;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



cosmid
Lapis Lazuli | Level 10
Hi andreas!

I didn't know there's a DUPLICATE that can be used with HASH.

I have seen a lot programs with the IF statement:
if 0 then set data_set_name;

I always wondered how that statement can execute because I thought the default numeric value for FALSE is also 0? So the 0 here must mean something else?
cosmid
Lapis Lazuli | Level 10
I understand the IF 0 now. It's used to set the PDV and skip reading in the observations.
Sorry, I wanted to follow up because I asked about it in an earlier reply and I don't know how to delete that reply. So in case I'll waste more of your time to answer I'll just explain here.
Thanks again for the help!
LinusH
Tourmaline | Level 20
Define "best".
Apart from the suggested hash techniques, you could also use PROC SQL with HAVING and COUNT.
Or apply a unique index and see the operation succeeeds.
Data never sleeps
cosmid
Lapis Lazuli | Level 10
Hi LinusH,

So, I was hoping for a built-in SAS function that I didn't know of or something like a one line of code. The other solution I found beside PROC SORT was using the FIRST and LAST and compare them. The PROC SORT creates another dataset and the FIRST and LAST involves more coding so I was hoping for a shorter version of some sort. I thought there might be one that exist since checking for duplicates is such a common task.
SASKiwi
PROC Star

Personally I find it useful to create macros for common tasks like this. It means you can get your answer with just one statement. It also means the underlying method isn't so important.

SASKiwi_0-1706056615771.png

 

%macro Find_Dups ( dataset = 
                  ,byvar   =
                  ,dupvar  = 
                 );

%if &dupvar = %then %let dupvar = &byvar; 

proc sort data = &dataset 
          out = sorted
           ;
  by &byvar;
run;

data dups;
  set sorted;
   by &byvar;
  if not (first.&dupvar and last.&dupvar);
run;

%mend Find_Dups;

Although you will notice that I prefer to create a table with the duplicate rows.

 

 

 

cosmid
Lapis Lazuli | Level 10
Thanks for the code! Is there a way for SAS to take parameters at the command line? I'm referring to Linux environment. For example, if I wanted to check if dataset sample.sas7bdat has any duplicate, I could just run the program with command like SAS PROG.SAS sample var
And the program will take the first parameter as the dataset and the 2nd parameter as the BY variable
Tom
Super User Tom
Super User

This thread seems to be devolving into a general discussion.

Much better to post new questions on new threads.  You can always include a link to some older topic.

 

You can use the old -sysparm option.

https://documentation.sas.com/doc/en/mcrolref/3.2/p0ajr6rtdhuhzbn199hhpkak2v8p.htm

 

Or you can take advantage of the new -set option.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/hostunx/n106qouqj0hfk5n1wgqpw8iovxy2.htm

 

SASKiwi
PROC Star

@cosmid - I suggest you follow @Tom 's advice regarding the SET option which creates environment variables you can read using %SYSGET or SYSGET in your SAS program.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

5 Steps to Your First Analytics Project Using SAS

For SAS newbies, this video is a great way to get started. James Harroun walks through the process using SAS Studio for SAS OnDemand for Academics, but the same steps apply to any analytics project.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 972 views
  • 14 likes
  • 6 in conversation