We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Getting All Duplicates within a SAS Data Set

by Contributor EricCai on ‎01-29-2015 12:52 PM - edited on ‎10-13-2016 07:59 AM by Community Manager (653 Views)

The following blog post was originally published on my blog, The Chemical Statistician.

 

Introduction

A common task in data manipulation is to obtain all observations that appear multiple times in a data set - in other words, to obtain the duplicates.  It turns out that there is no procedure or function that will directly provide the duplicates of a data set in SAS*.

*Update: As Fareeza Khurshed kindly commented, the NOUNIQUEKEY option in PROC SORT is available in SAS 9.3+ to directly obtain duplicates.

 

The Wrong Way to Obtain Duplicates in SAS

You may think that PROC SORT can accomplish this task with the nodupkey and the dupout options.  However, the output data set from such a procedure does not have the first of each set of duplicates.  Here is an example.

data test;     

     input var1 $;     

     cards;     

          111     

          222     

          333     

          444     

          444     

          444     

          555     

          555     

          666     

          666     

          666     

          777     

          888     

          888     

          999     

          999     

          999;

run;

 

The data set test has 5 sets of duplicates.  A simple PROC FREQ shows the frequency counts of each possible value of var1.

proc freq     

     data = test;     

     tables var1 / nocum nocol norow nopercent;

run;

 

111 1
222 1
333 1
444 3
555 2
666 3
777 1
888 2
999 3

 

Here is the WRONG way to get the duplicates from test.

proc sort     

     data = test          

     nodupkey               

     dupout = dups;     

     by var1;

run;

proc print     

     data = dups;

run;

 

1 444
2 444
3 555
4 666
5 666
6 888
7 999
8 999

The PROC FREQ of the data set dups will show a different frequency count than that of test as shown previously.

444 2
555 1
666 2
888 1
999 2

When you use nodupkey and dupout in PROC SORT, you will remove the first of each set of duplicates in the output data set.  That's why each duplicated value has a frequency in dups that is one less than its frequency in test.

 

 

The Right Way to Obtain Duplicates in SAS

 

To obtain ALL duplicates of a data set, you can take advantage of first.variable and last.variableHere is the code to do it with the above example data set of test; you will get both the single observations and the duplicate observations.

 

(Update: As Robert Matthews commented, you can alter this code to just obtain the duplicates - this would be more efficient if you have many single observations and few duplicate observations, and you don't care about the singles.)

 

data single dup;     

     set test;     

     by var1;     

     if first.var1 and last.var1          

          then output single;     

     else output dup;

run;

 

 

Here is the logic behind the code.  We are producing 2 data sets: single for the observations that appear only once, and dup for the duplicate observations.

 

If first.var1 and last.var1 both equal to "1" for the same observation, then there is no duplicate, and this observation belongs to single.

Otherwise, it's a duplicate, and this observation belongs to dup.

 

Here is what you get as the output data set.

1 444
2 444
3 444
4 555
5 555
6 666
7 666
8 666
9 888
10 888
11 999
12 999
13 999

 

Notice that it has the correct number of duplicates for each value.

Comments
by Valued Guide
on ‎10-13-2016 02:01 AM

Here's an alternative if the original data is not sorted by the variable whose duplicates are wanted, but big enough to make using proc sort expensive (in terms of computer resources).  This single DATA step outputs all duplicates without pre-sorting the data set, but produces the duplicates in sorted order.

 

The fundamental logic is to (1) record-by-record, build a hash table V tracking all values of VAR as they are encountered, noting the record number of the first encounter. (2) compare each incoming value of VAR record to the hash table.   If not found add it to V.  But if it is found then add the current record to hash object DUPLICATES, and (if neccessary) use SET ... POINT= to go back and get the first duplicate and put it also in DUPLICATES.

 

data _null_;
  set have end=eod;
  if _n_=1 then do;
    declare hash v ();
      v.definekey('var');
      v.definedata('var','_point');
      v.definedone();
    declare hash duplicates (dataset:'have (obs=0)',ordered:'a',multidata:'y');
      duplicates.definekey('var');
      duplicates.definedata(all:'y');
      duplicates.definedone();
  end;

  if v.find()^=0 then v.add(key:var,data:var,data:_n_);  /*If new VAR add it to hash*/
  else do;                 /* Not new var? Then we are processing a duplicate       */
    duplicates.add();      /* First add incoming duplicate                          */
    if _point^=. then do;  /* Then go back to add the first dupe if not already done*/
      set have point=_point;
      duplicates.add();
      v.replace(key:var,data:var,data:.);
    end;
  end;
  if eod then rc=duplicates.output(dataset:'duplicates2');
run;

 

Note by declaring hash object DUPLICATES as "multidata:'Y'", it can hold multiple items ("rows") per key value.  Making it "ordered:'A'" means that it is stored in sorted order.

 

Regards,
Mark

Your turn
Sign In!

Want to write an article? Sign in with your profile.