Remove Duplicates Based on a Pair of Columns

Reply
Occasional Learner
Posts: 1

Remove Duplicates Based on a Pair of Columns

Hello,

 

I have run into a task in SAS that, though appearing easily accomplished, has actually proved to be quite tricky.

 

I have a dataset that looks like this:

 

ID     A     B     C

1      a      e     #

2      c      f      #

3      e      a     #

4      f       c     #  ,

 

where A and B are identifiers and C is the common field between the two. In this particular instance, rows 1 and 3 are duplicates while rows 2 and 4 are duplicates.  I need to find a way to remove the instances of duplication.  There may be an easy answer to the problem, but I may have been looking at it too long to notice it Smiley Very Happy.

 

Thanks in advance,

 

MS

 

Regular Contributor
Posts: 161

Re: Remove Duplicates Based on a Pair of Columns

As with many problems, there are many ways to solve this, using a hash table, proc sql, or a data step.   Here is a way using simple data steps...

data in;
input ID VAR1:$1. VAR2:$1. VAR3:$1.;
DATALINES;

1      a      e     #
2      c      f      #
3      e      a     #
4      f       c     # 
;
RUN;
PROC SORT DATA=IN;
  BY VAR1 VAR2;
RUN;
PROC SORT DATA=IN OUT =REV;
  BY VAR2 VAR1;
RUN;

DATA DUPES;
     MERGE IN (IN=A RENAME=(VAR1=V1 VAR2=V2 ID=ID1 VAR3=V3))
           REV(IN=B RENAME=(VAR1=V2 VAR2=V1 ID=ID2 VAR3=V3));
        BY V1 V2;
     IF A & B;
     IF ID1 NE . AND ID2 NE . 
     THEN KEEP_REC=MIN(ID1,ID2);
RUN;
PROC SORT DATA=DUPES OUT=FINAL(KEEP=ID V1 V2 V3) NODUPKEYS;
  BY KEEP_REC;
RUN;
PROC PRINT;

Here is the output it generates...

Obs	V1	V2	V3
1	a	e	#
2	c	f	#

Hope this helps.... Good Luck...!!!

Kannan Deivasigamani
Super User
Posts: 5,372

Re: Remove Duplicates Based on a Pair of Columns

One way would be to create a pair of new variables:

 

data want;

set have;

if a < b then do;

   minval=a;

   maxval=b;

end;

else do;

   minval=b;

   maxval=a;

end;

run;

 

proc sort data=want nodupkey;

   by minval maxval;

run;

 

You could always drop minval and maxval later.

 

Good luck.

Super User
Posts: 19,192

Re: Remove Duplicates Based on a Pair of Columns

Simplifying @Astounding solutions, 

 

data want;
set have;

array temp(2) minval maxval;

new_a=a; new_b=b;
call sortn(temp(*));

/*if character variables, use below instead*/
/*array temp(2) $ minval maxval;*/

/*new_a=a; new_b=b;*/
/*call sortc(temp(*));*/

run;
 
proc sort data=want nodupkey;
   by minval maxval;
run;
Respected Advisor
Posts: 4,138

Re: Remove Duplicates Based on a Pair of Columns

Just another coding option which will work as long as your result dataset fits into memory.

data in;
  input ID VAR1:$1. VAR2:$1. VAR3:$1.;
  DATALINES;
1 a e #
2 c f #
3 e a #
4 f c # 
;
RUN;

data _null_;
  set in end=last;
  if _n_=1 then
    do;
      dcl hash h();
      _rc=h.defineKey('var1','var2');
      _rc=h.defineData('id','var1','var2','var3');
      _rc=h.defineDone();
    end;
  if h.check(key:var1,key:var2) and h.check(key:var2,key:var1) then h.add();
  if last then h.output(dataset:'want');
run;
Ask a Question
Discussion stats
  • 4 replies
  • 350 views
  • 2 likes
  • 5 in conversation