To select duplicates

Reply
Contributor
Posts: 62

To select duplicates

I have database which contains more than 1 million observation. I would like to select all observations that have same two variables (for example ID and date)

How could I do that?

Thanks

Super User
Super User
Posts: 7,392

Re: To select dublicates

There are several methods.  One option was discussed in a thread earlier today: https://communities.sas.com/thread/57918

Contributor
Posts: 62

Re: To select dublicates

I do not want to delete duplicates, i want all observations that has one or more duplicate to extract and to check if they are totally identical in all variables

Super User
Super User
Posts: 7,392

Re: To select dublicates

Hi, in which case just change the logic to:

proc sql;

  create table inter as

  select  col1,

          col2,

          tmp

  from    (select *,count(*) as tmp from have group by col1,col2)

  where   tmp > 1;

quit;

And this will give all records which appear more than once.

Contributor
Posts: 62

Re: To select dublicates

Thanks!

how could i do in this case:

i select duplicates using two variables (col1 and col2), then i want to see if the rest of variables are identical as well (for example if all diagnosis is identical in selected duplicates). How could i do it?

Super User
Super User
Posts: 7,392

Re: To select dublicates

Your not really talking about duplicates then so much, what you want is to compare data.  For that use the proc compare function, you can specify in there a by line if you want to match based on variables rather than row number etc.  This compares all the daa and shows you differences.  If you only want uniques, then drop duplicate data before running your proc compare. 

Contributor
Posts: 62

Re: To select dublicates

Thanks, but it looks that this procedure do not do what i really want to.

I will give the example. Lets say i have data:

Col1 Col2 Diag1 Diag2 Diag3 Diag4

1       2      A1      A2     A3   A4

1       2      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A0      A2     A3   A4

I want to separate this table in such way: first - to take all observations by col1 and col2 that has in all line diag identical (in this case would be this

1       2      A1      A2     A3   A4

1       2      A1      A2     A3   A4)

and second, i want to extract the one that has slight difference, in my case would be this:

2       3      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A0      A2     A3   A4

Super User
Super User
Posts: 7,392

Re: To select dublicates

Well, that's really pushing the boat out.  Get a count of your COLs, and your DIAGs by COLs.  Then add these to the data.  You can then see what repeats and what doesn't.

Try:

data have;
  attrib Col1 Col2 Diag1 Diag2 Diag3 Diag4 format=$20.;
  infile cards;
  input col1 $ col2 $ diag1 $ diag2 $ diag3 $ diag4 $;
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;


proc sql;
  create table INTER as
  select  A.*,
          B.DIAG_COUNT
  from    (select  *,
                    count(*) as ID_COUNT
            from    HAVE
            group by COL1,COL2) A
  left join (
              select *,
                     count(*) as DIAG_COUNT
              from   HAVE
              group by COL1,COL2,DIAG1,DIAG2,DIAG3,DIAG4) B
  on      A.COL1=B.COL1
  and     A.COL2=B.COL2
  and     A.DIAG1=B.DIAG1
  and     A.DIAG2=B.DIAG2
  and     A.DIAG3=B.DIAG3
  and     A.DIAG4=B.DIAG4;

  create table TABLE1 as
  select  *
  from    INTER A
  where   not exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

  create table TABLE2 as
  select  *
  from    INTER A
  where   exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

quit;


Respected Advisor
Posts: 3,777

Re: To select dublicates

Find the groups(col1 col2) with unique records those will have only ONE record or "slight differences".  The proc sort option nouniquekey finds the groups of interest.

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;;;;
   run;
proc sort data=have out=_null_ nouniquekey uniqueout=unique;
   by col1--diag4;
   run;
proc sort data=unique(keep=colSmiley Happy out=who nodupkey;
  
by col:;
   run;
data mixed allequal;
   merge sorted who(in=in2);
   by col:;
   if in2 then output mixed;
   else output allequal;
   run;
proc print data=mixed;
   run;
proc print data=allequal;
   run;
Super Contributor
Posts: 275

Re: To select dublicates

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;

proc sql;
  create table one as
  select * from have where col1 not in (select col1 from (select * from have except select *from have group by 1,2,3,4,5,6 having count(*)>1));
  create table two as
  select * from have except all select * from one;
quit;

Super User
Posts: 9,671

Re: To select dublicates

Are you talking about splitting a dataset ?


data have;
input Col1 Col2 (Diag1 Diag2 Diag3 Diag4 ) ($);
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;
proc sort data=have;by col1 col2;run;
data _null_;
 if _n_ eq 1 then do;
  if 0 then set have;
  declare hash h(multidata:'y');
   h.definekey('col1','col2');
   h.definedata('col1','col2','Diag1','Diag2','Diag3','Diag4' );
   h.definedone();
 end;
 set have;
 by col1 col2;
 h.add();
 if last.col2 then do;h.output(dataset:cats('_',col1,col2));h.clear();end;
 run;



Xia Keshan

Trusted Advisor
Posts: 1,204

Re: To select dublicates

Try this.

data have;
input id date date9.;
format date date9.;
datalines;
1 12-Aug-97
1 12-Aug-97
1 13-Aug-97
1 14-Aug-97
1 14-Aug-97
1 15-Aug-97
1 16-Aug-97
1 16-Aug-97
2 14-Aug-97
2 14-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 16-Aug-97
2 17-Aug-97
2 19-Aug-97
;
run;

proc sort data=have;
by id date;
run;

data want;
set have;
by id date;
if not first.date and last.date;
run;

Frequent Contributor
Posts: 114

Re: To select duplicates

Hi,

Greetings for the day !

you can also refer to this as well.  :

Regards

Uma Shanker Saini

SAS Certification prep and basic understanding

Super User
Posts: 9,671

Re: To select duplicates

You'd better post some sample data and the output you like.

The code below is not tested.

proc sql;

  create table inter as

  select  *

   from  have

    group by id,date

     having count(*) gt 1 ;

quit;

Xia Keshan

Valued Guide
Posts: 3,208

Re: To select duplicates

Ask a Question
Discussion stats
  • 14 replies
  • 529 views
  • 0 likes
  • 8 in conversation