To select duplicates

viollete · Posted 05-16-2014 09:01 AM

I have database which contains more than 1 million observation. I would like to select all observations that have same two variables (for example ID and date)

How could I do that?

Thanks

RW9 · Posted 05-16-2014 09:03 AM

There are several methods. One option was discussed in a thread earlier today: https://communities.sas.com/thread/57918

viollete · Posted 05-16-2014 10:26 AM

I do not want to delete duplicates, i want all observations that has one or more duplicate to extract and to check if they are totally identical in all variables

RW9 · Posted 05-16-2014 10:32 AM

Hi, in which case just change the logic to:

proc sql;

create table inter as

select col1,

col2,

tmp

from (select *,count(*) as tmp from have group by col1,col2)

where tmp > 1;

quit;

And this will give all records which appear more than once.

viollete · Posted 05-29-2014 04:33 AM

Thanks!

how could i do in this case:

i select duplicates using two variables (col1 and col2), then i want to see if the rest of variables are identical as well (for example if all diagnosis is identical in selected duplicates). How could i do it?

RW9 · Posted 05-29-2014 04:40 AM

Your not really talking about duplicates then so much, what you want is to compare data. For that use the proc compare function, you can specify in there a by line if you want to match based on variables rather than row number etc. This compares all the daa and shows you differences. If you only want uniques, then drop duplicate data before running your proc compare.

viollete · Posted 05-29-2014 05:28 AM

Thanks, but it looks that this procedure do not do what i really want to.

I will give the example. Lets say i have data:

Col1 Col2 Diag1 Diag2 Diag3 Diag4

1 2 A1 A2 A3 A4

2 3 A1 A2 A3 A4

2 3 A0 A2 A3 A4

I want to separate this table in such way: first - to take all observations by col1 and col2 that has in all line diag identical (in this case would be this

1 2 A1 A2 A3 A4

1 2 A1 A2 A3 A4)

and second, i want to extract the one that has slight difference, in my case would be this:

2 3 A1 A2 A3 A4

2 3 A0 A2 A3 A4

RW9 · Posted 05-29-2014 06:02 AM

Well, that's really pushing the boat out. Get a count of your COLs, and your DIAGs by COLs. Then add these to the data. You can then see what repeats and what doesn't.

Try:

data have;
attrib Col1 Col2 Diag1 Diag2 Diag3 Diag4 format=$20.;
infile cards;
input col1 $ col2 $ diag1 $ diag2 $ diag3 $ diag4 $;
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;

proc sql;
create table INTER as
select A.*,
          B.DIAG_COUNT
from    (select *,
                    count(*) as ID_COUNT
            from    HAVE
            group by COL1,COL2) A
left join (
              select *,
                     count(*) as DIAG_COUNT
              from   HAVE
              group by COL1,COL2,DIAG1,DIAG2,DIAG3,DIAG4) B
on      A.COL1=B.COL1
and     A.COL2=B.COL2
and     A.DIAG1=B.DIAG1
and     A.DIAG2=B.DIAG2
and     A.DIAG3=B.DIAG3
and     A.DIAG4=B.DIAG4;

create table TABLE1 as
select *
from    INTER A
where   not exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

create table TABLE2 as
select *
from INTER A
where exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

quit;

data_null__ · Posted 05-29-2014 08:33 AM

Find the groups(col1 col2) with unique records those will have only ONE record or "slight differences". The proc sort option nouniquekey finds the groups of interest.

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards; 
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;;;;
   run; 
proc sort data=have out=_null_ nouniquekey uniqueout=unique;
   by col1--diag4;
   run; 
proc sort data=unique(keep=col:) out=who nodupkey; 
   by col:;
   run; 
data mixed allequal;
   merge sorted who(in=in2);
   by col:;
   if in2 then output mixed;
   else output allequal;
   run; 
proc print data=mixed;
   run; 
proc print data=allequal;
   run; 

slchen · Posted 05-29-2014 10:25 AM

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;

proc sql;
create table one as
select * from have where col1 not in (select col1 from (select * from have except select *from have group by 1,2,3,4,5,6 having count(*)>1));
create table two as
select * from have except all select * from one;
quit;

Ksharp · Posted 05-29-2014 11:29 AM

Are you talking about splitting a dataset ?


data have;
input Col1 Col2 (Diag1 Diag2 Diag3 Diag4 ) ($);
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;
proc sort data=have;by col1 col2;run;
data _null_;
 if _n_ eq 1 then do;
  if 0 then set have;
  declare hash h(multidata:'y');
   h.definekey('col1','col2');
   h.definedata('col1','col2','Diag1','Diag2','Diag3','Diag4' );
   h.definedone();
 end;
 set have;
 by col1 col2;
 h.add();
 if last.col2 then do;h.output(dataset:cats('_',col1,col2));h.clear();end;
 run;

Xia Keshan

stat_sas · Posted 05-16-2014 09:47 AM

Try this.

data have;
input id date date9.;
format date date9.;
datalines;
1 12-Aug-97
1 12-Aug-97
1 13-Aug-97
1 14-Aug-97
1 14-Aug-97
1 15-Aug-97
1 16-Aug-97
1 16-Aug-97
2 14-Aug-97
2 14-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 16-Aug-97
2 17-Aug-97
2 19-Aug-97
;
run;

proc sort data=have;
by id date;
run;

data want;
set have;
by id date;
if not first.date and last.date;
run;

umashankersaini · Posted 05-17-2014 03:19 PM

Hi,

Greetings for the day !

you can also refer to this as well. :

Regards

Uma Shanker Saini

SAS Certification prep and basic understanding

Ksharp · Posted 05-18-2014 03:41 AM

You'd better post some sample data and the output you like.

The code below is not tested.

proc sql;

create table inter as

select *

from have

group by id,date

having count(*) gt 1 ;

quit;

Xia Keshan

jakarman · Posted 05-29-2014 06:01 AM

http://support.sas.com/resources/papers/proceedings13/324-2013.pdf

---->-- ja karman --<-----

To select duplicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select dublicates

Re: To select duplicates

Re: To select duplicates

Re: To select duplicates

Registration is open

SAS Training: Just a Click Away