BookmarkSubscribeRSS Feed
viollete
Calcite | Level 5

I have database which contains more than 1 million observation. I would like to select all observations that have same two variables (for example ID and date)

How could I do that?

Thanks

14 REPLIES 14
RW9
Diamond | Level 26 RW9
Diamond | Level 26

There are several methods.  One option was discussed in a thread earlier today: https://communities.sas.com/thread/57918

viollete
Calcite | Level 5

I do not want to delete duplicates, i want all observations that has one or more duplicate to extract and to check if they are totally identical in all variables

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi, in which case just change the logic to:

proc sql;

  create table inter as

  select  col1,

          col2,

          tmp

  from    (select *,count(*) as tmp from have group by col1,col2)

  where   tmp > 1;

quit;

And this will give all records which appear more than once.

viollete
Calcite | Level 5

Thanks!

how could i do in this case:

i select duplicates using two variables (col1 and col2), then i want to see if the rest of variables are identical as well (for example if all diagnosis is identical in selected duplicates). How could i do it?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Your not really talking about duplicates then so much, what you want is to compare data.  For that use the proc compare function, you can specify in there a by line if you want to match based on variables rather than row number etc.  This compares all the daa and shows you differences.  If you only want uniques, then drop duplicate data before running your proc compare. 

viollete
Calcite | Level 5

Thanks, but it looks that this procedure do not do what i really want to.

I will give the example. Lets say i have data:

Col1 Col2 Diag1 Diag2 Diag3 Diag4

1       2      A1      A2     A3   A4

1       2      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A0      A2     A3   A4

I want to separate this table in such way: first - to take all observations by col1 and col2 that has in all line diag identical (in this case would be this

1       2      A1      A2     A3   A4

1       2      A1      A2     A3   A4)

and second, i want to extract the one that has slight difference, in my case would be this:

2       3      A1      A2     A3   A4

2       3      A1      A2     A3   A4

2       3      A0      A2     A3   A4

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, that's really pushing the boat out.  Get a count of your COLs, and your DIAGs by COLs.  Then add these to the data.  You can then see what repeats and what doesn't.

Try:

data have;
  attrib Col1 Col2 Diag1 Diag2 Diag3 Diag4 format=$20.;
  infile cards;
  input col1 $ col2 $ diag1 $ diag2 $ diag3 $ diag4 $;
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;


proc sql;
  create table INTER as
  select  A.*,
          B.DIAG_COUNT
  from    (select  *,
                    count(*) as ID_COUNT
            from    HAVE
            group by COL1,COL2) A
  left join (
              select *,
                     count(*) as DIAG_COUNT
              from   HAVE
              group by COL1,COL2,DIAG1,DIAG2,DIAG3,DIAG4) B
  on      A.COL1=B.COL1
  and     A.COL2=B.COL2
  and     A.DIAG1=B.DIAG1
  and     A.DIAG2=B.DIAG2
  and     A.DIAG3=B.DIAG3
  and     A.DIAG4=B.DIAG4;

  create table TABLE1 as
  select  *
  from    INTER A
  where   not exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

  create table TABLE2 as
  select  *
  from    INTER A
  where   exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);

quit;


data_null__
Jade | Level 19

Find the groups(col1 col2) with unique records those will have only ONE record or "slight differences".  The proc sort option nouniquekey finds the groups of interest.

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;;;;
   run;
proc sort data=have out=_null_ nouniquekey uniqueout=unique;
   by col1--diag4;
   run;
proc sort data=unique(keep=col:) out=who nodupkey;
  
by col:;
   run;
data mixed allequal;
   merge sorted who(in=in2);
   by col:;
   if in2 then output mixed;
   else output allequal;
   run;
proc print data=mixed;
   run;
proc print data=allequal;
   run;
slchen
Lapis Lazuli | Level 10

data have;
   input Col1 Col2 (Diag1-Diag4)(:$2.);
   cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;

proc sql;
  create table one as
  select * from have where col1 not in (select col1 from (select * from have except select *from have group by 1,2,3,4,5,6 having count(*)>1));
  create table two as
  select * from have except all select * from one;
quit;

Ksharp
Super User

Are you talking about splitting a dataset ?


data have;
input Col1 Col2 (Diag1 Diag2 Diag3 Diag4 ) ($);
cards;
1       2      A1      A2     A3   A4
1       2      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A1      A2     A3   A4
2       3      A0      A2     A3   A4
;
run;
proc sort data=have;by col1 col2;run;
data _null_;
 if _n_ eq 1 then do;
  if 0 then set have;
  declare hash h(multidata:'y');
   h.definekey('col1','col2');
   h.definedata('col1','col2','Diag1','Diag2','Diag3','Diag4' );
   h.definedone();
 end;
 set have;
 by col1 col2;
 h.add();
 if last.col2 then do;h.output(dataset:cats('_',col1,col2));h.clear();end;
 run;



Xia Keshan

stat_sas
Ammonite | Level 13

Try this.

data have;
input id date date9.;
format date date9.;
datalines;
1 12-Aug-97
1 12-Aug-97
1 13-Aug-97
1 14-Aug-97
1 14-Aug-97
1 15-Aug-97
1 16-Aug-97
1 16-Aug-97
2 14-Aug-97
2 14-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 16-Aug-97
2 17-Aug-97
2 19-Aug-97
;
run;

proc sort data=have;
by id date;
run;

data want;
set have;
by id date;
if not first.date and last.date;
run;

umashankersaini
Quartz | Level 8

Hi,

Greetings for the day !

you can also refer to this as well.  :

Regards

Uma Shanker Saini

SAS Certification prep and basic understanding

Ksharp
Super User

You'd better post some sample data and the output you like.

The code below is not tested.

proc sql;

  create table inter as

  select  *

   from  have

    group by id,date

     having count(*) gt 1 ;

quit;

Xia Keshan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1168 views
  • 0 likes
  • 8 in conversation