I have database which contains more than 1 million observation. I would like to select all observations that have same two variables (for example ID and date)
How could I do that?
Thanks
There are several methods. One option was discussed in a thread earlier today: https://communities.sas.com/thread/57918
I do not want to delete duplicates, i want all observations that has one or more duplicate to extract and to check if they are totally identical in all variables
Hi, in which case just change the logic to:
proc sql;
create table inter as
select col1,
col2,
tmp
from (select *,count(*) as tmp from have group by col1,col2)
where tmp > 1;
quit;
And this will give all records which appear more than once.
Thanks!
how could i do in this case:
i select duplicates using two variables (col1 and col2), then i want to see if the rest of variables are identical as well (for example if all diagnosis is identical in selected duplicates). How could i do it?
Your not really talking about duplicates then so much, what you want is to compare data. For that use the proc compare function, you can specify in there a by line if you want to match based on variables rather than row number etc. This compares all the daa and shows you differences. If you only want uniques, then drop duplicate data before running your proc compare.
Thanks, but it looks that this procedure do not do what i really want to.
I will give the example. Lets say i have data:
Col1 Col2 Diag1 Diag2 Diag3 Diag4
1 2 A1 A2 A3 A4
1 2 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A0 A2 A3 A4
I want to separate this table in such way: first - to take all observations by col1 and col2 that has in all line diag identical (in this case would be this
1 2 A1 A2 A3 A4
1 2 A1 A2 A3 A4)
and second, i want to extract the one that has slight difference, in my case would be this:
2 3 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A0 A2 A3 A4
Well, that's really pushing the boat out. Get a count of your COLs, and your DIAGs by COLs. Then add these to the data. You can then see what repeats and what doesn't.
Try:
data have;
attrib Col1 Col2 Diag1 Diag2 Diag3 Diag4 format=$20.;
infile cards;
input col1 $ col2 $ diag1 $ diag2 $ diag3 $ diag4 $;
cards;
1 2 A1 A2 A3 A4
1 2 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A0 A2 A3 A4
;
run;
proc sql;
create table INTER as
select A.*,
B.DIAG_COUNT
from (select *,
count(*) as ID_COUNT
from HAVE
group by COL1,COL2) A
left join (
select *,
count(*) as DIAG_COUNT
from HAVE
group by COL1,COL2,DIAG1,DIAG2,DIAG3,DIAG4) B
on A.COL1=B.COL1
and A.COL2=B.COL2
and A.DIAG1=B.DIAG1
and A.DIAG2=B.DIAG2
and A.DIAG3=B.DIAG3
and A.DIAG4=B.DIAG4;
create table TABLE1 as
select *
from INTER A
where not exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);
create table TABLE2 as
select *
from INTER A
where exists (select distinct THIS.COL1,THIS.COL2 from INTER THIS where THIS.COL1=A.COL1 and THIS.COL2=A.COL2 and THIS.DIAG_COUNT=1);
quit;
Find the groups(col1 col2) with unique records those will have only ONE record or "slight differences". The proc sort option nouniquekey finds the groups of interest.
data have;
input Col1 Col2 (Diag1-Diag4)(:$2.);
cards;
1 2 A1 A2 A3 A4
1 2 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A1 A2 A3 A4
2 3 A0 A2 A3 A4
;
run;
proc sql;
create table one as
select * from have where col1 not in (select col1 from (select * from have except select *from have group by 1,2,3,4,5,6 having count(*)>1));
create table two as
select * from have except all select * from one;
quit;
Are you talking about splitting a dataset ?
data have; input Col1 Col2 (Diag1 Diag2 Diag3 Diag4 ) ($); cards; 1 2 A1 A2 A3 A4 1 2 A1 A2 A3 A4 2 3 A1 A2 A3 A4 2 3 A1 A2 A3 A4 2 3 A0 A2 A3 A4 ; run; proc sort data=have;by col1 col2;run; data _null_; if _n_ eq 1 then do; if 0 then set have; declare hash h(multidata:'y'); h.definekey('col1','col2'); h.definedata('col1','col2','Diag1','Diag2','Diag3','Diag4' ); h.definedone(); end; set have; by col1 col2; h.add(); if last.col2 then do;h.output(dataset:cats('_',col1,col2));h.clear();end; run;
Xia Keshan
Try this.
data have;
input id date date9.;
format date date9.;
datalines;
1 12-Aug-97
1 12-Aug-97
1 13-Aug-97
1 14-Aug-97
1 14-Aug-97
1 15-Aug-97
1 16-Aug-97
1 16-Aug-97
2 14-Aug-97
2 14-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 15-Aug-97
2 16-Aug-97
2 17-Aug-97
2 19-Aug-97
;
run;
proc sort data=have;
by id date;
run;
data want;
set have;
by id date;
if not first.date and last.date;
run;
You'd better post some sample data and the output you like.
The code below is not tested.
proc sql;
create table inter as
select *
from have
group by id,date
having count(*) gt 1 ;
quit;
Xia Keshan
http://support.sas.com/resources/papers/proceedings13/324-2013.pdf
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.