Hi all ,
I am trying to flag the duplicate records over the group. same id fall under different groups. i need to flg only those records who fall under different groups
have:
data have;
input (id grp pam) (: $8.) seq val ord 8.;
cards;
100 xyz pop 1 10 1.1
100 xyz pop 2 11 1.2
100 xyz pop 3 12 1.3
100 xyz pop 4 15 1.4
100 xyz pop 5 11 1.5
100 xyz pop 6 13 1.6
100 xyz pop 7 30 1.7
100 xyz pop 8 40 1.8
100 xyz pop 9 10 1.9
100 xyz pop 10 30 1.10
100 abc pop 9 10 1.9
100 abc pop 10 30 1.10
100 abc pop 9 10 1.11
100 abc pop 9 10 1.11
100 abc pop 12 30 1.12
100 abc pop 13 40 1.13
100 abc pop 14 20 1.14
100 abc pop 15 10 1.15
100 abc pop 16 40 1.16
100 abc pop 17 10 1.17
;
run;
want:
id grp pam seq val ord flg
100 xyz pop 1 10 1.10
100 xyz pop 2 11 1.20
100 xyz pop 3 12 1.30
100 xyz pop 4 15 1.40
100 xyz pop 5 11 1.50
100 xyz pop 6 13 1.60
100 xyz pop 7 30 1.70
100 xyz pop 8 40 1.80
100 xyz pop 9 10 1.90 1
100 xyz pop 10 30 1.10 1
100 abc pop 9 10 1.90 1
100 abc pop 10 30 1.10 1
100 abc pop 9 10 1.11
100 abc pop 9 10 1.11
100 abc pop 12 30 1.12
100 abc pop 13 40 1.13
100 abc pop 14 20 1.14
100 abc pop 15 10 1.15
100 abc pop 16 40 1.16
100 abc pop 17 10 1.17
my code:
proc sql;
create table want as
select a.*,b.flg
from have as a left join(select id,pam,seq,val,ord,1 as flg from have group by id,pam,seq,val,ord having count(*)>1) as b
on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.ord);
quit;
my code is flagging the duplicate record, if id fall in single group.
Thanks
It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group). Here's one approach:
proc sort data=have;
by id pam val seq ord grp;
run;
data want;
do until (last.grp);
set have;
by id pam val seq ord grp;
if first.grp=1 and first.ord=0 then flag=1;
if last.grp=1 and last.ord=0 then flag=1;
end;
do until (last.grp);
set have;
by id pam val seq ord grp;
output;
end;
run;
it's untested code, but it looks like it should do the job.
It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group). Here's one approach:
proc sort data=have;
by id pam val seq ord grp;
run;
data want;
do until (last.grp);
set have;
by id pam val seq ord grp;
if first.grp=1 and first.ord=0 then flag=1;
if last.grp=1 and last.ord=0 then flag=1;
end;
do until (last.grp);
set have;
by id pam val seq ord grp;
output;
end;
run;
it's untested code, but it looks like it should do the job.
Thank you Astounding!!! Worked perfect!!!.
Thanks to hentipdp,PGStat!!! your apporach also works !!!!
Thanks FreelanceReinha
Thanks!!!!
Hi,
You can perhaps give the following a go.
I split the code to make reading and debugging easier. i think the same could have been accomplished with the datastep.
proc sql;
/*Get the duplicate counts*/
create table t1 as
select id,pam,seq,val,ord,1 as flg
from have
group by id,pam,seq,val,ord
having count(*)>1 ;
/*From with in the duplicates get the mmaximum value, excluding the minimum values
this should allow for other values to also be excluded*/
create table t2 as
select a.id,a.pam,a.seq,a.val, max(a.ord) as maxord
from t1 a left outer join (select id,pam,seq,val, min(ord) as minord from t1) b on
(a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord ne b.minord)
group by a.id,a.pam,a.seq,a.val
;
/*Using the new data flag the OBS that are duplicates*/
create table want2 as
select a.*,b.flg
from have as a left join(select id,pam,seq,val,maxord, 1 as flg
from t2) as b
on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.maxord );
quit;
Hope this can at least provide some guidance.
KR
H
You only need to count the number of distinct values of grp within each instance of the other variables:
proc sql;
create table want as
select *, count(distinct grp) > 1 as flg
from have
group by id, pam, seq, val, ord;
select * from want;
quit;
@sam369: Just a side note: Numeric variable ORD is a bit odd. Please note that 1.1=1.10. Depending on how these values were generated, there's even a risk of numeric representation issues, i.e. numbers which look as if they were equal (and should be equal), but in fact have small differences so that GROUP BY would assign them to different groups. Maybe a character variable would be more appropriate, if it isn't redundant anyway.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.