DATA Step, Macro, Functions and more

flag the dup records

Accepted Solution Solved
Reply
Regular Contributor
Posts: 168
Accepted Solution

flag the dup records

Hi all ,

 

I am trying to flag the duplicate records over the group. same id fall under different groups. i need to flg only those records who fall under different groups

 

have:

data have;

  input (id grp pam) (: $8.) seq val ord 8.;

  cards;

 100  xyz pop  1 10 1.1

 100  xyz pop  2 11 1.2

 100  xyz pop  3 12 1.3

 100  xyz pop  4 15 1.4

 100  xyz pop  5 11 1.5

 100  xyz pop  6 13 1.6

 100  xyz pop  7 30 1.7

 100  xyz pop  8 40 1.8

 100  xyz pop  9 10 1.9

 100  xyz pop  10 30 1.10

 100  abc pop  9 10 1.9

 100  abc pop  10 30 1.10

 100  abc pop  9 10 1.11

 100  abc pop  9 10 1.11

 100  abc pop  12 30 1.12

 100  abc pop  13 40 1.13

 100  abc pop  14 20 1.14

 100  abc pop  15 10 1.15

 100  abc pop  16 40 1.16

 100  abc pop  17 10 1.17

 ;

 run;

 

want:

 

id     grp    pam    seq    val     ord     flg

 

100    xyz    pop      1     10    1.10

100    xyz    pop      2     11    1.20

100    xyz    pop      3     12    1.30

100    xyz    pop      4     15    1.40

100    xyz    pop      5     11    1.50

100    xyz    pop      6     13    1.60

100    xyz    pop      7     30    1.70

100    xyz    pop      8     40    1.80

100    xyz    pop      9     10    1.90       1

100    xyz    pop     10     30    1.10       1

100    abc    pop      9     10    1.90       1

100    abc    pop     10     30    1.10       1

100    abc    pop      9     10    1.11

100    abc    pop      9     10    1.11

100    abc    pop     12     30    1.12

100    abc    pop     13     40    1.13

100    abc    pop     14     20    1.14

100    abc    pop     15     10    1.15

100    abc    pop     16     40    1.16

100    abc    pop     17     10    1.17

 

my code:

 

proc sql;
  create table want as
  select a.*,b.flg
  from have as a left join(select id,pam,seq,val,ord,1 as flg from have group by id,pam,seq,val,ord having count(*)>1) as b
  on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.ord);
quit;

 

my code is flagging the duplicate record, if id fall in single group.

 

Thanks


Accepted Solutions
Solution
‎06-02-2016 08:37 AM
Super User
Posts: 5,079

Re: flag the dup records

It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group).  Here's one approach:

 

proc sort data=have;

by id pam val seq ord grp;

run;

 

data want;

do until (last.grp);

   set have;

   by id pam val seq ord grp;

   if first.grp=1 and first.ord=0 then flag=1;

   if last.grp=1 and last.ord=0 then flag=1;

end;

do until (last.grp);

   set have;

   by id pam val seq ord grp;

   output;

end;

run; 

 

it's untested code, but it looks like it should do the job.

View solution in original post


All Replies
Solution
‎06-02-2016 08:37 AM
Super User
Posts: 5,079

Re: flag the dup records

It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group).  Here's one approach:

 

proc sort data=have;

by id pam val seq ord grp;

run;

 

data want;

do until (last.grp);

   set have;

   by id pam val seq ord grp;

   if first.grp=1 and first.ord=0 then flag=1;

   if last.grp=1 and last.ord=0 then flag=1;

end;

do until (last.grp);

   set have;

   by id pam val seq ord grp;

   output;

end;

run; 

 

it's untested code, but it looks like it should do the job.

Regular Contributor
Posts: 168

Re: flag the dup records

Thank you Astounding!!! Worked perfect!!!.

 

Thanks to hentipdp,PGStat!!! your apporach also works !!!!

 

Thanks , yes my sample data for ord varaible is liite bit odd.. my real data is different !!! i just made some sample. from next time i will keep your points in my mind 

 

 

Thanks!!!!

Occasional Contributor
Posts: 9

Re: flag the dup records

Hi, 

 

You can perhaps give the following a go.

 

I split the code to make reading and debugging easier. i think the same could have been accomplished with the datastep.


proc sql;
/*Get the duplicate counts*/
create table t1 as
select id,pam,seq,val,ord,1 as flg
from have
group by id,pam,seq,val,ord
having count(*)>1 ;

/*From with in the duplicates get the mmaximum value, excluding the minimum values
this should allow for other values to also be excluded*/
create table t2 as
select a.id,a.pam,a.seq,a.val, max(a.ord) as maxord
from t1 a left outer join (select id,pam,seq,val, min(ord) as minord from t1) b on
(a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord ne b.minord)
group by a.id,a.pam,a.seq,a.val
;

/*Using the new data flag the OBS that are duplicates*/
create table want2 as
select a.*,b.flg
from have as a left join(select id,pam,seq,val,maxord, 1 as flg
from t2) as b
on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.maxord );
quit;

 

Hope this can at least provide some guidance.

 

KR

H

Respected Advisor
Posts: 4,641

Re: flag the dup records

You only need to count the number of distinct values of grp within each instance of the other variables:

 


proc sql;
create table want as
select *, count(distinct grp) > 1 as flg
from have
group by id, pam, seq, val, ord;
select * from want;
quit;
PG
Trusted Advisor
Posts: 1,115

Re: flag the dup records

@sam369: Just a side note: Numeric variable ORD is a bit odd. Please note that 1.1=1.10. Depending on how these values were generated, there's even a risk of numeric representation issues, i.e. numbers which look as if they were equal (and should be equal), but in fact have small differences so that GROUP BY would assign them to different groups. Maybe a character variable would be more appropriate, if it isn't redundant anyway.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 304 views
  • 1 like
  • 5 in conversation