Solved: Re: flag the dup records

sam369 · Posted 06-01-2016 02:09 PM

Hi all ,

I am trying to flag the duplicate records over the group. same id fall under different groups. i need to flg only those records who fall under different groups

have:

data have;

input (id grp pam) (: $8.) seq val ord 8.;

cards;

100 xyz pop 1 10 1.1

100 xyz pop 2 11 1.2

100 xyz pop 3 12 1.3

100 xyz pop 4 15 1.4

100 xyz pop 5 11 1.5

100 xyz pop 6 13 1.6

100 xyz pop 7 30 1.7

100 xyz pop 8 40 1.8

100 xyz pop 9 10 1.9

100 xyz pop 10 30 1.10

100 abc pop 9 10 1.9

100 abc pop 10 30 1.10

100 abc pop 9 10 1.11

100 abc pop 12 30 1.12

100 abc pop 13 40 1.13

100 abc pop 14 20 1.14

100 abc pop 15 10 1.15

100 abc pop 16 40 1.16

100 abc pop 17 10 1.17

;

run;

want:

id grp pam seq val ord flg

100 xyz pop 1 10 1.10

100 xyz pop 2 11 1.20

100 xyz pop 3 12 1.30

100 xyz pop 4 15 1.40

100 xyz pop 5 11 1.50

100 xyz pop 6 13 1.60

100 xyz pop 7 30 1.70

100 xyz pop 8 40 1.80

100 xyz pop 9 10 1.90 1

100 xyz pop 10 30 1.10 1

100 abc pop 9 10 1.90 1

100 abc pop 10 30 1.10 1

100 abc pop 9 10 1.11

100 abc pop 12 30 1.12

100 abc pop 13 40 1.13

100 abc pop 14 20 1.14

100 abc pop 15 10 1.15

100 abc pop 16 40 1.16

100 abc pop 17 10 1.17

my code:

proc sql;
create table want as
select a.*,b.flg
from have as a left join(select id,pam,seq,val,ord,1 as flg from have group by id,pam,seq,val,ord having count(*)>1) as b
on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.ord);
quit;

my code is flagging the duplicate record, if id fall in single group.

Thanks

Astounding · Posted 06-01-2016 03:09 PM

It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group). Here's one approach:

proc sort data=have;

by id pam val seq ord grp;

run;

data want;

do until (last.grp);

set have;

by id pam val seq ord grp;

if first.grp=1 and first.ord=0 then flag=1;

if last.grp=1 and last.ord=0 then flag=1;

end;

do until (last.grp);

set have;

by id pam val seq ord grp;

output;

end;

run;

it's untested code, but it looks like it should do the job.

View solution in original post

Astounding · Posted 06-01-2016 03:09 PM

It becomes more challenging if you allow that there might be multiple records with exactly the same values that need to be flagged (because they match another record on all but the group). Here's one approach:

proc sort data=have;

by id pam val seq ord grp;

run;

data want;

do until (last.grp);

set have;

by id pam val seq ord grp;

if first.grp=1 and first.ord=0 then flag=1;

if last.grp=1 and last.ord=0 then flag=1;

end;

do until (last.grp);

set have;

by id pam val seq ord grp;

output;

end;

run;

it's untested code, but it looks like it should do the job.

FreelanceReinhard · Posted 06-02-2016 08:43 AM

Thank you Astounding!!! Worked perfect!!!.

Thanks to hentipdp,PGStat!!! your apporach also works !!!!

Thanks FreelanceReinhard, yes my sample data for ord varaible is liite bit odd.. my real data is different !!! i just made some sample. from next time i will keep your points in my mind

Thanks!!!!

Hentiedp · Posted 06-01-2016 03:42 PM

Hi,

You can perhaps give the following a go.

I split the code to make reading and debugging easier. i think the same could have been accomplished with the datastep.

proc sql;
/*Get the duplicate counts*/
create table t1 as
select id,pam,seq,val,ord,1 as flg
from have
group by id,pam,seq,val,ord
having count(*)>1 ;

/*From with in the duplicates get the mmaximum value, excluding the minimum values
this should allow for other values to also be excluded*/
create table t2 as
select a.id,a.pam,a.seq,a.val, max(a.ord) as maxord
from t1 a left outer join (select id,pam,seq,val, min(ord) as minord from t1) b on
(a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord ne b.minord)
group by a.id,a.pam,a.seq,a.val
;

/*Using the new data flag the OBS that are duplicates*/
create table want2 as
select a.*,b.flg
from have as a left join(select id,pam,seq,val,maxord, 1 as flg
from t2) as b
on (a.id=b.id & a.pam=b.pam & a.seq=b.seq & a.val=b.val & a.ord=b.maxord );
quit;

Hope this can at least provide some guidance.

KR

H

PGStats · Posted 06-01-2016 05:54 PM

You only need to count the number of distinct values of grp within each instance of the other variables:


proc sql;
create table want as
select *, count(distinct grp) > 1 as flg
from have
group by id, pam, seq, val, ord;
select * from want;
quit;

PG

FreelanceReinh · Posted 06-02-2016 04:33 AM

@sam369: Just a side note: Numeric variable ORD is a bit odd. Please note that 1.1=1.10. Depending on how these values were generated, there's even a risk of numeric representation issues, i.e. numbers which look as if they were equal (and should be equal), but in fact have small differences so that GROUP BY would assign them to different groups. Maybe a character variable would be more appropriate, if it isn't redundant anyway.

flag the dup records

Re: flag the dup records

Re: flag the dup records

Re: flag the dup records

Re: flag the dup records

Re: flag the dup records

Re: flag the dup records

Registration is open

SAS Training: Just a Click Away