Identifying duplicates across observations

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 10
Accepted Solution

Identifying duplicates across observations


Hello,

I have a dataset along these lines:

Obs     UID     HomePhone     CellPhone     OtherPhone

1          JN32                              000-0000                             

2          A132     555-5413                             

3          CJ3N     000-0000          555-1234

4          BK78                                                  000-0000

5          AZ25     000-0000          555-5413

I have a canned report from a database telling me I have potential duplicate records based on phone number. For example, the system might tell me Obs 1 and 3 are duplicate records due to the 000-0000 entry. I do not want to remove any observations based on a 000-0000 phone number, or a missing phone number, as this would eliminate Obs 5, which may truly be a duplicate of Obs 2.

My question then is this:

Is there a way to identify, not delete, the true duplicates across multiple variables?

Thanks!


Accepted Solutions
Solution
‎08-14-2014 05:03 PM
Respected Advisor
Posts: 3,124

Re: Identifying duplicates across observations

data have;

input Obs (UID HomePhone CellPhone OtherPhone) (:$15.);

cards;

1 JN32 . 000-0000 . 

2 A132 555-5413 555-5413 . 

3 CJ3N 000-0000 555-1234 .

4 BK78 . . 000-0000

5 AZ25 000-0000 555-5413 .

;

data want;

if _n_=1 then

do;

declare hash h();

h.definekey('phone');

h.definedone();

length phone $ 15;

end;

set have;

array t(3) $ 15 _temporary_;

array p homephone--otherphone;

do i=1 to 3;

if p(i) ne '000-0000' and not missing (p(i)) and p(i) not in t then

t(i)=p(i);

end;

do i=1 to 3;

if not missing(t(i)) then

do;

rc=h.check(key:t(i));

if rc =0 then

flag='DUP';

else

do;

phone=t(i);

rc=h.replace();

end;

end;

end;

call missing (of t(*));

drop i rc phone;

run;

View solution in original post


All Replies
Solution
‎08-14-2014 05:03 PM
Respected Advisor
Posts: 3,124

Re: Identifying duplicates across observations

data have;

input Obs (UID HomePhone CellPhone OtherPhone) (:$15.);

cards;

1 JN32 . 000-0000 . 

2 A132 555-5413 555-5413 . 

3 CJ3N 000-0000 555-1234 .

4 BK78 . . 000-0000

5 AZ25 000-0000 555-5413 .

;

data want;

if _n_=1 then

do;

declare hash h();

h.definekey('phone');

h.definedone();

length phone $ 15;

end;

set have;

array t(3) $ 15 _temporary_;

array p homephone--otherphone;

do i=1 to 3;

if p(i) ne '000-0000' and not missing (p(i)) and p(i) not in t then

t(i)=p(i);

end;

do i=1 to 3;

if not missing(t(i)) then

do;

rc=h.check(key:t(i));

if rc =0 then

flag='DUP';

else

do;

phone=t(i);

rc=h.replace();

end;

end;

end;

call missing (of t(*));

drop i rc phone;

run;

Occasional Contributor
Posts: 10

Re: Identifying duplicates across observations

Perfect! Thank you!

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 251 views
  • 0 likes
  • 2 in conversation