## Identifying duplicates across observations

Solved
Occasional Contributor
Posts: 10

# Identifying duplicates across observations

Hello,

I have a dataset along these lines:

Obs     UID     HomePhone     CellPhone     OtherPhone

1          JN32                              000-0000

2          A132     555-5413

3          CJ3N     000-0000          555-1234

4          BK78                                                  000-0000

5          AZ25     000-0000          555-5413

I have a canned report from a database telling me I have potential duplicate records based on phone number. For example, the system might tell me Obs 1 and 3 are duplicate records due to the 000-0000 entry. I do not want to remove any observations based on a 000-0000 phone number, or a missing phone number, as this would eliminate Obs 5, which may truly be a duplicate of Obs 2.

My question then is this:

Is there a way to identify, not delete, the true duplicates across multiple variables?

Thanks!

Accepted Solutions
Solution
‎08-14-2014 05:03 PM
Posts: 3,167

## Re: Identifying duplicates across observations

data have;

input Obs (UID HomePhone CellPhone OtherPhone) (:\$15.);

cards;

1 JN32 . 000-0000 .

2 A132 555-5413 555-5413 .

3 CJ3N 000-0000 555-1234 .

4 BK78 . . 000-0000

5 AZ25 000-0000 555-5413 .

;

data want;

if _n_=1 then

do;

declare hash h();

h.definekey('phone');

h.definedone();

length phone \$ 15;

end;

set have;

array t(3) \$ 15 _temporary_;

array p homephone--otherphone;

do i=1 to 3;

if p(i) ne '000-0000' and not missing (p(i)) and p(i) not in t then

t(i)=p(i);

end;

do i=1 to 3;

if not missing(t(i)) then

do;

rc=h.check(key:t(i));

if rc =0 then

flag='DUP';

else

do;

phone=t(i);

rc=h.replace();

end;

end;

end;

call missing (of t(*));

drop i rc phone;

run;

All Replies
Solution
‎08-14-2014 05:03 PM
Posts: 3,167

## Re: Identifying duplicates across observations

data have;

input Obs (UID HomePhone CellPhone OtherPhone) (:\$15.);

cards;

1 JN32 . 000-0000 .

2 A132 555-5413 555-5413 .

3 CJ3N 000-0000 555-1234 .

4 BK78 . . 000-0000

5 AZ25 000-0000 555-5413 .

;

data want;

if _n_=1 then

do;

declare hash h();

h.definekey('phone');

h.definedone();

length phone \$ 15;

end;

set have;

array t(3) \$ 15 _temporary_;

array p homephone--otherphone;

do i=1 to 3;

if p(i) ne '000-0000' and not missing (p(i)) and p(i) not in t then

t(i)=p(i);

end;

do i=1 to 3;

if not missing(t(i)) then

do;

rc=h.check(key:t(i));

if rc =0 then

flag='DUP';

else

do;

phone=t(i);

rc=h.replace();

end;

end;

end;

call missing (of t(*));

drop i rc phone;

run;

Occasional Contributor
Posts: 10

## Re: Identifying duplicates across observations

Perfect! Thank you!

🔒 This topic is solved and locked.