DATA Step, Macro, Functions and more

Proc SQL - Duplicate Findings (SOUNDEX)

Reply
Contributor
Posts: 46

Proc SQL - Duplicate Findings (SOUNDEX)

Hello,

Need to find out the Duplicate Telephone Numbers by National_ID and NAME,

A) Tele_dup

1.BY NATIONAL_ID

National_ID  Customer_Name  Telephone
1DEVID+19 99999999    Out put should like this
2MARK+19 99999999National_ID  Telephone
3JIM+19 88888881+19 99999999
1DEVID+19 77777772+19 99999999
2MARK HENRY+19 999999997+19 99999999
4THOMAS+19 66666664+19 6666666
5BALE+19 55555556+19 6666666
6PITT+19 6666666
7WOOD+19 99999999

B) Name_Dups

2. By Customer_Name.

National_ID  Customer_Name  Telephone
1MASS+19 99999999
1MOSS+19 99999999     Out put should like this
2DEVID+19 8888888National_ID  Telephone
3DEVID THOMAS+19 77777771+19 99999999
4JOHN+19 999999994+19 99999999
5HENRY+19 66666668+19 99999999
6WOOD+19 5555555
7MARK HENRY+19 6666666
8WOOD+19 99999999

A)

proc sql ;

create table Dups_BY_ID as

select distinct a.National_ID,a.Telephone

from Tele_dup as a, Tele_dup as b

where

a.National_ID ne b.National_ID and a.Telephone eq b.Telephone

order by 2;

quit;

B)

proc sql ;

create table Dups_BY_Name as

select distinct a.National_ID,a.Telephone

from Name_Dups as a, Name_Dups as b

where

soundex(a.Customer_Name) ne soundex(b.Customer_Name) and a.Telephone eq b.Telephone

order by 2;

quit;

These two are giving the proper out put for small data, but same code run on huge data it is taking long time because it working as cartesian join,

Plz suggest any simple code for getting duplicats on huse data from single table.

Thanks in Advance....!

Super User
Posts: 17,750

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Have you looked into the prod sort options? Or a hash table sort?

Contributor
Posts: 46

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Not yet, but I want to get "same Phone number repeated for different persons(or differnt IDs)" from single table.

Could you please share the code for same.

Super Contributor
Posts: 578

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Even though proc sql doesn't require input tables to be sorted....it does help performance.  You could also create a covering index.

SAS Employee
Posts: 85

Re: Proc SQL - Duplicate Findings (SOUNDEX)

This is an another option which will eliminate the Cartesian product and potentially help performance. Here you are joining all of the data with a subset of data that has duplicate telephone numbers. I just took your sample data and it worked great. For the full table, I would put an index on the telephone column since you are joining on it to improve performance.

proc sql ;

create table Dups_BY_ID as

SELECT

  a.national_id,

  a.telephone

FROM

  tele_dup a

where

  a.telephone in (

  select

  telephone

  from

  tele_dup

  group by

  telephone

  having

  count(*) > 1

)

  order by 2 desc;

quit;

Contributor
Posts: 46

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Thank You,

Compare to my code what am using earlier it was more efficient

But above code is giving  "SAME ID having SAME Phone Numbers";

Ex:-

National_ID   Telephone

100                +19 99999999

100                +19 99999999                      here Same Phone Repeated for Same IDs

200                +19 77777777

200                +19 77777777

Iam Looking for "Same Phone Number is repeated for Differnt IDs"

Ex:-

National_ID   Telephone

100                +19 99999999

200                +19 99999999                   "Same Phone repeated for Different IDs

XXX               +19 77777777

YYY               +19 77777777

Regular Contributor
Posts: 195

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Try this one

proc sql;

  create table want(drop = tot) as

  select national_id,telephone,

         count(telephone) as tot

  from have

  group by 1,2

  having tot > 1;

quit;

proc sql;

  create table want1(drop = tot) as

  select customer_name,telephone,national_id,

         count(telephone) as tot

  from have

  group by 1,2

  having tot > 1;

quit;

proc append base = want data = want1 force;

run;

Contributor
Posts: 46

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Hi Everyone,

National_ID  TelePhone

11111 9833293682

11111 9833293682

22222 9833293682

33333 9833293682

44444 9902247880

55555 9902247880

66666 9999999999

77777 8888888888

77777 8888888888

Here we have Ids and Telephone_Num, National_ID is the Unique and submited their Phone number.

I am looking for different National_ID submitted same Phone numbers list.

Output Should be:-

National_ID  TelePhone

11111   9833293682

22222   9833293682

33333   9833293682

44444   9902247880

55555   9902247880

here -  11111,22222 and 33333 having same(9833293682) TelePhone and

           44444,55555 having same( 9902247880) TelePhone.


Thanks in Advance....!

Super User
Posts: 9,662

Re: Proc SQL - Duplicate Findings (SOUNDEX)

data have;
input National_ID  TelePhone : $40.;
cards;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;
run;
proc sql;
 select distinct * from have
  group by      TelePhone
   having count(distinct  National_ID) gt 1;
quit;
 


Xia Keshan

Contributor
Posts: 46

Re: Proc SQL - Duplicate Findings (SOUNDEX)

Thanks Xia Keshan,

Run the above quire on 30 millions data, it is taken hardly not more than 10 min and I checked output randomly it was looking good.

On the urgent basis I can apply this Smiley Happy

Super User
Posts: 9,662

Re: Proc SQL - Duplicate Findings (SOUNDEX)

I can't believe you can get it done within 10 minutes for 30 millions, that is incredible . I was thinking about data step for that big data .

Contributor
Posts: 46

Re: Proc SQL - Duplicate Findings (SOUNDEX)

For testing purpose run the quire on sample dataset. Observation count is 5316167 and am preparing the dataset for 30 millions.

That's why I wondered  - IS IT OK......!

here I am sharing the log.

NOTE: There were 3433276 observations read from the data set WORK.TELEPHONE_1.

NOTE: There were 1284869 observations read from the data set WORK.TELEPHONE_2.

NOTE: There were 562389 observations read from the data set WORK.TELEPHONE_3.

NOTE: There were 35406 observations read from the data set WORK.TELEPHONE_4.

NOTE: There were 227 observations read from the data set WORK.TELEPHONE_5.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_6.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_7.

NOTE: The data set WORK.PHONE has 5316167 observations and 3 variables.

NOTE: DATA statement used (Total process time):

      real time           1.37 seconds

      cpu time            1.34 seconds

Total Records -  WORK.PHONE has   5316167   observations and 3 variables

proc sql;

90   create table dup as select distinct Telphoneno,fid from Phone

91   group by Telphoneno

92   having count(distinct  fid) gt 1

93   order by Telphoneno;

NOTE: The query requires remerging summary statistics back with the original data.

NOTE: Table WORK.DUP created, with 1334130 rows and 2 columns.

94   quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           35.28 seconds

      cpu time            38.46 seconds

suggest me any other alternatives to get it.

Respected Advisor
Posts: 3,887

Re: Proc SQL - Duplicate Findings (SOUNDEX)

You seem to run this on reasonable hardware and "cpu time" being higher than "real time" also shows that the process runs multi-threaded. If 10 minutes is good enough for you (and for your environment) then may be it's better to have simple code instead of tweaking the code for performance but then having more complex code to maintain.

Frequent Contributor
Posts: 115

Re: Proc SQL - Duplicate Findings (SOUNDEX)

If datastep:

data have;
input National_ID $  TelePhone;
datalines;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;

data want;
set have;
by national_id telephone notsorted;
retain TelePhone2;
if first.national_id then telephone2=telephone;
if not first.national_id and telephone=telephone2 then delete;
drop telephone2;
run;

Ask a Question
Discussion stats
  • 13 replies
  • 771 views
  • 6 likes
  • 8 in conversation