Re: Proc SQL - Duplicate Findings (SOUNDEX)

sas_lak · Posted 12-23-2014 12:17 AM

Hello,

Need to find out the Duplicate Telephone Numbers by National_ID and NAME,

A) Tele_dup

1.BY NATIONAL_ID

National_ID	Customer_Name	Telephone
1	DEVID	+19 99999999	Out put should like this
2	MARK	+19 99999999	National_ID	Telephone
3	JIM	+19 8888888	1	+19 99999999
1	DEVID	+19 7777777	2	+19 99999999
2	MARK HENRY	+19 99999999	7	+19 99999999
4	THOMAS	+19 6666666	4	+19 6666666
5	BALE	+19 5555555	6	+19 6666666
6	PITT	+19 6666666
7	WOOD	+19 99999999

B) Name_Dups

2. By Customer_Name.

National_ID	Customer_Name	Telephone
1	MASS	+19 99999999
1	MOSS	+19 99999999	Out put should like this
2	DEVID	+19 8888888	National_ID	Telephone
3	DEVID THOMAS	+19 7777777	1	+19 99999999
4	JOHN	+19 99999999	4	+19 99999999
5	HENRY	+19 6666666	8	+19 99999999
6	WOOD	+19 5555555
7	MARK HENRY	+19 6666666
8	WOOD	+19 99999999

A)

proc sql ;

create table Dups_BY_ID as

select distinct a.National_ID,a.Telephone

from Tele_dup as a, Tele_dup as b

where

a.National_ID ne b.National_ID and a.Telephone eq b.Telephone

order by 2;

quit;

B)

proc sql ;

create table Dups_BY_Name as

select distinct a.National_ID,a.Telephone

from Name_Dups as a, Name_Dups as b

where

soundex(a.Customer_Name) ne soundex(b.Customer_Name) and a.Telephone eq b.Telephone

order by 2;

quit;

These two are giving the proper out put for small data, but same code run on huge data it is taking long time because it working as cartesian join,

Plz suggest any simple code for getting duplicats on huse data from single table.

Thanks in Advance....!

Reeza · Posted 12-23-2014 12:47 AM

Have you looked into the prod sort options? Or a hash table sort?

sas_lak · Posted 12-23-2014 01:07 AM

Not yet, but I want to get "same Phone number repeated for different persons(or differnt IDs)" from single table.

Could you please share the code for same.

DBailey · Posted 12-23-2014 08:38 AM

Even though proc sql doesn't require input tables to be sorted....it does help performance. You could also create a covering index.

skillman · Posted 12-23-2014 09:25 AM

This is an another option which will eliminate the Cartesian product and potentially help performance. Here you are joining all of the data with a subset of data that has duplicate telephone numbers. I just took your sample data and it worked great. For the full table, I would put an index on the telephone column since you are joining on it to improve performance.

proc sql ;

create table Dups_BY_ID as

SELECT

a.national_id,

a.telephone

FROM

tele_dup a

where

a.telephone in (

select

telephone

from

tele_dup

group by

telephone

having

count(*) > 1

)

order by 2 desc;

quit;

sas_lak · Posted 12-23-2014 11:30 PM

Thank You,

Compare to my code what am using earlier it was more efficient

But above code is giving "SAME ID having SAME Phone Numbers";

Ex:-

National_ID Telephone

100 +19 99999999

100 +19 99999999 here Same Phone Repeated for Same IDs

200 +19 77777777

Iam Looking for "Same Phone Number is repeated for Differnt IDs"

Ex:-

National_ID Telephone

100 +19 99999999

200 +19 99999999 "Same Phone repeated for Different IDs

XXX +19 77777777

YYY +19 77777777

UrvishShah · Posted 12-31-2014 12:06 PM

Try this one

proc sql;

create table want(drop = tot) as

select national_id,telephone,

count(telephone) as tot

from have

group by 1,2

having tot > 1;

quit;

proc sql;

create table want1(drop = tot) as

select customer_name,telephone,national_id,

count(telephone) as tot

from have

group by 1,2

having tot > 1;

quit;

proc append base = want data = want1 force;

run;

sas_lak · Posted 01-02-2015 06:15 AM

Hi Everyone,

National_ID TelePhone

11111 9833293682

22222 9833293682

33333 9833293682

44444 9902247880

55555 9902247880

66666 9999999999

77777 8888888888

Here we have Ids and Telephone_Num, National_ID is the Unique and submited their Phone number.

I am looking for different National_ID submitted same Phone numbers list.

Output Should be:-

National_ID TelePhone

11111 9833293682

22222 9833293682

33333 9833293682

44444 9902247880

55555 9902247880

here - 11111,22222 and 33333 having same(9833293682) TelePhone and

44444,55555 having same( 9902247880) TelePhone.

Thanks in Advance....!

Ksharp · Posted 01-02-2015 06:45 AM

data have;
input National_ID  TelePhone : $40.;
cards;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;
run;
proc sql;
 select distinct * from have
  group by      TelePhone
   having count(distinct  National_ID) gt 1;
quit;

Xia Keshan

sas_lak · Posted 01-02-2015 07:08 AM

Thanks Xia Keshan,

Run the above quire on 30 millions data, it is taken hardly not more than 10 min and I checked output randomly it was looking good.

On the urgent basis I can apply this

Ksharp · Posted 01-02-2015 07:18 AM

I can't believe you can get it done within 10 minutes for 30 millions, that is incredible . I was thinking about data step for that big data .

sas_lak · Posted 01-02-2015 07:47 AM

For testing purpose run the quire on sample dataset. Observation count is 5316167 and am preparing the dataset for 30 millions.

That's why I wondered - IS IT OK......!

here I am sharing the log.

NOTE: There were 3433276 observations read from the data set WORK.TELEPHONE_1.

NOTE: There were 1284869 observations read from the data set WORK.TELEPHONE_2.

NOTE: There were 562389 observations read from the data set WORK.TELEPHONE_3.

NOTE: There were 35406 observations read from the data set WORK.TELEPHONE_4.

NOTE: There were 227 observations read from the data set WORK.TELEPHONE_5.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_6.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_7.

NOTE: The data set WORK.PHONE has 5316167 observations and 3 variables.

NOTE: DATA statement used (Total process time):

real time 1.37 seconds

cpu time 1.34 seconds

Total Records - WORK.PHONE has 5316167 observations and 3 variables

proc sql;

90 create table dup as select distinct Telphoneno,fid from Phone

91 group by Telphoneno

92 having count(distinct fid) gt 1

93 order by Telphoneno;

NOTE: The query requires remerging summary statistics back with the original data.

NOTE: Table WORK.DUP created, with 1334130 rows and 2 columns.

94 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 35.28 seconds

cpu time 38.46 seconds

suggest me any other alternatives to get it.

Patrick · Posted 01-02-2015 10:04 PM

You seem to run this on reasonable hardware and "cpu time" being higher than "real time" also shows that the process runs multi-threaded. If 10 minutes is good enough for you (and for your environment) then may be it's better to have simple code instead of tweaking the code for performance but then having more complex code to maintain.

naveen_srini · Posted 01-02-2015 06:55 AM

If datastep:

data have;
input National_ID $ TelePhone;
datalines;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;

data want;
set have;
by national_id telephone notsorted;
retain TelePhone2;
if first.national_id then telephone2=telephone;
if not first.national_id and telephone=telephone2 then delete;
drop telephone2;
run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away