BookmarkSubscribeRSS Feed
sas_lak
Quartz | Level 8

Hello,

Need to find out the Duplicate Telephone Numbers by National_ID and NAME,

A) Tele_dup

1.BY NATIONAL_ID

National_ID  Customer_Name  Telephone
1DEVID+19 99999999    Out put should like this
2MARK+19 99999999National_ID  Telephone
3JIM+19 88888881+19 99999999
1DEVID+19 77777772+19 99999999
2MARK HENRY+19 999999997+19 99999999
4THOMAS+19 66666664+19 6666666
5BALE+19 55555556+19 6666666
6PITT+19 6666666
7WOOD+19 99999999

B) Name_Dups

2. By Customer_Name.

National_ID  Customer_Name  Telephone
1MASS+19 99999999
1MOSS+19 99999999     Out put should like this
2DEVID+19 8888888National_ID  Telephone
3DEVID THOMAS+19 77777771+19 99999999
4JOHN+19 999999994+19 99999999
5HENRY+19 66666668+19 99999999
6WOOD+19 5555555
7MARK HENRY+19 6666666
8WOOD+19 99999999

A)

proc sql ;

create table Dups_BY_ID as

select distinct a.National_ID,a.Telephone

from Tele_dup as a, Tele_dup as b

where

a.National_ID ne b.National_ID and a.Telephone eq b.Telephone

order by 2;

quit;

B)

proc sql ;

create table Dups_BY_Name as

select distinct a.National_ID,a.Telephone

from Name_Dups as a, Name_Dups as b

where

soundex(a.Customer_Name) ne soundex(b.Customer_Name) and a.Telephone eq b.Telephone

order by 2;

quit;

These two are giving the proper out put for small data, but same code run on huge data it is taking long time because it working as cartesian join,

Plz suggest any simple code for getting duplicats on huse data from single table.

Thanks in Advance....!

13 REPLIES 13
Reeza
Super User

Have you looked into the prod sort options? Or a hash table sort?

sas_lak
Quartz | Level 8

Not yet, but I want to get "same Phone number repeated for different persons(or differnt IDs)" from single table.

Could you please share the code for same.

DBailey
Lapis Lazuli | Level 10

Even though proc sql doesn't require input tables to be sorted....it does help performance.  You could also create a covering index.

skillman
SAS Employee

This is an another option which will eliminate the Cartesian product and potentially help performance. Here you are joining all of the data with a subset of data that has duplicate telephone numbers. I just took your sample data and it worked great. For the full table, I would put an index on the telephone column since you are joining on it to improve performance.

proc sql ;

create table Dups_BY_ID as

SELECT

  a.national_id,

  a.telephone

FROM

  tele_dup a

where

  a.telephone in (

  select

  telephone

  from

  tele_dup

  group by

  telephone

  having

  count(*) > 1

)

  order by 2 desc;

quit;

sas_lak
Quartz | Level 8

Thank You,

Compare to my code what am using earlier it was more efficient

But above code is giving  "SAME ID having SAME Phone Numbers";

Ex:-

National_ID   Telephone

100                +19 99999999

100                +19 99999999                      here Same Phone Repeated for Same IDs

200                +19 77777777

200                +19 77777777

Iam Looking for "Same Phone Number is repeated for Differnt IDs"

Ex:-

National_ID   Telephone

100                +19 99999999

200                +19 99999999                   "Same Phone repeated for Different IDs

XXX               +19 77777777

YYY               +19 77777777

UrvishShah
Fluorite | Level 6

Try this one

proc sql;

  create table want(drop = tot) as

  select national_id,telephone,

         count(telephone) as tot

  from have

  group by 1,2

  having tot > 1;

quit;

proc sql;

  create table want1(drop = tot) as

  select customer_name,telephone,national_id,

         count(telephone) as tot

  from have

  group by 1,2

  having tot > 1;

quit;

proc append base = want data = want1 force;

run;

sas_lak
Quartz | Level 8

Hi Everyone,

National_ID  TelePhone

11111 9833293682

11111 9833293682

22222 9833293682

33333 9833293682

44444 9902247880

55555 9902247880

66666 9999999999

77777 8888888888

77777 8888888888

Here we have Ids and Telephone_Num, National_ID is the Unique and submited their Phone number.

I am looking for different National_ID submitted same Phone numbers list.

Output Should be:-

National_ID  TelePhone

11111   9833293682

22222   9833293682

33333   9833293682

44444   9902247880

55555   9902247880

here -  11111,22222 and 33333 having same(9833293682) TelePhone and

           44444,55555 having same( 9902247880) TelePhone.


Thanks in Advance....!

Ksharp
Super User
data have;
input National_ID  TelePhone : $40.;
cards;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;
run;
proc sql;
 select distinct * from have
  group by      TelePhone
   having count(distinct  National_ID) gt 1;
quit;
 


Xia Keshan

sas_lak
Quartz | Level 8

Thanks Xia Keshan,

Run the above quire on 30 millions data, it is taken hardly not more than 10 min and I checked output randomly it was looking good.

On the urgent basis I can apply this Smiley Happy

Ksharp
Super User

I can't believe you can get it done within 10 minutes for 30 millions, that is incredible . I was thinking about data step for that big data .

sas_lak
Quartz | Level 8

For testing purpose run the quire on sample dataset. Observation count is 5316167 and am preparing the dataset for 30 millions.

That's why I wondered  - IS IT OK......!

here I am sharing the log.

NOTE: There were 3433276 observations read from the data set WORK.TELEPHONE_1.

NOTE: There were 1284869 observations read from the data set WORK.TELEPHONE_2.

NOTE: There were 562389 observations read from the data set WORK.TELEPHONE_3.

NOTE: There were 35406 observations read from the data set WORK.TELEPHONE_4.

NOTE: There were 227 observations read from the data set WORK.TELEPHONE_5.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_6.

NOTE: There were 0 observations read from the data set WORK.TELEPHONE_7.

NOTE: The data set WORK.PHONE has 5316167 observations and 3 variables.

NOTE: DATA statement used (Total process time):

      real time           1.37 seconds

      cpu time            1.34 seconds

Total Records -  WORK.PHONE has   5316167   observations and 3 variables

proc sql;

90   create table dup as select distinct Telphoneno,fid from Phone

91   group by Telphoneno

92   having count(distinct  fid) gt 1

93   order by Telphoneno;

NOTE: The query requires remerging summary statistics back with the original data.

NOTE: Table WORK.DUP created, with 1334130 rows and 2 columns.

94   quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time           35.28 seconds

      cpu time            38.46 seconds

suggest me any other alternatives to get it.

Patrick
Opal | Level 21

You seem to run this on reasonable hardware and "cpu time" being higher than "real time" also shows that the process runs multi-threaded. If 10 minutes is good enough for you (and for your environment) then may be it's better to have simple code instead of tweaking the code for performance but then having more complex code to maintain.

naveen_srini
Quartz | Level 8

If datastep:

data have;
input National_ID $  TelePhone;
datalines;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;

data want;
set have;
by national_id telephone notsorted;
retain TelePhone2;
if first.national_id then telephone2=telephone;
if not first.national_id and telephone=telephone2 then delete;
drop telephone2;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 3390 views
  • 6 likes
  • 8 in conversation