Hello,
Need to find out the Duplicate Telephone Numbers by National_ID and NAME,
A) Tele_dup
1.BY NATIONAL_ID
National_ID | Customer_Name | Telephone | ||
1 | DEVID | +19 99999999 | Out put should like this | |
2 | MARK | +19 99999999 | National_ID | Telephone |
3 | JIM | +19 8888888 | 1 | +19 99999999 |
1 | DEVID | +19 7777777 | 2 | +19 99999999 |
2 | MARK HENRY | +19 99999999 | 7 | +19 99999999 |
4 | THOMAS | +19 6666666 | 4 | +19 6666666 |
5 | BALE | +19 5555555 | 6 | +19 6666666 |
6 | PITT | +19 6666666 | ||
7 | WOOD | +19 99999999 |
B) Name_Dups
2. By Customer_Name.
National_ID | Customer_Name | Telephone | ||
1 | MASS | +19 99999999 | ||
1 | MOSS | +19 99999999 | Out put should like this | |
2 | DEVID | +19 8888888 | National_ID | Telephone |
3 | DEVID THOMAS | +19 7777777 | 1 | +19 99999999 |
4 | JOHN | +19 99999999 | 4 | +19 99999999 |
5 | HENRY | +19 6666666 | 8 | +19 99999999 |
6 | WOOD | +19 5555555 | ||
7 | MARK HENRY | +19 6666666 | ||
8 | WOOD | +19 99999999 |
A)
proc sql ;
create table Dups_BY_ID as
select distinct a.National_ID,a.Telephone
from Tele_dup as a, Tele_dup as b
where
a.National_ID ne b.National_ID and a.Telephone eq b.Telephone
order by 2;
quit;
B)
proc sql ;
create table Dups_BY_Name as
select distinct a.National_ID,a.Telephone
from Name_Dups as a, Name_Dups as b
where
soundex(a.Customer_Name) ne soundex(b.Customer_Name) and a.Telephone eq b.Telephone
order by 2;
quit;
These two are giving the proper out put for small data, but same code run on huge data it is taking long time because it working as cartesian join,
Plz suggest any simple code for getting duplicats on huse data from single table.
Thanks in Advance....!
Have you looked into the prod sort options? Or a hash table sort?
Not yet, but I want to get "same Phone number repeated for different persons(or differnt IDs)" from single table.
Could you please share the code for same.
Even though proc sql doesn't require input tables to be sorted....it does help performance. You could also create a covering index.
This is an another option which will eliminate the Cartesian product and potentially help performance. Here you are joining all of the data with a subset of data that has duplicate telephone numbers. I just took your sample data and it worked great. For the full table, I would put an index on the telephone column since you are joining on it to improve performance.
proc sql ;
create table Dups_BY_ID as
SELECT
a.national_id,
a.telephone
FROM
tele_dup a
where
a.telephone in (
select
telephone
from
tele_dup
group by
telephone
having
count(*) > 1
)
order by 2 desc;
quit;
Thank You,
Compare to my code what am using earlier it was more efficient
But above code is giving "SAME ID having SAME Phone Numbers";
Ex:-
National_ID Telephone
100 +19 99999999
100 +19 99999999 here Same Phone Repeated for Same IDs
200 +19 77777777
200 +19 77777777
Iam Looking for "Same Phone Number is repeated for Differnt IDs"
Ex:-
National_ID Telephone
100 +19 99999999
200 +19 99999999 "Same Phone repeated for Different IDs
XXX +19 77777777
YYY +19 77777777
Try this one
proc sql;
create table want(drop = tot) as
select national_id,telephone,
count(telephone) as tot
from have
group by 1,2
having tot > 1;
quit;
proc sql;
create table want1(drop = tot) as
select customer_name,telephone,national_id,
count(telephone) as tot
from have
group by 1,2
having tot > 1;
quit;
proc append base = want data = want1 force;
run;
Hi Everyone,
National_ID TelePhone
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
Here we have Ids and Telephone_Num, National_ID is the Unique and submited their Phone number.
I am looking for different National_ID submitted same Phone numbers list.
Output Should be:-
National_ID TelePhone
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
here - 11111,22222 and 33333 having same(9833293682) TelePhone and
44444,55555 having same( 9902247880) TelePhone.
Thanks in Advance....!
data have; input National_ID TelePhone : $40.; cards; 11111 9833293682 11111 9833293682 22222 9833293682 33333 9833293682 44444 9902247880 55555 9902247880 66666 9999999999 77777 8888888888 77777 8888888888 ; run; proc sql; select distinct * from have group by TelePhone having count(distinct National_ID) gt 1; quit;
Xia Keshan
Thanks Xia Keshan,
Run the above quire on 30 millions data, it is taken hardly not more than 10 min and I checked output randomly it was looking good.
On the urgent basis I can apply this
I can't believe you can get it done within 10 minutes for 30 millions, that is incredible . I was thinking about data step for that big data .
For testing purpose run the quire on sample dataset. Observation count is 5316167 and am preparing the dataset for 30 millions.
That's why I wondered - IS IT OK......!
here I am sharing the log.
NOTE: There were 3433276 observations read from the data set WORK.TELEPHONE_1.
NOTE: There were 1284869 observations read from the data set WORK.TELEPHONE_2.
NOTE: There were 562389 observations read from the data set WORK.TELEPHONE_3.
NOTE: There were 35406 observations read from the data set WORK.TELEPHONE_4.
NOTE: There were 227 observations read from the data set WORK.TELEPHONE_5.
NOTE: There were 0 observations read from the data set WORK.TELEPHONE_6.
NOTE: There were 0 observations read from the data set WORK.TELEPHONE_7.
NOTE: The data set WORK.PHONE has 5316167 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 1.37 seconds
cpu time 1.34 seconds
Total Records - WORK.PHONE has 5316167 observations and 3 variables
proc sql;
90 create table dup as select distinct Telphoneno,fid from Phone
91 group by Telphoneno
92 having count(distinct fid) gt 1
93 order by Telphoneno;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.DUP created, with 1334130 rows and 2 columns.
94 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 35.28 seconds
cpu time 38.46 seconds
suggest me any other alternatives to get it.
You seem to run this on reasonable hardware and "cpu time" being higher than "real time" also shows that the process runs multi-threaded. If 10 minutes is good enough for you (and for your environment) then may be it's better to have simple code instead of tweaking the code for performance but then having more complex code to maintain.
If datastep:
data have;
input National_ID $ TelePhone;
datalines;
11111 9833293682
11111 9833293682
22222 9833293682
33333 9833293682
44444 9902247880
55555 9902247880
66666 9999999999
77777 8888888888
77777 8888888888
;
data want;
set have;
by national_id telephone notsorted;
retain TelePhone2;
if first.national_id then telephone2=telephone;
if not first.national_id and telephone=telephone2 then delete;
drop telephone2;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.