BookmarkSubscribeRSS Feed
buddha_d
Pyrite | Level 9

Fellow SAS users,

                   I have some messy data and I cleaned it up little bit. I have account dataset which has account number (acct_nbr) and account holder name (acct_name). Acct_Name column has messy customer name information (Since multiple words of first, last and middle names are mixed up in random order). I have another dataset which has customer name information. Customer_name has only customer name (first, middle and last names). I want to join both datasets by matching name and if a letter or word extra from acct_name shouldn't be matched. 

proc sql;
create table WORK.have1( bufsize=65536 )
  (
   acct_nbr char(150) format=$150. informat=$150. label='acct_number',
   Acct_NAME char(150)

  );

insert into WORK.have1
values('54321', 'M A F VENKATA SHIVA SAI ARUN KUMARI KANDARPA B S RA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN KUMARI KANDARPA B S RA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN B S RA KUMARI KANDARPA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN B S RA KUMARI KANDARPA INCORPORATED' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN KUMARI KANDARPA B S RA LLC' )   
values('1234AUC5','V V S MUDIMI HARINATH PEMBARTHY BABU VADA VEMULA' )
values('1234AUC5','V V S VEMULA MUDIMI PEMBARTHY VADA BABU HARINATH' )
values('1234AUC5','VEMULA MUDIMI PEMBARTHY V V S  VADA BABU HARINATH' )
values('1234AUC5','VEMULA MUDIMI AND PEMBARTHY V V S  VADA BABU HARINATH' )
values('1234AUC5','A B V V S MUDIMI HARINATH PEMBARTHY BABU VADA VEMULA' )
values('12534AUC5','GARCIA DELYNN CHRISTINE DE DE LEON LA VEGA FOX TENA PONCE ' )
values('12534AUC5','GARCIA DE LA VEGA DELYNN CHRISTINE FOX TENA PONCE DE LEON' )
values('12534AUC5','GARCIA DE LA VEGA TENA PONCE DE LEON DELYNN CHRISTINE FOX ' )
values('A1234AUC5','ELAINE ALLAN CHRISTOPHER ELLIOTT LAA BELLE' )
values('A1234AUC5','CHRISTOPHER ELLIOTT LAA BELLE ELAINE ALLAN' )
values('1234AUC5','ALLAN CHRISTOPHER ELLIOTT ABIGAIL NICOLE K CATHARINE DA RE' )
values('1234AUC5','DA RE NICOLE K CATHARINE ELLIOTT ABIGAIL ALLAN CHRISTOPHER' )
values('1234AUC5',' RE DA  CATHARINE  K  NICOLE ABIGAIL ELLIOTT CHRISTOPHER ALLAN' )
values('54689AGC','WONG   TONY')
values('54689AGC','WONG K  TONY')
values('54689AGC','TONY WONG JAY ')
;
quit;

proc sql;
create table WORK.have2( bufsize=65536 )
  (
   first_name char(150),
   middle_name char(150),
   last_name char(150)	
  );
insert into WORK.have2
values('VENKATA SHIVA SAI ARUN KUMARI', 'M A F', 'KANDARPA B S RA' )
values('V V S HARINATH BABU', 'PEMBARTHY MUDIMI', 'VEMULA VADA' )
values('GARCIA DE LA VEGA', 'DELYNN CHRISTINE FOX', 'TENA PONCE DE LEON' )
values('ELAINE', 'ALLAN CHRISTOPHER ELLIOTT', 'LAA BELLE' )
values('ALLAN CHRISTOPHER ELLIOTT', 'ABIGAIL NICOLE K', 'CATHARINE DA RE' )
values('TONY','','WONG')
;
quit;

The output should have 

           Please check the attachment for the result output. 

 

I tried different ways to join, but unsuccessful. Any help would be appreciated. The issue is the two joining string columns are not arranged in a proper way to join. The most important thing to remember is to match acct_name with first_name, middle_name and last_name fields (in any order). If account name (for example WONG K TONY) is different from (TONY WONG) as there is an extra K in the string. 

Please help me here or can give suggestions too. 

Thanks in Advance. 

 

 

8 REPLIES 8
Reeza
Super User

This is known as a fuzzy join and is quite computationally intensive, as in if you have 1000 rows and 1000 rows in each tables you're looking at 1000*1000 = 1,000,000 calculations for a single variable and your second data set has three variables. Do you have any other way of narrowing down the comparison first? Dates of any kind?

 

 

If you have SAS Data Quality Studio it's worth checking out the features there. 

 

Otherwise, you can use something like the following to get you started but it will take a while. You may run out of memory.

 


proc sql;
create table want as
select a.*, b.*
from have1 as a
full join have2 as b
on find(a.acct_name, b.first_name, 'it') & find(a.acct_name, b.middle_name, 'it') & find(a.acct_name, b.last_name, 'it');
quit;

 


@buddha_d wrote:

Fellow SAS users,

                   I have some messy data and I cleaned it up little bit. I have account dataset which has account number (acct_nbr) and account holder name (acct_name). Acct_Name column has messy customer name information (Since multiple words of first, last and middle names are mixed up in random order). I have another dataset which has customer name information. Customer_name has only customer name (first, middle and last names). I want to join both datasets by matching name and if a letter or word extra from acct_name shouldn't be matched. 

proc sql;
create table WORK.have1( bufsize=65536 )
  (
   acct_nbr char(150) format=$150. informat=$150. label='acct_number',
   Acct_NAME char(150)

  );

insert into WORK.have1
values('54321', 'M A F VENKATA SHIVA SAI ARUN KUMARI KANDARPA B S RA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN KUMARI KANDARPA B S RA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN B S RA KUMARI KANDARPA' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN B S RA KUMARI KANDARPA INCORPORATED' )
values('54321', 'VENKATA SHIVA SAI M A F  ARUN KUMARI KANDARPA B S RA LLC' )   
values('1234AUC5','V V S MUDIMI HARINATH PEMBARTHY BABU VADA VEMULA' )
values('1234AUC5','V V S VEMULA MUDIMI PEMBARTHY VADA BABU HARINATH' )
values('1234AUC5','VEMULA MUDIMI PEMBARTHY V V S  VADA BABU HARINATH' )
values('1234AUC5','VEMULA MUDIMI AND PEMBARTHY V V S  VADA BABU HARINATH' )
values('1234AUC5','A B V V S MUDIMI HARINATH PEMBARTHY BABU VADA VEMULA' )
values('12534AUC5','GARCIA DELYNN CHRISTINE DE DE LEON LA VEGA FOX TENA PONCE ' )
values('12534AUC5','GARCIA DE LA VEGA DELYNN CHRISTINE FOX TENA PONCE DE LEON' )
values('12534AUC5','GARCIA DE LA VEGA TENA PONCE DE LEON DELYNN CHRISTINE FOX ' )
values('A1234AUC5','ELAINE ALLAN CHRISTOPHER ELLIOTT LAA BELLE' )
values('A1234AUC5','CHRISTOPHER ELLIOTT LAA BELLE ELAINE ALLAN' )
values('1234AUC5','ALLAN CHRISTOPHER ELLIOTT ABIGAIL NICOLE K CATHARINE DA RE' )
values('1234AUC5','DA RE NICOLE K CATHARINE ELLIOTT ABIGAIL ALLAN CHRISTOPHER' )
values('1234AUC5',' RE DA  CATHARINE  K  NICOLE ABIGAIL ELLIOTT CHRISTOPHER ALLAN' )
values('54689AGC','WONG   TONY')
values('54689AGC','WONG K  TONY')
values('54689AGC','TONY WONG JAY ')
;
quit;

proc sql;
create table WORK.have2( bufsize=65536 )
  (
   first_name char(150),
   middle_name char(150),
   last_name char(150)	
  );
insert into WORK.have2
values('VENKATA SHIVA SAI ARUN KUMARI', 'M A F', 'KANDARPA B S RA' )
values('V V S HARINATH BABU', 'PEMBARTHY MUDIMI', 'VEMULA VADA' )
values('GARCIA DE LA VEGA', 'DELYNN CHRISTINE FOX', 'TENA PONCE DE LEON' )
values('ELAINE', 'ALLAN CHRISTOPHER ELLIOTT', 'LAA BELLE' )
values('ALLAN CHRISTOPHER ELLIOTT', 'ABIGAIL NICOLE K', 'CATHARINE DA RE' )
values('TONY','','WONG')
;
quit;

The output should have 

           Please check the attachment for the result output. 

 

I tried different ways to join, but unsuccessful. Any help would be appreciated. The issue is the two joining string columns are not arranged in a proper way to join. The most important thing to remember is to match acct_name with first_name, middle_name and last_name fields (in any order). If account name (for example WONG K TONY) is different from (TONY WONG) as there is an extra K in the string. 

Please help me here or can give suggestions too. 

Thanks in Advance. 

 

 




 

 

buddha_d
Pyrite | Level 9

Hi Reeza, 

           Thanks for the quick response. I have 3 million records of account dataset and over 100,000 customer name dataset. So, full join with conditions might be so huge for me to run. It might even crash my SAS. I use SAS EG and don't have Data quality studio. 

Thanks, 

SASKiwi
PROC Star

With 3 million accounts you are likely to have a lot of accounts owned by customers who have the same name but are not the same people. You really need other attributes to improve accuracy. Is name the only common attribute?

buddha_d
Pyrite | Level 9

Yes SASKiwi. I know it is kind of difficult and I tried to do Cartesian join with conditions, with some functions. It was running for a while and I killed that query.  

SASKiwi
PROC Star

Do you at least have a customer ID so you can check for duplicate names in your customer table? I'm not talking about processing difficulties here, just simply that with duplicate customer names you will end up with a lot of different people with the same name being joined to the same account. 

buddha_d
Pyrite | Level 9
I made a sample data. The logic is there is CC account data and some contracting employees names are showing up as account holder name. So trying to develop the logic to match the names. There is no common field for both datasets.
Thanks for asking.
SASKiwi
PROC Star

If you are dealing with employee names then you should also have employee numbers. I agree with @Reeza that relying on mis-formed account and customer names only for joining is unlikely to produce any useful results.

 

I'm surprised by how messy your account and customers names are. I've worked at many banks and never encountered such poorly structured data. Is the sample data you've provided just the worst examples and most of it is in better shape or is it all very similar?     

Reeza
Super User

@buddha_d wrote:

Hi Reeza, 

           Thanks for the quick response. I have 3 million records of account dataset and over 100,000 customer name dataset. So, full join with conditions might be so huge for me to run. It might even crash my SAS. I use SAS EG and don't have Data quality studio. 

Thanks, 


As others have said, you need something else to be able to narrow this down otherwise you literally do not have the resources to do this calculation in a reasonable amount of time. 

Do you have gender, dates, birthdates?

 

For example for a similar comparison I had birthdates so I first linked on birthdates and then tried the fuzzy matching which reduced my run time from hours to under 10 minutes. 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 525 views
  • 7 likes
  • 3 in conversation