Data cleansing - company names

zishaq · Posted 11-21-2019 08:13 AM

Hello,

I have been tasked with cleaning a single column in SAS. The column in question is the names of businesses (BIZ_NAME) where we can have the same name business appearing more than once. I essentially want to calculate the number of times a business appears in this column (could easily use Pivot tables in Excel or the GROUP BY and SUM function in SAS).

However, the BIZ_NAME column needs cleansing. So for example, we can use the UPCASE, STRIP, COMPBL, COMPRESS functions to remove specific characters and to increase the accuracy of fuzzy matching. Many of the business also have LIMITED, LTD, CORPORATION etc. as a suffix so I was thinking of using TRANWRD to replace these with missing spaces.

Are there any other methods (or functions) that may be of use for this what appears to be 'simple' task? In the past I have used COMPGED when comparing two variables but in this case I am using only the one variable

Thanks!

SuryaKiran · Posted 11-21-2019 11:39 AM

Hello,

First count the records as they exists without any data cleansing, this will give all distinct names, from this you might need to find the fuzzy matches. There are several techniques to find out the fuzzy matches, it all depends on how your data look. I would recommend scoring with SPEDIS & COMPGED (great functions for fuzzy patterns). Why not both for better results. Check this blog post for more ways.

Here is one of my example for fuzzy pattern matching:

data have;
input biz_name $1-25;
datalines;
ABC Limited
ABC Limited
ABC Limited
ABC Ltd
ABC Ltd
Test Holding LLC
Test Holding LLC.
XYZ Ltd
;
run;

/* This will count the number of records per company (company names might be duplicated due to fuzzy) & assign a row number for each distinct company name */
proc sql;
create table test as 
select monotonic() as row_id,biz_name,rec_cnt
from (
select biz_name,count(*) as rec_cnt
from have 
group by 1) ;
quit;

/* Cartesian product, to cross compare each names */ 
proc sql;
create table test2 as 
select a.row_id, 
	   b.row_id as row_id_c, 
	   a.biz_name, 
	   b.biz_name as biz_name_c,
	   a.rec_cnt,
	   b.rec_cnt as rec_cnt_c
from test as a,test as b
where a.row_id<>b.row_id  /* excluding join by itself */
	and spedis(a.biz_name,b.biz_name)<50 and compged(a.biz_name,b.biz_name)<500 /* Fuzzy match comparsion */
order by 1,2
;
quit;

/* must exist in pairs, this will make sure the scores are acceptable both ways */
/* i.e. ABC Limited compared with ABC Ltd & ABC Ltd compared with ABC Limited (scores differ) */
proc sql;
create table test3 as 
select *
from test2
where row_id in (select row_id_c from test2)
	and row_id_c in (select row_id from test2);
quit;

/* records that have pair of sets */ 
data test4(drop=lag_row_id_c);
set test3;
lag_row_id_c=lag(row_id_c);
if row_id=lag_row_id_c;
run;

/* Join back the matched names to count the final values */
proc sql;
create table final as 
select 	a.row_id,
		a.biz_name,
		a.rec_cnt,
		b.row_id as merge_row_id,
		b.biz_name as merge_biz_name, 
		b.rec_cnt as merge_rec_cnt,
		case when a.rec_cnt=. then 0 else a.rec_cnt end + case when b.rec_cnt=. then 0 else b.rec_cnt end as total_cnt
from test as a
left join test4 as b on a.row_id=b.row_id_c
where a.row_id not in (select row_id from test4);
quit;

Thanks,
Suryakiran

Reeza · Posted 11-21-2019 12:42 PM

How many names do you have?

If it's smaller, you may want to take some time to identify which is the clean name, or if you have a database that has the clean names and then you can go from there to figure out how to assign the incorrect names to the correct names.

ie is Coca Cola LTD or Coca-cola ltd or Coca Cola Limited the correct one? What would happen if they have the same frequency.

One good tool to do this manual cleaning, if you decide to not use SAS is OpenRefine. It's great at suggesting the appropriate name and I *think* it generates some python code behind the scenes.

@zishaq wrote:

Hello,

I have been tasked with cleaning a single column in SAS. The column in question is the names of businesses (BIZ_NAME) where we can have the same name business appearing more than once. I essentially want to calculate the number of times a business appears in this column (could easily use Pivot tables in Excel or the GROUP BY and SUM function in SAS).

However, the BIZ_NAME column needs cleansing. So for example, we can use the UPCASE, STRIP, COMPBL, COMPRESS functions to remove specific characters and to increase the accuracy of fuzzy matching. Many of the business also have LIMITED, LTD, CORPORATION etc. as a suffix so I was thinking of using TRANWRD to replace these with missing spaces.

Are there any other methods (or functions) that may be of use for this what appears to be 'simple' task? In the past I have used COMPGED when comparing two variables but in this case I am using only the one variable

Thanks!

SASKiwi · Posted 11-21-2019 02:19 PM

SAS Data Quality has built-in methods for doing this without any coding, just in case you have this product available to you.

VincentRejany · Posted 11-21-2019 02:33 PM

You can avoid all this code by using the SAS Data Quality functions and a SAS Quality Knowledge Base, as it has been specifically designed for this particular use case

For standardizing Organizations:

%DQLOAD(DQLOCALE=(ENUSA), DQSETUPLOC='/opt/sas/spre/home/share/refdata/qkb/ci/31');

data _null_;
 length org stdOrg $ 50;
 input org $char50.;
 stdOrg=dqStandardize(org, 'Name');
 put 'Name:' @10 org /
 'StdName:' @10 stdOrg /;
datalines;
coca cola limited
coca cola ltd
;
run;

Name:    coca cola limited
StdName: Coca Cola, Limited
Name:    coca cola ltd
StdName: Coca Cola, Ltd

and parse

data _null_;
parsedValue=dqParse('Coca Cola, Ltd', 'ORGANIZATION', 'ENUSA');
name=dqParseTokenGet(parsedValue, 'Name', 'ORGANIZATION', 'ENUSA');
legal=dqParseTokenGet(parsedValue, 'Legal Form', 'ORGANIZATION', 'ENUSA');
 put 'Name:' @10 name /
 'Legal:' @10 legal /;
run;

1    %studio_hide_wrapper;
82   %DQLOAD(DQLOCALE=(ENUSA), DQSETUPLOC='/opt/sas/spre/home/share/refdata/qkb/ci/31');
NOTE: DQLOAD macro beginning.
NOTE: DQLOAD macro ending.
83   
84   data _null_;
85   parsedValue=dqParse('Coca Cola, Ltd', 'ORGANIZATION', 'ENUSA');
86   name=dqParseTokenGet(parsedValue, 'Name', 'ORGANIZATION', 'ENUSA');
87   legal=dqParseTokenGet(parsedValue, 'Legal Form', 'ORGANIZATION', 'ENUSA');
88    put 'Name:' @10 name /
89    'Legal:' @10 legal /;
90   run;
Name:    Coca Cola
Legal:   Ltd

Hope that help

Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

Re: Data cleansing - company names

SAS Innovate 2025: Call for Content