BookmarkSubscribeRSS Feed
zishaq
Obsidian | Level 7

Hello,

 

I have been tasked with cleaning a single column in SAS. The column in question is the names of businesses (BIZ_NAME) where we can have the same name business appearing more than once. I essentially want to calculate the number of times a business appears in this column (could easily use Pivot tables in Excel or the GROUP BY and SUM function in SAS). 

 

However, the BIZ_NAME column needs cleansing. So for example, we can use the UPCASE, STRIP, COMPBL, COMPRESS functions to remove specific characters and to increase the accuracy of fuzzy matching. Many of the business also have LIMITED, LTD, CORPORATION etc. as a suffix so I was thinking of using TRANWRD to replace these with missing spaces.

 

Are there any other methods (or functions) that may be of use for this what appears to be 'simple' task? In the past I have used COMPGED when comparing two variables but in this case I am using only the one variable 

 

Thanks!

4 REPLIES 4
SuryaKiran
Meteorite | Level 14

Hello,

 

First count the records as they exists without any data cleansing, this will give all distinct names, from this you might need to find the fuzzy matches. There are several techniques to find out the fuzzy matches, it all depends on how your data look. I would recommend scoring with SPEDIS & COMPGED (great functions for fuzzy patterns). Why not both for better results. Check this blog post for more ways. 

 

Here is one of my example for fuzzy pattern matching:

 

data have;
input biz_name $1-25;
datalines;
ABC Limited
ABC Limited
ABC Limited
ABC Ltd
ABC Ltd
Test Holding LLC
Test Holding LLC.
XYZ Ltd
;
run;

/* This will count the number of records per company (company names might be duplicated due to fuzzy) & assign a row number for each distinct company name */
proc sql;
create table test as 
select monotonic() as row_id,biz_name,rec_cnt
from (
select biz_name,count(*) as rec_cnt
from have 
group by 1) ;
quit;

/* Cartesian product, to cross compare each names */ 
proc sql;
create table test2 as 
select a.row_id, 
	   b.row_id as row_id_c, 
	   a.biz_name, 
	   b.biz_name as biz_name_c,
	   a.rec_cnt,
	   b.rec_cnt as rec_cnt_c
from test as a,test as b
where a.row_id<>b.row_id  /* excluding join by itself */
	and spedis(a.biz_name,b.biz_name)<50 and compged(a.biz_name,b.biz_name)<500 /* Fuzzy match comparsion */
order by 1,2
;
quit;

/* must exist in pairs, this will make sure the scores are acceptable both ways */
/* i.e. ABC Limited compared with ABC Ltd & ABC Ltd compared with ABC Limited (scores differ) */
proc sql;
create table test3 as 
select *
from test2
where row_id in (select row_id_c from test2)
	and row_id_c in (select row_id from test2);
quit;

/* records that have pair of sets */ 
data test4(drop=lag_row_id_c);
set test3;
lag_row_id_c=lag(row_id_c);
if row_id=lag_row_id_c;
run;

/* Join back the matched names to count the final values */
proc sql;
create table final as 
select 	a.row_id,
		a.biz_name,
		a.rec_cnt,
		b.row_id as merge_row_id,
		b.biz_name as merge_biz_name, 
		b.rec_cnt as merge_rec_cnt,
		case when a.rec_cnt=. then 0 else a.rec_cnt end + case when b.rec_cnt=. then 0 else b.rec_cnt end as total_cnt
from test as a
left join test4 as b on a.row_id=b.row_id_c
where a.row_id not in (select row_id from test4);
quit;

 

 

Thanks,
Suryakiran
Reeza
Super User

How many names do you have?

 

If it's smaller, you may want to take some time to identify which is the clean name, or if you have a database that has the clean names and then you can go from there to figure out how to assign the incorrect names to the correct names. 

 

ie is Coca Cola LTD or Coca-cola ltd or Coca Cola Limited the correct one? What would happen if they have the same frequency. 

 

One good tool to do this manual cleaning, if you decide to not use SAS is OpenRefine. It's great at suggesting the appropriate name and I *think* it generates some python code behind the scenes.

 


@zishaq wrote:

Hello,

 

I have been tasked with cleaning a single column in SAS. The column in question is the names of businesses (BIZ_NAME) where we can have the same name business appearing more than once. I essentially want to calculate the number of times a business appears in this column (could easily use Pivot tables in Excel or the GROUP BY and SUM function in SAS). 

 

However, the BIZ_NAME column needs cleansing. So for example, we can use the UPCASE, STRIP, COMPBL, COMPRESS functions to remove specific characters and to increase the accuracy of fuzzy matching. Many of the business also have LIMITED, LTD, CORPORATION etc. as a suffix so I was thinking of using TRANWRD to replace these with missing spaces.

 

Are there any other methods (or functions) that may be of use for this what appears to be 'simple' task? In the past I have used COMPGED when comparing two variables but in this case I am using only the one variable 

 

Thanks!


 

SASKiwi
PROC Star

SAS Data Quality has built-in methods for doing this without any coding, just in case you have this product available to you.

VincentRejany
SAS Employee

You can avoid all this code by using the SAS Data Quality functions and a SAS Quality Knowledge Base, as it has been specifically designed for this particular use case

For standardizing Organizations:

%DQLOAD(DQLOCALE=(ENUSA), DQSETUPLOC='/opt/sas/spre/home/share/refdata/qkb/ci/31');

data _null_;
 length org stdOrg $ 50;
 input org $char50.;
 stdOrg=dqStandardize(org, 'Name');
 put 'Name:' @10 org /
 'StdName:' @10 stdOrg /;
datalines;
coca cola limited
coca cola ltd
;
run;
Name:    coca cola limited
StdName: Coca Cola, Limited
Name:    coca cola ltd
StdName: Coca Cola, Ltd

and parse

 

data _null_;
parsedValue=dqParse('Coca Cola, Ltd', 'ORGANIZATION', 'ENUSA');
name=dqParseTokenGet(parsedValue, 'Name', 'ORGANIZATION', 'ENUSA');
legal=dqParseTokenGet(parsedValue, 'Legal Form', 'ORGANIZATION', 'ENUSA');
 put 'Name:' @10 name /
 'Legal:' @10 legal /;
run;
1    %studio_hide_wrapper;
82   %DQLOAD(DQLOCALE=(ENUSA), DQSETUPLOC='/opt/sas/spre/home/share/refdata/qkb/ci/31');
NOTE: DQLOAD macro beginning.
NOTE: DQLOAD macro ending.
83   
84   data _null_;
85   parsedValue=dqParse('Coca Cola, Ltd', 'ORGANIZATION', 'ENUSA');
86   name=dqParseTokenGet(parsedValue, 'Name', 'ORGANIZATION', 'ENUSA');
87   legal=dqParseTokenGet(parsedValue, 'Legal Form', 'ORGANIZATION', 'ENUSA');
88    put 'Name:' @10 name /
89    'Legal:' @10 legal /;
90   run;
Name:    Coca Cola
Legal:   Ltd

 

Hope that help

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1988 views
  • 4 likes
  • 5 in conversation