<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Combine Datasets using Inexact Character Variables in SAS in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95413#M27007</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Patrick and Ksharp,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for letting me know about this special features of DataFlux. I am not sure whether the University has subscribed for the SAS Data Quality Server. I will check with the local SAS representative on this matter. Perhaps, I should submit a proposal for the subscription of DataFlux if it is not accessible by staff members here.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;mspak&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 27 Apr 2012 13:00:31 GMT</pubDate>
    <dc:creator>mspak</dc:creator>
    <dc:date>2012-04-27T13:00:31Z</dc:date>
    <item>
      <title>Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95402#M26996</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;Dear all,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;I downloaded data from 2 different databases with different their identification codes. As such, it is impossible to match cases by an ID. Therefore, the only way is to match the cases by using their company names which is an inexact character variable. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;I have 2 different datasets:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;A) uw_match, with the following variables:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- underwriters_names;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- holding_company;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- ipo_date;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- others.....&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;B) maluw, with the following variables;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- name (label as bank name)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- bs_id_number (bankscope identification code)&lt;/SPAN&gt;&lt;/P&gt;&lt;ADDRESS&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- closdate (Company Fiscal Year End)&lt;/SPAN&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;SPAN style="font-size: 10pt; font-family: arial, helvetica, sans-serif;"&gt;- others....&lt;/SPAN&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;I wish to match-merge these two dataset by the following creteria:&lt;/ADDRESS&gt;&lt;ADDRESS&gt;STEP 1: the underwriters (either underwriters_names or their holding_company) in dataset A compared to bank name in dataset B;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;note: I can either match-merge the underwriters_names (in A) with the bank name (in B) or holding_company (in A) with bank name (in B), the pairs with the higher matching accuracy level will be output/used; AND &lt;/ADDRESS&gt;&lt;ADDRESS&gt;STEP 2: the closest&amp;nbsp; &lt;SPAN style="font-style: italic;"&gt;closdate&lt;/SPAN&gt; (in B) with ipo_date (in A). &lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;In short, I wish to match all the variables in A and B, by&lt;/ADDRESS&gt;&lt;ADDRESS&gt;1. the bank names (in B) = IPO underwriters (in A; either&lt;SPAN style="font-style: italic;"&gt; underwriters_names or their holding_company;which can provide the highest precision level) &lt;/SPAN&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;2. in the similar period (ie. the closest fiscal year end of the banks with the IPO date).&lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;I read an article (see the pdf attached), and I understand that it is possible with SAS. But I feel little knowledge on how to apply the examples into my context here. &lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;Any comment and advise is much appreciated.&lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;Thank you.&lt;/ADDRESS&gt;&lt;ADDRESS&gt;&lt;/ADDRESS&gt;&lt;ADDRESS&gt;Regards,&lt;/ADDRESS&gt;&lt;ADDRESS&gt;mspak &lt;/ADDRESS&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Apr 2012 12:37:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95402#M26996</guid>
      <dc:creator>mspak</dc:creator>
      <dc:date>2012-04-26T12:37:49Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95403#M26997</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;hi ... before you start any of the above, since step #1 relies on matching by literals, have you looked at the names in both files and determined if there are thigs you should do before you even start ... for example ...&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#1 in holding_company, I see ...&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;O.S.K. HOLDINGS BERHAD&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;OSK HOLDINGS BERHAD&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;are they the same company and should you get rid of those periods&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#2&amp;nbsp; there's a mix of lower and upper case letters ... should you convert to all uppercase&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#3&amp;nbsp; most (90+%) of all the name variables you cite have "BHD" or "BERHAD" as part of the name ... if you are going to look for similarity in names you don't want the fact that the "BHD' or "BERHAD" part of the match contributing anything to a score given to a name comparison&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;#4&amp;nbsp; sometimes a location is in parentheses (MALAYSIA) and sometimes it's not MALAYSIA&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;just using PROC FREQ on the various name variables would give you some idea as to how to fix up the names before you even try to match names&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for example, clean up the names and make some new variables to hold those names ...&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;data new_maluw;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;set z.maluw;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* add a record number for later use;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;mnrec+1;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* convert to uppercase, only keep numbers/letters/spaces, convert multiple spaces to one space;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nm = compbl(compress(upcase(name),,'kdas'));&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* get rid of BHD and BERHAD;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nm = tranwrd(nm,' BHD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nm = tranwrd(nm,' BERHAD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;run;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;data new_uw_match;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;set z.uw_match;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;unrec+1;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmh = compbl(compress(upcase(holding_company),,'kdas'));&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmh = tranwrd(nmh,' BHD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmh = tranwrd(nmh,' BERHAD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmu = compbl(compress(upcase(underwriters_names),,'kdas'));&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmu = tranwrd(nmu,' BHD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;nmu = tranwrd(nmu,' BERHAD','');&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;run;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif;"&gt;then run PROC FREQ again on the new variables (nm, nmh, and nmu) and see if there are any other things you should do before you start to match the nm in one file to nmh and nmu in another&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif;"&gt;once you have done the above, here's a suggestion for a start ... haven't used COMPGED much (maybe other folk know about a "good score" level) &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: arial,helvetica,sans-serif;"&gt;I usually do this stuf in stages, evaluating the success of each step (e.g. the name match) before I move onto the next ...&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* use SQL to match the files by a comparison of names, use the COMPGED function to compare names;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* you don't have to use all the data since you have pointers (mnrec and unrec);&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* nm_nmh and nm_nmu are matching scores;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;proc sql;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;create table both as&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;select mnrec, unrec, compged(nm, nmh) as nm_nmh, compged(nm, nmu) as nm_nmu&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;from new_maluw, new_uw_match&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;having nm_nmh lt 50 or nm_nmu lt 50;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;quit;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* reconstruct the data using the pointers;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;* maybe you only add the dates and other vars you need for more work at this point;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;data both;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;set both;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;p1=mnrec;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;p2=unrec;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;set new_maluw (keep=nm closdate) point=p1;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;set new_uw_match (keep=nmh nmu ipo_date) point=p2;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;run;&amp;nbsp;&amp;nbsp; &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&lt;STRONG&gt;etc ...&amp;nbsp; &lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Apr 2012 14:59:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95403#M26997</guid>
      <dc:creator>MikeZdeb</dc:creator>
      <dc:date>2012-04-26T14:59:53Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95404#M26998</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello MSPak, take a look at SAS string distance functions : SPEDIS, COMPGED and COMPLEV. They seem far more sophisticated than the function described in the joined article.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Apr 2012 15:09:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95404#M26998</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2012-04-26T15:09:33Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95405#M26999</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;hi .. I had the same reaction and wondered how there could be an SGF 2012 paper on inexact matching that did not even reference the functions you mentioned (so someone like MSPak could read a new paper and not even be made aware of the functions)&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Apr 2012 15:25:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95405#M26999</guid>
      <dc:creator>MikeZdeb</dc:creator>
      <dc:date>2012-04-26T15:25:40Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95406#M27000</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;For your case (with 1,560*940=1,466,400 comparisons) the match functions would be fine, although it's still a good idea to translate common abbreviations (CO, CORP, LTD, INC, DIST, DIV, states such as BHD) and eliminate punctuation before doing an outer join. When I match company names from much larger databases, an outer join is not practical. In that case, you need to transform both sides and join on the transformed fields. I have a rather old program that does a modified soundex transformation to each name from both datasets before matching. Not particularly sophisticated but, having developed the transformations over a period of time, it performs well.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Apr 2012 16:04:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95406#M27000</guid>
      <dc:creator>GTickner</dc:creator>
      <dc:date>2012-04-26T16:04:46Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95407#M27001</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;For your situation, it is very very hard .&lt;/P&gt;&lt;P&gt;SAS has a product named DATA FLUX can solve this problem.&lt;/P&gt;&lt;P&gt;As my opinion, I will maintain a table which contain all of these similar words to identify them .&lt;/P&gt;&lt;P&gt;And of course these spell distance function mentioned by PGStat .&lt;/P&gt;&lt;P&gt;The following code just for fun.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;libname x v9 'D:\Software';
data uw_match;
 set x.uw_match;
 _name=holding_company;output;
 _name=underwriters_names;output;
 keep _name;
run;
proc sql;
create table x as
 select name,_name
&amp;nbsp; from x.maluw,uw_match
&amp;nbsp;&amp;nbsp; where name =* _name ;quit;

&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Ksharp&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 03:23:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95407#M27001</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2012-04-27T03:23:56Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95408#M27002</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Ksharp,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is the meaning of this code "where name =* _name" ? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;mspak&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 07:01:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95408#M27002</guid>
      <dc:creator>mspak</dc:creator>
      <dc:date>2012-04-27T07:01:15Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95409#M27003</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi MikeZdeb,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your codes are extremely useful for me. I applied the codes suggested, I can say the result is excellent.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In BTW, I found a good article on&amp;nbsp; COMPGED Function (see the Pdf). &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much for your helps.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;mspak&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 07:07:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95409#M27003</guid>
      <dc:creator>mspak</dc:creator>
      <dc:date>2012-04-27T07:07:32Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95410#M27004</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;It is SOUND LIKE .&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 07:45:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95410#M27004</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2012-04-27T07:45:29Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95411#M27005</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you very much. Today, I learned a lot from all of you. &lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;"T&lt;/SPAN&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;here is no royal road to learning, learning SAS with this discussion forum is extremely useful and fun".&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;Hope everyone enjoy your weekend :smileylaugh:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;Regards,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="color: #000000; font-family: Arial; font-size: 16px; font-style: normal; text-align: left; text-indent: 0px; background-color: #ffffff;"&gt;mspak&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 08:02:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95411#M27005</guid>
      <dc:creator>mspak</dc:creator>
      <dc:date>2012-04-27T08:02:45Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95412#M27006</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Just to add to Ksharp's mentioning of DataFlux:&lt;/P&gt;&lt;P&gt;If you have the SAS Data Quality Server licensed at your site then you have have DataFlux components available which you can use within a data step.&lt;/P&gt;&lt;P&gt;&lt;A href="http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1dczp3khf9susn1m0b7badlypav.htm" title="http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1dczp3khf9susn1m0b7badlypav.htm"&gt;http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1dczp3khf9susn1m0b7badlypav.htm&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Interesting for you would be to create machcodes. Similar strings will collaps into a single matchcode which you then can use to join your tables.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The way DataFlux creates these matchcodes is far superior over functions like soundex() as it doesn't only use a rule-based approach but also lists (eg. I would assume all major companies together with a vast list of variants inclusive typos are in such a list - so "someone" does clusteranalysis here and there are regular updates to the "database").&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 09:22:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95412#M27006</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2012-04-27T09:22:33Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95413#M27007</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Patrick and Ksharp,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you for letting me know about this special features of DataFlux. I am not sure whether the University has subscribed for the SAS Data Quality Server. I will check with the local SAS representative on this matter. Perhaps, I should submit a proposal for the subscription of DataFlux if it is not accessible by staff members here.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;mspak&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 27 Apr 2012 13:00:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95413#M27007</guid>
      <dc:creator>mspak</dc:creator>
      <dc:date>2012-04-27T13:00:31Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95414#M27008</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Patrick,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am using SAS Dataflux.&lt;/P&gt;&lt;P&gt;I have to agree that Dataflux is indeed a great software.&lt;/P&gt;&lt;P&gt;However, as you mentioned, Dataflux collapses similar strings into a single matchcode which I then can use it to join my table.&lt;/P&gt;&lt;P&gt;Not sure you are referring to &lt;STRONG&gt;Match Codes&lt;/STRONG&gt; under the &lt;STRONG&gt;Entity Resolution&lt;/STRONG&gt; node.&lt;/P&gt;&lt;P&gt;If yes, then I have a question.&lt;/P&gt;&lt;P&gt;As you will find in my attachment, the final output (&lt;STRONG&gt;&lt;EM&gt;All Matches and All_Non_Match&lt;/EM&gt;&lt;/STRONG&gt;) files have lesser total number of firms than my original input files (&lt;STRONG&gt;&lt;EM&gt;Text file input 1 and Text file input 2&lt;/EM&gt;&lt;/STRONG&gt;).&lt;/P&gt;&lt;P&gt;In other words,&lt;SPAN style="font-size: small; line-height: 1.5em;"&gt; if the total number of firms in original input files &lt;STRONG&gt;&lt;EM&gt;(Text file input 1 and Text file input 2&lt;/EM&gt;&lt;/STRONG&gt;) consist of 1000 firms, then logically, the final output (&lt;STRONG&gt;&lt;EM&gt;All Matches and All_Non_Match&lt;/EM&gt;&lt;/STRONG&gt;) files should also consist of 1000&lt;/SPAN&gt;&lt;SPAN style="line-height: 1.5em; font-size: 10pt;"&gt; right?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 1.5em; font-size: 10pt;"&gt;In my case, I lost about 148 firms and upon checking manually, I realize that companies such ABC 1995, ABC 1996, ABC 1997 (which may refer to the similar or not similar companies) have been merged into one, which is ABC.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Can I customize the &lt;STRONG&gt;Match Codes&lt;/STRONG&gt; function so that &lt;SPAN style="font-size: small;"&gt;ABC 1995, ABC 1996, ABC 1997 &lt;/SPAN&gt;are considered as different companies?&lt;/P&gt;&lt;P&gt;My sincere advance thank you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Warm regards,&lt;/P&gt;&lt;P&gt;Steven&lt;/P&gt;&lt;BR /&gt;&lt;IMG src="https://communities.sas.com/t5/image/serverpage/image-id/11519iB1E5E457ED4E7866/image-size/large?v=1.0&amp;amp;px=600" border="0" alt="Capture.PNG" title="Capture.PNG" /&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 20 Nov 2013 09:12:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95414#M27008</guid>
      <dc:creator>stevennair</dc:creator>
      <dc:date>2013-11-20T09:12:58Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95415#M27009</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Steven&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You should normally start a new post in a situation like here and then eventually reference the old post which is related to your question.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To answer what you asked for:&lt;/P&gt;&lt;P&gt;It's quite a while that I had my DataFlux training and I didn't have a lot of opportunity to actually use it :-). I was using match codes in the past quite a bit though (using dq... SAS functions).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I had a look at your flow and can't give you the full answer but only a few thoughts:&lt;/P&gt;&lt;P&gt;- The Data Joining node follows the same logic than a SQL join so in order to no loose data you would need a OUTER JOIN&lt;/P&gt;&lt;P&gt;- If you're already standardizing your company names then you don't have to create match codes and use the match codes for joining. Use the standardized company names instead.&lt;/P&gt;&lt;P&gt;- " ABC 1995, ABC 1996, ABC 1997": That such variations of the same names get clustered and resolve to the same standardized name is exactly what you want. You can play with the sensitivity which will affect what will get clustered together (the same is also true for match codes).&lt;/P&gt;&lt;P&gt;- Yes, you can modify the clustering. That's what the role "Data Stewart" is supposed to do. Data Quality is an ongoing process. In the heart of things are the QKB's (Quality Knowledge Base). There is - I believe - a quarterly update and the QKB's are also local/language specific.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;Patrick&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 20 Nov 2013 09:47:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95415#M27009</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2013-11-20T09:47:17Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95416#M27010</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Patrick,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am sorry for that since I am a newbie here. learnt something now :smileyblush:&lt;/P&gt;&lt;P&gt;Thank you for the reply, I really appreciate it and find it VERY useful.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just to double check with you:&lt;/P&gt;&lt;P&gt;-When you mentioned that "&lt;SPAN style="background-color: #ffffff; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif;"&gt;&lt;EM&gt;The Data Joining node follows the same logic than a SQL join so in order to no loose data you would need a OUTER JOIN&lt;/EM&gt;&lt;/SPAN&gt;", you are just referring that i need the data joining node to join data from 2 different files right?&lt;/P&gt;&lt;P&gt;-If I have already standardized the company names, then I could join them even without match code OR I can join the company names by only using match code without standardization?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;By the way, in your opinion, do you think using SAS function (dq...) is better than Dataflux?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Once again, my sincere thank you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Warm regards,&lt;/P&gt;&lt;P&gt;Steven&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 20 Nov 2013 10:40:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95416#M27010</guid>
      <dc:creator>stevennair</dc:creator>
      <dc:date>2013-11-20T10:40:24Z</dc:date>
    </item>
    <item>
      <title>Re: Combine Datasets using Inexact Character Variables in SAS</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95417#M27011</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;As said I'm not very experienced with DataFlux.&lt;/P&gt;&lt;P&gt;- "Data Joining node": I've just looked it up &lt;A href="http://support.sas.com/documentation/onlinedoc/dfdmstudio/2.5/dmpdmsug/Content/dfDMStd_Task_RDM_DomainItems.html" title="http://support.sas.com/documentation/onlinedoc/dfdmstudio/2.5/dmpdmsug/Content/dfDMStd_Task_RDM_DomainItems.html"&gt;Using Reference Data Manager Domain Items in a Job&lt;/A&gt; . You must chose "full" as join type - and the variables you're selecting to be used for the join must represent a key.&lt;/P&gt;&lt;P&gt;- Match codes allow you to create a key for a set of variables (eg. first-, middle-, last name). Standardisation allows you to cleanse your data, eg. variations of the same company name converted to the same name. You can use either for matching depending on what suits your case better.&lt;/P&gt;&lt;P&gt;- The SAS "DQ...." functions allow you to use DataFlux functionality out of SAS in a data step. So the "DQ..." functions use DataFlux as well.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 20 Nov 2013 20:47:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Combine-Datasets-using-Inexact-Character-Variables-in-SAS/m-p/95417#M27011</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2013-11-20T20:47:22Z</dc:date>
    </item>
  </channel>
</rss>

