<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fuzzy match using a string variable between two large datasets in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208286#M51639</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Lan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Better than sample data, if you click on the link in the paper's first paragraph (problem statement), you can download full copies of all of the datasets that were used.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 24 Mar 2015 03:50:43 GMT</pubDate>
    <dc:creator>art297</dc:creator>
    <dc:date>2015-03-24T03:50:43Z</dc:date>
    <item>
      <title>Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208283#M51636</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello, everyone&lt;/P&gt;&lt;P&gt;I have two datasets to merge using a common string variable, the customer name.&lt;/P&gt;&lt;P&gt;The main data (customer data) contains each firm's id, the year, its sales to each of its customers, and the customer id and name. Each firm could have multiple customers in each year. This is a panel dataset with 455,000 records. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;supplierid&lt;/TD&gt;&lt;TD&gt;supplier&lt;/TD&gt;&lt;TD&gt;customername&lt;/TD&gt;&lt;TD&gt;salestocustomer&lt;/TD&gt;&lt;TD&gt;fyear&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1797&lt;/TD&gt;&lt;TD&gt;MM corp&lt;/TD&gt;&lt;TD&gt;Enserco Energy Inc&lt;/TD&gt;&lt;TD&gt;20&lt;/TD&gt;&lt;TD&gt;2006&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1797&lt;/TD&gt;&lt;TD&gt;MM corp&lt;/TD&gt;&lt;TD&gt;Calpine Corp&lt;/TD&gt;&lt;TD&gt;30&lt;/TD&gt;&lt;TD&gt;2006&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1891&lt;/TD&gt;&lt;TD&gt;Hilton Inc&lt;/TD&gt;&lt;TD&gt;International Business Machi&lt;/TD&gt;&lt;TD&gt;40&lt;/TD&gt;&lt;TD&gt;2006&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1891&lt;/TD&gt;&lt;TD&gt;Hilton Inc&lt;/TD&gt;&lt;TD&gt;Xilinx Inc&lt;/TD&gt;&lt;TD&gt;50&lt;/TD&gt;&lt;TD&gt;2006&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1001&lt;/TD&gt;&lt;TD&gt;Alcoa&lt;/TD&gt;&lt;TD&gt;X Incorporated&lt;/TD&gt;&lt;TD&gt;20&lt;/TD&gt;&lt;TD&gt;1990&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1001&lt;/TD&gt;&lt;TD&gt;Alcoa&lt;/TD&gt;&lt;TD&gt;Sumblet corp&lt;/TD&gt;&lt;TD&gt;30&lt;/TD&gt;&lt;TD&gt;1990&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;The firm data: this dataset contains all U.S. publicly traded firms between 1972 and 2012, with their names, year, and accounting data. 
If the firmname in this dataset is close enough to the customername in the main dataset, I want to join the two datasets together. This one has 256,000 observations, with 24,000 unique firmnames (note: each firmname could appear in multiple years).&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;firmid&lt;/TD&gt;&lt;TD&gt;firmname&lt;/TD&gt;&lt;TD&gt;xvar&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1291&lt;/TD&gt;&lt;TD&gt;Enserco Energy Incorporated&lt;/TD&gt;&lt;TD&gt;0.1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1081&lt;/TD&gt;&lt;TD&gt;Calpine&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1123&lt;/TD&gt;&lt;TD&gt;Xilin corp&lt;/TD&gt;&lt;TD&gt;110&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My plan is:&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;Use proc sql to join: for each record in the customer data, look up all records in the firm data and use a function such as compged or spedis to keep the acceptable matches. I need this match to bring in the accounting data for all the customers in the main data. &lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My concern is: given the number of records in the datasets, proc sql does a full Cartesian join that would be several GB, and it froze my computer each time. &lt;/P&gt;&lt;P&gt;Can someone comment on:&lt;/P&gt;&lt;OL style="list-style-type: decimal;"&gt;&lt;LI&gt;Whether I should use a full Cartesian join in PROC SQL&lt;/LI&gt;&lt;LI&gt;Whether I should include a where clause (e.g., set an acceptable matching score based on compged or spedis)&lt;/LI&gt;&lt;LI&gt;Whether compged is better than spedis in my case&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you could show some sample code, that would be great!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 23 Mar 2015 21:57:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208283#M51636</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-23T21:57:53Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208284#M51637</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Take a look at the approach I used for the following presentation: &lt;A href="http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck" title="http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck"&gt;Expert Panel Solution MWSUG 2013-Tabachneck - sasCommunity&amp;nbsp; &lt;/A&gt;&lt;/P&gt;&lt;P&gt;particularly the part about adjusting company names.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The approach requires creating a file of unique company names, and then using compged (as I recall) to first clean up the company names before doing any joins.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 23 Mar 2015 22:26:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208284#M51637</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-23T22:26:24Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208285#M51638</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you, Arthur!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Yes, you used compged.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Do you have the sample data used in the presentation? I found your code starting on page 24, but I do not completely follow what you did.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 24 Mar 2015 01:35:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208285#M51638</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-24T01:35:26Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208286#M51639</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Lan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Better than sample data, if you click on the link in the paper's first paragraph (problem statement), you can download full copies of all of the datasets that were used.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 24 Mar 2015 03:50:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208286#M51639</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-24T03:50:43Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208287#M51640</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you Art!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I downloaded the paper and the zipped data.&amp;nbsp; I did the first two steps following your paper: 1. create a unique set of banknames using the customer data, 2. get the number of records in the bank info dataset.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;On page 6 of the paper, I could not quite follow the block of code due to my limited knowledge of SAS. I have pasted it here:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me call the code below block 1, so I can refer back to it later.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data fmtDataset (keep=fmtname start label type);&lt;/P&gt;&lt;P&gt;retain fmtname '$banks' type 'C';&lt;/P&gt;&lt;P&gt;array bank(&amp;amp;numrec) $57;&lt;/P&gt;&lt;P&gt;do i=1 to &amp;amp;numrec;&lt;/P&gt;&lt;P&gt;set bankinfo;&lt;/P&gt;&lt;P&gt;bank(i)=BankName;&lt;/P&gt;&lt;P&gt;end;&lt;/P&gt;&lt;P&gt;do until (eof);&lt;/P&gt;&lt;P&gt;set banks (rename=(BankName=start)) end=eof;&lt;/P&gt;&lt;P&gt;if length(start) le 4 then label=start;&lt;/P&gt;&lt;P&gt;else do; lowscore=5000;&lt;/P&gt;&lt;P&gt;do i=1 to &amp;amp;numrec;&lt;/P&gt;&lt;P&gt;score= compged(start,bank(i));&lt;/P&gt;&lt;P&gt;if score le lowscore then do;&lt;/P&gt;&lt;P&gt;lowscore=score; closest=i;&lt;/P&gt;&lt;P&gt;end;&lt;/P&gt;&lt;P&gt;end;&lt;/P&gt;&lt;P&gt;label=bank(closest);&lt;/P&gt;&lt;P&gt;end;&lt;/P&gt;&lt;P&gt;output;&lt;/P&gt;&lt;P&gt;end;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;1. I have used compged in the past, but the cutoff score I set was generally low to ensure close matches. Your code has the line:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;lowscore=5000;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I do not know whether this means you allow very distant matches, i.e., pairs of names that are not close.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2. 
My data has company names from two databases, and some are easier to match,&amp;nbsp; &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;e.g. AB INC.&amp;nbsp;&amp;nbsp; vs. &lt;SPAN style="font-size: 13.3333330154419px;"&gt;AB&amp;nbsp; &lt;SPAN style="font-size: 13.3333330154419px;"&gt;INCORPORATED. I could use &lt;/SPAN&gt;SAS code: &amp;amp;name = tranwrd(&amp;amp;name, "INCORPORATED","INC"); to account for those, &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;but I notice you use &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/*Create the necessary format*/&lt;/P&gt;&lt;P&gt;proc format cntlin=fmtDataset;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;/*recode bank names*/&lt;/P&gt;&lt;P&gt;data dcandh;&lt;/P&gt;&lt;P&gt;set dcandh (rename=(BankName=_BankName));&lt;/P&gt;&lt;P&gt;BankName=put(_BankName,$banks.);&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My question is: should I use &lt;SPAN style="font-size: 13.3333330154419px;"&gt;the block 1 code in my case? My two data sets are as follows: data one has a group of firms that are customers of other firms, and &lt;SPAN style="font-size: 13.3333330154419px;"&gt;data two has all publicly traded firms in the U.S. market. The two data sources may use different abbreviations (such as Inc. or Corp., spelled out or not) and different letter case. To account for these, I can convert both datasets to lower case and spell out the abbreviations I can think of. Beyond that, should I use your block 1 to build the format? &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sincerely,&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 01:36:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208287#M51640</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-25T01:36:50Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208288#M51641</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Lan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The 5000 was just an arbitrary starting point. The code compares every record in one file with every record in the other file and selects the closest match.&lt;/P&gt;&lt;P&gt;I created a format because one of the files had a million records in it, but only about 10000 unique bank names. Thus, rather than match all million records, I first eliminated all duplicates, ran the code against the file of unique bank names, created a format, and applied the format to the full (not de-duplicated) file.&lt;/P&gt;&lt;P&gt;You likely won't need to create a format, but could just modify block 1 so that it does the entire job.&lt;/P&gt;&lt;P&gt;As for the score, you could always check the scores at the end of the process to see if any large scores resulted. Obviously, if you do end up with any large scores, adequate matches weren't found for those records.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 04:06:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208288#M51641</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-25T04:06:21Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208289#M51642</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;Thank you Art!&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;1. I do not know how to modify block 1 code (not using format).&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;2. you wrote "&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;you could always check the scores at the end of the process to see if any large scores resulted. 
Obviously, if you do end up with any large scores, adequate matches weren't found for those records.", how do I do that in your block 1 ?&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;Best,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;Lan&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; background-color: #ffffff;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 16:13:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208289#M51642</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-25T16:13:27Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208290#M51643</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;It's been a while since I've looked at that code, thus I'm a bit rusty.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm re-running it now and, after looking at it, I've changed my mind: building and using the format would probably be the easiest approach.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To answer your question, in block one just change the line:&lt;/P&gt;&lt;P style="font-size: 13px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;data fmtDataset (keep=fmtname start label type);&lt;/P&gt;&lt;P style="font-size: 13px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&lt;/P&gt;&lt;P style="font-size: 13px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;to&lt;/P&gt;&lt;P style="font-size: 13px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&lt;/P&gt;&lt;P style="font-size: 13px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;data fmtDataset (keep=fmtname start label type lowscore);&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 16:47:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208290#M51643</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-25T16:47:37Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208291#M51644</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I just re-ran the code. It turns out that it was matching 33,514 records with 10,000 bank names. Depending upon your system's processing speed that could take between 30 minutes and almost 3 hours.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The highest compged score it produced was 860 which, in this case, was quite satisfactory.&amp;nbsp; The records with a lowscore of 860 included the following matches:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;GIANR FARM&amp;nbsp; OFFICE OF CLE=GIANT FARM SAVINGS OFFICE OF CLEVELAND&lt;/P&gt;&lt;P&gt;GIANT FARM&amp;nbsp; OFFICE OF CLW=GIANT FARM SAVINGS OFFICE OF CLEVELAND&lt;/P&gt;&lt;P&gt;GIANY FARM&amp;nbsp; OFFICE OF CLE=GIANT FARM SAVINGS OFFICE OF CLEVELAND&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The next highest score was 840 and it was assigned to:&lt;/P&gt;&lt;P&gt;SOUTHWEST MILITSRY&amp;nbsp; OF WAS=SOUTHWEST MILITARY SAVINGS OF WASHINGTON&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 19:58:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208291#M51644</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-25T19:58:26Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208292#M51645</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you Art!&lt;/P&gt;&lt;P&gt;I plan to run your code as soon as I finish another program. In the meantime, I looked at the bankname in the bankinfo and DCANDH data; the spellings are not subject to inconsistencies such as lowercase/uppercase mismatches, Limited vs. Ltd, or Corp vs. Co. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My case is slightly different: my datatwo is a larger set of firm names that could overlap with dataone, but the spelling could be inconsistent (such as lowercase/uppercase mismatches, Limited vs. Ltd, or Corp vs. Co., but not limited to these). &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I plan to apply your block 1 to my data, if&amp;nbsp;I can make it work. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you can think of other things I should clean up in the data before the proc sql match, please let me know. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 20:51:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208292#M51645</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-25T20:51:10Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208293#M51646</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Apply the Upcase function to the name variables in each file before attempting to run block 1.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 25 Mar 2015 21:38:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208293#M51646</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-25T21:38:22Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208294#M51647</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Art,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your block 1 code &lt;STRONG&gt;works very well&lt;/STRONG&gt; for my small sample testing. I am moving on to large sample and will keep you posted on my progress.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks a bunch !&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Mar 2015 17:38:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208294#M51647</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-26T17:38:42Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208295#M51648</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;One thing to keep in mind: prior to running the block 1 code, use a proc sort by company_name (or whatever the field is called) with the nodupkey option, outputting a new file (e.g., companies) to use in the block 1 code.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You only need to create the format for each variant of a company's name; the reason for creating the format is to then apply it to the full file.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Mar 2015 17:55:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208295#M51648</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-03-26T17:55:34Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208296#M51649</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P style="font-size: 13.3333330154419px;"&gt;Yes, Art. I ran proc sort nodupkey on both of my datasets.&amp;nbsp; I think your bankinfo data has only one name per bank, so you did not have to run &lt;SPAN style="font-size: 13.3333330154419px;"&gt;nodupkey for that one; you ran &lt;SPAN style="font-size: 13.3333330154419px;"&gt;proc sort &lt;/SPAN&gt;&lt;SPAN style="font-size: 13.3333330154419px;"&gt;nodupkey &lt;/SPAN&gt; on the customer theft data. Both of my datasets are panels, so my customer names can appear multiple times. Therefore I ran &lt;SPAN style="font-size: 13.3333330154419px;"&gt;proc sort nodupkey, then used the output datasets to run your block 1 code. &lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="font-size: 13.3333330154419px;"&gt;&lt;SPAN style="font-size: 13.3333330154419px;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="font-size: 13.3333330154419px;"&gt;&lt;SPAN style="font-size: 13.3333330154419px;"&gt;Lan&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 26 Mar 2015 20:33:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208296#M51649</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-03-26T20:33:22Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208297#M51650</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Art,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As my earlier post indicated, I was able to run your code (with minor edits) successfully.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have one additional question regarding the formatting. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My matched sample has over 35000 records, with scores up to 1070. About 2000 records have a score of zero, i.e., a perfect match for names across the two data sets.&amp;nbsp; However, even at very low scores, I still have pairs of names that are not good matches on visual inspection, e.g.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;CIMM INC and ICM INC is a pair, with lowscore 40&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;another example&lt;/P&gt;&lt;P&gt;ESD CO&amp;nbsp; and EDS CORP &lt;SPAN style="font-size: 13.3333330154419px;"&gt; is a pair, with lowscore 40&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My question is whether I should change the formatting or other aspects of your posted code (&lt;A class="active_link" href="http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck" title="http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck"&gt;Expert Panel Solution MWSUG 2013-Tabachneck - sasCommunity&lt;/A&gt;) to enhance the match results. As it stands, I have 33,000 records with a non-zero score, and it is time-consuming to inspect each pair visually.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For those who are interested, the code starts at the bottom of page 5. 
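&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A minimal sketch, not taken from the paper, of one possible way to triage pairs like these without inspecting every one by eye: strip common legal suffixes from each name, divide lowscore by the length of what remains, and route pairs with a high score relative to their length to a review file. It assumes fmtDataset was built by block 1 with lowscore kept. The dataset names accepted and review, the variables core and relscore, the suffix list, and the cutoff of 10 are all hypothetical choices for illustration and would need tuning.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/* sketch only: suffix list and cutoff are assumptions, not from the paper */&lt;/P&gt;&lt;P&gt;data accepted review;&lt;/P&gt;&lt;P&gt;set fmtDataset;&lt;/P&gt;&lt;P&gt;/* drop one trailing legal suffix, e.g. ESD CO becomes ESD */&lt;/P&gt;&lt;P&gt;core=prxchange('s/\s+(INCORPORATED|INC|CORPORATION|CORP|COMPANY|CO|LIMITED|LTD)\.?$//',1,strip(upcase(start)));&lt;/P&gt;&lt;P&gt;/* score per character of the remaining name */&lt;/P&gt;&lt;P&gt;relscore=lowscore/length(core);&lt;/P&gt;&lt;P&gt;/* names that block 1 left unchanged, or pairs with a low relative score, are accepted */&lt;/P&gt;&lt;P&gt;if start=label or relscore le 10 then output accepted;&lt;/P&gt;&lt;P&gt;else output review;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;With these assumed settings, a three-letter name with lowscore 40 lands in review, while a long name with the same lowscore is accepted, so the manual check concentrates on short names.&lt;/P&gt;&lt;P&gt;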
&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.sascommunity.org/mwiki/images/f/f8/Expert_Panel_Tabachneck_MWSUG_2013.pdf" title="http://www.sascommunity.org/mwiki/images/f/f8/Expert_Panel_Tabachneck_MWSUG_2013.pdf"&gt;http://www.sascommunity.org/mwiki/images/f/f8/Expert_Panel_Tabachneck_MWSUG_2013.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks !&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 13 May 2015 02:46:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208297#M51650</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-05-13T02:46:26Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208298#M51651</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Lan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For three character names, I have found the compged function to be inefficient. However, SAS provides a number of alternatives (e.g., soundex).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, there are additional steps you can take to limit probable noise before looking for matches (see, e.g.: &lt;A href="http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf" title="http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf"&gt;http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf&lt;/A&gt; ).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Finally, it would definitely help if you were matching with a set of valid company names. It sounds like you are trying to clean up both data sets simultaneously which, methinks, would only add more confusion.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Art&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 13 May 2015 13:04:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208298#M51651</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2015-05-13T13:04:42Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match using a string variable between two large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208299#M51652</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thank you, Art, for sharing the resources and your advice! I will read them carefully.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A little more detail on my data:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;main data: contains the universe of publicly traded U.S. firm names (I kept one unique record per firm, similar to your code).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;second data: contains the customer names of publicly traded companies; these customers are themselves companies.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My main data has accounting data (e.g., cash holdings, debt, assets) for each firm. My goal is to get this information for my customer data, so I must match the two data sets.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In response to your latest comment: as far as I know, both data sets contain &lt;STRONG&gt;valid&lt;/STRONG&gt; company names. However, the databases may not be built to users' satisfaction, and name abbreviations etc. may not be used consistently across the two data sets.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks again!&lt;/P&gt;&lt;P&gt;Lan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 13 May 2015 14:53:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Fuzzy-match-using-a-string-variable-between-two-large-datasets/m-p/208299#M51652</guid>
      <dc:creator>LanMin</dc:creator>
      <dc:date>2015-05-13T14:53:43Z</dc:date>
    </item>
  </channel>
</rss>

