<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to drop words with no real meaning in a string? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896341#M354166</link>
    <description>&lt;P&gt;Presumably you are trying to match Compustat company data with data from another vendor of company-related data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before you work directly on fuzzy matching of names, do you have other information that might help, like phone numbers, zip codes, addresses, tickers?&amp;nbsp; If so, you might be able to reduce the population of unmatched data for submission to the fuzzy matching process.&amp;nbsp; Those other variables might also be helpful in supporting or invalidating fuzzy matches.&lt;/P&gt;</description>
    <pubDate>Thu, 28 Sep 2023 20:51:04 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2023-09-28T20:51:04Z</dc:date>
    <item>
      <title>How to drop words with no real meaning in a string?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896256#M354132</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm doing fuzzy matching of firm names. I have a table called &lt;STRONG&gt;compustat&lt;/STRONG&gt; with a column called &lt;STRONG&gt;conm&lt;/STRONG&gt; that contains company names. I also have a table called &lt;STRONG&gt;firm_ref&lt;/STRONG&gt; with a column called &lt;STRONG&gt;company&lt;/STRONG&gt; that also contains company names. After converting the names to a single case, dropping common firm suffixes (e.g. co, corp, etc.), and dropping special characters ($&amp;amp;,/+-, etc.), I found that the same firm can still have different names in the two columns because of words that carry no real meaning.&amp;nbsp;&lt;/P&gt;&lt;P&gt;For instance, we could have a firm name = "TIFFANY LUXURY RETAIL" in &lt;STRONG&gt;compustat&lt;/STRONG&gt; and a firm name = "TIFFANY LUXURY AND RETAIL" in &lt;STRONG&gt;firm_ref&lt;/STRONG&gt;. I want to get rid of words such as "and" that are not related to the identity of the firm. How should I do it?&lt;/P&gt;&lt;P&gt;I know how to remove a word from a string if the word belongs to a list (e.g. a list of such meaningless words), but my problem is that I do not have such a list. Is there a package, like those in Python, that provides a list of such words?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2023 14:17:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896256#M354132</guid>
      <dc:creator>Eileen1496</dc:creator>
      <dc:date>2023-09-28T14:17:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to drop words with no real meaning in a string?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896341#M354166</link>
      <description>&lt;P&gt;Presumably you are trying to match Compustat company data with data from another vendor of company-related data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before you work directly on fuzzy matching of names, do you have other information that might help, like phone numbers, zip codes, addresses, tickers?&amp;nbsp; If so, you might be able to reduce the population of unmatched data for submission to the fuzzy matching process.&amp;nbsp; Those other variables might also be helpful in supporting or invalidating fuzzy matches.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2023 20:51:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896341#M354166</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2023-09-28T20:51:04Z</dc:date>
    </item>
    <item>
      <title>Re: How to drop words with no real meaning in a string?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896344#M354167</link>
      <description>Yes, I do have some firm identifiers, but they are missing for many records in the external dataset. I have already run several rounds of matching on identifiers such as GVKEY to reduce the dataset left for fuzzy name matching. I also have NAICS codes, but I think of them as complementary criteria for verifying match accuracy after name matching.&lt;BR /&gt;&lt;BR /&gt;For name matching, I remove common firm suffixes (e.g. corp, ltd, etc.), special characters, and parentheses (so TIFFANY(UK) becomes TIFFANY), so that I can do some exact name matching. For the rest, I could match further if I could get rid of words like "and", "the", etc.&lt;BR /&gt;</description>
      <pubDate>Thu, 28 Sep 2023 21:35:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896344#M354167</guid>
      <dc:creator>Eileen1496</dc:creator>
      <dc:date>2023-09-28T21:35:54Z</dc:date>
    </item>
    <item>
      <title>Re: How to drop words with no real meaning in a string?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896360#M354169</link>
      <description>&lt;P&gt;I am unaware of a ready-made list of words that you might use for removal for fuzzy matching of company names.&amp;nbsp; Clearly your examples "and" and "the" should be included.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You might be able to leverage your data to build your own list.&amp;nbsp; You could do a frequency table of all words found in CONM, by word length, and a similar table for COMPANY.&amp;nbsp; That might suggest some words that wouldn't otherwise seem likely.&amp;nbsp; I imagine most words on the removal list would be short.&lt;/P&gt;
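&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A minimal sketch of that frequency-table idea, written in Python only because the thread mentions it; the same counts could be produced in SAS. The function name word_frequency_by_length and the sample names are hypothetical stand-ins for CONM and COMPANY, and the names are assumed to be already cleaned of suffixes and special characters.&lt;/P&gt;

```python
from collections import Counter


def word_frequency_by_length(names):
    """Return (length, word, count) rows: shortest words first, then most frequent."""
    counts = Counter(word for name in names for word in name.upper().split())
    rows = [(len(word), word, n) for word, n in counts.items()]
    return sorted(rows, key=lambda row: (row[0], -row[2], row[1]))


# Hypothetical sample values standing in for compustat.conm and firm_ref.company.
conm = ["TIFFANY LUXURY RETAIL", "BANK OF NEW YORK", "HOME DEPOT"]
company = ["TIFFANY LUXURY AND RETAIL", "THE BANK OF NEW YORK", "THE HOME DEPOT"]

conm_table = word_frequency_by_length(conm)
company_table = word_frequency_by_length(company)

# Words that occur in one column but never in the other are candidates to review.
conm_words = {word for _, word, _ in conm_table}
company_words = {word for _, word, _ in company_table}
only_in_company = sorted(company_words - conm_words)

for length, word, n in company_table:
    print(length, word, n)
print("Only in COMPANY:", only_in_company)
```

&lt;P&gt;With these sample names, only_in_company comes out as ['AND', 'THE']. On real data the candidates would still need manual review, since a short, frequent word can be part of a firm's identity.&lt;/P&gt;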
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As a general rule, I presume that false positives (i.e. erroneous fuzzy matches) would be a far more significant problem than false negatives.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2023 02:52:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896360#M354169</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2023-09-29T02:52:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to drop words with no real meaning in a string?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896361#M354170</link>
      <description>Getting a frequency table and then deciding on my own list is good advice! Indeed, when I tried matching with TF-IDF in Python before, even two firms with different names got a very high score as long as they had one word in common. I need to figure this out later.&lt;BR /&gt;</description>
      <pubDate>Fri, 29 Sep 2023 02:49:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-drop-words-with-no-real-meaning-in-a-string/m-p/896361#M354170</guid>
      <dc:creator>Eileen1496</dc:creator>
      <dc:date>2023-09-29T02:49:54Z</dc:date>
    </item>
  </channel>
</rss>

