<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: help needs on removing duplicate observation fuzzly in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115533#M23798</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi PG, thanks. I learned some basics of graph theory and mostly understand your macro. My only concern is that this macro treats every pair equally; I am thinking about the possibility of using the edit distance as a weight between two nodes. I guess this would give me more precise subgraphs.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Sat, 08 Sep 2012 02:12:06 GMT</pubDate>
    <dc:creator>tediest</dc:creator>
    <dc:date>2012-09-08T02:12:06Z</dc:date>
    <item>
      <title>help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115529#M23794</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi guys. I have a variable recording the titles of publications, and I need to remove duplicate observations from this variable and create a list of unique titles. Due to misspellings and improper citation, many duplicate titles are not exactly the same. How do I delete duplicates in this case?&lt;/P&gt;&lt;P&gt;I figured out a solution, but it was soon proven to be problematic. This is what I did. First I created a Cartesian product of this variable with itself. Next, I calculated the edit distance between each pair and determined which pairs should be considered the same. Finally, I kept only the matched pairs, so this table tells me which observations are the same. However, there is a serious problem in this procedure, illustrated by the simple example below. Suppose I have 5 observations in my data set. Going through the matching steps described above, I get the table below:&lt;/P&gt;&lt;TABLE&gt;&lt;TR&gt;&lt;TH&gt;obs1&lt;/TH&gt;&lt;TH&gt;obs2&lt;/TH&gt;&lt;TH&gt;match_result&lt;/TH&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;exact match&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;TD&gt;fuzzy match&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;5&lt;/TD&gt;&lt;TD&gt;exact match&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;5&lt;/TD&gt;&lt;TD&gt;exact match&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;TD&gt;4&lt;/TD&gt;&lt;TD&gt;fuzzy match&lt;/TD&gt;&lt;/TR&gt;&lt;/TABLE&gt;&lt;P&gt;In this example, the comparison between 2 and 5 is unnecessary, since 1=2 and 1=5 are already established. Worse, 3 is recognized as a fuzzy duplicate of 1, and later 4 becomes a fuzzy duplicate of 3. However, the edit distance between 1 and 4 is quite large, and they are not matched by any means. In this example, I do not know how to categorize the duplicate observations. Sure enough, 1, 2 and 5 are duplicates. But what about 3 and 4? Therefore, I set another restriction on this procedure: once an observation is matched, whether the match is fuzzy or exact, that observation should not be used to match other observations. In the example above, after the first loop, 2, 3 and 5 should be excluded from further matching. I feel a series of macro variables would be needed for this procedure, and I do not know how to implement it. Also, do you think the last restriction I set is reasonable? Thanks a lot!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 07 Sep 2012 21:18:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115529#M23794</guid>
      <dc:creator>tediest</dc:creator>
      <dc:date>2012-09-07T21:18:49Z</dc:date>
    </item>
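The chaining problem the poster describes can be reproduced in a few lines: fuzzy matching by edit distance is not transitive, so a chain of near matches can connect titles that are themselves far apart. A minimal sketch in Python (the Levenshtein implementation and the toy titles are invented for illustration, not the poster's data):

```python
# Illustration of the chaining problem from the post: each adjacent pair
# of toy titles is within edit distance 1 of the next, yet the first and
# last titles are 3 edits apart. Titles are invented for this sketch.

def edit_distance(a, b):
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

titles = ["graph theory", "graph theorz", "graph zheorz", "zraph zheorz"]
threshold = 1

# Keep every pair (i, j), i before j, within the threshold
pairs = []
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        if edit_distance(titles[i], titles[j]) > threshold:
            continue
        pairs.append((i, j))

# Only adjacent titles match: pairs is [(0, 1), (1, 2), (2, 3)].
# Chaining those matches would merge all four titles, even though
# edit_distance(titles[0], titles[3]) is 3, well over the threshold.
```

This is exactly the 1-3, 3-4 situation in the poster's table: whether to merge such a chain into one group is a policy decision, not something the pairwise distances alone can settle.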
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115530#M23795</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Can you post a small sample of your data and the code you used?&amp;nbsp; That would make it a lot easier to evaluate.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 07 Sep 2012 21:42:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115530#M23795</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2012-09-07T21:42:13Z</dc:date>
    </item>
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115531#M23796</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;You could try the following procedure:&lt;/P&gt;&lt;P&gt;1) create a dataset PAIRS of all pairs (obs1, obs2) where obs1 &amp;lt; obs2 and the edit distance is less than a given value (an exact or fuzzy match)&lt;/P&gt;&lt;P&gt;2) use the macro %SubGraphs(PAIRS,from=obs1,to=obs2,out=CLUSTERS); the macro is given here: &lt;A href="https://communities.sas.com/"&gt;https://communities.sas.com/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;3) look at the CLUSTERS dataset; the variable clust identifies each group of similar titles.&lt;/P&gt;&lt;P&gt;Good luck.&lt;/P&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 08 Sep 2012 00:43:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115531#M23796</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2012-09-08T00:43:26Z</dc:date>
    </item>
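The clustering step that %SubGraphs is described as performing, grouping the nodes of the PAIRS graph into connected components, can be sketched with union-find. This is a functional stand-in, not the macro's actual implementation, which is not shown in the thread:

```python
# Functional stand-in for the clustering that %SubGraphs is described to
# perform: given match pairs (graph edges), label each node with the
# connected component it belongs to. Union-find with path halving.

def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    roots = {}  # map each component root to a small cluster id
    return {node: roots.setdefault(find(node), len(roots) + 1)
            for node in parent}

# Pairs from the original example: 1-2, 1-5, 2-5 exact; 1-3, 3-4 fuzzy
clusters = cluster([(1, 2), (1, 3), (1, 5), (2, 5), (3, 4)])
# All five observations fall into a single cluster, which is exactly
# the chaining behaviour the original poster was worried about.
```

Note that connected components treat every edge equally; weighting edges by edit distance, as the poster later suggests, would require a different criterion (e.g. cutting edges above a stricter threshold before clustering, which is effectively what worked in the end).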
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115532#M23797</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Here is the code I wrote for this problem; the sample data is attached. Thanks!&lt;/P&gt;&lt;PRE&gt;%let maxscore=1000;

data title_deldup;
  set title(rename=(title=title1 order=order1)) nobs=nobs1;
  if _n_ = 1 then do;
    call compcost('fdelete=',200, 'finsert=',200, 'freplace=',100,
                  'delete=',100, 'insert=',100, 'replace=',100,
                  'append=',200, 'truncate=',200,
                  'double=',20, 'single=',20, 'swap=',20,
                  'blank=',10, 'punctuation=',10,
                  'match=',0);
  end;

  do i = 1 to nobs1;
    if _n_ &amp;lt; i then do;
      set title(rename=(title=title2 order=order2)) point=i;
      gedscore=compged(title1,title2,'iL');
      if gedscore &amp;lt; &amp;amp;maxscore then output;
    end;
  end;
run;&lt;/PRE&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 08 Sep 2012 01:52:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115532#M23797</guid>
      <dc:creator>tediest</dc:creator>
      <dc:date>2012-09-08T01:52:38Z</dc:date>
    </item>
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115533#M23798</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi PG, thanks. I learned some basics of graph theory and mostly understand your macro. My only concern is that this macro treats every pair equally; I am thinking about the possibility of using the edit distance as a weight between two nodes. I guess this would give me more precise subgraphs.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 08 Sep 2012 02:12:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115533#M23798</guid>
      <dc:creator>tediest</dc:creator>
      <dc:date>2012-09-08T02:12:06Z</dc:date>
    </item>
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115534#M23799</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If your PAIRS dataset contains the variables obs1, obs2, and dist (the edit distance), you could merge the CLUSTERS dataset with PAIRS to get back the title pairs grouped by cluster, as follows:&lt;/P&gt;&lt;PRE&gt;proc sql;
create table DISTANCES as
select C1.clust, P.obs1, P.obs2, P.dist
from PAIRS as P inner join
     CLUSTERS as C1 on C1.node=P.obs1 inner join
     CLUSTERS as C2 on C2.node=P.obs2 and C1.clust=C2.clust
order by C1.clust;
quit;&lt;/PRE&gt;&lt;P&gt;PG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 08 Sep 2012 02:40:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115534#M23799</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2012-09-08T02:40:11Z</dc:date>
    </item>
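The PROC SQL step above translates almost directly to other SQL dialects. A sketch using Python's sqlite3 module, with invented toy rows for PAIRS and CLUSTERS (the column names follow the post; the data values are made up):

```python
import sqlite3

# Toy re-creation of the PROC SQL join: attach the cluster id to each
# matched pair, keeping only pairs whose two nodes fall in the same
# cluster. Table contents are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pairs (obs1 INT, obs2 INT, dist INT);
    CREATE TABLE clusters (node INT, clust INT);
    INSERT INTO pairs VALUES (1, 2, 0), (1, 3, 40), (3, 4, 35), (6, 7, 10);
    INSERT INTO clusters VALUES (1, 1), (2, 1), (3, 1), (4, 1), (6, 2), (7, 2);
""")
distances = con.execute("""
    SELECT C1.clust, P.obs1, P.obs2, P.dist
    FROM pairs AS P
    INNER JOIN clusters AS C1 ON C1.node = P.obs1
    INNER JOIN clusters AS C2 ON C2.node = P.obs2 AND C1.clust = C2.clust
    ORDER BY C1.clust, P.obs1, P.obs2
""").fetchall()
# Each row is (clust, obs1, obs2, dist), one per matched pair,
# grouped by cluster.
```

The double join on CLUSTERS is the key design point: joining once per side of the pair, with the C1.clust=C2.clust condition, guarantees both observations carry the same cluster label.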
    <item>
      <title>Re: help needs on removing duplicate observation fuzzly</title>
      <link>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115535#M23800</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi PG, thanks!&lt;/P&gt;&lt;P&gt;Your subgraph macro did its job very well on my data. The only trick was that I needed to set a smaller threshold for the fuzzy match; when I did so, your macro gave a pretty clear-cut division. Thanks!&lt;/P&gt;&lt;P&gt;Ted&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 10 Sep 2012 01:27:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/help-needs-on-removing-duplicate-observation-fuzzly/m-p/115535#M23800</guid>
      <dc:creator>tediest</dc:creator>
      <dc:date>2012-09-10T01:27:53Z</dc:date>
    </item>
  </channel>
</rss>

