<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Fuzzy grouping in SAS Enterprise Guide</title>
    <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20268#M3387</link>
    <description>Hello,&lt;BR /&gt;
&lt;BR /&gt;
Can SAS perform fuzzy grouping?&lt;BR /&gt;
i.e. I would like to find, within a data set (say 8000 records and 200 columns), whether there are pairs of records that are likely to be duplicates or near-duplicates.&lt;BR /&gt;
&lt;BR /&gt;
e.g.&lt;BR /&gt;
id	f1	f2	f3	f4	f5&lt;BR /&gt;
1	1	2	3	4	5&lt;BR /&gt;
2	2	3	4	5	6&lt;BR /&gt;
3	3	4	5	6	7&lt;BR /&gt;
4	3	4	7	8	9&lt;BR /&gt;
5	1	3	3	3	3&lt;BR /&gt;
6	1	2	3	4	5&lt;BR /&gt;
Aim: (a) find the pairs of records that are exactly the same&lt;BR /&gt;
       (b) find the pairs of records that differ in 3 columns or fewer&lt;BR /&gt;
Result:&lt;BR /&gt;
(a)  pair (1 - 6), identical in fields F1 to F5&lt;BR /&gt;
(b)  pair (3 - 4), differing in fields F3, F4, F5&lt;BR /&gt;
      pair (1 - 5), differing in fields F2, F4, F5&lt;BR /&gt;
      ......&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Thanks for your help</description>
    <pubDate>Thu, 21 Oct 2010 04:02:14 GMT</pubDate>
    <dc:creator>achan</dc:creator>
    <dc:date>2010-10-21T04:02:14Z</dc:date>
    <item>
      <title>Fuzzy grouping</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20268#M3387</link>
      <description>Hello,&lt;BR /&gt;
&lt;BR /&gt;
Can SAS perform fuzzy grouping?&lt;BR /&gt;
i.e. I would like to find, within a data set (say 8000 records and 200 columns), whether there are pairs of records that are likely to be duplicates or near-duplicates.&lt;BR /&gt;
&lt;BR /&gt;
e.g.&lt;BR /&gt;
id	f1	f2	f3	f4	f5&lt;BR /&gt;
1	1	2	3	4	5&lt;BR /&gt;
2	2	3	4	5	6&lt;BR /&gt;
3	3	4	5	6	7&lt;BR /&gt;
4	3	4	7	8	9&lt;BR /&gt;
5	1	3	3	3	3&lt;BR /&gt;
6	1	2	3	4	5&lt;BR /&gt;
Aim: (a) find the pairs of records that are exactly the same&lt;BR /&gt;
       (b) find the pairs of records that differ in 3 columns or fewer&lt;BR /&gt;
Result:&lt;BR /&gt;
(a)  pair (1 - 6), identical in fields F1 to F5&lt;BR /&gt;
(b)  pair (3 - 4), differing in fields F3, F4, F5&lt;BR /&gt;
      pair (1 - 5), differing in fields F2, F4, F5&lt;BR /&gt;
      ......&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Thanks for your help</description>
      <pubDate>Thu, 21 Oct 2010 04:02:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20268#M3387</guid>
      <dc:creator>achan</dc:creator>
      <dc:date>2010-10-21T04:02:14Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy grouping</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20269#M3388</link>
      <description>What you are describing falls broadly under the ETL (Extract, Transform, and Load) family of activities.  DI may have some tools for that.  You can accomplish your first goal in Base SAS using PROC SORT with the NODUPRECS option (alias NODUP), or a sort followed by a DATA step if you want separate output files.&lt;BR /&gt;
&lt;BR /&gt;
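For the first goal, a minimal sketch of the sort-then-DATA-step route, assuming the sample rows from the question are loaded into a data set named inputdata (the names sorted, unique and dups are illustrative, not from the thread):&lt;BR /&gt;

```sas
/* Sample rows from the question; data set and variable names are assumptions. */
data inputdata;
  input id f1-f5;
  datalines;
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 3 4 7 8 9
5 1 3 3 3 3
6 1 2 3 4 5
;
run;

/* Sort on the compared fields (not on id) so identical rows become adjacent. */
proc sort data=inputdata out=sorted;
  by f1-f5;
run;

/* A row is unique when it is both first and last in its BY group;
   any other row belongs to a group of exact duplicates. */
data unique dups;
  set sorted;
  by f1-f5;
  if first.f5 and last.f5 then output unique;
  else output dups;
run;
```

On the sample rows, ids 1 and 6 should end up in dups and the other four in unique. With 200 columns the BY list would become f1-f200 and the test would use first.f200 and last.f200.&lt;BR /&gt;
&lt;BR /&gt;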
The second is tougher.  I see lots of data sets in which records agree on 3 variables but belong to different subjects.  A statistical approach might be to compute the Mahalanobis distance between each pair of records and look at the pairs whose distance is below some threshold; it is a computationally intensive approach.  See&lt;BR /&gt;
&lt;A href="http://support.sas.com/kb/30/662.html" target="_blank"&gt;http://support.sas.com/kb/30/662.html&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
Doc Muhlbaier&lt;BR /&gt;
Duke</description>
      <pubDate>Thu, 21 Oct 2010 15:27:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20269#M3388</guid>
      <dc:creator>Doc_Duke</dc:creator>
      <dc:date>2010-10-21T15:27:38Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy grouping</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20270#M3389</link>
      <description>I think this is doable in a DATA step by reading the input data twice using two SET statements, the second in a loop using the POINT= option, like so:&lt;BR /&gt;
&lt;BR /&gt;
data example;&lt;BR /&gt;
  set inputdata;&lt;BR /&gt;
  do record = 1 to obsnum;&lt;BR /&gt;
    set  inputdata point = record nobs = obsnum;&lt;BR /&gt;
    ......checking statements......&lt;BR /&gt;
  end;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
By using this strategy you can compare every row with every other row in your data and test for exact or partial matches.&lt;BR /&gt;
&lt;BR /&gt;
The trick for the checking logic would be to create two arrays so you can easily compare all 200 columns in a loop:&lt;BR /&gt;
&lt;BR /&gt;
array vars (*) F1 - F200;    * From first SET statement;&lt;BR /&gt;
array vars2 (*) G1 - G200; * From second SET statement; &lt;BR /&gt;
do i = 1 to dim(vars);&lt;BR /&gt;
  if vars (i) = vars2(i) then exact_match_count + 1;&lt;BR /&gt;
end;&lt;BR /&gt;
&lt;BR /&gt;
At the end of this logic you will know how many values match exactly, out of the total, for the pair of rows you are checking, so you can write a row to the result data set containing the IDs of the row pair and the matching statistics.
</description>
      <pubDate>Thu, 21 Oct 2010 23:22:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20270#M3389</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2010-10-21T23:22:55Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy grouping</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20271#M3390</link>
      <description>Thanks for your advice.&lt;BR /&gt;
&lt;BR /&gt;
I have tried the program, like this:&lt;BR /&gt;
&lt;BR /&gt;
data example;&lt;BR /&gt;
set inputdata;&lt;BR /&gt;
array vars (*) F1 - F200; * From first SET statement;&lt;BR /&gt;
do record = 1 to obsnum;&lt;BR /&gt;
set inputdata point = record nobs = obsnum;&lt;BR /&gt;
array vars2 (*) G1 - G200; * From second SET statement; &lt;BR /&gt;
do i = 1 to dim(vars);&lt;BR /&gt;
if vars (i) = vars2(i) then exact_match_count + 1;&lt;BR /&gt;
end;&lt;BR /&gt;
end;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
but I found a problem in the statement "if vars (i) = vars2(i) then exact_match_count + 1;". It seems that the array reference vars(i) does not work.&lt;BR /&gt;
&lt;BR /&gt;
Please advise. Thanks.</description>
      <pubDate>Wed, 27 Oct 2010 09:17:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20271#M3390</guid>
      <dc:creator>achan</dc:creator>
      <dc:date>2010-10-27T09:17:05Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy grouping</title>
      <link>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20272#M3391</link>
      <description>Sorry, I left out the renaming of the variables on the second SET statement:&lt;BR /&gt;
&lt;BR /&gt;
set inputdata (rename = (F1 - F200 = G1 - G200)) point = record nobs = obsnum;&lt;BR /&gt;
&lt;BR /&gt;
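Putting the pieces together, a minimal sketch of the whole step on the sample rows from the question might look like this. It assumes 5 columns instead of 200; the names pairs, id2 and diff_count are illustrative, and starting the inner loop at _n_ + 1 (so each pair is checked once and no row is compared with itself) is a variation on the earlier loop, not something posted in the thread:&lt;BR /&gt;

```sas
/* Sample rows from the question; data set and variable names are assumptions. */
data inputdata;
  input id f1-f5;
  datalines;
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 3 4 7 8 9
5 1 3 3 3 3
6 1 2 3 4 5
;
run;

data pairs (keep = id id2 diff_count);
  /* Outer read: one row per iteration of the DATA step. */
  set inputdata;
  array vars  (*) f1-f5;   /* values from the first SET statement  */
  array vars2 (*) g1-g5;   /* values from the second SET statement */

  /* Inner read: every later row, fetched directly by observation number. */
  do record = _n_ + 1 to obsnum;
    set inputdata (rename = (id = id2 f1-f5 = g1-g5))
        point = record nobs = obsnum;

    /* Count the columns in which the two rows differ. */
    diff_count = 0;
    do i = 1 to dim(vars);
      if vars(i) ne vars2(i) then diff_count = diff_count + 1;
    end;

    /* Keep exact matches (0) and near matches (3 or fewer differences). */
    if diff_count le 3 then output;
  end;
run;
```

With a threshold of 3 this keeps, among others, pair (1, 6) with diff_count = 0 and pair (3, 4) with diff_count = 3. For 8000 rows the step makes 8000 * 7999 / 2 = 31,996,000 row comparisons, so run time grows quickly with the number of rows. The key change from the earlier attempt is the RENAME= data set option on the second SET statement.&lt;BR /&gt;
&lt;BR /&gt;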
The rename is needed so that the second SET statement does not overwrite the values read by the first, and so that the VARS2 array is populated correctly and the exact matches can be counted.</description>
      <pubDate>Fri, 29 Oct 2010 00:21:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Enterprise-Guide/Fuzzy-grouping/m-p/20272#M3391</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2010-10-29T00:21:44Z</dc:date>
    </item>
  </channel>
</rss>

