<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Looking for records which are &amp;quot;close&amp;quot; but won't be detected with NODUPKEY or NODUPRECS in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980729#M378981</link>
    <description>&lt;P&gt;It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to&amp;nbsp;&lt;STRONG&gt;be&lt;/STRONG&gt; the same, but due to the quality of the data, they are different.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's a &lt;A href="https://www.lexjansen.com/pharmasug/2020/AP/PharmaSUG-2020-AP-073.pdf" target="_self"&gt;thorough paper by Stephen Sloan that describes an approach&lt;/A&gt;&amp;nbsp; and includes code that you can use as a process. For a more narrow tactic using "spelling distance", &lt;A href="https://blogs.sas.com/content/sgf/2021/09/21/fuzzy-matching/" target="_self"&gt;see this article by Ron Cody&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-Verification-Made-Easy-with-Loqate-in-SAS-Studio/ta-p/974065" target="_self"&gt;SAS offers data quality tools&lt;/A&gt; to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's &lt;A href="https://support.sas.com/resources/papers/proceedings18/2487-2018.pdf" target="_self"&gt;another paper specifically about addresses&lt;/A&gt; and some base SAS approaches.&lt;/P&gt;</description>
    <pubDate>Fri, 12 Dec 2025 20:29:34 GMT</pubDate>
    <dc:creator>ChrisHemedinger</dc:creator>
    <dc:date>2025-12-12T20:29:34Z</dc:date>
    <item>
      <title>Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980724#M378979</link>
      <description>&lt;P&gt;How can I generate a numeric score of similarity between records in a dataset that would help me find records that might be duplicates but have typos from one entry to the next. I also would not know in which fields this would occur. The dataset would contain both numeric and character variables.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Dec 2025 19:46:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980724#M378979</guid>
      <dc:creator>_Hopper</dc:creator>
      <dc:date>2025-12-12T19:46:42Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980729#M378981</link>
      <description>&lt;P&gt;It sounds like you are seeking a method of "fuzzy matching" -- identifying values that are close to the same, maybe intended to&amp;nbsp;&lt;STRONG&gt;be&lt;/STRONG&gt; the same, but due to the quality of the data, they are different.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's a &lt;A href="https://www.lexjansen.com/pharmasug/2020/AP/PharmaSUG-2020-AP-073.pdf" target="_self"&gt;thorough paper by Stephen Sloan that describes an approach&lt;/A&gt;&amp;nbsp; and includes code that you can use as a process. For a more narrow tactic using "spelling distance", &lt;A href="https://blogs.sas.com/content/sgf/2021/09/21/fuzzy-matching/" target="_self"&gt;see this article by Ron Cody&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-Verification-Made-Easy-with-Loqate-in-SAS-Studio/ta-p/974065" target="_self"&gt;SAS offers data quality tools&lt;/A&gt; to help with this problem for certain types of data, such as names and addresses. These apply a standardization approach that would recognize values like "Bob Smith" and "Robert Smith" as being probably the same entity. Here's &lt;A href="https://support.sas.com/resources/papers/proceedings18/2487-2018.pdf" target="_self"&gt;another paper specifically about addresses&lt;/A&gt; and some base SAS approaches.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Dec 2025 20:29:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980729#M378981</guid>
      <dc:creator>ChrisHemedinger</dc:creator>
      <dc:date>2025-12-12T20:29:34Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980736#M378984</link>
      <description>&lt;P&gt;The first thing is combining all the variables into one variables .&lt;/P&gt;
&lt;P&gt;a.k.a make a column for a row.&lt;/P&gt;
&lt;P&gt;Here could give you a start.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
 set sashelp.heart(obs=100);
 length row $ 200;
 obs+1;
 row=catx('|',of _all_);
 keep obs row;
 run;






 
proc fcmp outlib=work.math.func;
function fuzz_match(a $,b $)  ;
length new_a new_b str_short str_long  $ 200;  /*only support the max length of string is 200*/
new_a=kcompress(a,'|','kad');
new_b=kcompress(b,'|','kad');
if klength(a)&amp;lt;klength(b) then do;
 str_short=new_a;str_long=new_b;
end;
else do;
  str_short=new_b;str_long=new_a;
end;

array count{10} _temporary_  (0 0 0  0  0  0  0   0    0    0);  /*The count of one string ,two string , three string....*/
array weight{10} _temporary_ (2 4 8 16 32 64 128 256 512 1024);  /*The weight of one string ,two string , three string....*/
max_str_len=min(10,klength(a),klength(b));  /*Search the max length of str is 10*/
do len=1 to max_str_len;      /*Search  string which's length is 1, 2, 3.......*/
 n=0;
 do start=1 to klength(str_short)+1-len;  /*Search sub-string in  long string*/
  if kfind(str_long,strip(ksubstr(str_short,start,len))) then n+1;
 end;
 count{len}=n; 
end;


sum=0;w=0;mean=0;
do k=1 to max_str_len;
  if count{k} ne 0 then do;sum+count{k}*weight{k}; w+weight{k};end;
end;
/*Calculated weight mean
if w=0 then mean=0;
 else mean=sum/w;
*/
if a=b then sum=constant('exactint');
return (sum); /*return weight sum or weight mean(mean)*/
endsub;
run;



options cmplib=work.math;
proc sql;
create table want as
select a.obs,a.row as row1,b.obs as obs2,b.row as row2,fuzz_match(a.row,b.row) as weight
 from  have as a,have as b
  where a.obs ne b.obs
  order by a.obs,weight desc;
quit;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 13 Dec 2025 09:23:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980736#M378984</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-12-13T09:23:03Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980796#M378993</link>
      <description>&lt;P&gt;Thanks this helps. For longer strings the array size must increase correct? Also the string length can and is quite long, what other adjustments are required to account for that?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Lastly, is a lower score a better match or a higher one?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2025 14:56:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980796#M378993</guid>
      <dc:creator>_Hopper</dc:creator>
      <dc:date>2025-12-15T14:56:42Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980815#M378995</link>
      <description>&lt;P&gt;If you have SAS Data Quality, &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_069/dqclref/n1qx8v350g506kn1j1xlnm0v303h.htm" target="_self"&gt;PROC DQMATCH&lt;/A&gt; will do this for you. You can specify which variable(s) and match definition to use, and &lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_069/dqclref/p1ido24wrdf77fn16chwolsqaojx.htm#n0gq0lusd4szk7n1486tglw4l982" target="_self"&gt;optionally apply multiple conditions&lt;/A&gt;. This generates a match code for a condition. If you do not apply a condition statement then it is a single composite condition that generates one match code. If the match codes between rows are identical, there is a degree of confidence that those entities are the same.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is an example using both names and addresses together.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE border="1" width="100.79973651453255%"&gt;
&lt;TBODY&gt;
&lt;TR style="height: 14.4pt;"&gt;
&lt;TD width="49.71297359357061%" style="width: 48pt;"&gt;&lt;STRONG&gt;Name&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="50.172215843857636%" style="width: 48pt;"&gt;&lt;STRONG&gt;Address&lt;/STRONG&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="height: 14.4pt;"&gt;
&lt;TD width="49.71297359357061%"&gt;Bill Bond&lt;/TD&gt;
&lt;TD width="50.172215843857636%"&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;202 E Cedar St Cary NC 27511-3440&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="height: 14.4pt;"&gt;
&lt;TD width="49.71297359357061%"&gt;William L. Bond&lt;/TD&gt;
&lt;TD width="50.172215843857636%"&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;202 East Cedar Street, Cary, North Carolina, 27511 &lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="height: 14.4pt;"&gt;
&lt;TD width="49.71297359357061%"&gt;Prais Hilton&lt;/TD&gt;
&lt;TD width="50.172215843857636%"&gt;100 SAS Campus Dr, Cary, NC 27513&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="height: 14.4pt;"&gt;
&lt;TD width="49.71297359357061%"&gt;Paris Hilton&lt;/TD&gt;
&lt;TD width="50.172215843857636%"&gt;100 SAS campus Drive Cary NC, 27513&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In our data, we know that rows 1 and 2 are the same person, and 3 and 4 are the same person. Row 3 has a mispelled name, while Row 4 has the correct name. The addresses also differ in how they've been input, and one includes a ZIP+4. We need to programmatically determine who is who.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We can use the&amp;nbsp;&lt;A href="https://go.documentation.sas.com/doc/en/qkb/latest/p0ko3df76as2pln19k2x6gwcnjhz.htm#p0bx9tzu9r43tan1n51ciic6rzmt" target="_self"&gt;Name (with Suggestions)&lt;/A&gt; match definition to find similar names with mispellings, and the &lt;A href="https://go.documentation.sas.com/doc/en/qkb/latest/p0ko3df76as2pln19k2x6gwcnjhz.htm#n076vgbrzebbrtn13drcluqb2e9w" target="_self"&gt;Address&lt;/A&gt; match definition for the addresses. If the combination of those two generate the same match code, we will assign them the same group.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You may need to adjust the sensitivity (between 50-100) to get the desired results. Lower sensitivity increases fuzziness, while higher sensitivity decreases fuzziness. You can mix sensitivity on a per-variable basis.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data customers;
    length Name Address $50.;
    infile datalines dlm='|';
    input Name$ Address$;
    datalines;
Bill Bond|202 E Cedar St Cary NC 27511-3440
William L. Bond|202 East Cedar Street, Cary, North Carolina, 27511 
Prais Hilton|100 SAS Campus Dr, Cary, NC 27513
Paris Hilton|100 SAS campus Drive Cary NC, 27513
;
run;

/* Load the USA English locale */
%dqload(dqlocale=(enusa));

proc dqmatch 
    data = customers 
    out  = customers2 
    matchcode = match_cd   /* Output variable name of match code */
    cluster   = group_id;  /* Output variable name of a cluster of same match codes */

    criteria var=Name    matchdef='Name (with Suggestions)' sensitivity=50; /* Name must match */
    criteria var=Address matchdef='Address' sensitivity=60;                 /* and Address must match */
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;GROUP_ID	Name	          Address	                                        MATCH_CD
1	        Bill Bond	  202 E Cedar St Cary NC 27511-3440	                M&amp;amp;P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&amp;amp;8&amp;amp;Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
1	        William L. Bond	  202 East Cedar Street, Cary, North Carolina, 27511    M&amp;amp;P$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$H00$$$$$$$$$$$4&amp;amp;8&amp;amp;Y$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Prais Hilton	  100 SAS Campus Dr, Cary, NC 27513	                2&amp;amp;W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&amp;amp;43&amp;amp;$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
2	        Paris Hilton	  100 SAS campus Drive Cary NC, 27513	                2&amp;amp;W$$$$$$$$$$$$$$$$$$!$$$$$$$$$$$$$$I00$$$$$$$$$$$4&amp;amp;43&amp;amp;$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2025 01:55:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980815#M378995</guid>
      <dc:creator>Stu_SAS</dc:creator>
      <dc:date>2025-12-16T01:55:29Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for records which are "close" but won't be detected with NODUPKEY or NODUPRECS</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980818#M378996</link>
      <description>"For longer strings the array size must increase correct? "&lt;BR /&gt;Nope. array is used for checking sub-string ,not whole string.&lt;BR /&gt;Here I check sub-string which's length is 1,2,3,4,...10.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;"Also the string length can and is quite long, what other adjustments are required to account for that?"&lt;BR /&gt;You need to change the following line:&lt;BR /&gt;length new_a new_b str_short str_long  $ 200; &lt;BR /&gt;As I commented that max length of string is 200. if you have string longer, change 200 into other number.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;"Lastly, is a lower score a better match or a higher one?"&lt;BR /&gt;A higher score match better.&lt;BR /&gt;I already sort it by WEIGHT decending. The first obs in each group (a.obs) is the best match , you can just keep it and drop others .</description>
      <pubDate>Tue, 16 Dec 2025 07:06:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Looking-for-records-which-are-quot-close-quot-but-won-t-be/m-p/980818#M378996</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-12-16T07:06:55Z</dc:date>
    </item>
  </channel>
</rss>

