<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Matching by scores in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218312#M53693</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi billfish, This code is not providing 1-to-many matches, but working perfectly for 1:1 matches.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 10 Apr 2015 14:40:49 GMT</pubDate>
    <dc:creator>pmesh</dc:creator>
    <dc:date>2015-04-10T14:40:49Z</dc:date>
    <item>
      <title>Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218305#M53686</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have two datasets - data1 and data2. Data1 has 10000 observations and data2 has 50000. Both of them consist of two variables - ID and score. Both of the datasets have unique IDs, no duplicates.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I want to match IDs from data1 with IDs from data2, in 1:2 ratio, and such that the difference between the two scores is less than 0.01.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could you please suggest an efficient approach. Merging or proc sql is unfortunately not helping much.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 18:11:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218305#M53686</guid>
      <dc:creator>pmesh</dc:creator>
      <dc:date>2015-04-09T18:11:23Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218306#M53687</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you have multiple identical matches what do you do? What if you don't have enough to match?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you post some sample data and what you want the output to be, i.e. a list of closest between 0.01 or only top 2?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 18:52:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218306#M53687</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2015-04-09T18:52:12Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218307#M53688</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If an ID from data1 matches with multiple IDs from data2, then the two pairs with lowest difference (absolute value close to zero) should get selected and the matched IDs from data2 should not be available for further matching. Overall the program should maximize the number of matches.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 19:05:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218307#M53688</guid>
      <dc:creator>pmesh</dc:creator>
      <dc:date>2015-04-09T19:05:32Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218308#M53689</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;1. approach&lt;/P&gt;&lt;P&gt;load data2 into a hash object (&lt;EM&gt;hash2&lt;/EM&gt;)&lt;/P&gt;&lt;P&gt;use a set statement to read data1&lt;/P&gt;&lt;P&gt;at each observation iterate through &lt;EM&gt;hash2&lt;/EM&gt; (using a hash iterator object) and determine the 2 closest scores that are within 0.01&lt;/P&gt;&lt;P&gt;output those matches&lt;/P&gt;&lt;P&gt;remove those matches from hash2&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2. approach (more clever)&lt;/P&gt;&lt;P&gt;Instead of iterating over ~50000 items at every observation from data1 (which would be ~50000x10000 hash object method calls) use a &lt;STRONG&gt;sorted hash&lt;/STRONG&gt; object and "jump directly" to the position&lt;EM&gt; score-0.01&lt;/EM&gt; and iterate only until &lt;STRONG&gt;score+0.01&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;How to "jump directly" in a hash object? Look at the "bisection trick" of @FriedEgg at this thread: &lt;A _jive_internal="true" href="https://communities.sas.com/ideas/1613" title="https://communities.sas.com/ideas/1613"&gt;https://communities.sas.com/ideas/1613&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;3. approach (additional improvements)&lt;/P&gt;&lt;P&gt;You can sort both datasets. You can read them in parallel in a data step, so you don't need to store the whole data2 in the &lt;EM&gt;hash2&lt;/EM&gt; hash object. Only the [score-0.01, score+0.01] portion should be stored. This means, you should "read ahead" data2, and you can remove items from the hash object when their score value is less then the "current data1 score-0.01".&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Note, that none of the above methods "maximize the number of matches"! They are simple greedy algorithms.&lt;/P&gt;&lt;P&gt;To maximize matches I would create an optimization model with OPTMODEL. Do you have SAS/OR?&lt;/P&gt;&lt;P&gt;On the other hand, if you just need maximal matches but refrain from choosing the "2 closest scores", then this kind of greedy approach works perfectly. &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 21:27:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218308#M53689</guid>
      <dc:creator>gergely_batho</dc:creator>
      <dc:date>2015-04-09T21:27:33Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218309#M53690</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;A proposed solution.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;Some of the hints are already given in this forum.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;I am looking to match pairs with the smallest score differences. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;One can at some later point remove the pairs with a score difference greater than the minimum wanted.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;Same sample datasets.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*********************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/***** DataSet 1 *****/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*********************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;data t_a(keep=id1 zScore);&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; do id = 1 to 10000;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; id1 = compress('A'||put(id, z5.));&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; zScore=&amp;nbsp; ceil(100000*ranuni(3));&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; output;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; end;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;run;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*********************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/***** DataSet 2 *****/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*********************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;data t_b(keep=id2 zScore);&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; do id = 1 to 50000;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; id2 = compress('B'||put(id, z5.));&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; zScore=&amp;nbsp; ceil(200000*ranuni(3));&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; output;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp; end;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;run;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/**************************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/** One can do a full Cartesian Join **/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/**&amp;nbsp;&amp;nbsp;&amp;nbsp; or one can adjust the range&amp;nbsp;&amp;nbsp; **/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/**************************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;PROC SQL;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; create table t_c as&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; select a.id1, b.id2, a.zScore as zS1, b.zScore as zS2,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (a.zScore-b.zScore) as difx, &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; abs(a.zScore-b.zScore) as difa&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; from&amp;nbsp;&amp;nbsp; t_a a, t_b b&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; where (-100 &amp;lt;= (a.zScore-b.zScore) &amp;lt;=100);&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; /*********************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; /** check if all id1 are in t_c **/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; /*********************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; select count(distinct id1) as c_id1 from t_c;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;QUIT;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;proc sort data=t_c; by difa difx; run;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*****************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/**** the 1-to-1 matching ****/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/*****************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;data t_want(keep=id1 id2 difa difx zS1 zS2);&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; if _N_=1 then do;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; declare hash ha(multidata:'N');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ha.definekey('mbr1');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ha.definedata('mbr1');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ha.definedone();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; declare hash hb(multidata:'N');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; hb.definekey('mbr2');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; hb.definedata('mbr2');&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; hb.definedone();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; end;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; do until (aDone);&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; set t_c;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mbr1=id1;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mbr2=id2;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; k1=ha.find();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; k2=hb.find();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (k1 and k2) then do;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ha.add();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; hb.add();&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; output;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; end;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; end;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;run;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/***************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/**** some idiot checks ****/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;/***************************/&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;proc sql;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; select count(distinct id1) as c_id1, count(distinct id2) as c_id2&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; from t_want;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; select difa,difx, sum(1) as c_pairs&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; from t_want&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;&amp;nbsp;&amp;nbsp; group by difa, difx;&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-family: courier new,courier;"&gt;quit;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 23:05:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218309#M53690</guid>
      <dc:creator>billfish</dc:creator>
      <dc:date>2015-04-09T23:05:23Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218310#M53691</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;The following program just does a greedy maximal match. It does not select the &lt;EM&gt;closest&lt;/EM&gt; scores, only 2 scores that are within 0.01&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data data1;&lt;/P&gt;&lt;P&gt;&amp;nbsp; do id1=1 to 10000;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score1=100*ranuni(1237);&lt;/P&gt;&lt;P&gt;&amp;nbsp; output;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;data data2;&lt;/P&gt;&lt;P&gt;&amp;nbsp; do id2=1 to 50000;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score2=100*ranuni(12345);&lt;/P&gt;&lt;P&gt;&amp;nbsp; output;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;proc sort data=data1;by score1;run;&lt;/P&gt;&lt;P&gt;proc sort data=data2;by score2;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data matches_only;&lt;/P&gt;&lt;P&gt;&amp;nbsp; numMach=0;&lt;/P&gt;&lt;P&gt;&amp;nbsp; do until(end1);&lt;/P&gt;&lt;P&gt;&amp;nbsp; set data1 end=end1;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score2=.) then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score2~=. and score2-0.01&amp;lt;=score1 and score1&amp;lt;=score2+0.01) then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; output;&lt;/P&gt;&lt;P&gt;&amp;nbsp; numMach+1;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score2~=. and score2+0.01&amp;lt;score1) then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score2=.;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;/*until*/&lt;/P&gt;&lt;P&gt;&amp;nbsp; do until(end2);&lt;/P&gt;&lt;P&gt;&amp;nbsp; set data2 end=end2;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score1=.) then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score1~=. and score1-0.01&amp;lt;=score2 and score2&amp;lt;=score1+0.01)then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; output;&lt;/P&gt;&lt;P&gt;&amp;nbsp; numMach+1;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if(score1~=. and score1+0.01&amp;lt;score2)then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score1=.;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; if numMach&amp;gt;=2 then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score1=.;&lt;/P&gt;&lt;P&gt;&amp;nbsp; score2=.;&lt;/P&gt;&lt;P&gt;&amp;nbsp; leave;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp; end;/*until*/&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Apr 2015 23:44:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218310#M53691</guid>
      <dc:creator>gergely_batho</dc:creator>
      <dc:date>2015-04-09T23:44:14Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218311#M53692</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Gergely, Thank you very much!! This worked perfectly.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Only a couple of requests:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;- If you could please also demonstrate the code for approach 1 that you mentioned above&lt;SPAN style="font-size: 13.3333330154419px;"&gt; (using hash objects), I would be extremely thankful. It would allow me to compare it with the current approach and would help me understand it better.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;- Secondly, IDs with less than the required matches (in our example, IDs with less than 2 matches) are also kept in the dataset. I did the extra step to remove those, but is it something that can be handled in the original code itself?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 10 Apr 2015 14:39:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218311#M53692</guid>
      <dc:creator>pmesh</dc:creator>
      <dc:date>2015-04-10T14:39:13Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218312#M53693</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi billfish, This code is not providing 1-to-many matches, but working perfectly for 1:1 matches.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 10 Apr 2015 14:40:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218312#M53693</guid>
      <dc:creator>pmesh</dc:creator>
      <dc:date>2015-04-10T14:40:49Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218313#M53694</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Do you consider using proc format + cntlin&amp;nbsp; ?&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 10 Apr 2015 14:56:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218313#M53694</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2015-04-10T14:56:56Z</dc:date>
    </item>
    <item>
      <title>Re: Matching by scores</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218314#M53695</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Why is it important to handle removal of "less then two matches" in the same data step?&lt;/P&gt;&lt;P&gt;I'd like to understand how much this problem is performance(memory, CPU) focused.&lt;/P&gt;&lt;P&gt;The problem here is, that in the current code, when I discover a match, I immediately output it. If later it turns out, it is the only match, I cannot revoke it.&lt;/P&gt;&lt;P&gt;So instead of outputing, you should rather store it in memory (in a hash object, or in an array, or if it is really just 1 observation: in a temporary variable).&lt;/P&gt;&lt;P&gt;Then you should output it when it turns out, there are really enough matches. (Otherwise you just clear the hash, empty the array or variable.)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Unfortunately approach 1 (and also 2 and 3) is a bit more coding work.&lt;/P&gt;&lt;P&gt;But I rather adapted &lt;A __default_attr="858574" __jive_macro_name="user" class="jive_macro jive_macro_user" data-objecttype="3" href="https://communities.sas.com/"&gt;&lt;/A&gt; 's idea and code. It is much more easy to adapt to your needs. For example one additional line to exclude ids with less then 2 matches.&lt;/P&gt;&lt;P&gt;Also I think this program uses a better heuristic (if you want to minimize the sum of distances): it is "more greedy" then my program, because it starts with the smallest overall distance.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;PROC SQL;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; create table t_c as&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; select a.id1, b.id2, a.score1, b.score2,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (a.score1-b.score2) as difx, &lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; abs(a.score1-b.score2) as difa&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; from&amp;nbsp;&amp;nbsp; data1 a, data2 b&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; where (-0.01 &amp;lt;= (a.score1-b.score2) &amp;lt;=0.01)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; group by id1&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; having count(*)&amp;gt;=2 /*exclude ids with less then 2 matches*/&lt;/P&gt;&lt;P&gt;&amp;nbsp; order by difa;&lt;/P&gt;&lt;P&gt;QUIT;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data t_want(keep=id1 id2 difa difx score1 score2);&lt;/P&gt;&lt;P&gt;&amp;nbsp; retain one 1;&lt;/P&gt;&lt;P&gt;&amp;nbsp; length numCon1 numCon2 8;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; if _N_=1 then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; declare hash h1(multidata:'N', suminc:'one');/*This will use a counter for each ID: counting how many times it was used*/&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h1.definekey('id1');&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h1.definedone();&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; declare hash h2(multidata:'N', suminc:'one');&lt;SPAN style="font-size: 13.3333330154419px;"&gt;/*This will use a counter for each ID: counting how many times it was used*/&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h2.definekey('id2');&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h2.definedone();&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; do until(0);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; set t_c;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h1.ref();&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; h1.sum(sum:numCon1);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; h2.ref();&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; h2.sum(sum:numCon2);&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (numCon1&amp;lt;=2 and numCon2&amp;lt;=1) then do;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; output;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; end;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 11 Apr 2015 23:31:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Matching-by-scores/m-p/218314#M53695</guid>
      <dc:creator>gergely_batho</dc:creator>
      <dc:date>2015-04-11T23:31:16Z</dc:date>
    </item>
  </channel>
</rss>

