Hi, Why is it important to handle removal of "less then two matches" in the same data step? I'd like to understand how much this problem is performance(memory, CPU) focused. The problem here is, that in the current code, when I discover a match, I immediately output it. If later it turns out, it is the only match, I cannot revoke it. So instead of outputing, you should rather store it in memory (in a hash object, or in an array, or if it is really just 1 observation: in a temporary variable). Then you should output it when it turns out, there are really enough matches. (Otherwise you just clear the hash, empty the array or variable.) Unfortunately approach 1 (and also 2 and 3) is a bit more coding work. But I rather adapted 's idea and code. It is much more easy to adapt to your needs. For example one additional line to exclude ids with less then 2 matches. Also I think this program uses a better heuristic (if you want to minimize the sum of distances): it is "more greedy" then my program, because it starts with the smallest overall distance. PROC SQL; create table t_c as select a.id1, b.id2, a.score1, b.score2, (a.score1-b.score2) as difx, abs(a.score1-b.score2) as difa from data1 a, data2 b where (-0.01 <= (a.score1-b.score2) <=0.01) group by id1 having count(*)>=2 /*exclude ids with less then 2 matches*/ order by difa; QUIT; data t_want(keep=id1 id2 difa difx score1 score2); retain one 1; length numCon1 numCon2 8; if _N_=1 then do; declare hash h1(multidata:'N', suminc:'one');/*This will use a counter for each ID: counting how many times it was used*/ h1.definekey('id1'); h1.definedone(); declare hash h2(multidata:'N', suminc:'one');/*This will use a counter for each ID: counting how many times it was used*/ h2.definekey('id2'); h2.definedone(); end; do until(0); set t_c; h1.ref(); h1.sum(sum:numCon1); h2.ref(); h2.sum(sum:numCon2); if (numCon1<=2 and numCon2<=1) then do; output; end; end; run;
... View more