topic Re: Matching by scores in SAS Procedures

Matching by scores

pmesh — Thu, 09 Apr 2015 18:11:23 GMT

Hello,

I have two datasets - data1 and data2. Data1 has 10000 observations and data2 has 50000. Both of them consist of two variables - ID and score. Both of the datasets have unique IDs, no duplicates.

I want to match IDs from data1 with IDs from data2, in 1:2 ratio, and such that the difference between the two scores is less than 0.01.

Could you please suggest an efficient approach. Merging or proc sql is unfortunately not helping much.

Thanks!

Re: Matching by scores

Reeza — Thu, 09 Apr 2015 18:52:12 GMT

If you have multiple identical matches what do you do? What if you don't have enough to match?

Can you post some sample data and what you want the output to be, i.e. a list of closest between 0.01 or only top 2?

Re: Matching by scores

pmesh — Thu, 09 Apr 2015 19:05:32 GMT

If an ID from data1 matches with multiple IDs from data2, then the two pairs with lowest difference (absolute value close to zero) should get selected and the matched IDs from data2 should not be available for further matching. Overall the program should maximize the number of matches.

Re: Matching by scores

gergely_batho — Thu, 09 Apr 2015 21:27:33 GMT

1. approach

load data2 into a hash object (hash2)

use a set statement to read data1

at each observation iterate through hash2 (using a hash iterator object) and determine the 2 closest scores that are within 0.01

output those matches

remove those matches from hash2

2. approach (more clever)

Instead of iterating over ~50000 items at every observation from data1 (which would be ~50000x10000 hash object method calls) use a sorted hash object and "jump directly" to the position score-0.01 and iterate only until score+0.01.

How to "jump directly" in a hash object? Look at the "bisection trick" of @FriedEgg at this thread: https://communities.sas.com/ideas/1613

3. approach (additional improvements)

You can sort both datasets. You can read them in parallel in a data step, so you don't need to store the whole data2 in the hash2 hash object. Only the [score-0.01, score+0.01] portion should be stored. This means, you should "read ahead" data2, and you can remove items from the hash object when their score value is less then the "current data1 score-0.01".

Note, that none of the above methods "maximize the number of matches"! They are simple greedy algorithms.

To maximize matches I would create an optimization model with OPTMODEL. Do you have SAS/OR?

On the other hand, if you just need maximal matches but refrain from choosing the "2 closest scores", then this kind of greedy approach works perfectly.

Re: Matching by scores

billfish — Thu, 09 Apr 2015 23:05:23 GMT

A proposed solution.

Some of the hints are already given in this forum.

I am looking to match pairs with the smallest score differences.

One can at some later point remove the pairs with a score difference greater than the minimum wanted.

Same sample datasets.

/*********************/
/***** DataSet 1 *****/
/*********************/
data t_a(keep=id1 zScore);
do id = 1 to 10000;
     id1 = compress('A'||put(id, z5.));
     zScore= ceil(100000*ranuni(3));
     output;
end;
run;

/*********************/
/***** DataSet 2 *****/
/*********************/
data t_b(keep=id2 zScore);
do id = 1 to 50000;
     id2 = compress('B'||put(id, z5.));
     zScore= ceil(200000*ranuni(3));
     output;
end;
run;

/**************************************/
/** One can do a full Cartesian Join **/
/**    or one can adjust the range   **/
/**************************************/
PROC SQL;
   create table t_c as
   select a.id1, b.id2, a.zScore as zS1, b.zScore as zS2,
          (a.zScore-b.zScore) as difx,
           abs(a.zScore-b.zScore) as difa
   from   t_a a, t_b b
   where (-100 <= (a.zScore-b.zScore) <=100);
   /*********************************/
   /** check if all id1 are in t_c **/
   /*********************************/
   select count(distinct id1) as c_id1 from t_c;
QUIT;

proc sort data=t_c; by difa difx; run;

/*****************************/
/**** the 1-to-1 matching ****/
/*****************************/
data t_want(keep=id1 id2 difa difx zS1 zS2);
   if _N_=1 then do;
      declare hash ha(multidata:'N');
         ha.definekey('mbr1');
         ha.definedata('mbr1');
         ha.definedone();
      declare hash hb(multidata:'N');
         hb.definekey('mbr2');
         hb.definedata('mbr2');
         hb.definedone();
   end;

   do until (aDone);
      set t_c;
      mbr1=id1;
      mbr2=id2;
      k1=ha.find();
      k2=hb.find();
      if (k1 and k2) then do;
         ha.add();
         hb.add();
         output;
      end;
   end;
run;

/***************************/
/**** some idiot checks ****/
/***************************/
proc sql;
select count(distinct id1) as c_id1, count(distinct id2) as c_id2
from t_want;

   select difa,difx, sum(1) as c_pairs
   from t_want
   group by difa, difx;
quit;

Re: Matching by scores

gergely_batho — Thu, 09 Apr 2015 23:44:14 GMT

The following program just does a greedy maximal match. It does not select the closest scores, only 2 scores that are within 0.01

data data1;

do id1=1 to 10000;

score1=100*ranuni(1237);

output;

end;

run;

data data2;

do id2=1 to 50000;

score2=100*ranuni(12345);

output;

end;

run;

proc sort data=data1;by score1;run;

proc sort data=data2;by score2;run;

data matches_only;

numMach=0;

do until(end1);

set data1 end=end1;

if(score2=.) then do;

leave;

end;

if(score2~=. and score2-0.01<=score1 and score1<=score2+0.01) then do;

output;

numMach+1;

leave;

end;

if(score2~=. and score2+0.01<score1) then do;

score2=.;

leave;

end;

end;/*until*/

do until(end2);

set data2 end=end2;

if(score1=.) then do;

leave;

end;

if(score1~=. and score1-0.01<=score2 and score2<=score1+0.01)then do;

output;

numMach+1;

end;

if(score1~=. and score1+0.01<score2)then do;

score1=.;

leave;

end;

if numMach>=2 then do;

score1=.;

score2=.;

leave;

end;

end;/*until*/

run;

Re: Matching by scores

pmesh — Fri, 10 Apr 2015 14:39:13 GMT

Gergely, Thank you very much!! This worked perfectly.

Only a couple of requests:

- If you could please also demonstrate the code for approach 1 that you mentioned above (using hash objects), I would be extremely thankful. It would allow me to compare it with the current approach and would help me understand it better.

- Secondly, IDs with less than the required matches (in our example, IDs with less than 2 matches) are also kept in the dataset. I did the extra step to remove those, but is it something that can be handled in the original code itself?

Re: Matching by scores

pmesh — Fri, 10 Apr 2015 14:40:49 GMT

Hi billfish, This code is not providing 1-to-many matches, but working perfectly for 1:1 matches.

Re: Matching by scores

Ksharp — Fri, 10 Apr 2015 14:56:56 GMT

Do you consider using proc format + cntlin ?

Re: Matching by scores

gergely_batho — Sat, 11 Apr 2015 23:31:16 GMT

Hi,

Why is it important to handle removal of "less then two matches" in the same data step?

I'd like to understand how much this problem is performance(memory, CPU) focused.

The problem here is, that in the current code, when I discover a match, I immediately output it. If later it turns out, it is the only match, I cannot revoke it.

So instead of outputing, you should rather store it in memory (in a hash object, or in an array, or if it is really just 1 observation: in a temporary variable).

Then you should output it when it turns out, there are really enough matches. (Otherwise you just clear the hash, empty the array or variable.)

Unfortunately approach 1 (and also 2 and 3) is a bit more coding work.

But I rather adapted 's idea and code. It is much more easy to adapt to your needs. For example one additional line to exclude ids with less then 2 matches.

Also I think this program uses a better heuristic (if you want to minimize the sum of distances): it is "more greedy" then my program, because it starts with the smallest overall distance.

PROC SQL;

create table t_c as

select a.id1, b.id2, a.score1, b.score2,

(a.score1-b.score2) as difx,

abs(a.score1-b.score2) as difa

from data1 a, data2 b

where (-0.01 <= (a.score1-b.score2) <=0.01)

group by id1

having count(*)>=2 /*exclude ids with less then 2 matches*/

order by difa;

QUIT;

data t_want(keep=id1 id2 difa difx score1 score2);

retain one 1;

length numCon1 numCon2 8;

if _N_=1 then do;

declare hash h1(multidata:'N', suminc:'one');/*This will use a counter for each ID: counting how many times it was used*/

h1.definekey('id1');

h1.definedone();

declare hash h2(multidata:'N', suminc:'one');/*This will use a counter for each ID: counting how many times it was used*/

h2.definekey('id2');

h2.definedone();

end;

do until(0);

set t_c;

h1.ref();

h1.sum(sum:numCon1);

h2.ref();

h2.sum(sum:numCon2);

if (numCon1<=2 and numCon2<=1) then do;

output;

end;

run;