data one;
input ID_A ID_B Match_score;
cards;
123 777 28.1
124 778 15.6
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;
I just performed probabilistic linkage on two datasets. The output dataset called "one", contains the identification number from both original datasets, ID_A, the other ID_B, with a linkage score "match_score". There are numerous combinations of ID_A and ID_B. I want to select only the top linkage to pair then remove them from the selection process for further linkages. An ideal output would be...
ID_A ID_B Match_score
127 762 55.1
123 777 28.1
125 787 19.7
128 999 19.5
129 781 18.2
124 778 15.6
ID_A: 126 wouldn't match because of the ID_B (762), match_score is higher for another ID_A (127).
ID_B: 799 wouldn't match because ID_A(125) had a larger match_score with (787)
Any help would be greatly appreciated!
Edited 28.7 to 28.1...thanks for pointing out
Hello @Schilker and welcome to the SAS Support Communities!
Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?
proc sort data=one;
by descending match_score ID_A ID_B;
run;
data want;
if _n_=1 then do;
dcl hash ha();
ha.definekey('ID_A');
ha.definedone();
dcl hash hb();
hb.definekey('ID_B');
hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
output;
ha.add();
hb.add();
end;
run;
proc print data=want noobs;
run;
Result:
Match_ ID_A ID_B score 127 762 55.1 123 777 28.1 125 787 19.7 128 999 19.5 129 781 18.2 124 778 15.6
However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?
I vaguely remember the TOP notch people in the forum asking if you have SAS OR that has the procs for networking problems that solves at ease. I have seen PG, Ksharp , Robpratt answering such questions frequently.
I had fun attempting in this thread a while ago-
Dec,17,2018 attempt
So I tried copy pasting and plugging your dataset with a tweak
data have;
input (Cust dev) (:$8.) Match_score;
cards;
123 777 28.1
124 778 15.6
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;
proc sort data=have;
by cust dev;
run;
dm log 'clear';
data _null_;
if _n_=1 then do;
if 0 then set have have(rename=(dev=_dev cust=_cust));
length __dev UID __cust $5;
/*H initial full load*/
dcl hash H (dataset:'have(rename=(dev=_dev cust=_cust))',multidata:'y') ;
h.definekey ("_dev") ;
h.definedata ("_cust", "_dev","match_score") ;
h.definedone () ;
dcl hiter hh('h');
/*H1 Look up and load part by part load proceeding to our need*/
dcl hash H1 (multidata:'y',ordered:'y') ;
h1.definekey ("__cust") ;
h1.definedata ("__cust", "__dev","match_score","UID") ;
h1.definedone () ;
dcl hiter hh1('h1');
call missing(__dev,__cust);
end;
array t(999) $5;
array j(999) $5;
do until(last.cust);
set have end=l;
by cust;
if first.cust then
do;
if h1.find(key:cust)= 0 then f=1;
if not f then do; c+1; uid= cats( 'UID', put(c,8. -l));end;
end;
if not f then
do;
/*Look from dev to cust and collect cust residuals in array*/
do rc1=h.find(key:dev) by 0 while(rc1=0);
rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
/*collect cust residuals in array that's not part of cust but part of device*/
if _cust ne cust then do;n+1;t(n)=_cust;end;
rc1=h.find_next();
end;
end;
end;
n=0;
if not f then
do;
/*Residual look up-start with cust and iterate with dev*cust recursively until all check complete*/
do until(sum(cmiss(of t(*)),cmiss(of j(*)))=dim(t)*2);
do i=1 to dim(t);
if not missing(t(i)) then
/*iterate the full load h using hh*/
do while(hh.next()=0);
if t(i)=_cust then
do;
/*Check if residual cust's* dev is not in part load hash,if not then fetch*/
k=.;
do while(hh1.next()=0);
if __dev = _dev then do; k=1;leave;end;
end;
if not k then do;n1+1;j(n1)=_dev;end;
rc=h1.add(key:_cust,data:t(i),data:_dev,data:match_score,data:uid);
end;
end;
end;
call missing(of t(*));n1=0;
do i=1 to dim(j);
if not missing(j(i)) then
do rc1=h.find(key:j(i)) by 0 while(rc1=0);
/*Check if residual dev's *cust is not in part load hash,if not then fetch*/
if h1.check(key:_cust) ne 0 then do;n+1;t(n)=_cust;end;
rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
rc1=h.find_next();
end;
end;
call missing(of j(*));n=0;
end;
end;
if l then h1.output(dataset:'want');
run;
proc sql;
create table final_want as
select __cust as ID_A,__dev as ID_B, Match_score
from want
group by uid
having match_score=max(match_score);
quit;
FINAL_WANT
ID_A | ID_B | Match_score |
127 | 762 | 55.1 |
124 | 778 | 15.6 |
125 | 787 | 19.7 |
128 | 999 | 19.5 |
129 | 781 | 18.2 |
This is great except it excludes the match ID_A (123) and ID_B(777). Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123).
Hi @Schilker I think my understanding is perhaps wrong and consequently your req may not be easily plugged to the code
So from this
ID_A | ID_B | Match_score | Group |
123 | 777 | 28.1 | UID1 |
124 | 778 | 15.6 | UID2 |
125 | 787 | 19.7 | UID3 |
125 | 799 | 18.9 | UID3 |
126 | 762 | 36.1 | UID1 |
126 | 762 | 36.1 | UID1 |
127 | 777 | 28.7 | UID1 |
127 | 762 | 28.7 | UID1 |
127 | 777 | 28.1 | UID1 |
127 | 762 | 55.1 | UID1 |
128 | 999 | 19.5 | UID4 |
129 | 781 | 18.2 | UID5 |
You can see , i took the max of the match_score for each By group. Therefore your "Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123)." is not clear to me or the grouping logic is not supposed to the be approach to follow
Hello @Schilker and welcome to the SAS Support Communities!
Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?
proc sort data=one;
by descending match_score ID_A ID_B;
run;
data want;
if _n_=1 then do;
dcl hash ha();
ha.definekey('ID_A');
ha.definedone();
dcl hash hb();
hb.definekey('ID_B');
hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
output;
ha.add();
hb.add();
end;
run;
proc print data=want noobs;
run;
Result:
Match_ ID_A ID_B score 127 762 55.1 123 777 28.1 125 787 19.7 128 999 19.5 129 781 18.2 124 778 15.6
However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?
Thank you,
Very correct in my typological error when posting. ID_A 123 would not inherit the match score from ID_A 127 and ID_B 777.
But your solution is what I am looking for!
You saved my A$$.
Great, but one remaining issue might be tied match scores. In this case the sort order would not be uniquely determined by the scores. (As you see, I used ID_A and ID_B, in ascending order, as potential tie breakers.) But the order does have an impact on which of the subsequent pairs are discarded.
Example:
data one; input ID_A ID_B Match_score; cards; 101 201 100 101 202 100 102 201 90 103 202 80 103 203 75 ;
With the current algorithm only the two observations highlighted in green would be selected. However, the only reason why the second obs. (101 202 100) is discarded is its higher ID_B, which is rather arbitrary, isn't it? But if the second obs. had been selected (and the first discarded instead), the third and fifth obs. would have been selected as well rather than the fourth. So, the result would have been different both in terms of match scores (80 vs. 90 and 75) and number of selected observations! Would this be a problem?
Why you pick 123 not 127 ? since 28.7 > 28.1
123 777 28.1
...........
127 777 28.7
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.