Solved: Help deduplicate after probabilistic linkage

Schilker · Posted 07-13-2019 03:31 PM

data one;
  input ID_A ID_B Match_score;
cards;
123 777 28.1
124 778 15.6 
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;

I just performed probabilistic linkage on two datasets. The output dataset called "one", contains the identification number from both original datasets, ID_A, the other ID_B, with a linkage score "match_score". There are numerous combinations of ID_A and ID_B. I want to select only the top linkage to pair then remove them from the selection process for further linkages. An ideal output would be...

ID_A ID_B Match_score

127 762 55.1

123 777 28.1

125 787 19.7

128 999 19.5

129 781 18.2

124 778 15.6

ID_A: 126 wouldn't match because of the ID_B (762), match_score is higher for another ID_A (127).

ID_B: 799 wouldn't match because ID_A(125) had a larger match_score with (787)

Any help would be greatly appreciated!

Edited 28.7 to 28.1...thanks for pointing out

FreelanceReinh · Posted 07-13-2019 05:47 PM

Hello @Schilker and welcome to the SAS Support Communities!

Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?

proc sort data=one;
by descending match_score ID_A ID_B;
run;

data want;
if _n_=1 then do;
  dcl hash ha();
  ha.definekey('ID_A');
  ha.definedone();
  dcl hash hb();
  hb.definekey('ID_B');
  hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
  output;
  ha.add();
  hb.add();
end;
run;

proc print data=want noobs;
run;

Result:

                Match_
ID_A    ID_B     score

 127     762     55.1
 123     777     28.1
 125     787     19.7
 128     999     19.5
 129     781     18.2
 124     778     15.6

However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?

View solution in original post

novinosrin · Posted 07-13-2019 04:45 PM

I vaguely remember the TOP notch people in the forum asking if you have SAS OR that has the procs for networking problems that solves at ease. I have seen PG, Ksharp , Robpratt answering such questions frequently.

I had fun attempting in this thread a while ago-

Dec,17,2018 attempt

https://communities.sas.com/t5/SAS-Programming/SAS-unique-identifier-for-multiple-columns-with-same-...

So I tried copy pasting and plugging your dataset with a tweak



data have;
  input (Cust dev) (:$8.) Match_score;
cards;
123 777 28.1
124 778 15.6 
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;
proc sort data=have;
by cust dev;
run;
dm log 'clear';

data _null_;
if _n_=1 then do;
  if 0 then set  have have(rename=(dev=_dev cust=_cust));
  length __dev UID __cust $5;
	/*H initial full load*/
  dcl hash H (dataset:'have(rename=(dev=_dev cust=_cust))',multidata:'y') ;
   h.definekey  ("_dev") ;
   h.definedata ("_cust", "_dev","match_score") ;
   h.definedone () ;
   dcl hiter hh('h');
   	/*H1 Look up and load part by part load proceeding to our need*/
   dcl hash H1 (multidata:'y',ordered:'y') ;
   h1.definekey  ("__cust") ;
   h1.definedata ("__cust", "__dev","match_score","UID") ;
   h1.definedone () ;
   dcl hiter hh1('h1');
   call missing(__dev,__cust);
end;
array t(999) $5;
array j(999) $5;
do until(last.cust);
	set have end=l;
	by cust;
	if first.cust then 
	do;
		if  h1.find(key:cust)= 0 then f=1;
		if not f then do; c+1; uid= cats( 'UID', put(c,8. -l));end;
	end;
	if not f then 
	do;
		/*Look from dev to cust and collect cust residuals in array*/
			do rc1=h.find(key:dev) by 0 while(rc1=0);
				rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
		/*collect cust residuals in array that's not part of cust but part of device*/
				if _cust ne cust then do;n+1;t(n)=_cust;end;
				rc1=h.find_next();
			end;
	end;
end;
n=0;
if not f then 
do;
	/*Residual look up-start with cust and iterate with dev*cust recursively until all check complete*/
	do until(sum(cmiss(of t(*)),cmiss(of j(*)))=dim(t)*2);
		do i=1 to dim(t);
			if not missing(t(i)) then 
	/*iterate the full load h using hh*/
				do while(hh.next()=0);
					if t(i)=_cust then
					do;
		/*Check if residual cust's* dev is not in part load hash,if not then fetch*/
						k=.;
						do while(hh1.next()=0);
						if __dev = _dev then do; k=1;leave;end;
						end;
						if not  k then do;n1+1;j(n1)=_dev;end;
						rc=h1.add(key:_cust,data:t(i),data:_dev,data:match_score,data:uid);
					end;	
				end;
		end;
		call missing(of t(*));n1=0;
		do i=1 to dim(j);
			if not missing(j(i)) then 
				do rc1=h.find(key:j(i)) by 0 while(rc1=0);
			/*Check if residual dev's *cust is not in part load hash,if not then fetch*/
					if h1.check(key:_cust) ne 0 then do;n+1;t(n)=_cust;end;
					rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
					rc1=h.find_next();
				end;
		end;
		call missing(of j(*));n=0;
	end;
end;
if l then h1.output(dataset:'want');
run;

proc sql;
create table final_want as
select __cust as ID_A,__dev as ID_B, Match_score
from want 
group by uid
having match_score=max(match_score);
quit;

FINAL_WANT

ID_A	ID_B	Match_score
127	762	55.1
124	778	15.6
125	787	19.7
128	999	19.5
129	781	18.2

Schilker · Posted 07-13-2019 05:29 PM

This is great except it excludes the match ID_A (123) and ID_B(777). Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123).

novinosrin · Posted 07-13-2019 05:39 PM

Hi @Schilker I think my understanding is perhaps wrong and consequently your req may not be easily plugged to the code

So from this

ID_A	ID_B	Match_score	Group
123	777	28.1	UID1
124	778	15.6	UID2
125	787	19.7	UID3
125	799	18.9	UID3
126	762	36.1	UID1
126	762	36.1	UID1
127	777	28.7	UID1
127	762	28.7	UID1
127	777	28.1	UID1
127	762	55.1	UID1
128	999	19.5	UID4
129	781	18.2	UID5

You can see , i took the max of the match_score for each By group. Therefore your "Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123)." is not clear to me or the grouping logic is not supposed to the be approach to follow

FreelanceReinh · Posted 07-13-2019 05:47 PM

Hello @Schilker and welcome to the SAS Support Communities!

Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?

proc sort data=one;
by descending match_score ID_A ID_B;
run;

data want;
if _n_=1 then do;
  dcl hash ha();
  ha.definekey('ID_A');
  ha.definedone();
  dcl hash hb();
  hb.definekey('ID_B');
  hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
  output;
  ha.add();
  hb.add();
end;
run;

proc print data=want noobs;
run;

Result:

                Match_
ID_A    ID_B     score

 127     762     55.1
 123     777     28.1
 125     787     19.7
 128     999     19.5
 129     781     18.2
 124     778     15.6

However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?

Schilker · Posted 07-13-2019 06:07 PM

Thank you,

Very correct in my typological error when posting. ID_A 123 would not inherit the match score from ID_A 127 and ID_B 777.

But your solution is what I am looking for!

You saved my A$$.

FreelanceReinh · Posted 07-13-2019 06:58 PM

Great, but one remaining issue might be tied match scores. In this case the sort order would not be uniquely determined by the scores. (As you see, I used ID_A and ID_B, in ascending order, as potential tie breakers.) But the order does have an impact on which of the subsequent pairs are discarded.

Example:

data one;
input ID_A ID_B Match_score;
cards;
101 201 100
101 202 100
102 201  90
103 202  80
103 203  75
;

With the current algorithm only the two observations highlighted in green would be selected. However, the only reason why the second obs. (101 202 100) is discarded is its higher ID_B, which is rather arbitrary, isn't it? But if the second obs. had been selected (and the first discarded instead), the third and fifth obs. would have been selected as well rather than the fourth. So, the result would have been different both in terms of match scores (80 vs. 90 and 75) and number of selected observations! Would this be a problem?

Ksharp · Posted 07-14-2019 07:45 AM

Why you pick 123 not 127 ? since 28.7 > 28.1

123 777 28.1
...........
127 777 28.7

Schilker · Posted 07-14-2019 08:29 AM

Because 127 had a better match

Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Re: Help deduplicate after probabilistic linkage

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!