BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Schilker
Fluorite | Level 6
data one;
  input ID_A ID_B Match_score;
cards;
123 777 28.1
124 778 15.6 
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;

I just performed probabilistic linkage on two datasets. The output dataset called "one", contains the identification number from both original datasets,  ID_A, the other ID_B, with a linkage score "match_score". There are numerous combinations of ID_A and ID_B. I want to select only the top linkage to pair then remove them from the selection process for further linkages. An ideal output would be...

 

ID_A     ID_B     Match_score 

127       762       55.1

123       777       28.1

125       787       19.7

128       999       19.5

129       781       18.2

124       778       15.6

 

ID_A: 126 wouldn't match because of the ID_B (762),  match_score is higher for another ID_A (127).

ID_B: 799 wouldn't match because ID_A(125) had a larger match_score with (787)

 

Any help would be greatly appreciated!

 

 

 

Edited 28.7 to 28.1...thanks for pointing out

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Schilker and welcome to the SAS Support Communities!

 

Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?

proc sort data=one;
by descending match_score ID_A ID_B;
run;

data want;
if _n_=1 then do;
  dcl hash ha();
  ha.definekey('ID_A');
  ha.definedone();
  dcl hash hb();
  hb.definekey('ID_B');
  hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
  output;
  ha.add();
  hb.add();
end;
run;

proc print data=want noobs;
run;

Result:

                Match_
ID_A    ID_B     score

 127     762     55.1
 123     777     28.1
 125     787     19.7
 128     999     19.5
 129     781     18.2
 124     778     15.6

However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?

View solution in original post

8 REPLIES 8
novinosrin
Tourmaline | Level 20

I vaguely remember the TOP notch people in the forum asking if you have SAS OR that has the procs for networking problems that solves at ease. I have seen PG, Ksharp , Robpratt answering such questions frequently.

 

I had fun attempting in this thread a while ago-

Dec,17,2018 attempt

https://communities.sas.com/t5/SAS-Programming/SAS-unique-identifier-for-multiple-columns-with-same-...

 

So I tried copy pasting and plugging  your dataset with a tweak

 



data have;
  input (Cust dev) (:$8.) Match_score;
cards;
123 777 28.1
124 778 15.6 
125 787 19.7
125 799 18.9
126 762 36.1
127 762 55.1
127 777 28.7
128 999 19.5
129 781 18.2
;
proc sort data=have;
by cust dev;
run;
dm log 'clear';

data _null_;
if _n_=1 then do;
  if 0 then set  have have(rename=(dev=_dev cust=_cust));
  length __dev UID __cust $5;
	/*H initial full load*/
  dcl hash H (dataset:'have(rename=(dev=_dev cust=_cust))',multidata:'y') ;
   h.definekey  ("_dev") ;
   h.definedata ("_cust", "_dev","match_score") ;
   h.definedone () ;
   dcl hiter hh('h');
   	/*H1 Look up and load part by part load proceeding to our need*/
   dcl hash H1 (multidata:'y',ordered:'y') ;
   h1.definekey  ("__cust") ;
   h1.definedata ("__cust", "__dev","match_score","UID") ;
   h1.definedone () ;
   dcl hiter hh1('h1');
   call missing(__dev,__cust);
end;
array t(999) $5;
array j(999) $5;
do until(last.cust);
	set have end=l;
	by cust;
	if first.cust then 
	do;
		if  h1.find(key:cust)= 0 then f=1;
		if not f then do; c+1; uid= cats( 'UID', put(c,8. -l));end;
	end;
	if not f then 
	do;
		/*Look from dev to cust and collect cust residuals in array*/
			do rc1=h.find(key:dev) by 0 while(rc1=0);
				rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
		/*collect cust residuals in array that's not part of cust but part of device*/
				if _cust ne cust then do;n+1;t(n)=_cust;end;
				rc1=h.find_next();
			end;
	end;
end;
n=0;
if not f then 
do;
	/*Residual look up-start with cust and iterate with dev*cust recursively until all check complete*/
	do until(sum(cmiss(of t(*)),cmiss(of j(*)))=dim(t)*2);
		do i=1 to dim(t);
			if not missing(t(i)) then 
	/*iterate the full load h using hh*/
				do while(hh.next()=0);
					if t(i)=_cust then
					do;
		/*Check if residual cust's* dev is not in part load hash,if not then fetch*/
						k=.;
						do while(hh1.next()=0);
						if __dev = _dev then do; k=1;leave;end;
						end;
						if not  k then do;n1+1;j(n1)=_dev;end;
						rc=h1.add(key:_cust,data:t(i),data:_dev,data:match_score,data:uid);
					end;	
				end;
		end;
		call missing(of t(*));n1=0;
		do i=1 to dim(j);
			if not missing(j(i)) then 
				do rc1=h.find(key:j(i)) by 0 while(rc1=0);
			/*Check if residual dev's *cust is not in part load hash,if not then fetch*/
					if h1.check(key:_cust) ne 0 then do;n+1;t(n)=_cust;end;
					rc=h1.add(key:_cust,data:_cust,data:_dev,data:match_score,data:uid);
					rc1=h.find_next();
				end;
		end;
		call missing(of j(*));n=0;
	end;
end;
if l then h1.output(dataset:'want');
run;

proc sql;
create table final_want as
select __cust as ID_A,__dev as ID_B, Match_score
from want 
group by uid
having match_score=max(match_score);
quit;

FINAL_WANT

 

ID_A ID_B Match_score
127 762 55.1
124 778 15.6
125 787 19.7
128 999 19.5
129 781 18.2

 

 

Schilker
Fluorite | Level 6

This is great except it excludes the match ID_A (123) and ID_B(777). Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123).

 

 

novinosrin
Tourmaline | Level 20

Hi @Schilker   I think my understanding is perhaps wrong and consequently your req may not be easily plugged to the code

 

So from this

 

ID_A ID_B Match_score Group
123 777 28.1 UID1
124 778 15.6 UID2
125 787 19.7 UID3
125 799 18.9 UID3
126 762 36.1 UID1
126 762 36.1 UID1
127 777 28.7 UID1
127 762 28.7 UID1
127 777 28.1 UID1
127 762 55.1 UID1
128 999 19.5 UID4
129 781 18.2 UID5

 

You can see , i took the max of the match_score for each By group. Therefore your "Because ID_B (777) wasn't the top match for ID_A (127), it should be used to match with the next best possible ID_A, which would be ID_A (123)." is not clear to me or the grouping logic is not supposed to the be approach to follow

FreelanceReinh
Jade | Level 19

Hello @Schilker and welcome to the SAS Support Communities!

 

Isn't this just a matter of two lookup tables to check if either component of a matched pair has occurred before (i.e. with a higher match score), after sorting the pairs by descending match score?

proc sort data=one;
by descending match_score ID_A ID_B;
run;

data want;
if _n_=1 then do;
  dcl hash ha();
  ha.definekey('ID_A');
  ha.definedone();
  dcl hash hb();
  hb.definekey('ID_B');
  hb.definedone();
end;
set one;
if ha.check()*hb.check() then do;
  output;
  ha.add();
  hb.add();
end;
run;

proc print data=want noobs;
run;

Result:

                Match_
ID_A    ID_B     score

 127     762     55.1
 123     777     28.1
 125     787     19.7
 128     999     19.5
 129     781     18.2
 124     778     15.6

However, if your sample output is correct, then mine is wrong (see highlighted decimal above). But why should the pair (123, 777) "inherit" match score 28.7, which was computed for the pair (127, 777)?

Schilker
Fluorite | Level 6

Thank you,

 

Very correct in my typological error when posting. ID_A 123 would not inherit the match score from ID_A 127 and ID_B 777.

 

But your solution is what I am looking for!

 

You saved my A$$.

FreelanceReinh
Jade | Level 19

Great, but one remaining issue might be tied match scores. In this case the sort order would not be uniquely determined by the scores. (As you see, I used ID_A and ID_B, in ascending order, as potential tie breakers.) But the order does have an impact on which of the subsequent pairs are discarded.

 

Example:

data one;
input ID_A ID_B Match_score;
cards;
101 201 100
101 202 100
102 201  90
103 202  80
103 203  75
;

With the current algorithm only the two observations highlighted in green would be selected. However, the only reason why the second obs. (101 202 100) is discarded is its higher ID_B, which is rather arbitrary, isn't it? But if the second obs. had been selected (and the first discarded instead), the third and fifth obs. would have been selected as well rather than the fourth. So, the result would have been different both in terms of match scores (80 vs. 90 and 75) and number of selected observations! Would this be a problem?

Ksharp
Super User

Why you pick 123 not 127 ? since 28.7 > 28.1

 

123 777 28.1
...........
127 777 28.7

 

Schilker
Fluorite | Level 6
Because 127 had a better match

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1397 views
  • 5 likes
  • 4 in conversation