Solved: Fuzzy merge of diagnoses with multiple possibilities

DocMartin · Posted 01-24-2022 11:36 AM

I have two data sets. One is the main data set "MAIN" which has the variables ptid (patient id), hosp (name of hospital) and dx (principal diagnosis). The other data set "DIAGNOSES" has just one variable DX, which is a list of common diagnoses. I want to see if the diagnosis in MAIN is already in the data set DIAGNOSES.

data main; input @1 ptid $4 @6 hosp $6 @10 dx $65;
cards;
abcd JHop acute myocardial infarction w/ comorbidity
swer UMD  acute myocardial infarction with comorbidity
pppp StJo diabetes
......
;
run;

data diagnoses; input @1 dx $65;
cards;
acute myocardial infarction w/ comorbidity
diabetes 
gastric obstruction
hematoma
......
;

Normally getting the two rable to match merge would be simple. The problem is that the variable dx in MAIN can have multiple ways of being captured. For example, these should all be the same dx:

acute myocardial infarction w/ comorbidity

acute myocardial infarction with comorbidity

acute myocardial infarction with / comorbidity

A similar but different diagnosis is:

acute myocardial infarction w/ major comorbidity

acute myocardial infarction, major comorbidity

acute myocardial infarction & major comorbidity

These small spelling quirks make direct matching impossible (there are close to 1000 values for dx in DIAGNOSES).

Is there any way to match on similar but not exactly the same text string?

Thanks!

Andrew

DocMartin · Posted 02-04-2022 11:59 AM

O.K., I think I've found a way to do this. There are a few functions in SAS that compare two strings for the number of characters in one but not the other. So I can use two do lookups. One that looks up all the entries in the DIAGNOSES table into an array, and another that matches each entry in the array to diagnoses in the MAIN table. The key function is COMPLEV.

data match (keep = dx ptid hosp) 
     close (keep = dx ptid distance possible_dx);
array dxs[808] $70 _temporary_;
do until (done);
	set diagnoses end=done;
	count+1;
	dxs[count] = dx;
end;

do until (checkdone);
	set main end=checkdone;
	do i = 1 to dim(dxs);
		distance = complev(dx, dxs[i],'iln');
		if distance=0 then do;
			output match;
			leave;
		end;
		else if distance <= 10 then do;
			possible_dx = dxs[i];
			output close;
		end;
	end;
end;
stop;
run;

I'll need to combine MATCH and CLOSE, sort by dx and descending distance, and take the "first.dx".

A good article on the complev function can be found at: https://support.sas.com/kb/48/582.html

Andrew

View solution in original post

japelin · Posted 01-24-2022 12:45 PM

I think it would be difficult to implement this in code unless the criteria for determining what is considered the same and what is considered a different diagnosis is clear.
Therefore, I think it is necessary to take a step such as storing the replacement of w/ with/ in a temporary variable for matching.

In my experience, I think it is best to repeat the following steps and gradually build up the conversion pattern for the data that does not match.

For example, first exclude the matches without conversion.
Next, eliminate the matches by converting "w/" to "with/" and "with /" to "with/", and memorize this conversion pattern.
Next, eliminate matches by converting "&" to "and", and memorize this conversion pattern.
Next...
And so on.

Then, when you have some replacement patterns, you may want to create a list of the ones that match the first X characters, while removing spaces from the remaining ones using compress or something.

ballardw · Posted 01-24-2022 05:04 PM

I might suggest starting by creating a second variable, so you maintain the origin value for verification by people at different points, and "clean" that second variable such as removing "&" or "w/" or similar, make sure the number of spaces between words is consistent, case, expand or consistently replace words with abbreviations and such.

Then start attempting to match.

Or if the data has any ICD diagnosis codes start there. Of course if that is the goal that wouldn't be available would it.

DocMartin · Posted 02-04-2022 11:59 AM

O.K., I think I've found a way to do this. There are a few functions in SAS that compare two strings for the number of characters in one but not the other. So I can use two do lookups. One that looks up all the entries in the DIAGNOSES table into an array, and another that matches each entry in the array to diagnoses in the MAIN table. The key function is COMPLEV.

data match (keep = dx ptid hosp) 
     close (keep = dx ptid distance possible_dx);
array dxs[808] $70 _temporary_;
do until (done);
	set diagnoses end=done;
	count+1;
	dxs[count] = dx;
end;

do until (checkdone);
	set main end=checkdone;
	do i = 1 to dim(dxs);
		distance = complev(dx, dxs[i],'iln');
		if distance=0 then do;
			output match;
			leave;
		end;
		else if distance <= 10 then do;
			possible_dx = dxs[i];
			output close;
		end;
	end;
end;
stop;
run;

I'll need to combine MATCH and CLOSE, sort by dx and descending distance, and take the "first.dx".

A good article on the complev function can be found at: https://support.sas.com/kb/48/582.html

Andrew

Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Re: Fuzzy merge of diagnoses with multiple possibilities

Registration is open

SAS Training: Just a Click Away