BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DocMartin
Quartz | Level 8

I have two data sets. One is the main data set "MAIN"  which has the variables ptid (patient id), hosp (name of hospital) and dx (principal diagnosis). The other data set "DIAGNOSES" has just one variable DX, which is a list of common diagnoses. I want to see if the diagnosis in MAIN is already in the data set DIAGNOSES.

data main; input @1 ptid $4 @6 hosp $6 @10 dx $65;
cards;
abcd JHop acute myocardial infarction w/ comorbidity
swer UMD  acute myocardial infarction with comorbidity
pppp StJo diabetes
......
;
run;

data diagnoses; input @1 dx $65;
cards;
acute myocardial infarction w/ comorbidity
diabetes 
gastric obstruction
hematoma
......
;

Normally getting the two rable to match merge would be simple. The problem is that the variable dx in MAIN can have multiple ways of being captured. For example, these should all be the same dx:

 

acute myocardial infarction w/ comorbidity

acute myocardial infarction with comorbidity

acute myocardial infarction with / comorbidity

 

A similar but different diagnosis is:

 

acute myocardial infarction w/ major comorbidity

acute myocardial infarction, major comorbidity

acute myocardial infarction & major comorbidity

 

These small spelling quirks make direct matching impossible (there are close to 1000 values for dx in DIAGNOSES).

Is there any way to match on similar but not exactly the same text string?

 

Thanks!

 

Andrew

 

1 ACCEPTED SOLUTION

Accepted Solutions
DocMartin
Quartz | Level 8

O.K., I think I've found a way to do this. There are a few functions in SAS that compare two strings for the number of characters in one but not the other. So I can use two do lookups. One that looks up all the entries in the DIAGNOSES table into an array, and another that matches each entry in the array to diagnoses in the MAIN table. The key function is COMPLEV.

 

data match (keep = dx ptid hosp) 
     close (keep = dx ptid distance possible_dx);
array dxs[808] $70 _temporary_;
do until (done);
	set diagnoses end=done;
	count+1;
	dxs[count] = dx;
end;

do until (checkdone);
	set main end=checkdone;
	do i = 1 to dim(dxs);
		distance = complev(dx, dxs[i],'iln');
		if distance=0 then do;
			output match;
			leave;
		end;
		else if distance <= 10 then do;
			possible_dx = dxs[i];
			output close;
		end;
	end;
end;
stop;
run;

I'll need to combine MATCH and CLOSE, sort by dx and descending distance, and take the "first.dx".

A good article on the complev function can be found at: https://support.sas.com/kb/48/582.html

 

Andrew

View solution in original post

3 REPLIES 3
japelin
Rhodochrosite | Level 12
I think it would be difficult to implement this in code unless the criteria for determining what is considered the same and what is considered a different diagnosis is clear.
Therefore, I think it is necessary to take a step such as storing the replacement of w/ with/ in a temporary variable for matching.

In my experience, I think it is best to repeat the following steps and gradually build up the conversion pattern for the data that does not match.

For example, first exclude the matches without conversion.
Next, eliminate the matches by converting "w/" to "with/" and "with /" to "with/", and memorize this conversion pattern.
Next, eliminate matches by converting "&" to "and", and memorize this conversion pattern.
Next...
And so on.

Then, when you have some replacement patterns, you may want to create a list of the ones that match the first X characters, while removing spaces from the remaining ones using compress or something.
ballardw
Super User

I might suggest starting by creating a second variable, so you maintain the origin value for verification by people at different points, and "clean" that second variable such as removing "&" or "w/" or similar, make sure the number of spaces between words is consistent, case, expand or consistently replace words with abbreviations and such.

 

Then start attempting to match.

 

Or if the data has any ICD diagnosis codes start there. Of course if that is the goal that wouldn't be available would it.

DocMartin
Quartz | Level 8

O.K., I think I've found a way to do this. There are a few functions in SAS that compare two strings for the number of characters in one but not the other. So I can use two do lookups. One that looks up all the entries in the DIAGNOSES table into an array, and another that matches each entry in the array to diagnoses in the MAIN table. The key function is COMPLEV.

 

data match (keep = dx ptid hosp) 
     close (keep = dx ptid distance possible_dx);
array dxs[808] $70 _temporary_;
do until (done);
	set diagnoses end=done;
	count+1;
	dxs[count] = dx;
end;

do until (checkdone);
	set main end=checkdone;
	do i = 1 to dim(dxs);
		distance = complev(dx, dxs[i],'iln');
		if distance=0 then do;
			output match;
			leave;
		end;
		else if distance <= 10 then do;
			possible_dx = dxs[i];
			output close;
		end;
	end;
end;
stop;
run;

I'll need to combine MATCH and CLOSE, sort by dx and descending distance, and take the "first.dx".

A good article on the complev function can be found at: https://support.sas.com/kb/48/582.html

 

Andrew

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 513 views
  • 1 like
  • 3 in conversation