BookmarkSubscribeRSS Feed
dera
Obsidian | Level 7

Hi,

 

I have a data set that contains two variables : the subject ID and the conclusion of a medical exam.

 

IDCONCLUSION
1Signs of arrhytmia.
2Hypoplastic syndrome of left heart
3Normal exam.
4No sign of ventricular septal defect. Arrythmia.

 

I need to assign a code (from another data set) according to what is written in the conclusion variable.

The data set containing the codes looks like this :

CODEDESCRIPTION
0.0Normal exam
1.1Ventricular septal defect
2.2Arrhytmia

3.3

...

Hypoplastic left heart syndrome

...

 

I am looking for a way to match those two data sets and a function that would assign a code even if the expression is not the exact same. In addition to that, if it is written "no sign of", is there a way to not assign the code? Plus, if the word order is not the exact same, is it possible to assign the code anyway (see the hypoplastic left heart syndrome in my exemple).

 

I specify that it has to be as automatic as possible because my data set has about 20,000 observations.

 

Here is an exemple of what I am looking for :

 

IDCONCLUSIONCODE
1Signs of arrhytmia2.2
2Hypoplastic syndrome of left heart3.3
3Normal exam.0.0
4No sign of ventricular septal defect. Arrythmia.2.2

 

If it is not clear enough, please tell me. I will try to make it more clear.

 

 

Thank you in advance.

2 REPLIES 2
TomKari
Onyx | Level 15

Your question is very clear, but sadly this is a VERY challenging type of requirement to implement.

 

SAS has some very sophisticated text analytics tools, which cost a very sophisticated price. On the off chance that you have access to one of them, use it!

 

Otherwise, I think you're going to need to do some pre-processing of your "Conclusion" field, to try to get it into a reasonable form for matching (for example, the "no sign of..." type construct is always going to be a huge problem). SAS has two text matching functions, COMPGED and COMPLEV. I'm attaching an example of how it could be used below; the only problem is that the results aren't correct.

 

Hopefully this will move you another step closer.

 

Tom

 

data Exams;
	length Conclusion $30;
	input ID Conclusion &;
	cards;
1 Signs of arrhytmia.
2 Hypoplastic syndrome of left heart
3 Normal exam.
4 No sign of ventricular septal defect. Arrythmia.
run;

data Descriptions;
	length Description $30;
	input Code Description &;
	cards;
0.0 Normal exam
1.1 Ventricular septal defect
2.2 Arrhytmia
3.3 Hypoplastic left heart syndrome
run;

proc sql noprint;
	create table Combined as
		select e.ID, e.Conclusion, d.Code, d.Description, complev(e.Conclusion, d.Description, 1000, "il") as MatchScore
			from Exams e cross join Descriptions d
				order by MatchScore;
quit;
Reeza
Super User
20000 is actually not too bad. Although fully automating is likely impossible there are some methods to speed this up. First clean the data up as much as possible, capitalize or lower cas everything, remove punctuation, extraneous symbols.
Second, run a proc freq on the data and check your groupings. If you see a lot of things that can be grouped together you can do those first. Then assign them the correct code. You can separate the words for each observation against the codes and merge on those, scoring more words that’s match higher and if you find a NOT or NOT FOUND or INCONCLUSIVE you set it as no. For these, verify the matches using ocular test. It’s not perfect and with 20k observations you may get just as good results as if you did it manually. Building a process means you can reuse some parts. OpenRefine is a great tool for this if you want to give that a shot as well. It’s a tool designed for data cleaning via Google and they open sourced it.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 2 replies
  • 828 views
  • 4 likes
  • 3 in conversation