topic Re: How to deal with two datasets in SAS Data Management

How to deal with two datasets

dera — Wed, 03 Oct 2018 18:03:24 GMT

Hi,

I have a data set that contains two variables : the subject ID and the conclusion of a medical exam.

ID	CONCLUSION
1	Signs of arrhytmia.
2	Hypoplastic syndrome of left heart
3	Normal exam.
4	No sign of ventricular septal defect. Arrythmia.

I need to assign a code (from another data set) according to what is written in the conclusion variable.

The data set containing the codes looks like this :

CODE	DESCRIPTION
0.0	Normal exam
1.1	Ventricular septal defect
2.2	Arrhytmia
3.3 ...	Hypoplastic left heart syndrome ...

I am looking for a way to match those two data sets and a function that would assign a code even if the expression is not the exact same. In addition to that, if it is written "no sign of", is there a way to not assign the code? Plus, if the word order is not the exact same, is it possible to assign the code anyway (see the hypoplastic left heart syndrome in my exemple).

I specify that it has to be as automatic as possible because my data set has about 20,000 observations.

Here is an exemple of what I am looking for :

ID	CONCLUSION	CODE
1	Signs of arrhytmia	2.2
2	Hypoplastic syndrome of left heart	3.3
3	Normal exam.	0.0
4	No sign of ventricular septal defect. Arrythmia.	2.2

If it is not clear enough, please tell me. I will try to make it more clear.

Thank you in advance.

Re: How to deal with two datasets

TomKari — Wed, 03 Oct 2018 18:36:42 GMT

Your question is very clear, but sadly this is a VERY challenging type of requirement to implement.

SAS has some very sophisticated text analytics tools, which cost a very sophisticated price. On the off chance that you have access to one of them, use it!

Otherwise, I think you're going to need to do some pre-processing of your "Conclusion" field, to try to get it into a reasonable form for matching (for example, the "no sign of..." type construct is always going to be a huge problem). SAS has two text matching functions, COMPGED and COMPLEV. I'm attaching an example of how it could be used below; the only problem is that the results aren't correct.

Hopefully this will move you another step closer.

Tom

data Exams;
	length Conclusion $30;
	input ID Conclusion &;
	cards;
1 Signs of arrhytmia.
2 Hypoplastic syndrome of left heart
3 Normal exam.
4 No sign of ventricular septal defect. Arrythmia.
run;

data Descriptions;
	length Description $30;
	input Code Description &;
	cards;
0.0 Normal exam
1.1 Ventricular septal defect
2.2 Arrhytmia
3.3 Hypoplastic left heart syndrome
run;

proc sql noprint;
	create table Combined as
		select e.ID, e.Conclusion, d.Code, d.Description, complev(e.Conclusion, d.Description, 1000, "il") as MatchScore
			from Exams e cross join Descriptions d
				order by MatchScore;
quit;

Re: How to deal with two datasets

Reeza — Wed, 03 Oct 2018 23:12:41 GMT

20000 is actually not too bad. Although fully automating is likely impossible there are some methods to speed this up. First clean the data up as much as possible, capitalize or lower cas everything, remove punctuation, extraneous symbols.
Second, run a proc freq on the data and check your groupings. If you see a lot of things that can be grouped together you can do those first. Then assign them the correct code. You can separate the words for each observation against the codes and merge on those, scoring more words that’s match higher and if you find a NOT or NOT FOUND or INCONCLUSIVE you set it as no. For these, verify the matches using ocular test. It’s not perfect and with 20k observations you may get just as good results as if you did it manually. Building a process means you can reuse some parts. OpenRefine is a great tool for this if you want to give that a shot as well. It’s a tool designed for data cleaning via Google and they open sourced it.