Processing unstructured data into structured data for dummies

Reply
Regular Contributor
Posts: 182

Processing unstructured data into structured data for dummies

[ Edited ]

Hi everyone,

I am not sure this question fits here, if not then admin please feel free to delete.

I would appreciate everyone's input on this question.

As you know, more than 80% of data out there are unstructured data. In order to make sense of these data, you need to convert it to structured data for analysis.

I guess this is relevant to many people out there, let me give you my own version of the problem.

I am a physician, and a researcher. Working in a busy hospital, we store gigabytes of data everyday in medical notes, images, etc.

Medical notes are stored in text format (basically text in SQL sever). 

There are different types of notes, and you expect certain type of information stored int he text depending on the type of the note.

Let us imaging a document describing a simple endoscopic procedure. You would expect the following information to be scattered in the text of the document:

The name of the surgeon

The name of the patient

The age of the patient

The indication of the procedure

The date, time and duration of the procedure.

Findings in the oesophagus, stomach, duodenum, colon

Therapy done during the procedure

Complications

Follow up

 

 

This information is entered as free text, natural human language.

There are tens of thousands of these documents, transforming them into a structured analysable data is a huge (but a very tempting challenge)

I tried doing this using different approaches, analysing a small  sample(≈500 reports), and the best results I managed to obtain were through using regular expressions. 

Even through the results were impressive (as one of my colleagues who went through some of the cases said: I didn't know a computer could be this good), but they are far, far from good enough, and if the writer deviates excessively from the pattern i program into the regular expression, the code fails spectacularly.

I realise I have two issues here:

1. Large data where computing power is needed, but this is not the main question here

2. Processing unstructured data into structured data which is my main focus here.

I have been looking into text mining, but not sure that can do the job.  This is more of natural human language analysis.

I looked outside SAS: R seems to have some (limited?) packages to deal with this kind of issue:

https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Others seems to suggest Morphline, or Hadoop...etc.

So my question is:

Has anyone done this through SAS?

Is SAS at all an appropriate tool to do this?

If yes, then how?

If not, then could you please share your approach of dealing with this kind of problem?

As we store more and more data, and as the volume of stored data increased exponentially, this is going to be a more and more important problem to deal with. And if SAS or whoever comes with a good solution to this, it definitely is going to be a very sought after solution ... maybe the Holy Grail of data management in the future. 

 

 

Kind regards

AM

 

Super User
Super User
Posts: 9,437

Re: Processing unstructured data into structured data for dummies

The question is one faced by everyone, to differing degrees.  The pharma industry has made several large moves towards data standardisation - for instance CDISC data modelling.  But each problem generally needs it own inputs even with standardisation.

 

The key problem your facing here is the lack of standardisation at capture.  Now from a UI deisng point of view having one box where all of:

The name of the surgeon

The name of the patient

The age of the patient

The indication of the procedure

The date, time and duration of the procedure.

Findings in the oesophagus, stomach, duodenum, colon

Therapy done during the procedure

Complications

Follow up

Can be typed (written) in in any order or way is easily the simplest, and the easiest for the user.  However this provides the least usable data for the end product.  Fixed forms with each part in separate boxes with validation tends to be the other end of the spectrum, harder for the user, but providing a cleaner dataset for further processes.  Note this mimics the problem in computing in general where users want to use Excel as it is open and flexible, and programmers want databases with nice fixed fields.  

At the end of the day you have to arrive at a decision where the effort is going to be put, either at the user, or the output end and resource/bill accordingly.  For a clinical trial for instance, there are requirements which have to be met, so the strict end of the scale is necessary.  In a hospital it may be that the doctors time is more important than the data.  So with that you can then plan the next step, for instance should all the data be sent to a specialist coding/medical group to be converted?  You could for instance as you are doing, extract certain strings from text, but do you have the medical know how to ascertain relationships between them, specialist terms/drug classifications etc.

 

My personal opinion - and I am a programmer so heavily biased - is that woolly data capture should be a thing of the past.  With advances in tablets/phones, embedded medical devices, and other such things, data should be being captured using tools which create usable data as much as possible.  For instance lets take the fields you provide:

The name of the surgeon - this can be derived by the login to the tablet

The name of the patient - a patient id can be entered, all other demography data is linked via the patient id 

The age of the patient - as above

 

The indication of the procedure - probably a free text field with autoencoding attached

The date, time and duration of the procedure. - entry fields for date/time fixed format, duration calculated

Findings in the oesophagus, stomach, duodenum, colon  - separate boxes for each, free text, can then be searched.

Therapy done during the procedure - separate box, possibly some sort of find box for lists of therapies

Complications - free text?

Follow up - free text plus date selector

 

So I broke the two blocks up here as the first, general demographic information on user and subject is not really something which needs to be entered at all. 

The second block is where its at, now I would say, some changes to collection, to have fixed boxes for know data items would halve your data troubles.  Maybe some sort of notation, i.e. delimit text may help for the free text.

 

So after that rambling, there are tools out there, and SAS provides some itself:

https://www.sas.com/en_us/software/text-miner.html

 

But I still feel personally that more emphasis should be put on the capture tools to get it right, or closer to the end product at entry, rather than post processing somewhere down the line.

Regular Contributor
Posts: 182

Re: Processing unstructured data into structured data for dummies

Thank you @RW9 for taking the time to answer this question.

I fully agree with you on every single point you mentioned and especially regarding the importance capturing data in a structured form to begin with.

However, and as you know too well, the world is not that structured nor that easy to deal with. Being a physician, a researcher and a dedicated coder/programmer, I have been dealing with the full spectrum of the problem. I have worked with pharmaceutical sponsored studies, the data capture forms are tedious and no one outside these well sponsored studies will ever use these detailed forms on a daily basis. I tried designing webforms, for use on desktops and smart phones. I tried splitting data collection so that the patients will fill their part and the physicians will do theirs. I even used digitally readable form (where the responder would check a textbox and the form would then be converted to structured data after scanning the answer sheet and putting it through an OCR software. The problem: missing data (huge problem) as people ignore fileds or dont have time to fill in the forms. Forcing them to complete a filed is not the answer either. The more complex and structured form you create to collect more data, the less answers you get. Combining different modalities makes some aspect of the problem easier but can quickly build up to a very complex system relative to the task needed to be accomplished. 

Employing clerks/data analysts to go through captured data is very slow, very expensive and prone t errors. 

The other problem is, as you know, that there are numerous programs and solutions used out there for capturing information. Each department and each agency has its own systems. IT units in different companies adapt different solutions and systems. However, at the end of the day, all the data is stored in a database, be it Oracle, or SQL or something else, and it is there that work should be done.

You would imagine that, living in a time when Googles AI beats the world nr 1 at "Go" that we would have the tools to do more complex tasks than just working with structured data (just a metaphore, I understand the difference in the problem description here)

What I am doing now is a combination of bits and pieces to solve the problem. Matching the text against a table with the name of the surgeons would tells me who did the procedure, patient name is extracted by getting the string between (Mr/Mrs/Miss/Master) and a date regular expression (representing date of birth) and so forth

Then splitting the text into sentences.

Then applying  keywords (like "Stomach)  to extract the sentence containing information about the stomach.

In that sentence, I would look for expected results (like Ulcer). An example of the problem at this level is the following sentences return a positive result for ulcer: (there was an ulcer) ( there was no ulcer) (no identifiable ulcer was seen) (examination reveled healing ulcer) (After through inspection, an ulcer couldnt be seen) etc will all be picked by reg expression as containing the key-term "ulcer" and the code is not that smart to understand negation, and it is virtually impossible to include every possible way of negating the presence of an ulcer in a simple code format. Maybe text miner could pick this up but note sure if it could be coded and embedded at part of a SAS code. 

This is just one example of a problem that you might run into with reg expression (and the list could get very long)

Yet, you would think (maybe with machine learning?) that a set of general rules could be programmed into the code.

Like:

IF sentence X contains the term "ulcerAND there is no negation THEN Ulcer_variable=1 ELSE Ulcer_Variable=0

 

Maybe wishful thinking, but as data volume, computing power and the sophistication of coding (and human laziness) increase, asking humans to input data using structured forms is not the way forward. We need tools to understand and analyse natural human language

 

 

As I said, maybe wishful thinking...  at least today

 

Thanks 

AM

SAS Super FREQ
Posts: 9,325

Re: Processing unstructured data into structured data for dummies

Hi:
A lot of what you describe sounds like the kind of data mining that people perform using SAS Text Miner tools (used with SAS Enterprise Miner). For an example of Data Mining with Text Miner, there is an Ask the Expert webinar that you can watch here: https://communities.sas.com/t5/Ask-the-Expert/SAS-Text-Miner-Getting-Started/ta-p/350616

cynthia
Regular Contributor
Posts: 150

Re: Processing unstructured data into structured data for dummies

The problem is well known. I won't go into more than that but the issues is that SAS is not the language to handle it. SAS, under the covers, has the capability of doing a lot of the heavy lifting but syntactically, it is not the appropriate choice.

 

Look into languages that are structured differently (object-oriented). I use C# but easily found a library:

 

https://www.nrecosite.com/nlp_ner_net.aspx

 

You should be able to find similar stuff in java or python. Python is probably the slowest so keep that in mind.

Super User
Posts: 23,323

Re: Processing unstructured data into structured data for dummies

I've done this with Python before, but the reports usually do have some structure. 

 

The first step was bringing all the text into an application in a semi-structured form. Then we decided on the values we needed for a specific analysis and used some text mining algorithms to set the values.  Those are then used for the analysis.

 

It doesn't seem to matter what language you use, but the process is generally the same at the moment. 

Text mining is a more generic topic, but in this case it's text mining on specific data points which differ...and combining the unstructured physician notes with the relevant billing and coding data to build analytical models. 

 

There are custom applications and groups that work on decoding/analyzing EMR data. Search google "EMR analytics".

 

Are you using EPIC or a different EMR?

 

This is not a trivial exercise by any means, we have an entire research group solely dedicated to that. Though socialized medicine means all patient records can be centralized within the government, both scary and powerful. 

 

http://alberta.cpcssn.ca/what-we-do/

Regular Contributor
Posts: 182

Re: Processing unstructured data into structured data for dummies

Thank you Reeza,

Just a quick question, I presume you used regular expressions with Python, which can also be used within SAS.

Is that the main approach you use? ie, what is the advantage of going with Python vs. SAS with regular expressions?

I did come across natural language processing toolkit in Python, not sure how good that is...

I am not that familiar with Python, but familiar with other programming languages, and R. If Python provides a strong tool to analyse natural language that is not provided through other programming environment then it is worthwhile knowing that and learning the language.

Keen to hear your feedback.

 

 

Super User
Posts: 23,323

Re: Processing unstructured data into structured data for dummies

We used python because the data came in Word and/or PDF docs and Python had better packages to read the data. And it was better at parsing the text. 

 

Yes, most of it was regular expression matching looking for key words, but I didn't code that part (I'm allergic to regular expressions Smiley Wink). 

 

 

 

 

Ask a Question
Discussion stats
  • 7 replies
  • 618 views
  • 6 likes
  • 5 in conversation