Hi everyone,
I am not sure this question fits here; if not, admins please feel free to delete it.
I would appreciate everyone's input on this question.
As you know, more than 80% of the data out there is unstructured. To make sense of it, you need to convert it to structured data for analysis.
I suspect this is relevant to many people, so let me give you my own version of the problem.
I am a physician and a researcher. Working in a busy hospital, we store gigabytes of data every day in medical notes, images, etc.
Medical notes are stored in text format (basically free text in SQL Server).
There are different types of notes, and you expect a certain type of information to be stored in the text depending on the type of the note.
Let us imagine a document describing a simple endoscopic procedure. You would expect the following information to be scattered through the text of the document:
The name of the surgeon
The name of the patient
The age of the patient
The indication for the procedure
The date, time and duration of the procedure.
Findings in the oesophagus, stomach, duodenum, colon
Therapy done during the procedure
Complications
Follow up
This information is entered as free text, natural human language.
There are tens of thousands of these documents, and transforming them into structured, analysable data is a huge (but very tempting) challenge.
I tried doing this using different approaches on a small sample (≈500 reports), and the best results I managed to obtain were with regular expressions.
Even though the results were impressive (one of my colleagues who went through some of the cases said, "I didn't know a computer could be this good"), they are far, far from good enough: if the writer deviates excessively from the pattern I program into the regular expression, the code fails spectacularly.
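To make the approach concrete, here is a minimal sketch of the kind of heading-anchored regex extraction I mean, written in Python for illustration (SAS has equivalent PRX functions). The report text, field names, and patterns are all invented for this example; real notes would need many more patterns and fallbacks, which is exactly where this approach breaks down.

```python
import re

# A toy endoscopy report (invented for illustration -- not a real note).
report = """Endoscopist: Dr. Smith
Patient: John Doe, age 67
Indication: iron deficiency anaemia
Findings: small hiatus hernia in the oesophagus; antral gastritis
Complications: none"""

# One pattern per field; each captures the text after an assumed heading.
patterns = {
    "surgeon": r"Endoscopist:\s*(.+)",
    "patient": r"Patient:\s*([^,]+)",
    "age": r"age\s*(\d+)",
    "indication": r"Indication:\s*(.+)",
    "complications": r"Complications:\s*(.+)",
}

def extract(text, patterns):
    """Return a dict of field -> first match (or None if the pattern fails)."""
    record = {}
    for field, pat in patterns.items():
        m = re.search(pat, text, flags=re.IGNORECASE)
        record[field] = m.group(1).strip() if m else None
    return record

record = extract(report, patterns)
print(record["surgeon"])  # Dr. Smith
print(record["age"])      # 67
```

The weakness is visible in the design: each pattern assumes a fixed heading, so a report that says "Operator:" instead of "Endoscopist:" silently yields None for that field.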
I realise I have two issues here:
1. Large data volumes where computing power is needed, but this is not the main question here
2. Processing unstructured data into structured data which is my main focus here.
I have been looking into text mining, but I am not sure it can do the job; this is more a matter of natural language processing.
I looked outside SAS: R seems to have some (limited?) packages to deal with this kind of issue:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Others seem to suggest Morphline, or Hadoop, etc.
So my question is:
Has anyone done this through SAS?
Is SAS at all an appropriate tool to do this?
If yes, then how?
If not, then could you please share your approach of dealing with this kind of problem?
As we store more and more data, and as the volume of stored data increases exponentially, this is going to become a more and more important problem to deal with. And if SAS (or whoever) comes up with a good solution, it will definitely be a very sought-after one ... maybe the Holy Grail of data management in the future.
Kind regards
AM