04-19-2017 01:46 PM
I have about 13,000 letters in MS Word, and I need to find out which of them mention pancreatic cancer. Each letter may contain certain key terms (pancreatic cancer, pancreatic adenocarcinoma, pancreatic tumor) that I need to search for. Opening every letter by hand and looking for pancreatic cancer would be extremely tedious.
Each of these letters is saved under the patient's name and the date they were seen, e.g. doe, jane Jan 2016
I need help with the following:
1. Identify the letters that mention pancreatic cancer and create a separate list (preferably retaining MS Word format). These letters contain the name and date of birth; is there a way to extract the name and date of birth along with the pancreatic cancer match?
2. Identify patients with multiple letters in the pool of 13,000 and save those letters in a separate folder. I need to isolate patients who had multiple visits, that is, patients with more than one letter in the pool. If a patient had multiple visits (multiple letters), the name will be the same but the date after the name will differ, reflecting the date of each visit.
e.g. doe, jane Jan 2016
doe, jane Mar 2017
All the letters are stored in one folder.
I would really appreciate any help with this.
04-19-2017 02:11 PM
Can you share an example of one of these letters with any personally identifiable information such as Name and date of birth masked to look like "First Name Last Name" and 01/01/1900 or similar?
Do you have access to SAS Text and Content Analytics?
04-19-2017 02:44 PM
Here is an example of a letter for my Question #2. I do not have access to the letters for Question #1 right now, but I can share one with you in the evening, if that is OK with you. The letters for Question #1 are much simpler than the letters for Question #2.
Although each letter is saved under the patient's name, I am concerned that the file names are formatted inconsistently, e.g.
lastname, firstname hh-2015
lastname,firstname hh 2014
lastname, firstname december 2010, etc.
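One possible way to cope with the inconsistent file names is to reduce each name to a normalized "lastname,firstname" key and group on that. The following is only a minimal Python sketch under some assumptions: the function names `patient_key` and `group_letters` are made up for illustration, every file name is assumed to start with "lastname, firstname", and only the first word after the comma is taken as the first name (so two-word or hyphenated first names would be truncated).

```python
import re
from collections import defaultdict

def patient_key(filename):
    """Return a normalized 'lastname,firstname' key, or None if no comma-separated name is found."""
    # Drop a trailing .doc / .docx extension if present.
    stem = re.sub(r"\.docx?$", "", filename, flags=re.IGNORECASE)
    # Last name: everything before the first comma. First name: the first word after it.
    match = re.match(r"\s*([^,]+?)\s*,\s*([A-Za-z']+)", stem)
    if match is None:
        return None
    last, first = match.groups()
    return f"{last.lower()},{first.lower()}"

def group_letters(filenames):
    """Map each patient key to its file names, keeping only patients with more than one letter."""
    groups = defaultdict(list)
    for name in filenames:
        key = patient_key(name)
        if key is not None:
            groups[key].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

In practice the list of names could come from `os.listdir` on the folder holding the letters, and the files in each group could then be copied to a separate folder with `shutil.copy`. File names that return None would need a manual look.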
Please see the attached for a sample letter for Question #2.
I'm not sure whether I have access to SAS Text and Content Analytics. How can I check?
04-19-2017 03:25 PM
Are they doc files or docx?
When I had to create a database from a set of Word docs, my process was:
1. Convert to docx using a VBS script, which will automatically process every file in the folder.
2. Use Python scripts to parse the text out of the Word docs into a database holding all the information on the form. There are Python libraries for parsing information from Word docs.
3. Filter the form information afterwards by searching the fields extracted from each file.
AFAIK SAS doesn't have an easy way of interacting with Word docs that would make it the appropriate choice for this project. It can probably be accomplished, but there are easier ways.
DOCX files are zipped XML files so you may be able to access the XML file after the conversion and then use that instead as the basis for your search.
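As a rough illustration of the zipped-XML route, here is a minimal Python sketch that uses only the standard library. It assumes the files are already in docx format and reads the main document part, `word/document.xml`; the function names `docx_text` and `matched_keywords` are made up for this example, and the keyword list simply repeats the three terms from the original question.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
KEYWORDS = ("pancreatic cancer", "pancreatic adenocarcinoma", "pancreatic tumor")

def docx_text(path):
    """Return the document text, one line per Word paragraph (table cells included)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the pieces back together.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords found in the text, ignoring case and line breaks."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return [keyword for keyword in keywords if keyword in normalized]
```

Looping over the folder and calling `matched_keywords(docx_text(path))` for each file would give a list of candidate letters. Note that the sample form posted in this thread says "Pancreatic Neoplasm", which none of the three terms would catch, so the keyword list would probably need to be extended.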
04-19-2017 04:54 PM
Goodish news, badish news and possibly indeterminate.
First, the goodish news. You actually have a form, not a "letter". A letter is pretty free-form and would likely be a nightmare. So if the forms are all similar to the one shown and are converted to TXT files, the header information (the part through HISTORY) could be read pretty easily. From the example, it looks like you would have two fields to search for your key words: Reason and History.
The reason the data coming from a form is important is that when exported it will have a pretty regular layout.
(for those joining the conversation the example heading looks like this)
Patient History Procedure Date: Time: In Pt Out Pt MOBAPT Patient Name: FirstName LastName dob: 01-01-00 Phone:XXX-XXX-XXXX Alt: XXX-XXX-XXXX— daughter Procedure Scheduled: EGD/EUS/FNA Reason: Pancreatic Neoplasm on CT History: EUS performed 5-15-8 revealed food in stomach. To repeat after two days of clear liquids with erythromycin 250mg one bid x 2 days. Chronic back pain with SOB. CT of chest 4-22-8 revealed mildly prominent right paratracheal lymph node. Remaining lymph nodes in the hilar & mediastinal areas are densely calcified. A 9mm hyperenhancing lesion in the body of the pancreas anteriorly raises possibility for endocrine neoplasm of pancreas.
Two key points: SAS can read multiple lines into a single record, and if a field is preceded by fixed label text, SAS can search for that text and start reading right after it, for example: input @"Patient Name:" LASTNAME $ FIRSTNAME $;
SAS finds where the text appears and starts reading there; the first word is treated as the LastName value and the following word as FirstName. The nastier bit to deal with is names entered like Garcia Rodriguez, Fred, because the space embedded in the last name breaks simple space-delimited reading. If some forms were entered with the names reversed, that is a "welcome to the real world" type of problem. Incomplete dates are nastier still; if you know they are there, read them as text and then parse. If the dates are not too bad, SAS has a special informat, ANYDTDTE, that may help, but it cannot deal with a date that consists of only a 2- or 4-digit year. You would have to decide what to do with those.
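For anyone taking the script route instead, the same "find the label, then read what follows" idea can be sketched in Python. This is only an illustration, not a replacement for the SAS approach: the function name `read_fields` is made up, the label list is taken from the sample heading shown above, and it assumes every form spells the labels exactly that way.

```python
import re

# Labels as they appear in the sample heading; other forms may differ.
LABELS = ["Patient Name:", "dob:", "Phone:", "Procedure Scheduled:", "Reason:", "History:"]

def read_fields(text, labels=LABELS):
    """Return {label: value}, where each value runs from its label to the next label found."""
    # Collapse line breaks and repeated spaces so multi-line fields read as one record.
    flat = re.sub(r"\s+", " ", text)
    hits = []
    for label in labels:
        pos = flat.find(label)
        if pos != -1:
            hits.append((pos, label))
    hits.sort()
    fields = {}
    for i, (pos, label) in enumerate(hits):
        # The value ends where the next label starts, or at the end of the text.
        end = hits[i + 1][0] if i + 1 < len(hits) else len(flat)
        fields[label.rstrip(":")] = flat[pos + len(label):end].strip()
    return fields
```

The Reason and History values returned this way are the two fields that would be searched for the key words. Name order, embedded spaces in last names, and incomplete dates remain exactly the same problems here as in SAS.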
The badish news: If you need the rest of the data from the rest of the form which is in a Word table followed by some more relatively structured text fields, then Base SAS is not going to be the tool you want to use.
The indeterminate news is that the data step code to read the TXT file format is going to take a bit of work, involving much more effort than a few clicks to generate.
So, do you think you can get someone to write a script to turn those thousands of forms into individual TXT documents?
One might also ask at this point whether the data mentioned has already been entered into a database somewhere. It may be much easier to deal with there, or at least to export and read into SAS, than dealing with 13,000 individual documents.