06-24-2017 08:40 AM - edited 06-24-2017 08:42 AM
I have a folder that contain mutiple text documents, placed in subfolders.
Is there a way of importing the contents of these files as a string into a dataset?
I have found some examples round the internet and in SAS community but they dont seem to be working.
THe documents are all .doc
07-06-2017 04:15 AM
Actually, it did work, and the imported text was hidden between all the gibrish lines. I now have a working solution.
Attached is the final code for anyones reference:
***************** * Get file list * *****************; libname output "C:\Users\Home\Desktop\SAS"; %let wordPath=C:\Users\Home\Desktop\Word\; * &wordPath contains all Word forms *; %let rtfPath=C:\Users\Home\Desktop\rtf\; %let fileType=.doc; * Or .docx *; filename files pipe "dir ""&wordPath*&fileType"""; data fileList; infile files lrecl=300 truncover; input line $200.; retain fileID; if not index(line,"&fileType") then delete; else fileID+1; fileName = strip(substr(line,39,199)); wordPathname="&wordPath"||left(fileName); rtfPathname="&rtfPath"||left(tranwrd(compress(fileName,,p), "&fileType", ".txt")); keep fileID fileName wordPathname rtfPathname; run; * Convert Word to RTF files * *****************************; options noxwait noxsync; x call "C:\Program Files (x86)\Microsoft Office\root\Office16\winword.exe"; data _null_; wait=sleep(3); run; fileName wordLink dde 'WinWord|System'; %macro word2rtf(inPathname,outPathname); data _null_; file wordLink; put '[FileOpen.Name = "' &inPathname '"]'; put '[FileSaveAs "' &outPathname '",6]'; put '[FileClose]'; run; %mend word2rtf; data _null_; set fileList; call execute(cats('%nrstr(%word2rtf)(' ,quote(trim(wordPathname)) ,',' ,quote(trim(rtfPathname)) ,')' )); run; data _null_; file wordLink; put '[FileExit]'; run; fileName wordLink clear; *************** * Read in data from RTF files * *******************************; %macro rtf2sas(rtfFile,fileName,surveyNum); fileName inRTF "&rtfFile"; data questions; infile inRTF lrecl=5000 truncover; input rawTxt $ 1-5000; run; data q_and_a; set questions; run; data survey_&surveyNum; length Survey $ 50; set questions; Survey="&fileName"; run; %mend rtf2sas; data _null_; set fileList; call execute('%rtf2sas('||rtfPathname||','||fileName||','|| fileID||')'); run; **************************** * Set all surveys together * ****************************; data output.Survey_All; set survey_:; run;
Having said that, I know understand why it would be better to use a VBA code first to import the word files to csv and then import csv to SAS and process the data, it is much slower when using SAS
06-24-2017 08:51 AM
What is the question exactly?
Do you want to get the list of files in directory tree? Or do you also want then then use that list to convert those files into SAS datasets?
If you have an example that can find the files with names that end with .doc then you should be able to modify the program to find other files.
06-24-2017 08:54 AM
It is not only the file names that I need.
What I need is the text in the file documents themselves.
So the final SAS table I am trying to create would contain 2 columns:
the first is the file name
The second is a string that contains the imported text from the word document
I hope that explained it further
06-24-2017 11:11 AM - edited 06-24-2017 11:13 AM
Writing up some code which automates accessing files from different folders is very doable and there are already quite a few variations of how to achieve this here in the forums.
A Word document is NOT a simple plain text file so the real question for us to understand is: How does this Word document actually look like and what do you want to read into this 2nd variable.
Even though some members of the forums don't want to download Excel or Word attachments for security reasons, I'd still suggest that you attach a sample Word document and then explain/show us how the expected result should look like (best provided as a SAS datastep creating the desired result set as a SAS table).
You are aware that a SAS variable can only hold 32KB of characters but that a Word document could be much bigger and also contain embedded documents, pictures and many more things which can't get read into SAS variables?
What you're asking for will require a very specific set of Word documents.
So: Let's first get something working for one of your Word docs and only then look into generalizing the task for multiple Word docs in multiple sub-folders.
06-24-2017 10:02 PM - edited 06-24-2017 10:04 PM
Hi Patrick and Tom
Thank you for the excellent reply, and the remarks included in your post.
I fully agree with everything you said and as you can see in the attached word document, there are multiple word elements included, such as a table, free text and a text box.
I am aware of the limitation in SAS variable size, the documents contain only text and should not exceed the 32KB in size, so that is gnot going to be a problem
I really appreciate your time and help with this.
06-24-2017 10:54 PM
SAS doesn't haven an easy way to accomplish this.
I personally prefer python for this, since you can convert the DOC to DOCX and then there are many packages that will help you parse out the content as desired.
You don't actually have a limit on the text so you can't guarantee that the fields will smaller than the 32K character limit. You'll need multiple reads to get around that, but it's not the end of the world.
06-24-2017 11:54 PM
06-24-2017 11:57 PM
Do you have SAS e-Miner and/or text miner? That will actually process these files for you...
Otherwise, I'm in the camp of using the right tool for the right job. I'd probably write a script to process the word doc and then call the script from SAS. I understand doing it all in one tool though.
06-25-2017 12:03 AM
06-25-2017 01:59 AM
Save the file as text and try that.
SAS could write a VBS file to do the conversion and then process the text file. If you have specific file structure and tags you can use perl expressions to parse it. If there are definitive structures it should be straight forward. But test it first, it should be something that's quick to check.
06-25-2017 03:02 AM
Just to be clear about requirements:
These are doc files.
They are in multiple subfolders.
You cannot uses X commands. This means that you cannot batch convert the files to txt, rtf or docx formats? Docs may be parsed as it's in an XML format. You cannot use DDE because you can't use X commands.
So you're asking how to use SAS to import a two decade old proprietary unstructured format? Without using the SAS tool specifically designed to handle this type of data?
06-25-2017 05:39 AM - edited 06-25-2017 07:21 AM
I would be using Tika for this task. Tika can extract text from many different file formats. You then can pipe the result directly into SAS.
Below code that works in my environment:
filename extract pipe 'java -jar "c:\Apache\Lucene\tika17\tika-app-1.7.jar" -t C:\temp\sample.doc'; data want; length txt $32767; retain txt; infile extract end=last; input; txt=catx(' ',txt,_infile_); if last then output; run; filename extract clear;
Tika is a great open source tool for text extraction. You can download it from here: https://tika.apache.org/download.html
SAS is using Tika for some of its product stack and depending on what SAS software/solutions you have licensed, you might be lucky and have Tika already installed (i.e. for SAS Content Categorization).
06-25-2017 06:42 AM
Thank you very much for this approach, I havent actually heard of it before, but reading through the webpage, i see it has immense potentials.
I will definatly look into this and try to use it
Need further help from the community? Please ask a new question.