BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ammarhm
Lapis Lazuli | Level 10

Hi everyone.

I have a folder that contain mutiple text documents, placed in subfolders.

Is there a way of importing the contents of these files as a string into a dataset?

I have found some examples round the internet and in SAS community but they dont seem to be working.

THe documents are all .doc

 

Kind regards

1 ACCEPTED SOLUTION

Accepted Solutions
ammarhm
Lapis Lazuli | Level 10

Actually, it did work, and the imported text was hidden between all the gibrish lines. I now have a working solution. 

Thank you @Tom @Reeza @Ksharp @Patrickfor all the help

Attached is the final code for anyones reference:

 

*****************
 * Get file list *
 *****************;
 libname output "C:\Users\Home\Desktop\SAS";
 %let wordPath=C:\Users\Home\Desktop\Word\;
 * &wordPath contains all Word forms *;
 %let rtfPath=C:\Users\Home\Desktop\rtf\;
 %let fileType=.doc;   * Or .docx *;
 
filename files pipe "dir ""&wordPath*&fileType"""; 
data fileList;
infile files lrecl=300 truncover;
input line $200.;
retain fileID;   
if not index(line,"&fileType") then delete;
else fileID+1;
fileName = strip(substr(line,39,199));
wordPathname="&wordPath"||left(fileName);
rtfPathname="&rtfPath"||left(tranwrd(compress(fileName,,p), "&fileType", ".txt"));
keep fileID fileName wordPathname rtfPathname;
run;


 * Convert Word to RTF files *
 *****************************;
 options noxwait noxsync;
 x call "C:\Program Files (x86)\Microsoft Office\root\Office16\winword.exe";

data _null_;   
 wait=sleep(3);
 run;
 fileName wordLink dde 'WinWord|System';

%macro word2rtf(inPathname,outPathname);
data _null_;
  file wordLink;
  put '[FileOpen.Name = "' &inPathname '"]';
  put '[FileSaveAs "' &outPathname '",6]'; 
  put '[FileClose]';
run;
%mend word2rtf;

data _null_;
  set fileList;
  call execute(cats('%nrstr(%word2rtf)('
                   ,quote(trim(wordPathname))
                   ,','
                   ,quote(trim(rtfPathname))
                   ,')'
  )); 
run;

data _null_;
file wordLink;
 put '[FileExit]';
 run;
fileName wordLink clear;

***************
 * Read in data from RTF files *
 *******************************;
  %macro rtf2sas(rtfFile,fileName,surveyNum);
fileName inRTF "&rtfFile";
 data questions;
  infile inRTF lrecl=5000 truncover;
 input rawTxt $ 1-5000;

run;

data q_and_a; 
set questions; 

run;

data survey_&surveyNum; 
length Survey $ 50; 
set questions; 
Survey="&fileName"; 
run; 

%mend rtf2sas;
data _null_; 
set fileList; 


call execute('%rtf2sas('||rtfPathname||','||fileName||','||
 fileID||')'); 

run;

**************************** * Set all surveys together * ****************************; 
data output.Survey_All;
set survey_:;
run;

Having said that, I know understand why it would be better to use a VBA code first to import the word files to csv and then import csv to SAS and process the data, it is much slower when using SAS

View solution in original post

26 REPLIES 26
Tom
Super User Tom
Super User

What is the question exactly?

Do you want to get the list of files in directory tree?  Or do you also want then then use that list to convert those files into SAS datasets?

If you have an example that can find the files with names that end with .doc then you should be able to modify the program to find other files.

ammarhm
Lapis Lazuli | Level 10

It is not only the file names that I need.

What I need is the text in the file documents themselves. 

So the final SAS table I am trying to create would contain 2 columns:

the first is the file name

The second is a string that contains the imported text from the word document

I hope that explained it further

Patrick
Opal | Level 21

@ammarhm

Writing up some code which automates accessing files from different folders is very doable and there are already quite a few variations of how to achieve this here in the forums.

 

A Word document is NOT a simple plain text file so the real question for us to understand is: How does this Word document actually look like and what do you want to read into this 2nd variable.

 

Even though some members of the forums don't want to download Excel or Word attachments for security reasons, I'd still suggest that you attach a sample Word document and then explain/show us how the expected result should look like (best provided as a SAS datastep creating the desired result set as a SAS table).

 

You are aware that a SAS variable can only hold 32KB of characters but that a Word document could be much bigger and also contain embedded documents, pictures and many more things which can't get read into SAS variables?

What you're asking for will require a very specific set of Word documents.

 

So: Let's first get something working for one of your Word docs and only then look into generalizing the task for multiple Word docs in multiple sub-folders.

ammarhm
Lapis Lazuli | Level 10

@Patrick @Tom @Reeza

Hi Patrick and Tom

Thank you for the excellent reply, and the remarks included in your post.

I fully agree with everything you said and as you can see in the attached word document, there are multiple word elements included, such as a table, free text and a text box.

I am aware of the limitation in SAS variable size, the documents contain only text and should not exceed the 32KB in size, so that is gnot going to be a problem

I really appreciate your time and help with this.

Kind regards

 

Reeza
Super User

SAS doesn't haven an easy way to accomplish this. 

 

I personally prefer python for this, since you can convert the DOC to DOCX and then there are many packages that will help you parse out the content as desired.  

 

You don't actually have a limit on the text so you can't guarantee that the fields will smaller than the 32K character limit. You'll need multiple reads to get around that, but it's not the end of the world. 

ammarhm
Lapis Lazuli | Level 10
Thank you Reeza
I still believe it is better to use on system rather than two different to achieve the task. I can import the word document using VBA otherwise but i prefer doing the whole thing through SAS though it might not be the optimal approach
Reeza
Super User

Do you have SAS e-Miner and/or text miner? That will actually process these files for you...

 

Otherwise, I'm in the camp of using the right tool for the right job. I'd probably write a script to process the word doc and then call the script from SAS.  I understand doing it all in one tool though. 

ammarhm
Lapis Lazuli | Level 10
Unfortunately I dont have these packages
I was hoping to do it all through SAS and I have seen some people who managed to do it, but I am not able to implement a solution to this specific case
Reeza
Super User

Save the file as text and try that. 

SAS could write a VBS file to do the conversion and then process the text file. If you have specific file structure and tags you can use perl expressions to parse it. If there are definitive structures it should be straight forward. But test it first, it should be something that's quick to check. 

ammarhm
Lapis Lazuli | Level 10
Thanks Reeza
I still hope someone out there will have a SAS 9nly approach. Implementing command X in SAS is not always the preferred option.thanks again
Reeza
Super User

Just to be clear about requirements:

These are doc files.

They are in multiple subfolders. 

You cannot uses X commands. This means that you cannot batch convert the files to txt, rtf or docx formats? Docs may be parsed as it's in an XML format. You cannot use DDE because you can't use X commands. 

 

So you're asking how to use SAS to import a two decade old proprietary unstructured format? Without using the SAS tool specifically designed to handle this type of data? 

ammarhm
Lapis Lazuli | Level 10

Thank you Reeza

i understand, I guess I will have to do it with X commands and DDE

But I hope I wont need to use VBA, at least

Kind regards

 

Patrick
Opal | Level 21

@ammarhm

I would be using Tika for this task. Tika can extract text from many different file formats. You then can pipe the result directly into SAS.

 

Below code that works in my environment:

filename extract pipe 'java -jar "c:\Apache\Lucene\tika17\tika-app-1.7.jar" -t C:\temp\sample.doc';
data want;
  length txt $32767;
  retain txt;

  infile extract end=last;
  input;

  txt=catx(' ',txt,_infile_);
  if last then output;
run;
filename extract clear;

Capture.JPG

 

Tika is a great open source tool for text extraction. You can download it from here: https://tika.apache.org/download.html 

 

SAS is using Tika for some of its product stack and depending on what SAS software/solutions you have licensed, you might be lucky and have Tika already installed (i.e. for SAS Content Categorization).

ammarhm
Lapis Lazuli | Level 10

@Patrick

Thank you very much for this approach, I havent actually heard of it before, but reading through the webpage, i see it has immense potentials.

I will definatly look into this and try to use it 

Kind regards

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 26 replies
  • 2795 views
  • 5 likes
  • 5 in conversation