BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GKalema
Fluorite | Level 6

Dear All,

 

I have a word document (see example attachment) that I would like to get into SAS. I would like to extract only the TOC and the table title (this is supposed to be the same but sometimes, titles are edited manually introducing differences/mistakes). The following code imports Example2.docx successfully (following Jay Zhou's paper titled "Importing Data from Microsoft Word into SAS");

OPTIONS NOXWAIT NOXSYNC;
%let rc=%sysfunc(SYSTEM(START WINWORD));

DATA _NULL_;
  X=SLEEP(2); * PAUSE SAS PROCESSING FOR 2 SECONDS WHILE WORD OPENS;
RUN;

FILENAME word DDE "WINWORD|C:\Temp\Example2.docx" NOTAB;
FILENAME word DDE 'WINWORD|SYSTEM';

DATA _NULL_;
  FILE word;
  PUT '[FileOpen.Name = "' "C:\Temp\Example2.docx" '"]';
  PUT "[EditSelectAll]";
  PUT "[EditCopy]";
  PUT '[FileClose]';
  PUT '[AppMinimize]';
RUN;

%let rc=%sysfunc(SYSTEM(START EXCEL));

DATA _NULL_;
  X=SLEEP(2); * PAUSE SAS PROCESSING FOR 2 SECONDS WHILE EXCEL OPENS AND LOADS THE SHEET;
RUN;

FILENAME EXCEL DDE 'EXCEL|SYSTEM';

DATA _NULL_;
FILE excel;
  PUT '[error(false)]';
  PUT '[paste]';
  PUT '[SAVE.AS("C:\temp\temp.xlsx")]';
  PUT '[Select.Last.Cell()]';
  PUT '[Copy]';
  PUT '[quit]';
RUN;

PROC IMPORT OUT = temp
  DATAFILE = "C:\temp\temp.xlsx"
  DBMS = EXCELCS REPLACE;
RUN;

The problem is that the number of tables and the length/size of each table is sometimes very limiting (hundreds of tables, hundreds of pages) in terms of computational time when pasting the copied text from word to Excel.

 

A work around could be to go directly from Word to SAS without intermittently invoking Excel. Something like the following;

OPTIONS NOXWAIT NOXSYNC;
%let rc=%sysfunc(SYSTEM(START WINWORD));

DATA _NULL_;
  X=SLEEP(2); 
RUN;

FILENAME word DDE "WINWORD|C:\Temp\Example2.docx" NOTAB;
FILENAME word DDE 'WINWORD|SYSTEM';

DATA worddata;
  INFILE word;
  INPUT;
RUN;

I have tried a number of things with the last data step but I don't seem to get this to work. How do I efficiently go from Word to SAS avoiding Excel as a bridge? Note that the number of tables will vary between files.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Essentially once you have the word file open, execute a SAVE AS to convert it to a text file. There are probably VBS scripts that could convert the files to text files in batch. 

 

If the files are DOCX you 'could' try to parse the XML if you're feeling adventurous. It would mean changing the extension to .zip, unzipping the file and then finding the components you need. Once you have that set up it should be relatively simple to automate it for all your files.

 

Here's a program I used to convert the DOC files to DOCX. I think you just need to find the 'number' for a text file and change the 12 to that number. It will automatically do all files in the folder. 

https://gist.github.com/statgeek/c7cab12c48133abefc2b4c84f264b290

View solution in original post

10 REPLIES 10
art297
Opal | Level 21

If you have IML, which by definition can run r, there's an r package for that. see: https://www.r-bloggers.com/using-r-to-get-data-out-of-word-docs/ and https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwjN1oa8-P7UA...

 

Art, CEO, AnalystFinder.com

 

Reeza
Super User

Try using DDE to save the word document as a text file instead of going to Excel and then parse the text files instead. Someone else did a similar project recently so if you search and ask him, maybe he can help out with the code he ended up using.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

As another alternative, could you not run a VBA script from within Word to export the information you need as a text file for futher processing - doesn't need to be in the same file, have an empty word do, which loads your word doc, processess it and dumps the required information out to a text file.  This would be far simpler than trying to read directly into SAS.  Note that your current code uses DDE which is very old and probably wont work on half the things you would want to.

An example:

https://stackoverflow.com/questions/3628222/extract-headings-and-pagenumber-of-table-of-contents-of-...

GKalema
Fluorite | Level 6

Many thanks for the quick suggestions.

 

For 1, I will try calling R in IML, seems interesting to me (have interacted SAS and R in the not-too-distant past to use the copula package from R). However, I would prefer to have the code fully SAS based because one always needs to configure R to interact with SAS which may be uncomfortable for other users of the macro I am writing.

 

For 2, replacing the excel with a text file via DDE should be a good solution, if I can get it implemented. I have been searching for a while on the topic but am yet to get this.

 

For 3, I have so many of these word files (my solution will be applied to probably hundreds of them). I am not familiar with VBA and have no idea how I would write that script. 

Reeza
Super User

Essentially once you have the word file open, execute a SAVE AS to convert it to a text file. There are probably VBS scripts that could convert the files to text files in batch. 

 

If the files are DOCX you 'could' try to parse the XML if you're feeling adventurous. It would mean changing the extension to .zip, unzipping the file and then finding the components you need. Once you have that set up it should be relatively simple to automate it for all your files.

 

Here's a program I used to convert the DOC files to DOCX. I think you just need to find the 'number' for a text file and change the 12 to that number. It will automatically do all files in the folder. 

https://gist.github.com/statgeek/c7cab12c48133abefc2b4c84f264b290

GKalema
Fluorite | Level 6

Hi Reeza,

 

Indeed your script converts all .doc files in a directoy to .docx files. In order to convert to .txt files, not only do I need to change the number '12' in the script   

oWord.ActiveDocument.SaveAs oFile.path & "X", 12

to '2' but I also need to change the naming of the file because the above line only adds an "x" to the extension, so it becomes .docx.

 

How do I specify a complete new extension '.txt'?

 

Reeza
Super User

@GKalema wrote:

Hi Reeza,

 

Indeed your script converts all .doc files in a directoy to .docx files. In order to convert to .txt files, not only do I need to change the number '12' in the script   

oWord.ActiveDocument.SaveAs oFile.path & "X", 12

to '2' but I also need to change the naming of the file because the above line only adds an "x" to the extension, so it becomes .docx.

 

How do I specify a complete new extension '.txt'?

 


I don't know...I would recommend checking StackOverflow or a VBS site instead of SAS at that point, since this wouldn't be considered a SAS question.

GKalema
Fluorite | Level 6

Hi Reeza,

 

Actually the following line of code saves the word file as a txt file

oWord.ActiveDocument.SaveAs oFile.path & ".txt", 2

so that if file was e.g., c:\temp\test.doc, it becomes c:\temp\test.doc.txt

 

Thanks so much for you help.

George

Reeza
Super User

I'm glad you figured it out 🙂

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Discussion stats
  • 10 replies
  • 9168 views
  • 1 like
  • 4 in conversation