ammarhm
Lapis Lazuli | Level 10

So it seems the X command and Tika are the two available options.

Any other suggestions, please, from users with experience in similar cases?

@Ksharp any alternative approaches please?

Thanks

Reeza
Super User

If you have SAS/IML, you could see if there's an R package to read a DOC file.

https://communities.sas.com/t5/General-SAS-Programming/Run-R-code-inside-SAS-easily/td-p/210116

Ksharp
Super User
Yes, I think SAS can do that. 1) Use the " dir c:\temp\*.txt /s /b " command to get the full paths of all the files. 2) After that, use FILENAME + FILEVAR= to read what you want. @Tom is a specialist in that.
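The two steps described here could be sketched roughly as follows. This applies to plain text files only, not to DOC files. The folder c:\temp comes from the post; the fileref names, the data set name allText, and the variable names are assumptions made for illustration, and the sketch assumes the session is allowed to run operating system commands.

```sas
/* Step 1: pipe the output of "dir /s /b", which lists one full path per line. */
filename flist pipe 'dir "c:\temp\*.txt" /s /b';

data allText;
  infile flist truncover;
  input path $260.;            /* full path of the next text file            */
  fname = path;                /* copy it: the FILEVAR= variable itself is   */
                               /* not written to the output data set         */

  /* Step 2: FILEVAR= switches the second INFILE to the file named in path.  */
  /* "dummy" is only a placeholder fileref required by the syntax.           */
  infile dummy filevar=path end=done truncover lrecl=32767;
  do while (not done);
    input line $char1000.;     /* one record of the current file             */
    output;
  end;
run;
```

Each observation of allText then holds one line of text together with the file it came from. An empty file would probably need extra handling, since the inner INPUT would hit end of file straight away.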
Reeza
Super User

@Ksharp wrote:
Yes, I think SAS can do that. 1) Use the " dir c:\temp\*.txt /s /b " command to get the full paths of all the files. 2) After that, use FILENAME + FILEVAR= to read what you want. @Tom is a specialist in that.

 

@Ksharp DOC files returned gibberish the last time I tried connecting to them via a FILENAME statement. I'd be happy to be proven wrong, though.

 

 

Ksharp
Super User

Oops. I thought the OP wanted to import TEXT files.

ammarhm
Lapis Lazuli | Level 10

Thank you @Ksharp @Reeza @Patrick @Tom

 

Trying to keep the thread alive: after all the wise recommendations, and given that you can't import directly from a DOC file, I decided to try the X command approach, based on pieces of code I found online for similar problems.

Here is the code I have been playing with:

 

 

*****************
 * Get file list *
 *****************;
 libname output "C:\Users\home\Desktop\IU\SAS";
 %let wordPath=C:\Users\home\Desktop\IU\Word\;
 * &wordPath contains all Word forms *;
 %let rtfPath=C:\Users\home\Desktop\IU\rtf\;
 %let fileType=.doc;   * Or .docx *;
 
filename files pipe "dir ""&wordPath*&fileType"""; 
data fileList;
infile files lrecl=300 truncover;
input line $200.;
retain fileID;   
if not index(line,"&fileType") then delete;
else fileID+1;
fileName = strip(substr(line,39,199));
wordPathname="&wordPath"||left(fileName);
rtfPathname="&rtfPath"||left(tranwrd(fileName, "&fileType", ".rtf"));
keep fileID fileName wordPathname rtfPathname;
run;


 * Convert Word to RTF files *
 *****************************;
 options noxwait noxsync;
 x call "C:\Program Files (x86)\Microsoft Office\root\Office16\winword.exe";

data _null_;   
 wait=sleep(3);
 run;
 fileName wordLink dde 'WinWord|System';

%macro word2rtf(inPathname,outPathname);
data _null_;
file wordLink;
put '[FileOpen.Name = "'"&inPathname"'"]';
put '[FileSaveAs "'"&outPathname"'",6]'; 
put '[FileClose]';
run;
%mend word2rtf;

 data _null_;

set fileList;
  call execute('%word2rtf('||wordPathname||', '||rtfPathname||')'); 
 run;

data _null_;
file wordLink;
 put '[FileExit]';
 run;
fileName wordLink clear;

The problem is that the following step is causing an error: 

 

data _null_;

set fileList;
  call execute('%word2rtf('||wordPathname||', '||rtfPathname||')'); 
 run;

ERROR: More positional parameters found than defined.

ERROR: More positional parameters found than defined.

ERROR: More positional parameters found than defined.

ERROR: More positional parameters found than defined.

ERROR: More positional parameters found than defined.

ERROR: More positional parameters found than defined.

 

Could anyone help with this, please?

kind regards

 

 

Reeza
Super User

Wouldn't text make more sense than RTF?

Tom
Super User

Not sure why you are getting that message. Are you sure you don't have mismatched quotes?

Or perhaps the filenames themselves contain commas? If they do, you might want to adjust your macro to expect an already quoted filename instead of adding the quotes itself.

Also, it is good practice to use %NRSTR() around the macro name you are stacking up with CALL EXECUTE, to prevent the macro from running at the wrong time (while the data step that calls it is still executing). Your particular macro will not have a problem, but any macro that modifies macro variables via code (call symputx() or sql select into) will have issues.

%macro word2rtf(inPathname,outPathname);
data _null_;
  file wordLink;
  put '[FileOpen.Name = "' &inPathname '"]';
  put '[FileSaveAs "' &outPathname '",6]'; 
  put '[FileClose]';
run;
%mend word2rtf;

data _null_;
  set fileList;
  call execute(cats('%nrstr(%word2rtf)('
                   ,quote(trim(wordPathname))
                   ,','
                   ,quote(trim(rtfPathname))
                   ,')'
  )); 
run;
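The timing issue that %NRSTR() guards against can be seen in a small experiment. This is only a sketch: the macro showTiming and the macro variable found are hypothetical names invented for the demonstration.

```sas
%let found = (not set yet);

/* Hypothetical macro: generates a step that sets FOUND, then reports FOUND. */
%macro showTiming(name);
  data _null_;
    call symputx('found', "&name");  /* runs only when this stacked step executes */
  run;
  %put NOTE: inside showTiming, found=&found;
%mend showTiming;

/* 1) Without %NRSTR the macro executes while this data step is still running. */
/*    Its generated step is only queued, so %PUT is evaluated before FOUND     */
/*    has been updated.                                                        */
data _null_;
  call execute('%showTiming(first)');
run;

/* 2) With %NRSTR the macro call itself is queued and runs after this data     */
/*    step ends, so the generated step and the %PUT run in their written order. */
data _null_;
  call execute('%nrstr(%showTiming)(second)');
run;
```

In the first case the log should still show the initial value of found, while in the second it should show the value that was just assigned.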

 

 

ammarhm
Lapis Lazuli | Level 10

Thank you @Tom for the suggestion, that solved this problem.

@Reeza I did try text, but the result was only a garbled file that I couldn't read. I'm happy to work on that if you think it is better. I only changed the extension:

data fileList;
infile files lrecl=300 truncover;
input line $200.;
retain fileID;   
if not index(line,"&fileType") then delete;
else fileID+1;
fileName = strip(substr(line,39,199));
wordPathname="&wordPath"||left(fileName);
rtfPathname="&rtfPath"||left(tranwrd(fileName, "&fileType", ".txt"));
keep fileID fileName wordPathname rtfPathname;
run;

But as I mentioned, when I opened the generated files, they were unreadable.

Is there a difference between the two options when it comes to extensions? I have about 3000 documents to go through, so it would be helpful to get the conversion working a bit faster. Also, it opens the files one by one. Is there an option to make this happen in the background (like opening the files in hidden mode) so that the process is faster?

I will be working on text extraction from the generated files and will get back to the forum when I have done that.

Thanks everyone
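A note on the extension question: the 6 in the FileSaveAs command is the code used here for Rich Text Format, so renaming the target to .txt while keeping 6 most likely still writes RTF content, which would explain the unreadable files. Below is a minimal sketch of a variant macro with the format code as a parameter. The macro name word2file and its format= parameter are hypothetical, and the value 2 for plain text is an assumption that should be checked against a WordBasic reference before relying on it.

```sas
/* Hypothetical variant of %word2rtf with the save format as a parameter.  */
/* Assumes the DDE fileref wordLink is already assigned and Word is open.  */
/* format=6 is the RTF code used in this thread; 2 is ASSUMED to be text.  */
%macro word2file(inPathname, outPathname, format=6);
data _null_;
  file wordLink;
  put '[FileOpen.Name = "' &inPathname '"]';
  put '[FileSaveAs "' &outPathname '",' "&format" ']';
  put '[FileClose]';
run;
%mend word2file;

data _null_;
  set fileList;
  /* Both path names are passed already quoted, as in the %word2rtf call. */
  call execute(cats('%nrstr(%word2file)('
                   ,quote(trim(wordPathname))
                   ,','
                   ,quote(trim(rtfPathname))
                   ,',format=2)'
  ));
run;
```

If the assumed code is right, the generated .txt files would contain only the document text, without the RTF control words.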

ammarhm
Lapis Lazuli | Level 10

Thank you @Reeza @Tom @Ksharp @Patrick again for your kind help.

I am now working on text extraction from the rtf files

 

 

***************
 * Read in data from RTF files *
 *******************************;
  %macro rtf2sas(rtfFile,fileName,surveyNum);
fileName inRTF "&rtfFile";
 data questions;
  infile inRTF lrecl=5000 truncover;
 input rawTxt $ 1-5000;

run;

data q_and_a; 
set questions; 

run;

data survey_&surveyNum; 
length Survey $ 50; 
set question; 
Survey="&fileName"; 
run; 

%mend rtf2sas;
data _null_; 
set fileList; 
call execute(cats('%nrstr(%rtf2sas)('
                   ,quote(trim(rtfPathname))
                   ,quote(trim(fileName))
                   ,quote(trim(fileID))
                   ,')'
  )); 
run;

Yet the code is returning an error:

 

 

1   + %rtf2sas("C:\Users\Home\Desktop\IU\rtf\Uploaded_0944114_20160505_DAS, ADD.txt""Uploaded_0944114_20160505_DAS, ADD.doc""           1")

ERROR: Error in the FILENAME statement.

 

ERROR: No logical assign for filename INRTF.

 

I can't solve this problem. Could anyone please help me out here?

Could this be related to the fact that I have commas in the file name? If so, what would the easiest fix be?
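Judging from the generated call in the log, the three quoted values were glued together with no commas between them, so the macro received a single parameter. The macro then wrapped that already quoted value in a second pair of quotes in the FILENAME statement. One possible fix is sketched below, under these assumptions: fileID is numeric, the path names contain no double quotes, and the intermediate steps can be collapsed into one data step.

```sas
%macro rtf2sas(rtfFile, fileName, surveyNum);
  filename inRTF &rtfFile;          /* value arrives already quoted */
  data survey_&surveyNum;
    length Survey $ 50;
    infile inRTF lrecl=5000 truncover;
    input rawTxt $ 1-5000;
    Survey = &fileName;             /* value arrives already quoted */
  run;
  filename inRTF clear;
%mend rtf2sas;

data _null_;
  set fileList;
  /* Separate the parameters with ',' pieces. Commas inside the quoted   */
  /* names are then not treated as parameter separators.                 */
  call execute(cats('%nrstr(%rtf2sas)('
                   ,quote(trim(rtfPathname))
                   ,','
                   ,quote(trim(fileName))
                   ,','
                   ,fileID          /* unquoted, so survey_&surveyNum stays a valid name */
                   ,')'
  ));
run;
```

Passing fileID without quotes matters because it becomes part of the data set name survey_&surveyNum.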

 

Kind regards

 

ammarhm
Lapis Lazuli | Level 10

OK, I think I am moving forward on this now. I can import the rtf files as I want, but the imported text looks like gibberish and is unreadable. Please see the attached screenshot.

Any idea on how to process this into readable text?

 


Screen Shot 2017-07-06 at 9.59.28 am.png
ammarhm
Lapis Lazuli | Level 10

Actually, it did work, and the imported text was hidden between all the gibberish lines. I now have a working solution.

Thank you @Tom @Reeza @Ksharp @Patrick for all the help

Attached is the final code for anyone's reference:

 

*****************
 * Get file list *
 *****************;
 libname output "C:\Users\Home\Desktop\SAS";
 %let wordPath=C:\Users\Home\Desktop\Word\;
 * &wordPath contains all Word forms *;
 %let rtfPath=C:\Users\Home\Desktop\rtf\;
 %let fileType=.doc;   * Or .docx *;
 
filename files pipe "dir ""&wordPath*&fileType"""; 
data fileList;
infile files lrecl=300 truncover;
input line $200.;
retain fileID;   
if not index(line,"&fileType") then delete;
else fileID+1;
fileName = strip(substr(line,39,199));
wordPathname="&wordPath"||left(fileName);
rtfPathname="&rtfPath"||left(tranwrd(compress(fileName,,p), "&fileType", ".txt"));
keep fileID fileName wordPathname rtfPathname;
run;


 * Convert Word to RTF files *
 *****************************;
 options noxwait noxsync;
 x call "C:\Program Files (x86)\Microsoft Office\root\Office16\winword.exe";

data _null_;   
 wait=sleep(3);
 run;
 fileName wordLink dde 'WinWord|System';

%macro word2rtf(inPathname,outPathname);
data _null_;
  file wordLink;
  put '[FileOpen.Name = "' &inPathname '"]';
  put '[FileSaveAs "' &outPathname '",6]'; 
  put '[FileClose]';
run;
%mend word2rtf;

data _null_;
  set fileList;
  call execute(cats('%nrstr(%word2rtf)('
                   ,quote(trim(wordPathname))
                   ,','
                   ,quote(trim(rtfPathname))
                   ,')'
  )); 
run;

data _null_;
file wordLink;
 put '[FileExit]';
 run;
fileName wordLink clear;

***************
 * Read in data from RTF files *
 *******************************;
%macro rtf2sas(rtfFile,fileName,surveyNum);
  /* rtfFile and fileName arrive already quoted from CALL EXECUTE below */
  fileName inRTF &rtfFile;
  data questions;
    infile inRTF lrecl=5000 truncover;
    input rawTxt $ 1-5000;
  run;

  data q_and_a;
    set questions;
  run;

  data survey_&surveyNum;
    length Survey $ 50;
    set questions;
    Survey=&fileName;
  run;
%mend rtf2sas;

data _null_;
  set fileList;
  /* Quote the names so commas in them are not read as parameter separators */
  call execute(cats('%nrstr(%rtf2sas)('
                   ,quote(trim(rtfPathname))
                   ,','
                   ,quote(trim(fileName))
                   ,','
                   ,fileID
                   ,')'
  ));
run;

****************************
 * Set all surveys together *
 ****************************;
data output.Survey_All;
set survey_:;
run;

Having said that, I now understand why it would be better to use VBA code first to export the Word files to CSV and then import the CSV into SAS and process the data: doing the conversion from SAS is much slower.
