janpeter
Fluorite | Level 6

Hi,

I'm trying to read a large non-delimited text file into a dataset. I'm using PC SAS 9.4.

The file is the document.xml part of a docx file. It does not seem possible to read it into one variable and one row of a dataset, since it is too large (>300 KB). One approach I was considering is to pre-process the file by reading it character by character and adding a CR (carriage return) every time I see a '>'. I would then write it out and re-read it using PROC IMPORT with CR/LF ('0D0A'x) as the delimiter.

Can this be done? If yes, how?

BR

Jan

P.S. Note that reading the file with an XML libname is not useful here.

Accepted Solution
Tom
Super User

Not sure if it will help but here is how to do what you asked.

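filename in 'document.xml';
filename out 'document_fixed.xml';

data _null_;
  /* RECFM=N treats both files as byte streams with no record boundaries */
  infile in recfm=n;
  file out recfm=n;
  input char $char1. ;          /* read one byte */
  put char $char1. ;            /* copy it to the output unchanged */
  if char='>' then put '0D'x;   /* append a CR after each '>' */
run;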

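I don't think that UTF-8 (or other multibyte character sets) would make any difference. In UTF-8, at least, '>' is always the single byte '3E'x, and that byte never occurs inside a multibyte sequence, so copying the file one byte at a time is safe.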
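
The second half of the question, re-reading the fixed file, could look roughly like the sketch below. It assumes the output of the step above was written with a bare CR after each '>', so the INFILE option TERMSTR=CR is used to split records there. The fileref fixed, the dataset name xml_parts, and the variable name line are hypothetical names chosen for illustration.

```sas
/* Sketch only: re-read the CR-separated copy, one piece per row. */
filename fixed 'document_fixed.xml';

data xml_parts;
  length line $32767;                  /* maximum length of a character variable */
  infile fixed termstr=cr              /* records end at a bare CR ('0D'x)       */
               lrecl=32767 truncover;  /* do not flow over on short records      */
  input line $char32767.;              /* $CHAR keeps leading blanks             */
run;
```

Each row then holds the text up to and including one '>'. Any single piece longer than 32,767 bytes would still be truncated, and line feeds that were already in the original file would remain inside the values, so this may need adjusting for a particular document.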


7 REPLIES
Reeza
Super User
What happens if you do try to read it in as one string?

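data test;
  length x $32000;                   /* close to the 32,767-byte limit for a character variable */
  infile 'path to xml' lrecl=32000;
  input;                             /* load the next record into the _INFILE_ buffer */
  x = _infile_;                      /* copy the buffer into a dataset variable */
run;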

janpeter
Fluorite | Level 6

Sorry, I was too fast. I of course meant to write Tom.

Thanks Tom!!!

😉

BR

J

data_null__
Jade | Level 19

I wonder if it would be faster to read the file as if it were fixed length and apply the TRANSLATE function to _INFILE_.

Tom
Super User
Perhaps with the TRANWRD() function instead, since this is inserting an extra character rather than replacing one.
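
The difference between the two functions can be seen on a short literal. This is only an illustration of the point above, not a full block-wise rewrite of the file; the variable names s, t and u are hypothetical.

```sas
/* Sketch only: TRANSLATE swaps characters one-for-one, TRANWRD can insert. */
data _null_;
  length s t u $12;
  s = '<a><b>';
  t = translate(s, '0D'x, '>');         /* each '>' is replaced by a CR, so '>' is lost */
  u = tranwrd(s, '>', '>' || '0D'x);    /* each '>' is kept and a CR is added after it  */
  put t= $hex24. / u= $hex24.;          /* show the bytes in hex in the log             */
run;
```

In the hex output, the bytes 3E ('>') disappear from t but are still present in u, each followed by 0D. Because TRANWRD makes the string longer, the receiving variable has to be long enough to hold the inserted characters.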
Reeza
Super User
If you remind me later, I can dig out the code I was working with when I parsed a Word document before. If that's what you're trying to do, by the way, I ended up using a Python library (called from SAS) instead, which worked very well for parsing the contents of Word documents.
janpeter
Fluorite | Level 6

Hi Reeza,

that certainly did the trick. Thanks a lot!

Yes. I was considering Python as an option but prefer to keep it all in SAS.

 

BR

Jan
