janpeter
Fluorite | Level 6

Hi,

I'm trying to read a large non-delimited text file into a dataset. I'm using PC SAS 9.4.

The file is the document.xml part of a docx file. It does not seem possible to read it into one variable and one row of a dataset because it is too large (>300 KB). One way around this might be to pre-process the file: read it character by character and add a CR (carriage return) every time I see a '>', then write it out and re-read it using PROC IMPORT with the line terminator (CR, or CR+LF '0D0A'x) as the delimiter.

Can this be done? If so, how?

BR

Jan

P.S. Note that reading the file with an XML libname is not useful here.

1 ACCEPTED SOLUTION

Tom
Super User

Not sure if it will help but here is how to do what you asked.

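filename in 'document.xml';
filename out 'document_fixed.xml';

data _null_;
  infile in recfm=n;    /* read the file as a byte stream, no record boundaries */
  file out recfm=n;     /* write a byte stream, so PUT adds no line terminators */
  input char $char1. ;  /* read one byte, preserving blanks */
  put char $char1. ;    /* copy it through unchanged */
  if char='>' then put '0D'x;  /* append a CR after each '>' */
run;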

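I don't think that UTF-8 (or other multibyte character sets) would make any difference: the step works byte by byte, and in UTF-8 the byte for '>' ('3E'x) never occurs inside a multibyte sequence.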
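
Once the CRs are in place, the fixed file can be read back with one piece per observation. This is a minimal sketch, assuming the document_fixed.xml file written by the step above. The dataset name `tags`, the variable name `line`, and the 32767-byte width are illustrative choices, not part of the original answer. `TERMSTR=CR` tells the INFILE statement to treat a lone carriage return as the end of a record.

```sas
data tags;
  /* TERMSTR=CR: records end at a bare carriage return ('0D'x) */
  infile 'document_fixed.xml' termstr=cr lrecl=32767 truncover;
  length line $32767;
  /* $CHAR keeps leading blanks; anything past 32767 bytes is truncated */
  input line $char32767.;
run;
```

A DATA step is used here rather than PROC IMPORT because each record is a single string, not a set of delimited columns. Any single piece longer than 32767 bytes would be cut off, so it may be worth checking the longest record first.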

7 REPLIES
Reeza
Super User
What happens if you do try to read it in as one string?

data test;
  infile 'path to xml' lrecl=32000;
  length x $32000;
  input;
  x = _infile_;
run;
janpeter
Fluorite | Level 6

Sorry, I was too fast. I of course meant to write Tom.

Thanks Tom!!!

😉

BR

J

data_null__
Jade | Level 19

@Tom wrote:

Not sure if it will help but here is how to do what you asked.

filename in 'document.xml';
filename out 'document_fixed.xml';

data _null_;
  infile in recfm=n;
  file out recfm=n;
  input char $char1. ;
  put char $char1. ;
  if char='>' then put '0D'x;
run;

I don't think that UTF-8 (or other multibyte character sets) would make any difference.


I wonder if it would be faster to read the file as if it were fixed length and apply the TRANSLATE function to _INFILE_.

Tom
Super User
Perhaps with TRANWRD() function since this is inserting an extra character.
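
The block-wise idea could look roughly like the sketch below. It assumes the `in` and `out` filerefs from the accepted solution. The 1000-byte block size and the names `buf`, `len`, and `outlen` are illustrative. It also assumes that, with `RECFM=F`, the `LENGTH=` variable reports the true size of a short final block, which is worth confirming on a small test file before relying on it.

```sas
data _null_;
  /* read fixed 1000-byte blocks; LEN receives the length of the current block */
  infile in recfm=f lrecl=1000 length=len;
  file out recfm=n;
  length buf $2000;                      /* worst case: every byte is '>' */
  input;                                 /* load the block into _INFILE_ */
  buf = tranwrd(_infile_, '>', '>' || '0D'x);
  outlen = len + countc(_infile_, '>');  /* one extra byte per '>' */
  put buf $varying2000. outlen;          /* write exactly OUTLEN bytes */
run;
```

The `$VARYING` format is used so that trailing blanks inside a block are written out and the padding of `buf` is not. Whether this is actually faster than the byte-by-byte version would need to be timed on the real file.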
Reeza
Super User
Remind me later if you like: I have parsed a Word doc like this before. If that's what you're trying to do, by the way, I ended up using a Python library (called from SAS) instead, which worked very well for parsing the contents of Word documents. But I can at least dig out the code I had been working with.
janpeter
Fluorite | Level 6

Hi Reeza,

That certainly did the trick. Thanks a lot!

Yes. I was considering Python as an option but prefer to keep it all in SAS.

 

BR

Jan
