janpeter
Fluorite | Level 6

Hi,

I'm trying to read a large non-delimited text file into a dataset. I'm using PC SAS 9.4.

The file is the document.xml part of a docx file. It does not seem possible to read it into one variable and one row of a dataset because it is too large (>300 KB). One way around this might be to preprocess the file: read it character by character and add a CR (carriage return) every time I see a '>', then write it out and re-read it using PROC IMPORT with CR ('0D0A'x) as the delimiter.

Can this be done? If so, how?

BR

Jan

P.S. Note that reading the file with an XML libname is not useful here.

Accepted Solution
Tom
Super User

Not sure if it will help but here is how to do what you asked.

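filename in 'document.xml';
filename out 'document_fixed.xml';

data _null_;
  infile in recfm=n;           /* read the file as a byte stream, no record boundaries */
  file out recfm=n;            /* write a byte stream too, so PUT adds no line endings */
  input char $char1. ;         /* read one byte at a time */
  put char $char1. ;           /* copy it to the output unchanged */
  if char='>' then put '0D'x;  /* add a carriage return after each '>' */
run;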

I don't think that UTF-8 (or other multibyte character sets) would make any difference.
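
For the re-read step mentioned in the question, a minimal sketch follows. It assumes the fixed file contains bare carriage returns ('0D'x) as written by the step above, so TERMSTR=CR is given explicitly, since the default record terminator on Windows is CRLF. The fileref fixed, the dataset name xml_lines and the variable name line are illustrative, not from the thread. It also assumes no single fragment between two '>' characters exceeds 32,767 bytes, the maximum length of a SAS character variable.

```sas
/* Hypothetical fileref pointing at the file written by the step above */
filename fixed 'document_fixed.xml';

data xml_lines;
  /* TERMSTR=CR: treat a bare carriage return as the end of a record */
  /* TRUNCOVER: accept records shorter than the informat width       */
  infile fixed termstr=cr lrecl=32767 truncover;
  length line $32767;
  input line $char32767.;   /* one tag or text fragment per row */
run;
```

Each row of xml_lines should then hold the text up to and including one '>', which can be parsed further with ordinary character functions.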

Reeza
Super User
What happens if you do try to read it in as one string?

data test;
  infile 'path to xml' lrecl=32000;
  length x $32000;   /* one long character variable to hold the record */
  input;             /* load the record into the _INFILE_ buffer */
  x=_infile_;
run;
janpeter
Fluorite | Level 6

Sorry, I was too fast. Of course I meant to write Tom.

Thanks Tom!!!

😉

BR

J

data_null__
Jade | Level 19

I wonder if it would be faster to read the file as if it were fixed length and apply the TRANSLATE function to _INFILE_.

Tom
Super User
Perhaps with TRANWRD() function since this is inserting an extra character.
Reeza
Super User
If you remind me later: I did this before when parsing out a Word doc. If that's what you're trying to do, by the way, I ended up using a Python library instead (called from SAS) that worked very well for parsing the contents of Word documents. But I can dig out the code I had been working with at least.
janpeter
Fluorite | Level 6

Hi Reeza,

that certainly did the trick. Thanks a lot!

Yes. I was considering Python as an option but prefer to keep it all in SAS.

 

BR

Jan
