SAS Programming

DATA Step, Macro, Functions and more
janpeter
Fluorite | Level 6

Hi,

I'm trying to read a large non-delimited text file into a dataset. I'm using PC SAS 9.4.

The file is the document.xml part of a docx file. It does not seem possible to read it into one variable and one row in a dataset, since it is too large (>300 KB). I was thinking that one way of doing this is to pre-process the file by reading it character by character and adding a line break every time I see a '>', then writing it out and re-reading it using PROC IMPORT with CRLF ('0D0A'x) as a delimiter.

Can this be done? If yes, then how?

 

BR

Jan

P.S. Note that reading the file with an XML libname is not useful here.

ACCEPTED SOLUTION
Tom
Super User

Not sure if it will help but here is how to do what you asked.

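filename in 'document.xml';
filename out 'document_fixed.xml';

data _null_;
  /* recfm=n treats both files as byte streams with no record boundaries */
  infile in recfm=n;
  file out recfm=n;
  /* copy one byte per iteration of the DATA step */
  input char $char1. ;
  put char $char1. ;
  /* write a carriage return after every '>' */
  if char='>' then put '0D'x;
run;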

I don't think that UTF-8 (or other multibyte character sets) would make any difference.
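
For the re-read step described in the question, one possible sketch follows. It assumes the output file name document_fixed.xml from the code above; the fileref `fixed`, the dataset name `pieces`, the variable name `piece`, and the 32767-byte limit are placeholders of mine, not part of the original answer. Because the step above writes only a bare CR ('0D'x) after each '>', TERMSTR=CR is used so that SAS ends each record there.

```sas
filename fixed 'document_fixed.xml';

data pieces;
  /* TERMSTR=CR: a record ends at each bare '0D'x written after a '>' */
  infile fixed termstr=cr lrecl=32767 truncover;
  /* assumed upper bound; a record longer than this would be truncated */
  length piece $32767;
  /* $CHAR keeps leading blanks; TRUNCOVER accepts records shorter than the informat width */
  input piece $char32767.;
run;
```

Each observation should then hold the text up to and including one '>'. If any stretch of text between two '>' characters exceeds 32767 bytes, it would still be cut off, so it may be worth checking the longest record before relying on this.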


7 REPLIES
Reeza
Super User
What happens if you do try to read it in as one string?

data test;
infile 'path to xml' lrecl=32000;
input;
length x $32000.;
x=_infile_;
run;
janpeter
Fluorite | Level 6

Sorry, I was too fast. I of course meant to write Tom.

Thanks Tom!!!

😉

BR

J

data_null__
Jade | Level 19

I wonder if it would be faster to read the file as if it were fixed length and apply the TRANSLATE function to _INFILE_.

Tom
Super User
Perhaps with TRANWRD() function since this is inserting an extra character.
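
To illustrate the difference: TRANSLATE swaps characters one for one, so it can replace a '>' but cannot add a character after it, whereas TRANWRD replaces a substring with another of any length. Below is a small sketch on a made-up string; the variable names are placeholders and not from the thread.

```sas
data _null_;
  length before $20 swapped $20 inserted $40;
  before = '<a><b>';
  /* TRANSLATE(source, to, from): each '>' becomes a CR, so the '>' itself is lost */
  swapped = translate(before, '0D'x, '>');
  /* TRANWRD(source, target, replacement): each '>' becomes '>' followed by a CR */
  inserted = tranwrd(before, '>', '>' || '0D'x);
  /* count the '>' characters that survive in each result */
  n_gt_swapped  = countc(swapped, '>');
  n_gt_inserted = countc(inserted, '>');
  put n_gt_swapped= n_gt_inserted=;
run;
```

The log should show 0 for the TRANSLATE result and 2 for the TRANWRD result. Applying this to a whole file read in fixed-length blocks would still need care with the longer output length and with the final short block; that part is not shown here.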
Reeza
Super User
If you remind me later, I did this when parsing out a Word doc before. If that's what you're trying to do, by the way, I ended up using a Python library instead (called from SAS) that worked very well for parsing the contents of Word documents. But I can dig out the code I had been working with, at least.
janpeter
Fluorite | Level 6

Hi Reeza,

that certainly did the trick. Thanks a lot!

Yes. I was considering Python as an option but prefer to keep it all in SAS.

 

BR

Jan

