BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anotherdream
Quartz | Level 8

Hello everyone.  I have been given files which are in the xml format, and are quite large.  A very important note about these files is that they are ONE giant string, that is much longer than 32,000 characters in length.


Because an Xml doesn't need line feeds / carriage returns to make sense, the maker of the files didn't put any in the file.  However they did put some information within the files that I need to remove before the files are processed into my Production Database (not allowed legally to include them).

How would I open a file in SAS, and remove all occurance of a string and replace it with a blank if the string itself is too long to be made into a variable?

I tried the following method below; however the output file actually has nothing produced. (It makes a blank output file).  I do note from the log that the input variable has a length of 79,293.  I note this is much larger than the $32,000 newvar variable so I'm guessing something is going wrong that is causing the error based upon the string being too long....

The log has no other errors noted;

Please note that I put the general "Borrowername:" as the string I want to replace, but this can be replaced with anything. It's actually not important to the question.

      data _null_;

      infile "c:\xmlfiles\inputxml.xml" end=eof lrecl=1000000;

      file "c:\xmlfiles\temp.xml" lrecl=1000000 RECFM=N;

      informat newvar $32000.;

      input;

      newvar=(_infile_,' BorrowerName:',trimn(''));

      put newvar;

      run;

I also tried the same thing but without making a new variable, and just over-writing the _infile_ variable itself, like below.  When this happens the new file has data, however it is a 100% match to the original file. The transtrn doesn't actually do anything as the output file still has the "BorrowerName:" string within the file. It looks like that step was basically skipped.

  data _null_

      infile "c:\xmlfiles\inputxml.xml" end=eof lrecl=1000000;

        file "c:\xmlfiles\temp.xml" lrecl=1000000 RECFM=N;

      input;

      _infile_=transtrn(_infile_,' BorrowerName:',trimn(''));

      put _infile_;

      run;

Please note that BOTH methodology show a 100% clean log, and I am not sure what to do next.

Thanks all and let me know if I can add any info to help!

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Can you run x commands? Passing this to powershell or Unix to accomplish would be quicker.

Can you read in/parse entire XML file then generate a new XML with the structure you need?

View solution in original post

6 REPLIES 6
ballardw
Super User

I don't know if this will solve anything but the output buffer is accessed using the _FILE_ variable and

_infile_=transtrn(_infile_,' BorrowerName:',trimn(''));

      put _infile_;

would probably be better as :

_file_=transtrn(_infile_,' BorrowerName:',trimn(''));

      put ; /* puts the content of the buffer so explicit variable not needed though some interesting things can be done*/

put _infile_; explicitly in the documentation says that it writes the last input data record read

This specific task may be more appropriate for a text editor macro or possibly a utility such as UNIX SED or GREP

Anotherdream
Quartz | Level 8

Would it be possible to do something liek the following?

Since borrower name is a string of length 13, could I loop over a file moving one "space (or byte, i'm not sure what to call it, but basically the same as using lrecl=1 and recfm=N)"   at a time, but using a "moving window" of 13 length?

Aka if the file looked like below....

Example File:

Hey how are you doing?  I'm good. Did you happen to get that BorrowerName:  I did it's Shannon FakeGirl.

Could I create a loop that would look at the first 13 character first. Aka

'Hey how are y'

and then compare it to "BorrowerName:".  If the 13 length string isn't equal to BorrowerName: then keep the first record within it. Else replace it with nothing.


Aka the code would next move onto

"ey how are yo".

Since this is not equal to BorrowerName: it would keep the "e" that it started with...

However if the string ever was equal to "BorrowerName:" then replace that string with empty space and jump the pointer forward 13 'bytes' and continue thte loop?

Does this even sound possible in SAS? Unfortunately my skills relative to this portion of sas are pretty newbie, and i'm not sure how to even begin.

I know another route could be too break an _infile_ statement into multiple variables dynamically, making each one equal to length $32767.  until you reach the size of the file and then stop....  however I have no idea how to control for the string occuring in the break point between variables. Meaning the first variable ending at $32767 might cut the required string in half, and if that occurs I can't replace the string..

Thanks again, any other ideas are greatly appreciated.

Reeza
Super User

Can you run x commands? Passing this to powershell or Unix to accomplish would be quicker.

Can you read in/parse entire XML file then generate a new XML with the structure you need?

gergely_batho
SAS Employee

%let searchStr=BorrowerName;

%let buffLen=%sysfunc(length(&searchStr.));

data _null_;

  length buff $ &buffLen.;

  retain buff '';

  retain inBuff 0;

  infile "c:\temp\file.txt" RECFM=N eof=eof;

  file "c:\temp\outfile.txt" RECFM=N;

  input char $char1. @@;

  buff=substr(buff,2,&buffLen.-1)||char;

  inBuff=min(inBuff+1,&buffLen.);

  if buff="&searchStr." then do;

  /*do not put*/

  buff='';

  inBuff=0;

  end;

  if inBuff=>&buffLen. then do;

  outChar=char(buff ,1);

  put outChar $1. @@;

  end;

  return;

eof:

  if inBuff>0 then do;

  buff=substr(buff,&buffLen.-inBuff+1,inBuff );

  put buff $varying&buffLen..inBuff @@;

  end;

  stop;

run;

This code is inefficient, but if your xml files are small, it works. Reading bigger blocks and doing search on those (but solving the issue on the block boundaries) would be better.

Some comments:

1. end= option is not working in this setting on the infile statement, so I had to use the eof= option.

2. I had to use the $char1. informat, because $1. informat can skip blanks.

3. I needed a variable: inBuff, which stores, how many characters are actually in buff. I couldn't solve this without that variable, again because of the blanks.

4. $varying. format is needed at the end, if you don't want additional blans in the output file (compared to the input file).

Thank you sharing this problem, I wonder if anyone could suggest impovements (make it more effective or simplify it) to the above code.

Anotherdream
Quartz | Level 8

Novice

Master

Ksharp
Super User

You can make it vertically , one row with one column and with  value which's length is $1 .

data _null_;

      infile "c:\xmlfiles\inputxml.xml" RECFM=N;

      input x $1. @@ ;

      run;

After that , do it whatever you want.

Xia Keshan

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 3426 views
  • 5 likes
  • 5 in conversation