Carriage Returns - and lines longer than 32,000 bytes

Accepted Solution Solved
Reply
Super Contributor
Posts: 418
Accepted Solution

Carriage Returns - and lines longer than 32,000 bytes

Hello all.  A while back I posted a question on how to remove CR / LF from csv files programatically, without the ability to use the options TERMSTR (as my file has both CR and LF and CR+LF all within given comments, so this option would still break the file at incorrect places).

The user Tom was very helpful and provided the code given below, which is extremely helpful for ALMOST all the files I need to process. The issue is that I have files in which the total records across one line are vastly longer than 32,000 bytes (dealing with comment data where one box can (and often is) over 10-20 thousand bytes.

Beause of this, this process will not work on these files. Does anyone know if there is a way to alter this code, or another code that will work on these large text comment files?

Tom, if you are reading if you know a way that would be very helpful, and again thanks for this code below!


data _null_;

infile "YOUR FILE" lrecl=32000 end=eof;

file "NEW FILE" lrecl=32000;

nq=0;

do until (mod(nq,2)=0 or eof);

     input

     _infile_ = TRANSLATE(_infile_,' ','0D0A'x);

     newn+1;

     nq=nq+countc(_infile_,' " ');

     put _infile_ @;

     if mod(nq,2) then do;

     missq+1;

     put ' ' @;

     end;

end;

put;

run;

Thanks

Brandon


Accepted Solutions
Solution
‎02-12-2013 10:28 PM
Super User
Super User
Posts: 6,500

Re: Carriage Returns - and lines longer than 32,000 bytes

Do not change the value of CHAR to blank or else it will break the test for end of line that is in the condition of the DO UNTIL statement.

The extra PUT statement that you added inside the loop for when it sees a CR or LF while the quotes are balanced does not have a trailing @.  This is most likely what is added the blank line.

If you want to replace the CR and LF characters with spaces then you can eliminate a lot of the complexity and just do this after the number of quotes is incremented.

    if char in ('0D'x,'0A'x) then put ' ' @;

     else put char $char1. @;


Or if you just want them to disappear then you can be even simpler :


    if char not in ('0D'x,'0A'x) then put char $char1. @ ;


If your file is well formed such that only the LF at the end of the line is outside of the double quotes (and hence the only place where the number of quotes will be even) this should work.

It will not matter if the files is on UNIX where the end of line is just the LF or on WINDOWS where the end of line is CR and LF.  If you prevent copying the CR and LF from the line to the output inside of the loop (either by writing a space or not writing anything) then the bare PUT statement at the end of the loop will write the end of line using the appropriate characters for your operating system.

If your individual lines (in the fixed or new file) are longer than 1,000,000 characters (and if they are how the heck are you ever going to use them) then you might need to switch the settings on your output file to RECFM=N.  Then the PUT statement at the end should explicitly write the LF or CR and LF characters to end the line. (PUT '0D0A'X ; )

View solution in original post


All Replies
Super User
Super User
Posts: 6,500

Re: Carriage Returns - and lines longer than 32,000 bytes

You might need to resort to processing the file byte by byte.

* UNTESTED CODE ;

data _null_;

infile "YOUR FILE" lrecl=1 recfm=n end=eof;

file "NEW FILE" lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0) );

     input char $char1.;

     nq=nq+(char='"') ;

     if char in ('0D'x,'0A'x) and mod(nq)=1 then char=' ';

     put char $char1. @;

end;

put;

run;

Super Contributor
Posts: 418

Re: Carriage Returns - and lines longer than 32,000 bytes

Hey Tom! Thanks again for helping me through this problem!

Okay so in your code, it works Great, with the exception that between every single line it puts an additional blank line.  Do you know why this portion is happening? (I changed the if char in ('OD'x,'OA'x) and mod(nq,2)=1 portion and it ran exactly the way I was hoping, with this one minor detail.

So the data looks like below

Colm1          Colm2          Colm3

Thanks          So                  MUCH

TOM               This               IS

VERY            HELPFUL!    


Super User
Super User
Posts: 6,500

Re: Carriage Returns - and lines longer than 32,000 bytes

Try removing the PUT statement after the END of the loop.  Most likely it is already putting the end of line and so the extra PUT is adding another end of line.

Super Contributor
Posts: 418

Re: Carriage Returns - and lines longer than 32,000 bytes

Hello again Tom. I actually did try that, and it worked up until the file got to 1,000,000 bytes, and then the carriage returns stopped being processed. However with that put statement every line is 100% correct (there is just an extra line after the end of it). I believe that last put statement is resetting the lrecl somehow (or something along those lines). Does that make sense for the put statement to reset it because the loop reset and the lrecl reset to 1,000,000?

I did have one "observation" and I was thinking maybe this is why it is not working. Please correct me if I am wrong!

Anyways, at the end of each line in my original file is a CRLF (the combination of the two). In the code above, since it is a do until, when it encounters the first CR ('OD'x) it is not triggered by the do Until,  thus it is written as normal (because its mod(q,2)=0). Then it gets to the LF, and it is written as normal (because the do until will still process through it). Therefore we are keeping our CR+LF at the end of every line, and then we are moving forward and putting.

This results in a LF and then a CR+LF from the put statement. Would there be a way to test for subsetting values of CR+LF when the mod(nq,2)=0, and then completely remove that combination, thus when the put statement runs it fixes all of the problems?

Or perhaps any other nifty ideas?

Again thanks for your help, I am learning a lot about the workings of SAS!

Super Contributor
Posts: 418

Re: Carriage Returns - and lines longer than 32,000 bytes

I think I may have gotten it! Tom I took your code and added some modifications to it. If you have time, please see below and determine if there is anything incorrect in my logic. I will test it on more files to make sure the process works the way I intend it too (I feel the logic is solid, but I'm new at this!).

Basically I added a step that tests for CR and LF when they are outside of double quotes, and it removes them replacing them with null strings.

Therefore when it gets to the final LF at each line (in the CR+LF statement), it will replace it, however the put statement after the end clause will inherently put a new line, thus "replacing" the CR+LF at the end line with just the one LF that sas produces.

data _null_;

infile "YOUR FILE" lrecl=1 recfm=n end=eof;

file "NEW FILE" lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0) );

     input char $char1.;

     nq=nq+(char='"') ;

     if char in ('0D'x,'0A'x) and mod(nq)=1 then char=' ';

     if char in ('0D'x,'0A'x) and mod(nq)=0 then do;

     char=' ';

     put char $char1. ;

     end;

     else do;

     put char $char1. @;

     end;

end;

put;

run;

Solution
‎02-12-2013 10:28 PM
Super User
Super User
Posts: 6,500

Re: Carriage Returns - and lines longer than 32,000 bytes

Do not change the value of CHAR to blank or else it will break the test for end of line that is in the condition of the DO UNTIL statement.

The extra PUT statement that you added inside the loop for when it sees a CR or LF while the quotes are balanced does not have a trailing @.  This is most likely what is added the blank line.

If you want to replace the CR and LF characters with spaces then you can eliminate a lot of the complexity and just do this after the number of quotes is incremented.

    if char in ('0D'x,'0A'x) then put ' ' @;

     else put char $char1. @;


Or if you just want them to disappear then you can be even simpler :


    if char not in ('0D'x,'0A'x) then put char $char1. @ ;


If your file is well formed such that only the LF at the end of the line is outside of the double quotes (and hence the only place where the number of quotes will be even) this should work.

It will not matter if the files is on UNIX where the end of line is just the LF or on WINDOWS where the end of line is CR and LF.  If you prevent copying the CR and LF from the line to the output inside of the loop (either by writing a space or not writing anything) then the bare PUT statement at the end of the loop will write the end of line using the appropriate characters for your operating system.

If your individual lines (in the fixed or new file) are longer than 1,000,000 characters (and if they are how the heck are you ever going to use them) then you might need to switch the settings on your output file to RECFM=N.  Then the PUT statement at the end should explicitly write the LF or CR and LF characters to end the line. (PUT '0D0A'X ; )

Super Contributor
Posts: 418

Re: Carriage Returns - and lines longer than 32,000 bytes

Hello Tom. Just a few notes for clarification if you don't mind. The code that I provide above actually works and gives the exact output that I want (without the extra line). The original code posted is what gave the blank lines.

Also, the code that I gave above,  (called code block 1 below), appears to do the exact same thing as the code that you noted with the "

    if char not in ('0D'x,'0A'x) then put char $char1. @ ;" (named code block 2).

However I see your point that when I change char to blank, the do until loop should no longer work (because char no longer equals '0A'x). It seems strange to me that the code I provided worked based upon this. Do you have any insights into why CODE BLOCK 1 works (and it does indeed work I tested it on 15 different files).

Thanks so much for your help, I am marking you as correct answer once again!

CODE BLOCK 1

data _null_;

infile "YOUR FILE" lrecl=1 recfm=n end=eof;

file "NEW FILE" lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0) );

     input char $char1.;

     nq=nq+(char='"') ;

     if char in ('0D'x,'0A'x) and mod(nq)=1 then char=' ';

     if char in ('0D'x,'0A'x) and mod(nq)=0 then do;

     char=' ';

     put char $char1. ;

     end;

     else do;

     put char $char1. @;

     end;

end;

put;

run;

CODE BLOCK 2

data _null_;

infile "YOUR FILE" lrecl=1 recfm=n end=eof;

file "NEW FILE" lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0) );

     input char $char1.;

     nq=nq+(char='"') ;

     if char not in ('0D'x,'0A'x) then put char $char1. @;

end;

put;

run;

Super User
Super User
Posts: 6,500

Re: Carriage Returns - and lines longer than 32,000 bytes

CODE BLOCK 1 is replacing all of the CR and LF with blanks.  BUT when it finds either when the quotes are balanced it spits out an end of line because of the PUT statement without a trailing @ . This would also explain your extra blank lines since if the lines do end with CR+LF (which is normal for Windows text files) then both of those characters will cause the put statement to write an end of line. If you look carefully the extra blank line should have a single space character from where the LF was converted to blank and the put.  Also the main DO loop that was designed to stop when it got to the end of line instead runs through the whole file.  I expected it to reach the end of the data step and execute the PUT statement so that there would be one extra blank line.  But instead it stops when the INPUT statement tries to read past the end of the file.  Perhaps it is because of using RECFM=F LRECL=1 on the input statement that the EOF variable is not working normally?  You should see this NOTE in the log:

NOTE: Unexpected end of file for binary input.

CODE BLOCK 2 is removing all of the CR and LF characters, without replacing them with anything, but it is breaking out of the loop when it sees a LF that is not enclosed in quotes. Then the PUT statement at the end of the data step will write the end of line.

Try adding this line just before the RUN statement in both programs to see a note in the log every time the DO loop ends.

putlog 'Interation number ' _n_ ;

Here is a simple two line TEST input file with an embedded CR in line one (column 17) and below are the resulting output files from the two versions of the code.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+

1   CHAR  1,"This line is .split",3..2,"This line is not split",4.. 57

    ZONE  322566726666267207766722300322566726666267266727766722300

    NUMR  1C248930C9E50930D30C942C3DA2C248930C9E50930EF4030C942C4DA

Code Block 1 output

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+

1   CHAR  1,"This line is  split",3 .. ..2,"This line is not split",4 .. .. 65

    ZONE  32256672666626722776672232002003225667266662672667277667223200200

    NUMR  1C248930C9E50930030C942C30DA0DA2C248930C9E50930EF4030C942C40DA0DA

Code Block 2 output

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+

1   CHAR  1,"This line is split",3..2,"This line is not split",4.. 56

    ZONE  32256672666626727766722300322566726666267266727766722300

    NUMR  1C248930C9E5093030C942C3DA2C248930C9E50930EF4030C942C4DA

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 930 views
  • 0 likes
  • 2 in conversation