Solved: Removing embedded carriage returns

Wmcfarland · Posted 07-24-2017 11:55 AM

I have a data set that has quite a few carriage returns. I need a way to write this into sas and not change my initial data. The initial data step that I had was:

data companyinfo;

infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;

input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.

VAR10:$3.;

run;

Where VAR7 is the column that contains CR.

I tried termstr=crlf and var7=tranwrd(var7,'0D'x,'') amd have not been able to find a correct result.

Tom · Posted 07-24-2017 02:20 PM

Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks? Have you read the data using an editor that shows the underlying data? or with a SAS data step using fixed mode input so that you can see the actual characters?

For example you could use the code to look at the first 1,000 characters in your file. If that includes a line with extra line break embedded in VAR7 then you can tell if it is using the same hexcodes as at the end of record. (You could play with OBS= and FIRSTOBS= options to read other parts of the file of it is large.)

data _null_:
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;

Let's assume that they are both using the same codes and so the INFILE statement will have no way to tell the false line breaks form the real ones so that you need to do something to make it possible to read this data.

If the values really never have the delimters in the data then you could just count the number of fields and eliminate line breaks when there are not enough fields. But using comma as the delimiter and having many long character variables it is very likely this approach will not work since it is probably likely that some of those variables will contain commas. But you could try it.

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else. So you could replace any quoted CR or LF with a vertical bar for example.

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n ;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1. ;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|' ;
  else put ch $char1.;
run;

Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.

data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 
         var6 $2 var7 $550 var8 8 var9 $1 var10 $3
  ;
  input var1-var10;
run;

Moderator note: see also the code/macro provided by @AllanBowe to solve this as a general case.

View solution in original post

Shmuel · Posted 07-24-2017 12:16 PM

You were very close to the solution.

Change line:

var7=tranwrd(var7,'0D'x,'');

into:

var7 = translate(var7,' ','0D'x);

Wmcfarland · Posted 07-24-2017 01:09 PM

That appears to give me the right number of observations but the information after the CR appears to be deleted and it is giving me errors for the fields that follow VAR7.

Shmuel · Posted 07-24-2017 01:17 PM

Please post:

(1) the log of your run

(2) your output and your expected output for a specific problamatic row

Without this info I cannot guess what is the problem.

Reeza · Posted 07-24-2017 01:17 PM

I think the general approach has been to read in the full line using _infile_ and either write it out without the carriage return using Compress and then import that file, or to process line afterwards manually using SCAN.

sbxkoenk · Posted 07-24-2017 02:49 PM

Hello,

I agree with Reeza.

You can use the $VARYINGw. informat when reading the full line into a variable (VarFullLine).

data one;
   infile file-specification length=reclen;
   input @;
   fwidth=reclen-9;
   input id_name $ 1-9
         @10 VarFullLine $varying750. fwidth;
run;

Then use the TRANSTRN function instead of the TRANWRD function.

The TRANWRD function differs from the TRANSTRN function because TRANSTRN allows the replacement string to have a length of zero. TRANWRD uses a single blank instead when the replacement string has a length of zero.

Or use the compress function and indicate the list of characters you want to keep.

example:

data _NULL_;
 source='aZer§#t_y';
 y=compress(source, '+-', 'kin');
 /*   k for keep (instead of remove), i for case insensitive, l for lowercase, u for uppercase 
    , n for digits (and underscore character and English letters!), p for punctuation marks 
    , w for "write-able"  */
 put y=;
run;

Or simply compress the control characters

compress(source, , "c"); /* remove control characters from the result */

Other functions that you might find interesting with regard to control characters (carriage return, line feed, form feed, page break ...)

( carriage returns (‘0D’x ), line feeds (‘0A’x) -> hexadecimal constant notation 😞

ANYALPHA, NOTALPHA

, ANYDIGIT, NOTDIGIT

, ANYPRINT, NOTPRINT

, TRANSLATE

Terminate by using the scan function to make your variables!

See also this interesting usage note:

Usage Note 14178: The INFILE statement TERMSTR= option facilitates reading files between UNIX and Windows

http://support.sas.com/kb/14/178.html

Kind regards,

Koen

ChrisNZ · Posted 07-24-2017 08:10 PM

You have another problem that you are not describing it seems.

The solution given works fine.

data _null_;
  file 'test.csv' lrecl=2500  dlm=',';
  VAR1='abc';VAR2='def';VAR3=' ';VAR4=VAR5=VAR8;VAR6=' ';VAR7='ghi'||'0D'x||'jkl';VAR9='m';VAR10='n';
  put / VAR1-VAR10 ;
run;
 
data TEST;
  infile 'test.csv' lrecl=2500 firstobs=2 dsd truncover termstr=crlf dlm=',' ;
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1. VAR10:$3.;
  VAR7 = translate(VAR7,'_','0D'x);
run;

VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	VAR8	VAR9	VAR10
abc	def		1			ghi_jkl	.	m	n

Please supply a simple sample of data like the one above that replicates the issue.

High-Performance SAS Coding - Third Edition

Tom · Posted 07-24-2017 10:53 PM

@ChrisNZ the original post said that using TERMSTR=CRLF was already tried.

ChrisNZ · Posted 07-25-2017 12:10 AM

@Tom

Exactly. Just showing that we can't replicate the OP's problem, so something else is at play.

High-Performance SAS Coding - Third Edition

Tom · Posted 07-24-2017 02:20 PM

Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks? Have you read the data using an editor that shows the underlying data? or with a SAS data step using fixed mode input so that you can see the actual characters?

For example you could use the code to look at the first 1,000 characters in your file. If that includes a line with extra line break embedded in VAR7 then you can tell if it is using the same hexcodes as at the end of record. (You could play with OBS= and FIRSTOBS= options to read other parts of the file of it is large.)

data _null_:
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;

Let's assume that they are both using the same codes and so the INFILE statement will have no way to tell the false line breaks form the real ones so that you need to do something to make it possible to read this data.

If the values really never have the delimters in the data then you could just count the number of fields and eliminate line breaks when there are not enough fields. But using comma as the delimiter and having many long character variables it is very likely this approach will not work since it is probably likely that some of those variables will contain commas. But you could try it.

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else. So you could replace any quoted CR or LF with a vertical bar for example.

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n ;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1. ;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|' ;
  else put ch $char1.;
run;

Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.

data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 
         var6 $2 var7 $550 var8 8 var9 $1 var10 $3
  ;
  input var1-var10;
run;

Moderator note: see also the code/macro provided by @AllanBowe to solve this as a general case.

Shmuel · Posted 07-24-2017 02:39 PM

I would like to extend @Tom's post. Assuming the problem is only in var7 -

please run next code and post the log (results):

data _NULL_;
  set have;
        length check $500;
        check = compress(var7,' ','ADFI');
        if length(check) > 1 then do;
            m+1;
            if m < 20 then put _N_= ' *' check=  '*' /
                check= $hex100. ;
        end;
run;

the log will enable check non visible characters by displaying its hex value.

rdum96 · Posted 01-09-2024 03:33 PM

Hey Tom! Thanks for posting this solution. The second code you provided worked for the line breaks I was facing in my data. To better understand what's going on, could you explain why we add the second input statement and also the put statement at the end (in red)?

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Also I tried replace any single quotes in the data to avoid any complication with "|", however I'm not sure how to change it back to a single quote in the end!

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
 _infile_=tranwrd(_infile_,"'","|");
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Tom · Posted 01-09-2024 05:24 PM

The second INPUT is to read the next line and write it onto the end of the current line being written. You should probably test whether or not it works when there are multiple extra end-of-lines embedded in the same record.

The last PUT is the end the line that is being written since the previous PUT statements had trailing @ which does not write the end-of-line characters.

Probably the simplest way to convert the |'s back into ' is to do it in the variables you end up with in your SAS dataset after you have successfully read in the fixed file.

Here is a data step that will replace all | with ' in all character variables in a SAS dataset.

data apostrophe;
  set bar ;
  array _char_ _character_;
  do over _char_;
    _char_ = translate(_char_,"'",'|');
  end;
run;

Patrick · Posted 07-24-2017 05:57 PM

@Wmcfarland

The following not tested code (you haven't provided sample data) should replace all whitespace characters in your source with a blank before reading the data into SAS variables.

data companyinfo;
  infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;
  /* read source record into input buffer; @: holds line, next input will not read a new line */
  input @;
  /* replace all whitespace characters in input buffer with a blank */
  _infile_=prxchange('s/\s/ /oi',-1,trim(_infile_));
  /* map variables agains input buffer; due to @ before this input statement will not read a new record */
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.
    VAR10:$3.;
run;

Tom · Posted 07-24-2017 05:59 PM

@Patrick

How is that code supposed to change end of line characters that are in appropriately placed in the middle of a record?

That could would just read such records are two different records.

Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Re: Removing embedded carriage returns

Classroom Training Available!