Removing embedded carriage returns

Reply
New Contributor
Posts: 4

Removing embedded carriage returns

I have a data set that has quite a few carriage returns. I need a way to write this into sas and not change my initial data. The initial data step that I had was:

 

data companyinfo;

infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;

input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.

 VAR10:$3.;

run;

 

Where VAR7 is the column that contains CR.

 

I tried termstr=crlf and var7=tranwrd(var7,'0D'x,'') amd have not been able to find a correct result.

Trusted Advisor
Posts: 1,554

Re: Removing embedded carriage returns

[ Edited ]
Posted in reply to Wmcfarland

You were very close to the solution.

Change line: 

var7=tranwrd(var7,'0D'x,'');

into:

  

var7 = translate(var7,' ','0D'x);

 

New Contributor
Posts: 4

Re: Removing embedded carriage returns

That appears to give me the right number of observations but the information after the CR appears to be deleted and it is giving me errors for the fields that follow VAR7.

Trusted Advisor
Posts: 1,554

Re: Removing embedded carriage returns

Posted in reply to Wmcfarland

Please post:

(1) the log of your run

(2) your output and your expected output for a specific problamatic row

 

Without this info I cannot guess what is the problem.

Super User
Posts: 19,772

Re: Removing embedded carriage returns

Posted in reply to Wmcfarland

I think the general approach has been to read in the full line using _infile_ and either write it out without the carriage return using Compress and then import that file, or to process line afterwards manually using SCAN. 

 

 

SAS Employee
Posts: 51

Re: Removing embedded carriage returns

[ Edited ]

Hello, 

  

I agree with Reeza.

You can use the $VARYINGw. informat when reading the full line into a variable (VarFullLine). 

 

data one;
   infile file-specification length=reclen;
   input @;
   fwidth=reclen-9;
   input id_name $ 1-9
         @10 VarFullLine $varying750. fwidth;
run;

 

Then use the TRANSTRN function instead of the TRANWRD function.

The TRANWRD function differs from the TRANSTRN function because TRANSTRN allows the replacement string to have a length of zero. TRANWRD uses a single blank instead when the replacement string has a length of zero.

 

Or use the compress function and indicate the list of characters you want to keep.

example: 

 

data _NULL_;
 source='aZer§#t_y';
 y=compress(source, '+-', 'kin');
 /*   k for keep (instead of remove), i for case insensitive, l for lowercase, u for uppercase 
    , n for digits (and underscore character and English letters!), p for punctuation marks 
    , w for "write-able"  */
 put y=;
run;

 

Or simply compress the control characters

compress(source, , "c"); /* remove control characters from the result */

 

Other functions that you might find interesting with regard to control characters (carriage return, line feed, form feed, page break ...) 

( carriage returns (‘0D’x ), line feeds (‘0A’x) -> hexadecimal constant notation ):

 ANYALPHA, NOTALPHA

, ANYDIGIT, NOTDIGIT

, ANYPRINT, NOTPRINT

, TRANSLATE

 

Terminate by using the scan function to make your variables!

 

See also this interesting usage note:

Usage Note 14178: The INFILE statement TERMSTR= option facilitates reading files between UNIX and Windows

http://support.sas.com/kb/14/178.html

 

Kind regards,

Koen

PROC Star
Posts: 1,759

Re: Removing embedded carriage returns

Posted in reply to Wmcfarland

You have another problem that you are not describing it seems.

The solution given works fine.

 

data _null_;
  file 'test.csv' lrecl=2500  dlm=',';
  VAR1='abc';VAR2='def';VAR3=' ';VAR4=VAR5=VAR8;VAR6=' ';VAR7='ghi'||'0D'x||'jkl';VAR9='m';VAR10='n';
  put / VAR1-VAR10 ;
run;
 
data TEST;
  infile 'test.csv' lrecl=2500 firstobs=2 dsd truncover termstr=crlf dlm=',' ;
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1. VAR10:$3.;
  VAR7 = translate(VAR7,'_','0D'x);
run;
 

 

VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 VAR9 VAR10
abc def   1     ghi_jkl . m n
 
Please supply a simple sample of data like the one above that replicates the issue.

 

 

Super User
Super User
Posts: 7,039

Re: Removing embedded carriage returns

@ChrisNZ the original post said that using TERMSTR=CRLF was already tried.

PROC Star
Posts: 1,759

Re: Removing embedded carriage returns

@Tom

Exactly. Just showing that we can't replicate the OP's problem, so something else is at play.

Super User
Super User
Posts: 7,039

Re: Removing embedded carriage returns

[ Edited ]
Posted in reply to Wmcfarland

 Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks?  Have you read the data using an editor that shows the underlying data? or with a SAS data step using fixed mode input so that you can see the actual characters?

 

For example you could use the code to look at the first 1,000 characters in your file.  If that includes a line with extra line break embedded in VAR7 then you can tell if it is using the same hexcodes as at the end of record.  (You could play with OBS= and FIRSTOBS= options to read other parts of the file of it is large.)

data _null_:
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;

Let's assume that they are both using the same codes and so the INFILE statement will have no way to tell the false line breaks form the real ones so that you need to do something to make it possible to read this data.

 

If the values really never have the delimters in the data then you could just count the number of fields and eliminate line breaks when there are not enough fields. But using comma as the delimiter and having many long character variables it is very likely this approach will not work since it is probably likely that some of those variables will contain commas.  But you could try it.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else.  So you could replace any quoted CR or LF with a vertical bar for example.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n ;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1. ;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|' ;
  else put ch $char1.;
run;

Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.

data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 
         var6 $2 var7 $550 var8 8 var9 $1 var10 $3
  ;
  input var1-var10;
run;

 

Trusted Advisor
Posts: 1,554

Re: Removing embedded carriage returns

I would like to extend @Tom's post. Assuming the problem is only in var7 - 

please run next code and post the log (results):

data _NULL_;
  set have;
        length check $500;
        check = compress(var7,' ','ADFI');
        if length(check) > 1 then do;
            m+1;
            if m < 20 then put _N_= ' *' check=  '*' /
check= $hex100. ; end; run;

the log will enable check non visible characters by displaying its hex value.

 

Respected Advisor
Posts: 4,173

Re: Removing embedded carriage returns

[ Edited ]
Posted in reply to Wmcfarland

@Wmcfarland

The following not tested code (you haven't provided sample data) should replace all whitespace characters in your source with a blank before reading the data into SAS variables.

data companyinfo;
  infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;
  /* read source record into input buffer; @: holds line, next input will not read a new line */
  input @;
  /* replace all whitespace characters in input buffer with a blank */
  _infile_=prxchange('s/\s/ /oi',-1,trim(_infile_));
  /* map variables agains input buffer; due to @ before this input statement will not read a new record */
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.
    VAR10:$3.;
run;

 

Super User
Super User
Posts: 7,039

Re: Removing embedded carriage returns

@Patrick

How is that code supposed to change end of line characters that are in appropriately placed in the middle of a record?

That could would just read such records are two different records.

 

Respected Advisor
Posts: 4,173

Re: Removing embedded carriage returns

@Tom

The OP uses a .csv and I've made the assumption that we're simply dealing the normal embedded LF problem but that a records ends with CRLF.

 

@Wmcfarland

Please post your sample.csv as attachment so we can deal with the real thing and stop with all the guesswork.

 

Ask a Question
Discussion stats
  • 13 replies
  • 507 views
  • 0 likes
  • 7 in conversation