BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Wmcfarland
Calcite | Level 5

I have a data set that has quite a few carriage returns. I need a way to write this into sas and not change my initial data. The initial data step that I had was:

 

data companyinfo;

infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;

input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.

 VAR10:$3.;

run;

 

Where VAR7 is the column that contains CR.

 

I tried termstr=crlf and var7=tranwrd(var7,'0D'x,'') amd have not been able to find a correct result.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

 Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks?  Have you read the data using an editor that shows the underlying data? or with a SAS data step using fixed mode input so that you can see the actual characters?

 

For example you could use the code to look at the first 1,000 characters in your file.  If that includes a line with extra line break embedded in VAR7 then you can tell if it is using the same hexcodes as at the end of record.  (You could play with OBS= and FIRSTOBS= options to read other parts of the file of it is large.)

data _null_:
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;

Let's assume that they are both using the same codes and so the INFILE statement will have no way to tell the false line breaks form the real ones so that you need to do something to make it possible to read this data.

 

If the values really never have the delimters in the data then you could just count the number of fields and eliminate line breaks when there are not enough fields. But using comma as the delimiter and having many long character variables it is very likely this approach will not work since it is probably likely that some of those variables will contain commas.  But you could try it.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else.  So you could replace any quoted CR or LF with a vertical bar for example.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n ;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1. ;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|' ;
  else put ch $char1.;
run;

Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.

data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 
         var6 $2 var7 $550 var8 8 var9 $1 var10 $3
  ;
  input var1-var10;
run;

Moderator note: see also the code/macro provided by @AllanBowe to solve this as a general case.

View solution in original post

19 REPLIES 19
Shmuel
Garnet | Level 18

You were very close to the solution.

Change line: 

var7=tranwrd(var7,'0D'x,'');

into:

  

var7 = translate(var7,' ','0D'x);

 

Wmcfarland
Calcite | Level 5

That appears to give me the right number of observations but the information after the CR appears to be deleted and it is giving me errors for the fields that follow VAR7.

Shmuel
Garnet | Level 18

Please post:

(1) the log of your run

(2) your output and your expected output for a specific problamatic row

 

Without this info I cannot guess what is the problem.

Reeza
Super User

I think the general approach has been to read in the full line using _infile_ and either write it out without the carriage return using Compress and then import that file, or to process line afterwards manually using SCAN. 

 

 

sbxkoenk
SAS Super FREQ

Hello, 

  

I agree with Reeza.

You can use the $VARYINGw. informat when reading the full line into a variable (VarFullLine). 

 

data one;
   infile file-specification length=reclen;
   input @;
   fwidth=reclen-9;
   input id_name $ 1-9
         @10 VarFullLine $varying750. fwidth;
run;

 

Then use the TRANSTRN function instead of the TRANWRD function.

The TRANWRD function differs from the TRANSTRN function because TRANSTRN allows the replacement string to have a length of zero. TRANWRD uses a single blank instead when the replacement string has a length of zero.

 

Or use the compress function and indicate the list of characters you want to keep.

example: 

 

data _NULL_;
 source='aZer§#t_y';
 y=compress(source, '+-', 'kin');
 /*   k for keep (instead of remove), i for case insensitive, l for lowercase, u for uppercase 
    , n for digits (and underscore character and English letters!), p for punctuation marks 
    , w for "write-able"  */
 put y=;
run;

 

Or simply compress the control characters

compress(source, , "c"); /* remove control characters from the result */

 

Other functions that you might find interesting with regard to control characters (carriage return, line feed, form feed, page break ...) 

( carriage returns (‘0D’x ), line feeds (‘0A’x) -> hexadecimal constant notation 😞

 ANYALPHA, NOTALPHA

, ANYDIGIT, NOTDIGIT

, ANYPRINT, NOTPRINT

, TRANSLATE

 

Terminate by using the scan function to make your variables!

 

See also this interesting usage note:

Usage Note 14178: The INFILE statement TERMSTR= option facilitates reading files between UNIX and Windows

http://support.sas.com/kb/14/178.html

 

Kind regards,

Koen

ChrisNZ
Tourmaline | Level 20

You have another problem that you are not describing it seems.

The solution given works fine.

 

data _null_;
  file 'test.csv' lrecl=2500  dlm=',';
  VAR1='abc';VAR2='def';VAR3=' ';VAR4=VAR5=VAR8;VAR6=' ';VAR7='ghi'||'0D'x||'jkl';VAR9='m';VAR10='n';
  put / VAR1-VAR10 ;
run;
 
data TEST;
  infile 'test.csv' lrecl=2500 firstobs=2 dsd truncover termstr=crlf dlm=',' ;
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1. VAR10:$3.;
  VAR7 = translate(VAR7,'_','0D'x);
run;
 

 

VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 VAR9 VAR10
abc def   1     ghi_jkl . m n
 
Please supply a simple sample of data like the one above that replicates the issue.

 

 

Tom
Super User Tom
Super User

@ChrisNZ the original post said that using TERMSTR=CRLF was already tried.

ChrisNZ
Tourmaline | Level 20

@Tom

Exactly. Just showing that we can't replicate the OP's problem, so something else is at play.

Tom
Super User Tom
Super User

 Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks?  Have you read the data using an editor that shows the underlying data? or with a SAS data step using fixed mode input so that you can see the actual characters?

 

For example you could use the code to look at the first 1,000 characters in your file.  If that includes a line with extra line break embedded in VAR7 then you can tell if it is using the same hexcodes as at the end of record.  (You could play with OBS= and FIRSTOBS= options to read other parts of the file of it is large.)

data _null_:
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;

Let's assume that they are both using the same codes and so the INFILE statement will have no way to tell the false line breaks form the real ones so that you need to do something to make it possible to read this data.

 

If the values really never have the delimters in the data then you could just count the number of fields and eliminate line breaks when there are not enough fields. But using comma as the delimiter and having many long character variables it is very likely this approach will not work since it is probably likely that some of those variables will contain commas.  But you could try it.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else.  So you could replace any quoted CR or LF with a vertical bar for example.

 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n ;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1. ;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|' ;
  else put ch $char1.;
run;

Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.

data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 
         var6 $2 var7 $550 var8 8 var9 $1 var10 $3
  ;
  input var1-var10;
run;

Moderator note: see also the code/macro provided by @AllanBowe to solve this as a general case.

Shmuel
Garnet | Level 18

I would like to extend @Tom's post. Assuming the problem is only in var7 - 

please run next code and post the log (results):

data _NULL_;
  set have;
        length check $500;
        check = compress(var7,' ','ADFI');
        if length(check) > 1 then do;
            m+1;
            if m < 20 then put _N_= ' *' check=  '*' /
check= $hex100. ; end; run;

the log will enable check non visible characters by displaying its hex value.

 

rdum96
Calcite | Level 5

Hey Tom! Thanks for posting this solution. The second code you provided worked for the line breaks I was facing in my data. To better understand what's going on, could you explain why we add the second input statement and also the put statement at the end (in red)? 

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

Also I tried replace any single quotes in the data to avoid any complication with "|", however I'm not sure how to change it back to a single quote in the end!

filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000 ;
  input ;
 _infile_=tranwrd(_infile_,"'","|");
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input ;
    put _infile_ @;
  end;
  put;
run;

 

Tom
Super User Tom
Super User

The second INPUT is to read the next line and write it onto the end of the current line being written.  You should probably test whether or not it works when there are multiple extra end-of-lines embedded in the same record.

 

The last PUT is the end the line that is being written since the previous PUT statements had trailing @ which does not write the end-of-line characters.

 

Probably the simplest way to convert the |'s back into ' is to do it in the variables you end up with in your SAS dataset after you have successfully read in the fixed file. 

 

Here is a data step that will replace all | with ' in all character variables in a SAS dataset.

data apostrophe;
  set bar ;
  array _char_ _character_;
  do over _char_;
    _char_ = translate(_char_,"'",'|');
  end;
run;
  
Patrick
Opal | Level 21

@Wmcfarland

The following not tested code (you haven't provided sample data) should replace all whitespace characters in your source with a blank before reading the data into SAS variables.

data companyinfo;
  infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;
  /* read source record into input buffer; @: holds line, next input will not read a new line */
  input @;
  /* replace all whitespace characters in input buffer with a blank */
  _infile_=prxchange('s/\s/ /oi',-1,trim(_infile_));
  /* map variables agains input buffer; due to @ before this input statement will not read a new record */
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.
    VAR10:$3.;
run;

 

Tom
Super User Tom
Super User

@Patrick

How is that code supposed to change end of line characters that are in appropriately placed in the middle of a record?

That could would just read such records are two different records.

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 42990 views
  • 15 likes
  • 9 in conversation