Carriage Return / Multiple Line Break Issue Importing TXT File

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Carriage Return / Multiple Line Break Issue Importing TXT File

[ Edited ]

Hi all,

 

I am trying to import quite a large pipe-delimited txt file (approx. 18 million observations/ 17 million KB) into sas (version 9.4), but there are too many line breaks in my data to use a regular infile statement. I tried using the TERMSTR=CRLF statement initially, and this did help delete the completely empty rows, however it did not fix the rows that are misaligned. To fully understand my issue, see attached for a simple example I created of my raw data.

 

Regarding the sample I have attached, lines 2-5 are all the same row of data, but they are unfortunately split up like this. Does anybody have any tips for how to resolve this issue so I can potentially import a cleaner file without having to piece things back together manually?

 

Thanks in advance!

 

 

 


Accepted Solutions
Solution
‎12-14-2017 01:37 PM
Super User
Super User
Posts: 7,399

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Posted in reply to Lauren_dal

I would suggest counting the number of pipes and removing the line breaks when not at the end of the line.  But there are a few situations that could make that hard.

 

Does your actual data file ever have truncated records?  That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops?  That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks?  That would also make it impossible to just count lines.

 

If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.

 

Does your actual data every contain any pipe characters as values?  That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse.  Normally in a delimited file you would quote the value that contains the delimiter.  

1|"a|b"|3

But some systems might instead try to "escape" the delimiter.

1|a\|b|3

Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there.  I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.

filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;

Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file.  So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement. 

 

 

View solution in original post


All Replies
Trusted Advisor
Posts: 1,149

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Posted in reply to Lauren_dal

I have downloaded your attachment, which looks like this:

 

Var1|Var2|Var3|Var4|Var5|
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0|
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05

Lines 1 names 5 variables, suggesting each record will have 4 pipes.

 

Note that record 1 (line 1) and record 2 (line 2) each have a trailing pipe, yielding 5 pipes, and implying 6 pipe-delimited fields. But record 3 (lines 3 through 6) and record 4 (line 7)  do not have trailing pipes, so they have 4 pipes, implying 5 pipe-delimited fields.

 

Is this a transcription error, or  is it actually in the raw data you  are using?  If it's the latter get the person who made the file redo it consistently.

 

 

Contributor RM6
Contributor
Posts: 24

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

try using proc import, something like this


proc import datafile ="C:\data.txt" out= xyz
dbms="dlm" replace;
delimiter='|';
getnames=yes;
run;
Trusted Advisor
Posts: 1,683

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

It was realy a chalenge. Next code works fine with the sample.

I hope you'll be able to addapt it to your input:

data want;
  format v1-v5 ; 
  retain i j v1-v5;
  length a_line  word w_next $100 
         v1 $5 v2-v4 $100  v5 $10;
  infile datalines truncover;
  link read_line;
  array vx $ v1-v5;
  i=1; 
  do until (i=6);
     word= scan(a_line,i,'|'); 
     vx(i) = word;
     if word = '' then do;
        do until (a_line ne ' ');
           link read_line;
        end;
	    if i>1 then i=i-1;
        do until (w_next=' ');
	       word = scan(a_line,j,'|'); 
	       w_next = scan(a_line,j+1,'|');
	       vx(i) = catx(' ',vx(i),word);
	       if w_next ne ' ' then do; i+1; j+1; end;
	    end;
     end; 
     i+1;
  end;
  output; 
  call missing(v1, v2, v3, v4, v5);
  keep v1-v5;
return;
read_line:
  N+1; j=1;
  input a_line $100.;
return;
datalines;
Var1|Var2|Var3|Var4|Var5|
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0|
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05
; run;

You may want to define some variables as numeric instead of character.

You may need addapt maximum length  of character variables. and

to maximum number of variables per line.

 

wish you success.

New Contributor
Posts: 4

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Shmuel - your code seems like it will work for this data. I just tried modifying it a bit using my actual data (which the updated sample data it modeled off of) but now I am getting the following error:

 

ERROR: Array subscript out of range at line 661 column 29.

 

The line and column # change every time. 

 

Here is my adaptation of the code below:

 

Do you see anything amiss that I am not seeing?

 

data encounters_test;
format v1-v126 ;
retain i j v1-v126;
length a_line word w_next $500
v1-v126 $500;
infile source truncover;
link read_line;
array vx $ v1-v126;
i=1;
do until (i=127);
word= scan(a_line,i,'|');
vx(i) = word;
if word = '' then do;
do until (a_line ne ' ');
link read_line;
end;
if i>1 then i=i-1;
do until (w_next=' ');
word = scan(a_line,j,'|');
w_next = scan(a_line,j+1,'|');
vx(i) = catx(' ',vx(i),word);
if w_next ne ' ' then do; i+1; j+1; end;
end;
end;
i+1;
end;
output;
call missing(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30,
v31, v32, v33, v34, v35, v36, v37, v38, v39, v40, v41, v42, v43, v44, v45, v46, v47, v48, v49, v50, v51, v52, v53, v54, v55, v56, v57, v58, v59, v60,
v61, v62, v63, v64, v65, v66, v67, v68, v69, v70, v71, v72, v73, v74, v75, v76, v77, v78, v79, v80, v81, v82, v83, v84, v85, v86, v87, v88, v89, v90,
v91, v92, v93, v94, v95, v96, v97, v98, v99, v100, v101, v102, v103, v104, v105, v106, v107, v108, v109, v110, v111, v112, v113, v114, v115, v116, v117, v118, v119, v120, v121, v122, v123, v124, v125, v126);
keep v1-v126;
return;
read_line:
N+1; j=1;
input a_line $400.;
return;
run;

 

 

Thanks again for your helpful response!

Trusted Advisor
Posts: 1,683

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Posted in reply to Lauren_dal

@Tom's solution seems very neat, but testing it I have the feeling it misses the last input record.

 

I have not found any failure in your addapted code, but make some light changes

and added some debuging tools:

- defined MAX_V macro variable set to 126 - number of variables.

- added put to log statement = N= input record number every 20th record.

- replaced the call missing statement with a do loop.

 

Please run the next code and, in case of "ERROR: Array subscript out of rang"

post the full log and the input records that cause the ERROR (from record N= to N+20)

or you may change the debuging statement to:

     

/*debug*/ if N ge <last value in log> then put a_line=;

and rerun.

 

Having the log and the input records that caused the error - then I'll be able to fix the code.

Super User
Super User
Posts: 7,399

Re: Carriage Return / Multiple Line Break Issue Importing CSV File


Shmuel wrote:

@Tom's solution seems very neat, but testing it I have the feeling it misses the last input record.

 

...


Since it wrties each line as it reads it there is no way it misses any lines.  It might not write an end of line for the last record, if it did not have enough pipe characters.  

New Contributor
Posts: 4

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Thanks so much for the reply. I realize I actually made a mistake in my data sample, it should look like this (there are no pipes at the end of the lines, just CRLF characters): 

 

Var1|Var2|Var3|Var4|Var5
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05

 

I have updated my sample data file as well.

Solution
‎12-14-2017 01:37 PM
Super User
Super User
Posts: 7,399

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Posted in reply to Lauren_dal

I would suggest counting the number of pipes and removing the line breaks when not at the end of the line.  But there are a few situations that could make that hard.

 

Does your actual data file ever have truncated records?  That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops?  That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks?  That would also make it impossible to just count lines.

 

If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.

 

Does your actual data every contain any pipe characters as values?  That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse.  Normally in a delimited file you would quote the value that contains the delimiter.  

1|"a|b"|3

But some systems might instead try to "escape" the delimiter.

1|a\|b|3

Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there.  I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.

filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;

Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file.  So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement. 

 

 

New Contributor
Posts: 4

Re: Carriage Return / Multiple Line Break Issue Importing CSV File

Tom - Thank you so much for sending this code! I think it completely resolved the issue with the data! 

 

My understanding of the code is that for the lines that have less than the max # of pipes, it is replacing the extra CRLF characters at the end those lines with a space so that the data can then import properly. 

 

Thanks so much!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 320 views
  • 0 likes
  • 5 in conversation