BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Lauren_dal
Calcite | Level 5

Hi all,

 

I am trying to import quite a large pipe-delimited txt file (approx. 18 million observations/ 17 million KB) into sas (version 9.4), but there are too many line breaks in my data to use a regular infile statement. I tried using the TERMSTR=CRLF statement initially, and this did help delete the completely empty rows, however it did not fix the rows that are misaligned. To fully understand my issue, see attached for a simple example I created of my raw data.

 

Regarding the sample I have attached, lines 2-5 are all the same row of data, but they are unfortunately split up like this. Does anybody have any tips for how to resolve this issue so I can potentially import a cleaner file without having to piece things back together manually?

 

Thanks in advance!

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I would suggest counting the number of pipes and removing the line breaks when not at the end of the line.  But there are a few situations that could make that hard.

 

Does your actual data file ever have truncated records?  That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops?  That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks?  That would also make it impossible to just count lines.

 

If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.

 

Does your actual data every contain any pipe characters as values?  That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse.  Normally in a delimited file you would quote the value that contains the delimiter.  

1|"a|b"|3

But some systems might instead try to "escape" the delimiter.

1|a\|b|3

Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there.  I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.

filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;

Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file.  So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement. 

 

 

View solution in original post

9 REPLIES 9
mkeintz
PROC Star

I have downloaded your attachment, which looks like this:

 

Var1|Var2|Var3|Var4|Var5|
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0|
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05

Lines 1 names 5 variables, suggesting each record will have 4 pipes.

 

Note that record 1 (line 1) and record 2 (line 2) each have a trailing pipe, yielding 5 pipes, and implying 6 pipe-delimited fields. But record 3 (lines 3 through 6) and record 4 (line 7)  do not have trailing pipes, so they have 4 pipes, implying 5 pipe-delimited fields.

 

Is this a transcription error, or  is it actually in the raw data you  are using?  If it's the latter get the person who made the file redo it consistently.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
RM6
Obsidian | Level 7 RM6
Obsidian | Level 7

try using proc import, something like this


proc import datafile ="C:\data.txt" out= xyz
dbms="dlm" replace;
delimiter='|';
getnames=yes;
run;
Shmuel
Garnet | Level 18

It was realy a chalenge. Next code works fine with the sample.

I hope you'll be able to addapt it to your input:

data want;
  format v1-v5 ; 
  retain i j v1-v5;
  length a_line  word w_next $100 
         v1 $5 v2-v4 $100  v5 $10;
  infile datalines truncover;
  link read_line;
  array vx $ v1-v5;
  i=1; 
  do until (i=6);
     word= scan(a_line,i,'|'); 
     vx(i) = word;
     if word = '' then do;
        do until (a_line ne ' ');
           link read_line;
        end;
	    if i>1 then i=i-1;
        do until (w_next=' ');
	       word = scan(a_line,j,'|'); 
	       w_next = scan(a_line,j+1,'|');
	       vx(i) = catx(' ',vx(i),word);
	       if w_next ne ' ' then do; i+1; j+1; end;
	    end;
     end; 
     i+1;
  end;
  output; 
  call missing(v1, v2, v3, v4, v5);
  keep v1-v5;
return;
read_line:
  N+1; j=1;
  input a_line $100.;
return;
datalines;
Var1|Var2|Var3|Var4|Var5|
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0|
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05
; run;

You may want to define some variables as numeric instead of character.

You may need addapt maximum length  of character variables. and

to maximum number of variables per line.

 

wish you success.

Lauren_dal
Calcite | Level 5

Shmuel - your code seems like it will work for this data. I just tried modifying it a bit using my actual data (which the updated sample data it modeled off of) but now I am getting the following error:

 

ERROR: Array subscript out of range at line 661 column 29.

 

The line and column # change every time. 

 

Here is my adaptation of the code below:

 

Do you see anything amiss that I am not seeing?

 

data encounters_test;
format v1-v126 ;
retain i j v1-v126;
length a_line word w_next $500
v1-v126 $500;
infile source truncover;
link read_line;
array vx $ v1-v126;
i=1;
do until (i=127);
word= scan(a_line,i,'|');
vx(i) = word;
if word = '' then do;
do until (a_line ne ' ');
link read_line;
end;
if i>1 then i=i-1;
do until (w_next=' ');
word = scan(a_line,j,'|');
w_next = scan(a_line,j+1,'|');
vx(i) = catx(' ',vx(i),word);
if w_next ne ' ' then do; i+1; j+1; end;
end;
end;
i+1;
end;
output;
call missing(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30,
v31, v32, v33, v34, v35, v36, v37, v38, v39, v40, v41, v42, v43, v44, v45, v46, v47, v48, v49, v50, v51, v52, v53, v54, v55, v56, v57, v58, v59, v60,
v61, v62, v63, v64, v65, v66, v67, v68, v69, v70, v71, v72, v73, v74, v75, v76, v77, v78, v79, v80, v81, v82, v83, v84, v85, v86, v87, v88, v89, v90,
v91, v92, v93, v94, v95, v96, v97, v98, v99, v100, v101, v102, v103, v104, v105, v106, v107, v108, v109, v110, v111, v112, v113, v114, v115, v116, v117, v118, v119, v120, v121, v122, v123, v124, v125, v126);
keep v1-v126;
return;
read_line:
N+1; j=1;
input a_line $400.;
return;
run;

 

 

Thanks again for your helpful response!

Shmuel
Garnet | Level 18

@Tom's solution seems very neat, but testing it I have the feeling it misses the last input record.

 

I have not found any failure in your addapted code, but make some light changes

and added some debuging tools:

- defined MAX_V macro variable set to 126 - number of variables.

- added put to log statement = N= input record number every 20th record.

- replaced the call missing statement with a do loop.

 

Please run the next code and, in case of "ERROR: Array subscript out of rang"

post the full log and the input records that cause the ERROR (from record N= to N+20)

or you may change the debuging statement to:

     

/*debug*/ if N ge <last value in log> then put a_line=;

and rerun.

 

Having the log and the input records that caused the error - then I'll be able to fix the code.

Tom
Super User Tom
Super User

@Shmuel wrote:

@Tom's solution seems very neat, but testing it I have the feeling it misses the last input record.

 

...


Since it wrties each line as it reads it there is no way it misses any lines.  It might not write an end of line for the last record, if it did not have enough pipe characters.  

Lauren_dal
Calcite | Level 5

Thanks so much for the reply. I realize I actually made a mistake in my data sample, it should look like this (there are no pipes at the end of the lines, just CRLF characters): 

 

Var1|Var2|Var3|Var4|Var5
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS

AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05

 

I have updated my sample data file as well.

Tom
Super User Tom
Super User

I would suggest counting the number of pipes and removing the line breaks when not at the end of the line.  But there are a few situations that could make that hard.

 

Does your actual data file ever have truncated records?  That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops?  That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks?  That would also make it impossible to just count lines.

 

If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.

 

Does your actual data every contain any pipe characters as values?  That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse.  Normally in a delimited file you would quote the value that contains the delimiter.  

1|"a|b"|3

But some systems might instead try to "escape" the delimiter.

1|a\|b|3

Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there.  I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.

filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;

Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file.  So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement. 

 

 

Lauren_dal
Calcite | Level 5

Tom - Thank you so much for sending this code! I think it completely resolved the issue with the data! 

 

My understanding of the code is that for the lines that have less than the max # of pipes, it is replacing the extra CRLF characters at the end those lines with a space so that the data can then import properly. 

 

Thanks so much!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 12492 views
  • 1 like
  • 5 in conversation