Hi all,
I am trying to import quite a large pipe-delimited txt file (approx. 18 million observations/ 17 million KB) into sas (version 9.4), but there are too many line breaks in my data to use a regular infile statement. I tried using the TERMSTR=CRLF statement initially, and this did help delete the completely empty rows, however it did not fix the rows that are misaligned. To fully understand my issue, see attached for a simple example I created of my raw data.
Regarding the sample I have attached, lines 2-5 are all the same row of data, but they are unfortunately split up like this. Does anybody have any tips for how to resolve this issue so I can potentially import a cleaner file without having to piece things back together manually?
Thanks in advance!
I would suggest counting the number of pipes and removing the line breaks when not at the end of the line. But there are a few situations that could make that hard.
Does your actual data file ever have truncated records? That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops? That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks? That would also make it impossible to just count lines.
If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.
Does your actual data every contain any pipe characters as values? That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse. Normally in a delimited file you would quote the value that contains the delimiter.
1|"a|b"|3But some systems might instead try to "escape" the delimiter.
1|a\|b|3Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there. I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.
filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file. So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement.
I have downloaded your attachment, which looks like this:
Var1|Var2|Var3|Var4|Var5| 1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0| 1600|ROSE|NEEDS TO BE WATERED EVERY 5 TO 7 DAYS AND PLACED IN DIRECT SUNLIGHT|RED|125.25 1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05
Lines 1 names 5 variables, suggesting each record will have 4 pipes.
Note that record 1 (line 1) and record 2 (line 2) each have a trailing pipe, yielding 5 pipes, and implying 6 pipe-delimited fields. But record 3 (lines 3 through 6) and record 4 (line 7) do not have trailing pipes, so they have 4 pipes, implying 5 pipe-delimited fields.
Is this a transcription error, or is it actually in the raw data you are using? If it's the latter get the person who made the file redo it consistently.
try using proc import, something like this
proc import datafile ="C:\data.txt" out= xyz
dbms="dlm" replace;
delimiter='|';
getnames=yes;
run;It was realy a chalenge. Next code works fine with the sample.
I hope you'll be able to addapt it to your input:
data want;
  format v1-v5 ; 
  retain i j v1-v5;
  length a_line  word w_next $100 
         v1 $5 v2-v4 $100  v5 $10;
  infile datalines truncover;
  link read_line;
  array vx $ v1-v5;
  i=1; 
  do until (i=6);
     word= scan(a_line,i,'|'); 
     vx(i) = word;
     if word = '' then do;
        do until (a_line ne ' ');
           link read_line;
        end;
	    if i>1 then i=i-1;
        do until (w_next=' ');
	       word = scan(a_line,j,'|'); 
	       w_next = scan(a_line,j+1,'|');
	       vx(i) = catx(' ',vx(i),word);
	       if w_next ne ' ' then do; i+1; j+1; end;
	    end;
     end; 
     i+1;
  end;
  output; 
  call missing(v1, v2, v3, v4, v5);
  keep v1-v5;
return;
read_line:
  N+1; j=1;
  input a_line $100.;
return;
datalines;
Var1|Var2|Var3|Var4|Var5|
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0|
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS
AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05
; run;You may want to define some variables as numeric instead of character.
You may need addapt maximum length of character variables. and
to maximum number of variables per line.
wish you success.
Shmuel - your code seems like it will work for this data. I just tried modifying it a bit using my actual data (which the updated sample data it modeled off of) but now I am getting the following error:
ERROR: Array subscript out of range at line 661 column 29.
The line and column # change every time.
Here is my adaptation of the code below:
Do you see anything amiss that I am not seeing?
data encounters_test;
format v1-v126 ;
retain i j v1-v126;
length a_line word w_next $500
v1-v126 $500;
infile source truncover;
link read_line;
array vx $ v1-v126;
i=1;
do until (i=127);
word= scan(a_line,i,'|');
vx(i) = word;
if word = '' then do;
do until (a_line ne ' ');
link read_line;
end;
if i>1 then i=i-1;
do until (w_next=' ');
word = scan(a_line,j,'|');
w_next = scan(a_line,j+1,'|');
vx(i) = catx(' ',vx(i),word);
if w_next ne ' ' then do; i+1; j+1; end;
end;
end;
i+1;
end;
output;
call missing(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30,
v31, v32, v33, v34, v35, v36, v37, v38, v39, v40, v41, v42, v43, v44, v45, v46, v47, v48, v49, v50, v51, v52, v53, v54, v55, v56, v57, v58, v59, v60,
v61, v62, v63, v64, v65, v66, v67, v68, v69, v70, v71, v72, v73, v74, v75, v76, v77, v78, v79, v80, v81, v82, v83, v84, v85, v86, v87, v88, v89, v90,
v91, v92, v93, v94, v95, v96, v97, v98, v99, v100, v101, v102, v103, v104, v105, v106, v107, v108, v109, v110, v111, v112, v113, v114, v115, v116, v117, v118, v119, v120, v121, v122, v123, v124, v125, v126);
keep v1-v126;
return;
read_line:
N+1; j=1;
input a_line $400.;
return;
run;
Thanks again for your helpful response!
@Tom's solution seems very neat, but testing it I have the feeling it misses the last input record.
I have not found any failure in your addapted code, but make some light changes
and added some debuging tools:
- defined MAX_V macro variable set to 126 - number of variables.
- added put to log statement = N= input record number every 20th record.
- replaced the call missing statement with a do loop.
Please run the next code and, in case of "ERROR: Array subscript out of rang"
post the full log and the input records that cause the ERROR (from record N= to N+20)
or you may change the debuging statement to:
/*debug*/ if N ge <last value in log> then put a_line=;
and rerun.
Having the log and the input records that caused the error - then I'll be able to fix the code.
Thanks so much for the reply. I realize I actually made a mistake in my data sample, it should look like this (there are no pipes at the end of the lines, just CRLF characters):
Var1|Var2|Var3|Var4|Var5
1000|PEONY|DIRECT SUNLIGHT|YELLOW|80.0
1600|ROSE|NEEDS TO BE WATERED
EVERY 5 TO 7 DAYS
AND PLACED IN DIRECT SUNLIGHT|RED|125.25
1750|LILY|NOT DIRECT SUNLIGHT|WHITE|76.05
I have updated my sample data file as well.
I would suggest counting the number of pipes and removing the line breaks when not at the end of the line. But there are a few situations that could make that hard.
Does your actual data file ever have truncated records? That is records that only contain some of the fields and instead of adding a number of pipe characters to the end to indicate the last few a missing the line just stops? That will probably make just counting pipes impossible. Does the last field on the line every have embedded line breaks? That would also make it impossible to just count lines.
If you can't count pipes then you will need to see if you can detect the beginning of a new line. So if the first one or two fields always follow a specific pattern then you could test those for an indication that you have hit the beginning of a new record.
Does your actual data every contain any pipe characters as values? That is a value like 'a|b'. If so it will make the process harder, but it should be possible to still parse. Normally in a delimited file you would quote the value that contains the delimiter.
1|"a|b"|3But some systems might instead try to "escape" the delimiter.
1|a\|b|3Here is a simple method to copy the file and remove embedded line breaks by just counting the number of pipes. I have it put a space before the continuation lines, but you could put some other character there. I have it count the number of pipes in the first line and use that to drive the process of checking if the current record has enough pipes or not.
filename fixed temp;
data _null_;
  retain maxp 0 p 0;
  infile sample ;
  file fixed ;
  input ;
  if p then put ' ' @;
  put _infile_ @;
  p+countc(_infile_,'|');
  if _n_=1 then maxp=p;
  if p>= maxp then do ;
    put ;
    p=0;
  end;
run;Since the file is large you might see if you can get a solution that does not involve making a whole new copy of the file. So instead you might try building the full record into a string variable, then assigning the string variable back to the _INFILE_ variable and then reading the values with an INPUT statement.
Tom - Thank you so much for sending this code! I think it completely resolved the issue with the data!
My understanding of the code is that for the lines that have less than the max # of pipes, it is replacing the extra CRLF characters at the end those lines with a space so that the data can then import properly.
Thanks so much!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
