After spending quite some time, so far I have been unable to rectify this seemingly simple issue.
I am importing txt files into SAS with an infile data step. There are two types of txt files: one type where each line ends with a carriage return (CR or \r) and another type where it ends with a line feed (LF or \n). I want a single program to handle either type of delimiter encountered (it's a macro) and my goal is to use a multi-delimeter specification dlm = " '0D'x, '0A'x " which covers CR and LF, respectively. Here is the issue: using dlm = '0A'x is a perfect substitute for termstr=LF, but when I try to substitute termstr=CR with dlm = '0D'x, this does not work and I have no idea why. What is the equivalent way to substitute termstr=CR with a specific delimiter? I have already checked my data in Notepad++ and confirmed that one text file has line endings with CR and another with LF.
If you just want to read the whole line into a single character variable then DLM will work if used with RECFM=N.
data read;
if eof then call symputx('nobs',_n_-1);
infile dat dlm='0D0A'x recfm=n end=eof;
input line :$100.;
run;
This will work with lines terminated by CR, LF or CRLF.
But this will not work if any of the lines have embedded CR or LF characters.
DLM = delimiter = char that separates variables
termstr = separator between obs
Without having termstr in the infile-statement, sas uses the default depending on the os sas runs on.
It seems as if dlm = " '0D'x, '0A'x " is processed correctly, because in the background termstr=LF is still active.
I think the only way to solve the problem is by adding a parameter to the macro that is used in termstr later on.
One could, of course, try to read the file with one termstr-option active and check the number of lines read. Then change the option and re-read the file again.
I understand that dlm is a column delimiter, but when the goal is to read an entire row of data into a single column only, I thought that might not matter. Maybe a sample code will help. The goal is to read any text file (whether observations end with CR or LF), into one column with each line as a row in SAS. I'd like to achieve that in a single data step. Below is a conditional infile statement, but I don't think SAS supports it or it's just incorrect. I think another option is to change any LFs (\n) to CRs (\r) in data and then use termstr=CR. I tried that with Perl (_infile_= prxchange('s/\n/\r/', -1, _infile_)), but instead of adding an actual carriage return, PRXCHANGE adds "\r" as a text: "sentence one as an example\rsentence_& "two" /. as ..." The issue here is how to convert LF into CR line-by-line during the datastep.
data test;
infile datalines truncover;
input x $100.;
datalines;
sentence one as an example
sentence_& "two" /. as an example
23 sentence three "'"' example
;
/* Create text file with CR endings */
%let txt = C:\test.txt;
filename out "&txt" termstr=cr;
data _null_;
set test;
file out;
put x;
run;
data read;
infile "&txt" truncover termstr=cr end=eof;
if eof then call symputx('nobs',_n_);
input @;
put "Lines read = &nobs" ;
if &nobs > 1 then do;
input y $100.;
end;
/* Code below does not work */
else do;
infile "&txt" termstr=lf;
input y $100.;
end;
run;
proc print data=read;
run;
If you just want to read the whole line into a single character variable then DLM will work if used with RECFM=N.
data read;
if eof then call symputx('nobs',_n_-1);
infile dat dlm='0D0A'x recfm=n end=eof;
input line :$100.;
run;
This will work with lines terminated by CR, LF or CRLF.
But this will not work if any of the lines have embedded CR or LF characters.
I think this is the solution. Need to run some more tests to confirm.
Tom, I actually did try what you suggested, except that I kept using dlm=" '0D'x '0A'x " instead of '0D0A'x and that was bringing in too many lines. However, I don't understand why '0D0A'x is working. I think that's equivalent to "CRLF," but the document contains either CR or LF and so technically this shouldn't work. Whereas I thought " '0D'x '0A'x " means "CR or LF." Is my syntax off perhaps? Some insight would be great.
Finally, since I've gone down this rabbit hole too far at this point, I would still like to understand:
@Excelsius wrote:
I think this is the solution. Need to run some more tests to confirm.
Tom, I actually did try what you suggested, except that I kept using dlm=" '0D'x '0A'x " instead of '0D0A'x and that was bringing in too many lines. However, I don't understand why '0D0A'x is working. I think that's equivalent to "CRLF," but the document contains either CR or LF and so technically this shouldn't work. Whereas I thought " '0D'x '0A'x " means "CR or LF." Is my syntax off perhaps? Some insight would be great.
Finally, since I've gone down this rabbit hole too far at this point, I would still like to understand:
- Is it possible to have a conditional INFILE statement in a single datastep?
- Is it possible to replace CR with LF (or vice versa) when reading data into SAS?
Using dlm=" '0D'x '0A'x " means that you want to treat ANY of the six characters "0","D","'","x"," ", and "A" as a delimiter. Repeating the same character (like zero or single quote) multiple times does have any effect. So you essentially used:
dlm=" '0DxA"
Using '0D0A'x means you want the two byte string that is a CR and a LF to be the set of characters that are considered delimiters. So it does not matter if the lines are terminated with CR or LF or CRLF each of those three string will be considered a delimiter for the INPUT statement.
To make it dynamic use the name of a variable in the DLM= option.
For example you could make a character variable named DLM and use DLM=DLM. Then when the INPUT statement runs it will use the current value of DLM as the set of delimiter characters.
length dlm $1 ;
infile csv dlm=dlm ..... ;
input x @;
if x=1 then dlm='0D'x;
else dlm='0A'x;
input .... ;
After some more investigation, the specific issue is that if a given line only has LF or CR with no other characters (or spaces), it is not counted (n+1) with the recfm=n dlm='0A0D'x method:
Here, line 3 is counted because it has 2 spaces, but lines 4 and 5 are not and these errors add up as lines proceed. This is not an issue when using termstr=LF rather than using a delimiter.
That is normal behavior, multiple adjacent delimiters form one "word" boundary. It is why this simple data step works.
data have ;
input x y;
cards;
1 23
102 133
;
You could try using the DSD option. That will treat adjacent delimiters as indicating an empty "word". That is why you can read CSV files using code like:
data have;
infile cards dsd truncover;
input x y z;
cards;
1,2,3
4,,6
;
But that will have some other effects.
Lines with CR and LF as the end of line characters will be treated as TWO lines.
Lines with quotes around them will have the quotes removed. You can prevent this by using the ~ modifier on the INPUT statement.
Lines that have CR or LF inside of quotes will NOT be split into two lines. That might be a positive.
filename LF temp termstr=lf;
options parmcards=LF;
parmcards;
line one
"line two"
line four
line seven
;
filename CR temp termstr=CR;
options parmcards=CR;
parmcards;
line one
"line two"
line four
line seven
;
data want1;
infile LF dlm='0D0A'x dsd recfm=n;
lineno+1;
input line ~:$100.;
run;
data want2;
infile CR dlm='0D0A'x dsd recfm=n;
lineno+1;
input line ~:$100.;
run;
@Excelsius wrote:
After some more investigation, the specific issue is that if a given line only has LF or CR with no other characters (or spaces), it is not counted (n+1) with the recfm=n dlm='0A0D'x method:
Here, line 3 is counted because it has 2 spaces, but lines 4 and 5 are not and these errors add up as lines proceed. This is not an issue when using termstr=LF rather than using a delimiter.
I was able to reproduce your problem. Then I added the INFILE option DSD (delimiter sensitive data) and the null lines were counted.
Your problem does not need dynamic changes to the delimiter. That is for when you have lines of text where the delimiter changes for part of the line or part of the file. You want to treat the whole file using the same delimiter once you figure out what it is.
You can just read the beginning of the file and look for the first CR or LF character.
%let termstr=;
data _null_;
infile cr recfm=f lrecl=32767 obs=1;
input;
lf = indexc(_infile_,'0A'x);
cr = indexc(_infile_,'0D'x);
if lf and not cr then call symputx('termstr','termstr=LF');
else if cr and not lf then call symputx('termstr','termstr=CR');
else if cr+1=lf then call symputx('termstr','termstr=CRLF');
run;
%put &=termstr;
data want;
infile cr &termstr truncover;
lineno+1;
input line $char100.;
run;
There are no macros in the program I posted. It does use a macro variable to facilitate the generation of the INFILE statement.
DLM is for between VALUES.
TERMSTR is for between LINES.
If you want to treat LINES as VALUES then use RECFM=N.
Example:
filename dat temp;
data _null_;
file dat recfm=n ;
put 'line one' '0D'x 'line two' '0D'x;
run;
data test;
infile dat recfm=n dlm='0D'x ;
input line :$80. ;
run;
proc print;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.