Solved: Re: Importing data with CR and LF endings with custom delimeters

Excelsius · Posted 05-10-2022 08:56 PM

After spending quite some time, so far I have been unable to rectify this seemingly simple issue.

I am importing txt files into SAS with an infile data step. There are two types of txt files: one type where each line ends with a carriage return (CR or \r) and another type where it ends with a line feed (LF or \n). I want a single program to handle either type of delimiter encountered (it's a macro) and my goal is to use a multi-delimeter specification dlm = " '0D'x, '0A'x " which covers CR and LF, respectively. Here is the issue: using dlm = '0A'x is a perfect substitute for termstr=LF, but when I try to substitute termstr=CR with dlm = '0D'x, this does not work and I have no idea why. What is the equivalent way to substitute termstr=CR with a specific delimiter? I have already checked my data in Notepad++ and confirmed that one text file has line endings with CR and another with LF.

Tom · Posted 05-11-2022 01:59 PM

If you just want to read the whole line into a single character variable then DLM will work if used with RECFM=N.

data read; 
  if eof then call symputx('nobs',_n_-1);
  infile dat dlm='0D0A'x recfm=n end=eof;
  input line :$100.;
run;

This will work with lines terminated by CR, LF or CRLF.

But this will not work if any of the lines have embedded CR or LF characters.

View solution in original post

andreas_lds · Posted 05-11-2022 02:57 AM

DLM = delimiter = char that separates variables

termstr = separator between obs

Without having termstr in the infile-statement, sas uses the default depending on the os sas runs on.

It seems as if dlm = " '0D'x, '0A'x " is processed correctly, because in the background termstr=LF is still active.

I think the only way to solve the problem is by adding a parameter to the macro that is used in termstr later on.

One could, of course, try to read the file with one termstr-option active and check the number of lines read. Then change the option and re-read the file again.

Excelsius · Posted 05-11-2022 01:46 PM

I understand that dlm is a column delimiter, but when the goal is to read an entire row of data into a single column only, I thought that might not matter. Maybe a sample code will help. The goal is to read any text file (whether observations end with CR or LF), into one column with each line as a row in SAS. I'd like to achieve that in a single data step. Below is a conditional infile statement, but I don't think SAS supports it or it's just incorrect. I think another option is to change any LFs (\n) to CRs (\r) in data and then use termstr=CR. I tried that with Perl (_infile_= prxchange('s/\n/\r/', -1, _infile_)), but instead of adding an actual carriage return, PRXCHANGE adds "\r" as a text: "sentence one as an example\rsentence_& "two" /. as ..." The issue here is how to convert LF into CR line-by-line during the datastep.

data test;
	infile datalines truncover;
	input x $100.;
	datalines;
sentence one	as an example
sentence_& "two" /. as an example
23 sentence three "'"' example
;

/* Create text file with CR endings */
%let txt = C:\test.txt;
filename out "&txt" termstr=cr;

data _null_;
	set test;
	file out;
	put x;
run;

data read; 
	infile "&txt" truncover termstr=cr end=eof;
	if eof then call symputx('nobs',_n_);
	input @;
	put "Lines read = &nobs" ;
	if &nobs > 1 then do;
		input y $100.;
	end;

    /* Code below does not work */
	else do;
		infile "&txt" termstr=lf;
		input y $100.;
	end;
run;

proc print data=read;
run;

Tom · Posted 05-11-2022 01:59 PM

If you just want to read the whole line into a single character variable then DLM will work if used with RECFM=N.

data read; 
  if eof then call symputx('nobs',_n_-1);
  infile dat dlm='0D0A'x recfm=n end=eof;
  input line :$100.;
run;

This will work with lines terminated by CR, LF or CRLF.

But this will not work if any of the lines have embedded CR or LF characters.

Excelsius · Posted 05-11-2022 02:47 PM

I think this is the solution. Need to run some more tests to confirm.

Tom, I actually did try what you suggested, except that I kept using dlm=" '0D'x '0A'x " instead of '0D0A'x and that was bringing in too many lines. However, I don't understand why '0D0A'x is working. I think that's equivalent to "CRLF," but the document contains either CR or LF and so technically this shouldn't work. Whereas I thought " '0D'x '0A'x " means "CR or LF." Is my syntax off perhaps? Some insight would be great.

Finally, since I've gone down this rabbit hole too far at this point, I would still like to understand:

Is it possible to have a conditional INFILE statement in a single datastep?
Is it possible to replace CR with LF (or vice versa) when reading data into SAS?

Tom · Posted 05-11-2022 03:23 PM

@Excelsius wrote:

I think this is the solution. Need to run some more tests to confirm.

Tom, I actually did try what you suggested, except that I kept using dlm=" '0D'x '0A'x " instead of '0D0A'x and that was bringing in too many lines. However, I don't understand why '0D0A'x is working. I think that's equivalent to "CRLF," but the document contains either CR or LF and so technically this shouldn't work. Whereas I thought " '0D'x '0A'x " means "CR or LF." Is my syntax off perhaps? Some insight would be great.

Finally, since I've gone down this rabbit hole too far at this point, I would still like to understand:

Is it possible to have a conditional INFILE statement in a single datastep?

Is it possible to replace CR with LF (or vice versa) when reading data into SAS?

Using dlm=" '0D'x '0A'x " means that you want to treat ANY of the six characters "0","D","'","x"," ", and "A" as a delimiter. Repeating the same character (like zero or single quote) multiple times does have any effect. So you essentially used:

dlm=" '0DxA"

Using '0D0A'x means you want the two byte string that is a CR and a LF to be the set of characters that are considered delimiters. So it does not matter if the lines are terminated with CR or LF or CRLF each of those three string will be considered a delimiter for the INPUT statement.

To make it dynamic use the name of a variable in the DLM= option.

For example you could make a character variable named DLM and use DLM=DLM. Then when the INPUT statement runs it will use the current value of DLM as the set of delimiter characters.

length dlm $1 ;
infile csv dlm=dlm ..... ;

input x @;
if x=1 then dlm='0D'x;
else dlm='0A'x;

input .... ;

Excelsius · Posted 05-11-2022 07:14 PM

Thanks Tom, seems like I had the incorrect syntax for the dlm initially. After running some tests, the '0D0A'x seems to be working fine. However, this syntax has created an issue where the line numbers (line_cnt+1) are incorrect, which I'm assuming is related to the recfm=n option because without it, the counts were correct for each line. If I don't figure out how to count lines correctly, I will have to look for a different solution. Do you know if there is a simple way to add a variable that will be the running count of the rows (1,2,3, etc)?

The dlm=dlm example you posted is interesting. The usage in this case would be termstr=termstr, but x would have to be some function that would detect whether the input uses CR or LF: IF (input is LF) then termstr=LF. Not sure if SAS has a way of detecting line endings. Another test would be to detect if infile ends up being just a single line (meaning that termstr did not work), but I don't think this can be implemented inside a single data step.

Excelsius · Posted 05-11-2022 09:22 PM

After some more investigation, the specific issue is that if a given line only has LF or CR with no other characters (or spaces), it is not counted (n+1) with the recfm=n dlm='0A0D'x method:

Here, line 3 is counted because it has 2 spaces, but lines 4 and 5 are not and these errors add up as lines proceed. This is not an issue when using termstr=LF rather than using a delimiter.

Tom · Posted 05-11-2022 10:47 PM

That is normal behavior, multiple adjacent delimiters form one "word" boundary. It is why this simple data step works.

data have ;
   input x y;
cards;
  1  23
102 133
;

You could try using the DSD option. That will treat adjacent delimiters as indicating an empty "word". That is why you can read CSV files using code like:

data have;
   infile cards dsd truncover;
   input x y z;
cards;
1,2,3
4,,6
;

But that will have some other effects.

Lines with CR and LF as the end of line characters will be treated as TWO lines.

Lines with quotes around them will have the quotes removed. You can prevent this by using the ~ modifier on the INPUT statement.

Lines that have CR or LF inside of quotes will NOT be split into two lines. That might be a positive.

filename LF temp termstr=lf;
options parmcards=LF;
parmcards;
line one
"line two"

line four


line seven
;

filename CR temp termstr=CR;
options parmcards=CR;
parmcards;
line one
"line two"

line four


line seven
;

data want1;
  infile LF dlm='0D0A'x dsd recfm=n;
  lineno+1;
  input line ~:$100.;
run;

data want2;
  infile CR dlm='0D0A'x dsd recfm=n;
  lineno+1;
  input line ~:$100.;
run;

mkeintz · Posted 05-11-2022 11:05 PM

@Excelsius wrote:

After some more investigation, the specific issue is that if a given line only has LF or CR with no other characters (or spaces), it is not counted (n+1) with the recfm=n dlm='0A0D'x method:

Here, line 3 is counted because it has 2 spaces, but lines 4 and 5 are not and these errors add up as lines proceed. This is not an issue when using termstr=LF rather than using a delimiter.

I was able to reproduce your problem. Then I added the INFILE option DSD (delimiter sensitive data) and the null lines were counted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 05-11-2022 10:58 PM

Your problem does not need dynamic changes to the delimiter. That is for when you have lines of text where the delimiter changes for part of the line or part of the file. You want to treat the whole file using the same delimiter once you figure out what it is.

You can just read the beginning of the file and look for the first CR or LF character.

%let termstr=;
data _null_;
   infile cr recfm=f lrecl=32767 obs=1;
   input;
   lf = indexc(_infile_,'0A'x);
   cr = indexc(_infile_,'0D'x);
   if lf and not cr then call symputx('termstr','termstr=LF');
   else if cr and not lf then call symputx('termstr','termstr=CR');
   else if cr+1=lf then call symputx('termstr','termstr=CRLF');
run;
%put &=termstr;
data want;
   infile cr &termstr truncover;
   lineno+1;
   input line $char100.;
run;

Excelsius · Posted 05-13-2022 11:52 AM

This is a great implementation of macros, which is what I was trying to figure out. It's an interesting way to call macros under _null_ and then use them in a data step. Technically it's still not a single data step, but an interesting implementation nevertheless. I can think of other areas I could apply a similar technique. Thanks for your insights.

And yes, you were both correct that the issue had to do with dsd--that resolve the count issues.

Tom · Posted 05-13-2022 11:53 AM

There are no macros in the program I posted. It does use a macro variable to facilitate the generation of the INFILE statement.

Excelsius · Posted 05-13-2022 11:05 PM

Obviously, was referring to macro variable usage under data _null_. That's why that code works and it can be expanded to other uses.

Tom · Posted 05-11-2022 08:14 AM

DLM is for between VALUES.

TERMSTR is for between LINES.

If you want to treat LINES as VALUES then use RECFM=N.

Example:

filename dat temp;
data _null_;
  file dat recfm=n ;
  put 'line one' '0D'x 'line two' '0D'x;
run;

data test;
  infile dat recfm=n dlm='0D'x ;
  input line :$80. ;
run;

proc print;
run;

Classroom Training Available!