Solved: INFILE statment to read lines with sometimes a Carriage Return in the ...

CHUNantes · Posted 05-03-2021 02:48 AM

Hello.

We try to load a file into a SAS table so we use INFILE filename DLM='\' DSD

FILENAME ficin "path_to_file.txt";

DATA work.users;
   INFILE ficin DLM='\' DSD;
   INFORMAT Name $20. City $20. Birthdate DDMMYY10.;
   FORMAT Name $20. City $20. Birthdate DDMMYY10.;
   INPUT Name $ City $ Birthdate;
RUN;

But sometimes we can receive a file with a Carriage Return in the middle of the line :

Mikael\NANTES\01/01/2000
Guillaume\
NANTES\01/02/2000
Anne\\01/03/2000

And of course, we have to manage missing values so we need to keep the DSD option.

How can we load this file ?

We have try to remove the CR character using sed on our Linux server before reading the file without succes.

Regards

Ksharp · Posted 05-03-2021 07:17 AM

data _null_;
 infile "c:\temp\donnees.txt" dsd termstr=crlf;
 file   "c:\temp\want.txt";
 input ;
 length temp $ 2000;
 retain temp;
 temp=cats(temp,_infile_); 
 if countw(temp,'\','mq')=7 then do;put temp;call missing(temp);end;
run;

data want;
 infile   "c:\temp\want.txt" DLM='\' dsd termstr=crlf;
 length  N_Code_NoIPP 8
           C_Code_NoIPPExterne $12.
            C_Nom $50.
            C_Prenom $50.
            C_Telephone $20.
            C_Var1 $1.
            C_Var2 $1.
            ;
    input   N_Code_NoIPP
            C_Code_NoIPPExterne $
            C_Nom $
            C_Prenom $
            C_Telephone $
            C_Var1 $
            C_Var2 $
            ;
RUN;

View solution in original post

Patrick · Posted 05-03-2021 04:08 AM

What is the end of line indicator for your data? LF , CRLF ?

In Unix the default SAS uses is LF so a CR in the data shouldn't mess-up mapping of data into SAS variables. The CR would get stored in a variable though which is not something you normally want.

1. Figure out the EoL (I'm normally using Notepad++ for that just opening the file while showing all characters).

2. If EoL anything other than LF then use infile option TERMSTR to define it - i.e. INFILE.... TERMSTR=CRLF;

3. To remove CR and other control characters use translate() or compress() or whatever is more suitable.

data want;
infile ..... termstr=CRLF truncover dsd dlm='\';
input @;
_infile_=compress(_infile_,,'c');
input <your vars>;
run;

CHUNantes · Posted 05-03-2021 04:43 AM

Thanks for your replie.

The file use CRLF as end of line

If I use Notepad++ with Hexa editor, here is a valid line ending :

And here is a line with the CRLF in the middle of the line :

As the special character is a0, I have try to replace a00d0a with space or another character using sed command but I didn't succeed.

With TERMSTR=CRLF, only the first line is read.

As all lines are ending with \+CRLF (so 5c 0d 0a), other occurence of 0d 0a should be removed.

Ksharp · Posted 05-03-2021 06:42 AM

It is complicated question. Better post your real data to demonstrate your Q.

data _null_;
 infile "c:\temp\path_to_file.txt" dsd termstr=crlf;
 file   "c:\temp\want.txt";
 input ;
 length temp $ 2000;
 retain temp;
 temp=cats(temp,_infile_); 
 if countw(temp,'\','mq')=3 then do;put temp;call missing(temp);end;
run;

data want;
 infile   "c:\temp\want.txt" DLM='\' dsd termstr=crlf;
 INFORMAT Name $20. City $20. Birthdate DDMMYY10.;
   FORMAT Name $20. City $20. Birthdate DDMMYY10.;
   INPUT Name $ City $ Birthdate;
RUN;

Ksharp · Posted 05-03-2021 06:42 AM

It is complicated question. Better post your real data to demonstrate your Q.

data _null_;
 infile "c:\temp\path_to_file.txt" dsd termstr=crlf;
 file   "c:\temp\want.txt";
 input ;
 length temp $ 2000;
 retain temp;
 temp=cats(temp,_infile_); 
 if countw(temp,'\','mq')=3 then do;put temp;call missing(temp);end;
run;

data want;
 infile   "c:\temp\want.txt" DLM='\' dsd termstr=crlf;
 INFORMAT Name $20. City $20. Birthdate DDMMYY10.;
   FORMAT Name $20. City $20. Birthdate DDMMYY10.;
   INPUT Name $ City $ Birthdate;
RUN;

CHUNantes · Posted 05-03-2021 06:45 AM

I will post a small part of the file but I have to anonymise it before 😉

CHUNantes · Posted 05-03-2021 06:57 AM

Here is a sample of the file and the SAS programm.

filename ficin "C:\temp\donnees.txt" LRECL=1024;

data donnees;
    infile ficin missover dsd dlm='\';
    length  C_Code_NoIPPExterne $12.
            C_Nom $50.
            C_Prenom $50.
            C_Telephone $20.
            C_Var1 $1.
            C_Var2 $1.
            ;
    input   N_Code_NoIPP
            C_Code_NoIPPExterne $
            C_Nom $
            C_Prenom $
            C_Telephone $
            C_Var1 $
            C_Var2 $
            ;
run;

Ksharp · Posted 05-03-2021 07:17 AM

data _null_;
 infile "c:\temp\donnees.txt" dsd termstr=crlf;
 file   "c:\temp\want.txt";
 input ;
 length temp $ 2000;
 retain temp;
 temp=cats(temp,_infile_); 
 if countw(temp,'\','mq')=7 then do;put temp;call missing(temp);end;
run;

data want;
 infile   "c:\temp\want.txt" DLM='\' dsd termstr=crlf;
 length  N_Code_NoIPP 8
           C_Code_NoIPPExterne $12.
            C_Nom $50.
            C_Prenom $50.
            C_Telephone $20.
            C_Var1 $1.
            C_Var2 $1.
            ;
    input   N_Code_NoIPP
            C_Code_NoIPPExterne $
            C_Nom $
            C_Prenom $
            C_Telephone $
            C_Var1 $
            C_Var2 $
            ;
RUN;

CHUNantes · Posted 05-03-2021 08:17 AM

Thank you for your code.

My original file have more variables by lines and I have to deal with ' used by users in some fields but now I can load the file 😉

Here is the programm I use :

data temp;
 infile "/ECHANGE/PRQ/in/archive/CLI_20210501/202105010051DFD_MVTS_PATIENTS.NDW" dsd missover;
 file "/ECHANGE/PRQ/in/archive/CLI_20210501/FD_MVTS_PATIENTS.NDW";
 input ;
 length temp $ 2000;
 retain temp;
 temp = cats(temp,_infile_);
 temp = TRANWRD(temp,"'","¤");
 nb_var = countw(temp,'\','mq');
 if nb_var=42 then do;
  temp = TRANWRD(temp,"¤","'");
  put temp;
  call missing(temp);
 end;
run;

Ksharp · Posted 05-03-2021 07:26 AM

But I prefer to Perl Regular Expression.

data temp;
 infile "c:\temp\donnees.txt" dsd termstr=crlf length=len;
 input x $varying400. len;
 if prxmatch('/^\d{7}\\/',_infile_) then group+1;
run;

data temp1;
 length temp $ 2000;
do until(last.group);
 set temp;
 by group;
  temp=cats(temp,x); 
end;
drop x;
run;


data want;
 set temp1;
N_Code_NoIPP=scan(temp,1,'\','mq');
            C_Code_NoIPPExterne =scan(temp,2,'\','mq');
            C_Nom =scan(temp,3,'\','mq');
            C_Prenom =scan(temp,4,'\','mq');
            C_Telephone =scan(temp,5,'\','mq');
            C_Var1 =scan(temp,6,'\','mq');
            C_Var2 =scan(temp,7,'\','mq');
            ;
RUN;

Tom · Posted 05-03-2021 08:56 AM

So looking at that example file we have some answers.

End of line is CR+LF.

Embedded line breaks are also CR+LF.

No values are quoted.

There are no embedded delimiters.

The number of delimiters per (intended) record is fixed at 6. In addition it looks like there is an extra delimiter after the last value (which actually helps).

So you should be able to just COUNT the number of delimiters to determine whether you have need to remove a line break.

In this case the expected number of \ characters per line is 6.

So here is a program to convert the existing text file to a new file where the extra line breaks are removed. And then read the new file to confirm it is working.

filename csv "c:\downloads\donnees.txt";
filename fixed temp;

data _null_;
  infile csv end=eof;
  file fixed ;
  do until(nslash>=6 or eof) ;
    input ;
    nslash=sum(nslash,countc(_infile_,'\'));
    put _infile_ @;
  end;
  put;
run;

data want;
  infile fixed dlm='\' truncover ;
  input (var1-var5) (:$30.);
run;

proc print;
run;

Results:

Obs     var1        var2       var3       var4           var5

 1     1234567    123456789    Name1    Surname1     01 02 03 04 05
 2     1234567    123456789    Name2    Surname2    01.02.03.04.05.

Tom · Posted 05-03-2021 08:33 AM

To determine how hard it will be to fix the file so that SAS can parse it you need to answer some questions.

1) Are the fields that contain the end-of-line characters enclosed in quotes?

2) Are there a constant number of fields per line?

3) Does the delimiter character ever appear in the value of a field?

To look at the file you can use SAS data step and the LIST statement.

For example to look at the first 10 lines

data _null_;
  infile 'myfile' obs=10;
  input;
  list;
run;

To really see the end of line characters use RECFM=F on the infile statement.

So to look at the first 1,000 bytes of the file you could use:

data _null_;
  infile 'myfile' obs=10 recfm=f lrecl=100;
  input;
  list;
run;

INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Re: INFILE statment to read lines with sometimes a Carriage Return in the middle of the line

Classroom Training Available!