Solved: Re: Import file with data infile and ignore CR in middle of lines

ajulio4 · Posted 11-24-2022 05:34 AM

Hello,

I have problem with my infile data step to import a csv file. In my file i have sommetimes for lines, the CR caracter. consequence, i have my table output with anomalies. Anytime SAS see CR caracter it consider like a new line.

Can i have an option to tell SAS not to interpret the CR caracter when loading the file?

ErikLund_Jensen · Posted 11-24-2022 10:50 AM

Hi @ajulio4

There is no option for that, because SAS doesn't see the the CR character. It is not included in the record read by SAS, but works as a delimiter between records, so SAS reads the file line by line as you see it in Notepad (or any other editor).

The only way is to preprocess the file and write a new file with the broken fields collapsed. A simple solution is to count the number of deliniters, as shown below. But this simple technique works only if the file is well formed, so the same number of delimiters is found in every "complete" record, and there are no embedded delimiters within quoted fields. If this is not the case, the code needs some refinement before it works.

* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
  file in;
  put '22,"Richard Nixon","this a text field",something,more';
  put '23,"Jimmy Carter","this a long';
  put ' broken text field with';
  put ' two new lines",something,more';
  put '24,"Gerald Ford","a new text field",something,more';
run;

* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
  infile in;
  file temp;
  retain csum;
  input;
  c = count(_infile_,','); 
  if c = &expected_delimiters then put _infile_;
  else do;
    put _infile_ @;
    csum = sum(csum,c);
    if csum = &expected_delimiters then do;
      put;
      csum = 0;
    end;
  end;
run;

* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
  getnames=no;
run;

View solution in original post

fja · Posted 11-24-2022 06:10 AM

How would you like to separate lines? By length? Or how does a CR not indicating a line break differ from those doing so?
If you find the input data to be messed up my suggestion would be to sanitize the data source ... if possible.
--FJa

ErikLund_Jensen · Posted 11-24-2022 10:50 AM

Hi @ajulio4

There is no option for that, because SAS doesn't see the the CR character. It is not included in the record read by SAS, but works as a delimiter between records, so SAS reads the file line by line as you see it in Notepad (or any other editor).

The only way is to preprocess the file and write a new file with the broken fields collapsed. A simple solution is to count the number of deliniters, as shown below. But this simple technique works only if the file is well formed, so the same number of delimiters is found in every "complete" record, and there are no embedded delimiters within quoted fields. If this is not the case, the code needs some refinement before it works.

* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
  file in;
  put '22,"Richard Nixon","this a text field",something,more';
  put '23,"Jimmy Carter","this a long';
  put ' broken text field with';
  put ' two new lines",something,more';
  put '24,"Gerald Ford","a new text field",something,more';
run;

* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
  infile in;
  file temp;
  retain csum;
  input;
  c = count(_infile_,','); 
  if c = &expected_delimiters then put _infile_;
  else do;
    put _infile_ @;
    csum = sum(csum,c);
    if csum = &expected_delimiters then do;
      put;
      csum = 0;
    end;
  end;
run;

* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
  getnames=no;
run;

ajulio4 · Posted 11-24-2022 11:24 AM

Thks very much..

Tom · Posted 11-24-2022 12:00 PM

Please upvote this improvement suggestion.

https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

Note that the programmers at Excel figured out how to read CSV files with quoted embedded end of line characters. Even the idiots at snowflake.com that think text files have "record delimiters" instead of lines could figure it out. Time for SAS to add an option to support these types of files directly.

ajulio4 · Posted 11-25-2022 05:46 AM

Hello ErikLund_Jensen

thks you for your propostion yesterday. it work like i want. But can you explain me this part of the code :

else do;
put _infile_ @;
csum = sum(csum,c);
if csum = &expected_delimiters then do;
put;
csum = 0;
end;

If first c ne &expected_delimiters , what is done?

thks

ErikLund_Jensen · Posted 11-25-2022 07:45 AM

Hi @ajulio4

I will try to explain it with comments in the code. Write again if this is unclear or gives rise to further questions.

The idea is to write each input line to output. If the input line is a complete record, it is written to output, and the output line is released, so next put will write a new output line.

But if the input line is not a complete record, is is written to output, but the output line is held, so next put will continue writing the same line. The csum variable keeps track of the total number of delimiters in the partial records written, so this his is repeated for further input lines until a whole record is written to output, then the line is released.

* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
  infile in;
  file temp;

  * Direct SAS to hold the value of the summation variable csum instead of clearing for each input observation;
  retain csum;

  * Read a new line from file;
  input;

  * Count the number of delimiters in the newly read line;
  c = count(_infile_,','); 

  * If the number is as expected then the input line is a complete record, so 
  *    write it to to the output file and release it (ADD CR character);
  if c = &expected_delimiters then put _infile_;

  else do; 
    * The line is not a complete record, so it is written with the trailing @ directive, meaning:
    *    write to output file, but but hold the current output line for further writing ; 
    put _infile_ @;

    * Add the value of c (= number of delimiters in current input) to the summation variable csum;
    *   csum is now holding the sum of delimiters in the partial records wtitten so far;
    csum = sum(csum,c);

    * If the number if delimiters written so far is equal to the expected number, then 
    *   the output record is finished and can be released (write nothing and add CR Character);
    if csum = &expected_delimiters then do;
      put;

      * - and reset the counter to 0, so it is ready for next time a partial record is read from input;
      csum = 0;
    end;

    * Implicit return to next input;
  end;
run;

ajulio4 · Posted 11-30-2022 11:29 AM

I have a problem with my code here is an example of my file :

1 PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr

2 PTP;0002;App01;412; Azer

( test agerebdb);;;

3 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff

4 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff

With the code, if execute, i lost the line 3 but i read the line 4 in my table in output

ErikLund_Jensen · Posted 11-30-2022 04:51 PM

Hi @ajulio4

I am sorry to her that, and I can't explain what went wrong, because I wasn't able to reproduce the error.

When I change the number of delimiters from 4 to 7 and make appropriate changes to handle semicolon as delimiter instead of comma as in your first example, it works for me.

* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
  file in;
  put 'PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr';
  put 'PTP;0002;App01;412; Azer';
  put '( test agerebdb);;;';
  put 'PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff';
  put 'PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff';
run;

* Preprocess the file;
%let expected_delimiters = 7; /*** number of delimiters changed from 4 to 7 ***/
data _null_;
  infile in;
  file temp;
  retain csum;
  input;
  c = count(_infile_,';'); /*** delimiter changed from , to ; ***/
  if c = &expected_delimiters then put _infile_;
  else do;
    put _infile_ @;
    csum = sum(csum,c);
    if csum = &expected_delimiters then do;
      put;
      csum = 0;
    end;
  end;
run;

* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
  delimiter = ';'; /*** new statement needed, because comma is the default delimiter ***/
  getnames=no;
run;

result:

PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr
PTP;0002;App01;412; Azer( test agerebdb);;;
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff

ajulio4 · Posted 12-01-2022 10:50 AM

Hello ErikLund_Jensen,

i think i was not enough clear in my replu when am trying to produice an example of my file. there is a part of my file, there are 10 lignes, but am expecting 9 observations. but if process the file with the code, i have 8 lines so 8 observations.

i've attach a part of my file

Thks

ajulio4 · Posted 12-01-2022 10:52 AM

i want 252 colums in my table outpu so i expect "251 ; " i for my record

Tom · Posted 12-01-2022 11:16 AM

But that file does only have 8 records.

The second record is split over three lines.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         XXX;366XXXX;H;20210112;17:32:22;17004;360Travaux;10604;360Travaux-Renovation-technique;;;;;;;;;;;;;;
     101  ;;;;;;40 000 â‚¬ - 50 000 â‚¬;17000 LAROUSSE;50 Ã  100 mÂ²;RÃ©novation de piÃ¨ce;Electrique et domot
     201  ique;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
     301  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;Appartement
     401  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 434
2         XXX;366XXXX;H;20210121;12:08:46;17004;360Travaux;10950;360Travaux-Renovation-technique;;;;;;;;;;;;;;
     101  ;;;;;;Budget inconnu;17000 LAROUSSE;50 Ã  100 mÂ²;Travaux globaux 165
3         (Axxxxxxxxxxxxx, xxxxxxxxxvation, toit, fenÃªtres, xxxxtion de piÃ¨cesâ€¦);RÃ©novation globale 94
4         (rafraichissement, rÃ©agencement des espacesâ€¦);;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
     101  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
     201  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;Appartement;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 278
5         XXX;366XXXX;H;20210118;10:40:27;;XXX;XXX_CT_H000366XXXX;XXX-CalculatriceXATDGE00E;;;;;;;;;;;;;;;;;;;
     101  ;;;;;;;;;;;;;;;;BATIMENT 25 ehehehdhd de jrrrr ;00000;PAYSS;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
     201  ;;;;;;9.0;1417.0;N;N;;N;N;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;N;N;N;N;0.0;0.0;N;N;N;0.0;
     301  0.0;N;N;0.0;N;0.0;0.0;N;N;0.0;1417.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
     401  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 456

ErikLund_Jensen · Posted 12-01-2022 02:02 PM

Hi @ajulio4

I am sorry, but it is working perfectly well for me as shown below.

The file you sent me (newfile.txt) has 10 lines, first 40 bytes shown here:

dlm_count=251 line_abbrev=XXX;366XXXX;H;20210112;17:32:22;17004;36
dlm_count=31  line_abbrev=XXX;366XXXX;H;20210121;12:08:46;17004;36
dlm_count=1   line_abbrev=(Axxxxxxxxxxxxx, xxxxxxxxxvation, toit,
dlm_count=219 line_abbrev=(rafraichissement, réagencement des espa
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:27;;XXX;XXX
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:28;;XXX;XXX
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:47;;XXX;XXX
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:48;;XXX;XXX
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:51;;XXX;XXX
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:55;;XXX;XXX

The first record is complete with 251 delimiters.

The second record is split over 3 lines with 31+1+219 delimiters, totalling 251.

The remaining 6 records are all complete with 251 delimiters.

So the input file contains 8 complete records.

960  * Preprocess the file;
961  filename in 'c:\temp\newfile.txt';
962  filename temp 'c:\temp\csvtest_collapsed.csv';
963  %let expected_delimiters = 251;
964  data _null_;
965    infile in;
966    file temp;
967    retain csum;
968    input;
969    c = count(_infile_,';');
970    if c = &expected_delimiters then put _infile_;
971    else do;
972      put _infile_ @;
973      csum = sum(csum,c);
974      if csum = &expected_delimiters then do;
975        put;
976        csum = 0;
977      end;
978    end;
979  run;

NOTE: The infile IN is:
      Filename=c:\temp\newfile.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=3763,
      Last Modified=01. december 2022 19:05:57,
      Create Time=01. december 2022 19:05:56

NOTE: The file TEMP is:
      Filename=c:\temp\csvtest_collapsed.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=01. december 2022 19:44:36,
      Create Time=30. november 2022 22:25:34

NOTE: 10 records were read from the infile IN.
      The minimum record length was 94.
      The maximum record length was 466.
NOTE: 8 records were written to the file TEMP.
      The minimum record length was 434.
      The maximum record length was 537.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

The compressed file has - as expected - 8 records, because lines 2-4 are compressed into one line.

I cannot figure out why you would expect 9 records ?

Then proceed with import:

980  * Import the preprocessed file;
981  proc import datafile=temp out=want dbms=csv replace;
982    delimiter=';';
983    getnames=no;
984  run;
...
NOTE: The infile TEMP is:
      Filename=c:\temp\csvtest_collapsed.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=3759,
      Last Modified=01. december 2022 19:44:36,
      Create Time=30. november 2022 22:25:34

NOTE: 8 records were read from the infile TEMP.
      The minimum record length was 434.
      The maximum record length was 537.
NOTE: The data set WORK.WANT has 8 observations and 252 variables.
NOTE: DATA statement used (Total process time):
      real time           0.44 seconds
      cpu time            0.39 seconds

8 rows created in WORK.WANT from TEMP.

NOTE: WORK.WANT data set was successfully created.
NOTE: The data set WORK.WANT has 8 observations and 252 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.58 seconds
      cpu time            0.46 seconds

This is also as expected - 8 observations and 252 variables.

I cannot se any errors in this. If you run the same code on the same file and get a different result, please post the complete log from the steps.

ajulio4 · Posted 12-01-2022 11:06 AM

i want 252 colums in my table outpu so i expect "251 ; " i for my record

Tom · Posted 11-30-2022 08:06 PM

What are the actual end of line characters in that file?

Read the file as fixed length records (so the end of line character are treated as just ordinary characters) and look at it.

The LIST statement in a SAS data step can help with this.

So this data step will display the first 500 bytes of the file. If any of the characters are CR ('0D'x) or LF ('0A'x) then the LIST statement will show the the hex codes for ALL of the characters in the 100 byte fixed length record.

data _null_;
  infile 'myfile.txt' lrecl=100 recfm=f obs=5 ;
  input;
  list;
run;

The first question you need to answer is. What characters appear the end of the real records. If it the unix standard LF character? The Windows/DOS standard CR and LF pair? Is the deprecated standard used by the original Macintosh of just a CR?

The second quest is do the extra line breaks in the middle of a record use the exact same characters?

If you are lucky the real lines end with CRLF and the extra breaks you are seeing are caused by just CR or just LF. IF that is true then SAS will read the file properly if you tell it to use CRLF as the end of line characters.

data want;
  infile 'myfile.txt' dsd dlm=';' truncover termstr=crlf;
  input (var1-var8) (:$30.);
run;

Otherwise you are in trouble because the values that have the embedded end of line characters are not quoted. The standard definition of delimited files says that embedded delimiters or end of line characters need to have the values quoted. SAS does not directly support files with embedded end of line characters, but when the value is quoted it is simple to just count the quotes to know whether or not the end of line is occurring inside a quoted string. Such as with this macro %replace_crlf()

So if your file falls into neither of those patterns then you can come close to fixing the problem by counting delimiters, but that will not work right when the embedded end of line characters appear in the last field on the line. So you might need to do some review or manual touch up to get the file converted into something that can be read.

So let's try it on your example.

First let's convert the text you posted back into a file using PARMCARDS.

parmcards4;
PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr
PTP;0002;App01;412; Azer
 ( test agerebdb);;;
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
;;;;

Now let's look at the file:

124  options generic;
125  data _null_;
126    infile example termstr=crlf;
127    input;
128    list;
129  run;

NOTE: The infile EXAMPLE is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr 53
2         PTP;0002;App01;412; Azer 24
3          ( test agerebdb);;; 20
4         PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51
5         PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51
NOTE: 5 records were read from the infile (system-specific pathname).
      The minimum record length was 20.
      The maximum record length was 53.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Let's read it in an keep track of the number of semicolons. When we hit 7 (or more!) then let's write a CRLF pair. Otherwise let's just write a bare LF. I have used $200 character variable. So the maximum line it can handle is 200 bytes. But a data step can handle character strings of up to 32,767 bytes. If the lines are longer than that you will need to use a different method.

filename fixed temp;

data _null_;
  infile example truncover length=ll termstr=crlf end=eof;
  file fixed recfm=n;
  input line $char200. ;
  length=ll;
  count+countc(line,';');
  if _n_>1 then do;
    if count>7 and _n_>1 then do;
      put '0D0A'x ;
      count=countc(line,';');
    end;
    else put '0A'x ;
  end;
  put line $varying200. length;
  if eof then put '0D0A'x;
run;

Now let's read in the resulting file the same way to look at it.

166  data _null_;
167    infile fixed termstr=crlf;
168    input;
169    list;
170  run;

NOTE: The infile FIXED is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr 53

2   CHAR  PTP;0002;App01;412; Azer. ( test agerebdb);;; 45
    ZONE  555333333477333333324767022276772666766662333
    NUMR  040B0002B10001B412B01A52A08045340175252429BBB
3         PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51
4         PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51
NOTE: 4 records were read from the infile (system-specific pathname).
      The minimum record length was 45.
      The maximum record length was 53.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Now if you don't want that linefeed in the middle of the fifth field on the second line you could change the ELSE statement that is writing it to write something else, or nothing at all.

SAS Innovate 2025: Call for Content

Classroom Training Available!