Hello,
I have problem with my infile data step to import a csv file. In my file i have sommetimes for lines, the CR caracter. consequence, i have my table output with anomalies. Anytime SAS see CR caracter it consider like a new line.
Can i have an option to tell SAS not to interpret the CR caracter when loading the file?
Hi @ajulio4
There is no option for that, because SAS doesn't see the the CR character. It is not included in the record read by SAS, but works as a delimiter between records, so SAS reads the file line by line as you see it in Notepad (or any other editor).
The only way is to preprocess the file and write a new file with the broken fields collapsed. A simple solution is to count the number of deliniters, as shown below. But this simple technique works only if the file is well formed, so the same number of delimiters is found in every "complete" record, and there are no embedded delimiters within quoted fields. If this is not the case, the code needs some refinement before it works.
* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
file in;
put '22,"Richard Nixon","this a text field",something,more';
put '23,"Jimmy Carter","this a long';
put ' broken text field with';
put ' two new lines",something,more';
put '24,"Gerald Ford","a new text field",something,more';
run;
* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
infile in;
file temp;
retain csum;
input;
c = count(_infile_,',');
if c = &expected_delimiters then put _infile_;
else do;
put _infile_ @;
csum = sum(csum,c);
if csum = &expected_delimiters then do;
put;
csum = 0;
end;
end;
run;
* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
getnames=no;
run;
Hi @ajulio4
There is no option for that, because SAS doesn't see the the CR character. It is not included in the record read by SAS, but works as a delimiter between records, so SAS reads the file line by line as you see it in Notepad (or any other editor).
The only way is to preprocess the file and write a new file with the broken fields collapsed. A simple solution is to count the number of deliniters, as shown below. But this simple technique works only if the file is well formed, so the same number of delimiters is found in every "complete" record, and there are no embedded delimiters within quoted fields. If this is not the case, the code needs some refinement before it works.
* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
file in;
put '22,"Richard Nixon","this a text field",something,more';
put '23,"Jimmy Carter","this a long';
put ' broken text field with';
put ' two new lines",something,more';
put '24,"Gerald Ford","a new text field",something,more';
run;
* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
infile in;
file temp;
retain csum;
input;
c = count(_infile_,',');
if c = &expected_delimiters then put _infile_;
else do;
put _infile_ @;
csum = sum(csum,c);
if csum = &expected_delimiters then do;
put;
csum = 0;
end;
end;
run;
* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
getnames=no;
run;
Please upvote this improvement suggestion.
Note that the programmers at Excel figured out how to read CSV files with quoted embedded end of line characters. Even the idiots at snowflake.com that think text files have "record delimiters" instead of lines could figure it out. Time for SAS to add an option to support these types of files directly.
Hello ErikLund_Jensen
thks you for your propostion yesterday. it work like i want. But can you explain me this part of the code :
else do;
put _infile_ @;
csum = sum(csum,c);
if csum = &expected_delimiters then do;
put;
csum = 0;
end;
If first c ne &expected_delimiters , what is done?
thks
Hi @ajulio4
I will try to explain it with comments in the code. Write again if this is unclear or gives rise to further questions.
The idea is to write each input line to output. If the input line is a complete record, it is written to output, and the output line is released, so next put will write a new output line.
But if the input line is not a complete record, is is written to output, but the output line is held, so next put will continue writing the same line. The csum variable keeps track of the total number of delimiters in the partial records written, so this his is repeated for further input lines until a whole record is written to output, then the line is released.
* Preprocess the file;
filename temp 'c:\temp\csvtest_collapsed.csv';
%let expected_delimiters = 4;
data _null_;
infile in;
file temp;
* Direct SAS to hold the value of the summation variable csum instead of clearing for each input observation;
retain csum;
* Read a new line from file;
input;
* Count the number of delimiters in the newly read line;
c = count(_infile_,',');
* If the number is as expected then the input line is a complete record, so
* write it to to the output file and release it (ADD CR character);
if c = &expected_delimiters then put _infile_;
else do;
* The line is not a complete record, so it is written with the trailing @ directive, meaning:
* write to output file, but but hold the current output line for further writing ;
put _infile_ @;
* Add the value of c (= number of delimiters in current input) to the summation variable csum;
* csum is now holding the sum of delimiters in the partial records wtitten so far;
csum = sum(csum,c);
* If the number if delimiters written so far is equal to the expected number, then
* the output record is finished and can be released (write nothing and add CR Character);
if csum = &expected_delimiters then do;
put;
* - and reset the counter to 0, so it is ready for next time a partial record is read from input;
csum = 0;
end;
* Implicit return to next input;
end;
run;
I have a problem with my code here is an example of my file :
1 PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr
2 PTP;0002;App01;412; Azer
( test agerebdb);;;
3 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
4 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
With the code, if execute, i lost the line 3 but i read the line 4 in my table in output
Hi @ajulio4
I am sorry to her that, and I can't explain what went wrong, because I wasn't able to reproduce the error.
When I change the number of delimiters from 4 to 7 and make appropriate changes to handle semicolon as delimiter instead of comma as in your first example, it works for me.
* Create test file;
* Not part of the solution, just to simulate your input with broken lines;
filename in 'c:\temp\csvtest.csv';
data _null_;
file in;
put 'PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr';
put 'PTP;0002;App01;412; Azer';
put '( test agerebdb);;;';
put 'PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff';
put 'PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff';
run;
* Preprocess the file;
%let expected_delimiters = 7; /*** number of delimiters changed from 4 to 7 ***/
data _null_;
infile in;
file temp;
retain csum;
input;
c = count(_infile_,';'); /*** delimiter changed from , to ; ***/
if c = &expected_delimiters then put _infile_;
else do;
put _infile_ @;
csum = sum(csum,c);
if csum = &expected_delimiters then do;
put;
csum = 0;
end;
end;
run;
* Import the preprocessed file;
proc import datafile=temp out=want dbms=csv replace;
delimiter = ';'; /*** new statement needed, because comma is the default delimiter ***/
getnames=no;
run;
result:
PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr
PTP;0002;App01;412; Azer( test agerebdb);;;
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
Hello ErikLund_Jensen,
i think i was not enough clear in my replu when am trying to produice an example of my file. there is a part of my file, there are 10 lignes, but am expecting 9 observations. but if process the file with the code, i have 8 lines so 8 observations.
i've attach a part of my file
Thks
But that file does only have 8 records.
The second record is split over three lines.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 XXX;366XXXX;H;20210112;17:32:22;17004;360Travaux;10604;360Travaux-Renovation-technique;;;;;;;;;;;;;; 101 ;;;;;;40 000 € - 50 000 €;17000 LAROUSSE;50 à 100 m²;Rénovation de pièce;Electrique et domot 201 ique;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 301 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;Appartement 401 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 434 2 XXX;366XXXX;H;20210121;12:08:46;17004;360Travaux;10950;360Travaux-Renovation-technique;;;;;;;;;;;;;; 101 ;;;;;;Budget inconnu;17000 LAROUSSE;50 à 100 m²;Travaux globaux 165 3 (Axxxxxxxxxxxxx, xxxxxxxxxvation, toit, fenêtres, xxxxtion de pièces…);Rénovation globale 94 4 (rafraichissement, réagencement des espaces…);;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 101 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 201 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;Appartement;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 278 5 XXX;366XXXX;H;20210118;10:40:27;;XXX;XXX_CT_H000366XXXX;XXX-CalculatriceXATDGE00E;;;;;;;;;;;;;;;;;;; 101 ;;;;;;;;;;;;;;;;BATIMENT 25 ehehehdhd de jrrrr ;00000;PAYSS;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 201 ;;;;;;9.0;1417.0;N;N;;N;N;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;0.0;N;N;N;N;0.0;0.0;N;N;N;0.0; 301 0.0;N;N;0.0;N;0.0;0.0;N;N;0.0;1417.0;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 401 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 456
Hi @ajulio4
I am sorry, but it is working perfectly well for me as shown below.
The file you sent me (newfile.txt) has 10 lines, first 40 bytes shown here:
dlm_count=251 line_abbrev=XXX;366XXXX;H;20210112;17:32:22;17004;36 dlm_count=31 line_abbrev=XXX;366XXXX;H;20210121;12:08:46;17004;36 dlm_count=1 line_abbrev=(Axxxxxxxxxxxxx, xxxxxxxxxvation, toit, dlm_count=219 line_abbrev=(rafraichissement, réagencement des espa dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:27;;XXX;XXX dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:28;;XXX;XXX dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:47;;XXX;XXX dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:48;;XXX;XXX dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:51;;XXX;XXX dlm_count=251 line_abbrev=XXX;366XXXX;H;20210118;10:40:55;;XXX;XXX
The first record is complete with 251 delimiters.
The second record is split over 3 lines with 31+1+219 delimiters, totalling 251.
The remaining 6 records are all complete with 251 delimiters.
So the input file contains 8 complete records.
960 * Preprocess the file; 961 filename in 'c:\temp\newfile.txt'; 962 filename temp 'c:\temp\csvtest_collapsed.csv'; 963 %let expected_delimiters = 251; 964 data _null_; 965 infile in; 966 file temp; 967 retain csum; 968 input; 969 c = count(_infile_,';'); 970 if c = &expected_delimiters then put _infile_; 971 else do; 972 put _infile_ @; 973 csum = sum(csum,c); 974 if csum = &expected_delimiters then do; 975 put; 976 csum = 0; 977 end; 978 end; 979 run; NOTE: The infile IN is: Filename=c:\temp\newfile.txt, RECFM=V,LRECL=32767,File Size (bytes)=3763, Last Modified=01. december 2022 19:05:57, Create Time=01. december 2022 19:05:56 NOTE: The file TEMP is: Filename=c:\temp\csvtest_collapsed.csv, RECFM=V,LRECL=32767,File Size (bytes)=0, Last Modified=01. december 2022 19:44:36, Create Time=30. november 2022 22:25:34 NOTE: 10 records were read from the infile IN. The minimum record length was 94. The maximum record length was 466. NOTE: 8 records were written to the file TEMP. The minimum record length was 434. The maximum record length was 537. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds
The compressed file has - as expected - 8 records, because lines 2-4 are compressed into one line.
I cannot figure out why you would expect 9 records ?
Then proceed with import:
980 * Import the preprocessed file; 981 proc import datafile=temp out=want dbms=csv replace; 982 delimiter=';'; 983 getnames=no; 984 run; ... NOTE: The infile TEMP is: Filename=c:\temp\csvtest_collapsed.csv, RECFM=V,LRECL=32767,File Size (bytes)=3759, Last Modified=01. december 2022 19:44:36, Create Time=30. november 2022 22:25:34 NOTE: 8 records were read from the infile TEMP. The minimum record length was 434. The maximum record length was 537. NOTE: The data set WORK.WANT has 8 observations and 252 variables. NOTE: DATA statement used (Total process time): real time 0.44 seconds cpu time 0.39 seconds 8 rows created in WORK.WANT from TEMP. NOTE: WORK.WANT data set was successfully created. NOTE: The data set WORK.WANT has 8 observations and 252 variables. NOTE: PROCEDURE IMPORT used (Total process time): real time 0.58 seconds cpu time 0.46 seconds
This is also as expected - 8 observations and 252 variables.
I cannot se any errors in this. If you run the same code on the same file and get a different result, please post the complete log from the steps.
i want 252 colums in my table outpu so i expect "251 ; " i for my record
What are the actual end of line characters in that file?
Read the file as fixed length records (so the end of line character are treated as just ordinary characters) and look at it.
The LIST statement in a SAS data step can help with this.
So this data step will display the first 500 bytes of the file. If any of the characters are CR ('0D'x) or LF ('0A'x) then the LIST statement will show the the hex codes for ALL of the characters in the 100 byte fixed length record.
data _null_;
infile 'myfile.txt' lrecl=100 recfm=f obs=5 ;
input;
list;
run;
The first question you need to answer is. What characters appear the end of the real records. If it the unix standard LF character? The Windows/DOS standard CR and LF pair? Is the deprecated standard used by the original Macintosh of just a CR?
The second quest is do the extra line breaks in the middle of a record use the exact same characters?
If you are lucky the real lines end with CRLF and the extra breaks you are seeing are caused by just CR or just LF. IF that is true then SAS will read the file properly if you tell it to use CRLF as the end of line characters.
data want;
infile 'myfile.txt' dsd dlm=';' truncover termstr=crlf;
input (var1-var8) (:$30.);
run;
Otherwise you are in trouble because the values that have the embedded end of line characters are not quoted. The standard definition of delimited files says that embedded delimiters or end of line characters need to have the values quoted. SAS does not directly support files with embedded end of line characters, but when the value is quoted it is simple to just count the quotes to know whether or not the end of line is occurring inside a quoted string. Such as with this macro %replace_crlf()
So if your file falls into neither of those patterns then you can come close to fixing the problem by counting delimiters, but that will not work right when the embedded end of line characters appear in the last field on the line. So you might need to do some review or manual touch up to get the file converted into something that can be read.
So let's try it on your example.
First let's convert the text you posted back into a file using PARMCARDS.
parmcards4;
PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr
PTP;0002;App01;412; Azer
( test agerebdb);;;
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff
;;;;
Now let's look at the file:
124 options generic; 125 data _null_; 126 infile example termstr=crlf; 127 input; 128 list; 129 run; NOTE: The infile EXAMPLE is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr 53 2 PTP;0002;App01;412; Azer 24 3 ( test agerebdb);;; 20 4 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51 5 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51 NOTE: 5 records were read from the infile (system-specific pathname). The minimum record length was 20. The maximum record length was 53. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Let's read it in an keep track of the number of semicolons. When we hit 7 (or more!) then let's write a CRLF pair. Otherwise let's just write a bare LF. I have used $200 character variable. So the maximum line it can handle is 200 bytes. But a data step can handle character strings of up to 32,767 bytes. If the lines are longer than that you will need to use a different method.
filename fixed temp;
data _null_;
infile example truncover length=ll termstr=crlf end=eof;
file fixed recfm=n;
input line $char200. ;
length=ll;
count+countc(line,';');
if _n_>1 then do;
if count>7 and _n_>1 then do;
put '0D0A'x ;
count=countc(line,';');
end;
else put '0A'x ;
end;
put line $varying200. length;
if eof then put '0D0A'x;
run;
Now let's read in the resulting file the same way to look at it.
166 data _null_; 167 infile fixed termstr=crlf; 168 input; 169 list; 170 run; NOTE: The infile FIXED is: (system-specific pathname), (system-specific file attributes) RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 PPP;00001;Aa001zz;123;A71 tetstst;a4jgtt;gtrall;grrrr 53 2 CHAR PTP;0002;App01;412; Azer. ( test agerebdb);;; 45 ZONE 555333333477333333324767022276772666766662333 NUMR 040B0002B10001B412B01A52A08045340175252429BBB 3 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51 4 PTP;erfff;Zerrrr;847;derrr;55zer;dfererereer;erffff 51 NOTE: 4 records were read from the infile (system-specific pathname). The minimum record length was 45. The maximum record length was 53. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Now if you don't want that linefeed in the middle of the fifth field on the second line you could change the ELSE statement that is writing it to write something else, or nothing at all.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.