The CSV sample dataset we are using is attached and contains special carriage return characters (0A and 0D that we cannot make SAS read them as such) which in turn cause open quotation marks. We are trying to either count the number of double-quotes or the number of commas (delimiters) to determine if an observation completes before making a new observation. When an observation does not complete, I would like to hold that observation within the buffer until the next observation can be added to the end.
data _test_ ;
infile "C:\Desktop\sample_issues.CSV"
dlm=','
dsd
missover
lrecl=30000
firstobs = 2
sharebuffers
;
do until (comma > 5); /* 5 is an arbitrary number used with this sample file */
/*infile to buffer */
input
;
put _infile_ @;
/*Count the number of commas in each observation within the buffer */
comma = countw(_infile_,',');
/* If comma variable is less than 165, then current observation needs to be held in buffer
until next obs is added to end */
if comma < 5 then
put @ ;
else put _infile_;
/* put _infile_@;*/
end;
drop comma;
run;
I believe that you can only treat _INFILE_ like a variable if the LRECL of the records is below a certain value like 32K.
You could convert to program to read the lines character by character instead using RECFM=N.
filename copy temp;
data _null_;
infile 'c:\downloads\sample_issues.csv' recfm=n ;
file copy recfm=n ;
input x $char1. ;
quotes+(x='"');
if mod(quotes,2) and (x='0D'x or x='0A'x) then put '|';
else put x $char1.;
run;
data _null_;
infile copy dsd dlm=',' truncover termstr=CRLF;
length x1-x6 $200 ;
input x1-x6;
put (x1-x6) (=/);
run;
The file you posted appears to only have embedded linefeeds.
So you can read it directly by just using the TERMSTR=CRLF option on the INFILE statement.
Tom,
I agree the file contains embedded linefeeds that are corrected by TERMSTR, but the second observation "PROOF OF RETURN RECEIVED REFI" contains a linefeed not picked up by the TERMSTR=CRLF option. I have included the code and output I receive below.
data _test_ ;
infile "C:\Desktop\sample_issues.CSV"
dlm=','
dsd
termstr=crlf
missover
lrecl=30000
firstobs = 2
sharebuffers
;
length
Name $10
ID 8
Comment $30
Type $30
Class 3
Message $20
;
input
Name $
ID
Comment $
Type $
Class $
Message $
;
informat
Name $10.
ID 8.
Comment $30.
Type $30.
Class 3.
Message $20.
;
run;
Name ID Comment Type Class Message
adfjlkj 515511 aldfjd here 100 55 aldfjd here
lmnop" 5465 "PROOF OF RETURN RECEIVED .
REFI" 52 7 hhhttt REFI .
abc 244187 jfa;kj PLEASE NOTE (35 NOT "53")" 12 wjlkj
kjklwj 266121 jkjoej 250 75 jfk;ajs
So you do need to pre-process the file to get rid of the offending CR+LF combinations.
The basic method is to remove or replace the ones that appear inside of quotes. Here is one method. I am replacing them with '|', but you could replace with '0A'x if you want.
filename copy temp;
data _null_;
infile 'c:\downloads\sample_issues.csv' ;
file copy ;
input ;
put _infile_ @;
q + lengthn(compress(_infile_,'"','k')) ;
if 0=mod(q,2) then put;
else put '|' @;
run;
data _null_;
infile copy dsd dlm=',' truncover termstr=CRLF;
length x1-x6 $200 ;
input x1-x6;
put (x1-x6) (=/);
run;
Tom,
Your above code solved our issue with the sample file; however, when I apply the code to our full file there appears to be additional returns. The master file contains 262 fields, 250,000+ rows, and is well under the maximum string length. I applied your code to a smaller file with only 16 fields and 6 records(changing the first infile statement and length, input, and put statements in the second data step to 16). The code corrected our problem, but when applied to all 262 fields for 6 records and changing the length to $20000 I get linefeeds where previously there weren't any.
I believe that you can only treat _INFILE_ like a variable if the LRECL of the records is below a certain value like 32K.
You could convert to program to read the lines character by character instead using RECFM=N.
filename copy temp;
data _null_;
infile 'c:\downloads\sample_issues.csv' recfm=n ;
file copy recfm=n ;
input x $char1. ;
quotes+(x='"');
if mod(quotes,2) and (x='0D'x or x='0A'x) then put '|';
else put x $char1.;
run;
data _null_;
infile copy dsd dlm=',' truncover termstr=CRLF;
length x1-x6 $200 ;
input x1-x6;
put (x1-x6) (=/);
run;
The code above appears to have corrected our carriage return issue; however, upon reading in the CSV file i noticed any numeric field that previously contained decimals are now read in with a space replacing the decimal. Is this due to reading in as a byte?
The use of $CHAR format should have prevented the periods from being converted to spaces.
data _null_;
input @1 period $char1. @1 space $1. ;
put (period space) (=) '_infile_=' _infile_ ;
cards;
.
1
A
;;;;
I think if you just add lrecl=32767 to the two "infile" stmts in Tom's first code sample it will work.
Is this something you want in the end to implement using SAS DI Studio?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.