Hi Community,
I am trying to prepare a file with a speciel format
it is a csv format but with a pund sign(£) as delimiter and a lot if linefeeds/carriage returns in some of the text columns.
a line in my file with look like this
col1£col2£col3£col4£col5$
myname£myaftername£myaddress£this is a text column
wtih crlf all over
the place £col5_account
now i am trying to remove those crtl by count the oocourence if £ and remove them until the end of line, but a am still a newbie in SAS programming so i hoped for some kind advice in this forum.
My approach until now has been this peace of code, it is based on an example from SAS removing crtl inside quoted text columns
but i cant get my count to work properly and uit just writes \n all over right now.
%let repA='@'; /* replacement character LF */
%let repD='$'; /* replacement character CR */
%let dsnnme="/folders/myfolders/testdata.txt";
/* use full path of CSV file */
data test;
infile &dsnnme recfm=n sharebuffers;
file &dsnnme recfm=n;
retain open 0;
length cnt 8.;
retain cnt 0;
input a $char1.;
/* If the character is a £ i increment the cnt variable by 1 */
if a = 'A3'x then
cnt = cnt + 1;
/* if cnt is not end of line remove any crtl */
*/
if cnt < 5 then do;
if a = '0D'x then put &repD;
else if a = '0A'x then put &repA;
end;
if cnt = 5 then do;
/* this should capture the end of line */
cnt = 0;
end;
run;
First thing is to not overwrite your original file. Make a new file with the fix. That way you can test until you get it right without having to keep re-creating the input.
If you just want to count the number of delimiters then you should be able to do it by just reading and writing lines. To remove the embedded line breaks just don't write end of line until you see the next record starting.
First let's convert your example into an actual file.
filename source temp;
data _null_;
file source ;
put
'col1£col2£col3£col4£col5'
/'myname£myaftername£myaddress£this is a text column'
/'wtih crlf all over'
/'the place £col5_account'
;
run;
So here is what it looks like:
164 data _null_; 165 infile source ; 166 input ; 167 list; 168 run; NOTE: The infile SOURCE is: Filename=.../#LN00024, Owner Name=XXX,Group Name=YYY, Access Permission=-rw-rw-r--, Last Modified=14Dec2019:13:05:57, File Size (bytes)=119 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6 1 col1£col2£col3£col4£col5 24 2 myname£myaftername£myaddress£this is a text column 50 3 wtih crlf all over 18 4 the place £col5_account 23 NOTE: 4 records were read from the infile SOURCE. The minimum record length was 18. The maximum record length was 50.
Now let's run a data step to remove the breaks. This one will add a space where the line break was.
filename target temp;
data _null_;
infile source end=eof;
file target ;
input ;
count=countc(_infile_,'£');
if total+count > 4 then do ;
put ;
total = 0;
end;
if total then put ' ' @;
total+count;
put _infile_ @;
if eof then put;
run;
And here is the result:
185 data _null_; 186 infile target ; 187 input ; 188 list; 189 run; NOTE: The infile TARGET is: Filename=.../#LN00025, Owner Name=XXX,Group Name=YYY, Access Permission=-rw-rw-r--, Last Modified=14Dec2019:13:05:57, File Size (bytes)=119 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 col1£col2£col3£col4£col5 24 2 myname£myaftername£myaddress£this is a text column wtih crlf all over the place £col5_account 93 NOTE: 2 records were read from the infile TARGET. The minimum record length was 24. The maximum record length was 93.
The right solution depends on how the file is formatted.
Are the values with embedded CR and/or LF characters enclosed in quotes? If they are then the solution posted by SAS that counts quotes will work. http://support.sas.com/kb/26/065.html
Do the values ever contain actual British pound symbols? If not then just counting the number of pound symbols might work. But it will get tricky if the embedded line breaks appear in the last value on the line.
If yes then how have the protected those symbols so that you can tell which are delimiters and which are data? There are two normally used styles. One is to enclose the value in quotes (that is what SAS does and the DSD infile option supports). The other is the "escape" the character by preceding it with a special character, normally the backslash, \ , character.
First thing is to not overwrite your original file. Make a new file with the fix. That way you can test until you get it right without having to keep re-creating the input.
If you just want to count the number of delimiters then you should be able to do it by just reading and writing lines. To remove the embedded line breaks just don't write end of line until you see the next record starting.
First let's convert your example into an actual file.
filename source temp;
data _null_;
file source ;
put
'col1£col2£col3£col4£col5'
/'myname£myaftername£myaddress£this is a text column'
/'wtih crlf all over'
/'the place £col5_account'
;
run;
So here is what it looks like:
164 data _null_; 165 infile source ; 166 input ; 167 list; 168 run; NOTE: The infile SOURCE is: Filename=.../#LN00024, Owner Name=XXX,Group Name=YYY, Access Permission=-rw-rw-r--, Last Modified=14Dec2019:13:05:57, File Size (bytes)=119 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6 1 col1£col2£col3£col4£col5 24 2 myname£myaftername£myaddress£this is a text column 50 3 wtih crlf all over 18 4 the place £col5_account 23 NOTE: 4 records were read from the infile SOURCE. The minimum record length was 18. The maximum record length was 50.
Now let's run a data step to remove the breaks. This one will add a space where the line break was.
filename target temp;
data _null_;
infile source end=eof;
file target ;
input ;
count=countc(_infile_,'£');
if total+count > 4 then do ;
put ;
total = 0;
end;
if total then put ' ' @;
total+count;
put _infile_ @;
if eof then put;
run;
And here is the result:
185 data _null_; 186 infile target ; 187 input ; 188 list; 189 run; NOTE: The infile TARGET is: Filename=.../#LN00025, Owner Name=XXX,Group Name=YYY, Access Permission=-rw-rw-r--, Last Modified=14Dec2019:13:05:57, File Size (bytes)=119 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 col1£col2£col3£col4£col5 24 2 myname£myaftername£myaddress£this is a text column wtih crlf all over the place £col5_account 93 NOTE: 2 records were read from the infile TARGET. The minimum record length was 24. The maximum record length was 93.
Thanks,
Actually the result im trying to chive is the last three line of this test file attached, i know that there will be no crtl on the last column it is a number
I am extracting data from a system where some text is typed in by human and therefore can contain linefeeds and carriage returns.
Your example does not seem consistent with 5 fields per record.
Do you have any control over how the values are dumped from the source? If you can get the source to generate the file in a systematic way that it is possible to interpret it will make it much easier.
The first line is the header row. And it looks like the the last 3 lines are normal records.
But the stuff in the middle looks confused, they appear to be grouped in three lines per record.
So lines 2 and 3 look like a record, but then line 4 looks like ii has just the last two fields, or two extra fields.
But the other groups seem ok as they do not have that extra delimiter like line 3 does.
75 options ls=132; 76 data _null_; 77 infile source encoding='utf8' ; 78 input; 79 list ; 80 run; NOTE: The infile SOURCE is: Filename=c:\downloads\testdata.txt, RECFM=V,LRECL=131068,File Size (bytes)=702, Last Modified=14Dec2019:13:50:36, Create Time=14Dec2019:13:50:35 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 col1£col2£col3£col4£col5 24 2 simon£jespersen£højbovej£Dette er er fri tekst 46 3 med en masse£23 16 4 linieskift i£21 15 5 simon£jespersen£højbovej£Dette er er fri tekst 46 6 med en masse 13 7 linieskift i£20 15 8 simon£jespersen£højbovej£Dette er er fri tekst 46 9 med en masse 13 10 linieskift i£10 15 11 simon£jespersen£højbovej£Dette er er fri tekst 46 12 med en masse 13 13 linieskift i£07 15 14 simon£jespersen£højbovej£Dette er er fri tekst 46 15 med en masse 13 16 linieskift i£01 15 17 simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£25 74 18 simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£26 74 19 simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£27 74 NOTE: 19 records were read from the infile SOURCE. The minimum record length was 13. The maximum record length was 74.
Thank you very much,
its funny on my computer it did not work by doing count=countc(_infile_, '£');
i needed to test on hex value count=countc(_infile_,'A3'x );
But your solution seems to solve my problem, thank you very much.
Hi @havmaage
Instead of writing an output record in pieces, I find it easier to concatenate input lines to a SAS variable, until the expected number of delimiters is found, and then output and start building a new line. The following simple code will work with any number of line breaks if the input follow these rules:
filename source 'c:\temp\source.csv';
data _null_;
file source ;
put 'col1£col2£col3£col4£col5$';
put 'myname£myaftername£myaddress£this is a text column';
put 'wtih crlf all over';
put 'the place £col5_account';
put 'col1£col2£col3£col4£col5$';
put 'myname£myaftername£myaddress£this is a text column';
put 'with a single crlf £col5_account';
put 'col1£col2£col3£col4£col5$';
put 'myname£myaftername£myaddress£this is a text column without crlf in the free text column £col5_account';
put 'col1£col2£col3£col4£col5$myname£myaftername£myaddress£this is a text column without crlf in the free text column £col5_account';
;
run;
filename target 'c:\temp\target.csv';
data _null_;
infile source;
file target lrecl = 32000;
retain outline;
length outline $32000;
input;
outline = catx(' ',outline,_infile_);
if countc(outline,'£') >= 8 then do;
put outline;
outline = '';
end;
run;
v.h.
Erik
The issue, as I see it, is that a £ terminates the first 4 variables, while a CRLF terminates the last variable, correct?
If so, then if you assume
then this works:
filename oneline temp;
data _null_;
infile 'c:\temp\t.txt' ;
length single_line $500;
do until (n_dlim=4);
input;
n_dlim=sum(n_dlim,countc(_infile_,'£'));
if char(single_line,length(single_line))^='£' then single_line=catx(' ',single_line,_infile_); else
single_line=cats(single_line,_infile_);
end;
file oneline;
put single_line ;
run;
data want;
infile oneline dlm='£' truncover ;
informat col1-col5 $char100. ;
input col1 col2 col3 col4 col5 ;
run;
Like your program it converts multiple lines into a single line with all 5 variables - i.e. it effectively removes the excess CRLF's. It write the single line to a temporary raw file, which is read normally using the DLM='£'
delimiter. A single line is considered complete when it has accommodated 4 £'
signs. This is why the program relies on the assumption that the fifth function has no leading or internal CRLF's (a trailing CRLF is of course, indicator of end-of-record).
One note: If a '£'
is immediately followed by a CRLF, I insert a blank in the concatenation of the accumulating single line and the record fragment represented by _INFILE_. Otherwise just concatenate normally.
You can even do this in one step, by parsing the complete SINGLE_LINE text with a set of SCAN functions, as in:
data want2 (keep=col:);
infile 'c:\temp\t.txt' ;
length single_line $500;
do until (n_dlim=4);
input;
n_dlim=sum(n_dlim,countc(_infile_,'£'));
if char(single_line,length(single_line))^='£' then single_line=catx(' ',single_line,_infile_); else
single_line=cats(single_line,_infile_);
end;
col1=scan(single_line,1,'£');
col2=scan(single_line,2,'£');
col3=scan(single_line,3,'£');
col4=scan(single_line,4,'£');
col5=scan(single_line,5,'£');
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.