Running SAS 9.4M4 on Win 7.
I'm reading a UTF-8 encoded text file. My SAS session has WLATIN1 encoding. The file has an occasional mu symbol ("cebc"x) in it.
I've been reading the file fine with the hack below, where I guess it was reading with WLATIN1 encoding (?) and I just manually converted the mu to the letter u, to make it easier to deal with.
data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2.;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x= ;
  end ;
run ;
All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the front. So now, I guess, SAS recognizes that the file is UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.
When there is a BOM, I get a log like:
733  data _null_ ;
734  infile "C:\Junk\BOM.txt";
735  input x : $2.;
736  put x= ;
737  if x="cebc"x then do ;
738  put "found a mu and converted to u" ;
739  x="u" ;
740  put x= ;
741  end ;
742  run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13
x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
So my question:
Is there a way I can tell SAS to just ignore the BOM?
I don't want to rewrite the file without the BOM. I had hoped I could specify encoding=WLATIN1 on the infile statement but that just throws an error because SAS still sees the BOM and knows it's UTF8 so says it's a conflict. It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore the BOM when reading a file if it's there.
Attached are two sample text files, noBom.txt and Bom.txt.
Thanks,
-Q.
I don't know how to force SAS to ignore the BOM. This is a clumsy workaround:
You are using a local SAS installation, right? You should be able to start the session with Unicode encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.
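The last step of that workaround might look roughly like the sketch below. It assumes the session was already started with UTF-8 encoding. The data set names work.raw and work.clean are hypothetical stand-ins, and the two-row work.raw only imitates values read from the file.

```sas
/* Show the session encoding; a Unicode session reports utf-8. */
%put Session encoding: &sysencoding ;

/* Hypothetical stand-in for values read from the text file. */
data work.raw ;
  length x $2 ;
  x = 'A' ;      output ;
  x = 'CEBC'x ;  output ;   /* mu as its two UTF-8 bytes */
run ;

/* Replace the mu, then store the result with WLATIN1 encoding */
/* via the ENCODING= data set option.                          */
data work.clean (encoding=wlatin1) ;
  set work.raw ;
  if x = 'CEBC'x then x = 'u' ;
run ;
```

Because the mu is replaced before the row is written, nothing that WLATIN1 cannot represent should reach the output data set.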
I don't know how you can ignore the file's BOM when reading it "normally".
What you can do however is read it in binary format.
You can then either discard the first 3 bytes, or copy the whole file minus the BOM and then read it per normal.
Something like:
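/* Write a small UTF-8 test file */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;
  file UTF;
  put '1é';
run;
/* Reassign the fileref with fixed-length records for byte-wise reading */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
  length FID 8 ;
  /* Open for input, record length 1, binary format */
  FID = fopen('UTF','i',1,'b');
  REC = '20'x;
  /* FREAD loads the next record into the buffer; FGET copies 1 byte into REC */
  do while(fread(FID)=0);
    RC = fget(FID,REC,1);
    put REC=;
  end;
  RC = fclose(FID);
run;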
See:
data _null_ ;
  infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
  input ;
run ;
Thanks @Ksharp , but my goal is the opposite.
By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.
What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.
When there is no BOM, I can do:
1  data _null_ ;
2  infile "C:\Junk\noBOM.txt";
3  input ;
4  run ;
NOTE: The infile "C:\Junk\noBOM.txt" is:
      Filename=C:\Junk\noBOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=20,
      Last Modified=25Mar2019:16:02:37,
      Create Time=25Mar2019:16:01:33
Note that above, there is no transcoding of the mu symbol done. I assume this means the file was read as WLATIN1 encoding, not UTF-8. I like that.
When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:
7  data _null_ ;
8  infile "C:\Junk\BOM.txt";
9  input ;
10 run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13
WARNING: A character that could not be transcoded has been replaced in record 4.
I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:
12 data _null_ ;
13 infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
14 input ;
15 run ;
ERROR: The file "C:\Junk\BOM.txt" could not be opened. A byte-order mark indicates that the data is encoded in "utf-8". This conflicts with the "wlatin1" encoding that was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time 0.00 seconds
      user cpu time 0.01 seconds
      system cpu time 0.00 seconds
      memory 294.59k
      OS Memory 16120.00k
      Timestamp 03/26/2019 08:49:07 AM
Can you skip this line ?
data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  /* OR
  input x $hex2.;
  if x='xxxx' then delete;
  */
  input ;
run ;
Sadly, no. Even without reading anything from the file, SAS sees the BOM.
1  data _null_ ;
2  infile "C:\Junk\BOM.txt" firstobs=2;
3  *input ;
4  run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
And after that, if I read the mu value, it transcodes the mu and throws the warning.
You can read the file with encoding='any' (which basically turns off transcoding):
data test;
infile "C:\Junk\BOM.txt" encoding='any';
input a $;
run;
You'll still get the binary data which you can clean up as needed.
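As a rough illustration of what "binary data" means here, the sketch below reads only the first record with encoding=any and checks whether it starts with the three UTF-8 BOM bytes ('EFBBBF'x). The variable name first3 is made up for this example, and the step assumes the sample file from the question.

```sas
data _null_ ;
  /* encoding=any turns off transcoding, so the raw bytes arrive as-is */
  infile "C:\Junk\BOM.txt" encoding=any obs=1 truncover ;
  length first3 $3 ;
  input first3 $char3. ;          /* first three bytes of record 1 */
  if first3 = 'EFBBBF'x then put 'File starts with a UTF-8 BOM.' ;
  else put 'No UTF-8 BOM found.' ;
run ;
```

A check like this could be used to decide whether any BOM cleanup is needed before the real read.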
Thanks much @DaveHorne , looks like that is doing what I want.
1  data _null_ ;
2  infile "C:\Junk\BOM.txt" encoding=any;
3  input @ ;
4  _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
5  input x : $2.;
6  put x= ;
7  if x="cebc"x then do ;
8  put "found a mu and converted to u" ;
9  x="u" ;
10 put x= ;
11 end ;
12 run ;
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13
x=A
x=B
x=C
x=μ
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
      The minimum record length was 1.
      The maximum record length was 4.
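A possible variant of the step above, offered only as a sketch: since a BOM can only sit at the very start of the file, the cleanup can be limited to the first record and to a leading 'EFBBBF'x. This version swaps TRANSTRN for a prefix comparison plus SUBSTRN; it has not been run against the attached files.

```sas
data _null_ ;
  infile "C:\Junk\BOM.txt" encoding=any ;
  input @ ;                       /* load the record and hold it */
  /* Only record 1 can carry the BOM; =: compares just the prefix. */
  if _n_ = 1 and _infile_ =: 'EFBBBF'x then
    _infile_ = substrn(_infile_, 4) ;   /* drop the first 3 bytes */
  input x : $2. ;
  if x = 'CEBC'x then x = 'u' ;   /* mu (UTF-8 bytes) becomes u */
  put x= ;
run ;
```

SUBSTRN is used instead of SUBSTR so that a first record holding nothing but the BOM yields an empty string rather than an invalid-argument note.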