Running SAS 9.4M4 on Win 7.
I'm reading a UTF-8 encoded text file. My SAS session has WLATIN1 encoding. The file has an occasional mu symbol ("cebc"x) in it.
I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?), and I just manually converted the mu to the letter u to make it easier to deal with.
data _null_ ;
infile "C:\Junk\noBOM.txt" ;
input x : $2.;
put x= ;
if x="cebc"x then do ;
put "found a mu and converted to u" ;
x="u" ;
put x= ;
end ;
run ;
All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the front. So now, I guess, SAS recognizes that the file is UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.
When there is a BOM, I get a log like:
733  data _null_ ;
734  infile "C:\Junk\BOM.txt";
735  input x : $2.;
736  put x= ;
737  if x="cebc"x then do ;
738  put "found a mu and converted to u" ;
739  x="u" ;
740  put x= ;
741  end ;
742  run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
So my question:
Is there a way I can tell SAS to just ignore the BOM?
I don't want to rewrite the file without the BOM. I had hoped I could specify encoding=WLATIN1 on the infile statement but that just throws an error because SAS still sees the BOM and knows it's UTF8 so says it's a conflict. It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore the BOM when reading a file if it's there.
Attached are two sample text files, noBom.txt and Bom.txt.
Thanks,
-Q.
I don't know how to force SAS to ignore the BOM. This is a clumsy workaround:
You are using a local SAS installation, right? You should be able to start the session with Unicode as the session encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.
I don't know how you can ignore the file's BOM when reading it "normally".
What you can do however is read it in binary format.
You can then either discard the first 3 bytes, or copy the whole file minus the BOM and then read it per normal.
Something like:
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;
file UTF;
put '1é';
run;
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
length FID 8 ;
FID = fopen('UTF','i',1,'b');
REC = '20'x;
do while(fread(FID)=0);
RC = fget(FID,REC,1);
put REC=;
end;
RC = fclose(FID);
run;
See:
data _null_ ;
infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
input ;
run ;
Thanks @Ksharp, but my goal is the opposite.
By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.
What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.
When there is no BOM, I can do:
1    data _null_ ;
2    infile "C:\Junk\noBOM.txt";
3    input ;
4    run ;

NOTE: The infile "C:\Junk\noBOM.txt" is:
      Filename=C:\Junk\noBOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=20,
      Last Modified=25Mar2019:16:02:37,
      Create Time=25Mar2019:16:01:33
Note that above, there is no transcoding of the mu symbol done. I assume this means the file was read as WLATIN1 encoding, not UTF-8. I like that.
When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:
7    data _null_ ;
8    infile "C:\Junk\BOM.txt";
9    input ;
10   run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

WARNING: A character that could not be transcoded has been replaced in record 4.
I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:
12   data _null_ ;
13   infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
14   input ;
15   run ;

ERROR: The file "C:\Junk\BOM.txt" could not be opened. A byte-order mark indicates that the data is encoded in "utf-8". This conflicts with the "wlatin1" encoding that was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              294.59k
      OS Memory           16120.00k
      Timestamp           03/26/2019 08:49:07 AM
Can you skip this line?
data _null_ ;
infile "C:\Junk\BOM.txt" firstobs=2;
/* OR input x $hex2.; if x='xxxx' then delete; */
input ;
run ;
Sadly, no. Even without reading anything from the file, SAS sees the BOM.
1    data _null_ ;
2    infile "C:\Junk\BOM.txt" firstobs=2;
3    *input ;
4    run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
And after that, if I read the mu value, it transcodes the mu and throws the warning.
You can read the file with encoding='any' (which basically turns off transcoding):
data test;
infile "C:\Junk\BOM.txt" encoding='any';
input a $;
run;
You'll still get the binary data which you can clean up as needed.
Thanks much @DaveHorne, looks like that is doing what I want.
1    data _null_ ;
2    infile "C:\Junk\BOM.txt" encoding=any;
3    input @ ;
4    _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
5    input x : $2.;
6    put x= ;
7    if x="cebc"x then do ;
8    put "found a mu and converted to u" ;
9    x="u" ;
10   put x= ;
11   end ;
12   run ;

NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
x=μ
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
      The minimum record length was 1.
      The maximum record length was 4.
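A possible refinement of the step above, offered only as a sketch: since a byte-order mark should only ever sit at the very start of the file, the removal could be restricted to the first record. This assumes the same sample file and the same encoding=any behavior shown in the log above; the _n_ = 1 guard is the only change, and it has not been tested against the poster's data.

```sas
data _null_ ;
  infile "C:\Junk\BOM.txt" encoding=any ;   /* no transcoding: raw bytes come through */
  input @ ;                                 /* load the record into _infile_ and hold it */
  /* Assumption: the BOM ('EFBBBF'x) only appears at the start of record 1, */
  /* so later records are left untouched.                                   */
  if _n_ = 1 then _infile_ = transtrn(_infile_, 'EFBBBF'x, trimn('')) ;
  input x : $2. ;
  if x = "cebc"x then x = "u" ;             /* replace the UTF-8 mu with a plain u */
  put x= ;
run ;
```

Because the step never asks SAS to transcode, it should behave the same whether or not the file carries a BOM, so one program can read both noBOM.txt and BOM.txt.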