Running SAS 9.4M4 on Win 7.
I'm reading a UTF-8 encoded text file. My SAS session has WLATIN1 encoding. The file has an occasional mu symbol ("cebc"x) in it.
I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?), and I just manually converted the mu to the letter u to make it easier to deal with.
data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2. ;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x= ;
  end ;
run ;
All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the front. So now, I guess, SAS recognizes that the file is UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.
When there is a BOM, I get a log like:
data _null_ ;
  infile "C:\Junk\BOM.txt";
  input x : $2.;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x= ;
  end ;
run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates
that the data is encoded in "utf-8". This encoding will be used to process the
file.
NOTE: The infile "C:\Junk\BOM.txt" is:
Filename=C:\Junk\BOM.txt,
RECFM=V,LRECL=131068,File Size (bytes)=23,
Last Modified=25Mar2019:16:04:13,
Create Time=25Mar2019:16:04:13
x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
So my question:
Is there a way I can tell SAS to just ignore the BOM?
I don't want to rewrite the file without the BOM. I had hoped I could specify encoding=WLATIN1 on the infile statement, but that just throws an error: SAS still sees the BOM, knows the file is UTF-8, and reports a conflict. It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore a BOM that is already there when reading a file.
Attached are two sample text files, noBom.txt and Bom.txt.
Thanks,
-Q.
I don't know how to force SAS to ignore the BOM. Here is a clumsy workaround:
You are using a local SAS installation, right? You should be able to start the session with Unicode (UTF-8) encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.
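Before trying this, it may help to confirm which encoding the current session actually uses. A minimal sketch, relying only on the ENCODING system option and the SYSENCODING automatic macro variable:

```sas
/* Show the session encoding as a system option */
proc options option=encoding ;
run ;

/* The same value is available as an automatic macro variable */
%put &=sysencoding ;
```

In a WLATIN1 session like the one in the question, both should report wlatin1; a session started with Unicode support should report utf-8.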
I don't know how you can ignore the file's BOM when reading it "normally".
What you can do however is read it in binary format.
You can then either discard the first 3 bytes, or copy the whole file minus the BOM and then read the copy as normal.
Something like:
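/* Write a small UTF-8 test file */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;
  file UTF;
  put '1é';
run;

/* Re-assign the fileref without an encoding, as fixed-length records */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
  length FID 8 ;
  FID = fopen('UTF','i',1,'b');   /* input mode, record length 1, binary */
  REC = '20'x;                    /* one-byte buffer, initialised to a blank */
  do while(fread(FID)=0);         /* read the next byte into the file data buffer */
    RC = fget(FID,REC,1);         /* copy that byte into REC */
    put REC=;
  end;
  RC = fclose(FID);
run;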
See:
data _null_ ;
  infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
  input ;
run ;
Thanks @Ksharp , but my goal is the opposite.
By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.
What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.
When there is no BOM, I can do:
data _null_ ;
  infile "C:\Junk\noBOM.txt";
  input ;
run ;
NOTE: The infile "C:\Junk\noBOM.txt" is:
Filename=C:\Junk\noBOM.txt,
RECFM=V,LRECL=32767,File Size (bytes)=20,
Last Modified=25Mar2019:16:02:37,
Create Time=25Mar2019:16:01:33
Note that in the log above, no transcoding of the mu symbol is done. I assume this means the file was read as WLATIN1 encoding, not UTF-8. I like that.
When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:
data _null_ ;
  infile "C:\Junk\BOM.txt";
  input ;
run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates
that the data is encoded in "utf-8". This encoding will be used to process the
file.
NOTE: The infile "C:\Junk\BOM.txt" is:
Filename=C:\Junk\BOM.txt,
RECFM=V,LRECL=131068,File Size (bytes)=23,
Last Modified=25Mar2019:16:04:13,
Create Time=25Mar2019:16:04:13
WARNING: A character that could not be transcoded has been replaced in record 4.
I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:
data _null_ ;
  infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
  input ;
run ;
ERROR: The file "C:\Junk\BOM.txt" could not be opened. A byte-order mark indicates that
the data is encoded in "utf-8". This conflicts with the "wlatin1" encoding that
was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
memory 294.59k
OS Memory 16120.00k
Timestamp 03/26/2019 08:49:07 AM
Can you skip that line?
data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  /* OR input x $hex2.; if x='xxxx' then delete; */
  input ;
run ;
Sadly, no. Even without reading anything from the file, SAS sees the BOM.
data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  *input ;
run ;
NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates
that the data is encoded in "utf-8". This encoding will be used to process the
file.
And after that, if I read the mu value, it transcodes the mu and throws the warning.
You can read the file with encoding='any' (which basically turns off transcoding):
data test;
  infile "C:\Junk\BOM.txt" encoding='any';
  input a $;
run;
You'll still get the binary data which you can clean up as needed.
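Since encoding=any hands back the raw bytes, the BOM itself shows up as the first three bytes of the first record. As a minimal sketch (assuming the same C:\Junk\BOM.txt sample file from the question), a quick check for whether a given file carries a UTF-8 BOM could look like this:

```sas
data _null_ ;
  /* Read only the first record, with transcoding turned off */
  infile "C:\Junk\BOM.txt" encoding=any obs=1 ;
  input ;
  /* =: compares only as many bytes as the shorter operand, i.e. the first 3 */
  if _infile_ =: 'EFBBBF'x then put "NOTE: file starts with a UTF-8 BOM." ;
  else put "NOTE: no UTF-8 BOM at the start of the file." ;
run ;
```

This only reports what is there; it does not change the file, and any cleanup of the bytes is still up to the reading step.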
Thanks much @DaveHorne , looks like that is doing what I want.
data _null_ ;
  infile "C:\Junk\BOM.txt" encoding=any;
  input @ ;
  _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
  input x : $2.;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x= ;
  end ;
run ;
NOTE: The infile "C:\Junk\BOM.txt" is:
Filename=C:\Junk\BOM.txt,
RECFM=V,LRECL=32767,File Size (bytes)=23,
Last Modified=25Mar2019:16:04:13,
Create Time=25Mar2019:16:04:13
x=A
x=B
x=C
x=μ
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
The minimum record length was 1.
The maximum record length was 4.
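A possible refinement of the step above, offered as a sketch rather than a tested replacement: a BOM can only sit at the very start of the file, so the removal can be limited to the first record instead of scanning every line. The file name and the variable x are taken from the example above.

```sas
data _null_ ;
  infile "C:\Junk\BOM.txt" encoding=any ;
  input @ ;   /* load the record into _infile_ and hold it */
  /* Only record 1 can start with the BOM, so leave the other records alone */
  if _n_ = 1 then _infile_ = transtrn(_infile_, 'EFBBBF'x, trimn('')) ;
  input x : $2. ;
  if x = 'CEBC'x then x = 'u' ;   /* replace the UTF-8 bytes for mu */
  put x= ;
run ;
```

Because TRANSTRN leaves the record unchanged when the BOM bytes are absent, the same step should also read noBOM.txt, so one program could cover files with and without the BOM.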