<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Read UTF-8 file and ignore BOM in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/545957#M151116</link>
    <description>&lt;P&gt;Running SAS 9.4M4 on Win 7.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm reading a UTF-8 encoded text file.&amp;nbsp; My SAS session has WLATIN1 encoding.&amp;nbsp; The file has an occasional mu symbol ("cebc"x) in it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?) and I just manually converted the mu to the letter u, to make it easier to deal with.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2.;
  put x=  ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (&lt;A href="https://en.wikipedia.org/wiki/Byte_order_mark" target="_blank"&gt;https://en.wikipedia.org/wiki/Byte_order_mark&lt;/A&gt;)&amp;nbsp; at the front.&amp;nbsp; So now, I guess, SAS recognizes that it's UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When there is a BOM, I get a log like:&lt;/P&gt;
&lt;PRE&gt;733  data _null_ ;
734    infile "C:\Junk\BOM.txt";
735    input x : $2.;
736    put x= ;
737    if x="cebc"x then do ;
738      put "found a mu and converted to u" ;
739      x="u" ;
740      put x=  ;
741    end ;
742  run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=&amp;#26;
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
&lt;/PRE&gt;
&lt;P&gt;So my question:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Is there a way I can tell SAS to just ignore the BOM?&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I don't want to rewrite the file without the BOM.&amp;nbsp; I had hoped I could specify encoding=WLATIN1 on the infile statement, but that just throws an error: SAS still sees the BOM, knows the file is UTF-8, and reports a conflict.&amp;nbsp; It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore a BOM that is already there when reading a file.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Attached are two sample text files, noBom.txt and Bom.txt.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;-Q.&lt;/P&gt;</description>
    <pubDate>Mon, 25 Mar 2019 20:54:56 GMT</pubDate>
    <dc:creator>Quentin</dc:creator>
    <dc:date>2019-03-25T20:54:56Z</dc:date>
    <item>
      <title>Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/545957#M151116</link>
      <description>&lt;P&gt;Running SAS 9.4M4 on Win 7.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm reading a UTF-8 encoded text file.&amp;nbsp; My SAS session has WLATIN1 encoding.&amp;nbsp; The file has an occasional mu symbol ("cebc"x) in it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?) and I just manually converted the mu to the letter u, to make it easier to deal with.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2.;
  put x=  ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (&lt;A href="https://en.wikipedia.org/wiki/Byte_order_mark" target="_blank"&gt;https://en.wikipedia.org/wiki/Byte_order_mark&lt;/A&gt;)&amp;nbsp; at the front.&amp;nbsp; So now, I guess, SAS recognizes that it's UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When there is a BOM, I get a log like:&lt;/P&gt;
&lt;PRE&gt;733  data _null_ ;
734    infile "C:\Junk\BOM.txt";
735    input x : $2.;
736    put x= ;
737    if x="cebc"x then do ;
738      put "found a mu and converted to u" ;
739      x="u" ;
740      put x=  ;
741    end ;
742  run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=&amp;#26;
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
&lt;/PRE&gt;
&lt;P&gt;So my question:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Is there a way I can tell SAS to just ignore the BOM?&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I don't want to rewrite the file without the BOM.&amp;nbsp; I had hoped I could specify encoding=WLATIN1 on the infile statement, but that just throws an error: SAS still sees the BOM, knows the file is UTF-8, and reports a conflict.&amp;nbsp; It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore a BOM that is already there when reading a file.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Attached are two sample text files, noBom.txt and Bom.txt.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;-Q.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2019 20:54:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/545957#M151116</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2019-03-25T20:54:56Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/545958#M151117</link>
      <description>&lt;P&gt;I don't know how to force SAS to ignore the BOM. This is a clumsy workaround:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You are using a local SAS installation, right? You should be able to start the session with a Unicode (UTF-8) session encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2019 21:05:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/545958#M151117</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2019-03-25T21:05:21Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546006#M151134</link>
      <description>&lt;P&gt;I don't know how you can ignore the file's BOM when reading it "normally".&lt;/P&gt;
&lt;P&gt;What you can do however is read it in binary format.&lt;/P&gt;
&lt;P&gt;You can then either discard the first 3 bytes (the UTF-8 BOM), or copy the whole file minus the BOM and then read the copy as normal.&lt;/P&gt;
&lt;P&gt;Something like:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
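/* Write a small UTF-8 test file (server path as posted) */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;
  file UTF;
  put '1é';
run;

/* Reassign the fileref and read the file back one byte at a time */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
  length FID 8 ;
  FID = fopen('UTF','i',1,'b');   /* input mode, record length 1, binary */
  REC = '20'x;
  do while(fread(FID)=0);
    RC = fget(FID,REC,1);
    put REC=;                     /* print each byte to the log */
  end;
  RC = fclose(FID);
run;&lt;/CODE&gt;&lt;/PRE&gt;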
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;See:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://blogs.sas.com/content/sasdummy/2011/06/17/how-to-use-sas-data-step-to-copy-a-file-from-anywhere/" target="_blank"&gt;https://blogs.sas.com/content/sasdummy/2011/06/17/how-to-use-sas-data-step-to-copy-a-file-from-anywhere/&lt;/A&gt;&lt;/P&gt;
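&lt;P&gt;The "copy the whole file minus the BOM" option could be sketched roughly as below, following the byte-by-byte copy pattern from the linked post. This is an untested sketch, not a confirmed solution: the filerefs SRC and DST and the output path are hypothetical, and it assumes that opening the file in binary mode with FOPEN is not itself affected by the BOM.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical filerefs and output path, not taken from the posts above */
filename SRC "C:\Junk\BOM.txt" recfm=f;
filename DST "C:\Junk\BOM_stripped.txt";

data _null_;
  length HEAD $3 REC $1;
  FIN  = fopen('SRC','i',1,'b');   /* binary input, one byte per record  */
  FOUT = fopen('DST','o',1,'b');   /* binary output, one byte per record */
  REC  = '20'x;
  N    = 0;
  do while(fread(FIN)=0);
    RC = fget(FIN,REC,1);
    N  = N + 1;
    if N le 3 then do;
      substr(HEAD,N,1) = REC;      /* hold the first three bytes back */
      /* after the third byte: write them only if they are not the BOM */
      if N = 3 and HEAD ne 'EFBBBF'x then do J = 1 to 3;
        RC = fput(FOUT,substr(HEAD,J,1));
        RC = fwrite(FOUT);
      end;
    end;
    else do;
      RC = fput(FOUT,REC);         /* copy every later byte unchanged */
      RC = fwrite(FOUT);
    end;
  end;
  /* a file shorter than three bytes cannot hold a BOM: flush what was held */
  if N lt 3 then do J = 1 to N;
    RC = fput(FOUT,substr(HEAD,J,1));
    RC = fwrite(FOUT);
  end;
  RC = fclose(FIN);
  RC = fclose(FOUT);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If this works as assumed, the copy in DST has no BOM and can be read with a plain INFILE statement, at the cost of writing a second file.&lt;/P&gt;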
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 01:53:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546006#M151134</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-03-26T01:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546106#M151163</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_ ;
  infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
  input ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Mar 2019 12:36:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546106#M151163</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2019-03-26T12:36:06Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546115#M151165</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/18408"&gt;@Ksharp&lt;/a&gt;&amp;nbsp;, but my goal is the opposite.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When there is no BOM, I can do:&lt;/P&gt;
&lt;PRE&gt;1    data _null_ ;
2      infile "C:\Junk\noBOM.txt";
3      input ;
4    run ;

NOTE: The infile "C:\Junk\noBOM.txt" is:
      Filename=C:\Junk\noBOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=20,
      Last Modified=25Mar2019:16:02:37,
      Create Time=25Mar2019:16:01:33
&lt;/PRE&gt;
&lt;P&gt;Note that above, no transcoding of the mu symbol is done.&amp;nbsp; I assume this means the file was read as WLATIN1 encoding, not UTF-8.&amp;nbsp; I like that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;7    data _null_ ;
8      infile "C:\Junk\BOM.txt";
9      input ;
10   run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

WARNING: A character that could not be transcoded has been replaced in record 4.
&lt;/PRE&gt;
&lt;P&gt;I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;12   data _null_ ;
13     infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
14     input ;
15   run ;

ERROR: The file "C:\Junk\BOM.txt" could not be opened.  A byte-order mark indicates that
       the data is encoded in "utf-8".  This conflicts with the "wlatin1" encoding that
       was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              294.59k
      OS Memory           16120.00k
      Timestamp           03/26/2019 08:49:07 AM
&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 13:01:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546115#M151165</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2019-03-26T13:01:01Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546129#M151169</link>
      <description>&lt;P&gt;Can you skip the first line?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;data _null_ ;
      infile "C:\Junk\BOM.txt" firstobs=2;
   /* OR  input x $hex2.; if x='xxxx' then delete; */
     input ;
   run ;&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Mar 2019 13:17:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546129#M151169</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2019-03-26T13:17:21Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546131#M151170</link>
      <description>&lt;P&gt;Sadly, no.&amp;nbsp; Even without reading anything from the file, SAS sees the BOM.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;1    data _null_ ;
2      infile "C:\Junk\BOM.txt" firstobs=2;
3      *input ;
4    run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
&lt;/PRE&gt;
&lt;P&gt;And after that, if I read the mu value, SAS tries to transcode the mu and throws the warning.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 13:22:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546131#M151170</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2019-03-26T13:22:18Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546142#M151172</link>
      <description>&lt;P&gt;You can read the file with encoding='any' (which basically turns off transcoding):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data test;
  infile "C:\Junk\BOM.txt" encoding='any';
  input a $;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You'll still get the raw bytes, including the BOM, which you can clean up as needed.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2019 13:30:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546142#M151172</guid>
      <dc:creator>DaveHorne</dc:creator>
      <dc:date>2019-03-26T13:30:29Z</dc:date>
    </item>
    <item>
      <title>Re: Read UTF-8 file and ignore BOM</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546156#M151179</link>
      <description>&lt;P&gt;Thanks much &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/100"&gt;@DaveHorne&lt;/a&gt;&amp;nbsp;, looks like that is doing what I want.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;1    data _null_ ;
2      infile "C:\Junk\BOM.txt" encoding=any;
3      input @ ;
4      _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
5      input x : $2.;
6      put x= ;
7      if x="cebc"x then do ;
8        put "found a mu and converted to u" ;
9        x="u" ;
10       put x=  ;
11     end ;
12   run ;

NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
x=Î¼
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
      The minimum record length was 1.
      The maximum record length was 4.
&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Mar 2019 14:29:56 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Read-UTF-8-file-and-ignore-BOM/m-p/546156#M151179</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2019-03-26T14:29:56Z</dc:date>
    </item>
  </channel>
</rss>

