Running SAS 9.4M4 on Win 7.

 

I'm reading a UTF-8 encoded text file.  My SAS session has WLATIN1 encoding.  The file has an occasional mu symbol ("cebc"x) in it.

 

I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?), and I just manually converted the mu to the letter u to make it easier to deal with.

 

data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2.;
  put x=  ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;

All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the front.  So now, I guess, SAS recognizes that the file is UTF-8 and tries to transcode it to WLATIN1, but then throws a warning when it can't transcode the mu, and everything breaks.

 

When there is a BOM, I get a log like:

733  data _null_ ;
734    infile "C:\Junk\BOM.txt";
735    input x : $2.;
736    put x= ;
737    if x="cebc"x then do ;
738      put "found a mu and converted to u" ;
739      x="u" ;
740      put x=  ;
741    end ;
742  run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".

So my question: 

Is there a way I can tell SAS to just ignore the BOM? 

 

I don't want to rewrite the file without the BOM.  I had hoped I could specify encoding=WLATIN1 on the infile statement, but that just throws an error: SAS still sees the BOM, knows the file is UTF-8, and reports a conflict.  It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore a BOM that is already there when reading a file.

 

Attached are two sample text files, noBom.txt and Bom.txt.

 

Thanks,

-Q.

1 ACCEPTED SOLUTION

Accepted Solutions
DaveHorne
SAS Employee

You can read the file with encoding='any' (which basically turns off transcoding):

 

data test;
  infile "C:\Junk\BOM.txt" encoding='any';
  input a $;
run;

You'll still get the binary data which you can clean up as needed.


8 REPLIES 8
andreas_lds
Jade | Level 19

I don't know how to force SAS to ignore the BOM. Here is a clumsy workaround:

You are using a local SAS installation, right? You should be able to start the session with Unicode (UTF-8) encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.
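A rough sketch of that workaround, under some assumptions: the session has already been started with UTF-8 encoding (for example via the Unicode Support shortcut or the ENCODING startup option), the sample file is the C:\Junk\BOM.txt from the question, and the output dataset name work.clean is made up for illustration. It has not been run against the attached files.

```sas
/* Assumption: this runs in a session started with UTF-8 encoding. */
/* Check what the session encoding actually is before relying on it. */
proc options option=encoding;
run;

/* work.clean is a hypothetical dataset name. The ENCODING= dataset   */
/* option asks SAS to store the output dataset as WLATIN1.            */
data work.clean (encoding=wlatin1);
  infile "C:\Junk\BOM.txt";
  input x : $2.;
  /* In a UTF-8 session no transcoding to WLATIN1 is needed on input, */
  /* so the mu should still arrive as the two bytes CE BC.            */
  if x = "cebc"x then x = "u";
run;
```

Once the mu has been replaced, only ASCII characters remain in x, so writing the dataset as WLATIN1 should not hit the transcoding problem.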

ChrisNZ
Tourmaline | Level 20

I don't know how you can ignore the file's BOM when reading it "normally".

What you can do however is read it in binary format.

You can then either discard the first 3 bytes (the UTF-8 BOM, 'EFBBBF'x), or copy the whole file minus the BOM and then read the copy as normal.

Something like:

 


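/* Write a small UTF-8 test file */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;  
  file UTF;
  put '1é';
run;

/* Reopen the same file as fixed-format binary and print it one byte at a time */
filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
  length FID 8 ;
  FID = fopen('UTF','i',1,'b');  /* input mode, record length 1, binary */
  REC = '20'x;                   /* one-byte buffer */
  do while(fread(FID)=0);        /* fread returns 0 while a record was read */
    RC = fget(FID,REC,1);        /* copy 1 byte from the file buffer into REC */
    put REC=;
  end;
  RC = fclose(FID);
run;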

 

See:

https://blogs.sas.com/content/sasdummy/2011/06/17/how-to-use-sas-data-step-to-copy-a-file-from-anywh...

 

 

Ksharp
Super User

data _null_ ;
  infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
  input ;
run ;

Quentin
Super User

Thanks @Ksharp , but my goal is the opposite. 

 

By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.

 

What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.

 

When there is no BOM, I can do:

1    data _null_ ;
2      infile "C:\Junk\noBOM.txt";
3      input ;
4    run ;

NOTE: The infile "C:\Junk\noBOM.txt" is:
      Filename=C:\Junk\noBOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=20,
      Last Modified=25Mar2019:16:02:37,
      Create Time=25Mar2019:16:01:33

Note that above, there is no transcoding of the mu symbol done.  I assume this means the file was read as WLATIN1 encoding, not UTF-8.  I like that.

 

When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:

 

7    data _null_ ;
8      infile "C:\Junk\BOM.txt";
9      input ;
10   run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

WARNING: A character that could not be transcoded has been replaced in record 4.

I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:

 

12   data _null_ ;
13     infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
14     input ;
15   run ;

ERROR: The file "C:\Junk\BOM.txt" could not be opened.  A byte-order mark indicates that
       the data is encoded in "utf-8".  This conflicts with the "wlatin1" encoding that
       was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              294.59k
      OS Memory           16120.00k
      Timestamp           03/26/2019 08:49:07 AM

 

Ksharp
Super User

Can you skip this line ?

 

data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  /* OR input x $hex2.; if x='xxxx' then delete; */
  input ;
run ;

Quentin
Super User

Sadly, no.  Even without reading anything from the file, SAS sees the BOM. 

 

1    data _null_ ;
2      infile "C:\Junk\BOM.txt" firstobs=2;
3      *input ;
4    run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.

And after that, if I read the mu value, SAS tries to transcode it and throws the warning.

DaveHorne
SAS Employee

You can read the file with encoding='any' (which basically turns off transcoding):

 

data test;
  infile "C:\Junk\BOM.txt" encoding='any';
  input a $;
run;

You'll still get the binary data which you can clean up as needed.
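To see which bytes actually arrive once transcoding is off, something like the following sketch may help. It assumes the same C:\Junk\BOM.txt sample file; the variable name raw and its 8-byte length are illustrative choices, not part of the suggestion above.

```sas
data _null_;
  infile "C:\Junk\BOM.txt" encoding='any';  /* read the bytes without transcoding */
  input;
  length raw $8;          /* hypothetical helper; 8 bytes is enough for this sample */
  raw = _infile_;         /* copy the raw record so a format can be applied */
  put _n_= raw= $hex16.;  /* two hex digits per byte; a trailing 20 is blank padding */
run;
```

If the BOM is present, the first record should begin with EFBBBF, and the mu should show up as CEBC.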

Quentin
Super User

Thanks much @DaveHorne , looks like that is doing what I want. 

 

1    data _null_ ;
2      infile "C:\Junk\BOM.txt" encoding=any;
3      input @ ;
4      _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
5      input x : $2.;
6      put x= ;
7      if x="cebc"x then do ;
8        put "found a mu and converted to u" ;
9        x="u" ;
10       put x=  ;
11     end ;
12   run ;

NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
x=μ
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
      The minimum record length was 1.
      The maximum record length was 4.
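
A possible refinement, offered only as a sketch: since a UTF-8 BOM can only sit at the very start of a file, the removal could be limited to the first record rather than applied to every record. This is a variation on the step above, not something tested against the attached files; the _n_ = 1 and index() checks are the only additions.

```sas
data _null_;
  infile "C:\Junk\BOM.txt" encoding='any';
  input @;                        /* load the record into _infile_ and hold it */
  /* Only touch record 1, and only if it starts with the UTF-8 BOM bytes */
  if _n_ = 1 and index(_infile_, 'EFBBBF'x) = 1 then
    _infile_ = transtrn(_infile_, 'EFBBBF'x, trimn(''));
  input x : $2.;
  if x = "cebc"x then x = "u";    /* replace the UTF-8 mu with a plain u */
  put x=;
run;
```

Because the BOM check is conditional, the same step should also be usable on files like noBOM.txt that have no BOM, so one program could cover both cases.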
