Running SAS 9.4M4 on Win 7.

 

I'm reading a UTF-8 encoded text file.  My SAS session has WLATIN1 encoding.  The file has an occasional mu symbol ("cebc"x) in it.

 

I've been reading the file fine with the hack below, where I guess it was being read with WLATIN1 encoding (?), and I just manually converted the mu to the letter u to make it easier to deal with.

 

data _null_ ;
  infile "C:\Junk\noBOM.txt" ;
  input x : $2.;
  put x=  ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;

All was fine in the world until, apparently, somebody started occasionally opening some of the text files and saving them, I assume in some Microsoft text editor, which added the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the front.  So now, I guess, SAS recognizes that it's UTF-8 and tries to transcode it, but then throws a warning when it can't transcode the mu, and everything breaks.

 

When there is a BOM, I get a log like:

data _null_ ;
  infile "C:\Junk\BOM.txt";
  input x : $2.;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00234") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
WARNING: A character that could not be transcoded has been replaced in record 4.
x=
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".

So my question: 

Is there a way I can tell SAS to just ignore the BOM? 

 

I don't want to rewrite the file without the BOM.  I had hoped I could specify encoding=WLATIN1 on the infile statement but that just throws an error because SAS still sees the BOM and knows it's UTF8 so says it's a conflict.  It looks like the NOBOMFILE option prevents SAS from writing a BOM when it writes a file, but doesn't tell SAS to ignore the BOM when reading a file if it's there.

 

Attached are two sample text files, noBom.txt and Bom.txt.

 

Thanks,

-Q.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our email list.
Accepted Solutions
DaveHorne
SAS Employee

You can read the file with encoding='any' (which basically turns off transcoding):

 

data test;                                                                                                                              
  infile "C:\Junk\BOM.txt" encoding='any';                                                                                              
  input a $;                                                                                                                            
run; 

You'll still get the binary data which you can clean up as needed.
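
As a minimal sketch of that cleanup (an illustration, not part of the original answer): the step below strips the three BOM bytes ('EFBBBF'x) from the first record and replaces the UTF-8 mu ('CEBC'x) with a plain u before the line is parsed. The dataset name CLEANED and the choice to rewrite _INFILE_ are assumptions made for illustration; the file path and the variable x come from the question.

```sas
data cleaned ;
  /* encoding='any' reads the raw bytes without transcoding */
  infile "C:\Junk\BOM.txt" encoding='any' ;
  input @ ;                                   /* load the record and hold it */
  /* Remove the BOM bytes from the first record, if they are there */
  if _n_ = 1 then _infile_ = transtrn(_infile_, 'EFBBBF'x, trimn('')) ;
  /* Replace the two-byte UTF-8 mu with the letter u */
  _infile_ = transtrn(_infile_, 'CEBC'x, 'u') ;
  input x : $2. ;                             /* parse the cleaned record */
run ;
```

Because encoding='any' turns off transcoding, the same step should presumably also work unchanged on a file saved without a BOM, in which case the first TRANSTRN call simply finds nothing to remove.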

8 Replies
andreas_lds
Jade | Level 19

I don't know how to force SAS to ignore the BOM. This is a clumsy workaround:

You are using a local SAS installation, right? You should be able to start the session with Unicode (UTF-8) encoding, which avoids the warnings about characters that can't be transcoded. After removing the Unicode characters, saving the datasets with WLATIN1 encoding should be possible.

ChrisNZ
Tourmaline | Level 20

I don't know how you can ignore the file's BOM when reading it "normally".

What you can do however is read it in binary format.

You can then either discard the first 3 bytes, or copy the whole file minus the BOM and then read it per normal.

Something like:

 


filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" encoding=utf8;
data _null_;  
  file UTF;
  put '1é';
run;

filename UTF "\\NZ8037SPSAS2003\temp\utf.txt" recfm=f;
data _null_;
  length FID 8 ;
  FID = fopen('UTF','i',1,'b');
  REC = '20'x;
  do while(fread(FID)=0);
    RC = fget(FID,REC,1);
    put REC=;
  end;
  RC = fclose(FID);
run;
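
The second option mentioned above, copying the whole file minus the BOM, could be sketched along the same lines. This is an untested sketch that assumes the input is known to start with the 3-byte UTF-8 BOM (a file without one would lose its first three data bytes). The filerefs IN and OUT, the RECFM=N setting, and the output path are hypothetical choices, not something taken from the post:

```sas
/* Hypothetical filerefs: IN is the sample file from the question, */
/* OUT is a made-up name for the BOM-free copy                     */
filename in  "C:\Junk\BOM.txt"          recfm=n;
filename out "C:\Junk\BOM_stripped.txt" recfm=n;

data _null_;
  length FILEIN FILEOUT N RC 8 REC $1;
  FILEIN  = fopen('in',  'I', 1, 'B');   /* binary input, one byte per record */
  FILEOUT = fopen('out', 'O', 1, 'B');   /* binary output                     */
  REC = '20'x;
  N = 0;
  do while(fread(FILEIN)=0);
    RC = fget(FILEIN,REC,1);
    N + 1;                               /* count the bytes read so far */
    /* Assumption: bytes 1-3 are the BOM (EF BB BF), so skip them */
    if N > 3 then do;
      RC = fput(FILEOUT,REC);
      RC = fwrite(FILEOUT);
    end;
  end;
  RC = fclose(FILEIN);
  RC = fclose(FILEOUT);
run;
```

The copy can then be read with an ordinary INFILE statement, since it no longer carries a BOM for SAS to detect.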

 

See:

https://blogs.sas.com/content/sasdummy/2011/06/17/how-to-use-sas-data-step-to-copy-a-file-from-anywh...

 

 

Ksharp
Super User

data _null_ ;
  infile "C:\Junk\noBOM.txt" encoding='utf-8' ignoredoseof;
  input ;
run ;

Quentin
Super User

Thanks @Ksharp , but my goal is the opposite. 

 

By explicitly adding the UTF-8 encoding, you told SAS to read noBom.txt with UTF-8 encoding even though there was no BOM.

 

What I want (I think) is in a WLATIN1 SAS session, when I read a UTF-8 file with a BOM, to be able to force it to read as WLATIN1 encoding, just the same as if there was no BOM.

 

When there is no BOM, I can do:

data _null_ ;
  infile "C:\Junk\noBOM.txt";
  input ;
run ;

NOTE: The infile "C:\Junk\noBOM.txt" is:
      Filename=C:\Junk\noBOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=20,
      Last Modified=25Mar2019:16:02:37,
      Create Time=25Mar2019:16:01:33

Note that above, there is no transcoding of the mu symbol done.  I assume this means the file was read as WLATIN1 encoding, not UTF-8.  I like that.

 

When there is a BOM, it transcodes from UTF-8 to WLATIN1, and throws a warning when it encounters the mu symbol and can't transcode it:

 

data _null_ ;
  infile "C:\Junk\BOM.txt";
  input ;
run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00092") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.
NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

WARNING: A character that could not be transcoded has been replaced in record 4.

I had hoped I could force the file to be read with WLATIN1 encoding, but this fails:

 

data _null_ ;
  infile "C:\Junk\BOM.txt" encoding='WLATIN1' ;
  input ;
run ;

ERROR: The file "C:\Junk\BOM.txt" could not be opened.  A byte-order mark indicates that
       the data is encoded in "utf-8".  This conflicts with the "wlatin1" encoding that
       was specified for the fileref "#LN00093".
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              294.59k
      OS Memory           16120.00k
      Timestamp           03/26/2019 08:49:07 AM

 

Ksharp
Super User

Can you skip this line ?

 

data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  /* OR input x $hex2.; if x='xxxx' then delete; */
  input ;
run ;

Quentin
Super User

Sadly, no.  Even without reading anything from the file, SAS sees the BOM. 

 

data _null_ ;
  infile "C:\Junk\BOM.txt" firstobs=2;
  *input ;
run ;

NOTE: A byte-order mark in the file "C:\Junk\BOM.txt" (for fileref "#LN00101") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the
      file.

And after that, if I read the mu value, it transcodes the mu and throws the warning.

Quentin
Super User

Thanks much @DaveHorne , looks like that is doing what I want. 

 

data _null_ ;
  infile "C:\Junk\BOM.txt" encoding=any;
  input @ ;
  _infile_ = transtrn(_infile_,'EFBBBF'x,trimn('')) ; *Remove the BOM ;
  input x : $2.;
  put x= ;
  if x="cebc"x then do ;
    put "found a mu and converted to u" ;
    x="u" ;
    put x=  ;
  end ;
run ;

NOTE: The infile "C:\Junk\BOM.txt" is:
      Filename=C:\Junk\BOM.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=23,
      Last Modified=25Mar2019:16:04:13,
      Create Time=25Mar2019:16:04:13

x=A
x=B
x=C
x=μ
found a mu and converted to u
x=u
x=D
x=E
x=F
NOTE: 7 records were read from the infile "C:\Junk\BOM.txt".
      The minimum record length was 1.
      The maximum record length was 4.