BookmarkSubscribeRSS Feed
stf5018
SAS Employee

Problem: SAS code can't seem to handle encodings when reading/writing a file to/from Azure Data Lake.

 

Situation: I have a machine that is writing data to an Azure Data Lake Storage container. Unfortunately it writes the files with a BOM (Byte Order Mark). This is completely unnecessary with these UTF-8 files, but that's what I'm stuck with. I can't change my machine and I would like to read the data directly from ADLS to SAS without intermediate steps or non-SAS code..

 

Approach: The basic approach I've found is to use the FILENAME statement to read/write files from/to ADLS. However, SAS breaks on reading because of the BOM in the file. Even odder, SAS actually puts a BOM in the file when it writes to ADLS. I've read you can get around this with the "encoding" option in the FILENAME statement. However, that option is not supported with the Azure access method. Is there a way to get around the BOM? Or is this a bug in SAS that needs to be addressed?

 

Code:

*attempt to read file with BOM;
filename myjson adls "process_w_bom.json"
   applicationId=&appId 
   accountname=&acctName 
   filesystem=&filesys;

libname temp json fileref=myjson;

*attempt to use encoding option;
filename myjson2 adls "process_w_bom.json"
   applicationId=&appId 
   accountname=&acctName 
   filesystem=&filesys
encoding='utf-8';

libname temp json fileref=myjson2;

 

Log Output:

1 %studio_hide_wrapper;
77
78 filename myjson adls "process_w_bom.json"
79 applicationId=&appId
80 accountname=&acctName
81 filesystem=&filesys;
82
83 libname temp json fileref=myjson;
NOTE: JSON data is only read once. To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 1: Encountered an illegal character.
ERROR: Error in the LIBNAME statement.
84
85 *attempt to use encoding option;
86 filename myjson2 adls "process_w_bom.json"
87 applicationId=&appId
88 accountname=&acctName
89 filesystem=&filesys
90 encoding='utf-8';
--------
23
ERROR: Error in the FILENAME statement.
ERROR 23-2: Invalid option name encoding.
91
92 libname temp json fileref=myjson2;
ERROR: Libref TEMP is not assigned.
ERROR: Error in the LIBNAME statement.
93
94 %studio_hide_wrapper;

6 REPLIES 6
ballardw
Super User

Please post LOG text in a code box. SAS frequently supplies characters to indicate where errors or problems occur but the main message windows reformat text making those characters appear in a different location.

 

What is the SAS session encoding set to?

If not sure you should be able to test with this and check result in the log

proc options option=encoding;
run;

If it does not show a UTF then perhaps change your SAS session encoding, which will require either a change of the Encoding= system option to the config file or the start up command for SAS and restart the session

stf5018
SAS Employee

Here's the results of running proc options. It shows UTF-8. However, I was under the impression that the session encoding options would not have an effect on what the FILENAME statement does. And if it does, then I wouldn't want to have to manually change the encoding every time I run my code. And I wouldn't want to change a default system setting as that would mess up any other stuff I'm doing. The end goal is to have code that can automatically read data from ADLS as it gets put there by my machine.

 

That being said, the session encoding options don't seem to make any difference on the FILENAME statement. It sees that BOM in the beginning of the file when parsing and breaks. I need to either 1) use the information in BOM to parse the file, or 2) completely ignore it. 

 

 

proc options option=encoding;
run;

1    %studio_hide_wrapper;
77   
78   proc options option=encoding;
79   run;
    SAS (r) Proprietary Software Release V.04.00  TS1M0
 ENCODING=UTF8     Specifies the default character-set encoding for the SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
      
80   
81   %studio_hide_wrapper;
91   
92   
Tom
Super User Tom
Super User

If you cannot get SAS to ignore it then copy the file first and remove it yourself.

Perhaps something like this?

filename myjson adls "process_w_bom.json"
   applicationId=&appId 
   accountname=&acctName 
   filesystem=&filesys
;

filename myjsonX temp;
data _null_;
  infile myjson recfm=n ;
  file myjsonx recfm=n;
  if _n_=1 do; input x $char3. ; if x ne 'EFBBBF'x then put x $char3.; end;
  input x $char512. ;
  put x $char512.;
run;

libname temp json fileref=myjsonX;

 

stf5018
SAS Employee

I attempted the code you provided to remove the characters, but I get a new error. 

 

 

filename myjsonX temp;
data _null_;
  infile myjson recfm=n ;
  file myjsonx recfm=n;

  if _n_=1 then do; 
		input x $char3. ; 
		if x ne 'EFBBBF'x then do;
			put x $char3.; 
		end;
	end;

  input x $char512. ;
  put x $char512.;
run;

libname t2 json fileref=myjsonX;

1    %studio_hide_wrapper;
77   
78   filename myjsonX temp;
79   data _null_;
80     infile myjson recfm=n ;
81     file myjsonx recfm=n;
82   
83     if _n_=1 then do;
84   input x $char3. ;
85   if x ne 'EFBBBF'x then do;
86   put x $char3.;
87   end;
88   end;
89   
90     input x $char512. ;
91     put x $char512.;
92   run;
NOTE: The infile MYJSON is:
      Filename=process_correct_w_bom.json,
      Account Name=sefordtest,File system=test
NOTE: UNBUFFERED is the default with RECFM=N.
NOTE: The file MYJSONX is:
      
      Filename=/opt/sas/viya/...../#LN00469,
      Owner Name=UNKNOWN,Group Name=UNKNOWN,
      Access Permission=-rw-r--r--,
      Last Modified=02Aug2022:11:30:32
NOTE: LOST CARD.
x= _ERROR_=1 _N_=1
NOTE: 1 record was read from the infile MYJSON.
NOTE: DATA statement used (Total process time):
      real time           0.26 seconds
      cpu time            0.04 seconds
      
93   
94   libname t2 json fileref=myjsonX;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 1: No JSON text value found.
ERROR: Error in the LIBNAME statement.
95   
96   %studio_hide_wrapper;

 

 

Tom
Super User Tom
Super User

Perhaps RECFM=N is not the right way to process the file?

Make sure you can read the file and see the BOM.  Here is example to read the first "line" and dump it to the SAS log.

data _null_;
  infile myjson  obs=1;
  input ;
  list;
run;

If you are sure the "lines" in the JSON file will always be shorter then the 32,767 byte limit for a character variable then you could try something like:

data _null_;
  infile myjson  ;
  file myjsonx ;
  input ;
  if _n_=1 and 'EFBBBF'x =: _infile_ then _infile_=substr(_infile_,4);
  put _infile_;
run;

Or perhaps trying reading it byte by byte and just skip the first 3 bytes?

data _null_;
  infile myjson  recfm=f lrecl=1 firtsobs=4;
  file myjsonx recfm=f lrecl1;
  input x $char1.;
  put x $char1.;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1400 views
  • 3 likes
  • 3 in conversation