SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
ojaro
SAS Employee

Hi experts,

 

I am have an UTF-8 character encoding issue somewhere within my ETL pipeline storing data from REST API to CSV file.


My ETL steps are as follows:

1) Use proc fcmp to download the data from the REST API and save the data in the filesystem as a JSON file. - Reading the JSON file with Python works, so the characters must be ok when returned from the REST API.

2) Use the following code to read the JSON file to SAS:

filename jsondata &mc_temp_file. encoding="utf-8";
libname jsondata JSON fileref=jsondata;

The resulting dataset encoding "Default" returned by proc contents data=jsondata.docs;run; .

3) Transform data and load it to work.t_doc_attrs.
The resulting dataset encoding "UTF-8" returned by proc contents data=work.t_doc_attrs;run; .

4) Use the following code to write the dataset as CSV:

proc transpose data=work.t_doc_attrs(obs=0) out=_names;
	  var _all_;
	run;
	data _null_;
	  file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 encoding="utf-8";
	  set _names end=eof;
	  put _name_ @ ;
	  if eof then put;
	run;
	data _null_;
	  file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 MOD encoding="utf-8";
	  set work.t_doc_attrs;
	  put (_all_) (+0);
	run;

If I now try to read the resulting CSV file with Python, I ge the following UTF-8 encoding error:

UnicodeDecodeError 'utf-8' codec can't decode byte 0xc3 in position 7524: invalid continuation byte

Any ideas regardng what might cause the characters to get corrupted? My session encoding (returned by proc options option=encoding; run;) is UTF-8.

Many thanks in advance,

 

Olli

2 REPLIES 2
Tom
Super User Tom
Super User

It kind of looks like you have truncation going on.  The reason 'C3'x is not a valid UTF-8 character is because it is the first character of a sequence.  So if there is no next byte to say which character this multi-byte item is then you get an error.

 

Perhaps JSON libname engine is not making the variable long enough?  Perhaps you can modify the MAP file to make them longer?

Perhaps the varaibles are long enough but somehow to short a format has been attached and that is truncating the values.

 

Or perhaps you Python code it trying to translate the UTF-8 characters twice?

https://stackoverflow.com/questions/24475393/unicodedecodeerror-ascii-codec-cant-decode-byte-0xc3-in...

ojaro
SAS Employee
Hi Tom,

Thanks for these suggestions. Good to know regarding the truncation.

I did manage to trace the issue all the way back to the source in the end, so I have gone back to the API owners highligting the issue. It must be that the Python CSV reader is just extra sensitive, since all earlier steps (including reading JSON in Python as a text file) worked without any errors.

Cheers,

Olli

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 2275 views
  • 0 likes
  • 2 in conversation