BookmarkSubscribeRSS Feed
ojaro
SAS Employee

Hi experts,

 

I am have an UTF-8 character encoding issue somewhere within my ETL pipeline storing data from REST API to CSV file.


My ETL steps are as follows:

1) Use proc fcmp to download the data from the REST API and save the data in the filesystem as a JSON file. - Reading the JSON file with Python works, so the characters must be ok when returned from the REST API.

2) Use the following code to read the JSON file to SAS:

filename jsondata &mc_temp_file. encoding="utf-8";
libname jsondata JSON fileref=jsondata;

The resulting dataset encoding "Default" returned by proc contents data=jsondata.docs;run; .

3) Transform data and load it to work.t_doc_attrs.
The resulting dataset encoding "UTF-8" returned by proc contents data=work.t_doc_attrs;run; .

4) Use the following code to write the dataset as CSV:

proc transpose data=work.t_doc_attrs(obs=0) out=_names;
	  var _all_;
	run;
	data _null_;
	  file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 encoding="utf-8";
	  set _names end=eof;
	  put _name_ @ ;
	  if eof then put;
	run;
	data _null_;
	  file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 MOD encoding="utf-8";
	  set work.t_doc_attrs;
	  put (_all_) (+0);
	run;

If I now try to read the resulting CSV file with Python, I ge the following UTF-8 encoding error:

UnicodeDecodeError 'utf-8' codec can't decode byte 0xc3 in position 7524: invalid continuation byte

Any ideas regardng what might cause the characters to get corrupted? My session encoding (returned by proc options option=encoding; run;) is UTF-8.

Many thanks in advance,

 

Olli

2 REPLIES 2
Tom
Super User Tom
Super User

It kind of looks like you have truncation going on.  The reason 'C3'x is not a valid UTF-8 character is because it is the first character of a sequence.  So if there is no next byte to say which character this multi-byte item is then you get an error.

 

Perhaps JSON libname engine is not making the variable long enough?  Perhaps you can modify the MAP file to make them longer?

Perhaps the varaibles are long enough but somehow to short a format has been attached and that is truncating the values.

 

Or perhaps you Python code it trying to translate the UTF-8 characters twice?

https://stackoverflow.com/questions/24475393/unicodedecodeerror-ascii-codec-cant-decode-byte-0xc3-in...

ojaro
SAS Employee
Hi Tom,

Thanks for these suggestions. Good to know regarding the truncation.

I did manage to trace the issue all the way back to the source in the end, so I have gone back to the API owners highligting the issue. It must be that the Python CSV reader is just extra sensitive, since all earlier steps (including reading JSON in Python as a text file) worked without any errors.

Cheers,

Olli

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 1879 views
  • 0 likes
  • 2 in conversation