- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi experts,
I am have an UTF-8 character encoding issue somewhere within my ETL pipeline storing data from REST API to CSV file.
My ETL steps are as follows:
1) Use proc fcmp to download the data from the REST API and save the data in the filesystem as a JSON file. - Reading the JSON file with Python works, so the characters must be ok when returned from the REST API.
2) Use the following code to read the JSON file to SAS:
filename jsondata &mc_temp_file. encoding="utf-8";
libname jsondata JSON fileref=jsondata;
The resulting dataset encoding "Default" returned by proc contents data=jsondata.docs;run; .
3) Transform data and load it to work.t_doc_attrs.
The resulting dataset encoding "UTF-8" returned by proc contents data=work.t_doc_attrs;run; .
4) Use the following code to write the dataset as CSV:
proc transpose data=work.t_doc_attrs(obs=0) out=_names;
var _all_;
run;
data _null_;
file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 encoding="utf-8";
set _names end=eof;
put _name_ @ ;
if eof then put;
run;
data _null_;
file "&doc_cvs_path" dsd delimiter='|' lrecl=1000000 MOD encoding="utf-8";
set work.t_doc_attrs;
put (_all_) (+0);
run;
If I now try to read the resulting CSV file with Python, I ge the following UTF-8 encoding error:
UnicodeDecodeError 'utf-8' codec can't decode byte 0xc3 in position 7524: invalid continuation byte
Any ideas regardng what might cause the characters to get corrupted? My session encoding (returned by proc options option=encoding; run;) is UTF-8.
Many thanks in advance,
Olli
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It kind of looks like you have truncation going on. The reason 'C3'x is not a valid UTF-8 character is because it is the first character of a sequence. So if there is no next byte to say which character this multi-byte item is then you get an error.
Perhaps JSON libname engine is not making the variable long enough? Perhaps you can modify the MAP file to make them longer?
Perhaps the varaibles are long enough but somehow to short a format has been attached and that is truncating the values.
Or perhaps you Python code it trying to translate the UTF-8 characters twice?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for these suggestions. Good to know regarding the truncation.
I did manage to trace the issue all the way back to the source in the end, so I have gone back to the API owners highligting the issue. It must be that the Python CSV reader is just extra sensitive, since all earlier steps (including reading JSON in Python as a text file) worked without any errors.
Cheers,
Olli