BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Multipla99
Quartz | Level 8

I'm running the following code in UTF8

 

filename MYJSON "C:\Temp\JSON TEST\JSON\Emoji.json" encoding=utf8;
libname MYJSON json;   

 

and get the following error messages:

 

 

NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 1049: Some code points did not transcode.

 

Is the emoji, \ud83d\ude80, the culprit? If yes, how can I get around it?

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Multipla99
Quartz | Level 8

SAS Global Technical Support has explained the reason for not being able to read the file as follows. 

 

"Basically SAS reads the file into buffers before parsing the data. On Windows it can happen that a multi-byte (ie emoji) sequence is broken up, leading to a partial character. This provokes the error you see. You can remove a work before the emoji and it will read in just fine, although the behavior is unpredictable. It is not a problem on the Linux platform, so if you have access to SAS on there, it can be used as a workaround."

 

They have also filed a request for SAS developers to consider fixing this in SAS 9.4 M7.

View solution in original post

11 REPLIES 11
Tom
Super User Tom
Super User

Why do you think that those are emoji?  What emoji could they be?

D83D is not a valid UTF-8 sequence.

https://charbase.com/d83d-unicode-invalid-character

DE80 is not a valid UTF-8 sequence.

https://charbase.com/de80-unicode-invalid-character

 

Try fixing the file by replacing those with something that does not look like an attempt to define a unicode character.

11   %put %sysfunc(getoption(encoding,keyword));
ENCODING=UTF-8
12   filename original 'c:\downloads\Emoji.json' ;
13   filename myjson temp;
14   data _null_;
15     infile original;
16     file myjson;
17     input;
18     _infile_=tranwrd(_infile_,'\ude80','*de80*');
19     _infile_=tranwrd(_infile_,'\ud83d','*d83d*');
20     put _infile_;
21   run;

NOTE: The infile ORIGINAL is:
      Filename=c:\downloads\Emoji.json,
      RECFM=V,LRECL=32767,File Size (bytes)=2970,
      Last Modified=04Aug2022:10:18:06,
      Create Time=04Aug2022:10:18:06

NOTE: The file MYJSON is:
      Filename=C:\Users\...\#LN00053,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=04Aug2022:10:48:35,
      Create Time=04Aug2022:10:48:35

NOTE: 1 record was read from the infile ORIGINAL.
      The minimum record length was 2970.
      The maximum record length was 2970.
NOTE: 1 record was written to the file MYJSON.
      The minimum record length was 2970.
      The maximum record length was 2970.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


22
23   libname myjson json ;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
NOTE: Libref MYJSON was successfully assigned as follows:
      Engine:        JSON
      Physical Name: C:\Users\...\#LN00053
24   proc copy inlib=myjson outlib=work;
25   run;

NOTE: Copying MYJSON.ALLDATA to WORK.ALLDATA (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: There were 1 observations read from the data set MYJSON.ALLDATA.
NOTE: The data set WORK.ALLDATA has 1 observations and 4 variables.
NOTE: Copying MYJSON.ROOT to WORK.ROOT (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: There were 1 observations read from the data set MYJSON.ROOT.
NOTE: The data set WORK.ROOT has 1 observations and 2 variables.
NOTE: PROCEDURE COPY used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

 

 

Multipla99
Quartz | Level 8
If you put the sequence, "\ud83d\ude80", in Unicode escape you will get back a 🚀. Isn't this an emoji? 
Tom
Super User Tom
Super User

If you want to represent that character https://charbase.com/1f680-unicode-rocket try:

  _infile_=tranwrd(_infile_,'\ud83d\ude80','🚀');

Example:

filename original '~/test/Emoji.json' ;
filename myjson temp;
data _null_;
  infile original;
  file myjson;
  input;
  _infile_=tranwrd(_infile_,'\ude80\ud83d','🚀');
  put _infile_;
run;

libname myjson json ;

data test;
 set alldata;
 do offset=0 to length(value) by 80 ;
   value2=substrn(value,offset,80);
   output;
 end;
 keep offset value2;
run;

Tom_0-1659630896434.png

 

 

Multipla99
Quartz | Level 8

Thank you, Tom, for your propsosed solution!

 

Unfortunately, a more general solution is needed.

 

The content of the presented JSON string is only a part of a much, much larger file: "https://data.arbetsformedlingen.se/annonser/historiska/2020.zip". This is the file I need to read and I do not know how many points that will not transcode. Thus it doesn't seem to be a good way of working to change parts of the string for something else every time a Javascript Escape is not recognized.

 

Strangely enough, SAS seems to have the capability to recognize this Javascript Escape if you read it in through a put statement.

data _null_;
file input;
put '{"num_records": 1, "data": [[1, "\ud83d\ude80"]], "data_type": "json"}';
run;
libname input json fileref=input;
proc print data = posts.data;
run;

With the following result.

 

Obs ordinal_root ordinal_data element1 element2
1 1 1 1 🚀

 

Do you think one could use this to build a more general solution for reading JSON files containing Java Script Escapes?

Ksharp
Super User
Can you remove these "\uXXXX" string ? or Just read json by data step. I think @Tom has many experience about it .
Tom
Super User Tom
Super User

The issue is that the JSON engine is actually trying the transcode the unicode character strings and it apparently it does understand this strange variation that I had never heard of before but apparently that JAVA has decided to use.

 

Open a ticket with SAS support to see if they can enhance the JSON engine.

 

That data step is NOT trying to interpret the Java syntax.  It just stored the actual \u characters and the hex digits as normal text and whatever viewer you are using to look at the output is what is deciding to show it as an image instead of text.  I guess you are calling this an "Emoji" in the title of your question because the strange Java syntax works like text emojis where you type a colon and right parenthesis next to each other and it looks like a smiley face?  The problem is that the individual unicode code points that syntax means in the way SAS is interpreting it are not valid unicode characters.  But there is a valid unicode character for that rocket ship image.  So it would be best if the system that is generating these JSON files stopped using the JAVA syntax.

 

Tom
Super User Tom
Super User

Cannot check that ZIP file as the host seems to be invalid.

ERROR: Windows SSL error -2146893016 (0x80090328) occurred at line 2694, the error message is "The received certificate has expired. "

 

You could just try to generate your own list of strings that need to be recoded by using the example data _null_ step I showed.  First use it to fix the rocketships and then try to read the resulting file.  Once you see another error then look at the offending character and figure out how to fix it and update the data step to replace that string also.

 

The issue you might have using my data step is that it treating the file a lines of text.  So any line that is longer than 32767 bytes will get truncated.  So if you did want to do that it might be better to use different method to replace the strings.  Either an outside program, like unix command sed, or a more complex SAS data step.  If you want to go the SAS data step route I had had success in the past doing something like that by reading the file as fixed length records.  The trick is two use a small enough record length that you can concatenate two records into a single character variable.  That way you can find and replace strings that happen to fall over the record boundary.

Multipla99
Quartz | Level 8
Thank you, Tom!

I'm sorry about the ZIP-file. It was possible to download it some week ago.

I also plan to contact SAS Technical support an suggest that they enhance the JSON engine.

In the meantime I might go for your suggested solution, replacing offending characters as they appear. Of course I will watch out for the 32767 bytes limit.

Thank you for your warning about it!
Tom
Super User Tom
Super User

Here is a way that should work.  Just replace all of the \ with \\.  Then the JSON engine will not try to convert them. Instead you will get the actual \n and \uXXXX strings into the character variables.

 

filename java 'c:\downloads\Emoji.json';
filename myjson temp;

data _null_;
  infile java recfm=f lrecl=512 ;
  file myjson recfm=n ;
  input ;
  length string $1024;
  nslash = countc(_infile_,'\');
  string = tranwrd(_infile_,'\','\\');
  len = 512+nslash;
  put string $varying1024. len @;
run;

libname myjson json;

Text in modified JSON.

Tom_1-1659707960095.png

 

Resulting Text in SAS variable.

Tom_0-1659707909221.png

 

Multipla99
Quartz | Level 8

SAS Global Technical Support has explained the reason for not being able to read the file as follows. 

 

"Basically SAS reads the file into buffers before parsing the data. On Windows it can happen that a multi-byte (ie emoji) sequence is broken up, leading to a partial character. This provokes the error you see. You can remove a work before the emoji and it will read in just fine, although the behavior is unpredictable. It is not a problem on the Linux platform, so if you have access to SAS on there, it can be used as a workaround."

 

They have also filed a request for SAS developers to consider fixing this in SAS 9.4 M7.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2202 views
  • 5 likes
  • 3 in conversation