I'm running the following code in UTF8
filename MYJSON "C:\Temp\JSON TEST\JSON\Emoji.json" encoding=utf8;
libname MYJSON json;   
and get the following error messages:
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 1049: Some code points did not transcode.
Is the emoji, \ud83d\ude80, the culprit? If yes, how can I get around it?
SAS Global Technical Support has explained the reason for not being able to read the file as follows.
"Basically SAS reads the file into buffers before parsing the data. On Windows it can happen that a multi-byte (ie emoji) sequence is broken up, leading to a partial character. This provokes the error you see. You can remove a work before the emoji and it will read in just fine, although the behavior is unpredictable. It is not a problem on the Linux platform, so if you have access to SAS on there, it can be used as a workaround."
They have also filed a request for SAS developers to consider fixing this in SAS 9.4 M7.
Why do you think that those are emoji? What emoji could they be?
D83D is not a valid UTF-8 sequence.
https://charbase.com/d83d-unicode-invalid-character
DE80 is not a valid UTF-8 sequence.
https://charbase.com/de80-unicode-invalid-character
Try fixing the file by replacing those with something that does not look like an attempt to define a unicode character.
11   %put %sysfunc(getoption(encoding,keyword));
ENCODING=UTF-8
12   filename original 'c:\downloads\Emoji.json' ;
13   filename myjson temp;
14   data _null_;
15     infile original;
16     file myjson;
17     input;
18     _infile_=tranwrd(_infile_,'\ude80','*de80*');
19     _infile_=tranwrd(_infile_,'\ud83d','*d83d*');
20     put _infile_;
21   run;
NOTE: The infile ORIGINAL is:
      Filename=c:\downloads\Emoji.json,
      RECFM=V,LRECL=32767,File Size (bytes)=2970,
      Last Modified=04Aug2022:10:18:06,
      Create Time=04Aug2022:10:18:06
NOTE: The file MYJSON is:
      Filename=C:\Users\...\#LN00053,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=04Aug2022:10:48:35,
      Create Time=04Aug2022:10:48:35
NOTE: 1 record was read from the infile ORIGINAL.
      The minimum record length was 2970.
      The maximum record length was 2970.
NOTE: 1 record was written to the file MYJSON.
      The minimum record length was 2970.
      The maximum record length was 2970.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
22
23   libname myjson json ;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
NOTE: Libref MYJSON was successfully assigned as follows:
      Engine:        JSON
      Physical Name: C:\Users\...\#LN00053
24   proc copy inlib=myjson outlib=work;
25   run;
NOTE: Copying MYJSON.ALLDATA to WORK.ALLDATA (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: There were 1 observations read from the data set MYJSON.ALLDATA.
NOTE: The data set WORK.ALLDATA has 1 observations and 4 variables.
NOTE: Copying MYJSON.ROOT to WORK.ROOT (memtype=DATA).
NOTE: BUFSIZE is not cloned when copying across different engines. System Option for BUFSIZE was used.
NOTE: There were 1 observations read from the data set MYJSON.ROOT.
NOTE: The data set WORK.ROOT has 1 observations and 2 variables.
NOTE: PROCEDURE COPY used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds
If you want to represent that character https://charbase.com/1f680-unicode-rocket try:
  _infile_=tranwrd(_infile_,'\ud83d\ude80','🚀');
Example:
filename original '~/test/Emoji.json' ;
filename myjson temp;
data _null_;
  infile original;
  file myjson;
  input;
  _infile_=tranwrd(_infile_,'\ude80\ud83d','🚀');
  put _infile_;
run;
libname myjson json ;
data test;
 set alldata;
 do offset=0 to length(value) by 80 ;
   value2=substrn(value,offset,80);
   output;
 end;
 keep offset value2;
run;
Thank you, Tom, for your propsosed solution!
Unfortunately, a more general solution is needed.
The content of the presented JSON string is only a part of a much, much larger file: "https://data.arbetsformedlingen.se/annonser/historiska/2020.zip". This is the file I need to read and I do not know how many points that will not transcode. Thus it doesn't seem to be a good way of working to change parts of the string for something else every time a Javascript Escape is not recognized.
Strangely enough, SAS seems to have the capability to recognize this Javascript Escape if you read it in through a put statement.
data _null_;
file input;
put '{"num_records": 1, "data": [[1, "\ud83d\ude80"]], "data_type": "json"}';
run;
libname input json fileref=input;
proc print data = posts.data;
run;
With the following result.
| Obs | ordinal_root | ordinal_data | element1 | element2 | 
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 🚀 | 
Do you think one could use this to build a more general solution for reading JSON files containing Java Script Escapes?
The issue is that the JSON engine is actually trying the transcode the unicode character strings and it apparently it does understand this strange variation that I had never heard of before but apparently that JAVA has decided to use.
Open a ticket with SAS support to see if they can enhance the JSON engine.
That data step is NOT trying to interpret the Java syntax. It just stored the actual \u characters and the hex digits as normal text and whatever viewer you are using to look at the output is what is deciding to show it as an image instead of text. I guess you are calling this an "Emoji" in the title of your question because the strange Java syntax works like text emojis where you type a colon and right parenthesis next to each other and it looks like a smiley face? The problem is that the individual unicode code points that syntax means in the way SAS is interpreting it are not valid unicode characters. But there is a valid unicode character for that rocket ship image. So it would be best if the system that is generating these JSON files stopped using the JAVA syntax.
Cannot check that ZIP file as the host seems to be invalid.
ERROR: Windows SSL error -2146893016 (0x80090328) occurred at line 2694, the error message is "The received certificate has expired. "
You could just try to generate your own list of strings that need to be recoded by using the example data _null_ step I showed. First use it to fix the rocketships and then try to read the resulting file. Once you see another error then look at the offending character and figure out how to fix it and update the data step to replace that string also.
The issue you might have using my data step is that it treating the file a lines of text. So any line that is longer than 32767 bytes will get truncated. So if you did want to do that it might be better to use different method to replace the strings. Either an outside program, like unix command sed, or a more complex SAS data step. If you want to go the SAS data step route I had had success in the past doing something like that by reading the file as fixed length records. The trick is two use a small enough record length that you can concatenate two records into a single character variable. That way you can find and replace strings that happen to fall over the record boundary.
Here is a way that should work. Just replace all of the \ with \\. Then the JSON engine will not try to convert them. Instead you will get the actual \n and \uXXXX strings into the character variables.
filename java 'c:\downloads\Emoji.json';
filename myjson temp;
data _null_;
  infile java recfm=f lrecl=512 ;
  file myjson recfm=n ;
  input ;
  length string $1024;
  nslash = countc(_infile_,'\');
  string = tranwrd(_infile_,'\','\\');
  len = 512+nslash;
  put string $varying1024. len @;
run;
libname myjson json;Text in modified JSON.
Resulting Text in SAS variable.
SAS Global Technical Support has explained the reason for not being able to read the file as follows.
"Basically SAS reads the file into buffers before parsing the data. On Windows it can happen that a multi-byte (ie emoji) sequence is broken up, leading to a partial character. This provokes the error you see. You can remove a work before the emoji and it will read in just fine, although the behavior is unpredictable. It is not a problem on the Linux platform, so if you have access to SAS on there, it can be used as a workaround."
They have also filed a request for SAS developers to consider fixing this in SAS 9.4 M7.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
