This topic is solved and locked.
vfarmak
Quartz | Level 8

Greetings everyone, 

I am facing the following issue: I have loaded a dataset in SAS DI in which some columns contain JSON arrays.

The problem with these JSON arrays is that they do not always have the same length. At first I used the technique provided in the thread below

https://communities.sas.com/t5/SAS-Programming/SAS-DI-4-9-JSON-Column-Parse-Data-Step/m-p/684905#M20...

by ChrisZ, and it works fine with single-element arrays.

When it comes to arrays with more than one element, the code reads only the first one.

The expected outcome, for the example below, is to generate one line per array element.

id      content
000001  [{"item":"value1"}, {"item":"value2"},{"item":"value3"}]

to

id      value
000001  value1
000001  value2
000001  value3
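For this simple single-key case, a plain DATA step can split the array without any JSON tooling. This is only a sketch for arrays of the exact form shown above; the dataset name HAVE and the variables id and content are assumptions based on the example, and it assumes the values themselves contain no embedded quotes.

```sas
/* Sketch: flatten [{"item":"value"}, ...] into one row per element.    */
/* Assumes dataset HAVE with variables id and content (hypothetical).   */
data want;
  set have;
  length value $200;
  prx = prxparse('/"item"\s*:\s*"([^"]*)"/');  /* capture each item value */
  start = 1;
  stop  = length(content);
  call prxnext(prx, start, stop, content, pos, len);
  do while (pos > 0);
    value = prxposn(prx, 1, content);  /* text of capture group 1 */
    output;                            /* one row per array element */
    call prxnext(prx, start, stop, content, pos, len);
  end;
  keep id value;
run;
```

This only works for a fixed, flat key like "item"; the nested structures later in the thread need a real JSON parser.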

 

 

I tried the following:

  1. The SCANOVER option: this comes in handy, but I think it can only be used on an INFILE statement
  2. PROC DS2, with no luck: I was able to execute the code, but once the parser is initialized it does not see any of the tokens in the JSON column
  3. The JSON libname engine is not supported by the version of our SAS system

Below you will find my attempt with PROC DS2 and our SAS version.

SAS_Enterprise_Guide_Version.PNG

 

Contents of HAVE Dataset

_id                       contactProgress
5fa1b999c2dc2170299dc0b4  [{"date": { "$date": "2020-11-04T15:20:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "020", "description": "No answer"}, "notes": "this is the content of the text provided, "type": "INPROGRESS"}, {"date": { "$date": "2020-11-04T18:00:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "010", "description": "Communication"}, "notes": "This is the new note test", "type": "INPROGRESS"}, {"date": { "$date": "2020-11-05T16:30:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "010", "description": "Communication"}, "notes": "The test quotation has been sent to the client", "type": "INPROGRESS"}]

 

DATA HAVE;
SET WORK.QUERY_FOR_WRK_LEADS_KPIS;
RUN;

PROC DS2;
DATA WANT (OVERWRITE=YES);
DCL PACKAGE JSON J();
DCL INTEGER TOKENTYPE PARSEFLAGS;
DCL NVARCHAR(128) TOKEN;
DCL INTEGER RC;
DCL CHAR(2000) CHARACTER SET UTF8 CONTACT_PROGRESS;

METHOD INIT();
RC = J.CREATEPARSER( CONTACT_PROGRESS, 32767 );

DO WHILE( RC = 0 );
J.GETNEXTTOKEN( RC, TOKEN, TOKENTYPE, PARSEFLAGS );
IF (TOKEN = '') THEN do; 
%put 'Empty Token Found';

end;
END;
END;

METHOD RUN();
SET WORK.HAVE;
END;


method term();
rc = j.destroyParser();
end;

ENDDATA;
RUN;

 

 

Thank you in advance for your time.

 

ACCEPTED SOLUTION
Tom
Super User

Let's generalize a little more.  First let's take the two example strings posted in the original question and make a dataset out of them.  (Note the second string was not valid JSON, as one of the quoted strings was missing its closing quote, so I fixed that.)  So we have two variables: ID and a JSON string.

data have;
  infile cards truncover ;
  input id :$10. json $1000. ;
cards4;
001 [{"item":"value1"}, {"item":"value2"},{"item":"value3"}]
002 [{"date": { "$date": "2020-11-04T15:20:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "020", "description": "No answer"}, "notes": "this is the content of the text provided", "type": "INPROGRESS"}, {"date": { "$date": "2020-11-04T18:00:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "010", "description": "Communication"}, "notes": "This is the new note test", "type": "INPROGRESS"}, {"date": { "$date": "2020-11-05T16:30:00.000Z" }, "details": {"code": "040", "description": "Follow up"}, "feedback": {"code": "010", "description": "Communication"}, "notes": "The test quotation has been sent to the client", "type": "INPROGRESS"}]
;;;;

Now let's use code like Chris' to generate a JSON file.  Note the use of the TEMP engine to get a temporary file.  Also, I much prefer to place the continuation characters (commas, in JSON syntax) at the START of the line instead of the END of the line, so that humans can see them more easily.

filename json temp;
data _null_;
  file json;
  set have end=eof;
  if _n_=1 then put '{' @;
  else put ',' @;
  put id :$quote. ':' json ;
  if eof then put '}';
run;

Now we can make a libref using the JSON engine (it will default to look for a FILEREF with the same name as the LIBREF).

libname json json ;
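Before looking at ALLDATA, it can help to list the member datasets the engine exposed for this libref. A quick sketch (member names depend on the JSON structure):

```sas
/* Directory listing of the datasets the JSON engine created */
proc datasets lib=json nolist;
  contents data=_all_ nods;  /* NODS: directory only, no per-member detail */
run;
quit;
```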

Here is what the ALLDATA structure looks like for this JSON data.

     Obs    P    P1     P2          P3             V    Value

       1    1    001                               0
       2    2    001    item                       1    value1
       3    1    001                               0
       4    2    001    item                       1    value2
       5    1    001                               0
       6    2    001    item                       1    value3
       7    1    002                               0
       8    2    002    date                       0
       9    3    002    date        $date          1    2020-11-04T15:20:00.000Z
      10    2    002    details                    0
      11    3    002    details     code           1    040
      12    3    002    details     description    1    Follow up
      13    2    002    feedback                   0
      14    3    002    feedback    code           1    020
      15    3    002    feedback    description    1    No answer
      16    2    002    notes                      1    this is the content of the text provided
      17    2    002    type                       1    INPROGRESS
      18    1    002                               0
      19    2    002    date                       0
      20    3    002    date        $date          1    2020-11-04T18:00:00.000Z
...

P is the level in the JSON hierarchy at which this value lives (1,2,3,...), and P1,P2,... are the names at each level.  Note that in this case P1 is the value we wrote from ID, and P2 onward are the names from the JSON strings.  V tells whether VALUE is populated.

So we can use this to make a structure that PROC TRANSPOSE can use. 

First let's rename P and P1 so that we can then use the P: wildcard to find P2,P3,P4,... without knowing how deep the hierarchy is for this JSON text.  As long as those names are short enough, we can use them to make variable names.  If they are too long, we might need to add some more logic to generate unique names.

To handle the repeating names we can add a sequence number. We can increment the counter whenever P=1. We can use changes in P1 to signal restarting the counter.

data tall ;
  set json.alldata(rename=(p=first_id p1=id)) ;
  by id notsorted;
  length _name_ $32 _label_ $200 ;
  _label_=catx(' ',of p:);
  _name_ =catx('_',of p:);
  if first.id then seq=0;
  if first_id=1 then seq+1;
  if v=1;
run;

Now we can use PROC TRANSPOSE.

proc transpose data= tall out=want(drop=_name_);
  by id seq notsorted;
  var value;
run;

And get data like this:

                                                           details_     details_      feedback_
Obs    id     seq     item            date__date             code      description      code

 1     001     1     value1
 2     001     2     value2
 3     001     3     value3
 4     002     1               2020-11-04T15:20:00.000Z      040        Follow up        020
 5     002     2               2020-11-04T18:00:00.000Z      040        Follow up        010
 6     002     3               2020-11-05T16:30:00.000Z      040        Follow up        010

         feedback_
Obs     description     notes                                                type

 1
 2
 3
 4     No answer        this is the content of the text provided          INPROGRESS
 5     Communication    This is the new note test                         INPROGRESS
 6     Communication    The test quotation has been sent to the client    INPROGRESS

[See also the code in this reply that shows how to use DATA step and the JSON libname engine to read a series of JSON data entries.]

View solution in original post

17 REPLIES
Tom
Super User

Why do you have that macro code in the middle of your PROC DS2 step?  The %PUT will print its message while SAS is scanning the code and preparing the PROC step to execute; it does nothing while the step is running.  Did you mean to use a normal PUT statement there?
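A sketch of that change, applied to the INIT method from the question (only the %PUT line changes; the surrounding DS2 declarations are as posted):

```sas
/* Sketch of METHOD INIT() with runtime logging via PUT, not macro %PUT */
method init();
  rc = j.createParser(contact_progress, 32767);
  do while (rc = 0);
    j.getNextToken(rc, token, tokentype, parseflags);
    if token = '' then
      put 'Empty token found';   /* executes on each loop iteration */
    else
      put 'Token: ' token;       /* writes each parsed token to the log */
  end;
end;
```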

vfarmak
Quartz | Level 8

I used it for debugging purposes. 

I changed it as you suggested, but nothing is printed.

Ksharp
Super User

@ChrisHemedinger wrote a blog about reading JSON before.

Or you could write some raw SAS data step code to parse it.

 

vfarmak
Quartz | Level 8

Yes indeed, I am aware of it.

However, it focuses on JSON data retrieved from a web service, rather than on processing JSON arrays stored in a SAS column.

I used an excerpt of the PROC DS2 code that @ChrisHemedinger provided, with no luck.

The problem resides in how to target the column's value. He uses a REST API to capture the data, not a SAS dataset with JSON values stored in it.

ChrisHemedinger
Community Manager

I think your best option is to use the JSON libname engine. However, this operates only on text files, so you first need to extract the JSON data from the data set and create a text file for each record, or one large JSON text file with a wrapper object.  This assumes the data field you're reading is a valid complete JSON string.

 

Let's illustrate using an example. I don't have your data, so I'll simulate by creating a similar data set from SASHELP.CLASS.

 

%let datasource = sashelp.class;

/* Create a new subfolder in WORK to hold */
/* temp JSON files, avoiding conflicts    */
options dlcreatedir;
%let workpath = %sysfunc(getoption(WORK))/json;
libname json "&workpath.";
libname json clear;

/* Will create and run a separate PROC JSON */
/* for each record.  This might take a while   */
/* for very large data.                        */
/* Each iteration will create a new JSON file  */
data _null_;
 set &datasource.;
 call execute(catt('filename out "',"&workpath./out",_n_,'.json";'));
 call execute('proc json out=out nosastags ;');
 call execute("export &datasource.(obs="||_n_||" firstobs="||_n_||");");
 call execute('run;');
run;

data records;
 length ID $ 10 Details $ 32767;
 infile "&workpath./out*.json";
 input;
 ID = cats('ID',put(_n_,z3.));
 details = _infile_;
run;

This creates a data set that looks similar to yours:

2020-11-24_12-40-14.jpg

Now, starting with this, let's create a JSON file that has these records formatted as name/value pairs:

/* Now create one large JSON file that wraps these name/value pairs */
filename stage TEMP;
data _null_;
 length v $ 15;
 file stage;
 set records end=done;
 if _n_ = 1 then do;
  put '{';
 end;
 v = cats('"',ID,'":');
 put v;
 put Details;
 if done then do;
  put '}';
 end;
 else put ',';
run;

The resulting JSON looks like this:

{
  "ID001": [
    {
      "Name": "Alfred",
      "Sex": "M",
      "Age": 14,
      "Height": 69,
      "Weight": 112.5
    }
  ],
  "ID002": [
    {
      "Name": "John",
      "Sex": "M",
      "Age": 12,
      "Height": 59,
      "Weight": 99.5
    }
  ],
  "ID003": [
    {
      "Name": "Joyce",
      "Sex": "F",
      "Age": 11,
      "Height": 51.3,
      "Weight": 50.5
    }
  ],

Now we can use the JSON libname engine to read this like a set of data sets that we can assemble:

libname final json fileref=stage;
data final;
 length ID $ 15;
 set final.ID: indsname=src;
 drop ordinal:;
 ID=scan(src,2,'.');
run;

Result:

2020-11-24_12-44-53.jpg


vfarmak
Quartz | Level 8

Thank you @ChrisHemedinger & @Tom 

 

I tried to run the code, but the following error appears:

43         libname json json ;
NOTE: JSON data is only read once.  To read the JSON again, reassign the JSON LIBNAME.
ERROR: Invalid JSON in input near line 1 column 5: Encountered an illegal character.
ERROR: Error in the LIBNAME statement.

Does this have to do with the version of Enterprise Guide that I am running? Does it support the JSON libname engine?

I have read @ChrisHemedinger's blog about the JSON library and tried to use it, with no luck.

This is the main reason I tried the PROC DS2 method.

Tom
Super User

So that error is saying the JSON is not well formed. The error is early in the first line, so just look at the file. You can run this data step to dump the first line to the SAS log.

data _null_;
  infile json obs=1;
  input;
  list;
run;

If the JSON is not well formed then reading it with any JSON parser is not going to work.

vfarmak
Quartz | Level 8

Thank you @Tom for your input.

I used the debugger and observed the following:

  1. Is it necessary to add the brackets { .. [] ... } in the JSON file?
  2. The json fileref assigned in the FILENAME statement is empty. (I suspect this is the JSON file written to the temp folder that SAS Enterprise Guide uses to read and write data. Is that true? If so, why can't I see the value in the debugger?)
vfarmak
Quartz | Level 8

Also, @Tom, to give you a better perspective of the debugger, I have attached the following screenshot:

 

json_sas_datastep.PNG

Tom
Super User

I've never tried to use the data step debugger.

But the statement you have split over lines 10 and 11 needs to come AFTER the statement you have on line 12.

If the right side of that screenshot is supposed to show your data, then it looks like the variable with the JSON text is named CONTACTPROGRESS, not JSON.

vfarmak
Quartz | Level 8

Hello @ChrisHemedinger ,

 

First of all thank you for your time on this.

I have reviewed your code thoroughly and understood exactly what you were trying to achieve.

I did the same thing on my side: I located the file on disk that your code creates.

In the final data step, I can see that you are using only the ID column and not all the others (in your example, Sex, Weight, etc.).

Did you do this on purpose, with the code reading the remaining columns automatically, or is this something I should do explicitly in order to view the rest of the data?

ChrisHemedinger
Community Manager

@vfarmak In that final step, where I have a LENGTH statement only for ID, the remaining columns are being read from the JSON library via the SET statement.  This code does not presume knowledge of the column names/attributes that come from the JSON -- but as you can see in the final result, the data is there.

 

Nested JSON structures might require more code and join/merge operations for the related tables that the JSON engine creates based on structure.  The code from @Tom will read all of the values regardless of nesting structure, and is a good approach when you have no idea of the data structure/contents.  But my approach does take advantage of the JSON engine's ability to automatically assign the proper variable types and lengths based on the data values.
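For the join/merge case mentioned above, the engine links parent and child tables with generated ORDINAL_* key variables. A hypothetical sketch, assuming the engine produced members named ROOT and ITEMS in the FINAL libref (the actual member and key names depend on your JSON structure, so check the libref's directory first):

```sas
/* Hypothetical join of parent/child members created by the JSON engine. */
/* Member names (ROOT, ITEMS) and the ORDINAL_ROOT key are assumptions;  */
/* run PROC DATASETS on the libref to see what was actually generated.   */
proc sql;
  create table flat as
  select p.*, c.*
  from final.root  as p
       inner join
       final.items as c
       on p.ordinal_root = c.ordinal_root;  /* engine-generated key */
quit;
```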

vfarmak
Quartz | Level 8

Greetings @ChrisHemedinger and @Tom ,

I studied your answers in detail and flattened the objects, not only in EG but also in DI.

Thank you very much for your effort in this.

 

One last question: where can I find material about these library engines and the operations they offer? For example, for the JSON engine I found the ALLDATA member, which stores all the data in a single SAS dataset.

 


Discussion stats
  • 17 replies
  • 3303 views
  • 6 likes
  • 4 in conversation