brucegilsen
Obsidian | Level 7

Hi, I'm working on a conference paper on copying data between SAS and JSON. For reading JSON files into SAS, I started with the examples in the LIBNAME Statement: JSON Engine documentation and I'm using SAS 9.4TS1M4.

 

What I've seen about reading JSON files into SAS generally involves looking at the data and then writing code specific to the data layout. I wonder if there's a way to automate the process.

 

For example, the section on Merging Data Sets shows how to read JSON code into multiple SAS data sets and then use ordered values (variables beginning with ORDINAL_) to merge the data sets into a single data set. Using that JSON file to save space here, could I automate that? Maybe after the PROC DATASETS in the documentation, I could do something like the following:

1. Read the dictionary.columns table for the library containing the component data sets and subset to obtain all variables that start with ORDINAL_.

proc sql noprint;
  create table ordinals as
  select memname, name
    from dictionary.columns
    where libname="IN"
    and upcase(name) like 'ORDINAL_%';   
quit;

Resulting data set:

Obs    memname          name
 1     INFO         ordinal_root
 2     INFO         ordinal_info
 3     INFO_ADD     ordinal_info
 4     INFO_ADD     ordinal_add 
 5     INFO_NAME    ordinal_info
 6     INFO_NAME    ordinal_name
 7     ROOT         ordinal_root

 

2. Remove observations for ORDINAL_ values that only occur once since those are at the bottom of the hierarchy.

proc sort data=ordinals nouniquekey;
  by name;
run;

Resulting data set:

Obs    memname          name
 1     INFO         ordinal_info
 2     INFO_ADD     ordinal_info
 3     INFO_NAME    ordinal_info
 4     INFO         ordinal_root
 5     ROOT         ordinal_root

 

3. Construct the merge by working down the levels from ORDINAL_ROOT.
a. Merge 1 is for data sets with ORDINAL_ROOT (INFO and ROOT).
b. Since INFO from Merge 1 also has ORDINAL_INFO, Merge 2 is the data set created in Merge 1 with INFO_ADD and INFO_NAME (they both have ORDINAL_INFO).

c. Additional levels could work the same way.
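To make that concrete, here's a rough sketch of what the generated merges might look like for the documentation example (assuming the component data sets are read through the JSON libref IN, as in step 1 above; untested):

data merge1;
  /* Merge 1: data sets that share ORDINAL_ROOT (ROOT and INFO) */
  merge in.root in.info;
  by ordinal_root;
run;

data merge2;
  /* Merge 2: Merge 1 plus the data sets that share ORDINAL_INFO */
  merge merge1 in.info_add in.info_name;
  by ordinal_info;
  drop ordinal_:;
run;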

 

That worked with a few JSON files, but then I got to a few fairly simple cases that didn't. For example, consider the following JSON file (simplified to remove additional blocks like this one).

[
  {
    "country": "usa",
    "city": "chicago",
    "income": {
              "salary": {
                         "salary_h1": 35,
                         "salary_h2": 45
                        },
              "stocks": {
                         "stocks_h1": 8,
                         "stocks_h2": 12
                        }
     },
    "tax": {   
              "incometax": {
                             "incometax_estimated": 3,
                             "incometax_withheld": 7
                           },
              "salestax": 5,
              "cartax": {
                             "cartax_h1": 2,
                             "cartax_h2": 2
                           }
     }
  }
]

 

Using PROC SQL and PROC SORT as above, this generated the following result, where there's no connection between ORDINAL_INCOME and the rest of the tree.

Obs    memname               name
 1     INCOME_SALARY    ordinal_income
 2     INCOME_STOCKS    ordinal_income
 3     ROOT             ordinal_root  
 4     TAX              ordinal_root  
 5     TAX              ordinal_tax   
 6     TAX_CARTAX       ordinal_tax   
 7     TAX_INCOMETAX    ordinal_tax  

 

So I hit a dead end with this effort, unless there's a way to identify some subset of JSON files that'll work this way.

Has anyone done any research in this area and come up with anything interesting?

 

Thanks,
Bruce

8 REPLIES
Tom
Super User

INCOME has no data, so no dataset is made.  TAX has this data

"salestax": 5

You might want to just use the table ALLDATA.
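It has one row for each name/value pair, with the path pieces in P1, P2, ... and the value in the VALUE variable, so a quick look is just this (assuming your JSON libref is IN):

  /* Browse the engine's ALLDATA table: one row per JSON name/value pair */
proc print data=in.alldata;
run;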

brucegilsen
Obsidian | Level 7

Tom:

 

Thanks.  Revealing my ignorance about the concepts of JSON, I'm confused. If I do the 2 merges as described above, I get this.

                                     incometax_   incometax_   cartax_   cartax_
Obs   country    city     salestax    estimated    withheld       h1        h2
 1      usa     chicago       5           3            7          2         2   
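For reference, the two merges follow the same pattern as the sketch in the first post (assuming the data sets were first copied to a Base library XXX):

data merge1;
  merge xxx.root xxx.tax;                           /* share ORDINAL_ROOT */
  by ordinal_root;
run;

data merge2;
  merge merge1 xxx.tax_incometax xxx.tax_cartax;    /* share ORDINAL_TAX */
  by ordinal_tax;
  drop ordinal_:;
run;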

I see in the JSON file that INCOME has no data directly underneath it, as you noted. But SALARY and STOCKS, which do have values, are nested below INCOME. Conceptually, INCOMETAX and CARTAX seem analogous to SALARY and STOCKS, yet I can reach the values nested below TAX (because the key/value pair "salestax": 5 sits directly below TAX) while I can't reach the ones below INCOME, which has no data directly below it. That's what I'm having trouble wrapping my head around.

 

Thanks,
Bruce

Tom
Super User

It looks to me like INCOME is not generated, but if it were it would have just the variables ORDINAL_ROOT and ORDINAL_INCOME. For your simple case it looks like you could generate it from the ROOT table by just setting ORDINAL_INCOME to match ORDINAL_ROOT. But that might not work with more complex data.
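Something along these lines, perhaps (assuming the tables were copied to a Base library XXX; untested):

data xxx.income;
  /* fabricate the missing INCOME table from ROOT by equating the two ordinals */
  set xxx.root;
  ordinal_income = ordinal_root;
  keep ordinal_root ordinal_income;
run;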

Tom
Super User

Add ORDINALCOUNT=ALL to the LIBNAME statement.
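For example (the path and libref here are placeholders):

  /* ask the JSON engine to generate every possible ordinal variable */
libname in json '/some/path/file.json' ordinalcount=all;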

ORDINALCOUNT= ALL | NONE | n

specifies the maximum number of ordinal variables to generate for each data set. The maximum number of possible ordinal variables for a data set depends on the position of the data in the JSON, and can be less than the number that you specify. In that case, only the maximum number of possible ordinal variables is generated. ORDINALCOUNT can have one of these values:

ALL

creates all possible ordinal variables for the data set.

NONE

suppresses creation of ordinal variables for the data set.

n

is an integer that is greater than or equal to 0 (zero).

Default 2

 

Example results :

            ordinal_    ordinal_    ordinal_    salary_    salary_
     Obs      root       income      salary        h1         h2

       1        1           1           1          35         45
       2        3           2           2          35         45

 

brucegilsen
Obsidian | Level 7

Tom:

 

Hi, I added ORDINALCOUNT=ALL to the LIBNAME statement as suggested.  Thanks!!! 

 

Now, initial ORDINAL_ variables are as follows.  *** = not present without ORDINALCOUNT= ALL.

memname               name
INCOME_SALARY    ordinal_root    ***  
INCOME_SALARY    ordinal_income   
INCOME_SALARY    ordinal_salary   
INCOME_STOCKS    ordinal_root    *** 
INCOME_STOCKS    ordinal_income   
INCOME_STOCKS    ordinal_stocks   
ROOT             ordinal_root     
TAX              ordinal_root     
TAX              ordinal_tax      
TAX_CARTAX       ordinal_root    ***
TAX_CARTAX       ordinal_tax      
TAX_CARTAX       ordinal_cartax   
TAX_INCOMETAX    ordinal_root    *** 
TAX_INCOMETAX    ordinal_tax      
TAX_INCOMETAX    ordinal_incometax

And after getting rid of unique ordinals.    *** = not present without ORDINALCOUNT= ALL.

memname               name
INCOME_SALARY    ordinal_income
INCOME_STOCKS    ordinal_income
INCOME_SALARY    ordinal_root    ***
INCOME_STOCKS    ordinal_root    ***
ROOT             ordinal_root  
TAX              ordinal_root  
TAX_CARTAX       ordinal_root    ***
TAX_INCOMETAX    ordinal_root    ***
TAX              ordinal_tax   
TAX_CARTAX       ordinal_tax   
TAX_INCOMETAX    ordinal_tax   

 

Now the merge can work from this info. I haven't coded it, but I think for the first merge I want to merge by ORDINAL_ROOT all data sets whose MEMNAME either has no underscore, or whose MEMNAME up to the first (or only) underscore doesn't equal another entire MEMNAME. That's probably clearest by example. The first merge is of ROOT, INCOME_SALARY, INCOME_STOCKS, and TAX, but not TAX_CARTAX or TAX_INCOMETAX, as shown below (a sketch of the selection logic follows the table).

memname       name          Include in first merge?
INCOME_SALARY ordinal_root  Yes because there's no MEMNAME of INCOME
INCOME_STOCKS ordinal_root  Yes because there's no MEMNAME of INCOME
ROOT          ordinal_root  Yes because it's root
TAX           ordinal_root  Yes because there's no underscore
TAX_CARTAX    ordinal_root  No because there's a MEMNAME of TAX
TAX_INCOMETAX ordinal_root  No because there's a MEMNAME of TAX
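A rough sketch of that selection rule, starting from the ORDINALS table built in step 1 of my first post (untested):

proc sql noprint;
  create table firstmerge as
  select distinct memname
    from ordinals
    where upcase(name)='ORDINAL_ROOT'
      and ( index(memname,'_') = 0                     /* no underscore: ROOT, TAX        */
            or scan(memname,1,'_') not in              /* prefix is not itself a data set */
               (select distinct memname from ordinals) );
quit;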

 

So the merges are as follows.

data finaldata6;
  /* Merge 1: data sets that share ORDINAL_ROOT */
  merge xxx.root xxx.tax xxx.income_salary xxx.income_stocks;
  by ordinal_root;
run;

data finaldata6;
  /* Merge 2: add the data sets that share ORDINAL_TAX */
  merge finaldata6 xxx.tax_incometax xxx.tax_cartax;
  by ordinal_tax;
  drop ordinal_:;
run;

And now I get all the values in the final data set!

Obs  country  city     salestax  salary_h1  salary_h2  stocks_h1  stocks_h2
 1   usa      chicago      5         35         45          8         12

Obs  incometax_estimated  incometax_withheld  cartax_h1  cartax_h2
 1             3                    7              2          2

I have no idea how much this can be generalized, but will think about this next week.  Any thoughts you have are welcome.  Thanks so much for moving me this far forward today!

 

Bruce

 

 

 

brucegilsen
Obsidian | Level 7

Tom:

Hi, I'm back at this after the holidays and some other work. With a fresh deep dive, I realized something significant.

Returning to the JSON file I included in the initial post two weeks ago, I ran the following code (with ORDINALCOUNT=ALL as you suggested):

libname in5 json '/my/home/m1xxx00/json/fivefive.json'
  map='user5.map' automap=create ordinalcount=all;
libname xxx '/my/home/m1xxx00/json/example6';
proc copy in=in5 out=xxx;
run;

The library referenced by XXX had the following data sets (ignoring ALLDATA):
income_salary, income_stocks, root, tax_cartax, tax_incometax, tax

All 6 data sets had the variable ORDINAL_ROOT and now a simple merge of all these data sets generated the correct data. None of the steps I described previously were needed.

data simpledata6;
  merge xxx.root xxx.income_salary xxx.income_stocks xxx.tax xxx.tax_incometax xxx.tax_cartax;
  by ordinal_root;
  drop ordinal_:;
run;

Data set SIMPLEDATA6 had all the values we wanted:

Obs  country  city     salary_h1  salary_h2  stocks_h1  stocks_h2  salestax
 1   usa      chicago      35         45          8         12         5

Obs  incometax_estimated  incometax_withheld  cartax_h1  cartax_h2
 1             3                    7              2          2

I then tried a slightly different JSON file with blocks of records like the following, and the same thing was true: they all had ORDINAL_ROOT, with appropriate values for each block, and the simple merge worked.

[
  {
    "country": "usa",
    "city": "chicago",
    "income": {
              "salary": {
                         "salary_h1": 35,
                         "salary_h2": 45
                        },
              "bonus": 10,
              "stocks": {
                         "stocks_h1": 8,
                         "stocks_h2": 12
                        }
     },
    "tax": {   
              "incometax": {
                             "incometax_estimated": 3,
                             "incometax_withheld": 7
                           },
              "salestax": 5,
              "cartax": {
                             "cartax_h1": 2,
                             "cartax_h2": 2
                           }
     },

 ...... (additional blocks like these).......
] 

So, I'm wondering about a few things.
1. Why did the documentation use the indirect method that it did?
2. How generalized is this solution?
I'm not sure how much time you have for this but it seems to me that a simplified way to read JSON files would be useful to many people.
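In case it's useful, here is a rough, untested sketch of how the simple merge might be generated automatically from dictionary metadata, assuming ORDINALCOUNT=ALL is in effect (so every data set has ORDINAL_ROOT) and the data sets have been copied to library XXX:

proc sql noprint;
  /* list every data set that contains ORDINAL_ROOT, skipping ALLDATA */
  select cats('xxx.', memname) into :mergelist separated by ' '
    from dictionary.columns
    where libname='XXX'
      and upcase(name)='ORDINAL_ROOT'
      and upcase(memname) ne 'ALLDATA';
quit;

data simpledata;
  merge &mergelist;     /* e.g. xxx.root xxx.income_salary ... xxx.tax_cartax */
  by ordinal_root;
  drop ordinal_:;
run;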

Thanks,
Bruce

Tom
Super User

If just merging all of the tables at once on a single KEY variable worked for your data then you are very lucky.

Someone else just posted today a question on JSON where that would not work.

Each JSON file is different.

That is the beauty and frustration of the format.

brucegilsen
Obsidian | Level 7

Hi Tom, I checked out that other thread, at https://communities.sas.com/t5/SAS-Programming/SAS-EG-how-to-read-in-JSON-file/m-p/710217, and after using your code (with a minor change to the IF EOF code) to convert his JSONL code to JSON, I was able to use the same code as above to do a simple merge of all 3 data sets and get the desired result.

 

I know you are correct that each file is different, but I aspire to understand enough about the nuances of JSON files to have a feel for what types of files will resolve directly.

 

filename jsonl "/my/home/m1xxx00/json/example10/ten.jsonl";
filename json "/my/home//m1xxx00/json/example10/ten.json";
data _null_;
  infile jsonl end=eof;
  file json;
  if _n_=1 then put '[' @ ;
  else put ',' @ ;
  input;
  put _infile_;
  if eof then do;
     put ']';
     stop; /* otherwise I got an extra , at the end */
  end;
run;
         
libname in10 json '/my/home/m1xxx00/json/example10/ten.json'
  map='user10.map' automap=create ordinalcount=all;
libname xxx '/my/home/m1xxx00/json/example10';
proc copy in=in10 out=xxx;
run;

  /* Data sets ALLDATA, QUANTITY, ROOT, and VALUE in the library. */
proc datasets lib=xxx;
run;quit;  

  /* Simple merge of all data sets */
data simpledata10;
  merge xxx.root xxx.quantity xxx.value;
  by ordinal_root;
  /*   drop ordinal_:; */  /* Comment out to test */
run;
proc print data=simpledata10;
run;
