brucegilsen
Obsidian | Level 7

Hi, I'm working on a conference paper on copying data between SAS and JSON. For reading JSON files into SAS, I started with the examples in the LIBNAME Statement: JSON Engine documentation and I'm using SAS 9.4TS1M4.

 

What I've seen about reading JSON files into SAS generally involves looking at the data and then writing code specific to the data layout. I wonder if there's a way to automate the process.

 

For example, the section on Merging Data Sets shows how to read JSON code into multiple SAS data sets and then use ordered values (variables beginning with ORDINAL_) to merge the data sets into a single data set. Using that JSON file to save space here, could I automate that? Maybe after the PROC DATASETS in the documentation, I could do something like the following:

1. Read the dictionary.columns table for the library containing the component data sets and subset to obtain all variables that start with ORDINAL_.

proc sql noprint;
  create table ordinals as
  select memname, name
    from dictionary.columns
    where libname="IN"
    and upcase(name) like 'ORDINAL_%';   
quit;

Resulting data set:

Obs    memname          name
 1     INFO         ordinal_root
 2     INFO         ordinal_info
 3     INFO_ADD     ordinal_info
 4     INFO_ADD     ordinal_add 
 5     INFO_NAME    ordinal_info
 6     INFO_NAME    ordinal_name
 7     ROOT         ordinal_root

 

2. Remove observations for ORDINAL_ values that only occur once since those are at the bottom of the hierarchy.

proc sort data=ordinals nouniquekey;
  by name;
run;

Resulting data set:

Obs    memname          name
 1     INFO         ordinal_info
 2     INFO_ADD     ordinal_info
 3     INFO_NAME    ordinal_info
 4     INFO         ordinal_root
 5     ROOT         ordinal_root

 

3. Construct the merge by working down the levels from ORDINAL_ROOT.
a. Merge 1 is for data sets with ORDINAL_ROOT (INFO and ROOT).
b. Since INFO from Merge 1 also has ORDINAL_INFO, Merge 2 is the data set created in Merge 1 with INFO_ADD and INFO_NAME (they both have ORDINAL_INFO).

c. Additional levels could work the same way.
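To make that concrete, here's a rough sketch of what the generated merges might look like for the documentation example (assuming the component data sets are read through the JSON libref IN, as in step 1 above; untested):

data merge1;
  /* Merge 1: data sets that share ORDINAL_ROOT (ROOT and INFO) */
  merge in.root in.info;
  by ordinal_root;
run;

data merge2;
  /* Merge 2: Merge 1 plus the data sets that share ORDINAL_INFO */
  merge merge1 in.info_add in.info_name;
  by ordinal_info;
  drop ordinal_:;
run;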

 

That worked with a few JSON files, but then I got to a few fairly simple cases that didn't. For example, consider the following JSON file (simplified to remove additional blocks like this one).

[
  {
    "country": "usa",
    "city": "chicago",
    "income": {
              "salary": {
                         "salary_h1": 35,
                         "salary_h2": 45
                        },
              "stocks": {
                         "stocks_h1": 8,
                         "stocks_h2": 12
                        }
     },
    "tax": {   
              "incometax": {
                             "incometax_estimated": 3,
                             "incometax_withheld": 7
                           },
              "salestax": 5,
              "cartax": {
                             "cartax_h1": 2,
                             "cartax_h2": 2
                           }
     }
  }
]

 

Using PROC SQL and PROC SORT as above, this generated the following result, where there's no connection between ORDINAL_INCOME and the rest of the tree.

Obs    memname               name
 1     INCOME_SALARY    ordinal_income
 2     INCOME_STOCKS    ordinal_income
 3     ROOT             ordinal_root  
 4     TAX              ordinal_root  
 5     TAX              ordinal_tax   
 6     TAX_CARTAX       ordinal_tax   
 7     TAX_INCOMETAX    ordinal_tax  

 

So I hit a dead end with this effort, unless there's a way to identify some subset of JSON files that'll work this way.

Has anyone done any research in this area and come up with anything interesting?

 

Thanks,
Bruce

8 REPLIES
Tom
Super User

INCOME has no data, so no dataset is made.  TAX has this data

"salestax": 5

You might want to just use the table ALLDATA.
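It has one row for each name/value pair, with the path pieces in P1, P2, ... and the value in the VALUE variable, so a quick look is just this (assuming your JSON libref is IN):

  /* Browse the engine's ALLDATA table: one row per JSON name/value pair */
proc print data=in.alldata;
run;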

brucegilsen
Obsidian | Level 7

Tom:

 

Thanks.  Revealing my ignorance about the concepts of JSON, I'm confused. If I do the 2 merges as described above, I get this.

                                     incometax_   incometax_   cartax_   cartax_
Obs   country    city     salestax    estimated    withheld       h1        h2
 1      usa     chicago       5           3            7          2         2   
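For reference, the two merges follow the same pattern as the sketch in the first post (assuming the data sets were first copied to a Base library XXX):

data merge1;
  merge xxx.root xxx.tax;                           /* share ORDINAL_ROOT */
  by ordinal_root;
run;

data merge2;
  merge merge1 xxx.tax_incometax xxx.tax_cartax;    /* share ORDINAL_TAX */
  by ordinal_tax;
  drop ordinal_:;
run;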

I see in the JSON file that INCOME has no data directly underneath it, as you noted. But SALARY and STOCKS, which do have values, are nested below INCOME. Conceptually, INCOMETAX and CARTAX seem analogous to SALARY and STOCKS, yet I can reach the values nested below TAX (because the key/value pair "salestax": 5 sits directly below TAX) while I can't reach the ones below INCOME, which has no data directly below it. That's what I'm having trouble wrapping my head around.

 

Thanks,
Bruce

Tom
Super User

It looks to me like INCOME is not generated, but if it were it would have just the variables ORDINAL_ROOT and ORDINAL_INCOME. For your simple case it looks like you could generate it from the ROOT table by just setting ORDINAL_INCOME to match ORDINAL_ROOT. But that might not work with more complex data.
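Something along these lines, perhaps (assuming the tables were copied to a Base library XXX; untested):

data xxx.income;
  /* fabricate the missing INCOME table from ROOT by equating the two ordinals */
  set xxx.root;
  ordinal_income = ordinal_root;
  keep ordinal_root ordinal_income;
run;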

Tom
Super User

Add ORDINALCOUNT=ALL to the LIBNAME statement.
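For example (the path and libref here are placeholders):

  /* ask the JSON engine to generate every possible ordinal variable */
libname in json '/some/path/file.json' ordinalcount=all;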

ORDINALCOUNT= ALL | NONE | n

specifies the maximum number of ordinal variables to generate for each data set. The maximum number of possible ordinal variables for a data set depends on the position of the data in the JSON, and can be less than the number that you specify. In that case, only the maximum number of possible ordinal variables is generated. ORDINALCOUNT can have one of these values:

ALL

creates all possible ordinal variables for the data set.

NONE

suppresses creation of ordinal variables for the data set.

n

is an integer that is greater than or equal to 0 (zero).

Default 2

 

Example results :

            ordinal_    ordinal_    ordinal_    salary_    salary_
     Obs      root       income      salary        h1         h2

       1        1           1           1          35         45
       2        3           2           2          35         45

 

brucegilsen
Obsidian | Level 7

Tom:

 

Hi, I added ORDINALCOUNT=ALL to the LIBNAME statement as suggested.  Thanks!!! 

 

Now, initial ORDINAL_ variables are as follows.  *** = not present without ORDINALCOUNT= ALL.

memname               name
INCOME_SALARY    ordinal_root    ***  
INCOME_SALARY    ordinal_income   
INCOME_SALARY    ordinal_salary   
INCOME_STOCKS    ordinal_root    *** 
INCOME_STOCKS    ordinal_income   
INCOME_STOCKS    ordinal_stocks   
ROOT             ordinal_root     
TAX              ordinal_root     
TAX              ordinal_tax      
TAX_CARTAX       ordinal_root    ***
TAX_CARTAX       ordinal_tax      
TAX_CARTAX       ordinal_cartax   
TAX_INCOMETAX    ordinal_root    *** 
TAX_INCOMETAX    ordinal_tax      
TAX_INCOMETAX    ordinal_incometax

And after getting rid of unique ordinals.    *** = not present without ORDINALCOUNT= ALL.

memname               name
INCOME_SALARY    ordinal_income
INCOME_STOCKS    ordinal_income
INCOME_SALARY    ordinal_root    ***
INCOME_STOCKS    ordinal_root    ***
ROOT             ordinal_root  
TAX              ordinal_root  
TAX_CARTAX       ordinal_root    ***
TAX_INCOMETAX    ordinal_root    ***
TAX              ordinal_tax   
TAX_CARTAX       ordinal_tax   
TAX_INCOMETAX    ordinal_tax   

 

Now the merge can work from this info. I haven't coded it, but I think for the first merge I want to merge by ORDINAL_ROOT all data sets whose MEMNAME either has no underscore, or whose MEMNAME up to the first (or only) underscore doesn't equal another entire MEMNAME. That's probably clearest by example. The first merge is of ROOT, INCOME_SALARY, INCOME_STOCKS, and TAX, but not TAX_CARTAX or TAX_INCOMETAX, as shown below (a sketch of the selection logic follows the table).

memname       name          Include in first merge?
INCOME_SALARY ordinal_root  Yes because there's no MEMNAME of INCOME
INCOME_STOCKS ordinal_root  Yes because there's no MEMNAME of INCOME
ROOT          ordinal_root  Yes because it's root
TAX           ordinal_root  Yes because there's no underscore
TAX_CARTAX    ordinal_root  No because there's a MEMNAME of TAX
TAX_INCOMETAX ordinal_root  No because there's a MEMNAME of TAX
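A rough sketch of that selection rule, starting from the ORDINALS table built in step 1 of my first post (untested):

proc sql noprint;
  create table firstmerge as
  select distinct memname
    from ordinals
    where upcase(name)='ORDINAL_ROOT'
      and ( index(memname,'_') = 0                     /* no underscore: ROOT, TAX        */
            or scan(memname,1,'_') not in              /* prefix is not itself a data set */
               (select distinct memname from ordinals) );
quit;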

 

So the merges are as follows.

data finaldata6;
  /* Merge 1: data sets that share ORDINAL_ROOT */
  merge xxx.root xxx.tax xxx.income_salary xxx.income_stocks;
  by ordinal_root;
run;

data finaldata6;
  /* Merge 2: add the data sets that share ORDINAL_TAX */
  merge finaldata6 xxx.tax_incometax xxx.tax_cartax;
  by ordinal_tax;
  drop ordinal_:;
run;

And now I get all the values in the final data set!

Obs  country  city     salestax  salary_h1  salary_h2  stocks_h1  stocks_h2
 1   usa      chicago      5         35         45          8         12

Obs  incometax_estimated  incometax_withheld  cartax_h1  cartax_h2
 1             3                    7              2          2

I have no idea how much this can be generalized, but will think about this next week.  Any thoughts you have are welcome.  Thanks so much for moving me this far forward today!

 

Bruce

 

 

 

brucegilsen
Obsidian | Level 7

Tom:

Hi, I'm back at this after the holidays and some other work. With a fresh deep dive, I realized something significant.

Returning to the JSON file I included in the initial post two weeks ago, I ran the following code (with ORDINALCOUNT=ALL as you suggested):

libname in5 json '/my/home/m1xxx00/json/fivefive.json'
  map='user5.map' automap=create ordinalcount=all;
libname xxx '/my/home/m1xxx00/json/example6';
proc copy in=in5 out=xxx;
run;

The library referenced by XXX had the following data sets (ignoring ALLDATA):
income_salary, income_stocks, root, tax_cartax, tax_incometax, tax

All 6 data sets had the variable ORDINAL_ROOT and now a simple merge of all these data sets generated the correct data. None of the steps I described previously were needed.

data simpledata6;
  merge xxx.root xxx.income_salary xxx.income_stocks xxx.tax xxx.tax_incometax xxx.tax_cartax;
  by ordinal_root;
  drop ordinal_:;
run;

Data set SIMPLEDATA6 had all the values we wanted:

Obs  country  city     salary_h1  salary_h2  stocks_h1  stocks_h2  salestax
 1   usa      chicago      35         45          8         12         5

Obs  incometax_estimated  incometax_withheld  cartax_h1  cartax_h2
 1             3                    7              2          2

I then tried a slightly different JSON file with blocks of records like the following, and the same thing was true: they all had ORDINAL_ROOT, with appropriate values for each block, and the simple merge worked.

[
  {
    "country": "usa",
    "city": "chicago",
    "income": {
              "salary": {
                         "salary_h1": 35,
                         "salary_h2": 45
                        },
              "bonus": 10,
              "stocks": {
                         "stocks_h1": 8,
                         "stocks_h2": 12
                        }
     },
    "tax": {   
              "incometax": {
                             "incometax_estimated": 3,
                             "incometax_withheld": 7
                           },
              "salestax": 5,
              "cartax": {
                             "cartax_h1": 2,
                             "cartax_h2": 2
                           }
     },

 ...... (additional blocks like these).......
] 

So, I'm wondering about a few things.
1. Why did the documentation use the indirect method that it did?
2. How generalized is this solution?
I'm not sure how much time you have for this but it seems to me that a simplified way to read JSON files would be useful to many people.
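In case it's useful, here is a rough, untested sketch of how the simple merge might be generated automatically from dictionary metadata, assuming ORDINALCOUNT=ALL is in effect (so every data set has ORDINAL_ROOT) and the data sets have been copied to library XXX:

proc sql noprint;
  /* list every data set that contains ORDINAL_ROOT, skipping ALLDATA */
  select cats('xxx.', memname) into :mergelist separated by ' '
    from dictionary.columns
    where libname='XXX'
      and upcase(name)='ORDINAL_ROOT'
      and upcase(memname) ne 'ALLDATA';
quit;

data simpledata;
  merge &mergelist;     /* e.g. xxx.root xxx.income_salary ... xxx.tax_cartax */
  by ordinal_root;
  drop ordinal_:;
run;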

Thanks,
Bruce

Tom
Super User

If just merging all of the tables at once on a single KEY variable worked for your data then you are very lucky.

Someone else just posted today a question on JSON where that would not work.

Each JSON file is different.

That is the beauty and frustration of the format.

brucegilsen
Obsidian | Level 7

Hi Tom, I checked out that other thread, at https://communities.sas.com/t5/SAS-Programming/SAS-EG-how-to-read-in-JSON-file/m-p/710217, and after using your code (with a minor change to the IF EOF code) to convert his JSONL code to JSON, I was able to use the same code as above to do a simple merge of all 3 data sets and get the desired result.

 

I know you are correct that each file is different, but I aspire to understand enough about the nuances of JSON files to have a feel for what types of files will resolve directly.

 

filename jsonl "/my/home/m1xxx00/json/example10/ten.jsonl";
filename json "/my/home//m1xxx00/json/example10/ten.json";
data _null_;
  infile jsonl end=eof;
  file json;
  if _n_=1 then put '[' @ ;
  else put ',' @ ;
  input;
  put _infile_;
  if eof then do;
     put ']';
     stop; /* otherwise I got an extra , at the end */
  end;
run;
         
libname in10 json '/my/home/m1xxx00/json/example10/ten.json'
  map='user10.map' automap=create ordinalcount=all;
libname xxx '/my/home/m1xxx00/json/example10';
proc copy in=in10 out=xxx;
run;

  /* Data sets ALLDATA, QUANTITY, ROOT, and VALUE in the library. */
proc datasets lib=xxx;
run;quit;  

  /* Simple merge of all data sets */
data simpledata10;
  merge xxx.root xxx.quantity xxx.value;
  by ordinal_root;
  /*   drop ordinal_:; */  /* Comment out to test */
run;
proc print data=simpledata10;
run;
