BookmarkSubscribeRSS Feed
Assaf_Attas
Obsidian | Level 7

Hey group members,

Recently I came across a new challenge.

I needed to analyze data which is stored as a JSON in a single DB column.

What the heck?!

How can one parse the required data from the huge JSON?!

In my research for a solution to this challenge I found PROC JSON which was quite cute for handling a single JSON (like in HTTP responses)  but in my case I need to extract hundreds of fields from a table that contains millions of JSON’s.

 

Further in my research, I’ve found that almost any language supplies functions to parse JSON data.

To my great astonishment and disappointment SAS does not!

Is this real? SAS? The language that has a function for everything?

 

Well.. that seems to be the case.

so, since I couldn’t wait for the guys in north Carolina I’ve decided to write one down myself.

 

This is my first draft

I’ll need to add error-handling, optimization and all but its working.

Please feel free to improve & share

 

Assaf attas.  

 

%let j_json='{
   "quiz":{
      "sport":{
         "q1":{
            "question":"Which one is correct team name in NBA?",
            "options":[
               "New York Bulls",
               "Los Angeles Kings",
               "Golden State Warriros",
               "Huston Rocket"
            ],
            "answer":"Huston Rocket"
         }
      },
      "maths":{
         "q1":{
            "question":"5 + 7 = ?",
            "options":[
               "10",
               "11",
               "12",
               "13"
            ],
            "answer":"12"
         },
         "q2":{
            "question":"12 - 8 = ?",
            "options":[
               "1",
               "2",
               "3",
               "4"
            ],
            "answer":"4"
         }
      }
   }
}
'
;

options cmplib=work.funcs;
proc fcmp outlib=work.funcs.json;

	function json_value(jf_jstr $,jf_value $) $ 256;
		
		/* Function Input */
		/*-------------------------------------------------------*/
		length 	j_jstr  $ 32767 j_value j_return $ 256;
		j_jstr 		=compbl(jf_jstr);
		j_value 	=compress(jf_value);
		
			
		/* Declarations & Calc */
		/*-------------------------------------------------------*/
		length 	thisObj $ 32 j_object  $ 32767;
		
		levels 		=countw(j_value,'.');
		j_object 	=j_jstr;
		
		/* LOOP Object levels */
		/*-------------------------------------------------------*/
		do _j=2 to levels;
	
			thisObj 	=cats('"',scan(j_value,_j,'.'),'"');
			
			/* For Objects */
			/*-------------------------------------------------------*/
			if _j < levels then do;
			
				/* Start parser */
				/*---------------------------------------*/
				spos=index(j_object,trim(thisObj));
				spos=find(j_object,'{',spos+1 );
				epos=spos;
				x=1;
	
				/* LOOP parser */
				/*---------------------------------------*/
				do until (x=0 or max_iter=100);
					epos=findc(j_object, '{}',epos+1);
					v=substr(j_object,epos,1);
					if v='{' then x=x+1;
						else if v='}' then x=x-1;
						
					max_iter=sum(y,1); /* Just in case... :) */
				end;
				
				/* RETURN parser */
				/*---------------------------------------*/
				j_object=substr(j_object,spos,(epos-spos+1));
				
				/* CLEAR parser */
				/*---------------------------------------*/
				spos=0;
				epos=0;
				x=0;
			
			end;
			
			/* For Key:Value Pair */
			/*-------------------------------------------------------*/
			else do;
				pattern_rgx='/' || '(' || trim(thisObj) || ')(\s*:\s*)((\[([^\]]+)\])|(\"([^\"]+)\")|(\w*))' || '/i';
				pattern_ID =prxparse(pattern_rgx);
				call prxsubstr(pattern_ID, j_object, j_pos, j_len);
				j_return=scan(substr(j_object, j_pos, j_len), 2, ':');
			end;
	
		end;
	
	return(j_return);
	endfunc;

run;

data sample;

	length 	j_jstr  $ 32767;
	j_jstr 		=compbl(&j_json);
	
	q1_question= json_value(j_jstr,'$.quiz.maths.q1.question');
	
run;

 

12 REPLIES 12
Tom
Super User Tom
Super User

To handle embedded quotes shouldn't this:

	thisObj 	=cats('"',scan(j_value,_j,'.'),'"');

Be

	thisObj 	=quote(scan(j_value,_j,'.'));

Or if you want the "escape" embedded quotes instead. 

thisObj 	=cats('"',tranwrd(scan(j_value,_j,'.'),'"','\"'),'"');
Assaf_Attas
Obsidian | Level 7
yes, both works but yours is prettier 🙂
Tom
Super User Tom
Super User

QUOTE() function handles embedded quotes.  At least it handles them the way SAS would. I am not sure what JSON files do with embedded quotes.

Assaf_Attas
Obsidian | Level 7

if a quote() is to be used then it needs to be with a trim:

thisObj 	=quote(trim(scan(j_value,_j,'.')));
Tom
Super User Tom
Super User

@Assaf_Attas wrote:

if a quote() is to be used then it needs to be with a trim:

thisObj 	=quote(trim(scan(j_value,_j,'.')));

There is usually no need to TRIM() the results of SCAN() function.  If you had stored the string into a variable first then the TRIM() would be needed because the fixed length nature of character variables will add spaces to pad the string to the full length of the variable.

Assaf_Attas
Obsidian | Level 7

bottom line.. try for yourself

/* Works */
thisObj 	=cats('"',scan(j_value,_j,'.'),'"');

/* Works */
thisObj 	=quote(trim(scan(j_value,_j,'.')));

/* Doesnt Work */
thisObj 	=quote(scan(j_value,_j,'.'));
Tom
Super User Tom
Super User

Sorry I was looking at the string literal in the function call. That does not include any trailing spaces to worry about.

json_value(j_jstr,'$.quiz.maths.q1.question')

But inside the function the value supplied is being assign to character variable, so trailing spaces will be added.

Tom
Super User Tom
Super User

This link has details of what is a valid KEY in JSON text.

https://stackoverflow.com/questions/8676011/which-characters-are-valid-invalid-in-a-json-key-name

 

So you only need to escape embedded double quotes or backslash.  But it might be easiest to leave the just the CATS() function call and assume the caller has properly added the backslash escapes where they are needed.

 

But what about allowing quotes in the key list?

Do these two value for the second argument have the same meaning?

$.id.name
$."id"."name"

If so you might want to use the Q modifier on the SCAN() function call.  This will also let you handle keys with embedded periods.  Note however that embedded quotes that have "escaped" with a backslash will cause trouble for the SCAN() function.  So if you want to support embedded quotes or periods in key names then you might need to either build your own parser to replace SCAN() function or have the input key list use SAS syntax and add your own logic to both remover outer quotes and insert required escape characters.

Assaf_Attas
Obsidian | Level 7

hi tom,

 

this is relevant only for the function call and not to the json itself 

$.id.name

object in JSON are always quoted.

the key (referenced by the last item in call function) is searched using a perl regex so the code is capable to handle numerous situations.

 

you can consider this as a simplified version of T-SQL Jason_Value function:

https://docs.microsoft.com/en-us/sql/t-sql/functions/json-value-transact-sql?view=sql-server-ver15

 

my function does not yet support json arrays

and i havent yet tested it for performance. 

Tom
Super User Tom
Super User

Can you explain how this works?  What is the syntax for the function call?  It looks like the first argument is just the JSON text itself.  And the second argument is what exactly?  If this is following a pattern used in other functions from other languages can you provide a link to that?

 

I assume you are trying to provide a hierarchy of keys in the JSON text and want to return the first value for that key.  What happens if they keys repeat?  Do you just want the first value found? All of them? Do the other languages provide syntax for handling that?

 

For example could you spell out what $.quiz.maths.q1.question means and how exactly does your implementation locate that key? 

 

Does this implementation require a certain pattern to how the JSON text is constructed to work?

EyalGonen
Lapis Lazuli | Level 10

שלום @Assaf_Attas 

 

אני מניח שאני כותב את הברור מאליו אבל זה שיש לך מאות שדות או קבצי JSON לא אמור להפריע בשימוש ב PROC JSON שכן אפשר להפעיל אותו בלולאה על כל הקבצים/שדות דרך שפת Macro של SAS.

בכל מקרה, רעיון נוסף שרציתי להציע לך הוא שימוש בשפת Groovy בתוך SAS על מנת לנתח את קובץ ה JSON ראה לדוגמה https://communities.sas.com/t5/SAS-Procedures/Parse-json-file-with-Proc-Groovy/td-p/187484 

אפשרויות נוספות כוללות שליחת SQL בדחיפה ל DBMS ושימוש בפונקציות המובנות שלו לניתוח ה JSON ומשיכת התוצאות בחזרה לאחר הניתוח ל SAS. הכוונה כאן היא לשימוש ב SQL Pass Through.

 

בהצלחה,

אייל

Assaf_Attas
Obsidian | Level 7

הי אייל, תודה ומה שלומך.

שים לב שלא דיברתי על מאות שדות.. עם זה אין בעיה.

הבעיה היא מה עושים כשהמדובר במליוני רשומות?

 

נראה לי פחות פרקטי לקחת כל רשומה.. לכתוב אותה לקובץ JSON חיצוני.. ואז לקרוא אותו בחזרה באמצעות json libname

SQL Pass Through זה אכן פתרון.. כמו גם לבנות VIEW בבסיס הנתונים המקורי.. כאן יש רק בעיית בירוקרטיה מול ה- DBA 🙂

 

עם groovy עדיין לא יצא לי לעבוד.

ראיתי שגם ב- DS2 יש אוסף נאה של פונקציות JSON אבל גם שם אני פחות שוחה 🙂

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Discussion stats
  • 12 replies
  • 1844 views
  • 1 like
  • 3 in conversation