bdighera
Calcite | Level 5

Hello,

 

I am currently using a PROC HTTP request to GET a JSON payload from a server. The response is:

     1. Too long to be pulled in one request and must be pulled and concatenated page by page. 

     2. Contains two data sets (data, data_recorddata) that must be merged on their corresponding observations.

     3. Dynamic, in that I cannot anticipate which variables will come across the API, so I cannot set up a JSON map in advance.

 

My issue is that at some point, when I merge the data sets or concatenate the pages together, some of the data in the observations gets truncated. How do I prevent this? Can I set some parameter so that SAS does not look at length, or automatically uses the longest possible length? Unfortunately, I cannot anticipate exactly what will come through the API, so I need to be flexible, but I also cannot truncate. I have provided my current code below, which pulls from the API and merges and concatenates the JSON payload. Any help is greatly appreciated!

/*Extract the total number of pages from the JSON response and assemble a url for each page*/
data urls;
	set api.pagination;
	call symputx("nobs", sum(totalPages, -1));
	do I = 0 to totalPages - 1;	/* pages are numbered 0 to totalPages-1 */
		path = catt(&base_url., &study_name., '/', &endpoint., '?size=500&pages=', I, '&filter=formKey==', &crf.);
		output;
	end;
run;

/*Create a macro variable for the url path*/
data _NULL_;
	set urls nobs=obs;
	call symputx("url", path);
run;
		
		
/*Loop through urls based on the number of pages and collect API data for each page*/
/*(note: the %do/%end loop must be wrapped inside a macro definition to run)*/
%do I = 0 %to &nobs.;
	%let ping = &url.%str(&)page=&I.;
	%put &ping;

	proc http
		url="&ping."
		method="GET"
		out=respAll
		proxyhost="http://webproxy-azuse.vsp.sas.com"
		proxyport=3128
		headerout=hdrs;
		headers
			"x-imn-security-key"= &security_key.
			"x-api-key"= &api_key.;
	run;

	/*Assign http response to JSON library*/
	libname page&I. JSON fileref=respAll;

	/*Makes dataset in work library for the parsed data set*/
	data data_page&I.;
		set page&I..data;
	run;
				
			proc datasets lib=page&I.; quit;
			
			/*Makes dataset in work library for parsed column recorddata set*/
			data recorddata_page&I.;
				set page&I..data_recorddata;
			run;
			
			/*Merge the two datasets so patient info and CRF data are on the same page, merging by the ordinal index within the page*/
			data mergeData&I.;
				merge work.data_page&I. work.recorddata_page&I.;
				drop ordinal_recordData;
				by ordinal_data;
			run;
			
			proc transpose data=mergeData&I. out=middle&I.;
				var _all_;
			run;
			
			proc transpose data=middle&I. out=mergeDataFin&I. (drop=_name_ _label_);
			var _all_;
			run;
			
			data mergeDataFin&I.;
				set mergeDataFin&I.;
				if _N_ = 1 then delete;
			run;
	
			libname page&I. clear;
	
		%end;
	
		data out.&output.;
			set mergeDataFin: ;
			drop ordinal_root ordinal_data;
		run;

As seen below, the formId, recordId, and deleted columns are truncated in the finalized dataset.

 

[Screenshot Capture.PNG: finalized dataset showing truncated formId, recordId, and deleted values]

5 REPLIES
Tom
Super User

If you want the datasets created from the JSON text to be consistently defined, then create your own MAP and pass it to the JSON engine using the MAP= option.
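For instance, a minimal sketch of the MAP= route, reusing the respAll fileref from the original code (the map path and the idea of hand-widening the generated lengths are assumptions, not from the post):

/* First pass: let the JSON engine generate a map describing the payload */
filename jmap "%sysfunc(pathname(work))/response.map";

libname page0 JSON fileref=respAll map=jmap automap=create;
libname page0 clear;

/* Edit response.map so the character columns that truncate get a
   generous "CURRENT_LENGTH" (e.g. 200), then reuse the same map for
   every page so all pages parse with identical column definitions */
libname page0 JSON fileref=respAll map=jmap;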

Patrick
Opal | Level 21


If you can't know the possible universe of variables then how can you write downstream code that consumes this data and does something useful with it?

 

What you could do instead of a map:

1. Read the individual JSONs into individual SAS tables with a common root name.

2. Query dictionary.columns to determine all the column names with their maximum lengths across all the tables, and create a mapping table from it (or alternatively an ATTRIB statement for all the columns).

3. Concatenate the tables via a data step SET statement with the mapping table first in the list (see the sketch below) - or alternatively with the generated ATTRIB statement before the SET statement.
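A minimal sketch of step 3, using hypothetical column names and lengths (in practice the template would be generated from the step 2 query against dictionary.columns):

/* Zero-row template whose columns carry the maximum lengths found in
   step 2; the names and lengths below are placeholders */
data template;
	length formId recordId $200 deleted $8;
	stop;	/* define the columns but output no rows */
run;

/* Listing the template first in SET fixes the lengths in the PDV,
   so none of the page tables can truncate the concatenated result */
data all_pages;
	set template mergeDataFin: ;
run;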

bdighera
Calcite | Level 5

Hi Patrick,

 

Thank you for your response. I have built a preliminary mapping table using dictionary.columns and a PROC SQL statement, but now I am unsure how to modify/format this preliminary mapping table so it can be concatenated with the other tables.

 

My variables are in the name column. Should I transpose this? How would I assign the max_length and type? Thanks again for your help.

 

So far, I have built a general map using:

proc sql;
	create table length_report as
	select upcase(name) as NAME
	     , count(*) as n_datasets
	     , max(length) as max_length
	     , min(length) as min_length
	     , count(distinct length) as n_lengths
	     , min(type) as min_type
	     , max(type) as max_type
	     , count(distinct type) as n_types
	from dictionary.columns
	where upcase(LIBNAME) = "FIN"
	group by 1
	order by 1;
quit;

[Screenshot Capture.PNG: length_report output, with max_length of 75 for every variable]

 

Tom
Super User

It seems very strange to me that all of those variables would have a maximum length of 75.

But it does make writing the code easier. Just add the LENGTH statement before the SET statement in the data step that combines the individual datasets. Note you probably also need to remove any formats that might be attached (I suspect that the JSON engine is like the ORACLE and XLSX engines and stupidly attaches $xx. formats to character variables that do not need them).

data want;
  length SUBJECTID ......  $75 ;
  set fin: ;
  format _character_ ;
run;

You could type out all of the names, or, since you have them in a dataset, pull the list from there into a macro variable and use the macro variable.

The advantage of typing them (or copying and pasting them from a printout) is that you can put them into some logical order (such as having SUBJECTID first) so that the dataset is easier to review when you browse it.
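For the macro-variable route, a sketch assuming the length_report table from the earlier PROC SQL step (the WHERE clause restricting the list to character columns is my addition):

/* Build "NAME $len NAME $len ..." from the length report */
proc sql noprint;
	select catx(' ', name, cats('$', max_length))
		into :lenlist separated by ' '
		from length_report
		where min_type = 'char';	/* character variables only */
quit;

data want;
	length &lenlist. ;
	set fin: ;
	format _character_ ;	/* drop any attached $xx. formats */
run;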

Patrick
Opal | Level 21

You could do something along the lines of the sample code below.

proc sql;
  create table work.attribs as
  select 
    upcase(name) as _gname,
    max(varnum) as varnum,
    max(name) as name,
    max(length) as len,
    max(case when type='char' then '$' else ' ' end) as type,
    max(format) as format,
    max(informat) as informat,
    max(case when missing(label) then ' ' else cats("'",label,"'") end) as label
  from dictionary.columns
  where 
    libname='SASHELP'
    and memname like 'CLASS%'
  group by _gname
  order by varnum, _gname
  ;
quit;

filename codegen temp;
data _null_;
  file codegen;
  set work.attribs end=_last;
  if _n_=1 then put 'attrib';
  put name 'length=' type len @;
  if not missing(format)    then put 'format=' format @;
  if not missing(informat)  then put 'informat=' informat @;
  if not missing(label)     then put 'label=' label @;
  put;
  if _last then put ';';
run;

data want;
  %include codegen /source2;
  set sashelp.class:;
run;
filename codegen clear;

Ordering by varnum keeps the variable order in the PDV closer to the source data; ordering additionally by name covers the case where varnum is missing, as it can be for source data stored in a database.

 

 
