miller3b
Calcite | Level 5

Hi,

I am trying to fill in the blanks on a use case, and I'm facing some constraints:

 

> can't introduce any UDFs that aren't native to the Hortonworks Hadoop offering

> can't touch the permanent data persistence solution within Hadoop

 

We are currently storing some JSON objects in HBase, mapped through Hive so that the SAS connector can work directly with the data (reliant on HiveServer2). No problems with data access itself. However, the SAS limitations on Hive tables and PROC SQL mean that anything returned has to be below 32,767 characters (the SAS character-variable limit), which is not acceptable for this solution. The SAS JSON reader can handle the data but will need to consume it from a file. The SAS software sits outside of the cluster and not directly on any nodes.

 

The code I have been using is pasted below. It executes fine and, as far as the query goes, creates a comma-separated file to work with. However, the reader engine I am using needs the entire file to conform to the JSON standard, which means adding an opening/closing [ ] to make it parse as an array type (SAS JSON reader, 9.4.4).

 

Is there a way to specify adding these characters in the initial write? If writing the file to HDFS, how would I go about appending (or rather copy-merging) them in the process through the SAS Hadoop connector?
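
One possibility for getting the brackets in during the initial write, staying inside Hive's built-in functions (no new UDFs): collect_list() and concat_ws() are native Hive (collect_list since 0.13), so the query itself can assemble the entire array as a single string. This is only a sketch, reusing the connection placeholder and table names from the code further down; the caveat is that collect_list() funnels every row through a single reducer and holds the whole array in memory, so it may not scale to very large result sets.

proc sql;
   connect to hadoop (XXXXXX - connection string information);
   /* concat_ws joins the collected JSON rows with commas, and  */
   /* concat wraps the result in [ ] so the output file is one  */
   /* valid JSON array written as a single value                */
   execute (
      INSERT OVERWRITE DIRECTORY '/home/bmiller'
      SELECT concat('[', concat_ws(',', collect_list(r.JSON)), ']')
      FROM results r
      INNER JOIN hbaseresults hb ON r.confirmationnumber = hb.key
      WHERE clientcode = &clientcode
        AND artifacttype = 'CrfResult'
   ) by hadoop;
   disconnect from hadoop;
quit;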

 

If not, is there a good approach to bringing the file local and adding the characters as needed for the reader? (One possible sketch follows the code below.)

Current:

{JSON Data},

{JSON Data}

Desired (not sensitive to line feeds from reader perspective):

[{JSON Data},

{JSON Data}]

 

/* Pass-through to Hive: write the joined result out to an HDFS directory */
proc sql;
   connect to hadoop (XXXXXX - connection string information);
   execute (
      INSERT OVERWRITE DIRECTORY '/home/bmiller'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      SELECT JSON
      FROM results r
      INNER JOIN hbaseresults hb ON r.confirmationnumber = hb.key
      WHERE clientcode = &clientcode
        AND artifacttype = 'CrfResult'
   ) by hadoop;
   disconnect from hadoop;
quit;
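
On the bring-it-local question, here is a sketch of one approach, assuming the FILENAME HADOOP access method is available from the SAS machine; the config file path, credentials, and local paths are placeholders to substitute. The CONCAT option treats the HDFS directory as one logical file, and the byte-stream DATA step (RECFM=N) sidesteps the 32,767-byte record limit that a line-based read of _INFILE_ would hit on long JSON values.

/* Read every part file in the HDFS output directory as one stream. */
/* cfg=, user=, and pass= values are placeholders.                  */
filename hdfsout hadoop '/home/bmiller'
   cfg='/local/path/hadoop-config.xml'
   user='myuser' pass='XXXX'
   concat;

/* Local file for the SAS JSON engine to read */
filename jsonloc '/local/path/results.json';

/* Byte-for-byte copy, adding the opening and closing brackets */
data _null_;
   infile hdfsout recfm=n eof=done;
   file jsonloc recfm=n;
   if _n_ = 1 then put '[';
   input byte $char1.;
   put byte $char1.;
   return;
done:
   put ']';
   stop;
run;

/* Point the JSON libname engine at the repaired file */
libname j json '/local/path/results.json';

If FILENAME HADOOP isn't an option, PROC HADOOP's HDFS COPYTOLOCAL statement can pull the part files down first, and the same DATA step can do the bracketing against the local copies.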

 

Any help or suggestions would be greatly appreciated.

 

Thanks,

 

Ben

4 REPLIES
JBailey
Barite | Level 11

Hi @miller3b

 

I have no experience with Hive mapped onto HBase. The 32k issue is a known problem with Hive. This GitHub repo may help you.

 

https://github.com/Jeff-Bailey/SGF2016_SAS3880_Insiders_Guide_Hadoop_HOW

 

Take a look at the chapter on the 32k String Thing; it's Exercise 5.
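
For anyone landing here later: as I understand it, the SAS-side knob in that territory is the DBMAX_TEXT= option on the Hadoop LIBNAME statement. A minimal sketch, assuming a HiveServer2 connection (server, port, user, and password are placeholders):

/* DBMAX_TEXT= sets how many bytes SAS reserves for Hive STRING  */
/* columns (default 1024). It caps at 32767, the SAS character-  */
/* variable limit, which is why anything longer has to travel by */
/* file rather than through the connector.                       */
libname hdp hadoop
   server='hiveserver2.example.com'
   port=10000
   user=myuser password='XXXX'
   dbmax_text=32767;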

 

Best wishes,

Jeff

miller3b
Calcite | Level 5

Excellent. Thank you Jeff. I'm sure there are some other tricks in here that I need as well.

JBailey
Barite | Level 11

Hi @miller3b

 

You are quite welcome! 

 

Best wishes,

Jeff

AnnaBrown
Community Manager

Hi miller3b,

 

I'm glad you found some useful info! If one of the replies was the exact solution to your problem, can you "Accept it as a solution"? Or if one was particularly helpful, feel free to "Like" it. This will help other community members who may run into the same issue know what worked.

Thanks!
Anna

