Solved: Is it possible to register SPDE data stored on HDFS in metadata?

moorsd · Posted 03-03-2016 06:26 AM

Is it possible to register SPDE tables stored on HDFS in metadata, using SAS Management Console or code? I can't seem to locate any documentation that suggests this can be done. If it can be done, could someone advise how this can be achieved. Thanks David

SteveSober · Posted 03-04-2016 06:21 PM

Hi moorsd,

I’m glad this was helpful. You are correct you must be at SAS 9.4 M3 (maintenance 3) to leverage the Hive Serde for SPD Engine data.

No I’m not doing a workshop but my colleague Jeff Bailey is, and I see you’ve already chimed in with some feedback on what you’d like to see in the workshop. I am presenting a paper and it would be great if it works for you to attend it or to stop by to say hello:

Session ID: SAS2180
Session Title: How to Leverage the Hadoop Distributed File System as a Storage Format for the SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine
Session Type: Breakout
Day Scheduled: Tuesday, 4/19/16
Start Time/End Time: 15:00/ 15:20
Location: Bellini 2005

View solution in original post

JuanS_OCS · Posted 03-03-2016 06:46 AM

I know it is possible to register the tables once registered/loaded into LASR, but this is not what you are asking for.

Good question, I don't know and I feel interested on how to work with this option too.

LinusH · Posted 03-03-2016 02:12 PM

I've checked in my environment, and there are no signs of any hdfs adoption in the new library wizard for SPDE. And the same goes for SPD Server. Reservation: this is in 9.4 M2. But SMC usually doesn't get updated much between major versions.

So my guess is that you need to go the not so preferred way of pre-asigning the library (by external configuration).

Data never sleeps

SteveSober · Posted 03-04-2016 02:23 PM

Timely question David. I am working on my next article which will be posted to the Data Management community. In this article I which will provided details on how to register SPD Engine data to the Hive metastore. In the meantime here is some high level information on this topic.

Accessing SPD Engine Data Using Hive; SAS provides a custom Hive SerDe for SPD Engine data that is stored in HDFS. The SerDe makes the data available for applications outside of SAS to query using HiveQL.

Requirements:

You must deploy SAS Foundation using the SAS Deployment Wizard. Select SAS Hive SerDe for SPDE Data.
You must be running a supported Hadoop distribution that includes Hive 0.13 (for example, Cloudera CDH 5.2, Hortonworks HDP 2.1 or later, or MapR 4.0.2 or later).
The data stored in HDFS must have been created using the SPD Engine.
The SerDe is delivered as two JAR files that must be deployed to all nodes in the Hadoop cluster.

The data set must be registered in the Hive metastore. SAS supplies a registration utility to the data set. You cannot use any other method to register data sets.

For more information about the Hive SerDe for the SPD Engine, see SAS 9.4 SPD Engine: Storing Data in the Hadoop Distributed File System at http://support.sas.com/documentation/cdl/en/engspdehdfsug/67948/HTML/default/viewer.htm#n02qaxsa4wu1...

moorsd · Posted 03-04-2016 03:58 PM

Hi Steve,

Many thanks for the information. It was very informative. But from what I've read around this topic you need to be running on SAS 9.4 M3 to have the access the SAS Hive SerDe for SPDE.

The client I'm working at has SAS 9.4 M2 and Hive SerDe is not available for that release. So I guess we're stuck until they upgrade to SAS 9.4 M3 or some other version of SAS that may be released.

Are you running a SAS & Hadoop workshop at this years Global Forum. If so, I'll pop along and say hello.

thanks again,

David

SteveSober · Posted 03-04-2016 06:21 PM

Hi moorsd,

I’m glad this was helpful. You are correct you must be at SAS 9.4 M3 (maintenance 3) to leverage the Hive Serde for SPD Engine data.

No I’m not doing a workshop but my colleague Jeff Bailey is, and I see you’ve already chimed in with some feedback on what you’d like to see in the workshop. I am presenting a paper and it would be great if it works for you to attend it or to stop by to say hello:

Session ID: SAS2180
Session Title: How to Leverage the Hadoop Distributed File System as a Storage Format for the SAS® Scalable Performance Data Server and SAS® Scalable Performance Data Engine
Session Type: Breakout
Day Scheduled: Tuesday, 4/19/16
Start Time/End Time: 15:00/ 15:20
Location: Bellini 2005

LinusH · Posted 03-05-2016 03:14 AM

Exiting times, it sure is.

@SteveSober: are you saying that as of today (M3) you can't register a "plain" SPDE hdfs library in SMC?

The work-around would be to configure it as a Hive data store, and the register that in SMC as a Hive library?

Perhaps you will bring this up in your presentation...

Perhaps a bit off-topic (but not much ): as an architect it's quite difficult to compare different scenarios with new engine possibilities. Especially since it seems be not so many real world implementations learn from. Where can I find information that compare SPDE/SPDS hdfs, SPDE/SPDS Hive and pure Hive implementations? What are the pros and cons? Lessons learned etc.

br

Data never sleeps

SteveSober · Posted 03-08-2016 01:11 PM

Thanks for your comments and questions LinusH. To summarize what I am saying:

To utilize the Hive Serde for SPD Engine Data you must be at SAS 9.4 M3.
You cannot register the SPD Engine data to the Hive metastore using the SAS Management Console. This is done using a command that is issued on the Hadoop head node; more details to follow in my next article.
You can register an SPD Engine LIBREF to hdfs using SAS Management Console. Use this libref when you create, update, append or modify the SPD Engine data. Use this libref when you use PROC DS2 via SAS In-Database Code Accelerator for Hadoop and PROC TRANSPOSE. Note: SQL processing of SPD Engine data store on HDFS is done on the SAS Server.
You can register a Hive LIBREF using SAS Management Console. Use this libref whenever you want to process SQL against the SPD Engine data registered to the Hive metastore. I also suggest using this libref when you use PROC FREQ, MEANS, REPORT, and SUMMARY as well as when using DS2 MATCH MERGE and or FedSQL on the SET statement of your DS2 thread that runs code via MapReduce. The reason for this is MATH MERGE and FedSQL push SQL to run via MapReduce.

In the ideal world the information you requested (a bit off-topic) would be readily available but the reality is these kinds of implementation comparisons are proprietary among our customer base. In addition it is very hard to give general guidelines because every environment is unique. One site may have 100 cores in their Hadoop cluster while others sites have over 1,000 cores. One site might utilize TEZ or SPARK as the method HiveQL is processed. Another site may not use TEZ or SPARK. As you can see the results of any benchmark would vary tremendously depending on the core count of your Hadoop cluster and the methodology used to processing HiveQL.

moorsd · Posted 03-16-2016 10:39 AM

Hi Steve,

As the SPDE SerDe is just two Jar files, can these be deployed manually to all the nodes in the Hadoop cluster, so this works with SAS 9.4 M2 Then in a Hive query just add something like:

exec (ADD JAR /home/hadoop/SerDes/name_of_SAS_Serde.jar).

Is it that you just need the SAS 9.4 M3 depot to get the SerDe from SAS, as we already have a SAS 9.4 M3 depot. Or is there something in the SAS 9.4 M3 SAS/ACCESS to Hadoop engine that is doing something different under the hood? Could you tell me more about the registration utility that SAS supplies.

The ability to see SPDE files in Hive is something that would be ideal for my UK client, however it's taken nearly 12 months to migrate 10 SAS environments to SAS 9.4 M2 so the appetite to upgrade to SAS 9.4 M3 isn't there at the minute..

Thanks

David

LinusH · Posted 03-16-2016 04:06 PM

Going to M3 is not a migration, it's an update. So if all the Hadoop stuff is important for your customer upgrading is recommended. Tweaking the installation with not support component versions isn't really a best practice.

Data never sleeps

SteveSober · Posted 03-17-2016 12:24 PM

Thanks @moorsd and @LinusH for your questions and comments. If you were to tweak the SAS 9.4 M2 environment to try and leverage the HiveSerDe for SPD Engine Data SAS Technical Support would not be able to help you with any issues. It is best to wait until you are at SAS 9.4 M3 before trying to leverage the SerDe.

moorsd · Posted 03-18-2016 08:13 AM

@LinusH

Thanks for the information.

Sadly as the customer has just spent a lot of time and money migrating from 9.2 to 9.4 M2 (which was the current distribution when the migration project started). The appitite to upgrade to M3 isn't there at the moment. I was just wondering we we could get the SPDE Hove SerDe working with 9.4 M2 as a stop gap. Yes, it would be a hack but Hive can use other SerDe's so why not the SAS 9.4 M3 one?

A SAS upgrade is being considered by the client for 2017. However with a new version of SAS due soon (50 years of SAS, so time for a big announcement?) it might not be worth the bother going to M3, and see what a new version of SAS may bring to the table, Hadoop wise. Spark integration (i.e. SAS/ACCESS to Spark) would be nice. DataLoader 2.4 has started the ball rolling with Spark so maybe something like that may happen?

Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

Re: Is it possible to register SPDE data stored on HDFS in metadata?

The 2025 SAS Hackathon has begun!