platonsan
Calcite | Level 5

Hi All

I am testing SAS/ACCESS for Hadoop (9.4 TS1M1) against the Cloudera Quickstart VM Image (CDH 4.4.0.1)

When I create a table, everything looks fine at first: the table is created in Hadoop and the SAS log claims that observations are written. But afterwards there is no data in it, only the metadata. (SAS log below)

In the 9.4 documentation "LIBNAME Statement Specifics for Hadoop" there is a note stating "If HDFS /tmp has enabled the sticky bit, the LOAD command can fail. To resolve this, see the SAS system requirements and configuration guide documents, which are available at http://support.sas.com.".

It is not unlikely that the sticky bit is on, as Cloudera recommends it; but I can't find anything on the matter at support.sas.com.

Does anybody know the solution to this problem?

regards
Steen

8    libname cdh hadoop server="192.168.198.131" subprotocol=hive2 user=cloudera password=XXXXXXXX;
NOTE: Libref CDH was successfully assigned as follows:
      Engine:        HADOOP
      Physical Name: jdbc:hive2://192.168.198.131:10000/default
9    data cdh.class;
10   set sashelp.class;
11   run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set CDH.CLASS has 19 observations and 5 variables.

12   data test; set cdh.class; run;
NOTE: There were 0 observations read from the data set CDH.CLASS.
NOTE: The data set WORK.TEST has 0 observations and 5 variables.


22 REPLIES
JBailey
Barite | Level 11

This problem is typically caused by a configuration issue, specifically with HDFS. Verify that HDFS is configured properly: it could be the HDFS port number, lack of write permissions, or the /tmp sticky bit issue.

This is covered starting on page 555 in the SAS/ACCESS doc.
http://support.sas.com/documentation/cdl/en/acreldb/66787/PDF/default/acreldb.pdf
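
For what it's worth, a quick round-trip test makes the silent failure easy to spot. This is only a sketch using the libref from the original post; the table name is illustrative:

   /* Load a small table, then immediately read the row count back.   */
   /* If the count comes back 0, the write into HDFS failed silently. */
   libname cdh hadoop server="192.168.198.131" port=10000
           subprotocol=hive2 user=cloudera password=XXXXXXXX;

   data cdh.class_test;
      set sashelp.class;   /* expect 19 observations */
   run;

   proc sql;
      select count(*) as rows_loaded
      from cdh.class_test;   /* 0 rows here points at the HDFS staging step */
   quit;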

Just curious, how did you configure Hive2 in your virtual machine?

platonsan
Calcite | Level 5

I have tried it, but unfortunately it didn't work.

As I see it, the SAS data is not loaded into Hadoop at all.

If I manually (using Impala) create a table and insert a value into it, SAS can read it:

  15   data x; set cdh.sastest;  run;

NOTE: There were 1 observations read from the data set CDH.SASTEST.

NOTE: The data set WORK.X has 1 observations and 1 variables.

What I have done on the Hadoop side, is to download the preconfigured Virtual Machine (VMware) from Cloudera (Cloudera Quickstart) which is running CDH 4.4.  The only thing I have configured is to start the HiveServer2 service.

Inspired by the list posted in other threads, I have copied the following files from the server (a sketch of how SAS is pointed at them follows the list):

guava-11.0.2.jar

hadoop-auth-2.0.0-cdh4.4.0.jar

hadoop-common-2.0.0-cdh4.4.0.jar

hadoop-core-2.0.0-mr1-cdh4.4.0.jar

hadoop-hdfs-2.0.0-cdh4.4.0.jar

hive-exec-0.10.0-cdh4.4.0.jar

hive-jdbc-0.10.0-cdh4.4.0.jar

hive-metastore-0.10.0-cdh4.4.0.jar

hive-service-0.10.0-cdh4.4.0.jar

libfb303-0.9.0.jar

pig-0.11.0-cdh4.4.0.jar

protobuf-java-2.4.0a.jar
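
For reference, SAS finds these client JARs through the SAS_HADOOP_JAR_PATH environment variable, which must be set before the LIBNAME statement runs. A minimal sketch, assuming the JARs were copied to /opt/sas/hadoopjars on the SAS machine (the directory is hypothetical; check the configuration guide for the exact form your install expects):

   /* Point SAS at the directory holding the copied client JARs (hypothetical path). */
   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";

   /* Then assign the libref as before. */
   libname cdh hadoop server="192.168.198.131" subprotocol=hive2
           user=cloudera password=XXXXXXXX;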

JBailey
Barite | Level 11

Hi platonsan,

Were you able to get this to work? I have reproduced it and am going to work on it. If you have it fixed, I would love to hear what you did to fix it.

platonsan
Calcite | Level 5

Hi Jeff

No, I didn't get it to work.

I haven't looked at it for a few weeks; it would be great if you could get it to work.

JBailey
Barite | Level 11

Hi,

Just so you know, this problem became an obsession for me (I literally spent days working on it). I know more about CDH because of it. Plus, it was a lot of fun. Thank you so much for posting your question here.

The good news: it is now working.

The bad news: fixing it is really complicated. High-five Woody from Cloudera - he actually fixed it.

Here is the problem...

SAS connects just fine using a LIBNAME statement. The Hive server contacts the NameNode (NN). The NN creates the directory entry and puts a filename in metadata. This is why you can see the table name and other metadata. The NN then sends the DataNode (DN) hostname and port number (50010) to the SAS machine. The DN sends the name “localhost” back to the SAS machine. This causes SAS to try to connect to a DN on the same machine it is running on. Hilarity ensues. Since there is no DataNode on the SAS machine, the data never makes it to the VM. Hence the empty Hive table.

The fix involved giving the machine a real hostname - we used clouderavm - and then changing all occurrences of "localhost.localdomain" to the new hostname. This required a lot of work. Woody says it is unreasonable to expect to be able to do this on your own. That makes me want to try it ;) Watching him do this was eye-opening. He was using unpublished tricks and was flying around the machine like a madman. It was humbling to observe someone who truly knows what he is doing. I feel fortunate to have seen him work.

Woody is planning on redoing the VM at some point. In the meantime, if you still want to use the VM with SAS, feel free to contact me and we can discuss how to proceed.

platonsan
Calcite | Level 5

Well, I am glad I can help 🙂 (May I recommend an error message in ACCESS?)

So our problem is that we decided to test the setup with a VM before we threw ourselves into building a full-blown Hadoop cluster.

Well, we will have to rethink this.

BTW, what is the status of SAS/ACCESS for Cloudera (Impala)?

Thanks for the effort.

regards Steen

FriedEgg
SAS Employee

The Cloudera Manager tool can make this post-configuration simpler. The issue as described by JBailey makes sense.

Cloudera Manager

JBailey
Barite | Level 11

Testing SAS functionality using the Hadoop vendor VMs is hit-or-miss. Sometimes they work well. Other times they don't.

The problem with issuing an error message is that SAS doesn't know it is an error. It appears to work. The only indication is an empty table; we could check for that, but it would slow down processing, so we don't. In addition, we don't know what the cause of the error is. For example, I had this exact behavior happen today: it looked like it worked, but the table was empty. This was on a real cluster, so it wasn't caused by the same underlying problem. It was a JAR file issue (this is the cause of the vast majority of the problems I have experienced).
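
If you want to add such a guard yourself, something along these lines works; this is only a sketch, and the macro name is illustrative:

   /* After a load, count the rows that actually landed in the Hive table. */
   proc sql noprint;
      select count(*) into :loaded_rows trimmed
      from cdh.class;
   quit;

   %macro check_load;
      /* Warn if the table is empty even though the data step ran cleanly. */
      %if &loaded_rows = 0 %then
         %put WARNING: CDH.CLASS is empty - the load into HDFS probably failed.;
   %mend check_load;
   %check_load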

SAS/ACCESS Interface to Impala is currently a Limited Availability (LA) release. It is not feature complete yet. It will be Generally Available (GA) sometime in July. There is an approval process for the LA release; if you are interested in trying it out, contact your SAS account executive. I am pretty sure that you will get in.

FriedEgg
SAS Employee

The documentation the note is referring to is available on page 42 of the following:

http://support.sas.com/documentation/installcenter/en/ikfdtnunxcg/66380/PDF/default/config.pdf

The documentation describes how to check the sticky bit on the HDFS /tmp directory, as well as a SAS option to use if that folder's setting cannot be altered:

HDFS_TEMPDIR=

http://support.sas.com/documentation/cdl/en/acreldb/66787/HTML/default/viewer.htm#n1h398otek0j00n1it...

Of course, you should verify the writability of any alternate directory as well.
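
To make that concrete, here is a minimal sketch that stages the load somewhere other than /tmp. The path /user/cloudera/tmp is purely illustrative; use any HDFS directory the connecting user can write to:

   /* Stage the bulk load in a writable directory instead of a sticky-bit /tmp. */
   libname cdh hadoop server="192.168.198.131" subprotocol=hive2
           user=cloudera password=XXXXXXXX
           hdfs_tempdir="/user/cloudera/tmp";   /* illustrative path */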

daniele_laguardia
SAS Employee

Hi all,

I have the same problem on both virtual machines and on a physical cluster. When I try to load some data into Cloudera Hadoop using a simple data step, no data is written (the table has zero rows), but the metadata structure of the table is correctly stored (the SAS log shows no errors).

Do you have suggestions for solving the problem on the Hadoop cluster?

Thanks.

D.

JBailey
Barite | Level 11

Hi Daniele,

Your problem is most likely related to the file permissions on the Hadoop /tmp directory. If SAS can't write there, the metadata will be created but no data will be loaded into the table.

SAS bypasses JDBC and streams data directly into HDFS (/tmp directory). Once the data is in /tmp it is loaded into the final Hive table. This is done for performance reasons; it is very fast.

Best wishes,

Jeff

FriedEgg
SAS Employee

Have you reviewed the things already discussed here:

1. The sticky bit on /tmp.

2. Whether the IP/hostname of the data nodes is reachable and resolvable from the SAS server.

daniele_laguardia
SAS Employee

When the SAS code is running, I find this message several times in the log file of the Cloudera cluster (non-VM):

Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby

2015-03-26 11:40:57,582 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby

JBailey
Barite | Level 11

It looks like the cluster is in standby mode. Go into Cloudera Manager (under Hosts) and check out the status. You may need to take the NameNode out of standby.

Have you been able to get this to work on the VM? On the VM you will need to put user=cloudera on your LIBNAME statement. That is, unless you have added a user.

Can you try setting the following environment variable on the SAS machine (you will need to restart SAS)?

SAS_HADOOP_RESTFUL=1
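
For example, it can go in the SAS configuration file (which is why a restart is needed) or, depending on your setup, be set from within a session; treat this as a sketch and check the configuration guide for the form your install expects:

   /* In sasv9.cfg (requires restarting SAS): */
   /*    -SET SAS_HADOOP_RESTFUL 1            */

   /* Or from within a SAS session, if your setup allows it: */
   options set=SAS_HADOOP_RESTFUL 1;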

