In a nutshell, we discovered an issue with the HDFS datanode address that SAS Data Loader receives from the Cloudera cluster when Cloudera Manager is in use, and we found an HDFS configuration option we could enable to remedy the problem. A more detailed explanation follows.

After the initial issue with the Oozie service not responding (which went away after we reinstalled the VMs), the primary problem we struggled with was that Copy to Hadoop failed during execution with both SAS datasets and SQL Server tables as sources. When copying a SAS dataset, the operation created the Hive target table with the correct metadata but no data (i.e., an empty table). We confirmed that the HDFS file backing the Hive table was created, but its contents were empty. No errors appeared in any of the logs we examined. When copying from a SQL Server table to Hadoop, we hit an error very early in the execution stating “There are 0 datanode(s) running…”. In that case, no Hive table or HDFS file was created at all.

We discovered that the issue only occurs when Cloudera Manager is enabled; in the default Quickstart VM configuration, Cloudera Manager is disabled. The difference in behavior arises because the Quickstart VM maintains two different sets of Hadoop site XML configuration files, one for the default configuration and one for the Cloudera Manager configuration. In the default configuration, hdfs-site.xml contains explicit entries with the datanode IP address, and the Data Loader client uses these entries (after the Data Loader VM picks them up during its configuration phase) to locate and communicate with the HDFS datanode. In the Cloudera Manager configuration, hdfs-site.xml lacks explicit datanode entries, so the Data Loader client must query the Quickstart VM namenode for the datanode’s address. Unfortunately, the namenode reports back 127.0.0.1 for the datanode, and the Data Loader VM cannot connect to HDFS at that address, because from its perspective 127.0.0.1 is its own localhost, not the Quickstart VM. The logs do not readily indicate this connection failure.

The fix is to go into Cloudera Manager and enable the HDFS configuration option “Use DataNode Hostname” for external clients. With this change, the namenode reports back the hostname “quickstart.cloudera” rather than 127.0.0.1 for the datanode. We have verified on both the customer’s system and our own that this resolves the issue: we were able to copy both SAS datasets and SQL Server tables into the Quickstart VM.

We will update the SAS Communities documents we maintain describing the configuration changes necessary for the Copy to Hadoop operation in Data Loader.

Thank you,
Ben Ryan
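P.S. For anyone who wants to see from client code what address the namenode is handing out, here is a rough Java sketch that asks the namenode for its datanode report. The namenode URI is an assumption based on Quickstart VM defaults (hostname quickstart.cloudera, port 8020); substitute your own fs.defaultFS value. On an unsecured Quickstart VM this should run as-is; on a secured cluster the report call typically requires HDFS superuser privileges.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ReportDatanodeAddresses {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed Quickstart VM namenode URI; adjust for your cluster.
            URI namenode = URI.create("hdfs://quickstart.cloudera:8020");
            try (FileSystem fs = FileSystem.get(namenode, conf)) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // Ask the namenode which datanodes it knows about and print
                // the hostname and transfer address registered for each.
                for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                    System.out.println(dn.getHostName() + " -> " + dn.getXferAddr());
                }
            }
        }
    }

Before the fix, the transfer address printed as 127.0.0.1 on our setup; afterwards the datanode should be reachable by its hostname.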
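P.P.S. For completeness, here is a minimal sketch of applying the equivalent change from an external client’s own configuration, for cases where you cannot (or prefer not to) flip the Cloudera Manager checkbox. To our understanding, the checkbox corresponds to the Hadoop property dfs.client.use.datanode.hostname; treat the file path used as a probe below as a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DatanodeHostnameFix {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed Quickstart VM namenode URI; adjust for your cluster.
            conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020");
            // Client-side equivalent of "Use DataNode Hostname": connect to
            // datanodes by hostname rather than the (possibly 127.0.0.1)
            // IP address the namenode reports.
            conf.setBoolean("dfs.client.use.datanode.hostname", true);
            try (FileSystem fs = FileSystem.get(conf)) {
                // Creating a file forces a datanode connection, which is
                // exactly where the 127.0.0.1 failure surfaced for us.
                Path probe = new Path("/tmp/datanode_probe.txt");
                fs.createNewFile(probe);
                System.out.println("Datanode reachable; created " + probe);
            }
        }
    }

Note that this only helps if the datanode hostname actually resolves from the client machine, e.g. via an /etc/hosts entry mapping quickstart.cloudera to the VM’s external IP address.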