TBarker
Quartz | Level 8

Hello,

 

We have discovered that when returning a large amount of data from our Hadoop cluster (on Linux) to SAS (9.4M5 on AIX), the query runs very slowly if it passes through the Hadoop load balancer (LB) rather than being directed to a specific node in the connection URI. Currently, we have only two nodes in our cluster. Node 01 returns the data in approximately 13 minutes. Node 02 returns the same data in approximately 18 minutes. The LB returns the same data in approximately 2 *hours*. The result set is 19,932,993 records with 40 columns. It's a basic SELECT statement, and it is exactly the same query each time; the only change is the server in the URI, which points either directly to a node or to the LB. Results are similar regardless of the date or time of the run.
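To put those timings in perspective, here is a rough back-of-the-envelope sketch in Python (not SAS). It only uses the figures quoted above and treats "approximately 2 hours" as 120 minutes, which is an assumption; the function and variable names are illustrative.

```python
# Figures quoted in the post: row count and approximate wall-clock minutes per route.
# "load balancer" uses 120 minutes as a stand-in for "approximately 2 hours".
ROW_COUNT = 19_932_993
MINUTES_BY_ROUTE = {"node 01": 13, "node 02": 18, "load balancer": 120}


def rows_per_second(rows: int, minutes: float) -> float:
    """Average rows transferred per second over the whole run."""
    return rows / (minutes * 60)


def slowdown(minutes: float, baseline_minutes: float) -> float:
    """How many times longer a run took than the baseline run."""
    return minutes / baseline_minutes


if __name__ == "__main__":
    baseline = MINUTES_BY_ROUTE["node 01"]
    for route, minutes in MINUTES_BY_ROUTE.items():
        rate = rows_per_second(ROW_COUNT, minutes)
        factor = slowdown(minutes, baseline)
        print(f"{route:>13}: {rate:>9,.0f} rows/s, {factor:.1f}x node 01's run time")
```

The averages hide any stalls or bursts within a run, so they only show the overall size of the gap between the direct and LB routes.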

 

Our network administrator doesn't see anything unusual when he examines the traffic while the query runs, whether directly to either of the nodes or through the LB, and he says this LB is a pretty basic setup. We also tested running the query through the LB while he alternately removed one of the nodes from the LB pool, which we hoped would show whether the LB was having an issue with one of the nodes. Both of those runs (effectively LB-->01, then LB-->02) gave the same slow 2-hour run time.

 

His current thought/question:  "The only thing I can think of now is when running through the LB the source IP is the LB backend IP and not the initiating server.  Is there something within the application that may be looking at the source or the host name used to connect to Hive?  On Windows servers some applications require SPNs when using alias cnames or other DNS names.  Not sure if that is applicable here."

 

Both the Hadoop cluster and our SAS integration with it are new to my company as of earlier this year, so we're only just discovering this issue. As far as we know, no one accessing the cluster via a non-Hadoop application (mainly Tableau and SAP Analysis for Office) is experiencing this problem - but perhaps they are and they just don't know it yet due to the still-limited use of the nascent Hadoop environment.

 

Any suggestions as to what the problem might be and/or how to resolve it so we can utilize the load balancer?

~Tamara
Accepted Solutions
Kalind_Patel
Lapis Lazuli | Level 10

Have you checked the LB's configuration properties, such as the following?

-Ddfs.balance.bandwidthPerSec

 

When you run a query against a large Hadoop table from SAS, SAS automatically uses READ_METHOD=HDFS, which means SAS connects directly to the HDFS load balancers or HDFS data nodes.

SAS retrieves these values from the location set in the SAS_HADOOP_CONFIG_PATH environment variable. For more information, refer to this link: https://documentation.sas.com/?docsetId=hadoopbacg&docsetTarget=p15adw95cc9397n1drnqztusgdwp.htm&doc... 

 

You can check the core-site.xml and hdfs-site.xml files to see which LB, data nodes, and SPNs SAS is using to connect to Hadoop.
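As a quick way to do that check, here is a minimal Python sketch (not SAS code) that lists every property in those two files whose value mentions a given host name. It assumes SAS_HADOOP_CONFIG_PATH points to the directory holding the files and that they use the standard Hadoop `<configuration>`/`<property>` layout; the host name `lb.example.com` and the function names are hypothetical placeholders.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Union


def parse_hadoop_properties(xml_data: Union[str, bytes]) -> dict[str, str]:
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_data)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def properties_mentioning(props: dict[str, str], host: str) -> dict[str, str]:
    """Keep only the properties whose value contains the given host name."""
    needle = host.lower()
    return {k: v for k, v in props.items() if needle in v.lower()}


if __name__ == "__main__":
    config_dir = os.environ.get("SAS_HADOOP_CONFIG_PATH")
    # Hypothetical host name: replace with your load balancer or a data node.
    host = "lb.example.com"
    if not config_dir:
        print("SAS_HADOOP_CONFIG_PATH is not set in this environment.")
    else:
        for file_name in ("core-site.xml", "hdfs-site.xml"):
            path = Path(config_dir) / file_name
            if not path.is_file():
                print(f"{path}: not found")
                continue
            hits = properties_mentioning(parse_hadoop_properties(path.read_bytes()), host)
            print(f"{path}: {len(hits)} properties mention {host}")
            for name, value in sorted(hits.items()):
                print(f"  {name} = {value}")
```

Running it once with the LB's host name and once with each data node's name shows which of them the SAS-side configuration actually references.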

 

As for the SAS connection to Hadoop, it uses the Hadoop JAR files and configuration files retrieved from the Hadoop cluster. You can refer to this link for more info: https://documentation.sas.com/?docsetId=hadoopov&docsetTarget=n1gtt90tf28i1an1flr3c6a8yr3t.htm&docse...

 

TBarker
Quartz | Level 8

Thank you @Kalind_Patel for your response. I've passed your message on to my IT team to look into. I did look in our Hadoop XML files on the SAS server and I see only the data nodes and not the load balancer, so if the load balancer is supposed to be in there as well, then that could indeed be the problem and will definitely need to be tested. I'll update this thread once we've resolved the problem.

~Tamara
Kalind_Patel
Lapis Lazuli | Level 10
Glad to hear it.
As an interim solution, you can point to a specific data node in the LIBNAME statement's URI.
TBarker
Quartz | Level 8

This is exactly what we're having to do, as it turns out that we're currently using an external load balancer, not Cloudera's LB, as the Cloudera LB currently doesn't play well with our SAP/HANA environment. And the external LB is what our SAS environment is struggling with. Our I.T. team hopes to resolve that with the next upgrade of something... (HANA? Hadoop? Both? I don't recall what I was told about that.) Meanwhile, we've been granted permission to direct SAS/Hadoop queries directly to a specific data node and the external LB will take care of queries from Hadoop and other integrated tools that seem to work okay with it. Thanks again for your help!

~Tamara
