Writing data to Hadoop by using SAS libname(Performance issue)

GunnerEP — Fri, 13 Oct 2017 10:56:50 GMT

Hello All,

I am connecting to Hadoop and writing a SAS dataset to it using a LIBNAME statement.

libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /*parameters passed from unix*/

/** sas code **/
data hdptgt.main_table;
  merge main_table sub_table;
  by rec_id;
run;

Log output:

NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDP.MAIN_TABLE has 290000000 observations and 50 variables.

real time 8:30:04.19

cpu time 34:31.04

This takes around 8 hours 30 minutes. Is there anything I could do to make it run faster? Any help would be appreciated.

Re: Writing data to Hadoop by using SAS libname(Performance issue)

LinusH — Fri, 13 Oct 2017 11:18:32 GMT

Typically, one would use BULKLOAD to speed up RDBMS write operations.

Unfortunately, for Hive, BULKLOAD is supported as syntax only; the same underlying load process is used with or without it.

I would start with adding

options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

to your program to better analyze what's going on on the Hive side.
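As a rough sketch of where the trace options would go, assuming the libname and tables from the original post: set them just before the step that writes to Hive, then switch tracing off again so later steps don't flood the log.

```sas
/* Turn on SAS/ACCESS tracing; the generated Hive calls are written to the SAS log */
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* The step from the original post, unchanged */
data hdptgt.main_table;
  merge main_table sub_table;
  by rec_id;
run;

/* Turn tracing off again once the step has finished */
options sastrace=off;
```

The trace lines in the log should show what is sent to Hive and when, which helps tell time spent in SAS apart from time spent on the Hadoop side.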

Other than that, I think this is a matter of HDFS/Hive optimization (assuming you can rule out network bottlenecks and bottlenecks in the local SAS session during the read/merge operation).
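One way to rule the local SAS session in or out, as a sketch only: split the job into two steps so the log reports the merge and the Hive write separately. The intermediate table name work.merged is made up for illustration, and this assumes WORK has enough space for the merged table.

```sas
/* FULLSTIMER adds more detailed resource usage to each step's log notes */
options fullstimer;

/* Step 1: merge locally only; no Hadoop involved */
/* work.merged is a hypothetical intermediate table */
data work.merged;
  merge main_table sub_table;
  by rec_id;
run;

/* Step 2: write the already merged table to Hive */
data hdptgt.main_table;
  set work.merged;
run;
```

If step 1 is quick and step 2 takes nearly all of the elapsed time, the bottleneck is in the transfer or on the Hive/HDFS side rather than in the merge.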

Re: Writing data to Hadoop by using SAS libname(Performance issue)

GunnerEP — Tue, 17 Oct 2017 16:35:08 GMT

Unfortunately, that doesn't make much of a difference. I found this as well:

http://support.sas.com/documentation/cdl/en/acreldb/65247/HTML/default/viewer.htm#n0mnrn0q9n41atn194mujpi4zel9.htm