10-13-2017 06:56 AM
I am connecting to hadoop & writing a SAS dataset to hadoop using a libname statement.
libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /*parameters passed from unix*/
/** sas code **/
merge main_table sub_table;
NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDP.MAIN_TABLE has 290000000 observations and 50 variables.
real time 8:30:04.19
cpu time 34:31.04
This takes around 8 hrs 30 mins. Is there anything i could do to run this fast ? any help would be appreciated.
10-13-2017 07:18 AM
Typically, on would use BULKLOAD to speed up RDBMS write operations.
Unfortunately, for Hive this is just a syntax support, there is the same underlying process that is used.
I would start with adding
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;
to your program to better analyze what's going on on the Hive side.
Other than that, I think this is a matter of hdfs/Hive optimization issue (given that you can rule out network bottlenecks, or local SAS session ones during read/merge operation).
10-17-2017 12:35 PM
Unfortunately that doesn't make much of a difference, found this as well.