Hello All,
I am connecting to Hadoop and writing a SAS dataset to it using a LIBNAME statement.
libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /*parameters passed from unix*/
/** sas code **/
data hdptgt.main_table;
   merge main_table sub_table;
   by rec_id;
run;
Log output:
NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDP.MAIN_TABLE has 290000000 observations and 50 variables.
real time 8:30:04.19
cpu time 34:31.04
This takes around 8 hours 30 minutes. Is there anything I could do to make it run faster? Any help would be appreciated.
Typically, one would use BULKLOAD to speed up RDBMS write operations.
Unfortunately, for Hive this is syntax support only; the same underlying load process is used either way.
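For reference, the option is accepted on the LIBNAME statement, so it costs little to try and then verify in the trace output. A hedged sketch based on the original LIBNAME (exact behavior depends on your SAS/ACCESS Interface to Hadoop release):

```sas
/* Sketch: the LIBNAME from the original post with BULKLOAD added.
   For Hive this is accepted syntax but, as noted above, it does not
   change the underlying load mechanism. */
libname hdptgt hadoop server=&server port=10000 schema=sample
        config="&hadoop_config_file" bulkload=yes;
```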
I would start by adding
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;
to your program to get a better picture of what is happening on the Hive side.
Other than that, I think this is a matter of HDFS/Hive optimization (assuming you can rule out network bottlenecks, or local SAS session ones during the read/merge operation).
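On the Hive side, one thing worth checking is the storage format of the target table: writing to a default text-format table is typically much slower than a columnar format. A hedged sketch, assuming the DBCREATE_TABLE_OPTS= data set option is available in your SAS/ACCESS Interface to Hadoop release:

```sas
/* Sketch: create the target Hive table as ORC rather than the default
   text format. DBCREATE_TABLE_OPTS= passes the clause through to the
   Hive CREATE TABLE statement; support varies by SAS release. */
data hdptgt.main_table (dbcreate_table_opts='stored as orc');
   merge main_table sub_table;
   by rec_id;
run;
```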
Unfortunately that doesn't make much of a difference; I found the same thing as well.