Hello All,
I am connecting to Hadoop and writing a SAS dataset to it using a LIBNAME statement.
libname hdptgt hadoop server=&server port=10000 schema=sample config="&hadoop_config_file"; /*parameters passed from unix*/
/** sas code **/
data hdptgt.main_table;
   merge main_table sub_table;
   by rec_id;
run;
Log output:
NOTE: There were 290000000 observations read from the data set WORK.MAIN_TABLE.
NOTE: There were 10000000 observations read from the data set WORK.SUB_TABLE.
NOTE: The data set HDP.MAIN_TABLE has 290000000 observations and 50 variables.
real time 8:30:04.19
cpu time 34:31.04
This takes around 8 hours 30 minutes. Is there anything I could do to make it run faster? Any help would be appreciated.
Typically, one would use BULKLOAD to speed up RDBMS write operations.
Unfortunately, for Hive this is syntax support only; the same underlying load process is used either way.
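For reference, the option is accepted on the LIBNAME statement, so it costs little to try and then verify in the trace output. A hedged sketch based on the original LIBNAME (exact behavior depends on your SAS/ACCESS Interface to Hadoop release):

```sas
/* Sketch: the LIBNAME from the original post with BULKLOAD added.
   For Hive this is accepted syntax but, as noted above, it does not
   change the underlying load mechanism. */
libname hdptgt hadoop server=&server port=10000 schema=sample
        config="&hadoop_config_file" bulkload=yes;
```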
I would start by adding
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;
to your program to get a better picture of what is happening on the Hive side.
Other than that, I think this is a matter of HDFS/Hive optimization (assuming you can rule out network bottlenecks, or local SAS session ones during the read/merge operation).
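On the Hive side, one thing worth checking is the storage format of the target table: writing to a default text-format table is typically much slower than a columnar format. A hedged sketch, assuming the DBCREATE_TABLE_OPTS= data set option is available in your SAS/ACCESS Interface to Hadoop release:

```sas
/* Sketch: create the target Hive table as ORC rather than the default
   text format. DBCREATE_TABLE_OPTS= passes the clause through to the
   Hive CREATE TABLE statement; support varies by SAS release. */
data hdptgt.main_table (dbcreate_table_opts='stored as orc');
   merge main_table sub_table;
   by rec_id;
run;
```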
Unfortunately that doesn't make much of a difference; I found the same thing as well.