topic Re: need to improve performance of query which use connection to hive in SAS Programming

need to improve performance of query which use connection to hive

hema_bhure — Mon, 03 Oct 2022 16:22:25 GMT

Hi All,

I am using SAS EG 7.1 and writing some query such as parsethrough which use using to connect to the hive and accessing data from Hadoop but its taking to much time to execute simple queries and i need to improve the performance of query , so could you please help me to improve performance .

%let selstr =
int(test_acc/100) as pac,
test_acc-100 * int(test_acc/100) as test_sub_acc,
date-&dt_1. as comp_dt,
lstcapdte-&dt_2. as lstcap_dt,
test_int_rate/100 as int_rate,
Test_cur_bal as Bal,
edh_ingest_ts;

proc sql;
connect to hadoop (
server="XXXXXXXXXXXX"
uri="jdbc:hive2://XXXXXX,XXXX,XXXX/;
serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=XXXserver2" HIVE_KERBEROS_PRINCIPAL="hive/XXXXXX"
port=xxxxx
schema=xxxxx
read_method=JDBC
properties="mapred.job.queue.name=XXXX_default;hive.fetch.task.conversion=minimal;hive.cbo.enable=true;
hive.compute.query.using.stats=true;hive.stats.fetch.column.stats=true;hive.stats.fetch.partition.stats=true;
hive.prewarm.enabled=true;"
dbmax_text=255
);
create table work.test as
select
*
from connection to hadoop (
select
&selstr.
from &table.
where to_date(process_date_time) >=&test_date_1. and to_date(process_date_time) <=&test_date_2.
);
disconnect from hadoop;

quit;

/*processing time as below*/

NOTE: PROCEDURE SQL used (Total process time):
real time 35:37.20
2 The SAS System 09:28 Thursday, September 29, 2022

user cpu time 1:19.61
system cpu time 26.64 seconds
memory 5633.62k
OS Memory 43176.00k
Timestamp 29/09/2022 03:46:38 PM
Step Count 8 Switch Count 3
Page Faults 68
Page Reclaims 440
Page Swaps 0
Voluntary Context Switches 92058
Involuntary Context Switches 26290
Block Input Operations 0
Block Output Operations 0

Re: need to improve performance of query which use connection to hive

SASKiwi — Mon, 03 Oct 2022 19:25:15 GMT

How many rows are extracted into your SAS table? If it is a lot rerun your query changing &selstr for count(*). This will measure the performance of the query itself only and not the time to transfer the data across your network back to SAS. Please post the time for this query.

Re: need to improve performance of query which use connection to hive

hema_bhure — Tue, 04 Oct 2022 08:10:11 GMT

Hi,

no there is not changing the variable into the &selstr. The selstr is limited for this query only.

I am just copying the the processing time and number of observation below.

please let me know how i can optimised it, i tried libname approach also but the still query takes lot of time to extract data.

----------------------------------------------------------------------------------------------------------------------------

NOTE: Compressing data set WORK.test decreased size by 0.71 percent.
Compressed is 58707 pages; un-compressed would require 59127 pages.
NOTE: Table WORK.test created, with 43930975 rows and 13 columns.

45 ;
46 disconnect from hadoop;
47 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 35:37.20
2 The SAS System 09:28 Thursday, September 29, 2022

Re: need to improve performance of query which use connection to hive

Patrick — Tue, 04 Oct 2022 09:41:43 GMT

It's likely that the data transfer is what consumes the time.

As @SASKiwi suggested run your query with a select statement that only returns a single row of data as this will tell us if it's your where clause or the data transfer that consumes the time.

Can you please re-run the code with the following change

%let selstr = count(*) as n_rows;

Re: need to improve performance of query which use connection to hive

hema_bhure — Tue, 04 Oct 2022 10:44:52 GMT

Hi,

following is n_row count and log

n_rows
40073670

------------------------------------------------------------------------------------------------------------------------------------

-------------------------log-----------------------------

24
25 GOPTIONS ACCESSIBLE;
26 %let selstr = count(*) as n_rows;
27 proc sql;
connect to hadoop (
server="XXXXXXXXXXXX"
uri="jdbc:hive2://XXXXXX,XXXX,XXXX/;
serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=XXXserver2" HIVE_KERBEROS_PRINCIPAL="hive/XXXXXX"
port=xxxxx
schema=xxxxx
read_method=JDBC
properties="mapred.job.queue.name=XXXX_default;hive.fetch.task.conversion=minimal;hive.cbo.enable=true;
hive.compute.query.using.stats=true;hive.stats.fetch.column.stats=true;hive.stats.fetch.partition.stats=true;
hive.prewarm.enabled=true;"
dbmax_text=255
);
create table work.test as
select
*
from connection to hadoop (
select
&selstr.
from &table.
where to_date(process_date_time) >=&test_date_1. and to_date(process_date_time) <=&test_date_2.
);
disconnect from hadoop;

quit;
NOTE: Compression was disabled for data set WORK.TEST because compression overhead would increase the size of the data set.
NOTE: Table WORK.TEST created, with 1 rows and 1 columns.

51 ;
2 The SAS System 09:56 Tuesday, October 4, 2022

52 disconnect from hadoop;
53 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 17:21.22
user cpu time 0.12 seconds
system cpu time 0.08 seconds
memory 5598.90k
OS Memory 43688.00k
Timestamp 04/10/2022 11:18:42 AM
Step Count 9 Switch Count 3
Page Faults 0
Page Reclaims 312
Page Swaps 0
Voluntary Context Switches 5065
Involuntary Context Switches 50
Block Input Operations 0
Block Output Operations 0

54
55 GOPTIONS NOACCESSIBLE;
56 %LET _CLIENTTASKLABEL=;
57 %LET _CLIENTPROCESSFLOWNAME=;
58 %LET _CLIENTPROJECTPATH=;
59 %LET _CLIENTPROJECTPATHHOST=;
60 %LET _CLIENTPROJECTNAME=;
61 %LET _SASPROGRAMFILE=;
62 %LET _SASPROGRAMFILEHOST=;
63
64 ;*';*";*/;quit;run;
65 ODS _ALL_ CLOSE;
66
67
68 QUIT; RUN;
69

Re: need to improve performance of query which use connection to hive

LinusH — Tue, 04 Oct 2022 14:32:29 GMT

I think this might be a Hive optimization problem.

You should execute the query directly into Hive.

Hive is not a very fast engine, especially for small-medium sized data (whatever that is).

I have experienced that Impala is usually more performant, but that may require additional SAS/ACCESS licensing.

Re: need to improve performance of query which use connection to hive

hema_bhure — Tue, 04 Oct 2022 15:27:04 GMT

Thank you @LinusH , i can execute query in Hive but this source for us and we are doing etL into SAS. so i need to connection to Hadoop to access the data. thank you once again.

Re: need to improve performance of query which use connection to hive

LinusH — Tue, 04 Oct 2022 15:37:58 GMT

I didn't mean to do it as permanent solution, just as a way to see if the excessive execution time is in the Hive query or in the data transport to SAS.

So execute it, measure the time and compare with your SAS SQL pass through measurements.

Re: need to improve performance of query which use connection to hive

SASKiwi — Tue, 04 Oct 2022 19:17:45 GMT

The row count test takes less than half the time of the original SAS table query. That suggests the network speed between Hive and SAS is slowing performance a lot. There is not a lot you can do about that, except splitting your query and running it in parallel over different date periods. You might get better performance outside of normal business hours as well so you could schedule it to run overnight when run time isn't so important.

Re: need to improve performance of query which use connection to hive

hema_bhure — Wed, 05 Oct 2022 15:48:05 GMT

thank you @SASKiwi, I will try this.