BookmarkSubscribeRSS Feed
hema_bhure
Calcite | Level 5

Hi All,

 

I am using SAS EG 7.1 and writing some query such as parsethrough which use using to connect to the hive and accessing data from Hadoop but its taking to much time to execute simple queries and i need to improve the performance of query , so could you please help me to improve performance .

 

%let selstr =
int(test_acc/100) as pac,
test_acc-100 * int(test_acc/100) as test_sub_acc,
date-&dt_1. as comp_dt,
lstcapdte-&dt_2. as lstcap_dt,
test_int_rate/100 as int_rate,
Test_cur_bal as Bal,
edh_ingest_ts;


proc sql;
connect to hadoop (
 server="XXXXXXXXXXXX"
 uri="jdbc:hive2://XXXXXX,XXXX,XXXX/;
 serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=XXXserver2" HIVE_KERBEROS_PRINCIPAL="hive/XXXXXX"
 port=xxxxx
 schema=xxxxx
 read_method=JDBC
 properties="mapred.job.queue.name=XXXX_default;hive.fetch.task.conversion=minimal;hive.cbo.enable=true;
 hive.compute.query.using.stats=true;hive.stats.fetch.column.stats=true;hive.stats.fetch.partition.stats=true;
 hive.prewarm.enabled=true;"
 dbmax_text=255
 );
 create table work.test as
 select
 *
 from connection to hadoop (
 select
 &selstr.
 from &table.
 where to_date(process_date_time) >=&test_date_1. and to_date(process_date_time) <=&test_date_2.
 );
disconnect from hadoop;

quit; 

 

 

/*processing time as below*/

 

NOTE: PROCEDURE SQL used (Total process time):
real time 35:37.20
2 The SAS System 09:28 Thursday, September 29, 2022

user cpu time 1:19.61
system cpu time 26.64 seconds
memory 5633.62k
OS Memory 43176.00k
Timestamp 29/09/2022 03:46:38 PM
Step Count 8 Switch Count 3
Page Faults 68
Page Reclaims 440
Page Swaps 0
Voluntary Context Switches 92058
Involuntary Context Switches 26290
Block Input Operations 0
Block Output Operations 0

 

 

9 REPLIES 9
SASKiwi
PROC Star

How many rows are extracted into your SAS table? If it is a lot rerun your query changing &selstr for count(*). This will measure the performance of the query itself only and not the time to transfer the data across your network back to SAS. Please post the time for this query. 

hema_bhure
Calcite | Level 5

Hi,

no there is not changing the variable into the &selstr. The selstr is limited for this query only.

I am just copying the the processing time and number of observation below.

please let me know how i can optimised it, i tried libname approach also but the still query  takes lot of time to extract data.

----------------------------------------------------------------------------------------------------------------------------

NOTE: Compressing data set WORK.test decreased size by 0.71 percent.
Compressed is 58707 pages; un-compressed would require 59127 pages.
NOTE: Table WORK.test created, with 43930975 rows and 13 columns.

45 ;
46 disconnect from hadoop;
47 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 35:37.20
2 The SAS System 09:28 Thursday, September 29, 2022

user cpu time 1:19.61
system cpu time 26.64 seconds
memory 5633.62k
OS Memory 43176.00k
Timestamp 29/09/2022 03:46:38 PM
Step Count 8 Switch Count 3
Page Faults 68
Page Reclaims 440
Page Swaps 0
Voluntary Context Switches 92058
Involuntary Context Switches 26290
Block Input Operations 0
Block Output Operations 0

 

 

Patrick
Opal | Level 21

It's likely that the data transfer is what consumes the time. 

As @SASKiwi suggested run your query with a select statement that only returns a single row of data as this will tell us if it's your where clause or the data transfer that consumes the time.

Can you please re-run the code with the following change

%let selstr = count(*) as n_rows;
hema_bhure
Calcite | Level 5

Hi,

 

following is n_row count and log 

n_rows
40073670

------------------------------------------------------------------------------------------------------------------------------------

-------------------------log-----------------------------

24
25 GOPTIONS ACCESSIBLE;
26 %let selstr = count(*) as n_rows;
27 proc sql;
connect to hadoop (
server="XXXXXXXXXXXX"
uri="jdbc:hive2://XXXXXX,XXXX,XXXX/;
serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=XXXserver2" HIVE_KERBEROS_PRINCIPAL="hive/XXXXXX"
port=xxxxx
schema=xxxxx
read_method=JDBC
properties="mapred.job.queue.name=XXXX_default;hive.fetch.task.conversion=minimal;hive.cbo.enable=true;
hive.compute.query.using.stats=true;hive.stats.fetch.column.stats=true;hive.stats.fetch.partition.stats=true;
hive.prewarm.enabled=true;"
dbmax_text=255
);
create table work.test as
select
*
from connection to hadoop (
select
&selstr.
from &table.
where to_date(process_date_time) >=&test_date_1. and to_date(process_date_time) <=&test_date_2.
);
disconnect from hadoop;

quit;
NOTE: Compression was disabled for data set WORK.TEST because compression overhead would increase the size of the data set.
NOTE: Table WORK.TEST created, with 1 rows and 1 columns.

51 ;
2 The SAS System 09:56 Tuesday, October 4, 2022

52 disconnect from hadoop;
53 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 17:21.22
user cpu time 0.12 seconds
system cpu time 0.08 seconds
memory 5598.90k
OS Memory 43688.00k
Timestamp 04/10/2022 11:18:42 AM
Step Count 9 Switch Count 3
Page Faults 0
Page Reclaims 312
Page Swaps 0
Voluntary Context Switches 5065
Involuntary Context Switches 50
Block Input Operations 0
Block Output Operations 0

54
55 GOPTIONS NOACCESSIBLE;
56 %LET _CLIENTTASKLABEL=;
57 %LET _CLIENTPROCESSFLOWNAME=;
58 %LET _CLIENTPROJECTPATH=;
59 %LET _CLIENTPROJECTPATHHOST=;
60 %LET _CLIENTPROJECTNAME=;
61 %LET _SASPROGRAMFILE=;
62 %LET _SASPROGRAMFILEHOST=;
63
64 ;*';*";*/;quit;run;
65 ODS _ALL_ CLOSE;
66
67
68 QUIT; RUN;
69

LinusH
Tourmaline | Level 20

I think this might be a Hive optimization problem.

You should execute the query directly into Hive.

Hive is not a very fast engine, especially for small-medium sized data (whatever that is).

I have experienced that Impala is usually more performant, but that may require additional SAS/ACCESS licensing.

Data never sleeps
hema_bhure
Calcite | Level 5
Thank you @LinusH , i can execute query in Hive but this source for us and we are doing etL into SAS. so i need to connection to Hadoop to access the data. thank you once again.
LinusH
Tourmaline | Level 20

I didn't mean to do it as permanent solution, just as a way to see if the excessive execution time is in the Hive query or in the data transport to SAS.

So execute it, measure the time and compare with your SAS SQL pass through measurements.

Data never sleeps
SASKiwi
PROC Star

The row count test takes less than half the time of the original SAS table query. That suggests the network speed between Hive and SAS is slowing performance a lot. There is not a lot you can do about that, except splitting your query and running it in parallel over different date periods. You might get better performance outside of normal business hours as well so you could schedule it to run overnight when run time isn't so important.  

hema_bhure
Calcite | Level 5

thank you @SASKiwi, I will try this.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1589 views
  • 1 like
  • 4 in conversation