Hello,
I am trying to pull two fields from a partitioned HIVE ORC table. It is partitioned on month_end_date. Each month of data is about 5 million records. SAS batch is SAS (r) Proprietary Software 9.4 (TS1M3) . Apache Hive (version 1.2.1000.2.5.3.0-37). Each field has the format double.
options compress=yes macrogen symbolgen mlogic sastrace=',,,ds' sastraceloc=saslog nostsuffix sqlgeneration=dbms dbidirectexec sql_ip_trace=(note, source) msglevel=i source2 source2 mprint MCOMPILENOTE=all ;
%let URI="jdbc:=...";
libname lib "xxx";
proc sql;
connect to hadoop(server="lnbradpp06" schema=xxxxxxx_xxxx uri=&uri.);
CREATE TABLE lib.enrollment as select * from connection to hadoop
(
SELECT distinct id , count
FROM xxxxxx_xxxx.history
WHERE MONTH_END_DATE = "2018-06-30"
);
DISCONNECT FROM HADOOP;
quit;
I don't know why this would take so long (half an hour), or if my options (accumulated from various SAS/Hadoop examples) is a problem. Has anyone had this issue with SAS at their company? The solution I am getting is to switch to R.
So, our solution turned out to be a non-SAS one. Before doing the select distinct query, we have to execute a set hive statement since it is not currently the default on our system:
proc sql;
connect to hadoop(server="xxxx" schema=xxxxxxx_xxxx uri=xxxx);
execute(set hive.vectorized.execution.enabled=true) by hadoop;
create table lib.part as select * from connection to hadoop
(SELECT distinct part_id, part_date
from xxxxxxx_xxxx.history
);
DISCONNECT FROM HADOOP;
quit;
This made a query that took over 35 minutes take 10 seconds.
Try just a simple count(*) - that will remove the possibility that the network between the database and SAS is the cause of the slowness and let us know how fast that runs:
proc sql;
connect to hadoop(server="lnbradpp06" schema=xxxxxxx_xxxx uri=&uri.);
select * from connection to hadoop
(
SELECT count(*) as row_count
FROM xxxxxx_xxxx.history
WHERE MONTH_END_DATE = "2018-06-30"
);
DISCONNECT FROM HADOOP;
quit;
Less than 1 minute!
So, I ran three queries. The first (count(*) and second (select...) came back immediately. The "select distinct " one is the problem, it will likely to half an hour as usual. I am not even sure I need select distinct, this table should be distinct. It is a just-in-case. But why would it be so much longer?
proc sql;
connect to hadoop(server="xx" schema=xxxxxxx_xxxx uri=&uri.);
CREATE TABLE lib.test1 as select * from connection to hadoop
(
SELECT count(*) as N
FROM xxxxxx_xxxx.history
WHERE MONTH_END_DATE = "2018-05-31"
);
DISCONNECT FROM HADOOP;
quit;
proc sql;
connect to hadoop(server="xx" schema=xxxxxxx_xxxx uri=&uri.);
CREATE TABLE lib.elig_mos_test2 as select * from connection to hadoop
(
SELECT id , month_count
FROM xxxxxxx_xxxx.history
WHERE MONTH_END_DATE = "2018-05-31"
);
DISCONNECT FROM HADOOP;
quit;
proc sql;
connect to hadoop(server="xx" schema=xxxxxxx_xxxx uri=&uri.);
CREATE TABLE lib.test3 as select * from connection to hadoop
(
SELECT distinct id, month_count
FROM xxxxxxx_xxxx.history
WHERE MONTH_END_DATE = "2018-05-31"
);
DISCONNECT FROM HADOOP;
quit;
I am looking up implicit passthrough now.
I never use DISTINCT on a query if the table does not require it. Adding it "just in case" will add overhead to your processing that isn't necessary. Also I would argue it is probably better to let it happen so you know about it rather than having DISTINCT hiding the problem. You can add safeguards later in your program if required, for example using DATA step BY processing on ID with FIRST. and LAST. logic.
Well, I found out the select distinct query is slow in Beeline too. So, I guess it is a non-SAS problem. Still I need to find an explanation.
So, our solution turned out to be a non-SAS one. Before doing the select distinct query, we have to execute a set hive statement since it is not currently the default on our system:
proc sql;
connect to hadoop(server="xxxx" schema=xxxxxxx_xxxx uri=xxxx);
execute(set hive.vectorized.execution.enabled=true) by hadoop;
create table lib.part as select * from connection to hadoop
(SELECT distinct part_id, part_date
from xxxxxxx_xxxx.history
);
DISCONNECT FROM HADOOP;
quit;
This made a query that took over 35 minutes take 10 seconds.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.