Proc SQL RANK() over (partition by in Hadoop

dkassis — Wed, 17 Jul 2019 17:02:03 GMT

Hi,

I am working with SAS/ACCESS to Hadoop and trying to optimize my query. I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months. To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).

I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better. I tried using the below in SAS and I get an error.

1) is the Proc SQL RANK() over (partition by supported by SAS?

2) Any tips to optimize this query to work in Hive?

Thanks,

Dan

Query using RANK() over (partition by:

PROC SQL;
CREATE TABLE xxxxx.electric_interval_test_part
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM ( SELECT svc_agrmt_id
, svc_agrmt_intv_data_yr_mth_nbr
, SVC_PT_ID
, svc_agrmt_int_pf_rel_typ_cd
, svc_agrmt_int_pf_rel_typ_desc
, svc_agrmt_intv_pf_id
, svc_agrmt_intv_ds_dttm
, svc_agrmt_intv_ds_status_cd
, svc_agrmt_intv_ds_status_desc
, svc_agrmt_intv_data_dttm
, svc_agrmt_intv_data_nbr
, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)
AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'
AND svc_agrmt_intv_ds_status_cd = 'C'
AND svc_agrmt_id ='xxxxxxxxxxxx') st
WHERE st.mx_set_dttm = 1
order by ST.svc_agrmt_intv_data_yr_mth_nbr;

Original query not using the RANK() over (partition by:

options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;
PROC SQL;
CREATE TABLE xxxxxx.electric_interval_test
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM
(SELECT svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm
, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and
svc_agrmt_id in ('xxxxxxxxxxx') and
st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and
st.svc_agrmt_intv_ds_status_cd = 'C'
GROUP BY svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm) mx

JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id
AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm
AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm
quit;

Re: Proc SQL RANK() over (partition by in Hadoop

Reeza — Wed, 17 Jul 2019 17:06:26 GMT

I would have assumed the SAS processes are faster.

RANK/PARTITION are not supported in PROC SQL. However, getting the lastest observation or removing duplicates is a pretty trivial task in Base SAS. Have you tried a PROC SORT or PROC SUMMARY to get what you need?

Re: Proc SQL RANK() over (partition by in Hadoop

dkassis — Wed, 17 Jul 2019 18:04:48 GMT

I have, the problem is the table I am going against is quite large (billions or rows) and I also want to join other tables to this data to get demographic information. I want to leverage the processing in Hadoop by pushing everything there otherwise I will max out the SAS server.

Re: Proc SQL RANK() over (partition by in Hadoop

LinusH — Wed, 17 Jul 2019 18:34:40 GMT

If you think that rank() partition by is the best solution use HiveQL. You can still call it from SAS using explicit SQL pass through:
select * from connection to Hadoop (...);
or
execute (...) by Hadoop;

topic Proc SQL RANK() over (partition by in Hadoop in SAS Programming

Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop