SAS Programming

dkassis · Posted 07-17-2019 01:02 PM

Hi,

I am working with SAS/ACCESS to Hadoop and trying to optimize my query. I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months. To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).

I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better. I tried using the below in SAS and I get an error.

1) is the Proc SQL RANK() over (partition by supported by SAS?

2) Any tips to optimize this query to work in Hive?

Thanks,

Dan

Query using RANK() over (partition by:

PROC SQL;
CREATE TABLE xxxxx.electric_interval_test_part
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM ( SELECT svc_agrmt_id
, svc_agrmt_intv_data_yr_mth_nbr
, SVC_PT_ID
, svc_agrmt_int_pf_rel_typ_cd
, svc_agrmt_int_pf_rel_typ_desc
, svc_agrmt_intv_pf_id
, svc_agrmt_intv_ds_dttm
, svc_agrmt_intv_ds_status_cd
, svc_agrmt_intv_ds_status_desc
, svc_agrmt_intv_data_dttm
, svc_agrmt_intv_data_nbr
, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)
AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'
AND svc_agrmt_intv_ds_status_cd = 'C'
AND svc_agrmt_id ='xxxxxxxxxxxx') st
WHERE st.mx_set_dttm = 1
order by ST.svc_agrmt_intv_data_yr_mth_nbr;

Original query not using the RANK() over (partition by:

options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;
PROC SQL;
CREATE TABLE xxxxxx.electric_interval_test
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM
(SELECT svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm
, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and
svc_agrmt_id in ('xxxxxxxxxxx') and
st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and
st.svc_agrmt_intv_ds_status_cd = 'C'
GROUP BY svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm) mx

JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id
AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm
AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm
quit;

Reeza · Posted 07-17-2019 01:06 PM

I would have assumed the SAS processes are faster.

RANK/PARTITION are not supported in PROC SQL. However, getting the lastest observation or removing duplicates is a pretty trivial task in Base SAS. Have you tried a PROC SORT or PROC SUMMARY to get what you need?

dkassis · Posted 07-17-2019 02:04 PM

I have, the problem is the table I am going against is quite large (billions or rows) and I also want to join other tables to this data to get demographic information. I want to leverage the processing in Hadoop by pushing everything there otherwise I will max out the SAS server.

LinusH · Posted 07-17-2019 02:34 PM

If you think that rank() partition by is the best solution use HiveQL. You can still call it from SAS using explicit SQL pass through:
select * from connection to Hadoop (...);
or
execute (...) by Hadoop;

Data never sleeps

SAS Programming

Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop

Re: Proc SQL RANK() over (partition by in Hadoop

PROC SQL - Insert

Creating sequence variable with partitioning

PROC RANK Alternative

PROC SQL- Order by

Data append to a partitioned CAS table

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...