- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I am working with SAS/ACCESS to Hadoop and trying to optimize my query. I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months. To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).
I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better. I tried using the below in SAS and I get an error.
1) is the Proc SQL RANK() over (partition by supported by SAS?
2) Any tips to optimize this query to work in Hive?
Thanks,
Dan
Query using RANK() over (partition by:
PROC SQL;
CREATE TABLE xxxxx.electric_interval_test_part
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM ( SELECT svc_agrmt_id
, svc_agrmt_intv_data_yr_mth_nbr
, SVC_PT_ID
, svc_agrmt_int_pf_rel_typ_cd
, svc_agrmt_int_pf_rel_typ_desc
, svc_agrmt_intv_pf_id
, svc_agrmt_intv_ds_dttm
, svc_agrmt_intv_ds_status_cd
, svc_agrmt_intv_ds_status_desc
, svc_agrmt_intv_data_dttm
, svc_agrmt_intv_data_nbr
, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)
AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'
AND svc_agrmt_intv_ds_status_cd = 'C'
AND svc_agrmt_id ='xxxxxxxxxxxx') st
WHERE st.mx_set_dttm = 1
order by ST.svc_agrmt_intv_data_yr_mth_nbr;
Original query not using the RANK() over (partition by:
options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;
PROC SQL;
CREATE TABLE xxxxxx.electric_interval_test
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM
(SELECT svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm
, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and
svc_agrmt_id in ('xxxxxxxxxxx') and
st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and
st.svc_agrmt_intv_ds_status_cd = 'C'
GROUP BY svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm) mx
JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id
AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm
AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
RANK/PARTITION are not supported in PROC SQL. However, getting the lastest observation or removing duplicates is a pretty trivial task in Base SAS. Have you tried a PROC SORT or PROC SUMMARY to get what you need?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have, the problem is the table I am going against is quite large (billions or rows) and I also want to join other tables to this data to get demographic information. I want to leverage the processing in Hadoop by pushing everything there otherwise I will max out the SAS server.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
select * from connection to Hadoop (...);
or
execute (...) by Hadoop;