SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
dkassis
Calcite | Level 5

Hi, 

 

I am working with SAS/ACCESS to Hadoop and trying to optimize my query.  I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months.  To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).  

 

I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better.  I tried using the below in SAS and I get an error.

 

1) is the Proc SQL RANK() over (partition by supported by SAS?

2) Any tips to optimize this query to work in Hive?

 

 

Thanks,

Dan

 

Query using  RANK() over (partition by:

 

PROC SQL;
CREATE TABLE xxxxx.electric_interval_test_part
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM ( SELECT svc_agrmt_id
, svc_agrmt_intv_data_yr_mth_nbr
, SVC_PT_ID
, svc_agrmt_int_pf_rel_typ_cd
, svc_agrmt_int_pf_rel_typ_desc
, svc_agrmt_intv_pf_id
, svc_agrmt_intv_ds_dttm
, svc_agrmt_intv_ds_status_cd
, svc_agrmt_intv_ds_status_desc
, svc_agrmt_intv_data_dttm
, svc_agrmt_intv_data_nbr
, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)
AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'
AND svc_agrmt_intv_ds_status_cd = 'C'
AND svc_agrmt_id ='xxxxxxxxxxxx') st
WHERE st.mx_set_dttm = 1
order by ST.svc_agrmt_intv_data_yr_mth_nbr;

 

 

Original query not using the RANK() over (partition by:

 

options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;
PROC SQL;
CREATE TABLE xxxxxx.electric_interval_test
AS
SELECT DISTINCT
st.svc_agrmt_id
,ST.svc_agrmt_intv_data_yr_mth_nbr
,st.SVC_PT_ID
, st.svc_agrmt_int_pf_rel_typ_cd
, st.svc_agrmt_int_pf_rel_typ_desc
, st.svc_agrmt_intv_pf_id
, st.svc_agrmt_intv_ds_dttm
, st.svc_agrmt_intv_ds_status_cd
, st.svc_agrmt_intv_ds_status_desc
, st.svc_agrmt_intv_data_dttm
, st.svc_agrmt_intv_data_nbr
FROM
(SELECT svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm
, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm
FROM CUST.cust_svc_agrmt_intv_pf_stg
WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and
svc_agrmt_id in ('xxxxxxxxxxx') and
st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and
st.svc_agrmt_intv_ds_status_cd = 'C'
GROUP BY svc_agrmt_intv_pf_id
, svc_agrmt_intv_data_dttm) mx

 

JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id
AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm
AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm
quit;

 

3 REPLIES 3
Reeza
Super User
I would have assumed the SAS processes are faster.

RANK/PARTITION are not supported in PROC SQL. However, getting the lastest observation or removing duplicates is a pretty trivial task in Base SAS. Have you tried a PROC SORT or PROC SUMMARY to get what you need?
dkassis
Calcite | Level 5

I have, the problem is the table I am going against is quite large (billions or rows) and I also want to join other tables to this data to get demographic information.  I want to leverage the processing in Hadoop by pushing everything there otherwise I will max out the SAS server. 

LinusH
Tourmaline | Level 20
If you think that rank() partition by is the best solution use HiveQL. You can still call it from SAS using explicit SQL pass through:
select * from connection to Hadoop (...);
or
execute (...) by Hadoop;
Data never sleeps

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 10952 views
  • 2 likes
  • 3 in conversation