<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Proc SQL RANK() over (partition by in Hadoop in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574288#M162246</link>
    <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am working with SAS/ACCESS to Hadoop and trying to optimize my query.&amp;nbsp; I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months.&amp;nbsp; To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better.&amp;nbsp; I tried using the below in SAS and I get an error.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) is the&amp;nbsp;Proc SQL RANK() over (partition by supported by SAS?&lt;/P&gt;&lt;P&gt;2) Any tips to optimize this query to work in Hive?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Dan&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Query using&amp;nbsp; RANK() over (partition by:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PROC SQL;&lt;BR /&gt;CREATE TABLE xxxxx.electric_interval_test_part&lt;BR /&gt;AS&lt;BR /&gt;SELECT DISTINCT&lt;BR /&gt;st.svc_agrmt_id&lt;BR /&gt;,ST.svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;,st.SVC_PT_ID&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, st.svc_agrmt_intv_pf_id&lt;BR /&gt;, st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, st.svc_agrmt_intv_data_dttm&lt;BR /&gt;, st.svc_agrmt_intv_data_nbr&lt;BR /&gt;FROM ( SELECT svc_agrmt_id&lt;BR /&gt;, svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;, SVC_PT_ID&lt;BR /&gt;, svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_ds_dttm&lt;BR /&gt;, svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, svc_agrmt_intv_data_dttm&lt;BR /&gt;, svc_agrmt_intv_data_nbr&lt;BR /&gt;, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm&lt;BR /&gt;FROM CUST.cust_svc_agrmt_intv_pf_stg&lt;BR /&gt;WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)&lt;BR /&gt;AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'&lt;BR /&gt;AND svc_agrmt_intv_ds_status_cd = 'C'&lt;BR /&gt;AND svc_agrmt_id ='xxxxxxxxxxxx') st&lt;BR /&gt;WHERE st.mx_set_dttm = 1&lt;BR /&gt;order by ST.svc_agrmt_intv_data_yr_mth_nbr;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Original query not using the&amp;nbsp;RANK() over (partition by:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;&lt;BR /&gt;PROC SQL;&lt;BR /&gt;CREATE TABLE xxxxxx.electric_interval_test&lt;BR /&gt;AS&lt;BR /&gt;SELECT DISTINCT&lt;BR /&gt;st.svc_agrmt_id&lt;BR /&gt;,ST.svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;,st.SVC_PT_ID&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, st.svc_agrmt_intv_pf_id&lt;BR /&gt;, st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, st.svc_agrmt_intv_data_dttm&lt;BR /&gt;, st.svc_agrmt_intv_data_nbr&lt;BR /&gt;FROM&lt;BR /&gt;(SELECT svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_data_dttm&lt;BR /&gt;, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm&lt;BR /&gt;FROM CUST.cust_svc_agrmt_intv_pf_stg&lt;BR /&gt;WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and&lt;BR /&gt;svc_agrmt_id in ('xxxxxxxxxxx') and&lt;BR /&gt;st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and&lt;BR /&gt;st.svc_agrmt_intv_ds_status_cd = 'C'&lt;BR /&gt;GROUP BY svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_data_dttm) mx&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id&lt;BR /&gt;AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm&lt;BR /&gt;quit;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 17 Jul 2019 17:02:03 GMT</pubDate>
    <dc:creator>dkassis</dc:creator>
    <dc:date>2019-07-17T17:02:03Z</dc:date>
    <item>
      <title>Proc SQL RANK() over (partition by in Hadoop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574288#M162246</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am working with SAS/ACCESS to Hadoop and trying to optimize my query.&amp;nbsp; I am pulling data from a table that has duplicates as electric read data (in 15 minute intervals) is sent over everyday for the preceding months.&amp;nbsp; To make sure I don't duplicate my measures I need to take a max of a date in a subquery then join it back to the original query (shared below).&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have read that in Hive it is more efficient to use the RANK() over (partition by so a join isn't done and the query will perform better.&amp;nbsp; I tried using the below in SAS and I get an error.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) is the&amp;nbsp;Proc SQL RANK() over (partition by supported by SAS?&lt;/P&gt;&lt;P&gt;2) Any tips to optimize this query to work in Hive?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Dan&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Query using&amp;nbsp; RANK() over (partition by:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PROC SQL;&lt;BR /&gt;CREATE TABLE xxxxx.electric_interval_test_part&lt;BR /&gt;AS&lt;BR /&gt;SELECT DISTINCT&lt;BR /&gt;st.svc_agrmt_id&lt;BR /&gt;,ST.svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;,st.SVC_PT_ID&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, st.svc_agrmt_intv_pf_id&lt;BR /&gt;, st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, st.svc_agrmt_intv_data_dttm&lt;BR /&gt;, st.svc_agrmt_intv_data_nbr&lt;BR /&gt;FROM ( SELECT svc_agrmt_id&lt;BR /&gt;, svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;, SVC_PT_ID&lt;BR /&gt;, svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_ds_dttm&lt;BR /&gt;, svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, svc_agrmt_intv_data_dttm&lt;BR /&gt;, svc_agrmt_intv_data_nbr&lt;BR /&gt;, RANK() over (partition by svc_agrmt_intv_pf_id, svc_agrmt_intv_data_dttm order by svc_agrmt_intv_ds_dttm) as mx_set_dttm&lt;BR /&gt;FROM CUST.cust_svc_agrmt_intv_pf_stg&lt;BR /&gt;WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906)&lt;BR /&gt;AND svc_agrmt_int_pf_rel_typ_cd = 'KWH15'&lt;BR /&gt;AND svc_agrmt_intv_ds_status_cd = 'C'&lt;BR /&gt;AND svc_agrmt_id ='xxxxxxxxxxxx') st&lt;BR /&gt;WHERE st.mx_set_dttm = 1&lt;BR /&gt;order by ST.svc_agrmt_intv_data_yr_mth_nbr;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Original query not using the&amp;nbsp;RANK() over (partition by:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;&lt;BR /&gt;PROC SQL;&lt;BR /&gt;CREATE TABLE xxxxxx.electric_interval_test&lt;BR /&gt;AS&lt;BR /&gt;SELECT DISTINCT&lt;BR /&gt;st.svc_agrmt_id&lt;BR /&gt;,ST.svc_agrmt_intv_data_yr_mth_nbr&lt;BR /&gt;,st.SVC_PT_ID&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_cd&lt;BR /&gt;, st.svc_agrmt_int_pf_rel_typ_desc&lt;BR /&gt;, st.svc_agrmt_intv_pf_id&lt;BR /&gt;, st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_cd&lt;BR /&gt;, st.svc_agrmt_intv_ds_status_desc&lt;BR /&gt;, st.svc_agrmt_intv_data_dttm&lt;BR /&gt;, st.svc_agrmt_intv_data_nbr&lt;BR /&gt;FROM&lt;BR /&gt;(SELECT svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_data_dttm&lt;BR /&gt;, max(svc_agrmt_intv_ds_dttm) as mx_set_dttm&lt;BR /&gt;FROM CUST.cust_svc_agrmt_intv_pf_stg&lt;BR /&gt;WHERE svc_agrmt_intv_data_yr_mth_nbr in (201801,201802,201803,201804,201805,201806,201901,201902,201903,201904,201905,201906) and&lt;BR /&gt;svc_agrmt_id in ('xxxxxxxxxxx') and&lt;BR /&gt;st.svc_agrmt_int_pf_rel_typ_cd = 'KWH15' and&lt;BR /&gt;st.svc_agrmt_intv_ds_status_cd = 'C'&lt;BR /&gt;GROUP BY svc_agrmt_intv_pf_id&lt;BR /&gt;, svc_agrmt_intv_data_dttm) mx&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;JOIN CUST.cust_svc_agrmt_intv_pf_stg st ON mx.svc_agrmt_intv_pf_id = st.svc_agrmt_intv_pf_id&lt;BR /&gt;AND mx.mx_set_dttm = st.svc_agrmt_intv_ds_dttm&lt;BR /&gt;AND mx.svc_agrmt_intv_data_dttm = st.svc_agrmt_intv_data_dttm&lt;BR /&gt;quit;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jul 2019 17:02:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574288#M162246</guid>
      <dc:creator>dkassis</dc:creator>
      <dc:date>2019-07-17T17:02:03Z</dc:date>
    </item>
    <item>
      <title>Re: Proc SQL RANK() over (partition by in Hadoop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574290#M162247</link>
      <description>I would have assumed the SAS processes are faster. &lt;BR /&gt;&lt;BR /&gt;RANK/PARTITION are not supported in PROC SQL. However, getting the lastest observation or removing duplicates is a pretty trivial task in Base SAS. Have you tried a PROC SORT or PROC SUMMARY to get what you need?</description>
      <pubDate>Wed, 17 Jul 2019 17:06:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574290#M162247</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2019-07-17T17:06:26Z</dc:date>
    </item>
    <item>
      <title>Re: Proc SQL RANK() over (partition by in Hadoop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574312#M162256</link>
      <description>&lt;P&gt;I have, the problem is the table I am going against is quite large (billions or rows) and I also want to join other tables to this data to get demographic information.&amp;nbsp; I want to leverage the processing in Hadoop by pushing everything there otherwise I will max out the SAS server.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jul 2019 18:04:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574312#M162256</guid>
      <dc:creator>dkassis</dc:creator>
      <dc:date>2019-07-17T18:04:48Z</dc:date>
    </item>
    <item>
      <title>Re: Proc SQL RANK() over (partition by in Hadoop</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574320#M162261</link>
      <description>If you think that rank() partition by is the best solution use HiveQL. You can still call it from SAS using explicit SQL pass through:&lt;BR /&gt;select * from connection to Hadoop (...);&lt;BR /&gt;or &lt;BR /&gt;execute (...) by Hadoop;</description>
      <pubDate>Wed, 17 Jul 2019 18:34:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Proc-SQL-RANK-over-partition-by-in-Hadoop/m-p/574320#M162261</guid>
      <dc:creator>LinusH</dc:creator>
      <dc:date>2019-07-17T18:34:40Z</dc:date>
    </item>
  </channel>
</rss>

