Solved: Re: Joining multiple table with the help of hash object

Bharat_P · Posted 01-02-2020 01:29 AM

proc sql;
create table SUMM7_DIGITAL_CUST as
select M.CNCPT_NM
, M.TRRTRY_SHRT_NM
, (C.APPDWLD_CUST/M.CUST_CY) as APPDWLD_CUST_PER_CY format=percent8.2
, (L.APPDWLD_CUST/M.CUST_LY) as APPDWLD_CUST_PER_LY format=percent8.2
, (calculated APPDWLD_CUST_PER_CY - calculated APPDWLD_CUST_PER_LY) as APPDWLD_CUST_PP_CHNG
, (C.APPACTV_CUST/M.CUST_CY) as APPACTV_CUST_PER_CY format=percent8.2
, (L.APPACTV_CUST/M.CUST_LY) as APPACTV_CUST_PER_LY format=percent8.2
, (calculated APPACTV_CUST_PER_CY - calculated APPACTV_CUST_PER_LY) as APPACTV_CUST_PP_CHNG
from SUMM1_REV_TREE M
left join APP_DWLD_ACTV_CUST_CY C on M.CNCPT_NM=C.CNCPT_NM and M.TRRTRY_SHRT_NM=C.TRRTRY_SHRT_NM
left join APP_DWLD_ACTV_CUST_LY L on M.CNCPT_NM=L.CNCPT_NM and M.TRRTRY_SHRT_NM=L.TRRTRY_SHRT_NM
order by 1,2
;
quit;

this table needs to be converted to hash table to cut the runtime. Can someone please help me with the hash table version of the codes mentioned above

mkeintz · Posted 01-03-2020 03:55 PM

As @Tom pointed out your response to my question is contradictory. Here is a solution that should word if there are no duplicate key-pairs in table C or table L. It doesn't matter if there are duplicates in table M:

data  need (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST_PER_CY APPACTV_CUST_PER_CY
        APPDWLD_CUST_PER_LY APPACTV_CUST_PER_LY APPDWLD_CUST_PP_CHNG APPACTV_CUST_PP_CHNG) /view=need;

  merge
      SUMM1_REV_TREE        (      keep=CNCPT_NM TRRTRY_SHRT_NM CUST_CY CUST_LY)            /* M */ 
      APP_DWLD_ACTV_CUST_CY (obs=0 keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)  /* C */
      APP_DWLD_ACTV_CUST_LY (obs=0 keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)  /* L */
      ;

  if _n_=1 then do;
    declare hash C (dataset:'APP_DWLD_ACTV_CUST_CY (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)');
      c.definekey('CNCPT_NM','TRRTRY_SHRT_NM');
      c.definedata('APPDWLD_CUST','APPACTV_CUST');
      c.definedone();
    declare hash L (dataset:'APP_DWLD_ACTV_CUST_LY (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)');
      l.definekey('CNCPT_NM','TRRTRY_SHRT_NM');
      l.definedata('APPDWLD_CUST','APPACTV_CUST');
      l.definedone();
  end;

  if C.find() = 0 then do;   /* Retrieve C.APPDWLD_CUST and APPACTV_CUST */
    APPDWLD_CUST_PER_CY = APPDWLD_CUST/CUST_CY ;
    APPACTV_CUST_PER_CY = APPACTV_CUST/CUST_CY;
  end;

  if L.find() = 0 then do;   /* Retrieve L.APPDWLD_CUST and L.APPACTV_CUST*/
    APPDWLD_CUST_PER_LY= APPDWLD_CUST/CUST_LY ;
    APPACTV_CUST_PER_LY= APPACTV_CUST/CUST_LY ;
  end;

  format APPDWLD_CUST_PER_CY APPACTV_CUST_PER_CY 
         APPDWLD_CUST_PER_LY APPACTV_CUST_PER_LY percent8.2;

  APPDWLD_CUST_PP_CHNG=APPDWLD_CUST_PER_CY - APPDWLD_CUST_PER_LY;
  APPACTV_CUST_PP_CHNG=APPACTV_CUST_PER_CY - APPACTV_CUST_PER_LY;

run;

proc sort data=need out=SUMM7_DIGITAL_CUST0 ;
  by CNCPT_NM TRRTRY_SHRT_NM;
run;

The MERGE statement applies OBS=0 to tables C and L. That is done because merge will reset all the exclusively C-table and L-table variables to missing at the start of each iteration of the DATA step. You need that to protect against situations in which the current iteration finds no matching table C entry, which would leave C-table values retrieved from the most recent successful c.find() retrieval (same for L of course).

Of course, if every entry in table M has a match in tables L and C, then the defense against such retained values is not needed.
I don't guarantee this code, in the absence of usable sample data.
This code would need modification if C or L have duplicate key pairs.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 01-02-2020 08:52 AM

You are doing left joins.

Does that mean you are trying to accommodate actual one-to-many matches? Or are they all one-to-one and you merely want to restrict the output table to joins that have a match in the left-hand table (SUMM1_REV_TREE)?

The reason I ask is that you would need to set up your C and L tables (APP_DWLD_ACTV_CUST_CY and APP_DWLD_ACTV_CUST_LY, respectively) as hash objects, keyed on the matching variables (CNCPT_NM and TRRTRY_SHRT_NM).

The program will differ based on whether those key-pairs are unique within each of those two tables. However, it wouldn't matter if there are duplicate CNCPT_NM/TRRTRY_SHRT_NM pairs in the M table (SUMM1_REV_TREE).

So what do you have?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Bharat_P · Posted 01-03-2020 01:03 AM

they are all one to one and I want to restrict the output table to join that have a match. And there are duplicate values in the key (CNCPT_NM/TRRTRY_SHRT_NM)

Tom · Posted 01-03-2020 12:44 PM

@Bharat_P wrote:

they are all one to one and I want to restrict the output table to join that have a match. And there are duplicate values in the key (CNCPT_NM/TRRTRY_SHRT_NM)

That answer is confusing. How can the join be one to one if there are duplicate observations in one or more of the input datasets for the same combinations of the key variables?

Patrick · Posted 01-03-2020 02:08 PM

@Bharat_P

If you're after actual SAS code then please provide representative sample data in the form of tested SAS data steps creating such data. Make sure that your sample data also includes repeated key variables if there are such cases in your actual data.

Then show us the expected result based on the sample data and required join logic.

mkeintz · Posted 01-03-2020 03:55 PM

As @Tom pointed out your response to my question is contradictory. Here is a solution that should word if there are no duplicate key-pairs in table C or table L. It doesn't matter if there are duplicates in table M:

data  need (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST_PER_CY APPACTV_CUST_PER_CY
        APPDWLD_CUST_PER_LY APPACTV_CUST_PER_LY APPDWLD_CUST_PP_CHNG APPACTV_CUST_PP_CHNG) /view=need;

  merge
      SUMM1_REV_TREE        (      keep=CNCPT_NM TRRTRY_SHRT_NM CUST_CY CUST_LY)            /* M */ 
      APP_DWLD_ACTV_CUST_CY (obs=0 keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)  /* C */
      APP_DWLD_ACTV_CUST_LY (obs=0 keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)  /* L */
      ;

  if _n_=1 then do;
    declare hash C (dataset:'APP_DWLD_ACTV_CUST_CY (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)');
      c.definekey('CNCPT_NM','TRRTRY_SHRT_NM');
      c.definedata('APPDWLD_CUST','APPACTV_CUST');
      c.definedone();
    declare hash L (dataset:'APP_DWLD_ACTV_CUST_LY (keep=CNCPT_NM TRRTRY_SHRT_NM APPDWLD_CUST APPACTV_CUST)');
      l.definekey('CNCPT_NM','TRRTRY_SHRT_NM');
      l.definedata('APPDWLD_CUST','APPACTV_CUST');
      l.definedone();
  end;

  if C.find() = 0 then do;   /* Retrieve C.APPDWLD_CUST and APPACTV_CUST */
    APPDWLD_CUST_PER_CY = APPDWLD_CUST/CUST_CY ;
    APPACTV_CUST_PER_CY = APPACTV_CUST/CUST_CY;
  end;

  if L.find() = 0 then do;   /* Retrieve L.APPDWLD_CUST and L.APPACTV_CUST*/
    APPDWLD_CUST_PER_LY= APPDWLD_CUST/CUST_LY ;
    APPACTV_CUST_PER_LY= APPACTV_CUST/CUST_LY ;
  end;

  format APPDWLD_CUST_PER_CY APPACTV_CUST_PER_CY 
         APPDWLD_CUST_PER_LY APPACTV_CUST_PER_LY percent8.2;

  APPDWLD_CUST_PP_CHNG=APPDWLD_CUST_PER_CY - APPDWLD_CUST_PER_LY;
  APPACTV_CUST_PP_CHNG=APPACTV_CUST_PER_CY - APPACTV_CUST_PER_LY;

run;

proc sort data=need out=SUMM7_DIGITAL_CUST0 ;
  by CNCPT_NM TRRTRY_SHRT_NM;
run;

The MERGE statement applies OBS=0 to tables C and L. That is done because merge will reset all the exclusively C-table and L-table variables to missing at the start of each iteration of the DATA step. You need that to protect against situations in which the current iteration finds no matching table C entry, which would leave C-table values retrieved from the most recent successful c.find() retrieval (same for L of course).

Of course, if every entry in table M has a match in tables L and C, then the defense against such retained values is not needed.
I don't guarantee this code, in the absence of usable sample data.
This code would need modification if C or L have duplicate key pairs.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 01-03-2020 09:41 AM

Why do you think the hash tables will be faster than PROC SQL? How do you know that PROC SQL is not already doing a hash? There are options on the PROC SQL statement you can add so that it reports on how it has decided to implement your query.

Are the two tables small enough that you can load them into memory that a hash table would require?

Are your datasets already sorted? Why not just do a simple data step merge?

hashman · Posted 01-03-2020 12:33 PM

@Tom:

> Why do you think the hash tables will be faster than PROC SQL? <

A good question if the OP's idea is to make it run faster rather than some other motive.

> How do you know that PROC SQL is not already doing a hash? <

SAS SQL never uses the SQXJHSH access method for any join but an inner join and cannot be forced to even via MAGIC=103 option. I don't know why since algorithmically it's not a problem (e.g., it's easy to do a left/right join using the hash object).

> Are the two tables small enough that you can load them into memory that a hash table would require? <

It very rarely a problem, and when it is due to the frontal attack approach (i.e. load all the keys and all the data), in the case of a join it can always be circumvented by storing only the keys (or their combined MD5 signature if the composite key is really long) in the key portion and the record number RID in the data portion. Then the rest of the data can be extracted via POINT=RID if there's a key match. There're other ways of dealing with memory limitations as well. @DonH and I have a whole chapter on the topic in our "hash book":

https://www.sas.com/store/books/categories/examples/data-management-solutions-using-sas-hash-table-o...

In really extreme cases (which arise primarily when attacking massive data aggregation), this always works:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1755-2018.pdf

Kind regards

Paul D.

Classroom Training Available!