eroolpal
Fluorite | Level 6

Hi, I have the following PROC SQL step that joins two tables. I would like to see if hash objects would speed up the process. Is there a way to do this join with a hash object? I am having trouble translating the "case when" statements when merging the tables.

 

proc sql;
   drop table new;
   create table new as
   select
      a.case1,
      a.case2,
      a.case3,
      case when a.flag ne "X" then a.code else b.code end as code,
      case when a.flag ne "X" then a.df   else b.df   end as df,
      case when (a.flag = "X" and b.case1 ne "")
           then "O" else a.flag end as flag
   from new as a
   left join old as b
      on  a.case1 = b.case1
      and a.case2 = b.case2
   ;
quit;

1 ACCEPTED SOLUTION

Accepted Solutions
DavePrinsloo
Pyrite | Level 9

The data step hash object is very efficient, and I have seen very good performance improvements with it compared to other join mechanisms.

 

The important thing is that the lookup table (the "right" table in your join) must be relatively small: the relevant columns must fit comfortably into the memory available to the data step. In practice you need roughly 1.3 times the size of the uncompressed columns to be available.

In your case the lookup table has the columns code and df as data columns and case1 and case2 as keys. Assume these 4 columns have a combined length of 100 bytes.

If the lookup table has 1 million rows, you need roughly 100 million bytes times the 1.3 factor mentioned above, i.e. about 130 MB. Remember that about 500 MB can be consumed by the environment itself, so even a session with only 2 GB of memory leaves room for a lookup table of roughly 10 million such rows, and with 4 GB available you could load well over 20 million. That is not really that small :)
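
If you want SAS to do that arithmetic for you, here is a quick untested sketch. Note that it uses the full observation length of OLD, so it overestimates if you only load a few of its columns into the hash table:

data _null_;
   /* open the lookup table and read its row count and observation length */
   dsid   = open('old');
   nrows  = attrn(dsid, 'nlobs');   /* number of rows               */
   rowlen = attrn(dsid, 'lrecl');   /* observation length in bytes  */
   rc     = close(dsid);

   /* rule of thumb: rows * bytes per row * 1.3 overhead factor */
   need  = nrows * rowlen * 1.3;
   avail = input(getoption('xmrlmem'), 20.);   /* memory still available to the session */

   put 'Estimated hash memory (bytes): ' need  comma20.;
   put 'Available memory (bytes):      ' avail comma20.;
run;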

On the other hand, there are no real limits on the size of the main input table.   

Once the hash table is loaded, the execution time of the entire join grows linearly with the size of the main input table, whether or not it is sorted. In theory a sorted main table may give a small performance improvement, but the time required for the sort would always exceed that gain. In other words, a data step hash join has the advantage that your large table does not need to be sorted at all.

 

The following is an untested quick stab at reproducing your logic:

data new2;
   set new (keep= /* list only the columns you need from the left table */
                  case1 case2 case3 code df flag);

   /* define b_case1 to inherit the length of case1 */
   if 0 then b_case1 = case1;

   /* optional: record the memory available before the hash table is loaded */
   _memory_before = input(getoption('xmrlmem'), 20.);

   /* data step statements to define and load the hash table from the relevant
      columns -- if interested I have a macro for this, code available on request */
   %ut_hash_define(lookuptable    = old
                  ,hashobjektname = hashname
                  ,keys           = case1 case2
                   /* rename the columns from OLD when loading them into the hash
                      table, because "code" and "df" exist in both tables */
                  ,datacols_orig  = code df
                  ,datacols       = b_code b_df
                  )

   rc = hashname.find();

   /* if no match is found, set the looked-up columns to missing --
      this is important, otherwise the values from the last match are preserved */
   if rc ne 0 then call missing(b_code, b_df, b_case1);
   else b_case1 = case1;

   /* now apply the additional logic; values from the left table (NEW) are
      already correct from the SET statement above */
   if flag = "X" then do;
      code = b_code;
      df   = b_df;
      if b_case1 ne "" then flag = "O";
   end;
run;
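
The macro is not shown here, but if you prefer to spell out the hash object yourself, a rough untested equivalent of the step above could look like the following. It assumes case1 and case2 uniquely identify the rows of OLD; if they do not, the hash needs multidata:'yes' and a loop over the matches:

data new2;
   set new (keep=case1 case2 case3 code df flag);

   /* give the renamed lookup columns the attributes of the originals in OLD */
   if 0 then set old (keep=code df rename=(code=b_code df=b_df));
   if 0 then b_case1 = case1;

   /* load OLD into a hash table once, with its data columns renamed */
   if _n_ = 1 then do;
      declare hash h(dataset: "old(rename=(code=b_code df=b_df))");
      h.defineKey('case1', 'case2');
      h.defineData('b_code', 'b_df');
      h.defineDone();
      call missing(b_code, b_df, b_case1);
   end;

   /* look up the current case1/case2; clear the b_ columns when there is no match */
   rc = h.find();
   if rc ne 0 then call missing(b_code, b_df, b_case1);
   else b_case1 = case1;

   /* same CASE WHEN logic as the PROC SQL step */
   if flag = "X" then do;
      code = b_code;
      df   = b_df;
      if b_case1 ne "" then flag = "O";
   end;

   drop rc b_code b_df b_case1;
run;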

 


5 REPLIES
jimbarbour
Meteorite | Level 14

Hash tables are generally only used in DATA steps.  I'm not familiar with the use of hash tables in Proc SQL although maybe you could do it if you used Proc FCMP to create a compiled function.

 

Hash tables do work, and sometimes they speed things up but not always.  If the tables are both large and both have about the same number of rows, I don't think hash tables are going to speed things up.  You may also run out of memory.

 

What problems are you having with your SQL?  Can you describe what is missing or incorrect in your output?  Can you post some sample data?  Are there any error messages?

 

Jim

eroolpal
Fluorite | Level 6

Thank you for the response. I should have clarified this a bit more. The PROC SQL code works fine; however, I am dealing with millions of rows of data and the PROC SQL takes more than 10 minutes to run. I was trying out hash tables in DATA steps to see if I can speed up the process, but I am not familiar with them. I can do the merge with a hash object, but how do I handle PROC SQL constructs such as "case when" when merging tables with the hash object? Here is a sample of the tables I am using:

NEW table

case1  code  df  case2     case3     flag
1      5     N   2-Jan-10  4-Feb-10  X
1      5     N   1-Jan-10  7-Mar-10  X
2      5     Y   1-Feb-10  6-Apr-10  MAR
3      6     Y   1-Apr-10  7-May-10  X
3      6     N   2-Apr-10  5-Jun-10  X
4      6     N   1-Jul-10  8-Jul-10  FEB
5      7     Y   1-Aug-10  7-Sep-10  X
6      7     N   1-Sep-10  6-Oct-10  X
7      7     N   1-Oct-10  4-Nov-10  X
8      7     N   1-Nov-10  8-Dec-10  X

 

OLD table

case1  code  df  case2     case3     flag
1      6     N   2-Jan-10  4-Feb-10  O
1      6     Y   1-Jan-10  7-Mar-10  O
2      6     N   1-Feb-10  6-Apr-10  JAN
3      6     Y   1-Apr-10  7-May-10  O
3      5     N   2-Apr-10  5-Jun-10  O
4      6     N   1-Jul-10  8-Jul-10  AUG
5      7     N   1-Aug-10  7-Sep-10  O
6      7     Y   1-Sep-10  6-Oct-10  O
7      9     N   1-Oct-10  4-Nov-10  O
8      9     Y   1-Nov-10  8-Dec-10  O

ChrisNZ
Tourmaline | Level 20

1. PROC SQL uses hash tables when it sees fit. Use the _method option to see what is happening behind the scenes.

2. You can use the IFC or IFN functions if you want something closer to CASE-WHEN syntax (see the sketch after point 3), though if you can master the hash table syntax, you can certainly master the IF statement.

3. For a fast merge, keep your source tables sorted. 
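
To illustrate points 1 and 2 with an untested sketch of my own (the table names CHECK and WANT are just placeholders, and it assumes code is numeric while df and flag are character):

/* point 1: _METHOD makes PROC SQL report its chosen join strategy in the log
   (e.g. sqxjhsh = hash join) */
proc sql _method;
   create table check as
   select a.*, b.code as b_code, b.df as b_df
   from new as a
   left join old as b
   on a.case1 = b.case1 and a.case2 = b.case2;
quit;

/* point 2: IFN/IFC are one-line CASE-WHEN equivalents inside a data step */
data want;
   set check;
   code = ifn(flag ne "X", code, b_code);   /* numeric result   */
   df   = ifc(flag ne "X", df,   b_df);     /* character result */
run;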

 


ChrisNZ
Tourmaline | Level 20

The 1.3 ratio is probably optimistic. Hash tables are rather wasteful of RAM.

The worst case scenario is with a key and data of $1, still using 48 bytes per row on a 64-bit system.
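
If you want to see the actual per-row cost for your own lookup table, the hash object's ITEM_SIZE and NUM_ITEMS attributes report it. A quick untested check, assuming the OLD table posted above:

data _null_;
   /* host variables for the hash columns; this SET never actually executes */
   if 0 then set old (keep=case1 case2 code df);

   declare hash h(dataset:'old');
   h.defineKey('case1', 'case2');
   h.defineData('code', 'df');
   h.defineDone();

   bytes_per_item = h.item_size;                  /* bytes SAS uses per hash entry */
   total_bytes    = bytes_per_item * h.num_items;
   put bytes_per_item= total_bytes= comma20.;

   stop;   /* no SET statement executes, so stop the step explicitly */
run;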


Discussion stats
  • 5 replies
  • 1607 views
  • 1 like
  • 4 in conversation