12-15-2016 07:37 AM - edited 12-15-2016 07:46 AM
During the Webcast 14. December I used an example on joining data using HASH object instead of MERGE statement in the DATA STEP. Link to the webcast here (the demo of SAS 9.4 M4 starts 40 minutes into the presentation)
If you want to learn this the right way I suggest you join the SAS Programming 3 course. There you will learn this technique and other smart way to reduce responce times on your SAS programs.
To prove that HASH object is farster I created this program. Let me explain it to you.
First we create four tables. In the ORDERS table we only 640 rows. Once you are ready to test this for real then change the DO LOOP and create 64 million rows. Then the job will take some time to execute, and the HASH object can be used.
data product; input product_id product $ @@; datalines; 1001 Table 1002 Door 1003 Chair 1004 Sofa run; data country; input country_id country $ @@; datalines; 11 Denmark 12 Norway 13 Sweden 14 Germany run; data customer; input customer_id customer $ @@; datalines; 101 Hans 102 Pia 103 Ole 104 Lis run; data orders; do customer_id = 101 to 104; do country_id = 11 to 14; do product_id = 1001 to 1004; * Create 1.000.000 rows when doing the test; do id = 1 to 10; sale = Round(ranuni(0)*1000); output; end; end; end; end; drop id; run;
Now we need to join the four tables. Because we have three different join keys this needs to happen in many steps. Between each step we need to sort data, and that is taking long time.
proc sort data=orders; by customer_id; proc sort data=customer; by customer_id; data result; merge customer orders; by customer_id; run; proc sort data=result; by country_id; proc sort data=country; by country_id; data result; merge country result; by country_id; run; proc sort data=result; by product_id; proc sort data=product; by product_id; data result; merge product result; by product_id; run;
When executed on my PC the above code took 562 seconds with 64 million rows in the ORDERS table. When using the HASH object in the DATA STEP it only took 101 seconds - the code below.
data hash_result; if _n_=1 then do; length product customer country $ 8; declare hash A(dataset:"country"); A.definekey("country_id"); A.definedata("country"); A.definedone(); declare hash B(dataset:"product"); B.definekey("product_id"); B.definedata("product"); B.definedone(); declare hash C(dataset:"customer"); C.definekey("customer_id"); C.definedata("customer"); C.definedone(); end; set orders; rc1=A.find(); rc2=B.find(); rc3=C.find(); if rc1=0 and rc2=0 and rc3=0 then output; drop rc:; run;
The code above creates 3 HASH tables in memory, then it reads the large 64 million row table and perform 3 lookups into the 3 HASH objects. No sorting is performed and that is why it goes much faster.
Consider HASH objects next time you have a program where you are doing sorting prior to a join and especially if you have to sort many times.