Hi Experts,
I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.
Proc SQL;
create table want as
Select
a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;
Sample data:
DCS table
Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
01/01/2000 | 958-985 | 5 | 5698 | 9 |
02/01/2000 | 958-985 | 5 | 3256 | 0 |
03/01/2000 | 958-985 | 5 | 3254 | 5 |
04/01/2000 | 958-985 | 5 | 8523 | 6 |
01/01/2000 | 958-986 | 7 | 5698 | 9 |
02/01/2000 | 958-986 | 7 | 3256 | 0 |
03/01/2000 | 958-986 | 7 | 3254 | 5 |
04/01/2000 | 958-986 | 7 | 8523 | 6 |
01/01/2000 | 958-989 | 2 | 5698 | 9 |
02/01/2000 | 958-989 | 2 | 3256 | 0 |
03/01/2000 | 958-989 | 2 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
01/01/2000 | 958-989 | 6 | 5698 | 9 |
02/01/2000 | 958-989 | 6 | 3256 | 0 |
03/01/2000 | 958-989 | 6 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
Wiki table
Sourcekey | Sourcesystemid |
958-985 | 5 |
958-986 | 7 |
958-999 | 5 |
958-989 | 2 |
Results table
Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
01/01/2000 | 958-985 | 5 | 5698 | 9 |
02/01/2000 | 958-985 | 5 | 3256 | 0 |
03/01/2000 | 958-985 | 5 | 3254 | 5 |
04/01/2000 | 958-985 | 5 | 8523 | 6 |
01/01/2000 | 958-986 | 7 | 5698 | 9 |
02/01/2000 | 958-986 | 7 | 3256 | 0 |
03/01/2000 | 958-986 | 7 | 3254 | 5 |
04/01/2000 | 958-986 | 7 | 8523 | 6 |
01/01/2000 | 958-989 | 2 | 5698 | 9 |
02/01/2000 | 958-989 | 2 | 3256 | 0 |
03/01/2000 | 958-989 | 2 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
Since these tables are large, I think HASH technique will help me to get the results quickly.
Thanks in advance.
Regards,
Myu
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want; if _n_ = 1 then do; declare hash wiki_lookup(dataset:'wiki'); wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid'); wiki_lookup.defineDone(); end; set DCS; if wiki_lookup.check() = 0; run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want; if _n_ = 1 then do; declare hash wiki_lookup(dataset:'wiki'); wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid'); wiki_lookup.defineDone(); end; set DCS; if wiki_lookup.check() = 0; run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.
The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.
An edit of the source code line from
if wiki_lookup.check() = 0;
to
wiki_flag = (wiki_lookup.check() = 0);
will change the program flow from a subsetting-if to a flag variable assignment
1. How can I achieve this using HASH?
Change
if wiki_lookup.check() = 0;
to
WIKI_FLAG = ( wiki_lookup.check() = 0 );
2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.