Hi Experts,
I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.
Proc SQL;
create table want as
Select
a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;
Sample data:
DCS table
Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
01/01/2000 | 958-985 | 5 | 5698 | 9 |
02/01/2000 | 958-985 | 5 | 3256 | 0 |
03/01/2000 | 958-985 | 5 | 3254 | 5 |
04/01/2000 | 958-985 | 5 | 8523 | 6 |
01/01/2000 | 958-986 | 7 | 5698 | 9 |
02/01/2000 | 958-986 | 7 | 3256 | 0 |
03/01/2000 | 958-986 | 7 | 3254 | 5 |
04/01/2000 | 958-986 | 7 | 8523 | 6 |
01/01/2000 | 958-989 | 2 | 5698 | 9 |
02/01/2000 | 958-989 | 2 | 3256 | 0 |
03/01/2000 | 958-989 | 2 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
01/01/2000 | 958-989 | 6 | 5698 | 9 |
02/01/2000 | 958-989 | 6 | 3256 | 0 |
03/01/2000 | 958-989 | 6 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
Wiki table
Sourcekey | Sourcesystemid |
958-985 | 5 |
958-986 | 7 |
958-999 | 5 |
958-989 | 2 |
Results table
Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
01/01/2000 | 958-985 | 5 | 5698 | 9 |
02/01/2000 | 958-985 | 5 | 3256 | 0 |
03/01/2000 | 958-985 | 5 | 3254 | 5 |
04/01/2000 | 958-985 | 5 | 8523 | 6 |
01/01/2000 | 958-986 | 7 | 5698 | 9 |
02/01/2000 | 958-986 | 7 | 3256 | 0 |
03/01/2000 | 958-986 | 7 | 3254 | 5 |
04/01/2000 | 958-986 | 7 | 8523 | 6 |
01/01/2000 | 958-989 | 2 | 5698 | 9 |
02/01/2000 | 958-989 | 2 | 3256 | 0 |
03/01/2000 | 958-989 | 2 | 3254 | 5 |
04/01/2000 | 958-989 | 2 | 8523 | 6 |
Since these tables are large, I think HASH technique will help me to get the results quickly.
Thanks in advance.
Regards,
Myu
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want; if _n_ = 1 then do; declare hash wiki_lookup(dataset:'wiki'); wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid'); wiki_lookup.defineDone(); end; set DCS; if wiki_lookup.check() = 0; run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want; if _n_ = 1 then do; declare hash wiki_lookup(dataset:'wiki'); wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid'); wiki_lookup.defineDone(); end; set DCS; if wiki_lookup.check() = 0; run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.
The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.
An edit of the source code line from
if wiki_lookup.check() = 0;
to
wiki_flag = (wiki_lookup.check() = 0);
will change the program flow from a subsetting-if to a flag variable assignment
1. How can I achieve this using HASH?
Change
if wiki_lookup.check() = 0;
to
WIKI_FLAG = ( wiki_lookup.check() = 0 );
2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.