Hi Experts,
I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.
Proc SQL;
create table want as
Select
a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;Sample data:
DCS table
| Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
| 01/01/2000 | 958-985 | 5 | 5698 | 9 |
| 02/01/2000 | 958-985 | 5 | 3256 | 0 |
| 03/01/2000 | 958-985 | 5 | 3254 | 5 |
| 04/01/2000 | 958-985 | 5 | 8523 | 6 |
| 01/01/2000 | 958-986 | 7 | 5698 | 9 |
| 02/01/2000 | 958-986 | 7 | 3256 | 0 |
| 03/01/2000 | 958-986 | 7 | 3254 | 5 |
| 04/01/2000 | 958-986 | 7 | 8523 | 6 |
| 01/01/2000 | 958-989 | 2 | 5698 | 9 |
| 02/01/2000 | 958-989 | 2 | 3256 | 0 |
| 03/01/2000 | 958-989 | 2 | 3254 | 5 |
| 04/01/2000 | 958-989 | 2 | 8523 | 6 |
| 01/01/2000 | 958-989 | 6 | 5698 | 9 |
| 02/01/2000 | 958-989 | 6 | 3256 | 0 |
| 03/01/2000 | 958-989 | 6 | 3254 | 5 |
| 04/01/2000 | 958-989 | 2 | 8523 | 6 |
Wiki table
| Sourcekey | Sourcesystemid |
| 958-985 | 5 |
| 958-986 | 7 |
| 958-999 | 5 |
| 958-989 | 2 |
Results table
| Snapshot_date | Sourcekey | Sourcesystemid | Amount | Riskbucket |
| 01/01/2000 | 958-985 | 5 | 5698 | 9 |
| 02/01/2000 | 958-985 | 5 | 3256 | 0 |
| 03/01/2000 | 958-985 | 5 | 3254 | 5 |
| 04/01/2000 | 958-985 | 5 | 8523 | 6 |
| 01/01/2000 | 958-986 | 7 | 5698 | 9 |
| 02/01/2000 | 958-986 | 7 | 3256 | 0 |
| 03/01/2000 | 958-986 | 7 | 3254 | 5 |
| 04/01/2000 | 958-986 | 7 | 8523 | 6 |
| 01/01/2000 | 958-989 | 2 | 5698 | 9 |
| 02/01/2000 | 958-989 | 2 | 3256 | 0 |
| 03/01/2000 | 958-989 | 2 | 3254 | 5 |
| 04/01/2000 | 958-989 | 2 | 8523 | 6 |
Since these tables are large, I think HASH technique will help me to get the results quickly.
Thanks in advance.
Regards,
Myu
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want;
if _n_ = 1 then do;
declare hash wiki_lookup(dataset:'wiki');
wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
wiki_lookup.defineDone();
end;
set DCS;
if wiki_lookup.check() = 0;
run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
Example:
Data:
data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;
Code:
data want;
if _n_ = 1 then do;
declare hash wiki_lookup(dataset:'wiki');
wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
wiki_lookup.defineDone();
end;
set DCS;
if wiki_lookup.check() = 0;
run;
Note, your sample data has a last row
04/01/2000 958-989 2 8523 6
that is a duplicate of an earlier row
@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.
The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.
An edit of the source code line from
if wiki_lookup.check() = 0;
to
wiki_flag = (wiki_lookup.check() = 0);
will change the program flow from a subsetting-if to a flag variable assignment
1. How can I achieve this using HASH?
Change
if wiki_lookup.check() = 0;
to
WIKI_FLAG = ( wiki_lookup.check() = 0 );
2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.