BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Myurathan
Quartz | Level 8

Hi Experts, 

 

I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.

 

Proc SQL;

create table want as
Select
 a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;

Sample data:

DCS table

Snapshot_dateSourcekeySourcesystemidAmountRiskbucket
01/01/2000958-985556989
02/01/2000958-985532560
03/01/2000958-985532545
04/01/2000958-985585236
01/01/2000958-986756989
02/01/2000958-986732560
03/01/2000958-986732545
04/01/2000958-986785236
01/01/2000958-989256989
02/01/2000958-989232560
03/01/2000958-989232545
04/01/2000958-989285236
01/01/2000958-989656989
02/01/2000958-989632560
03/01/2000958-989632545
04/01/2000958-989285236

 

 

Wiki table

SourcekeySourcesystemid
958-9855
958-9867
958-9995
958-9892

 

Results table

Snapshot_dateSourcekeySourcesystemidAmountRiskbucket
01/01/2000958-985556989
02/01/2000958-985532560
03/01/2000958-985532545
04/01/2000958-985585236
01/01/2000958-986756989
02/01/2000958-986732560
03/01/2000958-986732545
04/01/2000958-986785236
01/01/2000958-989256989
02/01/2000958-989232560
03/01/2000958-989232545
04/01/2000958-989285236

 

Since these tables are large, I think HASH technique will help me to get the results quickly. 

 

Thanks in advance.

 

Regards,

Myu

1 ACCEPTED SOLUTION

Accepted Solutions
RichardDeVen
Barite | Level 11
  • Hash the 1M keys from Wiki
  • Implicitly loop DCS with SET
  • Output only on key found (.check() method)

Example:

 

Data:

Spoiler
data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

 

View solution in original post

5 REPLIES 5
RichardDeVen
Barite | Level 11
  • Hash the 1M keys from Wiki
  • Implicitly loop DCS with SET
  • Output only on key found (.check() method)

Example:

 

Data:

Spoiler
data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

 

PGStats
Opal | Level 21

@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.

PG
Myurathan
Quartz | Level 8
@RichardADeVenezia Thank you so much for your help.
followup question: if I want to do left join and wanna create a variable (wiki_Flag) that indicate that the contract is available in wiki table. How can I achieve this using HASH?
Thanks in advance.
RichardDeVen
Barite | Level 11

The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.

 

An edit of the source code line from 

  if wiki_lookup.check() = 0;

to

  wiki_flag = (wiki_lookup.check() = 0);

will change the program flow from a subsetting-if to a flag variable assignment

ChrisNZ
Tourmaline | Level 20

1. How can I achieve this using HASH?

Change

  if wiki_lookup.check() = 0;

to

  WIKI_FLAG = ( wiki_lookup.check() = 0 );

 

2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1852 views
  • 5 likes
  • 4 in conversation