BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Myurathan
Quartz | Level 8

Hi Experts, 

 

I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.

 

Proc SQL;

create table want as
Select
 a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;

Sample data:

DCS table

Snapshot_dateSourcekeySourcesystemidAmountRiskbucket
01/01/2000958-985556989
02/01/2000958-985532560
03/01/2000958-985532545
04/01/2000958-985585236
01/01/2000958-986756989
02/01/2000958-986732560
03/01/2000958-986732545
04/01/2000958-986785236
01/01/2000958-989256989
02/01/2000958-989232560
03/01/2000958-989232545
04/01/2000958-989285236
01/01/2000958-989656989
02/01/2000958-989632560
03/01/2000958-989632545
04/01/2000958-989285236

 

 

Wiki table

SourcekeySourcesystemid
958-9855
958-9867
958-9995
958-9892

 

Results table

Snapshot_dateSourcekeySourcesystemidAmountRiskbucket
01/01/2000958-985556989
02/01/2000958-985532560
03/01/2000958-985532545
04/01/2000958-985585236
01/01/2000958-986756989
02/01/2000958-986732560
03/01/2000958-986732545
04/01/2000958-986785236
01/01/2000958-989256989
02/01/2000958-989232560
03/01/2000958-989232545
04/01/2000958-989285236

 

Since these tables are large, I think HASH technique will help me to get the results quickly. 

 

Thanks in advance.

 

Regards,

Myu

1 ACCEPTED SOLUTION

Accepted Solutions
RichardDeVen
Barite | Level 11
  • Hash the 1M keys from Wiki
  • Implicitly loop DCS with SET
  • Output only on key found (.check() method)

Example:

 

Data:

Spoiler
data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

 

View solution in original post

5 REPLIES 5
RichardDeVen
Barite | Level 11
  • Hash the 1M keys from Wiki
  • Implicitly loop DCS with SET
  • Output only on key found (.check() method)

Example:

 

Data:

Spoiler
data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

 

PGStats
Opal | Level 21

@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.

PG
Myurathan
Quartz | Level 8
@RichardADeVenezia Thank you so much for your help.
followup question: if I want to do left join and wanna create a variable (wiki_Flag) that indicate that the contract is available in wiki table. How can I achieve this using HASH?
Thanks in advance.
RichardDeVen
Barite | Level 11

The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.

 

An edit of the source code line from 

  if wiki_lookup.check() = 0;

to

  wiki_flag = (wiki_lookup.check() = 0);

will change the program flow from a subsetting-if to a flag variable assignment

ChrisNZ
Tourmaline | Level 20

1. How can I achieve this using HASH?

Change

  if wiki_lookup.check() = 0;

to

  WIKI_FLAG = ( wiki_lookup.check() = 0 );

 

2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 948 views
  • 5 likes
  • 4 in conversation