Solved: Re: Inner join with keep variables from only one table in HASH

Myurathan · Posted 07-24-2020 11:26 AM

Hi Experts,

I have two tables, Table DCS (2 billion rows) and table Wiki (1 million rows). I would like to translate the following SQL code into SAS Hash.

Proc SQL;

create table want as
Select
 a.*
from DCS as a
inner join Wiki as b
on a.sourcekey = b.S_Key and a.sourcesystemid = b.s_ID;
quit;

Sample data:

DCS table

Snapshot_date	Sourcekey	Sourcesystemid	Amount	Riskbucket
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6

Wiki table

Sourcekey	Sourcesystemid
958-985	5
958-986	7
958-999	5
958-989	2

Results table

Snapshot_date	Sourcekey	Sourcesystemid	Amount	Riskbucket
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6

Since these tables are large, I think HASH technique will help me to get the results quickly.

Thanks in advance.

Regards,

Myu

RichardDeVen · Posted 07-24-2020 12:16 PM

Hash the 1M keys from Wiki
Implicitly loop DCS with SET
Output only on key found (.check() method)

Example:

Data:

Spoiler

data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

View solution in original post

RichardDeVen · Posted 07-24-2020 12:16 PM

Hash the 1M keys from Wiki
Implicitly loop DCS with SET
Output only on key found (.check() method)

Example:

Data:

Spoiler

data DCS;
input Snapshot_date mmddyy10.	Sourcekey	$ Sourcesystemid	Amount	Riskbucket;
format Snapshot_date mmddyy10.;
datalines;
01/01/2000	958-985	5	5698	9
02/01/2000	958-985	5	3256	0
03/01/2000	958-985	5	3254	5
04/01/2000	958-985	5	8523	6
01/01/2000	958-986	7	5698	9
02/01/2000	958-986	7	3256	0
03/01/2000	958-986	7	3254	5
04/01/2000	958-986	7	8523	6
01/01/2000	958-989	2	5698	9
02/01/2000	958-989	2	3256	0
03/01/2000	958-989	2	3254	5
04/01/2000	958-989	2	8523	6
01/01/2000	958-989	6	5698	9
02/01/2000	958-989	6	3256	0
03/01/2000	958-989	6	3254	5
04/01/2000	958-989	2	8523	6
;

data Wiki;
input Sourcekey $	Sourcesystemid;
datalines;
958-985	5
958-986	7
958-999	5
958-989	2
;

data DCS; input Snapshot_date mmddyy10. Sourcekey $ Sourcesystemid Amount Riskbucket; format Snapshot_date mmddyy10.; datalines; 01/01/2000 958-985 5 5698 9 02/01/2000 958-985 5 3256 0 03/01/2000 958-985 5 3254 5 04/01/2000 958-985 5 8523 6 01/01/2000 958-986 7 5698 9 02/01/2000 958-986 7 3256 0 03/01/2000 958-986 7 3254 5 04/01/2000 958-986 7 8523 6 01/01/2000 958-989 2 5698 9 02/01/2000 958-989 2 3256 0 03/01/2000 958-989 2 3254 5 04/01/2000 958-989 2 8523 6 01/01/2000 958-989 6 5698 9 02/01/2000 958-989 6 3256 0 03/01/2000 958-989 6 3254 5 04/01/2000 958-989 2 8523 6 ; data Wiki; input Sourcekey $ Sourcesystemid; datalines; 958-985 5 958-986 7 958-999 5 958-989 2 ;

Code:

data want;
  if _n_ = 1 then do;
    declare hash wiki_lookup(dataset:'wiki');
    wiki_lookup.defineKey('Sourcekey', 'Sourcesystemid');
    wiki_lookup.defineDone();
  end;

  set DCS;

  if wiki_lookup.check() = 0;
run;

Note, your sample data has a last row

04/01/2000	958-989	2	8523	6

that is a duplicate of an earlier row

PGStats · Posted 07-24-2020 12:32 PM

@RichardDeVen proposed a very good solution. You may also want to add the argument HASHEXP:9 or HASHEXP:10 to the DECLARE statement to increase the hash table size and make it more efficient when dealing with a million entries.

PG

Myurathan · Posted 07-25-2020 05:30 AM

@RichardADeVenezia Thank you so much for your help.
followup question: if I want to do left join and wanna create a variable (wiki_Flag) that indicate that the contract is available in wiki table. How can I achieve this using HASH?
Thanks in advance.

RichardDeVen · Posted 07-25-2020 07:08 AM

The hash method check returns 0 when a hash key corresponding to the current PDV variables exists.

An edit of the source code line from

  if wiki_lookup.check() = 0;

to

  wiki_flag = (wiki_lookup.check() = 0);

will change the program flow from a subsetting-if to a flag variable assignment

ChrisNZ · Posted 07-25-2020 07:14 AM

1. How can I achieve this using HASH?

Change

  if wiki_lookup.check() = 0;

to

  WIKI_FLAG = ( wiki_lookup.check() = 0 );

2. Scanning every row of a large table takes time. If the large table is indexed, retrieving 1/2000th of the table using indexes might be faster.

High-Performance SAS Coding - Third Edition

Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Registration is open

Call for Content EXTENDED

Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Re: Inner join with keep variables from only one table in HASH

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away