DATA Step, Macro, Functions and more

Hash join more than two tables

Reply
Super Contributor
Posts: 271

Hash join more than two tables

I want merge three tables. In order to avoid sort procedure,I am inclined to use hash method. I have sucessfully used it for joining two tables. I wonder if hash table can be used to join more than two tables.

Super User
Posts: 5,430

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive
Yes, I'm almost 100% sure.
In DI Studio the Lookup transformation uses hash tables, and are able to do lookup on several tables simultaneously.
Data never sleeps
Super Contributor
Posts: 271

Re: Hash join more than two tables

@LinusH I don't  have access to DI studio.  Is it doable in Base SAS?

Respected Advisor
Posts: 3,156

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive

Yes.

Super User
Posts: 5,430

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive
Absolutely. I was just bringing up the code generation from DI Studio as an example. Almost all code generated is in Base SAS code.
Data never sleeps
Super Contributor
Posts: 271

Re: Hash join more than two tables

@Haikuo and @LinusH  Can you give an example (or DI generated code) so that I can apply in my cases?  Thanks !

Respected Advisor
Posts: 4,173

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive

@SAS_inquisitive

If you know how to define and use one hash table then what holds you back to simply define and use a second hash table as well?

You can load as much data into as many hash tables as you wish. The only restriction is memory available.

Super Contributor
Posts: 271

Re: Hash join more than two tables

@Patrick  I want to make sure I am doing correctly. I looked some papers (Paul Dorfman's), most of them have only two tables.

Respected Advisor
Posts: 3,156

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive

@SAS_inquisitive: Have you tried yet? If you have and it did not work the way you want, show us what you have tried (code) and what didn't work (log). If you haven't tried, now it is the time for you to do so.

Super Contributor
Posts: 271

Re: Hash join more than two tables

[ Edited ]

@Haikuo I have not tried on more than two tables yet.  I will try to ceate some mock data sets (rather than real data sets) and will post the code. Here is one mock example.

This does not give desired result.

data a;
	input id x y;
	cards;
1 3 4
2 5 6
3 5 6
4 7 9
5 3 2
;

data b;
	input id x y;
	cards;
1 7 8
2 3 4
3 1 3
7 8 4
8 1 6
;

data c;
	input id x y;
	cards;
1 7 8
2 3 4
3 1 3
9 8 4
7 1 6
;

data test;
	if 0 then
		set a;
	if 0 then 
	    set b;

	if _n_=1 then
		do;
			dcl hash h1(dataset:'a');
			h1.definekey('id');
			h1.definedata('x','y');
				h1.definedone();

			dcl hash h2(dataset:'b');
			h2.definekey('id');
			h2.definedata('x','y');
				h2.definedone();
		end;

	set c;
	if h1.find() and h2.find() then output;
run;

 

Respected Advisor
Posts: 3,156

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive

Ok, that was nice try Smiley Happy. Couple of comments before SAS code:

It is hard for me to guess the purpose of your code without seeing your expected outcome. So here is my try:

1. From your code, you seem trying to do a SQL equivalent of 'inner join' , however, your code is doing it exclusively instead. the commented part of the code will do a inclusive inner join.

2. Since 3 of your incoming tables share exact variable names, so what you wound up doing is pdv is first overidden by h1 then by h2. What you want is probably to rename some of the variables to see some obvious outcome.

 

data a;
	input id x_A y_A;
	cards;
1 3 4
2 5 6
3 5 6
4 7 9
5 3 2
;

data b;
	input id x_B y_B;
	cards;
1 7 8
2 3 4
3 1 3
7 8 4
8 1 6
;

data c;
	input id x_C y_C;
	cards;
1 7 8
2 3 4
3 1 3
9 8 4
7 1 6
;

data test;
	if 0 then
		set a B;

	/*	if 0 then */
	/*	    set b;*/
	if _n_=1 then
		do;
			dcl hash h1(dataset:'a');
			h1.definekey('id');
	h1.definedata(ALL:
			'Y');
			h1.definedone();
			dcl hash h2(dataset:'b');
			h2.definekey('id');
	h2.definedata(ALL:
			'y');
			h2.definedone();
		end;

	set c;

	if h1.find() and h2.find() then
		output;

	/*		 if h1.find()=0 and h2.find()=0 then output;*/
run;

Going bed right now, if you have further questions, I may find some time tomorrow.

Super Contributor
Posts: 271

Re: Hash join more than two tables

@Haikuo Thank you.  Have a good night.

Respected Advisor
Posts: 4,173

Re: Hash join more than two tables

Posted in reply to SAS_inquisitive

@Haikuo states that the expected result should be the equivalent to a SQL inner join but then changes the selection logic so that it's no more an inner join. Must have been late... 

 

I would code as below.

data test;
  set c;
  if _n_=1 then
    do;
      if 0 then set a;
      dcl hash h1(dataset:'a');
      h1.definekey('id');
      h1.definedata(ALL:'Y');
      h1.definedone();

      if 0 then set b;
      dcl hash h2(dataset:'b');
      h2.definekey('id');
      h2.definedata(ALL:'y');
      h2.definedone();
    end;
  if h1.find()=0 and h2.find()=0 then output;
run;

The reason for having "set c;" first is that in case there are same named variables in the source datasets the definition from base table "c" will get used for the output dataset.

 

In case the relationship between the tables can be 0:N and not only 0:1 then you would also have to use multidata:'y' for the hash definition and code for some looping over the hash as well in case there is a match where there are multiple entries for the same key in the hash.

 

Ask a Question
Discussion stats
  • 12 replies
  • 380 views
  • 0 likes
  • 4 in conversation