DATA Step, Macro, Functions and more

Multiple SQL Joins using Hash?

Reply
New Contributor
Posts: 3

Multiple SQL Joins using Hash?

Proc sql;

create table JOIN as
select 
	a.key1, a.x1, a.x2, a.x3, 
	b.y1, b.y2, b.y3, b.y4,
	c.z1, c.z2, c.z3
	d.p1, d.p2
	e.t1, e.t2, e.t3, e.t4, 
	f.n1, f.n2, 
	g.w1, g.w2, g.w3

from DATASET1 a

	join DATASET2 b
	on b.y1  between '01aug2017'd and '31aug2017'd
	and a.key1 = b.key1
	and b.y3 = 'D'
	and b.y4 = 'P'

	join DATASET3 c
	on b.y3 = c.z1

	join DATASET4 d
	on c.key1 = d.key1
	and d.p2 = '31jul2017'd

	join DATASET5 e
	on d.key2 = e.key1

	join DATASET6 g
	on g.w1 in ('123')
	and e.t2 between '01aug2017'd and '31aug2017'd

	join DATASET7 f
	on c.key2 = f.key1

where a.x1 between '01aug2017'd and '31aug2017'd
and a.x2 = 'C'
and a.x3 in ('123');

Quit;

Hey everyone, I'm fairly new to SAS. I'm trying to join multiple datasets together on several different keys, and meeting certain criteria. I've posted my code and it does exactly what I want it to do... in about 80 seconds. Some of the tables (namely table e) are HUGE. Here's my issue: I need to cut down on the process time, because I need to loop this process several thousand times. I have the macro to do it, but it would take too long. I've read online that hash tables might be the solution but I can't wrap my head around how to code it. Any help would be appreciated! Thanks. (Using SAS EG 7.1)

 

Super User
Posts: 9,878

Re: Multiple SQL Joins using Hash?

[ Edited ]
Posted in reply to Jrunner1569

Your problem is not the time of the join, but that you loop it 1000 times. There's a fair chance that what you do can be solved by by-group processing.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Esteemed Advisor
Posts: 5,477

Re: Multiple SQL Joins using Hash?

Posted in reply to KurtBremser

Your query can be rewritten as:

 

Proc sql;

create table JOIN as
select 
	a.key1, a.x1, a.x2, a.x3, 
	b.y1, b.y2, b.y3, b.y4,
	c.z1, c.z2, c.z3
	d.p1, d.p2
	e.t1, e.t2, e.t3, e.t4, 
	f.n1, f.n2, 
	g.w1, g.w2, g.w3

from 
    DATASET1 a                                join
    DATASET2 b on a.key1 = b.key1 join
    DATASET3 c                                join
    DATASET4 d on c.key1 = d.key1 join
    DATASET5 e on d.key2 = e.key1 join
    DATASET7 f on c.key2 = f.key1   join
    DATASET6 g
    
where 
    a.x2 = 'C'                                 and
    b.y3 = 'D'                                 and
    b.y4 = 'P'                                 and
    c.z1 = 'D'                                 and
    a.x3 in ('123')                            and
    g.w1 in ('123')                            and
    d.p2 = '31jul2017'd                        and
    a.x1 between '01aug2017'd and '31aug2017'd and
    b.y1 between '01aug2017'd and '31aug2017'd and  
    e.t2 between '01aug2017'd and '31aug2017'd
    ;    
Quit;

 which means that you are actually asking for

 

 

(a*b) x (c*d*e*f) x g

 

where * is a join and x is a cartesian product. Are you certain that this is what you want?

PG
New Contributor
Posts: 3

Re: Multiple SQL Joins using Hash?

So I took your code suggestion (added 'DATASET6 g on a.x3 = g.w1') and it runs about 30 seconds faster (great!) and produces the same result after doing a quick proc-compare. Are there further enhancements now that you know what I'm after?
PROC Star
Posts: 2,311

Re: Multiple SQL Joins using Hash?

[ Edited ]
Posted in reply to Jrunner1569

I agree with @KurtBremser.

* What varies between the 1000 runs? *

If you read the same tables several times, sort them (and/or index them, depending on what is needed) so they are ready to use by the merge process.

New Contributor
Posts: 3

Re: Multiple SQL Joins using Hash?

So essentially these tables are several billions of observations. I start with data from table A, but need the data from table E to be joined to it. E's data can only be gotten from a key field in table D. D's key field can only be gotten by matching on a key field in C....so on until I reach table A.
Esteemed Advisor
Posts: 5,477

Re: Multiple SQL Joins using Hash?

Posted in reply to Jrunner1569

Without knowing the details of your data structures, at this point, your best bet would b to include indexes on join keys and on where condition variables, as @ChrisNZ suggested.

 

More importantly, you should reconsider your need to execute this query thousands of times. There is most likely a more efficient way.

PG
Ask a Question
Discussion stats
  • 6 replies
  • 128 views
  • 0 likes
  • 4 in conversation