topic Multiple SQL Joins using Hash? in SAS Programming

Multiple SQL Joins using Hash?

Jrunner1569 — Mon, 21 May 2018 19:54:58 GMT

Proc sql;

create table JOIN as
select 
	a.key1, a.x1, a.x2, a.x3, 
	b.y1, b.y2, b.y3, b.y4,
	c.z1, c.z2, c.z3
	d.p1, d.p2
	e.t1, e.t2, e.t3, e.t4, 
	f.n1, f.n2, 
	g.w1, g.w2, g.w3

from DATASET1 a

	join DATASET2 b
	on b.y1  between '01aug2017'd and '31aug2017'd
	and a.key1 = b.key1
	and b.y3 = 'D'
	and b.y4 = 'P'

	join DATASET3 c
	on b.y3 = c.z1

	join DATASET4 d
	on c.key1 = d.key1
	and d.p2 = '31jul2017'd

	join DATASET5 e
	on d.key2 = e.key1

	join DATASET6 g
	on g.w1 in ('123')
	and e.t2 between '01aug2017'd and '31aug2017'd

	join DATASET7 f
	on c.key2 = f.key1

where a.x1 between '01aug2017'd and '31aug2017'd
and a.x2 = 'C'
and a.x3 in ('123');

Quit;

Hey everyone, I'm fairly new to SAS. I'm trying to join multiple datasets together on several different keys, and meeting certain criteria. I've posted my code and it does exactly what I want it to do... in about 80 seconds. Some of the tables (namely table e) are HUGE. Here's my issue: I need to cut down on the process time, because I need to loop this process several thousand times. I have the macro to do it, but it would take too long. I've read online that hash tables might be the solution but I can't wrap my head around how to code it. Any help would be appreciated! Thanks. (Using SAS EG 7.1)

Re: Multiple SQL Joins using Hash?

Kurt_Bremser — Mon, 21 May 2018 22:42:12 GMT

Your problem is not the time of the join, but that you loop it 1000 times. There's a fair chance that what you do can be solved by by-group processing.

Re: Multiple SQL Joins using Hash?

ChrisNZ — Tue, 22 May 2018 21:28:46 GMT

I agree with @Kurt_Bremser.

* What varies between the 1000 runs? *

If you read the same tables several times, sort them (and/or index them, depending on what is needed) so they are ready to use by the merge process.

Re: Multiple SQL Joins using Hash?

PGStats — Tue, 22 May 2018 02:54:11 GMT

Your query can be rewritten as:

Proc sql;

create table JOIN as
select 
	a.key1, a.x1, a.x2, a.x3, 
	b.y1, b.y2, b.y3, b.y4,
	c.z1, c.z2, c.z3
	d.p1, d.p2
	e.t1, e.t2, e.t3, e.t4, 
	f.n1, f.n2, 
	g.w1, g.w2, g.w3

from 
    DATASET1 a                                join
    DATASET2 b on a.key1 = b.key1 join
    DATASET3 c                                join
    DATASET4 d on c.key1 = d.key1 join
    DATASET5 e on d.key2 = e.key1 join
    DATASET7 f on c.key2 = f.key1   join
    DATASET6 g
    
where 
    a.x2 = 'C'                                 and
    b.y3 = 'D'                                 and
    b.y4 = 'P'                                 and
    c.z1 = 'D'                                 and
    a.x3 in ('123')                            and
    g.w1 in ('123')                            and
    d.p2 = '31jul2017'd                        and
    a.x1 between '01aug2017'd and '31aug2017'd and
    b.y1 between '01aug2017'd and '31aug2017'd and  
    e.t2 between '01aug2017'd and '31aug2017'd
    ;    
Quit;

which means that you are actually asking for

(a*b) x (c*d*e*f) x g

where * is a join and x is a cartesian product. Are you certain that this is what you want?

Re: Multiple SQL Joins using Hash?

Jrunner1569 — Tue, 22 May 2018 15:37:10 GMT

So I took your code suggestion (added 'DATASET6 g on a.x3 = g.w1') and it runs about 30 seconds faster (great!) and produces the same result after doing a quick proc-compare. Are there further enhancements now that you know what I'm after?

Re: Multiple SQL Joins using Hash?

Jrunner1569 — Tue, 22 May 2018 15:40:52 GMT

So essentially these tables are several billions of observations. I start with data from table A, but need the data from table E to be joined to it. E's data can only be gotten from a key field in table D. D's key field can only be gotten by matching on a key field in C....so on until I reach table A.

Re: Multiple SQL Joins using Hash?

PGStats — Tue, 22 May 2018 18:08:44 GMT

Without knowing the details of your data structures, at this point, your best bet would b to include indexes on join keys and on where condition variables, as @ChrisNZ suggested.

More importantly, you should reconsider your need to execute this query thousands of times. There is most likely a more efficient way.