topic Re: Hash join to replace SQL Left Join in SAS Programming

Hash join to replace SQL Left Join

tommy81 — Wed, 12 Jun 2013 05:34:47 GMT

Hi,

I`m looking for an equivalent of the below code in Data step HASH join.

proc sql;

create table rllp2 as

select distinct a.*, b.enam as name,b.position,b.orid,b.ccid_1 as ccid,b.region1 as region

max(a.BEGD,b.BEGD) as BEGD,min(a.ENDD,b.ENDD) as ENDD

from rllp3 as a

left join

rllp1 as b

on a.position_id_k = b.position_id_1

and ((a.BEGD BETWEEN b.BEGD and b.ENDD)

or (a.ENDD BETWEEN b.BEGD and b.ENDD) or (a.BEGD >= b.BEGD
and a.ENDD <= b.ENDD) or (a.BEGD <= b.BEGD and a.ENDD >= b.ENDD))

order by position,BEGD,ENDD;

Not sure how to add the distinct ,extra join conditions and order by in a single datas tep.

DATA MI ;

IF 0 THEN SET rllp1;

DECLARE hash VS(hashexp:10, dataset:"rllp1");

VS.definekey('position_id_k,'position_id_1');

VS.definedata(all:'Y');

VS.definedone();

DO UNTIL(eof);

SET rllp3end=eof;

IF VS.find()=0 THEN OUTPUT;

ELSE DO;

CALL missing( a1---b1);

OUTPUT;

END;

STOP ;

RUN;

Kindly suggest some methods

Re: Hash join to replace SQL Left Join

LinusH — Wed, 12 Jun 2013 11:50:35 GMT

i don't think I can help about hashing, but maybe you could tell us about why? Is it a performance issue regarding the SQL?

It's quite hard for SAS SQL to optimize OR conditions. Storing in SPDE might improve performance (or SPD Server if available), and indexing on at least position_id_1 and position_id_k.

Then I'm not sure how distinct works with max and min functions. I am more comfortable using Group By, but distinct may work fine for you...

Re: Hash join to replace SQL Left Join

DBailey — Wed, 12 Jun 2013 12:09:27 GMT

Can you rewrite your clause to say

left join

rllp1 as b

on a.position_id_k = b.position_id_1

and a.BEGD <= b.ENDD

and a.ENDD >= b.BEGD

I didn't work through all of the permutations but that is what I see normally when trying to make sure you count everything that may have happened during a date span.

Re: Hash join to replace SQL Left Join

tommy81 — Wed, 12 Jun 2013 12:38:37 GMT

Hi LinusH,

Its about performance and time. The datasets are 100Gb in size.That why i though its better to deploy data step hash.

Re: Hash join to replace SQL Left Join

AhmedAl_Attar — Wed, 12 Jun 2013 13:12:26 GMT

tommy81,

When your data set gets into 30GB+ , that's when you move it into SPDE/SPDS!!

After all, that's why SPDE was developed in the first place. it was the "free" solution to handle large data sets using your existing BASE SAS. I know BASE SAS is not Free, but you don't have to purchase and configure SPDS to handle such volumes.

I would strongly recommend investigating SPDE as storage engine for your data set. It will payoff down the line.

Ahmed

Re: Hash join to replace SQL Left Join

tommy81 — Thu, 13 Jun 2013 05:51:58 GMT

I agree to the SPDS point, however I need to know how much of a difference it could make if I could use a data step hash to the above step . I`m certain data vloume cannot be controlled much, but at least run times can be improved.

Re: Hash join to replace SQL Left Join

LinusH — Thu, 13 Jun 2013 09:02:33 GMT

I think it will very complicated to do both the join logic and distinct/group by functionality within one single data step.

So my suggestion is to first try to optimize your SQL query first to see what gains is possible, t sum up:

simplify the SQL suggested by DBailey
Index on the join column(s)
move the data to SPDE

Since SQL operations such as summarizing and sorting is done in paralel, investigate if you can exploit the hardware (or have any other HW available). This includes setting up MEMSIZE/SORTSIZE options.