Hash join to replace SQL Left Join

Reply
Frequent Contributor
Posts: 83

Hash join to replace SQL Left Join

Hi,

I`m looking for an equivalent of the below code in Data step HASH join.


proc sql;

create table rllp2 as

select distinct a.*, b.enam as name,b.position,b.orid,b.ccid_1 as ccid,b.region1 as region

max(a.BEGD,b.BEGD) as BEGD,min(a.ENDD,b.ENDD) as ENDD

from rllp3 as a

left join

rllp1 as b

on a.position_id_k = b.position_id_1

and ((a.BEGD BETWEEN b.BEGD and  b.ENDD)

or (a.ENDD BETWEEN b.BEGD and  b.ENDD) or (a.BEGD >= b.BEGD
and a.ENDD <= b.ENDD) or (a.BEGD <= b.BEGD and a.ENDD >= b.ENDD))

order by position,BEGD,ENDD;

Not sure how to add the distinct ,extra join conditions and order by in a single datas tep.

DATA MI ;

IF 0 THEN SET rllp1;

      DECLARE hash VS(hashexp:10, dataset:"rllp1");

            VS.definekey('position_id_k,'position_id_1');

            VS.definedata(all:'Y');

            VS.definedone();

                  DO UNTIL(eof);

                        SET rllp3end=eof;

                  IF VS.find()=0 THEN OUTPUT;

                        ELSE DO;

                  CALL missing( a1---b1);

                  OUTPUT;

            END;

      END;

STOP ;

RUN;

Kindly suggest some methods

Super User
Posts: 5,426

Re: Hash join to replace SQL Left Join

i don't think I can help about hashing, but maybe you could tell us about why? Is it a performance issue regarding the SQL?

It's quite hard for SAS SQL to optimize OR conditions. Storing in SPDE might improve performance (or SPD Server if available), and indexing on at least position_id_1 and  position_id_k.

Then I'm not sure how distinct works with max and min functions. I am more comfortable using Group By, but distinct may work fine for you...

Data never sleeps
Frequent Contributor
Posts: 83

Re: Hash join to replace SQL Left Join


Hi LinusH,

Its about performance and time. The datasets are 100Gb in size.That why i though its better to deploy data step hash.

Regular Contributor
Posts: 216

Re: Hash join to replace SQL Left Join

tommy81,

When your data set gets into 30GB+ , that's when you move it into SPDE/SPDS!!

After all, that's why SPDE was developed in the first place. it was the "free" solution to handle large data sets using your existing BASE SAS. I know BASE SAS is not Free, but you don't have to purchase and configure SPDS to handle such volumes.

I would strongly recommend investigating SPDE as storage engine for your data set. It will payoff down the line.

Ahmed

Frequent Contributor
Posts: 83

Re: Hash join to replace SQL Left Join

Posted in reply to AhmedAl_Attar

I agree to the SPDS point, however I need to know how much of a difference it could make if I could use  a data step hash to the above step . I`m certain data vloume cannot be controlled much, but at least run times can be improved.

Super User
Posts: 5,426

Re: Hash join to replace SQL Left Join

I think it will very complicated to do both the join logic and distinct/group by functionality within one single data step.

So my suggestion is to first try to optimize your SQL query first to see what gains is possible, t sum up:

  • simplify the SQL suggested by DBailey
  • Index on the join column(s)
  • move the data to SPDE

Since SQL operations such as summarizing and sorting is done in paralel, investigate if you can exploit the hardware (or have any other HW available). This includes setting up MEMSIZE/SORTSIZE options.

Data never sleeps
Super Contributor
Posts: 578

Re: Hash join to replace SQL Left Join

Can you rewrite your clause to say

left join

rllp1 as b

on a.position_id_k = b.position_id_1

and a.BEGD <= b.ENDD

and a.ENDD >= b.BEGD

I didn't work through all of the permutations but that is what I see normally when trying to make sure you count everything that may have happened during a date span.

Ask a Question
Discussion stats
  • 6 replies
  • 383 views
  • 3 likes
  • 4 in conversation