Hi,
I`m looking for an equivalent of the below code in Data step HASH join.
proc sql;
create table rllp2 as
select distinct a.*, b.enam as name,b.position,b.orid,b.ccid_1 as ccid,b.region1 as region
max(a.BEGD,b.BEGD) as BEGD,min(a.ENDD,b.ENDD) as ENDD
from rllp3 as a
left join
rllp1 as b
on a.position_id_k = b.position_id_1
and ((a.BEGD BETWEEN b.BEGD and b.ENDD)
or (a.ENDD BETWEEN b.BEGD and b.ENDD) or (a.BEGD >= b.BEGD
and a.ENDD <= b.ENDD) or (a.BEGD <= b.BEGD and a.ENDD >= b.ENDD))
order by position,BEGD,ENDD;
Not sure how to add the distinct ,extra join conditions and order by in a single datas tep.
DATA MI ;
IF 0 THEN SET rllp1;
DECLARE hash VS(hashexp:10, dataset:"rllp1");
VS.definekey('position_id_k,'position_id_1');
VS.definedata(all:'Y');
VS.definedone();
DO UNTIL(eof);
SET rllp3end=eof;
IF VS.find()=0 THEN OUTPUT;
ELSE DO;
CALL missing( a1---b1);
OUTPUT;
END;
END;
STOP ;
RUN;
Kindly suggest some methods
i don't think I can help about hashing, but maybe you could tell us about why? Is it a performance issue regarding the SQL?
It's quite hard for SAS SQL to optimize OR conditions. Storing in SPDE might improve performance (or SPD Server if available), and indexing on at least position_id_1 and position_id_k.
Then I'm not sure how distinct works with max and min functions. I am more comfortable using Group By, but distinct may work fine for you...
Hi LinusH,
Its about performance and time. The datasets are 100Gb in size.That why i though its better to deploy data step hash.
tommy81,
When your data set gets into 30GB+ , that's when you move it into SPDE/SPDS!!
After all, that's why SPDE was developed in the first place. it was the "free" solution to handle large data sets using your existing BASE SAS. I know BASE SAS is not Free, but you don't have to purchase and configure SPDS to handle such volumes.
I would strongly recommend investigating SPDE as storage engine for your data set. It will payoff down the line.
Ahmed
I agree to the SPDS point, however I need to know how much of a difference it could make if I could use a data step hash to the above step . I`m certain data vloume cannot be controlled much, but at least run times can be improved.
I think it will very complicated to do both the join logic and distinct/group by functionality within one single data step.
So my suggestion is to first try to optimize your SQL query first to see what gains is possible, t sum up:
Since SQL operations such as summarizing and sorting is done in paralel, investigate if you can exploit the hardware (or have any other HW available). This includes setting up MEMSIZE/SORTSIZE options.
Can you rewrite your clause to say
left join
rllp1 as b
on a.position_id_k = b.position_id_1
and a.BEGD <= b.ENDD
and a.ENDD >= b.BEGD
I didn't work through all of the permutations but that is what I see normally when trying to make sure you count everything that may have happened during a date span.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.