I do not understand why join done with proc sql does not play the same way hash. What did I do wrong?
data tab1;
input id1 id2 ts;
datalines;
1 1 3
1 2 5
1 3 6
1 2 20
1 2 22
run;
data tab2;
input id id1 id2 time_from time_to descr $;
datalines;
1 1 1 4 6 xxx
2 1 2 4 6 yyy
3 1 2 21 30 kkk
run;
/*sql version*/
proc sql;
create table ris_sql as select
a.*,
b.descr
from tab1 as a
left join tab2 as b
on a.id1=b.id1 and
a.id2=b.id2 and
(a.ts between b.time_from and b.time_to);
quit;
/*hash version*/
data ris_hash;
attrib descr format=$10.;
if _n_=1 then do;
declare hash s(dataset: 'tab2');
s.definekey('id1','id2');
s.definedata('time_from', 'time_to','descr');
s.definedone();
end;
set tab1;
rc=s.find();
if not(time_from <= ts <= time_to) then call missing(descr);
drop rc time_from time_to;
run;
Hi.
That's because you data set has duplicate key entries. Only the first entry is loaded.
You should use the MULTIDATA option and cycle find until there are no more matches.
Check this very good paper by Paul M. Dorfman: http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf
Hope it helps.
Daniel Santos @ www.cgd.pt
For small datasets I don't think you can easily force SAS to use a HASH.
/* T007370 SQL HASH MAGIC=103 faster that SQL Sort merge */
Hi Team
I wanted to select about 1,000,000 records from a 5,000,000
table and add 100 variables at the same time.
The finder file was 800mb.
I noticed that the default SQL method(sort) took almost twice as
long as the SQL(hash) method. Note that I have plenty of ram(128gb).
The code below will not run in EG, memsize is set at 0.5gb.
* sql sort=103 seconds;
* sql hash 60 seconds;
* there is an excellent paper in the most recent pharmasug
on the sql magic option;
* 800mb finder file with additional variables c1-c100;
options compress=no;
data find_benes(compress=no drop=i j);
array chr[100] $8 c1-c100;
do i=100000 to 10000000 by 10;
pat_id=encrypt(i);
do j=1 to 100;
chr[j]=put(1e8*uniform(-1),z8.);
end;
output;
end;
run;quit;
/*
800mb table
Up to 40 obs from find_benes total obs=990001
pat_id C1 C100
1202046 75950697 28236507
1202056 22914180 68703933
1202066 61116305 19959418
1202076 58626531 82712843
1202086 10725652 84593862
*/
* 4 gb;
options compress=no;
data all_benes(compress=no drop=i j);
array chr[100] $8 d1-d100;
do i=100000 to 10000000 by 2;
pat_id=encrypt(i);
do j=1 to 100;
chr[j]=put(1e8*uniform(-1),z8.);
end;
output;
end;
run;quit;
/*
4gb table with variables d1-d100
Up to 40 obs from all_benes total obs=4,950,001
pat_id D1 D100
1201766 90585571 28236507
1201768 55665932 68703933
1201770 29801506 19959418
1201772 67134560 82712843
1201774 74029386 84593862
*/
* sql sort=103 seconds;
* sql hash 60 seconds;
proc sql _method magic=103;
create
table gotem as
select
l.*
,r.*
from
find_benes as l, all_benes as r
where
l.pat_id = r.pat_id
;quit;
/* magic=101 = cartesian
magic=102 = sort merge
magic=103 = hash
*/
Note if you have good size ram 60-128gb you might want to consider using sasfile
options load, open and close. for intermediate work datasets, but don't forget
to free memory often. Unfortunately SAS does not have a means of using memory
for the utility files and indexes are not supported. Perhaps a new kind of index
is needed, in memory index.
Sorting a 'big' dataset.
I suspect using a hash with 7 systasks( split the big dataset into 7(firstobs-obs)
pieces that sum to less then total dataset size might be much faster than the SAS
sort on the big dataset, even though the sort is mutithreaded.
Of course you need a lot of ram.
Hi.
That's because you data set has duplicate key entries. Only the first entry is loaded.
You should use the MULTIDATA option and cycle find until there are no more matches.
Check this very good paper by Paul M. Dorfman: http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf
Hope it helps.
Daniel Santos @ www.cgd.pt
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.