I am trying to join two large datasets on a long string variable and it is extremely inefficient. It runs forever and never finishes.
I was hoping you could give me some suggestions to make this query more efficient. Thank you.
select *
from xxxxx.xxxxx
inner join xxxx.xxxx b
on a.name=b.name
where a.name ne ' ' and dt>'201712';
quit;
Why not just merge them instead?
data want;
merge a(in=in1) b(in=in2);
by name;
if in1 and in2 and name ne ' ' and dt>'201712';
run;
It is a large data set, sorting it will also take a long time, but I'll give it a try.
Hello,
You can put the filters as a data step option (to avoid these observations from being processed).
In fact you move from an output filter to an input filter that way. And an input filter is faster obviously.
select *
from
lib1.abc(where=(name is not missing and dt>'201712')) as a
inner join
lib2.xyz(where=(name is not missing and dt>'201712')) as b
on a.name=b.name
quit;
If name is the only variable with the same name in a and b, you can also do a natural (inner) join:
from lib1.abc natural join lib2.xyz
Koen
I'm always a bit uneasy to combine an explicit join with a where clause and then trust the SAS SQL optimizer that it doesn't first execute the join and only then applies the where condition on the intermediary result set.
In your case I'd go for an implicit join with the "date" filter first.
data have_1;
do row_id=1 to 100000;
name=put(row_id,z10.);
if mod(row_id,500)=0 then dt='201812';
else dt=' ';
output;
end;
stop;
run;
data have_2;
set have_1(keep=name);
run;
proc sql _method feedback;
create table want_1 as
select a.*
from
have_1 a
,have_2 b
where
a.dt>'201712'
and a.name ne ' '
and a.name=b.name
;
quit;
Please note that if you've got the same name multiple times in both tables (=a many to many relationship) then you could end-up with a lot of rows which could be the reason for you long running SQL.
As for not using a data step merge because of the pre-sorting required: Your SQL will also have to sort the tables.
I've added in the code above option _method to the SQL. This shows you in the log what your SQL is actually doing. With my sample data it's a hash join - but this will be different with your big tables and it's highly likely that you'll see SQXSORT.
And just as a thought: If you just want to select all rows in the one table that have a matching name in another table then creating a distinct list of names first would be the way to go (avoiding a many to many situation). ...and if this distinct list of names is not too big then you could eventually even use a SAS data step hash lookup for this which would avoid any need for sorting the data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.