Re: joining strings in proc sql why so inefficient?

telligent · Posted 01-28-2022 10:05 PM

I am trying to join two large datasets on a long string variable and it is extremely inefficient. It runs forever and never finishes.

I was hoping you could give me some suggestions to make this query more efficient. Thank you.

select *

from xxxxx.xxxxx

inner join xxxx.xxxx b

on a.name=b.name

where a.name ne ' ' and dt>'201712';

quit;

Tom · Posted 01-28-2022 10:12 PM

Why not just merge them instead?

data want;
  merge a(in=in1) b(in=in2);
  by name;
  if in1 and in2 and name ne ' ' and dt>'201712';
run;

telligent · Posted 01-29-2022 03:13 AM

It is a large data set, sorting it will also take a long time, but I'll give it a try.

sbxkoenk · Posted 01-29-2022 06:05 AM

Hello,

You can put the filters as a data step option (to avoid these observations from being processed).

In fact you move from an output filter to an input filter that way. And an input filter is faster obviously.

select *
from 
     lib1.abc(where=(name is not missing and dt>'201712')) as a
 inner join 
     lib2.xyz(where=(name is not missing and dt>'201712')) as b
on a.name=b.name     

quit;

If name is the only variable with the same name in a and b, you can also do a natural (inner) join:

from lib1.abc natural join lib2.xyz

Koen

Patrick · Posted 01-29-2022 10:47 PM

I'm always a bit uneasy to combine an explicit join with a where clause and then trust the SAS SQL optimizer that it doesn't first execute the join and only then applies the where condition on the intermediary result set.

In your case I'd go for an implicit join with the "date" filter first.

data have_1;
  do row_id=1 to 100000;
    name=put(row_id,z10.);
    if mod(row_id,500)=0 then dt='201812';
    else dt=' ';
    output;
  end;
  stop;
run;

data have_2;
  set have_1(keep=name);
run;

proc sql _method feedback;
  create table want_1 as
  select a.*
  from 
    have_1 a
   ,have_2 b
  where 
    a.dt>'201712'
    and a.name ne ' ' 
    and a.name=b.name
  ;
quit;

Please note that if you've got the same name multiple times in both tables (=a many to many relationship) then you could end-up with a lot of rows which could be the reason for you long running SQL.

As for not using a data step merge because of the pre-sorting required: Your SQL will also have to sort the tables.

I've added in the code above option _method to the SQL. This shows you in the log what your SQL is actually doing. With my sample data it's a hash join - but this will be different with your big tables and it's highly likely that you'll see SQXSORT.

And just as a thought: If you just want to select all rows in the one table that have a matching name in another table then creating a distinct list of names first would be the way to go (avoiding a many to many situation). ...and if this distinct list of names is not too big then you could eventually even use a SAS data step hash lookup for this which would avoid any need for sorting the data.

joining strings in proc sql why so inefficient?