BookmarkSubscribeRSS Feed
telligent
Quartz | Level 8

I am trying to join two large datasets on a long string variable and it is extremely inefficient.  It runs forever and never finishes. 

I was hoping you could give me some suggestions to make this query more efficient.  Thank you.

 

select *

from xxxxx.xxxxx

inner join xxxx.xxxx b

on a.name=b.name

where a.name ne ' ' and dt>'201712';

quit;

4 REPLIES 4
Tom
Super User Tom
Super User

Why not just merge them instead?

data want;
  merge a(in=in1) b(in=in2);
  by name;
  if in1 and in2 and name ne ' ' and dt>'201712';
run;
  
telligent
Quartz | Level 8

It is a large data set, sorting it will also take a long time, but I'll give it a try. 

sbxkoenk
SAS Super FREQ

Hello,

 

You can put the filters as a data step option (to avoid these observations from being processed).

In fact you move from an output filter to an input filter that way. And an input filter is faster obviously.

select *
from 
     lib1.abc(where=(name is not missing and dt>'201712')) as a
 inner join 
     lib2.xyz(where=(name is not missing and dt>'201712')) as b
on a.name=b.name     

quit;

 

If name is the only variable with the same name in a and b, you can also do a natural (inner) join:

from lib1.abc natural join lib2.xyz


Koen

Patrick
Opal | Level 21

I'm always a bit uneasy to combine an explicit join with a where clause and then trust the SAS SQL optimizer that it doesn't first execute the join and only then applies the where condition on the intermediary result set.

 

In your case I'd go for an implicit join with the "date" filter first.

data have_1;
  do row_id=1 to 100000;
    name=put(row_id,z10.);
    if mod(row_id,500)=0 then dt='201812';
    else dt=' ';
    output;
  end;
  stop;
run;

data have_2;
  set have_1(keep=name);
run;

proc sql _method feedback;
  create table want_1 as
  select a.*
  from 
    have_1 a
   ,have_2 b
  where 
    a.dt>'201712'
    and a.name ne ' ' 
    and a.name=b.name
  ;
quit;

Please note that if you've got the same name multiple times in both tables (=a many to many relationship) then you could end-up with a lot of rows which could be the reason for you long running SQL.

 

As for not using a data step merge because of the pre-sorting required: Your SQL will also have to sort the tables.

I've added in the code above option _method to the SQL. This shows you in the log what your SQL is actually doing. With my sample data it's a hash join - but this will be different with your big tables and it's highly likely that you'll see SQXSORT.

 

And just as a thought: If you just want to select all rows in the one table that have a matching name in another table then creating a distinct list of names first would be the way to go (avoiding a many to many situation). ...and if this distinct list of names is not too big then you could eventually even use a SAS data step hash lookup for this which would avoid any need for sorting the data.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 967 views
  • 0 likes
  • 4 in conversation