Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:
proc sql ;
create table step2 as
select distinct a.*,
b.income,
...
from step1 as a
left join (select year, income, ..., from income_data) as b
on a.id=b.id and a.year=b.year and b.id^='' ;
quit ;
I could have used a where expression instead, are they equivalent, or is there some performance differences?:
proc sql ;
create table step2 as
select distinct a.*,
b.income,
...
from step1 as a
left join (select year, income, ..., from income_data where id^='') as b
on a.id=b.id and a.year=b.year ;
quit ;
Make up some fake data and run your codes against it.
I did this:
data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;
data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;
proc sql ;
create table step2_a as
select distinct
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;
proc sql ;
create table step2_b as
select distinct
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_b;
run;
and found no difference.
So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.
In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.
From this, I created a simplified version of your code; first, I removed the check for id ^= '':
proc sql;
create table step2_c as
select
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_c;
run;
and found no difference to the previous results.
Next, I simplified this even further:
proc sql;
create table step2_d as
select
a.*,
b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_d;
run;
and got the same results once again.
If you do not have a many-to-many relationship with regards to id and year, you can use a data step:
proc sort data=step1;
by id year;
run;
proc sort data=income_data;
by id year;
run;
data step2_e;
merge
step1 (in=_in_step1)
income_data
;
by id year;
if _in_step1;
run;
proc compare data=step2_a compare=step2_e;
run;
which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.
First, compare the results of both queries to see if they are logically equivalent.
@Gexern wrote:
Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:
proc sql ; create table step2 as select distinct a.*, b.income, ... from step1 as a left join (select year, income, ..., from income_data) as b on a.id=b.id and a.year=b.year and b.id^='' ; quit ;
I could have used a where expression instead, are they equivalent, or is there some performance differences?:
proc sql ; create table step2 as select distinct a.*, b.income, ... from step1 as a left join (select year, income, ..., from income_data where id^='') as b on a.id=b.id and a.year=b.year ; quit ;
Welcome to the SAS Community 🙂
What result did you get yourself? Did you get the same result from the two queries?
I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.
@Gexern wrote:
I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.
How many records are involved?
I would create subsets of the data of only a few thousand records for each and test the code for performance and equivalence.
Do you have missing values of a.id? I don't see much sense for the b.id^=' ' in this if you don't.
on a.id=b.id and a.year=b.year and b.id^=''
And if you do, I strongly wonder why an ID type variable would be missing. Since I tend to think of ID as used to identify things then missing values are extremely problematic. If A.ID does not have any missing values then the b.id^=' ' is not needed as join on will only get matches.
Make up some fake data and run your codes against it.
I did this:
data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;
data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;
proc sql ;
create table step2_a as
select distinct
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;
proc sql ;
create table step2_b as
select distinct
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_b;
run;
and found no difference.
So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.
In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.
From this, I created a simplified version of your code; first, I removed the check for id ^= '':
proc sql;
create table step2_c as
select
a.*,
b.income
from step1 as a
left join (
select
id,
year,
income
from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_c;
run;
and found no difference to the previous results.
Next, I simplified this even further:
proc sql;
create table step2_d as
select
a.*,
b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;
proc compare data=step2_a compare=step2_d;
run;
and got the same results once again.
If you do not have a many-to-many relationship with regards to id and year, you can use a data step:
proc sort data=step1;
by id year;
run;
proc sort data=income_data;
by id year;
run;
data step2_e;
merge
step1 (in=_in_step1)
income_data
;
by id year;
if _in_step1;
run;
proc compare data=step2_a compare=step2_e;
run;
which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.