Solved: Re: proc sql left join condition using where or on

Gexern · Posted 08-16-2019 07:09 AM

Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data) as b
    on a.id=b.id and a.year=b.year and b.id^='' ;
quit ;

I could have used a where expression instead, are they equivalent, or is there some performance differences?:

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data where id^='') as b
    on a.id=b.id and a.year=b.year ;
quit ;

Kurt_Bremser · Posted 08-20-2019 04:23 AM

Make up some fake data and run your codes against it.

I did this:

data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;

data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;

proc sql ;
create table step2_a as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;

proc sql ;
create table step2_b as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_b;
run;

and found no difference.

So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.

In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.

From this, I created a simplified version of your code; first, I removed the check for id ^= '':

proc sql;
create table step2_c as
select
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_c;
run;

and found no difference to the previous results.

Next, I simplified this even further:

proc sql;
create table step2_d as
select
  a.*,
  b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_d;
run;

and got the same results once again.

If you do not have a many-to-many relationship with regards to id and year, you can use a data step:

proc sort data=step1;
by id year;
run;

proc sort data=income_data;
by id year;
run;

data step2_e;
merge
  step1 (in=_in_step1)
  income_data
;
by id year;
if _in_step1;
run;

proc compare data=step2_a compare=step2_e;
run;

which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 08-16-2019 07:42 AM

First, compare the results of both queries to see if they are logically equivalent.

@Gexern wrote:

Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:
proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data) as b
    on a.id=b.id and a.year=b.year and b.id^='' ;
quit ;
I could have used a where expression instead, are they equivalent, or is there some performance differences?:
proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data where id^='') as b
    on a.id=b.id and a.year=b.year ;
quit ;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 08-16-2019 07:43 AM

Welcome to the SAS Community 🙂

What result did you get yourself? Did you get the same result from the two queries?

The DATA to DATA Step Macro
Blog: SASnrd

Gexern · Posted 08-16-2019 07:49 AM

I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.

ballardw · Posted 08-16-2019 10:25 AM

@Gexern wrote:

I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.

How many records are involved?

I would create subsets of the data of only a few thousand records for each and test the code for performance and equivalence.

Do you have missing values of a.id? I don't see much sense for the b.id^=' ' in this if you don't.

on a.id=b.id and a.year=b.year and b.id^=''

And if you do, I strongly wonder why an ID type variable would be missing. Since I tend to think of ID as used to identify things then missing values are extremely problematic. If A.ID does not have any missing values then the b.id^=' ' is not needed as join on will only get matches.

Kurt_Bremser · Posted 08-20-2019 04:23 AM

Make up some fake data and run your codes against it.

I did this:

data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;

data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;

proc sql ;
create table step2_a as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;

proc sql ;
create table step2_b as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_b;
run;

and found no difference.

So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.

In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.

From this, I created a simplified version of your code; first, I removed the check for id ^= '':

proc sql;
create table step2_c as
select
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_c;
run;

and found no difference to the previous results.

Next, I simplified this even further:

proc sql;
create table step2_d as
select
  a.*,
  b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_d;
run;

and got the same results once again.

If you do not have a many-to-many relationship with regards to id and year, you can use a data step:

proc sort data=step1;
by id year;
run;

proc sort data=income_data;
by id year;
run;

data step2_e;
merge
  step1 (in=_in_step1)
  income_data
;
by id year;
if _in_step1;
run;

proc compare data=step2_a compare=step2_e;
run;

which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Classroom Training Available!