BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Gexern
Fluorite | Level 6

Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:

 

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data) as b
    on a.id=b.id and a.year=b.year and b.id^='' ;
quit ;

I could have used a where expression instead, are they equivalent, or is there some performance differences?:

 

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data where id^='') as b
    on a.id=b.id and a.year=b.year ;
quit ;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Make up some fake data and run your codes against it.

I did this:

data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;

data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;

proc sql ;
create table step2_a as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;

proc sql ;
create table step2_b as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_b;
run;

and found no difference.

So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.

In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.

From this, I created a simplified version of your code; first, I removed the check for id ^= '':

proc sql;
create table step2_c as
select
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_c;
run;

and found no difference to the previous results.

Next, I simplified this even further:

proc sql;
create table step2_d as
select
  a.*,
  b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_d;
run;

and got the same results once again.

If you do not have a many-to-many relationship with regards to id and year, you can use a data step:

proc sort data=step1;
by id year;
run;

proc sort data=income_data;
by id year;
run;

data step2_e;
merge
  step1 (in=_in_step1)
  income_data
;
by id year;
if _in_step1;
run;

proc compare data=step2_a compare=step2_e;
run;

which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.

 

 

 

View solution in original post

5 REPLIES 5
Kurt_Bremser
Super User

First, compare the results of both queries to see if they are logically equivalent.

 


@Gexern wrote:

Hello, I wondered if people had any experience with where they place a condition when doing a left join performance/time wise. I have the code:

 

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data) as b
    on a.id=b.id and a.year=b.year and b.id^='' ;
quit ;

I could have used a where expression instead, are they equivalent, or is there some performance differences?:

 

proc sql ;
    create table step2 as
    select distinct a.*,
        b.income,
        ...
    from step1 as a

    left join (select year, income, ..., from income_data where id^='') as b
    on a.id=b.id and a.year=b.year ;
quit ;

 

PeterClemmensen
Tourmaline | Level 20

Welcome to the SAS Community 🙂

 

What result did you get yourself? Did you get the same result from the two queries?

Gexern
Fluorite | Level 6

I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.

ballardw
Super User

@Gexern wrote:

I haven't, no. The code runs for 4 hours so its not that easy to compare. I can batch two programs that run at the same time and compare. I would expect to get the same result from both queries though.


How many records are involved?

 

I would create subsets of the data of only a few thousand records for each and test the code for performance and equivalence.

 

Do you have missing values of a.id? I don't see much sense for the b.id^=' ' in this if you don't.

on a.id=b.id and a.year=b.year and b.id^='' 

And if you do, I strongly wonder why an ID type variable would be missing. Since I tend to think of ID as used to identify things then missing values are extremely problematic. If A.ID does not have any missing values then the b.id^=' ' is not needed as join on will only get matches.

Kurt_Bremser
Super User

Make up some fake data and run your codes against it.

I did this:

data step1;
input id $ year;
datalines;
A 2018
A 2019
B 2018
B 2019
C 2018
C 2019
;

data income_data;
input id $ year income;
datalines;
A 2018 1000
A 2018 1500
B 2019 3000
C 2018 2000
;
run;

proc sql ;
create table step2_a as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year and b.id ^= ''
;
quit;

proc sql ;
create table step2_b as
select distinct
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data where id ^= ''
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_b;
run;

and found no difference.

So play around with the faked source data to include edge cases (do you have missing id and/or year values in one of the datasets, or do you have multiple instances for any id/year pair in one or both datasets) that might turn up in your source data.

In particular I noticed your use of distinct. Some people use this out of habit and do not realize that it forces SQL to sort the result table by all columns to determine possible duplicates. This can be very time-consuming, so you should check your data (see Maxim 3) if it is necessary at all.

From this, I created a simplified version of your code; first, I removed the check for id ^= '':

proc sql;
create table step2_c as
select
  a.*,
  b.income
from step1 as a
left join (
  select
    id,
    year,
    income
  from income_data
) as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_c;
run;

and found no difference to the previous results.

Next, I simplified this even further:

proc sql;
create table step2_d as
select
  a.*,
  b.income
from step1 as a
left join income_data as b
on a.id = b.id and a.year = b.year
;
quit;

proc compare data=step2_a compare=step2_d;
run;

and got the same results once again.

If you do not have a many-to-many relationship with regards to id and year, you can use a data step:

proc sort data=step1;
by id year;
run;

proc sort data=income_data;
by id year;
run;

data step2_e;
merge
  step1 (in=_in_step1)
  income_data
;
by id year;
if _in_step1;
run;

proc compare data=step2_a compare=step2_e;
run;

which once again yields identical results with my fake data. Out of experience, I dare to say that this approach will beat all SQL methods performance-wise. Especially if you can manage to have the sorts done further up when the data sets are created in the first place.

 

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 19092 views
  • 2 likes
  • 4 in conversation