BookmarkSubscribeRSS Feed
Aexor
Lapis Lazuli | Level 10

Hi All,

 

I need to know is there any way I can optimize this join. I am getting notes :

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. mywork.dates

 

proc sql noprint;
   create table mywork.dates as
    select
        distinct
        a.prod_sk,
        a.geo_sk,
        b.date_sk,
        b.fiscal_start_dt,
        b.fiscal_end_dt
    from
        mywork.tmp_plan_actuals_pgr a,
        mylib.date_nm b
            where b.fiscal_start_dt >= &min_start_dt.
               and b.fiscal_end_dt <= &max_end_dt.
            and date_lvl_no = &time_low_lvl.;
    quit;
 
&min_start_dt. , &max_end_dt. ,  &time_low_lvl these have some values . these are getting created in previous steps and totally independent from the tables present in join.
 
Please help.
       
11 REPLIES 11
PaigeMiller
Diamond | Level 26

A Cartesian join takes all records of data set 1 and merges them with all records in data set 2. If data set 1 has n1 records and data set 2 has n2 records, then the Cartesian join will have n1*n2 records (which may be reduced by a WHERE statement, but I think first all records are created and then those that don't meet the where condition are deleted). If n1 and n2 are large, this will take a long time.

 

The only way to "optimize" is if you can replace the Cartesian join with some other type of join. Maybe that's possible, maybe that's not possible, you tell us.

--
Paige Miller
ballardw
Super User

Which data set has the variable date_lvl_no?

 

Without example data and expected results we can't come up with much optimization and when you use just a comma between two sets that is the request for a Cartesian join. SAS provides a note about such in case that wasn't your intent.

from set1 , set2

You might be able to improve performance by sub-setting data earlier such as:

proc sql noprint;
   create table mywork.dates as
    select
        distinct
        a.prod_sk,
        a.geo_sk,
        b.date_sk,
        b.fiscal_start_dt,
        b.fiscal_end_dt
    from
        mywork.tmp_plan_actuals_pgr a,
        ( select * from mylib.date_nm 
            where fiscal_start_dt >= &min_start_dt.
               and fiscal_end_dt <= &max_end_dt.) as b
        where date_lvl_no = &time_low_lvl.;
    quit;

Do the same about only selecting the records where date_lvl_no is desired from which set it comes from, which is why I asked.

This reduces the number of joins between the two sets which MAY significantly improve run time.

 

Caution: Is B.fiscal_end_dt ever missing? If it is missing then it is less than any value for &max_end_dt. Is that your intend or did you only want the fiscal_end_dt when not missing?

 

Aexor
Lapis Lazuli | Level 10
Thank i tried as you suggested . Please refer below comparison .
without applying the sub setting method
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table mywork.dates created, with 10790 rows and 5 columns.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.35 seconds
cpu time 0.51 seconds

with apply sub setting

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table mywork.dates created, with 10790 rows and 5 columns.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.39 seconds
cpu time 0.53 seconds

I am still getting notes and the its taking more time in sub setting method. . Can you please suggest if I am missing anything.
ballardw
Super User

The only way to remove that note is a setting that will suppress all notes. So is not recommended. It is not an error. It is a reminder that such as join is in your code. Such Cartesian joins, by definition, join every record in one set with every record in the other set. That step cannot be optimized. Ever. Period. 

 

If you are going to use Cartesian joins it is a good idea to keep a good grasp on your data set sizes and exactly how every element of the data is actually used to get reasonable performance out of everything else.

 

 

SASKiwi
PROC Star

Are there no key columns that are in both input tables? Your query is missing a join between at least one column in each input table, so that's why you are getting the Cartesian product note.

Aexor
Lapis Lazuli | Level 10
There are no key column. I just want to optimize that step to improve performance
SASKiwi
PROC Star

Why do you need to improve performance when the query real time is less than 1 second? Are you getting the right result set?

Aexor
Lapis Lazuli | Level 10
the result log is from sample data set, which is very less compare to actual ones
Aexor
Lapis Lazuli | Level 10
Sorry , I don't get this. What exactly I have to check in the blog ?
Tom
Super User Tom
Super User

Show examples of the data (does not need to be REAL data, just something that shows the right relationships).

 

If there really is no variables in common between the two tables then you cannot do anything other than a Cartesian product.

 

But you can fool PROC SQL into not writing the message by just adding a variable to each input and joining on it.

 

Here is trivial example you can try.

proc sql nowarn;
create table test as
select *
from (select distinct sex,1 as x from sashelp.class) a
   , (select distinct age,1 as x from sashelp.class) b
;
create table test2(drop=x) as
select *
from (select distinct sex,1 as x from sashelp.class) a
full join (select distinct age,1 as x from sashelp.class) b
on a.x=b.x
;
quit;

NOTE that the variable date_lvl_no mentioned in your WHERE clause is not output and you do not say where it came from.  You could make the problem easier by just excluding those observations from even being considered.

 

So if DATE_LVL_NO is coming from mylib.date_nm then your query looks like:

create table mywork.dates as
select distinct
     a.prod_sk
   , a.geo_sk
   , b.date_sk
   , b.fiscal_start_dt
   , b.fiscal_end_dt
from (select prod_sk,geo_sk,1 as x from mywork.tmp_plan_actuals_pgr) a
full join 
    (select fiscal_start_dt,fiscal_end_dt,date_sk,1 as x
      from mylib.date_nm
      where fiscal_start_dt >= &min_start_dt.
        and fiscal_end_dt <= &max_end_dt.
        and date_lvl_no = &time_low_lvl.
    ) b
on a.x=b.x
;

It might or might not run any faster.  But at least you won't get the NOTE in the LOG.

 

sas-innovate-2024.png

📢

ANNOUNCEMENT

The early bird rate has been extended! Register by March 18 for just $695 - $100 off the standard rate.

 

Check out the agenda and get ready for a jam-packed event featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events. 

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1236 views
  • 1 like
  • 6 in conversation