BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

I have the following panel data of 500 individuals for 60 months.

data have;
do i=1 to 500;
do j="1jan2016"d to "31dec2020"d;
x=rannor(1);
if j=intnx("month",j,0,"end") then output;
end;
end;
run;

And suppose I want to add lag(x) with SQL as follows.

proc sql;
create table want as
select a.*,b.x as x1
from have a join have b on a.i=b.i and intnx("month",a.j,0)=intnx("month",b.j,1);
quit;

And I found that altering intnx("month",a.j,0)=intnx("month",b.j,1) by intck("month",a.j,b.j)=-1 makes the proc sql slower—why is this the case?

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Junyong,

 

You can see the difference that @Kurt_Bremser has pointed out by modifying the queries as shown in the log below:

128  data have2;
129  do j="1jan2016"d to "31dec2020"d;
130    x=rannor(1);
131    output;
132  end;
133  run;

NOTE: The data set WORK.HAVE2 has 1827 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.07 seconds


134
135  proc sql;
136  create table want2 as
137  select a.*, b.x as x1
138  from have2 a join have2 b on intnx("month",a.j,0)=intnx("month",b.j,1);
NOTE: Table WORK.WANT2 created, with 54653 rows and 3 columns.

139  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.06 seconds
      cpu time            0.06 seconds


140
141  proc sql;
142  create table want2_ as
143  select a.*, b.x as x1
144  from have2 a join have2 b on intck("month",a.j,b.j)=-1;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.WANT2_ created, with 54653 rows and 3 columns.

145  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.59 seconds
      cpu time            0.59 seconds

View solution in original post

9 REPLIES 9
Kurt_Bremser
Super User

I guess it is because the INTNX functions are applied once for every observation in the source table, but for the INTCK function, SQL needs to do a cartesian join and then apply the function to every result of that cartesian join.

FreelanceReinh
Jade | Level 19

Hi @Junyong,

 

You can see the difference that @Kurt_Bremser has pointed out by modifying the queries as shown in the log below:

128  data have2;
129  do j="1jan2016"d to "31dec2020"d;
130    x=rannor(1);
131    output;
132  end;
133  run;

NOTE: The data set WORK.HAVE2 has 1827 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.07 seconds


134
135  proc sql;
136  create table want2 as
137  select a.*, b.x as x1
138  from have2 a join have2 b on intnx("month",a.j,0)=intnx("month",b.j,1);
NOTE: Table WORK.WANT2 created, with 54653 rows and 3 columns.

139  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.06 seconds
      cpu time            0.06 seconds


140
141  proc sql;
142  create table want2_ as
143  select a.*, b.x as x1
144  from have2 a join have2 b on intck("month",a.j,b.j)=-1;
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.WANT2_ created, with 54653 rows and 3 columns.

145  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.59 seconds
      cpu time            0.59 seconds
Kurt_Bremser
Super User

PS the fastest method to get your result is, of course, a DATA step:

proc sql stimer;

create table want1 as
select a.*,b.x as x1
from have a , have b where a.i=b.i and intnx("month",a.j,0)=intnx("month",b.j,1);

create table want2 as
select a.*,b.x as x1
from have a, have b where a.i=b.i and intck("month",a.j,b.j)=-1;

quit;

data want3;
set have;
by i;
x1 = lag(x);
if not first.i;
run;

Log:

 84         proc sql stimer;
 NOTE:  Verwendet wurde: SQL Statement - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.00 seconds
       
 85         
 86         create table want1 as
 87         select a.*,b.x as x1
 88         from have a , have b where a.i=b.i and intnx("month",a.j,0)=intnx("month",b.j,1);
 NOTE: Table WORK.WANT1 created, with 29500 rows and 4 columns.
 
 NOTE:  Verwendet wurde: SQL Statement - (Gesamtverarbeitungszeit):
       real time           0.04 seconds
       cpu time            0.04 seconds
       
 89         
 90         create table want2 as
 91         select a.*,b.x as x1
 92         from have a, have b where a.i=b.i and intck("month",a.j,b.j)=-1;
 NOTE: Table WORK.WANT2 created, with 29500 rows and 4 columns.
 
 NOTE:  Verwendet wurde: SQL Statement - (Gesamtverarbeitungszeit):
       real time           0.46 seconds
       cpu time            0.45 seconds
       
 93         
 94         quit;
 NOTE:  Verwendet wurde: PROZEDUR SQL - (Gesamtverarbeitungszeit):
       real time           0.00 seconds
       cpu time            0.01 seconds
       
 
 95         
 96         data want3;
 97         set have;
 98         by i;
 99         x1 = lag(x);
 100        if not first.i;
 101        run;
 
 NOTE: There were 30000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT3 has 29500 observations and 4 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.01 seconds
       cpu time            0.01 seconds
Junyong
Pyrite | Level 9

My real data rather than this example are not monthly continuous with some skipped months, which is why I am avoiding data, but thanks a lot for your clarification—I think I need to be careful in using intck next time.

Kurt_Bremser
Super User

Even with skipped months, a data step is the right tool for lagged values. To see if a lagged value can be used, you just need to look at the lagged date to see if it is the previous month or older.

Junyong
Pyrite | Level 9

I think I need to clarify in detail. In my case,

data have;
input firm date yymmdd8. assets;
cards;
1 20200131 110
1 20200331 120
1 20200430 130
2 20200131 210
2 20200229 220
2 20200331 230
;

If I define assets1 as the assets a month behind, then the observation for firm 1 on March 31 would be incorrect (110 is the value from January) with data as follows.

data no;
set have;
assets1=lag(assets);
run;

sql will have no such issue.

proc sql;
create table yes as
select a.*,b.assets as assets1
from have a join have b on a.firm=b.firm and a.date=intnx("month",b.date,1,"end")
order by firm,date;
quit;

Though this does not produce any value for firm 1 on February 29, I needed this outcome—which is why I am using sql over data currently.

+ I think the following is possible with data instead.

data yes;
set have;
assets1=lag(assets);
if firm>lag(firm) or date>intnx("month",lag(date),1,"end") then assets1=.;
run;

Thanks.

Kurt_Bremser
Super User

PPS you can create your have dataset much faster by avoiding unnecessary looping:

data have;
do i = 1 to 500;
  j = "31jan2016"d;
  do while (j le "31dec2020"d);
    output;
    j = intnx("month",j,1,"e");
  end;
end;
run;

Look at this log:

 73         data have1;
 74         do i = 1 to 500;
 75           do j = "1jan2016"d to "31dec2020"d;
 76             x = rannor(1);
 77             if j = intnx("month",j,0,"end") then output;
 78           end;
 79         end;
 80         run;
 
 NOTE: The data set WORK.HAVE1 has 30000 observations and 3 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.43 seconds
       cpu time            0.43 seconds
       
 
 81         
 82         data have2;
 83         do i = 1 to 500;
 84           j = "31jan2016"d;
 85           do while (j le "31dec2020"d);
 86             output;
 87             j = intnx("month",j,1,"e");
 88           end;
 89         end;
 90         run;
 
 NOTE: The data set WORK.HAVE2 has 30000 observations and 2 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.02 seconds
       cpu time            0.01 seconds

It creates the same amount of observations 20 times as fast.

RichardDeVen
Barite | Level 11

Use the undocumented _TREE option and you will see @Kurt_Bremser supposition is correct.

 

For 

intnx("month",a.j,0)=intnx("month",b.j,1)

INTNX is called for rows selected and results are cross joined.

 

For

intck("month",a.j,b.j)=-1

Rows are cross joined and INTCK is called for row in cross join and result applied to evaluation criteria

 

The trees:

 

203  proc sql _tree;
204  create table want as select a.*,b.x as x1
205  from have a join have b on a.i = b.i and
206       intnx("month",a.j,0)=intnx("month",b.j,1) ;

Tree as planned.
                               /-SYM-V-(a.i:1 flag=00000001)
                     /-OBJ----|
                    |         |--SYM-V-(a.j:2 flag=00000001)
                    |         |--SYM-V-(a.x:3 flag=00000001)
                    |          \-SYM-M-(x1:1 flag=00000005)
           /-JOIN---|
          |         |                              /-SYM-V-(a.i:1 flag=00000001)
          |         |                    /-OBJ----|
          |         |                   |         |--SYM-V-(a.j:2 flag=00000001)
          |         |                   |         |--SYM-V-(a.x:3 flag=00000001)
          |         |                   |          \-SYM-A-(#TEMJ001:2 flag=00000004)
          |         |          /-FIL----|
          |         |         |         |                    /-SYM-V-(a.i:1 flag=00000001)
          |         |         |         |          /-OBJ----|
          |         |         |         |         |         |--SYM-V-(a.j:2 flag=00000001)
          |         |         |         |         |          \-SYM-V-(a.x:3 flag=00000001)
          |         |         |         |--SRC----|
          |         |         |         |          \-TABL[WORK].have opt=''
          |         |         |         |--empty-
          |         |         |         |--empty-
          |         |         |         |--empty-
          |         |         |         |--empty-
          |         |         |         |                    /-SYM-A-(#TEMJ001:2 flag=00000004)
          |         |         |         |          /-ASGN---|
          |         |         |         |         |         |          /-SYM-F-(INTNX:1)
          |         |         |         |         |          \-FLST---|
          |         |         |         |         |                   |--LITC('month')
          |         |         |         |         |                   |--SYM-V-(a.j:2)
          |         |         |         |         |                    \-LITN(0)
          |         |         |          \-OBJE---|
          |         |--FROM---|
          |         |         |                    /-SYM-M-(x1:1 flag=00000001)
          |         |         |          /-OBJ----|
          |         |         |         |         |--SYM-V-(b.i:1 flag=00000001)
          |         |         |         |         |--SYM-V-(b.j:2 flag=00000001)
          |         |         |         |          \-SYM-A-(#TEMJ002:3 flag=00000004)
          |         |          \-FIL----|
          |         |                   |                    /-SYM-M-(x1:1 flag=00000001)
          |         |                   |          /-OBJ----|
          |         |                   |         |         |--SYM-V-(b.i:1 flag=00000001)
          |         |                   |         |          \-SYM-V-(b.j:2 flag=00000001)
          |         |                   |--SRC----|
          |         |                   |          \-TABL[WORK].have opt=''
          |         |                   |--empty-
          |         |                   |--empty-
          |         |                   |--empty-
          |         |                   |--empty-
          |         |                   |                    /-SYM-A-(#TEMJ002:3 flag=00000004)
          |         |                   |          /-ASGN---|
          |         |                   |         |         |          /-SYM-F-(INTNX:2)
          |         |                   |         |          \-FLST---|
          |         |                   |         |                   |--LITC('month')
          |         |                   |         |                   |--SYM-V-(b.j:2)
          |         |                   |         |                    \-LITN(1)
          |         |                    \-OBJE---|
          |         |--empty-
          |         |                    /-SYM-V-(a.i:1)
          |         |          /-CEQ----|
          |         |         |          \-SYM-V-(b.i:1)
          |          \-LAND---|
          |                   |          /-SYM-A-(#TEMJ001:2)
          |                    \-CEQ----|
          |                              \-SYM-A-(#TEMJ002:3)
 --SSEL---|


NOTE: Table WORK.WANT created, with 29500 rows and 4 columns.

207
208  create table want as select a.*,b.x as x1
209  from have a join have b on a.i = b.i and
210       intck("month",a.j,b.j)=-1 ;

Tree as planned.
                               /-SYM-V-(a.i:1 flag=00000001)
                     /-OBJ----|
                    |         |--SYM-V-(a.j:2 flag=00000001)
                    |         |--SYM-V-(a.x:3 flag=00000001)
                    |          \-SYM-M-(x1:1 flag=00000005)
           /-JOIN---|
          |         |                              /-SYM-V-(a.i:1 flag=00000001)
          |         |                    /-OBJ----|
          |         |                   |         |--SYM-V-(a.j:2 flag=00000001)
          |         |                   |          \-SYM-V-(a.x:3 flag=00000001)
          |         |          /-SRC----|
          |         |         |          \-TABL[WORK].have opt=''
          |         |--FROM---|
          |         |         |                    /-SYM-M-(x1:1 flag=00000001)
          |         |         |          /-OBJ----|
          |         |         |         |         |--SYM-V-(b.i:1 flag=00000001)
          |         |         |         |          \-SYM-V-(b.j:2 flag=00000001)
          |         |          \-SRC----|
          |         |                    \-TABL[WORK].have opt=''
          |         |                    /-SYM-F-(INTCK:1)
          |         |          /-FLST---|
          |         |         |         |--LITC('month')
          |         |         |         |--SYM-V-(a.j:2)
          |         |         |          \-SYM-V-(b.j:2)
          |         |--CEQ----|
          |         |          \-LITN(-1)
          |         |          /-SYM-V-(a.i:1)
          |          \-CEQ----|
          |                    \-SYM-V-(b.i:1)
 --SSEL---|


NOTE: Table WORK.WANT created, with 29500 rows and 4 columns.

211  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.47 seconds
      cpu time            0.46 seconds

 

 

 

Junyong
Pyrite | Level 9

intnx subsets and then joins, while intck joins and then subsets, which is why intnx was faster than intck—thanks for this clarification. Probably functions requiring multiple variables from different data sets cause bottlenecks.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2618 views
  • 8 likes
  • 4 in conversation