DATA Step, Macro, Functions and more

PROC SQL remerging

Accepted Solution Solved
Reply
PROC Star
Posts: 1,325
Accepted Solution

PROC SQL remerging

[ Edited ]

Hi All,

 

Below is a PROC SQL query that is surprising me.  It has a many-to-one join of two subqueries a and b, and then uses a group by clause to aggregate the result.

 

data have ;
  input id1 id2 x y ;
  cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
 ;
quit ;

proc sql ;
  select
    a.id1
   ,sum(max_x) as blah
   ,max_y  /*this causes remerging in 9.4 but not 9.3! */  /*if change to max_y as max_y in 9.4, it won't remerge! */
  from
    (select id1,max(x) as max_x
     from have
     group by id1,id2
     ) as a
  ,(select id1,max(y) as max_y
     from have
     group by id1
     ) as b
  where a.id1=b.id1
  group by a.id1
   ;
quit ;

 

 

Observations:

  1. When I run the code in 9.3, it does not trigger remerging, and the output is two rows.
  2. When I run the code in 9.4M3, it does trigger remerging, and the output is four rows.
  3. When I run the code in 9.4M3, if I change the first select statement from ",max_y" to ",max_y as max_y" it does NOT trigger remerging (and output is two rows, like 9.3).

I think I can convince myself that returning four rows is correct (since there is no excplicit aggregate function, even though the data were already aggregated on the second subquery).  Would that suggest that this was a bug in 9.3 not to remerge? 

 

I can't imagine why adding "as max_y" would cause 9.4M3 to decide NOT to remerge.  But it makes me wonder if in SAS SQL not remerging is actually the "correct" behavior.  i.e., is the SAS SQL compiler supposed to be smart enough to see that the max_y has already been aggregated to the appropriate level?

 

Thanks,

--Q.


Accepted Solutions
Solution
‎02-20-2017 01:40 PM
PROC Star
Posts: 1,325

Re: PROC SQL remerging

Hi All,

 

I heard back from tech support.  My expectation/hope was incorrect; they agreed with the expectations of @PGStats.

 

Per tech support, all six of the scenarios I posted should remerge.  So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge. 

 

The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause."  http://support.sas.com/techsup/notes/v8/4/308.html

 

So there is a bug  in 9.3 that the subquery and view approach do not remerge.  This bug was fixed in 9.4, so that they do remerge.

 

There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge.  They will add a TS note about this bug.

 

Thanks all for the help in understanding what is happening.

 

--Q.

View solution in original post


All Replies
Respected Advisor
Posts: 4,935

Re: PROC SQL remerging

Interesting. I think it should always remerge. Anyhow, these two operations should give identical results:

 

data have ;
  input id1 id2 x y ;
  cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
 ;

proc sql;

title "Views";
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;

title "Tables";
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;

quit;

they don't (in SAS/SQL version 9.4)

PG
Super User
Posts: 10,046

Re: PROC SQL remerging

Did you try :

,sum(a.max_x) as blah
   , b.max_y 

Better feedback it to sas support.


Valued Guide
Posts: 2,177

Re: PROC SQL remerging

I think the _TREE procedure option might reveal the different query structures.
PROC Star
Posts: 1,325

Re: PROC SQL remerging

Thanks all.

 

Yes @Peter_C _TREE showed differences (in particular the value of FLAG was different for max_y) but unfortunately I'm not adept at interpretting tree. In any case, it was helpful evidence that the SQL optimizer is making different decisisosn.

 

@Ksharp , changing to b.max_y did not change results.  I will definitely send it in to tech support.

 

@PGStats thanks for adding your very helpful examples of using views vs tables.

 

Below I have 6 queries.  Three approaches (subqueries, views, tables ) each with "select Max_Y" and "select Max_Y as Max_Y".

 

In 9.3 the results are consistent (though not what @PGStats expects).  With the subquery and view approach, there is no remerging.  Witht the tables approach there is remerging.  This is plausible to me, if the SQL optimizer is "smart" enough to look into the subqueries and views and see that Max_Y has already been aggregated to the appropriate level, so does not need to be remerged.  In 9.3, this is true regardless of using "select Max_Y or "select Max_Y as Max_Y".

 

In 9.4M3, if you use  "select Max_Y as Max_Y" the results match 9.3 (i.e. subquery approach and view approach do NOT remerge).  But in 9.4M3, if you use "select Max_Y" the subquery approach and view approach do remerge.  I can't see an argument for why using "as" should change the results of the query, so I feel like this is a bug in 9.4.

 

My questions for tech support will be

 

1.  "Which of these queries *should* remerge?"  Here I expect they will say use of subqueries and views should not remerge, but tables should remerge.  Even though this may be undesired from perspective of subqueries/views vs tables behaving differently, I think maybe it's fair because the whole remerging concept is non-standard SQL anyway, so SAS gets to decide the rules.

 

2.  "In 9.4M3, why would use of 'as' change the behavior?" Here I hope their answer will be that this is a bug in 9.4  That is, I hope they wil say that 9.4 behavior should be same as 9.3.

 

Test code:

 

proc sql  ; *_tree;

title1 "1. Subqueries with max_y as max_y";
title2 "9.4M3 No Remerging" ;
select
  a.id1
 ,sum(max_x) as blah
 ,b.max_y as max_y
from
  (select id1,max(x) as max_x
   from have
   group by id1,id2
   ) as a
inner join (select id1,max(y) as max_y
   from have
   group by id1
   ) as b
on a.id1=b.id1
group by a.id1
 ;

title1 "2. Views with max_y as max_y";
title2 "9.4M3 No Remerging" ;
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,b.max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;

title1 "3. Tables with max_y as max_y";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;


title1 "4. Subqueries with max_y (no as)";
title2 "9.4M3 Remerging" ;
select
  a.id1
 ,sum(max_x) as blah
 ,b.max_y
from
  (select id1,max(x) as max_x
   from have
   group by id1,id2
   ) as a
inner join (select id1,max(y) as max_y
   from have
   group by id1
   ) as b
on a.id1=b.id1
group by a.id1
 ;

title1 "5. Views with max_y (no as)";
title2 "9.4M3 Remerging" ;
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,b.max_y 
from a inner join b on a.id1 = b.id1
group by a.id1;

title1 "6. Tables with max_y (no as)";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;


quit;
Trusted Advisor
Posts: 1,022

Re: PROC SQL remerging

@Quentin

 

Please keep us up to date on SAS's response to your questions.

Solution
‎02-20-2017 01:40 PM
PROC Star
Posts: 1,325

Re: PROC SQL remerging

Hi All,

 

I heard back from tech support.  My expectation/hope was incorrect; they agreed with the expectations of @PGStats.

 

Per tech support, all six of the scenarios I posted should remerge.  So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge. 

 

The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause."  http://support.sas.com/techsup/notes/v8/4/308.html

 

So there is a bug  in 9.3 that the subquery and view approach do not remerge.  This bug was fixed in 9.4, so that they do remerge.

 

There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge.  They will add a TS note about this bug.

 

Thanks all for the help in understanding what is happening.

 

--Q.

Respected Advisor
Posts: 4,935

Re: PROC SQL remerging

Thank you @Quentin for carrying this through. Remerging is a very useful feature of SAS/SQL, it should be done flawlessly. 

PG
Trusted Advisor
Posts: 1,022

Re: PROC SQL remerging

@Quentin

 

I too appreciate you pasing along SAS's  answer to this question.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 375 views
  • 9 likes
  • 5 in conversation