BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi All,

 

Below is a PROC SQL query that is surprising me.  It has a many-to-one join of two subqueries a and b, and then uses a group by clause to aggregate the result.

 

data have ;
  input id1 id2 x y ;
  cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
 ;
quit ;

proc sql ;
  select
    a.id1
   ,sum(max_x) as blah
   ,max_y  /*this causes remerging in 9.4 but not 9.3! */  /*if change to max_y as max_y in 9.4, it won't remerge! */
  from
    (select id1,max(x) as max_x
     from have
     group by id1,id2
     ) as a
  ,(select id1,max(y) as max_y
     from have
     group by id1
     ) as b
  where a.id1=b.id1
  group by a.id1
   ;
quit ;

 

 

Observations:

  1. When I run the code in 9.3, it does not trigger remerging, and the output is two rows.
  2. When I run the code in 9.4M3, it does trigger remerging, and the output is four rows.
  3. When I run the code in 9.4M3, if I change the first select statement from ",max_y" to ",max_y as max_y" it does NOT trigger remerging (and output is two rows, like 9.3).

I think I can convince myself that returning four rows is correct (since there is no excplicit aggregate function, even though the data were already aggregated on the second subquery).  Would that suggest that this was a bug in 9.3 not to remerge? 

 

I can't imagine why adding "as max_y" would cause 9.4M3 to decide NOT to remerge.  But it makes me wonder if in SAS SQL not remerging is actually the "correct" behavior.  i.e., is the SAS SQL compiler supposed to be smart enough to see that the max_y has already been aggregated to the appropriate level?

 

Thanks,

--Q.

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

Hi All,

 

I heard back from tech support.  My expectation/hope was incorrect; they agreed with the expectations of @PGStats.

 

Per tech support, all six of the scenarios I posted should remerge.  So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge. 

 

The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause."  http://support.sas.com/techsup/notes/v8/4/308.html

 

So there is a bug  in 9.3 that the subquery and view approach do not remerge.  This bug was fixed in 9.4, so that they do remerge.

 

There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge.  They will add a TS note about this bug.

 

Thanks all for the help in understanding what is happening.

 

--Q.

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

View solution in original post

8 REPLIES 8
PGStats
Opal | Level 21

Interesting. I think it should always remerge. Anyhow, these two operations should give identical results:

 

data have ;
  input id1 id2 x y ;
  cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
 ;

proc sql;

title "Views";
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;

title "Tables";
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;

quit;

they don't (in SAS/SQL version 9.4)

PG
Ksharp
Super User
Did you try :

,sum(a.max_x) as blah
   , b.max_y 

Better feedback it to sas support.


Peter_C
Rhodochrosite | Level 12
I think the _TREE procedure option might reveal the different query structures.
Quentin
Super User

Thanks all.

 

Yes @Peter_C _TREE showed differences (in particular the value of FLAG was different for max_y) but unfortunately I'm not adept at interpretting tree. In any case, it was helpful evidence that the SQL optimizer is making different decisisosn.

 

@Ksharp , changing to b.max_y did not change results.  I will definitely send it in to tech support.

 

@PGStats thanks for adding your very helpful examples of using views vs tables.

 

Below I have 6 queries.  Three approaches (subqueries, views, tables ) each with "select Max_Y" and "select Max_Y as Max_Y".

 

In 9.3 the results are consistent (though not what @PGStats expects).  With the subquery and view approach, there is no remerging.  Witht the tables approach there is remerging.  This is plausible to me, if the SQL optimizer is "smart" enough to look into the subqueries and views and see that Max_Y has already been aggregated to the appropriate level, so does not need to be remerged.  In 9.3, this is true regardless of using "select Max_Y or "select Max_Y as Max_Y".

 

In 9.4M3, if you use  "select Max_Y as Max_Y" the results match 9.3 (i.e. subquery approach and view approach do NOT remerge).  But in 9.4M3, if you use "select Max_Y" the subquery approach and view approach do remerge.  I can't see an argument for why using "as" should change the results of the query, so I feel like this is a bug in 9.4.

 

My questions for tech support will be

 

1.  "Which of these queries *should* remerge?"  Here I expect they will say use of subqueries and views should not remerge, but tables should remerge.  Even though this may be undesired from perspective of subqueries/views vs tables behaving differently, I think maybe it's fair because the whole remerging concept is non-standard SQL anyway, so SAS gets to decide the rules.

 

2.  "In 9.4M3, why would use of 'as' change the behavior?" Here I hope their answer will be that this is a bug in 9.4  That is, I hope they wil say that 9.4 behavior should be same as 9.3.

 

Test code:

 

proc sql  ; *_tree;

title1 "1. Subqueries with max_y as max_y";
title2 "9.4M3 No Remerging" ;
select
  a.id1
 ,sum(max_x) as blah
 ,b.max_y as max_y
from
  (select id1,max(x) as max_x
   from have
   group by id1,id2
   ) as a
inner join (select id1,max(y) as max_y
   from have
   group by id1
   ) as b
on a.id1=b.id1
group by a.id1
 ;

title1 "2. Views with max_y as max_y";
title2 "9.4M3 No Remerging" ;
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,b.max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;

title1 "3. Tables with max_y as max_y";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;


title1 "4. Subqueries with max_y (no as)";
title2 "9.4M3 Remerging" ;
select
  a.id1
 ,sum(max_x) as blah
 ,b.max_y
from
  (select id1,max(x) as max_x
   from have
   group by id1,id2
   ) as a
inner join (select id1,max(y) as max_y
   from have
   group by id1
   ) as b
on a.id1=b.id1
group by a.id1
 ;

title1 "5. Views with max_y (no as)";
title2 "9.4M3 Remerging" ;
create view a as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create view b as
select id1, max(y) as max_y
        from have
        group by id1;
select
    a.id1
   ,sum(max_x) as blah
   ,b.max_y 
from a inner join b on a.id1 = b.id1
group by a.id1;

title1 "6. Tables with max_y (no as)";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
        from have
        group by id1, id2;
create table d as
select id1, max(y) as max_y
        from have
        group by id1;
select
    c.id1
   ,sum(max_x) as blah
   ,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;


quit;
BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mkeintz
PROC Star

@Quentin

 

Please keep us up to date on SAS's response to your questions.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Quentin
Super User

Hi All,

 

I heard back from tech support.  My expectation/hope was incorrect; they agreed with the expectations of @PGStats.

 

Per tech support, all six of the scenarios I posted should remerge.  So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge. 

 

The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause."  http://support.sas.com/techsup/notes/v8/4/308.html

 

So there is a bug  in 9.3 that the subquery and view approach do not remerge.  This bug was fixed in 9.4, so that they do remerge.

 

There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge.  They will add a TS note about this bug.

 

Thanks all for the help in understanding what is happening.

 

--Q.

BASUG is hosting free webinars Next up: Mike Sale presenting Data Warehousing with SAS April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
PGStats
Opal | Level 21

Thank you @Quentin for carrying this through. Remerging is a very useful feature of SAS/SQL, it should be done flawlessly. 

PG
mkeintz
PROC Star

@Quentin

 

I too appreciate you pasing along SAS's  answer to this question.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2556 views
  • 9 likes
  • 5 in conversation