Hi All,
Below is a PROC SQL query that is surprising me. It has a many-to-one join of two subqueries a and b, and then uses a group by clause to aggregate the result.
data have ;
input id1 id2 x y ;
cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
;
quit ;
proc sql ;
select
a.id1
,sum(max_x) as blah
,max_y /*this causes remerging in 9.4 but not 9.3! */ /*if change to max_y as max_y in 9.4, it won't remerge! */
from
(select id1,max(x) as max_x
from have
group by id1,id2
) as a
,(select id1,max(y) as max_y
from have
group by id1
) as b
where a.id1=b.id1
group by a.id1
;
quit ;
Observations:
I think I can convince myself that returning four rows is correct (since there is no excplicit aggregate function, even though the data were already aggregated on the second subquery). Would that suggest that this was a bug in 9.3 not to remerge?
I can't imagine why adding "as max_y" would cause 9.4M3 to decide NOT to remerge. But it makes me wonder if in SAS SQL not remerging is actually the "correct" behavior. i.e., is the SAS SQL compiler supposed to be smart enough to see that the max_y has already been aggregated to the appropriate level?
Thanks,
--Q.
Hi All,
I heard back from tech support. My expectation/hope was incorrect; they agreed with the expectations of @PGStats.
Per tech support, all six of the scenarios I posted should remerge. So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge.
The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause." http://support.sas.com/techsup/notes/v8/4/308.html
So there is a bug in 9.3 that the subquery and view approach do not remerge. This bug was fixed in 9.4, so that they do remerge.
There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge. They will add a TS note about this bug.
Thanks all for the help in understanding what is happening.
--Q.
Interesting. I think it should always remerge. Anyhow, these two operations should give identical results:
data have ;
input id1 id2 x y ;
cards ;
1 1 10 100
1 2 20 200
2 1 30 300
2 2 40 400
;
proc sql;
title "Views";
create view a as
select id1, max(x) as max_x
from have
group by id1, id2;
create view b as
select id1, max(y) as max_y
from have
group by id1;
select
a.id1
,sum(max_x) as blah
,max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;
title "Tables";
create table c as
select id1, max(x) as max_x
from have
group by id1, id2;
create table d as
select id1, max(y) as max_y
from have
group by id1;
select
c.id1
,sum(max_x) as blah
,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;
quit;
they don't (in SAS/SQL version 9.4)
Did you try : ,sum(a.max_x) as blah , b.max_y Better feedback it to sas support.
Thanks all.
Yes @Peter_C _TREE showed differences (in particular the value of FLAG was different for max_y) but unfortunately I'm not adept at interpretting tree. In any case, it was helpful evidence that the SQL optimizer is making different decisisosn.
@Ksharp , changing to b.max_y did not change results. I will definitely send it in to tech support.
@PGStats thanks for adding your very helpful examples of using views vs tables.
Below I have 6 queries. Three approaches (subqueries, views, tables ) each with "select Max_Y" and "select Max_Y as Max_Y".
In 9.3 the results are consistent (though not what @PGStats expects). With the subquery and view approach, there is no remerging. Witht the tables approach there is remerging. This is plausible to me, if the SQL optimizer is "smart" enough to look into the subqueries and views and see that Max_Y has already been aggregated to the appropriate level, so does not need to be remerged. In 9.3, this is true regardless of using "select Max_Y or "select Max_Y as Max_Y".
In 9.4M3, if you use "select Max_Y as Max_Y" the results match 9.3 (i.e. subquery approach and view approach do NOT remerge). But in 9.4M3, if you use "select Max_Y" the subquery approach and view approach do remerge. I can't see an argument for why using "as" should change the results of the query, so I feel like this is a bug in 9.4.
My questions for tech support will be
1. "Which of these queries *should* remerge?" Here I expect they will say use of subqueries and views should not remerge, but tables should remerge. Even though this may be undesired from perspective of subqueries/views vs tables behaving differently, I think maybe it's fair because the whole remerging concept is non-standard SQL anyway, so SAS gets to decide the rules.
2. "In 9.4M3, why would use of 'as' change the behavior?" Here I hope their answer will be that this is a bug in 9.4 That is, I hope they wil say that 9.4 behavior should be same as 9.3.
Test code:
proc sql ; *_tree;
title1 "1. Subqueries with max_y as max_y";
title2 "9.4M3 No Remerging" ;
select
a.id1
,sum(max_x) as blah
,b.max_y as max_y
from
(select id1,max(x) as max_x
from have
group by id1,id2
) as a
inner join (select id1,max(y) as max_y
from have
group by id1
) as b
on a.id1=b.id1
group by a.id1
;
title1 "2. Views with max_y as max_y";
title2 "9.4M3 No Remerging" ;
create view a as
select id1, max(x) as max_x
from have
group by id1, id2;
create view b as
select id1, max(y) as max_y
from have
group by id1;
select
a.id1
,sum(max_x) as blah
,b.max_y as max_y
from a inner join b on a.id1 = b.id1
group by a.id1;
title1 "3. Tables with max_y as max_y";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
from have
group by id1, id2;
create table d as
select id1, max(y) as max_y
from have
group by id1;
select
c.id1
,sum(max_x) as blah
,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;
title1 "4. Subqueries with max_y (no as)";
title2 "9.4M3 Remerging" ;
select
a.id1
,sum(max_x) as blah
,b.max_y
from
(select id1,max(x) as max_x
from have
group by id1,id2
) as a
inner join (select id1,max(y) as max_y
from have
group by id1
) as b
on a.id1=b.id1
group by a.id1
;
title1 "5. Views with max_y (no as)";
title2 "9.4M3 Remerging" ;
create view a as
select id1, max(x) as max_x
from have
group by id1, id2;
create view b as
select id1, max(y) as max_y
from have
group by id1;
select
a.id1
,sum(max_x) as blah
,b.max_y
from a inner join b on a.id1 = b.id1
group by a.id1;
title1 "6. Tables with max_y (no as)";
title2 "9.4M3 Remerging" ;
create table c as
select id1, max(x) as max_x
from have
group by id1, id2;
create table d as
select id1, max(y) as max_y
from have
group by id1;
select
c.id1
,sum(max_x) as blah
,max_y as max_y
from c inner join d on c.id1 = d.id1
group by c.id1;
quit;
Please keep us up to date on SAS's response to your questions.
Hi All,
I heard back from tech support. My expectation/hope was incorrect; they agreed with the expectations of @PGStats.
Per tech support, all six of the scenarios I posted should remerge. So regardless of whether using subqueries / views / interim tables, and regardless of whether or not you use an alias, it should remerge.
The remerging rule is straight-forward: "This note is generated when a column, listed on the select clause, is not being used in a summary function and the column is not listed on the GROUP BY clause." http://support.sas.com/techsup/notes/v8/4/308.html
So there is a bug in 9.3 that the subquery and view approach do not remerge. This bug was fixed in 9.4, so that they do remerge.
There is a different bug in 9.4 such that if you use an alias ("as") on the SELECT clause, or a format, it will not remerge. They will add a TS note about this bug.
Thanks all for the help in understanding what is happening.
--Q.
Thank you @Quentin for carrying this through. Remerging is a very useful feature of SAS/SQL, it should be done flawlessly.
I too appreciate you pasing along SAS's answer to this question.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.