I'm trying to sum totals across datasets using proc sql. Is there away to avoid the third select statement (i.e. incorporating it in to the first two?)
data a;
input x 2.;
datalines;
10
20
;
data b;
input x 2.;
datalines;
25 ;
proc sql;
create table c as select sum(x) as tot
from a
union
select sum(x) as tot
from b;
select sum(tot)
from c;
quit;
Hi @Batman
You can do this
Proc sql;
Create table c as
Select sum(x) as tot
From (select x from a union all select x from b);
Quit;
it is important to specify « all » to avoid to remove duplicate records from input tables
Hi,
how about subquery?
data a;
input x 2.;
datalines;
10
20
;
run;
data b;
input x 2.;
datalines;
25
;
run;
proc sql;
select sum(x) as tot
from
(
select x
from a
union all
select x
from b
)
;
quit;
Hi @Batman
You can do this
Proc sql;
Create table c as
Select sum(x) as tot
From (select x from a union all select x from b);
Quit;
it is important to specify « all » to avoid to remove duplicate records from input tables
Hi @Batman I am afraid there is some glitch in either
1. my comprehension of the requirement
2. Or the fix
3. or the requirement itself
My understanding from what you wrote is
1. You have 2 queries that involves a SET operator.
2. It appears the independent queries compute an independent sum of the the values of x. This would perfectly yield 1 record for each select clause i.e one for a A and two for B with two independent grand_totals
3. At this point, you merely want to compute the grand_total of the two grand_totals, or in other words sum of the 2 sums.
So the real and meaningful fix doesn't warrant an UNION ALL rather your existing query is indeed not wrong. The fix is actually to get your two SELECT clauses that are combined using a SET operator in a subquery as mentioned by genius @yabwon for the reason the SQL processor needs a copy aka akin to FROM table/view to process the sum=>sum (of the sums).
Of course doing one combined sum from an appended copy usng UNION ALL is essentially the right way albeit the slight tweak that is required in your existing query to make it work after all is
select sum(x) as tot
from a
union
select sum(x) as tot
from b
/*to have the above as SUBQUERY below*/
proc sql;
create table c as
select sum(tot) as g_tot
from
(select sum(x) as tot
from a
union
select sum(x) as tot
from b);
quit;
I hope this clears the crack for others who perhaps weren't clear initially like me
Hi @novinosrin ,
Your post got me thinking about performance (thanks 1e6 for the inspiration!)
So, what I did:
data a b c d;
do _N_ = 1 to 1e8;
x = 1;
output;
end;
run;
options fullstimer msglevel=i;
resetline;
options ps=max ls=max;
proc sql feedback _method _tree;
select sum(tot) as tot
from
(
select sum(x) as tot
from a
union all
select sum(x) as tot
from b
)
;
quit;
proc sql feedback _method _tree;
select sum(x) as tot
from
(
select x
from c
union all
select x
from d
)
;
quit;
Your idea of doing sum() in subquery ran half the time mine idea did:
71 data a b c d; 72 do _N_ = 1 to 1e8; 73 x = 1; 74 output; 75 end; 76 run; NOTE: The data set WORK.A has 100000000 observations and 1 variables. NOTE: The data set WORK.B has 100000000 observations and 1 variables. NOTE: The data set WORK.C has 100000000 observations and 1 variables. NOTE: The data set WORK.D has 100000000 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 19.85 seconds cpu time 19.78 seconds NOTE: Statement transforms to: select SUM(tot) as tot from ( select SUM(A.x) as tot from WORK.A ) union all ( select SUM(B.x) as tot from WORK.B )); NOTE: SQL execution methods chosen are: sqxslct sqxsumn sqxuall sqxsumn sqxsrc( WORK.A ) sqxsumn sqxsrc( WORK.B ) NOTE: PROCEDURE SQL used (Total process time): real time 15.98 seconds user cpu time 14.34 seconds system cpu time 1.60 seconds memory 5649.09k OS Memory 27384.00k NOTE: Statement transforms to: select SUM(x) as tot from ( select C.x from WORK.C ) union all ( select D.x from WORK.D )); NOTE: SQL execution methods chosen are: sqxslct sqxsumn sqxuall sqxsrc( WORK.C ) sqxsrc( WORK.D ) NOTE: PROCEDURE SQL used (Total process time): real time 31.27 seconds user cpu time 29.82 seconds system cpu time 1.54 seconds memory 5622.53k OS Memory 27384.00k
But I think we need `union all` in any case. There is one data setup which cause it:
data a;
input x 2.;
datalines;
10
20
;
run;
data b;
input x 2.;
datalines;
25
5
;
run;
proc sql;
select sum(x) as tot
from a
union /* ALL */
select sum(x) as tot
from b
;
run;
Without `all` we will get 30, but we would like to have 60.
All the best
Bart
Good morning Cop @yabwon Great catch. I didn't think of identical sum that caused the elimination of dup. Yes, UNION ALL will be required in cases where have sum(of identical sums). Indeed union all is essentially the right way of doing it. I agree.
I can't thank you enough for your time and details. Much appreciate it bro!
Okay, that being said one funny workaround(turn on your sense of humor) with a dsn label making it unique
data a;
input x 2.;
datalines;
10
20
;
run;
data b;
input x 2.;
datalines;
25
5
;
run;
proc sql;
select sum(tot) as grand_tot
from
(select 'a' as dsn,sum(x) as tot
from a
union
select 'b' as dsn, sum(x) as tot
from b)
;
quit;
grand_tot |
---|
60 |
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.