Solved: Re: Sql union

Batman · Posted 02-04-2020 01:54 PM

I'm trying to sum totals across datasets using proc sql. Is there away to avoid the third select statement (i.e. incorporating it in to the first two?)

data a;

input x 2.;

datalines;

10

20

;

data b;

input x 2.;

datalines;

25 ;

proc sql;

create table c as select sum(x) as tot

from a

union

select sum(x) as tot

from b;

select sum(tot)

from c;

quit;

ed_sas_member · Posted 02-04-2020 02:13 PM

Hi @Batman

You can do this

Proc sql;
Create table c as 
Select sum(x) as tot
From (select x from a union all select x from b);
Quit;

it is important to specify « all » to avoid to remove duplicate records from input tables

View solution in original post

yabwon · Posted 02-04-2020 02:09 PM

Hi,

how about subquery?

data a;
input x 2.;
datalines;
10
20
;
run;

data b;
input x 2.;
datalines;
25 
;
run;

 

proc sql;
select sum(x) as tot
from
(
  select x
  from a

  union all

  select x
  from b
)
;
quit;

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ed_sas_member · Posted 02-04-2020 02:13 PM

Hi @Batman

You can do this

Proc sql;
Create table c as 
Select sum(x) as tot
From (select x from a union all select x from b);
Quit;

it is important to specify « all » to avoid to remove duplicate records from input tables

novinosrin · Posted 02-04-2020 03:42 PM

Hi @Batman I am afraid there is some glitch in either

1. my comprehension of the requirement

2. Or the fix

3. or the requirement itself

My understanding from what you wrote is

1. You have 2 queries that involves a SET operator.

2. It appears the independent queries compute an independent sum of the the values of x. This would perfectly yield 1 record for each select clause i.e one for a A and two for B with two independent grand_totals

3. At this point, you merely want to compute the grand_total of the two grand_totals, or in other words sum of the 2 sums.

So the real and meaningful fix doesn't warrant an UNION ~~ALL~~ rather your existing query is indeed not wrong. The fix is actually to get your two SELECT clauses that are combined using a SET operator in a subquery as mentioned by genius @yabwon for the reason the SQL processor needs a copy aka akin to FROM table/view to process the sum=>sum (of the sums).

Of course doing one combined sum from an appended copy usng UNION ALL is essentially the right way albeit the slight tweak that is required in your existing query to make it work after all is

select sum(x) as tot

from a

union

select sum(x) as tot

from b

/*to have the above as SUBQUERY below*/

proc sql;

create table c as 
select sum(tot) as g_tot
from
(select sum(x) as tot

from a

union

select sum(x) as tot

from b);
quit;

I hope this clears the crack for others who perhaps weren't clear initially like me

yabwon · Posted 02-05-2020 03:36 AM

Hi @novinosrin ,

Your post got me thinking about performance (thanks 1e6 for the inspiration!)

So, what I did:

data a b c d;
  do _N_ = 1 to 1e8;
    x = 1;
    output;
  end; 
run;

options fullstimer msglevel=i;
resetline;
options ps=max ls=max;
proc sql feedback _method _tree;
select sum(tot) as tot
from
(
  select sum(x) as tot
  from a

  union all 

  select sum(x) as tot
  from b
)
;
quit;

proc sql feedback _method _tree;
select sum(x) as tot
from
(
  select x
  from c

  union all

  select x
  from d
)
;
quit;

Your idea of doing sum() in subquery ran half the time mine idea did:

71   data a b c d;
72     do _N_ = 1 to 1e8;
73       x = 1;
74       output;
75     end;
76   run;

NOTE: The data set WORK.A has 100000000 observations and 1 variables.
NOTE: The data set WORK.B has 100000000 observations and 1 variables.
NOTE: The data set WORK.C has 100000000 observations and 1 variables.
NOTE: The data set WORK.D has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           19.85 seconds
      cpu time            19.78 seconds

NOTE: Statement transforms to:

        select SUM(tot) as tot
          from ( select SUM(A.x) as tot
                   from WORK.A
               )
                 union all
               ( select SUM(B.x) as tot
                   from WORK.B
               ));


NOTE: SQL execution methods chosen are:

      sqxslct
          sqxsumn
              sqxuall
                  sqxsumn
                      sqxsrc( WORK.A )
                  sqxsumn
                      sqxsrc( WORK.B )

NOTE: PROCEDURE SQL used (Total process time):
      real time           15.98 seconds
      user cpu time       14.34 seconds
      system cpu time     1.60 seconds
      memory              5649.09k
      OS Memory           27384.00k


NOTE: Statement transforms to:

        select SUM(x) as tot
          from ( select C.x
                   from WORK.C
               )
                 union all
               ( select D.x
                   from WORK.D
               ));


NOTE: SQL execution methods chosen are:

      sqxslct
          sqxsumn
              sqxuall
                  sqxsrc( WORK.C )
                  sqxsrc( WORK.D )

NOTE: PROCEDURE SQL used (Total process time):
      real time           31.27 seconds
      user cpu time       29.82 seconds
      system cpu time     1.54 seconds
      memory              5622.53k
      OS Memory           27384.00k

But I think we need `union all` in any case. There is one data setup which cause it:

data a;
input x 2.;
datalines;
10
20
;
run;

data b;
input x 2.;
datalines;
25
5 
;
run;

proc sql;
select sum(x) as tot
from a

union /* ALL */

select sum(x) as tot
from b
;
run;

Without `all` we will get 30, but we would like to have 60.

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

novinosrin · Posted 02-05-2020 07:38 AM

Good morning Cop @yabwon Great catch. I didn't think of identical sum that caused the elimination of dup. Yes, UNION ALL will be required in cases where have sum(of identical sums). Indeed union all is essentially the right way of doing it. I agree.

I can't thank you enough for your time and details. Much appreciate it bro!

Okay, that being said one funny workaround(turn on your sense of humor) with a dsn label making it unique

data a;
input x 2.;
datalines;
10
20
;
run;

data b;
input x 2.;
datalines;
25
5 
;
run;

proc sql;
select sum(tot) as grand_tot
from
(select 'a' as dsn,sum(x) as tot
from a

union 

select 'b' as dsn, sum(x) as tot
from b)
;
quit;

grand_tot
60

yabwon · Posted 02-05-2020 08:00 AM

lovely 😄 😄

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Registration is open

SAS Training: Just a Click Away