Re: Using SQL aggregate functions with a LEFT JOIN

mconover · Posted 01-28-2016 03:38 AM

I've got two datasets. The first dataset (DATA_WITH_MANY_ROWS) is a dataset with many rows of data and includes two numeric variables numvar1 and numvar2. Using only one PROC SQL step, I would like to 1) aggregate the data across this entire table, then 2) aggregate the data within a subset of the table, then 3) join with an existing dataset that has only one observation (DATA_WITH_ONE_ROW). You'll also notice I'm including adding a text label to the data. Essentially I am trying to export a final dataset which contains only one observation and side by side includes the above described measures.

I have included code below showing my current approach. However, the output table (OUTPUT_TABLE) ends up having as many rows as the dataset DATA_WITH_MANY_ROWS. I can't seem to figure out where my code is going wrong since I'm fairly certain each item in the left-join should only have one observation and should join smoothly using the ON 1=1 condition. Since I am summarizing in every case down to a single row of data, I assume I do not need to use group by functions.

Am I misunderstanding something about how the order that SQL is processing my code? Thanks to the community for any help/insight.

PROC SQL;

CREATE TABLE work.OUTPUT_TABLE AS

SELECT

"texthere" AS label_everyone,

SUM(a.numvar1) AS TOTAL1 SUM(a.numvar2) AS TOTAL2,

b.*, c.*

FROM work.DATA_WITH_MANY_ROWS AS a

LEFT JOIN

(SELECT SUM(numvar1) AS TOTAL1_SUBSET, SUM(numvar2) AS TOTAL2_SUBSET FROM DATA_WITH_MANY_ROWS WHERE classvar=1) AS b ON 1=1

LEFT JOIN DATA_WITH_ONE_ROW AS c ON 1=1;

QUIT;

Kurt_Bremser · Posted 01-28-2016 04:02 AM

"texthere" AS label_everyone,

is not an aggregate. This is what most probably causes every record to be in the output.

As an aside, I would clearly avoid putting it all in one complex step. Build it step by step, it is easier to understand and easier to debug.

"Write every program as if the next one to maintain it is a violent psychopath who knows where you live"

Here's my solution with a data step that will make only one pass through the large dataset:

data sums;
label_everyone = 'texthere';
merge
  data_with_many_rows
  data_with_one_row
  end=done
;
retain
  total1 0
  total2 0
  total1_subset 0
  total2_subset 0
;
total1 + numvar1;
total2 + numvar2;
if classvar = 1 then do;
  total1_subset + numvar1;
  total2_subset + numvar2;
end;
if done then output;
keep label_everyone total1 total2 total1_subset total2_subset vars_from one_row;
run;

(not tested because of missing test data)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RW9 · Posted 01-28-2016 04:39 AM

Sorry, its not clear what you are trying to do, there are several errors in the code. Also, code formatting makes reading a lot easier:

proc sql;
  create table WORK.OUTPUT_TABLE as
  select  "texthere" as LABEL_EVERYONE,
          sum(A.NUMVAR1) as TOTAL1    /* <- missing semicolon */
          sum(A.NUMVAR2) as TOTAL2,
          B.*,    /* This is meaningless? */
          C.*     /* This is meaningless? */
  from    WORK.DATA_WITH_MANY_ROWS as A
  left join (select sum(NUMVAR1) as TOTAL1_SUBSET, 
                    sum(NUMVAR2) as TOTAL2_SUBSET 
             from   DATA_WITH_MANY_ROWS where CLASSVAR=1) as B 
  on      1=1  /* This is equicalent to merge every row in A with every row in B, not what you intend */
  left join DATA_WITH_ONE_ROW as C 
  on 1=1;  /* as above, merge eveery row of first result with every row of this one */
quit;

Post some sample data, in the form of a datastep, and what you want the output to look like.

FreelanceReinh · Posted 01-28-2016 07:17 AM

Let n denote the number of rows in DATA_WITH_MANY_ROWS (n>1). The left join of DATA_WITH_MANY_ROWS with the table with alias b (which has only one row) on 1=1 is a table with all rows from DATA_WITH_MANY_ROWS, enriched with the two variables from "b" (i.e. identical values of TOTAL1_SUBSET and TOTAL2_SUBSET, respectively, on each row). Hence the note in the log: "NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized." The resulting table has n rows.

Same situation with the second left join: You simply get additional columns (one per variable in DATA_WITH_ONE_ROW) having constant values in all rows.

It's this temporary table with n rows on which the first SELECT statement operates. It selects summary statistics (TOTAL1, TOTAL2), but at the same time "original data" contributed by tables "b" and "c". The important point is that "original data", i.e. b.* and c.*, does not refer to the single observations of tables "b" and "c", but to the columns contributed by "b" and "c" in the temporary table described above! Hence the note in the log: "NOTE: The query requires remerging summary statistics back with the original data." That is, the values of TOTAL1 and TOTAL2 are copied n times to appear in all rows together with the values b.* and c.* (regardless of the fact that these values happen to be likewise only copies of a single value per variable) and with the constant label_everyone.

So, a quick fix of your PROC SQL step would be to

insert the keyword "DISTINCT" after the first "SELECT": SELECT DISTINCT "texthere" ...
add the missing comma after "TOTAL1" (as @RW9 pointed at).

That said, I agree with @Kurt_Bremser that the task could be accomplished more efficiently, be it with an improved PROC SQL step or with a data step. (I think, "merge" should read "set" in Kurt's solution and the RETAIN statement is redundant due to the "SUM" statements.)

Kurt_Bremser · Posted 01-28-2016 07:33 AM

Both "set" and "merge" work because of the way the vars are retained and only one output happens at the very end. But to me the "merge" better shows the side-by-side nature of the operation.

And I used the retain statement because it allows to set the initial value.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 01-28-2016 08:43 AM

@Kurt_Bremser: Are you sure? Maybe it depends on the input datasets, but with those I had made up for testing (see below) it seemed that the variable(s) from DATA_WITH_ONE_ROW would have missing values with the 1-to-1 merge. With the SET statement, however, they would have their single value, because DATA_WITH_ONE_ROW is the second dataset in the SET statement, so that the value is present in the last observation selected with "if done ...".

data data_with_many_rows;
do numvar1=1 to 10;
  classvar=mod(numvar1,3);
  numvar2=int(11*ranuni(314159));
  output;
end;
run;

data data_with_one_row;
a=314;
run;

If I'm not mistaken, the initial value 0 (used in the RETAIN statement) is implied by the use of the Sum statement (as is the "retaining" itself). So, for example, without the RETAIN statement the subset totals would still have value 0 (and not missing) if the condition "classvar=1" was never met and hence the Sum statements were never executed (but compiled!).

Kurt_Bremser · Posted 01-28-2016 09:11 AM

You are right. I'm so used to do MERGE always with a BY that I presumed that the value(s) of data_with_one_row would automatically be kept for the whole operation. Which they obviously aren't, so set is the way to go.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BMiller · Posted 01-28-2016 10:05 AM

Excuse the poor layout etc, just edited your code quickly.

BM

PROC SQL;

CREATE TABLE work.OUTPUT_TABLE AS

SELECT

"texthere" AS label_everyone,

A.total1,

A.total2,

b.*,

c.*

FROM (

select SUM(a.numvar1) AS TOTAL1,

SUM(a.numvar2) AS TOTAL2

from work.DATA_WITH_MANY_ROWS

) AS a

, (SELECT SUM(numvar1) AS TOTAL1_SUBSET,

SUM(numvar2) AS TOTAL2_SUBSET

FROM DATA_WITH_MANY_ROWS

WHERE classvar=1

) AS b

, DATA_WITH_ONE_ROW AS c ;

QUIT;

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away