Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- Using SQL aggregate functions with a LEFT JOIN

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 01-28-2016 03:38 AM
(5200 views)

I've got two datasets. The first dataset (DATA_WITH_MANY_ROWS) is a dataset with many rows of data and includes two numeric variables numvar1 and numvar2. Using only one PROC SQL step, I would like to 1) aggregate the data across this entire table, then 2) aggregate the data within a subset of the table, then 3) join with an existing dataset that has only one observation (DATA_WITH_ONE_ROW). You'll also notice I'm including adding a text label to the data. Essentially I am trying to export a final dataset which contains only one observation and side by side includes the above described measures.

I have included code below showing my current approach. However, the output table (OUTPUT_TABLE) ends up having as many rows as the dataset DATA_WITH_MANY_ROWS. I can't seem to figure out where my code is going wrong since I'm fairly certain each item in the left-join should only have one observation and should join smoothly using the ON 1=1 condition. Since I am summarizing in every case down to a single row of data, I assume I do not need to use group by functions.

Am I misunderstanding something about how the order that SQL is processing my code? Thanks to the community for any help/insight.

**PROC** **SQL**;

CREATE TABLE work.OUTPUT_TABLE AS

SELECT

"texthere" AS label_everyone,

SUM(a.numvar1) AS TOTAL1 SUM(a.numvar2) AS TOTAL2,

b.*, c.*

FROM work.DATA_WITH_MANY_ROWS AS a

LEFT JOIN

(SELECT SUM(numvar1) AS TOTAL1_SUBSET, SUM(numvar2) AS TOTAL2_SUBSET FROM DATA_WITH_MANY_ROWS WHERE classvar=**1**) AS b ON **1**=**1**

LEFT JOIN DATA_WITH_ONE_ROW AS c ON **1**=**1**;

**QUIT**;

7 REPLIES 7

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

`"texthere" AS label_everyone,`

is not an aggregate. This is what most probably causes every record to be in the output.

As an aside, I would clearly avoid putting it all in one complex step. Build it step by step, it is easier to understand and easier to debug.

"Write every program as if the next one to maintain it is a violent psychopath who knows where you live"

Here's my solution with a data step that will make only one pass through the large dataset:

```
data sums;
label_everyone = 'texthere';
merge
data_with_many_rows
data_with_one_row
end=done
;
retain
total1 0
total2 0
total1_subset 0
total2_subset 0
;
total1 + numvar1;
total2 + numvar2;
if classvar = 1 then do;
total1_subset + numvar1;
total2_subset + numvar2;
end;
if done then output;
keep label_everyone total1 total2 total1_subset total2_subset vars_from one_row;
run;
```

(not tested because of missing test data)

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Sorry, its not clear what you are trying to do, there are several errors in the code. Also, code formatting makes reading a lot easier:

proc sql; create table WORK.OUTPUT_TABLE as select "texthere" as LABEL_EVERYONE, sum(A.NUMVAR1) as TOTAL1 /* <- missing semicolon */ sum(A.NUMVAR2) as TOTAL2, B.*, /* This is meaningless? */ C.* /* This is meaningless? */ from WORK.DATA_WITH_MANY_ROWS as A left join (select sum(NUMVAR1) as TOTAL1_SUBSET, sum(NUMVAR2) as TOTAL2_SUBSET from DATA_WITH_MANY_ROWS where CLASSVAR=1) as B on 1=1 /* This is equicalent to merge every row in A with every row in B, not what you intend */ left join DATA_WITH_ONE_ROW as C on 1=1; /* as above, merge eveery row of first result with every row of this one */ quit;

Post some sample data, in the form of a datastep, and what you want the output to look like.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Let n denote the number of rows in DATA_WITH_MANY_ROWS (n>1). The left join of DATA_WITH_MANY_ROWS with the table with alias b (which has only one row) on 1=1 is a table with all rows from DATA_WITH_MANY_ROWS, enriched with the two variables from "b" (i.e. identical values of TOTAL1_SUBSET and TOTAL2_SUBSET, respectively, on each row). Hence the note in the log: "NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized." The resulting table has n rows.

Same situation with the second left join: You simply get additional columns (one per variable in DATA_WITH_ONE_ROW) having constant values in all rows.

It's *this* temporary table with n rows on which the first SELECT statement operates. It selects summary statistics (TOTAL1, TOTAL2), but at the same time "original data" contributed by tables "b" and "c". The important point is that "original data", i.e. b.* and c.*, does *not* refer to the single observations of tables "b" and "c", but to the *columns* *contributed* *by* "b" and "c" *in the temporary table* described above! Hence the note in the log: "NOTE: The query requires remerging summary statistics back with the original data." That is, the values of TOTAL1 and TOTAL2 are copied n times to appear in all rows together with the values b.* and c.* (regardless of the fact that these values happen to be likewise only copies of a single value per variable) and with the constant label_everyone.

So, a **quick fix** of your PROC SQL step would be to

**insert the keyword "DISTINCT" after the first "SELECT": SELECT DISTINCT "texthere" ...****add the missing comma after "TOTAL1" (as @RW9 pointed at).**

That said, I agree with @Kurt_Bremser that the task could be accomplished more efficiently, be it with an improved PROC SQL step or with a data step. (I think, "merge" should read "**set**" in Kurt's solution and the RETAIN statement is redundant due to the "SUM" statements.)

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Both "set" and "merge" work because of the way the vars are retained and only one output happens at the very end. But to me the "merge" better shows the side-by-side nature of the operation.

And I used the retain statement because it allows to set the initial value.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Kurt_Bremser: Are you sure? Maybe it depends on the input datasets, but with those I had made up for testing (see below) it seemed that the variable(s) from DATA_WITH_ONE_ROW would have missing values with the 1-to-1 merge. With the SET statement, however, they would have their single value, because DATA_WITH_ONE_ROW is the second dataset in the SET statement, so that the value is present in the last observation selected with "if done ...".

```
data data_with_many_rows;
do numvar1=1 to 10;
classvar=mod(numvar1,3);
numvar2=int(11*ranuni(314159));
output;
end;
run;
data data_with_one_row;
a=314;
run;
```

If I'm not mistaken, the initial value 0 (used in the RETAIN statement) is implied by the use of the Sum statement (as is the "retaining" itself). So, for example, without the RETAIN statement the subset totals would still have value 0 (and not missing) if the condition "classvar=1" was never met and hence the Sum statements were never executed (but compiled!).

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

**Excuse the poor layout etc, just edited your code quickly.**

**BM**

**PROC** **SQL**;

CREATE TABLE work.OUTPUT_TABLE AS

SELECT

"texthere" AS label_everyone,

A.total1,

A.total2,

b.*,

c.*

FROM (

select SUM(a.numvar1) AS TOTAL1,

SUM(a.numvar2) AS TOTAL2

from work.DATA_WITH_MANY_ROWS

) AS a

, (SELECT SUM(numvar1) AS TOTAL1_SUBSET,

SUM(numvar2) AS TOTAL2_SUBSET

FROM DATA_WITH_MANY_ROWS

WHERE classvar=**1**

) AS b

, DATA_WITH_ONE_ROW AS c ;

**QUIT**;

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.