I've got two datasets. The first dataset (DATA_WITH_MANY_ROWS) is a dataset with many rows of data and includes two numeric variables numvar1 and numvar2. Using only one PROC SQL step, I would like to 1) aggregate the data across this entire table, then 2) aggregate the data within a subset of the table, then 3) join with an existing dataset that has only one observation (DATA_WITH_ONE_ROW). You'll also notice I'm including adding a text label to the data. Essentially I am trying to export a final dataset which contains only one observation and side by side includes the above described measures.
I have included code below showing my current approach. However, the output table (OUTPUT_TABLE) ends up having as many rows as the dataset DATA_WITH_MANY_ROWS. I can't seem to figure out where my code is going wrong since I'm fairly certain each item in the left-join should only have one observation and should join smoothly using the ON 1=1 condition. Since I am summarizing in every case down to a single row of data, I assume I do not need to use group by functions.
Am I misunderstanding something about how the order that SQL is processing my code? Thanks to the community for any help/insight.
PROC SQL;
CREATE TABLE work.OUTPUT_TABLE AS
SELECT
"texthere" AS label_everyone,
SUM(a.numvar1) AS TOTAL1 SUM(a.numvar2) AS TOTAL2,
b.*, c.*
FROM work.DATA_WITH_MANY_ROWS AS a
LEFT JOIN
(SELECT SUM(numvar1) AS TOTAL1_SUBSET, SUM(numvar2) AS TOTAL2_SUBSET FROM DATA_WITH_MANY_ROWS WHERE classvar=1) AS b ON 1=1
LEFT JOIN DATA_WITH_ONE_ROW AS c ON 1=1;
QUIT;
"texthere" AS label_everyone,
is not an aggregate. This is what most probably causes every record to be in the output.
As an aside, I would clearly avoid putting it all in one complex step. Build it step by step, it is easier to understand and easier to debug.
"Write every program as if the next one to maintain it is a violent psychopath who knows where you live"
Here's my solution with a data step that will make only one pass through the large dataset:
data sums;
label_everyone = 'texthere';
merge
data_with_many_rows
data_with_one_row
end=done
;
retain
total1 0
total2 0
total1_subset 0
total2_subset 0
;
total1 + numvar1;
total2 + numvar2;
if classvar = 1 then do;
total1_subset + numvar1;
total2_subset + numvar2;
end;
if done then output;
keep label_everyone total1 total2 total1_subset total2_subset vars_from one_row;
run;
(not tested because of missing test data)
Sorry, its not clear what you are trying to do, there are several errors in the code. Also, code formatting makes reading a lot easier:
proc sql; create table WORK.OUTPUT_TABLE as select "texthere" as LABEL_EVERYONE, sum(A.NUMVAR1) as TOTAL1 /* <- missing semicolon */ sum(A.NUMVAR2) as TOTAL2, B.*, /* This is meaningless? */ C.* /* This is meaningless? */ from WORK.DATA_WITH_MANY_ROWS as A left join (select sum(NUMVAR1) as TOTAL1_SUBSET, sum(NUMVAR2) as TOTAL2_SUBSET from DATA_WITH_MANY_ROWS where CLASSVAR=1) as B on 1=1 /* This is equicalent to merge every row in A with every row in B, not what you intend */ left join DATA_WITH_ONE_ROW as C on 1=1; /* as above, merge eveery row of first result with every row of this one */ quit;
Post some sample data, in the form of a datastep, and what you want the output to look like.
Let n denote the number of rows in DATA_WITH_MANY_ROWS (n>1). The left join of DATA_WITH_MANY_ROWS with the table with alias b (which has only one row) on 1=1 is a table with all rows from DATA_WITH_MANY_ROWS, enriched with the two variables from "b" (i.e. identical values of TOTAL1_SUBSET and TOTAL2_SUBSET, respectively, on each row). Hence the note in the log: "NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized." The resulting table has n rows.
Same situation with the second left join: You simply get additional columns (one per variable in DATA_WITH_ONE_ROW) having constant values in all rows.
It's this temporary table with n rows on which the first SELECT statement operates. It selects summary statistics (TOTAL1, TOTAL2), but at the same time "original data" contributed by tables "b" and "c". The important point is that "original data", i.e. b.* and c.*, does not refer to the single observations of tables "b" and "c", but to the columns contributed by "b" and "c" in the temporary table described above! Hence the note in the log: "NOTE: The query requires remerging summary statistics back with the original data." That is, the values of TOTAL1 and TOTAL2 are copied n times to appear in all rows together with the values b.* and c.* (regardless of the fact that these values happen to be likewise only copies of a single value per variable) and with the constant label_everyone.
So, a quick fix of your PROC SQL step would be to
That said, I agree with @Kurt_Bremser that the task could be accomplished more efficiently, be it with an improved PROC SQL step or with a data step. (I think, "merge" should read "set" in Kurt's solution and the RETAIN statement is redundant due to the "SUM" statements.)
Both "set" and "merge" work because of the way the vars are retained and only one output happens at the very end. But to me the "merge" better shows the side-by-side nature of the operation.
And I used the retain statement because it allows to set the initial value.
@Kurt_Bremser: Are you sure? Maybe it depends on the input datasets, but with those I had made up for testing (see below) it seemed that the variable(s) from DATA_WITH_ONE_ROW would have missing values with the 1-to-1 merge. With the SET statement, however, they would have their single value, because DATA_WITH_ONE_ROW is the second dataset in the SET statement, so that the value is present in the last observation selected with "if done ...".
data data_with_many_rows;
do numvar1=1 to 10;
classvar=mod(numvar1,3);
numvar2=int(11*ranuni(314159));
output;
end;
run;
data data_with_one_row;
a=314;
run;
If I'm not mistaken, the initial value 0 (used in the RETAIN statement) is implied by the use of the Sum statement (as is the "retaining" itself). So, for example, without the RETAIN statement the subset totals would still have value 0 (and not missing) if the condition "classvar=1" was never met and hence the Sum statements were never executed (but compiled!).
You are right. I'm so used to do MERGE always with a BY that I presumed that the value(s) of data_with_one_row would automatically be kept for the whole operation. Which they obviously aren't, so set is the way to go.
Excuse the poor layout etc, just edited your code quickly.
BM
PROC SQL;
CREATE TABLE work.OUTPUT_TABLE AS
SELECT
"texthere" AS label_everyone,
A.total1,
A.total2,
b.*,
c.*
FROM (
select SUM(a.numvar1) AS TOTAL1,
SUM(a.numvar2) AS TOTAL2
from work.DATA_WITH_MANY_ROWS
) AS a
, (SELECT SUM(numvar1) AS TOTAL1_SUBSET,
SUM(numvar2) AS TOTAL2_SUBSET
FROM DATA_WITH_MANY_ROWS
WHERE classvar=1
) AS b
, DATA_WITH_ONE_ROW AS c ;
QUIT;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.