- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello and thank you for helping. I am trying to figure out a way to use the AVG aggregate function in PROC SQL to average across a subset of the data, without specifying this subset using WHERE or HAVING statements. This is because I would like to calculate multiple different averages across different subsets within the data, all within the same select statement, creating a different variable name for each different average. I have included an illustrative example of what I am attempting to do below.
Consider the following data set:
DATA data;
INPUT id c b;
DATALINES;
1 1 1
17 0 1
33 0 0
49 0 0
65 1 0
81 1 1
2 1 0
18 0 1
34 0 0
50 1 0
20 1 1
39 0 0
;
PROC PRINT; RUN;
Using only one PROC SQL step (i.e. a single select statement), I would like to create a data set which contains only one observation and two variables:
- AVG_B_C0 = the average value of B when c = 0
- AVG_B_C1 = the average value of B when c = 1
So my output table should look like this:
Obs | Mean_B_C0 | Mean_B_C1 |
1 | .33333 | .5 |
I was trying to use this code but it doesn't work. The reason I was trying to use this code is that it works; if I use the SUM aggregate function rather than the AVG aggregate function.
PROC SQL;
CREATE TABLE summary AS
SELECT
AVG((c=0)*B) AS Mean_B_C0,
AVG((c=1)*B) AS Mean_B_C1
FROM data;
QUIT;
Can anyone help me? This code is part of a program that runs iteratively and I have to calculate values in multiple different subsets. While I know how to do this using multiple PROC SQL steps, I would prefer not to since it slows down the speed of my program. Just to note, I also tried using CASE WHEN statements, but it ends up returning two rows in my output table, with missing values for when the case statement was not satisfied.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The average function won't work properly because the denominator won't be calculated properly, it will be the total number of observations.
Proc means would work nicely - or you can use sum as you've indicated.
PROC SQL;
CREATE TABLE summary AS
SELECT
sum((c=0)*B)/sum(c=0) AS Mean_B_C0,
sum((c=1)*B)/sum(c=1) AS Mean_B_C1
FROM data;
QUIT;
proc print data=summary;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The average function won't work properly because the denominator won't be calculated properly, it will be the total number of observations.
Proc means would work nicely - or you can use sum as you've indicated.
PROC SQL;
CREATE TABLE summary AS
SELECT
sum((c=0)*B)/sum(c=0) AS Mean_B_C0,
sum((c=1)*B)/sum(c=1) AS Mean_B_C1
FROM data;
QUIT;
proc print data=summary;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Reeza,
Thank you. I suppose I should have thought of just using the sum function. This helps though because I think it clarifies what's actually happening with the PROC SQL AVG function.
Thank you again,
Mitch
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
CASE is meant for this.
PROC SQL;
CREATE TABLE summary AS
SELECT
AVG( case when (c=0) then b else (.) end ) AS Mean_B_C0,
AVG( case when (c=1) then b else (.) end ) AS Mean_B_C1
FROM data;
QUIT;