Dear all,
Thanks in advance for any and all help here.
An example data set and desired output is given below.
I want to compute new variables from the "Have" data set as follows:
RE: the average of "R" values within a given "Cat" variable value excluding the specific observation
IE the average of "I" responses within a given "Cat" variable value excluding the specific observation
Have:
ID | CAT | R | I | … (Additional variables I'd like to be retained) |
1 | 1 | 1 | 3 | … |
2 | 1 | 2 | 2 | … |
3 | 1 | 1 | 1 | … |
4 | 2 | 3 | 3 | … |
5 | 2 | 4 | 5 | … |
6 | 2 | 4 | 4 | … |
7 | 3 | 6 | 7 | … |
8 | 3 | 4 | 4 | … |
9 | 3 | 5 | 5 | … |
The desired data set should thus be:
Want:
ID | CAT | R | I | RE | IE | … (Additional variables retained) |
1 | 1 | 1 | 3 | 1.5 | 1.5 | … |
2 | 1 | 2 | 2 | 1 | 2 | … |
3 | 1 | 1 | 1 | 1.5 | 2.5 | … |
4 | 2 | 3 | 3 | etc. | etc. | … |
5 | 2 | 4 | 5 | … | ||
6 | 2 | 4 | 4 | … | ||
7 | 3 | 6 | 7 | … | ||
8 | 3 | 4 | 4 | … | ||
9 | 3 | 5 | 5 | … |
Thanks!
@Reeza's query can be simplified to :
proc sql;
create table want as
select *,
(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
The easiest way would be to bring in the total of the underlying variables for each ID. While a tricky DATA step could do it, I'll stick to basic techniques here.
proc summary data=have;
by ID;
var R I;
output out=totals (keep=total_R total_I n_R n_I) sum=total_R total_I n=n_R n_I;
run;
data want;
merge have totals;
by ID;
if (n_R > 1) then RE = (total_R - R) / (n_R - 1);
if (n_I > 1) then IE = (total_I - I) / (n_I - 1);
run;
There can be complications (which I didn't address here) if it might be possible that R or I could take on a missing value on some observations.
Good luck.
@AstoundingThanks for this, I should have specified that I do have missing values for R, I, and Cat. I think that will cause a problem here. I'll look into this and Reeza's response to see if missing data is properly accounted for and post back.
@Astounding Indeed, missing data is problematic here, but *appears* to be dealt with in the Proc SQL follow-ups. Hopefully we can get a follow-up on the handling of missing data yet with the two Proc SQL examples provided. Thanks for your help!
SQL is probably the easiest way.
proc sql;
create table want as
select *, mean(r) as avg_R, count(R) as n_R, mean(I) as avg_I, count(I) as n_I,
(calculated avg_R* calculated n_R - r)/(calculated n_r-1) as RE,
(calculated avg_I* calculated n_I - I)/(calculated n_I-1) as IE
from have
group by CAT
order by ID, CAT;
quit;
@Reeza Thanks for this,
As above, I should have specified that I do have missing values for R, I, and Cat. Will this be accounted for via the "count" argument?
Your code produced means within groups, but it created the same mean value for RE and IE for everyone in a given "Cat" group. Eveyone should have a slightly different average because the average should be calculated leaving out their unique observation (as per the "want" example). What can be done to ensure the RE and IE are calculated leaving out each individual observation within the given "Cat" group?
@Reeza My fault, this works correctly! Can you confirm how this deals with missing data?
Thanks!
@Reeza's query can be simplified to :
proc sql;
create table want as
select *,
(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
@PGStatsThanks for the further parsimony, can you confirm how this deals with missing data?
My review shows that
@Reeza method deals with missing data values by creating a missing data point for RE and IE when R or I is missing and that
@PGStats method deals with missing data values by inserting the group mean for RE and IE when R or I is missing.
If you could confirm my review for future users, that would be excellent. Both methods have merit depending on how the researcher plans to handle missing data.
My query deals with missing values in the following ways:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.