Solved: Re: Compute mean by grouping variable excluding observation

jpwiega · Posted 10-30-2015 03:47 PM

Dear all,

Thanks in advance for any and all help here.

An example data set and desired output is given below.

I want to compute new variables from the "Have" data set as follows:

RE: the average of "R" values within a given "Cat" variable value excluding the specific observation

IE the average of "I" responses within a given "Cat" variable value excluding the specific observation

Have:

ID	CAT	R	I	… (Additional variables I'd like to be retained)
1	1	1	3	…
2	1	2	2	…
3	1	1	1	…
4	2	3	3	…
5	2	4	5	…
6	2	4	4	…
7	3	6	7	…
8	3	4	4	…
9	3	5	5	…

The desired data set should thus be:

Want:

ID	CAT	R	I	RE	IE	… (Additional variables retained)
1	1	1	3	1.5	1.5	…
2	1	2	2	1	2	…
3	1	1	1	1.5	2.5	…
4	2	3	3	etc.	etc.	…
5	2	4	5			…
6	2	4	4			…
7	3	6	7			…
8	3	4	4			…
9	3	5	5			…

Thanks!

PGStats · Posted 10-30-2015 06:16 PM

@Reeza's query can be simplified to :

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;

PG

View solution in original post

Astounding · Posted 10-30-2015 04:19 PM

The easiest way would be to bring in the total of the underlying variables for each ID. While a tricky DATA step could do it, I'll stick to basic techniques here.

proc summary data=have;

by ID;

var R I;

output out=totals (keep=total_R total_I n_R n_I) sum=total_R total_I n=n_R n_I;

run;

data want;

merge have totals;

by ID;

if (n_R > 1) then RE = (total_R - R) / (n_R - 1);

if (n_I > 1) then IE = (total_I - I) / (n_I - 1);

run;

There can be complications (which I didn't address here) if it might be possible that R or I could take on a missing value on some observations.

Good luck.

jpwiega · Posted 10-30-2015 06:03 PM

@AstoundingThanks for this, I should have specified that I do have missing values for R, I, and Cat. I think that will cause a problem here. I'll look into this and Reeza's response to see if missing data is properly accounted for and post back.

jpwiega · Posted 10-30-2015 06:58 PM

@Astounding Indeed, missing data is problematic here, but *appears* to be dealt with in the Proc SQL follow-ups. Hopefully we can get a follow-up on the handling of missing data yet with the two Proc SQL examples provided. Thanks for your help!

Reeza · Posted 10-30-2015 04:57 PM

SQL is probably the easiest way.

proc sql;

create table want as
select *, mean(r) as avg_R, count(R) as n_R, mean(I) as avg_I, count(I) as n_I,
(calculated avg_R* calculated n_R - r)/(calculated n_r-1) as RE,
(calculated avg_I* calculated n_I - I)/(calculated n_I-1) as IE
from have
group by CAT
order by ID, CAT;
quit;

jpwiega · Posted 10-30-2015 06:15 PM

@Reeza Thanks for this,

As above, I should have specified that I do have missing values for R, I, and Cat. Will this be accounted for via the "count" argument?

Your code produced means within groups, but it created the same mean value for RE and IE for everyone in a given "Cat" group. Eveyone should have a slightly different average because the average should be calculated leaving out their unique observation (as per the "want" example). What can be done to ensure the RE and IE are calculated leaving out each individual observation within the given "Cat" group?

jpwiega · Posted 10-30-2015 06:40 PM

@Reeza My fault, this works correctly! Can you confirm how this deals with missing data?

Thanks!

PGStats · Posted 10-30-2015 06:16 PM

@Reeza's query can be simplified to :

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;

PG

jpwiega · Posted 10-30-2015 06:42 PM

@PGStatsThanks for the further parsimony, can you confirm how this deals with missing data?

jpwiega · Posted 10-30-2015 07:05 PM

My review shows that

@Reeza method deals with missing data values by creating a missing data point for RE and IE when R or I is missing and that

@PGStats method deals with missing data values by inserting the group mean for RE and IE when R or I is missing.

If you could confirm my review for future users, that would be excellent. Both methods have merit depending on how the researcher plans to handle missing data.

PGStats · Posted 10-30-2015 09:17 PM

My query deals with missing values in the following ways:

The SUM and COUNT aggregate functions operate on non-missing values only.
When a R or I value is missing, the average excluding that specific observation is computed anyway from the other values in the same CAT group.
When a R or I value is the only non-missing in its CAT group, the average excluding that specific observation is set to missing.
When all R or I values are missing in a CAT group, the average is set to missing.

PG

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away