DATA Step, Macro, Functions and more

Compute mean by grouping variable excluding observation

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 11
Accepted Solution

Compute mean by grouping variable excluding observation

Dear all,

 

Thanks in advance for any and all help here. 

 

An example data set and desired output is given below.

 

I want to compute new variables from the "Have" data set as follows:

RE:  the average of "R" values within a given "Cat" variable value excluding the specific observation

IE  the average of "I" responses within a given "Cat" variable value excluding the specific observation

 

Have: 

IDCATRI… (Additional variables I'd like to be retained)
1113
2122
3111
4233
5245
6244
7367
8344
9355

 

The desired data set should thus be:

Want:

IDCATRIREIE… (Additional variables retained)
11131.51.5
212212
31111.52.5
4233etc.etc.
5245  
6244  
7367  
8344  
9355  


Thanks!


Accepted Solutions
Solution
‎10-31-2015 05:45 PM
Respected Advisor
Posts: 4,930

Re: Compute mean by grouping variable excluding observation

@Reeza's query can be simplified to :

 

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
PG

View solution in original post


All Replies
Super User
Posts: 5,516

Re: Compute mean by grouping variable excluding observation

The easiest way would be to bring in the total of the underlying variables for each ID.  While a tricky DATA step could do it, I'll stick to basic techniques here.

 

proc summary data=have;

   by ID;

   var R I;

   output out=totals (keep=total_R total_I n_R n_I) sum=total_R total_I n=n_R n_I;

run;

 

data want;

   merge have totals;

   by ID;

   if (n_R > 1) then RE = (total_R - R) / (n_R - 1);

   if (n_I > 1)  then IE = (total_I - I) / (n_I - 1);

run;

 

 

There can be complications (which I didn't address here) if it might be possible that R or I could take on a missing value on some observations.

 

Good luck.

Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

[ Edited ]
Posted in reply to Astounding

@AstoundingThanks for this, I should have specified that I do have missing values for R, I, and Cat.  I think that will cause a problem here.  I'll look into this and Reeza's response to see if missing data is properly accounted for and post back.

Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

@Astounding Indeed, missing data is problematic here, but *appears* to be dealt with in the Proc SQL follow-ups.  Hopefully we can get a follow-up on the handling of missing data yet with the two Proc SQL examples provided.  Thanks for your help!

Super User
Posts: 19,855

Re: Compute mean by grouping variable excluding observation

SQL is probably the easiest way.

 

proc sql;

create table want as
select *, mean(r) as avg_R, count(R) as n_R, mean(I) as avg_I, count(I) as n_I,
(calculated avg_R* calculated n_R - r)/(calculated n_r-1) as RE,
(calculated avg_I* calculated n_I - I)/(calculated n_I-1) as IE
from have
group by CAT
order by ID, CAT;
quit;
            
Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

[ Edited ]

@Reeza Thanks for this, 

 

As above, I should have specified that I do have missing values for R, I, and Cat.  Will this be accounted for via the "count" argument?

 

Your code produced means within groups, but it created the same mean value for RE and IE for everyone in a given "Cat" group.  Eveyone should have a slightly different average because the average should be calculated leaving out their unique observation (as per the "want" example).  What can be done to ensure the RE and IE are calculated leaving out each individual observation within the given "Cat" group?

Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

[ Edited ]

@Reeza My fault, this works correctly!  Can you confirm how this deals with missing data?

Thanks!

Solution
‎10-31-2015 05:45 PM
Respected Advisor
Posts: 4,930

Re: Compute mean by grouping variable excluding observation

@Reeza's query can be simplified to :

 

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
PG
Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

[ Edited ]

@PGStatsThanks for the further parsimony, can you confirm how this deals with missing data?

Occasional Contributor
Posts: 11

Re: Compute mean by grouping variable excluding observation

My review shows that 

 

@Reeza method deals with missing data values by creating a missing data point for RE and IE when R or I is missing and that 

 

@PGStats method deals with missing data values by inserting the group mean for RE and IE when R or I is missing.

 

If you could confirm my review for future users, that would be excellent.  Both methods have merit depending on how the researcher plans to handle missing data.

Respected Advisor
Posts: 4,930

Re: Compute mean by grouping variable excluding observation

My query deals with missing values in the following ways:

 

  • The SUM and COUNT aggregate functions operate on non-missing values only.
  • When a R or I value is missing, the average excluding that specific observation is computed anyway from the other values in the same CAT group.
  • When a R or I value is the only non-missing in its CAT group, the average excluding that specific observation is set to missing.
  • When all R or I values are missing in a CAT group, the average is set to missing.
PG
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 454 views
  • 0 likes
  • 4 in conversation