BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jpwiega
Fluorite | Level 6

Dear all,

 

Thanks in advance for any and all help here. 

 

An example data set and desired output is given below.

 

I want to compute new variables from the "Have" data set as follows:

RE:  the average of "R" values within a given "Cat" variable value excluding the specific observation

IE  the average of "I" responses within a given "Cat" variable value excluding the specific observation

 

Have: 

IDCATRI… (Additional variables I'd like to be retained)
1113
2122
3111
4233
5245
6244
7367
8344
9355

 

The desired data set should thus be:

Want:

IDCATRIREIE… (Additional variables retained)
11131.51.5
212212
31111.52.5
4233etc.etc.
5245  
6244  
7367  
8344  
9355  


Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

@Reeza's query can be simplified to :

 

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
PG

View solution in original post

10 REPLIES 10
Astounding
PROC Star

The easiest way would be to bring in the total of the underlying variables for each ID.  While a tricky DATA step could do it, I'll stick to basic techniques here.

 

proc summary data=have;

   by ID;

   var R I;

   output out=totals (keep=total_R total_I n_R n_I) sum=total_R total_I n=n_R n_I;

run;

 

data want;

   merge have totals;

   by ID;

   if (n_R > 1) then RE = (total_R - R) / (n_R - 1);

   if (n_I > 1)  then IE = (total_I - I) / (n_I - 1);

run;

 

 

There can be complications (which I didn't address here) if it might be possible that R or I could take on a missing value on some observations.

 

Good luck.

jpwiega
Fluorite | Level 6

@AstoundingThanks for this, I should have specified that I do have missing values for R, I, and Cat.  I think that will cause a problem here.  I'll look into this and Reeza's response to see if missing data is properly accounted for and post back.

jpwiega
Fluorite | Level 6

@Astounding Indeed, missing data is problematic here, but *appears* to be dealt with in the Proc SQL follow-ups.  Hopefully we can get a follow-up on the handling of missing data yet with the two Proc SQL examples provided.  Thanks for your help!

Reeza
Super User

SQL is probably the easiest way.

 

proc sql;

create table want as
select *, mean(r) as avg_R, count(R) as n_R, mean(I) as avg_I, count(I) as n_I,
(calculated avg_R* calculated n_R - r)/(calculated n_r-1) as RE,
(calculated avg_I* calculated n_I - I)/(calculated n_I-1) as IE
from have
group by CAT
order by ID, CAT;
quit;
            
jpwiega
Fluorite | Level 6

@Reeza Thanks for this, 

 

As above, I should have specified that I do have missing values for R, I, and Cat.  Will this be accounted for via the "count" argument?

 

Your code produced means within groups, but it created the same mean value for RE and IE for everyone in a given "Cat" group.  Eveyone should have a slightly different average because the average should be calculated leaving out their unique observation (as per the "want" example).  What can be done to ensure the RE and IE are calculated leaving out each individual observation within the given "Cat" group?

jpwiega
Fluorite | Level 6

@Reeza My fault, this works correctly!  Can you confirm how this deals with missing data?

Thanks!

PGStats
Opal | Level 21

@Reeza's query can be simplified to :

 

proc sql;
create table want as
select *, 
	(sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE,
	(sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE
from have
group by CAT
order by ID, CAT;
quit;
PG
jpwiega
Fluorite | Level 6

@PGStatsThanks for the further parsimony, can you confirm how this deals with missing data?

jpwiega
Fluorite | Level 6

My review shows that 

 

@Reeza method deals with missing data values by creating a missing data point for RE and IE when R or I is missing and that 

 

@PGStats method deals with missing data values by inserting the group mean for RE and IE when R or I is missing.

 

If you could confirm my review for future users, that would be excellent.  Both methods have merit depending on how the researcher plans to handle missing data.

PGStats
Opal | Level 21

My query deals with missing values in the following ways:

 

  • The SUM and COUNT aggregate functions operate on non-missing values only.
  • When a R or I value is missing, the average excluding that specific observation is computed anyway from the other values in the same CAT group.
  • When a R or I value is the only non-missing in its CAT group, the average excluding that specific observation is set to missing.
  • When all R or I values are missing in a CAT group, the average is set to missing.
PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 5875 views
  • 0 likes
  • 4 in conversation