Solved
Contributor
Posts: 71

How to calculate mean within each group without one observation / including additional number

Hi,

I have a question about calculating mean of score within each school with additional condition. The variable for mean of score within each school [mean_score] has been created. I need create two more variables now and am not sure how to program. Below is a fake small dataset for program develop. Real data contains about 100,000 observations (students) belonging to about 100 schools.

First, I need create a variable for mean of score within each school but without the score of the same row. For example, for student_ID=123, it should be (95+89+95)/(4-1), which is without 98 (his own’s score) in numerator and minus 1 (observation) in denominator. For student_ID=124, it should be (98+89+95)/3, and so on.

Second, I need create a variable for mean of score within each school including additional number, which is 25th percentile of score for ALL student (in the entire dataset, not within each school). For example, for student_ID=123, it should be (98+95+89+95+25th percentile of score in the entire dataset)/(4+1). So this variable will be the same value for students within each school.

student_ID         school_ID            score     mean_score

123                         1                              98           94.25

124                         1                              95           94.25

125                         1                              89           94.25

126                         1                              95           94.25

127                         2                              96           80.2

128                         2                              78           80.2

129                         2                              85           80.2

130                         2                              74           80.2

131                         2                              68           80.2

132                         3                              98           87

133                         3                              84           87

134                         3                              85           87

135                         3                              75           87

136                         3                              99           87

137                         3                              81           87

Accepted Solutions
Solution
‎11-17-2015 04:53 PM
Posts: 1,163

Re: How to calculate mean within each group without one observation / including additional number

[ Edited ]

How do you like this?

``````/* Compute overall 25th percentile of score */

proc summary data=have;
var score;
output out=perc25 q1=q;
quit;

/* Write it into macro variable Q */

proc sql noprint;
select q
into :q
from perc25;
quit;

%put Overall 25th percentile of score: &q;

/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
set have(keep=school_ID);
by school_ID;
end;
run;

/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest, (mean_score*nst+&q)/(nst+1) as mwithq
from have natural join numstud
order by school_ID, student_ID;
quit;
``````

Edit: Added optional ORDER BY clause.

All Replies
Super User
Posts: 6,638

Re: How to calculate mean within each group without one observation / including additional number

In retrospect, your first question would have been easy to answer if you had prepared the data slightly differently.  Instead of adding MEAN_SCORE to each observation, it likely would have been possible to add SUM_SCORE and N_SCORE.  That would have made the calculations easy.  If that is still a possibility, I would suggest it.  If not, we can always use the existing data to generate those numbers.  The program wouldn't be simple, but it would be relatively short.

Regarding your second question, it looks like two preparatory steps are needed.  First, calculate the 25th percentile and store it in a SAS data set.  Then append it to each observation.  Are you comfortable with the first part of that, or do you need help with both parts?

Contributor
Posts: 71

Re: How to calculate mean within each group without one observation / including additional number

Thanks Astounding so much for your answer! I think I can get the first part done now. I need help for the second part. Thanks!

Super User
Posts: 6,638

Re: How to calculate mean within each group without one observation / including additional number

OK, assuming you can create a SAS data set (I'll call it CUTOFF) with a single variable representing the 25th percentile (I'll call it percentile_25), here's how to add one observation to every observation in your existing data set:

data want;

if _n_=1 then set cutoff;

set have;

*** add calculations here as needed;

run;

If you actually have SUM_SCORE and N_SCORE available, the calculations should be relatively easy.  For some of the calculations, you will be dividing by (N_SCORE - 1).  In that case, be sure to check:

if N_SCORE > 1 then ...

Good luck.

Contributor
Posts: 71

Re: How to calculate mean within each group without one observation / including additional number

Thanks so much, Astounding! So what you did is append the dataset cutoff to each first observation of each school? I have one more question. I need create 3 more variables, similar to the second variable [mwithq], using 1st percentile, 75th percentile, 99th percentile of score, instead of 25th percentile of score. How can I create these variable at the same time?

Thanks much for the check reminder. It is very useful.

Solution
‎11-17-2015 04:53 PM
Posts: 1,163

Re: How to calculate mean within each group without one observation / including additional number

[ Edited ]

How do you like this?

``````/* Compute overall 25th percentile of score */

proc summary data=have;
var score;
output out=perc25 q1=q;
quit;

/* Write it into macro variable Q */

proc sql noprint;
select q
into :q
from perc25;
quit;

%put Overall 25th percentile of score: &q;

/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
set have(keep=school_ID);
by school_ID;
end;
run;

/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest, (mean_score*nst+&q)/(nst+1) as mwithq
from have natural join numstud
order by school_ID, student_ID;
quit;
``````

Edit: Added optional ORDER BY clause.

Contributor
Posts: 71

Re: How to calculate mean within each group without one observation / including additional number

Thanks so much, FreelanceReinhard! Your code works! I have one more question. I need create 3 more variables, similar to the second variable [mwithq], using 1st percentile, 75th percentile, 99th percentile of score, instead of 25th percentile of score. How can I create these variable at the same time? In addition, I do not know how to refer to 1% percentile like the q1 for 25th percentile.

Posts: 1,163

Re: How to calculate mean within each group without one observation / including additional number

``````/* Compute overall 1st, 25th, 75th and 99th percentile of score */

proc summary data=have;
var score;
output out=perc p1=p1 p25=p25 p75=p75 p99=p99;
quit;

/* Write them into macro variables */

proc sql noprint;
select p1, p25, p75, p99
into :p1, :p25, :p75, :p99
from perc;
quit;

%put Overall percentiles of score:;
%put P1=&p1;
%put P25=&p25;
%put P75=&p75;
%put P99=&p99;

/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
set have(keep=school_ID);
by school_ID;
end;
run;

/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest,
(mean_score*nst+&p1) /(nst+1) as mwithp1,
(mean_score*nst+&p25)/(nst+1) as mwithp25,
(mean_score*nst+&p75)/(nst+1) as mwithp75,
(mean_score*nst+&p99)/(nst+1) as mwithp99
from have natural join numstud
order by school_ID, student_ID;
quit;
``````
Posts: 1,163

Re: How to calculate mean within each group without one observation / including additional number

Btw, PROC SQL is quite tolerant with regard to divisions by zero. Of course, you should be aware of the (remote?) possibility that there is a school with only one student, for which then MREST is literally not defined. But unlike the DATA step, PROC SQL does not complain when it divides by NST-1=0 in this situation. It simply assigns a missing value to MREST, which I think is fair.
Contributor
Posts: 71