Solved: How to calculate mean within each group without one observation / inc...

michellel · Posted 11-17-2015 02:41 PM

Hi,

I have a question about calculating mean of score within each school with additional condition. The variable for mean of score within each school [mean_score] has been created. I need create two more variables now and am not sure how to program. Below is a fake small dataset for program develop. Real data contains about 100,000 observations (students) belonging to about 100 schools.

First, I need create a variable for mean of score within each school but without the score of the same row. For example, for student_ID=123, it should be (95+89+95)/(4-1), which is without 98 (his own’s score) in numerator and minus 1 (observation) in denominator. For student_ID=124, it should be (98+89+95)/3, and so on.

Second, I need create a variable for mean of score within each school including additional number, which is 25^th percentile of score for ALL student (in the entire dataset, not within each school). For example, for student_ID=123, it should be (98+95+89+95+25^th percentile of score in the entire dataset)/(4+1). So this variable will be the same value for students within each school.

Many thanks in advance for your answer!!!

student_ID school_ID score mean_score

123 1 98 94.25

124 1 95 94.25

125 1 89 94.25

126 1 95 94.25

127 2 96 80.2

128 2 78 80.2

129 2 85 80.2

130 2 74 80.2

131 2 68 80.2

132 3 98 87

133 3 84 87

134 3 85 87

135 3 75 87

136 3 99 87

137 3 81 87

FreelanceReinh · Posted 11-17-2015 03:52 PM

How do you like this?

/* Compute overall 25th percentile of score */

proc summary data=have;
var score;
output out=perc25 q1=q;
quit;


/* Write it into macro variable Q */

proc sql noprint;
select q
into :q
from perc25;
quit;

%put Overall 25th percentile of score: &q;

/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
  set have(keep=school_ID);
  by school_ID;
end;
run;


/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest, (mean_score*nst+&q)/(nst+1) as mwithq
from have natural join numstud
order by school_ID, student_ID;
quit;

Edit: Added optional ORDER BY clause.

View solution in original post

Astounding · Posted 11-17-2015 03:46 PM

In retrospect, your first question would have been easy to answer if you had prepared the data slightly differently. Instead of adding MEAN_SCORE to each observation, it likely would have been possible to add SUM_SCORE and N_SCORE. That would have made the calculations easy. If that is still a possibility, I would suggest it. If not, we can always use the existing data to generate those numbers. The program wouldn't be simple, but it would be relatively short.

Regarding your second question, it looks like two preparatory steps are needed. First, calculate the 25th percentile and store it in a SAS data set. Then append it to each observation. Are you comfortable with the first part of that, or do you need help with both parts?

michellel · Posted 11-17-2015 03:51 PM

Thanks Astounding so much for your answer! I think I can get the first part done now. I need help for the second part. Thanks!

Astounding · Posted 11-17-2015 04:19 PM

OK, assuming you can create a SAS data set (I'll call it CUTOFF) with a single variable representing the 25th percentile (I'll call it percentile_25), here's how to add one observation to every observation in your existing data set:

data want;

if _n_=1 then set cutoff;

set have;

*** add calculations here as needed;

run;

If you actually have SUM_SCORE and N_SCORE available, the calculations should be relatively easy. For some of the calculations, you will be dividing by (N_SCORE - 1). In that case, be sure to check:

if N_SCORE > 1 then ...

Good luck.

michellel · Posted 11-17-2015 04:50 PM

Thanks so much, Astounding! So what you did is append the dataset cutoff to each first observation of each school? I have one more question. I need create 3 more variables, similar to the second variable [mwithq], using 1st percentile, 75th percentile, 99th percentile of score, instead of 25th percentile of score. How can I create these variable at the same time?

Thanks much for the check reminder. It is very useful.

Thanks again for your answer!

FreelanceReinh · Posted 11-17-2015 03:52 PM

How do you like this?

/* Compute overall 25th percentile of score */

proc summary data=have;
var score;
output out=perc25 q1=q;
quit;


/* Write it into macro variable Q */

proc sql noprint;
select q
into :q
from perc25;
quit;

%put Overall 25th percentile of score: &q;

/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
  set have(keep=school_ID);
  by school_ID;
end;
run;


/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest, (mean_score*nst+&q)/(nst+1) as mwithq
from have natural join numstud
order by school_ID, student_ID;
quit;

Edit: Added optional ORDER BY clause.

michellel · Posted 11-17-2015 04:46 PM

Thanks so much, FreelanceReinhard! Your code works! I have one more question. I need create 3 more variables, similar to the second variable [mwithq], using 1st percentile, 75th percentile, 99th percentile of score, instead of 25th percentile of score. How can I create these variable at the same time? In addition, I do not know how to refer to 1% percentile like the q1 for 25th percentile.

Thanks so much for your answer!

FreelanceReinh · Posted 11-17-2015 05:07 PM

/* Compute overall 1st, 25th, 75th and 99th percentile of score */

proc summary data=have;
var score;
output out=perc p1=p1 p25=p25 p75=p75 p99=p99;
quit;


/* Write them into macro variables */

proc sql noprint;
select p1, p25, p75, p99
into :p1, :p25, :p75, :p99
from perc;
quit;

%put Overall percentiles of score:;
%put P1=&p1;
%put P25=&p25;
%put P75=&p75;
%put P99=&p99;


/* Create dataset with number of students per school */

data numstud;
do nst=1 by 1 until(last.school_ID);
  set have(keep=school_ID);
  by school_ID;
end;
run;


/* Add the desired new variables */

proc sql;
create table want(drop=nst) as
select *, (mean_score*nst-score)/(nst-1) as mrest,
          (mean_score*nst+&p1) /(nst+1) as mwithp1,
          (mean_score*nst+&p25)/(nst+1) as mwithp25,
          (mean_score*nst+&p75)/(nst+1) as mwithp75,
          (mean_score*nst+&p99)/(nst+1) as mwithp99
from have natural join numstud
order by school_ID, student_ID;
quit;

FreelanceReinh · Posted 11-17-2015 05:29 PM

Btw, PROC SQL is quite tolerant with regard to divisions by zero. Of course, you should be aware of the (remote?) possibility that there is a school with only one student, for which then MREST is literally not defined. But unlike the DATA step, PROC SQL does not complain when it divides by NST-1=0 in this situation. It simply assigns a missing value to MREST, which I think is fair.

michellel · Posted 11-18-2015 10:41 AM

Hi FreelanceReinhard, Thanks so much for your updated programming and explanation! It is really helpful!

How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Re: How to calculate mean within each group without one observation / including additional number

Registration is open