Re: Calculating a tricky average using by group processing

lnicholl9 · Posted 10-25-2017 10:42 AM

Hi,

I have some data in which students are tracked between schools and years--observations with the same student ID and same school ID are assumed to be the same person (i.e. person A in school 65 in year 2007 = person A in school 65 in year 2008). But person A in school 65 is not the same as person A in school 64, regardless of year. I need to calculate the average score over years. Some students appear in multiple years, and some only appear in one (in which case their average score would just be their score from last year). Additionally, students can appear

Here is what my data currently looks like:

data test;

infile datalines;

input studentID $ schoolID year score;

datalines;

a 123 2000 78

a 123 2001 80

b 123 2002 95

b 456 2003 91

c 456 2003 90

c 456 2004 98

c 456 2006 92

;

run;

Basically i want another column with average score, that should be 79, 79, 79, 79, 79, 95, 91, 93.3, 93.3, 93.3

I feel like I might need a combination of sql and by group processing but I'm getting stuck on calculating the average score vertically.

Thanks!

Rick_SAS · Posted 10-25-2017 10:52 AM

Sort by studentID and schoolID, then use BY-group processing to compute the means for each student-school pair. Lastly, merge the results:


proc sort data=test;
by studentID schoolID;;
run;

proc means data=test noprint;
by studentID schoolID;
var score;
output out=means(drop=_type_ _freq_) mean=mean;
run;

data All;
merge test means;
by studentID schoolID;
run;

Reeza · Posted 10-25-2017 10:54 AM

It should be 79.2?


proc sql;
create table want as
select *, mean(score) as avg_score
from test
group by schoolid, studentid;
quit;

lnicholl9 · Posted 10-25-2017 11:00 AM

Thanks for the quick responses! I should have been more clear about how volume is calculated

Take this example:

studentID schoolID year score

a 123 2000 78

a 123 2001 80

The average score should be 79.0, not 79.3. This is because the second two observations are essentially duplicates. There are other covariates which are different but these 4 columns are the same. So the mean should not be calculated like these 3 are independent. It should be calculated as sum(of unique scores)/number of unique years.

Thanks!

Reeza · Posted 10-25-2017 11:00 AM

And you want to keep the duplicates in your output for some reason?

Reeza · Posted 10-25-2017 11:05 AM

Here's one way - first identify the duplicates and then average. If all you're doing is an average, it may be worth just adding that to the data step as well.

data test1;
set test;
by studentID schoolID year;
if first.year then duplicate=0;
else duplicate=1;
run;

proc sql;
create table want as
select *, mean(case when duplicate=0 then score else . end) as avg_score
from test1
group by schoolid, studentid;
quit;

lnicholl9 · Posted 10-25-2017 02:14 PM

This worked perfectly, thanks!!

PGStats · Posted 10-25-2017 02:03 PM

Try this:

proc sql;
create table test_avg as
select 
    studentID, 
    schoolID, 
    year,
    sum(score/n)/sum(1/n) as avg_score
from (
    select 
        *, 
        count(score) as n
    from test
    group by studentID, schoolID, year )
group by studentID, schoolID;
select * from test_avg;
quit;

PG

Ksharp · Posted 10-26-2017 09:29 AM


data test;
   infile datalines;
   input studentID $ schoolID year score;
datalines;
a 123 2000 78
a 123 2000 78
a 123 2001 80
a 123 2001 80
a 123 2001 80
b 123 2002 95
b 456 2003 91
c 456 2003 90
c 456 2004 98
c 456 2006 92
;
run;

proc sql;
select *,(select mean(distinct score) from test 
 where studentID=a.studentID and schoolID=a.schoolID) as mean
 from test as a;
quit;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away