DATA Step, Macro, Functions and more

Calculating a tricky average using by group processing

Reply
Occasional Contributor
Posts: 9

Calculating a tricky average using by group processing

Hi,

 

I have some data in which students are tracked between schools and years--observations with the same student ID and same school ID are assumed to be the same person (i.e. person A in school 65 in year 2007 = person A in school 65 in year 2008). But person A in school 65 is not the same as person A in school 64, regardless of year. I need to calculate the average score over years. Some students appear in multiple years, and some only appear in one (in which case their average score would just be their score from last year). Additionally, students can appear 

 

Here is what my data currently looks like:

data test;

   infile datalines;

   input studentID $ schoolID year score;

datalines;

a 123 2000 78

a 123 2000 78

a 123 2001 80

a 123 2001 80

a 123 2001 80

b 123 2002 95

b 456 2003 91

c 456 2003 90

c 456 2004 98

c 456 2006 92

;

run;

 

Basically i want another column with average score, that should be 79, 79, 79, 79, 79, 95, 91, 93.3, 93.3, 93.3

 

I feel like I might need a combination of sql and by group processing but I'm getting stuck on calculating the average score vertically.

 

Thanks!

SAS Super FREQ
Posts: 4,168

Re: Calculating a tricky average using by group processing

Posted in reply to lnicholl9

Sort by studentID and schoolID, then use BY-group processing to compute the means for each student-school pair. Lastly, merge the results:

 


proc sort data=test;
by studentID schoolID;;
run;

proc means data=test noprint;
by studentID schoolID;
var score;
output out=means(drop=_type_ _freq_) mean=mean;
run;

data All;
merge test means;
by studentID schoolID;
run;
Super User
Posts: 23,224

Re: Calculating a tricky average using by group processing

Posted in reply to lnicholl9

It should be 79.2?

 


proc sql;
create table want as
select *, mean(score) as avg_score
from test
group by schoolid, studentid;
quit;
Occasional Contributor
Posts: 9

Re: Calculating a tricky average using by group processing

Thanks for the quick responses! I should have been more clear about how volume is calculated

Take this example:

studentID schoolID year score

a                123        2000   78

a                 123       2001   80

a                 123       2001    80

 

The average score should be 79.0, not 79.3. This is because the second two observations are essentially duplicates. There are other covariates which are different but these 4 columns are the same. So the mean should not be calculated like these 3 are independent. It should be calculated as sum(of unique scores)/number of unique years.

 

Thanks!

Super User
Posts: 23,224

Re: Calculating a tricky average using by group processing

Posted in reply to lnicholl9

And you want to keep the duplicates in your output for some reason?

Super User
Posts: 23,224

Re: Calculating a tricky average using by group processing

Here's one way - first identify the duplicates and then average. If all you're doing is an average, it may be worth just adding that to the data step as well. 

 

data test1;
set test;
by studentID schoolID year;
if first.year then duplicate=0;
else duplicate=1;
run;

proc sql;
create table want as
select *, mean(case when duplicate=0 then score else . end) as avg_score
from test1
group by schoolid, studentid;
quit;
Occasional Contributor
Posts: 9

Re: Calculating a tricky average using by group processing

This worked perfectly, thanks!!

Esteemed Advisor
Posts: 5,474

Re: Calculating a tricky average using by group processing

Posted in reply to lnicholl9

Try this:

 

proc sql;
create table test_avg as
select 
    studentID, 
    schoolID, 
    year,
    sum(score/n)/sum(1/n) as avg_score
from (
    select 
        *, 
        count(score) as n
    from test
    group by studentID, schoolID, year )
group by studentID, schoolID;
select * from test_avg;
quit;
PG
Super User
Posts: 10,681

Re: Calculating a tricky average using by group processing

Posted in reply to lnicholl9

data test;
   infile datalines;
   input studentID $ schoolID year score;
datalines;
a 123 2000 78
a 123 2000 78
a 123 2001 80
a 123 2001 80
a 123 2001 80
b 123 2002 95
b 456 2003 91
c 456 2003 90
c 456 2004 98
c 456 2006 92
;
run;

proc sql;
select *,(select mean(distinct score) from test 
 where studentID=a.studentID and schoolID=a.schoolID) as mean
 from test as a;
quit;
Ask a Question
Discussion stats
  • 8 replies
  • 106 views
  • 5 likes
  • 5 in conversation