Hi,
I have some data in which students are tracked between schools and years--observations with the same student ID and same school ID are assumed to be the same person (i.e. person A in school 65 in year 2007 = person A in school 65 in year 2008). But person A in school 65 is not the same as person A in school 64, regardless of year. I need to calculate the average score over years. Some students appear in multiple years, and some only appear in one (in which case their average score would just be their score from last year). Additionally, students can appear
Here is what my data currently looks like:
data test;
infile datalines;
input studentID $ schoolID year score;
datalines;
a 123 2000 78
a 123 2000 78
a 123 2001 80
a 123 2001 80
a 123 2001 80
b 123 2002 95
b 456 2003 91
c 456 2003 90
c 456 2004 98
c 456 2006 92
;
run;
Basically i want another column with average score, that should be 79, 79, 79, 79, 79, 95, 91, 93.3, 93.3, 93.3
I feel like I might need a combination of sql and by group processing but I'm getting stuck on calculating the average score vertically.
Thanks!
Sort by studentID and schoolID, then use BY-group processing to compute the means for each student-school pair. Lastly, merge the results:
proc sort data=test;
by studentID schoolID;;
run;
proc means data=test noprint;
by studentID schoolID;
var score;
output out=means(drop=_type_ _freq_) mean=mean;
run;
data All;
merge test means;
by studentID schoolID;
run;
It should be 79.2?
proc sql;
create table want as
select *, mean(score) as avg_score
from test
group by schoolid, studentid;
quit;
Thanks for the quick responses! I should have been more clear about how volume is calculated
Take this example:
studentID schoolID year score
a 123 2000 78
a 123 2001 80
a 123 2001 80
The average score should be 79.0, not 79.3. This is because the second two observations are essentially duplicates. There are other covariates which are different but these 4 columns are the same. So the mean should not be calculated like these 3 are independent. It should be calculated as sum(of unique scores)/number of unique years.
Thanks!
And you want to keep the duplicates in your output for some reason?
Here's one way - first identify the duplicates and then average. If all you're doing is an average, it may be worth just adding that to the data step as well.
data test1;
set test;
by studentID schoolID year;
if first.year then duplicate=0;
else duplicate=1;
run;
proc sql;
create table want as
select *, mean(case when duplicate=0 then score else . end) as avg_score
from test1
group by schoolid, studentid;
quit;
This worked perfectly, thanks!!
Try this:
proc sql;
create table test_avg as
select
studentID,
schoolID,
year,
sum(score/n)/sum(1/n) as avg_score
from (
select
*,
count(score) as n
from test
group by studentID, schoolID, year )
group by studentID, schoolID;
select * from test_avg;
quit;
data test; infile datalines; input studentID $ schoolID year score; datalines; a 123 2000 78 a 123 2000 78 a 123 2001 80 a 123 2001 80 a 123 2001 80 b 123 2002 95 b 456 2003 91 c 456 2003 90 c 456 2004 98 c 456 2006 92 ; run; proc sql; select *,(select mean(distinct score) from test where studentID=a.studentID and schoolID=a.schoolID) as mean from test as a; quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.