10-12-2015 07:37 AM
Hi all,
I have a general question for help on efficiency of task I am doing.
I have a dataset which has a set of scores for each student.
STUDENTID SCORE
1 5
1 4
1 4
1 3
1 3
1 2
1 5
2 1
2 1
2 1
2 2
2 3
2 1
3 4
3 4
What I want to do is create a new dataset which will contain the STUDYID, HIGHEST SCORE, AVERAGE SCORE. this is fine if using all data, however my requirements are to only use, the last 3 valid observations (i.e. not null) per student.
If less than 3 datapoints exists, then HIGHEST and AVERAGE are considered missing.
i.e.
STUDYID HIGHEST AVERAGE
1 5 3.33
2 3 2
3 . .
I am really just trying to find a way to select thos last 3 values per student to use.
Kind regards,
10-12-2015 09:18 AM
You will need to process the data twice. One time counts the number of records, and the other calculates statistics. The easiest way might be to use a single DATA step:
data want;
n_records = 0;
do until (last.ID);
set have;
by ID;
where score > .;
n_records + 1;
end;
n_records2 = 0;
useful_records = min(3, n_records);
do until (last.ID);
set have;
by ID;
where score > .;
n_records2 + 1;
if (useful_records >= 3) and (n_records2 >= n_records-2) then do;
highest = max(highest, score);
average = sum(average, score);
end;
end;
if average > . then average = average/3;
drop n_records n_records2;
run;
The first DO loop counts, and the second calculates.
10-12-2015 10:55 AM
Hi, this works with your data ...
data x;
input studentid score @@;
datalines;
1 5 1 4 1 4 1 3 1 3 1 2 1 5 2 1
2 1 2 1 2 2 2 3 2 1 3 4 3 4
;
data y (keep=studentid average highest);
array z(100);
do j=1 by 1 until(last.studentid);
set x;
by studentid;
z(j) = score;
end;
if j ge 3 then do;
highest = max(of z(j) z(j-1) z(j-2));
average = round(mean(of z(j) z(j-1) z(j-2)),.01);;
end;
run;
data set Y ...
Obs studentid highest average
1 1 5 3.33
2 2 3 2.00
3 3 . .
10-12-2015 11:28 AM
Much simpler. Here's a wrinkle that removes any worries about the number of incoming scores:
array z {0:2} z0-z2;
Then:
z{mod(j,3)} = score;
Then compute max and average based on z{*}.
10-12-2015 01:35 PM - edited 10-12-2015 01:37 PM
Hi, OK but no need to bother with the z(0:2) part ...
data y (keep=studentid average highest);
array z(3);
do j=1 by 1 until(last.studentid);
set x;
by studentid;
z(mod(j,3)+1) = score;
end;
if j ge 3 then do;
highest = max(of z(*));
average = round(mean(of z(*)),.01);;
end;
run;
10-12-2015 01:42 PM - edited 10-12-2015 01:45 PM
The simpler the better
data have;
input StudentId score @@;
datalines;
1 5 1 4 1 4 1 3 1 3 1 2 1 5 2 1
2 1 2 1 2 2 2 3 2 1 3 4 3 4
;
data want;
do until(last.studentId);
set have; by studentId;
where score is not missing;
v2 = v1;
v1 = v0;
v0 = score;
end;
if not missing(v2) then do;
Highest = max(of v:);
Average = mean(of v:);
end;
keep studentId average highest;
format average 4.1;
run;
proc print data=want noobs; run;
thanks @MikeZdeb