Hi all,
I have a general question for help on efficiency of task I am doing.
I have a dataset which has a set of scores for each student.
STUDENTID SCORE
1 5
1 4
1 4
1 3
1 3
1 2
1 5
2 1
2 1
2 1
2 2
2 3
2 1
3 4
3 4
What I want to do is create a new dataset which will contain the STUDYID, HIGHEST SCORE, AVERAGE SCORE. this is fine if using all data, however my requirements are to only use, the last 3 valid observations (i.e. not null) per student.
If less than 3 datapoints exists, then HIGHEST and AVERAGE are considered missing.
i.e.
STUDYID HIGHEST AVERAGE
1 5 3.33
2 3 2
3 . .
I am really just trying to find a way to select thos last 3 values per student to use.
Kind regards,
You will need to process the data twice. One time counts the number of records, and the other calculates statistics. The easiest way might be to use a single DATA step:
data want;
n_records = 0;
do until (last.ID);
set have;
by ID;
where score > .;
n_records + 1;
end;
n_records2 = 0;
useful_records = min(3, n_records);
do until (last.ID);
set have;
by ID;
where score > .;
n_records2 + 1;
if (useful_records >= 3) and (n_records2 >= n_records-2) then do;
highest = max(highest, score);
average = sum(average, score);
end;
end;
if average > . then average = average/3;
drop n_records n_records2;
run;
The first DO loop counts, and the second calculates.
Hi, this works with your data ...
data x;
input studentid score @@;
datalines;
1 5 1 4 1 4 1 3 1 3 1 2 1 5 2 1
2 1 2 1 2 2 2 3 2 1 3 4 3 4
;
data y (keep=studentid average highest);
array z(100);
do j=1 by 1 until(last.studentid);
set x;
by studentid;
z(j) = score;
end;
if j ge 3 then do;
highest = max(of z(j) z(j-1) z(j-2));
average = round(mean(of z(j) z(j-1) z(j-2)),.01);;
end;
run;
data set Y ...
Obs studentid highest average
1 1 5 3.33
2 2 3 2.00
3 3 . .
Much simpler. Here's a wrinkle that removes any worries about the number of incoming scores:
array z {0:2} z0-z2;
Then:
z{mod(j,3)} = score;
Then compute max and average based on z{*}.
Hi, OK but no need to bother with the z(0:2) part ...
data y (keep=studentid average highest);
array z(3);
do j=1 by 1 until(last.studentid);
set x;
by studentid;
z(mod(j,3)+1) = score;
end;
if j ge 3 then do;
highest = max(of z(*));
average = round(mean(of z(*)),.01);;
end;
run;
The simpler the better
data have;
input StudentId score @@;
datalines;
1 5 1 4 1 4 1 3 1 3 1 2 1 5 2 1
2 1 2 1 2 2 2 3 2 1 3 4 3 4
;
data want;
do until(last.studentId);
set have; by studentId;
where score is not missing;
v2 = v1;
v1 = v0;
v0 = score;
end;
if not missing(v2) then do;
Highest = max(of v:);
Average = mean(of v:);
end;
keep studentId average highest;
format average 4.1;
run;
proc print data=want noobs; run;
thanks @MikeZdeb
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.