I have a longitudinal dataset containing various grades for students. I want to select the very last grade per student, create a new variable and assign that grade to that variable. Is there an easy way to accomplish this. I tried to write a last statement but did not produce the results I was anticipating.
Sample data below with new column far right:
Ross Smith 8
Jayden Woo 9
Ross Smith 7
Jayden Woo 10
Ross Smith 10
Want:
Ross Smith 8 10
Jayden Woo 9 10
Ross Smith 7 10
Jayden Woo 10 10
Ross Smith 10 10
You'll need two passes through the data and you can use last.student and then merge it back in with your original data by student name/id.
Like Fareeze said, using old school data step, such as 2X DOW, will involve 2X I/O passes; and if using Proc SQL technique, God knows how many passes involved. However, if trying to use Hash(), it may only involve one pass (I/O): (Credit goes to recent posts from Joe Hinson)
data have;
input (fn ln) (:$20.) grade;
cards;
Ross Smith 8
Jayden Woo 9
Ross Smith 7
Jayden Woo 10
Ross Smith 10
;
/*the least verbose approach: Proc SQL*/
proc sql;
select *, max(grade) as l_grade from have group by fn, ln ;quit;
/*theorically one-pass(I/O) approach*/
data want;
declare hash full(ordered:'y');
full.definekey('_n_');
full.definedata('fn','ln','grade');
full.definedone();
declare hiter hif('full');
declare hash hoh();
hoh.definekey('fn','ln');
hoh.definedata('fn','ln','lst','hil');
hoh.definedone();
declare hash lst;
declare hiter hil;
do _n_=1 by 1 until (done);
set have end=done;
l_grade=grade;
if hoh.find() ne 0 then do;
lst=_new_ hash (ordered:'d');
lst.definekey('fn','ln','l_grade');
lst.definedata('fn','ln','l_grade');
lst.definedone();
hil=_new_ hiter ('lst');
hoh.replace();
end;
lst.replace();
full.replace();
end;
do _rc=hif.next() by 0 while (_rc=0);
hoh.find();
hil.first();
output;
_rc=hif.next();
end;
stop;
drop _rc;
run;
Haikuo
The last grade isn't the same as the maximum grade, though it is in the sample data provided.
There isn't an order specified, which would actually be needed based on the question from the OP.
Hai.kuo (and others?),
I'm just learning this hashing stuff, but shouldn't there be an easier hash approach? Instead of creating an iterative hash object, isn't there a way to create a one-key, one-data-element table where the data element gets replaced if the key is repeated? After that, it should be a simple matter to run through the data and look up the final data element.
Astounding: There are simpler Hash() approaches, however, I can only do it with two passes theoretically :
data have;
input (fn ln) (:$20.) grade;
cards;
Ross Smith 11
Jayden Woo 10
Ross Smith 7
Jayden Woo 8
Ross Smith 6
;
/*theorically two-pass(I/O) approach: last record, not largest*/
data want;
if _n_=1 then do;
declare hash full(dataset:'have (rename=grade=l_grade)', multidata:'y');
full.definekey('fn','ln');
full.definedata('l_grade');
full.definedone();
call missing (l_grade);
end;
set have ;
_rc=full.find();
do _rc=0 by 0 while (_rc=0);
_rc=full.find_next();
end;
output;
drop _rc;
run;
/*theorically two-pass(I/O) approach: largest record*/
data want;
if _n_=1 then do;
declare hash full(dataset:'have (rename=grade=l_grade)', multidata:'y');
full.definekey('fn','ln');
full.definedata('l_grade');
full.definedone();
call missing (l_grade);
end;
set have;
_rc=full.find();
do _rc=0 by 0 while (_rc=0);
_grade=max(_grade,l_grade);
_rc=full.find_next();
end;
l_grade=_grade;
output;
drop _:;
run;
Haikuo
Hai.kuo,
I know it's for my own good, but now I have to look this up. Couldn't this work just as well? Is it not simpler and returns the right answer? Is it faster/slower?
data want;
if _n_=1 then do;
declare hash h(dataset:'have (rename=(grade=last_grade))', duplicate:'replace');
h.definekey('fn', 'ln');
h.definedata('last_grade');
h.definedone();
call missing(last_grade);
end;
set have;
rc = h.find();
drop rc;
run;
The idea is to build the hash table with one record per key (last name / first name). Replace duplicates, so the last data point is the only one kept in the hash table. What am I missing?
Astounding, your Hash() code is quite nice. But there are several things you may want to dig deeper:
1. What is your purpose, the last record or the largest one within the name? Yours is doing the last record, what if for the largest one? Can you construct Hash() code with similar simplicity?
2. Yours is still involving two passes: one to load into Hash(), another of course the 'set' statement. Theoretically mine is just as efficient as yours, cause it takes no time to do the direct access to RAM, but we all know it has to be some sort of overhead, and Art already proved that HOH is quite slow even it only has one pass.
Just my 2 cents and you are always my hero,
Haikuo
Art,
If the data are pre-sorted, the order still makes a difference for appending the highest grade. The ideal order would be by fn ln DESCENDING grade. Then you could simply code:
data want;
set have;
by fn ln;
if first.ln then highest_grade=grade;
retain highest_grade;
run;
In the longer run, I imagine that SAS will develop more hash methods to apply to handling duplicate key values. I used REPLACE, but I could easily conceive of LARGEST or SMALLEST being a possible value in the future. While that may seem trivial, there are questions to consider about what would be the right way to go about it. What if there are multiple data values stored in the hash table. Would LARGEST apply to each data element independently of the others, or does it make sense to take the largest combination from a single observation (similar to the way an ID statement works with PROC SUMMARY). There may be a right answer to that, but it takes time to think through the possibilities.
I imagine you could program your own version of LARGEST if you had to. I'll take liberties with the syntax, but here's the idea.
if _n_=1 then do;
declare hash h();
h.definekey('ln', 'fn');
h.definedata('largest_grade');
h.definedone();
do until (done);
set have end=done;
if h.check() then do;
largest_grade = grade;
h.add();
end;
else do;
rc = h.find();
if grade > largest_grade then do;
largest_grade = grade;
h.replace();
end;
end;
end;
Just looking at the hoops you have to jump through convinces me that SAS will make this easy one day.
If you want to select the highest grade then I would use Haikuo's suggested SQL code, but with a create statement added. i.e.,:
proc sql;
create table want as
select *, max(grade) as highest_grade
from have
group by fn, ln
;
quit;
That will run almost twice as fast as Haikuo's suggest hash approach. However, does highest grade always equate to last grade?
Mr. Tabachneck- The highest grade is always equal to or greater than the last grade. I am not familiar with hash and interested in learning once I become more comfortable with proc sql. Will your code do the work it needs to?
proc sql;
create table want as
select *, max(grade) as highest_grade
from have
group by fn, ln
;
quit;
: Just so we don't get confused with semantics, what you just said would suggested that you couldn't use the proc sql approach. It would only have been valid if you had said: "the last grade will always be the highest grade."
To meet the condition as you stated it, you would either need a hash approach or one where at least three passes of the data would be necessary. e.g. given a dataset like the following:
data have (drop=i);
input (fn ln) (:$20.) grade;
cards;
Ross Smith 8
Jayden Woo 9
Ross Smith 7
Jayden Woo 5
Ross Smith 2
;
I would now presume that you would want a file that looked like:
ln fn grade last_grade
Ross Smith 8 2
Ross Smith 7 2
Ross Smith 2 2
Jayden Woo 9 5
Jayden Woo 5 5
If that is indeed what you want, then one way of doing it would be:
proc sort data=have;
by ln fn;
run;
data want;
do until (last.fn);
set have end=_last1;
by ln fn;
if last.fn then last_grade=grade;
end;
do until (last.fn);
set have end=_last2;
by ln fn;
output;
end;
run;
However, that will definitely not run as quickly as a hash approach.
Art said: " what you just said would suggested that you couldn't use the proc sql approach.", What Art didn't say is: "UNLESS you use some undocumented features"
data have;
input (fn ln) (:$20.) grade;
cards;
Ross Smith 11
Jayden Woo 10
Ross Smith 7
Jayden Woo 8
Ross Smith 6
;
proc sql;
create table want as
select a.*, b.grade as l_grade from have a,
( select *,monotonic() as seq from have
group by fn,ln
having seq=max(seq)) b
where a.fn=b.fn and a.ln=b.ln;
quit;
Haikuo
Thank you Hai- I re-tried the last.student function and matched it back to the original dataset to create the new data variable. I will definitely try the proc sql to test it out. Many thanks everyone!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.