Longitudinal data

SannaSanna · Posted 12-28-2012 01:33 PM

I have a longitudinal dataset containing various grades for students. I want to select the very last grade per student, create a new variable and assign that grade to that variable. Is there an easy way to accomplish this. I tried to write a last statement but did not produce the results I was anticipating.

Sample data below with new column far right:

Ross Smith 8
Jayden Woo 9
Ross Smith 7
Jayden Woo 10
Ross Smith 10

Want:
Ross Smith 8 10
Jayden Woo 9 10
Ross Smith 7 10
Jayden Woo 10 10
Ross Smith 10 10

Reeza · Posted 12-28-2012 01:36 PM

You'll need two passes through the data and you can use last.student and then merge it back in with your original data by student name/id.

Haikuo · Posted 12-28-2012 03:05 PM

Like Fareeze said, using old school data step, such as 2X DOW, will involve 2X I/O passes; and if using Proc SQL technique, God knows how many passes involved. However, if trying to use Hash(), it may only involve one pass (I/O): (Credit goes to recent posts from Joe Hinson)

data have;

input (fn ln) (:$20.) grade;

cards;

Ross Smith 8

Jayden Woo 9

Ross Smith 7

Jayden Woo 10

Ross Smith 10

;

/*the least verbose approach: Proc SQL*/

proc sql;

select *, max(grade) as l_grade from have group by fn, ln ;quit;

/*theorically one-pass(I/O) approach*/

data want;

declare hash full(ordered:'y');

full.definekey('_n_');

full.definedata('fn','ln','grade');

full.definedone();

declare hiter hif('full');

declare hash hoh();

hoh.definekey('fn','ln');

hoh.definedata('fn','ln','lst','hil');

hoh.definedone();

declare hash lst;

declare hiter hil;

do _n_=1 by 1 until (done);

set have end=done;

l_grade=grade;

if hoh.find() ne 0 then do;

lst=_new_ hash (ordered:'d');

lst.definekey('fn','ln','l_grade');

lst.definedata('fn','ln','l_grade');

lst.definedone();

hil=_new_ hiter ('lst');

hoh.replace();

end;

lst.replace();

full.replace();

end;

do _rc=hif.next() by 0 while (_rc=0);

hoh.find();

hil.first();

output;

_rc=hif.next();

end;

stop;

drop _rc;

run;

Haikuo

Reeza · Posted 12-28-2012 03:29 PM

The last grade isn't the same as the maximum grade, though it is in the sample data provided.

There isn't an order specified, which would actually be needed based on the question from the OP.

Astounding · Posted 12-28-2012 03:42 PM

Hai.kuo (and others?),

I'm just learning this hashing stuff, but shouldn't there be an easier hash approach? Instead of creating an iterative hash object, isn't there a way to create a one-key, one-data-element table where the data element gets replaced if the key is repeated? After that, it should be a simple matter to run through the data and look up the final data element.

Haikuo · Posted 12-28-2012 04:52 PM

Astounding: There are simpler Hash() approaches, however, I can only do it with two passes theoretically :

data have;

input (fn ln) (:$20.) grade;

cards;

Ross Smith 11

Jayden Woo 10

Ross Smith 7

Jayden Woo 8

Ross Smith 6

;

/*theorically two-pass(I/O) approach: last record, not largest*/

data want;

if _n_=1 then do;

declare hash full(dataset:'have (rename=grade=l_grade)', multidata:'y');

full.definekey('fn','ln');

full.definedata('l_grade');

full.definedone();

call missing (l_grade);

end;

set have ;

_rc=full.find();

do _rc=0 by 0 while (_rc=0);

_rc=full.find_next();

end;

output;

drop _rc;

run;

/*theorically two-pass(I/O) approach: largest record*/

data want;

if _n_=1 then do;

declare hash full(dataset:'have (rename=grade=l_grade)', multidata:'y');

full.definekey('fn','ln');

full.definedata('l_grade');

full.definedone();

call missing (l_grade);

end;

set have;

_rc=full.find();

do _rc=0 by 0 while (_rc=0);

_grade=max(_grade,l_grade);

_rc=full.find_next();

end;

l_grade=_grade;

output;

drop _:;

run;

Haikuo

Astounding · Posted 12-30-2012 03:59 PM

Hai.kuo,

I know it's for my own good, but now I have to look this up. Couldn't this work just as well? Is it not simpler and returns the right answer? Is it faster/slower?

data want;

if _n_=1 then do;

declare hash h(dataset:'have (rename=(grade=last_grade))', duplicate:'replace');

h.definekey('fn', 'ln');

h.definedata('last_grade');

h.definedone();

call missing(last_grade);

end;

set have;

rc = h.find();

drop rc;

run;

The idea is to build the hash table with one record per key (last name / first name). Replace duplicates, so the last data point is the only one kept in the hash table. What am I missing?

Haikuo · Posted 12-30-2012 06:03 PM

Astounding, your Hash() code is quite nice. But there are several things you may want to dig deeper:

1. What is your purpose, the last record or the largest one within the name? Yours is doing the last record, what if for the largest one? Can you construct Hash() code with similar simplicity?

2. Yours is still involving two passes: one to load into Hash(), another of course the 'set' statement. Theoretically mine is just as efficient as yours, cause it takes no time to do the direct access to RAM, but we all know it has to be some sort of overhead, and Art already proved that HOH is quite slow even it only has one pass.

Just my 2 cents and you are always my hero,

Haikuo

art297 · Posted 12-30-2012 06:20 PM

and : Actually, if the data aren't sorted, Astounding's approach works the fastest for the current problem and DOES do what the OP (I think) requested.

The non-hash datastep approach only works fastest if the data are either already sorted or indexed.

Astounding · Posted 12-30-2012 11:03 PM

Art,

If the data are pre-sorted, the order still makes a difference for appending the highest grade. The ideal order would be by fn ln DESCENDING grade. Then you could simply code:

data want;

set have;

by fn ln;

if first.ln then highest_grade=grade;

retain highest_grade;

run;

In the longer run, I imagine that SAS will develop more hash methods to apply to handling duplicate key values. I used REPLACE, but I could easily conceive of LARGEST or SMALLEST being a possible value in the future. While that may seem trivial, there are questions to consider about what would be the right way to go about it. What if there are multiple data values stored in the hash table. Would LARGEST apply to each data element independently of the others, or does it make sense to take the largest combination from a single observation (similar to the way an ID statement works with PROC SUMMARY). There may be a right answer to that, but it takes time to think through the possibilities.

I imagine you could program your own version of LARGEST if you had to. I'll take liberties with the syntax, but here's the idea.

if _n_=1 then do;

declare hash h();

h.definekey('ln', 'fn');

h.definedata('largest_grade');

h.definedone();

do until (done);

set have end=done;

if h.check() then do;

largest_grade = grade;

h.add();

end;

else do;

rc = h.find();

if grade > largest_grade then do;

largest_grade = grade;

h.replace();

end;

Just looking at the hoops you have to jump through convinces me that SAS will make this easy one day.

art297 · Posted 12-28-2012 03:27 PM

If you want to select the highest grade then I would use Haikuo's suggested SQL code, but with a create statement added. i.e.,:

proc sql;

create table want as

select *, max(grade) as highest_grade

from have

group by fn, ln

;

quit;

That will run almost twice as fast as Haikuo's suggest hash approach. However, does highest grade always equate to last grade?

SannaSanna · Posted 12-28-2012 04:06 PM

Mr. Tabachneck- The highest grade is always equal to or greater than the last grade. I am not familiar with hash and interested in learning once I become more comfortable with proc sql. Will your code do the work it needs to?

proc sql;

create table want as

select *, max(grade) as highest_grade

from have

group by fn, ln

;

quit;

art297 · Posted 12-28-2012 05:19 PM

: Just so we don't get confused with semantics, what you just said would suggested that you couldn't use the proc sql approach. It would only have been valid if you had said: "the last grade will always be the highest grade."

To meet the condition as you stated it, you would either need a hash approach or one where at least three passes of the data would be necessary. e.g. given a dataset like the following:

data have (drop=i);

input (fn ln) (:$20.) grade;

cards;

Ross Smith 8

Jayden Woo 9

Ross Smith 7

Jayden Woo 5

Ross Smith 2

;

I would now presume that you would want a file that looked like:

ln fn grade last_grade

Ross Smith 8 2

Ross Smith 7 2

Ross Smith 2 2

Jayden Woo 9 5

Jayden Woo 5 5

If that is indeed what you want, then one way of doing it would be:

proc sort data=have;

by ln fn;

run;

data want;

do until (last.fn);

set have end=_last1;

by ln fn;

if last.fn then last_grade=grade;

end;

do until (last.fn);

set have end=_last2;

by ln fn;

output;

end;

run;

However, that will definitely not run as quickly as a hash approach.

Haikuo · Posted 12-28-2012 05:50 PM

Art said: " what you just said would suggested that you couldn't use the proc sql approach.", What Art didn't say is: "UNLESS you use some undocumented features"

data have;

input (fn ln) (:$20.) grade;

cards;

Ross Smith 11

Jayden Woo 10

Ross Smith 7

Jayden Woo 8

Ross Smith 6

;

proc sql;

create table want as

select a.*, b.grade as l_grade from have a,

( select *,monotonic() as seq from have

group by fn,ln

having seq=max(seq)) b

where a.fn=b.fn and a.ln=b.ln;

quit;

Haikuo

SannaSanna · Posted 12-28-2012 05:57 PM

Thank you Hai- I re-tried the last.student function and matched it back to the original dataset to create the new data variable. I will definitely try the proc sql to test it out. Many thanks everyone!

Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

Re: Longitudinal data

SAS Innovate 2025: Call for Content

Classroom Training Available!