longitudinal dataset - a count across variables

Solved
Occasional Contributor
Posts: 7

longitudinal dataset - a count across variables

Hello,

I have a longitudinal dataset (for years 2002 through 2010) that includes data for approximately 700,000 students in one school district. Each student has a unique identifier.

Over the years, there is a lot of student mobility from school to school. For example, a student might attend a given school for kindergarten, a second school for first grade and second grade, and a third school for third grade through eighth grade, for a total of 3 schools. Another student might attend one school for kindergarten through eighth grade, for a total of 1 school. I am trying to figure out how many schools each student attended. Then I can determine how many students attended just one school over the years, how many students attended 2 different schools, how many students attended 3 different schools, how many students attended 4 different schools, etc.  Each school has a unique identifier.

(Missing values exist when a given student is not enrolled within this school district.)

What I have:

 StudentID SchCode2002 SchCode2003 SchCode2004 SchCode2005 SchCode2006 SchCode2007 SchCode2008 SchCode2009 SchCode2010 100145 8543 8543 8543 8543 8543 8543 2455 2455 2455 100362 8543 8543 9652 3402 3402 3402 4656 4656 . 100595 . . 3402 3402 3402 3402 6548 6548 . 100788 2104 8453 8453 8453 2455 . . . . 100900 . . . . . . . 7502 7502 102044 3402 3402 3402 3402 3402 3402 3402 3402 3402 105305 . 4645 4645 4645 4645 4645 4645 4645 6548 108643 . . . 8543 8543 8543 8543 8543 8543 109224 3402 3402 2455 2455 2455 . . . . etc.

What I want (last column -- a count, for each student, of the number of schools attended):

 StudentID SchCode2002 SchCode2003 SchCode2004 SchCode2005 SchCode2006 SchCode2007 SchCode2008 SchCode2009 SchCode2010 SchCount 100145 8543 8543 8543 8543 8543 8543 2455 2455 2455 2 100362 8543 8543 9652 3402 3402 3402 4656 4656 . 4 100595 . . 3402 3402 3402 3402 6548 6548 . 2 100788 2104 8543 8543 8543 2455 . . . . 3 100900 . . . . . . . 7502 7502 1 102044 3402 3402 3402 3402 3402 3402 3402 3402 3402 1 105305 . 4645 4645 4645 4645 4645 4645 4645 6548 2 108643 . . . 8543 8543 8543 8543 8543 8543 1 109224 3402 3402 2455 2455 2455 . . . . 2 etc.

Accepted Solutions
Solution
‎12-28-2012 05:56 PM
PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

Here is one way to do it:

data want (drop=schools;

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

All Replies
Solution
‎12-28-2012 05:56 PM
PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

Here is one way to do it:

data want (drop=schools;

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

Occasional Contributor
Posts: 7

Re: longitudinal dataset - a count across variables

Thanks very much!  This worked quite well.

And thanks to others for your alternative approaches. Happy new year.

Super User
Posts: 23,778

Re: longitudinal dataset - a count across variables

A brute force approach always works, though I'm sure this doesn't handle the missing values appropriately.

But hopefully it gets you started.

data want;

set have;

array sch(9) schcode2002-schcode2010;

schcount=0;

do i=2 to 9;

if sch(i) ne . and sch(i) ne sch(i-1) then schcount+1;

end;

run;

PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

: I would think that your approach wouldn't work if a student goes back to the same school after a couple of years.  I think my extreme brute force approach will cover any deviations, including where a student moves out of and then back into the district, but of course I am only thinking of the logic and couldn't test all possibilities.

p.s.  Have I wished you and everyone the best for the coming new year?  If not, please consider those wishes conveyed!

Posts: 3,167

Re: longitudinal dataset - a count across variables

I think we can really use some convenience of operator "IN":

data want;

set have;

array sch schcode:;

array ds (9999);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

And Happy New to Everyone online or offline!

Haikuo

PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

L Lesson #576,432

Try the following and post your results:

data have (drop=i);

input (StudentID SchCode2002-SchCode2010) (\$);

do i=1 to 10000;

output;

end;

cards;

100145 8543 8543 8543 8543 8543 8543 2455 2455 2455

100362 8543 8543 9652 3402 3402 3402 4656 4656 .

100595 . . 3402 3402 3402 3402 6548 6548 .

100788 2104 8453 8453 8453 2455 . . . .

;

/*art*/

data want1 (drop=schools;

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

/*haikuo*/

data want2;

set have;

array sch schcode:;

array ds (9999);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

Posts: 3,167

Re: longitudinal dataset - a count across variables

LOL, Art.  Happy New Year to you, too.

I admit yours is 10 times faster than mine. However, because of my laziness, I sort of took "array ds (9999)" from your code, which I don't really need that much. Now try this:

/*haikuo*/

data want2;

set have;

array sch schcode:;

array ds (9);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

Haikuo

PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

: Happy New Year to you too!  Much better!

PROC Star
Posts: 1,317

Re: longitudinal dataset - a count across variables

Oh, you data step coders. I like PROCs!

proc sort data=have;
by studentid;
run;

proc transpose data=have out=interim1 prefix=column name=source;
by studentid;
var schcode2002 schcode2003 schcode2004 schcode2005 schcode2006 schcode2007 schcode2008 schcode2009 schcode2010;
run;

proc sql;
create table interim2 as
select studentid, count(column1) as SchCount from interim1
group by studentid;
quit;

proc sql;
create table want as
select h.*, i.SchCount from have h left join interim2 i on(h.studentid = i.studentid);
quit;

Tom

PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

: a belated Christmas present.  Since your code ran faster than mine, but I was awarded the correct answer, to justify my points here is an approach that runs faster (on average) than yours:

data want;

set have;

array sch \$ schcode:;

call sortc(of sch(*));

counter=1;

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

if sch(_n_+1) ne sch(_n_) then counter+1;

end;

run;

: I have absolutely nothing against procs, but your approach runs slower and would have to be modified as it is only counting number of schools, not the number of distinct schools as required by the OP.

PROC Star
Posts: 1,317

Re: longitudinal dataset - a count across variables

Oops! Missed the requirement for distinct schools. Amend

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from interim1

group by studentid;

quit;

to

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from

(select distinct studentid, column1 from interim1)

group by studentid;

quit;

However, that being said, in this case I agree that the data step solution is the better one. PROC APPEND is one that I haven't used much, and I'm trying to get better with it, so I jump in when the opportunity presents. I thought this would make a nice contrast to the other solutions, but if I was the client I'd prefer them!

Tom

(P.S. As long as the powers-that-be don't start deducting points, I'm happy!)

Posts: 3,167

Re: longitudinal dataset - a count across variables

: Very impressive. I made a small change to your code to take care of "all missing" scenario, and it still runs faster than mine. You very much deserve "correct answer", Sir.

data want;

set have;

array sch \$ schcode:;

call sortc(of sch(*));

if n(of sch(*))=0 then counter=0;

else counter=1;

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

if sch(_n_+1) ne sch(_n_) then counter+1;

end;

run;

Haikuo

PROC Star
Posts: 8,169

Re: longitudinal dataset - a count across variables

: Yes, good catch!  However, the performance could still be improved.  How about:

data want;

set have;

array sch \$ schcode:;

call sortc(of sch(*));

counter=not missing(sch(9));

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

counter+sch(_n_+1) ne sch(_n_);

end;

run;

Super User
Posts: 10,788

Re: longitudinal dataset - a count across variables

or using Hash Table.

```data have (drop=i);
input (StudentID SchCode2002-SchCode2010) (\$);
do i=1 to 10000;
output;
end;
cards;
100145 8543 8543 8543 8543 8543 8543 2455 2455 2455
100362 8543 8543 9652 3402 3402 3402 4656 4656 .
100595 . . 3402 3402 3402 3402 6548 6548 .
100788 2104 8453 8453 8453 2455 . . . .
;
run;

data want(drop=i rc k );
if _n_ eq 1 then do;
length k \$ 8;
declare hash ha();
ha.definekey('k');
ha.definedone();
end;
set have;
array s{9} \$ SchCode2002-SchCode2010;
do i=1 to 9 ;
if not missing(s{i}) then do;k=s{i};rc=ha.add(); end;
end;
n=ha.num_items;
ha.clear();
run;

```

Ksharp

Message was edited by: xia keshan

🔒 This topic is solved and locked.