Hello,
I have a longitudinal dataset (for years 2002 through 2010) that includes data for approximately 700,000 students in one school district. Each student has a unique identifier.
Over the years, there is a lot of student mobility from school to school. For example, a student might attend a given school for kindergarten, a second school for first grade and second grade, and a third school for third grade through eighth grade, for a total of 3 schools. Another student might attend one school for kindergarten through eighth grade, for a total of 1 school. I am trying to figure out how many schools each student attended. Then I can determine how many students attended just one school over the years, how many students attended 2 different schools, how many students attended 3 different schools, how many students attended 4 different schools, etc. Each school has a unique identifier.
(Missing values exist when a given student is not enrolled within this school district.)
What I have:
StudentID | SchCode2002 | SchCode2003 | SchCode2004 | SchCode2005 | SchCode2006 | SchCode2007 | SchCode2008 | SchCode2009 | SchCode2010 |
100145 | 8543 | 8543 | 8543 | 8543 | 8543 | 8543 | 2455 | 2455 | 2455 |
100362 | 8543 | 8543 | 9652 | 3402 | 3402 | 3402 | 4656 | 4656 | . |
100595 | . | . | 3402 | 3402 | 3402 | 3402 | 6548 | 6548 | . |
100788 | 2104 | 8453 | 8453 | 8453 | 2455 | . | . | . | . |
100900 | . | . | . | . | . | . | . | 7502 | 7502 |
102044 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 |
105305 | . | 4645 | 4645 | 4645 | 4645 | 4645 | 4645 | 4645 | 6548 |
108643 | . | . | . | 8543 | 8543 | 8543 | 8543 | 8543 | 8543 |
109224 | 3402 | 3402 | 2455 | 2455 | 2455 | . | . | . | . |
etc. |
What I want (last column -- a count, for each student, of the number of schools attended):
StudentID | SchCode2002 | SchCode2003 | SchCode2004 | SchCode2005 | SchCode2006 | SchCode2007 | SchCode2008 | SchCode2009 | SchCode2010 | SchCount |
100145 | 8543 | 8543 | 8543 | 8543 | 8543 | 8543 | 2455 | 2455 | 2455 | 2 |
100362 | 8543 | 8543 | 9652 | 3402 | 3402 | 3402 | 4656 | 4656 | . | 4 |
100595 | . | . | 3402 | 3402 | 3402 | 3402 | 6548 | 6548 | . | 2 |
100788 | 2104 | 8543 | 8543 | 8543 | 2455 | . | . | . | . | 3 |
100900 | . | . | . | . | . | . | . | 7502 | 7502 | 1 |
102044 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 3402 | 1 |
105305 | . | 4645 | 4645 | 4645 | 4645 | 4645 | 4645 | 4645 | 6548 | 2 |
108643 | . | . | . | 8543 | 8543 | 8543 | 8543 | 8543 | 8543 | 1 |
109224 | 3402 | 3402 | 2455 | 2455 | 2455 | . | . | . | . | 2 |
etc. |
Thanks in advance for your help.
Here is one way to do it:
data want (drop=schools:);
set have;
array schcode SchCode2002-SchCode2010;
array schools(9999);
do over schcode;
if not missing(schcode) then schools(schcode)=1;
end;
SchCount=9999-cmiss(of schools(*));
run;
Here is one way to do it:
data want (drop=schools:);
set have;
array schcode SchCode2002-SchCode2010;
array schools(9999);
do over schcode;
if not missing(schcode) then schools(schcode)=1;
end;
SchCount=9999-cmiss(of schools(*));
run;
Thanks very much! This worked quite well.
And thanks to others for your alternative approaches. Happy new year.
A brute force approach always works, though I'm sure this doesn't handle the missing values appropriately.
But hopefully it gets you started.
data want;
set have;
array sch(9) schcode2002-schcode2010;
schcount=0;
do i=2 to 9;
if sch(i) ne . and sch(i) ne sch(i-1) then schcount+1;
end;
run;
: I would think that your approach wouldn't work if a student goes back to the same school after a couple of years. I think my extreme brute force approach will cover any deviations, including where a student moves out of and then back into the district, but of course I am only thinking of the logic and couldn't test all possibilities.
p.s. Have I wished you and everyone the best for the coming new year? If not, please consider those wishes conveyed!
I think we can really use some convenience of operator "IN":
data want;
set have;
array sch schcode:;
array ds (9999);
do i=1 to dim(sch);
if sch(i) not in ds then ds(i)=sch(i);
end;
schcount=n(of ds(*));
drop ds: i;
run;
And Happy New to Everyone online or offline!
Haikuo
Try the following and post your results:
data have (drop=i);
input (StudentID SchCode2002-SchCode2010) ($);
do i=1 to 10000;
output;
end;
cards;
100145 8543 8543 8543 8543 8543 8543 2455 2455 2455
100362 8543 8543 9652 3402 3402 3402 4656 4656 .
100595 . . 3402 3402 3402 3402 6548 6548 .
100788 2104 8453 8453 8453 2455 . . . .
;
/*art*/
data want1 (drop=schools:);
set have;
array schcode SchCode2002-SchCode2010;
array schools(9999);
do over schcode;
if not missing(schcode) then schools(schcode)=1;
end;
SchCount=9999-cmiss(of schools(*));
run;
/*haikuo*/
data want2;
set have;
array sch schcode:;
array ds (9999);
do i=1 to dim(sch);
if sch(i) not in ds then ds(i)=sch(i);
end;
schcount=n(of ds(*));
drop ds: i;
run;
LOL, Art. Happy New Year to you, too.
I admit yours is 10 times faster than mine. However, because of my laziness, I sort of took "array ds (9999)" from your code, which I don't really need that much. Now try this:
/*haikuo*/
data want2;
set have;
array sch schcode:;
array ds (9);
do i=1 to dim(sch);
if sch(i) not in ds then ds(i)=sch(i);
end;
schcount=n(of ds(*));
drop ds: i;
run;
Haikuo
Oh, you data step coders. I like PROCs!
proc sort data=have;
by studentid;
run;
proc transpose data=have out=interim1 prefix=column name=source;
by studentid;
var schcode2002 schcode2003 schcode2004 schcode2005 schcode2006 schcode2007 schcode2008 schcode2009 schcode2010;
run;
proc sql;
create table interim2 as
select studentid, count(column1) as SchCount from interim1
group by studentid;
quit;
proc sql;
create table want as
select h.*, i.SchCount from have h left join interim2 i on(h.studentid = i.studentid);
quit;
Tom
: a belated Christmas present. Since your code ran faster than mine, but I was awarded the correct answer, to justify my points here is an approach that runs faster (on average) than yours:
data want;
set have;
array sch $ schcode:;
call sortc(of sch(*));
counter=1;
do _n_=8 to 1 by -1 while (not missing(sch(_n_)));
if sch(_n_+1) ne sch(_n_) then counter+1;
end;
run;
: I have absolutely nothing against procs, but your approach runs slower and would have to be modified as it is only counting number of schools, not the number of distinct schools as required by the OP.
Oops! Missed the requirement for distinct schools. Amend
proc sql;
create table interim2 as
select studentid, count(column1) as SchCount from interim1
group by studentid;
quit;
to
proc sql;
create table interim2 as
select studentid, count(column1) as SchCount from
(select distinct studentid, column1 from interim1)
group by studentid;
quit;
However, that being said, in this case I agree that the data step solution is the better one. PROC APPEND is one that I haven't used much, and I'm trying to get better with it, so I jump in when the opportunity presents. I thought this would make a nice contrast to the other solutions, but if I was the client I'd prefer them!
Tom
(P.S. As long as the powers-that-be don't start deducting points, I'm happy!)
: Very impressive. I made a small change to your code to take care of "all missing" scenario, and it still runs faster than mine. You very much deserve "correct answer", Sir.
data want;
set have;
array sch $ schcode:;
call sortc(of sch(*));
if n(of sch(*))=0 then counter=0;
else counter=1;
do _n_=8 to 1 by -1 while (not missing(sch(_n_)));
if sch(_n_+1) ne sch(_n_) then counter+1;
end;
run;
Haikuo
or using Hash Table.
data have (drop=i); input (StudentID SchCode2002-SchCode2010) ($); do i=1 to 10000; output; end; cards; 100145 8543 8543 8543 8543 8543 8543 2455 2455 2455 100362 8543 8543 9652 3402 3402 3402 4656 4656 . 100595 . . 3402 3402 3402 3402 6548 6548 . 100788 2104 8453 8453 8453 2455 . . . . ; run; data want(drop=i rc k ); if _n_ eq 1 then do; length k $ 8; declare hash ha(); ha.definekey('k'); ha.definedone(); end; set have; array s{9} $ SchCode2002-SchCode2010; do i=1 to 9 ; if not missing(s{i}) then do;k=s{i};rc=ha.add(); end; end; n=ha.num_items; ha.clear(); run;
Ksharp
Message was edited by: xia keshan
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.