Solved: Re: longitudinal dataset - a count across variables

Gbird · Posted 12-28-2012 05:29 PM

Hello,

I have a longitudinal dataset (for years 2002 through 2010) that includes data for approximately 700,000 students in one school district. Each student has a unique identifier.

Over the years, there is a lot of student mobility from school to school. For example, a student might attend a given school for kindergarten, a second school for first grade and second grade, and a third school for third grade through eighth grade, for a total of 3 schools. Another student might attend one school for kindergarten through eighth grade, for a total of 1 school. I am trying to figure out how many schools each student attended. Then I can determine how many students attended just one school over the years, how many students attended 2 different schools, how many students attended 3 different schools, how many students attended 4 different schools, etc. Each school has a unique identifier.

(Missing values exist when a given student is not enrolled within this school district.)

What I have:

StudentID	SchCode2002	SchCode2003	SchCode2004	SchCode2005	SchCode2006	SchCode2007	SchCode2008	SchCode2009	SchCode2010
100145	8543	8543	8543	8543	8543	8543	2455	2455	2455
100362	8543	8543	9652	3402	3402	3402	4656	4656	.
100595	.	.	3402	3402	3402	3402	6548	6548	.
100788	2104	8453	8453	8453	2455	.	.	.	.
100900	.	.	.	.	.	.	.	7502	7502
102044	3402	3402	3402	3402	3402	3402	3402	3402	3402
105305	.	4645	4645	4645	4645	4645	4645	4645	6548
108643	.	.	.	8543	8543	8543	8543	8543	8543
109224	3402	3402	2455	2455	2455	.	.	.	.
etc.

What I want (last column -- a count, for each student, of the number of schools attended):

StudentID	SchCode2002	SchCode2003	SchCode2004	SchCode2005	SchCode2006	SchCode2007	SchCode2008	SchCode2009	SchCode2010	SchCount
100145	8543	8543	8543	8543	8543	8543	2455	2455	2455	2
100362	8543	8543	9652	3402	3402	3402	4656	4656	.	4
100595	.	.	3402	3402	3402	3402	6548	6548	.	2
100788	2104	8543	8543	8543	2455	.	.	.	.	3
100900	.	.	.	.	.	.	.	7502	7502	1
102044	3402	3402	3402	3402	3402	3402	3402	3402	3402	1
105305	.	4645	4645	4645	4645	4645	4645	4645	6548	2
108643	.	.	.	8543	8543	8543	8543	8543	8543	1
109224	3402	3402	2455	2455	2455	.	.	.	.	2
etc.

Thanks in advance for your help.

art297 · Posted 12-28-2012 05:56 PM

Here is one way to do it:

data want (drop=schools:);

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

View solution in original post

art297 · Posted 12-28-2012 05:56 PM

Here is one way to do it:

data want (drop=schools:);

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

Gbird · Posted 12-29-2012 11:32 AM

Thanks very much! This worked quite well.

And thanks to others for your alternative approaches. Happy new year.

Reeza · Posted 12-28-2012 05:59 PM

A brute force approach always works, though I'm sure this doesn't handle the missing values appropriately.

But hopefully it gets you started.

data want;

set have;

array sch(9) schcode2002-schcode2010;

schcount=0;

do i=2 to 9;

if sch(i) ne . and sch(i) ne sch(i-1) then schcount+1;

end;

run;

art297 · Posted 12-28-2012 06:17 PM

: I would think that your approach wouldn't work if a student goes back to the same school after a couple of years. I think my extreme brute force approach will cover any deviations, including where a student moves out of and then back into the district, but of course I am only thinking of the logic and couldn't test all possibilities.

p.s. Have I wished you and everyone the best for the coming new year? If not, please consider those wishes conveyed!

Haikuo · Posted 12-28-2012 06:58 PM

I think we can really use some convenience of operator "IN":

data want;

set have;

array sch schcode:;

array ds (9999);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

And Happy New to Everyone online or offline!

Haikuo

art297 · Posted 12-28-2012 08:00 PM

L Lesson #576,432

Try the following and post your results:

data have (drop=i);

input (StudentID SchCode2002-SchCode2010) ($);

do i=1 to 10000;

output;

end;

cards;

100145 8543 8543 8543 8543 8543 8543 2455 2455 2455

100362 8543 8543 9652 3402 3402 3402 4656 4656 .

100595 . . 3402 3402 3402 3402 6548 6548 .

100788 2104 8453 8453 8453 2455 . . . .

;

/*art*/

data want1 (drop=schools:);

set have;

array schcode SchCode2002-SchCode2010;

array schools(9999);

do over schcode;

if not missing(schcode) then schools(schcode)=1;

end;

SchCount=9999-cmiss(of schools(*));

run;

/*haikuo*/

data want2;

set have;

array sch schcode:;

array ds (9999);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

Haikuo · Posted 12-28-2012 08:24 PM

LOL, Art. Happy New Year to you, too.

I admit yours is 10 times faster than mine. However, because of my laziness, I sort of took "array ds (9999)" from your code, which I don't really need that much. Now try this:

/*haikuo*/

data want2;

set have;

array sch schcode:;

array ds (9);

do i=1 to dim(sch);

if sch(i) not in ds then ds(i)=sch(i);

end;

schcount=n(of ds(*));

drop ds: i;

run;

Haikuo

art297 · Posted 12-28-2012 08:36 PM

: Happy New Year to you too! Much better!

TomKari · Posted 12-29-2012 11:37 AM

Oh, you data step coders. I like PROCs!

proc sort data=have;
by studentid;
run;

proc transpose data=have out=interim1 prefix=column name=source;
by studentid;
var schcode2002 schcode2003 schcode2004 schcode2005 schcode2006 schcode2007 schcode2008 schcode2009 schcode2010;
run;

proc sql;
create table interim2 as
select studentid, count(column1) as SchCount from interim1
group by studentid;
quit;

proc sql;
create table want as
select h.*, i.SchCount from have h left join interim2 i on(h.studentid = i.studentid);
quit;

Tom

art297 · Posted 12-29-2012 05:46 PM

: a belated Christmas present. Since your code ran faster than mine, but I was awarded the correct answer, to justify my points here is an approach that runs faster (on average) than yours:

data want;

set have;

array sch $ schcode:;

call sortc(of sch(*));

counter=1;

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

if sch(_n_+1) ne sch(_n_) then counter+1;

end;

run;

: I have absolutely nothing against procs, but your approach runs slower and would have to be modified as it is only counting number of schools, not the number of distinct schools as required by the OP.

TomKari · Posted 12-29-2012 07:52 PM

Oops! Missed the requirement for distinct schools. Amend

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from interim1

group by studentid;

quit;

to

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from

(select distinct studentid, column1 from interim1)

group by studentid;

quit;

However, that being said, in this case I agree that the data step solution is the better one. PROC APPEND is one that I haven't used much, and I'm trying to get better with it, so I jump in when the opportunity presents. I thought this would make a nice contrast to the other solutions, but if I was the client I'd prefer them!

Tom

(P.S. As long as the powers-that-be don't start deducting points, I'm happy!)

Haikuo · Posted 12-29-2012 11:06 PM

: Very impressive. I made a small change to your code to take care of "all missing" scenario, and it still runs faster than mine. You very much deserve "correct answer", Sir.

data want;

set have;

array sch $ schcode:;

call sortc(of sch(*));

if n(of sch(*))=0 then counter=0;

else counter=1;

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

if sch(_n_+1) ne sch(_n_) then counter+1;

end;

run;

Haikuo

art297 · Posted 12-30-2012 12:27 PM

: Yes, good catch! However, the performance could still be improved. How about:

data want;

set have;

array sch $ schcode:;

call sortc(of sch(*));

counter=not missing(sch(9));

do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

counter+sch(_n_+1) ne sch(_n_);

end;

run;

Ksharp · Posted 12-30-2012 09:33 PM

or using Hash Table.

data have (drop=i);
  input (StudentID SchCode2002-SchCode2010) ($);
  do i=1 to 10000;
    output;
  end;
  cards;
100145 8543 8543 8543 8543 8543 8543 2455 2455 2455
100362 8543 8543 9652 3402 3402 3402 4656 4656 .
100595 . . 3402 3402 3402 3402 6548 6548 .
100788 2104 8453 8453 8453 2455 . . . .
;
run;

data want(drop=i rc k );
 if _n_ eq 1 then do;
  length k $ 8;
  declare hash ha();
   ha.definekey('k');
   ha.definedone();
 end;
set have;
  array s{9} $ SchCode2002-SchCode2010;
  do i=1 to 9 ;
      if not missing(s{i}) then do;k=s{i};rc=ha.add(); end;
  end;
  n=ha.num_items;
  ha.clear();
run;

Ksharp

Message was edited by: xia keshan

SAS Innovate 2025: Call for Content

Classroom Training Available!