BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Gbird
Calcite | Level 5

Hello,


I have a longitudinal dataset (for years 2002 through 2010) that includes data for approximately 700,000 students in one school district. Each student has a unique identifier.

Over the years, there is a lot of student mobility from school to school. For example, a student might attend a given school for kindergarten, a second school for first grade and second grade, and a third school for third grade through eighth grade, for a total of 3 schools. Another student might attend one school for kindergarten through eighth grade, for a total of 1 school. I am trying to figure out how many schools each student attended. Then I can determine how many students attended just one school over the years, how many students attended 2 different schools, how many students attended 3 different schools, how many students attended 4 different schools, etc.  Each school has a unique identifier.

(Missing values exist when a given student is not enrolled within this school district.)

What I have:

StudentID

SchCode2002

SchCode2003

SchCode2004

SchCode2005

SchCode2006

SchCode2007

SchCode2008

SchCode2009

SchCode2010

100145

8543

8543

8543

8543

8543

8543

2455

2455

2455

100362

8543

8543

9652

3402

3402

3402

4656

4656

.

100595

.

.

3402

3402

3402

3402

6548

6548

.

100788

2104

8453

8453

8453

2455

.

.

.

.

100900

.

.

.

.

.

.

.

7502

7502

102044

3402

3402

3402

3402

3402

3402

3402

3402

3402

105305

.

4645

4645

4645

4645

4645

4645

4645

6548

108643

.

.

.

8543

8543

8543

8543

8543

8543

109224

3402

3402

2455

2455

2455

.

.

.

.

etc.

What I want (last column -- a count, for each student, of the number of schools attended):

StudentID

SchCode2002

SchCode2003

SchCode2004

SchCode2005

SchCode2006

SchCode2007

SchCode2008

SchCode2009

SchCode2010

SchCount

100145

8543

8543

8543

8543

8543

8543

2455

2455

2455

2

100362

8543

8543

9652

3402

3402

3402

4656

4656

.

4

100595

.

.

3402

3402

3402

3402

6548

6548

.

2

100788

2104

8543

8543

8543

2455

.

.

.

.

3

100900

.

.

.

.

.

.

.

7502

7502

1

102044

3402

3402

3402

3402

3402

3402

3402

3402

3402

1

105305

.

4645

4645

4645

4645

4645

4645

4645

6548

2

108643

.

.

.

8543

8543

8543

8543

8543

8543

1

109224

3402

3402

2455

2455

2455

.

.

.

.

2

etc.

 

Thanks in advance for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Here is one way to do it:

data want (drop=schools:);

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

View solution in original post

14 REPLIES 14
art297
Opal | Level 21

Here is one way to do it:

data want (drop=schools:);

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

Gbird
Calcite | Level 5

Thanks very much!  This worked quite well. 

And thanks to others for your alternative approaches. Happy new year.

Reeza
Super User

A brute force approach always works, though I'm sure this doesn't handle the missing values appropriately.

But hopefully it gets you started.

data want;

set have;

array sch(9) schcode2002-schcode2010;

schcount=0;

do i=2 to 9;

if sch(i) ne . and sch(i) ne sch(i-1) then schcount+1;

end;

run;

art297
Opal | Level 21

: I would think that your approach wouldn't work if a student goes back to the same school after a couple of years.  I think my extreme brute force approach will cover any deviations, including where a student moves out of and then back into the district, but of course I am only thinking of the logic and couldn't test all possibilities.

p.s.  Have I wished you and everyone the best for the coming new year?  If not, please consider those wishes conveyed!

Haikuo
Onyx | Level 15

I think we can really use some convenience of operator "IN":

data want;

   set have;

     array sch schcode:;

     array ds (9999);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

And Happy New to Everyone online or offline!

Haikuo

art297
Opal | Level 21

L Lesson #576,432

Try the following and post your results:

data have (drop=i);

  input (StudentID SchCode2002-SchCode2010) ($);

  do i=1 to 10000;

    output;

  end;

  cards;

100145 8543 8543 8543 8543 8543 8543 2455 2455 2455

100362 8543 8543 9652 3402 3402 3402 4656 4656 .

100595 . . 3402 3402 3402 3402 6548 6548 .

100788 2104 8453 8453 8453 2455 . . . .

;

/*art*/

data want1 (drop=schools:);

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

/*haikuo*/

data want2;

   set have;

     array sch schcode:;

     array ds (9999);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

Haikuo
Onyx | Level 15

LOL, Art.  Happy New Year to you, too.

I admit yours is 10 times faster than mine. However, because of my laziness, I sort of took "array ds (9999)" from your code, which I don't really need that much. Now try this:

/*haikuo*/

data want2;

   set have;

     array sch schcode:;

     array ds (9);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

Haikuo

art297
Opal | Level 21

: Happy New Year to you too!  Much better!

TomKari
Onyx | Level 15

Oh, you data step coders. I like PROCs!

proc sort data=have;
by studentid;
run;

proc transpose data=have out=interim1 prefix=column name=source;
by studentid;
var schcode2002 schcode2003 schcode2004 schcode2005 schcode2006 schcode2007 schcode2008 schcode2009 schcode2010;
run;

proc sql;
create table interim2 as
  select studentid, count(column1) as SchCount from interim1
   group by studentid;
quit;

proc sql;
create table want as
  select h.*, i.SchCount from have h left join interim2 i on(h.studentid = i.studentid);
quit;

Tom

art297
Opal | Level 21

: a belated Christmas present.  Since your code ran faster than mine, but I was awarded the correct answer, to justify my points here is an approach that runs faster (on average) than yours:

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  counter=1;

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    if sch(_n_+1) ne sch(_n_) then counter+1;

  end;

run;

: I have absolutely nothing against procs, but your approach runs slower and would have to be modified as it is only counting number of schools, not the number of distinct schools as required by the OP.

TomKari
Onyx | Level 15

Oops! Missed the requirement for distinct schools. Amend

proc sql;

create table interim2 as

  select studentid, count(column1) as SchCount from interim1

   group by studentid;

quit;

to

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from

(select distinct studentid, column1 from interim1)

group by studentid;

quit;

However, that being said, in this case I agree that the data step solution is the better one. PROC APPEND is one that I haven't used much, and I'm trying to get better with it, so I jump in when the opportunity presents. I thought this would make a nice contrast to the other solutions, but if I was the client I'd prefer them!

Tom

(P.S. As long as the powers-that-be don't start deducting points, I'm happy!)

Haikuo
Onyx | Level 15

: Very impressive. I made a small change to your code to take care of "all missing" scenario, and it still runs faster than mine. You very much deserve "correct answer", Sir.

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  if n(of sch(*))=0 then counter=0;

  else counter=1;

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    if sch(_n_+1) ne sch(_n_) then counter+1;

  end;

run;

Haikuo

art297
Opal | Level 21

: Yes, good catch!  However, the performance could still be improved.  How about:

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  counter=not missing(sch(9));

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    counter+sch(_n_+1) ne sch(_n_);

  end;

run;

Ksharp
Super User

or using Hash Table.

data have (drop=i);
  input (StudentID SchCode2002-SchCode2010) ($);
  do i=1 to 10000;
    output;
  end;
  cards;
100145 8543 8543 8543 8543 8543 8543 2455 2455 2455
100362 8543 8543 9652 3402 3402 3402 4656 4656 .
100595 . . 3402 3402 3402 3402 6548 6548 .
100788 2104 8453 8453 8453 2455 . . . .
;
run;

data want(drop=i rc k );
 if _n_ eq 1 then do;
  length k $ 8;
  declare hash ha();
   ha.definekey('k');
   ha.definedone();
 end;
set have;
  array s{9} $ SchCode2002-SchCode2010;
  do i=1 to 9 ;
      if not missing(s{i}) then do;k=s{i};rc=ha.add(); end;
  end;
  n=ha.num_items;
  ha.clear();
run;
   


Ksharp

Message was edited by: xia keshan

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 4447 views
  • 0 likes
  • 6 in conversation