Help using Base SAS procedures

longitudinal dataset - a count across variables

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

longitudinal dataset - a count across variables

Hello,


I have a longitudinal dataset (for years 2002 through 2010) that includes data for approximately 700,000 students in one school district. Each student has a unique identifier.

Over the years, there is a lot of student mobility from school to school. For example, a student might attend a given school for kindergarten, a second school for first grade and second grade, and a third school for third grade through eighth grade, for a total of 3 schools. Another student might attend one school for kindergarten through eighth grade, for a total of 1 school. I am trying to figure out how many schools each student attended. Then I can determine how many students attended just one school over the years, how many students attended 2 different schools, how many students attended 3 different schools, how many students attended 4 different schools, etc.  Each school has a unique identifier.

(Missing values exist when a given student is not enrolled within this school district.)

What I have:

StudentID

SchCode2002

SchCode2003

SchCode2004

SchCode2005

SchCode2006

SchCode2007

SchCode2008

SchCode2009

SchCode2010

100145

8543

8543

8543

8543

8543

8543

2455

2455

2455

100362

8543

8543

9652

3402

3402

3402

4656

4656

.

100595

.

.

3402

3402

3402

3402

6548

6548

.

100788

2104

8453

8453

8453

2455

.

.

.

.

100900

.

.

.

.

.

.

.

7502

7502

102044

3402

3402

3402

3402

3402

3402

3402

3402

3402

105305

.

4645

4645

4645

4645

4645

4645

4645

6548

108643

.

.

.

8543

8543

8543

8543

8543

8543

109224

3402

3402

2455

2455

2455

.

.

.

.

etc.

What I want (last column -- a count, for each student, of the number of schools attended):

StudentID

SchCode2002

SchCode2003

SchCode2004

SchCode2005

SchCode2006

SchCode2007

SchCode2008

SchCode2009

SchCode2010

SchCount

100145

8543

8543

8543

8543

8543

8543

2455

2455

2455

2

100362

8543

8543

9652

3402

3402

3402

4656

4656

.

4

100595

.

.

3402

3402

3402

3402

6548

6548

.

2

100788

2104

8543

8543

8543

2455

.

.

.

.

3

100900

.

.

.

.

.

.

.

7502

7502

1

102044

3402

3402

3402

3402

3402

3402

3402

3402

3402

1

105305

.

4645

4645

4645

4645

4645

4645

4645

6548

2

108643

.

.

.

8543

8543

8543

8543

8543

8543

1

109224

3402

3402

2455

2455

2455

.

.

.

.

2

etc.

 

Thanks in advance for your help.


Accepted Solutions
Solution
‎12-28-2012 05:56 PM
PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

Here is one way to do it:

data want (drop=schoolsSmiley Happy;

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

View solution in original post


All Replies
Solution
‎12-28-2012 05:56 PM
PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

Here is one way to do it:

data want (drop=schoolsSmiley Happy;

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

Occasional Contributor
Posts: 7

Re: longitudinal dataset - a count across variables

Thanks very much!  This worked quite well. 

And thanks to others for your alternative approaches. Happy new year.

Super User
Posts: 17,829

Re: longitudinal dataset - a count across variables

A brute force approach always works, though I'm sure this doesn't handle the missing values appropriately.

But hopefully it gets you started.

data want;

set have;

array sch(9) schcode2002-schcode2010;

schcount=0;

do i=2 to 9;

if sch(i) ne . and sch(i) ne sch(i-1) then schcount+1;

end;

run;

PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

: I would think that your approach wouldn't work if a student goes back to the same school after a couple of years.  I think my extreme brute force approach will cover any deviations, including where a student moves out of and then back into the district, but of course I am only thinking of the logic and couldn't test all possibilities.

p.s.  Have I wished you and everyone the best for the coming new year?  If not, please consider those wishes conveyed!

Respected Advisor
Posts: 3,124

Re: longitudinal dataset - a count across variables

I think we can really use some convenience of operator "IN":

data want;

   set have;

     array sch schcode:;

     array ds (9999);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

And Happy New to Everyone online or offline!

Haikuo

PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

L Lesson #576,432

Try the following and post your results:

data have (drop=i);

  input (StudentID SchCode2002-SchCode2010) ($);

  do i=1 to 10000;

    output;

  end;

  cards;

100145 8543 8543 8543 8543 8543 8543 2455 2455 2455

100362 8543 8543 9652 3402 3402 3402 4656 4656 .

100595 . . 3402 3402 3402 3402 6548 6548 .

100788 2104 8453 8453 8453 2455 . . . .

;

/*art*/

data want1 (drop=schoolsSmiley Happy;

  set have;

  array schcode SchCode2002-SchCode2010;

  array schools(9999);

  do over schcode;

    if not missing(schcode) then schools(schcode)=1;

  end;

  SchCount=9999-cmiss(of schools(*));

run;

/*haikuo*/

data want2;

   set have;

     array sch schcode:;

     array ds (9999);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

Respected Advisor
Posts: 3,124

Re: longitudinal dataset - a count across variables

LOL, Art.  Happy New Year to you, too.

I admit yours is 10 times faster than mine. However, because of my laziness, I sort of took "array ds (9999)" from your code, which I don't really need that much. Now try this:

/*haikuo*/

data want2;

   set have;

     array sch schcode:;

     array ds (9);

     do i=1 to dim(sch);

       if sch(i) not in ds then ds(i)=sch(i);

      end;

      schcount=n(of ds(*));

      drop ds: i;

run;

Haikuo

PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

: Happy New Year to you too!  Much better!

PROC Star
Posts: 1,093

Re: longitudinal dataset - a count across variables

Oh, you data step coders. I like PROCs!

proc sort data=have;
by studentid;
run;

proc transpose data=have out=interim1 prefix=column name=source;
by studentid;
var schcode2002 schcode2003 schcode2004 schcode2005 schcode2006 schcode2007 schcode2008 schcode2009 schcode2010;
run;

proc sql;
create table interim2 as
  select studentid, count(column1) as SchCount from interim1
   group by studentid;
quit;

proc sql;
create table want as
  select h.*, i.SchCount from have h left join interim2 i on(h.studentid = i.studentid);
quit;

Tom

PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

: a belated Christmas present.  Since your code ran faster than mine, but I was awarded the correct answer, to justify my points here is an approach that runs faster (on average) than yours:

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  counter=1;

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    if sch(_n_+1) ne sch(_n_) then counter+1;

  end;

run;

: I have absolutely nothing against procs, but your approach runs slower and would have to be modified as it is only counting number of schools, not the number of distinct schools as required by the OP.

PROC Star
Posts: 1,093

Re: longitudinal dataset - a count across variables

Oops! Missed the requirement for distinct schools. Amend

proc sql;

create table interim2 as

  select studentid, count(column1) as SchCount from interim1

   group by studentid;

quit;

to

proc sql;

create table interim2 as

select studentid, count(column1) as SchCount from

(select distinct studentid, column1 from interim1)

group by studentid;

quit;

However, that being said, in this case I agree that the data step solution is the better one. PROC APPEND is one that I haven't used much, and I'm trying to get better with it, so I jump in when the opportunity presents. I thought this would make a nice contrast to the other solutions, but if I was the client I'd prefer them!

Tom

(P.S. As long as the powers-that-be don't start deducting points, I'm happy!)

Respected Advisor
Posts: 3,124

Re: longitudinal dataset - a count across variables

: Very impressive. I made a small change to your code to take care of "all missing" scenario, and it still runs faster than mine. You very much deserve "correct answer", Sir.

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  if n(of sch(*))=0 then counter=0;

  else counter=1;

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    if sch(_n_+1) ne sch(_n_) then counter+1;

  end;

run;

Haikuo

PROC Star
Posts: 7,363

Re: longitudinal dataset - a count across variables

: Yes, good catch!  However, the performance could still be improved.  How about:

data want;

  set have;

  array sch $ schcode:;

  call sortc(of sch(*));

  counter=not missing(sch(9));

  do _n_=8 to 1 by -1 while (not missing(sch(_n_)));

    counter+sch(_n_+1) ne sch(_n_);

  end;

run;

Super User
Posts: 9,681

Re: longitudinal dataset - a count across variables

or using Hash Table.

data have (drop=i);
  input (StudentID SchCode2002-SchCode2010) ($);
  do i=1 to 10000;
    output;
  end;
  cards;
100145 8543 8543 8543 8543 8543 8543 2455 2455 2455
100362 8543 8543 9652 3402 3402 3402 4656 4656 .
100595 . . 3402 3402 3402 3402 6548 6548 .
100788 2104 8453 8453 8453 2455 . . . .
;
run;

data want(drop=i rc k );
 if _n_ eq 1 then do;
  length k $ 8;
  declare hash ha();
   ha.definekey('k');
   ha.definedone();
 end;
set have;
  array s{9} $ SchCode2002-SchCode2010;
  do i=1 to 9 ;
      if not missing(s{i}) then do;k=s{i};rc=ha.add(); end;
  end;
  n=ha.num_items;
  ha.clear();
run;
   


Ksharp

Message was edited by: xia keshan

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 1741 views
  • 0 likes
  • 6 in conversation