Turnover stats - Unbalanced data

Solved
Occasional Contributor
Posts: 7

Turnover stats - Unbalanced data

Hi All,

I'm infinitely grateful to folks who responded to an earlier query regarding creation of turnover variables (entries, exits, and total turnover [entries + exits]).  A terrific solution was provided for a balanced dataset, or at least one in which the number of years was known and could be specified.

Alas, a second dataset contains unbalanced data. Start and end years differ; also, duration varies across teams.  So, for example, four years of data might be available for one team; only three years for another.  What's more, teams exhibit different start and end years.  A sample of the data are provided below.  To the right of the three variable columns (Team, Year, Member) are some notes reflecting the events that need capturing (again, exits, entries, and their total).

Sample data:

 Team Year Member A 2000 Charlene A 2000 David A 2000 Jane A 2000 Paul A 2001 Charlene A 2001 David A 2001 Jane A 2001 Paul A 2002 Ann Charlene exited; Ann entered A 2002 David A 2002 Jane A 2002 Paul A 2003 Ann A 2003 David A 2003 Jane A 2003 Jim Jim entered A 2003 Paul B 2000 Betty B 2000 Joe B 2000 Suresh B 2000 Tom B 2001 Brandy Betty exited; Brandy entered B 2001 Joe B 2001 Suresh Tom exited C 2002 Gwen C 2002 Jamal C 2002 Peter C 2002 Susan C 2003 Edward Edward entered C 2003 Gwen C 2003 Jamal C 2003 Peter C 2003 Susan C 2004 Edward C 2004 Gwen C 2004 Jamal C 2004 Peter Susan exited

Whereas the input data are observations at the individual level of analysis, the output is aggregated at the team level and should have a Team x Year structure including three variables: Entry (# additions to the team in the most recent time period), Exit (# exits from the team in the most recent timer period), and Turnover (= Entry + Exit in the most recent time period).  In other words ...

 Team Year Entry Exit Turnover A 2000 . . . A 2001 0 0 0 A 2002 1 1 2 A 2003 1 0 1 B 2000 . . . B 2001 1 2 3 C 2002 . . . C 2003 1 0 1 C 2004 0 1 1

The first year (time period) for each team scores missing values since it cannot have incurred turnover.

Please know any thoughts or suggestions are immensely appreciated!  Thanks!

Accepted Solutions
Solution
‎02-21-2017 07:13 PM
Posts: 1,309

Re: Turnover stats - Unbalanced data

[ Edited ]

Just as proc summary can provide the minyear and maxyear for each team*member,  it can simultaneously provide minyear and maxyear for each team.  This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

``````data have;
input Team \$1. Year Member :\$8.;
datalines;
A 2000 Charlene
A 2000 David
A 2000 Jane
A 2000 Paul
A 2001 Charlene
A 2001 David
A 2001 Jane
A 2001 Paul
A 2002 Ann   Charlene exited/ Ann entered
A 2002 David
A 2002 Jane
A 2002 Paul
A 2003 Ann
A 2003 David
A 2003 Jane
A 2003 Jim   Jim entered
A 2003 Paul
B 2000 Betty
B 2000 Joe
B 2000 Suresh
B 2000 Tom
B 2001 Brandy   Betty exited, Brandy entered
B 2001 Joe
B 2001 Suresh   Tom exited
C 2002 Gwen
C 2002 Jamal
C 2002 Peter
C 2002 Susan
C 2003 Edward   Edward entered
C 2003 Gwen
C 2003 Jamal
C 2003 Peter
C 2003 Susan
C 2004 Edward
C 2004 Gwen
C 2004 Jamal
C 2004 Peter   Susan exited
run;

proc summary data=have min max ;
class team member;
var year;
output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
array ent {1991:2010} _temporary_ (20*0);
array ext {1991:2010} _temporary_ (20*0);
merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
range (where=(_type_=3));
by team ;

if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/

if last.team then do year=team_minyear to team_maxyear;
entries=ent{year};  ent{year}=0;
exits=ext{year};    ext{year}=0;
turnover=entries+exits;
if year=team_minyear then call missing(entries,exits,turnover);
output;
end;
run;``````

Notes:

1. By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally.   Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_.  If you look at data set RANGE, you'll see _TYPE_ =0(global),  1 (member), 2 (team) and 3 (member*team).
2. So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
3. And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

The nice thing is that the core of the program is changed very little.

All Replies
Solution
‎02-21-2017 07:13 PM
Posts: 1,309

Re: Turnover stats - Unbalanced data

[ Edited ]

Just as proc summary can provide the minyear and maxyear for each team*member,  it can simultaneously provide minyear and maxyear for each team.  This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

``````data have;
input Team \$1. Year Member :\$8.;
datalines;
A 2000 Charlene
A 2000 David
A 2000 Jane
A 2000 Paul
A 2001 Charlene
A 2001 David
A 2001 Jane
A 2001 Paul
A 2002 Ann   Charlene exited/ Ann entered
A 2002 David
A 2002 Jane
A 2002 Paul
A 2003 Ann
A 2003 David
A 2003 Jane
A 2003 Jim   Jim entered
A 2003 Paul
B 2000 Betty
B 2000 Joe
B 2000 Suresh
B 2000 Tom
B 2001 Brandy   Betty exited, Brandy entered
B 2001 Joe
B 2001 Suresh   Tom exited
C 2002 Gwen
C 2002 Jamal
C 2002 Peter
C 2002 Susan
C 2003 Edward   Edward entered
C 2003 Gwen
C 2003 Jamal
C 2003 Peter
C 2003 Susan
C 2004 Edward
C 2004 Gwen
C 2004 Jamal
C 2004 Peter   Susan exited
run;

proc summary data=have min max ;
class team member;
var year;
output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
array ent {1991:2010} _temporary_ (20*0);
array ext {1991:2010} _temporary_ (20*0);
merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
range (where=(_type_=3));
by team ;

if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/

if last.team then do year=team_minyear to team_maxyear;
entries=ent{year};  ent{year}=0;
exits=ext{year};    ext{year}=0;
turnover=entries+exits;
if year=team_minyear then call missing(entries,exits,turnover);
output;
end;
run;``````

Notes:

1. By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally.   Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_.  If you look at data set RANGE, you'll see _TYPE_ =0(global),  1 (member), 2 (team) and 3 (member*team).
2. So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
3. And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

The nice thing is that the core of the program is changed very little.

Occasional Contributor
Posts: 7

Re: Turnover stats - Unbalanced data

mkeintz,

You are brilliant!  I again extend thanks ... not only for the solution, but also the insights.  As someone who is self-taught in SAS, and very much an amateur, I've learned a great deal from your (and others') posts.  With deep appreciation, cheers,

MGray000

☑ This topic is solved.