DATA Step, Macro, Functions and more

Turnover stats - Unbalanced data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

Turnover stats - Unbalanced data

Hi All,

 

I'm infinitely grateful to folks who responded to an earlier query regarding creation of turnover variables (entries, exits, and total turnover [entries + exits]).  A terrific solution was provided for a balanced dataset, or at least one in which the number of years was known and could be specified.

 

Alas, a second dataset contains unbalanced data. Start and end years differ; also, duration varies across teams.  So, for example, four years of data might be available for one team; only three years for another.  What's more, teams exhibit different start and end years.  A sample of the data are provided below.  To the right of the three variable columns (Team, Year, Member) are some notes reflecting the events that need capturing (again, exits, entries, and their total).

 

Sample data:

 

TeamYearMember  
     
A2000Charlene  
A2000David  
A2000Jane  
A2000Paul  
A2001Charlene  
A2001David  
A2001Jane  
A2001Paul  
A2002Ann Charlene exited; Ann entered
A2002David  
A2002Jane  
A2002Paul  
A2003Ann  
A2003David  
A2003Jane  
A2003Jim Jim entered
A2003Paul  
B2000Betty  
B2000Joe  
B2000Suresh  
B2000Tom  
B2001Brandy Betty exited; Brandy entered
B2001Joe  
B2001Suresh Tom exited
C2002Gwen  
C2002Jamal  
C2002Peter  
C2002Susan  
C2003Edward Edward entered
C2003Gwen  
C2003Jamal  
C2003Peter  
C2003Susan  
C2004Edward  
C2004Gwen  
C2004Jamal  
C2004Peter Susan exited

 

Whereas the input data are observations at the individual level of analysis, the output is aggregated at the team level and should have a Team x Year structure including three variables: Entry (# additions to the team in the most recent time period), Exit (# exits from the team in the most recent timer period), and Turnover (= Entry + Exit in the most recent time period).  In other words ...

 

TeamYearEntryExitTurnover
     
A2000...
A2001000
A2002112
A2003101
B2000...
B2001123
C2002...
C2003101
C2004011

 

The first year (time period) for each team scores missing values since it cannot have incurred turnover.  

 

Please know any thoughts or suggestions are immensely appreciated!  Thanks!


Accepted Solutions
Solution
‎02-21-2017 07:13 PM
Trusted Advisor
Posts: 1,019

Re: Turnover stats - Unbalanced data

[ Edited ]

Just as proc summary can provide the minyear and maxyear for each team*member,  it can simultaneously provide minyear and maxyear for each team.  This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

 

data have;
  input Team $1. Year Member :$8.;
datalines;          
A 2000 Charlene     
A 2000 David     
A 2000 Jane     
A 2000 Paul     
A 2001 Charlene     
A 2001 David     
A 2001 Jane     
A 2001 Paul     
A 2002 Ann   Charlene exited/ Ann entered 
A 2002 David     
A 2002 Jane     
A 2002 Paul     
A 2003 Ann     
A 2003 David     
A 2003 Jane     
A 2003 Jim   Jim entered 
A 2003 Paul     
B 2000 Betty     
B 2000 Joe     
B 2000 Suresh     
B 2000 Tom     
B 2001 Brandy   Betty exited, Brandy entered 
B 2001 Joe     
B 2001 Suresh   Tom exited 
C 2002 Gwen     
C 2002 Jamal     
C 2002 Peter     
C 2002 Susan     
C 2003 Edward   Edward entered 
C 2003 Gwen     
C 2003 Jamal     
C 2003 Peter     
C 2003 Susan     
C 2004 Edward     
C 2004 Gwen     
C 2004 Jamal     
C 2004 Peter   Susan exited 
run;
 
proc summary data=have min max ;
  class team member;
  var year;
  output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
  array ent {1991:2010} _temporary_ (20*0);
  array ext {1991:2010} _temporary_ (20*0);
  merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
        range (where=(_type_=3));
  by team ;

  if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
  if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/
  
  if last.team then do year=team_minyear to team_maxyear;
    entries=ent{year};  ent{year}=0;
    exits=ext{year};    ext{year}=0;
    turnover=entries+exits;
	if year=team_minyear then call missing(entries,exits,turnover);
    output;
  end;
run;

Notes:

 

  1. By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally.   Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_.  If you look at data set RANGE, you'll see _TYPE_ =0(global),  1 (member), 2 (team) and 3 (member*team).
  2. So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
  3. And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

 

The nice thing is that the core of the program is changed very little.

View solution in original post


All Replies
Solution
‎02-21-2017 07:13 PM
Trusted Advisor
Posts: 1,019

Re: Turnover stats - Unbalanced data

[ Edited ]

Just as proc summary can provide the minyear and maxyear for each team*member,  it can simultaneously provide minyear and maxyear for each team.  This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

 

data have;
  input Team $1. Year Member :$8.;
datalines;          
A 2000 Charlene     
A 2000 David     
A 2000 Jane     
A 2000 Paul     
A 2001 Charlene     
A 2001 David     
A 2001 Jane     
A 2001 Paul     
A 2002 Ann   Charlene exited/ Ann entered 
A 2002 David     
A 2002 Jane     
A 2002 Paul     
A 2003 Ann     
A 2003 David     
A 2003 Jane     
A 2003 Jim   Jim entered 
A 2003 Paul     
B 2000 Betty     
B 2000 Joe     
B 2000 Suresh     
B 2000 Tom     
B 2001 Brandy   Betty exited, Brandy entered 
B 2001 Joe     
B 2001 Suresh   Tom exited 
C 2002 Gwen     
C 2002 Jamal     
C 2002 Peter     
C 2002 Susan     
C 2003 Edward   Edward entered 
C 2003 Gwen     
C 2003 Jamal     
C 2003 Peter     
C 2003 Susan     
C 2004 Edward     
C 2004 Gwen     
C 2004 Jamal     
C 2004 Peter   Susan exited 
run;
 
proc summary data=have min max ;
  class team member;
  var year;
  output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
  array ent {1991:2010} _temporary_ (20*0);
  array ext {1991:2010} _temporary_ (20*0);
  merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
        range (where=(_type_=3));
  by team ;

  if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
  if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/
  
  if last.team then do year=team_minyear to team_maxyear;
    entries=ent{year};  ent{year}=0;
    exits=ext{year};    ext{year}=0;
    turnover=entries+exits;
	if year=team_minyear then call missing(entries,exits,turnover);
    output;
  end;
run;

Notes:

 

  1. By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally.   Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_.  If you look at data set RANGE, you'll see _TYPE_ =0(global),  1 (member), 2 (team) and 3 (member*team).
  2. So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
  3. And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

 

The nice thing is that the core of the program is changed very little.

Occasional Contributor
Posts: 7

Re: Turnover stats - Unbalanced data

mkeintz,

 

You are brilliant!  I again extend thanks ... not only for the solution, but also the insights.  As someone who is self-taught in SAS, and very much an amateur, I've learned a great deal from your (and others') posts.  With deep appreciation, cheers,

 

MGray000

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 114 views
  • 0 likes
  • 2 in conversation