Solved: Re: Turnover stats - Unbalanced data

MGray000 · Posted 02-21-2017 02:52 PM

Hi All,

I'm infinitely grateful to folks who responded to an earlier query regarding creation of turnover variables (entries, exits, and total turnover [entries + exits]). A terrific solution was provided for a balanced dataset, or at least one in which the number of years was known and could be specified.

Alas, a second dataset contains unbalanced data. Start and end years differ; also, duration varies across teams. So, for example, four years of data might be available for one team; only three years for another. What's more, teams exhibit different start and end years. A sample of the data are provided below. To the right of the three variable columns (Team, Year, Member) are some notes reflecting the events that need capturing (again, exits, entries, and their total).

Sample data:

Team	Year	Member

A	2000	Charlene
A	2000	David
A	2000	Jane
A	2000	Paul
A	2001	Charlene
A	2001	David
A	2001	Jane
A	2001	Paul
A	2002	Ann	Charlene exited; Ann entered
A	2002	David
A	2002	Jane
A	2002	Paul
A	2003	Ann
A	2003	David
A	2003	Jane
A	2003	Jim	Jim entered
A	2003	Paul
B	2000	Betty
B	2000	Joe
B	2000	Suresh
B	2000	Tom
B	2001	Brandy	Betty exited; Brandy entered
B	2001	Joe
B	2001	Suresh	Tom exited
C	2002	Gwen
C	2002	Jamal
C	2002	Peter
C	2002	Susan
C	2003	Edward	Edward entered
C	2003	Gwen
C	2003	Jamal
C	2003	Peter
C	2003	Susan
C	2004	Edward
C	2004	Gwen
C	2004	Jamal
C	2004	Peter	Susan exited

Whereas the input data are observations at the individual level of analysis, the output is aggregated at the team level and should have a Team x Year structure including three variables: Entry (# additions to the team in the most recent time period), Exit (# exits from the team in the most recent timer period), and Turnover (= Entry + Exit in the most recent time period). In other words ...

Team	Year	Entry	Exit	Turnover

A	2000	.	.	.
A	2001	0	0	0
A	2002	1	1	2
A	2003	1	0	1
B	2000	.	.	.
B	2001	1	2	3
C	2002	.	.	.
C	2003	1	0	1
C	2004	0	1	1

The first year (time period) for each team scores missing values since it cannot have incurred turnover.

Please know any thoughts or suggestions are immensely appreciated! Thanks!

mkeintz · Posted 02-21-2017 06:10 PM

Just as proc summary can provide the minyear and maxyear for each team*member, it can simultaneously provide minyear and maxyear for each team. This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

data have;
  input Team $1. Year Member :$8.;
datalines;          
A 2000 Charlene     
A 2000 David     
A 2000 Jane     
A 2000 Paul     
A 2001 Charlene     
A 2001 David     
A 2001 Jane     
A 2001 Paul     
A 2002 Ann   Charlene exited/ Ann entered 
A 2002 David     
A 2002 Jane     
A 2002 Paul     
A 2003 Ann     
A 2003 David     
A 2003 Jane     
A 2003 Jim   Jim entered 
A 2003 Paul     
B 2000 Betty     
B 2000 Joe     
B 2000 Suresh     
B 2000 Tom     
B 2001 Brandy   Betty exited, Brandy entered 
B 2001 Joe     
B 2001 Suresh   Tom exited 
C 2002 Gwen     
C 2002 Jamal     
C 2002 Peter     
C 2002 Susan     
C 2003 Edward   Edward entered 
C 2003 Gwen     
C 2003 Jamal     
C 2003 Peter     
C 2003 Susan     
C 2004 Edward     
C 2004 Gwen     
C 2004 Jamal     
C 2004 Peter   Susan exited 
run;
 
proc summary data=have min max ;
  class team member;
  var year;
  output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
  array ent {1991:2010} _temporary_ (20*0);
  array ext {1991:2010} _temporary_ (20*0);
  merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
        range (where=(_type_=3));
  by team ;

  if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
  if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/
  
  if last.team then do year=team_minyear to team_maxyear;
    entries=ent{year};  ent{year}=0;
    exits=ext{year};    ext{year}=0;
    turnover=entries+exits;
	if year=team_minyear then call missing(entries,exits,turnover);
    output;
  end;
run;

Notes:

By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally. Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_. If you look at data set RANGE, you'll see _TYPE_ =0(global), 1 (member), 2 (team) and 3 (member*team).
So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

The nice thing is that the core of the program is changed very little.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 02-21-2017 06:10 PM

Just as proc summary can provide the minyear and maxyear for each team*member, it can simultaneously provide minyear and maxyear for each team. This means you can compare each member's time span to their team timespan, instead of a global pre-known time range.

data have;
  input Team $1. Year Member :$8.;
datalines;          
A 2000 Charlene     
A 2000 David     
A 2000 Jane     
A 2000 Paul     
A 2001 Charlene     
A 2001 David     
A 2001 Jane     
A 2001 Paul     
A 2002 Ann   Charlene exited/ Ann entered 
A 2002 David     
A 2002 Jane     
A 2002 Paul     
A 2003 Ann     
A 2003 David     
A 2003 Jane     
A 2003 Jim   Jim entered 
A 2003 Paul     
B 2000 Betty     
B 2000 Joe     
B 2000 Suresh     
B 2000 Tom     
B 2001 Brandy   Betty exited, Brandy entered 
B 2001 Joe     
B 2001 Suresh   Tom exited 
C 2002 Gwen     
C 2002 Jamal     
C 2002 Peter     
C 2002 Susan     
C 2003 Edward   Edward entered 
C 2003 Gwen     
C 2003 Jamal     
C 2003 Peter     
C 2003 Susan     
C 2004 Edward     
C 2004 Gwen     
C 2004 Jamal     
C 2004 Peter   Susan exited 
run;
 
proc summary data=have min max ;
  class team member;
  var year;
  output out=range min=minyear max=maxyear;
run;

data want (keep=team year entries exits turnover);
  array ent {1991:2010} _temporary_ (20*0);
  array ext {1991:2010} _temporary_ (20*0);
  merge range (where=(_type_=2) rename=(minyear=team_minyear maxyear=team_maxyear))
        range (where=(_type_=3));
  by team ;

  if minyear>team_minyear then ent{minyear}+1;   /*exclude baseline year from entry counts*/
  if maxyear<team_maxyear then ext{maxyear+1}+1; /*exclude last data year from exit counts*/
  
  if last.team then do year=team_minyear to team_maxyear;
    entries=ent{year};  ent{year}=0;
    exits=ext{year};    ext{year}=0;
    turnover=entries+exits;
	if year=team_minyear then call missing(entries,exits,turnover);
    output;
  end;
run;

Notes:

By taking away the NWAY option of proc summary as used in your other solution (Creating turnover stats - entries/exits to groups over ), you not only get stats for all the team*member combo's, but also the same stats for team, for member (in case a member name appears in multiple teams), and globally. Proc summary distinguishes these dimension combinations with the automatic variable _TYPE_. If you look at data set RANGE, you'll see _TYPE_ =0(global), 1 (member), 2 (team) and 3 (member*team).
So the only problem is how to merge the _TYPE_ 2 and 3 records, which you see in the data step.
And then instead of looping from a global minyear to maxyear, the loop for output is from TEAM_MINYEAR to TEAM_MAXYEAR.

The nice thing is that the core of the program is changed very little.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

MGray000 · Posted 02-21-2017 07:16 PM

mkeintz,

You are brilliant! I again extend thanks ... not only for the solution, but also the insights. As someone who is self-taught in SAS, and very much an amateur, I've learned a great deal from your (and others') posts. With deep appreciation, cheers,

MGray000

Turnover stats - Unbalanced data

Re: Turnover stats - Unbalanced data

Re: Turnover stats - Unbalanced data

Re: Turnover stats - Unbalanced data

Click image to register for webinar

Classroom Training Available!