Edited: Changed the required output data for id 2, which had the wrong start and end levels.
Hi all,
My company has a system of levels for probationary recruits that everyone has to pass through by completing certain tasks. They are usually given a period of 4 months to go through each level. If somebody has some relevant experience, they can start at a higher level. They are also allowed to change up the sequence in which they go through the levels, meaning they can are allowed to do a higher level task and then come back and finish the rest. If the person moves to another department, they repeat this.
The data looks as follows:
id | Month/Year | Start Month/Year | End Month/Year | Dept. | Level |
1 | Jan-17 | Jan-17 | Sep-17 | Acct. | 1 |
1 | May-17 | Jan-17 | Sep-17 | Acct. | 2 |
1 | Sep-17 | Jan-17 | Sep-17 | Acct. | 3 |
1 | Jan-18 | Jan-17 | Sep-17 | Acct. | 4 |
2 | Feb-19 | Feb-19 | Oct-19 | Finance | 2 |
2 | Jun-19 | Feb-19 | Oct-19 | Finance | 3 |
2 | Oct-19 | Feb-19 | Oct-19 | Finance | 4 |
3 | Sep-18 | Sep-18 | Sep-19 | HR | 1 |
3 | Jan-19 | Sep-18 | Sep-19 | HR | 2 |
3 | May-19 | Sep-18 | Sep-19 | HR | 4 |
3 | Sep-19 | Sep-18 | Sep-19 | HR | 3 |
1 | Apr-20 | Apr-20 | Mar-21 | SCM | 1 |
1 | Aug-20 | Apr-20 | Mar-21 | SCM | 2 |
1 | Nov-20 | Apr-20 | Mar-21 | SCM | 3 |
1 | Mar-21 | Apr-20 | Mar-21 | SCM | 4 |
I need to add two columns that give me the start and end levels each person took by dept. So, what I need is as below:
id | Month/Year | Start Month/Year | End Month/Year | Dept. | Level | Start level | End level |
1 | Jan-17 | Jan-17 | Sep-17 | Acct. | 1 | 1 | 4 |
1 | May-17 | Jan-17 | Sep-17 | Acct. | 2 | 1 | 4 |
1 | Sep-17 | Jan-17 | Sep-17 | Acct. | 3 | 1 | 4 |
1 | Jan-18 | Jan-17 | Sep-17 | Acct. | 4 | 1 | 4 |
2 | Feb-19 | Feb-19 | Oct-19 | Finance | 2 | 2 | 4 |
2 | Jun-19 | Feb-19 | Oct-19 | Finance | 3 | 2 | 4 |
2 | Oct-19 | Feb-19 | Oct-19 | Finance | 4 | 2 | 4 |
3 | Sep-18 | Sep-18 | Sep-19 | HR | 1 | 1 | 3 |
3 | Jan-19 | Sep-18 | Sep-19 | HR | 2 | 1 | 3 |
3 | May-19 | Sep-18 | Sep-19 | HR | 4 | 1 | 3 |
3 | Sep-19 | Sep-18 | Sep-19 | HR | 3 | 1 | 3 |
1 | Apr-20 | Apr-20 | Mar-21 | SCM | 1 | 1 | 2 |
1 | Aug-20 | Apr-20 | Mar-21 | SCM | 4 | 1 | 2 |
1 | Nov-20 | Apr-20 | Mar-21 | SCM | 3 | 1 | 2 |
1 | Mar-21 | Apr-20 | Mar-21 | SCM | 2 | 1 | 2 |
I would appreciate it if someone could please help me with this. Thank you in advance for all the help.
Do something like this
data have;
input id $ MonthYear $ StartMonthYear $ EndMonthYear $ Dept $ Level;
datalines;
1 Jan-17 Jan-17 Sep-17 Acct. 1
1 May-17 Jan-17 Sep-17 Acct. 2
1 Sep-17 Jan-17 Sep-17 Acct. 3
1 Jan-18 Jan-17 Sep-17 Acct. 4
2 Feb-19 Feb-19 Oct-19 Finance 2
2 Jun-19 Feb-19 Oct-19 Finance 3
2 Oct-19 Feb-19 Oct-19 Finance 4
3 Sep-18 Sep-18 Sep-19 HR 1
3 Jan-19 Sep-18 Sep-19 HR 2
3 May-19 Sep-18 Sep-19 HR 4
3 Sep-19 Sep-18 Sep-19 HR 3
1 Apr-20 Apr-20 Mar-21 SCM 1
1 Aug-20 Apr-20 Mar-21 SCM 2
1 Nov-20 Apr-20 Mar-21 SCM 3
1 Mar-21 Apr-20 Mar-21 SCM 4
;
data want;
do _N_ = 1 by 1 until (last.id);
set have;
by id notsorted;
if _N_ = 1 then start = Level;
end;
end = Level;
do _N_ = 1 to _N_;
set have;
output;
end;
run;
ID 2 has a sequence of 2, 3, 4. So why does it have start level 3 and end level 2?
Assuming that is an error, then here's a technique to pass through each id twice. The first pass determines the start and end levels. the second pass re-reads the id and outputs each obs with the recently determined start and end levels:
data want;
set have (in=firstpass) have (in=secondpass);
by id;
retain start_level end_level;
if first.id then start_level=level;
if in=firstpass then end_level=level;
if secondpass;
run;
The by statement create the dummy variables first.id and last.id, providing a way to know when an ID is beginning or ending. So when FIRST.ID is true, then you have start_level.
Since the HAVE dataset is specified twice in the SET statement, the presence of the BY ID statement causes the data to be interleaved - i.e. each id is completely read twice before progressing to the next id.
The retain tells SAS not to reset the variables to missing.
And the "if in=firstpass" keeps update the end_level as long as your are on the first pass.
Finally, there is a subsetting if: "if secondpass;"
The firstpass dummy is tested
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.