Hello,
I am working with a dataset that tracks entry and exit dates per individual, with each individual having a unique identifying number. In most instances, the individual only has a single record, however several individuals have multiple entry and exit dates. I am trying to calculate the number of days each individual is in the program by adding the difference between each release and entry date. The data looks like this:
casenum | entry_date | release_date | counter |
11111 | 1/2/2019 | 1/3/2019 | 1 |
11111 | 1/17/2019 | 1/21/2019 | 2 |
11112 | 1/1/2019 | 1/6/2019 | 1 |
11113 | 1/2/2019 | 1/5/2019 | 1 |
11114 | 1/1/2019 | 1/3/2019 | 1 |
11114 | 1/9/2019 | 1/15/2019 | 2 |
11114 | 1/17/2019 | 1/18/2019 | 3 |
11115 | 1/3/2019 | 1/5/2019 | 1 |
11116 | 1/2/2019 | 1/17/2019 | 1 |
11117 | 1/2/2019 | 1/6/2019 | 1 |
11117 | 1/18/2019 | 1/11/2019 | 2 |
11117 | 1/15/2019 | 1/27/2019 | 3 |
11117 | 2/5/2019 | 2/14/2019 | 4 |
11118 | 1/2/2019 | 2/1/2019 | 1 |
11119 | 1/1/2019 | 1/29/2019 | 1 |
11120 | 1/1/2019 | 1/5/2019 | 1 |
11120 | 1/1/2019 | 1/17/2019 | 2 |
11120 | 1/1/2019 | 1/22/2019 | 3 |
I was given the variables casenum, entry_date, and release_date. The counter variable was something I created. As I mentioned earlier, I need to calculate the number of days between each entry and release date per person. However, the snag I have run into is that I need to exclude cases like 11120, where a single entry date has multiple release dates; these need to be left as missing when I calculate the time duration. I would prefer to write the program such that the program drops ALL observations of the case number 11120 based on the existence of duplicate entry dates, rather than by using the precise case number (I am trying to limit my hard coding to avoid having to update it every time we receive new data). Is there an efficient way to go about 1) dropping the observations that need to be dropped and 2) adding the total time for the rest? Any advice is appreciated!