I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this:
ID | Month | Case | Control |
1 | 10 | 0 | 1 |
2 | 8 | 0 | 1 |
2 | 9 | 1 | 0 |
3 | 3 | 1 | 0 |
4 | 7 | 0 | 1 |
4 | 11 | 1 | 0 |
5 | 3 | 0 | 1 |
5 | 6 | 0 | 1 |
5 | 8 | 0 | 1 |
The conditions for retaining the observation in the dataset are:
1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations
- ID 2 month 9 would be retained
- ID 4 month 11 would be retained and ID 4 month 7 would be removed
2) For IDs with multiple control observations, select the observation with the first/lowest month.
- ID 5 month 3 would be retained
I know I can achieve criteria #2 with something like:
Data want; Set have;
By ID Month;
If first.ID; run;
But how can I achieve #1?