- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this:
ID | Month | Case | Control |
1 | 10 | 0 | 1 |
2 | 8 | 0 | 1 |
2 | 9 | 1 | 0 |
3 | 3 | 1 | 0 |
4 | 7 | 0 | 1 |
4 | 11 | 1 | 0 |
5 | 3 | 0 | 1 |
5 | 6 | 0 | 1 |
5 | 8 | 0 | 1 |
The conditions for retaining the observation in the dataset are:
1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations
- ID 2 month 9 would be retained
- ID 4 month 11 would be retained and ID 4 month 7 would be removed
2) For IDs with multiple control observations, select the observation with the first/lowest month.
- ID 5 month 3 would be retained
I know I can achieve criteria #2 with something like:
Data want; Set have;
By ID Month;
If first.ID; run;
But how can I achieve #1?
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's another, simpler possibility. It's possible this is all you really want:
data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;
It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just clarifying ....
Please confirm that ID 2 Month 8 should be removed.
Please confirm that there is no YEAR in the data. Just looking at MONTH is sufficient to determine which observation is earlier.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Correct, ID 2 month 8 should be removed. There is no year in the data, month is sufficient. Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It appears you have a sorted order to your data: by ID month. Let's pull the observations in a slightly different order:
data want;
set have (where=(case=1) in=have_case)
have (where=(case=0) in=have_control);
by id;
retain any_case 'Y';
*condition 1;
if first.id then do;
if have_case then any_case='Y';
else any_case='N';
end;
if have_control and any_case='Y' then delete;
*condition 2;
if have_control and not first.id then delete;
run;
The code is untested, since you have the data to work with. But it looks right (even if it's mildly cumbersome).
If there are multiple case records for an ID, this selects them all. I'm not sure if that is your intention or not. I'm interpreting the title vs. the description of the conditions as saying opposite things about that question.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's another, simpler possibility. It's possible this is all you really want:
data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;
It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.