I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this:
ID | Month | Case | Control |
1 | 10 | 0 | 1 |
2 | 8 | 0 | 1 |
2 | 9 | 1 | 0 |
3 | 3 | 1 | 0 |
4 | 7 | 0 | 1 |
4 | 11 | 1 | 0 |
5 | 3 | 0 | 1 |
5 | 6 | 0 | 1 |
5 | 8 | 0 | 1 |
The conditions for retaining the observation in the dataset are:
1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations
2) For IDs with multiple control observations, select the observation with the first/lowest month.
I know I can achieve criteria #2 with something like:
Data want; Set have;
By ID Month;
If first.ID; run;
But how can I achieve #1?
Here's another, simpler possibility. It's possible this is all you really want:
data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;
It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.
Just clarifying ....
Please confirm that ID 2 Month 8 should be removed.
Please confirm that there is no YEAR in the data. Just looking at MONTH is sufficient to determine which observation is earlier.
Correct, ID 2 month 8 should be removed. There is no year in the data, month is sufficient. Thank you!
It appears you have a sorted order to your data: by ID month. Let's pull the observations in a slightly different order:
data want;
set have (where=(case=1) in=have_case)
have (where=(case=0) in=have_control);
by id;
retain any_case 'Y';
*condition 1;
if first.id then do;
if have_case then any_case='Y';
else any_case='N';
end;
if have_control and any_case='Y' then delete;
*condition 2;
if have_control and not first.id then delete;
run;
The code is untested, since you have the data to work with. But it looks right (even if it's mildly cumbersome).
If there are multiple case records for an ID, this selects them all. I'm not sure if that is your intention or not. I'm interpreting the title vs. the description of the conditions as saying opposite things about that question.
Here's another, simpler possibility. It's possible this is all you really want:
data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;
It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.