Solved: Re: Select one observation per ID from multiple rows per ID based on c...

m5 · Posted 08-02-2024 09:33 PM

I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this:

ID	Month	Case	Control
1	10	0	1
2	8	0	1
2	9	1	0
3	3	1	0
4	7	0	1
4	11	1	0
5	3	0	1
5	6	0	1
5	8	0	1

The conditions for retaining the observation in the dataset are:

1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations

ID 2 month 9 would be retained
ID 4 month 11 would be retained and ID 4 month 7 would be removed

2) For IDs with multiple control observations, select the observation with the first/lowest month.

ID 5 month 3 would be retained

I know I can achieve criteria #2 with something like:

Data want; Set have;

By ID Month;

If first.ID; run;

But how can I achieve #1?

Astounding · Posted 08-03-2024 03:54 PM

Here's another, simpler possibility. It's possible this is all you really want:

data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;

It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.

View solution in original post

Astounding · Posted 08-02-2024 10:16 PM

Just clarifying ....

Please confirm that ID 2 Month 8 should be removed.

Please confirm that there is no YEAR in the data. Just looking at MONTH is sufficient to determine which observation is earlier.

m5 · Posted 08-02-2024 10:28 PM

Correct, ID 2 month 8 should be removed. There is no year in the data, month is sufficient. Thank you!

Astounding · Posted 08-02-2024 11:45 PM

It appears you have a sorted order to your data: by ID month. Let's pull the observations in a slightly different order:

data want;
set have (where=(case=1) in=have_case)
have (where=(case=0) in=have_control);
by id;
retain any_case 'Y';
*condition 1;
if first.id then do;
   if have_case then any_case='Y';
   else any_case='N';
end;
if have_control and any_case='Y' then delete;
*condition 2;
if have_control and not first.id then delete; 
run;

The code is untested, since you have the data to work with. But it looks right (even if it's mildly cumbersome).

If there are multiple case records for an ID, this selects them all. I'm not sure if that is your intention or not. I'm interpreting the title vs. the description of the conditions as saying opposite things about that question.

Astounding · Posted 08-03-2024 03:54 PM

Here's another, simpler possibility. It's possible this is all you really want:

data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;

It keeps first case record (deleting all else). If there are no case records, it keeps the first control record.

Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Re: Select one observation per ID from multiple rows per ID based on criteria.

Registration is open