New SAS User

m5 · Posted 08-02-2024 09:33 PM

I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this:

ID	Month	Case	Control
1	10	0	1
2	8	0	1
2	9	1	0
3	3	1	0
4	7	0	1
4	11	1	0
5	3	0	1
5	6	0	1
5	8	0	1
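
To try the suggestions in this thread against the rows above, the sample can be read into a dataset. This is only a sketch: it assumes all four columns are numeric and uses the dataset name HAVE that the code in this thread refers to.

```sas
/* Sketch: the sample rows from the question as a dataset named HAVE. */
/* Assumes all four columns are numeric; rows are already in ID, Month order. */
data have;
   input ID Month Case Control;
   datalines;
1 10 0 1
2 8 0 1
2 9 1 0
3 3 1 0
4 7 0 1
4 11 1 0
5 3 0 1
5 6 0 1
5 8 0 1
;
run;
```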

The conditions for retaining the observation in the dataset are:

1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations

ID 2 month 9 would be retained
ID 4 month 11 would be retained and ID 4 month 7 would be removed

2) For IDs with multiple control observations, select the observation with the first/lowest month.

ID 5 month 3 would be retained

I know I can achieve criteria #2 with something like:

data want;
   set have;
   by ID Month;
   if first.ID;
run;

But how can I achieve #1?

Astounding · Posted 08-02-2024 10:16 PM

Just clarifying ....

Please confirm that ID 2 Month 8 should be removed.

Please confirm that there is no YEAR in the data. Just looking at MONTH is sufficient to determine which observation is earlier.

m5 · Posted 08-02-2024 10:28 PM

Correct, ID 2 month 8 should be removed. There is no year in the data, month is sufficient. Thank you!

Astounding · Posted 08-02-2024 11:45 PM

It appears you have a sorted order to your data: by ID month. Let's pull the observations in a slightly different order:

data want;
   set have (where=(case=1) in=have_case)
       have (where=(case=0) in=have_control);
   by id;
   retain any_case 'Y';
   * condition 1: drop control rows for any ID that has a case row;
   if first.id then do;
      if have_case then any_case='Y';
      else any_case='N';
   end;
   if have_control and any_case='Y' then delete;
   * condition 2: for control-only IDs, keep just the first control row;
   if have_control and not first.id then delete;
run;

The code is untested, since you have the data to work with. But it looks right (even if it's mildly cumbersome).

If there are multiple case records for an ID, this selects them all. I'm not sure if that is your intention or not. I'm interpreting the title vs. the description of the conditions as saying opposite things about that question.
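
One way to find out whether that question matters for your data is to list the IDs that have more than one case record. This is a rough, untested sketch; the output name MULTI_CASE is made up for illustration, and it assumes HAVE is sorted by ID.

```sas
/* Sketch: keep case rows only for IDs that have two or more case rows. */
/* MULTI_CASE is a hypothetical name; HAVE must be sorted by ID. */
data multi_case;
   set have (where=(case=1));
   by id;
   /* A row that is both first and last for its ID is that ID's only case row. */
   if first.id and last.id then delete;
run;
```

If MULTI_CASE comes out empty, every ID has at most one case record and the question is moot.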

Astounding · Posted 08-03-2024 03:54 PM

Here's another, simpler possibility. It's possible this is all you really want:

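data want;
   /* Interleave by ID: each ID's case rows are read before its control rows. */
   set have (where=(case=1))
       have (where=(case=0));
   by id;
   /* Keep only the first row read for each ID. */
   if first.id;
run;

For each ID, this keeps the first case record and deletes everything else. If an ID has no case records, it keeps the first control record, which is the lowest month as long as the data are sorted by ID and month.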
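
If the data turn out not to be sorted already, a similar result can probably be had by sorting first so that case rows lead within each ID. This is a sketch of a different route, not the code above; HAVE_SORTED and WANT_ALT are made-up names.

```sas
/* Sketch: within each ID, put case rows first, then ascending month. */
/* HAVE_SORTED and WANT_ALT are hypothetical names. */
proc sort data=have out=have_sorted;
   by id descending case month;
run;

data want_alt;
   set have_sorted;
   by id;
   /* First row per ID: a case row if the ID has one, else the lowest-month control. */
   if first.id;
run;
```

With the nine sample rows from the question, this should leave ID 1 month 10, ID 2 month 9, ID 3 month 3, ID 4 month 11, and ID 5 month 3.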

Select one observation per ID from multiple rows per ID based on criteria.