BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
m5
Calcite | Level 5 m5
Calcite | Level 5

I have a large dataset with multiple rows per ID. I am trying to create a dataset with only one row per id (no duplicates), but on specific conditions. Data looks like this: 

IDMonthCaseControl
11001
2801
2910
3310
4701
41110
5301
5601
5801

The conditions for retaining the observation in the dataset are:

1) Select all cases (case=1). If the ID also has observations with control=1, remove those observations

  1. ID 2 month 9 would be retained
  2. ID 4 month 11 would be retained and ID 4 month 7 would be removed

2) For IDs with multiple control observations, select the observation with the first/lowest month.

  1. ID 5 month 3 would be retained

I know I can achieve criteria #2 with something like:

Data want; Set have;

By ID Month;

If first.ID; run;

But how can I achieve #1?

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Here's another, simpler possibility.  It's possible this is all you really want:

data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;

It keeps first case record (deleting all else).  If there are no case records, it keeps the first control record.

View solution in original post

4 REPLIES 4
Astounding
PROC Star

Just clarifying ....

Please confirm that ID 2 Month 8 should be removed.

Please confirm that there is no YEAR in the data.  Just looking at MONTH is sufficient to determine which observation is earlier.

m5
Calcite | Level 5 m5
Calcite | Level 5

Correct, ID 2 month 8 should be removed. There is no year in the data, month is sufficient. Thank you!

Astounding
PROC Star

It appears you have a sorted order to your data:  by ID month.  Let's pull the observations in a slightly different order:

data want;
set have (where=(case=1) in=have_case)
have (where=(case=0) in=have_control);
by id;
retain any_case 'Y';
*condition 1;
if first.id then do;
   if have_case then any_case='Y';
   else any_case='N';
end;
if have_control and any_case='Y' then delete;
*condition 2;
if have_control and not first.id then delete; 
run;

The code is untested, since you have the data to work with.  But it looks right (even if it's mildly cumbersome).

If there are multiple case records for an ID, this selects them all.  I'm not sure if that is your intention or not.  I'm interpreting the title vs. the description of the conditions as saying opposite things about that question.

Astounding
PROC Star

Here's another, simpler possibility.  It's possible this is all you really want:

data want;
set have (where=(case=1))
have (where=(case=0));
by id;
if first.id;
run;

It keeps first case record (deleting all else).  If there are no case records, it keeps the first control record.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1105 views
  • 0 likes
  • 2 in conversation