BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RoddyJ
Obsidian | Level 7

Hi all,

 

I have a dataset of the form:

 

ID    var1    var2    var3

1       1         3          11

1       2         4          12

1       2         5        NUL

2      1          5          13

2      2          5         NUL

3      1          5         NUL

3      1          4         NUL

 

I'd like to remove duplicate observations of ID based on multiple criteria.

For example: var3 should be non-null, var2 should be 5 and var1 should be 2. But if those criteria can't be met I want to keep the observation that best fits those criteria. e.g in the table given, I'd want to keep the 2nd observation with ID = 1 because var1 = 2 and var is non-null. I'd want to keep the first observation of ID=2 because var2 = 5 and var3 is non-null and I'd want to keep the first observation of ID=3 because var2 = 5.

 

Is this possible to solve with a SAS program, or is the original dataset too messy?

 

Thanks in advance for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @RoddyJ,

 

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

 

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

 

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

 

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

 

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

 

View solution in original post

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hi @RoddyJ,

 

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

 

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

 

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

 

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

 

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

 

RoddyJ
Obsidian | Level 7
Thanks for this reply, really well explained!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 879 views
  • 1 like
  • 2 in conversation