SAS Programming

RoddyJ · Posted 04-06-2020 11:16 AM

Hi all,

I have a dataset of the form:

ID var1 var2 var3

1 1 3 11

1 2 4 12

1 2 5 NUL

2 1 5 13

2 2 5 NUL

3 1 5 NUL

3 1 4 NUL

I'd like to remove duplicate observations of ID based on multiple criteria.

For example: var3 should be non-null, var2 should be 5 and var1 should be 2. But if those criteria can't be met I want to keep the observation that best fits those criteria. e.g in the table given, I'd want to keep the 2nd observation with ID = 1 because var1 = 2 and var is non-null. I'd want to keep the first observation of ID=2 because var2 = 5 and var3 is non-null and I'd want to keep the first observation of ID=3 because var2 = 5.

Is this possible to solve with a SAS program, or is the original dataset too messy?

Thanks in advance for your help!

FreelanceReinh · Posted 04-06-2020 11:42 AM

Hi @RoddyJ,

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

View solution in original post

FreelanceReinh · Posted 04-06-2020 11:42 AM

Hi @RoddyJ,

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

RoddyJ · Posted 04-06-2020 02:38 PM

Thanks for this reply, really well explained!

SAS Programming

Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Follow Us

What is...

SAS Programming

Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Re: Remove duplicate observations based on multiple criteria

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...