Hi all,
I have a dataset of the form:
ID var1 var2 var3
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
I'd like to remove duplicate observations of ID based on multiple criteria.
For example: var3 should be non-null, var2 should be 5 and var1 should be 2. But if those criteria can't be met I want to keep the observation that best fits those criteria. e.g in the table given, I'd want to keep the 2nd observation with ID = 1 because var1 = 2 and var is non-null. I'd want to keep the first observation of ID=2 because var2 = 5 and var3 is non-null and I'd want to keep the first observation of ID=3 because var2 = 5.
Is this possible to solve with a SAS program, or is the original dataset too messy?
Thanks in advance for your help!
Hi @RoddyJ,
You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:
proc format;
invalue nulmiss
'NUL' = .;
run;
data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;
proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;
data want;
set _tmp;
by id;
if last.id;
run;
(I assumed that the "NUL" values are missing values in your real data.)
Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.
Edit:
Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)
Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.
Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.
priority var1 var2 var3 1 2 5 non-missing 2 ≠2 5 non-missing 3 2 ≠5 non-missing 4 ≠2 ≠5 non-missing 5 2 5 missing 6 ≠2 5 missing 7 2 ≠5 missing 8 ≠2 ≠5 missing
Hi @RoddyJ,
You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:
proc format;
invalue nulmiss
'NUL' = .;
run;
data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;
proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;
data want;
set _tmp;
by id;
if last.id;
run;
(I assumed that the "NUL" values are missing values in your real data.)
Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.
Edit:
Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)
Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.
Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.
priority var1 var2 var3 1 2 5 non-missing 2 ≠2 5 non-missing 3 2 ≠5 non-missing 4 ≠2 ≠5 non-missing 5 2 5 missing 6 ≠2 5 missing 7 2 ≠5 missing 8 ≠2 ≠5 missing
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.