SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RoddyJ
Obsidian | Level 7

Hi all,

 

I have a dataset of the form:

 

ID    var1    var2    var3

1       1         3          11

1       2         4          12

1       2         5        NUL

2      1          5          13

2      2          5         NUL

3      1          5         NUL

3      1          4         NUL

 

I'd like to remove duplicate observations of ID based on multiple criteria.

For example: var3 should be non-null, var2 should be 5 and var1 should be 2. But if those criteria can't be met I want to keep the observation that best fits those criteria. e.g in the table given, I'd want to keep the 2nd observation with ID = 1 because var1 = 2 and var is non-null. I'd want to keep the first observation of ID=2 because var2 = 5 and var3 is non-null and I'd want to keep the first observation of ID=3 because var2 = 5.

 

Is this possible to solve with a SAS program, or is the original dataset too messy?

 

Thanks in advance for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @RoddyJ,

 

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

 

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

 

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

 

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

 

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

 

View solution in original post

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hi @RoddyJ,

 

You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:

proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;

(I assumed that the "NUL" values are missing values in your real data.)

 

Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.

 

Edit:

Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation whenever possible -- even if there was another observation satisfying both of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well. (Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "if last.id;" in the DATA step implements the selection rule as described.)

 

Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences abs(var2-5) and abs(var1-2) (or squared differences, ...) in the sorting rule so that, say, var2=5.1 would be regarded as a better match than var2=9.0, everything else being the same. You would need to decide about the priorities, e.g., if var1=2 (exact match in the third-priority criterion) should trump a close, but not exact match in var2 (5.03, say). The implementation in PROC SQL would be fairly easy.

 

Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.

priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing

 

RoddyJ
Obsidian | Level 7
Thanks for this reply, really well explained!

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1099 views
  • 2 likes
  • 2 in conversation