I have one large dataset where I need to filter out observations that do not abide by the 4 rules below:
1- Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.
2- Keep observations where different event1 is linked to the same event2 and has the same ID_1.
3-Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.
4-Keep observations where same event1 matches two different event2 and has same ID_2.
Below is a small example of dataset with 1 example for each rule:
event1 | event2 | ID_1 | ID_2 | |
RSS1 | OWT1 | 423 | 5 | |
RSS2 | OWT1 | 423 | 5 | |
RSS3 | OWT1 | 423 | 5 | |
RSS1 | OWT2 | 423 | 5 | |
RSS2 | OWT2 | 423 | 5 | |
RSS3 | OWT2 | 423 | 5 | |
RSS10 | OWT2 | . | 5 | do not keep/delete |
RSS4 | OWT3 | 12 | . | |
RSS7 | OWT3 | 12 | . | |
RSS11 | OWT3 | . | . | do not keep/delete |
RSS13 | OWT12 | . | . | |
RSS9 | OWT5 | . | 14 | |
RSS9 | OWT8 | . | 14 | |
RSS17 | OWT8 | . | 14 | do not keep/delete |
Below is the final desired dataset. I would like to create a New_ID showing distinct links based off the rules.
event1 | event2 | ID_1 | ID_2 | New_ID |
RSS1 | OWT1 | 423 | 5 | 1 |
RSS2 | OWT1 | 423 | 5 | 1 |
RSS3 | OWT1 | 423 | 5 | 1 |
RSS1 | OWT2 | 423 | 5 | 1 |
RSS2 | OWT2 | 423 | 5 | 1 |
RSS3 | OWT2 | 423 | 5 | 1 |
RSS4 | OWT3 | 12 | . | 2 |
RSS7 | OWT3 | 12 | . | 2 |
RSS13 | OWT12 | . | . | 3 |
RSS9 | OWT5 | . | 14 | 4 |
RSS9 | OWT8 | . | 14 | 4 |
How can I go about programming this with loops and sql while maintaining all the rules in one dataset?
Rule 1 - Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.
event1 | event2 | ID_1 | ID_2 | |
RSS1 | OWT1 | 423 | 5 | |
RSS2 | OWT1 | 423 | 5 | |
RSS3 | OWT1 | 423 | 5 | |
RSS1 | OWT2 | 423 | 5 | |
RSS2 | OWT2 | 423 | 5 | |
RSS3 | OWT2 | 423 | 5 | |
RSS10 | OWT2 | . | 5 | do not keep/delete |
Rule 2 - Keep observations where different event1 is linked to the same event2 and has the same ID_1.
event1 | event2 | ID_1 | ID_2 | |
RSS4 | OWT3 | 12 | . | |
RSS7 | OWT3 | 12 | . | |
RSS11 | OWT3 | . | . | do not keep/delete |
Rule 3 -Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.
event1 | event2 | ID_1 | ID_2 |
RSS13 | OWT12 | . | . |
Rule 4 - Keep observations where same event1 matches two different event2 and has same ID_2.
event1 | event2 | ID_1 | ID_2 | |
RSS9 | OWT5 | . | 14 | |
RSS9 | OWT8 | . | 14 | |
RSS17 | OWT8 | . | 14 | do not keep/delete |
Sorry, but I have hard time to understand the rules given under each bullet, what does "linked" and "same" means in this data context?
Best would be to make examples under each bullet (1-4).
Using these two examples to explain "same" and "linked".
Each row is a "link". event1 is linked to event2.
Same means the value in event1/event2/ID_1/ID_2 occurs more than once.
Rule 3 -Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.
RSS13 is linked to OWT12. RSS13 only occurs one time in the combined dataset above.
event1 | event2 | ID_1 | ID_2 |
RSS13 | OWT12 | . | . |
Rule 4 - Keep observations where same event1 matches two different event2 and has same ID_2.
RSS9 is linked to both OWT5 and OWT8 and has ID_2 = 14 for both links.
RSS17 is also linked to OWT8 and has ID_2 = 14 for this one link.
"Same event1" means RSS9 appears twice in event1 and is linked to two different event2 (OWT5 and OWT8) whereas RSS17 only appears one time.
These three links all have the same ID_2 (14) but do not keep RSS17 - OWT8 link because it doesn't the follow the rule where there has to be the same event matched to 2 different event2. RSS17 is only matched to a single event2.
event1 | event2 | ID_1 | ID_2 | |
RSS9 | OWT5 | . | 14 | |
RSS9 | OWT8 | . | 14 | |
RSS17 | OWT8 | . | 14 | do not keep/delete |
Hope this is more clear.
How about providing your sample data in the form of a working data step? You will be more likely to get actual tested suggestions.
Help us help you.
data have;
infile cards expandtabs;
input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;
Inner join doesn't work to retain the first 6 observations. How do I get rid of RSS10 linked with OWT2 ?
proc sql;
create table rule1 as
select a.*
from have a inner join have b
on a.ID_1=b.ID_1 and a.ID_2=b.ID_2
where a.event1 NE b.event2;
run;
Desired output:
Rule 1 - Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.
event1 | event2 | ID_1 | ID_2 | |
RSS1 | OWT1 | 423 | 5 | |
RSS2 | OWT1 | 423 | 5 | |
RSS3 | OWT1 | 423 | 5 | |
RSS1 | OWT2 | 423 | 5 | |
RSS2 | OWT2 | 423 | 5 | |
RSS3 | OWT2 | 423 | 5 | |
RSS10 | OWT2 | . | 5 | do not keep/delete |
Analyze the code and results from execution with the code below:
data have;
infile cards expandtabs;
input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;
data rule1assist;
set have;
rule1assist = catx('-', trim(ID_1), trim(ID_2));
run;
proc sort data=rule1assist;
by rule1assist;
run;
proc sort data=rule1assist out=rule1tmp(keep=rule1assist);
by rule1assist;
run;
data rule1tmp;
set rule1tmp;
if first.rule1assist eq 1 AND last.rule1assist eq 1 then delete;
by rule1assist;
run;
proc sort data=rule1tmp nodup;
by rule1assist;
run;
proc sql;
create table rule1 as
select event1, event2, ID_1, ID_2, a.rule1assist
from rule1assist a, rule1tmp b
where a.rule1assist eq b.rule1assist
;
quit;
data rule2assist;
set rule1;
rule2assist = catx('-', trim(event2), trim(ID_1));
run;
proc sort data=rule2assist;
by rule2assist;
run;
proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
by rule2assist;
run;
data rule2tmp;
set rule2tmp;
if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
by rule2assist;
run;
proc sort data=rule2tmp nodup;
by rule2assist;
run;
proc sql;
create table rule2 as
select event1, event2, ID_1, ID_2,
a.rule1assist, a.rule2assist
from rule2assist a, rule2tmp b
where a.rule2assist eq b.rule2assist
;
quit;
This seems to implement the rule 1 correctly but not rule 2. Because after applying the rule 2 deletions your rule 3 cannot be applied anymore. So I suspect I am not applying the correct programming logic for rule 2. The code for rule 1 could work for you though.
Thank you for your help. I've adapted your code to generate 4 different sets of code (1 for each rule) and then used a final data step to combine the data.
I am struggling with Rule 3. I would like to keep any observations that have distinct event1 and event2 and has missing ID_1 and ID_2. My code keeps 2 rows of data and I only want the row with RSS13 OWT12. How can I modify the code for rule 3 for desired output? See code below:
data have;
infile cards expandtabs;
input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;
/* Code for Keeping observations that meet Rule 1*/
data rule1assist;
set have;
rule1assist = catx('-', trim(ID_1), trim(ID_2));
run;
proc sort data=rule1assist;
by rule1assist;
run;
proc sort data=rule1assist out=rule1tmp(keep=rule1assist);
by rule1assist;
run;
data rule1tmp;
set rule1tmp;
if first.rule1assist eq 1 AND last.rule1assist eq 1 then delete;
by rule1assist;
run;
proc sort data=rule1tmp nodup;
by rule1assist;
run;
proc sql;
create table rule1 as
select event1, event2, ID_1, ID_2, a.rule1assist
from rule1assist a, rule1tmp b
where a.rule1assist eq b.rule1assist and ID_1 ne " " and ID_2 ne " "
;
quit;
/*Code for Keeping observations that meet Rule 2*/
data rule2assist;
set have;
rule2assist = catx('-', trim(event2), trim(ID_1));
run;
proc sort data=rule2assist;
by rule2assist;
run;
proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
by rule2assist;
run;
data rule2tmp;
set rule2tmp;
if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
by rule2assist;
run;
proc sort data=rule2tmp nodup;
by rule2assist;
run;
proc sql;
create table rule2 as
select event1, event2, ID_1, ID_2, a.rule2assist
from rule2assist a, rule2tmp b
where a.rule2assist eq b.rule2assist and ID_2 eq " "
;
quit;
/*Code for Keeping observations that meet Rule 3...this doesn't work*/
data rule3assist;
set have;
rule3assist = catx('-', trim(event1), trim(event2),trim(ID_1), trim(ID_2));
run;
proc sort data=rule3assist;
by rule3assist;
run;
proc sort data=rule3assist out=rule3tmp(keep=rule3assist);
by rule3assist;
run;
data rule3tmp;
set rule3tmp;
if first.rule3assist eq 1 AND last.rule3assist eq 1 ;
by rule3assist;
run;
proc sort data=rule3tmp nodup;
by rule3assist;
run;
proc sql;
create table rule3 as
select distinct *
from rule3assist a, rule3tmp b
where a.rule3assist eq b.rule3assist and ID_2 eq " " and ID_1 eq " "
;
quit;
/*Code for Keeping observations that meet Rule 4*/
data rule4assist;
set have;
rule4assist = catx('-', trim(event1), trim(ID_2));
run;
proc sort data=rule4assist;
by rule4assist;
run;
proc sort data=rule4assist out=rule4tmp(keep=rule4assist);
by rule4assist;
run;
data rule4tmp;
set rule4tmp;
if first.rule4assist eq 1 AND last.rule4assist eq 1 then delete;
by rule4assist;
run;
proc sort data=rule4tmp nodup;
by rule4assist;
run;
proc sql;
create table rule4 as
select event1, event2, ID_1, ID_2, a.rule4assist
from rule4assist a, rule4tmp b
where a.rule4assist eq b.rule4assist and ID_1 eq " "
;
quit;
data want;
set rule1 rule2 rule3 rule4;
run;
/*Code for Keeping observations that meet Rule 2*/
data rule2assist;
set have;
rule2assist = catx('-', trim(event2), trim(ID_1));
run;
proc sort data=rule2assist;
by rule2assist;
run;
proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
by rule2assist;
run;
data rule2tmp;
set rule2tmp;
if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
by rule2assist;
run;
proc sort data=rule2tmp nodup;
by rule2assist;
run;
proc sql;
create table rule2 as
select event1, event2, ID_1, ID_2, a.rule2assist
from rule2assist a, rule2tmp b
where a.rule2assist eq b.rule2assist and ID_2 eq " "
;
quit;
/*Code for Keeping observations that meet ...this doesn't work*/
data rule3assist;
set have;
rule3assist = catx('-', trim(event1), trim(event2),trim(ID_1), trim(ID_2));
run;
proc sort data=rule3assist;
by rule3assist;
run;
proc sort data=rule3assist out=rule3tmp(keep=rule3assist);
by rule3assist;
run;
data rule3tmp;
set rule3tmp;
if first.rule3assist eq 1 AND last.rule3assist eq 1 ;
by rule3assist;
run;
proc sort data=rule3tmp nodup;
by rule3assist;
run;
proc sql;
create table rule3 as
select distinct*
from rule3assist a, rule3tmp b
group by a.event2
where a.rule3assist eq b.rule3assist and ID_2 eq " " and ID_1 eq " "
;
quit;
/*Code for Keeping observations that meet Rule 4*/
data rule4assist;
set have;
rule4assist = catx('-', trim(event1), trim(ID_2));
run;
proc sort data=rule4assist;
by rule4assist;
run;
proc sort data=rule4assist out=rule4tmp(keep=rule4assist);
by rule4assist;
run;
data rule4tmp;
set rule4tmp;
if first.rule4assist eq 1 AND last.rule4assist eq 1 then delete;
by rule4assist;
run;
proc sort data=rule4tmp nodup;
by rule4assist;
run;
proc sql;
create table rule4 as
select event1, event2, ID_1, ID_2, a.rule4assist
from rule4assist a, rule4tmp b
where a.rule4assist eq b.rule4assist and ID_1 eq " "
;
quit;
data want;
set rule1 rule2 rule3 rule4
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.