Data Cleaning - remove events based off linkage rules between 2 Events...

r-lee-484 · Posted 08-14-2023 01:07 PM

I have one large dataset where I need to filter out observations that do not abide by the 4 rules below:

1- Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.

2- Keep observations where different event1 is linked to the same event2 and has the same ID_1.

3-Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.

4-Keep observations where same event1 matches two different event2 and has same ID_2.

Below is a small example of dataset with 1 example for each rule:

event1	event2	ID_1	ID_2
RSS1	OWT1	423	5
RSS2	OWT1	423	5
RSS3	OWT1	423	5
RSS1	OWT2	423	5
RSS2	OWT2	423	5
RSS3	OWT2	423	5
RSS10	OWT2	.	5	do not keep/delete
RSS4	OWT3	12	.
RSS7	OWT3	12	.
RSS11	OWT3	.	.	do not keep/delete
RSS13	OWT12	.	.
RSS9	OWT5	.	14
RSS9	OWT8	.	14
RSS17	OWT8	.	14	do not keep/delete

Below is the final desired dataset. I would like to create a New_ID showing distinct links based off the rules.

event1	event2	ID_1	ID_2	New_ID
RSS1	OWT1	423	5	1
RSS2	OWT1	423	5	1
RSS3	OWT1	423	5	1
RSS1	OWT2	423	5	1
RSS2	OWT2	423	5	1
RSS3	OWT2	423	5	1
RSS4	OWT3	12	.	2
RSS7	OWT3	12	.	2
RSS13	OWT12	.	.	3
RSS9	OWT5	.	14	4
RSS9	OWT8	.	14	4

How can I go about programming this with loops and sql while maintaining all the rules in one dataset?

JosvanderVelden · Posted 08-15-2023 06:25 AM

Can you please specify for each observation the rule or rules that is/are involved in keeping the observation?

r-lee-484 · Posted 08-15-2023 09:07 AM

Rule 1 - Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.

event1	event2	ID_1	ID_2
RSS1	OWT1	423	5
RSS2	OWT1	423	5
RSS3	OWT1	423	5
RSS1	OWT2	423	5
RSS2	OWT2	423	5
RSS3	OWT2	423	5
RSS10	OWT2	.	5	do not keep/delete

Rule 2 - Keep observations where different event1 is linked to the same event2 and has the same ID_1.

event1	event2	ID_1	ID_2
RSS4	OWT3	12	.
RSS7	OWT3	12	.
RSS11	OWT3	.	.	do not keep/delete

Rule 3 -Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.

event1	event2	ID_1	ID_2
RSS13	OWT12	.	.

Rule 4 - Keep observations where same event1 matches two different event2 and has same ID_2.

event1	event2	ID_1	ID_2
RSS9	OWT5	.	14
RSS9	OWT8	.	14
RSS17	OWT8	.	14	do not keep/delete

LinusH · Posted 08-15-2023 08:39 AM

Sorry, but I have hard time to understand the rules given under each bullet, what does "linked" and "same" means in this data context?

Best would be to make examples under each bullet (1-4).

Data never sleeps

r-lee-484 · Posted 08-15-2023 09:24 AM

Using these two examples to explain "same" and "linked".

Each row is a "link". event1 is linked to event2.

Same means the value in event1/event2/ID_1/ID_2 occurs more than once.

Rule 3 -Keep observations where event1 is linked to event2 one time and both ID_1 and ID_2 is missing.

RSS13 is linked to OWT12. RSS13 only occurs one time in the combined dataset above.

event1	event2	ID_1	ID_2
RSS13	OWT12	.	.

Rule 4 - Keep observations where same event1 matches two different event2 and has same ID_2.

RSS9 is linked to both OWT5 and OWT8 and has ID_2 = 14 for both links.

RSS17 is also linked to OWT8 and has ID_2 = 14 for this one link.

"Same event1" means RSS9 appears twice in event1 and is linked to two different event2 (OWT5 and OWT8) whereas RSS17 only appears one time.

These three links all have the same ID_2 (14) but do not keep RSS17 - OWT8 link because it doesn't the follow the rule where there has to be the same event matched to 2 different event2. RSS17 is only matched to a single event2.

event1	event2	ID_1	ID_2
RSS9	OWT5	.	14
RSS9	OWT8	.	14
RSS17	OWT8	.	14	do not keep/delete

Hope this is more clear.

mkeintz · Posted 08-15-2023 09:46 AM

How about providing your sample data in the form of a working data step? You will be more likely to get actual tested suggestions.

Help us help you.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

r-lee-484 · Posted 08-15-2023 02:24 PM

data have;
infile cards expandtabs;
input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;

Inner join doesn't work to retain the first 6 observations. How do I get rid of RSS10 linked with OWT2 ?

proc sql;
create table rule1 as
select a.*
from have a inner join have b
on a.ID_1=b.ID_1 and a.ID_2=b.ID_2
where a.event1 NE b.event2;
run;

Desired output:

Rule 1 - Keep observations where event1 is linked to event2 and have the same ID_1 and same ID_2.

event1	event2	ID_1	ID_2
RSS1	OWT1	423	5
RSS2	OWT1	423	5
RSS3	OWT1	423	5
RSS1	OWT2	423	5
RSS2	OWT2	423	5
RSS3	OWT2	423	5
RSS10	OWT2	.	5	do not keep/delete

JosvanderVelden · Posted 08-16-2023 05:47 AM

Analyze the code and results from execution with the code below:

data have;
   infile cards expandtabs;
   input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;

data rule1assist;
   set have;
   rule1assist = catx('-', trim(ID_1), trim(ID_2));
run;

proc sort data=rule1assist;
   by rule1assist;
run;

proc sort data=rule1assist out=rule1tmp(keep=rule1assist);
   by rule1assist;
run;

data rule1tmp;
   set rule1tmp;
   if first.rule1assist eq 1 AND last.rule1assist eq 1 then delete;
   by rule1assist;
run;

proc sort data=rule1tmp nodup;
   by rule1assist;
run;

proc sql;
   create table rule1 as
   select event1, event2, ID_1, ID_2, a.rule1assist
   from rule1assist a, rule1tmp b
   where a.rule1assist eq b.rule1assist
   ;
quit;

data rule2assist;
   set rule1;
   rule2assist = catx('-', trim(event2), trim(ID_1));
run;

proc sort data=rule2assist;
   by rule2assist;
run;

proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
   by rule2assist;
run;

data rule2tmp;
   set rule2tmp;
   if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
   by rule2assist;
run;

proc sort data=rule2tmp nodup;
   by rule2assist;
run;

proc sql;
   create table rule2 as
   select event1, event2, ID_1, ID_2, 
          a.rule1assist, a.rule2assist
   from rule2assist a, rule2tmp b
   where a.rule2assist eq b.rule2assist
   ;
quit;

This seems to implement the rule 1 correctly but not rule 2. Because after applying the rule 2 deletions your rule 3 cannot be applied anymore. So I suspect I am not applying the correct programming logic for rule 2. The code for rule 1 could work for you though.

r-lee-484 · Posted 08-17-2023 10:51 AM

Thank you for your help. I've adapted your code to generate 4 different sets of code (1 for each rule) and then used a final data step to combine the data.

I am struggling with Rule 3. I would like to keep any observations that have distinct event1 and event2 and has missing ID_1 and ID_2. My code keeps 2 rows of data and I only want the row with RSS13 OWT12. How can I modify the code for rule 3 for desired output? See code below:

data have;
infile cards expandtabs;
input (event1 event2 ID_1 ID_2) ($);
cards;
RSS1 OWT1 423 5
RSS2 OWT1 423 5
RSS3 OWT1 423 5
RSS1 OWT2 423 5
RSS2 OWT2 423 5
RSS3 OWT2 423 5
RSS10 OWT2 . 5
RSS4 OWT3 12 .
RSS7 OWT3 12 .
RSS11 OWT3 . .
RSS13 OWT12 . .
RSS9 OWT5 . 14
RSS9 OWT8 . 14
RSS17 OWT8 . 14
;

/* Code for Keeping observations that meet Rule 1*/
data rule1assist;
set have;
rule1assist = catx('-', trim(ID_1), trim(ID_2));
run;

proc sort data=rule1assist;
by rule1assist;
run;

proc sort data=rule1assist out=rule1tmp(keep=rule1assist);
by rule1assist;
run;

data rule1tmp;
set rule1tmp;
if first.rule1assist eq 1 AND last.rule1assist eq 1 then delete;
by rule1assist;
run;

proc sort data=rule1tmp nodup;
by rule1assist;
run;

proc sql;
create table rule1 as
select event1, event2, ID_1, ID_2, a.rule1assist
from rule1assist a, rule1tmp b
where a.rule1assist eq b.rule1assist and ID_1 ne " " and ID_2 ne " "
;
quit;

/*Code for Keeping observations that meet Rule 2*/
data rule2assist;
set have;
rule2assist = catx('-', trim(event2), trim(ID_1));
run;

proc sort data=rule2assist;
by rule2assist;
run;

proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
by rule2assist;
run;

data rule2tmp;
set rule2tmp;
if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
by rule2assist;
run;

proc sort data=rule2tmp nodup;
by rule2assist;
run;

proc sql;
create table rule2 as
select event1, event2, ID_1, ID_2, a.rule2assist
from rule2assist a, rule2tmp b
where a.rule2assist eq b.rule2assist and ID_2 eq " "
;
quit;

/*Code for Keeping observations that meet Rule 3...this doesn't work*/

data rule3assist;
set have;
rule3assist = catx('-', trim(event1), trim(event2),trim(ID_1), trim(ID_2));
run;

proc sort data=rule3assist;
by rule3assist;
run;

proc sort data=rule3assist out=rule3tmp(keep=rule3assist);
by rule3assist;
run;

data rule3tmp;
set rule3tmp;
if first.rule3assist eq 1 AND last.rule3assist eq 1 ;
by rule3assist;
run;

proc sort data=rule3tmp nodup;
by rule3assist;
run;

proc sql;
create table rule3 as
select distinct *
from rule3assist a, rule3tmp b
where a.rule3assist eq b.rule3assist and ID_2 eq " " and ID_1 eq " "
;
quit;

/*Code for Keeping observations that meet Rule 4*/

data rule4assist;
set have;
rule4assist = catx('-', trim(event1), trim(ID_2));
run;

proc sort data=rule4assist;
by rule4assist;
run;

proc sort data=rule4assist out=rule4tmp(keep=rule4assist);
by rule4assist;
run;

data rule4tmp;
set rule4tmp;
if first.rule4assist eq 1 AND last.rule4assist eq 1 then delete;
by rule4assist;
run;

proc sort data=rule4tmp nodup;
by rule4assist;
run;

proc sql;
create table rule4 as
select event1, event2, ID_1, ID_2, a.rule4assist
from rule4assist a, rule4tmp b
where a.rule4assist eq b.rule4assist and ID_1 eq " "
;
quit;

data want;
set rule1 rule2 rule3 rule4;
run;

/*Code for Keeping observations that meet Rule 2*/
data rule2assist;
set have;
rule2assist = catx('-', trim(event2), trim(ID_1));
run;

proc sort data=rule2assist;
by rule2assist;
run;

proc sort data=rule2assist out=rule2tmp(keep=rule2assist);
by rule2assist;
run;

data rule2tmp;
set rule2tmp;
if first.rule2assist eq 1 AND last.rule2assist eq 1 then delete;
by rule2assist;
run;

proc sort data=rule2tmp nodup;
by rule2assist;
run;

proc sql;
create table rule2 as
select event1, event2, ID_1, ID_2, a.rule2assist
from rule2assist a, rule2tmp b
where a.rule2assist eq b.rule2assist and ID_2 eq " "
;
quit;

/*Code for Keeping observations that meet ...this doesn't work*/

data rule3assist;
set have;
rule3assist = catx('-', trim(event1), trim(event2),trim(ID_1), trim(ID_2));
run;

proc sort data=rule3assist;
by rule3assist;
run;

proc sort data=rule3assist out=rule3tmp(keep=rule3assist);
by rule3assist;
run;

data rule3tmp;
set rule3tmp;
if first.rule3assist eq 1 AND last.rule3assist eq 1 ;
by rule3assist;
run;

proc sort data=rule3tmp nodup;
by rule3assist;
run;

proc sql;
create table rule3 as
select distinct*
from rule3assist a, rule3tmp b
group by a.event2
where a.rule3assist eq b.rule3assist and ID_2 eq " " and ID_1 eq " "
;
quit;

/*Code for Keeping observations that meet Rule 4*/

data rule4assist;
set have;
rule4assist = catx('-', trim(event1), trim(ID_2));
run;

proc sort data=rule4assist;
by rule4assist;
run;

proc sort data=rule4assist out=rule4tmp(keep=rule4assist);
by rule4assist;
run;

data rule4tmp;
set rule4tmp;
if first.rule4assist eq 1 AND last.rule4assist eq 1 then delete;
by rule4assist;
run;

proc sort data=rule4tmp nodup;
by rule4assist;
run;

proc sql;
create table rule4 as
select event1, event2, ID_1, ID_2, a.rule4assist
from rule4assist a, rule4tmp b
where a.rule4assist eq b.rule4assist and ID_1 eq " "
;
quit;

data want;
set rule1 rule2 rule3 rule4
run;

Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Re: Data Cleaning - remove events based off linkage rules between 2 Events and 2 IDs

Registration is open

SAS Training: Just a Click Away