Solved: Removing duplicate matched events

r-lee-484 · Posted 07-26-2023 10:34 AM

Hi, I have this dataset where these are hospitalized events matched to each patient. Each event is given a unique number. For example, event1 is matched to event2.

event1	event2	Subject_ID
1095A	2097B	74
1095A	3003C	74
1095A	4006D	74
2097B	3003C	12
2097B	4006D	12
3003C	4006D	116
9382J	5813K	32
8762L	3250M	66
4382E	6782F	45
4382E	5489L	45
6782F	5489L	378

I am trying to create a NEW_ID and remove subject_IDs 12, 116 since they are duplicate matches to subject_ID74. I also need to remove subject_ID 378 since it's a duplicate match to subject_ID 45. I ultimately want to retain matches from subject_ID 74, 32, 66, 45 as they are unique matches and assign a new ID for each subject.

Desired output:

event1	event2	NEW_ID
1095A	2097B	1
1095A	3003C	1
1095A	4006D	1
9382J	5813K	2
8762L	3250M	3
4382E	6782F	4
4382E	5489L	4

Is it possible to use PROC SQL and what kind of rules or code should I write to do this? I have a dataset where there are thousands of matched events so this is a simplified version of what needs to be accomplished.

Thanks for your help!

Ksharp · Posted 07-27-2023 07:57 AM

/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;

data temp;
 set have;
 key=event1;output;
 key=event2;output;
 keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;



%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from 
(
select Subject_ID,count(distinct key) as n_key
 from key 
  where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
   group by Subject_ID
);
select count(distinct key) into :n separated by ' ' 
 from key 
  where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
  data exclude_key;
   length key $ 100;
   key="&key.";
  run;
  proc append base=want data=exclude_key;run;
%end;
%mend;


option nodsnferr nosource;
proc delete data=want;run;
data _null_;
 set Subject_ID;
 call execute(cats('%nrstr(%check(key=',Subject_ID,'))')); 
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;

View solution in original post

ballardw · Posted 07-26-2023 10:37 AM

You have to define "duplicate" in words of two syllables or less for those of us that cannot see how 378 is a "duplicate" of 45.

4382E	6782F	45
4382E	5489L	45
6782F	5489L	378

r-lee-484 · Posted 07-26-2023 10:46 AM

Looking at these events:
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378

If I change these to dates, subject_ID 45 was hospitalized on 10/25/2000, 11/4/2006, and 6/23/2002. (a total of 3 times)

However, the events 11/4/2006 and 6/23/2002 were matched again to each other and assigned a different subject_ID. This is essentially the same as subject_ID 45 but is missing the match with date 10/25/2000. I don't want subject_ID 378's match

event1 event2 subject_ID
10/25/2000 11/4/2006 45
10/25/2000 6/23/2002 45
11/4/2006 6/23/2002 378

ballardw · Posted 07-26-2023 11:56 AM

Still having a hard time seeing how a value for Event1 is a match for Event2 for a different subject. They aren't "duplicates". Related perhaps but not duplicates.

But then I haven't understood anything in any of your posts because obviously there is a lot going on that relies only on your understanding of the process.

HB · Posted 07-27-2023 12:26 AM

I'm with @ballardw in that I don't understand your rules either.

But your desired output is a simple list of DISTINCT event2's. Pull the event1 that goes with the first occurrence of event2 (databases don't have order) and assign a new id at each change in event1.

Post a working data step and somebody might try some code. Help them to help you.

Ksharp · Posted 07-27-2023 07:57 AM

/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;

data temp;
 set have;
 key=event1;output;
 key=event2;output;
 keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;



%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from 
(
select Subject_ID,count(distinct key) as n_key
 from key 
  where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
   group by Subject_ID
);
select count(distinct key) into :n separated by ' ' 
 from key 
  where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
  data exclude_key;
   length key $ 100;
   key="&key.";
  run;
  proc append base=want data=exclude_key;run;
%end;
%mend;


option nodsnferr nosource;
proc delete data=want;run;
data _null_;
 set Subject_ID;
 call execute(cats('%nrstr(%check(key=',Subject_ID,'))')); 
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;

r-lee-484 · Posted 08-01-2023 11:32 AM

This is exactly the solution I am looking for. Thank you!

I wanted distinct combinations and not all possible matches per subject_id.

Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

Re: Removing duplicate matched events

SAS Innovate 2025: Call for Content

Classroom Training Available!