- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, I have this dataset where these are hospitalized events matched to each patient. Each event is given a unique number. For example, event1 is matched to event2.
event1 | event2 | Subject_ID |
1095A | 2097B | 74 |
1095A | 3003C | 74 |
1095A | 4006D | 74 |
2097B | 3003C | 12 |
2097B | 4006D | 12 |
3003C | 4006D | 116 |
9382J | 5813K | 32 |
8762L | 3250M | 66 |
4382E | 6782F | 45 |
4382E | 5489L | 45 |
6782F | 5489L | 378 |
I am trying to create a NEW_ID and remove subject_IDs 12, 116 since they are duplicate matches to subject_ID74. I also need to remove subject_ID 378 since it's a duplicate match to subject_ID 45. I ultimately want to retain matches from subject_ID 74, 32, 66, 45 as they are unique matches and assign a new ID for each subject.
Desired output:
event1 | event2 | NEW_ID |
1095A | 2097B | 1 |
1095A | 3003C | 1 |
1095A | 4006D | 1 |
9382J | 5813K | 2 |
8762L | 3250M | 3 |
4382E | 6782F | 4 |
4382E | 5489L | 4 |
Is it possible to use PROC SQL and what kind of rules or code should I write to do this? I have a dataset where there are thousands of matched events so this is a simplified version of what needs to be accomplished.
Thanks for your help!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;
data temp;
set have;
key=event1;output;
key=event2;output;
keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;
%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from
(
select Subject_ID,count(distinct key) as n_key
from key
where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
group by Subject_ID
);
select count(distinct key) into :n separated by ' '
from key
where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
data exclude_key;
length key $ 100;
key="&key.";
run;
proc append base=want data=exclude_key;run;
%end;
%mend;
option nodsnferr nosource;
proc delete data=want;run;
data _null_;
set Subject_ID;
call execute(cats('%nrstr(%check(key=',Subject_ID,'))'));
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You have to define "duplicate" in words of two syllables or less for those of us that cannot see how 378 is a "duplicate" of 45.
4382E | 6782F | 45 |
4382E | 5489L | 45 |
6782F | 5489L | 378 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
If I change these to dates, subject_ID 45 was hospitalized on 10/25/2000, 11/4/2006, and 6/23/2002. (a total of 3 times)
However, the events 11/4/2006 and 6/23/2002 were matched again to each other and assigned a different subject_ID. This is essentially the same as subject_ID 45 but is missing the match with date 10/25/2000. I don't want subject_ID 378's match
event1 event2 subject_ID
10/25/2000 11/4/2006 45
10/25/2000 6/23/2002 45
11/4/2006 6/23/2002 378
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Still having a hard time seeing how a value for Event1 is a match for Event2 for a different subject. They aren't "duplicates". Related perhaps but not duplicates.
But then I haven't understood anything in any of your posts because obviously there is a lot going on that relies only on your understanding of the process.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm with @ballardw in that I don't understand your rules either.
But your desired output is a simple list of DISTINCT event2's. Pull the event1 that goes with the first occurrence of event2 (databases don't have order) and assign a new id at each change in event1.
Post a working data step and somebody might try some code. Help them to help you.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;
data temp;
set have;
key=event1;output;
key=event2;output;
keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;
%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from
(
select Subject_ID,count(distinct key) as n_key
from key
where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
group by Subject_ID
);
select count(distinct key) into :n separated by ' '
from key
where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
data exclude_key;
length key $ 100;
key="&key.";
run;
proc append base=want data=exclude_key;run;
%end;
%mend;
option nodsnferr nosource;
proc delete data=want;run;
data _null_;
set Subject_ID;
call execute(cats('%nrstr(%check(key=',Subject_ID,'))'));
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I wanted distinct combinations and not all possible matches per subject_id.