Hi, I have this dataset where these are hospitalized events matched to each patient. Each event is given a unique number. For example, event1 is matched to event2.
event1 | event2 | Subject_ID |
1095A | 2097B | 74 |
1095A | 3003C | 74 |
1095A | 4006D | 74 |
2097B | 3003C | 12 |
2097B | 4006D | 12 |
3003C | 4006D | 116 |
9382J | 5813K | 32 |
8762L | 3250M | 66 |
4382E | 6782F | 45 |
4382E | 5489L | 45 |
6782F | 5489L | 378 |
I am trying to create a NEW_ID and remove subject_IDs 12, 116 since they are duplicate matches to subject_ID74. I also need to remove subject_ID 378 since it's a duplicate match to subject_ID 45. I ultimately want to retain matches from subject_ID 74, 32, 66, 45 as they are unique matches and assign a new ID for each subject.
Desired output:
event1 | event2 | NEW_ID |
1095A | 2097B | 1 |
1095A | 3003C | 1 |
1095A | 4006D | 1 |
9382J | 5813K | 2 |
8762L | 3250M | 3 |
4382E | 6782F | 4 |
4382E | 5489L | 4 |
Is it possible to use PROC SQL and what kind of rules or code should I write to do this? I have a dataset where there are thousands of matched events so this is a simplified version of what needs to be accomplished.
Thanks for your help!
/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;
data temp;
set have;
key=event1;output;
key=event2;output;
keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;
%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from
(
select Subject_ID,count(distinct key) as n_key
from key
where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
group by Subject_ID
);
select count(distinct key) into :n separated by ' '
from key
where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
data exclude_key;
length key $ 100;
key="&key.";
run;
proc append base=want data=exclude_key;run;
%end;
%mend;
option nodsnferr nosource;
proc delete data=want;run;
data _null_;
set Subject_ID;
call execute(cats('%nrstr(%check(key=',Subject_ID,'))'));
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;
You have to define "duplicate" in words of two syllables or less for those of us that cannot see how 378 is a "duplicate" of 45.
4382E | 6782F | 45 |
4382E | 5489L | 45 |
6782F | 5489L | 378 |
Still having a hard time seeing how a value for Event1 is a match for Event2 for a different subject. They aren't "duplicates". Related perhaps but not duplicates.
But then I haven't understood anything in any of your posts because obviously there is a lot going on that relies only on your understanding of the process.
I'm with @ballardw in that I don't understand your rules either.
But your desired output is a simple list of DISTINCT event2's. Pull the event1 that goes with the first occurrence of event2 (databases don't have order) and assign a new id at each change in event1.
Post a working data step and somebody might try some code. Help them to help you.
/*Assuming I understood what you mean.*/
data have;
infile cards expandtabs;
input (event1 event2 Subject_ID) ($);
cards;
1095A 2097B 74
1095A 3003C 74
1095A 4006D 74
2097B 3003C 12
2097B 4006D 12
3003C 4006D 116
9382J 5813K 32
8762L 3250M 66
4382E 6782F 45
4382E 5489L 45
6782F 5489L 378
;
data temp;
set have;
key=event1;output;
key=event2;output;
keep key Subject_ID;
run;
proc sort data=temp out=key nodupkey;
by Subject_ID key;
run;
proc sort data=temp(keep=Subject_ID) out=Subject_ID nodupkey;
by Subject_ID;
run;
%macro check(key=);
%local n_key n ;
proc sql noprint;
select max(n_key) into :n_key separated by ' ' from
(
select Subject_ID,count(distinct key) as n_key
from key
where key in (select key from key where Subject_ID="&key.") and Subject_ID ne "&key."
group by Subject_ID
);
select count(distinct key) into :n separated by ' '
from key
where Subject_ID="&key.";
quit;
%put &=key. &=n_key. &=n. ;
%if &n_key.=&n. %then %do;
data exclude_key;
length key $ 100;
key="&key.";
run;
proc append base=want data=exclude_key;run;
%end;
%mend;
option nodsnferr nosource;
proc delete data=want;run;
data _null_;
set Subject_ID;
call execute(cats('%nrstr(%check(key=',Subject_ID,'))'));
run;
proc sql;
create table final_want as
select * from have where Subject_ID not in (select key from want);
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.