Hello!
I have a long dataset from the "Early Prediction of Sepsis from Clinical Data: the PhysioNet/Computing in Cardiology Challenge 2019" This has over 1.5 million rows of hourly data, with over 40,000 unique Patient_IDs. There are many variable (such as Hour, HR, Resp, O2Sat, various lab values , etc.) by one outcome variable (SepsisLabel. 0 for no sepsis, 1 for sepsis).
I'm trying to keep the hourly rows of datasets from where SepsisLabel changes from 0 to 1, while excluding those who are only SepsisLabel=0 and excluding those who are only SepsisLabel=1. I've included a truncated example of one unique Patient_ID below with only Hour HR Temp Resp SepsisLabel and Patient_ID
Hour HR Temp Resp SepsisLabel Patient_ID
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 0 3205
3 81 . 16 0 3205
4 79 37.89 12 0 3205
5 81 . 15.25 0 3205
6 78 . 12 0 3205
7 75 . 13.5 1 3205
8 76 37.5 18.5 1 3205
9 84 . 14.5 1 3205
10 82 . 33 1 3205
11 95 . 28 1 3205
12 99 37.67 26 1 3205
13 96 . 21 1 3205
14 92 . 22 1 3205
15 85 . 26 1 3205
16 92 37.33 22 1 3205
Please help me with this!
Your sample data shows no instances of reversals (a single id having a change in both directions), then you are simply looking for patient_id's that start at sepsislabel zero and end at sepsislabel one. If so, then:
data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by patient_id;
retain _start_at_zero _end_at_one ;
if first.patient_id=1 then _start_at_zero=(sepsislabel=0);
if firstpass then _end_at_one=(sepsislabel=1);
if secondpass and _start_at_zero and _end_at_one;
run;
But if you also want instances that change from 1 to 0:
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by patient_id;
if first.patient_id then call missing(_n0,_n1);
_n0+(firstpass=1 and sepsislabel=0);
_n1+(firstpass=1 and sepsislabel=1);
if secondpass and _n0 and _n1;
run;
First, let's make some data robust enough for testing.
data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;
In this dataset, Patient_ID 3205 and 3208 had a change in SepsisLabel from 0 to 1 or from 1 to 0. Patient_ID 3206 and 3207 had no change in SepsisLabel. As I understand the problem, you want to keep all rows for Patient_ID 3205 and 3208, and exclude all rows for Patient_ID 3206 and 3207. If so, here's one approach:
proc sql;
select *
from have as h
inner join
(select Patient_ID
from(select distinct Patient_id, SepsisLabel
from have)
group by Patient_ID
having count(*) >1) as w
on h.Patient_ID=w.Patient_ID
;
quit;
The SQL subquery identifies those Patient_ID values having more than one value for SepsisLabel. Those Patient_ID values are used to select the desired rows from the original table.
Your sample data shows no instances of reversals (a single id having a change in both directions), then you are simply looking for patient_id's that start at sepsislabel zero and end at sepsislabel one. If so, then:
data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by patient_id;
retain _start_at_zero _end_at_one ;
if first.patient_id=1 then _start_at_zero=(sepsislabel=0);
if firstpass then _end_at_one=(sepsislabel=1);
if secondpass and _start_at_zero and _end_at_one;
run;
But if you also want instances that change from 1 to 0:
data want (drop=_:);
set have (in=firstpass)
have (in=secondpass);
by patient_id;
if first.patient_id then call missing(_n0,_n1);
_n0+(firstpass=1 and sepsislabel=0);
_n1+(firstpass=1 and sepsislabel=1);
if secondpass and _n0 and _n1;
run;
This worked! Took some tinkering (I had to sort my data by Patient_ID and hour first), but it matches the original dataset, but keeping only those who change from 0 to 1. There aren't any instances on it going from 1 to 0, but thank you for giving that! The one SASJedi gave was a little memory intensive and I'm not sure why (I'm new to SAS).
If "There aren't any instances on it going from 1 to 0". That would be easy.
proc sql;
create table want as
select * from have
group by Patient_ID
having count(distinct SepsisLabel)=2
order by Patient_ID,Hour;
quit;
While SQL is sometimes simpler to write, the SQL processor does a lot of "behind the scenes" work (sorting, etc.) on your behalf. An individual SQL step will often be more resource-intensive than a PROC SORT / DATA step combo.
After looking at my data some more, turns out I might need to include individuals who are only sepsislabel=0 for their whole time in the hospital, but also include those who are first sepsislabel=0 and then transition to sepsislabel=1. How would you do that? (I.e. Include patient_ID=3206 (the only sepsislabel=0) and 3205 (who goes from sepsislabel=0 to 1) in my dataset). This will exclude people who are only sepsislabel=1.
Trying to to a cox model and need a good comparison group for those who do not develop sepsis. If you have any more advice, please let me know!
If the above are both true, then you simply want all ID's that start with zero, That's a simple task.
If the data are sorted by PATIENT_ID/Hour, then:
data want (drop=_:);
retain _keep ' ';
set have;
by patient_id ;
if first.patient_id=1 then _keep=ifc(sepsislabel=0,'Y','N');
if _keep='Y';
run;
If the data within patient_id are sorted by HR, and the patient_id's are grouped, but not sorted, then change the BY statement to:
by patient_id notsorted;
Now if the data are sorted by HR, you don't have to re-sort by patient_id/hr. And it doesn't matter what the order is within each HR value. So, if sorting by patient_id/hr is expensive, then you can use a hash object:
data want (drop=_:);
set have;
by patient_id;
length _keep $1;
if _n_=1 then do;
declare hash h();
h.definekey('patient_id');
h.definedata('_keep');
h.definedone();
end;
if h.find()^=0 then do; /*If this is first instance for this ID ...*/
if sepsislabel=0 then _keep='Y'; /*Check the initial sepsislabel value */
else _keep='N';
h.add(); /*Put it in the hash object */
end;
if _keep='Y';
run;
These are untested.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.