Solved: Keep rows of data when one var changes from 0 to 1, but excluding when...

pdick2 · Posted 04-14-2024 02:37 PM

Hello!

I have a long dataset from the "Early Prediction of Sepsis from Clinical Data: the PhysioNet/Computing in Cardiology Challenge 2019" This has over 1.5 million rows of hourly data, with over 40,000 unique Patient_IDs. There are many variable (such as Hour, HR, Resp, O2Sat, various lab values , etc.) by one outcome variable (SepsisLabel. 0 for no sepsis, 1 for sepsis).

I'm trying to keep the hourly rows of datasets from where SepsisLabel changes from 0 to 1, while excluding those who are only SepsisLabel=0 and excluding those who are only SepsisLabel=1. I've included a truncated example of one unique Patient_ID below with only Hour HR Temp Resp SepsisLabel and Patient_ID

Hour HR Temp Resp SepsisLabel Patient_ID
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 0 3205
3 81 . 16 0 3205
4 79 37.89 12 0 3205
5 81 . 15.25 0 3205
6 78 . 12 0 3205
7 75 . 13.5 1 3205
8 76 37.5 18.5 1 3205
9 84 . 14.5 1 3205
10 82 . 33 1 3205
11 95 . 28 1 3205
12 99 37.67 26 1 3205
13 96 . 21 1 3205
14 92 . 22 1 3205
15 85 . 26 1 3205
16 92 37.33 22 1 3205

Please help me with this!

mkeintz · Posted 04-15-2024 10:07 AM

Your sample data shows no instances of reversals (a single id having a change in both directions), then you are simply looking for patient_id's that start at sepsislabel zero and end at sepsislabel one. If so, then:

data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;
data want (drop=_:);
  set have (in=firstpass)
      have (in=secondpass);
  by patient_id;
  retain _start_at_zero _end_at_one ;
  if first.patient_id=1 then _start_at_zero=(sepsislabel=0);
  if firstpass then _end_at_one=(sepsislabel=1);
  if secondpass  and _start_at_zero and _end_at_one;
run;

But if you also want instances that change from 1 to 0:

data want (drop=_:);
  set have (in=firstpass)
      have (in=secondpass);
  by patient_id;
  if first.patient_id then call missing(_n0,_n1);
  _n0+(firstpass=1 and sepsislabel=0);
  _n1+(firstpass=1 and sepsislabel=1);
  if secondpass  and _n0 and _n1;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

SASJedi · Posted 04-15-2024 08:35 AM

First, let's make some data robust enough for testing.

data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;

In this dataset, Patient_ID 3205 and 3208 had a change in SepsisLabel from 0 to 1 or from 1 to 0. Patient_ID 3206 and 3207 had no change in SepsisLabel. As I understand the problem, you want to keep all rows for Patient_ID 3205 and 3208, and exclude all rows for Patient_ID 3206 and 3207. If so, here's one approach:

proc sql;
select * 
	from have as h
	inner join
	(select Patient_ID
		from(select distinct Patient_id, SepsisLabel
				from have)
		group by Patient_ID
		having count(*) >1) as w
	on h.Patient_ID=w.Patient_ID
;
quit;

The SQL subquery identifies those Patient_ID values having more than one value for SepsisLabel. Those Patient_ID values are used to select the desired rows from the original table.

Check out my Jedi SAS Tricks for SAS Users

mkeintz · Posted 04-15-2024 10:07 AM

Your sample data shows no instances of reversals (a single id having a change in both directions), then you are simply looking for patient_id's that start at sepsislabel zero and end at sepsislabel one. If so, then:

data have;
input Hour HR Temp Resp SepsisLabel Patient_ID;
datalines;
0 . . . 0 3205
1 76 . 20 0 3205
2 78 . 20 1 3205
3 81 . 16 1 3205
0 . . . 0 3206
1 76 . 20 0 3206
2 78 . 20 0 3206
3 81 . 16 0 3206
0 . . . 1 3207
1 76 . 20 1 3207
2 78 . 20 1 3207
3 81 . 16 1 3207
0 . . . 1 3208
1 76 . 20 1 3208
2 78 . 20 0 3208
3 81 . 16 0 3208
;
data want (drop=_:);
  set have (in=firstpass)
      have (in=secondpass);
  by patient_id;
  retain _start_at_zero _end_at_one ;
  if first.patient_id=1 then _start_at_zero=(sepsislabel=0);
  if firstpass then _end_at_one=(sepsislabel=1);
  if secondpass  and _start_at_zero and _end_at_one;
run;

But if you also want instances that change from 1 to 0:

data want (drop=_:);
  set have (in=firstpass)
      have (in=secondpass);
  by patient_id;
  if first.patient_id then call missing(_n0,_n1);
  _n0+(firstpass=1 and sepsislabel=0);
  _n1+(firstpass=1 and sepsislabel=1);
  if secondpass  and _n0 and _n1;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

pdick2 · Posted 04-15-2024 08:47 PM

This worked! Took some tinkering (I had to sort my data by Patient_ID and hour first), but it matches the original dataset, but keeping only those who change from 0 to 1. There aren't any instances on it going from 1 to 0, but thank you for giving that! The one SASJedi gave was a little memory intensive and I'm not sure why (I'm new to SAS).

Ksharp · Posted 04-15-2024 10:32 PM

If "There aren't any instances on it going from 1 to 0". That would be easy.

proc sql;
create table want as
select * from have
 group by Patient_ID
  having count(distinct SepsisLabel)=2
   order by Patient_ID,Hour;
quit;

SASJedi · Posted 04-16-2024 07:38 AM

While SQL is sometimes simpler to write, the SQL processor does a lot of "behind the scenes" work (sorting, etc.) on your behalf. An individual SQL step will often be more resource-intensive than a PROC SORT / DATA step combo.

Check out my Jedi SAS Tricks for SAS Users

pdick2 · Posted 04-19-2024 10:22 AM

After looking at my data some more, turns out I might need to include individuals who are only sepsislabel=0 for their whole time in the hospital, but also include those who are first sepsislabel=0 and then transition to sepsislabel=1. How would you do that? (I.e. Include patient_ID=3206 (the only sepsislabel=0) and 3205 (who goes from sepsislabel=0 to 1) in my dataset). This will exclude people who are only sepsislabel=1.

Trying to to a cox model and need a good comparison group for those who do not develop sepsis. If you have any more advice, please let me know!

mkeintz · Posted 04-19-2024 03:04 PM

@pdick2

You want all cases for ID's that have sepsislabel always at zero, or that change from zero to one.
You earlier said your data has no instance of going from one to zero.

If the above are both true, then you simply want all ID's that start with zero, That's a simple task.

If the data are sorted by PATIENT_ID/Hour, then:

data want (drop=_:);
  retain _keep ' ';
  set have;
  by patient_id ;
  if first.patient_id=1 then _keep=ifc(sepsislabel=0,'Y','N');
  if _keep='Y';
run;

If the data within patient_id are sorted by HR, and the patient_id's are grouped, but not sorted, then change the BY statement to:

  by patient_id notsorted;

Now if the data are sorted by HR, you don't have to re-sort by patient_id/hr. And it doesn't matter what the order is within each HR value. So, if sorting by patient_id/hr is expensive, then you can use a hash object:

data want (drop=_:);
  set have;
  by patient_id;

  length _keep $1;
  if _n_=1 then do;
    declare hash h();
      h.definekey('patient_id');
      h.definedata('_keep');
      h.definedone();
  end;
  if h.find()^=0 then do;             /*If this is first instance for this ID ...*/
    if sepsislabel=0 then _keep='Y';  /*Check the initial sepsislabel value      */
    else _keep='N';
    h.add();                          /*Put it in the hash object */
  end;
  if _keep='Y';
run;

These are untested.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Re: Keep rows of data when one var changes from 0 to 1, but excluding when that var is only 1 or 0

Registration is open

SAS Training: Just a Click Away