BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
pdick2
Fluorite | Level 6

Hello again!

I have a long dataset from the "Early Prediction of Sepsis from Clinical Data: the PhysioNet/Computing in Cardiology Challenge 2019" This has over 1.5 million rows of hourly data, with over 40,000 unique Patient_IDs. There are many variable (such as Hour, HR, Resp, O2Sat, various lab values , etc.) by one outcome variable (SepsisLabel. 0 for no sepsis, 1 for sepsis).

 

I now have the dataset where SepsisLabel changes from 0 to 1, while excluding those who are only SepsisLabel=0 and excluding those who are only SepsisLabel=1.

 

Now I want to do some other things to the data. I want to take the time difference between when sepsislabel changes from 0 to 1 before onset. Here is an example dataset. I've been using ChatGPT to help, but it doesn't seem to understand what I want. I kinda got the code it gave me to work with the time difference AFTER sepsislabel changes from 0 to 1, but not before. Before it gives me missing values. Code below and an example dataset for Patient_ID=34.

 

data have;
input Hour HR Temp SepsisLabel Patient_ID onset_time TimeDifference;
datalines;
0 88 36.11 0 34 . .
1 88 36.17 0 34 . .
2 88 . 0 34 . .
3 83.5 . 1 34 3 0
4 80 . 1 34 3 1
5 88 36.5 1 34 3 2
6 91 . 1 34 3 3
7 88 . 1 34 3 4
8 80 . 1 34 3 5
9 80 . 1 34 3 6
10 80 . 1 34 3 7
11 82 . 1 34 3 8
12 77 . 1 34 3 9
;

 

data biosp.sepsis_0_to_1_time_diff;
set biosp.sepsis_0_to_1;
by Patient_ID;
retain onset_time;
if first.Patient_ID then onset_time =.; *Initialize the onset time for each unique patient_ID;
if sepsislabel=1 and onset_time=. then onset_time=Hour; *Records when sepsislabel changes to 1;
if not missing(onset_time) then TimeDifference = Hour - onset_time; /* Calculate time difference */
else TimeDifference=.;
run;

 

Please help me with this.

Something else I want to do is get the mean, median, mode, q1, q3, min, and max for certain variables (such as HR, Resp, Temp., etc.) at certain time intervals before sepsislabel changes from 0 to 1. I was going to look at t=-4 hours, t=-6 hours, and t=-12 hours. This will single out this long data into 1 row per 1 patient_ID (instead of 11 rows, or 100 rows in some instances for one patient), but the rows will have the mean, median, etc. for those variables of interest (HR, Temp., etc.). This will create many datasheets (1 datasheet for mean values, 1 for median, 1 for q1, etc.), but this can be more usable in a logistic analysis in my opinion. 

 

Can anyone help me with this please? Thank you!!!

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

Yes thank you, this is helpful.

 

data transition;
    set have;
    by patient_id;
    prev_sepsis=lag(sepsisLabel);
    if not first.patient_id and sepsisLabel=1 and prev_sepsis<1 then output;
run;
data want;
    merge have transition(rename=(hour=transition_hour) keep=hour patient_id);
    by patient_id;
    time_difference=hour-transition_hour;
run;

 

What I would like to do is, for example, take the mean of a column variable (lets use HR) for one that unique Patient_ID between the hours of -4 to 0. This will allow me to get the mean value of HR 4 hours before sepsislabel changes from 0 to 1. I would like to look at other variables besides HR, and other time intervals as well (such as -6, -12, etc. before the sepsislabel changes from 0 to 1).

Sounds like a job for PROC SUMMARY.

 

--
Paige Miller

View solution in original post

4 REPLIES 4
PaigeMiller
Diamond | Level 26

Is this what you want? You haven't given us an example of what you do want, you told us you got what you don't want. (In the future, please be clear about what you do want).

 

I have removed onset_time and timedifference from HAVE.

 

data have;
input Hour HR Temp SepsisLabel Patient_ID;
datalines;
0 88 36.11 0 34 . .
1 88 36.17 0 34 . .
2 88 . 0 34 . .
3 83.5 . 1 34 3 0
4 80 . 1 34 3 1
5 88 36.5 1 34 3 2
6 91 . 1 34 3 3
7 88 . 1 34 3 4
8 80 . 1 34 3 5
9 80 . 1 34 3 6
10 80 . 1 34 3 7
11 82 . 1 34 3 8
12 77 . 1 34 3 9
;

data sepsis_0_to_1_time_diff;
set have;
by Patient_ID;
retain onset_time;
lag_sepsis=lag(sepsislabel);
if first.patient_id then onset_time=hour;
if sepsislabel=1 and lag_sepsis<1 then TimeDifference = Hour - onset_time; /* Calculate time difference */
else TimeDifference=.;
run;

 

 

I don't understand the rest of your question, but this part bothers me: "This will create many datasheets (1 datasheet for mean values, 1 for median, 1 for q1, etc.)", why wouldn't you just put all the results (mean, median, etc.) into one data set? Separate data sets means more complicated programming.

 

I am not surprised ChatGPT fails here. It relies on finding answers to similar problems, and this is relatively unique problem.

--
Paige Miller
pdick2
Fluorite | Level 6

I'll try to be more clear. Here's what I do want or would like to see in a datalines format:

data have;
input Hour HR Temp SepsisLabel Patient_ID TimeDifference;
datalines;
0 88 36.11 0 34 -3
1 88 36.17 0 34 -2 
2 88 . 0 34 -1
3 83.5 . 1 34 0
4 80 . 1 34 1
5 88 36.5 1 34 2
6 91 . 1 34 3
7 88 . 1 34 4
8 80 . 1 34 5
9 80 . 1 34 6
10 80 . 1 34 7
11 82 . 1 34 8
12 77 . 1 34 9
;

And I didn't think about putting them on one data set. I see what you mean, it'll be easier to use one "calculated values" data set (and using the vars I want) instead of 6 data sets. Do you recommend just making one new data set with all of the calculated values or have them on the original "have" data set? This isn't the source data set, but just another copy of it (so I don't overwrite the original data).

 

What I would like to do is, for example, take the mean of a column variable (lets use HR) for one that unique Patient_ID between the hours of -4 to 0. This will allow me to get the mean value of HR 4 hours before sepsislabel changes from 0 to 1. I would like to look at other variables besides HR, and other time intervals as well (such as -6, -12, etc. before the sepsislabel changes from 0 to 1).

 

The code you gave didn't work for me. Maybe the specifics above will help. Thank you so much for helping me! First time SAS user here. Also, I can send the data set where it only has data when sepsislabel changed from 0 to 1, if you want to experiment.

 

Here is a screenshot of what I got from the code you gave:

pdick2_0-1713287466840.png

 

PaigeMiller
Diamond | Level 26

Yes thank you, this is helpful.

 

data transition;
    set have;
    by patient_id;
    prev_sepsis=lag(sepsisLabel);
    if not first.patient_id and sepsisLabel=1 and prev_sepsis<1 then output;
run;
data want;
    merge have transition(rename=(hour=transition_hour) keep=hour patient_id);
    by patient_id;
    time_difference=hour-transition_hour;
run;

 

What I would like to do is, for example, take the mean of a column variable (lets use HR) for one that unique Patient_ID between the hours of -4 to 0. This will allow me to get the mean value of HR 4 hours before sepsislabel changes from 0 to 1. I would like to look at other variables besides HR, and other time intervals as well (such as -6, -12, etc. before the sepsislabel changes from 0 to 1).

Sounds like a job for PROC SUMMARY.

 

--
Paige Miller
pdick2
Fluorite | Level 6

This worked thank you so much!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 254 views
  • 0 likes
  • 2 in conversation