Solved: Data cleaning for Cox regression analysis

wj2 · Posted 04-02-2021 09:18 AM

Hello Community,

I am trying to clean a dataset for a Cox regression analysis that I would greatly appreciate some help with. Basically, I am interested in running a Cox regression analysis to examine factors associated with the discontinuation of a certain medication prior to 6 months (180 days). I have provided an example below of what my dataset looks like: variable “days” represents the cumulative number of days from the index prescription date to the most recent prescription date, while variable “days_int” represents the number of days since the previous prescription.

A patient would be censored if the number of cumulative days since the first prescription date is less than 180 days (“Days”<180) or if a patient had 90 days or more in between prescriptions (“Days_int” ≥ 90) prior to 180 days since their first prescription.

I have also provided below an example of my desired output dataset. Please note that if a patient is censored because of the ≥90 days between prescriptions rule, the value for variable “Days” in the output dataset should be the “days” value in the record prior to the censoring event (see Patient_ID= 2 and Patient_ID=3). The value for “Dose” should also follow this rule for this case.

If a patient has received at least 180 days of the prescription (“Days” ≥180) with less than 90 days in between prescriptions during the first 180 days since the first prescription, the value for Days in the output dataset should be ‘180’ (i.e., to indicate that they received at least 180 days of the prescription). The value for “Dose” in this case should be the value in the last record prior to days ≥ 180. Please let me know if I can provide any further clarification. Any help with how to program this would be greatly appreciated!

Have:

Patient_ID	Dose	Med_start_date	Days	Days_int
1	2	01JAN2015	0	.
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300

Want:

Patient_ID	Dose	Days	Censor
1	4	15	1
2	2	0	1
3	16	30	1
4	8	180	0
5	16	180	0
6	4	170	1
7	8	0	1

mkeintz · Posted 04-02-2021 09:04 PM

You have two types of censoring:

1a: You run out of observations before days reaches 180, while days_int never reaches 90. In this case you write out the last observation as is, with CENSOR=1.

1b: You reach days_int>=90 before (or simulataneous with) days reaching 180. In this case, you write out the preceding days and dose, again with censor=1

CENSOR=0 when you reach 180 days without already reaching days_int>=90. In this case write out prior dose but set days=180.

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
    _lag_dose=lag(dose);
    _lag_days=lag(days);
    if censor^=. then continue; /*Already output, skip subsequent iterations*/
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 04-02-2021 09:04 PM

You have two types of censoring:

1a: You run out of observations before days reaches 180, while days_int never reaches 90. In this case you write out the last observation as is, with CENSOR=1.

1b: You reach days_int>=90 before (or simulataneous with) days reaching 180. In this case, you write out the preceding days and dose, again with censor=1

CENSOR=0 when you reach 180 days without already reaching days_int>=90. In this case write out prior dose but set days=180.

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
    _lag_dose=lag(dose);
    _lag_days=lag(days);
    if censor^=. then continue; /*Already output, skip subsequent iterations*/
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

wj2 · Posted 04-06-2021 08:55 AM

Thank you very much @mkeintz!! This is perfect!

wj2 · Posted 04-08-2021 12:00 PM

@mkeintzanother issue I'm running into for this analysis of my actual dataset is that sometimes patients will have missing data for the preceding 'Dose' value. Therefore, I would like to implement an imputation procedure where in this case, the first dose that isn't missing prior to censoring is written out instead. I have provided a couple of examples below for the example dataset used previously. Would be able to suggest how to modify the code you previously provided to do this? Thank you very much!

Want:

Patient_ID	Dose	Med_start_date	Days	Days_int
1	2	01JAN2015	0	.
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
8	8	01JAN2017	0	.
8	.	20FEB2017	50	50
9	8	01FEB2018	0	.
9	16	02APR2018	60	60
9	.	01JUN2018	120	60
9	.	21JUL2018	170	50
9	32	20AUG2018	200	30
9	32	18NOV2018	300	90

Want:

Patient_ID	Dose	Days	Censor
1	4	15	1
2	2	0	1
3	16	30	1
4	8	180	0
5	16	180	0
6	4	170	1
7	8	0	1
8	8	50	1
9	16	180	0

mkeintz · Posted 04-08-2021 03:12 PM

So you want the most recent non-missing value of dose. Then instead of

    _lag_dose=lag(dose);

you can use

    _lag_dose=ifn(lag(dose)^=.,lag(dose),_lag_dose);
    if first.patient_id=1 then _lag_dose=.;

Think of the IFN function as analogous to the excel if function. It tests the first argument, and returns the second arg if the first is true. Otherwise it returns the third arg.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

wj2 · Posted 04-12-2021 02:51 PM

@mkeintz Thank you very much. It looks like this works perfectly except for the case of patient_id = 8 in my example dataset below. For cases like this, the _lag_dose value is not getting inserted into the 'dose' variable; instead, dose =. where it should be equal to 8 in this case (please see desired output below). Can you please let me know how the code can be modified to account for cases like this as well? I have provided my code below as well. Thanks again for all your help.

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
8   8   01JAN2017   0   .
8   .   20FEB2017   50  50
9   8   01FEB2018   0   .
9   16  02APR2018   60  60
9   .   01JUN2018   120 60
9   .   21JUL2018   170 50
9   32  20AUG2018   200 30
9   32  18NOV2018   300 90
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
     _lag_dose=ifn(lag(dose)^=.,lag(dose),_lag_dose);
    if first.patient_id=1 then _lag_dose=.;
    _lag_days=lag(days);
    if censor^=. then continue; 
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

Want:

Patient_ID	Dose	Days	Censor
1	4	15	1
2	2	0	1
3	16	30	1
4	8	180	0
5	16	180	0
6	4	170	1
7	8	0	1
8	8	50	1
9	16	180	0

Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Re: Data cleaning for Cox regression analysis

Classroom Training Available!