BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Hello Community,

 

I am trying to clean a dataset for a Cox regression analysis that I would greatly appreciate some help with. Basically, I am interested in running a Cox regression analysis to examine factors associated with the discontinuation of a certain medication prior to 6 months (180 days). I have provided an example below of what my dataset looks like: variable “days” represents the cumulative number of days from the index prescription date to the most recent prescription date, while variable “days_int” represents the number of days since the previous prescription.

 

A patient would be censored if the number of cumulative days since the first prescription date is less than 180 days  (“Days”<180) or if a patient had 90 days or more in between prescriptions (“Days_int” ≥ 90) prior to 180 days since their first prescription.

 

I have also provided below an example of my desired output dataset. Please note that if a patient is censored because of the ≥90 days between prescriptions rule, the value for variable “Days” in the output dataset should be the “days” value in the record prior to the censoring event (see Patient_ID= 2 and Patient_ID=3). The value for “Dose” should also follow this rule for this case.

 

If a patient has received at least 180 days of the prescription (“Days” ≥180) with less than 90 days in between prescriptions during the first 180 days since the first prescription, the value for Days in the output dataset should be ‘180’ (i.e., to indicate that they received at least 180 days of the prescription). The value for “Dose” in this case should be the value in the last record prior to days ≥ 180. Please let me know if I can provide any further clarification. Any help with how to program this would be greatly appreciated!

 

Have:

Patient_ID

Dose

Med_start_date

Days

Days_int

1

2

01JAN2015

0

.

1

4

16JAN2015

15

15

2

2

01JAN2010

0

.

2

4

11APR2010

100

100

3

8

01JAN2012

0

.

3

16

31JAN2012

30

30

3

24

10MAY2012

130

100

4

8

01MAR2017

0

.

4

8

30APR2017

60

60

4

8

29JUN2017

120

60

4

16

28AUG2017

180

60

4

16

27OCT2017

240

60

5

8

01FEB2018

0

.

5

8

02APR2018

60

60

5

16

01JUN2018

120

60

5

16

21JUL2018

170

50

5

32

20AUG2018

200

30

5

32

18NOV2018

300

90

6

1

01JAN2011

0

.

6

2

20FEB2011

50

50

6

2

21APR2011

110

60

6

4

20JUN2011

170

60

6

6

18SEP2011

260

90

7

8

01JAN2019

0

.

7

16

28OCT2019

300

300

 

Want:

Patient_ID

Dose

Days

Censor

1

4

15

1

2

2

0

1

3

16

30

1

4

8

180

0

5

16

180

0

6

4

170

1

7

8

0

1

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You have two types of censoring:

 

    1a:  You run out of observations before days reaches 180, while days_int never reaches 90.  In this case you write out the last observation as is, with CENSOR=1.

    1b:  You reach days_int>=90 before (or simulataneous with) days reaching 180.  In this case, you write out the preceding days and dose, again with censor=1

    CENSOR=0 when you reach 180 days without already reaching days_int>=90.  In this case write out prior dose but set days=180.

 

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
    _lag_dose=lag(dose);
    _lag_days=lag(days);
    if censor^=. then continue; /*Already output, skip subsequent iterations*/
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

5 REPLIES 5
mkeintz
PROC Star

You have two types of censoring:

 

    1a:  You run out of observations before days reaches 180, while days_int never reaches 90.  In this case you write out the last observation as is, with CENSOR=1.

    1b:  You reach days_int>=90 before (or simulataneous with) days reaching 180.  In this case, you write out the preceding days and dose, again with censor=1

    CENSOR=0 when you reach 180 days without already reaching days_int>=90.  In this case write out prior dose but set days=180.

 

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
    _lag_dose=lag(dose);
    _lag_days=lag(days);
    if censor^=. then continue; /*Already output, skip subsequent iterations*/
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
wj2
Quartz | Level 8 wj2
Quartz | Level 8

Thank you very much @mkeintz!! This is perfect!

wj2
Quartz | Level 8 wj2
Quartz | Level 8

@mkeintzanother issue I'm running into for this analysis of my actual dataset is that sometimes patients will have missing data for the preceding 'Dose' value. Therefore, I would like to implement an imputation procedure where in this case, the first dose that isn't missing prior to censoring is written out instead. I have provided a couple of examples below for the example dataset used previously. Would be able to suggest how to modify the code you previously provided to do this? Thank you very much!

 

Want:

Patient_ID

Dose

Med_start_date

Days

Days_int

1

2

01JAN2015

0

.

1

4

16JAN2015

15

15

2

2

01JAN2010

0

.

2

4

11APR2010

100

100

3

8

01JAN2012

0

.

3

16

31JAN2012

30

30

3

24

10MAY2012

130

100

4

8

01MAR2017

0

.

4

8

30APR2017

60

60

4

8

29JUN2017

120

60

4

16

28AUG2017

180

60

4

16

27OCT2017

240

60

5

8

01FEB2018

0

.

5

8

02APR2018

60

60

5

16

01JUN2018

120

60

5

16

21JUL2018

170

50

5

32

20AUG2018

200

30

5

32

18NOV2018

300

90

6

1

01JAN2011

0

.

6

2

20FEB2011

50

50

6

2

21APR2011

110

60

6

4

20JUN2011

170

60

6

6

18SEP2011

260

90

7

8

01JAN2019

0

.

7

16

28OCT2019

300

300

8

8

01JAN2017

0

.

8

.

20FEB2017

50

50

9

8

01FEB2018

0

.

9

16

02APR2018

60

60

9

.

01JUN2018

120

60

9

.

21JUL2018

170

50

9

32

20AUG2018

200

30

9

32

18NOV2018

300

90

 

Want:

Patient_ID

Dose

Days

Censor

1

4

15

1

2

2

0

1

3

16

30

1

4

8

180

0

5

16

180

0

6

4

170

1

7

8

0

1

8

8

50

1

9

16

180

0

mkeintz
PROC Star

So you want the most recent non-missing value of dose.  Then instead of 

 

    _lag_dose=lag(dose);

you can use

    _lag_dose=ifn(lag(dose)^=.,lag(dose),_lag_dose);
    if first.patient_id=1 then _lag_dose=.;

Think of the IFN function as analogous to the excel if function.  It tests the first argument, and returns the second arg if the first is true.  Otherwise it returns the third arg.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
wj2
Quartz | Level 8 wj2
Quartz | Level 8

@mkeintz  Thank you very much. It looks like this works perfectly except for the case of patient_id = 8 in my example dataset below. For cases like this, the _lag_dose value is not getting inserted into the 'dose' variable; instead, dose =. where it should be equal to 8 in this case (please see desired output below). Can you please let me know how the code can be modified to account for cases like this as well? I have provided my code below as well. Thanks again for all  your help. 

data have;
  input Patient_ID	Dose	Med_start_date :date9.	Days	Days_int;
  format med_start_date date9.;
datalines;
1	2	01JAN2015	0	. 
1	4	16JAN2015	15	15
2	2	01JAN2010	0	.
2	4	11APR2010	100	100
3	8	01JAN2012	0	.
3	16	31JAN2012	30	30
3	24	10MAY2012	130	100
4	8	01MAR2017	0	.
4	8	30APR2017	60	60
4	8	29JUN2017	120	60
4	16	28AUG2017	180	60
4	16	27OCT2017	240	60
5	8	01FEB2018	0	.
5	8	02APR2018	60	60
5	16	01JUN2018	120	60
5	16	21JUL2018	170	50
5	32	20AUG2018	200	30
5	32	18NOV2018	300	90
6	1	01JAN2011	0	.
6	2	20FEB2011	50	50
6	2	21APR2011	110	60
6	4	20JUN2011	170	60
6	6	18SEP2011	260	90
7	8	01JAN2019	0	.
7	16	28OCT2019	300	300
8   8   01JAN2017   0   .
8   .   20FEB2017   50  50
9   8   01FEB2018   0   .
9   16  02APR2018   60  60
9   .   01JUN2018   120 60
9   .   21JUL2018   170 50
9   32  20AUG2018   200 30
9   32  18NOV2018   300 90
run;

data want (keep=patient_id censor days dose);
  do until (last.patient_id);
    set have;
    by patient_id;
     _lag_dose=ifn(lag(dose)^=.,lag(dose),_lag_dose);
    if first.patient_id=1 then _lag_dose=.;
    _lag_days=lag(days);
    if censor^=. then continue; 
    if days_int>=90 then do;
      censor=1;
      days=_lag_days;
      dose=_lag_dose;
    end;
    else if days>=180 then do;
      censor=0;
      days=180;
      dose=_lag_dose;
    end;
    else if last.patient_id then censor=1;
    if censor^=. then output;
  end;
run;

Want:

Patient_ID

Dose

Days

Censor

1

4

15

1

2

2

0

1

3

16

30

1

4

8

180

0

5

16

180

0

6

4

170

1

7

8

0

1

8

8

50

1

9

16

180

0

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1037 views
  • 2 likes
  • 2 in conversation