Hi all. Sorry about the length of the post but I wanted to be comprehensive in terms of context.
I have data of exits from an employment program over the course of a year. In survival analysis terms, the 'failure' variable is leaving the program (which is a positive thing as it means the individual has found employment). There are two groups I wish to compare in terms of 'failures'. Some people are in Group A, which is a 'standard' program over the course of a year. Some people are in Group B, which is the 'standard' program for the first six months, and then at the six month mark Group B is subject to additional participation requirements.
It has been found previously that people subject to additional participation requirements (Group B) leave the program more quickly around the time the requirements are imposed than people without them (Group A). Individuals know when the additional requirements are coming, so they might leave shortly before or shortly after the requirements are imposed.
Therefore, when comparing the groups, I would expect similar numbers of exits for the first 5 months or so, but then I would expect increased exits for a few weeks for Group B. In other words, I expect an interaction of time (measured in fortnights) and Group and that the hazard ratio will not be constant over time.
The two groups are also not random but quasi-experimental, so there are additional covariates (such as participant gender, age, local unemployment rate, etc) that are included. I'm not directly interested in these variables; I just want to hold them constant between the groups.
The particular problems I'm having are the following:
This is the SAS code I am using for PROC PHREG (I have dropped some of the covariates for readability, and I have perturbed the figures in any output shown). The data are in long format, with one row per fortnight per participant until they leave the program, if they leave within a year; participants who do not leave are right-censored at one year (26 fortnights). I have included indicative output.
ods graphics on;
proc phreg data=exits plots(overlay=stratum)=(survival);
class GROUP GENDER REMOTENESS fortnight /param=ref ref=first order=internal;
model (start,stop)*exit(0) = GROUP AGE GENDER REMOTENESS LAST_SCORE fortnight * GROUP / ties=efron alpha=0.01 rl;
baseline out=exit_out survival=_all_ /diradj group=GROUP;
hazardratio GROUP / at (fortnight=ALL) alpha=0.01 ;
run;
ods graphics off;
Hazard Ratios for GROUP | |
Description | Point Estimate |
GROUP GROUP_B vs GROUP_A At fortnight=1 | 1.090 |
GROUP GROUP_B vs GROUP_A At fortnight=2 | 1.258 |
GROUP GROUP_B vs GROUP_A At fortnight=3 | 1.240 |
GROUP GROUP_B vs GROUP_A At fortnight=4 | 1.155 |
GROUP GROUP_B vs GROUP_A At fortnight=5 | 1.075 |
GROUP GROUP_B vs GROUP_A At fortnight=6 | 1.117 |
GROUP GROUP_B vs GROUP_A At fortnight=7 | 0.942 |
GROUP GROUP_B vs GROUP_A At fortnight=8 | 0.941 |
GROUP GROUP_B vs GROUP_A At fortnight=9 | 1.006 |
GROUP GROUP_B vs GROUP_A At fortnight=10 | 0.962 |
GROUP GROUP_B vs GROUP_A At fortnight=11 | 0.977 |
GROUP GROUP_B vs GROUP_A At fortnight=12 | 1.173 |
GROUP GROUP_B vs GROUP_A At fortnight=13 | 1.310 |
GROUP GROUP_B vs GROUP_A At fortnight=14 | 1.347 |
GROUP GROUP_B vs GROUP_A At fortnight=15 | 1.322 |
GROUP GROUP_B vs GROUP_A At fortnight=16 | 1.381 |
GROUP GROUP_B vs GROUP_A At fortnight=17 | 1.389 |
GROUP GROUP_B vs GROUP_A At fortnight=18 | 1.144 |
GROUP GROUP_B vs GROUP_A At fortnight=19 | 1.206 |
GROUP GROUP_B vs GROUP_A At fortnight=20 | 1.072 |
GROUP GROUP_B vs GROUP_A At fortnight=21 | 1.196 |
GROUP GROUP_B vs GROUP_A At fortnight=22 | 1.117 |
GROUP GROUP_B vs GROUP_A At fortnight=23 | 0.906 |
GROUP GROUP_B vs GROUP_A At fortnight=24 | 1.083 |
GROUP GROUP_B vs GROUP_A At fortnight=25 | 0.806 |
GROUP GROUP_B vs GROUP_A At fortnight=26 | 0.761 |
GROUP GROUP_B vs GROUP_A At fortnight=27 | 0.729 |
The graphs below are not the original data but they illustrate the kind of thing happening to the original.
Below on the right is a graph of 'raw' exits showing the percentage leaving each fortnight, with the denominator being the number of people left at the beginning of the fortnight. The two lines are Group A and B.
Below on the left is a graph, plotted by taking the fortnightly 'survival' rate from phreg output, turning it into a failure rate (1-survival), and then calculating a 'hazard' rate each fortnight for each group. You can see this hazard ratio between the groups is essentially constant. Is my interaction variable wrongly specified? I want the hazard ratio between the groups to be free to vary at each fortnight.
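For readers following the calculation, one way to write the conversion described above is sketched here. The notation is an assumption and does not appear in the post: S_g(k) stands for the PHREG survival estimate for group g at the end of fortnight k, and the exact formula the poster used may differ.

```latex
% Assumed notation: S_g(k) = survival estimate for group g at the end of fortnight k, S_g(0) = 1
\[
h_g(k) \;=\; \frac{S_g(k-1) - S_g(k)}{S_g(k-1)} \;=\; 1 - \frac{S_g(k)}{S_g(k-1)},
\qquad
\mathrm{HR}(k) \;=\; \frac{h_B(k)}{h_A(k)}
\]
```

Under this reading, h_g(k) is the share of those still in the program at the start of fortnight k who leave during it, which is the same denominator as in the 'raw' exits graph.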
How did you create this "fortnight" variable? It is not mathematically correct to use start and stop to create the independent variables, because the rates then depend on future information. Instead, if you want an interaction with time, you should create the time variable inside proc phreg, something like this:
proc phreg data=simulation;
period1_a=(t<=5)*a;
period2_a=(t>5)*a;
model t=period1_a period2_a/rl;
run;
There is also another way to make an interaction with time in phreg. For that, you first need to aggregate your data on the risk sets. Once that is done, it becomes "legal" to use the time variable to construct the independent variables. I have made a macro for the aggregation step (coxaggregate). Maybe you will find it useful. Here is a simple example of how it can be used to make an interaction with time. (Time is here an effect modifier on the covariate "a", such that the true hazard ratio for a is 1.5 before time=5 and 2 later on.) Notice that the two phregs give the same result, but only by aggregating on the risk sets can you use the hazardratio statement.
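/* Simulate 1000 subjects with a piecewise-constant effect of a: */
/* hazard ratio 1.5 up to time 5 and 2 afterwards (baseline rate 0.1). */
data simulation;
  do i=1 to 1000;
    a=mod(i,2); /* covariate alternates between 0 and 1 */
    rate1=0.1*exp(log(1.5)*a); /* hazard for t<=5 */
    rate2=0.1*exp(log(2)*a); /* hazard for t>5 */
    t=rand('exponential',1/rate1);
    /* still at risk at time 5: redraw the remaining time at the later rate */
    if t>5 then t=5+rand('exponential',1/rate2);
    event=1; /* every subject has an event, so there is no censoring */
    output;
  end;
  keep t a event;
run;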
proc phreg data=simulation;
period1_a=(t<=5)*a;
period2_a=(t>5)*a;
model t=period1_a period2_a/rl;
run;
%coxaggregate(data=simulation,output=coxout,entry=0,exit=t,event=event,covariate=a)
data coxout;
set coxout;
timegroup=(time<=5);
run;
proc phreg data=coxout nosummary;
class a(ref="0") timegroup/param=glm ;
model dummytime*dummytime(2)= timegroup*a;
hazardratio a/at(timegroup=all);
strata time ;
freq weight;
run;
Hi JacobSimonsen, thank you for your response. I have some clarifications and questions below.
How did you create this "fortnight" variable? It is not mathematically correct to use start and stop to create the independent variables, because the rates then depend on future information. Instead, if you want an interaction with time, you should create the time variable inside proc phreg, something like this:
proc phreg data=simulation;
period1_a=(t<=5)*a;
period2_a=(t>5)*a;
model t=period1_a period2_a/rl;
run;
I did use 'stop' to create the fortnight variable, but I don't know what you mean by 'introduce some future dependency'. As I understand it, the fortnight being considered does not depend on some event in the future, but on how many fortnights have already passed since the beginning of observation. I am using the counting process syntax because I have other time-dependent variables. I don't expect these to interact with time; only the group variable, which does not change, interacts with time.
In the above code, 't' seems to be the survival variable (in time units), and 'a' is the grouping variable that interacts with time. However, I don't want to split time into two parts; I want 26.
Here is a simple example of how it can be used to make an interaction with time. (Time is here an effect modifier on the covariate "a", such that the true hazard ratio for a is 1.5 before time=5 and 2 later on.) Notice that the two phregs give the same result, but only by aggregating on the risk sets can you use the hazardratio statement.
This implies I need to estimate what I believe the hazard ratio to be during each time period?
It is not exactly clear to me what "start" and "stop" are. If you have divided your survival time into subintervals, and "start" and "stop" are the endpoints of those subintervals, then I think you did it right (and you can forget my comment).
But if "stop" is the survival time and you used it to create "fortnight", then the predictors you create introduce a dependency on future events. In Cox regression you go through the timeline, and at each time point you can legally construct the rate using only what has happened in the past. Using the survival time to construct the rate before the event time therefore amounts to conditioning on the future.
Alternatively, the statement that creates the rate can be specified inside proc phreg. The difference is that the survival time from each observation is not used directly; instead, a running time value is used, which depends on where on the timeline the rates are being calculated.
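As a rough illustration of the last point, the two-period example from earlier in the thread can be extended to more periods with an array inside proc phreg. This is only a sketch under assumptions: the four periods, the width of 2.5 time units, and the names a_p1-a_p4 and k are made up for illustration, and it runs against the simulated dataset rather than the real fortnightly data.

```sas
/* Sketch only: the 4 periods of width 2.5 and the names a_p1-a_p4, k are illustrative assumptions */
proc phreg data=simulation;
   model t*event(0) = a_p1 a_p2 a_p3 a_p4 / rl;
   array a_p{4} a_p1-a_p4;
   /* reset all period-specific copies of a */
   do k = 1 to 4;
      a_p{k} = 0;
   end;
   /* inside phreg, t is the running event time of the risk set being evaluated, */
   /* so each subject contributes a to whichever period that time falls in */
   k = min(4, ceil(t / 2.5));
   a_p{k} = a;
run;
```

Each coefficient is then the log hazard ratio for a within its own period (the last period covers all times beyond 7.5). The same pattern with 26 elements would give one hazard ratio per fortnight, provided both groups have events in every period.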
Thanks again for the response, JacobSimonsen.
"Start" and "stop" mark the endpoints of subintervals of time, determined only by the date (every two weeks, a new interval is created on a new row). However, intervals are created only until either the participant leaves the program or 26 fortnights have passed.
So if somebody left in three fortnights, they would have three rows, but if somebody never left, they'd have 26 rows. If somebody left in the third fortnight, their survival time would be equal to the 'stop' variable on that row only. The previous rows would show that the survival event did not occur in that time period.