02-01-2015 10:26 PM
In one of my projects, I will not be getting patient-level data for survival analysis (the usual time to an event, treatment, and baseline covariates) but summarized data. I have to suggest a proper way to summarize patient-level data for survival analysis, so I am trying to figure out the best possible summary format that will be amenable to modeling.
One type of summary data I have in mind is attached; it is made-up data. The actual data will have more time points and covariates. All covariates, including age, will be categorical.
This data set has time points which you can take to be 6-month intervals (1 corresponds to 0-6 months, 2 corresponds to 6-12 months, and 3 corresponds to > 12 months), the number of events during each interval, the number of patients at risk at the beginning of each interval, and two categorical variables. One of the categorical variables has three levels and the other (let's say treatment), two levels. Clearly, the number of patients at risk at the beginning of each interval (except the first) excludes patients who had an event or were censored in the previous interval. This table has one record for each level of each categorical variable and each time period.
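To make the structure concrete, here is a minimal Python sketch of how such a life-table-style summary could be built from patient-level records. The data, variable names, and the choice to group by the joint covariate pattern are all illustrative assumptions, not the actual data or program from this thread:

```python
# Hypothetical patient-level data: (interval, event, covariate, treatment),
# where interval is the 6-month period in which the patient had an event
# (event = 1) or was censored (event = 0): 1 = 0-6 months, 2 = 6-12, 3 = >12.
patients = [
    (1, 1, "A", "trt"), (1, 0, "A", "trt"), (2, 1, "A", "trt"),
    (2, 1, "B", "ctl"), (3, 0, "B", "ctl"),
    (1, 0, "C", "ctl"), (3, 1, "C", "trt"), (3, 0, "C", "ctl"),
]

intervals = [1, 2, 3]

# One summary row per joint covariate pattern and interval: events during
# the interval and number at risk at its start. The risk set shrinks by the
# previous interval's events and censorings, which is exactly the
# conditional structure described above.
summary = []
for cov, trt in sorted({(c, t) for _, _, c, t in patients}):
    group = [(i, e) for i, e, c, t in patients if c == cov and t == trt]
    at_risk = len(group)
    for iv in intervals:
        events = sum(1 for i, e in group if i == iv and e == 1)
        censored = sum(1 for i, e in group if i == iv and e == 0)
        summary.append({"interval": iv, "cov": cov, "trt": trt,
                        "at_risk": at_risk, "events": events})
        at_risk -= events + censored
```

Note that the summary has (number of covariate patterns) x (number of intervals) rows, so with many covariates it can easily exceed the number of patients, as discussed later in this thread.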
I have two questions. Is this the best way to organize the data? If so, what model would be suitable? The log-linear model is not, because it does not take into account the conditional nature of the data: the number at risk at the beginning of each interval is clearly determined by the events and censoring in the previous interval. I would like to know what you would suggest.
02-04-2015 12:57 PM
I split the program into two parts: one to create the summary table and the other to run the Cox model. They both work on the test examples provided by Jacob Simonsen. Although the number of records in the summary table may be larger than in the individual patient data table, I think this is the only summary table that would work. Thanks for pointing it out.
02-04-2015 04:55 PM
It is correct that the summary table can be larger than the original dataset. Using the original dataset will cause PHREG to create a temporary dataset that can be very large. By using the aggregated data you avoid the large temporary dataset and thereby save calculation time. Of course, this matters only if your dataset is large.
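The saving comes from collapsing identical rows into one row carrying a frequency weight (in SAS, PROC PHREG accepts such counts through its FREQ statement). A small Python sketch of the idea, with made-up rows that are not from the actual program in this thread:

```python
from collections import Counter

# Hypothetical rows after expanding to one record per patient per interval
# at risk: (interval, event, covariate, treatment). Identical rows are
# collapsed into a single row plus a frequency weight, which is why the
# aggregated table stays small however many patients share a pattern.
expanded = [
    (1, 0, "A", "trt"), (1, 0, "A", "trt"), (1, 1, "A", "trt"),
    (2, 0, "B", "ctl"), (2, 0, "B", "ctl"), (2, 1, "A", "trt"),
]
aggregated = Counter(expanded)
# The row (1, 0, "A", "trt") now carries weight 2 instead of appearing
# twice; the total weight still equals the original number of rows.
```

The fit on the weighted rows reproduces the fit on the expanded rows, since each weighted row contributes its count to the likelihood.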