I will suggest that there should be implemented an option in phreg for which algorithm to use when the Cox-partial likelihood function is maximized.
The calculation time of phreg seem to be somethin like O(number_of_events * population_size), -at least when left-truncation is present. The reason for this is that there is at many risk set as there are events, and for each riskset all indiduals should be veriedfied whether they belong to the riskset or not.
However, there is a faster algorithm: It can be used that the scorefunction has the form of a integral (called the scoreprocess).
Y=at-risk-indicator, N(s) is the counting processes, X is the covariates and beta is the parameter.
The idea is to keep in memory some of the numbers in the integrand (the numerator, the denominator and the sum of X's). When there is a new jump of N(s) these numbers should be updated and then the integrand can be recalculated. In this way we avoid to go through the whole cohort for each riskset, but so to say just make small corrections from one riskset to the new riskset. With this idea I created a macro which can calculate the estimates in a calculation time which is linear to the number of individuals. This can produce similar results to phreg, but (especially for large cohorts) it goes much faster (thousand times faster for VERY large cohorts).
I have implemented an easy-to use macro, attached here. Also, I attached the matrix-algebra packeage which is a requirement, and a testprogram which shows how the macro should be used. It is quite easy, I think. If anyone has suggestions for improvements I will be happy to know. Although I could not get the macro to run as fast as STATA's stcox, it runs much faster than phreg. The difference between these two implementation ofcourse depends on the nature of the data.
It would though be much better to have the faster algoritm in phreg than in the macro, since phreg has a much better implementation of construction the designmatrix (the class statement). In the attached macro, the user has to be sure that no overparametrization occur, which is no problem in phreg.
A comparison between the "risk-set"- algorithm and the "dynamic" algorithm is shown in the figure. I here simulated datasets with different number of observations and analysed with phreg, and the attached cox-macro. This is a fair comparison (same data, same machine). It shows how the phreg as a quadratic calculation-time, while it can be done linearly.
It is my experience that phreg only slow when there are either time-dependent covariates or left-truncation. It may therefore be that there are different algorithms implemented in phreg.
With the the above mentioned algorithm left-truncation is no problem, and time-dependent variables can be handles as long as covariates only depend on the underlying time.
I am aware that there are some features which is not possible with the fast algorithm which is possible with the algorithm implemented today. Therefore, what I hope is that there could be implemented an option in phreg which indicates what algoritm to use.
Has anyone done similar thoughts about efficient algorithm for cox-regression?
ps. I have changed a bit: A collegue of me pointed out that it is imprecise to say that integrand should be kept in memory. Also some of the numbers in the integrand should be kept in the memory. Sorry.