I would like to fit a Cox model on bootstrapped data with 1000 replications. Currently, it takes 1 hour to fit the model on the original data. Ultimately I have to repeat this for several models.
The original data (dat) has 2 million observations, includes only the variables used in the model, and has no missing values. I have access to 2 computers.
Any suggestions to make this more efficient?
Here is the code I am working with:
proc surveyselect data=dat out=dat_boot
seed=3446
method=urs
samprate=1
rep=1000; /* OUTHITS omitted: FREQ NUMBERHITS in PROC PHREG already accounts for repeated selections */
run;
proc phreg data=dat_boot outest=output covout noprint;
by replicate;
freq numberhits;
class zip a b c;
model time*y(0)=x e x*e a b c;
random zip;
run;
Thanks!
Which estimates are you wanting to bootstrap? Are you trying to get CIs that are not provided? With that many observations, I would think the normal approximation (by using the CL option) should be sufficient for the parameters.
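As a rough illustration of the normal-approximation route, the sketch below fits the model once on the original data and requests Wald confidence limits for the hazard ratios. It assumes that the RISKLIMITS option of the MODEL statement is the kind of option meant above; the data set and variable names are taken from the question, and ALPHA=0.05 is only an example value.

```sas
/* Sketch only: one fit on the original data, no resampling */
proc phreg data=dat;
   class zip a b c;
   /* RISKLIMITS adds Wald confidence limits for the hazard ratios;
      ALPHA=0.05 requests 95% limits (example value) */
   model time*y(0) = x e x*e a b c / risklimits alpha=0.05;
   random zip;
run;
```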
One option in traditional SAS is to use those two computers in parallel. On one, submit the PROC with
WHERE replicate <= 500;
and the other with
WHERE replicate > 500;
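Put together with the code from the question, the first half might look like the sketch below. It assumes both computers can read dat_boot (for example from a shared permanent library rather than WORK), and the OUTEST= names output1 and output2 are hypothetical, chosen only so the two result sets do not collide.

```sas
/* Computer 1: replicates 1-500.
   On computer 2, use WHERE replicate > 500 and OUTEST=output2 (hypothetical name). */
proc phreg data=dat_boot outest=output1 covout noprint;
   where replicate <= 500;
   by replicate;
   freq numberhits;
   class zip a b c;
   model time*y(0) = x e x*e a b c;
   random zip;
run;
```

Afterwards the two OUTEST= data sets can be stacked with a DATA step SET statement before summarizing the bootstrap distribution.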
There is a phreg.cox action in SAS Viya, if your company uses Viya. I think it supports the groupby= parameter for BY-group processing across multiple threads.
The ESTIMATE statement can provide estimates and CIs for linear combinations of the effect parameters.
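For instance, the effect of x at a given value of e is a linear combination of the x and x*e parameters. The sketch below is an illustration only: the value e = 1 and the label are arbitrary, x and e are assumed to be continuous (they are not in the CLASS statement in the question), and how ESTIMATE behaves alongside the RANDOM statement should be checked against the PROC PHREG documentation.

```sas
/* Sketch only: names come from the question's model; e = 1 is an arbitrary example */
proc phreg data=dat;
   class zip a b c;
   model time*y(0) = x e x*e a b c;
   random zip;
   /* beta_x + 1*beta_(x*e): log-hazard effect of a one-unit increase in x when e = 1.
      EXP reports it on the hazard-ratio scale; CL adds confidence limits. */
   estimate 'x effect at e=1' x 1 x*e 1 / exp cl;
run;
```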
To answer your question: if you run multiple copies of SAS on the same PC, you are probably going to compete with yourself for resources. So use multiple computers if you pursue the bootstrap idea.
I wonder whether a Bayesian analysis (using the BAYES statement in PROC PHREG) will give you the distribution of the estimates that you need. Anyway, I am not an expert on survival analysis, so I will let others offer their opinions. Good luck.
@pamplemousse822 wrote:
Does opening multiple SAS sessions on the same computer work? Or would that result in the same run time?
Hi @pamplemousse822,
Look at the processor load (e.g., in Windows task manager) while one SAS session is running your PROC PHREG step. If CPU usage is well below 100%, chances are that you can run two (or more) sessions in parallel without doubling (multiplying) run time. I remember a SAS program (not PROC PHREG, though) running at about 12-13% CPU usage. It was using essentially one of the eight available threads of my workstation's quad-core processor. With six SAS sessions in parallel (working on disjoint subsets of the data) CPU usage went up to about 75% (six threads) and thus I got my results almost six times faster.
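One way to set up such disjoint subsets is a small macro that each session calls with its own replicate range. This is a minimal sketch under stated assumptions: the libref boot, its path, the macro name fit_chunk, and the chunk boundaries are all hypothetical, and the bootstrap data must live in a permanent library because each SAS session has its own WORK library.

```sas
/* Hypothetical shared location; replace with a real path visible to every session */
libname boot '/path/to/shared/folder';

%macro fit_chunk(first=, last=);
   /* Fits the model for replicates &first..&last and writes boot.est_<first>_<last> */
   proc phreg data=boot.dat_boot outest=boot.est_&first._&last covout noprint;
      where replicate between &first and &last;
      by replicate;
      freq numberhits;
      class zip a b c;
      model time*y(0) = x e x*e a b c;
      random zip;
   run;
%mend fit_chunk;

/* Session 1 runs this call; session 2 would use first=251, last=500, and so on */
%fit_chunk(first=1, last=250)

/* After all sessions finish: stack every chunk (name-prefix list est_:) */
data boot.est_all;
   set boot.est_:;
run;
```

Whether this pays off depends on the CPU check described above; if one PROC PHREG step already saturates the processor, extra sessions will mostly compete with each other.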
Is your data clustered by zip code? If so, you should Google "bootstrapping clustered data"; you'll find information about the pitfalls of ignoring the original sample structure when resampling.
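If the clustering matters, one possibility is to resample whole zip codes instead of individual rows. The sketch below assumes the CLUSTER statement of PROC SURVEYSELECT for this; the output name dat_boot_zip is hypothetical, and this is an illustration of the idea rather than a vetted cluster bootstrap.

```sas
/* Sketch only: draw zip codes with replacement, keeping all rows of each selected zip */
proc surveyselect data=dat out=dat_boot_zip
   seed=3446
   method=urs
   samprate=1      /* draw as many clusters as there are zip codes */
   rep=1000;
   cluster zip;    /* the sampling unit is the zip code, not the single observation */
run;
```

Without OUTHITS, a zip code drawn more than once appears once with NumberHits > 1, so a FREQ numberhits statement is still needed in the analysis step. Note that a zip code selected twice keeps the same zip value, so the frailty term treats it as one cluster; whether that is acceptable is part of the pitfalls mentioned above.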