I have a dataset in which individuals undergo IVF treatment for infertility. Each individual may or may not become pregnant in the first round of treatment. About 50% do not become pregnant and go on to additional rounds of treatment, up to a maximum of 6 rounds.
I want to model the probability that an individual becomes pregnant in the i-th round of treatment, given that they did not become pregnant in the previous rounds. Once the univariate model has been explored, I want to adjust for baseline covariates.
I do not want to run a separate model for each cycle.
I have seen this modelled with a discrete-time logistic regression model, but I am not sure how to do this in SAS.
You might be talking about the discrete survival time or interval-censored model that is illustrated in an example in the PROC LOGISTIC documentation and which can also be fit in PROC ICPHREG. But it sounds like you just have repeated measures data which could be modeled by either a subject-specific random effects model using PROC GLIMMIX or with a population-averaged Generalized Estimating Equations (GEE) model in PROC GEE (or GENMOD). The GEE example in the Getting Started section of the GENMOD documentation seems similar to your situation with your question being similar to getting estimates of the wheeze probability at each age. Note that you can include any covariates in the model as needed. You just need to add an LSMEANS statement with the ILINK option to estimate the wheeze probabilities. Include the CL option to get confidence limits. An example follows. Note that the newer GEE procedure is the recommended procedure for fitting GEE models.
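proc gee data=six;
   class case city age;
   /* event='1' models the probability that wheeze=1 (assumes 0/1 coding) */
   model wheeze(event='1') = city age / dist=bin;
   /* exchangeable working correlation within each subject */
   repeated subject=case / type=exch;
   /* ILINK back-transforms to the probability scale; CL adds confidence limits */
   lsmeans age / ilink cl;
run;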
Thank you for your response.
Is that method appropriate if the goal is the cumulative probability of pregnancy? That is, if a woman became pregnant in the first cycle, the dataset would still have a record for cycle 2 indicating that she became pregnant in cycle 1, so that when counting how many women were pregnant by cycle 2 she would be included as pregnant in that cycle.
If you want to assume that the probability of pregnancy strictly increases (is cumulative) over the stages, then maybe a cumulative logit model is what you want. In that case, the input data would have only one record per subject and would include a variable giving the stage at which pregnancy occurred. This variable is then an ordinal multinomial categorical response variable that can be modeled with a cumulative logistic model in PROC LOGISTIC. For example, if there are different treatments and if age (and possibly others) is a covariate, then the following code fits the model. The OUTPUT statement provides the predicted probabilities both at the individual stage as well as the cumulative up to each stage.
proc logistic;
   class treatment;
   model pregstage(descending)=treatment age;
   /* PREDPROBS=(I C) requests individual and cumulative predicted probabilities */
   output out=preds predprobs=(i c);
run;
Yes, their goal is to model the cumulative probability of getting pregnant, so that when a patient comes in they can tell her the likelihood of pregnancy after completing each cycle/procedure, as the procedures are very costly.
However, I don't think this option solves the problem, as the outcome should be pregnancy and the stage or procedure number a covariate?
For example, if at cycle 1, 50% (50/100) are pregnant, and by the end of cycle 2, 20 more women have become pregnant, then 70% are now pregnant, and so on. This is the outcome they wish to model.
The cumulative logit model in PROC LOGISTIC that I showed will work fine for that purpose. For each observation in the output data set, you will have the probabilities of pregnancy by each possible stage. So, for a given subject, you will have the probability that she will be pregnant at stage 1, by stage 2, by stage 3, and so on (provided by PREDPROBS=c). However, the GEE model I initially showed is another possible approach and is set up more like you suggested. In that case, your STAGE variable would appear where the AGE variable is in the example code I showed. Note that the data structure for the GEE model is different and requires a set of observations for each subject (one at each stage) with a binary response variable.
However, note that the GEE model is a population-averaged model, most appropriate for making general predictions for the sampled population such as estimating the effect of predictors in the model on the population. If your goal is to predict probabilities for individual subjects, then the cumulative logit model is probably more appropriate.
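As a rough sketch of what "STAGE where AGE is" could look like, the earlier PROC GEE code might be adapted as follows. The data set name ivf_long and the variables id, stage, pregnant, and age are hypothetical placeholders, not names from the original post, and the code assumes one record per woman per stage with pregnant coded 0/1.

```sas
/* Sketch only: ivf_long, id, stage, pregnant, and age are assumed names. */
/* Assumes one record per woman per stage and a 0/1 response.             */
proc gee data=ivf_long;
   class id stage;
   /* event='1' so that the probability of pregnancy is modeled */
   model pregnant(event='1') = stage age / dist=bin;
   /* exchangeable working correlation within each woman */
   repeated subject=id / type=exch;
   /* estimated probability at each stage with confidence limits */
   lsmeans stage / ilink cl;
run;
```

How the stage estimates should be read depends on how pregnant is coded. If a pregnancy is carried forward to later stages, as described earlier in the thread, they approximate the proportion pregnant by each stage; if records stop once pregnancy occurs, they are per-stage probabilities among women still in treatment.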
Thank you for all your advice!
There are, however, several women who never got pregnant, so how would they be incorporated in the cumulative logit model? Using cycle=0 to indicate not pregnant doesn't seem appropriate.
Okay, so you have right censored observations. Then we are back to the first part of my first comment which is to use PROC ICPHREG to fit the interval censored model to deal both with your discrete stages as well as the censoring. See the example in the Getting Started section of the ICPHREG documentation. Your data will have multiple observations for the multiple stages for each subject. The left and right time variables in the MODEL statement will define the stages using intervals of (0, 1), (1, 2), (2, 3), ... , (6, .) where the last interval is for those subjects who don't get pregnant. In the MODEL statement, specify the BASE=DISCRETE option. You can add a BASELINE statement with the SURVIVAL= option (and LOWER= and UPPER= options if desired) to obtain estimates of the survival curve(s) over time.
I found this in a paper in a similar area of research but I am unclear exactly how they did this
The cumulative probability of achieving a live birth after x number of cycles was calculated using Kaplan-Meier product-limit estimate [1 − (1 − p_1)(1 − p_2)…(1 − p_x)] × 100%, where p_x is the probability of achieving live birth in cycle x. Three different assumptions were made for estimation of live birth rate.
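To make the quoted product-limit formula concrete, here it is written out and applied to the illustrative numbers given earlier in the thread (50 of 100 women pregnant in cycle 1, 20 more in cycle 2). This is only a worked example: it assumes that all 50 women who were not pregnant after cycle 1 went on to cycle 2, and the symbol \hat{F}(x) for the cumulative probability is my own notation, not the paper's. The \text command requires the amsmath package.

```latex
\[
  \hat{F}(x) \;=\; 1 - \prod_{j=1}^{x} (1 - p_j),
  \qquad
  p_j = \frac{\text{events in cycle } j}{\text{women starting cycle } j}
\]
\[
  p_1 = \frac{50}{100} = 0.5, \quad
  p_2 = \frac{20}{50} = 0.4
  \;\Longrightarrow\;
  \hat{F}(2) = 1 - (1 - 0.5)(1 - 0.4) = 1 - 0.30 = 0.70
\]
```

Note that p_2 is conditional on not having succeeded in cycle 1, which is why its denominator is 50 rather than 100. The result matches the 70% quoted in the thread.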
Hello,
I think the answer by @StatDave is the way to go, but here is some more information on survival data mining (discrete-time logistic hazard regression).
PD Allison is (or was?) a successful author in SAS Press ("books by users" program from SAS).
All of this can be done with PROC LOGISTIC (among other procedures):
SAS/STAT® 15.2 User's Guide
The LOGISTIC Procedure
Example 78.15 Complementary Log-Log Model for Interval-Censored Survival Times
https://go.documentation.sas.com/doc/en/statug/15.2/statug_logistic_examples19.htm
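A minimal sketch of the discrete-time hazard approach for the IVF question might look like the following. It is not taken from the documentation example. The input data set ivf and the variables id, last_cycle (last cycle attempted, 1 to 6), pregnant (1 if pregnant in that last cycle, 0 if never pregnant), and age are all hypothetical names. The idea is to expand the data to one record per woman per cycle attempted and then fit a binary model for the per-cycle event.

```sas
/* Sketch only: ivf, id, last_cycle, pregnant, and age are assumed names. */

/* 1. Expand to person-period form: one record per woman per cycle attempted. */
data ivf_long;
   set ivf;
   do cycle = 1 to last_cycle;
      /* event is 1 only in the cycle in which pregnancy occurred */
      event = (cycle = last_cycle and pregnant = 1);
      output;
   end;
run;

/* 2. Fit the discrete-time hazard model. CYCLE with PARAM=GLM and NOINT */
/*    gives one baseline parameter per cycle. LINK=CLOGLOG corresponds   */
/*    to grouped proportional hazards; the default logit link gives the  */
/*    discrete-time logistic model instead.                              */
proc logistic data=ivf_long;
   class cycle / param=glm;
   model event(event='1') = cycle age / noint link=cloglog;
   output out=haz p=hazard;   /* per-cycle conditional probability */
run;

/* 3. Turn per-cycle hazards into a cumulative probability of pregnancy. */
proc sort data=haz;
   by id cycle;
run;

data cumprob;
   set haz;
   by id;
   retain surv;
   if first.id then surv = 1;
   surv = surv * (1 - hazard);   /* probability of no pregnancy so far */
   cum_preg = 1 - surv;          /* probability of pregnancy by this cycle */
run;
```

Women who never become pregnant simply contribute event=0 records for every cycle they attempted, so right censoring is handled by the data layout. The cumulative probabilities in cumprob extend only to each woman's last attempted cycle; to predict all six cycles for a new patient, you would score a data set containing six rows for her covariate values.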
You can also take a look at this book :
Survival Data Mining: Modeling Customer Event Histories
Will Potts, SAS Institute
John Wiley & Sons Australia, Limited, 01 Apr 2006 - Business & Economics - 224 pages
BR,
Koen