☑ This topic is solved.
Laastat
Calcite | Level 5

I'm working on an actuarial project to estimate monthly probabilities that someone becomes disabled.

A portfolio of persons (each with a different 'PolicyNr') is observed for 12 months, and the time until disability is recorded in the variable 'TimetoDisability'. When no disability occurred during the 12 months, the variable 'RightCensored' has the value 1. We also have the variables 'Gender', 'AgeatDisability' (the age at disability, or the age after 12 months for right-censored observations), and 'OccupationClass'.

We have data available in a wide format. For example, the following lines are part of the data:

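| PolicyNr | TimetoDisability | RightCensored | Gender | AgeatDisability | OccupationClass |
|---|---|---|---|---|---|
| 001 | 2 months | 0 | Male | 40 years | 1 |
| 002 | 3 months | 0 | Male | 30 years | 2 |
| 003 | 12 months | 1 | Female | 42 years | 1 |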

Intuitively, I would model this using a Cox proportional hazards model, with the variable 'TimetoDisability' as the time until the occurrence of the disability, and 'Gender', 'AgeatDisability', and 'OccupationClass' as covariates. Monthly probabilities would then be derived from the estimated survival function.
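
To make the last step concrete, here is a minimal Python sketch of one way to derive monthly probabilities from a survival function. It assumes that "monthly probability" means the conditional probability of becoming disabled in month t given no disability before month t, i.e. q(t) = 1 - S(t)/S(t-1) with S(0) = 1. The function name and the example survival values are hypothetical and not taken from any fitted model.

```python
def monthly_probabilities(survival):
    """Convert S(1), ..., S(n) into conditional monthly probabilities.

    q[t] = 1 - S(t) / S(t-1), with S(0) = 1 by convention.
    """
    probs = []
    prev = 1.0  # S(0): everyone is still non-disabled at the start
    for s in survival:
        if prev <= 0.0:
            raise ValueError("survival reached zero; later months are undefined")
        probs.append(1.0 - s / prev)
        prev = s
    return probs


# Hypothetical survival curve for the first three months (illustration only).
example_survival = [0.99, 0.97, 0.96]
example_probs = monthly_probabilities(example_survival)
```
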

 

Now assume that, for practical/technical reasons, only a binomial GLM regression is possible. I read that fitting a binomial GLM to data with pseudo-observations is analogous to a discrete-time Cox survival model. To prepare the analysis, I transform the wide dataset into a long dataset (with pseudo-observations; see, e.g., https://grodri.github.io/glms/notes/c7s6), in which each line is duplicated according to the variable 'TimetoDisability'. For example, the first line of the table above becomes 2 lines, as it took 2 months to become disabled. The last of these lines has the value 1 for the variable 'Disability', as the disability occurred in month 2. The variable 'AgeatDisability' is transformed into the variable 'Age', which now represents the age during that month. The right-censored observation becomes 12 lines, all with the value 0 for 'Disability', as no disability was observed. This becomes:

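| PolicyNr | Duration | Disability | Gender | Age | OccupationClass |
|---|---|---|---|---|---|
| 001 | 1 | 0 | Male | 39 years 11 months | 1 |
| 001 | 2 | 1 | Male | 40 years | 1 |
| 002 | 1 | 0 | Male | 29 years 10 months | 2 |
| 002 | 2 | 0 | Male | 29 years 11 months | 2 |
| 002 | 3 | 1 | Male | 30 years | 2 |
| 003 | 1 | 0 | Female | 41 years 1 month | 1 |
| 003 | 2 | 0 | Female | 41 years 2 months | 1 |
| 003 | 3 | 0 | Female | 41 years 3 months | 1 |
| 003 | 4 | 0 | Female | 41 years 4 months | 1 |
| 003 | 5 | 0 | Female | 41 years 5 months | 1 |
| 003 | 6 | 0 | Female | 41 years 6 months | 1 |
| 003 | 7 | 0 | Female | 41 years 7 months | 1 |
| 003 | 8 | 0 | Female | 41 years 8 months | 1 |
| 003 | 9 | 0 | Female | 41 years 9 months | 1 |
| 003 | 10 | 0 | Female | 41 years 10 months | 1 |
| 003 | 11 | 0 | Female | 41 years 11 months | 1 |
| 003 | 12 | 0 | Female | 42 years | 1 |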
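
The wide-to-long transformation described above can be sketched in Python as follows. This is only an illustration of the expansion rule, not the code actually used: the function name is hypothetical, and the field 'AgeatDisabilityMonths' (age in whole months at disability or at censoring) is an assumption introduced so that the monthly ages can be computed with integer arithmetic.

```python
def expand_to_person_months(records):
    """Expand one row per person into one row per person-month."""
    rows = []
    for rec in records:
        n_months = rec["TimetoDisability"]
        age_end = rec["AgeatDisabilityMonths"]  # assumed field: age in months
        for month in range(1, n_months + 1):
            is_last = month == n_months
            # Only the final month of a non-censored person carries the event.
            disability = 1 if (is_last and rec["RightCensored"] == 0) else 0
            # Age during this month, counted back from the age at the end.
            age_months = age_end - (n_months - month)
            rows.append({
                "PolicyNr": rec["PolicyNr"],
                "Duration": month,
                "Disability": disability,
                "Gender": rec["Gender"],
                "Age": divmod(age_months, 12),  # (years, months)
                "OccupationClass": rec["OccupationClass"],
            })
    return rows


# The three example policies from the wide table above.
wide = [
    {"PolicyNr": "001", "TimetoDisability": 2, "RightCensored": 0,
     "Gender": "Male", "AgeatDisabilityMonths": 40 * 12, "OccupationClass": 1},
    {"PolicyNr": "002", "TimetoDisability": 3, "RightCensored": 0,
     "Gender": "Male", "AgeatDisabilityMonths": 30 * 12, "OccupationClass": 2},
    {"PolicyNr": "003", "TimetoDisability": 12, "RightCensored": 1,
     "Gender": "Female", "AgeatDisabilityMonths": 42 * 12, "OccupationClass": 1},
]
long_rows = expand_to_person_months(wide)
```

Applied to the three example policies, this yields 2 + 3 + 12 = 17 person-month rows, matching the long table.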

 

Question:

In this long data format, the multiple rows (pseudo observations) for each person are not independent. We have repeated measures for each person.

However, I read in Therneau and Grambsch: (quote)

"One concern that often arises is that observations [on the same individual] are "correlated," and would thus not be handled by standard methods. This is not actually an issue. The internal computations for a Cox model have a term for each unique death or event time..."

So for a Cox Discrete Time Survival model, the dependency is not an issue.

However, I don't see why the dependency in the data would not be an issue for a binomial GLM regression.

Given the dependency in the data, is it appropriate to fit a GLM to get trustworthy estimates of the monthly probabilities? Or should I use a mixed-effects model?

 

Thank you.

Accepted Solutions
sbxkoenk
SAS Super FREQ

Due to lack of time, I have only skimmed your post.

See here. Might be useful.

It's about discrete-time logistic hazard regression (aka survival data mining) -- to be done with PROC LOGISTIC :
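
As a side note (this sketch is not taken from the linked material): a standard argument for why the person-month rows can be fed to an ordinary binomial fit is that the discrete-time survival likelihood of each person factors exactly into one Bernoulli term per person-month. The minimal Python check below illustrates this numerically. The hazard values are made up, the function names are hypothetical, and for simplicity everyone shares the same monthly hazards, whereas in a regression they would depend on the covariates.

```python
import math


def person_level_likelihood(hazards, time, censored):
    """Contribution of one person from the survival-model point of view."""
    # Probability of no disability in months 1 .. time-1, i.e. S(time-1).
    surv_before = math.prod(1.0 - h for h in hazards[:time - 1])
    h_t = hazards[time - 1]
    if censored:
        return surv_before * (1.0 - h_t)  # S(time): still not disabled
    return surv_before * h_t  # P(T = time): disabled in month `time`


def person_month_likelihood(hazards, time, censored):
    """Same contribution as a product of Bernoulli terms, one per row."""
    lik = 1.0
    for month in range(1, time + 1):
        y = 1 if (month == time and not censored) else 0
        h = hazards[month - 1]
        lik *= h if y == 1 else 1.0 - h
    return lik


# Hypothetical monthly hazards for months 1..12 (illustration only).
hazards = [0.01 + 0.002 * m for m in range(12)]
# (TimetoDisability, right-censored?) for the three example policies.
people = [(2, False), (3, False), (12, True)]
diffs = [
    abs(person_level_likelihood(hazards, t, c)
        - person_month_likelihood(hazards, t, c))
    for t, c in people
]
```

Both functions return the same value for every person, up to floating-point error. That is why, under this model, maximising the binomial likelihood over the long dataset corresponds to maximising the discrete-time survival likelihood. Unobserved heterogeneity between persons is a separate modelling question that this check does not address.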

 

Good luck with your modelling efforts,

Koen
