nathanb_1993
Calcite | Level 5

Hi,

 

I was hoping someone could help me with some confusion around the Survival node in Enterprise Miner, particularly with the way that my data is structured.

 

Some quick background: I'm attempting to model the time to default on a mortgage, where the event is a simple 1 or 0 binary flag. For every ID I have the _T_ variable, which is my time since observation. It starts at 0 and increments across the observations for the following months, with an end date recorded if the event mentioned above happens.

 

Within my data, _t_ = 0 is the initial observation month, and no accounts have hit the event described above at that point. As a result, I've noticed a couple of issues with the way the cubic splines are being calculated, mainly that the hazard function has a sharp spike at _t_ = 1, as this is actually where my highest event rate occurs.

 

Upon noticing that the sharp spike is being used as part of the spline fitting, I realise this is most likely not what I want to happen. Am I correct to include _t_ = 0 with no events in my dataset when using the Survival node?
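
For anyone following along, the shape of the problem can be reproduced with a small sketch. This is only an illustration in Python, not what the Survival node does internally: the rows, the account IDs, and the `empirical_hazard` helper are all made up for the example, and the hazard here is just events divided by accounts at risk in each month.

```python
from collections import defaultdict

# Hypothetical expanded rows: (account_id, t, event).
# Every account has a t = 0 row with no event, as described in the post.
rows = [
    ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 0), ("B", 1, 0), ("B", 2, 0), ("B", 3, 0),
    ("C", 0, 0), ("C", 1, 1),
    ("D", 0, 0), ("D", 1, 0), ("D", 2, 0),
]

def empirical_hazard(rows):
    """Events divided by accounts at risk, per value of t."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for _account_id, t, event in rows:
        at_risk[t] += 1
        events[t] += event
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}

hazard = empirical_hazard(rows)
for t, h in hazard.items():
    print(t, h)
```

With these made-up rows the hazard is zero at _t_ = 0 by construction and jumps at _t_ = 1, which is the kind of step a smooth spline would have to bend around.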

 

 

Please let me know if some more information is needed.

 

Thanks!

Nathan

1 ACCEPTED SOLUTION

Accepted Solutions
WendyCzika
SAS Employee

If you have time-varying covariates, then yes you need to use an expanded form of the data, but if not, you can have just 1 obs. per ID.

 

There are a couple of videos available that can help you with formatting your data, this one for the standard format without time-varying covariates:

http://www.sas.com/apps/webnet/video-sharing.html?player=brightcove&width=640&height=360&autoStart=t...

 

And this one for the expanded format: 

http://www.sas.com/apps/webnet/video-sharing.html?player=brightcove&width=640&height=360&autoStart=t...
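
As a rough picture of the two layouts mentioned above, here is a minimal Python sketch that converts between one record per ID and an expanded, one-row-per-month form. The tuple layout `(account_id, t, event)` and the helper names `expand` and `collapse` are assumptions made for illustration; the videos remain the reference for what the Survival node actually expects.

```python
def expand(account_id, last_t, event):
    """One row per month from t = 0 to last_t; the event flag is set only on the final row."""
    return [(account_id, t, int(event and t == last_t)) for t in range(last_t + 1)]

def collapse(rows):
    """Reduce expanded rows to one (account_id, last_t, event) record per ID."""
    summary = {}
    for account_id, t, event in rows:
        last_t, seen = summary.get(account_id, (0, 0))
        summary[account_id] = (max(last_t, t), max(seen, event))
    return [(account_id, last_t, seen) for account_id, (last_t, seen) in summary.items()]

expanded = expand("A", 2, 1) + expand("B", 3, 0)
standard = collapse(expanded)
```

Collapsing like this only loses nothing when the covariates are constant within an ID, which is why the expanded form is the one to use with time-varying covariates.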


7 REPLIES 7
Reeza
Super User

@nathanb_1993 wrote:

Hi,

 

I was hoping someone could help me with some confusion around the Survival node in Enterprise Miner, particularly with the way that my data is structured.

 

Some quick background: I'm attempting to model the time to default on a mortgage, where the event is a simple 1 or 0 binary flag. For every ID I have the _T_ variable, which is my time since observation. It starts at 0 and increments across the observations for the following months, with an end date recorded if the event mentioned above happens.

 

 

Without fully seeing what you're doing, my initial concern is that your data structure is not correct for survival modeling. Review the generated code; if it's PROC PHREG, your data structure will likely need to be modified. You can review the examples in the PROC PHREG documentation to see how to set up your data.

 

nathanb_1993
Calcite | Level 5

Hi Reeza,

 

Below is an image which shows 3 (simple) examples of observations in my dataset. Hopefully this will help in understanding my data setup; obviously my actual dataset is considerably larger than this! The image shows the two outcomes in my dataset: either EVENT = 1 if the event eventually happened, in which case END_DATE is populated, or EVENT = 0 at the end of the ID, in which case END_DATE is not populated. Each ID is independent.

Data Set Up.PNG

Don't worry about the extra Variables, just in there as an example.

 

As mentioned previously, there are no instances where we have start_date = end_date since _t_ = 0 has no events as this is an initial observation month.
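
A layout like the one described can be sanity-checked with a short script. The sketch below is in Python and rests on assumptions the post does not confirm: rows are reduced to hypothetical `(account_id, t, event)` tuples, _t_ runs 0, 1, 2, ... without gaps within an ID, and the event flag is set only on an ID's final row.

```python
from collections import defaultdict

def check_layout(rows):
    """Return a list of problems found; an empty list means all checks passed."""
    by_id = defaultdict(list)
    for account_id, t, event in rows:
        by_id[account_id].append((t, event))

    problems = []
    for account_id, obs in by_id.items():
        obs.sort()  # order each ID's rows by t
        times = [t for t, _ in obs]
        flags = [e for _, e in obs]
        if times != list(range(len(times))):
            problems.append(f"{account_id}: t is not 0, 1, 2, ... without gaps")
        if flags[0] == 1:
            problems.append(f"{account_id}: event recorded at t = 0")
        if any(flags[:-1]):
            problems.append(f"{account_id}: event flagged before the final row")
    return problems

good = [("A", 0, 0), ("A", 1, 1), ("B", 0, 0), ("B", 1, 0)]
print(check_layout(good))
```

If the real data flags EVENT differently (for example on every row of an ID that eventually defaults), the last check would need adjusting.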


Reeza
Super User

Yeah, pretty sure that's not the setup required for survival analysis, at least not in Base SAS. You need a single record per ID, from what I understand. This assumes it's using PROC PHREG behind the scenes. If it's not, then you may be OK, but I strongly suspect this is not the correct data structure.

nathanb_1993
Calcite | Level 5

So for Enterprise Miner, we believe we are required to set the data up in this structure. From the SAS documentation and guidance available, we need multiple records per ID to capture the change in the _t_ variable, which is time since observation.

 

No worries though, I'll keep this in mind and see if anything is happening behind the scenes. Thanks for the input.

nathanb_1993
Calcite | Level 5

Hi Wendy,

 

Thanks for this, I'll have a watch of the videos when I'm back in the office tomorrow morning.

 

As a little extra context (and you may be able to highlight what I've done incorrectly with _t_), here is the hazard function I have output. My problem is with the initial spike at 1. When the node does the regression, it's treating 0 as a very favourable month (since no events occur there, as it's my observation month).

 

Hazard.PNG

 

Thank you

nathanb_1993
Calcite | Level 5
Hi,

Very helpful video! I actually think we used it to expand our data, which I've had a look over, and it all seems to be OK.

I'm still having the issue of the _t_ = 0 month being included in the regression as a favourable month, as there are no events. My previous post may help a little in explaining my confusion.

Is there a way that this can be avoided, i.e. a way to avoid my hazard starting at 0?

I realise this is probably quite confusing, but hopefully you can see where I'm coming from?

Thanks
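
To make the question concrete, one thing "not starting the hazard at 0" could mean is dropping the observation-month rows and re-indexing time so that the first month with events becomes the origin. The Python sketch below only shows what that does to a simple events-over-at-risk hazard on made-up rows; whether this is appropriate input for the Survival node is not settled in this thread, and `drop_observation_month` is a hypothetical helper, not a SAS feature.

```python
from collections import defaultdict

# Hypothetical expanded rows: (account_id, t, event), with an event-free t = 0 month.
rows = [
    ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 0), ("B", 1, 0), ("B", 2, 0), ("B", 3, 0),
    ("C", 0, 0), ("C", 1, 1),
    ("D", 0, 0), ("D", 1, 0), ("D", 2, 0),
]

def empirical_hazard(rows):
    """Events divided by accounts at risk, per value of t."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for _account_id, t, event in rows:
        at_risk[t] += 1
        events[t] += event
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}

def drop_observation_month(rows):
    """Remove the t = 0 rows and shift the remaining months down by one."""
    return [(account_id, t - 1, event) for account_id, t, event in rows if t > 0]

hazard_before = empirical_hazard(rows)
hazard_after = empirical_hazard(drop_observation_month(rows))
print(hazard_before)
print(hazard_after)
```

In this toy data the re-indexed hazard starts at its peak instead of climbing from zero, so there is no artificial step between the first two months for a spline to fit.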

