Survival Analysis (also known as Time-to-event analysis, and usually visualised with a Kaplan-Meier curve) is one of my favourite forms of analysis; it can be used for most data that has a time-based component. When used in the context of patients at a hospital, it is called Survival Analysis; in manufacturing, utilities, and anywhere else there is a start and end time, it is known as Time-to-event analysis.
Get the Data
I have wanted to do an article on Survival Analysis for a while, but I was unable to find a dataset that was ideal for what I wanted to cover. Most datasets that are in healthcare are aggregated to protect the patients’ privacy, which makes it difficult to do this analysis.
I did end up finding a perfect dataset – patient-level, but de-identified. I must admit the dataset was far too large for me to process in SAS University Edition (it’s over 2 million rows), so I had to truncate it down to a more manageable size. I kept only the first site in the list, Albany Medical Center Hospital, which still amounted to 33,000 patients, giving us more than enough data to play with. The 2012 data can be downloaded from here (you can also get the 2011 data).
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions in the PDF carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
After I’ve imported the data, I see that the Length of Stay column is a text column rather than numeric; it turns out that the data has 0-119 days and then “120 +” – so I have to remove the plus sign in the raw data and then reimport it to SAS, and everything is fine. I then do some preliminary exploration (I absolutely love playing with this type of dataset) and decide on a couple of variables I want to highlight here.
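As an alternative to editing the raw file by hand, the same clean-up could be done in a DATA step after import. This is only a minimal sketch: the dataset names work.albany_raw and work.albany, the text column length_of_stay, and the new numeric column los_days are all hypothetical names chosen for illustration, so adjust them to whatever your import produced.

```sas
/* Hypothetical names: work.albany_raw, work.albany, length_of_stay (text), los_days (numeric) */
data work.albany;
  set work.albany_raw;
  /* Remove the "+" and any blanks from values such as "120 +", then read the result as a number */
  los_days = input(compress(length_of_stay, '+ '), 8.);
run;
```

Stays recorded as “120 +” end up as exactly 120 days, which matches the manual fix described above.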
So let’s get to it – what you’ll notice is that this first example is very simple – 3 lines of code, nothing difficult at all!
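A minimal sketch of what such a three-line call might look like, assuming the imported data sits in a dataset called work.albany and the numeric length of stay is in a column called length_of_stay (both names are hypothetical):

```sas
/* Hypothetical names: work.albany (imported data), length_of_stay (numeric days) */
proc lifetest data=work.albany;
  time length_of_stay;   /* no censoring variable given, so every stay is treated as ended */
run;
```

With ODS Graphics enabled, PROC LIFETEST draws the survival plot without any extra options.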
Here’s the graph that is output (note: a couple of tables are also produced, but I’ve left those for another post).
This graph starts at 1.0, or 100% of the patient population, at Day 0, and every time a patient is discharged (their length of stay ends) there is a step down in the graph as the remaining number of patients gets smaller. This continues until 120 days, which is the maximum number of days in the dataset.
This first graph is good but it’s not really informative; we need to split our data into groups. In PROC LIFETEST, the Group option will create one graph per group; I prefer having all the groups on one image to make comparisons easier. The first group I look at is the most logical, Gender. The initial step is to sort the data; the grouping won’t work if SAS doesn’t know precisely where the different levels start / end.
Here’s the code:
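A minimal sketch of the sort-then-stratify step, assuming a dataset called work.albany with columns gender and length_of_stay (all three names are hypothetical; use the names from your own import):

```sas
/* Hypothetical names: work.albany, work.albany_sorted, gender, length_of_stay */
proc sort data=work.albany out=work.albany_sorted;
  by gender;
run;

proc lifetest data=work.albany_sorted notable;   /* NOTABLE suppresses the estimate tables */
  time length_of_stay;
  strata gender;   /* one curve per level, all drawn on the same plot */
run;
```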
You’ll note that I’ve put NOTABLE in the PROC LIFETEST statement; this suppresses those tables I mentioned earlier. The next key point is that I’ve added the STRATA statement (on line 8), which defines our groups.
SAS automatically assigns the colours, and the strata are sorted alphabetically; the curves for Males and Females, at least in this truncated dataset, show no clear difference.
The next stratification variable I wanted to use is type_of_admission; I’ve updated the code accordingly.
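A sketch of how that update might look. The variable type_of_admission is the one named above; the dataset names and length_of_stay are hypothetical placeholders:

```sas
/* type_of_admission is named in the article; the other names are hypothetical */
proc sort data=work.albany out=work.albany_sorted;
  by type_of_admission;
run;

proc lifetest data=work.albany_sorted notable;
  time length_of_stay;
  strata type_of_admission;   /* only the grouping variable changes */
run;
```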
The output is a little more complex as we now have 4 levels:
Digging a little more into the data, there’s a variable that indicates the severity of the illness – when I update my code and plot the graph, I get a very interesting output:
It’s very clear that the “Extreme” cases have noticeably longer stays in hospital than the other three groups, which makes sense. The one aspect of the graph that I feel has been missing is the number of patients in each group, and I’ve updated the code from above to add this:
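A minimal sketch of that change, assuming the severity variable is called severity_of_illness and the sorted data is in work.albany_sorted (both names, and length_of_stay, are hypothetical; check the column names in your own import):

```sas
/* Hypothetical names: work.albany_sorted, length_of_stay, severity_of_illness */
proc lifetest data=work.albany_sorted notable plots=(survival(atrisk));
  time length_of_stay;
  strata severity_of_illness;   /* Extreme, Major, Minor, Moderate */
run;
```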
The plots=(survival(atrisk)) option on the PROC LIFETEST statement requests the survival plot (which is the default, and the same one we’ve seen above). The (atrisk) sub-option adds a table of the number of patients still at risk, shown here:
You can note a couple of aspects right away. First, all patients with a Minor incident are discharged somewhere between 25 and 50 days. Second, the Moderate group is the largest group of patients, with Extreme being the smallest. The groups are alphabetised rather than ordered by severity (Extreme, Major, Moderate, Minor would make more sense); this can be done and I’ll show you how in a later post. The other aspect of these graphs that I’ve not mentioned is the title: “Product-limit survival estimates” is not really understandable, and this can be changed using the TITLE option that will be shown next week.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.