I intended today's post to be a continuation of last week’s article on Survival Analysis, but ran into problems getting my computer to cooperate with the code. Let's put that post on hold for now and come back to it (hopefully next week).
So, switching gears. This post focuses on three tasks that I’ve started really exploring and have found to be very useful -- Data Exploration, Summary Statistics and Characterize Data. I’ll use hospital data as that has a good combination of Categorical and Continuous variables.
Get the Data
Using the same dataset from last week's post, "How to survive Survival Analysis in SAS University Edition", I again had to truncate it down to a more manageable size. I kept only the first site in the list, Albany Medical Center Hospital, which still amounted to 33,000 patients, giving us more than enough data to play with. The 2012 data can be downloaded from here (you can also get the 2011 data).
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The only preparation I’ve done is to sort the data by the Length of Stay column, ready for our analyses. Remember that I also changed the values “120 +” to “120” to keep the column entirely numeric.
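If you would rather script that preparation than do it by hand, here is a minimal Python sketch of the same two steps: cleaning the “120 +” value and sorting by length of stay. The rows and the `patient` and `length_of_stay` field names are made up for illustration; only the “120 +” value comes from the real data.

```python
# Hypothetical rows standing in for the hospital extract; only the
# "120 +" value and the idea of a length-of-stay column come from the post.
rows = [
    {"patient": "A", "length_of_stay": "3"},
    {"patient": "B", "length_of_stay": "120 +"},
    {"patient": "C", "length_of_stay": "1"},
    {"patient": "D", "length_of_stay": "14"},
]


def clean_length_of_stay(value):
    """Turn '120 +' into 120 so the whole column is numeric."""
    return int(value.replace("+", "").strip())


for row in rows:
    row["length_of_stay"] = clean_length_of_stay(row["length_of_stay"])

# Sort ascending by the cleaned, numeric length of stay.
rows.sort(key=lambda row: row["length_of_stay"])

print([row["length_of_stay"] for row in rows])
```

Keep in mind that after this change a value of 120 means "120 days or more", not exactly 120.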
The Data Exploration task quickly generates scatter plots in a matrix layout, which makes visual exploration of the data easy.
You can select up to 6 Continuous and 2 Categorical Variables. I selected ones that may actually make some sense based on the data, so let’s take a look at the output:
There are clearly some trends with the data. SAS also provides enhanced analytical options for most tasks, and this one is no exception.
I won’t go through all of them, but when I reduce my variables to LENGTH_OF_STAY and TOTAL_CHARGES, with GENDER as my categorical variable, I can then select Comparative Box Plot and know that it will run fairly quickly.
Not really anything of interest, but boring results are still results, and this eliminates a path of inquiry we may have taken.
The Summary Statistics task sounds pretty straightforward, but it is appealing because it allows me to quickly see what the data looks like.
So right away we see the 50 to 69 year old age group is the largest, the 70+ group has the largest mean, and the 0 to 17 year olds have the largest standard deviation. I immediately have a sense of my data, and all it took was four clicks!
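To make it clear what those three numbers are, here is a minimal Python sketch that computes the count, mean and sample standard deviation of length of stay for each age group. The records are invented for illustration and are not the hospital data, and the group label "70 or Older" is my assumption about how the column is coded.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (age group, length of stay) pairs; not the real data.
records = [
    ("0 to 17", 1), ("0 to 17", 9),
    ("50 to 69", 4), ("50 to 69", 6), ("50 to 69", 5),
    ("70 or Older", 7), ("70 or Older", 9),
]

# Collect the lengths of stay for each age group.
groups = defaultdict(list)
for age_group, stay in records:
    groups[age_group].append(stay)

# N, mean and sample standard deviation (n - 1 denominator) per group.
summary = {
    age_group: {"n": len(stays), "mean": mean(stays), "std": stdev(stays)}
    for age_group, stays in groups.items()
}

for age_group, stats in summary.items():
    print(f"{age_group}: N={stats['n']} "
          f"mean={stats['mean']:.2f} std={stats['std']:.2f}")
```

A group can be the biggest without having the highest mean or the widest spread, which is why it helps to see all three side by side.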
But wait, as fantastic as that is, there’s even more power and flexibility in the Options tab – check out all the ways you can view and analyse your data!
The last task I wanted to touch on, Characterize Data, once again lets you explore your data, but with a slightly different output; combined with the other two tasks, it gives you a fairly complete picture of your data.
This provides two output graphics (as I’ve only selected the one variable) and provides one of my favourite statistics, the Cumulative Percentage. I find this to be very useful when looking at groups, and leads to developing Pareto Charts and other Key Performance Indicators.
Although this is similar to the previous task, having the frequency graph and the cumulative percentage shows us that 52% of our population is 49 years old or younger, and that only 11% of the patients were 18-29 years of age. I’d be interested in seeing a demographics table for the same area, to see if this is in line with the actual population.
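The cumulative percentage is just a running total of the group percentages, taken in the order the groups are listed. Here is a minimal Python sketch of that arithmetic. The counts are made up so the numbers are easy to follow, and the "30 to 49" and "70 or Older" labels are my assumption about how the age groups are coded.

```python
from itertools import accumulate

# Hypothetical patient counts per age group, in display order; not the real data.
counts = {
    "0 to 17": 10,
    "18 to 29": 10,
    "30 to 49": 30,
    "50 to 69": 30,
    "70 or Older": 20,
}

total = sum(counts.values())
running_totals = accumulate(counts.values())

# One row per group: (group, frequency, percent, cumulative percent).
table = [
    (group, count, 100 * count / total, 100 * running / total)
    for (group, count), running in zip(counts.items(), running_totals)
]

for group, count, pct, cum_pct in table:
    print(f"{group:<12} {count:>4} {pct:>6.1f}% {cum_pct:>6.1f}%")
```

Reading down the last column of this toy table, the "30 to 49" row tells you what share of patients are 49 or younger, which is how the cumulative figure quoted above should be read.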
Not only does this task give you a quick and dirty view, but you can also customise it using the Options tab. One of the best options is Date Variables, which lets you see if there are seasonal or other types of trends.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.