I was inspired by Robert Allison’s recent blog on comparing animals’ lifespans, Which Lives Longer, a Honey Bee or Black Ant, and wanted to dig further into the dataset to see if I could find anything interesting.
The dataset is packed with information (metabolic rates, adult and birth weights, litter sizes, etc.), so there are plenty of relationships to explore. I wanted to focus on the ages at which different animals become mature (not in the cleaning-up-after-themselves sense but in the baby-making sense) and found some rather interesting comparisons.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the Data Ready
I used the IMPORT Task to bring my data into my WORK Library, and kept the default name “IMPORT”. The data is in a text file, which is different from a CSV or Excel dataset, so pay attention when importing the data as I’ve seen different computers handle TXT files differently. You should end up with just over 4,000 rows of data.
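If you would rather write code than use the task, the same import can be sketched with PROC IMPORT. This is only a rough equivalent, not the code the task generates: the file name and folder below are hypothetical, and I am assuming the text file is tab-delimited with the column names in the first row.

```sas
/* Keep the original column names (spaces, brackets) instead of
   converting them to underscores. */
options validvarname=any;

/* Hypothetical path and file name: point this at wherever you saved the file. */
proc import datafile="/folders/myfolders/anage_data.txt"
            out=work.import
            dbms=tab        /* assumes a tab-delimited text file */
            replace;
    getnames=yes;           /* first row holds the column names */
    guessingrows=max;       /* scan all rows before choosing column types */
run;
```

If your file uses a different delimiter, swap dbms=tab for dbms=dlm and add a delimiter= statement.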
My first step was to do a scatter plot on the two variables of interest. I have imported the column names without converting them to the “SAS defaults” (with underscores instead of spaces) because I wanted to show a new feature that I just learnt about this week. Most analysis software does not like spaces or special characters in variable names; SAS does allow them, and gives you an easy way to handle them.
If I had used the SAS default, the fields would have been called Female_maturity__days_ and Male_maturity__days_ (note the double _, one for the space and one for the opening bracket). You would then be able to do x=Female_maturity__days_ and you’d be fine; for the purposes of this article however I wanted to keep the spaces and brackets, and use the ‘Female maturity (days)’n convention as shown below. I’ve done this because it makes the labels on the graphs and tables easier to read and you don’t have to worry about changing them later.
You’ll notice that I’ve used max=10000 in the xaxis and yaxis statements; this is to keep the upper boundary of the graph to a pre-defined limit, rather than using one that SAS felt appropriate. It also ensures that the x- and y-axes will be on the same scale, allowing for easier visual comparison.
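For anyone who wants to type along, here is a minimal sketch of a plot like the one described above. It assumes the data is in WORK.IMPORT and that the two columns are named exactly Female maturity (days) and Male maturity (days); it is a reconstruction from the description, not necessarily the exact code I ran.

```sas
/* Name literals ('...'n) only work when VALIDVARNAME=ANY is in effect. */
options validvarname=any;

proc sgplot data=work.import;
    /* Quoted name followed by n lets you use spaces and brackets in a variable name. */
    scatter x='Female maturity (days)'n y='Male maturity (days)'n;
    xaxis max=10000;   /* same upper limit on both axes ... */
    yaxis max=10000;   /* ... so the diagonal is easy to read */
run;
```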
Here’s the scatter plot:
You can very quickly see there’s a pretty straight line running diagonally through the graph. This means that for the majority of animals, the male and female maturation ages are very close if not identical. You will also notice, however, that there are some rather bizarre outliers: one animal has a female maturation age of almost 10000 days (27 years) versus about 3000 for the male (8 years). So how can we pull out the animals with the biggest gaps between maturation ages?
Those of you who have been following along know that I’m a big fan of PROC SQL, and this is another example where it makes life a lot easier.
The first step will be to create a data set that includes only those animals that have both a male and female maturation age, and calculate their difference.
On line 10, I have asked SAS to create a table based on my query, which starts on line 11 with my selecting the Common name and then calculating the absolute difference in ages (there may be cases where the males mature slower than the females, so I set everything to a positive difference). Because I want my final output to have the actual ages, I also include the two columns in the dataset. Lines 14 and 15 limit the data to only the rows that have both (because these two columns are numeric, missing data is indicated by a period).
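As a rough sketch of that query (the line numbers mentioned above refer to my original program, not to this snippet), something along these lines would do the job. The output table name maturity_gap and the column name Difference are placeholders I have picked for illustration.

```sas
options validvarname=any;

proc sql;
    /* maturity_gap and Difference are hypothetical names */
    create table work.maturity_gap as
    select 'Common name'n,
           /* abs() keeps the gap positive whichever sex matures later */
           abs('Female maturity (days)'n - 'Male maturity (days)'n) as Difference,
           'Female maturity (days)'n,
           'Male maturity (days)'n
    from work.import
    /* numeric missing values are stored as a period */
    where 'Female maturity (days)'n ne .
      and 'Male maturity (days)'n ne .;
quit;
```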
Now that we’ve done that, the next step is to pull out the top 10 observations. But I want to take it a step further – 10000 days doesn’t mean much to most people, so I want to calculate the Years. Here’s the code to do that:
Outobs=10 is going to limit the output to the top ten results based on the criteria we specify. I then select the Common name, and take the number of days for each and divide by 365.25 (the .25 takes into account leap years, otherwise your age will be off). I also want something a little cleaner than “20.45456”, so I specify the format as 5.2 (a total width of 5 characters, including the decimal point, with 2 digits after the decimal).
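A minimal sketch of this step is below. It assumes the previous step produced a table called work.maturity_gap with a Difference column; both names are placeholders, as are the new column names Female_years, Male_years and Gap_years.

```sas
options validvarname=any;

proc sql outobs=10;   /* stop after ten rows of output */
    select 'Common name'n,
           /* 365.25 days per year allows for leap years */
           'Female maturity (days)'n / 365.25 as Female_years format=5.2,
           'Male maturity (days)'n   / 365.25 as Male_years   format=5.2,
           Difference / 365.25                as Gap_years    format=5.2
    from work.maturity_gap      /* hypothetical table from the previous step */
    order by Difference desc;   /* biggest gaps first, so the ten rows are the top ten */
quit;
```

SAS writes a warning to the log saying the statement stopped early because of the OUTOBS= option; that is expected here.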
Here’s the table that SAS returns; talk about a difference in ages! For the Lake Sturgeon, males are sexually mature by 8 years of age, but females are not until they’re 26. On the flip side, the False Killer Whale females are not mature until they’re almost 10, whereas males are not until they’re over 18.
I must admit this started me thinking about the impact of environmental changes on these animals – if we are affecting the young of these 10 animals now, that means the effects won’t be felt for some time and so issues we’re not even aware of may become significant problems for our children and grandchildren.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field.
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U.