For this week's installment of Free Data Friday, I wanted to continue on the topic of messy data, this time around the idea of “filling in the blanks.” On Wednesday, as luck would have it, I had an analysis come up at work where I did actually need to expand out the data set so we could get a more accurate picture of trends.
My challenge was that it wasn’t just straight Year | Count data, but rather there were multiple categories to take into account. I tried PROC EXPAND, but couldn’t seem to get it to work. I reached out on social media and SAS Technical Support immediately got back to me with a link by @Ksharp that I was able to modify and use.
I wanted to take his DATA step method and use it here on an amusing data set I found.
Get the Data
I found this data set by searching “funny open data” and was intrigued by what turned up. The data comes from the City of Bristol in the UK and is an annual count of shopping carts found in the area’s rivers. I would love to see the business case put forward to track this! You can get the data here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The data was already in CSV format, but I did modify the data set slightly:
1) I added an ID column grouped by the River.
2) I renamed the variable Number of Trolleys to Number_of_Trolleys.
Both of these could be done through SAS, but I did it in the base CSV as part of my pre-import data review.
As usual, I run my pre-analysis checks and I notice that there are a couple of years missing, in particular 2006. (What, did they not get funding that year to track this?) Because I’d eventually want to do time series analysis and possibly forecasting, I would need to ensure the missing years are available.
Because the data is broken down by River, PROC EXPAND doesn’t (seem) to work, unless I split my data set into smaller ones by location. As I mentioned, @Ksharp answered a similar question so let’s take a look at the DATA step I used:
data import2; merge import import(firstobs=2 keep=id year river Number_of_Trolleys rename=(year=_y id=_id)); output; if id=_id then do; do i=year+1 to _y-1; year=i; Number_of_Trolleys =.;output; end; end; drop i _:; run;
Now, I’m no Data step expert, but this appears to be a rather simple one. We’re simply taking the data set, merging it to itself, and adding a new Year column with the Number_of_Trolleys variable equal to missing. From what I gather, I can provide my own calculation for a imputed value here, if I so choose.
Here’s my output before I run the Data step:
And here it is with the missing year added in:
Now that I have a more complete data set, I can easily move on to deeper analyses.
Just out of curiosity, let’s see if one river has more trolleys show up than the other:
For whatever reason, the River Frome (particularly in 2008) clearly had the highest volume. Not knowing anything about the area, I support possible reasons include more grocery stores, a more dense population, or (based on the area I grew up in) a really good hill that kids use the carts to ride in. (Disclaimer: Do not attempt, it’s extremely dangerous!)
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.