How to Assess and Finesse Messy Data -- Step 3: Filling in the Blanks

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

Access Now

For this week's installment of Free Data Friday, I wanted to continue on the topic of messy data, this time around the idea of “filling in the blanks.” On Wednesday, as luck would have it, I had an analysis come up at work where I did actually need to expand out University Edition and the missing shopping cart data.jpg the data set so we could get a more accurate picture of trends.

My challenge was that it wasn’t just straight Year | Count data, but rather there were multiple categories to take into account. I tried PROC EXPAND, but couldn’t seem to get it to work. I reached out on social media and SAS Technical Support immediately got back to me with a link by @Ksharp that I was able to modify and use.

I wanted to take his DATA step method and use it here on an amusing data set I found.

Get the data

I found this data set by searching “funny open data” and was intrigued by what turned up. The data comes from the City of Bristol in the UK and is an annual count of shopping carts found in the area’s rivers. I would love to see the business case put forward to track this! You can get the data here.

Get started with SAS OnDemand for Academics

In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

Get the data ready

The data was already in CSV format, but I did modify the data set slightly:

1) I added an ID column grouped by the River.

2) I renamed the variable Number of Trolleys to Number_of_Trolleys.

Both of these could be done through SAS, but I did it in the base CSV as part of my pre-import data review.

The results

As usual, I run my pre-analysis checks and I notice that there are a couple of years missing, in particular 2006. (What, did they not get funding that year to track this?) Because I’d eventually want to do time series analysis and possibly forecasting, I would need to ensure the missing years are available.

Because the data is broken down by River, PROC EXPAND doesn’t (seem) to work, unless I split my data set into smaller ones by location. As I mentioned, @Ksharp answered a similar question so let’s take a look at the DATA step I used:

data import2;
merge import import(firstobs=2 keep=id year river Number_of_Trolleys rename=(year=_y id=_id));
output;
 if id=_id then do;
  do i=year+1 to _y-1;
   year=i; Number_of_Trolleys =.;output;
  end;
end;
drop i _:;
run;

Now, I’m no Data step expert, but this appears to be a rather simple one. We’re simply taking the data set, merging it to itself, and adding a new Year column with the Number_of_Trolleys variable equal to missing. From what I gather, I can provide my own calculation for a imputed value here, if I so choose.

Here’s my output before I run the Data step:

And here it is with the missing year added in:

Now that I have a more complete data set, I can easily move on to deeper analyses.

Just out of curiosity, let’s see if one river has more trolleys show up than the other:

For whatever reason, the River Frome (particularly in 2008) clearly had the highest volume. Not knowing anything about the area, I support possible reasons include more grocery stores, a more dense population, or (based on the area I grew up in) a really good hill that kids use the carts to ride in. (Disclaimer: Do not attempt, it’s extremely dangerous!)