BookmarkSubscribeRSS Feed

How to Assess and Finesse Messy Data -- Step 3: Filling in the Blanks

Started ‎02-17-2017 by
Modified ‎08-04-2021 by
Views 1,489

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

 

Access Now

 

For this week's installment of Free Data Friday, I wanted to continue on the topic of messy data, this time around the idea of “filling in the blanks.” On Wednesday, as luck would have it, I had an analysis come up at work where I did actually need to expand out University Edition and the missing shopping cart data.jpgthe data set so we could get a more accurate picture of trends.

 

My challenge was that it wasn’t just straight Year | Count data, but rather there were multiple categories to take into account.  I tried PROC EXPAND, but couldn’t seem to get it to work.  I reached out on social media and SAS Technical Support immediately got back to me with a link by @Ksharp that I was able to modify and use.

 

I wanted to take his DATA step method and use it here on an amusing data set I found.

 

Get the data

I found this data set by searching “funny open data” and was intrigued by what turned up. The data comes from the City of Bristol in the UK and is an annual count of shopping carts found in the area’s rivers. I would love to see the business case put forward to track this!  You can get the data here

 

Get started with SAS OnDemand for Academics

 
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
 

Get Started

 

Get the data ready

The data was already in CSV format, but I did modify the data set slightly:

1) I added an ID column grouped by the River.

2) I renamed the variable Number of Trolleys to Number_of_Trolleys. 

 

Both of these could be done through SAS, but I did it in the base CSV as part of my pre-import data review.

 

The results

As usual, I run my pre-analysis checks and I notice that there are a couple of years missing, in particular 2006. (What, did they not FreeDataFriday_graphic.jpgget funding that year to track this?)  Because I’d eventually want to do time series analysis and possibly forecasting, I would need to ensure the missing years are available. 

 

Because the data is broken down by River, PROC EXPAND doesn’t (seem) to work, unless I split my data set into smaller ones by location.  As I mentioned, @Ksharp answered a similar question so let’s take a look at the DATA step I used:

  

 

data import2;
merge import import(firstobs=2 keep=id year river Number_of_Trolleys rename=(year=_y id=_id));
output;
 if id=_id then do;
  do i=year+1 to _y-1;
   year=i; Number_of_Trolleys =.;output;
  end;
end;
drop i _:;
run;

 

 

Now, I’m no Data step expert, but this appears to be a rather simple one. We’re simply taking the data set, merging it to itself, and adding a new Year column with the Number_of_Trolleys variable equal to missing.  From what I gather, I can provide my own calculation for a imputed value here, if I so choose. 

 

Here’s my output before I run the Data step:

 

 11.png

And here it is with the missing year added in:

 

 12.png

 

Now that I have a more complete data set, I can easily move on to deeper analyses. 

 

Just out of curiosity, let’s see if one river has more trolleys show up than the other:

 

13.png

 

For whatever reason, the River Frome (particularly in 2008) clearly had the highest volume.  Not knowing anything about the area, I support possible reasons include more grocery stores, a more dense population, or (based on the area I grew up in) a really good hill that kids use the carts to ride in. (Disclaimer: Do not attempt, it’s extremely dangerous!)

 

 

Version history
Last update:
‎08-04-2021 10:17 AM
Updated by:

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags