The Internet has made sharing data as easy as sharing a link to a video of a keyboard-playing cat. Data portals like the well-known http://www.data.gov/ in the US and http://open-data.europa.eu/en/data/ in the EU provide many different kinds of data for research purposes. Beyond the portals, almost every website that discusses data provides a URL, an API, or some other means to download all or part of the data being discussed.
Downloading the data is easy. Making sense of the data you downloaded, and finding meaning in it, is often not. Without being careful, you might conclude that Google’s stock price moves in the exact opposite direction from the rest of the stock market, or that there were 22.5 billion people on Earth in 2010, or, worse, you might perform an analysis that is just as wrong but less obviously mistaken.
Each data set has its own challenges. The data set you thought included one observation per row may also include summary rows with totals, with no easy way to separate one kind of record from the other. In a different data set, a missing value in a column may mean that no data were collected, but should it be treated as missing or as 0? Even when the data sets are provided in SAS format, using the data as you found them may result in statistical analyses that seem very scientific but are, in fact, just wrong.
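To make these two pitfalls concrete, here is a minimal sketch. It is written in plain Python rather than SAS, purely as an illustration, and the regions, numbers, and the "Total" label are invented for the example rather than taken from any real data set. It shows how a summary row and a missing value can each quietly change a simple average.

```python
# Hypothetical data: (region, units sold). None marks a missing value.
rows = [
    ("East", 10),
    ("West", None),   # missing: never collected, or really zero?
    ("North", 20),
    ("South", 30),
    ("Total", 60),    # summary row mixed in with the observations
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Naive: use every row as found and treat missing as 0.
naive = [0 if units is None else units for _, units in rows]

# Careful: drop the summary row first (assumes it is labeled "Total").
observations = [(region, units) for region, units in rows if region != "Total"]

# Then make the missing-value decision explicitly, one way or the other.
missing_excluded = [units for _, units in observations if units is not None]
missing_as_zero = [0 if units is None else units for _, units in observations]

naive_mean = mean(naive)
excluded_mean = mean(missing_excluded)
zero_mean = mean(missing_as_zero)

print(naive_mean, excluded_mean, zero_mean)
```

The same four observations give three different averages depending on choices that are easy to make without noticing. None of the three is automatically right; the point is that the choice should be deliberate.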
In this series, I will use SAS University Edition to demonstrate techniques you can use to access open data. I will also describe some of the detective work you might need to do to be certain that you understand the data you are using, including identifying whether data are missing and how to treat missing values, generating descriptive statistics, running correlations, and graphing relationships. The series will look at a wide range of data, from earthquakes to football to campus crime and the stock market.
I am also interested in hearing from you. If you have had similar experiences that you would like to share, I would be delighted to post your code and observations. Start by commenting below!
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label: click "Find A Community" in the right nav and select SAS Communities Library at the bottom of the list. Then, in the Labels box in the right nav, click Analytics U and select "Subscribe" from the Options menu.