Because so many in academia need data for school, I keep an eye out for open sources. Below are sources (alphabetized) that my predecessor Jennifer Scott pulled together last year and that I add to whenever I can.
Data already cleaned and labeled
There's a lot of interest in curated data sets that are already cleaned, labeled and ready to be mined. For that reason, I draw your attention to an incredible stash of 30 data sets posted on StartUp Grind's Medium blog by Luke de Oliveira, a tech entrepreneur and visiting scientist at Berkeley Labs.
The post, Fueling the Gold Rush: The Greatest Public Datasets for AI, includes links to the data with a legend and other context to help you quickly decide whether to download and analyze.
"Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling," Oliveira writes on StartUp Grind.
"Standard datasets can be used as validation or a good starting point for building a more tailored solution. This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world."
In the comments, add links to any data sets you've found.
This is a list of lists of datasets. There's not much organization here, but there really are a LOT of datasets. Dive in and have fun.
Project Data Sphere, founded by the CEO Roundtable on Cancer, brings together leading pharmaceutical companies to share data and an analytics platform to investigate those data. The platform now provides access to data from more than 40 studies and 25,000 patients from clinical trials of prostate cancer, breast cancer, and melanoma treatments. The list of available studies is growing rapidly.
Obtain health care data and statistics (biostatistics, nursing, epidemiology, etc.) in any form, from raw data to publications.
Obtain public use files to download for analysis, as well as access to pre-digested health statistics.
Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most rides taken on? We’ve heard all of these questions and more from you and now we are happy to provide the datasets to help you discover the answers to these questions and more. Developers, engineers, statisticians, artists, academics and other members of the interested public are encouraged to use the data we provide for analysis, development, visualization and whatever else moves you.
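As a rough illustration of how questions like these could be answered once trip data is downloaded, here is a minimal Python sketch. The sample rows and the column names (`start_time`, `start_station`, `distance_km`) are made up for this example and are assumptions, not the real file layout, so adjust them to whatever the downloaded files actually contain.

```python
import csv
import io
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Tiny made-up sample; the column names are hypothetical, not the real schema.
SAMPLE_CSV = """start_time,start_station,distance_km
2017-03-06 08:15:00,Station A,1.2
2017-03-06 18:40:00,Station B,2.5
2017-03-07 09:05:00,Station A,0.8
2017-03-11 14:30:00,Station C,3.1
2017-03-13 08:20:00,Station A,1.9
"""


def summarize_trips(csv_file):
    """Count rides per start station and per weekday, and average the distance."""
    by_station = Counter()
    by_weekday = Counter()
    total_km = 0.0
    trips = 0
    for row in csv.DictReader(csv_file):
        started = datetime.strptime(row["start_time"], "%Y-%m-%d %H:%M:%S")
        by_station[row["start_station"]] += 1
        by_weekday[WEEKDAYS[started.weekday()]] += 1
        total_km += float(row["distance_km"])
        trips += 1
    average_km = total_km / trips if trips else 0.0
    return by_station, by_weekday, average_km


stations, weekdays, avg_km = summarize_trips(io.StringIO(SAMPLE_CSV))
print("Most popular station:", stations.most_common(1))
print("Busiest weekday:", weekdays.most_common(1))
print("Average distance (km):", round(avg_km, 2))
```

To run this against a real download, pass an open file (for example `open(path, newline="")`) instead of the in-memory sample and rename the columns to match.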
The Duke Clinical Research Institute (DCRI) and SAS provide researchers worldwide with data management and analytics tools to explore 45 years of cardiovascular patient data collected by the Duke University Health System. The DCRI and SAS share the goal of greater transparency and openness in research to improve patient care and find new ways to treat heart disease, the leading cause of death in the United States. SAS' collaboration with DCRI represents a significant milestone for its broader data access initiative, Supporting Open Access for Researchers (SOAR).
Access a cardiovascular data set from the Duke Databank for Cardiovascular Disease (DDCD) at http://soar.dcri.org/data. The databank includes de-identified records for patients treated at Duke between 1969 and 2013, and data from more than 100,000 procedures on more than 50,000 unique patients. The data includes patient demographics, cardiac medical history, other conditions occurring simultaneously (comorbidity), final impressions and subsequent treatments.
This site offers a number of datasets on energy production, consumption, sources, etc.
Hundreds of datasets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.
This is a collection of geospatial datasets offered by Arizona State University’s Center for Geospatial Analysis & Computation.
An international consortium of more than 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 500,000 files of research in the social sciences.
Kaggle is a site that hosts data mining competitions. Each competition provides a data set that's free for download.
The Koblenz Network Collection. Several datasets related to social networking & Wikipedia.
A freely-available collection of audio features and metadata for a million contemporary popular music tracks.
This site includes quick access to many of NCDC's climate and weather datasets, products, and various web pages and resources.
More than 1,100 datasets are currently available on the NYC open data portal, more than any other U.S. city.
This beta site helps users get access to research data that is already online or request that data be made accessible. When a user wants access to the data behind a paper, they can make a data request via the app. The app will then see if the data is already available online, and if not, it will contact the author and invite them to make their research data openly available through the Center for Open Science’s Open Science Framework. Once a dataset is delivered, the author will be rewarded with an Open Data Badge to recognise their efforts.
Great resources for data sets to start learning about open health care data.
The Opportunity Project (U.S. Government)
Through The Opportunity Project, the Administration is releasing a unique package of federal and local datasets in an easy-to-use format and accelerating a new way for the federal government to collaborate with local leaders, technologists, and community members to use data and technology to tackle inequities and strengthen their communities. Read the announcement for more information.
This is a web-based front end to a number of public data sets. What's nice about this website is that it allows for the combination of data from a number of sources, and can export the data in a number of formats.
This one isn't a dataset itself, but rather a social news site devoted to datasets. It's updated regularly with news about newly available datasets.
The SAS office in the UK has a repository of open-source data worth checking out.
SAS Curriculum Pathways is provided at no cost to students and educators in traditional, virtual and home schools, as well as other teaching and learning environments. Its Data Depot houses data sources and focused lessons designed to help you become more data literate.
SAS provides more than 200 data sets in the Sashelp library. 17 of these data sets are used in SAS/STAT documentation and can be used with SAS University Edition. These data sets are available for you to use for examples and for testing code.
Stanford's Large Network Dataset Collection. This list has several datasets related to social networking. Lots of fun in here!
This collection of public data sets, organized topically, covers government, health, space, sports, weather, and much more. There's free data in each set, but there may be a charge for some.
Hosted by CKAN. Most of these datasets come from the government.
Mostly large datasets. The site is losing momentum, but the data available here is still gold.
There are 284 data sets maintained as a service to the machine learning community.
SAS and the United Nations Statistics Division made it possible for anyone to analyze the world's largest bank of international trade data. In the UN Comtrade database, you'll find details on more than 200 countries' import/export habits since 1988 and much more.
The US Census data resources section offers everything from data visualization tools to large data files. Whether you are teaching statistics, research methods, economics or political science, you can find resources here.
Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
The U.S. Energy Information Administration (EIA) collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment.
Literally hundreds of datasets spanning many decades, sortable by topic or country. Data is downloadable in Excel or XML formats, or you can make API calls. This is an outstanding resource.
Yahoo has various types of data available to share. They are categorized into Ratings, Language, Graph, Advertising and Market Data, Computing Systems and an appendix of other relevant data and resources available via the Yahoo! Developer Network.
The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.
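A bzipped file that large is best read as a stream rather than decompressed in one go. The sketch below uses Python's standard `bz2` module to count lines in a compressed stream. The three tab-separated lines are made up as a stand-in, since the actual record layout isn't described here.

```python
import bz2
import io


def count_lines(source):
    """Count lines in a bzip2-compressed file or stream, one line at a time."""
    count = 0
    # bz2.open accepts a path or a binary file object and decompresses lazily.
    with bz2.open(source, mode="rt", encoding="utf-8") as lines:
        for _ in lines:
            count += 1
    return count


# Stand-in for a real archive: three made-up lines compressed in memory.
fake_archive = io.BytesIO(
    bz2.compress(b"user1\titemA\nuser2\titemB\nuser1\titemC\n")
)
line_count = count_lines(fake_archive)
print(line_count)
```

For a file on disk, pass its path to `count_lines` instead of the in-memory stand-in. The same loop can filter or sample lines instead of just counting them.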
Pigeon races, chopstick effectiveness, and crop yields in England from 1211 to 1492 are just some of the quirky data sets you can get from data science vendor Yhat. Seriously, have a look. You'll have fun with these.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field.
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U, then select "Subscribe" from the Options menu.