(Editor's note: Data sources were last confirmed and updated by @BeverlyBrown on 5/22/2024; * denotes most recently added data set.)
Because so many in academia need data for school, I keep an eye out for open sources. Below are sources (alphabetized) that my predecessor Jennifer Scott pulled together and that I add to occasionally. Recently, SAS and the National Institutes of Health gave public health researchers access to SAS Analytics Pro on SAS® Viya® in the All of Us Researcher Workbench, the cloud-based platform that allows registered researchers to study data contributed by the All of Us participant cohort. For decades, researchers have relied on SAS to analyze biomedical and clinical data as part of their innovation toolset. Now, researchers will be able to wield their SAS expertise to explore data contributed by more than 400,000 All of Us participants who reflect the rich diversity of the United States, including communities that have been historically underrepresented in biomedical research.
Read on to see the data sets. In the comments, add links to any data sets you've found interesting and useful.
*All of Us Research Hub, National Institutes of Health
The NIH's All of Us Research Program is building one of the largest biomedical data resources of its kind. Its research hub stores health data from a diverse group of participants from across the U.S. Register for the Researcher Workbench to access data and tools to conduct health research and improve understanding of health and disease.
Cancer research data through Project Data Sphere
Project Data Sphere, founded by the CEO Roundtable on Cancer, brings together leading pharmaceutical companies to share data and an analytics platform to investigate those data. The platform now provides access to data from more than 40 studies and 25,000 patients from clinical trials of prostate cancer, breast cancer, and melanoma treatments. The list of available studies is growing rapidly.
Census Bureau (U.S. government)
Visit this site to tap into a key national resource, serving as a fuel for entrepreneurship and innovation, scientific discovery, and commercial activity.
CHIS - California Health Interview Survey
Obtain public use files to download for analysis, and access pre-digested health statistics.
Citi Bike - NYC Bike Share Data
Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most rides taken on? We’ve heard all of these questions and more from you and now we are happy to provide the datasets to help you discover the answers to these questions and more. Developers, engineers, statisticians, artists, academics and other members of the interested public are encouraged to use the data we provide for analysis, development, visualization and whatever else moves you.
Data USA
Compiled by Deloitte, this is a massive collection of US stats and data sources that, according to the site's "About us," "tells millions of stories about America. Through advanced data analytics and visualization, it tells stories about: places in America—towns, cities and states; occupations, from teachers to welders to web developers; industries--where they are thriving, where they are declining and their interconnectedness to each other; and education and skills, from where is the best place to live if you’re a computer science major to the key skills needed to be an accountant."
Don't get lost in the cool stats and forget to peruse the site's many data sets.
Energy Information Administration
This site offers a number of data sets on energy production, consumption, sources, etc.
Forecasting: data sets for practice
Google https://datasetsearch.research.google.com/
Make your own data sets with:
SAS/OR https://blogs.sas.com/content/operations/2017/05/17/creating-synthetic-data-sasor
SAS/IML, for example, using the ARMASIM function and the VARMASIM call.
Monash Time Series Forecasting Repository (Monash University, Australia)
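If you don't have SAS/IML handy, the "make your own data sets" idea above can also be sketched in a few lines of Python. This is only a minimal sketch and not a reproduction of ARMASIM: the function name `simulate_arma`, its parameters, the burn-in length, and the sign convention (x_t = phi_1*x_(t-1) + ... + e_t + theta_1*e_(t-1) + ...) are all assumptions of this example, so check the SAS/IML documentation before comparing results.

```python
import random


def simulate_arma(phi, theta, n, sigma=1.0, seed=None, burn_in=100):
    """Simulate n points of an ARMA(p, q) series (illustrative helper).

    Assumed convention for this sketch:
        x_t = sum(phi_i * x_{t-i}) + e_t + sum(theta_j * e_{t-j})
    where e_t is Gaussian white noise with standard deviation sigma.
    """
    rng = random.Random(seed)  # a seeded generator makes the series reproducible
    total = n + burn_in
    noise = [rng.gauss(0.0, sigma) for _ in range(total)]
    series = []
    for t in range(total):
        # Autoregressive part: weighted sum of earlier series values
        ar = sum(p * series[t - i]
                 for i, p in enumerate(phi, start=1) if t - i >= 0)
        # Moving-average part: weighted sum of earlier noise terms
        ma = sum(q * noise[t - j]
                 for j, q in enumerate(theta, start=1) if t - j >= 0)
        series.append(ar + noise[t] + ma)
    # Drop the burn-in so the arbitrary start-up has less influence
    return series[burn_in:]


if __name__ == "__main__":
    # ARMA(1, 1) practice series with hypothetical coefficients
    practice = simulate_arma(phi=[0.5], theta=[0.3], n=10, seed=42)
    for value in practice:
        print(round(value, 3))
```

Passing the same `seed` gives the same series every time, which is handy when you want classmates or colleagues to practice on identical data.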
Hundreds of data sets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.
ICPSR- Inter-university Consortium for Political and Social Research
An international consortium of more than 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 500,000 files of research in the social sciences.
Kaggle is a site that hosts data mining competitions. Each competition provides a data set that's free for download.
Million-Song Data Set
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
The Intelligent Systems Division's Prognostics Center of Excellence at NASA hosts a rich collection of data sets you can use to test your problem-solving acumen for a long list of challenges, including managing uncertainty.
Note from @sbxkoenk, who recommended this data source: The data didn't originate with NASA itself. The Prognostics Data Repository is a collection of data sets that have been donated by universities, agencies, or companies. The data repository focuses exclusively on prognostic data sets, i.e., data sets that can be used for the development of prognostic algorithms. Most of these are time-series data from a prior nominal state to a failed state. The collection of data in this repository is an ongoing process.
National Institutes of Health, US National Library of Medicine: What We Learned from Big Data for Autophagy Research
Scroll down to the "Autophagy networks and databases" heading where you'll find a list of open databases.
NYC Open Data
Over 1,100 datasets are currently available on the NYC Open Data portal, more than any other U.S. city.
The Open Data Button
This site helps users get access to research data that is already online or request that data be made accessible. When a user wants access to the data behind a paper, they can make a data request via the app. The app will then see if the data is already available online, and if not, it will contact the author and invite them to make their research data openly available through the Center for Open Science’s Open Science Framework. Once a dataset is delivered, the author will be rewarded with an Open Data Badge to recognise their efforts.
Reddit Datasets
This one isn't a dataset itself, but rather a social news site devoted to datasets. It's updated regularly with news about newly available datasets.
SAS Help Data Sets
SAS provides more than 200 data sets in the Sashelp library. Seventeen of these data sets are used in SAS/STAT documentation. These data sets are available for you to use for examples and for testing code.
Stanford's Large Network Dataset Collection. This list has several datasets related to social networking. Lots of fun in here!
This collection of public data sets, organized topically, covers government, health, space, sports, weather, and much more. There's free data in each set, but there may be a charge for some.
United Nations Comtrade Data
In the UN Comtrade database, you'll find details on countries' import/export habits and much more.
US Government Open Data
Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
USGS (U.S. Geological Survey)
Dig into real-time data about droughts, earthquakes, floods, landslides, volcanoes, and wildfires.
Weekly US Retail Gasoline and Diesel Prices
The U.S. Energy Information Administration (EIA) collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment.
World Bank Data
Literally hundreds of datasets spanning many decades, sortable by topic or country. Data is downloadable in Excel or XML formats, or you can make API calls. This is an outstanding resource.
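For the API route mentioned above, here is a rough Python sketch of building a request URL and reading the JSON that comes back. It assumes the World Bank's v2 Indicators API as I understand it (a `country/<code>/indicator/<id>` path, `format`, `date` and `per_page` query parameters, and a two-element JSON array of metadata followed by records); please confirm against the current API documentation. The helper names and the sample payload are made up for illustration, and the values in the payload are placeholders, not real statistics.

```python
import json
from urllib.parse import urlencode

# Assumed base URL for the World Bank v2 API; verify against the API docs.
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year, per_page=100):
    """Build a request URL for one indicator and one country (illustrative helper)."""
    query = urlencode({
        "format": "json",
        "date": f"{start_year}:{end_year}",
        "per_page": per_page,
    })
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(text):
    """Return (year, value) pairs from a response body, skipping missing values.

    Assumes the body is a JSON array whose second element is the record list.
    """
    payload = json.loads(text)
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    return [(row["date"], row["value"])
            for row in payload[1]
            if row.get("value") is not None]


if __name__ == "__main__":
    # SP.POP.TOTL is the indicator code for total population.
    print(build_indicator_url("US", "SP.POP.TOTL", 2015, 2020))

    # Placeholder payload in the assumed shape; the numbers are not real data.
    sample = json.dumps([
        {"page": 1, "pages": 1},
        [{"date": "2020", "value": 2}, {"date": "2019", "value": None}],
    ])
    print(parse_indicator_response(sample))
```

To fetch live data you could pass the URL to `urllib.request.urlopen` and hand the decoded body to `parse_indicator_response`; that step is left out here so the sketch runs without a network connection.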
Yahoo Data Sets
Yahoo has various types of data available to share. They are categorized into Ratings, Language, Graph, Advertising and Market Data, Computing Systems and an appendix of other relevant data and resources available via the Yahoo! Developer Network.
Yahoo News Feed Dataset
The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.