Need data for teaching or learning? Get it here!

by SAS Employee jennifers_sas on ‎02-26-2014 10:07 PM - edited on ‎02-13-2017 10:53 AM by Community Manager (16,326 Views)

 

Because so many in academia need data for school, I keep an eye out for open sources. Below are sources (alphabetized) that my predecessor Jennifer Scott pulled together last year and I add to as much as possible.

 

Data already cleaned and labeled

There's a lot of interest in curated data sets that are already cleaned, labeled and ready to be mined. For that reason, I draw your attention to an incredible stash of 30 data sets posted on StartUp Grind's Medium blog by Luke de Oliveira, a tech entrepreneur and visiting scientist at Berkeley Labs.

 

The post, Fueling the Gold Rush: The Greatest Public Datasets for AI, includes links to the data with a legend and other context to help you quickly decide whether to download and analyze.

 

"Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling," Oliveira writes on StartUp Grind.

 

"Standard datasets can be used as validation or a good starting point for building a more tailored solution. This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world."

 

In the comments, add links to any data sets you've found.

 

1,001 Datasets

This is a list of lists of datasets. There's not much organization here, but there really are a LOT of datasets. Dive in and have fun.

 

Cancer research data through Project Data Sphere

Project Data Sphere, founded by the CEO Roundtable on Cancer, brings together leading pharmaceutical companies to share data and an analytics platform to investigate those data. The platform now provides access to data from more than 40 studies and 25,000 patients from clinical trials of prostate cancer, breast cancer, and melanoma treatments. The list of available studies is growing rapidly.

 

CDC - Centers for Disease Control and Prevention

Obtain data and statistics in any form, from raw data to publications, specific to health care: biostatistics, nursing, epidemiology, etc.


CHIS - California Health Interview Survey

Download public use files for analysis, or access pre-digested health statistics.

 

Citi Bike - NYC Bike Share Data

Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most rides taken on? We’ve heard all of these questions and more from you and now we are happy to provide the datasets to help you discover the answers to these questions and more.  Developers, engineers, statisticians, artists, academics and other members of the interested public are encouraged to use the data we provide for analysis, development, visualization and whatever else moves you.

 

Clinical Trials Data from Supporting Open Access for Researchers

The Duke Clinical Research Institute (DCRI) and SAS provide researchers worldwide with data management and analytics tools to explore 45 years of cardiovascular patient data collected by the Duke University Health System. The DCRI and SAS share the goal of greater transparency and openness in research, both to improve patient care and to find new ways to treat heart disease, the leading cause of death in the United States. SAS' collaboration with DCRI represents a significant milestone for its broader data access initiative, Supporting Open Access for Researchers (SOAR).

 

Access a cardiovascular data set from the Duke Databank for Cardiovascular Disease (DDCD) at http://soar.dcri.org/data. The databank includes de-identified records for patients treated at Duke between 1969 and 2013, and data from more than 100,000 procedures on more than 50,000 unique patients. The data includes patient demographics, cardiac medical history, other conditions occurring simultaneously (comorbidity), final impressions and subsequent treatments.

 

Energy Information Administration

This site offers a number of datasets on energy production, consumption, sources, etc.

 

Gapminder

Hundreds of datasets on world health, economics, population, etc. All of it is viewable online within Google Docs, and downloadable as spreadsheets.

 

GeoDa Center

This is a collection of geospatial datasets offered by Arizona State University’s Center for Geospatial Analysis & Computation

 

ICPSR - Inter-university Consortium for Political and Social Research

An international consortium of more than 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 500,000 files of research in the social sciences.

 

Kaggle

Kaggle is a site that hosts data mining competitions. Each competition provides a data set that's free for download.

 

KONECT

The Koblenz Network Collection. Several datasets related to social networking & Wikipedia.

 

Million Song Dataset

A freely-available collection of audio features and metadata for a million contemporary popular music tracks.

 

National Climatic Data Center (NCDC)

This site includes quick access to many of NCDC's climate and weather datasets, products, and various web pages and resources.

 

NC School Report Cards

In this WRAL article, 10 Things to Know about NC School Report Cards, you'll find context about an open source of public schools data as well as a link to data sets.

 

NYC Open Data

Over 1,100 datasets are currently available on the NYC open data portal, more than any other U.S. city.

 

The Open Data Button

This beta site helps users get access to research data that is already online or request that data be made accessible. When a user wants access to the data behind a paper, they can make a data request via the app. The app will then see if the data is already available online, and if not, it will contact the author and invite them to make their research data openly available through the Center for Open Science’s Open Science Framework. Once a dataset is delivered, the author will be rewarded with an Open Data Badge to recognise their efforts.

 

Open Health Care Data Sets

Great resources for data sets to start learning about Open Health Care Data

 

The Opportunity Project (U.S. Government)

Through The Opportunity Project, the Administration is releasing a unique package of federal and local datasets in an easy-to-use format and accelerating a new way for the federal government to collaborate with local leaders, technologists, and community members to use data and technology to tackle inequities and strengthen their communities.  Read the announcement for more information.

 

Quandl

This is a web-based front end to a number of public data sets. What's nice about this website is that it allows for the combination of data from a number of sources, and can export the data in a number of formats.

 

Reddit Datasets

This last one isn't a dataset itself, but rather a social news site devoted to datasets. It's updated regularly with news about newly available datasets.

 

SAS Academic Program (UK)

The SAS office in the UK has a repository of open-source data worth checking out.

 

SAS Curriculum Pathways Data Depot

SAS Curriculum Pathways is provided at no cost to students and educators in traditional, virtual and home schools, as well as other teaching and learning environments. Its Data Depot houses data sources and focused lessons designed to help you become more data literate.

 

SAS Help Data Sets

SAS provides more than 200 data sets in the Sashelp library. 17 of these data sets are used in SAS/STAT documentation and can be used with SAS University Edition. These data sets are available for you to use for examples and for testing code.

 

SNAP

Stanford's Large Network Dataset Collection. This list has several datasets related to social networking. Lots of fun in here!

 

SQLServerCentral.com

This collection of public data sets, organized topically, covers government, health, space, sports, weather, and much more. There's free data in each set, but there may be a charge for some.

 

The Data Hub

Hosted by CKAN. Most of these datasets come from the government.

 

The Info

Mostly large datasets. The site is losing momentum, but the data available here is still gold.

 

UCI Machine Learning Datasets

There are 284 data sets maintained as a service to the machine learning community

 

United Nations Comtrade Data

SAS and the United Nations Statistics Division made it possible for anyone to analyze the world's largest bank of international trade data. In the UN Comtrade database, you'll find details on more than 200 countries' import/export habits since 1988 and much more.

 

US Census Site

The US Census data resources section offers everything from data visualization tools to large data files. Whether you are teaching statistics, research methods, economics or political science, you can find resources here.

 

US Government Open Data

Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.

 

Weekly US Retail Gasoline and Diesel Prices

The U.S. Energy Information Administration (EIA) collects, analyzes, and disseminates independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment.

 

World Bank Data

Literally hundreds of datasets spanning many decades, sortable by topic or country. Data is downloadable in Excel or XML formats, or you can make API calls. This is an outstanding resource.
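
For the API route, here is a minimal Python sketch. It assumes the World Bank's v2 REST endpoint at api.worldbank.org and its usual JSON shape (a two-element list: paging metadata first, then the data rows). The indicator code SP.POP.TOTL (total population) and the helper names are illustrative choices rather than anything from this article, so check the current API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"  # assumed endpoint; verify against current docs


def build_indicator_url(country, indicator, start_year, end_year):
    """Build a request URL for one indicator and one country (ISO code)."""
    params = {"format": "json", "date": f"{start_year}:{end_year}", "per_page": 1000}
    # safe=":" keeps the year range readable as 2010:2015 in the query string.
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{urlencode(params, safe=':')}"


def parse_indicator_payload(payload):
    """Turn the assumed [metadata, rows] payload into {year: value}.

    Rows with a missing value are skipped. Assumes annual data, where the
    "date" field is a plain year such as "2011".
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(row["date"]): row["value"]
        for row in payload[1]
        if row.get("value") is not None
    }


def fetch_indicator(country, indicator, start_year, end_year):
    """Download and parse one series (requires network access)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_payload(json.load(response))
```

Calling `fetch_indicator("CAN", "SP.POP.TOTL", 2010, 2015)` would return a dictionary keyed by year, which is easy to write out as CSV for import into SAS or any other tool.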

 

Yahoo Data Sets

Yahoo has various types of data available to share. They are categorized into Ratings, Language, Graph, Advertising and Market Data, Computing Systems and an appendix of other relevant data and resources available via the Yahoo! Developer Network.

 

Yahoo News Feed Dataset

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interaction of about 20M users from February 2015 to May 2015.

 

The Yhat Blog 

Pigeon races, chopstick effectiveness, and crop yields in England from 1211 to 1492 are just some of the quirky data sets you can get from data science vendor Yhat. Seriously, have a look. You'll have fun with these.

 

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field.

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. Then look for Analytics U in the Labels box in the right nav.

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

Comments
by Grand Advisor
on ‎04-07-2014 06:41 PM

Some of the CDC data projects, BRFSS at least, can have data in either SAS data set or text files with associated SAS programs to read the data, and associated files with SAS informats and formats. So make sure you get all of the files for the data you are interested in.

Also look for any data dictionaries, as there can often be both raw data and recoded values, and you'll want to know how the recodes were done.

Do not forget data quality checking; there may be oddities in the data, especially from surveys.
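
To make this advice concrete, here is a minimal Python sketch of reading a fixed-width text extract and running a basic quality check. The column layout, the variable names, and the missing-value code 9 are all hypothetical. For a real CDC file, take the positions and codes from the SAS input program and data dictionary shipped with that file.

```python
from collections import Counter

# Hypothetical layout: name -> (start, end) as 0-based, end-exclusive positions.
# A real layout comes from the SAS input statement distributed with the data.
LAYOUT = {"state": (0, 2), "sex": (2, 3), "age": (3, 5)}

# Hypothetical codes that a data dictionary might document as "refused/unknown".
MISSING_CODES = {"sex": {"9"}}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}


def quality_report(records, missing_codes=MISSING_CODES):
    """Count blank and coded-missing values per field so oddities surface early."""
    report = Counter()
    for record in records:
        for name, value in record.items():
            if value == "" or value in missing_codes.get(name, set()):
                report[name] += 1
    return dict(report)


if __name__ == "__main__":
    sample_lines = ["37142", "06967", "482  "]  # made-up rows for illustration only
    rows = [parse_record(line) for line in sample_lines]
    print(rows)
    print(quality_report(rows))
```

Keeping the layout and the missing-value codes in plain dictionaries makes it easy to compare them line by line against the published data dictionary.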

by Occasional Contributor shwilliams4
on ‎06-20-2014 01:28 PM

Thank you SAS for providing these links. 10 years ago people would look for data to work with to learn. This is a starting point for people to get some data. Pistol-stem data set will no longer be the only data set people use.

by SAS Employee jennifers_sas
on ‎06-20-2014 01:49 PM

Thanks shwilliams4  - please let me know if you run across any interesting data sets that we can add to this list.

Jennifer

by Respected Advisor
on ‎07-15-2014 10:26 PM

See also the archived case studies from the Statistical Society of Canada for many interesting (and challenging) datasets  from diverse fields at: http://ssc.ca/en/meetings/archived-case-studies

PG

by SAS Super FREQ
on ‎05-29-2015 12:02 PM

Wow, what a treasure-trove!

by New Contributor Zulfiqar_Ali
on ‎06-18-2015 01:03 AM

Hi Jennifer, I am a student learning SAS on my own by reading Mr. Cody's book "Learning SAS by Example". The book says that data sets and programs can be downloaded from http://support.sas.com/cody, but I could not find them. Can you please help me look for them? Thanks, Ali

by Community Manager
on ‎06-18-2015 09:55 AM

Hi Zulfiqar, the link to the data and programs is a bit difficult to find. Here it is: http://support.sas.com/publishing/bbu/zip/62857.zip   Comment back here if you're looking for something different!

by Occasional Contributor Hernan
on ‎06-18-2015 02:13 PM

Outstanding!

by Regular Contributor
on ‎09-15-2015 09:46 PM

Hee hee hee - lots and lots of data for me to play with!!  Thank you so much for posting this!!

by Super User
on ‎12-08-2015 07:43 PM

I came across a blog post containing a list of 20 Big Data Repositories to check out and thought it would be good to add the link here.

 

http://www.datasciencecentral.com/profiles/blogs/20-big-data-repositories-you-should-check-out-1

 

Cheers,

Michelle

by Community Manager
on ‎12-09-2015 11:06 AM

Wow, @MichelleH, this is awesome! Thank you for sharing.

by Super User
on ‎01-21-2016 07:47 AM

Another resource that may be of interest are the Kaggle Datasets.

by Community Manager
on ‎01-21-2016 10:03 AM

This is great, @MichelleHomes! I'll add this to the article above and mention it on the AU board. Got a lead yesterday on a huge stash of Yahoo datasets. We're awash in open data!

by Regular Contributor
on ‎01-21-2016 08:44 PM

So.

Much.

Data.

Mind.

Melting.

 

:-)

 

 

by Community Manager
on ‎01-21-2016 08:55 PM

Time to crank up University Edition, @DarthPathos!

by Regular Contributor
on ‎01-21-2016 10:35 PM

@BeverlyBrown is that a challenge? LOL

by Frequent Contributor
on ‎03-04-2016 08:17 AM

Wow, I've often wasted my time looking in vain for suitable datasets to use for motivating marketing analysis case studies in my university teaching, but prospecting through this collection has really got me some great hits in a fairly short time. I'm really impressed and grateful to SAS and the community for curating this. Well Done (SAS) U!

by Community Manager
‎03-04-2016 01:00 PM - edited ‎03-04-2016 01:05 PM

I appreciate that affirmation, @Damien_Mather! Credit my predecessor, @jennifers_sas, who started this stash. Here's a collection I just ran across, with data sets organized topically...government, sports, health, weather, space, etc. Keep using it, we'll keep adding to it!

by Regular Contributor
‎04-05-2016 09:24 PM - edited ‎04-05-2016 09:24 PM

 Hi Free Data Fans!  

 

I realised I had a great repository of links bookmarked (many of them ones people have posted here) and I wanted to add to the collection.  I've tried to make sure I haven't duplicated any of the ones already mentioned.  Enjoy!

Chris

 

http://open.canada.ca/en - Canadian government’s open data sets; I’ve been using these for a while and always impressed with the huge variety. 

 

Open Data Toronto – As someone born and raised in the city of Toronto, Ontario, I find this a definite treasure trove of great stuff.  Some aggregated / summarized, some raw, I’ve been using many of these in the Free Data Friday series and in numerous presentations I’ve given.

 

http://datausa.io - More focused on the visualization of data, but there are some pretty interesting datasets you can get down to.  Does take some time to navigate though, and not all have data that can be downloaded.

 

http://www.uscampgrounds.info/takeit.html - If you want to play with Geographic data, this is a great site; they’ve done a lot of work compiling Campgrounds in North America, and if you are an avid camper, you can easily import the data into your GPS.

 

https://data.nasa.gov - Another site I used data from for a Free Data Friday (on meteorites!) – a lot of the datasets mean nothing to me, but every so often I find a new one that’s really cool.  I’ve also found some pretty cool images, which make great wallpaper!

 

https://data.gov.uk/data/search - Jumping outside of North America, the UK has done a phenomenal job at compiling a wide range of data sources, in a multitude of formats. 

 

http://opendataforafrica.org - African datasets, including Port statistics, poverty, oil production, and a huge assortment of other topics.  The main page has links to different countries that also offer open data, so if you’re looking for a specific region chances are good there will be something useful.

 

http://www.opensourcesports.com - Hockey, baseball, football, soccer - even cycling, rugby, and swimming!  This website has a ton of data on a ton of sports.

 

http://www.seanlahman.com/baseball-archive/statistics/ - This database is *huge* and allows for exploration into topics like joining multiple tables, not-so-big big data, and time series analysis.  I absolutely love this dataset, and use it as often as I can.
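
As an illustration of the table joins mentioned above, here is a small sketch using Python's built-in sqlite3 module. The table and column names (People, Batting, playerID, yearID, HR) are assumed to mirror the layout of the Lahman download, so check them against the documentation that ships with the data. The rows are invented placeholders, not real statistics.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tiny, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE People (playerID TEXT PRIMARY KEY, nameFirst TEXT, nameLast TEXT);
        CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO People VALUES (?, ?, ?)",
        [("demo01", "Alex", "Example"), ("demo02", "Sam", "Sample")],
    )
    conn.executemany(
        "INSERT INTO Batting VALUES (?, ?, ?)",
        [("demo01", 2001, 10), ("demo01", 2002, 15), ("demo02", 2001, 7)],
    )
    return conn


def career_home_runs(conn):
    """Join batting rows to player names and total home runs per player."""
    query = """
        SELECT p.nameFirst || ' ' || p.nameLast AS player, SUM(b.HR) AS total_hr
        FROM Batting AS b
        JOIN People AS p ON p.playerID = b.playerID
        GROUP BY p.playerID
        ORDER BY total_hr DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(career_home_runs(build_demo_db()))
```

The same join-then-aggregate pattern carries over directly to PROC SQL once the real tables are loaded.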

by Super User
on ‎07-07-2016 07:36 PM

Thought the community may like to know about MIT's new visualization tool - a goldmine for data nerds! :-)

by Regular Contributor
on ‎07-07-2016 07:41 PM

@MichelleHomes well there goes my weekend - thanks :-)

by Super User
on ‎07-07-2016 07:47 PM

I knew you'd appreciate it @DarthPathos! ;-) Have a fun analytical datavizzy weekend...

by Regular Contributor
on ‎07-07-2016 08:03 PM

@MichelleHomes between you and @BeverlyBrown you two are keeping me very happy!! 8-)

by Community Manager
on ‎07-08-2016 06:59 AM

"Analytical datavizzy weekend." Epic! @MichelleHomes and @DarthPathos, your comments made me notice the other data sources @DarthPathos posted in this thread in April. Wow...thanks! @Damien_Mather: Did you see that? 

by Super User
on ‎11-11-2016 08:22 AM

Another resourceful list of data for folks - http://flowingdata.com/2016/11/10/find-the-data-you-need-2016-edition/

 

Yes, thinking of you @DarthPathos and your weekend again ;-)

by Regular Contributor
on ‎11-11-2016 08:35 AM

@MichelleHomes LOL I already saw your tweet and already sent it to myself ;-)  thanks for sharing and hope you have a great weekend! 

Chris

by Community Manager
on ‎11-11-2016 08:38 AM

@MichelleHomes I appreciate your latest contribution, which benefits anyone looking for practice data and is guaranteed to distract @DarthPathos for hours! ;-)

by Regular Contributor
on ‎11-11-2016 08:45 AM

@BeverlyBrown @MichelleHomes Hahah - distracting me is definitely not a pro...SQUIRREL!!!!!  Oh chocolate!!  Oh hey something's shiny over there :-)

by Super User
on ‎11-11-2016 08:46 AM

The joys of sharing with the awesome side effect of keeping @DarthPathos occupied! LOL. Have a wonderful weekend guys...

by Frequent Contributor
on ‎11-24-2016 06:45 PM

More great teaching data links, thanks team!

 

I'm loath to post anything negative about free data, especially when we get it for free, but I'd like to reflect some comments from my advanced business analytics class just finished this semester back to the community:

 

Much of the Kaggle data seemed to them so heavily anonymized as to be unusable for many of their learning and research opportunities.

 

Maybe this is due in part to the marketing (and therefore customer) orientation of my course, but it seems to me that some datasets are 'overanonymised' if that is indeed a word.

 

I am keenly aware that usability, as a function of anonymity, for public data is an 'ideal point' (inverted parabolic or inverted 'U') relationship, that without anonymity we would have much less data available, and that different domains generally have different views on what constitutes sufficient anonymity and different protocols to ensure adequate privacy.

 

Does anyone have a feel for characterizing some of the other teaching data sources on an anonymity scale?

by Regular Contributor
on ‎11-24-2016 09:21 PM

Hey Damien!

 

Data privacy is a topic that fascinates me (I work in a hospital, so it's critical for what I do).  I would love to discuss this and provide my thoughts but I'm exhausted and need to get some work done.  Please post back and we'll keep chatting :-)

 

Have a great evening

Chris

by Frequent Contributor
on ‎11-24-2016 10:23 PM

OK, you asked for it..

 

Anonymity in New Zealand and Australia for...

 

Primary research data:

 

As a university researcher, I always promise to store securely, limit access to a small group of named researchers, analyse and summarise so that individual participants cannot be identified, and eventually destroy after a fixed number of years all the primary survey and qualitative data I collect. As far as I know, this is mandatory if I want to get ethical approval for my research from my university, and ethical approval for primary research data gathering is in turn mandatory. So nobody outside the prior-nominated researcher group gets to see the data, anonymised or otherwise. Sometimes, to remove potential researcher bias, the data is also anonymised for some researchers.

 

Secondary internal data:

 

There are commercial customer privacy laws that prevent firms from collecting and retaining customer data unless they can demonstrate that it is part of their legitimate business activity, where legitimate business activity is understood to include any analysis that is designed to generate improvements in customer value and experience. Generally, improvements in customer experience and value can in turn be tied to improvements in business efficiency and profitability. Analysing internal secondary data for anti-competitive and market-dominance-maintaining reasons is explicitly prohibited on penalty of severe fines, as is lax data firewall and handling security. For this type of data to be made publicly available for general learning, an Australasian firm would have to assume responsibility for the effectiveness of any anonymising process, to ensure its customers cannot ever be identified.

 

So I can sort of understand why corporate legal counsel insist on data anonymising protocols that err on the conservative side, i.e. go over the top, before any is released into the wild.

 

Secondary external data:

 

This is where it gets a little tricky for me.

 

Data and metadata that is largely or primarily machine-generated, and does not explicitly identify individual users, like Google page stats for selected government public websites, should be OK to analyse and reproduce unless explicitly protected by an agreed protocol, not that any come to mind apart from Creative Commons variations. However, I'm never entirely sure that someone smarter than me couldn't identify users somehow.

 

 

Any data in the public domain generated by someone typing something, like I am doing right now, is generally protected by author copyright, which means, at a minimum, any explicit reproduction should acknowledge authorship during the period of copyright, and, as in the case of Creative Commons, other protocols should be adhered to. Reproduction and explicit CC or other protocols aside, the data and metadata should be fairly available for analysis, which generally summarises the heck out of what is generally big corpora and structured data and metadata.  But when we type, do we ever stop and think how someone might be openly and honestly attempting to analyse what we write, either specifically identifying us, or in bulk with others' creative output? As an academic, I do**, but how widespread is that consciousness? If writers are generally not aware, or limit their awareness to specific domains, how do we, as analysts, ensure their moral rights are fairly protected?  

 

Who analyses the analysts?

 

If you were tired before reading this, you'll be asleep by now. 

 

Nitey-nite.

** I watch my own bibliographic metadata like a hawk to see who is citing me and to what extent!

by Regular Contributor
on ‎11-25-2016 08:25 PM

Wow, @Damien_Mather that was impressive - and a great way for me to spend my Friday night.  I'm sitting here watching Supernatural, eating chocolate ice cream, reading SAS stuff and waiting for my wife to get home ;-).  

 

For me, I have worked for Canada's major telecom company, 2 hospitals (1 currently) and the Ministry of Health in Ontario.  In every job, I have had full access to highly sensitive material - from customer addresses, phone bills, and so on to full medical histories (including social work and psychiatric notes).  I have taken great pride in my job, and always approach my position by asking: if my data, or that of my parents, were in the database, how would I want it handled?  I am highly involved in research at the hospital, and have implemented policies and audits to ensure our data is as protected as possible.  

 

Deidentified data is absolutely critical - but so is data with full PHI, when it's handled appropriately.  One of my favourite types of analyses is geospatial cluster analysis - I always manage to find some truly interesting findings when I do this, and the only way I can do it is with postal codes.  Bringing age, income etc. into the equation obviously increases the level of identifiable information but could potentially increase the value of the findings exponentially.  When I send out reports, I use screen shots to ensure no identifiable information is sent.  When I worked at the first hospital, we had a breach because someone sent out a number of pivot tables in Excel without realising that the PHI was still accessible (Excel has hidden sheets with the data, which enables the pivot tables to be manipulated).  
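
The Excel breach described above is the kind of thing that can be checked for before a file is sent. Here is a minimal Python sketch, assuming the workbook is a standard .xlsx file: a ZIP package in which pivot caches live under xl/pivotCache/ and sheet visibility is recorded in xl/workbook.xml. The function name and report keys are illustrative, and a clean report is not proof that a file is free of PHI.

```python
import re
import zipfile


def inspect_xlsx(source):
    """Report pivot-cache parts and hidden sheets in an .xlsx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
        # Pivot tables keep a copy of their source rows in these parts.
        pivot_caches = [n for n in names if n.startswith("xl/pivotCache/")]
        workbook_xml = package.read("xl/workbook.xml").decode("utf-8")

    hidden_sheets = []
    # Each <sheet .../> element carries a name and, if not visible, a state attribute.
    for tag in re.findall(r"<sheet\b[^>]*>", workbook_xml):
        state = re.search(r'\bstate="(hidden|veryHidden)"', tag)
        name = re.search(r'\bname="([^"]*)"', tag)
        if state and name:
            hidden_sheets.append(name.group(1))
    return {"pivot_caches": pivot_caches, "hidden_sheets": hidden_sheets}
```

If either list comes back non-empty, the underlying rows may still be travelling with the workbook, and sending a static export such as a screen shot or PDF is the safer choice.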

 

Ok - enough rambling from me.  All this to say that I have felt for a long time that anyone working with data should be held to a standard (such as that of the American Statistical Association) where ethics and integrity etc. are clearly outlined, and methods are detailed for handling situations that arise.  

 

Thanks for the food for thought - and here I was thinking I was going to sleep tonight ;-)

Chris
