15 PLACES TO FIND FREE DATASETS FOR YOUR DATA SCIENCE PROJECTS www.newsdata.io


Table of Contents

Points for discussion

Overview
1. Data sets for your Data Visualization Projects
2. Data sets for your Data Processing Projects
3. Data sets for your Machine Learning Projects
4. Data sets for Data Cleaning Projects

Newsdata.io API


Overview

If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time scouring the internet for interesting datasets to analyze.

It can be fun to sift through dozens of datasets to find the best fit, but it can also be frustrating to download and import multiple CSV files, only to find that the data is not that interesting after all. Fortunately, there are online repositories that curate datasets and (mostly) remove the uninteresting ones.

In this article, we’ll look at different types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find datasets for each.


Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or you have a few hours to spare and want to practice your machine learning skills, we’ve got you covered.


Data sets for your Data Visualization Projects

A typical data visualization project might be something like “I want to create an infographic on how income varies in different states in the United States.”

There are a few considerations to keep in mind when looking for a good dataset for a data visualization project:

- The dataset shouldn’t be complicated, because you don’t want to spend a lot of time cleaning it up.
- It should be nuanced and interesting enough to make charts from.


- Ideally, each column should be well explained so that the visualization is accurate.
- The dataset should not have too many rows or columns, so it is easy to work with.

A good place to find good datasets for data visualization projects is news sites that publish their own data. They usually clean the data for you and already have some charts they created that you can reproduce or improve.
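The considerations above (a manageable number of rows and columns, and few gaps) can be checked quickly before committing to a dataset. The following is a minimal sketch using only the Python standard library; the function name `profile_csv` and the sample values are made up for illustration and stand in for a CSV file you have downloaded.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
# The names and numbers are invented purely for illustration.
SAMPLE_CSV = """state,median_income,population
State A,50000,1000
State B,60000,
State C,55000,3000
"""


def profile_csv(text):
    """Return the row count, column names, and missing-value count per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Treat absent fields and blank strings as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": row_count, "columns": columns, "missing": missing}


if __name__ == "__main__":
    print(profile_csv(SAMPLE_CSV))
```

A dataset with a handful of well-named columns and few missing cells is usually a reasonable candidate for a visualization project; what counts as "too many" rows or columns is a judgment call.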


1. Newsdata.io (for news datasets)

Newsdata.io is a great platform if you are interested in historical news datasets. It also provides a news API for breaking and historical news, so it collects news data every day. It also provides free data samples before you request your actual historical news dataset.

2. FiveThirtyEight

FiveThirtyEight is an incredibly popular interactive news and sports site launched by Nate Silver. They write interesting data-driven articles, such as “Don’t Blame a Skills Gap for Lack of Hiring in Manufacturing” and “2016 NFL Predictions.” FiveThirtyEight makes the datasets used in their articles available online on GitHub.


3. BuzzFeed

BuzzFeed started out as a provider of low-quality articles, but has since evolved and now writes investigative articles, such as “The Court That Rules the World” and “The Short Life of Deonte Hoard”. BuzzFeed makes the datasets used in its articles available on GitHub.

4. Socrata OpenData

Socrata OpenData is a portal that contains a number of datasets that can be explored in the browser or downloaded for visualization. A significant portion of the data comes from US government sources, and much of it is out of date. You can browse and download data from OpenData without registering. You can also use the visualization and exploration tools to explore the data in the browser.


Data sets for your Data Processing Projects

Sometimes you just want to work with a large set of data. The end result is not as important as the process of reading and analyzing the data.

You can use tools like Spark or Hadoop to distribute processing across multiple nodes. There are a few things to keep in mind when looking for a good dataset for data processing:

- The cleaner the data, the better: cleaning a large dataset can take a long time.
- The dataset should be interesting.
- There should be an interesting question the data can answer.
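To make the idea of distributing processing more concrete, here is a minimal single-machine sketch of the split, map, and reduce pattern that tools like Spark and Hadoop apply across nodes. It does not use either tool; the function names are hypothetical, and on a real cluster each chunk would be handled by a different worker rather than in a local loop.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def count_words(lines):
    """Map step: count words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts."""
    return left + right


def word_count(lines, chunk_size=2):
    """Count words by processing chunks independently, then merging the results."""
    chunks = split_into_chunks(lines, chunk_size)
    # On a cluster, each chunk would be sent to a different node here.
    partials = map(count_words, chunks)
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(["big data", "data processing", "big big data"]))
```

Because each chunk is processed independently and the merge step is associative, the same logic can in principle be spread over many machines, which is why large datasets suit this kind of practice project.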


Cloud hosting providers like Amazon and Google are good places to find large public datasets. They have an incentive to host datasets because it encourages you to analyze the data using their infrastructure (and to pay them for it).


5. AWS Public Data sets

Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and use it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. Amazon publishes more detail about how the program works.

Amazon has a page that lists all the datasets to browse. You will need an AWS account, although Amazon does provide a free access tier for new accounts that will allow you to explore data at no cost.


6. Google Public Datasets

Just like Amazon, Google also offers a cloud hosting service, called the Google Cloud Platform. With GCP, you can use a tool called BigQuery to explore large sets of data. Google lists all datasets on a page. You’ll need to create a GCP account, but the first 1 TB of queries you make is free.


7. Wikipedia

Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing expanse of knowledge, with pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy.

As part of Wikipedia’s commitment to the advancement of knowledge, they offer all of their content free of charge and regularly generate dumps of all articles on the site. In addition, Wikipedia offers a history of changes and activity, so you can track how a page on a topic evolves over time and see who contributes to it.

You can find different ways to download the data on the Wikipedia site. You will also find scripts to reformat the data in various ways.


Data sets for your Machine Learning Projects

When working on a machine learning project, you want to be able to predict a column from the other columns in a dataset.

To do this, we need to make sure that:

- The dataset is not too complicated. If it is, we’ll be spending all of our time cleaning up the data.
- There is an interesting target column for making predictions.
- The other variables have some explanatory power for the target column.

There are online repositories of datasets specifically for machine learning. These datasets are usually cleaned up beforehand and allow algorithms to be tested very quickly.
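One rough way to check whether the other variables have explanatory power for a target column is to compute each numeric column's correlation with the target. The sketch below uses the Pearson correlation coefficient and only the standard library; the function names and the example columns are hypothetical, and a linear correlation is just one simple proxy for explanatory power, not a complete test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear information about the target.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Rank every non-target column by absolute correlation with the target."""
    target_values = columns[target]
    scores = {
        name: pearson(values, target_values)
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Invented toy data: "hours" tracks "score" closely, "noise" does not.
    data = {
        "hours": [1, 2, 3, 4],
        "noise": [2, 1, 2, 1],
        "score": [2, 4, 6, 8],
    }
    for name, value in rank_features(data, "score"):
        print(name, round(value, 3))
```

Columns near the top of the ranking are candidates for useful predictors; columns near zero may still matter through nonlinear relationships that this check does not capture.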


8. Kaggle

Kaggle is a data science community that hosts machine learning contests. There are a variety of interesting, externally contributed datasets on the site. Kaggle offers both live and historical contests.

You can download data for both, but you must register with Kaggle and agree to the terms of use of the contest.

You can download Kaggle data by entering a contest. Each contest has its own associated dataset. There are also user-supplied datasets in the new Kaggle Datasets offering.


9. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. While the datasets are user-supplied and therefore have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied.

UCI is a great first stop when looking for interesting datasets. You can download the data directly from the UCI Machine Learning Repository, without registration. These datasets tend to be quite small and don’t have a lot of nuance, but they are useful for machine learning.


10. Quandl

Quandl is a repository of economic and financial data. Some of this information is free, but there are many datasets that need to be purchased. Quandl is useful for creating models to predict economic indicators or stock prices. Due to the large number of datasets available, it is possible to build a complex model that uses many datasets to predict values in another.


Data sets for Data Cleaning Projects

Sometimes it can be very satisfying to take a dataset that is spread across multiple files, clean it up, condense it into one, and then perform an analysis. In data cleaning projects, it sometimes takes hours of research to figure out what each column in the dataset means.


Sometimes it may turn out that the dataset you are analyzing is not suitable for what you are trying to do and you will have to start over.

When looking for a good dataset for a data cleaning project, you want it to:

- Be spread across multiple files.
- Have many nuances and many possible angles to take.
- Require a fair amount of research to understand.
- Be as “real” as possible.

These types of datasets are typically found on dataset aggregators. These aggregators tend to have datasets from multiple sources, without much curation. Too much curation gives us overly neat datasets that leave little to clean.
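Condensing a dataset that is spread across multiple files usually starts with reconciling column names and filling in columns that some files lack. The following is a minimal sketch of that step using only the standard library; the function names, file contents, and column names are invented for illustration, and real aggregator data will typically need more rules than this.

```python
import csv
import io

# Hypothetical contents of two files that describe the same kind of record
# but use different header spellings and different sets of columns.
FILE_A = "School Name,Score\nNorth High, 81 \n"
FILE_B = "school_name,score,district\nSouth High,77,East\n"


def normalize_header(name):
    """Lower-case a header and replace spaces so variants line up."""
    return name.strip().lower().replace(" ", "_")


def combine_csv_texts(texts):
    """Merge several CSV texts into one list of rows with a shared column set."""
    all_columns = []
    combined = []
    for text in texts:
        for row in csv.DictReader(io.StringIO(text)):
            clean = {
                normalize_header(key): (value or "").strip()
                for key, value in row.items()
                if key is not None  # skip overflow fields from malformed rows
            }
            for column in clean:
                if column not in all_columns:
                    all_columns.append(column)
            combined.append(clean)
    # Fill columns that a given file did not provide with empty strings.
    rows = [{column: row.get(column, "") for column in all_columns} for row in combined]
    return all_columns, rows


if __name__ == "__main__":
    columns, rows = combine_csv_texts([FILE_A, FILE_B])
    print(columns)
    for row in rows:
        print(row)
```

Keeping the empty strings visible, rather than dropping incomplete rows, makes it easier to see afterwards which source files were missing which columns.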


11. data.world

data.world describes itself as “the social network for data people”, but it could be more correctly described as “GitHub for data”. It is a place where you can search, copy, analyze, and download datasets.

Additionally, you can upload your data to data.world and use it to collaborate with others. In a relatively short time, it has become one of the go-to places for acquiring data, with many user-contributed datasets as well as fantastic datasets from data.world’s partnerships with various organizations, including a large amount of US federal government data.

A key differentiator of data.world is the set of tools they created to make working with data easier: you can write SQL queries in their interface to explore data and merge multiple datasets. They also have SDKs for R and Python to make it easier to acquire and work with data in your favorite tool.
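To illustrate what merging multiple datasets with SQL looks like, here is a small sketch that joins two tables on a shared key. It runs against a local in-memory SQLite database through Python's standard `sqlite3` module, not against data.world's interface or SDK, and the table names, column names, and values are all invented for illustration.

```python
import sqlite3


def join_income_and_population():
    """Join two small tables on their shared region column and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE income (region TEXT, median_income INTEGER);
        CREATE TABLE population (region TEXT, residents INTEGER);
        INSERT INTO income VALUES ('North', 50000), ('South', 42000);
        INSERT INTO population VALUES ('North', 1200), ('South', 3400), ('West', 900);
        """
    )
    rows = conn.execute(
        """
        SELECT i.region, i.median_income, p.residents
        FROM income AS i
        JOIN population AS p ON p.region = i.region
        ORDER BY i.region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_income_and_population():
        print(row)
```

The inner join keeps only regions present in both tables, so 'West' is dropped here; a left join would be the choice if unmatched rows should be kept for inspection.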


12. Data.gov

Data.gov is a relatively new site that is part of a US effort toward open government. Data.gov allows you to download data from several US government agencies.

Data can range from government budgets to school performance scores. Most of the data requires further research, and it can sometimes be difficult to understand which dataset is the “correct” version.

Anyone can download the data, although some datasets require additional steps, such as accepting license agreements.

You can browse the datasets on Data.gov directly, without registering. You can browse by domain or search for a specific dataset.


13. The World Bank

The World Bank is a global development organization that provides loans and advice to developing countries. The World Bank regularly funds programs in developing countries and then collects data to track the success of those programs. You can browse the World Bank datasets directly without registering. The datasets have many missing values and sometimes require multiple clicks to actually access the data.

14. /r/datasets

Reddit, a popular community discussion site, has a section dedicated to sharing interesting datasets. It is called the datasets subreddit, or /r/datasets. The scope of these datasets varies a lot, as they are all user-submitted, but they tend to be very interesting and nuanced.


15. Academic Torrents

Academic Torrents is a new site focused on sharing datasets from scientific papers. Because it is new, it’s hard to say what the most common types of datasets will look like. For now, it has tons of interesting datasets that lack context.

You can browse the datasets directly on the site. Since this is a torrent site, all datasets can be downloaded immediately, but you will need a BitTorrent client. Deluge is a good free option.
