65+ Best Free Datasets for Machine Learning
Jun 1, 2021
Save time searching for quality training data for your machine learning projects, and explore our collection of the best free datasets.
Alberto Rizzoli
Co-founder & CEO
Have you ever spent hours searching for a suitable dataset for your data science project?
It can get pretty daunting, right?
Well, not anymore ;-)
Whether you are a student or a professional looking for high-quality datasets for machine learning or data analysis projects—we’ve got you covered!
In today’s article, we will share with you a comprehensive list of 65+ open machine learning datasets that you can access for free.
Check out 13 Best Image Annotation Tools if you are looking for a data annotation platform for your project.
Use the links below to find the datasets you are looking for in seconds.
Here’s what we’ll cover:
Open dataset aggregators
Public government datasets
Finance and economics datasets
Computer vision datasets
Natural language processing datasets
Audio speech and music datasets
Data visualization datasets
P.S. We will regularly update this list, so feel free to suggest datasets you are using and we will make sure to add them. You can also head over to our Open Datasets repository to browse or download some of the coolest datasets out there.
Open Dataset Aggregators
“Where can I get free datasets for machine learning?” you might ask yourself.
Look no further.
Here’s the list of the best open dataset finders that you can use to browse through a wide variety of niche-specific datasets for your data science projects.
Let’s jump right into it.
Kaggle
A data science community with tools and resources which include externally contributed machine learning datasets of all kinds. From health, through sports, food, travel, education, and more, Kaggle is one of the best places to look for quality training data.
Google Dataset Search
A search engine from Google that helps researchers locate freely available online data. It works similarly to Google Scholar, and it contains over 25 million datasets. Here you can find economic and financial data, as well as datasets uploaded by organizations like WHO, Statista, or Harvard.
UCI Machine Learning Repository
One of the oldest dataset aggregators on the web. All datasets are user-contributed, and you can download them from the UCI Machine Learning Repository website without registration. They are categorized by task, attribute, data type, and area of expertise.
OpenML
OpenML is an online machine learning platform for sharing and organizing data, with more than 21,000 datasets. It is regularly updated, and it automatically versions and analyzes each dataset and annotates it with rich metadata to streamline analysis.
DataHub
A collection of thousands of machine learning datasets from financial market data, macroeconomic data, and population growth to cryptocurrency prices. You can access DataHub without any registration.
Papers with Code
A community project with free and open resources. It currently lists 3,937 datasets for data science and machine learning, including datasets for natural language processing tasks. You can easily filter them by modality, task, or language.
VisualData
VisualData is a search engine for computer vision datasets. You can easily filter them by category, date, or popularity, or use the search box to find a theme-specific dataset. A great source of datasets for image classification, image processing, and image segmentation projects.
You can start annotating your image and video data with V7 for free.
Public Government Datasets for Machine Learning
Leveraging demographic data can help governments to improve the well-being of citizens and the economy at scale. Using public government data to train machine learning models can help discover patterns, identify trends, and detect anomalies.
Those predictive models can, in turn, help address social and cultural issues such as population decline or migration.
Here’s the list of chosen public datasets that you can use for your machine learning projects.
Data.gov
The US government’s open data site. You can filter it by various industries like healthcare, climate, education, etc. Be aware that much of the open-source data on data.gov might require additional research.
Data.europa.eu
The point of access to public data published by the EU institutions, agencies, and other entities. Data.europa.eu contains data related to economics, agriculture, education, employment, climate, finance, science, etc.
World Bank
The open data from the World Bank that you can access without registration. It contains data concerning population demographics, macroeconomic data, and key indicators for development. A great source of data to perform data analysis at a large scale.
US Healthcare Data
Statistics and datasets for health care and public health. In US Healthcare Data you can find data about population health, diseases, drugs, and health plans collected from the FDA and USDA Food composition databases.
The US National Center for Education Statistics
Nces.ed.gov is a website with data on educational institutions and education demographics in the U.S. and internationally.
The UK Data Service
www.ukdataservice.ac.uk is a platform that provides access to over 7,000 digital data collections for research and teaching purposes. Here you can find economic and social data from the Economic and Social Data Service (ESDS), Census Programme, and others, including some international data sets.
Data USA
A free platform with the most comprehensive visualization of U.S. public data.
Machine Learning Datasets for Finance and Economics
Open financial and economic datasets are a great source of information for your machine learning projects related to the financial sector.
Thanks to the vast quantities of financial records collected over decades, you can train your models using rich public datasets that are easily accessible. It's not a secret that machine learning has been widely used for algorithmic trading, stock market predictions, portfolio management, and fraud detection.
Furthermore, the developments in deep learning over the years made it possible to test economic models, collect new sources of data more easily, and predict citizen behavior to help inform policymaking.
Here is the list of reliable sources of various datasets you can use for your machine learning projects.
Global Financial Development (GFD)
GFD is an extensive dataset of financial system characteristics for 214 economies around the world. It contains annual data which has been collected since 1960.
Financial Times Markets Data
FT's up-to-date source of data on financial markets from around the world. The dataset contains information about share and stock prices, equities, currencies, bonds, and commodities performance.
Quandl
A platform with rich datasets of financial, economic, and alternative data. Quandl's data comes in two formats: Time-series (data taken over a period of time) and Tables (numerical and unsorted data types such as strings). You can download it as either a JSON or a CSV file.
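Since time-series data usually arrives as a JSON or CSV file, here is a minimal sketch of reading both into the same list of (date, value) pairs using only the Python standard library. The sample strings and the column names Date and Value are made up for illustration; they are not Quandl's actual export layout, so adjust them to whatever the downloaded file contains.

```python
import csv
import io
import json

# Hypothetical sample data; real downloads may use different column names.
SAMPLE_CSV = "Date,Value\n2021-01-04,101.5\n2021-01-05,102.25\n"
SAMPLE_JSON = (
    '[{"Date": "2021-01-04", "Value": 101.5},'
    ' {"Date": "2021-01-05", "Value": 102.25}]'
)


def parse_csv_series(text):
    """Parse CSV text with Date and Value columns into (date, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Date"], float(row["Value"])) for row in reader]


def parse_json_series(text):
    """Parse a JSON array of {"Date", "Value"} objects into (date, value) pairs."""
    return [(item["Date"], float(item["Value"])) for item in json.loads(text)]


csv_series = parse_csv_series(SAMPLE_CSV)
json_series = parse_json_series(SAMPLE_JSON)
print(csv_series == json_series)
```

Normalizing both formats to one structure early means the rest of a pipeline does not need to care which file type was downloaded.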
IMF Data
The International Monetary Fund publishes data related to IMF lending, exchange rates, and other economic and financial indicators.
American Economic Association (AEA)
AEA's website with links to some of the most useful and popular economic data sources. It includes U.S. macroeconomic data as well as individual-level global data on income, employment, and health.
Image Datasets for Computer Vision
Now, let’s have a look at some of the best open datasets for computer vision projects.
Some of the most popular machine learning project ideas and lab research projects are based on training visual data. Computer vision finds application in fields like medical imaging, self-driving cars, or facial recognition.
You can use image or video datasets for a range of computer vision tasks, including image acquisition, image classification, semantic segmentation, and image analysis.
However, to build a robust deep learning model for computer vision, you need a sizeable amount of high-quality training data.
Here's the list of open-source websites where you can access it for free.
Labelme
Labelme is an extensive dataset created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). It contains 187,240 images, 62,197 annotated images, and 658,992 labeled objects.
ImageNet
ImageNet is one of the largest and most popular image datasets for computer vision. It is organized according to the WordNet hierarchy. It currently holds 1,281,167 images for training and 50,000 images for validation within 1,000 categories.
Kinetics-700
Kinetics-700 is a large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes. The videos include human-object interactions, as well as human-human interactions. The Kinetics dataset is great for training human action recognition models.
LSUN
LSUN is a dataset containing around one million labeled images for each of 10 scene categories (e.g., church, dining room, etc.) and 20 object categories (e.g., bird, airplane, etc.). It aims to provide a different benchmark for large-scale scene classification and understanding.
MS COCO
MS COCO is a large-scale object detection, segmentation, key-point detection, and captioning open-source dataset. It contains over 200,000 labeled images.
COIL100
COIL100 is a dataset containing 7,200 color images of 100 objects (72 images per object), photographed at 5-degree intervals through a full 360-degree rotation. It was collected by the Center for Research on Intelligent Systems at Columbia University.
Visual Genome
Visual Genome is a large and detailed dataset and knowledge base with captioning of over 100,000 images.
Google’s Open Images
Google’s Open Images is a collection of over 9 million varied images with rich annotations. It contains image-level label annotations, object bounding boxes, object segmentation, and visual relationships across 6000 categories. This large image database is a great source of data for any data science project.
YouTube-8M
YouTube-8M is a vast dataset of millions of YouTube video IDs with high-quality machine-generated annotations of more than 3,800 visual entities. This dataset comes with pre-computed audio-visual features from billions of frames and audio segments.
Labeled Faces in the Wild
Labeled Faces in the Wild is a high-quality database of 13,000 face photographs designed for developing facial recognition projects. Each face has been labeled with the name of the person pictured.
Indoor Scene Recognition
Indoor Scene Recognition is a database containing 15,620 images across 67 indoor categories. There are at least 100 images per category, all in JPG format.
xView
xView is a vast public dataset of overhead imagery. It contains more than 1 million object instances across 60 classes, annotated with bounding boxes in complex scenes from around the world.
CelebFaces
CelebFaces is a large-scale dataset of more than 200K celebrity images. Each image contains 40 attribute annotations. The images cover a range of pose variations and background clutter.
Stanford Dogs Dataset
Stanford Dogs is a dataset with images of 120 breeds of dogs from around the world. It contains 20,580 images across 120 categories annotated using class labels and bounding boxes.
Places
Places is a dataset provided by MIT Computer Science and Artificial Intelligence Laboratory. There are more than 2.5 million images across 205 scene categories. Each image comes with a category label. You can use it to train deep neural networks to understand various scenes.
VisualQA
VisualQA is a new dataset containing open-ended questions about images. It includes 265,016 images (COCO and abstract scenes), at least three questions per image, and ten answers per question.
CIFAR-10
CIFAR-10 is a widely used dataset containing 60,000 32x32 color images in 10 classes, with 6,000 images per class. It includes 50,000 training images and 10,000 test images.
Cityscapes Dataset
Cityscapes Dataset is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities. It comes with pixel-level annotations of 5,000 frames and a set of 20,000 weakly annotated frames.
This dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
You can check out our free dataset with 6000+ annotated X-ray lung images here.
Natural Language Processing Datasets
Where can I find databases for natural language processing tasks?
Good question.
Although NLP accounts for a significant share of machine learning use cases, including voice and speech recognition and language translation, it requires a large amount of data and hours of training.
There are also several categories of datasets you can use depending on the natural language processing concepts you plan to explore.
Have a look!
General NLP Datasets
Let's kick off with a few popular datasets for general natural language processing purposes.
The Big Bad NLP Database
The Big Bad NLP Database is a well-organized collection of 841 datasets for NLP-related tasks, including document classification, automated image captioning, dialog, clustering, intent classification, language modeling, or machine translation.
Enron Email Dataset
Enron Email is a dataset collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It includes over 600,000 emails generated by 158 employees of the Enron Corporation.
Google Books Ngrams
Google Books Ngrams is a vast collection of words extracted from the Google Books corpus. The “n” specifies the number of elements in the tuple, meaning that a 4-gram contains four words or characters.
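To make the idea of an n-gram concrete, here is a minimal sketch in plain Python that slides a window of size n over a sequence. The function name ngrams and the sample sentence are illustrative only; this is not how the Google Books corpus itself was built.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n items in `tokens` as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "the quick brown fox jumps".split()
word_4grams = ngrams(words, 4)    # 4-grams made of words
char_4grams = ngrams("data", 4)   # 4-grams made of characters

print(word_4grams)
print(char_4grams)
```

The same function covers both cases mentioned above: pass a list of words to get word n-grams, or a string to get character n-grams.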
Wikipedia Links Data
Wikipedia Links Data is a dataset with 1.9 billion words from more than 4 million articles. You can search by word, phrase, part of speech, synonyms, comparisons of terms, etc. Plus, you can create and use theme-specific virtual corpora from any of the 4,400,000 articles in the corpus.
SMS Spam Collection in English
SMS Spam Collection in English is a small dataset containing 5,574 labeled SMS messages (in English) collected for mobile phone spam research. They are tagged as either legitimate or spam.
Yelp Reviews
Yelp Reviews is an open dataset with over 8.6 million reviews and 200,000 pictures published by Yelp. It also contains over 1.2 million business attributes like hours, parking, availability, and ambiance.
Blog Authorship Corpus
Blog Authorship Corpus is a dataset containing over 681,000 posts written by 19,320 different bloggers. In total, there are over 140 million words within the corpus. Each blog is presented as a separate file and it features blogger ID number, gender, age, industry, and astrological sign.
Sentiment Analysis Datasets for Machine Learning
To train a reliable sentiment analysis model, you need a large volume of specialized datasets.
Finding relevant datasets can be challenging as they need to cover a wide range of sentiment analysis applications and use cases.
Luckily, we've put together a list of the best sentiment analysis datasets available for free.
Multidomain Sentiment Analysis Dataset
Multidomain Sentiment Analysis is a relatively old dataset with positive and negative product reviews from Amazon. The reviews contain ratings from 1 to 5 stars (and they can be converted to binary if needed).
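Converting star ratings to binary labels can be sketched as follows. The cut-off used here (4 and 5 stars positive, 1 and 2 stars negative, 3 stars dropped as neutral) is an assumption made for illustration; the dataset does not prescribe it, and the function name is hypothetical.

```python
def stars_to_binary(stars, neutral=3):
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (neutral).

    The neutral threshold is an illustrative assumption, not part of the dataset.
    """
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    if stars == neutral:
        return None
    return 1 if stars > neutral else 0


ratings = [5, 1, 3, 4, 2]
labels = [stars_to_binary(r) for r in ratings]
# Drop neutral reviews so only clearly positive or negative ones remain.
binary_labels = [label for label in labels if label is not None]

print(labels)
print(binary_labels)
```

Dropping the middle rating is a common way to keep the two classes well separated, at the cost of discarding some reviews.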
Stanford Sentiment Treebank
A large movie review dataset with sentiment annotations based on Rotten Tomatoes reviews. Stanford Sentiment Treebank contains 10,000+ pieces of data. The original code for this standard sentiment dataset was written in Matlab, but it has since been rewritten in Java.
Sentiment140
Sentiment140 is a dataset containing 1.6 million tweets extracted using the Twitter API (originally it wasn’t open-source, but it is now available for free on Kaggle). The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and they can be used to detect sentiment. This Twitter data is available in CSV format with emoticons removed.
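The numeric annotation scheme above (0, 2, 4) is easy to turn into readable labels. Below is a minimal sketch; the sample codes are made up, and reading them as strings simply mirrors how values typically come out of a CSV file rather than any documented column layout.

```python
from collections import Counter

# Polarity codes as described above: 0 = negative, 2 = neutral, 4 = positive.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def decode_polarity(code):
    """Translate a polarity code (int or numeric string) into a text label."""
    return POLARITY_LABELS[int(code)]


# Hypothetical codes, kept as strings the way a CSV reader would return them.
sample_codes = ["0", "4", "4", "2", "0", "4"]
label_counts = Counter(decode_polarity(code) for code in sample_codes)

print(label_counts)
```

Counting labels right after decoding is a quick way to check class balance before training a sentiment model.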
IMDB Movie Reviews Dataset
A collection of 50,000 movie reviews from IMDB. It contains 25,000 highly polarized movie reviews for training and 25,000 for testing. Negative reviews have a score of 4 or lower out of 10, and positive reviews have a score of 7 or higher.
Twitter US Airline Sentiment
Twitter US Airline Sentiment is a dataset containing tweets since February 2015 about each of the major US airlines. Tweets are classified as positive, negative, or neutral. It includes features like Twitter ID, sentiment confidence score, negative reasons, airline name, retweet count, etc.
OpinRank Review Dataset
OpinRank Review Dataset is a large collection of reviews on cars and hotels collected from Tripadvisor and Edmunds. It has nearly 260,000 hotel reviews and 42,230 car reviews.
Amazon Review Data (2018)
An updated version of an Amazon review dataset from 2014. It contains 233.1 million reviews collected between May 1996 and October 2018. Other features include product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
Sentiment Lexicons for 81 Languages
Sentiment Lexicons for 81 Languages is a dataset published on Kaggle. It contains both positive and negative sentiment lexicons for 81 languages. The sentiments were built based on English sentiment lexicons.
Text Datasets for Natural Language Processing
Lastly, here's a handful of text-based datasets to check out.
Jeopardy Dataset
A collection of 216,930 questions from the quiz show Jeopardy!, along with answers and other data, available for download in JSON format.
20 Newsgroups
A collection of 20,000 documents from over 20 different newsgroups. The content covers a variety of topics with some closely related for reference. There are three versions available: original, sorted by dates, and with removed duplicates.
This dataset is commonly used for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Legal Case Reports Dataset
Legal Case Reports is a small dataset with text summaries of 4,000 legal cases that you can download from the UCI Machine Learning Repository. A superb source of data for training automatic text summarization models.
The WikiQA Corpus
The WikiQA Corpus is a rich dataset containing question and sentence pairs collected and annotated for research on open-domain question answering. It comes with over 3,000 questions and over 29,000 candidate sentences, of which just under 1,500 are labeled as answer sentences.
Audio Speech and Music Datasets for Machine Learning Projects
Now, let’s have a look at some of the best audio speech and music datasets.
Common Voice
Common Voice is a high-quality, open-source, multi-language dataset of voices for training speech-enabled technologies. The project is led by volunteers who record sample sentences with a microphone and review recordings of other users.
AudioSet
A rich dataset with manually annotated audio events. AudioSet contains 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
LibriSpeech
LibriSpeech is a quality dataset of approximately 1,000 hours of read English speech, derived from audiobooks. All the audio data has been carefully segmented and aligned.
Spoken Wikipedia Corpora
Spoken Wikipedia Corpora is a volunteer-driven corpus of aligned Spoken Wikipedia recordings, including hundreds of articles from the English, German, and Dutch Wikipedia. The advantages of this data source come down to a diverse set of readers and topics. All annotations can be mapped back to the original HTML.
VoxForge
VoxForge is an open speech dataset that was set up to collect transcribed speech in languages like English, German, Italian, Portuguese or Spanish.
Free Music Archive (FMA)
FMA is a dataset for music analysis. It contains full-length and HQ audio, pre-computed features, and track and user-level metadata. The audio data comes from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
Ballroom
Ballroom is a music dataset with information on ballroom dancing (online lessons, etc.). Characteristic excerpts of many dance styles are provided in real audio format. It contains 698 instances, each around 30 seconds long.
Data Visualization Datasets
To successfully complete your data visualization projects, you need clean and well-organized data that could be logically presented on a graph or a chart.
Here are a few websites where you can find suitable datasets for this endeavor.
FiveThirtyEight
A platform that focuses on opinion poll analysis, politics, economics, and sports blogging. FiveThirtyEight hosts interactive articles backed by curated datasets. They publish their datasets via their GitHub repository.
BuzzFeed
A popular news website that evolved from low-quality clickbait writing to research-driven, high-quality data journalism. BuzzFeed makes its datasets publicly available on GitHub.
ProPublica
ProPublica is an independent, non-profit newsroom focused on issues of public interest in the U.S. It offers both free and paid datasets which are well-maintained and regularly updated.
Conclusion
There you have it—a comprehensive list of 65+ free datasets for machine learning, computer vision, data analysis, data mining, and data visualization projects.
We hope you've found the dataset you were looking for.
And if not—let us know!
We'd be happy to update the article with your dataset suggestions.