Have you ever spent hours searching for a suitable dataset for your data science project?
It can get pretty daunting, right?
Well, not anymore ;-)
Whether you are a student or a professional looking for high-quality datasets for machine learning or data analysis projects—we’ve got you covered!
In today’s article, we will share with you a comprehensive list of 65+ open machine learning datasets that you can access for free.
P.S. We will regularly update this list, so feel free to suggest datasets you are using and we will make sure to add them. You can also head over to our Open Datasets repository to browse or download some of the coolest datasets out there.
“Where can I get free datasets for machine learning?” you might ask yourself.
Look no further.
Here’s the list of the best open dataset finders that you can use to browse through a wide variety of niche-specific datasets for your data science projects.
Let’s jump right into it.
A data science community with tools and resources which include externally contributed machine learning datasets of all kinds. From health, through sports, food, travel, education, and more, Kaggle is one of the best places to look for quality training data.
A search engine from Google that helps researchers locate freely available online data. It works similarly to Google Scholar, and it contains over 25 million datasets. Here you can find economic and financial data, as well as datasets uploaded by organizations like WHO, Statista, or Harvard.
One of the oldest dataset aggregators on the web. All datasets are user-contributed, and you can download them from the UCI Machine Learning Repository website without registration. They are categorized by task, attribute, data type, and area of expertise.
An online machine learning platform for sharing and organizing data with more than 21,000 datasets. It’s regularly updated, and it automatically versions and analyzes each dataset and annotates it with rich metadata to streamline analysis.
A collection of thousands of machine learning datasets from financial market data, macroeconomic data, and population growth to cryptocurrency prices. You can access it without any registration.
A community project with free and open resources, currently including 3937 datasets for data science and machine learning, including natural language processing tasks. You can easily filter them by modality, task, or language.
A search engine for computer vision datasets. You can easily filter them by category, date, popularity or use a search box to find a theme-specific dataset. A great source of datasets for image classification, image processing, and image segmentation projects.
Leveraging demographic data can help governments to improve the well-being of citizens and the economy at scale. Using public government data to train machine learning models can help discover patterns, identify trends, and detect anomalies.
Those predictive models can, in turn, help prevent some of the social and cultural issues like population decline or migration.
Here’s the list of chosen public datasets that you can use for your machine learning projects.
The US government’s open data site. You can filter it by various industries like healthcare, climate, education, etc. Be aware that much of this open-source data might require additional research.
The point of access to public data published by the EU institutions, agencies, and other entities. It contains data related to economics, agriculture, education, employment, climate, finance, science, etc.
The open data from the World Bank that you can access without registration. It contains data concerning population demographics, macroeconomic data, and key indicators for development. A great source of data to perform data analysis at a large scale.
Statistics and datasets for health care and public health. You can find data about population health, diseases, drugs, and health plans collected from the FDA and USDA Food composition databases.
It’s a website with data on educational institutions and education demographics in the U.S. and internationally.
It’s a platform that provides access to over 7,000 digital data collections for research and teaching purposes. Here you can find economic and social data from the Economic and Social Data Service (ESDS), Census Programme, and others, including some international datasets.
A free platform with the most comprehensive visualization of U.S. public data.
Open financial and economic datasets are a great source of information for your machine learning projects related to the financial sector.
Thanks to the vast quantities of financial records collected over decades, you can train your models using rich public datasets that are easily accessible. It's not a secret that machine learning has been widely used for algorithmic trading, stock market predictions, portfolio management, and fraud detection.
Furthermore, the developments in deep learning over the years made it possible to test economic models, collect new sources of data more easily, and predict citizen behavior to help inform policymaking.
Here is the list of reliable sources of various datasets you can use for your machine learning projects.
An extensive dataset of financial system characteristics for 214 economies around the world. It contains annual data which has been collected since 1960.
An up-to-date source of data on financial markets from around the world. The dataset contains information about share and stock prices, equities, currencies, bonds, and commodities performance.
A platform with rich datasets of financial, economic, and alternative data. Quandl’s data comes in two formats: Time-series (data taken over a period of time) and Tables (numerical and unsorted data types such as strings, etc.). You can download it as either a JSON or a CSV file.
The International Monetary Fund publishes data related to IMF lending, exchange rates, and other economic and financial indicators.
A website with links to some of the most useful and popular economic data sources. It includes U.S. macroeconomic data as well as individual-level global data on income, employment, and health.
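The Quandl entry above says its data can be downloaded as either a JSON or a CSV file. As a minimal sketch of how both formats might be read with only the Python standard library, the snippet below parses two small inline samples into the same list of rows. The column names (`date`, `value`) and the sample values are made up for illustration; a real export will have its own schema.

```python
import csv
import io
import json

# Hypothetical sample data; a real download will have its own columns.
CSV_SAMPLE = "date,value\n2020-01-01,1.5\n2020-01-02,1.7\n"
JSON_SAMPLE = (
    '[{"date": "2020-01-01", "value": 1.5},'
    ' {"date": "2020-01-02", "value": 1.7}]'
)


def rows_from_csv(text):
    """Parse CSV text into a list of (date, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["date"], float(row["value"])) for row in reader]


def rows_from_json(text):
    """Parse a JSON array of records into a list of (date, value) tuples."""
    return [(rec["date"], float(rec["value"])) for rec in json.loads(text)]


csv_rows = rows_from_csv(CSV_SAMPLE)
json_rows = rows_from_json(JSON_SAMPLE)

# Both formats should yield the same time series.
print(csv_rows == json_rows)
```

Normalizing both formats into the same structure early on means the rest of an analysis does not need to care which file type was downloaded.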
Now, let’s have a look at some of the best open datasets for computer vision projects.
Some of the most popular machine learning project ideas and lab research projects are based on training visual data. Computer vision finds application in fields like medical imaging, self-driving cars, or facial recognition.
You can use image or video datasets for a range of computer vision tasks, including image acquisition, image classification, semantic segmentation, and image analysis.
To build a robust deep learning model for computer vision, you need a sizeable amount of high-quality training data.
Here's the list of open-source websites where you can access it for free.
An extensive dataset created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). It contains 187,240 images, 62,197 annotated images, and 658,992 labeled objects.
One of the most popular and the largest image datasets for computer vision. It is organized according to the WordNet hierarchy. It currently holds 1,281,167 images for training and 50,000 images for validation within 1,000 categories.
A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes. The videos include human-object interactions, as well as human-human interactions. The Kinetics dataset is great for training human action recognition models.
A dataset containing around one million labeled images for each of 10 scene categories (e.g., church, dining room, etc.) and 20 object categories (e.g., bird, airplane, etc.). It aims to provide a different benchmark for large-scale scene classification and understanding.
MS COCO is a large-scale object detection, segmentation, key-point detection, and captioning open-source dataset. It contains over 200,000 labeled images.
A dataset containing 7,200 color images of 100 objects (72 images per object) imaged at every angle in a 360-degree rotation. It was collected by the Center for Research on Intelligent Systems at Columbia University.
A large and detailed dataset and knowledge base with captioning of over 100,000 images.
A collection of over 9 million varied images with rich annotations. It contains image-level label annotations, object bounding boxes, object segmentation, and visual relationships across 6000 categories. This large image database is a great source of data for any data science project.
A vast dataset of millions of YouTube video IDs with high-quality machine-generated annotations of more than 3,800 visual entities. This dataset comes with pre-computed audio-visual features from billions of frames and audio segments.
A high-quality database of 13,000 face photographs designed for developing facial recognition projects. Each face has been labeled with the name of the person pictured.
A database containing 5620 images across 7 indoor categories. There are at least 100 images per category in JPG format.
A vast public dataset of overhead imagery. It contains more than 1 million object images with 60 classes from complex scenes around the world annotated using bounding boxes.
A large-scale dataset of more than 200K celebrity images. Each image contains 40 attribute annotations. The images cover a range of pose variations and background clutter.
A dataset with images of 120 breeds of dogs from around the world. It contains 20,580 images across 120 categories annotated using class labels and bounding boxes.
A dataset provided by MIT Computer Science and Artificial Intelligence Laboratory. There are more than 2.5 million images across 205 scene categories. Each image comes with a category label. You can use it to train deep neural networks to understand various scenes.
A new dataset containing open-ended questions about images. It includes 265,016 images (COCO and abstract scenes), at least three questions per image, and ten answers per question.
A vast dataset containing 60,000 32x32 color images in 10 classes, with 6,000 images per class. It includes 50,000 training images and 10,000 test images.
A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities. It comes with pixel-level annotations of 5,000 frames and a set of 20,000 weakly annotated frames. This dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene.
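For the dataset of 32x32 color images described above, the quoted figures can be sanity-checked and turned into a rough storage estimate. The short sketch below does that arithmetic. The assumption of 3 color channels at one byte per value is ours, not something the description states, so treat the byte counts as an estimate for uncompressed pixels only.

```python
# Figures quoted in the description above.
NUM_IMAGES = 60_000
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6_000
TRAIN_IMAGES = 50_000
TEST_IMAGES = 10_000
HEIGHT = WIDTH = 32

# Assumption: 3 color channels (RGB), one byte per channel value.
CHANNELS = 3
BYTES_PER_VALUE = 1

# The per-class and train/test counts should both add up to the total.
assert NUM_CLASSES * IMAGES_PER_CLASS == NUM_IMAGES
assert TRAIN_IMAGES + TEST_IMAGES == NUM_IMAGES

bytes_per_image = HEIGHT * WIDTH * CHANNELS * BYTES_PER_VALUE
total_bytes = NUM_IMAGES * bytes_per_image

print(bytes_per_image)  # 3072
print(total_bytes)      # 184320000
```

Under these assumptions the raw pixels come to roughly 184 MB, which is small enough to hold in memory on most machines.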
Where can I find databases for natural language processing tasks?
Although NLP makes up a significant part of machine learning use cases, including voice and speech recognition and language translation, it requires a large amount of data and hours of training.
There are also several categories of datasets you can use depending on the natural language processing concepts you plan to explore.
Have a look!
Let's kick off with a few popular datasets for general natural language processing purposes.
A well-organized collection of 841 datasets for NLP-related tasks, including document classification, automated image captioning, dialog, clustering, intent classification, language modeling, or machine translation.
A dataset collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It includes over 600,000 emails generated by 158 employees of the Enron Corporation.
A vast collection of words extracted from the Google Books corpus. The “n” specifies the number of elements in the tuple, meaning that a 4-gram contains four words or characters.
A dataset with 1.9 billion words from more than 4 million articles. You can search by word, phrase, part of speech, synonyms, comparisons of terms, etc. Plus, you can create and use theme-specific virtual corpora from any of the 4,400,000 articles in the corpus.
A small dataset containing 5,574 labeled SMS messages (in English) collected for mobile phone spam research. They are tagged either as legitimate or spam.
An open dataset with over 8.6 million reviews and 200,000 pictures published by Yelp. It also contains over 1.2 million business attributes like hours, parking, availability, and ambiance.
A dataset containing over 681,000 posts written by 19,320 different bloggers. In total, there are over 140 million words within the corpus. Each blog is presented as a separate file and it features blogger ID number, gender, age, industry, and astrological sign.
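The Google Books entry above defines an n-gram as a tuple of n elements, so that a 4-gram contains four words or characters. A minimal sketch of that idea is shown below. The helper name `ngrams` and the sample sentence are ours, chosen purely for illustration, and this is not how the Google Books corpus itself was built.

```python
def ngrams(items, n):
    """Return all contiguous n-item tuples from a sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "the quick brown fox jumps"

# Word-level 4-grams: each tuple holds four consecutive words.
word_4grams = ngrams(sentence.split(), 4)

# Character-level 2-grams: each tuple holds two consecutive characters.
char_2grams = ngrams("data", 2)

print(word_4grams)
print(char_2grams)
```

The same function handles both words and characters because it only relies on slicing, which lists and strings both support.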
To train a reliable sentiment analysis model, you need a large volume of specialized datasets.
Finding relevant datasets can be challenging as they need to cover a wide range of sentiment analysis applications and use cases.
Luckily, we've put together a list of the best sentiment analysis datasets available for free.
A relatively old dataset with positive and negative product reviews from Amazon. The reviews contain ratings from 1 to 5 stars (and they can be converted to binary if needed).
A large movie review dataset with sentiment annotations based on Rotten Tomatoes reviews. It contains 10,000+ pieces of data. This standard sentiment dataset had its original code written in Matlab, but it has now been rewritten in Java.
A dataset containing 1.6 million tweets extracted using Twitter API (originally it wasn’t open-source, but is now available for free on Kaggle). The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and they can be used to detect sentiment. This Twitter data is available in a CSV format with emoticons removed.
A vast collection of 50,000 movie reviews from IMDb. It contains 25,000 highly polarized movie reviews for training and 25,000 for testing. The negative reviews have a score of 4 or lower out of 10, and the positive reviews have a score of 7 or higher out of 10.
A dataset containing tweets since February 2015 about each of the major US airlines. Tweets are classified as positive, negative, or neutral. It includes features like Twitter ID, sentiment confidence score, negative reasons, airline name, retweet count, etc.
A large collection of reviews on cars and hotels collected from Tripadvisor and Edmunds. It has nearly 260,000 hotel reviews and 42,230 car reviews.
An updated version of an Amazon review dataset from 2014. It contains 233.1 million reviews collected between May 1996 and October 2018. Other features include product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
A dataset published on Kaggle. It contains both positive and negative sentiment lexicons for 81 languages. The sentiments were built based on English sentiment lexicons.
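Two of the entries above describe label schemes that usually need converting before training: the Amazon reviews rated from 1 to 5 stars, which can be converted to binary, and the tweets annotated as 0 = negative, 2 = neutral, 4 = positive. The sketch below shows one way to do both. The star cut-offs are an assumption on our part (4 to 5 stars positive, 1 to 2 negative, 3 treated as ambiguous), since the description does not say where to draw the line.

```python
# Label scheme quoted above for the tweet dataset.
TWEET_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def star_to_binary(stars):
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None.

    Assumption: 4-5 stars count as positive, 1-2 as negative, and
    3-star reviews are treated as ambiguous and returned as None.
    """
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    return None


def tweet_label_name(code):
    """Translate a numeric tweet annotation into its sentiment name."""
    return TWEET_LABELS[code]


ratings = [5, 1, 3, 4]
print([star_to_binary(r) for r in ratings])
print([tweet_label_name(c) for c in (0, 2, 4)])
```

Returning `None` for 3-star reviews lets you decide later whether to drop them or assign them to one side, instead of silently baking that choice into the data.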
Lastly, here's a handful of text-based datasets to check out.
A collection of 216,930 Jeopardy questions (quiz show), answers, and other data available for download in JSON format.
A collection of 20,000 documents from over 20 different newsgroups. The content covers a variety of topics with some closely related for reference. There are three versions available: original, sorted by dates, and with removed duplicates.
This dataset is commonly used for experiments in text applications of machine learning techniques, such as text classification and text clustering.
A small dataset with text summaries of 4000 legal cases that you can download from UCI Machine Learning Repository. A superb source of data for training automatic text summarization.
A rich dataset containing question and sentence pairs collected and annotated for research on open-domain question answering. It comes with over 3000 questions and over 29,000 answer sentences with just under 1500 labeled as answer sentences.
Now, let’s have a look at some of the best audio speech and music datasets.
A high-quality open source and multi-language dataset of voices for training speech-enabled technologies. The project is led by volunteers who record sample sentences with a microphone and review recordings of other users.
A rich dataset with manually annotated audio events. It contains 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
A quality dataset of approximately 1000 hours of read English speech, derived from audiobooks. All the audio data has been carefully segmented and aligned.
A volunteer-driven corpus of aligned Spoken Wikipedia including hundreds of articles from the English, German, and Dutch Wikipedia. The advantages of this data source come down to a diverse set of readers and topics. All annotations can be mapped back to the original HTML.
An open speech dataset that was set up to collect transcribed speech in languages like English, German, Italian, Portuguese or Spanish.
A dataset for music analysis. It contains full-length and HQ audio, pre-computed features, and track and user-level metadata. The audio data comes from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
A music dataset with information on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in real audio format. The total number of instances is 698 with a duration of around 30 seconds.
To successfully complete your data visualization projects, you need clean and well-organized data that could be logically presented on a graph or a chart.
Here are a few websites where you can find suitable datasets for this endeavor.
A platform that focuses on opinion poll analysis, politics, economics, and sports blogging. It hosts interactive articles backed by curated datasets. They publish their datasets via their GitHub repository.
A popular news website that evolved from low-quality clickbait writing to research-driven and high-quality data journalism. BuzzFeed makes its datasets publicly available on GitHub.
An independent, non-profit newsroom focused on issues of public interest in the U.S. It offers both free and paid datasets which are well-maintained and regularly updated.
There you have it—a comprehensive list of 65+ free datasets for machine learning, computer vision, data analysis, data mining, and data visualization projects.
We hope you've found the dataset you were looking for.
And if not—let us know!
We'd be happy to update the article with your dataset suggestions.