An Introductory Guide to Quality Training Data for Machine Learning

Data is the soul of every machine learning model. Here's why the world’s greatest machine learning teams spend more than a whopping 80% of their time improving training data.
12 min read  ·  July 11, 2022

The accuracy of your AI model is directly proportional to the quality of your training data.

Today’s deep neural networks perform extraordinarily well, learning representations across billions of parameters.

But—

If your data is poorly labeled, those billions of parameters will encode mistaken features, and you'll end up with hours of wasted time.

We don't want that to happen to you :)

Here’s what we’ll cover:

  1. What is AI training data?
  2. Types of training data
  3. Training, Validation, and Test Sets
  4. How much training data is enough?
  5. How to improve the quality of AI training data
  6. 4 ways to find quality training data


💡 Pro tip: Would you like to learn more about machine learning first? Check out this guide—What is Machine Learning? The Ultimate Beginner's Guide.

And in case you are ready to start annotating, check out:

  1. V7 Model Training
  2. V7 Workflows
  3. V7 Auto Annotation
  4. V7 Dataset Management

Now, let’s get started!

What is AI training data?

Training data refers to the initial set of data fed to a machine learning model, the examples from which the model is built.

Just like we humans learn better from examples, machines need a set of examples to learn patterns from.

💡 Training data is the data we use to train a machine learning algorithm.

In most cases, the training data contains a pair of input data and annotations gathered from various resources and organized to train the model to perform a specific task at a high level of accuracy.

It may be composed of raw data, such as images, text, or sound, paired with annotations, such as bounding boxes, tags, or connections.

Machine learning models learn from the annotations on training data so that they can apply what they have learned to new, unlabeled examples.
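To make this concrete, here is a minimal sketch of what a single labeled training example might look like; the field names and format are purely illustrative, not a specific annotation standard.

```python
# A purely illustrative labeled training example: raw input paired with annotations.
# Field names and the file path are hypothetical, not tied to any annotation format.
labeled_example = {
    "input": "images/street_042.jpg",  # raw data: the image file
    "annotations": [
        {"label": "car",        "bounding_box": [34, 120, 310, 280]},   # [x_min, y_min, x_max, y_max]
        {"label": "pedestrian", "bounding_box": [400, 95, 455, 260]},
    ],
}

# A training set is simply a large collection of such (input, annotation) pairs.
training_data = [labeled_example]  # ...plus thousands more examples
```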

For instance, here's how you can auto-annotate your training data with V7.

Training data auto-annotation using V7 Darwin

Training Data in Supervised vs. Unsupervised learning

How does training data differ between supervised and unsupervised learning?

In supervised learning, humans label the data, telling the model exactly what it needs to find.

For example, in spam detection, the input is a piece of text, and the label indicates whether the message is spam or not.

Supervised learning is more restrictive, as we aren’t allowing the model to derive its own conclusions from the data outside of the limits annotated by our labels.

In unsupervised learning, humans present the model with raw data containing no labels, and the model finds patterns within the data on its own. For example, it might recognize how similar or different two data points are based on the features it extracts.

This helps the model derive inferences and reach conclusions, for instance, grouping similar images into clusters.

Data in supervised vs. unsupervised learning
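To illustrate the difference in code, here is a minimal sketch using scikit-learn on synthetic data; the dataset, model choices, and numbers are assumptions made purely for illustration.

```python
# A minimal sketch contrasting supervised and unsupervised learning on toy data.
# Assumes scikit-learn and NumPy are installed; the data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two blobs of 2-D points: one around (0, 0), one around (5, 5).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Supervised: humans supply labels (0 or 1), and the model learns to predict them.
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.8, 5.2]]))  # -> [1]

# Unsupervised: no labels; the model groups points by similarity on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters[:5], clusters[-5:])  # two discovered clusters (ids are arbitrary)
```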

Semi-supervised learning is a combination of the two learning types mentioned above, where data is partly labeled by humans with some of the predictions left to the model's judgment.

Semi-supervised learning is often used when humans can direct the model towards the area of focus but where actual predictions become hard to annotate because they are too small or nuanced.

In reality, there is no such thing as fully supervised or unsupervised learning— there exist only various degrees of supervision.

Supervised learning: training data process

All learning methods start with the collection of raw data from different sources.

Raw data can come in any form: text, images, audio, video, etc. However, to tell the model what needs to be identified in this data, you must add annotations.

These annotations allow you to supervise the learning, ensuring that the model focuses on the features you point out, rather than extrapolating conclusions from other correlated (but not causal) elements in your data.

Supervised learning training data

Each input should have a corresponding label that guides the machine toward what the prediction should look like. This labeled dataset is obtained with the help of humans, and sometimes of other ML models accurate enough to reliably apply labels.

Once a labeled dataset is ready to be fed to the AI, the training phase starts.

Here, the model tries to derive important features that are common across all the areas where you applied your labels. For example, if you segmented out a few cars in your images, it will learn that wheels, rear-view mirrors, and door handles are all features that correlate with “car”.

Models test themselves continuously against a validation set defined prior to training time.

Once complete, they will make a final check against testing data (a set never seen before by the model) which will give an idea of the model’s performance on relevant new examples.

Your training, validation, and test sets are all part of your training data. Generally, the more training data you have, the higher your model's accuracy.

Now, let's define some of the popular terms you might encounter when dealing with machine learning training data.

💡 Pro tip: Dive deeper and check out Supervised vs. Unsupervised Learning: What’s the Difference?

What is labeled data?

Labeled data is data that comes with a tag/class that provides meaningful information.

Here are a few examples of labeled data:

  • images tagged as cat or dog
  • emails or messages marked as spam
  • stock price data where the future price is the label
  • nodules outlined with a polygon and labeled as cancerous or not
  • audio files paired with transcriptions of the spoken words

💡 Pro tip: If you are looking for a free labeling annotation tool, check out The Complete Guide to CVAT—Pros & Cons.

Accurately labeled data makes it easier for the machine to recognize the patterns relevant to the task and predict the target, which is why it is widely used for solving complex tasks.

What is human in the loop?

A human-in-the-loop (HITL) process is one in which a machine learning model can only partially solve a problem, so part of the task is offloaded to a human agent.

Model-assisted data labeling is an example of human in the loop, where an ML model will apply initial predictions, and a human complements them with additional tags, corrections, or other types of annotations unsupported by the model.

Humans provide continuous feedback improving the performance of the model.

To begin with, humans use annotation tools to label the raw data so that the machine can learn and make predictions. They then validate the model's output and check its predictions whenever the machine is unsure, ensuring that the model's learning progresses in the right direction.
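As an illustration, here is a minimal sketch of how low-confidence predictions might be routed to a human reviewer; the confidence threshold and function are hypothetical, not V7's actual workflow.

```python
# A minimal human-in-the-loop routing sketch (illustrative only):
# predictions the model is confident about are accepted automatically,
# while uncertain ones are sent to a human annotator for review.
CONFIDENCE_THRESHOLD = 0.85  # assumed threshold; tune per task

def route_prediction(item_id: str, label: str, confidence: float) -> str:
    """Decide whether a model prediction is accepted or escalated to a human."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"{item_id}: auto-accepted '{label}' ({confidence:.2f})"
    # Below the threshold, a human reviews and corrects the label; the corrected
    # example can then be fed back into the training set to improve the model.
    return f"{item_id}: sent to human review ('{label}', {confidence:.2f})"

print(route_prediction("img_001", "cat", 0.97))
print(route_prediction("img_002", "dog", 0.61))
```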

Sometimes though, humans stay forever in the loop to add more tags to data that we can’t fully rely on models for.

For example, many automated medical diagnosis systems, or identity verification systems, rely on humans in the loop to avoid leaving the final decision of an important evaluation to the machine learning algorithms.

💡 Pro tip: Check out The Ultimate Guide to Medical Image Annotation.

In this loop, machines and humans go hand in hand!

Human in the loop (HITL)

Training, Validation, and Test Sets

No AI model can be trained and tested on the same data.

Why?

It's simple—

The model's evaluation would be biased because the model is being tested on what it has already learned. It would be like putting the exact questions that were already answered in class on the exam. We would not know whether the student memorized the answers or actually understood the concepts.

The same rule applies to machine learning models.

Here’s an overview of the splits.

Training data—At least 60% of your data should be used for training.

Validation data—A sample (10-20%) of the total dataset will be used for validation and checked on periodically by the model during training. This validation set should look like a representative sample of the training set.

Test data—This set of data is used to test the model after it has been completely trained, and it is separate from both the training and validation sets. After the model is trained and validated, it is tested on the test set. The model sees the test inputs without their labels, exactly how real data would look once the model is deployed; the held-back labels are then used to score its predictions.
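As a rough illustration of these splits, here is a minimal sketch using scikit-learn's train_test_split; the 70/15/15 ratios and the placeholder data are assumptions you would adapt to your own dataset.

```python
# A minimal 70/15/15 train/validation/test split using scikit-learn.
from sklearn.model_selection import train_test_split

X = list(range(1000))    # stand-in for your inputs (images, texts, ...)
y = [i % 2 for i in X]   # stand-in for your labels

# First carve off the 15% test set, then split the rest into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))  # -> 700 150 150
```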

💡 Pro tip: Read How to Split Your Machine Learning Data: Train, Validation, Test Set Split to learn more.

You may have more than one test set in a dataset.

Each test set can be used to check whether a model generalizes to a specific scenario. For example—

An autonomous vehicle model made to detect pedestrians may be trained on videos from all over the United States.

Image annotation of pedestrians for autonomous vehicles
💡 Pro tip: Check out the list of 65+ datasets for machine learning.

Its main test set might be a mix of locations from all the states; however, you might want to create dedicated test sets for specific scenarios. These can include:

  • A test set for sunset driving
  • A test set for a snowy environment
  • A test set for driving in heavy storms
  • A test set for when the camera has a dirty lens or has been scratched.

These test sets are normally stored in a dataset management solution and are manually hand-picked by data scientists. As such, it’s paramount that you fully understand what your data looks like and appropriately tag outlier scenarios so that you may create test sets out of them.

Test sets are not used exclusively to assess AI model performance. Sometimes they are used to test human annotators' performance too.

This is known as a Gold Set.

Gold Sets—Your ideal ground truth

A selection of well-labeled images that accurately represent what perfect ground truth looks like is called a gold set.

These image sets are used as mini test sets for human annotators, either as part of an initial tutorial or scattered across labeling tasks, to make sure that an annotator's work is not deteriorating, whether because of declining performance on their part or because of changing instructions.

Gold sets usually check for a series of things:

  • Time to complete a task.
  • Accuracy of each annotation (by recall or IoU)
  • Performance increases with experience
  • Performance deterioration with new instruction changes

Testing continuously against gold sets is paramount to good training data. The best labeling teams in the market maintain rigorous automated tests and make use of a platform that allows them to be intelligently placed and measurable.
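As a simple illustration, here is a sketch of how an annotator's tags could be scored against a gold set; the labels and the 90% quality bar are hypothetical, and real platforms would also compare annotation shapes and completion times.

```python
# An illustrative check of an annotator's tags against a gold set.
# In practice you would also compare shapes (e.g., IoU for boxes) and task time;
# the 90% bar below is an assumed quality threshold, not a universal rule.
gold_labels      = ["car", "car", "person", "dog", "person", "car"]
annotator_labels = ["car", "car", "person", "cat", "person", "person"]

correct = sum(g == a for g, a in zip(gold_labels, annotator_labels))
accuracy = correct / len(gold_labels)
print(f"Agreement with gold set: {accuracy:.0%}")  # -> 67%

if accuracy < 0.90:
    print("Flag annotator for extra training or an instruction review")
```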

Blind Stages—Multiple passes by multiple annotators

Blind stages are annotation tasks where multiple humans (or models) place a label independently of one another, and the stage passes only if they all agree on the same outcome.

Blind stages are used to create ultra-accurate training data and to automate quality assurance checks. It’s very common for an annotator to miss an object, but it’s far less common for two of them to do so.

Blind stages are labeled in parallel and each participant cannot see the progress of the others.

When all annotators have completed their version of the task, it goes through a consensus check that validates that the annotations agree. If they don’t, or don’t overlap enough with one another spatially, the task is sent to a human reviewer to apply corrections, and the annotator who made an error is notified so they may improve their work.

Blind stages in the image annotation process
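Here is a minimal sketch of such a consensus check for bounding boxes, using Intersection over Union (IoU); the 0.8 agreement threshold and the example boxes are assumptions for illustration only.

```python
# A minimal consensus check for a blind stage: two annotators label the same
# object independently, and the task passes only if their boxes overlap enough.
# The 0.8 IoU threshold is an assumption; real pipelines tune this per class.
def iou(box_a, box_b):
    """Intersection over Union of two [x_min, y_min, x_max, y_max] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

annotator_1 = [100, 100, 200, 200]
annotator_2 = [102, 101, 203, 204]

agreement = iou(annotator_1, annotator_2)
if agreement >= 0.8:
    print(f"Consensus reached (IoU={agreement:.2f}); annotation accepted")
else:
    print(f"Disagreement (IoU={agreement:.2f}); send to a human reviewer")
```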

How much training data do you need?

The simple answer: Enough to represent each plausible case in your scenario with at least 1,000 data samples.

Why 1000?

If you use 10% of that as a test set, you get 100 test examples per class, so you can measure a class's accuracy with roughly 1% granularity.

To put things into perspective:

1,000 examples per class is a decent dataset.

10,000 is a great dataset.

100k-1 million is an excellent dataset.

More than 1 million labeled examples of something puts you on the leaderboard among AI teams.

Some companies are now training models on billions of image, video, and audio samples. These datasets have multiple test sets and are labeled and re-labeled multiple times to increase their scope.

Yes, theoretically you can train a model using 100 examples of something. For example, V7 allows you to train a model with as few as 100 instances; however, such models will perform rather poorly on new examples.

Training a model with V7 on 100 instances

Great models are trained on large volumes of training data for a good reason—modern neural network architectures work brilliantly because they can store many weights (parameters) efficiently. However, if you don’t have a lot of training data, you are only using a fraction of your model’s potential.

Dataset size will also depend on the domain of your task and the variance of each class.

If you plan to identify every Mars chocolate bar in the world, you’ll probably run out of variance after 10,000 examples. The model will have seen every possible angle, lighting condition, and crumpled appearance of the candy bar.

However, if you want to make a generalized person detector, 10,000 samples are only a glimpse of the variety of sizes, appearances, poses, and clothing that humans may have. As such—a class with high variance such as “person” requires a lot more training data.

💡 Pro tip: Check out 15+ Top Computer Vision Project Ideas for Beginners to start building your own computer vision models in less than an hour!

Five challenges in estimating training data amounts

Here are some factors that have a high degree of influence on the size of your dataset:

Size of the existing raw data corpus

How much data is the system capturing today?

If there’s no raw data available, make sure you have the ability to collect it at the scale your use case requires. For example, if you work at a logistics company processing 10,000 invoices a day, that volume is a good estimate of what your test set should look like, meaning your training set should contain at least 100,000 files to allow an accurate comparison with human performance.

Variance of classes

How sparsely represented are the tags you want to identify?

If your goal is to identify a very uniform set of objects, you can get away with a few thousand examples. If you want to identify something varied, such as car license plates across all countries, lighting, and weather conditions, your dataset should hold a few hundred examples of each plausible scenario to achieve reliable results. The number of classes proportionally increases this dataset size requirement.

Type of classification

Are you performing object detection or semantic segmentation?

Each type of computer vision task requires a different amount of training data. In object detection, you will want to estimate the number of object instances to expect per file. If you only expect 1 object per file, the model will learn slightly more slowly than if there are 4-5.

💡 Pro tip: Check out A Gentle Introduction to Image Segmentation for Machine Learning and AI.

Ongoing changes

Will your distribution (the content your dataset is supposed to represent) change over time?

For example, if you’re making a mobile phone detector, you’ll have to periodically add more training data to account for new models. If you make a face detector, you’ll have to account for face masks.

Instruments also change over time—modern phone cameras capture higher resolution HDR images and models trained on this instrument’s data will perform better.

The complexity of your model

The greater the number of parameters/attributes the model can learn from, the more training data is required.

In other words, as the complexity of the model increases, so does the required dataset size. Modern deep neural network architectures can store millions or even billions of parameters. As such, model complexity is hardly ever a bottleneck—always aim for more training data to improve performance.

If you are using classical machine learning instead, or are running on resource-constrained devices, you may see diminishing returns on multi-million file datasets.

How to calculate your training data needs?

An estimate of how much data is needed is enough to give you a kick start. A couple of methods can help with this:

Rule of 10

It is the most common of all estimation techniques.

It simply states that a model requires ten times more data than it has degrees of freedom. By degrees of freedom, we mean any parameter that affects the model or any attribute on which the model's output depends.

This is to handle the variability that any parameter or attribute can bring. For example, by this rule of thumb, a model with 1,000 learnable parameters would need roughly 10,000 training examples.

Learning curves

It is a more empirical approach, driven by evidence.

A learning curve shows the relationship between the performance of a machine learning model and the size of the dataset. We plot the model's performance as the dataset size increases with each iteration.

At some threshold dataset size, performance plateaus or even starts to diminish.
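As an illustration, here is a minimal learning-curve sketch using scikit-learn on a synthetic dataset; the model, data, and training-set sizes are placeholders, not a recommendation.

```python
# A minimal learning-curve sketch: measure validation accuracy as the training
# set grows, to see where more data stops helping. Uses a toy synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} training samples -> validation accuracy {score:.3f}")
# When the accuracy curve flattens out, extra data yields diminishing returns.
```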

Calculating training data needs


How to improve the quality of AI training data?

Training data quality is imperative to the machine learning model’s performance.

By the term quality data, we mean data that is clean and contains all the attributes on which the model's learning depends.

We can measure quality by both consistency and accuracy of labeled data. While consistency prevents randomness, accuracy brings correctness to the model.

Let's dive deeper to understand how we can ensure that our data sets consist of quality data.

4 characteristics of quality training data

  1. Relevancy - The dataset should contain only attributes that provide meaningful information to the model. Identifying the important attributes is a complicated task and requires domain knowledge to understand clearly which features to keep and which to remove.
  2. Consistency - Similar attribute values should correspond to similar labels, ensuring uniformity across the dataset.
  3. Uniformity - Values for all attributes should be comparable across all data points. Irregularities or outliers in the dataset have an adverse effect on the quality of the training data.
  4. Comprehensiveness - The dataset should have enough parameters or features that no edge cases are left out, and enough samples of those edge cases for the model to learn them as well.

What affects training data quality?

There are three main factors that directly affect the quality of training data.

People, Process, and Tools (P-P-T) are the three components vital to any business process.

Let's have a look at each one of them.

People

Quality starts with the human resources that are assigned this task. Worker selection and training significantly impact the efficiency of work and final results. Task-dependent training is the key to better quality data.

Process

These are the sets of actions that workers perform for data collection and labeling, following a quality control workflow.

Tools

Technology like V7 provides tools that help people implement the process. It also automates parts of the process to maximize the quality of the data.

What affects training data quality

How to prepare training data: Best Practices

Now let's have a look at a few best practices for preparing and preprocessing your training data.

Data cleaning

Raw data can be very dirty and corrupted in many different ways. If it is not cleaned properly, it might skew our results and lead our AI model to make wrong predictions.

Data cleaning is the process of fixing or removing incorrect, corrupted, or duplicate data within a dataset. The exact steps in the data cleaning process vary from dataset to dataset.

Best practices:

  1. Check for duplicates - The same data points may appear in the dataset more than once, often because data was gathered from several sources. Duplicates should be removed, as they may lead the model to overlearn certain patterns and end up making false predictions.
  2. Remove outliers—Some parts of the data behave very differently from the rest. An example is a SessionID appearing over and over again in weblog data, which may be due to malicious activity we do not want to feed to our model. Watching for outliers is one way of stripping out data we don't want our machine to learn from.
  3. Fix structural errors—In some cases, there might be mislabeling within the dataset. For example, ‘Cat’ and ‘cat’ are treated as different classes, and ‘caat’ and ‘cat’ are considered different because of a spelling error, leading to an erroneous class distribution.
  4. Check for missing values—There might be instances in a dataset where some attributes/features are missing for certain data points. You can either exclude these instances from the training dataset or fill in the missing values through imputation. (A minimal cleaning sketch illustrating these checks follows this list.)
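Here is the minimal cleaning sketch referenced above, using pandas; the column names, toy rows, and drop/fill strategies are illustrative assumptions.

```python
# A minimal data-cleaning sketch with pandas, covering the checks above.
# Column names and example rows are illustrative; adapt them to your data.
import pandas as pd

df = pd.DataFrame({
    "text":  ["cheap pills!!", "meeting at 3pm", "meeting at 3pm", "win a prize", None],
    "label": ["spam", "ham", "ham", "Spam", "ham"],
})

df = df.drop_duplicates()                          # 1. remove duplicate rows
df["label"] = df["label"].str.strip().str.lower()  # 3. fix structural errors ('Spam' -> 'spam')
df = df.dropna(subset=["text"])                    # 4. drop rows with missing inputs
# 2. outliers: for numeric columns you might filter here, e.g. by z-score or IQR.

print(df)
```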

Data labeling

Labeling data is about detecting and tagging data samples. It is the process by which we attach a meaning to the data in the form of a class or label.

Data labeling can be done by human collaborators (human in the loop) or by automated models that speed up the labeling process.

Best practices:

  1. Create a gold standard—In the Data Labeling domain, data scientists or experts are considered to be gold standards who label the raw data with the highest sensitivity and accuracy. Their labels are considered to be a reference point for our team annotations and can be used as answers to screen annotation candidates.
  2. Don’t use too many labels—Dividing the dataset into a large number of classes can make it more confusing to annotate. More features also need to be analyzed to distinguish between more labels. For example, it becomes a matter of debate for annotators whether to label a data point as “Very Expensive”, “Expensive”, or “Less Expensive”.
  3. Use multipass—This involves labeling the data points by a number of annotators. This is done in order to make the decision of labeling consistent and to improve the overall quality of the data. Though it is time-consuming and increases resource use, it is used to establish consensus within the team.
  4. Create a review system—Data labeling should be reviewed by another person, or through self-improvement checks, to reduce the chance of error. The main takeaway for any annotator is to learn their areas for improvement, their level of accuracy, and what kind of training would improve their work.
💡 Pro tip: Check out Annotating With Bounding Boxes Guide for more tips on data labeling.

Now, let's explore where we can find relevant data for our data science and deep learning projects.

4 ways to find high-quality training datasets

Whether you are looking for quality data for your business endeavors or you want to build your first computer vision model, having access to quality datasets is crucial.

Here are a few methods you can use.

Open datasets and search engines

The first technique is to explore options such as open datasets, online machine learning forums, and dataset search engines, which are free and relatively easy to use. A number of websites provide free and diverse datasets, such as Google Dataset Search, Kaggle, Reddit, and the UCI repository. We only need to preprocess the data to make it suitable for our use case.

💡 Pro tip: Check out 21+ Best Healthcare Datasets for Computer Vision if you are looking for medical data.

Scrape web data

This technique is mostly used in cases when we want data from multiple sources for diversified inputs. Data collection is done by the extraction of data from various public online resources, such as government websites or certain social media platforms.

Own data

Sometimes the above options do not work well for training data collection.

In this case, we have to check what in-house options are available. For example, if we are working on a chatbot that aims to respond to students' problems, instead of using generic natural language processing datasets, we can try to extract data from conversations between supervisors and students, provided the logs and messages have been preserved.

Use data augmentation

There might be some scenarios for which we are unable to gather data that meets our needs.

Instead, we can repurpose existing data to broaden the dataset. Data augmentation means applying different transformations to the original data to generate new data that suits our case. For image data, the training set size can be increased with simple operations like rotation and changes in color or brightness.
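As an example, here is a minimal augmentation sketch using torchvision transforms; the specific transforms, parameters, and the stand-in image are illustrative assumptions.

```python
# A minimal image-augmentation sketch using torchvision transforms.
# The transforms and parameters are illustrative; pick ones that produce
# images your model could plausibly see in production.
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Stand-in for a real training image; in practice, load your own files.
image = Image.new("RGB", (224, 224), color=(128, 128, 128))
augmented_versions = [augment(image) for _ in range(5)]  # 5 new training samples
```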

Quality training data: Key takeaways

Finally, let's recap everything you've learned in our essential guide to quality training data:

  • Training data refers to the data we use to train a machine learning algorithm.
  • Your model's accuracy depends on the data you use; the majority of any data engineer's time goes into preparing quality training data.
  • Supervised learning uses labeled data while unsupervised learning uses raw, unlabeled data.
  • You need high-quality datasets for training and validation, and a separate, original dataset for testing.
  • A gold set is a selection of accurately annotated images that represent what perfect ground truth looks like.
  • To achieve quality results, you need enough training data to represent each plausible case in your scenario with at least 1,000 data samples.
  • 4 characteristics of quality training data come down to relevant content, consistency, uniformity, and comprehensiveness.
  • Data cleansing, data labeling, and the annotation tools you are using play a key part in ensuring that your final model can be reliably applied in real-world conditions.

💡 Read next:

Optical Character Recognition: What is It and How Does it Work [Guide]

7 Life-Saving AI Use Cases in Healthcare

The Complete Guide to CVAT—Pros & Cons

5 Alternatives to Scale AI

YOLO: Real-Time Object Detection Explained

The Ultimate Guide to Semi-Supervised Learning

The Beginner’s Guide to Contrastive Learning

9 Reinforcement Learning Real-Life Applications

Mean Average Precision (mAP) Explained: Everything You Need to Know

A Step-by-Step Guide to Text Annotation [+Free OCR Tool]

The Essential Guide to Data Augmentation in Deep Learning
