The accuracy of your AI model depends heavily on the quality of your training data.
Today’s deep neural networks can hold billions of parameters and learn remarkably rich representations.
If your data is poorly labeled, those parameters will encode billions of mistaken features, and you will have wasted hours of training time.
We don't want that to happen to you :)
This article will show you the best tips and tricks for improving your training data's quality and help you understand the following:
Now, let’s get started!
Training data refers to the initial set of data fed to any machine learning model from which the model is created.
Just like we humans learn better from examples, machines also need a set of data to learn patterns from it.
In most cases, the training data contains a pair of input data and annotations gathered from various resources and organized to train the model to perform a specific task at a high level of accuracy.
It may be composed of raw data, such as images, text, or sound, containing annotations, such as bounding boxes, tags, or connections.
Machine learning models learn the annotations on training data, so that they may apply them to new, unlabeled examples.
What is the difference in training data using supervised vs. unsupervised learning?
In supervised learning, humans will label data telling the model exactly what it needs to find.
For example, in spam detection, the input is any text while the label would suggest if the message is spam or not.
Supervised learning is more restrictive, as we aren’t allowing the model to derive its own conclusions from the data outside of the limits annotated by our labels.
In unsupervised learning, humans present the model with raw data containing no labels, and the model finds patterns within the data, for example by recognizing how similar or different two data points are based on the features extracted from them.
This helps the model derive inferences and reach conclusions, for instance by segregating similar images into clusters.
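As a rough illustration of clustering without labels, here is a minimal k-means sketch in plain Python. It assumes each data point has already been turned into a numeric feature vector; the sample points, the choice of `k`, and the deterministic starting centroids are all made up for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters with plain k-means (no labels involved)."""
    # Deterministic start for the example: first k points become centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments

# Hypothetical 2D feature vectors forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
cluster_ids = kmeans(points, k=2)
```

The model is never told what the groups mean; it only reports which points belong together.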
Semi-supervised learning is a combination of the two learning types mentioned above, where data is partly labeled by humans with some of the predictions left to the model's judgment.
Semi-supervised learning is often used when humans can direct the model towards the area of focus but where actual predictions become hard to annotate because they are too small or nuanced.
In reality, there is no such thing as fully supervised or unsupervised learning; there exist only various degrees of supervision.
All learning methods start with the collection of raw data from different sources.
Raw data can take any form, such as text, images, audio, or video. However, to tell the model what needs to be identified in this data, you must add annotations.
These annotations allow you to supervise the learning, ensuring that the model focuses on the features you point out, rather than extrapolating conclusions from other correlated (but not causal) elements in your data.
Each input should have a corresponding label that guides the machine towards what the prediction should look like. This processed dataset is obtained with the help of humans, and sometimes other ML models accurate enough to reliably apply labels.
Once a labeled dataset is ready to be fed to the AI, the training phase starts.
Here, the model tries to derive important features that are common across all the areas where you applied your labels. For example, if you segmented out a few cars in your images, it will learn that wheels, rear-view mirrors, and door handles are all features that correlate with “car”.
Models test themselves continuously against a validation set defined prior to training time.
Once complete, they will make a final check against testing data (a set never seen before by the model) which will give an idea of the model’s performance on relevant new examples.
Your training, validation, and test sets are all part of your training data. In general, the more training data you have, the higher your model's accuracy.
Now, let's define some of the popular terms you might encounter when dealing with machine learning training data.
Labeled data is data that comes with a tag/class that provides meaningful information.
Here are a few examples of labeled data: images tagged as cat or dog, emails or messages marked as spam, stock price histories (the future state is your label), nodules outlined with a polygon and marked as cancerous or not, and audio files transcribed with the words that were spoken.
Accurately labeled data makes it easier for the machine to recognize the patterns that predict the target, which is why it is so widely used for solving complex tasks.
A human-in-the-loop (HITL) process is one where a machine learning model is only partially able to solve a problem, and part of the task is offloaded to a human agent.
Model-assisted data labeling is an example of human in the loop, where an ML model will apply initial predictions, and a human complements them with additional tags, corrections, or other types of annotations unsupported by the model.
Humans provide continuous feedback improving the performance of the model.
To begin with, humans use annotation tools to label the raw data to help the machines learn and make predictions accordingly. They validate the output of the model and check the predictions when the machine is not sure of its output to ensure that the learning of the model progresses in the right direction.
Sometimes though, humans stay forever in the loop to add more tags to data that we can’t fully rely on models for.
For example, many automated medical diagnosis systems, or identity verification systems, rely on humans in the loop to avoid leaving the final decision of an important evaluation to the machine learning algorithms.
In this loop, machines and humans go hand in hand!
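One common way to decide when a human should step in is to route low-confidence predictions to a review queue. The sketch below assumes each prediction carries a `confidence` score between 0 and 1; the field names, sample items, and the 0.8 threshold are illustrative assumptions, not part of any particular tool.

```python
def route_predictions(predictions, threshold=0.8):
    """Split model predictions into auto-accepted and human-review queues."""
    accepted, needs_review = [], []
    for item in predictions:
        # Confident predictions pass through; uncertain ones go to a human.
        if item["confidence"] >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review

# Hypothetical model output for three images.
predictions = [
    {"id": "img_001", "label": "car", "confidence": 0.97},
    {"id": "img_002", "label": "car", "confidence": 0.55},
    {"id": "img_003", "label": "person", "confidence": 0.81},
]
accepted, needs_review = route_predictions(predictions)
```

In high-stakes settings such as the medical example above, the threshold could be set so that every prediction is reviewed.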
No AI model should be trained and tested on the same data.
The model's evaluation would be biased as the model is being tested on what it has already learned. It would be like giving the same exact questions in an exam that were already answered in a class. We would not know if the student memorized the answers or actually understood the concepts.
The same rules apply to the machine learning models.
Here’s an overview of the splits.
Training data - At least 60% of your data should be used for training.
Validation data - A sample (10-20%) of the total dataset will be used for validation and checked on periodically by the model during training. This validation set should look like a representative sample of the training set.
Test data - This set of data is used to test the model after it has been completely trained and validated. It is kept separate from both the training set and the validation set. The model should see the test data without its labels, exactly how real data would look once the model is deployed; the labels are held back only to score its predictions.
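The three splits above can be sketched in a few lines of plain Python. This is a minimal example assuming a 70/15/15 split, which sits within the ranges mentioned above; the function name, the ratios, and the fixed seed are choices made for illustration.

```python
import random

def split_dataset(items, train=0.7, val=0.15, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    if not 0 < train < 1 or not 0 <= val < 1 or train + val >= 1:
        raise ValueError("train and val must leave room for a test set")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )

# Stand-in for 1,000 file identifiers.
train_set, val_set, test_set = split_dataset(range(1000))
```

Because the three slices come from one shuffled list, no item can land in more than one split.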
You may have more than one test set in a dataset.
Each test set can be used to check whether a model generalizes to a specific scenario. For example—
An autonomous vehicle model made to detect pedestrians may be trained on videos from all over the United States.
Its main test set might be a mix of locations from all the states; however, you might want to create dedicated test sets for specific scenarios. These can include:
These test sets are normally stored in a dataset management solution and are manually hand-picked by data scientists. As such, it’s paramount that you fully understand what your data looks like and appropriately tag outlier scenarios so that you may create test sets out of them.
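If your items are tagged by scenario, building dedicated test sets can be as simple as grouping by tag. The sketch below assumes every item has an `id` and a list of `tags`; the tag names ("night", "rain", "snow") and clip identifiers are hypothetical.

```python
def build_scenario_test_sets(items, scenario_tags):
    """Group item ids into one test set per scenario tag."""
    test_sets = {tag: [] for tag in scenario_tags}
    for item in items:
        for tag in item["tags"]:
            # An item can belong to several scenario test sets at once.
            if tag in test_sets:
                test_sets[tag].append(item["id"])
    return test_sets

# Hypothetical tagged video clips.
items = [
    {"id": "clip_01", "tags": ["night", "rain"]},
    {"id": "clip_02", "tags": ["snow"]},
    {"id": "clip_03", "tags": ["night"]},
    {"id": "clip_04", "tags": []},
]
scenario_sets = build_scenario_test_sets(items, ["night", "snow"])
```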
Test sets are not used exclusively to assess AI model performances. Sometimes they are used to test our human annotator performances too.
This is known as a Gold Set.
A selection of well-labeled images that accurately represent what perfect ground truth looks like is called a gold set.
These image sets are used as mini testing sets for human annotators, either as part of an initial tutorial, or to be scattered across labeling tasks to make sure that an annotator’s performance is not deteriorating either due to poor performance on their part, or changing instructions.
Gold sets usually check for a series of things:
Testing continuously against gold sets is paramount to good training data. The best labeling teams in the market maintain rigorous automated tests and use a platform that lets them place gold set items intelligently and measure the results.
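A gold set check can be approximated by comparing an annotator's labels with the known ground truth. This is a minimal sketch for classification tags only; the function name, the dictionary layout, and the sample labels are assumptions for illustration, and real checks would usually cover more than simple tag agreement.

```python
def gold_set_accuracy(annotator_labels, gold_labels):
    """Return the fraction of gold items the annotator labeled correctly."""
    if not gold_labels:
        raise ValueError("gold set is empty")
    correct = sum(
        1
        for item_id, gold in gold_labels.items()
        # A skipped item has no label and therefore counts as a miss.
        if annotator_labels.get(item_id) == gold
    )
    return correct / len(gold_labels)

gold = {"img_1": "cat", "img_2": "dog", "img_3": "cat", "img_4": "dog"}
annotator = {"img_1": "cat", "img_2": "dog", "img_3": "dog"}  # img_4 was skipped
score = gold_set_accuracy(annotator, gold)
```

Tracking this score over time is one way to notice when an annotator's performance starts to drift.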
Blind stages are annotation tasks where multiple humans (or models) place a label independently of one another, and the stage passes only if they all agree on the same outcome.
Blind stages are used to create ultra-accurate training data and to automate quality assurance checks. It’s very common for an annotator to miss an object, but it’s far less common for two of them to do so.
Blind stages are labeled in parallel and each participant cannot see the progress of the others.
When all annotators have completed their version of the task, it goes through a consensus check that validates that the annotations agree. If they don’t, or don’t overlap enough with one another spatially, the task is sent to a human reviewer to apply corrections, and the annotator who made an error is notified so they may improve their work.
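For bounding boxes, a consensus check of this kind is often based on intersection over union (IoU). The sketch below is a simplified version that assumes each annotator drew exactly one box, given as `(x_min, y_min, x_max, y_max)`; the 0.7 threshold and the sample boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def reaches_consensus(boxes, min_iou=0.7):
    """True if every pair of annotators' boxes overlaps by at least min_iou."""
    return all(
        iou(boxes[i], boxes[j]) >= min_iou
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )

agree = reaches_consensus([(10, 10, 50, 50), (12, 11, 51, 49)])
disagree = reaches_consensus([(10, 10, 50, 50), (40, 40, 90, 90)])
```

A task where `reaches_consensus` returns `False` would be the one sent on to a human reviewer.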
The simple answer: Enough to represent each plausible case in your scenario with at least 1,000 data samples.
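If you use 10% of that as a test set, you have 100 test samples per class, so each misclassified sample shifts the measured accuracy by 1%. In other words, you can measure a class's accuracy to a resolution of 1%.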
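The arithmetic behind that 1% figure can be made explicit: with 1,000 samples and a 10% test split, one test example is worth 1/100 of the measured accuracy. The helper below is a small sketch with a made-up name, not a statistical confidence interval.

```python
def accuracy_resolution(samples_per_class, test_fraction=0.10):
    """Return the test set size and the accuracy change one test example causes."""
    test_size = round(samples_per_class * test_fraction)
    if test_size == 0:
        raise ValueError("test set would be empty")
    return test_size, 1 / test_size

test_size, step = accuracy_resolution(1000)
# 100 test samples, so each error moves measured accuracy by 0.01 (1%).
```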
To put things into perspective:
1,000 examples per class is a decent dataset.
10,000 is a great dataset.
100k-1 million is an excellent dataset.
More than 1 million labeled examples of something puts you on the leader board among AI teams.
Some companies are now training models on billions of images, video, and audio samples. These datasets have multiple test sets and are labeled and re-labeled multiple times to increase their scope.
Yes, theoretically you can train a model using 100 examples of something. For example, V7 allows you to train a model with as few as 100 instances; however, such models will perform rather poorly on new examples.
Great models are trained on large volumes of training data items for a good reason - modern neural network architectures work brilliantly because they can store many weights (parameters) efficiently. However, if you don’t have a lot of training data, you are only using a fraction of your model’s potential.
Dataset size will also depend on the domain of your task and the variance of each class.
If you plan to identify every Mars chocolate bar in the world, you’ll probably run out of variance after 10,000 examples. The model will have seen every possible angle, lighting condition, and crumpled appearance of the candy bar.
However, if you want to make a generalized person detector, 10,000 samples are only a glimpse of the variety of sizes, appearances, poses, and clothing that humans may have. As such—a class with high variance such as “person” requires a lot more training data.
Here are some factors that have a high degree of influence on the size of your dataset:
How much data is the system capturing today?
If there’s no raw data available, make sure you have the ability to collect it at the scale necessary for your use case. For example, if you are working in a logistics company processing 10,000 invoices a day, that is a good estimate of what your test set should look like, meaning your training set should contain at least 100,000 files to accurately compare the model with human performance.
How sparsely represented are the tags you want to identify?
If your goal is to identify a very uniform set of objects, you can get away with a few thousand examples. If you want to identify something varied, such as car license plates across all countries, lighting, and weather conditions, your dataset should hold a few hundred examples of each plausible scenario to achieve reliable results. The number of classes proportionally increases this dataset size requirement.
Each type of classification task requires a different amount of training data. In object detection challenges you will want to calculate the number of object instances to expect. If you only expect 1 object per file, the model will learn slightly more slowly than if there are 4-5.
Will your distribution (the content your dataset is supposed to represent) change over time?
For example, if you’re making a mobile phone detector, you’ll have to periodically add more training data to account for new models. If you make a face detector, you’ll have to account for face masks.
Instruments also change over time - modern phone cameras capture higher resolution HDR images and models trained on this instrument’s data will perform better.
The greater the number of parameters/attributes the model can learn from, the more training data is required.
In other words as the complexity of the model increases so does the dataset size. Modern deep neural network architectures can store millions or even billions of parameters. As such, model complexity is hardly ever a bottleneck - always aim for more training data to improve its performance.
If you are using classical machine learning instead, or are running on resource-constrained devices, you may see diminishing returns on multi-million file datasets.
An estimate of how much data is needed is more than enough to give a kick start. A couple of methods that help us with this are:
The rule of 10 is the most common of all estimation techniques.
It simply states that a model requires ten times more data than it has degrees of freedom. By degrees of freedom, we mean any parameter that affects the model or any attribute on which the output of the model depends.
This accounts for the variability that any parameter or attribute can bring.
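As a worked instance of the ten-times heuristic described above, consider a hypothetical linear classifier with 20 input features plus one bias term, giving 21 degrees of freedom. The function name and the example model are assumptions for illustration; the result is only a starting estimate.

```python
def rule_of_ten(degrees_of_freedom, factor=10):
    """Estimate the minimum number of samples as factor x degrees of freedom."""
    if degrees_of_freedom <= 0:
        raise ValueError("degrees_of_freedom must be positive")
    return factor * degrees_of_freedom

# Hypothetical linear model: 20 feature weights + 1 bias = 21 parameters.
estimated_samples = rule_of_ten(20 + 1)
```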
Learning curves are a more logical approach driven by evidence.
A learning curve shows the relationship between the performance of a machine learning model and the size of the dataset. We plot the results of the model performance with the increase in the size of the dataset in each iteration.
There comes a threshold dataset size after which the performance becomes stagnant or diminishes.
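Once you have recorded model performance at several dataset sizes, finding that threshold can be automated. The sketch below assumes you already have a list of `(dataset_size, score)` pairs; the scores shown are invented purely to illustrate the shape of a curve, and the 0.01 minimum gain is an arbitrary choice.

```python
def find_plateau(curve, min_gain=0.01):
    """Return the first dataset size after which the score gains less than min_gain.

    curve is a list of (dataset_size, score) pairs sorted by dataset_size.
    Returns None if the score is still improving at the largest size.
    """
    for (size, score), (_, next_score) in zip(curve, curve[1:]):
        if next_score - score < min_gain:
            return size
    return None

# Illustrative, made-up scores: gains shrink as the dataset doubles.
curve = [(1_000, 0.62), (2_000, 0.74), (4_000, 0.81), (8_000, 0.84), (16_000, 0.845)]
plateau_at = find_plateau(curve)
```

Here doubling from 8,000 to 16,000 samples adds less than the minimum gain, so 8,000 is reported as the point where returns flatten out.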
Training data quality is imperative to the machine learning model’s performance.
By the term quality data, we mean data that is cleaned and contains all attributes on which the model learning depends.
We can measure quality by both consistency and accuracy of labeled data. While consistency prevents randomness, accuracy brings correctness to the model.
Let's dive deeper to understand how we can ensure that our data sets consist of quality data.
There are three main factors that directly affect the quality of training data.
People, Process, and Tools (P-P-T) are the three components vital to any business process.
Let's have a look at each one of them.
Quality starts with the human resources that are assigned this task. Worker selection and training significantly impact the efficiency of work and final results. Task-dependent training is the key to better quality data.
Processes are the set of actions that workers perform for data collection and labeling, following a quality control workflow.
Technology, like V7, provides tools that help people to implement the process. It also automates some parts of the process to maximize the quality of the data.
Now let's have a look at a few best practices for preparing your training data.
Raw data can be very dirty and corrupted in many different ways. If it is not cleaned properly, it might skew our results and lead our AI model to produce wrong outputs.
Data cleaning is the process of fixing or removing incorrect, corrupted, or duplicate data within a dataset, leaving a corrected version of it. The steps in the data cleaning process vary from dataset to dataset.
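Since the steps vary from dataset to dataset, the following is only one possible sketch for a small text dataset. It assumes records are dictionaries with `text` and `label` fields (hypothetical names), drops records with missing or empty fields, and removes duplicates after normalizing case and surrounding whitespace.

```python
def clean_records(records, required_fields=("text", "label")):
    """Drop incomplete records and duplicates, keeping the first copy of each."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip records where a required field is absent or empty.
        if any(not record.get(field) for field in required_fields):
            continue
        text = record["text"].strip()
        # Duplicates are detected ignoring case and surrounding whitespace.
        key = (text.lower(), record["label"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"text": text, "label": record["label"]})
    return cleaned

raw = [
    {"text": "Win a free prize now ", "label": "spam"},
    {"text": "win a free prize now", "label": "spam"},  # duplicate once normalized
    {"text": "", "label": "ham"},                        # empty text
    {"text": "Meeting moved to 3pm"},                    # missing label
    {"text": "Meeting moved to 3pm", "label": "ham"},
]
cleaned = clean_records(raw)
```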
Labeling data is about detecting and tagging data samples. It is the process by which we attach a meaning to the data in the form of a class or label.
Data labeling can be done by a human collaborator, through a human-in-the-loop workflow, or by an automated model that speeds up the labeling process.
Whether you are looking for quality data for your business endeavors or you want to build your first computer vision model, having access to quality datasets is crucial.
Here are a few methods you can use.
The first technique is to explore options such as open datasets, online machine learning forums, and dataset search engines which are free and relatively easy. There are a number of websites that provide free and diversified datasets like Google Dataset Search, Kaggle, Reddit, UCI repository. We only need to preprocess the data to make it suitable for our use case.
This technique is mostly used in cases when we want data from multiple sources for diversified inputs. Data collection is done by the extraction of data from various public online resources, such as government websites or certain social media platforms.
Sometimes the above options do not work well for training data collection.
In this case, we have to check what in-house options are available. For example, if we are working on a chatbot that aims to respond to students' problems, instead of using natural language processing datasets, we can try to extract data from supervisors and students' conversations if the logs and messages are preserved.
There might be some scenarios for which we are unable to gather data that meets our needs.
Instead, we can repurpose the data we already have to broaden the dataset. Data augmentation means applying different transformations to the original data to generate new data that suits our case. For image data, the training set can be enlarged with simple operations such as rotation or changes in color and brightness.
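To make these operations concrete, here is a minimal sketch that treats a grayscale image as a nested list of pixel values from 0 to 255. Real pipelines would normally use an image library; the helper names and the tiny sample image are assumptions made for illustration.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping the result to the 0-255 range."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]

# A tiny 2x3 grayscale "image".
image = [
    [0, 50, 100],
    [150, 200, 250],
]
augmented = [
    flip_horizontal(image),
    rotate_90(image),
    adjust_brightness(image, 30),
]
```

Each transformed copy keeps the original label, so one labeled image yields several training examples.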
Finally, let's recap everything you've learned in our essential guide to quality training data: