Humans improve by learning from their past mistakes.
Similarly, deep learning training uses a feedback mechanism, the loss function, to evaluate mistakes and improve its learning trajectory.
In this article, we will go in-depth on loss functions and their implementation in the PyTorch framework.
Before diving into the PyTorch specifics, let’s quickly recap the basics of loss functions and their characteristics.
Loss functions measure how close a predicted value is to the actual value. When our model makes predictions that are very close to the actual values on our training and testing dataset, it means we have a pretty robust model.
Loss functions guide the model training process towards correct predictions. A loss function is a mathematical function or expression used to measure a model’s performance on a dataset.
The objective of the learning process is to minimize the error given by the loss function so that the model improves after every training iteration.
Different loss functions suit different problems, each carefully crafted by researchers to ensure stable gradient flow during training.
Here’s what you need to do before getting hands-on experience with PyTorch.
First, you must set up PyTorch to test and run your code.
We can do this using these amazing tools:
Google Colab is helpful if you prefer to run your PyTorch code in your web browser. It comes with all major frameworks preinstalled, so you can run PyTorch loss functions out of the box.
Another option is to install the PyTorch framework on a local machine using the Anaconda package installer.
With Anaconda, it's easy to get and manage Python, Jupyter Notebook, and other commonly used scientific computing and data science packages, like PyTorch.
Let me walk you through the installation steps:
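The exact install command depends on your OS and CUDA setup (the PyTorch website generates it for you; a typical Anaconda command looks something like conda install pytorch torchvision -c pytorch). Once installed, a quick sanity check from Python confirms everything works:

import torch

print(torch.__version__)           # prints the installed PyTorch version
print(torch.cuda.is_available())   # True only if a CUDA-enabled GPU build and device are available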
PyTorch has two fundamental modules, torch and torch.nn, that contain the building blocks required to construct your loss functions, such as creating tensors.
The torch library provides excellent flexibility and support for tensor operations on the GPU, along with a wide range of functionality for training different neural network models.
The torch.nn module provides building blocks such as layers, activation functions, and loss functions that are essential to training a model.
You can read more about torch.nn here.
Once you have PyTorch up and running, here’s how you can add loss functions in PyTorch.
The torch.nn module in PyTorch has predefined, ready-to-use loss functions that you can use to train your neural network.
Let’s do a simple code walk-through that will guide you on how to add a loss function in PyTorch using the torch.nn library.
First, import the required libraries from PyTorch:
import torch
import torch.nn as nn
Remember, you can also write a custom loss function definition based on your application and use it instead of using an inbuilt PyTorch loss function.
While writing the forward pass for training the neural network, you pass your predictions and targets to the loss function to get a scalar loss value, which is then used for backward propagation.
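Here’s a minimal sketch of that pattern (the linear model, MSE loss, and random data below are only placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder single-layer model
criterion = nn.MSELoss()                                   # any built-in or custom loss works here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)                               # dummy batch of 16 samples
targets = torch.randn(16, 1)

predictions = model(inputs)                                # forward pass
loss = criterion(predictions, targets)                     # scalar loss value
optimizer.zero_grad()
loss.backward()                                            # backward propagation
optimizer.step()
print(loss.item())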
Here’s a link to the PyTorch documentation that lists all the predefined loss functions that you can use directly in your neural network training code.
I would recommend you refer to the source code for these loss functions to get a better understanding of how loss functions are defined internally in PyTorch.
To learn more about the standard coding practices for writing a neural network training pipeline, I would strongly recommend referring to the PyTorch Image Models (timm) open-source framework for training computer vision models.
There are three types of loss functions in PyTorch:
Regression loss functions deal with continuous values, which can take any value between two limits, such as when predicting the GDP per capita of a country given its rate of population growth, urbanization, historical GDP trends, etc.
Classification loss functions deal with discrete values, like the task of classifying an object with a confidence value. For instance, image classification into two labels: cat and dog.
Ranking loss functions predict the relative distances between values. An example would be face verification, where we want to know which face images belong to a particular face. We can do so by ranking faces that do not belong to the original face-holder via their degree of relative approximation to the target face scan.
Now, let’s discuss each of PyTorch’s loss functions in more detail.
The L1 loss function computes the mean absolute error between each value in the predicted and target tensors. It sums the absolute differences and averages them over the number of elements to obtain the Mean Absolute Error (MAE).
When to use?
Syntax
PyTorch Code Implementation
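Here is a minimal sketch, with random tensors standing in for real predictions and targets:

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)   # random values stand in for predictions
target = torch.randn(3, 5)                      # random values stand in for ground truth

mae_loss = nn.L1Loss()
output = mae_loss(input, target)
output.backward()

print(output)   # a single scalar: the mean absolute error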
The smooth L1 loss function combines the benefits of MSE loss and MAE loss through a heuristic value beta. It uses a squared term when the absolute error falls below beta (1 by default) and an absolute term otherwise. It is less sensitive to outliers than the mean square error loss and, in some cases, prevents exploding gradients.
In mean square error loss, squaring the difference can produce values much larger than the original error, and these large values can cause exploding gradients. Smooth L1 avoids this because differences larger than beta are not squared.
When to use?
Syntax
PyTorch Code Implementation
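A minimal sketch with dummy tensors (the beta argument is available in recent PyTorch versions):

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)

smooth_l1_loss = nn.SmoothL1Loss(beta=1.0)   # beta sets the switch point between the two regimes
output = smooth_l1_loss(input, target)
output.backward()

print(output)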
The Mean Square Error shares similarities with the Mean Absolute Error. It computes the squared difference between values in the prediction tensor and those in the target tensor.
By doing so, relatively significant differences are penalized more, while relatively minor differences are penalized less. MSE is considered less robust at handling outliers and noise than MAE.
When to use it?
Syntax
PyTorch Code Implementation
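A minimal sketch with dummy tensors:

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)

mse_loss = nn.MSELoss()
output = mse_loss(input, target)
output.backward()

print(output)   # mean of the squared differences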
Cross-entropy as a loss function is used to learn the probability distribution of the data. While other loss functions like squared loss penalize wrong predictions, cross-entropy gives a more significant penalty when incorrect predictions are made with high confidence.
What differentiates it from negative log-likelihood loss is that cross-entropy also penalizes wrong but confident predictions and correct but less confident predictions. In contrast, negative log loss does not penalize according to the confidence of predictions.
When to use?
Syntax
PyTorch Code Implementation
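A minimal sketch, assuming raw logits and integer class labels:

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)   # raw, unnormalized scores (logits) for 3 samples and 5 classes
target = torch.tensor([1, 0, 4])                # the correct class index for each sample

cross_entropy_loss = nn.CrossEntropyLoss()
output = cross_entropy_loss(input, target)
output.backward()

print(output)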
It maximizes the overall probability of the data. It penalizes the model when it predicts the correct class with a smaller probability and rewards it when the prediction is made with a higher probability. The logarithm does the penalizing part here.
The smaller the predicted probability of the correct class, the larger its negative logarithm, and therefore the larger the loss. The negative sign is used because the probabilities lie in the range [0, 1], and the logarithms of values in this range are negative, so negating them makes the loss value positive.
When to use?
Syntax
PyTorch Code Implementation
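A minimal sketch; the log-probabilities are produced with LogSoftmax, as NLLLoss expects:

import torch
import torch.nn as nn

# NLLLoss expects log-probabilities, so the raw scores go through LogSoftmax first
log_probs = nn.LogSoftmax(dim=1)(torch.randn(3, 5, requires_grad=True))
target = torch.tensor([1, 0, 4])

nll_loss = nn.NLLLoss()
output = nll_loss(log_probs, target)
output.backward()

print(output)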
Binary Cross-Entropy loss is a particular class of Cross-Entropy losses used for the unique problem of classifying data points into only two classes. Labels for this type of problem are usually binary, and our goal is to push the model to predict a number close to zero for a zero label and a number close to one for one label.
Usually, when using BCE loss for binary classification problems, the output of the neural network is passed through a Sigmoid layer to ensure that it is either a value close to zero or a value close to one.
When to use?
Syntax
PyTorch Code Implementation
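A minimal sketch with dummy probabilities and binary labels:

import torch
import torch.nn as nn

# Sigmoid squashes the raw scores into [0, 1], since BCELoss expects probabilities
probs = torch.sigmoid(torch.randn(3, requires_grad=True))
target = torch.tensor([0., 1., 1.])   # binary labels as floats

bce_loss = nn.BCELoss()
output = bce_loss(probs, target)
output.backward()

print(output)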
It combines a Sigmoid layer and the BCELoss in one single class. By merging the two operations, it can take advantage of the log-sum-exp trick for numerical stability, making it more numerically stable than a plain Sigmoid followed by a BCELoss.
Syntax
PyTorch Code Implementation
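A minimal sketch, feeding raw logits directly to the loss:

import torch
import torch.nn as nn

input = torch.randn(3, requires_grad=True)   # raw scores; the Sigmoid is applied inside the loss
target = torch.tensor([0., 1., 1.])

bce_with_logits_loss = nn.BCEWithLogitsLoss()
output = bce_with_logits_loss(input, target)
output.backward()

print(output)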
Hinge Embedding Loss measures the loss given an input tensor x and a labels tensor y containing values of 1 or -1. It is used for measuring whether two inputs are similar or dissimilar.
When to use?
Syntax
PyTorch Code Implementation
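A minimal sketch with illustrative 1/-1 labels:

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
# labels must be 1 (similar) or -1 (dissimilar); these values are just illustrative
target = torch.tensor([[ 1., -1.,  1., -1.,  1.],
                       [ 1.,  1., -1., -1.,  1.],
                       [-1.,  1.,  1., -1., -1.]])

hinge_loss = nn.HingeEmbeddingLoss(margin=1.0)
output = hinge_loss(input, target)
output.backward()

print(output)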
Margin Ranking loss belongs to the ranking losses whose main objective, unlike other loss functions, is to measure the relative distance between a set of inputs in a dataset.
The Margin Ranking loss function takes two inputs and a label containing only 1 or -1.
If the label is 1, the first input is assumed to deserve a higher ranking than the second input; if the label is -1, the second input is assumed to deserve a higher ranking than the first.
When to use?
Syntax
PyTorch Code Implementation
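A minimal sketch with two dummy score tensors:

import torch
import torch.nn as nn

input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.tensor([1., -1., 1.])   # 1: input1 should rank higher; -1: input2 should rank higher

ranking_loss = nn.MarginRankingLoss(margin=0.0)
output = ranking_loss(input1, input2, target)
output.backward()

print(output)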
The Triplet loss function is widely used to evaluate the similarity of inputs. It uses triplets sampled from the training data. A triplet comprises three sample points (anchor, positive, and negative): the anchor and positive belong to the same class but are different data points, whereas the negative belongs to a different class.
The objective is to learn to minimize the distance between the anchor and positive data points and maximize the distance between the anchor and negative data points by at least a margin. You can think of it as learning a separation boundary (margin) between the class clusters.
When to use?
Syntax
PyTorch Code Implementation
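A minimal sketch with random embeddings standing in for the anchor, positive, and negative samples:

import torch
import torch.nn as nn

anchor = torch.randn(10, 128, requires_grad=True)     # embeddings for the anchor samples
positive = torch.randn(10, 128, requires_grad=True)   # same class as the anchor
negative = torch.randn(10, 128, requires_grad=True)   # different class from the anchor

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
output = triplet_loss(anchor, positive, negative)
output.backward()

print(output)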
The criterion measures similarity by computing the cosine distance between the two data points in space. The cosine distance correlates to the angle between the two points, which means that the smaller the angle, the closer the inputs and the more similar they are.
When to use?
Syntax
PyTorch Code Implementation
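A minimal sketch with two dummy embedding tensors:

import torch
import torch.nn as nn

input1 = torch.randn(3, 6, requires_grad=True)
input2 = torch.randn(3, 6, requires_grad=True)
target = torch.tensor([1., -1., 1.])   # 1: the pair should be similar; -1: dissimilar

cosine_loss = nn.CosineEmbeddingLoss(margin=0.0)
output = cosine_loss(input1, input2, target)
output.backward()

print(output)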
KL divergence measures how different two probability distributions are from each other.
Given two distributions, P and Q, Kullback Leibler Divergence (KLD) loss measures how much information is lost when P (assumed to be the true distribution) is replaced with Q.
By measuring how much information is lost when we use Q to approximate P, we can obtain the similarity between P and Q and drive our algorithm to produce a distribution very close to the true distribution, P.
The information loss when Q is used to approximate P is not the same as when P is used to approximate Q; thus, KL Divergence is not symmetric.
When to use?
Syntax
PyTorch Code Implementation
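A minimal sketch, with softmax-normalized random tensors standing in for the two distributions:

import torch
import torch.nn as nn
import torch.nn.functional as F

# KLDivLoss expects log-probabilities for the input (Q) and probabilities for the target (P)
input = F.log_softmax(torch.randn(3, 5, requires_grad=True), dim=1)
target = F.softmax(torch.randn(3, 5), dim=1)

kl_loss = nn.KLDivLoss(reduction='batchmean')
output = kl_loss(input, target)
output.backward()

print(output)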
Writing a custom PyTorch loss function is simple.
The standard way of doing it is to write a class definition per loss function, with two main methods, as shown in the sketch below:
__init__: sets up the loss module and any parameters it needs
forward: takes the predictions and targets and returns the scalar loss value
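Here is a minimal sketch of that pattern; the class name and the MSE-style formula are purely illustrative:

import torch
import torch.nn as nn

class CustomMSELoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, predictions, targets):
        # mean of the squared differences, equivalent to nn.MSELoss
        return torch.mean((predictions - targets) ** 2)

criterion = CustomMSELoss()
loss = criterion(torch.randn(3, 5, requires_grad=True), torch.randn(3, 5))
loss.backward()
print(loss.item())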
Here’s a link to the Kaggle discussion thread that is a great resource to learn more about the right way of defining loss functions.
Once you have done the hard work of writing your loss function and training code for the neural network, monitoring the loss values is essential.
Let’s see how to do it!
Let’s take the example of the FashionMNIST classification task. The objective of the task is to classify images into ten clothing classes.
A neural network training pipeline consists mainly of a few low-level components: the dataset and data loaders, the model, the loss function, the optimizer, and the training loop that ties them together.
Please refer to the Google Colab link to train and test on the Fashion MNIST dataset on your own.
Let’s go through what losses we need to monitor while training and how we can visualize them to track the training progress.
Training a model requires the calculation of two types of losses: the training loss, computed on the batches the model learns from, and the validation loss, computed on held-out data after each epoch.
I am using the code snippet from the Google Colab link.
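Here is a self-contained sketch of that pattern; dummy random tensors stand in for the FashionMNIST data, and a bare linear classifier replaces the notebook's model:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# dummy random data standing in for FashionMNIST: 28x28 "images" with labels from 10 classes
train_set = TensorDataset(torch.randn(512, 1, 28, 28), torch.randint(0, 10, (512,)))
valid_set = TensorDataset(torch.randn(128, 1, 28, 28), torch.randint(0, 10, (128,)))
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=64)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # bare-bones classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_losses, valid_losses = [], []
for epoch in range(5):
    # training loss: averaged over the batches the model learns from
    model.train()
    running = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running += loss.item()
    train_losses.append(running / len(train_loader))

    # validation loss: computed on held-out data with gradients disabled
    model.eval()
    running = 0.0
    with torch.no_grad():
        for images, labels in valid_loader:
            running += criterion(model(images), labels).item()
    valid_losses.append(running / len(valid_loader))

print(train_losses)
print(valid_losses)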
When you plot the train_losses and valid_losses values on a graph, you should expect a roughly exponential decrease over the number of epochs.
Ideally, the training loss should be lower than the validation loss, indicating that the model has learned the training data well while still generalizing to unseen data, so that when the model is used in production, it does not fail under real-world constraints.