In this article, we will go in-depth about the loss functions and their implementation in the PyTorch framework.

10 min read · July 13, 2022

Humans evolve by learning from their past mistakes.

Similarly, deep learning training uses a feedback mechanism called loss functions to evaluate mistakes and improve learning trajectories.



Before diving into the PyTorch specifics, let’s quickly recap the basics of loss functions and their characteristics.

Loss functions measure how close a **predicted value** is to the **actual value**. When our model makes predictions that are very close to the actual values on our training and testing dataset, it means we have a pretty robust model.

Loss functions guide the model training process towards correct predictions. A loss function is a mathematical function or expression used to measure a model's performance on a dataset.

The objective of the learning process is to minimize the error given by the loss function so that the model improves after every training iteration. Different loss functions serve different purposes: each is suited to a particular training task and crafted by researchers to ensure stable gradient flow during training.

Here’s what you need to do before getting hands-on experience with PyTorch.

First, you must set up PyTorch to test and run your code.

We can do this using these amazing tools:

- Google Colab
- Anaconda

**Google Colab** is helpful if you prefer to run your PyTorch code in your web browser. It comes with all major frameworks preinstalled, so you can run PyTorch loss functions out of the box.

Another option is to install the PyTorch framework on a local machine using the Anaconda package installer.

With **Anaconda**, it's easy to get and manage Python, Jupyter Notebook, and other commonly used scientific computing and data science packages, like PyTorch.

Let me walk you through the installation steps:

- **Download and install Anaconda** (choose the latest Python version).
- Go to **PyTorch's site** and find the *Get Started locally* section.
- Specify the appropriate configuration options for your particular environment.
- Run the presented command in the terminal to install PyTorch.

PyTorch has two fundamental modules, **torch** and **torch.nn**, that provide the starter functions required to construct your loss functions, such as creating a tensor.

The **torch** library provides excellent flexibility and support for tensor operations on the GPU. It has a wide range of functionalities to train different neural network models.

The **torch.nn** module provides building blocks such as layers, activation functions, and loss functions that are essential to training a model.

You can read more about the torch.nn here.

Once you have PyTorch up and running, here’s how you can add loss functions in PyTorch.

The torch.nn module in PyTorch has predefined, ready-to-use loss functions that you can use to train your neural network.

Let’s do a simple code walk-through that shows how to add a loss function in PyTorch using the torch.nn module.

First, import the libraries from PyTorch:

import torch

import torch.nn as nn

Now, it's time to define a loss function variable. Here, we will use cross-entropy loss as an example, but you can use any loss function from the library.

loss_fn = nn.CrossEntropyLoss()

Remember, you can also write a custom loss function definition based on your application and use it instead of using an inbuilt PyTorch loss function.

While writing the forward pass for training the neural network, you can use the above loss function to pass your inputs and outputs to get a loss scalar value that is further used for backward propagation.
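As a minimal sketch of the step described above, the snippet below runs one forward and backward pass with cross-entropy loss. The model, tensor shapes, and learning rate are made up for illustration and are not taken from the article's own code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 8 samples, 4 features, 3 classes.
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))  # class indices, one per sample

optimizer.zero_grad()             # clear gradients from any previous step
outputs = model(inputs)           # forward pass: raw scores (logits)
loss = loss_fn(outputs, targets)  # scalar loss value
loss.backward()                   # backward propagation from the scalar
optimizer.step()                  # update the model parameters

print(loss.item())
```

The same pattern applies to any other loss function from torch.nn: only the `loss_fn` line and the expected shape or dtype of `targets` change.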

Here’s a link to the PyTorch documentation that lists all the predefined loss functions that you can use directly in your neural network training code.

I would recommend you refer to the source code for these loss functions to get a better understanding of how loss functions are defined internally in PyTorch.

To learn more about standard coding practices for writing neural network training pipelines, I would strongly recommend referring to the open-source PyTorch Image Models framework for training computer vision models.

There are **three** types of loss functions in PyTorch:

**Regression loss functions** deal with continuous values, which can take any value between two limits, such as when predicting the GDP per capita of a country given its rate of population growth, urbanization, historical GDP trends, etc.

**Classification loss functions** deal with discrete values, like the task of classifying an object with a confidence value. For instance, image classification into two labels: cat and dog.

**Ranking loss functions** predict the relative distances between values. An example would be face verification, where we want to know which face images belong to a particular person. We can do so by ranking candidate faces according to how closely they match the target face scan.

Now, let’s discuss each of PyTorch’s loss functions in more detail:

- L1 loss function (Mean Absolute Error Loss)
- Mean Squared Error Loss
- Negative Log-Likelihood Loss
- Cross-Entropy Loss
- Binary Cross Entropy Loss
- Hinge Embedding Loss
- Margin Ranking Loss
- Triplet Margin Loss
- Kullback-Leibler divergence

The L1 loss function computes the mean absolute error between each value in the predicted and target tensor. It computes the sum of all the values returned from each absolute difference computation and takes the average of this sum value to obtain the Mean Absolute Error (*MAE*).

**When to use?**

- Regression
- Specifically in those cases where target variables contain outliers.
- It is robust for handling noise.

**Syntax**

**PyTorch Code Implementation**
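A minimal sketch of L1 loss with `nn.L1Loss`, using made-up prediction and target values chosen so the result is easy to check by hand:

```python
import torch
import torch.nn as nn

loss_fn = nn.L1Loss()  # reduction="mean" by default

# Illustrative values only.
prediction = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Absolute differences are 0.5, 0.5, 0.0 and 1.0, so their mean is 0.5.
loss = loss_fn(prediction, target)
print(loss.item())
```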

The smooth L1 loss function combines the benefits of MSE loss and MAE loss through a threshold value beta (1.0 by default). It uses a squared term if the absolute error falls below beta and an absolute term otherwise. It is less sensitive to outliers than the mean square error loss and, in some cases, prevents exploding gradients.

In mean square error loss, we square the difference, so errors greater than 1 turn into much larger numbers. These high values can result in exploding gradients. Smooth L1 avoids this: errors at or above beta are not squared.

**When to use?**

- Regression.
- When the features have large values.
- Well suited for most problems.

**Syntax**

**PyTorch Code Implementation**
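A minimal sketch of smooth L1 loss with `nn.SmoothL1Loss`, assuming the default threshold `beta=1.0`. The two values are made up so that one error falls in the squared region and the other in the absolute region:

```python
import torch
import torch.nn as nn

loss_fn = nn.SmoothL1Loss(beta=1.0)

# Illustrative values only: one small error (0.5) and one large error (3.0).
prediction = torch.tensor([0.5, 3.0])
target = torch.tensor([0.0, 0.0])

# Small error: 0.5 * 0.5**2 / beta = 0.125 (squared term)
# Large error: 3.0 - 0.5 * beta   = 2.5   (absolute term)
loss = loss_fn(prediction, target)
print(loss.item())  # mean of 0.125 and 2.5
```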

The Mean Square Error shares similarities with the Mean Absolute Error. It computes the **squared difference** between values in the prediction tensor and those in the target tensor, and averages the results.

By doing so, relatively significant differences are penalized more, while relatively minor differences are penalized less. MSE is considered less robust at handling outliers and noise than MAE.

**When to use it?**

- Regression problems.
- The numerical value features are not large.
- The problem is not very high-dimensional.

**Syntax**

**PyTorch Code Implementation**
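A minimal sketch of mean squared error with `nn.MSELoss`, using made-up values. Note how the single error of 1.0 contributes most of the loss once the differences are squared:

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()  # reduction="mean" by default

# Illustrative values only.
prediction = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Squared differences are 0.25, 0.25, 0.0 and 1.0, so their mean is 0.375.
loss = loss_fn(prediction, target)
print(loss.item())
```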

Cross-entropy as a loss function is used to learn the probability distribution of the data. While other loss functions like squared loss penalize wrong predictions, cross-entropy gives a more significant penalty when incorrect predictions are made with high confidence.

Cross-entropy penalizes wrong but confident predictions as well as correct but less confident predictions. In PyTorch, what differentiates it from negative log-likelihood loss is the expected input: nn.CrossEntropyLoss takes raw, unnormalized scores (logits) and applies log-softmax internally, whereas nn.NLLLoss expects log-probabilities.

**When to use?**

- Classification tasks
- Making a confident model, i.e. a model that not only predicts accurately but also does so with high probability.
- For higher precision/recall values.

**Syntax**

**PyTorch Code Implementation**
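A minimal sketch of `nn.CrossEntropyLoss` on made-up logits. It compares a confident wrong prediction with a less confident wrong prediction to illustrate the heavier penalty on confidence:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# The true class is index 1 in both cases (illustrative values only).
target = torch.tensor([1])

# Logits strongly favouring the wrong class 0.
confident_wrong = loss_fn(torch.tensor([[5.0, 0.0]]), target)

# Logits only mildly favouring the wrong class 0.
unsure_wrong = loss_fn(torch.tensor([[0.5, 0.0]]), target)

print(confident_wrong.item(), unsure_wrong.item())
```

The loss expects raw logits and integer class indices, so no softmax layer is applied before calling it.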

Negative log-likelihood (NLL) loss maximizes the overall probability of the data. It penalizes the model when it predicts the correct class with smaller probabilities and incentivizes when the prediction is made with a higher probability. The logarithm does the penalizing part here.

The smaller the probability, the larger the magnitude of its logarithm. The negative sign is used here because the probabilities lie in the range [0, 1], and the logarithms of values in this range are negative or zero, so negating them makes the loss value non-negative.

**When to use?**

- Classification.
- Smaller quicker training.
- Simple tasks.

**Syntax**

**PyTorch Code Implementation**
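A minimal sketch of negative log-likelihood loss with `nn.NLLLoss`. This loss expects log-probabilities as input, so the example first passes made-up logits through `nn.LogSoftmax`:

```python
import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)
loss_fn = nn.NLLLoss()

# Illustrative logits for 2 samples and 3 classes.
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 2.0, 3.0]])
targets = torch.tensor([2, 0])  # class indices

log_probs = log_softmax(logits)
loss = loss_fn(log_probs, targets)

# The loss is the mean of the negated log-probabilities of the target classes.
manual = -(log_probs[0, 2] + log_probs[1, 0]) / 2
print(loss.item(), manual.item())
```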

Binary Cross-Entropy loss is a particular class of Cross-Entropy losses used for the unique problem of classifying data points into only two classes. Labels for this type of problem are usually binary, and our goal is to push the model to predict a number close to zero for a zero label and a number close to one for one label.

Usually, when using BCE loss for binary classification problems, the final layer of the neural network is a Sigmoid layer, which ensures that the output lies between zero and one and can be read as a probability.

**When to use?**

- Binary Classification tasks

**Syntax**

**PyTorch Code Implementation**
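A minimal sketch of binary cross-entropy with `nn.BCELoss`. The logits are made up and are passed through a sigmoid first, because this loss expects probabilities between 0 and 1 and float targets:

```python
import torch
import torch.nn as nn

loss_fn = nn.BCELoss()

# Illustrative raw scores and binary labels.
logits = torch.tensor([0.8, -1.2, 2.5])
targets = torch.tensor([1.0, 0.0, 1.0])

probs = torch.sigmoid(logits)  # squash scores into (0, 1)
loss = loss_fn(probs, targets)

# Same quantity written out by hand.
manual = -(targets * probs.log() + (1 - targets) * (1 - probs).log()).mean()
print(loss.item(), manual.item())
```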

BCEWithLogitsLoss combines a Sigmoid layer and the BCELoss in one single class. By using the log-sum-exp trick, it is more numerically stable than a plain Sigmoid followed by a BCELoss.

**Syntax**

**PyTorch Code Implementation**
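A minimal sketch of `nn.BCEWithLogitsLoss` on made-up values. It takes raw logits directly and, for well-behaved inputs like these, should agree with applying a sigmoid followed by `nn.BCELoss`:

```python
import torch
import torch.nn as nn

# Illustrative raw scores and binary labels.
logits = torch.tensor([0.8, -1.2, 2.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# Sigmoid and BCE fused into one numerically stable call.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Two-step equivalent, shown only for comparison.
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss.item(), two_step.item())
```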

Hinge Embedding Loss measures the loss given an input tensor x and a labels tensor y containing values (1 or -1). It is used for measuring whether two inputs are similar or dissimilar.

**When to use?**

- Learning nonlinear embeddings
- Semi-supervised learning
- Where the similarity or dissimilarity of two inputs is to be measured.

**Syntax**

**PyTorch Code Implementation**
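A minimal sketch of `nn.HingeEmbeddingLoss` with the default `margin=1.0`. The input values are made up and stand in for distances between pairs of samples; a label of 1 marks a similar pair and -1 a dissimilar pair:

```python
import torch
import torch.nn as nn

loss_fn = nn.HingeEmbeddingLoss(margin=1.0)

# Illustrative pairwise distances and their similarity labels.
distances = torch.tensor([0.2, 1.5, 0.4])
labels = torch.tensor([1.0, -1.0, -1.0])

# label  1: loss is the distance itself            -> 0.2
# label -1: loss is max(0, margin - distance)      -> 0.0 and 0.6
loss = loss_fn(distances, labels)
print(loss.item())  # mean of 0.2, 0.0 and 0.6
```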

Margin Ranking loss belongs to the ranking losses whose main objective, unlike other loss functions, is to measure the relative distance between a set of inputs in a dataset.

The Margin Ranking loss function takes two inputs and a label containing only 1 or -1.

If the label is 1, then it is assumed that the first input should have a higher ranking than the second input, and if the label is -1, it is assumed that the second input should have a higher ranking than the first input.

**When to use?**

- GANs.
- Ranking tasks.

**Syntax**

**PyTorch Code Implementation**
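A minimal sketch of `nn.MarginRankingLoss` with the default `margin=0.0`, using made-up scores. Each pair is penalized only when its ordering disagrees with the label:

```python
import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=0.0)

# Illustrative scores for three pairs of items.
first = torch.tensor([0.8, 0.2, 0.5])
second = torch.tensor([0.3, 0.9, 0.1])
# 1: first should rank higher, -1: second should rank higher.
labels = torch.tensor([1.0, 1.0, -1.0])

# Per-pair loss is max(0, -label * (first - second) + margin):
# pair 1 is ordered correctly -> 0.0, pair 2 -> 0.7, pair 3 -> 0.4
loss = loss_fn(first, second, labels)
print(loss.item())  # mean of 0.0, 0.7 and 0.4
```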

The Triplet loss function is widely used to evaluate the inputs' similarity. It uses triplets generated from the training data. Each triplet comprises three sample points (**anchor**, **positive**, and **negative**).

The objective is to learn to minimize the distance between the anchor and positive data points and maximize the distance between the anchor and negative data points with a margin. You can think of it as learning a separation boundary (margin) between the class clusters.

**When to use?**

- In ranking tasks like face matching, search retrieval, etc.
- Embedding learning

**Syntax**

**PyTorch Code Implementation**
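A minimal sketch of `nn.TripletMarginLoss` with `margin=1.0` and Euclidean distance (`p=2`). The 2-D embeddings are made up: the first triplet already satisfies the margin, the second does not.

```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)

# Illustrative embeddings for two triplets.
anchor = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
positive = torch.tensor([[0.0, 1.0], [0.0, 1.0]])
negative = torch.tensor([[3.0, 4.0], [0.0, 1.5]])

# Per-triplet loss is max(0, d(anchor, positive) - d(anchor, negative) + margin):
# triplet 1: 1.0 - 5.0 + 1.0 < 0 -> 0.0
# triplet 2: 1.0 - 1.5 + 1.0     -> 0.5
loss = loss_fn(anchor, positive, negative)
print(loss.item())  # approximately the mean of 0.0 and 0.5
```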

Cosine Embedding Loss measures similarity by computing the cosine distance between the two data points in space. The cosine distance correlates to the angle between the two points, which means that the smaller the angle, the closer the inputs and the more similar they are.

**When to use?**

- Learning nonlinear embeddings
- Semi-supervised learning
- Where the similarity or dissimilarity of two inputs is to be measured.

**Syntax**

**PyTorch Code Implementation**
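A minimal sketch of `nn.CosineEmbeddingLoss` with the default `margin=0.0`. The vectors are made up so the cosine similarities are exactly 1 or 0:

```python
import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.0)

# Illustrative 2-D embeddings for three pairs.
first = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
second = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# 1: the pair should be similar, -1: the pair should be dissimilar.
labels = torch.tensor([1.0, 1.0, -1.0])

# label  1: loss is 1 - cos              -> 0.0 (identical), 1.0 (orthogonal)
# label -1: loss is max(0, cos - margin) -> 1.0 (identical but labelled dissimilar)
loss = loss_fn(first, second, labels)
print(loss.item())  # mean of 0.0, 1.0 and 1.0
```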

KL divergence measures how two probability distributions are different from each other.

Given two distributions, P and Q, Kullback Leibler Divergence (KLD) loss measures how much information is lost when P (assumed to be the true distribution) is replaced with Q.

By measuring how much information is lost when we use Q to approximate P, we can obtain the similarity between P and Q and drive our algorithm to produce a distribution very close to the true distribution, P.

The information loss when Q is used to approximate P is not the same as when P is used to approximate Q; thus, KL Divergence is not symmetric.

**When to use?**

- Classification
- The same can be achieved with cross-entropy with less computation, so KL divergence is usually avoided for plain classification.

**Syntax**

**PyTorch Code Implementation**
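A minimal sketch of `nn.KLDivLoss` on two made-up distributions. Note the argument convention: the first argument holds the log-probabilities of the approximating distribution Q, and the second holds the probabilities of the true distribution P. Swapping the roles gives a different value, which illustrates the asymmetry.

```python
import torch
import torch.nn as nn

loss_fn = nn.KLDivLoss(reduction="batchmean")

# Illustrative distributions over 3 outcomes (batch of size 1).
p = torch.tensor([[0.7, 0.2, 0.1]])  # "true" distribution P
q = torch.tensor([[0.5, 0.3, 0.2]])  # approximation Q

kl_pq = loss_fn(q.log(), p)  # information lost when Q approximates P
kl_qp = loss_fn(p.log(), q)  # information lost when P approximates Q

# KL(P || Q) written out by hand for comparison.
manual = (p * (p.log() - q.log())).sum() / p.shape[0]
print(kl_pq.item(), kl_qp.item(), manual.item())
```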

Writing a custom PyTorch loss function is simple.

The standard way of doing it is to write a class definition per loss function. The class mainly has two methods.

**\_\_init\_\_**

- The **\_\_init\_\_** method defines the input member variables required for the loss function.
- In the EmbeddingLoss function, the **margin** variable defines the separation between positive and negative samples in high-dimensional space.

**forward**

- The forward function runs the loss function over the model outputs and targets and returns a scalar value that backward propagation starts from.
- In the EmbeddingLoss function, the distance between two samples is first calculated using the sum of error formula, and then the margin is used to enforce a minimum separation.
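The article's original EmbeddingLoss code is not reproduced here, so the class below is only a hypothetical sketch that follows the description above. It assumes a contrastive-style formulation with squared Euclidean distance and labels of 1 (similar) or 0 (dissimilar); the argument names and the label convention are illustrative choices, not the author's.

```python
import torch
import torch.nn as nn


class EmbeddingLoss(nn.Module):
    """Hypothetical margin-based loss for pairs of embeddings."""

    def __init__(self, margin=1.0):
        super().__init__()
        # Minimum separation required between dissimilar samples.
        self.margin = margin

    def forward(self, first, second, label):
        # Squared Euclidean distance for each pair (sum of squared errors).
        distance = (first - second).pow(2).sum(dim=1)
        # Similar pairs (label 1) are pulled together.
        similar_term = label * distance
        # Dissimilar pairs (label 0) are pushed apart until they reach the margin.
        dissimilar_term = (1 - label) * torch.clamp(self.margin - distance, min=0.0)
        return (similar_term + dissimilar_term).mean()


# Illustrative usage with made-up 2-D embeddings.
first = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
second = torch.tensor([[1.0, 0.0], [0.5, 0.0]])
label = torch.tensor([1.0, 0.0])

loss = EmbeddingLoss(margin=1.0)(first, second, label)
print(loss.item())  # mean of 1.0 (similar pair) and 0.75 (dissimilar pair)
```

Because the class subclasses `nn.Module` and uses only differentiable tensor operations, calling `loss.backward()` works without writing a custom backward pass.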

Here’s a link to the Kaggle discussion thread that is a great resource to learn more about the right way of defining loss functions.

Once you have done the hard work of writing your loss function and training code for the neural network, monitoring the loss values is essential.

Let’s see how to do it!

Let’s take the example of the FashionMNIST classification task. The objective of the task is to classify ten clothes classes.

A neural network training pipeline consists mainly of the following low-level components:

- **Dataset** class to load, filter and visualize data.
- **Network Architecture** class to define the network.
- **Training** function to train the model for every epoch using loss values.
- **Loss validation and visualization** function to monitor the loss values.
- **Inference** function to test the trained model.

Please refer to the Google Colab link to train and test the Fashion MNIST dataset on your own.

Let’s go through what losses we need to monitor while training and how we can visualize them to track the training progress.

Training a model requires the calculation of 2 types of losses:

- **train_loss** - The training loss value indicates how much the model has learned from the training data. Ideally, the training loss should decrease with every iteration.
- **validation_loss** - The validation loss value indicates the model’s performance on data it hasn’t seen before. It shows whether the model is *overfitting* or *underfitting* the training data.

I am using the code snippet from the Google Colab link.
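The Colab snippet itself is not reproduced here. As a rough, self-contained sketch of the same idea, the loop below records one training loss and one validation loss per epoch. It uses small synthetic data as a stand-in for FashionMNIST, and the network, sizes, and hyperparameters are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for FashionMNIST: 20 features, 10 classes.
# Labels come from a fixed random linear map so that they are learnable.
true_weights = torch.randn(20, 10)


def make_split(num_samples):
    features = torch.randn(num_samples, 20)
    labels = (features @ true_weights).argmax(dim=1)
    return TensorDataset(features, labels)


train_loader = DataLoader(make_split(512), batch_size=32, shuffle=True)
valid_loader = DataLoader(make_split(128), batch_size=32)

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_epochs = 5
train_losses, valid_losses = [], []

for epoch in range(num_epochs):
    # Training pass: update weights and accumulate the summed batch losses.
    model.train()
    running = 0.0
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
        running += loss.item() * features.size(0)
    train_losses.append(running / len(train_loader.dataset))

    # Validation pass: no gradient tracking and no weight updates.
    model.eval()
    running = 0.0
    with torch.no_grad():
        for features, labels in valid_loader:
            running += loss_fn(model(features), labels).item() * features.size(0)
    valid_losses.append(running / len(valid_loader.dataset))

    print(f"epoch {epoch + 1}: train_loss={train_losses[-1]:.4f} "
          f"valid_loss={valid_losses[-1]:.4f}")
```

The two lists hold one average loss per epoch, so they can be passed directly to a plotting library to draw the training and validation curves on the same axes.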

When you plot the **train_losses** and **valid_losses** values on a graph, you should expect both curves to fall steeply during the first epochs and then flatten out, roughly like an exponential decay.

The training loss is ideally slightly lower than the validation loss, with both decreasing together. This indicates that the model has learned well on the training data and generalizes to unseen data, so it is less likely to fail on real-world inputs in production. A validation loss that rises while the training loss keeps falling is a sign of overfitting.

- PyTorch has predefined loss functions that you can use to train almost any neural network architecture.
- The loss function guides the model training to convergence.
- Choosing the correct loss function is crucial to the model performance.
- Loss values should be monitored visually to track the model learning progress.
