Logistic regression is a popular classification algorithm and a useful stepping stone toward more advanced machine learning models such as neural networks. It’s widely adopted in healthcare, marketing, finance, and more.
In logistic regression, the dependent variable is binary, and the independent variables can be continuous, discrete, or categorical. The algorithm aims to find the relationship between the input variables and the probability of the dependent variable being in one of the two categories.
In this article, we’ll give you a comprehensive overview of logistic regression, dive into the mathematical principles behind the algorithm, and provide practical examples of implementing it in PyTorch.
Suppose you are an HR professional trying to determine the outcome of a round of interviews so you can decide which applicants to move on to the next stage. Each candidate must be categorized as "good" or "bad."
Logistic regression is a classification model that uses several independent variables to predict a binary dependent outcome. It is a highly effective technique for identifying the relationship between a set of input features and the likelihood of a particular outcome.
Using a set of input variables, logistic regression aims to model the likelihood of a specific outcome. The output variable in logistic regression is binary—it may only assume one of two potential values (e.g., 0 for the event not to occur or 1 for the event to happen).
To convert the model's output into a categorical value, we use the sigmoid function. The sigmoid function generates an S-shaped curve and delivers a probabilistic value ranging from 0 to 1; in machine learning, it is used to convert raw predictions into probabilities. Although logistic regression is a linear technique, the sigmoid transforms the predictions, so unlike in linear regression, we can no longer interpret the predictions as a linear combination of the inputs.
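The sigmoid itself is simple to express in code; a minimal sketch:

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 maps to exactly 0.5; large positive scores approach 1,
# and large negative scores approach 0.
print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ~0.9975
print(sigmoid(-6))  # ~0.0025
```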
The model can be written as follows:

P(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

This can also be rewritten as (note that f(x) and P(x) are used interchangeably):

f(x) = 1 / (1 + e^(-(β0 + β1x)))
Several explanatory factors frequently influence the dependent variable's value. Logistic regression assumes these independent variables are linearly related to the log odds of the outcome, so with multiple inputs the exponent generalizes to a linear combination β0 + β1x1 + β2x2 + … + βnxn, which the sigmoid then maps to a probability. As this linear combination approaches infinity, the predicted y approaches 1; as it approaches negative infinity, y approaches 0. This is how the dependent variable's value is estimated through logistic regression.
Like linear regression, logit regression represents the data with an equation, but one built on the odds: the ratio of the probability of success to the probability of failure. If p is the probability of success, the odds are p/(1 - p), and the log odds (the logit) are log(p/(1 - p)). Logistic regression models the log odds as a linear combination of the inputs.
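As a quick numerical check (the coefficients below are hypothetical, purely for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """The logit: the natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# The logit and the sigmoid are inverses of each other:
p = 0.8
z = log_odds(p)     # log(0.8 / 0.2) = log(4) ~= 1.386
print(sigmoid(z))   # recovers 0.8

# In logistic regression, the log odds are a linear function of the input:
b0, b1 = -1.0, 0.5  # hypothetical coefficients
x = 4.0
print(sigmoid(b0 + b1 * x))  # P(y = 1 | x) under this toy model
```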
Although logistic regression is a linear technique, the logistic function transforms the predictions: the model is linear on the log-odds scale rather than the probability scale. A threshold can then be defined to determine the class.
Let’s go back to the interview status example. The sigmoid curve assigns each candidate a probability, and a threshold converts that probability into a binary category: if the value is higher than the threshold, it falls into category 1; otherwise, it falls into category 2. For instance, p(x) = 0.98, a 98% chance of the event occurring, falls into category 1, while p(x) = 0.28, a 28% chance, falls into category 2. The default threshold is 0.5, but it should be adjusted to fit the situation.
For simplicity, let’s have two classes, “good” or “bad,” with 1 = good and 0 = bad. The threshold value is assumed to be 0.5.
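Converting model probabilities into these "good"/"bad" labels with a 0.5 threshold might look like this (the probabilities are hypothetical):

```python
def classify(p, threshold=0.5):
    """Map a predicted probability to the 'good' (1) or 'bad' (0) class."""
    return "good" if p >= threshold else "bad"

# Hypothetical predicted probabilities for four candidates
probabilities = [0.98, 0.28, 0.55, 0.41]
labels = [classify(p) for p in probabilities]
print(labels)  # ['good', 'bad', 'good', 'bad']
```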
Assumptions are necessary for modeling a problem statement. They can be considered prerequisites for determining if you can appropriately make inferences from the analysis findings.
When picking a model, checking assumptions is crucial. Since every model has unique requirements, if the assumptions don’t hold, the model may represent the data poorly and produce false predictions.
Here’s the list of assumptions for logistic regression:
- The dependent variable is binary (or categorical, for the multinomial and ordinal variants).
- The observations are independent of one another.
- There is little or no multicollinearity among the independent variables.
- The independent variables are linearly related to the log odds of the outcome.
- The sample size is sufficiently large.
Many performance criteria can be used to assess the logistic regression model, including:
- Accuracy
- Precision and recall
- F1 score
- ROC curve and AUC
- Log loss (cross-entropy)
- Confusion matrix
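The most common of these metrics can be computed directly from confusion-matrix counts; a sketch with hypothetical labels and predictions:

```python
# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts: true/false positives and negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```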
Logistic regression and linear regression are two standard statistical techniques used in many forms of data analysis. Both model the relationship between a dependent variable and one or more independent variables. However, they differ fundamentally, and each method is appropriate for specific kinds of problems.
Let’s go through three types of logistic regression: binary, multinomial, and ordinal.
The most widely used type of logistic regression is binary logistic regression, applied when the dependent variable is binary or dichotomous. Using the values of one or more independent variables, binary logistic regression estimates the likelihood that the dependent variable will take on a specific value (such as 0 or 1). This form of logistic regression is helpful for forecasting outcomes, like whether a customer will purchase a product or whether a patient will benefit from a specific treatment.
The logistic function generates a continuous value between 0 and 1, and the binary model then rounds this output to the nearest class: values below 0.5 map to 0, and values of 0.5 and above map to 1.
Multinomial logistic regression is applied when the dependent variable includes more than two categories but no ordered subcategories. In other words, the categories have no inherent ordering; they are all mutually exclusive.
For example, a multinomial logistic regression model can be used to forecast the kind of fruit a consumer will purchase based on their demographic information, where the possible outcomes are apples, bananas, or oranges. Another application is image classification, where an image must be assigned to one of several groups.
Multinomial logistic regression works by assigning each possible class a probability between 0 and 1, with the probabilities across classes summing to 1, and predicting the class with the highest probability.
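Concretely, the multinomial model replaces the sigmoid with a softmax over one raw score per class; a small sketch using hypothetical scores for the fruit example:

```python
import math

def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model scores for the classes apple, banana, orange
classes = ["apple", "banana", "orange"]
probs = softmax([2.0, 1.0, 0.1])
prediction = classes[probs.index(max(probs))]
print(probs)       # three probabilities summing to 1
print(prediction)  # the highest-probability class
```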
Ordinal logistic regression is applied when the dependent variable includes more than two categories, and there is a natural ordering between the categories. For instance, research might be done to gauge a disease's severity from the patient's symptoms—with a range of potential outcomes, from minor to severe.
Ordinal logistic regression aims to model the relationship between the independent variables and the dependent variable's ordered categories. The coefficients in an ordinal logistic regression model represent the change in the log odds of moving from one category to the next higher category.
Logistic regression is a flexible and effective method for modeling the association between a binary or categorical dependent variable and one or more independent variables. The logistic regression variants covered above can address problems across industries such as marketing, healthcare, and image classification, and each has its own advantages and applications.
Here’s the training code for logistic regression in PyTorch.
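A representative sketch of such a training loop is below. To keep the snippet self-contained, it substitutes randomly generated 28×28 tensors for the torchvision MNIST download; swapping in `torchvision.datasets.MNIST` with a `DataLoader` yields the real version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for MNIST: random 28x28 "images" with random digit labels.
# (Replace with torchvision.datasets.MNIST + a DataLoader for real data.)
n_samples, input_dim, n_classes = 512, 28 * 28, 10
images = torch.randn(n_samples, input_dim)
labels = torch.randint(0, n_classes, (n_samples,))

# Logistic regression = a single linear layer. CrossEntropyLoss applies
# the softmax internally, so the model outputs raw logits.
model = nn.Linear(input_dim, n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):
    optimizer.zero_grad()
    logits = model(images)            # forward pass
    loss = criterion(logits, labels)  # cross-entropy loss
    loss.backward()                   # backpropagation
    optimizer.step()                  # gradient descent update
    print(f"epoch {epoch}: loss={loss.item():.4f}")

# Predicted class = index of the largest logit for each image.
predictions = model(images).argmax(dim=1)
accuracy = (predictions == labels).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```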
This code implements logistic regression on MNIST-style image data using PyTorch: a single linear layer maps each flattened 28×28 image to one score per class, the cross-entropy loss applies the softmax, and each epoch runs a forward pass, loss computation, backpropagation, and an optimizer step.
Let's go through a few of the most popular applications of logistic regression across various industries.
An optical character recognition (OCR) method, often called text recognition, converts handwritten or printed characters into text that computers can understand. The output of OCR is categorical, meaning it belongs to a finite set of values, which makes it a classification problem in machine learning and a natural fit for logistic regression.
With logistic regression, we can train a binary classifier that discriminates between distinct characters. The dependent variable in this case is binary, denoting the presence or absence of a character, while the independent variables are the features retrieved from the input image.
The initial step in OCR is to extract features from the input image; features used to characterize an image include lines, curves, and edges. Once extracted, these features can be fed into a machine learning model such as logistic regression, which estimates the coefficients of the independent variables that best explain the observed data. These coefficients are then applied to make predictions on new images.
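A toy illustration of this pipeline is sketched below; the tiny image, the features, and the coefficients are all hypothetical, standing in for a real feature extractor and a trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A tiny 5x5 binary "image" of a character (1 = ink, 0 = background).
image = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0],
]

# Toy feature extraction: total ink, and ink in the top and bottom rows.
total_ink = sum(sum(row) for row in image)
top_ink = sum(sum(row) for row in image[:2])
bottom_ink = sum(sum(row) for row in image[3:])
features = [total_ink, top_ink, bottom_ink]

# Hypothetical coefficients a trained model might have learned.
bias, weights = -6.0, [0.5, 0.2, 0.1]
score = bias + sum(w * f for w, f in zip(weights, features))
probability = sigmoid(score)
print(f"P(character present) = {probability:.3f}")
```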
Fraud detection is about detecting scams and preventing fraudsters from acquiring money or property by deception. Fraud poses a significant financial liability that must be discovered and mitigated immediately.
Many organizations struggle with fraud detection, particularly finance, insurance, and e-commerce. Using machine learning methods such as logistic regression is one strategy for fraud detection. It can be used to train a binary classifier that can discriminate between fraudulent and legitimate transactions.
The dependent variable in this case is binary, representing whether the transaction is fraudulent. The independent variables are the attributes that characterize a transaction, including its value, location, time, and user details. The effectiveness of fraud detection can be increased by combining logistic regression with other machine learning methods, like anomaly detection and decision trees.
Disease spread prediction can be approached as a binary classification problem, where the target variable is whether an individual will contract the disease.
Factors such as the number of affected people, the population's age and health, the environment, and the accessibility of medical resources can all affect how quickly a disease spreads. The link between these variables and the risk of disease transmission can be modeled using logistic regression.
A dataset of historical disease spread data can be used to predict the spread of illnesses using logistic regression. The dataset should contain details regarding the number of affected people, the time frame, and the place. To increase the accuracy of disease spread prediction, we can combine logistic regression with other machine learning methods, such as time series analysis and clustering. The temporal patterns of disease propagation can be modeled using time series analysis. Clustering can be used to pinpoint the areas and populations that are most impacted.
Logistic regression is widely used for mortality prediction to calculate the likelihood of a person dying with a specific illness.
Attributes such as demographics, health history, and clinical information (age, gender, existing conditions, and indicators like hypertension, heart rate, and lab test results) can be used to train the logistic regression model. This can help medical professionals make informed decisions about patient care and improve patient outcomes.
Churn prediction identifies customers likely to stop using a product or service. Logistic regression is a commonly used technique to model churn prediction.
In logistic regression, the dependent variable is a binary variable that indicates whether a customer will churn. The independent variables are the customer's demographic information, usage patterns, and other factors that may influence their decision to leave.
The logistic regression model estimates each customer's churn probability, based on the independent variables. The model then predicts a customer's churn based on a threshold probability.
The logistic regression model can identify customers at high risk of churning, enabling businesses to take proactive measures to retain them. This can include targeted marketing campaigns, personalized offers, and customer support interventions.
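A minimal churn-scoring sketch along these lines, with hypothetical coefficients and customer features:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: heavier product usage lowers churn risk,
# while more support tickets raises it.
bias, w_usage, w_tickets = 0.5, -0.08, 0.6

customers = {
    "alice": {"usage_hours": 40, "tickets": 0},
    "bob":   {"usage_hours": 5,  "tickets": 3},
}

threshold = 0.5
churn_probs = {
    name: sigmoid(bias + w_usage * c["usage_hours"] + w_tickets * c["tickets"])
    for name, c in customers.items()
}
at_risk = [name for name, p in churn_probs.items() if p >= threshold]
print(churn_probs)
print("at-risk customers:", at_risk)
```

Customers flagged in `at_risk` would then be targeted with retention campaigns or support interventions.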
Logistic regression is a popular machine learning algorithm used for binary classification problems, where the goal is to predict the probability of an event occurring. Here are some key takeaways:
- The sigmoid function maps a linear combination of inputs to a probability between 0 and 1, and a threshold converts that probability into a class.
- Logistic regression models the log odds of the outcome as a linear function of the independent variables.
- Binary, multinomial, and ordinal logistic regression cover two-class, multi-class, and ordered-category problems, respectively.
- Common applications include OCR, fraud detection, disease spread and mortality prediction, and churn prediction.