
Logistic Regression

by Elvira Siegel
(Published: Sat Oct 05, 2019)

Logit Regression

Logit regression is another name for logistic regression, derived from the logit (short for "logistic unit"). Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. Its output is a continuous probability in the range [0.0 ; 1.0], which is then turned into a discrete class label, typically by comparing it with a threshold such as 0.5. A discrete value is some outcome (dependent variable) that has only a limited number of possible values.

Precisely speaking, a classification task is a task of predicting a discrete class label.

Logistic regression is sometimes called Logit, Maximum Entropy Classifier or Log Odds model, and it is used when the outcome (response variable) is categorical in nature, e.g. yes/no, true/false, 1/0, red/green/blue.

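The "Regression" part of the name comes from the technique it shares with Linear Regression: both fit a weighted linear combination of the inputs. The "Logistic" part comes from the logistic function, which turns that linear combination into a probability. The model is sometimes called Logit because the logit function, the inverse of the logistic function, maps the probability back to the linear combination.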

The Logistic function and the Logit :

The logit function maps the probability of some binary event to its log-odds (the logarithm of the odds). It is the inverse of the logistic function: the logistic function converts the log-odds back into a probability. So keep the Logistic function separate from the Logit.

the logit :

f(x) = log (x / (1 - x))

the plot of the logit function:


The Logistic function (also "the standard logistic sigmoid function") has an explicit S-shape and is presented in the following picture:


The Logistic function is sometimes called the logistic sigmoid function. Logistic sigmoid is one of the activation functions used in neural networks. It takes any real number and outputs a value between 0 and 1, which can be read as a probability. Furthermore, the logistic function is also referred to as the expit function.

If you are looking for an easy implementation of the logistic function and the logit function, you can use the SciPy library in Python:

from scipy.special import expit, logit
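
To make the relationship between the two functions concrete, here is a minimal sketch using that import. The input values are made up for illustration, and NumPy is assumed to be installed alongside SciPy.

```python
import numpy as np
from scipy.special import expit, logit

# Log-odds can be any real number; expit maps them into (0, 1).
log_odds = np.array([-2.0, 0.0, 2.0])    # illustrative values
probabilities = expit(log_odds)          # 1 / (1 + exp(-x))

# logit is the inverse: it maps probabilities back to log-odds.
recovered = logit(probabilities)         # log(p / (1 - p))

print(probabilities)
print(recovered)
```

A log-odds value of 0 corresponds to a probability of 0.5, and applying logit to the output of expit returns the original log-odds up to floating-point error.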

Back to Logistic Regression

Logistic regression belongs to the class of discriminative models. A discriminative model represents a decision boundary between the classes and is a model of conditional probability:


Decision Boundary: Linear


Decision Boundary: Non-Linear


There are also some other models which belong to the discriminative class. Some of them are: K-Nearest Neighbors (KNN), Maximum Entropy, Support Vector Machines (SVM) and Neural Networks.

Logistic regression is closely related to linear regression (both are generalized linear models), but it is not a special case of it. In contrast to logistic regression, a linear regression outcome is continuous in nature, e.g. height, weight, hours, price on the stock market, etc., and not discrete as in Logistic Regression.

Logistic Regression & Linear Regression. Differences

Logistic Regression Equation looks like:

Y = 1 / (1 + e^-(mX + C))

Linear Regression Equation looks like:

Y = mX + C
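
The difference between the two equations can be sketched in a few lines. The values of m and C below are hypothetical and chosen only for illustration; the logistic version passes the same linear term mX + C through the logistic function.

```python
import numpy as np

m, C = 1.5, -3.0   # hypothetical slope and intercept, not fitted to any data

def linear_prediction(X):
    # Linear regression: the output is unbounded.
    return m * X + C

def logistic_prediction(X):
    # Logistic regression: the same linear term squeezed into (0, 1).
    return 1.0 / (1.0 + np.exp(-(m * X + C)))

X = np.array([0.0, 2.0, 4.0])
linear_out = linear_prediction(X)
logistic_out = logistic_prediction(X)

print(linear_out)
print(logistic_out)
```

The linear outputs here are -3, 0 and 3, while the logistic outputs stay strictly between 0 and 1, with 0.5 exactly where the linear term is 0.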

Binary (ordinary) logistic regression needs the dependent variable to have exactly two categories. Multinomial logistic regression extends this to three or more unordered categories, and ordinal logistic regression to three or more ordered categories.

Linear regression needs the dependent variable to be continuous, which means no categories or groups are allowed. (Note: a dependent variable is a variable that is being measured in an experiment. The dependent variable changes as a result of changes in the independent variable, e.g. a person's height at different ages.)

Logistic regression is based on Maximum Likelihood Estimation, which means that we choose the coefficients so that they maximize the probability of the observed Y given X, P(Y|X). Viewed as a function of the coefficients, this probability is called the likelihood.

Linear regression, on the other hand, is based on Least Squares Estimation (LSE). The concept of LSE is that we choose the coefficients so that they minimize the sum of the squared distances between each observed outcome and the fitted line. (Note: the sum of squared distances means that we take the vertical distance (residual) from each individual data point to the fitted line, square it, and add all of these up. We want the line with the smallest sum of squared distances, because that is the line that fits the data points best.)
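
As a small illustration of Least Squares Estimation, the sketch below fits a line to made-up data with NumPy's polyfit and checks that nudging the fitted coefficients only increases the sum of squared residuals. The data and the helper name sum_squared_residuals are assumptions for this example.

```python
import numpy as np

# Made-up data for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(m, C):
    # Squared vertical distances between observed Y and the line mX + C.
    return float(np.sum((Y - (m * X + C)) ** 2))

# np.polyfit with deg=1 returns the least-squares slope and intercept.
m_hat, C_hat = np.polyfit(X, Y, deg=1)

best = sum_squared_residuals(m_hat, C_hat)
worse_slope = sum_squared_residuals(m_hat + 0.1, C_hat)
worse_intercept = sum_squared_residuals(m_hat, C_hat + 0.1)

print(m_hat, C_hat)
print(best, worse_slope, worse_intercept)
```

Any other line through the same data has a larger sum of squared residuals than the least-squares fit.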

Fitting the line to the data: Linear Regression vs. Logistic Regression


Maximum Likelihood Estimation

Logistic Regression makes use of Maximum Likelihood. Maximum Likelihood Estimation (MLE) is a widely used statistical method. It estimates the parameters of some probability distribution: MLE finds the parameter values for which the likelihood function reaches its maximum. In other words, MLE chooses the parameters under which the observed data samples are most probable given the presumed statistical model.

L(w*, b*) = max_{w,b} L(w, b)
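
For logistic regression this maximum has no closed-form solution, so it is usually found iteratively. The sketch below is one possible approach, not the only one: plain gradient ascent on the log-likelihood (maximizing the logarithm gives the same w and b as maximizing L itself). The data set, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

# Made-up one-dimensional data with overlapping classes.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def predict(w, b):
    # Logistic function applied to the linear term w*x + b.
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def log_likelihood(w, b):
    p = predict(w, b)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

w, b = 0.0, 0.0
learning_rate = 0.05                 # assumed step size
start = log_likelihood(w, b)

for _ in range(5000):
    p = predict(w, b)
    # Gradient of the log-likelihood with respect to w and b.
    w += learning_rate * np.sum((y - p) * x)
    b += learning_rate * np.sum(y - p)

final = log_likelihood(w, b)
print(w, b)
print(start, final)
```

The log-likelihood after training is higher than at the starting point, and the gradient is close to zero, which is what reaching a maximum means here.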


We noticed before that MLE is a parameter estimation method. That means MLE finds the values for the parameters. What are parameters then? Previously, we mentioned the linear regression equation: Y = mX + C. As an example, the variable X might stand for expenses or investments in a business and Y could represent the generated income. Then m and C are the model's two parameters that the estimation seeks to determine.

So the parameters define the shape of the model: different values of m and C give different lines.

Another short algorithmic example of MLE :

  1. start with a candidate logistic curve; for a given weight, it gives the probability that a person is obese
  2. for an obese person in the data set, the likelihood of that observation is the probability the curve assigns to "obese" at that person's weight
  3. for a non-obese person, the likelihood is one minus that probability
  4. do that for all people in the data set
  5. multiply all these likelihoods together. This is the likelihood of the data given the logistic regression curve
  6. shift the curve and compute a new likelihood of the data
  7. keep on shifting the curve until you can select the one with the maximum likelihood
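
The steps above can be sketched as follows. The weights, labels, fixed slope and candidate midpoints are all hypothetical; a real fit would adjust the slope as well, but shifting only the midpoint keeps the example short.

```python
import numpy as np

# Hypothetical data: body weight in kg and whether the person is obese (1) or not (0).
weight = np.array([55.0, 62.0, 70.0, 78.0, 85.0, 95.0])
obese = np.array([0, 0, 0, 1, 1, 1])

def likelihood(slope, intercept):
    # Probability of "obese" that the candidate curve assigns to each weight.
    p_obese = 1.0 / (1.0 + np.exp(-(slope * weight + intercept)))
    # Obese people contribute p, non-obese people contribute 1 - p.
    per_person = np.where(obese == 1, p_obese, 1.0 - p_obese)
    # Multiply the individual likelihoods together.
    return float(np.prod(per_person))

slope = 0.3                                                    # assumed, kept fixed
midpoints = np.array([60.0, 65.0, 70.0, 74.0, 80.0, 85.0])     # weight where p = 0.5

# Shift the curve and recompute the likelihood for every candidate.
likelihoods = [likelihood(slope, -slope * mid) for mid in midpoints]
best_midpoint = float(midpoints[int(np.argmax(likelihoods))])

print(likelihoods)
print(best_midpoint)
```

Among these candidates, the curve whose midpoint lies between the heaviest non-obese and the lightest obese person (74 kg) has the highest likelihood.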

To recap, Maximum Likelihood maximizes the likelihood by adjusting the parameters. When the parameters fit well, the model assigns a high probability to the data we actually observed. That is why it is a very popular technique for parameter estimation: MLE gives you the parameters which suit your data best under the chosen model.

Another crucial part of understanding Maximum Likelihood Estimation is having a good idea of differentiation from calculus. It is a mathematical method which helps us to find the maxima and minima of a function. To find the MLE values for some parameters we can apply the following algorithm:

  1. determine the derivative of the function
  2. set the derivative of the function to zero
  3. rearrange the equation in a way that the parameter of interest is the subject of the equation
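
A worked example of these three steps, assuming a simple coin-flip (Bernoulli) model rather than logistic regression: with k successes in n trials, the log-likelihood is k*log(p) + (n - k)*log(1 - p), its derivative is k/p - (n - k)/(1 - p), and setting that to zero and solving for p gives p = k/n. The outcomes below are made up, and the grid search is only there to check the closed-form answer numerically.

```python
import numpy as np

# Made-up outcomes of 10 trials (1 = success).
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
k, n = int(outcomes.sum()), len(outcomes)

def log_likelihood(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

def derivative(p):
    # Step 1: derivative of the log-likelihood with respect to p.
    return k / p - (n - k) / (1 - p)

# Steps 2 and 3: derivative = 0, rearranged for p.
p_hat = k / n

# Numerical check: the best value on a grid should match p_hat.
grid = np.linspace(0.01, 0.99, 99)
p_grid = float(grid[np.argmax(log_likelihood(grid))])

print(p_hat, derivative(p_hat), p_grid)
```

The derivative is zero (up to rounding) at p_hat, and the grid search lands on the same value.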

To learn more about derivatives and partial derivatives, read Partial Derivatives and The Jacobian Matrix.

Summing up

Logistic Regression (like linear regression) can take not only continuous data (e.g. age, weight) but also discrete data (e.g. blood type) as its inputs (independent variables).

Logistic Regression describes the relationship between a categorical dependent variable and one or more independent variables: logistic regression calculates probabilities with the help of the logistic function, which is often called the logistic sigmoid function.

Further recommended readings:

Classification with Naive Bayes

Introduction to statistics Part I

Partial Derivatives and The Jacobian Matrix


Copyright © 2024 by Richard Siegel at siegel.work