Warning! Your browser does not support this Website: Try Google-Chrome or Firefox!


by Elvira Siegel
(Published: Sun Oct 13, 2019)

Support Vector Machines

If you happened to have a classification, a regression or an outlier detection task, you might want to consider using Support Vector Machines (SVMs), a supervised learning model, that builds a line (hyperplane) to separate data into groups. SVM can perform not only linear classification but also effectively separate data in a non-linear, multi-dimensional space using the so called kernel trick which transfers data points into a high dimensional space. SVM is no new special fancy technique for machine learning, it exists since 1963.

Despite the fact that SVM is also applied for regression tasks, it is still more often used for classification.

SVM Introduction: problem stated, problem solved

Let's imagine the situation where we have some medical data samples. With these samples we want to predict whether a patient is ill or not. We plot the data and try to find some value which serves as a threshold. If we go above the threshold, then the prediction is the patient is ill and if under the threshold, then the prediction is the patient is not ill. How about this one:


All right, seems correctly classified at the first glance. Now a new value from another patient comes in and we also want to classify it:


We don't really know now whether the patient is ill or not, the new data point is classified as not ill, but it is definitely closer to the ill class.

What we can do is following: we only focus on data values which lies on the side edges. For later, we call those "edge"-data values - support vectors. When we have our support vectors, we draw a line right in the middle from both support vectors (data points) - this is our new, better threshold. Now if the new data point comes in, it will be correctly classified.


More concrete about SVM

From the visual example above we can conclude that with SVM our goal is to find a line (or a hyperplane) in some n-dimensional space, where n is the number of dimensions. For a 2-dimensional space we use a line as a separator and in higher dimensional spaces we use a hyperplane. In this article we will mostly use the line as a term for separator, so we use n=2 dimensions. Furthermore, those dimensions (n) are just the number of features (input values).


The line, we search for, has to divide the data points in two groups the best way possible. This "best way" can be found by maximizing the margin , so the line distinguishes the data point most clearly. Let's see why.

Here we have some data points which are placed in two groups:


Which line does separate those data groups the best? There may be an infinite number of lines ...



Well, as we saw in the introduction, the ideal line is the line which lies right in the middle of the biggest separating margin. So, we want to maximize the margin and draw a separating line in the middle. Margin is basically what builds the decision boundary, it is the distance from a data point which is placed closest to the line, till the beginning of that line. A margin can be seen as the maximum width that separates the support vectors of two classes:

When we maximize the margin, we provide reinforcement to the SVM model in a way that it classifies other data points more confidently. This reinforcement part is also one of the features of the supervised learning.



Support vectors are the data points which "support" the margin. Only they are important for finding the decision boundary. If we remove those points, we will change the position of the decision boundary. Support vectors lie on the beginning of the margin boundary and are used to maximize it. Data points that lie on the one or the other side of the decision boundary can be assigned to one or the other class.

The SVM algorithm looks only at those support vectors to compute the margin. From there it is able to assign the data from one side of the decision boundary to one class and the data from the other side of the decision boundary to the other class.

Why maximizing the margin?

Can't we just take some line that separates the data in two parts? Maybe somewhere in the middle? Well ... Nope, if we do so, we lose let's call it the "confidence"-measure in out model. Each distance from a data point to some separating line corresponds to the "confidence" of the prediction for that point's group. By maximizing the margin (the decision boundary), we maximize the level of confidence for our prediction for a data point. In other words, data points that lie near the separating line stay for uncertain classification decisions. That means, there is nearly 50 % chance this point can belong to the "other side" - to the other class. By maximizing the margin we make sure, that the closest point to the separating line which lies on a margin boundary, is still relatively far away from the line itself, thus increasing the probability that this point was classified correctly:

Defining the hyperplane

We can determine the separating line - a hyperplane, by using the so called weight vector (w⃗) and the intercept (b) which is sometimes referred to as the bias. The bias term represents the rate of how good a machine learning algorithm can show true correlations between the data points. The w⃗ vector is perpendicular to the hyperplane and logically the hyperplane is perpendicular to the weight vector as well.


A perpendicular line crosses the other line at right angles 90 °

In order to select the right line (hyperplane), we need the intercept b. The intercept which is often presented as a constant is some point at which the weight vector crosses the line (hyperplane).


Let's take a set of training data points S = { (x⃗ i , y i) } where x⃗ is a vector representing each data point at the index i and y is a corresponding correct label to the data point x⃗ i . In SVM the two data groups we want to classify the data points into, are given as 1 and -1 . So the 1 stays for one class and -1 for the other class.



  1. w⃗ is a perpendicular weight vector
  2. x⃗ i is a vector sample from the training set
  3. b is the intercept (sometimes called "the bias term")
  4. >= +1 and <= -1 are the threshold values, which define at which time one data point should be classified in the respective group

We can also present the formula above in a more concise way :


To conclude, the equation: ( w⃗ * x⃗ i ) + b is the predictor for one sample xi

SVM belongs to the class of discriminative models. Discriminative means: the algorithm models a decision boundary between classes.

In the python library sklearn we can find a ready implementation of SVM:

from sklearn.svm import SVC 

The Kernel Trick

The kernel is a group of mathematical functions. It transforms data into a different form.

The main idea of the kernel function is to bring data into higher dimensions. Why do we need it? Imagine that we have a data set which is not linearly separable, so we cannot draw a line between two classes.


With the kernel trick we can separate data that is not linearly separable by bringing it into higher dimensions.



There exist different kernel functions: linear/nonlinear kernel, polynomial kernel, sigmoid kernel, exponential kernel ...

Linear Kernel:

The linear kernel predicts the classification for a new data point in 2-dimensional space. It does so by calculating the dot product of the input and every support vector:

f(input) = Σ (weighti * dot(input, support_veci)) + bias(0)

Polynomial Kernel:

f(input, support_veci) = 1 + Σ(input * support_veci)n

where n is the number of dimensions.

In contrast to the linear kernels, polynomial kernels are used to figure out the separating hyperplane in higher dimensions.

Again, using the sklearn library makes our life easier:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

classifier = SVC(kernel='linear')
y_pred = classifier.predict(x_test)


Support Vector Machines is a powerful algorithm for classification tasks. It is widely used in image classification, text categorization, hand-written character recognition, as well as in face recognition systems.

Further recommended readings:

Neural Networks

Logistic Regression

Classification with Naive Bayes


It's AI Against Corona

2019-nCoV There has been a lot of talking about the new corona virus going around the world. Let's clear up some things about it first and then we will see how data science and ai can help us fight 2019-nCoV. ...

Activation Functions

What are activation functions in Neural Networks? First of all let's clear some terminology you need in order to understand the concept of an activation function. ...


or backward propagation of errorsis another supervised learning optimization algorithm. The main task of the backpropagation algorithm is to find optimal weights in a by implementing optimization technique. ...


The Convolutional Neural Network (CNN) architecture is widely used in the field of computer vision. Because we have a massive amount of data in image files, the usage of traditional neural networks wouldn't give much efficiency as the computational time would expl...

Early Stopping

In this article we will introduce you to the concept of Early Stopping and its implementation including code samples. ...


Generative Adversarial Networks (GANs) are a type of unsupervised neural networks. The network exists since 2014 and was developed by and colleges. ...

Gradient Descent

Hiking Down a Mountain Gradient Descent is a popular optimization technique in machine learning. It is aimed to find the minimum value of a function. ...

Introduction to Statistics

Part III In this third and last part of the series "Introduction to Statistics" we will cover questions as what is probability and what are its types, as well as the three probability axioms on top of which the entire probability theory is constructed. ...

Introduction to Statistics

Part I In the following three parts we will cover basic terminology as well as the core concepts from statistics. In this Part I you are going to learn about measures of central tendency (mean, median and mode). In the Part II you will read about measures of variabili...

Introduction to Statistics

Part II In this part we will continue our talk about descriptive statistics and the measures of variability such as range, standard deviation and variance as well as different types of distributions. Feel free to read the Part I of these series to deepen your knowle...

Logistic Regression

Logit Regression Logit regression is another shortened name derived from logistic unit. Logistic regression is a popular statistical model that generates probabilities for binary classification tasks. It produces discrete values and its span lies in the range of [...

Loss Functions

When training a neural network, we try to optimize the algorithm, so it gives the best possible output. This optimization needs a loss function to compute the error/loss of the model. In this article we will gain a general picture of Squared Error, Mean Sq...

The Magic Behind Tensorflow

Getting started In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need to grasp the real power of Tensorflow. ...

Classification with Naive Bayes

The Bayes' Theorem describes the probability of some event, based on some conditions that might be related to that event. ...

Neural Networks

Neural Networks - Introduction In Neural Networks (NNs) we try to create a program which is able to learn from experience with respect to some task. This program should cons...


Principal component analysis or PCA is a technique for taking out relevant data points (variables also called components or sometimes features) from a larger data set. From this high dimensional data set, PCA tries extracting low dimensional data points. The idea...

Introduction to reinforcement learning

Part IV: Policy Gradient In the previous articles from this series on Reinforcement Learning (RL) we discussed Model-Based and Model-Free RL. In model-free RL we talked about Value Function Approximation (VFA). In this Part we are going to learn about Policy Based R...

Introduction to Reinforcement Learning

Part I : Model-Based Reinforcement Learning Welcome to the series "Introduction to Reinforcement Learning" which will give you a broad understanding about basic (and not only :) ) techniques in the field of Reinforcement Learning. The article series assumes you have s...

Introduction to Reinforcement Learning

Part II : Model-Free Reinforcement Learning In this Part II we're going to deal with Model-Free approaches in Reinforcement Learning (RL). See what model-free prediction and control mean and get to know some useful algorithms like Monte Carlo (MC) and Temporal Differ...

Recurrent Neural Networks

RNNs A Recurrent Neural Network (RNN) is a type of neural network where an output from the previous step is given as an input to the current step. RNNs are designed to take an input series with no size limits. RNNs remember the past states and are influenced by them...

Singular Value Decomposition

Matrix factorization: Singular Value Decomposition Matrix decomposition is another name for matrix factorization. This method is a nice representation for applied linear algebra in machine learning and similar algorithms. ...

Partial Derivatives and the Jacobian Matrix

A Jacobian Matrix is a special kind of matrix that consists of first order partial derivatives for some vector function. The form of the Jacobian matrix can vary. That means, the number of rows and columns can be equal or not, denoting that in one case it is a squa...

Introduction to Reinforcement Learning

Part III: Value Function Approximation In the previous Part I and Part II of this series we described model-based and model-free reinforcement learning as well as some well known algorithms. In this Part III we are going to talk about Value Function Approximation: w...

Weight Initialization

How does Weight Initialization work? As a general rule, weights and biases are normally initialized with some random numbers. Weights and biases are extremely important model's parameters and play a pivot role in every neural network training. Therefore, one should ...

Word Embeddings

Part 1: Introduction to Word2Vec Word embedding is a popular vocabulary representation model. Such model is able to capture contexts and semantics of a word in a document. So what is it exactly? ...

Word Embeddings

Part 2: Word2Vec (Skip Gram)In the second part of Word Embeddings we will talk about what are the downsides of the Word2Vec model (Skip Gram...


T-Distributed Stochastic Neighbor Embedding If you do data analysis, machine learning or some other data driven research you will prob...
Copyright © 2024 by Richard Siegel at siegel.work Donate Contact & Privacy Policy