
Activation Functions

by Elvira Siegel
(Published: Sun Oct 06, 2019)

What are activation functions in Neural Networks?

First of all let's clear some terminology you need in order to understand the concept of an activation function.

What are Neural Networks?

Briefly speaking, neural networks are algorithmic procedures whose structure loosely resembles the human brain. These algorithms are designed so that they can remember and even recognize certain patterns. The patterns the neural nets recognize are just numbers, enclosed in vectors. No matter what kind of data we have (images, text, sound, etc.), it must be converted into a numerical vector representation.

We use neural networks to cluster and classify our real-world data. We can think of them as clustering or classification algorithms that manage and analyze information flows.

Neural Networks consist of layers. Layers consist of "neurons".

What is a neuron in Neural Networks?

An artificial neuron computes the weighted sum of its inputs and then adds a bias:

N = Σ(input * weight) + bias


After this computation, the neuron decides whether to "fire" or not, i.e. whether to activate itself or not. The value of a neuron may be any number from -infinity to +infinity. The neuron doesn't know the necessary boundary for its values - it doesn't really know when to "fire" and when not.


How does the neuron decide?

We add an activation function for this decision. We can now pass the neuron's value N through the activation function and see whether the neuron should be "activated" or not.
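
To make this concrete, here is a minimal sketch of a single neuron in Python (the input values, weights and bias below are made up purely for illustration):

import numpy as np

# example input vector, weights and bias (illustrative values only)
inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8,  0.1, 0.4])
bias    = -0.5

# N = Σ(input * weight) + bias
N = np.sum(inputs * weights) + bias

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# the activation function squashes N into a bounded range, here (0, 1)
print(N, sigmoid(N))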


Use a linear function?

What if we use a normal linear function as an activation function?

For example, we could use a linear function like L(x) = c·x, a straight-line activation function. In this case the output of the linear function spans an unbounded numerical range, which means it is not a binary activation. If you know the Gradient Descent optimization algorithm, you will also notice that the derivative of such a linear function is constant.

But the bigger problem is this: imagine several connected layers, each with its own activation function, and in our case we chose a linear function as that activation. The linearly activated output of one layer goes straight as input into the next layer, where another weighted sum is calculated and the output is again based on a linear function. All in all, no matter how many layers we add, the neural network will still behave like a single-layer network, because the composition of linear functions is itself a linear function. So it stays linear no matter what.
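
A quick numerical sketch of this collapse (the weight matrices are small random values made up for illustration; biases are left out, but including them would not change the conclusion, since the composition would still be affine):

import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=3)        # example input vector
W1 = rng.normal(size=(4, 3))   # weights of "layer 1"
W2 = rng.normal(size=(2, 4))   # weights of "layer 2"

two_linear_layers = W2 @ (W1 @ x)   # two stacked linear layers
one_linear_layer  = (W2 @ W1) @ x   # a single equivalent linear layer

print(np.allclose(two_linear_layers, one_linear_layer))   # True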


How do we bring in non-linearity?

We need non-linearity to solve more complex problems, in cases where the relationship between variables is not static or directly proportional to the input, but instead is dynamic and differs over time. In sales, for example, the per-unit cost may decrease rather than stay constant as the output increases.


We can use non-linear activation functions:


Sigmoid Function
[Figure: the sigmoid function, sigmoid(x) = 1 / (1 + e^(-x))]

The range of the non-linear sigmoid function is (0, 1), which means we can easily interpret its values as probabilities and, unlike with the linear function, we now have a boundary.

The sigmoid function is frequently used for binary classification.

As you may have noticed, the function contains the constant e (more on e here), and the exponential function e^x is particularly easy to differentiate, since it is its own derivative.

The y-intercept of the function lies at 0.5, i.e. sigmoid(0) = 0.5, which is also convenient for calculating probabilities when something has a 50/50 chance.

And finally, the sigmoid function is non-linear (a non-linear curve), so it solves the problem mentioned previously, namely that every combination of linear functions is again a linear function.

Implementation of the Sigmoid Function
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(num):
    return 1 / (1 + np.exp(-num))

num_seq = np.linspace(-5, 5, 100) #'np.linspace' creates evenly spaced numbers in sequences.
plt.plot(num_seq, sigmoid(num_seq),'r') # the 'r'-parameter for 'red'.

plt.xlabel('x')
plt.ylabel('y')

plt.title('sigmoid function')
plt.grid()

plt.show()  
What are the drawbacks of the sigmoid function?

If you look at the picture of the sigmoid curve above, you'll notice that there are regions of the graph where the function barely changes. That means the gradient in those regions is going to be very small. This gives rise to the problem of the "vanishing gradient": when gradients become very small, the weight updates stay the same or change extremely slowly. When the updating of weights stops, the learning stops.
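
The derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)), so we can check directly how quickly it shrinks; a small illustrative sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)   # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))   # 0.25, ~0.105, ~0.0066, ~0.000045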


TanH Function (Hyperbolic Tangent)

TanH looks similar to sigmoid. In fact, it is a scaled and shifted version of the sigmoid function: tanh(x) = 2·sigmoid(2x) - 1. The range of the tanH function is (-1, 1).


[Figure: the tanH curve]

In comparison to sigmoid, tanH has stronger gradients (its derivatives are steeper): if the data is centered around 0, the derivatives are higher. Depending on the gradient strength you need, you can choose between the two. But, similar to sigmoid, tanH is susceptible to the same vanishing gradient problem.
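
To see the difference in gradient strength, we can compare the two derivatives at 0, using tanh'(x) = 1 - tanh(x)^2 and sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); a small sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 0.0
tanh_grad    = 1 - np.tanh(x) ** 2             # 1.0
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))   # 0.25
print(tanh_grad, sigmoid_grad)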

Implementation of the TanH Function
import numpy as np
import matplotlib.pyplot as plt

def tanH(num):
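    # algebraically identical to np.tanh(num), since tanh(x) = 2*sigmoid(2x) - 1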
    return (2 / (1 + np.exp(-2*num))) - 1

num_seq = np.linspace(-5, 5, 100)   
plt.plot(num_seq, tanH(num_seq),'g') 

plt.xlabel('x')
plt.ylabel('y')

plt.title('tanH function')
plt.grid()

plt.show()
Rectified Linear Unit (ReLu)

[Figure: the ReLu function, relu(x) = max(x, 0)]

The ReLu function returns the input as the output if the input is positive, and 0 if the input is negative.

It is widely used in hidden layers.

If you look at the graph above, you may think that this function is linear. But it is not! ReLu is only piecewise linear, and combinations of ReLu units with weighted sums are non-linear.

ReLu is computationally light: it simply outputs max(input, 0), so it doesn't involve massive calculations and is easy to implement mathematically.

Imagine a neural network where ReLu outputs 0 for, say, 60% of the neurons (those with negative values). That means fewer neurons fire (get activated), so we have a sparse activation, which makes the network lighter.
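
As a rough illustration of this sparsity (the pre-activation values below are just random numbers, not from a real network):

import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)        # made-up, zero-centered pre-activations
activations = np.maximum(pre_activations, 0)   # ReLu

sparsity = np.mean(activations == 0)           # fraction of neurons that stay silent
print(sparsity)                                # roughly 0.5 for zero-centered inputs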

Implementation of the ReLu Function
import numpy as np
import matplotlib.pyplot as plt

def relu(num):
    return np.maximum(num, 0)

num_seq = np.linspace(-5, 5, 100)   
plt.plot(num_seq, relu(num_seq),'b') 

plt.xlabel('x')
plt.ylabel('y')

plt.title('relu function')
plt.grid()

plt.show()
What are potential problems with ReLu?

The range of ReLu is [0, inf), which means that activations can grow without bound, because there is no upper boundary.

If you look at the ReLu graph above, you will notice the horizontal line (for negative x or x = 0) where the ReLu output is 0. In that span the gradient is 0, so the weights won't be tweaked during gradient descent. If the weight updates stop (or become significantly slow), the learning stops: if the gradient is 0, nothing changes. Neurons stuck in this state stop reacting to changes in the error. This is what we know as the Dying ReLu problem: a part of the neurons "die" as they no longer respond, which makes the neural network passive and even static.

There is an approach to the dying ReLu problem, the so-called "Leaky ReLu". Leaky ReLu allows some "leakage" for input values less than 0. In other words, we take the horizontal line and incline it slightly so it is no longer flat. The key idea is: keep the gradient from becoming exactly 0 for negative inputs.

[Figure: the Leaky ReLu function]
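
A minimal sketch of Leaky ReLu in the same style as the other implementations (the slope 0.01 for negative inputs is a common but arbitrary choice):

import numpy as np
import matplotlib.pyplot as plt

def leaky_relu(num, alpha=0.01):
    # a small non-zero slope 'alpha' for negative inputs instead of a flat 0
    return np.where(num > 0, num, alpha * num)

num_seq = np.linspace(-5, 5, 100)
plt.plot(num_seq, leaky_relu(num_seq), 'c')

plt.xlabel('x')
plt.ylabel('y')

plt.title('leaky relu function')
plt.grid()

plt.show()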

Softmax

Softmax is often used in multiclass classification tasks. The output of the softmax is big if the input (also called logit or score) is big. If the input is small, the output will also be small.


[Figure: the softmax formula, softmax(x_i) = e^(x_i) / Σ_j e^(x_j)]

Softmax is widely used in the output layer. It is suitable for classification tasks of the logistic-regression type (multinomial logistic regression).

Softmax takes in a vector with logits (scores) and outputs a vector with probabilities which sum up to 1. The output represents a probability distribution (or categorical distribution) of possible outcomes. The most important task for softmax is: it turns numbers into probabilities which is great for probability analysis.

We called the input values for softmax logits several times. Logits are numerical values which are the output of the last hidden layer. So logits are scores from the last layer before activation (e.g. softmax) comes in.

If you look at the formula, you'll see that it sums over all elements of the input vector (all classes). That means a considerable drawback of softmax is that computing it takes linear time, O(n), in the number of classes.

Implementation of the Softmax Function
import numpy as np
import matplotlib.pyplot as plt

def softmax(num):
    # Axes are needed for multi-dimensional arrays.
    # A 2-dimensional array, for example, has two axes: axis 0 is vertical and goes downwards across the rows,
    # axis 1 is horizontal and goes across the columns.
    return np.exp(num) / np.sum(np.exp(num), axis=0)

num_seq = np.linspace(-5, 5, 100)
plt.plot(num_seq, softmax(num_seq),'m') 

plt.xlabel('x')
plt.ylabel('y')

plt.title('softmax function')
plt.grid()

plt.show()
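
Using the softmax function defined above on a small vector of example logits (the numbers are made up):

logits = np.array([2.0, 1.0, 0.1])   # example scores from a last hidden layer
probs = softmax(logits)

print(probs)         # roughly [0.66, 0.24, 0.10]
print(probs.sum())   # 1.0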
Which activation function should I choose?

Can I just always use ReLu, sigmoid or tanH? Well, it depends on the task you have. Generally speaking, you should choose the activation function which approximates the target function faster and leads to faster training. You can even customize your own activation function and use it! As a rule of thumb: start with ReLu and test it. As a rule, ReLu works fine as a general approximator.


Further recommended readings:

Activation Functions Implementation on Github

Check out the article on Backpropagation ...

... or on Recurrent Neural Networks (RNNs) as well

😄
