
Weight Initialization

by Elvira Siegel
(Published: Wed Oct 09, 2019)

How does Weight Initialization work?

As a general rule, weights and biases are initialized with random numbers. Weights and biases are extremely important model parameters and play a pivotal role in the training of every neural network. Therefore, one should have a firm understanding of how they work and why one needs them. In this article we will see how we can initialize and optimize weights to achieve better performance in a neural network.

Weights

Precisely speaking, a weight is a coefficient that is combined with a feature in a machine learning model. The main objective of every machine learning model is to find the weight for each feature that drives the model's performance to its best. If a weight is equal to zero, the feature it is paired with does not contribute to the model's output. We can see weights as a tweaking mechanism with which we adjust and fine-tune the model until it produces the output we want.

Weights are often represented as edges when graphed in a neural network:


weights_graphed_edges

An individual weight denotes the importance of an input: the larger the magnitude of the weight attached to an input, the more strongly that input influences the network's output. In other words, we "give weight" to the input.


Bias

Bias measures how far the model's average prediction is from the ground truth. If, for example, a linear regression model has a high bias, it means that the model is not able to capture the actual relationship between the data points.


large_bias

The linear regression model (the red straight line) cannot possibly relate to the curve, therefore this model will have a large bias.

On the other hand, if we take some other model and fit it perfectly to the data points ...


low_bias

The green line, which seems to fit perfectly, will have a very low bias. So, a low bias is good, right?

Unfortunately, both of those images demonstrate rather poor model performance. The first one demonstrates underfitting and the second one overfitting (feel free to read more about these topics here: Neural Networks Introduction).

The big problem with the second approach is this: the green line might seem to fit perfectly well. In fact it may fit so well that the distances between the line and the data points are actually zero. But that does not make it ideal or even good for our model! We previously named one crucial word: overfitting. It means that the second model, marked with the green line, is not able to generalize well to unseen data. An overfitted model performs well only on already seen data, that is, only on training data. On test data, such a model will have a very high variability, or high variance. For new datasets, the results produced by this model (the green line) will vary a lot from each other, creating too much inconsistency.

If we compare the two models again (the green line and the red straight line), the red one is actually better than the green one, because the red straight line has much lower variance in its results. It is going to be far more consistent across different data sets than the green line.

In a neural network, the word bias also names a parameter, and this bias can be represented as a vector. In this vector there are individual bias values (we can see them as "additional weights"). An individual bias value shifts the input of the activation function in a layer, which delays or advances the point at which the neuron activates. Biases are normally initialized as all zeros, but again there is no universal rule. Read more at CS231n Convolutional Neural Networks for Visual Recognition under "Initializing the biases".

The bias is one of the learnable parameters of a machine learning model. Its main task is to shift the output value: it is a constant added to the weighted sum, independent of the input. With it we can fit the best line for the given data in almost any machine learning model.


neuron_activation = σ( Σ (weights * inputs) + bias )


where σ is an activation function.
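The formula above can be sketched in a few lines of NumPy. This is only an illustration: the sigmoid is assumed as the activation function, and the helper name neuron_activation as well as the example numbers are made up for this sketch.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, used here as the activation function σ.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(weights, inputs, bias):
    # Weighted sum of the inputs plus the bias, passed through σ.
    return sigmoid(np.dot(weights, inputs) + bias)

# Hypothetical example values (not taken from the article).
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.1, 0.4, -0.2])
bias = 0.3

activation = neuron_activation(weights, inputs, bias)
print(activation)
```

The bias is added after the weighted sum, so it shifts the point at which the neuron starts to respond without depending on the input.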


Summarizing briefly, the perfect model would have low variance and low bias, so it is consistent in its predictions and it represents the relationship in the data correctly, with minimal or no error.



When we initialize the weights of a neural network with arbitrary random numbers, we could experience the vanishing (or exploding) gradient problem. As a result, our neural network will need much more time to converge.


What are some good weight initialization methods?

What if we just initialize all weights as zeros? It might seem easy and intuitively correct ... But in reality it is rather suboptimal. Let's see why.

Consider the formula for 3 neurons:

Note: the superscript (i) means the layer number. The subscript j means the neuron number in a layer. The small sigma is the sigmoid activation function. More on this function here.

neuron = σ_1^{(1)}(weight_1^{(1)} * input + bias_1^{(1)})

neuron = σ_2^{(1)}(weight_2^{(1)} * input + bias_2^{(1)})

neuron = σ_3^{(1)}(weight_3^{(1)} * input + bias_3^{(1)})


This equation is used to compute the node's (also called neuron) activation (whether the neuron "fires" or not).

Now, if all the weights are equal to 0, the weighted input of every neuron evaluates to 0, whatever the input is. That means that in the first layer all neurons (the subscript number) acquire the same post-activation value: with zero biases this is sigmoid(0) = 0.5. Even though we use a non-linear activation function (in the example the sigmoid, or logistic, function), the 0-weights eliminate its effect, because the output no longer depends on the input.

If we initialize all the weights as 0, then all neurons compute the same output. If the neurons' outputs are all the same, the gradients will also be the same. With the gradient being the same for all neurons, these neurons receive the same parameter updates during backpropagation and can never learn different features.
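A small experiment makes this symmetry visible. The sketch below assumes a tiny network that is not part of the article: three sigmoid hidden neurons, one linear output neuron and a squared-error loss. The function name and the numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_outputs_and_grads(init_value, x, y):
    # Every weight starts with the same value; biases start at zero.
    W1 = np.full((3, x.size), init_value)   # hidden layer weights
    b1 = np.zeros(3)
    W2 = np.full(3, init_value)             # output layer weights

    # Forward pass.
    h = sigmoid(W1 @ x + b1)
    y_hat = W2 @ h

    # Backward pass for the loss L = 0.5 * (y_hat - y)^2.
    err = y_hat - y
    grad_W2 = err * h
    delta_h = err * W2 * h * (1.0 - h)
    grad_W1 = np.outer(delta_h, x)
    return h, grad_W1, grad_W2

x = np.array([0.2, -0.7, 1.5])
h_zero, gW1_zero, gW2_zero = hidden_outputs_and_grads(0.0, x, y=1.0)
h_const, gW1_const, gW2_const = hidden_outputs_and_grads(0.5, x, y=1.0)

print(h_zero)      # identical hidden outputs
print(gW1_const)   # identical rows: every hidden neuron gets the same update
```

With all-zero weights the hidden layer even receives a zero gradient in this setup; with any other shared constant the gradient is non-zero but identical for every hidden neuron.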


Let's try small weights?

The same: it is not going to work well. With small weights, the variance of the signal gets smaller and smaller as it passes through the layers, until it finally vanishes. The small-valued input continues to shrink until the value is so small that it can't do any good.
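The shrinking can be observed directly. The following sketch is an illustration under assumptions that are not from the article: ten layers of width 100, tanh as the activation (chosen because it is centered at zero) and weights drawn with a standard deviation of 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 100, 10

x = rng.standard_normal(width)   # input signal with a spread of about 1
stds = []
for _ in range(depth):
    # Very small random weights for each layer.
    W = rng.standard_normal((width, width)) * 0.01
    x = np.tanh(W @ x)
    stds.append(x.std())

for layer, s in enumerate(stds, start=1):
    print(f"layer {layer:2d}: std of activations = {s:.3e}")
```

In this setup each layer scales the spread of the signal down by roughly a factor of ten, so after a handful of layers almost nothing of the input is left.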

A big problem which comes with very small weights is: we can't really bring in non-linearity which is vital for more complex machine learning tasks.

Consider the sigmoid activation function as an example:


sigmoid_pic_small_weights

Now look closely at the region near zero. We notice that the sigmoid function gets roughly linear as we approach zero on the graph. Let's call it "the linear region of the sigmoid function". We want to achieve a non-linear relationship; instead we get exactly the opposite: a linear one. In this situation we gain hardly any benefit from using a multi-layered neural network, because a stack of (nearly) linear layers behaves like a single linear layer.
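How linear the sigmoid is near zero can be checked numerically. The sketch compares it with the straight line 0.5 + z/4, which is its tangent at z = 0 (the slope of the sigmoid at zero is 1/4); the interval [-0.5, 0.5] is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small pre-activation values, as produced by very small weights.
z = np.linspace(-0.5, 0.5, 11)
linear_approx = 0.5 + z / 4.0            # tangent line of the sigmoid at z = 0
max_gap = np.max(np.abs(sigmoid(z) - linear_approx))

# For comparison: far away from zero the straight line is a poor description.
far_gap = abs(sigmoid(5.0) - (0.5 + 5.0 / 4.0))

print(max_gap, far_gap)
```

Near zero the gap between the sigmoid and the straight line is tiny, which is why a network with very small weights behaves almost like a linear model.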

What if we take one and the same value for all the weights initially?

This is also a bad idea, because this method has the same problem as initializing all weights as 0: weights that start with the same value (e.g. all zeros) all receive the same gradient update. Afterwards, these weights still hold the same value as each other, even after being updated through backpropagation. 😕


Large Weights Initialization

What if we initialize our weights with large values? Do we create any potential obstacles by doing so? Well, yes we do. A neural network with large weights becomes much more sensitive to small amounts of noise in the data. Noise in data means additional useless information which doesn't contribute to the learning and leads to bad results. Therefore, a network that ends up with large weights has probably fitted a good deal of noise in the training data.

If we return to the sigmoid activation function and analyze the problem of big weights on it, we can see: the input values for each layer grow at a great rate and in the end we get absolutely huge values, so huge they are of no use to us. And why can't we use a network with such big weights? Again, the sigmoid activation function: look at the graph of the sigmoid function once more. The function flattens when its input gets too big (or too negative). So the gradients will be close to zero whenever we are in a flattened region of the activation function.
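The flattening can be quantified with the derivative of the sigmoid, σ'(z) = σ(z) * (1 - σ(z)). The sketch below evaluates it at a few input values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: σ(z) * (1 - σ(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

# Pre-activation values of growing size, as produced by large weights.
grads = {z: sigmoid_grad(z) for z in (0.0, 2.0, 5.0, 10.0)}

for z, g in grads.items():
    print(f"z = {z:5.1f}  ->  gradient = {g:.6f}")
```

The gradient is largest at z = 0 (exactly 0.25) and drops by several orders of magnitude once the input reaches values around 10, so weight updates in the saturated region become very small.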


sigmoid_pic_large_weights

As a general rule, we can remember that large weights often signify overfitting. Our "second" general rule could be: don't go to extremes. Weights that are too large are just as bad as weights that are too small.


Weight Regularization

To put it concisely, weight regularization techniques used in neural network training add a penalty to the loss function to motivate the network to keep its weights small.

One proven solution is to add a term to the loss function that grows with the size of the weights, and to optimize this extended loss during training. This added term is called a penalty. When the weights get larger, the network is penalized more. The result is a larger loss and, in the end, larger updates that push the weights back toward smaller values. Such updates help to reduce overfitting.

The major problem with this kind of regularization which penalizes larger weights is that a network trained with this procedure might still let very large weights through. We will talk more about regularization techniques later.
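As a minimal sketch of such a penalty, the example below uses the L2 penalty (the sum of squared weights), which is one common choice; the article itself does not name a specific penalty. The function name, the strength lam and the example numbers are assumptions for illustration.

```python
import numpy as np

def l2_penalized_loss(data_loss, weights, lam=0.01):
    # Total loss = loss on the data + lam * sum of squared weights.
    return data_loss + lam * np.sum(weights ** 2)

# Two hypothetical weight vectors with the same loss on the data.
small_weights = np.array([0.1, -0.2, 0.05])
large_weights = np.array([3.0, -4.0, 5.0])

loss_small = l2_penalized_loss(0.5, small_weights)
loss_large = l2_penalized_loss(0.5, large_weights)

print(loss_small, loss_large)
```

Both weight vectors fit the data equally well in this example, but the one with large weights receives a clearly higher total loss. The penalty only discourages large weights; it does not forbid them.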



Neural networks often suffer from vanishing or exploding gradients. The major cause lies in the unbalanced variance of the values in the outputs of each layer. With vanishing signals, for example, the outputs take lower and lower values as we go deeper into the neural network towards the output layer. Therefore, there are more advanced weight initialization techniques we can apply.


Xavier Initialization

We said before that most neural networks are initially created to be optimized further. We optimize them by tweaking parameter values. We proceed with the parameter update iteratively using backpropagation. Weights and biases are the parameters we want to optimize in order to obtain the best possible result for our machine learning model.

If we set the weights randomly without any system, we might get problems with vanishing/exploding gradients. To deal with this kind of problem, Xavier Initialization is there to help us.

Because we actually don't know much about our data, we can't know for sure where to start with the weight initialization. Knowledge in statistics might help, namely the Gaussian distribution. We can try to initialize our network's weights from a Gaussian distribution with a mean of zero and a finite variance. We care about the variance a lot in the context of weight initialization, because if the variance does not stay the same from layer to layer, we have no mechanism that prevents our signals from exploding or vanishing.

To sum up, in the Xavier initialization we have the following algorithm: we take a normal distribution with mean 0 and a standard deviation that depends on the size of the layer. From this distribution the weights are randomly initialized.

In other words, with this method we want the variances of the inputs and outputs of each layer to be the same; or, stated otherwise, we seek to make the output variance of a layer equal to its input variance. In this way, the resulting weights are balanced.


Formula_X_Init

In the formula above, in is the number of input neurons and out is the number of output neurons. By applying this formula, we protect our network from weights that are too small or too large, as well as from weights that equal 0.
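A NumPy sketch of this initialization could look as follows. It assumes the commonly cited forms from the Glorot and Bengio paper: a normal distribution with standard deviation sqrt(2 / (in + out)), or a uniform distribution on [-limit, limit] with limit = sqrt(6 / (in + out)). The function names and the layer sizes are hypothetical.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Normal distribution with mean 0 and std = sqrt(2 / (fan_in + fan_out)).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

def xavier_uniform(fan_in, fan_out, rng):
    # Uniform distribution on [-limit, limit]; it has the same variance
    # as the normal variant above.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_normal = xavier_normal(300, 200, rng)    # 300 input and 200 output neurons
W_uniform = xavier_uniform(300, 200, rng)

print(W_normal.std(), W_uniform.std())
```

Both variants produce weights with a standard deviation of about 0.063 for this layer size; the larger the layer, the smaller the individual weights.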

The Xavier initialization is also referred to as Glorot initialization, because the method was first suggested by Xavier Glorot and Yoshua Bengio. Keras offers it as an initializer in a normal variant (glorot_normal) and a uniform variant (glorot_uniform); the uniform variant looks like this:


keras.initializers.glorot_uniform(seed=None) 


He Initialization

In general, the He initialization tries to accomplish the same as the Xavier initialization. But there is a difference: if we use ReLU as the activation function in a layer, the He weight initialization is usually the more appropriate choice. ReLU behaves differently from sigmoid or tanh: it sets every negative input to zero, and at x = 0 the function is not differentiable.


ReLu

It has been shown mathematically that for the ReLU activation function a well-suited weight initialization draws the weights with a variance given by this formula:

v^2 = 2 / N


This is called the He initialization. Here v^2 is the variance of the weights and N is the number of inputs to the neuron. For the complete proof and more mathematical grounding, consult the paper "On weight initialization in deep neural networks".

In short, the He initialization is frequently used with the ReLU activation function. We can implement this weight initialization in Python with the help of NumPy:


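import numpy as np

n = 5  # number of inputs to the neuron (N in the formula above)
weight_init = np.random.randn(n) * np.sqrt(2.0/n)

The built-in function "numpy.random.randn()" returns an array with n elements drawn from a standard normal distribution. Multiplying by sqrt(2/n) gives the weights the variance 2/n: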

weight_init = np.random.randn(5) * np.sqrt(2.0/5)
array([-0.22166148, -0.26587495, -0.72696066, 0.04386937, -0.55365807])

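Both the Xavier and the He initialization scale the weights to the size of the layer: values that would be too large are scaled down, whereas values that would be too small are scaled up. Basically, both aim to keep the variance of each layer's output close to the variance of its input.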
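The effect on a deep ReLU network can be sketched with a short experiment. The setup is an assumption for illustration only: ten layers of width 256, a random input vector, and the mean squared activation after the last layer as a measure of how much signal survives. The function name final_mean_square is hypothetical.

```python
import numpy as np

def final_mean_square(std_fn, width=256, depth=10, seed=0):
    # Push a random input through `depth` ReLU layers whose weights are drawn
    # with standard deviation std_fn(width); return the mean squared activation.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * std_fn(width)
        x = np.maximum(0.0, W @ x)   # ReLU
    return np.mean(x ** 2)

he_signal = final_mean_square(lambda n: np.sqrt(2.0 / n))   # He initialization
small_signal = final_mean_square(lambda n: 0.01)            # fixed small weights

print(he_signal, small_signal)
```

With the He scaling the signal stays on the same order of magnitude as the input, while the fixed small weights let it collapse towards zero within a few layers.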


Conclusion

Initializing the weights and biases properly is vital for optimal performance of the model. If the weights are initially very small or even zero, the model is stuck around the 0-point of the activation function (sigmoid), where the function is almost linear, so non-linearity is hardly possible. Furthermore, the gradients propagated back through many layers of small weights become extremely small, which is called the vanishing gradient problem.

If the weights are initialized with large values, the value np.dot(weights,input) + bias for a neuron becomes very large. If the sigmoid activation function is used in addition, its output saturates near 1 (or near 0 for large negative values), where the function is almost flat. The gradient there is close to zero, which means that learning and convergence take much more time.

The Xavier initialization technique is at its core the same as the He initialization technique; they differ in the activation function they are designed for. Xavier initialization is commonly applied with the tanh (or sigmoid) activation function, and He initialization is often applied with the ReLU activation function.


Further recommended readings:

Neural Networks

Gradient Descent

Xavier Glorot and Yoshua Bengio, the original paper: Understanding the difficulty of training deep feedforward neural networks

Copyright © 2024 by Richard Siegel at siegel.work