Weight Initialization
Weights
Precisely speaking, a weight is a coefficient that is combined with some feature in a machine learning model. The main objective of every machine learning model is to find, for each feature, the weight that drives the model's performance to its best. If a weight equals zero, the feature it is attached to contributes nothing to the model's output. We can see weights as a tweaking mechanism with which we can adjust and fine-tune our machine learning model towards the output we want it to produce.
Weights are often represented as edges when graphed in a neural network:
An individual weight denotes the importance of its input: if a particular weight on a particular input is large, that input strongly influences the network's output. In other words, we "give weight" to the input.
Bias
A bias measures how far the average predicted value lies from the ground truth. If, for example, a linear regression model has a high bias, it means that the model is not able to capture the actual relationship between the data points.
The linear regression model (the red straight line) cannot possibly follow the curve, therefore this model will have a large bias.
On the other hand, if we take some other model and fit it perfectly ...
The green line, which seems to fit perfectly, will have a very low bias. So, a low bias is good, right?
Unfortunately, both of those images demonstrate rather poor model performance. The first one demonstrates underfitting and the second one overfitting (feel free to read more about these topics here: Neural Networks Introduction).
The big problem with the second approach is: the green line might seem to fit perfectly well. In fact, it may fit so well that the distances between the line and the data points are actually zero. But that doesn't make it ideal, or even good, for our model! We already named the crucial word: overfitting. The second model, marked with the green line, is not able to generalize well to unseen data. An overfitted model performs well only on data it has already seen, i.e. only on the training data. On test data, such a model will have very high variability, or high variance: for new datasets, the results produced by this model (the green line) will vary a lot from each other, creating too much inconsistency.
If we again compare the two models, the red straight line is actually better than the green one: it has much lower variance in its results, so it is going to be far more consistent across different datasets than the green line.
A bias in a neural network can be represented as a vector of individual bias values (we can see them as "additional weights"). An individual bias value shifts the input of the activation function in a layer, moving the point at which the neuron activates. Biases are normally initialized to all zeros, but again there is no universal rule. Read more at CS231n Convolutional Neural Networks for Visual Recognition under "Initializing the biases".
Bias is one of many parameters in a machine learning model. Its main task is tweaking the output value. Consequently, bias is a constant value: with it we can fit the best line to the given data in almost any machine learning model.
neuron_activation = σ( Σ (weights * inputs) + bias )
where σ is an activation function.
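As a minimal sketch of this formula in NumPy (the input, weight, and bias values here are invented purely for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])    # made-up feature values
weights = np.array([0.8, 0.1, -0.4])   # made-up weights
bias = 0.25                            # made-up bias

# sigma( sum(weights * inputs) + bias )
neuron_activation = sigmoid(np.dot(weights, inputs) + bias)
print(neuron_activation)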
Summarizing briefly: the perfect model would have low variance and low bias, so it is consistent in its predictions and represents the data relationship correctly, with minimal or no error.
When we initialize the weights of a neural network with arbitrary random numbers, we can run into the vanishing (or exploding) gradient problem. As a result, our neural network will need much more time to converge.
What are some good weight initialization methods?
What if we just initialize all weights as zeros? It might seem easy and intuitively correct ... But in reality it is rather suboptimal. Let's see why.
Consider the formula for 3 neurons:
Note: the superscript (i) means the layer number. The subscript j means the neuron number in a layer. The small sigma is the sigmoid activation function. More on this function here.
neuron_1^(1) = σ(weight_1^(1) * input + bias_1^(1))
neuron_2^(1) = σ(weight_2^(1) * input + bias_2^(1))
neuron_3^(1) = σ(weight_3^(1) * input + bias_3^(1))
These equations are used to compute a node's (also called a neuron's) activation (whether the neuron "fires" or not).
Now, if all the weights are equal to 0, the neural network's pre-activation computations will all evaluate to 0. That means all neurons in the first layer (whatever their subscript number) will acquire the same post-activation value. Even if we apply an activation function (in the example sigmoid, the logistic function), the zero weights cancel out the effect of any non-linearity it could introduce.
If we initialize all the weights to 0, then all neurons compute the same output. If the neurons' outputs are all the same, the gradient values will also be the same. And with the gradient being the same for all neurons, these neurons will receive the same parameter updates during backpropagation.
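We can make this visible with a small NumPy sketch (the layer size and input values are arbitrary, chosen only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.7, -0.3])   # arbitrary input
W = np.zeros((3, 2))        # 3 neurons, every weight initialized to 0
b = np.zeros(3)             # zero biases

a = sigmoid(W @ x + b)
print(a)                    # [0.5 0.5 0.5]: every neuron outputs the same value

Because every neuron produces an identical output, every neuron also gets an identical gradient, so this symmetry is never broken during training.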
What if we try small weights?
The same story: this is not going to work well, because with small weights the variance of the input signal gets smaller and smaller as it passes through the layers, until it finally vanishes. The small-valued signal keeps shrinking until it is too small to be of any use. A sketch of this shrinking effect follows.
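A hedged sketch of the shrinking variance (using tanh, which like sigmoid is roughly linear around zero; the layer width and weight scale are arbitrary example values):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(512)                    # arbitrary input signal

for layer in range(5):
    W = rng.standard_normal((512, 512)) * 0.01  # deliberately tiny weights
    a = np.tanh(W @ a)
    print(f"layer {layer}: std of activations = {a.std():.6f}")

The printed standard deviation drops by roughly the same factor at every layer, until the signal has all but vanished.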
A big problem that comes with very small weights is that we can't really bring in non-linearity, which is vital for more complex machine learning tasks.
Consider the sigmoid activation function as an example:
Now look closely at the region near zero. We notice that the sigmoid function is roughly linear near the zero value on the graph: let's call it "the linear region of the sigmoid function". We want to achieve a non-linear relationship, but instead we get exactly the opposite: a linear one. In this situation we gain no benefit from using a multi-layered neural network, and it is basically impossible to bring in any non-linearity.
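We can check this near-linearity numerically: close to zero, sigmoid(z) is almost exactly 0.5 + z/4, its first-order (linear) approximation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-0.2, -0.1, 0.0, 0.1, 0.2]:
    print(f"z = {z:+.1f}   sigmoid = {sigmoid(z):.4f}   linear approx 0.5 + z/4 = {0.5 + z / 4:.4f}")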
What if we take one and the same value for all the weights initially?
This is also a bad idea, because the problem is the same as with initializing all weights to 0: weights starting with the same value will all receive the same gradient update. Afterwards, these weights will still hold the same value as each other, even after being updated through backpropagation. 😕
Large Weights Initialization
What if we initialize our weights with large values? Do we create any potential obstacles by doing so? Well, yes we do. A network with large weights becomes much more sensitive to small noise in the data. Noise in data is additional, useless information that doesn't contribute to learning and does contribute to bad results. Therefore, we can conclude that a network with large weights has probably learned little but the noise in the training data.
If we return to the sigmoid activation function and analyze the effect of big weights on it, we can see that the input values for each layer grow at a great rate, and in the end the pre-activations become so huge they are of no use to us. And why can't we use the network with such big values? Again, because of the sigmoid activation function: look at its graph once more. The function flattens out (saturates) for very large inputs, and on a flat region of the activation function the gradients come out close to zero.
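A quick numerical check of this saturation, evaluating sigmoid and its derivative at growing pre-activation values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.5, 2.0, 10.0, 50.0]:
    print(f"z = {z:5.1f}   sigmoid = {sigmoid(z):.6f}   gradient = {sigmoid_grad(z):.2e}")

At z = 50 the gradient is already around 10^-22: effectively zero, so almost nothing flows back through such a neuron.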
As a general rule, we can remember that large weights normally signify overfitting. And our "second" general rule could be: don't go to extremes, because too-large weights are as bad as too-small ones.
Weight Regularization
To put it concisely: weight regularization techniques, used during neural network training, add a penalty to the loss function to motivate the network to keep its weights small.
One proven solution is to take the weights' size into account while updating and optimizing the loss function during training. This extra term is called a penalty: the larger the weights get, the more the network is penalized. The result is a larger loss and, in the end, larger updates pushing the weights back down. These updates help to reduce overfitting.
The major problem with this kind of regularization, which penalizes larger weights, is that a network trained with this procedure might still let some very large weights through. We will talk more about regularization techniques later.
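As a minimal sketch of such a penalty (this is a plain L2 penalty written by hand; the loss, the weights, and the coefficient lam are illustrative, not a specific library's API):

import numpy as np

def l2_penalized_loss(y_true, y_pred, weights, lam=0.01):
    # Mean squared error plus an L2 penalty on the weights:
    # the larger the weights, the larger the loss.
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return mse + penalty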
Neural networks often suffer from vanishing or exploding gradients. The root cause lies in the unbalanced variance of the values output by each layer: the outputs take on smaller and smaller values as we go deeper in the neural network towards the last output layer. Therefore, there are more advanced weight initialization techniques we can apply.
Xavier Initialization
We said before that most neural networks are initially created to be optimized further. We optimize them by tweaking parameter values. We proceed with the parameter update iteratively using backpropagation. Weights and biases are the parameters we want to optimize in order to obtain the best possible result for our machine learning model.
If we set the weights randomly without any system, we might get problems with vanishing/exploding gradients. To deal with this kind of problem, Xavier Initialization is there to help us.
Because we actually don't know much about our data, we can't know for sure where to start with the weight initialization. Some statistics helps here, namely the Gaussian distribution: we can initialize our network's weights from a Gaussian distribution with zero mean and a finite variance. We care a lot about the variance in the context of weight initialization, because if the variance does not stay the same from layer to layer, we have no mechanism that prevents our weights and gradients from exploding or vanishing.
To sum up, the Xavier initialization algorithm is the following: we take a normal distribution with mean 0 and a carefully chosen standard deviation, and from this distribution the parameters are randomly initialized.
In other words, with this method we want the variance of the inputs and outputs of each layer to be the same; or, otherwise stated, we seek to match the output variance of one layer with its input variance. In this way, the produced weights are balanced. The variance of the weights is chosen as:
Var(weight) = 2 / (in + out)
In the formula above, in is the number of input neurons and out is the number of output neurons of the layer. By applying this formula, we prevent our network from getting too-small or too-large weights, as well as weights which equal 0.
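A minimal NumPy sketch of this scheme for a single weight matrix (the layer sizes are arbitrary example values):

import numpy as np

def xavier_normal(n_in, n_out, rng=np.random.default_rng()):
    # draw weights from a normal distribution with variance 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_normal(256, 128)
print(W.std())   # close to sqrt(2 / (256 + 128)), about 0.072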
The Xavier initialization is sometimes referred to as Glorot initialization, because this method was first suggested by Xavier Glorot and Yoshua Bengio. You can use the Keras implementation (initializers), which looks like this:
keras.initializers.glorot_uniform(seed=None)
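In practice you pass the initializer to a layer. A sketch of this (module paths vary slightly between Keras versions; the string shortcut "glorot_uniform" is also the default kernel initializer for Dense layers):

from tensorflow import keras

layer = keras.layers.Dense(64, activation="tanh",
                           kernel_initializer="glorot_uniform")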
He Initialization
In general, the He initialization tries to accomplish the same as the Xavier initialization. But there is a difference: if we use ReLU as the activation function in our layers, the He weight initialization is the more appropriate choice. With ReLU we have to deal with the particular non-linearity it brings in: if we look at the ReLU activation function, we can see that at x = 0 the function is non-differentiable.
It was proven mathematically that for the ReLU activation function, the optimal weight initialization technique is to initialize the weights according to this formula:
Var(weight) = σ² = 2 / N
where N is the number of input neurons of the layer. This is called the He initialization. For the complete proof and more mathematical grounding, consult the original paper by Kaiming He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".
In short, the He initialization is frequently used with the ReLU activation function. We can implement this weight initialization in Python with the help of NumPy:
import numpy as np

n = 5  # number of inputs feeding the neuron (example value)
weight_init = np.random.randn(n) * np.sqrt(2.0 / n)
The "numpy.random.randn()" build-in function which returns an array with n-elements:
weight_init = np.random.randn(5) * np.sqrt(2.0/5)
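For a whole layer rather than a single vector of weights, the same scaling applies, with n taken as the fan-in (the number of inputs to the layer). A sketch with arbitrary example sizes:

import numpy as np

n_in, n_out = 128, 64   # fan-in and fan-out, chosen arbitrarily
W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
print(W.std())          # roughly sqrt(2 / 128), about 0.125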
Both the Xavier and the He initialization methods scale large weight values down while small ones are scaled up. Basically, both initializations draw weights centered around zero, scaled so that the variance of the signal stays roughly constant from layer to layer.
Conclusion
Initializing the right weights and biases is vital for optimal model performance. If the weights are initially small or even zero, the model is stuck near the 0-point of the activation function (sigmoid), where it behaves almost linearly, so non-linearity is not possible. Furthermore, the gradient of the function becomes extremely small, which is called the vanishing gradient problem.
If the weights are initialized with large values, the pre-activation value of a neuron, np.dot(weights, input) + bias, becomes very high. If additionally the sigmoid activation function is used, the output saturates near its maximum of 1, where the slope of the function changes very slowly and the gradient is close to zero, which means that learning and convergence take much more time.
The Xavier initialization technique is at its core the same as the He initialization technique. Xavier initialization is typically applied with the tanh (or sigmoid) activation function, while He initialization is often applied to overcome the issues that come with the ReLU activation function.
Further recommended readings:
Xavier Glorot and Yoshua Bengio, the original paper: "Understanding the difficulty of training deep feedforward neural networks"