The Magic Behind Tensorflow
"With great power comes great responsibility", Spider-Man
Tensor and Flow?
First of all, let's clear up what a tensor is. In short, a tensor is a multi-dimensional array that lets us compute all values at once instead of calculating them one by one. By operating on whole tensors in Tensorflow, we gain a significant computational performance boost.
In Tensorflow a scalar, a vector and a matrix are all implemented as tensors. Each of them has a specific rank: rank 0 means a scalar value, rank 1 means a vector and rank 2 means a matrix. But those are just scalars, vectors or matrices represented as tensors. What does a "real tensor" look like? Well, look at the pictures below.
one representation of a tensor
another representation of a tensor
A tensor is the main data structure in Tensorflow
A 3-dimensional tensor has rank 3. We can go on and on: a rank-n tensor is an n-dimensional tensor, which means we can potentially have a huge number of dimensions in our tensor.
In Tensorflow each numerical object is represented in the form of a tensor, be it a scalar, a vector or an n-dimensional matrix. To put it simply, in Tensorflow everything is a tensor.
By using tensors we achieve huge computational efficiency. If we compute with normal, non-tensor matrices, we receive the result for only one entry at a time. That would mean that if we train a neural network with millions of entries, it will take a looooong while until they are all computed ...
The key idea with tensors is that we are able to perform simultaneous computations on all the entries of our matrices. In Tensorflow, computations are represented as nodes. More on this later in the article.
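To make this concrete, here is a minimal sketch (assuming Tensorflow 2.x, where operations run eagerly; the matrix values are made up for illustration) of operating on a whole tensor at once instead of looping over its entries one by one:

import tensorflow as tf  # assuming Tensorflow 2.x, where operations run eagerly

# a rank-2 tensor (a 2 x 3 matrix), the values are just an illustration
m = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# one vectorized call squares *all* entries at once,
# instead of visiting them one by one in a Python loop
squared = tf.square(m)
print(squared)  # a tf.Tensor of shape (2, 3) holding the squared entries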
And what about "flow"?
The idea behind the "flow" is that the created tensors are "flowing" through the model's computational graph, which consists of nodes (operations) and connections (edges) along which the tensors, for example the network's weights, travel. Read more on weights in neural networks here and here.
Computational Graphs: An Overview
Computational graphs are also sometimes called "dataflow graphs".
Node
A computational graph consists of nodes and edges. Each node represents some mathematical operation. The result of the operation is stored in the node in the form of a variable and is passed to the next node as a ready-to-be-used calculation. All in all, a node is a way to represent some function we need for our further computations.
Such an architecture allows us to reuse our computations at a later point. In the context of Tensorflow we should mention lazy execution. Although this execution type is called "lazy", it gives us a massive advantage in terms of computational speed. With lazy execution, Tensorflow doesn't compute values until we tell it to perform these computations. Tensorflow processes its input values only if we request it to do so; otherwise it just stores the generated information for later use. Such structuring makes it much easier to parallelize or fuse computations.
In lazy execution the idea of a session is crucial to understand. We need to create a session in Tensorflow lazy execution in order to activate our computational graph and compute the values from it. A session is a way to allocate necessary memory for storing variables' values. You can compare a session with some executable file that may be run on your computer. Without a session there is no execution possible.
Another important concept of lazy execution is the placeholder. A placeholder is a variable which will receive its value at a later point in time. With placeholders in lazy execution we generate operations for the computational graph without any data in the first place. Through these placeholders we are able to "feed" our data into the computational graph later on. Metaphorically, we can imagine a placeholder as an empty shell which we will fill with values later.
Consult the following code, which shows placeholders and sessions in action:
import tensorflow as tf  # Tensorflow 1.x style (lazy execution)

p_1 = tf.placeholder(tf.float32, None, name="placeholder_1")  # will receive its value later
func = p_1**2  # our operation

with tf.Session() as sess:
    # 'feed_dict' feeds the input into the placeholder p_1 so 'func' can be evaluated
    res = sess.run(func, feed_dict={p_1: [2, 3, 5]})

print(res)
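Running this snippet should print the squared values of the fed-in list, so the output should look roughly like this:
[ 4.  9. 25.]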
Note: there also exists "eager execution", which computes operations on the fly. This type of execution does not set up a graph or create any sessions. Eager execution follows the concept of imperative programming. We will cover eager execution later on.
So, each graph consists of nodes and edges. Each node represents an operation and each edge carries a tensor, such as a weight, between the nodes. Let's start with a simple example: for the function g = ((a * b) + d) / f we can construct the following computational graph:
computational graph for the function g
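To see it in code, here is a minimal sketch of the same graph in the lazy (TF 1.x) style introduced above; the values fed into a, b, d and f are made up for the example:

import tensorflow as tf  # Tensorflow 1.x style (lazy execution)

# placeholders for the inputs of the graph
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
d = tf.placeholder(tf.float32, name="d")
f = tf.placeholder(tf.float32, name="f")

g = ((a * b) + d) / f  # this only defines the graph, nothing is computed yet

with tf.Session() as sess:
    # example values: g = ((2 * 3) + 4) / 5 = 2.0
    print(sess.run(g, feed_dict={a: 2.0, b: 3.0, d: 4.0, f: 5.0}))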
This is an introductory example of a computational graph. If we feed it into the Tensorflow framework as it is, we don't get any particularly useful result, as we only perform the forward pass. To unlock the real potential of our neural network we will use the power of differentiation. By that I mean the Chain Rule, and by that I mean Backpropagation.
If you need a refresher on the Chain Rule and Backpropagation, consult this article on Backpropagation. It will provide you with the necessary background knowledge.
A small recap: we use backpropagation to go backwards through the neural network, starting from the output, in order to update the weight values. If we do not perform the weight update, the neural network won't learn anything, which means it becomes useless. The chain rule is basically the means by which backpropagation is implemented. The chain rule is, at its core, a repeated application of differentiation rules to composite functions. Composite functions are functions that have other functions inside them.
Now we compute partial derivatives of the computational graph from above starting from the end point as we're applying backpropagation:
computational graph for partial derivatives of the function g
For a quick recap: gradients (∇) are vectors of partial derivatives. Partial derivatives (∂) are derivatives taken with respect to one particular variable while all other variables are held constant.
To review what derivatives and related concepts are, consult this article here.
As you might notice, we start from the last computation and go backwards through the graph, computing the partial derivative at each node and thereby performing backpropagation. We need partial derivatives to determine the rate of change of our function. For example, if we have something like ∂g / ∂f, we read it as the partial derivative of g with respect to f, and we want to find the change in g if we slightly change f.
With partial differentiation we find the change of a function with respect to (w.r.t.) some variable. For example, writing the intermediate nodes of the graph as c = a * b and e = c + d (so that g = e / f), the change of the function g w.r.t. the variable a according to the chain rule looks like this:
∂g / ∂a = (∂g / ∂e) · (∂e / ∂c) · (∂c / ∂a) = (1 / f) · 1 · b = b / f
Same for the b variable:
∂g / ∂b = (∂g / ∂e) · (∂e / ∂c) · (∂c / ∂b) = (1 / f) · 1 · a = a / f
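We can also let Tensorflow compute these derivatives for us. The following sketch (again TF 1.x style, with made-up input values) uses tf.gradients to obtain ∂g / ∂a and ∂g / ∂b; with a = 2, b = 3 and f = 5 they should come out as b / f = 0.6 and a / f = 0.4:

import tensorflow as tf  # Tensorflow 1.x style (lazy execution)

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
d = tf.placeholder(tf.float32, name="d")
f = tf.placeholder(tf.float32, name="f")

g = ((a * b) + d) / f

# tf.gradients builds the backward part of the graph for us
dg_da, dg_db = tf.gradients(g, [a, b])

with tf.Session() as sess:
    grads = sess.run([dg_da, dg_db], feed_dict={a: 2.0, b: 3.0, d: 4.0, f: 5.0})
    print(grads)  # expected: [0.6, 0.4], i.e. b / f and a / f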
In Tensorflow, once we are done with defining the computational graph, we compile the whole model to define the loss function, the optimizer and the metrics we want to use in our model. We can see compile as a method to create the training-related connections between the nodes in the graph.
We compile our model before training to set parameters such as an optimizer, a loss function and metrics. Look at an example of training and evaluating a model here.
There is a pre-implemented compile method in Tensorflow we can use.
compile(
optimizer='rmsprop',
loss=None,
metrics=None,
loss_weights=None,
sample_weight_mode=None,
weighted_metrics=None,
target_tensors=None,
distribute=None,
**kwargs
)
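As a usage sketch, compiling a small Keras model could look like the following; the architecture, loss and metric here are made up purely for illustration:

import tensorflow as tf

# a tiny example model, the layer sizes are arbitrary
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1)
])

# wire up the optimizer, loss and metrics before training
model.compile(
    optimizer="rmsprop",
    loss="mse",
    metrics=["mae"]
)

# afterwards the model can be trained with model.fit(x, y, epochs=...)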
Why do we use compile if we can program in Python? Python is not a compiled language, right? Yes, that's right. The thing is, the idea behind Tensorflow is that we use some programming language (often it's Python) only to describe our model. Tensorflow itself is not intrinsically bound to Python; in fact the backend of Tensorflow is written in C++ and CUDA.
In the end we don't really execute the Python code itself. Instead, the code is translated into a computational graph and executed from there, so we gain the maximum optimization.
Computational Graph: More Details
Let's analyze computational graphs in more detail. What does the computational graph look like for a simple neural network?
Starting with an example: imagine we have an input vector x with n values in it, a weight vector w with n values in it and a bias b, which is a scalar.
We have a function of the form ŷ = σ(x · wᵀ + b), where the small Greek sigma is the sigmoid function, which serves as our activation function. This whole expression is called the prediction and we denote it with ŷ. We also need a loss function for our neural network in order to be able to analyze the error of the outputted prediction. Let's say that we have a regression task, so we can take the Mean Squared Error (MSE) function, which has the following form: MSE = (Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²) / N, where N is the total number of samples.
Now we want to produce a computational graph for the composed equation, i.e. the loss applied to the prediction: MSE = (Σᵢ₌₁ᴺ (yᵢ − σ(xᵢ · wᵀ + b))²) / N
As we already know, Tensorflow needs to perform partial differentiation on the graph starting from the very end, so the graph will look like this:
Tensorflow will do the differentiation part for you automatically, so you don't have to differentiate anything on paper, which becomes less and less feasible for bigger and more complex neural networks :(
The automatic differentiation in Tensorflow does exactly what we just saw ourselves:
- create a computational graph
- calculate the derivatives from the computational graph
- to calculate the derivatives, use the chain rule and backpropagate:
  - take a node and its corresponding gradient operation
  - calculate the derivative of the node's output w.r.t. its input
  - backpropagate by calculating the gradients w.r.t. each of the network's parameters
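To tie these steps back to our simple network, here is a small sketch (TF 1.x graph style; the shapes and data are made up) in which Tensorflow builds the graph for ŷ = σ(x · wᵀ + b) plus the MSE loss and derives the gradients w.r.t. w and b automatically:

import tensorflow as tf  # Tensorflow 1.x style (lazy execution)

x = tf.placeholder(tf.float32, shape=(None, 3), name="x")  # N samples, 3 features each
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")  # N target values

w = tf.Variable(tf.zeros((3, 1)), name="w")  # weight vector
b = tf.Variable(0.0, name="b")               # scalar bias

y_hat = tf.sigmoid(tf.matmul(x, w) + b)      # prediction y_hat
mse = tf.reduce_mean(tf.square(y - y_hat))   # MSE loss

# Tensorflow builds the backward graph and applies the chain rule for us
grad_w, grad_b = tf.gradients(mse, [w, b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_w, grad_b],
                   feed_dict={x: [[1.0, 2.0, 3.0]], y: [[1.0]]}))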
A short summary for computational graphs: one of the main ideas behind computational graphs is code portability. We can actually export any created graph from Tensorflow and use it on any other architecture.
GradientTape
Tensorflow GradientTape is an automatic differentiation API. The "Tape" part denotes that each operation will be "written down on a tape", aka recorded. From this recorded computation, which consists of the executed operations and their inputs, we can then compute the gradients. But aren't the gradients stored automatically all the time? Well, read further ...
GradientTape is very useful when we utilize eager execution. With eager execution Tensorflow won't build a computational graph for us, nor create a session, as is usual in the case of lazy execution. If Tensorflow builds no graph, it also won't store the information needed for gradients explicitly. But it's still necessary to record the operations to be able to apply backpropagation later. For this purpose we use GradientTape, as all operations executed inside its context are recorded.
So now, even with eager execution, we can calculate the gradients of the loss function w.r.t. the parameters.
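For example, the gradient computation for the simple network from above could look like this in eager mode (a sketch assuming Tensorflow 2.x, with made-up data):

import tensorflow as tf  # assuming Tensorflow 2.x (eager by default)

x = tf.constant([[1.0, 2.0, 3.0]])  # one made-up sample with 3 features
y = tf.constant([[1.0]])            # its target value

w = tf.Variable(tf.zeros((3, 1)))   # weight vector
b = tf.Variable(0.0)                # scalar bias

with tf.GradientTape() as tape:
    y_hat = tf.sigmoid(tf.matmul(x, w) + b)      # prediction
    loss = tf.reduce_mean(tf.square(y - y_hat))  # MSE loss

# gradients of the loss w.r.t. the parameters, read off the tape
grad_w, grad_b = tape.gradient(loss, [w, b])
print(grad_w, grad_b)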
Let's summarize some key differences between Lazy and Eager executions:
Lazy:
- constructs a graph
- creates a session to activate the graph: Session.run()
- utilizes placeholders
- gradients are saved in a graph internally
- a session is saved as a whole
Eager:
- doesn't construct a graph
- uses tf.enable_eager_execution() to enable the eager mode (if you don't use the 2.0 version)
- has no placeholders, instead we pass data in form of arguments directly into a function
- gradients are saved with the help of GradientTape
- variable values are saved and restored through tf.train.Checkpoint
import tensorflow as tf

tf.enable_eager_execution()  # activate eager execution (not needed in Tensorflow 2.0)

c = tf.constant(2.0)  # a Tensorflow constant with value 2.0

with tf.GradientTape() as tape:
    tape.watch(c)  # 'watch' c so that the tape records operations involving it
    f = 4 * c

df_dc = tape.gradient(f, c)  # ∂f / ∂c
print(df_dc)
the output is:
tf.Tensor(4.0, shape=(), dtype=float32)
Again we have the function f = 4 * c, so consequently we get ∂f / ∂c = 4.
Going through the different Tensorflow versions and their configurations for different purposes may be a pain. Just keep in mind that for Tensorflow 2.0 the eager execution is the default mode.
Conclusion
In this article we've learned about the two types of execution in Tensorflow: the lazy and the eager one. We've analyzed the workflow of computational graphs and what is behind it, namely partial differentiation. With computational graphs Tensorflow doesn't have to compute everything directly from the code; instead it can create a graph and represent our code in it. We then have to run a session in order to compute values from the graph. This procedure is called the lazy execution paradigm. In Tensorflow 2.0, the approach is to get rid of sessions and to execute everything eagerly by default.
I hope you could unravel some of the magic behind one of the most frequently used machine learning frameworks of the present time.
For further research, consult the official Tensorflow documentation.