
The Magic Behind Tensorflow

by Elvira Siegel
(Published: Sun Nov 17, 2019)

Getting started

In this article we will delve into the magic behind one of the most popular Deep Learning frameworks - Tensorflow. We will look at the crucial terminology and some core computation principles we need in order to grasp the real power of Tensorflow.


"With great power comes great responsibility", Spider-Man


Tensor and Flow?

First of all, let's clear up what a tensor is. In short, a tensor is a multi-dimensional generalization of a matrix which makes computations much faster: we do not calculate values one by one but compute all the values at once. By doing so in Tensorflow, we gain a significant computational performance boost.


[Figures: a scalar, a vector and a matrix]

In Tensorflow a scalar, a vector and a matrix are all implemented as tensors. Each of them has a specific rank: rank zero means a scalar value, rank 1 means a vector and rank 2 means a matrix. But those are just scalars, vectors or matrices represented as tensors. What does a "real tensor" look like? Well, look at the pictures below.



[Figure: one representation of a tensor]

[Figure: another representation of a tensor]

A tensor is the main data structure in Tensorflow



A 3-dimensional tensor accordingly has rank 3. We can go on and on: a rank-n tensor is an n-dimensional tensor, which means we can potentially have a huge number of dimensions in our tensor.
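
As a quick illustration (the concrete values are made up), here is a minimal sketch that builds tensors of rank 0 to 3 and prints their shapes:

import tensorflow as tf

scalar = tf.constant(3.0)                        # rank 0
vector = tf.constant([1.0, 2.0, 3.0])            # rank 1
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank 2
tensor = tf.zeros([2, 3, 4])                     # rank 3

print(scalar.shape)   # ()
print(vector.shape)   # (3,)
print(matrix.shape)   # (2, 2)
print(tensor.shape)   # (2, 3, 4)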

In Tensorflow each numerical object is represented in the form of a tensor, be it an n-dimensional vector or matrix. To put it simply, in Tensorflow everything is a tensor.

By using tensors we achieve a huge gain in computational efficiency. If we compute with normal, non-tensor matrices, we receive the result for only one entry at a time. That would mean that if we train a neural network with millions of entries, it will take a very long while until they are all computed ...




The key idea with tensors is: we are able to perform simultaneous computations on all the entries we have in our matrices. In Tensorflow, computations are represented as nodes. More on this later in the article.
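
As a small illustration of this idea (the concrete matrices below are made-up examples), one operation acts on all entries of the tensors at once instead of looping over individual elements:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

elementwise = a * b         # one node multiplies all entries simultaneously
product = tf.matmul(a, b)   # the matrix product is also a single node

Both elementwise and product are again tensors; when they are actually evaluated depends on the execution mode, which we will look at next.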


And what about "flow"?

The idea behind the "flow" is that the created tensors "flow" through the model's computational graph, which consists of nodes (operations) and edges along which the tensors, such as weights, travel. Read more on weights in neural networks here and here.


Computational graphs: an Overview

Computational graphs are also sometimes called "dataflow graphs".


Node

A computational graph consists of nodes and edges. Each node represents some mathematical operation. The result of the operation is stored in the node in the form of a variable and is passed to the next node as a ready-to-be-used calculation. All in all, a node is a way to represent some function we need for our further computations.

Such an architecture allows us to reuse our computations at a later point. In the context of Tensorflow we should mention lazy execution. Although this execution type is called "lazy", it gives us a massive advantage in terms of computational speed. With lazy execution, Tensorflow doesn't compute values until we tell it to perform these computations: it processes its input values only when we request it to do so and otherwise just stores the generated information for later use. Such structuring makes it much easier to parallelize or fuse computations.


In lazy execution, the concept of a session is crucial to understand. We need to create a session in Tensorflow's lazy execution in order to activate our computational graph and compute the values from it. A session is a way to allocate the memory necessary for storing variables' values. You can compare a session to an executable file that can be run on your computer. Without a session, no execution is possible.

Another important concept of lazy execution is the placeholder. A placeholder is a variable which will receive its value at a later point in time. With placeholders in lazy execution we define operations for the computational graph without having any data in the first place. Through these placeholders we are able to "feed" our data into the computational graph later on. Metaphorically, we can imagine a placeholder as an empty shell that we fill with values later.


Consult the following code, which represents placeholders and sessions in action:

import tensorflow as tf

p_1 = tf.placeholder(tf.float32, None, name="placeholder_1") # the value will be fed in later

func = p_1**2 # our operation

with tf.Session() as sess:
    res = sess.run(func, feed_dict={p_1: [2, 3, 5]}) # 'feed_dict' supplies the input for 'func' by filling the placeholder p_1
    print(res)  # [ 4.  9. 25.]

Note: there also exists "eager execution", which computes operations on the fly. This type of execution does not set up a graph or create any sessions. Eager execution follows the concept of imperative programming. We will cover eager execution later on.


So, each graph consists of nodes and edges. Each node represents an operation or an input value, and the edges carry the tensors (such as the weights) flowing between them. Let's start with a simple example: for the function g = ((a * b) + d) / f we can construct the following computational graph:


[Figure: computational graph for the function g]

[Figure: the formula for g]
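
As a hedged sketch (not from the original article), this is how the graph for g could be expressed with the placeholders and sessions introduced above; the concrete input values are made up:

import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
d = tf.placeholder(tf.float32, name="d")
f = tf.placeholder(tf.float32, name="f")

g = ((a * b) + d) / f   # each operation becomes a node in the graph

with tf.Session() as sess:
    print(sess.run(g, feed_dict={a: 2.0, b: 3.0, d: 1.0, f: 4.0}))  # 1.75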

This is an introductory example of a computational graph. If we feed it into the Tensorflow framework, we don't get any particularly satisfying result, as we only perform the forward pass. To unlock the real potential of our neural network we will use the power of differentiation. By that I mean the Chain Rule, and by that I mean Backpropagation.

If you need a refresher on the Chain Rule and Backpropagation, consult this article Backpropagation. It will provide you with necessary background knowledge.

A small recap: we use backpropagation to go backwards through the neural network, starting from the output, in order to update the weight values. If we do not perform the weight update, the neural network won't learn anything, which makes it useless. The chain rule is basically the means by which backpropagation is implemented. At its core, the chain rule is a repeated application of differentiation rules to composite functions. Composite functions are functions that have other functions inside them.


[Figure: a composite function]
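
As a tiny worked example of the chain rule (my own illustration, not the one from the figure above): take the composite function h(x) = (3x + 2)². The outer function is u² and the inner function is u = 3x + 2, so

h'(x) = 2 · (3x + 2) · 3 = 6 · (3x + 2)

that is, the derivative of the outer function evaluated at the inner function, multiplied by the derivative of the inner function.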

Now we compute partial derivatives of the computational graph from above starting from the end point as we're applying backpropagation:


[Figure: computational graph for the partial derivatives of the function g]


For a quick recap: gradients ( ∇ ) are vectors of partial derivatives. Partial derivatives ( ∂ ) are derivatives taken with respect to one certain variable while other variables are held constant.

To review what derivatives and Co. are, consult this article here .

As you might notice, we start from the last computation and go backwards in the graph, computing partial derivatives of each node and performing backpropagation. We need partial derivatives to determine the rate of change in our function. For example, if we have something like: ∂g / ∂f we read it as the partial derivative of g with respect to f and we want to find the change in g if we slightly change f .

With partial differentiation we find the change of a function with respect to (w.r.t.) some variable, for example, the change of the function g w.r.t. the variable a according to the chain rule would look like this:


[Figure: the chain rule expansion of ∂g/∂a]

Same for the b variable:



[Figure: the chain rule expansion of ∂g/∂b]
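
To make this concrete, here is a hedged sketch (extending the graph-building example above, with made-up input values) that lets Tensorflow apply the chain rule for us via tf.gradients:

import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
d = tf.placeholder(tf.float32, name="d")
f = tf.placeholder(tf.float32, name="f")

g = ((a * b) + d) / f

# tf.gradients walks the graph backwards and applies the chain rule
dg_da, dg_db = tf.gradients(g, [a, b])

with tf.Session() as sess:
    print(sess.run([dg_da, dg_db],
                   feed_dict={a: 2.0, b: 3.0, d: 1.0, f: 4.0}))
    # ∂g/∂a = b/f = 0.75 and ∂g/∂b = a/f = 0.5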

In Tensorflow, once we are done defining the computational graph, we compile the whole model to specify the loss function, the optimizer and the metrics we want to use in our model. We can see compile as a method that creates the connections between the nodes in a graph.

We compile our model before training to set parameters such as an optimizer, a loss function and metrics. Look at an example of training and evaluating a model here.

There is a pre-implemented compile method in Tensorflow that we can use:


compile(
    optimizer='rmsprop',
    loss=None,
    metrics=None,
    loss_weights=None,
    sample_weight_mode=None,
    weighted_metrics=None,
    target_tensors=None,
    distribute=None,
    **kwargs
) 
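
As a hedged usage sketch (the layer sizes and the input dimension are arbitrary choices, not from the article), compiling a small Keras model could look like this:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(3,)),
    tf.keras.layers.Dense(1)
])

# wire the graph together with an optimizer, a loss function and a metric
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mae'])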

Why do we use compile if we can program in Python? Python is a non-compiled language, right? Yes, that's right. The thing is, the idea of Tensorflow is that we can use some programming language (often it's Python) to describe our model. Tensorflow itself is not intrinsically bound to Python. In fact, the backend of Tensorflow is written in C++ and CUDA.

In the end we don't really execute the Python code itself. Instead, the code is translated into a computational graph and computed there, so we gain the maximum optimization.


Computational Graph: More Details

Let's analyze computational graphs in more detail. What does the computational graph look like for a simple neural network?

Starting with an example: imagine we have an input vector x with n values in it, a weight vector w with n values in it and a bias b, which is a scalar.


[Figures: the input vector x, the weight vector w and the bias b]

We have a function of the form σ(x · wᵀ + b), where the small Greek sigma is the sigmoid function, which serves as our activation function. This whole expression is called the prediction and we denote it with ŷ. We also need a loss function for our neural network in order to be able to analyze the error of the outputted prediction. Let's say that we have a regression task, so we can take the Mean Squared Error (MSE) function, which has the following form: MSE = ( Σᴺᵢ₌₁ (yᵢ - ŷᵢ)² ) / N, where N is the total number of samples.
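
As a hedged sketch of this forward pass (the concrete numbers for x, w, b and y are made up), computed in the session style used earlier:

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0]])            # input vector, shape (1, n)
w = tf.constant([[0.5], [-0.2], [0.1]])       # weight vector, shape (n, 1)
b = tf.constant(0.3)                          # scalar bias
y = tf.constant([[1.0]])                      # true target

y_hat = tf.sigmoid(tf.matmul(x, w) + b)       # prediction ŷ = σ(x·wᵀ + b)
mse = tf.reduce_mean(tf.square(y - y_hat))    # Mean Squared Error

with tf.Session() as sess:
    print(sess.run([y_hat, mse]))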

Now we want to produce a computational graph for the following equation:


[Figure: the MSE formula and its computational graph]

As we already know, Tensorflow needs to perform partial differentiation on the graph starting from the very end, so the graph will look like this:


[Figure: computational graph with the partial derivatives for the MSE example]

Tensorflow will do the differentiation part for you automatically, so you don't have to differentiate anything on paper - which becomes less and less feasible for bigger and more complex neural networks.

The automatic differentiation in Tensorflow does exactly what we just did ourselves (see the code sketch after this list):

  1. create a computational graph
  2. calculate derivatives from the computational graph
  3. use the chain rule to calculate those derivatives

To apply the chain rule and backpropagate, Tensorflow will:

  1. take a node and its corresponding gradient operation
  2. calculate the derivative of the node's output w.r.t. its inputs
  3. backpropagate by accumulating the gradients w.r.t. each of the network's parameters
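
Here is the sketch referred to above: a hedged continuation of the earlier forward-pass example (the numbers are again made up), where Tensorflow differentiates the MSE loss w.r.t. the parameters w and b for us:

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0]])
y = tf.constant([[1.0]])
w = tf.Variable([[0.5], [-0.2], [0.1]])       # trainable parameters
b = tf.Variable(0.3)

y_hat = tf.sigmoid(tf.matmul(x, w) + b)
mse = tf.reduce_mean(tf.square(y - y_hat))

# automatic differentiation: Tensorflow builds the backward pass itself
grad_w, grad_b = tf.gradients(mse, [w, b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([grad_w, grad_b]))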

A short summary for computational graphs: one of the main ideas behind computational graphs is code portability. We can export any created graph from Tensorflow and use it on any other architecture.


GradientTape

Tensorflow GradientTape is an automatic differentiation API. The "Tape" part denotes that each operation will be "written down on a tape", i.e. stored. Then we can compute the gradients from the stored computation, which consists of the operations and their previously computed gradients. But aren't the gradients stored automatically all the time? Well, read further ...

GradientTape is very useful when we utilize eager execution. With eager execution Tensorflow won't build a computational graph for us, nor create a session, as is usual in the case of lazy execution. If Tensorflow builds no graph, it also won't store gradients for us explicitly. But it's still necessary to record the operations to be able to apply backpropagation later. For this purpose we use GradientTape, which records all operations executed inside its context.

So now, even with the eager execution, we can calculate the gradients of the loss function w.r.t the parameters.

Let's summarize some key differences between Lazy and Eager executions:



Lazy:

  1. constructs a graph
  2. creates a session to activate the graph: Session.run()
  3. utilizes placeholders
  4. gradients are saved in the graph internally
  5. a session is saved as a whole

Eager:

  1. doesn't construct a graph
  2. uses tf.enable_eager_execution() to enable the eager mode (not needed from version 2.0 on)
  3. has no placeholders; instead we pass data directly into a function in the form of arguments
  4. gradients are saved with the help of GradientTape
  5. the eager saver navigates to variable values and loads them through Checkpoint (see the sketch right after this list)
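
For point 5, here is a minimal hedged sketch of saving and restoring a variable with tf.train.Checkpoint (the variable and the file path are made up):

import tensorflow as tf

tf.enable_eager_execution()               # not needed in version 2.0

v = tf.Variable(3.0)
ckpt = tf.train.Checkpoint(v=v)           # track the variable under the name 'v'

save_path = ckpt.save('/tmp/demo_ckpt')   # write the variable's value to disk
v.assign(0.0)                             # change the value ...
ckpt.restore(save_path)                   # ... and load the saved one back
print(v.numpy())                          # 3.0

And here is GradientTape itself (point 4) in action, recording a simple operation so that we can differentiate it afterwards: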

import tensorflow as tf


tf.enable_eager_execution()     # activate eager execution (not needed in version 2.0)

c = tf.constant(2.0)    # a Tensorflow constant with value 2.0

with tf.GradientTape() as tape:
    tape.watch(c)   # tell the tape to record all operations involving c
    f = 4*c

df_dc = tape.gradient(f, c)   # compute ∂f/∂c from the recorded operations
print(df_dc)

the output is:

tf.Tensor(4.0, shape=(), dtype=float32)

Again, we have the function f = 4*c, so consequently we get ∂f / ∂c = 4.
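
A side note, as a hedged sketch: if we use a tf.Variable instead of a constant, the tape watches it automatically and no explicit watch call is needed:

import tensorflow as tf

tf.enable_eager_execution()     # not needed in version 2.0

v = tf.Variable(2.0)            # trainable variables are watched automatically

with tf.GradientTape() as tape:
    f = 4 * v

df_dv = tape.gradient(f, v)
print(df_dv)                    # tf.Tensor(4.0, shape=(), dtype=float32)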

Going through the different Tensorflow versions and their configurations for different purposes may be a pain. Just keep in mind that for Tensorflow 2.0 the eager execution is the default mode.


Conclusion

In this article we've learned about two types of execution in Tensorflow: the lazy and the eager one. We've analyzed the workflow of computational graphs and what is behind it, namely partial differentiation. With computational graphs Tensorflow doesn't have to compute everything directly from the code; instead it can create a graph and represent our code in it. We then have to run a session in order to compute values from the graph. This procedure is called the lazy execution paradigm. In Tensorflow 2.0, the approach is to get rid of sessions and to execute everything eagerly by default.

I hope you could unravel some of the magic behind one of the most frequently used Machine Learning frameworks of the present time.

For further research, consult the official Tensorflow documentation.


Further recommended readings:

Neural Networks Introduction

Backpropagation

Partial Derivatives and The Jacobian Matrix
