In my last post on neural networks [HERE], I talked about how one can think of neural networks as universal approximators. In this post I try to help you understand a toy neural network implementation. In particular, it should give you a clearer and more intuitive understanding of what a forward_pass is and what back_propagation means.
Most of the material presented here is a summary of my learning from:
- CS231n, Stanford University course: Convolutional Neural Networks for Visual Recognition, by Prof. Fei-Fei Li
- Deep Learning Book, an MIT Press book by Ian Goodfellow, Yoshua Bengio and Aaron Courville. An online version is available for free.
Summary
- A neural network is just a rather complicated (not so much) non-linear function.
- The forward pass is a way to evaluate this function using a computation-graph representation.
- Back propagation is another name for the chain rule applied to this computation-graph; it is used to compute gradients.
- Gradient descent is used to minimize the loss function with respect to the tunable parameters.
- Code available on GitHub : https://github.com/mpkuse/learning_xor
- Update : Neural Network from scratch for a 3-category classification of spiral 2d dataset. Code on GitHub : https://github.com/mpkuse/nn_classify
Gradient Descent
Before we start understanding neural networks, let us understand what Gradient Descent is. There is a ton of material on the internet to help you with it ([Youtube]); I will just summarize it here. Gradient descent is a numerical technique for minimizing a function with a scalar output. Concretely, if $f : \mathbb{R}^n \rightarrow \mathbb{R}$, gradient descent is an iterative technique which can find you an input argument $x^* = \arg\min_{x} f(x)$ that results in a (local) minimum value of the function. The technique is very simple: one starts with an initial guess and in each iteration subtracts a scaled gradient vector from the current guess. Iterate until convergence or for a fixed number of iterations.
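Written out, a single gradient-descent step is

$$x_{k+1} = x_k - \alpha \, \nabla f(x_k)$$

where $x_k$ is the current guess, $\alpha > 0$ is the step size (learning rate), and $\nabla f(x_k)$ is the gradient of $f$ evaluated at $x_k$.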
As an example, let us try and minimize

$$f(b, w_1, w_2) = \frac{1}{2}\left( b^2 + (1-w_2-b)^2 + (1-w_1-b)^2 + (-w_1-w_2-b)^2 \right).$$
The following code is a naive implementation of gradient descent for the above function.
% initial guess
b = 1.3; w1 = 1; w2 = 2;
step = 0.6;

% Iterations
for itr = 1:200
    f = 0.5*( b^2 + (1-w2-b)^2 + (1-w1-b)^2 + (-w1-w2-b)^2 );

    % Partial derivatives of f
    df_db  = b - (1-w2-b) - (1-w1-b) - (-w1-w2-b);
    df_dw1 = -(1-w1-b) - (-w1-w2-b);
    df_dw2 = -(1-w2-b) - (-w1-w2-b);

    % Gradient-descent update: subtract the scaled gradient
    b  = b  - step*df_db;
    w1 = w1 - step*df_dw1;
    w2 = w2 - step*df_dw2;

    display( sprintf( '%d : %f (%f)', itr, f, norm([df_db, df_dw1, df_dw2]) ) );
end
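Incidentally, this particular $f$ is a linear least-squares objective, fitting the predictions $[b,\; w_2+b,\; w_1+b,\; w_1+w_2+b]$ to the targets $[0, 1, 1, 0]$, so the result can be sanity-checked against the closed-form solution. The following snippet (my addition, not part of the repository code) solves the normal equations directly:

% Closed-form least-squares check for the gradient-descent result above.
% Rows of A correspond to the 4 residuals; columns to [b, w1, w2].
A = [ 1 0 0 ;
      1 0 1 ;
      1 1 0 ;
      1 1 1 ];
t = [0; 1; 1; 0];
theta = A \ t;    % expected: b = 0.5, w1 = 0, w2 = 0
display( theta' );

The best linear fit predicts 0.5 for every input, which is one way to see that a linear model cannot represent XOR.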
Forward Pass
Quite often neural networks are compared with the neurons in a brain. I will skip this analogy; the working of our brains is far more complex and not clearly understood. The following links might help explain why artificial neural networks are far, far away from actual brain neurons: [link] and [link].
From my perspective, a neural network is just a rather complicated non-linear function with tunable/learnable parameters. For example, a 2-layer neural net with the particular architecture setting used below evaluates the function

$$f(X) = \max\left(0,\; X W_1 + b_1\right) W_2 + b_2$$

where $W_1, b_1, W_2, b_2$ are the parameter matrices (and bias vectors) which have to be tuned/learned.
function [u1, u2, H1, u3, s, L] = forward_pass( X, W1, b1, W2, b2, y )
    u1 = X * W1;
    u2 = u1 + b1;
    H1 = max( 0, u2 );   % Element-wise max (also referred to as ReLU)
    u3 = H1 * W2;
    s  = u3 + b2;
    L  = (y - s)^2;      % Squared-loss function
end
When people refer to the forward pass, they are referring to the evaluation of the above function, given a particular instance of the parameter matrices, the input data and the desired output. The output of the forward pass is the loss associated with these parameter matrices with respect to a loss-function.
When we say we are training a neural network, we are merely doing gradient descent to find the best parameters which fit the training data (the training data consists of inputs and the desired output for each input). Usually we quantify the fit with a loss function like the squared loss, softmax, etc.
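As a concrete usage sketch (the initialization below is my own choice for illustration; the repository code may initialize differently), one forward pass on the XOR point [0, 1] with desired output 1 looks like this:

% Architecture: 2 inputs -> 2 hidden ReLU units -> 1 output.
% Sketch: small random initialization, chosen for illustration.
W1 = 0.1*randn(2,2);  b1 = zeros(1,2);
W2 = 0.1*randn(2,1);  b2 = 0;

X = [0, 1];  y = 1;   % one training point and its desired output
[u1, u2, H1, u3, s, L] = forward_pass( X, W1, b1, W2, b2, y );
display( sprintf( 'prediction s = %f, loss L = %f', s, L ) );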
For more workable details on the construction of these neural network functions, refer to the lecture notes of Stanford's CS231n course on "Setting Up the Architecture".
Back-Propagation
Back propagation is a technique for evaluating the derivatives (gradients) of the loss with respect to each of the tunable/learnable parameters. It is actually just a fancy name for the chain rule of calculus.
One usually thinks of the neural network function as a computation-graph. Each of the variables is a node in a directed graph, and the edges denote functions (e.g. matrix-multiply, +, ReLU viz. max(0, z), etc.). Note that a node can represent a matrix, a vector or a scalar; in the literature one often finds that the nodes are said to represent tensors. One would draw the computation-graph for our 2-layer neural network as below. The graph computes exactly the forward_pass function (see the Matlab code above).
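In text form (the drawing itself is not reproduced here), the graph chains the variables (nodes) and operations (edges) as:

X → [× W1] → u1 → [+ b1] → u2 → [max(0, ·)] → H1 → [× W2] → u3 → [+ b2] → s → [(y − s)²] → L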
Now we compute the derivatives (gradients) of each of the simple functions, i.e. the functions denoted on the edges of the above computation graph. We are only interested in the gradients of the loss function with respect to W1, b1, W2, and b2, i.e. $\frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial b_1}, \frac{\partial L}{\partial W_2}, \frac{\partial L}{\partial b_2}$. To obtain these, one basically chains (matrix-multiplies) the local gradients along the path from the loss back to the variable.
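For instance, following the path from $L$ back to $W_2$ and multiplying the local gradients along the way gives

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial s} \cdot \frac{\partial s}{\partial u_3} \cdot \frac{\partial u_3}{\partial W_2} = -2(y-s) \cdot 1 \cdot H_1$$

which is exactly the chaining done in the code below; the longer path from $L$ to $W_1$ multiplies five local gradients in the same way.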
To get a little more detail on the back-propagation method of computing function gradients, I highly recommend Stanford's CS231n lecture on "Back-propagation Intuition". There are actually very simple rules for evaluating the derivatives of operations like matrix-multiply, ReLU and +, as noted in the Stanford CS231n lectures. After having read that, I would recommend Section 6.5 (chapter 6) of the Deep Learning Book.
One can evaluate the gradients of the squared-loss function for our 2-layer network with the following code. Note that this function needs the intermediate values computed by the forward_pass function.
function [dL_db2, dL_dW2, dL_db1, dL_dW1] = backward_pass( X, W1, b1, W2, b2, y, u1, u2, H1, u3, s, L )
    % Local gradients. u4 denotes the output node, i.e. s = u3 + b2.
    dL_dy   = 2*(y-s);     % scalar (gradient w.r.t. the target; not used below)
    dL_du4  = -2*(y-s);    % scalar
    du4_db2 = 1.0;         % scalar
    du4_du3 = 1.0;         % scalar
    du3_dW2 = H1;          % 1x2
    du3_dH1 = W2';         % 1x2
    % ReLU gradient: if u2_i > 0 ==> 1; if u2_i < 0 ==> 0
    dH1_du2 = diag( max( 0, u2 ) > 0 );  % 2x2
    du2_db1 = eye(2,2);    % 2x2
    du2_du1 = eye(2,2);    % 2x2

    % Final chaining
    dL_db2 = dL_du4 * du4_db2;
    dL_dW2 = dL_du4 * du4_du3 * du3_dW2;
    dL_db1 = dL_du4 * du4_du3 * du3_dH1 * dH1_du2 * du2_db1;
    dL_dW1 = X' * dL_du4 * du4_du3 * du3_dH1 * dH1_du2 * du2_du1;
end
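A standard way to gain confidence in such hand-derived gradients is a finite-difference check against forward_pass. The snippet below (my addition, continuing with the variables from the usage example above) compares the analytic gradient of b2 with a numerical estimate:

% Analytic gradients from the hand-derived chain rule.
[dL_db2, dL_dW2, dL_db1, dL_dW1] = backward_pass( X, W1, b1, W2, b2, y, u1, u2, H1, u3, s, L );

% Numerical gradient of L w.r.t. b2 by central differences.
eps_fd = 1e-5;
[~,~,~,~,~, L_plus ] = forward_pass( X, W1, b1, W2, b2 + eps_fd, y );
[~,~,~,~,~, L_minus] = forward_pass( X, W1, b1, W2, b2 - eps_fd, y );
num_dL_db2 = (L_plus - L_minus) / (2*eps_fd);

% The two values should agree to several decimal places.
display( sprintf( 'analytic dL/db2 = %f, numeric dL/db2 = %f', dL_db2, num_dL_db2 ) );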
Putting it all Together (Learning XOR-gate)
Now we understand that a forward_pass evaluates the function value (and the loss) given an instance of the input data and tunable parameters, and that back_propagation evaluates the derivatives at this input with respect to the tunable parameters. We start with an initial guess of the tunable parameters and improve this guess using gradient descent.
For the toy example, let us try to learn the XOR gate. This neural network will have 2 inputs and 1 output. In particular, we want to train the parameters of the neural network function so that an input of [0,0] results in 0; [0,1] results in 1; [1,0] results in 1; and [1,1] results in 0. Note that this data is not linearly separable (you can verify this by plotting it).
The total loss over all 4 data points will be $L_{total} = \sum_{i=1}^{4} L^{(i)} = \sum_{i=1}^{4} \left(y^{(i)} - s^{(i)}\right)^2$. Thus, the gradient of the total loss with respect to each tunable parameter $\theta \in \{W_1, b_1, W_2, b_2\}$ is $\frac{\partial L_{total}}{\partial \theta} = \sum_{i=1}^{4} \frac{\partial L^{(i)}}{\partial \theta}$. In other words, we need to add up the gradients of the individual data points resulting from back_propagation; a sketch of the resulting training loop is given below.
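Putting the pieces together, a training loop along the following lines accumulates the four per-point gradients and takes a gradient-descent step. This is a sketch under my own choices of initialization, step size and iteration count; the repository version may differ:

% XOR training data: inputs (rows of Xs) and desired outputs.
Xs = [0 0; 0 1; 1 0; 1 1];
ys = [0; 1; 1; 0];

% Sketch: small random initialization, 2 inputs -> 2 hidden ReLU units -> 1 output.
W1 = 0.5*randn(2,2);  b1 = zeros(1,2);
W2 = 0.5*randn(2,1);  b2 = 0;
step = 0.1;

for itr = 1:5000
    g_W1 = zeros(2,2);  g_b1 = zeros(1,2);
    g_W2 = zeros(2,1);  g_b2 = 0;
    total_L = 0;
    for i = 1:4
        [u1,u2,H1,u3,s,L] = forward_pass( Xs(i,:), W1, b1, W2, b2, ys(i) );
        [dL_db2, dL_dW2, dL_db1, dL_dW1] = ...
            backward_pass( Xs(i,:), W1, b1, W2, b2, ys(i), u1, u2, H1, u3, s, L );
        % Add up the per-point gradients (gradient of the total loss).
        % backward_pass returns dL_dW2 as a 1x2 row vector, hence the transpose.
        g_W1 = g_W1 + dL_dW1;   g_b1 = g_b1 + dL_db1;
        g_W2 = g_W2 + dL_dW2';  g_b2 = g_b2 + dL_db2;
        total_L = total_L + L;
    end
    % Gradient-descent step on every tunable parameter.
    W1 = W1 - step*g_W1;  b1 = b1 - step*g_b1;
    W2 = W2 - step*g_W2;  b2 = b2 - step*g_b2;
    display( sprintf( '%d : total loss %f', itr, total_L ) );
end

With only 2 hidden units, whether the loop reaches (near-)zero loss depends on the random initialization; re-running with a different seed may be needed.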
You can get the entire code (Matlab) from my GitHub repository.
https://github.com/mpkuse/learning_xor
This concludes the discussion of a toy neural network. Of course, real neural network implementations are way more complicated. For example, as the data gets larger, full-batch gradient descent is not feasible; the way around it is SGD (Stochastic Gradient Descent). For the sake of simplicity we have omitted a lot of details. For an excellent overview of the descent techniques in use in the machine learning community, viz. SGD, ADAM, RMSProp, AdaGrad, etc., see here [link].
Progressively, I am planning to introduce myself to these complexities and blog about them. Next time I might do a 3-category classifying neural network on a larger data set (~300 pts).