As jean-yves rightly pointed out in his writeup about neural networks, working out how to actually train a network was a difficult problem. Once we know the amount of error in the network's output, how do we divide that error amongst all the nodes? Every node is responsible for a bit of that error, but it isn't clear exactly how much.* As it turns out, the answer is found through the magic of a little bit of calculus and a little bit of cleverness.

The essential framework of the algorithm in broad strokes is:

- Choose random weights for the network
- Feed in an example and obtain a result
- Calculate the error for each node (starting from the last node and propagating the error backwards)
- Update the weights
- Lather, rinse, repeat with other examples until the network converges on the target output

First let's look at the process of calculating each node's error (Step 3). I am assuming you already have constructed a neural network and have chosen your training data. Run it with your first example and obtain the result.

**Calculating Error**

For the purposes of this writeup,

- Δ_{i} denotes the value of a node's error, where i is the particular node
- g(z) denotes the activation function of the network
- g'(z) is the derivative of the activation function
- a_{i} is the activation of node i, in other words, the value of its output when it fires
- in_{i} is the sum of the weighted input coming into node i. Specifically, in_{i} = Σ_{j}a_{j}w_{ji}, where w_{ji} is the weight from node j to i

**Δ_{i} = g'(in_{i})(T - a_{i})**

Basically, the error for the output node is the difference of its output a_{i} from the target output T, multiplied by the derivative of the activation function. We know what the target output is because the examples used to train the network are labelled -- this allows us to determine how far off it is from the correct answer. The derivative is used here in order to solve the problem mentioned earlier of assigning the correct amount of error to this particular node. As it turns out, the derivative is an accurate way to determine how much of a role each node plays in the network.
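For concreteness, here is the output-node formula in code. This is a minimal sketch that assumes a sigmoid activation g(z) = 1/(1 + e^(-z)) -- a common choice, but not one mandated by the writeup -- whose derivative takes the convenient form g'(z) = g(z)(1 - g(z)):

```python
import math

def g(z):
    """Sigmoid activation function (an assumed choice of g)."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = g(z)
    return s * (1.0 - s)

def output_delta(in_i, target):
    """Formula 1: Delta_i = g'(in_i) * (T - a_i), where a_i = g(in_i)."""
    a_i = g(in_i)
    return g_prime(in_i) * (target - a_i)
```

For example, a node whose weighted input is 0 fires with a_i = 0.5; if the target is 1, its delta is 0.25 × 0.5 = 0.125.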

Now that we know the error for the final node in the network, we can propagate that error backwards to determine all of the other nodes' error. For reference, these nodes are also called the Hidden Units. The formula for these nodes is:

**Δ_{i} = g'(in_{i})Σ_{k}w_{ik}Δ_{k}**

*DON'T PANIC!* It's not as scary as it sounds! Let's look at that scary sigma part first. k ranges over the nodes connected to the output of node i, so these are all of the nodes that come after our node. w_{ik} signifies the weight from node i to node k, and Δ_{k} we already know -- it's the amount of error for the nodes that are one layer ahead of node i. Remember, we know this because we are working backwards through the network. So basically we take the weight from node i to k and multiply it by k's delta. Then we do the same with each node connected out from i, and sum the total. Now we calculate g' for node i, based on all of the weighted inputs from the nodes that feed into i. Voila! Now you can repeat this step with all of the nodes one layer in front of i, and so on, until you have gone through the whole network.
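The hidden-unit formula can be sketched the same way. The function below assumes the same sigmoid activation as before; `weights_out` and `deltas_next` are hypothetical names I've chosen for the weights w_{ik} out of node i and the already-computed deltas Δ_{k} of the layer ahead:

```python
import math

def g(z):
    """Sigmoid activation function (assumed, as before)."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Sigmoid derivative: g'(z) = g(z) * (1 - g(z))."""
    s = g(z)
    return s * (1.0 - s)

def hidden_delta(in_i, weights_out, deltas_next):
    """Formula 2: Delta_i = g'(in_i) * sum_k w_ik * Delta_k.

    weights_out[k] holds w_ik, the weight from node i to node k;
    deltas_next[k] holds Delta_k, the error of node k one layer ahead.
    """
    weighted_error = sum(w * d for w, d in zip(weights_out, deltas_next))
    return g_prime(in_i) * weighted_error
```

Note that the sum only sees weights and deltas *ahead* of node i, which is why the errors must be computed back-to-front.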

**Updating Weights**

Now that we have calculated the amount of error (Δ) for each node, we can update the weights between the nodes (Step 4). This is our ultimate goal because it is the weights which determine how the network performs. The formula for this is now trivial:

**w_{ji} = w_{ji} + ηa_{j}Δ_{i}**

The new value of the weight from node j to node i is just its old value plus the product of node j's activation a_{j}, node i's error Δ_{i} (calculated earlier) and η (a constant known as the learning rate).
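In code the update is a one-liner; the learning-rate value of 0.1 below is an arbitrary choice for illustration:

```python
ETA = 0.1  # learning rate; 0.1 is an arbitrary illustrative value

def update_weight(w_ji, a_j, delta_i, eta=ETA):
    """Formula 3: w_ji <- w_ji + eta * a_j * Delta_i."""
    return w_ji + eta * a_j * delta_i
```

A small η means many gentle nudges toward the target; a large η learns faster but can overshoot.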

**Summary**

So again, the basic steps of error propagation are to determine the error for the output node using the first formula, then work backwards through the network, calculating each node's error using formula 2. Finally, we update the weights based on the error, using the third formula.

Processing every training example once in this way is technically known as one epoch of the training process. Once an epoch has completed we simply run through the examples again, calculating the error and updating the weights, and repeat until the network has converged on the target output, meaning that it responds correctly to given input.
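Putting all three formulas together, the whole procedure fits in a short program. The sketch below trains a tiny 2-2-1 network on the XOR function; the sigmoid activation, the learning rate of 0.5, the bias inputs, and the 5000-epoch budget are all my assumptions for illustration rather than anything prescribed above (bias weights are standard practice, and the formulas treat them like any other weight):

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

def g(z):
    """Sigmoid activation (assumed)."""
    return 1.0 / (1.0 + math.exp(-z))

# XOR training set: (inputs, target output)
examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

N_IN, N_HID = 2, 2
# Step 1: choose random weights. The "+ 1" rows/entries are bias
# weights, fed by a constant input of 1.0.
w_hid = [[random.uniform(-1, 1) for _ in range(N_HID)]
         for _ in range(N_IN + 1)]
w_out = [random.uniform(-1, 1) for _ in range(N_HID + 1)]

def run_epoch(eta=0.5):
    """One pass over all examples; returns the total squared error."""
    total_err = 0.0
    for x, target in examples:
        # Step 2: feed in an example and obtain a result
        xs = x + [1.0]                      # append the bias input
        in_h = [sum(xs[j] * w_hid[j][i] for j in range(N_IN + 1))
                for i in range(N_HID)]
        a_h = [g(z) for z in in_h] + [1.0]  # bias "activation"
        in_o = sum(a_h[i] * w_out[i] for i in range(N_HID + 1))
        a_o = g(in_o)
        total_err += (target - a_o) ** 2
        # Step 3: calculate each node's error, back to front.
        # Formula 1 (output node), using g'(z) = g(z)(1 - g(z)):
        d_o = a_o * (1 - a_o) * (target - a_o)
        # Formula 2 (hidden units):
        d_h = [a_h[i] * (1 - a_h[i]) * w_out[i] * d_o
               for i in range(N_HID)]
        # Step 4: update the weights (formula 3)
        for i in range(N_HID + 1):
            w_out[i] += eta * a_h[i] * d_o
        for j in range(N_IN + 1):
            for i in range(N_HID):
                w_hid[j][i] += eta * xs[j] * d_h[i]
    return total_err

first = run_epoch()
for _ in range(5000):
    last = run_epoch()
# After training, the error should have dropped well below its
# starting value as the network converges on XOR.
```

XOR is the classic test case here because, famously, no single-layer network can learn it -- it needs the hidden units whose deltas formula 2 provides.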

*A poor example of this might be the citizens of a country. If the country is experiencing a recession, every single person is somehow a cause of the downturn, simply by being a part (however small) of the economy. Certain individuals may play larger roles (a bank going bankrupt) than others, but everyone is ultimately implicated. In the same way, each node contributes to the overall success of the network.