## Tuesday, June 7, 2011

### Back Propagation

In a recent post on neural networks, using R I described neural networks and presented the following visualization from R:
I have also described a multilayer perceptron as a weighted average or ensemble of logits. But how are the weights in each hidden layer logistic activation function (or any activation function for other network architectures) estimated? How are the weights in the combination functions estimated? Neural networks can be estimated using back propagation, described in Hastie as 'a generic approach to minimizing R(θ) (the cost function) by gradient descent.'

Given a neural network with inputs X with hidden layers comprised of hidden units Z used to predict some outcome Y , schematically (simplifying the notation in Hastie)

X -> Z -> Y

Z = σ( α0α'1 x)
T =  β0 + βZ
f(X) = g(T)   [1]

where σ = the activation function

Given weights {α00 ,  β0 , β} find the values that minimize the specified error function:

R(θ) =∑∑ ( y-f(x)2 )      [2] (note a number of possible error functions may be used)

βr+1 – βr - γ ∂R/ ∂β    [3]

αr+1 – αr - γ ∂R/ ∂α   [4]

Errors can be re-specified as:

∂R/ ∂β = δZ   [5]
∂R/ ∂α =Sx    [6]

Backpropogation equations:

s = σ'( αTx )βδ   [7]

Algorithm:

Forward Pass: use initial or current weights (guesses) and calculate f(X)

Backward Pass: calculate errors δ using [5] and then 'back propagate' via back propagation equations to obtain s. Both sets of errors (δ) and (s) are used to update weight estimates via the update equations [3]& [4].

In Predictive modeling with SAS Enterprise Miner by Sarma,  the following basic description of back propagation is given:

Specify an error function E.

1) 1st iteration- set initial weights, use to evaluate E
2) 2nd iteration- weights are changed by a small amount such that the error is redced
-repeat until convergence

As Sarma explains, with each iteration a number of weights are produced, so if it takes 100 iterations to converge, 100 possible models are specified, giving 100 sets of weights. Using validation data, the best iteration can be chosen calculating E via the validation data.