When I first sat through this lecture I wondered if it would really be useful. It turns out that understanding gradient descent is helpful to understanding backpropogation which is used to train neural networks.
Based on the lecture notes, gradient descent can be described as follows:
Suppose we want to predict y with a function h(x) = Θ0+ Θ1 x1 + x2Θ2 + etc = ΘTx or βX
given a specified cost function: J(Θ) = (1/2) ∑ (h(x)-y)2or e'e
we choose Θ to minimize J(Θ) using a search algorithm that repeatedly changes Θ to make J(Θ) smaller and smaller until it converges to a value of Θ that minimizes J(Θ).
Θ : Θ(i) - α ∂ J(Θ)/∂Θ(i) or β : βi - α ∂e'e/∂β 'update or guessing function' for some guess 'i'
Solving for the partial derivative or gradient term gives:
∂ J(Θ)/∂Θ(i) = (h(Θ)-y)x or e'x
and the update function becomes:
Θ: Θ(i) +α(y-h(x))x or β : βi - αe'x
the magnitude of each update for each iteration is a function of the error term and the learning rate 'α '.
Alternatively, gradient descent can be represented as follows:
Given a function F() and guess Xo and the update function
Xn+1 = Xn - α∇F(Xn)
we get a series of updates such that F(Xo) > F(X1) > F(X2) >F(X3...
with convergence at the minimum value of F().
we choose Θ to minimize J(Θ) using a search algorithm that repeatedly changes Θ to make J(Θ) smaller and smaller until it converges to a value of Θ that minimizes J(Θ).
Θ : Θ(i) - α ∂ J(Θ)/∂Θ(i) or β : βi - α ∂e'e/∂β 'update or guessing function' for some guess 'i'
Solving for the partial derivative or gradient term gives:
∂ J(Θ)/∂Θ(i) = (h(Θ)-y)x or e'x
and the update function becomes:
Θ: Θ(i) +α(y-h(x))x or β : βi - αe'x
the magnitude of each update for each iteration is a function of the error term and the learning rate 'α '.
Alternatively, gradient descent can be represented as follows:
Given a function F() and guess Xo and the update function
Xn+1 = Xn - α∇F(Xn)
we get a series of updates such that F(Xo) > F(X1) > F(X2) >F(X3...
with convergence at the minimum value of F().
No comments:
Post a Comment