From Applied Logistic Regression, 2nd Edition. Hosmer and Lemeshow.
This is why you can exponentiate the coefficients from logistic regression to derive the odds ratio. Note that the odds are exp(B0 + B1X), so we start by calculating the odds given X = 1, exp(B0 + B1), and divide that by the odds given X = 0, exp(B0), which leaves exp(B1). See logistic regression and the calculation and interpretation of odds ratios and analysis of the logistic function for more details.
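As a quick numerical check of the statement above, the following minimal sketch computes the odds at X = 1 and X = 0 and compares their ratio with the exponentiated slope. The coefficient values b0 and b1 are hypothetical and chosen only for illustration; they do not come from the book.

```python
import math

def odds(b0, b1, x):
    """Odds p / (1 - p) implied by a logistic model with linear predictor b0 + b1*x."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # logistic function
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
b0, b1 = -1.5, 0.8

# Odds given X = 1 divided by odds given X = 0.
odds_ratio = odds(b0, b1, 1) / odds(b0, b1, 0)

# The two printed values should agree up to floating point rounding.
print(odds_ratio, math.exp(b1))
```

The intercept cancels in the ratio, which is why only the slope coefficient is exponentiated.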
An attempt to make sense of econometrics, biostatistics, machine learning, experimental design, bioinformatics, ....
Thursday, June 30, 2011
Tuesday, June 7, 2011
Back Propagation
In a recent post on neural networks using R, I described neural networks and presented a visualization produced in R.
I have also described a multilayer perceptron as a weighted average or ensemble of logits. But how are the weights in each hidden layer's logistic activation function (or any other activation function, for other network architectures) estimated? How are the weights in the combination functions estimated? Neural networks can be estimated using back propagation, described in Hastie as 'a generic approach to minimizing R(θ) (the cost function) by gradient descent.'
Given a neural network with inputs X with hidden layers comprised of hidden units Z used to predict some target T, we can represent a neural network schematically (simplifying the notation in Hastie by omitting key subscripts and summations)
X -> Z -> T
Z = σ(α0 + α^T X)
T = β0 + β^T Z
f(X) = g(T) [1]
where σ = the activation function
Given weights {α0, α, β0, β}, find the values that minimize the specified error function:
R(θ) = ∑∑ (y - f(x))^2 [2] (note that a number of other error functions may be used)
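Backpropagation equations:
s = σ'(α^T x) β δ [3]
where δ is the error at the output layer and s is the error at the hidden layer. The derivatives of the error function can be re-specified in terms of these errors:
∂R/∂β = δZ [4]
∂R/∂α = sx [5]
Gradient Descent Update:
β(r+1) = β(r) - γ ∂R/∂β [6]
α(r+1) = α(r) - γ ∂R/∂α [7]
where γ is the learning rate.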
Algorithm:
Forward Pass: use the initial or current weights (guesses) to calculate f(X) and the output-layer errors δ implied by the error function [2].
Backward Pass: 'back propagate' the output errors through the back propagation equation [3] to obtain s. Both sets of errors (δ and s) are used to form the derivative terms in [4] and [5], which are then used to update the weight estimates by gradient descent via equations [6] and [7].
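The forward and backward passes can be sketched in plain Python for a single hidden layer. This is a minimal illustration under stated assumptions, not Hastie's implementation: the output function g is taken to be the identity, the error is squared error, weights are updated after each observation rather than after summing over all observations, and the toy data, the number of hidden units, the learning rate gamma and the function names are all hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """Forward pass [1]: hidden units Z, then output f(X) with g = identity."""
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(a_m, x)))
         for a0, a_m in zip(alpha0, alpha)]
    f = beta0 + sum(b * zm for b, zm in zip(beta, z))
    return z, f

def train(data, n_hidden=3, gamma=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Small random starting weights (the initial guesses).
    alpha0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    alpha = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    beta0 = 0.0
    beta = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    history = []  # squared error accumulated over each pass through the data
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            z, f = forward(x, alpha0, alpha, beta0, beta)
            total += (y - f) ** 2
            delta = -2.0 * (y - f)  # output-layer error from [2]
            # Back propagation equation [3]; sigma'(v) = z * (1 - z) for the logistic.
            s = [zm * (1.0 - zm) * b * delta for zm, b in zip(z, beta)]
            # Updates [6] using derivatives [4].
            beta0 -= gamma * delta
            beta = [b - gamma * delta * zm for b, zm in zip(beta, z)]
            # Updates [7] using derivatives [5].
            alpha0 = [a0 - gamma * sm for a0, sm in zip(alpha0, s)]
            alpha = [[a - gamma * sm * xi for a, xi in zip(a_m, x)]
                     for a_m, sm in zip(alpha, s)]
        history.append(total)
    return (alpha0, alpha, beta0, beta), history

# Hypothetical toy data: one input, target tanh(2x) on a grid from -1 to 1.
data = [([i / 5.0], math.tanh(2.0 * i / 5.0)) for i in range(-5, 6)]
params, history = train(data)
print(history[0], history[-1])  # error in the first and last pass
```

Note that s is computed from the current β before β is updated, so both sets of derivatives refer to the same set of weights.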
In Predictive Modeling with SAS Enterprise Miner by Sarma, the following basic description of back propagation is given:
Specify an error function E.
1) 1st iteration - set initial weights and use them to evaluate E
2) 2nd iteration - change the weights by a small amount such that the error is reduced
- repeat until convergence
As Sarma explains, each iteration produces a set of weights, so if it takes 100 iterations to converge, 100 possible models are specified, giving 100 sets of weights. The best iteration can then be chosen by calculating E on the validation data.
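The idea of keeping the weights from every iteration and choosing among them with validation data can be sketched as follows. This is a toy illustration, not SAS Enterprise Miner's procedure: the model is a hypothetical one-weight linear model y = w*x rather than a neural network, and the training data, validation data and step size gamma are made up.

```python
def sse(w, data):
    """Error function E: sum of squared errors for the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in data)

# Hypothetical training and validation samples.
train_data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
valid_data = [(1.5, 1.4), (2.5, 2.4)]

w, gamma = 0.0, 0.01          # initial weight and step size
weights_by_iteration = []     # one candidate model per iteration
for _ in range(100):
    grad = sum(-2.0 * x * (y - w * x) for x, y in train_data)
    w -= gamma * grad         # small change that reduces the training error
    weights_by_iteration.append(w)

# Evaluate E on the validation data for every stored set of weights.
valid_errors = [sse(wi, valid_data) for wi in weights_by_iteration]
best_iteration = min(range(len(valid_errors)), key=valid_errors.__getitem__)
best_w = weights_by_iteration[best_iteration]
print(best_iteration, best_w)
```

In this toy example the iteration with the lowest validation error comes before the final one, which is the reason for keeping every set of weights rather than only the last.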
Gradient Descent
The following lecture from Dr. Ng's course in machine learning from Stanford covers gradient descent.
When I first sat through this lecture I wondered if it would really be useful. It turns out that understanding gradient descent is helpful for understanding backpropagation, which is used to train neural networks.
Based on the lecture notes, gradient descent can be described as follows:
Suppose we want to predict y with a function h(x) = Θ0 + Θ1 x1 + Θ2 x2 + ... = Θ^T x (or Xβ)
given a specified cost function: J(Θ) = (1/2) ∑ (h(x) - y)^2, or (1/2)e'e with e = h(x) - y,
we choose Θ to minimize J(Θ) using a search algorithm that repeatedly changes Θ to make J(Θ) smaller and smaller until it converges to a value of Θ that minimizes J(Θ).
Θ(i) := Θ(i) - α ∂J(Θ)/∂Θ(i) or β := β - α ∂((1/2)e'e)/∂β 'update or guessing function' applied to each parameter Θ(i)
Solving for the partial derivative or gradient term gives:
∂J(Θ)/∂Θ(i) = ∑ (h(x) - y) x(i) or X'e
and the update function becomes:
Θ(i) := Θ(i) + α ∑ (y - h(x)) x(i) or β := β - α X'e
The magnitude of each update in each iteration is a function of the error term and the learning rate 'α'.
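The update rule above can be tried on a small example. The sketch below is only an illustration under assumptions: a single predictor, four made-up data points that lie exactly on y = 1 + 2x, and a hypothetical learning rate and iteration count.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0  # initial guesses
    for _ in range(iterations):
        # Error term e = h(x) - y for every observation.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Gradient terms: sum of e for the intercept, sum of e * x for the slope.
        grad0 = sum(errors)
        grad1 = sum(e * x for e, x in zip(errors, xs))
        # Update both parameters using the learning rate alpha.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Made-up data generated from y = 1 + 2x with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # close to 1 and 2
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small only slows convergence, so in practice α is tuned to the scale of the data.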
Alternatively, gradient descent can be represented as follows:
Given a function F(), an initial guess X0 and the update function
X(n+1) = X(n) - α∇F(X(n))
we get a series of updates such that F(X0) ≥ F(X1) ≥ F(X2) ≥ F(X3) ≥ ...
with convergence, for a suitably small α, to a (local) minimum of F().
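The same iteration can be written for an arbitrary function. In this minimal sketch the function F(x) = (x - 3)^2 + 1, its gradient, the starting guess and the step size are all hypothetical choices made for illustration.

```python
def F(x):
    return (x - 3.0) ** 2 + 1.0  # minimized at x = 3

def grad_F(x):
    return 2.0 * (x - 3.0)

x, alpha = 0.0, 0.1   # initial guess X0 and step size
values = [F(x)]       # F(X0), F(X1), F(X2), ...
for _ in range(50):
    x = x - alpha * grad_F(x)  # X(n+1) = X(n) - alpha * grad F(X(n))
    values.append(F(x))

print(x, values[0], values[-1])
```

Each stored value of F is smaller than the one before it, and x moves toward the minimizer at 3.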
Sunday, June 5, 2011
Instrumental Variables and Selection Bias
In a previous post I noted Hal Varian and Andrew Gelman's discussion on instrumental variables, and the following specification for program or treatment T and instrument Z :
"You have to assume that the only way that z affects Y is through the treatment, T. So the IV model is
T = az + e
y = bT + d
It follows that
E(y|z) = b E(T|z) + E(d|z)
Now if we
1) assume E(d|z) = 0
2) verify that E(T|z) != 0
we can solve for b by division
i.e. b = E(y|z) / E(T|z)
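A small simulation can illustrate the division step. This is a sketch under stated assumptions, not part of the quoted discussion: the data-generating process, the true values a_true and b_true, the unobserved confounder u and the sample size are all made up, and the ratio is estimated with sample covariances, cov(y, z) / cov(T, z).

```python
import random

def cov(u, v):
    """Sample covariance of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

rng = random.Random(42)
n = 20000
a_true, b_true = 1.0, 2.0  # hypothetical true coefficients

z = [rng.gauss(0, 1) for _ in range(n)]  # instrument
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder, independent of z

# T = a*z + e, where e contains the confounder.
T = [a_true * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
# y = b*T + d, where d also contains the confounder, so d is related to T but not to z.
y = [b_true * Ti + 3.0 * ui + rng.gauss(0, 1) for Ti, ui in zip(T, u)]

b_ols = cov(y, T) / cov(T, T)  # naive regression slope, biased by the confounder
b_iv = cov(y, z) / cov(T, z)   # instrumental variables estimate

print(b_ols, b_iv)
```

Because d is unrelated to z by construction and a_true is nonzero, both conditions in the list above hold in this simulation, so the ratio recovers a value near b_true while the naive slope does not.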
In a recent paper, 'Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs', Pike, Hansen and Lin expand on the details, following the work of Angrist and Pischke. They describe selection bias for participation in first-year programs at four-year universities in the context of omitted variable bias.
Yi = α + βj Xij + p Di + ηi
where Xij may or may not be related to Di, which is program participation,
p = the unbiased effect of program participation
and ηi = γSi + vi, where Si is an omitted variable.
Because Si is related to program participation, leaving γSi in the error term causes the estimated effect of program participation to be overstated.
Pike, Hansen and Lin propose capturing the impact of selection bias using instrumental variables (Z) to ultimately measure the impact of program participation, p.
p = cov(Y, Z) / cov(D, Z) = Π11 / Π21
where the impact of program participation is derived from the ratio of the coefficients on Z in two regressions, Y on Z and D on Z:
Y = α1 + β1X + Π11Z + e1
D = α2 + β2X + Π21Z + e2
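The ratio of the two coefficients on Z can be computed on simulated data. The sketch below is an illustration under assumptions and does not reproduce the paper's analysis: the data are simulated, participation D is treated as continuous for simplicity, there is a single control X and a single instrument Z, and the coefficient on Z in each regression is obtained by first removing the part of each variable explained by X (the Frisch-Waugh approach). All names and parameter values are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def residualize(v, x):
    """Residuals from regressing v on a constant and x."""
    b = slope(v, x)
    a = mean(v) - b * mean(x)
    return [vi - a - b * xi for vi, xi in zip(v, x)]

def coef_on_z(outcome, x, z):
    """Coefficient on z in a regression of outcome on (1, x, z)."""
    return slope(residualize(outcome, x), residualize(z, x))

rng = random.Random(7)
n = 20000
p_true = 0.5  # hypothetical true program effect

X = [rng.gauss(0, 1) for _ in range(n)]            # observed control
S = [rng.gauss(0, 1) for _ in range(n)]            # unobserved selection factor
Z = [0.5 * xi + rng.gauss(0, 1) for xi in X]       # instrument, unrelated to S
D = [0.8 * zi + 0.5 * xi + si + rng.gauss(0, 1)
     for zi, xi, si in zip(Z, X, S)]               # participation
Y = [1.0 + 0.7 * xi + p_true * di + 2.0 * si + rng.gauss(0, 1)
     for xi, di, si in zip(X, D, S)]               # outcome

pi11 = coef_on_z(Y, X, Z)   # coefficient on Z in the Y regression
pi21 = coef_on_z(D, X, Z)   # coefficient on Z in the D regression
p_iv = pi11 / pi21

p_naive = coef_on_z(Y, X, D)  # regression of Y on X and D, ignoring selection
print(p_naive, p_iv)
```

In this simulation the naive coefficient on D is well above p_true because S raises both participation and the outcome, while the ratio of the two coefficients on Z is close to p_true.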
As they explain in their paper, the ratio for p as it is defined above is useful in thinking about the consequences of the two major assumptions of IV analysis.
1) Z should be strongly correlated with D. If the correlation is weak, then the denominator will be small, and p will be overstated.
2) Z should be unrelated to Y except through D, that is, uncorrelated with the error 'e'. If Z is strongly correlated with the error, then the numerator will be large, and p will overstate program effects.
In the paper, they correct for the impact of selection bias using two instruments (participation in a summer bridge program and having decided on a major prior to enrollment). In an ordinary regression, they find that even after including various other controls, there is a positive, significant relationship between first-year programs and student success (measured by GPA). However, once the instruments are used to correct for selection bias, this relationship goes away.
Instrumental variable techniques are a valuable tool that all policy analysts and researchers should have in their quantitative toolbox. As stated in the paper:
"If, as the results of this study suggest, traditional evaluation methods can overstate (either positively or negatively) the magnitude of program effects in the face of self selection, then evaluation research may be providing decision makers with inaccurate information. In addition to providing an incomplete accounting for external audiences, inaccurate information about program effectiveness can lead to the misallocation of scarce institutional resources."
References:
Angrist and Pischke. Mostly Harmless Econometrics. 2009.
Gary R. Pike, Michele J. Hansen and Ching-Hui Lin. Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs. Research in Higher Education, Volume 52, Number 2, 194-214. DOI: 10.1007/s11162-010-9188-x