Monday, October 31, 2011

Deriving Heteroskedasticity Consistent Standard Errors

Heteroskedasticity leads to inaccurate standard errors in linear regression, but this can be corrected using robust, or heteroskedasticity-consistent (HC), standard errors.

Recall the matrix form of the regression equation:

y = XB + ϵ

with the estimator b = (X'X)^{-1}X'y. Substituting for y we get:

b = (X'X)^{-1}X'(XB + ϵ)
  = (X'X)^{-1}[X'XB + X'ϵ]
  = (X'X)^{-1}X'XB + (X'X)^{-1}X'ϵ
  = B + (X'X)^{-1}X'ϵ

VAR(b) = E[(b − B)(b − B)']

From the above we know b = B + (X'X)^{-1}X'ϵ, so

(b − B) = B + (X'X)^{-1}X'ϵ − B
        = (X'X)^{-1}X'ϵ

Therefore, treating X as fixed (i.e., conditioning on X):

VAR(b) = E[(b − B)(b − B)'] = E[((X'X)^{-1}X'ϵ)((X'X)^{-1}X'ϵ)']
       = E[(X'X)^{-1}X'ϵϵ'X(X'X)^{-1}]
       = (X'X)^{-1}X'E[ϵϵ']X(X'X)^{-1}

Let E[ϵϵ'] = Φ. Then we have VAR(b) = (X'X)^{-1}X'ΦX(X'X)^{-1}   'general form'

If Φ = σ²I then VAR(b) = (X'X)^{-1}X'σ²IX(X'X)^{-1} = σ²(X'X)^{-1}   'case of homoskedasticity'

For heteroskedasticity-corrected standard errors, Φ is replaced by an estimate built from the squared OLS residuals:

VAR(b) = N/(N − k) (X'X)^{-1}X'Φ̂X(X'X)^{-1}   where Φ̂ = diag(e_i²)

The N/(N − k) term is a small-sample degrees-of-freedom correction; this is the estimator Long and Ervin (2000) label HC1.
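To verify the formula numerically, the sandwich estimator above can be computed directly in R. Below is a minimal sketch using simulated data (the data and variable names are illustrative, not from any cited source); the commented line shows how the result can be checked against vcovHC() in the sandwich package, which implements this correction as type "HC1".

# minimal sketch: HC1 robust standard errors computed from the sandwich formula
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n, sd = abs(x))  # simulated heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)                # design matrix
e <- residuals(fit)                   # OLS residuals e_i
k <- ncol(X)

XtX.inv <- solve(t(X) %*% X)          # (X'X)^{-1}
meat <- t(X) %*% diag(e^2) %*% X      # X' diag(e_i^2) X
V.hc1 <- (n/(n - k)) * XtX.inv %*% meat %*% XtX.inv

sqrt(diag(V.hc1))                     # robust (HC1) standard errors
sqrt(diag(vcov(fit)))                 # conventional sigma^2 (X'X)^{-1} errors, for comparison
# library(sandwich); sqrt(diag(vcovHC(fit, type = "HC1")))  # should match V.hc1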


References:
Greene. Econometric Analysis. 1990.
Long, J. Scott and Laurie H. Ervin. "Using Heteroskedasticity Consistent Standard Errors in the Linear Regression Model." The American Statistician, Vol. 54, No. 3 (Aug 2000), pp. 217-224.

Chi-Square Test of Independence (Association)

I watched this a while back; it's not a bad step-by-step overview of chi-square tests, and it's great for thinking about decision tree logic.


Saturday, September 17, 2011

Bayesian Models with Censored Data: A comparison of OLS, tobit and bayesian models

The following R code models a censored dependent variable (in this case academic aptitude) using traditional least squares, tobit, and Bayesian approaches. As depicted below, the OLS estimates (blue) for censored data are inconsistent and will not approach the true population parameters (green).

A tobit model, estimated via maximum likelihood, will provide better results.

For observations censored at the limit 'k', a cumulative distribution function value [F(k)] is specified in place of the density, and the likelihood is a mixture of products of density terms [f(y)] for uncensored observations and cumulative distribution terms [F(k)] for censored ones.

Roughly: L = Π [f(y_i)]^{1−d_i} [F(k)]^{d_i} where d_i = 1 if censored and 0 otherwise (for upper censoring, as in the example below, F(k) is replaced by 1 − F(k)).
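To make the likelihood concrete, here is a minimal hand-rolled sketch of the tobit negative log-likelihood for upper censoring at k, for illustration only (the function and variable names are my own; the actual estimation below uses vglm() from VGAM):

# minimal sketch: tobit log-likelihood with upper censoring at k
# uncensored observations contribute the normal density f(y);
# censored observations contribute P(y* >= k) = 1 - F(k)
tobit.negll <- function(par, y, X, k) {
  beta  <- par[1:ncol(X)]
  sigma <- exp(par[ncol(X) + 1])   # log-parameterized so sigma > 0
  mu    <- as.vector(X %*% beta)
  cens  <- (y >= k)                # d = 1 if censored at the upper limit
  ll <- sum(dnorm(y[!cens], mu[!cens], sigma, log = TRUE)) +
        sum(pnorm((k - mu[cens])/sigma, lower.tail = FALSE, log.p = TRUE))
  -ll                              # optim() minimizes, so return the negative
}

# usage sketch (after mydata is loaded as in the code at the end of this post):
# X <- model.matrix(~ read + math + as.factor(prog), data = mydata)
# optim(c(rep(0, ncol(X)), log(sd(mydata$apt))), tobit.negll,
#       y = mydata$apt, X = X, k = 800, method = "BFGS")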

[Figure: OLS estimates (blue) for the censored data versus the true population parameters (green).]
Bayesian methods for censored dependent variables are also available, and may provide better estimates than traditional tobit specifications in the face of heteroskedasticity (which, in addition to producing inaccurate standard errors in OLS, leads to inconsistent coefficient estimates in the tobit model).

Below is the output of a regression modeling academic aptitude (using data from the UCLA statistical computing examples; see the references in the R code documentation that follows) as a function of reading and math scores, as well as program participation:

Call:
lm(formula = mydata$apt ~ mydata$read + mydata$math + as.factor(mydata$prog))

Residuals:
     Min       1Q   Median       3Q      Max
-161.463  -42.474   -0.707   43.180  181.554

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)   
(Intercept)                       242.735     30.140   8.054 7.80e-14 ***
mydata$read                         2.553      0.583   4.379 1.95e-05 ***
mydata$math                         5.383      0.659   8.169 3.84e-14 ***
as.factor(mydata$prog)general     -13.741     11.744  -1.170 0.243423   
as.factor(mydata$prog)vocational  -48.835     12.982  -3.762 0.000223 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 62.38 on 195 degrees of freedom
Multiple R-squared: 0.6127,    Adjusted R-squared: 0.6048
F-statistic: 77.13 on 4 and 195 DF,  p-value: < 2.2e-16

Below is output from a tobit model specified in R:

Call:
vglm(formula = mydata$apt ~ mydata$read + mydata$math + as.factor(mydata$prog),
    family = tobit(Upper = 800))

Pearson Residuals:
            Min       1Q    Median      3Q    Max
mu      -3.2606 -0.69522  0.049445 0.82743 2.7935
log(sd) -1.1309 -0.61020 -0.310926 0.21836 4.8277

Coefficients:
                                    Value Std. Error t value
(Intercept):1                    209.5488  32.642056  6.4196
(Intercept):2                      4.1848   0.049756 84.1071
mydata$read                        2.6980   0.619577  4.3546
mydata$math                        5.9148   0.706472  8.3723
as.factor(mydata$prog)general    -12.7145  12.409533 -1.0246
as.factor(mydata$prog)vocational -46.1431  13.707785 -3.3662

Number of linear predictors:  2

Names of linear predictors: mu, log(sd)

Dispersion Parameter for tobit family:   1

Log-likelihood: -872.8971 on 394 degrees of freedom

Number of Iterations: 8

Notice the coefficients from the tobit model are larger than those from OLS, indicating the downward bias of the coefficients resulting from OLS regression on a censored dependent variable. Below is the output from a Bayesian model, based on the specifications outlined for the MCMCtobit function in the MCMCpack documentation. These results are very similar to those obtained from the tobit model above.

Iterations = 1001:31000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 30000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                                     Mean       SD Naive SE Time-series SE
(Intercept)                       208.635  33.3365 0.192468       0.213581
mydata$read                         2.702   0.6304 0.003640       0.003506
mydata$math                         5.930   0.7242 0.004181       0.004368
as.factor(mydata$prog)general     -12.733  12.6756 0.073183       0.082116
as.factor(mydata$prog)vocational  -46.145  13.9556 0.080573       0.085942
sigma2                           4486.277 488.5388 2.820580       2.962118

2. Quantiles for each variable:

                                     2.5%      25%      50%      75%    97.5%
(Intercept)                       142.260  186.381  208.659  231.153  273.779
mydata$read                         1.482    2.273    2.701    3.120    3.949
mydata$math                         4.522    5.438    5.924    6.417    7.357
as.factor(mydata$prog)general     -37.558  -21.261  -12.733   -4.175   12.300
as.factor(mydata$prog)vocational  -73.634  -55.441  -46.044  -36.719  -18.890
sigma2                           3635.560 4140.883 4450.761 4795.448 5536.943

R code:

#   ------------------------------------------------------------------
#  | PROGRAM NAME: ex_bayesian_tobit
#  | DATE: 9/17/11
#  | CREATED BY: Matt Bogard
#  | PROJECT FILE: www.econometricsense.blogspot.com
#  |----------------------------------------------------------------
#  | PURPOSE: comparison of models for censored dependent variables
#  | 1 - least squares
#  | 2 - tobit model
#  | 3 - Bayesian model
#  |------------------------------------------------------------------
#  | REFERENCES:
#  | UCLA Statistical Computing: http://www.ats.ucla.edu/stat/R/dae/tobit.htm
#  | R package 'MCMCpack' documentation: http://mcmcpack.wustl.edu/documentation.html
#  |
#  | Literature:
#  |   Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park. 2011. "MCMCpack: Markov Chain
#  |   Monte Carlo in R." Journal of Statistical Software. 42(9): 1-21.
#  |   http://www.jstatsoft.org/v42/i09/.
#  |
#  |   Daniel Pemstein, Kevin M. Quinn, and Andrew D. Martin. 2007. Scythe Statistical
#  |   Library 1.0. http://scythe.wustl.edu.
#  |
#  |   Martyn Plummer, Nicky Best, Kate Cowles, and Karen Vines. 2002. Output Analysis
#  |   and Diagnostics for MCMC (CODA). http://www-fis.iarc.fr/coda/.
#  |
#  |   Siddhartha Chib. 1992. "Bayes inference in the Tobit censored regression model."
#  |   Journal of Econometrics. 51:79-99.
#   ------------------------------------------------------------------

# example tobit model

# get data
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/tobit.csv"))

# explore data set
names(mydata)    # list variable names
dim(mydata)      # data dimensions
hist(mydata$apt) # histogram of the dependent variable (academic aptitude);
                 # indicates right (upper-bound) censoring at y = 800

# run model using standard OLS regression
ols <- lm(mydata$apt ~ mydata$read + mydata$math + as.factor(mydata$prog))
summary(ols)

# tobit model
library(VGAM) # load package
tobit <- vglm(mydata$apt ~ mydata$read + mydata$math + as.factor(mydata$prog), tobit(Upper = 800))
summary(tobit)

# note the coefficients for the tobit model are larger, indicating the downward bias
# of the OLS estimates

# Bayesian model
library(MCMCpack) # load package
bayes.tobit <- MCMCtobit(mydata$apt ~ mydata$read + mydata$math + as.factor(mydata$prog),
                         above = 800, mcmc = 30000, verbose = 1000)
summary(bayes.tobit)
plot(bayes.tobit)

# the empirical (posterior mean) estimates look very similar to the tobit estimates

Elements of Bayesian Econometrics


 posterior = (likelihood x prior) / integrated likelihood

The combination of a prior distribution and a likelihood function is used to produce a posterior distribution. Incorporating information from both the prior distribution and the likelihood function can reduce the variance of the resulting estimator relative to one based on the likelihood alone.

As n→ ∞ the likelihood centers over the true β. As a result, with more data the role of the likelihood function becomes more important in deriving the posterior distribution.

P(θ|y) = P(y|θ)P(θ) / P(y) = P(y|θ)P(θ) / ∫ P(y|θ)P(θ) dθ

P(θ|y) = posterior distribution

P(y|θ) = likelihood function L(θ)

P(θ) = prior distribution

P(y) = ∫ P(y|θ)P(θ) dθ = the integrated likelihood, a normalizing or proportionality factor

P(θ|y) ∝ P(y|θ)P(θ) --> the posterior distribution is proportional to the likelihood times the prior, so it combines (weights) the information in each
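As a concrete illustration of this proportionality, here is a minimal sketch in R using a conjugate beta-binomial example (the numbers are made up for illustration): the unnormalized product of prior and likelihood, once rescaled by the integrated likelihood, matches the known analytic posterior.

# minimal sketch: posterior proportional to likelihood x prior (beta-binomial example)
theta <- seq(0.001, 0.999, length.out = 999)      # grid over the parameter
prior <- dbeta(theta, 2, 2)                       # Beta(2,2) prior
lik   <- dbinom(7, 10, theta)                     # likelihood: 7 successes in 10 trials
post  <- prior * lik                              # unnormalized posterior
post  <- post / sum(post * (theta[2] - theta[1])) # normalize (the integrated likelihood step)

# compare with the analytic conjugate posterior, Beta(2 + 7, 2 + 3):
plot(theta, post, type = "l")
lines(theta, dbeta(theta, 9, 5), lty = 2)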

Deriving the posterior distribution may be an algebraic exercise, but summarizing it is more difficult, often requiring numerical methods. We are frequently interested in the posterior expected value of θ:

E(θ|y) = ∫ θ f(θ|y) dθ

This may be approximated numerically using a sequence of random draws θ^(1), θ^(2), …, θ^(G):

E(θ|y) ≈ (1/G) Σ θ^(g)

This can be implemented via Markov Chain Monte Carlo (MCMC) methods. In such a sequence of draws θ^(1), θ^(2), …, θ^(G), each draw θ^(g+1) depends solely on the previous draw θ^(g), forming a Markov chain which converges to the target (posterior) density.
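Continuing the beta-binomial example above, here is a minimal sketch of the Monte Carlo average (independent draws from the known Beta posterior stand in for MCMC output):

# minimal sketch: approximating E(theta|y) by averaging G posterior draws
set.seed(123)
draws <- rbeta(10000, 9, 5)  # G = 10000 draws from the Beta(9,5) posterior above
mean(draws)                  # Monte Carlo estimate of the posterior mean
9/(9 + 5)                    # analytic posterior mean, for comparison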

LOSS FUNCTIONS

Define l(θ̂, θ) as the loss from the decision to choose θ̂ as the estimate of θ.

Ex: squared error loss l(θ̂, θ) = (θ̂ − θ)²

Choose θ̂ to minimize the posterior expected loss E[l(θ̂, θ)|y] = ∫ l(θ̂, θ) p(θ|y) dθ. Under squared error loss the minimizer is the posterior mean E(θ|y); under absolute error loss it is the posterior median.
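A minimal sketch using the posterior draws from the example above: the sample mean and median of the draws approximate the optimal point estimates under squared and absolute error loss, and the posterior expected squared-error loss is indeed smallest at the mean.

# minimal sketch: point estimates under different loss functions, from posterior draws
mean(draws)    # minimizes posterior expected squared-error loss
median(draws)  # minimizes posterior expected absolute-error loss

# check: posterior expected squared-error loss as a function of the estimate
sq.loss <- function(est) mean((est - draws)^2)
sq.loss(mean(draws))    # smaller
sq.loss(median(draws))  # slightly larger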


REGRESSION ANALOGY

Recall that if we specify a normal likelihood, y ~ N(Xβ, σ²I), then the maximum likelihood estimator coincides with least squares:

β_OLS = β_MLE = (X'X)^{-1}X'y

If we specify a prior distribution for β as P(β) ∝ exp[−(β − m)'V^{-1}(β − m)/2],

i.e., β ~ N(m, V),

and

σ² ~ Inverse-Gamma(ν/2, ς/2),


Given the specified priors and likelihood, we can express the posterior distribution as:

p(β, σ²|y) ∝ p(y|β, σ²) p(β, σ²)

Bayesian models can be implemented using the package MCMCpack in R.

An example given by Martin (one of the developers of the package) involves modeling murder as a function of unemployment. 

Under the linear regression model we get the following results:


Call:
lm(formula = murder ~ unemp, data = murder)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5386 -2.6677 -0.6324  2.3935  8.9443

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.5509     2.4810  -0.222  0.82521  
unemp         1.4204     0.4535   3.132  0.00296 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.599 on 48 degrees of freedom
Multiple R-squared: 0.1697,    Adjusted R-squared: 0.1524
F-statistic: 9.809 on 1 and 48 DF,  p-value: 0.002957

Using Bayesian methods with standard (default) priors, we get very similar results:


Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean     SD Naive SE Time-series SE
(Intercept) -0.5278 2.5374 0.025374       0.023700
unemp        1.4158 0.4648 0.004648       0.004426
sigma2      13.5547 2.9186 0.029186       0.035616

2. Quantiles for each variable:

               2.5%    25%    50%    75%  97.5%
(Intercept) -5.5103 -2.189 -0.493  1.147  4.402
unemp        0.4957  1.112  1.412  1.719  2.329
sigma2       9.0343 11.496 13.151 15.199 20.355

Bayesian methods with informative priors (as specified in the earlier discussion above) give the following results:


Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

              Mean     SD Naive SE Time-series SE
(Intercept) -0.317 0.9147 0.009147       0.008721
unemp        1.392 0.1885 0.001885       0.001908
sigma2      13.306 2.8145 0.028145       0.033750

2. Quantiles for each variable:

              2.5%    25%     50%     75%  97.5%
(Intercept) -2.123 -0.925 -0.3036  0.2991  1.485
unemp        1.026  1.266  1.3907  1.5147  1.769
sigma2       8.916 11.296 12.9415 14.9085 19.802


References:

Andrew D. Martin. "Bayesian Analysis." Prepared for The Oxford Handbook of Political Methodology.

Andrew D. Martin. "Bayesian Inference and Computation in Political Science." Slides from a talk given to the Department of Politics, Nuffield College, Oxford University, March 9, 2004 (slides and example R code linked in the program header below).

An Introduction to Modern Bayesian Econometrics. Tony Lancaster. 2002. http://www.brown.edu/Courses/EC0264/book.pdf
 
R code:



#   ------------------------------------------------------------------
#  | PROGRAM NAME: EX_BAYESIAN_ECONOMETRICS
#  | DATE: 9-15-11
#  | CREATED BY: MATT BOGARD
#  | PROJECT FILE: WWW.ECONOMETRICSENSE.BLOGSPOT.COM
#  |----------------------------------------------------------------
#  | ADAPTED FROM: Andrew D. Martin. "Bayesian Inference and Computation in Political
#  | Science." Slides from a talk given to the Department of Politics, Nuffield College,
#  | Oxford University, March 9, 2004.
#  | SLIDES: http://adm.wustl.edu/media/talks/bayesslides.pdf
#  | R-CODE: http://adm.wustl.edu/media/talks/examples.zip
#  |------------------------------------------------------------------
 
 
setwd('/Users/wkuuser/Desktop/Briefcase/R Data Sets') 
 
library(MCMCpack)
murder <- read.table("murder.txt", header=TRUE)
names(murder)
dim(murder)
 
# estimation using OLS
lm(murder ~ unemp, data=murder)
summary(lm(murder ~ unemp, data=murder))
 
# posterior with standard priors
post1 <- MCMCregress(murder ~ unemp, data=murder)
print(summary(post1))
 
# posterior with informative priors
m <- matrix(c(0,3),2,1)      # prior means for (intercept, unemp)
V <- matrix(c(1,0,0,1),2,2)  # prior precision matrix (passed as B0 in MCMCregress)
post2 <- MCMCregress(murder ~ unemp, b0=m, B0=V, data=murder)
print(summary(post2))

Wednesday, August 31, 2011

Linear Regression and Analysis of Variance with a Binary Dependent Variable

(See also my posts related to Logistic Regression.)

If, for instance, Y is dichotomous or binary, Y = {1 if 'yes', 0 if 'no'}, would you consider it valid to do an analysis of variance or fit a linear regression model?

We might not think so based on traditional assumptions, because besides assuming that Y is continuous:

1) ANOVA and linear regression both work under the assumption of a constant-variance (homoskedastic) error term 'e'.

2) For a dichotomous Y, the expected value E(Y|X) is a 'probability' and may be continuous, but the errors follow a binomial distribution with mean 'p' and a variance that is a function of the mean, which is inherently heteroskedastic: var(e_i) = p_i(1 − p_i).

3) Therefore the assumption of uniform variance is violated, so the F test and other tests based on standard errors derived under this assumption are questionable.

Some people will argue that violations of this assumption may only matter by degree, and under certain conditions ANOVA and linear regression using least squares is OK with a binary dependent variable (Lunney 1971, D'Agostino 1971,Astin & Dey 1993, Angrist & Pischke 2008).

I recall from mathematical statistics and econometrics the lectures (and test questions) related to properties of estimators, including efficiency, consistency, unbiasedness, etc. We also spoke some about robustness, but in theoretical work it's not as easy to 'prove' robustness as it is to show that a certain estimator is unbiased or consistent. For that reason, I think robustness to assumptions often isn't given a lot of credence by students or practitioners. (However, Angrist and Pischke, in their book 'Mostly Harmless Econometrics', spend a lot of time discussing ideas related to the robustness of the least squares estimator.) Robustness to assumptions related to the distribution of the error terms (under OLS / ANOVA) is discussed in the following:

LITERATURE RELATED TO REGRESSION AND ANOVA  WITH A DICHOTOMOUS DEPENDENT VARIABLE

ANALYSIS OF VARIANCE CONTEXT--------------------------------------------

A Second Look at Analysis of Variance on Dichotomous Data
Author(s): Ralph B. D'Agostino
Source: Journal of Educational Measurement, Vol. 8, No. 4 (Winter, 1971), pp. 327-333

“probably a safe rule of thumb for deciding when the I x J ANOVA techniques may be used on dichotomous data with equal sample sizes in each cell is; the sample proportions for the cells should lie between .25 and .75 and there should be at least 20 degrees of freedom for error. This rule combines Lunney's results along with standard rules (Snedecor & Cochran, 1967, p. 494). The reason for the .25 and .75 lies in the fact that for this range there is little change between the within cell variances, p(1 - p), and so there is a sufficient homogeneity of variances. Given that this rule is satisfied a standard procedure for analysis is the ANOVA on the original data. In this range (.25 to .75) it is doubtful if any alternative valid procedure, such as an analysis of transformed data, would lead to different conclusions”

Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study
Author(s): Gerald H. Lunney
Source: Journal of Educational Measurement, Vol. 7, No. 4 (Winter, 1970), pp. 263-269

“The findings show the analysis of variance to be an appropriate statistical technique for analyzing dichotomous data in fixed effects models where cell frequencies are equal under the following conditions: (a) the proportion of responses in the smaller response category is equal to or greater than .2 and there are at least 20 degrees of freedom for error, or (b) the proportion of responses in the smaller response category is less than .2 and there are at least 40 degrees of freedom for error.”

REGRESSION CONTEXT---------------------------------------------------

Dey ,Eric L. and Alexander W. Astin. Statistical Alternatives For Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5. 1993. link http://www.jstor.org/stable/40196112  

"These results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression. While this may not always be the case, these and other analyses show that for variables that are moderately distributed (say, within a .75/. 25 split; for example, see Cleary and Angel, 1984) there is little practical difference in obtained results upon which to make a decision about one technique or another, especially in large samples."

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

BOTH CONTEXTS---------------------------------------------

The Analysis of Relationships Involving Dichotomous Dependent Variables
Author(s): Paul D. Cleary and Ronald Angel
Source: Journal of Health and Social Behavior, Vol. 25, No. 3 (Sep., 1984), pp. 334-348

"If the researcher wishes to estimate the probability of an outcome as a linear function and, A. If the sample size is moderately large and the dependent variable is not too skewed (.25 < p < .75), then OLS regression or ordinary ANOVA is adequate."

The Asymptotic Bias of OLS With Censored Dependent Variables


An interesting result from Greene that I was not previously aware of:

"We find that for the case of normally distributed regressors, the slopes in the linear regression of y on x, which are consistently estimated by ordinary least squares, are proportional to the corresponding slopes of the underlying regression relating y* and x. In this setting, the bias can be 'corrected' simply by dividing each OLS slope by the proportion of nonlimit observations in the sample. The other structural parameters can be consistently estimated in similar fashion." - Greene, "On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model." Econometrica, Vol. 49, No. 2 (Mar., 1981), pp. 505-513.
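A minimal simulation sketch of this result (my own illustrative example): with a normally distributed regressor and censoring from below at zero, dividing the OLS slope by the fraction of nonlimit observations approximately recovers the true slope.

# minimal sketch: Greene's (1981) proportionality result for OLS on censored data
set.seed(1)
n <- 100000
x <- rnorm(n)                   # normally distributed regressor (the key condition)
ystar <- 1 + 2*x + rnorm(n)     # latent outcome y*, true slope = 2
y <- pmax(ystar, 0)             # observed outcome, censored from below at 0

b.ols <- coef(lm(y ~ x))["x"]   # attenuated OLS slope
p.nonlimit <- mean(y > 0)       # proportion of nonlimit observations
b.ols / p.nonlimit              # approximately recovers the true slope of 2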


Wednesday, August 17, 2011

Basic Concepts: Defining The Mean and Variance

"E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X. So the mean of X is a measure
of where the values of the random variable X are" centered…. The variance of X will be a measure of the spread or dispersion of the density of X
" ---Mood, Alexander McFar1ane, INTRODUCTION TO THE THEORY OF STATISTICS  1963.

“The variance of a distribution provides a measure of the spread or dispersion of the distribution about its mean. A small value of the variance indicates that the probability distribution is tightly concentrated around the mean (μ); and a large value of the variance typically indicates that the probability distribution has a wide spread about the mean (μ)….the variance of X is the second central moment of X” – DeGroot & Schervish. Probability and Statistics. 2002.
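To make these definitions concrete: E[X] = μ = ∫ x f(x) dx, and Var(X) = E[(X − μ)²] = E[X²] − μ², the second central moment. A minimal R sketch verifying both moments for a standard normal density:

# minimal sketch: mean and variance as moments of a density (standard normal example)
integrate(function(x) x * dnorm(x), -Inf, Inf)$value          # E[X] = 0
integrate(function(x) (x - 0)^2 * dnorm(x), -Inf, Inf)$value  # Var(X) = 1 (second central moment)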