## Wednesday, August 31, 2011

### Linear Regression and Analysis of Variance with a Binary Dependent Variable

If, for instance, Y is dichotomous or binary (Y = 1 if ‘yes’, 0 if ‘no’), would you consider it valid to do an analysis of variance or fit a linear regression model?

We might not think so based on traditional assumptions, because, besides assuming that Y is continuous:

1) ANOVA and linear regression both work under the assumption of a uniform (homoskedastic) error term ‘e’.

2) For a dichotomous y, the expected value E(y|X) is a probability ‘p’ and may be continuous, but each observation follows a Bernoulli distribution (a binomial with n = 1) with mean ‘p’ and a variance that is a function of the mean, so the error term is inherently heteroskedastic: var(e|X) = p*(1-p).

3) Therefore the assumption of uniform variance is violated, so the F test and other tests that rely on standard errors computed under this assumption are questionable.
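
The variance in point 2 can be checked directly. The sketch below is only an illustration: for a single 0/1 outcome with P(y = 1) = p, the error e = y - p takes just two values, and its mean and variance follow from that two-point distribution. The function name `error_moments` is made up for this example.

```python
def error_moments(p):
    """Mean and variance of e = y - p when y is 0/1 with P(y = 1) = p."""
    # e equals (1 - p) with probability p, and (-p) with probability (1 - p).
    outcomes = [(1.0 - p, p), (-p, 1.0 - p)]
    mean = sum(value * prob for value, prob in outcomes)
    variance = sum((value - mean) ** 2 * prob for value, prob in outcomes)
    return mean, variance


for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    mean, variance = error_moments(p)
    # The last column is the closed form p(1 - p) for comparison.
    print(f"p = {p:.2f}  var(e) = {variance:.4f}  p(1-p) = {p * (1 - p):.4f}")
```

The error has mean zero, and its variance matches p(1-p) for every p, so it changes whenever the predicted probability changes across observations.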

Some people argue that violations of this assumption matter only by degree, and that under certain conditions ANOVA and least squares linear regression are acceptable with a binary dependent variable (Lunney 1970, D'Agostino 1971, Dey & Astin 1993, Angrist & Pischke 2008).

I recall from mathematical statistics and econometrics the lectures (and test questions) on properties of estimators, including efficiency, consistency, and unbiasedness. We also spoke some about robustness, but in theoretical work it's not as easy to 'prove' robustness as it is to show that a certain estimator is unbiased or consistent. For this reason, I think robustness to assumptions often isn't given much credence by students or practitioners. (However, Angrist and Pischke, in their book 'Mostly Harmless Econometrics', spend a lot of time discussing ideas related to the robustness of the least squares estimator.) Robustness to assumptions about the distribution of the error terms under OLS / ANOVA is discussed in the following:

LITERATURE RELATED TO REGRESSION AND ANOVA  WITH A DICHOTOMOUS DEPENDENT VARIABLE

ANALYSIS OF VARIANCE CONTEXT--------------------------------------------

A Second Look at Analysis of Variance on Dichotomous Data
Author(s): Ralph B. D'Agostino
Source: Journal of Educational Measurement, Vol. 8, No. 4 (Winter, 1971), pp. 327-333

“probably a safe rule of thumb for deciding when the I x J ANOVA techniques may be used on dichotomous data with equal sample sizes in each cell is; the sample proportions for the cells should lie between .25 and .75 and there should be at least 20 degrees of freedom for error. This rule combines Lunney's results along with standard rules (Snedecor & Cochran, 1967, p. 494). The reason for the .25 and .75 lies in the fact that for this range there is little change between the within cell variances, p(1 - p), and so there is a sufficient homogeneity of variances. Given that this rule is satisfied a standard procedure for analysis is the ANOVA on the original data. In this range (.25 to .75) it is doubtful if any alternative valid procedure, such as an analysis of transformed data, would lead to different conclusions”
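
D'Agostino's point about the .25 to .75 range can be seen with a little arithmetic. The sketch below (the helper `variance_ratio` is a hypothetical name, not from the paper) compares the largest and smallest within-cell variance p(1 - p) that can occur when all cell proportions lie in a given range.

```python
def variance_ratio(p_low, p_high):
    """Largest within-cell variance p(1 - p) divided by the smallest, for p in [p_low, p_high]."""
    # p(1 - p) is concave, so its minimum over an interval is at an endpoint.
    candidates = [p_low, p_high]
    if p_low <= 0.5 <= p_high:
        candidates.append(0.5)  # p(1 - p) peaks at p = 0.5
    variances = [p * (1.0 - p) for p in candidates]
    return max(variances) / min(variances)


for low, high in [(0.25, 0.75), (0.10, 0.90), (0.05, 0.95)]:
    print(f"p in [{low:.2f}, {high:.2f}]: max/min variance = {variance_ratio(low, high):.2f}")
```

Within .25 to .75 the largest cell variance (0.25) is only about a third larger than the smallest (0.1875), while widening the range to .05 to .95 makes the largest more than five times the smallest.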

Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study
Author(s): Gerald H. Lunney
Source: Journal of Educational Measurement, Vol. 7, No. 4 (Winter, 1970), pp. 263-269

“The findings show the analysis of variance to be an appropriate statistical technique for analyzing dichotomous data in fixed effects models where cell frequencies are equal under the following conditions: (a) the proportion of responses in the smaller response category is equal to or greater than .2 and there are at least 20 degrees of freedom for error, or (b) the proportion of responses in the smaller response category is less than .2 and there are at least 40 degrees of freedom for error.”
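
As a reading aid, Lunney's two conditions can be written as a small check. This is a sketch of the quoted rule of thumb only: the function name and arguments are invented here, and it assumes the other conditions in the quote (a fixed effects model with equal cell frequencies) already hold.

```python
def lunney_anova_ok(p_smaller, df_error):
    """Rule of thumb from the quote above.

    p_smaller: proportion of responses in the smaller response category.
    df_error:  degrees of freedom for error.
    Assumes a fixed effects model with equal cell frequencies.
    """
    if p_smaller >= 0.2:
        return df_error >= 20  # condition (a)
    return df_error >= 40      # condition (b)


print(lunney_anova_ok(0.30, 24))  # condition (a) applies
print(lunney_anova_ok(0.10, 24))  # condition (b) applies
```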

REGRESSION CONTEXT---------------------------------------------------

Dey, Eric L. and Alexander W. Astin. Statistical Alternatives For Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5. 1993. link http://www.jstor.org/stable/40196112

"These results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression. While this may not always be the case, these and other analyses show that for variables that are moderately distributed (say, within a .75/.25 split; for example, see Cleary and Angel, 1984) there is little practical difference in obtained results upon which to make a decision about one technique or another, especially in large samples."

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

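The marginal-effects claim can be illustrated with a small simulation. The sketch below is not taken from any of the cited sources: the data are simulated from a logit model with made-up parameters (intercept 0, slope 0.5, standard normal x, so most probabilities fall in a moderate range), the logit is fit with a hand-rolled Newton's method rather than a statistics library, and all function names are hypothetical. It compares the least squares slope with the logit average marginal effect.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, b0, b1, seed):
    """Draw x ~ N(0, 1) and a 0/1 outcome y with P(y = 1 | x) = logistic(b0 + b1 * x)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < logistic(b0 + b1 * x) else 0 for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Least squares slope of y on x (the linear probability model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def fit_logit(xs, ys, iters=25):
    """Maximum likelihood logit with intercept a and slope b, via Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = logistic(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (X'WX) * delta = score for the 2x2 system.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def compare(n=5000, b0=0.0, b1=0.5, seed=0):
    """Return (OLS slope, logit average marginal effect) on one simulated sample."""
    xs, ys = simulate(n, b0, b1, seed)
    a, b = fit_logit(xs, ys)
    # Average marginal effect: mean over the sample of b * p * (1 - p).
    ame = sum(b * logistic(a + b * x) * (1.0 - logistic(a + b * x)) for x in xs) / n
    return ols_slope(xs, ys), ame


if __name__ == "__main__":
    slope, ame = compare()
    print(f"OLS slope: {slope:.4f}   logit average marginal effect: {ame:.4f}")
```

In a setup like this the two numbers should be close, which is the sense in which the choice between the linear and nonlinear model "matters little" for marginal effects. Pushing the intercept far from zero, so that probabilities pile up near 0 or 1, is a simple way to see where the rules of thumb above start to matter.
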
BOTH CONTEXTS---------------------------------------------

The Analysis of Relationships Involving Dichotomous Dependent Variables
Author(s): Paul D. Cleary and Ronald Angel
Source: Journal of Health and Social Behavior, Vol. 25, No. 3 (Sep., 1984), pp. 334-348

"If the researcher wishes to estimate the probability of an outcome as a linear function and, A. If the sample size is moderately large and the dependent variable is not too skewed (.25 < p < .75), then OLS regression or ordinary ANOVA is adequate."