Wednesday, August 31, 2011

Linear Regression and Analysis of Variance with a Binary Dependent Variable

(See also my posts related to logistic regression.)

If Y is dichotomous or binary, for instance Y = 1 if 'yes' and Y = 0 if 'no', would you consider it valid to do an analysis of variance or fit a linear regression model?

Based on traditional assumptions we might not think so, because besides assuming that Y is continuous:

1) ANOVA and linear regression both work under the assumption of a constant-variance (homoskedastic) error term 'e'.

2) For a dichotomous Y, the expected value E(Y|X) is a probability and may vary continuously, but the errors follow a binomial (Bernoulli) distribution with mean p and a variance that is a function of the mean, var(error) = p*(1-p). Because p varies with X, this is inherently heteroskedastic (see the sketch after this list).

3) Therefore the assumption of constant variance is violated, so the F test and other tests whose standard errors rely on this assumption are questionable.
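As a quick illustration of points 1-3, here is a minimal R sketch using simulated data (the variable names and parameter values are mine, purely illustrative): it fits a linear probability model alongside a logistic regression and plots the implied error variance p*(1-p), which changes with X.

# minimal sketch with simulated data; all names and values are illustrative
set.seed(123)
n <- 1000
x <- rnorm(n)
p <- plogis(-0.25 + 0.75 * x)           # true probabilities vary with x
y <- rbinom(n, size = 1, prob = p)      # binary dependent variable

lpm   <- lm(y ~ x)                      # linear probability model (OLS)
logit <- glm(y ~ x, family = binomial)  # logistic regression

# Var(y|x) = p(1 - p) is a function of the mean, so the error term
# in the linear model is heteroskedastic by construction
phat <- fitted(logit)
plot(x, phat * (1 - phat), xlab = "x", ylab = "estimated Var(y|x)")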

Some people will argue that violations of this assumption may matter only by degree, and that under certain conditions ANOVA and linear regression using least squares are acceptable with a binary dependent variable (Lunney 1970; D'Agostino 1971; Dey & Astin 1993; Angrist & Pischke 2008).

I recall from mathematical statistics and econometrics the lectures (and test questions) related to properties of estimators, including efficiency, consistency, unbiasedness, etc. We also spoke some about robustness, but in theoretical work it's not as easy to 'prove' robustness as it is to show that a certain estimator is unbiased or consistent. Perhaps for that reason, robustness to assumptions often isn't given much weight by students or practitioners. (However, Angrist and Pischke, in their book 'Mostly Harmless Econometrics', spend a lot of time discussing ideas related to the robustness of the least squares estimator.) Robustness to assumptions about the distribution of the error terms under OLS / ANOVA is discussed in the following:

LITERATURE RELATED TO REGRESSION AND ANOVA  WITH A DICHOTOMOUS DEPENDENT VARIABLE

ANALYSIS OF VARIANCE CONTEXT--------------------------------------------

A Second Look at Analysis of Variance on Dichotomous Data
Author(s): Ralph B. D'Agostino
Source: Journal of Educational Measurement, Vol. 8, No. 4 (Winter, 1971), pp. 327-333

“probably a safe rule of thumb for deciding when the I x J ANOVA techniques may be used on dichotomous data with equal sample sizes in each cell is: the sample proportions for the cells should lie between .25 and .75 and there should be at least 20 degrees of freedom for error. This rule combines Lunney's results along with standard rules (Snedecor & Cochran, 1967, p. 494). The reason for the .25 and .75 lies in the fact that for this range there is little change between the within cell variances, p(1 - p), and so there is a sufficient homogeneity of variances. Given that this rule is satisfied a standard procedure for analysis is the ANOVA on the original data. In this range (.25 to .75) it is doubtful if any alternative valid procedure, such as an analysis of transformed data, would lead to different conclusions”
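The numerical basis for the .25/.75 rule is easy to check; a couple of lines of R (purely illustrative) show how little p(1 - p) changes over that range:

p <- seq(0.05, 0.95, by = 0.05)
round(p * (1 - p), 2)  # between .25 and .75, p(1-p) stays within 0.19 to 0.25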

Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study
Author(s): Gerald H. Lunney
Source: Journal of Educational Measurement, Vol. 7, No. 4 (Winter, 1970), pp. 263-269

“The findings show the analysis of variance to be an appropriate statistical technique for analyzing dichotomous data in fixed effects models where cell frequencies are equal under the following conditions: (a) the proportion of responses in the smaller response category is equal to or greater than .2 and there are at least 20 degrees of freedom for error, or (b) the proportion of responses in the smaller response category is less than .2 and there are at least 40 degrees of freedom for error.”

REGRESSION CONTEXT---------------------------------------------------

Dey, Eric L. and Alexander W. Astin. Statistical Alternatives for Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5 (1993). http://www.jstor.org/stable/40196112

"These results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression. While this may not always be the case, these and other analyses show that for variables that are moderately distributed (say, within a .75/. 25 split; for example, see Cleary and Angel, 1984) there is little practical difference in obtained results upon which to make a decision about one technique or another, especially in large samples."

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

BOTH CONTEXTS---------------------------------------------

The Analysis of Relationships Involving Dichotomous Dependent Variables
Author(s): Paul D. Cleary and Ronald Angel
Source: Journal of Health and Social Behavior, Vol. 25, No. 3 (Sep., 1984), pp. 334-348

"If the researcher wishes to estimate the probability of an outcome as a linear function and, A. If the sample size is moderately large and the dependent variable is not too skewed (.25 < p < .75), then OLS regression or ordinary ANOVA is adequate."

The Asymptotic Bias of OLS With Censored Dependent Variables


An interesting result from Greene that I was not previously aware of:

"We find that for the case of normally distributed regressors, the slopes in the linear regression of y on x, which are consistently estimated by ordinary least squares, are proportional to the corresponding slopes of the underlying regression relating y* and x. In this setting, the bias can be "corrected" simply by dividing each OLS slope by the proportion of nonlimit observations in the sample. The other structural parameters can be consistently estimated in similar fashion." - Greene. On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model. Econometrica, Vol. 49, No. 2 (Mar., 1981), pp. 505-513.
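Greene's proportionality result is easy to check by simulation; the sketch below (Tobit-type data with a normal regressor; the parameter values are mine) divides the OLS slope by the share of nonlimit observations and approximately recovers the latent slope.

# quick simulation check of Greene's (1981) result; values are illustrative
set.seed(1)
n <- 10000
x <- rnorm(n)                     # normally distributed regressor
ystar <- 1 + 2 * x + rnorm(n)     # latent regression y*, true slope = 2
y <- pmax(ystar, 0)               # observed y censored at zero

b_ols <- coef(lm(y ~ x))["x"]     # attenuated OLS slope on censored y
b_ols / mean(y > 0)               # divide by proportion of nonlimit obs; ~2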


Wednesday, August 17, 2011

Basic Concepts: Defining The Mean and Variance

"E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X. So the mean of X is a measure
of where the values of the random variable X are" centered…. The variance of X will be a measure of the spread or dispersion of the density of X
" ---Mood, Alexander McFar1ane, INTRODUCTION TO THE THEORY OF STATISTICS  1963.

“The variance of a distribution provides a measure of the spread or dispersion of the distribution about its mean. A small value of the variance indicates that the probability distribution is tightly concentrated around the mean (μ); and a large value of the variance typically indicates that the probability distribution has a wide spread about the mean (μ)…. the variance of X is the second central moment of X” – DeGroot & Schervish. Probability and Statistics. 2002.
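For reference, the standard textbook definitions behind both quotes, for a continuous random variable X with density f(x) and mean μ = E[X] (these formulas are mine, not quoted from either source):

$$E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx, \qquad \mathrm{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - \mu^2$$

Here E[(X - μ)²] is exactly the "second central moment" mentioned by DeGroot & Schervish.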



Saturday, August 13, 2011

QTL Analysis in R

See also: Part 1: QTL Analysis and Quantitative Genetics, and Part 2: Statistical Methods for QTL Analysis

The 'qtl' package in R allows you to implement QTL analysis using the methods I've previously discussed. The code below is adapted from Broman's documentation "A Brief Tour of R/qtl." ( http://www.rqtl.org/tutorials/rqtltour.pdf ) My code (which follows) is an even briefer tour, relating specifically to the general topics in QTL analysis that I have previously discussed in Part 1 and 2 of this 3 part series of posts. The data set is a simulated backcross data set. The 'summary' function provides some basic descriptive data, such as the number of individuals in the crosses, the number of mapped chromosomes, the number of phenotypes, and number of mapped markers.

Ex:
   Backcross

    No. individuals:    400

    No. phenotypes:     4
    Percent phenotyped: 100 100 100 100

    No. chromosomes:    19
        Autosomes:      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

    Total markers:      91
    No. markers:        7 5 3 4 10 5 4 4 6 4 4 5 11 3 3 4 4 2 3
    Percent genotyped:  72.7
    Genotypes (%):      AA:50.1  AB:49.9

The plot function (see output below) provides 1) a genetic map with markers, 2) phenotype distributions, and 3) missing data visualizations:


Recall, from the end of my discussion in Part 1 and my discussion in Part 2, I stated:

For maximum likelihood estimation, the probability density of y_i (the probability of phenotype y given genotype H_k) can be stated as:

$$f(y_i) = \Pr(y_i \mid Z_i = H_k) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}\left(y_i - X_i\beta - H_k\gamma\right)^2\right]$$

The hypothesis H0: γ = 0 can be tested using the likelihood ratio test:

$$\lambda = -2(\ell_0 - \ell_1)$$

where ℓ0 represents the log-likelihood under the restricted model and ℓ1 the log-likelihood under the full model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

LOD = log10[ Likelihood(effect occurs by QTL linkage) / Likelihood(effect occurs by chance) ]

so the LOD score and the likelihood ratio statistic are related by λ = 2 ln(10) × LOD ≈ 4.61 × LOD.
The calc.genoprob, scanone, and sim.geno functions allow you to make these calculations, with options for expectation maximization, Haley-Knott regression, or multiple imputation. The 'summary' and 'max' functions allow you to identify the locations on the chromosomes where the LOD score is maximized and infer the potential location of a QTL. The plot function allows you to plot these results for specified chromosomes.




R Code:
#   ------------------------------------------------------------------
#  |PROGRAM NAME: EX_QTL_R
#  |DATE: 8-13-11
#  |CREATED BY: MATT BOGARD 
#  |PROJECT FILE:              
#  |----------------------------------------------------------------
#  | PURPOSE: DEMONSTRATE STATISTICAL METHODS FOR QTL ANALYSIS IN R        
#  | 
#  |------------------------------------------------------------------
#  | REFERENCE:  R code for "A brief tour of R/qtl"
#  |  Karl W Broman, kbroman.wisc.edu http://www.rqtl.org
#  |-----------------------------------------------------------------
 
 
library(qtl) # call package qtl
 
ls() 
 
############################################################
#  exploring backcross data
############################################################
 
 
data(fake.bc) # load simulated backcross data (default data in qtl package)
ls()
 
summary(fake.bc) # summary of info in fake.bc, which is an object of class 'cross'
 
# you can also get this information with specific function calls:
nind(fake.bc)   # number of individuals
nphe(fake.bc)   # number of phenotypes
nchr(fake.bc)   # number of chromosomes
totmar(fake.bc) # number of total  markers
nmar(fake.bc)   # number of markers on each chromosome
 
############################################################
#   plotting maps, genotypes, phenotypes, and markers
###########################################################
 
 
plot(fake.bc) # plots missing genotypes, the genetic map, and phenotype distributions
 
# you can also call for the plots individually:
plot.missing(fake.bc) # just plot missing genotypes
plot.map(fake.bc)     # just plot the genetic map 
plot.pheno(fake.bc, pheno.col=1) # just plot phenotype 1 'phe 1' 
plot.map(fake.bc, chr=c(1, 4, 6, 7, 15), show.marker.names=TRUE) # specific chromosomes and marker names
plot.missing(fake.bc, reorder=TRUE) # reorder individuals by phenotype value when plotting missing genotypes
 
fake.bc <- drop.nullmarkers(fake.bc)  # drop markers that have no genotype data
totmar(fake.bc) # total # markers left
 
 
##################################################################
#  specifying and estimating the likelihood used for QTL mapping
#################################################################
 
# From Broman:
# The function calc.genoprob calculates QTL genotype probabilities, conditional
# on the available marker data. These are needed for most of the QTL mapping 
# functions. The argument step indicates the step size (in cM) at which the
# probabilities are calculated, and determines the step size at which later 
# LOD scores are calculated.
 
 
fake.bc <- calc.genoprob(fake.bc, step=1, error.prob=0.01)
 
# function scanone performs single-QTL genome scan with a normal model. 
#  methods include: maximum likelihood via the EM algorithm 
#  and Haley-Knott regression 
 
out.em <- scanone(fake.bc)
out.hk <- scanone(fake.bc, method="hk")
 
# multiple imputation method using sim.geno utilizing the joint 
# genotype distribution, given the observed marker data.
 
fake.bc <- sim.geno(fake.bc, step=2, n.draws=16, error.prob=0.01)
out.imp <- scanone(fake.bc, method="imp")
 
# get the maximum LOD score on each chromosome 
# can also specify a threshold for LOD
 
summary(out.em) 
summary(out.em, threshold=3)
summary(out.hk, threshold=3)
summary(out.imp, threshold=3)
 
# function max.scanone returns just the highest peak from output of scanone.
max(out.em) # based on expectation maximization
max(out.hk) # based on Haley-Knott regression
max(out.imp) # based on multiple imputation
 
##################################################################
#  plot LOD scores by chromosome for QTL mapping
#################################################################
 
 
plot(out.em, chr=c(2,5)) # just based on em method
plot(out.em, out.hk, out.imp, chr=c(2,5)) # all methods
plot(out.em, chr=c(2)) # zoom in on specified chromosome

R Program Documentation Template

#   ------------------------------------------------------------------
#  |PROGRAM NAME:
#  |DATE: 
#  |CREATED BY: MATT BOGARD 
#  |PROJECT FILE:              
#  |----------------------------------------------------------------
#  | PURPOSE:         
#  | 
#  |------------------------------------------------------------------
#  |DATA USED:  		            
#  |------------------------------------------------------------------
#  |CONTENTS:               
#  |
#  |			
#  |		
#  |		     
#  |------------------------------------------------------------------
#  |COMMENTS:               
#  |
#  |	 	
#  |-----------------------------------------------------------------
#  |UPDATES:               
#  |
#  |
#   ------------------------------------------------------------------
 
 
 
# set working directory
 
setwd('/Users/wkuuser/Desktop/R Data Sets') # mac 
 
setwd("P:\\R  Code References\\R Training") # windows
