Wednesday, August 31, 2011

Linear Regression and Analysis of Variance with a Binary Dependent Variable

(see also my posts related to Logistic Regression

If for instance Y is dichotomous or binary, Y = { 1 if ‘yes’  0 if ‘no’}, would  you consider it valid to do an analysis of variance or fit a linear regression model?

We might not think so based on traditional assumptions, because besides assuming that Y is continuous….

1)      ANOVA / linear regression both work under the assumption of a uniform (homoskedastic) error term ‘e’

2)      For a dichotomous y, the expected value E(y|X)  =  ‘probability’ and may be continuous, but the  error terms follow a binomial distribution with mean ‘p’ and a variance that is a function of the mean, which is inherently heteroskedastic , var(error) ~ n*p*(1-p)   

3)      Therefore the assumption of uniform variance is violated, so the F test and other tests based on standard errors based on this assumption are questionable

Some people will argue that violations of this assumption may only matter by degree, and under certain conditions ANOVA and linear regression using least squares is OK with a binary dependent variable (Lunney 1971, D'Agostino 1971,Astin & Dey 1993, Angrist & Pischke 2008).

I recall from mathematical statistics and econometrics, the lectures (and test questions) related to properties of estimators including efficiency, consistency, unbiasedness etc. We also spoke some about robustness, but in the theoretical work, its not so easy to 'prove' robustness as it is to show that a certain estimator is unbiased or consistent. In this way, I think a lot of times, robustness to assumptions isn't given a lot of credence by students or practitioners. ( However, Angrist and Pischke in their book 'Mostly Harmless Economics' spend a lot of time discussing ideas related to the robustness of the least squares estimator).  Robustness to assumptions related to the distribution of the error terms (under OLS / ANOVA are discussed in the following:

LITERATURE RELATED TO REGRESSION AND ANOVA  WITH A DICHOTOMOUS DEPENDENT VARIABLE

ANALYSIS OF VARIANCE CONTEXT--------------------------------------------

A Second Look at Analysis of Variance on Dichotomous Data
Author(s): Ralph B. D'Agostino
Source: Journal of Educational Measurement, Vol. 8, No. 4 (Winter, 1971), pp. 327-333

“probably a safe rule of thumb for deciding when the I x J ANOVA techniques may be used on dichotomous data with equal sample sizes in each cell is; the sample proportions for the cells should lie between .25 and .75 and there should be at least 20 degrees offreedom for error. This rule combines Lunney's results along with standard rules (Snedecor & Cochran, 1967, p. 494). The reason for the .25 and .75 lies in the fact that for this range there is little change between the within cell variances, p(l - p), and so there is a sufficient homogeneity of variances. Given that this rule is satisfied a standard procedure for analysis is the ANOVA on the original data. In this range (.25 to .75) it is doubtful if any alternative valid procedure, such as an analysis of transformed data, would lead to different conclusions”

Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study
Author(s): Gerald H. Lunney
Source: Journal of Educational Measurement, Vol. 7, No. 4 (Winter, 1970), pp. 263-269

“The findings show the analysis of variance to be an appropriate statistical technique for analyzing dichotomous data in fixed effects models where cell frequencies are equal under the following conditions: (a) the proportion of responses in the smaller response category is equal to or greater than .2 and there are at least 20 degrees of freedom for error, or (b) the proportion of responses in the smaller response category is less than .2 and there are at least 40 degrees of freedom for error.”

REGRESSION CONTEXT---------------------------------------------------

Dey ,Eric L. and Alexander W. Astin. Statistical Alternatives For Studying College Student Retention: A Comparative Analysis of Logit, Probit, and Linear Regression. Research in Higher Education, Vol. 34, No. 5. 1993. link http://www.jstor.org/stable/40196112  

"These results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression. While this may not always be the case, these and other analyses show that for variables that are moderately distributed (say, within a .75/. 25 split; for example, see Cleary and Angel, 1984) there is little practical difference in obtained results upon which to make a decision about one technique or another, especially in large samples."

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

BOTH CONTEXTS---------------------------------------------

The Analysis of Relationships Involving Dichotomous Dependent Variables
Author(s): Paul D. Cleary and Ronald AngelSource: Journal of Health and Social Behavior, Vol. 25, No. 3 (Sep., 1984), pp. 334-348

"If the researcher wishes to estimate the probability of an outcome as a linear function and, A. If the sample size is moderately large and the dependent variable is not too skewed (.25 < p < .75), then OLS regression or ordinary ANOVA is adequate."

The Asymptotic Bias of OLS With Censored Dependent Variables


An interesting result from Greene that I was not previously aware of:

We find that for the case of normally distributed regressors, the slopes in the linear regression of y on x, which are consistently estimated by ordinary least squares, are proportional to the corresponding slopes of the underlying regression relating y*, and x. In this setting, the bias can be "corrected" simply by dividing each OLS slope by the proportion of nonlimit observations in the sample. The other structural parameters can be consistently estimated in similar fashion. - Greene On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model Econometrica, Vol. 49, No. 2 (Mar., 1981), pp. 505-513Published


Wednesday, August 17, 2011

Basic Concepts: Defining The Mean and Variance

"E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X. So the mean of X is a measure
of where the values of the random variable X are" centered…. The variance of X will be a measure of the spread or dispersion of the density of X
" ---Mood, Alexander McFar1ane, INTRODUCTION TO THE THEORY OF STATISTICS  1963.

“The variance of a distribution provides a measure of the spread or dispersion of the distribution about its mean. A small value of the variance indicates that the probability distribution is tightly concentrated around the mean (μ ); and a large value of the variance typically indicates that the probability distribution has a wide spread about the mean (μ )….the variance of X is the second central moment of X” – DeGroot & Schervish. Probability and Statistics 2002.



Saturday, August 13, 2011

QTL Analysis in R

See also: Part 1: QTL Analysis and Quantitative Genetics  Part 2: Statistical Methods for QTL Analysis

The 'qtl' package in R allows you to implement QTL analysis using the methods I've previously discussed. The code below is adapted from Broman's documentation "A Brief Tour of R/qtl." ( http://www.rqtl.org/tutorials/rqtltour.pdf ) My code (which follows) is an even briefer tour, relating specifically to the general topics in QTL analysis that I have previously discussed in Part 1 and 2 of this 3 part series of posts. The data set is a simulated backcross data set. The 'summary' function provides some basic descriptive data, such as the number of individuals in the crosses, the number of mapped chromosomes, the number of phenotypes, and number of mapped markers.

Ex:
   Backcross

    No. individuals:    400

    No. phenotypes:     4
    Percent phenotyped: 100 100 100 100

    No. chromosomes:    19
        Autosomes:      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

    Total markers:      91
    No. markers:        7 5 3 4 10 5 4 4 6 4 4 5 11 3 3 4 4 2 3
    Percent genotyped:  72.7
    Genotypes (%):      AA:50.1  AB:49.9

The plot function (see output below) provides 1) a genetic map with markers 2) phenotype distributions 3) missing data visualizations:


Recall, from  the end of my discussion in Part 1 and my discussion in Part 2 I stated:

For maximum likelihood estimation, the following probability density of yi can be stated as:
f(yi) = Pr(y|Zi=Hk)  ‘ the probability of phenotype ‘y’ given genotype H
= (1/ √ 2π σ)  exp[ 1/2 σ 2 (yi-Xiβ +Hkγ )2
The hypothesis H0:  γ =0 can be tested using the likelihood ratio test:
λ = -2(L0-L1)
where L0 represents the likelihood under a restricted model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

LOD = Likelihood (effect occurs by QTL linkage) / Likelihood(effect occurs by chance)
The calc.genoprob, scanone, and sim.geno functions allow you to make these calculations with options for expectation maximization, Haley-Knott regression, or multiple imputation.  'summary' and 'max' functions allow you to identify the locations on the chromosomes where the LOD score is maximized and infer the potential location of a QTL.  The plot function allows you to plot this data for specified chromosomes.




R Code: (note this code scrolls from left to right- click on the code and use the arrow keys or use the scroll bar at the bottom of the post )
#   ------------------------------------------------------------------
#  |PROGRAM NAME: EX_QTL_R
#  |DATE: 8-13-11
#  |CREATED BY: MATT BOGARD 
#  |PROJECT FILE:              
#  |----------------------------------------------------------------
#  | PURPOSE: DEMONSTRATE STATISTICAL METHODS FOR QTL ANALYSIS IN R        
#  | 
#  |------------------------------------------------------------------
#  | REFERENCE:  R code for "A brief tour of R/qtl"
#  |  Karl W Broman, kbroman.wisc.edu http://www.rqtl.org
#  |-----------------------------------------------------------------
 
 
library(qtl) # call package qtl
 
ls() 
 
############################################################
#  exploring backcross data
############################################################
 
 
data(fake.bc) # load simulated backcross data (default data in qtl package)
ls()
 
summary(fake.bc) # summary of info in fake.bc which is and object of class 'cross'
 
# you can also get this information with specific function calls:
nind(fake.bc)   # number of individuals
nphe(fake.bc)   # number of phenotypes
nchr(fake.bc)   # number of chromosomes
totmar(fake.bc) # number of total  markers
nmar(fake.bc)   # list markers? 
 
############################################################
#   plotting maps, genotypes, phenotypes, and markers
###########################################################
 
 
plot(fake.bc) # gives plots data
 
# you can also call for the plots individually:
plot.missing(fake.bc) # just plot missing genotypes
plot.map(fake.bc)     # just plot the genetic map 
plot.pheno(fake.bc, pheno.col=1) # just plot phenotype 1 'phe 1' 
plot.map(fake.bc, chr=c(1, 4, 6, 7, 15), show.marker.names=TRUE) # specific chromosomes and marker names
plot.missing(fake.bc, reorder=TRUE) # order NA genotypes based on value of phenotype
 
fake.bc <- drop.nullmarkers(fake.bc)  # drop obs with missing genotypes
totmar(fake.bc) # total # markers left
 
 
##################################################################
#  specifying and estimating the likelihood used for QTL mapping
#################################################################
 
# From Broman:
# The function calc.genoprob calculates QTL genotype probabilities, conditional
# on the available marker data. These are needed for most of the QTL mapping 
# functions. The argument step indicates the step size (in cM) at which the
# probabilities are calculated, and determines the step size at which later 
# LOD scores are calculated.
 
 
fake.bc <- calc.genoprob(fake.bc, step=1, error.prob=0.01)
 
# function scanone performs single-QTL genome scan with a normal model. 
#  methods include: maximum likelihood via the EM algorithm 
#  and Haley-Knott regression 
 
out.em <- scanone(fake.bc)
out.hk <- scanone(fake.bc, method="hk")
 
# multiple imputation method using sim.geno utilizing the joint 
# genotype distribution, given the observed marker data.
 
fake.bc <- sim.geno(fake.bc, step=2, n.draws=16, error.prob=0.01)
out.imp <- scanone(fake.bc, method="imp")
 
# get the maximum LOD score on each chromosome 
# can also specify a threshold for LOD
 
summary(out.em) 
summary(out.em, threshold=3)
summary(out.hk, threshold=3)
summary(out.imp, threshold=3)
 
# function max.scanone returns just the highest peak from output of scanone.
max(out.em) # based on expectation maximization
max(out.hk) # based on Haley-Knott regression
max(out.imp) # based on multiple imputation
 
##################################################################
#  plot LOD scores by chromosome for QTL mapping
#################################################################
 
 
plot(out.em, chr=c(2,5)) # just based on em method
plot(out.em, out.hk, out.imp, chr=c(2,5)) # all methods
plot(out.em, chr=c(2)) # zoom in on specified chromosome
Created by Pretty R at inside-R.org

R Program Documentation Template

#   ------------------------------------------------------------------
#  |PROGRAM NAME:
#  |DATE: 
#  |CREATED BY: MATT BOGARD 
#  |PROJECT FILE:              
#  |----------------------------------------------------------------
#  | PURPOSE:         
#  | 
#  |------------------------------------------------------------------
#  |DATA USED:  		            
#  |------------------------------------------------------------------
#  |CONTENTS:               
#  |
#  |			
#  |		
#  |		     
#  |------------------------------------------------------------------
#  |COMMENTS:               
#  |
#  |	 	
#  |-----------------------------------------------------------------
#  |UPDATES:               
#  |
#  |
#   ------------------------------------------------------------------
 
 
 
# set working directory
 
setwd('/Users/wkuuser/Desktop/R Data Sets') # mac 
 
setwd("P:\\R  Code References\\R Training") # windows

Created by Pretty R at inside-R.org

Wednesday, July 27, 2011

Statistical Methods for QTL Analysis

See also QTL Analysis and Quantitative Genetics and  QTL Analysis in R.

In a previous post a gave a general overview of QTL mapping and analysis, and gave the motivation for the use of maximum likelihood to identify the approximate location of a QTL based on RFLP (or marker) variations. Below are more details related to the statistical methods that can be used in this process.


Analysis of Variance and Marker Regression

At each marker (RFLP) loci, compare backcross phenotype distributions for groups that differ according to their marker genotypes. As depicted in Broman (2001)

 
For markers a and c, we see that phenotype distributions differ for genotypes aa and aa’ but not for cc and cc’. Again this indicates that a QTL may be linked to the RFLP genotypes at locus “a” but not "c". 

The differences between genotypes and the associated phenotype distributions for two marker genotypes can be assessed using the t-statistic. For >2 genotypes, analysis of variance may be used. The ANOVA approach allows a flexible experimental design, allowing for the incorporation of covariates, treatment, and environmental effects. 

The model for marker regression, following the notation in Hu and Zu (2009) can be specified as follows:

yi= Xiβ + Ziγ +ϵi

such that  yi  is the phenotype of the ith individual, β is the vector of control effects, Xi is design vector,γ is a vector for QTL effects, and Z is a genotype indicator vector.

Z = H1 for A1A1 , H2 for A1A2 , H3 for A2A2  or more generally 

 
Tests on hypotheses related to QTL effects take the form H0: γ = 0 . Hence we test the null hypothesis of no QTL associated with genotype Z. 

Maximum Likelihood

For maximum likelihood estimation, the following probability density of yi can be stated as:
f(yi) = Pr(y|Zi=Hk)  ‘ the probability of phenotype ‘y’ given genotype H
= (1/ √ 2π σ)  exp[ 1/2 σ 2 (yi-Xiβ +Hkγ )2
The log likelihood function can then be specified as L(θ) =Σ  ln(f(yi))
The hypothesis H0:  γ =0 can be tested using the likelihood ratio test:
λ = -2(L0-L1)
where L0 represents the likelihood under a restricted model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

Likelihood (effect occurs by QTL linkage) / Likelihood(effect occurs by chance)

References:

 Jones, N., H. Ougham, and H. Thomas. Markers and mapping:We are all geneticists now. New Phytol. 137:165–177.1997.

Broman KW. Lab Anim (NY). Review of statistical methods for QTL mapping in experimental crosses.
2001 Jul-Aug;30(7):44-52.

Zhiqiu Hu and Shizhong Xu (2009). PROC QTL - A SAS Procedure for Mapping Quantitative Trait Loci. International Journal of Plant Genomics 2009: 3 doi:10.1155/2009/141234.

QTL Analysis and Quantitative Genetics

See also Statistical Methods for QTL Analysis and QTL Analysis in R

Restriction Fragment Length Polymorphisms (RFLPs) and Restriction Enzymes

Restriction enzymes target specific DNA base sequences producing staggered cuts and variable length DNA fragments-‘RFLPs’.


RFLPs establish fixed landmarks in the genome. Subjecting DNA to restriction enzymes and gel-electrophoresis leads to gel patterns which can be described in terms of genotpype or allelic differences between RFLPs  at a given locus, which follow rules of Mendelian inheritance.  (recall a 'locus' is the location of a DNA sequence on a chromosome. A variant of a DNA sequence at a given locus is an 'allele.'

 
Backcrossing, Segregation, and Recombination
P1 AABB x P2 aabb --> F1 AaBb

F1 x P2 -->              AAab
                                Abab
                                aBab
                                abab     

With unlinked genes and independent segregation, recombinant genotypes Abab and aBab will result 50% of the time. With linked genes, crossing over will result in recombinants < 50% of the time.

 
The probability of crossing over is proportional to the distance between loci, therefore, crossover rates can be used to create genetic maps.
Loci                        % Crossover or Recombinants
a,b                          7
b,c                          19

 
Mapping Population
Use P1,P2,F1 to create a population with segregating QTL and RFLP profiles and phenotypic variation. RFLPs that are tightly linked to QTLs will segregate together, can serve as markers for that trait.

Central Dogma of QTL Marker Analysis

DNA: Δ mapped locus position --> Δ RFLP allele -->Δ Phenotype  -->likelihood that RFLP is linked to QTL 

Where a QTL is referred to as a bundle of genes associated with some quantitative trait of interest. As we move across a segment of DNA, we note changes in RFLP genotype alleles, and associated changes in the phenotype of interest. This data can then be assessed to determine the likelihood that a RFLP is linked to a QTL.

For example, suppose at locus "a" there are two observed allelic differences in the RFLP profile, denoted aa and aa’. If we notice distinct differences in the phenotypic distribution for plants with alleles aa vs. aa’ it may be inferred that the RFLP at locus “a” is linked to the QTL of interest. If at another location, we observe two allelic differences in the RFLP profile cc and cc’, and we find little difference in the associated phenotypes we may conclude that the RFLP at locus “c” is not linked to the QTL of interest. Intermediate differences between phenotypes would imply that the RFLP at that locus is closely linked to the QTL of interest. The following schematic from Jones(1997) et al illustrates these differences:

 
The map position of the QTL can be determined by the method of maximum likelihood from the distribution of likelihood values determined at each locus above. At each locus we compute the following:

Likelihood(effect occurs QTL linkage) / Likelihood(effect occurs by chance)

Depicted as  L1/L0 below:

 
The key to using markers to map QTL is to associate RFLP patterns with QTLs. Observing changes in RFLP profiles among plant DNA from a segregating population at various positions in the genome, and associating them with changes in phenotype allows us to find a statistical relationship between RFLPs and QTLs. Based on the observations above, the evidence strongly supports that a QTL may be found near the RFLP locus “a”.

Reference: Jones, N., H. Ougham, and H. Thomas. 1997. Markers and mapping:We are all geneticists now. New Phytol. 137:165–177