Wednesday, July 27, 2011

Statistical Methods for QTL Analysis

See also QTL Analysis and Quantitative Genetics and  QTL Analysis in R.

In a previous post a gave a general overview of QTL mapping and analysis, and gave the motivation for the use of maximum likelihood to identify the approximate location of a QTL based on RFLP (or marker) variations. Below are more details related to the statistical methods that can be used in this process.

Analysis of Variance and Marker Regression

At each marker (RFLP) loci, compare backcross phenotype distributions for groups that differ according to their marker genotypes. As depicted in Broman (2001)

For markers a and c, we see that phenotype distributions differ for genotypes aa and aa’ but not for cc and cc’. Again this indicates that a QTL may be linked to the RFLP genotypes at locus “a” but not "c". 

The differences between genotypes and the associated phenotype distributions for two marker genotypes can be assessed using the t-statistic. For >2 genotypes, analysis of variance may be used. The ANOVA approach allows a flexible experimental design, allowing for the incorporation of covariates, treatment, and environmental effects. 

The model for marker regression, following the notation in Hu and Zu (2009) can be specified as follows:

yi= Xiβ + Ziγ +ϵi

such that  yi  is the phenotype of the ith individual, β is the vector of control effects, Xi is design vector,γ is a vector for QTL effects, and Z is a genotype indicator vector.

Z = H1 for A1A1 , H2 for A1A2 , H3 for A2A2  or more generally 

Tests on hypotheses related to QTL effects take the form H0: γ = 0 . Hence we test the null hypothesis of no QTL associated with genotype Z. 

Maximum Likelihood

For maximum likelihood estimation, the following probability density of yi can be stated as:
f(yi) = Pr(y|Zi=Hk)  ‘ the probability of phenotype ‘y’ given genotype H
= (1/ √ 2π σ)  exp[ 1/2 σ 2 (yi-Xiβ +Hkγ )2
The log likelihood function can then be specified as L(θ) =Σ  ln(f(yi))
The hypothesis H0:  γ =0 can be tested using the likelihood ratio test:
λ = -2(L0-L1)
where L0 represents the likelihood under a restricted model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

Likelihood (effect occurs by QTL linkage) / Likelihood(effect occurs by chance)


 Jones, N., H. Ougham, and H. Thomas. Markers and mapping:We are all geneticists now. New Phytol. 137:165–177.1997.

Broman KW. Lab Anim (NY). Review of statistical methods for QTL mapping in experimental crosses.
2001 Jul-Aug;30(7):44-52.

Zhiqiu Hu and Shizhong Xu (2009). PROC QTL - A SAS Procedure for Mapping Quantitative Trait Loci. International Journal of Plant Genomics 2009: 3 doi:10.1155/2009/141234.

QTL Analysis and Quantitative Genetics

See also Statistical Methods for QTL Analysis and QTL Analysis in R

Restriction Fragment Length Polymorphisms (RFLPs) and Restriction Enzymes

Restriction enzymes target specific DNA base sequences producing staggered cuts and variable length DNA fragments-‘RFLPs’.

RFLPs establish fixed landmarks in the genome. Subjecting DNA to restriction enzymes and gel-electrophoresis leads to gel patterns which can be described in terms of genotpype or allelic differences between RFLPs  at a given locus, which follow rules of Mendelian inheritance.  (recall a 'locus' is the location of a DNA sequence on a chromosome. A variant of a DNA sequence at a given locus is an 'allele.'

Backcrossing, Segregation, and Recombination
P1 AABB x P2 aabb --> F1 AaBb

F1 x P2 -->              AAab

With unlinked genes and independent segregation, recombinant genotypes Abab and aBab will result 50% of the time. With linked genes, crossing over will result in recombinants < 50% of the time.

The probability of crossing over is proportional to the distance between loci, therefore, crossover rates can be used to create genetic maps.
Loci                        % Crossover or Recombinants
a,b                          7
b,c                          19

Mapping Population
Use P1,P2,F1 to create a population with segregating QTL and RFLP profiles and phenotypic variation. RFLPs that are tightly linked to QTLs will segregate together, can serve as markers for that trait.

Central Dogma of QTL Marker Analysis

DNA: Δ mapped locus position --> Δ RFLP allele -->Δ Phenotype  -->likelihood that RFLP is linked to QTL 

Where a QTL is referred to as a bundle of genes associated with some quantitative trait of interest. As we move across a segment of DNA, we note changes in RFLP genotype alleles, and associated changes in the phenotype of interest. This data can then be assessed to determine the likelihood that a RFLP is linked to a QTL.

For example, suppose at locus "a" there are two observed allelic differences in the RFLP profile, denoted aa and aa’. If we notice distinct differences in the phenotypic distribution for plants with alleles aa vs. aa’ it may be inferred that the RFLP at locus “a” is linked to the QTL of interest. If at another location, we observe two allelic differences in the RFLP profile cc and cc’, and we find little difference in the associated phenotypes we may conclude that the RFLP at locus “c” is not linked to the QTL of interest. Intermediate differences between phenotypes would imply that the RFLP at that locus is closely linked to the QTL of interest. The following schematic from Jones(1997) et al illustrates these differences:

The map position of the QTL can be determined by the method of maximum likelihood from the distribution of likelihood values determined at each locus above. At each locus we compute the following:

Likelihood(effect occurs QTL linkage) / Likelihood(effect occurs by chance)

Depicted as  L1/L0 below:

The key to using markers to map QTL is to associate RFLP patterns with QTLs. Observing changes in RFLP profiles among plant DNA from a segregating population at various positions in the genome, and associating them with changes in phenotype allows us to find a statistical relationship between RFLPs and QTLs. Based on the observations above, the evidence strongly supports that a QTL may be found near the RFLP locus “a”.

Reference: Jones, N., H. Ougham, and H. Thomas. 1997. Markers and mapping:We are all geneticists now. New Phytol. 137:165–177

Newton's Method

Newton-Raphson Method

(see also Maximum Likelihood Visualization via SAS and R)

"Newton's method, also called the Newton-Raphson method, is a root-finding algorithm that uses the first few terms of the Taylor series of a function f(x) in the vicinity of a suspected root."­ -

Given we want to find some root value x* that optimizes the function f(x) such that f(x*)=0, we start with an initial guess x­­n and expand around that point with a taylor series:

f(xn + Δx ) = f(xn) + f’(xn)Δx + …=0 where Δx represents the difference between the actual solution x* and the guess xn.

Retaining only the first order terms we can solve for Δx :

Δx = -[f(xn)/f’(xn)]

If xn is an initial guess, the next guess can be obtained by:

xn+1 = xn - [f(xn)/f’(xn)]
Note as we get closer to the true root value x*, f(xn) gets smaller and Δx gets smaller, such that the value xn+1 from the next iteration changes little from the previous. This is referred to as convergence to the root value x*, which optimizes f(x).

Numerical Estimation of Maximum Likelihood Estimators

Recall, when we undertake MLE we typically maximize the log of the likelihood function as follows:

Max Log(L(β)) or LL ‘log likelihood’ or solve:

∂Log(L(β))/∂β = 0    =  'score matrix'   = u( β) = 0

Generally, these equations aren’t solved directly, but solutions ( βhat 's) are derived from an iterative procedure like the Newton-Raphson algorithm.

Following the procedure outlined above, let

βt = initial guess or parameter value at iteration t

βt+1 = βt –[ ∂ u(βt) / ∂ βt]-1 [u(βt)]  

(note the product in the 2nd term is the same as –[f(xn)/f’(xn)]

given  [∂ u(β0) / ∂β] = hessian matrix ‘H’ and u( β) = ∂Log(L(β))/∂β = score matrix ‘S’

the update function can be rewritten as:

βt+1 = βt –H -1t) S (βt)

Step 1: Guess initial value βt 
Step 2: Evaluate update function to obtain βt+1
Step 3: Repeat until convergence (differences between estimates approach zero)

Fisher Scoring

Similar to Newton's method above, but replace  H-1 with its expected value E(H-1 ) =  I-1 and use

βhat=  β0 - I-10) *u(β0) in the update, repeating until convergence.

Sunday, July 10, 2011

More Discussion on Matching Estimators

At first, I thought my interpretation of matching was a little oversimplified. In fact, Andrew Gelman stated the following concern about this (in the STATA Journal), particularly the interpretation from Angrist and Pischke:

"A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective."

I'm a big fan of Dr. Gelman's blog 'Statistical Modeling, Causal Inference and Social Science'  and link to it from this site. I've actually referenced discussions he has had in the comments section with Hal Varian related to Instrumental Variables. For further clarification I contacted Angrist and Pischke via thier blog and  this is what they stated:

Question: You state:

“Our view is that regression can be motivated as a particular sort of
weighted matching estimator, and therefore the differences between
regression and matching estimates are unlikely to be of major
empirical importance” (Chapter 3 p. 70)

I take this to mean that in a ‘mostly harmless way’ regular OLS
regression is in fact a method of matching, or is a matching
estimator.  Is that an appropriate interpretation?  In ‘The Stata
Journal and his blog, Andrew Gelman takes issue with my understanding,
he states:

“A casual reader of the book might be left with the unfortunate
impression that matching is a competitor to regression rather than a
tool for making regression more effective.”

Any guidance?

-Matt Bogard, Western Kentucky University


Well Matt, Andrew Gelman’s intentions are undoubtedly good but I’m afraid he risks doing some harm here.  Suppose you’re interested in the effects of treatment, D, and you have a discrete control variable, X, for a selection-on-observables story.  Regress on D an a full set of dummies (i.e., saturated) model for X.  The resulting estimate of  the effect of D is equal to matching on X, and weighting across cells by the variance of X, as explained in Chapter 3.  While you might not always want to saturate, any other regression model for X gives the best linear approx to this version subject to whatever parameterization you’re using.
This means that i can’t imagine a situation where matching makes sense but regression does not (though some my say that I’m known for my lack of imagination when it comes to econometric methods)

I also inquired with Dr. Gelman, who to my surprise also followed up on his blog:

I don't know what Angrist and Pischke actually do in their applied analysis. I'm sorry to report that many users of matching do seem to think of it as a pure substitute for regression: once they decide to use matching, they try to do it perfectly and they often don't realize they can use regression on the matched data to do even better. In my book with Jennifer, we try to clarify that the primary role of matching is to correct for lack of complete overlap between control and treatment groups.
But I think in their comment you quoted above, Angrist and Pischke are just giving a conceptual perspective rather than detailed methodological advice. They're saying that regression, like matching, is a way of comparing-like-with-like in estimating a comparison. This point seems commonplace from a statistical standpoint but may be news to some economists who might think that regression relies on the linear model being true.

Gary King and I discuss this general idea in our 1990 paper on estimating incumbency advantage. Basically, a regression model works if either of two assumptions is satisfied: if the linear model is true, or if the two groups are balanced so that you're getting an average treatment effect. More recently this idea (of their being two bases for an inference) has been given the name "double robustness"; in any case, it's a fundamental aspect of regression modeling, and I think that, by equating regression with matching, Angrist and Pischke are just trying to emphasize that these are just tow different ways of ensuring balance in a comparison.

In many examples, neither regression nor matching works perfectly, which is why it can be better to do both (as Don Rubin discussed in his Ph.D. thesis in 1970 and subsequently in some published articles with his advisor, William Cochran).

Gelman, Andrew. The Stata Journal (2009) 9, Number 2, pp. 315–320
A statistician’s perspective on “Mostly Harmless Econometrics: An Empiricist’s Companion”, by Joshua D. Angrist and J¨orn-Steffen Pischke.
Angrist and Pischke, Mostly Harmless Econometrics, 2009 

Matching Estimators

In a previous post, I discussed the use of instrumental variables to address selection bias. Another method of correcting for selection bias would involve the use of propensity score matching. Heuristically it would involve first estimating (using probit or logit models) the propensity that an individual would self select, and weighting or matching those subjects with similar predicted probabilities, or ‘propensities’ and carrying out the analysis on the weighted or adjusted data.   A good example using SAS can be found here.  Why would we want to derive matching estimators and how would we do it? In general, we employ matching when we want to compare like cases. We may employ matching when we want to estimate the average effect of a treatment or intervention comparing participants to non-participants in a program. Comparisons are made between participants and nonparticipants based on the similarity of their observed characteristics.  There are two major assumptions related to matching schemes:
1)      We have sufficient observable data (X) such that outcomes or treatment effects (Y) are independent of program participation (D). This is also referred to as the conditional independence assumption.
The idea is that conditional on observed characteristics (X), selection bias disappears.  We want to make conditional on X comparisons of Y that mimic as much as possible the experimental benchmark of random assignment.  Let Y be some measure of success or the average treatment effect, D be some program, and X be a matrix of control variables. It follows from the CIA that
E[Y|X,D=1] – E[Y|X,D=0] = E[Y(1) –Y(0)|X]
Where Y(1) = value of Y when D=1, Y(0) = value of Y when D = 0.
2)      0 < p(D=1|X) < 1
Also central to the concept of matching is the idea of ‘common support.’ Support is essentially the overlap between values of X for the comparison groups (defined by D=1 or 0).
How can we match observations? Several methods are discussed in the literature, including nearest neighbor, caliper and radius, stratification, and kernel matching.  In practice matching has some complications.  By definition, matching only utilizes observations in the region of common support where we are able to obtain matched observations.  That is, unmatched observations are excluded from the analysis.  Therefore we can’t estimate treatment effects (Y) outside the region of common support. As a result, estimated treatment effects may vary substantially based on the matching method employed. The matching process itself also adds variation that must be accounted for in the estimation of standard errors in excess of normal sampling variation.  Some approaches to this involve bootstrapping (Lechner,2002). However, Imbens(2004) finds lack of empirical support for the necessity of bootstrapping.
Why not just run a regression?
The discussion of conditional independence above and making conditional on X comparisons of Y may sound a lot like regression in terms of estimating the conditional expectation of Y.
E(Y|X)  = B0 + B1X +B2D +e
i.e. estimating the beta coefficient on D holding constant all other factors in the model. In fact, regression and matching have a very strong connection.  In Equivalence between Fixed Effects and Matching Estimators for Causal Inference, Kosuke Imaiy  and Song Kimz (as a prelude to their discussion of fixed effects models and matching in panel studies) state:

It is well known that the least squares estimate of β is algebraically
equivalent to a matching estimator for the average treatment effect

They illustrate the algebraic equivalence between regression and matching:
Morgan and Harding (2006) cite the abundance of literature supporting the connection between regression and matching estimators  including Hahn 1998; Heckman, Ichimura, and Todd 1998; Hirano et al. 2003; Imbens 2004; Angrist and Krueger 1999.  Particularly they point out the findings of  Heckman and Vytlacil,  2004, stating  ‘all average causal effect estimators can be interpreted as weighted averages of marginal treatment effects whether generated by matching, regression, or local instrumental variable estimators.’

In their particular work, Morgan and Harding  find:
“For comparison, we then provide analogous regression estimates in the second panel of the table. In some cases, these estimates outperform some of the matching estimates.”
 This is in line with more recent work by Angrist and Pischke (2009) as presented in their book Mostly Harmless Econometrics:
"Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance…The reason the regression and matching estimates are similar is that regression, too, can be seen as a sort of matching estimator: the regression estimand differs from the matching estimands only in the weights used to combine the covariate-specific effects into a single average effect.  In particular, while matching uses the distribution of covariates among the treated to weight covariate-specific estimates into an estimate of the effect of treatment on the treated, regression produces a variance-weighted average of these effects." (Chapter 3, p. 70,73-74)

Angrist and Pischke seem to imply that regression can be a 'mostly harmless' substitute or competitor to matching.   

Additional Notes
If we stratify our data, matching places more emphasis on cells most likely to be treated. Regression places more emphasis on cells with similar numbers of treated and untreated cases. Matching tries to address heterogeneity (the differences in the groups we are comparing). It can do this only to the extent that these differences are captured in observable characteristics (X). In this context, it offers no improvement over regression, which is also ‘matches’ observations on X.

Gelman, Andrew. The Stata Journal (2009) 9, Number 2, pp. 315–320
A statistician’s perspective on “Mostly Harmless Econometrics: An Empiricist’s Companion”, by Joshua D. Angrist and J¨orn-Steffen Pischke.
Angrist and Pischke, Mostly Harmless Econometrics, 2009 
Hahn, Jinyong. 1998. ‘‘On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects.’’ Econometrica 66:315-31.
Heckman, James J., Hidehiko Ichimura, and Petra Todd. 1997. ‘‘Matching as an Econometric Evaluation Estimator: Evidence From Evaluating a Job Training Programme.’’ Review of Economic Studies 64:605-54.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. ‘‘Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.’’ Econometrica 71:1161-89.
Imbens, Guido W. 2000. ‘‘The Role of the Propensity Score in Estimating Dose-Response Functions.’’ Biometrika 87:706-10.
Imbens, Guido W.  2004. ‘‘Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.’’ Review of Economics and Statistics 86:4-29.
Angrist, Joshua D. and Alan B. Krueger. 1999. ‘‘Empirical Strategies in Labor Economics.’’ Pp. 1277-1366 in Handbook of Labor Economics, vol. 3, edited by O. C. Ashenfelter and D. Card. Amsterdam: Elsevier.
Paper 366-2008 SAS Global Forum. The Use of Propensity Scores and Instrumental Variable Methods to Adjust For Treatment Selection Bias R. Scott Leslie, MedImpact Healthcare Systems, Inc., San Diego, CA Hassan Ghomrawi, MedImpact Healthcare Systems, Inc., San Diego, CA
Lechner M. 2002. Some practical issues in the evaluation of heterogeneous  labour  market programmes by matching methods. Journal of the Royal Statistical Society. Series A, 165, pp. 59-82.
Equivalence between Fixed Effects and Matching Estimators for Causal Inference Kosuke Imaiy In Song Kimz
Sociological Methods & Research. Volume 35 Number 1.  August 2006. Matching Estimators of Causal Effects Prospects and Pitfalls in Theory and Practice.  Stephen L. Morgan Cornell and David J. Harding