Saturday, January 14, 2017

Identification Through Copulas

Recently I attended a talk (see Zimmer and Trivedi below) that referenced work by Han and Vytlacil using copulas to estimate probit models with dummy endogenous regressors. The seminar offered an extension to other types of models, but here I want to summarize the approach more generally. The referenced working paper, which I am told is forthcoming in the Journal of Econometrics, is listed below for more details.

Copula functions can be used to simulate a dependence structure independently from the marginal distributions.

Based on Sklar's theorem the multivariate distribution F can be represented by copula C as follows:

F(x1…xp) = C{ F1(x1),…, Fp(xp); θ}
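To make the separation between dependence and marginals concrete, here is a minimal sketch in Python (standard library only). It assumes a Gaussian copula, with a correlation rho playing the role of θ, and two arbitrarily chosen marginals (exponential and normal). The function names and parameter values are illustrative assumptions, not anything taken from the referenced papers.

```python
import math
import random
from statistics import NormalDist

def sample_gaussian_copula(n, rho, seed=0):
    """Draw n pairs (u1, u2) of uniforms whose dependence is governed by rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal and has correlation rho with z1
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each score to a uniform on (0, 1)
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs

def exponential_quantile(u, rate):
    """Inverse CDF of an exponential distribution (assumed marginal F1)."""
    return -math.log(1.0 - u) / rate

def normal_quantile(u, mu, sigma):
    """Inverse CDF of a normal distribution (assumed marginal F2)."""
    return NormalDist(mu, sigma).inv_cdf(u)

def pearson(a, b):
    """Plain Pearson correlation, used here on the uniforms as a dependence check."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# The copula supplies the dependence; the quantile functions supply the marginals.
uniforms = sample_gaussian_copula(5000, rho=0.7)
x1 = [exponential_quantile(u1, rate=2.0) for u1, _ in uniforms]
x2 = [normal_quantile(u2, mu=10.0, sigma=3.0) for _, u2 in uniforms]
dependence = pearson([u for u, _ in uniforms], [v for _, v in uniforms])
```

Swapping either quantile function for a different distribution changes that marginal only; the dependence measured on the uniforms stays the same.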

The parameter θ represents the dependence among the marginal distributions (between F1 and F2 in the bivariate case). Now let's set up the framework for what we are trying to model.
Suppose we want to predict some outcome Y. Let

Y = f(x,D)

where x is a vector of controls and D is a treatment indicator. We are interested in estimating the coefficient on D as our measure of the treatment effect. However, suppose that there is selection bias, such that those who choose to engage in the program indicated by D are more likely to have higher levels of Y regardless of treatment. (See the following for more on selection bias, unobserved heterogeneity, and endogeneity.)

We can model selection as follows:

D = g(x,z)

where x is a vector of controls and z is an instrument, correlated with the probability of D but uncorrelated with the unobserved factors that drive the outcome Y. We can jointly model the outcome and selection functions using copulas where:

P(Y, D|x,z) = C{ F(.), G(.); θ} 

As it turns out, the term θ captures the dependence between outcome and selection allowing for unbiased estimation of treatment effects associated with D. Han and Vytlacil extend the results to cases without instruments.
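The role of θ can be illustrated with a small simulation in Python (standard library only). This is a sketch under strong assumptions: a linear outcome, normal unobservables linked by a Gaussian copula with correlation rho standing in for θ, and a true treatment effect of 1. It is not the Han and Vytlacil estimator; it only shows that a naive comparison of treated and untreated means is biased when the dependence is nonzero.

```python
import math
import random

def naive_effect(rho, n=20000, true_effect=1.0, seed=1):
    """Difference in mean Y between treated and untreated units.

    rho is the (assumed) dependence between the unobservables of the
    selection equation and the outcome equation.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # instrument: shifts D only
        v = rng.gauss(0.0, 1.0)  # unobservable in the selection equation
        # Outcome unobservable, correlated with v when rho != 0
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        d = 1 if z + v > 0 else 0  # selection: D = g(z) plus unobservable
        y = true_effect * d + u    # outcome: Y = f(D) plus unobservable
        (treated if d else untreated).append(y)
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

with_selection = naive_effect(rho=0.6)     # overstates the true effect of 1.0
without_selection = naive_effect(rho=0.0)  # close to the true effect of 1.0
```

A joint model that estimates the dependence parameter along with the outcome and selection equations is what allows the two sources of variation to be separated.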


Han, S. and E. Vytlacil (2015). Identification in a generalization of bivariate probit models with dummy endogenous regressors. Working paper, University of Texas at Austin.

Trivedi, P. K. and D. M. Zimmer (2016). A Note on Identification of Discrete Bivariate Copulas. August 5, 2016.

Tuesday, January 10, 2017

Mediators, Moderators, and Mechanisms

Recently Marc Bellemare shared a post highlighting an article in the American Political Science Review, Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects. He does an awesome job giving an overview of the article. If you read his post, you will see that the paper emphasizes causal mechanisms and introduces them through controlled direct effects:

"their method not only tells you whether M is a mechanism through which D causes y, it can also tell you whether there is any significant amount of statistical variation left in the causal relationship flowing from D through y after M is accounted for"

Previously, I have been working on a post related to mediators and moderators, and his post motivated me to wrap it up today.

In the article Mediators and Mechanisms of Change in Psychotherapy Research, Kazdin provides some clarity about the differences and relationships between mediators, moderators, and mechanisms:

Mediator: an intervening variable that may account (statistically) for the relationship between the independent and dependent variable. Something that mediates change may not necessarily explain the processes of how change came about. Also, the mediator could be a proxy for one or more other variables or be a general construct that is not necessarily intended to explain the mechanisms of change. A mediator may be a guide that points to possible mechanisms but is not necessarily a mechanism.

Mechanism: the basis for the effect, i.e., the processes or events that are responsible for the change; the reasons why change occurred or how change came about.

Moderator: a characteristic that influences the direction or magnitude of the relationship between an independent and dependent variable. If the relationship between variable x and y is different for males and females, sex is a moderator of the relation. Moderators are related to mediators and mechanisms because they suggest that different processes might be involved (e.g., for males or females).


Mediators and Mechanisms of Change in Psychotherapy Research
Alan E. Kazdin Annu Rev Clin Psychol. 2007;3:1-27

Mediators and Moderators

Moderators: With moderation, a third variable impacts or interacts with the relationship between two other variables. We would say the relationship between two variables is ‘moderated.’ This can be thought of as an interaction in a standard regression:

Y = b0 + b1*X1 + b2*X2 + b3*X1*X2

b3 = moderating effect, i.e., the relationship between Y and X1 changes with levels of X2
b1 = impact of X1 on Y when X2 = 0

So in the context of the relationship between Y and X1, X2 is a moderator.
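As a rough illustration of the interaction model, the following Python sketch (standard library only) simulates data with a known moderating effect and recovers it by least squares. The coefficient values, the binary moderator, and the small `ols` helper are all assumptions made for the example.

```python
import random

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [ar - factor * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(42)
rows, y = [], []
for _ in range(2000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = rng.choice([0, 1])  # assumed binary moderator
    rows.append([1.0, x1, x2, x1 * x2])
    # Assumed true model: b0=1, b1=2, b2=0.5, b3=1.5
    y.append(1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.gauss(0.0, 1.0))

b0, b1, b2, b3 = ols(rows, y)
slope_when_x2_is_0 = b1       # effect of X1 on Y at X2 = 0
slope_when_x2_is_1 = b1 + b3  # effect of X1 on Y at X2 = 1
```

The difference between the two slopes is exactly b3, which is why the interaction coefficient is read as the moderating effect.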

Mediators: With mediation, a third variable intervenes in the relationship between two other variables. For example, in the diagram below, suppose we are interested in the relationship between x and y. This relationship may be ‘mediated’ by a third variable m, with path a running from x to m, path b from m to y, and path c directly from x to y.

Consider a model where y = grade in course (our outcome of interest), x = IQ, and m = study skills. We might hypothesize that study skills ‘mediate’ the effect of IQ on course grade. A perfectly brilliant person might do OK on an exam through educated guesses, but we all might know of cases where brilliant students have done quite poorly due to lax study skills. So while there may be a direct effect of IQ on grades, IQ -> grades or x -> y, there is an indirect effect as well, IQ -> study skills -> grade or x -> m -> y.

This implies that mediation can take a number of forms and can be formally tested. In the case of full mediation, the relationship between x and y becomes insignificant after a mediator ‘m’ is included in the model, that is, our estimate of c (the path or direct effect between x and y) isn’t significantly different from 0. Partial mediation occurs if the relationship between x and y, or c, is reduced (but remains significant) after m is entered into the model. In this case we could say that x has both a direct effect on y (through the path c) and an indirect effect (through the mediator m, or paths a and b).

These relationships can be formally tested as laid out in Hair et al:

1) Test for significant correlations between x,y or estimate c; x,m or estimate a; m,y or estimate b
2) If c is significant after m is included, and the magnitude of c does not change then m is not a mediator.
3) If the magnitude of c is reduced after including m, and c remains significant, then m is a mediator. This is a case of partial mediation.
4) If including m in the model reduces the magnitude of c such that it is not significantly different from 0, then m is a mediator and this is considered a case of full mediation.
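The steps above can be sketched in Python (standard library only) on simulated data. The names `c_total` and `c_direct` are labels introduced here for the estimate of c before and after m is included; the path values are assumed, and the significance tests that the steps call for are left out, so this only shows the point estimates.

```python
import random

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [ar - factor * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(7)
# Assumed data-generating process: a = 0.8, b = 0.5, direct effect = 0.3
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
m = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]
y = [0.3 * xi + 0.5 * mi + rng.gauss(0.0, 1.0) for xi, mi in zip(x, m)]

_, c_total = ols([[1.0, xi] for xi in x], y)                        # y on x
_, a = ols([[1.0, xi] for xi in x], m)                              # m on x
_, c_direct, b = ols([[1.0, xi, mi] for xi, mi in zip(x, m)], y)    # y on x and m

indirect = a * b  # effect of x on y that runs through m
```

With linear least squares the total effect splits exactly into `c_direct + a * b`, so a drop from `c_total` to `c_direct` is the same thing as a nonzero indirect path. Here `c_direct` stays well away from zero, which corresponds to partial mediation.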

Reference: Multivariate Data Analysis. 6th Edition. Hair, Black, Babin, Anderson and Tatham. Pearson-Prentice Hall. 2006.

Thursday, November 17, 2016

Copula Based Regression

Recently I ran across an article in the Casualty Actuarial Society's publication Variance that discussed copula based regression. 

From the abstract: 

In this paper, we present copula regression as an alternative to OLS and GLM. The major advantage of a copula regression is that there are no restrictions on the probability distributions that can be used. In this paper, we will present the formulas and algorithms necessary for conducting a copula regression analysis using the normal copula. However, the ideas presented here can be used with any copula function that can incorporate multiple variables with varying degrees of association.

In the paper they outline a 3 step process for accomplishing this:

1) Assume a model for the joint distribution of all the variables (response and covariates)
2) Estimate the parameters of the model (the parameters for the selected marginal distributions and the parameters of the copula)
3) Compute the predicted values of Y given a set of covariates by using the conditional mean of Y given the covariates.
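Step 3 can be sketched for the simplest case in Python (standard library only): one covariate, a normal copula, a normal marginal for X, and an exponential marginal for Y. All parameter values are assumed to be known here rather than estimated as in step 2, and the function names are illustrative; this is not the algorithm from the paper, just a numerical integration of the conditional mean under those assumptions.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_sf(z):
    """Survival function 1 - Phi(z), via erfc for accuracy in the upper tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def exponential_quantile_from_z(z, rate):
    """F_Y^{-1}(Phi(z)) for an exponential marginal: -ln(1 - Phi(z)) / rate."""
    return -math.log(std_normal_sf(z)) / rate

def conditional_mean_y(x, rho, x_mu, x_sigma, y_rate, step=0.01, limit=8.0):
    """E[Y | X = x] under a normal copula with correlation rho.

    On the normal-score scale, z_y given z_x is N(rho * z_x, 1 - rho^2),
    so we integrate the Y quantile function against that density
    using the trapezoid rule.
    """
    z_x = (x - x_mu) / x_sigma  # normal score of x under its normal marginal
    scale = math.sqrt(1.0 - rho ** 2)
    n = int(round(2 * limit / step))
    total = 0.0
    for i in range(n + 1):
        e = -limit + i * step
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        z_y = rho * z_x + scale * e
        total += weight * exponential_quantile_from_z(z_y, y_rate) * std_normal_pdf(e)
    return total * step

# Assumed parameters: X ~ Normal(5, 2), Y ~ Exponential(rate 0.5), rho = 0.6
pred_low = conditional_mean_y(3.0, 0.6, 5.0, 2.0, 0.5)
pred_mid = conditional_mean_y(5.0, 0.6, 5.0, 2.0, 0.5)
pred_high = conditional_mean_y(7.0, 0.6, 5.0, 2.0, 0.5)
# With rho = 0 the prediction collapses to the marginal mean of Y
pred_independent = conditional_mean_y(7.0, 0.0, 5.0, 2.0, 0.5)
```

Note that the predictions are not a linear function of x even though the dependence is captured by a single correlation, because the exponential quantile function bends the relationship.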

I'd also like to point out an interesting virtual course on copula regression from Dr. Edward Frees at the University of Wisconsin. I have not had a chance to view these materials in detail, but I think they could be valuable to anyone wanting to learn more about these methods.

Additional Reading:

Modeling Dependence with Copulas and R
Copula Based Agricultural Risk Models
Intro to Copulas using SAS 
Copulas, R, and the Financial Crisis

Parsa, Rahul A, and Stuart A. Klugman, "Copula Regression," Variance 5:1, 2011, pp. 45-54.

Sunday, October 30, 2016

Modeling Dependence with Copulas and Quantmod in R

A while back I produced a few posts related to copulas. In A Basic Intro to Copulas I played around with some examples using SAS. This gave me some idea about how you could fit copulas based on parameters estimated from real data. Then I did some exploratory work using the copula package in R. In Copulas Functions, R and the Financial Crisis I mentioned some properties of copulas and how they were used in some of the risk modeling related to mortgage backed securities that went bad during the financial crisis. I used R to make some copula based plots, but it just did not click with me how I could use the copula functions in R to do anything concrete.

However, recently I ran across a post Modelling Dependence with Copulas in R, in which the author walks through a number of motivating examples giving some really powerful intuition about how copulas work, as well as a concrete application using stock data.

In this post (or really just the R code below) I modified the code somewhat to pull the data from Yahoo Finance using the quantmod package in R. Then I pulled out the key segments of code that estimate the copulas and use parameters derived from the actual stock data to simulate returns for two stocks. The original post used Yahoo and Cree, Inc.; I used Monsanto and ADM instead.
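For readers who want the gist of that workflow without R, here is a rough Python sketch (standard library only). It is not a translation of the R code: it assumes a Gaussian copula fitted by normal scores of the ranks, uses empirical quantiles for the marginals, and runs on synthetic placeholder series rather than actual Monsanto and ADM returns. All function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def pseudo_observations(values):
    """Rank-transform values into (0, 1) using rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def fit_gaussian_copula(returns_a, returns_b):
    """Estimate the copula correlation from normal scores of the ranks."""
    nd = NormalDist()
    za = [nd.inv_cdf(u) for u in pseudo_observations(returns_a)]
    zb = [nd.inv_cdf(u) for u in pseudo_observations(returns_b)]
    return correlation(za, zb)

def empirical_quantile(sorted_values, u):
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

def simulate_returns(returns_a, returns_b, n, seed=0):
    """Simulate n return pairs with the fitted dependence and empirical marginals."""
    rho = fit_gaussian_copula(returns_a, returns_b)
    rng, nd = random.Random(seed), NormalDist()
    sorted_a, sorted_b = sorted(returns_a), sorted(returns_b)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((empirical_quantile(sorted_a, nd.cdf(z1)),
                      empirical_quantile(sorted_b, nd.cdf(z2))))
    return rho, pairs

# Synthetic placeholder "daily returns" (NOT real stock data), built with correlation 0.6
rng = random.Random(2016)
series_a, series_b = [], []
for _ in range(750):
    common = rng.gauss(0.0, 1.0)
    series_a.append(0.015 * common)
    series_b.append(0.015 * (0.6 * common + 0.8 * rng.gauss(0.0, 1.0)))

rho_hat, simulated = simulate_returns(series_a, series_b, n=750)
simulated_corr = correlation([p[0] for p in simulated], [p[1] for p in simulated])
```

Plotting the simulated pairs against the input series is the Python analogue of the simulated versus actual returns plot discussed in this post.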

One thing discussed in the post that I found very interesting was related to the VineCopula application. This was used to help decide which copula specification might make the most sense for a given data structure.

"In the first example above I chose a normal copula model without much thinking, however, when applying these models to real data one should carefully think at what could suit the data better. For instance, many copulas are better suited for modelling asymetrical correlation other emphasize tail correlation a lot, and so on. My guess for stock returns is that a t-copula should be fine, however a guess is certainly not enough. Fortunately, the VineCopula package offers a great function which tells us what copula we should use. Essentially the VineCopula library allows us to perform copula selection using BIC and AIC through the function BiCopSelect()"

Below is the last plot from my code, of simulated returns vs actual daily returns for Monsanto and ADM for the last 3 years. Echoing the caveat above, I didn't use the VineCopula package to select a copula; my intent here was just to look at some example code and a practical application of copulas using R with real data.

I won't share the previous plots etc. but refer you to the original post, where they walk through the code almost line by line and demonstrate really well how this works. My code is below, but they have the entire program for their original post on GitHub.

R Code: