Friday, August 15, 2014

Implications of Maximum Likelihood Methods for Missing Data in Predictive Modeling Applications

I believe there are some key things to consider when dealing with missing data in two settings: 1) estimating a model to be used for prediction (training) and 2) using the model to predict new cases (scoring), which may also have missing values for key predictor variables. For purposes of this discussion, I'm thinking specifically about situations where one intends to use maximum likelihood (ML) or full information maximum likelihood (FIML) to estimate the parameters that define, or train, the model. You must consider how to handle missing values in both the training and scoring exercises. There may also be distinctions to consider between a purely predictive/machine learning application and causal inference.

If your goal is simply to estimate parameter values to make causal inferences, i.e. to evaluate treatment effects, then most likely you will only be concerned with handling missing values during the training or estimation stage. Again, in this post I am concerned with predictive modeling applications rather than causal inference. I will start with a short discussion of maximum likelihood estimation.

Standard Maximum Likelihood:

Maximize L = Π f(y,x1,…xk;β)  

With standard ML, the likelihood function is maximized, providing the values of β that define our regression model (e.g. Y = β0 + β1 x1 + … + βk xk + e).

As is the case in many modeling scenarios, with standard MLE only complete cases are used to estimate the model. That is, for each 'row' or individual case, all values of 'x' and 'y' must be defined. If a single explanatory variable 'x' or the dependent variable 'y' has a missing value, then that individual/case/row is excluded from the data. This is referred to as listwise deletion. In many scenarios this can be undesirable because, for one thing, you are reducing the amount of information used to estimate your model. Paul Allison has a very informative discussion of this in a recent post at Statistical Horizons.
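
To make listwise deletion concrete, here is a minimal sketch in Python. The toy data and the function name listwise_delete are made up for illustration; None stands in for a missing value.

```python
# Hypothetical toy data: each row is (y, x1, x2); None marks a missing value.
rows = [
    (3.1, 1.0, 2.0),
    (2.7, None, 1.5),   # x1 missing
    (4.0, 2.0, None),   # x2 missing
    (None, 1.2, 0.7),   # y missing
    (3.6, 1.8, 2.2),
]


def listwise_delete(data):
    """Keep only the rows in which every variable is observed."""
    return [row for row in data if all(value is not None for value in row)]


complete_cases = listwise_delete(rows)
print(f"{len(complete_cases)} of {len(rows)} rows remain")
```

In this toy example three of the five rows are dropped, even though each of them is missing only one value.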

Full Information Maximum Likelihood (FIML): 

Maximize L = Π f(y,x1,…xk;β)  Π f(y,x3,…xk;β)  

Full information maximum likelihood is an estimation strategy that allows us to get parameter estimates even in the presence of missing data.  The overall likelihood is the product of the likelihoods specified for all observations. If there are m observations with no missing values but n observations missing x1 and x2, we account for that by specifying the overall likelihood function as the product of two terms, i.e. the likelihood function is specified as a product of likelihoods for both complete and incomplete cases.  In the example above, the first term is the product of the likelihoods for the m complete cases. The second term is the product over the n cases missing the first two variables (x1 and x2); each such individual ‘i’ contributes a likelihood based only on the variables observed for that case. The overall likelihood is then maximized, providing the values for β which define our regression model.
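
As a rough illustration of how such a likelihood is assembled, the sketch below evaluates a FIML-style log-likelihood for the simplest possible case: one outcome y and one predictor x1 that is sometimes missing. It assumes (y, x1) are jointly normal with mean vector mu and covariance matrix sigma. That distributional assumption, the function names, and the numbers are all mine, chosen for illustration. Complete cases contribute the joint density; cases missing x1 contribute only the marginal density of y.

```python
import math


def log_density_y_and_x(y, x, mu, sigma):
    """Bivariate normal log density for a complete case (y, x)."""
    mu_y, mu_x = mu
    (s_yy, s_yx), (_, s_xx) = sigma
    det = s_yy * s_xx - s_yx ** 2
    dy, dx = y - mu_y, x - mu_x
    quad = (s_xx * dy ** 2 - 2.0 * s_yx * dy * dx + s_yy * dx ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_density_y_only(y, mu, sigma):
    """Marginal normal log density of y, used when x is missing."""
    mu_y, s_yy = mu[0], sigma[0][0]
    return -0.5 * math.log(2.0 * math.pi * s_yy) - 0.5 * (y - mu_y) ** 2 / s_yy


def fiml_log_likelihood(rows, mu, sigma):
    """Sum the log-likelihood contributions of complete and incomplete cases."""
    total = 0.0
    for y, x in rows:
        if x is None:
            total += log_density_y_only(y, mu, sigma)      # incomplete case
        else:
            total += log_density_y_and_x(y, x, mu, sigma)  # complete case
    return total


# Hypothetical data and parameter values, for illustration only.
rows = [(1.0, 2.0), (0.5, None), (-0.3, 0.4)]
mu = (0.0, 0.0)
sigma = ((1.0, 0.3), (0.3, 1.0))
print(fiml_log_likelihood(rows, mu, sigma))
```

This only evaluates the log-likelihood at one set of parameter values. An actual FIML routine would search over the parameters with a numerical optimizer to find the values that maximize it. Note that no row is discarded and no missing value is filled in.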

Both ML and FIML are methods for estimating parameters; they are not imputation procedures per se.  As Karen Grace-Martin (The Analysis Factor) aptly puts it, “This method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates.”

Predictive Modeling Applications

So if we have missing data, we could use FIML to obtain parameter estimates for a model. But what if we actually want to predict outcomes ‘y’ for some new data set, i.e. we want to 'score' a new data set using the model we just estimated? By assumption, if we are trying to predict ‘y’, we don’t have values for y in our data set.  We will attempt to take the parameter estimates we got from FIML and predict y based on the estimated β’s and the observed x’s. But what if the new data have missing x’s? Can’t we just use FIML to get our model and predictions? No. We have already derived our model via FIML using our original, or training, data. Again, FIML is a parameter estimation procedure. To apply FIML in our new data set would imply two things:

1)   We want to estimate a new model on new data
2)   We have observed values for what we are trying to predict, ‘y’, which by assumption we do not have in a prediction or scoring scenario!

So there is no way to properly specify the likelihood to even implement FIML to estimate a new model in a new data set!
But we don't want to estimate a new model in the first place. If we want to make new predictions based on our original model developed using FIML, we have to use some type of actual imputation procedure to derive values for the missing x’s in the new 'scoring' data set.
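
A minimal sketch of that scoring step is below. It uses simple mean imputation, filling each missing x with that variable's mean from the training data, purely because it is the easiest procedure to show; it is not a recommendation over other imputation methods. The coefficient values, training means, and function names are hypothetical.

```python
def impute_with_training_means(row, training_means):
    """Replace each missing x (None) with that variable's training-data mean."""
    return [mean if value is None else value
            for value, mean in zip(row, training_means)]


def score(row, beta):
    """Linear prediction b0 + b1*x1 + ... + bk*xk for one fully observed row."""
    intercept, slopes = beta[0], beta[1:]
    return intercept + sum(b * x for b, x in zip(slopes, row))


# Hypothetical values: beta_hat stands in for estimates obtained earlier
# (e.g. via FIML on the training data); training_means are the means of
# x1 and x2 in that same training data.
beta_hat = [1.0, 2.0, -1.0]        # b0, b1, b2
training_means = [0.5, 3.0]        # mean of x1, mean of x2

# New cases to score: no y is available, and x1 is missing in the second case.
new_cases = [[1.5, 2.0], [None, 4.0]]

predictions = [
    score(impute_with_training_means(row, training_means), beta_hat)
    for row in new_cases
]
print(predictions)
```

The model itself is never re-estimated here. The β's stay fixed at their training values, and only the missing x's in the scoring data are filled in before prediction.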

References:

Handling Missing Data by Maximum Likelihood. Paul D. Allison, Statistical Horizons, Haverford, PA, USA. SAS Global Forum Paper 312-2012.

Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood. Karen Grace-Martin, The Analysis Factor. Accessed 8/14/14.

Listwise Deletion: It's Not Evil. Paul Allison, Statistical Horizons. June 13, 2014.