Sunday, January 29, 2012

Decision Tree Basics in SAS and R

Assume we want to use a decision tree to predict 'green' vs. 'red' cases (see below; note that this plot of the data was actually created in R).

 
We want to use two variables, say X1 and X2, to make a prediction of 'green' or 'red'. In its simplest form, the decision tree algorithm searches through the values of X1 and X2 and finds the values that do the 'best' job of 'splitting' the cases. For each candidate split value of X1 or X2, the algorithm performs a chi-square test. The values that create the best 'split' are chosen; in SAS Enterprise Miner this choice is based on a metric called 'worth', which is a function of the p-value associated with the chi-square test [worth = -log(p-value)]. (See below.) A short sketch of this calculation for a single split follows.
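To make the 'worth' idea concrete, here is a minimal R sketch (not the SAS Enterprise Miner internals) that scores one candidate split. The data frame dat1 and the cutoff of 0.5 are hypothetical choices for illustration, and I use log base 10:

# Hedged sketch: score one candidate split of x1 at a hypothetical cutoff of 0.5.
# 'dat1' (columns: class, x1, x2) is assumed to be the data plotted above.
split <- dat1$x1 <= 0.5                  # candidate split
tab   <- table(split, dat1$class)        # 2 x 2 contingency table
chi   <- chisq.test(tab, correct = FALSE)
worth <- -log10(chi$p.value)             # worth = -log(p-value); larger = better split
worth

In practice the algorithm repeats this for every candidate cutoff of every input and keeps the split with the highest worth.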
 
After all of the 'best' splits are determined, these values become rules for distinguishing between cases (in this case 'green' vs. 'red'). What you end up with in the end is a set of 'rectangles' defined by a set of 'rules' that can be visualized with a 'tree' diagram (see visualization from R below).


 

(see visualization from SAS Enterprise Miner below)




But, in SAS Enterprise Miner, the algorithm then goes on to look at new data (validation data) and assesses how well the splitting rules do in terms of predicting or classifying new cases. If it finds that a 'smaller' tree with fewer variables, splits, rules, or 'branches' does a better job, it prunes or trims the tree and removes those branches from the analysis. In the end, you get a final set of rules that can then be applied algorithmically to predict new cases based on observed values of X1 and X2 (or whatever variables are used by the tree). A sketch of the analogous step in R appears below.
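R's rpart does something similar through cross-validation rather than a separate validation data set. This is a minimal hedged sketch, not SAS Enterprise Miner's method, and it assumes the fitted tree object r from the code at the end of this post:

# Hedged sketch: prune an rpart tree using its cross-validated error.
# Assumes 'r' is the rpart fit from the code below (library(rpart) loaded).
printcp(r)                                            # cp table with cross-validated error (xerror)
best.cp  <- r$cptable[which.min(r$cptable[, "xerror"]), "CP"]
r.pruned <- prune(r, cp = best.cp)                    # drop branches that don't lower xerror
plot(r.pruned); text(r.pruned)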

In SAS Enterprise Miner, with each split created by the decision tree, a metric for importance based on the Gini impurity index (a measure of variability or impurity for categorical data) is calculated. This measures how well the tree distinguishes between cases (again, 'green' vs. 'red', or 'retained' vs. 'non-retained' in our model) and ultimately how well the model explains whatever it is we are trying to predict. Overall, if splits based on values of X2 'reduce impurity' more than splits based on values of X1, then X2 would be considered the 'best' or 'most' predictive variable. The short sketch below spells out the calculation.
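For intuition, here is a hand-rolled R sketch of Gini impurity and the impurity reduction from one split (rpart and SAS Enterprise Miner compute this internally; the cutoff in the example line is hypothetical):

# Hedged sketch: Gini impurity for a node, and the impurity reduction from a split.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)                    # 0 = pure node; larger = more mixed
}

gini.reduction <- function(y, split) {
  n  <- length(y)
  nl <- sum(split)
  nr <- n - nl
  gini(y) - (nl/n) * gini(y[split]) - (nr/n) * gini(y[!split])
}

# e.g., impurity reduction from splitting on a hypothetical cutoff of x2:
# gini.reduction(dat1$class, dat1$x2 <= 0.5)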

The rpart algorithm used in R may differ in some of the specific details, but as discussed in The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2008), decision trees all work in essentially the same way, fitting the model by recursively partitioning the feature space into rectangular subsets.

 R code for the Decision Tree and Visualizations Above: 

# *------------------------------------------------------------------
# | PROGRAM NAME: R_tree_basic 
# | DATE:4/26/11    
# | CREATED BY: Matt Bogard 
# | PROJECT FILE:P:\R  Code References\Data Mining_R              
# *----------------------------------------------------------------
# | PURPOSE: demo of basic decision tree mechanics               
# |
# *------------------------------------------------------------------
 
rm(list=ls()) # get rid of any existing data 
ls() # view open data sets
 
setwd('/Users/wkuuser/Desktop/R Data Sets') # mac 
setwd("P:\\R  Code References\\R Data") # windows
 
library(rpart) # load the rpart decision tree library
 
# *------------------------------------------------------------------
# | get data            
# *-----------------------------------------------------------------
 
dat1 <-  read.csv("basicTree.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
plot( dat1$x2, dat1$x1, col = dat1$class) # plot data space
 
# fit decision tree
 
(r <- rpart(class ~ x1 + x2, data = dat1)) 
 
plot(r)
text(r)
 
library(rattle) # data mining package
drawTreeNodes(r) # for more detailed tree plot supported by rattle

Saturday, January 21, 2012

Predictive Analytics In Higher Ed

See also: Using SAS® Enterprise BI and SAS® Enterprise Miner™ to


A colleague shared with me the following article: The Predictive Analytics Reporting Framework Moves Forward. This is a great example of the 'multicultural' approach to data analysis that I discussed in 'Culture War: Classical Statistics vs. Machine Learning'.

From the article:

Wagner: "For some in education what we’re doing might seem a bit heretical at first--we’ve all been warned in research methods classes that data snooping is bad! But the newer technologies and the sophisticated analyses that those technologies have enabled have helped us to move away from looking askance at pattern recognition. That may take a while in some research circles, but in decision-making circles it’s clear that pattern recognition techniques are making a real difference in terms of the ways numerous enterprises in many industries are approaching their work"

The fact that they are willing to step out of what Leo Breiman described as the 'statistical straight jacket' of traditional research methods and embrace 'heretical' pattern recognition and data mining algorithms is impressive. The paragraph excerpted above, and the distinction between 'research circles' and 'decision making circles', indicates to me that they clearly get what he was talking about over a decade ago in his famous article 'Statistical Modeling: The Two Cultures' ( http://bit.ly/bqtZ6I ):

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. (i.e. 'research circles') This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.Algorithmic modeling,both in theory and practice,has developed rapidly in fields outside statistics."(like 'decision making circles').

As far as 'newer' technologies and the 'sophisticated analysis' that they can employ, I can only see opportunities for programs like SAS Enterprise Miner, JMP, and R.

Reference:
Leo Breiman. Statistical Modeling: The Two Cultures.  Statist. Sci. Volume 16, Issue 3 (2001), 199-231.

Time Series Intervention Analysis with R and SAS

In a previous post, I worked through the theory behind intervention analysis. In his time series course, University of Georgia political science professor Jamie Monogan demonstrates how to implement intervention analysis in R. The following example is from that course. It investigates the impact of the terrorist attacks of 9/11 on President Bush's approval ratings. An excerpt from the data set follows:


   year month  approve  t s11
1  2001     1 45.41947  1   0
2  2001     2 55.83721  2   0
3  2001     3 55.91828  3   0
4  2001     4 58.12725  4   0
5  2001     5 55.79231  5   0
.
.
.
etc


Note in the plot of the series above, observation 10 represents October 2001, post-9/11/2001. It appears that there is a drastic shift in the series that slowly decays and eventually returns to previous levels. This may indicate an intervention model with a pulse function. Following the Box-Jenkins methodology, an ARIMA(1,0,0) model with the intervention can be specified in R as follows:

model <- arimax(bush$approve, order=c(1,0,0), xtransf=bush$s11, transfer=list(c(1,0)))

where:
bush$approve = Yt, Bush's approval ratings
order = c(1,0,0) indicates an AR(1) process
xtransf = bush$s11 is the intervention series, flagging Pt = 1 to indicate September 11, 2001
transfer = list(c(1,0)) indicates the functional form of the transfer function. The first parameter gives the order of the denominator, while the second gives the order of the numerator. As specified in the code above, the transfer function is:

[w0/(1 - δB)]

Based on SAS documentation (http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect014.htm), this could also be specified in SAS via PROC ARIMA as follows:

   proc arima data=bush;
      identify var=approve crosscorr=s11;
      estimate p =1  input=( / (1) s11 );
   run;

Using R, the estimated model coefficients are:
Coefficients:
         ar1  intercept  T1-AR1   T1-MA0
      0.8198    58.2875  0.8915  23.6921
s.e.  0.1184     4.6496  0.0166   3.9415

Based on the output above, w0 = 23.6921 and δ = .8915. The plot below shows the fitted values, which look very similar to the graphs for this specification shown in the previous post.


The code below was used to conduct this analysis, and is excerpted from Jamie Monogan's original R program.
#   ------------------------------------------------------------------
#  | PROGRAM NAME: R_ARIMAX
#  | DATE:1/21/12
#  | CREATED BY: Matt Bogard 
#  | PROJECT FILE: /Users/wkuuser/Desktop/Briefcase/R Programs              
#  |----------------------------------------------------------------
#  | PURPOSE: illustration of ARIMA model with transfer function/intervention series               
#  |
#  |
#  | 
#  |------------------------------------------------------------------
#  | COMMENTS: R code and data originally sourced from : http://monogan.myweb.uga.edu/teaching/ts/index.html
#  |------------------------------------------------------------------
 
#libraries
rm(list=ls())
library(foreign)
library(TSA)
 
setwd('/Users/wkuuser/Desktop/briefcase/R Data Sets') # mac 
 
#load data & view series
 
bush <- read.dta("BUSHJOB.DTA")
names(bush)
print(bush)
 
plot(y=bush$approve, x=bush$t, type='l')
 
#identify arima process
acf(bush$approve)
pacf(bush$approve)
 
#estimate arima model
mod.1 <- arima(bush$approve, order=c(0,1,0))
mod.1
 
#diagnose arima model
acf(mod.1$residuals)
pacf(mod.1$residuals)
Box.test(mod.1$residuals)
 
#Looks like I(1). Note: There is a theoretical case against public opinion data being integrated.
 
#estimate intervention analysis for September 11 (remember to start with a pulse)
 
mod.2 <- arimax(bush$approve, order=c(0,1,0), xtransf=bush$s11, transfer=list(c(1,0))); mod.2
 
summary(mod.2)
 
#Notes: the second parameter directs the numerator. If 0 only, then concurrent effect only. The first parameter affects the denominator. c(0,0) replicates the effect of just doing "xreg"
#Our parameter estimates look good, no need to drop delta or switch to a step function.
 
#Graph the intervention model
y.diff <- diff(bush$approve)
t.diff <- bush$t[-1]
y.pred <- 24.3741*bush$s11 + 24.3741*(.9639^(bush$t-9))*as.numeric(bush$t>9)
y.pred <- y.pred[-1]
plot(y=y.diff, x=t.diff, type='l')
lines(y=y.pred, x=t.diff, lty=2)
 
#suppose an AR(1) process
mod.2b <- arimax(bush$approve, order=c(1,0,0), xtransf=bush$s11, transfer=list(c(1,0))); mod.2b
y.pred <- 58.2875 + 23.6921*bush$s11 + 23.6921*(.8915^(bush$t-9))*as.numeric(bush$t>9)
plot(y=bush$approve, x=bush$t, type='l')
lines(y=y.pred, x=bush$t, lty=2)

Intervention Analysis

See also: Time Series Intervention Analysis with R and SAS

In previous posts I have discussed the basics of time series analysis methods, provided an example of an applied ARIMA model (using fertilizer application data), and discussed how vector autoregressions can be used to accommodate a multivariate analysis of time series. What if you want to measure the impact of some shock or event (such as a change in government policy, the impact of a natural disaster, a negative news report, etc.) on the behavior of a series of data? For instance, let's assume beef consumption was holding steady or even trending downward. Then at time 't' someone of national prominence makes false claims about beef safety, and a downward trend in beef consumption follows. How much of this decrease can be attributed to the headline-making false claims, and how much is just the continued momentum of other factors? Or, assume that corn prices have been trending upward for some time and at some time 't' the government increases renewable fuels mandates. It follows that we notice an increase in the trend of corn prices. How much of this can be attributed to the public policy vs. the momentum of other evolving factors?

Time series intervention analysis (ARIMA modeling with intervention) provides the framework for answering these types of questions.  As explained in the introductory business forecasting text written by Newbold and Bos:

“When a significant intervention has interrupted the stable behavior of a time series of interest, it is important to attempt to explicitly model its impact. Box and Tiao (1975) introduced a procedure, known as intervention analysis, for this purpose. Essentially these authors proposed the use of the transfer function-noise class of models…but with Xt a dummy variable series defined to take the value zero up to the point in time that the intervention occurs, and the value one thereafter.”

As described by Nelson (2000), intervention analysis is

“an extension of the ARIMA methods introduced by Box and Tiao (1975) and Box and Jenkins (1976) to measure the effect of a policy change or event on the outcome of a variable...In summary, intervention models generalize the univariate Box-Jenkins methodology by allowing the time path of the dependent variable to be influenced by the time path of the intervention variables”

So it seems a time series intervention analysis can be a valuable tool for assessing the impact of some policy change or other intervention on the trend of some variable we are interested in. This, however, first requires an understanding of the transfer function-noise class of models described by Newbold and Bos.

Transfer Functions

ARIMA models are simply models of a single dependent variable  (Y) as a function of past values of Y as well as past values of the error terms (e).

For example, ARIMA(2,1,1):  Y*t = b0 + b1Y*t-1 + b2Y*t-2 + b3et-1 + et  represents an ARIMA model with 2 autoregressive terms, 1 first difference, and 1 moving average term (where Y*t denotes the differenced series). But what if we want to include the impact of some other variable X on Y? A basic regression model Y = B0 + B1X provides a static model for this relationship, but what if we want a model that captures the relationship between the time paths of both X and Y? One way to look at the impact of the time path of X on Y would be to model:

Yt = V0 + V1Xt + V2Xt-1 + V3Xt-2 + … + et
 
This can be more compactly, flexibly, and efficiently represented and estimated as

Yt = α + [w(B)/δ(B)]*Xt + et

where:

w(B) = w0 + w1B + w2B^2 + … + wrB^r
δ(B) = 1 - δ1B - … - δsB^s
B = the backshift operator
Xt = the independent explanatory variable of interest
et = noise/error term

As pointed out by Newbold and Bos, this model can often fit many data series quite well with only a few polynomial terms, such as the very popular formulation below:

Yt = α + [w0/(1 - δ1B)]*Xt + et

which is equivalent to:

Yt = α + w0Xt + w0δ1Xt-1 + w0δ1^2Xt-2 + w0δ1^3Xt-3 + … + et

This model implies that a 1 unit increase in X during time t increases Y by w0 units, increases Y by w0δ1 units in the next time period, and by w0δ1^2 units the following period, with the impact of the change in X phasing out over time. The polynomial expression [w0/(1 - δ1B)]*Xt represents what's referred to as a transfer function. A small numerical sketch of this decay pattern follows.
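To see the decay pattern numerically, here is a minimal R sketch of the implied impulse-response weights w0δ1^k; the values of w0 and δ1 are arbitrary choices for illustration, not estimates from any data:

# Hedged sketch: impulse-response weights implied by [w0/(1 - delta1*B)]*Xt.
# w0 and delta1 below are arbitrary illustration values.
w0      <- 2
delta1  <- 0.7
k       <- 0:10                     # periods after the one-unit change in X
weights <- w0 * delta1^k            # effect on Y, k periods after the change
plot(k, weights, type = "h", xlab = "periods after change in X", ylab = "impact on Y")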

What about the time path of Y itself? We previously modeled this via an ARIMA process. Letting e be represented by the ARIMA process (aka the noise process Nt) as follows:

Nt = [θ(B)θ(Bs) / Φ(B)Φ(Bs)ΔdΔs] * at,  where

B = backshift operator such that BYt = Yt-1
θ(B) = MA(q)
θ(Bs) = seasonal MA(Q)
Φ(B) = AR(p)
Φ(Bs) = seasonal AR(P)
Δd=  differencing
Δs = seasonal differencing
at = white noise term

which is just another way of representing the ARIMA(p,d,q)(P,D,Q) process where

P = # of seasonally differenced autoregressive terms
D = # of seasonal differences
Q = # of seasonally differenced moving average terms

The combined transfer function-noise model can be represented more compactly as:

Yt = C + [w(B)/δ(B)]*Xt + Nt

Or even more compactly:

Yt = C + v(B)*Xt + Nt


This brings us back to the description of intervention analysis previously described by Bos and Newbold:


“Essentially these authors proposed the use of the transfer function-noise class of models…but with Xt a dummy variable series defined to take the value zero up to the point in time that the intervention occurs, and the value one thereafter.”

That is, we use the transfer function-noise model above, with the independent variable Xt being a dummy or indicator variable flagging the intervention at time 'T'. If we let Xt be the intervention variable It, it can take two forms, depending on the impact of the intervention on the series of interest (Y):

Let It be a pulse function Pt such that:

Pt = 0 if t != T
     1 if t = T

Pulse functions are typically employed if the effects of the intervention are expected to be only temporary, and to decay over time.

Let It be a step function St such that:

St = 0 if t < T
       1 if t >= T

Step functions are typically employed if the effects of the intervention are expected to remain permanently after time T.
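In code, these intervention dummies are just 0/1 vectors. A minimal R sketch (the series length and the intervention time T = 10 are hypothetical):

# Hedged sketch: pulse and step intervention dummies for a series of length n,
# with a hypothetical intervention at period 10.
n       <- 50
T.event <- 10                            # hypothetical intervention time
period  <- 1:n
Pt <- as.numeric(period == T.event)      # pulse: 1 only in the intervention period
St <- as.numeric(period >= T.event)      # step: 1 from the intervention period onward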

So a time series intervention model can be compactly expressed in two ways, as a pulse function:

Yt = C + v(B)*Pt + Nt

Or as a step function

Yt = C + v(B)*St + Nt

Min (2008) provides some interesting graphs illustrating these various specifications.

Pulse: [wB/(1 - δB)]Pt

 
The above specification illustrates an intervention with an abrupt temporary effect 'w' that gradually decays or 'wears off' at rate δ, with a return back to original or pre-intervention levels.

Step: [wB/(1 - δB)]St

The above specification illustrates an intervention with a gradual permanent effect, with an initial impact or level change of 'w', a rate of adjustment equal to δ, and a long run response level of w/(1 - δ). To better illustrate the role of δ, note that if δ = 0 then we would have [wB]St; there would be no adjustment process, as illustrated below. The short R sketch below traces out the pulse and step response paths for a nonzero δ.
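For intuition, here is a minimal R sketch (my own toy example, with arbitrary values of w and δ) of the response paths implied by [wB/(1 - δB)] applied to a pulse versus a step input:

# Hedged sketch: response paths for a pulse vs. a step intervention under the
# transfer function [w*B/(1 - delta*B)]. w and delta are arbitrary illustration values.
w     <- 10
delta <- 0.6
k     <- 1:20                                    # periods after the intervention at time T
pulse.path <- w * delta^(k - 1)                  # abrupt temporary effect, decays back to 0
step.path  <- w * (1 - delta^k) / (1 - delta)    # gradual permanent effect, approaches w/(1 - delta)
plot(k, step.path, type = "l", xlab = "periods after intervention", ylab = "response")
lines(k, pulse.path, lty = 2)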
 

There are many other possible specifications depending on the behavior of the series we are interested in (Yt) and the impact of the intervention on the time path of Yt. As noted in Wei (1990), you can even specify a combination of step and pulse functions as follows:

[w0B/(1 - δB)]Pt + w1St

It also follows that we could have both an input series Xt  (as discussed previously in the context of transfer functions) and an intervention It specified as follows:

Yt = C + v(B)*It + w1BXt + Nt

Where It is either a step or pulse function.


References:

Nelson, Jon P. (2000). "Consumer Bankruptcies and the Bankruptcy Reform Act: A Time-Series Intervention Analysis, 1960–1997." Journal of Financial Services Research, Vol. 17, No. 2, 181-200. DOI: 10.1023/A:1008166614928

Box, G. E. P. and G. C. Tiao (1975). "Intervention Analysis with Applications to Economic and Environmental Problems." Journal of the American Statistical Association, Vol. 70, 70-79.

Newbold, P. and T. Bos (1990). Introductory Business Forecasting.

Min, Jennifer C. H. (2008). "Forecasting Japanese Tourism Demand in Taiwan Using an Intervention Analysis." International Journal of Culture, Tourism and Hospitality Research, Vol. 2, No. 3, 197-216.

Wei, William W. S. (1990). Time Series Analysis: Univariate and Multivariate Methods.

Friday, December 2, 2011

Culture War (original post)


'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science, 2001, Vol. 16, No. 3, 199–231) is an interesting paper that is a must read for anyone traditionally trained in statistics but new to the concept of machine learning. It gives perspective and context to anyone who may attempt to learn to use data mining software such as SAS Enterprise Miner, or who may take a course in machine learning (like Dr. Ng's (Stanford) YouTube lectures in machine learning). The algorithmic machine learning paradigm is in great contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.

From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data.

Classical Statistics/Stochastic Data Modeling Paradigm:

" assumes that the data are generated by a given stochastic data model. "

Algorithmic or Machine Learning Paradigm:

"uses algorithmic models and treats the data mechanism as unknown."

In a lecture for Eco 5385, Data Mining Techniques for Economists, Professor Tom Fomby of Southern Methodist University distinguishes machine learning from classical statistical techniques:

Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models.  Model Choice is based on parameter significance and In-sample Goodness-of-fit.

Machine Learning:  Focus is on Predictive Accuracy even in the face of lack of interpretability of models.  Model Choice is based on Cross Validation of Predictive Accuracy using Partitioned Data Sets.

For some, this distinction may be made more transparent by comparing the methods used under each approach. Professor Fomby does a great job making these distinctions:

Classical Statistics: Regression, Logit/Probit, Duration Models, Principal Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis
 
From the standpoint of econometrics, the data modeling culture is described very well in this post by Tim Harford:

"academic econometrics is rarely used for forecasting. Instead, econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

This is certainly consistent with the comparisons presented in the Statistical Science article. Note, however, that the methodologies referenced in the article (like logistic regression) that are utilized under the data modeling or classical statistics paradigm are a means to fill what Breiman refers to as a black box. Under this paradigm analysts are attempting to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. It is not that these methods are never used under the machine learning paradigm, but how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both, such as the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development, rating the model on total variance explained and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article. (I understand this caution has been moderated in contemporary editions.)
Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be the appropriate direction to take. (Breiman of course still takes issue, arguing that we can't trust the significance of an estimated coefficient if the model overall is a poor predictor.)

"Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions."

"Algorithmic models can give better predictive accuracy than data models,and provide better information about the underlying mechanism.
"

"The goal is not interpretability, but accurate information."

When algorithmic models are more appropriate (especially when the goal is prediction), a stochastic model designed to make inferences about specific model coefficients may provide "the right answer to the wrong question," as Emanuel Parzen puts it in his comments on Breiman.

While in graduate school, the stochastic culture was dominant. However, there was some inclusiveness; take for example Kennedy's A Guide to Econometrics. Kennedy discusses the expectation maximization algorithm (also a lecture in Dr. Ng's course) and also provides a great introduction to neural networks, which greatly influenced my understanding of such allegedly 'uninterpretable' algorithmic techniques.


Keeping an Open Mind: Multiculturalism in Data Science

As  Breiman states:

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

A multicultural approach to analysis (stochastic or algorithmic) seems to be the take-away message of the Breiman article and the discussions that follow. In fact, this may be the best approach in the universe of all things empirical. This is certainly true in the new field of data science, and is clearly depicted in Drew Conway's data science Venn diagram below.





As Parzen states, "I believe statistics has many cultures." He points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our tool box to stochastic methods, we can often find our hands tied if we are not open minded. We can still sacrifice results for the sake of technique. And there are plenty of divisive debates. For example, in discussing the state of modern econometrics, Leamer faults econometric results that emphasize statistical significance over practicality or importance in an interesting podcast from EconTalk here.

Breiman touches on his idea of importance as well:

"My definition of importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy"

Note: no mention of statistical significance (a child of the stochastic data modeling culture) by Breiman.

This is echoed in a paper from the Joint Statistical Meetings - Section on Statistical Education (2009), 'The Cult of Statistical Significance,' by Stephen T. Ziliak and Deirdre N. McCloskey, who have authored a book by the same title. They note:

"Statistical significance is, we argue, a diversion from the proper objects of scientific study…Significant does not mean important and insignificant does not mean unimportant"

I even find a hint of this in Greene, a well-known econometrics textbook author:

 "It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing."
  p. 686 Greene,  Econometric Analysis 5th ed

It seems that cultural divides are hard to avoid. Take another example from the article below:

Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Eric L. Dey and Alexander W. Astin. Research in Higher Education, Volume 34, Number 5, 569-581 (1993).

 The above article concluded:

"Results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression. "

It never fails: go to a paper presentation and you will see someone getting uptight over linear probability models. In fact, there is some interesting literature on the robustness and pragmatism of linear probability models.

An emphasis on robustness, or even parsimony, seems to be an overarching theme in the fairly new book "Mostly Harmless Econometrics" by Joshua Angrist and Jörn-Steffen Pischke. Angrist and Pischke write about the robustness of the LPM in their book (but also see econometrician Dave Giles' post with some challenging objections to the LPM, a followup by Angrist and Pischke, and more discussion from economist Marc Bellemare). A small simulated comparison appears below.
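For a quick sense of why the practical differences are often small, here is a minimal R sketch (my own toy example on simulated data, not from any of the papers or posts cited above) comparing the LPM slope to the average marginal effect from a logit:

# Hedged sketch: LPM slope vs. logit average marginal effect on simulated data.
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))    # simulated binary outcome

lpm   <- lm(y ~ x)                           # linear probability model
logit <- glm(y ~ x, family = binomial)

coef(lpm)["x"]                                      # LPM marginal effect
mean(dlogis(predict(logit)) * coef(logit)["x"])     # logit average marginal effect

The two numbers typically come out quite close, which is the practical point made in the retention article above.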

Their web site for the book seems to caution practitioners about being married to fancier techniques:

"fancier techniques are typically unnecessary and even dangerous."


I have become more and more open minded as my experience working under both paradigms has increased. Software packages like SAS Enterprise Miner certainly accommodate open minded curiosity. Packages like R (or even SAS IML) let the very curious get their hands even dirtier. Both have been very influential to me. When it comes to estimating the marginal effects of a treatment for a binary outcome, I usually have no issue with using an LPM over logistic regression (though I would have taken issue with it not long ago). But when it comes to prediction, I certainly won't shy away from using logistic regression or, for that matter, a neural net, decision tree, or a 'multicultural' ensemble of all of the above.

Sunday, November 27, 2011

Regression via Gradient Descent in R

In a previous post I derived the least squares estimators using basic calculus, algebra, and arithmetic, and also showed how the same results can be achieved using the canned functions in SAS and R or via the matrix programming capabilities offered by those languages. I've also introduced the concept of gradient descent here and here. 

Given recent course work in the online machine learning course via Stanford, in this post I tie all of these concepts together. The R code that follows borrows heavily from the Statalgo blog, which has done a tremendous job breaking down the concepts from the course and applying them in R. This post is highly redundant with that one, and not quite as sophisticated (i.e., my code isn't nearly as efficient), but it allows me to intuitively extend the previous work I've done.

If you are not going to use a canned routine from R or SAS to estimate a regression, then why not at least apply the matrix inversions and solve for the betas as I did before? One reason may be that with a large number of features (independent variables) the matrix inversion can be slow, and a computational algorithm like gradient descent may be more efficient.

In my earliest gradient descent post, I provided the intuition and notation necessary for performing gradient descent to estimate OLS coefficients. Following the notation from the machine learning course mentioned above, I review the notation again below.

In machine learning, the regression function y = b0 + b1x is referred to as a 'hypothesis' function:

hθ(x) = θ0 + θ1x

The idea, just like in OLS, is to minimize the sum of squared errors, represented as a 'cost function' in the machine learning context:

J(θ0, θ1) = [1/(2m)] Σ (hθ(x(i)) - y(i))^2,  summing over the m training examples

This is minimized by solving for the values of theta that set the derivative to zero. In some cases this can be done analytically with calculus and a little algebra, but it can also be done (especially when complex functions are involved) via gradient descent. Recall from before, the basic gradient descent algorithm involves a learning rate 'alpha' and an update function that utilizes the first derivative or gradient f'(.). The basic algorithm is as follows:

repeat until convergence {
  x := x - α∇F(x) 
}

(see here for a basic demo using R code)
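In case that link isn't handy, here is a minimal stand-alone R sketch of the generic update rule on a toy one-dimensional function (my own example, not from the course):

# Hedged sketch: gradient descent on f(x) = (x - 3)^2, whose gradient is f'(x) = 2*(x - 3).
alpha <- 0.1                      # learning rate
x     <- 0                        # starting value
for (i in 1:100) {
  x <- x - alpha * 2 * (x - 3)    # x := x - alpha * f'(x)
}
x                                 # converges toward the minimizer x = 3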

For the case of OLS, and the cost function J(theta) depicted above, the gradient descent algorithm can be implemented in pseudo-code as follows:

repeat until convergence {
  θj := θj - α ∂J(θ0, θ1)/∂θj    (updating θ0 and θ1 simultaneously)
}

It turns out that:

∂J/∂θ0 = (1/m) Σ (hθ(x(i)) - y(i))
∂J/∂θ1 = (1/m) Σ (hθ(x(i)) - y(i)) x(i)

and the gradient descent algorithm becomes:

repeat until convergence {
  θ0 := θ0 - α (1/m) Σ (hθ(x(i)) - y(i))
  θ1 := θ1 - α (1/m) Σ (hθ(x(i)) - y(i)) x(i)
}

The following R code implements this algorithm and achieves the same results we got from the matrix programming or the canned R or SAS routines before. The concept of 'feature scaling' is introduced, which is a method of standardizing the independent variables or features. This improves the convergence speed of the algorithm. Additional code intuitively investigates the concept of convergence. Ideally, the algorithm should stop once changes in the cost function become small as we update the values in theta.

R-Code:

#  ----------------------------------------------------------------------------------
# |PROGRAM NAME: gradient_descent_OLS_R
# |DATE: 11/27/11
# |CREATED BY: MATT BOGARD 
# |PROJECT FILE:              
# |----------------------------------------------------------------------------------
# | PURPOSE: illustration of gradient descent algorithm applied to OLS
# | REFERENCE: adapted from : http://www.cs.colostate.edu/~anderson/cs545/Lectures/week6day2/week6day2.pdf                
# |  and http://www.statalgo.com/2011/10/17/stanford-ml-1-2-gradient-descent/
#  ---------------------------------------------------------------------------------
 
# get data 
rm(list = ls(all = TRUE)) # make sure previous work is clear
ls()
x0 <- c(1,1,1,1,1) # column of 1's
x1 <- c(1,2,3,4,5) # original x-values
 
# create the x- matrix of explanatory variables
 
x <- as.matrix(cbind(x0,x1))
 
# create the y-matrix of dependent variables
 
y <- as.matrix(c(3,7,5,11,14))
m <- nrow(y)
 
# implement feature scaling
x.scaled <- x
x.scaled[,2] <- (x[,2] - mean(x[,2]))/sd(x[,2])
 
# analytical results with matrix algebra
 solve(t(x)%*%x)%*%t(x)%*%y # w/o feature scaling
 solve(t(x.scaled)%*%x.scaled)%*%t(x.scaled)%*%y # w/ feature scaling
 
# results using canned lm function match results above
summary(lm(y ~ x[, 2])) # w/o feature scaling
summary(lm(y ~ x.scaled[, 2])) # w/feature scaling
 
# define the gradient function dJ/dtheta: (1/m) * (h(x)-y)*x where h(x) = x*theta
# in matrix form this is as follows:
grad <- function(x, y, theta) {
  gradient <- (1/m)* (t(x) %*% ((x %*% t(theta)) - y))
  return(t(gradient))
}
 
# define gradient descent update algorithm
grad.descent <- function(x, maxit){
    theta <- matrix(c(0, 0), nrow=1) # Initialize the parameters
 
    alpha = .05 # set learning rate
    for (i in 1:maxit) {
      theta <- theta - alpha  * grad(x, y, theta)   
    }
 return(theta)
}
 
 
# results without feature scaling
print(grad.descent(x,1000))
 
# results with feature scaling
print(grad.descent(x.scaled,1000))
 
# ----------------------------------------------------------------------- 
# cost and convergence intuition
# -----------------------------------------------------------------------
 
# typically we would iterate the algorithm above until the 
# change in the cost function (as a result of the updated b0 and b1 values)
# was smaller than some small value 'c'. 'c' would be referred to as the 'convergence'
# criterion. If the criterion is not met after a given number of iterations, you can
# increase the iterations or change the learning rate 'alpha' to speed up convergence
 
# get results from gradient descent
beta <- grad.descent(x,1000)
 
# define the 'hypothesis function'
h <- function(xi,b0,b1) {
 b0 + b1 * xi 
}
 
# define the cost function   
cost <- t(mat.or.vec(1,m))
  for(i in 1:m) {
    cost[i,1] <-  (1 /(2*m)) * (h(x[i,2],beta[1,1],beta[1,2])- y[i,])^2 
  }
 
totalCost <- colSums(cost)
print(totalCost)
 
# save this as Cost1000
cost1000 <- totalCost
 
# change iterations to 1001 and compute cost1001
beta <- (grad.descent(x,1001))
cost <- t(mat.or.vec(1,m))
  for(i in 1:m) {
    cost[i,1] <-  (1 /(2*m)) * (h(x[i,2],beta[1,1],beta[1,2])- y[i,])^2 
  }
cost1001 <- colSums(cost)
 
# does this difference meet your convergence criteria? 
print(cost1000 - cost1001)