Sunday, January 29, 2012

Decision Tree Basics in SAS and R

Assume we want to use a decision tree to predict ‘green’ vs. ‘red’ cases (see below; note this plot of the data was actually created in R).

 
We want to use two variables, say X1 and X2, to predict ‘green’ or ‘red’. In its simplest form, the decision tree algorithm searches through the values of X1 and X2 and finds the values that do the ‘best’ job of ‘splitting’ the cases. For each candidate split value of X, the algorithm performs a chi-square test. The values that create the best ‘split’ are chosen (in SAS Enterprise Miner this is based on a metric called ‘worth’, which is a function of the p-value associated with the chi-square test: worth = -log(p-value)). (See below)
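
To make the splitting criterion concrete, the sketch below (an illustration of mine, not code from SAS Enterprise Miner) shows how a single candidate split could be scored in R with a chi-square test; the data frame dat1, the variable x1, and the 0.5 cut point are placeholders standing in for the data used in the full program further down.

# Score one candidate split with a chi-square test (illustrative only; the
# cut point 0.5 and the data frame dat1 are placeholders)
splitWorth <- function(x, y, cutpoint){
  side <- ifelse(x <= cutpoint, "left", "right")  # which side of the split each case falls on
  tab <- table(side, y)                           # 2 x 2 table of split side vs. class
  p <- chisq.test(tab)$p.value                    # chi-square test of independence
  -log10(p)                                       # 'worth' = -log(p-value) (base 10 here); larger is better
}

# splitWorth(dat1$x1, dat1$class, 0.5) # compare across candidate cut points and variables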
 
After all of the ‘best’ splits are determined, these values become rules for distinguishing between cases (in this case ‘green’ vs. ‘red’). What you end up with is a set of ‘rectangles’ defined by a set of ‘rules’ that can be visualized in a ‘tree’ diagram (see the visualization from R below).


 

(see the visualization from SAS Enterprise Miner below)




In SAS Enterprise Miner, the algorithm then goes on to look at new data (validation data) and assesses how well the splitting rules do in terms of predicting or classifying new cases. If it finds that a ‘smaller’ tree with fewer variables, splits, rules, or ‘branches’ does a better job, it prunes or trims the tree and removes the unneeded splits from the analysis. In the end, you get a final set of rules that can be applied algorithmically to predict new cases based on observed values of X1 and X2 (or whatever variables are used by the tree).

In SAS Enterprise Miner, with each split created by the decision tree, a metric for importance based on the Gini impurity index (a measure of variability or impurity for categorical data) is calculated. This measures how well the tree distinguishes between cases (again, ‘green’ vs. ‘red’, or ‘retained’ vs. ‘non-retained’ in our model) and ultimately how well the model explains whatever it is we are trying to predict. Overall, if splits based on values of X2 ‘reduce impurity’ more than splits based on values of X1, then X2 would be considered the ‘best’ or ‘most’ predictive variable.
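
As a rough illustration of the idea (my own sketch, not Enterprise Miner's internal calculation), Gini impurity and the impurity reduction from a candidate split can be computed by hand in R:

# Gini impurity of a node: 1 - sum of squared class proportions
giniImpurity <- function(y){
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Impurity reduction achieved by splitting variable x at a given cut point
giniReduction <- function(x, y, cutpoint){
  left <- y[x <= cutpoint]
  right <- y[x > cutpoint]
  n <- length(y)
  giniImpurity(y) -
    (length(left)/n)*giniImpurity(left) -
    (length(right)/n)*giniImpurity(right)
}

# Comparing, say, giniReduction(dat1$x2, dat1$class, 0.5) with
# giniReduction(dat1$x1, dat1$class, 0.5) indicates which variable's split
# 'reduces impurity' more (the cut points here are placeholders)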

The rpart algorithm used in R may differ in some of the specific details, but as discussed in The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2008), all decision trees work in much the same way, fitting the model by recursively partitioning the feature space into rectangular subsets.

 R code for the Decision Tree and Visualizations Above: 

# *------------------------------------------------------------------
# | PROGRAM NAME: R_tree_basic 
# | DATE:4/26/11    
# | CREATED BY: Matt Bogard 
# | PROJECT FILE:P:\R  Code References\Data Mining_R              
# *----------------------------------------------------------------
# | PURPOSE: demo of basic decision tree mechanics               
# |
# *------------------------------------------------------------------
 
rm(list=ls()) # get rid of any existing data 
ls() # list objects remaining in the workspace
 
setwd('/Users/wkuuser/Desktop/R Data Sets') # mac 
setwd("P:\\R  Code References\\R Data") # windows
 
library(rpart) # load the rpart decision tree library (install first with install.packages('rpart') if needed)
 
# *------------------------------------------------------------------
# | get data            
# *-----------------------------------------------------------------
 
dat1 <-  read.csv("basicTree.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
plot( dat1$x2, dat1$x1, col = dat1$class) # plot data space
 
# fit decision tree
 
(r <- rpart(class ~ x1 + x2, data = dat1)) 
 
plot(r)
text(r)
 
library(rattle) # data mining package
drawTreeNodes(r) # for more detailed tree plot supported by rattle

Saturday, January 21, 2012

Predictive Analytics In Higher Ed

See also: Using SAS® Enterprise BI and SAS® Enterprise Miner™ to


A colleague shared with me the following article: The Predictive Analytics Reporting Framework Moves Forward. This is a great example of the 'multicultural' approach to data analysis that I discussed in 'Culture War: Classical Statistics vs. Machine Learning'.

From the article:

Wagner: "For some in education what we’re doing might seem a bit heretical at first--we’ve all been warned in research methods classes that data snooping is bad! But the newer technologies and the sophisticated analyses that those technologies have enabled have helped us to move away from looking askance at pattern recognition. That may take a while in some research circles, but in decision-making circles it’s clear that pattern recognition techniques are making a real difference in terms of the ways numerous enterprises in many industries are approaching their work"

The fact that they are willing to step out of what Leo Breiman described as the 'statistical straitjacket' of traditional research methods and embrace 'heretical' pattern recognition and data mining algorithms is impressive. The paragraph excerpted above, and the distinction between 'research circles' and 'decision making circles', indicates to me that they clearly get what he was talking about over a decade ago in his famous article 'Statistical Modeling: The Two Cultures' ( http://bit.ly/bqtZ6I ).

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. (i.e. 'research circles') This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.Algorithmic modeling,both in theory and practice,has developed rapidly in fields outside statistics."(like 'decision making circles').

As far as 'newer' technologies and the 'sophisticated analyses' they enable go, I can only see opportunities for programs like SAS Enterprise Miner, JMP, and R.

Reference:
Breiman, Leo. Statistical Modeling: The Two Cultures. Statistical Science, Vol. 16, No. 3 (2001), 199-231.

Time Series Intervention Analysis with R and SAS

In a previous post, I worked through the theory behind intervention analysis. In his time series course, University of Georgia political science professor Jamie Monogan demonstrates how to implement intervention analysis in R. The following example is from that course; it investigates the impact of the terrorist attacks of 9/11 on President Bush's approval ratings. An excerpt from the data set follows:


   year month  approve  t s11
1  2001     1 45.41947  1   0
2  2001     2 55.83721  2   0
3  2001     3 55.91828  3   0
4  2001     4 58.12725  4   0
5  2001     5 55.79231  5   0
.
.
.
etc


Note in the plot of the series above that observation 10 represents October 2001, just after 9/11/2001. There appears to be a drastic shift in the series that slowly decays and eventually returns to previous levels. This may indicate an intervention model with a pulse function. Following the Box-Jenkins methodology, an ARIMA(1,0,0) model with the intervention can be specified in R as follows:

model <- arimax(bush$approve, order=c(1,0,0), xtransf=bush$s11, transfer=list(c(1,0)))

where:
bush$approve = Yt, Bush's approval ratings
order = c(1,0,0) indicates an AR(1) process
xtransf = bush$s11 is the intervention series, flagging Pt = 1 to indicate September 11, 2001
transfer = list(c(1,0)) indicates the functional form of the transfer function: the first number gives the order of the denominator, while the second gives the order of the numerator. As specified in the code above, the transfer function is:

[w0/(1 - δB)]

Based on SAS documentation (http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect014.htm )  this could also be specified in SAS via PROC ARIMA as follows:

   proc arima data=bush;
      identify var=approve crosscorr=s11;
      estimate p =1  input=( / (1) s11 );
   run;

Using R, the estimated model coefficients are:
Coefficients:
         ar1  intercept  T1-AR1   T1-MA0
      0.8198    58.2875  0.8915  23.6921
s.e.  0.1184     4.6496  0.0166   3.9415

Based on the output above, w0 = 23.6921 and δ = .8915. The plot below shows the fitted values, which look very similar to the graphs for this specification shown in the previous post.


The code below was used to conduct this analysis and is excerpted from Jamie Monogan's original R program.
#   ------------------------------------------------------------------
#  | PROGRAM NAME: R_ARIMAX
#  | DATE:1/21/12
#  | CREATED BY: Matt Bogard 
#  | PROJECT FILE: /Users/wkuuser/Desktop/Briefcase/R Programs              
#  |----------------------------------------------------------------
#  | PURPOSE: illustration of ARIMA model with transfer function/intervention series               
#  |
#  |
#  | 
#  |------------------------------------------------------------------
#  | COMMENTS: R code and data originally sourced from : http://monogan.myweb.uga.edu/teaching/ts/index.html
#  |------------------------------------------------------------------
 
#libraries
rm(list=ls())
library(foreign)
library(TSA)
 
setwd('/Users/wkuuser/Desktop/briefcase/R Data Sets') # mac 
 
#load data & view series
 
bush <- read.dta("BUSHJOB.DTA")
names(bush)
print(bush)
 
plot(y=bush$approve, x=bush$t, type='l')
 
#identify arima process
acf(bush$approve)
pacf(bush$approve)
 
#estimate arima model
mod.1 <- arima(bush$approve, order=c(0,1,0))
mod.1
 
#diagnose arima model
acf(mod.1$residuals)
pacf(mod.1$residuals)
Box.test(mod.1$residuals)
 
#Looks like I(1). Note: There is a theoretical case against public opinion data being integrated.
 
#estimate intervention analysis for September 11 (remember to start with a pulse function)
 
mod.2 <- arimax(bush$approve, order=c(0,1,0), xtransf=bush$s11, transfer=list(c(1,0))); mod.2
 
summary(mod.2)
 
#Notes: the second parameter directs the numerator. If 0 only, then concurrent effect only. The first parameter affects the denominator. c(0,0) replicates the effect of just doing "xreg"
#Our parameter estimates look good, no need to drop delta or switch to a step function.
 
#Graph the intervention model
y.diff <- diff(bush$approve)
t.diff <- bush$t[-1]
y.pred <- 24.3741*bush$s11 + 24.3741*(.9639^(bush$t-9))*as.numeric(bush$t>9)
y.pred <- y.pred[-1]
plot(y=y.diff, x=t.diff, type='l')
lines(y=y.pred, x=t.diff, lty=2)
 
#suppose an AR(1) process
mod.2b <- arimax(bush$approve, order=c(1,0,0), xtransf=bush$s11, transfer=list(c(1,0))); mod.2b
y.pred <- 58.2875 + 23.6921*bush$s11 + 23.6921*(.8915^(bush$t-9))*as.numeric(bush$t>9)
plot(y=bush$approve, x=bush$t, type='l')
lines(y=y.pred, x=bush$t, lty=2)

Intervention Analysis

See also: Time Series Intervention Analysis with R and SAS

In previous posts I have discussed the basics of time series analysis methods, provided an example of an applied ARIMA model (using fertilizer application data), and discussed how vector autoregressions can be used to accommodate a multivariate analysis of time series. What if you want to measure the impact of some shock or event (such as a change in government policy, the impact of a natural disaster, a negative news report, etc.) on the behavior of a series of data? For instance, let's assume beef consumption was holding steady or even trending downward. Then at time ‘t’ someone of national prominence makes false claims about beef safety, and a steeper downward trend in beef consumption follows. How much of this decrease can be attributed to the headline-making false claims, and how much is just the continued momentum of other factors? Or, assume that corn prices have been trending upward for some time and at some time ‘t’ the government increases renewable fuels mandates. An increase in the trend of corn prices follows. How much of this can be attributed to the public policy vs. the momentum of other evolving factors?

Time series intervention analysis (ARIMA modeling with intervention) provides the framework for answering these types of questions.  As explained in the introductory business forecasting text written by Newbold and Bos:

“When a significant intervention has interrupted the stable behavior of a time series of interest, it is important to attempt to explicitly model its impact. Box and Tiao (1975) introduced a procedure, known as intervention analysis, for this purpose. Essentially these authors proposed the use of the transfer function-noise class of models…but with Xt a dummy variable series defined to take the value zero up to the point in time that the intervention occurs, and the value one thereafter.”

As described by Nelson (2000), intervention analysis is

“an extension of the ARIMA methods introduced by Box and Tiao (1975) and Box and Jenkins (1976) to measure the effect of a policy change or event on the outcome of a variable... In summary, intervention models generalize the univariate Box-Jenkins methodology by allowing the time path of the dependent variable to be influenced by the time path of the intervention variables”

So it seems a time series intervention analysis can be a valuable tool for assessing the impact of some policy change or other intervention on the trend of a variable we are interested in. This, however, first requires an understanding of the transfer function-noise class of models described by Newbold and Bos.

Transfer Functions

ARIMA models are simply models of a single dependent variable  (Y) as a function of past values of Y as well as past values of the error terms (e).

For example, ARIMA(2,1,1): Y*t = b0 + b1Y*t-1 + b2Y*t-2 + b3et-1 + et represents an ARIMA model with 2 autoregressive terms, 1 first difference, and 1 moving average term (where Y*t denotes the differenced series). But what if we want to include the impact of some other variable on Y, such as X? A basic regression model Y = B0 + B1X provides a static model for this relationship, but what if we want a model that captures the relationship between the time paths of both X and Y? One way to look at the impact of the time path of X on Y would be to model:

Yt = v0 + v1Xt + v2Xt-1 + v3Xt-2 + … + et
 
This can be represented and estimated more compactly, flexibly, and efficiently as

Yt = α + [w(B)/δ(B)]Xt + et

where:

w(B) = w0 + w1B + w2B^2 + … + wrB^r
δ(B) = 1 - δ1B - … - δsB^s
B = the backshift operator
Xt = the independent explanatory variable of interest
et = the noise/error term

As pointed out by Newbold and Bos, this model can often fit many data series quite well with only a few polynomial terms, such as the very popular formulation below:

Yt = α + [w0/(1 - δ1B)]Xt + et

which is equivalent to:

Yt = α + w0Xt + w0δ1Xt-1 + w0δ1^2Xt-2 + w0δ1^3Xt-3 + … + et

This model implies that a 1 unit increase in X during time t increases Y by w0 units, by w0δ1 units in the next time period, and by w0δ1^2 units the period after that, with the impact of the change in X phasing out over time. The expression [w0/(1 - δ1B)]Xt represents what’s referred to as a transfer function.
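
To see the geometric decay implied by this formulation, the short R sketch below traces out the lag weights w0δ1^j, using the estimates from the 9/11 example above (w0 = 23.6921, δ1 = 0.8915) purely for illustration:

# Implied lag weights w0*delta1^j for the transfer function [w0/(1 - delta1*B)],
# using the 9/11 example estimates above for illustration
w0 <- 23.6921
delta1 <- 0.8915
lags <- 0:12

weights <- w0*delta1^lags  # effect of a one-unit pulse j periods later
round(weights, 2)

plot(lags, weights, type='h', xlab='periods after the intervention',
  ylab='implied effect on Y', main='Geometric decay of the transfer function weights')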

What about the time path of Y itself? We previously modeled this via an ARIMA process. Letting the error term be represented by an ARIMA process (aka the noise process Nt) gives:

Nt = [θ(B)θ(B^s) / Φ(B)Φ(B^s)Δ^dΔs^D] at, where

B = the backshift operator, such that BYt = Yt-1
θ(B) = the MA(q) polynomial
θ(B^s) = the seasonal MA(Q) polynomial
Φ(B) = the AR(p) polynomial
Φ(B^s) = the seasonal AR(P) polynomial
Δ^d = regular differencing (d times)
Δs^D = seasonal differencing (D times)
at = the white noise term

which is just another way of representing the ARIMA(p,d,q)(P,D,Q) process, where

P = # of seasonal autoregressive terms
D = # of seasonal differences
Q = # of seasonal moving average terms

The combined transfer function-noise model can be represented more compactly as:

Yt = C + [w(B)/δ(B)]Xt + Nt

Or even more compactly:

Yt = C + v(B)Xt + Nt
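
The arimax() function from the TSA package (used in the 9/11 example in the previous post) can estimate this kind of transfer function-noise model. The sketch below is hypothetical: the data frame dat, its columns y and x, the ARIMA(1,0,0) noise model, and the transfer orders are all placeholders that would normally come from identifying the series.

library(TSA)

# Hypothetical transfer function-noise model: the data frame dat and the
# ARIMA/transfer orders are placeholders, not results from this post
tf.mod <- arimax(dat$y,
  order=c(1,0,0),           # ARIMA noise process Nt
  xtransf=dat$x,            # input series Xt
  transfer=list(c(1,0)))    # w(B)/delta(B) with a first order denominator
tf.mod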


This brings us back to the description of intervention analysis from Newbold and Bos quoted earlier:


“Essentially these authors proposed the use of the transfer function-noise class of models…but with Xt a dummy variable series defined to take the value zero up to the point in time that the intervention occurs, and the value one thereafter.”

That is, we use the transfer function-noise model above, with the independent variable Xt being a dummy or indicator variable flagging the intervention at time ‘T’. If we let Xt be the intervention variable It, it can take two forms, depending on the impact of the intervention on the series of interest (Y):

Let It be a pulse function Pt such that:

Pt = 0 if t != T
     1 if t = T

Pulse functions are typically employed if the effects of the intervention are expected to be only temporary and to decay over time.

Let It be a step function St such that:

St = 0 if t < T
     1 if t >= T

Step functions are typically employed if the effects of the intervention are expected to remain permanently after time T.
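
As a quick illustration (a sketch of mine, with an arbitrary series length and intervention time), the two intervention series can be constructed directly in R:

# Construct pulse and step intervention dummies for a series of length n
# with an intervention at time Tint (n = 24 and Tint = 10 are arbitrary)
n <- 24
Tint <- 10
tt <- 1:n

Pt <- as.numeric(tt == Tint)  # pulse: 1 only at the intervention period
St <- as.numeric(tt >= Tint)  # step: 1 from the intervention period onward

cbind(t=tt, Pt=Pt, St=St)[8:12, ]  # peek at the values around the intervention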

So a time series intervention model can be compactly expressed in two ways, as a pulse function:

Yt = C + v(B)*Pt + Nt

Or as a step function

Yt = C + v(B)*St + Nt

Min (2008) provides some interesting graphs illustrating these various specifications.

Pulse: [wB/(1 - δB)]Pt

The above specification illustrates an intervention with an abrupt but temporary effect ‘w’ that gradually decays or ‘wears off’ at rate δ, with a return to the original or pre-intervention level.

Step: [wB/(1 - δB)]St

The above specification illustrates an intervention with a gradual permanent effect, with an initial impact or level change of ‘w’, a rate of adjustment equal to δ, and a long run response level of w/(1 - δ). To better illustrate the role of δ, note that if δ = 0 then we would have [wB]St, and there would be no gradual adjustment process, as illustrated below.
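
These two response paths can also be traced out numerically. The R sketch below is mine (the values w = 20 and δ = 0.8 are arbitrary); it applies the 1/(1 - δB) recursion with stats::filter() to a pulse and to a step input:

# Trace the response implied by [wB/(1 - delta*B)] for a pulse and a step
# intervention (w = 20 and delta = 0.8 are arbitrary illustrative values)
n <- 30
Tint <- 10
tt <- 1:n
w <- 20
delta <- 0.8

Pt <- as.numeric(tt == Tint)
St <- as.numeric(tt >= Tint)

lag1 <- function(x) c(0, head(x, -1))  # apply the backshift operator B
pulse.resp <- stats::filter(w*lag1(Pt), delta, method='recursive')
step.resp <- stats::filter(w*lag1(St), delta, method='recursive')

plot(tt, step.resp, type='l', ylim=range(0, step.resp), xlab='time', ylab='response')
lines(tt, pulse.resp, lty=2)
legend('topleft', legend=c('step: gradual permanent shift to w/(1 - delta)',
  'pulse: abrupt temporary effect that decays'), lty=c(1,2), bty='n')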
 

There are many other possible specifications, depending on the behavior of the series we are interested in (Yt) and the impact of the intervention on the time path of Yt. As noted in Wei (1990), you can even specify a combination of pulse and step functions as follows:

[w0B/(1 - δB)]Pt + w1St

It also follows that we could have both an input series Xt (as discussed previously in the context of transfer functions) and an intervention It, specified as follows:

Yt = C + v(B)It + w1BXt + Nt

where It is either a step or pulse function.
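
A hypothetical sketch of how such a combined model might be set up with TSA::arimax(), with the intervention entered through xtransf/transfer and the lagged input series through xreg; the data frame dat, its columns y, x, and int, and the model orders are placeholders:

library(TSA)

# Hypothetical combined model: intervention It via xtransf/transfer and the
# lagged input series B*Xt via xreg (dat, its columns, and the orders are placeholders)
xlag <- c(0, head(dat$x, -1))  # B*Xt: the input series lagged one period

combined.mod <- arimax(dat$y,
  order=c(1,0,0),              # noise process Nt
  xreg=xlag,                   # w1*B*Xt term
  xtransf=dat$int,             # intervention series It
  transfer=list(c(1,0)))       # v(B) = w0/(1 - delta*B)
combined.mod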


References:

Nelson, Jon P. "Consumer Bankruptcies and the Bankruptcy Reform Act: A Time-Series Intervention Analysis, 1960-1997." Journal of Financial Services Research, Vol. 17, No. 2 (August 2000), 181-200. DOI: 10.1023/A:1008166614928

Box, G. E. P. and G. C. Tiao. 1975. "Intervention Analysis with Applications to Economic and Environmental Problems." Journal of the American Statistical Association, 70: 70-79.

Newbold and Bos. Introductory Business Forecasting. 1990.

Min, Jennifer C. H. (2008). "Forecasting Japanese Tourism Demand in Taiwan Using an Intervention Analysis." International Journal of Culture, Tourism and Hospitality Research, Vol. 2, No. 3, 197-216.

Wei, William W. S. Time Series Analysis: Univariate and Multivariate Methods. 1990.