Wednesday, March 16, 2011

Predictive Modeling and Custom Reporting with R

In a previous post I looked at different customer/patron/donor segments and how they differed over time in terms of predicted risk, based on a predictive model that I created. Below I give a brief introductory example of one such model implemented in R. The aim of the project is to predict admission status and create a report (one that could be implemented in an enterprise-wide system) that ranks each individual's probability of admission with a simple color code: 'red' = low probability of admission, 'yellow' = marginal probability of admission, 'green' = high probability of admission. This example isn't the most practical, but similar results could easily be obtained for any predicted outcome: customer purchase decisions, retention, success, etc.

The data in this example consist of graduate school application data provided by UCLA, with the variables rank (the prestige rank of the applicant's undergraduate institution), gre (the applicant's GRE score), and gpa (the applicant's undergraduate GPA). Additionally, I added a unique ID (row number) for each applicant, which is used to help create the report. The variable 'admit' is the binary (0,1) outcome that we are trying to predict.

One thing I do differently from the UCLA example, for demo purposes, is divide the data into training data, used to build the model, and validation data, which here stands in for the data we score, i.e. the students whose admission we are trying to predict. (In practice, validation data is used to calibrate and evaluate models prior to deployment, and a final 'score' data set is used for predicting new people.)
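As a rough sketch of the partitioning idea, assuming a simulated stand-in for the application data (the data frame `toy_apps`, its values, and the seed are illustrative and not part of the original analysis), a reproducible 50/50 split might look like this:

```r
# Simulated stand-in for the application data (hypothetical values)
set.seed(2011)                       # arbitrary seed so the split is reproducible
toy_apps <- data.frame(
  id  = seq_len(400),
  gre = round(runif(400, 220, 800)),
  gpa = round(runif(400, 2.2, 4.0), 2)
)

n_total <- nrow(toy_apps)            # derive the row count from the data
n_train <- floor(n_total * 0.5)      # half of the rows go to training

shuffled     <- toy_apps[sample(n_total), ]          # rows in random order
toy_train    <- shuffled[seq_len(n_train), ]         # first half: build the model
toy_validate <- shuffled[(n_train + 1):n_total, ]    # second half: score/evaluate

# The two sets should not share any applicant ids
length(intersect(toy_train$id, toy_validate$id))
```

Deriving the counts from `nrow()` rather than typing them in means the same code keeps working if the number of applicants changes.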

The model used in this example is a logistic regression model. After running the model and computing the odds ratios, we get the following interesting result (the exact value depends on the random training split):

GPA : 2.41991974

This implies that for every one-unit increase in GPA, the odds of being admitted increase by a factor of about 2.4, holding the other variables constant (for more on interpreting logistic regression coefficients and odds ratios see my post here). Odds ratios can be useful for measuring the impact of various variables (which could indicate customer segments, interventions, marketing campaigns, etc.) and how they relate to the probability of any outcome of interest.
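To make the "factor of" interpretation concrete, here is a minimal sketch on simulated data. The data frame `toy`, the assumed true coefficients, and the helper `odds_at` are all hypothetical; the point is only that exponentiating a logit coefficient gives the multiplicative change in the odds for a one-unit increase in that variable.

```r
# Simulated data with an assumed true model: log-odds = -4 + 0.9 * gpa
set.seed(42)
n   <- 2000
toy <- data.frame(gpa = runif(n, 2.0, 4.0))
toy$admit <- rbinom(n, size = 1, prob = plogis(-4 + 0.9 * toy$gpa))

toy_fit <- glm(admit ~ gpa, data = toy, family = binomial(link = "logit"))

odds_ratios <- exp(coef(toy_fit))           # coefficients on the odds-ratio scale
or_ci <- exp(confint.default(toy_fit))      # Wald 95% intervals, also exponentiated

print(odds_ratios)
print(or_ci)

# Hypothetical helper: predicted odds of admission at a given GPA
odds_at <- function(g) {
  pr <- predict(toy_fit, newdata = data.frame(gpa = g), type = "response")
  unname(pr / (1 - pr))
}

# The ratio of odds one GPA unit apart matches the gpa odds ratio
odds_at(3) / odds_at(2)
```

Note that the odds ratio is constant across the GPA range, while the change in the predicted probability is not; that is why the report later in this post is built from predicted probabilities rather than from the odds ratios themselves.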

By using the R function 'predict', new data can be read in and predictions can be made using the developed model. This gives a probability of admission for each student applicant, which can then be used to create an easily interpreted, actionable report that can be used directly, refined in another program like Excel, delivered via the web, or incorporated into an enterprise-wide reporting system. The data can be merged by ID with other data sources, and various customized reports can be created using the analytics provided by the model.
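The scoring and merge steps can be sketched as follows, again on simulated data. Everything named here (`make_apps`, `toy_train`, `toy_new`, `toy_contacts`, the `advisor` column) is a hypothetical stand-in; only `glm`, `predict` and `merge` are the actual R functions involved.

```r
set.seed(7)

# Hypothetical generator for applicant records with sequential ids
make_apps <- function(n, first_id) {
  gpa <- runif(n, 2.0, 4.0)
  data.frame(
    id    = seq(first_id, length.out = n),
    gpa   = gpa,
    admit = rbinom(n, 1, plogis(-4 + 0.9 * gpa))
  )
}

toy_train <- make_apps(300, first_id = 1)     # data used to build the model
toy_new   <- make_apps(100, first_id = 301)   # "new" applicants to score

toy_model <- glm(admit ~ gpa, data = toy_train,
                 family = binomial(link = "logit"))

# newdata must contain columns named as in the model formula
toy_new$score <- predict(toy_model, newdata = toy_new, type = "response")

# Hypothetical second data source keyed by the same id
toy_contacts <- data.frame(
  id      = toy_new$id,
  advisor = sample(c("A", "B", "C"), nrow(toy_new), replace = TRUE)
)

# Join the scores to the other source by applicant id
toy_report <- merge(toy_new[c("id", "score")], toy_contacts, by = "id")
head(toy_report)
```

Because the formula refers to bare column names and the data frame is supplied through `data=`, `predict` can substitute the new applicants through `newdata`.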

The R code used for this demonstration is below:

# *------------------------------------------------------------------
# | PROGRAM NAME: ex_logit_analytics_R
# | DATE: 3/14/11
# | CREATED BY: Matt Bogard  
# | PROJECT FILE:Desktop/R Programs            
# *----------------------------------------------------------------
# | PURPOSE: example of predictive model and reporting               
# |
# *------------------------------------------------------------------
# | COMMENTS:               
# |
# |  1: Reference: R Data Analysis Logistic Regression 
# |     http://www.ats.ucla.edu/stat/r/dae/logit.htm
# |  2: 
# |  3: 
# |*------------------------------------------------------------------
# | DATA USED: data downloaded from reference above             
# |
# |
# |*------------------------------------------------------------------
# | CONTENTS:               
# |
# |  PART 1: data partition 
# |  PART 2: build model
# |  PART 3: predictions/scoring
# |  PART 4: traffic lighting report
# *-----------------------------------------------------------------
# | UPDATES:               
# |
# |
# *------------------------------------------------------------------
 
# get data 
 
apps <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/binary.csv")) # read data 
 
 
names(apps) # list variables in this data set
dim(apps) # number of observations
print(apps) # view
 
# *------------------------------------------------------------------
# |                
# |    data partition
# |  
# |  
# *-----------------------------------------------------------------
 
 
# store total number of observations in your data
# (taken from the data rather than hard-coded)
N <- nrow(apps)
print(N)
 
# Number of training observations
Ntrain <- N * 0.5
print(Ntrain)
 
# add an explicit row number variable for tracking
 
id <- seq_len(nrow(apps))
 
apps2 <- cbind(apps,id)
 
# Randomly arrange the data and divide it into a training
# and validation set.
 
set.seed(2011) # arbitrary seed so the random split can be reproduced
dat <- apps2[sample(1:N),]
train <- dat[1:Ntrain,]
validate <- dat[(Ntrain+1):N,]
 
dim(dat)
dim(train)
dim(validate)
 
# sort and look at data sets to see that they are different
 
sort_train <- train[order(train$id),]
print(sort_train)
 
sort_val <- validate[order(validate$id),]
print(sort_val)
 
# *------------------------------------------------------------------
# |                
# |    build model
# |  
# |  
# *-----------------------------------------------------------------
 
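# logit model
# (variables are named in the formula and looked up via data=train, so that
# predict() can later substitute the validation data through newdata)
 
admit_model <- glm(admit ~ gre + gpa + as.factor(rank), data=train, family=binomial(link="logit"), na.action=na.pass)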
 
# model results
 
summary(admit_model)
 
# odds ratios
 
exp(admit_model$coefficients)
 
# *------------------------------------------------------------------
# |                
# |   predictions/scoring data
# |  
# |  
# *-----------------------------------------------------------------
 
train$score <- predict(admit_model,type="response") # add predictions to training data 
 
sort_train_score <- train[order(train$id),] # sort by observation
print(sort_train_score) # view
 
validate$score <-predict(admit_model,newdata=validate,type="response") # add predictions to validation data 
sort_val_score <- validate[order(validate$id),] # sort by observation
print(sort_val_score) # view
 
 
# *------------------------------------------------------------------
# |                
# |    create a 'traffic light report' based on predicted probabilities
# |  
# |  
# *----------------------------------------------------------------- 
 
summary(validate$score) # look at probability ranges
 
 
green <- validate[validate$score >=.6,] # subset most likely to be admitted group
dim(green)
green$colorcode <- "green" # add color code variable for this group
 
yellow <- validate[(validate$score < .6 & validate$score >.5),] # subset intermediate group
dim(yellow)
yellow$colorcode <-"yellow"  # add color code
 
red <- validate[validate$score <=.5,]  # subset least likely to be admitted group
dim(red)
red$colorcode <- "red" # add color code
 
# create distribution list/report
 
applicants_by_risk<- rbind(red,yellow, green)
dim(applicants_by_risk)
report<-applicants_by_risk[order(applicants_by_risk$id),] # sort by applicant id
print(report[c("id","colorcode", "score")]) # basic unformatted action report can be saved as a data set, and exported for other reports and formatting
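As an alternative to subsetting and re-binding the three groups, the color code can be assigned in one vectorized step and the report written to a file for use elsewhere. This is only a sketch: the data frame `toy_scores` and its values are made up, the cutoffs simply mirror the 0.5 and 0.6 thresholds used above, and the output goes to a temporary file rather than a real reporting location.

```r
# Hypothetical scored applicants
toy_scores <- data.frame(
  id    = 1:6,
  score = c(0.12, 0.50, 0.55, 0.60, 0.83, 0.41)
)

# Same rules as above: >= 0.6 green, between 0.5 and 0.6 yellow, <= 0.5 red
toy_scores$colorcode <- ifelse(toy_scores$score >= 0.6, "green",
                        ifelse(toy_scores$score >  0.5, "yellow", "red"))

print(toy_scores)
table(toy_scores$colorcode)     # number of applicants in each group

# Export for formatting in Excel or loading into another reporting system
out_file <- tempfile(fileext = ".csv")
write.csv(toy_scores, out_file, row.names = FALSE)
```

Nesting `ifelse` keeps the rows in their original order, so no final sort by id is needed.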