The data in this example are graduate school application data provided by UCLA, with the following variables: rank (the prestige rank, 1 to 4, of the applicant's undergraduate institution), GRE (the student's GRE score), and GPA (the student's undergraduate GPA). Additionally, I added a unique ID (row number) for each student applicant, which is used to help create the report. The variable 'admit' is the binary variable (0,1) that we are trying to predict.
One thing I do differently from the example given by UCLA: for demo purposes, I divide the data into training data, used to build the model, and validation data, which here stands in for the data we score, i.e. the students whose admission we are trying to predict. (In practice, validation data are used to calibrate and evaluate models prior to deployment, and a final 'score' data set is used for predicting new people.)
The model used in this example is a logistic regression model. After running the model and getting the odds ratios, we get the following interesting result:
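A minimal sketch of such a split, assuming a small made-up data frame in place of the UCLA file. The seed and the toy values are my own illustrative choices (the program further down does not fix a seed, so its split changes on every run).

```r
# Toy stand-in for the application data (values are made up for illustration)
set.seed(123)                          # fix the seed so the split can be reproduced
toy <- data.frame(id = 1:10, admit = rbinom(10, 1, 0.3))

n        <- nrow(toy)
n_train  <- floor(n * 0.5)             # half of the rows go to training
shuffled <- toy[sample(n), ]           # random row order

train_toy    <- shuffled[1:n_train, ]
validate_toy <- shuffled[(n_train + 1):n, ]

# The two sets should not share any applicant ids
length(intersect(train_toy$id, validate_toy$id))
```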
GPA : 2.41991974
This implies that for every 1-unit increase in GPA, the odds of being admitted increase by a factor of about 2.4, holding GRE and rank constant (for more on interpreting logistic regression coefficients and odds ratios see my post here). Because the training sample is drawn at random, the exact value will differ somewhat from run to run. Odds ratios can be useful for measuring the impact of various variables (which could indicate customer segments, interventions, marketing campaigns, etc.) and how they relate to the probability of any outcome of interest.
By using the R function 'predict', new data can be read in and predictions can be made using the developed model. This gives a probability of admission for each student applicant, which can then be used to create an easily interpreted, actionable report that can be used directly, refined in another program like Excel, delivered via the web, or incorporated into an enterprise-wide reporting system. The data can be merged by ID with other data sources, and various customized reports can be created using the analytics provided by the model.
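To make the odds ratio concrete, here is a small worked sketch. The odds ratio is the one reported above; the 30% baseline probability is a hypothetical applicant I made up for illustration.

```r
# Odds ratio for GPA as reported in the text
or_gpa <- 2.41991974

# The underlying logit coefficient is the log of the odds ratio
beta_gpa <- log(or_gpa)

# Hypothetical applicant with a 30% baseline admission probability
p0    <- 0.30
odds0 <- p0 / (1 - p0)          # convert probability to odds
odds1 <- odds0 * or_gpa         # odds after a 1-unit GPA increase
p1    <- odds1 / (1 + odds1)    # convert odds back to probability

# Same calculation on the logit scale using qlogis/plogis
p1_check <- plogis(qlogis(p0) + beta_gpa)

round(c(beta_gpa = beta_gpa, p1 = p1, p1_check = p1_check), 3)
```

Note that the odds are multiplied by about 2.4, not the probability: for this hypothetical applicant the probability moves from 0.30 to roughly 0.51.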
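A rough sketch of that scoring and merging workflow, using simulated data rather than the UCLA file. The data frames `train_sim`, `new_apps`, and `contacts`, the `advisor` column, and the coefficients used to simulate outcomes are all hypothetical.

```r
set.seed(1)  # simulated data only, so results are reproducible

# Simulated training data (made-up values, not the UCLA file)
train_sim <- data.frame(
  gpa = round(runif(100, 2, 4), 2),
  gre = round(runif(100, 300, 800))
)
train_sim$admit <- rbinom(100, 1,
                          plogis(-6 + 1.2 * train_sim$gpa + 0.003 * train_sim$gre))

# Fit with a data= argument so predict() can be applied to new data
fit <- glm(admit ~ gpa + gre, data = train_sim, family = binomial(link = "logit"))

# New applicants to score, each with an id
new_apps <- data.frame(id  = 1:5,
                       gpa = c(2.5, 3.0, 3.4, 3.8, 4.0),
                       gre = c(400, 520, 600, 700, 780))
new_apps$score <- predict(fit, newdata = new_apps, type = "response")

# Hypothetical second data source keyed by the same id
contacts <- data.frame(id = 1:5, advisor = c("A", "B", "A", "C", "B"))

# Merge by id and write the report to a CSV file that Excel can open
report_sim <- merge(new_apps, contacts, by = "id")
out_file   <- tempfile(fileext = ".csv")
write.csv(report_sim, out_file, row.names = FALSE)
```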
Example:
The R code used for this demonstration is below:
# *------------------------------------------------------------------
# | PROGRAM NAME: ex_logit_analytics_R
# | DATE: 3/14/11
# | CREATED BY: Matt Bogard
# | PROJECT FILE:Desktop/R Programs
# *----------------------------------------------------------------
# | PURPOSE: example of predictive model and reporting
# |
# *------------------------------------------------------------------
# | COMMENTS:
# |
# | 1: Reference: R Data Analysis Logistic Regression
# |    http://www.ats.ucla.edu/stat/r/dae/logit.htm
# | 2:
# | 3:
# |*------------------------------------------------------------------
# | DATA USED: data downloaded from reference above
# |
# |
# |*------------------------------------------------------------------
# | CONTENTS:
# |
# | PART 1: data partition
# | PART 2: build model
# | PART 3: predictions/scoring
# | PART 4: traffic lighting report
# *-----------------------------------------------------------------
# | UPDATES:
# |
# |
# *------------------------------------------------------------------

# get data

apps <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/binary.csv")) # read data
names(apps) # list variables in this data set
dim(apps)   # number of observations
print(apps) # view

# *------------------------------------------------------------------
# |
# | data partition
# |
# |
# *-----------------------------------------------------------------

# store total number of observations in your data (400 for this file)
N <- nrow(apps)
print(N)

# Number of training observations
Ntrain <- N * 0.5
print(Ntrain)

# add an explicit row number variable for tracking
id <- seq_len(N)
apps2 <- cbind(apps, id)

# Randomly arrange the data and divide it into a training
# and test set.
dat <- apps2[sample(1:N), ]
train <- dat[1:Ntrain, ]
validate <- dat[(Ntrain + 1):N, ]

dim(dat)
dim(train)
dim(validate)

# sort and look at data sets to see that they are different
sort_train <- train[order(train$id), ]
print(sort_train)

sort_val <- validate[order(validate$id), ]
print(sort_val)

# *------------------------------------------------------------------
# |
# | build model
# |
# |
# *-----------------------------------------------------------------

# logit model
# (variables are referenced through data=train rather than train$...,
#  otherwise predict() with newdata would silently reuse the training rows)
admit_model <- glm(admit ~ gre + gpa + as.factor(rank), data = train,
                   family = binomial(link = "logit"), na.action = na.pass)

# model results
summary(admit_model)

# odds ratios
exp(admit_model$coefficients)

# *------------------------------------------------------------------
# |
# | predictions/scoring data
# |
# |
# *-----------------------------------------------------------------

train$score <- predict(admit_model, type = "response") # add predictions to training data
sort_train_score <- train[order(train$id), ] # sort by observation
print(sort_train_score) # view

validate$score <- predict(admit_model, newdata = validate, type = "response") # add predictions to validation data
sort_val_score <- validate[order(validate$id), ] # sort by observation
print(sort_val_score) # view

# *------------------------------------------------------------------
# |
# | create a 'traffic light report' based on predicted probabilities
# |
# |
# *-----------------------------------------------------------------

summary(validate$score) # look at probability ranges

green <- validate[validate$score >= .6, ] # subset most likely to be admitted group
dim(green)
green$colorcode <- "green" # add color code variable for this group

yellow <- validate[(validate$score < .6 & validate$score > .5), ] # subset intermediate group
dim(yellow)
yellow$colorcode <- "yellow" # add color code

red <- validate[validate$score <= .5, ] # subset least likely to be admitted group
dim(red)
red$colorcode <- "red" # add color code

# create distribution list/report
applicants_by_risk <- rbind(red, yellow, green)
dim(applicants_by_risk)

report <- applicants_by_risk[order(applicants_by_risk$id), ] # sort by applicant id
print(report[c("id", "colorcode", "score")]) # basic unformatted action report

# can be saved as a data set, and exported for other reports and formatting
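As an alternative sketch, the three color groups could also be assigned in a single step with nested `ifelse` calls instead of subsetting and re-stacking. The `score` vector here is made up for illustration; the cutoffs are the same ones used in the program.

```r
# Hypothetical predicted probabilities
score <- c(0.20, 0.50, 0.55, 0.60, 0.90)

# Same cutoffs as the program: >= .6 green, between .5 and .6 yellow, <= .5 red
colorcode <- ifelse(score >= 0.6, "green",
             ifelse(score > 0.5, "yellow", "red"))

data.frame(score, colorcode)  # one row per applicant with its color code
table(colorcode)              # number of applicants in each group
```

This keeps the applicants in their original order, so no `rbind` or re-sorting by id is needed afterwards.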