Tuesday, March 15, 2011

Applied Analytics with R and Venn Diagrams


For a particular client, I developed a predictive model that scored a set of patrons (donors) at different points in time, giving the predicted probability that each would stop making contributions. At each successive point, patrons had more experience with the service and more data about them had been collected, so the model's predictive accuracy improved over time. The client wanted to know, for the same cohort followed over time, how often the same patrons were predicted to stop donating. In other words, of the patrons predicted to stop contributing at t=1, when the model is weakest, how many were also on the 'list' at, say, t=3, when the model is much more accurate?

To do this I used the 'limma' package from the Bioconductor repository (see the reference and the R code that follows).

Before attempting to construct the Venn diagram, I had to take the scored donor data set and subset it to the patrons who were ever flagged as 'high risk.' Then I created a data set with one row per patron and one binary indicator per stage, marking whether the patron was on the high-risk list as a 'novice,' 'intermediate,' or 'experienced' patron (t=1, 2, 3 respectively).



The format for the data set is similar to the layout below:

ID NOVICE INTERMEDIATE EXPERIENCED
1 1 1 1
2 1 1 0
3 1 0 0
.  .  .  .
etc.
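
As a rough sketch of that reshaping step, the base R code below starts from a made-up long-format table with one row per patron per period and produces the one-row-per-patron layout shown above. The data frame `scored` and its columns `id`, `period`, and `high_risk` are assumptions for illustration only; they are not the client's data or the actual names used in the project.

```r
# Hypothetical long-format scores: one row per patron per time period.
# 'scored', 'id', 'period' and 'high_risk' are illustrative names only.
scored <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  period    = rep(c("NOVICE", "INTERMEDIATE", "EXPERIENCED"), times = 4),
  high_risk = c(1, 1, 1,  1, 1, 0,  1, 0, 0,  0, 0, 0)
)

# Keep only patrons who were flagged high risk at least once
ever_flagged <- unique(scored$id[scored$high_risk == 1])
flagged <- scored[scored$id %in% ever_flagged, ]

# Fix the column order so it follows t = 1, 2, 3
flagged$period <- factor(flagged$period,
                         levels = c("NOVICE", "INTERMEDIATE", "EXPERIENCED"))

# One row per patron, one 0/1 column per period
wide <- as.data.frame.matrix(xtabs(high_risk ~ id + period, data = flagged))
wide <- cbind(ID = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL

wide
```

Patron 4 is never flagged, so it drops out before the reshape, which mirrors the 'ever high risk' subset described above.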

The resulting Venn Diagram is below:



Note, there were 50 patrons in this example data set, and only 12 of those were predicted to be 'high risk' every time as they moved across each experience category or time period. 15 were high risk 'novice' patrons but never became part of the 'intermediate' or 'experienced' segments.
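
The counts in each region of the diagram can also be checked by hand without limma. The sketch below uses a small made-up indicator table (not the 50-patron data set) to count the 'all three' region and the 'novice only' region, and then tabulates every combination of the three indicators.

```r
# Toy indicator data (illustrative only, not the client's data)
ind <- data.frame(
  NOVICE       = c(1, 1, 1, 0, 1),
  INTERMEDIATE = c(1, 1, 0, 1, 0),
  EXPERIENCED  = c(1, 0, 0, 1, 0)
)

# Patrons flagged at every stage (centre of the Venn diagram)
all_three <- sum(rowSums(ind) == 3)

# Patrons flagged only as novices
novice_only <- sum(ind$NOVICE == 1 & ind$INTERMEDIATE == 0 & ind$EXPERIENCED == 0)

# Count of patrons for every 0/1 combination of the three indicators
regions <- as.data.frame(table(ind))

all_three
novice_only
regions
```

Each row of `regions` corresponds to one region of a three-set Venn diagram (including the 'none' region), so it should agree with the counts that vennCounts() reports for the same data.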

References:

How can I generate a Venn diagram in R?
http://www.ats.ucla.edu/stat/r/faq/venn.htm

R code:

# *------------------------------------------------------------------
# | PROGRAM NAME: R_Venn
# | DATE: 3/15/11
# | CREATED BY: MATT BOGARD  
# | PROJECT FILE: stats blog        
# *----------------------------------------------------------------
# | PURPOSE: CREATE VENN DIAGRAMS FOR MEMBERSHIP IN MULTIPLE GROUPS              
# |
# *------------------------------------------------------------------
# | COMMENTS:               
# |
# |  1: REFERENCES: How can I generate a Venn diagram in R? 
# |     http://www.ats.ucla.edu/stat/r/faq/venn.htm
# | 
# |  2: 
# |  3: 
# |*------------------------------------------------------------------
# | DATA USED: data scored by predictive model  
# |
# |*------------------------------------------------------------------
# | CONTENTS:               
# |
# |  PART 1: Run UCLA example code for practice 
# |  PART 2: My data
# |  PART 3: 
# *-----------------------------------------------------------------
# | UPDATES:               
# |
# |
# *------------------------------------------------------------------
 
 
 
 
 rm(list=ls()) # get rid of any existing data 
 ls() # view open data sets
 
 
 
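# for 1st time use - install the limma package from Bioconductor
# (the biocLite() installer has been retired; BiocManager replaces it)
 
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
 
 
BiocManager::install("limma")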
 
ls() # see what data is there
 
library(limma) # load package
 
# *------------------------------------------------------------------
# | Part 1: Run UCLA example code            
# *-----------------------------------------------------------------
 
 
# read data
 
hsb2<-read.table("http://www.ats.ucla.edu/stat/R/notes/hsb2.csv", sep=',', header=T)
 
fix(hsb2) # view data set 
 
# create column vectors to represent the data sets
 
hw<-(hsb2$write>=60)
hm<-(hsb2$math >=60)
hr<-(hsb2$read >=60)
c3<-cbind(hw, hm, hr)
 
# create the matrix that will be used to plot the venn diagram
a <- vennCounts(c3)
a
 
vennDiagram(a) # plot venn diagram
 
# *------------------------------------------------------------------
# | Part 2: My Data           
# *-----------------------------------------------------------------
 
 
setwd('/Users/wkuuser/Desktop/R Data Sets') # set working directory
 
 
donors <- read.csv("CUSTOMER_LOYALTY.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") # read data (named 'donors' so it does not mask base R's list())
 
fix(donors) # view data set
 
names(donors) # get variable names (for cutting and pasting below)
 
# look at summary statistics for each data group
 
library(Hmisc) # for describe function
 
novice <- donors[(donors$NOVICE==1),] # subset novice segment
describe(novice) # n =37
 
intermediate <- donors[(donors$INTERMEDIATE==1),] # subset intermediate segment
describe(intermediate) # n=25
 
experienced <- donors[(donors$EXPERIENCED==1),] # subset experienced segment
describe(experienced)  # n= 25
 
# format data for use in venn diagram function below
 
l <- donors[c("NOVICE","INTERMEDIATE","EXPERIENCED")] # keep only indicator variables
 
l3 <- as.matrix(l) # convert to a matrix
 
a <- vennCounts(l3) # create counts for venn diagram
a
 
# plot venn diagram 
vennDiagram(a, include = "both", names = c("Novice (n =37)", "Intermediate (n=25)", "Experienced (n=25)"), cex = 1, counts.col = "blue")
title("Donors Likely to Stop Contributions by Experience")