## Tuesday, March 15, 2011

### Applied Analytics with R and Venn Diagrams

For a particular client, I developed a predictive model that scored a set of patrons (donors) at different points in time, giving each a predicted probability of stopping contributions. At each successive point the patrons had more experience with the service and more data about them had been collected, so the model’s predictive accuracy improved over time. The client wanted to know, following the same cohort of customers over time, how often the same customers were predicted to stop making donations. In other words, of the customers predicted to stop contributing at t=1, when the model is weakest, how many were also on the ‘list’ at, say, t=3, when the model is much more accurate?

To do this I used the 'limma' package from the Bioconductor repository (see the reference below and the R code that follows).

Before constructing the Venn diagram, I subset the scored donor data to all patrons ever flagged as 'high risk.' I then created a data set with one row per patron and a binary indicator for each stage as they moved from 'novice' to 'intermediate' to 'experienced' (t=1, 2, 3 respectively).
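The reshaping step described above might look something like the sketch below. It assumes the scored data arrive in long format, one row per patron per segment; the data frame `scored` and its columns `ID`, `SEGMENT`, and `HIGH_RISK` are hypothetical stand-ins, not the client's actual layout.

```r
# Hypothetical long-format scores: one row per patron per experience segment
scored <- data.frame(
  ID        = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  SEGMENT   = rep(c("NOVICE", "INTERMEDIATE", "EXPERIENCED"), times = 4),
  HIGH_RISK = c(1, 1, 1,  1, 1, 0,  1, 0, 0,  0, 0, 0)
)

# Fix the column order so the wide table reads t = 1, 2, 3
scored$SEGMENT <- factor(scored$SEGMENT,
                         levels = c("NOVICE", "INTERMEDIATE", "EXPERIENCED"))

# One row per patron, one 0/1 column per segment
# (max() marks a patron as 1 if flagged at any point within that segment)
flags <- with(scored, tapply(HIGH_RISK, list(ID, SEGMENT), max))
wide  <- data.frame(ID = as.numeric(rownames(flags)), flags, row.names = NULL)

# Keep only patrons who were ever flagged as high risk
wide <- wide[rowSums(wide[, -1]) > 0, ]
wide
```

Patron 4 in this toy data is never flagged, so that row is dropped before the indicators are passed on.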

The format for the data set is similar to the layout below:

| ID | NOVICE | INTERMEDIATE | EXPERIENCED |
|----|--------|--------------|-------------|
| 1  | 1      | 1            | 1           |
| 2  | 1      | 1            | 0           |
| 3  | 1      | 0            | 0           |
| ...| ...    | ...          | ...         |

The resulting Venn Diagram is below:

Note that this example data set contains 50 patrons, and only 12 of those were predicted to be 'high risk' at every stage as they moved through the experience categories (time periods). Fifteen were high risk as 'novice' patrons but never appeared in the 'intermediate' or 'experienced' high-risk segments.
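Counts like these can also be cross-checked directly from the indicator columns without drawing the diagram. The sketch below uses a small made-up data frame (`donors` is hypothetical, not the client data) in the same layout and relies only on base R.

```r
# Hypothetical indicator data in the layout shown earlier
donors <- data.frame(
  ID           = 1:6,
  NOVICE       = c(1, 1, 1, 0, 1, 0),
  INTERMEDIATE = c(1, 1, 0, 1, 1, 0),
  EXPERIENCED  = c(1, 0, 0, 1, 1, 1)
)

# Patrons flagged as high risk in all three segments (center of the diagram)
all_three <- sum(donors$NOVICE == 1 & donors$INTERMEDIATE == 1 &
                 donors$EXPERIENCED == 1)

# Patrons flagged only as novices
novice_only <- sum(donors$NOVICE == 1 & donors$INTERMEDIATE == 0 &
                   donors$EXPERIENCED == 0)

# Share of the t=1 list that is still on the list at t=3
retained <- sum(donors$NOVICE == 1 & donors$EXPERIENCED == 1) /
  sum(donors$NOVICE == 1)

# Tally of every membership pattern, one row per region of the diagram
pattern_counts <- as.data.frame(
  table(donors[, c("NOVICE", "INTERMEDIATE", "EXPERIENCED")])
)
pattern_counts
```

The `retained` ratio speaks to the client's original question: of those on the list at t=1, what fraction is still there at t=3.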

References:

How can I generate a Venn diagram in R?
http://www.ats.ucla.edu/stat/r/faq/venn.htm

R code:

```r
# *------------------------------------------------------------------
# | PROGRAM NAME: R_Venn
# | DATE: 3/15/11
# | CREATED BY: MATT BOGARD
# | PROJECT FILE: stats blog
# *----------------------------------------------------------------
# | PURPOSE: CREATE VENN DIAGRAMS FOR MEMBERSHIP IN MULTIPLE GROUPS
# |
# *------------------------------------------------------------------
# |
# |  1: REFERENCES: How can I generate a Venn diagram in R?
# |     http://www.ats.ucla.edu/stat/r/faq/venn.htm
# |
# |  2:
# |  3:
# |*------------------------------------------------------------------
# | DATA USED: data scored by predictive model
# |
# |*------------------------------------------------------------------
# | CONTENTS:
# |
# |  PART 1: Run UCLA example code for practice
# |  PART 2: My data
# |  PART 3:
# *-----------------------------------------------------------------
# |
# |
# *------------------------------------------------------------------

rm(list=ls()) # get rid of any existing data
ls() # view open data sets

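# for 1st time use: install the limma package from Bioconductor
# (biocLite was the installer when this was written; current Bioconductor
#  releases use install.packages("BiocManager"); BiocManager::install("limma"))

source("http://www.bioconductor.org/biocLite.R")

biocLite("limma")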

ls() # see what data is there

library(limma) # load package

# *------------------------------------------------------------------
# | Part 1: Run UCLA example code
# *-----------------------------------------------------------------

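fix(hsb2) # view data set (hsb2 must already be loaded; see the UCLA reference)

# create column vectors to represent the data sets

hw <- (hsb2$write >= 60)
hm <- (hsb2$math >= 60)
hr <- (hsb2$read >= 60) # reading indicator, needed by cbind() below
c3 <- cbind(hw, hm, hr)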

# create the matrix that will be used to plot the venn diagram
a <- vennCounts(c3)
a

vennDiagram(a) # plot venn diagram

# *------------------------------------------------------------------
# | Part 2: My Data
# *-----------------------------------------------------------------

setwd('/Users/wkuuser/Desktop/R Data Sets') # set working directory

list<- read.csv("CUSTOMER_LOYALTY.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") # read data

fix(list) # view data set

names(list) # get variable names (for cutting and pasting below)

# look at summary statistics for each data group

library(Hmisc) # for describe function

novice <- list[(list$NOVICE==1),] # subset novice segment
describe(novice) # n = 37

intermediate <- list[(list$INTERMEDIATE==1),] # subset intermediate segment
describe(intermediate) # n = 25

experienced <- list[(list$EXPERIENCED==1),] # subset experienced segment
describe(experienced) # n = 25

# format data for use in venn diagram function below

l <- list[c("NOVICE","INTERMEDIATE","EXPERIENCED"  )] # keep only indicator variables

l3 <- as.matrix(l) # convert to a matrix

a <- vennCounts(l3) # create counts for venn diagram
a

# plot venn diagram
vennDiagram(a, include = "both", names = c("Novice (n =37)", "Intermediate (n=25)", "Experienced (n=25)"), cex = 1, counts.col = "blue")
title("Donors Likely to Stop Contributions by Experience")
```