## Friday, September 17, 2010

### Data Mining/Machine Learning Overview

For example code in R related to some of these topics see 'Data Mining in A Nutshell' -link

BAGGING: Acronym for ‘bootstrap aggregating.’ A technique that relies on sampling with replacement. By taking a number N of bootstrap samples, N models are fit, one from each sample. For regression, the models are combined or averaged. For classification 'votes' or outcomes from each model are tallied with the majority determining a given class.

BOOSTING:
Developing an 'ensemble' or additive model composed of simple rules or weak learners. These rules are combined in such a way that each ensemble member's performance is improved or 'boosted.'

Separates classes utilizing supporting hyperplanes and a separating hyperplane that maximizes the margin between classes.
A nonlinear model of complex relationships composed of multiple 'hidden' layers (similar to composite functions)

Y = f(g(h(x)) or
x -> hidden layers ->Y

'Classification and Decision Trees'(CART) Divides the training set into rectangles (partitions) based on simple rules and a measure of 'impurity.' (recursive partitioning) Rules and partitions can be visualized as a 'tree.' These rules can then be used to classify new data sets.

RANDOM FORESTS:
An ensemble of decision trees.

CHI-SQUARED AUTOMATIC INTERACTION DETECTION:
characterized by the use of a chi-square test to stop decision tree splits. (CHAID) Requires more computational power than CART.
A generative learning algorithm based on an application of Bayes Theorem. Referred to as 'naive' because of strong independence assumptions:

p(yi|x) = p(x|yi)p(yi) / p(x)    'bayes rule'
p(x|y) = p(x1,x2|y) = p(x1|y) p(x2|y) = ∏ p(xk|yi)   'conditional independence assumption'
p(yi|x) = ∏ p(xk|yi) p(yi) / p(x)    'naive bayes'

SURVIVAL ANALYSIS: Models the failure time or occurrence of an event. Define a 'survival' function that relates 't' (failure time variable) to Y. Using the survival function, a likelihood function can be specified.

DESCRIMINANT ANALYSIS: Weights are estimated that form a linear function maximizing between group vs. within group variance. Can be visualized as a line that separates groups.

CLUSTER ANALYSIS:
Utilizes a measure of distance to define homogeneous clusters or classes of data

K-MEANS CLUSTERING: Defines a centroid or average of all points in a cluster of data. Observations are assigned to the cluster that has the 'nearest' or closest mean value. The space defined by cluster centroids is equivalent to the subspace defined by principal components of the data set.

K-NEAREST NEIGHBOR:
an observation takes the same classification as the most common class of its k-nearest neighbors.

K-FOLD CROSS VALIDATION: Models are constructed by training on all but 1 of 'K' subsets of data. This is repeated K-1 times.

MARKET BASKET ANALYSIS-APRIORI ALGORITHM: Derives rules based on the probability of items occurring together. A conditional probability is used as a measure of 'confidence.' Example: The probability that A and C occur together ( A --> C) can be defined as P(A∩B)/P(C) = P(CA)

LIFT: the increase in the probability of an event given the occurrence of another event. Lift = support/(support(A) + support(C))

PARTIAL LEAST SQUARES: For Y = X, derive latent vectors (principle components) of X and use them to explain the covariance of X and Y.

PATH ANALYSIS: An application of regression analysis to define connections or paths between variables. Several simultaneous regressions may be specified and can be visualized by a path diagram.

STRUCTURAL EQUATION MODELING:
Path analysis utilizing latent variables. Combines regression analysis and factor analysis in a simultaneous equation model.

PRINCIPLE COMPONENTS ANALYSIS: Y = Ax The components of A are eigenvectors of an observed covariance matrix S or correlation matrix R. They define principal axis that maximize the variance of a data set. The principal components Y = (y.....y) define new uncorrelated variables that provide an alternative 'reduced' representation of the original data.

FACTOR ANALYSIS: Utilizes principal components analysis to partition total variance into common an unique variance based on underlying factors and assumptions of a statistical model.

X = L F + U

X = observations
F = theoretical factor, defined by the X's that load most heavily on that factor
U = unique variance

GENETIC ALGORITHMS: Analogizes concepts from genetics to model data in an iterative multistep (multigenerational) process.

Each ‘generation’ is a completion of 5 steps, iterating for 3-5 generations generates estimators.

1) Chromosomes = S initial random samples
2) Fitness = predictive accuracy
3) Replicate chromosomes weighting by fitness
4) Crossover = randomly select 2 chromosomes to form 2 new ‘recombinants’
5) Mutations = initiate random changes in randomly selected chromosomes