## Thursday, March 3, 2011

### Data Mining and Input Transformations

In applied econometrics, there are many reasons why we may or may not want to transform data. With time series data, we may take first differences to deal with non-stationarity. To deal with serial correlation, we may resort to generalized least squares and adjust the data based on an estimate of the correlation of the error terms. In spatially weighted regression, the inputs are adjusted or weighted using a spatial weights matrix. In many of these cases, the adjustments are an attempt to explicitly model a theorized mathematical relationship between inputs and outputs.
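As a small illustration of the first case, here is a minimal sketch of first differencing in R. The series is a simulated random walk (my own made-up example, with an arbitrary seed), not data from any real application:

```r
set.seed(42)            # arbitrary seed, only for reproducibility
e  <- rnorm(200)        # white-noise shocks
rw <- cumsum(e)         # random walk: a simple non-stationary series
d_rw <- diff(rw)        # first difference: rw[t] - rw[t-1]

# differencing the random walk recovers the shocks (all but the first one)
print(all.equal(d_rw, e[-1]))
```

The point is only that `diff()` removes the accumulated trend and leaves the stationary shocks; real series usually need a unit root test before deciding to difference.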

In the SAS training 'Applied Analytics with SAS Enterprise Miner', it is recommended to transform any of the training examples that may have skewed distributions, without regard to any explicit mathematical relationship between the explanatory and dependent variables. The recommendation is based on an attempt to eliminate leverage or skewness that may impact model performance. Recall that in regression the goal is to use sample data to estimate the population conditional expectation function. Since what we are estimating is a conditional mean, it is subject to bias from outliers, or the same sort of leverage that biases any mean from any skewed distribution (see also my post on data imputation and M-estimators). A transformation that creates a more symmetric or less skewed distribution could lead to less biased predictions.
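To see how a single extreme value pulls a mean, consider a quick sketch with made-up numbers (these values are purely illustrative and are not the data used later in this post):

```r
x     <- c(30, 32, 33, 35, 36, 37, 40)   # made-up, roughly symmetric values
x_out <- c(x, 400)                       # same values plus one extreme observation

# without the extreme value, mean and median are close
print(c(mean = mean(x), median = median(x)))

# with it, the mean jumps while the median barely moves
print(c(mean = mean(x_out), median = median(x_out)))
```

A regression line is a conditional mean, so it inherits this same sensitivity to extreme observations.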

As explained in a great post on the stats blog for Children's Mercy Hospitals and Clinics (which may seem like an odd place for a stats blog, but it is a resource where I have often found great references):

"If you want to use a log transformation, you compute the logarithm of each data value and then analyze the resulting data. You may wish to transform the results back to the original scale of measurement. The logarithm function tends to squeeze together the larger values in your data set and stretches out the smaller values. This squeezing and stretching can correct one or more of the following problems with your data: 1. Skewed data 2. Outliers 3. Unequal variation"

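The squeezing and stretching described in the quote can be sketched with a few made-up values that each differ by a factor of 10 (the vector `v` is my own illustration, not part of the quoted post):

```r
v <- c(1, 10, 100, 1000)

print(diff(v))        # gaps on the original scale: 9 90 900
print(diff(log(v)))   # gaps on the log scale are all equal to log(10)

# transforming back to the original scale with exp()
back <- exp(log(v))
print(all.equal(back, v))
```

On the log scale, equal ratios become equal distances, which is why large values get pulled together while small ones spread apart.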
Let's simulate a skewed data distribution and see how this works using R.

The following R code will create the skewed data, graph it, and compute some basic diagnostics, as well as provide results for the regression of y on x2:

```r
# simulate skewed data
# Note: the final value was originally typed as `breaks =20` inside c().
# c() treats that as a (named) data value of 20, and the results reported
# in this post include it, so it is kept here as a plain 20.
x2 <- c(10, 11, 13, 24, 30, 35, 33, 34, 35, 36, 37, 32, 37, 40, 42, 60, 71, 100, 120, 20)
y <- 1:20

hist(x2)            # plot histogram
summary(x2)         # summary stats
plot(x2, y)         # plot x2 and y
reg2 <- lm(y ~ x2)  # regress y on x2
abline(reg2)        # plot regression line over data
summary(reg2)       # regression output
```

The summary statistics give the following:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|------|---------|--------|------|---------|------|
| 10.0 | 28.5 | 35.0 | 41.0 | 40.5 | 120.0 |

It is easy to see that the mean is pulled toward the tail of the distribution and, as expected, is greater than the median.

Regressing y on x2 gives an adjusted R-squared value of 0.4653.

Let's look at a log transformation of x2.

```r
logx2 <- log(x2)       # log transform
hist(logx2)            # plot histogram of logx2
summary(logx2)         # produce summary stats
plot(logx2, y)         # scatterplot

reg3 <- lm(y ~ logx2)  # regress y on log(x2)
abline(reg3)           # plot regression line over data
summary(reg3)          # regression results
```

Taking a log transform of x2 gives a much more symmetrical distribution, as the histogram produced by `hist(logx2)` shows.

From the summary statistics, we see that the mean and median are almost identical, another indication that the transformation improved the symmetry of the distribution:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|------|---------|--------|------|---------|------|
| 2.303 | 3.345 | 3.555 | 3.521 | 3.701 | 4.787 |
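Comparing the mean and median is an informal check. One way to put a number on the symmetry is a moment-based skewness coefficient. Base R has no built-in skewness function, so the `skew()` helper below is a hypothetical one written for this sketch (the third central moment divided by the 1.5 power of the second):

```r
# Same data as in the post (the final 20 is the value typed as `breaks =20`)
x2 <- c(10, 11, 13, 24, 30, 35, 33, 34, 35, 36, 37, 32, 37, 40, 42, 60, 71, 100, 120, 20)

# hypothetical helper: moment-based sample skewness
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skew(x2)        # clearly positive: long right tail
skew_log <- skew(log(x2))   # much closer to zero after the log transform
print(round(c(raw = skew_raw, log = skew_log), 3))
```

A value near zero indicates a roughly symmetric distribution; other skewness estimators apply small-sample corrections and will give slightly different numbers.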

From the regression output, we see that we are now able to explain more of the total variation in y after the transformation. Squeezing the data together reduced the leverage exerted by the extreme observations in the sample and enabled a better fit, which should allow better predictions of y.
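Both claims can be checked directly. The sketch below refits the two models and pulls out the adjusted R-squared and the largest leverage (hat value) from each. The use of `hatvalues()` as the leverage measure is my choice for this illustration:

```r
# Same data as in the post (the final 20 is the value typed as `breaks =20`)
x2 <- c(10, 11, 13, 24, 30, 35, 33, 34, 35, 36, 37, 32, 37, 40, 42, 60, 71, 100, 120, 20)
y  <- 1:20

reg2 <- lm(y ~ x2)        # untransformed input
reg3 <- lm(y ~ log(x2))   # log-transformed input

# adjusted R-squared for each fit
adj_r2 <- c(raw = summary(reg2)$adj.r.squared,
            log = summary(reg3)$adj.r.squared)
print(round(adj_r2, 4))

# largest leverage (diagonal of the hat matrix) in each fit
max_lev <- c(raw = max(hatvalues(reg2)),
             log = max(hatvalues(reg3)))
print(round(max_lev, 4))
```

The `raw` adjusted R-squared should agree with the 0.4653 reported above, and the log fit should show both a higher adjusted R-squared and a smaller maximum leverage.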

This is all without making any assumptions about the explicit mathematical relationship between x and y. In reality, could there be such an explicit relationship? Possibly. If theory leads you down that path, then you should transform x regardless, not just to smooth out the distribution, but to incorporate the theory into your model. Often in data mining we are not sure of the relationships between the training examples and the dependent variable. But if our goal is making good predictions or fitting the data well, we know that certain transformations can improve our results. As Greene says:

"It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing." (Econometric Analysis, 5th Edition)

Greene is referring to issues related to the pseudo R-squared and maximum likelihood estimators, but the concept applies, and yet again goes back to the cultural conflicts between machine learning and traditional data modeling. Recall Breiman's quote from my post 'Culture Wars':

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

As I was cultured in the traditional data modeling paradigm, I've been more likely to think in terms of this straight jacket approach: critically thinking through the data model, the functional form, theoretical support for the variables to include and their functional relationship with y, etc. If your goal is to get coefficient estimates and make specific inferences about the relationships between specific variables in the model, then perhaps that is the approach to take. However, with hard deadlines and large data sets, this becomes very time consuming and difficult. If your goal is actually prediction, as Breiman notes, you could easily paint yourself into a very restricted corner, and as Greene states, fitting y well and obtaining good parameter estimates need not be the same thing.

Ultimately it comes down to what your goals are and your client's preferences.