## Saturday, March 5, 2011

### Logit Models: R-square and the Percentage of Correct Predictions

In my post on logistic regression and maximum likelihood estimation, I presented a formulation for a pseudo-R-square based on measures of deviance (derived from the log-likelihood). The pseudo-R-square is fine as far as it goes, and some people like it, but people often get wrapped up in the least squares framework and start to talk and act as if it measured an explained share of the sum of squared errors. This is not what is being measured. With maximum likelihood, we are not minimizing squared errors but maximizing a specified likelihood function. While the pseudo-R-square is fashioned after the traditional R-square, there are no sums of squares from which to compute 1-SSE/SST. The pseudo-R-square simply attempts to capture relative changes in the log-likelihood given the model parameters. The various transformations (Cox and Snell, Nagelkerke) attempt to make it a little more useful.
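As a rough illustration of how these measures depend only on log-likelihoods, here is a minimal Python sketch. It uses the standard textbook forms of the McFadden, Cox and Snell, and Nagelkerke measures. The outcomes and "model" probabilities are made-up numbers for illustration, not estimates from a fitted logit, and the function names are my own.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def pseudo_r_squares(ll_model, ll_null, n):
    """Common pseudo-R-square measures built from two log-likelihoods."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}

# Hypothetical data: observed outcomes and illustrative predicted probabilities.
y = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
p_model = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.8, 0.9, 0.2, 0.7]

# The null (intercept-only) model predicts the sample proportion of ones for everyone.
p_null = [sum(y) / len(y)] * len(y)

ll_model = bernoulli_log_likelihood(y, p_model)
ll_null = bernoulli_log_likelihood(y, p_null)
r2 = pseudo_r_squares(ll_model, ll_null, len(y))

for name, value in r2.items():
    print(f"{name}: {value:.3f}")
```

Note that no sum of squares appears anywhere in the calculation; every measure is a function of how far the log-likelihood moves from the null model.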

Below are some remarks from Greene and Kennedy regarding the use of the pseudo-R-square and the assessment of model performance in the framework of maximum likelihood and logistic regression.

‘But the maximum likelihood estimator, on which all the fit measures are based, is not chosen so as to maximize a fitting criterion based on prediction of y as it is in the classical regression (which maximizes R²). It is chosen to maximize the joint density of the observed dependent variables. It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing.’

Greene, Econometric Analysis, 5th ed., p. 686

“There is no universally accepted goodness of fit measure (pseudo-R-square) for probit, logit, or count data models. It is tempting to use the percentage of correct predictions as a measure of goodness of fit. This temptation should be resisted: a naïve predictor, for example that every y=1, could do well on this criterion. A better measure along these lines is the sum of the fraction of zeros correctly predicted plus the fraction of ones correctly predicted, a number which should exceed unity if the prediction method is of value.” (Kennedy, A Guide to Econometrics, 5th ed., p. 267)

Kennedy makes a good point. You will often see results indicating that a logit model performs well because the percentage of correct predictions is very high, maybe 90% or higher. But what if your population is dominated by cases belonging to class Y=1, and your model does a great job predicting only those cases? If all you are interested in is predicting cases where Y=1, that may be an OK model. However, as Kennedy points out, if in your true population 9 out of every 10 observations belong to class Y=1, a naive predictor that classifies 10/10 cases as belonging to class 1 will also have 90% correct predictions. If you are interested in the cases belonging to class Y=0, the naive predictor will be wrong 100% of the time, and a model with 90% correct predictions overall may still misclassify a large share of them.
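Kennedy's point can be checked with a small Python sketch. The data below are made up for illustration (90 ones and 10 zeros), the "model" predictions are hypothetical rather than output from a fitted logit, and the function names are my own. The sketch compares the overall percentage of correct predictions with Kennedy's suggested measure, the fraction of zeros correctly predicted plus the fraction of ones correctly predicted.

```python
def fraction_correct(y_true, y_pred):
    """Overall share of observations classified correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def kennedy_sum(y_true, y_pred):
    """Fraction of zeros correctly predicted plus fraction of ones correctly predicted."""
    ones = [p for t, p in zip(y_true, y_pred) if t == 1]
    zeros = [p for t, p in zip(y_true, y_pred) if t == 0]
    frac_ones = sum(1 for p in ones if p == 1) / len(ones)
    frac_zeros = sum(1 for p in zeros if p == 0) / len(zeros)
    return frac_zeros + frac_ones

# Hypothetical population: 9 out of every 10 observations have Y=1.
y_true = [1] * 90 + [0] * 10

# Naive predictor: classify every case as Y=1.
naive_pred = [1] * 100

# Hypothetical model: gets 85 of the 90 ones right and 5 of the 10 zeros right.
model_pred = [1] * 85 + [0] * 5 + [0] * 5 + [1] * 5

for label, pred in (("naive", naive_pred), ("model", model_pred)):
    print(label, fraction_correct(y_true, pred), round(kennedy_sum(y_true, pred), 3))
```

Both predictors classify 90% of cases correctly, but the naive predictor's sum is exactly 1 (all ones right, no zeros right), while the hypothetical model's sum exceeds 1. That is the sense in which Kennedy's measure separates a useful prediction method from a naive one.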

More on R-square and Goodness of Fit for Discrete Choice Models

Thomas, Dawes, and Reznik. Using Predictive Modeling to Target Student Recruitment: Theory and Practice.  AIR Professional File 78 Winter 2001

Applied multiple regression/correlation analysis for the behavioral sciences 3rd Edition.
Jacob Cohen, Patricia Cohen - L. Erlbaum Associates (2003)

“Again, we caution that all these indicators are not goodness of fit indices in the sense of ‘proportion of variance accounted for’, in contrast to R² in OLS regression. This seems puzzling, perhaps, but the explanation is straightforward. Reflect for a moment on the OLS regression model, which assumes homoscedasticity - the same error variance for every value of the criterion. Given homoscedasticity, we are able to think of the total proportion of variance that is error variance in a universal sense, across the full range of Y. In contrast, in logistic regression, we have inherent heteroscedasticity, with a different error variance for each different value of the predicted score 'p'.”

Applied Logistic Regression. Second Edition. David W. Hosmer & Stanley Lemeshow.

“The most common assumption is that 'e' follows a normal distribution with mean zero and some variance that is constant across all levels of the independent variable. It follows that the conditional distribution of the outcome variable given x will be normal with mean E(Y|x), and a variance that is constant. This is not the case with a dichotomous outcome variable… thus 'e' has a distribution with mean zero and variance equal to π(x)[1-π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).”
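The variance expression π(x)[1-π(x)] in the passage above can be illustrated numerically. The Python sketch below assumes a logit model with made-up coefficients (intercept 0 and slope 1, chosen only for illustration) and prints the error variance at several values of x. The function names are my own.

```python
import math

def logistic(x, b0=0.0, b1=1.0):
    """Conditional mean pi(x) under a logit model; b0 and b1 are illustrative values."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def bernoulli_variance(pi):
    """Variance of a 0/1 outcome whose probability of 1 is pi."""
    return pi * (1.0 - pi)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    pi = logistic(x)
    print(f"x={x:+.1f}  pi(x)={pi:.3f}  variance={bernoulli_variance(pi):.3f}")
```

The variance is largest where π(x) = 0.5 and shrinks toward zero as π(x) approaches 0 or 1, so there is no single error variance that applies across the full range of the predictors. This is the inherent heteroscedasticity that both quoted passages describe.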