Saturday, January 29, 2011

Which variables are the most important?

Often, after conducting an analysis, I've been asked which variables are the most important. While the question is simple and practical, the ways of answering it are not always simple or practical. Professor David Firth of Oxford assembled a very useful literature review addressing this topic, which can be found here.

In a critique of The Bell Curve, Goldberger and Manski offer the following:

Arthur S. Goldberger and Charles F. Manski, in Journal of Economic Literature Vol. XXXIII (June 1995), pp. 762-776

"We must still confront HM's interpretation of the relative slopes of the two curves in Figure 1 as measuring the "relative importance" of cognitive ability and SES in determining poverty status.
Standardization in the manner of HM - essentially using "beta weights" - has been a common practice in sociology, psychology, and education. Yet it is very rarely encountered in economics. Most econometrics textbooks do not even mention the practice. The reason is that standardization accomplishes nothing except to give quantities in noncomparable units the superficial appearance of being in comparable units. This accomplishment is worse than useless - it yields misleading inferences.

"We find no substantively meaningful way to interpret the empirical analysis in Part II of The Bell Curve as showing that IQ is "more important" than SES as a determinant of social behaviors....The Coleman Report sought to measure the "strength" of the relationship between various school factors and pupil achievement through the percent of variance explained by each factor, an approach similar to that of HM. Cain and Watts write (p. 231): "this measure of strength is totally inappropriate for the purpose of informing policy choice, and cannot provide relevant information for the policy maker."
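To make the concern about "beta weights" concrete, here is a minimal sketch (my own illustration, not a reproduction of Goldberger and Manski's argument). A beta weight is the raw slope rescaled by the sample standard deviations of x and y, so it moves with the spread of x in the sample even when the effect per unit of x is unchanged. The simulated data, the true slope of 2.0, and the function names are all assumptions made for the example.

```python
import random
import statistics

def slope_and_beta(x, y):
    """Return the raw OLS slope and the standardized ("beta weight") slope."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    # Beta weight: slope rescaled by the sample standard deviations.
    beta = slope * statistics.stdev(x) / statistics.stdev(y)
    return slope, beta

rng = random.Random(0)

def simulate(x_sd, n=5000):
    """Hypothetical data: y = 2.0 * x + noise, with x drawn at a chosen spread."""
    x = [rng.gauss(0, x_sd) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0, 1.0) for xi in x]
    return slope_and_beta(x, y)

slope_narrow, beta_narrow = simulate(x_sd=0.5)
slope_wide, beta_wide = simulate(x_sd=2.0)

print(f"narrow x: slope={slope_narrow:.2f}, beta={beta_narrow:.2f}")
print(f"wide x:   slope={slope_wide:.2f}, beta={beta_wide:.2f}")
```

In both samples the raw slope should land near 2.0, while the beta weight should be noticeably larger in the sample where x varies more. The standardized number reflects the sample's spread as much as the underlying relationship.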

Again, this goes back to Leamer's discussion of the current state of econometrics on EconTalk, as well as Breiman's comments on variable importance in Statistical Modeling: The Two Cultures:

"My definition of importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy...Importance does not yet have a satisfactory theoretical definition...It depends on the dependencies between the output variable and the input variables, and on the dependencies between the input variables. The problem begs for research."
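A rough sketch of the deletion idea in that quotation: fit a model with all inputs, refit with one input removed, and compare prediction error on held-out data. This is a simplified stand-in using ordinary least squares on simulated data, not Breiman's own procedure. The data-generating process, the variable names x1, x2, x3, and the helper functions are all hypothetical. All three inputs have the same true coefficient, but x1 and x2 are correlated while x3 is independent.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [A'A | A'y], solved by Gauss-Jordan elimination.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(coefs, X):
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], r)) for r in X]

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def drop_column(X, j):
    return [[v for i, v in enumerate(r) if i != j] for r in X]

def simulate(n, rng):
    """Hypothetical data: y = x1 + x2 + x3 + noise, with x2 correlated with x1."""
    X, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.8 * x1 + 0.6 * rng.gauss(0, 1)
        x3 = rng.gauss(0, 1)
        X.append([x1, x2, x3])
        y.append(x1 + x2 + x3 + rng.gauss(0, 0.5))
    return X, y

rng = random.Random(0)
X_train, y_train = simulate(2000, rng)
X_test, y_test = simulate(2000, rng)

full_mse = mse(y_test, predict(ols_fit(X_train, y_train), X_test))
increases = []
for j in range(3):
    coefs = ols_fit(drop_column(X_train, j), y_train)
    reduced_mse = mse(y_test, predict(coefs, drop_column(X_test, j)))
    increases.append(reduced_mse - full_mse)

for name, inc in zip(["x1", "x2", "x3"], increases):
    print(f"dropping {name} raises test MSE by {inc:.3f}")
```

Deleting x3 should hurt prediction far more than deleting x1 or x2, even though all three coefficients are equal, because each of the correlated pair can partly stand in for the other. That is the dependence "between the input variables" the quotation refers to.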

In A Guide to Econometrics, Kennedy presents multiple regression in the context of the Ballentine Venn diagram and explains:
“... the blue-plus-red-plus-green area represents total variation in Y explained by X and Z together (X1 & X2 in my adapted diagram above)….the red area is discarded only for the purpose of estimating the coefficients, not for predicting Y; once the coefficients are estimated, all variation in X and Z is used to predict Y…Thus, the R² resulting from multiple regression is given by the ratio of the blue-plus-red-plus-green area to the entire Y circle. Notice there is no way of allocating portions of total R² to X and Z because the red area variation is explained by both, in a way that cannot be disentangled. Only if X and Z are orthogonal, and the red area disappears, can total R² be allocated unequivocally to X and Z separately.”
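The same point can be seen algebraically in the two-regressor case. The identity below is the standard textbook expression for R² in terms of simple correlations; it is not taken from Kennedy's text, and the symbols r_{YX}, r_{YZ}, and r_{XZ} (the pairwise sample correlations) are notation introduced here for illustration.

```latex
R^2 \;=\; \frac{r_{YX}^2 + r_{YZ}^2 - 2\, r_{YX}\, r_{YZ}\, r_{XZ}}{1 - r_{XZ}^2}
\qquad\Longrightarrow\qquad
R^2 \;=\; r_{YX}^2 + r_{YZ}^2 \quad \text{when } r_{XZ} = 0 .
```

When X and Z are uncorrelated, R² splits cleanly into one piece per regressor. Otherwise the cross term involving r_{XZ} belongs to neither variable alone, which corresponds to the red area in the Ballentine.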

So on a theoretical basis, total variance explained is out. A better answer might seem to be that all significant variables are important. However, Ziliak and McCloskey argue against this approach:

"Significant does not mean important and insignificant does not mean unimportant"

The thing to bear in mind is that, even in the absence of theoretical justification, there are questions that need answering. Perhaps a number of practical approaches should be embraced, including some of those criticized above.

(For an interesting discussion of this topic on Google Groups, see here.)
