Saturday, January 19, 2013

That modeling feeling « Statistical Modeling, Causal Inference, and Social Science

This is exactly what it's like: 

"And the challenge is to get from point A to point B. So, you throw model after model at the problem, method after method, alternating between quick-and-dirty methods that get me nowhere, and elaborate models that give uninterpretable, nonsensical results. Until finally you get close. Actually, what happens is that you suddenly solve the problem! Unexpectedly, you're done! And boy is the result exciting. And you do some checking, fit to a different dataset maybe, or make some graphs showing raw data and model estimates together, or look carefully at some of the numbers, and you realize you have a problem. And you stare at your code for a long long time and finally bite the bullet, suck it up and do some active debugging, fake-data simulation, and all the rest. You code your quick graphs as diagnostic plots and build them into your procedure. And you go back and do some more modeling, and you get closer, and you never quite return to the triumphant feeling you had earlier—because you know that, at some point, the revolution will come again and with new data or new insights you'll have to start over on this problem..."

"But, not so deep inside you, that not-so-still and not-so-small voice reminds you of the compromises you've made, the data you've ignored, the things you just don't know if you believe. You want to do more, but that will require more computing, more modeling, more theory. Yes, more theory...."

And the post continues with more of 'what it's like'...

Friday, January 18, 2013

Paul Allison On Multicollinearity

I happened to stumble upon this great article by Paul Allison, whom I had the privilege of seeing at last year's SAS Global Forum. Actually, everything I've read by him, from logistic regression to survival analysis and data imputation, is very thorough and cogent. Here are a few of his points that hit home with me: 

When is it OK to ignore multicollinearity? Well when....

# 1. The variables with high VIFs are control variables, and the variables of interest do not have high VIFs.
"Here's an example from some of my own work: the sample consists of U.S. colleges, the dependent variable is graduation rate, and the variable of interest is an indicator (dummy) for public vs. private. Two control variables are average SAT scores and average ACT scores for entering freshmen. These two variables have a correlation above .9, which corresponds to VIFs of at least 5.26 for each of them. But the VIF for the public/private indicator is only 1.04. So there's no problem to be concerned about, and no need to delete one or the other of the two controls."
# 3. The variables with high VIFs are indicator (dummy) variables that represent a categorical variable with three or more categories.
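
The arithmetic in the SAT/ACT example under point 1 is easy to check. A minimal sketch, assuming only the two correlated controls are involved: when a predictor is regressed on a single other predictor, the R-squared equals the squared correlation, so the VIF is 1 / (1 - r^2). Adding further predictors can only raise that R-squared, which is why the quote says "at least". The function name is my own, not something from Allison's article.

```python
def vif_from_correlation(r):
    """VIF for one of two predictors whose pairwise correlation is r.

    With a single other predictor, the R-squared from regressing one
    on the other equals r**2, so VIF = 1 / (1 - r**2).
    """
    return 1.0 / (1.0 - r ** 2)


# Correlation of .9 between average SAT and average ACT scores.
vif_controls = vif_from_correlation(0.9)
print(round(vif_controls, 2))
```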

Thursday, January 10, 2013

You say Stata I say SAS: Software Signaling and Social Identity Theory

In my previous post I quoted the following: 

"When you don't have to code your own estimators, you probably won't understand what you're doing. I'm not saying that you definitely won't, but push-button analyses make it easy to compute numbers that you are not equipped to interpret." 

Then I said: 

I Agree. Statistics is a language best communicated and understood via code vs. a point and click GUI.

But I revised the post and restated:

I agree that statistics is a language best communicated and understood via code vs. a point and click GUI. 

I've been thinking a lot about this, and considering some insight from Rick Wicklin in the comments to that post. What I should actually say is that  *personally* I don't feel like I understand an estimator as well until I've actually coded it (as an algorithm not just submitting a command to a software package) or at least made some attempt to implement it in some simplified way, or I get some idea of how it could be coded if my coding skills were up to the challenge. So, by that measure I understand some estimators better than others and trying to better understand estimators in this way is really the purpose of this blog. 

But I've thought a little more about the first quote. *IF* I understand correctly how SAS/R/STATA/SPSS etc. work, then whether you are pointing and clicking in a GUI or submitting canned routines at the command line, *BOTH* are wrappers for the heavy lifting: the actual statistical programming that developers have abstracted away behind the scenes. Both command line and GUI environments make it pretty easy to get results you may or may not correctly interpret, or simply to apply the wrong test in the wrong situation.
But, you can also really get into trouble actually coding your estimator (maybe using SAS IML or R or Octave). If you don't know what you are doing and you click through a regression in SPSS or submit a PROC in SAS, at least the estimates will be correct. If you make a syntax error the code likely just won't run and the log may even help you out.  The application of the correct statistical test or interpretation is up to you. But you can make a mistake coding your own estimator and not even realize it. Efficiency, repeatability, resource requirements etc. all factor in too.
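
One way to guard against a silent mistake in a hand-coded estimator is a fake-data check: simulate from parameters you chose yourself and see whether the code recovers them. The sketch below does this for simple linear regression in plain Python (not SAS IML or R, just an illustrative stand-in). The function name, seed, and parameter values are all assumptions for the example.

```python
import random


def ols_simple(x, y):
    """Hand-coded OLS for y = a + b*x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Fake-data check: simulate from known parameters, then see whether
# the hand-coded estimator gets close to them.
random.seed(0)
true_intercept, true_slope = 2.0, 0.5
x = [random.uniform(0, 10) for _ in range(5000)]
y = [true_intercept + true_slope * xi + random.gauss(0, 1) for xi in x]

intercept_hat, slope_hat = ols_simple(x, y)
print(intercept_hat, slope_hat)
```

If the estimates land far from the values used to simulate the data, the bug is in the estimator, not in the data. That is a check a canned PROC never asks you to make.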

As far as 'signaling' goes, now that I've thought more about it, a screen full of code certainly may give the appearance of a more sophisticated analysis and may even give one a false sense of confidence in the results i.e. the signal for sophistication or quality may be mixed at best. 

In a recent blog post, Andrew Gelman states (in the context of this same software signal discussion on his blog): 
  To me, a statistics package is not just its code, it’s also its community, it’s what people do with it.

I can relate to the community aspect. On campus, within academia, we have our different communities of users (Stata, SPSS, Excel, SAS, Mathematica, and some R), and I think Rick's comments on the previous post are informative about academic and business communities. I think lots of times social identity theory begins to play out: we get really comfortable within our community and tend to pigeonhole and shortchange other tools and, even worse, project those perceptions onto the users of those tools. I've been guilty of this and Rick helped me realize it.

Note: There can definitely be benefits to coding your own estimators. If you are a coder and use SAS, I highly recommend Rick Wicklin's The Do Loop, where he has produced a number of excellent posts explaining the nuts and bolts of coding your own estimators and gets behind the scenes of a vast array of concepts (like the power method for computing only the largest eigenvalue and this tip on NOT using macros to code a simulation). See also 12 Tips for SAS Statistical Programmers from 2012.
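
For a flavor of what coding such an algorithm looks like, here is a rough sketch of the power method in plain Python. It is not Rick's SAS/IML code, just my own illustration of the general idea: multiply a vector by the matrix repeatedly, renormalize, and read off the eigenvalue from the Rayleigh quotient. It assumes the matrix has a single eigenvalue of largest absolute value and that the starting vector is not orthogonal to its eigenvector.

```python
def power_method(matrix, iterations=500, tol=1e-12):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    vector = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        # Multiply by the matrix and rescale to unit length.
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
        # Rayleigh quotient v'Av (v has unit length) estimates the eigenvalue.
        av = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        new_eigenvalue = sum(v * a for v, a in zip(vector, av))
        converged = abs(new_eigenvalue - eigenvalue) < tol
        eigenvalue = new_eigenvalue
        if converged:
            break
    return eigenvalue, vector


# Symmetric 2x2 example; its eigenvalues are (5 +/- sqrt(5)) / 2.
largest, direction = power_method([[2.0, 1.0], [1.0, 3.0]])
print(largest)
```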

Tuesday, January 8, 2013

Decomposition: The Statistics Software Signal


"When you don't have to code your own estimators, you probably won't understand what you're doing. I'm not saying that you definitely won't, but push-button analyses make it easy to compute numbers that you are not equipped to interpret."

I agree that statistics is a language best communicated and understood via code vs. a point and click GUI.

However, particularly interesting is Taylor's view of how the use of a given software package may relate to the quality of research:

"SPSS: You love using your mouse and discovering options using menus. You are nervous about writing code and probably manage your data in Microsoft Excel." (see the linked article for similar remarks)

 To be fair, STATA, SPSS, SAS and R have coding environments, and as a user of both SAS and R products I don't see why using PROC REG in SAS is any less sophisticated than the 'lm' function in R. Nor do I see any difference in coding an estimator or algorithm in R vs. SAS IML.

In fact, there has been a long-running discussion for over a year now on SAS vs. R on LinkedIn, and in my opinion all it has established is that R certainly provides a powerful software solution for many researchers and businesses. 

It would be interesting to quantify and test Taylor's theory.

UPDATE: see You say Stata I Say SAS: software signaling and social identity theory.

Data science = failure of imagination

I think I like this distinction between Bayesian and Frequentist statistics: 

"we are nearly always ultimately curious about the Bayesian probability of the hypothesis (i.e. "how probable it is that things work a certain way, given what we see") rather then in the frequentist pobability of the data (i.e. "how likely it is that we would see this if we repeated the experiment again and again and again")."

But I think the rest of the article mischaracterizes data science. Take, for instance, the following paragraph:
"But most importantly, data-driven science is less intellectually demanding then hypothesis-driven science. Data mining is sweet, anyone can do it. Plotting multivariate data, maps, "relationships" and colorful visualizations is hip and catchy, everybody can understand it. By contrary, thinking about theory can be pain and it requires a rare commodity: imagination."

Actually, in my opinion it takes way more imagination to develop an effective data visualization than to develop an estimator by hand and prove it's unbiased or consistent. I'd much rather do the latter because I frankly don't have the imagination to do the best job with the former. 

But data science is much more than visualization. As for not being intellectually demanding, trying to understand the backpropagation algorithm used by neural networks, not to mention actually coding your own algorithm, isn't child's play.

As far as results go, ultimately it is about getting the right tool for the right job. There are plenty of cases, in bioinformatics and genomics for example, where the algorithmic approach is more useful than, say, ANOVA. As Leo Breiman said: 

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems." 

Culture Shock? see "Culture War: Classical Statistics vs. Machine Learning" here

Saturday, January 5, 2013

Interpreting Regression Coefficients

Nice discussion on regression here (one of my favorite blogs):

I particularly like Gelman's comment:

"It's all about comparisons, nothing about how a variable "responds to change." Why? Because, in its most basic form, regression tells you nothing at all about change. It's a structured way of computing average comparisons in data."

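The simplest case makes the 'average comparisons' point concrete: regress an outcome on a single 0/1 indicator and the OLS slope is exactly the difference between the two group means. A small sketch with made-up numbers (the variable names and values are hypothetical, loosely echoing a public vs. private college indicator):

```python
def ols_slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: indicator (1 = public) and graduation rate.
public = [0, 0, 0, 1, 1, 1]
grad_rate = [60.0, 70.0, 80.0, 50.0, 55.0, 60.0]

slope = ols_slope(public, grad_rate)

# The same comparison computed directly as a difference in group means.
mean_public = sum(g for p, g in zip(public, grad_rate) if p == 1) / public.count(1)
mean_private = sum(g for p, g in zip(public, grad_rate) if p == 0) / public.count(0)
difference_in_means = mean_public - mean_private

print(slope, difference_in_means)
```

Nothing in that calculation says what would happen to any one college if it changed type; it only compares averages across the colleges we observed.
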
This 'computing average comparisons in data' interpretation is why regression works as a sort of matching estimator, as Angrist and Pischke argue, and as we've all (including Dr. Gelman) discussed before here: 

"Think of the world of difference between using a regression model for prediction and using one for estimating a parameter with a causal interpretation, for example, the effect of class size on school children's test scores. With prediction, we don't need our relationship to be causal, but we do need to be concerned with the relation between our training and our test set. If we have reason to think that our future test set may differ from our past training set in unknown ways, nothing, including cross-validation, will save us. When estimating the causal parameter, we do need to ask whether the children were randomly assigned to classes of different sizes, and if not, we need to find a way to deal with possible selection bias. If we have not measured suitable covariates on our children, we may not be able to adjust for any bias."

If we are talking about the following specification: 

E[Y_i | c_i = 1] - E[Y_i | c_i = 0] = E[Y_{1i} - Y_{0i} | c_i = 1] + { E[Y_{0i} | c_i = 1] - E[Y_{0i} | c_i = 0] }

Observed effect = treatment effect on the treated + {selection bias}
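
With simulated data, where both potential outcomes are known for every unit, the decomposition can be verified term by term. This is only a sketch under assumptions I made up for illustration: an unobserved 'ability' variable drives both the untreated outcome and the chance of treatment, and the treatment effect is a constant 2.

```python
import random

random.seed(1)
units = []
for _ in range(10000):
    ability = random.gauss(0, 1)                 # unobserved confounder (hypothetical)
    y0 = 50 + 5 * ability + random.gauss(0, 1)   # potential outcome without treatment
    y1 = y0 + 2.0                                # potential outcome with treatment
    c = 1 if ability + random.gauss(0, 1) > 0 else 0  # higher ability, more likely treated
    units.append((c, y0, y1))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Observed comparison: treated units reveal y1, untreated units reveal y0.
observed_diff = (mean(y1 for c, y0, y1 in units if c == 1)
                 - mean(y0 for c, y0, y1 in units if c == 0))

# Treatment effect on the treated: E[Y1 - Y0 | c = 1].
att = mean(y1 - y0 for c, y0, y1 in units if c == 1)

# Selection bias: E[Y0 | c = 1] - E[Y0 | c = 0].
selection_bias = (mean(y0 for c, y0, y1 in units if c == 1)
                  - mean(y0 for c, y0, y1 in units if c == 0))

print(observed_diff, att, selection_bias)
```

The observed difference equals the sum of the other two terms (up to floating-point rounding). In this setup the selection bias is positive, so the naive comparison overstates the effect. In real data only the observed difference is computable, which is the whole problem.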

I think that framework is the most useful for characterizing and understanding selection bias. I could be missing something, but I don't see how the block quote from Terry above is really inconsistent with the potential outcome framework of causal inference, unless maybe you completely refuse to think of regression as a matching estimator. I think he does a good job pointing out what most people don't see as different applications of regression. As Dr. Gelman says, inference may be a special case of prediction, but when I hear this distinction I can't help but think of this comment from Greene: 

 "It remains an interesting question for research whether fitting well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing."
  p. 686 Greene,  Econometric Analysis 5th ed

Thursday, January 3, 2013

Propensity Score Matching in Higher Ed Research

See also: Causal Inference in A Nutshell, and this example using Instrumental Variables to evaluate First Year Programs, as well as my previous discussion of matching estimators here and here.

And this really good presentation: Why Propensity Score Matching should be used to Assess Programmatic Effects by Forrest Lane at the University of North Texas.

Below are some great references for both higher education research as well as good examples of applied quasi-experimental methods, particularly propensity score matching: 

Estimating the influence of financial aid on student retention: A discrete-choice propensity score-matching model
Education Working Paper Archive
January 17, 2008
Serge Herzog, Ph.D.
Director, Institutional Analysis
Consultant, CRDA StatLab
University of Nevada, Reno

Estimates the effect of financial aid on freshmen retention using propensity score-matching. Found that higher income students accrue a retention benefit from financial aid while retention of low-income freshmen is more likely due to academic

Assessing the Effectiveness of a College Freshman
Seminar Using Propensity Score Adjustments

M. H. Clark  and Nicole L. Cundiff
Res High Educ (2011) 52:616–639

Without accounting for selection bias, those who took the course had similar retention rates
and lower GPAs than those who did not take the course. After matching on propensity
scores, the negative effects of the program on GPA were nullified and those in the program
were more likely to enroll for a second year.

An, B. P. (In press). The influence of dual enrollment on academic performance and college readiness: Differences by socioeconomic status. Research in Higher Education. 2012.

Employed a propensity score matching model to assess the impact of dual enrollment
on academic performance and college readiness. Found that dual enrollment continues to
influence positively academic performance and college readiness.
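
To make the general idea behind these studies concrete, here is a bare-bones sketch of one-to-one nearest-neighbor matching on an estimated propensity score, using simulated data. It is not the model from any of the papers above: the single covariate, the true effect of 1.0, the hand-rolled Newton-Raphson logistic fit, and matching with replacement are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, t, steps=25):
    """Newton-Raphson fit of P(t = 1 | x) = sigmoid(a + b * x)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += ti - p            # gradient, intercept
            g1 += (ti - p) * xi     # gradient, slope
            h00 += w                # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def mean(values):
    return sum(values) / len(values)


# Simulated students: x drives both program participation and the outcome.
random.seed(2)
n = 2000
true_effect = 1.0
x = [random.gauss(0, 1) for _ in range(n)]
t = [1 if random.random() < sigmoid(xi) else 0 for xi in x]
y = [2.0 * xi + true_effect * ti + random.gauss(0, 1) for xi, ti in zip(x, t)]

# Step 1: estimate each student's propensity score.
a_hat, b_hat = fit_logistic(x, t)
score = [sigmoid(a_hat + b_hat * xi) for xi in x]

treated = [i for i in range(n) if t[i] == 1]
controls = [i for i in range(n) if t[i] == 0]

# Naive comparison ignores selection into the program.
naive_diff = mean([y[i] for i in treated]) - mean([y[i] for i in controls])

# Step 2: match each treated student to the control with the closest score.
gaps = []
for i in treated:
    j = min(controls, key=lambda k: abs(score[k] - score[i]))
    gaps.append(y[i] - y[j])
matched_diff = mean(gaps)

print(naive_diff, matched_diff)
```

In this simulation the naive difference is inflated by selection, while the matched estimate should sit much closer to the effect built into the data. Real applications also need balance checks, common-support diagnostics, and appropriate standard errors, none of which are shown here.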

A Mandate for Causal Inference in Higher Ed

 See also:  Using SAS® Enterprise BI and SAS® Enterprise Miner™ to
Reduce Student Attrition & Propensity Score Matching in Higher Education Research

From an older article in the Chronicle of Higher Education:

 "The new policy will permit -- but not require-- all of the department's units to give preference to grant applicants who promise to use randomized controlled trials or similar quasi experimental

Complaints From Researchers:

"They argued that randomized trials were expensive and difficult to conduct on a meaningful scale... posed ethical problems"

Well that sounds exactly like a scenario for quasi-experimental methods and causal inference!

New Federal Policy Favors Randomized Trials in
Education Research
March 11, 2005