Tuesday, January 8, 2013

Data science = failure of imagination

I think I like this distinction between Bayesian and frequentist statistics:

"we are nearly always ultimately curious about the Bayesian probability of the hypothesis (i.e. "how probable it is that things work a certain way, given what we see") rather than in the frequentist probability of the data (i.e. "how likely it is that we would see this if we repeated the experiment again and again and again")."

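To make the distinction concrete, here is a minimal sketch using a coin-flip example. The setup is entirely hypothetical (8 heads in 10 flips, two candidate hypotheses, equal priors); none of it comes from the quoted article. It computes the frequentist tail probability of the data under a "fair coin" hypothesis, and then the Bayesian posterior probability of each hypothesis given the data.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical data: 8 heads observed in 10 flips.
n_flips, n_heads = 10, 8

# Two candidate hypotheses about P(heads), with equal prior weight.
# Both the candidates and the priors are assumptions made for illustration.
hypotheses = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}

# Frequentist view: probability of data at least this extreme,
# assuming the coin is fair (a one-sided p-value).
p_value = sum(
    binomial_pmf(k, n_flips, hypotheses["fair"])
    for k in range(n_heads, n_flips + 1)
)

# Bayesian view: probability of each hypothesis given the observed data,
# obtained from Bayes' theorem.
likelihood = {h: binomial_pmf(n_heads, n_flips, p) for h, p in hypotheses.items()}
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

print(f"P(data at least this extreme | fair coin) = {p_value:.4f}")
for h in hypotheses:
    print(f"P({h} | data) = {posterior[h]:.4f}")
```

The two numbers answer different questions: the first conditions on a hypothesis and asks about the data, the second conditions on the data and asks about the hypotheses. The posterior also depends on the priors and on which alternatives are considered, both of which are choices made here only for the sake of the example.
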
But I think the rest of the article mischaracterizes data science. Take, for instance, the following paragraph:
"But most importantly, data-driven science is less intellectually demanding than hypothesis-driven science. Data mining is sweet, anyone can do it. Plotting multivariate data, maps, "relationships" and colorful visualizations is hip and catchy, everybody can understand it. By contrary, thinking about theory can be pain and it requires a rare commodity: imagination."

Actually, in my opinion it takes far more imagination to develop an effective data visualization than to develop an estimator by hand and prove it's unbiased or consistent. I'd much rather do the latter because I frankly don't have the imagination to do the best job with the former.

But data science is much more than visualization. As for not being intellectually demanding, trying to understand the backpropagation algorithm used by neural networks, not to mention actually coding your own algorithm, isn't child's play.
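
To give a flavor of what that involves, here is a minimal sketch of backpropagation for about the smallest network imaginable: one input, one tanh hidden unit, one linear output, and a squared-error loss. The architecture, parameter values, and function names are all assumptions chosen for illustration, not anyone's production code. The analytic gradients from the chain rule are checked against finite-difference estimates.

```python
import math


def forward(params, x):
    """Forward pass: scalar input -> one tanh hidden unit -> linear output."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(params, x, t):
    """Squared-error loss for a single (input, target) pair."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - t                # dL/dy
    dw2 = dy * h              # dL/dw2
    db2 = dy                  # dL/db2
    dh = dy * w2              # error propagated back to the hidden unit
    dz = dh * (1.0 - h * h)   # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dw1 = dz * x              # dL/dw1
    db1 = dz                  # dL/db1
    return [dw1, db1, dw2, db2]


def numerical_gradient(params, x, t, eps=1e-6):
    """Central finite-difference estimate of the same gradients."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2 * eps))
    return grads


# Arbitrary illustrative values for the parameters and one training example.
params = [0.5, -0.3, 0.8, 0.1]
x, t = 1.5, 0.7

analytic = backprop(params, x, t)
numeric = numerical_gradient(params, x, t)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest gap between analytic and numerical gradients:", max_gap)
```

Even in this toy case the derivative bookkeeping has to be exactly right, which is why a gradient check like the one above is a common sanity test before trusting a hand-coded implementation.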

As for results, ultimately it is about getting the right tool for the right job. There are plenty of cases, in bioinformatics and genomics for example, where the algorithmic approach is more useful than, say, ANOVA. As Leo Breiman said:

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems." 

Culture shock? See "Culture War: Classical Statistics vs. Machine Learning" here: http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html


  1. Sure, understanding the mechanics and theory behind data mining / machine learning algorithms is difficult. No one should say differently. The question you should be asking is whether your average "data scientist" actually knows or even cares about the theory.

    My, albeit limited, experience with data scientists has led me to believe that the majority blindly apply whatever algorithm is in vogue... (There are a lot of statistical quacks out there!) Many of them don't even think twice about what the data generating process might be... Of course, that's sometimes beneficial. This is esp. true in a commercial environment, where the main concern is prediction / classification rather than understanding of the DGP.

    Contrast that to the case of econometric research / analysis, where your main concern is usually about understanding the dynamics...

    Different tools for different tasks.