Friday, January 30, 2015

Data Science is 10% Inspiration and 90% Perspiration

“Success is 10 percent inspiration and 90 percent perspiration.” Thomas Alva Edison

Last fall there was a really good article in the New York Times:

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

"Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

This has always been true, even before 'Big Data' was a big deal. It was one of the first rude awakenings I had as a researcher straight out of graduate school (another was the large gap between econometric theory and applied econometrics). Thankfully, I was lucky enough to work in a shop where this kind of work was much appreciated, and I developed the necessary SAS and SQL (and later R) skills to deal with these issues. They just don't teach this stuff in school (though I'm sure some places do).

The article mentions some efforts to develop software to make these tasks simpler. I think there is a very fine line between the value gained from doing this grunt work and the savings in time and energy we could realize by flattening the cost curve for data prep. As the article says:

"Data scientists emphasize that there will always be some hands-on work in data preparation, and there should be. Data science, they say, is a step-by-step process of experimentation."

“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,” said Cathy O’Neil, a data scientist at the Columbia University Graduate School of Journalism, and co-author, with Rachel Schutt, of “Doing Data Science” (O’Reilly Media, 2013).

In God We Trust, All Others Show Me Your Code

