Previously I wrote a post titled "What do you really need to know to be a data scientist. Data science lovers and haters." In that post I argued that data science is a broad space, and that there is a lot of contention about the level of technical skill and the tools one must master to be considered a 'real' data scientist rather than getting labeled a 'fake' data scientist, a 'poser', or whatever. But to me it's all about leveraging data to solve problems, and most of that work is about cleaning and prepping data. It's process. In an older KDNuggets article, economist/data scientist Scott Nicholson makes a similar point:
GP: What advice you have for aspiring data scientists?
SN: Focus less on algorithms and fancy technology & more on identifying questions, and extracting/cleaning/verifying data. People often ask me how to get started, and I usually recommend that they start with a question and follow through with the end-to-end process before they think about implementing state-of-the-art technology or algorithms. Grab some data, clean it, visualize it, and run a regression or some k-means before you do anything else. That basic set of skills surprisingly is something that a lot of people are just not good at but it is crucial.
GP: Your opinion on the hype around Big Data - how much is real?
SN: Overhyped. Big data is more of a sudden realization of all of the things that we can do with the data than it is about the data themselves. Of course also it is true that there is just more data accessible for analysis and that then starts a powerful and virtuous spiral. For most companies more data is a curse as they can barely figure out what to do with what they had in 2005.
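To make the "grab some data, clean it, and run a regression" advice concrete, here is a minimal sketch of that end-to-end loop in plain Python. The data, the column meanings (ad spend and sales), and the helper names `clean` and `fit_line` are all hypothetical and made up for illustration; they are not from the interview. The point is only that the whole process fits in a few lines with no special tooling.

```python
from statistics import mean

# Hypothetical toy data: (ad_spend, sales) pairs as raw strings,
# including a couple of missing or malformed entries.
raw_rows = [
    ("1.0", "3.1"),
    ("2.0", "4.9"),
    ("", "7.0"),      # missing predictor
    ("3.0", "n/a"),   # malformed response
    ("4.0", "9.0"),
    ("5.0", "11.0"),
]


def clean(rows):
    """Keep only the rows where both fields parse as numbers."""
    cleaned = []
    for x, y in rows:
        try:
            cleaned.append((float(x), float(y)))
        except ValueError:
            continue  # drop rows that cannot be parsed
    return cleaned


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


points = clean(raw_rows)
slope, intercept = fit_line(points)
print(f"kept {len(points)} of {len(raw_rows)} rows")
print(f"sales ~ {intercept:.2f} + {slope:.2f} * ad_spend")
```

In practice you would also plot the cleaned points before fitting anything, and most people would reach for a data frame library rather than hand-rolled helpers. Either way the order of operations is the same: start with a question, clean the data, look at it, then fit something simple.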
So getting your foot in the door in a data science field apparently doesn't mean mastering Hive or Hadoop. And this does not sound like PhD-level rocket science at this point either. Karolis Urbonas, Head of Business Intelligence at Amazon, has recently written a couple of similarly themed pieces, also at KDNuggets:
How to think like a data scientist to become one
"I still think there’s too much chaos around the craft and much less clarity, especially for people thinking of switching careers. Don’t get me wrong – there are a lot of very complex branches of data science – like AI, robotics, computer vision, voice recognition etc. – which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into a data science role that was called a business / data analyst just a few years ago – here are the four rules that have helped me get into and are still helping me survive in the data science."
He emphasizes basic data analysis, statistics, and coding to get started. The emphasis again is not on specific tools, degrees, etc., but on the process and the ability to use data to solve problems. Note that in the comments there is some pushback on the level of expertise required, but Karolis actually addressed that when he mentioned the very narrow and specific roles in AI, robotics, etc. Here he's giving advice for getting started in the broad diversity of data science roles outside these narrow tracks. The issue is that some people in data science want to narrow the scope to the exclusion of much of the work done by business analysts, researchers, engineers, and consultants, who create much of the value in this space (again, see my previous post).
What makes a great data scientist?
"A data scientist is an umbrella term that describes people whose main responsibility is leveraging data to help other people (or machines) making more informed decisions….Over the years that I have worked with data and analytics I have found that this has almost nothing to do with technical skills. Yes, you read it right. Technical knowledge is a must-have if you want to get hired but that’s just the basic absolutely minimal requirement. The features that make one a great data scientist are mostly non-technical."
1. Great data scientist is obsessed with solving problems, not new tools.
"This one is so fundamental, it is hard to believe it’s so simple. Every occupation has this curse – people tend to focus on tools, processes or – more generally – emphasize the form over the content. A very good example is the on-going discussion whether R or Python is better for data science and which one will win the beauty contest. Or another one – frequentist vs. Bayesian statistics and why one will become obsolete. Or my favorite – SQL is dead, all data will be stored on NoSQL databases."