Previously I wrote a post titled: What do you really need to know to be a data scientist. Data science lovers and haters. In this post I made the general argument that this is a broad space and there is a lot of contention about the level of technical skill and tools that one must master to consider themselves a 'real' data scientist vs. getting labeled a 'fake' data scientist or 'poser' or whatever. But, to me its all about leveraging data to solve problems and most of that work is about cleaning and prepping data. It's process. In an older KDNuggets article, economist/data scientist Scott Nicholson makes a similar point:
GP: What advice you have for aspiring data scientists?
SN: Focus less on algorithms and fancy technology & more on identifying questions, and extracting/cleaning/verifying data. People often ask me how to get started, and I usually recommend that they start with a question and follow through with the end-to-end process before they think about implementing state-of-the-art technology or algorithms. Grab some data, clean it, visualize it, and run a regression or some k-means before you do anything else. That basic set of skills surprisingly is something that a lot of people are just not good at but it is crucial.
GP: Your opinion on the hype around Big Data - how much is real?
SN: Overhyped. Big data is more of a sudden realization of all of the things that we can do with the data than it is about the data themselves. Of course also it is true that there is just more data accessible for analysis and that then starts a powerful and virtuous spiral. For most companies more data is a curse as they can barely figure out what to do with what they had in 2005.
So getting your foot in the door in a data science field doesn't mean mastering Hive or Hadoop apparently. And, this does not sound like PhD level rocket science at this point either. Karolis Urbonas, Head of Business Intelligence at Amazon has recently written a couple of similarly themed pieces also at KDNuggets:
How to think like a data scientist to become one
"I still think there’s too much chaos around the craft and much less clarity, especially for people thinking of switching careers. Don’t get me wrong – there are a lot of very complex branches of data science – like AI, robotics, computer vision, voice recognition etc. – which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into a data science role that was called a business / data analyst just a few years ago – here are the four rules that have helped me get into and are still helping me survive in the data science."
He emphasizes basic data analysis, statistics, and coding to get started. The emphasis again is not on specific tools, degrees etc. but more on the process and ability to use data to solve problems. Note in the comments there is some push back on the level of expertise required, but Karolis actually addressed that when he mentioned very narrow and specific roles in AI, robotics, etc. Here he's giving advice for getting started in the broad diversity of roles in data science outside these narrow tracks. The issue is some people in data science want to narrow the scope to the exclusion of much of the work done by business analysts, researchers, engineers and consultants creating much of the value in this space (again see my previous post).
What makes a great data scientist?
"A data scientist is an umbrella term that describes people whose main responsibility is leveraging data to help other people (or machines) making more informed decisions….Over the years that I have worked with data and analytics I have found that this has almost nothing to do with technical skills. Yes, you read it right. Technical knowledge is a must-have if you want to get hired but that’s just the basic absolutely minimal requirement. The features that make one a great data scientist are mostly non-technical."
1. Great data scientist is obsessed with solving problems, not new tools.
"This one is so fundamental, it is hard to believe it’s so simple. Every occupation has this curse – people tend to focus on tools, processes or – more generally – emphasize the form over the content. A very good example is the on-going discussion whether R or Python is better for data science and which one will win the beauty contest. Or another one – frequentist vs. Bayesian statistics and why one will become obsolete. Or my favorite – SQL is dead, all data will be stored on NoSQL databases."
An attempt to make sense of econometrics, biostatistics, machine learning, experimental design, bioinformatics, ....
Monday, April 10, 2017
Saturday, April 8, 2017
What do you really need to know to be a data scientist? Data Science Lovers and Haters
Previously I discussed the Super Data Science podcast and credit modeling in terms of the modeling strategy and models used. The discussion also covered data science in general, and one part of the conversation I thought was well worth discussing in more detail. It really gets to the question of what's it take to be a data scientist. There is a ton of energy spent on this in places like LinkedIn and other forums. I think the answer comes in two forms. From the 'lovers' of data science its all about what kind of advice can I give people to help and encourage them to create value in this space. To the 'haters' its more like now that I have established myself in this space what kind of criterion should we have to keep people out and prevent them from creating value. But before we get to that, here is some great dialogue from Kirill discussing a trap that data scientists or aspiring data scientists fall into:
Kirill: "I think there’s a level of acumen that people should have, especially going into data science role. And then if you’re a manager you might take a step back from that. You might not need that much detail…If you’re doing the algorithms, that acumen might be enough. You don’t need to know the nitty-gritty mathematical academic formulas to everything about support vector machines or Kernels and stuff like that to apply it properly and get results. On the other hand, if you find that you do need that stuff you can go and spend some additional time learning. A lot of people fall into the trap. They try to learn everything in a lot of depth, whereas I think the space of data science is so broad you can’t just learn everything in huge depths. It’s better to learn everything to an acceptable level of acumen and then deepen your knowledge in the spaces that you need."
Greg: "if you don’t want to get into that detail, I totally get it. You can be totally fine without it. I have never once in my career had somebody ask me what are the formulas behind the algorithm….there’s a lot of jobs out there for people that don’t know them."
I admit I used to fall into this trap. In fact this blog is a direct result. Early in my career I had the mindset if you can't prove it you can't use it. I really didn't feel confident about an algorithm or method until I understood it 'on paper' and could at least code my own version in SAS IML or R. A number of posts here were based on this work and mindset. Then, a very well known and accomplished developer/computational scientist that frequently helped me gave the good advice that with this mindset you might never get any work done. Or only a fraction of work.
Given the amount of discussion you might see on LinkedIn or the so called data science community about real or fake data scientists (lots of haters out there) in the Talk Python to Me podcast author Joel Grus (of Data Science from Scratch) provides what I think is the most honest discussion of what data science is and what data scientists do:
"there are just as many jobs called data science as there are data scientists"
That is kind of paraphrasing and kind of out of context and yes very general. Almost defining a word using the word in the definition. But it is very very TRUE. That is because the field is largely undefined. To attempt to define it is futile and I think would be the antithesis of data science itself. I will warn though that there are plenty of data science haters out there that would quibble with what Greg and Joel have said above.
These are people that want to impose something more strict. Some minimum threshold. Common threads indicate some fear of a poser or fake data scientist fooling some company into hiring them or incompetently pointing and clicking their way through an analysis without knowing what is going on and calling themselves a data scientist. While I understand that concern, its one extreme. It can easily morph into a straw man argument for a more political agenda at the other extreme. That might lead to a listing of minimal requirements to be a real data scientist, some laundry list of requirements (think big data technologies, degrees and the like). Economists know all about this and we see it in the form of licensing and rent seeking in a number of professions and industries. Broadly speaking its a waste of resources. Absolutely in this broad space economists would also recognize merit in signaling through certification, certain degree programs or course work, or other methods of credentialization. But there is a big difference between competitive signaling and non-competitive rent seeking behaviors.
In its inception, data science was all about disruption. As described in Johns Hopkins applied economics program description:
“Economic analysis is no longer relegated to academicians and a small number of PhD-trained specialists. Instead, economics has become an increasingly ubiquitous as well as rapidly changing line of inquiry that requires people who are skilled in analyzing and interpreting economic data, and then using it to effect decisions ………Advances in computing and the greater availability of timely data through theInternet have created an arena which demands skilled statistical analysis, guided by economic reasoning and modeling.”
This parallels data science. Suddenly you no longer need a PhD in statistics or a software engineering background or an academics' level of acumen to create value added analysis. (although those are all excellent backgrounds for doing some advanced work in data science no doubt). Its that basic combination of subject matter expertise, some knowledge of statistics and machine learning, and ability to write code or use software to solve problems. That's it. Its disruptive and the haters hate it. They simultaneously embrace the disruption and want to reign it in and fence out the competition. I hate it for the haters but you don't need to be able to code your own estimators or train a neural net from scratch to use it. And there is probably as much or more value creating professional space out there for someone that can clean a data set and provide a set of cross tabs as there is for the know how to set up a Hadoop cluster.
Below are a couple of really great KDNuggets articles in this regard written by Karolis Urbonas, Head of Business Intelligence at Amazon:
How to think like a data scientist to become one
What makes a great data scientist?
Kirill: "I think there’s a level of acumen that people should have, especially going into data science role. And then if you’re a manager you might take a step back from that. You might not need that much detail…If you’re doing the algorithms, that acumen might be enough. You don’t need to know the nitty-gritty mathematical academic formulas to everything about support vector machines or Kernels and stuff like that to apply it properly and get results. On the other hand, if you find that you do need that stuff you can go and spend some additional time learning. A lot of people fall into the trap. They try to learn everything in a lot of depth, whereas I think the space of data science is so broad you can’t just learn everything in huge depths. It’s better to learn everything to an acceptable level of acumen and then deepen your knowledge in the spaces that you need."
Greg: "if you don’t want to get into that detail, I totally get it. You can be totally fine without it. I have never once in my career had somebody ask me what are the formulas behind the algorithm….there’s a lot of jobs out there for people that don’t know them."
I admit I used to fall into this trap. In fact this blog is a direct result. Early in my career I had the mindset if you can't prove it you can't use it. I really didn't feel confident about an algorithm or method until I understood it 'on paper' and could at least code my own version in SAS IML or R. A number of posts here were based on this work and mindset. Then, a very well known and accomplished developer/computational scientist that frequently helped me gave the good advice that with this mindset you might never get any work done. Or only a fraction of work.
Given the amount of discussion you might see on LinkedIn or the so called data science community about real or fake data scientists (lots of haters out there) in the Talk Python to Me podcast author Joel Grus (of Data Science from Scratch) provides what I think is the most honest discussion of what data science is and what data scientists do:
"there are just as many jobs called data science as there are data scientists"
That is kind of paraphrasing and kind of out of context and yes very general. Almost defining a word using the word in the definition. But it is very very TRUE. That is because the field is largely undefined. To attempt to define it is futile and I think would be the antithesis of data science itself. I will warn though that there are plenty of data science haters out there that would quibble with what Greg and Joel have said above.
These are people that want to impose something more strict. Some minimum threshold. Common threads indicate some fear of a poser or fake data scientist fooling some company into hiring them or incompetently pointing and clicking their way through an analysis without knowing what is going on and calling themselves a data scientist. While I understand that concern, its one extreme. It can easily morph into a straw man argument for a more political agenda at the other extreme. That might lead to a listing of minimal requirements to be a real data scientist, some laundry list of requirements (think big data technologies, degrees and the like). Economists know all about this and we see it in the form of licensing and rent seeking in a number of professions and industries. Broadly speaking its a waste of resources. Absolutely in this broad space economists would also recognize merit in signaling through certification, certain degree programs or course work, or other methods of credentialization. But there is a big difference between competitive signaling and non-competitive rent seeking behaviors.
In its inception, data science was all about disruption. As described in Johns Hopkins applied economics program description:
“Economic analysis is no longer relegated to academicians and a small number of PhD-trained specialists. Instead, economics has become an increasingly ubiquitous as well as rapidly changing line of inquiry that requires people who are skilled in analyzing and interpreting economic data, and then using it to effect decisions ………Advances in computing and the greater availability of timely data through theInternet have created an arena which demands skilled statistical analysis, guided by economic reasoning and modeling.”
This parallels data science. Suddenly you no longer need a PhD in statistics or a software engineering background or an academics' level of acumen to create value added analysis. (although those are all excellent backgrounds for doing some advanced work in data science no doubt). Its that basic combination of subject matter expertise, some knowledge of statistics and machine learning, and ability to write code or use software to solve problems. That's it. Its disruptive and the haters hate it. They simultaneously embrace the disruption and want to reign it in and fence out the competition. I hate it for the haters but you don't need to be able to code your own estimators or train a neural net from scratch to use it. And there is probably as much or more value creating professional space out there for someone that can clean a data set and provide a set of cross tabs as there is for the know how to set up a Hadoop cluster.
Below are a couple of really great KDNuggets articles in this regard written by Karolis Urbonas, Head of Business Intelligence at Amazon:
How to think like a data scientist to become one
What makes a great data scientist?
Super Data Science Podcast Credit Scoring Models
I recently discovered the Super Data Science podcast hosted by Kirill Eremenko. What I like about this podcast series is that it is applied data science. You can talk all day about theory, theorems, proofs, and mathematical details and assumptions. Even if you could master every technical detail underlying 'data science' you have only scratched the surface. What distinguishes data science from the academic discipline of statistics, computer science, or machine learning is application to solve a problem for business or society. Its not theory for theory's sake. There are huge gaps between theory and application that can easily stump a team of PhD's or experienced practitioners (see also applied econometrics). Podcasts like this can help bridge the gap.
Episode 014 featured Greg Poppe who is Sr Vice President for risk management at an auto lending firm. They discussed how data science is leveraged in loan approvals and rate setting among other things.
The general modeling approach that Greg discussed is very similar to work that I have done before in student risk modeling in higher education (see here and here).
"So think of it like -- you know, I would have a hard time telling you with any high degree of certainty, “This loan will pay. This loan will pay. But this loan won’t.” However, if you give me a portfolio of a hundred loans, I should be able to say “15 aren’t going to pay. I don’t know which 15, but 15 won’t.” And then if you give me another portfolio that’s say riskier, I should be able to measure that risk and say “This is a riskier pool. 25 aren’t going to pay. And again, I don’t know which 25, but I’m estimating 25.” And that’s how we measure our accuracy. So it’s not so much on a loan-by-loan basis. It’s “If we just select a random sample, how many did not pay, and what was our expectation of that?” And if they’re very close, we consider our models to be accurate."
A toy example in R that seems very similar can be found here (Predictive Modeling and Custom Reporting in R).
So at a basic level they are just using predictive models to get a score and using cutoffs to determine different pools of risk and making approvals, declines, and setting interest rates based on this. He doesn't discuss the specifics of the model testing, but to me the key here sounds a lot like calibration (see Is the ROC curve a good metric for model calibration?). In terms of the types of models they use of this it gets very interesting. As Kirill says, the whole podcast is worth listening to for this very point. For their credit scoring models they use regression, even though they could get improved performance from other algorithms like decision trees or ensembles. Why?
"so primarily in the credit decisioning models, we use regression models. And the reason why—well, there’s quite a few. One is it’s very computationally easy. It’s easy to explain, it’s easy for people to understand but it’s also not a black box in the sense that a lot of models can be, and what we need to do is we need to provide a continuity to a dealership because they can adjust the parameters of the application and that will adjust the risk accordingly…..If we were to go with a CART model or any other decision tree model, if the first break point or the first cut point in that model is down payment and they go from one side to the other, it can throw it down a completely separate set of decision logic and they can get very strange approvals. From a data science perspective and from an analytics perspective, that may be more accurate but it’s not sellable, it’s not marketable to the dealership."
Yes huge gap just filled and well worth repeating. Its interesting, in a different scenario you could go the other way around. For instance, in my work in higher education student risk modeling we went with decision trees instead of regression but based on a similar line of reasoning. Our end users however were not going to be tweaking parameters but getting sign off and buy in required that they understand more about what the model was doing. The explicit nature of the splits and decision logic of the trees was easier to explain and understand for untrained statisticians than was regression models or neural networks.
If you have been a practitioner for a while you might think of course every data scientist knows there is a tradeoff between accuracy, complexity, and functional practicality. I agree but it still can't be emphasized enough. And more time should be spent on applied examples like this vs the waste we see in social media discussion who is or isn't a fake data scientist. The real data scientists are too busy working in the gaps between theory and practice to care. To be continued....
Episode 014 featured Greg Poppe who is Sr Vice President for risk management at an auto lending firm. They discussed how data science is leveraged in loan approvals and rate setting among other things.
The general modeling approach that Greg discussed is very similar to work that I have done before in student risk modeling in higher education (see here and here).
"So think of it like -- you know, I would have a hard time telling you with any high degree of certainty, “This loan will pay. This loan will pay. But this loan won’t.” However, if you give me a portfolio of a hundred loans, I should be able to say “15 aren’t going to pay. I don’t know which 15, but 15 won’t.” And then if you give me another portfolio that’s say riskier, I should be able to measure that risk and say “This is a riskier pool. 25 aren’t going to pay. And again, I don’t know which 25, but I’m estimating 25.” And that’s how we measure our accuracy. So it’s not so much on a loan-by-loan basis. It’s “If we just select a random sample, how many did not pay, and what was our expectation of that?” And if they’re very close, we consider our models to be accurate."
A toy example in R that seems very similar can be found here (Predictive Modeling and Custom Reporting in R).
So at a basic level they are just using predictive models to get a score and using cutoffs to determine different pools of risk and making approvals, declines, and setting interest rates based on this. He doesn't discuss the specifics of the model testing, but to me the key here sounds a lot like calibration (see Is the ROC curve a good metric for model calibration?). In terms of the types of models they use of this it gets very interesting. As Kirill says, the whole podcast is worth listening to for this very point. For their credit scoring models they use regression, even though they could get improved performance from other algorithms like decision trees or ensembles. Why?
"so primarily in the credit decisioning models, we use regression models. And the reason why—well, there’s quite a few. One is it’s very computationally easy. It’s easy to explain, it’s easy for people to understand but it’s also not a black box in the sense that a lot of models can be, and what we need to do is we need to provide a continuity to a dealership because they can adjust the parameters of the application and that will adjust the risk accordingly…..If we were to go with a CART model or any other decision tree model, if the first break point or the first cut point in that model is down payment and they go from one side to the other, it can throw it down a completely separate set of decision logic and they can get very strange approvals. From a data science perspective and from an analytics perspective, that may be more accurate but it’s not sellable, it’s not marketable to the dealership."
Yes huge gap just filled and well worth repeating. Its interesting, in a different scenario you could go the other way around. For instance, in my work in higher education student risk modeling we went with decision trees instead of regression but based on a similar line of reasoning. Our end users however were not going to be tweaking parameters but getting sign off and buy in required that they understand more about what the model was doing. The explicit nature of the splits and decision logic of the trees was easier to explain and understand for untrained statisticians than was regression models or neural networks.
If you have been a practitioner for a while you might think of course every data scientist knows there is a tradeoff between accuracy, complexity, and functional practicality. I agree but it still can't be emphasized enough. And more time should be spent on applied examples like this vs the waste we see in social media discussion who is or isn't a fake data scientist. The real data scientists are too busy working in the gaps between theory and practice to care. To be continued....
Friday, April 7, 2017
Andrew Gelman on EconTalk
Recently Andrew Gelman was on EconTalk with Russ Roberts. A couple of the most interesting topics covered included discussion of his garden of forking paths as well as what Gelman covered in a fairly recent blog post- The “What does not kill my statistical significance makes it stronger” fallacy.
Here is an excerpt from the transcript:
Russ Roberts: But you have a small sample...--that's very noisy, usually. Very imprecise. And you still found statistical significance. That means, 'Wow, if you'd had a big sample you'd have found even a more reliable effect.'
Andrew Gelman: Um, yes. You're using what we call the 'That which does not kill my statistical significance makes it stronger' fallacy. We can talk about that, too.
From Andrew's Blog Post:
"The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.
This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.
"What went wrong? Why it is a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S error says that a statistically significant estimate can have a high probability of getting the sign wrong.
We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward."
Noted in both the podcast and the blog post by Gelman, this is not a well known fallacy and as they point out very well known researchers appear to be found committing it one time or another in their writing or dialogue.
See also: Econometrics, Multiple Testing, and Researcher Degrees of Freedom
Here is an excerpt from the transcript:
Russ Roberts: But you have a small sample...--that's very noisy, usually. Very imprecise. And you still found statistical significance. That means, 'Wow, if you'd had a big sample you'd have found even a more reliable effect.'
Andrew Gelman: Um, yes. You're using what we call the 'That which does not kill my statistical significance makes it stronger' fallacy. We can talk about that, too.
From Andrew's Blog Post:
"The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.
This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.
"What went wrong? Why it is a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S error says that a statistically significant estimate can have a high probability of getting the sign wrong.
We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward."
Noted in both the podcast and the blog post by Gelman, this is not a well known fallacy and as they point out very well known researchers appear to be found committing it one time or another in their writing or dialogue.
See also: Econometrics, Multiple Testing, and Researcher Degrees of Freedom