Friday, June 19, 2015

Got Data? Probably not like your econometrics textbook!

Recently there has been a lot of discussion of the Angrist and Pischke piece entitled "Why Econometrics Teaching Needs an Overhaul" (read more...), and I have discussed before the large gap between theoretical and applied econometrics.

But here I plan to discuss another potential gap between teaching and application, one that often is not introduced at any point in a traditional undergraduate or graduate economics curriculum: hacking skills. These skills become extremely important for economists who someday find themselves doing applied work in a corporate environment, or working in the area of data science. Drew Conway points out that there are three spheres of data science: hacking skills, math and statistics knowledge, and subject matter expertise. For many economists, the hacking sphere might be the weakest (read also Big Data Requires a New Kind of Expert: The Econinformatrician), while their quantitative training otherwise makes them ripe to become very good data scientists.

Drew Conway's Data Science Venn Diagram

In a recent whitepaper, I discuss this issue:

Students of econometrics might often spend their days learning proofs and theorems, and if they are lucky they will get their hands on some data and access to software to actually practice some applied work, whether it be for a class project or part of a thesis or dissertation. I have written before about the large gap between theoretical and applied econometrics, but there is another gap to speak of, and it has nothing to do with theoretical properties of estimators or interpreting output from Stata, SAS or R. This has to do with raw coding, hacking, and data manipulation skills; the ability to tease out relevant observations and measures from large structured transactional databases as well as unstructured log files or web data like tweet streams. This gap becomes more of an issue as econometricians move from academic environments to corporate environments, and especially so for those economists that begin to take on roles as data scientists. In these environments, not only is it true that problems don’t fit the standard textbook solutions (see article ‘Applied Econometrics’), but the data doesn't look much like the simple data sets often used in textbooks either. You cannot always expect your IT people to just dump you a flat file with all the variables and formats that will work for your research project. In fact, the absolute best you might hope for in many environments is a SQL or Oracle database with hundreds or thousands of tables and the tiny bits of information you need spread across a number of them. How do you bring all of this information together to do an analysis? This can be complicated, but for the uninitiated I will present some ‘toy’ examples to give a feel for executing basic database queries to bring together different pieces of information housed in separate tables in order to produce a ‘toy’ analytics ready data set.
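The whitepaper walks through its own examples, but as a minimal sketch of the idea, here is a hypothetical Python/SQLite snippet that joins a toy customers table to a transactions table and aggregates the result into a small analytics-ready data set. All table and column names here are made up for illustration and are not taken from the paper.

```python
import sqlite3

# In-memory SQLite database with two hypothetical tables:
# customers (one row per customer) and transactions (one row per purchase).
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, region TEXT, age INTEGER);
CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);

INSERT INTO customers VALUES (1, 'East', 34), (2, 'West', 51), (3, 'East', 42);
INSERT INTO transactions VALUES (101, 1, 25.00), (102, 1, 40.00), (103, 2, 15.50);
""")

# Join the two tables and collapse to one row per customer --
# the kind of 'analytics ready' data set you could hand to a regression.
cur.execute("""
SELECT c.cust_id,
       c.region,
       c.age,
       COUNT(t.txn_id)            AS n_transactions,
       COALESCE(SUM(t.amount), 0) AS total_spend
FROM customers c
LEFT JOIN transactions t ON c.cust_id = t.cust_id
GROUP BY c.cust_id, c.region, c.age;
""")

for row in cur.fetchall():
    print(row)
# e.g. (1, 'East', 34, 2, 65.0): customer 1 made 2 purchases totaling $65

con.close()
```

From there, the joined result could be read into R, SAS, Stata, or a pandas data frame for the actual econometric work; the point is simply that the joining and aggregating usually has to happen before any estimation does.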

I am certain that many schools actually do teach some of the basics related to joining and cleaning data sets, and if they don't, students might figure this out on the job or through one research project or another. I am not certain that this gap necessarily needs to be filled as part of any econometrics course. However, it is something students need to be aware of, and offering some sort of workshop, lab, or formal course (maybe as part of a more comprehensive data science curriculum like this) would be very beneficial.

Read the whole paper here:

Matt Bogard. 2015. "Joining Tables with SQL: The most important econometrics lesson you may ever learn." The SelectedWorks of Matt Bogard.
Available at: http://works.bepress.com/matt_bogard/29

See also: Is Machine Learning Trending with Economists?
