Friday, June 19, 2015

Got Data? Probably not like your econometrics textbook!

Recently there has been a lot of discussion of the Angrist and Pischke piece entitled "Why Econometrics Teaching Needs an Overhaul" (read more...), and I have discussed before the large gap between theoretical and applied econometrics.

But here I plan to discuss another potential gap in teaching and application, one that often is not introduced at any point in a traditional undergraduate or graduate economics curriculum: hacking skills. These become extremely important for economists who someday find themselves doing applied work in a corporate environment, or working in the area of data science. Drew Conway points out that there are three spheres of data science: hacking skills, math and statistics knowledge, and subject matter expertise. For many economists, the hacking sphere might be the weakest (read also Big Data Requires a New Kind of Expert: The Econinformatrician), while their quantitative training otherwise makes them ripe to become very good data scientists.

Drew Conway's Data Science Venn Diagram

In a recent whitepaper, I discuss this issue:

Students of econometrics might often spend their days learning proofs and theorems, and if they are lucky they will get their hands on some data and access to software to actually practice some applied work, whether it be for a class project or part of a thesis or dissertation. I have written before about the large gap between theoretical and applied econometrics, but there is another gap to speak of, and it has nothing to do with theoretical properties of estimators or interpreting output from Stata, SAS, or R. This has to do with raw coding, hacking, and data manipulation skills: the ability to tease out relevant observations and measures from large structured transactional databases as well as unstructured log files or web data like tweet streams. This gap becomes more of an issue as econometricians move from academic environments to corporate environments, and especially so for those economists who begin to take on roles as data scientists. In these environments, not only do problems not fit the standard textbook solutions (see article 'Applied Econometrics'), but the data doesn't look much like the simple data sets often used in textbooks either. You cannot always expect your IT people to just dump you a flat file with all the variables and formats that will work for your research project. In fact, the best you might hope for in many environments is a SQL or Oracle database with hundreds or thousands of tables and the tiny bits of information you need spread across a number of them. How do you bring all of this information together to do an analysis? This can be complicated, but for the uninitiated I will present some 'toy' examples to give a feel for executing basic database queries that bring together different pieces of information housed in separate tables in order to produce a 'toy' analytics-ready data set.
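The kind of 'toy' join described above can be sketched without a full SQL Server or Oracle install. Below is a minimal illustration using Python's built-in sqlite3 module; the members and claims tables, and all of their values, are hypothetical stand-ins for the "tiny bits of information spread across a number of tables":

```python
import sqlite3

# Hypothetical tables standing in for a real transactional database:
# a 'members' table and a 'claims' table that share a key.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE members (member_id INTEGER, age INTEGER)")
cur.execute("CREATE TABLE claims (member_id INTEGER, cost REAL)")
cur.executemany("INSERT INTO members VALUES (?, ?)", [(1, 34), (2, 51)])
cur.executemany("INSERT INTO claims VALUES (?, ?)",
                [(1, 120.0), (1, 80.0), (2, 300.0)])

# Join and aggregate to get one row per member -- a 'toy'
# analytics-ready data set.
rows = cur.execute("""
    SELECT m.member_id, m.age, SUM(c.cost) AS total_cost
    FROM members m
    LEFT JOIN claims c ON m.member_id = c.member_id
    GROUP BY m.member_id, m.age
    ORDER BY m.member_id
""").fetchall()
print(rows)  # [(1, 34, 200.0), (2, 51, 300.0)]
```

The same LEFT JOIN / GROUP BY pattern scales from this two-table toy to the hundreds of tables mentioned above.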

I am certain that many schools actually do teach some of the basics related to joining and cleaning data sets, and if they don't, students might figure this out on the job or through one research project or another. I am not certain that this gap necessarily needs to be filled as part of any econometrics course. However, it is something students need to be aware of, and offering some sort of workshop, lab, or formal course (maybe as part of a more comprehensive data science curriculum like this) would be very beneficial.

Read the whole paper here:

Matt Bogard. 2015. "Joining Tables with SQL: The most important econometrics lesson you may ever learn." The SelectedWorks of Matt Bogard.
Available at:  

See also: Is Machine Learning Trending with Economists?

Wednesday, June 17, 2015

Farmlink and the Rise of Data Science in Agriculture

At a recent Global Ag Investing Conference, Dave Gebhardt (Chief Strategy Officer for FarmLink) spoke about the rise of data science in agriculture. You can read the story and find a link to the podcast here:

In the podcast he discusses the way data science is revolutionizing agriculture, and how we are at a "tipping point where advances in science, IT, technology, and computing power have put a whole new level of opportunities before us."

This sounds a lot like what I have previously discussed in relation to big data and the internet of things: 

Watch more about how FarmLink is leveraging IoT, big data, and advanced analytics:


Big Ag Meets Big Data (Part 1 & Part 2)

Saturday, June 13, 2015

SAS vs R? The right answer to the wrong question?

For a long time I tracked a discussion on LinkedIn that consisted of various opinions about using SAS vs R. Some people can take this very personally. Recently there was an interesting post at the DataCamp blog addressing this topic. They also provided an interesting infographic comparing SAS and R as well as SPSS. Other popular debates also include Python in the mix. (By the way, it is possible to integrate all three on the SAS platform, and you can also run R via the open source integration node in SAS Enterprise Miner 13.1.)

Aside: for older versions of SAS EM, you can drop in a code node and call R via PROC IML.

Anyway, getting back to the article, I tend to agree with this one point:

"While these debates are a good thing for the community and the programming language as a whole, they unfortunately also have a negative effect on those individuals that are just in the beginning of their data analytics career. Biased opinions on all sides of the table make it difficult for new data analysts to see the forest for the trees when choosing a statistical programming language."

While I agree with this notion, I want to reflect for a minute on the concept of a programming language. If you think of SAS as just a programming language, then perhaps these kinds of comparisons and discussions make sense, but for a data scientist, I think one's view of analytics should transcend a single language. When we think of an overall analytical solution, there is a lot to consider: how the data is generated, how it is captured and warehoused, how it is extracted, cleaned, and accessed by whatever programming tool(s), how it is visualized and analyzed, and ultimately how the solution is operationalized so that it can be consumed by business users.

So to me the relevant question is not which programming language is preferred by data scientists, or which program is better for implementing specific machine learning algorithms, but rather: what is the best analytical solution platform for solving the problems at hand?

Friday, June 12, 2015

Linear Literalism & Fundamentalist Econometrics

Your tweet stream may have included the recently trending article by Angrist and Pischke entitled: "Why econometrics teaching needs an overhaul". Here is an excerpt:

“In addition to its more up-to-date contents, our book renews the econometrics canon by abandoning the childish literalism of the legacy approach to econometric instruction. In this spirit, we eschew the notion that regression is tied to a literal linear model. Regression describes differences in averages whether or not these averages fit a linear equation. This is a universal property – one that is reliably true – and we don’t intimidate readers with descriptions of the punishments to be meted out for the failure of classical assumptions. Our regression discussion begins by challenging readers to ask themselves, first, what the target causal effect is, and, second, by asking, ‘what is the regression you want’? In other words, what would you like to hold fixed when trying to regress-out an average causal effect?”

They are referencing their text Mastering Metrics, which I highly recommend. In their other text, Mostly Harmless Econometrics, they also state:

"In fact, the validity of linear regression as an empirical tool does not turn on linearity either...The statement that regression approximates the CEF lines up with our view of empirical work as an effort to describe the essential features of statistical relationships, without necessarily trying to pin them down exactly."  - Mostly Harmless Econometrics, p. 26 & 29

On their MHE blog a reader asks about their pedagogy which focuses on 'best linear projection' vs the traditional BLUE criteria to which they respond:

"our undergrad econometrics training (like most people’s) focused on the sampling distribution of OLS. Hence you were tortured with the Gauss-Markov Thm, which says that OLS is a Best Linear Unbiased Estimator (BLUE). MHE and MM are largely unconcerned with such things. Rather, we try to give our students a clear understanding of what regression means. To that end, we introduce regression as the best linear approximation to whatever conditional expectation fn. (CEF) motivates your empirical work – this is the BLP property you mention, which is a regression feature unrelated to samples. (MM also emphasises our interpretation of regression as a form of “automated matching”)." read more...

The notion of using regression as a means of making like comparisons has also been echoed by Andrew Gelman:

"It's all about comparisons, nothing about how a variable "responds to change." Why? Because, in its most basic form, regression tells you nothing at all about change. It's a structured way of computing average comparisons in data."

Linear literalism or fundamentalist undergraduate econometrics (being tortured with BLUE, as A&P might put it) can have long-term consequences for students. I think this has caused harm that I encounter from time to time, even among more seasoned practitioners and graduate degree holders. This isn't too different from what Leo Breiman described as a 'statistical straitjacket' that can arbitrarily limit fruitful empirical work. Overly clinical concerns with linearity, heteroskedasticity, and multicollinearity might crowd out more important concerns around causality and prediction.


Take heteroskedasticity, which we could summarize simply as non-constant variance. As Angrist and Pischke note:

"Our view of regression as an approximation to the CEF makes heteroskedasticity seem natural. If the CEF is nonlinear....the residuals will be larger, on average, at values of X where the fit is an empirical matter, heteroskedasticity may matter little" -Ch 3, p.46-47 MHE

Of course the concern is correct standard errors and valid inference, which can be addressed via heteroskedasticity-corrected standard errors. But I am afraid some students, after taking a traditional econometrics course, may terminate all thought processes the moment a cookbook test hints at its existence.
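For readers who have only seen robust standard errors as a software option, the sandwich calculation itself is short. A sketch with numpy on simulated data (the data-generating process is invented for illustration), comparing the classical variance estimate to White's heteroskedasticity-consistent (HC0) version:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 2, n)
# residual variance is largest far from the center of x
y = 1.0 + 2.0 * x + rng.normal(0, 0.2 + np.abs(x - 1.0), n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# classical estimate assumes a single error variance
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White's HC0 sandwich estimate lets variance differ by observation
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_classical[1], se_robust[1])
```

Here the robust slope standard error comes out larger than the classical one, and inference proceeds rather than terminating at the diagnosis.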


Multicollinearity is similar. When covariates are highly correlated, it may be difficult to parse out the independent information about each variable, which can lead to inflated standard errors. Again, this is a phenomenon related to inference, not prediction. Even in professional and academic settings, when I have presented or attended presentations related to forecasting or predictive analytics, you will get the occasional criticism or self-aggrandizing question about multicollinearity being a concern.

"Multicollinearity has a very different impact if your goal is prediction from when your goal is estimation. When predicting, multicollinearity is not really a problem provided the values of your predictors lie within the hyper-​​region of the predictors used when estimating the model."-  Statist. Sci.  Volume 25, Number 3 (2010), 289-310.

Undue criticism and literalism related to multicollinearity often result from a failure to recognize the differences between the goals of explaining vs. predicting.
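The explain-vs-predict distinction is easy to demonstrate. In the simulation below (a sketch; the data are invented), two predictors are nearly collinear, so their individual coefficients are poorly identified, yet the fitted values still track the true conditional mean closely:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
y = 1.0 + x1 + x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Individual slopes can land far from the true values of 1.0 each,
# but predictions remain accurate.
fitted = X @ beta
truth = 1.0 + x1 + x2
print(np.corrcoef(x1, x2)[0, 1])       # near-perfect collinearity
print(np.max(np.abs(fitted - truth)))  # small prediction error
```

Estimation of either slope alone is hopeless here, but as the Shmueli quote above suggests, prediction within the observed region of the predictors is unaffected.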

Paul Allison offers some additional advice on when not to worry about multicollinearity. I have highlighted a couple of points of interest here.

This kind of linear fundamentalist paradigm can lead students and practitioners to adopt more complicated methods than necessary, abandon promising empirical work altogether, become overly critical and dismissive of important work done by others, or completely miss more important questions related to selection bias, identification, unobserved heterogeneity, and endogeneity.

Some of this is also the result of the huge gap between theoretical and applied econometrics.

See also:
Marc Bellemare discusses a similar vein of literalism that is averse to linear probability models here.

Linear Probability Models

Regression as an empirical tool
Quasi-Experimental Design Roundup

Friday, May 8, 2015

Mendelian Instruments (Applied Econometrics meets Bioinformatics)

Recently I defended the use of quasi-experimental methods in wellness studies, and a while back I speculated that genomic data might be useful in a quasi-experimental setting, but wasn't sure how:

"If causality is the goal, then merge 'big data' from the gym app with biometrics and the SNP profiles and employ some quasi-experimental methodology to investigate causality."

Then this morning at Marginal Revolution I ran across a link to a blog post that mentioned exploiting Mendelian variation as an instrument in a study related to alcohol consumption.

This piece gives a nice intro I think:

Lawlor DA, Harbord RM, Sterne JA, Timpson N, Davey Smith G. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Stat Med. 2008 Apr 15;27(8):1133-63.


“Observational epidemiological studies suffer from many potential biases, from confounding and from reverse causation, and this limits their ability to robustly identify causal associations. Several high-profile situations exist in which randomized controlled trials of precisely the same intervention that has been examined in observational studies have produced markedly different findings. In other observational sciences, the use of instrumental variable (IV) approaches has been one approach to strengthening causal inferences in non-experimental situations. The use of germline genetic variants that proxy for environmentally modifiable exposures as instruments for these exposures is one form of IV analysis that can be implemented within observational epidemiological studies. The method has been referred to as 'Mendelian randomization', and can be considered as analogous to randomized controlled trials. This paper outlines Mendelian randomization, draws parallels with IV methods, provides examples of implementation of the approach and discusses limitations of the approach and some methods for dealing with these.”
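The logic maps directly onto the familiar instrumental variables setup, with the genotype playing the role of the instrument. A toy simulation with numpy (all numbers invented; the true causal effect is set to 1) shows naive OLS picking up confounding while the simple Wald/IV ratio recovers the effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50000
z = rng.binomial(2, 0.3, n).astype(float)  # genotype: 0, 1, or 2 risk alleles
u = rng.normal(size=n)                     # unobserved confounder
x = 0.5 * z + u + rng.normal(size=n)       # exposure, shifted by genotype
y = 1.0 * x + u + rng.normal(size=n)       # outcome; true causal effect is 1

# Naive OLS slope of y on x is biased upward by the confounder u
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# Wald/IV estimator using the genotype as the instrument
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(ols, iv)  # OLS well above 1; IV close to 1
```

The design works only under the usual IV conditions, which is exactly where the limitations discussed in the paper (pleiotropy, linkage disequilibrium, population stratification) enter.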

Tuesday, April 28, 2015

Healthcare Analytics at SAS Global Forum 2015

I was not able to attend this year's SAS Global Forum, but have had a chance to browse the numerous session papers as well as enjoy some live content. The conference page has a searchable catalog, and for each paper you will find links to similar sessions in the sidebar. There were around 5000 attendees at this year's conference and over 600 papers. For a 2 1/2 day conference, that's more than 200 papers to cover per day. Below is a selection of papers related to healthcare analytics. If we widened the search to include papers from other fields with applications in healthcare analytics, the selection would probably double. I'm sure I missed something, and would be glad to know if you have a favorite paper or presentation you'd like to share in the comments.

1329 - Causal Analytics: Testing, Targeting, and Tweaking to Improve Outcomes. This session is an introduction to predictive analytics and causal analytics in the context of improving outcomes. The session covers the following topics: 1) Basic... (20 minutes, Breakout, Jason Pieratt)

1340 - Using SAS® Macros to Flag Claims Based on Medical Codes. Many epidemiological studies use medical claims to identify and describe a population. But finding out who was diagnosed, and who received treatment, isn't always simple... (50 minutes, Breakout, Andy Karnopp)

2382 - Reducing the Bias: Practical Application of Propensity Score Matching in Health-Care Program Evaluation. To stay competitive in the marketplace, health-care programs must be capable of reporting the true savings to clients. This is a tall order, because most health-care programs... (20 minutes, Breakout, Amber Schmitz)

2920 - Text Mining Kaiser Permanente Member Complaints with SAS® Enterprise Miner™. This presentation details the steps involved in using SAS® Enterprise Miner™ to text mine a sample of member complaints. Specifically, it describes how the Text Parsing, Text... (30 minutes, E-Poster, Amanda Pasch)

3214 - How is Your Health? Using SAS® Macros, ODS Graphics, and GIS Mapping to Monitor Neighborhood and Small-Area Health Outcomes. With the constant need to inform researchers about neighborhood health data, the Santa Clara County Health Department created socio-demographic and health profiles for 109... (20 minutes, Breakout, Roshni Shah)

3254 - Predicting Readmission of Diabetic Patients Using the High-Performance Support Vector Machine Algorithm of SAS® Enterprise Miner™ 13.1. Diabetes is a chronic condition affecting people of all ages and is prevalent in around 25.8 million people in the U.S. The objective of this research is to predict the... (20 minutes, Breakout, Hephzibah Munnangi)

3281 - Using SAS® to Create Episodes-of-Hospitalization for Health Services Research. An essential part of health services research is describing the use and sequencing of a variety of health services. One of the most frequently examined health services is... (20 minutes, Breakout, Meriç Osman)

3282 - A Case Study: Improve Classification of Rare Events with SAS® Enterprise Miner™. Imbalanced data are frequently seen in fraud detection, direct marketing, disease prediction, and many other areas. Rare events are sometimes of primary interest. Classifying... (20 minutes, Breakout, Ruizhe Wang)

3411 - Identifying Factors Associated with High-Cost Patients. Research has shown that the top five percent of patients can account for nearly fifty percent of the total healthcare expenditure in the United States. Using SAS® Enterprise... (30 minutes, E-Poster, Jialuo Cheng)

3488 - Text Analytics on Electronic Medical Record Data. This session describes our journey from data acquisition to text analytics on clinical, textual data. (50 minutes, Breakout, Mark Pitts)

3560 - A SAS Macro to Calculate the PDC Adjustment of Inpatient Stays. The Centers for Medicare & Medicaid Services (CMS) uses the Proportion of Days Covered (PDC) to measure medication adherence. There is also some PDC-related research based on... (20 minutes, Breakout, Anping Chang)

3600 - When Two Are Better Than One: Fitting Two-Part Models Using SAS. In many situations, an outcome of interest has a large number of zero outcomes and a group of nonzero outcomes that are discrete or highly skewed. For example, in modeling... (20 minutes, Breakout, Laura Kapitula)

3740 - Risk-Adjusting Provider Performance Utilization Metrics. Pay-for-performance programs are putting increasing pressure on providers to better manage patient utilization through care coordination, with the philosophy that good... (50 minutes, Breakout, Tracy Lewis)

3741 - The Spatio-Temporal Impact of Urgent Care Centers on Physician and ER Use. The unsustainable trend in healthcare costs has led to efforts to shift some healthcare services to less expensive sites of care. In North Carolina, the expansion of urgent... (50 minutes, Breakout, Laurel Trantham)

3760 - Methodological and Statistical Issues in Provider Performance Assessment. With the move to value-based benefit and reimbursement models, it is essential to quantify the relative cost, quality, and outcome of a service. Accurately measuring the cost... (50 minutes, Breakout, Daryl Wansink)

SAS1855 - Using the PHREG Procedure to Analyze Competing-Risks Data. Competing risks arise in studies in which individuals are subject to a number of potential failure events and the occurrence of one event might impede the occurrence of other... (20 minutes, Breakout, Ying So)

SAS1900 - Establishing a Health Analytics Framework. Medicaid programs are the second largest line item in each state's budget. In 2012, they contributed $421.2 billion, or 15 percent of total national healthcare expenditures... (20 minutes, Breakout, Krisa Tailor)

SAS1951 - Using SAS® Text Analytics to Examine Labor and Delivery Sentiments on the Internet. In today's society, where seemingly unlimited information is just a mouse click away, many turn to social media, forums, and medical websites to research and understand how mothers feel about the birthing process. Mining the data in these resources helps provide an understanding of what mothers value and how they feel. This paper shows the use of SAS® Text Analytics to gather, explore, and analyze reports from mothers to determine their sentiment about labor and delivery topics. Results of this analysis could aid in the design and development of a labor and delivery survey and be used to understand what characteristics of the birthing process yield the highest levels of importance. These resources can then be used by labor and delivery professionals to engage with mothers regarding their labor and delivery preferences. (20 minutes, Breakout, Michael Wallis)

Saturday, March 28, 2015

Using the R MatchIt package for propensity score analysis

Descriptive analysis between treatment and control groups can reveal interesting patterns or relationships, but we cannot always take descriptive statistics at face value. Regression and matching methods allow us to make controlled comparisons to reduce selection bias in observational studies.
For a couple of good references that I am basically tracking in this post, see here and here. These link to the package authors' pages and a nice paper (A Step by Step Guide to Propensity Score Matching in R) from higher education evaluation research, respectively.

In both Mostly Harmless Econometrics and Mastering Metrics Angrist and Pischke discuss the similarities between matching and regression. From MM:

"Specifically, regression estimates are weighted averages of multiple matched comparisons"

In this post I borrow from some of the previous references and try to follow closely the dialogue in chapter 3 of MHE. Conveniently, the R MatchIt propensity score matching package comes with a subset of the Lalonde data set referenced in MHE. Based on descriptives, it looks like this data matches columns (1) and (4) in table 3.3.2. The Lalonde data set basically consists of a treatment indicator, an outcome (re78, real earnings in 1978), and other variables that can be used as controls (see the links above for more details). If we use regression to look at basic uncontrolled raw differences between treatment and control groups, it appears that the treatment (a job training program) produces negative results (on the order of -$635):

R code:

summary(lm(re78 ~ treat, data = lalonde))

            Estimate Std. Error t value            Pr(>|t|)   
(Intercept)     6984        361   19.36 <0.0000000000000002 ***
treat           -635        657   -0.97                0.33  <---

Once we implement matching in R, the output provides comparisons between the balance in covariates for the treatment and control groups before and after matching. Matching is based on propensity scores estimated with logistic regression. (see previous post on propensity score analysis for further details). The output below indicates that the propensity score matching creates balance among covariates/controls as if we were explicitly trying to match on the controls themselves.

R Code:
m.out1 <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75, data = lalonde, method = "nearest", distance = "logit")

Summary of balance for all data:
         Means Treated Means Control SD Control  Mean Diff   eQQ Med  eQQ Mean   eQQ Max
distance        0.5774        0.1822     0.2295     0.3952    0.5176    0.3955    0.5966
age            25.8162       28.0303    10.7867    -2.2141    1.0000    3.2649   10.0000
educ           10.3459       10.2354     2.8552     0.1105    1.0000    0.7027    4.0000
black           0.8432        0.2028     0.4026     0.6404    1.0000    0.6432    1.0000
hispan          0.0595        0.1422     0.3497    -0.0827    0.0000    0.0811    1.0000
nodegree        0.7081        0.5967     0.4911     0.1114    0.0000    0.1135    1.0000
married         0.1892        0.5128     0.5004    -0.3236    0.0000    0.3243    1.0000
re74         2095.5737     5619.2365  6788.7508 -3523.6628 2425.5720 3620.9240 9216.5000
re75         1532.0553     2466.4844  3291.9962  -934.4291  981.0968 1060.6582 6795.0100

Summary of balance for matched data:
         Means Treated Means Control SD Control Mean Diff  eQQ Med eQQ Mean    eQQ Max
distance        0.5774        0.3629     0.2533    0.2145   0.1646   0.2146     0.4492
age            25.8162       25.3027    10.5864    0.5135   3.0000   3.3892     9.0000
educ           10.3459       10.6054     2.6582   -0.2595   0.0000   0.4541     3.0000
black           0.8432        0.4703     0.5005    0.3730   0.0000   0.3730     1.0000
hispan          0.0595        0.2162     0.4128   -0.1568   0.0000   0.1568     1.0000
nodegree        0.7081        0.6378     0.4819    0.0703   0.0000   0.0703     1.0000
married         0.1892        0.2108     0.4090   -0.0216   0.0000   0.0216     1.0000
re74         2095.5737     2342.1076  4238.9757 -246.5339 131.2709 545.1182 13121.7500
re75         1532.0553     1614.7451  2632.3533  -82.6898 152.1774 349.5371 11365.7100

Estimation of treatment effects can be obtained via paired or matched comparisons (Lanehart et al., 2012; Austin, 2010; see previous posts here and here).


t = 1.2043, df = 184, p-value = 0.23
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -579.6904 2396.0948
sample estimates:
mean of the differences
               908.2022 <---

This indicates an estimated treatment effect of about $900.00, which is quite a reversal from the raw uncontrolled/unmatched comparisons. In Mostly Harmless Econometrics, as part of the dialogue relating regression to matching, Angrist and Pischke present results in table 3.3.3 for regressions utilizing data that has been 'screened' by eliminating observations where ps > .90 or < .10. Similar results were obtained in R below:

summary(lm(re78~treat + age + educ + black + hispan + nodegree + married + re74 + re75, data = m.data3))

              Estimate Std. Error t value Pr(>|t|) 
(Intercept)    23.6095  3920.6599    0.01    0.995 
treat        1229.8619   849.1439    1.45    0.148  <---
age            -3.4488    45.4815   -0.08    0.940 
educ          506.9958   240.1989    2.11    0.036 *
black       -1030.4883  1255.1766   -0.82    0.412 
hispan        926.5288  1498.4450    0.62    0.537

Again we get a reversal of the values from the raw comparisons, and a much larger estimated treatment effect than the non-parametrically matched comparisons above.

Note: Slightly different results were obtained from A&P, partly because I am not sure of the exact specification of their ps model, which may have impacted the screening and ultimately the data used. Full R code is provided below with the ps model specification. Also, I have replicated similar results in SAS using the Mayo Clinic %gmatch macro as well as approaches outlined by Lanehart et al. (2012). These results may be shared in a later post or white paper.

See also: A Toy Instrumental Variable Application

R Program:

# *------------------------------------------------------------------
# | PROGRAM NAME: ex ps match mostly harmless R
# | DATE: 3/24/15 
# | PROJECT FILE: Tools and References/Rcode             
# *----------------------------------------------------------------
# | PURPOSE: Use R matchit and glm to mimic the conversation in 
# | chapter 3 of Mostly Harmless Econometrics              
# | NOTE: because of random sorting within matching application different results may be
# | obtained with each iteration of matchit in R
# *------------------------------------------------------------------
rm(list=ls()) # get rid of any existing data 
ls() # view open data sets
library(MatchIt) # load matching package
options("scipen" =100, "digits" = 4) # override R's tendency to use scientific notation
# raw differences in means for treatments and controls using regression
summary(lm(re78 ~ treat, data = lalonde))
# nearest neighbor matching via MatchIt
# estimate propensity scores and create matched data set using 'matchit' and lalonde data
m.out1 <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75, data = lalonde, method = "nearest", distance = "logit")
summary(m.out1) # check balance
m.data1 <- match.data(m.out1, distance = "pscore") # create ps matched data set from previous output
hist(m.data1$pscore) # distribution of propensity scores
# perform paired t-test on matched pairs (treated units and their matched controls)
t.test(lalonde[row.names(m.out1$match.matrix), "re78"],
       lalonde[m.out1$match.matrix, "re78"], paired = TRUE)
# lets look at regression on the ps restricted data on the entire sample per MHE chapter 3
m.data2 <- lalonde # copy lalonde for ps estimation
# generate propensity scores for all of the data in Lalonde 
ps.model <- glm(treat ~ age + educ + black + hispan + nodegree + married + re74 +
re75, data = m.data2, family = binomial(link = "logit"))
# add pscores to study data
m.data2$pscore <- predict(ps.model, newdata = m.data2, type = "response")
hist(m.data2$pscore) # distribution of ps
# restrict data to ps range .10 <= ps <= .90
m.data3 <- m.data2[m.data2$pscore >= .10 & m.data2$pscore <=.90,]
# regression with controls on propensity score screened data set
summary(lm(re78~treat + age + educ + black + hispan + nodegree + married + re74 + re75, data = m.data3))
# unrestricted regression with controls
summary(lm(re78~treat + age + educ + black + hispan + nodegree + married + re74 + re75, data = lalonde))