Friday, January 30, 2015

Considerations in Propensity Score Matching

A while back I stumbled across a paper by Liu & Lynch: Do Agricultural Land Preservation Programs Reduce Farmland Loss?


They use a really long panel and propensity score matching and highlight some important considerations in propensity score matching applications:

1) They used unrestricted matching (which basically ignored the time component and allowed an individual to be matched to itself if propensity scores across time periods matched up) and restricted matching, which required matches to be made only between treatments and controls within a given time period/census period.

2) They provide an interesting discussion of the variance/bias tradeoff associated with bandwidth and kernel selection:

 “Bandwidth and kernel type selection is an important issue in choosing matching method. Generally speaking, a large bandwidth leads to a larger bias but smaller variance of the estimated average treatment effect of the PDR programs; a small bandwidth leads to a smaller bias but a larger variance. The differences among kernel types are embedded in the weights they assign to non-PDR county observations whose estimated propensity score are farther away from that of their matched PDR county observations.”

3) They discuss the use of a leave-one-out cross-validation mechanism to choose the ‘best’ matching method (a combination of matching method, i.e. nearest neighbor, kernel, or local linear, with 5 possible kernel types and 6 bandwidths), optimized on an MSE criterion. They cite references for this:

Racine, J. S. and Q. Li. 2004. “Nonparametric estimation of regression functions with both categorical and continuous data.” Journal of Econometrics 119 (1): 99-130.

Black, D., and J. Smith. 2004. “How Robust is the Evidence on the Effects of College Quality? Evidence from Matching.” Journal of Econometrics, 121(1-2): 99-124.
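
To make the leave-one-out idea in item 3 concrete, here is a minimal Python sketch. It is not the authors' procedure: it assumes a Gaussian kernel, a simple kernel-weighted mean predictor fitted on comparison-group observations only, and made-up scores and outcomes. The function names `loo_mse` and `choose_bandwidth` are hypothetical.

```python
import math

def gaussian(u):
    """Gaussian kernel (unnormalized; the constant cancels in the weights)."""
    return math.exp(-0.5 * u * u)

def loo_mse(scores, outcomes, bandwidth, kernel=gaussian):
    """Leave-one-out MSE of a kernel-weighted mean of outcomes on scores."""
    squared_errors = []
    for i, (p_i, y_i) in enumerate(zip(scores, outcomes)):
        num = den = 0.0
        for j, (p_j, y_j) in enumerate(zip(scores, outcomes)):
            if j == i:
                continue  # hold out observation i
            w = kernel((p_j - p_i) / bandwidth)
            num += w * y_j
            den += w
        if den == 0.0:
            return math.inf  # bandwidth too small to predict this point
        squared_errors.append((y_i - num / den) ** 2)
    return sum(squared_errors) / len(squared_errors)

def choose_bandwidth(scores, outcomes, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out MSE."""
    return min(candidates, key=lambda h: loo_mse(scores, outcomes, h))

# Illustrative (made-up) comparison-group propensity scores and outcomes.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
outcomes = [1.2, 1.9, 3.3, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0]
candidates = [0.05, 0.10, 0.20, 0.40]

best = choose_bandwidth(scores, outcomes, candidates)
print(best, loo_mse(scores, outcomes, best))
```

The same loop could be repeated over kernel types and matching methods, keeping whichever combination gives the smallest out-of-sample error.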

4) They state: “Matching with replacement performs as well or better than matching without replacement”– based on:

Dehejia, R., and S. Wahba. 2002. “Propensity score matching methods for non-experimental causal studies.” The Review of Economics and Statistics 84: 151-161.

Rosenbaum, P. 2002. Observational Studies (2nd edition). New York: Springer Verlag.

5) They also write that  the “selection of matching methods depends on the distribution of the estimated propensity score”- i.e. 

Kernel matching works well with asymmetric distributions and excludes bad matches.

The local linear estimator may be more efficient than standard kernel matching when there is a large concentration of observations with propensity scores near 1 or 0.

Also from: McMillen, D. P., and J.F. McDonald. 2002. “Land values in a newly zoned city.” Review of Economics and Statistics 84(1): 62–72.

They also discuss the fact that nearest neighbor matching is more biased if propensity score  distributions are not very compatible.

6) Balancing Tests – A lot of practitioners implement more subjective evaluations of balance based on data visualization, but in this paper formal tests are discussed.

“After matching, we check again whether the two matched groups are the same on their observed characteristics. If unbalanced, the estimated ATT may not be solely the impact of PDR programs. Instead, it may be a combination of the impacts of PDR programs and the unbalanced variables. We rely on two of the balancing tests that exist in the empirical literature: the standardized difference test and a regression-based test. The first method is a t-test for the equality of the means for each covariate in the matched PDR and non-PDR counties. The regression test estimates coefficients for each covariate on polynomials of the estimated propensity scores…. and the interaction of these polynomials with the treatment binary variable, ….If the estimated coefficients on the interacted terms are jointly equal to zero according to an F-test, the balancing condition is satisfied.”
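
As a rough illustration of the first kind of check, the sketch below computes a standardized difference in means for each covariate in a matched sample. It assumes the commonly used formula (mean gap divided by the square root of the average of the two group variances, times 100), which may not be exactly the statistic used in the paper; the covariate names and values are made up, and `standardized_difference` is a hypothetical helper.

```python
import math
import statistics

def standardized_difference(treated_values, control_values):
    """Standardized difference in means, in percent.

    100 * (mean_T - mean_C) / sqrt((var_T + var_C) / 2)
    """
    gap = statistics.fmean(treated_values) - statistics.fmean(control_values)
    pooled_sd = math.sqrt(
        (statistics.variance(treated_values) + statistics.variance(control_values)) / 2
    )
    if pooled_sd == 0.0:
        raise ValueError("covariate has no variation in either group")
    return 100.0 * gap / pooled_sd

# Made-up covariate values for matched treated and comparison units.
matched = {
    "farm_income": ([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]),
    "pop_density": ([10.0, 12.0, 14.0], [10.0, 12.0, 14.0]),
}

balance = {name: standardized_difference(t, c) for name, (t, c) in matched.items()}
print(balance)
```

A value near zero suggests the matched groups look alike on that covariate; large values point to variables that may be contaminating the estimated ATT.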

7) They also offer some really nice explanations of how kernel matching works:

“Kernel matching and local linear matching techniques match each PDR county with all non-PDR counties whose estimated propensity scores fall within a specified bandwidth (Heckman, Ichimura and Todd, 1997). The bandwidth is centered on the estimated propensity score for the PDR county. The matched non-PDR counties are weighted according to the density function of the kernel types.11 The closer a non-PDR county’s estimated propensity score is to the matched PDR county’s propensity score, the more similar the non-PDR county is to the matched PDR county and therefore it is assigned a larger weight calculated from a kernel functions defined in each method. More non-PDR counties are utilized under the kernel and local linear matching as compared to nearest neighbor matching.”
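
The weighting described in that passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an Epanechnikov kernel, plain kernel matching (not local linear), propensity scores that have already been estimated, and made-up data. The function names are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel: positive weight only when |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kernel_matching_att(treated, controls, bandwidth, kernel=epanechnikov):
    """Estimate the ATT from lists of (propensity_score, outcome) pairs.

    Each treated unit is compared with a kernel-weighted average of the
    outcomes of controls whose scores lie within `bandwidth` of its own.
    Treated units with no control inside the bandwidth are dropped.
    """
    effects = []
    for p_t, y_t in treated:
        weights = [kernel((p_c - p_t) / bandwidth) for p_c, _ in controls]
        total = sum(weights)
        if total == 0.0:
            continue  # no comparison unit close enough
        counterfactual = sum(w * y_c for w, (_, y_c) in zip(weights, controls)) / total
        effects.append(y_t - counterfactual)
    if not effects:
        raise ValueError("no treated unit has a control within the bandwidth")
    return sum(effects) / len(effects)

# Made-up (score, outcome) pairs.
treated = [(0.60, 10.0), (0.70, 12.0)]
controls = [(0.55, 7.0), (0.65, 9.0), (0.90, 20.0)]

att = kernel_matching_att(treated, controls, bandwidth=0.1)
print(att)  # approximately 2.5
```

In this toy data the control with a score of 0.90 gets zero weight for both treated units, which is the sense in which kernel matching "excludes bad matches." Widening the bandwidth would pull it in, trading variance for bias as in item 2.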

Data Science is 10% Inspiration and 90% Perspiration

“Success is 10 percent inspiration and 90 percent perspiration.” Thomas Alva Edison

Last fall there was a really good article in the New York Times:

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

"Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

This has always been true, even before 'Big Data' was a big deal. It was one of the first rude awakenings I had as a researcher straight out of graduate school (one of the other things was related to the large gap between econometric theory and applied econometrics). Thank goodness, I was lucky enough to work in a shop where this was much appreciated and I developed the necessary SAS and SQL (and later R) skills to deal with these issues. They just don't teach this stuff in school (though I'm sure they might in some places).

The article mentions some efforts to develop software to make these tasks simpler. I think there is a very fine line between the value gained from doing this grunt work vs. the savings of time and energy that we could yield if we flatten the cost curve when it comes to data prep. As the article says:

"Data scientists emphasize that there will always be some hands-on work in data preparation, and there should be. Data science, they say, is a step-by-step process of experimentation."

“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,” said Cathy O’Neil, a data scientist at the Columbia University Graduate School of Journalism, and co-author, with Rachel Schutt, of “Doing Data Science” (O’Reilly Media, 2013).

See also:

In God We Trust, All Others Show Me Your Code

In God We Trust, All Others Show Me Your Code

There recently was a really interesting article at the Political Methodologist titled:

A Decade of Replications: Lessons from the Quarterly Journal of Political Science

They have high standards related to research documentation:

"Since its inception in 2005, the Quarterly Journal of Political Science (QJPS) has sought to encourage this type of transparency by requiring all submissions to be accompanied by a replication package, consisting of data and code for generating paper results. These packages are then made available with the paper on the QJPS website. In addition, all replication packages are subject to internal review by the QJPS prior to publication. This internal review includes ensuring the code executes smoothly, results from the paper can be easily located, and results generated by the replication package match those in the paper."

"Although the QJPS does not necessarily require the submitted code to access the data if the data are publicly available (e.g., data from the National Election Studies, or some other data repository), it does require that the dataset containing all of the original variables used in the analysis be included in the replication package. For the sake of transparency, the variables should be in their original, untransformed and unrecoded form, with code included that performs the transformations and recodings in the reported analyses. This allows replicators to assess the impact of transformations and recodings on the results."

From an efficiency standpoint, I don't know if this standard should be applied universally or not. We wouldn't want to bottleneck the body of peer reviewed literature contributing to society's pool of knowledge, but at the same time, some sort of filtration system might keep the murkiness out of the water so we can see more clearly the 'real' effects of policies and treatments.

I certainly know from personal (professional and academic) experience collaborating with others that a lot of time and resources have been lost trying to reinvent the wheel, reconstruct the creation of some data set, etc., because of a lack of documentation around how data was pulled or cleaned. Better documentation and code sharing always seems better. Maybe everyone needs a Github account.

Saturday, January 24, 2015

The Internet of Things, Big Data, and John Deere

What is the internet of things? It's essentially the proliferation of smart products with connectivity and data sharing capabilities that are changing the way we interact with them and the rest of the world. We've all experienced IOT via smartphones, but the next generation of smart products will include our cars, homes, and appliances. A recent Harvard Business Review article discusses what this means for the future of industry and the economy:

How Smart, Connected Products Are Transforming Competition by Porter and Heppelmann (the same Porter of Porter's 5 competitive forces):

"IT is becoming an integral part of the product itself. Embedded sensors, processors, software, and connectivity in products (in effect, computers are being put inside products), coupled with a product cloud in which product data is stored and analyzed and some applications are run, are driving dramatic improvements in product functionality and performance....As products continue to communicate and collaborate in networks, which are expanding both in number and diversity, many companies will have to reexamine their core mission and value proposition"

In the article, they view how IOT is reshaping the competitive framework through the lens of Porter's five competitive forces.

HBR repeatedly uses John Deere as a case study of a firm that is leading its industry in leveraging big data and analytics, and as a successful model for an IOT strategy, particularly the integration of IOT applications through connected tractors and implements. So, the competitive space expands from a singular focus on a line of equipment to optimization of performance and interoperability within a connected system of systems:

"The function of one product is optimized with other related products...The manufacturer can now offer a package of connected equipment and related services that optimize overall results. Thus in the farm example, the industry expands from tractor manufacturing to farm equipment optimization."

A 'system of systems' approach means not only smart connected tractors and implements, but layers of connectivity and data related to weather, crop prices, and agronomics. That changes the value proposition not just for the farm implement business, but also for seed, inputs, sales, and crop consulting. When you think about the IOT, suddenly Monsanto's purchase of Climate Corporation makes sense. Suddenly the competitive landscape has changed in agriculture. Both Monsanto and John Deere are offering advanced analytic and agronomic consulting services based on the big data generated from the IOT. John Deere has found value creation in these expanded economies of scope given the data generated from their equipment. Seed companies see added value in optimizing and developing customized genetics as farmers begin to farm (collecting data) literally inch by inch (vs field by field). Both new alliances and rivalries are being drawn between what were once rather distinct lines of business.

And, as the HBR article notes, there is a lot of opportunity in the new era of Big Data and the IOT.  The convergence of biotechnology, genomics, and big data has major implications for economic development as well as environmental sustainability. 

Additional Reading:

This article does a great job breaking down insight from the HBR paper: How John Deere is Using APIs to Grow the World's Food Supply  
From the Motley Fool: The Internet of Things is Changing How Your Food is Grown

Thursday, January 22, 2015

Fat Tails, Kurtosis, and Risk

When I think of 'fat tails' or 'heavy tails' I typically think of situations that can be described by probability distributions with heavy mass in the tails. This might imply that the tails are 'fat' or 'thicker' than other situations with less mass in the tails. (for instance, a normal distribution might be said to have thin tails while a distribution with more mass in the tails than the normal might be considered a 'fat tailed' distribution.)

Basically when I think of tails I think of 'extreme events', some occurrence with an excessive departure from what's expected on average. So, if there is more mass in the tail of a probability distribution (it is a fat or thick tailed distribution), that implies that extreme events will occur with greater probability. So in application, if I am trying to model or assess the probability of some extreme event (like a huge loss on an investment), then I had better get the distribution correct in terms of tail thickness.

See here for a very  nice explanation of fat tailed events from the Models and Agents blog:

Also here is a great podcast from the CME group related to tail hedging strategies: (May 29 2014)

And here's a twist, what if I am trying to model the occurrence of multiple events simultaneously? (a huge loss in commodities and equities simultaneously, or losses on real estate in Colorado and Florida simultaneously). I would want to model a multivariate process that captures the correlation or dependence between multiple extreme events (in other words between the tails of their distributions). Copulas offer an approach to modeling tail dependence, and again, getting the distributions correct matters.

When I think of how we assess or measure tail thickness in a data set, I think of kurtosis. Rick Wicklin recently had a nice piece discussing the interpretation of kurtosis and relating kurtosis to tail thickness.

"A data distribution with negative kurtosis is often broader, flatter, and has thinner tails than the normal distribution."

" A data distribution with positive kurtosis is often narrower at its peak and has fatter tails than the normal distribution."

The connection between kurtosis and tail thickness can be tricky, and kurtosis cannot be interpreted this way universally in all situations. Rick gives some good examples if you want to read more. But this definition of kurtosis from his article seems to keep us honest:

Kurtosis can be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails," with the caveat that "the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features."

But Rick also had an earlier post on fat and long tailed distributions where he puts all of this into perspective in terms of the connection to modeling extreme events as well as a more rigorous discussion and definition of tails and what 'fat' or 'heavy' tailed means:

"Probability distribution functions that decay faster than an exponential are called thin-tailed distributions. The canonical example of a thin-tailed distribution is the normal distribution, whose PDF decreases like exp(-x^2/2) for large values of |x|. A thin-tailed distribution does not have much mass in the tail, so it serves as a model for situations in which extreme events are unlikely to occur.

Probability distribution functions that decay slower than an exponential are called heavy-tailed distributions. The canonical example of a heavy-tailed distribution is the t distribution. The tails of many heavy-tailed distributions follow a power law (like |x|^(-α)) for large values of |x|. A heavy-tailed distribution has substantial mass in the tail, so it serves as a model for situations in which extreme events occur somewhat frequently."
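
To get a feel for how different these two kinds of tails are, here is a small Python sketch comparing two-sided tail probabilities for a standard normal and for a standard Cauchy, which is the t distribution with one degree of freedom. Choosing the Cauchy as the heavy-tailed example is my assumption for convenience (its tail probability has a closed form); the function names are hypothetical.

```python
import math

def normal_tail(x):
    """P(|Z| > x) for a standard normal Z."""
    return math.erfc(x / math.sqrt(2.0))

def cauchy_tail(x):
    """P(|T| > x) for a standard Cauchy T (Student's t with 1 degree of freedom)."""
    return 1.0 - (2.0 / math.pi) * math.atan(x)

# Tail probability at increasingly extreme thresholds, and the ratio between them.
for x in (2, 4, 6, 8):
    print(x, normal_tail(x), cauchy_tail(x), cauchy_tail(x) / normal_tail(x))
```

The ratio in the last column grows very quickly with the threshold, which is why assuming thin tails when the data are heavy tailed can badly understate the chance of an extreme loss.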

Also, somewhat related, Nassim Taleb, in his paper on the precautionary principle and GMOs (genetically modified organisms), discusses such concepts as ruin, harm, fat tails and fragility, tail sensitivity to uncertainty, etc. He uses very rigorous definitions of these terms and determines that there are certain things like GMOs that would require a non-naive application of the precautionary principle while other things like nuclear energy would not. (Also tune into his discussion of this with Russ Roberts on Econtalk; there is more discussion on this actual application at my applied economics blog Economic Sense.)

Wednesday, January 14, 2015

A Credibility Revolution in Wellness Program Analysis?

Last November there was a post on the Health Affairs blog related to the evaluation of wellness programs. Here are some tidbits:

"This blog post will consider the results of two compelling study designs — population-based wellness-sensitive medical event analysis, and randomized controlled trials (RCTs). Then it will look at the popular, although weaker, participant vs. non-participant study design."

"More often than not wellness studies simply compare participants to “matched” non-participants or compare a subset of participants (typically high-risk individuals) to themselves over time."

“Looking at how participants improve versus non-participants…ignores self-selection bias. Self-improvers are likely to be drawn to self-improvement programs, and self-improvers are more likely to improve.” Further, passive non-participants can be tracked all the way through the study since they cannot “drop out” from not participating, but dropouts from the participant group—whose results would presumably be unfavorable—are not counted and are considered lost to follow-up. So the study design is undermined by two major limitations, both of which would tend to overstate savings."
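
The self-selection point is easy to see in a toy simulation. The Python sketch below is purely illustrative and assumes a made-up data generating process: an unobserved "motivation" trait drives both the decision to join and the improvement in outcomes, and the program itself has a true effect of zero. The function name and all parameter values are hypothetical.

```python
import random
import statistics

def naive_wellness_gap(n=5000, true_effect=0.0, seed=1):
    """Participant minus non-participant mean improvement under self-selection."""
    rng = random.Random(seed)
    participants, non_participants = [], []
    for _ in range(n):
        motivation = rng.gauss(0.0, 1.0)                 # unobserved self-improver trait
        joins = motivation + rng.gauss(0.0, 1.0) > 0.0   # motivated people tend to join
        improvement = motivation + rng.gauss(0.0, 1.0)   # motivated people improve anyway
        if joins:
            participants.append(improvement + true_effect)
        else:
            non_participants.append(improvement)
    return statistics.fmean(participants) - statistics.fmean(non_participants)

gap = naive_wellness_gap()
print(gap)  # well above zero even though true_effect is 0.0
```

The naive comparison credits the program with a gain that comes entirely from who chose to participate.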

Does Wellness need a credibility revolution?

These criticisms are certainly valid; however, my thought is that panel methods, difference-in-differences, and propensity score matching fit firmly in the Rubin Causal Model or potential outcomes framework for addressing issues related to selection bias. And what about examples of more robust quasi-experimental approaches (like instrumental variables)? These are all methods that are meant to deal specifically with the drawbacks of self comparisons and the issues mentioned above by the authors, and are at the heart of techniques related to the credibility revolution in econometrics.

An RCT is by far the most reliable way to identify treatment effects, but I know when it comes to applied work, an RCT just isn't happening for a lot of obvious reasons. As Marc Bellemare might say, let the credibility revolution flow through you!

***this post was revised July 22, 2016, originally titled "Are Quasi-Experimental Designs Off the Table in Wellness Program Analysis"

Tuesday, January 13, 2015

Overconfident Confidence Intervals

In an interesting post, "Playing Dumb on Statistical Significance", there is a discussion relating to Naomi Oreskes' January 4 NYT piece Playing Dumb on Climate Change. The dialogue centers around her reference to confidence intervals and a possibly overbearing burden of proof that researchers apply related to statistical significance.

From the article:

"Although the confidence interval is related to the pre-specified Type I error rate, alpha, and so a conventional alpha of 5% does lead to a coefficient of confidence of 95%, Oreskes has misstated the confidence interval to be a burden of proof consisting of a 95% posterior probability. The “relationship” is either true or not; the p-value or confidence interval provides a probability for the sample statistic, or one more extreme, on the assumption that the null hypothesis is correct. The 95% probability of confidence intervals derives from the long-term frequency that 95% of all confidence intervals, based upon samples of the same size, will contain the true parameter of interest."

As mentioned in the piece, Oreskes' writing does have a Bayesian ring to it, and this whole story and critique makes me think of Kennedy's chapter on "The Bayesian Approach" in his book "A Guide to Econometrics". I believe that people often interpret frequentist confidence intervals from a Bayesian perspective. If I understand any of this at all, and I admit my knowledge of Bayesian econometrics is limited, then I think I have been a guilty offender at times as well. In the chapter it is even stated that Bayesians tout their methods because Bayesian thinking is actually how people really think, and that is why they so often misinterpret frequentist confidence intervals.
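
The long-run frequency reading quoted from the article can be checked with a quick simulation. This Python sketch assumes a deliberately simple setup that is not taken from the article or from Kennedy: normal data with a known standard deviation, a z-based 95% interval for the mean, and an arbitrary true mean of 2.65. The function name is hypothetical.

```python
import math
import random
import statistics

def ci_coverage(true_mean=2.65, sigma=1.0, n=30, repetitions=2000, seed=42):
    """Share of repeated-sample 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # known-sigma z interval
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / repetitions

coverage = ci_coverage()
print(coverage)  # close to 0.95
```

The 95% describes the procedure across repeated samples. Any single interval either contains the true value or it does not, which is exactly the distinction the critique is drawing.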

In Bayesian analysis, a posterior probability distribution is produced, along with a posterior probability interval that 'chops off' 2.5% from each tail, leaving an area or probability of 95%. From a Bayesian perspective, it is correct for the researcher to claim or believe that there is a 95% probability that the true value of the parameter they are estimating falls within the interval. This is how many people interpret confidence intervals, which are quite different from Bayesian posterior probability intervals. Kennedy gives an illustration:

"How do you think about an unknown parameter? When you are told that the interval between 2.6 and 2.7 is a 95% confidence interval, how do you think about this? Do you think, I am willing to bet $95 to your $5 that the true value of 'beta' lies in this interval [note this sounds a lot like Oreskes as if you read the article above]? Or do you think, if I were to estimate this interval over and over again using data with different error terms, then 95% of the time this interval will cover the true value of 'beta'"?

"Are you a Bayesian or a frequentist?"