Saturday, March 11, 2017

Count Models with Offsets

See also: Count Models with Offsets: Practical Applications using R

Principles of Count Model Regression

Often we want to model the impact of some intervention, or differences between groups, in relation to an outcome that is a count. Examples include measures of medical care utilization such as office visits, days in the hospital, or total hospital admissions.

"In all cases the data are concentrated on a few small discrete values, say 0, 1 and 2; skewed to the left; and intrinsically heteroskedastic with variance increasing with the mean. These features motivate the application of special methods and models for count regression." (Cameron and Trivedi, 1999).

Poisson regression is "the starting point for count data analysis, though it is often inadequate" (Cameron and Trivedi, 1999). The "main focus is the effect of covariates on the frequency of an event, measured by non-negative integer values or counts" (Cameron and Trivedi, 1996).

From Cameron and Trivedi (1999): "The Poisson regression model is derived from the Poisson distribution by parameterizing the relation between the mean parameter μ and covariates (regressors) x. The standard assumption is to use the exponential mean parameterization"

E(Y|x) = μ = exp(xβ), where xβ = β0 + β1x

Equivalently:

Log(μ) = xβ

Where β is the change in the log of the average count given a one-unit change in x. Alternatively, we can say that E(Y|x) changes by a factor of exp(β). Often exp(β) is interpreted as an 'incidence rate ratio,' although in many cases 'rate' and 'mean' are used interchangeably (Williams, 2016).
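As a concrete illustration, here is a minimal sketch in R using simulated data (the variable names and parameter values are purely illustrative, not from any of the sources above): fit a Poisson model with glm() and exponentiate the coefficient to get the multiplicative interpretation of exp(β).

# Minimal sketch: Poisson regression and exponentiated coefficients
# (simulated data; names and parameter values are illustrative)
set.seed(123)
n  <- 1000
x  <- rbinom(n, 1, 0.5)      # e.g., a group or treatment indicator
mu <- exp(0.5 + 0.7 * x)     # exponential mean: E(Y|x) = exp(b0 + b1*x)
y  <- rpois(n, mu)           # counts drawn from a Poisson distribution

fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                    # b1: change in the log of the average count
exp(coef(fit))               # exp(b1): multiplicative change in E(Y|x), i.e., the IRR

Here exp(b1) should land near exp(0.7) ≈ 2.01, i.e., average counts for the x = 1 group are roughly twice those for the x = 0 group.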

Are we modeling mean (average) counts or rates, and does it matter?

We might often think of our data as counts, but count models assume that counts are always observed within time or space. This gives rise to an implicit rate interpretation of counts. For example, the probability mass function for a Poisson process can be specified as:

P(Y = y | μ) = exp(-μ) * μ^y / y!

The Poisson process described above gives us the probability of y events occurring during a given interval. The parameter μ is the expected count, or average number of occurrences, in the specified or implied interval. Often, but not always, this interval is measured in units of time (e.g., ER visits per year or total leaks in 1 mile of pipeline). So even if we think of our outcome as counts, we are actually modeling a rate per some implicit interval of observation, whether we think of it explicitly or not. This is what gives us the incidence rate ratio (IRR) interpretation of exponentiated coefficients. If we think of rates as:

Rate = count/t

Where t = the time, interval of observation, or exposure. If all participants are assumed to have a common t, this is essentially like dividing by 1, and our rate is for all practical purposes interpreted as a count.
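As a quick sanity check on the pmf above, the hand-coded formula matches R's built-in dpois() (the value of μ and the range of y below are arbitrary):

# Sanity check: P(Y = y | mu) = exp(-mu) * mu^y / y!  versus dpois()
mu <- 2.5                    # expected count per interval of observation
y  <- 0:5
by_hand  <- exp(-mu) * mu^y / factorial(y)
built_in <- dpois(y, lambda = mu)
cbind(y, by_hand, built_in)  # the two columns agree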

Equivalently, as noted in the STAT 504 course notes (Penn State), when we model rates the mean count is proportional to t, and the interpretation of the parameter estimates α and β stays the same as in the model for counts; you just need to multiply the expected counts by t.

If t = 1, or everyone is observed for the same period of time, then we are back to just thinking about counts with an implied rate.

As noted in Cameron and Trivedi (1996), count models and duration (survival) models can be viewed as duals. If observed units have different levels of exposure, duration, or intervals of observation, this is accounted for in count models through the inclusion of an offset. Including an offset creates a model that explicitly considers the interval tx, where tx represents exposure time for individuals with covariate value x:

Log(μ/tx) = xβ. Here we are explicitly specifying a rate based on time 'tx'.

Re-arranging terms we get:

Log(μ) – Log(tx) = xβ 
Log(μ) = xβ + Log(tx)
The term Log(tx) is referred to as an ‘offset.’
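In R, a model like this is typically specified by adding the log of exposure as an offset; here is a minimal sketch with simulated data (names and parameter values are illustrative):

# Minimal sketch: Poisson regression with an offset for exposure time
# (simulated data; names and parameter values are illustrative)
set.seed(123)
n  <- 500
x  <- rbinom(n, 1, 0.5)              # e.g., an intervention indicator
t  <- runif(n, 0.5, 3)               # exposure time varies across observations
mu <- t * exp(0.2 + 0.6 * x)         # mean count proportional to exposure
y  <- rpois(n, mu)

# Log(mu) = xB + Log(t): log(t) enters with its coefficient fixed at 1
fit <- glm(y ~ x + offset(log(t)), family = poisson)
exp(coef(fit))                       # exp(B): rate ratio per unit of exposure

An equivalent specification uses the offset argument directly: glm(y ~ x, offset = log(t), family = poisson).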

When should we include an offset?

Karen Grace-Martin at The Analysis Factor gives a great explanation of modeling offsets or exposure in count models. Here is an excerpt:

"What this means theoretically is that by defining an offset variable, you are only adjusting for the amount of opportunity an event has.....A patient in for 20 days is twice as likely to have an incident as a patient in for 10 days....There is an assumption that the likelihood of events is not changing over time."

In another post Karen states:
"It is often necessary to include an exposure or offset parameter in the model to account for the amount of risk each individual had to the event."

So if there are differences in exposure or observation times across observations that are relevant to the outcome of interest, then it makes sense to account for this by including an offset as specified above. By explicitly specifying tx we can account for differences in exposure time or observation periods unique to each observation. Often the relevant interval of exposure may be something other than time. Karen gives one example where it might not make sense to include an offset or account for time, such as the number of words a toddler can say. Another example might be the number of correct words spelled in a spelling bee. In fact, in this case time may be endogenous: more correct words spelled by a participant implies a longer interval of observation, duration, or 'exposure.' We would not decide who is the better speller on the basis of time or a rate such as total correct words per minute. As all count models implicitly model rates, the implicit and most relevant interval here is the contest itself. In practical terms this simply reverts to a comparison of raw counts.
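To make the consequences concrete, here is a small simulation sketch (all settings are made up for illustration): when one group is observed roughly twice as long and exposure is ignored, the estimated group effect absorbs the difference in observation time; including the offset recovers something close to the true rate ratio.

# Illustration: ignoring differing exposure vs. including an offset
# (simulated data; all settings are illustrative)
set.seed(42)
n   <- 2000
grp <- rbinom(n, 1, 0.5)
t   <- ifelse(grp == 1, 2, 1) * runif(n, 0.8, 1.2)  # group 1 observed ~twice as long
mu  <- t * exp(0.1 + 0.3 * grp)                     # true rate ratio = exp(0.3) ~ 1.35
y   <- rpois(n, mu)

no_offset   <- glm(y ~ grp, family = poisson)
with_offset <- glm(y ~ grp + offset(log(t)), family = poisson)

exp(coef(no_offset))["grp"]    # inflated: mixes the rate difference with exposure time
exp(coef(with_offset))["grp"]  # close to the true rate ratio of ~1.35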

Summary: Counts always occur within some interval of time or space and therefore always have an implicit 'rate' interpretation. If counts are observed over different intervals of time or space for different observations, then the differing exposure should be modeled through the specification of an offset. Whether to include an offset really depends on answering two questions: (1) What is the relevant interval in time or space upon which our counts are based? (2) Does this interval differ across our observations of counts?

References:
Cameron, A. C. and P. K. Trivedi (1999). Essentials of Count Data Regression.

Cameron, A. C. and P. K. Trivedi (1996). Count Data Models for Financial Data.

Williams, R. (2016). Models for Count Outcomes. University of Notre Dame, http://www3.nd.edu/~rwilliam/. Last revised February 16, 2016.

Winkelmann, R. Econometric Analysis of Count Data. 2nd Edition.

Notes: This post ignores issues related to overdispersion and excess zeros, which motivate other possible model specifications including negative binomial, zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) models.

Monday, March 6, 2017

Interpreting Confidence Intervals

From: Handbook of Biological Statistics http://www.biostathandbook.com/confidence.html
"There is a myth that when two means have confidence intervals that overlap, the means are not significantly different (at the P<0.05 level)… (Schenker and Gentleman 2001, Payton et al. 2003); it is easy for two sets of numbers to have overlapping confidence intervals, yet still be significantly different by a two-sample t–test; conversely… Don't try compare two means by visually comparing their confidence intervals, just use the correct statistical test."
A really cogent note related to this from the Cornell Statistical Consulting unit can be found here.
"Generally, when comparing two parameter estimates, it is always true that if the confidence intervals do not overlap, then the statistics will be statistically significantly different. However, the converse is not true. That is, it is erroneous to determine the statistical significance of the difference between two statistics based on overlapping confidence intervals."

More details using basic math here.

A 2005 article in Psychological Methods indicates that a large number of researchers don't interpret confidence intervals correctly. In an interesting American Psychologist article (2005), researchers determined that, under a number of broadly applicable conditions, 95% confidence intervals can overlap by as much as 25% for groups that are actually significantly different at the 5% level, and that zero overlap corresponds to significance at roughly the 1% level (p ≈ .01).


Dave Giles also brings up the Insect Science paper by Payton et al. in a related discussion of using confidence intervals to determine statistical significance:
http://davegiles.blogspot.com/2017/01/hypothesis-testing-using-non.html?_sm_au_=iVVFv75rHPD1TSHs
"Here's a well-known result that bears on this use of the confidence intervals. Recall that we're effectively testing H0: μ1 = μ2, against HA: μ1 ≠ μ2. If we construct the two 95% confidence intervals, and they fail to overlap, then this does not imply rejection of H0 at the 5% significance level. In fact the correct significance is roughly one-tenth of that.  Yes, 0.5%!
If you want to learn why, there are plenty of references to help you. For instance, check out McGill et al. (1978), Andrews et al. (1980), Schenker and Gentleman (2001), Masson and Loftus (2003), and Payton et al. (2003) - to name a few. The last of these papers also demonstrates that a rough rule-of-thumb would be to use 84% confidence intervals if you want to achieve an effective 5% significance level when you "try" to test H0 by looking at the overlap/non-overlap of the intervals."
Actually, according to the last paper mentioned (Payton et al., 2003), the 84% CI should be adjusted depending on the ratio of the standard errors of the two estimates being compared.
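A quick back-of-the-envelope check in R, under the simplifying assumption of two independent estimates with equal standard errors (Payton et al. adjust the level when the standard errors differ): if the two 95% intervals just touch, the implied two-sided p-value is roughly 0.005, while 84% intervals that just touch correspond to roughly p = 0.05.

# Two independent estimates with EQUAL standard errors (the common SE cancels)
# If two 95% CIs just touch, the difference in means is 2 * 1.96 * SE,
# and the SE of the difference is sqrt(2) * SE, so the z statistic is:
z95 <- 2 * qnorm(0.975) / sqrt(2)
2 * pnorm(-z95)   # ~0.005, far smaller than 0.05

# The same calculation for 84% intervals gives a z close to 1.96:
z84 <- 2 * qnorm(0.92) / sqrt(2)
2 * pnorm(-z84)   # ~0.05, the effective level behind the 84% rule of thumb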
But all of the work above is predicated on a comparison of two populations. Considerations of multiple comparisons complicate things further (see Rick Wicklin's post on doing this in SAS). Perhaps, if a visual presentation is what we want, we plot the CIs (as much as we may not like dynamite plots) but denote which groups are significantly different based on the properly specified tests (per the note from the handbook above). Something like below:
http://freakonomics.com/2008/07/30/how-big-is-your-halo-a-guest-post/
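One way to sketch that kind of figure in R (made-up numbers; in practice the grouping letters would come from whatever properly specified tests you actually ran):

# Rough sketch: plot group means with 95% CIs, but denote significance
# separately, based on the appropriate tests (made-up data for illustration)
means <- c(A = 10, B = 12, C = 15)
ses   <- c(A = 0.8, B = 0.9, C = 0.7)
lower <- means - qnorm(0.975) * ses
upper <- means + qnorm(0.975) * ses

bp <- barplot(means, ylim = c(0, max(upper) * 1.2),
              ylab = "Mean outcome (95% CI)")
arrows(bp, lower, bp, upper, angle = 90, code = 3, length = 0.05)
text(bp, upper + 0.8, labels = c("a", "a", "b"))  # letters from the actual tests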

References:
Belia, S., F. Fidler, J. Williams, and G. Cumming (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389-396.

Cumming, G. and S. Finch (2005). Inference by eye: confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.

Payton, M. E., M. H. Greenstone, and N. Schenker (2003). Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? Journal of Insect Science, 3, 1-6.


Tuesday, February 21, 2017

Basic Data Manipulation and Statistics in R and Python

Below are links to a couple of gists with R and Python code for some very basic data manipulation and statistics. I have been using R and SAS for almost a decade, but the R code originates from some very basic scripts that I used when I was a beginning programmer. The Python script is just a translation from R to Python. This does not represent the best way to solve these problems, but it provides enough code for a beginner to get a feel for coding in one of these environments. This is 'starter' code in the crudest sense, intended to let one begin learning R or Python with as little intimidation and as simple a syntax as possible. However, once started, one can google other sources or enroll in courses to expand their programming skill set.

Basic Data Manipulation in R

Basic Data Manipulation in Python

Basic Statistics in R

Basic Statistics in Python 

For more advanced applications in R posted to this blog see all posts with the tag R Code.

Thursday, February 16, 2017

Machine Learning in Finance and Economics with Python

I recently caught a podcast via Chat with Traders, one of several episodes related to quantitative finance, and this one emphasized some basics of machine learning. It is a very good discussion of some fundamental concepts in machine learning regardless of your interest in finance or algorithmic trading.

You can find this episode via iTunes. But here is a link with some summary information.

Q5: Good (and Not So Good) Uses of Machine Learning in Finance w/ Max Margenot & Delaney Mackenzie

https://chatwithtraders.com/quantopian-podcast-episode-5-max-margenot/

Some of the topics covered include (swiping from the link above):

What is machine learning and how is it used in everyday life?

Supervised vs unsupervised machine learning, and when to use each class.    

Does machine learning offer anything more than traditional statistical methods?

Good (and not so good) uses of machine learning in trading and finance.

The balance between simplicity and complexity.

I believe the guests on the show were Quantopian data scientists; Quantopian is a platform for algorithmic trading and machine learning applied to finance. They do this stuff for real.

There was also some discussion of Python. Following up on that, there was a tweet from @chatwithtraders linking to a nice blog, python for finance, that covers some applications using Python. Very good stuff all around. I wish I still taught financial data modeling!


See also: Modeling Dependence with Copulas and Quantmod in R

Sunday, February 12, 2017

Molecular Genetics and Economics

A really interesting article in JEP:

A slice:

"In fact, the costs of comprehensively genotyping human subjects have fallen to the point where major funding bodies, even in the social sciences, are beginning to incorporate genetic and biological markers into major social surveys. The National Longitudinal Study of Adolescent Health, the Wisconsin Longitudinal Study, and the Health and Retirement Survey have launched, or are in the process of launching, datasets with comprehensively genotyped subjects…These samples contain, or will soon contain, data on hundreds of thousands of genetic markers for each individual in the sample as well as, in most cases, basic economic variables. How, if at all, should economists use and combine molecular genetic and economic data? What challenges arise when analyzing genetically informative data?"


Link:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306008/


Reference:
Beauchamp, J. P., D. Cesarini, M. Johannesson, et al. (2011). Molecular Genetics and Economics. Journal of Economic Perspectives, 25(4), 57-82.

Saturday, February 11, 2017

Program Evaluation and Causal Inference with High Dimensional Data

Brand new from Econometrica-

Abstract: "In this paper, we provide efficient estimators and honest confidence bands for a variety of treatment effects including local average (LATE) and local quantile treatment effects (LQTE) in data-rich environments… We provide results on honest inference for (function-valued) parameters within this general framework where any high-quality, machine learning methods (e.g., boosted trees, deep neural networks, random forest, and their aggregated and hybrid versions) can be used to learn the nonparametric/high-dimensional components of the model." Read more...

Tuesday, January 24, 2017