Wednesday, March 22, 2017

Count Models with Offsets: Practical Applications Using R

Let's consider three count modeling scenarios and determine the appropriate modeling strategy for each.

Example 1: Suppose we are observing kids playing basketball during open gym. We have two groups of equal size, A and B. Both groups play for 60 minutes; kids in group A score about 2.5 goals each on average, while kids in group B average about 5.
In this case both groups play for the same amount of time, so there seems to be no need to include time as an offset. For whatever reason, students in group B are simply better at scoring and therefore score more goals.

If we simulate count data to mimic this scenario (see toy data below) we might get descriptive statistics that look like this:

Table 1:

It is clear that over the period of observation (60 minutes) group B outscored group A. Would we in practice discuss this in terms of rates? Total goals per 60 minute session? Or goals per minute? In this case group A scores .0433 goals per minute vs. .09 for B. Again, we conclude based on rates that B is better at scoring goals. But most likely, whether the rate is implicit or explicit, we would discuss these outcomes in a more practical sense: total goals for A vs. B.
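The averages and per-minute rates quoted above can be recomputed from the COUNT column of the toy data listed at the end of this post. This is only a minimal Python sketch: the variable names are illustrative, and it assumes every kid is observed for the full 60 minutes.

```python
from statistics import mean

# COUNT column of the toy data, split by GROUP (15 kids per group)
counts_a = [3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4]
counts_b = [5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6]
minutes = 60  # assumption for Example 1: everyone plays the whole session

mean_a, mean_b = mean(counts_a), mean(counts_b)
rate_a, rate_b = mean_a / minutes, mean_b / minutes

print(f"A: {mean_a:.2f} goals per kid, {rate_a:.4f} goals per minute")
print(f"B: {mean_b:.2f} goals per kid, {rate_b:.4f} goals per minute")
```

Because the exposure is identical for everyone, ranking the groups by total goals or by goals per minute gives the same answer.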

We could model this difference with a Poisson regression:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))

Table 2:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.9555     0.1601   5.967 2.41e-09 ***
GROUPB        0.7309     0.1949   3.750 0.000177 ***

We can see from this that group B completes significantly more goals than A, at a ‘rate’ exp(.7309) = 2.077 times that of A (roughly twice as many goals). This is basically what we get from a direct comparison of the average counts in the descriptives above.
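With a single binary regressor the Poisson estimates have a closed form, so the glm output in Table 2 can be cross-checked by hand. The sketch below is not the author's code. It assumes the standard results that the fitted group means equal the sample means, and that the coefficient variances are 1/(total count in A) and 1/(total in A) + 1/(total in B). The function name is hypothetical.

```python
from math import exp, log, sqrt

counts_a = [3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4]
counts_b = [5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6]


def two_group_poisson(y_ref, y_other):
    """Closed-form MLE for COUNT ~ GROUP (log link, no offset)."""
    total_ref, total_other = sum(y_ref), sum(y_other)
    mean_ref = total_ref / len(y_ref)
    mean_other = total_other / len(y_other)
    intercept = log(mean_ref)           # log of the reference-group mean
    slope = log(mean_other / mean_ref)  # log ratio of the two group means
    se_intercept = sqrt(1 / total_ref)
    se_slope = sqrt(1 / total_ref + 1 / total_other)
    return intercept, slope, se_intercept, se_slope


b0, b1, se0, se1 = two_group_poisson(counts_a, counts_b)
print(f"(Intercept) {b0:.4f}  SE {se0:.4f}  z {b0 / se0:.3f}")
print(f"GROUPB      {b1:.4f}  SE {se1:.4f}  z {b1 / se1:.3f}")
print(f"ratio of expected counts, B vs A: {exp(b1):.3f}")
```

This shortcut only works because the model has one indicator variable; with more covariates the estimates have to be found iteratively, which is what glm does.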
But what if we wanted to be explicit about the interval of observation and include an offset? The way we incorporate rates into a Poisson model for counts is through the offset.

log(μ/tx) = xβ, where we explicitly specify a rate based on exposure time ‘tx’.

Re-arranging terms we get:
log(μ) - log(tx) = xβ
log(μ) = xβ + log(tx)
The term log(tx) becomes our ‘offset.’

So we would do this by including log(time) as an offset in our R code: 

summary(glm(COUNT ~ GROUP + offset(log(TIME2)),data = counts, family = poisson))

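It turns out the estimate of .7309 for B vs A is the same. Because the offset log(60) is identical for every observation, it is simply absorbed into the intercept, which shifts from .9555 to .9555 - log(60) = -3.1388 (see Table 3). Whether we directly compare the raw counts, or run count models with or without offsets, we get the same result.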

Table 3:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -3.1388     0.1601  -19.60  < 2e-16 ***
GROUPB        0.7309     0.1949    3.75 0.000177 ***
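The reason the GROUPB estimate does not move can be checked with a few lines of arithmetic. This sketch assumes the group means of 2.6 and 5.4 goals per kid (computed from the toy data) and uses illustrative variable names.

```python
from math import log

mean_a, mean_b = 2.6, 5.4  # average goals per kid, from the toy data
minutes = 60               # the same exposure for every kid

# No offset: log(mu) = b0 + b1 * GROUPB
b0_counts = log(mean_a)
b1_counts = log(mean_b) - log(mean_a)

# Offset log(60): log(mu / 60) = b0 + b1 * GROUPB
b0_rate = log(mean_a / minutes)
b1_rate = log(mean_b / minutes) - log(mean_a / minutes)

print(f"intercepts: {b0_counts:.4f} vs {b0_rate:.4f}")
print(f"GROUPB:     {b1_counts:.4f} vs {b1_rate:.4f}")
print(f"intercept shift: {b0_counts - b0_rate:.4f}, log(60) = {log(minutes):.4f}")
```

The log(60) term cancels in the difference between groups, so only the intercept absorbs it.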

Example 2: Suppose again we are observing kids playing basketball during open gym. Let’s still refer to them as groups A and B. Suppose after 30 minutes group A is forced to leave the court (maybe their section of the court is reserved for an art show). Before leaving they score an average of about 2.5 goals. Group B is allowed to play for 60 minutes, scoring an average of about 5 goals. This is an example where the two groups had different observation times or exposure times (i.e. playing time). It’s plausible that if group A had continued to play, they would have had more risk or opportunity to score more goals. It seems the only fair way to compare goal scoring for A vs B is to consider court time, or exposure, and compare the rate of goal completion. If we use the same toy data as before (but assuming this different scenario) we would get the following descriptives:

Table 4:

You can see that the difference in the rate of goals scored is very small. Both teams are put on an ‘even’ playing field when we consider rates of goal completion. 

If we fail to consider exposure or the period of observation we run the following model:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))

The results will be the same as in Table 2 above. But what if we want to consider the differences in exposure or observation time? In this case we would include an offset in our model specification:

summary(glm(COUNT ~ GROUP + offset(log(TIME3)),data = counts, family = poisson))

Table 5:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.44569    0.16013 -15.273   <2e-16 ***
GROUPB       0.03774    0.19490   0.194    0.846 

We can see from the results that when considering exposure (modeling with an offset) there is no significant difference between groups, although this could be an issue of low power and small sample size. Directionally, exp(.0377) = 1.038 indicates that group B completes goals at 1.038 times the rate of A, or (1.038-1)*100 = 3.8% more goals per minute of exposure. We can get all of this from the descriptives by comparing the average ‘rates’ of goal completion for B vs A. The key point is that if we fail to consider rates or exposure in this case, we get the wrong answer!
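The Table 5 estimates can be reproduced from group totals alone. This is a hedged sketch rather than the original analysis: it assumes 39 and 81 total goals (the sums of the COUNT column by group), exposures of 30 and 60 minutes per kid, and a Wald-style 95% interval for the rate ratio.

```python
from math import exp, log, sqrt

goals_a, minutes_a = 39, 15 * 30  # group A: 15 kids, 30 minutes each
goals_b, minutes_b = 81, 15 * 60  # group B: 15 kids, 60 minutes each

rate_a = goals_a / minutes_a      # goals per minute of court time
rate_b = goals_b / minutes_b

intercept = log(rate_a)
group_b = log(rate_b / rate_a)    # log rate ratio, B vs A
se_group_b = sqrt(1 / goals_a + 1 / goals_b)
z = group_b / se_group_b

# Wald interval on the log scale, then exponentiated
low = exp(group_b - 1.96 * se_group_b)
high = exp(group_b + 1.96 * se_group_b)

print(f"(Intercept) {intercept:.5f}")
print(f"GROUPB      {group_b:.5f}  SE {se_group_b:.5f}  z {z:.3f}")
print(f"rate ratio {exp(group_b):.3f}, 95% interval {low:.2f} to {high:.2f}")
```

The wide interval around the rate ratio is one way to see the low-power caveat: with 30 kids the data are compatible with a range of rate differences in either direction.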

Example 3: Suppose we are again observing kids playing basketball during open gym with groups A and B. Except this time group A tires out after playing about 20 minutes or so and leaves the court after scoring 2.6 goals each on average. Group B perseveres another 30 minutes or so and scores a total of about 5 goals on average per student. In this instance there seem to be important differences between groups A and B in terms of drive and ambition that should not be equalized away by accounting for time played through an offset. Event success seems to drive time as much as time drives the event. If we want to think of a ‘rate’ here, the rate is total goals scored per open gym session, not per minute of activity. The relevant interval is a single open gym session.
In this case time actually seems endogenous: it is confounded with the outcome, or with other factors like effort and motivation that drive the outcome.

If we alter our simulated data from before to mimic this scenario we would generate the following descriptive statistics:

Table 6:

As discussed previously, this should be modeled without an offset, implying equal exposure/observation time with regard to the event or exposure being an entire open gym session.  We can think of this as a model of counts, or an implied model of rates in terms of total goals per open gym session.  In that case we get the same results as table 2 indicating that group B scores more goals than A.  It makes no sense in this case to include time as an offset or compare rates of goal completion between groups. But, if we did model this with an offset (making this a model with an explicit specification of exposure being court time) then we would get the following:

Table 7:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -2.1203     0.1601 -13.241   <2e-16 ***
GROUPB       -0.1054     0.1949  -0.541    0.589   

In this case, modeling playing time explicitly as exposure gives a result indicating that group B completes goals at a lower rate than group A (the estimate is negative, though not statistically significant). This approach completely ignores the fact that group B persevered to play longer and ultimately completed more goals. Including an offset in this case most likely leads to the wrong conclusion.
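For readers who want to see where the Table 7 estimates come from, here is a small Newton-Raphson fit of a Poisson model with a per-kid offset, in pure Python. It is a sketch of the kind of iteration glm performs, not R's actual implementation. It assumes the fourth column of the toy data holds the Example 3 playing times, and the function name is hypothetical.

```python
from math import exp, log

counts = [3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4,
          5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6]
group_b = [0] * 15 + [1] * 15
# Assumed Example 3 playing times in minutes (fourth column of the toy data)
minutes = [20, 25, 20, 20, 20, 30, 20, 20, 20, 25, 20, 20, 20, 25, 20,
           50, 45, 55, 50, 50, 45, 55, 50, 50, 45, 55, 50, 50, 45, 55]


def fit_poisson_offset(y, x, exposure, iterations=25):
    """Newton-Raphson for log(mu) = b0 + b1*x + log(exposure)."""
    b0, b1 = log(sum(y) / sum(exposure)), 0.0  # start at the pooled rate
    for _ in range(iterations):
        mu = [t * exp(b0 + b1 * xi) for xi, t in zip(x, exposure)]
        # Score vector X'(y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix X' diag(mu) X, a 2x2 matrix
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < 1e-12:
            break
    return b0, b1


b0, b1 = fit_poisson_offset(counts, group_b, minutes)
print(f"(Intercept) {b0:.4f}")
print(f"GROUPB      {b1:.4f}  (rate ratio {exp(b1):.3f})")
```

The fit is mechanically fine; the problem described above is conceptual. Dividing by court time treats group B's extra minutes as mere exposure, when staying on the court longer is part of what distinguishes the groups.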

Summary:  When modeling outcomes that are counts a rate is always implied by the nature of the probability mass function for a Poisson process. However, in practical applications we may not always think of our outcome as an explicit rate based on an explicit interval or exposure time. In some cases this distinction can be critical. When we want to explicitly consider differences in exposure this is done through specification of an offset in our count model. Three examples were given using toy data where (1) modeling rates or including an offset made no difference in outcome (2) including an offset was required to obtain the correct conclusion and (3) including an offset may lead to the wrong conclusion. 

Conclusion: Counts always occur within some interval of time or space and therefore can always have an implicit ‘rate’ interpretation. If counts are observed across different intervals in time or space for different observations then differences in outcomes should be modeled through the specification of an offset. Whether to include an offset really depends on answering the questions:  (1) What is the relevant interval in time or space upon which our counts are based? (2) Is this interval different across our observations of counts?


Essentials of Count Data Regression. A. Colin Cameron and Pravin K. Trivedi. (1999)

Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)

Models for Count Outcomes. Richard Williams, University of Notre Dame. Last revised February 16, 2016

Econometric Analysis of Count Data. By Rainer Winkelmann. 2nd Edition.

Notes: This ignores any discussion related to overdispersion or inflated zeros, which relate to other possible model specifications including negative binomial, zero-inflated Poisson (ZIP), or zero-inflated negative binomial (ZINB) models.

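Simulated Toy Count Data (columns, in order: COUNT, GROUP, observation ID, playing time in minutes for Example 3, TIME2, TIME3):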

3    A    1    20   60   30
4    A    2    25   60   30
2    A    3    20   60   30
2    A    4    20   60   30
1    A    5    20   60   30
6    A    6    30   60   30
0    A    7    20   60   30
0    A    8    20   60   30
1    A    9    20   60   30
5    A    10   25   60   30
3    A    11   20   60   30
2    A    12   20   60   30
3    A    13   20   60   30
3    A    14   25   60   30
4    A    15   20   60   30
5    B    16   50   60   60
4    B    17   45   60   60
7    B    18   55   60   60
8    B    19   50   60   60
3    B    20   50   60   60
7    B    21   45   60   60
5    B    22   55   60   60
4    B    23   50   60   60
7    B    24   50   60   60
8    B    25   45   60   60
5    B    26   55   60   60
3    B    27   50   60   60
5    B    28   50   60   60
4    B    29   45   60   60
6    B    30   55   60   60

Saturday, March 11, 2017

Basic Econometrics of Counts

As Cameron and Trivedi state, Poisson regression is "the starting point for count data analysis, though it is often inadequate" (Cameron and Trivedi, 1999). The "main focus is the effect of covariates on the frequency of an event, measured by non-negative integer values or counts" (Cameron and Trivedi, 1996).

Examples of counts they reference are related to medical care utilization such as office visits or days in the hospital.

"In all cases the data are concentrated on a few small discrete values, say 0, 1 and 2; skewed to the left; and intrinsically heteroskedastic with variance increasing with the mean. These features motivate the application of special methods and models for count regression." (Cameron and Trivedi, 1999).

From Cameron and Trivedi (1999) "The Poisson regression model is derived from the Poisson distribution by parameterizing the relation between the mean parameter μ and covariates (regressors) x. The standard assumption is to use the exponential mean parameterization"

μ = exp(xβ)

Slightly abusing partial derivative notation we derive the marginal effect of x on y as follows:

dE[y|x]/dx = β*exp(xβ)

What this implies is that if  β = .10 and  exp(xβ) = 2 then a 1 unit change in x  will change the expectation of y by .20 units.
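The derivative gives an instantaneous rate of change, so for a full one-unit step in x it is an approximation. The short sketch below compares the two for the numbers in the text; the variable names are illustrative only.

```python
from math import exp

beta = 0.10
expected_count = 2.0  # exp(x*beta) at the point where we evaluate the effect

# Marginal effect from the derivative: beta * exp(x*beta)
marginal_effect = beta * expected_count

# Exact change in E[y|x] when x increases by one full unit
one_unit_change = expected_count * (exp(beta) - 1)

print(f"derivative-based effect: {marginal_effect:.4f}")
print(f"exact one-unit change:   {one_unit_change:.4f}")
```

For a coefficient this small the two are close, which is why the .20 figure is a reasonable reading of the effect.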

Another way of interpretation is to get an approximate value for average response by:


(see also Wooldridge, 2010)

Another way this is interpreted is through exponentiation. From Penn State's STATS 504 course:

"with every unit increase in X, the predictor variable has multiplicative effect of exp(β) on the mean (i.e. expected count) of Y."

This implies (as noted in the course notes):

    If β = 0, then exp(β) = 1  and Y and X are not related.
    If β > 0, then exp(β) > 1, and the expected count μ = E(y) is exp(β) times larger than when X = 0
    If β < 0, then exp(β) < 1, and the expected count μ = E(y) is exp(β) times smaller than when X = 0

For example, when comparing group A (X = 1) vs B (X = 0), if exp(β) = .5, group A has an expected count .5 times that of B (half as large). On the other hand, if exp(β) = 1.5, group A has an expected count 1.5 times that of B.

This could also be interpreted in percentage terms (similar to odds ratios in logistic regression).

For example, comparing group A (X = 1) vs B (X = 0), if exp(β) = .5 then group A has a (.5-1)*100% = -50% difference, i.e. a 50% lower expected count than group B. On the other hand, if exp(β) = 1.5, group A has a (1.5-1)*100% = 50% larger expected count than B.

A simple rule of thumb or shortcut is that for small values we can interpret β as a percent change in the expected count of y for a given change in x, as in 100*β (Wooldridge, 2nd ed., 2010).

β       exp(β)          (exp(β) - 1)*100%
0.01    1.0100501671    1.0050167084
0.02    1.02020134      2.0201340027
0.03    1.030454534     3.0454533954
0.04    1.0408107742    4.0810774192
0.05    1.0512710964    5.1271096376
0.06    1.0618365465    6.1836546545
0.07    1.0725081813    7.2508181254
0.08    1.0832870677    8.3287067675
0.09    1.0941742837    9.4174283705
0.10    1.1051709181    10.5170918076
0.11    1.1162780705    11.6278070459
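The table can be regenerated in a few lines, which also makes it easy to see how far the 100*β shortcut drifts from the exact percent change as β grows. This is a minimal sketch; the list name `rows` and the extra "gap" column are illustrative additions.

```python
from math import exp

betas = [i / 100 for i in range(1, 12)]  # 0.01, 0.02, ..., 0.11

# Each row: beta, exp(beta), exact percent change in the expected count
rows = [(b, exp(b), (exp(b) - 1) * 100) for b in betas]

print(f"{'beta':>5} {'exp(beta)':>13} {'exact %':>14} {'gap vs 100*beta':>16}")
for b, ratio, pct in rows:
    print(f"{b:5.2f} {ratio:13.10f} {pct:14.10f} {pct - 100 * b:16.4f}")
```

The gap column grows with β, which is why the shortcut is only recommended for small coefficients.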

In Stata, the margins command can be used to get predicted (average) counts at each specified level of a covariate. This is similar, as I understand it, to getting marginal effects at the mean for logistic regression. See the UCLA Stata examples. Similarly, this can be done in SAS using the ilink option with lsmeans.

An Applied Example: Suppose we have some count outcome y that we want to model as a function of some treatment 'TRT.' Maybe we are modeling differences in hospital admission rates between treated and control groups for some intervention, or maybe this is the number of weeds in an acre plot for treated vs control groups in an agricultural experiment. Using Python I simulated a toy count data set for two groups, treated (TRT = 1) and untreated (TRT = 0). Descriptive statistics indicate a treatment effect.

Mean (treated): 2.6
Mean (untreated): 5.4

However, if I want to look at the significance of this I can model the treatment effect by specifying a Poisson regression model. Results are below:

E[y|TRT]  = exp(xβ) where x = TRT our binary indicator for treatment

Despite the poor quality of this image we can see that our estimate for β = -.7309. This is rather large, so the direct percentage approximation above likely won't hold. However, we can interpret the significance and direction of the effect to imply that the treatment significantly reduces the expected count of y, our outcome of interest. The table presented previously indicates that as β becomes large in magnitude the direct percentage shortcut drifts away from the exact effect; for a negative β of this size, 100*β overstates the reduction. This implies that the treatment reduces the expected count by something less than 73%. If we take the path of exponentiation we get:

exp(-.7309) = .48

This implies the treatment group has an expected count .48 times that of the control. In percentage terms the treatment group has an expected count (.48-1)*100 = -52%, or 52% lower than the control group.

Interestingly, with a single variable Poisson regression model we can derive these results from the descriptive data.

If we take the ratio of average counts for treated vs untreated groups we get 2.6/5.4 = .48, which is basically the same as our exponentiated result exp(β). And if we calculate the relative difference in raw means, (2.6 - 5.4)/5.4 = -.52, we see that the treatment group has an average count that is about 52% lower than the control group.
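The link between the descriptives and the regression coefficient can be verified directly. This sketch takes only the two reported group means as input and compares the 100*β shortcut with the exact percent change; the variable names are illustrative.

```python
from math import exp, log

mean_treated, mean_untreated = 2.6, 5.4  # group means reported above

beta = log(mean_treated / mean_untreated)  # coefficient implied by the means
shortcut_pct = 100 * beta                  # rule-of-thumb percent change
exact_pct = (exp(beta) - 1) * 100          # percent change via exponentiation

print(f"beta = {beta:.4f}")
print(f"exp(beta) = {exp(beta):.2f}")
print(f"shortcut: {shortcut_pct:.0f}%, exact: {exact_pct:.0f}%")
```

With a coefficient this large the shortcut is off by roughly 21 percentage points, so the exponentiated form is the one to report.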

Extensions of the Model 

As stated at the beginning of this post, the Poisson model is just the benchmark or starting point for count models. One assumption is that the mean and variance are equal. This is known as 'equidispersion.' If the variance exceeds the mean, that is referred to as overdispersion, and negative binomial models are often specified (but interpretation of the coefficients is unchanged). Overdispersion is more often the case (Cameron and Trivedi, 1996). Other special cases consider the proportion of zeros, sometimes accounted for by zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models. As noted in Cameron and Trivedi (1996), count models and duration models can be viewed as duals. If observed units have different levels of exposure or duration, this is accounted for in count models through inclusion of an offset. More advanced treatment and references should be considered in these cases.
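A quick, informal way to look at equidispersion is to compare the sample variance with the sample mean within each group. The sketch below is not a formal dispersion test. It borrows the two-group toy counts from the companion offsets post as a stand-in, on the assumption that they resemble the simulated data used here (the group means match).

```python
from statistics import mean, variance

# Stand-in counts (assumption: toy data from the companion offsets post)
groups = {
    "A": [3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4],
    "B": [5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6],
}

ratios = {}
for name, y in groups.items():
    # Ratio near 1 is consistent with equidispersion; well above 1
    # suggests overdispersion, well below 1 underdispersion.
    ratios[name] = variance(y) / mean(y)
    print(f"group {name}: mean {mean(y):.2f}, variance {variance(y):.2f}, "
          f"ratio {ratios[name]:.2f}")
```

With only 15 observations per group these ratios are noisy, so they are best read as a prompt to consider a negative binomial or similar specification rather than as evidence on their own.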

Applied Examples from Literature

Some examples where counts are modeled in the applied economics literature include the following:

The Demand for Varied Diet with Econometric Models for Count Data. Jonq-Ying Lee. American Journal of Agricultural Economics, vol 69 no 3 (Aug, 1987)

Standing on the shoulders of giants: Coherence and biotechnology innovation performance. Sanchez and Ng. Selected Poster, 2015 Agricultural and Applied Economics Association and Western Agricultural Economics Association Joint Annual Meeting. San Francisco, CA, July 26-28

Adoption of Best Management Practices to Control Weed Resistance by Corn, Cotton, and Soybean Growers. Frisvold, Hurley, and Mitchell. AgBioForum 12(3&4) 370-381. 2009.

In all cases, as is common in the limited amount of literature I have seen in applied economics, the results of the count regressions are interpreted in terms of direction and significance, but not much consideration is given to an interpretation of results based on exponentiation of coefficients. 


Essentials of Count Data Regression. A. Colin Cameron and Pravin K. Trivedi. (1999)
Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)
Econometric Analysis of Cross Section and Panel Data. Wooldridge. 2nd Ed. 2010.

See also:
Count Models with Offsets
Do we Really Need Zero Inflated Models
Quantile Regression with Count Data

For python code related to the applied example see the following gist.