Monday, April 10, 2017

More on Data Science from Actual Data Scientists

Previously I wrote a post titled What do you really need to know to be a data scientist? Data Science Lovers and Haters. In that post I made the general argument that this is a broad space, and there is a lot of contention about the level of technical skill and the tools one must master to be considered a 'real' data scientist rather than getting labeled a 'fake' data scientist or 'poser' or whatever. But to me it's all about leveraging data to solve problems, and most of that work is about cleaning and prepping data. It's process. In an older KDNuggets article, economist/data scientist Scott Nicholson makes a similar point:

GP: What advice you have for aspiring data scientists?

SN: Focus less on algorithms and fancy technology & more on identifying questions, and extracting/cleaning/verifying data. People often ask me how to get started, and I usually recommend that they start with a question and follow through with the end-to-end process before they think about implementing state-of-the-art technology or algorithms. Grab some data, clean it, visualize it, and run a regression or some k-means before you do anything else. That basic set of skills surprisingly is something that a lot of people are just not good at but it is crucial.

GP: Your opinion on the hype around Big Data - how much is real?

SN: Overhyped. Big data is more of a sudden realization of all of the things that we can do with the data than it is about the data themselves. Of course also it is true that there is just more data accessible for analysis and that then starts a powerful and virtuous spiral. For most companies more data is a curse as they can barely figure out what to do with what they had in 2005.
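Nicholson's basic loop (grab some data, clean it, visualize it, run a regression or some k-means) is easy to sketch. Here is a minimal example in R using the built-in mtcars data; the variable choices are arbitrary and purely for illustration:

# Grab some data (built into R) and do some basic cleaning
data(mtcars)
dat <- na.omit(mtcars)   # mtcars has no missing values, but cleaning is the habit
summary(dat)

# Visualize: fuel economy vs. weight
plot(dat$wt, dat$mpg, xlab = "Weight (1000 lbs)", ylab = "MPG")

# Run a regression
fit <- lm(mpg ~ wt + hp, data = dat)
summary(fit)

# ...and some k-means on a few scaled features
km <- kmeans(scale(dat[, c("mpg", "wt", "hp")]), centers = 3)
table(km$cluster)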

So getting your foot in the door in the data science field apparently doesn't mean mastering Hive or Hadoop. And this does not sound like PhD-level rocket science at this point either. Karolis Urbonas, Head of Business Intelligence at Amazon, has recently written a couple of similarly themed pieces, also at KDNuggets:

How to think like a data scientist to become one

"I still think there’s too much chaos around the craft and much less clarity, especially for people thinking of switching careers. Don’t get me wrong – there are a lot of very complex branches of data science – like AI, robotics, computer vision, voice recognition etc. – which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into a data science role that was called a business / data analyst just a few years ago – here are the four rules that have helped me get into and are still helping me survive in the data science."

He emphasizes basic data analysis, statistics, and coding to get started. The emphasis again is not on specific tools, degrees, etc., but on the process and the ability to use data to solve problems. Note that in the comments there is some pushback on the level of expertise required, but Karolis actually addressed that when he mentioned the very narrow and specific roles in AI, robotics, etc. Here he's giving advice for getting started in the broad diversity of data science roles outside those narrow tracks. The issue is that some people in data science want to narrow the scope to the exclusion of much of the work done by the business analysts, researchers, engineers, and consultants creating much of the value in this space (again, see my previous post).

What makes a great data scientist?

"A data scientist is an umbrella term that describes people whose main responsibility is leveraging data to help other people (or machines) making more informed decisions….Over the years that I have worked with data and analytics I have found that this has almost nothing to do with technical skills. Yes, you read it right. Technical knowledge is a must-have if you want to get hired but that’s just the basic absolutely minimal requirement. The features that make one a great data scientist are mostly non-technical."

1. Great data scientist is obsessed with solving problems, not new tools.

"This one is so fundamental, it is hard to believe it’s so simple. Every occupation has this curse – people tend to focus on tools, processes or – more generally – emphasize the form over the content. A very good example is the on-going discussion whether R or Python is better for data science and which one will win the beauty contest. Or another one – frequentist vs. Bayesian statistics and why one will become obsolete. Or my favorite – SQL is dead, all data will be stored on NoSQL databases."




Saturday, April 8, 2017

What do you really need to know to be a data scientist? Data Science Lovers and Haters

Previously I discussed the Super Data Science podcast and credit modeling in terms of the modeling strategy and models used. The discussion also covered data science in general, and one part of the conversation was, I thought, well worth discussing in more detail. It really gets to the question of what it takes to be a data scientist. There is a ton of energy spent on this in places like LinkedIn and other forums. I think the answer comes in two forms. For the 'lovers' of data science, it's all about what kind of advice I can give people to help and encourage them to create value in this space. For the 'haters,' it's more like: now that I have established myself in this space, what kind of criteria should we have to keep people out and prevent them from creating value? But before we get to that, here is some great dialogue from Kirill discussing a trap that data scientists, or aspiring data scientists, fall into:

Kirill: "I think there’s a level of acumen that people should have, especially going into data science role. And then if you’re a manager you might take a step back from that. You might not need that much detail…If you’re doing the algorithms, that acumen might be enough. You don’t need to know the nitty-gritty mathematical academic formulas to everything about support vector machines or Kernels and stuff like that to apply it properly and get results. On the other hand, if you find that you do need that stuff you can go and spend some additional time learning. A lot of people fall into the trap. They try to learn everything in a lot of depth, whereas I think the space of data science is so broad you can’t just learn everything in huge depths. It’s better to learn everything to an acceptable level of acumen and then deepen your knowledge in the spaces that you need."

Greg: "if you don’t want to get into that detail, I totally get it. You can be totally fine without it. I have never once in my career had somebody ask me what are the formulas behind the algorithm….there’s a lot of jobs out there for people that don’t know them."

I admit I used to fall into this trap. In fact, this blog is a direct result. Early in my career I had the mindset that if you can't prove it, you can't use it. I really didn't feel confident about an algorithm or method until I understood it 'on paper' and could at least code my own version in SAS IML or R. A number of posts here were based on that work and mindset. Then a very well known and accomplished developer/computational scientist who frequently helped me gave the good advice that with this mindset you might never get any work done, or only a fraction of it.

Given the amount of discussion you might see on LinkedIn or in the so-called data science community about real vs. fake data scientists (lots of haters out there), in the Talk Python to Me podcast Joel Grus (author of Data Science from Scratch) provides what I think is the most honest discussion of what data science is and what data scientists do:

"there are just as many jobs called data science as there are data scientists"

That is kind of a paraphrase, kind of out of context, and yes, very general. It almost defines a word using the word in the definition. But it is very, very TRUE. That is because the field is largely undefined. Attempting to define it is futile and, I think, would be the antithesis of data science itself. I will warn, though, that there are plenty of data science haters out there who would quibble with what Greg and Joel have said above.

These are people who want to impose something stricter, some minimum threshold. The common thread is a fear of a poser or fake data scientist fooling some company into hiring them, or incompetently pointing and clicking their way through an analysis without knowing what is going on while calling themselves a data scientist. While I understand that concern, it's one extreme. It can easily morph into a straw-man argument for a more political agenda at the other extreme: a laundry list of minimal requirements to be a 'real' data scientist (think big data technologies, degrees, and the like). Economists know all about this; we see it in the form of licensing and rent seeking in a number of professions and industries, and broadly speaking it's a waste of resources. To be clear, in this broad space economists would also recognize merit in signaling through certification, certain degree programs or coursework, or other methods of credentialing. But there is a big difference between competitive signaling and non-competitive rent-seeking behavior.

At its inception, data science was all about disruption. As described in the Johns Hopkins applied economics program description:

“Economic analysis is no longer relegated to academicians and a small number of PhD-trained specialists. Instead, economics has become an increasingly ubiquitous as well as rapidly changing line of inquiry that requires people who are skilled in analyzing and interpreting economic data, and then using it to effect decisions… Advances in computing and the greater availability of timely data through the Internet have created an arena which demands skilled statistical analysis, guided by economic reasoning and modeling.”

This parallels data science. Suddenly you no longer need a PhD in statistics, a software engineering background, or an academic's level of acumen to create value-added analysis (although those are all excellent backgrounds for doing advanced work in data science, no doubt). It's that basic combination of subject matter expertise, some knowledge of statistics and machine learning, and the ability to write code or use software to solve problems. That's it. It's disruptive, and the haters hate it. They simultaneously embrace the disruption and want to rein it in and fence out the competition. I hate it for the haters, but you don't need to be able to code your own estimators or train a neural net from scratch to use them. And there is probably as much or more value-creating professional space out there for someone who can clean a data set and provide a set of cross tabs as there is for someone who knows how to set up a Hadoop cluster.

Below are a couple of really great KDNuggets articles in this regard written by Karolis Urbonas, Head of Business Intelligence at Amazon:

How to think like a data scientist to become one

What makes a great data scientist?



Super Data Science Podcast Credit Scoring Models

I recently discovered the Super Data Science podcast hosted by Kirill Eremenko. What I like about this podcast series is that it is applied data science. You can talk all day about theory, theorems, proofs, and mathematical details and assumptions, but even if you could master every technical detail underlying 'data science' you would have only scratched the surface. What distinguishes data science from the academic disciplines of statistics, computer science, or machine learning is application: solving a problem for business or society. It's not theory for theory's sake. There are huge gaps between theory and application that can easily stump a team of PhDs or experienced practitioners (see also applied econometrics). Podcasts like this help bridge that gap.

Episode 014 featured Greg Poppe, Sr. Vice President for risk management at an auto lending firm. They discussed, among other things, how data science is leveraged in loan approvals and rate setting.

The general modeling approach that Greg discussed is very similar to work that I have done before in student risk modeling in higher education (see here and here).

"So think of it like -- you know, I would have a hard time telling you with any high degree of certainty, “This loan will pay. This loan will pay. But this loan won’t.” However, if you give me a portfolio of a hundred loans, I should be able to say “15 aren’t going to pay. I don’t know which 15, but 15 won’t.” And then if you give me another portfolio that’s say riskier, I should be able to measure that risk and say “This is a riskier pool. 25 aren’t going to pay. And again, I don’t know which 25, but I’m estimating 25.” And that’s how we measure our accuracy. So it’s not so much on a loan-by-loan basis. It’s “If we just select a random sample, how many did not pay, and what was our expectation of that?” And if they’re very close, we consider our models to be accurate."

A toy example in R that seems very similar can be found here (Predictive Modeling and Custom Reporting in R).
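In the same spirit, here is a rough, hypothetical sketch in R of the portfolio-level accuracy check Greg describes: fit a scoring model on simulated loans, then compare the expected number of defaults in a holdout pool (the sum of predicted probabilities) to the number actually observed. The data, variables, and model below are made up for illustration and are not the firm's actual approach.

set.seed(42)

# Simulate a toy loan portfolio: default risk depends on credit score and LTV
n <- 5000
score <- rnorm(n, 650, 60)
ltv   <- runif(n, 0.5, 1.2)
p     <- plogis(-4 + 0.01 * (700 - score) + 2 * (ltv - 0.8))
loans <- data.frame(default = rbinom(n, 1, p), score, ltv)

# Fit a simple scoring model on a training split
train <- loans[1:4000, ]
test  <- loans[4001:5000, ]
fit <- glm(default ~ score + ltv, data = train, family = binomial)

# Portfolio-level check on the holdout pool: expected vs. observed defaults
expected <- sum(predict(fit, newdata = test, type = "response"))
observed <- sum(test$default)
c(expected = round(expected, 1), observed = observed)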

So at a basic level they are using predictive models to get a score, using cutoffs to determine different pools of risk, and making approvals and declines and setting interest rates based on this. He doesn't discuss the specifics of the model testing, but to me the key here sounds a lot like calibration (see Is the ROC curve a good metric for model calibration?). In terms of the types of models they use, this is where it gets very interesting. As Kirill says, the whole podcast is worth listening to for this very point. For their credit scoring models they use regression, even though they could get improved performance from other algorithms like decision trees or ensembles. Why?

"so primarily in the credit decisioning models, we use regression models. And the reason why—well, there’s quite a few. One is it’s very computationally easy. It’s easy to explain, it’s easy for people to understand but it’s also not a black box in the sense that a lot of models can be, and what we need to do is we need to provide a continuity to a dealership because they can adjust the parameters of the application and that will adjust the risk accordingly…..If we were to go with a CART model or any other decision tree model, if the first break point or the first cut point in that model is down payment and they go from one side to the other, it can throw it down a completely separate set of decision logic and they can get very strange approvals. From a data science perspective and from an analytics perspective, that may be more accurate but it’s not sellable, it’s not marketable to the dealership."

Yes, a huge gap just filled, and well worth repeating. It's interesting that in a different scenario you could go the other way around. For instance, in my work on student risk modeling in higher education we went with decision trees instead of regression, but based on a similar line of reasoning. Our end users were not going to be tweaking parameters, but getting sign-off and buy-in required that they understand more about what the model was doing. The explicit splits and decision logic of the trees were easier to explain to an audience without statistical training than regression models or neural networks would have been.
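To make the tradeoff concrete, here is a small sketch on made-up data that fits both a logistic regression and a CART-style tree (via the rpart package) to the same simulated applications. The regression moves the risk score smoothly as inputs change, while the tree produces explicit cut points that are easy to read but can send an application down a different branch of logic when an input crosses a split. The variables and parameters are hypothetical.

set.seed(1)

# Simulated applications: default risk driven by down payment and income
n <- 2000
down_payment <- runif(n, 0, 0.3)
income <- rlnorm(n, 10.5, 0.4)
p <- plogis(1 - 8 * down_payment - 0.00005 * income)
apps <- data.frame(default = rbinom(n, 1, p), down_payment, income)

# Regression: a small change in an input moves the predicted risk smoothly
logit_fit <- glm(default ~ down_payment + income, data = apps, family = binomial)
summary(logit_fit)

# Tree: explicit splits that are easy to explain, but a small change near a
# cut point can flip an application onto a different branch of decision logic
library(rpart)
tree_fit <- rpart(factor(default) ~ down_payment + income, data = apps, method = "class")
print(tree_fit)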

If you have been a practitioner for a while you might think that of course every data scientist knows there is a tradeoff between accuracy, complexity, and practicality. I agree, but it still can't be emphasized enough. More time should be spent on applied examples like this versus the waste we see in social media debates about who is or isn't a 'fake' data scientist. The real data scientists are too busy working in the gaps between theory and practice to care. To be continued....







Friday, April 7, 2017

Andrew Gelman on EconTalk

Recently Andrew Gelman was on EconTalk with Russ Roberts. Among the most interesting topics covered were his 'garden of forking paths' and the subject of a fairly recent blog post of his, The “What does not kill my statistical significance makes it stronger” fallacy.

Here is an excerpt from the transcript:

Russ Roberts: But you have a small sample...--that's very noisy, usually. Very imprecise. And you still found statistical significance. That means, 'Wow, if you'd had a big sample you'd have found even a more reliable effect.'

Andrew Gelman: Um, yes. You're using what we call the 'That which does not kill my statistical significance makes it stronger' fallacy. We can talk about that, too.

From Andrew's Blog Post:

"The idea is that statistical significance is taken as an even stronger signal when it was obtained from a noisy study.

This idea, while attractive, is wrong. Eric Loken and I call it the “What does not kill my statistical significance makes it stronger” fallacy.

"What went wrong? Why it is a fallacy? In short, conditional on statistical significance at some specified level, the noisier the estimate, the higher the Type M and Type S errors. Type M (magnitude) error says that a statistically significant estimate will overestimate the magnitude of the underlying effect, and Type S error says that a statistically significant estimate can have a high probability of getting the sign wrong.

We demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward."
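A quick simulation in R illustrates the fallacy, in the spirit of Gelman and Carlin's Type M / Type S calculations. The true effect and standard error below are made-up numbers chosen to give a similarly underpowered design; they are not the exact example from the post.

set.seed(123)

true_effect <- 2    # a small true effect
se <- 8             # a very noisy estimate (very low power)
n_sims <- 100000

# Each replication yields one estimate and a z-test at the 5% level
est <- rnorm(n_sims, mean = true_effect, sd = se)
significant <- abs(est / se) > 1.96

# Power: how often we reach statistical significance at all
mean(significant)

# Type M (exaggeration): significant estimates overstate the true effect
mean(abs(est[significant])) / true_effect

# Type S (sign): share of significant estimates with the wrong sign
mean(est[significant] < 0)

Conditioning on significance, the surviving estimates are several times larger than the true effect and have a nontrivial chance of pointing in the wrong direction, which is exactly the point of the quote above.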

As noted in both the podcast and the blog post, this fallacy is not well known, and, as Gelman and Loken point out, even very well known researchers appear to have committed it at one time or another in their writing or dialogue.

See also: Econometrics, Multiple Testing, and Researcher Degrees of Freedom


Wednesday, March 22, 2017

Count Models with Offsets: Practical Applications Using R

See also: Basic Econometrics of Counts

Let's consider three count modeling scenarios and determine the appropriate modeling strategy for each.

Example 1: Suppose we are observing kids playing basketball during open gym. We have two groups of equal size, A and B. Suppose both groups play for 60 minutes, and kids in one group, A, score about 2.5 goals each on average while group B averages about 5.

In this case both groups engage in the activity for the same amount of time, so there seems to be no need to include time as an offset. And it is clear that, for whatever reason, the kids in group B are better at scoring and therefore score more goals.

If we simulate count data to mimic this scenario (see toy data below) we might get descriptive statistics that look like this:

Table 1:

GROUP   Mean goals   Minutes observed   Goals per minute
A       2.6          60                 0.0433
B       5.4          60                 0.0900

It is clear that over the period of observation (60 minutes) group B outscored group A. Would we in practice discuss this in terms of rates? Total goals per 60-minute session? Or total goals per minute? In this case group A scores .0433 goals per minute vs. .09 for B. Again, based on rates we conclude that B is better at scoring goals. But most likely, whatever the implicit or explicit view of rate, we would discuss these outcomes in the more practical sense of total goals for A vs. B.
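These descriptives can be reproduced from the toy data listed at the end of this post, assuming it has been read into the data frame counts used in the model calls below:

# Mean goals by group and goals per minute for the 60-minute scenario (TIME2)
aggregate(COUNT ~ GROUP, data = counts, FUN = mean)
with(counts, tapply(COUNT, GROUP, mean) / tapply(TIME2, GROUP, mean))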

We could model this difference with a Poisson regression:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))


Table 2:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.9555     0.1601   5.967 2.41e-09 ***
GROUPB        0.7309     0.1949   3.750 0.000177 ***

We can see from this that group B completes significantly more goals than A, at a 'rate' exp(.7309) = 2.077 times that of A (roughly twice as many goals). This is basically what we get from a direct comparison of the average counts in the descriptives above.

But what if we wanted to be explicit about the interval of observation and include an offset? The way we incorporate rates into a Poisson model for counts is through the offset.

Log(μ/t) = xβ, where we are explicitly specifying a rate based on exposure time t.

Re-arranging terms we get:

Log(μ) – Log(t) = xβ
Log(μ) = xβ + Log(t)

The term Log(t) becomes our 'offset.'

So we would do this by including log(time) as an offset in our R code: 

summary(glm(COUNT ~ GROUP + offset(log(TIME2)),data = counts, family = poisson))

It turns out the estimate of .7309 for B vs. A is the same. Whether we directly compare the raw counts or run count models with or without an offset, we get the same result; only the intercept changes, since both groups were observed for the same 60 minutes.

Table 3:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -3.1388     0.1601  -19.60  < 2e-16 ***
GROUPB        0.7309     0.1949    3.75 0.000177 ***

Example 2: Suppose again we are observing kids playing basketball during open gym. Let's still refer to them as groups A and B. Suppose that after 30 minutes group A is forced to leave the court (maybe their section of the court is reserved for an art show). Before leaving they score an average of about 2.5 goals. Group B is allowed to play for 60 minutes, scoring an average of about 5 goals. This is an example where the two groups had different observation or exposure times (i.e., playing time). It's plausible that if group A had continued to play longer they would have had more opportunity to score more goals. It seems the only fair way to compare goal scoring for A vs. B is to consider court time, or exposure, i.e., the rate of goal completion. If we use the same toy data as before (but assuming this different scenario) we would get the following descriptives:

Table 4:

GROUP   Mean goals   Minutes observed   Goals per minute
A       2.6          30                 0.0867
B       5.4          60                 0.0900

You can see that the difference in the rate of goals scored is very small. Both teams are put on an ‘even’ playing field when we consider rates of goal completion. 

If we fail to consider exposure or the period of observation we run the following model:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))

The results will appear the same as in table 2 above.  But what if we want to consider the differences in exposure or observation time? In this case we would include an offset in our model specification:

summary(glm(COUNT ~ GROUP + offset(log(TIME3)),data = counts, family = poisson))

Table 5
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.44569    0.16013 -15.273   <2e-16 ***
GROUPB       0.03774    0.19490   0.194    0.846 

We can see from the results that when we consider exposure (modeling with an offset) there is no significant difference between the groups, although this could be an issue of low power and small sample size. Directionally, group B completes about 3.8% more goals per minute of exposure than A: exp(.0377) = 1.038 indicates that B completes 1.038 times as many goals as A, or (1.038 - 1)*100 = 3.8% more. We could get all of this from the descriptives by comparing the average 'rates' of goal completion for B vs. A. The conclusion is the same either way, but if we fail to consider rates or exposure in this case we get the wrong answer!

Example 3: Suppose we are again observing kids playing basketball during open gym with groups A and B. This time group A tires out after playing about 20 minutes or so and leaves the court, having scored about 2.6 goals each on average. Group B perseveres another 30 minutes or so and scores a total of about 5 goals on average per student. In this instance there seem to be important differences between groups A and B in terms of drive and ambition that should not be equalized away by accounting for time played or including an offset. Event success seems to drive time as much as time drives the event. If we want to think of a 'rate' here, the rate is total goals scored per open gym session, not per minute of activity. The relevant interval is a single open gym session.

In this case time actually seems endogenous: it is confounded with the outcome, or with other factors like effort and motivation that drive the outcome.

If we alter our simulated data from before to mimic this scenario we would generate the following descriptive statistics:

Table 6:

GROUP   Mean goals   Avg minutes played   Goals per minute
A       2.6          21.7                 0.120
B       5.4          50.0                 0.108

As discussed previously, this should be modeled without an offset, with the exposure implicitly being an entire open gym session, which is the same for everyone. We can think of this as a model of counts, or an implied model of rates in terms of total goals per open gym session. In that case we get the same results as Table 2, indicating that group B scores more goals than A. It makes no sense in this case to include playing time as an offset or to compare per-minute rates of goal completion between groups. But if we did model this with an offset (explicitly specifying exposure as court time) we would get the following:

Table 7:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -2.1203     0.1601 -13.241   <2e-16 ***
GROUPB       -0.1054     0.1949  -0.541    0.589   

In this case, modeling explicitly with playing time as the exposure, we get a result indicating that group B completes fewer goals, i.e., completes goals at a lower rate per minute than group A. This completely ignores the fact that group B persevered, played longer, and ultimately completed more goals. Including an offset in this case most likely leads to the wrong conclusion.

Summary: When modeling outcomes that are counts, a rate is always implied by the nature of the probability mass function for a Poisson process. However, in practical applications we may not always think of our outcome as an explicit rate based on an explicit interval or exposure time. In some cases this distinction can be critical. When we want to explicitly account for differences in exposure, we do so by specifying an offset in our count model. Three examples were given using toy data where (1) modeling rates or including an offset made no difference to the conclusion, (2) including an offset was required to reach the correct conclusion, and (3) including an offset may lead to the wrong conclusion.

Conclusion: Counts always occur within some interval of time or space and therefore always have an implicit 'rate' interpretation. If counts are observed across different intervals of time or space for different observations, then those differences should be modeled through the specification of an offset. Whether to include an offset really depends on answering two questions: (1) What is the relevant interval in time or space upon which our counts are based? (2) Is this interval different across our observations?

References:

Essentials of Count Data Regression. A. Colin Cameron and Pravin K. Trivedi. (1999)

Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)

Models for Count Outcomes. Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ . Last revised February 16, 2016

Econometric Analysis of Count Data. By Rainer Winkelmann. 2nd Edition.

Notes: This post ignores any discussion related to overdispersion or inflated zeros, which relate to other possible model specifications including negative binomial, zero-inflated Poisson (ZIP), and zero-inflated negative binomial (ZINB) models.

Simulated Toy Count Data:

COUNT GROUP ID   TIME TIME2 TIME3
3    A    1    20   60   30
4    A    2    25   60   30
2    A    3    20   60   30
2    A    4    20   60   30
1    A    5    20   60   30
6    A    6    30   60   30
0    A    7    20   60   30
0    A    8    20   60   30
1    A    9    20   60   30
5    A    10   25   60   30
3    A    11   20   60   30
2    A    12   20   60   30
3    A    13   20   60   30
3    A    14   25   60   30
4    A    15   20   60   30
5    B    16   50   60   60
4    B    17   45   60   60
7    B    18   55   60   60
8    B    19   50   60   60
3    B    20   50   60   60
7    B    21   45   60   60
5    B    22   55   60   60
4    B    23   50   60   60
7    B    24   50   60   60
8    B    25   45   60   60
5    B    26   55   60   60
3    B    27   50   60   60
5    B    28   50   60   60
4    B    29   45   60   60
6    B    30   55   60   60

Saturday, March 11, 2017

Basic Econometrics of Counts

As Cameron and Trivedi state, Poisson regression is "the starting point for count data analysis, though it is often inadequate" (Cameron and Trivedi,1999). The "main focus is the effect of covariates on the frequency of an event, measured by non-negative integer values or counts"(Cameron and Trivedi,1996).

Examples of counts they reference are related to medical care utilization such as office visits or days in the hospital.

"In all cases the data are concentrated on a few small discrete values, say 0, 1 and 2; skewed to the left; and intrinsically heteroskedastic with variance increasing with the mean. These features motivate the application of special methods and models for count regression." (Cameron and Trivedi, 1999).

From Cameron and Trivedi (1999) "The Poisson regression model is derived from the Poisson distribution by parameterizing the relation between the mean parameter μ and covariates (regressors) x. The standard assumption is to use the exponential mean parameterization"

μ = exp(xβ)

Slightly abusing partial derivative notation we derive the marginal effect of x on y as follows:

dE[y|x]/dx = β*exp(xβ)

What this implies is that if  β = .10 and  exp(xβ) = 2 then a 1 unit change in x  will change the expectation of y by .20 units.

Another way to interpret this is to approximate the average effect on the response using:

 mean(y)*β

(see also Wooldridge, 2010)
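A small sketch on simulated data shows that, for a Poisson model with a log link and an intercept, averaging the per-observation marginal effects β*exp(xβ) gives the same number as the mean(y)*β shortcut:

set.seed(7)

# Simulate a count outcome that depends on one continuous covariate
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.10 * x))
m <- glm(y ~ x, family = poisson)

b <- unname(coef(m)["x"])

# Marginal effect at each observation: b * exp(x'beta) = b * fitted value
me_i <- b * fitted(m)

# Average marginal effect vs. the mean(y) * beta shortcut
c(average_marginal_effect = mean(me_i), shortcut = mean(y) * b)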

Another way this is interpreted is through exponentiation. From Penn State's STAT 504 course:

"with every unit increase in X, the predictor variable has multiplicative effect of exp(β) on the mean (i.e. expected count) of Y."

This implies (as noted in the course notes):

    If β = 0, then exp(β) = 1, and Y and X are not related.
    If β > 0, then exp(β) > 1, and the expected count μ = E(y) is exp(β) times larger than when X = 0.
    If β < 0, then exp(β) < 1, and the expected count μ = E(y) is exp(β) times that when X = 0, i.e., smaller.

For example, comparing group A (X = 1) vs. B (X = 0): if exp(β) = .5, group A has an expected count half that of B. On the other hand, if exp(β) = 1.5, group A has an expected count 1.5 times that of B.

This could also be interpreted in percentage terms (similar to odds ratios in logistic regression).

For example, comparing group A (X = 1) vs. B (X = 0): if exp(β) = .5, that implies group A has a (.5 - 1)*100% = -50% change in the expected count, i.e., a 50% lower expected count than group B. On the other hand, if exp(β) = 1.5, group A has a (1.5 - 1)*100% = 50% larger expected count than B.

A simple rule of thumb or shortcut: for small values of β we can interpret 100*β as the approximate percent change in the expected count of y for a one-unit change in x (Wooldridge, 2nd ed., 2010).

 
β       exp(β)      (exp(β) - 1)*100%
0.01    1.0101       1.01
0.02    1.0202       2.02
0.03    1.0305       3.05
0.04    1.0408       4.08
0.05    1.0513       5.13
0.06    1.0618       6.18
0.07    1.0725       7.25
0.08    1.0833       8.33
0.09    1.0942       9.42
0.10    1.1052      10.52
0.11    1.1163      11.63
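The table above can be reproduced with a couple of lines of R:

b <- seq(0.01, 0.11, by = 0.01)
data.frame(beta = b, exp_beta = exp(b), pct_change = (exp(b) - 1) * 100)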

In STATA, the margins command can be used to get predicted (average) counts at each specified level of a covariate. This is similar, as I understand it, to getting marginal effects at the mean for logistic regression. See the UCLA STATA examples. Similarly, this can be done in SAS using the ilink option with lsmeans.
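A rough analogue in R (on made-up data) is simply predicted counts on the response scale at each level of the covariate:

# Predicted (average) counts by group from a Poisson model, on the response scale;
# the data and variable names here are simulated just for illustration
set.seed(99)
grp <- factor(rep(c("A", "B"), each = 50))
y <- rpois(100, lambda = ifelse(grp == "A", 2.5, 5))
m <- glm(y ~ grp, family = poisson)

newdat <- data.frame(grp = factor(c("A", "B")))
cbind(newdat, predicted_count = predict(m, newdata = newdat, type = "response"))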

An Applied Example: Suppose we have some count outcome y that we want to model as a function of some treatment 'TRT.' Maybe we are modeling differences in hospital admissions between treated and control groups for some intervention, or maybe this is the number of weeds in an acre plot for treated vs. control groups in an agricultural experiment. Using Python, I simulated a toy count data set for two groups, treated (TRT = 1) and untreated (TRT = 0). Descriptive statistics indicate a treatment effect.

Mean (treated): 2.6
Mean (untreated): 5.4

However, if I want to look at the significance of this I can model the treatment effect by specifying a Poisson regression model. Results are below:

E[y|TRT] = exp(xβ), where x = TRT is our binary indicator for treatment.

[Image: Poisson regression model output]

From the output we can see that our estimate is β = -.7309. This is rather large in absolute value, so the direct percentage approximation above won't hold well. However, we can interpret the significance and direction of the effect to mean that the treatment significantly reduces the expected count of y, our outcome of interest. The table presented previously indicates that as β becomes large, the direct percentage shortcut tends to overestimate the true effect, which implies the treatment is reducing the expected count by something less than 73%. If we take the path of exponentiation we get:

exp(-.7309) = .48

This implies the treatment group has an expected count that is .48 times that of the control. In percentage terms, the treatment group has an expected count that is (.48 - 1)*100 = -52%, i.e., 52% lower than the control group.

Interestingly, with a single-variable Poisson regression model we can derive these results directly from the descriptive data.

If we take the ratio of average counts for the treated vs. untreated groups we get 2.6/5.4 = .48, which is the same as our exponentiated result exp(β). And if we compare the raw means between the treated and untreated groups we see that the treatment group indeed has an average count about 52% lower than the control group.
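That back-of-the-envelope check is a one-liner:

2.6 / 5.4        # 0.4815, essentially exp(-.7309)
log(2.6 / 5.4)   # -0.7309, the estimated coefficient itself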

Extensions of the Model 

As stated at the beginning of this post, the Poisson model is just the benchmark or starting point for count models. One assumption is that the mean and variance are equal, known as 'equidispersion.' If the variance exceeds the mean, that is referred to as overdispersion, and negative binomial models are often specified instead (the interpretation of the coefficients is unchanged). Overdispersion is the more common case (Cameron and Trivedi, 1996). Other special cases consider the proportion of zeros, sometimes accounted for by zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models. As noted in Cameron and Trivedi (1996), count models and duration models can be viewed as duals. If observed units have different levels of exposure or duration, this is accounted for in count models through the inclusion of an offset. More advanced treatments and references should be consulted in these cases.
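For completeness, here is a minimal sketch (on simulated data, using MASS::glm.nb) of a crude overdispersion check followed by a negative binomial fit. This is the generic pattern rather than any example from above.

library(MASS)
set.seed(2017)

# Simulate overdispersed counts (negative binomial: variance exceeds the mean)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1 + 0.3 * x), size = 1.5)

# Poisson fit and a crude check: residual deviance much larger than the
# residual degrees of freedom suggests overdispersion
pois_fit <- glm(y ~ x, family = poisson)
deviance(pois_fit) / df.residual(pois_fit)

# Negative binomial fit; the coefficients keep the same interpretation
nb_fit <- glm.nb(y ~ x)
summary(nb_fit)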

Applied Examples from Literature

Some examples where counts are modeled in the applied economics literature include the following:

The Demand for Varied Diet with Econometric Models for Count Data. Jonq-Ying Lee. American Journal of Agricultural Economics, vol 69 no 3 (Aug,1987)

Standing on the shoulders of giants: Coherence and biotechnology innovation performance. Sanchez and Ng. Selected Poster, 2015 Agricultural and Applied Economics Association and Western Agricultural Economics Association Joint Annual Meeting, San Francisco, CA, July 26-28.

Adoption of Best Management Practices to Control Weed Resistance by Corn, Cotton, and Soybean Growers. Frisvold, Hurley, and Mitchell. AgBioForum 12(3&4): 370-381. 2009.

In all cases, as is common in the limited amount of literature I have seen in applied economics, the results of the count regressions are interpreted in terms of direction and significance, but not much consideration is given to an interpretation of results based on exponentiation of coefficients. 

References:

Essentials of Count Data Regression. A. Colin Cameron and Pravin K. Trivedi. (1999)
Count Data Models for Financial Data. A. Colin Cameron and Pravin K. Trivedi. (1996)
Econometric Analysis of Cross Section and Panel Data. Wooldridge. 2nd Ed. 2010.

See also:
Count Models with Offsets
Do we Really Need Zero Inflated Models
Quantile Regression with Count Data

For python code related to the applied example see the following gist.