Tuesday, May 24, 2016

Data Scientists vs Algorithms vs Solutions

A few weeks ago there was a short tweet in my tweetstream that kindled some thoughts.

"All people are biased, that's why we need algorithms!" "All algorithms are biased, that's why we need people!" via @pmarca

And a retweet/reply by Diego Kuonen:

"Algorithms are aids to thinking and NOT replacements for it"

This got me thinking about a lot of the work I have been doing, past job interviews and conversations with headhunters and 'data science' recruiters, as well as a number of discussions, and sometimes arguments, about what defines data science and about 'unicorns' and 'fake' data scientists.

To that end, I have seen numerous variations on Drew Conway's data science Venn diagram, but I think Drew still nails it pretty well. If you can write code, understand statistics, and can apply those skills in the context of your specific subject matter expertise, to me that meets the threshold of minimum skills most employers might require for you to do value-added work. But these tweets raise the question: for many business problems, do we need algorithms at all, and what kind of people do we really need?

Absolutely, there are differences in the skill sets required for machine learning vs traditional statistical inference, and there are definitely instances where knowing how to set up a Hadoop cluster can be valuable for certain problems. Maybe you do need a complex algorithm to power a killer app or recommender system.

I think part of the hype and snobbery around the terms data science and data scientist might stem from the fact that they are used in so many different contexts, and mean so many things to so many people, that there is fear the true meaning will be lost along with one's relevance in this space as a data scientist. It might be better to forget about semantics and just concentrate on the ends we are trying to achieve.

I think the vast majority of businesses really need insights driven by people with subject matter expertise and the ability to clean, extract, analyze, visualize, and, probably most importantly, communicate. Sometimes the business need requires prediction, other times inference. Many times you may not need a complicated algorithm or experimental design at all, as much as you need someone to make sense of the nasty transactional data your business is producing and summarize it all with some cross tabs. Sometimes you might need a PhD computer scientist or engineer who can meet the strictest of data science thresholds, but often what you really need is a statistician, econometrician, biometrician, or just a good MBA or business analyst who understands predictive modeling, causal inference, and the basics of a left join.

One thing is for certain: recruiters might have a better shot at placing candidates for their clients if role descriptions would just say what they mean and leave the fighting over who's a real data scientist to the LinkedIn discussion boards and their tweetstream.

Tuesday, March 22, 2016

Identification and Common Trend Assumptions in Difference-in-Differences for Linear vs GLM Models

In a previous post I discussed the conclusion from Lechner's paper 'The Estimation of Causal Effects by Difference-in-Difference Methods': that difference-in-difference models in a non-linear or GLM context fail to meet the common trend assumption, and therefore fail to identify treatment effects in a selection on unobservables context.

In that post I noted that Lechner points out (quite rigorously, in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect”

I wanted to review, at a high level, exactly how he gets to this result, but simplify things as much as possible and start with some basic concepts. Starting with a basic regression model, the population conditional expectation function, or conditional mean of Y given X, can be written as:

Regression and Expected Value Notation:

E[Y|X] = β0 + β1 X (1)

and we estimate this with the regression on observed data:

y = b0 + b1X + e (2)

Where b1 is our estimate of the population parameter of interest β1.

If E[b1] = β1 then we say our estimator is unbiased.

Potential Outcomes Notation:

When it comes to experimental designs, we are interested in counterfactuals: what value of an outcome would a treatment or program participant have in the absence of treatment (the baseline potential outcome) vs. if they participated or were treated? We can specify these 'potential outcomes' as follows:

Y0= baseline potential outcome
Y1= potential treatment outcome

We can characterize the treatment effect as:

E[Y1 - Y0], or the difference in potential treated vs baseline outcomes. This is referred to as the average treatment effect or ATE. Sometimes we are interested in, or some models estimate, the average treatment effect on the treated or ATT: E[Y1 - Y0 | d = 1]

where d is an indicator for treatment (d = 1) vs control or untreated (d =0).

Difference-in-Difference Analysis:

Difference-in-difference (DD) estimators assume that in the absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. Treatment effects in DD estimators are derived by subtracting pre and post differences within the treatment and control groups, and then taking the difference in those differences between treatment and control groups. The unobservable effects that are constant or fixed over time 'difference out', allowing us to identify treatment effects while controlling for these unobservable characteristics without explicitly measuring them. This characterizes what is referred to as a 'selection on unobservables' framework.

  This can also be estimated using linear regression with an interaction term:

y = b0 + b1 d + b2 t + b3 d*t + e (3)

where d indicates treatment (d = 1 vs d = 0), t indicates the post period (t = 1 vs t = 0), and the estimated coefficient (b3) on the treatment by time interaction term gives us our estimate of the treatment effect.
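
To make this concrete, here is a minimal R sketch on simulated data (the variable names, sample size, and effect size are assumptions for illustration, not taken from any of the papers discussed):

# simulate a two-group, two-period setting and recover the DD effect
# as the coefficient on the treatment-by-time interaction (b3)
set.seed(123)
n <- 1000
d <- rbinom(n, 1, 0.5)                    # treatment group indicator
t <- rbinom(n, 1, 0.5)                    # pre (0) vs post (1) period
y <- 10 + 3*d + 1.5*t + 2*d*t + rnorm(n)  # group effect, common time trend, assumed treatment effect = 2
summary(lm(y ~ d + t + d:t))              # the d:t coefficient is the DD estimate of the treatment effect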

Lechner and Potential Outcomes Framework:

In an attempt to present the issues with GLM DD models raised in Lechner (2010) using the simplest notation possible (abusing notation slightly, and perhaps at some cost to precision), we can depict the framework for difference-in-difference analysis using expectations:

DID = [E(Y1|D=1) - E(Y0|D=1)] - [E(Y1|D=0) - E(Y0|D=0)] (4)

DID = [pre/post differences for treatment group] – [pre/post differences for control group]

where Y represents observed outcome values subscripted by pre (0) and post (1) period.

We can represent potential outcomes in the regression framework as follows:

E(Yt1|D) = α + tδ1 + dγ “potential outcome if treated” (5)

E(Yt0|D) = α + tδ0 + dγ “potential baseline outcome” (6)

ATET: E(Y11 - Y10|D = 1) = θ1 = δ1 - δ0    (7)

“difference in treated vs. baseline potential outcomes in the post period, for the treated group”

We can estimate θ1 with a regression on observed data of the form:

y = b0 + b1 d + b2 t + b3 d*t + e (3')

where b3 is our estimator for θ1 = δ1 - δ0.

Common Trend Assumption:
Difference-in-difference (DD) estimators assume that in the absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. This can be represented geometrically in a linear modeling context by 'parallel trends' in outcome levels between treatment and control groups in the absence of treatment:

As depicted above, BB represents the trend in outcome Y for the control group. AA represents the counterfactual trend, or parallel or common trend, for the treatment group that would occur in the absence of treatment. The distance A'A represents a departure from the parallel trend in response to treatment, and would be our DD treatment effect, i.e., the value of b3, our estimator for θ1.

The common trend assumption, following Lechner, can be expressed in terms of potential outcomes:

E(Y10|D=1) - E(Y00|D=1) = (α + δ0 + γ) - (α + γ) = δ0 (8)

E(Y10|D=0) - E(Y00|D=0) = (α + δ0) - α = δ0 (9)

i.e. the pre and post period difference in baseline (untreated) outcomes is the same (δ0) regardless of whether individuals are assigned to the treatment group (D=1) or the control group (D=0).

Nonlinear Models:

In a GLM framework, with a specific link function G(·), the DD setup can be expressed in terms of potential outcomes as follows:

E(Yt1|D) = G(α + tδ1 + dγ) “potential outcome if treated” (10)

E(Yt0|D) = G(α + tδ0 + dγ) “potential baseline outcome” (11)

DID can be estimated by regression on observed outcomes:

E(y|d,t) = G(b0 + b1 d + b2 t + b3 d*t) (12)

Common Trend Assumption:

E(Y10|D=1) - E(Y00|D=1) = G(α + δ0 + γ) - G(α + γ) (13)

E(Y10|D=0) - E(Y00|D=0) = G(α + δ0) - G(α) (14)

It turns out that in a GLM framework, for the common trend assumption to hold, group specific differences must be zero, i.e. γ = 0: expressions (13) and (14) are equal for a nonlinear G(·) only when γ = 0. The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, the group specific effects we are trying to control for in the selection on unobservables scenario, but in a GLM setting we have to assume these effects are absent. In essence, the attractive feature of DD models, controlling for unobservable effects, is lost in a GLM setting.
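
For concreteness, a minimal R sketch of what the nonlinear specification in (12) can look like in practice (a Poisson/log-link model on simulated data; all names and parameter values are assumptions for illustration). The d:t coefficient here is a difference-in-differences on the link (log) scale, and per the discussion above it will not generally identify the causal effect under the standard common trend assumption unless γ = 0:

# the same d, t, d*t specification estimated with a log link
set.seed(123)
n <- 1000
d <- rbinom(n, 1, 0.5)                   # treatment group indicator
t <- rbinom(n, 1, 0.5)                   # pre (0) vs post (1) period
mu <- exp(1 + 0.5*d + 0.3*t + 0.4*d*t)   # assumed mean structure on the log scale
y <- rpois(n, mu)                        # simulated count outcome
fit <- glm(y ~ d + t + d:t, family = poisson(link = "log"))
summary(fit)                             # d:t is a DD on the log scale, not in outcome levels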

The Estimation of Causal Effects by Difference-in-Difference Methods. Michael Lechner. Foundations and Trends in Econometrics, Vol. 4, No. 3 (2010) 165–224.

Program Evaluation and the Difference-in-Difference Estimator. Course Notes, Education Policy and Program Evaluation, Vanderbilt University, October 4, 2008.

Difference in Difference Models, Course Notes. ECON 47950: Methods for Inferring Causal Relationships in Economics. William N. Evans, University of Notre Dame, Spring 2008.

Friday, March 11, 2016

Marginal Effects vs Odds Ratios

Models of binary dependent variables are often estimated using logistic regression or probit models, but the estimated coefficients (or exponentiated coefficients expressed as odds ratios) are often difficult to interpret from a practical standpoint. Empirical economic research often reports ‘marginal effects’, which are more intuitive but often more difficult to obtain from popular statistical software. The most straightforward way to obtain marginal effects is from estimation of linear probability models. This post uses a toy data set to demonstrate the calculation of odds ratios and marginal effects from logistic regression using SAS and R, while comparing them to the results from a standard linear probability model.

Suppose we have a data set that looks at program participation (for some program or product or service of interest) by age and we want to know the influence of age on the decision to participate. Our data may look something like the excerpt below:

participation     age
1                       25
1                       26
1                       27
1                       28
1                       29
1                       30
0                       31
1                       32
1                       33
0                       34

Theoretically, this might call for logistic regression to model a dichotomous outcome like participation, so we could use SAS or R to get the following results:

                   Estimate Std. Error z value Pr(>|z|) 
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age            -0.14099    0.05656  -2.493   0.0127 *

                    OR               2.5 %       97.5 %
(Intercept)   376.049897 6.2769262 7.864410e+04
age              0.868502 0.7641126 9.595017e-01

While the estimated coefficients from logistic regression are not easily interpretable (they represent the change in the log odds of participation for a given change in age), odds ratios might provide a better summary of the effects of age on participation (odds ratios are derived from exponentiating the estimated coefficients from logistic regression - see also: The Calculation and Interpretation of Odds Ratios) and may be somewhat more meaningful. We can see the odds ratio associated with age is .8685, which implies that for every one year increase in age the odds of participation are about (1 - .8685)*100 = 13.15% lower. You tell me what this means if this is the way you think about the likelihood of outcomes in everyday life!
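
For reference, a minimal R sketch of the kind of calls that would produce output in the form shown above (assuming a data frame dat1 containing the participation and age columns excerpted earlier; this is an illustration, not necessarily the exact code behind the output):

fit <- glm(participation ~ age, data = dat1, family = binomial(link = "logit"))  # logistic regression
summary(fit)                                # coefficient table on the log-odds scale
exp(cbind(OR = coef(fit), confint(fit)))    # odds ratios with confidence intervals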

Marginal effects are an alternative metric that can be used to describe the impact of age on participation. Marginal effects can be described as the change in the outcome as a function of a change in the treatment (or independent variable of interest), holding all other variables in the model constant. In linear regression, the estimated regression coefficients are marginal effects and are more easily interpreted (more on this later). Marginal effects can be output easily from Stata, but they are not directly available in SAS or R, although there are some ad hoc ways of getting them which I will demonstrate here (there are also some packages in R available to assist with this). I am basing most of this directly on two very good blog posts on the topic:


One approach in SAS is to use PROC QLIM and request output of marginal effects. This computes a marginal effect for each observation’s value of x in the data set (because marginal effects may not be constant across the range of explanatory variables). Taking the average of this result gives an estimated ‘sample average marginal effect’: -.0258

This tells us that for every one year increase in age the probability of participation decreases on average by about 2.6 percentage points. For most people, for practical purposes, this is probably a more useful interpretation of the relationship between age and participation than odds ratios. We can calculate this more directly (following the code from the blog post by WenSui Liu) using output from logistic regression and the data step in SAS. Basically, for each observation in the data set calculate:

MARGIN_AGE = EXP(XB) / ((1 + EXP(XB)) ** 2) * (-0.1410);

where XB is the linear predictor and -.1410 is the estimated coefficient on age from the original logistic regression model. We can run the same analysis in R, either replicating the results from the data step above, or using the mfx function defined by Alan Fernihough, referenced in the diffuseprior blog post mentioned above and in the paper referenced below.
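
A minimal R sketch of that replication (again assuming the dat1 data frame used above; this reproduces the data step formula rather than the mfx function itself):

fit <- glm(participation ~ age, data = dat1, family = binomial(link = "logit"))
xb <- predict(fit, type = "link")                    # linear predictor XB for each observation
me <- exp(xb) / (1 + exp(xb))^2 * coef(fit)["age"]   # logistic density times the estimated age coefficient
mean(me)                                             # sample average marginal effect (compare to the -.0258 above)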

The paper notes that this function gives similar results to the mfx command in Stata. We get almost the same result we got from SAS above, and it additionally provides bootstrapped standard errors:

marginal.effects   standard.error
      -0.0258330        0.6687069

Marginal Effects from Linear Probability Models

Earlier I mentioned that you could estimate marginal effects directly from the estimated coefficients of a linear probability model. While in some circles LPMs are not viewed favorably, they have a strong following among applied econometricians (see references for more on this). As Angrist and Pischke state in their very popular book Mostly Harmless Econometrics:

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

Using SAS or R we can get the following results from estimating an LPM for this data:

                   Estimate    Std. Error  t value    Pr(>|t|)  
(Intercept)  1.700260   0.378572   4.491     0.000111 ***
dat1$age    -0.028699   0.009362  -3.065   0.004775 **

You can see that the linear probability model above gives us a marginal effect (-.028699) quite close to the previous estimates derived from logistic regression, as is often the case, and as indicated by Angrist and Pischke.
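
A sketch of the corresponding LPM in R (same assumed dat1 data frame; the LPM is just OLS on the 0/1 outcome):

lpm <- lm(participation ~ age, data = dat1)   # linear probability model
summary(lpm)                                  # the age coefficient is itself a marginal effect (about -.0287 above)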

In the SAS ETS example cited in the references below, a distinction is made between calculating sample average marginal effects (which were discussed above) vs. calculating marginal effects at the mean:

“To evaluate the "average" or "overall" marginal effect, two approaches are frequently used. One approach is to compute the marginal effect at the sample means of the data. The other approach is to compute marginal effect at each observation and then to calculate the sample average of individual marginal effects to obtain the overall marginal effect. For large sample sizes, both the approaches yield similar results. However for smaller samples, averaging the individual marginal effects is preferred (Greene 1997, p. 876)”

For a step by step review of the SAS and R code presented above as well as an additional example with multiple variables see:

Matt Bogard. "Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models" Staff Paper (2016)
Available at: http://works.bepress.com/matt_bogard/30/ 


Fernihough, Alan. Simple logit and probit marginal effects in R. Working paper. https://ideas.repec.org/p/ucn/wpaper/201122.html

SAS/ETS Web Examples Computing Marginal Effects for Discrete Dependent Variable Models. http://support.sas.com/rnd/app/examples/ets/margeff/ 

Linear Regression and Analysis of Variance with a Binary Dependent Variable (from EconomicSense, by Matt Bogard).

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

Probit better than LPM? http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/ 

Love It or Logit. By Marc Bellemare. marcfbellemare.com/wordpress/9024

R Data Analysis Examples: Logit Regression. From http://www.ats.ucla.edu/stat/r/dae/logit.htm   (accessed March 4,2016).

Greene, W. H. (1997), Econometric Analysis, Third edition, Prentice Hall, 339–350.

Wednesday, March 9, 2016

What's the difference between difference-in-difference models in a linear vs nonlinear context?

A while back I discussed a powerful methodology for identification of causal effects in both selection on observables and selection on unobservables contexts, namely combining propensity score matching and difference-in-differences.

But recently I ran across a tweet from Felix Bethke (https://twitter.com/F_Bethke) sharing a blog post by Tom Pepinsky related to 'plug and play' models. At the risk of oversimplifying, the takeaway was that we can't just take a methodology like DID used in a standard linear regression context and necessarily 'plug it into' a non-linear context and get the same results. (Often we see arguments going the other way around, that we can't use linear models in a non-linear context, but that is a different battle for another day.) I highly recommend Tom's post for more details, and he links to a number of papers that clarify the issues in a very technical sense.

In a linear difference-in-difference (DID) analysis, identification of causal effects hinges on a common trend assumption and the interpretation of the estimated regression coefficient on the treatment x time interaction term.

y = b0 + b1 x + b2 t + b3 x*t + e

In Tom's post, and in some of the papers, specific attention is given to how the interpretation of the interaction term (our estimated treatment effect, or b3 in a specification like the one above) changes in a logit or probit context, and it becomes something quite different from the causal effect of interest.
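
To illustrate, here is a hedged R sketch (simulated data; all variable names and values are assumptions) comparing the raw interaction coefficient from a logit DD to the difference-in-differences of predicted probabilities, which is what we usually care about on the outcome scale:

set.seed(123)
n <- 5000
x <- rbinom(n, 1, 0.5)                      # treatment group
t <- rbinom(n, 1, 0.5)                      # post period
p <- plogis(-1 + 0.5*x + 0.3*t + 0.7*x*t)   # assumed mean structure
y <- rbinom(n, 1, p)
fit <- glm(y ~ x + t + x:t, family = binomial)
cells <- expand.grid(x = c(0, 1), t = c(0, 1))
cells$phat <- predict(fit, newdata = cells, type = "response")   # predicted probability for each cell
dd_prob <- with(cells, (phat[x == 1 & t == 1] - phat[x == 1 & t == 0]) -
                       (phat[x == 0 & t == 1] - phat[x == 0 & t == 0]))
coef(fit)["x:t"]   # interaction coefficient on the log-odds scale
dd_prob            # DD in predicted probabilities, generally a different number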

I was specifically interested in knowing: is this an issue just for probit and logit models, or for other nonlinear models, like GLMs in general? For instance, in the healthcare economics literature, it's very common to use probit or logit models in a two part modeling context where the second part of the two part model is a GLM with a log link and gamma distribution. And I have seen some papers using difference-in-differences across the board with these models.

I took a look at a couple of papers, and it appears that these issues are a concern for any GLM.

In a Health Services Research paper, Karaca-Mandic et al. discuss these issues, and in the abstract imply that this applies to the log transformed models often used in healthcare economics:

"We discuss the motivation for including interaction terms in multivariate analyses. We then explain how the straightforward interpretation of interaction terms in linear models changes in nonlinear models, using graphs and equations. We extend the basic results from logit and probit to difference‐in‐differences models, models with higher powers of explanatory variables, other nonlinear models (including log transformation and ordered models), and panel data models."

After pointing out several issues, they state:

"It is important to understand that the issues about interaction terms discussed here apply to all nonlinear models, including log transformation models"

More specifically, what are these issues, at least at a high level? Recall that difference-in-difference models are a special case of fixed effects panel data models, where unobserved differences and individual specific effects essentially cancel out, providing clean identification of causal effects. For this to work in the DID framework, a common trend assumption is required. In the paper referenced below, Lechner points out (quite rigorously, in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect "

Because this is demonstrated to apply to any GLM specification/link function, it seems to strike a blow against using DID with many of the modeling approaches used in healthcare economics, or any other field relying on similar GLM specifications.

So, as Angrist and Pischke might ask, what is an applied guy to do? One approach, even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space), is to specify a linear model. For dichotomous outcomes (utilization measures like ER visits or hospital admissions are often dichotomized and modeled with logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomizing the outcome and using an LPM may also be an attractive alternative.


Special thanks to tweets and additional input from Tom Pepinsky and Marc Bellemare.

Interaction Terms in Nonlinear Models. Pinar Karaca-Mandic, Edward C. Norton, and Bryan Dowd. HSR: Health Services Research 47:1, Part I (February 2012).

The Estimation of Causal Effects by Difference-in-Difference Methods. Michael Lechner. Foundations and Trends in Econometrics, Vol. 4, No. 3 (2010) 165–224.

Tuesday, March 8, 2016

Applied Econometrics in One Lesson

When it comes to the challenging problems of causal inference (all the issues we encounter that create the gaps between textbook and applied econometrics) I think the best advice I have seen as an applied researcher comes from Marc Bellemare:

Do Both!!

Which seems to be a big takeaway from Angrist and Pischke's  Mostly Harmless Econometrics:

"So what's an applied guy to do? One answer, as always, is to check the robustness of your findings using alternative identifying assumptions. That means that you would like to find broadly similar results using plausible alternative models" 

That's applied econometrics in one lesson. That's the credibility revolution in practice.

Saturday, March 5, 2016

Machine Learning and Econometrics

Not long ago Tyler Cowen blogged at Marginal Revolution about a Quora post by Susan Athey discussing the impact of machine learning on econometrics, flavors of machine learning, and differences in the emphasis placed on tools and methodologies traditional in each field. The differences often hinge on whether one's intention is to explain or to predict, or whether one is interested in causal inference vs analytics. I really liked the point about instrumental variables made in the snippet below:

"Yet, a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated.  Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you change price) will not do as good a job fitting the data….Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of model has not received almost any attention in ML."

Tyler also points to a wealth of resources by Susan Athey here. And check out the mini-course she taught with Guido Imbens via NBER.

The differences and synergies between the tools used in econometrics and machine learning are something I have been interested in for a long time and have blogged about several times in the past. Kenneth Sanford and Hal Varian have also been writing about this. See related content below.

Related Content and Further Reading

Economists as Data Scientists http://econometricsense.blogspot.com/2012/10/economists-as-data-scientists.html

Econometrics, Math, and Machine Learning….what? http://econometricsense.blogspot.com/2015/09/econometrics-math-and-machine.html 

"Mathematical Themes in Economics, Machine Learning, and Bioinformatics" (2010)
Available at: http://works.bepress.com/matt_bogard/7/ 

Notes to 'Support' an Understanding of Support Vector Machines  http://econometricsense.blogspot.com/2012/05/notes-to-support-understanding-of.html

Culture War: Classical Statistics vs. Machine Learning http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html

Analytics vs Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html

Big Data: Don’t throw the baby out with the bath water http://econometricsense.blogspot.com/2014/05/big-data-dont-throw-baby-out-with.html

To Explain or Predict http://econometricsense.blogspot.com/2015/03/to-explain-or-predict.html 

Big Data: Causality and Local Expertise Are Key in Agronomic Applications http://econometricsense.blogspot.com/2014/05/big-data-think-global-act-local-when-it.html

Big Data: New Tricks for Econometrics. Hal R. Varian. June 2013; revised April 14, 2014.
Is machine learning trending with economists? (Kenneth Sanford)  http://blogs.sas.com/content/subconsciousmusings/2015/06/05/is-machine-learning-trending-with-economists/

Tuesday, December 29, 2015

Nonparametric Approaches to Multiple Comparisons

I have recently started reading "Applied Nonparametric Econometrics", and was thinking, when was the last time I even worked with basic non-parametric statistics?

For instance, in the courses I teach, I don't cover this, but some of the texts I reference cover some basics like the Mann-Whitney-Wilcoxon (MWW) test (which can be thought of as a non-parametric equivalent to a two sample independent t-test) or the Kruskal-Wallis test (which is a non-parametric analogue to analysis of variance). These tests are often useful in situations that involve highly skewed, non-normal, or categorical ordered or ranked data, or data from problematic or unknown distributions. I briefly reviewed some implementations in SAS, and particularly focused on the Kruskal-Wallis test, which has the following general null hypothesis:

Ho: All Populations Are Equal
Ha: Not All Populations Are Equal

If we reject Ho, we might conclude that there is a difference among populations, with one population or another providing a larger proportion of larger or smaller values for the variable of interest. If we could assume that the populations were of similar shape and symmetry, this *might* be interpreted as a test of differences in medians, but in general this is a test of differences in distributions, and specifically ranks, similar to the MWW test. But if we do reject Ho, what next? In an analysis of variance context, if we reject the overall F-test on multiple means we can follow up with pairwise comparisons to determine which means differ. But, at least in older versions of SAS, there are no straightforward ways to do this kind of analysis in the non-parametric context. However, in the SAS Note (22620), one recommendation is to rank-transform the data and use the normal-theory methods in PROC GLM (Iman, 1982). See also Conover, W. J. & Iman, R. L. (1981), referenced below.

A good example of the application of GLM on ranked data can be found here: http://people.stat.sc.edu/Hitchcock/soil_KW_sasexample705.txt 

and a general overview of some non-parametric applications in SAS along these lines here.

You can also find a SAS macro with code and examples for post hoc tests here: http://www.alanelliott.com/kw/

At first I thought this was the macro by Juneau (in the references below and mentioned in the SAS note above), but it is something different; see the Elliott and Hynan reference below. From the abstract:

"The Kruskal-Wallis (KW) nonparametric analysis of variance is often used instead of a standard one-way ANOVA when data are from a suspected non-normal population. The KW omnibus procedure tests for some differences between groups, but provides no specific post hoc pair wise comparisons. This paper provides a SAS(®) macro implementation of a multiple comparison test based on significant Kruskal-Wallis results from the SAS NPAR1WAY procedure. The implementation is designed for up to 20 groups at a user-specified alpha significance level. A Monte-Carlo simulation compared this nonparametric procedure to commonly used parametric multiple comparison tests."

I found an application referencing this implementation here if interested.

According to the SAS note referenced above, SAS/STAT 12.1 will include versions of some non-parametric post hoc tests. I'm also aware that there are several R packages that can do this, such as the dunn.test package.

I compared results from Elliott and Hynan's example code (example 1) and data to those from the ad hoc GLM on ranks following Hitchcock and got similar results. I also got similar results using dunn.test in R:

# use same data as in www.alanelliott.com/kw
race <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
bmi <- c(32,30.1,27.6,26.2,28.2,26.4,23.1,23.5,24.6,24.3,24.9,25.3,23.8,22.1,23.4)
library(dunn.test)  # load package
dunn.test(bmi, race, kw = TRUE, method = "bonferroni")  # Kruskal-Wallis test plus Dunn's pairwise comparisons with Bonferroni adjustment
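
For comparison, a minimal sketch of the rank-transform approach described above (Conover & Iman, 1981), applied to the same toy data using base R:

rbmi <- rank(bmi)                          # rank-transform the response
summary(aov(rbmi ~ factor(race)))          # normal-theory ANOVA on the ranks (analogous to Kruskal-Wallis)
pairwise.t.test(rbmi, factor(race), p.adjust.method = "bonferroni")   # post hoc pairwise comparisons on ranks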

Palomares-Rius JE, Castillo P, Montes-Borrego M, Navas-Cortés JA, Landa BB (2015) Soil Properties and Olive Cultivar Determine the Structure and Diversity of Plant-Parasitic Nematode Communities Infesting Olive Orchards Soils in Southern Spain. PLoS ONE 10(1): e0116890. doi:10.1371/journal.pone.0116890

Dunn, O.J. “Multiple comparisons using rank sums.” Technometrics 6 (1964) pp. 241-252.

Conover, W. J. & Iman, R. L. (1981). "Rank transformations as a bridge between parametric and nonparametric statistics". American Statistician 35 (3): 124–129. doi:10.2307/2683975

Elliott AC, Hynan LS. “A SAS Macro implementation of a Multiple Comparison post hoc test for a Kruskal-Wallis analysis,” Comp Meth Prog Bio, 102:75-80, 2011

Iman, R.L. (1982), "Some Aspects of the Rank Transform in Analysis of Variance Problems," Proceedings of the Seventh Annual SAS Users Group International Conference, 7, 676-680.

Juneau, P. (2004), "Simultaneous Nonparametric Inference in a One-Way Layout Using the SAS System," Proceedings of the PharmaSUG 2004 Annual Conference, Paper SP04.