Friday, April 26, 2013

SAS Global Forum 2013 Paper 144-2013: SAS IML Workshop

I didn't realize until now that the hands-on workshops also had accompanying papers! (I also just noticed this year that the same is true for the posters as well.)

This paper is a great intro to SAS/IML. (See my other posts with statistical programming applications in social network analysis, text mining, and maximum likelihood estimation here.)

Paper 144-2013
Getting Started with the SAS/IML® Language
Rick Wicklin, SAS Institute Inc.


Do you need a statistic that is not computed by any SAS® procedure? Reach for the SAS/IML® language! Many statistics are naturally expressed in terms of matrices and vectors. For these, you need a matrix language. This paper introduces the SAS/IML language to SAS programmers who are familiar with elementary linear algebra. The focus is on statements that create and manipulate matrices, read and write data sets, and control the program flow. The paper demonstrates how to write user-defined functions, interact with other SAS procedures, and recognize efficient programming techniques.


Thursday, April 18, 2013

Propensity Score Weighting: Logistic vs. CART vs. Boosting vs. Random Forests

I've yet to do a post on IPTW regressions, although I have been doing some applied work using them. I have found similar results comparing neural network, decision tree, logistic regression, and gradient boosting propensity score methods in applied examples. This paper provides more robust results using simulation.

Lee BK, Lessler J, Stuart EA (2011) Weight Trimming and Propensity Score Weighting. PLoS ONE 6(3): e18174. doi:10.1371/journal.pone.0018174

“Propensity score weighting is sensitive to model misspecification and outlying weights that can unduly influence results. The authors investigated whether trimming large weights downward can improve the performance of propensity score weighting and whether the benefits of trimming differ by propensity score estimation method. In a simulation study, the authors examined the performance of weight trimming following logistic regression, classification and regression trees (CART), boosted CART, and random forests to estimate propensity score weights. Results indicate that although misspecified logistic regression propensity score models yield increased bias and standard errors, weight trimming following logistic regression can improve the accuracy and precision of final parameter estimates. In contrast, weight trimming did not improve the performance of boosted CART and random forests. The performance of boosted CART and random forests without weight trimming was similar to the best performance obtainable by weight trimmed logistic regression estimated propensity scores. While trimming may be used to optimize propensity score weights estimated using logistic regression, the optimal level of trimming is difficult to determine. These results indicate that although trimming can improve inferences in some settings, in order to consistently improve the performance of propensity score weighting, analysts should focus on the procedures leading to the generation of weights (i.e., proper specification of the propensity score model) rather than relying on ad-hoc methods such as weight trimming.”
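As a rough illustration of the weighting and trimming step the abstract describes, here is a minimal Python sketch using only NumPy. The propensity scores are simulated (in practice they would come from logistic regression, boosted CART, etc.), and the 99th-percentile trimming cap is my own choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated treatment indicator and estimated propensity scores.
# In a real analysis e(x) would come from a fitted propensity model.
n = 1000
treated = rng.integers(0, 2, size=n)
propensity = np.clip(rng.beta(2, 5, size=n), 0.01, 0.99)

# Inverse probability of treatment weights (ATE weights):
# w = T / e(x) + (1 - T) / (1 - e(x))
weights = treated / propensity + (1 - treated) / (1 - propensity)

# Trim large weights downward at the 99th percentile.
cap = np.percentile(weights, 99)
trimmed = np.minimum(weights, cap)

print("max weight before trimming:", weights.max())
print("max weight after trimming: ", trimmed.max())
```

The paper's point is that this kind of ad-hoc capping helps when the weights come from a misspecified logistic model, but adds little when they come from boosted CART or random forests.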

Monday, April 15, 2013

SNA & Learning Communities

This week I'm at the CPE Student Success Summit. Learning communities are an ongoing theme at the conference. This article is one of the few that I've found that uses social network analysis metrics to investigate student learning communities. Are centrality measures good indicators of integration?

Phys. Rev. ST Physics Ed. Research 8, 010101 (2012) [9 pages]

Investigating student communities with network analysis of interactions in a physics learning center


"Developing a sense of community among students is one of the three pillars of an overall reform effort to increase participation in physics, and the sciences more broadly, at Florida International University. The emergence of a research and learning community, embedded within a course reform effort, has contributed to increased recruitment and retention of physics majors. We utilize social network analysis to quantify interactions in Florida International University's Physics Learning Center (PLC) that support the development of academic and social integration. The tools of social network analysis allow us to visualize and quantify student interactions and characterize the roles of students within a social network. After providing a brief introduction to social network analysis, we use sequential multiple regression modeling to evaluate factors that contribute to participation in the learning community. Results of the sequential multiple regression indicate that the PLC learning community is an equitable environment as we find that gender and ethnicity are not significant predictors of participation in the PLC. We find that providing students space for collaboration provides a vital element in the formation of a supportive learning community." 

Wednesday, April 10, 2013

SAS Global Forum Paper 089-2013 (CART)

This was a nice paper illustrating and explaining CART (classification and regression trees).

089-2013 Using Classification and Regression Trees (CART) in SAS® Enterprise Miner™ for Applications in Public Health

"They (CARTs) are typically model free in their implementation. Howbeit, a model based statistic is sometimes used for a splitting criterion. The main idea of a classification tree is a statistician’s version of the popular twenty questions game. Several questions are asked with the aim of answering a particular research question at hand. However, they are advantageous because of their non -parametric and non- linear nature. They do not make any distribution assumptions and treat the data generation process as unknown and do not require a functional form for the predictors. They also do not assume additivity of the predictors which allows them to identify complex interactions. Tree methods are probably one of the most easily interpreted  statistical techniques. They can be followed with little or no understanding of Statistics and to a certain extent follow the decision process that humans use to make decisions. In this regard, they are conceptually simple yet present a powerful analysis (Hastie et al 2009)."

Interesting SAS Global Forum 2013 Papers

I recently noticed almost all of the papers for SASGF13 are posted. Instead of browsing the conference materials (which are larger than my local telephone directory; it's a huge conference), I decided to start by browsing the papers (which can be found in the proceedings). I can then refer back to this post when I start trying to actually map out which sessions I'll go to (and get back to the papers for the sessions I miss).

Most of the sessions and papers I typically like are in the areas of Statistics and Data Analysis or Data Mining and Text Analytics. These sessions and papers offer direct applications in SAS that I can immediately take back to my job and implement. There are often other papers throughout Pharma, Operations Research, and Financial Services that can also be really helpful.

Below are the titles and links to papers that I've found so far. Yes, this seems like a lot, but it's only a small portion of the total proceedings. There are tons of other sections related to business intelligence, data management, and programming. I'm not sure I'll fit all of the sessions I've found into a 3.5-day conference schedule.

Business Intelligence Applications
044-2013 A Data-Driven Analytic Strategy for Increasing Yield and Retention at Western Kentucky University Using SAS Enterprise BI and SAS® Enterprise Miner™

Operations Research

Statistics and Data Analysis

Data Mining and Text Analytics

Posters and Videos (papers included)

Monday, April 8, 2013

Using Advanced Analytics to Recruit Students for Improved Retention & Graduation

Forthcoming: SAS Global Forum (April 28–May 1, 2013)

Paper 044-2013
A Data-Driven Analytic Strategy for Increasing Yield and Retention at Western Kentucky University Using SAS Enterprise BI and SAS Enterprise Miner
Matt Bogard, Western Kentucky University, Bowling Green, KY


As many universities face the constraints of declining enrollment demographics, pressure from state governments for increased student success, as well as declining revenues, the costs of utilizing anecdotal evidence and intuition based on ‘gut’ feelings to make time and resource allocation decisions become significant. This paper describes how we are using SAS® Enterprise Miner to develop a model to score university students based on their probability of enrollment and retention early in the enrollment funnel so that staff and administrators can work to recruit students that not only have an average or better chance of enrolling but also succeeding once they enroll. Incorporating these results into SAS® EBI will allow us to deliver easy-to-understand results to university personnel.

Full text (PDF) available in the Proceedings of the SAS® Global Forum 2013 Conference.


The correct bibliographic citation for this publication is as follows:
SAS Institute Inc. 2013. Proceedings of the SAS® Global Forum 2013 Conference. Cary, NC: SAS Institute Inc.

Tuesday, April 2, 2013

Is the ROC curve a good metric for model calibration?

I previously discussed the use of the ROC curve as a tool for model assessment, particularly as a metric for discrimination. I stated that this metric (particularly the area under the ROC curve, or c-statistic) is used increasingly in the machine learning community and is preferred over other measures of fit like precision or the F1 score because it evaluates model performance across all considered cutoff values rather than at an arbitrarily chosen cutoff (Bradley, 1997). I still prefer this metric over one based on an arbitrary cutoff (like percentage of correct predictions, precision, recall, or the F1 score). However, if the goal is to use your predictive model to stratify your scored data into groups (as in a market segmentation application or this example), then the ROC curve may not be the best metric. The metric we are actually after is one that assesses model calibration (as discussed here). In the article 'Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction' we get several criticisms of using the ROC curve in this context:

'The c statistic also describes how well models can rank order cases and noncases, but is not a function of the actual predicted probabilities. For example, a model that assigns all cases a value of 0.52 and all noncases a value of 0.51 would have perfect discrimination, although the probabilities it assigns may not be helpful.'
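The quoted 0.52/0.51 example is easy to verify, because the c-statistic is just the fraction of (case, noncase) pairs in which the case gets the higher score; rank order is all that matters. A quick pure-Python check (the pairwise formulation below is a standard equivalent of the AUC, with ties counted half):

```python
def c_statistic(case_scores, noncase_scores):
    """AUC as the proportion of case/noncase pairs ranked correctly
    (ties count half)."""
    pairs = concordant = 0.0
    for c in case_scores:
        for nc in noncase_scores:
            pairs += 1
            if c > nc:
                concordant += 1
            elif c == nc:
                concordant += 0.5
    return concordant / pairs

# Every case scored 0.52, every noncase 0.51: "perfect" discrimination,
# even though the predicted probabilities are nearly useless.
auc = c_statistic([0.52] * 10, [0.51] * 10)
print(auc)  # 1.0
```

Every case outranks every noncase, so the c-statistic is 1.0 despite the model telling us essentially nothing about actual risk levels, which is exactly the criticism in the quote.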

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

The paper goes on to demonstrate that there is in fact a tradeoff between model discrimination (as measured by the ROC curve) and calibration.

In this context, we may prefer a metric based on calibration, like the Hosmer-Lemeshow test, but that test is often criticized for its sensitivity to group/category composition, low power in small samples, and hypersensitive, misleading behavior in large samples.

Here is one proposed solution (discussed previously):

From: NATIONAL QUALITY FORUM National Voluntary Consensus Standards for Patient Outcomes Measure Summary. (link)

"Because of the very large sample sizes studied here, a statistically significant Hosmer-Lemeshow statistic is not considered informative with respect to calibration. Although the HL statistic is uninformative, model calibration could still be assessed graphically. This could be done by comparing observed vs. predicted event rates within deciles of predicted risk."

The Assessment Score Rankings and Assessment Score Distribution tables from SAS Enterprise Miner are helpful in this regard.
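The decile comparison described above (which Enterprise Miner's Score Rankings table automates) can be sketched directly: sort by predicted risk, cut into ten groups, and compare mean predicted vs. observed event rates in each. A minimal NumPy sketch with simulated data that is well calibrated by construction (the outcomes are drawn from the predicted probabilities themselves):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predicted risks, and outcomes drawn from those risks,
# so predicted and observed rates should agree in every decile.
pred = rng.uniform(0.05, 0.95, size=10_000)
obs = rng.binomial(1, pred)

# Sort by predicted risk and split into deciles of predicted risk.
order = np.argsort(pred)
for d, idx in enumerate(np.array_split(order, 10), start=1):
    print(f"decile {d}: predicted {pred[idx].mean():.3f}  "
          f"observed {obs[idx].mean():.3f}")
```

For a poorly calibrated model the predicted and observed columns drift apart in some deciles even when the c-statistic looks fine, which is the graphical check the NQF recommends in place of the HL statistic.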

Additional References:
Nancy R. Cook. Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation. 2007; 115: 928–935.
Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997; 30(7): 1145–1159.