Sunday, September 30, 2012

The Robustness of SNA Metrics

Given the data and processing requirements of sociometric measures like betweenness, k-betweenness, or eigenvector centrality, and the fact that we may not always be able to characterize an entire network (due to missing nodes and edges), we might ask: how robust are metrics derived from incomplete network data? Degree centrality is much easier to obtain, in terms of both computation and data, than eigenvector centrality or even betweenness. Is degree similar enough to other measures of centrality to use in our analysis instead? Although not a complete literature review, the articles below investigate these issues.
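To make the degree vs. eigenvector question concrete, here is a minimal sketch in plain Python. It assumes a small simulated random graph rather than any of the data sets discussed below, and the helper names (random_graph, eigenvector_centrality, pearson) are my own for illustration, not from any of the articles or from a particular library.

```python
import random

def random_graph(n, p, seed=0):
    """Undirected random graph as an adjacency dict of sets (illustrative data only)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def degree_centrality(adj):
    """Raw degree: the number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); the shift avoids oscillation on bipartite graphs."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

adj = random_graph(n=60, p=0.1, seed=1)
deg = degree_centrality(adj)
eig = eigenvector_centrality(adj)
nodes = sorted(adj)
r = pearson([deg[v] for v in nodes], [eig[v] for v in nodes])
print(f"degree vs. eigenvector correlation: {r:.2f}")
```

Degree only needs a count of each node's ties, while the eigenvector score needs the whole adjacency structure and repeated passes over it, which is the cost difference described above.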

The stability of centrality measures when networks are sampled
Elizabeth Costenbader, Thomas W. Valente.
Social Networks 25 (2003) 283–307

"Our results indicate relatively high correlation, albeit in some instances substantial absolute differences, between actual network properties and those calculated on randomly selected sub-samples for some network measures. This indicates that under some circumstances researchers may be still be able to use network data for which some data are missing to study network properties or create network-based interventions. In other words, researchers who do not interview all members of a community or network may still be able to take advantage of some aspects of network theory and techniques."

Their remarks on eigenvector centrality are particularly interesting. 

"As noted previously, the stability of eigenvector centrality when calculated as a simple raw score may indicate that it is the preferred centrality measure when the network data are incomplete. However, the fact that sampling has less effect on this centrality measure may be due to the fact that in comparison to the other centrality measures, which measure the ones (i.e. the actual nominations), this measure is able to effectively capture the similarity of zeros. Since many of the studies restrict nominations to five people, there are a lot of zeros in the original networks. Consequently, eigenvector centrality as a simple raw score is less affected by sampling from the networks as the zeros are preserved."

How Correlated Are Network Centrality Measures?
Thomas W. Valente, PhD, University of Southern California, Department of Prevention Research, Los Angeles
Kathryn Coronges, MPH, University of Southern California, Department of Prevention Research, Los Angeles
Cynthia Lakon, PhD, University of Southern California, Department of Prevention Research, Los Angeles
Elizabeth Costenbader, PhD, Research Triangle Institute, Raleigh, North Carolina
Connect (Tor). 2008 January 1; 28(1): 16–26.

This article investigates the correlation among four centrality measures (degree, betweenness, closeness, and eigenvector) and calculates 9 versions of these measures for 58 existing social networks previously analyzed by Costenbader and Valente (2003).
From the article:
"We correlated the 9 measures for each network and then calculated the average correlation, standard deviation, and range across centrality measures. We also calculated the overall correlation and compared it by study to assess the degree of variation in average correlation between studies."

"We find strong but varied correlations among the 9 centrality measures presented here. The average of the average correlations was 0.53 with a standard deviation of 0.14, indicating that most correlations would be considered strong. The level of correlation among measures seems nearly optimal - too high a correlation would indicate redundancy and too low, an indication that the variables measured different things. The amount of correlation between degree, betweenness, closeness, and eigenvector indicates that these measures are distinct, yet conceptually related.

A summary of the correlations for degree, betweenness and eigenvector centrality as reported in the article can be found below:

Borgatti, S.P., Carley, K., and Krackhardt, D. (2006). Robustness of Centrality Measures under Conditions of Imperfect Data. Social Networks 28: 124–136.

Many empirical studies have examined the relationship between centrality measures across networks, and in the context of missing data, using data from actual networks. Borgatti et al. point out a limitation of this approach:

"A limitation of this approach is that the sampling errors contained in the data are likely to be systematic, but the pattern is unknown. Another limitation is that the sample of networks is necessarily very limited. To overcome these limitations, we take a statistical computational approach and examine robustness in a very large sample of random graphs, into which we introduce a controlled amount and kind of error."

In the article, measures of degree, betweenness, closeness and eigenvector centrality from 'sampled' networks were compared to the 'actual' values from the complete networks. These comparisons were based on 5 measures of robustness (discussed in the article).
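The general idea of comparing 'sampled' and 'actual' centrality under a controlled amount of error can be sketched as below. This is a simplified illustration, not the authors' simulation design: it assumes random edge deletion as the only kind of error, uses degree as the only centrality measure, and uses two stand-in robustness summaries (a correlation and a top-k overlap) that I picked for the example. They should not be read as the article's five measures.

```python
import random

def random_edges(n, p, rng):
    """Edge list of an undirected random graph on nodes 0..n-1."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def degrees(n, edges):
    """Degree of every node given an edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def top_overlap(true, observed, k):
    """Share of the true top-k nodes that also appear in the observed top-k."""
    def top(scores):
        return set(sorted(range(len(scores)), key=lambda v: -scores[v])[:k])
    return len(top(true) & top(observed)) / k

rng = random.Random(42)
n = 100
edges = random_edges(n, 0.08, rng)
true_deg = degrees(n, edges)

# Assumed error levels: the probability that each true edge goes unobserved.
for error in (0.1, 0.25, 0.5):
    kept = [e for e in edges if rng.random() >= error]
    obs_deg = degrees(n, kept)
    r = pearson(true_deg, obs_deg)
    overlap = top_overlap(true_deg, obs_deg, k=10)
    print(f"error={error:.2f}  correlation={r:.2f}  top-10 overlap={overlap:.1f}")
```

Because the error is introduced by the simulation itself, its amount and kind are known, which is the advantage over empirical networks that the quoted passage points to.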


"the four centrality measures behave virtually identically in the face of measurement error. This suggests that the distinction between local and global measures of centrality (Scott, 2000) is not as important as previously thought. These results are consistent with those of Everett and Borgatti (2004) who found that betweenness calculated on ego networks (a local measure) was, on average, nearly identical to betweenness calculated on the full network in which ego networks were embedded (a global measure)."

Seeding Strategies for Viral Marketing: An Empirical Comparison
Oliver Hinz, Bernd Skiera, Christian Barrot, & Jan U. Becker
Journal of Marketing, Volume 75, Number 6, November 2011

Consistent with the results from some of the previous research, this article concludes:

"Remarkably, we reveal that to target a particular subnetwork (e.g., students of a particular university, Study 2) with a viral marketing message, the use of the respective subnetwork’s sociometric measures is not absolutely required to implement the desired seeding strategies. Instead, because the sociometric measures of subnetworks and their total network are highly correlated, marketers can use the socio- metric measures of the total network, without undertaking the complex task of determining exact network boundaries. Conversely, this appealing result also allows marketers to feel confident in inferring the connectivity of a person in an overall network from information about his or her connec- tivity in a natural subnetwork."

Friday, September 28, 2012

Social Media Analytics

Here is a summary of some of my posts related to social media analytics, primarily text mining and social network analysis.

Using Twitter to Demonstrate Basic Concepts from Network Analysis

An Intuitive Approach To Text Mining with SAS IML

The Budget Compromise- Mining Tweets

An Introduction to Social Network Analysis using R and Netdraw

Using SNA in Predictive Modeling

Analysis of AgChat Facebook Group Members

Mining Tweets about Factory Farms (from my companion blog Economic Sense)

Analysis of AgChat Facebook Group Members

(this is an older post updated from 'drafts')

Created using R and the ‘members to .csv’ Facebook app
March 9, 2010

Disclaimer: This is for demonstration purposes only. There were actually 643 members of the #AgChat Facebook group to date, but the ‘members to .csv’ app limits data retrieval to 499 observations, so this represents only a sample of actual members. Observations with missing values for the variables in a given analysis are also omitted. For instance, only 24 of the available 499 observations had hometown data listed.

Breakdown by Gender

Representation by Hometown City and State

Augusta, Illinois
Chicago, Illinois
Indianapolis, Indiana
Hampton, Iowa
Miltonvale, Kansas
Louisville, Kentucky
Caneyville, Kentucky
Frankfort, Kentucky
Winnipeg, Manitoba (Canada)
Saginaw, Michigan
Deckerville, Michigan
Springfield, Missouri
Tecumseh, Oklahoma
Portland, Oregon
Fredrikstad, Ostfold (Norway)
Dallas, Texas
Selah, Washington
Union, West Virginia

# of Represented Members by Hometown State

Representation By Country
(Canada, Norway & the U.S.)

Thursday, September 27, 2012

Data Science and Automation

"New insights often emerge when data scientists and business executives (or anyone else with a strong domain expertise) discuss and brainstorm what questions to ask, what the results of the analysis actually mean, and what the next iteration should be. This is what lies behind the requirement for people skills or business acumen often included in the basket of skills expected of a data scientist"

Wednesday, September 5, 2012

SAS Enterprise Miner Demo

Not a bad demo for SAS EM. Some people would offer up the criticism that you can't just point and click your way through statistics or a model building process without a rigorous understanding of what is going on. I would agree. A good background in statistics, machine learning and research methodology is essential. I personally view statistics and machine learning as a language best communicated via code, not pictures and icons. Base SAS, SAS IML, and R will let you get your hands dirty if you really want to figure out what is going on. To me, the point of SAS EM isn't that SAS makes it possible to point and click your way through a problem without really understanding what's going on, but that given the appropriate background knowledge, you can expedite what can be a very tedious process. SAS also offers very good training and a predictive modeling certification program to go along with the software, both of which I highly recommend.

Generalized Method of Moments (Video)

I actually have not created a post related to the generalized method of moments. Until then, this video seems to provide a nice introduction if you are already comfortable with the concept of moments and method of moments estimation.

As noted, the full video and slides can also be found here: