After a week at SAS Global Forum, I've been pretty excited about some of the text mining presentations I got to see, and I couldn't wait to get back to work to try something. After getting home I found a tweet from @imusicmash sharing a post from the Heuristic Andrew blog with text mining code in R. I thought I'd use that code to mine tweets related to the budget compromise/government shutdown. I searched two hashtags, #budget and #teaparty, since these were interesting topics: I originally wanted to see either what teaparty supporters might be saying about the budget, or what others were saying about the teaparty and the potential government shutdown.
After extracting the text from a sample of 1,500 tweets, I found the following 'most common' terms (raw output from R):
[1] "cnn" "boehner" "gop" "time" "words" "budget" "historic" "cuts" "billion" "cut"
[11] "cspan" "deal" "heh" "movement" "table" "parenthood" "planned" "tcot" "hee" "andersoncooper"
[21] "funded" "neener" "protecting" "painful" "alot" "party" "tea" "shutdown" "government" "beep"
[31] "jobs" "barackobama" "feels" "drinkers" "koolaid" "deficit" "facing" "trillion" "gown" "wedding"
[41] "healthcare" "spending" "2012" "agree" "plan" "compromise" "victory" "term" "tax" "decorating"
[51] "diy" "home" "bank" "dems" "biggest" "history" "hug" "civil" "hoo" "little"
[61] "38b" "tips" "life" "people" "suicide" "doesnt" "wars" "trump" "system" "books"
[71] "teaparty" "ventura" "etatsunis" "fair" "fight" "military" "actually" "win" "compulsive" "liars"
[81] "tbags" "revenue" "rights" "libya" "base" "elites" "house" "crisis" "housing" "hud"
[91] "dem" "nay" "yea"
I then clustered the terms using hierarchical clustering to get the following groupings:
[dendrogram of term clusters]
Although it doesn't yield extremely revealing or novel results, the clusters do make sense, putting key politicians in the same group as the terms 'government' and 'shutdown', and putting republicans in the same group as 'teaparty'-related terms. This at least validates, for me, the power of R's 'tm' text mining package. I'm in the ballpark, and a better-structured analysis could give better results. But this is just for fun.
I also ran some correlations, which gets me about as close to sentiment as I'm going to get on a first-time stab at text mining:
For 'budget' some of the more correlated terms included deal, shutdown, balanced, reached, danger.
For 'teaparty' some of the more correlated terms included boehner, farce, selling, trouble, teapartynation.
For 'barakobama' (as it appeared in the text data) some of the more correlated terms included appreciate, cooperation, rejects, hypocrite, achieved, lunacy.
For 'boehner' some of the more correlated terms included selling, notwinning, teaparty, moderate, retarded.
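These correlations come from tm's findAssocs on the term-document matrix. Here is a minimal sketch of the kind of calls involved, assuming the mydata.dtm matrix built in the code below and an illustrative 0.20 correlation cutoff:
# terms correlated with 'budget' and 'teaparty' at or above the cutoff
findAssocs(mydata.dtm, 'budget', 0.20)
findAssocs(mydata.dtm, 'teaparty', 0.20)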
**UPDATE 1/23/2012**
Below is some R code similar to what I used for this project. While it may not make sense to look at tweets related to the budget compromise at this point, the code can be modified to investigate whatever current trend is of interest.
###
### Read tweets from Twitter using ATOM (XML) format
###
# installation is required only once and is remembered across sessions
install.packages('XML')
# loading the package is required once each session
require(XML)
# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)
# paginate to get more tweets
for (page in c(1:15))
{
# search parameter
twitter_q <- URLencode('#OWS')
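# note (my assumption, not part of the original code): the two-hashtag
# search described above could be replicated with an OR query, e.g.
# twitter_q <- URLencode('#budget OR #teaparty')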
# construct a URL
twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
# fetch the remote URL and parse the XML
mydata.xml <- xmlParseDoc(twitter_url, asText=FALSE)
# extract the titles
mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
# aggregate new tweets with previous tweets
mydata.vectors <- c(mydata.vector, mydata.vectors)
}
# how many tweets did we get?
length(mydata.vectors)
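# 15 pages at 100 tweets per page tops out around 1,500 tweets,
# matching the sample size mentioned above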
# *------------------------------------------------------------------
# |
# | analyze using tm package
# |
# |
# *-----------------------------------------------------------------
###
### Use tm (text mining) package
###
install.packages('tm')
require(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)
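# note: newer versions of tm require base functions to be wrapped, e.g.
# mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))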
# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)
# remove generic and custom stop words (the custom terms here are illustrative; tailor them to your search)
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
# find word correlations (substitute a term that appears in your data)
findAssocs(mydata.dtm, 'monsanto', 0.20)
# *------------------------------------------------------------------
# |
# |
# | create data frame for cluster analysis
# |
# *-----------------------------------------------------------------
# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)
# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))
# inspect dimensions of the data frame
nrow(mydata.df)
ncol(mydata.df)
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" in older versions of R
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
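As a quick check (my own addition, not in the original code), the cutree assignments can be split by cluster to list the terms in each group:
# list the terms assigned to each of the 5 clusters
split(names(groups), groups)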