After extracting the text from a sample of 1,500 tweets, I found the following 'most common' terms (raw output from R):
[1] "cnn" "boehner" "gop" "time" "words" "budget" "historic" "cuts" "billion" "cut"
[11] "cspan" "deal" "heh" "movement" "table" "parenthood" "planned" "tcot" "hee" "andersoncooper"
[21] "funded" "neener" "protecting" "painful" "alot" "party" "tea" "shutdown" "government" "beep"
[31] "jobs" "barackobama" "feels" "drinkers" "koolaid" "deficit" "facing" "trillion" "gown" "wedding"
[41] "healthcare" "spending" "2012" "agree" "plan" "compromise" "victory" "term" "tax" "decorating"
[51] "diy" "home" "bank" "dems" "biggest" "history" "hug" "civil" "hoo" "little"
[61] "38b" "tips" "life" "people" "suicide" "doesnt" "wars" "trump" "system" "books"
[71] "teaparty" "ventura" "etatsunis" "fair" "fight" "military" "actually" "win" "compulsive" "liars"
[81] "tbags" "revenue" "rights" "libya" "base" "elites" "house" "crisis" "housing" "hud"
[91] "dem" "nay" "yea"
I then clustered the terms using hierarchical clustering to get the following groupings:
Although it doesn't yield extremely revealing or novel results, the clusters do make sense: key politicians land in the same group as the terms 'government' and 'shutdown', and Republicans land in the same group as 'teaparty'-related terms. This at least validates, for me, the power of R's 'tm' text mining package. I'm in the ballpark, and a better-structured analysis could give better results. But this is just for fun.
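To see which terms ended up in which group, the dendrogram can be cut into clusters and the terms listed by cluster. A minimal sketch, assuming the hclust result ('fit') and the term-level data frame ('mydata.df') produced by the code in the update below; k = 5 matches the number of clusters used there.

```r
# Cut the dendrogram into 5 clusters and list the terms in each one.
# Assumes 'fit' (the hclust result) and 'mydata.df' (rows = terms)
# from the code further down.
groups <- cutree(fit, k = 5)
split(rownames(mydata.df), groups)
```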
I also ran some correlations, which is about as close to sentiment as I'm going to get on a first stab at text mining:
For 'budget' some of the more correlated terms included deal, shutdown, balanced, reached, danger.
For 'teaparty' some of the more correlated terms included boehner, farce, selling, trouble, teapartynation.
For 'barakobama' (as it appeared in the text data) some of the more correlated terms included appreciate, cooperation, rejects, hypocrite, achieved, lunacy.
For 'boehner' some of the more correlated terms included selling, notwinning, teaparty, moderate, retarded.
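These lists come from tm's findAssocs(), which reports terms whose frequencies across documents correlate with a given term above a cutoff. A minimal sketch, assuming the mydata.dtm matrix built in the code below; the 0.20 cutoff is illustrative.

```r
# Terms correlated with each term of interest, above a 0.20 correlation cutoff.
# Assumes the term-document matrix 'mydata.dtm' from the code below;
# raise the cutoff for a shorter, tighter list.
findAssocs(mydata.dtm, 'budget', 0.20)
findAssocs(mydata.dtm, 'teaparty', 0.20)
findAssocs(mydata.dtm, 'barakobama', 0.20)  # misspelling kept as it appeared in the tweets
findAssocs(mydata.dtm, 'boehner', 0.20)
```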
**UPDATE 1/23/2012**
Below is some R code similar to what I used for this project. While it may no longer make sense to look at tweets about the budget compromise, the code can be modified to investigate whatever current trend is of interest.
```r
###
### Read tweets from Twitter using ATOM (XML) format
###

# installation is only required once and is remembered across sessions
install.packages('XML')

# loading the package is required once each session
require(XML)

# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)

# paginate to get more tweets
for (page in c(1:15))
{
    # search parameter
    twitter_q <- URLencode('#OWS')
    # construct a URL
    twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
    # fetch remote URL and parse
    mydata.xml <- xmlParseDoc(twitter_url, asText=F)
    # extract the titles
    mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue,
                                 namespaces=c('s'='http://www.w3.org/2005/Atom'))
    # aggregate new tweets with previous tweets
    mydata.vectors <- c(mydata.vector, mydata.vectors)
}

# how many tweets did we get?
length(mydata.vectors)

# *------------------------------------------------------------------
# |
# |  analyze using tm package
# |
# *------------------------------------------------------------------

###
### Use tm (text mining) package
###
install.packages('tm')
require(tm)

# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))

# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)

# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)

# remove generic and custom stopwords
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)

# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)

# inspect the term-document matrix
mydata.dtm

# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)

# find word correlations
findAssocs(mydata.dtm, 'monsanto', 0.20)

# *------------------------------------------------------------------
# |
# |  create data frame for cluster analysis
# |
# *------------------------------------------------------------------

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)

# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(inspect(mydata.dtm2))

# inspect dimensions of the data frame
nrow(mydata.df)
ncol(mydata.df)

# scale the data and compute the distance matrix
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method="euclidean")

# hierarchical clustering using Ward's method
fit <- hclust(d, method="ward")

# display the dendrogram
plot(fit)

# cut the tree into 5 clusters
groups <- cutree(fit, k=5)

# draw the dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
```
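To point the same pipeline at a different topic, only the search parameter inside the loop needs to change; the hashtag below is a hypothetical placeholder.

```r
# Hypothetical example: substitute whatever hashtag or phrase is currently of interest.
# URLencode() takes care of the '#' and any spaces in the query string.
twitter_q <- URLencode('#jobs')
```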