Friday, April 22, 2011

ARIMA Model Forecast of Annual Fertilizer Use in KY

I recently found some data on annual fertilizer use (measured by an index) in KY. This looked like a good series to try out  time series analysis in R (see program documentation for data source)as well as provide a basic demo for my earlier post on time series analysis.

The series is fertilizer use in Kentucky from 1960-2004, and is plotted below:

A look at the autocorrelation plot indicates a sinusoidal decay toward zero, which indicates an autoregressive process and partial autocorrelations show a cut off after the first lag, representing a possible AR(1) process(Newbold & Bos, 1990).

Fitting the ARMA(0,0,1) model in R based on values through 2003, yields a forecast of 2.447325 for 2004 vs. the actual observed value of 2.49. This is a much closer estimate than the naive estimate based soley on the previous value for 2003, which was 2.63. The forecast for 2005 is 2.298228, for which I have no data to verify. It would be interesting to build a more complex model that incorporates additional data, perhaps a vector auto-regression.

R code for for this post:
# *------------------------------------------------------------------
# |  forecast for annual fertilizer use in KY (based on index)           
# *-----------------------------------------------------------------
# data source:
# read in year/time variable
yr <- c(1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004)
# read in fertilizer index
fertilizer <- c(1.4279, 1.4114, 1.5069, 1.6599, 1.8183, 1.8229, 2.1135, 1.9576, 1.3736, 1.5381, 1.5041, 1.5090, 1.4749, 1.6329, 1.9789, 1.8462, 2.0444, 1.9997, 1.9384, 1.8392, 2.1652, 1.9528, 1.4552, 1.3719, 1.3323, 1.3283, 1.5702, 1.5339, 1.4294, 1.4010, 1.4534, 1.4372, 1.4505, 1.5637, 1.4135, 1.6958, 1.7138, 1.7090, 2.3354, 2.4032, 2.7095, 1.9193, 2.3007, 2.6395, 2.4940)
kydat <- data.frame(yr,fertilizer) # create data frame, necessary for subsetting by yr
names(kydat) # list var names in data frame
dim(kydat)  # rows and columns
kydat2 <- kydat[kydat$yr<2004,] # subset to remove last 5 values for forecasting
dim(kydat2) # rows and columns
ky_ftlzr <- ts(kydat2) # convert to a time series object
plot(ky_ftlzr[,2]) # plot the series objects
acf(ky_ftlzr[,2], 40,xlim=c(1,40)) # autocorrelation plot- sinusoidal decay to zero -indicates autoregressive process 
pacf(ky_ftlzr[,2],40, xlim=c(1,40)) # consistent with a AR(1) model
# model 
( = arima(ky_ftlzr[,2], order = c(1,0,0))) # fit a ARIMA(0,0,1)) AIC = 5084.73
tsdiag(, gof.lag = 20) # ARIMA model diagnostics
x.fore = predict(, n.ahead=2) # forecast the next 5 values
print(x.fore) # 2.447325 2.298228
# compare these values to the observed value
print(kydat[kydat$yr>=2004,]) # 2004 = 2.49
Created by Pretty R at

Saturday, April 9, 2011

The Budget Compromise: Mining Tweets

After a week as SAS Gobal Forum, I've been pretty excited about some of the text mining presentations that I got to see. I couldn't wait to get back to work to at least try something.  After getting home I found a tweet from @imusicmash sharing a post from the Heuristic Andrew blog that shared text mining code from R.  I thought I'd use that code to mine tweets related to the budget compromise/government shutdown.  I searched two hashtags, #budget and #teaparty. I originally wanted to see if I could find out either what teaparty supporters may be saying about the budget, or maybe what others were saying about the teaparty and the potential government shutdown. (since these were interesting topics).

After extracting the text from a sample of 1500 tweets I found the following 'most common' terms. (in the raw from R)

[1] "cnn"            "boehner"        "gop"            "time"           "words"          "budget"         "historic"       "cuts"           "billion"        "cut"          
[11] "cspan"          "deal"           "heh"            "movement"       "table"          "parenthood"     "planned"        "tcot"           "hee"            "andersoncooper"
[21] "funded"         "neener"         "protecting"     "painful"        "alot"           "party"          "tea"            "shutdown"       "government"     "beep"         
[31] "jobs"           "barackobama"    "feels"          "drinkers"       "koolaid"        "deficit"        "facing"         "trillion"       "gown"           "wedding"      
[41] "healthcare"     "spending"       "2012"           "agree"          "plan"           "compromise"     "victory"        "term"           "tax"            "decorating"   
[51] "diy"            "home"           "bank"           "dems"           "biggest"        "history"        "hug"            "civil"          "hoo"            "little"       
[61] "38b"            "tips"           "life"           "people"         "suicide"        "doesnt"         "wars"           "trump"          "system"         "books"        
[71] "teaparty"       "ventura"        "etatsunis"      "fair"           "fight"          "military"       "actually"       "win"            "compulsive"     "liars"        
[81] "tbags"          "revenue"        "rights"         "libya"          "base"           "elites"         "house"          "crisis"         "housing"        "hud"          
[91] "dem"            "nay"            "yea"    

I then clustered the terms, using hierarchial clustering to get the following groupings: (click to enlarge)

Although it doesn't yield extremely revealing or novel results, the clusters do make sense, putting key politicians in the same group as the terms 'government' and 'shutdown' and putting republicans in the same group as 'teaparty' related terms. This at least validates for me, the power of R's 'tm' textmining package. I'm in the ball park, and a better structured analysis could give better results. But this is just for fun. 

I also ran some correlations, which gets me about as close to sentiment as I'm going to get for a first time stab at text mining:

For 'budget' some of the more correlated terms included deal, shutdown, balanced, reached, danger.

For 'teaparty' some of the more correlated terms included boehner, farce, selling,  trouble, teapartynation.

For 'barakobama' (as it appeared in the text data) some of the more correlated terms included appreciate, cooperation, rejects, hypocrite, achieved, lunacy.

For 'boehner' some of the more correlated terms included selling, notwinning, teaparty, moderate, retarded.

**UPDATE 1/23/2012 **
Below is some R-code similar to what I used for this project. While it may not make sense to look at tweets related to the budget compromise at this time, it can be modified to investigate whatever current trend of interest.

### Read tweets from Twitter using ATOM (XML) format
# installation is required only required once and is rememberd across sessions
# loading the package is required once each session
# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)
# paginate to get more tweets
for (page in c(1:15))
    # search parameter
    twitter_q <- URLencode('#OWS')
    # construct a URL
    twitter_url = paste('',twitter_q,'&rpp=100&page=', page, sep='')
    # fetch remote URL and parse
    mydata.xml <- xmlParseDoc(twitter_url, asText=F)
    # extract the titles
    mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces =c('s'=''))
    # aggregate new tweets with previous tweets
    mydata.vectors <- c(mydata.vector, mydata.vectors)
# how many tweets did we get?
# *------------------------------------------------------------------
# |                
# | analyze using tm package 
# |   
# |  
# *-----------------------------------------------------------------
### Use tm (text mining) package
# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))
# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)
# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the document-term matrix
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
# find word correlations 
findAssocs(mydata.dtm, 'monsanto', 0.20)
# *------------------------------------------------------------------
# |                
# |
# |  create data frame for cluster analysis 
# |  
# *-----------------------------------------------------------------
# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)
# convert the sparse term-document matrix to a standard data frame
mydata.df <-
# inspect dimensions of the data frame
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendogram?
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
Created by Pretty R at

Wednesday, April 6, 2011

Topics Related to Linear Models

While attending a session at this year’s SAS Global Forum, I was reminded of several topics that I have not covered in previous posts, but should have.

Fixed and Random Effects Models and Panel Data (see also Mixed, Fixed and Random Effects)

Notes: Panel data, or repeated measures data, is characterized by multiple observations on individuals over time.  A consequence of this is that measurements on the same individual will likely be more correlated than measurements for (or between) different individuals.  Measurements taken closer together over time will also likely be more correlated than measurements taken at greater intervals. As a result, assumptions from OLS regression regarding independence and homogenous variance will likely be violated.  As noted during the session I attended, in SAS, PROC MIXED appropriately handles within subject and time dependent correlations and the covariance structure associated with repeated measures.

Let’s structure the model as follows: Yit = Xit + Ait +Uit

Where Ait is a unobserved individual effect. There are two assumptions that we can work under when estimating this model,

1)Random Effects (RE): Assumes that Ait is independent of X . X may be considered a  predetermined, or other fixed effect. Ait is a random effect.

Estimation: Feasible Generalized Least Squares-  Bfgls = (X’W-1X)-1X’W-1y
i.e V(e) != σ2  I  which would be the case under OLS

2)Fixed Effects (FE): Assumes Ait is not independent of X.

Estimation: differencing or subtracting the respective means from X and Y  and running OLS on the adjusted data produces the estimate of B.

You could also subtract the lagged version of each variable X and Y from itself respectively to move the time-correlated component, or you could remove Ait through dummy variable regression.

Fixed and Random Effects in Analysis of Variance (in general)

In my previous posts related to AOV and Mixed Models,  I did not discuss these in the context of  AOV.

Typically fixed effects are repeatable factors that are set by the experimenter,  often the ‘treatments.’  Random effects are effects that are selected randomly from a population, often ‘blocks.’

Type I and III Sums of Squares

Type I Sums of Squares: often referred to as sequential sums of squares, this is the SS for each effect adjusted for all effects that appear earlier in the model.

Y1 = X1B1                       

Type I SS = SSR(Y1) = SS(x1)

Y2 = X1B1 + X2B2

Type I SS = SSR(Y2) – SSR(Y1) = SS(X2)

Type III sums of squares are adjusted for every X in the full model.


Y1 = B1X1 + B2X2
Y2 = B1X1

Type III SS = SSR(Y1) – SSR(Y2) = SS(X1)

Y3 = B2X2

Type III SS = SSR(Y1) – SSR(Y3) = SS(X2)

Means vs. LS Means

Means = overall mean for a treatment or factor level

LS Means = within group means adjusted for other effects in the model

Type I SS are used to test differences between means while type III SS are used to test differences between LS Means.