Wednesday, October 24, 2012

Nonnegative Matrix Factorization and Recommender Systems

Albert Au Yeung provides a very nice tutorial on non-negative matrix factorization and an implementation in Python. This post is based very loosely on his approach. Suppose we have the following matrix of users and ratings on movies:

If we use the information above to form a matrix R, it can be decomposed into two matrices W and H such that R ≈ WH,

where R is an n x p matrix of users and ratings
            W is an n x r user feature matrix
            H is an r x p movie feature matrix

Similar to principal components analysis, the columns of W can be interpreted as latent user features, while the columns of H' (that is, the rows of H) can be interpreted as latent movie features. This factorization allows us to classify or cluster user types and movie types based on these latent factors.
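Before turning to the R example below, the factorization itself can be sketched in plain Python using Lee and Seung's multiplicative update rules (the same "lee" method the R code in this post calls). This is only a minimal illustration, not Au Yeung's implementation; the rank r = 2, the iteration count, and the random seed are arbitrary choices for this toy ratings matrix.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(R, r=2, iters=2000, eps=1e-9):
    """Factor R (n x p) into nonnegative W (n x r) and H (r x p)
    by Lee & Seung multiplicative updates on squared error."""
    random.seed(0)
    n, p = len(R), len(R[0])
    W = [[random.random() for _ in range(r)] for _ in range(n)]
    H = [[random.random() for _ in range(p)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W'R) / (W'WH)
        WtR = matmul(transpose(W), R)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtR[i][j] / (WtWH[i][j] + eps)
              for j in range(p)] for i in range(r)]
        # W <- W * (RH') / (WHH')
        RHt = matmul(R, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * RHt[i][j] / (WHHt[i][j] + eps)
              for j in range(r)] for i in range(n)]
    return W, H

# the same 5-user, 4-movie ratings matrix used in the R code
R = [[5, 4, 1, 1],
     [4, 5, 1, 1],
     [1, 1, 5, 5],
     [1, 1, 4, 5],
     [1, 1, 5, 4]]
W, H = nmf(R)
```

The rows of W then place each user in the two-dimensional latent feature space, and the columns of H do the same for each movie.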

For example, using the nmf function from the NMF package in R, we can decompose the matrix R above and obtain the following column vectors of H' (the rows of H).

We can see that the first column vector 'loads' heavily on the 'military' movies, while the second 'loads' more heavily on the 'western'-themed movies. These vectors form a 'feature space' for movie types. Each movie can be visualized in this space as a member of a cluster associated with its respective latent feature.

If a new user gives a high recommendation to a movie belonging to one of the clusters created by the matrix factorization, other movies belonging to the same cluster can be recommended.
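That clustering step can be sketched as follows. The loadings and movie titles here are made up for illustration (the post's figure is not reproduced); the first latent feature stands in for the 'military' theme and the second for the 'western' theme. Each movie is assigned to the feature it loads on most heavily, and a liked movie triggers recommendations from the same cluster:

```python
# hypothetical latent-feature loadings per movie (columns of H, r = 2)
H = {
    "Patton":     (2.1, 0.1),
    "Platoon":    (1.9, 0.2),
    "Tombstone":  (0.2, 2.0),
    "Unforgiven": (0.1, 2.2),
}

def cluster_of(movie):
    """Index of the latent feature this movie loads on most heavily."""
    loadings = H[movie]
    return max(range(len(loadings)), key=lambda k: loadings[k])

def recommend(liked_movie):
    """Other movies in the same latent-feature cluster."""
    c = cluster_of(liked_movie)
    return [m for m in H if m != liked_movie and cluster_of(m) == c]

print(recommend("Patton"))  # → ['Platoon']
```

A real system would rank within the cluster (e.g., by distance in the feature space) rather than return all members, but the grouping logic is the same.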


Matrix Factorization Techniques for Recommender Systems. Yehuda Koren (Yahoo Research), Robert Bell and Chris Volinsky (AT&T Labs—Research). IEEE Computer Society, 2009.

Matrix Factorisation: A Simple Tutorial and Implementation in Python. Albert Au Yeung.

R Code:

#   ------------------------------------------------------------------
#  | PROGRAM NAME: R nmf example
#  | DATE: 10/20/12
#  | PROJECT FILE: /Users/wkuuser/Desktop/Briefcase/R Programs
#  |----------------------------------------------------------------
#  | PURPOSE: very basic example of a recommender system based on
#  | non-negative matrix factorization
#  |------------------------------------------------------------------

library(NMF) # provides nmf(), basis(), coef()

# X ~ WH
# X is an n x p matrix
# W = n x r  user feature matrix
# H = r x p  movie feature matrix

# get ratings for 5 users on 4 movies
x1 <- c(5,4,1,1)
x2 <- c(4,5,1,1)
x3 <- c(1,1,5,5)
x4 <- c(1,1,4,5)
x5 <- c(1,1,5,4)

R <- as.matrix(rbind(x1,x2,x3,x4,x5)) # n = 5 rows, p = 4 columns

res <- nmf(R, 4, "lee") # Lee & Seung method, rank r = 4
V.hat <- fitted(res)
print(V.hat) # estimated target matrix

w <- basis(res) # W, the user feature matrix
dim(w) # n x r (n = 5, r = 4)

h <- coef(res) # H, the movie feature matrix
dim(h) # r x p (r = 4, p = 4)

# recommender system via clustering based on the rows of H
movies <- data.frame(t(h))
features <- cbind(movies$X1, movies$X2)
plot(features) # plot movies in the space of the first two latent features
title("Movie Feature Plot")

Thursday, October 18, 2012

Get a Data Science Attitude

Do the following terms mean anything to you?

load balance, toggle, join, index, normalize, key

If you are a statistician and aspiring data scientist, they should. If not, this is one area where you should expand your knowledge base. In her article 'Being a data scientist is as much about IT as it is analysis', Carla Gentry explains why.

"With knowledge of the client's IT setup from a data management/quality perspective, you'll be equipped to handle most situations you run into when dealing with data, even if the Architect and Programmer are out sick. Your professional knowledge is going to be a big help in getting the assignment or job complete."

If all you want is a cooked-to-order data set from IT so you can run a regression, that's not the attitude or the skill set that employers have in mind when they seek out data scientists. There are plenty of jobs in academia for traditional statisticians, but that's not what Hal Varian was talking about when he said that the sexy job in the next 10 years will be statisticians.

This reminds me of an article I read not long ago about building data science teams:

"Most of the data was available online, but due to its size, the data was in special formats and spread out over many different systems. To make that data useful for my research, I created a system that took over every computer in the department from 1 AM to 8 AM. During that time, it acquired, cleaned, and processed that data. Once done, my final dataset could easily fit in a single computer's RAM. And that's the whole point. The heavy lifting was required before I could start my research. Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem."

Saturday, October 13, 2012

BMC Proceedings: A comparison of random forests, boosting and support vector machines for genomic selection

A very cool combination of machine learning/quantitative genetics/bioinformatics

"Genomic selection (GS) involves estimating breeding values using molecular markers spanning the entire genome. Accurate prediction of genomic breeding values (GEBVs) presents a central challenge to contemporary plant and animal breeders. The existence of a wide array of marker-based approaches for predicting breeding values makes it essential to evaluate and compare their relative predictive performances to identify approaches able to accurately predict breeding values. We evaluated the predictive accuracy of random forests (RF), stochastic gradient boosting (boosting) and support vector machines (SVMs) for predicting genomic breeding values using dense SNP markers and explored the utility of RF for ranking the predictive importance of markers for pre-screening markers or discovering chromosomal locations of QTLs."

Tuesday, October 9, 2012

Non-Negative Matrix Factorization

From Matrix Factorization Techniques for Recommender Systems:

"Modern consumers are inundated with choices. Electronic retailers and content providers offer a huge selection of products, with unprecedented opportunities to meet a variety of special needs and tastes. Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more retailers have become interested in recommender systems, which analyze patterns of user interest in products to provide personalized recommendations that suit a user’s taste. Because good personalized recommendations can add another dimension to the user experience, e-commerce leaders like and Netflix have made recommender systems a salient part of their websites."

"matrix factorization models are superior to classic nearest-neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels"


Saturday, October 6, 2012

Big Data- A Visualization

Three features characterize 'big data': volume, variety, and velocity (1).

Volume: With the automation of so many products, electronic sensors, GPS, mobile apps, networked equipment, log files, and social media, more data is being generated than ever before, creating challenges for traditional architectures in terms of how to store and access it.

Variety: Not only do we still utilize traditional data sources (like customer databases), but social media and automation are also creating new forms of data, each of which must be handled uniquely in order to be analyzed in some way to inform strategy and decision making.

Velocity: Because so much of this data is generated in real time, our time preference for analysis also becomes more present-oriented. We may need to get ahead of trends quickly. This calls for tools and architectures that can give us real-time analysis.

References and Further Reading:

Big Ag Meets Big Data (Part 1 & Part 2)

1. IDC. "Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO," September 2011.

2. With Hadoop, Big Data Analytics Challenges Old-School Business Intelligence. Doug Henschen, Information Week

3. Big Bets On Big Data. Eric Savitz, Forbes.

Creative Commons Image Attributions:

Handheld GPS: Paul Downey from Berkhamsted, UK (Earthcache De Slufter, uploaded by Partyzan_XXI) [CC-BY-2.0], via Wikimedia Commons

Satellite: NAVSTAR-2 (GPS-2) satellite, public domain (PD-USGov-Military-Air Force)

Tractor: bdk [CC-BY-SA-3.0], via Wikimedia Commons

Economists as Data Scientists


Economists are Like Scientists

In Greg Mankiw’s principles of economics textbooks, he proposes that economists are like scientists in that they develop theories and subsequently gather data to test those theories empirically. Econometrics is the empirical aspect of economics. In general, econometrics is focused on hypothesis testing of causes and effects. The goal is typically deriving estimators with desirable properties appropriate for making inferences. As described by Tim Harford:

" econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

Some of the tools of econometrics include linear regression, logit/probit models, instrumental variables, and time series.

Data Scientists

As presented by Drew Conway, data science is a combination of hacking skills, math and statistics knowledge, and substantive expertise.

A recent post via the Harvard Business Review blog gives some practical examples of the capabilities of a data scientist:

"They can suck data out of a server log, a telecom billing file, or the alternator on a locomotive, and figure out what the heck is going on with it. They create new products and services for customers. They can also interface with carbon-based lifeforms — senior executives, product managers, CTOs, and CIOs. You need them." - Can You Live Without a Data Scientist? - Harvard Business Review

 Hal Varian, Google's Chief Economist describes the skills of a data scientist as follows:
"Database and data manipulation or how to shuffle data around and move things from place to place; statistics and statistical analysis; machine learning; visualization, or how to present data in a meaningful way; and communication or being able to describe what’s going on."
While data scientists certainly rely on a strong foundation in statistics, and may in fact utilize some of the same tools of inferential statistics used by econometricians, data scientists most often follow a different path. As described by Leo Breiman:

"There are two cultures in the use of statistical modeling to reach conclusions from data”

The traditional statistical/econometric culture:

"assumes that the data are generated by a given stochastic data model."

vs. the machine learning/data mining culture:

"uses algorithmic models and treats the data mechanism as unknown."

Because of the nature of the data and the problems solved by data scientists, they very often use algorithmic methods to obtain desired solutions. Typically this is not a situation that calls for the types of estimators with desirable properties leading to empirically sound inferences sought by econometricians, but often the concern is simply making accurate predictions or discovering informative patterns in the data.

Economist Scott Nicholson (Chief Data Scientist at Accretive Health and formerly at LinkedIn) comments on the differences between economists and data scientists: 

 "In terms of applied work, economists are primarily concerned with establishing causation. This is key to understanding what influences individual decision-making, how certain economic and public policies impact the world, and tells a much clearer story of the effects of incentives. With this in mind, economists care much less about the accuracy of the predictions from their econometric models than they do about properly estimating the coefficients, which gets them closer to understanding causal effects.At Strata NYC 2011, I summed this up by saying: If you care about prediction, think like a computer scientist, if you care about causality, think like an economist."
The algorithms used by data scientists come from the machine learning and data mining paradigm, and often include neural networks, decision trees, support vector machines, association rules, and others.

These approaches may not be very familiar to economists, but their training in statistics and mathematics makes these techniques very accessible. Take, for instance, logistic regression. This technique is very familiar to most economists, and is in fact often used by data scientists to solve classification problems. Moreover, as Peter Kennedy describes in A Guide to Econometrics, neural networks (with logistic activation functions) can be thought of as a weighted average of logit functions. And, if the econometrician understands how logistic regression parameters are estimated (based on maximum likelihood, with estimation implemented via Newton’s Method), it’s not that difficult to grasp gradient descent or even the backpropagation algorithm used in neural networks.
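To make that connection concrete, here is a minimal sketch (in plain Python, on made-up toy data) of fitting a logistic regression by batch gradient descent on the negative log-likelihood. The same sigmoid-plus-gradient machinery reappears when training a neural network with logistic activations; the learning rate and epoch count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.5, epochs=2000):
    """Gradient descent on the negative log-likelihood.
    Returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(epochs):
        db, dw = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            # prediction error: fitted probability minus observed label
            err = sigmoid(b + sum(wj * xij for wj, xij in zip(w, xi))) - yi
            db += err
            for j in range(p):
                dw[j] += err * xi[j]
        b -= lr * db / n
        w = [wj - lr * dwj / n for wj, dwj in zip(w, dw)]
    return b, w

# toy one-regressor data: label 1 when x is above 1
X = [[0.0], [0.5], [1.5], [2.0]]
y = [0, 0, 1, 1]
b, w = fit_logit(X, y)
```

Newton's Method would use second-derivative (Hessian) information to converge in far fewer iterations; gradient descent trades that convergence speed for the simplicity that lets the same update generalize to neural networks.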

Similarly, as econometrics is written in the language of calculus and linear algebra, so is machine learning (for more details, see the popular machine learning text The Elements of Statistical Learning: Data Mining, Inference, and Prediction). Some of the mathematical concepts used in advanced microeconomic theory (inner products, separating and supporting hyperplanes, and quadratic programming, for example) are also very useful when it comes to understanding support vector machines.

In conclusion, most economists trained in econometrics have two of the three elements that comprise data science: substantive expertise (economic theory) and knowledge of mathematics and statistics. Supplementing their quantitative skills with hacking skills (data management, manipulation, cleaning, and loop and array processing, etc., via a language like SAS/SQL, MATLAB, or R) and familiarity with machine learning algorithms would open the door for many trained in economics and statistics to employ their skills as data scientists.


'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science, 2001, Vol. 16, No. 3, 199–231), discussed in Culture War: Classical Statistics vs. Machine Learning. Matt Bogard. Econometric Sense.

Data Scientist: The Sexiest Job of the 21st Century. HBR. 

Exclusive: Scott Nicholson Interview: Data Science, Economics, Weather, LinkedIn, and Healthcare  

Google’s chief economist examines the data scientist factor

Tuesday, October 2, 2012

Modern Ag Meets Big Data

"The major problem we keep on seeing — especially in bigger, modern farms — is that there's a lot of data being created and not being used, on how they're performing, what they're doing."

Amazon, Hadoop, and Farm Management via Big Data


"We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture," says David Friedberg, chief executive of the Climate Corporation, a name WeatherBill started using Tuesday. "We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat."

Big Data Opportunities In Agriculture, Algo Trading, and Futures


"In agriculture the data patterns are changing radically. I bet you never thought about farming as a big data opportunity. Next generation farm equipment like combines and tillers are going to be able to take soil samples as they move along, perform analysis on those samples, and feed the results of the analysis back to the manufacturer for crunching on a macro scale. This will result in a better understanding of what is happening in that entire area and make it possible to adjust things like the amount or types of fertilizer and chemicals that should be applied. If the farm equipment manufacturers figure out how to harness all this information, this kind of big-picture analysis could change the commodity trading markets forever."