Thursday, November 10, 2011

Singular Value Decomposition and Text Mining

Single Value Decomposition (SVD) is a concept from linear algebra based on the following matrix equation:
A = USV which states that a rectangular matrix A can be decomposed into 3 other matrix components:

U  consists of the orthonormal eigenvectors of AA’,  where U’U =  I (recall from linear algebra, orthogonal vectors of unit length are ‘orthonormal’)
V consists of the orthonormal eigenvectors of A’A
S  is a diagonal matrix consisting of the square root of the eigenvalues of U or V (which are equal). The values in S depict the variance of linearly independent components along each dimension similarly to the way eigenvalues depict variance explained by ‘factors’ or components in principle components analysis (PCA).

(below is an entertaining video depicting SVD)

In relation to text mining, SVD provides the mathematical foundation for text mining and classification techniques generally known as latent semantic indexing. In SVD, the matrix A is typically a word x document matrix, it is a way of representing your document and text as a highly dimensional vector space model, also referred to as hyperspace document representation.  Similarly  to PCA , SVD takes high dimensional highly variable data and reduces it to a lower dimensional space that more clearly depicts the underlying structure of the data.  SVD reduces noise and redundancy in the data leaving you with new dimensions that capture the essence of existing relationships.   With regard to text mining, SVD has the following interpretation:

-documents are represented as rows in V
-document similarity can be determined by examining rows in VS
- words are represented as rows in U
- word similarity can be determined by examining the rows in the matrix US


SAS Text Miner: Theory and Practice at UnitedHealthcare
Mark Pitts and Chris Clark, UnitedHealthcare (Presentation at Analytics 2011 Conference)

Singular Value Decomposition Tutorial. Kirk Baker. March 29, 2005.

No comments:

Post a Comment