Saturday, May 5, 2012

An Intuitive Approach to Text Mining with SAS IML

Text Mining- in Plain English: Turning Text into Numbers

With Twitter, Facebook, email, online forums, open response surveys, customer and reader comments on web pages and news articles etc. there is a lot of information available to companies and organizations in the form of text. Without hiring experts to read through all of the thousands of pages worth of text available and making subjective claims about its meaning, text mining allows us to take otherwise unusable 'qualitative' data and convert it into quantitative measures that we can use for various types of reporting and modeling. In the example below I demonstrate how the mathematical technique of singular value decomposition  (SVD) can be used to do this.

Sometimes to assess statistical techniques, or to even understand them at a basic level, you need data with properties you understand. With numerical data, that typically might call for simulation. But since we are dealing with text, I just made some up that will work to clearly demonstrate the effectiveness of SVD.

Below you will see 10 hypothetical comments to the question- 'What is your take on pink slime?' Each person's response is considered a document, and I have classified each document as type 'H' or 'S' as explained below.  The goal will be to see if we can use a basic application of SVD to convert these comments into numbers and use them to predict what type of person  made which type of comments (i.e. does a comment belong in category H or S) based on clustering or some type of predictive model. This gets way beyond simply classifying comments by doing a key word search. SVD allows us to not only classify documents by the specific words they contain, but also by how similar they are.

 The purpose of this exercise is to provide intuition for text mining and the application of singular value decomposition. The text above is made up, and specifically designed to produce the results I'm after. I'm not making any claims one way or the other about how realistic this is. But suppose these are potential customers and we want to be able to distinguish between the hippies, who favor a more local nostalgic food supply from centuries past  (designated as class or type 'H') and the animal scientists, designated as type 'S.' Obviously comments of class H are critical of modern agriculture and technology.  We might want to make these distinctions for PR, marketing, lobbying,or educational outreach purposes.  

After cleaning up the text (which many software programs like SAS Enterprise Miner provide excellent tools for doing so)by eliminating parts of speech, articles, etc. we can form a term-document frequency matrix as follows:




Singular Value Decomposition

Singular Value Decomposition (SVD) is a concept from linear algebra based on the following matrix equation:

A = USV which states that a rectangular matrix A can be decomposed into 3 other matrix components:

U  consists of the orthonormal eigenvectors of AA’,  where U’U =  I (recall from linear algebra, orthogonal vectors of unit length are ‘orthonormal’)
V consists of the orthonormal eigenvectors of A’A
S  is a diagonal matrix consisting of the square root of the eigenvalues of U or V (which are equal). The values in S depict the variance of linearly independent components along each dimension similarly to the way eigenvalues depict variance explained by ‘factors’ or components in principle components analysis (PCA).

SVD provides the mathematical foundation for text mining.

A  term document matrix A can be decomposed as in:
A = USV’

If the term document matrix A were a collection of individuals’ textual responses or comments to a survey question (or Facebook post etc. ) then each individual’s response would be considered a ‘document’. The individual words they used in their response are the ‘terms.’ The term document frequency matrix consists of rows that represent each term and columns that represent each ‘document’ or individual. 

The  vectors in U can be used to score documents in the term-document matrix, as in U’A.  (this may be analogous to the way values of x are scored by eigenvectors in PCA)  As a result, a single numerical value or ‘score’ can be assigned to each person’s textual response (i.e. each document) for each SVD dimension (i.e. each kept independent vector in U). Thus the text is converted into numerical scores (via the transformation U’A) that can then be used in predictive modeling (as predictor variables) or  clustering can be utilized to cluster the individual textual responses (or ‘documents’).  So ultimately SVD converts  otherwise unuseful  text into numeric SVD scores or more interpretable clusters.  Likewise, using  the weights from the eigenvectors that comprise  V to score the terms in the term document matrix A, as in AV’ we can score  cluster like terms as ‘topics’. 

Thus SVD of A gives allows us to derive the following scores:

U’A = SVD document vectors 
AV’ = SVD term vectors

 This can be loosely demonstrated in using PROC IML in SAS. Specifying the term document frequency matrix in PROC IML and implementing SVD produces:









 
The document vectors U’A can be depicted as follows:


 
Here is the original text document with customer types/classes and the appended SVD scores. 

 As depicted above, SVD has allowed us to replace all of the text with the quantitative values for the associated SVD scores (derived from U'A). In this case, I'm representing all of the text with just two SVD dimensions. These values can then be used for clustering or entered into a predictive model.


 
It is easy to see that the different documents types (response types H vs S) cluster very well on the dimensions SVD1 and SVD2. In fact, all responses of type S have a value of SVD2 < .5.
The SVD values can also be entered into a regression or other type of predictive model.

 
Consistent with the observed relationship between SVD2 and class ‘S’ and ‘H’ clusters, we see that SVD2 is significant in the regression. In addition, it shows that higher values of SVD2 decrease the probability of being classified as document or response type ‘S’. Yes, this is OLS on a binary dependent variable, but again the purpose is to provide intuition and motivation for using text analytics for predictive modeling. 

This seemed to work ok on a small collection of documents carefully constructed to illustrate the concepts above, but tools like SAS Text Miner in conjunction with SAS Enterprise Miner are designed specifically to do this type of analysis on a much larger scale. I have used both of these tools on much larger document collections and obtained promising results using text topics from SVD in predictive modeling applications.

SAS CODE


proc iml;
       A = {0 0      0      1      0      0      0      0      0      0,
1      1      0      0      0      1      0      0      0      0,
0      0      0      0      0      1      1      1      0      2,
0      0      0      0      0      1      0      0      0      0,
0      1      1      0      0      0      0      0      0      0,
1      0      0      0      0      0      0      0      0      0,
0      0      1      0      1      0      0      0      0      0,
0      0      0      0      1      0      0      0      0      0,
0      0      0      0      1      0      0      0      0      0,
0      0      1      0      0      0      0      0      0      0,
0      0      0      0      0      1      0      1      0      1,
0      0      1      0      0      0      0      0      0      0,
0      1      1      1      0      0      0      0      0      0,
0      0      0      1      0      0      0      0      0      0,
0      0      0      0      0      0      1      1      0      0,
1      0      1      0      0      0      0      0      0      0,
0      0      0      0      0      0      1      0      0      0,
0      0      0      0      0      0      0      0      0      1,
0      0      0      0      0      1      0      0      0      0,
0      0      1      0      0      0      0      0      0      0,
1      0      0      0      1      0      0      0      0      0,
0      1      0      1      0      0      0      0      0      0,
0      1      0      0      0      1      0      0      0      0,
0      0      0      0      0      0      0      1      0      0,
0      0      0      0      0      0      0      0      0      1,
0      0      0      0      0      0      0      0      1      1,
0      0      0      0      0      0      0      0      0      1,
1      0      0      0      1      0      0      0      0      0,
0      0      0      0      0      0      1      1      1      0,
0      0      1      0      0      0      0      0      0      0,
0      0      0      0      0      0      0      0      1      0,
0      0      0      0      0      0      0      0      0      1,
0      0      0      0      0      1      0      1      0      0,
0      0      0      0      0      0      0      0      1      0
};
       print(A); /* print matrix A*/
       n = nrow(A); /* how many rows */
       p = ncol(A); /* how many columns */
       print(n);   
       print(p);
       call svd(u,s,v,A); /* sigular value decomposition of A = usv'*/
       print(u); /* independent eigenvectors of AA' */
       print(s); /*  independent eigenvectors of A'A */
       print(v); /*singular values (sqrt(eigenvalues)) of AA' or A'A */
       uTa = T(u)*A; /* document vectors*/
       avT = A*T(v); /* term vectors */
       print(uTa);
       print(avT);
    /* scoring a data set */
       ID = {1, 2, 3, 4, 5, 6, 7, 8, 9,10}; /* create a document id matrix */
       print(ID);
       docscores =ID||T(uTa); /* combine with svd results */
       print(docscores);
       /* export as a SAS data set */
       varnames = 'svd1':'svd10';
       create svd_scores from docscores[colname = varnames];
       append from docscores;
       close svd_scores;

quit;
run;

 *-------------------------------------------------------------------------------------*
 |  SCORING, CLUSTERING AND PREDICTIVE MODELING
 *-------------------------------------------------------------------------------------*;

* CREATE CLIENT DATA SET;

DATA CLIENTS;
       INPUT ID CLASS $ ;
       CARDS;
1      H
2      H
3      H
4      H
5      H
6      S
7      S
8      S
9      S
10  S
;
RUN;

* MERGE WITH SVD SCORES;

PROC SQL;
       CREATE TABLE CLIENTS_SCORED AS
       SELECT A.ID, A.CLASS, B.SVD2 AS SVD1, B.SVD3 AS SVD2
       FROM CLIENTS A LEFT JOIN SVD_SCORES B
       ON A.ID = B.SVD1;
QUIT;

PROC PRINT DATA = CLIENTS_SCORED;
RUN;


* CLUSTER/ VISUALIZE  CLIENTS BASED ON SVD SCORES;
PROC GPLOT DATA = CLIENTS_SCORED;
       PLOT SVD1*SVD2 = ID;
RUN;
QUIT;


* RECODE FOR NUMERIC DEPENDENT VAR;

DATA CLIENTS_SCORED2;
       SET CLIENTS_SCORED;
       IF CLASS = 'H' THEN Y = 0;
       ELSE Y = 1;
RUN;

* BASIC REGRESSION MODEL;

PROC REG DATA = CLIENTS_SCORED2;
       MODEL Y = SVD1 SVD2;
RUN;
QUIT;

4 comments:

  1. Question: if we assume we want to reduce dimension to K, and the process of SVD(A) result in dimensions => (WxK)(KxK)(DxK)' => (WxD), then above should SVD term vectors scores instead be A*V = (WxD)(DxK)? AV' only works when D=K. Certainly possible I'm making a matrix dimension error. Very helpful post. Thanks.

    ReplyDelete
    Replies
    1. You may be correct. I need to think about this more when I get a chance. Thanks for bringing it up. Sorry for the delay in response.

      Delete
  2. Is there any meaning to the SVD values containing positive and negative numbers? Thanks.

    ReplyDelete
  3. I'm not sure, from a purely mathematical sense. However, they are at least discussed in the following reference:

    http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
    http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf

    " There are two groups visible in the second column vector of U: car and wheel have negative coefficients, while doctor, nurse, and hospital are all positive, indicating a grouping in which wheel only cooccurs with car. "

    I think this is on page 21 of the reference. I think the reference may have a little different take than what I wrote. My post is based on how I think some of the tools in SAS Text Miner work, based on the documentation and discussions with technical support.

    ReplyDelete

Note: Only a member of this blog may post a comment.