Text Mining in Plain English:
Turning Text into Numbers
With Twitter, Facebook, email, online forums, open-response surveys, and customer and reader comments on web pages and news articles, companies and organizations have access to a lot of information in the form of text. Rather than hiring experts to read through thousands of pages of text and make subjective claims about its meaning, text mining lets us take otherwise unusable 'qualitative' data and convert it into quantitative measures that we can use for various types of reporting and modeling. In the example below I demonstrate how the mathematical technique of singular value decomposition (SVD) can be used to do this.
Sometimes, to assess statistical techniques, or even to understand them at a basic level, you need data with properties you understand. With numerical data, that typically might call for simulation. But since we are dealing with text, I just made some up that will clearly demonstrate the effectiveness of SVD.
Below you will see 10 hypothetical comments responding to the question 'What is your take on pink slime?' Each person's response is considered a document, and I have classified each document as type 'H' or 'S' as explained below. The goal will be to see if we can use a basic application of SVD to convert these comments into numbers and use them to predict which type of person made which comment (i.e., does a comment belong in category H or S) based on clustering or some type of predictive model. This goes well beyond simply classifying comments with a keyword search: SVD allows us to classify documents not only by the specific words they contain, but also by how similar they are.
The purpose of this exercise is to provide intuition for text mining and the application of singular value decomposition. The text above is made up, and specifically designed to produce the results I'm after. I'm not making any claims one way or the other about how realistic this is. But suppose these are potential customers and we want to be able to distinguish between the hippies, who favor a more local, nostalgic food supply from centuries past (designated as class or type 'H'), and the animal scientists, designated as type 'S.' Obviously comments of class H are critical of modern agriculture and technology. We might want to make these distinctions for PR, marketing, lobbying, or educational outreach purposes.
After cleaning up the text by eliminating uninformative parts of speech, articles, etc. (many software programs, such as SAS Enterprise Miner, provide excellent tools for this), we can form a term-document frequency matrix as follows:
Singular Value Decomposition
Singular Value Decomposition (SVD) is a concept from linear algebra based on the following matrix equation:

A = USV'

which states that a rectangular matrix A can be decomposed into three other matrix components:
U consists of the orthonormal eigenvectors of AA', where U'U = I (recall from linear algebra that orthogonal vectors of unit length are 'orthonormal')

V consists of the orthonormal eigenvectors of A'A

S is a diagonal matrix consisting of the square roots of the eigenvalues of AA' or A'A (their nonzero eigenvalues are equal). The values in S depict the variance of linearly independent components along each dimension, similarly to the way eigenvalues depict the variance explained by 'factors' or components in principal components analysis (PCA).
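Before applying this to text, it helps to see the decomposition work on a tiny numeric example. The sketch below uses Python rather than SAS, purely for illustration, with a made-up 2x2 matrix; it recovers the largest singular value by power iteration on A'A (one simple way to compute the leading component, not the algorithm PROC IML uses):

```python
import math

# Toy matrix (made up for illustration; chosen so that A'A has
# eigenvalues 45 and 5, giving singular values sqrt(45) and sqrt(5)).
A = [[3.0, 0.0],
     [4.0, 5.0]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(m * n for m, n in zip(row, col)) for col in Nt] for row in M]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

# Power iteration on A'A converges to v1, the top right singular vector;
# its eigenvalue of A'A is sigma1 squared.
AtA = matmul(transpose(A), A)
v = [1.0, 0.5]
for _ in range(200):
    w = matvec(AtA, v)
    v = [wi / norm(w) for wi in w]

sigma1 = norm(matvec(A, v))               # largest singular value
u1 = [x / sigma1 for x in matvec(A, v)]   # corresponding left singular vector
```

Here sigma1 comes out to sqrt(45), confirming that the singular values are the square roots of the eigenvalues of A'A.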
SVD provides the mathematical foundation for text mining.
A term-document matrix A can be decomposed as in:

A = USV'

If the term-document matrix A were a collection of individuals' textual responses or comments to a survey question (or Facebook post, etc.), then each individual's response would be considered a 'document.' The individual words they used in their response are the 'terms.' The term-document frequency matrix consists of rows that represent each term and columns that represent each 'document' or individual.
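As a quick sketch of this structure, the Python snippet below builds a term-document frequency matrix from three made-up one-line 'documents' (hypothetical toy data, not the pink slime responses):

```python
# Hypothetical mini-corpus: each string is one person's response ('document').
docs = [
    "cows eat grass",
    "cows eat corn",
    "beef from cows",
]

# Terms are the distinct words across all documents (no cleaning needed here).
terms = sorted({word for doc in docs for word in doc.split()})

# Term-document frequency matrix: rows = terms, columns = documents,
# entry [i][j] = how many times term i appears in document j.
tdm = [[doc.split().count(term) for doc in docs] for term in terms]

for term, row in zip(terms, tdm):
    print(f"{term:6s}{row}")
```

The word 'cows' appears once in every document, so its row is [1, 1, 1], while 'grass' appears only in the first, giving [1, 0, 0].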
The vectors in U can be used to score documents in the term-document matrix, as in U'A (this is analogous to the way values of x are scored by eigenvectors in PCA). As a result, a single numerical value or 'score' can be assigned to each person's textual response (i.e., each document) for each SVD dimension (i.e., each kept independent vector in U). Thus the text is converted into numerical scores (via the transformation U'A) that can then be used in predictive modeling (as predictor variables), or clustering can be utilized to group the individual textual responses (or 'documents'). So ultimately SVD converts otherwise unuseful text into numeric SVD scores or more interpretable clusters. Likewise, using the weights from the eigenvectors that comprise V to score the terms in the term-document matrix A, as in AV, we can cluster like terms into 'topics.'
Thus the SVD of A allows us to derive the following scores:

U'A = SVD document vectors
AV = SVD term vectors
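To make the scoring step concrete, here is a Python sketch (a stand-in for the PROC IML run, using a made-up 3-term by 3-document matrix) that computes the leading left singular vector by power iteration and scores each document via u'A:

```python
import math

# Made-up term-document matrix: rows = terms, columns = documents.
A = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 2.0]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(m * n for m, n in zip(row, col)) for col in Nt] for row in M]

# Power iteration on AA' gives u, the leading left singular vector.
AAt = matmul(A, transpose(A))
u = [1.0] * len(A)
for _ in range(300):
    w = [sum(a * ui for a, ui in zip(row, u)) for row in AAt]
    n = math.sqrt(sum(x * x for x in w))
    u = [x / n for x in w]

# Document scores on SVD dimension 1: u'A, one number per document (column).
doc_scores = [sum(u[i] * A[i][j] for i in range(len(A)))
              for j in range(len(A[0]))]
```

Each column of text has been replaced by a single number; a full SVD would add one such score per kept dimension, exactly what the document vectors U'A provide.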
This can be loosely demonstrated using PROC IML in SAS. Specifying the term-document frequency matrix in PROC IML and implementing SVD produces:
The document vectors U’A can be depicted as follows:
Here is the original text document with customer types/classes and the appended SVD scores.
As depicted above, SVD has allowed us to replace all of the text with the quantitative values for the associated SVD scores (derived from U'A). In this case, I'm representing all of the text with just two SVD dimensions. These values can then be used for clustering or entered into
a predictive model.
It is easy to see that the different document types (response types H vs. S) cluster very well on the dimensions SVD1 and SVD2. In fact, all responses of type S have a value of SVD2 < .5.
The SVD values can also be entered into a regression or
other type of predictive model.
Consistent with the observed relationship between SVD2
and class ‘S’ and ‘H’ clusters, we see that SVD2 is significant in the regression. In
addition, it shows that higher values of SVD2 decrease the probability of being
classified as document or response type ‘S’.
Yes, this is OLS on a binary dependent variable (a linear probability model rather than a logit), but again the purpose is to provide intuition and motivation for using text analytics in predictive modeling.
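To show what that regression step amounts to, here is a minimal OLS fit via the normal equations in Python; the predictor values are made-up stand-ins for SVD scores, not the actual SVD1/SVD2 output above:

```python
def ols(X, y):
    """OLS via the normal equations (Z'Z)b = Z'y, with an intercept."""
    Z = [[1.0] + list(row) for row in X]   # prepend intercept column
    k = len(Z[0])
    ZtZ = [[sum(zr[i] * zr[j] for zr in Z) for j in range(k)] for i in range(k)]
    Zty = [sum(zr[i] * yi for zr, yi in zip(Z, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on [Z'Z | Z'y]
    M = [ZtZ[i] + [Zty[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Toy data: y generated exactly as 0.5 + 2*x1 - 1*x2,
# so OLS should recover those coefficients.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0.5, 2.5, -0.5, 1.5, 3.5]
b = ols(X, y)   # [intercept, coefficient on x1, coefficient on x2]
```

With a 0/1 dependent variable in place of y, the same fit gives the sign and significance pattern described above for SVD2.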
This seemed to work ok on a small collection of documents carefully constructed to illustrate the concepts above, but tools like
SAS Text Miner in conjunction with
SAS Enterprise Miner are designed specifically to do this type of analysis on a much larger scale. I have used both of these tools on much larger document collections and obtained promising results using text topics from SVD in predictive modeling applications.
SAS CODE
proc iml;
A = {0 0 0 1 0 0 0 0 0 0,
1 1 0 0 0 1 0 0 0 0,
0 0 0 0 0 1 1 1 0 2,
0 0 0 0 0 1 0 0 0 0,
0 1 1 0 0 0 0 0 0 0,
1 0 0 0 0 0 0 0 0 0,
0 0 1 0 1 0 0 0 0 0,
0 0 0 0 1 0 0 0 0 0,
0 0 0 0 1 0 0 0 0 0,
0 0 1 0 0 0 0 0 0 0,
0 0 0 0 0 1 0 1 0 1,
0 0 1 0 0 0 0 0 0 0,
0 0 0 1 0 0 0 0 0 0,
0 0 0 0 0 0 1 1 0 0,
1 0 1 0 0 0 0 0 0 0,
0 0 0 0 0 0 1 0 0 0,
0 0 0 0 0 0 0 0 0 1,
0 0 0 0 0 1 0 0 0 0,
0 0 1 0 0 0 0 0 0 0,
1 0 0 0 1 0 0 0 0 0,
0 1 0 1 0 0 0 0 0 0,
0 1 0 0 0 1 0 0 0 0,
0 0 0 0 0 0 0 1 0 0,
0 0 0 0 0 0 0 0 0 1,
0 0 0 0 0 0 0 0 1 1,
0 0 0 0 0 0 0 0 0 1,
1 0 0 0 1 0 0 0 0 0,
0 0 0 0 0 0 1 1 1 0,
0 0 1 0 0 0 0 0 0 0,
0 0 0 0 0 0 0 0 1 0,
0 0 0 0 0 0 0 0 0 1,
0 0 0 0 0 1 0 1 0 0,
0 0 0 0 0 0 0 0 1 0
};
print(A); /* print matrix A */
n = nrow(A); /* number of rows */
p = ncol(A); /* number of columns */
print(n);
print(p);
call svd(u,s,v,A); /* singular value decomposition of A = usv' */
print(u); /* left singular vectors: orthonormal eigenvectors of AA' */
print(s); /* singular values: square roots of the eigenvalues of AA' or A'A */
print(v); /* right singular vectors: orthonormal eigenvectors of A'A */
uTa = T(u)*A; /* document vectors u'A */
av = A*v; /* term vectors Av */
print(uTa);
print(av);
/* score and export as a SAS data set */
ID = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; /* document id column */
print(ID);
docscores = ID || T(uTa); /* document ids alongside the transposed SVD document scores */
print(docscores);
varnames = 'svd1':'svd11'; /* svd1 holds the document id; svd2-svd11 hold the scores */
create svd_scores from docscores[colname = varnames];
append from docscores;
close svd_scores;
quit;
** SCORING, CLUSTERING AND PREDICTIVE MODELING **;
* CREATE CLIENT DATA SET;
DATA CLIENTS;
INPUT ID CLASS $ ;
CARDS;
1 H
2 H
3 H
4 H
5 H
6 S
7 S
8 S
9 S
10 S
;
RUN;
* MERGE WITH SVD SCORES;
PROC SQL;
CREATE TABLE CLIENTS_SCORED AS
SELECT A.ID, A.CLASS, B.SVD2 AS SVD1, B.SVD3 AS SVD2
FROM CLIENTS A LEFT JOIN SVD_SCORES B
ON A.ID = B.SVD1;
QUIT;
PROC PRINT DATA = CLIENTS_SCORED;
RUN;
* CLUSTER/VISUALIZE CLIENTS BASED ON SVD SCORES;
PROC GPLOT DATA = CLIENTS_SCORED;
PLOT SVD1*SVD2 = ID;
RUN;
QUIT;
* RECODE FOR NUMERIC DEPENDENT VAR;
DATA CLIENTS_SCORED2;
SET CLIENTS_SCORED;
IF CLASS = 'H' THEN Y = 0;
ELSE Y = 1;
RUN;
* BASIC REGRESSION MODEL;
PROC REG DATA = CLIENTS_SCORED2;
MODEL Y = SVD1 SVD2;
RUN;
QUIT;