Working with the Bag-of-Words representation
The bow module in tmtoolkit contains several functions for working with Bag-of-Words (BoW) representations of documents. It’s divided into two sub-modules: bow.bow_stats and bow.dtm. The former implements several statistics and transformations for BoW representations; the latter contains functions to create and convert sparse or dense document-term matrices (DTMs).
Most of the functions in both sub-modules accept and/or return sparse DTMs. The previous chapter contained a section about what sparse DTMs are and how they can be generated with tmtoolkit.
An example document-term matrix
Before we start with the bow.dtm module, we will generate a sparse DTM from a small example corpus.
[1]:
import random
random.seed(20191113) # to make the sampling reproducible
import numpy as np
np.set_printoptions(precision=5)
from tmtoolkit.corpus import Corpus
corpus = Corpus.from_builtin_corpus('en-NewsArticles').sample(5)
Let’s have a look at a sample document:
[2]:
print(corpus['NewsArticles-2058'][:227])
Merkel: 'Only if Europe is doing well, will Germany be doing well'
Ahead of meeting her fellow European leaders at a summit in Brussels, German Chancellor Angela Merkel has reiterated her government's call for unity in the EU.
We employ a preprocessing pipeline that removes a lot of information from our original data in order to obtain a very condensed DTM.
[3]:
from tmtoolkit.preprocess import TMPreproc
preproc = TMPreproc(corpus, language='en')
preproc.pos_tag() \
.lemmatize() \
.filter_for_pos('N') \
.tokens_to_lowercase() \
.remove_special_chars_in_tokens() \
.clean_tokens(remove_shorter_than=2) \
.remove_common_tokens(5, absolute=True) # remove tokens that occur in all five documents
preproc.tokens_datatable
[3]:
 | doc | position | token | lemma | pos | whitespace
---|---|---|---|---|---|---
0 | NewsArticles-119 | 0 | day | day | NOUN | 1 |
1 | NewsArticles-119 | 1 | nhs | NHS | PROPN | 0 |
2 | NewsArticles-119 | 2 | day | day | NOUN | 1 |
3 | NewsArticles-119 | 3 | nhs | NHS | PROPN | 0 |
4 | NewsArticles-119 | 4 | pledge | pledge | NOUN | 1 |
5 | NewsArticles-119 | 5 | prime | Prime | PROPN | 1 |
6 | NewsArticles-119 | 6 | minister | Minister | PROPN | 1 |
7 | NewsArticles-119 | 7 | david | David | PROPN | 1 |
8 | NewsArticles-119 | 8 | cameron | Cameron | PROPN | 0 |
9 | NewsArticles-119 | 9 | theresa | Theresa | PROPN | 1 |
10 | NewsArticles-119 | 10 | may | May | PROPN | 1 |
11 | NewsArticles-119 | 11 | government | government | NOUN | 1 |
12 | NewsArticles-119 | 12 | people | people | NOUN | 1 |
13 | NewsArticles-119 | 13 | access | access | NOUN | 1 |
14 | NewsArticles-119 | 14 | gps | gps | NOUN | 1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
925 | NewsArticles-3665 | 351 | article | article | NOUN | 1 |
926 | NewsArticles-3665 | 352 | author | author | NOUN | 0 |
927 | NewsArticles-3665 | 353 | al | Al | PROPN | 1 |
928 | NewsArticles-3665 | 354 | jazeera | Jazeera | PROPN | 0 |
929 | NewsArticles-3665 | 355 | policy | policy.- | NOUN | 0 |
[4]:
preproc.n_docs, preproc.vocabulary_size
[4]:
(5, 514)
We fetch the document labels and vocabulary and convert them to NumPy arrays, because NumPy arrays support advanced indexing such as boolean indexing.
[5]:
doc_labels = np.array(preproc.doc_labels)
doc_labels
[5]:
array(['NewsArticles-119', 'NewsArticles-1206', 'NewsArticles-2058',
'NewsArticles-3016', 'NewsArticles-3665'], dtype='<U17')
[6]:
vocab = np.array(preproc.vocabulary)
vocab[:10] # only showing the first 10 tokens here
[6]:
array(['70', 'abuse', 'access', 'accession', 'accusation', 'act',
'addition', 'address', 'administration', 'affiliation'],
dtype='<U16')
Finally, we fetch the sparse DTM:
[7]:
dtm = preproc.dtm
dtm
[7]:
<5x514 sparse matrix of type '<class 'numpy.int32'>'
with 578 stored elements in Compressed Sparse Row format>
We now have a sparse DTM dtm, an array of document labels doc_labels that represents the rows of the DTM, and an array of vocabulary tokens vocab that represents the columns of the DTM. We will use this data for the remainder of the chapter.
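These three objects are aligned; a quick sanity check (plain NumPy, nothing tmtoolkit-specific):

# rows of the DTM correspond to documents, columns to vocabulary tokens
assert dtm.shape == (len(doc_labels), len(vocab))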
The bow.dtm module
This module is quite small. There are two functions to convert a DTM to a datatable or DataFrame: dtm_to_datatable() and dtm_to_dataframe(). Note that the generated datatable or DataFrame is dense, i.e. it uses up (much) more memory than the input DTM.
Let’s generate a datatable via dtm_to_datatable() from our DTM, the document labels and the vocabulary:
[8]:
from tmtoolkit.bow.dtm import dtm_to_datatable
dtm_to_datatable(dtm, doc_labels, vocab)
[8]:
 | _doc | 70 | abuse | access | accession | accusation | act | addition | address | administration | … | world | wound | year | york | yucel
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | NewsArticles-119 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |
1 | NewsArticles-1206 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |
2 | NewsArticles-2058 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 | 2 | 0 | 2 |
3 | NewsArticles-3016 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 3 | 1 | 0 | 1 | 0 |
4 | NewsArticles-3665 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | … | 0 | 0 | 1 | 0 | 0 |
We can see that a column _doc with the document labels was created and that the vocabulary tokens became the column names. dtm_to_dataframe() works the same way.
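A minimal sketch of the pandas variant, assuming dtm_to_dataframe() accepts the same arguments:

from tmtoolkit.bow.dtm import dtm_to_dataframe

dtm_to_dataframe(dtm, doc_labels, vocab)   # dense pandas DataFrame built from the same inputs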
You can combine tmtoolkit with Gensim. The bow.dtm module provides several functions to convert data between both packages (a short sketch follows the list):
dtm_and_vocab_to_gensim_corpus_and_dict(): converts a (sparse) DTM and a vocabulary list to a Gensim Corpus and a Gensim Dictionary
dtm_to_gensim_corpus(): converts only a (sparse) DTM to a Gensim Corpus
gensim_corpus_to_dtm(): converts a Gensim Corpus object to a sparse DTM in COO format
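For example, a minimal round-trip sketch (assuming the default arguments suffice):

from tmtoolkit.bow.dtm import dtm_to_gensim_corpus, gensim_corpus_to_dtm

gensim_corpus = dtm_to_gensim_corpus(dtm)      # sparse DTM to a Gensim Corpus
dtm_coo = gensim_corpus_to_dtm(gensim_corpus)  # and back to a sparse DTM in COO format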
The bow.bow_stats module
This module provides several statistics and transformations for sparse or dense DTMs.
Document lengths, document and term frequencies, token co-occurrences
Let’s start with the doc_lengths() function, which simply gives the number of tokens per document (i.e. the row-wise sum of the DTM):
[9]:
from tmtoolkit.bow.bow_stats import doc_lengths
doc_lengths(dtm)
[9]:
array([ 38, 40, 336, 160, 356])
The returned array is aligned with the document labels doc_labels, so we can see that the last document, “NewsArticles-3665”, is the one with the most tokens. Or to do it computationally:
[10]:
doc_labels[doc_lengths(dtm).argmax()]
[10]:
'NewsArticles-3665'
While doc_lengths() gives the row-wise sum across the DTM, term_frequencies() gives the column-wise sum. This means it returns an array whose length equals the vocabulary size, where each entry reflects the total number of occurrences of the respective vocabulary token (aka term).
Let’s calculate that measure, get its maximum and the vocabulary token(s) for that maximum value:
[11]:
from tmtoolkit.bow.bow_stats import term_frequencies
term_freq = term_frequencies(dtm)
(term_freq.max(), vocab[term_freq == term_freq.max()])
[11]:
(21, array(['medium'], dtype='<U16'))
It’s also possible to calculate the proportional frequency, i.e. to normalize the counts by the overall number of tokens, via proportions=True:
[12]:
term_prop = term_frequencies(dtm, proportions=True)
vocab[term_prop >= 0.01]
[12]:
array(['candidate', 'eu', 'macron', 'medium', 'merkel', 'refugee'],
dtype='<U16')
The function doc_frequencies() returns, for each token in the vocabulary, the number of documents in which it occurs at least n times. You can control n via the parameter min_val, which is set to 1 by default. The returned array is aligned with the vocabulary. Here, we calculate the document frequency with the default value min_val=1, extract the maximum document frequency and see which of the tokens in the vocab array reach that maximum:
[13]:
from tmtoolkit.bow.bow_stats import doc_frequencies
df = doc_frequencies(dtm)
max_df = df.max()
max_df, vocab[df == max_df]
[13]:
(4, array(['minister'], dtype='<U16'))
It turns out that the maximum document frequency is 4 and only the token “minister” reaches it. This means only “minister” occurs at least once (because min_val is 1) in 4 of the documents. Remember that during preprocessing, we removed all tokens that occur in all five documents, hence there can’t be a vocabulary token with a document frequency of 5.
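We can double-check this directly on the DTM; a small sketch using plain NumPy/SciPy operations:

# number of documents in which “minister” occurs at least once
minister_ix = np.where(vocab == 'minister')[0][0]
int((dtm[:, minister_ix].todense() > 0).sum())   # gives 4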
Let’s see which vocabulary tokens occur within a single document at least 10 times:
[14]:
df = doc_frequencies(dtm, min_val=10)
vocab[df > 0]
[14]:
array(['candidate', 'eu', 'macron', 'medium', 'merkel', 'refugee'],
dtype='<U16')
We can also calculate the co-document frequency or token co-occurrence matrix via codoc_frequencies(). This measures, for each pair of vocabulary tokens, the number of documents in which both occur at least n times. Again, you can control n via the parameter min_val, which is set to 1 by default. The result is a sparse matrix of shape vocabulary size by vocabulary size, whose rows and columns represent the pairs of tokens from the vocabulary.
Let’s generate a co-document frequency matrix and convert it to a dense representation, because our further operations don’t support sparse matrices.
A co-document frequency matrix is symmetric about the diagonal, because the co-occurrence of a pair (token1, token2) is always the same as that of (token2, token1). We want to filter out the duplicate pairs and for that use np.triu() to take only the upper triangle of the matrix, i.e. set all values in the lower triangle, including the matrix diagonal, to zero (k=1 does this):
[15]:
from tmtoolkit.bow.bow_stats import codoc_frequencies
codoc_mat = codoc_frequencies(dtm).todense()
codoc_upper = np.triu(codoc_mat, k=1)
codoc_upper
[15]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 1, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
Now we create a list that contains the pairs of tokens that occur together in at least two documents (codoc_upper > 1), along with their co-document frequency:
[16]:
interesting_pairs = [(vocab[t1], vocab[t2], codoc_upper[t1, t2])
for t1, t2 in zip(*np.where(codoc_upper > 1))]
interesting_pairs[:10] # showing only the first ten pairs
[16]:
[('access', 'channel', 2),
('access', 'day', 2),
('access', 'minister', 2),
('access', 'news', 2),
('april', 'author', 2),
('april', 'co', 2),
('april', 'critic', 2),
('april', 'distribution', 2),
('april', 'heart', 2),
('april', 'law', 2)]
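If we want just the pair with the highest co-document frequency, we can reuse the argmax pattern from above:

# convert the flat argmax index into a (row, column) pair of vocabulary indices
i, j = np.unravel_index(codoc_upper.argmax(), codoc_upper.shape)
vocab[i], vocab[j], codoc_upper[i, j]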
Generate sorted lists and datatables according to term frequency
When working with DTMs, it’s often helpful to rank terms per document according to their frequency. This is what sorted_terms() does for you. It further allows you to specify the sorting order (the default is descending order via ascending=False) and several limits:
lo_thresh for the minimum term frequency
hi_thresh for the maximum term frequency
top_n for the maximum number of terms per document
Let’s display the top three tokens per document by frequency:
[17]:
from tmtoolkit.bow.bow_stats import sorted_terms
sorted_terms(dtm, vocab, top_n=3)
[17]:
[[('day', 3), ('nhs', 2), ('bbc', 2)],
[('car', 4), ('garda', 4), ('collision', 3)],
[('merkel', 14), ('eu', 13), ('refugee', 13)],
[('politic', 7), ('party', 6), ('farron', 5)],
[('medium', 21), ('candidate', 19), ('macron', 15)]]
The output is a list for each document (which means the output is aligned with the document labels doc_labels), with three pairs of (token, frequency) each. It’s also possible to get this data as a datatable via sorted_terms_datatable(), which gives a better overview and also includes labels for the documents. It accepts the same parameters for sorting and limiting the results:
[18]:
from tmtoolkit.bow.bow_stats import sorted_terms_datatable
sorted_terms_datatable(dtm, vocab, doc_labels, top_n=3)
[18]:
 | doc | token | value
---|---|---|---
0 | NewsArticles-119 | day | 3 |
1 | NewsArticles-119 | nhs | 2 |
2 | NewsArticles-119 | bbc | 2 |
3 | NewsArticles-1206 | car | 4 |
4 | NewsArticles-1206 | garda | 4 |
5 | NewsArticles-1206 | collision | 3 |
6 | NewsArticles-2058 | merkel | 14 |
7 | NewsArticles-2058 | eu | 13 |
8 | NewsArticles-2058 | refugee | 13 |
9 | NewsArticles-3016 | politic | 7 |
10 | NewsArticles-3016 | party | 6 |
11 | NewsArticles-3016 | farron | 5 |
12 | NewsArticles-3665 | medium | 21 |
13 | NewsArticles-3665 | candidate | 19 |
14 | NewsArticles-3665 | macron | 15 |
[19]:
sorted_terms_datatable(dtm, vocab, doc_labels, lo_thresh=5)
[19]:
 | doc | token | value
---|---|---|---
0 | NewsArticles-2058 | merkel | 14 |
1 | NewsArticles-2058 | refugee | 13 |
2 | NewsArticles-2058 | eu | 13 |
3 | NewsArticles-2058 | germany | 8 |
4 | NewsArticles-2058 | country | 8 |
5 | NewsArticles-2058 | turkey | 7 |
6 | NewsArticles-2058 | europe | 6 |
7 | NewsArticles-3016 | politic | 7 |
8 | NewsArticles-3016 | party | 6 |
9 | NewsArticles-3665 | medium | 21 |
10 | NewsArticles-3665 | candidate | 19 |
11 | NewsArticles-3665 | macron | 15 |
12 | NewsArticles-3665 | france | 9 |
13 | NewsArticles-3665 | election | 9 |
14 | NewsArticles-3665 | le | 7 |
15 | NewsArticles-3665 | coverage | 6 |
Term frequency–inverse document frequency transformation (tf-idf)
Term frequency–inverse document frequency transformation (tf-idf) is a matrix transformation that is often applied to DTMs in order to reflect the importance of a token to a document. The bow_stats module provides the function tfidf() for this. When the input is a sparse matrix and the chosen calculations support sparse matrices, the output will also be a sparse matrix, which makes the tf-idf transformation very memory-efficient.
Let’s apply tf-idf to our DTM with the default settings:
[20]:
from tmtoolkit.bow.bow_stats import tfidf
tfidf_mat = tfidf(dtm)
tfidf_mat
[20]:
<5x514 sparse matrix of type '<class 'numpy.float64'>'
with 578 stored elements in COOrdinate format>
We can see that the output is a sparse matrix. Let’s have a look at its values:
[21]:
tfidf_mat.todense()
[21]:
matrix([[0. , 0. , 0.02581, ..., 0. , 0. , 0. ],
[0.03132, 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.00584, 0. , 0.00746],
[0. , 0.00783, 0. , ..., 0. , 0.00783, 0. ],
[0. , 0. , 0.00276, ..., 0.00276, 0. , 0. ]])
Of course we can also pass this matrix to sorted_terms_datatable() and observe that some rankings have changed in comparison to the untransformed DTM:
[22]:
sorted_terms_datatable(tfidf_mat, vocab, doc_labels, top_n=3)
[22]:
 | doc | token | value
---|---|---|---
0 | NewsArticles-119 | day | 0.0774339 |
1 | NewsArticles-119 | nhs | 0.0659349 |
2 | NewsArticles-119 | victoria | 0.0659349 |
3 | NewsArticles-1206 | car | 0.125276 |
4 | NewsArticles-1206 | garda | 0.125276 |
5 | NewsArticles-1206 | collision | 0.0939572 |
6 | NewsArticles-2058 | merkel | 0.0521985 |
7 | NewsArticles-2058 | refugee | 0.04847 |
8 | NewsArticles-2058 | eu | 0.0379488 |
9 | NewsArticles-3016 | politic | 0.0548084 |
10 | NewsArticles-3016 | farron | 0.0391488 |
11 | NewsArticles-3016 | party | 0.0367811 |
12 | NewsArticles-3665 | medium | 0.0738989 |
13 | NewsArticles-3665 | candidate | 0.0668609 |
14 | NewsArticles-3665 | macron | 0.052785 |
The tf-idf matrix is calculated from a DTM \(D\) as \(\textit{tf}(D) \cdot \textit{idf}(D)\).
There are different variants for how to calculate the term frequency \(\textit{tf}(D)\) and the inverse document frequency \(\textit{idf}(D)\). tmtoolkit contains several functions that implement some of these variants. For \(\textit{tf}\) these are:
tf_binary(): binary term frequency matrix (matrix contains 1 whenever a term occurred in a document, else 0)
tf_proportions(): proportional term frequency matrix (term counts are normalized by document length)
tf_log(): log-normalized term frequency matrix (by default \(\log(1 + D)\))
tf_double_norm(): double-normalized term frequency matrix \(K + (1-K) \cdot \frac{D}{\textit{rowmax}(D)}\), where \(\textit{rowmax}(D)\) is a vector containing the maximum term count per document
As you can see, all the term frequency functions are prefixed with tf_. There are also two variants for \(\textit{idf}\):
idf(): calculates \(\log(\frac{a + N}{b + \textit{df}(D)})\) where \(a\) and \(b\) are smoothing constants, \(N\) is the number of documents and \(\textit{df}(D)\) calculates the document frequency
idf_probabilistic(): calculates \(\log(a + \frac{N - \textit{df}(D)}{\textit{df}(D)})\)
The term frequency functions return a sparse matrix whenever the input is sparse and the calculation permits it. Let’s try out two of the term frequency functions:
[23]:
from tmtoolkit.bow.bow_stats import tf_binary, tf_proportions
tf_binary(dtm).todense()
[23]:
matrix([[0, 0, 1, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 1, 0, 1],
[0, 1, 0, ..., 0, 1, 0],
[0, 0, 1, ..., 1, 0, 0]])
[24]:
tf_proportions(dtm).todense()
[24]:
matrix([[0. , 0. , 0.02632, ..., 0. , 0. , 0. ],
[0.025 , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.00595, 0. , 0.00595],
[0. , 0.00625, 0. , ..., 0. , 0.00625, 0. ],
[0. , 0. , 0.00281, ..., 0.00281, 0. , 0. ]])
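The remaining two term frequency variants work the same way; a quick sketch (output omitted here; we assume tf_double_norm’s smoothing parameter K defaults to 0.5):

from tmtoolkit.bow.bow_stats import tf_log, tf_double_norm

tf_log(dtm)          # log-normalized counts, by default log(1 + D)
tf_double_norm(dtm)  # double-normalized counts; dense result, since zero counts map to K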
Just like the document frequency function doc_frequencies(), the inverse document frequency functions also return a vector with the same length as the vocabulary. Let’s use these functions and have a look at the inverse document frequency of the first few vocabulary tokens:
[25]:
from tmtoolkit.bow.bow_stats import idf, idf_probabilistic
idf_vec = idf(dtm)
list(zip(vocab, idf_vec))[:10]
[25]:
[('70', 1.252762968495368),
('abuse', 1.252762968495368),
('access', 0.9808292530117262),
('accession', 1.252762968495368),
('accusation', 1.252762968495368),
('act', 1.252762968495368),
('addition', 1.252762968495368),
('address', 1.252762968495368),
('administration', 1.252762968495368),
('affiliation', 1.252762968495368)]
[26]:
probidf_vec = idf_probabilistic(dtm)
list(zip(vocab, probidf_vec))[:10]
[26]:
[('70', 1.6094379124341003),
('abuse', 1.6094379124341003),
('access', 0.9162907318741551),
('accession', 1.6094379124341003),
('accusation', 1.6094379124341003),
('act', 1.6094379124341003),
('addition', 1.6094379124341003),
('address', 1.6094379124341003),
('administration', 1.6094379124341003),
('affiliation', 1.6094379124341003)]
Note that due to our very small sample, there’s not much variation in the inverse document frequency values.
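tfidf() merely combines such building blocks. As a small sketch, we can recombine tf_proportions() and idf() by hand; this should reproduce tfidf_mat from above, assuming both functions are applied with their default settings:

# proportional term frequencies, scaled column-wise by the idf vector
manual_tfidf = tf_proportions(dtm).multiply(idf_vec)
np.allclose(manual_tfidf.todense(), tfidf_mat.todense())   # expected: True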
By default, tfidf() uses tf_proportions() and idf() to calculate the tf-idf matrix. You can plug in other functions to get other variants of tf-idf:
[27]:
from tmtoolkit.bow.bow_stats import tf_double_norm
# we also set a "K" parameter for "tf_double_norm"
tfidf_mat2 = tfidf(dtm, tf_func=tf_double_norm,
idf_func=idf_probabilistic, K=0.25)
tfidf_mat2
[27]:
array([[0.40236, 0.40236, 0.45815, ..., 0.22907, 0.40236, 0.40236],
[0.70413, 0.40236, 0.22907, ..., 0.22907, 0.40236, 0.40236],
[0.40236, 0.40236, 0.22907, ..., 0.32725, 0.40236, 0.5748 ],
[0.40236, 0.5748 , 0.22907, ..., 0.22907, 0.5748 , 0.40236],
[0.40236, 0.40236, 0.2618 , ..., 0.2618 , 0.40236, 0.40236]])
[28]:
sorted_terms_datatable(tfidf_mat2, vocab, doc_labels, top_n=3)
[28]:
 | doc | token | value
---|---|---|---
0 | NewsArticles-119 | bbc | 1.20708 |
1 | NewsArticles-119 | victoria | 1.20708 |
2 | NewsArticles-119 | nhs | 1.20708 |
3 | NewsArticles-1206 | car | 1.60944 |
4 | NewsArticles-1206 | garda | 1.60944 |
5 | NewsArticles-1206 | road | 1.30767 |
6 | NewsArticles-2058 | merkel | 1.60944 |
7 | NewsArticles-2058 | refugee | 1.52322 |
8 | NewsArticles-2058 | germany | 1.09212 |
9 | NewsArticles-3016 | politic | 1.60944 |
10 | NewsArticles-3016 | farron | 1.26456 |
11 | NewsArticles-3016 | putin | 1.09212 |
12 | NewsArticles-3665 | medium | 1.60944 |
13 | NewsArticles-3665 | candidate | 1.49448 |
14 | NewsArticles-3665 | macron | 1.26456 |
Once we have generated a DTM, we can use it for topic modeling. The next chapter will show how tmtoolkit can be used to evaluate the quality of your model, export essential information from it and visualize the results.