Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts in a way that makes it easier to apply analysis methods at a later stage, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.

Two approaches: functional API and TMPreproc class

There are two ways to apply text preprocessing methods to your documents: First, there is the functional API which consists of a set of Python functions that accept a list of (tokenized) documents. An example might be:

from tmtoolkit.preprocess import tokenize, to_lowercase

corpus = [
    "Hello world!",    # document 1
    "Another example"  # document 2
]

docs = tokenize(corpus)
to_lowercase(docs)
# Out: [['hello', 'world', '!'],
#       ['another', 'example']]

The advantage of this approach is that it’s very straightforward and flexible. However, you must manage any metadata associated with the documents on your own (e.g. document labels or token metadata). Furthermore, the processing is not done in parallel.

Second, there is the TMPreproc class which addresses these limitations. You can create an instance of this class from your (labelled) documents and then apply preprocessing methods to it. This instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:

from tmtoolkit.preprocess import TMPreproc

corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens                  # one of many ways to access the tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }

The most important advantage is that TMPreproc employs parallel processing, i.e. it uses all available processors on your machine to do the computations necessary during preprocessing. For large text corpora, this can lead to a substantial speed-up.

Both approaches offer mostly the same features in terms of available preprocessing methods. TMPreproc additionally provides methods to export the data to pandas DataFrames or datatable Frames. In general, the functional API is best suited for quick prototyping and for small amounts of data. For projects with large amounts of data, it’s recommended to use TMPreproc, especially because of its parallel computation support.

A note on the use of datatable Frames

If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is much faster and more memory-efficient in most cases. You can always convert between the two like this:

import datatable as dt
import pandas as pd

# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})

# DataFrame to datatable:
dtable = dt.Frame(df)

# and vice versa, datatable to DataFrame
# (the element-wise comparison confirms both hold the same data):
df == dtable.to_pandas()

# Out:
#       a     b
# 0  True  True
# 1  True  True
# 2  True  True

Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.
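If you want to verify this on your own machine, a rough, hedged sketch with made-up benchmark data could use Python’s timeit module:

import timeit

import datatable as dt
import pandas as pd

# made-up benchmark data: one numeric and one string column with one million rows
data = {'num': list(range(1_000_000)), 'txt': ['token'] * 1_000_000}

t_direct = timeit.timeit(lambda: pd.DataFrame(data), number=5)
t_via_dt = timeit.timeit(lambda: dt.Frame(data).to_pandas(), number=5)

print('direct pandas DataFrame:   %.2fs' % t_direct)
print('datatable Frame to pandas: %.2fs' % t_via_dt)

The exact numbers will of course depend on your data and hardware.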

This chapter starts with the functional API and then turns to TMPreproc.

Functional API

The functions in the preprocessing module make up the functional API for text preprocessing. We will explore some of the available functions. Most of them require at least passing a list of tokenized documents. In order to tokenize raw text documents (for example from a Corpus object), we can use tokenize().

Loading example data

Let’s load a sample of three documents from the built-in NewsArticles dataset. We’ll save the document labels in doc_labels since the functional API works with lists of documents (not with dicts):

[1]:
import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(3)
doc_labels = list(corpus.keys())
doc_labels
[1]:
['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']

Tokenization

We can now tokenize these documents. We use corpus.values() to pass a list of documents. We get a list of tokenized documents back (i.e. a list of lists). We peek into the documents by showing at most the first 10 tokens of each.

[2]:
docs = tokenize(corpus.values())
[doc[:10] for doc in docs]
[2]:
[['White',
  'House',
  'aides',
  'told',
  'to',
  'keep',
  'Russia-related',
  'materials',
  'Lawyers',
  'for'],
 ['Frustration',
  'as',
  'cabin',
  'electronics',
  'ban',
  'comes',
  'into',
  'force',
  'Passengers',
  'decry'],
 ['Should',
  'you',
  'have',
  'two',
  'bins',
  'in',
  'your',
  'bathroom',
  '?',
  'Our']]

Corpus language

Some preprocessing steps are language-dependent, i.e. they’re trained for different languages and hence you have to specify the language in which your documents are written. At the moment, tmtoolkit only supports two languages off the shelf: English and German.

In the functional API, all functions that are language-dependent have a language argument. Examples of such functions are tokenize(), pos_tag(), stem() and lemmatize(). The default value for the language parameter of the preprocessing functions is set in tmtoolkit.defaults.language. If you don’t change it, it’s set to "english". So you have two options when you use the functional API and work with a corpus that is not in English: you either pass the language parameter each time you use a language-dependent function, or you set tmtoolkit.defaults.language right at the beginning, which is then used as the default for all further language-dependent preprocessing functions. Let’s try both options with a German sample corpus:

[3]:
from tmtoolkit.preprocess import stem

docs_de = [
    'Von der Wiege bis zur Bahre, Formulare, Formulare.',
    'Fischers Fritz fischt frische Fische.',
    'Viel schon ist getan, mehr noch ist zu tun, sagt der Wasserhahn zum Wasserhuhn.'
]

Option 1, passing the language parameter each time:

[4]:
tokens_de = tokenize(docs_de, language='german')
stemmed_de = stem(tokens_de, language='german')
stemmed_de
[4]:
[['von',
  'der',
  'wieg',
  'bis',
  'zur',
  'bahr',
  ',',
  'formular',
  ',',
  'formular',
  '.'],
 ['fisch', 'fritz', 'fischt', 'frisch', 'fisch', '.'],
 ['viel',
  'schon',
  'ist',
  'getan',
  ',',
  'mehr',
  'noch',
  'ist',
  'zu',
  'tun',
  ',',
  'sagt',
  'der',
  'wasserhahn',
  'zum',
  'wasserhuhn',
  '.']]

Option 2, setting tmtoolkit.defaults.language provides the same output:

[5]:
import tmtoolkit.defaults
tmtoolkit.defaults.language = 'german'

tokens_de = tokenize(docs_de)
stemmed_de == stem(tokens_de)
[5]:
True

Since we will return to the English corpus, we reset the default language and clean up:

[6]:
tmtoolkit.defaults.language = 'english'

del docs_de, tokens_de, stemmed_de

A small tour around the functional preprocessing API

We will continue with the most important functions in the preprocessing API and apply them to our English sample corpus.

Document length

The document length is the number of tokens per document and can be obtained with doc_lengths():

[7]:
from tmtoolkit.preprocess import doc_lengths

doc_lengths(docs)
[7]:
[227, 646, 1052]

Vocabulary and document frequencies

The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use vocabulary() for that and vocabulary_counts() to additionally get the number of times each token appears in the corpus.

The document frequency of a token is the number of documents in which this token occurs at least once. The function doc_frequencies() returns this measure for all tokens in the vocabulary.

[8]:
from tmtoolkit.preprocess import vocabulary, vocabulary_counts, doc_frequencies

# first 10 entries from the sorted vocab
vocabulary(docs, sort=True)[:10]
[8]:
['%', "'", "''", "'s", '(', ')', ',', '-', '-Al', '.']
[9]:
# get unsorted vocabulary counts as Counter object
vocab_counts = vocabulary_counts(docs)
# get top 10 tokens by occurrence
vocab_counts.most_common(10)
[9]:
[('the', 82),
 (',', 70),
 ('.', 60),
 ('to', 53),
 ('and', 45),
 ('in', 38),
 ('a', 31),
 ('``', 28),
 ('of', 25),
 ("''", 23)]
[10]:
doc_freq = doc_frequencies(docs)

# "the" occurs in all three documents, "Lawyers" only in one
doc_freq['the'], doc_freq['Lawyers']

[10]:
(3, 1)

Part-of-speech (POS) tagging

Part-of-speech (POS) tagging identifies the grammatical word category of each token in a document. The function pos_tag() applies this to the whole corpus. It returns a list of tags for each document. These tags conform to a specific tagset: for English this is the Penn Treebank tagset and for German the STTS tagset.

These tags can be used to filter, annotate or lemmatize the documents.

Remember that this is a language-dependent function.

[11]:
from tmtoolkit.preprocess import pos_tag

docs_pos = pos_tag(docs)

# show pairs of tokens and POS tags for the first 10 tokens in the first document
list(zip(docs[0][:10], docs_pos[0][:10]))
[11]:
[('White', 'NNP'),
 ('House', 'NNP'),
 ('aides', 'NNS'),
 ('told', 'VBD'),
 ('to', 'TO'),
 ('keep', 'VB'),
 ('Russia-related', 'JJ'),
 ('materials', 'NNS'),
 ('Lawyers', 'NNS'),
 ('for', 'IN')]

Stemming and lemmatization

Stemming and lemmatization bring a token, if it is a word, to a base form. The former method is rule-based and creates base forms by chopping off common prefixes and suffixes. The resulting token may not be a lexicographically correct word any more. We’ve already used stem() in an example above.

Lemmatization is a more sophisticated process that tries to find the lexicographically correct base form of a given word by also considering its POS tag and possibly its context (tokens and POS tags nearby). It is usually not rule-based but relies on a trained model that predicts the base form from these inputs. Lemmatization can be applied with lemmatize().

Remember that both functions are language-dependent.

[12]:
from tmtoolkit.preprocess import lemmatize

docs_lem = lemmatize(docs, docs_pos)
# show pairs of original tokens and lemmata for the first 10 tokens of first document
list(zip(docs[0][:10], docs_lem[0][:10]))
[12]:
[('White', 'White'),
 ('House', 'House'),
 ('aides', 'aide'),
 ('told', 'tell'),
 ('to', 'to'),
 ('keep', 'keep'),
 ('Russia-related', 'Russia-related'),
 ('materials', 'material'),
 ('Lawyers', 'Lawyers'),
 ('for', 'for')]

Token normalization

Depending on your methodology, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).

If you want to remove certain characters in all tokens in your corpus, you can use remove_chars() and pass it a sequence of characters to remove.

Note that for the following examples we continue working with the lemmatized documents docs_lem.

[13]:
from tmtoolkit.preprocess import remove_chars

# remove all vowels from the documents, show first 10 tokens from first document
remove_chars(docs_lem, 'aeiou')[0][:10]
[13]:
['Wht', 'Hs', 'd', 'tll', 't', 'kp', 'Rss-rltd', 'mtrl', 'Lwyrs', 'fr']

You can for example use this to remove all punctuation characters from all tokens:

[14]:
import string

docs_clean = remove_chars(docs_lem, string.punctuation)
# show pairs of original tokens and cleaned tokens for the first 10 tokens of 2nd doc.
list(zip(docs_lem[2][:10], docs_clean[2][:10]))
[14]:
[('Should', 'Should'),
 ('you', 'you'),
 ('have', 'have'),
 ('two', 'two'),
 ('bin', 'bin'),
 ('in', 'in'),
 ('your', 'your'),
 ('bathroom', 'bathroom'),
 ('?', ''),
 ('Our', 'Our')]

Notice how the token '?' was transformed to an empty string '', because “?” is a punctuation character.

A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with to_lowercase():

[15]:
from tmtoolkit.preprocess import to_lowercase

docs_clean = to_lowercase(docs_clean)
docs_clean[2][:10]
[15]:
['should', 'you', 'have', 'two', 'bin', 'in', 'your', 'bathroom', '', 'our']

The function clean_tokens() finally applies several steps that remove tokens that meet certain criteria. This includes removing:

  • punctuation tokens

  • stopwords (very common words for the given language)

  • empty tokens (i.e. '')

  • tokens that are longer or shorter than a certain number of characters

  • numbers

Note that this is a language-dependent function, because the default stopword list is determined per language. This function has many parameters to tweak, so it’s recommended to check out the documentation.

[16]:
from tmtoolkit.preprocess import clean_tokens

# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
docs_final = clean_tokens(docs_clean, remove_shorter_than=2, remove_numbers=True)

# first 10 tokens of doc. #2
docs_final[2][:10]
[16]:
['two',
 'bin',
 'bathroom',
 'bathroom',
 'fill',
 'shampoo',
 'bottle',
 'toilet',
 'roll',
 'cleaning']

Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

[17]:
doc_lengths(docs), doc_lengths(docs_final)
[17]:
([227, 646, 1052], [129, 310, 504])

We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

[18]:
len(vocabulary(docs)), len(vocabulary(docs_final))
[18]:
(681, 478)

You can also apply custom token transform functions by using transform() and passing it a function that should be applied to each token in each document (hence it must accept one string argument).

First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:

[19]:
def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('USA'), token_shape('CamelCase'), token_shape('lower')
[19]:
('XXX', 'XxxxxXxxx', 'xxxxx')

We can now apply this function to our corpus:

[20]:
from tmtoolkit.preprocess import transform

doc_shapes = transform(docs, token_shape)

# show pairs of tokens and their shapes for the first 10 tokens in the first document
list(zip(docs[0][:10], doc_shapes[0][:10]))
[20]:
[('White', 'Xxxxx'),
 ('House', 'Xxxxx'),
 ('aides', 'xxxxx'),
 ('told', 'xxxx'),
 ('to', 'xx'),
 ('keep', 'xxxx'),
 ('Russia-related', 'Xxxxxxxxxxxxxx'),
 ('materials', 'xxxxxxxxx'),
 ('Lawyers', 'Xxxxxxx'),
 ('for', 'xxx')]

Keywords-in-context (KWIC)

Keywords-in-context (KWIC) allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.

tmtoolkit provides three functions for this purpose:

  • kwic() is the base function accepting the input documents, a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;

  • kwic_table() is the more “user-friendly” version of the above function, as it produces a datatable with the keyword highlighted by default;

  • filter_tokens_with_kwic() works similarly to the above functions but returns the result as a list of tokenized documents again; it is explained in the section on filtering

Let’s see the first two functions in action:

[21]:
from tmtoolkit.preprocess import kwic, kwic_table

kwic(docs, 'news')
[21]:
[[],
 [['told', 'Reuters', 'news', 'agency', '.'],
  ['Jazeera', 'and', 'news', 'agencies']],
 []]

We see that the first and last document do not contain any keyword that matches "news", hence we get empty results for these documents. In the second document, we get two result contexts for the requested keyword. This keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two tokens to the right. Notice that in the second result context only one token to the right is shown since the document ends after “agencies”.

[22]:
kwic_table(docs, 'news')
[22]:
    doc  context  kwic
 0    1        0  told Reuters *news* agency .
 1    1        1  Jazeera and *news* agencies

With kwic_table(), we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as *news* and empty results are removed (only document “1” contains the keyword which is the second document – remember that Python indexing starts with 0).

We can also pass the document labels via doc_labels to get proper labels in the doc column instead of document indices:

[23]:
kwic_table(docs, 'news', doc_labels=doc_labels)
[23]:
    doc                context  kwic
 0  NewsArticles-3350        0  told Reuters *news* agency .
 1  NewsArticles-3350        1  Jazeera and *news* agencies

Another important parameter is context_size. It determines the number of tokens to display left and right to the found keyword. You can either pass a single integer for a symmetric context or a tuple with integers (<left>, <right>).

[24]:
# a symmetric context of size (5, 5)
kwic_table(docs, 'news', context_size=5, doc_labels=doc_labels)
[24]:
    doc                context  kwic
 0  NewsArticles-3350        0  a traveler , told Reuters *news* agency . Al Jazee…
 1  NewsArticles-3350        1  Source : -Al Jazeera and *news* agencies
[25]:
# an asymmetric context of size (5, 1)
kwic_table(docs, 'news', context_size=(5, 1), doc_labels=doc_labels)
[25]:
    doc                context  kwic
 0  NewsArticles-3350        0  a traveler , told Reuters *news* agency
 1  NewsArticles-3350        1  Source : -Al Jazeera and *news* agencies

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact matches between the corpus tokens and our keyword "news". However, it is also possible to match patterns like "new*" (matches any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section gives an introduction on the different options for pattern matching.

Common parameters for pattern matching functions

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

  • search_token or search_tokens: lets you specify one or more patterns as strings

  • match_type: sets the matching type and can be one of the following options:

      • 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching

      • 'regex' uses regular expression matching

      • 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see the globre package)

  • ignore_case: ignore character case (applies to all three match types)

  • glob_method: if match_type is ‘glob’, use this glob method. Must be 'match' or 'search' (similar behavior as Python’s re.match or re.search)

  • inverse: invert the match results, i.e. when matching “hello”, return all results that do not match “hello”

Let’s try out some of these options with kwic_table():

[26]:
# using a regular expression, ignoring case
kwic_table(docs, r'agenc(y|ies)', match_type='regex', ignore_case=True,
           doc_labels=doc_labels)
[26]:
    doc                context  kwic
 0  NewsArticles-1880        0  law enforcement *agencies* to keep
 1  NewsArticles-1880        1  organizations , *agencies* and individuals
 2  NewsArticles-3350        0  Reuters news *agency* . Al
 3  NewsArticles-3350        1  and news *agencies*
[27]:
# using a glob, ignoring case
kwic_table(docs, 'pol*', match_type='glob', ignore_case=True,
           doc_labels=doc_labels)
[27]:
    doc                context  kwic
 0  NewsArticles-1880        0  false and *politically* motivated attacks
 1  NewsArticles-99          0  , senior *policy* adviser for
[28]:
# using a glob, ignoring case
kwic_table(docs, '*sol*', match_type='glob', ignore_case=True,
           doc_labels=doc_labels)
[28]:
    doc              context  kwic
 0  NewsArticles-99        0  potential simple *solution* that could
 1  NewsArticles-99        1  confused by *aerosols* . ''
 2  NewsArticles-99        2  bottles , *aerosols* for deodorant
[29]:
# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
kwic_table(docs, r'[AEIOUaeiou]', match_type='regex', inverse=True,
           doc_labels=doc_labels)
[29]:
      doc                context  kwic
   0  NewsArticles-1880        0  in the *2016* presidential election
   1  NewsArticles-1880        1  related investigations *,* ABC News
   2  NewsArticles-1880        2  has confirmed *.* `` The
   3  NewsArticles-1880        3  confirmed . *``* The White
   4  NewsArticles-1880        4  motivated attacks *,* '' an
   5  NewsArticles-1880        5  attacks , *''* an administration
   6  NewsArticles-1880        6  News Wednesday *.* The directive
   7  NewsArticles-1880        7  last week *by* Senate Democrats
   8  NewsArticles-1880        8  between Trump *'s* administration ,
   9  NewsArticles-1880        9  's administration *,* campaign and
  10  NewsArticles-1880       10  transition teams *``* ? or
  11  NewsArticles-1880       11  teams `` *?* or anyone
  12  NewsArticles-1880       12  their behalf *``* ? and
  13  NewsArticles-1880       13  behalf `` *?* and Russian
  14  NewsArticles-1880       14  their associates *.* Similarly ,
   …
 252  NewsArticles-99        142  you do *n't* have the
 253  NewsArticles-99        143  two bins *?* There are
 254  NewsArticles-99        144  other options *.* Hang a
 255  NewsArticles-99        145  recycling bin *.* Or opt
 256  NewsArticles-99        146  non-recyclable items *.*

Filtering tokens and documents

We can use the pattern matching parameters in numerous filtering functions and methods. The heart of many of these functions is token_match(). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a boolean NumPy array of the same length as the input tokens. Each occurrence of True in this array signals a match.

[30]:
from tmtoolkit.preprocess import token_match

doc0_snippet = docs[0][:10]   # first 10 tokens of first doc.
# get all tokens that match "to*"
matches = token_match('to*', doc0_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc0_snippet, matches):
    print(tok, ':', match)
White : False
House : False
aides : False
told : True
to : True
keep : False
Russia-related : False
materials : False
Lawyers : False
for : False

The token_match() function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.
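For instance, here is a small sketch (with made-up file names) that uses token_match() outside of the text preprocessing context:

from tmtoolkit.preprocess import token_match

# made-up list of file names to match against
filenames = ['report_2016.txt', 'report_2017.txt', 'notes.md', 'summary.txt']

# boolean NumPy array: True where the glob pattern matches
matches = token_match('report_*.txt', filenames, match_type='glob')

for fname, match in zip(filenames, matches):
    print(fname, ':', match)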

The following functions cover common use-cases for filtering during text preprocessing. Many of these functions start either with filter_...() or remove_...() and these pairs of filter and remove functions are complements. A filter function will always retain the matched elements whereas a remove function will always drop the matched elements. We can observe that with the first pair of functions, filter_tokens() and remove_tokens():

[31]:
from tmtoolkit.preprocess import filter_tokens, remove_tokens

# retain only the tokens that match the pattern in each document
filter_tokens(docs, '*house*', match_type='glob', ignore_case=True)
[31]:
[['House', 'House', 'House', 'House'],
 [],
 ['house', 'greenhouse', 'household']]
[32]:
# retain only the tokens that DON'T match the pattern in each document
# will only show the first 10 tokens from the first document here, b/c
# the resulting documents are too long; you can see that "House" was
# removed from ["White", "House", ...]
remove_tokens(docs, '*house*', match_type='glob', ignore_case=True)[0][:10]
[32]:
['White',
 'aides',
 'told',
 'to',
 'keep',
 'Russia-related',
 'materials',
 'Lawyers',
 'for',
 'the']

The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents according to the supplied match criteria. Both accept the standard pattern matching parameters but also a parameter matches_threshold with default value 1. When this number of matching tokens is reached, the document will be part of the result set (filter_documents()) or removed from the result set (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.

Let’s try these functions out in practice. This time we will also pass the doc_labels so that the filtering also applies to our list of document labels. If doc_labels is also passed, the functions return two results – the filtered list of documents and the filtered list of document labels.

[33]:
from tmtoolkit.preprocess import filter_documents, remove_documents

filtered_docs, filtered_doc_labels = filter_documents(docs, '*house*',
                                                      doc_labels=doc_labels,
                                                      match_type='glob',
                                                      ignore_case=True)
filtered_doc_labels
[33]:
['NewsArticles-1880', 'NewsArticles-99']

We can see that two out of three documents contained the pattern '*house*' and hence were retained. The list filtered_docs represents these two documents (we don’t print them here because they are too long).

We can also adjust matches_threshold to set the minimum number of token matches for filtering:

[34]:
filtered_docs, filtered_doc_labels = filter_documents(docs, '*house*',
                                                      doc_labels=doc_labels,
                                                      match_type='glob',
                                                      ignore_case=True,
                                                      matches_threshold=4)
filtered_doc_labels
[34]:
['NewsArticles-1880']
[35]:
filtered_docs, filtered_doc_labels = remove_documents(docs, '*house*',
                                                      doc_labels=doc_labels,
                                                      match_type='glob',
                                                      ignore_case=True)
filtered_doc_labels
[35]:
['NewsArticles-3350']

When we use remove_documents() we get only the documents that did not contain the specified pattern.

Another useful pair of functions is filter_documents_by_name() and remove_documents_by_name(). Both functions again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:

[36]:
from tmtoolkit.preprocess import filter_documents_by_name

filtered_docs, filtered_doc_labels = filter_documents_by_name(docs, doc_labels,
                                                              r'-\d{4}$',
                                                              match_type='regex')
filtered_doc_labels
[36]:
['NewsArticles-1880', 'NewsArticles-3350']

In the above example we wanted to retain only the documents whose document labels ended with exactly 4 digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() will do the exact opposite.

You may also use Keywords-in-context (KWIC) to filter your tokens in the neighborhood around certain keyword pattern(s). The function for that is called filter_tokens_with_kwic() and works very similarly to kwic() but returns the result as a list of tokenized documents (whereas kwic() returns a list of KWIC results per document) with which you can continue working as usual. Here, we filter the tokens in each document to get the tokens directly before and after the glob pattern '*house*' (context_size=1):

[37]:
from tmtoolkit.preprocess import filter_tokens_with_kwic

filter_tokens_with_kwic(docs, '*house*', context_size=1,
                        match_type='glob', ignore_case=True)
[37]:
[['White',
  'House',
  'aides',
  'White',
  'House',
  'aides',
  'White',
  'House',
  'is',
  'White',
  'House',
  'and'],
 [],
 ['the',
  'house',
  ',',
  'of',
  'greenhouse',
  'gases',
  'UK',
  'household',
  'threw']]

Once you have annotated your documents’ tokens with Part-of-Speech (POS) tags, you can also filter them using filter_for_pos(). You need to pass the documents, their POS tags and the POS tag(s) to be used for filtering:

[38]:
from tmtoolkit.preprocess import filter_for_pos

filtered_docs, filtered_docs_pos = filter_for_pos(docs, docs_pos, 'N')
# displaying only the first 10 filtered tokens from the first document
filtered_docs[0][:10]
[38]:
['White',
 'House',
 'aides',
 'materials',
 'Lawyers',
 'Trump',
 'administration',
 'White',
 'House',
 'aides']

In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the function documentation). We can also filter for more than one tag, e.g. nouns or verbs:

[39]:
filtered_docs, filtered_docs_pos = filter_for_pos(docs, docs_pos, ['N', 'V'])
# displaying only the first 10 filtered tokens from the first document
filtered_docs[0][:10]
[39]:
['White',
 'House',
 'aides',
 'told',
 'keep',
 'materials',
 'Lawyers',
 'Trump',
 'administration',
 'have']

filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.
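A short sketch of such an inverted filter (output not shown here) could look like this:

# keep only the tokens that were NOT tagged as nouns by inverting the match
docs_nonnouns, docs_nonnouns_pos = filter_for_pos(docs, docs_pos, 'N', inverse=True)
docs_nonnouns[0][:10]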

Finally, there are two functions for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens that have a document frequency greater than or equal to a certain threshold defined by the parameter df_threshold. The latter does the same for all tokens that have a document frequency lower than or equal to df_threshold. This parameter can either be a relative frequency (default) or an absolute count (by setting the parameter absolute=True).

Before applying the function, let’s have a look at the number of tokens per document again, to later see how many we will remove:

[40]:
doc_lengths(docs)
[40]:
[227, 646, 1052]
[41]:
from tmtoolkit.preprocess import remove_common_tokens

doc_lengths(remove_common_tokens(docs, df_threshold=0.9))
[41]:
[143, 413, 699]

By removing all tokens with a document frequency of at least 0.9, we remove quite a number of tokens from each document. Let’s investigate the vocabulary in order to see which tokens are removed:

[42]:
orig_vocab = vocabulary(docs)  # vocabulary of unfiltered documents

filtered_docs = remove_common_tokens(docs, df_threshold=0.9)
filtered_vocab = vocabulary(filtered_docs)
orig_vocab - filtered_vocab   # set difference gives removed vocabulary tokens
[42]:
{"''",
 "'s",
 ',',
 '.',
 '?',
 'The',
 '``',
 'a',
 'all',
 'also',
 'an',
 'and',
 'be',
 'for',
 'has',
 'have',
 'in',
 'into',
 'is',
 'more',
 'of',
 'on',
 'or',
 'other',
 'such',
 'than',
 'that',
 'the',
 'to',
 'which',
 'with'}

remove_uncommon_tokens() works similarly. This time, let’s use an absolute count as threshold:

[43]:
from tmtoolkit.preprocess import remove_uncommon_tokens

filtered_docs = remove_uncommon_tokens(docs, df_threshold=1, absolute=True)
filtered_vocab = vocabulary(filtered_docs)
# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(orig_vocab - filtered_vocab)[:10]
[43]:
['%', '(', ')', '-Al', '.-', '10', '12', '135,000', '2016', '38']

The above means that we remove all tokens that appear in exactly one document.
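With the default relative mode, a comparable sketch for our three documents could use a threshold of one third (this exact value is an illustration, not taken from the original example):

# remove tokens that occur in at most a third of the documents
# (for three documents this again means tokens that occur in only one document)
filtered_docs_rel = remove_uncommon_tokens(docs, df_threshold=1/3)
len(vocabulary(filtered_docs_rel))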

Expanding compound words and joining tokens

Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compounds():

[44]:
from tmtoolkit.preprocess import expand_compounds

# trying it out with a single *tokenized* document:
expand_compounds([['US-Student', 'on', 'Berlin-bound', 'train', '.']])
[44]:
[['US', 'Student', 'on', 'Berlin', 'bound', 'train', '.']]
[45]:
# applying this to our documents

docs_expanded = expand_compounds(docs)
orig_vocab - vocabulary(docs_expanded)    # vocabulary tokens that were expanded
[45]:
{'-Al',
 '.-',
 'Britain-bound',
 'Lagoas-and',
 'Russia-related',
 'ban.-',
 'carry-on',
 'editor-in-chief',
 'experts-perplexed',
 'non-recyclable',
 'off-putting',
 're-use'}

It’s also possible to join together certain subsequent occurrences of tokens or token patterns. This means you can for example transform all subsequent occurrences of the tokens “White” and “House” into single tokens “White_House”. In case you don’t use n-grams (see next section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as persons, institutions or concepts like “Climate Change”, as a single token. The function to use for this is glue_tokens(). You can pass this function:

  • documents docs to operate on;

  • a patterns sequence of length N that is used to match the subsequent N tokens;

  • a glue string that is used to join the matched subsequent tokens (by default: "_").

Along with that, you can adjust the token matching with the well-known common token matching parameters.

Let’s “glue” all subsequent occurrences of “White” and “House”:

[46]:
from tmtoolkit.preprocess import glue_tokens

# showing only first 20 tokens in document 1
glue_tokens(docs, ['White', 'House'])[0][:20]
[46]:
['White_House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia-related',
 'materials',
 'Lawyers',
 'for',
 'the',
 'Trump',
 'administration',
 'have',
 'instructed',
 'White_House',
 'aides',
 'to',
 'preserve',
 'any',
 'material']

Instead of exact matches, we can also specify a sequence of regular expressions (or “glob” expressions) that must be matched by subsequent tokens. Here, we want to join all token pairs where the first token starts with a capital letter and the second token is “Trump”. We also set return_glued_tokens to True so that a second return value is created containing all matched and “glued” tokens.

[47]:
docs_glued, glued = glue_tokens(docs, [r'^[A-Z]', 'Trump'], match_type='regex',
                                return_glued_tokens=True)
glued
[47]:
{'President_Trump'}

Let’s have a quick view at the context using kwic_table(). We can see that only one such pattern was matched:

[48]:
kwic_table(docs_glued, 'President_Trump')
[48]:
    doc  context  kwic
 0    0        0  contact between *President_Trump* 's advisers

Generating n-grams

So far, we worked with unigrams, i.e. each document consisted of a sequence of discrete tokens. We can also generate n-grams from our corpus, where each document then consists of a sequence of n-grams, i.e. chunks of n subsequent tokens each. An example would be:

Document: “This is a simple example.”

n=1 (unigrams):

['This', 'is', 'a', 'simple', 'example', '.']

n=2 (bigrams):

['This is', 'is a', 'a simple', 'simple example', 'example .']

n=3 (trigrams):

['This is a', 'is a simple', 'a simple example', 'simple example .']

The function ngrams() allows us to generate n-grams from tokenized documents.

[49]:
from tmtoolkit.preprocess import ngrams

# showing the first 10 bigrams from the first document:
ngrams(docs, n=2)[0][:10]
[49]:
['White House',
 'House aides',
 'aides told',
 'told to',
 'to keep',
 'keep Russia-related',
 'Russia-related materials',
 'materials Lawyers',
 'Lawyers for',
 'for the']

The string used to join the tokens in each n-gram can be specified via join_str:

[50]:
# showing the first 10 trigrams from the first document:
ngrams(docs, n=3, join_str='_')[0][:10]
[50]:
['White_House_aides',
 'House_aides_told',
 'aides_told_to',
 'told_to_keep',
 'to_keep_Russia-related',
 'keep_Russia-related_materials',
 'Russia-related_materials_Lawyers',
 'materials_Lawyers_for',
 'Lawyers_for_the',
 'for_the_Trump']

The n-grams don’t have to be joined. You can use join=False to generate n-grams as string lists of size n:

[51]:
# showing the first 10 bigrams from the first document:
ngrams(docs, n=2, join=False)[0][:10]
[51]:
[['White', 'House'],
 ['House', 'aides'],
 ['aides', 'told'],
 ['told', 'to'],
 ['to', 'keep'],
 ['keep', 'Russia-related'],
 ['Russia-related', 'materials'],
 ['materials', 'Lawyers'],
 ['Lawyers', 'for'],
 ['for', 'the']]

Generating a sparse document-term matrix (DTM)

If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which represents the number of occurrences of each term (i.e. vocabulary token) in each document. This is an N-by-M matrix, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).
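To illustrate the concept with a minimal, made-up example before using tmtoolkit’s own functions, a DTM for two toy documents could be built by hand like this:

import numpy as np

# two toy tokenized documents and their vocabulary
toy_docs = [['a', 'b', 'a'], ['b', 'c']]
toy_vocab = ['a', 'b', 'c']

# count each vocabulary token per document -> a 2x3 document-term matrix
toy_dtm = np.array([[doc.count(t) for t in toy_vocab] for doc in toy_docs])
toy_dtm
# array([[2, 1, 0],
#        [0, 1, 1]])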

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.

For this example, we’ll use the preprocessed documents docs_final from above. First, let’s check the vocabulary size:

[52]:
len(vocabulary(docs_final))
[52]:
478

Now we can use sparse_dtm() to generate a sparse DTM. We can either pass an already computed sorted vocabulary or let the function itself generate a vocabulary which is necessary to construct the DTM. In the latter case, the generated vocabulary is also returned:

[53]:
from tmtoolkit.preprocess import sparse_dtm

dtm, vocab_final = sparse_dtm(docs_final)
dtm
[53]:
<3x478 sparse matrix of type '<class 'numpy.int32'>'
        with 529 stored elements in COOrdinate format>

We can see that a sparse matrix with 3 rows (which corresponds with the number of documents) and 478 columns was generated (which corresponds with the vocabulary size). 529 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. dense, representation and see parts of its elements:

[54]:
dtm.todense()
[54]:
matrix([[2, 1, 1, ..., 0, 0, 0],
        [0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 0, 2, 1]], dtype=int32)

However, note that you should only convert a sparse matrix to a dense representation when you’re either dealing with a small amount of data (which is what we’re doing in this example), or use only a part of the full matrix. Converting a sparse matrix to a dense representation can otherwise easily exceed the available computer memory.

There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in “coo” format, which is a good intermediate format that you can quickly convert to a different sparse matrix format, but which doesn’t offer many matrix operations. For example, the “coo” format doesn’t support indexing:

[55]:
# not running the following here:
# dtm[0, 0]

# it creates the following exception:
# TypeError: 'coo_matrix' object is not subscriptable
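If you want to see this for yourself, a small sketch that catches the exception could look like this:

try:
    dtm[0, 0]
except TypeError as exc:
    print('indexing failed:', exc)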

So you have to convert the sparse DTM to another format first. For example, the CSR format allows indexing and is especially optimized for fast row access:

[56]:
dtm.tocsr()[0, 443]
[56]:
4

This gives us the number of times the token at vocabulary index 443 occurs in the first document. Which token and document does this refer to exactly? We can find out using doc_labels, which corresponds to the rows in dtm, and vocab_final, which was returned by sparse_dtm() and corresponds to the columns:

[57]:
doc_labels[0], vocab_final[443]
[57]:
('NewsArticles-1880', 'trump')

Where does the index 443 come from? It’s the position of the token “trump” in the vocab_final list. These indices are important when working with DTMs, so you should be familiar with the methods of Python’s list data type, such as index():

[58]:
vocab_final.index('trump')
[58]:
443

See also the following example of finding out the index for “administration” and then getting an array that represents the number of occurrences of this token across all three documents:

[59]:
vocab_admin_ix = vocab_final.index('administration')
dtm.tocsc()[:, vocab_admin_ix].toarray()
[59]:
array([[4],
       [1],
       [0]], dtype=int32)

Parallel processing with the TMPreproc class

As mentioned in the beginning of this chapter, the TMPreproc class employs parallel computation for text preprocessing. All functions that are available in the functional API are also available in the TMPreproc class as properties or methods. So you can do exactly the same things, only with a slightly different syntax and with the benefit of parallel processing.

Optional: enabling logging output

First, let’s have a look at how to display logging output from tmtoolkit. By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long-running operations, it’s helpful to enable the display of logging output, which can be done as follows:

import logging

logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display, for instance also logging.DEBUG
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True

Creating a TMPreproc object

You can create a TMPreproc object (also known as “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass a Corpus object. This time we will not use a sample but the full English news articles corpus:

[60]:
corpus = Corpus.from_builtin_corpus('english-NewsArticles')
corpus
[60]:
<Corpus [3824 documents]>

We can now pass this directly to TMPreproc. Doing so will at first distribute all documents to several sub-processes which will later be used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes. It defaults to the number of CPU cores in your machine. The distribution of documents to the processes happens according to document size. E.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will take care of the large document alone and CPU 2 will take the other three small documents. After distribution of the documents, they will directly be tokenized (in parallel). Hence when you have a large corpus, the creation of a TMPreproc object may take some time because of the tokenization process.
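If you want to limit the number of worker processes, e.g. on a shared machine, a sketch (not run here) could pass n_max_processes explicitly:

from tmtoolkit.preprocess import TMPreproc

# limit parallel processing to at most two worker processes
preproc_limited = TMPreproc(corpus, n_max_processes=2)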

Let’s create a TMPreproc object from corpus:

[61]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus)
preproc
[61]:
<TMPreproc [3824 documents]>

Another important parameter is language, which defaults to 'english'. So when you’re working with a German corpus, you would create the object as:

preproc = TMPreproc(corpus, language='german')

Our TMPreproc object preproc is now set up to work with the documents passed in corpus and the language 'english'. All further operations with this object will use the specified documents and language.

Accessing tokens, vocabulary and other important properties

TMPreproc provides several properties to access its data and some summary statistics. See for example the number of documents and the sum of the number of tokens in all documents:

[62]:
preproc.n_docs
[62]:
3824
[63]:
preproc.n_tokens
[63]:
2452726

We can also access the document labels and the number of tokens in each document:

[64]:
preproc.doc_labels[:10]  # displaying only the first 10 here
[64]:
['NewsArticles-1',
 'NewsArticles-10',
 'NewsArticles-100',
 'NewsArticles-1000',
 'NewsArticles-1001',
 'NewsArticles-1002',
 'NewsArticles-1003',
 'NewsArticles-1004',
 'NewsArticles-1005',
 'NewsArticles-1006']
[65]:
# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']
[65]:
227

As expected, there are properties for vocabulary and vocabulary counts, too:

[66]:
preproc.vocabulary[:10]  # displaying only the first 10 here
[66]:
['!', '#', '$', '%', '&', "'", "''", "''We", "'-", "'-and"]
[67]:
# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']
[67]:
115385

We can also get the document frequency for each token in the vocabulary as absolute numbers (.vocabulary_abs_doc_frequency) or proportions (.vocabulary_rel_doc_frequency):

[68]:
(preproc.vocabulary_abs_doc_frequency['Trump'],
 preproc.vocabulary_rel_doc_frequency['Trump'])
[68]:
(1096, 0.28661087866108786)
[69]:
(preproc.vocabulary_abs_doc_frequency['Putin'],
 preproc.vocabulary_rel_doc_frequency['Putin'])
[69]:
(166, 0.043410041841004186)

Accessing document tokens

The most important properties are those that start with .tokens.... They give access to the tokenized documents in the TMPreproc object in different formats.

The .tokens property simply returns a dict mapping document labels to their tokens:

[70]:
# only showing the first ten tokens of a specific doc.
preproc.tokens['NewsArticles-1880'][:10]
[70]:
['White',
 'House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia-related',
 'materials',
 'Lawyers',
 'for']

The .tokens_datatable and .tokens_dataframe properties return a datatable Frame or a pandas DataFrame, respectively. The datatable Frame consists of at least three columns: the document label, the position of the token in the document (zero-indexed) and the token itself. Please note that for large amounts of data, .tokens_datatable is usually quicker than .tokens_dataframe.

[71]:
preproc.tokens_datatable
[71]:
           doc               position  token
0          NewsArticles-1    0         Betsy
1          NewsArticles-1    1         DeVos
2          NewsArticles-1    2         Confirmed
3          NewsArticles-1    3         as
4          NewsArticles-1    4         Education
5          NewsArticles-1    5         Secretary
6          NewsArticles-1    6         ,
7          NewsArticles-1    7         With
8          NewsArticles-1    8         Pence
9          NewsArticles-1    9         Casting
10         NewsArticles-1    10        Historic
11         NewsArticles-1    11        Tie-Breaking
12         NewsArticles-1    12        Vote
13         NewsArticles-1    13        Michigan
14         NewsArticles-1    14        billionaire
…
2,452,721  NewsArticles-999  589       article
2,452,722  NewsArticles-999  590       was
2,452,723  NewsArticles-999  591       n't
2,452,724  NewsArticles-999  592       funny
2,452,725  NewsArticles-999  593       ?

The returned pandas DataFrame from .tokens_dataframe has a similar layout (not shown here).
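If you want to inspect it yourself, a quick sketch (output not shown) could use pandas' head() method:

# first rows of the pandas DataFrame variant
preproc.tokens_dataframe.head()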

More columns may be shown when you add token metadata (more on that later).

Understanding TMPreproc as a state machine

Before we proceed with the methods that TMPreproc provides, we should understand how a TMPreproc object represents a state which can be changed by calling its methods. This state also determines the behavior of the object. For example, when you want to lemmatize your documents, you can call the TMPreproc.lemmatize() method (more on that later). However, you can only use this method if you performed POS tagging via TMPreproc.pos_tag() before, i.e. if your TMPreproc object’s state is “ready” for lemmatization.

A TMPreproc object is a complex data structure that encapsulates the data you work with (i.e. your corpus), several “state” variables (e.g. a variable that records whether the tokens have POS tag information), a bunch of methods that transform your data or compute something from it and, as already introduced, some properties that provide access to your data and some summary statistics.

We can see how calling methods may change the data and the state of the object. For example, transforming all tokens to lowercase also changes the vocabulary and hence the vocabulary size:

[72]:
# original vocabulary size
len(preproc.vocabulary)
[72]:
78290
[73]:
preproc.tokens_to_lowercase()
len(preproc.vocabulary)  # vocabulary size is now smaller
[73]:
69086

Copying TMPreproc objects

It’s important to note that after calling the method tokens_to_lowercase(), the tokens in preproc were transformed and the original tokens from before calling this method are not available anymore. In Python, assigning a mutable object to a variable only binds the same object to a different name; it doesn’t copy it. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning such an object to a different variable (say preproc_upper) gives us two names for the same object, and calling a method via either name changes the data for both.

Let’s see this example:

[74]:
preproc_upper = preproc  # simply assignment, no copy!

# we didn't change anything, so this should be true:
preproc.vocabulary == preproc_upper.vocabulary
[74]:
True
[75]:
# let's transform the tokens to uppercase
# we might expect that this only applies to the tokens in "preproc_upper"
preproc_upper.transform_tokens(str.upper)
[75]:
<TMPreproc [3824 documents]>
[76]:
# but the vocabulary is the same for both!
preproc.vocabulary == preproc_upper.vocabulary
[76]:
True
[77]:
preproc.vocabulary[10000:10010]
[77]:
['ARTICHOKES',
 'ARTICLE',
 'ARTICLE-IN',
 'ARTICLE50',
 'ARTICLES',
 'ARTICULATE',
 'ARTICULATED',
 'ARTIFACTS',
 'ARTIFICIAL',
 'ARTIFICIALLY']
[78]:
preproc_upper.vocabulary[10000:10010]
[78]:
['ARTICHOKES',
 'ARTICLE',
 'ARTICLE-IN',
 'ARTICLE50',
 'ARTICLES',
 'ARTICULATE',
 'ARTICULATED',
 'ARTIFACTS',
 'ARTIFICIAL',
 'ARTIFICIALLY']

What happened? As explained, by the assignment preproc_upper = preproc we only assigned a new name to the object behind preproc. Calling methods on either preproc_upper or preproc will essentially modify the same object. We can confirm that both variables point to the same object, by comparing the Python object ID via id():

[79]:
id(preproc), id(preproc_upper)
[79]:
(139932303304072, 139932303304072)

The same is true when you assign the result of a method that returns the TMPreproc “self” object, so you have to watch out here, too:

[80]:
# again, we only create another name for the same object:
preproc_lower = preproc.tokens_to_lowercase()
[81]:
# *all* three names refer to the same object and hence to the same vocabulary
preproc_lower.vocabulary == preproc_upper.vocabulary == preproc.vocabulary
[81]:
True
[82]:
# it's all lowercase now
preproc.vocabulary[10000:10010]
[82]:
['arthanayake',
 'arthaud',
 'arthena',
 'arthenia',
 'arthritic',
 'arthritis',
 'arthur',
 'artichokes',
 'article',
 'article-in']

What can we do about that? We need to copy the object, which can be done with the TMPreproc.copy() method. This way, we create another variable that points to a separate TMPreproc object.

[83]:
preproc_upper = preproc.copy()
[84]:
# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)
[84]:
(139931125296264, 139932303304072)
[85]:
preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary
[85]:
False
[86]:
preproc_upper.vocabulary[10000:10010]
[86]:
['ARTICHOKES',
 'ARTICLE',
 'ARTICLE-IN',
 'ARTICLE50',
 'ARTICLES',
 'ARTICULATE',
 'ARTICULATED',
 'ARTIFACTS',
 'ARTIFICIAL',
 'ARTIFICIALLY']

Note that this now also uses twice as much computer memory. So you shouldn’t create copies too often, and you should release unused memory by using del:

[87]:
# removing the objects again
del preproc_upper, preproc_lower

Serialization: Saving and loading TMPreproc objects

The current state of a TMPreproc object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it using that file. The methods for that are TMPreproc.save_state() and TMPreproc.load_state() / TMPreproc.from_state().

Let’s store the current state of the preproc, which has all tokens transformed to lowercase:

[88]:
preproc.save_state('data/preproc_lowercase.pickle')
[88]:
<TMPreproc [3824 documents]>

Let’s change the object by retaining only documents that contain the token “trump” (see the reduced number of documents):

[89]:
preproc.filter_documents('trump')
[89]:
<TMPreproc [1097 documents]>

We can restore the saved data using TMPreproc.from_state():

[90]:
preproc_full = TMPreproc.from_state('data/preproc_lowercase.pickle')
preproc_full
[90]:
<TMPreproc [3824 documents]>

This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. When you’re finished running these operations, you can easily store the current state to disk and later retrieve it without the need to re-run these operations.

Methods

All functions from the functional API are also available as TMPreproc methods, most carrying the same name. Additional functionality comes in the form of token metadata handling, which will be the first topic in the next section.

Before starting to explore the TMPreproc methods, we’ll re-create a fresh TMPreproc object from the NewsArticles corpus and make a copy of it in order to be able to revert to that state later.

[91]:
preproc = TMPreproc(corpus)
preproc_orig = preproc.copy()
preproc
[91]:
<TMPreproc [3824 documents]>

Working with token metadata / POS tagging

TMPreproc allows you to attach arbitrary metadata to each token in each document. This kind of token annotation is very useful. For example, you may add metadata that records a token’s length or whether it consists only of uppercase letters and later use that for filtering or in further analyses. One function to add such metadata is add_metadata_per_doc(). This function requires passing a dict that maps document labels to the respective token metadata lists. Each list’s length must match the number of tokens in the respective document. First, we need to create such a metadata dict. Let’s do that for the tokens’ lengths:

[92]:
meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc.tokens.items()}

# show the first 10 tokens and their string lengths for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
         meta_tok_lengths['NewsArticles-1880'][:10]))
[92]:
[('White', 5),
 ('House', 5),
 ('aides', 5),
 ('told', 4),
 ('to', 2),
 ('keep', 4),
 ('Russia-related', 14),
 ('materials', 9),
 ('Lawyers', 7),
 ('for', 3)]

We can now add this metadata via add_metadata_per_doc(). We pass the metadata key (here 'length') and the previously generated metadata dict:

[93]:
preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore

The property .tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):

[94]:
preproc.tokens_datatable
[94]:
           doc               position  token         meta_length
0          NewsArticles-1    0         Betsy         5
1          NewsArticles-1    1         DeVos         5
2          NewsArticles-1    2         Confirmed     9
3          NewsArticles-1    3         as            2
4          NewsArticles-1    4         Education     9
5          NewsArticles-1    5         Secretary     9
6          NewsArticles-1    6         ,             1
7          NewsArticles-1    7         With          4
8          NewsArticles-1    8         Pence         5
9          NewsArticles-1    9         Casting       7
10         NewsArticles-1    10        Historic      8
11         NewsArticles-1    11        Tie-Breaking  12
12         NewsArticles-1    12        Vote          4
13         NewsArticles-1    13        Michigan      8
14         NewsArticles-1    14        billionaire   11
…
2,452,721  NewsArticles-999  589       article       7
2,452,722  NewsArticles-999  590       was           3
2,452,723  NewsArticles-999  591       n't           3
2,452,724  NewsArticles-999  592       funny         5
2,452,725  NewsArticles-999  593       ?             1

Let’s add a boolean indicator for whether the given token is all uppercase:

[95]:
meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper

preproc.tokens_datatable
[95]:
           doc               position  token         meta_upper  meta_length
0          NewsArticles-1    0         Betsy         0           5
1          NewsArticles-1    1         DeVos         0           5
2          NewsArticles-1    2         Confirmed     0           9
3          NewsArticles-1    3         as            0           2
4          NewsArticles-1    4         Education     0           9
5          NewsArticles-1    5         Secretary     0           9
6          NewsArticles-1    6         ,             0           1
7          NewsArticles-1    7         With          0           4
8          NewsArticles-1    8         Pence         0           5
9          NewsArticles-1    9         Casting       0           7
10         NewsArticles-1    10        Historic      0           8
11         NewsArticles-1    11        Tie-Breaking  0           12
12         NewsArticles-1    12        Vote          0           4
13         NewsArticles-1    13        Michigan      0           8
14         NewsArticles-1    14        billionaire   0           11
…
2,452,721  NewsArticles-999  589       article       0           7
2,452,722  NewsArticles-999  590       was           0           3
2,452,723  NewsArticles-999  591       n't           0           3
2,452,724  NewsArticles-999  592       funny         0           5
2,452,725  NewsArticles-999  593       ?             0           1

You may use these newly added columns now for example for filtering the datatable:

[96]:
import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1,:]
[96]:
        doc                position  token     meta_upper  meta_length
0       NewsArticles-1     466       ABC       1           3
1       NewsArticles-10    10        A         1           1
2       NewsArticles-10    109       U.S       1           3
3       NewsArticles-10    225       ABC       1           3
4       NewsArticles-10    227       WEAR      1           4
5       NewsArticles-10    290       AP        1           2
6       NewsArticles-10    373       9613BJ    1           6
7       NewsArticles-100   97        UK        1           2
8       NewsArticles-100   108       UK        1           2
9       NewsArticles-100   326       C         1           1
10      NewsArticles-100   559       A         1           1
11      NewsArticles-100   581       UK        1           2
12      NewsArticles-1000  11        A         1           1
13      NewsArticles-1000  26        A         1           1
14      NewsArticles-1000  123       A         1           1
...
36,844  NewsArticles-999   490       U.S       1           3
36,845  NewsArticles-999   495       LTE       1           3
36,846  NewsArticles-999   515       4G        1           2
36,847  NewsArticles-999   567       22-28GB   1           7
36,848  NewsArticles-999   575       FCC       1           3
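
Since .tokens_datatable is an ordinary datatable Frame, you can also apply datatable's own grouping and aggregation syntax to it. As a small sketch (plain datatable code, not a tmtoolkit feature), the following would count the all-uppercase tokens per document:

import datatable as dt

# count rows with meta_upper == 1, grouped by document label
upper_counts = preproc.tokens_datatable[dt.f.meta_upper == 1, dt.count(), dt.by(dt.f.doc)]
upper_counts.head(5)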

POS tagging is also a way of annotating tokens in TMPreproc. When you run the method pos_tag(), a new metadata column meta_pos is added. We can try that out now:

[97]:
preproc.pos_tag()
preproc.tokens_datatable
[97]:
           doc               position  token          meta_upper  meta_pos  meta_length
0          NewsArticles-1    0         Betsy          0           NNP       5
1          NewsArticles-1    1         DeVos          0           NNP       5
2          NewsArticles-1    2         Confirmed      0           NNP       9
3          NewsArticles-1    3         as             0           IN        2
4          NewsArticles-1    4         Education      0           NNP       9
5          NewsArticles-1    5         Secretary      0           NNP       9
6          NewsArticles-1    6         ,              0           ,         1
7          NewsArticles-1    7         With           0           IN        4
8          NewsArticles-1    8         Pence          0           NNP       5
9          NewsArticles-1    9         Casting        0           NNP       7
10         NewsArticles-1    10        Historic       0           NNP       8
11         NewsArticles-1    11        Tie-Breaking   0           NNP       12
12         NewsArticles-1    12        Vote           0           NNP       4
13         NewsArticles-1    13        Michigan       0           NNP       8
14         NewsArticles-1    14        billionaire    0           POS       11
...
2,452,721  NewsArticles-999  589       article        0           NN        7
2,452,722  NewsArticles-999  590       was            0           VBD       3
2,452,723  NewsArticles-999  591       n't            0           RB        3
2,452,724  NewsArticles-999  592       funny          0           JJ        5
2,452,725  NewsArticles-999  593       ?              0           .         1

We can see that a new column meta_pos with the POS tags for each token was introduced.

To see which metadata keys are available, you can use get_available_metadata_keys():

[98]:
preproc.get_available_metadata_keys()
[98]:
{'length', 'pos', 'upper'}

Token metadata can be removed with remove_metadata():

[99]:
preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()
[99]:
{'length', 'pos'}

The section on filtering will later show how to use metadata to filter tokens and documents.

Token transformations

As mentioned before, TMPreproc provides the same functionality as the functional API. Token transformations like stemming, lemmatization, lowercase transformation, etc. can be applied step by step. We will show a typical preprocessing pipeline consisting of:

  1. lemmatization (which we can apply because we already POS-tagged our tokens)

  2. lowercase transformation

  3. token cleaning

  4. removal of very common and very uncommon tokens

Let’s start with the lemmatize() method:

[100]:
preproc.lemmatize()
preproc.tokens_datatable
[100]:
           doc               position  token          meta_pos  meta_length
0          NewsArticles-1    0         Betsy          NNP       5
1          NewsArticles-1    1         DeVos          NNP       5
2          NewsArticles-1    2         Confirmed      NNP       9
3          NewsArticles-1    3         as             IN        2
4          NewsArticles-1    4         Education      NNP       9
5          NewsArticles-1    5         Secretary      NNP       9
6          NewsArticles-1    6         ,              ,         1
7          NewsArticles-1    7         With           IN        4
8          NewsArticles-1    8         Pence          NNP       5
9          NewsArticles-1    9         Casting        NNP       7
10         NewsArticles-1    10        Historic       NNP       8
11         NewsArticles-1    11        Tie-Breaking   NNP       12
12         NewsArticles-1    12        Vote           NNP       4
13         NewsArticles-1    13        Michigan       NNP       8
14         NewsArticles-1    14        billionaire    POS       11
...
2,452,721  NewsArticles-999  589       article        NN        7
2,452,722  NewsArticles-999  590       be             VBD       3
2,452,723  NewsArticles-999  591       n't            RB        3
2,452,724  NewsArticles-999  592       funny          JJ        5
2,452,725  NewsArticles-999  593       ?              .         1

Note in the output above that the meta_length values still refer to the tokens before lemmatization (e.g. “be” is still listed with length 3, inherited from “was”), since token metadata is not recomputed automatically. We now proceed with the pipeline and employ “method chaining”: you can apply several methods one after another by chaining them with a ., as long as each method returns a TMPreproc object:

[101]:
preproc.tokens_to_lowercase().clean_tokens(remove_numbers=True)
preproc.tokens_datatable
[101]:
           doc               position  token          meta_pos  meta_length
0          NewsArticles-1    0         betsy          NNP       5
1          NewsArticles-1    1         devos          NNP       5
2          NewsArticles-1    2         confirmed      NNP       9
3          NewsArticles-1    3         education      NNP       9
4          NewsArticles-1    4         secretary      NNP       9
5          NewsArticles-1    5         pence          NNP       5
6          NewsArticles-1    6         casting        NNP       7
7          NewsArticles-1    7         historic       NNP       8
8          NewsArticles-1    8         tie-breaking   NNP       12
9          NewsArticles-1    9         vote           NNP       4
10         NewsArticles-1    10        michigan       NNP       8
11         NewsArticles-1    11        billionaire    POS       11
12         NewsArticles-1    12        education      NN        9
13         NewsArticles-1    13        activist       NN        8
14         NewsArticles-1    14        betsy          NNP       5
...
1,313,679  NewsArticles-999  275       away           RB        4
1,313,680  NewsArticles-999  276       think          VBD       7
1,313,681  NewsArticles-999  277       article        NN        7
1,313,682  NewsArticles-999  278       n't            RB        3
1,313,683  NewsArticles-999  279       funny          JJ        5
[102]:
preproc.remove_common_tokens(0.9).remove_uncommon_tokens(5, absolute=True)
preproc.tokens_datatable
[102]:
           doc               position  token          meta_pos  meta_length
0          NewsArticles-1    0         betsy          NNP       5
1          NewsArticles-1    1         devos          NNP       5
2          NewsArticles-1    2         education      NNP       9
3          NewsArticles-1    3         secretary      NNP       9
4          NewsArticles-1    4         pence          NNP       5
5          NewsArticles-1    5         historic       NNP       8
6          NewsArticles-1    6         vote           NNP       4
7          NewsArticles-1    7         michigan       NNP       8
8          NewsArticles-1    8         billionaire    POS       11
9          NewsArticles-1    9         education      NN        9
10         NewsArticles-1    10        activist       NN        8
11         NewsArticles-1    11        betsy          NNP       5
12         NewsArticles-1    12        devos          NNP       5
13         NewsArticles-1    13        confirm        VBN       9
14         NewsArticles-1    14        today          NN        5
...
1,183,399  NewsArticles-999  219       away           RB        4
1,183,400  NewsArticles-999  220       think          VBD       7
1,183,401  NewsArticles-999  221       article        NN        7
1,183,402  NewsArticles-999  222       n't            RB        3
1,183,403  NewsArticles-999  223       funny          JJ        5

When we have a look at the vocabulary size and compare it with that of the unprocessed data, we can see that we greatly reduced the number of unique tokens:

[103]:
len(preproc.vocabulary), len(preproc_orig.vocabulary)
[103]:
(11250, 78290)
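
To put this reduction into perspective, you can relate the two numbers directly (plain Python, nothing tmtoolkit-specific):

n_vocab, n_vocab_orig = len(preproc.vocabulary), len(preproc_orig.vocabulary)
print('reduced to %.1f%% of the original vocabulary size' % (100 * n_vocab / n_vocab_orig))
# with the numbers above, this prints: reduced to 14.4% of the original vocabulary size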

Filtering

Filtering also works the same as with the functional API, i.e. methods like filter_tokens() or filter_documents() are available. We will now focus on filtering with metadata.

We can tell filter_tokens() and similar methods to match against metadata instead of the tokens themselves. For example, we can use the meta_length metadata, which we created in the metadata section, to filter for tokens of a certain length:

[104]:
preproc.filter_tokens(3, by_meta='length')
preproc.tokens_datatable
[104]:
        doc               position  token  meta_pos  meta_length
0       NewsArticles-1    0         use    VB        3
1       NewsArticles-1    1         tie    NN        3
2       NewsArticles-1    2         day    NN        3
3       NewsArticles-1    3         one    CD        3
4       NewsArticles-1    4         sen    NNP       3
5       NewsArticles-1    5         law    NN        3
6       NewsArticles-1    6         van    NNP       3
7       NewsArticles-1    7         two    CD        3
8       NewsArticles-1    8         abc    NNP       3
9       NewsArticles-10   0         run    NN        3
10      NewsArticles-10   1         may    MD        3
11      NewsArticles-10   2         u.s    NNP       3
12      NewsArticles-10   3         abc    NNP       3
13      NewsArticles-10   4         say    VBP       3
14      NewsArticles-10   5         duo    NN        3
...
70,182  NewsArticles-999  19        n't    RB        3
70,183  NewsArticles-999  20        new    JJ        3
70,184  NewsArticles-999  21        n't    RB        3
70,185  NewsArticles-999  22        new    JJ        3
70,186  NewsArticles-999  23        n't    RB        3

Note that all matching options then apply to the metadata column, in this case to the meta_length column, which contains integers. Since filter_tokens() employs exact matching by default, we get all tokens whose meta_length equals the first argument, 3. Regular expression or glob matching would fail here, because those can only be applied to string data.
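
For string metadata such as the POS tags, pattern matching does work. The following sketch is not executed as part of this session; it assumes that filter_tokens() accepts match_type='glob' (as in the functional API) and that TMPreproc provides a copy() method, so that preproc itself is left untouched:

# hypothetical sketch: keep only tokens whose POS tag starts with "N" (nouns),
# matched via a glob pattern against the string metadata column "pos"
preproc_nouns = preproc.copy().filter_tokens('N*', by_meta='pos', match_type='glob')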

If you want to use more complex filter criteria, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps each document label to a sequence of booleans. Wherever the value is True, the respective token in the document is retained; all other tokens are removed. Let’s try that out with a small sample:

[105]:
preproc_small = TMPreproc(corpus.sample(5))
meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc_small.tokens.items()}
preproc_small.pos_tag().add_metadata_per_doc('length', meta_tok_lengths)
preproc_small.tokens_datatable
[105]:
      doc                position  token      meta_pos  meta_length
0     NewsArticles-1728  0         Trump      NN        5
1     NewsArticles-1728  1         :          :         1
2     NewsArticles-1728  2         Agency     NN        6
3     NewsArticles-1728  3         to         TO        2
4     NewsArticles-1728  4         support    VB        7
5     NewsArticles-1728  5         'victims   NNS       8
6     NewsArticles-1728  6         of         IN        2
7     NewsArticles-1728  7         immigrant  JJ        9
8     NewsArticles-1728  8         crimes'    NN        7
9     NewsArticles-1728  9         In         IN        2
10    NewsArticles-1728  10        first      JJ        5
11    NewsArticles-1728  11        speech     NN        6
12    NewsArticles-1728  12        to         TO        2
13    NewsArticles-1728  13        Congress   NNP       8
14    NewsArticles-1728  14        ,          ,         1
...
2570  NewsArticles-948   332       .          .         1
2571  NewsArticles-948   333       Source     NN        6
2572  NewsArticles-948   334       :          :         1
2573  NewsArticles-948   335       -News      NN        5
2574  NewsArticles-948   336       agencies   NNS       8

We now generate the filter mask, i.e. for each document we create a boolean list or array that indicates for each token whether it should be kept (True) or removed (False).

We will iterate through the .tokens_with_metadata property, which is a dict that contains a datatable Frame with tokens and metadata for each document. Let’s have a look at the first document’s datatable:

[106]:
next(iter(preproc_small.tokens_with_metadata.values()))
[106]:
     token          meta_pos  meta_length
0    Ex-footballer  NNP       13
1    Adam           NNP       4
2    Johnson        NNP       7
3    loses          VBZ       5
4    appeal         JJ        6
5    Ex-England     NNP       10
6    footballer     NN        10
7    Adam           NNP       4
8    Johnson        NNP       7
9    has            VBZ       3
10   lost           VBN       4
11   a              DT        1
12   Court          NNP       5
13   of             IN        2
14   Appeal         NNP       6
...
134  to             TO        2
135  another        DT        7
136  sexual         JJ        6
137  act            NN        3
138  .              .         1

Now we can create the filter mask:

[107]:
import numpy as np

filter_mask = {}
for doc_label, doc_data in preproc_small.tokens_with_metadata.items():
    # extract the columns "meta_length" and "meta_pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.meta_pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())

    # create a boolean array for nouns with token length less or equal 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.char.startswith(tok_pos, 'N')

# it's not necessary to add the filter mask as metadata
# but it's a good way to check the mask
preproc_small.add_metadata_per_doc('small_nouns', filter_mask)
preproc_small.tokens_datatable
[107]:
      doc                position  token      meta_small_nouns  meta_pos  meta_length
0     NewsArticles-1728  0         Trump      1                 NN        5
1     NewsArticles-1728  1         :          0                 :         1
2     NewsArticles-1728  2         Agency     0                 NN        6
3     NewsArticles-1728  3         to         0                 TO        2
4     NewsArticles-1728  4         support    0                 VB        7
5     NewsArticles-1728  5         'victims   0                 NNS       8
6     NewsArticles-1728  6         of         0                 IN        2
7     NewsArticles-1728  7         immigrant  0                 JJ        9
8     NewsArticles-1728  8         crimes'    0                 NN        7
9     NewsArticles-1728  9         In         0                 IN        2
10    NewsArticles-1728  10        first      0                 JJ        5
11    NewsArticles-1728  11        speech     0                 NN        6
12    NewsArticles-1728  12        to         0                 TO        2
13    NewsArticles-1728  13        Congress   0                 NNP       8
14    NewsArticles-1728  14        ,          0                 ,         1
...
2570  NewsArticles-948   332       .          0                 .         1
2571  NewsArticles-948   333       Source     0                 NN        6
2572  NewsArticles-948   334       :          0                 :         1
2573  NewsArticles-948   335       -News      1                 NN        5
2574  NewsArticles-948   336       agencies   0                 NNS       8

Finally we can pass the mask dict to filter_tokens_by_mask():

[108]:
preproc_small.filter_tokens_by_mask(filter_mask)
preproc_small.tokens_datatable
[108]:
     doc                position  token  meta_small_nouns  meta_pos  meta_length
0    NewsArticles-1728  0         Trump  1                 NN        5
1    NewsArticles-1728  1         Path   1                 NN        4
2    NewsArticles-1728  2         wall'  1                 NN        5
3    NewsArticles-1728  3         Trump  1                 NNP       5
4    NewsArticles-1728  4         crime  1                 NN        5
5    NewsArticles-1728  5         VOICE  1                 NN        5
6    NewsArticles-1728  6         Crime  1                 NNP       5
7    NewsArticles-1728  7         VOICE  1                 NNP       5
8    NewsArticles-1728  8         list   1                 NN        4
9    NewsArticles-1728  9         US     1                 NNP       2
10   NewsArticles-1728  10        name   1                 NN        4
11   NewsArticles-1728  11        Trump  1                 NNP       5
12   NewsArticles-1728  12        name   1                 NN        4
13   NewsArticles-1728  13        READ   1                 NNP       4
14   NewsArticles-1728  14        Trump  1                 NNP       5
...
254  NewsArticles-948   40        flow   1                 NN        4
255  NewsArticles-948   41        goods  1                 NNS       5
256  NewsArticles-948   42        Egypt  1                 NNP       5
257  NewsArticles-948   43        Gaza   1                 NNP       4
258  NewsArticles-948   44        -News  1                 NN        5
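
Since the filter mask was only attached as metadata for checking purposes, we can drop that column again with remove_metadata(), which we already used above:

# the 'small_nouns' column served only for inspection, so remove it again
preproc_small.remove_metadata('small_nouns')
preproc_small.get_available_metadata_keys()
# this should leave: {'length', 'pos'}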

Other methods

Again, all the functions that you know from the functional API are also available for TMPreproc and work the same way, so we won’t replicate that here. Make sure to have a look at the API to get an overview of TMPreproc’s methods and properties. For this final section, we only want to focus on generating a sparse document-term matrix (DTM). The property .dtm generates and returns a sparse DTM from the tokens of a TMPreproc object. First, let’s check the number of documents and the vocabulary size, which determine the shape of the DTM that we will create afterwards. We will continue working with preproc_small:

[109]:
(preproc_small.n_docs, len(preproc_small.vocabulary))
[109]:
(5, 137)
[110]:
dtm_small = preproc_small.dtm
dtm_small
[110]:
<5x137 sparse matrix of type '<class 'numpy.int32'>'
        with 158 stored elements in Compressed Sparse Row format>
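
A quick way to inspect such a small DTM is to convert it to a dense NumPy array. This sketch assumes that the DTM columns follow the order of .vocabulary and that the document labels are available via a .doc_labels property:

import numpy as np

vocab = np.array(preproc_small.vocabulary)
dtm_dense = dtm_small.toarray()   # dense 5x137 array, fine for such a tiny matrix

# how often does "Trump" occur in each of the five sampled documents?
dict(zip(preproc_small.doc_labels, dtm_dense[:, vocab == 'Trump'].ravel()))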

We can see that the DTM has the correct shape. The method get_dtm() also allows returning the result as a datatable Frame or pandas DataFrame:

[111]:
preproc_small.get_dtm(as_datatable=True)
[111]:
   _doc               'We  -Al  -News  A  A-321  Adam  Air  Al  Bays  ...  wages  wall  wall'  years  zone
0  NewsArticles-1728  1    1    0      0  0      0     0    1   2     ...  3      1     1      0      0
1  NewsArticles-2162  0    0    0      0  0      0     0    0   0     ...  0      0     0      0      0
2  NewsArticles-2616  0    0    0      0  0      3     0    0   0     ...  0      0     0      1      0
3  NewsArticles-2902  0    0    0      0  1      0     0    0   0     ...  0      0     0      0      1
4  NewsArticles-948   0    0    1      1  0      0     1    0   0     ...  0      0     0      0      0

(only the first and last few of the 137 term columns are shown)

The bow module contains several functions for working with DTMs, e.g. for applying transformations such as tf-idf or computing important summary statistics. The next chapter will introduce some of these functions.
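
As a rough preview of what such a transformation does, here is a plain NumPy sketch of a simple tf-idf weighting computed directly on the sparse DTM; the bow module provides ready-made (and more refined) functions for this:

import numpy as np

# assumes every document still contains at least one token (otherwise the row sum is zero)
counts = dtm_small.toarray().astype(float)
tf = counts / counts.sum(axis=1, keepdims=True)   # term frequencies per document
df = (counts > 0).sum(axis=0)                     # document frequency of each term
idf = np.log(1 + counts.shape[0] / df)            # smoothed inverse document frequency
tfidf_mat = tf * idf                              # same shape as the DTM: (5, 137)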