# Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that later analysis steps become easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.

## Two approaches: functional API and TMPreproc class

There are two ways to apply text preprocessing methods to your documents: First, there is the functional API which consists of a set of Python functions that accept a list of (tokenized) documents. An example might be:

corpus = [
"Hello world!",    # document 1
"Another example"  # document 2
]

docs = tokenize(corpus)
to_lowercase(docs)
# Out: [['hello', 'world', '!'],
#       ['another', 'example']]


The advantage of this approach is that it’s very straightforward and flexible. However, you must manage any metadata associated with the documents on your own (e.g. document labels or token metadata). Furthermore, the processing is not done in parallel.

Second, there is the TMPreproc class which addresses these limitations. You can create an instance of this class from your (labelled) documents and then apply preprocessing methods to it. This instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:

corpus = {
"doc1": "Hello world!",
"doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens                  # one of many ways to access the tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }


The most important advantage is that TMPreproc employs parallel processing, i.e. it uses all available processors on your machine to do the computations necessary during preprocessing. For large text corpora, this can lead to a substantial speedup.

Both approaches offer mostly the same features in terms of available preprocessing methods. TMPreproc has some more methods to export the data to pandas DataFrames or datatable Frames. In general, the functional API is mostly used for quick prototyping and when using a small amount of data. For projects with large amounts of data, it’s recommended to use TMPreproc, especially because of the parallel computation support.

A note on the use of datatable Frames

If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is much faster and more memory-efficient in most cases. You can always convert between the two like this:

import datatable as dt
import pandas as pd

# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})

# DataFrame to datatable:
dtable = dt.Frame(df)

# and vice versa datatable to DataFrame:
df == dtable.to_pandas()

# Out:
#       a     b
# 0  True  True
# 1  True  True
# 2  True  True


Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.

This chapter starts with the functional API and then turns to TMPreproc.

## Functional API

The functions in the preprocessing module make up the functional API for text preprocessing. We will explore some of the available functions. Most of them require at least passing a list of tokenized documents. In order to tokenize raw text documents (for example from a Corpus object), we can use tokenize().

Let’s load a sample of three documents from the built-in NewsArticles dataset. We’ll save the document labels in doc_labels since the functional API works with lists of documents (not with dicts):

[1]:

import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(3)
doc_labels = list(corpus.keys())
doc_labels

[1]:

['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']


### Tokenization

We can now tokenize these documents. We use corpus.values() to pass a list of documents. We get a list of tokenized documents back (i.e. a list of lists). We peek into the documents by showing at most the first 10 tokens of each.

[2]:

docs = tokenize(corpus.values())
[doc[:10] for doc in docs]

[2]:

[['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia-related',
'materials',
'Lawyers',
'for'],
['Frustration',
'as',
'cabin',
'electronics',
'ban',
'comes',
'into',
'force',
'Passengers',
'decry'],
['Should',
'you',
'have',
'two',
'bins',
'in',
'your',
'bathroom',
'?',
'Our']]


### Corpus language

Some preprocessing steps are language-dependent, i.e. they rely on models or resources built for a specific language, and hence you have to specify the language in which your documents are written. At the moment, tmtoolkit only supports two languages off the shelf: English and German.

In the functional API, all functions that are language-dependent have a language argument. Examples of such functions are tokenize(), pos_tag(), stem() and lemmatize(). The default language for the language parameter of the preprocessing functions is set in tmtoolkit.defaults.language. If you don’t change it, it’s set to "english". So you have two options when you use the functional API and work with a corpus that is not in English: you either pass the language parameter each time you use a language-dependent function; or you set tmtoolkit.defaults.language right at the beginning which will be used as default for all further language-dependent preprocessing functions. Let’s try both options with a German sample corpus:

[3]:

from tmtoolkit.preprocess import stem

docs_de = [
'Von der Wiege bis zur Bahre, Formulare, Formulare.',
'Fischers Fritz fischt frische Fische.',
'Viel schon ist getan, mehr noch ist zu tun, sagt der Wasserhahn zum Wasserhuhn.'
]


Option 1, passing the language parameter each time:

[4]:

tokens_de = tokenize(docs_de, language='german')
stemmed_de = stem(tokens_de, language='german')
stemmed_de

[4]:

[['von',
'der',
'wieg',
'bis',
'zur',
'bahr',
',',
'formular',
',',
'formular',
'.'],
['fisch', 'fritz', 'fischt', 'frisch', 'fisch', '.'],
['viel',
'schon',
'ist',
'getan',
',',
'mehr',
'noch',
'ist',
'zu',
'tun',
',',
'sagt',
'der',
'wasserhahn',
'zum',
'wasserhuhn',
'.']]


Option 2, setting tmtoolkit.defaults.language provides the same output:

[5]:

import tmtoolkit.defaults
tmtoolkit.defaults.language = 'german'

tokens_de = tokenize(docs_de)
stemmed_de == stem(tokens_de)

[5]:

True


We will now return to the English corpus, so we reset the default language and clean up:

[6]:

tmtoolkit.defaults.language = 'english'

del docs_de, tokens_de, stemmed_de


### A small tour around the functional preprocessing API

We will continue with the most important functions in the preprocessing API and apply them to our English sample corpus.

#### Document length

The document length is the number of tokens per document and can be obtained with doc_lengths():

[7]:

from tmtoolkit.preprocess import doc_lengths

doc_lengths(docs)

[7]:

[227, 646, 1052]


#### Vocabulary and document frequencies

The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use vocabulary() for that and vocabulary_counts() to additionally get the number of times each token appears in the corpus.

The document frequency of a token is the number of documents in which this token occurs at least once. The function doc_frequencies() returns this measure for all tokens in the vocabulary.

[8]:

from tmtoolkit.preprocess import vocabulary, vocabulary_counts, doc_frequencies

# first 10 entries from the sorted vocab
vocabulary(docs, sort=True)[:10]

[8]:

['%', "'", "''", "'s", '(', ')', ',', '-', '-Al', '.']

[9]:

# get unsorted vocabulary counts as Counter object
vocab_counts = vocabulary_counts(docs)
# get top 10 tokens by occurrence
vocab_counts.most_common(10)

[9]:

[('the', 82),
(',', 70),
('.', 60),
('to', 53),
('and', 45),
('in', 38),
('a', 31),
('', 28),
('of', 25),
("''", 23)]

[10]:

doc_freq = doc_frequencies(docs)

# "the" occurs in all three documents, "Lawyers" only in one
doc_freq['the'], doc_freq['Lawyers']


[10]:

(3, 1)
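To make the difference between vocabulary counts and document frequencies concrete, here is a minimal pure-Python sketch on a toy corpus. It does not use tmtoolkit; the helper name doc_frequencies_sketch and the toy documents are made up for illustration, and the library's own implementation may differ.

```python
from collections import Counter


def doc_frequencies_sketch(docs):
    """Count in how many documents each token occurs at least once."""
    df = Counter()
    for doc in docs:
        # set() makes sure a token is counted only once per document
        df.update(set(doc))
    return df


# toy corpus of three tokenized documents (illustrative data)
toy_docs = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked'],
    ['a', 'cat', 'and', 'a', 'dog'],
]

vocab = set().union(*toy_docs)                            # unique tokens
counts = Counter(t for doc in toy_docs for t in doc)      # total occurrences
df = doc_frequencies_sketch(toy_docs)                     # document frequencies

print(len(vocab))                  # 9
print(counts['the'], df['the'])    # 3 2
```

The token "the" occurs three times in total but only in two documents, which is exactly the distinction between the two measures.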


#### Part-of-speech (POS) tagging

Part-of-speech (POS) tagging finds the grammatical word category for each token in a document. The function pos_tag() applies this to the whole corpus. It returns a list of tags for each document. These tags conform to a specific tagset. For English this is the Penn Treebank tagset and for German this is the STTS tagset.

These tags can be used to filter, annotate or lemmatize the documents.

Remember that this is a language-dependent function.

[11]:

from tmtoolkit.preprocess import pos_tag

docs_pos = pos_tag(docs)

# show pairs of tokens and POS tags for the first 10 tokens in the first document
list(zip(docs[0][:10], docs_pos[0][:10]))

[11]:

[('White', 'NNP'),
('House', 'NNP'),
('aides', 'NNS'),
('told', 'VBD'),
('to', 'TO'),
('keep', 'VB'),
('Russia-related', 'JJ'),
('materials', 'NNS'),
('Lawyers', 'NNS'),
('for', 'IN')]


#### Stemming and lemmatization

Stemming and lemmatization bring a token, if it is a word, to a base form. The former method is rule-based and creates base forms by chopping off common prefixes and suffixes. The resulting token may not be a lexicographically correct word any more. We’ve already used stem() in an example above.

Lemmatization is a more sophisticated process that tries to find the lexicographically correct base form of a given word by also considering its POS tag and possibly its context (tokens and POS tags nearby). It’s usually not rule-based but uses a trained model that predicts the base form from the mentioned parameters. Lemmatization can be applied with lemmatize().

Remember that both functions are language-dependent.

[12]:

from tmtoolkit.preprocess import lemmatize

docs_lem = lemmatize(docs, docs_pos)
# show pairs of original tokens and lemmata for the first 10 tokens of first document
list(zip(docs[0][:10], docs_lem[0][:10]))

[12]:

[('White', 'White'),
('House', 'House'),
('aides', 'aide'),
('told', 'tell'),
('to', 'to'),
('keep', 'keep'),
('Russia-related', 'Russia-related'),
('materials', 'material'),
('Lawyers', 'Lawyers'),
('for', 'for')]


#### Token normalization

Depending on your methodology, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).

If you want to remove certain characters in all tokens in your corpus, you can use remove_chars() and pass it a sequence of characters to remove.

Note that for the following examples we continue working with the lemmatized documents docs_lem.

[13]:

from tmtoolkit.preprocess import remove_chars

# remove all vowels from the documents, show first 10 tokens from first document
remove_chars(docs_lem, 'aeiou')[0][:10]

[13]:

['Wht', 'Hs', 'd', 'tll', 't', 'kp', 'Rss-rltd', 'mtrl', 'Lwyrs', 'fr']


You can for example use this to remove all punctuation characters from all tokens:

[14]:

import string

docs_clean = remove_chars(docs_lem, string.punctuation)
# show pairs of original tokens and cleaned tokens for the first 10 tokens of the 3rd doc.
list(zip(docs_lem[2][:10], docs_clean[2][:10]))

[14]:

[('Should', 'Should'),
('you', 'you'),
('have', 'have'),
('two', 'two'),
('bin', 'bin'),
('in', 'in'),
('your', 'your'),
('bathroom', 'bathroom'),
('?', ''),
('Our', 'Our')]


Notice how the token '?' was transformed to an empty string '', because “?” is a punctuation character.

A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with to_lowercase():

[15]:

from tmtoolkit.preprocess import to_lowercase

docs_clean = to_lowercase(docs_clean)
docs_clean[2][:10]

[15]:

['should', 'you', 'have', 'two', 'bin', 'in', 'your', 'bathroom', '', 'our']


The function clean_tokens() finally applies several steps that remove tokens that meet certain criteria. This includes removing:

• punctuation tokens

• stopwords (very common words for the given language)

• empty tokens (i.e. '')

• tokens that are longer or shorter than a certain number of characters

• numbers

Note that this is a language-dependent function, because the default stopword list is determined per language. This function has many parameters to tweak, so it’s recommended to check out the documentation.
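The removal criteria listed above can be sketched in plain Python as follows. This is only an illustration of the idea and not tmtoolkit's implementation: the function name clean_tokens_sketch, its parameter defaults and the tiny stopword list are assumptions made for this example.

```python
import string

# Hypothetical, very small stopword list used only for this illustration
STOPWORDS = {'the', 'a', 'in', 'you', 'your', 'have', 'should', 'our'}


def clean_tokens_sketch(docs, stopwords=STOPWORDS, remove_shorter_than=None,
                        remove_longer_than=None, remove_numbers=False):
    """Drop tokens that meet any of the removal criteria."""
    punct = set(string.punctuation)
    cleaned = []
    for doc in docs:
        kept = []
        for tok in doc:
            if tok == '':                                  # empty token
                continue
            if all(c in punct for c in tok):               # punctuation token
                continue
            if tok in stopwords:                           # stopword
                continue
            if remove_shorter_than is not None and len(tok) < remove_shorter_than:
                continue
            if remove_longer_than is not None and len(tok) > remove_longer_than:
                continue
            if remove_numbers and tok.isnumeric():         # e.g. "2019"
                continue
            kept.append(tok)
        cleaned.append(kept)
    return cleaned


toy = [['should', 'you', 'have', 'two', 'bin', 'in', 'your',
        'bathroom', '', 'our', '2019', 'x', '?']]
cleaned_toy = clean_tokens_sketch(toy, remove_shorter_than=2, remove_numbers=True)
print(cleaned_toy)   # [['two', 'bin', 'bathroom']]
```

Each criterion is an independent check, which is why the real function can expose one parameter per criterion.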

[16]:

from tmtoolkit.preprocess import clean_tokens

# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
docs_final = clean_tokens(docs_clean, remove_shorter_than=2, remove_numbers=True)

# first 10 tokens of the document at index 2 (the third document)
docs_final[2][:10]

[16]:

['two',
'bin',
'bathroom',
'bathroom',
'fill',
'shampoo',
'bottle',
'toilet',
'roll',
'cleaning']


Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

[17]:

doc_lengths(docs), doc_lengths(docs_final)

[17]:

([227, 646, 1052], [129, 310, 504])


We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

[18]:

len(vocabulary(docs)), len(vocabulary(docs_final))

[18]:

(681, 478)


You can also apply custom token transform functions by using transform() and passing it a function that should be applied to each token in each document (hence it must accept one string argument).

First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:

[19]:

def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('USA'), token_shape('CamelCase'), token_shape('lower')

[19]:

('XXX', 'XxxxxXxxx', 'xxxxx')


We can now apply this function to our corpus:

[20]:

from tmtoolkit.preprocess import transform

doc_shapes = transform(docs, token_shape)

# show pairs of tokens and token shapes for the first 10 tokens in the first document
list(zip(docs[0][:10], doc_shapes[0][:10]))

[20]:

[('White', 'Xxxxx'),
('House', 'Xxxxx'),
('aides', 'xxxxx'),
('told', 'xxxx'),
('to', 'xx'),
('keep', 'xxxx'),
('Russia-related', 'Xxxxxxxxxxxxxx'),
('materials', 'xxxxxxxxx'),
('Lawyers', 'Xxxxxxx'),
('for', 'xxx')]


#### Keywords-in-context (KWIC)

Keywords-in-context (KWIC) allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.

tmtoolkit provides three functions for this purpose:

• kwic() is the base function accepting the input documents, a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;

• kwic_table() is the more “user-friendly” version of the above function as it produces a datatable with the highlighted keyword by default;

• filter_tokens_with_kwic() works similarly to the above functions but returns the result as a list of tokenized documents again; it is explained in the section on filtering

Let’s see the first two functions in action:

[21]:

from tmtoolkit.preprocess import kwic, kwic_table

kwic(docs, 'news')

[21]:

[[],
[['told', 'Reuters', 'news', 'agency', '.'],
['Jazeera', 'and', 'news', 'agencies']],
[]]


We see that the first and last document do not contain any keyword that matches "news", hence we get empty results for these documents. In the second document, we get two result contexts for the requested keyword. This keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two tokens to the right. Notice that in the second result context only one token to the right is shown since the document ends after “agencies”.
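The context window logic described above can be sketched in a few lines of plain Python. This sketch only handles exact keyword matches; the function name kwic_sketch and the toy documents are made up for illustration and are not part of tmtoolkit.

```python
def kwic_sketch(docs, keyword, context_size=2):
    """Return, per document, a list of token windows around each exact match."""
    # accept a single integer (symmetric) or a (left, right) tuple
    if isinstance(context_size, int):
        left, right = context_size, context_size
    else:
        left, right = context_size

    results = []
    for doc in docs:
        hits = []
        for i, tok in enumerate(doc):
            if tok == keyword:
                start = max(0, i - left)          # don't run past the document start
                hits.append(doc[start:i + right + 1])   # slicing clips at the end
        results.append(hits)
    return results


toy_docs = [
    ['no', 'match', 'here'],
    ['he', 'told', 'Reuters', 'news', 'agency', '.',
     'Jazeera', 'and', 'news', 'agencies'],
]

print(kwic_sketch(toy_docs, 'news'))
# [[], [['told', 'Reuters', 'news', 'agency', '.'], ['Jazeera', 'and', 'news', 'agencies']]]
```

As in the output above, the second window is shorter on the right because the document ends directly after “agencies”.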

[22]:

kwic_table(docs, 'news')

[22]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | 1 | 0 | told Reuters *news* agency . |
| 1 | 1 | 1 | Jazeera and *news* agencies |

With kwic_table(), we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as *news* and empty results are removed (only document “1” contains the keyword which is the second document – remember that Python indexing starts with 0).

We can also pass the document labels via doc_labels to get proper labels in the doc column instead of document indices:

[23]:

kwic_table(docs, 'news', doc_labels=doc_labels)

[23]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-3350 | 0 | told Reuters *news* agency . |
| 1 | NewsArticles-3350 | 1 | Jazeera and *news* agencies |

Another important parameter is context_size. It determines the number of tokens to display to the left and right of the matched keyword. You can either pass a single integer for a symmetric context or a tuple of integers (<left>, <right>).

[24]:

# a symmetric context of size (5, 5)
kwic_table(docs, 'news', context_size=5, doc_labels=doc_labels)

[24]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-3350 | 0 | a traveler , told Reuters *news* agency . Al Jazee… |
| 1 | NewsArticles-3350 | 1 | Source : -Al Jazeera and *news* agencies |

[25]:

# an asymmetric context of size (5, 1)
kwic_table(docs, 'news', context_size=(5, 1), doc_labels=doc_labels)

[25]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-3350 | 0 | a traveler , told Reuters *news* agency |
| 1 | NewsArticles-3350 | 1 | Source : -Al Jazeera and *news* agencies |

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact matches between the corpus tokens and our keyword "news". However, it is also possible to match patterns like "new*" (matches any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section gives an introduction to the different options for pattern matching.

#### Common parameters for pattern matching functions

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

• search_token or search_tokens: allows you to specify one or more patterns as strings

• match_type: sets the matching type and can be one of the following options:

• 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching

• 'regex' uses regular expression matching

• 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

• ignore_case: ignore character case (applies to all three match types)

• glob_method: if match_type is ‘glob’, use this glob method. Must be 'match' or 'search' (similar behavior as Python’s re.match or re.search)

• inverse: invert the match results, i.e. if matching for “hello”, return all results that do not match “hello”
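To illustrate how these parameters interact, here is a rough pure-Python approximation of the three match types. It is a sketch under assumptions, not tmtoolkit's code: the function name token_match_sketch is made up, glob patterns are translated with Python's fnmatch module instead of the globre package, and the glob_method parameter is left out.

```python
import fnmatch
import re


def token_match_sketch(pattern, tokens, match_type='exact',
                       ignore_case=False, inverse=False):
    """Return a list of booleans, one per token, signaling a match."""
    flags = re.IGNORECASE if ignore_case else 0

    if match_type == 'exact':
        # whole token must equal the pattern (optionally ignoring case)
        matcher = re.compile(re.escape(pattern), flags).fullmatch
    elif match_type == 'regex':
        # regular expression may match anywhere in the token
        matcher = re.compile(pattern, flags).search
    elif match_type == 'glob':
        # fnmatch.translate() turns a glob like "to*" into a regular expression
        matcher = re.compile(fnmatch.translate(pattern), flags).match
    else:
        raise ValueError("match_type must be 'exact', 'regex' or 'glob'")

    matches = [matcher(tok) is not None for tok in tokens]
    return [not m for m in matches] if inverse else matches


tokens = ['White', 'House', 'aides', 'told', 'to', 'keep', '.']

print(token_match_sketch('to*', tokens, match_type='glob'))
# [False, False, False, True, True, False, False]
print(token_match_sketch('house', tokens, ignore_case=True))
# [False, True, False, False, False, False, False]
print(token_match_sketch(r'[aeiou]', tokens, match_type='regex', inverse=True))
# [False, False, False, False, False, False, True]
```

Applying inverse as a final negation step keeps it independent of the chosen match type, which is why it can be combined with any of them.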

Let’s try out some of these options with kwic_table():

[26]:

# using a regular expression, ignoring case
kwic_table(docs, r'agenc(y|ies)', match_type='regex', ignore_case=True,
doc_labels=doc_labels)

[26]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | law enforcement *agencies* to keep |
| 1 | NewsArticles-1880 | 1 | organizations , *agencies* and individuals |
| 2 | NewsArticles-3350 | 0 | Reuters news *agency* . Al |
| 3 | NewsArticles-3350 | 1 | and news *agencies* |

[27]:

# using a glob, ignoring case
kwic_table(docs, 'pol*', match_type='glob', ignore_case=True,
doc_labels=doc_labels)

[27]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | false and *politically* motivated attacks |
| 1 | NewsArticles-99 | 0 | , senior *policy* adviser for |

[28]:

# using a glob, ignoring case
kwic_table(docs, '*sol*', match_type='glob', ignore_case=True,
doc_labels=doc_labels)

[28]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-99 | 0 | potential simple *solution* that could |
| 1 | NewsArticles-99 | 1 | confused by *aerosols* . '' |
| 2 | NewsArticles-99 | 2 | bottles , *aerosols* for deodorant |

[29]:

# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
kwic_table(docs, r'[AEIOUaeiou]', match_type='regex', inverse=True,
doc_labels=doc_labels)

[29]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | in the *2016* presidential election |
| 1 | NewsArticles-1880 | 1 | related investigations *,* ABC News |
| 2 | NewsArticles-1880 | 2 | has confirmed *.*  The |
| 3 | NewsArticles-1880 | 3 | confirmed . ** The White |
| 4 | NewsArticles-1880 | 4 | motivated attacks *,* '' an |
| 5 | NewsArticles-1880 | 5 | attacks , *''* an administration |
| 6 | NewsArticles-1880 | 6 | News Wednesday *.* The directive |
| 7 | NewsArticles-1880 | 7 | last week *by* Senate Democrats |
| 8 | NewsArticles-1880 | 8 | between Trump *'s* administration , |
| 9 | NewsArticles-1880 | 9 | 's administration *,* campaign and |
| 10 | NewsArticles-1880 | 10 | transition teams ** ? or |
| 11 | NewsArticles-1880 | 11 | teams  *?* or anyone |
| 12 | NewsArticles-1880 | 12 | their behalf ** ? and |
| 13 | NewsArticles-1880 | 13 | behalf  *?* and Russian |
| 14 | NewsArticles-1880 | 14 | their associates *.* Similarly , |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 252 | NewsArticles-99 | 142 | you do *n't* have the |
| 253 | NewsArticles-99 | 143 | two bins *?* There are |
| 254 | NewsArticles-99 | 144 | other options *.* Hang a |
| 255 | NewsArticles-99 | 145 | recycling bin *.* Or opt |
| 256 | NewsArticles-99 | 146 | non-recyclable items *.* |

#### Filtering tokens and documents

We can use the pattern matching parameters in numerous filtering functions and methods. The heart of many of these functions is token_match(). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a boolean NumPy array of the same length as the input tokens. Each occurrence of True in this array signals a match.

[30]:

from tmtoolkit.preprocess import token_match

doc0_snippet = docs[0][:10]   # first 10 tokens of first doc.
# get all tokens that match "to*"
matches = token_match('to*', doc0_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc0_snippet, matches):
    print(tok, ':', match)

White : False
House : False
aides : False
told : True
to : True
keep : False
Russia-related : False
materials : False
Lawyers : False
for : False


The token_match() function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.

The following functions cover common use-cases for filtering during text preprocessing. Many of these functions start either with filter_...() or remove_...() and these pairs of filter and remove functions are complements. A filter function will always retain the matched elements whereas a remove function will always drop the matched elements. We can observe that with the first pair of functions, filter_tokens() and remove_tokens():

[31]:

from tmtoolkit.preprocess import filter_tokens, remove_tokens

# retain only the tokens that match the pattern in each document
filter_tokens(docs, '*house*', match_type='glob', ignore_case=True)

[31]:

[['House', 'House', 'House', 'House'],
[],
['house', 'greenhouse', 'household']]

[32]:

# retain only the tokens that DON'T match the pattern in each document
# will only show the first 10 tokens from the first document here, b/c
# the resulting documents are too long; you can see that "House" was
# removed from ["White", "House", ...]
remove_tokens(docs, '*house*', match_type='glob', ignore_case=True)[0][:10]

[32]:

['White',
'aides',
'told',
'to',
'keep',
'Russia-related',
'materials',
'Lawyers',
'for',
'the']


The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents according to the supplied match criteria. Both accept the standard pattern matching parameters but also a parameter matches_threshold with default value 1. When at least this number of matching tokens is found in a document, the document will be part of the result set (filter_documents()) or removed from the result set (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.

Let’s try these functions out in practice. This time we will also pass the doc_labels so that the filtering also applies to our list of document labels. If doc_labels is also passed, the functions return two results – the filtered list of documents and the filtered list of document labels.

[33]:

from tmtoolkit.preprocess import filter_documents, remove_documents

filtered_docs, filtered_doc_labels = filter_documents(docs, '*house*',
doc_labels=doc_labels,
match_type='glob',
ignore_case=True)
filtered_doc_labels

[33]:

['NewsArticles-1880', 'NewsArticles-99']


We can see that two out of three documents contained the pattern '*house*' and hence were retained. The list filtered_docs represents these two documents (we don’t print them here because they are too long).

We can also adjust matches_threshold to set the minimum number of token matches for filtering:

[34]:

filtered_docs, filtered_doc_labels = filter_documents(docs, '*house*',
doc_labels=doc_labels,
match_type='glob',
ignore_case=True,
matches_threshold=4)
filtered_doc_labels

[34]:

['NewsArticles-1880']

[35]:

filtered_docs, filtered_doc_labels = remove_documents(docs, '*house*',
doc_labels=doc_labels,
match_type='glob',
ignore_case=True)
filtered_doc_labels

[35]:

['NewsArticles-3350']


When we use remove_documents() we get only the documents that did not contain the specified pattern.
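The complementary behavior of the filter and remove variants, together with matches_threshold, can be sketched like this. The function filter_documents_sketch, its remove flag and the toy data are assumptions for illustration only; the sketch always uses case-insensitive glob matching via Python's fnmatch module rather than tmtoolkit's own matching.

```python
import fnmatch


def filter_documents_sketch(docs, doc_labels, pattern,
                            matches_threshold=1, remove=False):
    """Keep (or, with remove=True, drop) documents with enough matching tokens."""
    kept_docs, kept_labels = [], []
    for doc, label in zip(docs, doc_labels):
        # count tokens matching the glob pattern, ignoring case
        n_matches = sum(
            fnmatch.fnmatchcase(tok.lower(), pattern.lower()) for tok in doc
        )
        is_hit = n_matches >= matches_threshold
        # filter keeps hits, remove keeps non-hits
        if is_hit != remove:
            kept_docs.append(doc)
            kept_labels.append(label)
    return kept_docs, kept_labels


toy_docs = [
    ['White', 'House', 'aides', 'White', 'House'],   # two matches
    ['cabin', 'electronics', 'ban'],                 # no match
    ['greenhouse', 'gases'],                         # one match
]
toy_labels = ['d1', 'd2', 'd3']

print(filter_documents_sketch(toy_docs, toy_labels, '*house*')[1])
# ['d1', 'd3']
print(filter_documents_sketch(toy_docs, toy_labels, '*house*', matches_threshold=2)[1])
# ['d1']
print(filter_documents_sketch(toy_docs, toy_labels, '*house*', remove=True)[1])
# ['d2']
```

Filtering the labels alongside the documents keeps both lists aligned, which is the reason the real functions return two results when doc_labels is passed.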

Another useful pair of functions is filter_documents_by_name() and remove_documents_by_name(). Both functions again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:

[36]:

from tmtoolkit.preprocess import filter_documents_by_name

filtered_docs, filtered_doc_labels = filter_documents_by_name(docs, doc_labels,
r'-\d{4}$', match_type='regex')
filtered_doc_labels

[36]:

['NewsArticles-1880', 'NewsArticles-3350']


In the above example we wanted to retain only the documents whose document labels end with exactly 4 digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() will do the exact opposite.

You may also use Keywords-in-context (KWIC) to filter your tokens in the neighborhood around certain keyword pattern(s). The function for that is called filter_tokens_with_kwic() and works very similarly to kwic(), but returns the result as a list of tokenized documents (whereas kwic() returns a list of KWIC results per document) with which you can continue working as usual. Here, we filter the tokens in each document to get the tokens directly before and after the glob pattern '*house*' (context_size=1):

[37]:

from tmtoolkit.preprocess import filter_tokens_with_kwic

filter_tokens_with_kwic(docs, '*house*', context_size=1,
match_type='glob', ignore_case=True)

[37]:

[['White', 'House', 'aides', 'White', 'House', 'aides',
'White', 'House', 'is', 'White', 'House', 'and'],
[],
['the', 'house', ',', 'of', 'greenhouse', 'gases', 'UK', 'household', 'threw']]


When you have annotated your documents’ tokens with Part-of-Speech (POS) tags, you can also filter them using filter_for_pos(). You need to pass the documents, their POS tags and the POS tag(s) to be used for filtering:

[38]:

from tmtoolkit.preprocess import filter_for_pos

filtered_docs, filtered_docs_pos = filter_for_pos(docs, docs_pos, 'N')

# displaying only the first 10 filtered tokens from the first document
filtered_docs[0][:10]

[38]:

['White', 'House', 'aides', 'materials', 'Lawyers',
'Trump', 'administration', 'White', 'House', 'aides']


In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the function documentation).
We can also filter for more than one tag, e.g. nouns or verbs:

[39]:

filtered_docs, filtered_docs_pos = filter_for_pos(docs, docs_pos, ['N', 'V'])

# displaying only the first 10 filtered tokens from the first document
filtered_docs[0][:10]

[39]:

['White', 'House', 'aides', 'told', 'keep',
'materials', 'Lawyers', 'Trump', 'administration', 'have']


filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.

Finally there are two functions for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens that have a document frequency greater than or equal to a certain threshold defined by the parameter df_threshold. The latter does the same for all tokens that have a document frequency lower than or equal to df_threshold. This parameter can either be a relative frequency (default) or an absolute count (by setting the parameter absolute=True).

Before applying the function, let’s have a look at the number of tokens per document again, to later see how many we will remove:

[40]:

doc_lengths(docs)

[40]:

[227, 646, 1052]

[41]:

from tmtoolkit.preprocess import remove_common_tokens

doc_lengths(remove_common_tokens(docs, df_threshold=0.9))

[41]:

[143, 413, 699]


By removing all tokens with a relative document frequency of at least 0.9, we remove quite a number of tokens in each document.
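The thresholding logic for common and uncommon tokens can be sketched in plain Python as follows. The function remove_by_doc_frequency and its common flag are hypothetical names chosen for this illustration; the sketch merely mirrors the description of df_threshold and absolute given above.

```python
from collections import Counter


def remove_by_doc_frequency(docs, df_threshold, common=True, absolute=False):
    """Drop tokens whose document frequency is >= (common) or <= (uncommon) the threshold."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each token once per document
    n_docs = len(docs)

    def should_drop(tok):
        value = df[tok] if absolute else df[tok] / n_docs
        return value >= df_threshold if common else value <= df_threshold

    return [[tok for tok in doc if not should_drop(tok)] for doc in docs]


toy_docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'barked'],
    ['the', 'cat', 'and', 'dog'],
]

# "the" occurs in 3 of 3 documents (relative frequency 1.0) and is removed
print(remove_by_doc_frequency(toy_docs, df_threshold=0.9))
# [['cat', 'sat'], ['dog', 'barked'], ['cat', 'and', 'dog']]

# tokens that occur in only one document are removed
print(remove_by_doc_frequency(toy_docs, df_threshold=1, common=False, absolute=True))
# [['the', 'cat'], ['the', 'dog'], ['the', 'cat', 'dog']]
```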
Let’s investigate the vocabulary in order to see which tokens are removed:

[42]:

orig_vocab = vocabulary(docs)   # vocabulary of unfiltered documents
filtered_docs = remove_common_tokens(docs, df_threshold=0.9)
filtered_vocab = vocabulary(filtered_docs)
orig_vocab - filtered_vocab     # set difference gives removed vocabulary tokens

[42]:

{"''", "'s", ',', '.', '?', 'The', '', 'a', 'all', 'also', 'an', 'and',
'be', 'for', 'has', 'have', 'in', 'into', 'is', 'more', 'of', 'on', 'or',
'other', 'such', 'than', 'that', 'the', 'to', 'which', 'with'}


remove_uncommon_tokens() works similarly. This time, let’s use an absolute number as threshold:

[43]:

from tmtoolkit.preprocess import remove_uncommon_tokens

filtered_docs = remove_uncommon_tokens(docs, df_threshold=1, absolute=True)
filtered_vocab = vocabulary(filtered_docs)

# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(orig_vocab - filtered_vocab)[:10]

[43]:

['%', '(', ')', '-Al', '.-', '10', '12', '135,000', '2016', '38']


The above means that we remove all tokens that appear in exactly one document.
#### Expanding compound words and joining tokens

Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compounds():

[44]:

from tmtoolkit.preprocess import expand_compounds

# trying it out with a single *tokenized* document:
expand_compounds([['US-Student', 'on', 'Berlin-bound', 'train', '.']])

[44]:

[['US', 'Student', 'on', 'Berlin', 'bound', 'train', '.']]

[45]:

# applying this to our documents
docs_expanded = expand_compounds(docs)
orig_vocab - vocabulary(docs_expanded)   # vocabulary tokens that were expanded

[45]:

{'-Al', '.-', 'Britain-bound', 'Lagoas-and', 'Russia-related', 'ban.-',
'carry-on', 'editor-in-chief', 'experts-perplexed', 'non-recyclable',
'off-putting', 're-use'}


It’s also possible to join certain subsequent occurrences of tokens or token patterns. This means you can for example transform all subsequent tokens “White” and “House” to a single token “White_House”. In case you don’t use n-grams (see next section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as persons, institutions or concepts like “Climate Change”, as a single token. The function to use for this is glue_tokens(). You can pass this function:

• documents docs to operate on;

• a patterns sequence of length N that is used to match the subsequent N tokens;

• a glue string that is used to join the matched subsequent tokens (by default: "_").

Along with that, you can adjust the token matching with the common pattern matching parameters introduced above.
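The idea of matching N subsequent tokens and joining them can be sketched in plain Python. This sketch only supports exact matches; the function name glue_tokens_sketch and the toy document are made up for illustration and the library's implementation may work differently.

```python
def glue_tokens_sketch(docs, patterns, glue='_'):
    """Join each run of subsequent tokens that exactly equals `patterns`."""
    patterns = list(patterns)
    n = len(patterns)
    if n < 2:
        raise ValueError('patterns must contain at least two tokens')

    glued_docs = []
    for doc in docs:
        out = []
        i = 0
        while i < len(doc):
            if doc[i:i + n] == patterns:
                out.append(glue.join(patterns))   # replace the run by one token
                i += n                            # skip the tokens just glued
            else:
                out.append(doc[i])
                i += 1
        glued_docs.append(out)
    return glued_docs


toy_docs = [['White', 'House', 'aides', 'told', 'White', 'House', '.']]
print(glue_tokens_sketch(toy_docs, ['White', 'House']))
# [['White_House', 'aides', 'told', 'White_House', '.']]
```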
Let’s “glue” all subsequent occurrences of “White” and “House”:

[46]:

from tmtoolkit.preprocess import glue_tokens

# showing only first 20 tokens in document 1
glue_tokens(docs, ['White', 'House'])[0][:20]

[46]:

['White_House',
'aides',
'told',
'to',
'keep',
'Russia-related',
'materials',
'Lawyers',
'for',
'the',
'Trump',
'administration',
'have',
'instructed',
'White_House',
'aides',
'to',
'preserve',
'any',
'material']


Instead of exact matches, we can also specify a sequence of regular expressions (or “glob” expressions) that must be matched by subsequent tokens. Here, we want to join all token pairs where the first token starts with a capital letter, and the second token is “Trump”. We also set return_glued_tokens to True so that a second return value is created: a set of all matched and “glued” tokens.

[47]:

docs_glued, glued = glue_tokens(docs, [r'^[A-Z]', 'Trump'], match_type='regex',
                                return_glued_tokens=True)
glued

[47]:

{'President_Trump'}


Let’s have a quick look at the context using kwic_table(). We can see that only one such pattern was matched:

[48]:

kwic_table(docs_glued, 'President_Trump')

[48]:

|   | doc | context | kwic |
| --- | --- | --- | --- |
| 0 | 0 | 0 | contact between *President_Trump* 's advisers |

#### Generating n-grams¶

So far, we worked with unigrams, i.e. each document consisted of a sequence of discrete tokens. We can also generate n-grams from our corpus, where each document is represented as a sequence of n-grams, each made up of n subsequent tokens. An example would be:

Document: “This is a simple example.”

n=1 (unigrams):

['This', 'is', 'a', 'simple', 'example', '.']

n=2 (bigrams):

['This is', 'is a', 'a simple', 'simple example', 'example .']

n=3 (trigrams):

['This is a', 'is a simple', 'a simple example', 'simple example .']

The function ngrams() allows us to generate n-grams from tokenized documents.

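
As a library-independent illustration of the definition above, the following sketch builds n-grams with a sliding window over one tokenized document. The function name `simple_ngrams` and its parameters are chosen here for illustration only; they mirror the idea, not the actual signature, of ngrams().

```python
def simple_ngrams(tokens, n, join=True, join_str=' '):
    """Return all runs of n subsequent tokens from a single document.

    Illustrative sketch; yields an empty list if the document has
    fewer than n tokens.
    """
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    if join:
        return [join_str.join(g) for g in grams]
    return grams


toy_doc = ['This', 'is', 'a', 'simple', 'example', '.']
bigrams = simple_ngrams(toy_doc, 2)
trigrams = simple_ngrams(toy_doc, 3, join_str='_')
```

A document with T tokens yields T - n + 1 n-grams, which is why the bigram list for the six-token example has five entries.
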
[49]:

from tmtoolkit.preprocess import ngrams

# showing the first 10 bigrams from the first document:
ngrams(docs, n=2)[0][:10]

[49]:

['White House',
'House aides',
'aides told',
'told to',
'to keep',
'keep Russia-related',
'Russia-related materials',
'materials Lawyers',
'Lawyers for',
'for the']


The string used to join the tokens in each n-gram can be specified via join_str:

[50]:

# showing the first 10 trigrams from the first document:
ngrams(docs, n=3, join_str='_')[0][:10]

[50]:

['White_House_aides',
'House_aides_told',
'aides_told_to',
'told_to_keep',
'to_keep_Russia-related',
'keep_Russia-related_materials',
'Russia-related_materials_Lawyers',
'materials_Lawyers_for',
'Lawyers_for_the',
'for_the_Trump']


The n-grams don’t have to be joined. You can use join=False to generate n-grams as string lists of size n:

[51]:

# showing the first 10 bigrams from the first document:
ngrams(docs, n=2, join=False)[0][:10]

[51]:

[['White', 'House'],
['House', 'aides'],
['aides', 'told'],
['told', 'to'],
['to', 'keep'],
['keep', 'Russia-related'],
['Russia-related', 'materials'],
['materials', 'Lawyers'],
['Lawyers', 'for'],
['for', 'the']]


#### Generating a sparse document-term matrix (DTM)¶

If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which represents the number of occurrences of each term (i.e. vocabulary token) in each document. This is a matrix with N rows and M columns, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.

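
What "storing only non-zero values" means can be sketched with the standard library alone. The hypothetical function `coo_triples` below records each non-zero DTM cell as a (row, column, count) triple, which is the idea behind the coordinate ("COO") sparse format; it is a teaching sketch, not the code tmtoolkit uses.

```python
from collections import Counter


def coo_triples(docs):
    """Build coordinate-style triples for a document-term matrix.

    Returns three parallel lists (row indices, column indices, counts)
    plus the sorted vocabulary that defines the column order.
    """
    vocab = sorted({tok for tokens in docs for tok in tokens})
    col_index = {tok: j for j, tok in enumerate(vocab)}

    rows, cols, counts = [], [], []
    for i, tokens in enumerate(docs):
        # only tokens that actually occur in document i produce an entry
        for tok, cnt in sorted(Counter(tokens).items()):
            rows.append(i)
            cols.append(col_index[tok])
            counts.append(cnt)
    return rows, cols, counts, vocab


toy_docs = [['a', 'b', 'a'], ['b', 'c'], ['c', 'c', 'c']]
rows, cols, counts, vocab = coo_triples(toy_docs)
n_cells = len(toy_docs) * len(vocab)   # size of the equivalent dense matrix
```

Here only 5 of the 9 cells need to be stored. If SciPy is available, such triples can be turned into a sparse matrix with `scipy.sparse.coo_matrix((counts, (rows, cols)), shape=(len(toy_docs), len(vocab)))`.
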
For this example, we’ll use the preprocessed documents docs_final from above. First, let’s check the vocabulary size:

[52]:

len(vocabulary(docs_final))

[52]:

478


Now we can use sparse_dtm() to generate a sparse DTM. We can either pass an already computed sorted vocabulary or let the function itself generate the vocabulary that is necessary to construct the DTM. In the latter case, the generated vocabulary is also returned:

[53]:

from tmtoolkit.preprocess import sparse_dtm

dtm, vocab_final = sparse_dtm(docs_final)
dtm

[53]:

<3x478 sparse matrix of type '<class 'numpy.int32'>'
        with 529 stored elements in COOrdinate format>


We can see that a sparse matrix with 3 rows (which corresponds to the number of documents) and 478 columns (which corresponds to the vocabulary size) was generated. 529 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. dense, representation and see parts of its elements:

[54]:

dtm.todense()

[54]:

matrix([[2, 1, 1, ..., 0, 0, 0],
        [0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 0, 2, 1]], dtype=int32)


However, note that you should only convert a sparse matrix to a dense representation when you’re either dealing with a small amount of data (which is what we’re doing in this example), or use only a part of the full matrix. Converting a sparse matrix to a dense representation can otherwise easily exceed the available computer memory.

There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in “coo” format, which is a good intermediate format that you can use to convert to a different sparse matrix format quickly, but that doesn’t offer many matrix operations.

For example, the “coo” format doesn’t support indexing:

[55]:

# not running the following here:
# dtm[0, 0]
# it creates the following exception:
# TypeError: 'coo_matrix' object is not subscriptable


So you have to convert the sparse DTM to another format first. For example, the CSR format allows indexing and is especially optimized for fast row access:

[56]:

dtm.tocsr()[0, 443]

[56]:

4


This gives us the number of times the token at vocabulary index 443 occurs in the first document. Which token and document does this refer to exactly? We can find out using doc_labels, which corresponds to the rows of dtm, and vocab_final, which was returned by sparse_dtm() and corresponds to the columns:

[57]:

doc_labels[0], vocab_final[443]

[57]:

('NewsArticles-1880', 'trump')


Where does the index 443 come from? It’s the position of the token “trump” in the vocab_final list. These indices are important when working with DTMs, so you should be familiar with the methods of Python’s *list* data type:

[58]:

vocab_final.index('trump')

[58]:

443


See also the following example of finding out the index for “administration” and then getting an array that represents the number of occurrences of this token across all three documents:

[59]:

vocab_admin_ix = vocab_final.index('administration')
dtm.tocsc()[:, vocab_admin_ix].toarray()

[59]:

array([[4],
       [1],
       [0]], dtype=int32)


## Parallel processing with the TMPreproc class¶

As mentioned in the beginning of this chapter, the TMPreproc class employs parallel computation for text preprocessing. All functions that are available in the functional API are also available in the TMPreproc class as properties or methods. So you can do exactly the same things, only with a slightly different syntax and with the power of parallel processing behind you.

### Optional: enabling logging output¶

First, let’s have a look at how to display the logging output from tmtoolkit. By default, tmtoolkit does not expose any internal logging messages.
Sometimes, for example for diagnostic output during debugging or in order to see progress for long-running operations, it’s helpful to enable logging output display, which can be done as follows:

import logging

logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display, for instance also logging.DEBUG
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True


### Creating a TMPreproc object¶

You can create a TMPreproc object (also known as “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass a Corpus object. This time we will not use a sample but the full English news articles corpus:

[60]:

corpus = Corpus.from_builtin_corpus('english-NewsArticles')
corpus

[60]:

<Corpus [3824 documents]>


We can now pass this directly to TMPreproc. Doing so will at first distribute all documents to several sub-processes which will later be used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes. It defaults to the number of CPU cores in your machine. The distribution of documents to the processes happens according to the document size. E.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will take care of the large document alone and CPU 2 will take the other three small documents.

After distribution of the documents, they will directly be tokenized (in parallel). Hence when you have a large corpus, the creation of a TMPreproc object may take some time because of the tokenization process.

Let’s create a TMPreproc object from corpus:

[61]:

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus)
preproc

[61]:

<TMPreproc [3824 documents]>


Another important parameter is language, which defaults to 'english'.
So when you’re working with a German corpus, you would create the object as:

preproc = TMPreproc(corpus, language='german')


Our TMPreproc object preproc is now set up to work with the documents passed in corpus and the language 'english'. All further operations with this object will use the specified documents and language.

### Accessing tokens, vocabulary and other important properties¶

TMPreproc provides several properties to access its data and some summary statistics. See for example the number of documents and the total number of tokens across all documents:

[62]:

preproc.n_docs

[62]:

3824

[63]:

preproc.n_tokens

[63]:

2452726


We can also access the document labels and the number of tokens in each document:

[64]:

preproc.doc_labels[:10]   # displaying only the first 10 here

[64]:

['NewsArticles-1',
'NewsArticles-10',
'NewsArticles-100',
'NewsArticles-1000',
'NewsArticles-1001',
'NewsArticles-1002',
'NewsArticles-1003',
'NewsArticles-1004',
'NewsArticles-1005',
'NewsArticles-1006']

[65]:

# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']

[65]:

227


As expected, there are properties for vocabulary and vocabulary counts, too:

[66]:

preproc.vocabulary[:10]   # displaying only the first 10 here

[66]:

['!', '#', '$', '%', '&', "'", "''", "''We", "'-", "'-and"]

[67]:

# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']

[67]:

115385


We can also get the document frequency for each token in the vocabulary as absolute numbers (.vocabulary_abs_doc_frequency) or proportions (.vocabulary_rel_doc_frequency):

[68]:

(preproc.vocabulary_abs_doc_frequency['Trump'],
preproc.vocabulary_rel_doc_frequency['Trump'])

[68]:

(1096, 0.28661087866108786)

[69]:

(preproc.vocabulary_abs_doc_frequency['Putin'],
preproc.vocabulary_rel_doc_frequency['Putin'])

[69]:

(166, 0.043410041841004186)


#### Accessing document tokens¶

The most important properties are those that start with .tokens.... They give access to the tokenized documents in the TMPreproc object in different formats.

The .tokens property simply returns a dict mapping document labels to their tokens:

[70]:

# only showing the first ten tokens of a specific doc.
preproc.tokens['NewsArticles-1880'][:10]

[70]:

['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia-related',
'materials',
'Lawyers',
'for']


The .tokens_datatable and .tokens_dataframe properties return a datatable Frame or pandas DataFrame, respectively. The datatable Frame consists of at least three columns: The document label, the position of the token in the document (zero-indexed) and the token itself. Please note that for large amounts of data, .tokens_datatable is usually quicker than using .tokens_dataframe.

[71]:

preproc.tokens_datatable

[71]:

|   | doc | position | token |
| --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | Betsy |
| 1 | NewsArticles-1 | 1 | DeVos |
| 2 | NewsArticles-1 | 2 | Confirmed |
| 3 | NewsArticles-1 | 3 | as |
| 4 | NewsArticles-1 | 4 | Education |
| 5 | NewsArticles-1 | 5 | Secretary |
| 6 | NewsArticles-1 | 6 | , |
| 7 | NewsArticles-1 | 7 | With |
| 8 | NewsArticles-1 | 8 | Pence |
| 9 | NewsArticles-1 | 9 | Casting |
| 10 | NewsArticles-1 | 10 | Historic |
| 11 | NewsArticles-1 | 11 | Tie-Breaking |
| 12 | NewsArticles-1 | 12 | Vote |
| 13 | NewsArticles-1 | 13 | Michigan |
| 14 | NewsArticles-1 | 14 | billionaire |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 2,452,721 | NewsArticles-999 | 589 | article |
| 2,452,722 | NewsArticles-999 | 590 | was |
| 2,452,723 | NewsArticles-999 | 591 | n't |
| 2,452,724 | NewsArticles-999 | 592 | funny |
| 2,452,725 | NewsArticles-999 | 593 | ? |

The pandas DataFrame returned from .tokens_dataframe has a similar layout (not shown here).

More columns may be shown when you add token metadata (more on that later).

### Understanding TMPreproc as a state machine¶

Before we proceed with the methods that TMPreproc provides, we should understand how a TMPreproc object represents a state which can be changed by calling its methods. This state also determines the behavior of the object. For example, when you want to lemmatize your documents, you can call the TMPreproc.lemmatize() method (more on that later). However, you can only use this method if you performed POS tagging via TMPreproc.pos_tag() before, i.e. if your TMPreproc object’s state is “ready” for lemmatization.

A TMPreproc object is a complex data structure that encapsulates the data you work with (i.e. your corpus), several “state” variables (e.g. a variable that records whether the tokens have POS tag information), a bunch of methods that transform your data or compute something from it and, as already introduced, some properties that provide access to your data and some summary statistics.
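
The notion of a state that determines which methods may be called can be sketched with a toy class. `ToyPreproc` below is purely hypothetical: the flag name, the error type and the placeholder "lemmatization" are assumptions made for illustration and say nothing about how TMPreproc is implemented internally.

```python
class ToyPreproc:
    """Toy stand-in for illustration only; not tmtoolkit's implementation."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos_tagged = False   # "state" variable: were tokens POS-tagged?

    def pos_tag(self):
        # a real tagger would attach POS tags here; we only flip the flag
        self.pos_tagged = True
        return self               # returning self allows method chaining

    def lemmatize(self):
        if not self.pos_tagged:
            # the object's state is not "ready" for this operation
            raise ValueError('call pos_tag() before lemmatize()')
        # placeholder transformation: strip a trailing plural "s"
        self.tokens = [t[:-1] if t.endswith('s') else t for t in self.tokens]
        return self


toy = ToyPreproc(['aides', 'materials'])

try:
    toy.lemmatize()               # fails: state is not ready yet
    failed = False
except ValueError:
    failed = True

toy.pos_tag().lemmatize()         # works once the state has changed
```

The same call, `lemmatize()`, fails or succeeds depending on what happened to the object before, which is exactly what "state machine" refers to here.
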

We can see how calling methods may change the data and the state of the object. For example, we can see how transforming all tokens to lowercase also changes the vocabulary and hence the vocabulary size:

[72]:

# original vocabulary size
len(preproc.vocabulary)

[72]:

78290

[73]:

preproc.tokens_to_lowercase()
len(preproc.vocabulary)  # vocabulary size is now smaller

[73]:

69086


### Copying TMPreproc objects¶

It’s important to note that after calling the method tokens_to_lowercase(), the tokens in preproc were transformed and the original tokens from before calling this method are not available anymore. In Python, assigning a mutable object to a variable only binds the same object to a different name; it doesn’t copy it. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning it to a different variable (say preproc_upper) gives us two names for the same object. Calling a method via either name changes the values seen through both names.

Let’s see this example:

[74]:

preproc_upper = preproc  # simple assignment, no copy!

# we didn't change anything, so this should be true:
preproc.vocabulary == preproc_upper.vocabulary

[74]:

True

[75]:

# let's transform the tokens to uppercase
# we might expect that this only applies to the tokens in "preproc_upper"
preproc_upper.transform_tokens(str.upper)

[75]:

<TMPreproc [3824 documents]>

[76]:

# but the vocabulary is the same for both!
preproc.vocabulary == preproc_upper.vocabulary

[76]:

True

[77]:

preproc.vocabulary[10000:10010]

[77]:

['ARTICHOKES',
'ARTICLE',
'ARTICLE-IN',
'ARTICLE50',
'ARTICLES',
'ARTICULATE',
'ARTICULATED',
'ARTIFACTS',
'ARTIFICIAL',
'ARTIFICIALLY']

[78]:

preproc_upper.vocabulary[10000:10010]

[78]:

['ARTICHOKES',
'ARTICLE',
'ARTICLE-IN',
'ARTICLE50',
'ARTICLES',
'ARTICULATE',
'ARTICULATED',
'ARTIFACTS',
'ARTIFICIAL',
'ARTIFICIALLY']


What happened? As explained, by the assignment preproc_upper = preproc we only assigned a new name to the object behind preproc. Calling methods on either preproc_upper or preproc will essentially modify the same object. We can confirm that both variables point to the same object, by comparing the Python object ID via id():

[79]:

id(preproc), id(preproc_upper)

[79]:

(139932303304072, 139932303304072)


The same is true when you assign the result of a method that returns the TMPreproc “self” object, so you have to watch out here, too:

[80]:

# again, we only create another name for the same object:
preproc_lower = preproc.tokens_to_lowercase()

[81]:

# *all* three names refer to the same object and hence to the same vocabulary
preproc_lower.vocabulary == preproc_upper.vocabulary == preproc.vocabulary

[81]:

True

[82]:

# it's all lowercase now
preproc.vocabulary[10000:10010]

[82]:

['arthanayake',
'arthaud',
'arthena',
'arthenia',
'arthritic',
'arthritis',
'arthur',
'artichokes',
'article',
'article-in']


What can we do about that? We need to copy the object which can be done with the TMPreproc.copy() method. By this, we create another variable that points to a separate TMPreproc object.

[83]:

preproc_upper = preproc.copy()

[84]:

# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)

[84]:

(139931125296264, 139932303304072)

[85]:

preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary

[85]:

False

[86]:

preproc_upper.vocabulary[10000:10010]

[86]:

['ARTICHOKES',
'ARTICLE',
'ARTICLE-IN',
'ARTICLE50',
'ARTICLES',
'ARTICULATE',
'ARTICULATED',
'ARTIFACTS',
'ARTIFICIAL',
'ARTIFICIALLY']


Note that this now uses up twice as much computer memory. So you shouldn’t create copies too often, and you should release unused memory by using del:

[87]:

# removing the objects again
del preproc_upper, preproc_lower


### Serialization: Saving and loading TMPreproc objects¶

The current state of a TMPreproc object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it using that file. The methods for that are TMPreproc.save_state() and TMPreproc.load_state() / TMPreproc.from_state().

Let’s store the current state of the preproc, which has all tokens transformed to lowercase:

[88]:

preproc.save_state('data/preproc_lowercase.pickle')

[88]:

<TMPreproc [3824 documents]>


Let’s change the object by retaining only documents that contain the token “trump” (see the reduced number of documents):

[89]:

preproc.filter_documents('trump')

[89]:

<TMPreproc [1097 documents]>


We can restore the saved data using TMPreproc.from_state():

[90]:

preproc_full = TMPreproc.from_state('data/preproc_lowercase.pickle')
preproc_full

[90]:

<TMPreproc [3824 documents]>


This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. When you’re finished running these operations, you can easily store the current state to disk and later retrieve it without the need to re-run these operations.
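
The general save-and-restore pattern can be sketched with Python’s standard pickle module. This is a generic illustration with a plain dict as the "state"; the file name is made up, and how save_state() serializes a TMPreproc object internally is not covered here.

```python
import os
import pickle
import tempfile

# a plain dict standing in for some expensive-to-compute state
state = {'tokens': {'doc1': ['hello', 'world']}, 'language': 'english'}

# write the state to a file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'toy_state.pickle')
with open(path, 'wb') as f:
    pickle.dump(state, f)

# later changes to the in-memory object do not affect the stored file
state['tokens']['doc1'] = ['changed']

# restore the state as it was at the time of saving
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

Keep in mind that pickle files can execute arbitrary code when loaded, so only load files from sources you trust.
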

### Methods¶

All functions from the functional API are also available as TMPreproc methods, most carrying the same name. Additional functionality comes in the form of token metadata handling, which will be the first topic in the next section.

Before starting to explore the TMPreproc methods, we’ll re-create a fresh TMPreproc object from the NewsArticles corpus and make a copy of it in order to be able to revert to that state later.

[91]:

preproc = TMPreproc(corpus)
preproc_orig = preproc.copy()
preproc

[91]:

<TMPreproc [3824 documents]>


#### Working with token metadata / POS tagging¶

TMPreproc allows you to attach arbitrary metadata to each token in each document. This kind of “annotation” for tokens is very useful. For example, you may add metadata that records a token’s length or whether it is all uppercase letters and later use that for filtering or in further analyses. One method to add such metadata is add_metadata_per_doc(). It requires a dict that maps document labels to the respective token metadata list. The list’s length must match the number of tokens in the respective document. First, we need to create such a metadata dict. Let’s do that for the tokens’ length:

[92]:

meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
for doc_label, doc_tokens in preproc.tokens.items()}

# show first 10 tokens and their string length for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
meta_tok_lengths['NewsArticles-1880'][:10]))

[92]:

[('White', 5),
('House', 5),
('aides', 5),
('told', 4),
('to', 2),
('keep', 4),
('Russia-related', 14),
('materials', 9),
('Lawyers', 7),
('for', 3)]


[93]:

preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore


The property .tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):

[94]:

preproc.tokens_datatable

[94]:

|   | doc | position | token | meta_length |
| --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | Betsy | 5 |
| 1 | NewsArticles-1 | 1 | DeVos | 5 |
| 2 | NewsArticles-1 | 2 | Confirmed | 9 |
| 3 | NewsArticles-1 | 3 | as | 2 |
| 4 | NewsArticles-1 | 4 | Education | 9 |
| 5 | NewsArticles-1 | 5 | Secretary | 9 |
| 6 | NewsArticles-1 | 6 | , | 1 |
| 7 | NewsArticles-1 | 7 | With | 4 |
| 8 | NewsArticles-1 | 8 | Pence | 5 |
| 9 | NewsArticles-1 | 9 | Casting | 7 |
| 10 | NewsArticles-1 | 10 | Historic | 8 |
| 11 | NewsArticles-1 | 11 | Tie-Breaking | 12 |
| 12 | NewsArticles-1 | 12 | Vote | 4 |
| 13 | NewsArticles-1 | 13 | Michigan | 8 |
| 14 | NewsArticles-1 | 14 | billionaire | 11 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2,452,721 | NewsArticles-999 | 589 | article | 7 |
| 2,452,722 | NewsArticles-999 | 590 | was | 3 |
| 2,452,723 | NewsArticles-999 | 591 | n't | 3 |
| 2,452,724 | NewsArticles-999 | 592 | funny | 5 |
| 2,452,725 | NewsArticles-999 | 593 | ? | 1 |

Let’s add a boolean indicator for whether the given token is all uppercase:

[95]:

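meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

# attach the indicator under the key "upper" (shown as column meta_upper)
preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper

preproc.tokens_datatable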

[95]:

|   | doc | position | token | meta_upper | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | Betsy | 0 | 5 |
| 1 | NewsArticles-1 | 1 | DeVos | 0 | 5 |
| 2 | NewsArticles-1 | 2 | Confirmed | 0 | 9 |
| 3 | NewsArticles-1 | 3 | as | 0 | 2 |
| 4 | NewsArticles-1 | 4 | Education | 0 | 9 |
| 5 | NewsArticles-1 | 5 | Secretary | 0 | 9 |
| 6 | NewsArticles-1 | 6 | , | 0 | 1 |
| 7 | NewsArticles-1 | 7 | With | 0 | 4 |
| 8 | NewsArticles-1 | 8 | Pence | 0 | 5 |
| 9 | NewsArticles-1 | 9 | Casting | 0 | 7 |
| 10 | NewsArticles-1 | 10 | Historic | 0 | 8 |
| 11 | NewsArticles-1 | 11 | Tie-Breaking | 0 | 12 |
| 12 | NewsArticles-1 | 12 | Vote | 0 | 4 |
| 13 | NewsArticles-1 | 13 | Michigan | 0 | 8 |
| 14 | NewsArticles-1 | 14 | billionaire | 0 | 11 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2,452,721 | NewsArticles-999 | 589 | article | 0 | 7 |
| 2,452,722 | NewsArticles-999 | 590 | was | 0 | 3 |
| 2,452,723 | NewsArticles-999 | 591 | n't | 0 | 3 |
| 2,452,724 | NewsArticles-999 | 592 | funny | 0 | 5 |
| 2,452,725 | NewsArticles-999 | 593 | ? | 0 | 1 |

You may use these newly added columns now for example for filtering the datatable:

[96]:

import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1,:]

[96]:

|   | doc | position | token | meta_upper | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 466 | ABC | 1 | 3 |
| 1 | NewsArticles-10 | 10 | A | 1 | 1 |
| 2 | NewsArticles-10 | 109 | U.S | 1 | 3 |
| 3 | NewsArticles-10 | 225 | ABC | 1 | 3 |
| 4 | NewsArticles-10 | 227 | WEAR | 1 | 4 |
| 5 | NewsArticles-10 | 290 | AP | 1 | 2 |
| 6 | NewsArticles-10 | 373 | 9613BJ | 1 | 6 |
| 7 | NewsArticles-100 | 97 | UK | 1 | 2 |
| 8 | NewsArticles-100 | 108 | UK | 1 | 2 |
| 9 | NewsArticles-100 | 326 | C | 1 | 1 |
| 10 | NewsArticles-100 | 559 | A | 1 | 1 |
| 11 | NewsArticles-100 | 581 | UK | 1 | 2 |
| 12 | NewsArticles-1000 | 11 | A | 1 | 1 |
| 13 | NewsArticles-1000 | 26 | A | 1 | 1 |
| 14 | NewsArticles-1000 | 123 | A | 1 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 36,844 | NewsArticles-999 | 490 | U.S | 1 | 3 |
| 36,845 | NewsArticles-999 | 495 | LTE | 1 | 3 |
| 36,846 | NewsArticles-999 | 515 | 4G | 1 | 2 |
| 36,847 | NewsArticles-999 | 567 | 22-28GB | 1 | 7 |
| 36,848 | NewsArticles-999 | 575 | FCC | 1 | 3 |

POS tagging is also a way of annotating tokens in TMPreproc. When you run the method pos_tag(), a new metadata column meta_pos is added. We can try that out now:

[97]:

preproc.pos_tag()
preproc.tokens_datatable

[97]:

|   | doc | position | token | meta_upper | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | Betsy | 0 | NNP | 5 |
| 1 | NewsArticles-1 | 1 | DeVos | 0 | NNP | 5 |
| 2 | NewsArticles-1 | 2 | Confirmed | 0 | NNP | 9 |
| 3 | NewsArticles-1 | 3 | as | 0 | IN | 2 |
| 4 | NewsArticles-1 | 4 | Education | 0 | NNP | 9 |
| 5 | NewsArticles-1 | 5 | Secretary | 0 | NNP | 9 |
| 6 | NewsArticles-1 | 6 | , | 0 | , | 1 |
| 7 | NewsArticles-1 | 7 | With | 0 | IN | 4 |
| 8 | NewsArticles-1 | 8 | Pence | 0 | NNP | 5 |
| 9 | NewsArticles-1 | 9 | Casting | 0 | NNP | 7 |
| 10 | NewsArticles-1 | 10 | Historic | 0 | NNP | 8 |
| 11 | NewsArticles-1 | 11 | Tie-Breaking | 0 | NNP | 12 |
| 12 | NewsArticles-1 | 12 | Vote | 0 | NNP | 4 |
| 13 | NewsArticles-1 | 13 | Michigan | 0 | NNP | 8 |
| 14 | NewsArticles-1 | 14 | billionaire | 0 | POS | 11 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2,452,721 | NewsArticles-999 | 589 | article | 0 | NN | 7 |
| 2,452,722 | NewsArticles-999 | 590 | was | 0 | VBD | 3 |
| 2,452,723 | NewsArticles-999 | 591 | n't | 0 | RB | 3 |
| 2,452,724 | NewsArticles-999 | 592 | funny | 0 | JJ | 5 |
| 2,452,725 | NewsArticles-999 | 593 | ? | 0 | . | 1 |

We can see that a new column meta_pos with the POS tags for each token was introduced.

[98]:

preproc.get_available_metadata_keys()

[98]:

{'length', 'pos', 'upper'}


[99]:

preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()   # check which keys remain

[99]:

{'length', 'pos'}


The section on filtering will later show how to use metadata to filter tokens and documents.

#### Token transformations¶

As already said, TMPreproc provides the same functionality as the functional API. Token transformations like stemming, lemmatization, lowercase transformation, etc. can be applied step-by-step. We will show a typical pre-processing pipeline consisting of:

1. lemmatization (which we can apply because we already POS-tagged our tokens)

2. lowercase transformation

3. token cleaning

4. removal of very common and very uncommon tokens

[100]:

preproc.lemmatize()
preproc.tokens_datatable

[100]:

|   | doc | position | token | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | Betsy | NNP | 5 |
| 1 | NewsArticles-1 | 1 | DeVos | NNP | 5 |
| 2 | NewsArticles-1 | 2 | Confirmed | NNP | 9 |
| 3 | NewsArticles-1 | 3 | as | IN | 2 |
| 4 | NewsArticles-1 | 4 | Education | NNP | 9 |
| 5 | NewsArticles-1 | 5 | Secretary | NNP | 9 |
| 6 | NewsArticles-1 | 6 | , | , | 1 |
| 7 | NewsArticles-1 | 7 | With | IN | 4 |
| 8 | NewsArticles-1 | 8 | Pence | NNP | 5 |
| 9 | NewsArticles-1 | 9 | Casting | NNP | 7 |
| 10 | NewsArticles-1 | 10 | Historic | NNP | 8 |
| 11 | NewsArticles-1 | 11 | Tie-Breaking | NNP | 12 |
| 12 | NewsArticles-1 | 12 | Vote | NNP | 4 |
| 13 | NewsArticles-1 | 13 | Michigan | NNP | 8 |
| 14 | NewsArticles-1 | 14 | billionaire | POS | 11 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2,452,721 | NewsArticles-999 | 589 | article | NN | 7 |
| 2,452,722 | NewsArticles-999 | 590 | be | VBD | 3 |
| 2,452,723 | NewsArticles-999 | 591 | n't | RB | 3 |
| 2,452,724 | NewsArticles-999 | 592 | funny | JJ | 5 |
| 2,452,725 | NewsArticles-999 | 593 | ? | . | 1 |

We proceed with the pipeline and employ “method chaining”: You can apply several methods one after another by chaining them with a . as long as each method in the chain returns a TMPreproc object:

[101]:

preproc.tokens_to_lowercase().clean_tokens(remove_numbers=True)
preproc.tokens_datatable

[101]:

|   | doc | position | token | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | betsy | NNP | 5 |
| 1 | NewsArticles-1 | 1 | devos | NNP | 5 |
| 2 | NewsArticles-1 | 2 | confirmed | NNP | 9 |
| 3 | NewsArticles-1 | 3 | education | NNP | 9 |
| 4 | NewsArticles-1 | 4 | secretary | NNP | 9 |
| 5 | NewsArticles-1 | 5 | pence | NNP | 5 |
| 6 | NewsArticles-1 | 6 | casting | NNP | 7 |
| 7 | NewsArticles-1 | 7 | historic | NNP | 8 |
| 8 | NewsArticles-1 | 8 | tie-breaking | NNP | 12 |
| 9 | NewsArticles-1 | 9 | vote | NNP | 4 |
| 10 | NewsArticles-1 | 10 | michigan | NNP | 8 |
| 11 | NewsArticles-1 | 11 | billionaire | POS | 11 |
| 12 | NewsArticles-1 | 12 | education | NN | 9 |
| 13 | NewsArticles-1 | 13 | activist | NN | 8 |
| 14 | NewsArticles-1 | 14 | betsy | NNP | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1,313,679 | NewsArticles-999 | 275 | away | RB | 4 |
| 1,313,680 | NewsArticles-999 | 276 | think | VBD | 7 |
| 1,313,681 | NewsArticles-999 | 277 | article | NN | 7 |
| 1,313,682 | NewsArticles-999 | 278 | n't | RB | 3 |
| 1,313,683 | NewsArticles-999 | 279 | funny | JJ | 5 |

[102]:

preproc.remove_common_tokens(0.9).remove_uncommon_tokens(5, absolute=True)
preproc.tokens_datatable

[102]:

|   | doc | position | token | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | betsy | NNP | 5 |
| 1 | NewsArticles-1 | 1 | devos | NNP | 5 |
| 2 | NewsArticles-1 | 2 | education | NNP | 9 |
| 3 | NewsArticles-1 | 3 | secretary | NNP | 9 |
| 4 | NewsArticles-1 | 4 | pence | NNP | 5 |
| 5 | NewsArticles-1 | 5 | historic | NNP | 8 |
| 6 | NewsArticles-1 | 6 | vote | NNP | 4 |
| 7 | NewsArticles-1 | 7 | michigan | NNP | 8 |
| 8 | NewsArticles-1 | 8 | billionaire | POS | 11 |
| 9 | NewsArticles-1 | 9 | education | NN | 9 |
| 10 | NewsArticles-1 | 10 | activist | NN | 8 |
| 11 | NewsArticles-1 | 11 | betsy | NNP | 5 |
| 12 | NewsArticles-1 | 12 | devos | NNP | 5 |
| 13 | NewsArticles-1 | 13 | confirm | VBN | 9 |
| 14 | NewsArticles-1 | 14 | today | NN | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1,183,399 | NewsArticles-999 | 219 | away | RB | 4 |
| 1,183,400 | NewsArticles-999 | 220 | think | VBD | 7 |
| 1,183,401 | NewsArticles-999 | 221 | article | NN | 7 |
| 1,183,402 | NewsArticles-999 | 222 | n't | RB | 3 |
| 1,183,403 | NewsArticles-999 | 223 | funny | JJ | 5 |

When we have a look at the vocabulary size and compare it with the unprocessed data, we can see that we greatly reduced the number of unique tokens:

[103]:

len(preproc.vocabulary), len(preproc_orig.vocabulary)

[103]:

(11250, 78290)


#### Filtering¶

Filtering also works the same as with the functional API, i.e. methods like filter_tokens() or filter_documents() are available. We will now focus on filtering with metadata.

We can tell filter_tokens() and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata meta_length, which we created in the metadata section to filter for tokens of a certain length:

[104]:

preproc.filter_tokens(3, by_meta='length')
preproc.tokens_datatable

[104]:

|   | doc | position | token | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1 | 0 | use | VB | 3 |
| 1 | NewsArticles-1 | 1 | tie | NN | 3 |
| 2 | NewsArticles-1 | 2 | day | NN | 3 |
| 3 | NewsArticles-1 | 3 | one | CD | 3 |
| 4 | NewsArticles-1 | 4 | sen | NNP | 3 |
| 5 | NewsArticles-1 | 5 | law | NN | 3 |
| 6 | NewsArticles-1 | 6 | van | NNP | 3 |
| 7 | NewsArticles-1 | 7 | two | CD | 3 |
| 8 | NewsArticles-1 | 8 | abc | NNP | 3 |
| 9 | NewsArticles-10 | 0 | run | NN | 3 |
| 10 | NewsArticles-10 | 1 | may | MD | 3 |
| 11 | NewsArticles-10 | 2 | u.s | NNP | 3 |
| 12 | NewsArticles-10 | 3 | abc | NNP | 3 |
| 13 | NewsArticles-10 | 4 | say | VBP | 3 |
| 14 | NewsArticles-10 | 5 | duo | NN | 3 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 70,182 | NewsArticles-999 | 19 | n't | RB | 3 |
| 70,183 | NewsArticles-999 | 20 | new | JJ | 3 |
| 70,184 | NewsArticles-999 | 21 | n't | RB | 3 |
| 70,185 | NewsArticles-999 | 22 | new | JJ | 3 |
| 70,186 | NewsArticles-999 | 23 | n't | RB | 3 |

Note that all matching options then apply to the metadata column, in this case to the meta_length column which contains integers. Since filter_tokens() by default employs exact matching, we get all tokens where meta_length equals the first argument, 3. If we used regular expression or glob matching instead, this method would fail because you can only use that for string data.
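
Why exact matching works on integer metadata while pattern matching does not can be shown with plain Python. The lists below are made-up sample data, and the snippet only illustrates the general point about types; it does not reproduce filter_tokens() itself.

```python
import re

tokens = ['use', 'house', 'tie']
meta_length = [3, 5, 3]           # integer metadata, one value per token

# exact matching: a plain equality test works for any data type
kept = [tok for tok, length in zip(tokens, meta_length) if length == 3]

# regular expression matching expects strings and rejects integers
try:
    re.search(r'^3$', meta_length[0])
    regex_ok = True
except TypeError:
    regex_ok = False
```

If pattern matching on such metadata is really needed, the values would have to be stored as strings in the first place.
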

If you want to use more complex filter queries, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps a document label to a sequence of booleans. For all occurrences of True, the respective token in the document will be retained, all others will be removed. Let’s try that out with a small sample:

[105]:

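preproc_small = TMPreproc(corpus.sample(5))
preproc_small.pos_tag()   # adds the meta_pos column shown below
meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc_small.tokens.items()}
preproc_small.add_metadata_per_doc('length', meta_tok_lengths)
preproc_small.tokens_datatable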

[105]:

|   | doc | position | token | meta_pos | meta_length |
| --- | --- | --- | --- | --- | --- |
| 0 | NewsArticles-1728 | 0 | Trump | NN | 5 |
| 1 | NewsArticles-1728 | 1 | : | : | 1 |
| 2 | NewsArticles-1728 | 2 | Agency | NN | 6 |
| 3 | NewsArticles-1728 | 3 | to | TO | 2 |
| 4 | NewsArticles-1728 | 4 | support | VB | 7 |
| 5 | NewsArticles-1728 | 5 | 'victims | NNS | 8 |
| 6 | NewsArticles-1728 | 6 | of | IN | 2 |
| 7 | NewsArticles-1728 | 7 | immigrant | JJ | 9 |
| 8 | NewsArticles-1728 | 8 | crimes' | NN | 7 |
| 9 | NewsArticles-1728 | 9 | In | IN | 2 |
| 10 | NewsArticles-1728 | 10 | first | JJ | 5 |
| 11 | NewsArticles-1728 | 11 | speech | NN | 6 |
| 12 | NewsArticles-1728 | 12 | to | TO | 2 |
| 13 | NewsArticles-1728 | 13 | Congress | NNP | 8 |
| 14 | NewsArticles-1728 | 14 | , | , | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2570 | NewsArticles-948 | 332 | . | . | 1 |
| 2571 | NewsArticles-948 | 333 | Source | NN | 6 |
| 2572 | NewsArticles-948 | 334 | : | : | 1 |
| 2573 | NewsArticles-948 | 335 | -News | NN | 5 |
| 2574 | NewsArticles-948 | 336 | agencies | NNS | 8 |

We now generate the filter mask, which means for each document we create a boolean list or array that for each token in that document indicates whether that token should be kept or removed.

We will iterate through the .tokens_with_metadata property which is a dict that for each document contains a datatable with its tokens and metadata. Let’s have a look at the first document’s datatable:

[106]:

next(iter(preproc_small.tokens_with_metadata.values()))

[106]:

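| | token | meta_pos | meta_length |
|---|---|---|---|
| 0 | Ex-footballer | NNP | 13 |
| 1 | Adam | NNP | 4 |
| 2 | Johnson | NNP | 7 |
| 3 | loses | VBZ | 5 |
| 4 | appeal | JJ | 6 |
| 5 | Ex-England | NNP | 10 |
| 6 | footballer | NN | 10 |
| 7 | Adam | NNP | 4 |
| 8 | Johnson | NNP | 7 |
| 9 | has | VBZ | 3 |
| 10 | lost | VBN | 4 |
| 11 | a | DT | 1 |
| 12 | Court | NNP | 5 |
| 13 | of | IN | 2 |
| 14 | Appeal | NNP | 6 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 134 | to | TO | 2 |
| 135 | another | DT | 7 |
| 136 | sexual | JJ | 6 |
| 137 | act | NN | 3 |
| 138 | . | . | 1 |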

Now we can create the filter mask:

[107]:

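import numpy as np
import datatable as dt

filter_mask = {}
for doc_label, doc_data in preproc_small.tokens_with_metadata.items():
    # extract the columns "meta_length" and "meta_pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.meta_pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())

    # create a boolean array for nouns with token length less or equal 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.char.startswith(tok_pos, 'N')

# adding the filter mask as metadata is not necessary,
# but it's a good way to check the mask
preproc_small.add_metadata_per_doc('small_nouns', filter_mask)
preproc_small.tokens_datatable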

[107]:

| | doc | position | token | meta_small_nouns | meta_pos | meta_length |
|---|---|---|---|---|---|---|
| 0 | NewsArticles-1728 | 0 | Trump | 1 | NN | 5 |
| 1 | NewsArticles-1728 | 1 | : | 0 | : | 1 |
| 2 | NewsArticles-1728 | 2 | Agency | 0 | NN | 6 |
| 3 | NewsArticles-1728 | 3 | to | 0 | TO | 2 |
| 4 | NewsArticles-1728 | 4 | support | 0 | VB | 7 |
| 5 | NewsArticles-1728 | 5 | 'victims | 0 | NNS | 8 |
| 6 | NewsArticles-1728 | 6 | of | 0 | IN | 2 |
| 7 | NewsArticles-1728 | 7 | immigrant | 0 | JJ | 9 |
| 8 | NewsArticles-1728 | 8 | crimes' | 0 | NN | 7 |
| 9 | NewsArticles-1728 | 9 | In | 0 | IN | 2 |
| 10 | NewsArticles-1728 | 10 | first | 0 | JJ | 5 |
| 11 | NewsArticles-1728 | 11 | speech | 0 | NN | 6 |
| 12 | NewsArticles-1728 | 12 | to | 0 | TO | 2 |
| 13 | NewsArticles-1728 | 13 | Congress | 0 | NNP | 8 |
| 14 | NewsArticles-1728 | 14 | , | 0 | , | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2570 | NewsArticles-948 | 332 | . | 0 | . | 1 |
| 2571 | NewsArticles-948 | 333 | Source | 0 | NN | 6 |
| 2572 | NewsArticles-948 | 334 | : | 0 | : | 1 |
| 2573 | NewsArticles-948 | 335 | -News | 1 | NN | 5 |
| 2574 | NewsArticles-948 | 336 | agencies | 0 | NNS | 8 |

[108]:

preproc_small.filter_tokens_by_mask(filter_mask)
preproc_small.tokens_datatable

[108]:

| | doc | position | token | meta_small_nouns | meta_pos | meta_length |
|---|---|---|---|---|---|---|
| 0 | NewsArticles-1728 | 0 | Trump | 1 | NN | 5 |
| 1 | NewsArticles-1728 | 1 | Path | 1 | NN | 4 |
| 2 | NewsArticles-1728 | 2 | wall' | 1 | NN | 5 |
| 3 | NewsArticles-1728 | 3 | Trump | 1 | NNP | 5 |
| 4 | NewsArticles-1728 | 4 | crime | 1 | NN | 5 |
| 5 | NewsArticles-1728 | 5 | VOICE | 1 | NN | 5 |
| 6 | NewsArticles-1728 | 6 | Crime | 1 | NNP | 5 |
| 7 | NewsArticles-1728 | 7 | VOICE | 1 | NNP | 5 |
| 8 | NewsArticles-1728 | 8 | list | 1 | NN | 4 |
| 9 | NewsArticles-1728 | 9 | US | 1 | NNP | 2 |
| 10 | NewsArticles-1728 | 10 | name | 1 | NN | 4 |
| 11 | NewsArticles-1728 | 11 | Trump | 1 | NNP | 5 |
| 12 | NewsArticles-1728 | 12 | name | 1 | NN | 4 |
| 13 | NewsArticles-1728 | 13 | READ | 1 | NNP | 4 |
| 14 | NewsArticles-1728 | 14 | Trump | 1 | NNP | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 254 | NewsArticles-948 | 40 | flow | 1 | NN | 4 |
| 255 | NewsArticles-948 | 41 | goods | 1 | NNS | 5 |
| 256 | NewsArticles-948 | 42 | Egypt | 1 | NNP | 5 |
| 257 | NewsArticles-948 | 43 | Gaza | 1 | NNP | 4 |
| 258 | NewsArticles-948 | 44 | -News | 1 | NN | 5 |
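The effect of a filter mask can also be imitated in plain Python, which may help when debugging a mask before passing it to filter_tokens_by_mask(). The following is a minimal sketch that does not use tmtoolkit at all: the document labels and the small token lists are illustrative, with tokens and POS tags borrowed from the output above, and boolean lists stand in for NumPy arrays.

```python
# toy documents; labels are illustrative, tokens and tags taken from the tables above
tokens = {
    "doc_a": ["Trump", ":", "Agency", "to", "support"],
    "doc_b": ["Source", ":", "-News", "agencies"],
}
pos_tags = {
    "doc_a": ["NN", ":", "NN", "TO", "VB"],
    "doc_b": ["NN", ":", "NN", "NNS"],
}

# build the mask: one boolean per token, True for nouns of length <= 5
filter_mask = {}
for doc_label, doc_tokens in tokens.items():
    filter_mask[doc_label] = [
        len(tok) <= 5 and pos.startswith("N")
        for tok, pos in zip(doc_tokens, pos_tags[doc_label])
    ]

# apply the mask: keep a token only where its mask entry is True
filtered = {
    doc_label: [tok for tok, keep in zip(doc_tokens, filter_mask[doc_label]) if keep]
    for doc_label, doc_tokens in tokens.items()
}

print(filter_mask)
print(filtered)
```

Each mask has exactly as many entries as its document has tokens, which is the property to check first if a real mask does not behave as expected.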

#### Other methods

Again, all functions that you know from the functional API are also available as TMPreproc methods and work in the same way, so we won’t repeat them here. Have a look at the API documentation for an overview of TMPreproc’s methods and properties. In this final section, we focus on generating a sparse document-term matrix (DTM). The .dtm property generates and returns a sparse DTM from the tokens of a TMPreproc object. First, let’s check the number of documents and the vocabulary size, which determine the shape of the DTM that we will create afterwards. We will continue working with preproc_small:

[109]:

(preproc_small.n_docs, len(preproc_small.vocabulary))

[109]:

(5, 137)

[110]:

dtm_small = preproc_small.dtm
dtm_small

[110]:

<5x137 sparse matrix of type '<class 'numpy.int32'>'
with 158 stored elements in Compressed Sparse Row format>


We can see that the DTM has the expected shape: 5 documents by 137 vocabulary terms. The get_dtm() method can also return the result as a datatable or pandas DataFrame:

[111]:

preproc_small.get_dtm(as_datatable=True)

[111]:

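| | _doc | 'We | -Al | -News | A | A-321 | Adam | Air | Al | Bays | … | wages | wall | wall' | years | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NewsArticles-1728 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | … | 3 | 1 | 1 | 0 | 0 |
| 1 | NewsArticles-2162 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |
| 2 | NewsArticles-2616 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 |
| 3 | NewsArticles-2902 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 |
| 4 | NewsArticles-948 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |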
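To make the relation between the number of documents, the vocabulary size and the DTM shape concrete, here is a minimal sketch in plain Python. It is not how tmtoolkit builds its DTM (which is sparse); it uses a dense nested list, and the two toy documents are illustrative only.

```python
from collections import Counter

# toy tokenized documents (illustrative data)
docs = {
    "doc1": ["Trump", "wall", "Trump"],
    "doc2": ["Gaza", "wall"],
}

# sorted vocabulary: one DTM column per unique token
vocab = sorted({tok for toks in docs.values() for tok in toks})
# sorted document labels: one DTM row per document
doc_labels = sorted(docs)

# each cell holds how often a term occurs in a document
dtm = []
for label in doc_labels:
    counts = Counter(docs[label])
    dtm.append([counts[term] for term in vocab])

shape = (len(doc_labels), len(vocab))

print(vocab)
print(dtm)
print(shape)
```

Most cells of a real DTM are zero, as the output above suggests, which is why a sparse format is the more economical choice for larger corpora.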

The bow module contains several functions to work with DTMs, e.g. apply transformations such as tf-idf or compute some important summary statistics. The next chapter will introduce some of these functions.