# Text preprocessing¶

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation marks, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts in a way that makes later analysis steps easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.

## Parallel processing with the TMPreproc class¶

You can pass a dict-like dataset (i.e. anything that maps document labels to their plain text contents, e.g. a tmtoolkit Corpus object) to the TMPreproc class and then apply several text processing methods to it. You can chain these processing steps by applying one method after another and examining the results.

Under the hood, the spaCy package is used to perform most NLP methods. However, TMPreproc offers functionality beyond spaCy, including flexible token and document filtering. The most important advantage of using TMPreproc is that it employs parallel processing, i.e. it uses all available processors on your machine for the computations necessary during preprocessing. For large text corpora, this can lead to a substantial speed-up.

Using the functional API

Apart from the TMPreproc class, tmtoolkit also provides several functions in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. Note that only the latter provides parallel processing.

Let’s load a sample of three documents from the built-in NewsArticles dataset. We’ll use only a small number of documents here to have a better overview at the beginning. We can later use a larger sample.

[1]:

import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus_small = Corpus.from_builtin_corpus('en-NewsArticles').sample(3)


### Optional: enabling logging output¶

By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long running operations, it’s helpful to enable logging output display, which can be done as follows:

import logging

logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display, for instance also logging.DEBUG
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True


### Creating a TMPreproc object¶

You can create a TMPreproc object (also known as “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass our corpus_small object. We also need to specify the corpus language as two-letter ISO 639-1 language code (here "en" for English).

[2]:

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus_small, language='en')
preproc

[2]:

<TMPreproc [3 documents / en]>


The above will at first distribute all documents to several sub-processes which will later be used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes. It defaults to the number of CPU cores in your machine. The distribution of documents to the processes happens according to document size. E.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will take care of the large document alone and CPU 2 will take the other three small documents. After the documents have been distributed, they are directly tokenized (in parallel). Hence, when you have a large corpus, creating a TMPreproc object may take some time because of the tokenization process.
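
For example, a minimal sketch that limits the number of worker processes via the n_max_processes parameter mentioned above (two processes are chosen here purely for illustration):

# restrict TMPreproc to at most two worker processes;
# by default, all available CPU cores would be used
preproc_2cores = TMPreproc(corpus_small, language='en', n_max_processes=2)
del preproc_2cores   # release the example instance again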

Our TMPreproc object preproc is now set up to work with the documents passed in corpus_small and the language 'en' for English. All further operations with this object will use the specified documents and language. All documents are directly tokenized.

The method print_summary() is very handy and we will use it quite often. It displays a small summary of the documents in the TMPreproc object. N=... denotes the number of tokens in the respective document.

[3]:

preproc.print_summary()

3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=657): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1947 / vocabulary size: 683

[3]:

<TMPreproc [3 documents / en]>


### Accessing tokens, vocabulary and other important properties¶

TMPreproc provides several properties to access its data and some summary statistics. These properties are read-only, i.e. you can only retrieve the results but not assign new values to them.

First, let’s have a look at the labels (names) of the documents:

[4]:

preproc.doc_labels

[4]:

['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']


We can access the tokens of each document by using the tokens property:

[5]:

# use [:10] slice to show only the first 10 tokens
preproc.tokens['NewsArticles-1880'][:10]

[5]:

['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials']


If you prefer a tabular output, you can also access the tokens and their metadata as pandas DataFrames or datatable Frames.

A note on the use of datatable Frames

If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is much faster and more memory-efficient in most cases. You can always convert between the two like this:

import datatable as dt
import pandas as pd

# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})

# DataFrame to datatable:
dtable = dt.Frame(df)

# and vice versa datatable to DataFrame:
df == dtable.to_pandas()

# Out:
#       a     b
# 0  True  True
# 1  True  True
# 2  True  True


Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.

You can use the tokens_dataframe or tokens_datatable properties for tabular output. The datatable Frame consists of at least five columns: the document label, the position of the token in the document (zero-indexed), the token itself, its lemma and whitespace. The lemma column contains the token’s lemma and whitespace indicates whether there is a whitespace after the token in the original text. Please note that for large amounts of data, tokens_datatable is usually quicker than tokens_dataframe.

[6]:

preproc.tokens_datatable

[6]:

 doc position token lemma whitespace ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White 1 1 NewsArticles-1880 1 House House 1 2 NewsArticles-1880 2 aides aide 1 3 NewsArticles-1880 3 told tell 1 4 NewsArticles-1880 4 to to 1 5 NewsArticles-1880 5 keep keep 1 6 NewsArticles-1880 6 Russia Russia 0 7 NewsArticles-1880 7 - - 0 8 NewsArticles-1880 8 related relate 1 9 NewsArticles-1880 9 materials material 0 10 NewsArticles-1880 10 0 11 NewsArticles-1880 11 Lawyers Lawyers 1 12 NewsArticles-1880 12 for for 1 13 NewsArticles-1880 13 the the 1 14 NewsArticles-1880 14 Trump Trump 1 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1942 NewsArticles-99 1055 non non 0 1943 NewsArticles-99 1056 - - 0 1944 NewsArticles-99 1057 recyclable recyclable 1 1945 NewsArticles-99 1058 items item 0 1946 NewsArticles-99 1059 . . 0

More columns may be shown when you add token metadata (more on that later).

The method get_tokens() gives you more options for accessing the tokens. For example, you can get all tokens with their metadata as a nested dictionary of the form document label -> metadata key (e.g. “lemma”) -> list of values.

[7]:

doctokens = preproc.get_tokens(with_metadata=True, as_datatables=False)
doctokens['NewsArticles-1880'].keys()

[7]:

dict_keys(['token', 'lemma', 'whitespace'])

[8]:

# lemmata for the first 10 tokens in this document
doctokens['NewsArticles-1880']['lemma'][:10]

[8]:

['White',
'House',
'aide',
'tell',
'to',
'keep',
'Russia',
'-',
'relate',
'material']


You may also want to access the re-constructed full text of each document via the texts property. This returns a dict that maps document labels to their text. Here we only display the first 100 characters of a single document:

[9]:

preproc.texts['NewsArticles-1880'][:100]

[9]:

'White House aides told to keep Russia-related materials\n\nLawyers for the Trump administration have i'


As mentioned in the beginning, tmtoolkit’s preprocessing module uses spaCy internally for most NLP tasks. If you want direct access to the spaCy documents, you can use the spacy_docs property. Here, we access a single spaCy document and check its is_tagged attribute:

[10]:

preproc.spacy_docs['NewsArticles-1880'].is_tagged

[10]:

False


You can also retrieve the document and token vectors from the word embeddings representation of the documents. For this, however, you need to create a TMPreproc instance with the argument enable_vectors=True:

[11]:

preproc_vec = TMPreproc(corpus_small, language='en', enable_vectors=True)
preproc_vec.vectors_enabled

[11]:

True


Now you may access the document vectors via the doc_vectors property:

[12]:

# displaying only the first 10 values of a single
# document's document vector
preproc_vec.doc_vectors['NewsArticles-1880'][:10]

[12]:

array([-7.0222005e-02,  8.1240870e-02, -3.9869484e-02,  1.8360456e-02,
1.9232498e-02, -2.5533361e-02, -2.9136341e-02, -1.0187237e-01,
1.6649088e-03,  2.4026785e+00], dtype=float32)


Token vectors are also available via the token_vectors property:

[13]:

# displaying only a single document's token matrix
preproc_vec.token_vectors['NewsArticles-1880']

[13]:

array([[-0.39347 , -0.061407,  0.015231, ...,  0.046462,  0.058398,
0.46169 ],
[ 0.19847 ,  0.18087 , -0.089119, ..., -0.24263 , -0.035183,
-0.29661 ],
[ 0.28059 , -0.45684 ,  0.414   , ..., -0.31501 , -0.31649 ,
-0.026392],
...,
[-0.08267 ,  0.092944,  0.028411, ...,  0.49965 , -0.17115 ,
0.27578 ],
[ 0.01327 ,  0.51269 , -0.35735 , ...,  0.19492 ,  0.058496,
0.26636 ],
[ 0.012001,  0.20751 , -0.12578 , ...,  0.13871 , -0.36049 ,
-0.035   ]], dtype=float32)

[14]:

del preproc_vec


The following gives you the number of documents and the total number of tokens, respectively:

[15]:

preproc.n_docs

[15]:

3

[16]:

preproc.n_tokens

[16]:

1947


We can also access the number of tokens in each document via the doc_lengths property:

[17]:

# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']

[17]:

230


The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use the property vocabulary for that and the property vocabulary_counts to additionally get the number of times each token appears in the corpus.

[18]:

preproc.vocabulary[:10]  # displaying only the first 10 here

[18]:

['\n\n', ' ', '"', '%', "'", "'s", '(', ')', ',', '-']

[19]:

# number of unique tokens in all documents
preproc.vocabulary_size

[19]:

683

[20]:

# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']

[20]:

82


The latter returns a Python Counter object, so we can use its methods, e.g. most_common() to get the most frequent tokens:

[21]:

preproc.vocabulary_counts.most_common()[:10]

[21]:

[('the', 82),
(',', 70),
('.', 60),
('to', 53),
('"', 50),
('and', 46),
('in', 39),
('a', 31),
('of', 25),
('that', 22)]


The document frequency of a token is the number of documents in which this token occurs at least once. The properties vocabulary_abs_doc_frequency and vocabulary_rel_doc_frequency return this measure as absolute frequency or proportion respectively:

[22]:

(preproc.vocabulary_abs_doc_frequency['Trump'],
preproc.vocabulary_rel_doc_frequency['Trump'])

[22]:

(2, 0.6666666666666666)

[23]:

(preproc.vocabulary_abs_doc_frequency['Russia'],
preproc.vocabulary_rel_doc_frequency['Russia'])

[23]:

(1, 0.3333333333333333)


### Part-of-Speech (POS) tagging¶

Part-of-speech (POS) tagging finds the grammatical word category for each token in a document. The method pos_tag() applies this to the whole corpus. The identified POS tags are added as metadata to each token. These tags conform to a specific tagset which is explained in the spaCy documentation. The POS tags can be used to annotate and filter the documents. Let’s apply POS tagging:

[24]:

preproc.pos_tag()

[24]:

<TMPreproc [3 documents / en]>


We can now see a new column pos with the found POS tag for each token:

[25]:

preproc.tokens_datatable

[25]:

 doc position token lemma pos whitespace ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White PROPN 1 1 NewsArticles-1880 1 House House PROPN 1 2 NewsArticles-1880 2 aides aide NOUN 1 3 NewsArticles-1880 3 told tell VERB 1 4 NewsArticles-1880 4 to to PART 1 5 NewsArticles-1880 5 keep keep VERB 1 6 NewsArticles-1880 6 Russia Russia PROPN 0 7 NewsArticles-1880 7 - - PUNCT 0 8 NewsArticles-1880 8 related relate VERB 1 9 NewsArticles-1880 9 materials material NOUN 0 10 NewsArticles-1880 10 SPACE 0 11 NewsArticles-1880 11 Lawyers lawyer NOUN 1 12 NewsArticles-1880 12 for for ADP 1 13 NewsArticles-1880 13 the the DET 1 14 NewsArticles-1880 14 Trump trump ADJ 1 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1942 NewsArticles-99 1055 non non ADJ 0 1943 NewsArticles-99 1056 - - ADJ 0 1944 NewsArticles-99 1057 recyclable recyclable ADJ 1 1945 NewsArticles-99 1058 items item NOUN 0 1946 NewsArticles-99 1059 . . PUNCT 0

### Aside: TMPreproc as “state machine”¶

Before continuing, we should clarify that a TMPreproc instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:

corpus = {
"doc1": "Hello world!",
"doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens

# Out:
# {
#   'doc1': ['Hello', 'world', '!'],
#   'doc2': ['Another', 'example']
# }

preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }


As you can see, the tokens “inside” preproc are changed in place: after calling tokens_to_lowercase(), the tokens in preproc are transformed and the original tokens from before the call are no longer available. Also keep in mind that in Python, assigning an object to another variable does not copy it; it only binds the same object to another name. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning it to a different variable (say preproc_upper) gives us two names for the same object, and calling a method via one of these names changes the data seen through both.
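
The aliasing behavior itself is plain Python semantics and can be sketched in two lines:

preproc_alias = preproc     # no copy is made: both names refer to the same object
preproc_alias is preproc    # True, so a method called via either name changes "both"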

#### Copying TMPreproc objects¶

What can we do about that? We need to copy the object, which can be done with the TMPreproc.copy() method. This way, we create a new variable preproc_upper that points to a separate TMPreproc object.

[26]:

preproc_upper = preproc.copy()

[27]:

# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)

[27]:

(140426331677504, 140426727032000)

[28]:

preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary

[28]:

False

[29]:

# show a sample
preproc_upper.tokens['NewsArticles-1880'][:10]

[29]:

['WHITE',
'HOUSE',
'AIDES',
'TOLD',
'TO',
'KEEP',
'RUSSIA',
'-',
'RELATED',
'MATERIALS']

[30]:

# the original "preproc" still holds the same data
preproc.tokens['NewsArticles-1880'][:10]

[30]:

['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials']


Note that we now also use up twice as much memory. So you shouldn’t create copies too often and should release unused objects with del:

[31]:

# removing the objects again
del preproc_upper


### Lemmatization and term normalization¶

Before we start with token normalization, we will create a copy of the original TMPreproc object and its data, so that we can later use it for comparison:

[32]:

preproc_orig = preproc.copy()


Lemmatization brings a token, if it is a word, to its base form. The lemma is already determined during tokenization and is available in the lemma metadata column. However, when you want to further process the tokens on the basis of the lemmata, you should use the lemmatize() method. This method sets the lemmata as tokens, and all further processing will happen on the lemmatized tokens:

[33]:

preproc.lemmatize()
preproc.tokens_datatable

[33]:

 doc position token lemma pos whitespace ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White PROPN 1 1 NewsArticles-1880 1 House House PROPN 1 2 NewsArticles-1880 2 aide aide NOUN 1 3 NewsArticles-1880 3 tell tell VERB 1 4 NewsArticles-1880 4 to to PART 1 5 NewsArticles-1880 5 keep keep VERB 1 6 NewsArticles-1880 6 Russia Russia PROPN 0 7 NewsArticles-1880 7 - - PUNCT 0 8 NewsArticles-1880 8 relate relate VERB 1 9 NewsArticles-1880 9 material material NOUN 0 10 NewsArticles-1880 10 SPACE 0 11 NewsArticles-1880 11 lawyer lawyer NOUN 1 12 NewsArticles-1880 12 for for ADP 1 13 NewsArticles-1880 13 the the DET 1 14 NewsArticles-1880 14 trump trump ADJ 1 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1942 NewsArticles-99 1055 non non ADJ 0 1943 NewsArticles-99 1056 - - ADJ 0 1944 NewsArticles-99 1057 recyclable recyclable ADJ 1 1945 NewsArticles-99 1058 item item NOUN 0 1946 NewsArticles-99 1059 . . PUNCT 0

As we can see, the lemma column was copied over to the token column.

Stemming

tmtoolkit doesn’t support stemming directly, since lemmatization is generally considered the better approach for reducing different word forms to a common base form. However, you may install NLTK and apply stemming by passing a stemmer’s stem() method to transform_tokens().
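
As a rough sketch (assuming NLTK is installed; PorterStemmer and its stem() method are NLTK’s API, while the combination with transform_tokens() is only illustrative and applied to a copy so the running example stays untouched):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# transform_tokens() applies the given function to every token in every document,
# so passing the stemmer's stem() method stems all tokens
preproc_stemmed = preproc.copy()
preproc_stemmed.transform_tokens(stemmer.stem)
del preproc_stemmed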

Depending on how you further want to analyze the data, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).

If you want to remove certain characters from all tokens in your documents, you can use remove_chars_in_tokens() and pass it a sequence of characters to remove. There is also a shortcut remove_special_chars_in_tokens() which will remove all “special characters” (all characters in string.punctuation by default).

[34]:

preproc.remove_chars_in_tokens(['-'])  # remove only "-"
preproc.print_summary()

3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom ? Our bat...
total number of tokens: 1947 / vocabulary size: 596

[34]:

<TMPreproc [3 documents / en]>

[35]:

# remove all punctuation
preproc.remove_special_chars_in_tokens()
preproc.print_summary()   # the "?" also vanishes

3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom Our bathr...
total number of tokens: 1947 / vocabulary size: 580

[35]:

<TMPreproc [3 documents / en]>


A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with tokens_to_lowercase():

[36]:

preproc.tokens_to_lowercase()
preproc.print_summary()

3 documents in language English:
> NewsArticles-1880 (N=230): white house aide tell to keep russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): should you have two bin in your bathroom our bathr...
total number of tokens: 1947 / vocabulary size: 562

[36]:

<TMPreproc [3 documents / en]>


The method clean_tokens() finally applies several steps that remove tokens that meet certain criteria. This includes removing:

• punctuation tokens

• stopwords (very common words for the given language)

• empty tokens (i.e. '')

• tokens that are longer or shorter than a certain number of characters

• numbers

Note that this is a language-dependent method, because the default stopword list is determined per language. This method has many parameters to tweak, so it’s recommended to check out the documentation.

[37]:

# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)
preproc.print_summary()

3 documents in language English:
> NewsArticles-1880 (N=130): white house aide tell keep russia relate material ...
> NewsArticles-3350 (N=309): frustration cabin electronic ban come force passen...
> NewsArticles-99 (N=486): bin bathroom bathroom fill shampoo bottle toilet r...
total number of tokens: 925 / vocabulary size: 469

[37]:

<TMPreproc [3 documents / en]>


Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

[38]:

preproc.doc_lengths, preproc_orig.doc_lengths

[38]:

({'NewsArticles-1880': 130, 'NewsArticles-3350': 309, 'NewsArticles-99': 486},
{'NewsArticles-1880': 230, 'NewsArticles-3350': 657, 'NewsArticles-99': 1060})


We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

[39]:

len(preproc.vocabulary), len(preproc_orig.vocabulary)

[39]:

(469, 683)


You can also apply custom token transform functions by using transform_tokens() and passing it a function that should be applied to each token in each document (hence it must accept one string argument).

First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:

[40]:

def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('EU'), token_shape('CamelCase'), token_shape('lower')

[40]:

('XX', 'XxxxxXxxx', 'xxxxx')


We can now apply this function to our documents (we will use the original documents here, because they were not transformed to lower case):

[41]:

preproc = preproc_orig.copy() # swap instances for later

preproc_orig.transform_tokens(token_shape)   # apply function
preproc_orig.print_summary()

# remove instance
del preproc_orig

3 documents in language English:
> NewsArticles-1880 (N=230): Xxxxx Xxxxx xxxxx xxxx xx xxxx Xxxxxx x xxxxxxx xx...
> NewsArticles-3350 (N=657): Xxxxxxxxxxx xx xxxxx xxxxxxxxxxx xxx xxxxx xxxx xx...
> NewsArticles-99 (N=1060): Xxxxxx xxx xxxx xxx xxxx xx xxxx xxxxxxxx x xx Xxx...
total number of tokens: 1947 / vocabulary size: 32


#### Expanding compound words and joining tokens¶

Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compound_tokens(). However, depending on the language model, most of these compounds will already be separated on initial tokenization.

[42]:

orig_vocab = preproc.vocabulary
preproc.expand_compound_tokens()

# create set difference to show vocabulary tokens
# that were expanded
set(orig_vocab) - set(preproc.vocabulary)

[42]:

{'Source:-Al'}


It’s also possible to join certain subsequent occurrences of tokens or token patterns into single tokens. This means you can, for example, transform all subsequent occurrences of the tokens “White” and “House” into the single token “White_House”. In case you don’t use n-grams (described in a separate section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as a person, an institution or a concept like “Climate Change”, as a single token. The method to use for this is glue_tokens(). It accepts the following parameters:

• a patterns sequence of length N that is used to match the subsequent N tokens;

• a glue string that is used to join the matched subsequent tokens (by default: "_").

Along with that, you can adjust the token matching with the common token matching parameters described below.

Let’s “glue” all subsequent occurrences of “White” and “House”. The glue_tokens() method will return a set of glued tokens that matched the provided pattern:

[43]:

preproc_orig = preproc.copy()  # make a copy of full orig. data for later use
preproc.glue_tokens(['White', 'House'])

[43]:

{'White_House'}

[44]:

preproc.tokens['NewsArticles-1880'][:20]

[44]:

['White_House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials',
'\n\n',
'Lawyers',
'for',
'the',
'Trump',
'have',
'instructed',
'White_House',
'aides',
'to']

[45]:

del preproc


### Keywords-in-context (KWIC) and general filtering methods¶

Keywords-in-context (KWIC) allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.

TMPreproc provides three methods for this purpose:

• get_kwic() is the base method accepting a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;

• get_kwic_table() is the more “user friendly” version of the above method, as it produces a datatable with the keyword highlighted by default;

• filter_tokens_with_kwic() works similarly to the above methods but applies the result by filtering the documents; it is explained in the section on filtering

Let’s see the KWIC methods in action:

[46]:

preproc = preproc_orig.copy()  # use orig. full data
preproc.get_kwic('house', ignore_case=True)

[46]:

{'NewsArticles-1880': [['White', 'House', 'aides', 'told'],
['instructed', 'White', 'House', 'aides', 'to'],
['The', 'White', 'House', 'is', 'simply'],
['the', 'White', 'House', 'and', 'law']],
'NewsArticles-3350': [],
'NewsArticles-99': [['of', 'the', 'house', ',', '"']]}


The method returns a dictionary that maps document labels to the KWIC results. Each document contains a list of “contexts”, i.e. a list of tokens that surround a keyword, here "house". This keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two tokens to the right (which may be less when the keyword is near the start or the end of a document).

We can see that NewsArticles-1880 contains four contexts, NewsArticles-99 one context and NewsArticles-3350 none.

With get_kwic_table(), we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as *house* and empty results are removed:

[47]:

preproc.get_kwic_table('house', ignore_case=True)

[47]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | White *House* aides told |
| 1 | NewsArticles-1880 | 1 | instructed White *House* aides to |
| 2 | NewsArticles-1880 | 2 | The White *House* is simply |
| 3 | NewsArticles-1880 | 3 | the White *House* and law |
| 4 | NewsArticles-99 | 0 | of the *house* , " |

An important parameter is context_size. It determines the number of tokens to display left and right to the found keyword. You can either pass a single integer for a symmetric context or a tuple with integers (<left>, <right>):

[48]:

preproc.get_kwic_table('house', ignore_case=True, context_size=4)

[48]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | White *House* aides told to keep |
| 1 | NewsArticles-1880 | 1 | administration have instructed White *House* aides… |
| 2 | NewsArticles-1880 | 2 | . " The White *House* is simply taking proactive |
| 3 | NewsArticles-1880 | 3 | Democrats to the White *House* and law enforcement… |
| 4 | NewsArticles-99 | 0 | other rooms of the *house* , " says Jonny |

[49]:

preproc.get_kwic_table('house', ignore_case=True, context_size=(1, 4))

[49]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | White *House* aides told to keep |
| 1 | NewsArticles-1880 | 1 | White *House* aides to preserve any |
| 2 | NewsArticles-1880 | 2 | White *House* is simply taking proactive |
| 3 | NewsArticles-1880 | 3 | White *House* and law enforcement agencies |
| 4 | NewsArticles-99 | 0 | the *house* , " says Jonny |

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact (but case insensitive) matches between the corpus tokens and our keyword "house". However, it is also possible to match patterns like "new*" (matches any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section gives an introduction on the different options for pattern matching.

#### Common parameters for pattern matching functions¶

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

• search_token or search_tokens: lets you specify one or more search patterns as strings

• match_type: sets the matching type and can be one of the following options:

• 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching

• 'regex' uses regular expression matching

• 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

• ignore_case: ignore character case (applies to all three match types)

• glob_method: if match_type is ‘glob’, use this glob method. Must be 'match' or 'search' (similar behavior as Python’s re.match or re.search)

• inverse: invert the match results, i.e. if matching for “hello”, return all results that do not match “hello”

Let’s try out some of these options with get_kwic_table():

[50]:

# using a regular expression, ignoring case
preproc.get_kwic_table(r'agenc(y|ies)', match_type='regex', ignore_case=True)

[50]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | law enforcement *agencies* to keep |
| 1 | NewsArticles-1880 | 1 | organizations , *agencies* and individuals |
| 2 | NewsArticles-3350 | 0 | Reuters news *agency* . Al |
| 3 | NewsArticles-3350 | 1 | and news *agencies* |

[51]:

# using a glob, ignoring case
preproc.get_kwic_table('pol*', match_type='glob', ignore_case=True)

[51]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | false and *politically* motivated attacks |
| 1 | NewsArticles-99 | 0 | , senior *policy* adviser for |

[52]:

# using a glob, ignoring case
preproc.get_kwic_table('*sol*', match_type='glob', ignore_case=True)

[52]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-99 | 0 | potential simple *solution* that could |
| 1 | NewsArticles-99 | 1 | confused by *aerosols* . " |
| 2 | NewsArticles-99 | 2 | bottles , *aerosols* for deodorant |

[53]:

# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
preproc.get_kwic_table(r'[AEIOUaeiou]', match_type='regex', inverse=True)

[53]:

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-1880 | 0 | keep Russia *-* related materials |
| 1 | NewsArticles-1880 | 1 | related materials * * Lawyers for |
| 2 | NewsArticles-1880 | 2 | in the *2016* presidential election |
| 3 | NewsArticles-1880 | 3 | related investigations *,* ABC News |
| 4 | NewsArticles-1880 | 4 | has confirmed *.* " The |
| 5 | NewsArticles-1880 | 5 | confirmed . *"* The White |
| 6 | NewsArticles-1880 | 6 | motivated attacks *,* " an |
| 7 | NewsArticles-1880 | 7 | attacks , *"* an administration |
| 8 | NewsArticles-1880 | 8 | News Wednesday *.* The directive |
| 9 | NewsArticles-1880 | 9 | last week *by* Senate Democrats |
| 10 | NewsArticles-1880 | 10 | between Trump *'s* administration , |
| 11 | NewsArticles-1880 | 11 | 's administration *,* campaign and |
| 12 | NewsArticles-1880 | 12 | transition teams *"* ? or |
| 13 | NewsArticles-1880 | 13 | teams " *?* or anyone |
| 14 | NewsArticles-1880 | 14 | their behalf *"* ? and |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 265 | NewsArticles-99 | 147 | two bins *?* There are |
| 266 | NewsArticles-99 | 148 | other options *.* Hang a |
| 267 | NewsArticles-99 | 149 | recycling bin *.* Or opt |
| 268 | NewsArticles-99 | 150 | and non *-* recyclable items |
| 269 | NewsArticles-99 | 151 | recyclable items *.* |

#### Filtering tokens and documents¶

We can use the pattern matching parameters in numerous filtering methods. At the heart of many of these methods is token_match(). Given a search pattern, a sequence of tokens and optionally some pattern matching parameters, it returns a boolean NumPy array of the same length as the input tokens. Each True in this array signals a match.

[54]:

from tmtoolkit.preprocess import token_match

# first 10 tokens of document "NewsArticles-1880"
doc_snippet = preproc.tokens['NewsArticles-1880'][:10]
# get all tokens that match "to*"
matches = token_match('to*', doc_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc_snippet, matches):
    print(tok, ':', match)

White : False
House : False
aides : False
told : True
to : True
keep : False
Russia : False
- : False
related : False
materials : False


The token_match() function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.
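
For example, here is a small sketch applying token_match() to a list of file names instead of corpus tokens (the file names are made up for illustration):

fnames = ['report_2017.txt', 'report_2018.csv', 'notes.txt']

# boolean array indicating which file names match the glob pattern
token_match('*.txt', fnames, match_type='glob')
# expected result: array([ True, False,  True])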

The following methods cover common use cases for filtering during text preprocessing. Many of these methods come in pairs starting with filter_...() or remove_...(), and each such pair is complementary: a filter method always retains the matched elements, whereas a remove method always drops them. We can observe this with the first pair of methods, filter_tokens() and remove_tokens():

So much .copy()

Note that the following code snippets make a lot of use of the copy() method. This is because we want to show how the different methods work on the same original data (remember that a TMPreproc instance behaves like a state machine) and we also want to “clean up” the temporary instances. Under normal circumstances, you wouldn’t use copy() so excessively.

[55]:

# retain only the tokens that match the pattern in each document
preproc.filter_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

3 documents in language English:
> NewsArticles-1880 (N=4): House House House House
> NewsArticles-3350 (N=0):
> NewsArticles-99 (N=3): house greenhouse household
total number of tokens: 7 / vocabulary size: 4

[56]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

3 documents in language English:
> NewsArticles-1880 (N=226): White aides told to keep Russia - related material...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1057): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1941 / vocabulary size: 679


The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents according to the supplied match criteria. Both accept the standard pattern matching parameters but also a parameter matches_threshold with default value 1. When this number of matching tokens is reached, the document will be part of the result set (filter_documents()) or removed from it (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.

Let’s try these methods out in practice:

[57]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485


We can see that two out of three documents contained the pattern '*house*' and hence were retained.

We can also adjust matches_threshold to set the minimum number of token matches for filtering:

[58]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True,
                         matches_threshold=4)
preproc.print_summary()

del preproc

1 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
total number of tokens: 230 / vocabulary size: 140

[59]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

1 documents in language English:
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 658 / vocabulary size: 288


When we use remove_documents() we get only the documents that did not contain the specified pattern.

Another useful pair of methods is filter_documents_by_name() and remove_documents_by_name(). Both methods again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:

[60]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents_by_name(r'-\d{4}$', match_type='regex')
preproc.print_summary()

del preproc

2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 888 / vocabulary size: 385


In the above example we wanted to retain only the documents whose document labels ended with exactly 4 digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() will do the exact opposite.

You may also use keywords-in-context (KWIC) to filter your tokens to the neighborhood around certain keyword pattern(s). The method for this is called filter_tokens_with_kwic() and works very similarly to get_kwic(), but it filters the documents in the TMPreproc instance, with which you can then continue working as usual. Here, we filter the tokens in each document to get only the tokens directly before and after the glob pattern '*house*' (context_size=1):

[61]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_tokens_with_kwic('*house*', context_size=1,
                                match_type='glob', ignore_case=True)
preproc.tokens_datatable

[61]:

 doc position token lemma whitespace ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White 1 1 NewsArticles-1880 1 House House 1 2 NewsArticles-1880 2 aides aide 1 3 NewsArticles-1880 3 White White 1 4 NewsArticles-1880 4 House House 1 5 NewsArticles-1880 5 aides aide 1 6 NewsArticles-1880 6 White White 1 7 NewsArticles-1880 7 House House 1 8 NewsArticles-1880 8 is be 1 9 NewsArticles-1880 9 White White 1 10 NewsArticles-1880 10 House House 1 11 NewsArticles-1880 11 and and 1 12 NewsArticles-99 0 the the 1 13 NewsArticles-99 1 house house 0 14 NewsArticles-99 2 , , 0 15 NewsArticles-99 3 of of 1 16 NewsArticles-99 4 greenhouse greenhouse 1 17 NewsArticles-99 5 gases gas 1 18 NewsArticles-99 6 UK UK 1 19 NewsArticles-99 7 household household 1 20 NewsArticles-99 8 threw throw 1

Once you have annotated your documents’ tokens with Part-of-Speech (POS) tags, you can also filter by those tags using filter_for_pos():

[62]:

del preproc

preproc = preproc_orig.copy()  # make a copy from full data

# apply POS tagging and retain only nouns
preproc.pos_tag().filter_for_pos('N').tokens_datatable

[62]:

doc position token ▪ ▪ ▪
[63]:

del preproc


In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the method documentation). We can also filter for more than one tag, e.g. nouns or verbs by passing a list of required POS tags.

filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.
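
Both variants described above could look like the following sketch (using a fresh copy and applying pos_tag() first, as in the cells above):

preproc_pos = preproc_orig.copy()
preproc_pos.pos_tag()

# keep only tokens tagged as nouns or verbs
preproc_pos.filter_for_pos(['N', 'V'])

# alternatively: drop all nouns, keeping everything else
# preproc_pos.filter_for_pos('N', inverse=True)

del preproc_pos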

Finally, there are two methods for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens whose document frequency is greater than or equal to a certain threshold defined by the parameter df_threshold. The latter does the same for all tokens whose document frequency is lower than or equal to df_threshold. This parameter can either be a relative frequency (the default) or an absolute count (by setting the parameter absolute=True).

Before applying the method, let’s have a look at the number of tokens per document again, to later see how many we will remove. We will also store the vocabulary in orig_vocab for later comparison:

[64]:

preproc = preproc_orig.copy()  # make a copy from full data
orig_vocab = preproc.vocabulary
preproc.doc_lengths

[64]:

{'NewsArticles-1880': 230, 'NewsArticles-3350': 658, 'NewsArticles-99': 1060}

[65]:

preproc.remove_common_tokens(df_threshold=0.9).doc_lengths

[65]:

{'NewsArticles-1880': 144, 'NewsArticles-3350': 413, 'NewsArticles-99': 700}


By removing all tokens with a document frequency threshold of 0.9, we removed quite a number of tokens in each document. Let’s investigate the vocabulary in order to see which tokens were removed:

[66]:

# set difference gives removed vocabulary tokens
set(orig_vocab) - set(preproc.vocabulary)

[66]:

{'\n\n',
'"',
"'s",
',',
'-',
'.',
'?',
'The',
'a',
'all',
'also',
'an',
'and',
'be',
'for',
'has',
'have',
'in',
'into',
'is',
'more',
'of',
'on',
'or',
'other',
'such',
'than',
'that',
'the',
'to',
'which',
'with'}

[67]:

del preproc


remove_uncommon_tokens() works similarly. This time, let’s use an absolute number as threshold:

[68]:

preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_uncommon_tokens(df_threshold=1, absolute=True)

# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(set(orig_vocab) - set(preproc.vocabulary))[:10]

[68]:

[' ', '%', '(', ')', '10', '12', '135,000', '2016', '38', '45']


The above means that we removed all tokens that appear in only one document.

[69]:

del preproc


### Working with token metadata¶

TMPreproc lets you attach arbitrary metadata to each token in each document. This kind of token “annotation” is very useful. For example, you may add metadata that records a token’s length or whether it consists only of uppercase letters, and later use that for filtering or in further analyses. One method to add such metadata is add_metadata_per_doc(). It requires a dict that maps document labels to the respective lists of token metadata; each list’s length must match the number of tokens in the respective document. First we need to create such a metadata dict. Let’s do that for the tokens’ lengths:

[70]:

preproc = preproc_orig.copy()  # make a copy from full data

meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc.tokens.items()}

# show the first 10 tokens and their string lengths for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
         meta_tok_lengths['NewsArticles-1880'][:10]))

[70]:

[('White', 5),
('House', 5),
('aides', 5),
('told', 4),
('to', 2),
('keep', 4),
('Russia', 6),
('-', 1),
('related', 7),
('materials', 9)]


[71]:

preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore


The property .tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):

[72]:

preproc.tokens_datatable

[72]:

 doc position token lemma whitespace meta_length ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ 0 NewsArticles-1880 0 White White 1 5 1 NewsArticles-1880 1 House House 1 5 2 NewsArticles-1880 2 aides aide 1 5 3 NewsArticles-1880 3 told tell 1 4 4 NewsArticles-1880 4 to to 1 2 5 NewsArticles-1880 5 keep keep 1 4 6 NewsArticles-1880 6 Russia Russia 0 6 7 NewsArticles-1880 7 - - 0 1 8 NewsArticles-1880 8 related relate 1 7 9 NewsArticles-1880 9 materials material 0 9 10 NewsArticles-1880 10 0 2 11 NewsArticles-1880 11 Lawyers lawyer 1 7 12 NewsArticles-1880 12 for for 1 3 13 NewsArticles-1880 13 the the 1 3 14 NewsArticles-1880 14 Trump trump 1 5 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1943 NewsArticles-99 1055 non non 0 3 1944 NewsArticles-99 1056 - - 0 1 1945 NewsArticles-99 1057 recyclable recyclable 1 10 1946 NewsArticles-99 1058 items item 0 5 1947 NewsArticles-99 1059 . . 0 1

Let’s add a boolean indicator for whether the given token is all uppercase:

[73]:

meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper  # we don't need that object anymore

preproc.tokens_datatable

[73]:

 doc position token lemma whitespace meta_length meta_upper ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White 1 5 0 1 NewsArticles-1880 1 House House 1 5 0 2 NewsArticles-1880 2 aides aide 1 5 0 3 NewsArticles-1880 3 told tell 1 4 0 4 NewsArticles-1880 4 to to 1 2 0 5 NewsArticles-1880 5 keep keep 1 4 0 6 NewsArticles-1880 6 Russia Russia 0 6 0 7 NewsArticles-1880 7 - - 0 1 0 8 NewsArticles-1880 8 related relate 1 7 0 9 NewsArticles-1880 9 materials material 0 9 0 10 NewsArticles-1880 10 0 2 0 11 NewsArticles-1880 11 Lawyers lawyer 1 7 0 12 NewsArticles-1880 12 for for 1 3 0 13 NewsArticles-1880 13 the the 1 3 0 14 NewsArticles-1880 14 Trump trump 1 5 0 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1943 NewsArticles-99 1055 non non 0 3 0 1944 NewsArticles-99 1056 - - 0 1 0 1945 NewsArticles-99 1057 recyclable recyclable 1 10 0 1946 NewsArticles-99 1058 items item 0 5 0 1947 NewsArticles-99 1059 . . 0 1 0

You may use these newly added columns now for example for filtering the datatable:

[74]:

import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1,:]

[74]:

 doc position token lemma whitespace meta_length meta_upper ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ ▪ 0 NewsArticles-1880 43 ABC ABC 1 3 1 1 NewsArticles-1880 73 ABC ABC 1 3 1 2 NewsArticles-1880 213 U.S. U.S. 1 4 1 3 NewsArticles-3350 11 US US 0 2 1 4 NewsArticles-3350 13 UK UK 1 2 1 5 NewsArticles-3350 34 US US 1 2 1 6 NewsArticles-3350 98 US US 1 2 1 7 NewsArticles-3350 106 US US 1 2 1 8 NewsArticles-3350 134 UAE UAE 1 3 1 9 NewsArticles-3350 153 READ READ 1 4 1 10 NewsArticles-3350 154 MORE MORE 0 4 1 11 NewsArticles-3350 273 US US 1 2 1 12 NewsArticles-3350 346 READ READ 1 4 1 13 NewsArticles-3350 347 MORE MORE 0 4 1 14 NewsArticles-3350 349 US US 1 2 1 15 NewsArticles-3350 358 US US 1 2 1 16 NewsArticles-3350 454 I -PRON- 1 1 1 17 NewsArticles-3350 480 UK UK 1 2 1 18 NewsArticles-3350 502 UK UK 1 2 1 19 NewsArticles-3350 506 UAE UAE 1 3 1 20 NewsArticles-3350 529 UAE UAE 1 3 1 21 NewsArticles-3350 570 US US 1 2 1 22 NewsArticles-3350 637 US US 1 2 1 23 NewsArticles-99 376 UK UK 1 2 1 24 NewsArticles-99 711 A a 1 1 1 25 NewsArticles-99 955 UK UK 1 2 1 26 NewsArticles-99 995 M25 M25 1 3 1

[75]:

preproc.get_available_metadata_keys()

[75]:

{'lemma', 'length', 'upper', 'whitespace'}


[76]:

preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()

[76]:

{'lemma', 'length', 'whitespace'}

[77]:

preproc.tokens_datatable

[77]:

 doc position token lemma whitespace meta_length ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ 0 NewsArticles-1880 0 White White 1 5 1 NewsArticles-1880 1 House House 1 5 2 NewsArticles-1880 2 aides aide 1 5 3 NewsArticles-1880 3 told tell 1 4 4 NewsArticles-1880 4 to to 1 2 5 NewsArticles-1880 5 keep keep 1 4 6 NewsArticles-1880 6 Russia Russia 0 6 7 NewsArticles-1880 7 - - 0 1 8 NewsArticles-1880 8 related relate 1 7 9 NewsArticles-1880 9 materials material 0 9 10 NewsArticles-1880 10 0 2 11 NewsArticles-1880 11 Lawyers lawyer 1 7 12 NewsArticles-1880 12 for for 1 3 13 NewsArticles-1880 13 the the 1 3 14 NewsArticles-1880 14 Trump trump 1 5 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1943 NewsArticles-99 1055 non non 0 3 1944 NewsArticles-99 1056 - - 0 1 1945 NewsArticles-99 1057 recyclable recyclable 1 10 1946 NewsArticles-99 1058 items item 0 5 1947 NewsArticles-99 1059 . . 0 1

We can tell filter_tokens() and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata meta_length, which we created before, to filter for tokens of a certain length:

[78]:

preproc_meta_example = preproc.copy()
preproc_meta_example.filter_tokens(3, by_meta='length')
preproc_meta_example.tokens_datatable

[78]:

 doc position token lemma whitespace meta_length ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ 0 NewsArticles-1880 0 for for 1 3 1 NewsArticles-1880 1 the the 1 3 2 NewsArticles-1880 2 any any 1 3 3 NewsArticles-1880 3 the the 1 3 4 NewsArticles-1880 4 and and 1 3 5 NewsArticles-1880 5 ABC ABC 1 3 6 NewsArticles-1880 6 has have 1 3 7 NewsArticles-1880 7 The the 1 3 8 NewsArticles-1880 8 and and 1 3 9 NewsArticles-1880 9 ABC ABC 1 3 10 NewsArticles-1880 10 The the 1 3 11 NewsArticles-1880 11 the the 1 3 12 NewsArticles-1880 12 and and 1 3 13 NewsArticles-1880 13 law law 1 3 14 NewsArticles-1880 14 all all 1 3 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 335 NewsArticles-99 186 for for 1 3 336 NewsArticles-99 187 bin bin 1 3 337 NewsArticles-99 188 can can 1 3 338 NewsArticles-99 189 and and 1 3 339 NewsArticles-99 190 non non 0 3
[79]:

del preproc_meta_example


Note that all matching options then apply to the metadata column, in this case the meta_length column, which contains integers. Since filter_tokens() by default uses exact matching, we get all tokens whose meta_length equals the first argument, 3. If we used regular expression or glob matching instead, the method would fail, because those can only be applied to string data.

If you want to use more complex filter queries, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps a document label to a sequence of booleans. For all occurrences of True, the respective token in the document will be retained, all others will be removed. Let’s try that out with a small sample:

[80]:

preproc.pos_tag().tokens_datatable

[80]:

 doc position token lemma pos whitespace meta_length ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ 0 NewsArticles-1880 0 White White PUNCT 1 5 1 NewsArticles-1880 1 House House PUNCT 1 5 2 NewsArticles-1880 2 aides aide PUNCT 1 5 3 NewsArticles-1880 3 told tell PUNCT 1 4 4 NewsArticles-1880 4 to to PUNCT 1 2 5 NewsArticles-1880 5 keep keep PUNCT 1 4 6 NewsArticles-1880 6 Russia Russia PUNCT 0 6 7 NewsArticles-1880 7 - - PUNCT 0 1 8 NewsArticles-1880 8 related relate PUNCT 1 7 9 NewsArticles-1880 9 materials material PUNCT 0 9 10 NewsArticles-1880 10 PUNCT 0 2 11 NewsArticles-1880 11 Lawyers lawyer PUNCT 1 7 12 NewsArticles-1880 12 for for PUNCT 1 3 13 NewsArticles-1880 13 the the PUNCT 1 3 14 NewsArticles-1880 14 Trump trump PUNCT 1 5 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1943 NewsArticles-99 1055 non non PUNCT 0 3 1944 NewsArticles-99 1056 - - PUNCT 0 1 1945 NewsArticles-99 1057 recyclable recyclable PUNCT 1 10 1946 NewsArticles-99 1058 items item PUNCT 0 5 1947 NewsArticles-99 1059 . . PUNCT 0 1

We now generate the filter mask, which means for each document we create a boolean list or array that for each token in that document indicates whether that token should be kept or removed.

We will iterate through the tokens_with_metadata property, which is a dict that for each document contains a datatable with its tokens and metadata. Let’s have a look at the first document’s datatable:

[81]:

next(iter(preproc.tokens_with_metadata.values()))

[81]:

 token lemma pos whitespace meta_length ▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ 0 White White PUNCT 1 5 1 House House PUNCT 1 5 2 aides aide PUNCT 1 5 3 told tell PUNCT 1 4 4 to to PUNCT 1 2 5 keep keep PUNCT 1 4 6 Russia Russia PUNCT 0 6 7 - - PUNCT 0 1 8 related relate PUNCT 1 7 9 materials material PUNCT 0 9 10 PUNCT 0 2 11 Lawyers lawyer PUNCT 1 7 12 for for PUNCT 1 3 13 the the PUNCT 1 3 14 Trump trump PUNCT 1 5 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 225 during during PUNCT 1 6 226 his -PRON- X 1 3 227 confirmation confirmation PUNCT 1 12 228 hearing hearing PUNCT 0 7 229 . . PUNCT 0 1

Now we can create the filter mask:

[82]:

import numpy as np

filter_mask = {}

for doc_label, doc_data in preproc.tokens_with_metadata.items():
    # extract the columns "meta_length" and "pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())

    # create a boolean mask: nouns with a token length of at most 5 characters
    filter_mask[doc_label] = (tok_lengths <= 5) & np.isin(tok_pos, ['NOUN', 'PROPN'])

# adding the mask as token metadata is not required for filtering,
# but it's a good way to check the mask
preproc.add_metadata_per_doc('small_nouns',
                             {lbl: mask.tolist() for lbl, mask in filter_mask.items()})
preproc.tokens_datatable

[82]:

 doc position token lemma pos whitespace meta_length meta_small_nouns ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ ▪▪▪▪▪▪▪▪ ▪ 0 NewsArticles-1880 0 White White PUNCT 1 5 0 1 NewsArticles-1880 1 House House PUNCT 1 5 0 2 NewsArticles-1880 2 aides aide PUNCT 1 5 0 3 NewsArticles-1880 3 told tell PUNCT 1 4 0 4 NewsArticles-1880 4 to to PUNCT 1 2 0 5 NewsArticles-1880 5 keep keep PUNCT 1 4 0 6 NewsArticles-1880 6 Russia Russia PUNCT 0 6 0 7 NewsArticles-1880 7 - - PUNCT 0 1 0 8 NewsArticles-1880 8 related relate PUNCT 1 7 0 9 NewsArticles-1880 9 materials material PUNCT 0 9 0 10 NewsArticles-1880 10 PUNCT 0 2 0 11 NewsArticles-1880 11 Lawyers lawyer PUNCT 1 7 0 12 NewsArticles-1880 12 for for PUNCT 1 3 0 13 NewsArticles-1880 13 the the PUNCT 1 3 0 14 NewsArticles-1880 14 Trump trump PUNCT 1 5 0 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1943 NewsArticles-99 1055 non non PUNCT 0 3 0 1944 NewsArticles-99 1056 - - PUNCT 0 1 0 1945 NewsArticles-99 1057 recyclable recyclable PUNCT 1 10 0 1946 NewsArticles-99 1058 items item PUNCT 0 5 0 1947 NewsArticles-99 1059 . . PUNCT 0 1 0

[83]:

preproc.filter_tokens_by_mask(filter_mask)
preproc.tokens_datatable

[83]:

doc position token ▪ ▪ ▪

### Generating n-grams¶

So far, we worked with unigrams, i.e. each document consisted of a sequence of individual tokens. We can also generate n-grams from our corpus, so that each document consists of a sequence of n-grams, each made up of n subsequent tokens. An example would be:

Document: “This is a simple example.”

n=1 (unigrams):

['This', 'is', 'a', 'simple', 'example', '.']


n=2 (bigrams):

['This is', 'is a', 'a simple', 'simple example', 'example .']


n=3 (trigrams):

['This is a', 'is a simple', 'a simple example', 'simple example .']


The method generate_ngrams() allows us to generate n-grams from tokenized documents. We can then get the results with the ngrams property:

[84]:

del preproc

preproc = preproc_orig.copy()  # make a copy from full data

preproc.generate_ngrams(2)  # generate bigrams
preproc.ngrams['NewsArticles-1880'][:10]  # show first 10 bigrams of this document

[84]:

[['White', 'House'],
['House', 'aides'],
['aides', 'told'],
['told', 'to'],
['to', 'keep'],
['keep', 'Russia'],
['Russia', '-'],
['-', 'related'],
['related', 'materials'],
['materials', 'Lawyers']]


You may afterwards use join_ngrams() to merge the generated n-grams into joint tokens and use these as the new tokens in this TMPreproc instance:

[85]:

preproc.join_ngrams()
preproc.tokens_datatable

[85]:

 doc position token lemma whitespace ▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪ ▪▪▪▪ ▪ 0 NewsArticles-1880 0 White House White House 1 1 NewsArticles-1880 1 House aides House aides 1 2 NewsArticles-1880 2 aides told aides told 1 3 NewsArticles-1880 3 told to told to 1 4 NewsArticles-1880 4 to keep to keep 1 5 NewsArticles-1880 5 keep Russia keep Russia 1 6 NewsArticles-1880 6 Russia - Russia - 1 7 NewsArticles-1880 7 - related - related 1 8 NewsArticles-1880 8 related materials related materials 1 9 NewsArticles-1880 9 materials Lawyers materials Lawyers 1 10 NewsArticles-1880 10 Lawyers for Lawyers for 1 11 NewsArticles-1880 11 for the for the 1 12 NewsArticles-1880 12 the Trump the Trump 1 13 NewsArticles-1880 13 Trump administration Trump administration 1 14 NewsArticles-1880 14 administration have administration have 1 ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 1934 NewsArticles-99 1052 and non and non 1 1935 NewsArticles-99 1053 non - non - 1 1936 NewsArticles-99 1054 - recyclable - recyclable 1 1937 NewsArticles-99 1055 recyclable items recyclable items 1 1938 NewsArticles-99 1056 items . items . 1
[86]:

del preproc


### Generating a sparse document-term matrix (DTM)¶

If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which records the number of occurrences of each term (i.e. vocabulary token) in each document. This is an N-by-M matrix, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.
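
To make the idea of a sparse DTM concrete, here is a small sketch using SciPy directly. The counts are invented toy data, not taken from our corpus, and tmtoolkit builds such matrices for you; this only illustrates how a sparse CSR matrix stores counts:

import numpy as np
from scipy.sparse import csr_matrix

# toy counts for 2 documents and a vocabulary of 4 terms (purely illustrative)
counts = np.array([[2, 0, 5, 1],
                   [0, 3, 4, 0]])

dtm_toy = csr_matrix(counts)         # stores only the 5 non-zero entries
print(dtm_toy.shape, dtm_toy.nnz)    # (2, 4) 5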

For this example, we’ll generate a DTM from the preproc_orig instance. First, let’s check the number of documents and the vocabulary size:

[87]:

preproc_orig.n_docs, preproc_orig.vocabulary_size

[87]:

(3, 683)


We can use the dtm property to generate a sparse DTM from the current instance:

[88]:

preproc_orig.dtm

[88]:

<3x683 sparse matrix of type '<class 'numpy.int32'>'
with 816 stored elements in Compressed Sparse Row format>


We can see that a sparse matrix with 3 rows (which corresponds to the number of documents) and 683 columns (which corresponds to the vocabulary size) was generated. 816 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. dense, representation and inspect some of its elements:

[89]:

preproc_orig.dtm.todense()

[89]:

matrix([[ 1,  0,  4, ...,  0,  0,  0],
        [ 2,  1, 14, ...,  0,  3,  0],
        [ 2,  0, 32, ...,  2,  5,  5]], dtype=int32)


However, note that you should only convert a sparse matrix to a dense representation when you’re dealing with a small amount of data (as in this example) or when you use only a part of the full matrix. Otherwise, converting a sparse matrix to a dense representation can easily exceed the available computer memory.

There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in Compressed Sparse Row (CSR) format. This format allows indexing and is especially optimized for fast row access. You may convert it to any other sparse matrix format; see the mentioned SciPy documentation for this.
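
For example, converting the CSR matrix to other standard SciPy sparse formats could look like this (a short illustration using only standard SciPy methods; whether such a conversion is needed depends on the operations you plan to run):

dtm_csr = preproc_orig.dtm   # CSR format by default
dtm_csc = dtm_csr.tocsc()    # Compressed Sparse Column: optimized for column access
dtm_coo = dtm_csr.tocoo()    # COOrdinate format: convenient for iterating over non-zero entries

dtm_csc.format, dtm_coo.format
# -> ('csc', 'coo')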

The rows of the DTM are aligned to the sequence of document labels and its columns are aligned to the vocabulary. For example, let’s find the frequency of the term “House” in the document “NewsArticles-1880”. To do this, we first determine the row and column indices into the matrix:

[90]:

preproc_orig.doc_labels.index('NewsArticles-1880')

[90]:

0

[91]:

preproc_orig.vocabulary.index('House')

[91]:

67


This means the frequency of the term “House” in the document “NewsArticles-1880” is located in row 0 and column 67 of the DTM:

[92]:

preproc_orig.dtm[0, 67]

[92]:

4


See also the following example of finding the vocabulary index for “administration” and then getting an array with the number of occurrences of this token in each of the three documents:

[93]:

vocab_admin_ix = preproc_orig.vocabulary.index('administration')
preproc_orig.dtm[:, vocab_admin_ix].todense()  # occurrences of "administration" per document

[93]:

matrix([[4],
        [1],
        [0]], dtype=int32)


Apart from the dtm property, there’s also the get_dtm() method, which additionally allows returning the result as a datatable Frame or pandas DataFrame. Note that these representations are not sparse and hence can consume a lot of memory.

[94]:

preproc_orig.get_dtm(as_datatable=True)

DatatableWarning: Duplicate column name found, and was assigned a unique name: '.' -> '.0'

[94]:

    _doc                 .   "   %   '   's  (   )   ,   …   work  world  would  you  your
0   NewsArticles-1880    1   0   4   0   1   3   0   0   9   …   0   0   0   0   0
1   NewsArticles-3350    2   1   14  0   1   6   0   0   28  …   0   1   0   3   0
2   NewsArticles-99      2   0   32  5   0   3   2   2   33  …   1   0   2   5   5

### Serialization: Saving and loading TMPreproc objects¶

The current state of a TMPreproc object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it from that file. The methods for this are save_state() and load_state() / from_state().

Let’s store the current state of the preproc_orig instance:

[95]:

preproc_orig.print_summary()
preproc_orig.save_state('data/preproc_state.pickle')

3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683

[95]:

<TMPreproc [3 documents / en]>


Let’s change the object by retaining only documents that contain the token “house” (see the reduced number of documents):

[96]:

preproc_orig.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc_orig.print_summary()

2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485

[96]:

<TMPreproc [2 documents / en]>


We can restore the saved data using from_state():

[97]:

preproc_restored = TMPreproc.from_state('data/preproc_state.pickle')
preproc_restored.print_summary()

3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683

[97]:

<TMPreproc [3 documents / en]>


You can see that the full dataset with three documents was restored.

This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. Once these operations are finished, you can store the current state to disk and later restore it without having to re-run them.

## Functional API¶

The TMPreproc class provides a convenient object-oriented interface for parallel text processing and analysis. There is also a functional API provided in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. The functional API does not provide parallel processing.

To initialize the functional API for a certain language, you need to start with init_for_language() and may then tokenize your raw text documents via tokenize(), which will generate a list of spaCy documents. Most other functions in this API accept such a list of spaCy documents as input.

from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')
docs = tokenize(['Hello this is a test.', 'And here comes another one.'])
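
Since the result is a list of spaCy documents, you can inspect them with standard spaCy token attributes. For the two example sentences above, this should give the following (shown as expected output, not taken from a run in this tutorial):

[token.text for token in docs[0]]
# -> ['Hello', 'this', 'is', 'a', 'test', '.']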


The final result after applying preprocessing steps and hence transforming the text data is often a document-term matrix (DTM). The bow module contains several functions for working with DTMs, e.g. for applying transformations such as tf-idf weighting or computing important summary statistics. The next chapter will introduce some of these functions.
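
As a rough preview of what such a transformation does, here is a minimal, hand-written tf-idf sketch computed with NumPy on a small dense copy of a DTM. The helper name tfidf_sketch is made up, and this is only one common tf-idf variant written out for illustration; for real work, use the functions in the bow module introduced in the next chapter:

import numpy as np

def tfidf_sketch(dtm):
    # dtm: a (sparse) document-term matrix of raw counts
    m = np.asarray(dtm.todense(), dtype=float)   # dense copy; only OK for small data
    tf = m / m.sum(axis=1, keepdims=True)        # relative term frequency per document
    df = np.count_nonzero(m, axis=0)             # number of documents each term occurs in
    idf = np.log(m.shape[0] / df)                # inverse document frequency
    return tf * idf

tfidf_sketch(preproc_restored.dtm).shape   # same shape as the DTM, here (3, 683)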