Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that subsequent analysis methods can be applied more easily, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.

Parallel processing with the TMPreproc class

You can pass a dict-like dataset (i.e. anything that maps document labels to their plain text contents, e.g. a tmtoolkit Corpus object) to the TMPreproc class and then apply several text processing methods to it. You can chain these processing steps by applying one method after another and examining the results.

Under the hood, the spaCy package is used to perform most NLP tasks. On top of that, TMPreproc offers additional functionality, including flexible token and document filtering. The most important advantage of using TMPreproc is that it employs parallel processing, i.e. it uses all available processors on your machine for the computations necessary during preprocessing. For large text corpora, this can lead to a considerable speedup.

Using the functional API

Apart from the TMPreproc class, tmtoolkit also provides several functions in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. Note that only the latter provides parallel processing.

Loading example data

Let’s load a sample of three documents from the built-in NewsArticles dataset. We’ll use only a small number of documents here so that the output stays easy to follow; we can switch to a larger sample later.

[1]:
import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus_small = Corpus.from_builtin_corpus('en-NewsArticles').sample(3)

Optional: enabling logging output

By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long running operations, it’s helpful to enable logging output display, which can be done as follows:

import logging

logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display; use e.g. logging.DEBUG for more detail
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True

Creating a TMPreproc object

You can create a TMPreproc object (also known as “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass our corpus_small object. We also need to specify the corpus language as two-letter ISO 639-1 language code (here "en" for English).

[2]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus_small, language='en')
preproc
[2]:
<TMPreproc [3 documents / en]>

The above will first distribute all documents to several sub-processes which are later used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes; it defaults to the number of CPU cores in your machine. Documents are distributed to the processes according to their size: e.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will take care of the large document alone and CPU 2 will take the three small documents. After distribution, the documents are directly tokenized (in parallel). Hence, when you have a large corpus, creating a TMPreproc object may take some time because of the tokenization process.
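If you want to limit the number of worker processes, you can set n_max_processes when creating the instance. A minimal sketch (the variable name preproc_two_workers is only used for illustration):

# use at most two worker processes for this instance
preproc_two_workers = TMPreproc(corpus_small, language='en', n_max_processes=2)
del preproc_two_workers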

Our TMPreproc object preproc is now set up to work with the documents passed in corpus_small and the language 'en' for English. All further operations with this object will use the specified documents and language. All documents are directly tokenized.

The method print_summary() is very handy and we will use it quite often. It displays a small summary of the documents in the TMPreproc object. N=... denotes the number of tokens in the respective document.

[3]:
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=657): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1947 / vocabulary size: 683
[3]:
<TMPreproc [3 documents / en]>

Accessing tokens, vocabulary and other important properties

TMPreproc provides several properties to access its data and some summary statistics. These properties are read-only, i.e. you can only retrieve the results but not assign new values to them.

First, let’s have a look at the labels (names) of the documents:

[4]:
preproc.doc_labels
[4]:
['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']

We can access the tokens of each document by using the tokens property:

[5]:
# use [:10] slice to show only the first 10 tokens
preproc.tokens['NewsArticles-1880'][:10]
[5]:
['White',
 'House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials']

If you prefer a tabular output, you can also access the tokens and their metadata as pandas DataFrames or datatable Frames.

A note on the use of datatable Frames

If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is usually faster and more memory-efficient. You can always convert between the two like this:

import datatable as dt
import pandas as pd

# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})

# DataFrame to datatable:
dtable = dt.Frame(df)

# and vice versa datatable to DataFrame:
df == dtable.to_pandas()

# Out:
#       a     b
# 0  True  True
# 1  True  True
# 2  True  True

Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.
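If you want to check this claim on your own machine, a rough timing comparison could look like the following sketch (the sample data and repetition count are arbitrary; results will vary):

import timeit

import datatable as dt
import pandas as pd

data = {'a': list(range(100000)), 'b': ['x'] * 100000}

# time direct DataFrame creation vs. creating a datatable Frame first and converting it
t_pandas = timeit.timeit(lambda: pd.DataFrame(data), number=20)
t_datatable = timeit.timeit(lambda: dt.Frame(data).to_pandas(), number=20)
print(t_pandas, t_datatable)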

You can use the tokens_dataframe or tokens_datatable properties for tabular output. The datatable Frame consists of at least five columns: the document label, the position of the token in the document (zero-indexed), the token itself, its lemma and a whitespace indicator. The lemma column contains the token’s lemma and whitespace indicates whether the token is followed by whitespace in the text. Note that for large amounts of data, tokens_datatable is usually quicker than tokens_dataframe.

[6]:
preproc.tokens_datatable
[6]:
      doc                position   token         lemma         whitespace
0     NewsArticles-1880  0          White         White         1
1     NewsArticles-1880  1          House         House         1
2     NewsArticles-1880  2          aides         aide          1
3     NewsArticles-1880  3          told          tell          1
4     NewsArticles-1880  4          to            to            1
5     NewsArticles-1880  5          keep          keep          1
6     NewsArticles-1880  6          Russia        Russia        0
7     NewsArticles-1880  7          -             -             0
8     NewsArticles-1880  8          related       relate        1
9     NewsArticles-1880  9          materials     material      0
10    NewsArticles-1880  10         \n\n          \n\n          0
11    NewsArticles-1880  11         Lawyers       Lawyers       1
12    NewsArticles-1880  12         for           for           1
13    NewsArticles-1880  13         the           the           1
14    NewsArticles-1880  14         Trump         Trump         1
...
1942  NewsArticles-99    1055       non           non           0
1943  NewsArticles-99    1056       -             -             0
1944  NewsArticles-99    1057       recyclable    recyclable    1
1945  NewsArticles-99    1058       items         item          0
1946  NewsArticles-99    1059       .             .             0

More columns may be shown when you add token metadata (more on that later).

The method get_tokens() gives you more options for accessing the tokens. For example, you can get all tokens with their metadata as nested dictionary in the form document label -> metadata key (e.g. “lemma”) -> metadata.

[7]:
doctokens = preproc.get_tokens(with_metadata=True, as_datatables=False)
doctokens['NewsArticles-1880'].keys()
[7]:
dict_keys(['token', 'lemma', 'whitespace'])
[8]:
# lemmata for the first 10 tokens in this document
doctokens['NewsArticles-1880']['lemma'][:10]
[8]:
['White',
 'House',
 'aide',
 'tell',
 'to',
 'keep',
 'Russia',
 '-',
 'relate',
 'material']

You may also want to access the reconstructed full text of each document via the texts property. This returns a dict that maps document labels to their text. Here we only display the first 100 characters of a single document:

[9]:
preproc.texts['NewsArticles-1880'][:100]
[9]:
'White House aides told to keep Russia-related materials\n\nLawyers for the Trump administration have i'

As mentioned in the beginning, tmtoolkit’s preprocessing module uses spaCy internally for most NLP tasks. If you want direct access to the spaCy documents, you can use the spacy_docs property. Here, we access a single spaCy document and check its is_tagged attribute:

[10]:
preproc.spacy_docs['NewsArticles-1880'].is_tagged
[10]:
False

You can also retrieve the document and token vectors from the word embeddings representation of the documents. For this, however, you need to create a TMPreproc instance with the argument enable_vectors=True:

[11]:
preproc_vec = TMPreproc(corpus_small, language='en', enable_vectors=True)
preproc_vec.vectors_enabled
[11]:
True

Now you may access the document vectors via the doc_vectors property:

[12]:
# displaying only the first 10 values of a single
# document's document vector
preproc_vec.doc_vectors['NewsArticles-1880'][:10]
[12]:
array([-7.0222005e-02,  8.1240870e-02, -3.9869484e-02,  1.8360456e-02,
        1.9232498e-02, -2.5533361e-02, -2.9136341e-02, -1.0187237e-01,
        1.6649088e-03,  2.4026785e+00], dtype=float32)

Token vectors are also available via the token_vectors property:

[13]:
# displaying only a single document's token matrix
preproc_vec.token_vectors['NewsArticles-1880']
[13]:
array([[-0.39347 , -0.061407,  0.015231, ...,  0.046462,  0.058398,
         0.46169 ],
       [ 0.19847 ,  0.18087 , -0.089119, ..., -0.24263 , -0.035183,
        -0.29661 ],
       [ 0.28059 , -0.45684 ,  0.414   , ..., -0.31501 , -0.31649 ,
        -0.026392],
       ...,
       [-0.08267 ,  0.092944,  0.028411, ...,  0.49965 , -0.17115 ,
         0.27578 ],
       [ 0.01327 ,  0.51269 , -0.35735 , ...,  0.19492 ,  0.058496,
         0.26636 ],
       [ 0.012001,  0.20751 , -0.12578 , ...,  0.13871 , -0.36049 ,
        -0.035   ]], dtype=float32)
[14]:
del preproc_vec

The following gives you the number of documents and the total number of tokens, respectively:

[15]:
preproc.n_docs
[15]:
3
[16]:
preproc.n_tokens
[16]:
1947

We can also access the number of tokens in each document via the doc_lengths property:

[17]:
# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']
[17]:
230

The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use the property vocabulary for that and the property vocabulary_counts to additionally get the number of times each token appears in the corpus.

[18]:
preproc.vocabulary[:10]  # displaying only the first 10 here
[18]:
['\n\n', ' ', '"', '%', "'", "'s", '(', ')', ',', '-']
[19]:
# number of unique tokens in all documents
preproc.vocabulary_size
[19]:
683
[20]:
# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']
[20]:
82

The latter returns a Python Counter object, so we can use its methods, e.g. to get the most frequent tokens:

[21]:
preproc.vocabulary_counts.most_common()[:10]
[21]:
[('the', 82),
 (',', 70),
 ('.', 60),
 ('to', 53),
 ('"', 50),
 ('and', 46),
 ('in', 39),
 ('a', 31),
 ('of', 25),
 ('that', 22)]

The document frequency of a token is the number of documents in which this token occurs at least once. The properties vocabulary_abs_doc_frequency and vocabulary_rel_doc_frequency return this measure as absolute frequency or proportion respectively:

[22]:
(preproc.vocabulary_abs_doc_frequency['Trump'],
 preproc.vocabulary_rel_doc_frequency['Trump'])
[22]:
(2, 0.6666666666666666)
[23]:
(preproc.vocabulary_abs_doc_frequency['Russia'],
 preproc.vocabulary_rel_doc_frequency['Russia'])
[23]:
(1, 0.3333333333333333)

Part-of-Speech (POS) tagging

Part-of-speech (POS) tagging determines the grammatical word category of each token in a document. The method pos_tag() applies this to the whole corpus. The detected POS tags are added as metadata to each token. They conform to a specific tagset, which is explained in the spaCy documentation. The POS tags can be used to annotate and filter the documents. Let’s apply POS tagging:

[24]:
preproc.pos_tag()
[24]:
<TMPreproc [3 documents / en]>

We can now see a new column pos with the found POS tag for each token:

[25]:
preproc.tokens_datatable
[25]:
      doc                position   token         lemma         pos     whitespace
0     NewsArticles-1880  0          White         White         PROPN   1
1     NewsArticles-1880  1          House         House         PROPN   1
2     NewsArticles-1880  2          aides         aide          NOUN    1
3     NewsArticles-1880  3          told          tell          VERB    1
4     NewsArticles-1880  4          to            to            PART    1
5     NewsArticles-1880  5          keep          keep          VERB    1
6     NewsArticles-1880  6          Russia        Russia        PROPN   0
7     NewsArticles-1880  7          -             -             PUNCT   0
8     NewsArticles-1880  8          related       relate        VERB    1
9     NewsArticles-1880  9          materials     material      NOUN    0
10    NewsArticles-1880  10         \n\n          \n\n          SPACE   0
11    NewsArticles-1880  11         Lawyers       lawyer        NOUN    1
12    NewsArticles-1880  12         for           for           ADP     1
13    NewsArticles-1880  13         the           the           DET     1
14    NewsArticles-1880  14         Trump         trump         ADJ     1
...
1942  NewsArticles-99    1055       non           non           ADJ     0
1943  NewsArticles-99    1056       -             -             ADJ     0
1944  NewsArticles-99    1057       recyclable    recyclable    ADJ     1
1945  NewsArticles-99    1058       items         item          NOUN    0
1946  NewsArticles-99    1059       .             .             PUNCT   0

Aside: TMPreproc as “state machine”

Before continuing, we should clarify that a TMPreproc instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:

corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens

# Out:
# {
#   'doc1': ['Hello', 'world', '!'],
#   'doc2': ['Another', 'example']
# }

preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }

As you can see, the tokens “inside” preproc are changed in place. After calling the method tokens_to_lowercase(), the tokens in preproc were transformed and the original tokens are not available anymore. In Python, assigning a mutable object to another variable only binds the same object to a different name; it doesn’t copy it. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning such an object to a different variable (say preproc_upper) gives us two names for the same object, and calling a method via one of these names changes the data seen through both.
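Continuing the small example from above, a short sketch illustrates this: plain assignment does not create a copy, so both names refer to the same object and changes are visible through both.

preproc_other = preproc          # no copy is made here
preproc_other is preproc         # -> True: both names refer to the same object

preproc_other.transform_tokens(str.upper)
preproc.tokens                   # "preproc" shows the transformed tokens, too

# Out:
# {
#   'doc1': ['HELLO', 'WORLD', '!'],
#   'doc2': ['ANOTHER', 'EXAMPLE']
# }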

Copying TMPreproc objects

What can we do about that? We need to copy the object, which can be done with the TMPreproc.copy() method. This way, we create another variable preproc_upper that points to a separate TMPreproc object.

[26]:
preproc_upper = preproc.copy()
[27]:
# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)
[27]:
(140426331677504, 140426727032000)
[28]:
preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary
[28]:
False
[29]:
# show a sample
preproc_upper.tokens['NewsArticles-1880'][:10]
[29]:
['WHITE',
 'HOUSE',
 'AIDES',
 'TOLD',
 'TO',
 'KEEP',
 'RUSSIA',
 '-',
 'RELATED',
 'MATERIALS']
[30]:
# the original "preproc" still holds the same data
preproc.tokens['NewsArticles-1880'][:10]
[30]:
['White',
 'House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials']

Note that this also uses up twice as much memory. So you shouldn’t create copies too often, and you should release unused objects with del:

[31]:
# removing the objects again
del preproc_upper

Lemmatization and term normalization

Before we start with token normalization, we will create a copy of the original TMPreproc object and its data, so that we can later use it for comparison:

[32]:
preproc_orig = preproc.copy()

Lemmatization reduces a token, if it is a word, to its base form. The lemma is already determined during tokenization and is available in the lemma metadata column. However, when you want to further process the tokens on the basis of their lemmata, you should use the lemmatize() method. This method sets the lemmata as tokens, and all further processing will happen on the lemmatized tokens:

[33]:
preproc.lemmatize()
preproc.tokens_datatable
[33]:
      doc                position   token         lemma         pos     whitespace
0     NewsArticles-1880  0          White         White         PROPN   1
1     NewsArticles-1880  1          House         House         PROPN   1
2     NewsArticles-1880  2          aide          aide          NOUN    1
3     NewsArticles-1880  3          tell          tell          VERB    1
4     NewsArticles-1880  4          to            to            PART    1
5     NewsArticles-1880  5          keep          keep          VERB    1
6     NewsArticles-1880  6          Russia        Russia        PROPN   0
7     NewsArticles-1880  7          -             -             PUNCT   0
8     NewsArticles-1880  8          relate        relate        VERB    1
9     NewsArticles-1880  9          material      material      NOUN    0
10    NewsArticles-1880  10         \n\n          \n\n          SPACE   0
11    NewsArticles-1880  11         lawyer        lawyer        NOUN    1
12    NewsArticles-1880  12         for           for           ADP     1
13    NewsArticles-1880  13         the           the           DET     1
14    NewsArticles-1880  14         trump         trump         ADJ     1
...
1942  NewsArticles-99    1055       non           non           ADJ     0
1943  NewsArticles-99    1056       -             -             ADJ     0
1944  NewsArticles-99    1057       recyclable    recyclable    ADJ     1
1945  NewsArticles-99    1058       item          item          NOUN    0
1946  NewsArticles-99    1059       .             .             PUNCT   0

As we can see, the lemma column was copied over to the token column.

Stemming

tmtoolkit doesn’t support stemming directly, since lemmatization is generally accepted as a better approach to bring different word forms of one word to a common base form. However, you may install NLTK and apply stemming by using the transform_tokens() method together with the stem() function.
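A minimal sketch, assuming NLTK is installed: you can pass NLTK's Snowball stemmer directly to transform_tokens(), since it only needs a function that maps one string to another (the variable names are just for illustration).

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

preproc_stemmed = preproc_orig.copy()
preproc_stemmed.transform_tokens(stemmer.stem)   # apply the stemmer to every token
preproc_stemmed.print_summary()
del preproc_stemmed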

Depending on how you further want to analyze the data, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).

If you want to remove certain characters from all tokens in your documents, you can use remove_chars_in_tokens() and pass it a sequence of characters to remove. There is also a shortcut remove_special_chars_in_tokens() which removes all “special characters” (by default all characters in string.punctuation).

[34]:
preproc.remove_chars_in_tokens(['-'])  # remove only "-"
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom ? Our bat...
total number of tokens: 1947 / vocabulary size: 596
[34]:
<TMPreproc [3 documents / en]>
[35]:
# remove all punctuation
preproc.remove_special_chars_in_tokens()
preproc.print_summary()   # the "?" also vanishes
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom Our bathr...
total number of tokens: 1947 / vocabulary size: 580
[35]:
<TMPreproc [3 documents / en]>

A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with tokens_to_lowercase():

[36]:
preproc.tokens_to_lowercase()
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): white house aide tell to keep russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): should you have two bin in your bathroom our bathr...
total number of tokens: 1947 / vocabulary size: 562
[36]:
<TMPreproc [3 documents / en]>

The method clean_tokens() finally applies several steps that remove tokens that meet certain criteria. This includes removing:

  • punctuation tokens

  • stopwords (very common words for the given language)

  • empty tokens (i.e. '')

  • tokens that are longer or shorter than a certain number of characters

  • numbers

Note that this is a language-dependent method, because the default stopword list is determined per language. This method has many parameters to tweak, so it’s recommended to check out the documentation.

[37]:
# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=130): white house aide tell keep russia relate material ...
> NewsArticles-3350 (N=309): frustration cabin electronic ban come force passen...
> NewsArticles-99 (N=486): bin bathroom bathroom fill shampoo bottle toilet r...
total number of tokens: 925 / vocabulary size: 469
[37]:
<TMPreproc [3 documents / en]>

Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

[38]:
preproc.doc_lengths, preproc_orig.doc_lengths
[38]:
({'NewsArticles-1880': 130, 'NewsArticles-3350': 309, 'NewsArticles-99': 486},
 {'NewsArticles-1880': 230, 'NewsArticles-3350': 657, 'NewsArticles-99': 1060})

We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

[39]:
len(preproc.vocabulary), len(preproc_orig.vocabulary)
[39]:
(469, 683)

You can also apply custom token transformations by using transform_tokens() and passing it a function that will be applied to each token in each document (hence it must accept a single string argument).

First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:

[40]:
def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('EU'), token_shape('CamelCase'), token_shape('lower')
[40]:
('XX', 'XxxxxXxxx', 'xxxxx')

We can now apply this function to our documents (we will use the original documents here, because they were not transformed to lower case):

[41]:
preproc = preproc_orig.copy() # swap instances for later

preproc_orig.transform_tokens(token_shape)   # apply function
preproc_orig.print_summary()

# remove instance
del preproc_orig
3 documents in language English:
> NewsArticles-1880 (N=230): Xxxxx Xxxxx xxxxx xxxx xx xxxx Xxxxxx x xxxxxxx xx...
> NewsArticles-3350 (N=657): Xxxxxxxxxxx xx xxxxx xxxxxxxxxxx xxx xxxxx xxxx xx...
> NewsArticles-99 (N=1060): Xxxxxx xxx xxxx xxx xxxx xx xxxx xxxxxxxx x xx Xxx...
total number of tokens: 1947 / vocabulary size: 32

Expanding compound words and joining tokens

Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compound_tokens(). However, depending on the language model, most of these compounds will already be separated on initial tokenization.

[42]:
orig_vocab = preproc.vocabulary
preproc.expand_compound_tokens()

# create set difference to show vocabulary tokens
# that were expanded
set(orig_vocab) - set(preproc.vocabulary)
[42]:
{'Source:-Al'}

It’s also possible to join together certain subsequent occurrences of tokens or token patterns. For example, you can transform all subsequent occurrences of the tokens “White” and “House” into single tokens “White_House”. In case you don’t use n-grams (described in a separate section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as a person, institution or a concept like “Climate Change”, as a single token. The method to use for this is glue_tokens(). It accepts the following parameters:

  • a patterns sequence of length N that is used to match the subsequent N tokens;

  • a glue string that is used to join the matched subsequent tokens (by default: "_").

Along with that, you can adjust the token matching with the common token matching parameters described below.

Let’s “glue” all subsequent occurrences of “White” and “House”. The glue_tokens() method will return a set of glued tokens that matched the provided pattern:

[43]:
preproc_orig = preproc.copy()  # make a copy of full orig. data for later use
preproc.glue_tokens(['White', 'House'])
[43]:
{'White_House'}
[44]:
preproc.tokens['NewsArticles-1880'][:20]
[44]:
['White_House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials',
 '\n\n',
 'Lawyers',
 'for',
 'the',
 'Trump',
 'administration',
 'have',
 'instructed',
 'White_House',
 'aides',
 'to']
[45]:
del preproc

Keywords-in-context (KWIC) and general filtering methods

Keyword-in-context (KWIC) searches allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after a keyword.

TMPreproc provides three methods for this purpose:

  • get_kwic() is the base method accepting a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;

  • get_kwic_table() is the more “user friendly” version of the above method, as it produces a datatable with the highlighted keyword by default;

  • filter_tokens_with_kwic() works similarly to the above functions but applies the result by filtering the documents accordingly; it is explained in the section on filtering below

Let’s see the KWIC methods in action:

[46]:
preproc = preproc_orig.copy()  # use orig. full data
preproc.get_kwic('house', ignore_case=True)
[46]:
{'NewsArticles-1880': [['White', 'House', 'aides', 'told'],
  ['instructed', 'White', 'House', 'aides', 'to'],
  ['The', 'White', 'House', 'is', 'simply'],
  ['the', 'White', 'House', 'and', 'law']],
 'NewsArticles-3350': [],
 'NewsArticles-99': [['of', 'the', 'house', ',', '"']]}

The method returns a dictionary that maps document labels to the KWIC results. Each document contains a list of “contexts”, i.e. a list of tokens that surround a keyword, here "house". This keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two tokens to the right (which may be less when the keyword is near the start or the end of a document).

We can see that NewsArticles-1880 contains four contexts, NewsArticles-99 one context and NewsArticles-3350 none.

With get_kwic_table(), we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as *house* and empty results are removed:

[47]:
preproc.get_kwic_table('house', ignore_case=True)
[47]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told
1     NewsArticles-1880  1         instructed White *House* aides to
2     NewsArticles-1880  2         The White *House* is simply
3     NewsArticles-1880  3         the White *House* and law
4     NewsArticles-99    0         of the *house* , "

An important parameter is context_size. It determines the number of tokens to display to the left and right of the matched keyword. You can either pass a single integer for a symmetric context or a tuple of integers (<left>, <right>):

[48]:
preproc.get_kwic_table('house', ignore_case=True, context_size=4)
[48]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told to keep
1     NewsArticles-1880  1         administration have instructed White *House* aides…
2     NewsArticles-1880  2         . " The White *House* is simply taking proactive
3     NewsArticles-1880  3         Democrats to the White *House* and law enforcement…
4     NewsArticles-99    0         other rooms of the *house* , " says Jonny
[49]:
preproc.get_kwic_table('house', ignore_case=True, context_size=(1, 4))
[49]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told to keep
1     NewsArticles-1880  1         White *House* aides to preserve any
2     NewsArticles-1880  2         White *House* is simply taking proactive
3     NewsArticles-1880  3         White *House* and law enforcement agencies
4     NewsArticles-99    0         the *house* , " says Jonny

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact (but case insensitive) matches between the corpus tokens and our keyword "house". However, it is also possible to match patterns like "new*" (matches any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section gives an introduction on the different options for pattern matching.

Common parameters for pattern matching functions

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

  • search_token or search_tokens: one or more search patterns given as strings

  • match_type: sets the matching type and can be one of the following options:

      • 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching

      • 'regex': uses regular expression matching

      • 'glob': uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see the globre package)

  • ignore_case: ignore character case (applies to all three match types)

  • glob_method: if match_type is 'glob', use this glob method; must be 'match' or 'search' (behaving similarly to Python’s re.match or re.search)

  • inverse: invert the match results, i.e. if matching for “hello”, return all results that do not match “hello”

Let’s try out some of these options with get_kwic_table():

[50]:
# using a regular expression, ignoring case
preproc.get_kwic_table(r'agenc(y|ies)', match_type='regex', ignore_case=True)
[50]:
      doc                context   kwic
0     NewsArticles-1880  0         law enforcement *agencies* to keep
1     NewsArticles-1880  1         organizations , *agencies* and individuals
2     NewsArticles-3350  0         Reuters news *agency* . Al
3     NewsArticles-3350  1         and news *agencies*
[51]:
# using a glob, ignoring case
preproc.get_kwic_table('pol*', match_type='glob', ignore_case=True)
[51]:
      doc                context   kwic
0     NewsArticles-1880  0         false and *politically* motivated attacks
1     NewsArticles-99    0         , senior *policy* adviser for
[52]:
# using a glob, ignoring case
preproc.get_kwic_table('*sol*', match_type='glob', ignore_case=True)
[52]:
      doc                context   kwic
0     NewsArticles-99    0         potential simple *solution* that could
1     NewsArticles-99    1         confused by *aerosols* . "
2     NewsArticles-99    2         bottles , *aerosols* for deodorant
[53]:
# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
preproc.get_kwic_table(r'[AEIOUaeiou]', match_type='regex', inverse=True)
[53]:
      doc                context   kwic
0     NewsArticles-1880  0         keep Russia *-* related materials
1     NewsArticles-1880  1         related materials * * Lawyers for
2     NewsArticles-1880  2         in the *2016* presidential election
3     NewsArticles-1880  3         related investigations *,* ABC News
4     NewsArticles-1880  4         has confirmed *.* " The
5     NewsArticles-1880  5         confirmed . *"* The White
6     NewsArticles-1880  6         motivated attacks *,* " an
7     NewsArticles-1880  7         attacks , *"* an administration
8     NewsArticles-1880  8         News Wednesday *.* The directive
9     NewsArticles-1880  9         last week *by* Senate Democrats
10    NewsArticles-1880  10        between Trump *'s* administration ,
11    NewsArticles-1880  11        's administration *,* campaign and
12    NewsArticles-1880  12        transition teams *"* ? or
13    NewsArticles-1880  13        teams " *?* or anyone
14    NewsArticles-1880  14        their behalf *"* ? and
...
265   NewsArticles-99    147       two bins *?* There are
266   NewsArticles-99    148       other options *.* Hang a
267   NewsArticles-99    149       recycling bin *.* Or opt
268   NewsArticles-99    150       and non *-* recyclable items
269   NewsArticles-99    151       recyclable items *.*

Filtering tokens and documents

We can use the pattern matching parameters in numerous filtering methods. The heart of many of these methods is token_match(). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a binary NumPy array of the same length as the input tokens. Each occurrence of True in this binary array signals a match.

[54]:
from tmtoolkit.preprocess import token_match

# first 10 tokens of document "NewsArticles-1880"
doc_snippet = preproc.tokens['NewsArticles-1880'][:10]
# get all tokens that match "to*"
matches = token_match('to*', doc_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc_snippet, matches):
    print(tok, ':', match)
White : False
House : False
aides : False
told : True
to : True
keep : False
Russia : False
- : False
related : False
materials : False

The token_match() function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.

The following methods cover common use cases for filtering during text preprocessing. Many of these methods start with either filter_...() or remove_...(), and these pairs of filter and remove methods are complements: a filter method always retains the matched elements whereas a remove method always drops the matched elements. We can observe that with the first pair of methods, filter_tokens() and remove_tokens():

So much .copy()

Note that the following code snippets make a lot of use of the copy() method. This is because we want to show how the different methods work on the same original data (remember that a TMPreproc instance behaves like a state machine) and we also want to “clean up” the temporary instances. Under normal circumstances, you wouldn’t use copy() so excessively.

[55]:
# retain only the tokens that match the pattern in each document
preproc.filter_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
3 documents in language English:
> NewsArticles-1880 (N=4): House House House House
> NewsArticles-3350 (N=0):
> NewsArticles-99 (N=3): house greenhouse household
total number of tokens: 7 / vocabulary size: 4
[56]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
3 documents in language English:
> NewsArticles-1880 (N=226): White aides told to keep Russia - related material...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1057): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1941 / vocabulary size: 679

The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents based on the supplied match criteria. Both accept the standard pattern matching parameters as well as a parameter matches_threshold with default value 1. When this number of matching tokens is reached, the document is kept in the result set (filter_documents()) or removed from it (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.

Let’s try these methods out in practice:

[57]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485

We can see that two out of three documents contained the pattern '*house*' and hence were retained.

We can also adjust matches_threshold to set the minimum number of token matches for filtering:

[58]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True,
                         matches_threshold=4)
preproc.print_summary()

del preproc
1 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
total number of tokens: 230 / vocabulary size: 140
[59]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
1 documents in language English:
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 658 / vocabulary size: 288

When we use remove_documents() we get only the documents that did not contain the specified pattern.

Another useful pair of methods is filter_documents_by_name() and remove_documents_by_name(). Both methods again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:

[60]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents_by_name(r'-\d{4}$', match_type='regex')
preproc.print_summary()

del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 888 / vocabulary size: 385

In the above example we wanted to retain only the documents whose document labels ended with exactly 4 digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() will do the exact opposite.

You may also use keywords-in-context (KWIC) to filter your tokens down to the neighborhood around certain keyword patterns. The method for this is called filter_tokens_with_kwic() and works very similarly to get_kwic(), but it filters the documents in the TMPreproc instance, with which you can then continue working as usual. Here, we filter the tokens in each document to get only the tokens directly before and after the glob pattern '*house*' (context_size=1):

[61]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_tokens_with_kwic('*house*', context_size=1,
                                match_type='glob', ignore_case=True)
preproc.tokens_datatable
[61]:
      doc                position   token         lemma         whitespace
0     NewsArticles-1880  0          White         White         1
1     NewsArticles-1880  1          House         House         1
2     NewsArticles-1880  2          aides         aide          1
3     NewsArticles-1880  3          White         White         1
4     NewsArticles-1880  4          House         House         1
5     NewsArticles-1880  5          aides         aide          1
6     NewsArticles-1880  6          White         White         1
7     NewsArticles-1880  7          House         House         1
8     NewsArticles-1880  8          is            be            1
9     NewsArticles-1880  9          White         White         1
10    NewsArticles-1880  10         House         House         1
11    NewsArticles-1880  11         and           and           1
12    NewsArticles-99    0          the           the           1
13    NewsArticles-99    1          house         house         0
14    NewsArticles-99    2          ,             ,             0
15    NewsArticles-99    3          of            of            1
16    NewsArticles-99    4          greenhouse    greenhouse    1
17    NewsArticles-99    5          gases         gas           1
18    NewsArticles-99    6          UK            UK            1
19    NewsArticles-99    7          household     household     1
20    NewsArticles-99    8          threw         throw         1

Once you have annotated your documents’ tokens with part-of-speech (POS) tags, you can also filter them using filter_for_pos():

[62]:
del preproc

preproc = preproc_orig.copy()  # make a copy from full data

# apply POS tagging and retain only nouns
preproc.pos_tag().filter_for_pos('N').tokens_datatable
[62]:
      doc                position   token
[63]:
del preproc

In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the method documentation). We can also filter for more than one tag, e.g. nouns or verbs by passing a list of required POS tags.

filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.
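For example, to drop all nouns instead of keeping them, you could invert the filter like in the following sketch (run on a fresh copy, so the other examples are not affected):

preproc_no_nouns = preproc_orig.copy()

# POS tag the copy and remove all nouns by inverting the filter
preproc_no_nouns.pos_tag().filter_for_pos('N', inverse=True)
preproc_no_nouns.print_summary()

del preproc_no_nouns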

Finally, there are two methods for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens with a document frequency greater than or equal to a certain threshold defined by the parameter df_threshold. The latter does the same for all tokens with a document frequency lower than or equal to df_threshold. This parameter can either be a relative frequency (default) or an absolute count (by setting absolute=True).

Before applying the method, let’s have a look at the number of tokens per document again, to later see how many we will remove. We will also store the vocabulary in orig_vocab for later comparison:

[64]:
preproc = preproc_orig.copy()  # make a copy from full data
orig_vocab = preproc.vocabulary
preproc.doc_lengths
[64]:
{'NewsArticles-1880': 230, 'NewsArticles-3350': 658, 'NewsArticles-99': 1060}
[65]:
preproc.remove_common_tokens(df_threshold=0.9).doc_lengths
[65]:
{'NewsArticles-1880': 144, 'NewsArticles-3350': 413, 'NewsArticles-99': 700}

By removing all tokens with a document frequency threshold of 0.9, we removed quite a number of tokens in each document. Let’s investigate the vocabulary in order to see which tokens were removed:

[66]:
# set difference gives removed vocabulary tokens
set(orig_vocab) - set(preproc.vocabulary)
[66]:
{'\n\n',
 '"',
 "'s",
 ',',
 '-',
 '.',
 '?',
 'The',
 'a',
 'all',
 'also',
 'an',
 'and',
 'be',
 'for',
 'has',
 'have',
 'in',
 'into',
 'is',
 'more',
 'of',
 'on',
 'or',
 'other',
 'such',
 'than',
 'that',
 'the',
 'to',
 'which',
 'with'}
[67]:
del preproc

remove_uncommon_tokens() works similarly. This time, let’s use an absolute number as the threshold:

[68]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_uncommon_tokens(df_threshold=1, absolute=True)

# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(set(orig_vocab) - set(preproc.vocabulary))[:10]
[68]:
[' ', '%', '(', ')', '10', '12', '135,000', '2016', '38', '45']

This means that we removed all tokens that appear in only a single document.

[69]:
del preproc

Working with token metadata

TMPreproc allows you to attach arbitrary metadata to each token in each document. Such token “annotations” are very useful. For example, you may record a token’s length or whether it consists only of uppercase letters and later use that for filtering or further analyses. One method to add such metadata is add_metadata_per_doc(). It requires a dict that maps document labels to the respective lists of token metadata; each list’s length must match the number of tokens in the respective document. First we need to create such a metadata dict. Let’s do that for the token lengths:

[70]:
preproc = preproc_orig.copy()  # make a copy from full data

meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc.tokens.items()}

# show the first 10 tokens and their string lengths for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
         meta_tok_lengths['NewsArticles-1880'][:10]))
[70]:
[('White', 5),
 ('House', 5),
 ('aides', 5),
 ('told', 4),
 ('to', 2),
 ('keep', 4),
 ('Russia', 6),
 ('-', 1),
 ('related', 7),
 ('materials', 9)]

We can now add this metadata via add_metadata_per_doc(). We pass the metadata key ('length') and the previously generated metadata dict:

[71]:
preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore

The property tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):

[72]:
preproc.tokens_datatable
[72]:
      doc                position   token         lemma         whitespace  meta_length
0     NewsArticles-1880  0          White         White         1           5
1     NewsArticles-1880  1          House         House         1           5
2     NewsArticles-1880  2          aides         aide          1           5
3     NewsArticles-1880  3          told          tell          1           4
4     NewsArticles-1880  4          to            to            1           2
5     NewsArticles-1880  5          keep          keep          1           4
6     NewsArticles-1880  6          Russia        Russia        0           6
7     NewsArticles-1880  7          -             -             0           1
8     NewsArticles-1880  8          related       relate        1           7
9     NewsArticles-1880  9          materials     material      0           9
10    NewsArticles-1880  10         \n\n          \n\n          0           2
11    NewsArticles-1880  11         Lawyers       lawyer        1           7
12    NewsArticles-1880  12         for           for           1           3
13    NewsArticles-1880  13         the           the           1           3
14    NewsArticles-1880  14         Trump         trump         1           5
...
1943  NewsArticles-99    1055       non           non           0           3
1944  NewsArticles-99    1056       -             -             0           1
1945  NewsArticles-99    1057       recyclable    recyclable    1           10
1946  NewsArticles-99    1058       items         item          0           5
1947  NewsArticles-99    1059       .             .             0           1

Let’s add a boolean indicator for whether the given token is all uppercase:

[73]:
meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper

preproc.tokens_datatable
[73]:
      doc                position   token         lemma         whitespace  meta_length  meta_upper
0     NewsArticles-1880  0          White         White         1           5            0
1     NewsArticles-1880  1          House         House         1           5            0
2     NewsArticles-1880  2          aides         aide          1           5            0
3     NewsArticles-1880  3          told          tell          1           4            0
4     NewsArticles-1880  4          to            to            1           2            0
5     NewsArticles-1880  5          keep          keep          1           4            0
6     NewsArticles-1880  6          Russia        Russia        0           6            0
7     NewsArticles-1880  7          -             -             0           1            0
8     NewsArticles-1880  8          related       relate        1           7            0
9     NewsArticles-1880  9          materials     material      0           9            0
10    NewsArticles-1880  10         \n\n          \n\n          0           2            0
11    NewsArticles-1880  11         Lawyers       lawyer        1           7            0
12    NewsArticles-1880  12         for           for           1           3            0
13    NewsArticles-1880  13         the           the           1           3            0
14    NewsArticles-1880  14         Trump         trump         1           5            0
...
1943  NewsArticles-99    1055       non           non           0           3            0
1944  NewsArticles-99    1056       -             -             0           1            0
1945  NewsArticles-99    1057       recyclable    recyclable    1           10           0
1946  NewsArticles-99    1058       items         item          0           5            0
1947  NewsArticles-99    1059       .             .             0           1            0

You may use these newly added columns now for example for filtering the datatable:

[74]:
import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1,:]
[74]:
      doc                position   token    lemma     whitespace  meta_length  meta_upper
0     NewsArticles-1880  43         ABC      ABC       1           3            1
1     NewsArticles-1880  73         ABC      ABC       1           3            1
2     NewsArticles-1880  213        U.S.     U.S.      1           4            1
3     NewsArticles-3350  11         US       US        0           2            1
4     NewsArticles-3350  13         UK       UK        1           2            1
5     NewsArticles-3350  34         US       US        1           2            1
6     NewsArticles-3350  98         US       US        1           2            1
7     NewsArticles-3350  106        US       US        1           2            1
8     NewsArticles-3350  134        UAE      UAE       1           3            1
9     NewsArticles-3350  153        READ     READ      1           4            1
10    NewsArticles-3350  154        MORE     MORE      0           4            1
11    NewsArticles-3350  273        US       US        1           2            1
12    NewsArticles-3350  346        READ     READ      1           4            1
13    NewsArticles-3350  347        MORE     MORE      0           4            1
14    NewsArticles-3350  349        US       US        1           2            1
15    NewsArticles-3350  358        US       US        1           2            1
16    NewsArticles-3350  454        I        -PRON-    1           1            1
17    NewsArticles-3350  480        UK       UK        1           2            1
18    NewsArticles-3350  502        UK       UK        1           2            1
19    NewsArticles-3350  506        UAE      UAE       1           3            1
20    NewsArticles-3350  529        UAE      UAE       1           3            1
21    NewsArticles-3350  570        US       US        1           2            1
22    NewsArticles-3350  637        US       US        1           2            1
23    NewsArticles-99    376        UK       UK        1           2            1
24    NewsArticles-99    711        A        a         1           1            1
25    NewsArticles-99    955        UK       UK        1           2            1
26    NewsArticles-99    995        M25      M25       1           3            1

To see which metadata keys were already created, you can use get_available_metadata_keys():

[75]:
preproc.get_available_metadata_keys()
[75]:
{'lemma', 'length', 'upper', 'whitespace'}

Token metadata can be removed with remove_metadata():

[76]:
preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()
[76]:
{'lemma', 'length', 'whitespace'}
[77]:
preproc.tokens_datatable
[77]:
      doc                position   token         lemma         whitespace  meta_length
0     NewsArticles-1880  0          White         White         1           5
1     NewsArticles-1880  1          House         House         1           5
2     NewsArticles-1880  2          aides         aide          1           5
3     NewsArticles-1880  3          told          tell          1           4
4     NewsArticles-1880  4          to            to            1           2
5     NewsArticles-1880  5          keep          keep          1           4
6     NewsArticles-1880  6          Russia        Russia        0           6
7     NewsArticles-1880  7          -             -             0           1
8     NewsArticles-1880  8          related       relate        1           7
9     NewsArticles-1880  9          materials     material      0           9
10    NewsArticles-1880  10         \n\n          \n\n          0           2
11    NewsArticles-1880  11         Lawyers       lawyer        1           7
12    NewsArticles-1880  12         for           for           1           3
13    NewsArticles-1880  13         the           the           1           3
14    NewsArticles-1880  14         Trump         trump         1           5
...
1943  NewsArticles-99    1055       non           non           0           3
1944  NewsArticles-99    1056       -             -             0           1
1945  NewsArticles-99    1057       recyclable    recyclable    1           10
1946  NewsArticles-99    1058       items         item          0           5
1947  NewsArticles-99    1059       .             .             0           1

We can tell filter_tokens() and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata meta_length, which we created before, to filter for tokens of a certain length:

[78]:
preproc_meta_example = preproc.copy()
preproc_meta_example.filter_tokens(3, by_meta='length')
preproc_meta_example.tokens_datatable
[78]:
      doc                position   token    lemma    whitespace  meta_length
0     NewsArticles-1880  0          for      for      1           3
1     NewsArticles-1880  1          the      the      1           3
2     NewsArticles-1880  2          any      any      1           3
3     NewsArticles-1880  3          the      the      1           3
4     NewsArticles-1880  4          and      and      1           3
5     NewsArticles-1880  5          ABC      ABC      1           3
6     NewsArticles-1880  6          has      have     1           3
7     NewsArticles-1880  7          The      the      1           3
8     NewsArticles-1880  8          and      and      1           3
9     NewsArticles-1880  9          ABC      ABC      1           3
10    NewsArticles-1880  10         The      the      1           3
11    NewsArticles-1880  11         the      the      1           3
12    NewsArticles-1880  12         and      and      1           3
13    NewsArticles-1880  13         law      law      1           3
14    NewsArticles-1880  14         all      all      1           3
...
335   NewsArticles-99    186        for      for      1           3
336   NewsArticles-99    187        bin      bin      1           3
337   NewsArticles-99    188        can      can      1           3
338   NewsArticles-99    189        and      and      1           3
339   NewsArticles-99    190        non      non      0           3
[79]:
del preproc_meta_example

Note that all matching options then apply to the metadata column, in this case the meta_length column, which contains integers. Since filter_tokens() by default employs exact matching, we get all tokens whose meta_length equals the first argument, 3. Regular expression or glob matching would fail here, because these can only be applied to string data.

If you want to use more complex filter queries, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps a document label to a sequence of booleans. For all occurrences of True, the respective token in the document will be retained, all others will be removed. Let’s try that out with a small sample:

[80]:
preproc.pos_tag().tokens_datatable
[80]:
      doc                position   token         lemma         pos     whitespace  meta_length
0     NewsArticles-1880  0          White         White         PUNCT   1           5
1     NewsArticles-1880  1          House         House         PUNCT   1           5
2     NewsArticles-1880  2          aides         aide          PUNCT   1           5
3     NewsArticles-1880  3          told          tell          PUNCT   1           4
4     NewsArticles-1880  4          to            to            PUNCT   1           2
5     NewsArticles-1880  5          keep          keep          PUNCT   1           4
6     NewsArticles-1880  6          Russia        Russia        PUNCT   0           6
7     NewsArticles-1880  7          -             -             PUNCT   0           1
8     NewsArticles-1880  8          related       relate        PUNCT   1           7
9     NewsArticles-1880  9          materials     material      PUNCT   0           9
10    NewsArticles-1880  10         \n\n          \n\n          PUNCT   0           2
11    NewsArticles-1880  11         Lawyers       lawyer        PUNCT   1           7
12    NewsArticles-1880  12         for           for           PUNCT   1           3
13    NewsArticles-1880  13         the           the           PUNCT   1           3
14    NewsArticles-1880  14         Trump         trump         PUNCT   1           5
...
1943  NewsArticles-99    1055       non           non           PUNCT   0           3
1944  NewsArticles-99    1056       -             -             PUNCT   0           1
1945  NewsArticles-99    1057       recyclable    recyclable    PUNCT   1           10
1946  NewsArticles-99    1058       items         item          PUNCT   0           5
1947  NewsArticles-99    1059       .             .             PUNCT   0           1

We now generate the filter mask, which means for each document we create a boolean list or array that for each token in that document indicates whether that token should be kept or removed.

We will iterate through the tokens_with_metadata property, which is a dict that for each document contains a datatable with its tokens and metadata. Let’s have a look at the first document’s datatable:

[81]:
next(iter(preproc.tokens_with_metadata.values()))
[81]:
      token          lemma          pos     whitespace  meta_length
0     White          White          PUNCT   1           5
1     House          House          PUNCT   1           5
2     aides          aide           PUNCT   1           5
3     told           tell           PUNCT   1           4
4     to             to             PUNCT   1           2
5     keep           keep           PUNCT   1           4
6     Russia         Russia         PUNCT   0           6
7     -              -              PUNCT   0           1
8     related        relate         PUNCT   1           7
9     materials      material       PUNCT   0           9
10    \n\n           \n\n           PUNCT   0           2
11    Lawyers        lawyer         PUNCT   1           7
12    for            for            PUNCT   1           3
13    the            the            PUNCT   1           3
14    Trump          trump          PUNCT   1           5
...
225   during         during         PUNCT   1           6
226   his            -PRON-         X       1           3
227   confirmation   confirmation   PUNCT   1           12
228   hearing        hearing        PUNCT   0           7
229   .              .              PUNCT   0           1

Now we can create the filter mask:

[82]:
import numpy as np

filter_mask = {}
for doc_label, doc_data in preproc.tokens_with_metadata.items():
    # extract the columns "meta_length" and "pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())

    # create a boolean array for nouns with token length less or equal 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.isin(tok_pos, ['NOUN', 'PROPN'])

# it's not necessary to add the filter mask as metadata
# but it's a good way to check the mask
preproc.add_metadata_per_doc('small_nouns', filter_mask)
preproc.tokens_datatable
[82]:
      doc                position   token         lemma         pos     whitespace  meta_length  meta_small_nouns
0     NewsArticles-1880  0          White         White         PUNCT   1           5            0
1     NewsArticles-1880  1          House         House         PUNCT   1           5            0
2     NewsArticles-1880  2          aides         aide          PUNCT   1           5            0
3     NewsArticles-1880  3          told          tell          PUNCT   1           4            0
4     NewsArticles-1880  4          to            to            PUNCT   1           2            0
5     NewsArticles-1880  5          keep          keep          PUNCT   1           4            0
6     NewsArticles-1880  6          Russia        Russia        PUNCT   0           6            0
7     NewsArticles-1880  7          -             -             PUNCT   0           1            0
8     NewsArticles-1880  8          related       relate        PUNCT   1           7            0
9     NewsArticles-1880  9          materials     material      PUNCT   0           9            0
10    NewsArticles-1880  10         \n\n          \n\n          PUNCT   0           2            0
11    NewsArticles-1880  11         Lawyers       lawyer        PUNCT   1           7            0
12    NewsArticles-1880  12         for           for           PUNCT   1           3            0
13    NewsArticles-1880  13         the           the           PUNCT   1           3            0
14    NewsArticles-1880  14         Trump         trump         PUNCT   1           5            0
...
1943  NewsArticles-99    1055       non           non           PUNCT   0           3            0
1944  NewsArticles-99    1056       -             -             PUNCT   0           1            0
1945  NewsArticles-99    1057       recyclable    recyclable    PUNCT   1           10           0
1946  NewsArticles-99    1058       items         item          PUNCT   0           5            0
1947  NewsArticles-99    1059       .             .             PUNCT   0           1            0

Finally, we can pass the mask dict to filter_tokens_by_mask():

[83]:
preproc.filter_tokens_by_mask(filter_mask)
preproc.tokens_datatable
[83]:
      doc                position   token

Generating n-grams

So far, we worked with unigrams, i.e. each document consisted of a sequence of discrete tokens. We can also generate n-grams from our corpus, so that each document consists of a sequence of n-grams, i.e. chunks of n subsequent tokens. An example would be:

Document: “This is a simple example.”

n=1 (unigrams):

['This', 'is', 'a', 'simple', 'example', '.']

n=2 (bigrams):

['This is', 'is a', 'a simple', 'simple example', 'example .']

n=3 (trigrams):

['This is a', 'is a simple', 'a simple example', 'simple example .']

The method generate_ngrams() allows us to generate n-grams from tokenized documents. We can then get the results with the ngrams property:

[84]:
del preproc

preproc = preproc_orig.copy()  # make a copy from full data

preproc.generate_ngrams(2)  # generate bigrams
preproc.ngrams['NewsArticles-1880'][:10]  # show first 10 bigrams of this document
[84]:
[['White', 'House'],
 ['House', 'aides'],
 ['aides', 'told'],
 ['told', 'to'],
 ['to', 'keep'],
 ['keep', 'Russia'],
 ['Russia', '-'],
 ['-', 'related'],
 ['related', 'materials'],
 ['materials', 'Lawyers']]

You may afterwards use join_ngrams() to merge the generated n-grams into joined tokens and use these as the new tokens in this TMPreproc instance:

[85]:
preproc.join_ngrams()
preproc.tokens_datatable
[85]:
      doc                position   token                   lemma                   whitespace
0     NewsArticles-1880  0          White House             White House             1
1     NewsArticles-1880  1          House aides             House aides             1
2     NewsArticles-1880  2          aides told              aides told              1
3     NewsArticles-1880  3          told to                 told to                 1
4     NewsArticles-1880  4          to keep                 to keep                 1
5     NewsArticles-1880  5          keep Russia             keep Russia             1
6     NewsArticles-1880  6          Russia -                Russia -                1
7     NewsArticles-1880  7          - related               - related               1
8     NewsArticles-1880  8          related materials       related materials       1
9     NewsArticles-1880  9          materials Lawyers       materials Lawyers       1
10    NewsArticles-1880  10         Lawyers for             Lawyers for             1
11    NewsArticles-1880  11         for the                 for the                 1
12    NewsArticles-1880  12         the Trump               the Trump               1
13    NewsArticles-1880  13         Trump administration    Trump administration    1
14    NewsArticles-1880  14         administration have     administration have     1
...
1934  NewsArticles-99    1052       and non                 and non                 1
1935  NewsArticles-99    1053       non -                   non -                   1
1936  NewsArticles-99    1054       - recyclable            - recyclable            1
1937  NewsArticles-99    1055       recyclable items        recyclable items        1
1938  NewsArticles-99    1056       items .                 items .                 1
[86]:
del preproc

Generating a sparse document-term matrix (DTM)

If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which represents the number of occurrences of each term (i.e. vocabulary token) in each document. This is an N rows by M columns matrix, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.

For this example, we’ll generate a DTM from the preproc_orig instance. First, let’s check the number of documents and the vocabulary size:

[87]:
preproc_orig.n_docs, preproc_orig.vocabulary_size
[87]:
(3, 683)

We can use the dtm property to generate a sparse DTM from the current instance:

[88]:
preproc_orig.dtm
[88]:
<3x683 sparse matrix of type '<class 'numpy.int32'>'
        with 816 stored elements in Compressed Sparse Row format>

We can see that a sparse matrix with 3 rows (which corresponds with the number of documents) and 683 columns was generated (which corresponds to the vocabulary size). 816 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. dense, representation and see parts of its elements:

[89]:
preproc_orig.dtm.todense()
[89]:
matrix([[ 1,  0,  4, ...,  0,  0,  0],
        [ 2,  1, 14, ...,  0,  3,  0],
        [ 2,  0, 32, ...,  2,  5,  5]], dtype=int32)

However, note that you should only convert a sparse matrix to a dense representation when you’re either dealing with a small amount of data (which is what we’re doing in this example), or use only a part of the full matrix. Converting a sparse matrix to a dense representation can otherwise easily exceed the available computer memory.

There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in Compressed Sparse Row (CSR) format. This format allows indexing and is especially optimized for fast row access. You may convert it to any other sparse matrix format; see the mentioned SciPy documentation for this.
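For example, the CSR matrix returned by the dtm property can be converted to other SciPy sparse formats with the usual SciPy methods (a short sketch):

dtm_csr = preproc_orig.dtm   # Compressed Sparse Row format (default)

dtm_csc = dtm_csr.tocsc()    # Compressed Sparse Column: fast column slicing
dtm_coo = dtm_csr.tocoo()    # COOrdinate format: useful for constructing / converting matrices
type(dtm_csc), type(dtm_coo)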

The rows of the DTM are aligned to the sequence of the document labels and its columns are aligned to the vocabulary. For example, let’s find the frequency of the term “House” in the document “NewsArticles-1880”. To do this, we find out the indices into the matrix:

[90]:
preproc_orig.doc_labels.index('NewsArticles-1880')
[90]:
0
[91]:
preproc_orig.vocabulary.index('House')
[91]:
67

This means the frequency of the term “House” in the document “NewsArticles-1880” is located in row 0 and column 67 of the DTM:

[92]:
preproc_orig.dtm[0, 67]
[92]:
4

See also the following example of finding out the index for “administration” and then getting an array that represents the number of occurrences of this token across all three documents:

[93]:
vocab_admin_ix = preproc_orig.vocabulary.index('administration')
preproc_orig.dtm[:, vocab_admin_ix].todense()
[93]:
matrix([[4],
        [1],
        [0]], dtype=int32)

Apart from the dtm property, there’s also the get_dtm() method, which can also return the result as a datatable Frame or pandas DataFrame. Note that these representations are not sparse and hence can consume a lot of memory.

[94]:
preproc_orig.get_dtm(as_datatable=True)
DatatableWarning: Duplicate column name found, and was assigned a unique name: '.' -> '.0'
[94]:
      _doc               \n\n  ' '  "   %   '   's  (   )   ,   ...  work  world  would  you  your
0     NewsArticles-1880  1     0    4   0   1   3   0   0   9   ...  0     0      0      0    0
1     NewsArticles-3350  2     1    14  0   1   6   0   0   28  ...  0     1      0      3    0
2     NewsArticles-99    2     0    32  5   0   3   2   2   33  ...  1     0      2      5    5

Serialization: Saving and loading TMPreproc objects

The current state of a TMPreproc object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it using that file. The methods for that are save_state() and load_state() / from_state().

Let’s store the current state of the preproc_orig instance:

[95]:
preproc_orig.print_summary()
preproc_orig.save_state('data/preproc_state.pickle')
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[95]:
<TMPreproc [3 documents / en]>

Let’s change the object by retaining only documents that contain the token “house” (see the reduced number of documents):

[96]:
preproc_orig.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc_orig.print_summary()
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485
[96]:
<TMPreproc [2 documents / en]>

We can restore the saved data using from_state():

[97]:
preproc_restored = TMPreproc.from_state('data/preproc_state.pickle')
preproc_restored.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[97]:
<TMPreproc [3 documents / en]>

You can see that the full dataset with three documents was restored.

This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. Once you’re finished running these operations, you can store the current state to disk and later restore it without having to re-run them.

Functional API

The TMPreproc class provides a convenient object-oriented interface for parallel text processing and analysis. There is also a functional API provided in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. The functional API does not provide parallel processing.

To initialize the functional API for a certain language, you need to start with init_for_language() and may then tokenize your raw text documents via tokenize(), which will generate a list of spaCy documents. Most other functions in this API accept such a list of spaCy documents as input.

from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')
docs = tokenize(['Hello this is a test.', 'And here comes another one.'])

The final result after applying preprocessing steps and hence transforming the text data is often a document-term matrix (DTM). The bow module contains several functions to work with DTMs, e.g. apply transformations such as tf-idf or compute some important summary statistics. The next chapter will introduce some of these functions.
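As a small preview, and assuming the tfidf function in tmtoolkit.bow.bow_stats (check the bow module documentation for the exact names and signatures), applying a tf-idf transformation to the DTM from above could look like this sketch:

from tmtoolkit.bow.bow_stats import tfidf

dtm = preproc_restored.dtm   # sparse DTM from the restored instance above
dtm_tfidf = tfidf(dtm)       # tf-idf weighted matrix of the same shape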