Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that subsequent analysis methods can be applied more easily, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.

Parallel processing with the TMPreproc class

You can pass a dict-like dataset (i.e. anything that maps document labels to their plain text contents, e.g. a tmtoolkit Corpus object) to the TMPreproc class and then apply several text processing methods to it. You can chain these processing steps by applying one method after another and examining the results.

Under the hood, the spaCy package is used to perform most NLP tasks. On top of that, TMPreproc offers additional functionality, including flexible token and document filtering. The most important advantage of using TMPreproc is that it employs parallel processing, i.e. it uses all available processors on your machine for the computations necessary during preprocessing. For large text corpora, this can lead to a considerable speedup.

Using the functional API

Apart from the TMPreproc class, tmtoolkit also provides several functions in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. Note that only the latter provides parallel processing.

Loading example data

Let’s load a sample of three documents from the built-in NewsArticles dataset. We’ll use only a small number of documents here so that the output stays easy to follow; we can switch to a larger sample later.

[1]:
import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus_small = Corpus.from_builtin_corpus('en-NewsArticles').sample(3)

Optional: enabling logging output

By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long running operations, it’s helpful to enable logging output display, which can be done as follows:

import logging

logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display; use e.g. logging.DEBUG for more detail
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True

Creating a TMPreproc object

You can create a TMPreproc object (also known as “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass our corpus_small object. We also need to specify the corpus language as two-letter ISO 639-1 language code (here "en" for English).

[2]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus_small, language='en')
preproc
[2]:
<TMPreproc [3 documents / en]>

The above will first distribute all documents to several sub-processes which are later used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes; it defaults to the number of CPU cores in your machine. Documents are distributed to the processes according to their size: e.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will take care of the large document alone and CPU 2 will take the three small documents. After distribution, the documents are directly tokenized (in parallel). Hence, when you have a large corpus, creating a TMPreproc object may take some time because of the tokenization process.
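If you want to limit the number of worker processes, you can set n_max_processes when creating the instance. A minimal sketch (the variable name preproc_two_workers is only used for illustration):

# use at most two worker processes for this instance
preproc_two_workers = TMPreproc(corpus_small, language='en', n_max_processes=2)
del preproc_two_workers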

Our TMPreproc object preproc is now set up to work with the documents passed in corpus_small and the language 'en' for English. All further operations with this object will use the specified documents and language. All documents are directly tokenized.

The method print_summary() is very handy and we will use it quite often. It displays a small summary of the documents in the TMPreproc object. N=... denotes the number of tokens in the respective document.

[3]:
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=657): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1947 / vocabulary size: 683
[3]:
<TMPreproc [3 documents / en]>

Accessing tokens, vocabulary and other important properties

TMPreproc provides several properties to access its data and some summary statistics. These properties are read-only, i.e. you can only retrieve the results but not assign new values to them.

First, let’s have a look at the labels (names) of the documents:

[4]:
preproc.doc_labels
[4]:
['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']

We can access the tokens of each document by using the tokens property:

[5]:
# use [:10] slice to show only the first 10 tokens
preproc.tokens['NewsArticles-1880'][:10]
[5]:
['White',
 'House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials']

If you prefer a tabular output, you can also access the tokens and their metadata as pandas DataFrames or datatable Frames.

A note on the use of datatable Frames

If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is usually faster and more memory-efficient. You can always convert between the two like this:

import datatable as dt
import pandas as pd

# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})

# DataFrame to datatable:
dtable = dt.Frame(df)

# and vice versa datatable to DataFrame:
df == dtable.to_pandas()

# Out:
#       a     b
# 0  True  True
# 1  True  True
# 2  True  True

Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.
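If you want to check this claim on your own machine, a rough timing comparison could look like the following sketch (the sample data and repetition count are arbitrary; results will vary):

import timeit

import datatable as dt
import pandas as pd

data = {'a': list(range(100000)), 'b': ['x'] * 100000}

# time direct DataFrame creation vs. creating a datatable Frame first and converting it
t_pandas = timeit.timeit(lambda: pd.DataFrame(data), number=20)
t_datatable = timeit.timeit(lambda: dt.Frame(data).to_pandas(), number=20)
print(t_pandas, t_datatable)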

You can use the tokens_dataframe or tokens_datatable properties for tabular output. The datatable Frame consists of at least five columns: the document label, the position of the token in the document (zero-indexed), the token itself, its lemma and a whitespace indicator. The lemma column contains the token’s lemma and whitespace indicates whether the token is followed by whitespace in the text. Note that for large amounts of data, tokens_datatable is usually quicker than tokens_dataframe.

[6]:
preproc.tokens_datatable
[6]:
      doc                position   token         lemma         whitespace
0     NewsArticles-1880  0          White         White         1
1     NewsArticles-1880  1          House         House         1
2     NewsArticles-1880  2          aides         aide          1
3     NewsArticles-1880  3          told          tell          1
4     NewsArticles-1880  4          to            to            1
5     NewsArticles-1880  5          keep          keep          1
6     NewsArticles-1880  6          Russia        Russia        0
7     NewsArticles-1880  7          -             -             0
8     NewsArticles-1880  8          related       relate        1
9     NewsArticles-1880  9          materials     material      0
10    NewsArticles-1880  10         \n\n          \n\n          0
11    NewsArticles-1880  11         Lawyers       Lawyers       1
12    NewsArticles-1880  12         for           for           1
13    NewsArticles-1880  13         the           the           1
14    NewsArticles-1880  14         Trump         Trump         1
...
1942  NewsArticles-99    1055       non           non           0
1943  NewsArticles-99    1056       -             -             0
1944  NewsArticles-99    1057       recyclable    recyclable    1
1945  NewsArticles-99    1058       items         item          0
1946  NewsArticles-99    1059       .             .             0

More columns may be shown when you add token metadata (more on that later).

The method get_tokens() gives you more options for accessing the tokens. For example, you can get all tokens with their metadata as nested dictionary in the form document label -> metadata key (e.g. “lemma”) -> metadata.

[7]:
doctokens = preproc.get_tokens(with_metadata=True, as_datatables=False)
doctokens['NewsArticles-1880'].keys()
[7]:
dict_keys(['token', 'lemma', 'whitespace'])
[8]:
# lemmata for the first 10 tokens in this document
doctokens['NewsArticles-1880']['lemma'][:10]
[8]:
['White',
 'House',
 'aide',
 'tell',
 'to',
 'keep',
 'Russia',
 '-',
 'relate',
 'material']

You may also want to access the reconstructed full text of each document via the texts property. This returns a dict that maps document labels to their text. Here we only display the first 100 characters of a single document:

[9]:
preproc.texts['NewsArticles-1880'][:100]
[9]:
'White House aides told to keep Russia-related materials\n\nLawyers for the Trump administration have i'

As mentioned in the beginning, tmtoolkit’s preprocessing module uses spaCy internally for most NLP tasks. If you want direct access to the spaCy documents, you can use the spacy_docs property. Here, we access a single spaCy document and check its is_tagged attribute:

[10]:
preproc.spacy_docs['NewsArticles-1880'].is_tagged
[10]:
False

You can also retrieve the document and token vectors from the word embeddings representation of the documents. For this, however, you need to create a TMPreproc instance with the argument enable_vectors=True:

[11]:
preproc_vec = TMPreproc(corpus_small, language='en', enable_vectors=True)
preproc_vec.vectors_enabled
[11]:
True

Now you may access the document vectors via the doc_vectors property:

[12]:
# displaying only the first 10 values of a single
# document's document vector
preproc_vec.doc_vectors['NewsArticles-1880'][:10]
[12]:
array([-7.0222005e-02,  8.1240870e-02, -3.9869484e-02,  1.8360456e-02,
        1.9232498e-02, -2.5533361e-02, -2.9136341e-02, -1.0187237e-01,
        1.6649088e-03,  2.4026785e+00], dtype=float32)

Token vectors are also available via the token_vectors property:

[13]:
# displaying only a single document's token matrix
preproc_vec.token_vectors['NewsArticles-1880']
[13]:
array([[-0.39347 , -0.061407,  0.015231, ...,  0.046462,  0.058398,
         0.46169 ],
       [ 0.19847 ,  0.18087 , -0.089119, ..., -0.24263 , -0.035183,
        -0.29661 ],
       [ 0.28059 , -0.45684 ,  0.414   , ..., -0.31501 , -0.31649 ,
        -0.026392],
       ...,
       [-0.08267 ,  0.092944,  0.028411, ...,  0.49965 , -0.17115 ,
         0.27578 ],
       [ 0.01327 ,  0.51269 , -0.35735 , ...,  0.19492 ,  0.058496,
         0.26636 ],
       [ 0.012001,  0.20751 , -0.12578 , ...,  0.13871 , -0.36049 ,
        -0.035   ]], dtype=float32)
[14]:
del preproc_vec

The following gives you the number of documents and the total number of tokens, respectively:

[15]:
preproc.n_docs
[15]:
3
[16]:
preproc.n_tokens
[16]:
1947

We can also access the number of tokens in each document via the doc_lengths property:

[17]:
# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']
[17]:
230

The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use the property vocabulary for that and the property vocabulary_counts to additionally get the number of times each token appears in the corpus.

[18]:
preproc.vocabulary[:10]  # displaying only the first 10 here
[18]:
['\n\n', ' ', '"', '%', "'", "'s", '(', ')', ',', '-']
[19]:
# number of unique tokens in all documents
preproc.vocabulary_size
[19]:
683
[20]:
# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']
[20]:
82

The latter returns a Python Counter object, so we can use its methods, e.g. to get the most frequent tokens:

[21]:
preproc.vocabulary_counts.most_common()[:10]
[21]:
[('the', 82),
 (',', 70),
 ('.', 60),
 ('to', 53),
 ('"', 50),
 ('and', 46),
 ('in', 39),
 ('a', 31),
 ('of', 25),
 ('that', 22)]

The document frequency of a token is the number of documents in which this token occurs at least once. The properties vocabulary_abs_doc_frequency and vocabulary_rel_doc_frequency return this measure as absolute frequency or proportion respectively:

[22]:
(preproc.vocabulary_abs_doc_frequency['Trump'],
 preproc.vocabulary_rel_doc_frequency['Trump'])
[22]:
(2, 0.6666666666666666)
[23]:
(preproc.vocabulary_abs_doc_frequency['Russia'],
 preproc.vocabulary_rel_doc_frequency['Russia'])
[23]:
(1, 0.3333333333333333)

Part-of-Speech (POS) tagging

Part-of-speech (POS) tagging determines the grammatical word category of each token in a document. The method pos_tag() applies this to the whole corpus. The detected POS tags are added as metadata to each token. They conform to a specific tagset, which is explained in the spaCy documentation. The POS tags can be used to annotate and filter the documents. Let’s apply POS tagging:

[24]:
preproc.pos_tag()
[24]:
<TMPreproc [3 documents / en]>

We can now see a new column pos with the found POS tag for each token:

[25]:
preproc.tokens_datatable
[25]:
      doc                position   token         lemma         pos     whitespace
0     NewsArticles-1880  0          White         White         PROPN   1
1     NewsArticles-1880  1          House         House         PROPN   1
2     NewsArticles-1880  2          aides         aide          NOUN    1
3     NewsArticles-1880  3          told          tell          VERB    1
4     NewsArticles-1880  4          to            to            PART    1
5     NewsArticles-1880  5          keep          keep          VERB    1
6     NewsArticles-1880  6          Russia        Russia        PROPN   0
7     NewsArticles-1880  7          -             -             PUNCT   0
8     NewsArticles-1880  8          related       relate        VERB    1
9     NewsArticles-1880  9          materials     material      NOUN    0
10    NewsArticles-1880  10         \n\n          \n\n          SPACE   0
11    NewsArticles-1880  11         Lawyers       lawyer        NOUN    1
12    NewsArticles-1880  12         for           for           ADP     1
13    NewsArticles-1880  13         the           the           DET     1
14    NewsArticles-1880  14         Trump         trump         ADJ     1
...
1942  NewsArticles-99    1055       non           non           ADJ     0
1943  NewsArticles-99    1056       -             -             ADJ     0
1944  NewsArticles-99    1057       recyclable    recyclable    ADJ     1
1945  NewsArticles-99    1058       items         item          NOUN    0
1946  NewsArticles-99    1059       .             .             PUNCT   0

Aside: TMPreproc as “state machine”

Before continuing, we should clarify that a TMPreproc instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:

corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens

# Out:
# {
#   'doc1': ['Hello', 'world', '!'],
#   'doc2': ['Another', 'example']
# }

preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }

As you can see, the tokens “inside” preproc are changed in place. After calling the method tokens_to_lowercase(), the tokens in preproc were transformed and the original tokens are not available anymore. In Python, assigning a mutable object to another variable only binds the same object to a different name; it doesn’t copy it. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning such an object to a different variable (say preproc_upper) gives us two names for the same object, and calling a method via one of these names changes the data seen through both.
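Continuing the small example from above, a short sketch illustrates this: plain assignment does not create a copy, so both names refer to the same object and changes are visible through both.

preproc_other = preproc          # no copy is made here
preproc_other is preproc         # -> True: both names refer to the same object

preproc_other.transform_tokens(str.upper)
preproc.tokens                   # "preproc" shows the transformed tokens, too

# Out:
# {
#   'doc1': ['HELLO', 'WORLD', '!'],
#   'doc2': ['ANOTHER', 'EXAMPLE']
# }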

Copying TMPreproc objects

What can we do about that? We need to copy the object, which can be done with the TMPreproc.copy() method. This way, we create another variable preproc_upper that points to a separate TMPreproc object.

[26]:
preproc_upper = preproc.copy()
[27]:
# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)
[27]:
(140426331677504, 140426727032000)
[28]:
preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary
[28]:
False
[29]:
# show a sample
preproc_upper.tokens['NewsArticles-1880'][:10]
[29]:
['WHITE',
 'HOUSE',
 'AIDES',
 'TOLD',
 'TO',
 'KEEP',
 'RUSSIA',
 '-',
 'RELATED',
 'MATERIALS']
[30]:
# the original "preproc" still holds the same data
preproc.tokens['NewsArticles-1880'][:10]
[30]:
['White',
 'House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials']

Note that this also uses up twice as much memory. So you shouldn’t create copies too often, and you should release unused objects with del:

[31]:
# removing the objects again
del preproc_upper

Lemmatization and term normalization

Before we start with token normalization, we will create a copy of the original TMPreproc object and its data, so that we can later use it for comparison:

[32]:
preproc_orig = preproc.copy()

Lemmatization reduces a token, if it is a word, to its base form. The lemma is already determined during tokenization and is available in the lemma metadata column. However, when you want to further process the tokens on the basis of their lemmata, you should use the lemmatize() method. This method sets the lemmata as tokens, and all further processing will happen on the lemmatized tokens:

[33]:
preproc.lemmatize()
preproc.tokens_datatable
[33]:
      doc                position   token         lemma         pos     whitespace
0     NewsArticles-1880  0          White         White         PROPN   1
1     NewsArticles-1880  1          House         House         PROPN   1
2     NewsArticles-1880  2          aide          aide          NOUN    1
3     NewsArticles-1880  3          tell          tell          VERB    1
4     NewsArticles-1880  4          to            to            PART    1
5     NewsArticles-1880  5          keep          keep          VERB    1
6     NewsArticles-1880  6          Russia        Russia        PROPN   0
7     NewsArticles-1880  7          -             -             PUNCT   0
8     NewsArticles-1880  8          relate        relate        VERB    1
9     NewsArticles-1880  9          material      material      NOUN    0
10    NewsArticles-1880  10         \n\n          \n\n          SPACE   0
11    NewsArticles-1880  11         lawyer        lawyer        NOUN    1
12    NewsArticles-1880  12         for           for           ADP     1
13    NewsArticles-1880  13         the           the           DET     1
14    NewsArticles-1880  14         trump         trump         ADJ     1
...
1942  NewsArticles-99    1055       non           non           ADJ     0
1943  NewsArticles-99    1056       -             -             ADJ     0
1944  NewsArticles-99    1057       recyclable    recyclable    ADJ     1
1945  NewsArticles-99    1058       item          item          NOUN    0
1946  NewsArticles-99    1059       .             .             PUNCT   0

As we can see, the lemma column was copied over to the token column.

Stemming

tmtoolkit doesn’t support stemming directly, since lemmatization is generally accepted as a better approach to bring different word forms of one word to a common base form. However, you may install NLTK and apply stemming by using the transform_tokens() method together with the stem() function.
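A minimal sketch, assuming NLTK is installed: you can pass NLTK's Snowball stemmer directly to transform_tokens(), since it only needs a function that maps one string to another (the variable names are just for illustration).

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

preproc_stemmed = preproc_orig.copy()
preproc_stemmed.transform_tokens(stemmer.stem)   # apply the stemmer to every token
preproc_stemmed.print_summary()
del preproc_stemmed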

Depending on how you further want to analyze the data, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).

If you want to remove certain characters from all tokens in your documents, you can use remove_chars_in_tokens() and pass it a sequence of characters to remove. There is also a shortcut remove_special_chars_in_tokens() which removes all “special characters” (by default all characters in string.punctuation).

[34]:
preproc.remove_chars_in_tokens(['-'])  # remove only "-"
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom ? Our bat...
total number of tokens: 1947 / vocabulary size: 596
[34]:
<TMPreproc [3 documents / en]>
[35]:
# remove all punctuation
preproc.remove_special_chars_in_tokens()
preproc.print_summary()   # the "?" also vanishes
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom Our bathr...
total number of tokens: 1947 / vocabulary size: 580
[35]:
<TMPreproc [3 documents / en]>

A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with tokens_to_lowercase():

[36]:
preproc.tokens_to_lowercase()
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): white house aide tell to keep russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): should you have two bin in your bathroom our bathr...
total number of tokens: 1947 / vocabulary size: 562
[36]:
<TMPreproc [3 documents / en]>

The method clean_tokens() finally applies several steps that remove tokens that meet certain criteria. This includes removing:

  • punctuation tokens

  • stopwords (very common words for the given language)

  • empty tokens (i.e. '')

  • tokens that are longer or shorter than a certain number of characters

  • numbers

Note that this is a language-dependent method, because the default stopword list is determined per language. This method has many parameters to tweak, so it’s recommended to check out the documentation.

[37]:
# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=130): white house aide tell keep russia relate material ...
> NewsArticles-3350 (N=309): frustration cabin electronic ban come force passen...
> NewsArticles-99 (N=486): bin bathroom bathroom fill shampoo bottle toilet r...
total number of tokens: 925 / vocabulary size: 469
[37]:
<TMPreproc [3 documents / en]>

Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

[38]:
preproc.doc_lengths, preproc_orig.doc_lengths
[38]:
({'NewsArticles-1880': 130, 'NewsArticles-3350': 309, 'NewsArticles-99': 486},
 {'NewsArticles-1880': 230, 'NewsArticles-3350': 657, 'NewsArticles-99': 1060})

We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

[39]:
len(preproc.vocabulary), len(preproc_orig.vocabulary)
[39]:
(469, 683)

You can also apply custom token transformations by using transform_tokens() and passing it a function that will be applied to each token in each document (hence it must accept a single string argument).

First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:

[40]:
def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('EU'), token_shape('CamelCase'), token_shape('lower')
[40]:
('XX', 'XxxxxXxxx', 'xxxxx')

We can now apply this function to our documents (we will use the original documents here, because they were not transformed to lower case):

[41]:
preproc = preproc_orig.copy() # swap instances for later

preproc_orig.transform_tokens(token_shape)   # apply function
preproc_orig.print_summary()

# remove instance
del preproc_orig
3 documents in language English:
> NewsArticles-1880 (N=230): Xxxxx Xxxxx xxxxx xxxx xx xxxx Xxxxxx x xxxxxxx xx...
> NewsArticles-3350 (N=657): Xxxxxxxxxxx xx xxxxx xxxxxxxxxxx xxx xxxxx xxxx xx...
> NewsArticles-99 (N=1060): Xxxxxx xxx xxxx xxx xxxx xx xxxx xxxxxxxx x xx Xxx...
total number of tokens: 1947 / vocabulary size: 32

Expanding compound words and joining tokens

Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compound_tokens(). However, depending on the language model, most of these compounds will already be separated on initial tokenization.

[42]:
orig_vocab = preproc.vocabulary
preproc.expand_compound_tokens()

# create set difference to show vocabulary tokens
# that were expanded
set(orig_vocab) - set(preproc.vocabulary)
[42]:
{'Source:-Al'}

It’s also possible to join together certain subsequent occurrences of tokens or token patterns. For example, you can transform all subsequent occurrences of the tokens “White” and “House” into single tokens “White_House”. In case you don’t use n-grams (described in a separate section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as a person, institution or a concept like “Climate Change”, as a single token. The method to use for this is glue_tokens(). It accepts the following parameters:

  • a patterns sequence of length N that is used to match the subsequent N tokens;

  • a glue string that is used to join the matched subsequent tokens (by default: "_").

Along with that, you can adjust the token matching with the common token matching parameters described below.

Let’s “glue” all subsequent occurrences of “White” and “House”. The glue_tokens() method will return a set of glued tokens that matched the provided pattern:

[43]:
preproc_orig = preproc.copy()  # make a copy of full orig. data for later use
preproc.glue_tokens(['White', 'House'])
[43]:
{'White_House'}
[44]:
preproc.tokens['NewsArticles-1880'][:20]
[44]:
['White_House',
 'aides',
 'told',
 'to',
 'keep',
 'Russia',
 '-',
 'related',
 'materials',
 '\n\n',
 'Lawyers',
 'for',
 'the',
 'Trump',
 'administration',
 'have',
 'instructed',
 'White_House',
 'aides',
 'to']
[45]:
del preproc

Keywords-in-context (KWIC) and general filtering methods

Keyword-in-context (KWIC) searches allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after a keyword.

TMPreproc provides three methods for this purpose:

  • get_kwic() is the base method accepting a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;

  • get_kwic_table() is the more “user friendly” version of the above method, as it produces a datatable with the highlighted keyword by default;

  • filter_tokens_with_kwic() works similarly to the above functions but applies the result by filtering the documents accordingly; it is explained in the section on filtering below

Let’s see the KWIC methods in action:

[46]:
preproc = preproc_orig.copy()  # use orig. full data
preproc.get_kwic('house', ignore_case=True)
[46]:
{'NewsArticles-1880': [['White', 'House', 'aides', 'told'],
  ['instructed', 'White', 'House', 'aides', 'to'],
  ['The', 'White', 'House', 'is', 'simply'],
  ['the', 'White', 'House', 'and', 'law']],
 'NewsArticles-3350': [],
 'NewsArticles-99': [['of', 'the', 'house', ',', '"']]}

The method returns a dictionary that maps document labels to the KWIC results. Each document contains a list of “contexts”, i.e. a list of tokens that surround a keyword, here "house". This keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two tokens to the right (which may be less when the keyword is near the start or the end of a document).

We can see that NewsArticles-1880 contains four contexts, NewsArticles-99 one context and NewsArticles-3350 none.

With get_kwic_table(), we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as *house* and empty results are removed:

[47]:
preproc.get_kwic_table('house', ignore_case=True)
[47]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told
1     NewsArticles-1880  1         instructed White *House* aides to
2     NewsArticles-1880  2         The White *House* is simply
3     NewsArticles-1880  3         the White *House* and law
4     NewsArticles-99    0         of the *house* , "

An important parameter is context_size. It determines the number of tokens to display to the left and right of the matched keyword. You can either pass a single integer for a symmetric context or a tuple of integers (<left>, <right>):

[48]:
preproc.get_kwic_table('house', ignore_case=True, context_size=4)
[48]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told to keep
1     NewsArticles-1880  1         administration have instructed White *House* aides…
2     NewsArticles-1880  2         . " The White *House* is simply taking proactive
3     NewsArticles-1880  3         Democrats to the White *House* and law enforcement…
4     NewsArticles-99    0         other rooms of the *house* , " says Jonny
[49]:
preproc.get_kwic_table('house', ignore_case=True, context_size=(1, 4))
[49]:
      doc                context   kwic
0     NewsArticles-1880  0         White *House* aides told to keep
1     NewsArticles-1880  1         White *House* aides to preserve any
2     NewsArticles-1880  2         White *House* is simply taking proactive
3     NewsArticles-1880  3         White *House* and law enforcement agencies
4     NewsArticles-99    0         the *house* , " says Jonny

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact (but case insensitive) matches between the corpus tokens and our keyword "house". However, it is also possible to match patterns like "new*" (matches any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section gives an introduction on the different options for pattern matching.

Common parameters for pattern matching functions

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

  • search_token or search_tokens: one or more search patterns given as strings

  • match_type: sets the matching type and can be one of the following options:

      • 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching

      • 'regex': uses regular expression matching

      • 'glob': uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see the globre package)

  • ignore_case: ignore character case (applies to all three match types)

  • glob_method: if match_type is 'glob', use this glob method; must be 'match' or 'search' (behaving similarly to Python’s re.match or re.search)

  • inverse: invert the match results, i.e. if matching for “hello”, return all results that do not match “hello”

Let’s try out some of these options with get_kwic_table():

[50]:
# using a regular expression, ignoring case
preproc.get_kwic_table(r'agenc(y|ies)', match_type='regex', ignore_case=True)
[50]:
      doc                context   kwic
0     NewsArticles-1880  0         law enforcement *agencies* to keep
1     NewsArticles-1880  1         organizations , *agencies* and individuals
2     NewsArticles-3350  0         Reuters news *agency* . Al
3     NewsArticles-3350  1         and news *agencies*
[51]:
# using a glob, ignoring case
preproc.get_kwic_table('pol*', match_type='glob', ignore_case=True)
[51]:
      doc                context   kwic
0     NewsArticles-1880  0         false and *politically* motivated attacks
1     NewsArticles-99    0         , senior *policy* adviser for
[52]:
# using a glob, ignoring case
preproc.get_kwic_table('*sol*', match_type='glob', ignore_case=True)
[52]:
      doc                context   kwic
0     NewsArticles-99    0         potential simple *solution* that could
1     NewsArticles-99    1         confused by *aerosols* . "
2     NewsArticles-99    2         bottles , *aerosols* for deodorant
[53]:
# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
preproc.get_kwic_table(r'[AEIOUaeiou]', match_type='regex', inverse=True)
[53]:
      doc                context   kwic
0     NewsArticles-1880  0         keep Russia *-* related materials
1     NewsArticles-1880  1         related materials * * Lawyers for
2     NewsArticles-1880  2         in the *2016* presidential election
3     NewsArticles-1880  3         related investigations *,* ABC News
4     NewsArticles-1880  4         has confirmed *.* " The
5     NewsArticles-1880  5         confirmed . *"* The White
6     NewsArticles-1880  6         motivated attacks *,* " an
7     NewsArticles-1880  7         attacks , *"* an administration
8     NewsArticles-1880  8         News Wednesday *.* The directive
9     NewsArticles-1880  9         last week *by* Senate Democrats
10    NewsArticles-1880  10        between Trump *'s* administration ,
11    NewsArticles-1880  11        's administration *,* campaign and
12    NewsArticles-1880  12        transition teams *"* ? or
13    NewsArticles-1880  13        teams " *?* or anyone
14    NewsArticles-1880  14        their behalf *"* ? and
...
265   NewsArticles-99    147       two bins *?* There are
266   NewsArticles-99    148       other options *.* Hang a
267   NewsArticles-99    149       recycling bin *.* Or opt
268   NewsArticles-99    150       and non *-* recyclable items
269   NewsArticles-99    151       recyclable items *.*

Filtering tokens and documents

We can use the pattern matching parameters in numerous filtering methods. The heart of many of these methods is token_match(). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a binary NumPy array of the same length as the input tokens. Each occurrence of True in this binary array signals a match.

[54]:
from tmtoolkit.preprocess import token_match

# first 10 tokens of document "NewsArticles-1880"
doc_snippet = preproc.tokens['NewsArticles-1880'][:10]
# get all tokens that match "to*"
matches = token_match('to*', doc_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc_snippet, matches):
    print(tok, ':', match)
White : False
House : False
aides : False
told : True
to : True
keep : False
Russia : False
- : False
related : False
materials : False

The token_match() function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.

The following methods cover common use cases for filtering during text preprocessing. Many of these methods start with either filter_...() or remove_...(), and these pairs of filter and remove methods are complements: a filter method always retains the matched elements whereas a remove method always drops the matched elements. We can observe that with the first pair of methods, filter_tokens() and remove_tokens():

So much .copy()

Note that the following code snippets make a lot of use of the copy() method. This is because we want to show how the different methods work on the same original data (remember that a TMPreproc instance behaves like a state machine) and we also want to “clean up” the temporary instances. Under normal circumstances, you wouldn’t use copy() so excessively.

[55]:
# retain only the tokens that match the pattern in each document
preproc.filter_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
3 documents in language English:
> NewsArticles-1880 (N=4): House House House House
> NewsArticles-3350 (N=0):
> NewsArticles-99 (N=3): house greenhouse household
total number of tokens: 7 / vocabulary size: 4
[56]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
3 documents in language English:
> NewsArticles-1880 (N=226): White aides told to keep Russia - related material...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1057): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1941 / vocabulary size: 679

The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents based on the supplied match criteria. Both accept the standard pattern matching parameters as well as a parameter matches_threshold with default value 1. When this number of matching tokens is reached, the document is kept in the result set (filter_documents()) or removed from it (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.

Let’s try these methods out in practice:

[57]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485

We can see that two out of three documents contained the pattern '*house*' and hence were retained.

We can also adjust matches_threshold to set the minimum number of token matches for filtering:

[58]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True,
                         matches_threshold=4)
preproc.print_summary()

del preproc
1 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
total number of tokens: 230 / vocabulary size: 140
[59]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
1 documents in language English:
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 658 / vocabulary size: 288

When we use remove_documents() we get only the documents that did not contain the specified pattern.

Another useful pair of methods is filter_documents_by_name() and remove_documents_by_name(). Both methods again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:

[60]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents_by_name(r'-\d{4}$', match_type='regex')
preproc.print_summary()

del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 888 / vocabulary size: 385

In the above example we wanted to retain only the documents whose document labels ended with exactly 4 digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() will do the exact opposite.

You may also use keywords-in-context (KWIC) to filter your tokens down to the neighborhood around certain keyword patterns. The method for this is called filter_tokens_with_kwic() and works very similarly to get_kwic(), but it filters the documents in the TMPreproc instance, with which you can then continue working as usual. Here, we filter the tokens in each document to get only the tokens directly before and after the glob pattern '*house*' (context_size=1):

[61]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_tokens_with_kwic('*house*', context_size=1,
                                match_type='glob', ignore_case=True)
preproc.tokens_datatable
[61]:
      doc                position   token         lemma         whitespace
0     NewsArticles-1880  0          White         White         1
1     NewsArticles-1880  1          House         House         1
2     NewsArticles-1880  2          aides         aide          1
3     NewsArticles-1880  3          White         White         1
4     NewsArticles-1880  4          House         House         1
5     NewsArticles-1880  5          aides         aide          1
6     NewsArticles-1880  6          White         White         1
7     NewsArticles-1880  7          House         House         1
8     NewsArticles-1880  8          is            be            1
9     NewsArticles-1880  9          White         White         1
10    NewsArticles-1880  10         House         House         1
11    NewsArticles-1880  11         and           and           1
12    NewsArticles-99    0          the           the           1
13    NewsArticles-99    1          house         house         0
14    NewsArticles-99    2          ,             ,             0
15    NewsArticles-99    3          of            of            1
16    NewsArticles-99    4          greenhouse    greenhouse    1
17    NewsArticles-99    5          gases         gas           1
18    NewsArticles-99    6          UK            UK            1
19    NewsArticles-99    7          household     household     1
20    NewsArticles-99    8          threw         throw         1

Once you have annotated your documents’ tokens with part-of-speech (POS) tags, you can also filter them using filter_for_pos():

[62]:
del preproc

preproc = preproc_orig.copy()  # make a copy from full data

# apply POS tagging and retain only nouns
preproc.pos_tag().filter_for_pos('N').tokens_datatable
[62]:
      doc                position   token
[63]:
del preproc

In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the method documentation). We can also filter for more than one tag, e.g. nouns or verbs by passing a list of required POS tags.

filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.
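For example, to drop all nouns instead of keeping them, you could invert the filter like in the following sketch (run on a fresh copy, so the other examples are not affected):

preproc_no_nouns = preproc_orig.copy()

# POS tag the copy and remove all nouns by inverting the filter
preproc_no_nouns.pos_tag().filter_for_pos('N', inverse=True)
preproc_no_nouns.print_summary()

del preproc_no_nouns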

Finally, there are two methods for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens with a document frequency greater than or equal to a certain threshold defined by the parameter df_threshold. The latter does the same for all tokens with a document frequency lower than or equal to df_threshold. This parameter can either be a relative frequency (default) or an absolute count (by setting absolute=True).

Before applying the method, let’s have a look at the number of tokens per document again, to later see how many we will remove. We will also store the vocabulary in orig_vocab for later comparison:

[64]:
preproc = preproc_orig.copy()  # make a copy from full data
orig_vocab = preproc.vocabulary
preproc.doc_lengths
[64]:
{'NewsArticles-1880': 230, 'NewsArticles-3350': 658, 'NewsArticles-99': 1060}
[65]:
preproc.remove_common_tokens(df_threshold=0.9).doc_lengths
[65]:
{'NewsArticles-1880': 144, 'NewsArticles-3350': 413, 'NewsArticles-99': 700}

By removing all tokens with a document frequency threshold of 0.9, we removed quite a number of tokens in each document. Let’s investigate the vocabulary in order to see which tokens were removed:

[66]:
# set difference gives removed vocabulary tokens
set(orig_vocab) - set(preproc.vocabulary)
[66]:
{'\n\n',
 '"',
 "'s",
 ',',
 '-',
 '.',
 '?',
 'The',
 'a',
 'all',
 'also',
 'an',
 'and',
 'be',
 'for',
 'has',
 'have',
 'in',
 'into',
 'is',
 'more',
 'of',
 'on',
 'or',
 'other',
 'such',
 'than',
 'that',
 'the',
 'to',
 'which',
 'with'}
[67]:
del preproc

remove_uncommon_tokens() works similarly. This time, let’s use an absolute number as the threshold:

[68]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_uncommon_tokens(df_threshold=1, absolute=True)

# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(set(orig_vocab) - set(preproc.vocabulary))[:10]
[68]:
[' ', '%', '(', ')', '10', '12', '135,000', '2016', '38', '45']

This means that we removed all tokens that appear in only a single document.

[69]:
del preproc

Working with token metadata

TMPreproc allows you to attach arbitrary metadata to each token in each document. Such token “annotations” are very useful. For example, you may record a token’s length or whether it consists only of uppercase letters and later use that for filtering or further analyses. One method to add such metadata is add_metadata_per_doc(). It requires a dict that maps document labels to the respective lists of token metadata; each list’s length must match the number of tokens in the respective document. First we need to create such a metadata dict. Let’s do that for the token lengths:

[70]:
preproc = preproc_orig.copy()  # make a copy from full data

meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc.tokens.items()}

# show the first 10 tokens and their string lengths for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
         meta_tok_lengths['NewsArticles-1880'][:10]))
[70]:
[('White', 5),
 ('House', 5),
 ('aides', 5),
 ('told', 4),
 ('to', 2),
 ('keep', 4),
 ('Russia', 6),
 ('-', 1),
 ('related', 7),
 ('materials', 9)]

We can now add this metadata via add_metadata_per_doc(). We pass the metadata key ('length') and the previously generated metadata dict:

[71]:
preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore

The property tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):

[72]:
preproc.tokens_datatable
[72]:
      doc                position   token         lemma         whitespace  meta_length
0     NewsArticles-1880  0          White         White         1           5
1     NewsArticles-1880  1          House         House         1           5
2     NewsArticles-1880  2          aides         aide          1           5
3     NewsArticles-1880  3          told          tell          1           4
4     NewsArticles-1880  4          to            to            1           2
5     NewsArticles-1880  5          keep          keep          1           4
6     NewsArticles-1880  6          Russia        Russia        0           6
7     NewsArticles-1880  7          -             -             0           1
8     NewsArticles-1880  8          related       relate        1           7
9     NewsArticles-1880  9          materials     material      0           9
10    NewsArticles-1880  10         \n\n          \n\n          0           2
11    NewsArticles-1880  11         Lawyers       lawyer        1           7
12    NewsArticles-1880  12         for           for           1           3
13    NewsArticles-1880  13         the           the           1           3
14    NewsArticles-1880  14         Trump         trump         1           5
...
1943  NewsArticles-99    1055       non           non           0           3
1944  NewsArticles-99    1056       -             -             0           1
1945  NewsArticles-99    1057       recyclable    recyclable    1           10
1946  NewsArticles-99    1058       items         item          0           5
1947  NewsArticles-99    1059       .             .             0           1

Let’s add a boolean indicator for whether the given token is all uppercase:

[73]:
meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper

preproc.tokens_datatable
[73]:
      doc                position   token         lemma         whitespace  meta_length  meta_upper
0     NewsArticles-1880  0          White         White         1           5            0
1     NewsArticles-1880  1          House         House         1           5            0
2     NewsArticles-1880  2          aides         aide          1           5            0
3     NewsArticles-1880  3          told          tell          1           4            0
4     NewsArticles-1880  4          to            to            1           2            0
5     NewsArticles-1880  5          keep          keep          1           4            0
6     NewsArticles-1880  6          Russia        Russia        0           6            0
7     NewsArticles-1880  7          -             -             0           1            0
8     NewsArticles-1880  8          related       relate        1           7            0
9     NewsArticles-1880  9          materials     material      0           9            0
10    NewsArticles-1880  10         \n\n          \n\n          0           2            0
11    NewsArticles-1880  11         Lawyers       lawyer        1           7            0
12    NewsArticles-1880  12         for           for           1           3            0
13    NewsArticles-1880  13         the           the           1           3            0
14    NewsArticles-1880  14         Trump         trump         1           5            0
...
1943  NewsArticles-99    1055       non           non           0           3            0
1944  NewsArticles-99    1056       -             -             0           1            0
1945  NewsArticles-99    1057       recyclable    recyclable    1           10           0
1946  NewsArticles-99    1058       items         item          0           5            0
1947  NewsArticles-99    1059       .             .             0           1            0

You may use these newly added columns now for example for filtering the datatable:

[74]:
import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1,:]
[74]:
      doc                position   token    lemma     whitespace  meta_length  meta_upper
0     NewsArticles-1880  43         ABC      ABC       1           3            1
1     NewsArticles-1880  73         ABC      ABC       1           3            1
2     NewsArticles-1880  213        U.S.     U.S.      1           4            1
3     NewsArticles-3350  11         US       US        0           2            1
4     NewsArticles-3350  13         UK       UK        1           2            1
5     NewsArticles-3350  34         US       US        1           2            1
6     NewsArticles-3350  98         US       US        1           2            1
7     NewsArticles-3350  106        US       US        1           2            1
8     NewsArticles-3350  134        UAE      UAE       1           3            1
9     NewsArticles-3350  153        READ     READ      1           4            1
10    NewsArticles-3350  154        MORE     MORE      0           4            1
11    NewsArticles-3350  273        US       US        1           2            1
12    NewsArticles-3350  346        READ     READ      1           4            1
13    NewsArticles-3350  347        MORE     MORE      0           4            1
14    NewsArticles-3350  349        US       US        1           2            1
15    NewsArticles-3350  358        US       US        1           2            1
16    NewsArticles-3350  454        I        -PRON-    1           1            1
17    NewsArticles-3350  480        UK       UK        1           2            1
18    NewsArticles-3350  502        UK       UK        1           2            1
19    NewsArticles-3350  506        UAE      UAE       1           3            1
20    NewsArticles-3350  529        UAE      UAE       1           3            1
21    NewsArticles-3350  570        US       US        1           2            1
22    NewsArticles-3350  637        US       US        1           2            1
23    NewsArticles-99    376        UK       UK        1           2            1
24    NewsArticles-99    711        A        a         1           1            1
25    NewsArticles-99    955        UK       UK        1           2            1
26    NewsArticles-99    995        M25      M25       1           3            1

To see which metadata keys were already created, you can use get_available_metadata_keys():

[75]:
preproc.get_available_metadata_keys()
[75]:
{'lemma', 'length', 'upper', 'whitespace'}

Token metadata can be removed with remove_metadata():

[76]:
preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()
[76]:
{'lemma', 'length', 'whitespace'}
[77]:
preproc.tokens_datatable
[77]:
      doc                position   token         lemma         whitespace  meta_length
0     NewsArticles-1880  0          White         White         1           5
1     NewsArticles-1880  1          House         House         1           5
2     NewsArticles-1880  2          aides         aide          1           5
3     NewsArticles-1880  3          told          tell          1           4
4     NewsArticles-1880  4          to            to            1           2
5     NewsArticles-1880  5          keep          keep          1           4
6     NewsArticles-1880  6          Russia        Russia        0           6
7     NewsArticles-1880  7          -             -             0           1
8     NewsArticles-1880  8          related       relate        1           7
9     NewsArticles-1880  9          materials     material      0           9
10    NewsArticles-1880  10         \n\n          \n\n          0           2
11    NewsArticles-1880  11         Lawyers       lawyer        1           7
12    NewsArticles-1880  12         for           for           1           3
13    NewsArticles-1880  13         the           the           1           3
14    NewsArticles-1880  14         Trump         trump         1           5
...
1943  NewsArticles-99    1055       non           non           0           3
1944  NewsArticles-99    1056       -             -             0           1
1945  NewsArticles-99    1057       recyclable    recyclable    1           10
1946  NewsArticles-99    1058       items         item          0           5
1947  NewsArticles-99    1059       .             .             0           1

We can tell filter_tokens() and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata meta_length, which we created before, to filter for tokens of a certain length:

[78]:
preproc_meta_example = preproc.copy()
preproc_meta_example.filter_tokens(3, by_meta='length')
preproc_meta_example.tokens_datatable
[78]:
      doc                position   token    lemma    whitespace  meta_length
0     NewsArticles-1880  0          for      for      1           3
1     NewsArticles-1880  1          the      the      1           3
2     NewsArticles-1880  2          any      any      1           3
3     NewsArticles-1880  3          the      the      1           3
4     NewsArticles-1880  4          and      and      1           3
5     NewsArticles-1880  5          ABC      ABC      1           3
6     NewsArticles-1880  6          has      have     1           3
7     NewsArticles-1880  7          The      the      1           3
8     NewsArticles-1880  8          and      and      1           3
9     NewsArticles-1880  9          ABC      ABC      1           3
10    NewsArticles-1880  10         The      the      1           3
11    NewsArticles-1880  11         the      the      1           3
12    NewsArticles-1880  12         and      and      1           3
13    NewsArticles-1880  13         law      law      1           3
14    NewsArticles-1880  14         all      all      1           3
...
335   NewsArticles-99    186        for      for      1           3
336   NewsArticles-99    187        bin      bin      1           3
337   NewsArticles-99    188        can      can      1           3
338   NewsArticles-99    189        and      and      1           3
339   NewsArticles-99    190        non      non      0           3
[79]:
del preproc_meta_example

Note that all matching options then apply to the metadata column, in this case the meta_length column, which contains integers. Since filter_tokens() by default employs exact matching, we get all tokens whose meta_length equals the first argument, 3. Regular expression or glob matching would fail here, because these can only be applied to string data.

If you want to use more complex filter queries, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps a document label to a sequence of booleans. For all occurrences of True, the respective token in the document will be retained, all others will be removed. Let’s try that out with a small sample:

[80]:
preproc.pos_tag().tokens_datatable
[80]:
      doc                position   token         lemma         pos     whitespace  meta_length
0     NewsArticles-1880  0          White         White         PUNCT   1           5
1     NewsArticles-1880  1          House         House         PUNCT   1           5
2     NewsArticles-1880  2          aides         aide          PUNCT   1           5
3     NewsArticles-1880  3          told          tell          PUNCT   1           4
4     NewsArticles-1880  4          to            to            PUNCT   1           2
5     NewsArticles-1880  5          keep          keep          PUNCT   1           4
6     NewsArticles-1880  6          Russia        Russia        PUNCT   0           6
7     NewsArticles-1880  7          -             -             PUNCT   0           1
8     NewsArticles-1880  8          related       relate        PUNCT   1           7
9     NewsArticles-1880  9          materials     material      PUNCT   0           9
10    NewsArticles-1880  10         \n\n          \n\n          PUNCT   0           2
11    NewsArticles-1880  11         Lawyers       lawyer        PUNCT   1           7
12    NewsArticles-1880  12         for           for           PUNCT   1           3
13    NewsArticles-1880  13         the           the           PUNCT   1           3
14    NewsArticles-1880  14         Trump         trump         PUNCT   1           5
...
1943  NewsArticles-99    1055       non           non           PUNCT   0           3
1944  NewsArticles-99    1056       -             -             PUNCT   0           1
1945  NewsArticles-99    1057       recyclable    recyclable    PUNCT   1           10
1946  NewsArticles-99    1058       items         item          PUNCT   0           5
1947  NewsArticles-99    1059       .             .             PUNCT   0           1

We now generate the filter mask, which means for each document we create a boolean list or array that for each token in that document indicates whether that token should be kept or removed.

We will iterate through the tokens_with_metadata property, which is a dict that for each document contains a datatable with its tokens and metadata. Let’s have a look at the first document’s datatable:

[81]:
next(iter(preproc.tokens_with_metadata.values()))
[81]:
      token          lemma          pos     whitespace  meta_length
0     White          White          PUNCT   1           5
1     House          House          PUNCT   1           5
2     aides          aide           PUNCT   1           5
3     told           tell           PUNCT   1           4
4     to             to             PUNCT   1           2
5     keep           keep           PUNCT   1           4
6     Russia         Russia         PUNCT   0           6
7     -              -              PUNCT   0           1
8     related        relate         PUNCT   1           7
9     materials      material       PUNCT   0           9
10    \n\n           \n\n           PUNCT   0           2
11    Lawyers        lawyer         PUNCT   1           7
12    for            for            PUNCT   1           3
13    the            the            PUNCT   1           3
14    Trump          trump          PUNCT   1           5
...
225   during         during         PUNCT   1           6
226   his            -PRON-         X       1           3
227   confirmation   confirmation   PUNCT   1           12
228   hearing        hearing        PUNCT   0           7
229   .              .              PUNCT   0           1

Now we can create the filter mask:

[82]:
import numpy as np

filter_mask = {}
for doc_label, doc_data in preproc.tokens_with_metadata.items():
    # extract the columns "meta_length" and "pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())

    # create a boolean array for nouns with token length less or equal 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.isin(tok_pos, ['NOUN', 'PROPN'])

# it's not necessary to add the filter mask as metadata
# but it's a good way to check the mask
preproc.add_metadata_per_doc('small_nouns', filter_mask)
preproc.tokens_datatable
[82]:
      doc                position   token         lemma         pos     whitespace  meta_length  meta_small_nouns
0     NewsArticles-1880  0          White         White         PUNCT   1           5            0
1     NewsArticles-1880  1          House         House         PUNCT   1           5            0
2     NewsArticles-1880  2          aides         aide          PUNCT   1           5            0
3     NewsArticles-1880  3          told          tell          PUNCT   1           4            0
4     NewsArticles-1880  4          to            to            PUNCT   1           2            0
5     NewsArticles-1880  5          keep          keep          PUNCT   1           4            0
6     NewsArticles-1880  6          Russia        Russia        PUNCT   0           6            0
7     NewsArticles-1880  7          -             -             PUNCT   0           1            0
8     NewsArticles-1880  8          related       relate        PUNCT   1           7            0
9     NewsArticles-1880  9          materials     material      PUNCT   0           9            0
10    NewsArticles-1880  10         \n\n          \n\n          PUNCT   0           2            0
11    NewsArticles-1880  11         Lawyers       lawyer        PUNCT   1           7            0
12    NewsArticles-1880  12         for           for           PUNCT   1           3            0
13    NewsArticles-1880  13         the           the           PUNCT   1           3            0
14    NewsArticles-1880  14         Trump         trump         PUNCT   1           5            0
...
1943  NewsArticles-99    1055       non           non           PUNCT   0           3            0
1944  NewsArticles-99    1056       -             -             PUNCT   0           1            0
1945  NewsArticles-99    1057       recyclable    recyclable    PUNCT   1           10           0
1946  NewsArticles-99    1058       items         item          PUNCT   0           5            0
1947  NewsArticles-99    1059       .             .             PUNCT   0           1            0

Finally, we can pass the mask dict to filter_tokens_by_mask():

[83]:
preproc.filter_tokens_by_mask(filter_mask)
preproc.tokens_datatable
[83]:
      doc                position   token

Generating n-grams

So far, we worked with unigrams, i.e. each document consisted of a sequence of discrete tokens. We can also generate n-grams from our corpus, so that each document consists of a sequence of n-grams, i.e. chunks of n subsequent tokens. An example would be:

Document: “This is a simple example.”

n=1 (unigrams):

['This', 'is', 'a', 'simple', 'example', '.']

n=2 (bigrams):

['This is', 'is a', 'a simple', 'simple example', 'example .']

n=3 (trigrams):

['This is a', 'is a simple', 'a simple example', 'simple example .']

The method generate_ngrams() allows us to generate n-grams from tokenized documents. We can then get the results with the ngrams property:

[84]:
del preproc

preproc = preproc_orig.copy()  # make a copy from full data

preproc.generate_ngrams(2)  # generate bigrams
preproc.ngrams['NewsArticles-1880'][:10]  # show first 10 bigrams of this document
[84]:
[['White', 'House'],
 ['House', 'aides'],
 ['aides', 'told'],
 ['told', 'to'],
 ['to', 'keep'],
 ['keep', 'Russia'],
 ['Russia', '-'],
 ['-', 'related'],
 ['related', 'materials'],
 ['materials', 'Lawyers']]

You may afterwards use join_ngrams() to merge the generated n-grams into joined tokens and use these as the new tokens in this TMPreproc instance:

[85]:
preproc.join_ngrams()
preproc.tokens_datatable
[85]:
      doc                position   token                   lemma                   whitespace
0     NewsArticles-1880  0          White House             White House             1
1     NewsArticles-1880  1          House aides             House aides             1
2     NewsArticles-1880  2          aides told              aides told              1
3     NewsArticles-1880  3          told to                 told to                 1
4     NewsArticles-1880  4          to keep                 to keep                 1
5     NewsArticles-1880  5          keep Russia             keep Russia             1
6     NewsArticles-1880  6          Russia -                Russia -                1
7     NewsArticles-1880  7          - related               - related               1
8     NewsArticles-1880  8          related materials       related materials       1
9     NewsArticles-1880  9          materials Lawyers       materials Lawyers       1
10    NewsArticles-1880  10         Lawyers for             Lawyers for             1
11    NewsArticles-1880  11         for the                 for the                 1
12    NewsArticles-1880  12         the Trump               the Trump               1
13    NewsArticles-1880  13         Trump administration    Trump administration    1
14    NewsArticles-1880  14         administration have     administration have     1
...
1934  NewsArticles-99    1052       and non                 and non                 1
1935  NewsArticles-99    1053       non -                   non -                   1
1936  NewsArticles-99    1054       - recyclable            - recyclable            1
1937  NewsArticles-99    1055       recyclable items        recyclable items        1
1938  NewsArticles-99    1056       items .                 items .                 1
[86]:
del preproc

Generating a sparse document-term matrix (DTM)

If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which represents the number of occurrences of each term (i.e. vocabulary token) in each document. This is an N rows by M columns matrix, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.

For this example, we’ll generate a DTM from the preproc_orig instance. First, let’s check the number of documents and the vocabulary size:

[87]:
preproc_orig.n_docs, preproc_orig.vocabulary_size
[87]:
(3, 683)

We can use the dtm property to generate a sparse DTM from the current instance:

[88]:
preproc_orig.dtm
[88]:
<3x683 sparse matrix of type '<class 'numpy.int32'>'
        with 816 stored elements in Compressed Sparse Row format>

We can see that a sparse matrix with 3 rows (which corresponds with the number of documents) and 683 columns was generated (which corresponds to the vocabulary size). 816 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. dense, representation and see parts of its elements:

[89]:
preproc_orig.dtm.todense()
[89]:
matrix([[ 1,  0,  4, ...,  0,  0,  0],
        [ 2,  1, 14, ...,  0,  3,  0],
        [ 2,  0, 32, ...,  2,  5,  5]], dtype=int32)

However, note that you should only convert a sparse matrix to a dense representation when you’re either dealing with a small amount of data (which is what we’re doing in this example), or use only a part of the full matrix. Converting a sparse matrix to a dense representation can otherwise easily exceed the available computer memory.

There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in Compressed Sparse Row (CSR) format. This format allows indexing and is especially optimized for fast row access. You may convert it to any other sparse matrix format; see the mentioned SciPy documentation for this.
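For example, the CSR matrix returned by the dtm property can be converted to other SciPy sparse formats with the usual SciPy methods (a short sketch):

dtm_csr = preproc_orig.dtm   # Compressed Sparse Row format (default)

dtm_csc = dtm_csr.tocsc()    # Compressed Sparse Column: fast column slicing
dtm_coo = dtm_csr.tocoo()    # COOrdinate format: useful for constructing / converting matrices
type(dtm_csc), type(dtm_coo)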

The rows of the DTM are aligned to the sequence of the document labels and its columns are aligned to the vocabulary. For example, let’s find the frequency of the term “House” in the document “NewsArticles-1880”. To do this, we find out the indices into the matrix:

[90]:
preproc_orig.doc_labels.index('NewsArticles-1880')
[90]:
0
[91]:
preproc_orig.vocabulary.index('House')
[91]:
67

This means the frequency of the term “House” in the document “NewsArticles-1880” is located in row 0 and column 67 of the DTM:

[92]:
preproc_orig.dtm[0, 67]
[92]:
4

See also the following example of finding out the index for “administration” and then getting an array that represents the number of occurrences of this token across all three documents:

[93]:
vocab_admin_ix = preproc_orig.vocabulary.index('administration')
preproc_orig.dtm[:, vocab_admin_ix].todense()
[93]:
matrix([[4],
        [1],
        [0]], dtype=int32)

Apart from the dtm property, there’s also the get_dtm() method, which can also return the result as a datatable Frame or pandas DataFrame. Note that these representations are not sparse and hence can consume a lot of memory.

[94]:
preproc_orig.get_dtm(as_datatable=True)
DatatableWarning: Duplicate column name found, and was assigned a unique name: '.' -> '.0'
[94]:
      _doc               \n\n  ' '  "   %   '   's  (   )   ,   ...  work  world  would  you  your
0     NewsArticles-1880  1     0    4   0   1   3   0   0   9   ...  0     0      0      0    0
1     NewsArticles-3350  2     1    14  0   1   6   0   0   28  ...  0     1      0      3    0
2     NewsArticles-99    2     0    32  5   0   3   2   2   33  ...  1     0      2      5    5

Serialization: Saving and loading TMPreproc objects

The current state of a TMPreproc object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it using that file. The methods for that are save_state() and load_state() / from_state().

Let’s store the current state of the preproc_orig instance:

[95]:
preproc_orig.print_summary()
preproc_orig.save_state('data/preproc_state.pickle')
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[95]:
<TMPreproc [3 documents / en]>

Let’s change the object by retaining only documents that contain the token “house” (see the reduced number of documents):

[96]:
preproc_orig.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc_orig.print_summary()
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485
[96]:
<TMPreproc [2 documents / en]>

We can restore the saved data using from_state():

[97]:
preproc_restored = TMPreproc.from_state('data/preproc_state.pickle')
preproc_restored.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[97]:
<TMPreproc [3 documents / en]>

You can see that the full dataset with three documents was restored.

This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. Once you’re finished running these operations, you can store the current state to disk and later restore it without having to re-run them.

Functional API

The TMPreproc class provides a convenient object-oriented interface for parallel text processing and analysis. There is also a functional API provided in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. The functional API does not provide parallel processing.

To initialize the functional API for a certain language, you need to start with init_for_language() and may then tokenize your raw text documents via tokenize(), which will generate a list of spaCy documents. Most other functions in this API accept such a list of spaCy documents as input.

from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')
docs = tokenize(['Hello this is a test.', 'And here comes another one.'])

The final result after applying preprocessing steps and hence transforming the text data is often a document-term matrix (DTM). The bow module contains several functions to work with DTMs, e.g. apply transformations such as tf-idf or compute some important summary statistics. The next chapter will introduce some of these functions.
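As a small preview, and assuming the tfidf function in tmtoolkit.bow.bow_stats (check the bow module documentation for the exact names and signatures), applying a tf-idf transformation to the DTM from above could look like this sketch:

from tmtoolkit.bow.bow_stats import tfidf

dtm = preproc_restored.dtm   # sparse DTM from the restored instance above
dtm_tfidf = tfidf(dtm)       # tf-idf weighted matrix of the same shape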