Text preprocessing
During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation marks, numbers, etc.) and the resulting tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that later analysis steps become easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the tmtoolkit.preprocess module.
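To make the idea of tokenization concrete, here is a minimal pure-Python sketch that splits a string into word, number and punctuation tokens with a regular expression. This is only a toy illustration with a made-up function name; tmtoolkit itself relies on spaCy's far more sophisticated tokenizer.

```python
import re

def naive_tokenize(text):
    """Split a string into word, number and punctuation tokens.
    Toy illustration only -- tmtoolkit uses spaCy's tokenizer."""
    # \w+ matches runs of word characters (incl. digits),
    # [^\w\s] matches single punctuation characters
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("Prices rose 3.5% in 2020, analysts said.")
print(tokens)
```

Note how punctuation becomes separate tokens, which is why later cleaning steps (removing punctuation, numbers, etc.) operate on the token level.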
Parallel processing with the TMPreproc class
You can pass a dict-like dataset (i.e. anything that maps document labels to their plain text contents, e.g. a tmtoolkit Corpus object) to the TMPreproc class and can then apply several text processing methods to it. You can chain these processing steps by applying one method after another and examine the results.
Under the hood, the spaCy package is used to perform most NLP methods. However, TMPreproc offers much more functionality than spaCy, including flexible token and document filtering. The most important advantage of using TMPreproc is that it employs parallel processing, i.e. it uses all available processors on your machine for the computations necessary during preprocessing. For large text corpora, this can lead to a substantial speed-up.
Using the functional API
Apart from the TMPreproc class, tmtoolkit also provides several functions in the tmtoolkit.preprocess module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. Note that only the latter provides parallel processing.
Loading example data
Let’s load a sample of three documents from the built-in NewsArticles dataset. We use only a small number of documents here to keep the initial output easy to follow; we can switch to a larger sample later.
[1]:
import random
random.seed(20191018) # to make the sampling reproducible
from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize
corpus_small = Corpus.from_builtin_corpus('en-NewsArticles').sample(3)
Optional: enabling logging output
By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long running operations, it’s helpful to enable logging output display, which can be done as follows:
import logging
logging.basicConfig(level=logging.INFO)
tmtoolkit_log = logging.getLogger('tmtoolkit')
# set the minimum log level to display, for instance also logging.DEBUG
tmtoolkit_log.setLevel(logging.INFO)
tmtoolkit_log.propagate = True
Creating a TMPreproc object
You can create a TMPreproc object (also known as an “instance”) by passing a dict that maps document labels to (untokenized) documents. Since a tmtoolkit Corpus behaves like a dict, we can pass our corpus_small object. We also need to specify the corpus language as a two-letter ISO 639-1 language code (here "en" for English).
[2]:
from tmtoolkit.preprocess import TMPreproc
preproc = TMPreproc(corpus_small, language='en')
preproc
[2]:
<TMPreproc [3 documents / en]>
The above will first distribute all documents to several sub-processes which are later used to run the computations in parallel. The number of sub-processes can be controlled via n_max_processes; it defaults to the number of CPU cores in your machine. The documents are distributed to the processes according to their size. E.g. when you have two CPU cores, one very large document and three small documents, CPU 1 will handle the large document alone while CPU 2 takes the three small documents. After distribution, the documents are directly tokenized (in parallel). Hence, for a large corpus, creating a TMPreproc object may take some time because of the tokenization process.
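The size-based distribution described above can be sketched as a greedy assignment: each document goes to the currently least-loaded worker. The function name and the exact strategy below are made up for illustration; TMPreproc's actual scheduling may differ in detail.

```python
def distribute_docs(doc_sizes, n_workers):
    """Greedily assign documents to workers so that total token
    counts stay balanced. Sketch only -- not TMPreproc's code."""
    workers = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    # largest documents first, so big ones don't pile onto a busy worker
    for label, size in sorted(doc_sizes.items(), key=lambda x: -x[1]):
        i = loads.index(min(loads))  # least-loaded worker
        workers[i].append(label)
        loads[i] += size
    return workers

sizes = {'big': 1000, 'a': 100, 'b': 90, 'c': 80}
print(distribute_docs(sizes, 2))  # the large doc gets a worker to itself
```

With two workers, the one large document occupies one worker alone while the three small documents share the other, mirroring the two-CPU example above.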
Our TMPreproc object preproc is now set up to work with the documents passed in corpus_small and the language 'en' for English. All further operations on this object will use the specified documents and language. All documents have already been tokenized.
The method print_summary() is very handy and we will use it quite often. It displays a small summary of the documents in the TMPreproc object. N=... denotes the number of tokens in the respective document.
[3]:
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=657): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1947 / vocabulary size: 683
[3]:
<TMPreproc [3 documents / en]>
Accessing tokens, vocabulary and other important properties
TMPreproc provides several properties to access its data and some summary statistics. These properties are read-only, i.e. you can retrieve the results but cannot assign new values to them.
First, let’s have a look at the labels (names) of the documents:
[4]:
preproc.doc_labels
[4]:
['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']
We can access the tokens of each document by using the tokens property:
[5]:
# use [:10] slice to show only the first 10 tokens
preproc.tokens['NewsArticles-1880'][:10]
[5]:
['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials']
If you prefer a tabular output, you can also access the tokens and their metadata as pandas DataFrames or datatable Frames.
A note on the use of datatable Frames
If you have installed the datatable package, many functions and methods in tmtoolkit return or accept datatable Frames instead of (the more commonly known) pandas DataFrames. This is because the former is much faster and more memory-efficient in most cases. You can always convert between the two like this:
import datatable as dt
import pandas as pd
# a pandas DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})
# DataFrame to datatable:
dtable = dt.Frame(df)
# and vice versa datatable to DataFrame:
df == dtable.to_pandas()
# Out:
# a b
# 0 True True
# 1 True True
# 2 True True
Even first creating a datatable and then converting to a DataFrame is often faster than directly creating a DataFrame.
You can use the tokens_dataframe or tokens_datatable properties for tabular output. The datatable Frame consists of at least five columns: the document label, the position of the token within the document (zero-indexed), the token itself, its lemma and whitespace. The lemma column contains the token’s lemma and the whitespace column indicates whether a whitespace character follows the token in the text. Please note that for large amounts of data, tokens_datatable is usually quicker than tokens_dataframe.
[6]:
preproc.tokens_datatable
[6]:
doc | position | token | lemma | whitespace | |
---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪ | |
0 | NewsArticles-1880 | 0 | White | White | 1 |
1 | NewsArticles-1880 | 1 | House | House | 1 |
2 | NewsArticles-1880 | 2 | aides | aide | 1 |
3 | NewsArticles-1880 | 3 | told | tell | 1 |
4 | NewsArticles-1880 | 4 | to | to | 1 |
5 | NewsArticles-1880 | 5 | keep | keep | 1 |
6 | NewsArticles-1880 | 6 | Russia | Russia | 0 |
7 | NewsArticles-1880 | 7 | - | - | 0 |
8 | NewsArticles-1880 | 8 | related | relate | 1 |
9 | NewsArticles-1880 | 9 | materials | material | 0 |
10 | NewsArticles-1880 | 10 | 0 | ||
11 | NewsArticles-1880 | 11 | Lawyers | Lawyers | 1 |
12 | NewsArticles-1880 | 12 | for | for | 1 |
13 | NewsArticles-1880 | 13 | the | the | 1 |
14 | NewsArticles-1880 | 14 | Trump | Trump | 1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1942 | NewsArticles-99 | 1055 | non | non | 0 |
1943 | NewsArticles-99 | 1056 | - | - | 0 |
1944 | NewsArticles-99 | 1057 | recyclable | recyclable | 1 |
1945 | NewsArticles-99 | 1058 | items | item | 0 |
1946 | NewsArticles-99 | 1059 | . | . | 0 |
More columns may be shown when you add token metadata (more on that later).
The method get_tokens() gives you more options for accessing the tokens. For example, you can get all tokens with their metadata as a nested dictionary of the form document label -> metadata key (e.g. “lemma”) -> values.
[7]:
doctokens = preproc.get_tokens(with_metadata=True, as_datatables=False)
doctokens['NewsArticles-1880'].keys()
[7]:
dict_keys(['token', 'lemma', 'whitespace'])
[8]:
# lemmata for the first 10 tokens in this document
doctokens['NewsArticles-1880']['lemma'][:10]
[8]:
['White',
'House',
'aide',
'tell',
'to',
'keep',
'Russia',
'-',
'relate',
'material']
You may also want to access the re-constructed full text of each document via the texts property. This returns a dict that maps document labels to their text. Here we only display the first 100 characters of a single document:
[9]:
preproc.texts['NewsArticles-1880'][:100]
[9]:
'White House aides told to keep Russia-related materials\n\nLawyers for the Trump administration have i'
As mentioned in the beginning, tmtoolkit’s preprocessing module uses spaCy internally for most NLP tasks. If you want direct access to the spaCy documents, you can use the spacy_docs property. Here, we access a single spaCy document and check its is_tagged attribute:
[10]:
preproc.spacy_docs['NewsArticles-1880'].is_tagged
[10]:
False
You can also retrieve the document and token vectors from the word embeddings representation of the documents. For this, however, you need to create a TMPreproc instance with the argument enable_vectors=True:
[11]:
preproc_vec = TMPreproc(corpus_small, language='en', enable_vectors=True)
preproc_vec.vectors_enabled
[11]:
True
Now you may access the document vectors via the doc_vectors property:
[12]:
# displaying only the first 10 values of a single
# document's document vector
preproc_vec.doc_vectors['NewsArticles-1880'][:10]
[12]:
array([-7.0222005e-02, 8.1240870e-02, -3.9869484e-02, 1.8360456e-02,
1.9232498e-02, -2.5533361e-02, -2.9136341e-02, -1.0187237e-01,
1.6649088e-03, 2.4026785e+00], dtype=float32)
Token vectors are also available via the token_vectors property:
[13]:
# displaying only a single document's token matrix
preproc_vec.token_vectors['NewsArticles-1880']
[13]:
array([[-0.39347 , -0.061407, 0.015231, ..., 0.046462, 0.058398,
0.46169 ],
[ 0.19847 , 0.18087 , -0.089119, ..., -0.24263 , -0.035183,
-0.29661 ],
[ 0.28059 , -0.45684 , 0.414 , ..., -0.31501 , -0.31649 ,
-0.026392],
...,
[-0.08267 , 0.092944, 0.028411, ..., 0.49965 , -0.17115 ,
0.27578 ],
[ 0.01327 , 0.51269 , -0.35735 , ..., 0.19492 , 0.058496,
0.26636 ],
[ 0.012001, 0.20751 , -0.12578 , ..., 0.13871 , -0.36049 ,
-0.035 ]], dtype=float32)
[14]:
del preproc_vec
The following gives you the number of documents and number of unique tokens respectively:
[15]:
preproc.n_docs
[15]:
3
[16]:
preproc.n_tokens
[16]:
1947
We can also access the number of tokens in each document via the doc_lengths property:
[17]:
# displaying only a single document's length here
preproc.doc_lengths['NewsArticles-1880']
[17]:
230
The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use the property vocabulary for that and the property vocabulary_counts to additionally get the number of times each token appears in the corpus.
[18]:
preproc.vocabulary[:10] # displaying only the first 10 here
[18]:
['\n\n', ' ', '"', '%', "'", "'s", '(', ')', ',', '-']
[19]:
# number of unique tokens in all documents
preproc.vocabulary_size
[19]:
683
[20]:
# how often the word "the" occurs in the whole corpus
preproc.vocabulary_counts['the']
[20]:
82
The latter returns a Python Counter object, so we can apply its useful methods, e.g. most_common() to get the most frequent tokens:
[21]:
preproc.vocabulary_counts.most_common()[:10]
[21]:
[('the', 82),
(',', 70),
('.', 60),
('to', 53),
('"', 50),
('and', 46),
('in', 39),
('a', 31),
('of', 25),
('that', 22)]
The document frequency of a token is the number of documents in which this token occurs at least once. The properties vocabulary_abs_doc_frequency and vocabulary_rel_doc_frequency return this measure as absolute frequency or proportion respectively:
[22]:
(preproc.vocabulary_abs_doc_frequency['Trump'],
preproc.vocabulary_rel_doc_frequency['Trump'])
[22]:
(2, 0.6666666666666666)
[23]:
(preproc.vocabulary_abs_doc_frequency['Russia'],
preproc.vocabulary_rel_doc_frequency['Russia'])
[23]:
(1, 0.3333333333333333)
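For a small toy corpus, these measures are easy to verify by hand. The following pure-Python sketch (document labels and tokens are made up) mirrors what vocabulary_counts, vocabulary_abs_doc_frequency and vocabulary_rel_doc_frequency report:

```python
from collections import Counter

# toy tokenized corpus: document label -> list of tokens
docs = {
    'd1': ['trump', 'russia', 'trump'],
    'd2': ['trump', 'cabin'],
    'd3': ['bins', 'bathroom'],
}

# corpus-wide token counts (cf. vocabulary_counts)
vocab_counts = Counter(tok for toks in docs.values() for tok in toks)

# absolute document frequency: in how many documents does a token occur
# at least once? (cf. vocabulary_abs_doc_frequency)
abs_doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))

# relative document frequency: proportion of documents containing the token
# (cf. vocabulary_rel_doc_frequency)
rel_doc_freq = {tok: n / len(docs) for tok, n in abs_doc_freq.items()}

print(vocab_counts['trump'], abs_doc_freq['trump'], rel_doc_freq['trump'])
```

Note how "trump" occurs three times in total but only in two of the three documents, so its absolute document frequency is 2 and its relative document frequency is 2/3.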
Part-of-Speech (POS) tagging
Part-of-speech (POS) tagging determines the grammatical word category for each token in a document. The method pos_tag() applies this to the whole corpus. The identified POS tags are added as metadata to each token. These tags conform to a specific tagset which is explained in the spaCy documentation. The POS tags can be used to annotate and filter the documents. Let’s apply POS tagging:
[24]:
preproc.pos_tag()
[24]:
<TMPreproc [3 documents / en]>
We can now see a new column pos with the identified POS tag for each token:
[25]:
preproc.tokens_datatable
[25]:
doc | position | token | lemma | pos | whitespace | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪ | |
0 | NewsArticles-1880 | 0 | White | White | PROPN | 1 |
1 | NewsArticles-1880 | 1 | House | House | PROPN | 1 |
2 | NewsArticles-1880 | 2 | aides | aide | NOUN | 1 |
3 | NewsArticles-1880 | 3 | told | tell | VERB | 1 |
4 | NewsArticles-1880 | 4 | to | to | PART | 1 |
5 | NewsArticles-1880 | 5 | keep | keep | VERB | 1 |
6 | NewsArticles-1880 | 6 | Russia | Russia | PROPN | 0 |
7 | NewsArticles-1880 | 7 | - | - | PUNCT | 0 |
8 | NewsArticles-1880 | 8 | related | relate | VERB | 1 |
9 | NewsArticles-1880 | 9 | materials | material | NOUN | 0 |
10 | NewsArticles-1880 | 10 | SPACE | 0 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | NOUN | 1 |
12 | NewsArticles-1880 | 12 | for | for | ADP | 1 |
13 | NewsArticles-1880 | 13 | the | the | DET | 1 |
14 | NewsArticles-1880 | 14 | Trump | trump | ADJ | 1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1942 | NewsArticles-99 | 1055 | non | non | ADJ | 0 |
1943 | NewsArticles-99 | 1056 | - | - | ADJ | 0 |
1944 | NewsArticles-99 | 1057 | recyclable | recyclable | ADJ | 1 |
1945 | NewsArticles-99 | 1058 | items | item | NOUN | 0 |
1946 | NewsArticles-99 | 1059 | . | . | PUNCT | 0 |
Aside: TMPreproc as “state machine”
Before continuing, we should clarify that a TMPreproc instance is a “state machine”, i.e. its contents (the documents) and behavior can change when you call a method. An example:
corpus = {
"doc1": "Hello world!",
"doc2": "Another example"
}
preproc = TMPreproc(corpus) # documents are directly tokenized
preproc.tokens
# Out:
# {
# 'doc1': ['Hello', 'world', '!'],
# 'doc2': ['Another', 'example']
# }
preproc.tokens_to_lowercase() # this changes the documents
preproc.tokens
# Out:
# {
# 'doc1': ['hello', 'world', '!'],
# 'doc2': ['another', 'example']
# }
As you can see, the tokens “inside” preproc are changed in place. It’s important to note that after calling tokens_to_lowercase(), the tokens in preproc were transformed and the original tokens from before the call are no longer available. In Python, assigning a mutable object to a variable only binds the same object to a different name; it doesn’t copy it. Since a TMPreproc object is mutable (you can change its state by calling its methods), simply assigning it to another variable (say preproc_upper) gives us two names for the same object, and calling a method via one name changes the data seen through both.
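This aliasing behavior can be demonstrated with any mutable Python object, e.g. a plain list:

```python
import copy

# assignment aliases a mutable object, it does not copy it
a = ['Hello', 'world']
b = a            # b is just another name for the same list object
b[0] = 'hello'   # editing via b ...
print(a)         # ... is visible via a as well

c = copy.copy(a)        # an actual (shallow) copy
c[0] = 'HELLO'
print(a is b, a is c)   # same object vs. separate object
print(a)                # unchanged by the edit to c
```

TMPreproc.copy(), shown next, plays the role of copy.copy() here: it creates a genuinely separate object.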
Copying TMPreproc objects
What can we do about that? We need to copy the object, which can be done with the TMPreproc.copy() method. This creates a separate TMPreproc object that the new variable preproc_upper points to.
[26]:
preproc_upper = preproc.copy()
[27]:
# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)
[27]:
(140426331677504, 140426727032000)
[28]:
preproc_upper.transform_tokens(str.upper)
# the transformation was applied only to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary
[28]:
False
[29]:
# show a sample
preproc_upper.tokens['NewsArticles-1880'][:10]
[29]:
['WHITE',
'HOUSE',
'AIDES',
'TOLD',
'TO',
'KEEP',
'RUSSIA',
'-',
'RELATED',
'MATERIALS']
[30]:
# the original "preproc" still holds the same data
preproc.tokens['NewsArticles-1880'][:10]
[30]:
['White',
'House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials']
Note that this also uses twice as much memory now. So you shouldn’t create copies too often, and you should release unused memory by using del:
[31]:
# removing the objects again
del preproc_upper
Lemmatization and term normalization
Before we start with token normalization, we will create a copy of the original TMPreproc object and its data, so that we can use it later for comparison:
[32]:
preproc_orig = preproc.copy()
Lemmatization reduces a token, if it is a word, to its base form. The lemma is already determined during the tokenization process and is available in the lemma metadata column. However, when you want to further process the tokens on the basis of their lemmata, you should use the lemmatize() method. This method sets the lemmata as tokens and all further processing will happen on the lemmatized tokens:
[33]:
preproc.lemmatize()
preproc.tokens_datatable
[33]:
doc | position | token | lemma | pos | whitespace | |
---|---|---|---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | ▪ | |
0 | NewsArticles-1880 | 0 | White | White | PROPN | 1 |
1 | NewsArticles-1880 | 1 | House | House | PROPN | 1 |
2 | NewsArticles-1880 | 2 | aide | aide | NOUN | 1 |
3 | NewsArticles-1880 | 3 | tell | tell | VERB | 1 |
4 | NewsArticles-1880 | 4 | to | to | PART | 1 |
5 | NewsArticles-1880 | 5 | keep | keep | VERB | 1 |
6 | NewsArticles-1880 | 6 | Russia | Russia | PROPN | 0 |
7 | NewsArticles-1880 | 7 | - | - | PUNCT | 0 |
8 | NewsArticles-1880 | 8 | relate | relate | VERB | 1 |
9 | NewsArticles-1880 | 9 | material | material | NOUN | 0 |
10 | NewsArticles-1880 | 10 | SPACE | 0 | ||
11 | NewsArticles-1880 | 11 | lawyer | lawyer | NOUN | 1 |
12 | NewsArticles-1880 | 12 | for | for | ADP | 1 |
13 | NewsArticles-1880 | 13 | the | the | DET | 1 |
14 | NewsArticles-1880 | 14 | trump | trump | ADJ | 1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1942 | NewsArticles-99 | 1055 | non | non | ADJ | 0 |
1943 | NewsArticles-99 | 1056 | - | - | ADJ | 0 |
1944 | NewsArticles-99 | 1057 | recyclable | recyclable | ADJ | 1 |
1945 | NewsArticles-99 | 1058 | item | item | NOUN | 0 |
1946 | NewsArticles-99 | 1059 | . | . | PUNCT | 0 |
As we can see, the lemma column was copied over to the token column.
Stemming
tmtoolkit doesn’t support stemming directly, since lemmatization is generally accepted as the better approach for reducing different word forms to a common base form. However, you may install NLTK and apply stemming by using the transform_tokens() method together with an NLTK stemmer’s stem() function.
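If you don’t want to pull in NLTK, the idea can be sketched with a naive suffix stripper standing in for a real stemmer. The rules below are made up purely for illustration and are far cruder than e.g. NLTK’s PorterStemmer, whose stem method you would pass to transform_tokens() in the same way:

```python
def naive_stem(token):
    """Extremely crude stand-in for a real stemmer: strip a few
    common English suffixes. For serious work use e.g.
    nltk.stem.PorterStemmer().stem via transform_tokens()."""
    for suffix in ('ation', 'ings', 'ing', 'es', 's'):
        # only strip if a reasonably long stem remains
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

print([naive_stem(t) for t in ['materials', 'keeping', 'bins']])
```

A real stemmer would be applied per token in exactly the same way, e.g. preproc.transform_tokens(PorterStemmer().stem) (hypothetical usage, assuming NLTK is installed).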
Depending on how you further want to analyze the data, it may be necessary to “clean” or “normalize” your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).
If you want to remove certain characters from all tokens in your documents, you can use remove_chars_in_tokens() and pass it a sequence of characters to remove. There is also a shortcut, remove_special_chars_in_tokens(), which removes all “special characters” (all characters in string.punctuation by default).
[34]:
preproc.remove_chars_in_tokens(['-']) # remove only "-"
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom ? Our bat...
total number of tokens: 1947 / vocabulary size: 596
[34]:
<TMPreproc [3 documents / en]>
[35]:
# remove all punctuation
preproc.remove_special_chars_in_tokens()
preproc.print_summary() # the "?" also vanishes
3 documents in language English:
> NewsArticles-1880 (N=230): White House aide tell to keep Russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): Should you have two bin in your bathroom Our bathr...
total number of tokens: 1947 / vocabulary size: 580
[35]:
<TMPreproc [3 documents / en]>
A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with tokens_to_lowercase():
[36]:
preproc.tokens_to_lowercase()
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): white house aide tell to keep russia relate materi...
> NewsArticles-3350 (N=657): frustration as cabin electronic ban come into forc...
> NewsArticles-99 (N=1060): should you have two bin in your bathroom our bathr...
total number of tokens: 1947 / vocabulary size: 562
[36]:
<TMPreproc [3 documents / en]>
The method clean_tokens() finally applies several steps that remove tokens meeting certain criteria. This includes removing:

- punctuation tokens
- stopwords (very common words for the given language)
- empty tokens (i.e. '')
- tokens that are longer or shorter than a certain number of characters
- numbers

Note that this is a language-dependent method, because the default stopword list is determined per language. This method has many parameters to tweak, so it’s recommended to check out the documentation.
[37]:
# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)
preproc.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=130): white house aide tell keep russia relate material ...
> NewsArticles-3350 (N=309): frustration cabin electronic ban come force passen...
> NewsArticles-99 (N=486): bin bathroom bathroom fill shampoo bottle toilet r...
total number of tokens: 925 / vocabulary size: 469
[37]:
<TMPreproc [3 documents / en]>
Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:
[38]:
preproc.doc_lengths, preproc_orig.doc_lengths
[38]:
({'NewsArticles-1880': 130, 'NewsArticles-3350': 309, 'NewsArticles-99': 486},
{'NewsArticles-1880': 230, 'NewsArticles-3350': 657, 'NewsArticles-99': 1060})
We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:
[39]:
len(preproc.vocabulary), len(preproc_orig.vocabulary)
[39]:
(469, 683)
You can also apply custom token transformation functions by using transform_tokens() and passing it a function that is applied to each token in each document (hence it must accept a single string argument).
First let’s define such a function. Here we create a simple function that should return a token’s “shape” in terms of the case of its characters:
[40]:
def token_shape(t):
return ''.join(['X' if str.isupper(c) else 'x' for c in t])
token_shape('EU'), token_shape('CamelCase'), token_shape('lower')
[40]:
('XX', 'XxxxxXxxx', 'xxxxx')
We can now apply this function to our documents (we will use the original documents here, because they were not transformed to lower case):
[41]:
preproc = preproc_orig.copy() # swap instances for later
preproc_orig.transform_tokens(token_shape) # apply function
preproc_orig.print_summary()
# remove instance
del preproc_orig
3 documents in language English:
> NewsArticles-1880 (N=230): Xxxxx Xxxxx xxxxx xxxx xx xxxx Xxxxxx x xxxxxxx xx...
> NewsArticles-3350 (N=657): Xxxxxxxxxxx xx xxxxx xxxxxxxxxxx xxx xxxxx xxxx xx...
> NewsArticles-99 (N=1060): Xxxxxx xxx xxxx xxx xxxx xx xxxx xxxxxxxx x xx Xxx...
total number of tokens: 1947 / vocabulary size: 32
Expanding compound words and joining tokens
Compound words like “US-Student” or “non-recyclable” can be expanded to separate tokens like “US”, “Student” and “non”, “recyclable” using expand_compound_tokens(). However, depending on the language model, most of these compounds will already be separated on initial tokenization.
[42]:
orig_vocab = preproc.vocabulary
preproc.expand_compound_tokens()
# create set difference to show vocabulary tokens
# that were expanded
set(orig_vocab) - set(preproc.vocabulary)
[42]:
{'Source:-Al'}
It’s also possible to join together certain subsequent occurrences of tokens or token patterns. This means you can, for example, transform all subsequent occurrences of the tokens “White” and “House” into single tokens “White_House”. In case you don’t use n-grams (described in a separate section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as persons, institutions or concepts like “Climate Change”, as a single token. The method to use for this is glue_tokens(). It accepts the following parameters:

- a patterns sequence of length N that is used to match N subsequent tokens;
- a glue string that is used to join the matched subsequent tokens (by default: "_").
Along with that, you can adjust the token matching with the common token matching parameters described below.
Let’s “glue” all subsequent occurrences of “White” and “House”. The glue_tokens() method will return a set of glued tokens that matched the provided pattern:
[43]:
preproc_orig = preproc.copy() # make a copy of full orig. data for later use
preproc.glue_tokens(['White', 'House'])
[43]:
{'White_House'}
[44]:
preproc.tokens['NewsArticles-1880'][:20]
[44]:
['White_House',
'aides',
'told',
'to',
'keep',
'Russia',
'-',
'related',
'materials',
'\n\n',
'Lawyers',
'for',
'the',
'Trump',
'administration',
'have',
'instructed',
'White_House',
'aides',
'to']
[45]:
del preproc
Keywords-in-context (KWIC) and general filtering methods
Keywords-in-context (KWIC) allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.
TMPreproc provides three methods for this purpose:

- get_kwic() is the base method, accepting a search pattern and several options that control how the search pattern is matched (more on that below); use this method when you want to further process the output of a KWIC search;
- get_kwic_table() is the more “user friendly” version of the above method, as it produces a datatable with the keyword highlighted by default;
- filter_tokens_with_kwic() works similarly to the above methods but applies the result by filtering the documents; it is explained in the section on filtering.
Let’s see the KWIC methods in action:
[46]:
preproc = preproc_orig.copy() # use orig. full data
preproc.get_kwic('house', ignore_case=True)
[46]:
{'NewsArticles-1880': [['White', 'House', 'aides', 'told'],
['instructed', 'White', 'House', 'aides', 'to'],
['The', 'White', 'House', 'is', 'simply'],
['the', 'White', 'House', 'and', 'law']],
'NewsArticles-3350': [],
'NewsArticles-99': [['of', 'the', 'house', ',', '"']]}
The method returns a dictionary that maps document labels to the KWIC results. Each document contains a list of “contexts”, i.e. lists of tokens that surround a keyword, here "house". The keyword stands in the middle and is surrounded by its “context tokens”, which by default means two tokens to the left and two to the right (fewer when the keyword is near the start or end of a document).
We can see that NewsArticles-1880 contains four contexts, NewsArticles-99 one context and NewsArticles-3350 none.
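Conceptually, a KWIC search just slides over the token list and cuts out a window around each match. A minimal sketch (function name and defaults are made up; this is not tmtoolkit’s implementation):

```python
def kwic(tokens, keyword, context_size=2, ignore_case=True):
    """Return a list of context windows (lists of tokens) around each
    occurrence of `keyword`; windows are clipped at document borders."""
    key = keyword.lower() if ignore_case else keyword
    contexts = []
    for i, tok in enumerate(tokens):
        t = tok.lower() if ignore_case else tok
        if t == key:
            # slice is clipped automatically at the end of the list
            contexts.append(tokens[max(0, i - context_size):i + context_size + 1])
    return contexts

toks = ['The', 'White', 'House', 'is', 'simply', 'taking', 'steps']
print(kwic(toks, 'house'))
```

The clipping at document borders is why some contexts in the real output above are shorter than the full window size.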
With get_kwic_table(), we get back a datatable which provides better formatting for quick investigation. See how the matched tokens are highlighted as *house* and empty results are removed:
[47]:
preproc.get_kwic_table('house', ignore_case=True)
[47]:
doc | context | kwic | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | |
0 | NewsArticles-1880 | 0 | White *House* aides told |
1 | NewsArticles-1880 | 1 | instructed White *House* aides to |
2 | NewsArticles-1880 | 2 | The White *House* is simply |
3 | NewsArticles-1880 | 3 | the White *House* and law |
4 | NewsArticles-99 | 0 | of the *house* , " |
An important parameter is context_size. It determines the number of tokens to display left and right of the found keyword. You can either pass a single integer for a symmetric context or a tuple of integers (<left>, <right>):
[48]:
preproc.get_kwic_table('house', ignore_case=True, context_size=4)
[48]:
doc | context | kwic | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | |
0 | NewsArticles-1880 | 0 | White *House* aides told to keep |
1 | NewsArticles-1880 | 1 | administration have instructed White *House* aides |
2 | NewsArticles-1880 | 2 | . " The White *House* is simply taking proactive |
3 | NewsArticles-1880 | 3 | Democrats to the White *House* and law enforcement |
4 | NewsArticles-99 | 0 | other rooms of the *house* , " says Jonny |
[49]:
preproc.get_kwic_table('house', ignore_case=True, context_size=(1, 4))
[49]:
doc | context | kwic | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | |
0 | NewsArticles-1880 | 0 | White *House* aides told to keep |
1 | NewsArticles-1880 | 1 | White *House* aides to preserve any |
2 | NewsArticles-1880 | 2 | White *House* is simply taking proactive |
3 | NewsArticles-1880 | 3 | White *House* and law enforcement agencies |
4 | NewsArticles-99 | 0 | the *house* , " says Jonny |
The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for exact (but case-insensitive) matches between the corpus tokens and our keyword "house". However, it is also possible to match patterns like "new*" (matching any word starting with “new”) or "agenc(y|ies)" (a regular expression matching “agency” and “agencies”). The next section introduces the different options for pattern matching.
Common parameters for pattern matching functions
Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions, but also functions for filtering tokens or documents, as you will see later. They all share similar function signatures, i.e. similar parameters:

- search_token or search_tokens: one or more patterns given as strings
- match_type: sets the matching type and can be one of the following options:
  - 'exact' (default): exact string matching (optionally ignoring character case), i.e. no pattern matching
  - 'regex': uses regular expression matching
  - 'glob': uses “glob patterns” like "politic*", which matches for example “politic”, “politics” or “politician” (see the globre package)
- ignore_case: ignore character case (applies to all three match types)
- glob_method: if match_type is 'glob', use this glob method; must be 'match' or 'search' (similar behavior to Python’s re.match and re.search)
- inverse: invert the match results, i.e. when matching for “hello”, return all results that do not match “hello”
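Python’s standard library can mimic the 'glob' match type: the fnmatch module translates glob patterns into regular expressions. Note that tmtoolkit actually uses the globre package, so edge cases may differ from this sketch:

```python
import fnmatch
import re

vocab = ['politic', 'politics', 'politician', 'policy', 'apolitical']

# glob matching: the pattern must cover the whole token
matches = [t for t in vocab if fnmatch.fnmatch(t, 'politic*')]
print(matches)

# the same glob expressed as a regular expression,
# roughly what match_type='regex' with re.match-style anchoring does
rx = re.compile(fnmatch.translate('politic*'))
print([t for t in vocab if rx.match(t)])
```

Because the glob is anchored at the start of the token, "apolitical" does not match; a 'search'-style glob method would match it as well.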
Let’s try out some of these options with get_kwic_table():
[50]:
# using a regular expression, ignoring case
preproc.get_kwic_table(r'agenc(y|ies)', match_type='regex', ignore_case=True)
[50]:
doc | context | kwic | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | |
0 | NewsArticles-1880 | 0 | law enforcement *agencies* to keep |
1 | NewsArticles-1880 | 1 | organizations , *agencies* and individuals |
2 | NewsArticles-3350 | 0 | Reuters news *agency* . Al |
3 | NewsArticles-3350 | 1 | and news *agencies* |
[51]:
# using a glob, ignoring case
preproc.get_kwic_table('pol*', match_type='glob', ignore_case=True)
[51]:
doc | context | kwic | |
---|---|---|---|
▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | |
0 | NewsArticles-1880 | 0 | false and *politically* motivated attacks |
1 | NewsArticles-99 | 0 | , senior *policy* adviser for |
[52]:
# using a glob, ignoring case
preproc.get_kwic_table('*sol*', match_type='glob', ignore_case=True)
[52]:
doc | context | kwic | |
---|---|---|---|
0 | NewsArticles-99 | 0 | potential simple *solution* that could |
1 | NewsArticles-99 | 1 | confused by *aerosols* . " |
2 | NewsArticles-99 | 2 | bottles , *aerosols* for deodorant |
[53]:
# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
preproc.get_kwic_table(r'[AEIOUaeiou]', match_type='regex', inverse=True)
[53]:
doc | context | kwic | |
---|---|---|---|
0 | NewsArticles-1880 | 0 | keep Russia *-* related materials |
1 | NewsArticles-1880 | 1 | related materials * * Lawyers for |
2 | NewsArticles-1880 | 2 | in the *2016* presidential election |
3 | NewsArticles-1880 | 3 | related investigations *,* ABC News |
4 | NewsArticles-1880 | 4 | has confirmed *.* " The |
5 | NewsArticles-1880 | 5 | confirmed . *"* The White |
6 | NewsArticles-1880 | 6 | motivated attacks *,* " an |
7 | NewsArticles-1880 | 7 | attacks , *"* an administration |
8 | NewsArticles-1880 | 8 | News Wednesday *.* The directive |
9 | NewsArticles-1880 | 9 | last week *by* Senate Democrats |
10 | NewsArticles-1880 | 10 | between Trump *'s* administration , |
11 | NewsArticles-1880 | 11 | 's administration *,* campaign and |
12 | NewsArticles-1880 | 12 | transition teams *"* ? or |
13 | NewsArticles-1880 | 13 | teams " *?* or anyone |
14 | NewsArticles-1880 | 14 | their behalf *"* ? and |
⋮ | ⋮ | ⋮ | ⋮ |
265 | NewsArticles-99 | 147 | two bins *?* There are |
266 | NewsArticles-99 | 148 | other options *.* Hang a |
267 | NewsArticles-99 | 149 | recycling bin *.* Or opt |
268 | NewsArticles-99 | 150 | and non *-* recyclable items |
269 | NewsArticles-99 | 151 | recyclable items *.* |
Filtering tokens and documents
We can use the pattern matching parameters in numerous filtering methods. The heart of many of these methods is token_match(). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a boolean NumPy array of the same length as the input tokens. Each occurrence of True in this array signals a match.
[54]:
from tmtoolkit.preprocess import token_match
# first 10 tokens of document "NewsArticles-1880"
doc_snippet = preproc.tokens['NewsArticles-1880'][:10]
# get all tokens that match "to*"
matches = token_match('to*', doc_snippet, match_type='glob')
# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc_snippet, matches):
    print(tok, ':', match)
White : False
House : False
aides : False
told : True
to : True
keep : False
Russia : False
- : False
related : False
materials : False
The token_match() function is a rather low-level function that you may use for pattern matching against any list or array of strings, e.g. a list of tokens, file names, etc.
The following methods cover common use cases for filtering during text preprocessing. Many of them start with either filter_...() or remove_...(), and these pairs of filter and remove methods are complements: a filter method always retains the matched elements, whereas a remove method always drops them. We can observe this with the first pair of methods, filter_tokens() and remove_tokens():
So much .copy()

Note that the following code snippets make a lot of use of the copy() method. This is because we want to show how the different methods work with the same original data (remember that a TMPreproc instance behaves like a state machine) and also want to “clean up” the temporary instances. Under normal circumstances, you wouldn’t use copy() so excessively.
[55]:
# retain only the tokens that match the pattern in each document
preproc.filter_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()
del preproc
3 documents in language English:
> NewsArticles-1880 (N=4): House House House House
> NewsArticles-3350 (N=0):
> NewsArticles-99 (N=3): house greenhouse household
total number of tokens: 7 / vocabulary size: 4
[56]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.remove_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()
del preproc
3 documents in language English:
> NewsArticles-1880 (N=226): White aides told to keep Russia - related material...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1057): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1941 / vocabulary size: 679
The pair filter_documents() and remove_documents() works similarly, but filters or drops whole documents according to the supplied match criteria. Both accept the standard pattern matching parameters but also a parameter matches_threshold with default value 1. When this number of matching tokens is reached, the document becomes part of the result set (filter_documents()) or is removed from the result set (remove_documents()). This way, we can for example retain only those documents that contain certain token patterns.
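The matches_threshold logic boils down to counting matching tokens per document. A plain-Python sketch of this idea (a simplified illustration with substring matching, not tmtoolkit's implementation):

```python
docs = {
    'doc1': ['White', 'House', 'house', 'garden'],
    'doc2': ['no', 'match', 'here'],
    'doc3': ['a', 'house'],
}

def count_matches(tokens, needle):
    """Number of tokens containing `needle` (case-insensitive)."""
    return sum(needle in tok.lower() for tok in tokens)

# filter_documents-style: keep documents with at least 2 matching tokens
kept = {lbl: toks for lbl, toks in docs.items()
        if count_matches(toks, 'house') >= 2}
sorted(kept)
# -> ['doc1']
```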
Let’s try these methods out in practice:
[57]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()
del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485
We can see that two out of three documents contained the pattern '*house*' and hence were retained.
We can also adjust matches_threshold to set the minimum number of token matches for filtering:
[58]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.filter_documents('*house*', match_type='glob', ignore_case=True,
matches_threshold=4)
preproc.print_summary()
del preproc
1 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
total number of tokens: 230 / vocabulary size: 140
[59]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.remove_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()
del preproc
1 documents in language English:
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 658 / vocabulary size: 288
When we use remove_documents(), we get only the documents that did not contain the specified pattern.
Another useful pair of methods is filter_documents_by_name() and remove_documents_by_name(). Both methods again accept the same pattern matching parameters but they only apply them to the document names, i.e. document labels:
[60]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.filter_documents_by_name(r'-\d{4}$', match_type='regex')
preproc.print_summary()
del preproc
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
total number of tokens: 888 / vocabulary size: 385
In the above example we wanted to retain only the documents whose labels end with exactly four digits, like “…-1234”. Hence, we only get “NewsArticles-1880” and “NewsArticles-3350” but not “NewsArticles-99”. Again, remove_documents_by_name() does the exact opposite.
You may also use keywords-in-context (KWIC) to filter your tokens within a window around certain keyword patterns. The method for that is called filter_tokens_with_kwic() and works very similarly to get_kwic(), but it filters the documents in the TMPreproc instance, with which you can continue working as usual. Here, we filter the tokens in each document to get the tokens directly before and after the glob pattern '*house*' (context_size=1):
[61]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.filter_tokens_with_kwic('*house*', context_size=1,
match_type='glob', ignore_case=True)
preproc.tokens_datatable
[61]:
doc | position | token | lemma | whitespace | |
---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | 1 |
1 | NewsArticles-1880 | 1 | House | House | 1 |
2 | NewsArticles-1880 | 2 | aides | aide | 1 |
3 | NewsArticles-1880 | 3 | White | White | 1 |
4 | NewsArticles-1880 | 4 | House | House | 1 |
5 | NewsArticles-1880 | 5 | aides | aide | 1 |
6 | NewsArticles-1880 | 6 | White | White | 1 |
7 | NewsArticles-1880 | 7 | House | House | 1 |
8 | NewsArticles-1880 | 8 | is | be | 1 |
9 | NewsArticles-1880 | 9 | White | White | 1 |
10 | NewsArticles-1880 | 10 | House | House | 1 |
11 | NewsArticles-1880 | 11 | and | and | 1 |
12 | NewsArticles-99 | 0 | the | the | 1 |
13 | NewsArticles-99 | 1 | house | house | 0 |
14 | NewsArticles-99 | 2 | , | , | 0 |
15 | NewsArticles-99 | 3 | of | of | 1 |
16 | NewsArticles-99 | 4 | greenhouse | greenhouse | 1 |
17 | NewsArticles-99 | 5 | gases | gas | 1 |
18 | NewsArticles-99 | 6 | UK | UK | 1 |
19 | NewsArticles-99 | 7 | household | household | 1 |
20 | NewsArticles-99 | 8 | threw | throw | 1 |
When you annotated your documents’ tokens with Part-of-Speech (POS) tags, you can also filter them using filter_for_pos():
[62]:
del preproc
preproc = preproc_orig.copy() # make a copy from full data
# apply POS tagging and retain only nouns
preproc.pos_tag().filter_for_pos('N').tokens_datatable
[62]:
doc | position | token | |
---|---|---|---|
[63]:
del preproc
In this example we filtered for tokens that were identified as nouns by passing the simplified POS tag 'N' (for more on simplified tags, see the method documentation). We can also filter for more than one tag, e.g. nouns or verbs, by passing a list of required POS tags.
filter_for_pos() has no remove_...() counterpart, but you can set the inverse parameter to True to achieve the same effect.
Finally, there are two methods for removing tokens based on their document frequency: remove_common_tokens() and remove_uncommon_tokens(). The former removes all tokens whose document frequency is greater than or equal to a threshold defined by the parameter df_threshold. The latter does the same for all tokens whose document frequency is lower than or equal to df_threshold. This parameter can either be a relative frequency (the default) or an absolute count (by setting absolute=True).
Before applying the method, let’s have a look at the number of tokens per document again, to later see how many we will remove. We will also store the vocabulary in orig_vocab for later comparison:
[64]:
preproc = preproc_orig.copy() # make a copy from full data
orig_vocab = preproc.vocabulary
preproc.doc_lengths
[64]:
{'NewsArticles-1880': 230, 'NewsArticles-3350': 658, 'NewsArticles-99': 1060}
[65]:
preproc.remove_common_tokens(df_threshold=0.9).doc_lengths
[65]:
{'NewsArticles-1880': 144, 'NewsArticles-3350': 413, 'NewsArticles-99': 700}
By removing all tokens with a document frequency threshold of 0.9, we removed quite a number of tokens in each document. Let’s investigate the vocabulary in order to see which tokens were removed:
[66]:
# set difference gives removed vocabulary tokens
set(orig_vocab) - set(preproc.vocabulary)
[66]:
{'\n\n',
'"',
"'s",
',',
'-',
'.',
'?',
'The',
'a',
'all',
'also',
'an',
'and',
'be',
'for',
'has',
'have',
'in',
'into',
'is',
'more',
'of',
'on',
'or',
'other',
'such',
'than',
'that',
'the',
'to',
'which',
'with'}
[67]:
del preproc
remove_uncommon_tokens() works similarly. This time, let’s use an absolute number as the threshold:
[68]:
preproc = preproc_orig.copy() # make a copy from full data
preproc.remove_uncommon_tokens(df_threshold=1, absolute=True)
# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(set(orig_vocab) - set(preproc.vocabulary))[:10]
[68]:
[' ', '%', '(', ')', '10', '12', '135,000', '2016', '38', '45']
The above means that we removed all tokens that appear in exactly one document.
[69]:
del preproc
Working with token metadata
TMPreproc allows you to attach arbitrary metadata to each token in each document. This kind of token “annotation” is very useful: for example, you may add metadata that records a token’s length or whether it consists of all uppercase letters, and later use that for filtering or in further analyses. One method to add such metadata is add_metadata_per_doc(). It requires a dict that maps document labels to the respective token metadata lists. Each list’s length must match the number of tokens in the respective document. At first we need to create such a metadata dict. Let’s do that for the tokens’ lengths first:
[70]:
preproc = preproc_orig.copy() # make a copy from full data
meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
for doc_label, doc_tokens in preproc.tokens.items()}
# show first 10 tokens and their string lengths for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
meta_tok_lengths['NewsArticles-1880'][:10]))
[70]:
[('White', 5),
('House', 5),
('aides', 5),
('told', 4),
('to', 2),
('keep', 4),
('Russia', 6),
('-', 1),
('related', 7),
('materials', 9)]
We can now add this metadata via add_metadata_per_doc(), passing the metadata key and the previously generated metadata dict:
[71]:
preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths # we don't need that object anymore
The property .tokens_datatable now shows an additional column meta_length (the metadata key is always prefixed with meta_):
[72]:
preproc.tokens_datatable
[72]:
doc | position | token | lemma | whitespace | meta_length | |
---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | 1 | 5 |
1 | NewsArticles-1880 | 1 | House | House | 1 | 5 |
2 | NewsArticles-1880 | 2 | aides | aide | 1 | 5 |
3 | NewsArticles-1880 | 3 | told | tell | 1 | 4 |
4 | NewsArticles-1880 | 4 | to | to | 1 | 2 |
5 | NewsArticles-1880 | 5 | keep | keep | 1 | 4 |
6 | NewsArticles-1880 | 6 | Russia | Russia | 0 | 6 |
7 | NewsArticles-1880 | 7 | - | - | 0 | 1 |
8 | NewsArticles-1880 | 8 | related | relate | 1 | 7 |
9 | NewsArticles-1880 | 9 | materials | material | 0 | 9 |
10 | NewsArticles-1880 | 10 | 0 | 2 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | 1 | 7 |
12 | NewsArticles-1880 | 12 | for | for | 1 | 3 |
13 | NewsArticles-1880 | 13 | the | the | 1 | 3 |
14 | NewsArticles-1880 | 14 | Trump | trump | 1 | 5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1943 | NewsArticles-99 | 1055 | non | non | 0 | 3 |
1944 | NewsArticles-99 | 1056 | - | - | 0 | 1 |
1945 | NewsArticles-99 | 1057 | recyclable | recyclable | 1 | 10 |
1946 | NewsArticles-99 | 1058 | items | item | 0 | 5 |
1947 | NewsArticles-99 | 1059 | . | . | 0 | 1 |
Let’s add a boolean indicator for whether the given token is all uppercase:
[73]:
meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
for doc_label, doc_tokens in preproc.tokens.items()}
preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper
preproc.tokens_datatable
[73]:
doc | position | token | lemma | whitespace | meta_length | meta_upper | |
---|---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | 1 | 5 | 0 |
1 | NewsArticles-1880 | 1 | House | House | 1 | 5 | 0 |
2 | NewsArticles-1880 | 2 | aides | aide | 1 | 5 | 0 |
3 | NewsArticles-1880 | 3 | told | tell | 1 | 4 | 0 |
4 | NewsArticles-1880 | 4 | to | to | 1 | 2 | 0 |
5 | NewsArticles-1880 | 5 | keep | keep | 1 | 4 | 0 |
6 | NewsArticles-1880 | 6 | Russia | Russia | 0 | 6 | 0 |
7 | NewsArticles-1880 | 7 | - | - | 0 | 1 | 0 |
8 | NewsArticles-1880 | 8 | related | relate | 1 | 7 | 0 |
9 | NewsArticles-1880 | 9 | materials | material | 0 | 9 | 0 |
10 | NewsArticles-1880 | 10 | 0 | 2 | 0 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | 1 | 7 | 0 |
12 | NewsArticles-1880 | 12 | for | for | 1 | 3 | 0 |
13 | NewsArticles-1880 | 13 | the | the | 1 | 3 | 0 |
14 | NewsArticles-1880 | 14 | Trump | trump | 1 | 5 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1943 | NewsArticles-99 | 1055 | non | non | 0 | 3 | 0 |
1944 | NewsArticles-99 | 1056 | - | - | 0 | 1 | 0 |
1945 | NewsArticles-99 | 1057 | recyclable | recyclable | 1 | 10 | 0 |
1946 | NewsArticles-99 | 1058 | items | item | 0 | 5 | 0 |
1947 | NewsArticles-99 | 1059 | . | . | 0 | 1 | 0 |
You may now use these newly added columns, for example to filter the datatable:
[74]:
import datatable as dt
preproc.tokens_datatable[dt.f.meta_upper == 1,:]
[74]:
doc | position | token | lemma | whitespace | meta_length | meta_upper | |
---|---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 43 | ABC | ABC | 1 | 3 | 1 |
1 | NewsArticles-1880 | 73 | ABC | ABC | 1 | 3 | 1 |
2 | NewsArticles-1880 | 213 | U.S. | U.S. | 1 | 4 | 1 |
3 | NewsArticles-3350 | 11 | US | US | 0 | 2 | 1 |
4 | NewsArticles-3350 | 13 | UK | UK | 1 | 2 | 1 |
5 | NewsArticles-3350 | 34 | US | US | 1 | 2 | 1 |
6 | NewsArticles-3350 | 98 | US | US | 1 | 2 | 1 |
7 | NewsArticles-3350 | 106 | US | US | 1 | 2 | 1 |
8 | NewsArticles-3350 | 134 | UAE | UAE | 1 | 3 | 1 |
9 | NewsArticles-3350 | 153 | READ | READ | 1 | 4 | 1 |
10 | NewsArticles-3350 | 154 | MORE | MORE | 0 | 4 | 1 |
11 | NewsArticles-3350 | 273 | US | US | 1 | 2 | 1 |
12 | NewsArticles-3350 | 346 | READ | READ | 1 | 4 | 1 |
13 | NewsArticles-3350 | 347 | MORE | MORE | 0 | 4 | 1 |
14 | NewsArticles-3350 | 349 | US | US | 1 | 2 | 1 |
15 | NewsArticles-3350 | 358 | US | US | 1 | 2 | 1 |
16 | NewsArticles-3350 | 454 | I | -PRON- | 1 | 1 | 1 |
17 | NewsArticles-3350 | 480 | UK | UK | 1 | 2 | 1 |
18 | NewsArticles-3350 | 502 | UK | UK | 1 | 2 | 1 |
19 | NewsArticles-3350 | 506 | UAE | UAE | 1 | 3 | 1 |
20 | NewsArticles-3350 | 529 | UAE | UAE | 1 | 3 | 1 |
21 | NewsArticles-3350 | 570 | US | US | 1 | 2 | 1 |
22 | NewsArticles-3350 | 637 | US | US | 1 | 2 | 1 |
23 | NewsArticles-99 | 376 | UK | UK | 1 | 2 | 1 |
24 | NewsArticles-99 | 711 | A | a | 1 | 1 | 1 |
25 | NewsArticles-99 | 955 | UK | UK | 1 | 2 | 1 |
26 | NewsArticles-99 | 995 | M25 | M25 | 1 | 3 | 1 |
To see which metadata keys were already created, you can use get_available_metadata_keys():
[75]:
preproc.get_available_metadata_keys()
[75]:
{'lemma', 'length', 'upper', 'whitespace'}
Token metadata can be removed with remove_metadata():
[76]:
preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()
[76]:
{'lemma', 'length', 'whitespace'}
[77]:
preproc.tokens_datatable
[77]:
doc | position | token | lemma | whitespace | meta_length | |
---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | 1 | 5 |
1 | NewsArticles-1880 | 1 | House | House | 1 | 5 |
2 | NewsArticles-1880 | 2 | aides | aide | 1 | 5 |
3 | NewsArticles-1880 | 3 | told | tell | 1 | 4 |
4 | NewsArticles-1880 | 4 | to | to | 1 | 2 |
5 | NewsArticles-1880 | 5 | keep | keep | 1 | 4 |
6 | NewsArticles-1880 | 6 | Russia | Russia | 0 | 6 |
7 | NewsArticles-1880 | 7 | - | - | 0 | 1 |
8 | NewsArticles-1880 | 8 | related | relate | 1 | 7 |
9 | NewsArticles-1880 | 9 | materials | material | 0 | 9 |
10 | NewsArticles-1880 | 10 | 0 | 2 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | 1 | 7 |
12 | NewsArticles-1880 | 12 | for | for | 1 | 3 |
13 | NewsArticles-1880 | 13 | the | the | 1 | 3 |
14 | NewsArticles-1880 | 14 | Trump | trump | 1 | 5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1943 | NewsArticles-99 | 1055 | non | non | 0 | 3 |
1944 | NewsArticles-99 | 1056 | - | - | 0 | 1 |
1945 | NewsArticles-99 | 1057 | recyclable | recyclable | 1 | 10 |
1946 | NewsArticles-99 | 1058 | items | item | 0 | 5 |
1947 | NewsArticles-99 | 1059 | . | . | 0 | 1 |
We can tell filter_tokens() and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata column meta_length, which we created before, to filter for tokens of a certain length:
[78]:
preproc_meta_example = preproc.copy()
preproc_meta_example.filter_tokens(3, by_meta='length')
preproc_meta_example.tokens_datatable
[78]:
doc | position | token | lemma | whitespace | meta_length | |
---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | for | for | 1 | 3 |
1 | NewsArticles-1880 | 1 | the | the | 1 | 3 |
2 | NewsArticles-1880 | 2 | any | any | 1 | 3 |
3 | NewsArticles-1880 | 3 | the | the | 1 | 3 |
4 | NewsArticles-1880 | 4 | and | and | 1 | 3 |
5 | NewsArticles-1880 | 5 | ABC | ABC | 1 | 3 |
6 | NewsArticles-1880 | 6 | has | have | 1 | 3 |
7 | NewsArticles-1880 | 7 | The | the | 1 | 3 |
8 | NewsArticles-1880 | 8 | and | and | 1 | 3 |
9 | NewsArticles-1880 | 9 | ABC | ABC | 1 | 3 |
10 | NewsArticles-1880 | 10 | The | the | 1 | 3 |
11 | NewsArticles-1880 | 11 | the | the | 1 | 3 |
12 | NewsArticles-1880 | 12 | and | and | 1 | 3 |
13 | NewsArticles-1880 | 13 | law | law | 1 | 3 |
14 | NewsArticles-1880 | 14 | all | all | 1 | 3 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
335 | NewsArticles-99 | 186 | for | for | 1 | 3 |
336 | NewsArticles-99 | 187 | bin | bin | 1 | 3 |
337 | NewsArticles-99 | 188 | can | can | 1 | 3 |
338 | NewsArticles-99 | 189 | and | and | 1 | 3 |
339 | NewsArticles-99 | 190 | non | non | 0 | 3 |
[79]:
del preproc_meta_example
Note that all matching options then apply to the metadata column, in this case to the meta_length column, which contains integers. Since filter_tokens() by default employs exact matching, we get all tokens where meta_length equals the first argument, 3. If we used regular expression or glob matching instead, the method would fail, because those match types can only be applied to string data.
If you want to use more complex filter queries, you should create a “filter mask” and pass it to filter_tokens_by_mask(). A filter mask is a dictionary that maps a document label to a sequence of booleans. For each occurrence of True, the respective token in the document will be retained; all others will be removed. Let’s try that out with a small sample:
[80]:
preproc.pos_tag().tokens_datatable
[80]:
doc | position | token | lemma | pos | whitespace | meta_length | |
---|---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | PUNCT | 1 | 5 |
1 | NewsArticles-1880 | 1 | House | House | PUNCT | 1 | 5 |
2 | NewsArticles-1880 | 2 | aides | aide | PUNCT | 1 | 5 |
3 | NewsArticles-1880 | 3 | told | tell | PUNCT | 1 | 4 |
4 | NewsArticles-1880 | 4 | to | to | PUNCT | 1 | 2 |
5 | NewsArticles-1880 | 5 | keep | keep | PUNCT | 1 | 4 |
6 | NewsArticles-1880 | 6 | Russia | Russia | PUNCT | 0 | 6 |
7 | NewsArticles-1880 | 7 | - | - | PUNCT | 0 | 1 |
8 | NewsArticles-1880 | 8 | related | relate | PUNCT | 1 | 7 |
9 | NewsArticles-1880 | 9 | materials | material | PUNCT | 0 | 9 |
10 | NewsArticles-1880 | 10 | PUNCT | 0 | 2 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | PUNCT | 1 | 7 |
12 | NewsArticles-1880 | 12 | for | for | PUNCT | 1 | 3 |
13 | NewsArticles-1880 | 13 | the | the | PUNCT | 1 | 3 |
14 | NewsArticles-1880 | 14 | Trump | trump | PUNCT | 1 | 5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1943 | NewsArticles-99 | 1055 | non | non | PUNCT | 0 | 3 |
1944 | NewsArticles-99 | 1056 | - | - | PUNCT | 0 | 1 |
1945 | NewsArticles-99 | 1057 | recyclable | recyclable | PUNCT | 1 | 10 |
1946 | NewsArticles-99 | 1058 | items | item | PUNCT | 0 | 5 |
1947 | NewsArticles-99 | 1059 | . | . | PUNCT | 0 | 1 |
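Applying such a mask to a single document boils down to a pair-wise selection over tokens and booleans, which a plain-Python sketch makes clear:

```python
doc_tokens = ['White', 'House', 'aides', 'told']
mask = [True, True, False, False]

# retain only the tokens whose mask entry is True
kept = [tok for tok, keep in zip(doc_tokens, mask) if keep]
kept
# -> ['White', 'House']
```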
We now generate the filter mask: for each document we create a boolean list or array that indicates for each token whether it should be kept or removed.
We will iterate through the tokens_with_metadata property, which is a dict that for each document contains a datatable with its tokens and metadata. Let’s have a look at the first document’s datatable:
[81]:
next(iter(preproc.tokens_with_metadata.values()))
[81]:
token | lemma | pos | whitespace | meta_length | |
---|---|---|---|---|---|
0 | White | White | PUNCT | 1 | 5 |
1 | House | House | PUNCT | 1 | 5 |
2 | aides | aide | PUNCT | 1 | 5 |
3 | told | tell | PUNCT | 1 | 4 |
4 | to | to | PUNCT | 1 | 2 |
5 | keep | keep | PUNCT | 1 | 4 |
6 | Russia | Russia | PUNCT | 0 | 6 |
7 | - | - | PUNCT | 0 | 1 |
8 | related | relate | PUNCT | 1 | 7 |
9 | materials | material | PUNCT | 0 | 9 |
10 | PUNCT | 0 | 2 | ||
11 | Lawyers | lawyer | PUNCT | 1 | 7 |
12 | for | for | PUNCT | 1 | 3 |
13 | the | the | PUNCT | 1 | 3 |
14 | Trump | trump | PUNCT | 1 | 5 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
225 | during | during | PUNCT | 1 | 6 |
226 | his | -PRON- | X | 1 | 3 |
227 | confirmation | confirmation | PUNCT | 1 | 12 |
228 | hearing | hearing | PUNCT | 0 | 7 |
229 | . | . | PUNCT | 0 | 1 |
Now we can create the filter mask:
[82]:
import numpy as np

filter_mask = {}
for doc_label, doc_data in preproc.tokens_with_metadata.items():
    # extract the columns "meta_length" and "pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())
    # create a boolean array for nouns with token length less than or equal to 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.isin(tok_pos, ['NOUN', 'PROPN'])

# it's not necessary to add the filter mask as metadata,
# but it's a good way to check the mask
preproc.add_metadata_per_doc('small_nouns', filter_mask)
preproc.tokens_datatable
[82]:
doc | position | token | lemma | pos | whitespace | meta_length | meta_small_nouns | |
---|---|---|---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White | White | PUNCT | 1 | 5 | 0 |
1 | NewsArticles-1880 | 1 | House | House | PUNCT | 1 | 5 | 0 |
2 | NewsArticles-1880 | 2 | aides | aide | PUNCT | 1 | 5 | 0 |
3 | NewsArticles-1880 | 3 | told | tell | PUNCT | 1 | 4 | 0 |
4 | NewsArticles-1880 | 4 | to | to | PUNCT | 1 | 2 | 0 |
5 | NewsArticles-1880 | 5 | keep | keep | PUNCT | 1 | 4 | 0 |
6 | NewsArticles-1880 | 6 | Russia | Russia | PUNCT | 0 | 6 | 0 |
7 | NewsArticles-1880 | 7 | - | - | PUNCT | 0 | 1 | 0 |
8 | NewsArticles-1880 | 8 | related | relate | PUNCT | 1 | 7 | 0 |
9 | NewsArticles-1880 | 9 | materials | material | PUNCT | 0 | 9 | 0 |
10 | NewsArticles-1880 | 10 | PUNCT | 0 | 2 | 0 | ||
11 | NewsArticles-1880 | 11 | Lawyers | lawyer | PUNCT | 1 | 7 | 0 |
12 | NewsArticles-1880 | 12 | for | for | PUNCT | 1 | 3 | 0 |
13 | NewsArticles-1880 | 13 | the | the | PUNCT | 1 | 3 | 0 |
14 | NewsArticles-1880 | 14 | Trump | trump | PUNCT | 1 | 5 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1943 | NewsArticles-99 | 1055 | non | non | PUNCT | 0 | 3 | 0 |
1944 | NewsArticles-99 | 1056 | - | - | PUNCT | 0 | 1 | 0 |
1945 | NewsArticles-99 | 1057 | recyclable | recyclable | PUNCT | 1 | 10 | 0 |
1946 | NewsArticles-99 | 1058 | items | item | PUNCT | 0 | 5 | 0 |
1947 | NewsArticles-99 | 1059 | . | . | PUNCT | 0 | 1 | 0 |
Finally, we can pass the mask dict to filter_tokens_by_mask():
[83]:
preproc.filter_tokens_by_mask(filter_mask)
preproc.tokens_datatable
[83]:
doc | position | token | |
---|---|---|---|
Generating n-grams
So far, we have worked with unigrams, i.e. each document consisted of a sequence of discrete tokens. We can also generate n-grams from our corpus, where each element is a combination of n subsequent tokens. An example would be:
Document: “This is a simple example.”
n=1 (unigrams):
['This', 'is', 'a', 'simple', 'example', '.']
n=2 (bigrams):
['This is', 'is a', 'a simple', 'simple example', 'example .']
n=3 (trigrams):
['This is a', 'is a simple', 'a simple example', 'simple example .']
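The examples above amount to a simple sliding window over the token sequence. A minimal sketch (an illustration, not tmtoolkit's implementation):

```python
def ngrams(tokens, n):
    """All windows of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

toks = ['This', 'is', 'a', 'simple', 'example', '.']
[' '.join(ng) for ng in ngrams(toks, 2)]
# -> ['This is', 'is a', 'a simple', 'simple example', 'example .']
```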
The method generate_ngrams() allows us to generate n-grams from tokenized documents. We can then get the results with the ngrams property:
[84]:
del preproc
preproc = preproc_orig.copy() # make a copy from full data
preproc.generate_ngrams(2) # generate bigrams
preproc.ngrams['NewsArticles-1880'][:10] # show first 10 bigrams of this document
[84]:
[['White', 'House'],
['House', 'aides'],
['aides', 'told'],
['told', 'to'],
['to', 'keep'],
['keep', 'Russia'],
['Russia', '-'],
['-', 'related'],
['related', 'materials'],
['materials', 'Lawyers']]
You may afterwards use join_ngrams() to merge the generated n-grams into joint tokens and use these as the new tokens in this TMPreproc instance:
[85]:
preproc.join_ngrams()
preproc.tokens_datatable
[85]:
doc | position | token | lemma | whitespace | |
---|---|---|---|---|---|
0 | NewsArticles-1880 | 0 | White House | White House | 1 |
1 | NewsArticles-1880 | 1 | House aides | House aides | 1 |
2 | NewsArticles-1880 | 2 | aides told | aides told | 1 |
3 | NewsArticles-1880 | 3 | told to | told to | 1 |
4 | NewsArticles-1880 | 4 | to keep | to keep | 1 |
5 | NewsArticles-1880 | 5 | keep Russia | keep Russia | 1 |
6 | NewsArticles-1880 | 6 | Russia - | Russia - | 1 |
7 | NewsArticles-1880 | 7 | - related | - related | 1 |
8 | NewsArticles-1880 | 8 | related materials | related materials | 1 |
9 | NewsArticles-1880 | 9 | materials Lawyers | materials Lawyers | 1 |
10 | NewsArticles-1880 | 10 | Lawyers for | Lawyers for | 1 |
11 | NewsArticles-1880 | 11 | for the | for the | 1 |
12 | NewsArticles-1880 | 12 | the Trump | the Trump | 1 |
13 | NewsArticles-1880 | 13 | Trump administration | Trump administration | 1 |
14 | NewsArticles-1880 | 14 | administration have | administration have | 1 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1934 | NewsArticles-99 | 1052 | and non | and non | 1 |
1935 | NewsArticles-99 | 1053 | non - | non - | 1 |
1936 | NewsArticles-99 | 1054 | - recyclable | - recyclable | 1 |
1937 | NewsArticles-99 | 1055 | recyclable items | recyclable items | 1 |
1938 | NewsArticles-99 | 1056 | items . | items . | 1 |
[86]:
del preproc
Generating a sparse document-term matrix (DTM)
If you’re working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which records the number of occurrences of each term (i.e. vocabulary token) in each document. This is an N-by-M matrix, where N is the number of documents and M is the vocabulary size (i.e. the number of unique tokens in the corpus).
Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you’re dealing with a “real world” dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with sparse matrices, where only non-zero values are stored in computer memory.
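To illustrate what such a sparse DTM contains, here is a small sketch that builds one directly with SciPy's csr_matrix (tmtoolkit does this for you via its dtm property):

```python
from collections import Counter
from scipy.sparse import csr_matrix

docs = {
    'd1': ['the', 'cat', 'sat'],
    'd2': ['the', 'the', 'dog'],
}
vocab = sorted({tok for toks in docs.values() for tok in toks})
vocab_ix = {tok: j for j, tok in enumerate(vocab)}

# collect only the non-zero entries in coordinate form
rows, cols, vals = [], [], []
for i, toks in enumerate(docs.values()):
    for tok, count in Counter(toks).items():
        rows.append(i)
        cols.append(vocab_ix[tok])
        vals.append(count)

dtm = csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))
dtm.toarray()
# vocab: ['cat', 'dog', 'sat', 'the']
# -> [[1, 0, 1, 1],
#     [0, 1, 0, 2]]
```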
For this example, we’ll generate a DTM from the preproc_orig instance. First, let’s check the number of documents and the vocabulary size:
[87]:
preproc_orig.n_docs, preproc_orig.vocabulary_size
[87]:
(3, 683)
We can use the dtm property to generate a sparse DTM from the current instance:
[88]:
preproc_orig.dtm
[88]:
<3x683 sparse matrix of type '<class 'numpy.int32'>'
with 816 stored elements in Compressed Sparse Row format>
We can see that a sparse matrix with 3 rows (corresponding to the number of documents) and 683 columns (corresponding to the vocabulary size) was generated. 816 elements in this matrix are non-zero.
We can convert this matrix to a non-sparse, i.e. dense, representation and inspect its elements:
[89]:
preproc_orig.dtm.todense()
[89]:
matrix([[ 1, 0, 4, ..., 0, 0, 0],
[ 2, 1, 14, ..., 0, 3, 0],
[ 2, 0, 32, ..., 2, 5, 5]], dtype=int32)
However, note that you should only convert a sparse matrix to a dense representation when you’re either dealing with a small amount of data (which is what we’re doing in this example), or use only a part of the full matrix. Converting a sparse matrix to a dense representation can otherwise easily exceed the available computer memory.
There exist different “formats” for sparse matrices, which have different advantages and disadvantages (see for example the SciPy “sparse” module documentation). Not all formats support all operations that you can usually apply to an ordinary, dense matrix. By default, the generated DTM is in Compressed Sparse Row (CSR) format. This format allows indexing and is especially optimized for fast row access. You may convert it to any other sparse matrix format; see the mentioned SciPy documentation for this.
The rows of the DTM are aligned to the sequence of the document labels and its columns are aligned to the vocabulary. For example, let’s find the frequency of the term “House” in the document “NewsArticles-1880”. To do this, we first look up the respective row and column indices:
[90]:
preproc_orig.doc_labels.index('NewsArticles-1880')
[90]:
0
[91]:
preproc_orig.vocabulary.index('House')
[91]:
67
This means the frequency of the term “House” in the document “NewsArticles-1880” is located in row 0 and column 67 of the DTM:
[92]:
preproc_orig.dtm[0, 67]
[92]:
4
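The two index lookups can be wrapped in a small helper function. This is a hypothetical convenience function, not part of tmtoolkit, shown here with toy data standing in for a real corpus:

```python
import numpy as np
from scipy.sparse import csr_matrix

def term_freq(dtm, doc_labels, vocab, doc_label, term):
    """Look up the frequency of `term` in the document `doc_label`.

    `dtm` is a sparse DTM whose rows are aligned to `doc_labels`
    and whose columns are aligned to `vocab`.
    """
    return dtm[doc_labels.index(doc_label), vocab.index(term)]

# toy data standing in for a real corpus
labels = ['doc-a', 'doc-b']
vocab = ['house', 'senate', 'vote']
dtm = csr_matrix(np.array([[4, 0, 1],
                           [0, 2, 3]]))

print(term_freq(dtm, labels, vocab, 'doc-a', 'house'))  # 4
```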
The following example finds the index for “administration” and then retrieves an array with the number of occurrences of this token across all three documents:
[93]:
vocab_admin_ix = preproc_orig.vocabulary.index('administration')
preproc_orig.dtm[:, vocab_admin_ix].todense()
[93]:
matrix([[4],
[1],
[0]], dtype=int32)
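Because rows correspond to documents and columns to terms, row and column sums of the DTM yield useful corpus statistics directly. A small sketch with a toy matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy DTM: 3 documents x 3 terms
dtm = csr_matrix(np.array([[4, 1, 0],
                           [1, 0, 2],
                           [0, 3, 5]]))

# total occurrences of each term across all documents (column sums)
term_freqs = np.asarray(dtm.sum(axis=0)).ravel()
# number of tokens per document (row sums)
doc_lengths = np.asarray(dtm.sum(axis=1)).ravel()

print(term_freqs)   # [5 4 7]
print(doc_lengths)  # [5 3 8]
```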
Apart from the dtm property, there’s also the get_dtm() method, which can additionally return the result as a datatable or pandas DataFrame. Note that these representations are not sparse and hence can consume much memory.
[94]:
preproc_orig.get_dtm(as_datatable=True)
DatatableWarning: Duplicate column name found, and was assigned a unique name: '.' -> '.0'
[94]:
|   | _doc | . | .0 | " | % | ' | 's | ( | ) | , | … | work | world | would | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NewsArticles-1880 | 1 | 0 | 4 | 0 | 1 | 3 | 0 | 0 | 9 | … | 0 | 0 | 0 | 0 | 0 |
| 1 | NewsArticles-3350 | 2 | 1 | 14 | 0 | 1 | 6 | 0 | 0 | 28 | … | 0 | 1 | 0 | 3 | 0 |
| 2 | NewsArticles-99 | 2 | 0 | 32 | 5 | 0 | 3 | 2 | 2 | 33 | … | 1 | 0 | 2 | 5 | 5 |
Serialization: Saving and loading TMPreproc objects
The current state of a TMPreproc object can also be stored in a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it from that file. The methods for this are save_state() and load_state() / from_state().
Let’s store the current state of the preproc_orig instance:
[95]:
preproc_orig.print_summary()
preproc_orig.save_state('data/preproc_state.pickle')
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[95]:
<TMPreproc [3 documents / en]>
Let’s change the object by retaining only documents that contain a token matching “house” (note the reduced number of documents in the output):
[96]:
preproc_orig.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc_orig.print_summary()
2 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1290 / vocabulary size: 485
[96]:
<TMPreproc [2 documents / en]>
We can restore the saved data using from_state():
[97]:
preproc_restored = TMPreproc.from_state('data/preproc_state.pickle')
preproc_restored.print_summary()
3 documents in language English:
> NewsArticles-1880 (N=230): White House aides told to keep Russia - related ma...
> NewsArticles-3350 (N=658): Frustration as cabin electronics ban comes into fo...
> NewsArticles-99 (N=1060): Should you have two bins in your bathroom ? Our ba...
total number of tokens: 1948 / vocabulary size: 683
[97]:
<TMPreproc [3 documents / en]>
You can see that the full dataset with three documents was restored.
This is especially useful when you have a large amount of data and run time-consuming operations such as POS tagging: once these operations have finished, you can store the current state to disk and later restore it without having to re-run them.
Functional API
The TMPreproc class provides a convenient object-oriented interface for parallel text processing and analysis. There is also a functional API in the tmtoolkit.preprocess module. Most of its functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use TMPreproc. Note that the functional API does not provide parallel processing.
To initialize the functional API for a certain language, you need to start with init_for_language() and may then tokenize your raw text documents via tokenize(), which will generate a list of spaCy documents. Most other functions in this API accept such a list of spaCy documents as input.
from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')
docs = tokenize(['Hello this is a test.', 'And here comes another one.'])
The final result after applying preprocessing steps and hence transforming the text data is often a document-term matrix (DTM). The bow module contains several functions to work with DTMs, e.g. apply transformations such as tf-idf or compute some important summary statistics. The next chapter will introduce some of these functions.
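As a preview, here is a minimal sketch of one common tf-idf variant computed directly on a sparse DTM with SciPy. This is an illustration only; the bow module provides ready-made (and more configurable) implementations:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy DTM: 3 documents x 4 terms
dtm = csr_matrix(np.array([[2, 0, 1, 0],
                           [0, 1, 1, 0],
                           [1, 1, 0, 3]], dtype=float))

n_docs = dtm.shape[0]
# document frequency: number of documents each term occurs in
df = np.asarray((dtm > 0).sum(axis=0)).ravel()
# smoothed inverse document frequency
idf = np.log(1 + n_docs / df)

# term frequency: counts normalized by document length
tf = dtm.multiply(1 / dtm.sum(axis=1))
# element-wise product; the result stays sparse
tfidf = tf.multiply(idf).tocsr()

print(tfidf.shape)  # (3, 4)
```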