# Working with text corpora

Your text data usually comes in the form of (long) plain text strings that are stored in one or several files on disk. We can load and transform this data into a Corpus object so that we can perform all kinds of operations that are implemented as corpus functions in tmtoolkit. The Corpus class itself resembles a Python dictionary with some additional functionality.

Let’s import the Corpus class first:

[1]:
from tmtoolkit.corpus import Corpus

Several methods are implemented to load text data from different sources:

• load plain text files (“.txt files”)

• load folder(s) with plain text files

• load a tabular (i.e. CSV or Excel) file containing document IDs and texts

• load a ZIP file containing plain text or tabular files

We can create a Corpus object directly by immediately loading a dataset using one of the Corpus.from_... methods. This is what we’ve done when we used corp = Corpus.from_builtin_corpus('en-News100') in the previous chapter.

Let’s load a folder with example documents. Make sure that the path is relative to the current working directory. The data for these examples can be downloaded from GitHub.

Note: Rich text documents

If you want to work with “rich text documents”, i.e. formatted, non-plain text sources such as PDFs, Word documents, HTML files, etc., you must convert them to one of the supported formats first. For example, you can use the pdftotext command from the Linux package poppler-utils to convert from PDF to plain text files, or pandoc to convert from Word or HTML to plain text.

[2]:
corp = Corpus.from_folder('data/corpus_example', language='en')
corp
[2]:
<Corpus [3 documents  / language "en"]>

Again, we can have a look at which document labels were created and print one sample document:

[3]:
corp.doc_labels
[3]:
['sample1', 'sample2', 'sample3']
[4]:
corp['sample3']['token']
[4]:
['And',
'here',
'we',
'go',
'with',
'the',
'third',
'and',
'final',
'example',
'file',
'.',
'\n',
'Another',
'line',
'of',
'text',
'.',
'\n\n',
'§',
...]

The corpus_summary and print_summary functions are very helpful to get a first overview of a loaded corpus:

[5]:
from tmtoolkit.corpus import print_summary

print_summary(corp)
Corpus with 3 documents in English
> sample2 (33 tokens): Here comes the second example ( with HTML < i > ta...
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 94 / vocabulary size: 64

Side note: Corpus functions

The corpus_summary and print_summary functions are examples of corpus functions. All corpus functions accept a Corpus object as first argument and operate on it. A corpus function may retrieve information from a corpus and/or modify it. Most functions in the tmtoolkit.corpus module are corpus functions.

Another option is to create a Corpus object and add further documents using the corpus_add_... functions. Here we create an empty Corpus and then add documents via corpus_add_files, which is another example of a corpus function (one that modifies a Corpus object). It takes a Corpus object and one or more paths to raw text files.

[6]:
corp = Corpus(language='en')
print_summary(corp)
Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0
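[7]:
from tmtoolkit.corpus import corpus_add_files

corpus_add_files(corp, 'data/corpus_example/sample1.txt')
print_summary(corp)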
Corpus with 1 document in English
> data_corpus_example-sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 25 / vocabulary size: 24

Note that this time the document label is different. It’s prefixed by a normalized version of the path to the document. We can alter the doc_label_fmt argument of corpus_add_files in order to control how document labels are generated. But first, let’s remove the previously loaded document from the corpus. Since a Corpus instance behaves like a Python dict, we can use del:

[8]:
del corp['data_corpus_example-sample1']
print_summary(corp)
Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0

Now we use a modified doc_label_fmt parameter value to generate document labels only from the file name and not from the full path to the document. We also load three files now:

[9]:
corpus_add_files(corp, ['data/corpus_example/sample1.txt',
                        'data/corpus_example/sample2.txt',
                        'data/corpus_example/sample3.txt'],
                 doc_label_fmt='{basename}')
print_summary(corp)

Corpus with 3 documents in English
> sample2 (33 tokens): Here comes the second example ( with HTML < i > ta...
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 94 / vocabulary size: 64
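
To make the effect of doc_label_fmt more concrete, the following sketch mimics how a label could be derived from a file path. It is not tmtoolkit’s implementation: the helper make_doc_label and the placeholder names {path} and {ext} are assumptions made for illustration. Only the {basename} placeholder and the two resulting labels are taken from the examples above.

```python
import posixpath


def make_doc_label(path, fmt='{path}-{basename}'):
    """Hypothetical helper: build a document label from a POSIX-style file path."""
    dirname, filename = posixpath.split(path)
    basename, ext = posixpath.splitext(filename)
    # "normalize" the directory part by joining its components with underscores
    norm_path = '_'.join(part for part in dirname.split('/') if part)
    return fmt.format(path=norm_path, basename=basename, ext=ext.lstrip('.'))


# assumed default format: normalized path as prefix, then the file name
label_full = make_doc_label('data/corpus_example/sample1.txt')
# only the file name without extension
label_short = make_doc_label('data/corpus_example/sample1.txt', fmt='{basename}')

print(label_full)
print(label_short)
```

Using only the file name gives shorter labels, but labels must stay unique, so a path prefix is the safer choice when files in different folders share the same name.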

As noted in the beginning, there are more corpus_add_... and Corpus.from_... functions/methods to load text data from different sources. See the corpus module API for details.

Note

Please be aware of the difference between the corpus_add_... and Corpus.from_... functions/methods: The former modify a given Corpus object, whereas the latter create a new Corpus object.

## Configuring the NLP pipeline, parallel processing and more via Corpus parameters

When initializing a Corpus, you can pass several arguments. You must at least provide either the language, language_model or spacy_instance argument. The simplest is to just pass the language as two-letter ISO 639-1 language code (“en”, “de”, “es”, etc.). A respective SpaCy language model will then be automatically selected depending on that language code and the NLP pipeline features that you require. Alternatively, you can set the SpaCy language model to be loaded via language_model. Finally, if you already loaded a SpaCy pipeline instance yourself, you can also pass it via the spacy_instance parameter.

You can specify the features, i.e. the components, of the NLP pipeline using the load_features parameter. This only applies if you don’t provide your own pipeline instance via spacy_instance. It determines the pipeline’s capabilities, i.e. whether POS tags, named entities, sentences, etc. are recognized. By default, tmtoolkit will load and enable all components except for the named-entity recognition (NER) component. The more components are enabled, the slower the text processing works and the more memory it uses. So it makes sense to only enable components that you actually use. See the SpaCy documentation for which components are implemented (note that this also depends on the language and language model).

Let’s create a corpus with only a minimal pipeline. If we set load_features to an empty sequence, only a basic tokenizing pipeline is run.

[10]:
corp_minimal = Corpus.from_folder('data/corpus_example', language='en', load_features=[])

We can compare the token table output of the minimal pipeline with the default pipeline used in the previously loaded corp object:

[11]:
from tmtoolkit.corpus import tokens_table

tokens_table(corp_minimal)
[11]:
doc position token is_punct is_stop like_num
0 sample1 0 This False True False
1 sample1 1 is False True False
2 sample1 2 the False True False
3 sample1 3 first False True True
4 sample1 4 example False False False
... ... ... ... ... ... ...
89 sample3 31 third False True True
90 sample3 32 and False True False
91 sample3 33 final False False False
92 sample3 34 paragraph False False False
93 sample3 35 . True False False

94 rows × 6 columns

[12]:
tokens_table(corp)
[12]:
doc position token is_punct is_stop lemma like_num pos tag
0 sample1 0 This False True this False PRON DT
1 sample1 1 is False True be False AUX VBZ
2 sample1 2 the False True the False DET DT
3 sample1 3 first False True first True ADJ JJ
4 sample1 4 example False False example False NOUN NN
... ... ... ... ... ... ... ... ... ...
89 sample3 31 third False True third True ADJ JJ
90 sample3 32 and False True and False CCONJ CC
91 sample3 33 final False False final False ADJ JJ
92 sample3 34 paragraph False False paragraph False NOUN NN
93 sample3 35 . True False . False PUNCT .

94 rows × 9 columns

As we can see, the minimal pipeline produces less information: the lemmata and POS tags are missing from the table for the corpus using the minimal pipeline. Also, if we requested sentences, we’d get an error since sentence recognition is not enabled in the minimal pipeline:

tokens_table(corp_minimal, sentences=True)
# results in ValueError: sentence numbers requested, but sentence borders not set; Corpus documents probably not parsed with sentence recognition

You can also use the default pipeline as a basis and add new components via the add_features parameter. Here, we add the NER component for identifying named entities such as persons. The table then shows a new column ent_type with "PERSON" entries next to names:

[13]:
corp_ner = Corpus.from_folder('data/corpus_example', language='en', add_features=['ner'])

tokens_table(corp_ner, select='sample1').tail(10)
[13]:
doc position token ent_type is_punct is_stop lemma like_num pos tag
15 sample1 15 famous False False famous False ADJ JJ
16 sample1 16 people False False people False NOUN NNS
17 sample1 17 like False False like False ADP IN
18 sample1 18 Missy PERSON False False Missy False PROPN NNP
19 sample1 19 Elliott PERSON False False Elliott False PROPN NNP
20 sample1 20 or False True or False CCONJ CC
21 sample1 21 George PERSON False False George False PROPN NNP
22 sample1 22 Harrison PERSON False False Harrison False PROPN NNP
23 sample1 23 . True False . False PUNCT .
24 sample1 24 \n False False \n False SPACE _SP

### Parallel processing

A final important parameter is the max_workers setting. With it, you can enable parallel processing and specify the number of worker processes that can run in parallel. You should never set this value higher than the number of CPU cores in your machine (it will still work, but it will actually slow down the processing). Parallel processing makes most sense for large corpora with thousands of documents. With the small toy examples in this tutorial, you won’t see performance gains (you will actually see performance loss due to increased overhead for setting up the parallel processing).

You can provide an integer value to max_workers, which sets the number of worker processes. 0 or 1 disable parallel processing (the default behavior).

Here, we use two workers, i.e. two CPU cores of our machine:

[14]:
corp_2w = Corpus.from_folder('data/corpus_example', language='en', max_workers=2)
corp_2w
[14]:
<Corpus [3 documents  / 2 worker processes / language "en"]>

If you use a negative number -x, this means “use all available CPU cores, but leave x cores spare”. For example, when your machine has four CPU cores, three of them will be used by tmtoolkit when setting max_workers=-1.

[15]:
corp_allbutone = Corpus.from_folder('data/corpus_example', language='en', max_workers=-1)
corp_allbutone
[15]:
<Corpus [3 documents  / 3 worker processes / language "en"]>

Finally, you can pass a float value to the max_workers parameter which specifies the fraction of available CPU cores to use (rounded). Here, we use 50% of the available CPU cores:

[16]:
corp_halfcpus = Corpus.from_folder('data/corpus_example', language='en', max_workers=0.5)
corp_halfcpus
[16]:
<Corpus [3 documents  / 2 worker processes / language "en"]>
[17]:
del corp_2w, corp_allbutone, corp_halfcpus
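
The three ways of specifying max_workers described in this section can be summarized in a small sketch. The function resolve_max_workers is hypothetical and is not taken from tmtoolkit. It only reproduces the rules stated above for a given number of CPU cores, and it assumes that the result is never lower than one worker.

```python
def resolve_max_workers(max_workers, n_cpus):
    """Hypothetical helper: translate a max_workers setting into a worker count."""
    if isinstance(max_workers, float):
        # a float is read as the fraction of available CPU cores (rounded)
        n = round(max_workers * n_cpus)
    elif max_workers < 0:
        # a negative integer -x leaves x cores spare
        n = n_cpus + max_workers
    else:
        # a non-negative integer is the number of workers itself
        n = max_workers
    # 0 and 1 both mean "no parallel processing", i.e. a single process
    return max(1, n)


# the settings used in the examples above, on a machine with four CPU cores
for setting in (2, -1, 0.5, 0):
    print(setting, '->', resolve_max_workers(setting, n_cpus=4))
```

Note that Python’s round uses “round half to even”, so the rounding in this sketch may differ from what tmtoolkit does in edge cases.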

### Text preprocessing before tokenization

Finally, there’s the option to provide a text preprocessing pipeline that is applied to the raw text documents before the text is tokenized. This is helpful in cases where the tokenizer might be confused by some elements in the text. An example might be HTML tags like in <i>italic text</i>. If your text contains these, you should remove them, since the tokenizer can’t handle them, as seen for example in document “sample2”:

[18]:
corp['sample2']['token']
[18]:
['Here',
'comes',
'the',
'second',
'example',
'(',
'with',
'HTML',
'<',
'i',
'>',
'tags</i',
'>',
'&',
'amp',
';',
'entities',
')',
'.',
'\n\n',
...]

There are two options for how to handle this: One option is to write your own code for loading the text and cleansing it and then pass the preprocessed text as dictionary to the Corpus constructor. The other option is to use the raw_preproc parameter for the Corpus class. This parameter allows you to pass one or more functions that will be applied one after another to the raw input text. The only requirement for these functions is that they accept a string and return a processed string.

There’s already a function for stripping HTML tags in tmtoolkit, strip_tags, and we can set it as the sole input preprocessing function. We can see in the tokens of “sample2” that there are no more HTML tags and that the HTML entity &amp; was converted to &:

[19]:
from tmtoolkit.corpus import strip_tags

corp = Corpus.from_folder('data/corpus_example', language='en', raw_preproc=strip_tags)
corp['sample2']['token']
[19]:
['Here',
'comes',
'the',
'second',
'example',
'(',
'with',
'HTML',
'tags',
'&',
'entities',
')',
'.',
'\n\n',
'This',
'one',
'contains',
'three',
'lines',
'of',
...]

You may specify your own preprocessing function and pass all functions that will be applied one after another as a list:

[20]:
def remove_line_breaks(txt):
    return txt.replace('\n', ' ')

corp_no_lb = Corpus.from_folder('data/corpus_example', language='en',
                                raw_preproc=[strip_tags, remove_line_breaks])
corp_no_lb['sample2']['token']
[20]:
['Here',
'comes',
'the',
'second',
'example',
'(',
'with',
'HTML',
'tags',
'&',
'entities',
')',
'.',
' ',
'This',
'one',
'contains',
'three',
'lines',
'of',
...]
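
The way a list of raw_preproc functions is applied one after another can be pictured with plain Python. This is a sketch under assumptions, not tmtoolkit code: apply_raw_preproc is a hypothetical helper, simple_strip_tags is a crude regex-based stand-in for strip_tags, and the input string is only modelled on the beginning of “sample2”.

```python
import html
import re
from functools import reduce


def simple_strip_tags(txt):
    """Crude stand-in for an HTML stripper: drop tags, then decode entities."""
    return html.unescape(re.sub(r'<[^>]+>', '', txt))


def remove_line_breaks(txt):
    return txt.replace('\n', ' ')


def apply_raw_preproc(txt, funcs):
    """Hypothetical helper: apply one function or a list of functions in order."""
    if callable(funcs):
        funcs = [funcs]
    # each function receives the output of the previous one
    return reduce(lambda t, f: f(t), funcs, txt)


raw = 'Here comes the second example (with HTML <i>tags</i> &amp; entities).\n\nThis one'
cleaned = apply_raw_preproc(raw, [simple_strip_tags, remove_line_breaks])
print(repr(cleaned))
```

Because each function only has to accept a string and return a string, the order of the list matters: here the tags are removed first and the line breaks afterwards.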

Text (pre-)processing

The tmtoolkit package contains many text processing and text “normalization” functions, all of which are applied after tokenization, i.e. after creating the Corpus object. The next chapter will present many of these functions. You should only use the raw_preproc parameter in cases where downstream tasks like tokenization fail due to malformed input (e.g. text with HTML tags). In all other cases, use the functions explained in the next chapter.

## Corpus properties, iteration and corpus functions for document management

A Corpus object provides several helpful properties that summarize the text data and the tmtoolkit.corpus module features several corpus functions to manage the documents.

### Properties

Let’s start with the number of documents in the corpus. There are two ways to obtain this value:

[21]:
len(corp)
[21]:
3
[22]:
corp.n_docs
[22]:
3

Other important properties are the language of the documents in the corpus and the SpaCy language model used in the NLP pipeline (note that so far tmtoolkit requires that all documents within one corpus use the same language):

[23]:
corp.language
[23]:
'en'
[24]:
corp.language_model
[24]:
'en_core_web_sm'

The corpus’ SpaCy NLP pipeline is given with the nlp property:

[25]:
corp.nlp
[25]:
<spacy.lang.en.English at 0x7fee11675220>

As you can configure the NLP pipeline (as shown in the previous section), sentence recognition is optional. You can check if sentences were recognized using the following property:

[26]:
corp.has_sents
[26]:
True

The max_workers property shows how many worker processes are used during computations.

[27]:
corp.max_workers
[27]:
1

You can set new values for this property in order to temporarily enable parallel processing:

[28]:
corp.max_workers = 2
corp
[28]:
<Corpus [3 documents  / 2 worker processes / language "en"]>
[29]:
corp.max_workers = 1   # reset

Another important property is the list of document labels:

[30]:
corp.doc_labels
[30]:
['sample1', 'sample2', 'sample3']

### Working with a Corpus object like with a dictionary

A Corpus object resembles a dictionary, i.e. it provides the same methods. The keys of this corpus-dictionary are the unique document labels and the values or elements are the respective Document objects.

You can access a Corpus object’s elements via square brackets (corp[<doc. label>]) or corp.get(<doc. label>):

[31]:
corp['sample1']
[31]:
Document "sample1" (25 tokens, 9 token attributes, 2 document attributes)
[32]:
corp.get('sample2')
[32]:
Document "sample2" (27 tokens, 9 token attributes, 2 document attributes)

The square brackets also accept an integer i, which will give you the document with the ith document label (corresponding to the order in corp.doc_labels):

[33]:
corp[2]
[33]:
Document "sample3" (36 tokens, 9 token attributes, 2 document attributes)

Slicing is also supported and returns a list of Document objects:

[34]:
corp[0:2]
[34]:
[Document "sample1" (25 tokens, 9 token attributes, 2 document attributes),
Document "sample2" (27 tokens, 9 token attributes, 2 document attributes)]
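
The three access styles shown above (label, integer position, slice) can be imitated with a small dict subclass. This is only an illustration of the idea: MiniCorpus is a hypothetical class, it stores plain strings instead of Document objects, and it assumes that document labels are kept in sorted order.

```python
class MiniCorpus(dict):
    """Toy dict subclass that additionally accepts integer and slice keys."""

    @property
    def doc_labels(self):
        # assumption for this sketch: labels are kept in sorted order
        return sorted(self)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # a slice returns a list of the documents at these positions
            return [dict.__getitem__(self, lbl) for lbl in self.doc_labels[key]]
        if isinstance(key, int):
            # an integer i returns the document with the i-th label
            return dict.__getitem__(self, self.doc_labels[key])
        # anything else is treated as a document label
        return dict.__getitem__(self, key)


docs = MiniCorpus(sample2='second text', sample1='first text', sample3='third text')

print(docs['sample1'])      # access by label
print(docs.get('sample2'))  # regular dict method
print(docs[2])              # access by position in doc_labels
print(docs[0:2])            # slicing returns a list
```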

You can also update existing documents or add new documents using the bracket syntax. For this, you can either pass a string, a Document object or a SpaCy Doc object.

[35]:
print_summary(corp)
Corpus with 3 documents in English
> sample2 (27 tokens): Here comes the second example ( with HTML tags & e...
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 88 / vocabulary size: 59
[36]:
corp['sample2'] = 'This is the updated version of the second example.'
corp['sample4'] = 'This document was added as fourth example.'
print_summary(corp)
Corpus with 4 documents in English
> sample2 (10 tokens): This is the updated version of the second example ...
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample4 (8 tokens): This document was added as fourth example .
total number of tokens: 79 / vocabulary size: 66

The dictionary interface of the Corpus class also allows you to iterate through its contents via corp.items(), corp.keys() or corp.values().

[37]:
# iterate through pairs of document labels lbl and
# document objects d and report each document's length
for lbl, d in corp.items():
    print(f'Document length for {lbl}: {len(d)}')
Document length for sample1: 25
Document length for sample2: 10
Document length for sample3: 36
Document length for sample4: 8
[38]:
# make a list of each document's document attributes
# (more on document attributes in the next chapter)
[d.doc_attrs for d in corp.values()]
[38]:
[{'label': 'sample1', 'has_sents': True},
{'label': 'sample2', 'has_sents': True},
{'label': 'sample3', 'has_sents': True},
{'label': 'sample4', 'has_sents': True}]

### Corpus functions for document management

#### Splitting documents

Sometimes you may want to split larger documents. In our example corpus, document “sample3” consists of three paragraphs:

[39]:
from tmtoolkit.corpus import doc_texts

print(doc_texts(corp)['sample3'])
And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.

We can use the corpus_split_by_paragraph function for splitting documents by paragraphs. There’s also corpus_split_by_token for splitting by an arbitrary token.

[40]:
from tmtoolkit.corpus import corpus_split_by_paragraph

corpus_split_by_paragraph(corp)
print_summary(corp)
Corpus with 6 documents in English
> sample3-3 (6 tokens): The third and final paragraph .
> sample2 (10 tokens): This is the updated version of the second example ...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample4 (8 tokens): This document was added as fourth example .
> sample3-1 (19 tokens): And here we go with the third and final example fi...
> sample3-2 (11 tokens): § 2 .   This is the second paragraph .
total number of tokens: 79 / vocabulary size: 49

As we can see above, the “sample3” document was split into three individual documents, one per paragraph. You can further customize the splitting process by tweaking the parameters, e.g. the minimum number of line breaks used to detect paragraphs (default is two line breaks).
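
The idea of detecting paragraphs by a minimum number of consecutive line breaks can be sketched on a raw string. The function split_paragraphs and its min_linebreaks parameter are hypothetical names chosen for this illustration. tmtoolkit works on tokenized documents and may differ in the details.

```python
import re


def split_paragraphs(text, min_linebreaks=2):
    """Hypothetical helper: split text wherever at least min_linebreaks line breaks occur."""
    pattern = r'\n{%d,}' % min_linebreaks
    # drop empty chunks that can appear at the start or end of the text
    return [chunk for chunk in re.split(pattern, text) if chunk.strip()]


sample3 = ('And here we go with the third and final example file.\n'
           'Another line of text.\n\n'
           '§2.\n'
           'This is the second paragraph.\n\n'
           'The third and final paragraph.')

paragraphs = split_paragraphs(sample3)
for i, par in enumerate(paragraphs, 1):
    print(f'sample3-{i}: {par!r}')

# with a minimum of one line break, every line becomes its own chunk
lines = split_paragraphs(sample3, min_linebreaks=1)
```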

#### Joining documents

There’s of course also a function for doing the inverse operation, i.e. joining several documents. This function is called corpus_join_documents and it accepts a dictionary that maps a name for the newly joined document to a string pattern or a list of string patterns of documents to be joined. This function is especially helpful when you want to bundle lots of smaller documents (e.g. tweets) into a bigger document (e.g. all tweets of one account).

[41]:
from tmtoolkit.corpus import corpus_join_documents

# alternative: no matching, specify each document name in a list
# corpus_join_documents(corp, {'sample3-rejoined': ['sample3-1', 'sample3-2', 'sample3-3']})

# join all documents matching "sample3*" to form a new document "sample3-rejoined"
# "sample3*" is a glob pattern, so we set this as match_type (more on that in the next chapter)
# we also set glue to an empty string, because we don't need to have additional line breaks
# between the documents
corpus_join_documents(corp, {'sample3-rejoined': 'sample3*'}, glue='', match_type='glob')
print_summary(corp)
Corpus with 4 documents in English
> sample2 (10 tokens): This is the updated version of the second example ...
> sample3-rejoined (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample4 (8 tokens): This document was added as fourth example .
total number of tokens: 79 / vocabulary size: 49
[42]:
print(doc_texts(corp)['sample3-rejoined'])
And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.
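
The joining logic (match labels against glob patterns, glue the matched texts together, replace the matched documents by the new one) can be sketched with a plain dictionary of texts. The function join_documents below is a hypothetical stand-in written for this illustration. It is not corpus_join_documents, it only supports glob patterns, and it returns a new dictionary instead of modifying its input.

```python
from fnmatch import fnmatchcase


def join_documents(docs, join, glue='\n\n'):
    """Hypothetical helper: join documents whose labels match glob patterns."""
    result = dict(docs)
    for new_label, patterns in join.items():
        if isinstance(patterns, str):
            patterns = [patterns]
        # labels are matched in the order in which they appear in docs
        matched = [lbl for lbl in docs
                   if any(fnmatchcase(lbl, pat) for pat in patterns)]
        if not matched:
            continue
        for lbl in matched:
            result.pop(lbl, None)
        result[new_label] = glue.join(docs[lbl] for lbl in matched)
    return result


docs = {
    'sample1': 'First.',
    'sample3-1': 'Para one.',
    'sample3-2': 'Para two.',
    'sample3-3': 'Para three.',
}

joined = join_documents(docs, {'sample3-rejoined': 'sample3*'}, glue=' ')
print(joined)
```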

#### Sampling a corpus

Finally, you can sample the documents in a corpus using corpus_sample. To get a random sample of two documents from our corpus:

[43]:
from tmtoolkit.corpus import corpus_sample

# sample two out of four documents in the corpus
corpus_sample(corp, 2)

print_summary(corp)
Corpus with 2 documents in English
> sample3-rejoined (36 tokens): And here we go with the third and final example fi...
> sample4 (8 tokens): This document was added as fourth example .
total number of tokens: 44 / vocabulary size: 30
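
Drawing a random sample of documents without replacement can be sketched with the standard library. The function sample_docs and its seed parameter are hypothetical and not part of tmtoolkit. Unlike the corpus_sample call above, which reduced the corpus itself to two documents, this sketch returns a new dictionary and leaves the input untouched.

```python
import random


def sample_docs(docs, n, seed=None):
    """Hypothetical helper: return n randomly chosen documents as a new dict."""
    if not 0 < n <= len(docs):
        raise ValueError('n must be between 1 and the number of documents')
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    chosen = rng.sample(list(docs), n)
    return {lbl: docs[lbl] for lbl in chosen}


docs = {
    'sample1': 'first text',
    'sample2': 'second text',
    'sample3-rejoined': 'third text',
    'sample4': 'fourth text',
}

sampled = sample_docs(docs, 2, seed=1)
print(sorted(sampled))
```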

There are more corpus functions for document management, namely the functions for filtering documents. These will be explained in the next chapter, which will also show how to apply several text processing and mining methods to a corpus.