Working with text corpora

Your text data usually comes in the form of (long) plain text strings that are stored in one or several files on disk. We can load and transform this data into a Corpus object so that we can perform all kinds of operations that are implemented as corpus functions in tmtoolkit. The Corpus class itself resembles a Python dictionary with some additional functionality.

Let’s import the Corpus class first:

[1]:

from tmtoolkit.corpus import Corpus

Loading text data

Several methods are implemented to load text data from different sources:

load built-in datasets
load plain text files (“.txt files”)
load folder(s) with plain text files
load a tabular (i.e. CSV or Excel) file containing document IDs and texts
load a ZIP file containing plain text or tabular files

We can create a Corpus object directly by immediately loading a dataset using one of the Corpus.from_... methods. This is what we’ve done when we used corp = Corpus.from_builtin_corpus('en-News100') in the previous chapter.

Let’s load a folder with example documents. Make sure that the path is relative to the current working directory. The data for these examples can be downloaded from GitHub.

Note: Rich text documents

If you want to work with “rich text documents”, i.e. formatted, non-plain text sources such as PDFs, Word documents, HTML files, etc., you must convert them to one of the supported formats first. For example, you can use the pdftotext command from the Linux package poppler-utils to convert from PDF to plain text files or pandoc to convert from Word or HTML to plain text.

[2]:

corp = Corpus.from_folder('data/corpus_example', language='en')
corp

[2]:

<Corpus [3 documents  / language "en"]>

Again, we can look at which document labels were created and print the tokens of one sample document:

[3]:

corp.doc_labels

[3]:

['sample1', 'sample2', 'sample3']

[4]:

corp['sample3']['token']

[4]:

['And',
 'here',
 'we',
 'go',
 'with',
 'the',
 'third',
 'and',
 'final',
 'example',
 'file',
 '.',
 '\n',
 'Another',
 'line',
 'of',
 'text',
 '.',
 '\n\n',
 '§',
 ...]

The corpus_summary and print_summary functions are very helpful to get a first overview of a loaded corpus:

[5]:

from tmtoolkit.corpus import print_summary

print_summary(corp)

Corpus with 3 documents in English
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (33 tokens): Here comes the second example ( with HTML < i > ta...
total number of tokens: 94 / vocabulary size: 64

Side note: Corpus functions

The corpus_summary and print_summary functions are examples of corpus functions. All corpus functions accept a Corpus object as first argument and operate on it. A corpus function may retrieve information from a corpus and/or modify it. Most functions in the tmtoolkit.corpus module are corpus functions.

Another option is to create a Corpus object and add further documents using the corpus_add_... functions. Here we create an empty Corpus and then add documents via corpus_add_files, which is another example of a corpus function (one that modifies a Corpus object). It takes a Corpus object and one or more paths to raw text files.

[6]:

corp = Corpus(language='en')
print_summary(corp)

Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0

[7]:

from tmtoolkit.corpus import corpus_add_files

corpus_add_files(corp, 'data/corpus_example/sample1.txt')
print_summary(corp)

Corpus with 1 document in English
> data_corpus_example-sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 25 / vocabulary size: 24

Note that this time the document label is different. It’s prefixed by a normalized version of the path to the document. We can alter the doc_label_fmt argument of corpus_add_files in order to control how document labels are generated. But first, let’s remove the previously loaded document from the corpus. Since a Corpus instance behaves like a Python dict, we can use del:

[8]:

del corp['data_corpus_example-sample1']
print_summary(corp)

Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0

Now we use a modified doc_label_fmt parameter value to generate document labels only from the file name and not from the full path to the document. We also load three files now:

[9]:

corpus_add_files(corp, ['data/corpus_example/sample1.txt',
                        'data/corpus_example/sample2.txt',
                        'data/corpus_example/sample3.txt'],
                 doc_label_fmt='{basename}')
print_summary(corp)

Corpus with 3 documents in English
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (33 tokens): Here comes the second example ( with HTML < i > ta...
total number of tokens: 94 / vocabulary size: 64
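To make the effect of doc_label_fmt more tangible, the following sketch mimics how a label could be derived from a file path. The helper make_doc_label, its default format string and its path_join argument are hypothetical; they only approximate the labels seen in the outputs above and are not tmtoolkit’s actual implementation.

```python
from pathlib import PurePosixPath


def make_doc_label(path, doc_label_fmt='{path}-{basename}', path_join='_'):
    """Hypothetical helper: build a document label from a file path."""
    p = PurePosixPath(path)
    # join the directory parts to a "normalized" path, e.g. "data_corpus_example"
    norm_path = path_join.join(part for part in p.parent.parts if part not in ('/', '.'))
    # p.stem is the file name without extension, e.g. "sample1"
    return doc_label_fmt.format(path=norm_path, basename=p.stem)


# label built from the full path (assumed default format)
print(make_doc_label('data/corpus_example/sample1.txt'))
# label built from the file name only
print(make_doc_label('data/corpus_example/sample1.txt', doc_label_fmt='{basename}'))
```

The first call prints data_corpus_example-sample1 and the second prints sample1, which corresponds to the two label styles shown in the summaries above.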

As noted in the beginning, there are more corpus_add_... and Corpus.from_... functions/methods to load text data from different sources. See the corpus module API for details.

Note

Please be aware of the difference between the corpus_add_... and Corpus.from_... functions/methods: The former modify a given Corpus object, whereas the latter create a new Corpus object.

Configuring the NLP pipeline, parallel processing and more via Corpus parameters

When initializing a Corpus, you can pass several arguments. You must at least provide either the language, language_model or spacy_instance argument. The simplest is to just pass the language as a two-letter ISO 639-1 language code (“en”, “de”, “es”, etc.). A respective SpaCy language model will then be automatically selected depending on that language code and the NLP pipeline features that you require. Alternatively, you can set the SpaCy language model to be loaded via language_model. Finally, if you already loaded a SpaCy pipeline instance yourself, you can also pass it via the spacy_instance parameter.

You can specify the features, i.e. the components, of the NLP pipeline using the load_features parameter. This only applies if you don’t provide your own pipeline instance via spacy_instance. It determines the pipeline’s capabilities, i.e. if POS-tags, named entities, sentences, etc. are recognized. By default, tmtoolkit will load and enable all components except for the named-entity recognition (NER) component. The more components are enabled, the slower the text processing works and the more memory it uses. So it makes sense to only enable the components that you actually use. See the SpaCy documentation for which components are implemented (note that this also depends on the language and language model).

Let’s create a corpus with only a minimal pipeline. If we set load_features to an empty sequence, only a basic tokenizing pipeline is run.

[10]:

corp_minimal = Corpus.from_folder('data/corpus_example', language='en', load_features=[])

We can compare the token table output of the minimal pipeline with the default pipeline used in the previously loaded corp object:

[11]:

from tmtoolkit.corpus import tokens_table

tokens_table(corp_minimal)

[11]:

	doc	position	token	is_punct	is_stop	like_num
0	sample1	0	This	False	True	False
1	sample1	1	is	False	True	False
2	sample1	2	the	False	True	False
3	sample1	3	first	False	True	True
4	sample1	4	example	False	False	False
...	...	...	...	...	...	...
89	sample3	31	third	False	True	True
90	sample3	32	and	False	True	False
91	sample3	33	final	False	False	False
92	sample3	34	paragraph	False	False	False
93	sample3	35	.	True	False	False

94 rows × 6 columns

[12]:

tokens_table(corp)

[12]:

	doc	position	token	is_punct	is_stop	lemma	like_num	pos	tag
0	sample1	0	This	False	True	this	False	PRON	DT
1	sample1	1	is	False	True	be	False	AUX	VBZ
2	sample1	2	the	False	True	the	False	DET	DT
3	sample1	3	first	False	True	first	True	ADJ	JJ
4	sample1	4	example	False	False	example	False	NOUN	NN
...	...	...	...	...	...	...	...	...	...
89	sample3	31	third	False	True	third	True	ADJ	JJ
90	sample3	32	and	False	True	and	False	CCONJ	CC
91	sample3	33	final	False	False	final	False	ADJ	JJ
92	sample3	34	paragraph	False	False	paragraph	False	NOUN	NN
93	sample3	35	.	True	False	.	False	PUNCT	.

94 rows × 9 columns

As we can see, the minimal pipeline produces less information: The lemmata and POS tags are missing from the table for the corpus using the minimal pipeline. Also, if we wanted sentences, we’d get an error since sentence recognition is not enabled in the minimal pipeline:

tokens_table(corp_minimal, sentences=True)
# results in ValueError: sentence numbers requested, but sentence borders not set; Corpus documents probably not parsed with sentence recognition

You can also use the default pipeline as a basis and add new components via the add_features parameter. Here, we add the NER component for detecting named entities such as persons. The table then shows a new column ent_type with "PERSON" entries next to names:

[13]:

corp_ner = Corpus.from_folder('data/corpus_example', language='en', add_features=['ner'])

tokens_table(corp_ner, select='sample1').tail(10)

[13]:

	doc	position	token	ent_type	is_punct	is_stop	lemma	like_num	pos	tag
15	sample1	15	famous		False	False	famous	False	ADJ	JJ
16	sample1	16	people		False	False	people	False	NOUN	NNS
17	sample1	17	like		False	False	like	False	ADP	IN
18	sample1	18	Missy	PERSON	False	False	Missy	False	PROPN	NNP
19	sample1	19	Elliott	PERSON	False	False	Elliott	False	PROPN	NNP
20	sample1	20	or		False	True	or	False	CCONJ	CC
21	sample1	21	George	PERSON	False	False	George	False	PROPN	NNP
22	sample1	22	Harrison	PERSON	False	False	Harrison	False	PROPN	NNP
23	sample1	23	.		True	False	.	False	PUNCT	.
24	sample1	24	\n		False	False	\n	False	SPACE	_SP

Parallel processing

A final important parameter is the max_workers setting. With it, you can enable parallel processing and specify the number of worker processes that can run in parallel. You should never set this value higher than the number of CPU cores in your machine (it will still work, but it will actually slow down the processing). Parallel processing makes most sense for large corpora with thousands of documents. With the small toy examples in this tutorial, you won’t see performance gains (you will actually see performance loss due to increased overhead for setting up the parallel processing).

You can provide an integer value to max_workers, which sets the number of worker processes. 0 or 1 disable parallel processing (the default behavior).

Here, we use two workers, i.e. two CPU cores of our machine:

[14]:

corp_2w = Corpus.from_folder('data/corpus_example', language='en', max_workers=2)
corp_2w

[14]:

<Corpus [3 documents  / 2 worker processes / language "en"]>

If you pass a negative number -x, this means “use all available CPU cores, but leave x cores spare”. For example, when your machine has four CPU cores, three of them will be used by tmtoolkit when setting max_workers=-1.

[15]:

corp_allbutone = Corpus.from_folder('data/corpus_example', language='en', max_workers=-1)
corp_allbutone

[15]:

<Corpus [3 documents  / 11 worker processes / language "en"]>

Finally, you can pass a float value to the max_workers parameter which specifies the fraction of available CPU cores to use (rounded). Here, we use 50% of the available CPU cores:

[16]:

corp_halfcpus = Corpus.from_folder('data/corpus_example', language='en', max_workers=0.5)
corp_halfcpus

[16]:

<Corpus [3 documents  / 6 worker processes / language "en"]>
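The three ways of specifying max_workers (positive integer, negative integer, float fraction) can be summarized in a small sketch. The function resolve_max_workers and its n_cpus argument are hypothetical and only illustrate the rules described above; tmtoolkit’s internal logic, in particular its rounding, may differ.

```python
import os


def resolve_max_workers(max_workers, n_cpus=None):
    """Hypothetical helper: translate a max_workers value to a worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if isinstance(max_workers, float):
        # a float is read as fraction of the available CPU cores (rounded)
        n = round(max_workers * n_cpus)
    elif max_workers < 0:
        # a negative value -x means: all cores, but leave x cores spare
        n = n_cpus + max_workers
    else:
        n = max_workers
    # 0 or 1 means no parallel processing, i.e. a single process
    return max(1, n)


# n_cpus is fixed to 12 here to make the result reproducible
for value in (2, -1, 0.5, 0):
    print(value, '->', resolve_max_workers(value, n_cpus=12))
```

With n_cpus=12, the sketch gives 11 workers for -1 and 6 workers for 0.5, which is consistent with the outputs above if they were produced on a 12-core machine.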

[17]:

del corp_2w, corp_allbutone, corp_halfcpus

Text preprocessing before tokenization

Finally, there’s the option to provide a text preprocessing pipeline that is applied on the raw text documents before the text is tokenized. This is helpful in cases where the tokenizer might be confused by some elements in the text. An example might be HTML tags like in <i>italic text</i>. If your text contains these, you should remove them, since the tokenizer can’t handle them as seen for example in document “sample2”:

[18]:

corp['sample2']['token']

[18]:

['Here',
 'comes',
 'the',
 'second',
 'example',
 '(',
 'with',
 'HTML',
 '<',
 'i',
 '>',
 'tags</i',
 '>',
 '&',
 'amp',
 ';',
 'entities',
 ')',
 '.',
 '\n\n',
 ...]

There are two options for how to handle this: One option is to write your own code for loading the text and cleansing it and then pass the preprocessed text as dictionary to the Corpus constructor. The other option is to use the raw_preproc parameter for the Corpus class. This parameter allows you to pass one or more functions that will be applied one after another to the raw input text. The only requirement for these functions is that they accept a string and return a processed string.
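As a rough illustration of the first option, the following sketch reads all plain text files from a folder, cleans them with standard library tools and collects them in a dictionary that maps document labels to texts. The helper load_clean_texts is hypothetical, and the regular expression is a deliberately crude way of removing tags, meant only for illustration.

```python
import html
import re
from pathlib import Path


def load_clean_texts(folder):
    """Hypothetical helper: read all .txt files in a folder and clean the raw text."""
    texts = {}
    for path in sorted(Path(folder).glob('*.txt')):
        raw = path.read_text(encoding='utf-8')
        no_tags = re.sub(r'<[^>]+>', '', raw)      # crude tag removal, illustration only
        texts[path.stem] = html.unescape(no_tags)  # converts entities, e.g. "&amp;" to "&"
    return texts


# The resulting dict maps document labels to cleaned texts. It could then be
# passed to the constructor, e.g.:
# corp = Corpus(load_clean_texts('data/corpus_example'), language='en')
```

This keeps full control over loading and cleaning in your own code, at the cost of reimplementing what the Corpus.from_... methods already do.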

There’s already a function for stripping HTML tags in tmtoolkit, strip_tags, and we can set it as the sole input preprocessing function. We can see in the tokens of “sample2” that there are no more HTML tags and that the HTML entity &amp; was converted to &:

[19]:

from tmtoolkit.corpus import strip_tags

corp = Corpus.from_folder('data/corpus_example', language='en', raw_preproc=strip_tags)
corp['sample2']['token']

[19]:

['Here',
 'comes',
 'the',
 'second',
 'example',
 '(',
 'with',
 'HTML',
 'tags',
 '&',
 'entities',
 ')',
 '.',
 '\n\n',
 'This',
 'one',
 'contains',
 'three',
 'lines',
 'of',
 ...]

You may specify your own preprocessing function and pass all functions that will be applied one after another as a list:

[20]:

def remove_line_breaks(txt):
    return txt.replace('\n', ' ')

corp_no_lb = Corpus.from_folder('data/corpus_example', language='en',
                                raw_preproc=[strip_tags, remove_line_breaks])
corp_no_lb['sample2']['token']

[20]:

['Here',
 'comes',
 'the',
 'second',
 'example',
 '(',
 'with',
 'HTML',
 'tags',
 '&',
 'entities',
 ')',
 '.',
 ' ',
 'This',
 'one',
 'contains',
 'three',
 'lines',
 'of',
 ...]
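Conceptually, applying a list of preprocessing functions “one after another” is plain function composition: the output string of one function becomes the input of the next. The sketch below shows this with the standard library only; apply_chain and collapse_spaces are hypothetical names and do not reflect how tmtoolkit applies raw_preproc internally.

```python
import re
from functools import reduce


def remove_line_breaks(txt):
    return txt.replace('\n', ' ')


def collapse_spaces(txt):
    # hypothetical second step: squeeze runs of spaces into a single space
    return re.sub(r' {2,}', ' ', txt)


def apply_chain(txt, funcs):
    """Apply each str -> str function in order, feeding each result into the next."""
    return reduce(lambda t, f: f(t), funcs, txt)


print(repr(apply_chain('first line\n\nsecond line', [remove_line_breaks, collapse_spaces])))
```

The order of the list matters: swapping the two functions here would leave the double space untouched, because the line breaks would only be replaced after the spaces were collapsed.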

Text (pre-)processing

The tmtoolkit package contains a lot of text processing and text “normalization” functions, but they can all be applied after tokenization, i.e. after creating the Corpus object. The next chapter will present many of these functions. You should only use the raw_preproc parameter in cases where downstream tasks like tokenization fail due to malformed input (e.g. text with HTML tags). In all other cases use the functions explained in the next chapter.

Corpus properties, iteration and corpus functions for document management

A Corpus object provides several helpful properties that summarize the text data, and the tmtoolkit.corpus module features several corpus functions to manage the documents.

Properties

Let’s start with the number of documents in the corpus. There are two ways to obtain this value:

[21]:

len(corp)

[21]:

3

[22]:

corp.n_docs

[22]:

3

Other important properties are the language of the documents in the corpus and the SpaCy language model used in the NLP pipeline (note that so far tmtoolkit requires that all documents within one corpus use the same language):

[23]:

corp.language

[23]:

'en'

[24]:

corp.language_model

[24]:

'en_core_web_sm'

The corpus’ SpaCy NLP pipeline is available via the nlp property:

[25]:

corp.nlp

[25]:

<spacy.lang.en.English at 0x7fe4a199dd80>

As you can configure the NLP pipeline (as shown in the previous section), sentence recognition is optional. You can check if sentences were recognized using the following property:

[26]:

corp.has_sents

[26]:

True

The max_workers property shows how many worker processes are used during computations.

[27]:

corp.max_workers

[27]:

1

You can set new values for this property in order to temporarily enable parallel processing:

[28]:

corp.max_workers = 2
corp

[28]:

<Corpus [3 documents  / 2 worker processes / language "en"]>

[29]:

corp.max_workers = 1   # reset

Another important property is the list of document labels:

[30]:

corp.doc_labels

[30]:

['sample1', 'sample2', 'sample3']

Working with a Corpus object like a dictionary

A Corpus object resembles a dictionary, i.e. it provides the same methods. The keys of this corpus-dictionary are the unique document labels and the values or elements are the respective Document objects.

You can access a Corpus object’s elements via square brackets (corp[<doc. label>]) or corp.get(<doc. label>):

[31]:

corp['sample1']

[31]:

Document "sample1" (25 tokens, 9 token attributes, 2 document attributes)

[32]:

corp.get('sample2')

[32]:

Document "sample2" (27 tokens, 9 token attributes, 2 document attributes)

The square brackets also accept an integer i, which will give you the document with the ith document label (corresponding to the order in corp.doc_labels):

[33]:

corp[2]

[33]:

Document "sample3" (36 tokens, 9 token attributes, 2 document attributes)

Slicing is also supported and returns a list of Document objects:

[34]:

corp[0:2]

[34]:

[Document "sample1" (25 tokens, 9 token attributes, 2 document attributes),
 Document "sample2" (27 tokens, 9 token attributes, 2 document attributes)]

You can also update existing documents or add new documents using the bracket syntax. For this, you can either pass a string, a Document object or a SpaCy Doc object.

[35]:

print_summary(corp)

Corpus with 3 documents in English
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (27 tokens): Here comes the second example ( with HTML tags & e...
total number of tokens: 88 / vocabulary size: 59

[36]:

corp['sample2'] = 'This is the updated version of the second example.'
corp['sample4'] = 'This document was added as fourth example.'
print_summary(corp)

Corpus with 4 documents in English
> sample3 (36 tokens): And here we go with the third and final example fi...
> sample4 (8 tokens): This document was added as fourth example .
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (10 tokens): This is the updated version of the second example ...
total number of tokens: 79 / vocabulary size: 66

The dictionary interface of the Corpus class also allows you to iterate through its contents via corp.items(), corp.keys() or corp.values().

[37]:

# iterate through pairs of document labels `lbl` and
# document objects `d` and report each document's length
for lbl, d in corp.items():
    print(f'Document length for {lbl}: {len(d)}')

Document length for sample1: 25
Document length for sample2: 10
Document length for sample3: 36
Document length for sample4: 8

[38]:

# make a list of each document's document attributes
# (more on document attributes in the next chapter)
[d.doc_attrs for d in corp.values()]

[38]:

[{'label': 'sample1', 'has_sents': True},
 {'label': 'sample2', 'has_sents': True},
 {'label': 'sample3', 'has_sents': True},
 {'label': 'sample4', 'has_sents': True}]

Corpus functions for document management

Splitting documents

Sometimes you may want to split larger documents. In our example corpus, document “sample3” consists of three paragraphs:

[39]:

from tmtoolkit.corpus import doc_texts

print(doc_texts(corp)['sample3'])

And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.

We can use the corpus_split_by_paragraph function for splitting documents by paragraphs. There’s also corpus_split_by_token for splitting by an arbitrary token.

[40]:

from tmtoolkit.corpus import corpus_split_by_paragraph

corpus_split_by_paragraph(corp)
print_summary(corp)

Corpus with 6 documents in English
> sample3-3 (6 tokens): The third and final paragraph .
> sample3-2 (11 tokens): § 2 .   This is the second paragraph .
> sample3-1 (19 tokens): And here we go with the third and final example fi...
> sample2 (10 tokens): This is the updated version of the second example ...
> sample4 (8 tokens): This document was added as fourth example .
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
total number of tokens: 79 / vocabulary size: 49

As we can see above, the “sample3” document was split into three individual documents, one per paragraph. You can further customize the splitting process by tweaking the parameters, e.g. the minimum number of line breaks used to detect paragraphs (default is two line breaks).
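The idea behind paragraph splitting can be sketched on plain strings: the text is cut wherever a minimum number of consecutive line breaks occurs, and each part receives a numbered label. The helper split_by_paragraph and its parameter names are hypothetical and do not mirror the signature of corpus_split_by_paragraph; unlike the corpus function, this sketch also discards the separating line breaks.

```python
import re


def split_by_paragraph(label, text, min_linebreaks=2, label_fmt='{label}-{num}'):
    """Hypothetical helper: split one text into paragraphs keyed by new labels."""
    # a paragraph border is a run of at least `min_linebreaks` line breaks
    pattern = r'\n{%d,}' % min_linebreaks
    parts = [p for p in re.split(pattern, text) if p.strip()]
    return {label_fmt.format(label=label, num=i): p
            for i, p in enumerate(parts, start=1)}


text = ('And here we go with the third and final example file.\n'
        'Another line of text.\n\n'
        '§2.\nThis is the second paragraph.\n\n'
        'The third and final paragraph.')

for new_label, paragraph in split_by_paragraph('sample3', text).items():
    print(new_label, repr(paragraph))
```

Raising min_linebreaks to 3 would leave this text as a single paragraph, since it never contains more than two consecutive line breaks.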

Joining documents

There’s of course also a function for doing the inverse operation, i.e. joining several documents. This function is called corpus_join_documents and it accepts a dictionary that maps a name for the newly joined document to a string pattern or a list of string patterns of documents to be joined. This function is especially helpful when you want to bundle lots of smaller documents (e.g. tweets) into a bigger document (e.g. all tweets of one account).

[41]:

from tmtoolkit.corpus import corpus_join_documents

# alternative: no matching, specify each document name in a list
# corpus_join_documents(corp, {'sample3-rejoined': ['sample3-1', 'sample3-2', 'sample3-3']})

# join all documents matching "sample3*" to form a new document "sample3-rejoined"
# "sample3*" is a glob pattern, so we set this as `match_type` (more on that in the next chapter)
# we also set glue to an empty string, because we don't need to have additional line breaks
# between the documents
corpus_join_documents(corp, {'sample3-rejoined': 'sample3*'}, glue='', match_type='glob')
print_summary(corp)

Corpus with 4 documents in English
> sample4 (8 tokens): This document was added as fourth example .
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (10 tokens): This is the updated version of the second example ...
> sample3-rejoined (36 tokens): And here we go with the third and final example fi...
total number of tokens: 79 / vocabulary size: 49

[42]:

print(doc_texts(corp)['sample3-rejoined'])

And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.
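How glob matching and the glue string interact can be sketched with a plain dictionary of texts and the standard library’s fnmatch module. The function join_documents is a hypothetical stand-in for illustration only; it is not corpus_join_documents, and its default glue is an assumption.

```python
from fnmatch import fnmatchcase


def join_documents(docs, join, glue='\n\n'):
    """Hypothetical helper: merge texts whose labels match a glob pattern."""
    result = dict(docs)
    for new_label, pattern in join.items():
        # collect matching labels in sorted order, so that parts are joined in sequence
        matched = sorted(lbl for lbl in docs if fnmatchcase(lbl, pattern))
        if not matched:
            continue
        joined = glue.join(docs[lbl] for lbl in matched)
        # remove the matched documents, then add the joined document
        for lbl in matched:
            result.pop(lbl, None)
        result[new_label] = joined
    return result


docs = {'sample1': 'x', 'sample3-1': 'a\n\n', 'sample3-2': 'b\n\n', 'sample3-3': 'c'}
print(join_documents(docs, {'sample3-rejoined': 'sample3*'}, glue=''))
```

Because the parts in this example still end with their own line breaks, an empty glue string restores the original text without adding extra blank lines.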

Sampling a corpus

Finally, you can sample the documents in a corpus using corpus_sample. To get a random sample of two documents from our corpus:

[43]:

from tmtoolkit.corpus import corpus_sample

# sample two out of four documents in the corpus
corpus_sample(corp, 2)

print_summary(corp)

Corpus with 2 documents in English
> sample1 (25 tokens): This is the first example file . ☺ We showcase NER...
> sample2 (10 tokens): This is the updated version of the second example ...
total number of tokens: 35 / vocabulary size: 28

There are more corpus functions for document management, namely the functions for filtering documents. These will be explained in the next chapter, which will also show how to apply several text processing and mining methods to a corpus.