Working with text corpora

Your text data usually comes in the form of (long) plain text strings that are stored in one or several files on disk. The Corpus class is for loading and managing plain text corpora, i.e. sets of documents that each have a label and their content as a text string. It resembles a Python dictionary with additional functionality.

Let’s import the Corpus class first:

[1]:

from tmtoolkit.corpus import Corpus

Loading text data

Several methods are implemented to load text data from different sources:

load built-in datasets
load plain text files (“.txt files”)
load folder(s) with plain text files
load a tabular (i.e. CSV or Excel) file containing document IDs and texts
load a ZIP file containing plain text or tabular files

We can create a Corpus object and load a dataset in one step using one of the Corpus.from_... methods. This is what we did in the previous chapter with corpus = Corpus.from_builtin_corpus('en-NewsArticles'). Let’s load a folder with example documents. Make sure that the path is relative to the current working directory. The data for these examples can be downloaded from GitHub.

Note

If you want to work with “rich text documents”, i.e. formatted, non-plain text sources such as PDFs, Word documents, HTML files, etc., you must convert them to one of the supported formats first. For example, you can use the pdftotext command from the Linux package poppler-utils to convert from PDF to plain text files, or pandoc to convert from Word or HTML to plain text.

[2]:

corpus = Corpus.from_folder('data/corpus_example')
corpus

[2]:

<Corpus [3 documents]>
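To make the idea of "loading a folder" concrete, here is a minimal plain-Python sketch that reads all .txt files of a folder into a dict of labels and texts. It is not how tmtoolkit implements Corpus.from_folder(); the helper name load_txt_folder and the choice to use the file name without extension as label are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_txt_folder(folder, encoding="utf8"):
    """Read every .txt file below `folder` into a {label: text} dict.

    Hypothetical helper for illustration only; the label is the file
    name without its extension.
    """
    docs = {}
    for path in sorted(Path(folder).rglob("*.txt")):
        docs[path.stem] = path.read_text(encoding=encoding)
    return docs


# Build a throwaway folder so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample1.txt").write_text("This is the first example file.", encoding="utf8")
    Path(tmp, "sample2.txt").write_text("Here comes the second example.", encoding="utf8")
    Path(tmp, "notes.md").write_text("Ignored, not a .txt file.", encoding="utf8")

    docs = load_txt_folder(tmp)

print(sorted(docs))
```

Only the two .txt files end up in the result; the Markdown file is skipped because of the file name pattern.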

Again, we can check which document labels were created and print one sample document:

[3]:

corpus.doc_labels

[3]:

['sample1', 'sample2', 'sample3']

[4]:

corpus['sample1']

[4]:

'This is the first example file. ☺'

Now let’s look at all documents’ text. Since we have a very small corpus, printing all text out shouldn’t be a problem. We can iterate through all documents by using the items() method because a Corpus object behaves like a dict. We will write a small function for this because we’ll reuse this later and one of the most important principles when writing code is DRY – don’t repeat yourself.

[5]:

def print_corpus(c):
    """Print all documents and their text in corpus `c`"""

    for doc_label, doc_text in c.items():
        print(doc_label, ':')
        print(doc_text)
        print('---\n')

print_corpus(corpus)

sample1 :
This is the first example file. ☺
---

sample2 :
Here comes the second example.

This one contains three lines of plain text which means two paragraphs.
---

sample3 :
And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.
---

Another option is to create a Corpus object by passing a dictionary of already obtained data and then optionally adding further documents using the Corpus.add_... methods. We can also create an empty Corpus and then add documents:

[6]:

corpus = Corpus()
corpus

[6]:

<Corpus [0 documents]>

[7]:

corpus.add_files('data/corpus_example/sample1.txt')
corpus.doc_labels

[7]:

['data_corpus_example-sample1']

See how we created an empty corpus first and then added a single document. Also note that this time the document label is different: it’s prefixed by a normalized version of the path to the document. We can alter the doc_label_fmt argument of Corpus.add_files() in order to control how document labels are generated. But first, let’s remove the previously loaded document from the corpus. Since a Corpus instance behaves like a Python dict, we can use del:

[8]:

del corpus['data_corpus_example-sample1']
corpus

[8]:

<Corpus [0 documents]>

Now we use a modified doc_label_fmt parameter value to generate document labels only from the file name and not from the full path to the document. We also load three files this time:

[9]:

corpus.add_files(['data/corpus_example/sample1.txt',
                  'data/corpus_example/sample2.txt',
                  'data/corpus_example/sample3.txt'],
                 doc_label_fmt='{basename}')
corpus.doc_labels

[9]:

['sample1', 'sample2', 'sample3']
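The following sketch shows one way such a label format string could be expanded. It is an illustration in plain Python, not tmtoolkit's code: the helper make_doc_label, the {path} placeholder and the underscore used to join path components are assumptions inferred from the label 'data_corpus_example-sample1' seen earlier. Only the {basename} placeholder is confirmed by the example above.

```python
import os


def make_doc_label(path, doc_label_fmt="{path}-{basename}", path_join="_"):
    """Build a document label from a file path (hypothetical helper)."""
    dirname, filename = os.path.split(os.path.normpath(path))
    basename, _ext = os.path.splitext(filename)
    # Normalize the directory part, e.g. "data/corpus_example" -> "data_corpus_example".
    path_part = path_join.join(part for part in dirname.split(os.sep) if part)
    return doc_label_fmt.format(path=path_part, basename=basename)


full_label = make_doc_label("data/corpus_example/sample1.txt")
short_label = make_doc_label("data/corpus_example/sample1.txt", doc_label_fmt="{basename}")

print(full_label)   # data_corpus_example-sample1
print(short_label)  # sample1
```

Short labels are easier to read, but they can collide when files with the same name live in different folders, which is a reason to keep a path prefix for larger collections.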

As noted in the beginning, there are more add_... and from_... methods to load text data from different sources. See the Corpus API for details.

Note

Please be aware of the difference between the add_... and from_... methods: the former modify a given Corpus instance, whereas the latter create a new Corpus instance.
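The distinction follows a common Python pattern: a class method that constructs a new object versus an instance method that changes an existing one. The toy class below only sketches that pattern. MiniCorpus, from_texts and add_texts are made-up names and not part of tmtoolkit.

```python
class MiniCorpus(dict):
    """Toy stand-in for a corpus: a dict of {label: text}. Not tmtoolkit's class."""

    @classmethod
    def from_texts(cls, texts):
        """Create and return a NEW instance (the `from_...` style)."""
        return cls(texts)

    def add_texts(self, texts):
        """Modify THIS instance in place (the `add_...` style)."""
        self.update(texts)


a = MiniCorpus.from_texts({"doc1": "first text"})
b = MiniCorpus.from_texts({"doc2": "second text"})  # independent of `a`
a.add_texts({"doc3": "third text"})                 # changes `a` itself

print(list(a))  # ['doc1', 'doc3']
print(list(b))  # ['doc2']
```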

Corpus properties and methods

A Corpus object provides several helpful properties that summarize the plain text data and several methods to manage the documents.

Number of documents and characters

Let’s start with the number of documents in the corpus. There are two ways to obtain this value:

[10]:

len(corpus)

[10]:

3

[11]:

corpus.n_docs

[11]:

3

Another important property is the number of characters per document:

[12]:

corpus.doc_lengths

[12]:

{'sample1': 33, 'sample2': 103, 'sample3': 142}
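These lengths are plain character counts, so they can be reproduced with len() on the raw strings. The short sketch below does this for the first two sample texts, using an ordinary dict as a stand-in for the corpus.

```python
docs = {
    "sample1": "This is the first example file. ☺",
    "sample2": "Here comes the second example.\n\n"
               "This one contains three lines of plain text which means two paragraphs.",
}

# Number of characters per document; line breaks and the smiley count as one character each.
doc_lengths = {label: len(text) for label, text in docs.items()}

print(doc_lengths)  # {'sample1': 33, 'sample2': 103}
```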

Characters used in the corpus

The unique_characters property returns the set of characters that occur at least once in the corpus.

[13]:

corpus.unique_characters

[13]:

{'\n',
 ' ',
 '.',
 '2',
 'A',
 'H',
 'T',
 'a',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'w',
 'x',
 '§',
 '☺'}

This is helpful if you want to check if there are strange characters in your documents that you may want to replace or remove. For example, I included a Unicode smiley ☺ in the first document (which may not be rendered correctly in your browser) that we can remove using Corpus.remove_characters().

[14]:

corpus['sample1']

[14]:

'This is the first example file. ☺'

[15]:

corpus.remove_characters('☺')
corpus['sample1']

[15]:

'This is the first example file. '
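The same check and cleanup can be sketched with built-in string tools, which may help to see what happens under the hood. This is an illustration on a plain dict and not tmtoolkit's implementation; treating all non-ASCII characters as candidates for review is just one possible heuristic.

```python
docs = {
    "sample1": "This is the first example file. ☺",
    "sample3": "§2.\nThis is the second paragraph.",
}

# Union of all characters over all documents.
unique_characters = set().union(*docs.values())

# Characters outside ASCII are often worth a closer look.
suspicious = sorted(c for c in unique_characters if not c.isascii())
print(suspicious)  # ['§', '☺']

# Remove one of them everywhere; the third argument of str.maketrans
# lists the characters to delete.
table = str.maketrans("", "", "☺")
cleaned = {label: text.translate(table) for label, text in docs.items()}
print(repr(cleaned["sample1"]))  # 'This is the first example file. '
```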

Corpus.filter_characters() behaves similarly to the method used above, but by default it removes all characters that are not in a whitelist of allowed characters.

Corpus.replace_characters() allows you to replace certain characters with others. With Corpus.apply() you can perform any custom text transformation on each document.

There are more filtering methods: Corpus.filter_by_min_length() and Corpus.filter_by_max_length() allow you to remove documents that are too short or too long.
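To illustrate what these operations do conceptually, the sketch below performs whitelist filtering, character replacement, a custom transformation and length-based filtering on a plain dict. It deliberately does not call the Corpus methods, since their exact parameters are documented in the Corpus API; the whitelist and the minimum length of 10 are arbitrary choices for this example.

```python
import string

docs = {
    "sample1": "This is the first example file. ☺",
    "short": "Hi ☺",
}

# Whitelist filtering: keep only characters from an allowed set.
allowed = set(string.ascii_letters + string.digits + " .,!?\n")
filtered = {label: "".join(c for c in text if c in allowed)
            for label, text in docs.items()}

# Replacement: map single characters to replacement strings.
replace_table = str.maketrans({"☺": ":)"})
replaced = {label: text.translate(replace_table) for label, text in docs.items()}

# Custom transformation: apply any function to each document's text.
lowered = {label: text.lower() for label, text in docs.items()}

# Length filtering: drop documents with fewer than `min_length` characters.
min_length = 10
long_enough = {label: text for label, text in filtered.items()
               if len(text) >= min_length}

print(repr(filtered["short"]))  # 'Hi '
print(replaced["short"])        # Hi :)
print(list(long_enough))        # ['sample1']
```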

Note

These methods already go in the direction of “text preprocessing”, which is the topic of the next chapter and is implemented in the tmtoolkit.preprocess module. However, the methods in Corpus differ substantially from the preprocess module: the Corpus methods work on untokenized plain text strings, whereas the preprocess functions and methods work on document tokens (e.g. individual words) and therefore provide a much richer set of tools. Still, it is sometimes necessary to do things like removing certain characters before tokenization, e.g. when such characters confuse the tokenizer.

Splitting by paragraphs

Another helpful method is Corpus.split_by_paragraphs(). This allows splitting each document of the corpus by paragraph.

Again, let’s have a look at our current corpus’ documents:

[16]:

print_corpus(corpus)

sample1 :
This is the first example file.
---

sample2 :
Here comes the second example.

This one contains three lines of plain text which means two paragraphs.
---

sample3 :
And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.
---

As we can see, sample1 contains one paragraph, sample2 two and sample3 three paragraphs. Now we can split those and get the expected number of documents (each paragraph is then an individual document):

[17]:

corpus.split_by_paragraphs()
corpus

[17]:

<Corpus [6 documents]>

Our newly created six documents:

[18]:

print_corpus(corpus)

sample1-1 :
This is the first example file.
---

sample2-1 :
Here comes the second example.
---

sample2-2 :
This one contains three lines of plain text which means two paragraphs.
---

sample3-1 :
And here we go with the third and final example file. Another line of text.
---

sample3-2 :
§2. This is the second paragraph.
---

sample3-3 :
The third and final paragraph.
---

You can further customize the splitting process by tweaking the parameters, e.g. the minimum number of line breaks used to detect paragraphs (default is two line breaks).
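A minimal sketch of such a paragraph splitter is shown below. It reproduces the behaviour seen in the output above (split on runs of line breaks, join the lines of a paragraph with spaces, number the resulting documents), but it is not tmtoolkit's implementation; the function name split_paragraphs and the parameter name min_newlines are assumptions for illustration.

```python
import re


def split_paragraphs(docs, min_newlines=2):
    """Split each text in `docs` into paragraphs (hypothetical helper).

    A run of at least `min_newlines` line breaks separates two paragraphs.
    Each paragraph becomes a new document labelled "<label>-<number>".
    """
    separator = re.compile(r"\n{%d,}" % min_newlines)
    result = {}
    for label, text in docs.items():
        paragraphs = [p for p in separator.split(text) if p.strip()]
        for number, paragraph in enumerate(paragraphs, start=1):
            # Join the lines inside one paragraph with single spaces.
            lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
            result[f"{label}-{number}"] = " ".join(lines)
    return result


docs = {
    "sample3": "And here we go with the third and final example file.\n"
               "Another line of text.\n\n"
               "§2.\nThis is the second paragraph.\n\n"
               "The third and final paragraph.",
}

pars = split_paragraphs(docs)
for label, text in pars.items():
    print(label, ":", text)
```

Raising min_newlines to 3 would treat the whole of sample3 as a single paragraph, because its paragraphs are separated by only two line breaks.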

Sampling a corpus

Finally, you can sample the documents in a corpus using Corpus.sample(). To get a random sample of three documents from our corpus:

[19]:

corpus.sample(3)

[19]:

<Corpus [3 documents]>

[20]:

corpus.doc_labels

[20]:

['sample1-1', 'sample2-1', 'sample2-2', 'sample3-1', 'sample3-2', 'sample3-3']

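Note that this returns a new Corpus instance by default and leaves the original corpus unchanged, which is why corpus.doc_labels still lists all six documents. You can pass as_corpus=False if you only need a Python dict.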
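Sampling without modifying the source can be sketched with the standard library as follows. This is a plain-dict illustration and not tmtoolkit's code; the helper sample_docs and its seed parameter are assumptions added so that the result can be made reproducible.

```python
import random


def sample_docs(docs, n, seed=None):
    """Return a new dict with `n` randomly chosen documents (hypothetical helper)."""
    rng = random.Random(seed)
    labels = rng.sample(sorted(docs), n)  # sampling without replacement
    return {label: docs[label] for label in labels}


docs = {
    "sample1-1": "This is the first example file.",
    "sample2-1": "Here comes the second example.",
    "sample2-2": "This one contains three lines of plain text which means two paragraphs.",
    "sample3-1": "And here we go with the third and final example file. Another line of text.",
    "sample3-2": "§2. This is the second paragraph.",
    "sample3-3": "The third and final paragraph.",
}

sampled = sample_docs(docs, 3, seed=42)

print(len(sampled))  # 3
print(len(docs))     # 6, the original dict is untouched
```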

The next chapter will show how to apply several text preprocessing functions to a corpus.