Getting started

This is only a quick overview of how to get started. Corpus loading, text preprocessing, etc. are explained in depth in the respective chapters.

Loading a built-in text corpus

Once you have installed tmtoolkit, you can start by loading a built-in dataset. Let’s import the Corpus class first and have a look at which datasets are available:

[1]:
from tmtoolkit.corpus import Corpus

Corpus.builtin_corpora()
[1]:
['de-parlspeech-v2-sample-bundestag',
 'en-NewsArticles',
 'en-parlspeech-v2-sample-houseofcommons',
 'es-parlspeech-v2-sample-congreso',
 'nl-parlspeech-v2-sample-tweedekamer']

Let’s load one of these corpora, the News Articles dataset from Harvard Dataverse:

[2]:
corpus = Corpus.from_builtin_corpus('en-NewsArticles')
corpus
[2]:
<Corpus [3824 documents]>

We can have a look at which documents were loaded (showing only the first ten document labels):

[3]:
corpus.doc_labels[:10]
[3]:
['NewsArticles-1',
 'NewsArticles-10',
 'NewsArticles-100',
 'NewsArticles-1000',
 'NewsArticles-1001',
 'NewsArticles-1002',
 'NewsArticles-1003',
 'NewsArticles-1004',
 'NewsArticles-1005',
 'NewsArticles-1006']

The first 100 characters from the document NewsArticles-1:

[4]:
corpus['NewsArticles-1'][:100]
[4]:
'Betsy DeVos Confirmed as Education Secretary, With Pence Casting Historic Tie-Breaking Vote\n\nMichiga'

The Corpus class is for loading and managing plain text corpora, i.e. a set of documents, each with a label and its content as a text string. It resembles a Python dictionary. See working with text corpora for more information.
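
Because the corpus behaves like a dictionary, the usual mapping operations work on it. Here is a minimal sketch, assuming the standard mapping protocol is supported (as the dictionary comparison above suggests):

[ ]:
# dictionary-like access (a sketch; assumes Corpus supports the standard
# mapping protocol, as its dictionary-like behavior suggests)
print(len(corpus))                     # number of documents
print('NewsArticles-1' in corpus)      # does this document label exist?
print(corpus['NewsArticles-1'][:100])  # a document's text, accessed by label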

Tokenizing a corpus

For quantitative text analysis, you usually work with words in documents as units of interest. This means the plain text strings in the corpus’ documents need to be split up into individual tokens (words, punctuation, etc.). As a quick start, we can do so with tokenize after specifying the language of the corpus via init_for_language.

[5]:
from tmtoolkit.preprocess import init_for_language, tokenize

doc_labels = corpus.doc_labels   # save the document labels as a list for later use

init_for_language('en')   # we use an English corpus
docs = tokenize(list(corpus.values()))

The function tokenize() takes a sequence of text strings, tokenizes them and returns a list of tokenized spaCy documents:

[6]:
type(docs)
[6]:
list
[7]:
type(docs[0])
[7]:
spacy.tokens.doc.Doc

Each document in docs is in turn a sequence of tokens (words, punctuation marks, etc.). Let’s peek into the first document (index 0) and return the first ten tokens from it:

[8]:
docs[0][:10]
[8]:
Betsy DeVos Confirmed as Education Secretary, With Pence Casting

docs and doc_labels are aligned, i.e. the first element in doc_labels is the label of the first tokenized document in docs:

[9]:
doc_labels[0]
[9]:
'NewsArticles-1'
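
If you want to keep the labels and the tokenized documents together, you can pair them with plain Python (this is just an illustration, not a tmtoolkit function):

[ ]:
# pair each document label with its tokenized document (plain Python)
labeled_docs = dict(zip(doc_labels, docs))
labeled_docs['NewsArticles-1'][:10]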

Tokenization is part of text preprocessing, which also includes several transformations that you can apply to the tokens, e.g. transforming them all to lower case; a small sketch of this follows below. The chapter on text preprocessing explains this in much more detail.
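
As a quick taste of such a transformation, you can lower-case the tokens of the first document with plain Python and the spaCy token attribute text. This is only an illustration; tmtoolkit provides its own transformation functions, which are covered in the text preprocessing chapter.

[ ]:
# lower-case the first ten tokens of the first document
# (plain Python / spaCy only; tmtoolkit's dedicated transformation
# functions are covered in the text preprocessing chapter)
[t.text.lower() for t in docs[0][:10]]

Next, we proceed with working with text corpora.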