Getting started

This is only a quick overview to get you started. Corpus loading, text preprocessing, etc. are explained in depth in the respective chapters.

Loading a built-in text corpus

Once you have installed tmtoolkit, you can start by loading a built-in dataset. Note that you must have installed tmtoolkit with the [recommended] or [textproc] option for this to work. See the installation instructions for details.
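
For reference, installation typically looks like the following. These commands are only a sketch; see the installation instructions for the authoritative steps for your platform:

# install tmtoolkit with the recommended set of dependencies
pip install -U "tmtoolkit[recommended]"

# download the language data needed by the NLP pipeline, e.g. for English
python -m tmtoolkit setup en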

Let’s import the builtin_corpora_info function first and have a look at which datasets are available:

[1]:
from tmtoolkit.corpus import builtin_corpora_info

builtin_corpora_info()
[1]:
['de-parlspeech-v2-sample-bundestag',
 'en-News100',
 'en-NewsArticles',
 'en-healthtweets',
 'en-parlspeech-v2-sample-houseofcommons',
 'es-parlspeech-v2-sample-congreso',
 'nl-parlspeech-v2-sample-tweedekamer']

Let’s load one of these corpora: a sample of 100 articles from the News Articles dataset from Harvard Dataverse. For this, we import the Corpus class and use Corpus.from_builtin_corpus. The raw text data will then be processed by an NLP pipeline with spaCy. That is, it will be tokenized and analyzed for the grammatical structure of each sentence and the linguistic attributes of each token, among other things. Since this step is computationally intensive, it can take quite some time for large text corpora (it can be sped up by enabling parallel processing, as sketched below and explained later).

[2]:
from tmtoolkit.corpus import Corpus

corp = Corpus.from_builtin_corpus('en-News100')
corp
[2]:
<Corpus [100 documents  / language "en"]>
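
Loading can be sped up on multi-core machines by enabling parallel processing when creating the corpus. A minimal sketch, assuming that the max_workers option of the Corpus constructor is also accepted by Corpus.from_builtin_corpus:

# sketch: process documents with up to four worker processes in parallel
corp = Corpus.from_builtin_corpus('en-News100', max_workers=4)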

We can have a look at which documents were loaded (showing only the first ten document labels):

[3]:
corp.doc_labels[:10]
[3]:
['News100-2338',
 'News100-3228',
 'News100-1253',
 'News100-1615',
 'News100-3334',
 'News100-92',
 'News100-869',
 'News100-3092',
 'News100-3088',
 'News100-1173']

Accessing documents and document tokens

We can now access each document in this corpus via its document label:

[4]:
corp['News100-2338']
[4]:
Document "News100-2338" (680 tokens, 9 token attributes, 2 document attributes)

By accessing the corpus in this way, we get a Document object. We can query a document for its contents using the same square-bracket syntax. Here, we access its tokens and show only the first ten:

[5]:
corp['News100-2338']['token'][:10]
[5]:
["'",
 'This',
 'Is',
 'Us',
 "'",
 'Makes',
 'Surprising',
 'Reveal',
 'About',
 'Jack']

Most of the time, you won’t need to access the Document objects of a corpus directly. Instead, you can use functions that provide a convenient interface to a corpus’ contents, e.g. the doc_tokens function, which lets you retrieve all documents’ tokens along with additional token attributes like Part-of-Speech (POS) tags, token lemmata, etc.

Let’s first import doc_tokens and then list the first ten tokens of the documents “News100-2338” and “News100-3228”:

[6]:
from tmtoolkit.corpus import doc_tokens

tokens = doc_tokens(corp)
[7]:
tokens['News100-2338'][:10]
[7]:
["'",
 'This',
 'Is',
 'Us',
 "'",
 'Makes',
 'Surprising',
 'Reveal',
 'About',
 'Jack']
[8]:
tokens['News100-3228'][:10]
[8]:
['Neil',
 'Gorsuch',
 'facing',
 "'",
 'rigorous',
 "'",
 'confirmation',
 'hearing',
 'this',
 'week']

We can retrieve more information than just the tokens. Let’s also fetch the POS tags via with_attr='pos' and structure the results according to the sentences in each document via sentences=True:

[9]:
tokens = doc_tokens(corp, sentences=True, with_attr='pos')

For each document, we now have a dictionary with two entries, “token” and “pos”:

[10]:
tokens['News100-2338'].keys()
[10]:
dict_keys(['token', 'pos'])
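
Since we passed sentences=True, the tokens and the POS tags within these entries are nested inside a list of sentences, with one inner list per sentence. A quick illustrative check (a sketch, output omitted):

# both entries are nested by sentence, so their outer lists have the
# same length: the number of sentences in the document
n_sent = len(tokens['News100-2338']['token'])
assert n_sent == len(tokens['News100-2338']['pos'])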

For example, to get the POS tags for each token in the fourth sentence (i.e. index 3), we can write:

[11]:
# index 3 is the fourth sentence, since indices start with 0
tokens['News100-2338']['pos'][3]
[11]:
['DET',
 'NOUN',
 'VERB',
 'ADP',
 'ADP',
 'DET',
 'ADJ',
 'PROPN',
 'VERB',
 'ADP',
 'PROPN',
 'PART',
 'PUNCT',
 'PROPN',
 'PROPN',
 'PUNCT',
 'VERB',
 'ADP',
 'VERB',
 'ADP',
 ...]

We could, for example, combine the tokens and their POS tags using zip. Here, we do that for the first five tokens of the fourth sentence:

[12]:
list(zip(tokens['News100-2338']['token'][3][:5],
         tokens['News100-2338']['pos'][3][:5]))
[12]:
[('The', 'DET'),
 ('episode', 'NOUN'),
 ('started', 'VERB'),
 ('off', 'ADP'),
 ('with', 'ADP')]
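
Note that doc_tokens can return several token attributes at once: with_attr should also accept a sequence of attribute names instead of a single string (a hedged sketch, output omitted):

# sketch: request POS tags and lemmata together
tokens = doc_tokens(corp, sentences=True, with_attr=['pos', 'lemma'])
sorted(tokens['News100-2338'].keys())   # expected: ['lemma', 'pos', 'token']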

For an overview of the contents of a corpus, a tabular format is often more useful. The tmtoolkit package provides the tokens_table function, which generates a pandas DataFrame from a corpus.

We’ll use that now and instruct it to also return the sentence index of each token via sentences=True:

[13]:
from tmtoolkit.corpus import tokens_table

toktbl = tokens_table(corp, sentences=True)
toktbl.head()
[13]:
            doc  sent  position    token  is_punct  is_stop    lemma  like_num    pos  tag
0  News100-1026     0         0  Kremlin     False    False  Kremlin     False  PROPN  NNP
1  News100-1026     0         1    gives     False    False     give     False   VERB  VBZ
2  News100-1026     0         2       no     False     True       no     False    DET   DT
3  News100-1026     0         3  comment     False    False  comment     False   NOUN   NN
4  News100-1026     0         4       on     False     True       on     False    ADP   IN

Using subsetting, we can, for example, select the fourth sentence of the document “News100-2338”:

[14]:
toktbl[(toktbl.doc == 'News100-2338') & (toktbl.sent == 3)].head()
[14]:
                doc  sent  position    token  is_punct  is_stop    lemma  like_num   pos  tag
28191  News100-2338     3       101      The     False     True      the     False   DET   DT
28192  News100-2338     3       102  episode     False    False  episode     False  NOUN   NN
28193  News100-2338     3       103  started     False    False    start     False  VERB  VBD
28194  News100-2338     3       104      off     False     True      off     False   ADP   RP
28195  News100-2338     3       105     with     False     True     with     False   ADP   IN
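
Since tokens_table returns an ordinary pandas DataFrame, all of pandas’ usual tools apply. For example, based on the columns shown above, we could count the tokens per document or select only the tokens tagged as nouns:

# number of tokens per document
toktbl.groupby('doc').size()

# only tokens with the universal POS tag "NOUN"
toktbl[toktbl.pos == 'NOUN'].head()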

We can do much more with text corpora in terms of accessing and transforming their contents. This is shown in great detail in the chapter on text preprocessing.

Next, we proceed to the chapter on working with text corpora.