Getting started
This is only a quick overview to get you started. Corpus loading, text preprocessing, and other topics are explained in depth in their respective chapters.
Loading a built-in text corpus
Once you have installed tmtoolkit, you can start by loading a built-in dataset. Note that you must have installed tmtoolkit with the [recommended] or [textproc] option for this to work. See the installation instructions for details.
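For reference, installing with an extras option typically looks like the following (check the installation instructions for the exact, up-to-date commands for your setup):

```shell
# install tmtoolkit with the recommended set of optional dependencies
pip install -U "tmtoolkit[recommended]"

# then download language model data, e.g. for English
python -m tmtoolkit setup en
```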
Let’s import the builtin_corpora_info function first and have a look at which datasets are available:
[1]:
from tmtoolkit.corpus import builtin_corpora_info
builtin_corpora_info()
[1]:
['de-parlspeech-v2-sample-bundestag',
'en-News100',
'en-NewsArticles',
'en-parlspeech-v2-sample-houseofcommons',
'es-parlspeech-v2-sample-congreso',
'nl-parlspeech-v2-sample-tweedekamer']
Let’s load one of these corpora: a sample of 100 articles from the News Articles dataset from Harvard Dataverse. For this, we import the Corpus class and use Corpus.from_builtin_corpus. The raw text data is then processed by an NLP pipeline with spaCy: it is tokenized and analyzed for the grammatical structure of each sentence and the linguistic attributes of each token, among other things. Since this step is computationally intensive, it takes quite some time for large text corpora (it can be sped up by enabling parallel processing, as explained later).
[2]:
from tmtoolkit.corpus import Corpus
corp = Corpus.from_builtin_corpus('en-News100')
corp
[2]:
<Corpus [100 documents / language "en"]>
We can have a look at which documents were loaded (showing only the first ten document labels):
[3]:
corp.doc_labels[:10]
[3]:
['News100-2338',
'News100-3228',
'News100-1253',
'News100-1615',
'News100-3334',
'News100-92',
'News100-869',
'News100-3092',
'News100-3088',
'News100-1173']
Accessing documents and document tokens
We can now access each document in this corpus via its document label:
[4]:
corp['News100-2338']
[4]:
Document "News100-2338" (680 tokens, 9 token attributes, 2 document attributes)
By accessing the corpus in this way, we get a Document object. We can in turn query a document for its contents using the same square-bracket syntax. Here, we access its tokens and show only the first ten:
[5]:
corp['News100-2338']['token'][:10]
[5]:
["'",
'This',
'Is',
'Us',
"'",
'Makes',
'Surprising',
'Reveal',
'About',
'Jack']
Most of the time, you won’t need to access the Document objects of a corpus directly. Instead, you can use functions that provide a convenient interface to a corpus’ contents, e.g. the doc_tokens function, which allows you to retrieve all documents’ tokens along with additional token attributes like Part-of-Speech (POS) tags, token lemmata, etc.
Let’s first import doc_tokens and then list the first ten tokens of the documents “News100-2338” and “News100-3228”:
[6]:
from tmtoolkit.corpus import doc_tokens
tokens = doc_tokens(corp)
[7]:
tokens['News100-2338'][:10]
[7]:
["'",
'This',
'Is',
'Us',
"'",
'Makes',
'Surprising',
'Reveal',
'About',
'Jack']
[8]:
tokens['News100-3228'][:10]
[8]:
['Neil',
'Gorsuch',
'facing',
"'",
'rigorous',
"'",
'confirmation',
'hearing',
'this',
'week']
We can retrieve more information than just the tokens. Let’s also get the POS tags via with_attr='pos' and structure the results according to the sentences in the document via sentences=True:
[9]:
tokens = doc_tokens(corp, sentences=True, with_attr='pos')
For each document, we now have a dictionary with two entries, “token” and “pos”:
[10]:
tokens['News100-2338'].keys()
[10]:
dict_keys(['token', 'pos'])
Within these dictionary entries, the tokens and the POS tags are contained inside a list of sentences. So, for example, to get the POS tags for each token in the fourth sentence (i.e. index 3), we can write:
[11]:
# index 3 is the fourth sentence, since indices start with 0
tokens['News100-2338']['pos'][3]
[11]:
['DET',
'NOUN',
'VERB',
'ADP',
'ADP',
'DET',
'ADJ',
'PROPN',
'VERB',
'ADP',
'PROPN',
'PART',
'PUNCT',
'PROPN',
'PROPN',
'PUNCT',
'VERB',
'ADP',
'VERB',
'ADP',
...]
We could, for example, combine the tokens and their POS tags using zip. Here we do that for the first five tokens of the fourth sentence:
[12]:
list(zip(tokens['News100-2338']['token'][3][:5],
tokens['News100-2338']['pos'][3][:5]))
[12]:
[('The', 'DET'),
('episode', 'NOUN'),
('started', 'VERB'),
('off', 'ADP'),
('with', 'ADP')]
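The nested structure underlying the examples above can be summarized with a small pure-Python sketch (using invented toy data, not the actual corpus contents): each attribute maps to a list of sentences, each sentence is a list of per-token values, and tokens and tags can be paired sentence by sentence with zip:

```python
# toy document mimicking the structure returned by
# doc_tokens(..., sentences=True, with_attr='pos') for a single document
doc = {
    'token': [['The', 'episode', 'started'], ['It', 'ended']],
    'pos':   [['DET', 'NOUN', 'VERB'],       ['PRON', 'VERB']],
}

# pair each token with its POS tag, per sentence
tagged = [list(zip(toks, tags))
          for toks, tags in zip(doc['token'], doc['pos'])]

print(tagged[0])   # [('The', 'DET'), ('episode', 'NOUN'), ('started', 'VERB')]
```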
To get an overview of the contents of a corpus, it’s often more useful to have them in tabular format. The tmtoolkit package provides a function to generate a pandas DataFrame from a corpus: tokens_table. We’ll use that now and instruct it to also return the sentence index of each token via sentences=True:
[13]:
from tmtoolkit.corpus import tokens_table
toktbl = tokens_table(corp, sentences=True)
toktbl.head()
[13]:
|   | doc | sent | position | token | is_punct | is_stop | lemma | like_num | pos | tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | News100-1026 | 0 | 0 | Kremlin | False | False | Kremlin | False | PROPN | NNP |
| 1 | News100-1026 | 0 | 1 | gives | False | False | give | False | VERB | VBZ |
| 2 | News100-1026 | 0 | 2 | no | False | True | no | False | DET | DT |
| 3 | News100-1026 | 0 | 3 | comment | False | False | comment | False | NOUN | NN |
| 4 | News100-1026 | 0 | 4 | on | False | True | on | False | ADP | IN |
Using subsetting, we can for example select the fourth sentence in the “News100-2338” document:
[14]:
toktbl[(toktbl.doc == 'News100-2338') & (toktbl.sent == 3)].head()
[14]:
|   | doc | sent | position | token | is_punct | is_stop | lemma | like_num | pos | tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 28191 | News100-2338 | 3 | 101 | The | False | True | the | False | DET | DT |
| 28192 | News100-2338 | 3 | 102 | episode | False | False | episode | False | NOUN | NN |
| 28193 | News100-2338 | 3 | 103 | started | False | False | start | False | VERB | VBD |
| 28194 | News100-2338 | 3 | 104 | off | False | True | off | False | ADP | RP |
| 28195 | News100-2338 | 3 | 105 | with | False | True | with | False | ADP | IN |
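Since tokens_table returns an ordinary pandas DataFrame, the full pandas API is available for further analysis. A minimal sketch on an invented toy table (not the actual corpus data):

```python
import pandas as pd

# toy tokens table with the same column layout as a tokens_table result
toktbl = pd.DataFrame({
    'doc':   ['d1', 'd1', 'd1', 'd2', 'd2'],
    'sent':  [0, 0, 1, 0, 0],
    'token': ['The', 'cat', 'sat', 'Dogs', 'bark'],
    'pos':   ['DET', 'NOUN', 'VERB', 'NOUN', 'VERB'],
})

# the same boolean-mask subsetting as above ...
first_sent_d1 = toktbl[(toktbl.doc == 'd1') & (toktbl.sent == 0)]

# ... or any other pandas operation, e.g. counting POS tags
pos_counts = toktbl['pos'].value_counts()
print(pos_counts['NOUN'])   # 2
```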
We can do much more with text corpora in terms of accessing and transforming their contents. This is shown in great detail in the chapter on text preprocessing.
Next, we proceed with the chapter on working with text corpora.