Getting started
This is only a quick overview to get you started. Corpus loading, text preprocessing, and other topics are explained in depth in their respective chapters.
Loading a built-in text corpus
Once you have installed tmtoolkit, you can start by loading a built-in dataset. Note that you must have installed tmtoolkit with the [recommended] or [textproc] option for this to work. See the installation instructions for details.
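For reference, installing with an extras option typically looks like the following (check the installation instructions for the exact, up-to-date commands for your setup):

```shell
# install tmtoolkit with the recommended set of optional dependencies
pip install -U "tmtoolkit[recommended]"

# then download language model data, e.g. for English
python -m tmtoolkit setup en
```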
Let’s import the builtin_corpora_info function first and have a look at which datasets are available:
[1]:
from tmtoolkit.corpus import builtin_corpora_info
builtin_corpora_info()
[1]:
['de-parlspeech-v2-sample-bundestag',
'en-News100',
'en-NewsArticles',
'en-parlspeech-v2-sample-houseofcommons',
'es-parlspeech-v2-sample-congreso',
'nl-parlspeech-v2-sample-tweedekamer']
Let’s load one of these corpora: a sample of 100 articles from the News Articles dataset from Harvard Dataverse. For this, we import the Corpus class and use Corpus.from_builtin_corpus. The raw text data is then processed by an NLP pipeline with spaCy: it is tokenized and analyzed for the grammatical structure of each sentence and the linguistic attributes of each token, among other things. Since this step is computationally intensive, it takes quite some time for large text corpora (it can be sped up by enabling parallel processing, as explained later).
[2]:
from tmtoolkit.corpus import Corpus
corp = Corpus.from_builtin_corpus('en-News100')
corp
[2]:
<Corpus [100 documents / language "en"]>
We can have a look at which documents were loaded (showing only the first ten document labels):
[3]:
corp.doc_labels[:10]
[3]:
['News100-2338',
'News100-3228',
'News100-1253',
'News100-1615',
'News100-3334',
'News100-92',
'News100-869',
'News100-3092',
'News100-3088',
'News100-1173']
Accessing documents and document tokens
We can now access each document in this corpus via its document label:
[4]:
corp['News100-2338']
[4]:
Document "News100-2338" (680 tokens, 9 token attributes, 2 document attributes)
By accessing the corpus in this way, we get a Document object. We can in turn query a document for its contents using the same square-bracket syntax. Here, we access its tokens and show only the first ten:
[5]:
corp['News100-2338']['token'][:10]
[5]:
["'",
'This',
'Is',
'Us',
"'",
'Makes',
'Surprising',
'Reveal',
'About',
'Jack']
Most of the time, you won’t need to access the Document objects of a corpus directly. Instead, you can use functions that provide a convenient interface to a corpus’ contents, e.g. the doc_tokens function, which allows you to retrieve all documents’ tokens along with additional token attributes like Part-of-Speech (POS) tags, token lemmata, etc.
Let’s first import doc_tokens and then list the first ten tokens of the documents “News100-2338” and “News100-3228”:
[6]:
from tmtoolkit.corpus import doc_tokens
tokens = doc_tokens(corp)
[7]:
tokens['News100-2338'][:10]
[7]:
["'",
'This',
'Is',
'Us',
"'",
'Makes',
'Surprising',
'Reveal',
'About',
'Jack']
[8]:
tokens['News100-3228'][:10]
[8]:
['Neil',
'Gorsuch',
'facing',
"'",
'rigorous',
"'",
'confirmation',
'hearing',
'this',
'week']
We can retrieve more information than just the tokens. Let’s also get the POS tags via with_attr='pos' and structure the results according to the sentences in the document via sentences=True:
[9]:
tokens = doc_tokens(corp, sentences=True, with_attr='pos')
For each document, we now have a dictionary with two entries, “token” and “pos”:
[10]:
tokens['News100-2338'].keys()
[10]:
dict_keys(['token', 'pos'])
Within these dictionary entries, the tokens and the POS tags are contained inside a list of sentences. So, for example, to get the POS tags for each token in the fourth sentence (i.e. index 3), we can write:
[11]:
# index 3 is the fourth sentence, since indices start with 0
tokens['News100-2338']['pos'][3]
[11]:
['DET',
'NOUN',
'VERB',
'ADP',
'ADP',
'DET',
'ADJ',
'PROPN',
'VERB',
'ADP',
'PROPN',
'PART',
'PUNCT',
'PROPN',
'PROPN',
'PUNCT',
'VERB',
'ADP',
'VERB',
'ADP',
...]
We could, for example, combine the tokens and their POS tags using zip. Here we do that for the first five tokens of the fourth sentence:
[12]:
list(zip(tokens['News100-2338']['token'][3][:5],
tokens['News100-2338']['pos'][3][:5]))
[12]:
[('The', 'DET'),
('episode', 'NOUN'),
('started', 'VERB'),
('off', 'ADP'),
('with', 'ADP')]
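The nested structure underlying the examples above can be summarized with a small pure-Python sketch (using invented toy data, not the actual corpus contents): each attribute maps to a list of sentences, each sentence is a list of per-token values, and tokens and tags can be paired sentence by sentence with zip:

```python
# toy document mimicking the structure returned by
# doc_tokens(..., sentences=True, with_attr='pos') for a single document
doc = {
    'token': [['The', 'episode', 'started'], ['It', 'ended']],
    'pos':   [['DET', 'NOUN', 'VERB'],       ['PRON', 'VERB']],
}

# pair each token with its POS tag, per sentence
tagged = [list(zip(toks, tags))
          for toks, tags in zip(doc['token'], doc['pos'])]

print(tagged[0])   # [('The', 'DET'), ('episode', 'NOUN'), ('started', 'VERB')]
```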
To get an overview of the contents of a corpus, it’s often more useful to have them in tabular format. The tmtoolkit package provides a function to generate a pandas DataFrame from a corpus: tokens_table. We’ll use that now and instruct it to also return the sentence index of each token via sentences=True:
[13]:
from tmtoolkit.corpus import tokens_table
toktbl = tokens_table(corp, sentences=True)
toktbl.head()
[13]:
|   | doc | sent | position | token | is_punct | is_stop | lemma | like_num | pos | tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | News100-1026 | 0 | 0 | Kremlin | False | False | Kremlin | False | PROPN | NNP |
| 1 | News100-1026 | 0 | 1 | gives | False | False | give | False | VERB | VBZ |
| 2 | News100-1026 | 0 | 2 | no | False | True | no | False | DET | DT |
| 3 | News100-1026 | 0 | 3 | comment | False | False | comment | False | NOUN | NN |
| 4 | News100-1026 | 0 | 4 | on | False | True | on | False | ADP | IN |
Using subsetting, we can for example select the fourth sentence in the “News100-2338” document:
[14]:
toktbl[(toktbl.doc == 'News100-2338') & (toktbl.sent == 3)].head()
[14]:
|   | doc | sent | position | token | is_punct | is_stop | lemma | like_num | pos | tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 28191 | News100-2338 | 3 | 101 | The | False | True | the | False | DET | DT |
| 28192 | News100-2338 | 3 | 102 | episode | False | False | episode | False | NOUN | NN |
| 28193 | News100-2338 | 3 | 103 | started | False | False | start | False | VERB | VBD |
| 28194 | News100-2338 | 3 | 104 | off | False | True | off | False | ADP | RP |
| 28195 | News100-2338 | 3 | 105 | with | False | True | with | False | ADP | IN |
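Since tokens_table returns an ordinary pandas DataFrame, the full pandas API is available for further analysis. A minimal sketch on an invented toy table (not the actual corpus data):

```python
import pandas as pd

# toy tokens table with the same column layout as a tokens_table result
toktbl = pd.DataFrame({
    'doc':   ['d1', 'd1', 'd1', 'd2', 'd2'],
    'sent':  [0, 0, 1, 0, 0],
    'token': ['The', 'cat', 'sat', 'Dogs', 'bark'],
    'pos':   ['DET', 'NOUN', 'VERB', 'NOUN', 'VERB'],
})

# the same boolean-mask subsetting as above ...
first_sent_d1 = toktbl[(toktbl.doc == 'd1') & (toktbl.sent == 0)]

# ... or any other pandas operation, e.g. counting POS tags
pos_counts = toktbl['pos'].value_counts()
print(pos_counts['NOUN'])   # 2
```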
We can do much more with text corpora in terms of accessing and transforming their contents. This is shown in great detail in the chapter on text preprocessing.
Next, we proceed with the chapter on working with text corpora.