Development
This part of the documentation serves as developer documentation, i.e. help for those who want to contribute to the development of the package.
Project overview
This project aims to provide a Python package that allows text processing, text mining and topic modeling with
easy installation,
extensive documentation,
clear functional programming interface,
good performance on large datasets.
All computations need to be performed in memory; streaming data from disk is currently not supported.
The package is written in Python and uses other packages for key tasks:
SpaCy is used for the text processing and text mining tasks
lda, gensim or scikit-learn are used for computing topic models
The project’s packages are published to the Python Package Index (PyPI).
The package’s dependencies are only installed on demand. There’s a setup routine that provides an interface for easy installation of SpaCy’s language models.
Text processing and normalization is often used to construct a Bag-of-Words (BoW) model which in turn is the input for topic models.
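This pipeline can be illustrated with a small, dependency-free sketch (the normalization step here is deliberately minimal and not tmtoolkit's actual implementation): normalized tokens are counted per document to form a BoW representation.

```python
from collections import Counter
import re

def normalize(text):
    # minimal normalization: lowercase and keep only alphabetic "words"
    return re.findall(r"[a-z]+", text.lower())

docs = {
    "doc1": "Hello world! Hello again.",
    "doc2": "The world of text mining.",
}

# BoW: one token-count mapping per document
bow = {label: Counter(normalize(text)) for label, text in docs.items()}
print(bow["doc1"]["hello"])  # 2
```

A topic model then takes these per-document counts (usually arranged as a document-term matrix) as input.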
Contributing to tmtoolkit
If you want to contribute to tmtoolkit, you can create code or documentation patches (updates) and submit them as pull requests on GitHub. The first thing to do for this is to fork the GitHub repository and to clone it to your local machine. Next, it’s best to create a separate branch for your updates. You should then set up your local machine for development as follows:
create a Python virtual environment – make sure that the Python version you’re using for this is supported by tmtoolkit
update pip via
pip install -U pip
if you’re planning to contribute to the code or to the tutorials in the documentation:
make sure your current working directory is the tmtoolkit repository root folder
install all dependencies via
pip install -r requirements.txt
run the tmtoolkit setup routine via
python -m tmtoolkit setup all
to install the required language models
check that everything works by running all tests via
pytest tests/
if you’re only planning to contribute to the documentation (without the tutorials which are Jupyter Notebooks):
install dependencies for documentation via
pip install -r requirements_doc.txt
You can then start working on the code or documentation. Make sure to run the tests and/or create new tests when you provide code updates in your pull request. You should also read this developer documentation completely before diving into the code.
Folder structure
The project’s root folder contains files for documentation generation (.readthedocs.yaml), testing (conftest.py, coverage.svg, tox.ini) as well as project management and package building (Makefile, MANIFEST.in, setup.py). The subfolders include:
.github/workflows: provides Continuous Integration (CI) configuration for GitHub Actions
doc: documentation source and built documentation files
examples: example scripts and data to show some of the features (most features are better explained in the tutorial, which is part of the documentation)
scripts: scripts used for preparing datasets that come along with the package
tests: test suite
tmtoolkit: package source code
Packaging and dependency management
This package uses setuptools for packaging. All package metadata and dependencies are defined in setup.py. Since tmtoolkit allows installing dependencies on demand, there are several installation options defined in setup.py. For development, the most important are:
[dev]: installs packages for development and packaging
[test]: installs packages for testing tmtoolkit
[doc]: installs packages for generating the documentation
[all]: installs all required and optional packages – recommended for development
The requirements.txt and requirements_doc.txt files simply point to the [all] and [doc] installation options.
The Makefile in the root folder contains targets for generating a Python wheel package (make wheel) and a Python source distribution package (make sdist).
Built-in datasets
All built-in datasets reside in tmtoolkit/data/<LANGUAGE_CODE>, where LANGUAGE_CODE is an ISO language code. For the ParlSpeech V2 datasets, the samples are generated via the R script scripts/prepare_corpora.R. For the Health News in Twitter Data Set, the data conversion is done in scripts/health_tweets_data.py. The News Articles dataset is used without further processing.
Automated testing
The tmtoolkit package relies on the following packages for testing:
pytest as testing framework,
hypothesis for property-based testing,
coverage for measuring test coverage of the code,
tox for checking packaging and running tests in different virtual environments.
All tests are implemented in the tests directory and prefixed by test_. The conftest.py file contains project-wide test configuration. The tox.ini file contains the configuration for setting up the virtual environments for tox. For each release, tmtoolkit aims to support the three most recent Python minor release versions, e.g. 3.8, 3.9 and 3.10, and all of these are tested with tox along with different dependency configurations, from minimal to full. To use different versions of Python on the same system, it’s recommended to use the deadsnakes repository on Ubuntu or Debian Linux.
The Makefile in the root folder contains a target for generating coverage reports and the coverage badge (make cov_tests).
Documentation
The Sphinx package is used for documentation. All objects exposed by the API are documented in the Sphinx format. All other parts of the documentation reside in doc/source. The configuration for Sphinx lies in doc/source/conf.py. The nbsphinx package is used for generating the tutorial from Jupyter Notebooks, which are also located in doc/source.
Some cells in the Jupyter Notebooks generate long sequences of data. In order to limit the number of items that are displayed in the cell output for these sequences, it is advisable to first generate an IPython profile via ipython profile create tmtoolkitdoc and then adapt the generated profile configuration in ~/.ipython/profile_tmtoolkitdoc/ipython_config.py, setting the maximum number of displayed sequence items with c.PlainTextFormatter.max_seq_length. Finally, the IPython kernel for the Jupyter Notebooks needs to be adapted to load the profile by editing <VENV_PATH>/share/jupyter/kernels/python3/kernel.json (note the new line "--profile=tmtoolkitdoc"):
{
    "argv": [
        "python",
        "-m",
        "ipykernel_launcher",
        "--profile=tmtoolkitdoc",
        "-f",
        "{connection_file}"
    ],
    "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "metadata": {
        "debugger": true
    }
}
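For reference, the edit to the generated ipython_config.py mentioned above could look like the following sketch (the limit of 20 is only an example value, not a project requirement):

```python
# in ~/.ipython/profile_tmtoolkitdoc/ipython_config.py
c = get_config()  # provided by IPython when the profile configuration is loaded
c.PlainTextFormatter.max_seq_length = 20  # show at most 20 items of a sequence
```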
The Makefile in the doc folder has several targets for generating the documentation. These are:
make notebooks – run all notebooks to generate their outputs; these are stored in-place
make clean – remove everything under doc/build
make html – generate the HTML documentation from the documentation source
The generated documentation then resides under doc/build.
The documentation is published at tmtoolkit.readthedocs.io. For this, new commits to the master branch of the GitHub project or new tags are automatically built by readthedocs.org. The .readthedocs.yaml file in the root folder sets up the build process for readthedocs.org.
Continuous integration
Continuous integration routines are defined via GitHub Actions (GA). For tmtoolkit, this so far only means automatic testing for new commits and releases on different machine configurations.
The GA setup for the tests is done in .github/workflows/runtests.yml. There are “minimal” and “full” test suites for Ubuntu, macOS and Windows with Python versions 3.8, 3.9 and 3.10 each, which means 18 jobs are spawned. Again, tox is used for running the tests on these machines.
Release management
Publishing a new release for tmtoolkit involves several steps, listed below. You may consider creating a pre-release for PyPI first before publishing a final release.
Preparation:
create a new branch for the release version X.Y.Z as releaseX.Y.Z
check if there are new minimum version requirements for dependencies or generally new dependencies to be added in setup.py
check if the compatible Python versions should be updated in setup.py
set the new version in setup.py and tmtoolkit/__init__.py (consider first using a pre-release version denoted by an rcN version suffix)
Documentation updates:
check and possibly update the tutorials – do all code examples still work and are all important features covered?
update documentation
update README
update changelog (doc/source/version_history.rst)
Testing:
run examples and check if they work
run tests locally via tox
push the develop or release* branch to the GitHub repository to run tests via GitHub Actions
when all tests pass locally and via GitHub Actions, update the test coverage report by running make cov_tests locally
Publish package to PyPI and GitHub:
make a new tag for the new version via git tag -a vX.Y.Z -m "version X.Y.Z"
push the new tag to the GitHub repository – this will automatically trigger the release workflow and publish the source and built distributions to PyPI
build the source distribution via make sdist
build the wheel via make wheel
create a new release from the tag in the GitHub repository and upload the source and wheel distributions
Finalization:
merge the development or release branch with the master branch and push the master branch to the GitHub repository
log in to readthedocs.org, go to the project page, activate the current version, let it build the documentation
verify documentation on tmtoolkit.readthedocs.io
If you notice a (major) mistake in a release after publication, you have several options, such as yanking the release on PyPI, publishing a post-release, or updating the build number of the wheel. See this blog post for more information about these options.
API style
The tmtoolkit package provides a functional API. This is quite different from object-oriented APIs that are found in many other Python packages, where a programmer mainly uses classes and their methods that are exposed by an API. The tmtoolkit API on the other hand mainly exposes data structures and functions that operate on these data structures. In tmtoolkit, Python classes are usually used to implement more complex data structures such as documents or document corpora, but these classes don’t provide (public) methods. Rather, they are used as function arguments, for example as in the large set of corpus functions that operate on text corpora as explained below.
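The contrast can be sketched with a toy example; note that the class and function names below are invented for illustration and are not part of tmtoolkit:

```python
from dataclasses import dataclass

# object-oriented style: behavior lives in methods of the class
@dataclass
class OOCorpus:
    texts: list

    def n_docs(self):
        return len(self.texts)

# functional style (as in tmtoolkit): a plain data structure ...
@dataclass
class FnCorpus:
    texts: list

# ... plus functions that take the data structure as an argument
def n_docs(corpus):
    return len(corpus.texts)

corp = FnCorpus(texts=["first document", "second document"])
print(n_docs(corp))  # 2
```

In the functional style, the data structure stays a passive container and all operations are free functions, which is the pattern the corpus functions described below follow.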
Implementation details
Top-level module and setup routine
The __main__.py file provides a command-line interface for the package. Its only purpose is to allow easy installation of SpaCy language models via the setup routine. The tokenseq module provides functions that operate on single (string) tokens or sequences of tokens. These functions are mainly used internally in the corpus module, but are also exposed by the API for package users. The utils.py module provides helper functions that are used internally throughout the package, but may also be used by package users.
bow module
This module provides functions for generating document-term-matrices (DTMs), which are central to the BoW concept, and some common statistics used for these matrices.
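A document-term matrix records how often each vocabulary term occurs in each document. A minimal, dependency-free sketch of the idea follows; tmtoolkit's own implementation differs and works with NumPy/SciPy matrices:

```python
from collections import Counter

# tokenized documents: label -> token sequence
docs = {
    "d1": ["a", "b", "a"],
    "d2": ["b", "c"],
}

# vocabulary: sorted set of all terms across documents
vocab = sorted({tok for tokens in docs.values() for tok in tokens})
doc_labels = sorted(docs)

# DTM as a nested list: rows = documents, columns = vocabulary terms
counts = {label: Counter(tokens) for label, tokens in docs.items()}
dtm = [[counts[label][term] for term in vocab] for label in doc_labels]
print(dtm)  # [[2, 1, 0], [0, 1, 1]]
```

Statistics such as per-term frequencies then reduce to row or column operations on this matrix.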
corpus module
This is the central module for text processing and text mining.
At the core of this module is the Corpus class, implemented in corpus/_corpus.py. It takes documents with raw text as input (i.e. a dict mapping document labels to text strings) and applies a SpaCy NLP pipeline to them. After that, the corpus consists of Document objects (implemented in corpus/_document.py) which contain the textual data in tokenized form, i.e. as a sequence of tokens (roughly translated as “words”, but other text contents such as numbers and punctuation also form separate tokens). Each token comes along with several token attributes that were estimated by the NLP pipeline. Examples of token attributes include the Part-of-Speech tag or the lemma.
The Document class stores the tokens and their “standard” attributes in a token matrix. This matrix is of shape (N, M) for N tokens and M attributes. There are at least two or three attributes: whitespace (boolean – is there a whitespace after the token?), token (the actual token, i.e. “word” type) and optionally sent_start (only given when sentence information is parsed in the NLP pipeline).
The token matrix is a uint64 matrix as it stores all information as 64 bit hash values. Compared to sequences of strings, this reduces memory usage and allows faster computations and data modifications. E.g., when you transform a token (let’s say “Hello” to “hello”), you only do one transformation: calculate one new hash value and replace each occurrence of the old hash with the new hash. The hashes are calculated with SpaCy’s hash_string function. For fast conversion between token/attribute hashes and strings, the mappings are stored in a bidirectional dictionary using the bidict package. Each column, i.e. each attribute, in the token matrix has a separate bidict in the bimaps dictionary that is shared between a corpus and each Document object. Using bidict proved to be much faster than using SpaCy’s built-in Vocab / StringStore.
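The hashing idea can be illustrated with a small stand-alone sketch; Python's built-in hash() and a pair of plain dicts stand in for SpaCy's hash_string and the bidict mappings here:

```python
# tokens stored as hashes; a two-way mapping recovers the strings
tokens = ["Hello", "world", "Hello"]

str2hash = {}
hash2str = {}
for t in tokens:
    h = hash(t)  # stand-in for spacy.strings.hash_string
    str2hash[t] = h
    hash2str[h] = t

hashed = [str2hash[t] for t in tokens]

# transforming "Hello" -> "hello": compute one new hash,
# then replace every occurrence of the old hash
old_h, new_t = str2hash["Hello"], "hello"
new_h = hash(new_t)
str2hash[new_t] = new_h
hash2str[new_h] = new_t
hashed = [new_h if h == old_h else h for h in hashed]

print([hash2str[h] for h in hashed])  # ['hello', 'world', 'hello']
```

The transformation itself runs once per distinct token rather than once per occurrence, which is where the speedup comes from.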
Besides “standard” token attributes that come from the SpaCy NLP pipeline, a user may also add custom token attributes. These are stored in each document’s custom_token_attrs dictionary, which maps an attribute name to a NumPy array. These arrays are of arbitrary type and don’t use the hashing approach. Besides token attributes, there are also document attributes. These are attributes attached to each document, for example the document label (a unique document identifier). Custom document attributes can be added, e.g. to record the publication year of a document. Document attributes can also be of any type and are not hashed.
The Corpus class implements a data structure for text corpora with named documents. All these documents are stored in the corpus as Document objects. Corpus functions operate on Corpus objects. They are implemented in corpus/_corpusfuncs.py. All corpus functions that transform/modify a corpus have an inplace argument, by default set to True. If inplace is set to True, the corpus is modified directly, i.e. the input corpus itself is changed. If inplace is set to False, a copy of the input corpus is created and all modifications are applied to this copy; the original input corpus is not altered in that case. The corpus_func_inplace_opt decorator is used to mark corpus functions with the in-place option.
The Corpus class provides parallel processing capabilities for processing large amounts of data. This can be controlled with the max_workers argument. Parallel processing is then enabled at two stages: First, it is simply enabled for the SpaCy NLP pipeline by setting up the pipeline accordingly. Second, a reusable process pool executor is created by means of loky. This process pool is then used in corpus functions whenever parallel execution is beneficial over serial execution. The parallelexec decorator is used to mark (inner) functions for parallel execution.
topicmod module
This is the central module for computing, evaluating and analyzing topic models.
topicmod/evaluate.py mainly implements several evaluation metrics for topic models. Topic models can be computed and evaluated in parallel; the base code for that is in topicmod/parallel.py. Three modules use the base classes from topicmod/parallel.py to implement interfaces to popular topic modeling packages:
topicmod/tm_gensim.py for gensim
topicmod/tm_lda.py for lda
topicmod/tm_sklearn.py for scikit-learn