Development
This part of the documentation serves as developer documentation, i.e. help for those who want to contribute to the development of the package.
Project overview
This project aims to provide a Python package that allows text processing, text mining and topic modeling with
easy installation,
extensive documentation,
clear functional programming interface,
good performance on large datasets.
All computations need to be performed in memory; streaming data from disk is currently not supported.
The package is written in Python and uses other packages for key tasks:
SpaCy is used for the text processing and text mining tasks
lda, gensim or scikit-learn are used for computing topic models
The project’s packages are published to the Python Package Index (PyPI).
The package’s dependencies are only installed on demand. There’s a setup routine that provides an interface for easy installation of SpaCy’s language models.
Text processing and normalization is often used to construct a Bag-of-Words (BoW) model which in turn is the input for topic models.
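This pipeline can be illustrated with a small, dependency-free sketch (the normalization step here is deliberately minimal and not tmtoolkit's actual implementation): normalized tokens are counted per document to form a BoW representation.

```python
from collections import Counter
import re

def normalize(text):
    # minimal normalization: lowercase and keep only alphabetic "words"
    return re.findall(r"[a-z]+", text.lower())

docs = {
    "doc1": "Hello world! Hello again.",
    "doc2": "The world of text mining.",
}

# BoW: one token-count mapping per document
bow = {label: Counter(normalize(text)) for label, text in docs.items()}
print(bow["doc1"]["hello"])  # 2
```

A topic model then takes these per-document counts (usually arranged as a document-term matrix) as input.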
Contributing to tmtoolkit
If you want to contribute to tmtoolkit, you can create code or documentation patches (updates) and submit them as pull requests on GitHub. The first thing to do for this is to fork the GitHub repository and to clone it to your local machine. Next, it’s best to create a separate branch for your updates. You should then set up your local machine for development as follows:
create a Python virtual environment – make sure that the Python version you’re using for this is supported by tmtoolkit
update pip via
pip install -U pip
if you’re planning to contribute to the code or to the tutorials in the documentation:
make sure your current working directory is the tmtoolkit repository root folder
install all dependencies via
pip install -r requirements.txt
run the tmtoolkit setup routine via
python -m tmtoolkit setup all
to install the required language models
check that everything works by running all tests via
pytest tests/
if you’re only planning to contribute to the documentation (without the tutorials which are Jupyter Notebooks):
install dependencies for documentation via
pip install -r requirements_doc.txt
You can then start working on the code or documentation. Make sure to run the tests and/or create new tests when you provide code updates in your pull request. You should also read this developer documentation completely before diving into the code.
Folder structure
The project’s root folder contains files for documentation generation (.readthedocs.yaml), testing (conftest.py, coverage.svg, tox.ini) as well as project management and package building (Makefile, MANIFEST.in, setup.py). The subfolders include:
.github/workflows: provides Continuous Integration (CI) configuration for GitHub Actions
doc: documentation source and built documentation files
examples: example scripts and data to show some of the features (most features are better explained in the tutorial, which is part of the documentation)
scripts: scripts used for preparing datasets that come along with the package
tests: test suite
tmtoolkit: package source code
Packaging and dependency management
This package uses setuptools for packaging. All package metadata and dependencies are defined in setup.py. Since tmtoolkit allows installing dependencies on demand, there are several installation options defined in setup.py. For development, the most important are:
[dev]: installs packages for development and packaging
[test]: installs packages for testing tmtoolkit
[doc]: installs packages for generating the documentation
[all]: installs all required and optional packages – recommended for development
The requirements.txt and requirements_doc.txt files simply point to the [all] and [doc] installation options.
The Makefile in the root folder contains targets for generating a Python wheel package (make wheel) and a Python source distribution package (make sdist).
Built-in datasets
All built-in datasets reside in tmtoolkit/data/<LANGUAGE_CODE>, where LANGUAGE_CODE is an ISO language code. For the ParlSpeech V2 datasets, the samples are generated via the R script scripts/prepare_corpora.R. For the Health News in Twitter Data Set, the data conversion is done in scripts/health_tweets_data.py. The News Articles dataset is used without further processing.
Automated testing
The tmtoolkit package relies on the following packages for testing:
pytest as testing framework,
hypothesis for property-based testing,
coverage for measuring test coverage of the code,
tox for checking packaging and running tests in different virtual environments.
All tests are implemented in the tests directory and prefixed by test_. The conftest.py file contains project-wide test configuration. The tox.ini file contains the configuration for setting up the virtual environments for tox. For each release, tmtoolkit aims to support the three most recent Python minor release versions, e.g. 3.8, 3.9 and 3.10, and all of these are tested with tox along with different dependency configurations, from minimal to full. To use different versions of Python on the same system, it’s recommended to use the deadsnakes repository on Ubuntu or Debian Linux.
The Makefile in the root folder contains a target for generating coverage reports and the coverage badge (make cov_tests).
Documentation
The Sphinx package is used for documentation. All objects exposed by the API are documented in the Sphinx format. All other parts of the documentation reside in doc/source. The configuration for Sphinx lies in doc/source/conf.py. The nbsphinx package is used for generating the tutorial from Jupyter Notebooks, which are also located in doc/source.
Some cells in the Jupyter Notebooks generate long sequences of data. In order to limit the number of items that are displayed in the cell output for these sequences, it is advisable to first generate an IPython profile via ipython profile create tmtoolkitdoc and then adapt the generated profile configuration in ~/.ipython/profile_tmtoolkitdoc/ipython_config.py, setting the maximum number of displayed sequence items with c.PlainTextFormatter.max_seq_length. Finally, the IPython kernel for the Jupyter Notebooks needs to be adapted to load the profile by editing <VENV_PATH>/share/jupyter/kernels/python3/kernel.json (note the new line "--profile=tmtoolkitdoc"):
{
    "argv": [
        "python",
        "-m",
        "ipykernel_launcher",
        "--profile=tmtoolkitdoc",
        "-f",
        "{connection_file}"
    ],
    "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "metadata": {
        "debugger": true
    }
}
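For reference, the edit to the generated ipython_config.py mentioned above could look like the following sketch (the limit of 20 is only an example value, not a project requirement):

```python
# in ~/.ipython/profile_tmtoolkitdoc/ipython_config.py
c = get_config()  # provided by IPython when the profile configuration is loaded
c.PlainTextFormatter.max_seq_length = 20  # show at most 20 items of a sequence
```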
The Makefile in the doc folder has several targets for generating the documentation. These are:
make notebooks – run all notebooks to generate their outputs; these are stored in-place
make clean – remove everything under doc/build
make html – generate the HTML documentation from the documentation source
The generated documentation then resides under doc/build.
The documentation is published at tmtoolkit.readthedocs.io. For this, new commits to the master branch of the GitHub project or new tags are automatically built by readthedocs.org. The .readthedocs.yaml file in the root folder sets up the build process for readthedocs.org.
Continuous integration
Continuous integration routines are defined via GitHub Actions (GA). For tmtoolkit, this so far only means automatic testing for new commits and releases on different machine configurations.
The GA setup for the tests is done in .github/workflows/runtests.yml. There are “minimal” and “full” test suites for Ubuntu, macOS and Windows with Python versions 3.8, 3.9 and 3.10 each, which means 18 jobs are spawned. Again, tox is used for running the tests on these machines.
Release management
Publishing a new release for tmtoolkit involves several steps, listed below. You may consider creating a pre-release for PyPI first before publishing a final release.
Preparation:
create a new branch for the release version X.Y.Z as releaseX.Y.Z
check if there are new minimum version requirements for dependencies or generally new dependencies to be added in setup.py
check if the compatible Python versions should be updated in setup.py
set the new version in setup.py and tmtoolkit/__init__.py (consider first using a pre-release version denoted by an rcN version suffix)
Documentation updates:
check and possibly update the tutorials – do all code examples still work and are all important features covered?
update documentation
update README
update changelog (doc/source/version_history.rst)
Testing:
run examples and check if they work
run tests locally via tox
push the develop or release* branch to the GitHub repository to run tests via GitHub Actions
when all tests pass locally and via GitHub Actions, update the test coverage report by running make cov_tests locally
Publish package to PyPI and GitHub:
make a new tag for the new version via git tag -a vX.Y.Z -m "version X.Y.Z"
push the new tag to the GitHub repository – this will automatically trigger the release workflow and publish the source and built distributions to PyPI
build the source distribution via make sdist
build the wheel via make wheel
create a new release from the tag in the GitHub repository and upload the source and wheel distributions
Finalization:
merge the development or release branch with the master branch and push the master branch to the GitHub repository
log in to readthedocs.org, go to the project page, activate the current version, let it build the documentation
verify documentation on tmtoolkit.readthedocs.io
If you notice a (major) mistake in a release after publication, you have several options, such as yanking the release on PyPI, publishing a post-release, or updating the build number of the wheel. See this blog post for more information about these options.
API style
The tmtoolkit package provides a functional API. This is quite different from object-oriented APIs that are found in many other Python packages, where a programmer mainly uses classes and their methods that are exposed by an API. The tmtoolkit API on the other hand mainly exposes data structures and functions that operate on these data structures. In tmtoolkit, Python classes are usually used to implement more complex data structures such as documents or document corpora, but these classes don’t provide (public) methods. Rather, they are used as function arguments, for example as in the large set of corpus functions that operate on text corpora as explained below.
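The contrast can be sketched with a toy example; note that the class and function names below are invented for illustration and are not part of tmtoolkit:

```python
from dataclasses import dataclass

# object-oriented style: behavior lives in methods of the class
@dataclass
class OOCorpus:
    texts: list

    def n_docs(self):
        return len(self.texts)

# functional style (as in tmtoolkit): a plain data structure ...
@dataclass
class FnCorpus:
    texts: list

# ... plus functions that take the data structure as an argument
def n_docs(corpus):
    return len(corpus.texts)

corp = FnCorpus(texts=["first document", "second document"])
print(n_docs(corp))  # 2
```

In the functional style, the data structure stays a passive container and all operations are free functions, which is the pattern the corpus functions described below follow.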
Implementation details
Top-level module and setup routine
The __main__.py file provides a command-line interface for the package. Its only purpose is to allow easy installation of SpaCy language models via the setup routine. The tokenseq module provides functions that operate on single (string) tokens or sequences of tokens. These functions are mainly used internally in the corpus module, but are also exposed by the API for package users. The utils.py module provides helper functions that are used internally throughout the package, but may also be used by package users.
bow module
This module provides functions for generating document-term-matrices (DTMs), which are central to the BoW concept, and some common statistics used for these matrices.
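A document-term matrix records how often each vocabulary term occurs in each document. A minimal, dependency-free sketch of the idea follows; tmtoolkit's own implementation differs and works with NumPy/SciPy matrices:

```python
from collections import Counter

# tokenized documents: label -> token sequence
docs = {
    "d1": ["a", "b", "a"],
    "d2": ["b", "c"],
}

# vocabulary: sorted set of all terms across documents
vocab = sorted({tok for tokens in docs.values() for tok in tokens})
doc_labels = sorted(docs)

# DTM as a nested list: rows = documents, columns = vocabulary terms
counts = {label: Counter(tokens) for label, tokens in docs.items()}
dtm = [[counts[label][term] for term in vocab] for label in doc_labels]
print(dtm)  # [[2, 1, 0], [0, 1, 1]]
```

Statistics such as per-term frequencies then reduce to row or column operations on this matrix.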
corpus module
This is the central module for text processing and text mining.
At the core of this module is the Corpus class, implemented in corpus/_corpus.py. It takes documents with raw text as input (i.e. a dict mapping document labels to text strings) and applies a SpaCy NLP pipeline to them. After that, the corpus consists of Document objects (implemented in corpus/_document.py) which contain the textual data in tokenized form, i.e. as a sequence of tokens (roughly translated as “words”, but other text contents such as numbers and punctuation also form separate tokens). Each token comes along with several token attributes that were estimated by the NLP pipeline. Examples of token attributes include the Part-of-Speech tag or the lemma.
The Document class stores the tokens and their “standard” attributes in a token matrix. This matrix is of shape (N, M) for N tokens and M attributes. There are at least two or three attributes: whitespace (boolean – is there a whitespace after the token?), token (the actual token, i.e. “word” type) and optionally sent_start (only given when sentence information is parsed in the NLP pipeline).
The token matrix is a uint64 matrix as it stores all information as 64 bit hash values. Compared to sequences of strings, this reduces memory usage and allows faster computations and data modifications. E.g., when you transform a token (let’s say “Hello” to “hello”), you only do one transformation: calculate one new hash value and replace each occurrence of the old hash with the new hash. The hashes are calculated with SpaCy’s hash_string function. For fast conversion between token/attribute hashes and strings, the mappings are stored in a bidirectional dictionary using the bidict package. Each column, i.e. each attribute, in the token matrix has a separate bidict in the bimaps dictionary that is shared between a corpus and each Document object. Using bidict proved to be much faster than using SpaCy’s built-in Vocab / StringStore.
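The hashing idea can be illustrated with a small stand-alone sketch; Python's built-in hash() and a pair of plain dicts stand in for SpaCy's hash_string and the bidict mappings here:

```python
# tokens stored as hashes; a two-way mapping recovers the strings
tokens = ["Hello", "world", "Hello"]

str2hash = {}
hash2str = {}
for t in tokens:
    h = hash(t)  # stand-in for spacy.strings.hash_string
    str2hash[t] = h
    hash2str[h] = t

hashed = [str2hash[t] for t in tokens]

# transforming "Hello" -> "hello": compute one new hash,
# then replace every occurrence of the old hash
old_h, new_t = str2hash["Hello"], "hello"
new_h = hash(new_t)
str2hash[new_t] = new_h
hash2str[new_h] = new_t
hashed = [new_h if h == old_h else h for h in hashed]

print([hash2str[h] for h in hashed])  # ['hello', 'world', 'hello']
```

The transformation itself runs once per distinct token rather than once per occurrence, which is where the speedup comes from.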
Besides “standard” token attributes that come from the SpaCy NLP pipeline, a user may also add custom token attributes. These are stored in each document’s custom_token_attrs dictionary, which maps an attribute name to a NumPy array. These arrays are of arbitrary type and don’t use the hashing approach. Besides token attributes, there are also document attributes. These are attributes attached to each document, for example the document label (a unique document identifier). Custom document attributes can be added, e.g. to record the publication year of a document. Document attributes can also be of any type and are not hashed.
The Corpus class implements a data structure for text corpora with named documents. All these documents are stored in the corpus as Document objects. Corpus functions operate on Corpus objects. They are implemented in corpus/_corpusfuncs.py. All corpus functions that transform/modify a corpus have an inplace argument, by default set to True. If inplace is set to True, the corpus is modified directly, i.e. the input corpus itself is changed. If inplace is set to False, a copy of the input corpus is created and all modifications are applied to this copy; the original input corpus is not altered in that case. The corpus_func_inplace_opt decorator is used to mark corpus functions with the in-place option.
The Corpus class provides parallel processing capabilities for processing large amounts of data. This can be controlled with the max_workers argument. Parallel processing is then enabled at two stages: First, it is simply enabled for the SpaCy NLP pipeline by setting up the pipeline accordingly. Second, a reusable process pool executor is created by means of loky. This process pool is then used in corpus functions whenever parallel execution is beneficial over serial execution. The parallelexec decorator is used to mark (inner) functions for parallel execution.
topicmod module
This is the central module for computing, evaluating and analyzing topic models.
topicmod/evaluate.py mainly implements several evaluation metrics for topic models. Topic models can be computed and evaluated in parallel; the base code for that is in topicmod/parallel.py. Three modules use the base classes from topicmod/parallel.py to implement interfaces to popular topic modeling packages:
topicmod/tm_gensim.py for gensim
topicmod/tm_lda.py for lda
topicmod/tm_sklearn.py for scikit-learn