Installation

Requirements

tmtoolkit works with Python 3.8 or newer (tested up to Python 3.11).

Note

Two dependencies, lda and wordcloud, don’t work with Python 3.11 so far. If you want to do topic modeling via LDA or use word cloud visualizations, you must use Python 3.8 to 3.10 or wait until lda and wordcloud receive updates that make them work under Python 3.11.
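The version constraints above can be checked before installing anything. The following is a minimal sketch, not part of tmtoolkit: the function name `check_python` and the warning texts are made up for illustration, and the version bounds are copied from the requirements stated above.

```python
import sys

# Bounds taken from the requirements above (assumed to still be current).
MIN_VERSION = (3, 8)          # oldest supported Python
MAX_TESTED = (3, 11)          # newest version tmtoolkit is tested with
LDA_WORDCLOUD_BROKEN = (3, 11)  # version on which lda and wordcloud don't work


def check_python(version=None):
    """Return a list of warnings for a (major, minor) Python version.

    Hypothetical helper; uses the running interpreter if no version is given.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    warnings = []
    if version < MIN_VERSION:
        warnings.append("Python 3.8 or newer is required")
    if version > MAX_TESTED:
        warnings.append("Python version is newer than the latest tested one (3.11)")
    if version == LDA_WORDCLOUD_BROKEN:
        warnings.append("lda and wordcloud are reported not to work on Python 3.11")
    return warnings


if __name__ == "__main__":
    for message in check_python():
        print("warning:", message)
```

An empty list means the interpreter falls inside the range where all features, including lda and wordcloud, are expected to work.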

Requirements are automatically installed via pip as described below. Additional packages can also be installed via pip for certain use cases (see Optional packages).

Installation instructions

The package tmtoolkit is available on PyPI and can be installed via the Python package manager pip. It is highly recommended to install tmtoolkit and its dependencies in a separate Python virtual environment (“venv”) and to upgrade to the latest pip version. You may also choose to install virtualenvwrapper, which makes managing venvs a lot easier.

Creating and activating a venv without virtualenvwrapper:

python3 -m venv myenv

# activating the environment (on Windows type "myenv\Scripts\activate.bat")
source myenv/bin/activate

Alternatively, creating and activating a venv with virtualenvwrapper:

mkvirtualenv myenv

# re-activating the environment later, e.g. in a new shell (mkvirtualenv already activates it on creation)
workon myenv

Upgrading pip (only do this when you’ve activated your venv):

pip install -U pip
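If you are unsure whether a venv is currently active, you can ask the Python interpreter. This is a small sketch using only the standard library; the function name `in_virtualenv` is chosen for illustration. It relies on the fact that inside a venv `sys.prefix` points to the environment, while `sys.base_prefix` points to the base Python installation.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Outside a venv both prefixes are identical; inside they differ.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("venv active:", sys.prefix)
    else:
        print("no venv active; activate one before upgrading pip")
```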

The tmtoolkit package is highly modular and tries to install as few software dependencies as possible. You can therefore choose between a minimal installation and a recommended set of packages that enables most features. For the recommended installation, run one of the following commands, depending on your preferred package for topic modeling:

# recommended installation without topic modeling
pip install -U "tmtoolkit[recommended]"

# recommended installation with "lda" for topic modeling
pip install -U "tmtoolkit[recommended,lda]"

# recommended installation with "scikit-learn" for topic modeling
pip install -U "tmtoolkit[recommended,sklearn]"

# recommended installation with "gensim" for topic modeling
pip install -U "tmtoolkit[recommended,gensim]"

# you may also select several topic modeling packages
pip install -U "tmtoolkit[recommended,lda,sklearn,gensim]"

The minimal installation installs only a base set of dependencies and enables only the modules for bag-of-words (BoW) statistics, token sequence operations, topic modeling and utility functions. You can install it as follows:

# alternative installation if you only want to install a minimum set of dependencies
pip install -U tmtoolkit
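When the installation is scripted, for example in a provisioning script, the requirement string can be assembled from a list of extras. The sketch below is only an illustration: the helpers `requirement` and `pip_command` are hypothetical and not part of tmtoolkit. Passing the command as an argument list to `subprocess.run` sidesteps the shell quoting that the square brackets otherwise need.

```python
import subprocess
import sys


def requirement(extras=()):
    """Build a pip requirement such as 'tmtoolkit[recommended,lda]'."""
    extras = list(dict.fromkeys(extras))  # drop duplicates, keep order
    if not extras:
        return "tmtoolkit"  # minimal installation
    return "tmtoolkit[" + ",".join(extras) + "]"


def pip_command(extras=()):
    """Return the argument list for installing tmtoolkit with the given extras."""
    # Using "python -m pip" makes sure pip runs for the current interpreter.
    return [sys.executable, "-m", "pip", "install", "-U", requirement(extras)]


if __name__ == "__main__":
    command = pip_command(["recommended", "sklearn"])
    print("would run:", " ".join(command))
    # Uncomment to actually install:
    # subprocess.run(command, check=True)
```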

Note

The tmtoolkit package is about 10 MB in size because it contains some example corpora.

After installation, you should run tmtoolkit’s setup routine once. It makes sure that all required data files are present and downloads them if necessary. You should specify a list of languages for which language models should be downloaded and installed. The list of available language models corresponds to the models provided by spaCy (except for “multi-language”). Specify the two-letter ISO language code of each language model that you want to install, separated by commas and without spaces. For example, to install models for English and German:

python -m tmtoolkit setup en,de

To install all available language models, you can run:

python -m tmtoolkit setup all
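The setup call can also be built programmatically, which makes it easy to guard against stray spaces in the language list. This is a sketch under assumptions: the helper `setup_command` is hypothetical, and its validation (two ASCII letters, or the keyword `all`) only mirrors the rules described above rather than tmtoolkit’s own checks.

```python
import subprocess
import sys


def setup_command(languages):
    """Build the argument list for 'python -m tmtoolkit setup <codes>'."""
    if isinstance(languages, str):
        languages = [languages]
    # Strip whitespace so that the joined list contains no spaces.
    codes = [code.strip().lower() for code in languages]
    if not codes:
        raise ValueError("at least one language code is required")
    for code in codes:
        is_iso_code = len(code) == 2 and code.isascii() and code.isalpha()
        if code != "all" and not is_iso_code:
            raise ValueError(f"not a two-letter language code: {code!r}")
    return [sys.executable, "-m", "tmtoolkit", "setup", ",".join(codes)]


if __name__ == "__main__":
    command = setup_command(["en", "de"])
    print("would run:", " ".join(command))
    # Uncomment to actually download the language models:
    # subprocess.run(command, check=True)
```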

Optional packages

For additional features, you can install further packages using the following installation options:

pip install -U "tmtoolkit[textproc_extra]" for Unicode normalization and simplification and for stemming with nltk
pip install -U "tmtoolkit[wordclouds]" for generating word clouds
pip install -U "tmtoolkit[lda]" for topic modeling with LDA
pip install -U "tmtoolkit[sklearn]" for topic modeling with scikit-learn
pip install -U "tmtoolkit[gensim]" for topic modeling and additional evaluation metrics with Gensim
pip install -U "tmtoolkit[topic_modeling_eval_extra]" for topic modeling evaluation metrics griffiths_2004 and held_out_documents_wallach09 (see further information below)
pip install -U "tmtoolkit[rinterop]" for interoperability with R (see R interop. chapter)

For the LDA evaluation metrics griffiths_2004 and held_out_documents_wallach09, it is necessary to install gmpy2 for multiple-precision arithmetic. This in turn requires the C development libraries (including header files) for GMP, MPFR and MPC. On Debian/Ubuntu systems, they are installed with:

sudo apt install libgmp-dev libmpfr-dev libmpc-dev
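To see which of the optional features are usable in the current environment, you can test whether the corresponding modules are importable. The sketch below is not part of tmtoolkit. The mapping from installation option to module name is an assumption based on the package names mentioned above and may be incomplete (for instance, it omits the rinterop option).

```python
from importlib.util import find_spec

# Installation option -> top-level modules it is assumed to provide.
# These module names are assumptions for illustration.
OPTIONAL_MODULES = {
    "textproc_extra": ["nltk"],
    "wordclouds": ["wordcloud"],
    "lda": ["lda"],
    "sklearn": ["sklearn"],
    "gensim": ["gensim"],
    "topic_modeling_eval_extra": ["gmpy2"],
}


def available_extras(modules=OPTIONAL_MODULES):
    """Map each option to True if all of its modules can be found."""
    # find_spec() locates a top-level module without importing it.
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in modules.items()
    }


if __name__ == "__main__":
    for extra, found in available_extras().items():
        print(f"{extra}: {'available' if found else 'missing'}")
```

A module that is found can still fail at import time, for example when gmpy2 was built against missing system libraries, so treat the result as a first check only.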