Version history

0.12.0 - 2023-05-03

  • added optional interoperability functions for data exchange with R

  • added token_cooccurrence function for calculating a token cooccurrence matrix for a corpus

  • added common by_attr argument for many text processing/mining functions to operate only on a certain token attribute

  • added new function token_collocation_matrix for calculating a token collocation matrix based on bigrams

  • added PPMI measure (ppmi function)

  • added NGramModel class for N-gram models

  • added NaiveBayesClassifier class for Naive Bayes classification models

  • added Health News in Twitter Data Set

  • added 5 new languages now supported by SpaCy (Croatian, Finnish, Korean, Swedish, Ukrainian)

  • fix: don’t store parallelization worker related attributes on pickling

  • updated dependencies (only SpaCy 3.3 or higher is now supported)

  • compatibility with Python 3.11
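
For readers unfamiliar with the PPMI measure mentioned in this release, the following is a minimal, stdlib-only sketch of positive pointwise mutual information computed from a co-occurrence count matrix. It is not tmtoolkit's ppmi function; the name ppmi_sketch and the list-of-lists input format are assumptions made for illustration only.

```python
import math


def ppmi_sketch(counts):
    """Hypothetical helper (not part of tmtoolkit): PPMI for a count matrix.

    `counts` is a list of rows holding non-negative co-occurrence counts.
    PMI(x, y) = log(p(x, y) / (p(x) * p(y))); PPMI clips negative values to 0.
    """
    total = sum(sum(row) for row in counts)
    if total <= 0:
        raise ValueError("co-occurrence counts must sum to a positive value")

    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]

    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n in enumerate(row):
            if n > 0:
                # p(x, y) / (p(x) * p(y)) simplifies to n * total / (row * col)
                pmi = math.log(n * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(pmi, 0.0))
            else:
                # zero counts would give PMI = -inf, which PPMI clips to 0
                out_row.append(0.0)
        result.append(out_row)
    return result


# Two token pairs that only ever co-occur with themselves
scores = ppmi_sketch([[2, 0], [0, 2]])
```

In this toy input the diagonal entries receive a positive score and the off-diagonal entries are clipped to zero.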

0.11.2 - 2022-03-11

  • updated Arun et al. 2010 topic model evaluation metric to better follow the paper’s instructions instead of the implementation adapted from the ldatuning package (see this discussion – many thanks to @hkimber)

  • updated Mimno et al. 2011 topic model evaluation metric’s default argument values to be the same as used in the original paper; added an optional argument to include word probabilities into the calculations

  • added an example with topic model evaluation for the AP and NIPS datasets

  • added developer documentation

0.11.1 - 2022-02-10

  • show better error messages when dependencies for optional module corpus are not met

  • fix a SciPy deprecation warning

0.11.0 - 2022-02-08

This release brings several major API changes to the text loading, text preprocessing and text mining parts of tmtoolkit. All these features are now in a single sub-module, corpus. This module contains a Corpus class which holds Document objects. All text processing and text mining operations can be performed on Corpus objects. These operations are implemented as a functional API in the corpus sub-module.

It is advisable to re-install tmtoolkit in a new virtual environment following the installation instructions. Make sure to run python -m tmtoolkit setup <LANGUAGES>, where <LANGUAGES> is a list of language codes like en,fr.

Further changes include:

  • added new functions for identifying and joining token collocations

  • added new functions for visualizing corpus summary statistics

  • added new function find_documents

  • added new text normalization functions normalize_unicode, simplify_unicode, numbers_to_magnitudes

  • added support for sentences

  • added support for using all SpaCy token attributes

  • added common select argument for many text processing/mining functions to operate only on a subset of documents

  • added common as_table argument for many text processing/mining functions to convert the result to a (sorted) dataframe

  • added common proportions argument for many text processing/mining functions to convert resulting frequencies to proportions or log proportions

  • added common inplace argument for many text processing/mining functions to either transform a corpus in-place or return a transformed copy

  • added 6 new languages now supported by SpaCy (Catalan, Danish, Macedonian, Polish, Romanian, Russian)

  • added new function corpus_join_documents for joining documents

  • added option for calculating log probabilities or proportions

  • fixed log probability calculations for higher precision in BoW statistics and topic model evaluation functions

  • dependencies for text processing and text mining are now optional

  • added function for easier logging: enable_logging

  • moved all functions that operate on string or numeric sequences to tokenseq sub-module

  • all glob patterns now use EXACT flag

  • added type annotations for corpus, tokenseq and utils sub-modules

  • updated dependencies (only SpaCy 3.2 or higher is now supported)

  • updated minimum Python requirements (Python 3.8 or higher)

  • removed datatable support

0.10.0 - 2020-08-03

This release marks a switch from NLTK to SpaCy for text preprocessing tasks. With this change, many more languages are supported (see this list). It is advisable to re-install tmtoolkit in a new virtual environment following the installation instructions. Make sure to run python -m tmtoolkit setup <LANGUAGES>, where <LANGUAGES> is a list of language codes like en,fr.

Further changes:

  • added support for word and document vectors via SpaCy

  • added built-in datasets available via Corpus class

  • added ldamodel_top_word_topics and ldamodel_top_topic_docs functions

  • added new filter functions and options for TMPreproc

  • made stemming function optional (only available when NLTK is installed)

  • run DTM generation in parallel

  • updated dependencies

  • restructured tests

0.9.0 - 2019-12-20

  • added usage and API documentation

  • added support for Arun 2010 metric in tm_gensim (thanks to @mcooper)

  • added support for datatable package

  • added functional API for text preprocessing

  • added KWIC in text preprocessing

  • added post-installation setup routine to download necessary data files

  • added built-in corpora

  • added sorted_terms and sorted_terms_data_table to bow_stats

  • added glue_tokens function

  • retain sparse matrices in several bow_stats functions such as tfidf

  • corpus module: loading of CSV and ZIP files, added several other new methods

  • faster get_dtm (now works in parallel)

  • filter_tokens / filter_documents accept multiple patterns at once

  • lots of (partly breaking) changes and speed improvements in TMPreproc

  • fixed error with ignore_case being ignored in token_match for regex and glob

  • integrate tox

  • use NumPy extras for Hypothesis tests

  • compatibility with Python 3.6, 3.7 and 3.8

0.8.0 - 2019-02-05

  • faster package and sub-module import

  • remove support for Python 2.7 (now only Python 3.5 and higher is supported)

  • use germalemma package

  • use importlib instead of deprecated imp

  • fix problem with not installing all required packages

0.7.3 - 2018-09-17 (last release to support Python 2.7)

  • new options in corpus module for converting Windows linebreaks to Unix linebreaks

0.7.2 - 2018-07-23

  • new option for exclude_topics: return_new_topic_mapping

  • fixed issue #7 (results entry about model gets overwritten)

0.7.1 - 2018-06-18

  • fix stupid missing import

0.7.0 - 2018-06-18

  • added sub-package bow with functions for DTM creation and statistics

  • fixed problems with evaluation and parallel calculation of gensim models (#5)

  • added Gensim evaluation example

0.6.3 - 2018-06-01

  • made get_vocab_and_terms more memory-efficient

  • updated requirements (fixes #6)

0.6.2 - 2018-04-27

  • added new function exclude_topics to model_stats

0.6.1 - 2018-04-27

  • better figure title placement, grouped subplots and other improvements in plot_eval_results

  • bugfix in model_stats due to missing unicode literals

0.6.0 - 2018-04-25

  • API restructured (uninstall package first when upgrading!):

      • sub-package lda_utils is now called topicmod

      • no more common module in topicmod -> divided into evaluate (including evaluation metrics from former eval_metrics), model_io, model_stats, and parallel

  • added coherence metrics (PR #2):

      • implemented modified coherence metric according to Mimno et al. 2011 as metric_coherence_mimno_2011

      • added wrapper function for coherence model provided by Gensim as metric_coherence_gensim

  • added evaluation metric with probability of held-out documents in cross-validation (see metric_held_out_documents_wallach09)

  • added new example for topic model coherence

  • updated examples

0.5.0 - 2018-02-13

  • add doc_paths field to Corpus

  • change plot_eval_results to show individual metrics’ results as subplots – function signature changed!

0.4.2 - 2018-02-06

  • made greedy partitioning much more efficient (i.e. faster work distribution)

  • added package information variables

  • added this CHANGES document :)
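
As background on the greedy partitioning mentioned in this release: one common greedy scheme for distributing work assigns the largest remaining item to the currently least-loaded worker. The sketch below illustrates that general idea with a heap. It is an assumption about the technique, not tmtoolkit's actual code, and the function name greedy_partition_sketch is hypothetical.

```python
import heapq


def greedy_partition_sketch(sizes, k):
    """Hypothetical helper: split labelled work items into `k` balanced bins.

    `sizes` maps an item label (e.g. a document name) to its size
    (e.g. its number of tokens). Items are handled largest first and each
    one goes to the bin with the smallest load so far.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # (current load, bin index); the heap yields the least-loaded bin
    heap = [(0, i) for i in range(k)]
    bins = [[] for _ in range(k)]

    for label, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        bins[i].append(label)
        heapq.heappush(heap, (load + size, i))

    return bins


# Distribute four documents of different sizes across two workers
parts = greedy_partition_sketch({"a": 5, "b": 4, "c": 3, "d": 2}, 2)
```

Using a heap keeps the search for the least-loaded bin at logarithmic cost per item, which matters when many documents are distributed.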

0.4.1 - 2018-01-24

  • fixed bug in lda_utils.common.ldamodel_full_doc_topics

  • added topic_labels for doc-topic heatmap

  • minor documentation fixes

0.4.0 - 2018-01-18

  • improved parameter checks for TMPreproc.filter_for_pos

  • improved tests for TMPreproc.filter_for_pos

  • fixed broken test in Python 2.x

  • added generate_topic_labels_from_top_words

  • speed up in top_n_from_distribution

  • added relevance score calculation (Sievert et al 2014)

  • added functions to get most/least distinctive words

  • added saliency calculation

  • allow defining axis labels and plot title in plot_eval_results
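
For context on the relevance score listed above: Sievert et al. 2014 define the relevance of a word w for a topic t as λ · log(φ_tw) + (1 − λ) · log(φ_tw / p_w), where φ_tw is the word's probability in the topic and p_w its marginal probability in the corpus. The following stdlib-only sketch computes this for a single topic. It is not tmtoolkit's implementation; the function name relevance_sketch and its argument layout are assumptions for illustration.

```python
import math


def relevance_sketch(phi_topic, p_word, lambda_):
    """Hypothetical helper: word relevance for one topic.

    phi_topic -- probabilities of each word within the topic (all > 0)
    p_word    -- marginal probabilities of the same words (all > 0)
    lambda_   -- weight in [0, 1]; 1 ranks by topic probability alone,
                 0 ranks by lift (phi / p_word) alone
    """
    if not 0 <= lambda_ <= 1:
        raise ValueError("lambda_ must be in [0, 1]")
    return [
        lambda_ * math.log(phi) + (1 - lambda_) * math.log(phi / p)
        for phi, p in zip(phi_topic, p_word)
    ]


# Relevance of two words in one topic, weighting both terms equally
scores = relevance_sketch([0.5, 0.25], [0.5, 0.5], 0.5)
```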