Version history
0.11.0 - 2022-02-08
This release brings several major API changes to the text loading, text preprocessing and text mining parts of
tmtoolkit. All these features are now in a single sub-module, corpus. This module contains a Corpus class which
holds Document objects. All text processing and text mining operations can be performed on Corpus objects. These
operations are implemented as a functional API in the corpus sub-module.
It is advisable to re-install tmtoolkit in a new virtual environment following the
installation instructions. Make sure to run python -m tmtoolkit setup <LANGUAGES>, where
<LANGUAGES> is a list of language codes like en,fr.
Further changes include:
added new functions for identifying and joining token collocations
added new functions for visualizing corpus summary statistics
added new function find_documents
added new text normalization functions normalize_unicode, simplify_unicode, numbers_to_magnitudes
added support for sentences
added support for using all SpaCy token attributes
added common select argument for many text processing/mining functions to operate only on a subset of documents
added common as_table argument for many text processing/mining functions to convert the result to a (sorted) dataframe
added common proportions argument for many text processing/mining functions to convert resulting frequencies to proportions or log proportions
added common inplace argument for many text processing/mining functions to either transform a corpus in-place or return a transformed copy
added 6 new languages now supported by SpaCy (Catalan, Danish, Macedonian, Polish, Romanian, Russian)
added new function corpus_join_documents for joining documents
added option for calculating log probabilities or proportions
fixed log probability calculations for higher precision in BoW statistics and topic model evaluation functions
dependencies for text processing and text mining are now optional
added function for easier logging: enable_logging
moved all functions that operate on string or numeric sequences to tokenseq sub-module
all glob patterns now use EXACT flag
added type annotations for corpus, tokenseq and utils sub-modules
updated dependencies (only SpaCy 3.2 or higher is now supported)
updated minimum Python requirements (Python 3.8 or higher)
removed datatable support
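As a rough illustration of why the log probability items above matter for precision, the following stand-alone sketch compares multiplying many small probabilities with summing their logarithms. It uses only the Python standard library, not tmtoolkit; the function names are hypothetical and this is not how tmtoolkit implements its calculations.

```python
import math


def product_of_probs(probs):
    """Multiply probabilities directly (prone to floating point underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def sum_of_log_probs(probs):
    """Sum the natural logs of the probabilities instead of multiplying them."""
    return sum(math.log(p) for p in probs)


# 100 small probabilities, e.g. word probabilities in a long document
probs = [1e-5] * 100

direct = product_of_probs(probs)   # underflows in double precision
logged = sum_of_log_probs(probs)   # stays finite

print(direct, logged)
```

With these inputs the direct product collapses to zero, while the sum of logs remains a usable finite number.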
0.10.0 - 2020-08-03
This release marks a switch from NLTK to SpaCy for text preprocessing tasks. With this change,
many more languages are supported (see this list). It is advisable to re-install tmtoolkit
in a new virtual environment following the installation instructions. Make sure to run
python -m tmtoolkit setup <LANGUAGES>, where <LANGUAGES> is a list of language codes like en,fr.
Further changes:
added support for word and document vectors via SpaCy
added built-in datasets available via Corpus class
added ldamodel_top_word_topics and ldamodel_top_topic_docs functions
added new filter functions and options for TMPreproc
made stemming function optional (only available when NLTK is installed)
run DTM generation in parallel
updated dependencies
restructured tests
0.9.0 - 2019-12-20
added usage and API documentation
added support for Arun 2010 metric in tm_gensim (thx to @mcooper)
added support for datatable package
added functional API for text preprocessing
added KWIC in text preprocessing
added post-installation setup routine to download necessary data files
added built-in corpora
added sorted_terms and sorted_terms_data_table to bow_stats
added glue_tokens function
retain sparse matrices in several bow_stats functions such as tfidf
corpus module: loading of CSV and ZIP files, added several other new methods
faster get_dtm (now works in parallel)
filter_tokens / filter_documents accept multiple patterns at once
lots of (partly breaking) changes and speed improvements in TMPreproc
fixed error with ignore_case being ignored in token_match for regex and glob
integrate tox
use Numpy extras for hypothesis tests
compatibility with Python 3.6, 3.7 and 3.8
0.8.0 - 2019-02-05
faster package and sub-module import
remove support for Python 2.7 (now only Python 3.5 and higher is supported)
use importlib instead of deprecated imp
fix problem with not installing all required packages
0.7.3 - 2018-09-17 (last release to support Python 2.7)
new options in corpus module for converting Windows linebreaks to Unix linebreaks
0.7.2 - 2018-07-23
new option for exclude_topics: return_new_topic_mapping
fixed issue #7 (results entry about model gets overwritten)
0.7.1 - 2018-06-18
fix stupid missing import
0.7.0 - 2018-06-18
added sub-package bow with functions for DTM creation and statistics
fixed problems with evaluation and parallel calculation of gensim models (#5)
added Gensim evaluation example
0.6.3 - 2018-06-01
made get_vocab_and_terms more memory-efficient
updated requirements (fixes #6)
0.6.2 - 2018-04-27
added new function exclude_topics to model_stats
0.6.1 - 2018-04-27
better figure title placement, grouped subplots and other improvements in plot_eval_results
bugfix in model_stats due to missing unicode literals
0.6.0 - 2018-04-25
API restructured (uninstall package first when upgrading!): sub-package lda_utils is now called topicmod; no more common module in topicmod, it is now divided into evaluate (including evaluation metrics from former eval_metrics), model_io, model_stats, and parallel
added coherence metrics (PR #2): implemented modified coherence metric according to Mimno et al. 2011 as metric_coherence_mimno_2011; added wrapper function for coherence model provided by Gensim as metric_coherence_gensim
added evaluation metric with probability of held-out documents in cross-validation (see metric_held_out_documents_wallach09)
added new example for topic model coherence
updated examples
0.5.0 - 2018-02-13
add doc_paths field to Corpus
change plot_eval_results to show individual metrics’ results as subplots – function signature changed!
0.4.2 - 2018-02-06
made greedy partitioning much more efficient (i.e. faster work distribution)
added package information variables
added this CHANGES document :)
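The greedy partitioning item above refers to distributing work across parallel workers. A minimal sketch of one common greedy heuristic (largest item first, always into the currently lightest bin) is shown below. That this is the heuristic meant is an assumption, and the function name greedy_partition is hypothetical, not tmtoolkit's API.

```python
import heapq


def greedy_partition(sizes, k):
    """Split labeled items into k bins with roughly equal total size.

    sizes: dict mapping item label -> size (e.g. document length)
    k: number of bins (e.g. worker processes)
    """
    # each heap entry is (current total size, bin index, member labels);
    # the unique bin index breaks ties, so the lists are never compared
    bins = [(0, i, []) for i in range(k)]
    heapq.heapify(bins)

    # place the largest items first, each into the currently lightest bin
    for label, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, i, members = heapq.heappop(bins)
        members.append(label)
        heapq.heappush(bins, (total + size, i, members))

    # return the bins in their original index order
    return [members for _, _, members in sorted(bins, key=lambda b: b[1])]


parts = greedy_partition({'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}, k=2)
print(parts)
```

Using a heap keeps each placement at O(log k) instead of scanning all bins for the lightest one.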
0.4.1 - 2018-01-24
fixed bug in lda_utils.common.ldamodel_full_doc_topics
added topic_labels for doc-topic heatmap
minor documentation fixes
0.4.0 - 2018-01-18
improved parameter checks for TMPreproc.filter_for_pos
improved tests for TMPreproc.filter_for_pos
fixed broken test in Python 2.x
added generate_topic_labels_from_top_words
speed up in top_n_from_distribution
added relevance score calculation (Sievert et al 2014)
added functions to get most/least distinctive words
added saliency calculation
allow defining axis labels and plot title in plot_eval_results
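For orientation, the relevance score mentioned in the list above is defined by Sievert and Shirley (2014) as relevance(w, t | lambda) = lambda * log(phi_tw) + (1 - lambda) * log(phi_tw / p_w), where phi_tw is the probability of word w in topic t and p_w is the marginal probability of w. The sketch below computes it in plain Python. The function name word_relevance and the default lambda are hypothetical, and this is not tmtoolkit's implementation.

```python
import math


def word_relevance(topic_word, p_word, lambda_=0.6):
    """Relevance of each word for each topic.

    topic_word: list of rows, one per topic, holding word probabilities phi_tw
    p_word: marginal word probabilities p_w (same word order as the rows)
    lambda_: weight between the topic-specific probability and the lift
    """
    return [
        [
            lambda_ * math.log(phi) + (1 - lambda_) * math.log(phi / pw)
            for phi, pw in zip(row, p_word)
        ]
        for row in topic_word
    ]


# two topics over a two-word vocabulary; the topics are assumed equally likely,
# so the marginal word distribution is the average of the two rows
topic_word = [[0.5, 0.5], [0.9, 0.1]]
p_word = [0.7, 0.3]

rel = word_relevance(topic_word, p_word)
rel_prob_only = word_relevance(topic_word, p_word, lambda_=1.0)
rel_lift_only = word_relevance(topic_word, p_word, lambda_=0.0)
```

Setting lambda_ to 1 ranks words purely by their probability within the topic, while 0 ranks them purely by lift, i.e. how much more likely a word is in the topic than overall.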