API

tmtoolkit.bow

tmtoolkit.bow.bow_stats

Common statistics from bag-of-words (BoW) matrices.

tmtoolkit.bow.bow_stats.codoc_frequencies(dtm, min_val=1, proportions=False)

Calculate the co-document frequency (aka word co-occurrence) matrix for a document-term matrix dtm, i.e. for each pair of tokens, the number of documents in which both occur at least min_val times. If proportions is True, return proportions scaled to the number of documents instead of absolute counts.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • min_val – threshold for counting occurrences

  • proportions – if True, return proportions scaled to the number of documents instead of absolute counts

Returns

co-document frequency (aka word co-occurrence) matrix with shape (vocab size, vocab size)

tmtoolkit.bow.bow_stats.doc_frequencies(dtm, min_val=1, proportions=False)

For each term in the vocabulary of dtm (i.e. its columns), return the number of documents in which it occurs at least min_val times.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • min_val – threshold for counting occurrences

  • proportions – if True, return proportions scaled to the number of documents instead of absolute counts

Returns

NumPy array of size M (vocab size) indicating in how many documents each term occurs at least min_val times (or the respective proportion if proportions is True).
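
A minimal usage sketch with a small, made-up dense NumPy DTM (a dense array is assumed to be accepted in place of a sparse matrix, as the parameter description suggests):

import numpy as np
from tmtoolkit.bow.bow_stats import doc_frequencies

# toy DTM with 3 documents (rows) and a vocabulary of 4 terms (columns)
dtm = np.array([[1, 0, 2, 0],
                [0, 1, 1, 0],
                [3, 0, 0, 1]])

doc_frequencies(dtm)                     # -> array([2, 1, 2, 1])
doc_frequencies(dtm, proportions=True)   # -> the same values divided by the number of documents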

tmtoolkit.bow.bow_stats.doc_lengths(dtm)

Return the length, i.e. number of terms for each document in document-term-matrix dtm. This corresponds to the row-wise sums in dtm.

Parameters

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns

NumPy array of size N (number of docs) with integers indicating the number of terms per document

tmtoolkit.bow.bow_stats.idf(dtm, smooth_log=1, smooth_df=1)

Calculate inverse document frequency (idf) vector from raw count document-term-matrix dtm with formula log(smooth_log + N / (smooth_df + df)), where N is the number of documents, df is the document frequency (see function doc_frequencies()), smooth_log and smooth_df are smoothing constants. With default arguments, the formula is thus log(1 + N/(1+df)).

Note that this may introduce NaN or infinite values if smooth_df is set to 0 and a term has a document frequency of 0 (division by zero).

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • smooth_log – smoothing constant inside log()

  • smooth_df – smoothing constant to add to document frequency

Returns

NumPy array of size M (vocab size) with inverse document frequency for each term in the vocab

tmtoolkit.bow.bow_stats.idf_probabilistic(dtm, smooth=1)

Calculate probabilistic inverse document frequency (idf) vector from raw count document-term-matrix dtm with formula log(smooth + (N - df) / df), where N is the number of documents and df is the document frequency (see function doc_frequencies()).

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • smooth – smoothing constant (setting this to 0 can lead to -inf results)

Returns

NumPy array of size M (vocab size) with probabilistic inverse document frequency for each term in the vocab

tmtoolkit.bow.bow_stats.sorted_terms(mat, vocab, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False, datatable_doc_labels=None)

For each row (i.e. document) in a (sparse) document-term-matrix mat, do the following:

  1. filter all values according to lo_thresh and hi_tresh

  2. sort values and the corresponding terms from vocab according to ascending

  3. optionally select the top top_n terms

  4. generate a list with pairs of terms and values

Return the collected lists for each row, or convert the result to a data table if document labels are passed via datatable_doc_labels (see the shortcut function sorted_terms_datatable()).

Parameters
  • mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)

  • vocab – list or array of vocabulary corresponding to columns in mat

  • lo_thresh – if not None, filter for values greater than lo_thresh

  • hi_tresh – if not None, filter for values less than or equal to hi_tresh

  • top_n – if not None, select only the top top_n terms

  • ascending – sorting direction

  • datatable_doc_labels – optional list/array of document labels corresponding to mat rows

Returns

list of lists with (term, value) tuples, or a data table with columns “doc”, “term” and “value” if datatable_doc_labels is given
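
A short sketch of typical usage; the vocabulary and counts below are made up for illustration and mat may be any (transformed) matrix:

import numpy as np
from tmtoolkit.bow.bow_stats import tfidf, sorted_terms

vocab = ['apple', 'banana', 'cherry', 'date']   # hypothetical vocabulary
dtm = np.array([[1, 0, 2, 0],
                [0, 1, 1, 1],
                [3, 0, 0, 1]])

mat = tfidf(dtm)                    # any transformation works here, e.g. tf-idf
sorted_terms(mat, vocab, top_n=2)
# -> one list per document with at most two (term, value) pairs each,
#    sorted by value in descending order (ascending=False)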

tmtoolkit.bow.bow_stats.sorted_terms_datatable(mat, vocab, doc_labels, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False)

Shortcut function for sorted_terms() which generates a data table with doc_labels.

Parameters
  • mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)

  • vocab – list or array of vocabulary corresponding to columns in mat

  • doc_labels – list/array of document labels corresponding to mat rows

  • lo_thresh – if not None, filter for values greater than lo_thresh

  • hi_tresh – if not None, filter for values less than or equal to hi_tresh

  • top_n – if not None, select only the top top_n terms

  • ascending – sorting direction

Returns

data table with columns “doc”, “term”, “value”

tmtoolkit.bow.bow_stats.term_frequencies(dtm, proportions=False)

Return the number of occurrences of each term in the vocab across all documents in document-term-matrix dtm. This corresponds to the column-wise sums in dtm.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • proportions – if True, return proportions scaled to the total number of tokens in dtm instead of absolute counts

Returns

NumPy array of size M (vocab size) indicating the number of occurrences of each term across all documents (proportions instead of absolute counts if proportions is True).

tmtoolkit.bow.bow_stats.tf_binary(dtm)

Transform raw count document-term-matrix dtm to binary term frequency matrix. This matrix contains 1 whenever a term occurred in a document, else 0.

Parameters

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

Returns

(sparse) binary term frequency matrix of type integer of size NxM

tmtoolkit.bow.bow_stats.tf_double_norm(dtm, K=0.5)

Transform raw count document-term-matrix dtm to double-normalized term frequency matrix K + (1-K) * dtm / max{t in doc}, where max{t in doc} is a vector of size N containing the maximum term count per document.

Note that this may introduce NaN values due to division by zero when a document is of length 0.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • K – normalization factor

Returns

double-normalized term frequency matrix of size NxM

tmtoolkit.bow.bow_stats.tf_log(dtm, log_fn=<ufunc 'log1p'>)

Transform raw count document-term-matrix dtm to log-normalized term frequency matrix log_fn(dtm).

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • log_fn – log function to use; default is NumPy’s numpy.log1p(), which calculates log(1 + x)

Returns

(sparse) log-normalized term frequency matrix of size NxM

tmtoolkit.bow.bow_stats.tf_proportions(dtm)

Transform raw count document-term-matrix dtm to term frequency matrix with proportions, i.e. term counts normalized by document length.

Note that this may introduce NaN values due to division by zero when a document is of length 0.

Parameters

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns

(sparse) term frequency matrix of size NxM with proportions, i.e. term counts normalized by document length
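
A sketch comparing the tf_* transformations on a made-up raw count matrix:

import numpy as np
from tmtoolkit.bow.bow_stats import tf_binary, tf_log, tf_proportions, tf_double_norm

dtm = np.array([[1, 0, 2],
                [0, 3, 1]])

tf_binary(dtm)        # 1 wherever a count is > 0, else 0
tf_log(dtm)           # log(1 + count) using the default numpy.log1p
tf_proportions(dtm)   # counts divided by the respective document length
tf_double_norm(dtm)   # 0.5 + 0.5 * count / maximum count per document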

tmtoolkit.bow.bow_stats.tfidf(dtm, tf_func=<function tf_proportions>, idf_func=<function idf>, **kwargs)

Calculate tfidf (term frequency inverse document frequency) matrix from raw count document-term-matrix dtm with matrix multiplication tf * diag(idf), where tf is the term frequency matrix tf_func(dtm) and idf is the inverse document frequency vector idf_func(dtm).

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • tf_func – function to calculate term-frequency matrix; see tf_* functions in this module

  • idf_func – function to calculate inverse document frequency vector; see idf_* functions in this module

  • kwargs – additional parameters passed to tf_func or idf_func like K or smooth (depending on which parameters these functions accept)

Returns

(sparse) tfidf matrix of size NxM
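
A sketch of swapping in different component functions; extra keyword arguments are forwarded as described above:

import numpy as np
from tmtoolkit.bow.bow_stats import tfidf, tf_double_norm, idf_probabilistic

dtm = np.array([[1, 0, 2],
                [0, 3, 1]])

tfidf(dtm)                                        # defaults: tf_proportions() and idf()
tfidf(dtm, tf_func=tf_double_norm, K=0.75)        # K is passed on to tf_double_norm()
tfidf(dtm, idf_func=idf_probabilistic, smooth=1)  # smooth is passed on to idf_probabilistic()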

tmtoolkit.bow.bow_stats.word_cooccurrence(dtm, min_val=1, proportions=False)

Calculate the co-document frequency (aka word co-occurrence) matrix. Alias for codoc_frequencies().

tmtoolkit.bow.dtm

Functions for creating a document-term matrix (DTM) and some compatibility functions for Gensim.

tmtoolkit.bow.dtm.create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=False, dtype=<class 'numpy.int32'>)

Create a sparse document-term-matrix (DTM) as matrix in COO sparse format from vocabulary array vocab, a list of tokenized documents docs and the number of unique tokens across all documents n_unique_tokens.

The DTM’s rows correspond to the documents in docs and its columns to the indices in vocab, hence the value DTM[j, k] is the term frequency of term vocab[k] in document j.

A note on performance: Creating the three arrays for a COO matrix seems to be the fastest way to generate a DTM. An alternative implementation using LIL format was ~2x slower.

Memory requirement: about 3 * <n_unique_tokens> * 4 bytes with default dtype (32-bit integer).

See also

This is the “low level” function. For a more straightforward interface, see tmtoolkit.preprocess.sparse_dtm(), which also calculates n_unique_tokens.

Parameters
  • vocab – list or array of the vocabulary, i.e. the unique tokens across all documents; determines the columns of the resulting DTM

  • docs – a list of tokenized documents

  • n_unique_tokens – number of unique tokens across all documents

  • vocab_is_sorted – if True, assume that vocab is sorted when creating the token IDs

  • dtype – data type of the resulting matrix

Returns

a sparse document-term-matrix in COO sparse format
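
A minimal sketch; n_unique_tokens is taken here as the summed number of unique tokens per document, which matches the COO allocation note above (when in doubt, prefer the higher-level tmtoolkit.preprocess.sparse_dtm(), which computes this for you):

import numpy as np
from tmtoolkit.bow.dtm import create_sparse_dtm

docs = [['the', 'cat', 'sat', 'the'], ['the', 'dog']]       # tokenized documents
vocab = np.array(sorted(set(t for d in docs for t in d)))   # sorted unique tokens
n_unique_tokens = sum(len(set(d)) for d in docs)            # number of nonzero DTM cells

dtm = create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=True)
dtm.todense()   # rows correspond to the documents in docs, columns to entries in vocab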

tmtoolkit.bow.dtm.dtm_and_vocab_to_gensim_corpus_and_dict(dtm, vocab, as_gensim_dictionary=True)

Convert a (sparse) DTM and a vocabulary list to a Gensim Corpus object and Gensim Dictionary object or a Python dict().

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • vocab – list or array of vocabulary

  • as_gensim_dictionary – if True create Gensim Dictionary from vocab, else create Python dict()

Returns

a 2-tuple with (Corpus object, Gensim Dictionary or Python dict())

tmtoolkit.bow.dtm.dtm_to_dataframe(dtm, doc_labels, vocab)

Convert a (sparse) DTM to a pandas DataFrame using document labels doc_labels as row index and vocab as column names.

See also

dtm_to_datatable() for generating a datatable Frame.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • doc_labels – document labels used as row index (row names); size must equal number of rows in dtm

  • vocab – list or array of vocabulary used as column names; size must equal number of columns in dtm

Returns

pandas DataFrame

tmtoolkit.bow.dtm.dtm_to_datatable(dtm, doc_labels, vocab, colname_rowindex='_doc')

Convert a (sparse) DTM to a datatable Frame using document labels doc_labels as row identifier (with column name colname_rowindex) and vocab as column names.

See also

dtm_to_dataframe() for generating a pandas DataFrame.

Parameters
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • doc_labels – document labels used as row index (row names); size must equal number of rows in dtm

  • vocab – list or array of vocabulary used as column names; size must equal number of columns in dtm

  • colname_rowindex – column name for row identifier (i.e. column where the document labels are put)

Returns

datatable Frame

tmtoolkit.bow.dtm.dtm_to_gensim_corpus(dtm)

Convert a (sparse) DTM to a Gensim Corpus object.

See also

gensim_corpus_to_dtm() for the reverse function or dtm_and_vocab_to_gensim_corpus_and_dict() which additionally creates a Gensim Dictionary.

Parameters

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns

a Gensim gensim.matutils.Sparse2Corpus object

tmtoolkit.bow.dtm.gensim_corpus_to_dtm(corpus)

Convert a Gensim corpus object to a sparse DTM in COO format.

See also

dtm_to_gensim_corpus() for the reverse function.

Parameters

corpus – Gensim corpus object

Returns

sparse DTM in COO format
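
A round-trip sketch between a sparse DTM and a Gensim corpus:

from scipy.sparse import coo_matrix
from tmtoolkit.bow.dtm import dtm_to_gensim_corpus, gensim_corpus_to_dtm

dtm = coo_matrix([[1, 0, 2],
                  [0, 3, 1]])

corpus = dtm_to_gensim_corpus(dtm)        # gensim.matutils.Sparse2Corpus object
dtm_back = gensim_corpus_to_dtm(corpus)   # back to a sparse DTM in COO format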

tmtoolkit.corpus

Corpus class for handling raw text corpora

class tmtoolkit.corpus.Corpus(docs=None)

The Corpus class facilitates the handling of raw text corpora. By “raw text” we mean that the documents in the corpus are represented as plain text strings, i.e. they are not tokenized and hence not ready for token-based quantitative analysis. In order to tokenize and further process the raw text documents, you can pass the Corpus object to tmtoolkit.preprocess.TMPreproc or use the functional preprocessing API from tmtoolkit.preprocess.

This class implements dict() methods, i.e. it behaves like a Python dict() where the keys are document labels and values are the corresponding document texts as strings.
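
A brief sketch of the dict-like interface (document labels and texts are made up):

from tmtoolkit.corpus import Corpus

corpus = Corpus({'doc1': 'Hello world.', 'doc2': 'Another short document.'})
corpus['doc3'] = 'A third document, added via dict-style assignment.'

'doc2' in corpus    # -> True
corpus.doc_labels   # -> ['doc1', 'doc2', 'doc3'] (sorted document labels)
corpus['doc1']      # -> 'Hello world.'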

__init__(docs=None)

Construct a new Corpus object by passing a dictionary of documents with document label -> document text mapping. You can create an empty corpus by not passing any documents and later add them, e.g. with add_doc(), add_files() or add_folder().

A Corpus object can also be created by loading data from files or folders. See the class methods from_files(), from_folders() and from_pickle().

Parameters

docs – dictionary of documents with document label -> document text mapping

__deepcopy__(memodict=None)

Copy a Corpus object including all of its present state. Performs a deep copy.

__getitem__(doc_label)

dict method for retrieving document with label doc_label via corpus[<doc_label>].

__setitem__(doc_label, doc_text)

dict method for setting a document with label doc_label via corpus[<doc_label>] = <doc_text>.

__delitem__(doc_label)

dict method for removing a document with label doc_label via del corpus[<doc_label>].

__contains__(doc_label)

dict method for checking whether doc_label exists in this corpus.

add_doc(doc_label, doc_text, force_unix_linebreaks=True)

Add a document with document label doc_label and text doc_text to the corpus.

Parameters
  • doc_label – document label string

  • doc_text – document text string

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

Returns

this corpus instance

add_files(files, encoding='utf8', doc_label_fmt='{path}-{basename}', doc_label_path_join='_', doc_labels=None, read_size=-1, force_unix_linebreaks=True)

Read text documents from files passed in files and add them to the corpus. The document label for each new document is determined via format string doc_label_fmt.

Parameters
  • files – single file string or sequence of files to read

  • encoding – character encoding of the files

  • doc_label_fmt – document label format string with placeholders “path”, “basename”, “ext”

  • doc_label_path_join – string with which to join the components of the file paths

  • doc_labels – instead of generating document labels from doc_label_fmt, pass a list of document labels to be used directly

  • read_size – max. number of characters to read. -1 means read full file.

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

Returns

this instance

add_folder(folder, valid_extensions=('txt',), encoding='utf8', strip_folderpath_from_doc_label=True, doc_label_fmt='{path}-{basename}', doc_label_path_join='_', read_size=-1, force_unix_linebreaks=True)

Read documents residing in folder folder and ending in one of the file extensions specified via valid_extensions. Note that only raw text files can be read, not PDFs, Word documents, etc. These must be converted to raw text files beforehand, for example with pdftotext (from the poppler-utils package) or pandoc.

Parameters
  • folder – Folder from where the files are read.

  • valid_extensions – Sequence of valid file extensions like 'txt', 'md', etc.

  • encoding – character encoding of the files

  • strip_folderpath_from_doc_label – if True, do not include the folder path in the document label

  • doc_label_fmt – document label format string with placeholders “path”, “basename”, “ext”

  • doc_label_path_join – string with which to join the components of the file paths

  • read_size – max. number of characters to read. -1 means read full file.

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

Returns

this instance

add_tabular(files, id_column, text_column, prepend_columns=None, encoding='utf8', doc_label_fmt='{basename}-{id}', force_unix_linebreaks=True, **kwargs)

Add documents from tabular (CSV or Excel) file(s).

Parameters
  • files – single string or list of strings with path to file(s) to load

  • id_column – column name of document identifiers

  • text_column – column name of document texts

  • prepend_columns – if not None, pass a list of columns whose contents should be added before the document text, e.g. ['title', 'subtitle']

  • encoding – character encoding of the files

  • doc_label_fmt – document label format string with placeholders "basename", "id" (document ID), and "row_index" (dataset row index)

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks in texts

  • kwargs – additional arguments passed to pandas.read_csv() or pandas.read_excel()

Returns

this instance
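
A hedged sketch of loading a CSV file; the file name and column names below are hypothetical and must match your own data:

from tmtoolkit.corpus import Corpus

corpus = Corpus()
# "articles.csv" is a hypothetical CSV file with columns "article_id", "title" and "text"
corpus.add_tabular('articles.csv', id_column='article_id', text_column='text',
                   prepend_columns=['title'],
                   doc_label_fmt='{basename}-{id}')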

add_zip(zipfile, valid_extensions=('txt', 'csv', 'xls', 'xlsx'), encoding='utf8', doc_label_fmt_txt='{path}-{basename}', doc_label_path_join='_', doc_label_fmt_tabular='{basename}-{id}', force_unix_linebreaks=True, **kwargs)

Add documents from a ZIP file. The ZIP file may include documents with extensions listed in valid_extensions.

For file extensions ‘csv’, ‘xls’ or ‘xlsx’ add_tabular() will be called. Make sure to pass at least the parameters id_column and text_column as additional kwargs if your ZIP contains such files.

For all other file extensions add_files() will be called.

Parameters
  • zipfile – path to ZIP file to be loaded; string

  • valid_extensions – list of valid file extensions of ZIP file members; all other members will be ignored

  • encoding – character encoding of the files

  • doc_label_fmt_txt – document label format for non-tabular files; string with placeholders "path", "basename", "ext"

  • doc_label_path_join – string with which to join the components of the file paths

  • doc_label_fmt_tabular – document label format string for tabular files; placeholders "basename", "id" (document ID), and "row_index" (dataset row index)

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks in texts

  • kwargs – additional arguments passed to add_tabular() or add_files()

Returns

this instance

apply(func)

Apply function func to each document in the corpus.

Parameters

func – function accepting a document text string as only argument

Returns

this instance
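
For example, to lower-case all raw document texts:

from tmtoolkit.corpus import Corpus

corpus = Corpus({'a': 'First DOCUMENT.', 'b': 'Second DOCUMENT.'})
corpus.apply(str.lower)   # func receives each document text string
corpus['a']               # -> 'first document.'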

static builtin_corpora(with_paths=False)

Return list of available built-in corpora.

Parameters

with_paths – if True, return dict mapping corpus label to absolute path to dataset, else return only a list of corpus labels

Returns

dict or list, depending on with_paths

copy()

Copy a Corpus object including all of its present state. Performs a deep copy.

Returns

copy of this Corpus object

property doc_labels

Sorted document labels.

property doc_lengths

Return dict with number of characters per document.

Returns

dict mapping document labels to document text length in number of characters

filter_by_max_length(nchars)

Filter corpus by retaining only documents with at most nchars characters.

Parameters

nchars – maximum number of characters

Returns

this instance

filter_by_min_length(nchars)

Filter corpus by retaining only documents with at least nchars characters.

Parameters

nchars – minimum number of characters

Returns

this instance

filter_characters(allow_chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', drop_chars=None)

Filter the document strings by removing all characters but those in allow_chars or, if allow_chars evaluates to False, remove those in drop_chars.

Parameters
  • allow_chars – set (like {'a', 'b', 'c'}) or string sequence (like 'abc') of characters to keep

  • drop_chars – set or string sequence of characters to remove (if allow_chars evaluates to False)

Returns

this instance

classmethod from_builtin_corpus(corpus_label)

Construct Corpus object by loading one of the built-in datasets specified by corpus_label. To get a list of available built-in datasets, use builtin_corpora().

Parameters

corpus_label – the corpus to load (one of the labels listed in builtin_corpora())

Returns

Corpus instance

classmethod from_files(*args, **kwargs)

Construct Corpus object by loading files. See method add_files() for available arguments.

Returns

Corpus instance

classmethod from_folder(*args, **kwargs)

Construct Corpus object by loading files from a folder. See method add_folder() for available arguments.

Returns

Corpus instance

classmethod from_pickle(picklefile)

Construct Corpus object by loading picklefile.

Returns

Corpus instance

classmethod from_tabular(*args, **kwargs)

Construct Corpus object by loading documents from a tabular file, i.e. CSV or Excel file. See method add_tabular() for available arguments.

Returns

Corpus instance

classmethod from_zip(*args, **kwargs)

Construct Corpus object by loading files from a ZIP file. See method add_zip() for available arguments.

Returns

Corpus instance

get(*args)

dict method to retrieve a specific document like corpus.get(<doc_label>, <default>).

get_doc_labels(sort=False)

Return the document labels, optionally sorted.

Parameters

sort – sort the document labels if True

Returns

list of document labels

items()

dict method to retrieve pairs of document labels and texts.

keys()

dict method to retrieve document labels.

property n_docs

Number of documents.

remove_characters(drop_chars)

Shortcut for filter_characters() for removing characters in drop_chars.

Parameters

drop_chars – set or string sequence of characters to remove

Returns

this instance

replace_characters(translation_table)

Replace all characters in all document strings by applying the translation table translation_table, which in effect converts or removes characters.

Parameters

translation_table – a dict with character -> replacement mapping; if “replacement” is None, remove that character; both “character” and “replacement” can be either single characters or ordinals; can be constructed with str.maketrans(); Examples: {'a': 'X', 'b': None} (turns all a’s to X’s and removes all b’s), which is equivalent to {97: 88, 98: None}

Returns

this instance

sample(n, inplace=False, as_corpus=True)

Return a sample of n documents of this corpus. Sampling occurs without replacement.

Parameters
  • n – sample size

  • inplace – replace this corpus’ documents with the sampled documents if this argument is True

  • as_corpus – if True, return result as new Corpus object, else as dict. Only applies when inplace is False

Returns

a sample of n documents as dict if inplace is False, else this instance with the sampled documents

split_by_paragraphs(break_on_num_newlines=2, splitchar='\n', join_paragraphs=1, force_unix_linebreaks=True, new_doc_label_fmt='{doc}-{parnum}')

Split documents in corpus by paragraphs and set the resulting documents as new corpus.

Parameters
  • break_on_num_newlines – Threshold of minimum number of linebreaks that denote a new paragraph.

  • splitchar – Linebreak character(s)

  • join_paragraphs – Number of subsequent paragraphs to join and form a document

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

  • new_doc_label_fmt – document label format string with placeholders “doc” and “parnum” (paragraph number)

Returns

this corpus instance
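
A small sketch; the resulting labels follow the default new_doc_label_fmt and paragraph numbering is assumed to start at 1:

from tmtoolkit.corpus import Corpus

corpus = Corpus({'novel': 'First paragraph.\n\nSecond paragraph.\n\nThird paragraph.'})
corpus.split_by_paragraphs()
corpus.doc_labels   # -> ['novel-1', 'novel-2', 'novel-3'] (assumed numbering)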

to_pickle(picklefile)

Save corpus to pickle file picklefile.

Parameters

picklefile – path to file to store corpus

Returns

this instance

property unique_characters

Return the set of unique characters that exist in this corpus.

Returns

set of unique characters that exist in this corpus

values()

dict method to retrieve document texts.

Utility functions in corpus module

Module that facilitates handling of raw text corpora.

tmtoolkit.corpus.linebreaks_win2unix(text)

Convert Windows line breaks '\r\n' to Unix line breaks '\n'.

Parameters

text – text string

Returns

text string with Unix line breaks

tmtoolkit.corpus.paragraphs_from_lines(lines, splitchar='\n', break_on_num_newlines=2, force_unix_linebreaks=True)

Take a string of lines, split it into a list of lines using splitchar (or don’t split if splitchar evaluates to False) and then group the lines into individual paragraphs. A paragraph must be divided by at least break_on_num_newlines line breaks (empty lines) from another paragraph. Return a list of paragraphs, each paragraph containing a string of sentences.

Parameters
  • lines – either a string which will be split into lines by splitchar or a list of strings representing lines; in this case, set splitchar to None

  • splitchar – character used to split string lines into separate lines

  • break_on_num_newlines – threshold of consecutive line breaks for creating a new paragraph

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

Returns

list of paragraphs, each paragraph containing a string of sentences

tmtoolkit.corpus.path_recursive_split(path, base=None)

Split path path into its components:

path_recursive_split('a/simple/test.txt')
# ['a', 'simple', 'test.txt']
Parameters
  • path – a file path

  • base – path remainder (used for recursion)

Returns

components of the path as list

tmtoolkit.corpus.read_text_file(fpath, encoding, read_size=-1, force_unix_linebreaks=True)

Read the text file at path fpath with character encoding encoding and return it as string.

Parameters
  • fpath – path to file to read

  • encoding – character encoding

  • read_size – max. number of characters to read. -1 means read full file.

  • force_unix_linebreaks – if True, convert Windows linebreaks to Unix linebreaks

Returns

file content as string

tmtoolkit.preprocess

TMPreproc class for parallel text preprocessing

class tmtoolkit.preprocess.TMPreproc(docs, language=None, language_model=None, n_max_processes=None, stopwords=None, special_chars=None, enable_vectors=False, spacy_opts=None, loading_from_state=False)

TMPreproc implements a class for parallel text processing. The API implements a state machine, i.e. you create a TMPreproc instance with text documents and modify them by calling methods like “tokens_to_lowercase”, etc.
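
A minimal sketch of this state-machine style API; method chaining works because the processing methods return the instance, and language='en' assumes that the respective English spaCy language model is installed:

from tmtoolkit.preprocess import TMPreproc

docs = {'doc1': 'This is a test document.',
        'doc2': 'This is another test, test, test.'}

preproc = TMPreproc(docs, language='en')
preproc.pos_tag().lemmatize().tokens_to_lowercase().clean_tokens()

preproc.tokens      # dict mapping document labels to processed tokens
preproc.get_dtm()   # sparse document-term matrix built from the processed tokens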

__init__(docs, language=None, language_model=None, n_max_processes=None, stopwords=None, special_chars=None, enable_vectors=False, spacy_opts=None, loading_from_state=False)

Create a parallel text processing instance by passing a dictionary of raw texts docs with document label to document text mapping. You can pass a Corpus instance because it implements the dictionary methods.

TMPreproc will start n_max_processes sub-processes and distribute the documents on them for parallel processing.

Parameters
  • docs – documents dictionary (“corpus”) with document label to document text mapping

  • language – documents language used for language-dependent methods such as POS tagging or lemmatization

  • n_max_processes – max. number of sub-processes for parallel processing; uses the number of CPUs on the current machine if None is passed

  • stopwords – provide manual stopword list or use default stopword list for given language

  • special_chars – provide manual special characters list or use the default list from string.punctuation

  • enable_vectors – if True, enable word vectors (aka word embeddings) by loading the appropriate models; this will be more computationally expensive; note that you will have to install the respective medium or large spaCy language models beforehand

  • spacy_opts – keyword arguments passed to spaCy’s spacy.load() function

__del__()

Destructor. Shuts down all worker processes.

__copy__()

Copy a TMPreproc object including all its present state (tokens, meta data, etc.). Performs a deep copy.

Returns

deep copy of the current TMPreproc instance

__deepcopy__(memodict=None)

Copy a TMPreproc object including all its present state (tokens, meta data, etc.). Performs a deep copy.

Returns

deep copy of the current TMPreproc instance

add_metadata_per_doc(key, data, default=None)

Add a list of meta data values per document, where key is the meta data label and data is a dict that maps document labels to meta data values. The length of the values of each document must match the number of tokens for the respective document. If a document that exists in this instance is not part of data, the value default will be repeated len(document) times.

Suppose you have three documents named a, b, c with respective document lengths (i.e. numbers of tokens) 5, 3, 6. You want to add meta data labelled as token_category. You can do so by passing a dict with lists of values for each document:

preproc.add_metadata_per_doc('token_category', {
    'a': ['x', 'y', 'z', 'y', 'z'],
    'b': ['z', 'y', 'z'],
    'c': ['x', 'x', 'x', 'y', 'x', 'x'],
})
Parameters
  • key – meta data key, i.e. label as string

  • data – dict that maps document labels to meta data values

  • default – default value for documents not listed in data

Returns

this instance

add_metadata_per_token(key, data, default=None)

Add a meta data value per token match, where key is the meta data label and data is a dict that maps tokens to the respective meta data values. If a token existing in this instance is not listed in data, the value default is taken instead. Example:

preproc = TMPreproc(docs={'a': 'This is a test document.',
                          'b': 'This is another test, test, test.'})
preproc.tokens_datatable

Output (note that there’s no meta data column for the tokens):

    doc  position  token
--  ---  --------  --------
 0  a           0  This
 1  a           1  is
 2  a           2  a
 3  a           3  test
[...]

Now we add meta data with the key interesting, e.g. indicating which tokens we deem interesting. For every occurrence of the token “test”, this should be set to True (1). All other tokens by default get False (0):

preproc.add_metadata_per_token('interesting', {'test': True}, default=False)
preproc.tokens_datatable

New output with additional column meta_interesting:

    doc  position  token     meta_interesting
--  ---  --------  --------  ----------------
 0  a           0  This                     0
 1  a           1  is                       0
 2  a           2  a                        0
 3  a           3  test                     1
Parameters
  • key – meta data key, i.e. label as string

  • data – dict that maps tokens to the respective meta data values

  • default – default meta data value for tokens that do not appear in data

Returns

this instance

add_special_chars(special_chars)

Add more characters to the set of “special characters” used in remove_special_chars_in_tokens().

Parameters

special_chars – list, tuple or set of special characters

Returns

this instance

add_stopwords(stopwords)

Add more stop words to the set of stop words used in clean_tokens().

Parameters

stopwords – list, tuple or set of stop words

Returns

this instance

apply_custom_filter(filter_func, to_tokens_datatable=False)

Apply a custom filter function filter_func to all tokens or the tokens datatable. filter_func must accept a single parameter: a dictionary of structure {<doc_label>: <tokens list>} as from tokens if to_tokens_datatable is False, or a datatable Frame as from tokens_datatable. It must return a result with the same structure.

Parameters
  • filter_func – filter function to apply to all tokens or tokens dataframe

  • to_tokens_datatable – if True, pass datatable as from tokens_datatable to filter_func, otherwise pass dict as from tokens to filter_func

Warning

This function can only be run on a single process, hence it could be slow for large corpora.

clean_tokens(remove_punct=True, remove_stopwords=True, remove_empty=True, remove_shorter_than=None, remove_longer_than=None, remove_numbers=False)

Clean tokens by removing a certain, configurable subset of them.

Parameters
  • remove_punct – remove all tokens that intersect with punctuation tokens from punctuation

  • remove_stopwords – remove all tokens that intersect with stopword tokens from stopwords

  • remove_empty – remove all empty string "" tokens

  • remove_shorter_than – remove all tokens shorter than this length

  • remove_longer_than – remove all tokens longer than this length

  • remove_numbers – remove all tokens that are “numeric” according to the NumPy function numpy.char.isnumeric()

Returns

this instance

copy()

Copy a TMPreproc instance including all its present state (tokens, meta data, etc.). Performs a deep copy.

Returns

deep copy of the current TMPreproc instance

property doc_labels

Document labels as sorted list.

property doc_lengths

Document lengths as dict with mapping document label to document length (number of tokens in doc.).

property doc_vectors

A dict mapping document labels to document vectors.

property dtm

Generate and return a sparse document-term matrix of shape (n_docs, n_vocab) where n_docs is the number of documents and n_vocab is the vocabulary size.

expand_compound_tokens(split_chars=('-',), split_on_len=2, split_on_casechange=False)

Expand compound tokens like “US-Student” to “US” and “Student”. Use split_chars to determine possible split points and/or split on case changes (e.g. “USstudent”) by setting split_on_casechange to True. Each split sub-string must have a minimum length of split_on_len.

Warning

This will remove all information about POS tags, i.e. POS tagging has to be applied (again) after using this method.

Parameters
  • split_chars – possibly split on these characters

  • split_on_len – ensure that split sub-strings have at least this length

  • split_on_casechange – also split on case changes

Returns

this instance

filter_documents(search_tokens, by_meta=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_result=False, inverse_matches=False)

This method is similar to filter_tokens() but applies at the document level. For each document, the number of matches is counted. If it is at least matches_threshold, the document is retained, otherwise it is removed. If inverse_result is True, then documents that meet the threshold are removed.

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • by_meta – if not None, this should be a string of a meta data key; this meta data will then be used for matching instead of the tokens in docs

  • matches_threshold – the minimum number of matches required per document

  • match_type – the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse_result – inverse the threshold comparison result

  • inverse_matches – inverse the match results for filtering

Returns

this instance
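
For example, to keep only documents that mention a pattern often enough (a small made-up corpus for illustration):

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc({'a': 'test this test and test that',
                     'b': 'something about politics'}, language='en')

# keep only documents containing at least 3 tokens that match "test"
preproc.filter_documents('test', matches_threshold=3)

# the same call with inverse_result=True would instead remove those documents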

filter_documents_by_name(name_patterns, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter documents by their name (i.e. document label). Keep all documents whose name matches name_pattern according to additional matching options. If inverse is True, drop all those documents whose name matches, which is the same as calling remove_documents_by_name().

Parameters
  • name_patterns – either single search string or sequence of search strings

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse – invert the matching results

Returns

this instance

filter_for_pos(required_pos, simplify_pos=True, inverse=False)

Filter tokens for a specific POS tag (if required_pos is a string) or several POS tags (if required_pos is a list/tuple/set of strings). The POS tag depends on the tagset used during tagging. See https://spacy.io/api/annotation#pos-tagging for a general overview on POS tags in SpaCy and refer to the documentation of your language model for specific tags.

If simplify_pos is True, then the tags are matched to the following simplified forms:

  • 'N' for nouns

  • 'V' for verbs

  • 'ADJ' for adjectives

  • 'ADV' for adverbs

  • None for all others

Parameters
  • required_pos – single string or list of strings with POS tag(s) used for filtering

  • simplify_pos – before matching simplify POS tags in documents to forms shown above

  • inverse – inverse the matching results, i.e. remove tokens that match the POS tag

Returns

this instance

filter_tokens(search_tokens, by_meta=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter tokens according to search pattern(s) search_tokens and several matching options. Only those tokens are retained that match the search criteria unless you set inverse=True, which will remove all tokens that match the search criteria (which is the same as calling remove_tokens()).

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • by_meta – if not None, this should be a string of a meta data key; this meta data will then be used for matching instead of the tokens in docs

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

Returns

this instance

filter_tokens_by_mask(mask, inverse=False)

Filter tokens according to a binary mask specified by mask.

Parameters
  • mask – a dict containing a mask list for each document; each mask list contains boolean values for each token in that document, where True means keeping that token and False means removing it

  • inverse – inverse the mask for filtering, i.e. keep all tokens with a mask set to False and remove all those with True

Returns

this instance

filter_tokens_with_kwic(search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter tokens in docs according to Keywords-in-Context (KWIC) context window of size context_size around search_tokens.

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

Returns

this instance

classmethod from_state(file_or_stateobj, **init_kwargs)

Create a new TMPreproc instance either by loading a pickled state from disk as saved with save_state() or by loading a state object directly.

Parameters
  • file_or_stateobj – either path to a pickled file as saved with save_state() or a state object

  • init_kwargs – arguments passed to __init__()

Returns

new instance as restored from the passed file / object

classmethod from_tokens(tokens, **init_kwargs)

Create a new TMPreproc instance by loading tokens in the same format as they are returned by tokens or tokens_with_metadata, i.e. as dict with mapping: document label -> document tokens array or document data frame.

Note

You must specify either language or language_model as additional arguments in init_kwargs.

Parameters
  • tokens – dict of tokens in the format described above

  • init_kwargs – arguments passed to __init__()

Returns

new instance with passed tokens

classmethod from_tokens_datatable(tokensdf, **init_kwargs)

Create a new TMPreproc instance by loading the tokens datatable tokensdf in the same format as it is returned by tokens_datatable(), i.e. as data frame with hierarchical indices “doc” and “position” and at least a column “token” plus optional columns like “pos”, “lemma”, “meta_…”, etc.

Note

You must specify either language or language_model as additional arguments in init_kwargs.

Parameters
  • tokensdf – tokens datatable Frame object in the format described above

  • init_kwargs – arguments passed to __init__()

Returns

new instance with passed tokens

generate_ngrams(n)

Generate n-grams of length n. They are then available in the ngrams property.

You may afterwards use join_ngrams() to join the generated n-grams to a single token and use these as new tokens in this TMPreproc instance.

Parameters

n – length of n-grams, must be >= 2

Returns

this instance

get_available_metadata_keys()

Return set of available meta data keys, e.g. “pos” for POS tags if pos_tag() was called before.

Returns

set of available meta data keys

get_dtm(as_datatable=False, as_dataframe=False, dtype=None)

Generate a sparse document-term-matrix (DTM) for the current tokens with rows representing documents according to doc_labels and columns representing tokens according to vocabulary.

Parameters
  • as_datatable – Return result as datatable with document labels in ‘_doc’ column and vocabulary as column names

  • as_dataframe – Return result as pandas dataframe with document labels in index and vocabulary as column names

  • dtype – optionally specify a DTM data type; by default it is 32bit integer

Returns

either a sparse document-term-matrix (in CSR format) or a datatable or a pandas DataFrame
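
A quick sketch of the three output variants (the toy document is made up):

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc({'a': 'This is a test document.'}, language='en')

preproc.get_dtm()                     # sparse document-term matrix in CSR format
preproc.get_dtm(as_datatable=True)    # datatable Frame with document labels in the "_doc" column
preproc.get_dtm(as_dataframe=True)    # pandas DataFrame with document labels as index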

get_kwic(search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False, with_metadata=False, as_datatable=False, non_empty=False, glue=None, highlight_keyword=None)

Perform keyword-in-context (kwic) search for search_tokens. Uses similar search parameters as filter_tokens().

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • with_metadata – Also return metadata (like POS) along with each token.

  • as_datatable – Return result as data frame with indices “doc” (document label) and “context” (context ID per document) and optionally “position” (original token position in the document) if tokens are not glued via glue parameter.

  • non_empty – If True, only return non-empty result documents.

  • glue – If not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword – If not None, this must be a string which is used to indicate the start and end of the matched keyword.

Returns

Return dict with document label -> kwic for document mapping or a data frame, depending on as_datatable.

get_kwic_table(search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False, glue=' ', highlight_keyword='*')

Shortcut for get_kwic() to directly return a data frame table with highlighted keywords in context.

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • glue – If not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword – If not None, this must be a string which is used to indicate the start and end of the matched keyword.

Returns

Data frame with indices “doc” (document label) and “context” (context ID per document) and column “kwic” containing strings with highlighted keywords in context.
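
A small sketch; the documents are made up and the output layout follows the description above:

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc({'a': 'This is a test document.',
                     'b': 'This is another test, test, test.'}, language='en')

preproc.get_kwic_table('test', context_size=2)
# -> data frame with a "kwic" column; each match is shown with two tokens of
#    context on each side and highlighted like "*test*" (the default highlight)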

get_ngrams(non_empty=False)

Return generated n-grams as dict with mapping document labels to document n-grams list. Each list of n-grams (i.e. each document) in turn contains lists of size n (i.e. two if you generated bigrams).

Requires that n-grams have been generated with generate_ngrams() before.

Parameters

non_empty – remove empty documents from the result set

Returns

dict mapping document labels to document n-grams list

get_tokens(non_empty=False, with_metadata=True, as_datatables=False, arrays_to_lists=True)

Return document tokens as dict with mapping document labels to document tokens. The format of the tokens depends on the passed arguments: If as_datatables is True, each document is a datatable with at least the column "token" and optional "lemma", "pos" and "meta_..." columns if with_metadata is True.

If as_datatables is False, the result documents are either plain lists of tokens if with_metadata is False, or they’re dicts of lists with keys "token" and optional "lemma", "pos" and "meta_..." keys.

Parameters
  • non_empty – remove empty documents from the result set

  • with_metadata – add meta data to results (e.g. POS tags)

  • as_datatables – return results as dict of datatables (if package datatable is installed) or pandas DataFrames

  • arrays_to_lists – if True, convert NumPy character arrays to plain Python lists (only applies when as_datatables is False)

Returns

dict mapping document labels to document tokens

get_vocabulary(sort=True)

Return the vocabulary, i.e. the list of unique words across all documents, as (sorted) list.

Parameters

sort – if True, sort the vocabulary alphabetically

Returns

list of tokens in the vocabulary

glue_tokens(patterns, glue='_', match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Match N subsequent tokens to the N patterns in patterns using match options like in filter_tokens(). Join the matched tokens by glue string glue. Replace these tokens in the documents. Returns a set of all joined tokens.

Warning

This will remove all information about POS tags and other token metadata.

Parameters
  • patterns – a sequence of search patterns as accepted by filter_tokens()

  • glue – string for joining the subsequent matches

  • match_type – one of: 'exact', 'regex', 'glob'. If 'regex', patterns must be list of RE patterns. If 'glob', patterns must be list of a “glob” patterns like ["hello", "w*"] (see https://github.com/metagriffin/globre).

  • ignore_case – if True, ignore case for matching

  • glob_method – if match_type is ‘glob’, use this glob method; must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • inverse – invert the matching results

Returns

set of all joined tokens
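
A small sketch; the document text is made up:

from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc({'a': 'The United States of America.'}, language='en')
preproc.glue_tokens(['United', 'States'], glue='_')
# -> {'United_States'}; the two-token sequence now appears as a single token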

join_ngrams(join_str=' ')

Use the generated n-grams as tokens by joining them via join_str. After this operation, the joined n-grams are available as tokens but the original n-grams will be removed and ngrams_generated is reset to False.

Requires that n-grams have been generated with generate_ngrams() before.

Parameters

join_str – string used to “glue” the n-grams together

Returns

this instance

lemmatize()

Lemmatize tokens, i.e. set the lemmata as tokens so that all further processing will happen using the lemmatized tokens.

Returns

this instance

load_state(file_or_stateobj)

Restore a state by loading either a pickled state from disk as saved with save_state() or by loading a state object directly.

Parameters

file_or_stateobj – either path to a pickled file as saved with save_state() or a state object

Returns

this instance as restored from the passed file / object

load_tokens(tokens)

Load tokens tokens into TMPreproc in the same format as they are returned by tokens or tokens_with_metadata, i.e. as dict with mapping: document label -> document tokens array or document data frame.

Parameters

tokens – dict of tokens as returned by tokens or tokens_with_metadata

Returns

this instance

load_tokens_datatable(tokendf)

Load the tokens datatable tokendf into TMPreproc in the same format as it is returned by tokens_datatable, i.e. as a datatable Frame with indices “doc” and “position” and at least a column “token” plus optional columns like “pos”, “lemma”, “meta_…”, etc.

Parameters

tokendf – tokens datatable Frame object as returned by tokens_datatable

Returns

this instance

property n_docs

Number of documents.

property n_tokens

Number of tokens in all documents (sum of document lengths).

property ngrams

Generated n-grams as dict with mapping document label to list of n-grams. Each list of n-grams (i.e. each document) in turn contains lists of size n (i.e. two if you generated bigrams).

property ngrams_generated

Indicates if n-grams were generated before (True if yes, else False).

pos_tag()

Apply Part-of-Speech (POS) tagging to all documents. POS tags can then be retrieved via the tokens_with_metadata, tokens_datatable or tokens_with_pos_tags properties, or the get_tokens() method.

The meanings of the POS tags are described in the spaCy documentation.

Returns

this instance

property pos_tagged

Indicates if documents were POS tagged. True if yes, otherwise False.

print_summary(max_documents=None, max_tokens_string_length=None)

Print a summary of this object, i.e. the first tokens of each document and some summary statistics.

Parameters
  • max_documents – maximum number of documents to print; None uses default value 10; set to -1 to print all documents

  • max_tokens_string_length – maximum string length of concatenated tokens for each document; None uses default value 50; set to -1 to print complete documents

Returns

this instance

remove_chars_in_tokens(chars)

Remove all characters listed in chars from all tokens.

Parameters

chars – list of characters to remove

Returns

this instance

remove_common_tokens(df_threshold, absolute=False)

Remove tokens with document frequency greater than or equal to df_threshold.

Parameters
  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. the number of documents in which a token occurs at least once), otherwise use relative document frequency (normalized by the number of documents)

Returns

this instance

remove_documents(search_tokens, by_meta=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_matches=False)

This is a shortcut for the filter_documents method with inverse_result=True, i.e. remove all documents that meet the token matching threshold.

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • by_meta – if not None, this should be a string of a meta data key; this meta data will then be used for matching instead of the tokens in docs

  • matches_threshold – the minimum number of matches required per document

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse_matches – inverse the match results for filtering

Returns

this instance

remove_documents_by_name(name_patterns, match_type='exact', ignore_case=False, glob_method='match')

Same as filter_documents_by_name() with inverse=True: drop all documents whose name matches.

Parameters
  • name_patterns – either single search string or sequence of search strings

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

Returns

this instance

remove_metadata(key)

Remove meta data information previously added by pos_tag() or add_metadata_per_token()/add_metadata_per_doc() and identified by meta data key key.

Parameters

key – meta data key, i.e. label as string

Returns

this instance

remove_special_chars_in_tokens()

Remove everything that is deemed a “special character”, i.e. everything in special_chars, from all tokens. By default, this will remove all characters listed in string.punctuation.

Returns

this instance

remove_tokens(search_tokens, by_meta=None, match_type='exact', ignore_case=False, glob_method='match')

This is a shortcut for the filter_tokens() method with inverse=True, i.e. remove all tokens that match the search criteria.

Parameters
  • search_tokens – single string or list of strings that specify the search pattern(s)

  • by_meta – if not None, this should be a string of a meta data key; this meta data will then be used for matching instead of the tokens in docs

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

Returns

this instance

remove_tokens_by_doc_frequency(which, df_threshold, absolute=False)

Remove tokens according to their document frequency.

Parameters
  • which – which threshold comparison to use: either 'common', '>', '>=' which means that tokens with higher document freq. than (or equal to) df_threshold will be removed; or 'uncommon', '<', '<=' which means that tokens with lower document freq. than (or equal to) df_threshold will be removed

  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. number of times token X occurs at least once in a document), otherwise use relative document frequency (normalized by number of documents)

Returns

this instance
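
A usage sketch, assuming preproc is an existing TMPreproc instance:

# drop tokens that occur in more than 90% of the documents (relative document frequency)
preproc.remove_tokens_by_doc_frequency('>', 0.9)
# drop tokens that occur in fewer than 2 documents (absolute document frequency)
preproc.remove_tokens_by_doc_frequency('<', 2, absolute=True)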

remove_tokens_by_mask(mask)

Remove tokens according to a binary mask specified by mask.

Parameters

mask – a dict containing a mask list for each document; each mask list contains boolean values for each token in that document, where False means keeping that token and True means removing it

Returns

this instance

remove_uncommon_tokens(df_threshold, absolute=False)

Remove tokens with document frequency less than or equal to df_threshold.

Parameters
  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. number of times token X occurs at least once in a document), otherwise use relative document frequency (normalized by number of documents)

Returns

this instance

save_state(picklefile)

Save the current state of this TMPreproc instance to disk, i.e. to the pickle file picklefile. The state can be restored from this file using load_state() or class method from_state().

Parameters

picklefile – disk file to store the state to

Returns

this instance

shutdown_workers(force=False)

Manually send the shutdown signal to all worker processes.

Normally you don’t need to call this manually as the worker processes are killed automatically when the TMPreproc instance is removed. However, if you need to free resources immediately, you can use this method as it is also used in the tests.

property spacy_docs

Documents as dict mapping document labels to spaCy document objects.

property texts

Document texts as dict mapping document labels to document content strings.

property token_vectors

A dict mapping document labels to document’s token vector matrix. Each row in this matrix represents a token vector (word embeddings).

property tokens

Document tokens as dict with mapping document label to list of tokens.

property tokens_dataframe

Tokens and metadata as pandas DataFrame with indices “doc” (document label) and “position” (token position in the document) and columns “token” plus optional meta data columns.

property tokens_datatable

Tokens and metadata as datatable (if datatable package is installed) or pandas DataFrame. Result has columns “doc” (document label), “position” (token position in the document), “token” and optional meta data columns.

tokens_to_lowercase()

Convert all tokens to lower-case form.

Returns

this instance

property tokens_with_metadata

Document tokens with metadata (e.g. POS tag) as dict with mapping document label to datatable.

property tokens_with_pos_tags

Document tokens with POS tag as dict with mapping document label to datatable. The datatables have two columns, token and pos.

transform_tokens(transform_fn, process_on_workers=False)

Transform tokens in all documents by applying transform_fn to each document’s tokens individually.

If transform_fn is “picklable” (e.g. pickle.dumps(transform_fn) doesn’t raise an exception), you may try to set process_on_workers to True which will apply transform_fn in parallel on the worker processes. However, there’s no guarantee that this will work and you may get an AttributeError exception in “_ForkingPickler”.

By default the function is applied to the documents sequentially, which may be very slow.

Parameters
  • transform_fn – a function to apply to all documents’ tokens; it must accept a single token string and return a single (transformed) token string

  • process_on_workers – if True, apply transform_fn in parallel on the worker processes

Returns

this instance
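
A minimal sketch, assuming preproc is an existing TMPreproc instance; the function is applied to every token sequentially because process_on_workers defaults to False:

# replace a character in every token of every document
preproc.transform_tokens(lambda t: t.replace('ß', 'ss'))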

property vocabulary

Corpus vocabulary, i.e. sorted list of all tokens that occur across all documents.

property vocabulary_abs_doc_frequency

Absolute document frequency per vocabulary token as dict with token to document frequency mapping. Document frequency is the measure of how often a token occurs at least once in a document. Example:

doc tokens
--- ------
A   z, z, w, x
B   y, z, y
C   z, z, y, z

document frequency df(z) = 3  (occurs in all 3 documents)
df(x) = df(w) = 1 (occur only in A)
df(y) = 2 (occurs in B and C)
...

property vocabulary_counts

collections.Counter() instance of vocabulary containing counts of occurrences of tokens across all documents.

property vocabulary_rel_doc_frequency

Same as vocabulary_abs_doc_frequency but normalized by number of documents, i.e. relative document frequency.

property vocabulary_size

Number of unique tokens across all documents.

Functional Preprocessing API

tmtoolkit.preprocess.clean_tokens(docs, remove_punct=True, remove_stopwords=True, remove_empty=True, remove_shorter_than=None, remove_longer_than=None, remove_numbers=False, nlp_instance=None, language=None)

Apply several token cleaning steps to documents docs, depending on the given parameters.

Parameters
  • docs – list of string tokens or spaCy documents

  • remove_punct – if True, remove all tokens marked as is_punct by spaCy if docs are spaCy documents, otherwise remove tokens that match the characters listed in string.punctuation; if arg is a list, tuple or set, remove all tokens listed in this arg from the documents; if False do not apply punctuation token removal

  • remove_stopwords – if True, remove stop words for the given language as loaded via ~tmtoolkit.preprocess.load_stopwords ; if arg is a list, tuple or set, remove all tokens listed in this arg from the documents; if False do not apply stop word token removal

  • remove_empty – if True, remove empty strings "" from documents

  • remove_shorter_than – if given a positive number, remove tokens that are shorter than this number

  • remove_longer_than – if given a positive number, remove tokens that are longer than this number

  • remove_numbers – if True, remove all tokens that are deemed numeric by np.char.isnumeric()

  • nlp_instance – spaCy nlp instance

  • language – language for stop word removal

Returns

list of string tokens or spaCy documents, depending on docs
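
A minimal usage sketch with hypothetical token lists; passing language is needed for stop word removal when docs are plain token lists:

from tmtoolkit.preprocess import clean_tokens

docs = [['This', 'is', 'document', '1', '!', ''],
        ['and', 'a', 'second', 'document']]
cleaned = clean_tokens(docs, remove_numbers=True, language='en')
# punctuation, stop words, empty strings and numeric tokens are removed from each document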

tmtoolkit.preprocess.compact_documents(docs)

Compact documents docs by recreating new documents using the previously applied filters.

Parameters

docs – list of spaCy documents

Returns

list with compact spaCy documents

tmtoolkit.preprocess.doc_frequencies(docs, proportions=False)

Document frequency per vocabulary token as dict with token to document frequency mapping. Document frequency is the measure of how often a token occurs at least once in a document. Example with absolute document frequencies:

doc tokens
--- ------
A   z, z, w, x
B   y, z, y
C   z, z, y, z

document frequency df(z) = 3  (occurs in all 3 documents)
df(x) = df(w) = 1 (occur only in A)
df(y) = 2 (occurs in B and C)
...

Parameters
  • docs – list of string tokens or spaCy documents

  • proportions – if True, normalize by number of documents to obtain proportions

Returns

dict mapping token to document frequency
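
A usage sketch mirroring the example above (exact dict ordering may differ):

from tmtoolkit.preprocess import doc_frequencies

docs = [['z', 'z', 'w', 'x'],   # doc A
        ['y', 'z', 'y'],        # doc B
        ['z', 'z', 'y', 'z']]   # doc C
doc_frequencies(docs)
# {'z': 3, 'y': 2, 'w': 1, 'x': 1}
doc_frequencies(docs, proportions=True)
# approximately {'z': 1.0, 'y': 0.67, 'w': 0.33, 'x': 0.33}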

tmtoolkit.preprocess.doc_labels(docs)

Return list of document labels that are assigned to spaCy documents docs.

Parameters

docs – list of spaCy documents

Returns

list of document labels

tmtoolkit.preprocess.doc_lengths(docs)

Return the document length (i.e. number of tokens) for each document.

Parameters

docs – list of string tokens or spaCy documents

Returns

list of document lengths

tmtoolkit.preprocess.doc_tokens(docs, to_lists=False)

If docs is a list of spaCy documents, return the (potentially filtered) tokens from these documents as list of string tokens, otherwise return the input list as-is.

Parameters
  • docs – list of string tokens or spaCy documents

  • to_lists – if docs is list of spaCy documents or list of NumPy arrays, convert result to lists

Returns

list of string tokens as NumPy arrays (default) or lists (if to_lists is True)

tmtoolkit.preprocess.expand_compound_token(t, split_chars=('-',), split_on_len=2, split_on_casechange=False)

Expand a token t if it is a compound word, e.g. splitting token “US-Student” into two tokens “US” and “Student”.

See also

expand_compounds() which operates on token documents

Parameters
  • t – string token

  • split_chars – characters to split on

  • split_on_len – minimum length of a result token when considering splitting (e.g. when split_on_len=2 “e-mail” would not be split into “e” and “mail”)

  • split_on_casechange – use case change to split tokens, e.g. “CamelCase” would become “Camel”, “Case”

Returns

list with split sub-tokens or single original token, i.e. [t]
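
A sketch of the expected behavior per the parameter descriptions above:

from tmtoolkit.preprocess import expand_compound_token

expand_compound_token('US-Student')
# ['US', 'Student']
expand_compound_token('e-mail')
# ['e-mail']   (not split because of the default split_on_len=2)
expand_compound_token('CamelCase', split_on_casechange=True)
# ['Camel', 'Case']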

tmtoolkit.preprocess.expand_compounds(docs, split_chars=('-',), split_on_len=2, split_on_casechange=False)

Expand all compound tokens in documents docs, e.g. splitting token “US-Student” into two tokens “US” and “Student”.

Parameters
  • docs – list of string tokens or spaCy documents

  • split_chars – characters to split on

  • split_on_len – minimum length of a result token when considering splitting (e.g. when split_on_len=2 “e-mail” would not be split into “e” and “mail”)

  • split_on_casechange – use case change to split tokens, e.g. “CamelCase” would become “Camel”, “Case”

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.filter_documents(docs, search_tokens, by_meta=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_result=False, inverse_matches=False)

This function is similar to filter_tokens() but applies at document level. For each document, the number of matches is counted. If it is at least matches_threshold the document is retained, otherwise removed. If inverse_result is True, then documents that meet the threshold are removed.

Parameters
  • docs – list of string tokens or spaCy documents

  • search_tokens – typically a single string or non-empty list of strings that specify the search pattern(s); when matching against meta data via by_meta, may also be of any other type

  • by_meta – if not None, this should be a string of a token meta data attribute; this meta data will then be used for matching instead of the tokens in docs

  • matches_threshold – the minimum number of matches required per document

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse_result – inverse the threshold comparison result

  • inverse_matches – inverse the match results for filtering

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.filter_documents_by_name(docs, name_patterns, labels=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter documents by their name (i.e. document label). Keep all documents whose name matches name_patterns according to additional matching options. If inverse is True, drop all those documents whose name matches, which is the same as calling remove_documents_by_name().

Parameters
  • docs – list of string tokens or spaCy documents

  • name_patterns – either a single search string or a sequence of search strings

  • labels – if docs is not a list of spaCy documents, you must pass the document labels as list of strings

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – invert the matching results

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.filter_for_pos(docs, required_pos, simplify_pos=True, pos_attrib='pos_', tagset='ud', inverse=False)

Filter tokens for a specific POS tag (if required_pos is a string) or several POS tags (if required_pos is a list/tuple/set of strings). The POS tag depends on the tagset used during tagging. See https://spacy.io/api/annotation#pos-tagging for a general overview of POS tags in spaCy and refer to the documentation of your language model for specific tags.

If simplify_pos is True, then the tags are matched to the following simplified forms:

  • 'N' for nouns

  • 'V' for verbs

  • 'ADJ' for adjectives

  • 'ADV' for adverbs

  • None for all other

Parameters
  • docs – list of spaCy documents

  • required_pos – single string or list of strings with POS tag(s) used for filtering

  • simplify_pos – before matching simplify POS tags in documents to forms shown above

  • pos_attrib – token attribute name for POS tags

  • tagset – POS tagset used while tagging; necessary for simplifying POS tags when simplify_pos is True

  • inverse – inverse the matching results, i.e. remove tokens that match the POS tag

Returns

filtered list of spaCy documents

tmtoolkit.preprocess.filter_tokens(docs, search_tokens, by_meta=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter tokens in docs according to search pattern(s) search_tokens and several matching options. Only those tokens are retained that match the search criteria unless you set inverse=True, which will remove all tokens that match the search criteria (which is the same as calling remove_tokens()).

Parameters
  • docs – list of string tokens or spaCy documents

  • search_tokens – typically a single string or non-empty list of strings that specify the search pattern(s); when matching against meta data via by_meta, may also be of any other type

  • by_meta – if not None, this should be a string of a token meta data attribute; this meta data will then be used for matching instead of the tokens in docs

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

Returns

list of string tokens or spaCy documents, depending on docs
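
A minimal sketch with hypothetical token lists (results may be returned as NumPy arrays instead of plain lists):

from tmtoolkit.preprocess import filter_tokens

docs = [['politics', 'and', 'economics'],
        ['political', 'science']]
filter_tokens(docs, 'politic*', match_type='glob')
# roughly [['politics'], ['political']]
filter_tokens(docs, 'politic*', match_type='glob', inverse=True)
# roughly [['and', 'economics'], ['science']]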

tmtoolkit.preprocess.filter_tokens_by_mask(docs, mask, inverse=False)

Filter tokens in docs according to a binary mask specified by mask.

Parameters
  • docs – list of string tokens or spaCy documents

  • mask – a list containing a mask list for each document in docs; each mask list contains boolean values for each token in that document, where True means keeping that token and False means removing it;

  • inverse – inverse the mask for filtering, i.e. keep all tokens with a mask set to False and remove all those with True

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.filter_tokens_with_kwic(docs, search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Filter tokens in docs according to a Keywords-in-Context (KWIC) context window of size context_size around search_tokens. Works similarly to kwic(), but returns the result as a list of tokenized documents, i.e. in the same structure as docs, whereas kwic() returns the result as a list of KWIC windows into docs.

See also

kwic()

Parameters
  • docs – list of string tokens or spaCy documents

  • search_tokens – typically a single string or non-empty list of strings that specify the search pattern(s); when matching against meta data via by_meta, may also be of any other type

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.glue_tokens(docs, patterns, glue='_', match_type='exact', ignore_case=False, glob_method='match', inverse=False, return_glued_tokens=False)

Match N subsequent tokens to the N patterns in patterns using match options like in filter_tokens(). Join the matched tokens by glue string glue. Replace these tokens in the documents.

If there is metadata, the respective entries for the joined tokens are set to None.

Note

If docs is a list of spaCy documents, this modifies the documents in docs in place.

Parameters
  • docs – list of string tokens or spaCy documents

  • patterns – a sequence of search patterns as accepted by filter_tokens()

  • glue – string for joining the subsequent matches

  • match_type – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, search_token must be RE pattern; if glob, search_token must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)

  • ignore_case – if True, ignore case for matching

  • glob_method – if match_type is ‘glob’, use this glob method; must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – invert the matching results

  • return_glued_tokens – if True, additionally return a set of tokens that were glued

Returns

updated documents docs if docs is a list of spaCy documents or otherwise a list of string token documents; if return_glued_tokens is True, return 2-tuple with additional set of tokens that were glued
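
A minimal sketch with a hypothetical token list:

from tmtoolkit.preprocess import glue_tokens

docs = [['New', 'York', 'is', 'in', 'New', 'York', 'State']]
glue_tokens(docs, ['New', 'York'], glue='_')
# roughly [['New_York', 'is', 'in', 'New_York', 'State']]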

tmtoolkit.preprocess.ids2tokens(vocab, tokids)

Convert the list of numeric token ID arrays tokids back to string tokens with the help of the spaCy vocabulary vocab. Returns the result as a list of spaCy documents.

See also

tokens2ids() which reverses this operation.

Parameters
  • vocab – spaCy vocabulary

  • tokids – list of numeric token ID arrays as from tokens2ids()

Returns

list of spaCy documents

tmtoolkit.preprocess.init_for_language(language=None, language_model=None, **spacy_opts)

Initialize the functional API for a given language code language or a spaCy language model language_model. The spaCy nlp instance will be returned and will also be used by default in all subsequent preprocess API calls.

Parameters
  • language – two-letter ISO 639-1 language code (lowercase)

  • language_model – name of a spaCy language model to load (alternative to language)

  • spacy_opts – additional keyword arguments passed to spacy.load()

Returns

spaCy nlp instance
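
A minimal sketch; an English spaCy language model must be installed for this to work:

from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')                                     # two-letter ISO 639-1 language code
docs = tokenize(['Hello world.', 'And another document.'])
# list of spaCy Doc objects with default labels "doc-1" and "doc-2"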

tmtoolkit.preprocess.kwic(docs, search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False, with_metadata=False, as_dict=False, as_datatable=False, non_empty=False, glue=None, highlight_keyword=None)

Perform keyword-in-context (kwic) search for search pattern(s) search_tokens. Returns result as list of KWIC windows or datatable / dataframe. If you want to filter with KWIC, use filter_tokens_with_kwic(), which returns results as list of tokenized documents (same structure as docs).

Uses similar search parameters as filter_tokens().

Parameters
  • docs – list of string tokens or spaCy documents

  • search_tokens – single string or list of strings that specify the search pattern(s)

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case – ignore character case (applies to all three match types)

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search())

  • inverse – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • with_metadata – also return metadata (like POS) along with each token

  • as_dict – if True, return result as dict with document labels mapping to KWIC results

  • as_datatable – return result as data frame with indices “doc” (document label) and “context” (context ID per document) and optionally “position” (original token position in the document) if tokens are not glued via glue parameter

  • non_empty – if True, only return non-empty result documents

  • glue – if not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword – if not None, this must be a string which is used to indicate the start and end of the matched keyword

Returns

return either as: (1) list with KWIC results per document, (2) as dict with document labels mapping to KWIC results when as_dict is True or (3) dataframe / datatable when as_datatable is True
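
A minimal sketch with a hypothetical token list and context_size=1:

from tmtoolkit.preprocess import kwic

docs = [['hello', 'world', 'and', 'hello', 'again']]
kwic(docs, 'hello', context_size=1)
# roughly [[['hello', 'world'], ['and', 'hello', 'again']]]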

tmtoolkit.preprocess.kwic_table(docs, search_tokens, context_size=2, match_type='exact', ignore_case=False, glob_method='match', inverse=False, glue=' ', highlight_keyword='*')

Shortcut for kwic() to directly return a data frame table with highlighted keywords in context.

Parameters
  • docs – list of string tokens or spaCy documents

  • search_tokens – single string or list of strings that specify the search pattern(s)

  • context_size – either scalar int or tuple (left, right) – number of surrounding words in keyword context. if scalar, then it is a symmetric surrounding, otherwise can be asymmetric.

  • match_type – One of: ‘exact’, ‘regex’, ‘glob’. If ‘regex’, search_token must be RE pattern. If glob, search_token must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre).

  • ignore_case – If True, ignore case for matching.

  • glob_method – If match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match() or re.search()).

  • inverse – Invert the matching results.

  • glue – If not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword – If not None, this must be a string which is used to indicate the start and end of the matched keyword.

Returns

datatable or pandas DataFrame with columns “doc” (document label), “context” (context ID per document) and “kwic” containing strings with highlighted keywords in context.

tmtoolkit.preprocess.lemmatize(docs, lemma_attrib='lemma_')

Lemmatize documents docs by fetching each token’s lemma from the spaCy token attribute given by lemma_attrib.

Parameters
  • docs – list of spaCy documents

  • lemma_attrib – spaCy token attribute from which to fetch the lemmata; "lemma_" gives lemmata as strings, "lemma" gives lemmata as integer token IDs

Returns

list of string lists with lemmata for each document

tmtoolkit.preprocess.load_stopwords(language)

Load stopwords for language code language.

Parameters

language – two-letter ISO 639-1 language code

Returns

list of stopword strings or None if loading failed

tmtoolkit.preprocess.make_index_window_around_matches(matches, left, right, flatten=False, remove_overlaps=True)

Take a boolean 1D vector matches of length N and generate an array of indices, where each occurrence of a True value in the boolean vector at index i generates a sequence of the form:

[i-left, i-left+1, ..., i, ..., i+right-1, i+right]

If flatten is True, then a flattened NumPy 1D array is returned. Otherwise, a list of NumPy arrays is returned, where each array contains the window indices.

remove_overlaps is only applied when flatten is True.

Example with left=1 and right=1, flatten=False:

input:
#   0      1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*):
[[*0*, *1*], [*0*, *1*, 2], [3, *4*, 5], [7, *8*]]

Example with left=1 and right=1, flatten=True, remove_overlaps=True:

input:
#   0      1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*, other values belong to the respective "windows"):
[*0*, *1*, 2, 3, *4*, 5, 7, *8*]

tmtoolkit.preprocess.ngrams(docs, n, join=True, join_str=' ')

Generate and return n-grams of length n.

Parameters
  • docs – list of string tokens or spaCy documents

  • n – length of n-grams, must be >= 2

  • join – if True, join generated n-grams by string join_str

  • join_str – string used for joining

Returns

list of n-grams; if join is True, the list contains strings of joined n-grams, otherwise the list contains lists of size n in turn containing the strings that make up the n-gram
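
A minimal sketch with a hypothetical token list:

from tmtoolkit.preprocess import ngrams

docs = [['the', 'quick', 'brown', 'fox']]
ngrams(docs, 2)
# [['the quick', 'quick brown', 'brown fox']]
ngrams(docs, 2, join=False)
# [[['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]]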

tmtoolkit.preprocess.pos_tag(docs, tagger=None, nlp_instance=None)

Apply Part-of-Speech (POS) tagging to all documents.

The meanings of the POS tags are described in the spaCy documentation.

Note

This function only applies POS tagging to the documents but doesn’t retrieve the tags. If you want to retrieve the tags, you may use pos_tags().

Note

This function modifies the documents in docs in place and adds/modifies a pos_ attribute in each token.

Parameters
  • docs – list of spaCy documents

  • tagger – POS tagger instance to use; by default, use the tagger for the currently loaded spaCy nlp instance

  • nlp_instance – spaCy nlp instance

Returns

input spaCy documents docs with in-place modified documents

tmtoolkit.preprocess.pos_tags(docs, tag_attrib='pos_', tagger=None, nlp_instance=None)

Return Part-of-Speech (POS) tags of docs. If POS tagging was not applied to docs yet, this function runs pos_tag() first.

Parameters
  • docs – list of spaCy documents

  • tag_attrib – spaCy document tag attribute to fetch the POS tag; "pos_" and "pos" give coarse POS tags as string or integer tags respectively, "tag_" and "tag" give fine grained POS tags as string or integer tags

  • tagger – POS tagger instance to use; by default, use the tagger for the currently loaded spaCy nlp instance

  • nlp_instance – spaCy nlp instance

Returns

POS tags of docs as list of strings or integers depending on tag_attrib

tmtoolkit.preprocess.remove_chars(docs, chars)

Remove all characters listed in chars from all tokens.

Parameters
  • docs – list of string tokens or spaCy documents

  • chars – list of characters to remove

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.remove_common_tokens(docs, df_threshold=0.95, absolute=False)

Shortcut for remove_tokens_by_doc_frequency() for removing tokens above a certain document frequency.

Parameters
  • docs – list of string tokens or spaCy documents

  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. number of times token X occurs at least once in a document), otherwise use relative document frequency (normalized by number of documents)

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.remove_documents(docs, search_tokens, by_meta=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_matches=False)

Same as filter_documents() but with inverse=True.

tmtoolkit.preprocess.remove_documents_by_name(docs, name_patterns, labels=None, match_type='exact', ignore_case=False, glob_method='match')

Same as filter_documents_by_name() but with inverse=True.

tmtoolkit.preprocess.remove_tokens(docs, search_tokens, by_meta=None, match_type='exact', ignore_case=False, glob_method='match')

Same as filter_tokens() but with inverse=True.

tmtoolkit.preprocess.remove_tokens_by_doc_frequency(docs, which, df_threshold, absolute=False, return_blacklist=False, return_mask=False)

Remove tokens according to their document frequency.

Parameters
  • docs – list of string tokens or spaCy documents

  • which – which threshold comparison to use: either 'common', '>', '>=' which means that tokens with higher document freq. than (or equal to) df_threshold will be removed; or 'uncommon', '<', '<=' which means that tokens with lower document freq. than (or equal to) df_threshold will be removed

  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. number of times token X occurs at least once in a document), otherwise use relative document frequency (normalized by number of documents)

  • return_blacklist – if True return a list of tokens that should be removed instead of the filtered tokens

  • return_mask – if True return a list of token masks where each occurrence of True signals a token to be removed

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.remove_tokens_by_mask(docs, mask)

Same as filter_tokens_by_mask() but with inverse=True.

tmtoolkit.preprocess.remove_uncommon_tokens(docs, df_threshold=0.05, absolute=False)

Shortcut for remove_tokens_by_doc_frequency() for removing tokens below a certain document frequency.

Parameters
  • docs – list of string tokens or spaCy documents

  • df_threshold – document frequency threshold value

  • absolute – if True, use absolute document frequency (i.e. number of times token X occurs at least once in a document), otherwise use relative document frequency (normalized by number of documents)

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.simplified_pos(pos, tagset='ud', default='')

Return a simplified POS tag for a full POS tag pos belonging to a tagset tagset.

Does the following conversion by default:

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all ADJ… (adjective) tags to ‘ADJ’

  • all ADV… (adverb) tags to ‘ADV’

  • all other to default

Does the following conversion with tagset=='penn':

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all JJ… (adjective) tags to ‘ADJ’

  • all RB… (adverb) tags to ‘ADV’

  • all other to default

Does the following conversion with tagset=='ud':

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all ADJ… (adjective) tags to ‘ADJ’

  • all ADV… (adverb) tags to ‘ADV’

  • all other to default

Parameters
  • pos – a POS tag

  • tagset – tagset used for pos; can be 'wn' (WordNet), 'penn' (Penn tagset) or 'ud' (universal dependencies – default)

  • default – default return value when tag could not be simplified

Returns

simplified tag

tmtoolkit.preprocess.spacydoc_from_tokens(tokens, vocab=None, spaces=None, lemmata=None, label=None)

Create a new spaCy Doc document with tokens tokens.

Parameters
  • tokens – list, tuple or NumPy array of string tokens

  • vocab – list, tuple, set, NumPy array or spaCy Vocab object with vocabulary; if None, vocabulary will be generated from tokens

  • spaces – list, tuple or NumPy array of whitespace for each token

  • lemmata – list, tuple or NumPy array of string lemmata for each token

  • label – document label

Returns

spaCy Doc document

tmtoolkit.preprocess.sparse_dtm(docs, vocab=None)

Create a sparse document-term-matrix (DTM) from a list of tokenized documents docs. If vocab is None, determine the vocabulary (unique terms) from docs, otherwise take vocab which must be a sorted list or NumPy array. If vocab is None, the generated sorted vocabulary list is returned as second value, else only a single value is returned – the DTM.

Parameters
  • docs – list of string tokens or spaCy documents

  • vocab – optional sorted list / NumPy array of vocabulary (unique terms) in docs

Returns

either a single value (sparse document-term-matrix) or a tuple with sparse DTM and sorted vocabulary if none was passed
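
A minimal sketch with hypothetical token lists (the dense view is shown only for illustration):

from tmtoolkit.preprocess import sparse_dtm

docs = [['a', 'b', 'a'], ['b', 'c']]
dtm, vocab = sparse_dtm(docs)   # vocab is the sorted vocabulary ['a', 'b', 'c']
dtm.todense()
# matrix([[2, 1, 0],
#         [0, 1, 1]])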

tmtoolkit.preprocess.str_multisplit(s, split_chars)

Split string s by all characters in split_chars.

Parameters
  • s – a string to split

  • split_chars – sequence or set of characters to use for splitting

Returns

list of split string parts
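
A minimal sketch:

from tmtoolkit.preprocess import str_multisplit

str_multisplit('US-Student/Teacher', {'-', '/'})
# ['US', 'Student', 'Teacher']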

tmtoolkit.preprocess.str_shape(s, lower=0, upper=1, as_str=False)

Generate a sequence that reflects the “shape” of string s.

Parameters
  • s – input string

  • lower – shape element marking a lower case letter

  • upper – shape element marking an upper case letter

  • as_str – join the sequence to a string

Returns

shape list or string if as_str is True

tmtoolkit.preprocess.str_shapesplit(s, shape=None, min_part_length=2)

Split string s according to its “shape”, which is either given by shape or computed via str_shape().

Parameters
  • s – string to split

  • shape – list where 0 denotes a lower case character and 1 an upper case character; if shape is None, it is computed via str_shape()

  • min_part_length – minimum length of a chunk (as long as len(s) >= min_part_length)

Returns

list of substrings of s; returns [''] if s is empty string
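
A sketch of the expected behavior; the shape is computed via str_shape() when it is not given:

from tmtoolkit.preprocess import str_shapesplit

str_shapesplit('CamelCase')
# expected: ['Camel', 'Case']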

tmtoolkit.preprocess.to_lowercase(docs)

Apply lowercase transformation to each document.

Parameters

docs – list of string tokens or spaCy documents

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.token_glue_subsequent(tokens, matches, glue='_', return_glued=False)

Select subsequent tokens as defined by list of indices matches (e.g. output of token_match_subsequent()) and join those by string glue. Return a list of tokens where the subsequent matches are replaced by the joined tokens.

Warning

Only works correctly when matches contains indices of subsequent tokens.

Example:

token_glue_subsequent(['a', 'b', 'c', 'd', 'd', 'a', 'b', 'c'], [np.array([1, 2]), np.array([6, 7])])
# ['a', 'b_c', 'd', 'd', 'a', 'b_c']

Parameters
  • tokens – a sequence of tokens

  • matches – list of NumPy arrays with subsequent indices into tokens (e.g. output of token_match_subsequent())

  • glue – string for joining the subsequent matches, or None if a None object instead of a joined token should be placed in the result list

  • return_glued – if yes, return also a list of joint tokens

Returns

either a two-tuple or a list; if return_glued is True, return a two-tuple with 1) a list of tokens where the subsequent matches are replaced by the joined tokens and 2) a list of the joined tokens; if return_glued is False, only return 1)

tmtoolkit.preprocess.token_match(pattern, tokens, match_type='exact', ignore_case=False, glob_method='match')

Return a boolean NumPy array signaling matches between pattern and tokens. pattern is a string that will be compared with each element in sequence tokens either as exact string equality (match_type is 'exact') or regular expression (match_type is 'regex') or glob pattern (match_type is 'glob').

Parameters
  • pattern – either a string or a compiled RE pattern used for matching against tokens

  • tokens – list or NumPy array of string tokens

  • match_type – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, search_token must be RE pattern; if glob, search_token must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)

  • ignore_case – if True, ignore case for matching

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

Returns

1D boolean NumPy array of length len(tokens) where elements signal matches between pattern and the respective token from tokens
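
A minimal sketch with a hypothetical token list:

from tmtoolkit.preprocess import token_match

tokens = ['Hello', 'world', 'hello', 'again']
token_match('hello', tokens)
# array([False, False,  True, False])
token_match('hello', tokens, ignore_case=True)
# array([ True, False,  True, False])
token_match('^h', tokens, match_type='regex', ignore_case=True)
# array([ True, False,  True, False])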

tmtoolkit.preprocess.token_match_subsequent(patterns, tokens, **kwargs)

Using N patterns in patterns, return each tuple of N matching subsequent tokens from tokens. Accepts the same token matching options via kwargs as token_match(). The results are returned as list of NumPy arrays with indices into tokens.

Example:

# indices:   0        1        2         3        4       5       6
tokens = ['hello', 'world', 'means', 'saying', 'hello', 'world', '.']

token_match_subsequent(['hello', 'world'], tokens)
# [array([0, 1]), array([4, 5])]

token_match_subsequent(['world', 'hello'], tokens)
# []

token_match_subsequent(['world', '*'], tokens, match_type='glob')
# [array([1, 2]), array([5, 6])]

See also

token_match()

Parameters
  • patterns – a sequence of search patterns as accepted by token_match()

  • tokens – a sequence of tokens to be used for matching

  • kwargs – token matching options as passed to token_match()

Returns

list of NumPy arrays with subsequent indices into tokens

tmtoolkit.preprocess.tokendocs2spacydocs(docs, vocab=None, doc_labels=None, return_vocab=False)

Create new spaCy documents from token lists in docs.

Note

spaCy doesn’t handle empty tokens (“”), hence these tokens will not appear in the resulting spaCy documents if they exist in the input documents.

Parameters
  • docs – list of document tokens

  • vocab – provide vocabulary to be used when generating spaCy documents; if no vocabulary is given, it will be generated from docs

  • doc_labels – optional list of document labels; if given, must be of same length as docs

  • return_vocab – if True, additionally return generated vocabulary as spaCy Vocab object

Returns

list of spaCy documents or tuple with additional generated vocabulary if return_vocab is True

tmtoolkit.preprocess.tokenize(docs, as_spacy_docs=True, doc_labels=None, doc_labels_fmt='doc-{i1}', enable_vectors=False, nlp_instance=None)

Tokenize a list or dict of documents docs, where each element contains the raw text of the document as string.

Requires that init_for_language() is called before or nlp_instance is passed.

Parameters
  • docs – list or dict of documents with raw text strings; if dict, use dict keys as document labels

  • as_spacy_docs – if True, return list of spaCy Doc objects, otherwise return list of string tokens

  • doc_labels – if not None and docs is a list, use strings in this list as document labels

  • doc_labels_fmt – if docs is a list and doc_labels is None, generate document labels according to this format, where {i0} or {i1} are replaced by the respective zero- or one-indexed document numbers

  • enable_vectors – if True, generate word vectors (aka word embeddings) during tokenization; this will be more computationally expensive

  • nlp_instance – spaCy nlp instance

Returns

list of spaCy Doc documents if as_spacy_docs is True (default) or list of string token documents
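
A minimal sketch, assuming init_for_language() was called before (see above); the exact tokenization depends on the loaded spaCy model:

from tmtoolkit.preprocess import tokenize

corpus = {'sample1': 'Hello world.', 'sample2': 'Another short document.'}   # hypothetical raw texts
docs = tokenize(corpus)                      # spaCy Doc objects; the dict keys become document labels
tokens = tokenize(corpus, as_spacy_docs=False)
# roughly [['Hello', 'world', '.'], ['Another', 'short', 'document', '.']]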

tmtoolkit.preprocess.tokens2ids(docs)

Convert a list of spaCy documents docs to a list of numeric token ID arrays. The IDs correspond to the current spaCy vocabulary.

See also

ids2tokens() which reverses this operation.

Parameters

docs – list of spaCy documents

Returns

list of token ID arrays

tmtoolkit.preprocess.transform(docs, func, **kwargs)

Apply func to each token in each document of docs and return the result.

Parameters
  • docs – list of string tokens or spaCy documents

  • func – function to apply to each token; should accept a string as first arg. and optional kwargs

  • kwargs – keyword arguments passed to func

Returns

list of string tokens or spaCy documents, depending on docs

tmtoolkit.preprocess.vocabulary(docs, sort=False)

Return vocabulary, i.e. set of all tokens that occur at least once in at least one of the documents in docs.

Parameters
  • docs – list of string tokens or spaCy documents

  • sort – return as sorted list

Returns

either set of token strings or sorted list if sort is True
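
A minimal sketch with hypothetical token lists:

from tmtoolkit.preprocess import vocabulary

docs = [['a', 'b', 'a'], ['b', 'c']]
vocabulary(docs)
# {'a', 'b', 'c'}
vocabulary(docs, sort=True)
# ['a', 'b', 'c']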

tmtoolkit.preprocess.vocabulary_counts(docs)

Return collections.Counter() instance of vocabulary containing counts of occurrences of tokens across all documents.

Parameters

docs – list of string tokens or spaCy documents

Returns

collections.Counter() instance of vocabulary containing counts of occurrences of tokens across all documents

tmtoolkit.topicmod

Topic modeling sub-package with modules for model evaluation, model I/O, model statistics, parallel computation and visualization.

Functions and classes in tm_gensim, tm_lda and tm_sklearn implement parallel model computation and evaluation using popular topic modeling packages. You need to install the respective packages (lda, scikit-learn or gensim) in order to use them.

Evaluation metrics for Topic Modeling

Metrics for topic model evaluation.

In order to run model evaluations in parallel use one of the modules tm_gensim, tm_lda or tm_sklearn.

tmtoolkit.topicmod.evaluate.metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths)

Calculate metric as in [Arun2010] using topic-word distribution topic_word_distrib, document-topic distribution doc_topic_distrib and document lengths doc_lengths.

Note

It will fail when the num. of words in the vocabulary is less than the num. of topics (which is very unusual).

Arun2010

Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents

  • doc_lengths – array of length N with number of tokens per document

Returns

calculated metric

tmtoolkit.topicmod.evaluate.metric_cao_juan_2009(topic_word_distrib)

Calculate metric as in [Cao2008] using topic-word distribution topic_word_distrib.

Cao2008

Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing — 16th European Symposium on Artificial Neural Networks 2008 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011.

Parameters

topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

Returns

calculated metric

tmtoolkit.topicmod.evaluate.metric_coherence_gensim(measure, topic_word_distrib=None, gensim_model=None, vocab=None, dtm=None, gensim_corpus=None, texts=None, top_n=20, return_coh_model=False, return_mean=False, **kwargs)

Calculate model coherence using Gensim’s CoherenceModel.

Define which measure to use with parameter measure:

  • 'u_mass'

  • 'c_v'

  • 'c_uci'

  • 'c_npmi'

Provide a topic word distribution topic_word_distrib OR a Gensim model gensim_model and the corpus’ vocabulary as vocab OR pass a gensim corpus as gensim_corpus. top_n controls how many most probable words per topic are selected.

If measure is 'u_mass', a document-term-matrix dtm or gensim_corpus must be provided and texts can be None. If any other measure than 'u_mass' is used, tokenized input as texts must be provided as 2D list:

[['some', 'text', ...],          # doc. 1
 ['some', 'more', ...],          # doc. 2
 ['another', 'document', ...]]   # doc. 3

If return_coh_model is True, the whole gensim.models.CoherenceModel instance will be returned, otherwise:

  • if return_mean is True, the mean coherence value will be returned

  • if return_mean is False, a list of coherence values (for each topic) will be returned

Provided kwargs will be passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic().

Note

This function also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!

Parameters
  • measure – the coherence calculation type; one of the values listed above

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size if gensim_model is not given

  • gensim_model – a topic model from Gensim if topic_word_distrib is not given

  • vocab – vocabulary list/array if gensim_corpus is not given

  • dtm – document-term matrix of shape NxM with N documents and vocabulary size M if gensim_corpus is not given

  • gensim_corpus – a Gensim corpus if vocab is not given

  • texts – list of tokenized documents; necessary if using a measure other than 'u_mass'

  • top_n – number of most probable words selected per topic

  • return_coh_model – if True, return gensim.models.CoherenceModel as result

  • return_mean – if return_coh_model is False and return_mean is True, return mean coherence

  • kwargs – parameters passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic()

Returns

if return_coh_model is True, return gensim.models.CoherenceModel as result; otherwise if return_mean is True, mean of all coherence values, otherwise array of length K with coherence per topic

tmtoolkit.topicmod.evaluate.metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1e-12, normalize=True, return_mean=False)

Calculate coherence metric according to [Mimno2011] (a.k.a. “U_Mass” coherence metric). There are two modifications to the originally suggested measure:

  • uses a different epsilon by default (set eps=1 for original)

  • uses a normalizing constant by default (set normalize=False for original)

Provide a topic word distribution as topic_word_distrib and a document-term-matrix dtm (can be sparse). top_n controls how many most probable words per topic are selected.

By default, it will return a NumPy array of coherence values per topic (same ordering as in topic_word_distrib). Set return_mean to True to return the mean of all topics instead.

Mimno2011

D. Mimno, H. Wallach, E. Talley, M. Leenders, A. McCullum 2011: Optimizing semantic coherence in topic models

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • dtm – document-term matrix of shape NxM with N documents and vocabulary size M

  • top_n – number of most probable words selected per topic

  • eps – smoothing constant epsilon

  • normalize – if True, normalize coherence values

  • return_mean – if True, return mean of all coherence values, otherwise array of coherence per topic

Returns

if return_mean is True, mean of all coherence values, otherwise array of length K with coherence per topic
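
A usage sketch; topic_word_distrib and dtm are placeholders for a fitted model’s KxM topic-word distribution and the NxM raw-count document-term matrix it was trained on:

from tmtoolkit.topicmod.evaluate import metric_coherence_mimno_2011

coh_per_topic = metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20)   # array of length K
mean_coh = metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, return_mean=True)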

tmtoolkit.topicmod.evaluate.metric_griffiths_2004(logliks)

Calculate metric as in [GriffithsSteyvers2004].

Calculates the harmonic mean of the log-likelihood values logliks. Burn-in values should already be removed from logliks.

GriffithsSteyvers2004

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101

Note

Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.

Parameters

logliks – array with log-likelihood values

Returns

calculated metric

tmtoolkit.topicmod.evaluate.metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train, alpha, n_samples=10000)

Estimation of the probability of held-out documents according to [Wallach2009] using a document-topic estimation theta_test that was estimated via held-out documents dtm_test on a trained model with a topic-word distribution phi_train and a document-topic prior alpha. Draw n_samples according to theta_test for each document in dtm_test (memory consumption and run time can be very high for larger n_samples and a large amount of big documents in dtm_test).

A document-topic estimation theta_test can be obtained from a trained model from the “lda” package or scikit-learn package with the transform() method.

Adapted from MATLAB code originally by Ian Murray, 2009, downloaded from umass.edu.

Note

Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.

Wallach2009

Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D., 2009. Evaluation methods for topic models.

Parameters
  • dtm_test – held-out documents of shape NxM with N documents and vocabulary size M

  • theta_test – document-topic estimation of dtm_test; shape NxK with K topics

  • phi_train – topic-word distribution of a trained topic model that should be evaluated; shape KxM

  • alpha – document-topic prior of the trained topic model that should be evaluated; either a scalar or an array of length K

Returns

estimated probability of held-out documents

tmtoolkit.topicmod.evaluate.results_by_parameter(res, param, sort_by=None, sort_desc=False)

Takes a list of evaluation results res returned by a topic model evaluation function – a list in the form:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

Then returns a list with tuple pairs using only the parameter param from the parameter sets in the evaluation results such that the returned list is:

[(param_1, {'<metric_name>': result_1, ...}),
 ...,
 (param_n, {'<metric_name>': result_n, ...})]

Optionally order either by parameter value (sort_by is None - the default) or by result metric (sort_by='<metric name>').

Parameters
  • res – list of evaluation results

  • param – string of parameter name

  • sort_by – order by parameter value if this is None, or by a certain result metric given as string

  • sort_desc – sort in descending order

Returns

list with tuple pairs using only the parameter param from the parameter sets
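
A sketch with hypothetical evaluation results in the format shown above; the parameter and metric names are made up for illustration:

from tmtoolkit.topicmod.evaluate import results_by_parameter

eval_results = [({'n_topics': 10, 'alpha': 0.1}, {'some_metric': 0.31}),
                ({'n_topics': 20, 'alpha': 0.1}, {'some_metric': 0.27})]
results_by_parameter(eval_results, 'n_topics')
# [(10, {'some_metric': 0.31}), (20, {'some_metric': 0.27})]
results_by_parameter(eval_results, 'n_topics', sort_by='some_metric')
# [(20, {'some_metric': 0.27}), (10, {'some_metric': 0.31})]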

Printing, importing and exporting topic model results

Functions for printing/exporting topic model results.

tmtoolkit.topicmod.model_io.ldamodel_full_doc_topics(doc_topic_distrib, doc_labels, colname_rowindex='_doc', topic_labels='topic_{i1}')

Generate a datatable Frame (if datatable is installed) or pandas DataFrame for the full doc-topic distribution doc_topic_distrib.

See also

ldamodel_top_doc_topics() to retrieve only the most probable topics in the distribution as formatted pandas DataFrame; ldamodel_full_topic_words() to retrieve the full topic-word distribution as datatable Frame

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • colname_rowindex – column name for the “row index”, i.e. the column that identifies each row

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

Returns

datatable Frame or pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_full_topic_words(topic_word_distrib, vocab, colname_rowindex='_topic', row_labels='topic_{i1}')

Generate a datatable Frame (if datatable is installed) or pandas DataFrame for the full topic-word distribution topic_word_distrib.

See also

ldamodel_top_topic_words() to retrieve only the most probable words in the distribution as formatted pandas DataFrame; ldamodel_full_doc_topics() to retrieve the full document-topic distribution as datatable Frame

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M (the vocabulary size)

  • colname_rowindex – column name for the “row index”, i.e. the column that identifies each row

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels

Returns

datatable Frame or pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='document')

Retrieve the top (i.e. most probable) top_n topics for each document in the document-topic distribution doc_topic_distrib as pandas DataFrame.

See also

ldamodel_full_doc_topics() to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_docs() to retrieve the top documents per topic; ldamodel_top_topic_words() to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics() to retrieve the top topics per word from a topic-word distribution

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of most probable topics per document to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective topic name and {val} is replaced by the topic’s probability given the document

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_topic_docs(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='topic')

Retrieve the top (i.e. most probable) top_n documents for each topic in the document-topic distribution doc_topic_distrib as pandas DataFrame.

See also

ldamodel_full_doc_topics() to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_doc_topics() to retrieve the top topics per document; ldamodel_top_topic_words() to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics() to retrieve the top topics per word from a topic-word distribution

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of most probable documents per topic to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective document label and {val} is replaced by the topic’s probability given the document

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_topic_words(topic_word_distrib, vocab, top_n=10, val_fmt=None, row_labels='topic_{i1}', col_labels=None, index_name='topic')

Retrieve the top (i.e. most probable) top_n words for each topic in the topic-word distribution topic_word_distrib as pandas DataFrame.

See also

ldamodel_full_topic_words() to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_word_topics() to retrieve the top topics per word from a topic-word distribution; ldamodel_top_doc_topics() to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs() to retrieve the top documents per topic.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M

  • top_n – number of most probable words per topic to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective word from vocab and {val} is replaced by the word’s probability given the topic

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns

pandas DataFrame
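
For instance, a table with the ten most probable words per topic can be generated as follows (phi and vocab again stand in for a fitted model's topic-word distribution and vocabulary):

from tmtoolkit.topicmod.model_io import ldamodel_top_topic_words

top_words_per_topic = ldamodel_top_topic_words(phi, vocab, top_n=10)
print(top_words_per_topic)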

tmtoolkit.topicmod.model_io.ldamodel_top_word_topics(topic_word_distrib, vocab, top_n=10, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='token')

Retrieve the top (i.e. most probable) top_n topics for each word in the topic-word distribution topic_word_distrib as pandas DataFrame.

See also

ldamodel_full_topic_words() to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_words() to retrieve the top words per topic from a topic-word distribution; ldamodel_top_doc_topics() to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs() to retrieve the top documents per topic.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M

  • top_n – number of most probable topics per word to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective topic label from topic_labels and {val} is replaced by the word’s probability given the topic

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns

pandas DataFrame

tmtoolkit.topicmod.model_io.load_ldamodel_from_pickle(picklefile, **kwargs)

Load an LDA model object from a pickle file picklefile.

See also

save_ldamodel_to_pickle() to save a model.

Parameters
  • picklefile – path to the pickle file to load

  • kwargs – additional options passed to the pickle loading function

Returns

dict with keys: 'model' – model instance; 'vocab' – vocabulary; 'doc_labels' – document labels; 'dtm' – optional document-term matrix

tmtoolkit.topicmod.model_io.print_ldamodel_distribution(distrib, row_labels, val_labels, top_n=10)

Print top_n top values from a LDA model’s distribution distrib. This is a general function to print top values of any multivariate distribution given as matrix distrib with H rows and I columns, each identified by H row_labels and I val_labels.

See also

print_ldamodel_topic_words() to print the top values of a topic-word distribution or print_ldamodel_doc_topics() to print the top values of a document-topic distribution.

Parameters
  • distrib – either a topic-word or a document-topic distribution of shape HxI

  • row_labels – list/array of length H with label string for each row of distrib or format string

  • val_labels – list/array of length I with label string for each column of distrib or format string

  • top_n – number of top values to print

tmtoolkit.topicmod.model_io.print_ldamodel_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_labels='topic_{i1}')

Print top_n values from an LDA model’s document-topic distribution doc_topic_distrib.

See also

print_ldamodel_topic_words() to print the top values of a topic-word distribution.

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of top values to print

  • val_labels – format string for each value where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual value labels

tmtoolkit.topicmod.model_io.print_ldamodel_topic_words(topic_word_distrib, vocab, top_n=10, row_labels='topic_{i1}')

Print top_n values from an LDA model’s topic-word distribution topic_word_distrib.

See also

print_ldamodel_doc_topics() to print the top values of a document-topic distribution.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M

  • top_n – number of top values to print

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels
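
A quick way to inspect a fitted model on the console, assuming phi, theta, vocab and doc_labels as in the sketches above:

from tmtoolkit.topicmod.model_io import (print_ldamodel_topic_words,
                                         print_ldamodel_doc_topics)

print_ldamodel_topic_words(phi, vocab, top_n=10)        # top words per topic
print_ldamodel_doc_topics(theta, doc_labels, top_n=3)   # top topics per document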

tmtoolkit.topicmod.model_io.save_ldamodel_summary_to_excel(excel_file, topic_word_distrib, doc_topic_distrib, doc_labels, vocab, top_n_topics=10, top_n_words=10, dtm=None, rank_label_fmt=None, topic_labels=None)

Save a summary derived from an LDA model’s topic-word and document-topic distributions (topic_word_distrib and doc_topic_distrib) to an Excel file excel_file. Return the generated Excel sheets as dict of pandas DataFrames.

The resulting Excel file will consist of six or optionally seven sheets:

  • top_doc_topics_vals: document-topic distribution with probabilities of top topics per document

  • top_doc_topics_labels: document-topic distribution with labels (e.g. "topic_12") of top topics per document

  • top_doc_topics_labelled_vals: document-topic distribution combining probabilities and labels of top topics per document (e.g. "topic_12 (0.21)")

  • top_topic_word_vals: topic-word distribution with probabilities of top words per topic

  • top_topic_word_labels: topic-word distribution with top words (e.g. "politics") per topic

  • top_topic_words_labelled_vals: topic-word distribution combining probabilities and top words per topic (e.g. "politics (0.08)")

  • optional if dtm is given – marginal_topic_distrib: marginal topic distribution

Parameters
  • excel_file – target Excel file

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • vocab – vocabulary list/array of length M

  • top_n_topics – number of most probable topics per document to include in the summary

  • top_n_words – number of most probable words per topic to include in the summary

  • dtm – document-term matrix; shape NxM; if this is given, a sheet for the marginal topic distribution will be included

  • rank_label_fmt – format string for the rank labels where {i0} or {i1} are replaced by the respective zero- or one-indexed rank numbers (leave to None for default)

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

Returns

dict mapping sheet name to pandas DataFrame
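
A sketch of exporting a model summary; phi, theta, doc_labels and vocab are the placeholders used above, and dtm is the (hypothetical) document-term matrix used to fit the model, so the marginal topic distribution sheet is included:

from tmtoolkit.topicmod.model_io import save_ldamodel_summary_to_excel

sheets = save_ldamodel_summary_to_excel('model_summary.xlsx',
                                        phi, theta, doc_labels, vocab,
                                        top_n_topics=10, top_n_words=10,
                                        dtm=dtm)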

tmtoolkit.topicmod.model_io.save_ldamodel_to_pickle(picklefile, model, vocab, doc_labels, dtm=None, **kwargs)

Save an LDA model object model as pickle file to picklefile.

See also

load_ldamodel_from_pickle() to load the saved model.

Parameters
  • picklefile – target file

  • model – LDA model instance

  • vocab – vocabulary list/array of length M

  • doc_labels – document labels list/array of length N

  • dtm – optional document-term matrix of shape NxM

  • kwargs – additional options for tmtoolkit.utils.pickle_data()
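
Saving and re-loading a model could look like this; model is whatever estimator object the topic modeling package (e.g. lda, gensim or scikit-learn) produced:

from tmtoolkit.topicmod.model_io import (save_ldamodel_to_pickle,
                                         load_ldamodel_from_pickle)

save_ldamodel_to_pickle('my_model.pickle', model, vocab, doc_labels, dtm=dtm)

loaded = load_ldamodel_from_pickle('my_model.pickle')
model, vocab = loaded['model'], loaded['vocab']
doc_labels, dtm = loaded['doc_labels'], loaded['dtm']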

Statistics for topic models and BoW matrices

Common statistics and tools for topic models.

tmtoolkit.topicmod.model_stats.exclude_topics(excl_topic_indices, doc_topic_distrib, topic_word_distrib=None, renormalize=True, return_new_topic_mapping=False)

Exclude topics with the indices excl_topic_indices from the document-topic distribution doc_topic_distrib (i.e. delete the respective columns in this matrix) and optionally re-normalize the distribution so that the rows sum up to 1 if renormalize is set to True.

Optionally also strip the topics from the topic-word distribution topic_word_distrib (i.e. remove the respective rows).

If topic_word_distrib is given, return a tuple with the updated doc.-topic and topic-word distributions, else return only the updated doc.-topic distribution.

Warning

The topics to be excluded are specified by zero-based indices.

Parameters
  • excl_topic_indices – list/array with zero-based indices of topics to exclude

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • topic_word_distrib – optional topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • renormalize – if True, re-normalize the document-topic distribution so that the rows sum up to 1

  • return_new_topic_mapping – if True, additionally return a dict that maps old topic indices to new topic indices

Returns

new document-topic distribution where topics from excl_topic_indices are removed and optionally re-normalized; optional new topic-word distribution with same topics removed; optional dict that maps old topic indices to new topic indices
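
For example, removing two (hypothetically uninformative) topics with zero-based indices 0 and 5 from both distributions could look like this:

from tmtoolkit.topicmod.model_stats import exclude_topics

new_theta, new_phi = exclude_topics([0, 5], theta,
                                    topic_word_distrib=phi,
                                    renormalize=True)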

tmtoolkit.topicmod.model_stats.filter_topics(search_pattern, vocab, topic_word_distrib, top_n=None, thresh=None, match_type='exact', cond='any', glob_method='match', return_words_and_matches=False)

Filter topics defined as topic-word distribution topic_word_distrib across vocabulary vocab for a word (pass a string) or multiple words/patterns (pass a list of strings) via search_pattern. Either run the pattern(s) against the list of top words per topic (use top_n for the number of words in the top words list) or specify a minimum topic-word probability thresh, resulting in a list of words above this threshold for each topic, which will be used for pattern matching. You can also specify both top_n and thresh.

Set the match_type parameter according to the options provided by tmtoolkit.preprocess.token_match() (exact matching, RE or glob matching). Use cond to specify whether only one match suffices per topic when a list of patterns is passed (cond='any') or all patterns must match (cond='all').

By default, this function returns a NumPy array containing the indices of topics that passed the filter criteria. If return_words_and_matches is True, this function additionally returns a NumPy array with the top words for each topic and a NumPy array with the pattern matches for each topic.

See also

See tmtoolkit.preprocess.token_match() for filtering options.

Parameters
  • search_pattern – single match pattern string or list of match pattern strings

  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • top_n – if given, consider only the top top_n words per topic

  • thresh – if given, consider only the words with a probability above thresh

  • match_type – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, search_pattern must be a RE pattern; if ‘glob’, search_pattern must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)

  • cond – either "any" or "all"; controls whether only one or all patterns must match if multiple match patterns are given

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • return_words_and_matches – if True, additionally return list of arrays of words per topic and list of binary arrays indicating matches per topic

Returns

array of topic indices with matches; if return_words_and_matches is True, return two more lists as described above
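
As a sketch, finding all topics whose ten most probable words contain a word starting with "polit" could be done with glob matching:

from tmtoolkit.topicmod.model_stats import filter_topics

matching_topic_indices = filter_topics('polit*', vocab, phi,
                                       top_n=10, match_type='glob')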

tmtoolkit.topicmod.model_stats.generate_topic_labels_from_top_words(topic_word_distrib, doc_topic_distrib, doc_lengths, vocab, n_words=None, lambda_=1, labels_glue='_', labels_format='{i1}_{topwords}')

Generate unique topic labels derived from the top words of each topic. The top words are determined from the relevance score [SievertShirley2014] depending on lambda_. Specify the number of top words in the label with n_words. If n_words is None, a minimum number of words will be used to create unique labels for each topic. Topic labels are formed by joining the top words with labels_glue and formatting them with labels_format. Placeholders in labels_format are "{i0}" (zero-based topic index), "{i1}" (one-based topic index) and "{topwords}" (top words glued with labels_glue).

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • vocab – vocabulary array of length M

  • n_words – number of top words to include in the labels; if None, the minimum number of words needed to make the labels unique is used

  • lambda_ – lambda parameter (influences weight of “log lift”)

  • labels_glue – string to join the top words

  • labels_format – final topic labels format string

Returns

NumPy array of topic labels; length is K
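
A sketch of generating labels from the two most relevant words per topic; doc_lengths is assumed to be the per-document token count, e.g. computed with tmtoolkit.bow.bow_stats.doc_lengths() from the document-term matrix, and lambda_=0.6 is only an assumed choice:

from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words

topic_labels = generate_topic_labels_from_top_words(phi, theta, doc_lengths,
                                                    vocab, n_words=2,
                                                    lambda_=0.6)
# result: array of K unique label strings such as '1_<word>_<word>'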

tmtoolkit.topicmod.model_stats.least_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by distinctiveness score from least to most distinctive. Optionally only return the n least distinctive words.

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least distinctive words

Returns

array of length M or n (if n is given) with least distinctive words

tmtoolkit.topicmod.model_stats.least_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by marginal word probability from least to most probable. Optionally only return the n least probable words.

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least probable words

Returns

array of length M or n (if n is given) with least probable words

tmtoolkit.topicmod.model_stats.least_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by least to most relevance according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from topic_word_relevance(). Optionally only return the n least relevant words.

Parameters
  • vocab – vocabulary array of length M

  • rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size

  • topic – topic number (zero-indexed)

  • n – if not None, return only the n least relevant words

Returns

array of length M or n (if n is given) with least relevant words for topic topic

tmtoolkit.topicmod.model_stats.least_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by saliency score from least to most salient. Optionally only return the n least salient words.

See also

word_saliency()

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least salient words

Returns

array of length M or n (if n is given) with least salient words

tmtoolkit.topicmod.model_stats.marginal_topic_distrib(doc_topic_distrib, doc_lengths)

Return marginal topic distribution p(T) (topic proportions) given the document-topic distribution (theta) doc_topic_distrib and the document lengths doc_lengths. The latter can be calculated with doc_lengths().

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

Returns

array of size K (number of topics) with marginal topic distribution

tmtoolkit.topicmod.model_stats.marginal_word_distrib(topic_word_distrib, p_t)

Return the marginal word distribution p(w) (term proportions derived from topic model) given the topic-word distribution (phi) topic_word_distrib and the marginal topic distribution p(T) p_t. The latter can be calculated with marginal_topic_distrib().

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • p_t – marginal topic distribution; array of size K

Returns

array of size M (vocabulary size) with marginal word distribution
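
The two marginal distributions build on each other; a minimal sketch using the placeholder variables from above:

from tmtoolkit.topicmod.model_stats import (marginal_topic_distrib,
                                            marginal_word_distrib)

p_t = marginal_topic_distrib(theta, doc_lengths)   # array of length K
p_w = marginal_word_distrib(phi, p_t)              # array of length M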

tmtoolkit.topicmod.model_stats.most_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by distinctiveness score from most to least distinctive. Optionally only return the n most distinctive words.

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most distinctive words

Returns

array of length M or n (if n is given) with most distinctive words

tmtoolkit.topicmod.model_stats.most_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by marginal word probability from most to least probable. Optionally only return the n most probable words.

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most probable words

Returns

array of length M or n (if n is given) with most probable words

tmtoolkit.topicmod.model_stats.most_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by most to least relevance according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from topic_word_relevance(). Optionally only return the n most relevant words.

Parameters
  • vocab – vocabulary array of length M

  • rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size

  • topic – topic number (zero-indexed)

  • n – if not None, return only the n most relevant words

Returns

array of length M or n (if n is given) with most relevant words for topic topic

tmtoolkit.topicmod.model_stats.most_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by saliency score from most to least salient. Optionally only return the n most salient words.

See also

word_saliency()

Parameters
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most salient words

Returns

array of length M or n (if n is given) with most salient words

tmtoolkit.topicmod.model_stats.top_n_from_distribution(distrib, top_n=10, row_labels=None, col_labels=None, val_labels=None)

Get top_n values from LDA model’s distribution distrib as DataFrame. Can be used for topic-word distributions and document-topic distributions. Set row_labels to a format string or a list. Set col_labels to a format string for the column names. Set val_labels to return value labels instead of pure values (probabilities).

Parameters
  • distrib – a 2D probability distribution of shape NxM from an LDA model

  • top_n – number of top values to take from each row of distrib

  • row_labels – either list of row label strings of length N or a single row format string

  • col_labels – column format string or None for default numbered columns

  • val_labels – value labels format string or None to return only the probabilities

Returns

pandas DataFrame with N rows and top_n columns

tmtoolkit.topicmod.model_stats.top_words_for_topics(topic_word_distrib, top_n=None, vocab=None, return_prob=False)

Generate sorted list of top_n words (or word indices) per topic in topic-word distribution topic_word_distrib.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • top_n – number of top words (according to probability given topic) to select per topic; if None return full sorted lists of words

  • vocab – vocabulary array of length M; if None, return word indices instead of word strings

  • return_prob – if True, also return sorted arrays of word probabilities given topic for each topic

Returns

list of length K consisting of sorted arrays of most probable words; arrays have length top_n or M (if top_n is None); if return_prob is True another list of sorted arrays of word probabilities given topic for each topic is returned
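
A sketch retrieving the five most probable words per topic together with their probabilities:

from tmtoolkit.topicmod.model_stats import top_words_for_topics

top_words, top_probs = top_words_for_topics(phi, top_n=5, vocab=vocab,
                                            return_prob=True)
# top_words[k] and top_probs[k] are the sorted arrays for topic k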

tmtoolkit.topicmod.model_stats.topic_word_relevance(topic_word_distrib, doc_topic_distrib, doc_lengths, lambda_)

Calculate the topic-word relevance score with a lambda parameter lambda_ according to [SievertShirley2014]:

relevance(w,t|lambda) = lambda * log phi_{t,w} + (1-lambda) * log (phi_{t,w} / p(w)), where

  • phi is the topic-word distribution,

  • p(w) is the marginal word probability.

[SievertShirley2014]

Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • lambda_ – lambda parameter (influences weight of “log lift”)

Returns

matrix with topic-word relevance scores; shape KxM
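
The relevance matrix is typically computed once and then queried per topic, e.g. with most_relevant_words_for_topic(); lambda_=0.6 here is only an assumed choice:

from tmtoolkit.topicmod.model_stats import (topic_word_relevance,
                                            most_relevant_words_for_topic)

rel_mat = topic_word_relevance(phi, theta, doc_lengths, lambda_=0.6)
top_relevant = most_relevant_words_for_topic(vocab, rel_mat, topic=0, n=10)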

tmtoolkit.topicmod.model_stats.word_distinctiveness(topic_word_distrib, p_t)

Calculate word distinctiveness according to [Chuang2012]:

distinctiveness(w) = KL(P(T|w), P(T)) = sum_T(P(T|w) log(P(T|w)/P(T))), where

  • KL is Kullback-Leibler divergence,

  • P(T) is marginal topic distribution,

  • P(T|w) is prob. of a topic given a word.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • p_t – marginal topic distribution; array of size K

Returns

array of size M (vocabulary size) with word distinctiveness

tmtoolkit.topicmod.model_stats.word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths)

Calculate word saliency according to [Chuang2012] as saliency(w) = p(w) * distinctiveness(w) for a word w.

[Chuang2012]

J. Chuang, C. Manning, J. Heer. 2012. Termite: Visualization Techniques for Assessing Textual Topic Models

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

Returns

array of size M (vocabulary size) with word saliency

Parallel model fitting and evaluation with lda

Parallel model computation and evaluation using the lda package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_lda.AVAILABLE_METRICS = ('loglikelihood', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

Available metrics for lda ("griffiths_2004", "held_out_documents_wallach09" are added when package gmpy2 is installed, several "coherence_gensim_" metrics are added when package gensim is installed).

tmtoolkit.topicmod.tm_lda.DEFAULT_METRICS = ('cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

Metrics used by default.

tmtoolkit.topicmod.tm_lda.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “lda” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list containing tuples (parameter_set, model) where parameter_set is a dict of the used parameters.

Parameters
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_lda.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “lda” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter().

Parameters
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this/these metric(s) for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns

list of evaluation results for each varying parameter set as described above
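
A sketch of evaluating models with a varying number of topics on a document-term matrix dtm; the parameter names n_topics, n_iter and random_state belong to the lda package and are assumptions here:

from tmtoolkit.topicmod import tm_lda

varying_params = [{'n_topics': k} for k in range(20, 101, 20)]
constant_params = {'n_iter': 1000, 'random_state': 1}

eval_results = tm_lda.evaluate_topic_models(dtm, varying_params,
                                            constant_params,
                                            metric=['cao_juan_2009',
                                                    'arun_2010'])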

Parallel model fitting and evaluation with scikit-learn

Parallel model computation and evaluation using the scikit-learn package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_sklearn.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')

Available metrics for sklearn ("held_out_documents_wallach09" is added when package gmpy2 is installed, several "coherence_gensim_" metrics are added when package gensim is installed).

tmtoolkit.topicmod.tm_sklearn.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

Metrics used by default.

tmtoolkit.topicmod.tm_sklearn.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “sklearn” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list containing tuples (parameter_set, model) where parameter_set is a dict of the used parameters.

Parameters
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_sklearn.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “sklearn” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter().

Parameters
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this/these metric(s) for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns

list of evaluation results for each varying parameter set as described above

Parallel model fitting and evaluation with Gensim

Parallel model computation and evaluation using the Gensim package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_gensim.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')

Available metrics for Gensim.

tmtoolkit.topicmod.tm_gensim.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_c_v')

Metrics used by default.

tmtoolkit.topicmod.tm_gensim.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “gensim” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list containing tuples (parameter_set, model) where parameter_set is a dict of the used parameters.

Parameters
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_gensim.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “gensim” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter().

Parameters
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this/these metric(s) for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns

list of evaluation results for each varying parameter set as described above

Visualize topic models and topic model evaluation results

Wordclouds from topic models

tmtoolkit.topicmod.visualize.DEFAULT_WORDCLOUD_KWARGS = {'background_color': None, 'color_func': <function _wordcloud_color_func_black>, 'height': 600, 'mode': 'RGBA', 'width': 800}

Default wordcloud settings for transparent background and black font; will be passed to wordcloud.WordCloud

tmtoolkit.topicmod.visualize.generate_wordclouds_for_topic_words(topic_word_distrib, vocab, top_n, topic_labels='topic_{i1}', which_topics=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for the top top_n words of each topic in topic_word_distrib.

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary array of length M

  • top_n – number of top values to take from each row of topic_word_distrib

  • topic_labels – labels used for each row; determine keys in result dict; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_topics – if not None, a sequence of indices into rows of topic_word_distrib to select only these topics to generate wordclouds from

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns

dict mapping row labels to wordcloud images or instances generated from each topic
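
A sketch of generating and saving wordcloud images for the first three topics (requires the wordcloud package; write_wordclouds_to_folder() is documented further below):

from tmtoolkit.topicmod.visualize import (generate_wordclouds_for_topic_words,
                                          write_wordclouds_to_folder)

clouds = generate_wordclouds_for_topic_words(phi, vocab, top_n=30,
                                             which_topics=[0, 1, 2])
write_wordclouds_to_folder(clouds, 'wordclouds_out')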

tmtoolkit.topicmod.visualize.generate_wordclouds_for_document_topics(doc_topic_distrib, doc_labels, top_n, topic_labels='topic_{i1}', which_documents=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for the top top_n topics of each document in doc_topic_distrib.

Parameters
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of top values to take from each row of doc_topic_distrib

  • topic_labels – labels used for each row; determine keys in result dict; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_documents – if not None, a sequence of indices into rows of doc_topic_distrib to select only these documents to generate wordclouds from

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns

dict mapping row labels to wordcloud images or instances generated from each document

tmtoolkit.topicmod.visualize.generate_wordcloud_from_probabilities_and_words(prob, words, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)

Generate a single wordcloud for given probabilities (weights) prob of the respective words.

Parameters
  • prob – 1D array or sequence of probabilities for words

  • words – 1D array or sequence of word strings

  • return_image – if True, return an image object instead of a wordcloud.WordCloud object

  • wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns

either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance

tmtoolkit.topicmod.visualize.generate_wordcloud_from_weights(weights, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)

Generate a single wordcloud for a weights dict that maps words to “weights” (e.g. probabilities) which determine their size in the wordcloud.

Parameters
  • weights – dict that maps words to weights

  • return_image – if True, return an image object instead of a wordcloud.WordCloud object

  • wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns

either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance

tmtoolkit.topicmod.visualize.write_wordclouds_to_folder(wordclouds, folder, file_name_fmt='{label}.png', **save_kwargs)

Save all wordcloud image objects in wordclouds to folder.

Parameters
  • wordclouds – dict mapping wordcloud label to wordcloud object

  • folder – target path

  • file_name_fmt – file name string format with placeholder "{label}"

  • save_kwargs – additional options passed to save method of each wordcloud image object

tmtoolkit.topicmod.visualize.generate_wordclouds_from_distribution(distrib, row_labels, val_labels, top_n, which_rows=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for each row in a given probability distribution distrib.

Note

Use generate_wordclouds_for_topic_words() or generate_wordclouds_for_document_topics() as shortcuts for creating wordclouds for a topic-word or document-topic distribution.

Parameters
  • distrib – 2D (sparse) array/matrix probability distribution

  • row_labels – labels for rows in probability distribution; these are used as keys in the return dict

  • val_labels – labels for values in probability distribution (e.g. vocabulary)

  • top_n – number of top values to take from each row of distrib

  • which_rows – if not None, select only the rows from this sequence of indices from distrib

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns

dict mapping row labels to wordcloud images or instances generated from each distribution row

Plot heatmaps for topic models

tmtoolkit.topicmod.visualize.plot_doc_topic_heatmap(fig, ax, doc_topic_distrib, doc_labels, topic_labels=None, which_documents=None, which_document_indices=None, which_topics=None, which_topic_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)

Plot a heatmap for a document-topic distribution doc_topic_distrib to a matplotlib Figure fig and Axes ax using doc_labels as document labels on the y-axis and topics from 1 to K (number of topics) on the x-axis.

Note

It is almost always necessary to select a subset of your document-topic distribution with the which_documents or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.

Parameters
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • topic_labels – labels used for each row; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_documents – select documents via document label strings

  • which_document_indices – alternatively, select documents with zero-based document index in [0, N-1]

  • which_topics – select topics via topic label strings (when string array or list) or with one-based topic index in [1, K] (when integer array or list)

  • which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • kwargs – additional arguments passed to plot_heatmap()

Returns

tuple of generated (matplotlib Figure object, matplotlib Axes object)
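
A sketch of plotting a small subset of the document-topic distribution (matplotlib is assumed to be installed):

import matplotlib.pyplot as plt
from tmtoolkit.topicmod.visualize import plot_doc_topic_heatmap

fig, ax = plt.subplots(figsize=(8, 6))
plot_doc_topic_heatmap(fig, ax, theta, doc_labels,
                       which_document_indices=list(range(10)),
                       which_topic_indices=list(range(10)))
plt.show()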

tmtoolkit.topicmod.visualize.plot_topic_word_heatmap(fig, ax, topic_word_distrib, vocab, topic_labels=None, which_topics=None, which_topic_indices=None, which_words=None, which_word_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)

Plot a heatmap for a topic-word distribution topic_word_distrib to a matplotlib Figure fig and Axes ax using vocab as vocabulary on the x-axis and topics from 1 to K (the number of topics, i.e. topic_word_distrib.shape[0]) on the y-axis.

Note

It is almost always necessary to select a subset of your topic-word distribution with the which_words or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.

Parameters
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary array of length M

  • topic_labels – labels used for each row; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_topics – select topics via topic label strings (when string array or list and topic_labels is given) or with one-based topic index in [1, K] (when integer array or list)

  • which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]

  • which_words – select words via word strings from vocab

  • which_word_indices – alternatively, select words with zero-based word index in [0, M-1]

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • kwargs – additional arguments passed to plot_heatmap()

Returns

tuple of generated (matplotlib Figure object, matplotlib Axes object)

tmtoolkit.topicmod.visualize.plot_heatmap(fig, ax, data, xaxislabel=None, yaxislabel=None, xticklabels=None, yticklabels=None, title=None, grid=True, values_in_cells=True, round_values_in_cells=2, legend=False, fontsize_axislabel=None, fontsize_axisticks=None, fontsize_cell_values=None)

Generic heatmap plotting function for 2D matrix data.

Parameters
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • data – 2D array/matrix to be plotted as heatmap

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • xticklabels – list of x axis tick labels

  • yticklabels – list of y axis tick labels

  • title – plot title

  • grid – draw grid if True

  • values_in_cells – draw values of data in heatmap cells

  • round_values_in_cells – round these values to the given number of digits

  • legend – if True, draw a legend

  • fontsize_axislabel – font size for axis label

  • fontsize_axisticks – font size for axis ticks

  • fontsize_cell_values – font size for values in cells

Returns

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Plot topic model evaluation results

tmtoolkit.topicmod.visualize.plot_eval_results(eval_results, metric=None, xaxislabel=None, yaxislabel=None, title=None, title_fontsize='x-large', axes_title_fontsize='large', show_metric_direction=True, metric_direction_font_size='large', subplots_opts=None, subplots_adjust_opts=None, figsize='auto', **fig_kwargs)

Plot the evaluation results from eval_results, which must be a sequence containing (param, values) tuples, where param is the parameter value to appear on the x axis and values can be a dict structure containing the metric values. eval_results can be created using tmtoolkit.topicmod.evaluate.results_by_parameter().

Parameters
  • eval_results – topic evaluation results as sequence containing (param, metric results)

  • metric – either single string or list of strings; plot only this/these specific metric/s

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • title – plot title

  • title_fontsize – font size for the figure title

  • axes_title_fontsize – font size for the plot titles

  • show_metric_direction – if True, show whether the shown metric should be minimized or maximized for optimization

  • metric_direction_font_size – font size for the metric optimization direction indicator

  • subplots_opts – options passed to Matplotlib’s plt.subplots()

  • subplots_adjust_opts – options passed to Matplotlib’s fig.subplots_adjust()

  • figsize – tuple (width, height) or "auto" (default) which will set the size to (8, 2 * <num. of metrics>)

  • fig_kwargs – additional parameters passed to Matplotlib’s plt.subplots()

Returns

tuple of generated (matplotlib Figure object, matplotlib Axes object)
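
Combined with results_by_parameter(), plotting evaluation results could look like this (eval_results and the parameter name 'n_topics' are the same assumptions as in the tm_lda example above):

import matplotlib.pyplot as plt
from tmtoolkit.topicmod.evaluate import results_by_parameter
from tmtoolkit.topicmod.visualize import plot_eval_results

results_by_n_topics = results_by_parameter(eval_results, 'n_topics')
plot_eval_results(results_by_n_topics, xaxislabel='number of topics',
                  title='Topic model evaluation')
plt.show()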

Other functions

tmtoolkit.topicmod.visualize.parameters_for_ldavis(topic_word_distrib, doc_topic_distrib, dtm, vocab, sort_topics=False)

Create a parameters dict that can be used with the pyLDAVis package by passing the dict params like pyLDAVis.prepare(**params).

Parameters
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • dtm – document-term-matrix; shape NxM

  • vocab – vocabulary array/list of length M

  • sort_topics – if True, sort the topics

Returns

dict with parameters ready to use with pyLDAVis
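
A sketch of preparing an interactive visualization; pyLDAvis.prepare() and pyLDAvis.save_html() are functions of the external pyLDAvis package:

import pyLDAvis
from tmtoolkit.topicmod.visualize import parameters_for_ldavis

params = parameters_for_ldavis(phi, theta, dtm, vocab)
vis = pyLDAvis.prepare(**params)
pyLDAvis.save_html(vis, 'ldavis.html')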

Base classes for parallel model fitting and evaluation

Base classes for parallel model fitting and evaluation. See the specific functions and classes in tm_gensim, tm_lda and tm_sklearn for parallel processing with popular topic modeling packages.

Note

The classes and functions in this module are only important if you want to implement your own parallel model computation and evaluation.

class tmtoolkit.topicmod.parallel.MultiprocEvaluationRunner(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)

Specialization of MultiprocModelsRunner for parallel model evaluations.

__init__(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)

Initialize evaluation runner.

Parameters
  • worker_class – model computation worker class derived from MultiprocModelsWorkerABC

  • available_metrics – list/tuple with available metrics as strings

  • data – the data that the workers use for computations; 2D (sparse) array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • metric – string or list of strings; if given, use only this/these metric(s) for evaluation; must be a subset of available_metrics

  • metric_options – dict of options for the used metric(s)

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

class tmtoolkit.topicmod.parallel.MultiprocEvaluationWorkerABC(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Specialization of MultiprocModelsWorkerABC for parallel model evaluations.

__init__(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Initialize parallel model evaluations worker class with an ID worker_id, a queue to receive tasks from tasks_queue, a queue to send results to results_queue and the data to operate on. Use evaluation metrics eval_metric.

Parameters
  • worker_id – process ID

  • eval_metric – list/tuple of strings of evaluation metrics to use

  • eval_metric_options – dict of options for the used metric(s)

  • tasks_queue – queue to receive tasks from

  • results_queue – queue to send results to

  • data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for sparse matrix in COO format (see tmtoolkit.topicmod.parallel.MultiprocModelsRunner._prepare_data())

  • group – see Python’s multiprocessing.Process class

  • target – see Python’s multiprocessing.Process class

  • name – see Python’s multiprocessing.Process class

  • args – see Python’s multiprocessing.Process class

  • kwargs – see Python’s multiprocessing.Process class

class tmtoolkit.topicmod.parallel.MultiprocModelsRunner(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Runner class for distributing and managing worker processes for parallel model computation.

__init__(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Initialize the runner class with a model computation worker class worker_class (which should be derived from MultiprocModelsWorkerABC). The worker class represents the worker processes; each worker will be instantiated with data and work on it with a different parameter set that can be passed via varying_parameters.

Parameters
  • worker_class – model computation worker class derived from MultiprocModelsWorkerABC

  • data – the data that the workers use for computations; 2D (sparse) array/matrix or a dict of such matrices; the latter allows running all computations on several datasets at once

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

run()

Set up worker processes and run parallel computations. Blocks until all processes are done, then stops all workers and returns the results.

Returns

if the passed data is a 2D array, a list of tuples (parameter set, results); if the passed data is a dict of 2D arrays, a dict with the same keys as data and the respective results for each dataset

shutdown_workers()

Send shutdown signal to all worker processes to stop them.

class tmtoolkit.topicmod.parallel.MultiprocModelsWorkerABC(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Abstract base class for parallel model computations worker class.

__init__(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Initialize the parallel model computation worker with an ID worker_id, a queue to receive tasks from (tasks_queue), a queue to send results to (results_queue) and the data to operate on.

Parameters
  • worker_id – process ID

  • tasks_queue – queue to receive tasks from

  • results_queue – queue to send results to

  • data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for a sparse matrix in COO format (see tmtoolkit.topicmod.parallel.MultiprocModelsRunner._prepare_data())

  • group – see Python’s multiprocessing.Process class

  • target – see Python’s multiprocessing.Process class

  • name – see Python’s multiprocessing.Process class

  • args – see Python’s multiprocessing.Process class

  • kwargs – see Python’s multiprocessing.Process class

fit_model(data, params)

Method stub; a subclass must implement the actual model fitting for data with parameter set params.

Parameters
  • data – data passed to the model fitting algorithm

  • params – parameter set dict

Returns

model fitting / evaluation results

run()

Run the process worker: Calls fit_model() on each dataset and parameter set coming from the tasks queue.

send_results(doc, params, results)

Put the results into the results queue.

Parameters
  • doc – “document” / dataset label

  • params – used parameter set

  • results – generated results, e.g. fit model and/or evaluation results
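
To tie the pieces of this module together, the following is a minimal sketch of a custom worker and runner, using only the interfaces documented above (MultiprocModelsWorkerABC.fit_model() and MultiprocModelsRunner.run()). The RowSumWorker class and its trivial "model" (scaled per-document term sums) are purely illustrative placeholders; a real subclass would fit an actual topic model in fit_model().

    # Illustrative sketch: RowSumWorker and its trivial "model" are placeholders.
    import numpy as np
    from scipy.sparse import csr_matrix
    from tmtoolkit.topicmod.parallel import MultiprocModelsRunner, MultiprocModelsWorkerABC

    class RowSumWorker(MultiprocModelsWorkerABC):
        def fit_model(self, data, params):
            # `data` is the (sparse) matrix assigned to this worker, `params` is one
            # parameter set from varying_parameters; the base class's run() sends the
            # return value to the results queue via send_results().
            return np.asarray(data.sum(axis=1)).ravel() * params['scale']

    if __name__ == '__main__':
        dtm = csr_matrix(np.random.randint(0, 5, size=(8, 20)))
        runner = MultiprocModelsRunner(RowSumWorker, dtm,
                                       varying_parameters=[{'scale': s} for s in (1, 2, 3)])
        results = runner.run()   # list of (parameter set, results) tuples for a single dataset
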

tmtoolkit.utils

Misc. utility functions.

tmtoolkit.utils.argsort(seq)

Same as NumPy’s numpy.argsort() but for Python sequences.

Parameters

seq – a sequence

Returns

indices into seq that sort seq
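
A short illustration with arbitrary values (the expected result is shown as a comment):

    from tmtoolkit.utils import argsort

    argsort([3.0, 1.0, 2.0])   # -> [1, 2, 0], i.e. indices that sort the sequence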

tmtoolkit.utils.combine_sparse_matrices_columnwise(matrices, col_labels, row_labels=None, dtype=None)

Given a sequence of sparse matrices in matrices and their corresponding column labels in col_labels, stack these matrices row-wise while retaining the column affiliation and filling in zeros, e.g.:

m1:
   C A D
   -----
   1 0 3
   0 2 0

m2:
   D B C A
   -------
   0 0 1 2
   3 4 5 6
   2 1 0 0

will result in:

A B C D
-------
0 0 1 3
2 0 0 0
2 0 1 0
6 4 5 3
0 1 0 2

(where the first two rows come from m1 and the other three rows from m2).

The resulting columns will always be sorted in ascending order.

Additionally you can pass a sequence of row labels for each matrix via row_labels. This will also sort the rows in ascending order according to the row labels.

Parameters
  • matrices – sequence of sparse matrices

  • col_labels – column labels for each matrix in matrices; may be a sequence of strings or integers

  • row_labels – optional sequence of row labels for each matrix in matrices

  • dtype – optionally specify the dtype of the resulting sparse matrix

Returns

a tuple with (1) combined sparse matrix in CSR format; (2) column labels of the matrix; (3) optionally row labels of the matrix if row_labels is not None.
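
The sketch below reproduces the example above, assuming SciPy sparse input; with row_labels left at None, only the combined matrix and the sorted column labels are returned.

    import numpy as np
    from scipy.sparse import coo_matrix
    from tmtoolkit.utils import combine_sparse_matrices_columnwise

    m1 = coo_matrix(np.array([[1, 0, 3],
                              [0, 2, 0]]))
    m2 = coo_matrix(np.array([[0, 0, 1, 2],
                              [3, 4, 5, 6],
                              [2, 1, 0, 0]]))

    combined, cols = combine_sparse_matrices_columnwise(
        [m1, m2],
        [('C', 'A', 'D'), ('D', 'B', 'C', 'A')])
    # cols is sorted ('A', 'B', 'C', 'D'); combined.toarray() matches the table above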

tmtoolkit.utils.empty_chararray()

Create empty NumPy character array.

Returns

empty NumPy character array

tmtoolkit.utils.flatten_list(l)

Flatten a 2D sequence l to a 1D list and return it.

Although return sum(l, []) looks like a very nice one-liner, it turns out to be much slower than the actual implementation of this function.

Parameters

l – 2D sequence, e.g. list of lists

Returns

flattened list, i.e. a 1D list that concatenates all elements from each list inside l
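
A short illustration with arbitrary values (the expected result is shown as a comment):

    from tmtoolkit.utils import flatten_list

    flatten_list([[1, 2], [3], [4, 5]])   # -> [1, 2, 3, 4, 5]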

tmtoolkit.utils.greedy_partitioning(elems_dict, k, return_only_labels=False)

Implementation of a greedy partitioning algorithm for a dict elems_dict containing elements with a label -> weight mapping. A weight can be a number in an arbitrary range. Since this is used for task scheduling, you can think of it as: the larger the weight, the bigger the task.

The elements are placed in k bins such that the difference of sums of weights in each bin is minimized. The algorithm does not always find the optimal solution.

If return_only_labels is False, returns a list of k dicts with label -> weight mapping, else returns a list of k lists containing only the labels for the respective partitions.

Parameters
  • elems_dict – dictionary containing elements with label -> weight mapping

  • k – number of bins

  • return_only_labels – if True, only return the labels in each bin

Returns

list of k bins, where each bin is either a dict with label -> weight mapping if return_only_labels is False, or a list of labels otherwise
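
A small sketch with hypothetical task weights; the exact assignment may differ, but the weight sums per bin should end up roughly balanced:

    from tmtoolkit.utils import greedy_partitioning

    tasks = {'a': 10, 'b': 7, 'c': 5, 'd': 3}
    greedy_partitioning(tasks, k=2)
    # e.g. [{'a': 10, 'd': 3}, {'b': 7, 'c': 5}]   (weight sums 13 and 12)
    greedy_partitioning(tasks, k=2, return_only_labels=True)
    # e.g. [['a', 'd'], ['b', 'c']]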

tmtoolkit.utils.mat2d_window_from_indices(mat, row_indices=None, col_indices=None, copy=False)

Select an area ("window") inside a 2D array/matrix mat, specified by a sequence of row indices row_indices and/or a sequence of column indices col_indices. Returns the specified area as a view into the data if copy is False, else as a copy.

Parameters
  • mat – a 2D NumPy array

  • row_indices – list or array of row indices to select or None to select all rows

  • col_indices – list or array of column indices to select or None to select all columns

  • copy – if True, return result as copy, else as view into mat

Returns

window into mat as specified by the passed indices
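
A short illustration with arbitrary values (the expected result is shown as a comment):

    import numpy as np
    from tmtoolkit.utils import mat2d_window_from_indices

    mat = np.arange(12).reshape(3, 4)
    mat2d_window_from_indices(mat, row_indices=[0, 2], col_indices=[1, 3])
    # -> array([[ 1,  3],
    #           [ 9, 11]])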

tmtoolkit.utils.merge_dict_sequences_inplace(a, b)

Given two sequences of equal length a and b, where each sequence contains only dicts, update the dicts in a with the corresponding dict from b.

a is updated in place, hence no value is returned from this function.

Parameters
  • a – a sequence of dicts where each dict will be updated

  • b – a sequence of dicts used for updating
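
A short illustration with arbitrary values (the expected result is shown as a comment):

    from tmtoolkit.utils import merge_dict_sequences_inplace

    a = [{'x': 1}, {'y': 2}]
    b = [{'x': 10, 'z': 3}, {}]
    merge_dict_sequences_inplace(a, b)
    # a is now [{'x': 10, 'z': 3}, {'y': 2}]; the function returns nothing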

tmtoolkit.utils.normalize_to_unit_range(values)

Linearly normalize a 1D NumPy array values (with at least two elements) to the range [0, 1].

The result is (x - min(x)) / (max(x) - min(x)) where x is values. Note that a ValueError is raised when max(x) - min(x) equals 0.

Parameters

values – 1D NumPy array with at least two values

Returns

values linearly normalized to range [0, 1]
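
A short illustration with arbitrary values (the expected result is shown as a comment):

    import numpy as np
    from tmtoolkit.utils import normalize_to_unit_range

    normalize_to_unit_range(np.array([3.0, 1.0, 2.0]))
    # -> array([1. , 0. , 0.5])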

tmtoolkit.utils.pickle_data(data, picklefile, **kwargs)

Save data in picklefile with Python’s pickle module.

Parameters
  • data – data to store in picklefile

  • picklefile – either target file path as string or file handle

  • kwargs – further parameters passed to pickle.dump()

tmtoolkit.utils.require_attrs(x, req_attrs, error_msg=None)

Check if x has all attributes listed in req_attrs. Raise a ValueError if the check fails.

Parameters
  • x – variable to check

  • req_attrs – required attributes as sequence of strings

  • error_msg – optional error message to use instead of default exception message

tmtoolkit.utils.require_dictlike(x)

Check if x has all attributes implemented that make it a dict-like data structure.

Parameters

x – variable to check

tmtoolkit.utils.require_listlike(x)

Check if x is a list, tuple or dict values sequence.

Parameters

x – variable to check

tmtoolkit.utils.require_listlike_or_set(x)

Check if x is a list, tuple, dict values sequence or set.

Parameters

x – variable to check

tmtoolkit.utils.require_types(x, valid_types, valid_types_str=(), error_msg=None)

Check if x is an instance of the types in valid_types or its type string representation is listed in valid_types_str. Raise a ValueError if x is not of the required type(s).

Parameters
  • x – variable to check

  • valid_types – types to check against

  • valid_types_str – optional string representations of types to check against

  • error_msg – optional error message to use instead of default exception message
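
The require_* helpers are meant for argument validation: they return nothing when a check passes and raise an error otherwise. A small sketch with illustrative arguments:

    import numpy as np
    from tmtoolkit.utils import require_listlike, require_types

    require_listlike([1, 2, 3])                        # passes silently
    require_types(np.array([1, 2, 3]), (np.ndarray,))  # passes silently
    require_types('text', (int, float))                # raises ValueError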

tmtoolkit.utils.unpickle_file(picklefile, **kwargs)

Load data from picklefile with Python’s pickle module.

Parameters
  • picklefile – either source file path as string or file handle

  • kwargs – further parameters passed to pickle.load()

Returns

data stored in picklefile
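
A round-trip sketch combining pickle_data() and unpickle_file() (the file path is only an example):

    from tmtoolkit.utils import pickle_data, unpickle_file

    data = {'labels': ['a', 'b'], 'values': [1, 2]}
    pickle_data(data, '/tmp/example_data.pickle')
    restored = unpickle_file('/tmp/example_data.pickle')
    assert restored == data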

tmtoolkit.utils.widen_chararray(arr, size)

Widen the maximum character length of a NumPy unicode character array to size characters and return a copy of arr with the adapted maximum character length. If the maximum length is already greater than or equal to size, return the input arr without any changes (arr won't be copied).

Parameters
  • arr – NumPy unicode character array

  • size – new maximum character length

Returns

NumPy unicode character array with adapted maximum character length if necessary
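
A short illustration with arbitrary values:

    import numpy as np
    from tmtoolkit.utils import widen_chararray

    arr = np.array(['foo', 'bar'])    # dtype '<U3'
    wide = widen_chararray(arr, 10)   # dtype '<U10'
    wide[0] = 'foobarbaz!'            # now fits without truncation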