API

tmtoolkit.bow

tmtoolkit.bow.bow_stats

Common statistics for bag-of-words (BoW) or sparse word representation models.

tmtoolkit.bow.bow_stats.codoc_frequencies(dtm, min_val=1, proportions=0)

Calculate the co-document frequency (aka word co-occurrence) matrix for a document-term matrix dtm, i.e. in how many documents each pair of tokens occurs together at least min_val times. If proportions is set, return proportions scaled by the number of documents instead of absolute counts.

See also

See pairwise_max_table for a convenient way to get the maximum token cooccurrences in tabular form.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • min_val – threshold for counting occurrences

  • proportions – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – convert input to dense matrix if necessary and return log(proportions + 1)

Returns:

co-document frequency (aka word co-occurrence) matrix with shape (vocab size, vocab size)
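
Example (an illustrative sketch, not from the original docs; the small count matrix is made up and a dense NumPy array is assumed to be accepted in place of a sparse matrix):

import numpy as np
from tmtoolkit.bow.bow_stats import codoc_frequencies

# hypothetical document-term matrix: 3 documents, 4 vocabulary terms, raw counts
dtm = np.array([[1, 0, 2, 0],
                [0, 1, 1, 0],
                [3, 0, 1, 1]])

cooc = codoc_frequencies(dtm)                        # absolute co-document frequencies, shape (4, 4)
cooc_prop = codoc_frequencies(dtm, proportions=1)    # scaled by the number of documents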

tmtoolkit.bow.bow_stats.doc_frequencies(dtm, min_val=1, proportions=0)

For each term in the vocab of dtm (i.e. its columns), return the number of documents in which it occurs at least min_val times.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • min_val – threshold for counting occurrences

  • proportions – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – return log of proportions

Returns:

NumPy array of size M (vocab size) indicating for each term in how many documents it occurs at least min_val times (as counts, proportions or log proportions, depending on proportions).
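
Example (illustrative sketch with a made-up dense count matrix, assuming dense input is accepted):

import numpy as np
from tmtoolkit.bow.bow_stats import doc_frequencies

# hypothetical document-term matrix: 3 documents, 3 vocabulary terms
dtm = np.array([[1, 0, 2],
                [0, 1, 1],
                [3, 0, 1]])

df = doc_frequencies(dtm)                        # e.g. the first term occurs in 2 of 3 documents
df_prop = doc_frequencies(dtm, proportions=1)    # document frequencies as proportions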

tmtoolkit.bow.bow_stats.doc_lengths(dtm)

Return the length, i.e. number of terms for each document in document-term-matrix dtm. This corresponds to the row-wise sums in dtm.

Parameters:

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns:

NumPy array of size N (number of docs) with integers indicating the number of terms per document

tmtoolkit.bow.bow_stats.idf(dtm, smooth_log=1, smooth_df=1)

Calculate inverse document frequency (idf) vector from raw count document-term-matrix dtm with formula log(smooth_log + N / (smooth_df + df)), where N is the number of documents, df is the document frequency (see function doc_frequencies), smooth_log and smooth_df are smoothing constants. With default arguments, the formula is thus log(1 + N/(1+df)).

Note that this may introduce NaN values due to division by zero when a document is of length 0.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • smooth_log – smoothing constant inside log()

  • smooth_df – smoothing constant to add to document frequency

Returns:

NumPy array of size M (vocab size) with inverse document frequency for each term in the vocab
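
Example (illustrative sketch with a made-up dense count matrix, assuming dense input is accepted):

import numpy as np
from tmtoolkit.bow.bow_stats import idf

# hypothetical document-term matrix: 3 documents, 3 vocabulary terms
dtm = np.array([[1, 0, 2],
                [0, 1, 1],
                [3, 0, 1]])

idf_vec = idf(dtm)   # log(1 + N/(1+df)) per vocabulary term; array of length 3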

tmtoolkit.bow.bow_stats.idf_probabilistic(dtm, smooth=1)

Calculate probabilistic inverse document frequency (idf) vector from raw count document-term-matrix dtm with formula log(smooth + (N - df) / df), where N is the number of documents and df is the document frequency (see function doc_frequencies).

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • smooth – smoothing constant (setting this to 0 can lead to -inf results)

Returns:

NumPy array of size M (vocab size) with probabilistic inverse document frequency for each term in the vocab

tmtoolkit.bow.bow_stats.sorted_terms(mat, vocab, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False, table_doc_labels=None)

For each row (i.e. document) in a (sparse) document-term-matrix mat, do the following:

  1. filter all values according to lo_thresh and hi_thresh

  2. sort values and the corresponding terms from vocab according to ascending

  3. optionally select the top top_n terms

  4. generate a list with pairs of terms and values

Return the collected lists for each row or convert the result to a data frame if document labels are passed via table_doc_labels (see shortcut function sorted_terms_table).

Parameters:
  • mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)

  • vocab – list or array of vocabulary corresponding to columns in mat

  • lo_thresh – if not None, filter for values greater than lo_thresh

  • hi_tresh – if not None, filter for values less than or equal to this threshold

  • top_n – if not None, select only the top top_n terms

  • ascending – sorting direction

  • table_doc_labels – optional list/array of document labels corresponding to mat rows

Returns:

list of lists with tuples (term, value), or data table with columns “doc”, “term”, “value” if table_doc_labels is given

tmtoolkit.bow.bow_stats.sorted_terms_table(mat, vocab, doc_labels, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False)

Shortcut function for sorted_terms which generates a data table with doc_labels.

Parameters:
  • mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)

  • vocab – list or array of vocabulary corresponding to columns in mat

  • doc_labels – list/array of document labels corresponding to mat rows

  • lo_thresh – if not None, filter for values greater than lo_thresh

  • hi_tresh – if not None, filter for values less than or equal to this threshold

  • top_n – if not None, select only the top top_n terms

  • ascending – sorting direction

Returns:

data table with columns “doc”, “term”, “value”
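
Example (illustrative sketch; the matrix, vocabulary and document labels are made up, and a dense tf-idf matrix is assumed to be accepted):

import numpy as np
from tmtoolkit.bow.bow_stats import tfidf, sorted_terms_table

dtm = np.array([[1, 0, 2],
                [0, 1, 1],
                [3, 0, 1]])
vocab = ['alpha', 'beta', 'gamma']
doc_labels = ['doc1', 'doc2', 'doc3']

# table with columns "doc", "term", "value": the two highest-weighted terms per document
top_terms = sorted_terms_table(tfidf(dtm), vocab, doc_labels, top_n=2)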

tmtoolkit.bow.bow_stats.term_frequencies(dtm, proportions=0)

Return the number of occurrences of each term in the vocab across all documents in document-term-matrix dtm. This corresponds to the column-wise sums in dtm.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • proportions – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – return log of proportions

Returns:

NumPy array of size M (vocab size) indicating the number of occurrences of each term in the vocab across all documents (as counts, proportions or log proportions, depending on proportions).

tmtoolkit.bow.bow_stats.tf_binary(dtm)

Transform raw count document-term-matrix dtm to binary term frequency matrix. This matrix contains 1 whenever a term occurred in a document, else 0.

Parameters:

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

Returns:

(sparse) binary term frequency matrix of type integer of size NxM

tmtoolkit.bow.bow_stats.tf_double_norm(dtm, K=0.5)

Transform raw count document-term-matrix dtm to the double-normalized term frequency matrix K + (1-K) * dtm / max{t in doc}, where max{t in doc} is a vector of size N containing the maximum term count per document.

Note that this may introduce NaN values due to division by zero when a document is of length 0.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • K – normalization factor

Returns:

double-normalized term frequency matrix of size NxM

tmtoolkit.bow.bow_stats.tf_log(dtm, log_fn=<ufunc 'log1p'>)

Transform raw count document-term-matrix dtm to log-normalized term frequency matrix log_fn(dtm).

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.

  • log_fn – log function to use; default is NumPy’s numpy.log1p, which calculates log(1 + x)

Returns:

(sparse) log-normalized term frequency matrix of size NxM

tmtoolkit.bow.bow_stats.tf_proportions(dtm)

Transform raw count document-term-matrix dtm to term frequency matrix with proportions, i.e. term counts normalized by document length.

Note that this may introduce NaN values due to division by zero when a document is of length 0.

Parameters:

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns:

(sparse) term frequency matrix of size NxM with proportions, i.e. term counts normalized by document length

tmtoolkit.bow.bow_stats.tfidf(dtm, tf_func=<function tf_proportions>, idf_func=<function idf>, **kwargs)

Calculate the tfidf (term frequency inverse document frequency) matrix from raw count document-term-matrix dtm via the matrix multiplication tf * diag(idf), where tf is the term frequency matrix tf_func(dtm) and idf is the inverse document frequency vector idf_func(dtm).

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • tf_func – function to calculate term-frequency matrix; see tf_* functions in this module

  • idf_func – function to calculate inverse document frequency vector; see idf_* functions in this module

  • kwargs – additional parameters passed to tf_func or idf_func, like K or smooth (depending on which parameters these functions accept)

Returns:

(sparse) tfidf matrix of size NxM
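
Example (illustrative sketch with a made-up dense count matrix; tf_log and idf_probabilistic are swapped in for the defaults to show the tf_func/idf_func parameters):

import numpy as np
from tmtoolkit.bow.bow_stats import tfidf, tf_log, idf_probabilistic

dtm = np.array([[1, 0, 2],
                [0, 1, 1],
                [3, 0, 1]])

mat_default = tfidf(dtm)   # tf_proportions(dtm) weighted by idf(dtm)
mat_custom = tfidf(dtm, tf_func=tf_log, idf_func=idf_probabilistic,
                   smooth=1)   # smooth is forwarded to idf_probabilistic (see kwargs above)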

tmtoolkit.bow.bow_stats.word_cooccurrence(dtm, min_val=1, proportions=0)

Calculate the co-document frequency (aka word co-occurrence) matrix. Alias for codoc_frequencies.

tmtoolkit.bow.dtm

Functions for creating a document-term matrix (DTM) and some compatibility functions for Gensim.

tmtoolkit.bow.dtm.create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=False, dtype=None)

Create a sparse document-term-matrix (DTM) as matrix in COO sparse format from vocabulary array vocab, a list of tokenized documents docs and the number of unique tokens across all documents n_unique_tokens.

The DTM’s rows correspond to the documents in docs (in the given order) and its columns correspond to the indices in vocab; hence a value DTM[j, k] is the term frequency of term vocab[k] in document j.

A note on performance: Creating the three arrays for a COO matrix seems to be the fastest way to generate a DTM. An alternative implementation using LIL format was ~2x slower.

Memory requirement: about 3 * <n_unique_tokens> * 4 bytes with default dtype (32-bit integer).

See also

This is the “low level” function. For the more convenient high-level function, see tmtoolkit.corpus.dtm, which also calculates n_unique_tokens.

Parameters:
  • vocab – list or array of the vocabulary; its size determines the number of columns of the resulting matrix

  • docs – a list of tokenized documents

  • n_unique_tokens – number of unique tokens across all documents

  • vocab_is_sorted – if True, assume that vocab is sorted when creating the token IDs

  • dtype – data type of the resulting matrix

Returns:

a sparse document-term-matrix in COO sparse format
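
Example (illustrative sketch; here n_unique_tokens is taken to be the summed number of unique tokens per document, i.e. the number of non-zero DTM entries implied by the memory note above, which is an assumption):

import numpy as np
from tmtoolkit.bow.dtm import create_sparse_dtm

# made-up tokenized documents
docs = [['a', 'b', 'a'], ['b', 'c'], ['a', 'c', 'c']]
vocab = np.array(sorted({t for d in docs for t in d}))    # array(['a', 'b', 'c'])
n_unique_tokens = sum(len(set(d)) for d in docs)          # assumed: non-zero entries across all docs

dtm = create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=True)
print(dtm.todense())   # COO matrix of shape (3, 3)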

tmtoolkit.bow.dtm.dtm_and_vocab_to_gensim_corpus_and_dict(dtm, vocab, as_gensim_dictionary=True)

Convert a (sparse) DTM and a vocabulary list to a Gensim Corpus object and Gensim Dictionary object or a Python dict.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • vocab – list or array of vocabulary

  • as_gensim_dictionary – if True create Gensim Dictionary from vocab, else create Python dict

Returns:

a 2-tuple with (Corpus object, Gensim Dictionary or Python dict)

tmtoolkit.bow.dtm.dtm_to_dataframe(dtm, doc_labels, vocab)

Convert a (sparse) DTM to a pandas DataFrame using document labels doc_labels as row index and vocab as column names.

Parameters:
  • dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

  • doc_labels – document labels used as row index (row names); size must equal number of rows in dtm

  • vocab – list or array of vocabulary used as column names; size must equal number of columns in dtm

Returns:

pandas DataFrame
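
Example (illustrative sketch; the dense matrix, labels and vocabulary are made up):

import numpy as np
from tmtoolkit.bow.dtm import dtm_to_dataframe

dtm = np.array([[2, 1, 0],
                [0, 1, 3]])

df = dtm_to_dataframe(dtm, doc_labels=['doc1', 'doc2'], vocab=['a', 'b', 'c'])
# pandas DataFrame with index ['doc1', 'doc2'] and columns ['a', 'b', 'c']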

tmtoolkit.bow.dtm.dtm_to_gensim_corpus(dtm)

Convert a (sparse) DTM to a Gensim Corpus object.

See also

gensim_corpus_to_dtm for the inverse function or dtm_and_vocab_to_gensim_corpus_and_dict which additionally creates a Gensim Dictionary.

Parameters:

dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts

Returns:

a Gensim gensim.matutils.Sparse2Corpus object

tmtoolkit.bow.dtm.gensim_corpus_to_dtm(corpus)

Convert a Gensim corpus object to a sparse DTM in COO format.

See also

dtm_to_gensim_corpus for the inverse function.

Parameters:

corpus – Gensim corpus object

Returns:

sparse DTM in COO format

tmtoolkit.bow.dtm.read_dtm_from_rds(path)

Load a document-term matrix with optional document labels and/or vocabulary from an RDS file. The RDS file must contain an R sparseMatrix or matrix.

Note

It’s highly recommended to store a sparse matrix when dealing with a large text corpus.

See also

Use save_dtm_to_rds to save a document-term matrix to an RDS file.

Parameters:

path (str | Path) – path to RDS file

Returns:

triplet with sparse or dense document-term matrix, optional document labels list, optional vocabulary list

Return type:

Tuple[csc_matrix | ndarray, List[str] | None, List[str] | None]

tmtoolkit.bow.dtm.save_dtm_to_rds(path, dtmat, doc_labels=None, vocab=None)

Save a document-term matrix along with document labels and/or vocabulary to an RDS file that can be imported with R. The RDS file will contain an R sparseMatrix or matrix with optional row and column names according to doc_labels and vocab.

Note

It’s highly recommended to store a sparse matrix when dealing with a large text corpus. Note that the sparse matrix is always represented with float values, so you may need to convert it to integer values in R.

See also

Use read_dtm_from_rds to load a document-term matrix from an RDS file.

Parameters:
  • path (str | Path) – path to RDS file

  • dtmat (ndarray | spmatrix) – sparse or dense document-term matrix

  • doc_labels (List[str] | None) – optional document labels; length must match number of rows in dtmat

  • vocab (List[str] | None) – optional vocabulary; length must match number of columns in dtmat

Return type:

None
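
Example round trip (illustrative sketch; 'dtm.RDS' is a hypothetical file path):

from scipy.sparse import csr_matrix
from tmtoolkit.bow.dtm import save_dtm_to_rds, read_dtm_from_rds

dtm = csr_matrix([[2, 1, 0],
                  [0, 1, 3]])
save_dtm_to_rds('dtm.RDS', dtm, doc_labels=['doc1', 'doc2'], vocab=['a', 'b', 'c'])

dtm_loaded, doc_labels, vocab = read_dtm_from_rds('dtm.RDS')   # values may come back as float (see note above)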

tmtoolkit.corpus

Corpus class and corpus functions

Module for processing text as token sequences in labelled documents. A set of documents is represented as corpus using the Corpus class. This sub-package also provides functions that work with a Corpus object.

Text parsing and processing relies on the SpaCy library which must be installed when using this sub-package.

class tmtoolkit.corpus.Corpus(docs=None, language=None, language_model=None, load_features=None, add_features=(), raw_preproc=None, spacy_token_attrs=None, spacy_instance=None, spacy_opts=None, punctuation=None, max_workers=None, workers_timeout=10)

The Corpus class represents text as string token sequences in labelled documents. It behaves like a Python dict, i.e. you can access document tokens via square brackets (corp['my_doc']).

SpaCy is used for text parsing and all documents are SpaCy Doc objects with special user data. The SpaCy documents can be accessed by using the spacydocs function. The SpaCy instance can be accessed via the nlp property. Many more properties are defined in the Corpus class.

The Corpus class allows to attach attributes (or “meta data”) to documents and individual tokens inside documents. This can be done using the set_document_attr and set_token_attr functions.

Because of the functional programming approach used in tmtoolkit, this class doesn’t implement any methods besides special Python “dunder” methods to provide dict-like behaviour and (deep)-copy functionality. Functions that operate on Corpus objects are defined in the corpus module.

Parallel processing is implemented for many tasks in order to improve processing speed with large text corpora when multiple processors are available. Parallel processing can be enabled by setting the max_workers argument or max_workers property to the respective number or proportion of CPUs to be used. A Reusable Process Pool Executor from the loky package is used for job scheduling. It can be accessed via the procexec property.

Parameters:
  • docs (Optional[Union[Dict[str, str], Sequence[Document]]]) –

  • language (Optional[str]) –

  • language_model (Optional[str]) –

  • load_features (Optional[Collection[str]]) –

  • add_features (Collection[str]) –

  • raw_preproc (Optional[Union[Callable, Sequence[Callable]]]) –

  • spacy_token_attrs (Optional[Collection[str]]) –

  • spacy_instance (Optional[spacy.Language]) –

  • spacy_opts (Optional[dict]) –

  • punctuation (Optional[Sequence[str]]) –

  • max_workers (Optional[Union[int, float]]) –

  • workers_timeout (int) –

__init__(docs=None, language=None, language_model=None, load_features=None, add_features=(), raw_preproc=None, spacy_token_attrs=None, spacy_instance=None, spacy_opts=None, punctuation=None, max_workers=None, workers_timeout=10)

Create a new Corpus class using raw text data (i.e. the document text as string) from the dict docs that maps document labels to document text.

The documents will be parsed right away using a newly generated SpaCy instance or one that is provided via spacy_instance. If no spacy_instance is given, either language or language_model must be given.

Parameters:
  • docs (Dict[str, str] | Sequence[Document] | None) – either dict mapping document labels to document text strings or a sequence of Document objects

  • language (str | None) – documents language as two-letter ISO 639-1 language code; will be used to load the appropriate SpaCy language model if language_model is not set

  • language_model (str | None) –

    SpaCy language model to be loaded if neither language nor spacy_instance is given

  • spacy_instance (Language | None) – a SpaCy Language text-processing pipeline; set this if you want to use your already loaded pipeline, otherwise specify either language or language_model

  • load_features (Collection[str] | None) – SpaCy pipeline components to load; see spacy.load; only effective when not providing your own spacy_instance; the special feature “vectors” determines the default language model to load if no language_model is given; by default, the set given by the model’s “pipeline” meta information is used, except for NER

  • add_features (Collection[str]) – shortcut for providing pipeline components additional to the default list in load_features

  • spacy_token_attrs (Collection[str] | None) – SpaCy token attributes to be loaded from each parsed document; see attributes list for spacy.Token

  • spacy_opts (dict | None) –

    other SpaCy pipeline parameters passed to spacy.load; only effective when not providing your own spacy_instance

  • punctuation (Sequence[str] | None) – provide custom punctuation characters list or use default list from string.punctuation and common whitespace characters

  • max_workers (int | float | None) – number of worker processes used for parallel processing; set to None, 0 or 1 to disable parallel processing; set to positive integer to use up to this amount of worker processes; set to negative integer to use all available CPUs except for this amount; set to float in interval [0, 1] to use this proportion of available CPUs

  • workers_timeout (int) – timeout in seconds until worker processes are stopped

  • raw_preproc (Callable | Sequence[Callable] | None) –

Return type:

None
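
Example (illustrative sketch; assumes an English SpaCy language model is installed and uses made-up document labels):

from tmtoolkit.corpus import Corpus

corp = Corpus({'doc1': 'A simple example document.',
               'doc2': 'Another short document.'},
              language='en')

print(corp.doc_labels)   # ['doc1', 'doc2']
print(corp['doc1'])      # Document object for 'doc1'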

bimaps: Dict[str, bidict]

bijective maps (bidirectional dictionaries) for each token attribute that is represented with hashes

property custom_token_attrs_defaults: Dict[str, Any]

Return dict of available custom token attributes along with their default values.

property doc_attrs: Tuple[str, ...]

Return list of available document attributes.

property doc_attrs_defaults: Dict[str, Any]

Return dict of available document attributes along with their default values.

property doc_labels: List[str]

Return document label names.

classmethod from_builtin_corpus(corpus_label, **kwargs)

Construct Corpus object by loading one of the built-in datasets specified by corpus_label. To get a list of available built-in datasets, use builtin_corpora_info.

Parameters:
  • corpus_label – the corpus to load (one of the labels listed in builtin_corpora_info)

  • kwargs – override arguments of Corpus constructor

Returns:

Corpus instance

Return type:

Corpus
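
Example (illustrative sketch; the first label returned by builtin_corpora_info is chosen only for demonstration):

from tmtoolkit.corpus import Corpus, builtin_corpora_info

labels = builtin_corpora_info()    # list of available built-in dataset labels
corp = Corpus.from_builtin_corpus(labels[0], max_workers=2)   # constructor args overridden via kwargs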

classmethod from_files(files, **kwargs)

Construct Corpus object by loading files. Pass arguments for Corpus initialization and file loading as keyword arguments via kwargs. See __init__ for Corpus constructor arguments and corpus_add_files for file loading arguments.

Parameters:

files (str | Collection[str] | Dict[str, str]) – single file path string or sequence of file paths or dict mapping document label to file path

Returns:

Corpus instance

Return type:

Corpus

classmethod from_folder(folder, **kwargs)

Construct Corpus object by loading files from a folder folder. Pass arguments for Corpus initialization and file loading as keyword arguments via kwargs. See __init__ for Corpus constructor arguments and corpus_add_folder for file loading arguments.

Parameters:

folder (str) – folder from where the files are read

Returns:

Corpus instance

Return type:

Corpus

classmethod from_tabular(files, **kwargs)

Construct Corpus object by loading documents from a tabular file, i.e. CSV or Excel file. Pass arguments for Corpus initialization and file loading as keyword arguments via kwargs. See __init__ for Corpus constructor arguments and corpus_add_tabular for file loading arguments.

Parameters:

files (str | Collection[str]) – single string or list of strings with path to file(s) to load

Returns:

Corpus instance

Return type:

Corpus

classmethod from_zip(zipfile, **kwargs)

Construct Corpus object by loading files from a ZIP file. Pass arguments for Corpus initialization and file loading as keyword arguments via kwargs. See __init__ for Corpus constructor arguments and corpus_add_zip for file loading arguments.

Parameters:

zipfile (str) – path to ZIP file to be loaded

Returns:

Corpus instance

Return type:

Corpus

get(*args)

Dict method to retrieve a specific document like corpus.get(<doc_label>, <default>).

Returns:

token sequence

Return type:

Document

property has_sents: bool

Return True if information on sentence borders was parsed for documents in this corpus, else return False.

items()

Dict method to retrieve pairs of document labels and their Document objects.

Returns:

pairs of document labels and their Document objects

Return type:

ItemsView[str, Document]

keys()

Dict method to retrieve document labels.

Returns:

document labels

Return type:

KeysView[str]

property language: str

Return Corpus language as two-letter ISO 639-1 language code.

property language_model: str

Return name of the language model that was loaded.

property max_workers

Return the number of worker processes for parallel processing.

property n_docs: int

Same as __len__.

property ngrams: int

Return n-gram setting, e.g. 1 if Corpus is set up for unigrams, 2 if set up for bigrams, etc.

property ngrams_join_str: str

Return string that is used for joining n-grams.

nlp: Language | None

SpaCy Language instance

print_summary_default_max_documents: int

max. number of documents to display in corpus_summary

print_summary_default_max_tokens_string_length: int

max. number of characters to display in corpus_summary for document tokens

procexec: ProcessPoolExecutor | None

Reusable Process Pool Executor from the loky package used for job scheduling

punctuation: Sequence[str]

sequence of punctuation characters

property spacy_token_attrs: Tuple[str, ...]

Return tuple of available SpaCy token attributes.

property token_attrs: Tuple[str, ...]

Return tuple of available token attributes (SpaCy attributes like “pos” or “lemma” and custom attributes).

update(new_docs)

Dict method for inserting new documents or updating existing documents, given either as a dict mapping document labels to text, SpaCy Doc objects or Document objects, or as a sequence of Document objects.

Parameters:

new_docs (Dict[str, str | Doc | Document] | Sequence[Document]) –

dict mapping document labels to text, SpaCy Doc objects or Document objects; or sequence of Document objects

property uses_unigrams: bool

Return True when this Corpus is set up for unigram tokens, i.e. ngrams is 1.

values()

Dict method to retrieve Document objects.

Returns:

Document objects

Return type:

ValuesView[Document]

property workers_docs: List[List[str]]

When N is the number of worker processes for parallel processing, return list of size N with each item being a list of document labels for the respective worker process. Returns an empty list when parallel processing is disabled.

workers_timeout: int

timeout in seconds until worker processes are stopped (used for parallel processing)

class tmtoolkit.corpus.Document(bimaps, label, has_sents, tokenmat, tokenmat_attrs, custom_token_attrs=None, doc_attrs=None)

A class that represents text as sequence of tokens. Attributes are also implemented at two levels:

  1. Document attributes like the document label (document name);

  2. Token attributes (e.g. POS, lemma, etc.)

Token attributes are further divided into “standard” or “SpaCy” token attributes and custom attributes. The former are represented as 64-bit unsigned integer hash values and are stored in a “token matrix” in which rows represent tokens and columns represent token attributes. The token hash itself is also stored in this matrix as the “token” attribute. The custom token attributes are stored as NumPy arrays of any dtype.

Parameters:
  • bimaps (Optional[Dict[str, bidict]]) –

  • label (str) –

  • has_sents (bool) –

  • tokenmat (np.ndarray) –

  • tokenmat_attrs (Sequence[str]) –

  • custom_token_attrs (Optional[Dict[str, Union[Sequence, np.ndarray]]]) –

  • doc_attrs (Optional[Dict[str, Any]]) –

__init__(bimaps, label, has_sents, tokenmat, tokenmat_attrs, custom_token_attrs=None, doc_attrs=None)

Create a new Document object that uses the bidirectional dictionaries in bimaps for hash <-> text conversion, has a document label label, has sentences recognized (has_sents) and has a token matrix tokenmat.

Parameters:
  • bimaps (Dict[str, bidict] | None) – bidirectional dictionaries for hash <-> text conversion of data in tokenmat

  • label (str) – document label (document name)

  • has_sents (bool) – if True, this document supports sentences

  • tokenmat (ndarray) – token matrix as uint64 matrix of shape (N, M) for N tokens and with M attributes; the data is not copied

  • tokenmat_attrs (Sequence[str]) – names of token attributes in tokenmat with respect to column order

  • custom_token_attrs (Dict[str, Sequence | ndarray] | None) – additional custom token attributes

  • doc_attrs (Dict[str, Any] | None) – document attributes

property has_sents: bool

Whether information on sentence borders is contained in this document.

Returns:

True if information on borders of sentences is contained in this document, else False

property label: str

Document label (document name).

property token_attrs: List[str]

Retrieve list of token attribute names (standard and custom attributes).

Returns:

list of token attribute names

tmtoolkit.corpus.builtin_corpora_info(with_paths=False)

Return list/dict of available built-in corpora.

Parameters:

with_paths (bool) – if True, return dict mapping corpus label to absolute path to dataset, else return only a list of corpus labels

Returns:

dict or list, depending on with_paths

Return type:

List[str] | Dict[str, str]

tmtoolkit.corpus.corpus_add_files(docs, files, encoding='utf8', doc_label_fmt='{path}-{basename}', doc_label_path_join='_', read_size=-1, sample=None, force_unix_linebreaks=True, inplace=True)

Read text documents from files passed in files and add them to the corpus. If files is a dict, the dict keys represent the document labels. Otherwise, the document label for each new document is determined via format string doc_label_fmt.

Parameters:
  • docs (Corpus) – a Corpus object

  • files (str | Collection[str] | Dict[str, str]) – single file path string or sequence of file paths or dict mapping document label to file path

  • encoding (str) – character encoding of the files

  • doc_label_fmt (str) – document label format string with placeholders “path”, “basename”, “ext”

  • doc_label_path_join (str) – string with which to join the components of the file paths

  • custom_doc_labels – instead of generating document labels from doc_label_fmt, pass a list of document labels to be used directly

  • read_size (int) – max. number of characters to read. -1 means read full file.

  • sample (int | None) – if given, draw random sample of size sample from files (without replacement)

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_add_folder(docs, folder, valid_extensions=('txt',), encoding='utf8', strip_folderpath_from_doc_label=True, doc_label_fmt='{path}-{basename}', doc_label_path_join='_', read_size=-1, sample=None, force_unix_linebreaks=True, inplace=True)

Read documents residing in folder folder and ending on file extensions specified via valid_extensions and add these to the corpus. This is done recursively, i.e. documents are also loaded from sub-folders inside folder.

Note that only raw text files can be read, not PDFs, Word documents, etc. These must be converted to raw text files beforehand, for example with pdftotext (poppler-utils package) or pandoc.

Parameters:
  • docs (Corpus) – a Corpus object

  • folder (str) – folder from where the files are read

  • valid_extensions (Collection[str]) – collection of valid file extensions like .txt, .md, etc.

  • encoding (str) – character encoding of the files

  • strip_folderpath_from_doc_label (bool) – if True, do not include the folder path in the document label

  • doc_label_fmt (str) – document label format string with placeholders “path”, “basename”, “ext”

  • doc_label_path_join (str) – string with which to join the components of the file paths

  • read_size (int) – max. number of characters to read. -1 means read full file.

  • sample (int | None) – if given, draw random sample of size sample from all loaded files

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_add_tabular(docs, files, id_column, text_column, prepend_columns=None, encoding='utf8', doc_label_fmt='{basename}-{id}', sample=None, force_unix_linebreaks=True, pandas_read_opts=None, inplace=True)

Add documents from tabular (CSV or Excel) file(s) to the corpus.

Parameters:
  • docs (Corpus) – a Corpus object

  • files (str | Collection[str]) – single string or list of strings with path to file(s) to load

  • id_column (str | int) – column name or column index of document identifiers

  • text_column (str | int) – column name or column index of document texts

  • prepend_columns (Sequence[str] | None) – if not None, pass a list of columns whose contents should be added before the document text, e.g. ['title', 'subtitle']

  • encoding (str) – character encoding of the files

  • doc_label_fmt (str) – document label format string with placeholders "basename", "id" (document ID), and "row_index" (dataset row index)

  • sample (int | None) – if given, draw random sample of size sample from all text data

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks in texts

  • pandas_read_opts (Dict[str, Any] | None) – additional arguments passed to pandas.read_csv or pandas.read_excel

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
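
Example (illustrative sketch; 'articles.csv' and its column names are hypothetical, and an English SpaCy model is assumed to be installed):

from tmtoolkit.corpus import Corpus, corpus_add_tabular

corp = Corpus(language='en')    # start with an empty corpus
corpus_add_tabular(corp, 'articles.csv',
                   id_column='article_id', text_column='text',
                   prepend_columns=['title'])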

tmtoolkit.corpus.corpus_add_zip(docs, zipfile, valid_extensions=('txt', 'csv', 'xls', 'xlsx'), encoding='utf8', doc_label_fmt_txt='{path}-{basename}', doc_label_path_join='_', doc_label_fmt_tabular='{basename}-{id}', sample=None, force_unix_linebreaks=True, add_files_opts=None, add_tabular_opts=None, inplace=True)

Add documents from a ZIP file. The ZIP file may include documents with extensions listed in valid_extensions.

For file extensions ‘csv’, ‘xls’ or ‘xlsx’ corpus_add_tabular will be called. Make sure to pass at least the parameters id_column and text_column via add_tabular_opts if your ZIP contains such files.

For all other file extensions corpus_add_files will be called.

Parameters:
  • docs (Corpus) – a Corpus object

  • zipfile (str) – path to ZIP file to be loaded

  • valid_extensions (Collection[str]) – list of valid file extensions of ZIP file members; all other members will be ignored

  • encoding (str) – character encoding of the files

  • doc_label_fmt_txt (str) – document label format for non-tabular files; string with placeholders "path", "basename", "ext"

  • doc_label_path_join (str) – string with which to join the components of the file paths

  • doc_label_fmt_tabular (str) – document label format string for tabular files; placeholders "basename", "id" (document ID), and "row_index" (dataset row index)

  • sample (int | None) – if given, draw random sample of size sample from all text data

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks in texts

  • add_files_opts (Dict[str, Any] | None) – additional arguments passed to corpus_add_files

  • add_tabular_opts (Dict[str, Any] | None) – additional arguments passed to corpus_add_tabular

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_collocations(docs, select=None, threshold=None, min_count=1, embed_tokens_min_docfreq=None, embed_tokens_set=None, statistic=<function ppmi>, return_statistic=True, rank='desc', as_table=True, glue=' ', **statistic_kwargs)

Identify token collocations in the corpus docs. Collocations are tokens that occur together in a series frequently (i.e. more than would be expected by chance).

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • threshold (float | None) – minimum statistic value for a collocation to enter the results; if None, results are not filtered

  • min_count (int) – ignore collocations with number of occurrences below this threshold

  • embed_tokens_min_docfreq (int | float | None) – dynamically generate the set of embed_tokens used when calling token_collocations by using a minimum document frequency (see doc_frequencies); if this is an integer, it is used as absolute count, if it is a float, it is used as proportion

  • embed_tokens_set (Set | None) – tokens that, if occurring inside an n-gram, are not counted; see ngrams

  • statistic (Callable) – function to calculate the statistic measure from the token counts; use one of the [n]pmi[2,3] functions provided in the tokenseq module or provide your own function which must accept parameters n_x, n_y, n_xy, n_total; see pmi for more information

  • return_statistic (bool) – also return computed statistic

  • rank (str | None) – if not None, rank the results according to the computed statistic in ascending (rank='asc') or descending (rank='desc') order

  • as_table (bool) – return result as dataframe with columns “collocation” and optionally “statistic”

  • glue (str) – if not None, provide a string that is used to join the collocation tokens; must be set if as_table is True

  • statistic_kwargs – additional arguments passed to statistic function

Returns:

if as_table is True, a dataframe with columns “collocation” and optionally “statistic”; else same output as token_collocations, i.e. list of tuples (collocation tokens, score) if return_statistic is True, otherwise only a list of collocations

Return type:

DataFrame | List[tuple | str]
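
Example (illustrative sketch; the tiny corpus is made up and the resulting collocations depend on the loaded model):

from tmtoolkit.corpus import Corpus, corpus_collocations

corp = Corpus({'doc1': 'New York is big. New York is a city. I like New York.'},
              language='en')
colloc = corpus_collocations(corp, min_count=2)
# dataframe with columns "collocation" and "statistic", e.g. containing "New York"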

tmtoolkit.corpus.corpus_join_documents(docs, /, join, glue='\n\n', sort_document_labels=True, match_type='exact', ignore_case=False, glob_method='match', doc_opts=None, inplace=True)

Join documents using the document labels or patterns for document labels in join. For each entry in join, the document labels in docs are matched against a provided pattern. This may be a string or a list of strings either for exact matching (default) or pattern matching (controlled via match_type). If no match is found for an entry in join, no joint document is generated.

# example: generate joint document named "joined-tweets-foo" with all documents
# whose labels start with "tweets-foo"
corpus_join_documents(corp, {'joined-tweets-foo': 'tweets-foo*'}, match_type='glob')

# alternatively specify a list of documents to match, this time using exact matching
corpus_join_documents(corp, {'joined-tweets-foo': ['tweets-foo-1',
                                                   'tweets-foo-2',
                                                   'tweets-foo-3']})
Parameters:
  • docs (Corpus) – a Corpus object

  • join (Dict[str, List[str] | str]) – dictionary that maps a name for the newly joint document to a string pattern or a list of string patterns of documents to be joint

  • glue (str) – string used for concatenating the documents

  • sort_document_labels (bool) – if True, sort the matched document labels before joining the documents

  • match_type (str) – the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • doc_opts (Dict[str, Any] | None) – keyword arguments passed to Document constructor when creating a joint document

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_ngramify(docs, /, n, join_str=' ', inplace=True)

Set the Corpus docs to handle tokens as n-grams.

Parameters:
  • docs (Corpus) – a Corpus object

  • n (int) – size of the n-grams to generate

  • join_str (str) – string to join n-grams

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_num_chars(docs, select=None)

Return the number of characters (excluding whitespace) in a Corpus docs.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

Returns:

number of characters

Return type:

int

tmtoolkit.corpus.corpus_num_tokens(docs, select=None)

Return the number of tokens in a Corpus docs.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

Returns:

number of tokens

Return type:

int

tmtoolkit.corpus.corpus_retokenize(docs, collapse=' ', inplace=True)

Parse the corpus again using the current – possibly modified – tokens, but the same NLP pipeline as before.

Note

This function is useful when you modified the corpus’ tokens, e.g. by removing punctuation characters or transforming to lower-case characters, which has influence on token attributes like POS tags when parsing the corpus again. Already specified custom document and token attributes will be removed when applying this function.

Parameters:
  • docs (Corpus) – a Corpus object

  • collapse (str | None) – if None, use whitespace token attribute for collapsing tokens, otherwise use custom string

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_sample(docs, /, n, inplace=True)

Generate a sample of n documents of corpus docs. Sampling occurs without replacement, hence n must be smaller than or equal to len(docs).

Parameters:
  • docs (Corpus) – a Corpus object

  • n (int) – sample size; must be in range [1, len(docs)]

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_split_by_paragraph(docs, /, paragraph_linebreaks=2, linebreak_str='\n', new_doc_label_fmt='{doc}-{num}', force_unix_linebreaks=True, inplace=True)

Split documents in corpus by paragraphs and set the resulting documents as new corpus. Paragraphs are divided by a number paragraph_linebreaks of line breaks (given as linebreak_str).

See also

See corpus_split_by_token, which allows splitting documents by any token.

Parameters:
  • docs (Corpus) – a Corpus object

  • paragraph_linebreaks (int) – number of subsequent line breaks to start a new paragraph

  • linebreak_str (str) – string used for line breaks

  • new_doc_label_fmt (str) – document label format string with placeholders “doc” and “num” (split number)

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
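
Example (illustrative sketch with a made-up two-paragraph document):

from tmtoolkit.corpus import Corpus, corpus_split_by_paragraph, doc_labels

corp = Corpus({'doc1': 'First paragraph.\n\nSecond paragraph.'}, language='en')
corpus_split_by_paragraph(corp)   # modifies corp in place
print(doc_labels(corp))           # e.g. ['doc1-1', 'doc1-2']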

tmtoolkit.corpus.corpus_split_by_token(docs, /, split, new_doc_label_fmt='{doc}-{num}', force_unix_linebreaks=True, inplace=True)

Split documents in corpus by token split and set the resulting documents as new corpus.

See also

See corpus_split_by_paragraph for a shortcut for splitting by paragraph, which is a common use case.

Parameters:
  • docs (Corpus) – a Corpus object

  • split (str) – string used for splitting documents

  • new_doc_label_fmt (str) – document label format string with placeholders “doc” and “num” (split number)

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.corpus_summary(docs, select=None, max_documents=None, max_tokens_string_length=None)

Generate a summary of this object, i.e. the first tokens of each document and some summary statistics.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • max_documents (int | None) – maximum number of documents to print; None uses default value 10; set to -1 to print all documents; this setting is disabled if select is not None

  • max_tokens_string_length (int | None) – maximum string length of concatenated tokens for each document; None uses default value 50; set to -1 to print complete documents

Returns:

summary as string

Return type:

str

tmtoolkit.corpus.corpus_tokens_flattened(docs, select=None, sentences=False, by_attr=None, tokens_as_hashes=False, as_array=False, force_unigrams=False)

Return tokens (or token hashes) from docs as flattened list, simply concatenating all documents.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • sentences (bool) – divide results into sentences; if True, the result will consist of a list of sentences

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • tokens_as_hashes (bool) – passed to doc_tokens; if True, return token hashes instead of string tokens

  • as_array (bool) – if True, return NumPy array instead of list

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

Returns:

list or NumPy array (depending on as_array) of token strings or hashes (depending on tokens_as_hashes); if sentences is True, the result is a list of sentences that in turn are token lists/arrays

Return type:

list | ndarray

tmtoolkit.corpus.corpus_unique_chars(docs, select=None)

Return the set of characters used in a Corpus docs.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

Returns:

set of characters

Return type:

Set[str]

tmtoolkit.corpus.deserialize_corpus(serialized_corpus_data)

Deserialize a Corpus object from a dict. The inverse operation is implemented in serialize_corpus.

Parameters:

serialized_corpus_data (dict) – Corpus data serialized as dict

Returns:

a Corpus object

Return type:

Corpus

tmtoolkit.corpus.doc_frequencies(docs, select=None, by_attr=None, tokens_as_hashes=False, force_unigrams=False, proportions=Proportion.NO, as_table=False)

Document frequency per vocabulary token as dict with token to document frequency mapping. Document frequency is the number of documents in which a token occurs at least once. Example with absolute document frequencies:

doc tokens
--- ------
A   z, z, w, x
B   y, z, y
C   z, z, y, z

document frequency df(z) = 3  (occurs in all 3 documents)
df(x) = df(w) = 1 (occurs only in A)
df(y) = 2 (occurs in B and C)
...
Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

  • proportions (Proportion) – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – return log10 of proportions

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict mapping token to document frequency or dataframe if as_table is active

Return type:

Dict[str | int, int | float] | DataFrame

tmtoolkit.corpus.doc_labels(docs, sort=True)

Return list of the documents’ labels.

Parameters:
  • docs (Corpus) – a Corpus object

  • sort (bool) – if True, return as sorted list

Returns:

list of the documents’ labels

Return type:

List[str]

tmtoolkit.corpus.doc_labels_sample(docs, n)

Generate random sample of document labels from docs with sample size n.

Parameters:
  • docs (Corpus) – a Corpus object

  • n (int) – sample size; must be in interval [0, len(docs)]

Returns:

set of sampled document labels

Return type:

Set[str]

tmtoolkit.corpus.doc_lengths(docs, select=None, as_table=False)

Return the document length (number of tokens in the document) for each document.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict of document lengths per document label or dataframe if as_table is active

Return type:

Dict[str, int] | DataFrame

tmtoolkit.corpus.doc_num_sents(docs, select=None, as_table=False)

Return number of sentences for each document.

Note

This number may be unreliable after filtering tokens in the corpus, since a filter may remove the starting tokens of sentences.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict with number of sentences per document label or dataframe if as_table is active

Return type:

Dict[str, int] | DataFrame

tmtoolkit.corpus.doc_sent_lengths(docs, select=None)

Return sentence lengths (number of tokens of each sentence) for each document.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

Returns:

dict with list of sentence lengths per document label

Return type:

Dict[str, List[int]]

tmtoolkit.corpus.doc_texts(docs, select=None, by_attr=None, collapse=None, n_tokens=None, as_table=False)

Return reconstructed document text from documents in docs. By default, uses whitespace token attribute to collapse tokens to document text, otherwise custom collapse string.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying the documents to fetch

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • collapse (str | None) – if None, use whitespace token attribute for collapsing tokens, otherwise use custom string

  • n_tokens (int | None) – max. number of tokens to retrieve from each document; if None (default), retrieve all tokens

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict with reconstructed document text per document label or dataframe if as_table is active

Return type:

Dict[str, str] | DataFrame

tmtoolkit.corpus.doc_token_lengths(docs, select=None)

Return token lengths (number of characters of each token) for each document.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

Returns:

dict with list of token lengths per document label

Return type:

Dict[str, List[int]]

tmtoolkit.corpus.doc_tokens(docs, select=None, sentences=False, only_non_empty=False, by_attr=None, with_attr=False, tokens_as_hashes=False, n_tokens=None, as_tables=False, as_arrays=False, force_unigrams=False)

Retrieve documents’ tokens from a Corpus or dict of SpaCy documents. Optionally also retrieve document and token attributes.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying the documents to fetch

  • sentences (bool) – divide results into sentences; if True, each document will consist of a list of sentences which in turn contain a list or array of tokens

  • only_non_empty (bool) – if True, only return non-empty result documents

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • with_attr (bool | str | Sequence[str]) – also return document and token attributes along with each token; if True, returns all default attributes and custom defined attributes; if string, return this specific attribute; if sequence, returns attributes specified in this sequence

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • n_tokens (int | None) – max. number of tokens to retrieve from each document; if None (default), retrieve all tokens

  • as_tables (bool) – return result as dataframe with tokens and document and token attributes in columns

  • as_arrays (bool) – return result as NumPy arrays instead of lists

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

Returns:

by default, a dict mapping document labels to document tokens data, which can take different forms depending on the arguments passed to this function: (1) list of token strings or hash integers; (2) NumPy array of token strings or hash integers; (3) dict containing a "token" key with values from (1) or (2) plus document and token attributes with their values as list or NumPy array; (4) dataframe with tokens and document and token attributes in columns; if select is a single string, not a dict of documents but a single document in one of the four forms described before is returned; if sentences is True, another list level representing sentences is added

Return type:

Dict[str, List[int | str] | List[List[int | str]] | ndarray | List[ndarray] | Dict[str, list | ndarray] | List[Dict[str, list | ndarray]] | DataFrame] | List[int | str] | List[List[int | str]] | ndarray | List[ndarray] | Dict[str, list | ndarray] | List[Dict[str, list | ndarray]] | DataFrame
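
Example (illustrative sketch; the exact tokenization depends on the loaded SpaCy model and POS tagging is assumed to be enabled in the pipeline):

from tmtoolkit.corpus import Corpus, doc_tokens

corp = Corpus({'doc1': 'A tiny example.'}, language='en')

toks = doc_tokens(corp)                         # e.g. {'doc1': ['A', 'tiny', 'example', '.']}
toks_pos = doc_tokens(corp, with_attr='pos')    # tokens together with their POS tags
only_doc1 = doc_tokens(corp, select='doc1')     # a single document's tokens, not a dict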

tmtoolkit.corpus.doc_vectors(docs, select=None, collapse=None, omit_empty=False)

Return a vector representation for each document in docs. The vector representation’s size corresponds to the vector width of the language model that is used (usually 300).

Note

docs can be either a Corpus object or dict of SpaCy Doc objects. If it is a Corpus object, it must use a SpaCy language model with word vectors (i.e. an _md or _lg model). If the corpus was transformed, especially if tokens were removed, then you should set collapse to " ". Otherwise tokens may be joined because of missing whitespace between them.

Parameters:
  • docs (Corpus | Dict[str, Doc]) – a Corpus object or dict mapping document labels to SpaCy Doc objects

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • collapse (str | None) – if None, use whitespace token attribute for collapsing tokens, otherwise use custom string

  • omit_empty (bool) – omit empty documents

Returns:

dict mapping document label to vector representation of the document

Return type:

Dict[str, ndarray]

tmtoolkit.corpus.document_from_attrs(bimaps, vocab, label, tokens_w_attr, sentences, doc_attr_names=None, token_attr_names=None)

Create a new Document object from tokens with attributes in tokens_w_attr.

Parameters:
  • bimaps (Dict[str, bidict]) – bidirectional dictionaries for hash <-> text conversion

  • label (str) – document label

  • tokens_w_attr (Dict[str, list | ndarray]) – dictionary mapping attribute names to attribute values

  • sentences (bool) – if True, tokens_w_attr contains data split by sentences, else sentences are not split

  • doc_attr_names (Collection[str] | None) – names of keys in tokens_w_attr that are assumed to be document attributes

  • token_attr_names (Collection[str] | None) – names of keys in tokens_w_attr that are assumed to be token attributes

  • vocab (Vocab) –

Returns:

Document object with data from tokens_w_attr

Return type:

Document

tmtoolkit.corpus.document_token_attr(d, attr='token', default=None, sentences=False, n=None, ngrams=1, ngrams_join=' ', as_hashes=False, as_array=False)

Retrieve one or more token attributes given as attr from a Document object d.

Parameters:
  • d (Document) – Document object

  • attr (Union[str, Sequence[str]]) – either single token attribute name or a sequence of token attribute names

  • default (Optional[Any, Dict[str, Any]]) – default value if a token attribute doesn’t exist

  • sentences (bool) – divide result into sentences

  • n (Optional[int]) – max. number of tokens to retrieve from each document; if None (default), retrieve all tokens

  • ngrams (int) – form n-grams if ngrams > 1

  • ngrams_join (str) – use this string to join the n-grams if ngrams > 1

  • as_hashes (bool) – return hashes instead of textual representations

  • as_array (bool) – return NumPy arrays instead of lists

Returns:

if a single token attribute is given as attr, return a list, a NumPy array or a list of lists or NumPy arrays depending on as_array and sentences; if multiple token attributes are given, return a dictionary mapping the token attribute name to the respective result

Return type:

Union[list, List[list], np.ndarray, List[np.ndarray], Dict[str, list], Dict[str, List[list]], Dict[str, np.ndarray], Dict[str, List[np.ndarray]]]

tmtoolkit.corpus.dtm(docs, select=None, by_attr=None, as_table=False, tokens_as_hashes=False, dtype=None, return_doc_labels=False, return_vocab=False)

Generate and return a sparse document-term matrix (or alternatively a dataframe) of shape (n_docs, n_vocab) where n_docs is the number of documents and n_vocab is the vocabulary size.

The rows of the matrix correspond to the sorted document labels, the columns of the matrix correspond to the sorted vocabulary of docs. Using return_doc_labels and/or return_vocab, you can additionally return these two lists.

Warning

Setting as_table to True will return dense data, which means that it may require a lot of memory.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • as_table (bool) – return result as dense pandas DataFrame

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings) in the vocabulary

  • dtype (str | dtype | None) – use a specific matrix dtype; otherwise dtype will be int32

  • return_doc_labels (bool) – if True, additionally return sorted document labels that correspond to the rows of the document-term matrix

  • return_vocab (bool) – if True, additionally return the sorted vocabulary that corresponds to the columns of the document-term matrix

Returns:

document-term matrix as sparse matrix or dense dataframe; additionally sorted document labels and/or sorted vocabulary if return_doc_labels and/or return_vocab is True

Return type:

csr_matrix | DataFrame | Tuple[csr_matrix | DataFrame, List[int | str]] | Tuple[csr_matrix | DataFrame, List[int | str], List[int | str]]
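
Example: a minimal usage sketch (the corpus construction and sample texts are illustrative assumptions):

    from tmtoolkit.corpus import Corpus, dtm

    corp = Corpus({'d1': 'a b b c', 'd2': 'b c c'}, language='en')

    # sparse CSR matrix plus the sorted row (document) and column (vocabulary) labels
    mat, doc_labels, vocab = dtm(corp, return_doc_labels=True, return_vocab=True)
    print(mat.shape)          # (2, vocabulary size)
    print(doc_labels, vocab)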

tmtoolkit.corpus.filter_clean_tokens(docs, /, remove_punct=True, remove_stopwords=True, remove_empty=True, remove_shorter_than=None, remove_longer_than=None, remove_numbers=False, inplace=True)

Filter tokens in docs to retain only a certain, configurable subset of tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • remove_punct (bool) – remove all tokens that are considered to be punctuation (".", ",", ";" etc.) according to the is_punct attribute of the SpaCy Token

  • remove_stopwords (bool | Iterable[str]) –

    remove all tokens that are considered to be stopwords; if True, remove tokens according to the is_stop attribute of the SpaCy Token; if remove_stopwords is a set/tuple/list it defines the stopword list

  • remove_empty (bool) – remove all empty ("") and whitespace-only string tokens

  • remove_shorter_than (int | None) – remove all tokens shorter than this length

  • remove_longer_than (int | None) – remove all tokens longer than this length

  • remove_numbers (bool) –

    remove all tokens that are “numeric” according to the like_num attribute of the SpaCy Token

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
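
Example: a minimal usage sketch that works on a copy of the corpus (the sample text is an illustrative assumption):

    from tmtoolkit.corpus import Corpus, filter_clean_tokens, doc_tokens

    corp = Corpus({'d1': 'The cat, naturally, sat on the mat in 2021.'},
                  language='en')

    # remove punctuation, stopwords, empty tokens and numbers; keep the original corpus intact
    cleaned = filter_clean_tokens(corp, remove_numbers=True, inplace=False)
    print(doc_tokens(cleaned))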

tmtoolkit.corpus.filter_documents(docs, /, search_tokens, by_attr=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_result=False, inverse_matches=False, inplace=True)

This function is similar to filter_tokens but applies at document level. For each document, the number of matches is counted. If it is at least matches_threshold the document is retained, otherwise it is removed. If inverse_result is True, then documents that meet the threshold are removed.

See also

find_documents, which does the same but only reports the found documents; remove_documents, which is the same as this function but with inverted result

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • matches_threshold (int) – number of matches required for filtering a document

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse_result (bool) – inverse the threshold comparison result

  • inverse_matches (bool) – inverse the match results for filtering

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
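
Example: a minimal usage sketch (the sample texts are illustrative assumptions):

    from tmtoolkit.corpus import Corpus, filter_documents, doc_tokens

    corp = Corpus({'d1': 'politics and political debate',
                   'd2': 'nothing relevant here'}, language='en')

    # keep only documents with at least two tokens matching the glob pattern
    filter_documents(corp, 'politic*', match_type='glob', matches_threshold=2)
    print(list(doc_tokens(corp).keys()))   # only 'd1' should remain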

tmtoolkit.corpus.filter_documents_by_docattr(docs, /, search_tokens, by_attr, match_type='exact', ignore_case=False, glob_method='match', inverse=False, inplace=True)

Filter documents by a document attribute by_attr.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str) – document attribute name used for filtering

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.filter_documents_by_label(docs, /, search_tokens, match_type='exact', ignore_case=False, glob_method='match', inverse=False, inplace=True)

Filter documents by document label.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.filter_documents_by_length(docs, /, relation, threshold, inverse=False, inplace=True)

Filter documents in docs by length, i.e. number of tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • relation (str) – comparison operator as string; must be one of '<', '<=', '==', '>=', '>'

  • threshold (int) – document length threshold in number of tokens

  • inverse (bool) – inverse the mask

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.filter_documents_by_mask(docs, /, mask, inverse=False, inplace=True)

Filter documents by setting a mask.

Parameters:
  • docs (Corpus) – a Corpus object

  • mask (Dict[str, bool]) – dict that maps document labels to boolean values; documents with a True value are retained, the others are removed

  • inverse (bool) – inverse the mask

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.filter_for_pos(docs, /, search_pos, simplify_pos=True, tagset='ud', inverse=False, inplace=True)

Filter tokens for a specific POS tag (if search_pos is a string) or several POS tags (if search_pos is a list/tuple/set of strings). The POS tag depends on the tagset used during tagging. See https://spacy.io/api/annotation#pos-tagging for a general overview on POS tags in SpaCy and refer to the documentation of your language model for specific tags.

If simplify_pos is True, then the tags are matched to the following simplified forms:

  • 'N' for nouns

  • 'V' for verbs

  • 'ADJ' for adjectives

  • 'ADV' for adverbs

  • None for all other

Parameters:
  • docs (Corpus) – a Corpus object

  • search_pos (str | Collection[str]) – single string or list of strings with POS tag(s) used for filtering

  • simplify_pos (bool) – if True, simplify POS tags in documents to forms shown above before matching

  • tagset (str) – tagset used for pos; can be 'wn' (WordNet), 'penn' (Penn tagset) or 'ud' (universal dependencies – default)

  • inverse (bool) – inverse the matching results, i.e. remove tokens that match the POS tag

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
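
Example: a minimal usage sketch (the sample text is an illustrative assumption; POS tags must be available in the corpus’ SpaCy pipeline):

    from tmtoolkit.corpus import Corpus, filter_for_pos, doc_tokens

    corp = Corpus({'d1': 'The quick brown fox jumps over the lazy dog.'},
                  language='en')

    # retain only nouns and adjectives, using the simplified POS forms listed above
    filter_for_pos(corp, ['N', 'ADJ'])
    print(doc_tokens(corp))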

tmtoolkit.corpus.filter_tokens(docs, /, search_tokens, by_attr=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False, inplace=True)

Filter tokens according to search pattern(s) search_tokens and several matching options. Only those tokens are retained that match the search criteria unless you set inverse=True, which will remove all tokens that match the search criteria (which is the same as calling remove_tokens).

See also

remove_tokens and token_match

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
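
Example: a minimal usage sketch (the sample text is an illustrative assumption):

    from tmtoolkit.corpus import Corpus, filter_tokens, doc_tokens

    corp = Corpus({'d1': 'apples and oranges and bananas'}, language='en')

    # keep only tokens matching the regular expression, i.e. tokens containing "an"
    filter_tokens(corp, r'^.*an.*$', match_type='regex')
    print(doc_tokens(corp))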

tmtoolkit.corpus.filter_tokens_by_doc_frequency(docs, /, which, df_threshold, proportions=Proportion.NO, return_filtered_tokens=False, inverse=False, inplace=True)

Filter tokens according to their document frequency.

Parameters:
  • docs (Corpus) – a Corpus object

  • which (str) – which threshold comparison to use: either 'common', '>', '>=' which means that tokens with higher document freq. than (or equal to) df_threshold will be kept; or 'uncommon', '<', '<=' which means that tokens with lower document freq. than (or equal to) df_threshold will be kept

  • df_threshold (int | float) – document frequency threshold value

  • proportions (Proportion) – controls whether document frequency threshold is given in (log) proportions rather than absolute counts

  • return_filtered_tokens (bool) – if True, additionally return set of filtered token types

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

depending on return_filtered_tokens and inplace: if both are True, returns only filtered token types; if return_filtered_tokens is True and inplace is False, returns tuple with modified copy of docs and filtered token types; if return_filtered_tokens is False returns either original Corpus object docs or a modified copy of it

Return type:

None | Corpus | Set[str] | Tuple[Corpus, Set[str]]

tmtoolkit.corpus.filter_tokens_by_mask(docs, /, mask, inverse=False, inplace=True)

Filter (i.e. remove) tokens according to a boolean mask specified by mask.

Parameters:
  • docs (Corpus) – a Corpus object

  • mask (Dict[str, List[bool] | ndarray]) – dict mapping document label to boolean list or NumPy array where False means “remove” and True means “keep” for the respective token; the length of the mask must equal the number of tokens in the document

  • inverse (bool) – inverse the truth values in the mask arrays

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.filter_tokens_with_kwic(docs, /, search_tokens, context_size=2, by_attr=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False, inplace=True)

Filter tokens in docs according to Keywords-in-Context (KWIC) context window of size context_size around search_tokens. Uses similar search parameters as filter_tokens. Use kwic or kwic_table if you want to retrieve KWIC results without filtering the corpus.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s)

  • context_size (int | Tuple[int, int] | List[int]) – either scalar int or tuple/list (left, right) – number of surrounding words in keyword context; if scalar, then it is a symmetric surrounding, otherwise can be asymmetric

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.find_documents(docs, /, search_tokens, by_attr=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_result=False, inverse_matches=False, as_table=False)

For each document, the number of token matches is counted and a dict or dataframe (if as_table is True) is returned containing entries for those documents whose number of matches is at least matches_threshold.

See also

filter_documents, which does the same but applies the matches to the corpus, creating a subset of documents instead of only reporting the matches

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • matches_threshold (int) – number of matches required for filtering a document

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse_result (bool) – inverse the threshold comparison result

  • inverse_matches (bool) – inverse the match results for filtering

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict of number of matches per document label or dataframe if as_table is active

Return type:

Dict[str, int] | DataFrame

tmtoolkit.corpus.join_collocations_by_patterns(docs, /, patterns, select=None, glue='_', match_type='exact', ignore_case=False, glob_method='match', return_joint_tokens=False, inplace=True)

Match N subsequent tokens to the N patterns in patterns using match options like in filter_tokens. Join the matched tokens by glue string glue and mask the original tokens that this new joint token was generated from.

Warning

For each of the joint subsequent tokens, only the token attributes of the first token in the sequence will be retained. All further tokens will be removed. For example: In a document with tokens ["a", "hello", "world", "example"] where we join "hello", "world", the resulting document will be ["a", "hello_world", "example"] and only the token attributes (lemma, POS tag, etc. and custom attributes) for "hello" will be retained and assigned to “hello_world”.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • patterns (Sequence[str]) – a sequence of search patterns as accepted by filter_tokens

  • glue (str) – string used for joining the matched subsequent tokens

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • return_joint_tokens (bool) – also return set of joint collocations

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object; if return_joint_tokens is True, return set of joint collocations instead (if inplace is True) or additionally in tuple (modified Corpus copy, set of joint collocations) (if inplace is False)

Return type:

Corpus | Tuple[Corpus, Set[str]] | None
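
Example: a minimal usage sketch (the sample text is an illustrative assumption):

    from tmtoolkit.corpus import Corpus, join_collocations_by_patterns, doc_tokens

    corp = Corpus({'d1': 'I love new york in winter'}, language='en')

    # join every exact "new" + "york" token sequence into a single token "new_york"
    join_collocations_by_patterns(corp, ['new', 'york'], glue='_')
    print(doc_tokens(corp))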

tmtoolkit.corpus.join_collocations_by_statistic(docs, /, threshold, select=None, glue='_', min_count=1, embed_tokens_min_docfreq=None, embed_tokens_set=None, statistic=<function ppmi>, return_joint_tokens=False, inplace=True, **statistic_kwargs)

Join subsequent tokens by token collocation statistic as can be computed by corpus_collocations.

Parameters:
  • docs (Corpus) – a Corpus object

  • threshold (float) – minimum statistic value for a collocation to enter the results

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • glue (str) – string used for joining the subsequent tokens

  • min_count (int) – ignore collocations with number of occurrences below this threshold

  • embed_tokens_min_docfreq (int | float | None) – dynamically generate the set of embed_tokens used when calling token_collocations by using a minimum document frequency (see doc_frequencies); if this is an integer, it is used as absolute count, if it is a float, it is used as proportion

  • embed_tokens_set (Set | None) – tokens that, if occurring inside an n-gram, are not counted; see token_ngrams

  • statistic (Callable) – function to calculate the statistic measure from the token counts; use one of the [n]pmi[2,3]_from_counts functions provided in the tokenseq module or provide your own function which must accept parameters n_x, n_y, n_xy, n_total; see pmi_from_counts and pmi for more information

  • return_joint_tokens (bool) – also return set of joint collocations

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

  • statistic_kwargs – additional arguments passed to statistic function

Returns:

either None (if inplace is True) or a modified copy of the original docs object; if return_joint_tokens is True, return set of joint collocations instead (if inplace is True) or additionally in tuple (modified Corpus copy, set of joint collocations) (if inplace is False)

Return type:

Corpus | Tuple[Corpus, Set[str]] | None

tmtoolkit.corpus.kwic(docs, search_tokens, context_size=2, select=None, by_attr=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False, with_attr=False, as_tables=False, only_non_empty=False, glue=None, highlight_keyword=None)

Perform keyword-in-context (KWIC) search for search_tokens. Uses similar search parameters as filter_tokens. Returns results as dict with document label to KWIC results mapping. For tabular output, use kwic_table. You may also use as_tables which gives dataframes per document with columns doc (document label), context (document-specific context number), position (token position in document), token and further token attributes if specified via with_attr.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s)

  • context_size (int | Tuple[int, int] | List[int]) – either scalar int or tuple/list (left, right) – number of surrounding words in keyword context; if scalar, then it is a symmetric surrounding, otherwise can be asymmetric

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • with_attr (bool | str | Sequence[str]) – also return document and token attributes along with each token; if True, returns all default attributes and custom defined attributes; if sequence, returns attributes specified in this sequence

  • as_tables (bool) – return result as dataframe with “doc” (document label) and “context” (context ID per document) and optionally “position” (original token position in the document) if tokens are not glued via glue parameter

  • only_non_empty (bool) – if True, only return non-empty result documents

  • glue (str | None) – if not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword (str | None) – if not None, this must be a string which is used to indicate the start and end of the matched keyword

Returns:

dict with document label -> kwic for document mapping or a dataframe, depending on as_tables

Return type:

Dict[str, list | DataFrame]
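
Example: a minimal usage sketch (the sample text is an illustrative assumption):

    from tmtoolkit.corpus import Corpus, kwic

    corp = Corpus({'d1': 'the cat sat on the mat while the dog slept'},
                  language='en')

    # one token of context on each side, each match glued to a single string
    res = kwic(corp, 'the', context_size=1, glue=' ', highlight_keyword='*')
    print(res)   # e.g. {'d1': ['*the* cat', 'on *the* mat', 'while *the* dog']}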

tmtoolkit.corpus.kwic_table(docs, search_tokens, context_size=2, select=None, by_attr=None, match_type='exact', ignore_case=False, glob_method='match', inverse=False, with_attr=False, glue=' ', highlight_keyword='*')

Perform keyword-in-context (KWIC) search for search_tokens and return result as dataframe.

If a glue string is given, a “short” dataframe will be generated with columns doc (document label), context (document-specific context number) and token (KWIC result) or, if by_attr is set, the specified token attribute as last column name.

If glue is None, a “long” dataframe will be generated with columns doc (document label), context (document-specific context number), position (token position in document), token and further token attributes if specified via with_attr.

Uses similar search parameters as filter_tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s)

  • context_size (int | Tuple[int, int] | List[int]) – either scalar int or tuple/list (left, right) – number of surrounding words in keyword context; if scalar, then it is a symmetric surrounding, otherwise can be asymmetric

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • inverse (bool) – inverse the match results for filtering (i.e. remove all tokens that match the search criteria)

  • with_attr (bool | str | Sequence[str]) – also return document and token attributes along with each token; if True, returns all default attributes and custom defined attributes; if sequence, returns attributes specified in this sequence

  • glue (str) – if not None, this must be a string which is used to combine all tokens per match to a single string

  • highlight_keyword (str | None) – if not None, this must be a string which is used to indicate the start and end of the matched keyword

Returns:

dataframe with columns doc (document label), context (document-specific context number) and kwic (KWIC result)

Return type:

DataFrame

tmtoolkit.corpus.lemmatize(docs, /, select=None, inplace=True)

Lemmatize tokens, i.e. set the lemmata as tokens so that all further processing will happen using the lemmatized tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.load_corpus_from_picklefile(picklefile)

Load and deserialize a stored Corpus object from the Python pickle file picklefile.

See also

Use save_corpus_to_picklefile to save a Corpus object to a pickle file.

Warning

Python pickle files may contain malicious code. You should only load pickle files from trusted sources.

Parameters:

picklefile (str) – path to pickle file

Returns:

a Corpus object

Return type:

Corpus

tmtoolkit.corpus.load_corpus_from_tokens(tokens, sentences=False, doc_attr=None, token_attr=None, **corpus_opt)

Create a Corpus object from a dict of tokens (optionally along with document/token attributes) as may be returned from doc_tokens.

Parameters:
  • tokens (Dict[str, Any]) – dict mapping document labels to tokens (optionally along with document/token attributes)

  • sentences (bool) – if True, tokens are assumed to contain another level that indicates the sentences (as from doc_tokens with sentences=True)

  • doc_attr (Dict[str, Any] | None) – document attributes with their respective default values

  • token_attr (Dict[str, Any] | None) – token attributes with their respective default values

  • corpus_opt – arguments passed to __init__; shall not contain docs argument; at least language, language_model or spacy_instance should be given

Returns:

a Corpus object

Return type:

Corpus
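
Example: a minimal round-trip sketch (the corpus construction and sample text are illustrative assumptions):

    from tmtoolkit.corpus import Corpus, doc_tokens, load_corpus_from_tokens

    corp = Corpus({'d1': 'a small example'}, language='en')
    tok = doc_tokens(corp)                        # dict: label -> list of tokens

    # rebuild a Corpus object from the extracted tokens
    corp2 = load_corpus_from_tokens(tok, language='en')
    print(doc_tokens(corp2))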

tmtoolkit.corpus.load_corpus_from_tokens_table(tokens, doc_attr=None, token_attr=None, **corpus_kwargs)

Create a Corpus object from a dataframe as may be returned from tokens_table.

Parameters:
  • tokens (DataFrame) – a dataframe with tokens, optionally along with document/token attributes

  • doc_attr (Dict[str, Any] | None) – optional dict mapping document attribute names to default values

  • token_attr (Dict[str, Any] | None) – optional dict mapping token attribute names to default values

  • corpus_kwargs – arguments passed to __init__; shall not contain docs argument

Returns:

a Corpus object

Return type:

Corpus

tmtoolkit.corpus.ngrams(docs, n, select=None, sentences=False, by_attr=None, tokens_as_hashes=False, join=True, join_str=' ')

Generate and return n-grams of length n.

Parameters:
  • docs (Corpus) – a Corpus object

  • n (int) – length of n-grams, must be >= 2

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • sentences (bool) – divide results into sentences; if True, each document will consist of a list of sentences which in turn contain a list or array of tokens

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • join (bool) – if True, join generated n-grams by string join_str

  • join_str (str) – string used for joining

Returns:

dict mapping document label to document n-grams; if join is True, the list contains strings of joined n-grams, otherwise the list contains lists of size n in turn containing the strings that make up the n-gram; if sentences is True, each result document consists of a list of sentences with n-grams

Return type:

Dict[str, List[str] | str]
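
Example: a minimal usage sketch (the sample text is an illustrative assumption):

    from tmtoolkit.corpus import Corpus, ngrams

    corp = Corpus({'d1': 'to be or not to be'}, language='en')

    # bigrams joined with a space
    print(ngrams(corp, 2))
    # {'d1': ['to be', 'be or', 'or not', 'not to', 'to be']}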

tmtoolkit.corpus.normalize_unicode(docs, /, select=None, form='NFC', inplace=True)

Normalize unicode characters according to form.

This function only normalizes unicode characters in the tokens of docs to the form specified by form. If you want to simplify the characters, i.e. remove diacritics, underlines and other marks, use simplify_unicode instead.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • form (str) – target unicode normalization form such as 'NFC' (default), 'NFD', 'NFKC' or 'NFKD'

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.numbers_to_magnitudes(docs, /, select=None, char='0', firstchar='1', below_one='0', zero='0', drop_sign=False, decimal_sep='.', thousands_sep=',', value_on_conversion_error=None, inplace=True)

Convert each string token in docs that represents a number (e.g. “13”, “1.3” or “-1313”) to a string token that represents the magnitude of that number by repeating char (“10”, “1”, “1000” for the mentioned examples, with the default firstchar). A different first character can be set via firstchar.

See also

numbertoken_to_magnitude

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • char (str) – character string used to represent single orders of magnitude

  • firstchar (str) – special character used for first character in the output

  • below_one (str) – special character used for numbers with absolute value below 1 (would otherwise return ‘’)

  • zero (str) – if numbertoken evaluates to zero, return this string

  • drop_sign (bool) – if True, drop the sign in number numbertoken, i.e. use absolute value

  • decimal_sep (str) – decimal separator used in numbertoken; this is language-specific

  • thousands_sep (str) – thousands separator used in numbertoken; this is language-specific

  • value_on_conversion_error (str | None) – determines placeholder when the input token cannot be converted to a number; if value_on_conversion_error is None, use the input token unchanged, otherwise use value_on_conversion_error

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
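
Example: a minimal usage sketch (the sample text is an illustrative assumption; exact output depends on tokenization):

    from tmtoolkit.corpus import Corpus, numbers_to_magnitudes, doc_tokens

    corp = Corpus({'d1': 'revenue grew from 950 to 1313 in 13 months'},
                  language='en')

    # with the defaults, e.g. 950 -> '100', 1313 -> '1000', 13 -> '10'
    numbers_to_magnitudes(corp)
    print(doc_tokens(corp))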

tmtoolkit.corpus.print_summary(docs, select=None, max_documents=None, max_tokens_string_length=None)

Print a summary of this object, i.e. the first tokens of each document and some summary statistics.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • max_documents (int | None) – maximum number of documents to print; None uses default value 10; set to -1 to print all documents; this setting is disabled if select is not None

  • max_tokens_string_length (int | None) – maximum string length of concatenated tokens for each document; None uses default value 50; set to -1 to print complete documents

Return type:

None

tmtoolkit.corpus.remove_chars(docs, /, chars, select=None, inplace=True)

Remove all characters listed in chars from all tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • chars (Iterable[str]) – list of characters to remove; each element in the list should be a single character

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_common_tokens(docs, /, df_threshold=0.95, proportions=Proportion.YES, inplace=True)

Shortcut for filter_tokens_by_doc_frequency for removing tokens above a certain document frequency.

Parameters:
  • docs (Corpus) – a Corpus object

  • df_threshold (int | float) – document frequency threshold value

  • proportions (Proportion) – controls whether document frequency threshold is given in (log) proportions rather than absolute counts

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_document_attr(docs, /, attrname, inplace=True)

Remove a document attribute with name attrname from the Corpus object docs.

See also

See set_document_attr to set a document attribute.

Parameters:
  • docs (Corpus) – a Corpus object

  • attrname (str) – name of the document attribute

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_documents(docs, /, search_tokens, by_attr=None, matches_threshold=1, match_type='exact', ignore_case=False, glob_method='match', inverse_matches=False, inplace=True)

This is a shortcut for the filter_documents function with inverse_result=True, i.e. remove all documents that meet the token matching threshold.

See also

filter_documents

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inverse_matches (bool) – inverse the match results for filtering

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

  • matches_threshold (int) –

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_documents_by_docattr(docs, /, search_tokens, by_attr, match_type='exact', ignore_case=False, glob_method='match', inplace=True)

This is a shortcut for the filter_documents_by_docattr function with inverse=True, i.e. remove all documents that meet the document attribute matching criteria.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str) – document attribute name used for filtering

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_documents_by_label(docs, /, search_tokens, match_type='exact', ignore_case=False, glob_method='match', inplace=True)

Shortcut for filter_documents_by_label with inverse=True, i.e. remove all documents that meet the document label matching criteria.

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_documents_by_length(docs, /, relation, threshold, inplace=True)

Shortcut for filter_documents_by_length with inverse=True, i.e. remove all documents that meet the length criterion.

Parameters:
  • docs (Corpus) – a Corpus object

  • relation (str) – comparison operator as string; must be one of '<', '<=', '==', '>=', '>'

  • threshold (int) – document length threshold in number of tokens

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_documents_by_mask(docs, /, mask, inplace=True)

This is a shortcut for the filter_documents_by_mask function with inverse_result=True, i.e. remove all documents where the mask is set to True.

Parameters:
  • docs (Corpus) – a Corpus object

  • mask (Dict[str, bool]) – dict that maps document labels to boolean values; documents with a True value are removed

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_punctuation(docs, /, select=None, inplace=True)

Removes punctuation characters in tokens, i.e. ['a', '.', 'f;o;o'] becomes ['a', '', 'foo'].

If you want to remove punctuation tokens, use filter_clean_tokens.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_token_attr(docs, /, attrname, inplace=True)

Remove a token attribute with name attrname from the Corpus object docs.

See also

See set_token_attr to set a token attribute.

Parameters:
  • docs (Corpus) – a Corpus object

  • attrname (str) – name of the token attribute

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_tokens(docs, /, search_tokens, by_attr=None, match_type='exact', ignore_case=False, glob_method='match', inplace=True)

This is a shortcut for the filter_tokens method with inverse=True, i.e. it removes all tokens that match the search criteria.

See also

filter_tokens and token_match

Parameters:
  • docs (Corpus) – a Corpus object

  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • match_type (str) –

    the type of matching that is performed: 'exact' does exact string matching (optionally ignoring character case if ignore_case=True is set); 'regex' treats search_tokens as regular expressions to match the tokens against; 'glob' uses “glob patterns” like "politic*" which matches for example “politic”, “politics” or “politician” (see globre package)

  • ignore_case (bool) – ignore character case (applies to all three match types)

  • glob_method (str) – if match_type is 'glob', use either 'search' or 'match' as glob method (has similar implications as Python’s re.search vs. re.match)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_tokens_by_mask(docs, /, mask, inplace=True)

Remove tokens according to a boolean mask specified by mask.

Parameters:
  • docs (Corpus) – a Corpus object

  • mask (Dict[str, List[bool] | ndarray]) – dict mapping document label to boolean list or NumPy array where False means “keep” and True means “remove” for the respective token; the length of the mask must equal the number of tokens in the document

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.remove_uncommon_tokens(docs, /, df_threshold=0.05, proportions=Proportion.YES, inplace=True)

Shortcut for filter_tokens_by_doc_frequency for removing tokens below a certain document frequency.

Parameters:
  • docs (Corpus) – a Corpus object

  • df_threshold (int | float) – document frequency threshold value

  • proportions (Proportion) – controls whether document frequency threshold is given in (log) proportions rather than absolute counts

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.save_corpus_to_picklefile(docs, picklefile)

Serialize Corpus docs and save to Python pickle file picklefile.

See also

Use load_corpus_from_picklefile to load the Corpus object from a pickle file.

Parameters:
  • docs (Corpus) – a Corpus object

  • picklefile (str) – path to pickle file

Return type:

None

tmtoolkit.corpus.serialize_corpus(docs, deepcopy_attrs=True, store_workers_attrs=False)

Serialize a Corpus object to a dict. The inverse operation is implemented in deserialize_corpus.

Parameters:
  • docs (Corpus) – a Corpus object

  • deepcopy_attrs (bool) – apply deep copy to all attributes

  • store_workers_attrs (bool) – if True, store the number of maximum parallel worker processes and worker timeout

Returns:

Corpus data serialized as dict

Return type:

Dict[str, Any]

tmtoolkit.corpus.set_document_attr(docs, /, attrname, data, default=None, inplace=True)

Set a document attribute named attrname for documents in Corpus object docs. If the attribute already exists, it will be overwritten.

See also

See remove_document_attr to remove a document attribute.

Parameters:
  • docs (Corpus) – a Corpus object

  • attrname (str) – name of the document attribute

  • data (Dict[str, Any]) – dict that maps document labels to document attribute value

  • default (Any | None) – default document attribute value

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.set_token_attr(docs, /, attrname, data, default=None, per_token_occurrence=True, inplace=True)

Set a token attribute named attrname for all tokens in all documents in Corpus object docs. If the attribute already exists, it will be overwritten.

There are two ways of assigning token attributes which are determined by the argument per_token_occurrence. If per_token_occurrence is True, then data is a dict that maps token occurrences (or “word types”) to attribute values, i.e. {'foo': True} will assign the attribute value True to every occurrence of the token "foo". If per_token_occurrence is False, then data is a dict that maps document labels to token attributes. In this case the token attributes must be a list, tuple or NumPy array with a length according to the number of tokens.

See also

See remove_token_attr to remove a token attribute.

Parameters:
  • docs (Corpus) – a Corpus object

  • attrname (str) – name of the token attribute

  • data (Dict[str, Any]) – depends on per_token_occurrence; if per_token_occurrence is True, then data is a dict that maps token occurrences (or “token types”) to attribute values; if per_token_occurrence is False, then data is a dict that maps document labels to token attributes; in this case the token attributes must be a list, tuple or NumPy array with a length according to the number of tokens

  • per_token_occurrence (bool) – determines how data is interpreted when assigning token attributes

  • default (Any | None) – default token attribute value

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
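
Example: a minimal sketch of both assignment modes described above (the sample text and attribute names are illustrative assumptions):

    from tmtoolkit.corpus import Corpus, set_token_attr, tokens_table

    corp = Corpus({'d1': 'foo bar baz'}, language='en')

    # per token type: every occurrence of "foo" gets is_keyword=True, all others False
    set_token_attr(corp, 'is_keyword', {'foo': True}, default=False)

    # per document: one attribute value per token in document 'd1'
    set_token_attr(corp, 'score', {'d1': [0.1, 0.5, 0.9]},
                   per_token_occurrence=False)

    print(tokens_table(corp))   # contains the new columns "is_keyword" and "score"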

tmtoolkit.corpus.simplified_pos(pos, tagset='ud', default='')

Return a simplified POS tag for a full POS tag pos belonging to a tagset tagset.

Does the following conversion by default:

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all ADJ… (adjective) tags to ‘ADJ’

  • all ADV… (adverb) tags to ‘ADV’

  • all other to default

Does the following conversion with tagset=='penn':

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all JJ… (adjective) tags to ‘ADJ’

  • all RB… (adverb) tags to ‘ADV’

  • all other to default

Does the following conversion with tagset=='ud':

  • all N… (noun) tags to ‘N’

  • all V… (verb) tags to ‘V’

  • all JJ… (adjective) tags to ‘ADJ’

  • all RB… (adverb) tags to ‘ADV’

  • all other to default

Parameters:
  • pos (str) – a POS tag as string

  • tagset (str) – tagset used for pos; can be 'wn' (WordNet), 'penn' (Penn tagset) or 'ud' (universal dependencies – default)

  • default (str) – default return value when tag could not be simplified

Returns:

simplified tag string

Return type:

str

tmtoolkit.corpus.simplify_unicode(docs, /, select=None, method='icu', ascii_encoding_errors='ignore', inplace=True)

Simplify unicode characters in the tokens of docs, i.e. remove diacritics, underlines and other marks. Requires PyICU to be installed when using method="icu".

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • method (str) –

    either "icu" which uses PyICU for “proper” simplification or "ascii" which tries to encode the characters as ASCII; the latter is not recommended and will simply dismiss any characters that cannot be converted to ASCII after decomposition

  • ascii_encoding_errors (str) – only used if method is "ascii"; what to do when a character cannot be encoded as ASCII character; can be either "ignore" (default – replace by empty character), "replace" (replace by "???") or "strict" (raise a UnicodeEncodeError)

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.spacydocs(docs, select=None, collapse=None)

Generate SpaCy Doc objects from current corpus.

Note

If the corpus was transformed, especially if tokens were removed, then you should set collapse to " ". Otherwise tokens may be joined because of missing whitespace between them.

Parameters:
  • docs (Corpus) – a Corpus object or a dict of token strings

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • collapse (str | None) – if None, use whitespace token attribute for collapsing tokens, otherwise use custom string

Returns:

dict mapping document labels to SpaCy Doc objects

Return type:

Dict[str, Doc]

tmtoolkit.corpus.to_lowercase(docs, /, select=None, inplace=True)

Convert all tokens to lower-case form.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.to_uppercase(docs, /, select=None, inplace=True)

Convert all tokens to upper-case form.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None

tmtoolkit.corpus.token_cooccurrence(docs, context_size, tokens=None, select=None, by_attr=None, per_document=False, tokens_as_hashes=False, sparse_mat=True, triu=False, as_table=False, dtype='int32', return_tokens=False)

Calculate a token cooccurrence matrix either for all unique tokens in a corpus or for a given set of tokens via tokens argument. The cooccurrences are counted within a context window specified via context_size. See [JurafskyMartin2023], p. 111, section 6.3.3 (“Words as vectors: word dimensions”) for details.

If the context window is symmetric, the output will be a symmetric matrix. In such a case, you can set triu to True in order to obtain only the upper triangular of that matrix. Together with using a sparse matrix (sparse_mat argument), this can effectively reduce the memory usage.

See also

See pairwise_max_table for a convenient way to get the maximum token cooccurrences in tabular form. See codoc_frequencies to calculate the token cooccurrence on document-level, i.e. without context window, based on a document-term matrix.

Parameters:
  • docs (Corpus) – a Corpus object

  • context_size (int | Tuple[int, int] | List[int]) – either scalar int or tuple/list (left, right) – number of surrounding words in keyword context; if scalar, then it is a symmetric surrounding, otherwise can be asymmetric

  • tokens (List[int | str] | ndarray | None) – if None, this function will obtain the vocabulary of the passed corpus and compute the cooccurrence matrix for all tokens in the vocabulary; otherwise specify a sequence of tokens for which to calculate the cooccurrence matrix

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used for matching instead of the tokens in docs

  • per_document (bool) – if True, then compute a cooccurrence matrix for each document in docs separately and return the result as dictionary mapping the document label to the document’s cooccurrence matrix

  • tokens_as_hashes (bool) – if True, assume that tokens are passed as token hashes (integers)

  • sparse_mat (bool) – generate a sparse matrix in CSR format

  • triu (bool) – return only the upper triangular of the cooccurrence matrix; only possible if context_size is symmetric

  • as_table (bool) – convert output to pandas dataframe (not sparse)

  • dtype (str | dtype) – matrix data type

  • return_tokens (bool) – if True, return the tokens for which the cooccurrence matrix was computed as list; the order of the items in that list corresponds to the rows and columns of the returned cooc. matrix

Returns:

token cooc. matrix as sparse matrix, dense array, or pandas dataframe; optional list of tokens that correspond to the rows and columns of that matrix; if per_document is True, return a dict instead that maps each document label to its cooccurrence matrix

Return type:

csr_matrix | ndarray | DataFrame | Tuple[csr_matrix | ndarray | DataFrame, List[int | str]] | Dict[str, csr_matrix | ndarray | DataFrame] | Tuple[Dict[str, csr_matrix | ndarray | DataFrame], List[int | str]]
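
Example (a hedged sketch on a toy corpus; assumes an English spaCy model is installed and that a Corpus can be built from raw text strings):

from tmtoolkit.corpus import Corpus, token_cooccurrence

corp = Corpus({'d1': 'a b c a b', 'd2': 'b c a'}, language='en')
# symmetric context window of one token on each side; keep only the upper
# triangular of the sparse result to save memory
mat, toks = token_cooccurrence(corp, context_size=1, triu=True, return_tokens=True)
print(toks)            # tokens corresponding to the rows/columns of mat
print(mat.toarray())   # dense view of the sparse cooccurrence counts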

tmtoolkit.corpus.token_vectors(docs, select=None, collapse=None, omit_oov=True)

Return a token vectors matrix for each document in docs. This matrix is of size n by m where n is the number of tokens in the document and m is the vector width of the language model that is used (usually 300). If omit_oov is True, n will be the number of tokens in the document for which there is a word vector in the language model used.

Note

docs can be either a Corpus object or dict of SpaCy Doc objects. If it is a Corpus object, it must use a SpaCy language model with word vectors (i.e. an _md or _lg model). If the corpus was transformed, especially if tokens were removed, then you should set collapse to " ". Otherwise tokens may end up joined together because of missing whitespace between them.

Parameters:
  • docs (Corpus | Dict[str, Doc]) – a Corpus object or dict mapping document labels to SpaCy Doc objects

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • collapse (str | None) – if None, use whitespace token attribute for collapsing tokens, otherwise use custom string

  • omit_oov (bool) – omit “out of vocabulary” tokens, i.e. tokens without a vector

Returns:

dict mapping document label to token vectors matrix

Return type:

Dict[str, ndarray]

tmtoolkit.corpus.tokens_table(docs, select=None, sentences=False, tokens_as_hashes=False, with_attr=True, force_unigrams=False)

Generate a dataframe with tokens and document/token attributes. Result has columns “doc” (document label), “position” (token position in the document, starting at zero), “token” and optional columns for document/token attributes.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying the documents to fetch

  • sentences (bool) – if True, list sentence index (starting at zero) per token in sent column

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • with_attr (bool | str | Sequence[str]) – also return document and token attributes along with each token; if True, returns all default attributes and custom defined attributes; if sequence, returns attributes specified in this sequence

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

Returns:

dataframe with tokens and document/token attributes

Return type:

DataFrame
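
Example (hedged sketch with a toy corpus; assumes an English spaCy model is installed):

from tmtoolkit.corpus import Corpus, tokens_table

corp = Corpus({'d1': 'Hello world. Second sentence.'}, language='en')
df = tokens_table(corp, sentences=True)   # columns incl. doc, sent, position, token
print(df.head())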

tmtoolkit.corpus.transform_tokens(docs, /, func, select=None, vocab=None, inplace=True, **kwargs)

Transform tokens in all documents by applying function func to each document’s tokens individually.

Parameters:
  • docs (Corpus) – a Corpus object

  • func (Callable) – a function to apply to all documents’ tokens; it must accept a single token string and return a single token string

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • vocab (Set[int] | None) – optional vocabulary of token hashes (set of integers), which should be considered for transformation; if this is not given, the full vocabulary of docs will be generated

  • inplace (bool) – if True, modify Corpus object in place, otherwise return a modified copy

  • kwargs – additional arguments passed to func

Returns:

either None (if inplace is True) or a modified copy of the original docs object

Return type:

Corpus | None
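
Example (hedged sketch; str.upper satisfies the documented contract of accepting and returning a single token string, so the effect mirrors to_uppercase):

from tmtoolkit.corpus import Corpus, transform_tokens

corp = Corpus({'d1': 'Hello world'}, language='en')
transform_tokens(corp, func=str.upper)   # modifies corp in place (inplace=True by default)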

tmtoolkit.corpus.vocabulary(docs, select=None, by_attr=None, tokens_as_hashes=False, force_unigrams=False, sort=True, convert_uint64hashes=True)

Return the vocabulary, i.e. the set or sorted list of unique token types, of a Corpus or a dict of token strings.

Parameters:
  • docs (Corpus) – a Corpus object or a dict of token strings

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • tokens_as_hashes (bool) – use token hashes instead of token strings

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

  • sort (bool) – if True, sort the vocabulary

  • convert_uint64hashes (bool) – if True, convert NumPy uint64 hashes to Python int types (only is effective if tokens_as_hashes is True)

Returns:

set or, if sort is True, a sorted list of unique token types

Return type:

Set[str | int] | List[int | str]

tmtoolkit.corpus.vocabulary_counts(docs, select=None, by_attr=None, proportions=Proportion.NO, tokens_as_hashes=False, force_unigrams=False, convert_uint64hashes=True, as_table=False)

Return a dict mapping the tokens in the vocabulary to their respective number of occurrences across all or selected documents.

Parameters:
  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • by_attr (str | None) – if not None, this should be an attribute name; this attribute data will then be used instead of the tokens in docs

  • proportions (Proportion) – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – return log10 of proportions

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

  • convert_uint64hashes (bool) – if True, convert NumPy uint64 hashes to Python int types (only is effective if tokens_as_hashes is True)

  • as_table (bool | str) – if True, return result as dataframe; if a string, sort dataframe by this column; if string prefixed with “-”, sort by this column in descending order

Returns:

dict mapping the tokens in the vocabulary to their respective counts or dataframe if as_table is active

Return type:

Dict[str | int, int | float] | DataFrame
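
Example (hedged sketch combining vocabulary and vocabulary_counts on a toy corpus; assumes an English spaCy model is installed):

from tmtoolkit.corpus import Corpus, vocabulary, vocabulary_counts

corp = Corpus({'d1': 'a b b a', 'd2': 'a c'}, language='en')
print(vocabulary(corp))          # sorted list of unique token types
print(vocabulary_counts(corp))   # dict mapping token type to its count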

tmtoolkit.corpus.vocabulary_size(docs, select=None, force_unigrams=False)

Return size of the vocabulary, i.e. number of unique token types in docs (or a subset via select).

Parameters:
  • docs (Corpus | Dict[str, List[str]]) – a Corpus object or a dict of token strings

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • force_unigrams (bool) – ignore n-grams setting if docs is a Corpus with ngrams and always return unigrams

Returns:

size of the vocabulary

Return type:

int

Functions to visualize corpus summary statistics

Functions to visualize corpus summary statistics.

tmtoolkit.corpus.visualize.plot_doc_frequencies_hist(fig, ax, docs, select=None, proportions=Proportion.NO, y_log=True, title='Histogram of document frequencies', xaxislabel='document frequency', yaxislabel='count', **kwargs)

Plot histogram of document frequencies for corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • proportions (Proportion) – one of Proportion: NO (0) – return counts; YES (1) – return proportions; LOG (2) – return log10 of proportions

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_doc_lengths_hist(fig, ax, docs, select=None, y_log=True, title='Histogram of document lengths', xaxislabel='document lengths', yaxislabel='count', **kwargs)

Plot histogram of document lengths for corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]
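
Example (hedged usage sketch; the figure/axes pair is created with matplotlib and passed in, and the toy corpus is illustrative only):

import matplotlib.pyplot as plt
from tmtoolkit.corpus import Corpus
from tmtoolkit.corpus.visualize import plot_doc_lengths_hist

corp = Corpus({'d1': 'Hello world.', 'd2': 'A slightly longer test document.'}, language='en')
fig, ax = plt.subplots()
plot_doc_lengths_hist(fig, ax, corp)
plt.show()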

tmtoolkit.corpus.visualize.plot_num_sents_hist(fig, ax, docs, select=None, y_log=True, title='Histogram of number of sentences per document', xaxislabel='number of sentences', yaxislabel='count', **kwargs)

Plot histogram of number of sentences per document of corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_num_sents_vs_sent_length(fig, ax, docs, select=None, min_n_sents=0, x_log=False, y_log=False, title='Number of sentences vs. mean sentence length', xaxislabel='number of documents', yaxislabel='sentence length', **kwargs)

Make scatter plot of number of sentences vs. mean sentence length in corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • min_n_sents (int) – plot only mean sentence lengths for documents with at least min_n_sents sentences

  • x_log (bool) – if True, scale x-axis via log10 transformation

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to the respective matplotlib plotting function

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_ranked_vocab_counts(fig, ax, docs, select=None, x_log=True, y_log=True, zipf=False, title='Scatter plot for vocabulary term count vs. rank', xaxislabel='rank', yaxislabel='count', hist_opts=None, plot_opts=None)

Make scatter plot for vocabulary term count vs. rank and optionally overlay with theoretical distribution from Zipf’s law via zipf=True.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • x_log (bool) – if True, scale x-axis via log10 transformation

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • zipf (bool) – if True, add red dashed line indicating theoretical frequencies according to Zipf’s law

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • hist_opts (Dict[str, Any] | None) – additional keyword arguments passed on to histogram binning function np.histogram

  • plot_opts (Dict[str, Any] | None) – additional keyword arguments passed on to respective matplotlib plotting function

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_sent_lengths_hist(fig, ax, docs, select=None, y_log=True, title='Histogram of sentence lengths', xaxislabel='sentence length', yaxislabel='count', **kwargs)

Plot histogram of sentence lengths in corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_token_lengths_hist(fig, ax, docs, select=None, y_log=True, title='Histogram of token lengths', xaxislabel='token length', yaxislabel='count', **kwargs)

Plot histogram of token lengths in corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.corpus.visualize.plot_vocab_counts_hist(fig, ax, docs, select=None, y_log=True, title='Histogram for number of occurrences per token type', xaxislabel='number of occurrences per token type', yaxislabel='count', **kwargs)

Plot histogram of vocabulary counts (i.e. number of occurrences per token type) for corpus docs.

Parameters:
  • fig (Figure) – matplotlib Figure object

  • ax (Axes) – matplotlib Axes object

  • docs (Corpus) – a Corpus object

  • select (str | Collection[str] | None) – if not None, this can be a single string or a sequence of strings specifying a subset of docs

  • y_log (bool) – if True, scale y-axis via log10 transformation

  • title (str | None) – plot title

  • xaxislabel (str | None) – x-axis label

  • yaxislabel (str | None) – y-axis label

  • kwargs – additional keyword arguments passed on to matplotlib histogram plotting function ax.hist

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Return type:

Tuple[Figure, Axes]

tmtoolkit.ngrammodels

N-gram models as in [JurafskyMartin2023]. Mainly provides the NGramModel class.

class tmtoolkit.ngrammodels.NGramModel(n, add_k_smoothing=1.0, keep_vocab=None, tokens_as_hashes=True)

An N-gram model.

Parameters:
  • n (int) –

  • add_k_smoothing (float) –

  • keep_vocab (Optional[Union[int, float]]) –

  • tokens_as_hashes (bool) –

__init__(n, add_k_smoothing=1.0, keep_vocab=None, tokens_as_hashes=True)

Initialize an n-gram model with gram size n.

Parameters:
  • n (int) – strictly positive integer for the gram size

  • add_k_smoothing (float) – smoothing constant added to each count; must be positive

  • keep_vocab (int | float | None) – optional; specifies the maximum vocabulary size, i.e. keep only the most frequent tokens; either int or float; if float, the number gives the proportion of most frequent tokens to keep

  • tokens_as_hashes (bool) – if True, use token type hashes (integers) instead of textual representations (strings)

convert_token_sequence(tok, collapse=' ')

Convert a sequence of tokens tok to a sequence of token strings if tokens in this model are given as hashes (self.tokens_as_hashes is True) or to a sequence of token hashes if tokens in this model are given as strings (self.tokens_as_hashes is False).

Parameters:
  • tok (Iterable[str | int]) – sequence of tokens to convert

  • collapse (str | None) – collapse the resulting sequence to a string joined by this character (if output is a sequence of strings)

Returns:

sequence of converted tokens or a string if collapse is given

Return type:

str | Tuple[str | int, …] | List[int | str]

fit(corp)

Fit this n-gram model using a Corpus object or a list of token sequences.

Parameters:

corp (Corpus | List[List[int | str]]) – a Corpus object or a list of token sequences used as training data

Returns:

this instance

Return type:

NGramModel

generate_sequence(given=None, backoff=True, until_n=None, until_token=11)

Generate a random sequence of tokens given an optional “seed” sequence as given. If given is None, assume a sentence start. Generate the sequence until either a number of tokens until_n is reached or a certain token until_token was generated.

The random sequence is generated by sampling from a probability distribution that is conditional on the previous n-1 token(s) in the given n-gram model.

This method is a generator that yields one token at a time.

Parameters:
  • given (Optional[Union[StrOrInt, Tuple[StrOrInt, ...], List[StrOrInt]]]) – optional given single token or given sequence of tokens; if None, assume a sentence start

  • backoff (bool) – if True, then if no continuation candidates for the given sequence can be found, iteratively back off to a smaller given sequence (eventually up until no given sequence) until continuation candidates are found

  • until_n (Optional[int]) – if given, sample until at maximum this number of tokens is generated

  • until_token (Optional[StrOrInt]) – if given, stop sampling once this token is generated

Returns:

yields one token at a time

Return type:

Generator[StrOrInt]

pad_sequence(s, sides='both')

Prepend start sentence token(s) and/or append end sentence token(s) of length n-1 to a sequence of tokens s. If s is an empty sequence don’t apply padding.

Parameters:
  • s (Tuple[str | int, ...] | List[int | str]) – sequence of tokens

  • sides (str) – either ‘left’, ‘right’ or ‘both’

Returns:

padded sequence

Return type:

Tuple[str | int, …] | List[int | str]

perplexity(x, pad_input=False)

Calculate the perplexity for a given single token or sequence of tokens x. The perplexity is defined as perplexity(x) = p(x)^(-1/N), where p(x) is the prob. of the sequence x as calc. by prob and N is the vocabulary size.

Parameters:
  • x (str | int | Tuple[str | int, ...] | List[int | str]) – single token or sequence of tokens

  • pad_input (bool) – if True, pad x with sentence start token(s)

Returns:

perplexity

Return type:

float

predict(given=None, backoff=False, return_prob=0)

Predict the most likely continuation candidate (i.e. next token) given a sequence of tokens given. If given is None, assume a sentence start.

Parameters:
  • given (str | int | Tuple[str | int, ...] | List[int | str] | None) – optional given single token or given sequence of tokens; if None, assume a sentence start

  • backoff (bool) – if True, then if no continuation candidates for the given sequence can be found, iteratively back off to a smaller given sequence (eventually up until no given sequence) until continuation candidates are found

  • return_prob (int) – 0 - don’t return prob., 1 – return prob., 2 – return log prob.

Returns:

if return_prob is 0, return the most likely next token; if return_prob is not zero, return a 2-tuple with (most likely token, prediction probability); if backoff is False and no continuation candidates are found, return None or (None, 1.0)

Return type:

str | int | None | Tuple[str | int | None, float]

prob(x, given=None, log=True, pad_input=False)

Return the probability of token or token sequence x optionally given a sequence given. If given is not None, it is simply prepended to x.

For each token t_i in the concatenated sequence S of given and x, calculate the overall prob. of the sequence S in the n-gram model as prod(P(t_i | t_{i-n+1}, ..., t_{i-1})) (or the sum of the respective log. prob. if log is True).

Parameters:
  • x (str | int | Tuple[str | int, ...] | List[int | str]) – single token or sequence of tokens

  • given (str | int | Tuple[str | int, ...] | List[int | str] | None) – optional given single token or given sequence of tokens

  • log (bool) – if True, return log prob.

  • pad_input (bool) – if True, pad x with sentence start token(s)

Returns:

(log) probability

Return type:

float
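
Example (a hedged sketch of the class on toy token sequences; tokens are plain strings, hence tokens_as_hashes=False):

from tmtoolkit.ngrammodels import NGramModel

sents = [['i', 'like', 'tea'],
         ['i', 'like', 'coffee'],
         ['you', 'like', 'tea']]
ng = NGramModel(n=2, tokens_as_hashes=False).fit(sents)   # fit returns the instance

print(ng.predict(['i']))                                  # most likely continuation of "i"
print(ng.prob(['i', 'like'], log=False, pad_input=True))  # probability of the padded sequence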

tmtoolkit.strings

Module for functions that work with strings, i.e. single tokens.

tmtoolkit.strings.numbertoken_to_magnitude(numbertoken, char='0', firstchar='1', below_one='0', zero='0', decimal_sep='.', thousands_sep=',', drop_sign=False, value_on_conversion_error='')

Convert a string token numbertoken that represents a number (e.g. “13”, “1.3” or “-1313”) to a string token that represents the magnitude of that number by repeating char (“10”, “1”, “-1000” for the mentioned examples). A different first character can be set via firstchar. The sign can be dropped via drop_sign.

If numbertoken cannot be converted to a float, either the value value_on_conversion_error is returned or numbertoken is returned unchanged if value_on_conversion_error is None.

Parameters:
  • numbertoken (str) – token that represents a number

  • char (str) – character string used to represent single orders of magnitude

  • firstchar (str) – special character used for first character in the output

  • below_one (str) – special character used for numbers with absolute value below 1 (would otherwise return ‘’)

  • zero (str) – if numbertoken evaluates to zero, return this string

  • decimal_sep (str) – decimal separator used in numbertoken; this is language-specific

  • thousands_sep (str) – thousands separator used in numbertoken; this is language-specific

  • drop_sign (bool) – if True, drop the sign in number numbertoken, i.e. use absolute value

  • value_on_conversion_error (str | None) – determines return value when numbertoken cannot be converted to a number; if None, return input numbertoken unchanged, otherwise return value_on_conversion_error

Returns:

string that represents the magnitude of the input or an empty string

Return type:

str
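
Example (the expected outputs follow directly from the description above):

from tmtoolkit.strings import numbertoken_to_magnitude

numbertoken_to_magnitude('13')                      # '10'
numbertoken_to_magnitude('1.3')                     # '1'
numbertoken_to_magnitude('-1313')                   # '-1000'
numbertoken_to_magnitude('-1313', drop_sign=True)   # '1000'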

tmtoolkit.strings.simplify_unicode_chars(token, method='icu', ascii_encoding_errors='ignore')

Simplify unicode characters in string token, i.e. remove diacritics, underlines and other marks. Requires PyICU to be installed when using method="icu".

Parameters:
  • token (str) – string to simplify

  • method (str) –

    either "icu" which uses PyICU for “proper” simplification or "ascii" which tries to encode the characters as ASCII; the latter is not recommended and will simply dismiss any characters that cannot be converted to ASCII after decomposition

  • ascii_encoding_errors (str) – only used if method is "ascii"; what to do when a character cannot be encoded as ASCII character; can be either "ignore" (default – replace by empty character), "replace" (replace by "???") or "strict" (raise a UnicodeEncodeError)

Returns:

simplified string

Return type:

str
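
Example (hedged sketch using the "ascii" fallback method, which does not require PyICU; the expected output is illustrative):

from tmtoolkit.strings import simplify_unicode_chars

simplify_unicode_chars('façade', method='ascii')   # expected: 'facade'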

tmtoolkit.strings.strip_tags(value)

Return the given HTML with all tags stripped and HTML entities and character references converted to Unicode characters.

Code taken and adapted from https://github.com/django/django/blob/main/django/utils/html.py.

Parameters:

value (str) – input string

Returns:

string without HTML tags

Return type:

str
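
Example (hedged sketch; the expected output follows from the description above):

from tmtoolkit.strings import strip_tags

strip_tags('<p>ham &amp; eggs</p>')   # expected: 'ham & eggs'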

tmtoolkit.tokenseq

Module for functions that work with text represented as token sequences, e.g. ["A", "test", "document", "."].

Tokens don’t have to be represented as strings – for many functions, they may also be token hashes (as integers). Most functions also accept NumPy arrays instead of lists / tuples.

class tmtoolkit.tokenseq.Counter(iterable=None, /, **kwds)

Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.

>>> c = Counter('abcdeabcdabcaba')  # count elements from a string
>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c)                       # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements()))   # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values())                 # total of all counts
15
>>> c['a']                          # count of letter 'a'
5
>>> for elem in 'shazam':           # update counts from an iterable
...     c[elem] += 1                # by adding 1 to each element's count
>>> c['a']                          # now there are seven 'a'
7
>>> del c['b']                      # remove all 'b'
>>> c['b']                          # now there are zero 'b'
0
>>> d = Counter('simsalabim')       # make another counter
>>> c.update(d)                     # add in the second counter
>>> c['a']                          # now there are nine 'a'
9
>>> c.clear()                       # empty the counter
>>> c
Counter()

Note: If a count is set to zero or reduced to zero, it will remain in the counter until the entry is deleted or the counter is cleared:

>>> c = Counter('aaabbc')
>>> c['b'] -= 2                     # reduce the count of 'b' by two
>>> c.most_common()                 # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]
__init__(iterable=None, /, **kwds)

Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.

>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args
copy()

Return a shallow copy.

elements()

Iterator over elements repeating each as many times as its count.

>>> c = Counter('ABCABC')
>>> sorted(c.elements())
['A', 'A', 'B', 'B', 'C', 'C']

# Knuth's example for prime factors of 1836:  2**2 * 3**3 * 17**1
>>> prime_factors = Counter({2: 2, 3: 3, 17: 1})
>>> product = 1
>>> for factor in prime_factors.elements():     # loop over factors
...     product *= factor                       # and multiply them
>>> product
1836

Note, if an element’s count has been set to zero or is a negative number, elements() will ignore it.

classmethod fromkeys(iterable, v=None)

Create a new dictionary with keys from iterable and values set to value.

most_common(n=None)

List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.

>>> Counter('abracadabra').most_common(3)
[('a', 5), ('b', 2), ('r', 2)]
subtract(iterable=None, /, **kwds)

Like dict.update() but subtracts counts instead of replacing them. Counts can be reduced below zero. Both the inputs and outputs are allowed to contain zero and negative counts.

Source can be an iterable, a dictionary, or another Counter instance.

>>> c = Counter('which')
>>> c.subtract('witch')             # subtract elements from another iterable
>>> c.subtract(Counter('watch'))    # subtract elements from another counter
>>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
0
>>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
-1
update(iterable=None, /, **kwds)

Like dict.update() but add counts instead of replacing them.

Source can be an iterable, a dictionary, or another Counter instance.

>>> c = Counter('which')
>>> c.update('witch')           # add elements from another iterable
>>> d = Counter('watch')
>>> c.update(d)                 # add elements from another counter
>>> c['h']                      # four 'h' in which, witch, and watch
4
tmtoolkit.tokenseq.collapse_tokens(tokens, collapse=' ')

Take a sequence of tokens tokens and turn it into a string by joining the tokens using either a single “glue” string or a sequence of “glue” strings in collapse.

Parameters:
  • tokens (Iterable[str] | ndarray) – list or NumPy array of string tokens

  • collapse (str | Iterable[str] | ndarray) – either single string or list / NumPy array of “glue” strings where collapse[i] is the string to appear after tokens[i]

Returns:

collapsed tokens as string

Return type:

str
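
Example (hedged sketch; with a sequence as collapse, collapse[i] appears after tokens[i]):

from tmtoolkit.tokenseq import collapse_tokens

collapse_tokens(['Hello', 'world', '!'])                          # expected: 'Hello world !'
collapse_tokens(['Hello', 'world', '!'], collapse=[' ', '', ''])  # expected: 'Hello world!'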

tmtoolkit.tokenseq.copy(x)

Shallow copy operation on arbitrary Python objects.

See the module’s __doc__ string for more info.

tmtoolkit.tokenseq.empty_chararray()

Create empty NumPy character array.

Returns:

empty NumPy character array

Return type:

ndarray

tmtoolkit.tokenseq.index_windows_around_matches(matches, left, right, flatten=False, remove_overlaps=True)

Take a boolean 1D array matches of length N and generate an array of indices, where each occurrence of a True value in the boolean vector at index i generates a sequence of the form:

[i-left, i-left+1, ..., i, ..., i+right-1, i+right]

If flatten is True, then a flattened NumPy 1D array is returned. Otherwise, a list of NumPy arrays is returned, where each array contains the window indices.

remove_overlaps is only applied when flatten is True.

Example with left=1 and right=1, flatten=False:

input:
#   0     1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*):
[[*0*, 1], [0, *1*, 2], [3, *4*, 5], [7, *8*]]

Example with left=1 and right=1, flatten=True, remove_overlaps=True:

input:
#   0     1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*, other values belong to the respective "windows"):
[*0*, *1*, 2, 3, *4*, 5, 7, *8*]
Parameters:
  • matches (ndarray) – 1D boolean input array

  • left (int) – index window left side size

  • right (int) – index window right side size

  • flatten (bool) – if True return flattened NumPy 1D array, otherwise return list of NumPy arrays with one array per window

  • remove_overlaps (bool) – if True, remove overlaps in match windows (only applies if flatten is set to True)

Returns:

if flatten is False, return a list of arrays where each array is an index window into matches; if flatten is True, return a concatenated NumPy array with the index windows

Return type:

List[List[int]] | List[ndarray] | ndarray

tmtoolkit.tokenseq.indices_of_matches(a, b, b_is_sorted=False, check_a_in_b=False)

Return the indices into 1D array b where elements in 1D array a equal an element in b. E.g.: Suppose b is a vocabulary like [13, 10, 12, 8] and a is a sequence of tokens [12, 13]. Then indices_of_matches(a, b) will return [2, 0] since first element in a equals b[2] and the second element in a equals b[0].

Parameters:
  • a (ndarray) – 1D array which will be searched in b

  • b (ndarray) – 1D array of elements to match against; result will produce indices into this array; should have same dtype as a

  • b_is_sorted (bool) – set this to True if you’re sure that b is sorted; then a shortcut will be used

  • check_a_in_b (bool) – if True then check if all elements in a exist in b; if this is not the case, raise an exception

Returns:

1D array of indices; length equals the length of a

Return type:

ndarray
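
Example (reproduces the vocabulary example from the description above):

import numpy as np
from tmtoolkit.tokenseq import indices_of_matches

b = np.array([13, 10, 12, 8])   # "vocabulary"
a = np.array([12, 13])          # token sequence
indices_of_matches(a, b)        # array([2, 0])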

tmtoolkit.tokenseq.npmi(x, y=None, xy=None, n_total=None, logfn=<ufunc 'log'>, *, k=1, alpha=1.0, normalize=True)

Calculate pointwise mutual information measure (PMI). You can either pass a matrix x which represents counts N_{x,y} (if the matrix is of dtype (u)int) or probabilities p(x, y), or you pass probabilities p(x), p(y), p(x, y) given as x, y, xy, or total counts x, y, xy together with n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.

Probabilities should be such that p(x, y) <= min(p(x), p(y)).

Parameters:
  • x (Union[np.ndarray, sparse.spmatrix]) – either a matrix with probabilities p(x, y) or counts (if matrix is of dtype (u)int); for the alternative calling signature that requires arguments x, y and xy, you can pass x a vector of probabilities p(x) or vector of number of occurrences of x (interpreted as count if n_total is given)

  • y (Optional[np.ndarray]) – probabilities p(y) or count of occurrences of y (interpreted as count if n_total is given) when using alternative calling signature

  • xy (Optional[np.ndarray]) – probabilities p(x, y) or count of occurrences of x and y (interpreted as count if n_total is given) when using alternative calling signature

  • n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space; if x is given as matrix you can set n_total to 1 to indicate that x is a matrix of counts even if it is of dtype (u)int

  • logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)

  • k (int) – if k > 1, calculate PMI^k variant

  • alpha (float) – calculate p_{alpha}(y) as y^alpha/sum(y^alpha) (only if given as counts)

  • normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure

Returns:

array with same shape as inputs containing (N)PMI measures for each input probability

Return type:

Union[np.ndarray, sparse.spmatrix]

tmtoolkit.tokenseq.pad_sequence(s, left, right, left_symbol, right_symbol, skip_empty=True)

Prepend and/or append symbols to token sequence s.

Parameters:
  • s (Tuple[str | int, ...] | List[int | str] | ndarray) – sequence of tokens

  • left (int) – number of symbols to add to the start

  • right (int) – number of symbols to add to the end

  • left_symbol (str | int) – symbol to add to the start

  • right_symbol (str | int) – symbol to add to the end

  • skip_empty (bool) – if set to True and s is an empty sequence, don’t apply padding

Returns:

padded sequence of same type as input sequence

Return type:

Tuple[str | int, …] | List[int | str] | ndarray
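
Example (hedged sketch derived from the parameter descriptions above):

from tmtoolkit.tokenseq import pad_sequence

pad_sequence(['a', 'b'], left=2, right=1, left_symbol='<s>', right_symbol='</s>')
# expected: ['<s>', '<s>', 'a', 'b', '</s>']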

tmtoolkit.tokenseq.pmi(x, y=None, xy=None, n_total=None, logfn=<ufunc 'log'>, k=1, alpha=1.0, normalize=False)

Calculate pointwise mutual information measure (PMI). You can either pass a matrix x which represents counts N_{x,y} (if the matrix is of dtype (u)int) or probabilities p(x, y), or you pass probabilities p(x), p(y), p(x, y) given as x, y, xy, or total counts x, y, xy together with n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.

Probabilities should be such that p(x, y) <= min(p(x), p(y)).

Parameters:
  • x (ndarray | spmatrix) – either a matrix with probabilities p(x, y) or counts (if matrix is of dtype (u)int); for the alternative calling signature that requires arguments x, y and xy, you can pass x a vector of probabilities p(x) or vector of number of occurrences of x (interpreted as count if n_total is given)

  • y (ndarray | None) – probabilities p(y) or count of occurrences of y (interpreted as count if n_total is given) when using alternative calling signature

  • xy (ndarray | None) – probabilities p(x, y) or count of occurrences of x and y (interpreted as count if n_total is given) when using alternative calling signature

  • n_total (int | None) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space; if x is given as matrix you can set n_total to 1 to indicate that x is a matrix of counts even if it is of dtype (u)int

  • logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)

  • k (int) – if k > 1, calculate PMI^k variant

  • alpha (float) – calculate p_{alpha}(y) as y^alpha/sum(y^alpha) (only if given as counts)

  • normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure

Returns:

array with same shape as inputs containing (N)PMI measures for each input probability

Return type:

ndarray | spmatrix
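
Example (hedged sketch of the alternative calling signature with counts; the count vectors are toy values):

import numpy as np
from tmtoolkit.tokenseq import pmi

x = np.array([10, 5])    # occurrence counts of events x_i
y = np.array([8, 20])    # occurrence counts of events y_i
xy = np.array([6, 1])    # joint occurrence counts of (x_i, y_i)
pmi(x, y, xy, n_total=100)
# element i equals log( p(x_i, y_i) / (p(x_i) * p(y_i)) )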

tmtoolkit.tokenseq.pmi2(x, y=None, xy=None, n_total=None, logfn=<ufunc 'log'>, *, k=2, alpha=1.0, normalize=False)

Calculate pointwise mutual information measure (PMI). You can either pass a matrix x which represents counts N_{x,y} (if the matrix is of dtype (u)int) or probabilities p(x, y), or you pass probabilities p(x), p(y), p(x, y) given as x, y, xy, or total counts x, y, xy together with n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.

Probabilities should be such that p(x, y) <= min(p(x), p(y)).

Parameters:
  • x (Union[np.ndarray, sparse.spmatrix]) – either a matrix with probabilities p(x, y) or counts (if matrix is of dtype (u)int); for the alternative calling signature that requires arguments x, y and xy, you can pass x a vector of probabilities p(x) or vector of number of occurrences of x (interpreted as count if n_total is given)

  • y (Optional[np.ndarray]) – probabilities p(y) or count of occurrences of y (interpreted as count if n_total is given) when using alternative calling signature

  • xy (Optional[np.ndarray]) – probabilities p(x, y) or count of occurrences of x and y (interpreted as count if n_total is given) when using alternative calling signature

  • n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space; if x is given as matrix you can set n_total to 1 to indicate that x is a matrix of counts even if it is of dtype (u)int

  • logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)

  • k (int) – if k > 1, calculate PMI^k variant

  • alpha (float) – calculate p_{alpha}(y) as y^alpha/sum(y^alpha) (only if given as counts)

  • normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure

Returns:

array with same shape as inputs containing (N)PMI measures for each input probability

Return type:

Union[np.ndarray, sparse.spmatrix]

tmtoolkit.tokenseq.pmi3(x, y=None, xy=None, n_total=None, logfn=<ufunc 'log'>, *, k=3, alpha=1.0, normalize=False)

Calculate pointwise mutual information measure (PMI). You can either pass a matrix x which represents counts N_{x,y} (if the matrix is of dtype (u)int) or probabilities p(x, y), or you pass probabilities p(x), p(y), p(x, y) given as x, y, xy, or total counts x, y, xy together with n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.

Probabilities should be such that p(x, y) <= min(p(x), p(y)).

Parameters:
  • x (Union[np.ndarray, sparse.spmatrix]) – either a matrix with probabilities p(x, y) or counts (if matrix is of dtype (u)int); for the alternative calling signature that requires arguments x, y and xy, you can pass x a vector of probabilities p(x) or vector of number of occurrences of x (interpreted as count if n_total is given)

  • y (Optional[np.ndarray]) – probabilities p(y) or count of occurrences of y (interpreted as count if n_total is given) when using alternative calling signature

  • xy (Optional[np.ndarray]) – probabilities p(x, y) or count of occurrences of x and y (interpreted as count if n_total is given) when using alternative calling signature

  • n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space; if x is given as matrix you can set n_total to 1 to indicate that x is a matrix of counts even if it is of dtype (u)int

  • logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)

  • k (int) – if k > 1, calculate PMI^k variant

  • alpha (float) – calculate p_{alpha}(y) as y^alpha/sum(y^alpha) (only if given as counts)

  • normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure

Returns:

array with same shape as inputs containing (N)PMI measures for each input probability

Return type:

Union[np.ndarray, sparse.spmatrix]

tmtoolkit.tokenseq.ppmi(x, y=None, xy=None, n_total=None, logfn=<ufunc 'log'>, add_k_smoothing=0.0, alpha=1.0)

Calculate positive pointwise mutual information measure (PPMI) as max(pmi(...), 0). This results in a measure that is in range [0, +Inf]. See pmi for further information. See [JurafskyMartin2023], p. 117 for more on (positive) PMI.

Note

If you pass x as sparse matrix, the calculations are applied only to non-zero elements and will return another sparse matrix.

Parameters:
  • x (ndarray | spmatrix) – either a matrix with probabilities p(x, y) or counts (if matrix is of dtype (u)int); for the alternative calling signature that requires arguments x, y and xy, you can pass x a vector of probabilities p(x) or vector of number of occurrences of x (interpreted as count if n_total is given)

  • y (ndarray | None) – probabilities p(y) or count of occurrences of y (interpreted as count if n_total is given) when using alternative calling signature

  • xy (ndarray | None) – probabilities p(x, y) or count of occurrences of x and y (interpreted as count if n_total is given) when using alternative calling signature

  • n_total (int | None) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space; if x is given as matrix you can set n_total to 1 to indicate that x is a matrix of counts even if it is of dtype (u)int

  • logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)

  • add_k_smoothing (float) – can only be used when x is a matrix of counts; in this case add_k_smoothing is added to x

  • alpha (float) – calculate p_{alpha}(y) as y^alpha/sum(y^alpha) (only if given as counts)

Returns:

array with same shape as inputs containing PPMI measures for each input probability

Return type:

ndarray | spmatrix

tmtoolkit.tokenseq.token_collocation_matrix(sentences, min_count=1, embed_tokens=None, tokens_as_hashes=False, return_vocab=False, return_bigrams_with_indices=False)

Generate a sparse token collocation matrix from bigrams in sentences.

See also

See token_collocations for a similar function that returns a list of collocations sorted by a statistic score such as PPMI.

Parameters:
  • sentences (List[List[StrOrInt]]) – list of sentences containing lists of tokens

  • min_count (int) – ignore collocations with number of occurrences below this threshold

  • embed_tokens (Optional[Iterable]) – tokens that, if occurring inside an n-gram, are not counted; see token_ngrams

  • tokens_as_hashes (bool) – if True, assume that tokens in sentences are hashes (integers) instead of strings

  • return_vocab (bool) – additionally return the vocabulary as numpy array for each axis of the matrix

  • return_bigrams_with_indices (bool) – additionally return a list of bigrams together with a pair of indices of the respective bigram into the result matrix

Returns:

a sparse collocation count matrix where the rows and columns represent bigram token pairs and the elements represent their collocation count; if return_vocab is True, also return the vocabulary for each matrix axis; if return_bigrams_with_indices is True, additionally return a list of bigrams together with a pair of indices of the respective bigram into the result matrix

Return type:

Union[sparse.csr_matrix, Tuple[sparse.csr_matrix, np.ndarray, np.ndarray], Tuple[sparse.csr_matrix, List[Tuple, Tuple[int, int]]], Tuple[sparse.csr_matrix, np.ndarray, np.ndarray, List[Tuple, Tuple[int, int]]]]

tmtoolkit.tokenseq.token_collocations(sentences, threshold=None, min_count=1, embed_tokens=None, statistic=<function ppmi>, glue=None, return_statistic=True, rank='desc', tokens_as_hashes=False, hashes2tokens=None, **statistic_kwargs)

Identify token collocations (frequently co-occurring token series) in a list of sentences of tokens given by sentences. Currently only supports bigram collocations.

Parameters:
  • sentences (List[List[int | str]]) – list of sentences containing lists of tokens; tokens can be items of any type if glue is None

  • threshold (float | None) – minimum statistic value for a collocation to enter the results; if None, results are not filtered

  • min_count (int) – ignore collocations with number of occurrences below this threshold

  • embed_tokens (Iterable | None) – tokens that, if occurring inside an n-gram, are not counted; see token_ngrams

  • statistic (Callable[[spmatrix, ...], spmatrix | ndarray]) – function to calculate the statistic measure from the token counts; use one of the [n|p]pmi functions provided in this module or provide your own function which must accept a sparse matrix x and return a matrix of the same shape; see pmi for more information

  • glue (str | None) – if not None, provide a string that is used to join the collocation tokens

  • return_statistic (bool) – also return computed statistic

  • rank (str | None) – if not None, rank the results according to the computed statistic in ascending (rank='asc') or descending (rank='desc') order

  • tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)

  • hashes2tokens (Dict[int, str] | dict | None) – if tokens are given as integer hashes, this table is used to generate textual representations for the results

  • statistic_kwargs – additional arguments passed to statistic function

Returns:

list of tuples (collocation tokens, score) if return_statistic is True, otherwise only a list of collocations; collocations are either a string (if glue is given) or a tuple of strings

Return type:

List[tuple | str]
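
Example (hedged sketch on toy sentences; by default collocations are scored with PPMI and ranked in descending order):

from tmtoolkit.tokenseq import token_collocations

sents = [['new', 'york', 'is', 'big'],
         ['i', 'like', 'new', 'york'],
         ['new', 'york', 'is', 'far']]
token_collocations(sents, min_count=2, glue=' ')
# expected: list of (collocation, score) tuples with 'new york' ranked near the top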

tmtoolkit.tokenseq.token_hash_convert(tokens, stringstore, special_tokens=None, collapse=None, arr_dtype_for_hashes=None)

Perform token <-> hash conversion on a sequence of tokens tokens using the bijection stringstore. If tokens contains token hashes, the output is a sequence of token strings and if tokens contains token strings, the output is a sequence of token hashes. In case the output is a sequence of token strings, these can be collapsed using the collapse parameter.

Parameters:
  • tokens (Iterable[str | int] | ndarray) – a sequence of tokens either as token strings or token hashes

  • stringstore (dict) – a bijection mapping strings to hashes and vice versa as implemented in SpaCy’s StringStore

  • special_tokens (dict | None) – optional bijection for tokens not present or of higher importance than those in stringstore

  • collapse (str | Iterable[str] | ndarray | None) – either single string or list / NumPy array of “glue” strings where collapse[i] is the string to appear after tokens[i]; if this is None, no collapsing is applied, i.e. this function returns a sequence of converted tokens instead of a string; collapsing can only be applied if tokens is converted to strings and not hashes

  • arr_dtype_for_hashes (str | None) – if tokens is an array, assume this dtype for hashes (e.g. ‘uint64’ if using SpaCy’s token hashes)

Returns:

converted sequence of tokens or collapsed token string if collapse is given

Return type:

str | Iterable[str | int] | ndarray

tmtoolkit.tokenseq.token_join_subsequent(tokens, matches, glue='_', tokens_dtype=None, return_glued=False, return_mask=False)

Select subsequent tokens as defined by list of indices matches (e.g. output of token_match_subsequent) and join those by string glue. Return a list of tokens where the subsequent matches are replaced by the joined tokens.

Warning

Only works correctly when matches contains indices of subsequent tokens.

Example:

token_join_subsequent(['a', 'b', 'c', 'd', 'd', 'a', 'b', 'c'],
                      [np.array([1, 2]), np.array([6, 7])])
# ['a', 'b_c', 'd', 'd', 'a', 'b_c']
Parameters:
  • tokens (List[str] | ndarray) – a sequence of tokens

  • matches (List[ndarray]) – list of NumPy arrays with subsequent indices into tokens (e.g. output of token_match_subsequent)

  • glue (str | None) – string for joining the subsequent matches or None to keep them as separate items in a list

  • tokens_dtype (str | dtype | None) – if tokens is not a NumPy array, it will be converted as such; use this dtype for the array

  • return_glued (bool) – if True, also return a list of the joined tokens

  • return_mask (bool) – if True, also return a NumPy integer array with the length of the input tokens list that marks the original input tokens in three ways: 0 means mask that original token, 1 means retain that original token, 2 means replace the original token by a newly generated joined token; if True, only the newly generated joined subsequent tokens are returned and not the original tokens as well

Returns:

either two-tuple, three-tuple or list depending on return_glued and return_mask

Return type:

list | tuple

tmtoolkit.tokenseq.token_lengths(tokens)

Token lengths (number of characters of each token) in tokens.

Parameters:

tokens (Iterable[str] | ndarray) – list or NumPy array of string tokens

Returns:

list of token lengths

Return type:

List[int]

tmtoolkit.tokenseq.token_match(pattern, tokens, match_type='exact', ignore_case=False, glob_method='match', inverse=False)

Return a boolean NumPy array signaling matches between pattern and tokens. pattern will be compared with each element in sequence tokens either as exact equality (match_type is 'exact') or regular expression (match_type is 'regex') or glob pattern (match_type is 'glob'). For the last two options, pattern must be a string or compiled RE pattern, otherwise it can be of any type that allows equality checking.

See token_match_multi_pattern for a version of this function that accepts multiple search patterns.

Parameters:
  • pattern (Any) – string or compiled RE pattern used for matching against tokens; when match_type is 'exact', pattern may be of any type that allows equality checking

  • tokens (List[int | str] | ndarray) – list or NumPy array of string tokens

  • match_type (str) – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, pattern must be a RE pattern; if ‘glob’, pattern must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)

  • ignore_case (bool) – if True, ignore case for matching

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • inverse (bool) – invert the matching results

Returns:

1D boolean NumPy array of length len(tokens) where elements signal matches between pattern and the respective token from tokens

Return type:

ndarray
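
Example (hedged sketch; expected outputs are given as comments):

from tmtoolkit.tokenseq import token_match

tokens = ['hello', 'world', 'hello!', 'Hello']
token_match('hello*', tokens, match_type='glob')   # expected: array([ True, False,  True, False])
token_match('hello', tokens, ignore_case=True)     # expected: array([ True, False, False,  True])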

tmtoolkit.tokenseq.token_match_multi_pattern(search_tokens, tokens, match_type='exact', ignore_case=False, glob_method='match')

Return a boolean NumPy array signaling matches between any pattern in search_tokens and tokens. Works the same as token_match, but accepts multiple patterns as search_tokens argument.

Parameters:
  • search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', pattern may be of any type that allows equality checking

  • tokens (List[str] | ndarray) – list or NumPy array of string tokens

  • match_type (str) – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, the patterns in search_tokens must be RE patterns; if ‘glob’, they must be “glob” patterns like “hello w*” (see https://github.com/metagriffin/globre)

  • ignore_case (bool) – if True, ignore case for matching

  • glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

Returns:

1D boolean NumPy array of length len(tokens) where elements signal matches

Return type:

ndarray

tmtoolkit.tokenseq.token_match_subsequent(patterns, tokens, **match_opts)

Using N patterns in patterns, return each tuple of N matching subsequent tokens from tokens. Accepts the same token matching options via match_opts as token_match. The results are returned as list of NumPy arrays with indices into tokens.

Example:

# indices:   0        1        2         3        4       5       6
tokens = ['hello', 'world', 'means', 'saying', 'hello', 'world', '.']

token_match_subsequent(['hello', 'world'], tokens)
# [array([0, 1]), array([4, 5])]

token_match_subsequent(['world', 'hello'], tokens)
# []

token_match_subsequent(['world', '*'], tokens, match_type='glob')
# [array([1, 2]), array([5, 6])]

See also

token_match

Parameters:
  • patterns (Sequence) – a sequence of search patterns as accepted by token_match

  • tokens (list | ndarray) – a sequence of string tokens to be used for matching

  • match_opts – token matching options as passed to token_match

Returns:

list of NumPy arrays with subsequent indices into tokens

Return type:

List[ndarray]

tmtoolkit.tokenseq.token_ngrams(tokens, n, join=True, join_str=' ', ngram_container=<class 'list'>, embed_tokens=None, keep_embed_tokens=True)

Generate n-grams of length n from list of tokens tokens. Either join the n-grams when join is True using join_str so that a list of joined n-gram strings is returned or, if join is False, return a list of n-gram lists (or other sequences depending on ngram_container). For the latter option, the tokens in tokens don’t have to be strings but can be of any type.

Optionally pass a set/list/tuple embed_tokens which contains tokens that, if occurring inside an n-gram, are not counted. See for example how a trigram 'bank of america' is generated when the token 'of' is set as embed_tokens, although we ask to generate bigrams:

>>> token_ngrams("I visited the bank of america".split(), n=2)
['I visited', 'visited the', 'the bank', 'bank of', 'of america']
>>> token_ngrams("I visited the bank of america".split(), n=2, embed_tokens={'of'})
['I visited', 'visited the', 'the bank', 'bank of america', 'of america']
Parameters:
  • tokens (Sequence) – sequence of tokens; if join is True, this must be a list of strings

  • n (int) – size of the n-grams to generate

  • join (bool) – if True, join n-grams by join_str

  • join_str (str) – string to join n-grams if join is True

  • ngram_container (Callable) – if join is False, use this function to create the n-gram sequences

  • embed_tokens (Iterable | None) – tokens that, if occurring inside an n-gram, are not counted

  • keep_embed_tokens (bool) – if True, keep embedded tokens in the result

Returns:

list of joined n-gram strings or list of n-grams that are n-sized sequences

Return type:

list

tmtoolkit.tokenseq.unique_chars(tokens)

Return a set of all characters used in tokens.

Parameters:

tokens (Iterable[str]) – iterable of string tokens

Returns:

set of all characters used in tokens

Return type:

Set[str]

tmtoolkit.topicmod

Topic modeling sub-package with modules for model evaluation, model I/O, model statistics, parallel computation and visualization.

Functions and classes in tm_gensim, tm_lda and tm_sklearn implement parallel model computation and evaluation using popular topic modeling packages. You need to install the respective packages (lda, scikit-learn or gensim) in order to use them.

Evaluation metrics for Topic Modeling

Metrics for topic model evaluation.

In order to run model evaluations in parallel use one of the modules tm_gensim, tm_lda or tm_sklearn.

tmtoolkit.topicmod.evaluate.metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths)

Calculate metric as in [Arun2010] using topic-word distribution topic_word_distrib, document-topic distribution doc_topic_distrib and document lengths doc_lengths.

Note

It will fail when the number of words in the vocabulary is less than the number of topics (which is very unusual).

Warning

There’s no code available for the [Arun2010] paper. The code follows the procedures outlined in the paper so that its results could be reproduced for the NIPS dataset. See the discussion at https://github.com/nikita-moor/ldatuning/issues/7.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents

  • doc_lengths – array of length N with number of tokens per document

Returns:

calculated metric

tmtoolkit.topicmod.evaluate.metric_cao_juan_2009(topic_word_distrib)

Calculate metric as in [Cao2009] using topic-word distribution topic_word_distrib.

Parameters:

topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

Returns:

calculated metric

tmtoolkit.topicmod.evaluate.metric_coherence_gensim(measure, topic_word_distrib=None, gensim_model=None, vocab=None, dtm=None, gensim_corpus=None, texts=None, top_n=20, return_coh_model=False, return_mean=False, **kwargs)

Calculate model coherence using Gensim’s CoherenceModel.

Define which measure to use with parameter measure:

  • 'u_mass'

  • 'c_v'

  • 'c_uci'

  • 'c_npmi'

Provide a topic word distribution topic_word_distrib OR a Gensim model gensim_model and the corpus’ vocabulary as vocab OR pass a gensim corpus as gensim_corpus. top_n controls how many most probable words per topic are selected.

If measure is 'u_mass', a document-term-matrix dtm or gensim_corpus must be provided and texts can be None. If any other measure than 'u_mass' is used, tokenized input as texts must be provided as 2D list:

[['some', 'text', ...],          # doc. 1
 ['some', 'more', ...],          # doc. 2
 ['another', 'document', ...]]   # doc. 3

If return_coh_model is True, the whole gensim.models.CoherenceModel instance will be returned, otherwise:

  • if return_mean is True, the mean coherence value will be returned

  • if return_mean is False, a list of coherence values (for each topic) will be returned

Provided kwargs will be passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic.

Note

This function also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!

Parameters:
  • measure – the coherence calculation type; one of the values listed above

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size if gensim_model is not given

  • gensim_model – a topic model from Gensim if topic_word_distrib is not given

  • vocab – vocabulary list/array if gensim_corpus is not given

  • dtm – document-term matrix of shape NxM with N documents and vocabulary size M if gensim_corpus is not given

  • gensim_corpus – a Gensim corpus if vocab is not given

  • texts – list of tokenized documents; necessary if using a measure other than 'u_mass'

  • top_n – number of most probable words selected per topic

  • return_coh_model – if True, return gensim.models.CoherenceModel as result

  • return_mean – if return_coh_model is False and return_mean is True, return mean coherence

  • kwargs – parameters passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic

Returns:

if return_coh_model is True, return gensim.models.CoherenceModel as result; otherwise if return_mean is True, mean of all coherence values, otherwise array of length K with coherence per topic
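
The following is a minimal usage sketch (the variables topic_word, vocab, dtm and tokenized_docs are assumed to exist as described in the parameter list; they are not defined here):

from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# per-topic "c_v" coherence from an already fitted model's topic-word distribution
coh_per_topic = metric_coherence_gensim('c_v', topic_word_distrib=topic_word,
                                        vocab=vocab, texts=tokenized_docs, top_n=20)

# mean "u_mass" coherence; requires a document-term matrix instead of tokenized texts
mean_coh = metric_coherence_gensim('u_mass', topic_word_distrib=topic_word,
                                   vocab=vocab, dtm=dtm, return_mean=True)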

tmtoolkit.topicmod.evaluate.metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1, include_prob=False, normalize=False, return_mean=False)

Calculate coherence metric according to [Mimno2011]. You need to provide a topic word distribution as topic_word_distrib and a document-term-matrix dtm (can be sparse). top_n controls how many most probable words per topic are selected.

If you set eps=1e-12 and normalize=True, this is equivalent to the “U_Mass” coherence metric as provided in the Gensim package and as wrapper function in metric_coherence_gensim with measure='u_mass'.

By default, it will return a NumPy array of coherence values per topic (same ordering as in topic_word_distrib). Set return_mean to True to return the mean of all topics instead.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • dtm – document-term matrix of shape NxM with N documents and vocabulary size M

  • top_n – number of most probable words selected per topic

  • eps – smoothing constant epsilon

  • include_prob – if True, include probabilities of top words per topic in the calculations

  • normalize – if True, normalize coherence values

  • return_mean – if True, return mean of all coherence values, otherwise array of coherence per topic

Returns:

if return_mean is True, mean of all coherence values, otherwise array of length K with coherence per topic
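
A minimal usage sketch (topic_word and dtm are assumed to exist as described in the parameter list):

from tmtoolkit.topicmod.evaluate import metric_coherence_mimno_2011

coh = metric_coherence_mimno_2011(topic_word, dtm, top_n=20)        # array of length K
coh_mean = metric_coherence_mimno_2011(topic_word, dtm, top_n=20, return_mean=True)

# with these settings the result is equivalent to Gensim's "u_mass" measure
umass_like = metric_coherence_mimno_2011(topic_word, dtm, eps=1e-12, normalize=True)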

tmtoolkit.topicmod.evaluate.metric_griffiths_2004(logliks)

Calculate metric as in [GriffithsSteyvers2004].

Calculates the harmonic mean of the log-likelihood values logliks. Burn-in values should already be removed from logliks.

Note

Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.

Parameters:

logliks – array with log-likelihood values

Returns:

calculated metric

tmtoolkit.topicmod.evaluate.metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train, alpha, n_samples=10000)

Estimation of the probability of held-out documents according to [Wallach2009] using a document-topic estimation theta_test that was estimated via held-out documents dtm_test on a trained model with a topic-word distribution phi_train and a document-topic prior alpha. Draw n_samples according to theta_test for each document in dtm_test (memory consumption and run time can be very high for larger n_samples and a large number of long documents in dtm_test).

A document-topic estimation theta_test can be obtained from a trained model from the “lda” package or scikit-learn package with the transform() method.

Adapted from MATLAB code originally by Ian Murray, 2009, downloaded from umass.edu.

Note

Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.

Parameters:
  • dtm_test – held-out documents of shape NxM with N documents and vocabulary size M

  • theta_test – document-topic estimation of dtm_test; shape NxK with K topics

  • phi_train – topic-word distribution of a trained topic model that should be evaluated; shape KxM

  • alpha – document-topic prior of the trained topic model that should be evaluated; either a scalar or an array of length K

  • n_samples – number of samples to draw according to theta_test for each document in dtm_test

Returns:

estimated probability of held-out documents

tmtoolkit.topicmod.evaluate.results_by_parameter(res, param, sort_by=None, sort_desc=False)

Takes a list of evaluation results res returned by a topic model evaluation function – a list in the form:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

It then returns a list of tuples that contain only the m parameter(s) listed in param from each parameter set in the evaluation results, such that the returned list is:

[(param_1_0, ..., param_1_m, {'<metric_name>': result_1, ...}),
 ...,
 (param_n_0, ..., param_n_m, {'<metric_name>': result_n, ...})]

Optionally order either by parameter value (sort_by is None - the default) or by result metric (sort_by='<metric name>').

Parameters:
  • res – list of evaluation results

  • param – string of parameter name

  • sort_by – order by parameter value if this is None, or by a certain result metric given as string

  • sort_desc – sort in descending order

Returns:

list with tuple pairs using only the parameter param from the parameter sets
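
A usage sketch, assuming eval_results was returned by one of the evaluate_topic_models functions and the models were varied over a (hypothetical) parameter 'n_topics':

from tmtoolkit.topicmod.evaluate import results_by_parameter

by_k = results_by_parameter(eval_results, 'n_topics')
# -> [(10, {'cao_juan_2009': ..., 'coherence_mimno_2011': ...}),
#     (20, {...}),
#     ...]

# order by a metric instead of by the parameter value
best_first = results_by_parameter(eval_results, 'n_topics',
                                  sort_by='coherence_mimno_2011', sort_desc=True)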

Printing, importing and exporting topic model results

Functions for printing/exporting topic model results.

tmtoolkit.topicmod.model_io.ldamodel_full_doc_topics(doc_topic_distrib, doc_labels, colname_rowindex='_doc', topic_labels='topic_{i1}')

Generate a pandas DataFrame for the full doc-topic distribution doc_topic_distrib.

See also

ldamodel_top_doc_topics to retrieve only the most probable topics in the distribution as formatted pandas DataFrame; ldamodel_full_topic_words to retrieve the full topic-word distribution as dataframe

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • colname_rowindex – column name for the “row index”, i.e. the column that identifies each row

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

Returns:

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_full_topic_words(topic_word_distrib, vocab, colname_rowindex='_topic', row_labels='topic_{i1}')

Generate a pandas DataFrame for the full topic-word distribution topic_word_distrib.

See also

ldamodel_top_topic_words to retrieve only the most probable words in the distribution as formatted pandas DataFrame; ldamodel_full_doc_topics to retrieve the full document-topic distribution as dataframe

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M (vocabulary size)

  • colname_rowindex – column name for the “row index”, i.e. the column that identifies each row

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels

Returns:

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='document')

Retrieve the top (i.e. most probable) top_n topics for each document in the document-topic distribution doc_topic_distrib as pandas DataFrame.

See also

ldamodel_full_doc_topics to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_docs to retrieve the top documents per topic; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of most probable topics per document to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective topic name and {val} is replaced by the topic’s probability given the document

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns:

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_topic_docs(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='topic')

Retrieve the top (i.e. most probable) top_n documents for each topic in the document-topic distribution doc_topic_distrib as pandas DataFrame.

See also

ldamodel_full_doc_topics to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_doc_topics to retrieve the top topics per document; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of most probable documents per topic to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective document label and {val} is replaced by the topic’s probability given the document

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns:

pandas DataFrame

tmtoolkit.topicmod.model_io.ldamodel_top_topic_words(topic_word_distrib, vocab, top_n=10, val_fmt=None, row_labels='topic_{i1}', col_labels=None, index_name='topic')

Retrieve the top (i.e. most probable) top_n words for each topic in the topic-word distribution topic_word_distrib as pandas DataFrame.

See also

ldamodel_full_topic_words to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution; ldamodel_top_doc_topics to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs to retrieve the top documents per topic

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M (vocabulary size)

  • top_n – number of most probable words per topic to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective word from vocab and {val} is replaced by the word’s probability given the topic

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns:

pandas DataFrame
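
A usage sketch (topic_word and vocab are assumed to exist as described in the parameter list):

from tmtoolkit.topicmod.model_io import ldamodel_top_topic_words

# DataFrame with one row per topic and one column per rank, holding the top 5 words
df = ldamodel_top_topic_words(topic_word, vocab, top_n=5)

# combine word and probability in each cell via val_fmt
df_lbl = ldamodel_top_topic_words(topic_word, vocab, top_n=5, val_fmt='{lbl} ({val})')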

tmtoolkit.topicmod.model_io.ldamodel_top_word_topics(topic_word_distrib, vocab, top_n=10, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='token')

Retrieve the top (i.e. most probable) top_n topics for each word in the topic-word distribution topic_word_distrib as pandas DataFrame.

See also

ldamodel_full_topic_words to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_doc_topics to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs to retrieve the top documents per topic

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M (vocabulary size)

  • top_n – number of most probable topics per word to select

  • val_fmt – format string for table cells where {lbl} is replaced by the respective topic label from topic_labels and {val} is replaced by the word’s probability given the topic

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

  • col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank

  • index_name – name of the table index

Returns:

pandas DataFrame

tmtoolkit.topicmod.model_io.load_ldamodel_from_pickle(picklefile, **kwargs)

Load an LDA model object from a pickle file picklefile.

See also

save_ldamodel_to_pickle to save a model.

Warning

Python pickle files may contain malicious code. You should only load pickle files from trusted sources.

Parameters:
  • picklefile – path to the pickle file to load

  • kwargs – additional keyword arguments passed on when loading the pickle file

Returns:

dict with keys: 'model' – model instance; 'vocab' – vocabulary; 'doc_labels' – document labels; 'dtm' – optional document-term matrix;

tmtoolkit.topicmod.model_io.print_ldamodel_distribution(distrib, row_labels, val_labels, top_n=10)

Print the top_n top values from an LDA model’s distribution distrib. This is a general function to print top values of any multivariate distribution given as matrix distrib with H rows and I columns, each identified by H row_labels and I val_labels.

See also

print_ldamodel_topic_words to print the top values of a topic-word distribution or print_ldamodel_doc_topics to print the top values of a document-topic distribution.

Parameters:
  • distrib – either a topic-word or a document-topic distribution of shape HxI

  • row_labels – list/array of length H with label string for each row of distrib or format string

  • val_labels – list/array of length I with label string for each column of distrib or format string

  • top_n – number of top values to print

tmtoolkit.topicmod.model_io.print_ldamodel_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_labels='topic_{i1}')

Print top_n values from an LDA model’s document-topic distribution doc_topic_distrib.

See also

print_ldamodel_topic_words to print the top values of a topic-word distribution.

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of top values to print

  • val_labels – format string for each value where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual value labels

tmtoolkit.topicmod.model_io.print_ldamodel_topic_words(topic_word_distrib, vocab, top_n=10, row_labels='topic_{i1}')

Print top_n values from an LDA model’s topic-word distribution topic_word_distrib.

See also

print_ldamodel_doc_topics to print the top values of a document-topic distribution.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary list/array of length M (vocabulary size)

  • top_n – number of top values to print

  • row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual row labels

tmtoolkit.topicmod.model_io.save_ldamodel_summary_to_excel(excel_file, topic_word_distrib, doc_topic_distrib, doc_labels, vocab, top_n_topics=10, top_n_words=10, dtm=None, rank_label_fmt=None, topic_labels=None)

Save a summary derived from an LDA model’s topic-word and document-topic distributions (topic_word_distrib and doc_topic_distrib) to an Excel file excel_file. Return the generated Excel sheets as a dict of pandas DataFrames.

The resulting Excel file will consist of six or, optionally, seven sheets:

  • top_doc_topics_vals: document-topic distribution with probabilities of top topics per document

  • top_doc_topics_labels: document-topic distribution with labels (e.g. "topic_12") of top topics per document

  • top_doc_topics_labelled_vals: document-topic distribution combining probabilities and labels of top topics per document (e.g. "topic_12 (0.21)")

  • top_topic_word_vals: topic-word distribution with probabilities of top words per topic

  • top_topic_word_labels: topic-word distribution with top words per topic (e.g. "politics")

  • top_topic_words_labelled_vals: topic-word distribution combining probabilities and top words per topic (e.g. "politics (0.08)")

  • optional if dtm is given – marginal_topic_distrib: marginal topic distribution

Parameters:
  • excel_file – target Excel file

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • vocab – vocabulary list/array of length M (vocabulary size)

  • top_n_topics – number of most probable topics per document to include in the summary

  • top_n_words – number of most probable words per topic to include in the summary

  • dtm – document-term matrix; shape NxM; if this is given, a sheet for the marginal topic distribution will be included

  • rank_label_fmt – format string for the rank labels where {i0} or {i1} are replaced by the respective zero- or one-indexed rank numbers (leave to None for default)

  • topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers or an array with individual topic labels

Returns:

dict mapping sheet name to pandas DataFrame

tmtoolkit.topicmod.model_io.save_ldamodel_to_pickle(picklefile, model, vocab, doc_labels, dtm=None, **kwargs)

Save an LDA model object model as pickle file to picklefile.

See also

load_ldamodel_from_pickle to load the saved model.

Parameters:
  • picklefile – target file

  • model – LDA model instance

  • vocab – vocabulary list/array of length M

  • doc_labels – document labels list/array of length N

  • dtm – optional document-term matrix of shape NxM

  • kwargs – additional options for tmtoolkit.utils.pickle_data
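
A sketch of a save/load round trip (model, vocab, doc_labels and dtm are assumed to exist; the file name is arbitrary):

from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle, load_ldamodel_from_pickle

save_ldamodel_to_pickle('my_model.pickle', model, vocab, doc_labels, dtm=dtm)

loaded = load_ldamodel_from_pickle('my_model.pickle')
model, vocab = loaded['model'], loaded['vocab']
doc_labels, dtm = loaded['doc_labels'], loaded['dtm']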

Statistics for topic models and BoW matrices

Common statistics and tools for topic models.

tmtoolkit.topicmod.model_stats.exclude_topics(excl_topic_indices, doc_topic_distrib, topic_word_distrib=None, renormalize=True, return_new_topic_mapping=False)

Exclude topics with the indices excl_topic_indices from the document-topic distribution doc_topic_distrib (i.e. delete the respective columns in this matrix) and optionally re-normalize the distribution so that the rows sum up to 1 if renormalize is set to True.

Optionally also strip the topics from the topic-word distribution topic_word_distrib (i.e. remove the respective rows).

If topic_word_distrib is given, return a tuple with the updated doc.-topic and topic-word distributions, else return only the updated doc.-topic distribution.

Warning

The topics to be excluded are specified by zero-based indices.

Parameters:
  • excl_topic_indices – list/array with zero-based indices of topics to exclude

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • topic_word_distrib – optional topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • renormalize – if True, re-normalize the document-topic distribution so that the rows sum up to 1

  • return_new_topic_mapping – if True, additionally return a dict that maps old topic indices to new topic indices

Returns:

new document-topic distribution where topics from excl_topic_indices are removed and optionally re-normalized; optional new topic-word distribution with same topics removed; optional dict that maps old topic indices to new topic indices
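
A usage sketch with assumed distributions theta (document-topic) and phi (topic-word):

from tmtoolkit.topicmod.model_stats import exclude_topics

# remove topics #1 and #5 (zero-based indices 0 and 4) from both distributions
new_theta, new_phi = exclude_topics([0, 4], theta, topic_word_distrib=phi)

# without a topic-word distribution, only the updated doc-topic distribution is returned
new_theta = exclude_topics([0, 4], theta)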

tmtoolkit.topicmod.model_stats.filter_topics(search_pattern, vocab, topic_word_distrib, top_n=None, thresh=None, match_type='exact', cond='any', glob_method='match', return_words_and_matches=False)

Filter topics, given as topic-word distribution topic_word_distrib across vocabulary vocab, for a single word (pass a string) or for multiple words/patterns (pass a list of strings) via search_pattern. Either run the pattern(s) against the list of top words per topic (use top_n for the number of words in the top words list) or specify a minimum topic-word probability thresh, resulting in a list of words above this threshold for each topic, which will be used for pattern matching. You can also specify both top_n and thresh.

Set the match_type parameter according to the options provided by token_match (exact matching, RE or glob matching). Use cond to specify whether only one match per topic suffices when a list of patterns is passed (cond='any') or whether all patterns must match (cond='all').

By default, this function returns a NumPy array containing the indices of topics that passed the filter criteria. If return_words_and_matches is True, this function additionally returns a NumPy array with the top words for each topic and a NumPy array with the pattern matches for each topic.

See also

See tmtoolkit.tokenseq.token_match for filtering options.

Parameters:
  • search_pattern – single match pattern string or list of match pattern strings

  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • top_n – if given, consider only the top top_n words per topic

  • thresh – if given, consider only the words with a probability above thresh

  • match_type – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, search_pattern must be a RE pattern; if ‘glob’, search_pattern must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)

  • cond – either "any" or "all"; controls whether only one or all patterns must match if multiple match patterns are given

  • glob_method – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)

  • return_words_and_matches – if True, additionally return list of arrays of words per topic and list of binary arrays indicating matches per topic

Returns:

array of topic indices with matches; if return_words_and_matches is True, return two more lists as described above
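
A usage sketch (phi is an assumed topic-word distribution, vocab the corresponding vocabulary array):

from tmtoolkit.topicmod.model_stats import filter_topics

# indices of topics whose top 10 words contain a word starting with "politic"
topic_ind = filter_topics('politic*', vocab, phi, top_n=10, match_type='glob')

# require that *all* patterns match somewhere in the top words of a topic
topic_ind = filter_topics(['econom*', 'tax*'], vocab, phi, top_n=10,
                          match_type='glob', cond='all')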

tmtoolkit.topicmod.model_stats.generate_topic_labels_from_top_words(topic_word_distrib, doc_topic_distrib, doc_lengths, vocab, n_words=None, lambda_=1, labels_glue='_', labels_format='{i1}_{topwords}')

Generate unique topic labels derived from the top words of each topic. The top words are determined from the relevance score [SievertShirley2014] depending on lambda_. Specify the number of top words in the label with n_words. If n_words is None, a minimum number of words will be used to create unique labels for each topic. Topic labels are formed by joining the top words with labels_glue and formatting them with labels_format. Placeholders in labels_format are "{i0}" (zero-based topic index), "{i1}" (one-based topic index) and "{topwords}" (top words glued with labels_glue).

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • vocab – vocabulary array of length M

  • n_words – minimum number of words to be used to create unique labels

  • lambda_ – lambda parameter (influences weight of “log lift”)

  • labels_glue – string to join the top words

  • labels_format – final topic labels format string

Returns:

NumPy array of topic labels; length is K
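
A usage sketch (phi, theta, doc_lengths and vocab are assumed to exist as described in the parameter list; the example labels are made up):

from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words

labels = generate_topic_labels_from_top_words(phi, theta, doc_lengths, vocab,
                                              n_words=2, lambda_=0.6)
# e.g. array(['1_tax_income', '2_school_teacher', ...])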

tmtoolkit.topicmod.model_stats.least_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by distinctiveness score from least to most distinctive. Optionally only return the n least distinctive words.

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least distinctive words

Returns:

array of length M or n (if n is given) with least distinctive words

tmtoolkit.topicmod.model_stats.least_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by marginal word probability from least to most probable. Optionally only return the n least probable words.

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least probable words

Returns:

array of length M or n (if n is given) with least probable words

tmtoolkit.topicmod.model_stats.least_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by least to most relevance according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from topic_word_relevance. Optionally only return the n least relevant words.

Parameters:
  • vocab – vocabulary array of length M

  • rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size

  • topic – topic number (zero-indexed)

  • n – if not None, return only the n least relevant words

Returns:

array of length M or n (if n is given) with least relevant words for topic topic

tmtoolkit.topicmod.model_stats.least_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by saliency score from least to most salient. Optionally only return the n least salient words.

See also

word_saliency

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n least salient words

Returns:

array of length M or n (if n is given) with least salient words

tmtoolkit.topicmod.model_stats.marginal_topic_distrib(doc_topic_distrib, doc_lengths)

Return the marginal topic distribution p(T) (topic proportions) given the document-topic distribution (theta) doc_topic_distrib and the document lengths doc_lengths. The latter can be calculated with tmtoolkit.bow.bow_stats.doc_lengths.

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

Returns:

array of size K (number of topics) with marginal topic distribution

tmtoolkit.topicmod.model_stats.marginal_word_distrib(topic_word_distrib, p_t)

Return the marginal word distribution p(w) (term proportions derived from topic model) given the topic-word distribution (phi) topic_word_distrib and the marginal topic distribution p(T) p_t. The latter can be calculated with marginal_topic_distrib.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • p_t – marginal topic distribution; array of size K

Returns:

array of size M (vocabulary size) with marginal word distribution
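
A sketch showing how the two marginal distributions are typically chained (theta, phi and doc_lengths are assumed to exist):

from tmtoolkit.topicmod.model_stats import marginal_topic_distrib, marginal_word_distrib

p_t = marginal_topic_distrib(theta, doc_lengths)   # length K, sums to 1
p_w = marginal_word_distrib(phi, p_t)              # length M, sums to 1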

tmtoolkit.topicmod.model_stats.most_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by distinctiveness score from most to least distinctive. Optionally only return the n most distinctive words.

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most distinctive words

Returns:

array of length M or n (if n is given) with most distinctive words

tmtoolkit.topicmod.model_stats.most_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by marginal word probability from most to least probable. Optionally only return the n most probable words.

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most probable words

Returns:

array of length M or n (if n is given) with most probable words

tmtoolkit.topicmod.model_stats.most_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by most to least relevance according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from topic_word_relevance. Optionally only return the n most relevant words.

Parameters:
  • vocab – vocabulary array of length M

  • rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size

  • topic – topic number (zero-indexed)

  • n – if not None, return only the n most relevant words

Returns:

array of length M or n (if n is given) with most relevant words for topic topic

tmtoolkit.topicmod.model_stats.most_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by saliency score from most to least salient. Optionally only return the n most salient words.

See also

word_saliency

Parameters:
  • vocab – vocabulary array of length M

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • n – if not None, return only the n most salient words

Returns:

array of length M or n (if n is given) with most salient words

tmtoolkit.topicmod.model_stats.top_n_from_distribution(distrib, top_n=10, row_labels=None, col_labels=None, val_labels=None)

Get top_n values from LDA model’s distribution distrib as DataFrame. Can be used for topic-word distributions and document-topic distributions. Set row_labels to a format string or a list. Set col_labels to a format string for the column names. Set val_labels to return value labels instead of pure values (probabilities).

Parameters:
  • distrib – a 2D probability distribution of shape NxM from an LDA model

  • top_n – number of top values to take from each row of distrib

  • row_labels – either list of row label strings of length N or a single row format string

  • col_labels – column format string or None for default numbered columns

  • val_labels – value labels format string or None to return only the probabilities

Returns:

pandas DataFrame with N rows and top_n columns

tmtoolkit.topicmod.model_stats.top_words_for_topics(topic_word_distrib, top_n=None, vocab=None, return_prob=False)

Generate sorted list of top_n words (or word indices) per topic in topic-word distribution topic_word_distrib.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • top_n – number of top words (according to probability given topic) to select per topic; if None return full sorted lists of words

  • vocab – vocabulary array of length M; if None, return word indices instead of word strings

  • return_prob – if True, also return sorted arrays of word probabilities given topic for each topic

Returns:

list of length K consisting of sorted arrays of most probable words; arrays have length top_n or M (if top_n is None); if return_prob is True, another list of sorted arrays of word probabilities for each topic is returned

tmtoolkit.topicmod.model_stats.topic_word_relevance(topic_word_distrib, doc_topic_distrib, doc_lengths, lambda_)

Calculate the topic-word relevance score with a lambda parameter lambda_ according to [SievertShirley2014]:

relevance(w,t|lambda) = lambda * log phi_{t,w} + (1-lambda) * log (phi_{t,w} / p(w)), where

  • phi is the topic-word distribution,

  • p(w) is the marginal word probability.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

  • lambda_ – lambda parameter (influences weight of “log lift”)

Returns:

matrix with topic-word relevance scores; shape KxM
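
A usage sketch that combines this function with most_relevant_words_for_topic (phi, theta, doc_lengths and vocab are assumed to exist):

from tmtoolkit.topicmod.model_stats import topic_word_relevance, most_relevant_words_for_topic

rel_mat = topic_word_relevance(phi, theta, doc_lengths, lambda_=0.6)   # shape KxM

# the 10 most relevant words for the first topic (zero-based topic index 0)
top_rel_words = most_relevant_words_for_topic(vocab, rel_mat, topic=0, n=10)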

tmtoolkit.topicmod.model_stats.word_distinctiveness(topic_word_distrib, p_t)

Calculate word distinctiveness according to [Chuang2012]:

distinctiveness(w) = KL(P(T|w), P(T)) = sum_T(P(T|w) log(P(T|w)/P(T))), where

  • KL is Kullback-Leibler divergence,

  • P(T) is marginal topic distribution,

  • P(T|w) is prob. of a topic given a word.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • p_t – marginal topic distribution; array of size K

Returns:

array of size M (vocabulary size) with word distinctiveness

tmtoolkit.topicmod.model_stats.word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths)

Calculate word saliency according to [Chuang2012] as saliency(w) = p(w) * distinctiveness(w) for a word w.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document

Returns:

array of size M (vocabulary size) with word saliency

Parallel model fitting and evaluation with lda

Parallel model computation and evaluation using the lda package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_lda.AVAILABLE_METRICS = ('loglikelihood', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

Available metrics for lda ("griffiths_2004", "held_out_documents_wallach09" are added when package gmpy2 is installed, several "coherence_gensim_" metrics are added when package gensim is installed).

tmtoolkit.topicmod.tm_lda.DEFAULT_METRICS = ('cao_juan_2009', 'coherence_mimno_2011')

Metrics used by default.

tmtoolkit.topicmod.tm_lda.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “lda” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list of tuples (parameter_set, model), where parameter_set is a dict of the parameters used.

Parameters:
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns:

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_lda.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “lda” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.

Parameters:
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns:

list of evaluation results for each varying parameter set as described above
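
A usage sketch (dtm is an assumed sparse document-term matrix; the parameter names n_topics, n_iter and random_state follow the lda package):

from tmtoolkit.topicmod import tm_lda
from tmtoolkit.topicmod.evaluate import results_by_parameter

varying = [{'n_topics': k} for k in range(20, 101, 20)]
const = {'n_iter': 1000, 'random_state': 1}

eval_results = tm_lda.evaluate_topic_models(dtm, varying_parameters=varying,
                                            constant_parameters=const,
                                            metric=['cao_juan_2009', 'coherence_mimno_2011'])

# simplify the results for inspection or plotting
by_k = results_by_parameter(eval_results, 'n_topics')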

Parallel model fitting and evaluation with scikit-learn

Parallel model computation and evaluation using the scikit-learn package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_sklearn.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')

Available metrics for sklearn ("held_out_documents_wallach09" is added when package gmpy2 is installed, several "coherence_gensim_" metrics are added when package gensim is installed).

tmtoolkit.topicmod.tm_sklearn.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'coherence_mimno_2011')

Metrics used by default.

tmtoolkit.topicmod.tm_sklearn.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “sklearn” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list of tuples (parameter_set, model), where parameter_set is a dict of the parameters used.

Parameters:
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns:

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_sklearn.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “sklearn” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.

Parameters:
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns:

list of evaluation results for each varying parameter set as described above

Parallel model fitting and evaluation with Gensim

Parallel model computation and evaluation using the Gensim package.

Available evaluation metrics for this module are listed in AVAILABLE_METRICS. See tmtoolkit.topicmod.evaluate for references and implementations of those evaluation metrics.

tmtoolkit.topicmod.tm_gensim.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')

Available metrics for Gensim.

tmtoolkit.topicmod.tm_gensim.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'coherence_mimno_2011', 'coherence_gensim_c_v')

Metrics used by default.

tmtoolkit.topicmod.tm_gensim.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Compute several topic models in parallel using the “gensim” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.

If data is a dict of named matrices, this function will return a dict with corpus ID -> result list. Otherwise it will only return a result list. A result list is always a list of tuples (parameter_set, model), where parameter_set is a dict of the parameters used.

Parameters:
  • data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

Returns:

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

tmtoolkit.topicmod.tm_gensim.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)

Compute several Topic Models in parallel using the “gensim” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.

data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).

Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:

[(parameter_set_1, {'<metric_name>': result_1, ...}),
 ...,
 (parameter_set_n, {'<metric_name>': result_n, ...})]

See also

Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.

Parameters:
  • data – a (sparse) 2D array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

  • metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS

  • metric_kwargs – dict of options for the used metric(s)

Returns:

list of evaluation results for each varying parameter set as described above

Visualize topic models and topic model evaluation results

Wordclouds from topic models

tmtoolkit.topicmod.visualize.DEFAULT_WORDCLOUD_KWARGS = {'background_color': None, 'color_func': <function _wordcloud_color_func_black>, 'height': 600, 'mode': 'RGBA', 'width': 800}

Default wordcloud settings for transparent background and black font; will be passed to wordcloud.WordCloud

tmtoolkit.topicmod.visualize.generate_wordclouds_for_topic_words(topic_word_distrib, vocab, top_n, topic_labels='topic_{i1}', which_topics=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for the top top_n words of each topic in topic_word_distrib.

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary array of length M

  • top_n – number of top values to take from each row of distrib

  • topic_labels – labels used for each row; determine keys in the result dict; either a single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or a list of topic label strings

  • which_topics – if not None, a sequence of indices into rows of topic_word_distrib to select only these topics to generate wordclouds from

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns:

dict mapping row labels to wordcloud images or instances generated from each topic
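
A usage sketch (phi is an assumed topic-word distribution, vocab the corresponding vocabulary array; requires the wordcloud package):

from tmtoolkit.topicmod.visualize import generate_wordclouds_for_topic_words, write_wordclouds_to_folder

# wordclouds for the first three topics only, built from the top 30 words each
clouds = generate_wordclouds_for_topic_words(phi, vocab, top_n=30, which_topics=[0, 1, 2])

# clouds maps topic labels (e.g. "topic_1") to wordcloud images; save them as PNG files
write_wordclouds_to_folder(clouds, 'wordclouds', file_name_fmt='{label}.png')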

tmtoolkit.topicmod.visualize.generate_wordclouds_for_document_topics(doc_topic_distrib, doc_labels, top_n, topic_labels='topic_{i1}', which_documents=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for the top top_n topics of each document in doc_topic_distrib.

Parameters:
  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • top_n – number of top values to take from each row of distrib

  • topic_labels – labels used for each row; determine keys in the result dict; either a single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or a list of topic label strings

  • which_documents – if not None, a sequence of indices into rows of doc_topic_distrib to select only these documents to generate wordclouds from

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns:

dict mapping row labels to wordcloud images or instances generated from each document

tmtoolkit.topicmod.visualize.generate_wordcloud_from_probabilities_and_words(prob, words, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)

Generate a single wordcloud for given probabilities (weights) prob of the respective words.

Parameters:
  • prob – 1D array or sequence of probabilities for words

  • words – 1D array or sequence of word strings

  • return_image – if True, return an image object instead of a wordcloud.WordCloud object

  • wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns:

either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance

tmtoolkit.topicmod.visualize.generate_wordcloud_from_weights(weights, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)

Generate a single wordcloud for a weights dict that maps words to “weights” (e.g. probabilities) which determine their size in the wordcloud.

Parameters:
  • weights – dict that maps words to weights

  • return_image – if True, return an image object instead of a wordcloud.WordCloud object

  • wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns:

either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance

tmtoolkit.topicmod.visualize.write_wordclouds_to_folder(wordclouds, folder, file_name_fmt='{label}.png', **save_kwargs)

Save all wordcloud image objects in wordclouds to folder.

Parameters:
  • wordclouds – dict mapping wordcloud label to wordcloud object

  • folder – target path

  • file_name_fmt – file name string format with placeholder "{label}"

  • save_kwargs – additional options passed to save method of each wordcloud image object

tmtoolkit.topicmod.visualize.generate_wordclouds_from_distribution(distrib, row_labels, val_labels, top_n, which_rows=None, return_images=True, **wordcloud_kwargs)

Generate wordclouds for each row in a given probability distribution distrib.

Note

Use generate_wordclouds_for_topic_words or generate_wordclouds_for_document_topics as shortcuts for creating wordclouds for a topic-word or document-topic distribution.

Parameters:
  • distrib – 2D (sparse) array/matrix probability distribution

  • row_labels – labels for rows in probability distribution; these are used as keys in the return dict

  • val_labels – labels for values in probability distribution (e.g. vocabulary)

  • top_n – number of top values to take from each row of distrib

  • which_rows – if not None, select only the rows from this sequence of indices from distrib

  • return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict

  • wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS

Returns:

dict mapping row labels to wordcloud images or instances generated from each distribution row
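
A sketch for a batch of wordclouds from a synthetic distribution; the data and labels are made up and the optional wordcloud package is assumed to be installed:

import numpy as np
from tmtoolkit.topicmod.visualize import (generate_wordclouds_from_distribution,
                                          write_wordclouds_to_folder)

# synthetic distribution: 3 rows (e.g. topics) over 5 values (e.g. vocabulary)
vocab = np.array(['alpha', 'beta', 'gamma', 'delta', 'epsilon'])
distrib = np.random.dirichlet(np.ones(len(vocab)), size=3)

clouds = generate_wordclouds_from_distribution(
    distrib,
    row_labels=['topic_%d' % (i + 1) for i in range(distrib.shape[0])],
    val_labels=vocab,
    top_n=3)

# with return_images=True (the default), each dict value is an image object;
# the target folder is assumed to exist already
write_wordclouds_to_folder(clouds, 'wordclouds', file_name_fmt='{label}.png')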

Plot heatmaps for topic models

tmtoolkit.topicmod.visualize.plot_doc_topic_heatmap(fig, ax, doc_topic_distrib, doc_labels, topic_labels=None, which_documents=None, which_document_indices=None, which_topics=None, which_topic_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)

Plot a heatmap for a document-topic distribution doc_topic_distrib to a matplotlib Figure fig and Axes ax using doc_labels as document labels on the y-axis and topics from 1 to K (number of topics) on the x-axis.

Note

It is almost always necessary to select a subset of your document-topic distribution with the which_documents or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • doc_labels – list/array of length N with a string label for each document

  • topic_labels – labels used for each row; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_documents – select documents via document label strings

  • which_document_indices – alternatively, select documents with zero-based document index in [0, N-1]

  • which_topics – select topics via topic label strings (when string array or list) or with one-based topic index in [1, K] (when integer array or list)

  • which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • kwargs – additional arguments passed to plot_heatmap

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)
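
A minimal sketch with synthetic data (the document-topic distribution and labels are made up):

import numpy as np
import matplotlib.pyplot as plt
from tmtoolkit.topicmod.visualize import plot_doc_topic_heatmap

# synthetic document-topic distribution: 50 documents, 10 topics
doc_topic = np.random.dirichlet(np.ones(10), size=50)
doc_labels = ['doc_%d' % i for i in range(50)]

fig, ax = plt.subplots(figsize=(8, 4))
# select only a few documents, as recommended in the note above
plot_doc_topic_heatmap(fig, ax, doc_topic, doc_labels,
                       which_document_indices=[0, 1, 2, 3, 4])
plt.show()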

tmtoolkit.topicmod.visualize.plot_topic_word_heatmap(fig, ax, topic_word_distrib, vocab, topic_labels=None, which_topics=None, which_topic_indices=None, which_words=None, which_word_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)

Plot a heatmap for a topic-word distribution topic_word_distrib to a matplotlib Figure fig and Axes ax using vocab as vocabulary on the x-axis and topics from 1 to K (number of topics, i.e. topic_word_distrib.shape[0]) on the y-axis.

Note

It is almost always necessary to select a subset of your topic-word distribution with the which_words or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • vocab – vocabulary array of length M

  • topic_labels – labels used for each row; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings

  • which_topics – select topics via topic label strings (when string array or list and topic_labels is given) or with one-based topic index in [1, K] (when integer array or list)

  • which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]

  • which_words – select words with one-based word index in [1, M]

  • which_word_indices – alternatively, select words with zero-based word index in [0, M-1]

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • kwargs – additional arguments passed to plot_heatmap

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

tmtoolkit.topicmod.visualize.plot_heatmap(fig, ax, data, xaxislabel=None, yaxislabel=None, xticklabels=None, yticklabels=None, title=None, grid=True, values_in_cells=True, round_values_in_cells=2, legend=False, fontsize_axislabel=None, fontsize_axisticks=None, fontsize_cell_values=None)

Generic heatmap plotting function for 2D matrix data.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • data – 2D array/matrix to be plotted as heatmap

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • xticklabels – list of x axis tick labels

  • yticklabels – list of y axis tick labels

  • title – plot title

  • grid – draw grid if True

  • values_in_cells – draw values of data in heatmap cells

  • round_values_in_cells – round these values to the given number of digits

  • legend – if True, draw a legend

  • fontsize_axislabel – font size for axis label

  • fontsize_axisticks – font size for axis ticks

  • fontsize_cell_values – font size for values in cells

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

Plot probability distribution rankings for topic models

tmtoolkit.topicmod.visualize.plot_topic_word_ranked_prob(fig, ax, topic_word_distrib, n, highlight_label_fmt='topic {i0}', highlight_label_other='other topics', title='Ranked word probability per topic', xaxislabel='word rank', yaxislabel='word probability', **kwargs)

Plot a topic-word probability distribution by ranking the probabilities in each row. This is useful, for example, for examining how many top words usually describe most of a topic.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • topic_word_distrib – topic-word probability distribution

  • n – limit max. shown word rank on x-axis

  • highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows

  • highlight_label_other – if highlight is given, use this as label for non-highlighted rows

  • title – plot title

  • xaxislabel – x-axis label

  • yaxislabel – y-axis label

  • kwargs – further arguments passed to plot_prob_distrib_ranked_prob

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

tmtoolkit.topicmod.visualize.plot_doc_topic_ranked_prob(fig, ax, doc_topic_distrib, n, highlight_label_fmt='document {i0}', highlight_label_other='other documents', title='Ranked topic probability per document', xaxislabel='topic rank', yaxislabel='topic probability', **kwargs)

Plot a document-topic probability distribution by ranking the probabilities in each row. This is useful, for example, for examining how many top topics usually describe most of a document.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • doc_topic_distrib – document-topic probability distribution

  • n – limit max. shown topic rank on x-axis

  • highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows

  • highlight_label_other – if highlight is given, use this as label for non-highlighted rows

  • title – plot title

  • xaxislabel – x-axis label

  • yaxislabel – y-axis label

  • kwargs – further arguments passed to plot_prob_distrib_ranked_prob

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)

tmtoolkit.topicmod.visualize.plot_prob_distrib_ranked_prob(fig, ax, data, x_limit, log_scale=True, lw=1, alpha=0.1, highlight=None, highlight_label_fmt='{i0}', highlight_label_other='other', highlight_lw=3, highlight_alpha=0.3, title=None, xaxislabel='rank', yaxislabel='probability')

Plot a 2D probability distribution (one distribution for each row which should add up to 1) by ranking the probabilities in each row.

Parameters:
  • fig – matplotlib Figure object

  • ax – matplotlib Axes object

  • data – a 2D probability distribution (one distribution for each row which should add up to 1)

  • x_limit – limit max. shown rank on x-axis

  • log_scale – if True, apply log scale on y-axis

  • lw – line width

  • alpha – line transparency

  • highlight – if given, pass a sequence or NumPy array with indices of rows in data, which should be highlighted

  • highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows

  • highlight_label_other – if highlight is given, use this as label for non-highlighted rows

  • highlight_lw – line width for highlighted distributions

  • highlight_alpha – line transparency for highlighted distributions

  • title – plot title

  • xaxislabel – x-axis label

  • yaxislabel – y-axis label

Returns:

tuple of generated (matplotlib Figure object, matplotlib Axes object)
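
A minimal sketch with synthetic data, using plot_topic_word_ranked_prob from above (the distribution is made up):

import numpy as np
import matplotlib.pyplot as plt
from tmtoolkit.topicmod.visualize import plot_topic_word_ranked_prob

# synthetic topic-word distribution: 20 topics over a vocabulary of 1000 words
topic_word = np.random.dirichlet(np.ones(1000), size=20)

fig, ax = plt.subplots()
plot_topic_word_ranked_prob(fig, ax, topic_word, n=100)   # show only the first 100 word ranks
plt.show()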

Plot topic model evaluation results

tmtoolkit.topicmod.visualize.plot_eval_results(eval_results, metric=None, param=None, xaxislabel=None, yaxislabel=None, title=None, title_fontsize='xx-large', subfig_fontsize='large', axes_title_fontsize='medium', show_metric_direction=True, metric_direction_font_size='medium', subplots_adjust_opts=None, figsize='auto', fig_opts=None, subfig_opts=None, subplots_opts=None)

Plot the evaluation results from eval_results, which must be a sequence containing (param_0, …, param_N, metric results) tuples, where param_N is the parameter value to appear on the x axis and all preceding parameters are used to create a small multiples plot (if there is more than one parameter). The metric results can be a dict structure containing the evaluation results for each metric. eval_results can be created using tmtoolkit.topicmod.evaluate.results_by_parameter.

Note

Due to a bug in matplotlib, it seems that it’s not possible to display a plot title when plotting small multiples and adjusting the positioning of the subplots. Hence you must set show_metric_direction to False when you’re displaying small multiples and want to display a plot title.

Parameters:
  • eval_results – topic evaluation results as sequence containing (param_0, …, param_N, metric results)

  • metric – either single string or list of strings; plot only this/these specific metric/s

  • param – names of the parameters used in eval_results

  • xaxislabel – x axis label string

  • yaxislabel – y axis label string

  • title – plot title

  • title_fontsize – font size for the figure title

  • subfig_fontsize – font size for subfigure titles

  • axes_title_fontsize – font size for the plot titles

  • show_metric_direction – if True, show whether the shown metric should be minimized or maximized for optimization

  • metric_direction_font_size – font size for the metric optimization direction indicator

  • subplots_adjust_opts – options passed to Matplotlib’s fig.subplots_adjust()

  • figsize – tuple (width, height) or "auto" (default)

  • fig_opts – additional parameters passed to Matplotlib’s plt.figure()

  • subfig_opts – additional parameters passed to Matplotlib’s fig.subfigures()

  • subplots_opts – additional parameters passed to Matplotlib’s subfig.subplots()

Returns:

tuple of generated (matplotlib Figure object, matplotlib Subfigures, matplotlib Axes)
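
A minimal sketch; eval_res stands for evaluation results obtained elsewhere (e.g. from one of the parallel evaluation functions), and the parameter name 'n_topics' is only an example:

from tmtoolkit.topicmod import evaluate, visualize

eval_by_topics = evaluate.results_by_parameter(eval_res, 'n_topics')
fig, subfigs, axes = visualize.plot_eval_results(eval_by_topics,
                                                 xaxislabel='number of topics')
fig.savefig('eval_results.png')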

Other functions

tmtoolkit.topicmod.visualize.parameters_for_ldavis(topic_word_distrib, doc_topic_distrib, dtm, vocab, sort_topics=False)

Create a parameters dict that can be used with the pyLDAVis package by passing the dict params via pyLDAVis.prepare(**params).

Parameters:
  • topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size

  • doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics

  • dtm – document-term-matrix; shape NxM

  • vocab – vocabulary array/list of length M

  • sort_topics – if True, sort the topics

Returns:

dict with parameters ready to use with pyLDAVis
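
A minimal sketch; topic_word, doc_topic, dtm and vocab are placeholders for data from an already fitted model, and the pyLDAvis package must be installed:

import pyLDAvis
from tmtoolkit.topicmod.visualize import parameters_for_ldavis

params = parameters_for_ldavis(topic_word, doc_topic, dtm, vocab)
vis = pyLDAvis.prepare(**params)
pyLDAvis.save_html(vis, 'ldavis.html')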

Base classes for parallel model fitting and evaluation

Base classes for parallel model fitting and evaluation. See the specific functions and classes in tm_gensim, tm_lda and tm_sklearn for parallel processing with popular topic modeling packages.

Note

The classes and functions in this module are only important if you want to implement your own parallel model computation and evaluation.

class tmtoolkit.topicmod.parallel.MultiprocEvaluationRunner(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)

Specialization of MultiprocModelsRunner for parallel model evaluations.

__init__(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)

Initialize evaluation runner.

Parameters:
  • worker_class – model computation worker class derived from MultiprocModelsWorkerABC

  • available_metrics – list/tuple with available metrics as strings

  • data – the data that the workers use for computations; 2D (sparse) array/matrix

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • metric – string or list of strings; if given, use only this metric(s) for evaluation; must be subset of available_metrics

  • metric_options – dict of options for the used metric(s)

  • n_max_processes – maximum number of worker processes to spawn

  • return_models – if True, also return the computed models in the evaluation results

class tmtoolkit.topicmod.parallel.MultiprocEvaluationWorkerABC(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Specialization of MultiprocModelsWorkerABC for parallel model evaluations.

__init__(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Initialize parallel model evaluations worker class with an ID worker_id, a queue to receive tasks from tasks_queue, a queue to send results to results_queue and the data to operate on. Use evaluation metrics eval_metric.

Parameters:
  • worker_id – process ID

  • eval_metric – list/tuple of strings of evaluation metrics to use

  • eval_metric_options – dict of options for the used metric(s)

  • tasks_queue – queue to receive tasks from

  • results_queue – queue to send results to

  • data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for sparse matrix in COO format

  • group – see Python’s multiprocessing.Process class

  • target – see Python’s multiprocessing.Process class

  • name – see Python’s multiprocessing.Process class

  • args – see Python’s multiprocessing.Process class

  • kwargs – see Python’s multiprocessing.Process class

class tmtoolkit.topicmod.parallel.MultiprocModelsRunner(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Runner class for distributing and managing worker processes for parallel model computation.

__init__(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)

Initialize the runner with a model computation worker class worker_class (which should be derived from MultiprocModelsWorkerABC). This class represents the worker processes; each worker will be instantiated with data and work on it with a different parameter set that can be passed via varying_parameters.

Parameters:
  • worker_class – model computation worker class derived from MultiprocModelsWorkerABC

  • data – the data that the workers use for computations; 2D (sparse) array/matrix or a dict with such matrices; the latter allows running all computations on different datasets at once

  • varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation

  • constant_parameters – dict with parameters that are the same for all parallel computations

  • n_max_processes – maximum number of worker processes to spawn

run()

Set up worker processes and run parallel computations. Blocks until all processes are done, then stops all workers and returns the results.

Returns:

if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset

shutdown_workers()

Send shutdown signal to all worker processes to stop them.

class tmtoolkit.topicmod.parallel.MultiprocModelsWorkerABC(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Abstract base class for parallel model computations worker class.

__init__(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)

Initialize parallel model computations worker class with an ID worker_id, a queue to receive tasks from tasks_queue, a queue to send results to results_queue and the data to operate on.

Parameters:
  • worker_id – process ID

  • tasks_queue – queue to receive tasks from

  • results_queue – queue to send results to

  • data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for sparse matrix in COO format

  • group – see Python’s multiprocessing.Process class

  • target – see Python’s multiprocessing.Process class

  • name – see Python’s multiprocessing.Process class

  • args – see Python’s multiprocessing.Process class

  • kwargs – see Python’s multiprocessing.Process class

fit_model(data, params)

Method stub for implementing the actual model fitting for data with parameter set params.

Parameters:
  • data – data passed to the model fitting algorithm

  • params – parameter set dict

Returns:

model fitting / evaluation results

run()

Run the process worker: Calls fit_model on each dataset and parameter set coming from the tasks queue.

send_results(doc, params, results)

Put the results into the results queue.

Parameters:
  • doc – “document” / dataset label

  • params – used parameter set

  • results – generated results, e.g. fit model and/or evaluation results
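
A rough sketch of how a custom worker and runner might be combined, based only on the interface described above; MyModelWorker, the 'mymodelpkg' identifier, dtm and the parameter sets are placeholders:

from tmtoolkit.topicmod.parallel import MultiprocModelsRunner, MultiprocModelsWorkerABC

class MyModelWorker(MultiprocModelsWorkerABC):
    package_name = 'mymodelpkg'   # hypothetical identifier for the modeling package

    def fit_model(self, data, params):
        # fit some model to `data` with the parameter set `params` and return
        # whatever should appear in the results; here only a stub
        return {'used_params': params}

# dtm stands for a document-term matrix created elsewhere
runner = MultiprocModelsRunner(MyModelWorker, dtm,
                               varying_parameters=[{'n_topics': k} for k in (5, 10, 20)])
models = runner.run()   # list of (parameter set, result) tuples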

tmtoolkit.utils

Misc. utility functions.

tmtoolkit.utils.applychain(funcs, initial_arg)

For n functions in funcs, apply them in a chain: f_0 is applied to initial_arg, f_1 to the result of f_0, and so on, i.e. the result is f_{n-1}(… f_1(f_0(initial_arg)) …).

Parameters:
  • funcs (Iterable[Callable]) – functions to apply; must not be empty

  • initial_arg (Any) – initial function argument

Returns:

result after applying all functions in funcs

Return type:

Any
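
For example, applying two functions in a chain (the first adds one, the second doubles the result):

from tmtoolkit.utils import applychain

applychain([lambda x: x + 1, lambda x: x * 2], 3)
# 8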

tmtoolkit.utils.argsort(seq)

Same as NumPy’s numpy.argsort but for Python sequences.

Parameters:

seq (Sequence) – a sequence

Returns:

indices into seq that sort seq

Return type:

List[int]

tmtoolkit.utils.as_chararray(x)

Convert a NumPy array or sequence x to a NumPy character array. If x is already a NumPy character array, return a copy of it.

Parameters:

x (ndarray | Sequence) – NumPy array or sequence

Returns:

NumPy character array

Return type:

ndarray

tmtoolkit.utils.chararray_elem_size(x)

Return the reserved size of each element in a NumPy unicode character array x, which is the maximum character length of all elements in x, but at least 1. E.g. if x.dtype is '<U5', this function will return 5.

Parameters:

x (ndarray) – NumPy unicode character array

Returns:

reserved size of each element

Return type:

int

tmtoolkit.utils.check_context_size(context_size)

Check a context size for validity. The context size must be given as integer for a symmetric context size or as tuple (left, right) and must contain at least one strictly positive value.

Parameters:

context_size (int | Tuple[int, int] | List[int]) – either scalar int or tuple/list (left, right) – number of surrounding tokens; if scalar, then it is a symmetric surrounding, otherwise can be asymmetric

Returns:

tuple of (left, right) context size

Return type:

Tuple[int, int]

tmtoolkit.utils.combine_sparse_matrices_columnwise(matrices, col_labels, row_labels=None, dtype=None, dtype_cols=None)

Given a sequence of sparse matrices in matrices and their corresponding column labels in col_labels, stack these matrices in rowwise fashion by retaining the column affiliation and filling in zeros, e.g.:

m1:
   C A D
   -----
   1 0 3
   0 2 0

m2:
   D B C A
   -------
   0 0 1 2
   3 4 5 6
   2 1 0 0

will result in:

A B C D
-------
0 0 1 3
2 0 0 0
2 0 1 0
6 4 5 3
0 1 0 2

(where the first two rows come from m1 and the other three rows from m2).

The resulting columns will always be sorted in ascending order.

Additionally, you can pass a sequence of row labels for each matrix via row_labels. This will also sort the rows in ascending order according to the row labels.

Parameters:
  • matrices (Sequence) – sequence of sparse matrices

  • col_labels (Sequence[Sequence[str | int]]) – column labels for each matrix in matrices; may be sequence of strings or integers

  • row_labels (Sequence[Sequence[str]] | None) – optional sequence of row labels for each matrix in matrices

  • dtype (str | dtype | None) – optionally specify the dtype of the resulting sparse matrix

  • dtype_cols (str | dtype | None) – optionally specify the dtype for the column labels

Returns:

a tuple with (1) combined sparse matrix in CSR format; (2) column labels of the matrix; (3) optionally row labels of the matrix if row_labels is not None.

Return type:

Tuple[csr_matrix, ndarray] | Tuple[csr_matrix, ndarray, ndarray]
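
The documented example, expressed in code (using SciPy to build the input matrices):

import numpy as np
from scipy.sparse import coo_matrix
from tmtoolkit.utils import combine_sparse_matrices_columnwise

m1 = coo_matrix(np.array([[1, 0, 3],
                          [0, 2, 0]]))        # columns C, A, D
m2 = coo_matrix(np.array([[0, 0, 1, 2],
                          [3, 4, 5, 6],
                          [2, 1, 0, 0]]))     # columns D, B, C, A

combined, cols = combine_sparse_matrices_columnwise(
    [m1, m2], [['C', 'A', 'D'], ['D', 'B', 'C', 'A']])

cols                   # column labels sorted ascending: A, B, C, D
combined.toarray()     # the 5x4 result shown above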

tmtoolkit.utils.dict2df(data, key_name='key', value_name='value', sort=None)

Take a simple dictionary that maps any key to any scalar value and convert it to a dataframe that contains two columns: one for the keys and one for the respective values. Optionally sort by column sort.

Parameters:
  • data (dict) – dictionary that maps keys to scalar values

  • key_name (str) – column name for the keys

  • value_name (str) – column name for the values

  • sort (str | None) – optionally sort by this column; prepend by “-” to indicate descending sorting order, e.g. “-value”

Returns:

a dataframe with two columns: one for the keys named key_name and one for the respective values named value_name

Return type:

DataFrame
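
For example (the data is made up):

from tmtoolkit.utils import dict2df

dict2df({'spam': 3, 'bacon': 1, 'eggs': 2}, sort='-value')
# dataframe with a "key" and a "value" column, ordered by descending value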

tmtoolkit.utils.disable_logging()

Disable logging for tmtoolkit package.

Return type:

None

tmtoolkit.utils.empty_chararray()

Create empty NumPy character array.

Returns:

empty NumPy character array

Return type:

ndarray

tmtoolkit.utils.enable_logging(level=20, fmt='%(asctime)s:%(levelname)s:%(name)s:%(message)s', logging_handler=None, add_logging_handler=True, **stream_hndlr_opts)

Enable logging for tmtoolkit package with minimum log level level and log message format fmt. By default, logs to stderr via logging.StreamHandler. You may also pass your own log handler.

See also

Currently, only the logging levels INFO and DEBUG are used in tmtoolkit. See the Python Logging HOWTO guide for more information on log levels and formats.

Parameters:
  • level (int) – minimum log level; default is INFO level

  • fmt (str) – log message format

  • logging_handler (Handler | None) – pass a custom logging handler to be used instead of the default logging.StreamHandler

  • add_logging_handler (bool) – if True, add the logging handler to the logger

  • stream_hndlr_opts – optional additional parameters passed to logging.StreamHandler

Return type:

None
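
For example, to show debug messages on stderr and later reduce the verbosity again:

import logging
from tmtoolkit.utils import enable_logging, set_logging_level

enable_logging(logging.DEBUG)
# ... later:
set_logging_level(logging.INFO)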

tmtoolkit.utils.flatten_list(l)

Flatten a 2D sequence l to a 1D list and return it.

Although return sum(l, []) looks like a very nice one-liner, it turns out to be much slower than this implementation.

Parameters:

l (Iterable[Iterable]) – 2D sequence, e.g. list of lists

Returns:

flattened list, i.e. a 1D list that concatenates all elements from each list inside l

Return type:

list

tmtoolkit.utils.greedy_partitioning(elems_dict, k, return_only_labels=False)

Implementation of a greedy partitioning algorithm for a dict elems_dict containing elements with a label -> weight mapping. A weight can be a number in an arbitrary range. Since this is used for task scheduling, you can think of it as: the larger the weight, the bigger the task.

The elements are placed in k bins such that the difference of sums of weights in each bin is minimized. The algorithm does not always find the optimal solution.

If return_only_labels is False, returns a list of k dicts with label -> weight mapping, else returns a list of k lists containing only the labels for the respective partitions.

Parameters:
  • elems_dict (Dict[str, int | float]) – dictionary containing elements with label -> weight mapping

  • k (int) – number of bins

  • return_only_labels – if True, only return the labels in each bin

Returns:

list with k bins, where each bin is either a dict with label -> weight mapping if return_only_labels is False or a list of labels

Return type:

List[Dict[str, int | float]] | List[List[str]]
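
For example, distributing five "tasks" with different weights over two bins (the weights are made up):

from tmtoolkit.utils import greedy_partitioning

tasks = {'a': 10, 'b': 8, 'c': 5, 'd': 3, 'e': 2}
greedy_partitioning(tasks, k=2)
# two dicts with label -> weight mappings whose weight sums are roughly balanced

greedy_partitioning(tasks, k=2, return_only_labels=True)
# the same partitioning, but only the labels per bin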

tmtoolkit.utils.indices_of_matches(a, b, b_is_sorted=False, check_a_in_b=False)

Return the indices into 1D array b where elements in 1D array a equal an element in b. E.g.: Suppose b is a vocabulary like [13, 10, 12, 8] and a is a sequence of tokens [12, 13]. Then indices_of_matches(a, b) will return [2, 0] since first element in a equals b[2] and the second element in a equals b[0].

Parameters:
  • a (ndarray) – 1D array which will be searched in b

  • b (ndarray) – 1D array of elements to match against; result will produce indices into this array; should have same dtype as a

  • b_is_sorted (bool) – set this to True if you’re sure that b is sorted; then a shortcut will be used

  • check_a_in_b (bool) – if True then check if all elements in a exist in b; if this is not the case, raise an exception

Returns:

1D array of indices; length equals the length of a

Return type:

ndarray
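
The documented example, expressed in code:

import numpy as np
from tmtoolkit.utils import indices_of_matches

vocab = np.array([13, 10, 12, 8])
tokens = np.array([12, 13])
indices_of_matches(tokens, vocab)
# array([2, 0])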

tmtoolkit.utils.linebreaks_win2unix(text)

Convert Windows line breaks \r\n to Unix line breaks \n.

Parameters:

text (str) – text string

Returns:

text string with Unix line breaks

Return type:

str

tmtoolkit.utils.mat2d_window_from_indices(mat, row_indices=None, col_indices=None, copy=False)

Select an area ("window") inside a 2D array/matrix mat specified by a sequence of row indices row_indices and/or a sequence of column indices col_indices. Returns the specified area as a view of the data if copy is False, else returns a copy.

Parameters:
  • mat (ndarray) – a 2D NumPy array

  • row_indices (List[int] | ndarray | None) – list or array of row indices to select or None to select all rows

  • col_indices (List[int] | ndarray | None) – list or array of column indices to select or None to select all columns

  • copy – if True, return result as copy, else as view into mat

Returns:

window into mat as specified by the passed indices

Return type:

ndarray

tmtoolkit.utils.merge_dicts(dicts, sort_keys=False, safe=False)

Merge all dictionaries in dicts to form a single dict.

Parameters:
  • dicts (Sequence[dict]) – sequence of dictionaries to merge

  • sort_keys (bool) – sort the keys in the resulting dictionary

  • safe (bool) – if True, raise a ValueError if sets of keys in dicts are not disjoint, else later dicts in the sequence will silently update already existing data with the same key

Returns:

merged dictionary

Return type:

dict

tmtoolkit.utils.merge_sets(sets, safe=False)

Merge all sets in sets to form a single set.

Parameters:
  • sets (Sequence[set]) – sequence of sets to merge

  • safe (bool) – if True, raise a ValueError if sets are not disjoint

Returns:

merged set

Return type:

set

tmtoolkit.utils.pairwise_max_table(m, labels=None, output_columns=('x', 'y', 'value'), sort=True, skip_zeros=False)

Given a symmetric or triangular matrix or dataframe m in which each entry m[i,j] denotes some metric between a pair (i, j), this function takes the maximum entry for each row and outputs the result as a dataframe, i.e. a table listing, for each i, the pair (i, j) with the maximum value.

Parameters:
  • m (ndarray | spmatrix | DataFrame) – symmetric or triangular matrix or dataframe; can be a sparse matrix

  • labels (Sequence | None) – sequence of pair labels; if m is a dataframe, the labels will be taken from its column names

  • output_columns (Sequence[str]) – names of columns in output dataframe

  • sort (str | bool | None) – optionally sort by this column; by default will sort by last column in output_columns in descending order; pass a string to specify the column and prepend by “-” to indicate descending sorting order, e.g. “-value”

  • skip_zeros (bool) – don’t store pair entries with value zero in the result

Returns:

dataframe with pair maxima

Return type:

DataFrame
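
A minimal sketch with a made-up symmetric "similarity" matrix for the pair labels a, b, c:

import numpy as np
from tmtoolkit.utils import pairwise_max_table

sim = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.3],
                [0.1, 0.3, 0.0]])

pairwise_max_table(sim, labels=['a', 'b', 'c'])
# dataframe with columns x, y, value; e.g. the maximum for row "a" is the pair (a, b) with value 0.8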

tmtoolkit.utils.partial_sparse_log(x, logfn=<ufunc 'log'>)

Apply logarithm function logfn to all non-zero elements in sparse matrix x.

Note

Applying \(\log(x)\) only to non-zero elements in \(x\) does not produce mathematically correct results, since \(\log(0)\) is not defined (but \(\log(x)\) approaches minus infinity if \(x\) goes toward 0). However, if you further process a matrix x, e.g. by replacing negative values with 0 as for example in the PPMI calculation, this function is still useful.

Parameters:
  • x (spmatrix) – a sparse matrix

  • logfn (Callable[[ndarray], ndarray]) – a logarithm function that accepts a numpy array and returns a numpy array

Returns:

a sparse matrix with logfn applied to all non-zero elements

Return type:

spmatrix

tmtoolkit.utils.path_split(path, base=None)

Split path path into its components:

path_split('a/simple/test.txt')
# ['a', 'simple', 'test.txt']

Parameters:
  • path (str) – a file path

  • base (List[str] | None) – path remainder (used for recursion)

Returns:

components of the path as list

Return type:

List[str]

tmtoolkit.utils.pickle_data(data, picklefile, **kwargs)

Save data in picklefile with Python’s pickle module.

Parameters:
  • data (Any) – data to store in picklefile

  • picklefile (str) – either target file path as string or file handle

  • kwargs – further parameters passed to pickle.dump

Return type:

None

tmtoolkit.utils.read_text_file(fpath, encoding, read_size=-1, force_unix_linebreaks=True)

Read the text file at path fpath with character encoding encoding and return it as string.

Parameters:
  • fpath (str) – path to file to read

  • encoding (str) – character encoding

  • read_size (int) – max. number of characters to read. -1 means read full file.

  • force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks

Returns:

file content as string

Return type:

str

tmtoolkit.utils.sample_dict(d, n)

Return a subset of the dictionary d as a random sample of size n.

Parameters:
  • d (dict) – dictionary to sample

  • n (int) – sample size; must be positive and smaller than or equal to len(d)

Returns:

subset of the input dictionary

Return type:

dict

tmtoolkit.utils.set_logging_level(level)

Set logging level for tmtoolkit package default logging handler.

Parameters:

level (int) – minimum log level

Return type:

None

tmtoolkit.utils.sorted_df(df, sort=None, **kwargs)

Sort a dataframe df by column sort if sort is not None. Otherwise, keep df unchanged.

Parameters:
  • df (DataFrame) – input dataframe

  • sort (str | None) – optionally sort by this column; prepend by “-” to indicate descending sorting order, e.g. “-value”

  • kwargs – optional arguments passed to pandas.DataFrame.sort_values

Returns:

optionally sorted dataframe

Return type:

DataFrame

tmtoolkit.utils.split_func_args(fn, args)

Split keyword arguments args so that all function arguments for fn are the first element of the returned tuple and the rest of the arguments are the second element of the returned tuple.

Parameters:
  • fn (Callable) – a function

  • args (Dict[str, Any]) – keyword arguments dict

Returns:

tuple with two dict elements: all arguments for fn are the first element, the rest of the arguments are the second element

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]
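
For example, splitting a mixed keyword argument dict for a small helper function (the function and arguments are made up):

from tmtoolkit.utils import split_func_args

def tokenize(text, lowercase=True):
    toks = text.split()
    return [t.lower() for t in toks] if lowercase else toks

fn_args, rest = split_func_args(tokenize, {'text': 'A b C', 'lowercase': False, 'n_jobs': 4})
# fn_args == {'text': 'A b C', 'lowercase': False}
# rest    == {'n_jobs': 4}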

tmtoolkit.utils.unpickle_file(picklefile, **kwargs)

Load data from picklefile with Python’s pickle module.

Warning

Python pickle files may contain malicious code. You should only load pickle files from trusted sources.

Parameters:
  • picklefile (str) – either target file path as string or file handle

  • kwargs – further parameters passed to pickle.load

Returns:

data stored in picklefile

Return type:

Any