API
tmtoolkit.bow
tmtoolkit.bow.bow_stats
Common statistics from bag-of-words (BoW) matrices.
- tmtoolkit.bow.bow_stats.codoc_frequencies(dtm, min_val=1, proportions=0)
Calculate the co-document frequency (aka word co-occurrence) matrix for a document-term matrix dtm, i.e. how often each pair of tokens occurs together at least min_val times in the same document. If proportions is True, return proportions scaled to the number of documents instead of absolute numbers.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
min_val – threshold for counting occurrences
proportions – one of Proportion:
NO (0) – return counts
YES (1) – return proportions
LOG (2) – convert input to dense matrix if necessary and return log(proportions + 1)
- Returns
co-document frequency (aka word co-occurrence) matrix with shape (vocab size, vocab size)
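The co-document counts can be illustrated with a small NumPy sketch for dense input and raw counts only (the name codoc_freq_sketch is illustrative, not a library function):

```python
import numpy as np

def codoc_freq_sketch(dtm, min_val=1):
    # binary indicator: does term j occur at least min_val times in doc i?
    binary = (dtm >= min_val).astype(int)
    # entry [j, k]: number of documents containing both term j and term k
    return binary.T @ binary

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
cooc = codoc_freq_sketch(dtm)
# cooc[0, 1] == 1: terms 0 and 1 occur together only in document 0
```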
- tmtoolkit.bow.bow_stats.doc_frequencies(dtm, min_val=1, proportions=0)
For each term in the vocab of dtm (i.e. its columns), return how often it occurs at least min_val times per document.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
min_val – threshold for counting occurrences
proportions – one of Proportion:
NO (0) – return counts
YES (1) – return proportions
LOG (2) – return log of proportions
- Returns
NumPy array of size M (vocab size) indicating how often each term occurs at least min_val times.
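The counting logic corresponds to this minimal NumPy sketch (doc_freq_sketch is an illustrative name, not a library function):

```python
import numpy as np

def doc_freq_sketch(dtm, min_val=1):
    # per term (column): in how many documents does it occur >= min_val times?
    return (dtm >= min_val).sum(axis=0)

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
df1 = doc_freq_sketch(dtm)             # array([2, 2, 2])
df2 = doc_freq_sketch(dtm, min_val=2)  # array([1, 1, 0])
```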
- tmtoolkit.bow.bow_stats.doc_lengths(dtm)
Return the length, i.e. number of terms for each document in document-term-matrix dtm. This corresponds to the row-wise sums in dtm.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
- Returns
NumPy array of size N (number of docs) with integers indicating the number of terms per document
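Since document lengths are just row-wise sums, the equivalent NumPy operation on a dense matrix is:

```python
import numpy as np

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
lengths = dtm.sum(axis=1)  # number of tokens per document: array([3, 2, 4])
```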
- tmtoolkit.bow.bow_stats.idf(dtm, smooth_log=1, smooth_df=1)
Calculate the inverse document frequency (idf) vector from the raw count document-term matrix dtm with the formula log(smooth_log + N / (smooth_df + df)), where N is the number of documents, df is the document frequency (see function doc_frequencies), and smooth_log and smooth_df are smoothing constants. With the default arguments, the formula is thus log(1 + N/(1 + df)).
Note that this may introduce NaN values due to division by zero when a document is of length 0.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
smooth_log – smoothing constant inside log()
smooth_df – smoothing constant to add to document frequency
- Returns
NumPy array of size M (vocab size) with inverse document frequency for each term in the vocab
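The default formula log(1 + N/(1 + df)) can be reproduced with NumPy as follows (idf_sketch is an illustrative name, not the library function):

```python
import numpy as np

def idf_sketch(dtm, smooth_log=1, smooth_df=1):
    n_docs = dtm.shape[0]
    df = (dtm >= 1).sum(axis=0)  # document frequency per term
    return np.log(smooth_log + n_docs / (smooth_df + df))

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
idf_vec = idf_sketch(dtm)
# each term occurs in 2 of 3 documents -> log(1 + 3/3) = log(2) everywhere
```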
- tmtoolkit.bow.bow_stats.idf_probabilistic(dtm, smooth=1)
Calculate the probabilistic inverse document frequency (idf) vector from the raw count document-term matrix dtm with the formula log(smooth + (N - df) / df), where N is the number of documents and df is the document frequency (see function doc_frequencies).
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
smooth – smoothing constant (setting this to 0 can lead to -inf results)
- Returns
NumPy array of size M (vocab size) with probabilistic inverse document frequency for each term in the vocab
- tmtoolkit.bow.bow_stats.sorted_terms(mat, vocab, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False, table_doc_labels=None)
For each row (i.e. document) in a (sparse) document-term matrix mat, do the following:
filter all values according to lo_thresh and hi_tresh
sort the values and the corresponding terms from vocab according to ascending
optionally select only the top top_n terms
generate a list of pairs of terms and values
Return the collected lists for each row or convert the result to a data table if document labels are passed via table_doc_labels (see the shortcut function sorted_terms_table).
- Parameters
mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)
vocab – list or array of vocabulary corresponding to columns in mat
lo_thresh – if not None, filter for values greater than lo_thresh
hi_tresh – if not None, filter for values less than or equal to hi_tresh
top_n – if not None, select only the top top_n terms
ascending – sorting direction
table_doc_labels – optional list/array of document labels corresponding to mat rows
- Returns
list of lists with (term, value) tuples, or a data table with columns “doc”, “term”, “value” if table_doc_labels is given
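The per-row logic can be sketched with NumPy (simplified: descending order only, no upper threshold; sorted_terms_sketch is an illustrative name):

```python
import numpy as np

def sorted_terms_sketch(mat, vocab, lo_thresh=0, top_n=None):
    # per row: keep values > lo_thresh, sort descending, optionally cut to top_n
    res = []
    for row in np.asarray(mat):
        idx = np.where(row > lo_thresh)[0]
        idx = idx[np.argsort(row[idx])[::-1]]
        if top_n is not None:
            idx = idx[:top_n]
        res.append([(vocab[i], row[i]) for i in idx])
    return res

mat = np.array([[0.1, 0.0, 0.5],
                [0.3, 0.2, 0.0]])
vocab = ['a', 'b', 'c']
top = sorted_terms_sketch(mat, vocab, top_n=1)
# [[('c', 0.5)], [('a', 0.3)]]
```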
- tmtoolkit.bow.bow_stats.sorted_terms_table(mat, vocab, doc_labels, lo_thresh=0, hi_tresh=None, top_n=None, ascending=False)
Shortcut function for sorted_terms which generates a data table with doc_labels.
- Parameters
mat – (sparse) document-term-matrix mat (may be tf-idf transformed or any other transformation)
vocab – list or array of vocabulary corresponding to columns in mat
doc_labels – list/array of document labels corresponding to mat rows
lo_thresh – if not None, filter for values greater than lo_thresh
hi_tresh – if not None, filter for values less than or equal to hi_tresh
top_n – if not None, select only the top top_n terms
ascending – sorting direction
- Returns
data table with columns “doc”, “term”, “value”
- tmtoolkit.bow.bow_stats.term_frequencies(dtm, proportions=0)
Return the number of occurrences of each term in the vocab across all documents in document-term-matrix dtm. This corresponds to the column-wise sums in dtm.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
proportions – one of Proportion:
NO (0) – return counts
YES (1) – return proportions
LOG (2) – return log of proportions
- Returns
NumPy array of size M (vocab size) indicating the number of occurrences of each term in the vocab across all documents, as counts, proportions or log proportions depending on proportions.
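Equivalently, with NumPy on a dense matrix:

```python
import numpy as np

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
counts = dtm.sum(axis=0)        # column-wise sums: array([4, 3, 2])
props = counts / counts.sum()   # proportions variant; sums to 1
```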
- tmtoolkit.bow.bow_stats.tf_binary(dtm)
Transform raw count document-term-matrix dtm to binary term frequency matrix. This matrix contains 1 whenever a term occurred in a document, else 0.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
- Returns
(sparse) binary term frequency matrix of type integer of size NxM
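On a dense matrix this transformation is simply:

```python
import numpy as np

dtm = np.array([[1, 2, 0],
                [0, 1, 1]])
tf_bin = (dtm > 0).astype(int)
# [[1, 1, 0],
#  [0, 1, 1]]
```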
- tmtoolkit.bow.bow_stats.tf_double_norm(dtm, K=0.5)
Transform the raw count document-term matrix dtm to a double-normalized term frequency matrix K + (1-K) * dtm / max{t in doc}, where max{t in doc} is a vector of size N containing the maximum term count per document.
Note that this may introduce NaN values due to division by zero when a document is of length 0.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
K – normalization factor
- Returns
double-normalized term frequency matrix of size NxM
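A NumPy sketch of this formula for dense input (tf_double_norm_sketch is an illustrative name, not the library function):

```python
import numpy as np

def tf_double_norm_sketch(dtm, K=0.5):
    # maximum term count per document, kept as a column vector for broadcasting
    max_per_doc = dtm.max(axis=1, keepdims=True)
    return K + (1 - K) * dtm / max_per_doc

dtm = np.array([[1, 2, 0],
                [0, 4, 1]])
res = tf_double_norm_sketch(dtm)
# row 0: 0.5 + 0.5 * [1, 2, 0] / 2 -> [0.75, 1.0, 0.5]
```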
- tmtoolkit.bow.bow_stats.tf_log(dtm, log_fn=<ufunc 'log1p'>)
Transform the raw count document-term matrix dtm to a log-normalized term frequency matrix log_fn(dtm).
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts.
log_fn – log function to use; default is NumPy’s numpy.log1p, which calculates log(1 + x)
- Returns
(sparse) log-normalized term frequency matrix of size NxM
- tmtoolkit.bow.bow_stats.tf_proportions(dtm)
Transform raw count document-term-matrix dtm to term frequency matrix with proportions, i.e. term counts normalized by document length.
Note that this may introduce NaN values due to division by zero when a document is of length 0.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
- Returns
(sparse) term frequency matrix of size NxM with proportions, i.e. term counts normalized by document length
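The row normalization corresponds to this NumPy operation on a dense matrix:

```python
import numpy as np

dtm = np.array([[1, 2, 1],
                [0, 3, 1]])
tf_prop = dtm / dtm.sum(axis=1, keepdims=True)
# row 0 -> [0.25, 0.5, 0.25]; each row sums to 1
```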
- tmtoolkit.bow.bow_stats.tfidf(dtm, tf_func=<function tf_proportions>, idf_func=<function idf>, **kwargs)
Calculate a tfidf (term frequency inverse document frequency) matrix from the raw count document-term matrix dtm with the matrix multiplication tf * diag(idf), where tf is the term frequency matrix tf_func(dtm) and idf is the inverse document frequency vector idf_func(dtm).
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
tf_func – function to calculate the term frequency matrix; see the tf_* functions in this module
idf_func – function to calculate the inverse document frequency vector; see the idf_* functions in this module
kwargs – additional parameters passed to tf_func or idf_func, like K or smooth (depending on which parameters these functions accept)
- Returns
(sparse) tfidf matrix of size NxM
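Combining the documented defaults (tf_proportions and idf) gives this NumPy sketch; broadcasting tf * idf is equivalent to the matrix product tf @ diag(idf) (tfidf_sketch is an illustrative name, not the library function):

```python
import numpy as np

def tfidf_sketch(dtm):
    tf = dtm / dtm.sum(axis=1, keepdims=True)   # tf_proportions
    df = (dtm >= 1).sum(axis=0)                 # document frequency
    idf = np.log(1 + dtm.shape[0] / (1 + df))   # default idf formula
    return tf * idf  # broadcasting, same as tf @ np.diag(idf)

dtm = np.array([[1, 2, 0],
                [0, 1, 1],
                [3, 0, 1]])
mat = tfidf_sketch(dtm)
# mat[0, 0] = (1/3) * log(2), since term 0 occurs in 2 of 3 documents
```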
- tmtoolkit.bow.bow_stats.word_cooccurrence(dtm, min_val=1, proportions=0)
Calculate the co-document frequency (aka word co-occurrence) matrix. Alias for codoc_frequencies.
tmtoolkit.bow.dtm
Functions for creating a document-term matrix (DTM) and some compatibility functions for Gensim.
- tmtoolkit.bow.dtm.create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=False, dtype=None)
Create a sparse document-term matrix (DTM) in COO sparse format from the vocabulary array vocab, a list of tokenized documents docs and the number of unique tokens across all documents n_unique_tokens.
The DTM’s rows correspond to the documents in docs and its columns to the terms in vocab, hence a value DTM[j, k] is the term frequency of term vocab[k] in document j.
A note on performance: creating the three arrays for a COO matrix seems to be the fastest way to generate a DTM. An alternative implementation using LIL format was about 2x slower.
Memory requirement: about 3 * <n_unique_tokens> * 4 bytes with the default dtype (32-bit integer).
See also
This is the “low level” function. For the straightforward-to-use function see tmtoolkit.corpus.dtm, which also calculates n_unique_tokens.
- Parameters
vocab – list or array of vocabulary; determines the columns of the resulting DTM
docs – a list of tokenized documents
n_unique_tokens – number of unique tokens across all documents
vocab_is_sorted – if True, assume that vocab is sorted when creating the token IDs
dtype – data type of the resulting matrix
- Returns
a sparse document-term-matrix in COO sparse format
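The COO triplet construction can be sketched in pure Python: the three arrays (row indices, column indices, counts) are what a COO sparse matrix constructor such as scipy.sparse.coo_matrix consumes (coo_triplets is an illustrative helper, not part of the library):

```python
from collections import Counter

def coo_triplets(vocab, docs):
    # map each token to its column index in vocab
    token2id = {t: i for i, t in enumerate(vocab)}
    rows, cols, data = [], [], []
    for doc_idx, tokens in enumerate(docs):
        for token, count in Counter(tokens).items():
            rows.append(doc_idx)       # document (row) index
            cols.append(token2id[token])  # term (column) index
            data.append(count)         # term frequency in that document
    return rows, cols, data

vocab = ['a', 'b', 'c']
docs = [['a', 'b', 'a'], ['c']]
rows, cols, data = coo_triplets(vocab, docs)
# e.g. triplet (0, 0, 2): term 'a' occurs twice in document 0
```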
- tmtoolkit.bow.dtm.dtm_and_vocab_to_gensim_corpus_and_dict(dtm, vocab, as_gensim_dictionary=True)
Convert a (sparse) DTM and a vocabulary list to a Gensim Corpus object and a Gensim Dictionary object or a Python dict.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
vocab – list or array of vocabulary
as_gensim_dictionary – if True, create a Gensim Dictionary from vocab, else create a Python dict
- Returns
a 2-tuple with (Corpus object, Gensim Dictionary or Python dict)
- tmtoolkit.bow.dtm.dtm_to_dataframe(dtm, doc_labels, vocab)
Convert a (sparse) DTM to a pandas DataFrame using document labels doc_labels as row index and vocab as column names.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
doc_labels – document labels used as row index (row names); size must equal number of rows in dtm
vocab – list or array of vocabulary used as column names; size must equal number of columns in dtm
- Returns
pandas DataFrame
- tmtoolkit.bow.dtm.dtm_to_gensim_corpus(dtm)
Convert a (sparse) DTM to a Gensim Corpus object.
See also
gensim_corpus_to_dtm for the inverse function, or dtm_and_vocab_to_gensim_corpus_and_dict, which additionally creates a Gensim Dictionary.
- Parameters
dtm – (sparse) document-term-matrix of size NxM (N docs, M is vocab size) with raw term counts
- Returns
a Gensim gensim.matutils.Sparse2Corpus object
- tmtoolkit.bow.dtm.gensim_corpus_to_dtm(corpus)
Convert a Gensim corpus object to a sparse DTM in COO format.
See also
dtm_to_gensim_corpus for the inverse function.
- Parameters
corpus – Gensim corpus object
- Returns
sparse DTM in COO format
tmtoolkit.corpus
Corpus class and corpus functions
Functions to visualize corpus summary statistics
tmtoolkit.tokenseq
Module for functions that work with text represented as token sequences, e.g. ["A", "test", "document", "."], and single tokens (i.e. strings).
Tokens don’t have to be represented as strings – for many functions, they may also be token hashes (as integers). Most functions also accept NumPy arrays instead of lists / tuples.
- [RoleNadif2011] Role, François & Nadif, Mohamed (2011). Handling the Impact of Low Frequency Events on Co-occurrence based Measures of Word Similarity – A Case Study of Pointwise Mutual Information.
- [Bouma2009] Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 30, 31–40.
- tmtoolkit.tokenseq.index_windows_around_matches(matches, left, right, flatten=False, remove_overlaps=True)
Take a boolean 1D array matches of length N and generate an array of indices, where each occurrence of a True value in the boolean vector at index i generates a sequence of the form:
[i-left, i-left+1, ..., i, ..., i+right-1, i+right]
If flatten is True, then a flattened NumPy 1D array is returned. Otherwise, a list of NumPy arrays is returned, where each array contains the window indices.
remove_overlaps is only applied when flatten is True.
Example with left=1 and right=1, flatten=False:
input:
#   0     1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*):
[[*0*, 1], [0, *1*, 2], [3, *4*, 5], [7, *8*]]
Example with left=1 and right=1, flatten=True, remove_overlaps=True:
input:
#   0     1      2      3     4      5      6      7     8
[True, True, False, False, True, False, False, False, True]
output (matches *highlighted*, other values belong to the respective "windows"):
[*0*, *1*, 2, 3, *4*, 5, 7, *8*]
- Parameters
matches (ndarray) – boolean 1D array of length N
left (int) – number of indices to take left of each match
right (int) – number of indices to take right of each match
flatten (bool) – if True, return a flattened 1D array instead of a list of arrays
remove_overlaps (bool) – if True and flatten is True, remove overlapping window indices
- Return type
Union[List[List[int]], ndarray]
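The behavior can be sketched in pure Python (the real function works on and returns NumPy arrays; windows_around_matches is an illustrative name):

```python
def windows_around_matches(matches, left, right, flatten=False, remove_overlaps=True):
    # each True at index i expands to [i-left, ..., i+right], clipped to bounds
    n = len(matches)
    windows = [list(range(max(0, i - left), min(n, i + right + 1)))
               for i, m in enumerate(matches) if m]
    if not flatten:
        return windows
    flat = [j for w in windows for j in w]
    if remove_overlaps:
        flat = sorted(set(flat))  # drop duplicate indices from overlapping windows
    return flat

matches = [True, True, False, False, True, False, False, False, True]
nested = windows_around_matches(matches, 1, 1)
# [[0, 1], [0, 1, 2], [3, 4, 5], [7, 8]]
flat = windows_around_matches(matches, 1, 1, flatten=True)
# [0, 1, 2, 3, 4, 5, 7, 8]
```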
- tmtoolkit.tokenseq.npmi(x, y, xy, n_total=None, logfn=<ufunc 'log'>, *, k=1, normalize=True)
Calculate the pointwise mutual information measure (PMI), either from probabilities p(x), p(y), p(x, y) given as x, y, xy, or from total counts x, y, xy and additionally n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.
Probabilities should be such that p(x, y) <= min(p(x), p(y)).
- Parameters
x (ndarray) – probabilities p(x) or count of occurrence of x (interpreted as count if n_total is given)
y (ndarray) – probabilities p(y) or count of occurrence of y (interpreted as count if n_total is given)
xy (ndarray) – probabilities p(x, y) or count of occurrence of x and y (interpreted as count if n_total is given)
n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space
logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)
k (int) – if k > 1, calculate the PMI^k variant
normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure
- Returns
array with same length as inputs containing (N)PMI measures for each input probability
- Return type
ndarray
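The count-based NPMI computation for a single pair can be sketched in plain Python (scalar inputs only; the library version is vectorized over NumPy arrays and npmi_sketch is an illustrative name):

```python
from math import log

def npmi_sketch(x, y, xy, n_total=None):
    # interpret inputs as counts if n_total is given, else as probabilities
    if n_total is not None:
        x, y, xy = x / n_total, y / n_total, xy / n_total
    pmi = log(xy / (x * y))
    return pmi / -log(xy)  # normalize to the range [-1, 1]

# tokens that each occur 10 times in 1000 tokens and co-occur 8 times
score = npmi_sketch(10, 10, 8, n_total=1000)   # close to 1 -> strong association
indep = npmi_sketch(0.1, 0.1, 0.01)            # ~0 for independent events
```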
- tmtoolkit.tokenseq.numbertoken_to_magnitude(numbertoken, char='0', firstchar='1', below_one='0', zero='0', decimal_sep='.', thousands_sep=',', drop_sign=False, value_on_conversion_error='')
Convert a string token numbertoken that represents a number (e.g. “13”, “1.3” or “-1313”) to a string token that represents the magnitude of that number by repeating char (“10”, “1”, “-1000” for the mentioned examples). A different first character can be set via firstchar. The sign can be dropped via drop_sign.
If numbertoken cannot be converted to a float, either the value value_on_conversion_error is returned or numbertoken is returned unchanged if value_on_conversion_error is None.
- Parameters
numbertoken (str) – token that represents a number
char (str) – character string used to represent single orders of magnitude
firstchar (str) – special character used for first character in the output
below_one (str) – special character used for numbers with absolute value below 1 (would otherwise return ‘’)
zero (str) – if numbertoken evaluates to zero, return this string
decimal_sep (str) – decimal separator used in numbertoken; this is language-specific
thousands_sep (str) – thousands separator used in numbertoken; this is language-specific
drop_sign (bool) – if True, drop the sign in number numbertoken, i.e. use absolute value
value_on_conversion_error (Optional[str]) – determines return value when numbertoken cannot be converted to a number; if None, return input numbertoken unchanged, otherwise return value_on_conversion_error
- Returns
string that represents the magnitude of the input or an empty string
- Return type
str
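The magnitude conversion can be sketched in pure Python under simplifying assumptions (“.” decimal separator, “,” thousands separator, default error value ''; magnitude_sketch is an illustrative name, not the library function):

```python
from math import floor, log10

def magnitude_sketch(numbertoken, char='0', firstchar='1', below_one='0',
                     zero='0', drop_sign=False):
    try:
        value = float(numbertoken.replace(',', ''))  # strip thousands separators
    except ValueError:
        return ''  # corresponds to the default value_on_conversion_error
    if value == 0:
        return zero
    sign = '-' if value < 0 and not drop_sign else ''
    if abs(value) < 1:
        return sign + below_one
    n_digits = floor(log10(abs(value)))  # order of magnitude
    return sign + firstchar + char * n_digits

magnitude_sketch('13')     # '10'
magnitude_sketch('1.3')    # '1'
magnitude_sketch('-1313')  # '-1000'
```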
- tmtoolkit.tokenseq.pmi(x, y, xy, n_total=None, logfn=<ufunc 'log'>, k=1, normalize=False)
Calculate the pointwise mutual information measure (PMI), either from probabilities p(x), p(y), p(x, y) given as x, y, xy, or from total counts x, y, xy and additionally n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.
Probabilities should be such that p(x, y) <= min(p(x), p(y)).
- Parameters
x (ndarray) – probabilities p(x) or count of occurrence of x (interpreted as count if n_total is given)
y (ndarray) – probabilities p(y) or count of occurrence of y (interpreted as count if n_total is given)
xy (ndarray) – probabilities p(x, y) or count of occurrence of x and y (interpreted as count if n_total is given)
n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space
logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)
k (int) – if k > 1, calculate the PMI^k variant
normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure
- Returns
array with same length as inputs containing (N)PMI measures for each input probability
- Return type
ndarray
- tmtoolkit.tokenseq.pmi2(x, y, xy, n_total=None, logfn=<ufunc 'log'>, *, k=2, normalize=False)
Calculate the pointwise mutual information measure (PMI), either from probabilities p(x), p(y), p(x, y) given as x, y, xy, or from total counts x, y, xy and additionally n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.
Probabilities should be such that p(x, y) <= min(p(x), p(y)).
- Parameters
x (ndarray) – probabilities p(x) or count of occurrence of x (interpreted as count if n_total is given)
y (ndarray) – probabilities p(y) or count of occurrence of y (interpreted as count if n_total is given)
xy (ndarray) – probabilities p(x, y) or count of occurrence of x and y (interpreted as count if n_total is given)
n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space
logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)
k (int) – if k > 1, calculate the PMI^k variant
normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure
- Returns
array with same length as inputs containing (N)PMI measures for each input probability
- Return type
ndarray
- tmtoolkit.tokenseq.pmi3(x, y, xy, n_total=None, logfn=<ufunc 'log'>, *, k=3, normalize=False)
Calculate the pointwise mutual information measure (PMI), either from probabilities p(x), p(y), p(x, y) given as x, y, xy, or from total counts x, y, xy and additionally n_total. Setting k > 1 gives PMI^k variants. Setting normalize to True gives normalized PMI (NPMI) as in [Bouma2009]. See [RoleNadif2011] for a comparison of PMI variants.
Probabilities should be such that p(x, y) <= min(p(x), p(y)).
- Parameters
x (ndarray) – probabilities p(x) or count of occurrence of x (interpreted as count if n_total is given)
y (ndarray) – probabilities p(y) or count of occurrence of y (interpreted as count if n_total is given)
xy (ndarray) – probabilities p(x, y) or count of occurrence of x and y (interpreted as count if n_total is given)
n_total (Optional[int]) – if given, x, y and xy are interpreted as counts with n_total as size of the sample space
logfn (Callable) – logarithm function to use (default: np.log – natural logarithm)
k (int) – if k > 1, calculate the PMI^k variant
normalize (bool) – if True, normalize to range [-1, 1]; gives NPMI measure
- Returns
array with same length as inputs containing (N)PMI measures for each input probability
- Return type
ndarray
- tmtoolkit.tokenseq.simple_collocation_counts(x, y, xy, n_total)
“Statistic” function that can be used in token_collocations; it simply returns the number of collocations between tokens x and y passed as xy. Mainly useful for debugging purposes.
- Parameters
x (Optional[ndarray]) – unused
y (Optional[ndarray]) – unused
xy (ndarray) – counts for collocations of x and y
n_total (Optional[int]) – total number of tokens (strictly positive)
- Returns
simply returns xy
- tmtoolkit.tokenseq.simplify_unicode_chars(token, method='icu', ascii_encoding_errors='ignore')
Simplify unicode characters in string token, i.e. remove diacritics, underlines and other marks. Requires PyICU to be installed when using method="icu".
- Parameters
token (str) – string to simplify
method (str) – either "icu", which uses PyICU for “proper” simplification, or "ascii", which tries to encode the characters as ASCII; the latter is not recommended and will simply dismiss any characters that cannot be converted to ASCII after decomposition
ascii_encoding_errors (str) – only used if method is "ascii"; what to do when a character cannot be encoded as an ASCII character; can be either "ignore" (default – replace by empty character), "replace" (replace by "???") or "strict" (raise a UnicodeEncodeError)
- Returns
simplified string
- Return type
str
- tmtoolkit.tokenseq.strip_tags(value)
Return the given HTML with all tags stripped and HTML entities and character references converted to Unicode characters.
Code taken and adapted from https://github.com/django/django/blob/main/django/utils/html.py.
- Parameters
value (str) – input string
- Returns
string without HTML tags
- Return type
str
- tmtoolkit.tokenseq.token_collocations(sentences, threshold=None, min_count=1, embed_tokens=None, statistic=functools.partial(<function pmi>, k=1, normalize=True), vocab_counts=None, glue=None, return_statistic=True, rank='desc', tokens_as_hashes=False, hashes2tokens=None, **statistic_kwargs)
Identify token collocations (frequently co-occurring token series) in a list of sentences of tokens given by sentences. Currently only supports bigram collocations.
- Parameters
sentences (List[List[Union[str, int]]]) – list of sentences containing lists of tokens; tokens can be items of any type if glue is None
threshold (Optional[float]) – minimum statistic value for a collocation to enter the results; if None, results are not filtered
min_count (int) – ignore collocations with number of occurrences below this threshold
embed_tokens (Optional[Iterable]) – tokens that, if occurring inside an n-gram, are not counted; see token_ngrams
statistic (Callable) – function to calculate the statistic measure from the token counts; use one of the [n]pmi functions provided in this module or provide your own function which must accept parameters x, y, xy, n_total; see pmi for more information
vocab_counts (Optional[Mapping]) – pass already computed token type counts to prevent computing these again in this function
glue (Optional[str]) – if not None, provide a string that is used to join the collocation tokens
return_statistic (bool) – also return the computed statistic
rank (Optional[str]) – if not None, rank the results according to the computed statistic in ascending (rank='asc') or descending (rank='desc') order
tokens_as_hashes (bool) – if True, return token type hashes (integers) instead of textual representations (strings)
hashes2tokens (Optional[Union[Dict[int, str], dict]]) – if tokens are given as integer hashes, this table is used to generate textual representations for the results
statistic_kwargs – additional arguments passed to the statistic function
- Returns
list of tuples (collocation tokens, score) if return_statistic is True, otherwise only a list of collocations; collocations are either a string (if glue is given) or a tuple of strings
- Return type
List[Union[tuple, str]]
- tmtoolkit.tokenseq.token_join_subsequent(tokens, matches, glue='_', tokens_dtype=None, return_glued=False, return_mask=False)
Select subsequent tokens as defined by the list of indices matches (e.g. output of token_match_subsequent) and join those by string glue. Return a list of tokens where the subsequent matches are replaced by the joined tokens.
Warning
Only works correctly when matches contains indices of subsequent tokens.
Example:
token_join_subsequent(['a', 'b', 'c', 'd', 'd', 'a', 'b', 'c'], [np.array([1, 2]), np.array([6, 7])])
# ['a', 'b_c', 'd', 'd', 'a', 'b_c']
- Parameters
tokens (Union[List[str], ndarray]) – a sequence of tokens
matches (List[ndarray]) – list of NumPy arrays with subsequent indices into tokens (e.g. output of token_match_subsequent)
glue (Optional[str]) – string for joining the subsequent matches, or None to keep them as separate items in a list
tokens_dtype (Optional[Union[str, dtype]]) – if tokens is not a NumPy array, it will be converted as such; use this dtype for the array
return_glued (bool) – if True, also return a list of the joined tokens
return_mask (bool) – if True, also return a NumPy integer array with the length of the input tokens list that marks the original input tokens in three ways: 0 means mask that original token, 1 means retain that original token, 2 means replace the original token by a newly generated joint token; if True, only the newly generated joint subsequent tokens are returned, not also the original tokens
- Returns
either two-tuple, three-tuple or list depending on return_glued and return_mask
- Return type
Union[list, tuple]
- tmtoolkit.tokenseq.token_lengths(tokens)
Token lengths (number of characters of each token) in tokens.
- Parameters
tokens (Union[Iterable[str], ndarray]) – list or NumPy array of string tokens
- Returns
list of token lengths
- Return type
List[int]
- tmtoolkit.tokenseq.token_match(pattern, tokens, match_type='exact', ignore_case=False, glob_method='match', inverse=False)
Return a boolean NumPy array signaling matches between pattern and tokens. pattern will be compared with each element in sequence tokens either by exact equality (match_type is 'exact'), regular expression (match_type is 'regex') or glob pattern (match_type is 'glob'). For the last two options, pattern must be a string or compiled RE pattern, otherwise it can be of any type that allows equality checking.
See token_match_multi_pattern for a version of this function that accepts multiple search patterns.
- Parameters
pattern (Any) – string or compiled RE pattern used for matching against tokens; when match_type is 'exact', pattern may be of any type that allows equality checking
tokens (Union[List[str], ndarray]) – list or NumPy array of string tokens
match_type (str) – one of 'exact', 'regex', 'glob'; if 'regex', pattern must be a RE pattern; if 'glob', pattern must be a "glob" pattern like "hello w*" (see https://github.com/metagriffin/globre)
ignore_case (bool) – if True, ignore case for matching
glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)
inverse (bool) – invert the matching results
- Returns
1D boolean NumPy array of length len(tokens) where elements signal matches between pattern and the respective token from tokens
- Return type
ndarray
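Exact and regex matching can be sketched as follows (an illustrative reimplementation, not the library code; glob matching is omitted, and applying re.search for regex patterns is an assumption of this sketch):

```python
import re
import numpy as np

def token_match_sketch(pattern, tokens, match_type='exact', ignore_case=False):
    if match_type == 'exact':
        if ignore_case:
            return np.array([t.lower() == pattern.lower() for t in tokens])
        return np.array([t == pattern for t in tokens])
    # 'regex': compare each token against the compiled pattern
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    return np.array([bool(rx.search(t)) for t in tokens])

tokens = ['Hello', 'world', 'hello']
m1 = token_match_sketch('hello', tokens)                    # [False, False, True]
m2 = token_match_sketch('hello', tokens, ignore_case=True)  # [True, False, True]
m3 = token_match_sketch(r'^w', tokens, match_type='regex')  # [False, True, False]
```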
- tmtoolkit.tokenseq.token_match_multi_pattern(search_tokens, tokens, match_type='exact', ignore_case=False, glob_method='match')
Return a boolean NumPy array signaling matches between any pattern in search_tokens and tokens. Works the same as token_match, but accepts multiple patterns as the search_tokens argument.
- Parameters
search_tokens (Any) – single string or list of strings that specify the search pattern(s); when match_type is 'exact', patterns may be of any type that allows equality checking
tokens (Union[List[str], ndarray]) – list or NumPy array of string tokens
match_type (str) – one of 'exact', 'regex', 'glob'; if 'regex', the search patterns must be RE patterns; if 'glob', they must be "glob" patterns like "hello w*" (see https://github.com/metagriffin/globre)
ignore_case (bool) – if True, ignore case for matching
glob_method (str) – if match_type is ‘glob’, use this glob method. Must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)
- Returns
1D boolean NumPy array of length len(tokens) where elements signal matches
- Return type
ndarray
- tmtoolkit.tokenseq.token_match_subsequent(patterns, tokens, **match_opts)
Using N patterns in patterns, return each tuple of N matching subsequent tokens from tokens. Accepts the same token matching options via match_opts as token_match. The results are returned as a list of NumPy arrays with indices into tokens.
Example:
# indices:  0        1        2        3         4        5        6
tokens = ['hello', 'world', 'means', 'saying', 'hello', 'world', '.']
token_match_subsequent(['hello', 'world'], tokens)
# [array([0, 1]), array([4, 5])]
token_match_subsequent(['world', 'hello'], tokens)
# []
token_match_subsequent(['world', '*'], tokens, match_type='glob')
# [array([1, 2]), array([5, 6])]
- Parameters
patterns (Sequence) – a sequence of search patterns as accepted by token_match
tokens (Union[list, ndarray]) – a sequence of string tokens to be used for matching
match_opts – token matching options as passed to token_match
- Returns
list of NumPy arrays with subsequent indices into tokens
- Return type
List[ndarray]
- tmtoolkit.tokenseq.token_ngrams(tokens, n, join=True, join_str=' ', ngram_container=<class 'list'>, embed_tokens=None, keep_embed_tokens=True)
Generate n-grams of length n from the list of tokens tokens. Either join the n-grams when join is True using join_str, so that a list of joined n-gram strings is returned, or, if join is False, return a list of n-gram lists (or other sequences depending on ngram_container). For the latter option, the tokens in tokens don’t have to be strings but can be of any type.
Optionally pass a set/list/tuple embed_tokens which contains tokens that, if occurring inside an n-gram, are not counted. See for example how a trigram 'bank of america' is generated when the token 'of' is set as embed_tokens, although we asked to generate bigrams:
> token_ngrams("I visited the bank of america".split(), n=2)
['I visited', 'visited the', 'the bank', 'bank of', 'of america']
> token_ngrams("I visited the bank of america".split(), n=2, embed_tokens={'of'})
['I visited', 'visited the', 'the bank', 'bank of america', 'of america']
- Parameters
tokens (Sequence) – sequence of tokens; if join is True, this must be a list of strings
n (int) – size of the n-grams to generate
join (bool) – if True, join n-grams by join_str
join_str (str) – string to join n-grams if join is True
ngram_container (Callable) – if join is False, use this function to create the n-gram sequences
embed_tokens (Optional[Iterable]) – tokens that, if occurring inside an n-gram, are not counted
keep_embed_tokens (bool) – if True, keep embedded tokens in the result
- Returns
list of joined n-gram strings or list of n-grams that are n-sized sequences
- Return type
list
- tmtoolkit.tokenseq.unique_chars(tokens)
Return a set of all characters used in tokens.
- Parameters
tokens (Iterable[str]) – iterable of string tokens
- Returns
set of all characters used in tokens
- Return type
Set[str]
tmtoolkit.topicmod
Topic modeling sub-package with modules for model evaluation, model I/O, model statistics, parallel computation and visualization.
Functions and classes in tm_gensim, tm_lda and tm_sklearn implement parallel model computation and evaluation using popular topic modeling packages. You need to install the respective packages (lda, scikit-learn or gensim) in order to use them.
Evaluation metrics for Topic Modeling
Metrics for topic model evaluation.
In order to run model evaluations in parallel, use one of the modules tm_gensim, tm_lda or tm_sklearn.
- tmtoolkit.topicmod.evaluate.metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths)
Calculate metric as in [Arun2010] using topic-word distribution topic_word_distrib, document-topic distribution doc_topic_distrib and document lengths doc_lengths.
Note
This metric will fail when the number of words in the vocabulary is smaller than the number of topics (which is very unusual).
- Arun2010
Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents
doc_lengths – array of length N with number of tokens per document
- Returns
calculated metric
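The idea behind this metric can be sketched with NumPy as a symmetric KL divergence between the normalized singular values of the topic-word matrix and the document-length-weighted topic proportions (a simplified sketch of the published method with an illustrative name; tmtoolkit's exact implementation may differ in details):

```python
import numpy as np

def metric_arun_sketch(topic_word, doc_topic, doc_lengths):
    # distribution 1: singular values of the topic-word matrix, normalized to sum to 1
    cm1 = np.linalg.svd(topic_word, compute_uv=False)
    cm1 = cm1 / cm1.sum()
    # distribution 2: topic proportions weighted by document lengths, normalized
    cm2 = np.asarray(doc_lengths) @ doc_topic
    cm2 = cm2 / cm2.sum()
    # symmetric KL divergence between the two distributions
    return float(np.sum(cm1 * np.log(cm1 / cm2)) + np.sum(cm2 * np.log(cm2 / cm1)))
```

Lower values indicate a better fit between the two views of the topic proportions.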
- tmtoolkit.topicmod.evaluate.metric_cao_juan_2009(topic_word_distrib)
Calculate metric as in [Cao2009] using topic-word distribution topic_word_distrib.
- Cao2009
Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing — 16th European Symposium on Artificial Neural Networks 2008 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
- Returns
calculated metric
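The core of this metric is the average pairwise cosine similarity between the topic-word vectors, where lower values indicate better-separated topics. A minimal NumPy sketch assuming this standard formulation (illustrative name, not the library code):

```python
import numpy as np

def metric_cao_juan_sketch(topic_word):
    # normalize each topic's word distribution to unit length
    norm = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    cos = norm @ norm.T                   # pairwise cosine similarities, shape KxK
    k = topic_word.shape[0]
    iu = np.triu_indices(k, k=1)          # indices of the K*(K-1)/2 distinct topic pairs
    return float(cos[iu].sum() / (k * (k - 1) / 2))
```

Identical topics yield a value of 1, mutually orthogonal topics a value of 0.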
- tmtoolkit.topicmod.evaluate.metric_coherence_gensim(measure, topic_word_distrib=None, gensim_model=None, vocab=None, dtm=None, gensim_corpus=None, texts=None, top_n=20, return_coh_model=False, return_mean=False, **kwargs)
Calculate model coherence using Gensim’s CoherenceModel. See also this tutorial.
Define which measure to use with parameter measure:
'u_mass'
'c_v'
'c_uci'
'c_npmi'
Provide a topic word distribution topic_word_distrib OR a Gensim model gensim_model and the corpus’ vocabulary as vocab OR pass a gensim corpus as gensim_corpus. top_n controls how many most probable words per topic are selected.
If measure is 'u_mass', a document-term-matrix dtm or gensim_corpus must be provided and texts can be None. If any other measure than 'u_mass' is used, tokenized input texts must be provided as a 2D list:

[['some', 'text', ...],          # doc. 1
 ['some', 'more', ...],          # doc. 2
 ['another', 'document', ...]]   # doc. 3
If return_coh_model is True, the whole gensim.models.CoherenceModel instance will be returned, otherwise:
- if return_mean is True, the mean coherence value will be returned
- if return_mean is False, a list of coherence values (one for each topic) will be returned
Provided kwargs will be passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic.
Note
This function also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!
- Parameters
measure – the coherence calculation type; one of the values listed above
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size if gensim_model is not given
gensim_model – a topic model from Gensim if topic_word_distrib is not given
vocab – vocabulary list/array if gensim_corpus is not given
dtm – document-term matrix of shape NxM with N documents and vocabulary size M if gensim_corpus is not given
gensim_corpus – a Gensim corpus if vocab is not given
texts – list of tokenized documents; necessary if using a measure other than 'u_mass'
top_n – number of most probable words selected per topic
return_coh_model – if True, return gensim.models.CoherenceModel as result
return_mean – if return_coh_model is False and return_mean is True, return mean coherence
kwargs – parameters passed to gensim.models.CoherenceModel or gensim.models.CoherenceModel.get_coherence_per_topic
- Returns
if return_coh_model is True, a gensim.models.CoherenceModel instance; otherwise, if return_mean is True, the mean of all coherence values, else an array of length K with one coherence value per topic
- tmtoolkit.topicmod.evaluate.metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1e-12, normalize=True, return_mean=False)
Calculate coherence metric according to [Mimno2011] (a.k.a. “U_Mass” coherence metric). There are two modifications to the originally suggested measure:
uses a different epsilon by default (set eps=1 for original)
uses a normalizing constant by default (set normalize=False for original)
Provide a topic word distribution as topic_word_distrib and a document-term-matrix dtm (can be sparse). top_n controls how many most probable words per topic are selected.
By default, it will return a NumPy array of coherence values per topic (same ordering as in topic_word_distrib). Set return_mean to True to return the mean of all topics instead.
- Mimno2011
D. Mimno, H. Wallach, E. Talley, M. Leenders, A. McCullum 2011: Optimizing semantic coherence in topic models
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
dtm – document-term matrix of shape NxM with N documents and vocabulary size M
top_n – number of most probable words selected per topic
eps – smoothing constant epsilon
normalize – if True, normalize coherence values
return_mean – if True, return mean of all coherence values, otherwise array of coherence per topic
- Returns
if return_mean is True, mean of all coherence values, otherwise array of length K with coherence per topic
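The measure can be sketched as follows: for each topic, take the top_n most probable words and sum, over all word pairs, the log ratio of the pair's co-document frequency to the less probable word's document frequency. This is a simplified sketch without the normalizing constant mentioned above; the helper name is illustrative:

```python
import numpy as np

def coherence_mimno_sketch(topic_word, dtm, top_n=5, eps=1e-12):
    """Per-topic U_Mass-style coherence from a topic-word distribution and a raw-count dtm."""
    occurs = np.asarray(dtm) > 0                             # binary doc-term occurrence matrix
    doc_freq = occurs.sum(axis=0)                            # D(w): number of docs containing w
    codoc_freq = occurs.T.astype(int) @ occurs.astype(int)   # D(w, w'): co-document frequencies
    coh = []
    for t_dist in topic_word:
        top = np.argsort(t_dist)[::-1][:top_n]               # indices of top_n most probable words
        c = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                c += np.log((codoc_freq[top[m], top[l]] + eps) / doc_freq[top[l]])
        coh.append(c)
    return np.array(coh)
```

If a topic's top words always co-occur, every log ratio is (nearly) zero, so the maximum attainable coherence is 0.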
- tmtoolkit.topicmod.evaluate.metric_griffiths_2004(logliks)
Calculate metric as in [GriffithsSteyvers2004].
Calculates the harmonic mean of the log-likelihood values logliks. Burn-in values should already be removed from logliks.
- GriffithsSteyvers2004
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101
Note
Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.
- Parameters
logliks – array with log-likelihood values
- Returns
calculated metric
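The harmonic mean of the likelihoods can be sketched in log space with a log-sum-exp over the negated log-likelihoods (an illustrative NumPy sketch; tmtoolkit itself uses gmpy2 multiple-precision arithmetic instead, which avoids underflow more robustly):

```python
import numpy as np

def metric_griffiths_sketch(logliks):
    """Log of the harmonic mean of the likelihoods exp(logliks)."""
    ll = np.asarray(logliks, dtype=float)
    # stable log-sum-exp of the negated log-likelihoods
    shift = (-ll).max()
    lse = shift + np.log(np.exp(-ll - shift).sum())
    # log HM = log N - logsumexp(-logliks)
    return float(np.log(len(ll)) - lse)
```

For identical log-likelihood values the harmonic mean equals that value, which gives a quick sanity check.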
- tmtoolkit.topicmod.evaluate.metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train, alpha, n_samples=10000)
Estimation of the probability of held-out documents according to [Wallach2009] using a document-topic estimation theta_test that was estimated via held-out documents dtm_test on a trained model with a topic-word distribution phi_train and a document-topic prior alpha. Draws n_samples according to theta_test for each document in dtm_test (memory consumption and run time can be very high for a large n_samples and a large number of big documents in dtm_test).
A document-topic estimation theta_test can be obtained from a trained model from the “lda” package or scikit-learn package with the transform() method.
Adopted MATLAB code originally from Ian Murray, 2009 and downloaded from umass.edu.
Note
Requires gmpy2 package for multiple-precision arithmetic to avoid numerical underflow.
- Wallach2009
Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D., 2009. Evaluation methods for topic models.
- Parameters
dtm_test – held-out documents of shape NxM with N documents and vocabulary size M
theta_test – document-topic estimation of dtm_test; shape NxK with K topics
phi_train – topic-word distribution of a trained topic model that should be evaluated; shape KxM
alpha – document-topic prior of the trained topic model that should be evaluated; either a scalar or an array of length K
- Returns
estimated probability of held-out documents
- tmtoolkit.topicmod.evaluate.results_by_parameter(res, param, sort_by=None, sort_desc=False)
Takes a list of evaluation results res returned by a topic model evaluation function – a list in the form:
[(parameter_set_1, {'<metric_name>': result_1, ...}), ..., (parameter_set_n, {'<metric_name>': result_n, ...})]
Then returns a list with tuple pairs using only the m parameter(s) listed in param from the parameter sets in the evaluation results such that the returned list is:
[(param_1_0, ..., param_1_m, {'<metric_name>': result_1, ...}), ..., (param_n_0, ..., param_n_m, {'<metric_name>': result_n, ...})]
Optionally order either by parameter value (sort_by is None, the default) or by a result metric (sort_by='<metric name>').
- Parameters
res – list of evaluation results
param – string of parameter name
sort_by – order by parameter value if this is None, or by a certain result metric given as string
sort_desc – sort in descending order
- Returns
list with tuple pairs using only the parameter param from the parameter sets
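For the simple single-parameter case, the transformation this function performs can be sketched in plain Python (illustrative name and metric key; the real function also supports multiple parameters per entry):

```python
def results_by_parameter_sketch(res, param, sort_by=None, sort_desc=False):
    # pick the requested parameter out of each parameter set
    out = [(params[param], metrics) for params, metrics in res]
    # sort by parameter value (default) or by a result metric
    key = (lambda x: x[0]) if sort_by is None else (lambda x: x[1][sort_by])
    return sorted(out, key=key, reverse=sort_desc)

res = [({'n_topics': 20, 'alpha': 0.1}, {'cao_juan_2009': 0.3}),
       ({'n_topics': 10, 'alpha': 0.1}, {'cao_juan_2009': 0.5})]
print(results_by_parameter_sketch(res, 'n_topics'))
# [(10, {'cao_juan_2009': 0.5}), (20, {'cao_juan_2009': 0.3})]
```

Passing sort_by='cao_juan_2009' would instead order the pairs by that metric value.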
Printing, importing and exporting topic model results
Functions for printing/exporting topic model results.
- tmtoolkit.topicmod.model_io.ldamodel_full_doc_topics(doc_topic_distrib, doc_labels, colname_rowindex='_doc', topic_labels='topic_{i1}')
Generate a pandas DataFrame for the full doc-topic distribution doc_topic_distrib.
See also
ldamodel_top_doc_topics to retrieve only the most probable topics in the distribution as formatted pandas DataFrame; ldamodel_full_topic_words to retrieve the full topic-word distribution as DataFrame
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
colname_rowindex – column name for the “row index”, i.e. the column that identifies each row
topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual topic labels
- Returns
pandas DataFrame
- tmtoolkit.topicmod.model_io.ldamodel_full_topic_words(topic_word_distrib, vocab, colname_rowindex='_topic', row_labels='topic_{i1}')
Generate a pandas DataFrame for the full topic-word distribution topic_word_distrib.
See also
ldamodel_top_topic_words to retrieve only the most probable words in the distribution as formatted pandas DataFrame; ldamodel_full_doc_topics to retrieve the full document-topic distribution as DataFrame
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary list/array of length M
colname_rowindex – column name for the “row index”, i.e. the column that identifies each row
row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual row labels
- Returns
pandas DataFrame
- tmtoolkit.topicmod.model_io.ldamodel_top_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='document')
Retrieve the top (i.e. most probable) top_n topics for each document in the document-topic distribution doc_topic_distrib as pandas DataFrame.
See also
ldamodel_full_doc_topics to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_docs to retrieve the top documents per topic; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
top_n – number of most probable topics per document to select
val_fmt – format string for table cells where {lbl} is replaced by the respective topic name and {val} is replaced by the topic’s probability given the document
topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual topic labels
col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank
index_name – name of the table index
- Returns
pandas DataFrame
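The core of these "top N" tables is a row-wise argsort over the distribution matrix; a minimal NumPy sketch of that selection step (illustrative name, without the DataFrame formatting the function adds):

```python
import numpy as np

def top_n_per_row(distrib, n):
    """Return, per row, the indices of the n largest values in descending order."""
    return np.argsort(distrib, axis=1)[:, ::-1][:, :n]

theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(top_n_per_row(theta, 2))
# [[0 1]
#  [2 1]]
```

The function then maps these indices to topic labels and probabilities to build the table cells.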
- tmtoolkit.topicmod.model_io.ldamodel_top_topic_docs(doc_topic_distrib, doc_labels, top_n=3, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='topic')
Retrieve the top (i.e. most probable) top_n documents for each topic in the document-topic distribution doc_topic_distrib as pandas DataFrame.
See also
ldamodel_full_doc_topics to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_doc_topics to retrieve the top topics per document; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
top_n – number of most probable documents per topic to select
val_fmt – format string for table cells where {lbl} is replaced by the respective document label and {val} is replaced by the topic’s probability given the document
topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual topic labels
col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank
index_name – name of the table index
- Returns
pandas DataFrame
- tmtoolkit.topicmod.model_io.ldamodel_top_topic_words(topic_word_distrib, vocab, top_n=10, val_fmt=None, row_labels='topic_{i1}', col_labels=None, index_name='topic')
Retrieve the top (i.e. most probable) top_n words for each topic in the topic-word distribution topic_word_distrib as pandas DataFrame.
See also
ldamodel_full_topic_words to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_word_topics to retrieve the top topics per word from a topic-word distribution; ldamodel_top_doc_topics to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs to retrieve the top documents per topic
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary list/array of length M
top_n – number of most probable words per topic to select
val_fmt – format string for table cells where {lbl} is replaced by the respective word from vocab and {val} is replaced by the word’s probability given the topic
row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual row labels
col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank
index_name – name of the table index
- Returns
pandas DataFrame
- tmtoolkit.topicmod.model_io.ldamodel_top_word_topics(topic_word_distrib, vocab, top_n=10, val_fmt=None, topic_labels='topic_{i1}', col_labels=None, index_name='token')
Retrieve the top (i.e. most probable) top_n topics for each word in the topic-word distribution topic_word_distrib as pandas DataFrame.
See also
ldamodel_full_topic_words to retrieve the full distribution as formatted pandas DataFrame; ldamodel_top_topic_words to retrieve the top words per topic from a topic-word distribution; ldamodel_top_doc_topics to retrieve the top topics per document from a document-topic distribution; ldamodel_top_topic_docs to retrieve the top documents per topic
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary list/array of length M
top_n – number of most probable topics per word to select
val_fmt – format string for table cells where {lbl} is replaced by the respective topic label from topic_labels and {val} is replaced by the word’s probability given the topic
topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual topic labels
col_labels – format string for the columns where {i0} or {i1} are replaced by the respective zero- or one-indexed rank
index_name – name of the table index
- Returns
pandas DataFrame
- tmtoolkit.topicmod.model_io.load_ldamodel_from_pickle(picklefile, **kwargs)
Load an LDA model object from a pickle file picklefile.
See also
save_ldamodel_to_pickle to save a model.
Warning
Python pickle files may contain malicious code. You should only load pickle files from trusted sources.
- Parameters
picklefile – target file
kwargs – additional options for tmtoolkit.utils.unpickle_file
- Returns
dict with keys: 'model' (model instance), 'vocab' (vocabulary), 'doc_labels' (document labels), 'dtm' (optional document-term matrix)
- tmtoolkit.topicmod.model_io.print_ldamodel_distribution(distrib, row_labels, val_labels, top_n=10)
Print top_n top values from a LDA model’s distribution distrib. This is a general function to print top values of any multivariate distribution given as matrix distrib with H rows and I columns, each identified by H row_labels and I val_labels.
See also
print_ldamodel_topic_words to print the top values of a topic-word distribution or print_ldamodel_doc_topics to print the top values of a document-topic distribution.
- Parameters
distrib – either a topic-word or a document-topic distribution of shape HxI
row_labels – list/array of length H with label string for each row of distrib or format string
val_labels – list/array of length I with label string for each column of distrib or format string
top_n – number of top values to print
- tmtoolkit.topicmod.model_io.print_ldamodel_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_labels='topic_{i1}')
Print top_n values from an LDA model’s document-topic distribution doc_topic_distrib.
See also
print_ldamodel_topic_words to print the top values of a topic-word distribution.
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
top_n – number of top values to print
val_labels – format string for each value where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual value labels
- tmtoolkit.topicmod.model_io.print_ldamodel_topic_words(topic_word_distrib, vocab, top_n=10, row_labels='topic_{i1}')
Print top_n values from an LDA model’s topic-word distribution topic_word_distrib.
See also
print_ldamodel_doc_topics to print the top values of a document-topic distribution.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary list/array of length M
top_n – number of top values to print
row_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual row labels
- tmtoolkit.topicmod.model_io.save_ldamodel_summary_to_excel(excel_file, topic_word_distrib, doc_topic_distrib, doc_labels, vocab, top_n_topics=10, top_n_words=10, dtm=None, rank_label_fmt=None, topic_labels=None)
Save a summary derived from an LDA model’s topic-word and document-topic distributions (topic_word_distrib and doc_topic_distrib) to an Excel file excel_file. Return the generated Excel sheets as dict of pandas DataFrames.
The resulting Excel file will consist of 6 or optionally 7 sheets:
- top_doc_topics_vals: document-topic distribution with probabilities of top topics per document
- top_doc_topics_labels: document-topic distribution with labels (e.g. "topic_12") of top topics per document
- top_doc_topics_labelled_vals: document-topic distribution combining probabilities and labels of top topics per document (e.g. "topic_12 (0.21)")
- top_topic_word_vals: topic-word distribution with probabilities of top words per topic
- top_topic_word_labels: topic-word distribution with top words per topic (e.g. "politics")
- top_topic_words_labelled_vals: topic-word distribution combining probabilities and top words per topic (e.g. "politics (0.08)")
- optional, if dtm is given – marginal_topic_distrib: marginal topic distribution
- Parameters
excel_file – target Excel file
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
vocab – vocabulary list/array of length M
top_n_topics – number of most probable topics per document to include in the summary
top_n_words – number of most probable words per topic to include in the summary
dtm – document-term matrix; shape NxM; if this is given, a sheet for the marginal topic distribution will be included
rank_label_fmt – format string for the rank labels where {i0} or {i1} are replaced by the respective zero- or one-indexed rank numbers (leave as None for default)
topic_labels – format string for each row index where {i0} or {i1} are replaced by the respective zero- or one-indexed topic numbers, or an array with individual topic labels
- Returns
dict mapping sheet name to pandas DataFrame
- tmtoolkit.topicmod.model_io.save_ldamodel_to_pickle(picklefile, model, vocab, doc_labels, dtm=None, **kwargs)
Save an LDA model object model as pickle file to picklefile.
See also
load_ldamodel_from_pickle to load the saved model.
- Parameters
picklefile – target file
model – LDA model instance
vocab – vocabulary list/array of length M
doc_labels – document labels list/array of length N
dtm – optional document-term matrix of shape NxM
kwargs – additional options for tmtoolkit.utils.pickle_data
Statistics for topic models and BoW matrices
Common statistics and tools for topic models.
- SievertShirley2014
Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).
- Chuang2012
J. Chuang, C. Manning, J. Heer. 2012. Termite: Visualization Techniques for Assessing Textual Topic Models
- tmtoolkit.topicmod.model_stats.exclude_topics(excl_topic_indices, doc_topic_distrib, topic_word_distrib=None, renormalize=True, return_new_topic_mapping=False)
Exclude topics with the indices excl_topic_indices from the document-topic distribution doc_topic_distrib (i.e. delete the respective columns in this matrix) and optionally re-normalize the distribution so that the rows sum up to 1 if renormalize is set to True.
Optionally also strip the topics from the topic-word distribution topic_word_distrib (i.e. remove the respective rows).
If topic_word_distrib is given, return a tuple with the updated doc.-topic and topic-word distributions, else return only the updated doc.-topic distribution.
Warning
The topics to be excluded are specified by zero-based indices.
- Parameters
excl_topic_indices – list/array with zero-based indices of topics to exclude
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
topic_word_distrib – optional topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
renormalize – if True, re-normalize the document-topic distribution so that the rows sum up to 1
return_new_topic_mapping – if True, additionally return a dict that maps old topic indices to new topic indices
- Returns
new document-topic distribution where topics from excl_topic_indices are removed and optionally re-normalized; optional new topic-word distribution with same topics removed; optional dict that maps old topic indices to new topic indices
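The column removal and row re-normalization at the heart of this function can be sketched with NumPy (illustrative helper name; the real function additionally handles the topic-word distribution and the old-to-new index mapping):

```python
import numpy as np

def exclude_topics_sketch(excl, theta, renormalize=True):
    """Remove topic columns with indices `excl` from a doc-topic matrix and
    optionally re-normalize each row so it sums to 1 again."""
    new_theta = np.delete(theta, excl, axis=1)
    if renormalize:
        new_theta = new_theta / new_theta.sum(axis=1, keepdims=True)
    return new_theta

theta = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.2, 0.6]])
print(exclude_topics_sketch([1], theta))  # rows sum to 1 again after removing topic 1
```

After removal, each remaining probability is scaled up by the inverse of the remaining row mass.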
- tmtoolkit.topicmod.model_stats.filter_topics(search_pattern, vocab, topic_word_distrib, top_n=None, thresh=None, match_type='exact', cond='any', glob_method='match', return_words_and_matches=False)
Filter topics, given as topic-word distribution topic_word_distrib across vocabulary vocab, for a word (pass a string) or multiple words/patterns (pass a list of strings) via search_pattern. Either run the pattern(s) against the list of top words per topic (use top_n for the number of words in the top words list) or specify a minimum topic-word probability thresh, resulting in a list of words above this threshold for each topic, which will be used for pattern matching. You can also specify both top_n and thresh.
Set the match_type parameter according to the options provided by token_match (exact matching, RE or glob matching). Use cond to specify whether only one match per topic suffices when a list of patterns is passed (cond='any') or whether all patterns must match (cond='all').
By default, this function returns a NumPy array containing the indices of topics that passed the filter criteria. If return_words_and_matches is True, this function additionally returns a NumPy array with the top words for each topic and a NumPy array with the pattern matches for each topic.
Note
Using this function requires that you’ve installed tmtoolkit with the [textproc] option.
See also
See tmtoolkit.tokenseq.token_match for filtering options.
- Parameters
search_pattern – single match pattern string or list of match pattern strings
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
top_n – if given, consider only the top top_n words per topic
thresh – if given, consider only the words with a probability above thresh
match_type – one of: ‘exact’, ‘regex’, ‘glob’; if ‘regex’, search_pattern must be a RE pattern; if ‘glob’, search_pattern must be a “glob” pattern like “hello w*” (see https://github.com/metagriffin/globre)
cond – either "any" or "all"; controls whether only one or all patterns must match if multiple match patterns are given
glob_method – if match_type is ‘glob’, use this glob method; must be ‘match’ or ‘search’ (similar behavior as Python’s re.match or re.search)
return_words_and_matches – if True, additionally return list of arrays of words per topic and list of binary arrays indicating matches per topic
- Returns
array of topic indices with matches; if return_words_and_matches is True, return two more lists as described above
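For the exact-matching case with a top_n word list, the filtering logic can be sketched with NumPy (illustrative name; the real function also supports regex/glob matching via token_match and threshold-based word lists):

```python
import numpy as np

def filter_topics_sketch(patterns, vocab, topic_word, top_n, cond='any'):
    """Return indices of topics whose top_n word list matches the given exact patterns."""
    vocab = np.asarray(vocab)
    matches = []
    for t_dist in topic_word:
        # take the top_n most probable words of this topic
        top_words = set(vocab[np.argsort(t_dist)[::-1][:top_n]])
        hits = [p in top_words for p in patterns]
        matches.append(any(hits) if cond == 'any' else all(hits))
    return np.flatnonzero(matches)

vocab = ['a', 'b', 'c', 'd']
tw = np.array([[0.4, 0.3, 0.2, 0.1],
               [0.1, 0.2, 0.3, 0.4]])
print(filter_topics_sketch(['a'], vocab, tw, top_n=2))  # [0]
```

With cond='all', a topic only passes if every pattern matches one of its top words.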
- tmtoolkit.topicmod.model_stats.generate_topic_labels_from_top_words(topic_word_distrib, doc_topic_distrib, doc_lengths, vocab, n_words=None, lambda_=1, labels_glue='_', labels_format='{i1}_{topwords}')
Generate unique topic labels derived from the top words of each topic. The top words are determined from the relevance score [SievertShirley2014] depending on lambda_. Specify the number of top words in the label with n_words. If n_words is None, a minimum number of words will be used to create unique labels for each topic. Topic labels are formed by joining the top words with labels_glue and formatting them with labels_format. Placeholders in labels_format are "{i0}" (zero-based topic index), "{i1}" (one-based topic index) and "{topwords}" (top words glued with labels_glue).
See also
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
vocab – vocabulary array of length M
n_words – number of top words to include in each label; if None, the minimum number needed to create unique labels is used
lambda_ – lambda parameter (influences weight of “log lift”)
labels_glue – string to join the top words
labels_format – final topic labels format string
- Returns
NumPy array of topic labels; length is K
- tmtoolkit.topicmod.model_stats.least_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by distinctiveness score from least to most distinctive. Optionally only return the n least distinctive words.
See also
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n least distinctive words
- Returns
array of length M or n (if n is given) with least distinctive words
- tmtoolkit.topicmod.model_stats.least_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by marginal word probability from least to most probable. Optionally only return the n least probable words.
See also
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n least probable words
- Returns
array of length M or n (if n is given) with least probable words
- tmtoolkit.topicmod.model_stats.least_relevant_words_for_topic(vocab, rel_mat, topic, n=None)
Get words from vocab for topic ordered by least to most relevance according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from
topic_word_relevance
. Optionally only return the n least relevant words.See also
- Parameters
vocab – vocabulary array of length M
rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size
topic – topic number (zero-indexed)
n – if not None, return only the n least relevant words
- Returns
array of length M or n (if n is given) with least relevant words for topic topic
- tmtoolkit.topicmod.model_stats.least_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by saliency score from least to most salient. Optionally only return the n least salient words.
See also
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n least salient words
- Returns
array of length M or n (if n is given) with least salient words
- tmtoolkit.topicmod.model_stats.marginal_topic_distrib(doc_topic_distrib, doc_lengths)
Return the marginal topic distribution p(T) (topic proportions) given the document-topic distribution (theta) doc_topic_distrib and the document lengths doc_lengths. The latter can be calculated with doc_lengths.
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
- Returns
array of size K (number of topics) with marginal topic distribution
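The marginal topic distribution is the document-length-weighted average of theta's rows; a minimal NumPy sketch (illustrative name):

```python
import numpy as np

def marginal_topic_distrib_sketch(doc_topic, doc_lengths):
    """p(T): topic proportions as document-length-weighted average of theta's rows."""
    doc_lengths = np.asarray(doc_lengths)
    return (doc_lengths @ doc_topic) / doc_lengths.sum()

theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
p_t = marginal_topic_distrib_sketch(theta, [10, 30])
print(p_t)  # [0.375 0.625]
```

Longer documents contribute proportionally more to the marginal distribution.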
- tmtoolkit.topicmod.model_stats.marginal_word_distrib(topic_word_distrib, p_t)
Return the marginal word distribution p(w) (term proportions derived from the topic model) given the topic-word distribution (phi) topic_word_distrib and the marginal topic distribution p(T) p_t. The latter can be calculated with marginal_topic_distrib.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
p_t – marginal topic distribution; array of size K
- Returns
array of size M (vocabulary size) with marginal word distribution
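This amounts to a weighted average of the topic-word rows, i.e. a dot product; a toy NumPy sketch (values are hypothetical):

```python
import numpy as np

# p(w) is the topic-word distribution phi averaged over topics,
# weighted by the marginal topic distribution p(T).
phi = np.array([[0.7, 0.2, 0.1],    # topic 1
                [0.1, 0.3, 0.6]])   # topic 2
p_t = np.array([0.5, 0.5])
p_w = p_t @ phi   # shape (M,), sums to 1
```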
- tmtoolkit.topicmod.model_stats.most_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by distinctiveness score from most to least distinctive. Optionally only return the n most distinctive words.
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n most distinctive words
- Returns
array of length M or n (if n is given) with most distinctive words
- tmtoolkit.topicmod.model_stats.most_probable_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by marginal word probability from most to least probable. Optionally only return the n most probable words.
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n most probable words
- Returns
array of length M or n (if n is given) with most probable words
- tmtoolkit.topicmod.model_stats.most_relevant_words_for_topic(vocab, rel_mat, topic, n=None)
Get words from vocab for topic topic, ordered from most to least relevant according to [SievertShirley2014]. Use the relevance matrix rel_mat obtained from topic_word_relevance. Optionally only return the n most relevant words.
See also
- Parameters
vocab – vocabulary array of length M
rel_mat – relevance matrix; shape KxM, where K is number of topics, M is vocabulary size
topic – topic number (zero-indexed)
n – if not None, return only the n most relevant words
- Returns
array of length M or n (if n is given) with most relevant words for topic topic
- tmtoolkit.topicmod.model_stats.most_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)
Order the words from vocab by saliency score from most to least salient. Optionally only return the n most salient words.
- Parameters
vocab – vocabulary array of length M
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
n – if not None, return only the n most salient words
- Returns
array of length M or n (if n is given) with most salient words
- tmtoolkit.topicmod.model_stats.top_n_from_distribution(distrib, top_n=10, row_labels=None, col_labels=None, val_labels=None)
Get top_n values from LDA model’s distribution distrib as DataFrame. Can be used for topic-word distributions and document-topic distributions. Set row_labels to a format string or a list. Set col_labels to a format string for the column names. Set val_labels to return value labels instead of pure values (probabilities).
- Parameters
distrib – a 2D probability distribution of shape NxM from an LDA model
top_n – number of top values to take from each row of distrib
row_labels – either list of row label strings of length N or a single row format string
col_labels – column format string or None for default numbered columns
val_labels – value labels format string or None to return only the probabilities
- Returns
pandas DataFrame with N rows and top_n columns
- tmtoolkit.topicmod.model_stats.top_words_for_topics(topic_word_distrib, top_n=None, vocab=None, return_prob=False)
Generate sorted list of top_n words (or word indices) per topic in topic-word distribution topic_word_distrib.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
top_n – number of top words (according to probability given topic) to select per topic; if None return full sorted lists of words
vocab – vocabulary array of length M; if None, return word indices instead of word strings
return_prob – if True, also return sorted arrays of word probabilities given topic for each topic
- Returns
list of length K consisting of sorted arrays of most probable words; arrays have length top_n or M (if top_n is None); if return_prob is True another list of sorted arrays of word probabilities given topic for each topic is returned
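A simplified reimplementation of the sorting logic (ignoring the return_prob option) may clarify the return structure; the vocabulary and values below are hypothetical:

```python
import numpy as np

def top_words_for_topics(topic_word_distrib, top_n=None, vocab=None):
    # Sort each topic's row by descending probability; map indices to
    # word strings when a vocabulary is given.
    res = []
    for row in topic_word_distrib:
        order = np.argsort(row)[::-1]
        if top_n is not None:
            order = order[:top_n]
        res.append(vocab[order] if vocab is not None else order)
    return res

phi = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.2, 0.7]])
vocab = np.array(['apple', 'pear', 'plum'])
top = top_words_for_topics(phi, top_n=2, vocab=vocab)
# one array of the two most probable words per topic
```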
- tmtoolkit.topicmod.model_stats.topic_word_relevance(topic_word_distrib, doc_topic_distrib, doc_lengths, lambda_)
Calculate the topic-word relevance score with a lambda parameter lambda_ according to [SievertShirley2014]:
relevance(w,t|lambda) = lambda * log phi_{t,w} + (1-lambda) * log (phi_{t,w} / p(w)),
where phi is the topic-word distribution and p(w) is the marginal word probability.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
lambda_ – lambda parameter (influences weight of “log lift”)
- Returns
matrix with topic-word relevance scores; shape KxM
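The formula can be evaluated directly in NumPy via broadcasting; a toy sketch with hypothetical phi and p(T) (in tmtoolkit, p(w) is derived from doc_topic_distrib and doc_lengths):

```python
import numpy as np

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])        # K=2 topics, M=3 words
p_t = np.array([0.5, 0.5])               # marginal topic distribution
p_w = p_t @ phi                          # marginal word distribution
lambda_ = 0.6
rel_mat = lambda_ * np.log(phi) + (1 - lambda_) * np.log(phi / p_w)
# lambda_=1 ranks words purely by phi; lambda_=0 purely by "lift" phi/p(w)
```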
- tmtoolkit.topicmod.model_stats.word_distinctiveness(topic_word_distrib, p_t)
Calculate word distinctiveness according to [Chuang2012]:
distinctiveness(w) = KL(P(T|w), P(T)) = sum_T P(T|w) log(P(T|w) / P(T)),
where KL is the Kullback-Leibler divergence, P(T) is the marginal topic distribution and P(T|w) is the probability of a topic given a word.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
p_t – marginal topic distribution; array of size K
- Returns
array of size M (vocabulary size) with word distinctiveness
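A toy NumPy sketch of this computation, deriving P(T|w) from phi and p(T) via Bayes' rule (values are hypothetical):

```python
import numpy as np

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
p_t = np.array([0.5, 0.5])
p_w = p_t @ phi                           # p(w), shape (M,)
p_t_given_w = (phi * p_t[:, None]) / p_w  # shape KxM, columns sum to 1
distinct = np.sum(p_t_given_w * np.log(p_t_given_w / p_t[:, None]), axis=0)
# words concentrated in a single topic get high distinctiveness
```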
- tmtoolkit.topicmod.model_stats.word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths)
Calculate word saliency according to [Chuang2012] as
saliency(w) = p(w) * distinctiveness(w)
for a word w.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_lengths – array of size N (number of docs) with integers indicating the number of terms per document
- Returns
array of size M (vocabulary size) with word saliency
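Combining the marginal word distribution and distinctiveness from above, a self-contained toy sketch (all values hypothetical):

```python
import numpy as np

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
doc_lengths = np.array([100, 300])

p_t = doc_lengths @ theta / doc_lengths.sum()    # marginal topic distrib
p_w = p_t @ phi                                  # marginal word distrib
p_t_given_w = (phi * p_t[:, None]) / p_w
distinct = np.sum(p_t_given_w * np.log(p_t_given_w / p_t[:, None]), axis=0)
saliency = p_w * distinct                        # array of length M
```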
Parallel model fitting and evaluation with lda
Parallel model computation and evaluation using the lda package.
Available evaluation metrics for this module are listed in AVAILABLE_METRICS
.
See tmtoolkit.topicmod.evaluate
for references and implementations of those evaluation metrics.
- tmtoolkit.topicmod.tm_lda.AVAILABLE_METRICS = ('loglikelihood', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')
Available metrics for lda ("griffiths_2004" and "held_out_documents_wallach09" are added when package gmpy2 is installed; several "coherence_gensim_" metrics are added when package gensim is installed).
- tmtoolkit.topicmod.tm_lda.DEFAULT_METRICS = ('cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')
Metrics used by default.
- tmtoolkit.topicmod.tm_lda.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)
Compute several topic models in parallel using the “lda” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.
If data is a dict of named matrices, this function will return a dict with corpus ID -> result list mapping. Otherwise it will only return a result list. A result list is a list of tuples (parameter_set, model), where parameter_set is a dict of the used parameters.
- Parameters
data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
- Returns
if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset
- tmtoolkit.topicmod.tm_lda.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)
Compute several Topic Models in parallel using the “lda” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).
Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:
[(parameter_set_1, {'<metric_name>': result_1, ...}), ..., (parameter_set_n, {'<metric_name>': result_n, ...})]
See also
Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.
- Parameters
data – a (sparse) 2D array/matrix
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
return_models – if True, also return the computed models in the evaluation results
metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS
metric_kwargs – dict of options for the used metric(s)
- Returns
list of evaluation results for each varying parameter set as described above
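The returned structure can be post-processed directly, e.g. to pick the best parameter set for a single metric. A sketch with hypothetical result values (cao_juan_2009 is a metric to be minimized):

```python
# Hypothetical evaluation output in the documented shape:
eval_results = [
    ({'n_topics': 10}, {'cao_juan_2009': 0.31, 'arun_2010': 812.0}),
    ({'n_topics': 20}, {'cao_juan_2009': 0.24, 'arun_2010': 790.5}),
    ({'n_topics': 30}, {'cao_juan_2009': 0.27, 'arun_2010': 801.2}),
]

# cao_juan_2009 measures average topic similarity and should be
# minimized, so pick the parameter set with the smallest value.
best_params, best_metrics = min(eval_results,
                                key=lambda r: r[1]['cao_juan_2009'])
```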
Parallel model fitting and evaluation with scikit-learn
Parallel model computation and evaluation using the scikit-learn package.
Available evaluation metrics for this module are listed in AVAILABLE_METRICS
.
See tmtoolkit.topicmod.evaluate
for references and implementations of those evaluation metrics.
- tmtoolkit.topicmod.tm_sklearn.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')
Available metrics for sklearn ("held_out_documents_wallach09" is added when package gmpy2 is installed; several "coherence_gensim_" metrics are added when package gensim is installed).
- tmtoolkit.topicmod.tm_sklearn.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')
Metrics used by default.
- tmtoolkit.topicmod.tm_sklearn.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)
Compute several topic models in parallel using the “sklearn” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.
If data is a dict of named matrices, this function will return a dict with corpus ID -> result list mapping. Otherwise it will only return a result list. A result list is a list of tuples (parameter_set, model), where parameter_set is a dict of the used parameters.
- Parameters
data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
- Returns
if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset
- tmtoolkit.topicmod.tm_sklearn.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)
Compute several Topic Models in parallel using the “sklearn” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).
Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:
[(parameter_set_1, {'<metric_name>': result_1, ...}), ..., (parameter_set_n, {'<metric_name>': result_n, ...})]
See also
Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.
- Parameters
data – a (sparse) 2D array/matrix
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
return_models – if True, also return the computed models in the evaluation results
metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS
metric_kwargs – dict of options for the used metric(s)
- Returns
list of evaluation results for each varying parameter set as described above
Parallel model fitting and evaluation with Gensim
Parallel model computation and evaluation using the Gensim package.
Available evaluation metrics for this module are listed in AVAILABLE_METRICS
.
See tmtoolkit.topicmod.evaluate
for references and implementations of those evaluation metrics.
- tmtoolkit.topicmod.tm_gensim.AVAILABLE_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_u_mass', 'coherence_gensim_c_v', 'coherence_gensim_c_uci', 'coherence_gensim_c_npmi')
Available metrics for Gensim.
- tmtoolkit.topicmod.tm_gensim.DEFAULT_METRICS = ('perplexity', 'cao_juan_2009', 'arun_2010', 'coherence_mimno_2011', 'coherence_gensim_c_v')
Metrics used by default.
- tmtoolkit.topicmod.tm_gensim.compute_models_parallel(data, varying_parameters=None, constant_parameters=None, n_max_processes=None)
Compute several topic models in parallel using the “gensim” package. Use a single or multiple document term matrices data and optionally a list of varying parameters varying_parameters. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data can be either a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix) or a dict with corpus ID -> Document-Term-Matrix mapping when calculating models for multiple corpora.
If data is a dict of named matrices, this function will return a dict with corpus ID -> result list mapping. Otherwise it will only return a result list. A result list is a list of tuples (parameter_set, model), where parameter_set is a dict of the used parameters.
- Parameters
data – either a (sparse) 2D array/matrix or a dict mapping dataset labels to such matrices
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
- Returns
if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset
- tmtoolkit.topicmod.tm_gensim.evaluate_topic_models(data, varying_parameters, constant_parameters=None, n_max_processes=None, return_models=False, metric=None, **metric_kwargs)
Compute several Topic Models in parallel using the “gensim” package. Calculate the models using a list of varying parameters varying_parameters on a single Document-Term-Matrix data. Pass parameters in constant_parameters dict to each model calculation. Use at maximum n_max_processes processors or use all available processors if None is passed.
data must be a Document-Term-Matrix (NumPy array/matrix, SciPy sparse matrix).
Will return a list of size len(varying_parameters) containing tuples (parameter_set, eval_results) where parameter_set is a dict of the used parameters and eval_results is a dict of metric names -> metric results:
[(parameter_set_1, {'<metric_name>': result_1, ...}), ..., (parameter_set_n, {'<metric_name>': result_n, ...})]
See also
Results can be simplified using tmtoolkit.topicmod.evaluate.results_by_parameter.
- Parameters
data – a (sparse) 2D array/matrix
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate evaluation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
return_models – if True, also return the computed models in the evaluation results
metric – string or list of strings; if given, use only this metric or these metrics for evaluation; must be a subset of AVAILABLE_METRICS
metric_kwargs – dict of options for the used metric(s)
- Returns
list of evaluation results for each varying parameter set as described above
Visualize topic models and topic model evaluation results
Wordclouds from topic models
- tmtoolkit.topicmod.visualize.DEFAULT_WORDCLOUD_KWARGS = {'background_color': None, 'color_func': <function _wordcloud_color_func_black>, 'height': 600, 'mode': 'RGBA', 'width': 800}
Default wordcloud settings for transparent background and black font; will be passed to wordcloud.WordCloud.
- tmtoolkit.topicmod.visualize.generate_wordclouds_for_topic_words(topic_word_distrib, vocab, top_n, topic_labels='topic_{i1}', which_topics=None, return_images=True, **wordcloud_kwargs)
Generate wordclouds for the top top_n words of each topic in topic_word_distrib.
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary array of length M
top_n – number of top values to take from each row of topic_word_distrib
topic_labels – labels used for each row; determine keys in result dict; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings
which_topics – if not None, a sequence of indices into rows of topic_word_distrib to select only these topics to generate wordclouds from
return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict
wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS
- Returns
dict mapping row labels to wordcloud images or instances generated from each topic
- tmtoolkit.topicmod.visualize.generate_wordclouds_for_document_topics(doc_topic_distrib, doc_labels, top_n, topic_labels='topic_{i1}', which_documents=None, return_images=True, **wordcloud_kwargs)
Generate wordclouds for the top top_n topics of each document in doc_topic_distrib.
- Parameters
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
top_n – number of top values to take from each row of doc_topic_distrib
topic_labels – labels used for each topic; determine keys in result dict; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings
which_documents – if not None, a sequence of indices into rows of doc_topic_distrib to select only these documents to generate wordclouds from
return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict
wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS
- Returns
dict mapping row labels to wordcloud images or instances generated from each document
- tmtoolkit.topicmod.visualize.generate_wordcloud_from_probabilities_and_words(prob, words, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)
Generate a single wordcloud for given probabilities (weights) prob of the respective words.
- Parameters
prob – 1D array or sequence of probabilities for words
words – 1D array or sequence of word strings
return_image – if True, return an image object instead of a wordcloud.WordCloud instance
wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance
wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS
- Returns
either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance
- tmtoolkit.topicmod.visualize.generate_wordcloud_from_weights(weights, return_image=True, wordcloud_instance=None, **wordcloud_kwargs)
Generate a single wordcloud for a weights dict that maps words to “weights” (e.g. probabilities) which determine their size in the wordcloud.
- Parameters
weights – dict that maps words to weights
return_image – if True, return an image object instead of a wordcloud.WordCloud instance
wordcloud_instance – optionally pass an already initialized wordcloud.WordCloud instance
wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS
- Returns
either a wordcloud image if return_image is True, otherwise a wordcloud.WordCloud instance
- tmtoolkit.topicmod.visualize.write_wordclouds_to_folder(wordclouds, folder, file_name_fmt='{label}.png', **save_kwargs)
Save all wordcloud image objects in wordclouds to folder.
- Parameters
wordclouds – dict mapping wordcloud label to wordcloud object
folder – target path
file_name_fmt – file name string format with placeholder "{label}"
save_kwargs – additional options passed to the save method of each wordcloud image object
- tmtoolkit.topicmod.visualize.generate_wordclouds_from_distribution(distrib, row_labels, val_labels, top_n, which_rows=None, return_images=True, **wordcloud_kwargs)
Generate wordclouds for each row in a given probability distribution distrib.
Note
Use generate_wordclouds_for_topic_words or generate_wordclouds_for_document_topics as shortcuts for creating wordclouds for a topic-word or document-topic distribution.
- Parameters
distrib – 2D (sparse) array/matrix probability distribution
row_labels – labels for rows in probability distribution; these are used as keys in the return dict
val_labels – labels for values in probability distribution (e.g. vocabulary)
top_n – number of top values to take from each row of distrib
which_rows – if not None, select only the rows from this sequence of indices from distrib
return_images – if True, store image objects instead of wordcloud.WordCloud objects in the result dict
wordcloud_kwargs – pass additional options to wordcloud.WordCloud; updates options in DEFAULT_WORDCLOUD_KWARGS
- Returns
dict mapping row labels to wordcloud images or instances generated from each distribution row
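The per-row preparation step effectively extracts the top_n value/label pairs from each row; a hypothetical NumPy sketch of that step (the actual wordcloud rendering is omitted):

```python
import numpy as np

# Toy distribution: 2 rows (e.g. topics), 3 values (e.g. vocabulary).
distrib = np.array([[0.5, 0.3, 0.2],
                    [0.1, 0.2, 0.7]])
row_labels = ['topic_1', 'topic_2']
val_labels = np.array(['apple', 'pear', 'plum'])
top_n = 2

weights_per_row = {}
for label, row in zip(row_labels, distrib):
    top = np.argsort(row)[::-1][:top_n]   # indices of the top_n values
    weights_per_row[label] = dict(zip(val_labels[top], row[top]))
# each dict could be passed to generate_wordcloud_from_weights
```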
Plot heatmaps for topic models
- tmtoolkit.topicmod.visualize.plot_doc_topic_heatmap(fig, ax, doc_topic_distrib, doc_labels, topic_labels=None, which_documents=None, which_document_indices=None, which_topics=None, which_topic_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)
Plot a heatmap for a document-topic distribution doc_topic_distrib to a matplotlib Figure fig and Axes ax using doc_labels as document labels on the y-axis and topics from 1 to K (number of topics) on the x-axis.
Note
It is almost always necessary to select a subset of your document-topic distribution with the which_documents or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
doc_labels – list/array of length N with a string label for each document
topic_labels – labels used for each topic; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings
which_documents – select documents via document label strings
which_document_indices – alternatively, select documents with zero-based document index in [0, N-1]
which_topics – select topics via topic label strings (when string array or list) or with one-based topic index in [1, K] (when integer array or list)
which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]
xaxislabel – x axis label string
yaxislabel – y axis label string
kwargs – additional arguments passed to plot_heatmap
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
- tmtoolkit.topicmod.visualize.plot_topic_word_heatmap(fig, ax, topic_word_distrib, vocab, topic_labels=None, which_topics=None, which_topic_indices=None, which_words=None, which_word_indices=None, xaxislabel=None, yaxislabel=None, **kwargs)
Plot a heatmap for a topic-word distribution topic_word_distrib to a matplotlib Figure fig and Axes ax, using vocab as vocabulary on the x-axis and topics from 1 to K (number of topics) on the y-axis.
Note
It is almost always necessary to select a subset of your topic-word distribution with the which_words or which_topics parameters, as otherwise the amount of data to be plotted will be too high to give a reasonable picture.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
vocab – vocabulary array of length M
topic_labels – labels used for each row; either single format string with placeholders "{i0}" (zero-based topic index) or "{i1}" (one-based topic index), or list of topic label strings
which_topics – select topics via topic label strings (when string array or list and topic_labels is given) or with one-based topic index in [1, K] (when integer array or list)
which_topic_indices – alternatively, select topics with zero-based topic index in [0, K-1]
which_words – select words with one-based word index in [1, M]
which_word_indices – alternatively, select words with zero-based word index in [0, M-1]
xaxislabel – x axis label string
yaxislabel – y axis label string
kwargs – additional arguments passed to plot_heatmap
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
- tmtoolkit.topicmod.visualize.plot_heatmap(fig, ax, data, xaxislabel=None, yaxislabel=None, xticklabels=None, yticklabels=None, title=None, grid=True, values_in_cells=True, round_values_in_cells=2, legend=False, fontsize_axislabel=None, fontsize_axisticks=None, fontsize_cell_values=None)
Generic heatmap plotting function for 2D matrix data.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
data – 2D array/matrix to be plotted as heatmap
xaxislabel – x axis label string
yaxislabel – y axis label string
xticklabels – list of x axis tick labels
yticklabels – list of y axis tick labels
title – plot title
grid – draw grid if True
values_in_cells – draw values of data in heatmap cells
round_values_in_cells – round these values to the given number of digits
legend – if True, draw a legend
fontsize_axislabel – font size for axis label
fontsize_axisticks – font size for axis ticks
fontsize_cell_values – font size for values in cells
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
Plot probability distribution rankings for topic models
- tmtoolkit.topicmod.visualize.plot_topic_word_ranked_prob(fig, ax, topic_word_distrib, n, highlight_label_fmt='topic {i0}', highlight_label_other='other topics', title='Ranked word probability per topic', xaxislabel='word rank', yaxislabel='word probability', **kwargs)
Plot a topic-word probability distribution by ranking the probabilities in each row. This is for example useful in order to examine how many top words usually describe most of a topic.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
topic_word_distrib – topic-word probability distribution
n – limit max. shown word rank on x-axis
highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows
highlight_label_other – if highlight is given, use this as label for non-highlighted rows
title – plot title
xaxislabel – x-axis label
yaxislabel – y-axis label
kwargs – further arguments passed to plot_prob_distrib_ranked_prob
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
- tmtoolkit.topicmod.visualize.plot_doc_topic_ranked_prob(fig, ax, doc_topic_distrib, n, highlight_label_fmt='document {i0}', highlight_label_other='other documents', title='Ranked topic probability per document', xaxislabel='topic rank', yaxislabel='topic probability', **kwargs)
Plot a document-topic probability distribution by ranking the probabilities in each row. This is for example useful in order to examine how many top topics usually describe most of a document.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
doc_topic_distrib – document-topic probability distribution
n – limit max. shown topic rank on x-axis
highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows
highlight_label_other – if highlight is given, use this as label for non-highlighted rows
title – plot title
xaxislabel – x-axis label
yaxislabel – y-axis label
kwargs – further arguments passed to plot_prob_distrib_ranked_prob
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
- tmtoolkit.topicmod.visualize.plot_prob_distrib_ranked_prob(fig, ax, data, x_limit, log_scale=True, lw=1, alpha=0.1, highlight=None, highlight_label_fmt='{i0}', highlight_label_other='other', highlight_lw=3, highlight_alpha=0.3, title=None, xaxislabel='rank', yaxislabel='probability')
Plot a 2D probability distribution (one distribution for each row which should add up to 1) by ranking the probabilities in each row.
- Parameters
fig – matplotlib Figure object
ax – matplotlib Axes object
data – a 2D probability distribution (one distribution for each row which should add up to 1)
x_limit – limit max. shown rank on x-axis
log_scale – if True, apply log scale on y-axis
lw – line width
alpha – line transparency
highlight – if given, pass a sequence or NumPy array with indices of rows in data, which should be highlighted
highlight_label_fmt – if highlight is given, use this format for labeling the highlighted rows
highlight_label_other – if highlight is given, use this as label for non-highlighted rows
highlight_lw – line width for highlighted distributions
highlight_alpha – line transparency for highlighted distributions
title – plot title
xaxislabel – x-axis label
yaxislabel – y-axis label
- Returns
tuple of generated (matplotlib Figure object, matplotlib Axes object)
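The ranking that these plotting functions visualize can be sketched with plain NumPy (a hypothetical stand-in for the internal computation, not tmtoolkit’s actual code): each row of the distribution is sorted in descending order, so column j holds the probability of the rank-(j+1) topic per document.

```python
import numpy as np

# Toy document-topic distribution: 3 documents, 5 topics; each row sums to 1.
doc_topic = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.1, 0.1],
])

# Rank probabilities in each row in descending order; column j then holds
# the probability of the rank-(j+1) topic for each document.
ranked = -np.sort(-doc_topic, axis=1)

print(ranked[0])   # highest topic probability of document 0 comes first
```

Plotting these rows as lines against their rank is exactly what plot_prob_distrib_ranked_prob does, typically on a log-scaled y-axis.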
Plot topic model evaluation results
- tmtoolkit.topicmod.visualize.plot_eval_results(eval_results, metric=None, param=None, xaxislabel=None, yaxislabel=None, title=None, title_fontsize='xx-large', subfig_fontsize='large', axes_title_fontsize='medium', show_metric_direction=True, metric_direction_font_size='medium', subplots_adjust_opts=None, figsize='auto', fig_opts=None, subfig_opts=None, subplots_opts=None)
Plot the evaluation results from eval_results, which must be a sequence containing (param_0, …, param_N, metric results) tuples, where param_N is the parameter value to appear on the x axis and all preceding parameter combinations are used to create a small multiples plot (if there is more than one parameter). The metric results can be a dict structure containing the evaluation results for each metric. eval_results can be created using
tmtoolkit.topicmod.evaluate.results_by_parameter
.
Note
Due to a bug in matplotlib, it seems that it’s not possible to display a plot title when plotting small multiples and adjusting the positioning of the subplots. Hence you must set show_metric_direction to False when you’re displaying small multiples and want to display a plot title.
- Parameters
eval_results – topic evaluation results as sequence containing (param_0, …, param_N, metric results)
metric – either single string or list of strings; plot only this/these specific metric/s
param – names of the parameters used in eval_results
xaxislabel – x axis label string
yaxislabel – y axis label string
title – plot title
title_fontsize – font size for the figure title
subfig_fontsize – font size for the subfigure titles
axes_title_fontsize – font size for the plot titles
show_metric_direction – if True, show whether the shown metric should be minimized or maximized for optimization
metric_direction_font_size – font size for the metric optimization direction indicator
subplots_adjust_opts – options passed to Matplotlib’s
fig.subplots_adjust()
figsize – tuple (width, height) or "auto" (default)
fig_opts – additional parameters passed to Matplotlib’s plt.figure()
subfig_opts – additional parameters passed to Matplotlib’s
fig.subfigures()
subplots_opts – additional parameters passed to Matplotlib’s
subfig.subplots()
- Returns
tuple of generated (matplotlib Figure object, matplotlib Subfigures, matplotlib Axes)
Other functions
- tmtoolkit.topicmod.visualize.parameters_for_ldavis(topic_word_distrib, doc_topic_distrib, dtm, vocab, sort_topics=False)
Create a parameters dict that can be used with the pyLDAVis package by passing the dict params like pyLDAVis.prepare(**params).
- Parameters
topic_word_distrib – topic-word distribution; shape KxM, where K is number of topics, M is vocabulary size
doc_topic_distrib – document-topic distribution; shape NxK, where N is the number of documents, K is the number of topics
dtm – document-term-matrix; shape NxM
vocab – vocabulary array/list of length M
sort_topics – if True, sort the topics
- Returns
dict with parameters ready to use with pyLDAVis
Base classes for parallel model fitting and evaluation
Base classes for parallel model fitting and evaluation. See the specific functions and classes in
tm_gensim
, tm_lda
and tm_sklearn
for
parallel processing with popular topic modeling packages.
Note
The classes and functions in this module are only important if you want to implement your own parallel model computation and evaluation.
- class tmtoolkit.topicmod.parallel.MultiprocEvaluationRunner(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)
Specialization of MultiprocModelsRunner for parallel model evaluations.
- __init__(worker_class, available_metrics, data, varying_parameters, constant_parameters=None, metric=None, metric_options=None, n_max_processes=None, return_models=False)
Initialize evaluation runner.
- Parameters
worker_class – model computation worker class derived from
MultiprocModelsWorkerABC
available_metrics – list/tuple with available metrics as strings
data – the data that the workers use for computations; 2D (sparse) array/matrix
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation
constant_parameters – dict with parameters that are the same for all parallel computations
metric – string or list of strings; if given, use only this/these metric(s) for evaluation; must be a subset of available_metrics
metric_options – dict of options for the used metric(s)
n_max_processes – maximum number of worker processes to spawn
return_models – if True, also return the computed models in the evaluation results
- class tmtoolkit.topicmod.parallel.MultiprocEvaluationWorkerABC(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)
Specialization of MultiprocModelsWorkerABC for parallel model evaluations.
- __init__(worker_id, eval_metric, eval_metric_options, return_models, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)
Initialize parallel model evaluations worker class with an ID worker_id, a queue to receive tasks from tasks_queue, a queue to send results to results_queue and the data to operate on. Use evaluation metrics eval_metric.
- Parameters
worker_id – process ID
eval_metric – list/tuple of strings of evaluation metrics to use
eval_metric_options – dict of options for the used metric(s)
tasks_queue – queue to receive tasks from
results_queue – queue to send results to
data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for sparse matrix in COO format
group – see Python’s multiprocessing.Process class
target – see Python’s multiprocessing.Process class
name – see Python’s multiprocessing.Process class
args – see Python’s multiprocessing.Process class
kwargs – see Python’s multiprocessing.Process class
- class tmtoolkit.topicmod.parallel.MultiprocModelsRunner(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)
Runner class for distributing and managing worker processes for parallel model computation.
- __init__(worker_class, data, varying_parameters=None, constant_parameters=None, n_max_processes=None)
Initiate runner class with a model computation worker class worker_class (which should be derived from MultiprocModelsWorkerABC). This class represents the worker processes; each will be instantiated with data and work on it with a different parameter set that can be passed via varying_parameters.
- Parameters
worker_class – model computation worker class derived from
MultiprocModelsWorkerABC
data – the data that the workers use for computations; 2D (sparse) array/matrix or a dict with such matrices; the latter allows to run all computations on different datasets at once
varying_parameters – list of dicts with parameters; each parameter set will be used in a separate computation
constant_parameters – dict with parameters that are the same for all parallel computations
n_max_processes – maximum number of worker processes to spawn
- run()
Set up worker processes and run parallel computations. Blocks until all processes are done, then stops all workers and returns the results.
- Returns
if passed data is 2D array, returns a list with tuples (parameter set, results); if passed data is a dict of 2D arrays, returns dict with same keys as data and the respective results for each dataset
- shutdown_workers()
Send shutdown signal to all worker processes to stop them.
- class tmtoolkit.topicmod.parallel.MultiprocModelsWorkerABC(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)
Abstract base class for parallel model computations worker class.
- __init__(worker_id, tasks_queue, results_queue, data, group=None, target=None, name=None, args=(), kwargs=None)
Initialize parallel model computations worker class with an ID worker_id, a queue to receive tasks from tasks_queue, a queue to send results to results_queue and the data to operate on.
- Parameters
worker_id – process ID
tasks_queue – queue to receive tasks from
results_queue – queue to send results to
data – data to operate on; a dict mapping dataset label to a dataset; can be anything but is usually a tuple of shared data pointers for sparse matrix in COO format
group – see Python’s multiprocessing.Process class
target – see Python’s multiprocessing.Process class
name – see Python’s multiprocessing.Process class
args – see Python’s multiprocessing.Process class
kwargs – see Python’s multiprocessing.Process class
- fit_model(data, params)
Method stub to implement the actual model fitting for data with parameter set params.
- Parameters
data – data passed to the model fitting algorithm
params – parameter set dict
- Returns
model fitting / evaluation results
- run()
Run the process worker: Calls
fit_model
on each dataset and parameter set coming from the tasks queue.
- send_results(doc, params, results)
Put the results into the results queue.
- Parameters
doc – “document” / dataset label
params – used parameter set
results – generated results, e.g. fit model and/or evaluation results
tmtoolkit.utils
Misc. utility functions.
- tmtoolkit.utils.applychain(funcs, initial_arg)
For n functions f in funcs, apply them in sequence: f_n(… f_1(f_0(initial_arg)) …), i.e. f_0 is applied first.
- Parameters
funcs (Iterable[Callable]) – functions to apply; must not be empty
initial_arg (Any) – initial function argument
- Returns
result after applying all functions in funcs
- Return type
Any
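The documented behavior (f_0 applied first, its result fed to f_1, and so on) matches a simple functools.reduce; this is a sketch of the semantics, not necessarily the library’s implementation:

```python
from functools import reduce

def applychain(funcs, initial_arg):
    # Apply each function in turn: f_0 first, its result fed to f_1, and so on.
    return reduce(lambda result, f: f(result), funcs, initial_arg)

print(applychain([lambda x: x + 1, lambda x: x * 2], 3))   # (3 + 1) * 2 = 8
```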
- tmtoolkit.utils.argsort(seq)
Same as NumPy’s numpy.argsort but for Python sequences.
- Parameters
seq (Sequence) – a sequence
- Returns
indices into seq that sort seq
- Return type
List[int]
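A minimal sketch of the same idea for plain sequences, using sorted over the index range (a hypothetical equivalent, not tmtoolkit’s code):

```python
def argsort(seq):
    # Indices that would sort the sequence, like numpy.argsort for plain lists.
    return sorted(range(len(seq)), key=seq.__getitem__)

idx = argsort(['c', 'a', 'b'])
print(idx)   # [1, 2, 0]
```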
- tmtoolkit.utils.as_chararray(x)
Convert a NumPy array or sequence x to a NumPy character array. If x is already a NumPy character array, return a copy of it.
- Parameters
x (Union[ndarray, Sequence]) – NumPy array or sequence
- Returns
NumPy character array
- Return type
ndarray
- tmtoolkit.utils.combine_sparse_matrices_columnwise(matrices, col_labels, row_labels=None, dtype=None)
Given a sequence of sparse matrices in matrices and their corresponding column labels in col_labels, stack these matrices in rowwise fashion by retaining the column affiliation and filling in zeros, e.g.:
m1:
  C A D
  -----
  1 0 3
  0 2 0

m2:
  D B C A
  -------
  0 0 1 2
  3 4 5 6
  2 1 0 0
will result in:
  A B C D
  -------
  0 0 1 3
  2 0 0 0
  2 0 1 0
  6 4 5 3
  0 1 0 2
(where the first two rows come from m1 and the other three rows from m2).
The resulting columns will always be sorted in ascending order.
Additionally you can pass a sequence of row labels for each matrix via row_labels. This will also sort the rows in ascending order according to the row labels.
- Parameters
matrices (Sequence) – sequence of sparse matrices
col_labels (Sequence[Union[str, int]]) – column labels for each matrix in matrices; may be sequence of strings or integers
row_labels (Optional[Sequence[str]]) – optional sequence of row labels for each matrix in matrices
dtype (Optional[Union[str, dtype]]) – optionally specify the dtype of the resulting sparse matrix
- Returns
a tuple with (1) combined sparse matrix in CSR format; (2) column labels of the matrix; (3) optionally row labels of the matrix if row_labels is not None.
- Return type
Union[Tuple[csr_matrix, ndarray], Tuple[csr_matrix, ndarray, ndarray]]
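The column alignment described above can be illustrated with a dense NumPy sketch that reproduces the documented example (the real function operates on sparse matrices and returns CSR format; this only shows the zero-filling logic):

```python
import numpy as np

def combine_dense_columnwise(matrices, col_labels):
    # Union of all column labels, sorted ascending, as in the documentation.
    all_cols = sorted(set(l for labels in col_labels for l in labels))
    col_idx = {c: j for j, c in enumerate(all_cols)}
    n_rows = sum(m.shape[0] for m in matrices)
    out = np.zeros((n_rows, len(all_cols)), dtype=int)
    row = 0
    for m, labels in zip(matrices, col_labels):
        for j, label in enumerate(labels):
            out[row:row + m.shape[0], col_idx[label]] = m[:, j]
        row += m.shape[0]
    return out, all_cols

m1 = np.array([[1, 0, 3], [0, 2, 0]])                       # columns C, A, D
m2 = np.array([[0, 0, 1, 2], [3, 4, 5, 6], [2, 1, 0, 0]])   # columns D, B, C, A
combined, cols = combine_dense_columnwise([m1, m2], [list('CAD'), list('DBCA')])
print(cols)
print(combined)
```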
- tmtoolkit.utils.dict2df(data, key_name='key', value_name='value', sort=None)
Take a simple dictionary that maps any key to any scalar value and convert it to a dataframe that contains two columns: one for the keys and one for the respective values. Optionally sort by column sort.
- Parameters
data (dict) – dictionary that maps keys to scalar values
key_name (str) – column name for the keys
value_name (str) – column name for the values
sort (Optional[str]) – optionally sort by this column; prepend by “-” to indicate descending sorting order, e.g. “-value”
- Returns
a dataframe with two columns: one for the keys named key_name and one for the respective values named value_name
- Return type
DataFrame
- tmtoolkit.utils.disable_logging()
Disable logging for tmtoolkit package.
- Return type
None
- tmtoolkit.utils.empty_chararray()
Create empty NumPy character array.
- Returns
empty NumPy character array
- Return type
ndarray
- tmtoolkit.utils.enable_logging(level=20, fmt='%(asctime)s:%(levelname)s:%(name)s:%(message)s', logging_handler=None, add_logging_handler=True, **stream_hndlr_opts)
Enable logging for tmtoolkit package with minimum log level level and log message format fmt. By default, logs to stderr via
logging.StreamHandler. You may also pass your own log handler.
See also
Currently, only the logging levels INFO and DEBUG are used in tmtoolkit. See the Python Logging HOWTO guide for more information on log levels and formats.
- Parameters
level (int) – minimum log level; default is INFO level
fmt (str) – log message format
logging_handler (Optional[Handler]) – pass a custom logging handler to be used instead of the default logging.StreamHandler
add_logging_handler (bool) – if True, add the logging handler to the logger
stream_hndlr_opts – optional additional parameters passed to
logging.StreamHandler
- Return type
None
- tmtoolkit.utils.flatten_list(l)
Flatten a 2D sequence l to a 1D list and return it.
Although return sum(l, []) looks like a nice one-liner, it turns out to be much slower than the implementation used here.
- Parameters
l (Iterable[Iterable]) – 2D sequence, e.g. list of lists
- Returns
flattened list, i.e. a 1D list that concatenates all elements from each list inside l
- Return type
list
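A chain-based flattening achieves the documented behavior and avoids the quadratic cost of sum(l, []); this is a sketch, not necessarily the library’s exact implementation:

```python
from itertools import chain

def flatten_list(l):
    # Concatenate all inner iterables into one flat list.
    return list(chain.from_iterable(l))

print(flatten_list([[1, 2], [3], [4, 5]]))   # [1, 2, 3, 4, 5]
```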
- tmtoolkit.utils.greedy_partitioning(elems_dict, k, return_only_labels=False)
Implementation of a greedy partitioning algorithm for a dict elems_dict containing elements with label -> weight mapping. A weight can be a number in an arbitrary range. Since this is used for task scheduling, you can think of it as: the larger the weight, the bigger the task.
The elements are placed in k bins such that the difference of sums of weights in each bin is minimized. The algorithm does not always find the optimal solution.
If return_only_labels is False, returns a list of k dicts with label -> weight mapping, else returns a list of k lists containing only the labels for the respective partitions.
- Parameters
elems_dict (Dict[str, Union[int, float]]) – dictionary containing elements with label -> weight mapping
k (int) – number of bins
return_only_labels – if True, only return the labels in each bin
- Returns
list with k bins, where each bin is either a dict with label -> weight mapping if return_only_labels is False or a list of labels
- Return type
Union[List[Dict[str, Union[int, float]]], List[List[str]]]
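The greedy idea can be sketched as: sort elements by descending weight, then always place the next element into the currently lightest bin. This is a hypothetical re-implementation of the documented semantics, not tmtoolkit’s code:

```python
def greedy_partitioning(elems_dict, k, return_only_labels=False):
    bins = [{} for _ in range(k)]
    # Largest weights first; each goes into the bin with the smallest sum so far.
    for label, weight in sorted(elems_dict.items(), key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: sum(b.values()))
        target[label] = weight
    if return_only_labels:
        return [list(b.keys()) for b in bins]
    return bins

tasks = {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}
bins = greedy_partitioning(tasks, 2)
print([sum(b.values()) for b in bins])
```

As noted in the documentation, this heuristic minimizes the difference between bin sums but does not always find the optimal partition.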
- tmtoolkit.utils.linebreaks_win2unix(text)
Convert Windows line breaks \r\n to Unix line breaks \n.
- Parameters
text (str) – text string
- Returns
text string with Unix line breaks
- Return type
str
- tmtoolkit.utils.mat2d_window_from_indices(mat, row_indices=None, col_indices=None, copy=False)
Select an area/”window” inside of a 2D array/matrix mat specified by either a sequence of row indices row_indices and/or a sequence of column indices col_indices. Returns the specified area as a view of the data if copy is False, else it will return a copy.
- Parameters
mat (ndarray) – a 2D NumPy array
row_indices (Optional[Union[List[int], ndarray]]) – list or array of row indices to select or None to select all rows
col_indices (Optional[Union[List[int], ndarray]]) – list or array of column indices to select or None to select all columns
copy – if True, return result as copy, else as view into mat
- Returns
window into mat as specified by the passed indices
- Return type
ndarray
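Selecting such a window with plain NumPy can be sketched via np.ix_ (a hypothetical illustration of the selection itself, leaving aside the copy/view distinction the function exposes):

```python
import numpy as np

mat = np.arange(12).reshape(3, 4)

# Select rows 0 and 2, columns 1 and 3 as a 2x2 "window".
window = mat[np.ix_([0, 2], [1, 3])]
print(window)
```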
- tmtoolkit.utils.merge_dicts(dicts, sort_keys=False, safe=False)
Merge all dictionaries in dicts to form a single dict.
- Parameters
dicts (Sequence[dict]) – sequence of dictionaries to merge
sort_keys (bool) – sort the keys in the resulting dictionary
safe (bool) – if True, raise a
ValueError
if sets of keys in dicts are not disjoint, else later dicts in the sequence will silently update already existing data with the same key
- Returns
merged dictionary
- Return type
dict
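The safe-mode semantics can be sketched like this (a hypothetical re-implementation of the documented behavior):

```python
def merge_dicts(dicts, safe=False):
    merged = {}
    for d in dicts:
        if safe and merged.keys() & d.keys():
            raise ValueError('dicts must have disjoint sets of keys in safe mode')
        merged.update(d)   # later dicts silently overwrite when safe is False
    return merged

print(merge_dicts([{'a': 1}, {'b': 2}, {'a': 3}]))   # {'a': 3, 'b': 2}
```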
- tmtoolkit.utils.merge_sets(sets, safe=False)
Merge all sets in sets to form a single set.
- Parameters
sets (Sequence[set]) – sequence of sets to merge
safe (bool) – if True, raise a
ValueError
if sets are not disjoint
- Returns
merged set
- Return type
set
- tmtoolkit.utils.path_split(path, base=None)
Split path path into its components:
path_split('a/simple/test.txt') # ['a', 'simple', 'test.txt']
- Parameters
path (str) – a file path
base (Optional[List[str]]) – path remainder (used for recursion)
- Returns
components of the path as list
- Return type
List[str]
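The recursive splitting can be sketched with os.path.split (a hypothetical equivalent of the documented behavior, reproducing the example above):

```python
import os

def path_split(path, base=None):
    # Recursively split off the last path component until nothing is left.
    components = base or []
    head, tail = os.path.split(path)
    if tail:
        components.insert(0, tail)
    if head and head != path:
        return path_split(head, components)
    return components

print(path_split('a/simple/test.txt'))   # ['a', 'simple', 'test.txt']
```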
- tmtoolkit.utils.pickle_data(data, picklefile, **kwargs)
Save data in picklefile with Python’s pickle module.
- Parameters
data (Any) – data to store in picklefile
picklefile (str) – either target file path as string or file handle
kwargs – further parameters passed to
pickle.dump
- Return type
None
- tmtoolkit.utils.read_text_file(fpath, encoding, read_size=-1, force_unix_linebreaks=True)
Read the text file at path fpath with character encoding encoding and return it as string.
- Parameters
fpath (str) – path to file to read
encoding (str) – character encoding
read_size (int) – max. number of characters to read. -1 means read full file.
force_unix_linebreaks (bool) – if True, convert Windows linebreaks to Unix linebreaks
- Returns
file content as string
- Return type
str
- tmtoolkit.utils.sample_dict(d, n)
Return a subset of the dictionary d as random sample of size n.
- Parameters
d (dict) – dictionary to sample
n (int) – sample size; must be positive and smaller than or equal to
len(d)
- Returns
subset of the input dictionary
- Return type
dict
- tmtoolkit.utils.set_logging_level(level)
Set logging level for tmtoolkit package default logging handler.
- Parameters
level (int) – minimum log level
- Return type
None
- tmtoolkit.utils.split_func_args(fn, args)
Split keyword arguments args so that all function arguments for fn are the first element of the returned tuple and the rest of the arguments are the second element of the returned tuple.
- Parameters
fn (Callable) – a function
args (Dict[str, Any]) – keyword arguments dict
- Returns
tuple with two dict elements: all arguments for fn are the first element, the rest of the arguments are the second element
- Return type
Tuple[Dict[str, Any], Dict[str, Any]]
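Argument splitting like this can be sketched with inspect.signature (a hypothetical re-implementation of the documented semantics; greet is an illustrative function, not part of the library):

```python
import inspect

def split_func_args(fn, args):
    # Keyword arguments accepted by fn go in the first dict, the rest in the second.
    fn_params = set(inspect.signature(fn).parameters)
    fn_args = {k: v for k, v in args.items() if k in fn_params}
    rest = {k: v for k, v in args.items() if k not in fn_params}
    return fn_args, rest

def greet(name, punct='!'):
    return name + punct

fn_args, rest = split_func_args(greet, {'name': 'hi', 'color': 'red'})
print(fn_args, rest)   # {'name': 'hi'} {'color': 'red'}
```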
- tmtoolkit.utils.unpickle_file(picklefile, **kwargs)
Load data from picklefile with Python’s pickle module.
Warning
Python pickle files may contain malicious code. You should only load pickle files from trusted sources.
- Parameters
picklefile (str) – either target file path as string or file handle
kwargs – further parameters passed to
pickle.load
- Returns
data stored in picklefile
- Return type
Any