Topic modeling

The topicmod module offers a wide range of tools to facilitate topic modeling with Python. This chapter introduces them step by step.

A quick note on terminology: So far, we have spoken of tokens, or sometimes terms, when we meant the individual elements that our documents consist of after text preprocessing such as tokenization has been applied to the raw input strings. These tokens can be lexicographically correct words, but they don’t have to be; e.g. after stemming you might have tokens like “argu” in your vocabulary. There may also be numbers or punctuation symbols in your vocabulary. For the topic modeling techniques that tmtoolkit supports, the results are always two probability distributions: a document-topic distribution and a topic-word distribution. Since the latter is called topic-word and not topic-token or topic-term distribution, we will also use the term word when we mean any token from the corpus’ vocabulary.

An example document-term matrix

tmtoolkit supports topic models that are computed from document-term matrices (DTMs). Just as in the previous chapter, we will first generate a DTM. However, this time the sample will be bigger:

[1]:
import random
random.seed(20191120)   # to make the sampling reproducible

import numpy as np
np.set_printoptions(precision=5)

from tmtoolkit.corpus import Corpus

# for topic modeling, the document sizes shouldn't be
# too different, hence we set a filter
corpus = Corpus.from_builtin_corpus('english-NewsArticles') \
    .filter_by_min_length(1000) \
    .filter_by_max_length(10000) \
    .sample(100)
corpus
[1]:
<Corpus [100 documents]>

We will now generate two DTMs, because we later want to show how you can compute topic models for two different DTMs in parallel. First, we do some general preprocessing:

[2]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus)
preproc.pos_tag() \
    .lemmatize() \
    .tokens_to_lowercase() \
    .remove_special_chars_in_tokens()
[2]:
<TMPreproc [100 documents]>

We first apply a more “relaxed” cleaning:

[3]:
preproc_bigger = preproc.copy() \
    .add_stopwords(['would', 'could', 'nt', 'mr', 'mrs', 'also']) \
    .clean_tokens(remove_shorter_than=2) \
    .remove_common_tokens(df_threshold=0.85) \
    .remove_uncommon_tokens(df_threshold=0.05)

preproc_bigger.n_docs, preproc_bigger.vocabulary_size
[3]:
(100, 866)

Another copy of preproc will apply more aggressive cleaning and hence will result in a smaller vocabulary size:

[4]:
preproc_smaller = preproc.copy() \
    .filter_for_pos('N') \
    .clean_tokens(remove_numbers=True, remove_shorter_than=2) \
    .remove_common_tokens(df_threshold=0.9) \
    .remove_uncommon_tokens(df_threshold=0.1)

del preproc

preproc_smaller.n_docs, preproc_smaller.vocabulary_size
[4]:
(100, 156)

We will create the document labels, vocabulary arrays and DTMs for both versions now:

[5]:
# doc_labels are the same for both

doc_labels = np.array(preproc_bigger.doc_labels)
doc_labels[:10]
[5]:
array(['NewsArticles-1041', 'NewsArticles-1065', 'NewsArticles-1099',
       'NewsArticles-1169', 'NewsArticles-1174', 'NewsArticles-1184',
       'NewsArticles-1189', 'NewsArticles-120', 'NewsArticles-1237',
       'NewsArticles-1282'], dtype='<U17')
[6]:
vocab_bg = np.array(preproc_bigger.vocabulary)
vocab_sm = np.array(preproc_smaller.vocabulary)
[7]:
dtm_bg = preproc_bigger.dtm
dtm_sm = preproc_smaller.dtm

del preproc_bigger, preproc_smaller  # don't need these any more

dtm_bg, dtm_sm
[7]:
(<100x866 sparse matrix of type '<class 'numpy.int32'>'
        with 10860 stored elements in Compressed Sparse Row format>,
 <100x156 sparse matrix of type '<class 'numpy.int32'>'
        with 2785 stored elements in Compressed Sparse Row format>)

We now have two sparse DTMs, dtm_bg (from the bigger preprocessed data) and dtm_sm (from the smaller preprocessed data), an array of document labels doc_labels that represents the rows of both DTMs, and the vocabulary arrays vocab_bg and vocab_sm that represent the columns of the respective DTMs. We will use this data for the remainder of the chapter.
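To make the relationship between document labels, vocabulary and DTM concrete, here is a minimal sketch that builds a tiny dense count matrix by hand. The toy documents and all variable names are made up for illustration and are not part of the sample corpus; a real DTM would be stored as a sparse matrix.

```python
import numpy as np

# Hypothetical, already tokenized toy documents (not from the sample corpus).
docs = {
    'doc_a': ['court', 'police', 'court'],
    'doc_b': ['china', 'company', 'china', 'market'],
    'doc_c': ['police', 'company'],
}

label_list = sorted(docs)
vocab_list = sorted({tok for toks in docs.values() for tok in toks})
col_index = {word: j for j, word in enumerate(vocab_list)}

toy_labels = np.array(label_list)   # one label per DTM row
toy_vocab = np.array(vocab_list)    # one word per DTM column

# Dense count matrix: entry (i, j) counts word j in document i.
toy_dtm = np.zeros((len(label_list), len(vocab_list)), dtype=np.int32)
for i, label in enumerate(label_list):
    for tok in docs[label]:
        toy_dtm[i, col_index[tok]] += 1

print(toy_vocab)
print(toy_dtm)
```

Each row sum of such a matrix is the length of the respective document, which is the quantity that the statistics later in this chapter use as document lengths.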

Computing topic models in parallel

tmtoolkit allows you to compute topic models in parallel, making use of all processor cores in your machine. Parallelization can be done per input DTM, per hyperparameter set, or as a combination of both. Hyperparameters control the number of topics and their “granularity”. We will later have a look at the role of hyperparameters and at how to find an optimal combination for a given dataset by means of topic model evaluation.

For now, we will concentrate on computing the topic models for our two DTMs in parallel. tmtoolkit supports three very popular packages for topic modeling, which do the actual work of computing the model from the input matrix. Each of them can be accessed through a separate sub-module of the topicmod module.

Each of these sub-modules offers at least two functions that work with the respective package: compute_models_parallel() for general parallel model computation and evaluate_topic_models() for parallel model computation and evaluation (discussed later). For now, we want to compute two models in parallel with the lda package and hence use compute_models_parallel() from topicmod.tm_lda.

We need to provide two things for this function: First, the input matrices as a dict that maps labels to the respective DTMs. Second, hyperparameters to use for the model computations. Note that each topic modeling package has different hyperparameters and you should refer to their documentation in order to find out which hyperparameters you need to provide. For lda, we set the number of topics n_topics to 10 and the number of iterations for the Gibbs sampling process n_iter to 1000. We always want to use the same hyperparameters, so we pass these as constant_parameters. If we wanted to create models for a whole range of parameters, e.g. for different numbers of topics, we could provide varying_parameters. We will check this out later when we evaluate topic models.

Note

For proper topic modeling, we shouldn’t just set the number of topics, but try to find it out via evaluation methods. We should also check if the algorithm converged using the provided likelihood estimations. We will do both later on, but now focus on compute_models_parallel().

[8]:
import logging
import warnings
from tmtoolkit.topicmod.tm_lda import compute_models_parallel

# suppress the "INFO" messages and warnings from lda
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())
logger.propagate = False

warnings.filterwarnings('ignore')

# set data to use
dtms = {
    'bigger': dtm_bg,
    'smaller': dtm_sm
}

# and fixed hyperparameters
lda_params = {
    'n_topics': 10,
    'n_iter': 1000,
    'random_state': 20191122  # to make results reproducible
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)
models
[8]:
defaultdict(list,
            {'smaller': [({'n_topics': 10,
                'n_iter': 1000,
                'random_state': 20191122},
               <lda.lda.LDA at 0x7f4251b0aef0>)],
             'bigger': [({'n_topics': 10,
                'n_iter': 1000,
                'random_state': 20191122},
               <lda.lda.LDA at 0x7f4251b0a828>)]})

As expected, two models were created. These can be accessed via the labels that we used in the dtms dict:

[9]:
models['smaller']
[9]:
[({'n_topics': 10, 'n_iter': 1000, 'random_state': 20191122},
  <lda.lda.LDA at 0x7f4251b0aef0>)]

We can see that for each input DTM, we get a list of 2-tuples. The first element in each tuple is a dict that represents the hyperparameters that were used to compute the model, the second element is the actual topic model (the <lda.lda.LDA ...> object). This structure looks a bit complex, but this is because it also supports varying parameters. Since we only have one fixed set of hyperparameters per DTM, we only have a list of length 1 for each DTM.
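The nested structure is easy to traverse with two loops. The following sketch uses a hand-written stand-in for the result, in which plain strings take the place of the fitted model objects; the names toy_models and summary are only used for this illustration.

```python
# Stand-in for the result structure described above: a dict that maps each
# DTM label to a list of (hyperparameters, model) 2-tuples.
# The strings are placeholders for fitted model objects.
toy_models = {
    'smaller': [({'n_topics': 10, 'n_iter': 1000}, '<model for smaller DTM>')],
    'bigger': [({'n_topics': 10, 'n_iter': 1000}, '<model for bigger DTM>')],
}

summary = []
for dtm_label, param_model_pairs in toy_models.items():
    # one 2-tuple per hyperparameter set; here there is only one per DTM
    for params, model in param_model_pairs:
        summary.append((dtm_label, params['n_topics'], model))

for row in summary:
    print(row)
```

With varying parameters, the inner loop would simply run more than once per DTM.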

We will now access the models and print the top words per topic by using print_ldamodel_topic_words():

[10]:
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words

model_sm = models['smaller'][0][1]
print_ldamodel_topic_words(model_sm.topic_word_, vocab_sm, top_n=3)
topic_1
> #1. child (0.100100)
> #2. state (0.087127)
> #3. police (0.076006)
topic_2
> #1. minister (0.076240)
> #2. deal (0.066211)
> #3. party (0.062199)
topic_3
> #1. russia (0.165184)
> #2. threat (0.073712)
> #3. february (0.066089)
topic_4
> #1. group (0.097418)
> #2. attack (0.064261)
> #3. force (0.058045)
topic_5
> #1. house (0.170026)
> #2. white (0.099919)
> #3. president (0.089403)
topic_6
> #1. trump (0.224533)
> #2. president (0.121897)
> #3. administration (0.075390)
topic_7
> #1. year (0.100786)
> #2. court (0.078077)
> #3. day (0.049691)
topic_8
> #1. us (0.358221)
> #2. united (0.093928)
> #3. states (0.076541)
topic_9
> #1. china (0.181912)
> #2. company (0.110824)
> #3. year (0.093052)
topic_10
> #1. people (0.143312)
> #2. government (0.080095)
> #3. health (0.063238)
[11]:
model_bg = models['bigger'][0][1]
print_ldamodel_topic_words(model_bg.topic_word_, vocab_bg, top_n=3)
topic_1
> #1. russia (0.057998)
> #2. vote (0.033534)
> #3. russian (0.032628)
topic_2
> #1. us (0.042852)
> #2. people (0.030392)
> #3. take (0.027657)
topic_3
> #1. year (0.057346)
> #2. first (0.023742)
> #3. last (0.022817)
topic_4
> #1. one (0.035330)
> #2. get (0.030228)
> #3. go (0.024732)
topic_5
> #1. death (0.049719)
> #2. court (0.041434)
> #3. police (0.031642)
topic_6
> #1. trump (0.089177)
> #2. president (0.074211)
> #3. house (0.059869)
topic_7
> #1. china (0.117126)
> #2. chinese (0.038374)
> #3. million (0.034335)
topic_8
> #1. north (0.092136)
> #2. south (0.063956)
> #3. two (0.031442)
topic_9
> #1. company (0.044429)
> #2. market (0.043636)
> #3. european (0.040463)
topic_10
> #1. child (0.036569)
> #2. state (0.029752)
> #3. tell (0.029133)

We could also generate models from different parameters in parallel, either for a single DTM or several. In the following example we generate models for a series of four different values for the alpha parameter. The parameters n_iter and n_topics are held constant across all models.

[12]:
var_params = [{'alpha': 1/(10**x)} for x in range(1, 5)]

const_params = {
    'n_iter': 500,
    'n_topics': 10,
    'random_state': 20191122  # to make results reproducible
}

models = compute_models_parallel(dtm_sm,  # smaller DTM
                                 varying_parameters=var_params,
                                 constant_parameters=const_params)
models
[12]:
[({'alpha': 0.001, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7f423661ffd0>),
 ({'alpha': 0.01, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7f4251af5eb8>),
 ({'alpha': 0.1, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7f4251af5e10>),
 ({'alpha': 0.0001, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7f4251af5d30>)]

We could compare these models now, e.g. by investigating their topics.

A more systematic approach on comparing and evaluating topic models, also in order to find a good set of hyperparameters for a given dataset, will be presented in the next section.

Evaluation of topic models

tmtoolkit provides several metrics for comparing and evaluating topic models. These can be used to find a good hyperparameter set for a given dataset, e.g. a good combination of the number of topics and the concentration parameters (often called alpha and beta in the literature). For some background on hyperparameters in topic modeling, see this blog post.

For each candidate hyperparameter set, a model can be generated and evaluated in parallel. We will do this now for the “big” DTM dtm_bg. Our candidate values for the number of topics k range from 20 to 120 in steps of 10. We let alpha, the concentration parameter of the prior over the document-specific topic distributions, depend on k by setting it to 1/k:

[13]:
var_params = [{'n_topics': k, 'alpha': 1/k} for k in range(20, 121, 10)]
var_params
[13]:
[{'n_topics': 20, 'alpha': 0.05},
 {'n_topics': 30, 'alpha': 0.03333333333333333},
 {'n_topics': 40, 'alpha': 0.025},
 {'n_topics': 50, 'alpha': 0.02},
 {'n_topics': 60, 'alpha': 0.016666666666666666},
 {'n_topics': 70, 'alpha': 0.014285714285714285},
 {'n_topics': 80, 'alpha': 0.0125},
 {'n_topics': 90, 'alpha': 0.011111111111111112},
 {'n_topics': 100, 'alpha': 0.01},
 {'n_topics': 110, 'alpha': 0.00909090909090909},
 {'n_topics': 120, 'alpha': 0.008333333333333333}]
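If more than one hyperparameter should vary independently, the same kind of list of dicts can be built from a Cartesian product. This is a sketch with made-up candidate values; it assumes that a hyperparameter listed as varying (here eta) is then not also passed as a constant parameter.

```python
from itertools import product

# Hypothetical candidate values; eta is varied here only for illustration.
candidate_k = [20, 40, 60]
candidate_eta = [0.1, 0.01]

# One dict per combination; alpha still depends on k as 1/k.
grid_params = [
    {'n_topics': k, 'alpha': 1 / k, 'eta': eta}
    for k, eta in product(candidate_k, candidate_eta)
]

print(len(grid_params))
print(grid_params[0])
```

Note that the number of models to compute grows multiplicatively with each additional varied hyperparameter.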

The heart of the model evaluation process is the function evaluate_topic_models(), which is available for all three topic modeling packages. We stick with lda and import that function from topicmod.tm_lda. It is similar to compute_models_parallel() as it accepts varying and constant hyperparameters. However, it doesn’t only compute the models in parallel, but also applies several metrics to these models in order to evaluate them. This can be controlled with the metric parameter that accepts a string or a list of strings that specify the used metric(s). These metrics refer to functions that are implemented in topicmod.evaluate.

Each topic modeling sub-module defines two important sequences: AVAILABLE_METRICS and DEFAULT_METRICS. The former lists all available metrics for that sub-module, the latter lists the default metrics that are used when you don’t specify anything with the metric parameter. Let’s have a look at both sequences in topicmod.tm_lda:

[14]:
from tmtoolkit.topicmod import tm_lda

tm_lda.AVAILABLE_METRICS
[14]:
('loglikelihood',
 'cao_juan_2009',
 'arun_2010',
 'coherence_mimno_2011',
 'griffiths_2004',
 'held_out_documents_wallach09',
 'coherence_gensim_u_mass',
 'coherence_gensim_c_v',
 'coherence_gensim_c_uci',
 'coherence_gensim_c_npmi')
[15]:
tm_lda.DEFAULT_METRICS
[15]:
('cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

For details about the metrics and the academic references, see the respective implementations in the topicmod.evaluate module.

We will now run the model evaluations with evaluate_topic_models() using our previously generated list of varying hyperparameters var_params, some constant hyperparameters and the default set of metrics. We also set return_models=True, which retains the generated models in the evaluation results:

[16]:
from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
from tmtoolkit.topicmod.evaluate import results_by_parameter

const_params = {
    'n_iter': 1000,
    'eta': 0.1,       # "eta" aka "beta"
    'random_state': 20191122  # to make results reproducible
}

eval_results = evaluate_topic_models(dtm_bg,
                                     varying_parameters=var_params,
                                     constant_parameters=const_params,
                                     return_models=True)
eval_results[:3]  # only show first three models
[16]:
[({'n_topics': 20,
   'alpha': 0.05,
   'n_iter': 1000,
   'eta': 0.1,
   'random_state': 20191122},
  {'model': <lda.lda.LDA at 0x7f4251af5cf8>,
   'cao_juan_2009': 0.11481331353994903,
   'arun_2010': 10.814407714872376,
   'coherence_mimno_2011': -1.5705955029901921}),
 ({'n_topics': 30,
   'alpha': 0.03333333333333333,
   'n_iter': 1000,
   'eta': 0.1,
   'random_state': 20191122},
  {'model': <lda.lda.LDA at 0x7f422b9d24a8>,
   'cao_juan_2009': 0.11299796966131251,
   'arun_2010': 6.3501538041085475,
   'coherence_mimno_2011': -1.5797425879990783}),
 ({'n_topics': 40,
   'alpha': 0.025,
   'n_iter': 1000,
   'eta': 0.1,
   'random_state': 20191122},
  {'model': <lda.lda.LDA at 0x7f422b9bf128>,
   'cao_juan_2009': 0.11013624472342246,
   'arun_2010': 4.853368419225177,
   'coherence_mimno_2011': -1.6514230985163838})]

The evaluation results are a list with pairs of hyperparameters and their evaluation results for each metric. Additionally, there is the generated model for each hyperparameter set.

We now use results_by_parameter(), which takes the “raw” evaluation results and sorts them by a specific hyperparameter, in this case n_topics. This is important because this is the way that the function for visualizing evaluation results, plot_eval_results(), expects the input.

[17]:
eval_results_by_topics = results_by_parameter(eval_results, 'n_topics')
eval_results_by_topics[:3]  # again only the first three models
[17]:
[(20,
  {'model': <lda.lda.LDA at 0x7f4251af5cf8>,
   'cao_juan_2009': 0.11481331353994903,
   'arun_2010': 10.814407714872376,
   'coherence_mimno_2011': -1.5705955029901921}),
 (30,
  {'model': <lda.lda.LDA at 0x7f422b9d24a8>,
   'cao_juan_2009': 0.11299796966131251,
   'arun_2010': 6.3501538041085475,
   'coherence_mimno_2011': -1.5797425879990783}),
 (40,
  {'model': <lda.lda.LDA at 0x7f422b9bf128>,
   'cao_juan_2009': 0.11013624472342246,
   'arun_2010': 4.853368419225177,
   'coherence_mimno_2011': -1.6514230985163838})]
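Conceptually, this transformation replaces each hyperparameter dict by the value of one selected hyperparameter and sorts by that value. The following is a plain-Python sketch of that idea, not tmtoolkit's implementation; by_parameter() is a hypothetical helper and the metric values are rounded from the output above.

```python
def by_parameter(results, param):
    """Reduce (params, metrics) pairs to (params[param], metrics), sorted by value."""
    pairs = [(params[param], metrics) for params, metrics in results]
    return sorted(pairs, key=lambda pair: pair[0])


# Toy evaluation results in arbitrary order (values rounded from above).
toy_results = [
    ({'n_topics': 40, 'alpha': 0.025}, {'arun_2010': 4.85}),
    ({'n_topics': 20, 'alpha': 0.05}, {'arun_2010': 10.81}),
    ({'n_topics': 30, 'alpha': 1 / 30}, {'arun_2010': 6.35}),
]

by_k = by_parameter(toy_results, 'n_topics')
print(by_k)
```

The sorted values of the selected hyperparameter then serve as the x-axis when the metrics are plotted.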

We can now see the results for each metric across the specified range of number of topics using plot_eval_results():

[18]:
from tmtoolkit.topicmod.visualize import plot_eval_results

plot_eval_results(eval_results_by_topics);
_images/topic_modeling_32_0.png

These results suggest setting the number of topics, n_topics, to 50 and alpha to 0.02. We don’t have to generate a model with these hyperparameters again, because it’s already in the evaluation results (thanks to return_models=True). We extract the model from there in order to use it in the rest of the chapter.

[19]:
best_tm = [m for k, m in eval_results_by_topics if k == 50][0]['model']
best_tm.n_topics, best_tm.alpha, best_tm.eta  # just to make sure
[19]:
(50, 0.02, 0.1)

Common statistics and tools for topic models

The topicmod.model_stats module mostly contains functions that compute statistics from the document-topic and topic-word distribution of a topic model and also some helper functions for working with such distributions. We’ll start with an important helper function, generate_topic_labels_from_top_words().

Generating labels for topics

In topic modeling, topics are numbered because they’re abstract – they’re simply a probability distribution across all words in the vocabulary. Still, it’s useful to give them labels for better identification. The function generate_topic_labels_from_top_words() is very useful for that, as it finds labels according to the most “relevant” words in each topic. We’ll later see how we can identify the most relevant words per topic using a special relevance statistic. Note that you can adjust the weight of the relevance measure for the ranking by using the parameter lambda_ which is in range \([0, 1]\).

The function requires at least the topic-word and document-topic distributions from the model, the document lengths and the vocabulary. It then finds the minimum number of relevant words that uniquely label each topic. You can also use a fixed number for that minimum number with the parameter n_words.
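The idea of finding the minimum number of words that makes all labels unique can be sketched as follows. This is only an illustration under simplifying assumptions: labels_from_top_words() is a hypothetical helper, it ranks words by an arbitrary score matrix instead of the relevance statistic, and all numbers are made up.

```python
import numpy as np


def labels_from_top_words(scores, vocab, max_words=5):
    """Join the top-n words per row, using the smallest n that gives unique labels."""
    ranked = np.argsort(-scores, axis=1)   # best-scoring words first
    for n in range(1, max_words + 1):
        labels = ['_'.join(vocab[row[:n]]) for row in ranked]
        if len(set(labels)) == len(labels):
            break   # all labels are unique with n words
    # prefix each label with a one-based topic number
    return [f'{i}_{label}' for i, label in enumerate(labels, start=1)]


toy_vocab = np.array(['china', 'president', 'trump', 'year'])
toy_scores = np.array([[0.05, 0.30, 0.50, 0.15],
                       [0.40, 0.10, 0.45, 0.05],
                       [0.20, 0.10, 0.05, 0.65]])

toy_labels = labels_from_top_words(toy_scores, toy_vocab)
print(toy_labels)
```

In this toy example one word is not enough, because the first two topics share the same top word, so two words per label are used.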

[20]:
from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words

doc_lengths_bg = doc_lengths(dtm_bg)
topic_labels = generate_topic_labels_from_top_words(
    best_tm.topic_word_,
    best_tm.doc_topic_,
    doc_lengths_bg,
    vocab_bg,
    lambda_=0.6
)

topic_labels[:10]   # showing only the first 10 topics here
[20]:
array(['1_record_rock', '2_referendum_vote', '3_car_reportedly',
       '4_help_ability', '5_air_force', '6_find_drug', '7_food_safety',
       '8_order_enter', '9_capacity_million', '10_recall_vehicle'],
      dtype='<U24')

As we can see, two words are necessary to label each topic uniquely. By default, each label is prefixed with a number. You can change that with the parameter labels_format.

Let’s have a look at the top words for a specific topic. For that, we can use ldamodel_top_topic_words() from the module topicmod.model_io, which we will have a closer look at later:

[21]:
from tmtoolkit.topicmod.model_io import ldamodel_top_topic_words

top_topic_word = ldamodel_top_topic_words(best_tm.topic_word_,
                                          vocab_bg,
                                          row_labels=topic_labels)
top_topic_word[top_topic_word.index == '10_recall_vehicle']
[21]:
rank_1 rank_2 rank_3 rank_4 rank_5 rank_6 rank_7 rank_8 rank_9 rank_10
topic
10_recall_vehicle recall (0.07373) vehicle (0.05172) car (0.03338) 2015 (0.02605) cause (0.02605) company (0.02605) include (0.02605) 2014 (0.02605) fire (0.01871) level (0.01504)

Marginal topic and word distributions

We’ll now focus on the marginal topic and word distributions. Let’s get the marginal topic distribution first by using marginal_topic_distrib():

[22]:
from tmtoolkit.topicmod.model_stats import marginal_topic_distrib

marg_topic = marginal_topic_distrib(best_tm.doc_topic_, doc_lengths_bg)
marg_topic
[22]:
array([0.00566, 0.02144, 0.01269, 0.00993, 0.00775, 0.01187, 0.01217,
       0.01248, 0.02731, 0.01021, 0.01079, 0.01756, 0.01002, 0.00871,
       0.01896, 0.02267, 0.02212, 0.01184, 0.01375, 0.0095 , 0.02533,
       0.0233 , 0.02307, 0.00638, 0.04622, 0.02352, 0.05468, 0.01701,
       0.03019, 0.01418, 0.02923, 0.02074, 0.03795, 0.01413, 0.02126,
       0.02138, 0.00654, 0.00981, 0.01801, 0.01419, 0.02214, 0.02156,
       0.03424, 0.02314, 0.06307, 0.01903, 0.01363, 0.01762, 0.01606,
       0.03497])

The marginal topic distribution can be interpreted as the “importance” of each topic for the whole corpus. Let’s get the sorted indices into topic_labels with np.argsort() and get the top five topics:

[23]:
# np.argsort() gives ascending order, hence reverse via [::-1]
topic_labels[np.argsort(marg_topic)[::-1][:5]]
[23]:
array(['45_get_go', '27_china_chinese', '25_trump_president',
       '33_kill_group', '50_year_first'], dtype='<U24')

Likewise, we can get the marginal word distribution with marginal_word_distrib() from the model’s topic-word distribution and the marginal topic distribution. We’ll use this to list the most probable words for the corpus. As expected, these are mostly quite common words:

[24]:
from tmtoolkit.topicmod.model_stats import marginal_word_distrib

marg_word = marginal_word_distrib(best_tm.topic_word_, marg_topic)
vocab_bg[np.argsort(marg_word)[::-1][:10]]
[24]:
array(['year', 'china', 'us', 'trump', 'people', 'president', 'one',
       'country', 'company', 'new'], dtype='<U15')

Two helper functions exist for this purpose: most_probable_words() and least_probable_words() sort the vocabulary according to the marginal probability:

[25]:
from tmtoolkit.topicmod.model_stats import most_probable_words, least_probable_words

most_probable_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[25]:
array(['year', 'china', 'us', 'trump', 'people', 'president', 'one',
       'country', 'company', 'new'], dtype='<U15')
[26]:
least_probable_words(vocab_bg, best_tm.topic_word_,
                     best_tm.doc_topic_, doc_lengths_bg,
                     n=10)
[26]:
array(['17', 'implement', 'reject', 'immediately', 'representative',
       'attention', 'highly', 'responsibility', 'quarter', 'ongoing'],
      dtype='<U15')
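To see what the marginal distributions in this section express, here is a small NumPy sketch on a made-up model with three documents, two topics and four words. It assumes that documents are weighted by their length; the variable names are chosen for this illustration only.

```python
import numpy as np

# Toy model: 3 documents, 2 topics, 4 words (all numbers are made up).
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5]])
topic_word = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
doc_len = np.array([100, 300, 100])

# P(T) = sum_d P(T|d) * P(d), with P(d) proportional to the document length
p_doc = doc_len / doc_len.sum()
p_topic = doc_topic.T @ p_doc

# P(w) = sum_T P(w|T) * P(T)
p_word = topic_word.T @ p_topic

print(p_topic)
print(p_word)
```

The long second document pulls the marginal topic distribution towards the second topic, which in turn raises the marginal probability of the words that are probable in that topic.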

Word distinctiveness and saliency

Word distinctiveness and saliency (see below) help to identify the most “informative” words in a corpus given its topic model. Both measures are introduced in Chuang et al. 2012.

Word distinctiveness is calculated for each word \(w\) as

\(\text{distinctiveness}(w) = \sum_T(P(T|w) \log \frac{P(T|w)}{P(T)})\).

where \(P(T)\) is the marginal topic distribution and \(P(T|w)\) is the probability of a topic given a word \(w\).
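The formula can be traced with a few lines of NumPy. The sketch below uses a made-up topic-word distribution and marginal topic distribution and assumes strictly positive probabilities, so that no logarithm of zero occurs; \(P(T|w)\) is obtained via Bayes' rule.

```python
import numpy as np

# Made-up model: 2 topics, 4 words.
topic_word = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
p_topic = np.array([0.4, 0.6])

# marginal word distribution P(w)
p_word = topic_word.T @ p_topic

# Bayes' rule: P(T|w) = P(w|T) * P(T) / P(w); shape (n_topics, n_words)
p_topic_given_word = topic_word * p_topic[:, np.newaxis] / p_word

# sum over topics of P(T|w) * log(P(T|w) / P(T)), one value per word
distinct = (p_topic_given_word
            * np.log(p_topic_given_word / p_topic[:, np.newaxis])).sum(axis=0)

print(distinct)
```

The second word is equally probable in both topics, so knowing it tells us nothing about the topic and its distinctiveness is (numerically) zero. The first word is concentrated in the less prevalent topic and gets the highest value.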

We can calculate this measure using word_distinctiveness(). To use this measure directly to rank words, we can use most_distinct_words() and least_distinct_words():

[27]:
from tmtoolkit.topicmod.model_stats import word_distinctiveness, \
    most_distinct_words, least_distinct_words

word_distinct = word_distinctiveness(best_tm.topic_word_, marg_topic)
word_distinct[:10]   # first 10 words in vocab
[27]:
array([0.919  , 0.83647, 1.42262, 0.91743, 1.30967, 0.83061, 1.04771,
       1.626  , 1.31854, 1.14434])
[28]:
most_distinct_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[28]:
array(['food', 'criminal', 'recall', 'son', 'safety', 'facebook',
       'protest', 'vehicle', 'record', 'north'], dtype='<U15')
[29]:
least_distinct_words(vocab_bg, best_tm.topic_word_,
                     best_tm.doc_topic_, doc_lengths_bg,
                     n=10)
[29]:
array(['participate', '24', 'fun', 'room', 'central', 'chinadailycomcn',
       'single', '40', 'chairman', 'transfer'], dtype='<U15')

Word saliency weights each word’s distinctiveness by its marginal probability \(P(w)\):

\(\text{saliency}(w) = P(w) \cdot \text{distinctiveness}(w)\).
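A tiny example with made-up numbers shows how this weighting changes the ranking: a rare but very distinctive word can lose its top position to a frequent word that is only moderately distinctive.

```python
import numpy as np

toy_vocab = np.array(['year', 'recall', 'china', 'one'])
p_word = np.array([0.40, 0.05, 0.30, 0.25])   # made-up P(w)
distinct = np.array([0.2, 1.5, 0.6, 0.1])     # made-up distinctiveness

saliency = p_word * distinct

# np.argsort() gives ascending order, hence reverse via [::-1]
by_distinct = toy_vocab[np.argsort(distinct)[::-1]]
by_saliency = toy_vocab[np.argsort(saliency)[::-1]]

print(by_distinct)
print(by_saliency)
```

Here “recall” is the most distinctive word, but “china” is the most salient one because it is much more frequent.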

The respective functions in tmtoolkit are word_saliency(), most_salient_words() and least_salient_words():

[30]:
from tmtoolkit.topicmod.model_stats import word_saliency, \
    most_salient_words, least_salient_words

word_sal = word_saliency(best_tm.topic_word_, best_tm.doc_topic_, doc_lengths_bg)
word_sal[:10]   # first 10 words in vocab
[30]:
array([0.00079, 0.00084, 0.00078, 0.00052, 0.00083, 0.00048, 0.00081,
       0.0008 , 0.00059, 0.00142])
[31]:
most_salient_words(vocab_bg, best_tm.topic_word_,
                   best_tm.doc_topic_, doc_lengths_bg,
                   n=10)
[31]:
array(['china', 'trump', 'us', 'north', 'president', 'year', 'company',
       'death', 'people', 'house'], dtype='<U15')
[32]:
least_salient_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[32]:
array(['participate', 'fun', 'chinadailycomcn', 'central', 'piece', '24',
       'route', 'section', 'mission', 'education'], dtype='<U15')

Topic-word relevance

The topic-word relevance measure as introduced by Sievert and Shirley 2014 helps to identify the most relevant words within a topic by also accounting for the marginal probability of each word across the corpus. This is done by integrating a lift value, which is the “ratio of a term’s probability within a topic to its marginal probability across the corpus.” (ibid.)

Thus for each word \(w\), given a topic-word distribution \(\phi\), a topic \(t\) and a weight parameter \(\lambda\), it is calculated as:

\(\text{relevance}_{\phi, \lambda}(w, t) = \lambda \log \phi_{t,w} + (1-\lambda) \log \frac{\phi_{t,w}}{p(w)}\).

The first term \(\log \phi_{t,w}\) is the log of the topic-word distribution, the second term \(\log \frac{\phi_{t,w}}{p(w)}\) is the log lift, and \(\lambda\) controls the weight between both terms. The lower \(\lambda\), the more weight is put on the lift term, i.e. the more the results differ from the original topic-word distribution.
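The effect of \(\lambda\) can be checked with a direct NumPy transcription of the formula. This is a sketch with made-up numbers and a hypothetical helper relevance(); it assumes strictly positive probabilities.

```python
import numpy as np


def relevance(topic_word, p_word, lam):
    """lam * log(phi) + (1 - lam) * log(phi / p(w)) for all topics and words."""
    log_phi = np.log(topic_word)
    log_lift = np.log(topic_word / p_word)
    return lam * log_phi + (1 - lam) * log_lift


# Made-up model: 2 topics, 4 words, with the matching marginal P(w).
topic_word = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
p_word = np.array([0.34, 0.10, 0.16, 0.40])

rel = relevance(topic_word, p_word, lam=0.6)
print(rel)
```

In the second topic, the first two words have the same probability of 0.1. Since the first word is much more common across the corpus, its lift is lower and so is its relevance. With lam=1, the result is just the log of the topic-word distribution.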

This measure is implemented in topic_word_relevance(). It returns a matrix of the same shape as the topic-word distribution, i.e. each row represents a topic with a (log-transformed) distribution across all words in the vocabulary. Please note that the lambda parameter ends with an underscore: lambda_.

[33]:
from tmtoolkit.topicmod.model_stats import topic_word_relevance

topic_word_rel = topic_word_relevance(best_tm.topic_word_, best_tm.doc_topic_,
                                      doc_lengths_bg, lambda_=0.6)
topic_word_rel
[33]:
array([[-4.72165, -4.78791, -4.54399, ..., -4.75864, -4.83761, -4.55602],
       [-5.64969, -5.71594, -5.47202, ..., -5.68667, -5.76565, -5.48406],
       [-5.24067, -5.30693, -5.06301, ..., -5.27766, -5.35663, -5.07504],
       ...,
       [-3.09158, -5.55573, -5.31181, ..., -5.52646, -5.60543, -5.32384],
       [-5.41848, -5.48474, -5.24082, ..., -5.45547, -5.53444, -5.25285],
       [-6.06786, -2.70013, -5.8902 , ..., -6.10484, -6.18382, -2.18866]])

To confirm that it’s 50 topics across all words in the vocabulary:

[34]:
topic_word_rel.shape
[34]:
(50, 866)

Two functions can be used to get the most or least relevant words for a topic: most_relevant_words_for_topic() and least_relevant_words_for_topic(). You can select the topic with the topic parameter which is a zero-based topic index.

We’ll do it for the topic with index 9, which is:

[35]:
topic_labels[9]
[35]:
'10_recall_vehicle'
[36]:
from tmtoolkit.topicmod.model_stats import most_relevant_words_for_topic, \
    least_relevant_words_for_topic

most_relevant_words_for_topic(vocab_bg, topic_word_rel, topic=9, n=10)
[36]:
array(['recall', 'vehicle', 'car', '2014', 'cause', 'fire', '2015', '29',
       'include', 'hit'], dtype='<U15')
[37]:
least_relevant_words_for_topic(vocab_bg, topic_word_rel, topic=9, n=10)
[37]:
array(['year', 'china', 'trump', 'people', 'president', 'one', 'country',
       'new', 'house', 'two'], dtype='<U15')

Topic coherence

We already used the coherence metric (Mimno et al. 2011) for topic model evaluation. However, this metric can be used not only to assess the overall quality of a topic model, but also to evaluate the coherence of individual topics.
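The intuition behind this metric is that the top words of a coherent topic tend to occur together in the same documents. The following sketch computes the basic score from Mimno et al. 2011 for a single topic, i.e. the sum of \(\log \frac{D(v_m, v_l) + 1}{D(v_l)}\) over all pairs of top words, where \(D\) counts the documents that contain a word or a pair of words. The helper umass_coherence() and the toy data are made up, every top word is assumed to occur in at least one document, and tmtoolkit's function may normalize or otherwise differ, so the values are not comparable with the values shown in this chapter.

```python
import numpy as np


def umass_coherence(topic_word_row, dtm, top_n):
    """Sum of log((D(v_m, v_l) + 1) / D(v_l)) over pairs of the top_n words."""
    top = np.argsort(-topic_word_row)[:top_n]   # top words, most probable first
    present = (dtm[:, top] > 0).astype(int)     # does a document contain a word?
    doc_freq = present.sum(axis=0)              # D(v)
    co_doc_freq = present.T @ present           # D(v, v')
    score = 0.0
    for m in range(1, top_n):
        for l in range(m):
            score += np.log((co_doc_freq[m, l] + 1) / doc_freq[l])
    return score


# Toy DTM with 4 documents and 4 words (made-up counts).
toy_dtm = np.array([[2, 1, 0, 0],
                    [1, 3, 0, 0],
                    [0, 0, 1, 2],
                    [1, 0, 0, 1]])

# The top two words of the first topic co-occur; those of the second never do.
c_good = umass_coherence(np.array([0.5, 0.4, 0.06, 0.04]), toy_dtm, top_n=2)
c_bad = umass_coherence(np.array([0.04, 0.5, 0.4, 0.06]), toy_dtm, top_n=2)
print(c_good, c_bad)
```

Values closer to zero indicate higher coherence, which is why the topic whose top words share documents scores better.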

[38]:
from tmtoolkit.topicmod.evaluate import metric_coherence_mimno_2011

# use top 20 words per topic for metric
coh = metric_coherence_mimno_2011(best_tm.topic_word_, dtm_bg, top_n=20)
coh
[38]:
array([-2.56315, -1.28152, -2.14267, -2.31874, -1.87702, -1.30562,
       -1.26685, -2.01535, -1.41837, -1.50476, -2.04062, -1.48196,
       -2.71536, -1.58692, -1.87578, -1.27845, -1.34391, -1.43587,
       -1.49466, -1.38354, -1.8825 , -1.21781, -1.5731 , -2.29883,
       -1.05245, -1.28351, -0.84887, -1.40302, -1.55985, -1.52426,
       -1.16612, -1.11992, -1.04331, -2.00359, -1.13631, -1.21529,
       -1.67817, -2.38309, -1.70165, -2.83715, -2.03771, -1.45466,
       -0.93653, -1.2439 , -0.88786, -1.52184, -2.96194, -1.66745,
       -1.40139, -1.65171])

This generates a coherence value for each topic. Let’s show the distribution of these values:

[39]:
import matplotlib.pyplot as plt

plt.hist(coh, bins=20)
plt.xlabel('coherence')
plt.ylabel('n')
plt.show();
_images/topic_modeling_68_0.png

And print the best and worst topics according to this metric:

[40]:
# np was already imported at the beginning of the chapter
top5_t_indices = np.argsort(coh)[::-1][:5]
bottom5_t_indices = np.argsort(coh)[:5]

topic_labels[top5_t_indices]
[40]:
array(['27_china_chinese', '45_get_go', '43_company_manufacturing',
       '33_kill_group', '25_trump_president'], dtype='<U24')
[41]:
topic_labels[bottom5_t_indices]
[41]:
array(['47_russia_border', '40_father_new', '13_air_commission',
       '1_record_rock', '38_rule_concern'], dtype='<U24')

Note that this metric does not replace careful manual evaluation, because it can be off for some topics. For example, topic 45_get_go is certainly not a coherent topic, as it mostly ranks common words high.

More coherence metrics can be used with the function metric_coherence_gensim(). This requires that gensim is installed. Furthermore, most metrics require that a parameter texts is passed, which is the tokenized text that was used to create the document-term matrix.

Filtering topics

With the function filter_topics(), you can filter the topics according to their topic-word distribution and the following search criteria:

  • search_pattern: one or more search patterns according to the common parameters for pattern matching

  • top_n: pattern match(es) must occur in the first top_n most probable words in the topic

  • thresh: matched words’ probability must be above this threshold

You must specify at least one of top_n and thresh, but you can also specify both. The function returns an array of topic indices (which start with zero!).
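The top_n criterion can be pictured as follows. This is a simplified sketch of the idea and not tmtoolkit's implementation: topics_matching() is a hypothetical helper that only supports glob patterns via the standard library's fnmatch module, and the toy model is made up.

```python
from fnmatch import fnmatchcase

import numpy as np


def topics_matching(patterns, vocab, topic_word, top_n, require_all=False):
    """Indices of topics whose top_n words match any (or all) glob patterns."""
    found = []
    for t, row in enumerate(topic_word):
        # the top_n most probable words of this topic
        top_words = vocab[np.argsort(-row)[:top_n]].tolist()
        hits = [any(fnmatchcase(w, p) for w in top_words) for p in patterns]
        matched = all(hits) if require_all else any(hits)
        if matched:
            found.append(t)
    return np.array(found, dtype=int)


toy_vocab = np.array(['china', 'russia', 'russian', 'trump', 'vote'])
toy_topic_word = np.array([[0.05, 0.40, 0.30, 0.05, 0.20],
                           [0.10, 0.05, 0.05, 0.60, 0.20],
                           [0.60, 0.05, 0.05, 0.10, 0.20],
                           [0.02, 0.38, 0.05, 0.40, 0.15]])

found_any = topics_matching(['trump', 'russia*'], toy_vocab, toy_topic_word, top_n=2)
found_all = topics_matching(['trump', 'russia*'], toy_vocab, toy_topic_word, top_n=2,
                            require_all=True)
print(found_any, found_all)
```

As with filter_topics(), the returned indices are zero-based and can be used to index an array of topic labels.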

Let’s find all topics that have the word “trump” in the top 5 most probable words:

[42]:
from tmtoolkit.topicmod.model_stats import filter_topics

found_topics = filter_topics('trump', vocab_bg,
                             best_tm.topic_word_, top_n=5)
found_topics
[42]:
array([11, 20, 24])

We can use these indices with our topic_labels:

[43]:
topic_labels[found_topics]
[43]:
array(['12_ryan_democrats', '21_day_share', '25_trump_president'],
      dtype='<U24')

Next, we want to select all topics where any of the words matched by the glob patterns (match_type='glob') "trump" or "russia*" has a probability of at least 0.01 (thresh=0.01):

[44]:
found_topics = filter_topics(['trump', 'russia*'], vocab_bg,
                             best_tm.topic_word_, thresh=0.01, match_type='glob')
topic_labels[found_topics]
[44]:
array(['12_ryan_democrats', '14_protest_young', '21_day_share',
       '24_criminal_domestic', '25_trump_president', '39_russia_moscow',
       '47_russia_border'], dtype='<U24')

When we specify cond='all', all patterns must have at least one match (here in the top 50 list of words per topic):

[45]:
found_topics = filter_topics(['trump', 'russia*'], vocab_bg,
                             best_tm.topic_word_, top_n=50, match_type='glob',
                             cond='all')
topic_labels[found_topics]
[45]:
array(['12_ryan_democrats'], dtype='<U24')

You could also pass a topic-word relevance matrix instead of a topic-word probability distribution.

[46]:
found_topics = filter_topics('trump', vocab_bg,
                             topic_word_rel, top_n=5)
topic_labels[found_topics]
[46]:
array(['12_ryan_democrats', '25_trump_president'], dtype='<U24')

Excluding topics

It is often the case that some topics of a topic model rank a lot of uninformative (e.g. very common) words the highest. This results in some uninformative topics, which you may want to exclude from further analysis. Note that if a large fraction of topics seems uninformative, it points to a problem with your topic model and/or your preprocessing steps. You should evaluate your candidate models carefully with the mentioned metrics and/or adjust your text preprocessing pipeline.

The function exclude_topics() allows you to remove a specified set of topics from the document-topic and topic-word distributions. You need to pass the zero-based indices of the topics that you want to remove, and both distributions.

In this example, I identified the following topics as uninformative (by looking at the top ranked words either by topic-word distribution or topic-word relevance):

[47]:
uninform_topics = [19, 27, 30, 44, 49]
topic_labels[uninform_topics]
[47]:
array(['20_son_site', '28_office_man', '31_support_time', '45_get_go',
       '50_year_first'], dtype='<U24')

We can now pass these indices to exclude_topics() along with the topic model distributions. We’ll get back new, filtered distributions.

[48]:
from tmtoolkit.topicmod.model_stats import exclude_topics

new_doc_topic, new_topic_word, new_topic_mapping = \
    exclude_topics(uninform_topics, best_tm.doc_topic_,
                best_tm.topic_word_, return_new_topic_mapping=True)
new_doc_topic.shape, new_topic_word.shape
[48]:
((100, 45), (45, 866))

We can see in the new distributions’ shapes that we now have 45 instead of 50 topics, because we removed five of them. We shouldn’t forget to also update the topic labels and remove the unwanted topics:
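The effect on the two matrices can be sketched with plain NumPy on toy data. This is only an illustration under stated assumptions, not tmtoolkit's code: the toy values are made up, the renormalization of the remaining topic shares per document is an assumption, and the dict new_to_old merely illustrates what a mapping from new to original topic indices could look like.

```python
import numpy as np

# toy distributions: 3 documents, 4 topics, 5 words (hypothetical values)
doc_topic = np.array([[0.10, 0.20, 0.30, 0.40],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.70, 0.10, 0.10, 0.10]])
topic_word = np.full((4, 5), 0.2)
excl = [1, 3]   # zero-based indices of the topics to drop

new_doc_topic = np.delete(doc_topic, excl, axis=1)     # topics are columns here
new_topic_word = np.delete(topic_word, excl, axis=0)   # topics are rows here

# assumption: rescale each document's remaining topic shares to sum to 1
new_doc_topic = new_doc_topic / new_doc_topic.sum(axis=1, keepdims=True)

# illustrative mapping from new topic index to original topic index
kept = [t for t in range(doc_topic.shape[1]) if t not in excl]
new_to_old = dict(enumerate(kept))

print(new_doc_topic.shape, new_topic_word.shape)
print(new_to_old)
```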

[49]:
new_topic_labels = np.delete(topic_labels, uninform_topics)
new_topic_labels
[49]:
array(['1_record_rock', '2_referendum_vote', '3_car_reportedly',
       '4_help_ability', '5_air_force', '6_find_drug', '7_food_safety',
       '8_order_enter', '9_capacity_million', '10_recall_vehicle',
       '11_season_third', '12_ryan_democrats', '13_air_commission',
       '14_protest_young', '15_campaign_news', '16_death_murder',
       '17_north_action', '18_facebook_review', '19_north_woman',
       '21_day_share', '22_south_visit', '23_year_energy',
       '24_criminal_domestic', '25_trump_president', '26_child_home',
       '27_china_chinese', '29_house_white', '30_opposition_bank',
       '32_police_arrest', '33_kill_group', '34_president_security',
       '35_election_party', '36_eu_uk', '37_board_solution',
       '38_rule_concern', '39_russia_moscow', '40_father_new',
       '41_water_per', '42_water_people', '43_company_manufacturing',
       '44_product_market', '46_court_case', '47_russia_border',
       '48_growth_tax', '49_hospital_care'], dtype='<U24')

Displaying and exporting topic modeling results

The topicmod.model_io module provides several functions for displaying and exporting topic modeling results, i.e. results derived from the document-topic and topic-word distribution of a given topic model.

We already used ldamodel_top_topic_words() briefly, which generates a dataframe with the top words from a topic-word distribution. You can also use the topic-word relevance matrix instead. With top_n we can control the number of top words:

[50]:
# using relevance matrix here and showing only the first 3 topics
ldamodel_top_topic_words(topic_word_rel, vocab_bg, top_n=5)[:3]
[50]:
rank_1 rank_2 rank_3 rank_4 rank_5
topic
topic_1 record (-0.2566) rock (-0.6573) best (-0.7433) list (-1.149) mike (-1.162)
topic_2 referendum (-0.4418) vote (-0.4827) government (-1.178) street (-1.2) next (-1.286)
topic_3 car (-0.8822) reportedly (-0.9954) white (-1.008) vehicle (-1.08) individual (-1.198)

Note that the values in parentheses are the corresponding values from the matrix for that word in that topic. They’re negative because of the log transformation that is applied in the topic-word relevance measure.
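The role of the logarithm can be seen in a small sketch of the relevance measure as defined by Sievert and Shirley (2014): lambda * log(phi) + (1 - lambda) * log(phi / p_w), where phi is the topic-word probability and p_w the marginal probability of the word. The helper relevance_sketch, the toy values and the default lambda_=0.6 are assumptions made for this illustration and not tmtoolkit's implementation. Since the logarithm of a probability below 1 is negative, the first term is never positive.

```python
import numpy as np

def relevance_sketch(topic_word, word_prob, lambda_=0.6):
    """lambda * log(phi) + (1 - lambda) * log(phi / p_w) for every topic-word pair."""
    topic_word = np.asarray(topic_word, dtype=float)
    word_prob = np.asarray(word_prob, dtype=float)
    return (lambda_ * np.log(topic_word)
            + (1 - lambda_) * np.log(topic_word / word_prob))

# toy example: 2 topics, 3 words; each row is a probability distribution
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
p_w = np.array([0.40, 0.25, 0.35])   # hypothetical marginal word probabilities

rel = relevance_sketch(phi, p_w)
print(rel.round(4))
```

With lambda_=1 the measure reduces to the log of the topic-word probability, and with lambda_=0 it ranks words purely by how much more probable they are in the topic than in the corpus overall.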

A similar function can be used for the document-topic distribution: ldamodel_top_doc_topics(). Here, top_n controls the number of top-ranked topics to export. This time, we use the filtered document-topic distribution new_doc_topic:

[51]:
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics

ldamodel_top_doc_topics(new_doc_topic, doc_labels, top_n=3,
                        topic_labels=new_topic_labels)[:5]
[51]:
rank_1 rank_2 rank_3
document
NewsArticles-1041 22_south_visit (0.8199) 15_campaign_news (0.06708) 34_president_security (0.06192)
NewsArticles-1065 27_china_chinese (0.4759) 35_election_party (0.3074) 40_father_new (0.1389)
NewsArticles-1099 25_trump_president (0.3958) 8_order_enter (0.3464) 12_ryan_democrats (0.2104)
NewsArticles-1169 33_kill_group (0.4507) 47_russia_border (0.1984) 37_board_solution (0.1398)
NewsArticles-1174 33_kill_group (0.6924) 3_car_reportedly (0.2984) 49_hospital_care (0.000213)
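How such a "top topics per document" table comes about can be sketched with NumPy. The data below is made up and the code is a simplified illustration, not the function's actual implementation: each row of the document-topic matrix is sorted in descending order and the first top_n topics are combined with their labels.

```python
import numpy as np

# toy document-topic distribution (2 documents, 3 topics)
doc_topic = np.array([[0.05, 0.80, 0.15],
                      [0.50, 0.10, 0.40]])
doc_names = ['doc_a', 'doc_b']
labels = np.array(['1_sport', '2_politics', '3_economy'])
top_n = 2

# per document: topic indices sorted by descending probability
top_idx = np.argsort(-doc_topic, axis=1)[:, :top_n]

summary = {}
for name, row, idx in zip(doc_names, doc_topic, top_idx):
    # combine label and probability like in the table above
    summary[name] = ['%s (%.4g)' % (labels[i], row[i]) for i in idx]

print(summary)
```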

Let’s have a look at one of these documents:

[52]:
print(corpus['NewsArticles-1065'][:500], '...')
The leader China never forgot

In new biography, author recalls former prime minister Edward Heath's meetings with nation's legendary figures Michael McManus believes Edward Heath was a pivotal figure in China's opening up to the West. The former British prime minister - who is the subject of the author's new biography - famously first met with Chairman Mao in 1974 and was a regular visitor thereafter to the country that has since become the world's second-largest economy. "They (the Chinese) re ...

There are also two functions that generate datatables for the full topic-word and document-topic distributions: ldamodel_full_topic_words() and ldamodel_full_doc_topics(). The output of both functions is naturally quite big, as long as you’re not working with a “toy dataset”.

[53]:
from tmtoolkit.topicmod.model_io import ldamodel_full_topic_words

datatable_topic_word = ldamodel_full_topic_words(new_topic_word,
                                                 vocab_bg,
                                                 row_labels=new_topic_labels)
# displaying only the first 5 topics and the first 10 columns, i.e. the
# topic label plus the first 9 words from the vocabulary (all numbers)
datatable_topic_word[:5, :10]
[53]:
   _topic             10           100          11           12           13           14           15           16           17
0  1_record_rock      0.000527426  0.000527426  0.000527426  0.000527426  0.000527426  0.000527426  0.000527426  0.0110759    0.000527426
1  2_referendum_vote  0.000208507  0.000208507  0.000208507  0.00229358   0.000208507  0.00229358   0.000208507  0.000208507  0.000208507
2  3_car_reportedly   0.000313873  0.000313873  0.000313873  0.000313873  0.000313873  0.000313873  0.00659134   0.000313873  0.000313873
3  4_help_ability     0.000373692  0.000373692  0.000373692  0.000373692  0.000373692  0.000373692  0.000373692  0.000373692  0.000373692
4  5_air_force        0.000439367  0.000439367  0.000439367  0.000439367  0.000439367  0.000439367  0.000439367  0.000439367  0.000439367
[54]:
from tmtoolkit.topicmod.model_io import ldamodel_full_doc_topics

datatable_doc_topic = ldamodel_full_doc_topics(new_doc_topic,
                                               doc_labels,
                                               topic_labels=new_topic_labels)
# displaying only the first 3 documents and the first 5 columns,
# i.e. the document label plus the first 4 topics
datatable_doc_topic[:3, :5]
[54]:
   _doc               1_record_rock  2_referendum_vote  3_car_reportedly  4_help_ability
0  NewsArticles-1041  5.15597e-05    5.15597e-05        5.15597e-05       5.15597e-05
1  NewsArticles-1065  0.000198216    0.000198216        0.000198216       0.000198216
2  NewsArticles-1099  0.000247219    0.000247219        0.000247219       0.000247219

For quick inspection of topics there’s also a pair of print functions. We already used print_ldamodel_topic_words(), but we haven’t tried print_ldamodel_doc_topics() yet. This prints the top_n most probable topics for each document:

[55]:
from tmtoolkit.topicmod.model_io import print_ldamodel_doc_topics

# subsetting new_doc_topic and doc_labels to get only the first
# five documents
print_ldamodel_doc_topics(new_doc_topic[:5, :], doc_labels[:5],
                          val_labels=new_topic_labels)
NewsArticles-1041
> #1. 22_south_visit (0.819850)
> #2. 15_campaign_news (0.067079)
> #3. 34_president_security (0.061923)
NewsArticles-1065
> #1. 27_china_chinese (0.475917)
> #2. 35_election_party (0.307433)
> #3. 40_father_new (0.138949)
NewsArticles-1099
> #1. 25_trump_president (0.395797)
> #2. 8_order_enter (0.346354)
> #3. 12_ryan_democrats (0.210383)
NewsArticles-1169
> #1. 33_kill_group (0.450744)
> #2. 47_russia_border (0.198378)
> #3. 37_board_solution (0.139793)
NewsArticles-1174
> #1. 33_kill_group (0.692439)
> #2. 3_car_reportedly (0.298403)
> #3. 49_hospital_care (0.000213)

You can also export the results of a topic model to an Excel file using save_ldamodel_summary_to_excel(). The resulting Excel file will contain the following sheets:

  • top_doc_topics_vals: document-topic distribution with probabilities of top topics per document

  • top_doc_topics_labels: document-topic distribution with labels of top topics per document

  • top_doc_topics_labelled_vals: document-topic distribution combining probabilities and labels of top topics per document (e.g. "topic_12 (0.21)")

  • top_topic_word_vals: topic-word distribution with probabilities of top words per topic

  • top_topic_word_labels: topic-word distribution with top words per topic (e.g. "politics")

  • top_topic_words_labelled_vals: topic-word distribution combining probabilities and top words per topic (e.g. "politics (0.08)")

  • optional if dtm is given – marginal_topic_distrib: marginal topic distribution

In addition to saving the output to the specified Excel file, the function also returns a dict with the sheets and their data.
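The marginal topic distribution mentioned for the optional sheet needs the document lengths, which is why the DTM has to be given. A minimal sketch, assuming the usual definition (each document's topic shares weighted by the document's length, then normalized) and using made-up numbers:

```python
import numpy as np

# toy document-topic distribution (2 documents, 2 topics)
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
# document lengths, i.e. the row sums of the DTM (hypothetical values)
doc_lengths = np.array([100, 300])

# weight each document's topic shares by its length and sum per topic
weighted = (doc_topic * doc_lengths[:, np.newaxis]).sum(axis=0)
marginal = weighted / weighted.sum()

print(marginal)
```

The longer second document dominates here, so its main topic gets the larger share of the marginal distribution.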

[56]:
from tmtoolkit.topicmod.model_io import save_ldamodel_summary_to_excel

sheets = save_ldamodel_summary_to_excel('data/news_articles_100.xlsx',
                                        new_topic_word, new_doc_topic,
                                        doc_labels, vocab_bg,
                                        dtm = dtm_bg,
                                        topic_labels = new_topic_labels)

To quickly store a topic model to disk for sharing or loading again at a later point in time, there are save_ldamodel_to_pickle() and load_ldamodel_from_pickle(). The function for saving takes a path to a pickle file to create (or update), a topic model object (such as an LDA instance like best_tm; you could also pass a tuple like (new_doc_topic, new_topic_word)), the corresponding vocabulary and document labels, and optionally the DTM that was used to create the topic model. The function for loading the data will return the saved data as a dict. We will only show the dict’s keys here, as the data itself is too large to be printed:

[57]:
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle, \
    load_ldamodel_from_pickle

save_ldamodel_to_pickle('data/news_articles_100.pickle',
                        best_tm, vocab_bg, doc_labels,
                        dtm = dtm_bg)

loaded = load_ldamodel_from_pickle('data/news_articles_100.pickle')
loaded.keys()
[57]:
dict_keys(['model', 'vocab', 'doc_labels', 'dtm'])
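The general idea of such a round trip can be sketched with the standard library's pickle module. The payload below is a hypothetical stand-in with the same keys as shown above, and the code is not tmtoolkit's implementation. Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# hypothetical stand-ins for a fitted model's data
payload = {
    'model': {'doc_topic': [[0.9, 0.1]],
              'topic_word': [[0.5, 0.5], [0.2, 0.8]]},
    'vocab': ['eu', 'uk'],
    'doc_labels': ['doc_a'],
    'dtm': [[3, 1]],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_model.pickle')
    with open(path, 'wb') as f:    # write the dict to disk
        pickle.dump(payload, f)
    with open(path, 'rb') as f:    # and load it again
        restored = pickle.load(f)

print(list(restored.keys()))
```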

Visualizing topic models

The topicmod.visualize module contains several functions to visualize topic models and evaluation results. We’ve already used plot_eval_results() during topic model evaluation so we’ll now focus on visualizing topic models.

Heatmaps

Let’s start with heatmap visualizations of document-topic or topic-word distributions from our topic model. This can be done with plot_doc_topic_heatmap() and plot_topic_word_heatmap() respectively. Both functions draw on a matplotlib figure and Axes object, which you must create before using these functions.

Heatmap visualizations essentially shade cells in a 2D matrix (like the document-topic or topic-word distributions) according to their value, i.e. the respective probability for a topic in a given document or a word in a given topic. Since these matrices are usually quite large, i.e. with hundreds of rows and/or columns, it makes more sense to plot a certain subset of interest than the whole matrix. When we want to visualize a document-topic distribution, we can optionally select a subset of the documents with the which_documents parameter and a subset of the topics with the which_topics parameter. Let’s first draw a heatmap of a subset of documents across all topics:

[58]:
import matplotlib.pyplot as plt
from tmtoolkit.topicmod.visualize import plot_doc_topic_heatmap

# create a figure of certain size and
# Axes object to draw on
fig, ax = plt.subplots(figsize=(32, 8))

which_docs = [
    'NewsArticles-1473',
    'NewsArticles-1646',
    'NewsArticles-2252',
    'NewsArticles-2473',
    'NewsArticles-2583',
    'NewsArticles-2765',
    'NewsArticles-2922',
    'NewsArticles-3396',
    'NewsArticles-3601',
    'NewsArticles-3753'
]

plot_doc_topic_heatmap(fig, ax, new_doc_topic, doc_labels,
                       topic_labels=new_topic_labels,
                       which_documents=which_docs);
_images/topic_modeling_105_0.png
[59]:
fig, ax = plt.subplots(figsize=(6, 8))

which_topics = [
    '2_referendum_vote',
    '35_election_party',
    '36_eu_uk',
    '6_find_drug',
    '13_air_commission'
]

plot_doc_topic_heatmap(fig, ax, new_doc_topic, doc_labels,
                       topic_labels=new_topic_labels,
                       which_documents=which_docs,
                       which_topics=which_topics);
_images/topic_modeling_106_0.png
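Selecting documents and topics by their labels boils down to taking a sub-matrix. The following sketch shows this with NumPy's np.ix_() on made-up data. It illustrates the idea only: how the plotting functions select and order rows and columns internally is not shown in this chapter.

```python
import numpy as np

# toy document-topic distribution (3 documents, 3 topics)
doc_topic = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.3, 0.3, 0.4]])
doc_names = ['doc_a', 'doc_b', 'doc_c']
labels = ['1_sport', '2_politics', '3_economy']

which_docs = ['doc_c', 'doc_a']
which_topics = ['3_economy', '1_sport']

# translate labels to row and column indices
row_idx = [doc_names.index(d) for d in which_docs]
col_idx = [labels.index(t) for t in which_topics]

# np.ix_ builds an index for the selected rows crossed with the selected columns
sub = doc_topic[np.ix_(row_idx, col_idx)]
print(sub)
```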

Similarly, we can work with plot_topic_word_heatmap() to visualize a topic-word distribution. We can also select a subset of topics and words from the vocabulary:

[60]:
from tmtoolkit.topicmod.visualize import plot_topic_word_heatmap

fig, ax = plt.subplots(figsize=(12, 8))

which_words = ['may', 'trump', 'referendum', 'brexit',
               'eu', 'uk', 'britain', 'economy', 'trade', 'law']

plot_topic_word_heatmap(fig, ax, new_topic_word, vocab_bg,
                        topic_labels=new_topic_labels,
                        which_topics=which_topics,
                        which_words=which_words);
_images/topic_modeling_108_0.png

Note that there’s also a generic heatmap plotting function plot_heatmap() for any kind of 2D matrices.

Word clouds

Thanks to the wordcloud package, topic-word and document-topic distributions can also be visualized as “word clouds” with tmtoolkit. The function generate_wordclouds_for_topic_words() generates a word cloud for each topic by scaling each of the topic’s words by its probability (weight). You can choose to display only the top_n most probable words per topic. The result of this function is a dictionary mapping topic labels to the respective word cloud image.

[61]:
from tmtoolkit.topicmod.visualize import generate_wordclouds_for_topic_words

# some options for wordcloud output
img_w = 400   # image width
img_h = 300   # image height

topic_clouds = generate_wordclouds_for_topic_words(
    new_topic_word, vocab_bg,
    top_n=20, topic_labels=new_topic_labels,
    width=img_w, height=img_h
)

# show all generated word clouds
topic_clouds.keys()
[61]:
dict_keys(['1_record_rock', '2_referendum_vote', '3_car_reportedly', '4_help_ability', '5_air_force', '6_find_drug', '7_food_safety', '8_order_enter', '9_capacity_million', '10_recall_vehicle', '11_season_third', '12_ryan_democrats', '13_air_commission', '14_protest_young', '15_campaign_news', '16_death_murder', '17_north_action', '18_facebook_review', '19_north_woman', '21_day_share', '22_south_visit', '23_year_energy', '24_criminal_domestic', '25_trump_president', '26_child_home', '27_china_chinese', '29_house_white', '30_opposition_bank', '32_police_arrest', '33_kill_group', '34_president_security', '35_election_party', '36_eu_uk', '37_board_solution', '38_rule_concern', '39_russia_moscow', '40_father_new', '41_water_per', '42_water_people', '43_company_manufacturing', '44_product_market', '46_court_case', '47_russia_border', '48_growth_tax', '49_hospital_care'])
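The weights behind a single topic's word cloud can be sketched as a mapping from the top_n words to their probabilities. The data is made up and this is not tmtoolkit's code. A dict of this form is what the wordcloud package's WordCloud.generate_from_frequencies() accepts, so larger probabilities result in larger words.

```python
import numpy as np

# toy vocabulary and one topic's word probabilities
vocab = np.array(['eu', 'uk', 'vote', 'trade', 'law'])
topic_probs = np.array([0.40, 0.25, 0.05, 0.20, 0.10])
top_n = 3

# indices of the top_n most probable words, most probable first
top_idx = np.argsort(topic_probs)[::-1][:top_n]

# word -> weight mapping used for scaling the words
weights = {str(vocab[i]): float(topic_probs[i]) for i in top_idx}
print(weights)
```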

Let’s select specific topics and display their word cloud:

[62]:
topic_clouds['36_eu_uk']
[62]:
_images/topic_modeling_113_0.png
[63]:
topic_clouds['2_referendum_vote']
[63]:
_images/topic_modeling_114_0.png

The same can be done for the document-topic distribution using generate_wordclouds_for_document_topics(). Here, a word cloud for each document will be generated that contains the top_n most probable topics for this document:

[64]:
from tmtoolkit.topicmod.visualize import generate_wordclouds_for_document_topics

doc_clouds = generate_wordclouds_for_document_topics(
    new_doc_topic, doc_labels, topic_labels=new_topic_labels,
    top_n=5, width=img_w, height=img_h)

# show only the first 5 documents for
# which word clouds were generated
list(doc_clouds.keys())[:5]
[64]:
['NewsArticles-1041',
 'NewsArticles-1065',
 'NewsArticles-1099',
 'NewsArticles-1169',
 'NewsArticles-1174']

To display a specific document’s topic word cloud:

[65]:
doc_clouds['NewsArticles-1473']
[65]:
_images/topic_modeling_118_0.png

We can write the generated images as PNG files to a folder on disk. Here, we store all word clouds in topic_clouds to 'data/tm_wordclouds/':

[66]:
from tmtoolkit.topicmod.visualize import write_wordclouds_to_folder

write_wordclouds_to_folder(topic_clouds, 'data/tm_wordclouds/')

Interactive visualization with pyLDAvis

The pyLDAvis package offers a great interactive tool to explore a topic model. The tmtoolkit function parameters_for_ldavis() prepares your topic model data so that you can easily pass it on to pyLDAvis.

[67]:
from tmtoolkit.topicmod.visualize import parameters_for_ldavis

ldavis_params = parameters_for_ldavis(new_topic_word,
                                      new_doc_topic,
                                      dtm_bg,
                                      vocab_bg)

If you have installed the package, you can now start the pyLDAvis explorer with the following lines of code in a Jupyter notebook:

import pyLDAvis
pyLDAvis.prepare(**ldavis_params)
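For orientation, pyLDAvis.prepare() expects the two distributions plus the document lengths, the vocabulary and the overall term frequencies. The following sketch assembles such keyword arguments by hand from a toy DTM. The data is made up, and it is an assumption that this resembles what parameters_for_ldavis() puts together; the sketch itself needs neither tmtoolkit nor pyLDAvis.

```python
import numpy as np

# toy DTM (3 documents, 3 words) and toy distributions (2 topics)
dtm = np.array([[2, 0, 1],
                [0, 3, 2],
                [1, 1, 5]])
vocab = ['eu', 'trump', 'vote']
topic_word = np.array([[0.6, 0.1, 0.3],
                       [0.1, 0.5, 0.4]])
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5]])

# keyword arguments named after the parameters of pyLDAvis.prepare()
params = {
    'topic_term_dists': topic_word,
    'doc_topic_dists': doc_topic,
    'doc_lengths': dtm.sum(axis=1),      # number of tokens per document
    'vocab': vocab,
    'term_frequency': dtm.sum(axis=0),   # number of occurrences per word
}

print(params['doc_lengths'], params['term_frequency'])
```

Both distributions must have rows that sum to 1, and the vocabulary order has to match the columns of the DTM and of the topic-word distribution.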