Topic modeling

The topicmod module offers a wide range of tools to facilitate topic modeling with Python. This chapter will introduce the following techniques: computing topic models in parallel, evaluating topic models, computing common statistics from them, displaying and exporting topic modeling results, and visualizing topic models.

A quick note on terminology: so far, we have spoken of tokens or sometimes terms when we meant the individual elements that our documents consist of after text preprocessing such as tokenization was applied to the raw input text strings. These tokens can be lexicographically correct words, but they don’t have to be, e.g. when you applied stemming you might have tokens like “argu” in your vocabulary. There may also be numbers or punctuation symbols in your vocabulary. For the topic modeling techniques that tmtoolkit supports, the results are always two probability distributions: a document-topic distribution and a topic-word distribution. Since the latter is called a topic-word and not a topic-token or topic-term distribution, we will also use the term word when we mean any token from the corpus’ vocabulary.

An example document-term matrix

tmtoolkit supports topic models that are computed from document-term matrices (DTMs). Just as in the previous chapter, we will first generate a DTM. However, this time the sample will be bigger:

[1]:
import random
random.seed(20191120)   # to make the sampling reproducible

import numpy as np
np.set_printoptions(precision=5)

from tmtoolkit.utils import enable_logging
enable_logging()

from tmtoolkit.corpus import Corpus, print_summary


corp = Corpus.from_builtin_corpus('en-NewsArticles', sample=100)
print_summary(corp)
2022-02-08 07:53:29,472:INFO:tmtoolkit:creating Corpus instance with no documents
2022-02-08 07:53:29,473:INFO:tmtoolkit:using serial processing
2022-02-08 07:53:29,957:INFO:tmtoolkit:sampling 100 documents(s) out of 3824
2022-02-08 07:53:29,958:INFO:tmtoolkit:adding text from 100 documents(s)
2022-02-08 07:53:29,959:INFO:tmtoolkit:running NLP pipeline on 100 documents
2022-02-08 07:53:37,126:INFO:tmtoolkit:generating document texts
Corpus with 100 documents in English
> NewsArticles-113 (1071 tokens): Use talk not tech to tame your children 's online ...
> NewsArticles-1043 (270 tokens): Burhan Ozbilici wins 2017 World Press Photo compet...
> NewsArticles-1032 (653 tokens): Germany 's right - wing AfD seeks to expel state l...
> NewsArticles-104 (31 tokens): Your pictures : Broken resolutions    Each week , ...
> NewsArticles-1137 (226 tokens): These Cool New ' Vertical Forest ' Skyscrapers Are...
> NewsArticles-1036 (835 tokens): Amnesty accuses Tunisian authorities of torture ah...
> NewsArticles-1048 (476 tokens): Espirito Santo police return to work after murder ...
> NewsArticles-1126 (163 tokens): This Makeup Palette Has A Game - Changing Little S...
> NewsArticles-1090 (1291 tokens): Martin challenges Fitzgerald over Tusla informatio...
> NewsArticles-1141 (914 tokens): Aslef members reject Southern rail deal    Aslef m...
(and 90 more documents)
total number of tokens: 66637 / vocabulary size: 9469

We will also generate two DTMs now, because we later want to show how you can compute topic models for two different DTMs in parallel. First, we do some general preprocessing.

[2]:
from tmtoolkit.corpus import lemmatize, to_lowercase, remove_punctuation

lemmatize(corp)
to_lowercase(corp)
remove_punctuation(corp)

print_summary(corp)
2022-02-08 07:53:37,165:INFO:tmtoolkit:replacing 2186 token hashes
2022-02-08 07:53:37,202:INFO:tmtoolkit:replacing 502 token hashes
2022-02-08 07:53:37,217:INFO:tmtoolkit:generating document texts
Corpus with 100 documents in English
> NewsArticles-113 (1071 tokens): use talk not tech to tame your child s online habi...
> NewsArticles-1043 (270 tokens): burhan ozbilici win 2017 world press photo competi...
> NewsArticles-1032 (653 tokens): germany s right  wing afd seek to expel state lead...
> NewsArticles-104 (31 tokens): your picture  break resolution  each week  we publ...
> NewsArticles-1137 (226 tokens): these cool new  vertical forest  skyscraper be des...
> NewsArticles-1036 (835 tokens): amnesty accuse tunisian authority of torture ahead...
> NewsArticles-1048 (476 tokens): espirito santo police return to work after murder ...
> NewsArticles-1126 (163 tokens): this makeup palette have a game  change little sec...
> NewsArticles-1090 (1291 tokens): martin challenge fitzgerald over tusla information...
> NewsArticles-1141 (914 tokens): aslef member reject southern rail deal  aslef memb...
(and 90 more documents)
total number of tokens: 66637 / vocabulary size: 6758

Let’s check whether there are any odd, unprintable characters in the tokens:

[3]:
import string
from tmtoolkit.corpus import corpus_unique_chars

{(c, c.encode('utf-8')) for c in corpus_unique_chars(corp) if c not in string.printable}
[3]:
{('\xa0', b'\xc2\xa0'),
 ('à', b'\xc3\xa0'),
 ('ó', b'\xc3\xb3'),
 ('™', b'\xe2\x84\xa2'),
 ('�', b'\xef\xbf\xbd')}

We remove all of them except “à” and “ó”:

[4]:
from tmtoolkit.corpus import remove_chars

unprintable_bytes = {b'\xc2\xa0', b'\xe2\x84\xa2', b'\xef\xbf\xbd'}
unprintable_chars = set(map(lambda b: b.decode('utf-8'), unprintable_bytes))
remove_chars(corp, unprintable_chars)

# check again
{(c, c.encode('utf-8')) for c in corpus_unique_chars(corp) if c not in string.printable}
2022-02-08 07:53:37,262:INFO:tmtoolkit:replacing 3 token hashes
[4]:
{('à', b'\xc3\xa0'), ('ó', b'\xc3\xb3')}

First, we apply rather “relaxed” cleaning to one copy of corp:

[5]:
from copy import copy
from tmtoolkit.corpus import filter_clean_tokens, remove_common_tokens, remove_uncommon_tokens

corp_bigger = copy(corp)

filter_clean_tokens(corp_bigger, remove_shorter_than=2)
remove_common_tokens(corp_bigger, df_threshold=0.85)
remove_uncommon_tokens(corp_bigger, df_threshold=0.05)

print_summary(corp_bigger)
2022-02-08 07:53:37,327:INFO:tmtoolkit:creating Corpus instance with no documents
2022-02-08 07:53:37,327:INFO:tmtoolkit:using serial processing
2022-02-08 07:53:37,447:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 66637 and is now 30407
2022-02-08 07:53:37,534:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 30407 and is now 30407
2022-02-08 07:53:37,626:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 30407 and is now 16443
2022-02-08 07:53:37,662:INFO:tmtoolkit:generating document texts
Corpus with 100 documents in English
> NewsArticles-113 (256 tokens): use talk child online like parent house like happy...
> NewsArticles-1043 (64 tokens): win 2017 world press win 2017 world press image ru...
> NewsArticles-1032 (158 tokens): germany right seek state leader leader germany ask...
> NewsArticles-104 (7 tokens): break week publish set theme week break
> NewsArticles-1137 (43 tokens): new design help fight world need dont china kind d...
> NewsArticles-1036 (241 tokens): accuse authority ahead key talk germany right grou...
> NewsArticles-1048 (141 tokens): police return work murder officer return work stat...
> NewsArticles-1126 (19 tokens): game change little product reveal feature game new...
> NewsArticles-1090 (266 tokens): challenge information leader claim minister justic...
> NewsArticles-1141 (248 tokens): member reject southern deal member reject deal sou...
(and 90 more documents)
total number of tokens: 16443 / vocabulary size: 791

We apply more aggressive cleaning to another copy of corp, which hence results in a smaller vocabulary:

[6]:
from tmtoolkit.corpus import filter_for_pos
corp_smaller = copy(corp)

filter_for_pos(corp_smaller, 'N')
filter_clean_tokens(corp_smaller, remove_shorter_than=2)
remove_common_tokens(corp_smaller, df_threshold=0.8)
remove_uncommon_tokens(corp_smaller, df_threshold=0.1)

del corp   # remove original corpus

print_summary(corp_smaller)
2022-02-08 07:53:37,706:INFO:tmtoolkit:creating Corpus instance with no documents
2022-02-08 07:53:37,706:INFO:tmtoolkit:using serial processing
2022-02-08 07:53:37,814:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 66637 and is now 19002
2022-02-08 07:53:37,868:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 19002 and is now 18551
2022-02-08 07:53:37,939:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 18551 and is now 18551
2022-02-08 07:53:38,000:INFO:tmtoolkit:filtered tokens by mask: num. tokens was 18551 and is now 5065
2022-02-08 07:53:38,014:INFO:tmtoolkit:generating document texts
Corpus with 100 documents in English
> NewsArticles-113 (69 tokens): child house service house service month time child...
> NewsArticles-1043 (18 tokens): world press world press world press russia year ne...
> NewsArticles-1032 (58 tokens): germany state leader leader germany state party le...
> NewsArticles-104 (2 tokens): week week
> NewsArticles-1137 (12 tokens): new world china press day area china house office ...
> NewsArticles-1036 (91 tokens): authority germany right group security official mi...
> NewsArticles-1048 (59 tokens): police work officer work state people day police o...
> NewsArticles-1126 (3 tokens): product way twitter
> NewsArticles-1090 (103 tokens): information leader minister child prime time week ...
> NewsArticles-1141 (52 tokens): member member member member secretary decision mem...
(and 90 more documents)
total number of tokens: 5065 / vocabulary size: 141

We will create the document labels, vocabulary arrays and DTMs for both versions now:

[7]:
from tmtoolkit.corpus import dtm

dtm_bg, doc_labels_bg, vocab_bg = dtm(corp_bigger, return_doc_labels=True, return_vocab=True)
dtm_sm, doc_labels_sm, vocab_sm = dtm(corp_smaller, return_doc_labels=True, return_vocab=True)

del corp_bigger, corp_smaller  # don't need these any more

dtm_bg, dtm_sm
2022-02-08 07:53:38,029:INFO:tmtoolkit:generating sparse DTM with 100 documents and vocab size 791
2022-02-08 07:53:38,063:INFO:tmtoolkit:generating sparse DTM with 100 documents and vocab size 141
[7]:
(<100x791 sparse matrix of type '<class 'numpy.int32'>'
        with 9247 stored elements in Compressed Sparse Row format>,
 <100x141 sparse matrix of type '<class 'numpy.int32'>'
        with 2378 stored elements in Compressed Sparse Row format>)

We now have two sparse DTMs, dtm_bg (from the bigger preprocessed data) and dtm_sm (from the smaller preprocessed data), document label lists doc_labels_bg and doc_labels_sm that represent the rows of the respective DTMs, and vocabulary arrays vocab_bg and vocab_sm that represent their columns. We will use this data for the remainder of the chapter.

Computing topic models in parallel

tmtoolkit allows you to compute topic models in parallel, making use of all processor cores of your machine. Parallelization can be done per input DTM, per hyperparameter set, or as a combination of both. Hyperparameters control the number of topics and their “granularity”. We will later have a look at the role of hyperparameters and how to find an optimal combination for a given dataset by means of topic model evaluation.

For now, we will concentrate on computing the topic models for both of our DTMs in parallel. tmtoolkit supports three very popular packages for topic modeling, which perform the actual work of computing a model from the input matrix. They can all be accessed via separate sub-modules of the topicmod module:

  • lda via topicmod.tm_lda

  • scikit-learn via topicmod.tm_sklearn

  • gensim via topicmod.tm_gensim

Each of these sub-modules offers at least two functions that work with the respective package: compute_models_parallel for general parallel model computation and evaluate_topic_models for parallel model computation and evaluation (discussed later). For now, we want to compute two models in parallel with the lda package and hence use compute_models_parallel from topicmod.tm_lda.

We need to provide two things to this function: first, the input matrices as a dict that maps labels to the respective DTMs; second, the hyperparameters to use for the model computations. Note that each topic modeling package has different hyperparameters and you should refer to its documentation to find out which hyperparameters you need to provide. For lda, we set the number of topics n_topics to 10 and the number of iterations for the Gibbs sampling process n_iter to 1000. We always want to use the same hyperparameters, so we pass these as constant_parameters. If we wanted to create models for a whole range of parameters, e.g. for different numbers of topics, we could provide varying_parameters. We will check this out later when we evaluate topic models.

Note

For proper topic modeling, we shouldn’t just set the number of topics, but try to find a suitable value via evaluation methods. We should also check whether the algorithm converged using the provided likelihood estimations. We will do both later on, but for now we focus on compute_models_parallel.

[8]:
import logging
import warnings
from tmtoolkit.utils import disable_logging
from tmtoolkit.topicmod.tm_lda import compute_models_parallel

# disable tmtoolkit logging for now (too much output)
disable_logging()

# suppress the "INFO" messages and warnings from lda
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())
logger.propagate = False

warnings.filterwarnings('ignore')

# set data to use
dtms = {
    'bigger': dtm_bg,
    'smaller': dtm_sm
}

# and fixed hyperparameters
lda_params = {
    'n_topics': 10,
    'n_iter': 1000,
    'random_state': 20191122  # to make results reproducible
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)
models
[8]:
defaultdict(list,
            {'smaller': [({'n_topics': 10,
                'n_iter': 1000,
                'random_state': 20191122},
               <lda.lda.LDA at 0x7fb69dd0d580>)],
             'bigger': [({'n_topics': 10,
                'n_iter': 1000,
                'random_state': 20191122},
               <lda.lda.LDA at 0x7fb69dd0d460>)]})

As expected, two models were created. These can be accessed via the labels that we used in the dtms dict:

[9]:
models['smaller']
[9]:
[({'n_topics': 10, 'n_iter': 1000, 'random_state': 20191122},
  <lda.lda.LDA at 0x7fb69dd0d580>)]

We can see that for each input DTM, we get a list of 2-tuples. The first element in each tuple is a dict with the hyperparameters that were used to compute the model; the second element is the actual topic model (the <lda.lda.LDA ...> object). This structure looks a bit complex, but that’s because it also supports varying parameters. Since we only have one fixed set of hyperparameters per DTM, the list has length 1 for each DTM.
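
Because of this structure, iterating over all computed models is straightforward. The following small loop only illustrates how the pairs can be unpacked; the variable names are arbitrary:

# illustration: iterate over all (hyperparameters, model) pairs per DTM label
for dtm_label, param_model_pairs in models.items():
    for params, model in param_model_pairs:
        print(dtm_label, params['n_topics'], type(model).__name__)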

We will now access the models and print the top words per topic by using print_ldamodel_topic_words:

[10]:
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words

model_sm = models['smaller'][0][1]
print_ldamodel_topic_words(model_sm.topic_word_, vocab_sm, top_n=3)
topic_1
> #1. mr (0.159627)
> #2. germany (0.134861)
> #3. member (0.096337)
topic_2
> #1. police (0.229620)
> #2. man (0.185890)
> #3. officer (0.145804)
topic_3
> #1. party (0.132964)
> #2. election (0.132964)
> #3. leader (0.072724)
topic_4
> #1. people (0.149059)
> #2. country (0.117747)
> #3. attack (0.056375)
topic_5
> #1. al (0.118424)
> #2. syria (0.109317)
> #3. force (0.097174)
topic_6
> #1. company (0.173780)
> #2. percent (0.086901)
> #3. business (0.069525)
topic_7
> #1. trump (0.196396)
> #2. house (0.109759)
> #3. president (0.088581)
topic_8
> #1. year (0.217796)
> #2. time (0.102423)
> #3. day (0.073904)
topic_9
> #1. china (0.299307)
> #2. development (0.074851)
> #3. european (0.071598)
topic_10
> #1. report (0.087935)
> #2. official (0.058197)
> #3. president (0.058197)
[11]:
model_bg = models['bigger'][0][1]
print_ldamodel_topic_words(model_bg.topic_word_, vocab_bg, top_n=3)
topic_1
> #1. american (0.041231)
> #2. new (0.036285)
> #3. america (0.029689)
topic_2
> #1. year (0.032174)
> #2. day (0.026613)
> #3. work (0.025025)
topic_3
> #1. party (0.051625)
> #2. election (0.051625)
> #3. vote (0.040334)
topic_4
> #1. people (0.090103)
> #2. country (0.067925)
> #3. million (0.032580)
topic_5
> #1. say (0.109116)
> #2. report (0.040041)
> #3. mr (0.025522)
topic_6
> #1. trump (0.085939)
> #2. president (0.063024)
> #3. russian (0.045838)
topic_7
> #1. police (0.048848)
> #2. man (0.039545)
> #3. officer (0.034119)
topic_8
> #1. china (0.066056)
> #2. company (0.057441)
> #3. market (0.030878)
topic_9
> #1. help (0.029090)
> #2. child (0.028429)
> #3. good (0.024463)
topic_10
> #1. say (0.091405)
> #2. year (0.040949)
> #3. come (0.021510)

We can also generate models for different parameters in parallel, either for a single DTM or for several. In the following example, we generate models for a series of four different values of the alpha parameter. The parameters n_iter and n_topics are held constant across all models.

[12]:
var_params = [{'alpha': 1/(10**x)} for x in range(1, 5)]

const_params = {
    'n_iter': 500,
    'n_topics': 10,
    'random_state': 20191122  # to make results reproducible
}

models = compute_models_parallel(dtm_sm,  # smaller DTM
                                 varying_parameters=var_params,
                                 constant_parameters=const_params)
models
[12]:
[({'alpha': 0.0001, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7fb69dcec430>),
 ({'alpha': 0.001, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7fb69dd01ee0>),
 ({'alpha': 0.01, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7fb69dd015e0>),
 ({'alpha': 0.1, 'n_iter': 500, 'n_topics': 10, 'random_state': 20191122},
  <lda.lda.LDA at 0x7fb69dd017c0>)]

We could compare these models now, e.g. by investigating their topics.

A more systematic approach on comparing and evaluating topic models, also in order to find a good set of hyperparameters for a given dataset, will be presented in the next section.

Evaluation of topic models

The package tmtoolkit provides several metrics for comparing and evaluating topic models. These can be used for finding a good hyperparameter set for a given dataset, e.g. a good combination of the number of topics and the concentration parameters (often called alpha and beta in the literature). For some background on hyperparameters in topic modeling, see this blog post.

For each candidate hyperparameter set, a model can be generated and evaluated in parallel. We will do this now for the “big” DTM dtm_bg. Our candidate values for the number of topics k range from 20 to 120 in steps of 10. We make alpha, the concentration parameter of the prior over the document-specific topic distributions, depend on k as 1/k:

[13]:
var_params = [{'n_topics': k, 'alpha': 1/k}
               for k in range(20, 121, 10)]
var_params
[13]:
[{'n_topics': 20, 'alpha': 0.05},
 {'n_topics': 30, 'alpha': 0.03333333333333333},
 {'n_topics': 40, 'alpha': 0.025},
 {'n_topics': 50, 'alpha': 0.02},
 {'n_topics': 60, 'alpha': 0.016666666666666666},
 {'n_topics': 70, 'alpha': 0.014285714285714285},
 {'n_topics': 80, 'alpha': 0.0125},
 {'n_topics': 90, 'alpha': 0.011111111111111112},
 {'n_topics': 100, 'alpha': 0.01},
 {'n_topics': 110, 'alpha': 0.00909090909090909},
 {'n_topics': 120, 'alpha': 0.008333333333333333}]

The heart of the model evaluation process is the function evaluate_topic_models, which is available for all three topic modeling packages. We stick with lda and import that function from topicmod.tm_lda. It is similar to compute_models_parallel in that it accepts varying and constant hyperparameters. However, it not only computes the models in parallel, but also applies several metrics to these models in order to evaluate them. This can be controlled with the metric parameter, which accepts a string or a list of strings specifying the metric(s) to use. These metrics refer to functions that are implemented in topicmod.evaluate.

Each topic modeling sub-module defines two important sequences: AVAILABLE_METRICS and DEFAULT_METRICS. The former lists all available metrics for that sub-module, the latter lists the default metrics that are used when you don’t specify anything with the metric parameter. Let’s have a look at both sequences in topicmod.tm_lda:

[14]:
from tmtoolkit.topicmod import tm_lda

tm_lda.AVAILABLE_METRICS
[14]:
('loglikelihood',
 'cao_juan_2009',
 'arun_2010',
 'coherence_mimno_2011',
 'griffiths_2004',
 'held_out_documents_wallach09',
 'coherence_gensim_u_mass',
 'coherence_gensim_c_v',
 'coherence_gensim_c_uci',
 'coherence_gensim_c_npmi')
[15]:
tm_lda.DEFAULT_METRICS
[15]:
('cao_juan_2009', 'arun_2010', 'coherence_mimno_2011')

For details about the metrics and the academic references, see the respective implementations in the topicmod.evaluate module.
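
If you only want to apply a subset of the available metrics, you can pass them via the metric parameter mentioned above. The following is only a sketch of such a call (the actual evaluation run below uses the default metrics):

from tmtoolkit.topicmod.tm_lda import evaluate_topic_models

# sketch: evaluate the models for the varying hyperparameters, but only with
# two of the available metrics
evaluate_topic_models(dtm_bg, varying_parameters=var_params,
                      metric=['cao_juan_2009', 'arun_2010'])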

We will now run the model evaluations with evaluate_topic_models, using our previously generated list of varying hyperparameters var_params, some constant hyperparameters and the default set of metrics. We also set return_models=True, which retains the generated models in the evaluation results:

[16]:
from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
from tmtoolkit.topicmod.evaluate import results_by_parameter

const_params = {
    'n_iter': 1000,
    'random_state': 20191122,  # to make results reproducible
    'eta': 0.1,                # sometimes also called "beta"
}

eval_results = evaluate_topic_models(dtm_bg,
                                     varying_parameters=var_params,
                                     constant_parameters=const_params,
                                     return_models=True)
eval_results[:3]  # only show first three models
[16]:
[({'n_topics': 20,
   'alpha': 0.05,
   'n_iter': 1000,
   'random_state': 20191122,
   'eta': 0.1},
  {'model': <lda.lda.LDA at 0x7fb69df052e0>,
   'cao_juan_2009': 0.1583051825037188,
   'arun_2010': 6.813827298707812,
   'coherence_mimno_2011': -1.696975792080187}),
 ({'n_topics': 30,
   'alpha': 0.03333333333333333,
   'n_iter': 1000,
   'random_state': 20191122,
   'eta': 0.1},
  {'model': <lda.lda.LDA at 0x7fb69dd01610>,
   'cao_juan_2009': 0.11654514557276367,
   'arun_2010': 4.260416549019389,
   'coherence_mimno_2011': -1.563392437504735}),
 ({'n_topics': 40,
   'alpha': 0.025,
   'n_iter': 1000,
   'random_state': 20191122,
   'eta': 0.1},
  {'model': <lda.lda.LDA at 0x7fb69dd01730>,
   'cao_juan_2009': 0.11915454452526894,
   'arun_2010': 3.0999249948241188,
   'coherence_mimno_2011': -1.688220159791409})]

The evaluation results are a list with pairs of hyperparameters and their evaluation results for each metric. Additionally, there is the generated model for each hyperparameter set.

We now use results_by_parameter, which takes the “raw” evaluation results and sorts them by a specific hyperparameter, in this case n_topics. This is important because this is how the function for visualizing evaluation results, plot_eval_results, expects its input.

[17]:
eval_results_by_topics = results_by_parameter(eval_results, 'n_topics')
eval_results_by_topics[:3]  # again only the first three models
[17]:
[(20,
  {'model': <lda.lda.LDA at 0x7fb69df052e0>,
   'cao_juan_2009': 0.1583051825037188,
   'arun_2010': 6.813827298707812,
   'coherence_mimno_2011': -1.696975792080187}),
 (30,
  {'model': <lda.lda.LDA at 0x7fb69dd01610>,
   'cao_juan_2009': 0.11654514557276367,
   'arun_2010': 4.260416549019389,
   'coherence_mimno_2011': -1.563392437504735}),
 (40,
  {'model': <lda.lda.LDA at 0x7fb69dd01730>,
   'cao_juan_2009': 0.11915454452526894,
   'arun_2010': 3.0999249948241188,
   'coherence_mimno_2011': -1.688220159791409})]
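
Since eval_results_by_topics is simply a list of pairs of a parameter value and a results dict, you could also pull out a single metric series manually, e.g. as an illustration:

# collect (n_topics, coherence) pairs from the sorted evaluation results
coh_by_k = [(k, res['coherence_mimno_2011']) for k, res in eval_results_by_topics]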

We can now see the results for each metric across the specified range of number of topics using plot_eval_results:

[18]:
from tmtoolkit.topicmod.visualize import plot_eval_results

plot_eval_results(eval_results_by_topics);
_images/topic_modeling_34_0.png

These results suggest setting the number of topics, n_topics, to 50 and alpha to 0.02. We don’t have to generate a model with these hyperparameters again, because it’s already contained in the evaluation results (thanks to return_models=True). We extract the model from there in order to use it in the rest of the chapter.

[19]:
best_tm = [m for k, m in eval_results_by_topics if k == 50][0]['model']
best_tm.n_topics, best_tm.alpha, best_tm.eta  # just to make sure
[19]:
(50, 0.02, 0.1)

Common statistics and tools for topic models

The topicmod.model_stats module mostly contains functions that compute statistics from the document-topic and topic-word distribution of a topic model and also some helper functions for working with such distributions. We’ll start with an important helper function, generate_topic_labels_from_top_words.

Generating labels for topics

In topic modeling, topics are only numbered because they’re abstract: each topic is simply a probability distribution across all words in the vocabulary. Still, it’s useful to give them labels for better identification. The function generate_topic_labels_from_top_words is very useful for that, as it derives labels from the most “relevant” words in each topic. We’ll later see how we can identify the most relevant words per topic using a special relevance statistic. Note that you can adjust the weight of the relevance measure for the ranking by using the parameter lambda_, which lies in the range \([0, 1]\).

The function requires at least the topic-word and document-topic distributions from the model, the document lengths and the vocabulary. It then finds the minimum number of relevant words that uniquely label each topic. You can also use a fixed number for that minimum number with the parameter n_words.

[20]:
from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words

vocab_bg = np.array(vocab_bg)   # we need this to be an array

doc_lengths_bg = doc_lengths(dtm_bg)
topic_labels = generate_topic_labels_from_top_words(
    best_tm.topic_word_,
    best_tm.doc_topic_,
    doc_lengths_bg,
    np.array(vocab_bg),
    lambda_=0.6
)

topic_labels[:10]   # showing only the first 10 topics here
[20]:
array(['1_error_case', '2_germany_german', '3_trump_president',
       '4_circumstance_describe', '5_flight_air', '6_report_agency',
       '7_company_reform', '8_intelligence_russia', '9_party_vote',
       '10_al_syria'], dtype='<U23')

As we can see, two words are necessary to label each topic uniquely. By default, each label is prefixed with a number. You can change that with the parameter labels_format.

Let’s have a look at the top words for a specific topic. We can use ldamodel_top_topic_words for that from the module topicmod.model_io, which we will have a closer look at later:

[21]:
from tmtoolkit.topicmod.model_io import ldamodel_top_topic_words

top_topic_word = ldamodel_top_topic_words(best_tm.topic_word_,
                                          vocab_bg,
                                          row_labels=topic_labels)
top_topic_word[top_topic_word.index == '9_party_vote']
[21]:
rank_1 rank_2 rank_3 rank_4 rank_5 rank_6 rank_7 rank_8 rank_9 rank_10
topic
9_party_vote party (0.09073) vote (0.04245) leader (0.02635) right (0.02434) minister (0.02233) percent (0.02233) voter (0.02032) political (0.01629) call (0.01629) come (0.01629)

Marginal topic and word distributions

We’ll now focus on the marginal topic and word distributions. Let’s get the marginal topic distribution first by using marginal_topic_distrib:

[22]:
from tmtoolkit.topicmod.model_stats import marginal_topic_distrib

marg_topic = marginal_topic_distrib(best_tm.doc_topic_, doc_lengths_bg)
marg_topic
[22]:
array([0.01576, 0.01888, 0.01706, 0.00615, 0.00833, 0.03309, 0.03259,
       0.01403, 0.02541, 0.03231, 0.02303, 0.01714, 0.02143, 0.03394,
       0.01467, 0.01534, 0.0367 , 0.02148, 0.01134, 0.01369, 0.02523,
       0.01143, 0.0359 , 0.01353, 0.01892, 0.01301, 0.01765, 0.02398,
       0.01649, 0.02521, 0.02488, 0.01322, 0.00826, 0.03753, 0.00964,
       0.08694, 0.01579, 0.0151 , 0.01658, 0.02005, 0.01848, 0.00534,
       0.01172, 0.00715, 0.01603, 0.02789, 0.01261, 0.00995, 0.01288,
       0.01625])
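
The computation behind this is simply a document-length-weighted average of the document-topic distribution. The following lines are a minimal sketch of that formula (an assumption about the computation, not tmtoolkit’s actual code) and should closely reproduce marg_topic:

# sketch: weight each document's topic distribution by its relative length
doc_weights = doc_lengths_bg / doc_lengths_bg.sum()     # p(d) for each document
marg_topic_manual = doc_weights @ best_tm.doc_topic_    # p(T) = sum_d p(T|d) p(d)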

The marginal topic distribution can be interpreted as the “importance” of each topic for the whole corpus. Let’s get the sorted indices into topic_labels with np.argsort and get the top five topics:

[23]:
# np.argsort gives ascending order, hence reverse via [::-1]
topic_labels[np.argsort(marg_topic)[::-1][:5]]
[23]:
array(['36_say_year', '34_america_nation', '17_white_trump',
       '23_country_love', '14_say_committee'], dtype='<U23')

Likewise, we can get the marginal word distribution with marginal_word_distrib from the model’s topic-word distribution and the marginal topic distribution. We’ll use this to list the most probable words for the corpus. As expected, these are mostly quite common words:

[24]:
from tmtoolkit.topicmod.model_stats import marginal_word_distrib

marg_word = marginal_word_distrib(best_tm.topic_word_, marg_topic)
vocab_bg[np.argsort(marg_word)[::-1][:10]]
[24]:
array(['say', 'year', 'people', 'country', 'new', 'time', 'trump',
       'report', 'china', 'president'], dtype='<U14')

Two helper functions exist for this purpose: most_probable_words and least_probable_words sort the vocabulary according to the marginal probability:

[25]:
from tmtoolkit.topicmod.model_stats import most_probable_words, least_probable_words

most_probable_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[25]:
array(['say', 'year', 'people', 'country', 'new', 'time', 'trump',
       'report', 'china', 'president'], dtype='<U14')
[26]:
least_probable_words(vocab_bg, best_tm.topic_word_,
                     best_tm.doc_topic_, doc_lengths_bg,
                     n=10)
[26]:
array(['urge', 'series', 'reveal', 'protection', 'associate', 'argue',
       'elect', 'analysis', 'seven', 'guarantee'], dtype='<U14')

Word distinctiveness and saliency

Word distinctiveness and saliency (see below) help to identify the most “informative” words in a corpus, given its topic model. Both measures were introduced in Chuang et al. 2012.

Word distinctiveness is calculated for each word \(w\) as

\(\text{distinctiveness}(w) = \sum_T P(T|w) \log \frac{P(T|w)}{P(T)},\)

where \(P(T)\) is the marginal topic distribution and \(P(T|w)\) is the probability of a topic given word \(w\).
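
To make the formula concrete, here is a minimal NumPy sketch of it (not tmtoolkit’s actual implementation), in which \(P(T|w)\) is obtained from the topic-word distribution via Bayes’ theorem:

def distinctiveness_manual(phi, p_t):
    p_w = phi.T @ p_t                                # marginal word probabilities P(w)
    p_t_given_w = phi * p_t[:, None] / p_w[None, :]  # P(T|w) via Bayes' theorem
    # KL divergence of P(T|w) from P(T) for each word
    return np.sum(p_t_given_w * np.log(p_t_given_w / p_t[:, None]), axis=0)

# should closely match the word_distinctiveness() result shown below
distinctiveness_manual(best_tm.topic_word_, marg_topic)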

We can calculate this measure using word_distinctiveness. To use this measure directly to rank words, we can use most_distinct_words and least_distinct_words:

[27]:
from tmtoolkit.topicmod.model_stats import word_distinctiveness, \
    most_distinct_words, least_distinct_words

word_distinct = word_distinctiveness(best_tm.topic_word_, marg_topic)
word_distinct[:10]   # first 10 words in vocab
[27]:
array([0.78042, 1.13865, 1.21893, 1.01726, 1.20055, 1.5611 , 1.18047,
       1.58108, 0.74311, 0.97515])
[28]:
most_distinct_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[28]:
array(['note', 'space', 'china', '7', 'bank', 'judge', 'north', 'flight',
       'police', 'mr'], dtype='<U14')
[29]:
least_distinct_words(vocab_bg, best_tm.topic_word_,
                     best_tm.doc_topic_, doc_lengths_bg,
                     n=10)
[29]:
array(['away', 'adviser', 'agree', 'adopt', 'place', 'effect',
       'conference', 'currently', 'mind', 'explain'], dtype='<U14')

Word saliency weights each word’s distinctiveness by its marginal probability \(P(w)\):

\(\text{saliency}(w) = P(w) \cdot \text{distinctiveness}(w)\).
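
Continuing the sketch from above, saliency is then just the marginal word probability times the distinctiveness:

# sketch: P(w) = sum_T phi[T, w] * P(T), used to weight the distinctiveness
p_w = best_tm.topic_word_.T @ marg_topic
p_w * distinctiveness_manual(best_tm.topic_word_, marg_topic)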

The respective functions in tmtoolkit are word_saliency, most_salient_words and least_salient_words:

[30]:
from tmtoolkit.topicmod.model_stats import word_saliency, \
    most_salient_words, least_salient_words

word_sal = word_saliency(best_tm.topic_word_, best_tm.doc_topic_, doc_lengths_bg)
word_sal[:10]   # first 10 words in vocab
[30]:
array([0.00054, 0.00106, 0.00136, 0.00081, 0.00093, 0.00084, 0.00091,
       0.00105, 0.00044, 0.00072])
[31]:
most_salient_words(vocab_bg, best_tm.topic_word_,
                   best_tm.doc_topic_, doc_lengths_bg,
                   n=10)
[31]:
array(['say', 'china', 'trump', 'people', 'report', 'new', 'year',
       'country', 'police', 'mr'], dtype='<U14')
[32]:
least_salient_words(vocab_bg, best_tm.topic_word_,
                    best_tm.doc_topic_, doc_lengths_bg,
                    n=10)
[32]:
array(['adopt', 'effect', 'personal', 'adviser', 'positive', 'shortly',
       'strengthen', 'conduct', 'arab', 'conference'], dtype='<U14')

Topic-word relevance

The topic-word relevance measure as introduced by Sievert and Shirley 2014 helps to identify the most relevant words within a topic by also accounting for the marginal probability of each word across the corpus. This is done by integrating a lift value, which is the “ratio of a term’s probability within a topic to its marginal probability across the corpus.” (ibid.)

Thus for each word \(w\), given a topic-word distribution \(\phi\), a topic \(t\) and a weight parameter \(\lambda\), it is calculated as:

\(\text{relevance}_{\phi, \lambda}(w, t) = \lambda \log \phi_{t,w} + (1-\lambda) \log \frac{\phi_{t,w}}{p(w)}\).

The first term, \(\log \phi_{t,w}\), is the log of the topic-word distribution; the second term, \(\log \frac{\phi_{t,w}}{p(w)}\), is the log lift; and \(\lambda\) controls the weight between both terms. The lower \(\lambda\), the more weight is put on the lift term, i.e. the more the results differ from the original topic-word distribution.
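
As a minimal sketch (again not tmtoolkit’s actual code), the formula can be computed directly from the topic-word distribution and the marginal word distribution marg_word from above:

def relevance_manual(phi, p_w, lambda_=0.6):
    # lambda_ * log(phi) + (1 - lambda_) * log(phi / p(w))
    return lambda_ * np.log(phi) + (1 - lambda_) * np.log(phi / p_w[None, :])

# should give values close to those from topic_word_relevance() below
relevance_manual(best_tm.topic_word_, marg_word, lambda_=0.6)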

This measure is implemented in topic_word_relevance. It returns a matrix of the same shape as the topic-word distribution, i.e. each row represents a topic with a (log-transformed) distribution across all words in the vocabulary. Please note that the lambda parameter ends with an underscore: lambda_.

[33]:
from tmtoolkit.topicmod.model_stats import topic_word_relevance

topic_word_rel = topic_word_relevance(best_tm.topic_word_, best_tm.doc_topic_,
                                      doc_lengths_bg, lambda_=0.6)
topic_word_rel
[33]:
array([[-5.21276, -5.33602, -5.40768, ..., -6.27353, -5.35915, -5.51232],
       [-5.35582, -5.47908, -5.55074, ..., -2.70302, -1.78864, -1.54451],
       [-5.27302, -2.35175, -5.46794, ..., -6.33379, -5.41941, -5.57258],
       ...,
       [-4.87877, -5.00203, -5.07369, ..., -5.93954, -5.02516, -5.17833],
       [-5.05964, -2.78501, -5.25456, ..., -6.12041, -5.20604, -5.3592 ],
       [-5.23614, -5.3594 , -5.43107, ..., -2.18604, -5.38254, -5.5357 ]])

To confirm that it covers 50 topics across all words in the vocabulary:

[34]:
topic_word_rel.shape
[34]:
(50, 791)

Two functions can be used to get the most or least relevant words for a topic: most_relevant_words_for_topic and least_relevant_words_for_topic. You can select the topic with the topic parameter which is a zero-based topic index.

We’ll do this for the topic with index 9, which is:

[35]:
topic_labels[9]
[35]:
'10_al_syria'
[36]:
from tmtoolkit.topicmod.model_stats import most_relevant_words_for_topic, \
    least_relevant_words_for_topic

most_relevant_words_for_topic(vocab_bg, topic_word_rel, topic=9, n=10)
[36]:
array(['al', 'syria', 'syrian', 'opposition', 'city', 'area', 'military',
       'kill', 'war', 'strike'], dtype='<U14')
[37]:
least_relevant_words_for_topic(vocab_bg, topic_word_rel, topic=9, n=10)
[37]:
array(['year', 'people', 'time', 'trump', 'report', 'china', 'president',
       'come', 'state', 'company'], dtype='<U14')

Topic coherence

We already used the coherence metric (Mimno et al. 2011) for topic model evaluation. However, this metric can not only be used to assess the overall quality of a topic model, but also to evaluate the coherence of individual topics.

[38]:
from tmtoolkit.topicmod.evaluate import metric_coherence_mimno_2011

# use top 20 words per topic for metric
coh = metric_coherence_mimno_2011(best_tm.topic_word_, dtm_bg, top_n=20)
coh
[38]:
array([-2.06169, -0.69593, -1.13214, -1.48083, -2.53062, -1.14   ,
       -1.30559, -1.66265, -1.18398, -1.24934, -1.50454, -1.24552,
       -1.27484, -1.26755, -1.45137, -2.55208, -1.20722, -1.6191 ,
       -2.06777, -1.2751 , -2.14256, -1.50499, -1.10072, -1.98659,
       -2.34686, -1.56978, -1.97106, -1.54649, -1.58539, -1.20991,
       -1.50709, -2.20169, -1.56954, -1.4693 , -3.6065 , -0.93041,
       -1.18229, -1.30928, -3.1338 , -1.35259, -1.34244, -2.3644 ,
       -2.73258, -3.99957, -1.45934, -1.34668, -2.20253, -2.63822,
       -1.52807, -3.48115])

This generates a coherence value for each topic. Let’s show the distribution of these values:

[39]:
import matplotlib.pyplot as plt

plt.hist(coh, bins=20)
plt.xlabel('coherence')
plt.ylabel('n')
plt.show();
_images/topic_modeling_70_0.png

And print the five best and worst topics according to this metric:

[40]:
top5_t_indices = np.argsort(coh)[::-1][:5]
bottom5_t_indices = np.argsort(coh)[:5]

topic_labels[top5_t_indices]
[40]:
array(['2_germany_german', '36_say_year', '23_country_love',
       '3_trump_president', '6_report_agency'], dtype='<U23')
[41]:
topic_labels[bottom5_t_indices]
[41]:
array(['44_note_7', '35_force_group', '50_officer_police',
       '39_russian_diplomat', '43_bank_investor'], dtype='<U23')

Note that this metric doesn’t spare you careful manual evaluation, because it can also be off for some topics. For example, topic 36_say_year is certainly not a coherent topic, as it mostly ranks very common but less meaningful words high:

[42]:
top_topic_word[top_topic_word.index == '36_say_year']
[42]:
rank_1 rank_2 rank_3 rank_4 rank_5 rank_6 rank_7 rank_8 rank_9 rank_10
topic
36_say_year say (0.185) year (0.04498) time (0.02648) tell (0.02582) think (0.01922) lot (0.01922) take (0.01922) people (0.01856) know (0.01658) want (0.01592)

More coherence metrics can be used with the function metric_coherence_gensim. This requires that gensim is installed. Furthermore, most of these metrics require a parameter texts, which is the tokenized text that was used to create the document-term matrix.

Filtering topics

With the function filter_topics, you can filter the topics according to their topic-word distribution and the following search criteria:

  • search_pattern: one or more search patterns according to the common parameters for pattern matching

  • top_n: pattern match(es) must occur in the first top_n most probable words in the topic

  • thresh: matched words’ probability must be above this threshold

You must specify at least one of top_n and thresh, but you can also specify both. The function returns an array of topic indices (which start with zero!).

Let’s find all topics that have the glob pattern (match_type='glob') “russ*” (to match both “russia” and “russian”) in the top 5 most probable words:

[43]:
from tmtoolkit.topicmod.model_stats import filter_topics

found_topics = filter_topics('russ*', vocab_bg,
                             best_tm.topic_word_, match_type='glob',
                             top_n=5)
found_topics
[43]:
array([ 7, 24, 38, 41])

We can use these indices with our topic_labels:

[44]:
topic_labels[found_topics]
[44]:
array(['8_intelligence_russia', '25_sanction_visit',
       '39_russian_diplomat', '42_win_anti'], dtype='<U23')

Next, we want to select all topics where any word matched by the glob patterns "chin*" or "business" achieves a probability of at least 0.01 (thresh=0.01):

[45]:
found_topics = filter_topics(['chin*', 'business'], vocab_bg,
                             best_tm.topic_word_, thresh=0.01, match_type='glob')
topic_labels[found_topics]
[45]:
array(['7_company_reform', '22_space_european', '30_china_chinese',
       '47_family_refugee'], dtype='<U23')

When we specify cond='all', all patterns must have at least one match (here in the top 10 list of words per topic):

[46]:
found_topics = filter_topics(['chin*', 'business'], vocab_bg,
                             best_tm.topic_word_, top_n=10, match_type='glob',
                             cond='all')
topic_labels[found_topics]   # no result
[46]:
array([], dtype='<U23')

You could also pass a topic-word relevance matrix instead of a topic-word probability distribution.

[47]:
found_topics = filter_topics('russ*', vocab_bg,
                             topic_word_rel, match_type='glob',
                             top_n=5)
topic_labels[found_topics]
[47]:
array(['8_intelligence_russia', '25_sanction_visit',
       '39_russian_diplomat', '42_win_anti'], dtype='<U23')

Excluding topics

It is often the case that some topics of a topic model rank a lot of uninformative (e.g. very common) words highest. This results in uninformative topics, which you may want to exclude from further analysis. Note that if a large fraction of topics seems uninformative, this points to a problem with your topic model and/or your preprocessing steps: you should evaluate your candidate models carefully with the mentioned metrics and/or adjust your text preprocessing pipeline.

The function exclude_topics allows you to remove a specified set of topics from the document-topic and topic-word distributions. You need to pass the zero-based indices of the topics that you want to remove, along with both distributions.

For example, suppose the following topics were identified as uninformative:

[48]:
uninform_topics = [0, 3, 35]
topic_labels[uninform_topics]
[48]:
array(['1_error_case', '4_circumstance_describe', '36_say_year'],
      dtype='<U23')

We can now pass these indices to exclude_topics along with the topic model distributions. We’ll get back new, filtered distributions.

[49]:
from tmtoolkit.topicmod.model_stats import exclude_topics

new_doc_topic, new_topic_word, new_topic_mapping = \
    exclude_topics(uninform_topics, best_tm.doc_topic_,
                best_tm.topic_word_, return_new_topic_mapping=True)
new_doc_topic.shape, new_topic_word.shape
[49]:
((100, 47), (47, 791))

We can see in the new distributions’ shapes that we now have 47 instead of 50 topics, because we removed three of them. We shouldn’t forget to also update the topic labels and remove the unwanted topics:

[50]:
new_topic_labels = np.delete(topic_labels, uninform_topics)
new_topic_labels
[50]:
array(['2_germany_german', '3_trump_president', '5_flight_air',
       '6_report_agency', '7_company_reform', '8_intelligence_russia',
       '9_party_vote', '10_al_syria', '11_play_game', '12_parent_child',
       '13_economic_meeting', '14_say_committee', '15_northern_election',
       '16_mr_damage', '17_white_trump', '18_minister_election',
       '19_police_shoot', '20_percent_limit', '21_attack_group',
       '22_space_european', '23_country_love', '24_industry_service',
       '25_sanction_visit', '26_judge_victim', '27_man_new', '28_feel_mp',
       '29_club_sell', '30_china_chinese', '31_company_investment',
       '32_britain_democratic', '33_general_authority',
       '34_america_nation', '35_force_group', '37_north_discuss',
       '38_southern_deal', '39_russian_diplomat', '40_turkish_country',
       '41_bill_house', '42_win_anti', '43_bank_investor', '44_note_7',
       '45_day_morning', '46_support_provide', '47_family_refugee',
       '48_body_north', '49_fire_death', '50_officer_police'],
      dtype='<U23')

Displaying and exporting topic modeling results

The topicmod.model_io module provides several functions for displaying and exporting topic modeling results, i.e. results derived from the document-topic and topic-word distribution of a given topic model.

We already used ldamodel_top_topic_words briefly, which generates a dataframe with the top words from a topic-word distribution. You can also use the topic-word relevance matrix instead. With top_n we can control the number of top words:

[51]:
# using relevance matrix here and showing only the first 3 topics
ldamodel_top_topic_words(topic_word_rel, vocab_bg, top_n=5)[:3]
[51]:
rank_1 rank_2 rank_3 rank_4 rank_5
topic
topic_1 error (-0.8448) case (-0.8618) stage (-0.897) statement (-0.9278) detail (-0.9412)
topic_2 germany (-0.1098) german (-0.9475) linkedin (-1.021) tumblr (-1.021) stumble (-1.021)
topic_3 trump (0.02926) president (-0.4054) course (-0.9024) play (-1.059) fact (-1.187)

If you’re interested in the top topics for each word/token, you can use ldamodel_top_word_topics. Here, we generate the top five topics for each token in the vocabulary, but only display the output for four specific words. Instead of the generic "topic_..." topic names, we additionally pass our previously generated topic labels topic_labels:

[52]:
from tmtoolkit.topicmod.model_io import ldamodel_top_word_topics

top_word_topics = ldamodel_top_word_topics(topic_word_rel, vocab_bg,
                                           top_n=5, topic_labels=topic_labels)
top_word_topics[top_word_topics.index.isin(['china', 'chinese', 'russia', 'russian'])]
[52]:
rank_1 rank_2 rank_3 rank_4 rank_5
token
china 30_china_chinese (0.4125) 37_north_discuss (-2.959) 34_america_nation (-4.323) 42_win_anti (-5.29) 4_circumstance_describe (-5.365)
chinese 30_china_chinese (-0.3812) 22_space_european (-1.433) 42_win_anti (-4.893) 4_circumstance_describe (-4.969) 44_note_7 (-5.059)
russia 8_intelligence_russia (-0.539) 25_sanction_visit (-0.907) 39_russian_diplomat (-0.9549) 42_win_anti (-1.286) 31_company_investment (-3.681)
russian 39_russian_diplomat (-0.1902) 8_intelligence_russia (-1.083) 25_sanction_visit (-1.224) 42_win_anti (-2.033) 24_industry_service (-2.628)

Note that the values in parentheses are the corresponding values from the relevance matrix for that word in that topic. They’re negative because of the log transformation that is applied in the topic-word relevance measure.

Similar functions can be used for the document-topic distribution: ldamodel_top_doc_topics gives the top topics per document and ldamodel_top_topic_docs gives the top documents per topic. Here, top_n controls the number of top-ranked topics or documents to return, respectively. This time, we use the filtered document-topic distribution new_doc_topic:

[53]:
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics

ldamodel_top_doc_topics(new_doc_topic, doc_labels_bg, top_n=3,
                        topic_labels=new_topic_labels)[:5]
[53]:
rank_1 rank_2 rank_3
document
NewsArticles-1032 9_party_vote (0.6419) 2_germany_german (0.1385) 18_minister_election (0.1008)
NewsArticles-1036 21_attack_group (0.3321) 6_report_agency (0.2325) 2_germany_german (0.1495)
NewsArticles-104 38_southern_deal (0.6322) 32_britain_democratic (0.2544) 24_industry_service (0.002519)
NewsArticles-1043 42_win_anti (0.6779) 10_al_syria (0.1851) 6_report_agency (0.0619)
NewsArticles-1048 50_officer_police (0.5356) 45_day_morning (0.2608) 6_report_agency (0.1974)

And now for the top documents per topic:

[54]:
from tmtoolkit.topicmod.model_io import ldamodel_top_topic_docs

ldamodel_top_topic_docs(new_doc_topic, doc_labels_bg, top_n=3,
                        topic_labels=new_topic_labels)[:5]
[54]:
rank_1 rank_2 rank_3
topic
2_germany_german NewsArticles-293 (0.8348) NewsArticles-879 (0.4695) NewsArticles-3119 (0.2692)
3_trump_president NewsArticles-3140 (0.8479) NewsArticles-19 (0.2168) NewsArticles-3730 (0.172)
5_flight_air NewsArticles-1539 (0.6423) NewsArticles-1982 (0.6268) NewsArticles-526 (0.4505)
6_report_agency NewsArticles-1545 (0.8991) NewsArticles-3125 (0.6466) NewsArticles-2269 (0.4598)
7_company_reform NewsArticles-2799 (0.8006) NewsArticles-461 (0.5785) NewsArticles-2613 (0.4432)

There are also two functions that generate dataframes for the full topic-word and document-topic distributions: ldamodel_full_topic_words and ldamodel_full_doc_topics. The output of both functions is naturally quite big, unless you’re working with a “toy dataset”.

[55]:
from tmtoolkit.topicmod.model_io import ldamodel_full_topic_words

df_topic_word = ldamodel_full_topic_words(new_topic_word,
                                          vocab_bg,
                                          row_labels=new_topic_labels)
# displaying only the first 5 topics with 10 tokens
# from sorted vocabulary list (tokens 120 to 129)
df_topic_word.iloc[:5, 120:130]
[55]:
care carry case cause center central centre century certain chairman
0 0.000256 0.000256 0.002820 0.005383 0.000256 0.000256 0.000256 0.000256 0.000256 0.000256
1 0.000278 0.000278 0.000278 0.000278 0.000278 0.000278 0.000278 0.000278 0.000278 0.003063
2 0.000465 0.000465 0.000465 0.000465 0.000465 0.000465 0.000465 0.000465 0.000465 0.000465
3 0.000160 0.000160 0.000160 0.000160 0.000160 0.000160 0.000160 0.000160 0.000160 0.000160
4 0.000162 0.000162 0.003409 0.000162 0.000162 0.000162 0.000162 0.001785 0.000162 0.001785
[56]:
from tmtoolkit.topicmod.model_io import ldamodel_full_doc_topics

df_doc_topic = ldamodel_full_doc_topics(new_doc_topic,
                                        doc_labels_bg,
                                        topic_labels=new_topic_labels)
# displaying only the first 3 documents with the first
# 5 topics
df_doc_topic.iloc[:3, :5]
[56]:
_doc 2_germany_german 3_trump_president 5_flight_air 6_report_agency
0 NewsArticles-1032 0.138543 0.000126 0.000126 0.000126
1 NewsArticles-1036 0.149498 0.000083 0.000083 0.232506
2 NewsArticles-104 0.002519 0.002519 0.002519 0.002519

For quick inspection of topics there’s also a pair of print functions. We already used print_ldamodel_topic_words, but we haven’t tried print_ldamodel_doc_topics yet. This prints the top_n most probable topics for each document:

[57]:
from tmtoolkit.topicmod.model_io import print_ldamodel_doc_topics

# subsetting new_doc_topic and doc_labels to get only the first
# five documents
print_ldamodel_doc_topics(new_doc_topic[:5, :], doc_labels_bg[:5],
                          val_labels=new_topic_labels)
NewsArticles-1032
> #1. 9_party_vote (0.641877)
> #2. 2_germany_german (0.138543)
> #3. 18_minister_election (0.100793)
NewsArticles-1036
> #1. 21_attack_group (0.332116)
> #2. 6_report_agency (0.232506)
> #3. 2_germany_german (0.149498)
NewsArticles-104
> #1. 38_southern_deal (0.632242)
> #2. 32_britain_democratic (0.254408)
> #3. 24_industry_service (0.002519)
NewsArticles-1043
> #1. 42_win_anti (0.677856)
> #2. 10_al_syria (0.185094)
> #3. 6_report_agency (0.061903)
NewsArticles-1048
> #1. 50_officer_police (0.535578)
> #2. 45_day_morning (0.260814)
> #3. 6_report_agency (0.197407)

You can also export the results of a topic model to an Excel file using save_ldamodel_summary_to_excel. The resulting Excel file will contain the following sheets:

  • top_doc_topics_vals: document-topic distribution with probabilities of top topics per document

  • top_doc_topics_labels: document-topic distribution with labels of top topics per document

  • top_doc_topics_labelled_vals: document-topic distribution combining probabilities and labels of top topics per document (e.g. "topic_12 (0.21)")

  • top_topic_word_vals: topic-word distribution with probabilities of top words per topic

  • top_topic_word_labels: topic-word distribution with the top words per topic (e.g. "politics")

  • top_topic_words_labelled_vals: topic-word distribution combining probabilities and top words per topic (e.g. "politics (0.08)")

  • optional if dtm is given – marginal_topic_distrib: marginal topic distribution

In addition to saving the output to the specified Excel file, the function also returns a dict with the sheets and their data.

[58]:
from tmtoolkit.topicmod.model_io import save_ldamodel_summary_to_excel

sheets = save_ldamodel_summary_to_excel('data/news_articles_100.xlsx',
                                        new_topic_word, new_doc_topic,
                                        doc_labels_bg, vocab_bg,
                                        dtm = dtm_bg,
                                        topic_labels = new_topic_labels)
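
The returned sheets dict maps the sheet names listed above to their data, so you can also inspect the summary directly in Python, e.g.:

# the keys correspond to the sheet names listed above
list(sheets.keys())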

To quickly store a topic model to disk for sharing or loading it again at a later point in time, there are save_ldamodel_to_pickle and load_ldamodel_from_pickle. The function for saving takes a path to a pickle file to create (or overwrite), a topic model object (such as the LDA instance best_tm, but you could also pass a tuple like (new_doc_topic, new_topic_word)), the corresponding vocabulary and document labels, and optionally the DTM that was used to create the topic model. The function for loading the data returns the saved data as a dict. We will only show the dict’s keys here, as the data itself is too large to be printed:

[59]:
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle, \
    load_ldamodel_from_pickle

save_ldamodel_to_pickle('data/news_articles_100.pickle',
                        best_tm, vocab_bg, doc_labels_bg,
                        dtm = dtm_bg)

loaded = load_ldamodel_from_pickle('data/news_articles_100.pickle')
loaded.keys()
[59]:
dict_keys(['model', 'vocab', 'doc_labels', 'dtm'])

Visualizing topic models

The topicmod.visualize module contains several functions to visualize topic models and evaluation results. We’ve already used plot_eval_results during topic model evaluation so we’ll now focus on visualizing topic models.

Heatmaps

Let’s start with heatmap visualizations of document-topic or topic-word distributions from our topic model. This can be done with plot_doc_topic_heatmap and plot_topic_word_heatmap respectively. Both functions draw on a matplotlib figure and Axes object, which you must create before using these functions.

Heatmap visualizations essentially shade the cells of a 2D matrix (like the document-topic or topic-word distribution) according to their value, i.e. the respective probability of a topic in a given document or of a word in a given topic. Since these matrices are usually quite large, i.e. with hundreds of rows and/or columns, it doesn’t make sense to plot a heatmap of the whole matrix, but rather of a certain subset of interest. When we want to visualize a document-topic distribution, we can optionally select a subset of the documents with the which_documents parameter and a subset of the topics with the which_topics parameter. Let’s first draw a heatmap of a subset of documents across all topics:

[60]:
import matplotlib.pyplot as plt
from tmtoolkit.topicmod.visualize import plot_doc_topic_heatmap

# create a figure of certain size and
# Axes object to draw on
fig, ax = plt.subplots(figsize=(32, 8))

# randomly selecting a subset of documents
which_docs = random.sample(doc_labels_bg, 5)

plot_doc_topic_heatmap(fig, ax, new_doc_topic, doc_labels_bg,
                       topic_labels=new_topic_labels,
                       which_documents=which_docs);
_images/topic_modeling_111_0.png
[61]:
fig, ax = plt.subplots(figsize=(6, 8))

# randomly selecting a subset of topics
which_topics = random.sample(list(new_topic_labels), 10)

plot_doc_topic_heatmap(fig, ax, new_doc_topic, doc_labels_bg,
                       topic_labels=new_topic_labels,
                       which_documents=which_docs,
                       which_topics=which_topics);
_images/topic_modeling_112_0.png

Similarly, we can work with plot_topic_word_heatmap to visualize a topic-word distribution. We can also select a subset of topics and words from the vocabulary:

[62]:
from tmtoolkit.topicmod.visualize import plot_topic_word_heatmap

fig, ax = plt.subplots(figsize=(12, 8))

which_words = ['may', 'european', 'referendum', 'brexit',
               'eu', 'uk', 'britain', 'company', 'trade', 'growth']

plot_topic_word_heatmap(fig, ax, new_topic_word, vocab_bg,
                        topic_labels=new_topic_labels,
                        which_topics=which_topics,
                        which_words=which_words);
_images/topic_modeling_114_0.png

Note that there’s also a generic heatmap plotting function plot_heatmap for any kind of 2D matrices.

Word clouds

Thanks to the wordcloud package, topic-word and document-topic distributions can also be visualized as “word clouds” with tmtoolkit. The function generate_wordclouds_for_topic_words generates a word cloud for each topic by scaling each of a topic’s words by its probability (weight). You can choose to display only the top top_n words per topic. The result of this function is a dictionary mapping topic labels to the respective word cloud images.

[63]:
from tmtoolkit.topicmod.visualize import generate_wordclouds_for_topic_words

# some options for wordcloud output
img_w = 400   # image width
img_h = 300   # image height

topic_clouds = generate_wordclouds_for_topic_words(
    new_topic_word, vocab_bg,
    top_n=20, topic_labels=new_topic_labels,
    width=img_w, height=img_h
)

# show all generated word clouds
topic_clouds.keys()
[63]:
dict_keys(['2_germany_german', '3_trump_president', '5_flight_air', '6_report_agency', '7_company_reform', '8_intelligence_russia', '9_party_vote', '10_al_syria', '11_play_game', '12_parent_child', '13_economic_meeting', '14_say_committee', '15_northern_election', '16_mr_damage', '17_white_trump', '18_minister_election', '19_police_shoot', '20_percent_limit', '21_attack_group', '22_space_european', '23_country_love', '24_industry_service', '25_sanction_visit', '26_judge_victim', '27_man_new', '28_feel_mp', '29_club_sell', '30_china_chinese', '31_company_investment', '32_britain_democratic', '33_general_authority', '34_america_nation', '35_force_group', '37_north_discuss', '38_southern_deal', '39_russian_diplomat', '40_turkish_country', '41_bill_house', '42_win_anti', '43_bank_investor', '44_note_7', '45_day_morning', '46_support_provide', '47_family_refugee', '48_body_north', '49_fire_death', '50_officer_police'])

Let’s select a specific topic and display its word cloud:

[64]:
topic_clouds['50_officer_police']
[64]:
_images/topic_modeling_119_0.png

The same can be done for the document-topic distribution using generate_wordclouds_for_document_topics. Here, a word cloud is generated for each document, containing the top_n most probable topics for that document:

[65]:
from tmtoolkit.topicmod.visualize import generate_wordclouds_for_document_topics

doc_clouds = generate_wordclouds_for_document_topics(
    new_doc_topic, doc_labels_bg, topic_labels=new_topic_labels,
    top_n=5, width=img_w, height=img_h)

# show only the first 5 documents for
# which word clouds were generated
list(doc_clouds.keys())[:5]
[65]:
['NewsArticles-1032',
 'NewsArticles-1036',
 'NewsArticles-104',
 'NewsArticles-1043',
 'NewsArticles-1048']

To display a specific document’s topic word cloud:

[66]:
doc_clouds['NewsArticles-1032']
[66]:
_images/topic_modeling_123_0.png

We can write the generated images as PNG files to a folder on disk. Here, we write all word clouds from topic_clouds to 'data/tm_wordclouds/':

[67]:
from tmtoolkit.topicmod.visualize import write_wordclouds_to_folder

write_wordclouds_to_folder(topic_clouds, 'data/tm_wordclouds/')

Interactive visualization with pyLDAvis

The pyLDAvis package offers a great interactive tool to explore a topic model. The tmtoolkit function parameters_for_ldavis allows you to prepare your topic model data for this package so that you can easily pass it on to pyLDAvis.

[68]:
from tmtoolkit.topicmod.visualize import parameters_for_ldavis

ldavis_params = parameters_for_ldavis(new_topic_word,
                                      new_doc_topic,
                                      dtm_bg,
                                      vocab_bg)

If you have installed that package, you can now open the interactive visualization with the following lines of code in a Jupyter notebook:

import pyLDAvis
pyLDAvis.display(pyLDAvis.prepare(**ldavis_params))