Interoperability with R

There are many popular R packages for text mining, topic modeling and NLP, such as tm or topicmodels. If you need to implement parts of your work in Python with tmtoolkit and other parts in R, you can do that quite easily.

First of all, you can exchange all tabular data between Python and R via common file formats such as CSV or Excel. See for example the sections on tabular tokens output or exporting topic modeling results, and check out the load_corpus_from_tokens_table function.
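
For illustration, a round trip via CSV could look like the following sketch. It assumes an existing Corpus object corp; the language argument is an assumption here and is passed on to the Corpus constructor:

import pandas as pd
import tmtoolkit.corpus as c

# export: write the corpus' tokens to a CSV file, one row per token
c.tokens_table(corp).to_csv('data/tokens.csv', index=False)

# import: re-create a Corpus object from such a tokens table
toktbl = pd.read_csv('data/tokens.csv')
corp2 = c.load_corpus_from_tokens_table(toktbl, language='en')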

However, if you only want to transfer a document-term matrix (DTM) that you generated with tmtoolkit into R, or vice versa, the most efficient way is to store this matrix along with all necessary metadata in an RDS file, as explained in the following sections.

Note

You will need to install tmtoolkit with the “rinterop” option in order to use the functions explained in this chapter: pip install tmtoolkit[rinterop]. These functions are available since version 0.12.0.
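
You can check the installed version via importlib.metadata:

from importlib.metadata import version

print(version('tmtoolkit'))   # should report 0.12.0 or newer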

Saving a (sparse) document-term matrix to an RDS file

A common scenario is that you used tmtoolkit for preprocessing your text corpus and generated a DTM along with document labels and the corpus vocabulary. For further processing, you want to switch to R, e.g. for topic modeling with the topicmodels package. You can do so using the save_dtm_to_rds function.

First, we generate a DTM from some sample data:

[1]:
import tmtoolkit.corpus as c

corp = c.Corpus.from_builtin_corpus('en-News100', sample=10)
c.print_summary(corp)
Corpus with 10 documents in English
> News100-3088 (306 tokens): Rose McGowan Seeking Help From Department Of Justi...
> News100-3160 (160 tokens): SpaceX capsule returns space station science to Ea...
> News100-3232 (159 tokens): Assad ally Russia summons Israeli diplomat over Sy...
> News100-3687 (240 tokens): FitzPatrick trial now expected to conclude in May ...
> News100-2510 (366 tokens): Murder trial told victim suffered nine blows to he...
> News100-2462 (124 tokens): US Federal Reserve System raises base interest rat...
> News100-3575 (328 tokens): Cyclone Debbie makes landfall with destructive win...
> News100-755 (768 tokens): World Cup 2026 : Uefa will ask for 16 places for E...
> News100-161 (165 tokens): Syrian army gaining ground in effort to re - take ...
> News100-2338 (680 tokens): ' This Is Us ' Makes Surprising Reveal About Jack ...
total number of tokens: 3296 / vocabulary size: 1244
[2]:
c.lemmatize(corp)                                  # replace each token with its lemma
c.to_lowercase(corp)                               # normalize all tokens to lowercase
c.filter_clean_tokens(corp, remove_numbers=True)   # remove punctuation, stopwords and numeric tokens
c.remove_common_tokens(corp, df_threshold=0.9)     # drop tokens occurring in more than 90% of documents
c.remove_uncommon_tokens(corp, df_threshold=0.1)   # drop tokens occurring in less than 10% of documents

c.print_summary(corp)
Corpus with 10 documents in English
> News100-3088 (38 tokens): send group online take situation report tape onlin...
> News100-3160 (19 tokens): return space space sunday coast set international ...
> News100-3232 (21 tokens): strike say moscow week syrian force syrian preside...
> News100-3687 (41 tokens): trial expect conclude trial bank return court expe...
> News100-2510 (50 tokens): trial tell victim blow die force central criminal ...
> News100-2462 (13 tokens): expect /tass/. central bank point open committee s...
> News100-3575 (47 tokens): make great report area go begin cross state coast ...
> News100-755 (80 tokens): ask time share ask give expand new look begin grou...
> News100-161 (25 tokens): syrian army effort moscow /tass/. syrian army way ...
> News100-2338 (82 tokens): make reveal tuesday night fan wait new set learn d...
total number of tokens: 416 / vocabulary size: 124
[3]:
dtm, doc_labels, vocab = c.dtm(corp, return_doc_labels=True, return_vocab=True)
[4]:
print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)
first 10 document labels:
['News100-161', 'News100-2338', 'News100-2462', 'News100-2510', 'News100-3088', 'News100-3160', 'News100-3232', 'News100-3575', 'News100-3687', 'News100-755']
first 10 vocabulary tokens:
['/tass/.', 'agency', 'ago', 'allow', 'area', 'army', 'ask', 'authority', 'away', 'bank']
DTM shape:
(10, 124)

The DTM is stored as a sparse matrix. It’s highly recommended to use a sparse matrix representation, especially when you’re working with large text corpora: since a DTM usually consists mostly of zeros, the sparse format saves a lot of memory, as the quick comparison below shows.

[5]:
dtm
[5]:
<10x124 sparse matrix of type '<class 'numpy.int32'>'
        with 291 stored elements in Compressed Sparse Row format>
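
To quantify this, we can compare the memory footprint of the sparse representation with that of an equivalent dense array. This is a rough sketch; the difference grows dramatically with real-world corpus sizes:

import numpy as np

# memory for a dense array of the same shape and dtype ...
dense_nbytes = int(np.prod(dtm.shape)) * dtm.dtype.itemsize
# ... vs. the three arrays that make up the CSR representation
sparse_nbytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes
print(f'dense: {dense_nbytes:,} bytes / sparse: {sparse_nbytes:,} bytes')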

Now we save the DTM, along with the document labels and the vocabulary, as a sparse matrix to an RDS file that we can then load into R:

[6]:
import os
from tmtoolkit.bow.dtm import save_dtm_to_rds

rds_file = os.path.join('data', 'dtm.RDS')
print(f'saving DTM, document labels and vocabulary to file "{rds_file}"')
save_dtm_to_rds(rds_file, dtm, doc_labels, vocab)

saving DTM, document labels and vocabulary to file "data/dtm.RDS"
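
If you like, you can verify the round trip from within Python before switching over to R, using the read_dtm_from_rds function that is introduced in the next section. Note that the matrix may come back in a different sparse format or dtype than it was saved with:

from tmtoolkit.bow.dtm import read_dtm_from_rds

# read the file back and compare it with the original data
dtm2, doc_labels2, vocab2 = read_dtm_from_rds(rds_file)
assert (dtm2 != dtm).nnz == 0                    # same matrix entries
assert list(doc_labels2) == list(doc_labels)     # same document labels
assert list(vocab2) == list(vocab)               # same vocabulary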

The following R code would load this DTM from the RDS file and fit a topic model via LDA with 20 topics:

library(Matrix)       # for sparseMatrix in RDS file
library(topicmodels)  # for LDA()
library(slam)         # for as.simple_triplet_matrix()

# load data
dtm <- readRDS('data/dtm.RDS')
class(dtm)
dtm  # sparse matrix with document labels as row names, vocabulary as column names

# convert sparse matrix to triplet format required for LDA
dtm <- as.simple_triplet_matrix(dtm)

# fit a topic model
topicmodel <- LDA(dtm, k = 20, method = 'Gibbs')

# investigate the topics
terms(topicmodel, 5)

Loading a (sparse) document-term matrix from an RDS file

The opposite direction is also possible. For example, you may have preprocessed a text corpus in R and generated a (sparse) DTM along with its document labels and vocabulary. You can write this data to an RDS file and load it into Python/tmtoolkit. The following R code shows how to generate a sparse DTM and store it in data/dtm2.RDS:

library(Matrix)       # for sparseMatrix
library(tm)           # for DocumentTermMatrix

data("crude")

dtm <- DocumentTermMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE))

dtm_out <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v, dims = dim(dtm),
                        dimnames = dimnames(dtm))

saveRDS(dtm_out, 'data/dtm2.RDS')

We can now load the DTM along with its document labels and vocabulary from this RDS file:

[7]:
import os.path
from tmtoolkit.bow.dtm import read_dtm_from_rds


rds_file = os.path.join('data', 'dtm2.RDS')
print(f'loading DTM, document labels and vocabulary from file "{rds_file}"')
dtm, doc_labels, vocab = read_dtm_from_rds(rds_file)

print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)
loading DTM, document labels and vocabulary from file "data/dtm2.RDS"
first 10 document labels:
['127', '144', '191', '194', '211', '236', '237', '242', '246', '248']
first 10 vocabulary tokens:
['100000', '108', '111', '115', '12217', '1232', '1381', '13member', '13nation', '150']
DTM shape:
(20, 1000)
[8]:
dtm
[8]:
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 1738 stored elements in Compressed Sparse Column format>

Note that the DTM was loaded as a floating point matrix, but it makes more sense to represent the term frequencies as integers, since they are essentially counts:

[9]:
dtm = dtm.astype('int')
dtm
[9]:
<20x1000 sparse matrix of type '<class 'numpy.int64'>'
        with 1738 stored elements in Compressed Sparse Column format>

We could now further process and analyze this DTM with tmtoolkit. For example, we can display the three most frequent tokens per document:

[10]:
from tmtoolkit.bow.bow_stats import sorted_terms_table

# selecting only the first 5 documents
sorted_terms_table(dtm[:5, :], vocab=vocab, doc_labels=doc_labels[:5], top_n=3)
[10]:
              token  value
doc rank
127 1           oil      5
    2        prices      3
    3          said      3
144 1          opec     13
    2           oil     12
    3          said     11
191 1      canadian      2
    2        texaco      2
    3         crude      2
194 1         crude      3
    2         price      2
    3          west      2
211 1          said      3
    2     estimates      2
    3         trust      2
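
Other functions from tmtoolkit.bow.bow_stats operate on this DTM as well. For example, you can apply a tf-idf weighting; the sketch below relies on the function's default weighting scheme:

from tmtoolkit.bow.bow_stats import tfidf

# transform the raw counts to tf-idf weights and show the three
# highest-weighted tokens in the first document
tfidf_mat = tfidf(dtm)
sorted_terms_table(tfidf_mat[:1, :], vocab=vocab, doc_labels=doc_labels[:1], top_n=3)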