Interoperability with R

There are many popular R packages for text mining, topic modeling and NLP, such as tm or topicmodels. If you need to implement parts of your work in Python with tmtoolkit and other parts in R, you can do that quite easily.

First of all, you can exchange all tabular data between Python and R using common formats like CSV or Excel. See for example the sections on tabular tokens output or on exporting topic modeling results, and check out the load_corpus_from_tokens_table function.
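For instance, a tokens table can be written to a CSV file with pandas and later be turned back into a corpus. Here is a minimal sketch; the file path is only an example, and we assume load_corpus_from_tokens_table accepts the same language option as the Corpus constructor:

import pandas as pd
import tmtoolkit.corpus as c

corp = c.Corpus.from_builtin_corpus('en-News100', sample=10)

# export the tokens and their attributes as a dataframe and write it to CSV
c.tokens_table(corp).to_csv('data/tokens.csv', index=False)

# ... later, e.g. after processing the CSV file in R, re-create a corpus from it
corp2 = c.load_corpus_from_tokens_table(pd.read_csv('data/tokens.csv'), language='en')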

However, if you only want to load a document-term matrix (DTM) that you generated with tmtoolkit into R or vice versa, the most efficient way is to store this matrix along with all necessary metadata to an RDS file as explained in the following section.

Note

You will need to install tmtoolkit with the “rinterop” option in order to use the functions explained in this chapter: pip install tmtoolkit[rinterop]. These functions are available since version 0.12.0.

Saving a (sparse) document-term matrix to an RDS file

A common scenario is that you used tmtoolkit for preprocessing your text corpus and generated a DTM along with document labels and the corpus vocabulary. For further processing you want to use R, e.g. for topic modeling with the topicmodels package. You can do so by using the save_dtm_to_rds function.

First, we generate a DTM from some sample data:

[1]:
import tmtoolkit.corpus as c

corp = c.Corpus.from_builtin_corpus('en-News100', sample=10)
c.print_summary(corp)
Corpus with 10 documents in English
> News100-2128 (67 tokens): Brexit : Scottish leader seeks UK split as EU divo...
> News100-3575 (328 tokens): Cyclone Debbie makes landfall with destructive win...
> News100-1943 (1090 tokens): Donald Trump : ' Total witch hunt ' over attorney ...
> News100-2515 (167 tokens): Four killed in Austrian avalanche    Four Swiss me...
> News100-1807 (217 tokens): Russia to repair thirty Indian Mi-17 - 1B helicopt...
> News100-2462 (124 tokens): US Federal Reserve System raises base interest rat...
> News100-879 (400 tokens): German gold repatriation ahead of schedule    Germ...
> News100-1026 (358 tokens): Kremlin gives no comment on row involving Russian ...
> News100-877 (1061 tokens): Hunting for fake news    Fearing Russian influence...
> News100-2671 (387 tokens): GENIVI Alliance Chosen by Google Summer of Code Pr...
total number of tokens: 4199 / vocabulary size: 1400
[2]:
c.lemmatize(corp)
c.to_lowercase(corp)
c.filter_clean_tokens(corp, remove_numbers=True)
c.remove_common_tokens(corp, df_threshold=0.9)
c.remove_uncommon_tokens(corp, df_threshold=0.1)

c.print_summary(corp)
Corpus with 10 documents in English
> News100-2128 (13 tokens): seek eu london news member reject government eu ci...
> News100-3575 (52 tokens): make high great report lash area go begin state ne...
> News100-1943 (247 tokens): donald trump total hunt president lash russia meet...
> News100-2515 (28 tokens): man sweep police region man area near high find bu...
> News100-1807 (42 tokens): russia lot russian moscow march /tass/. russian mo...
> News100-2462 (33 tokens): federal reserve rate rate increase expect expert w...
> News100-879 (74 tokens): german schedule germany central bank bring home fo...
> News100-1026 (114 tokens): give involve russian ambassador trump advisor flyn...
> News100-877 (180 tokens): hunt news fear russian authority plan election han...
> News100-2671 (60 tokens): program march open community open source vehicle a...
total number of tokens: 843 / vocabulary size: 205
[3]:
dtm, doc_labels, vocab = c.dtm(corp, return_doc_labels=True, return_vocab=True)
[4]:
print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)
first 10 document labels:
['News100-1026', 'News100-1807', 'News100-1943', 'News100-2128', 'News100-2462', 'News100-2515', 'News100-2671', 'News100-3575', 'News100-877', 'News100-879']
first 10 vocabulary tokens:
['/tass/.', 'able', 'accord', 'add', 'advisor', 'aftermath', 'agency', 'allegation', 'allow', 'ambassador']
DTM shape:
(10, 205)
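The three objects belong together: row i of the DTM corresponds to doc_labels[i] and column j to vocab[j]. As a quick sketch, we can look up the frequency of a single token in a single document (both taken from the output above):

# row index of the document and column index of the token
i = doc_labels.index('News100-1026')
j = vocab.index('ambassador')

# term frequency of 'ambassador' in document 'News100-1026'
print(dtm[i, j])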

The DTM is stored as a sparse matrix. A sparse matrix representation is highly recommended, especially when you're working with large text corpora.

[5]:
dtm
[5]:
<10x205 sparse matrix of type '<class 'numpy.int32'>'
        with 478 stored elements in Compressed Sparse Row format>
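To see why the sparse representation pays off, you can compare its memory footprint to that of an equivalent dense array. A quick sketch for a CSR matrix (for this small sample the difference is modest, but it grows rapidly with corpus size):

# memory used by the sparse representation: data plus the two index arrays
sparse_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes

# memory a dense array of the same shape and dtype would require
dense_bytes = dtm.shape[0] * dtm.shape[1] * dtm.dtype.itemsize

print(f'sparse: {sparse_bytes:,} bytes / dense: {dense_bytes:,} bytes')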

Now we save the DTM, along with the document labels and the vocabulary, as a sparse matrix to an RDS file that we can load into R:

[6]:
import os
from tmtoolkit.bow.dtm import save_dtm_to_rds

rds_file = os.path.join('data', 'dtm.RDS')
print(f'saving DTM, document labels and vocabulary to file "{rds_file}"')
save_dtm_to_rds(rds_file, dtm, doc_labels, vocab)

saving DTM, document labels and vocabulary to file "data/dtm.RDS"

The following R code would load this DTM from the RDS file and fit a topic model via LDA with 20 topics:

library(Matrix)       # for sparseMatrix in RDS file
library(topicmodels)  # for LDA()
library(slam)         # for as.simple_triplet_matrix()

# load data
dtm <- readRDS('data/dtm.RDS')
class(dtm)
dtm  # sparse matrix with document labels as row names, vocabulary as column names

# convert sparse matrix to triplet format required for LDA
dtm <- as.simple_triplet_matrix(dtm)

# fit a topic model
topicmodel <- LDA(dtm, k = 20, method = 'Gibbs')

# investigate the topics
terms(topicmodel, 5)

Loading a (sparse) document-term matrix from an RDS file

The opposite direction is also possible. For example, you may have preprocessed a text corpus in R and generated a (sparse) DTM along with its document labels and vocabulary. You can write this data to an RDS file and load it into Python/tmtoolkit. The following R code shows how to generate a sparse DTM and store it in data/dtm2.RDS:

library(Matrix)       # for sparseMatrix
library(tm)           # for DocumentTermMatrix

data("crude")

dtm <- DocumentTermMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE))

dtm_out <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v, dims = dim(dtm),
                        dimnames = dimnames(dtm))

saveRDS(dtm_out, 'data/dtm2.RDS')

We can now load the DTM along with its document labels and vocabulary from this RDS file:

[7]:
import os.path
from tmtoolkit.bow.dtm import read_dtm_from_rds


rds_file = os.path.join('data', 'dtm2.RDS')
print(f'loading DTM, document labels and vocabulary from file "{rds_file}"')
dtm, doc_labels, vocab = read_dtm_from_rds(rds_file)

print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)
loading DTM, document labels and vocabulary from file "data/dtm2.RDS"
first 10 document labels:
['127', '144', '191', '194', '211', '236', '237', '242', '246', '248']
first 10 vocabulary tokens:
['100000', '108', '111', '115', '12217', '1232', '1381', '13member', '13nation', '150']
DTM shape:
(20, 1000)
[8]:
dtm
[8]:
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 1738 stored elements in Compressed Sparse Column format>

Note that the DTM was loaded as a floating point matrix, but it makes more sense to represent the term frequencies as integers, since they are essentially counts:

[9]:
dtm = dtm.astype('int')
dtm
[9]:
<20x1000 sparse matrix of type '<class 'numpy.int64'>'
        with 1738 stored elements in Compressed Sparse Column format>
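Also note that the loaded matrix is in compressed sparse column (CSC) format, as shown above. Since we will slice it row-wise (i.e. by documents) in the next step, optionally converting it to compressed sparse row (CSR) format can speed up such access:

# optional: row-oriented format for faster per-document (row) slicing
dtm = dtm.tocsr()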

We could now further process and analyze this DTM with tmtoolkit. For example, we can display the three most frequent tokens per document:

[10]:
from tmtoolkit.bow.bow_stats import sorted_terms_table

# selecting only the first 5 documents
sorted_terms_table(dtm[:5, :], vocab=vocab, doc_labels=doc_labels[:5], top_n=3)
[10]:
              token  value
doc rank
127 1           oil      5
    2        prices      3
    3          said      3
144 1          opec     13
    2           oil     12
    3          said     11
191 1      canadian      2
    2        texaco      2
    3         crude      2
194 1         crude      3
    2         price      2
    3          west      2
211 1          said      3
    2     estimates      2
    3         trust      2