
Master's Thesis
submitted for the academic degree of
Master of Arts
at the Faculty of Arts of the University of Zurich

Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles

Author: Parijat Ghoshal
Student ID: 09-716-010
Examiner: Prof. Dr. Martin Volk
Supervisor: Dr. Fabio Rinaldi
Institut für Computerlinguistik
Submission date: 24.06.2017


Abstract

In the biomedical domain, there is such an abundance of texts that gaining a thematic overview of them is a challenging endeavour. This is also due to the fact that many of these texts are unlabelled and cannot simply be assigned to a certain thematic domain; some texts remain thematically ambiguous, and sorting them neatly into thematic domains is impossible. It can therefore be helpful to implement an unsupervised algorithm to sort a corpus of unlabelled data into topics. In this Master's thesis, latent Dirichlet allocation is used to automatically generate topics from the corpus. Throughout this work, I create topic models based on articles from PubMed Central's Open Access Subset. With the help of these topic models, I then observe diachronic trends on three different levels. On the first level, I observe diachronic changes in the popularity of the topics themselves. On the second level, I check how the popularity of the topic words within a topic evolves throughout the corpus. On the third level, I observe the popularity of common words that belong to documents about a certain topic. Moreover, a companion website and a topic modeling pipeline are created as outputs of this project.


Acknowledgement

I would like to thank Dr. Fabio Rinaldi for his guidance, patience and understanding, and my parents for their unrelenting support. Finally, I would also like to thank Jo, who put up with everything else.


Contents

Abstract
Acknowledgement
Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
1.1 Motivation
1.2 Research Questions
1.3 Thesis Structure

2 Theoretical background
2.1 Topic Models
2.1.1 Precursors of Latent Dirichlet Allocation
2.1.2 Latent Dirichlet Allocation
2.2 Machine Learning Problem
2.3 Issues with Topic Models
2.3.1 Categories of Bad Topics
2.3.1.1 General and Specific Words
2.3.1.2 Mixed and Chained Topics
2.3.1.3 Identical Topics
2.3.1.4 Incomplete Stopword List
2.3.1.5 Nonsensical Topics
2.3.2 Topic Alignment
2.3.3 Topic Quality Evaluation
2.3.3.1 Human Evaluation
2.3.3.2 Topic Size
2.3.3.3 Topic Word Length
2.4 Improving Topic Models
2.4.1 Automatic Topic Model Labelling
2.4.1.1 Information Retrieval
2.4.1.2 Neural Embeddings
2.4.2 Text Preprocessing to Acquire Meaningful Topics

3 Previous work
3.1 Biomedical Topic Modeling
3.1.1 Ontology Term Mapping
3.1.2 Enriching LDA Output with External Data
3.1.3 Discovering Relationships between Diseases and Genes with Topic Modeling
3.1.4 Comprehensive Biomedical LDA Topics Example Source
3.2 Diachronic Topic Modeling
3.2.1 Early Work in the Modern Non-biomedical Domain
3.2.2 Observing Diachronic Changes and Author Influence in the Biomedical Domain
3.3 Brief Overview of Available Tools
3.3.1 MAchine Learning for LanguagE Toolkit (MALLET)
3.3.2 Stanford Topic Modeling Toolbox
3.3.3 Spark Machine Learning Library (MLlib)
3.3.4 R Libraries
3.3.5 Gensim

4 Methodology
4.1 Source of Data
4.2 Extracting Topic Models
4.2.1 Experiment 1: Exploring the Topics in the Corpus
4.2.1.1 Preprocessing
4.2.2 LDA Model Creation Parameters
4.2.2.1 Evaluation: 10 Topics Model
4.2.2.2 Evaluation: 20 Topics Model
4.2.2.3 Model Evaluation
4.2.2.4 Evaluation: 50 Topics Model
4.2.2.5 Evaluation: 100 Topics Model
4.2.2.6 Evaluation Experiment 1: All Models
4.2.3 Experiment 2: Edited Corpus and Modified Model Update Parameters
4.2.3.1 Preprocessing
4.2.3.2 Results
4.2.4 Experiment 3: Online Learning with Different Batch Sizes
4.2.4.1 Preprocessing
4.2.4.2 Results
4.2.5 Experiment 4: Reduced Vocabulary
4.2.5.1 Preprocessing
4.2.5.2 10 Topics Model
4.2.5.3 20 Topics Model
4.2.5.4 50 Topics Model
4.2.5.5 100 Topics Model
4.2.6 Experiment 5: Influence of POS Tags
4.2.6.1 Noun-Verb Corpus
4.2.6.2 Noun-Adjective Corpus
4.2.6.3 Noun-Verb-Adjective Corpus
4.2.7 Experiment 6: Extracting Models with Distinct Topics Using Topic Similarity
4.2.8 Experiment 7: Extracting Stable Models
4.2.9 Topic Labelling

5 Data and Topic Exploration
5.1 Document Topics Distribution
5.2 Data Exploration
5.2.1 Topic Distribution
5.2.2 Average Topic Probability
5.3 Observing Diachronic Trends Using Topic Models
5.4 Topic Exploration
5.4.1 Frequency of Topic Words in the Corpus
5.4.2 Diachronic Shifts within a Topic
5.5 Frequency of Popular Words within a Topic
5.5.1 Diachronic Popularity of Non-topic Word Related Terms

6 Results and Discussion
6.1 Research Question Nr. 1
6.2 Research Question Nr. 2
6.3 Research Question Nr. 3
6.4 Summary

7 Website
7.1 Generating the charts
7.2 Website sections
7.2.1 Observing diachronic trends in topics
7.2.2 Generating frequency of topic words in the corpus
7.2.3 Frequency of popular words within a topic

8 Diachronic topic modeling pipeline
8.1 Data extraction
8.1.1 Extract metadata
8.1.2 Extract text
8.1.2.1 Text preprocessing
8.1.2.1.1 POS tagging of the corpus
8.1.2.2 Token filtering
8.1.3 Corpus creation
8.2 LDA topic modeling
8.2.1 Dictionary creation
8.2.2 Editing the original dictionary
8.2.3 LDA corpus creation
8.2.4 LDA model creation
8.3 Data mapping
8.3.1 Mapping: creating yearly average topic probability
8.3.2 Mapping: generating relative frequencies for topic words
8.3.3 Mapping: generating relative frequencies for popular words in topic subcorpora
8.4 Other functions
8.5 Summary

9 Conclusion
9.1 Future Work

References

A Tables

Curriculum Vitae


List of Figures

1.1 Plate notation representing the LDA model (from Blei et al. [2003])
4.1 Number of articles published per year from 1950-2016 in the corpus of 150 thousand articles
4.2 Percentage of articles from the original corpus (1.5 million articles) per year from 1950-2016 that are in the corpus of 150 thousand articles
4.3 Average inter-topic similarity
5.1 Topic words in article (green: topic 34)
5.2 Topic words in article (green: topic 19, yellow: topic 28)
5.3 Topic probability distribution of documents of topic 19
5.4 Topic probability distribution of documents of topic 19, where topic probability >0.1
5.5 Topic probability distribution of documents of topics 6, 10, 39, and 41, where topic probability >0.1
5.6 Topic probability distribution of documents of topic 25, where topic probability >0.1
5.7 Average topic probability of documents from 2000 to 2015 of topics 10, 23, and 33
5.8 Average topic probability of documents from 2000 to 2015 of topics 11, 12, 17, 21, 28, and 43
5.9 Average topic probability of documents from 2000 to 2015 of topics 2, 5, and 25
5.10 Average topic probability of documents from 1980 to 2005 of topic 50
5.11 Average topic probability of documents from 2000 to 2015 of topics 13, 22, 34, and 38
5.12 Relative frequency of topic words for topic 13-woman-heart-pregnancy
5.13 Relative frequency of pregnancy-related words from topic 13
5.14 Relative frequency of heart disease-related words from topic 13
5.15 Relative frequency of topic words for topic 22-infection-virus-vaccine
5.16 Relative frequency of immunology-related words from topic 22 (group 2)
5.17 Relative frequency of immunology-related words from topic 22 (group 3)
5.18 Relative frequency of words from topic 13 (top 1-5 words)
5.19 Relative frequency of words from topic 13 (top 6-10 words)
5.20 Relative frequency of words from topic 22 (top 1-5 words)
5.21 Relative frequency of words from topic 22 (top 6-9 words)
7.1 Website: Part 1 User options
7.2 Website: Part 1 Example output for topics 2, 3, 4, 5
7.3 Website: Part 2 Example output for topic 13 (topics shown partially)
7.4 Website: Part 3 Example output for topic 13 (top 2-5 words shown)
8.1 Diachronic topic modeling pipeline


List of Tables

4.1 10 topics generated from a corpus of 1.5 million articles
4.2 A selection from the 20 topics generated from a corpus of 1.5 million articles
4.3 A selection from the 50 topics generated from a corpus of 1.5 million articles
4.4 A selection from the 100 topics generated from a corpus of 1.5 million articles
4.5 Identical topics generated from multiple topic models with different topic sizes
4.6 Topics from the 10 topic model from the noun corpus
4.7 Topics from the 20 topic model from the noun corpus
4.8 A selection from the 50 topics generated from the noun corpus
4.9 Focus on breast-cancer related topics from the 100 topics model from the noun corpus
4.10 15 topics from the 50 topic model from the noun corpus
4.11 15 topics from the 50 topic model from the noun-verb corpus
4.12 15 topics from the 50 topic model from the noun-adjective corpus
4.13 15 topics from the 50 topic model from the noun-verb-adjective corpus
4.14 Percentage of identical terms between the models
4.15 Number of unique words found in all the topics
4.16 Topic similarity based on the number of similar words over multiple passes
5.1 Topics 19, 28, 34
5.2 Fictitious topic probability distribution over multiple topics and documents
5.3 Four topics selected for data exploration
A.1 All 50 topics from the final model


List of Acronyms

HTML HyperText Markup Language

LDA latent Dirichlet allocation

NLP Natural Language Processing

NLTK Natural Language Toolkit

OCR Optical Character Recognition

POS Part-Of-Speech

XML eXtensible Markup Language


1 Introduction

In this chapter, I present the motivation for writing this Master's thesis and state the research questions that will be tackled in this work. Finally, I give an overview of the thesis, outlining the themes of the upcoming chapters.

1.1 Motivation

In the field of biomedical literature, thousands of articles are published every day. This is by no means an exaggeration: between 2012 and 2015, approximately 800,000 articles were published annually.1 Research and discoveries in the biomedical field are primarily found in scholarly publications; however, due to the aforementioned amount of literature being published, finding trends in the biomedical domain can be a challenge. Natural language processing (NLP) can consequently be of great use, because these academic publications are often published in machine-readable text formats. Huang and Lu [2016] mention in their article about community challenges in the biomedical field that collaborations between biomedical and NLP researchers have become commonplace, forming the field of research known as biomedical natural language processing (BioNLP). They also mention that NLP methods and text mining can be used for a multitude of tasks, such as constructing ontologies and curating databases.

For those interested in machine-learning approaches, the field of biomedical research publication is well suited for finding patterns in large amounts of data, due to the abundance of publications available.

PubMed offers full articles as well as abstracts of scientific articles written in the biomedical domain. The data provided on PubMed is quite useful for machine-learning approaches: firstly, the data is machine-readable, and secondly, metadata including authors and date of publication accompany the scientific texts.

1 Based on MEDLINE citation counts by year of publication: https://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html


Exploring diachronic trends in large corpora has been a particular interest of mine for quite a while, and I have worked on related projects both within the constraints of academia and in professional settings. Furthermore, working with biomedical data is undeniably fascinating in its own right. For these reasons, I decided to embark on this project with the aim of discovering diachronic trends in a large corpus of biomedical publications.

1.2 Research Questions

The research questions that shall be answered in this Master's thesis are:

1. Is it possible to detect temporal trends in a corpus using topics generated from a topic modeling algorithm?

2. Using topic modeling, can one detect diachronic changes within the words of a given topic throughout the entire corpus?

3. Can one use topic modeling to detect diachronic changes in term/word usage within documents that fall into a specific topic?

1.3 Thesis Structure

This Master's thesis is structured as follows: first, I provide a brief overview of the theoretical background for topic modeling in Chapter 2 and explain the topic modeling algorithm that I use to create the models; I also discuss the issues of this algorithm and possible strategies for circumventing them. In Chapter 3, I present previous work in the domain of diachronic and biomedical topic modeling, along with a brief overview of the available off-the-shelf tools for topic modeling. In Chapter 4, the entire methodology for creating the topic model from the initial corpus is provided. Chapter 5, which is about data and topic exploration, looks into the topic model and addresses the research questions. In Chapter 6, I discuss the results for the research questions. In Chapter 7, I introduce the companion website and explain its functionalities. In Chapter 8, I introduce the diachronic topic modeling pipeline that was used to create the topic model. Finally, in Chapter 9, I summarise the findings of my Master's thesis and mention possibilities for future work in this field.


2 Theoretical background

In this chapter, we look at the theoretical framework behind topic modeling. I give a brief overview of the precursors of latent Dirichlet allocation (LDA), then provide a brief theoretical overview of LDA itself and of the issues that one can face when using topic models. The problems of topic models are explained in detail, as I will refer to them to justify my methodology.

2.1 Topic Models

One can define topic models as statistical models that are used to learn about the latent structures that exist within a corpus of documents. These models have many uses; however, discovering patterns is one of the key reasons for building topic models (Boyd-Graber et al. [2014]).

2.1.1 Precursors of Latent Dirichlet Allocation

There are many different kinds of statistical models currently in use to discover topics within documents. In this section, I briefly list a few of these methods and then explain in detail the one that I use. Latent Semantic Analysis (LSA) uses vector-based models to find coherence between texts. The main methods used in this model are term frequency-inverse document frequency (tf-idf) and singular value decomposition (SVD)1 (Deerwester et al. [1990]). There are certain advantages to using LSA; namely, one can find the latent topics that exist within the corpus. However, due to SVD, the mathematical complexity of this model is extremely high.

Another topic modeling method is Probabilistic Latent Semantic Analysis (PLSA). It uses a simple two-level generative model, in which it calculates a probability model for the documents, the topics and the words (Hofmann [1999]). It has certain advantages: the topics can be easily interpreted, and the model rests on a solid statistical foundation (Holzinger et al. [2014]). However, this model also has a disadvantage, because the expectation-maximization algorithm2 used by PLSA has a tendency to find a solution that is not always the global optimum (Leopold [2007]).

1 https://en.wikipedia.org/wiki/Singular_value_decomposition
2 https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
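To make the tf-idf step that LSA starts from concrete, here is a minimal, dependency-free sketch of the weighting; the toy documents are invented purely for illustration, and the SVD factorisation that LSA applies afterwards is omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenised documents.
    tf = term count within the document, idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["gene", "cell", "gene"], ["cell", "trial"], ["trial", "patient"]]
w = tfidf(docs)
# "gene" occurs twice in doc 0 and in no other document: high weight.
# "cell" occurs in two of the three documents: lower weight.
```

In LSA proper, the resulting document-term weight matrix would then be factorised with SVD to obtain the latent dimensions.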

2.1.2 Latent Dirichlet Allocation

One of the key concepts for the models used in this Master's thesis is latent Dirichlet allocation (LDA) by Blei et al. [2003]. They explain LDA as follows:

“Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.”

The underlying logic behind LDA is that similar groups of words will occur in documents with similar topics. Latent topics are thus groups of words that frequently occur together in documents. Documents, in this view, are simply probability distributions over latent topics, and the topics can be defined as probability distributions over words. A key point here is that the model works with probability distributions and not word frequencies. Hence, the syntax of the text within a document does not matter; only the distribution of the words is of importance.

Figure 1.1: Plate notation representing the LDA model (from Blei et al. [2003])

The plate notation (see Figure 1.1) represents the overall architecture of the LDA model from Blei et al. [2003].

M: total number of documents in the corpus (1...m)
N: number of words in a document (1...n)
α: the Dirichlet prior parameter for the per-document topic distributions
β: the Dirichlet prior parameter for the per-topic word distributions
θm: the distribution of topics in document m
zmn: the topic assigned to the n-th word in document m
wmn: the n-th word in document m

The generative process gives an insight into how the LDA model assumes a document is created. As the first step, the model determines the number of words for a given document. Then it determines the document as a mixture of a given set of topics. For example, if the number of topics is set to four, it could decide that document m consists of 10% topic 1, 20% topic 2, 40% topic 3, and 30% topic 4. The model then generates the words of the document. For each word, it first chooses a topic according to the multinomial distribution of topics assigned to the document (40% topic 3, 30% topic 4, etc.), and then chooses the word itself from the chosen topic's multinomial distribution over words.
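The generative story above can be sketched in a few lines of Python. The vocabulary and the two topic-word distributions below are invented purely for illustration; real models have far larger vocabularies and topic counts.

```python
import random
from collections import Counter

def sample_dirichlet(alpha, k, rng):
    # Draw from a symmetric Dirichlet(alpha) by normalising Gamma samples.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topic_word_dists, vocab, rng):
    """Generate one document following the LDA generative story."""
    n_topics = len(topic_word_dists)
    # Step 1: draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, n_topics, rng)
    words = []
    for _ in range(n_words):
        # Step 2: draw a topic z for this word position from theta.
        z = rng.choices(range(n_topics), weights=theta)[0]
        # Step 3: draw the word w from topic z's distribution over the vocabulary.
        w = rng.choices(vocab, weights=topic_word_dists[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["cell", "gene", "protein", "patient", "trial", "dose"]
# Hypothetical topic-word distributions; each row sums to 1.
topics = [
    [0.35, 0.30, 0.25, 0.04, 0.03, 0.03],  # a "molecular" topic
    [0.04, 0.03, 0.03, 0.35, 0.30, 0.25],  # a "clinical" topic
]
theta, doc = generate_document(50, 0.5, topics, vocab, rng)
print(Counter(doc).most_common(3))
```

Fitting an LDA model is the inverse of this process: given only the observed words, infer the hidden θ and topic-word distributions.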

2.2 Machine Learning Problem

The techniques of machine learning can be broadly divided into three categories, namely supervised, semi-supervised and unsupervised learning. A machine learning problem falls into one of these categories based on the full, partial, or absent ground truth that can be applied to the model during the training procedure. One uses unsupervised machine learning when there is a complete absence of ground truth. The aim of unsupervised learning methods is to find structures and patterns in the input data, depending on the type of machine learning algorithm being implemented (Bonissone [2015]). LDA falls under the category of unsupervised machine learning algorithms, as the input data is not labelled and the model tries to infer structures within the data based on predefined parameters. This can be problematic at a later stage, as it is challenging to judge the quality of the generated topic model when there exists no reference against which one could compare it.

2.3 Issues with Topic Models

Boyd-Graber et al. [2014] give a comprehensive overview of the issues that can occur with topics generated by an LDA model. They mention five categories that can be used to judge whether a topic is of good quality. As I will be using these metrics to judge the quality of the topics generated by the model (see Chapter 4), the following sections present the aspects of good and bad topics as proposed by them.

2.3.1 Categories of Bad Topics

2.3.1.1 General and Specific Words

As most words in natural language convey some sort of meaning, it can happen that the model generates topics made up of words that are not useful. These are topics containing words that are frequent in the corpus but not specific. Such topics are perceived as general and as not belonging to any specific subdivision of the corpus. They may include stop words that were not removed during the preprocessing step. However, it can also be the case that they contain high-frequency words specific to the corpus, which should be removed in order to yield meaningful topics.

Boyd-Graber et al. [2014] also state that low-frequency words can cause problems. According to them, topics containing a multitude of very specific words can also be bad, because there is a chance that such topics are not representative of a specific subdivision within the corpus but were generated by mere chance, as the model generates topics based on word frequencies. The authors do not mention how to avoid the creation of such topics.
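As a concrete illustration of how overly general and overly specific words can be filtered out before modeling, the sketch below drops tokens by document frequency. The thresholds and toy documents are illustrative assumptions; libraries such as Gensim offer an equivalent built-in (`Dictionary.filter_extremes`).

```python
from collections import Counter

def filter_extremes(tokenized_docs, no_below=2, no_above=0.5):
    """Drop tokens that occur in fewer than `no_below` documents
    (overly specific) or in more than `no_above` of all documents
    (overly general)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))      # count each token once per document
    keep = {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

docs = [
    ["the", "cell", "membrane", "protein"],
    ["the", "cell", "nucleus", "xyzzy"],
    ["the", "patient", "cell", "trial"],
    ["the", "patient", "cohort", "trial"],
]
filtered = filter_extremes(docs, no_below=2, no_above=0.8)
# "the" (present in every document) and hapaxes such as "xyzzy" are removed.
```

Tuning `no_below` and `no_above` is corpus-specific; overly aggressive filtering can remove terms that would have anchored a meaningful topic.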

2.3.1.2 Mixed and Chained Topics

Boyd-Graber et al. [2014] define ‘mixed’ topics as topics made up of a set of words that make no sense in combination, even though the topic contains subsets of words that do make sense on their own. Example 2.1 is a case of a ‘mixed’ topic, as it consists of two subsets, namely names of flowers (in emphasis) and tools.

(2.1) rose, daffodil, daisy, hammer, screwdriver, pliers ...

‘Chained’ topics are related to mixed topics: here too the topic contains different subsets of words, but the issue is that at least one word from one of the subsets could also belong to the other subset. As shown in Example 2.2, there are two subsets within the topic, but one word from the first subset (names of fruits), namely ‘apple’, could also belong to the other subset, which is about products made by the company Apple.

(2.2) apple, banana, grape, iphone, smartphone, ipad ...


2.3.1.3 Identical Topics

Another issue that can occur while generating topics is that some topics are mostly or completely identical, possibly exhibiting a different word order (see Examples 2.3 and 2.4).

(2.3) apple, banana, grape, orange, pineapple

(2.4) grape, apple, pear, pineapple, banana

Boyd-Graber et al. [2014] mention some solutions to avoid such topics. They suggest checking whether there are empty documents in the dataset, and whether the number of topics is excessive for the given dataset.
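One simple way to flag such near-identical topics automatically is to compare their top-word sets with the Jaccard similarity, which ignores word order. The threshold at which a pair counts as a near-duplicate (e.g. 0.6) is a hypothetical choice, not one prescribed by the literature cited here.

```python
def jaccard(a, b):
    """Jaccard similarity of two top-word lists (order-insensitive)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# The topics from Examples 2.3 and 2.4 above:
topic_23 = ["apple", "banana", "grape", "orange", "pineapple"]
topic_24 = ["grape", "apple", "pear", "pineapple", "banana"]

sim = jaccard(topic_23, topic_24)
print(round(sim, 2))  # 4 shared words out of 6 distinct -> 0.67
```

Any topic pair whose similarity exceeds the chosen threshold can then be reported for manual inspection or merged.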

2.3.1.4 Incomplete Stopword List

These are topics generated as a result of an incomplete stop word list. This issue is somewhat similar to that of topics containing general words. The difference, however, is that these topics are not vague but make sense; for example, they could be a list of first names or of Roman numerals. The authors suggest that this problem can be circumvented by updating the list of stop words and running the model again (Boyd-Graber et al. [2014]).

2.3.1.5 Nonsensical Topics

These are topics that do not make any sense at all. Boyd-Graber et al. [2014] mention that asking the model to generate an excessive number of topics may cause nonsensical topics: the model tries to generate the given number of topics even if they do not exist in the corpus, inferring topics from whatever patterns it finds. The authors mention, for example, that OCR errors can form an artificial topic that a topic modeling algorithm may well detect.

2.3.2 Topic Alignment

Another issue with LDA models is that, even if one uses the same corpus and parameters, the model generates the topics in a different order on each run. Yang et al. [2016] mention that in an LDA model the words are generated from fixed conditional distributions; a side-effect of this is that the topics in the model are exchangeable. For this reason, in statistical topic modeling algorithms such as LDA, the topic indices of two topic models may not match even with identical model generation parameters. Hence, the authors suggest some kind of alignment measure in order to calculate the similarities between multiple models. Yang et al. [2016] implement the Hungarian algorithm1 in their work. However, for my experiments I use a different measure to calculate the similarities between multiple models (see Chapter 4.2.7).
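To make the alignment idea concrete, here is a small sketch of my own (not the implementation used by Yang et al. [2016]): it matches the topics of two runs by maximising the total cosine similarity between their topic-word vectors. The toy vectors are invented, and the brute-force search over permutations is only feasible for small topic counts; this is exactly the assignment problem that the Hungarian algorithm (available, for instance, as scipy.optimize.linear_sum_assignment) solves efficiently.

```python
from itertools import permutations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def align_topics(run_a, run_b):
    """Match each topic of run_a to one topic of run_b so that the total
    cosine similarity is maximal (brute force over all permutations)."""
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(len(run_b))):
        score = sum(cosine(run_a[i], run_b[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(enumerate(best_perm))

# Two toy 3-topic runs over a 4-word vocabulary: same topics, different order.
run1 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
run2 = [run1[2], run1[0], run1[1]]
print(align_topics(run1, run2))  # [(0, 1), (1, 2), (2, 0)]
```

The returned pairs say which topic index in the second run corresponds to each topic index in the first, which is all that is needed to compare topics across runs.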

2.3.3 Topic Quality Evaluation

Boyd-Graber et al. [2014] mention that, after retrieving the topics from the model, there are different ways of judging their quality. Furthermore, they indicate that a weakness of most topic modeling papers is that the researchers only perform a qualitative assessment after generating the topics. They also state that in many cases the quality of the topics is judged on some NLP-related task that is not directly related to the topics themselves, for example using the inferred topics for an information retrieval task (e.g. Wei and Croft [2006]) or for a sentiment detection task (e.g. Titov and McDonald [2008]). The second approach is to use a set of held-out articles (i.e. a test set) and to evaluate the probability that the model trained on the remaining articles assigns to the observations in the test set. The paper by Wallach et al. [2009] provides a comprehensive overview of evaluating topics based on the probability of the observations on the test set.

At this point I would like to mention that I will not use a test set to evaluate my results, as I do not have a reference for comparison. Hence, I will use a different metric to judge the quality of my model (see 4.2.7).

2.3.3.1 Human Evaluation

Chang et al. [2009] propose a method for human evaluation of topics, where the participants are given a set of words from a topic into which an intruder word has been inserted (see Example 2.5; the intruder word is shown in emphasis). They mention that the participants are able to identify the intruder word if the other words in the set belong to the same semantic group. However, if the set contains words that do not belong together (see Example 2.6), the task becomes much more difficult, and the participants seem to start choosing the intruder word at random. The quality of a topic can thus be evaluated based on the consistency of the human judgements.

1 https://en.wikipedia.org/wiki/Hungarian_algorithm


(2.5) apple, orange, banana, pineapple, tiger

(2.6) book, computer, teacher, student, weekend

2.3.3.2 Topic Size

The topic size plays a key role in the quality of the model. As Mimno et al. [2011] state, there is a relationship between the number of topics and the quality of the topics themselves. They point out that, on the one hand, models with a large number of topics provide the user with a more detailed view of the themes present in the corpus. On the other hand, a multitude of topics comes with disadvantages, because certain topic modeling algorithms tend to create topics even if there are none to be found (see 2.3.1.5).

2.3.3.3 Topic Word Length

Boyd-Graber et al. [2014] refer to the length of the words in a topic as an indicator of topic quality. Their intuition is as follows: if a word has a specific meaning, it is quite likely to be long, and vice versa. Thus, according to Boyd-Graber et al. [2014], a short average word length in a topic could be an indication of anomalous topic clustering (e.g. acronyms). However, they mention that short topic words do not by themselves indicate a nonsensical topic that cannot be interpreted by the user; topics with short words probably contain words that have a tendency to co-occur. The authors allude to a topic that contains the word ‘legislator’ and acronyms for the names of US states (e.g. ‘ca’, ‘pa’, ‘nc’, ‘fl’, etc.). In this case, the topic shows that names of states tend to co-occur with tokens such as ‘legislator’. The authors do not provide a solution for avoiding such topics. Nonetheless, this criterion of word length can be applied to my future topic models to evaluate whether the output is of adequate quality.
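As a sketch of how this word-length criterion could be checked automatically (the helper and the example topics are my own illustration, not code from Boyd-Graber et al. [2014]):

```python
def avg_word_length(topic_words):
    """Average character length of a topic's top words. A very short
    average may flag acronym-heavy topics that deserve manual inspection."""
    return sum(len(word) for word in topic_words) / len(topic_words)

acronym_topic = ["legislator", "ca", "pa", "nc", "fl"]   # US-state acronyms
gene_topic = ["expression", "protein", "mutation", "genome", "sequence"]
print(avg_word_length(acronym_topic))  # 3.6
print(avg_word_length(gene_topic))     # 7.8
```

A threshold on this average would then mark topics like the first one for manual review rather than discard them outright.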

2.4 Improving Topic Models

Boyd-Graber et al. [2014] also mention multiple ways of improving topic models. Their suggestions include merging topics that are similar (see 2.3.1.3) or separating topics that conflate multiple concepts (see 2.3.1.2). Among the approaches that can be implemented, they recommend measures that calculate the co-occurrence of words (e.g. point-wise mutual information, PMI) together with expert knowledge. They also mention automatic topic labelling as a way to interpret topics without the help of domain experts; this method also provides a summary of what is represented by the topics.
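As an illustration of the PMI measure referred to above, here is a minimal sketch computed from document-level co-occurrence counts (the counts below are invented):

```python
from math import log2

def pmi(pair_count, count_a, count_b, total_docs):
    """Point-wise mutual information from document frequencies:
    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ).
    Positive values mean the two words co-occur more often than chance."""
    p_ab = pair_count / total_docs
    p_a = count_a / total_docs
    p_b = count_b / total_docs
    return log2(p_ab / (p_a * p_b))

# Invented counts over 1000 documents: each word occurs in 100 documents,
# and the two words appear together in 80 of them -> strong association.
print(round(pmi(80, 100, 100, 1000), 2))  # 3.0
```

Word pairs with high PMI belong together; a topic whose top words all have high pairwise PMI is therefore likely to be coherent.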

2.4.1 Automatic Topic Model Labelling

2.4.1.1 Information Retrieval

Lau et al. [2011] attempt to resolve the issue of topic labelling by generating new labels for the topics. The methods they implement include using terms that are found in the topics: they search the English Wikipedia for the top-ranking topic terms and use the article titles returned by their query to generate more topic labels. Then they rank and process the Wikipedia article titles and extract label keywords from them. These keywords are further processed using a combination of association measures (PMI) and lexical features. Of all the datasets evaluated in their work, the topic labels for the PubMed abstracts perform less well than the labels generated on the other datasets. This approach is certainly interesting; however, implementing their methodology requires using an API to get the relevant information from Wikipedia, which goes beyond the scope of this Master's thesis.

2.4.1.2 Neural Embeddings

An improvement of the model proposed by Lau et al. [2011] is the one by Bhatia et al. [2016]. Even though their methodology has some similarities to the former approach, Bhatia et al. [2016] forego the information retrieval aspect of Lau et al. [2011] and replace it with word2vec and doc2vec. The word2vec model generates more abstract labels, whereas doc2vec returns fine-grained labels for a given topic. The strength of the system lies in combining the outputs of both models. They also make use of different learning-to-rank approaches to improve the quality of the top-ranking topic labels. Unlike with the approach by Lau et al. [2011], the researchers claim that this method is much easier to implement, as one is not required to use search APIs. Moreover, Bhatia et al. [2016] report better results than Lau et al. [2011].

As their method relies on doc2vec to generate multi-word topic labels, I chose not to implement it for my work: doc2vec requires a significant amount of computational resources (notably RAM) to run, and it is not recommended for the corpus that I intend to use.


2.4.2 Text Preprocessing to Acquire Meaningful Topics

A prevalent issue with LDA topic modeling is finding meaningful topics in the given corpus. Zhu et al. [2014] suggest that when using LDA, this problem can be significantly reduced by applying multiple levels of text preprocessing, with methods such as POS tagging, base noun phrase chunking, and K-means clustering. Part of their preprocessing approach is the reduction of plural tokens to their lemma form. They mention that with these preprocessing steps their method outperforms a baseline, and that its output is ranked higher by human annotators. This paper is of particular interest because many of the approaches mentioned by the authors can be implemented using off-the-shelf NLP tools such as the Natural Language Toolkit (NLTK) for Python (Bird et al. [2009]).


3 Previous work

In this chapter I give a brief overview of the work that has been done in the fields of biomedical topic modeling and diachronic topic modeling, focusing only on the articles that are of interest to my work. Finally, I give a brief overview of the tools that can be used for topic modeling purposes.

3.1 Biomedical Topic Modeling

3.1.1 Ontology Term Mapping

Zheng et al. [2006] used topic modeling on titles and abstracts of protein-related MEDLINE articles. They used LDA and extracted 300 topics from their corpus. They found that the majority of the extracted topics were not only semantically coherent but also featured biological terms. As an added feature, they mapped the topics to the Gene Ontology (GO) controlled vocabulary by associating the terms shared between the topic words and the GO terms. Thus, this paper exhibits a practical usage of topic modeling in the domain of biomedical publications. Furthermore, it contains multiple examples of biomedical topics created using LDA.

3.1.2 Enriching LDA Output with External Data

The output of the model can also be improved by enhancing it with information from an external knowledge base. Wang et al. [2011] do this by applying multiple levels of complex preprocessing steps, which include NER, retrieving information about the tokens from an external database, and recategorising the extracted information into a relational database for further usage. Moreover, they apply other semantic association features to improve the topics. The researchers adapt the LDA method to suit their needs for biomedical topic modeling: they create Bio-LDA, an algorithm that performs LDA on a given corpus and enriches the results using datasets from the life science domain. It then identifies relationships between the topics using the aforementioned methods.

This article is of interest to me, as the authors incorporate external information to enhance their method. It has also served as an inspiration for the pipeline that I created to easily process PubMed articles (see Chapter 8).

3.1.3 Discover Relationships between Diseases and Genes with Topic Modeling

ElShal et al. [2016] aim to find relationships between diseases and genes. They used LDA to extract topics from their corpus, the abstracts of biomedical research articles, converting the documents into vectors using the bag-of-words approach. From the LDA model they used the topics and the topic-word distribution, and from the corpus they used the mentions of genes and diseases. They combined this information to find links between genes and diseases; in most cases, they calculated the similarities between documents, topics, genes, and diseases using cosine similarity.

Using this approach, they found many correct links between genes and diseases; however, they also realised that the vocabulary plays an important role and that the extracted topics rely heavily on it. This article is interesting as it focuses on the role of vocabulary when creating LDA models.

3.1.4 Comprehensive Biomedical LDA Topics Example Source

Examples of topics created from models with biomedical texts as input are a useful resource, because they serve as a reference for what such topics could look like. This is one of the reasons why the article by van Altena et al. [2016] is helpful. The article focuses on the usage of big data themes in scientific texts, with an emphasis on biomedical literature. As their corpus, they take 1308 documents from PubMed and PMC, selected based on the occurrence of big data related keywords. As for the topic modeling methodology, they implemented LDA using R. The preprocessing methods included stopword filtering and stemming. An interesting feature of their corpus creation was to transform common bigrams into a single token joined by an underscore (e.g. ‘health care’ becomes ‘health_care’). To extract the best parameters for their model, they calculated


the Akaike information criterion (AIC, https://en.wikipedia.org/wiki/Akaike_information_criterion) using the following variables: the number of topics, the likelihood of the model, and the number of unique words in the corpus. Based on their corpus, the best number of topics is 25. Finally, they used manual annotators with domain knowledge to assign labels to the topics. The article further describes the usage of big data terminology in biomedical texts.

Despite the fact that their corpus is about big data, this article is a highly valuable resource, as it gives an insight into what topics generated from a biomedical corpus can look like. Moreover, their method for finding the best parameters for the LDA model can potentially be applied to finding the parameters for my models during the experiments.
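The underscore-joining of frequent bigrams mentioned above can be sketched as follows. The helper and the hard-coded bigram set are my own illustration; in practice, a tool such as Gensim's Phrases model detects frequent bigrams automatically from corpus statistics.

```python
def merge_bigrams(tokens, bigrams):
    """Replace every occurrence of a known bigram with a single
    underscore-joined token, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

known = {("health", "care"), ("big", "data")}
print(merge_bigrams(["big", "data", "in", "health", "care"], known))
# ['big_data', 'in', 'health_care']
```

Merging such bigrams lets the topic model treat multi-word terms as single vocabulary items instead of splitting them across topics.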

3.2 Diachronic Topic Modeling

3.2.1 Early Work in Modern Non-biomedical Domain

The paper by Wang and McCallum [2006] performs diachronic topic modeling using LDA. The authors developed a tool called Topics over Time (TOT), which uses topic modeling and combines it with co-occurrence patterns of words. Although they do not use biomedical data in their training corpus, they were able to show trends in their data. This article is mentioned here as an interesting early example of diachronic topic modeling and trend analysis.

3.2.2 Observe Diachronic Changes and Author Influence in Biomedical Domain

The issue of diachronic changes in topics and of author influence is tackled by Song et al. [2014]. They analyse 20,869 PubMed articles from 2000 to 2011, extracting bibliographical and other relevant information (e.g. abstract, full text, etc.) from the XML files. Using the extracted bibliographical information, they create a relational citation database. Furthermore, the texts are divided into three temporal categories, namely 2000-2003, 2004-2007, and 2008-2011, and separate LDA models are run for each of these time periods. The authors state that these temporal categories are based on in-domain trends, the number of publications per year, and the need for enough data for diachronic topic modeling. They use topic modeling to find out the changes in topics that have occurred in the

aforementioned time periods. Moreover, they implement a mixture of information extraction, to find the most influential authors, and topic modeling, to find the topics

associated with them. The researchers use Dirichlet multinomial regression (DMR) as their topic modeling algorithm, implemented with the tool MALLET. Using these methods, they not only find the most productive authors, countries, and institutions, but also patterns of larger collaboration networks based on topics. This article provides an insight into diachronic changes in topics in the domain of biomedical literature and into what such topics could look like. It also provides a method for finding influential authors and the topics used by them, as well as information about finding collaboration networks and interrelated fields. Finally, similar to the article by van Altena et al. [2016], this paper is very useful to me as it provides a comprehensive list of topics that the researchers found during the topic modeling process.

3.3 Brief Overview of Available Tools

There are numerous topic modeling tools available for free; which one to use depends on one's knowledge of programming languages, the amount of data being processed, and the level of customizability required for the task.

3.3.1 MAchine Learning for LanguagE Toolkit (MALLET)

MALLET is a Java-based, cross-platform NLP package (McCallum [2002]). It has tools for document classification, sequence tagging, and topic modeling. The topic modeling toolkit contains models such as LDA, hierarchical LDA, etc. It is easy to use, does not require any prior knowledge of Java, and contains tutorials for users without prior programming experience.

3.3.2 Stanford Topic Modeling Toolbox

The Stanford Topic Modeling Toolbox is part of the Stanford NLP suite of software (Ramage and Rosen [2009]). Written in Scala, it takes as input data in spreadsheet (Excel, CSV) format. The topic models available in this toolkit are LDA, labelled LDA, and PLDA. The advantages of using this tool are that it generates its output in Excel format, which eases the data analysis process, and that there is a Java-based user interface. A disadvantage is that knowledge of Scala is required to fine-tune the model parameters.


3.3.3 Spark Machine Learning Library (MLlib)

The machine learning libraries in Spark offer a big data solution for LDA models1. Although most of the core code and documentation for Spark MLlib is written in Java (and Scala), there is a Python wrapper (the PySpark API) which enables the user to write the required code in Python. The advantage of this library is that it offers a big data solution for handling large amounts of data, which is usually the case when working with the biomedical literature. A disadvantage is that it requires an architecture for big data analysis, and the user documentation is somewhat complicated and takes time to get accustomed to.

3.3.4 R Libraries

There are also topic modeling libraries for R; a notable one is called ‘topicmodels’ (Hornik and Grun [2011]). It requires the data in document-term matrix format so that it can be processed by the package. Moreover, it requires programming knowledge of R, and one should be skilled in data manipulation with other R libraries so that both the input and the output can be processed in R.

3.3.5 Gensim

Gensim is a Python library (compatible with Python 2 and 3) that contains a multitude of tools for extracting semantic relations from documents (Rehurek and Sojka [2010]). As for topic modeling algorithms, it offers LSI and LDA. It takes as input either raw text or text in Python-readable list formats, and it provides its own text processing tools such as simple tokenizers. A key feature is that Gensim allows parallel processing, which can reduce the time it takes to run the models. Moreover, the library has very detailed documentation, which is easy to follow, and allows a high level of customizability.

In addition, as it runs on Python, NLP tools such as NLTK can be used for preprocessing before feeding the data into the topic modeling algorithm. I will be using Gensim for this project because of the customizability of the library and my programming knowledge of Python.

1 https://spark.apache.org/mllib/


4 Methodology

As mentioned before, LDA is an unsupervised machine learning algorithm. Thus, there is the issue of ground truth: there is no pre-existing labelling against which I can compare my results. Therefore, my aim in the following sections is to be aware of the issues of topic modeling and to circumvent them where possible, following the recommendations in the literature on building topic models.

4.1 Source of Data

In this section, I document how the topics were extracted from a corpus of around 1.5 million open access PubMed articles. I downloaded the open access bulk article packages from PubMed Central; the corpus contains articles from the PMC Open Access Subset1. I used the articles from this subset because the data made available there falls under Creative Commons or similar licences, which enables me to avoid any issues regarding copyright. The corpus was downloaded at the beginning of March 2017.

4.2 Extracting Topic Models

In the following sections I apply the LDA algorithm to the corpus to create topic models. I tried different parameters, such as chunk size, dictionary trimming, the number of extracted topics, and corpus filtering by POS tags, to obtain topics that appear coherent and meaningful to a human reader.

1 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/


4.2.1 Experiment 1: Exploring the Topics in the Corpus

4.2.1.1 Preprocessing

The articles downloaded from the PMC Open Access Subset are in XML format. Hence, with the help of Python libraries, I extracted the text from each article to create my corpus. As a next step, I tokenized the text and then reduced the corpus by removing English stopwords1 and punctuation marks that do not occur within a token. To further reduce the vocabulary, I lowercased the entire corpus. This approach has its advantages and disadvantages: lowercasing consolidates multiple tokens with different casing into a single lowercased token (e.g. ‘CELL’ and ‘Cell’ to ‘cell’), which reduces the number of vocabulary items; however, it can also merge homonyms with different meanings (e.g. ‘AIDS’ vs. ‘aids’) that should not be consolidated into a single vocabulary item. Despite this issue, I decided to lowercase the corpus, as in this context the advantages of lowercasing outweigh the disadvantages. Then I lemmatized the text to further reduce the number of vocabulary items (e.g. ‘cell’ and ‘cells’ to ‘cell’).
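The steps above can be sketched as follows. This is a deliberately simplified stand-in: the regular-expression tokenizer, the tiny stopword set, and the crude plural stripper substitute for the NLTK tokenizer, stopword list, and lemmatizer used in the actual pipeline.

```python
import re

STOPWORDS = {"the", "of", "in", "and", "a", "to", "is", "are", "were"}  # tiny stand-in

def crude_lemma(token):
    """Very rough plural reduction standing in for a real lemmatizer."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess(text):
    tokens = re.findall(r"[a-z0-9-]+", text.lower())    # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return [crude_lemma(t) for t in tokens]             # e.g. 'cells' -> 'cell'

print(preprocess("The Cells of the tumor were analysed."))
# ['cell', 'tumor', 'analysed']
```

Note that the character class in the tokenizer keeps hyphens, mirroring the decision to remove only punctuation marks that do not occur within a token.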

I used the tokenizing and lemmatizing functions provided by NLTK in the text preprocessing stage (Bird et al. [2009]). The decision to use NLTK was mainly based on the fact that it is freely available and can easily be combined with Python. Secondly, the dataset I am using consists of texts written in English, and the tools provided by NLTK are trained on English data. It should be noted that biomedical texts pose a difficulty for most text processing tools, due to the specific biomedical terminology used by the authors of these texts. Considering the scope of this Master's thesis, I decided against using tools that have been trained on biomedical data. This decision was based on the examples of topic models that I saw in the work done by Song et al. [2014]2 and van Altena et al. [2016]3, where the topic models were generated from biomedical texts.

In both papers, the topic words tend to be common nouns such as ‘cell’, ‘disease’, ‘cancer’, ‘virus’, etc. (from Song et al. [2014]), or ‘brain’, ‘disorder’, ‘biology’, etc. (from van Altena et al. [2016]). In neither work did the researchers implement tools for recognising biomedical entities. Nonetheless, terms such as

1 These were taken from the set of English stopwords provided in the NLTK package (Bird et al. [2009]).
2 Song et al. [2014] have a table of topics generated by their model in their paper on pages 357-359.
3 van Altena et al. [2016] also show the topics generated by their model, in tables that can be found on pages 357-359 of their paper.


‘dna’ were recognised as topic words by the models in both papers. Therefore, based on the previous work done in this domain, I decided against implementing a tool trained on biomedical data.

4.2.2 LDA Model Creation Parameters

Model Parameters

After the preprocessing step, I created an LDA model using Gensim. In my first tests, I ran the LDA model with different configurations for extracting topics from the corpus: I set the number of topics to 10, 20, 50, and 100, one value per iteration. Moreover, I set the chunk size to 5000 and the number of passes to 1.

Dictionary Trimming

I also trimmed the dictionary used by the model, in order to make the model run faster and to discard redundant vocabulary items that could behave like spam in the output. Hence, I removed not only words that occurred fewer than 10 times in the entire corpus but also the 100 most common words. Other omissions from the dictionary included numbers and tokens that are three characters long or shorter.
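The trimming rules can be sketched as follows. This is a pure-Python illustration with invented toy data and shrunk thresholds; in Gensim, the frequency-based parts correspond roughly to Dictionary.filter_extremes and Dictionary.filter_n_most_frequent.

```python
from collections import Counter

def trim_vocabulary(docs, min_count=10, drop_top=100):
    """Apply the trimming described above: drop rare words, the most
    frequent words, pure numbers, and tokens of three characters or fewer."""
    counts = Counter(token for doc in docs for token in doc)
    too_common = {word for word, _ in counts.most_common(drop_top)}
    return {
        word for word, count in counts.items()
        if count >= min_count
        and word not in too_common
        and not word.isdigit()
        and len(word) > 3
    }

# Toy corpus with shrunk thresholds so every rule fires at least once.
docs = [["protein", "cell", "42", "dna", "protein"],
        ["protein", "cell", "binding"]]
print(sorted(trim_vocabulary(docs, min_count=2, drop_top=1)))  # ['cell']
```

With these toy thresholds only ‘cell’ survives: ‘protein’ is the single most common word, ‘42’ is a number, ‘dna’ is too short, and ‘binding’ is too rare.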

Topic number  Topic words
1   response participant task trial stimulus subject experiment fig activity day
2   fig structure solution protein patient surface size compound acid reaction
3   patient health age risk participant care score woman clinical outcome
4   snp gene patient population dna expression allele fig association protein
5   protein fig gene activity dna mutant binding expression strain acid
6   expression mouse fig protein gene tissue antibody day tumor activity
7   fig specie water plant concentration area temperature surface site population
8   patient disease blood clinical day infection response tumor serum month
9   gene sequence document expression protein genome fig set region minimal
10  patient fig image parameter region activity network area structure response

Table 4.1: 10 topics generated from a corpus of 1.5 million articles

4.2.2.1 Evaluation: 10 Topics Model

The model returned 10 topics based on the entire corpus. The results can be found in Table 4.1, where the topic numbers and topic words are displayed. As can be seen from the results, the theme of a topic can often be guessed from its five most relevant topic words. For example, topics 4, 5, and 9 are about gene-related themes, whereas topic 3 is about patient information.


Issues

Within the topics, there are certain terms that can be classified as noise, namely ‘fig’ (in emphasis), which probably refers to the shortened form of ‘figure’. These terms should be removed in future experiments. Moreover, as seen in Table 4.1, at least three of the topics can be labelled as gene-related, so it can be concluded that this model lacks thematic variance. As previously discussed, these are issues common to bad topic models. Three issues worth mentioning here are: mixed and chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete stopword list (2.3.1.4).

Model Evaluation

The 10 topics given as output are vague and somewhat generic, and they tend to share common themes. Nonetheless, by looking at them it is possible to gain a general idea of what the corpus is about. However, the topics themselves are far too vague; ultimately, more fine-grained topics are necessary. Finally, it can be said that this is a poor-quality model and that 10 topics are not enough for a corpus of this size.

4.2.2.2 Evaluation: 20 Topics Model

Table 4.2 shows some of the results of the 20 topics model, which was run with the same set of parameters except for the number of topics. There are still spam words in the output, such as ‘fig’. One topic was found that contained LaTeX-related markup terms (topic 7); this is due to the fact that the markup terms had not been filtered out of the corpus during the preprocessing steps. Unlike in the previous iteration, there are more distinct topics, as new themes emerge in the output: there are topics about insulin and mice (topic 15), about mice and brain experiments (topic 20), and about cancer (topic 10), which were not in the previous model.

4.2.2.3 Model Evaluation

More topics were extracted, and new themes emerged in the 20 topics model that are much more detailed than those in the 10 topics model. These topics contain more words that are frequent in the corpus, but they are still not specific: they can be perceived as general and not belonging to a specific subdivision of the corpus (see 2.3.1.1). Moreover, the issues of mixed and chained topics (2.3.1.2), identical topics (2.3.1.3), and an incomplete stopword list (2.3.1.4) are still prevalent in this model. Finally, it can be said that 20 topics are not adequate for generating such a model for this corpus.

Topic number  Topic words
7   document amsbsy minimal amsfonts mathrsfs wasysym upgreek amssymb amsmath 12pt
10  cancer tumor patient expression breast gene protein tissue mutation line
15  concentration glucose insulin mouse acid fig weight diet rat protein
20  rat neuron mouse animal brain response day patient experiment trial

Table 4.2: A selection from the 20 topics generated from a corpus of 1.5 million articles

4.2.2.4 Evaluation: 50 Topics Model

In this iteration, the same issues as with the 10 and 20 topics models persist, since this model was created with the same parameters, corpus, and dictionary. However, compared to the previous ones, this model resulted in even more detailed topics. Some of the topics in this model were also found in the previous models; hence, I will only mention the new topics that emerged. These are about mental health/depression (topic 44), antibiotic resistance (topic 6), vaccines (topic 12), diabetes (topic 17), eyes/surgery (topic 32), bone/tissue (topic 18), neurons (topic 48), and plants (topic 8). It appears that in this 50 topics model the generated topics are less general and tend to be more specific.

Model Evaluation

Although the 50 topics model yielded much more detailed results than the previous iterations, it contains some of the topics that were found before. Furthermore, some of the criteria that make a model bad were found in this iteration as well. Nonetheless, with 50 topics, some nuanced themes in the corpus become apparent, as the topics are less vague than before. If I manage to solve some of the issues that place the topics generated by this model in the category of bad topics, then for future iterations 50 is an adequate yet manageable number of topics to generate from this corpus.

4.2.2.5 Evaluation: 100 Topics Model

Unlike the 50 topics model, this iteration contained very few new topics, namely about heart disease (topic 22), aneurysms (topic 63), malaria (topic 91), pregnancy


Chapter 4. Methodology

Topic number Topic words

6  antibiotic resistance patient isolates strain antimicrobial infection resistant culture day
8  plant leaf soil specie fig water seed site day root
12 virus vaccine influenza antibody vaccination response patient infection protein day
17 risk patient age diabetes blood subject bmi disease association weight
18 bone fracture patient fig tissue implant cartilage week day protein
32 patient eye nerve left surgery right clinical disease month image
44 patient pain score symptom depression sleep protein disorder fig anxiety
48 neuron fig response channel current receptor expression activity synaptic protein

Table 4.3: A selection from the 50 topics generated from a corpus of 1.5 million articles

(topic 74) and reproduction (topic 95). The lack of topic diversity is due to the fact that this model contained similar topics to those found in the previous models, with slight variations in the topic words and their order. Thus, it can be said that this model has a high number of identical topics. Perhaps this is because 100 topics are too many for the given corpus (2.3.1.3). As this model was made with the same parameters as the aforementioned ones, the weaknesses of those models are also present in this one.

Topic number Topic words

22 patient heart cardiac pressure vitamin ventricular left blood artery volume
63 aneurysm patient fig artery blood protein activity min concentration rbc
74 birth pregnancy infant maternal patient woman asthma age child risk
91 malaria parasite infection hpv falciparum patient mosquito blood day expression
95 oocyte sperm expression embryo patient human chromosome mtdna ovarian stage

Table 4.4: A selection from the 100 topics generated from a corpus of 1.5 million articles

Model Evaluation

The 100 topics model exhibits a rather comprehensive array of topics. Unfortunately, several of them are similar. It would be possible to go into further detail with a 200 topics model, but as the current model generates very few new topics, branching into one at this point is not advisable. As discussed before, LDA has a tendency to create topics even when there are none thematically in the corpus (see 2.3.1.5). Hence, at this point the priority is not to branch further into a more detailed topic model, but to fix the issues that are currently present.

4.2.2.6 Evaluation Experiment 1: All Models

The results of the different iterations show that extracting only a small number of topics, namely 10 to 20, from a corpus results in vague topics. This is because the LDA model tries to cluster together high-frequency words in the


corpus, resulting in generic topics. As the number of topics extracted from the corpus increases, the topics become more specific. It should be noted, however, that with an increasing number of topics, the model may not extract genuinely different topics, but rather different versions of topics belonging to the same theme. As discussed before, the generation of identical topics can sometimes be due to a number of topics that is excessive for the data set.

At the end of this experiment, the results give an overview of the types of topics that

can be found in this corpus. As mentioned before, there are many topics and topic

words in the output that should be removed, which will be done by updating the

stopword list. Hence, the corpus needs to be cleaned again and further experiments

with different parameters should be run on the new corpus.

4.2.3 Experiment 2: Edited Corpus and Modified Model Update Parameters

In the previous experiment, I ran the model with the chunk size parameter set to

5000. This means that the model performs online training: with the influx of new information, the model is updated continuously. The update

parameter of the model is dependent on the number of workers (parallel processes)

and the chunk size (see 4.1).

update = number of workers × chunk size (4.1)

In this batch of experiments, I decided against online training, as I do not have any

control over how the initial chunk is chosen and how representative it is of the entire

corpus. Thus, I set the batch parameter to True, which results in the LDA model calculating the topics from the entire corpus at once. The goal of this approach is

to find the topics that are representative of the whole corpus and not those chosen

by the chunks.
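The relation in (4.1) between chunk size, number of workers, and the update interval can be illustrated in plain Python. This is a toy stand-in for Gensim's internal chunking during online training, not Gensim itself; the names `iter_chunks`, `workers`, and `chunksize` are illustrative.

```python
def iter_chunks(corpus, chunksize):
    """Yield successive document chunks, as online training consumes them."""
    for i in range(0, len(corpus), chunksize):
        yield corpus[i:i + chunksize]

corpus = list(range(12000))            # stand-in for 12,000 BoW documents
workers, chunksize = 4, 5000
update_interval = workers * chunksize  # documents seen between updates (Eq. 4.1)
chunks = list(iter_chunks(corpus, chunksize))
# Online training processes three chunks of sizes 5000, 5000 and 2000;
# batch training would instead process all 12,000 documents at once.
```

With batch training the chunking above simply disappears: the whole corpus is treated as a single chunk per pass.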

4.2.3.1 Preprocessing

The corpus from the previous model needed to be modified as it contained many

spam elements such as LaTeX markup terms. Hence, a new corpus was created with the spam elements removed. The other preprocessing elements remain

the same as before (see 4.2.1.1).


Topic words

cell gene patient model mouse cancer activity rate expression population

patient protein cell model gene treatment rate activity expression site

patient cell treatment gene protein model expression risk activity rate

cell protein patient treatment expression activity mouse concentration line model

gene cell patient expression treatment model network activity disease function

cell protein gene mouse activity patient antibody sequence expression treatment

Table 4.5: Identical topics generated from multiple topic models with different topic sizes

4.2.3.2 Results

The results from this iteration are disappointing because for all the different types

of models that I ran, all the topics were variations of the ones shown in Table 4.5.

Hence, I can only assume that the model latched onto the most common elements in the corpus, which concern experiments on mice, protein analysis, and cells. Unlike in the previous experiment, the topics have little variance. The number

of topics generated from the model is irrelevant, as even in the 100 topics model all

the topics have the word ‘cell’ in them. I can assume that online training matters, and in the following experiments I will try different batch sizes to see how the topics differ. Nonetheless, the results from this experiment seem illogical, as I do not see any obvious causal link between online learning and the types of topics generated. Perhaps the problem lies in the dictionary, because a new corpus was generated (as mentioned in the preprocessing step 4.2.3.1) which has different word frequencies.

4.2.4 Experiment 3: Online Learning with Different Batch Sizes

4.2.4.1 Preprocessing

Working with a corpus of 1.5 million documents is not recommended with Gensim, because during training Gensim loads a significant amount of the data into RAM. Therefore, one cannot run multiple batches with different parameters simultaneously, as doing so consumed almost 250 GB of RAM and brought the server to a standstill. For this reason, in order to reduce the training time and run multiple models at the same time, I chose a smaller subset of the corpus. This subset contains 150 thousand randomly selected articles from the entire corpus, regardless of year of publication. In Figure 4.1 one


Figure 4.1: Number of articles published per year from 1950-2016 in the corpus of 150 thousand articles

can see the number of articles published per year in the smaller corpus. The graph shows the number of publications from 1950 to 2016. As is visible in Figure 4.1, the number of publications increases exponentially. In order to check whether the randomly chosen sample is representative of the corpus, I calculated for each year the percentage of articles from the original corpus that are in the new corpus of 150 thousand articles. As one can see in Figure 4.2, this yearly share oscillates between 8 and 11 percent of the articles that were published in a given year and are available in the Open Access Subset. Drastically reducing the number of articles reduced both the training time and the strain on the RAM created by running the topic models. The other preprocessing elements remain the same as before (see

4.2.1.1).
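The year-independent random selection described above can be sketched with Python's standard library. The identifier format and the seed are illustrative, not the ones actually used for the thesis corpus.

```python
import random

random.seed(42)  # any fixed seed makes the draw reproducible (assumption)
# Stand-in identifiers for the 1.5 million open-access articles.
article_ids = [f"PMC{i}" for i in range(1_500_000)]
# Draw 150 thousand articles without replacement, ignoring publication year.
subset = random.sample(article_ids, k=150_000)
```

Because the sample is uniform over the whole corpus, each publication year is expected to contribute roughly the same share (about 10%), which is what Figure 4.2 checks empirically.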

4.2.4.2 Results

The results of this batch were nearly identical to the output of the previous experiment, thus confirming my suspicion that the online training did not influence the

topics generated by the model. Hence, I looked into the differences between the first

experiment and the latter two and realised that the key issue is the dictionary. In

experiments 2 and 3, after removing the LaTeX markup words from the corpus, the


Figure 4.2: Percentage of articles from the original corpus (1.5 million articles) per year from 1950-2016 that are in the corpus of 150 thousand articles

token ‘cell’ was not removed when I deleted 100 of the most common tokens from

the corpus. It can be seen that certain vocabulary items have a tremendous influence on the results. In later experiments, such words should be noted and treated as a form of stopword, since they are high-frequency vocabulary items. This raises the issue of which further tokens should be considered

as being disruptive when it comes to finding the underlying topics in the corpus.

Should other tokens such as ‘protein’ or ‘mice’ be removed as well? Although the results from this experiment can be discarded, the underlying cause of the lack of varied topics has possibly been found.

4.2.5 Experiment 4: Reduced Vocabulary

4.2.5.1 Preprocessing

As seen before, the dictionary plays a critical role regarding the type of tokens

chosen by the model (see 4.2.4.2). In the preprocessing step I decided to reduce the

vocabulary drastically.

I created a new, reduced corpus which consists only of lemmatized common and proper nouns1. This was done by running NLTK’s English POS tagger on the corpus and selecting only the noun-related tags for further processing. The other

1 The POS tagger in NLTK uses the Penn Treebank POS tags, which in this case are NN, NNS, NNP and NNPS.


preprocessing methods remain the same as before (see 4.2.1.1).
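The noun filter can be illustrated without running the tagger itself. A hand-tagged toy sentence stands in here for the `(token, tag)` pairs that NLTK's `pos_tag` returns, so the sketch needs no NLTK installation; the variable names are illustrative.

```python
# (token, Penn Treebank tag) pairs, as nltk.pos_tag would return them.
tagged = [("mice", "NNS"), ("were", "VBD"), ("injected", "VBN"),
          ("with", "IN"), ("insulin", "NN"), ("daily", "RB")]

# The noun-related tags listed in footnote 1.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
nouns = [token for token, tag in tagged if tag in NOUN_TAGS]
# Only "mice" and "insulin" survive the filter.
```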

4.2.5.2 10 Topics Model

As seen in Table 4.6, the topics generated by this model are quite generic and partially nonsensical; however, none of them are identical. The token ‘mouse’ appears in several topics, namely topics 3, 7, 8 and 9; perhaps ‘mouse’ is an important token in the subcorpus I am currently using. Other common topic words are ‘blood’, ‘virus’, and ‘tumour’. The words in these topics are very generic, and the topics themselves also fall into the category of mixed topics.

Topic number Topic words

1  month intervention therapy blood trial hospital outcome pressure infection event
2  sequence mutation specie mouse domain receptor antibody virus genome family
3  infection mouse strain blood antibody sequence virus culture serum isolates
4  child score trial month mortality death outcome infection antibody sequence
5  medium growth solution strain membrane culture surface antibody temperature reaction
6  participant woman intervention score child community service family practice problem
7  mouse plant growth infection antibody macrophage production tumor activation cancer
8  tumor water lesion stage image diagnosis mouse therapy brain field
9  cancer mouse tumor antibody blood brain muscle animal breast receptor
10 sequence parameter image network length target frequency position performance error

Table 4.6: Topics from 10 topic model from noun corpus

Model Evaluation

The topics generated in this iteration do meet some of the criteria of poor-quality topics (see 2.3.1). Despite being somewhat vague, the topics show the

overarching themes in the subcorpus. Reducing the vocabulary items does not have

an adverse influence on the generated topics; nonetheless, more fine-grained topics

are required. Overall one can say that 10 topics are not enough to demonstrate the

thematic diversity within this corpus.

4.2.5.3 20 Topics Model

Unlike in the previous iteration, the topics in this model are less nonsensical. However, they are still very vague and exhibit signs of mixed topics. Here we also observe that the tokens ‘cancer’ and ‘mouse’ occur in many of the topics; perhaps these tokens are important in the subcorpus I am using. Expanding to 20 topics brings forth more detailed topics that were not present in the smaller model. However, one can also observe that some of the topics are partially identical.


Topic number Topic words

1  infection virus blood mouse woman mutation antibody pregnancy growth phase
2  sequence specie genome length variation selection distance frequency receptor family
3  cluster sequence strain plant family specie isolates genome annotation transcription
4  network image parameter channel exercise frequency performance solution measurement surface
5  brain image neuron surgery month lesion diagnosis muscle tumor nerve
6  muscle cancer mouse trial blood month specie medium score event
7  participant woman child service intervention question people community practice score
8  temperature mouse hospital medium culture frequency production image lesion cancer
9  cancer blood antibody association exposure tumor bladder smoker brain status
10 trial intervention score outcome month child therapy participant review criterion
11 feature brain frequency image sequence association event error performance position
12 water strain growth infection energy surface temperature culture density production
13 sequence mouse strain domain culture mutation mutant primer antibody promoter
14 mouse antibody infection brain sequence image marker mirnas cancer culture
15 vaccine sequence virus mouse specie infection trial death cancer mortality
16 cancer tumor mutation breast stage survival metastasis therapy methylation sequence
17 plant stress mouse sequence growth water specie reaction temperature promoter
18 compound solution reaction product medium membrane water blood mouse plant
19 blood woman association therapy month parameter pressure serum fracture criterion
20 antibody mouse tumor cancer activation receptor medium growth culture inhibitor

Table 4.7: Topics from 20 topic model from noun corpus

Model Evaluation

In this iteration, we also have the same issues as shown in the 10 topics model.

However, the topics are more specific than before, as new themes emerge within

them. Nonetheless, they are still too vague, which shows that 20 topics are not adequate for this smaller subcorpus.

4.2.5.4 50 Topics Model

There are some overlaps with the previous model; however, in this case, we can see

that ‘cancer’ is again a common topic word. Thus, I assume there are texts about cancer in this subcorpus. Some of the topics are about chromosomes (topic 32), surgery complications (topic 7), patient mental health (topic 23), viral and bacterial infection (topics 30 and 35), pregnancy (topic 40), microRNA (topic 26), and some social aspects of hospital/clinical procedures (topic 12) (see Table

4.8).

Topic number Topic words

7  surgery injury complication image technique lesion month blood strain brain
12 network hospital family staff mouse program score participant member intervention
23 depression anxiety movement participant score disorder scale symptom stress month
26 mirnas sequence cancer growth target fraction mirna culture infection medium
30 virus insulin mouse blood trial vitamin score child infection cancer
32 chromosome embryo stage clone exposure receptor mutation locus phenotype marker
35 strain bacteria production culture medium activation growth infection inhibition product
40 pregnancy birth woman mother antibody child outcome parent muscle infection

Table 4.8: A selection from the 50 topics generated from noun corpus


Model Evaluation

This model has more detailed topics, and ‘cancer’ appears to be a significant topic word. Overall, the model shows variety in the types of topics it contains, although some identical topics and instances of mixed topics are present as well.

4.2.5.5 100 Topics Model

The topics found by this model are variations of the topics found in the 10, 20, and 50 topics models. However, the advantage of this model is that one can see many topic words associated with the same theme. A good example of this are the three topics that fall under the theme of breast cancer, depicted in Table 4.9. We see that the topic words in them have a different thematic focus: some focus on the growth of the carcinoma (topic 1), others on diagnosis and mortality (topic 23), whereas the last one focuses on tumours and mutations (topic 80).

Topic words Labels

cancer tumor breast proliferation antibody growth carcinoma woman medium invasion breast-cancer (1)

cancer association death mortality cohort breast diagnosis exposure survival incidence breast-cancer (23)

cancer tumor breast sequence therapy mutation plant stage mirnas association breast-cancer (80)

Table 4.9: Focus on breast-cancer related topics from the 100 topics model from the noun corpus

Model Evaluation

Overall, the 100 topics model has the issue of identical topics, although, as shown in the previous example (see Table 4.9), it generates fairly good topics. The model lacks thematic diversity; perhaps 100 topics are too many to be extracted from this subcorpus. Hence, if one were to work with this corpus for the scope of this Master’s thesis, one would have to rely on a better version of the previously generated 50 topics model.

4.2.6 Experiment 5: Influence of POS Tags

In order to gauge the influence of different POS tags, I ran multiple models with

different corpora. As the base corpus, I am using the results from 4.2.5.4, where I have a 50 topics model with noun-related words. Let us denote it as the Noun-corpus.

I created three new corpora, each with a different group of tokens in them. I used


Topic number Topic words

1  plant strain sequence medium growth culture primer cancer water mutation
2  stimulus stimulation vaccine mouse neuron target layer latency trial sequence
3  growth insulin image neuron domain field score strain parameter reaction
4  participant trial event image performance block stage activation stimulus frequency
5  measurement temperature blood image water sensor phase device field surface
6  mouse antibody infection blood animal cytokine culture tumor macrophage brain
7  surgery injury complication image technique lesion month blood strain brain
8  domain sequence mutation amino molecule peptide position virus compound family
9  intervention trial score child outcome participant month disorder symptom criterion
10 infant child score infection trial month season image correlation participant
11 brain seizure mouse frequency sequence woman blood trial error stimulation
12 network hospital family staff mouse program score participant member intervention
13 woman participant infection blood child prevalence brain status adult volume
14 antibody culture mouse membrane neuron medium image growth mutation fluorescence
15 student service country community people practice program question school survey

Table 4.10: 15 topics from 50 topic model from Noun-Corpus

the following criteria to create the corpora, namely a corpus with noun and verb

type tokens (Noun-Verb-corpus)1, a corpus with noun and adjective type tokens

(Noun-Adjective-corpus)2, and a corpus with noun, verb, and adjective type tokens

(Noun-Verb-Adjective-corpus)3. Also, as a vocabulary reduction measure that is comparable across all three corpora, I removed the top 100 most common words from each newly created corpus. This was done by removing the top 100 most frequent words from the dictionary of vocabulary frequencies created by Gensim. I then

analysed the topics returned by the models.
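The frequency-based trimming can be sketched with a corpus-wide counter. This is a plain-Python stand-in for trimming the Gensim dictionary; the function name `remove_top_n` and the toy documents are illustrative.

```python
from collections import Counter

def remove_top_n(docs, n):
    """Drop the n most frequent tokens over the whole corpus."""
    freq = Counter(token for doc in docs for token in doc)
    top = {token for token, _ in freq.most_common(n)}
    return [[token for token in doc if token not in top] for doc in docs]

docs = [["cell", "gene", "mouse"], ["cell", "protein"], ["cell", "gene"]]
# 'cell' is the most frequent token (3 occurrences), so n=1 removes it.
reduced = remove_top_n(docs, n=1)
```

The same effect is achieved on a real Gensim dictionary by removing its most frequent entries before converting the documents to bag-of-words vectors.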

4.2.6.1 Noun-Verb Corpus

For the Noun-Verb corpus I will only look at the first 15 topics returned by the model

(see Table 4.11). I chose 15 topics because topic diversity in this model is extremely low, and it is possible to judge its quality from the first 15 topics alone. I could also have randomly chosen, e.g., 20 topics from this model, and it would not have made a difference for my evaluation. I will compare it with the top 15 topics from the

Noun-corpus (see Table 4.10).

Comparing the topics returned by both corpora, one can observe that the topics from the Noun-corpus demonstrate great diversity. In contrast, the Noun-Verb-corpus exhibits an issue with its dictionary: most of the observed topics contain the words ‘cell’, ‘gene’, ‘mouse’, and ‘data’. These topics are also of poor quality, as most of them are identical, generic, and nonsensical. Furthermore, most of the 15 topics show, to a certain extent, cases of mixed topics.

1 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, and VBZ.
2 Tokens with POS tags: NN, NNS, NNP, NNPS, JJ, JJR, and JJS.
3 Tokens with POS tags: NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ, JJ, JJR, and JJS.


Model Evaluation

This model has severe issues with its vocabulary and cannot be used for any practical

purposes. I also observed that despite the model containing words that fall under the

category of verbs, none of the topic words appear explicitly to be verbs themselves. It

could be the case that some of the tokens in the topics are homonyms of lemmatized

verbs. For example, the tokens ‘effect’, ‘test’, ‘study’ and ‘result’ could possibly

be verbs. Unfortunately, from the topic model alone it is not possible to say whether these tokens are nouns or verbs. This is due to how LDA works: it ignores the syntax of the tokens and focuses on word frequencies in documents. Moreover, during the text preprocessing stage I may have conflated tokens which are homonyms. It is highly likely that the homonyms that appear in the topics refer both to their verb and their noun forms.

Topic number Topic words

1  cell protein figure antibody activity expression analysis study level membrane
2  cell study group data patient analysis difference result time level
3  cell protein data treatment expression study analysis level gene receptor
4  cell study patient data analysis effect group gene level disease
5  cell population region site bone analysis gene study number data
6  sequence analysis study gene data population time group result rate
7  patient study risk data group time year case result rate
8  cell figure analysis condition activity time result data study effect
9  study patient analysis data group result model treatment effect test
10 gene study patient expression cell group sequence data protein disease
11 cell expression mouse control level protein group analysis figure gene
12 specie population model effect time study data number group size
13 data cell study trial analysis time gene result effect treatment
14 study student patient group time level effect rate treatment analysis
15 cell patient study infection control mouse gene result expression level

Table 4.11: 15 topics from 50 topic model from Noun-Verb corpus

4.2.6.2 Noun-Adjective Corpus

For this corpus, I will compare the topics returned by the model using the Noun-

Adjective-corpus with the results from our base Noun-corpus. I observed here as well

that the model has issues with the dictionary, as the tokens ‘cell’, ‘study’, ‘group’, and ‘data’ are highly prevalent in all of the 15 topics of the Noun-Adjective model. In this case too, the topics are very generic and, in some cases, nonsensical.

Model Evaluation

Besides the fact that the model has issues with its dictionary, which should be reduced in order to yield sensible results, another aspect I noticed is that the topics do not contain any adjective-related tokens. I assume that the adjective tokens have low frequencies compared to their noun counterparts, and for this reason they do not appear in the topic model. In summary, this model cannot be used for the aforementioned reasons.

Topic number Topic words

1  cell study analysis patient expression level gene data group figure
2  plant cell study control analysis group leaf sample level gene
3  cell response expression study data effect macrophage level infection group
4  cell expression tumor control mouse figure protein antibody level treatment
5  gene data study analysis figure level effect group control genotype
6  study risk population year group prevalence case cancer analysis woman
7  study level cell data group analysis concentration year child time
8  group study bone case result time difference patient data analysis
9  strain sample sequence group number vaccine study data analysis isolates
10 study group analysis protein data region activity patient gene level
11 treatment data study group analysis response sample time cell model
12 cell study figure analysis data expression group gene number control
13 compound study activity concentration reaction effect result acid group data
14 mutation study analysis cell nerve data sample group patient case
15 group system process study data change time research community level

Table 4.12: 15 topics from 50 topic model from Noun-Adjective corpus

4.2.6.3 Noun-Verb-Adjective Corpus

In this case too, the model generated from this corpus will be compared to the base Noun-corpus. It is also observed here that the tokens ‘cell’, ‘study’, ‘group’, and ‘data’

are quite prevalent items in the generated topics. For this iteration, the topics are

nonsensical and partially mixed in nature.

Model Evaluation

This model should theoretically exhibit topic words that are verbs and adjectives. Unfortunately, what is observed here is that the tokens are mostly nouns. It could be the case that some of these nouns are homonyms of verbs, as discussed before

before (see 4.2.6.1). As for the adjectives, they are lacking in this topic model as

well. The dictionary of the Noun-Verb-Adjective-corpus is also the cause of the lack

of topic diversity for this model.

Topic number Topic words

1  patient study cell treatment group level trial result data therapy
2  model data number value effect study result time parameter analysis
3  cell patient data analysis study effect mouse model figure result
4  cell data group study patient time figure protein analysis result
5  cell response study result gene expression effect analysis number receptor
6  sequence gene specie population data number analysis region genome site
7  patient study disease year data risk rate treatment analysis factor
8  data time sample analysis temperature solution particle figure result surface
9  study treatment result gene time patient group cell analysis effect
10 cell protein figure antibody control result data expression membrane time
11 group study effect analysis time data response control difference patient
12 health study care woman research data intervention group time service
13 cell concentration study activity effect data control figure analysis treatment
14 gene study cell level analysis data activity expression sample treatment
15 cell protein study level activity gene expression effect control result

Table 4.13: 15 topics from 50 topic model from Noun-Verb-Adjective corpus


Term Similarity between the Models

                Noun-Verb  Noun-Adjective  Noun-Verb-Adjective
Noun            45.77%     44.09%          37.93%
Noun-Verb       -          61.8%           54.59%
Noun-Adjective  -          -               60.4%

Table 4.14: Percentage of identical terms in between the models

As a next step, I calculated the percentage of identical words among the topic words of the corpora. As seen in Table 4.14, the Noun-corpus shares the fewest terms with the other three corpora. As a further measure, I also looked at the number of unique topic words for each case (see Table 4.15).

Noun Noun-Verb Noun-Adjective Noun-Verb-Adjective

Unique terms 201 178 165 159

Table 4.15: Number of unique words found in all the topics

It can be seen that the Noun-corpus has the highest number of unique words in its topics. Even though the other models should, in principle, contain tokens from different groups of POS tags, the number of unique tokens decreases as the number of token types increases.
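The two measures in Tables 4.14 and 4.15 can be reproduced on toy models as follows. The directed-overlap reading of "percentage of identical terms" is one plausible interpretation, and the function names are illustrative.

```python
def unique_terms(model):
    """Distinct words over all topics of a model (cf. Table 4.15)."""
    return {word for topic in model for word in topic}

def percent_shared(model_a, model_b):
    """Share of model_a's topic vocabulary that also occurs in model_b's
    topics (cf. Table 4.14; one plausible reading of that measure)."""
    a, b = unique_terms(model_a), unique_terms(model_b)
    return 100 * len(a & b) / len(a)

noun_model = [["cell", "gene"], ["plant", "leaf"]]
noun_verb_model = [["cell", "study"], ["gene", "data"]]
n_unique = len(unique_terms(noun_model))              # 4 distinct terms
shared = percent_shared(noun_model, noun_verb_model)  # 2 of 4 terms shared
```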

The explanation for this can be found by looking at the topic models and the dictionary for each corpus. Even though the number of tokens in the corpus increases with each new group of tokens included, the topic words in the resulting models are still different types of nouns. This is true for all the corpora created thus far. Hence, the issue of dictionary trimming becomes important, because with increasing numbers of tokens, removing only the top 100 tokens from each of the three corpora is not the best method for getting good topic models. For the verb- and adjective-based corpora, it has been mentioned that some of the vocabulary items may have been conflated during the lemmatization process; the word frequencies within the dictionaries are no longer identical due to this conflation. Therefore, for the larger Noun-Verb-, Noun-Adjective-, and Noun-Verb-Adjective-corpora, it would be advisable to remove a larger number of most frequent terms than for the corpora with fewer types of lemmas. Unfortunately, there are no guidelines that state how many high-frequency dictionary items should be removed; this can only be determined via trial and error over multiple experiments.

Hence, I decided to discard the verbs and the adjectives from my future experiments.

This decision is easily comprehensible if one considers that, with the current subcorpus, a somewhat ideal cut-off point for reducing the dictionary has been found. In the following experiments, I will only work with the corpus of noun-related lemmas.

4.2.7 Experiment 6: Extracting Models with Distinct Topics Using

Topic Similarity

Similar topics are an issue for topic models. In this section, I look at a measure for

detecting similar topics and check whether, with the current model parameters, my models contain them.

In order to find the common topics, I ran the model with the Noun-corpus and

previously mentioned parameters 25 times. I then calculated whether the topics within a model are similar. Let us consider a topic model A, which consists of topics

At1 to At50, and each topic has 10 topic words. If I want to check if At1 is similar

to At2, At3...At50, I can check if the topic words in At1 match the topic words in

the other topics.

Here, I can use a similarity measure, inter-topic similarity, which specifies the amount of overlap required between two topics. If the inter-topic-similarity measure is 100%, then all the topic words in At1 are identical to all the topic words in At2; if the measure is 60%, then only 6 out of 10 words (in any order) need to be shared between the two topics.

To check the inter-topic similarity between all the topics of a given model, one should first calculate the number of pairwise combinations of topics (k = 2) out of the n = 50 topics in a model:

\[ \frac{n!}{k!\,(n-k)!} = \frac{50!}{2!\,(50-2)!} = 1225 \tag{4.2} \]

Then the inter-topic similarity can be calculated as the number of cases where two topics are the same (for a given similarity measure) divided by the number of possible topic combinations. For example, if model A has 1000 cases where two topics are the same (for an inter-topic-similarity measure of 20%), then the inter-topic similarity for all the topics in model A is (1000/1225) × 100 = 81.63%. If one has multiple topic models, one can calculate the average of the inter-topic similarity scores over all models.
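The procedure above can be sketched in a few lines of Python. The function names and the three toy topics below are my own illustration, not code or topics from the thesis:

```python
from itertools import combinations

def topic_similarity(t1, t2):
    """Fraction of topic words shared by two topics (word order ignored)."""
    return len(set(t1) & set(t2)) / len(t1)

def inter_topic_similarity(topics, threshold):
    """Percentage of topic pairs whose word overlap reaches the threshold;
    for 50 topics there are C(50, 2) = 1225 such pairs."""
    pairs = list(combinations(topics, 2))
    hits = sum(1 for a, b in pairs if topic_similarity(a, b) >= threshold)
    return hits / len(pairs) * 100

# three toy topics of 10 words each (illustrative only)
topics = [
    ["cell", "gene", "protein", "expression", "dna",
     "rna", "pathway", "receptor", "enzyme", "kinase"],
    ["cell", "gene", "protein", "tumor", "cancer",
     "breast", "survival", "growth", "therapy", "lung"],
    ["patient", "trial", "therapy", "dose", "drug",
     "treatment", "response", "outcome", "risk", "group"],
]
print(round(inter_topic_similarity(topics, 0.2), 2))  # → 33.33
```

Here only one of the three topic pairs shares at least 2 of 10 words, so the score is 1/3 of 100%.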


Figure 4.3: Average inter-topic similarity

Figure 4.3 shows the average inter-topic similarity for the models. It shows that with an inter-topic-similarity measure of 40%, the average inter-topic similarity of the model drops to 2.65%. This means that, on average, only in 32.56 out of 1225 possible topic combinations could one expect topics with 4 out of 10 words in common. For 50%, the average inter-topic similarity of the model drops to 0.43%. For a high inter-topic-similarity measure, a low average inter-topic similarity score indicates that the model does not have many identical topics. This shows that, with the current set of parameters, the models do not have many identical topics (see Figure 4.3).

4.2.8 Experiment 7: Extracting Stable Models

As discussed before, LDA models tend to be unstable as each run of the model

returns a different set of topics in a different order (see 2.3.2). Hence, the task is

to find a model that is stable enough so that it can be used for further testing and

experiments. In the following section I will try to find a way to judge the stability

of my topic models.

As mentioned before, topic model A consists of topics At1, At2... At50 and similarly


another model B consists of topics Bt1 ... Bt50. In order to check if there are any similarities between the models, one could try to match the topics from model A with those from model B. This could be done by checking whether the topic words of At1 match those of any topic in model B. Here, it is highly likely that the match is not perfect. Hence, we can have results such as: At1 matches Bt3 (6 out of 10 words) and Bt6 (7 out of 10 words). In that case, I say that At1 matches Bt6. I then remove At1 and Bt6 from the comparison setup and continue in the same manner with At2 through At50. For each topic, I go through all the candidate topics and choose the one with the highest score. If the scores are identical, I choose the first one. For example, if At7 matches both Bt27 and Bt35 with 5 common words, I match At7 with Bt27.

In order to calculate the similarity between two models, the inter-model similarity, I can sum the similarity scores of the matched topic pairs; e.g. At1 and Bt6 have a similarity of 7. Two models can have a maximum inter-model-similarity score of the number of topic words per topic multiplied by the total number of topics, in our case 10 × 50 = 500. Thus, if two models are identical (up to a permutation of their topics), their inter-model-similarity score would be 500. I calculated the average similarity score for the aforementioned models: I chose one model as a constant and compared it with the other models (model A with model B, then model A with model C, etc.). This resulted in an average inter-model-similarity score of 171.25; in other words, the models are 34.25% similar to each other.

However, I realized that this measure is not adequate for checking the similarity between the models: by trying to match every topic in one model with every topic in another model, the matches are sometimes very much forced.

I therefore tried a similar approach; this time, however, I only recorded whether a topic was found or not, i.e. whether At1 can be matched with any of the topics in model B. Moreover, I also accounted for the number of common words between the topics, as explained in Section 4.2.7.

This can be explained in the following manner. I take model A as my base model and try to match At1 with any of the topics in model B, given a specific inter-topic-similarity measure. For example, with an inter-topic-similarity measure of 0.7 (7 out of 10 words must be common), At1 matched at least one topic in model B, At2 matched none, and so on. Suppose that, at the end of this example, 32 topics from model A found a counterpart in model B for the given inter-topic-similarity measure. The inter-model similarity is then (32/50) × 100 = 64%. The logic behind this measure is that if the inter-model similarity score is high even for a high inter-topic-similarity measure, then the models are similar to each other while containing a variety of topics that are not identical.
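This threshold-based inter-model similarity can be sketched as follows; the two two-topic toy models (with single characters standing in for topic words) are purely illustrative:

```python
def topics_match(t1, t2, threshold):
    """True if the two topics share at least threshold * len(t1) words."""
    return len(set(t1) & set(t2)) >= threshold * len(t1)

def inter_model_similarity(model_a, model_b, threshold):
    """Percentage of topics in model A that find at least one match in model B."""
    found = sum(
        1 for ta in model_a
        if any(topics_match(ta, tb, threshold) for tb in model_b)
    )
    return found / len(model_a) * 100

# toy models: two topics of 10 "words" each (characters used as stand-ins)
model_a = [list("abcdefghij"), list("klmnopqrst")]
model_b = [["k", "l", "m", "n", "o", "p", "q", "x", "y", "z"],
           ["a", "b", "c", "u", "v", "w", "x", "y", "z", "q"]]

print(inter_model_similarity(model_a, model_b, threshold=0.7))  # → 50.0
```

With a measure of 0.7 only the second topic of model A finds a counterpart, so 1 of 2 topics matches; lowering the measure to 0.3 lets both topics match.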


number of passes     10    20    30    40    50    60    70    80    90   100   200   300   400   500
similarity (%)
 10                93.2  91.6  91.6  90.4  88.8  92.8  90.4  93.6  88.4  90.4    89    91    91    94
 20                79.6  80.4    82    82  80.4  80.4  85.2    80    82  85.2    83    84    85    84
 30                68.4    74  78.4  75.2    76  79.2  81.2  79.6  78.8  80.4    84    82    84    79
 40                61.6    62  70.8  66.8  69.2    72  76.4  73.2  75.6  78.4    76    78    76    79
 50                51.2    54    56    56  60.8  61.6  66.8    66  66.4  70.4    67    74    69    70
 60                37.2  40.8  43.2  41.6  48.4  49.6    54  52.4  48.8    54    58    61    51    49
 70                23.6    28  28.4  26.8  34.4  34.4  39.6  36.8    32  42.8    41    49    41    34
 80                11.6  18.4    16  17.2    20  24.4  26.8  18.8    16  28.4    20    32    22    20
 90                 5.2   5.2   8.8   4.8   7.2  11.6    12     8  10.8    16     9    17    13    10
100                 0.4   0.4     2   0.4   0.4   3.2     2   1.6   2.4   4.4     2     2     3     4

Table 4.16: Topic similarity based on number of similar words over multiple passes

I calculated the inter-model similarity for different inter-topic-similarity measures. As my preliminary results were rather low, I changed another parameter of my LDA model, the number of passes, i.e. the number of times the model goes through the entire corpus during training. Then I calculated the scores for those models. For every change in the number of passes, I ran the model 5 times; the scores in Table 4.16 show the average inter-model similarity.

One can see in Table 4.16 that the similarity between the models increases with the number of passes. Moreover, there is a greater similarity between the models if one takes the number of common words between the topics into consideration: intuitively, the lower the required minimum number of common words, the more similar the models are to each other. For 100 passes, inter-topic-similarity measures of 50% and 60% yield model similarities of 70.4% and 54%, respectively. It can therefore be assumed that the models are stable. With 100 passes, one achieves a fairly adequate level of similarity, and beyond 100 passes the level of similarity between the models does not increase significantly any more.

At this point, the hyper-parameters for the topic models have been set, and a fairly stable set of topic models has been found. For the upcoming sections, I will choose one of the models trained with 100 passes, namely the one used as the base model, as my topic model.

4.2.9 Topic Labelling

There are many ways of automatically labelling topics within a topic model (see 2.4.1). However, despite their usability, they either depend on external APIs (Lau et al. [2011]) or require further models to be trained (Bhatia et al. [2016]). Instead of referring to a topic merely by a number, I propose a rudimentary


approach of referring to them by their topic number and the top three topic words in a hyphenated construct. For example, topic 19 (see Table 5.1) can also be referred to as topic 19-surgery-injury-complication. This conveys more information about the content of the topic. A comprehensive list of all 50 topics can be found in the appendix (see Table A.1).
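The label construction can be sketched in a few lines. With Gensim, the (word, probability) pairs could come from `LdaModel.show_topic(19, topn=3)`; the probabilities shown below are illustrative placeholders:

```python
def topic_label(topic_id, top_words, n=3):
    """Join the topic number with its top-n topic words,
    e.g. 19 -> '19-surgery-injury-complication'."""
    words = [word for word, _prob in top_words[:n]]
    return "-".join([str(topic_id)] + words)

# with Gensim, the pairs could come from: lda.show_topic(19, topn=3)
# (the probability values below are illustrative placeholders)
top_words = [("surgery", 0.041), ("injury", 0.032), ("complication", 0.027)]
print(topic_label(19, top_words))  # → 19-surgery-injury-complication
```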


5 Data and Topic Exploration

5.1 Document Topics Distribution

Checking the topic distribution per model can be a challenging task, because LDA represents a document as a probability distribution over multiple topics.

For example, the document in Figure 5.1 is represented by topic 34-cancer-tumor-breast (see Table 5.1) with a topic probability of 0.9034, whereas the document in Figure 5.2 is represented by topic 19-surgery-injury-complication with a probability of 0.4172 and topic 28-lung-liver-platelet with a probability of 0.4142. Topic words from both topics 19 and 28 are present in the article in Figure 5.2.

In some cases, it is easy to guess which topic best represents the document (e.g. topic 34-cancer-tumor-breast for Figure 5.1). However, for other documents this distinction is not easy, as shown for Figure 5.2: the probability difference between the top 2 topics is 0.4172 − 0.4142 = 0.003. There are other cases in the corpus where the probability scores of the top topics are identical.

Topic number   Topic words
19             surgery injury complication pain technique nerve catheter pressure operation vein
28             lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
34             cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy

Table 5.1: Topics 19, 28, 34

5.2 Data Exploration

5.2.1 Topic Distribution

Trends in the data can be discovered by looking at the topic probability distribution within the corpus. The logic for this approach is as follows: for the 150,000 documents in the corpus, using Gensim, one can obtain the document topic probability for each document that was used to create the model. For example, for a fictitious document D4, the topic probabilities are 0.85 for topic 2 and 0.1 for topic 3 (see


Table 5.2). After gathering the topic probability scores for all the documents, one can calculate the topic probability distribution for a given topic. For topic 3 in the example shown in Table 5.2, the probability distribution can be visualised by creating a histogram over all 150,000 data points (documents in the corpus) for which the document probability is greater than 0 (e.g. 0.45, ..., 0.1).
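The gathering step can be sketched as follows, assuming the per-document (topic, probability) lists have already been obtained, e.g. via Gensim's `get_document_topics`; all names and toy values below are my own illustration:

```python
def topic_probabilities(doc_topics, topic_id, min_prob=0.0):
    """Collect the probability of `topic_id` from per-document lists of
    (topic_id, probability) pairs, keeping values above `min_prob`.
    With Gensim, one such list per document could be obtained via
    lda.get_document_topics(bow, minimum_probability=0)."""
    return [p for topics in doc_topics
            for tid, p in topics
            if tid == topic_id and p > min_prob]

# toy per-document topic distributions (illustrative values)
doc_topics = [
    [(19, 0.4172), (28, 0.4142)],  # cf. the article in Figure 5.2
    [(34, 0.9034)],                # cf. the article in Figure 5.1
    [(19, 0.05), (34, 0.30)],
]
probs = topic_probabilities(doc_topics, topic_id=19, min_prob=0.1)
print(probs)  # → [0.4172]

# the histogram itself could then be drawn with matplotlib:
# import matplotlib.pyplot as plt
# plt.hist(probs, bins=50)
# plt.xlabel("topic probability"); plt.ylabel("number of articles")
# plt.show()
```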

Figure 5.1: Topic words in article (green: topic 34)

Figure 5.2: Topic words in article (green: topic 19, yellow: topic 28)

Based on the corpus of 150,000 PubMed documents, Figure 5.3 shows the topic probability distribution for the corpus. The x-axis represents the topic probability, and the y-axis shows the number of articles in each category. As can be seen, this graph does not convey much information. This could be because, in many cases, the articles exhibit a low probability for this topic. In order to visualize the data better, I only selected the cases where the topic probabilities are greater than 0.1 (Figure 5.4).

Document ID   Year   Topic 1   Topic 2   Topic 3   Topic 4
D1            2000   0.25      0         0         0
D2            2000   0         0         0.45      0
D4            2001   0         0.85      0.1       0
D5            2001   0         0         0         0.3

Table 5.2: Fictitious topic probability distribution over multiple topics and documents

For most of the 50 topics, the probability distribution looks similar to that of topic 19 (see Figure 5.4). As for topics 30-adult-host-male and 41-image-volume-measurement, one can observe sudden spikes in the topic distribution. For topic 41, there is a sudden increase in the number of articles at a probability of approximately 0.5. For


Figure 5.3: Topic probability distribution of documents of topic 19

Figure 5.4: Topic probability distribution of documents of topic 19, where topic probability > 0.1

topic 39, the spike in the number of articles occurs at topic probabilities of approximately 0.7 and 0.8. For topic 6-specie-specimen-margin, the probability distribution does not fall exponentially but stagnates a little. For topic 25-domain-molecule-chain, the drop in the number of articles with increasing probability is observed until 0.6; then the number of articles with higher probability increases before falling again (see Figure 5.6). In general, it is observed that the higher the topic probability, the fewer articles belong to that category. Beyond this, however, the information is not useful for detecting trends in the data.

5.2.2 Average Topic Probability

To observe temporal changes in the data, the information shown before (see 5.2.1) needs to be transformed to expose its diachronic aspects. For a given topic, article i has the topic probability k_i. During a given year y, n_y articles are published. Thus, one can calculate the yearly average topic probability A_y (see Equation 5.1). For example, for the fictitious data in Table 5.2, the average yearly topic probability for the year 2000 and topic 3 is (0 + 0.45)/2 = 0.225.

\[ A_y = \frac{\sum_{i=1}^{n_y} k_i}{n_y} \tag{5.1} \]
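Equation 5.1 can be computed directly; the sketch below reproduces the fictitious documents of Table 5.2 (function and variable names are my own, not from the thesis code):

```python
from collections import defaultdict

def average_topic_probability(docs, topic_id):
    """Yearly average probability of one topic, per Equation 5.1:
    the per-article probabilities summed and divided by the number
    of articles published that year (absence counts as probability 0)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for year, topic_probs in docs:  # topic_probs: {topic_id: probability}
        totals[year] += topic_probs.get(topic_id, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}

# the fictitious documents of Table 5.2
docs = [
    (2000, {1: 0.25}),           # D1
    (2000, {3: 0.45}),           # D2
    (2001, {2: 0.85, 3: 0.1}),   # D4
    (2001, {4: 0.3}),            # D5
]
print(average_topic_probability(docs, topic_id=3))  # → {2000: 0.225, 2001: 0.05}
```

For topic 3 and the year 2000 this reproduces the (0 + 0.45)/2 = 0.225 example from the text.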

This process reveals temporal trends in the data based on the topics. I looked at a time frame of 15 years, namely from 2000 to 2015. The following graphs show the results (Figures 5.7–5.10). The x-axis represents the year and the


Figure 5.5: Topic probability distribution of documents of topics 6, 10, 39, and 41, where topic probability > 0.1

y-axis represents the average topic probability. It should be mentioned that the data points are not continuous; the lines drawn between them are a visual aid.

In most cases, the average topic probability fluctuates within a given time range or remains more or less constant. However, there are cases where one can observe an increase in the topic probability. As one can see in Figure 5.7, the topic probabilities for the topics 23-trial-month-therapy and 33-lesion-diagnosis-biopsy fluctuate, with a gradual tendency to increase over time. For topic 10-surface-temperature-particle, after a dip in the early 2000s, the probability also rises sharply.

In other cases, the opposite is true: the average topic probability gradually drops over time. This is illustrated in Figure 5.8, where the topic probability has either gradually dropped, as is the case for topic 11-antibody-vector-construct, or has dropped and then remained stable over the years, e.g. topics 12-activation-inhibitor-phosphorylation and 28-care-intervention-service.

There are also instances where a topic suddenly peaks and then its topic probability


Figure 5.6: Topic probability distribution of documents of topic 25, where topic probability > 0.1

Figure 5.7: Average topic probability of documents from 2000 to 2015 of topics 10, 23, and 33

wanes gradually over time. This is shown in Figure 5.9, where topic 25-domain-molecule-chain suddenly exhibits a surge in topic probability in the late 2000s and then declines sharply over the following years. Peaks are also observed for topics 2-sequence-genome-specie and 5-parameter-correlation-probability.

Finally, Figure 5.10 shows that a topic can vanish over time. This is the case for

topic 50-exposure-skin-smoking, which suffers a sharp decline in 1995 and does not

reappear in the corpus after 1999.


Figure 5.8: Average topic probability of documents from 2000 to 2015 of topics 11, 12, 17, 21, 28, and 43

Figure 5.9: Average topic probability of documents from 2000 to 2015 of topics 2, 5, and 25

Thus, using the average yearly topic probability, one can observe temporal trends

in the data, namely changes in diachronic topic probability, where it increases (see

Figure 5.7), decreases (see Figure 5.8), exhibits peaks (see Figure 5.9), and stops

being popular (see Figure 5.10).


Figure 5.10: Average topic probability of documents from 1980 to 2005 of topic 50

5.3 Observing Diachronic Trends Using Topic Models

With the help of the average topic probability and a diachronic corpus, it has been shown that diachronic trends within a corpus can be detected using a topic model (see 5.2.2). Moreover, trends in the data vary based on the type of topic. Thus, using topic models, one can detect surges as well as declines of certain topics within these corpora. These observations support my claim that it is indeed possible to detect temporal trends in a corpus by using topics generated by a topic modeling algorithm (see 1.2).

5.4 Topic Exploration

For the upcoming sections, I will be analysing the following topics: 13-woman-heart-pregnancy, 22-infection-virus-vaccine, 34-cancer-tumor-breast, and 38-infection-resistance-bacteria (see Table 5.3). I chose these topics not only because I know the meaning of their topic words, but also because the topics themselves have a cohesive theme. As for their average topic probability, these topics exhibit a certain variability (see Figure 5.11); it is visible that these topics both gained and lost popularity between 2000 and 2015. However, simply looking at the average topic probability does not convey much information about the content of these topics. Hence, in the following sections, I will look into the variability that exists within the topics themselves.


Topic number   Topic words
13             woman heart pregnancy pressure birth hypertension infant delivery mother week
22             infection virus vaccine antibody vaccination antigen replication titer influenza transmission
34             cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
38             infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli

Table 5.3: Four topics selected for data exploration

Figure 5.11: Average topic probability of documents from 2000 to 2015 of topics 13, 22, 34, and 38

5.4.1 Frequency of Topic Words in the Corpus

To analyse the contents of the individual topics, I looked at the relative frequency

of the topic words as they occur in the corpus. The relative frequency for each year

was calculated as follows:

\[ \text{yearly relative frequency} = \frac{\text{number of documents in which the topic word occurred that year}}{\text{number of documents published that year}} \]
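The yearly relative frequency can be sketched as follows, assuming each document has been reduced to its set of lemmas; the toy corpus is purely illustrative:

```python
def yearly_relative_frequency(docs_by_year, word):
    """Fraction of the documents of each year that contain the word
    at least once (document counts, not token counts)."""
    return {year: sum(1 for doc in docs if word in doc) / len(docs)
            for year, docs in docs_by_year.items()}

# toy corpus: each document reduced to its set of lemmas (illustrative)
docs_by_year = {
    2000: [{"pregnancy", "birth"}, {"heart", "pressure"}],
    2001: [{"pregnancy", "infant"}, {"pregnancy"}, {"hypertension"}],
}
print(yearly_relative_frequency(docs_by_year, "pregnancy"))
# → {2000: 0.5, 2001: 0.6666666666666666}
```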

The aim of this approach is to demonstrate the usage of the topics and the diachronic changes that they have undergone. As seen in Figure 5.12, it is difficult to see any trends, as the graph is highly cluttered by the 10 topic words. Hence, I decided to divide the topic words into groups for better visibility of the trends. Two groups were created: pregnancy-related words (e.g. ‘birth’, ‘infant’, ‘mother’, and ‘pregnancy’) and words related to heart disease (e.g. ‘heart’, ‘hypertension’, ‘pressure’). I discarded the words ‘week’ and ‘woman’ from the analysis, as they are too general and can occur in many articles regardless of the topic. As for the pregnancy-related words, it can be seen that there


Figure 5.12: Relative frequency of topic words for topic 13-woman-heart-pregnancy

has been a steep increase in the usage of these words since 2001, but the usage has somewhat stagnated since 2005 (see Figure 5.13). As for the words related to heart disease, their usage has steadily increased over time (see Figure 5.14).

Another observation made during this process was that, by analysing the trends present among the topic words, one can observe groups within a given topic. Unlike in the example mentioned above, one can clearly detect groups within the topic words, as demonstrated in Figure 5.15. It is clearly visible that from 2005 onwards the topic words form three distinct groups. Based on the graph in Figure 5.15, one can divide the topic words into three groups, as shown in Examples 5.2–5.4.

(5.2) ‘antibody’, ‘infection’

(5.3) ‘replication’, ‘transmission’, ‘virus’, ‘antigen’

(5.4) ‘influenza’, ‘vaccination’, ‘vaccine’, ‘titer’

Among these immunology-related terms from topic 22-infection-virus-vaccine, one can observe logical groupings, as in Example 5.4, where the topic words ‘vaccination’ and ‘vaccine’ fall into the same semantic category. Nonetheless, as these are low-frequency terms within the specific grouping


of topic words, it is somewhat difficult to see the changes in their usage within the topic as a whole. Hence, I plotted the topic words from Example 5.4 on a separate graph for better visibility of the temporal trends (see Figure 5.16). Unlike in Figure 5.15, where due to the scaling of the image it appears that the relative frequency of the terms from Example 5.4 has remained somewhat stable since 2007, in Figure 5.16 it can be observed that the frequency of these topic words has fluctuated over time. I applied a similar approach to visualize the words in Example 5.3 (see Figure 5.17). However, this visualization was not helpful for ascertaining temporal trends among the topic words within this group. Thus, by means of visualisation it is also possible to detect groups within a topic and observe how these groups evolve temporally.

Figure 5.13: Relative frequency of pregnancy-related words from topic 13

Figure 5.14: Relative frequency of heart-disease-related words from topic 13

It can be seen here that the internal trends within a topic can be examined by

looking at the diachronic relative frequency of the topic words. This task can be

made easier with the help of a dynamic interface (see Chapter 7).

5.4.2 Diachronic Shifts within a Topic

In the previous section, it has been demonstrated that one can use the relative frequency of the words within a model to show diachronic shifts and trends within a topic. Thus, it has been shown that using topic modeling one can detect diachronic changes within the words of a given topic. Consequently, this provides an answer to my second research question.


Figure 5.15: Relative frequency of topic words for topic 22-infection-virus-vaccine

Figure 5.16: Relative frequency of immunology-related words from topic 22 (group 2)

Figure 5.17: Relative frequency of immunology-related words from topic 22 (group 3)

5.5 Frequency of Popular Words within a Topic

A topic does not consist only of the topic words in it; one also has to consider other underlying trends within the articles of a given topic. These trends can be analysed by visualizing the relative frequency of the most popular words in them. For


my analysis, I selected articles that belong to a certain topic, based on their document topic probability: if a document has a topic probability greater than zero for a specific topic, then it belongs to that topic. As a consequence, a document can belong to multiple topics. From this subset of articles, I then calculated the most frequent words, i.e. those with the highest absolute frequency in this subcorpus of articles. Finally, I visualized the relative frequency of these popular words within the subcorpus in a diachronic manner.
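The selection and counting procedure can be sketched as follows; all names and toy values are my own illustration, not the thesis code:

```python
from collections import Counter

def top_words_in_topic(documents, doc_topic_probs, topic_id, n=10):
    """Most frequent words in the subcorpus of documents whose
    probability for `topic_id` is greater than zero."""
    counts = Counter()
    for tokens, probs in zip(documents, doc_topic_probs):
        if probs.get(topic_id, 0.0) > 0:
            counts.update(tokens)
    return [word for word, _count in counts.most_common(n)]

# toy data: token lists plus one {topic_id: probability} dict per document
documents = [
    ["heart", "pressure", "heart"],
    ["virus", "vaccine"],
    ["heart", "infant"],
]
doc_topic_probs = [{13: 0.8}, {22: 0.9}, {13: 0.4, 22: 0.1}]
print(top_words_in_topic(documents, doc_topic_probs, topic_id=13, n=1))  # → ['heart']
```

Note that the third document counts towards both topics 13 and 22, reflecting that a document can belong to multiple topics.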

Figure 5.18: Relative frequency of words from topic 13 (top 1-5 words)

Figure 5.19: Relative frequency of words from topic 13 (top 6-10 words)

In Figures 5.18 and 5.19, one can see the diachronic relative frequency of the 10 most popular words in topic 13-woman-heart-pregnancy. Similarly, the diachronic frequencies of the popular words in topic 22-infection-virus-vaccine are displayed in Figures 5.20 and 5.21.

Figure 5.20: Relative frequency of words from topic 22 (top 1-5 words)

Figure 5.21: Relative frequency of words from topic 22 (top 6-9 words)

Specifically for the word ‘brain’, the diachronic relative frequencies differ significantly between topics 13 and 22. Thus, one can assume that the articles


that occur with these topics exhibit a different thematic focus, despite the usage of common words (see Figures 5.18¹ and 5.21²).

5.5.1 Diachronic Popularity of Non-topic Word Related Terms

In the previous section, I was able to demonstrate that by using topic modeling one can group a corpus into different topic groups using the document topic probability (see 5.5). Within these partitions, one can show diachronic trends in the most popular words. I was able to detect diachronic changes in the usage of words within documents that fall under a specific topic, consequently answering my third and final research question.

¹ The word ‘brain’ is displayed by the turquoise line.
² The word ‘brain’ is displayed by the blue line.


6 Results and Discussion

6.1 Research Question Nr. 1

I could show, with the help of the average topic probability, that topics exhibit diachronic trends within a corpus. I also visualized them over a period of 15 years for certain topics. As mentioned before (see 2.3), LDA is an unsupervised machine learning technique, and there is no ground truth against which I can compare my results. Furthermore, each iteration of the model generates a different output, which indeed makes LDA a weak candidate when it comes to replicating the methods described in this Master's thesis. This means that even with the same data and model parameters, one could get different results based on the probabilities calculated by the model. Thus, the results here serve to demonstrate the hypotheses stated in the research questions. Therefore, the only way to demonstrate my claim was with the help of visualizations, as the goal was to show diachronic trends in the data.

LDA aids in the automatic generation of subdivisions within the corpus. However, LDA does not create the diachronic trends; they already exist in the data set. The topics produced by LDA do reflect thematic entities that are present in the corpus; hence, the topics exhibit semantic coherence between the topic words. This is the case for topic 34-cancer-tumor-breast, where the topic is made up mostly of words that are about cancer or otherwise belong to cancer-related discourse. A similar argument can be made for topic 22-infection-virus-vaccine, where the topic words belong to the theme of immunology. In other cases, the topics are mixed (see 2.3.1.2), but the subtopics in them demonstrate semantic coherence; for example, topic 13-woman-heart-pregnancy is about pregnancy and heart disease. Therefore, following the trend of such topics can be helpful to some extent; however, one should also take the trends in the subtopics into consideration (see 6.2).

It should be emphasised that these topics are specific to the corpus and are not about general historical trends in the field of biomedical publication. Hence, the results may vary for a different subset of documents. As mentioned before (see 4.2.4.1), the corpus used for model creation consists of only 150 thousand


articles, which correspond to about 8-11% of the articles published each year, selected at random. These topics could be of interest to anyone wishing to explore a data set and check why a certain trend occurs in the corpus. It should also be mentioned that, in order to acknowledge the validity of a given topic, one should consult a domain expert.

6.2 Research Question Nr. 2

I was able to demonstrate diachronic shifts within a topic. These temporal changes in the usage of topic words throughout the entire corpus show how certain terms within a given topic can exhibit different levels of usage over time. The results also showed that words belonging to a specific subtopic have relative frequencies that are close to each other.

A side effect of this approach is that one can use it to check the quality of a given topic. As demonstrated by topic 13-woman-heart-pregnancy, vague and generic terms that occur in a topic tend to exhibit high relative frequencies within the corpus. This result is logical, as generic terms are expected to occur in many documents.

In some cases, topic words with low relative frequencies show an interesting property: if these topic words are similar to each other, they tend to be grouped together and can be detected in the visualization. This was the case for topic 22-infection-virus-vaccine, where similar words had similar diachronic relative frequencies. Hence, one can use the diachronic visualisation to detect thematic subdivisions within a topic.

6.3 Research Question Nr. 3

Using the corpus subdivisions created by LDA, one can also check trends of words within a topic. This approach, too, cannot be quantified with the help of some pre-existing ground truth, as the results are entirely corpus dependent. However, if one looks at the topics and the trends of the popular words within the topic-related subcorpus, it can be said that this approach is useful for analysing trends within the data. For topic 13-woman-heart-pregnancy, one observes that the usage of the word ‘brain’ is fairly high. If one compares this topic with topic 22-infection-virus-vaccine, one observes a different relative frequency of the word ‘brain’ and a very different diachronic trend


as well. This shows the different thematic foci of the topics, and that the most popular words in each topic are different. Furthermore, the differences in the diachronic trends of the relative frequencies of words that are shared between certain topics emphasize the temporal thematic focus of these topics. Hence, it is possible to detect diachronic changes in the usage of words in a specific topic.

6.4 Summary

In summary, it can be said that even though the three aforementioned approaches

cannot be quantified, the results exhibit clear trends, which appear to be logical,

based on the topics that I analysed in detail. Visualisations are a key to judging

the viability of these diachronic trends. As an exploratory approach into analysing

the underlying trends in unlabelled data, visualisations are adequate to judge the

validity of the claims made in the research questions. As the aim of all three research questions was to demonstrate visible diachronic trends in the data, a quantifiable

measure is not yet required. However, for future work in this domain, it would be

advisable to find some metric which quantifies the trends shown in the visualisations.


7 Website

As it is quite challenging to view the results of this Master's thesis in a non-interactive interface, I have built a companion website for diachronic topic visualization

(DiaTopVis), where one can access and view the data interactively. This section

introduces the companion website designed to visualize the results of my Master’s

thesis. Also in this section, the features and functionalities of this website are ex-

plained. The aim of this website is to provide the user with an easy-to-use interface

for exploring the topics and the underlying themes that exist within them. The in-

spiration for this project is based on the work done by Wang and McCallum [2006]

and Song et al. [2014]. The interface was likewise inspired by a previous project of mine (Ghoshal et al. [2017]). The website consists of three primary sections,

each providing an interactive interface to explore the answers to the three research

questions tackled in this paper. There are also sections where one can view the topic

models and the topics in them. Finally, there is a help section that assists the user with website navigation and orientation. The help section also

provides explanations for each section.

7.1 Generating Charts

The graphs on the website are created using the C3 library [1]. C3 is a JavaScript library (based on D3 [2]) that can be used to generate multiple formats of charts. An

advantage of C3 is that the graphs are interactive. The user can select or remove a

line within the chart by clicking on the legend that is associated with the line. In

addition, the y-axis is dynamic and immediately responds to the addition or removal

of a given line. Furthermore, by placing the cursor on any given point in the chart

the user is provided with all the values for a chosen cross section on the y-axis.

[1] http://c3js.org/examples.html
[2] https://d3js.org/


7.2 Website Sections

7.2.1 Observing Diachronic Trends in Topics

This section of the website provides the user with a tool to interactively observe the

average topic probabilities of a chosen topic. On the website this section, labelled as

‘Part 1: Generate diachronic topic distribution’, generates diachronic topic models.

In this section of the website, the user is guided through a series of steps which result in the generation of a graph showing the average topic probability of documents for a chosen number of topics

and a specified time range. In ‘Step 1’ the user can choose between multiple topics, presented as check boxes. In ‘Step 2’ the user specifies what time

range should be looked at. Finally, ‘Step 3’ generates the topic distribution (see

Figure 7.1 and 7.2).

Figure 7.1: Website: Part 1 User options

Figure 7.2: Website: Part 1 Example output for topics 2,3,4,5


7.2.2 Generate Frequency of Topic Words in the Corpus

In this section, the website provides a tool to look at the distribution of the topic

words in the corpus. On the website, this section is called ‘Part 2: Generate topic

words distribution’. In a first step the user can choose a topic from a drop-down

menu. In a second step the user specifies what time range should be looked at. The

final step, again, generates the topic distribution. The graphs here are similar to the

ones generated in 5.4.1. The x-axis shows the year and the y-axis shows the relative

frequency of the topic words in the entire corpus that was used to create this model.

After the graph is generated it shows the relative frequencies for all ten topic words.

As this view can be a little cluttered, the user can remove some of the topic words

by clicking on the name in the legend below. The advantage of this section is that

one has the opportunity to set a time scale and toggle the number of topic words that one wishes to visualize with only a few clicks (see Figure 7.3).

Figure 7.3: Website: Part 2 Example output for topic 13 (topics shown partially)

7.2.3 Frequency of Popular Words within a Topic

This part of the website can be used to visualize the popular words that occur in

documents that belong to a specific topic. This section is called ‘Part 3: Generate

distribution of top word(s) in a topic’ on the website. The graphs generated in this

section of the website are the ones that were used to demonstrate the popularity

of non-topic related terms (see 5.5). In this third part of the website, the user can

choose a specific topic from the drop-down menu. Then they can select the years for

which they would like to visualize the topics. As a next step, the user can choose the

range of popular words they wish to view. The range of popular words that shall


be depicted can be set using two HTML input buttons with a step attribute [1]. That

means, for example, if the user sets the range of words to be shown between 2 and

4, then the generated graph will show the results for the second, third and fourth

most popular words. In a last step, the user can simply press a button to generate the relative frequencies of the words from a topic subcorpus. These are the three key sections of the website where the user has the opportunity to explore the topic models with an interactive interface (see Figure 7.4).

Figure 7.4: Website: Part 3 Example output for topic 13 (top 2-5 words shown)

[1] https://www.w3schools.com/tags/att_input_step.asp


8 Diachronic Topic Modeling Pipeline

In the methodology section, I mentioned the steps taken in order to arrive at the

final topic model which was used to answer the research questions. This section has

a particular focus on the pipeline that takes PubMed XML documents as input and returns topic models, as well as the CSV files that can be used to create the

visualizations that I used for answering my research questions. Figure 8.1 shows the

structure of the pipeline, which is explained in the following sections.

8.1 Data Extraction

8.1.1 Extract Metadata

The first part of the pipeline consists of data extraction. In the data extraction part

a script reads the PubMed XML file and extracts any required metadata from the

file. For the purpose of this Master’s thesis only the information about the year of

publication was extracted from the file. It is however possible to extend the script

in order to allow the extraction of other metadata. The information about the year

of publication is saved to a separate file. The metadata information will be used in

the data mapping section (see 8.3).
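The year extraction step can be sketched with the standard library's XML parser. The `pub-date/year` path below assumes the JATS-style front matter used by PMC articles, and `extract_year` is a hypothetical helper name, not the script's actual function:

```python
import xml.etree.ElementTree as ET

def extract_year(xml_string):
    """Return the first publication year found in a PMC-style article,
    or None if no <pub-date><year> element is present."""
    root = ET.fromstring(xml_string)
    year = root.find(".//pub-date/year")  # assumed JATS front-matter path
    return year.text if year is not None else None

sample = """<article>
  <front>
    <article-meta>
      <pub-date pub-type="epub"><year>2014</year></pub-date>
    </article-meta>
  </front>
</article>"""

print(extract_year(sample))  # → 2014
```

The same traversal could be extended to pull out further metadata (journal, authors) if needed.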

8.1.2 Extract Text

In the text extraction part of the pipeline, the article text from the corpus undergoes

different levels of text preprocessing and filtering. These processes are explained in

detail in the following sections (see 8.1.2.1 and 8.1.2.2).

8.1.2.1 Text Pre-processing

Here the article text is extracted from the XML file. Then the article text is divided

into sentences using a sentence splitter. As a next step these sentences are tokenized

into words. There is an optional step which one can take to filter the results based on POS tagging; this optional step is discussed in section 8.1.2.1.1. Then the tokens in the sentences are lemmatized.

Figure 8.1: Diachronic topic modeling pipeline. [Flowchart showing the stages: Data Extraction (Extract Metadata; Extract Text with Sentence Segmentation, Word Tokenization, POS Tagging (optional) and Token Lemmatization), Token Pre-processing and Token Filtering (remove short tokens, e.g. tokens with less than 3 characters; remove stopwords, punctuation and numbers; words with specific POS tags (optional)), Corpus Creation, LDA Topic Modeling (Create LDA Dictionary, Edit LDA Dictionary (optional), Create LDA Corpus, Create LDA Model), Data Mapping (user-defined criteria (optional)), Output: Tabular Data]
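As a rough stand-in for these pre-processing steps (the actual pipeline uses a proper sentence splitter and tokenizer; the regexes below are deliberate simplifications):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())

text = "The virus spreads quickly. Vaccines reduce transmission!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
# [['the', 'virus', 'spreads', 'quickly'], ['vaccines', 'reduce', 'transmission']]
```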

8.1.2.1.1 POS Tagging of the Corpus

This is an optional part of the text extraction section where one can POS-tag the corpus and choose to keep only the tokens that fall under a user-defined set of POS tags.

8.1.2.2 Token Filtering

In this section, the corpus is reduced: stopwords, punctuation and numbers are removed from it. It is also possible to remove tokens that are below a certain length. For example, for my corpus I only kept tokens that had at least three characters in them. The output of this section is sent to the corpus creation part of the pipeline (see 8.1.3).
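A minimal sketch of this filtering step; the stopword list is an illustrative subset, and the three-character minimum matches the threshold used for my corpus:

```python
STOPWORDS = {"the", "and", "was", "were", "for", "with"}  # illustrative subset

def filter_tokens(tokens, min_length=3, stopwords=STOPWORDS):
    """Drop stopwords, non-alphabetic tokens (punctuation, numbers)
    and tokens shorter than min_length."""
    return [t for t in tokens
            if t.isalpha() and len(t) >= min_length and t not in stopwords]

tokens = ["the", "virus", "was", "detected", "in", "12", "samples", "..."]
print(filter_tokens(tokens))  # ['virus', 'detected', 'samples']
```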

8.1.3 Corpus Creation

In this section of the pipeline, the data from the token filtering process is converted

into a Python list. The list is then saved as a JSON file. The purpose of creating a

corpus at this stage is twofold. Firstly, the corpus will be used to create the LDA

model (see 8.2.4), and secondly, at a later stage this corpus will also be used to

create word frequencies.
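The serialization step is straightforward; the file location below is hypothetical:

```python
import json
import os
import tempfile

# Each document is a list of filtered tokens; the corpus is a list of such lists.
corpus = [["virus", "vaccine", "infection"], ["heart", "pregnancy", "birth"]]

path = os.path.join(tempfile.gettempdir(), "corpus.json")  # hypothetical location
with open(path, "w", encoding="utf-8") as f:
    json.dump(corpus, f)

with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)
assert reloaded == corpus  # token lists round-trip losslessly
```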

8.2 LDA Topic Modeling

In this section, certain aspects such as corpus creation, dictionary creation and LDA

model creation are explained. An advantage of this part of the pipeline is that the

user can always reuse the data from the previous section to generate different models.

This data recycling aspect can be useful as it saves time.

8.2.1 Dictionary Creation

The files from the previously created corpus are used to create a dictionary that

will be used by the LDA model. This dictionary is created by Gensim and contains

information about the word frequencies in the corpus. At a later stage, this dictio-

nary serves as one of the input parameters for the corpus creation that is required

by Gensim to make an LDA model.

8.2.2 Editing the Original Dictionary

As it was mentioned multiple times in the methodology section (see Chapter 4), the

dictionary plays a key role in determining the quality of the topic model. High

frequency words in the dictionary may lead to the creation of identical topics and

subsequently to a poor-quality topic model. However, creating a new dictionary from a corpus can be a time-consuming endeavour. For this reason, this pipeline

contains a dictionary reduction section where the user can specify which parts of

the vocabulary should be reduced (e.g. high frequency words, low frequency words

or words with certain features that can be user-defined). The output is a reduced

dictionary which will be used to create the corpus.
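Gensim's Dictionary performs this bookkeeping internally; the standard-library sketch below mirrors the two steps — counting document frequencies and pruning over- and under-represented words — with thresholds (`no_below`, `no_above`) chosen only for illustration:

```python
from collections import Counter

def build_dictionary(corpus):
    """Document frequency per word, analogous to what Gensim's
    Dictionary stores for each vocabulary entry."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word once per document
    return df

def reduce_dictionary(df, n_docs, no_below=2, no_above=0.5):
    """Keep words appearing in at least no_below documents and in at
    most no_above (fraction) of documents."""
    return {w for w, c in df.items() if c >= no_below and c / n_docs <= no_above}

corpus = [["virus", "cell", "cell"], ["virus", "heart"],
          ["virus", "cell"], ["heart", "brain"]]
df = build_dictionary(corpus)
vocab = reduce_dictionary(df, len(corpus))
print(sorted(vocab))  # → ['cell', 'heart']
```

Here ‘virus’ is dropped for being too frequent (3 of 4 documents) and ‘brain’ for being too rare, which is exactly the kind of reduction described above.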


8.2.3 LDA Corpus Creation

In this section Gensim creates a corpus based on the frequencies calculated by the

aforementioned dictionary (original or reduced). The corpus itself is not human-readable but contains information about word locations and frequencies in the documents.

8.2.4 LDA Model Creation

This part of the pipeline creates the LDA model. Gensim takes as an input the

dictionary and the LDA corpus and generates the model based on them. Here the

user has the option to set certain parameters of the model, such as the number of topics, the chunk size and the number of passes. Based on this information, Gensim

creates an LDA model. From the output of the LDA model one can extract the

topics. Moreover, the model has additional information such as document topic

probability for all the documents that were used to create the corpus. The topics

as well as the document topic probabilities will be used to generate the diachronic

topic information for some parts of the desired output.

8.3 Data Mapping

In order to generate output that would answer my research questions, it is necessary

to map the information that was generated in the data extraction section (see 8.1)

and the LDA topic modeling section (see 8.2). The key issue here is mapping the

documents in the corpus that was created during the data extraction process (see

8.1.3) and the document topic probabilities that were calculated during the LDA

model creation process (see 8.2.4).

8.3.1 Mapping: Creating Yearly Average Topic Probability

For calculating the average topic probability for every year, I first map the document IDs used by Gensim to the metadata information that was extracted before

(see 8.1.1). Then I calculate the topic probabilities for all the documents in this cor-

pus. This data is saved as a CSV file for later use. The average topic probabilities for every year are calculated using the data from this file and saved for future use

such as visualization.
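Assuming the document–year mapping and the per-document topic distributions are already loaded (all names and values below are hypothetical), the yearly averaging can be sketched as:

```python
from collections import defaultdict

def yearly_average(doc_years, doc_topic_probs, topic_id):
    """Average probability of one topic over all documents of each year.
    doc_years maps a document id to its publication year; doc_topic_probs
    maps a document id to a {topic_id: probability} dict."""
    sums, counts = defaultdict(float), defaultdict(int)
    for doc_id, year in doc_years.items():
        sums[year] += doc_topic_probs[doc_id].get(topic_id, 0.0)
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sums}

doc_years = {"d1": 2010, "d2": 2010, "d3": 2011}
doc_topic_probs = {"d1": {0: 0.8}, "d2": {0: 0.4}, "d3": {0: 0.5}}
print(yearly_average(doc_years, doc_topic_probs, 0))
# averages: 2010 → 0.6 (up to float rounding), 2011 → 0.5
```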


8.3.2 Mapping: Generating Relative Frequencies for Topic Words

For calculating the diachronic trends of the topic words I also map the document

IDs with the metadata. Then for each of the topic words I extract their occurrence

from the corpus files (see 8.1.3). Then I calculate the yearly relative frequency for

those topic words using the method mentioned in section 5.4.1. The output is saved as a CSV file for visualization purposes.
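A sketch of the relative-frequency computation, assuming the corpus has already been grouped by publication year (the data below is invented for illustration):

```python
from collections import Counter

def yearly_relative_frequency(docs_by_year, word):
    """Relative frequency of `word` per year: its count divided by the
    total number of tokens published that year."""
    result = {}
    for year, docs in docs_by_year.items():
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        result[year] = counts[word] / total if total else 0.0
    return result

docs_by_year = {
    2010: [["virus", "vaccine"], ["virus", "cell"]],
    2011: [["vaccine", "cell", "cell", "heart"]],
}
print(yearly_relative_frequency(docs_by_year, "virus"))  # {2010: 0.5, 2011: 0.0}
```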

8.3.3 Mapping: Generating Relative Frequencies for Popular

Words in Topic Subcorpora

Finally, for calculating the relative frequencies of words I divide my corpus into

multiple subcorpora where each subcorpus represents a topic. For the documents

that belong to a specific subcorpus I count all the words and their absolute frequencies. Then I select the top 250 words with the highest absolute frequency.

For these words, I calculate their relative frequencies using the same method as in the previous section (see 8.3.2). Again, the output is saved as a CSV file for visualization purposes.
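A sketch of the subcorpus step, assuming each document has already been assigned a dominant topic (the assignment itself comes from the document topic probabilities and is not shown here):

```python
from collections import Counter

def top_words_in_subcorpus(docs, dominant_topics, topic_id, n=250):
    """Collect the documents whose dominant topic is topic_id and return
    the n most frequent words in that subcorpus as (word, count) pairs."""
    subcorpus_tokens = [t for doc, topic in zip(docs, dominant_topics)
                        if topic == topic_id
                        for t in doc]
    return Counter(subcorpus_tokens).most_common(n)

docs = [["virus", "virus", "vaccine"], ["heart", "birth"], ["virus", "antibody"]]
dominant_topics = [22, 13, 22]  # assumed per-document dominant topic ids
print(top_words_in_subcorpus(docs, dominant_topics, 22, n=2))
# [('virus', 3), ('vaccine', 1)]
```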

8.4 Other Functions

This pipeline also contains other functions that can be useful to judge the quality

of a topic model. There is a function that calculates the topic similarity of multiple models based on the number of words their topics share. There is also another function which calculates the inter-topic similarity of a single topic model, which can be used as a measure of model quality.
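A possible shape for such a similarity function is the word overlap (here the Jaccard index) of the topics' top-word sets; this is a sketch of the idea, not the exact measure implemented in the pipeline:

```python
def topic_overlap(topic_a, topic_b):
    """Word-overlap similarity of two topics: the Jaccard index of their
    top-word sets. 0 means disjoint, 1 means identical."""
    a, b = set(topic_a), set(topic_b)
    return len(a & b) / len(a | b)

t13 = ["woman", "heart", "pregnancy", "pressure", "birth"]
t22 = ["infection", "virus", "vaccine", "antibody", "vaccination"]
t28 = ["lung", "liver", "heart", "pressure", "mortality"]

print(topic_overlap(t13, t22))  # 0.0 — disjoint topics
print(topic_overlap(t13, t28))  # 0.25 — shares 'heart' and 'pressure'
```

High pairwise overlap within one model flags near-duplicate topics, which is one symptom of the poor-quality models discussed in section 8.2.2.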

8.5 Summary

The pipeline described in this section helps the user create their own topic models

and CSV-tables for data visualisation. Unlike off-the-shelf topic modeling tools,

here the user is provided with some predefined pre-processing functions. Due to

the modular structure of this pipeline, it is possible to skip certain steps. This is

true for the data extraction part where one can skip the part of speech tagging.

In the topic modeling section, the modular structure supports the user by saving

time, as the output generated in the corpus creation (see 8.2.3) and the dictionary creation

(see 8.2.1) sections can always be used by the LDA model (given these were created


from the same original corpus). Furthermore, the data mapping structure helps the

user to identify the original files from which the LDA model was created. This is

a clear advantage over the current version of Gensim, which does not keep track of

the original input files used to create the corpus and the dictionary.


9 Conclusion

In this Master’s thesis, I implemented topic modeling on biomedical texts to detect

temporal trends within the data. I took a systematic approach towards topic modeling of biomedical texts. By conducting a series of experiments, I arrived at the desired topic model, which was used to answer my research questions.

For my first research question, I was able to demonstrate temporal trends within

the topics that were generated using an LDA model. By means of the average topic probability, temporal fluctuations in the popularity of the topics were observed.

For my second research question, I then delved deeper into the topics themselves

in order to investigate the impact of the topic words throughout the corpus. For

this I looked at the relative frequency of the topic words and observed temporal

trends within them. Moreover, I was able to observe diachronic groupings of topic

words. These groupings showed tendencies about the quality of the topic words.

Semantically similar topic words tend to be grouped together, whereas generic words

tend to exhibit a high frequency and are grouped further away from the semantically coherent topic words.

For the third research question, I observed temporal trends using topic modeling

within documents that belong to a specific topic. I was able to demonstrate that

the popular words within a subcorpus of documents belonging to a specific topic

tend to have different diachronic relative frequencies. Moreover, the top 250 most

popular words within the corpus tend to be different as well, based on the topic they

are representing.

As a further feature, I created a website to demonstrate the results that have been

found. The website has a general three part structure, where each part concentrates

on the results of one of the research questions.

Finally, I was also able to create a pipeline that would enable the user to create

similar results as shown in this paper. The pipeline contains a text extraction

component with the focus on text preprocessing and a topic modeling component.


The aim of the pipeline is to facilitate the topic modeling process of PubMed XML

files and map elements from the corpus document metadata and information from

the topic model together to create data tables that can be used by researchers.

9.1 Future Work

In the scope of this Master’s thesis I was only able to scratch the surface of diachronic

topic modeling using a large corpus of biomedical texts. I would like to improve my current work, especially the companion website, so that it also shows the articles with the highest topic probabilities for a specific time frame. I aim to focus on finding the

most relevant document for a specific topic, which I was not able to do for this

project, as I did not have any access to domain experts who could have evaluated

the output of my system.

In the future, I plan to gather articles on a specific topic from only a few sources and to match the temporal trends with actual historical developments in the

field. For example, for articles belonging to the general theme of immunology, one

could try to match the trends with historical events such as major outbreaks of diseases and

discoveries of cures. I was not able to implement such an aspect for this project, as

I first needed to test the viability of the framework.


References

S. Bhatia, J. H. Lau, and T. Baldwin. Automatic Labelling of Topics with Neural

Embeddings. ArXiv e-prints, Dec. 2016.

S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python.

O’Reilly Media, Inc., 1st edition, 2009. ISBN 0596516495, 9780596516499.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

P. P. Bonissone. Machine Learning Applications, pages 783–821. Springer Berlin

Heidelberg, Berlin, Heidelberg, 2015. ISBN 978-3-662-43505-2. doi:

10.1007/978-3-662-43505-2_41. URL

http://dx.doi.org/10.1007/978-3-662-43505-2_41.

J. Boyd-Graber, D. Mimno, and D. Newman. Care and Feeding of Topic Models:

Problems, Diagnostics, and Improvements. CRC Handbooks of Modern

Statistical Methods. CRC Press, Boca Raton, Florida, 2014. URL

docs/2014_book_chapter_care_and_feeding.pdf.

J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea

leaves: How humans interpret topic models. In Advances in neural information

processing systems, pages 288–296, 2009.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman.

Indexing by latent semantic analysis. Journal of the American society for

information science, 41(6):391, 1990.

S. ElShal, M. Mathad, J. Simm, J. Davis, and Y. Moreau. Topic modeling of

biomedical text. In 2016 IEEE International Conference on Bioinformatics and

Biomedicine (BIBM), pages 712–716, Dec 2016. doi:

10.1109/BIBM.2016.7822606.

P. Ghoshal, J. Goldzycher, and S. Clematide. Bbdia: Diachronic visualisation of

semantically related n-grams using word embeddings. Conference Poster, June

2017. Poster presentation at SwissText 2017: 2nd Swiss Text Analytics

Conference.


T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd

annual international ACM SIGIR conference on Research and development in

information retrieval, pages 50–57. ACM, 1999.

A. Holzinger, J. Schantl, M. Schroettner, C. Seifert, and K. Verspoor. Biomedical

Text Mining: State-of-the-Art, Open Problems and Future Challenges, pages

271–300. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. ISBN

978-3-662-43968-5. doi: 10.1007/978-3-662-43968-5_16. URL

http://dx.doi.org/10.1007/978-3-662-43968-5_16.

K. Hornik and B. Grün. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011.

C.-C. Huang and Z. Lu. Community challenges in biomedical text mining over 10

years: success, failure and the future. Briefings in Bioinformatics, 17(1):

132–144, 2016. doi: 10.1093/bib/bbv024. URL

http://bib.oxfordjournals.org/content/17/1/132.abstract.

J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic labelling of topic

models. In Proceedings of the 49th Annual Meeting of the Association for

Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11,

pages 1536–1545, Stroudsburg, PA, USA, 2011. Association for Computational

Linguistics. ISBN 978-1-932432-87-9. URL

http://dl.acm.org/citation.cfm?id=2002472.2002658.

E. Leopold. Models of Semantic Spaces, pages 117–137. Springer Berlin

Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-37522-7. doi:

10.1007/978-3-540-37522-7_6. URL

http://dx.doi.org/10.1007/978-3-540-37522-7_6.

A. K. McCallum. MALLET: A Machine Learning for Language Toolkit.

http://mallet.cs.umass.edu, 2002.

D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing

semantic coherence in topic models. In Proceedings of the Conference on

Empirical Methods in Natural Language Processing, EMNLP ’11, pages 262–272,

Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN

978-1-937284-11-4. URL

http://dl.acm.org/citation.cfm?id=2145432.2145462.

D. Ramage and E. Rosen. Stanford TMT, 2009. URL

http://nlp.stanford.edu/software/tmt/tmt-0.4/.


R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large

Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for

NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.

http://is.muni.cz/publication/884893/en.

M. Song, S. Kim, G. Zhang, Y. Ding, and T. Chambers. Productivity and

influence in bioinformatics: A bibliometric analysis using PubMed Central.

Journal of the Association for Information Science and Technology, 65(2):

352–371, 2014. ISSN 2330-1643. doi: 10.1002/asi.22970. URL

http://dx.doi.org/10.1002/asi.22970.

I. Titov and R. T. McDonald. A Joint Model of Text and Aspect Ratings for

Sentiment Summarization. In ACL, volume 8, pages 308–316. Citeseer, 2008.

A. J. van Altena, P. D. Moerland, A. H. Zwinderman, and S. D. Olabarriaga.

Understanding big data themes from scientific biomedical literature through

topic modeling. Journal of Big Data, 3(1):23, 2016. ISSN 2196-1115. doi:

10.1186/s40537-016-0057-0. URL

http://dx.doi.org/10.1186/s40537-016-0057-0.

H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods

for topic models. In Proceedings of the 26th annual international conference on

machine learning, pages 1105–1112. ACM, 2009.

H. Wang, Y. Ding, J. Tang, X. Dong, B. He, J. Qiu, and D. J. Wild. Finding

complex biological relationships in recent PubMed articles using Bio-LDA.

PLOS ONE, 6(3):1–14, 03 2011. doi: 10.1371/journal.pone.0017243. URL

https://doi.org/10.1371/journal.pone.0017243.

X. Wang and A. McCallum. Topics over Time: A non-Markov Continuous-time

Model of Topical Trends. In Proceedings of the 12th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, KDD ’06,

pages 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi:

10.1145/1150402.1150450. URL

http://doi.acm.org/10.1145/1150402.1150450.

X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In

Proceedings of the 29th annual international ACM SIGIR conference on

Research and development in information retrieval, pages 178–185. ACM, 2006.

Y. Yang, S. Pan, J. Lu, M. Topkara, and Y. Song. The Stability and Usability of

Statistical Topic Models. ACM Trans. Interact. Intell. Syst., 6(2):14:1–14:23,

July 2016. ISSN 2160-6455. doi: 10.1145/2954002. URL

http://doi.acm.org/10.1145/2954002.


B. Zheng, D. C. McLean, and X. Lu. Identifying biological concepts from a

protein-related corpus with a probabilistic topic model. BMC Bioinformatics, 7

(1):58, 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-58. URL

http://dx.doi.org/10.1186/1471-2105-7-58.

P. Zhu, J. Shen, D. Sun, and K. Xu. Mining meaningful topics from massive

biomedical literature. In 2014 IEEE International Conference on Bioinformatics

and Biomedicine (BIBM), pages 438–443, Nov 2014. doi:

10.1109/BIBM.2014.6999197.


A Tables

Topic number  Topic words
1  strain growth medium culture production bacteria coli yeast mutant plate
2  sequence genome specie family position domain alignment amino tree database
3  genotype marker locus frequency polymorphism variant allele trait variation haplotype
4  mutation mutant embryo phenotype deletion stage loss mouse defect domain
5  parameter correlation probability error prediction estimate feature performance equation simulation
6  specie specimen margin length view dorsal head genus surface seta
7  specie water soil temperature community season abundance habitat climate diversity
8  frequency phase input stimulation stimulus neuron amplitude noise power electrode
9  transcript file probe mirnas microarray array mrna transcription pathway mirna
10  surface temperature particle energy solution layer water film property nanoparticles
11  antibody vector construct sequence plasmid transfection domain buffer blot lane
12  activation inhibitor phosphorylation pathway kinase apoptosis inhibition receptor death growth
13  woman heart pregnancy pressure birth hypertension infant delivery mother week
14  food insulin cholesterol obesity intake consumption diabetes weight alcohol mass
15  bone fracture cartilage spine knee teeth joint surface root pain
16  mouse antibody medium hour culture section animal week serum plate
17  migration adhesion microtubule actin formation image localization junction motility filament
18  brain mouse neuron cortex animal cord motor hippocampus nerve matter
19  surgery injury complication pain technique nerve catheter pressure operation vein
20  promoter methylation transcription histone chromatin chip element regulation modification motif
21  membrane fluorescence vesicle fusion microscopy mitochondrion transport image localization surface
22  infection virus vaccine antibody vaccination antigen replication titer influenza transmission
23  trial month therapy week baseline event outcome intervention medication cohort
24  network cluster module node database edge user tool feature file
25  domain molecule chain bond ligand conformation energy crystal position atom
26  child participant student school family parent people question experience behavior
27  care intervention service practice program management community provider staff hospital
28  lung liver platelet respiratory mortality fibrosis admission sepsis count fluid
29  score disorder item scale symptom depression pain questionnaire anxiety correlation
30  adult host male female larva stage mosquito fitness generation selection
31  chromosome cycle replication damage repair phase focus testis irradiation follicle
32  drug administration agent injection mg/kg combination toxicity dose efficacy therapy
33  lesion diagnosis biopsy examination tumor resection recurrence month mass feature
34  cancer tumor breast survival metastasis growth carcinoma lung tumour chemotherapy
35  vessel segment artery wall plaque toxin aggregation formation flow lesion
36  serum kidney plasma vitamin donor biomarkers correlation creatinine marker laboratory
37  plant leaf seed root rice arabidopsis growth fruit wheat stage
38  infection resistance bacteria isolates strain culture pathogen phage tuberculosis coli
39  review search quality paper publication criterion strategy literature science heterogeneity
40  task trial participant memory stimulus performance word session face block
41  image volume measurement sensor intensity position field detection device resolution
42  primer reaction sequence product amplification cycle clone polymerase extraction detection
43  mouse macrophage cytokine production antibody inflammation activation receptor immune antigen
44  differentiation stem culture collagen proliferation marker growth progenitor medium fibroblast
45  stress metabolism acid enzyme iron production synthesis pathway oxygen reaction
46  receptor channel release calcium activation neuron stimulation solution action voltage
47  muscle exercise movement force motor hand training strength limb fiber
48  compound solution acid reaction water mass mixture extract spectrum fraction
49  country prevalence cost survey mortality status death hospital proportion household
50  exposure skin smoking smoker tobacco cigarette worker hair product meta

Table A.1: All 50 topics from final model


Curriculum Vitae

Personal Details

Parijat Ghoshal

Lerchenberg 31

8046 Zurich

[email protected]

Education

2015 Bachelor of Arts in English Languages and Literatures

at University of Bern

since 2015 M.A. studies in Multilingual Text Analysis at the

University of Zurich

Professional Activities

July 2016 – September 2016 Machine Learning Intern at Neue Zürcher Zeitung

since October 2016 Data Scientist at Neue Zürcher Zeitung


Selbstständigkeitserklärung
