DISTANTLY SUPERVISED INFORMATION EXTRACTION USING
BOOTSTRAPPED PATTERNS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Sonal Gupta
June 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/nt508qx3506
© 2015 by Sonal Gupta. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christopher Manning, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jeffrey Heer
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Percy Liang
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Information extraction (IE) involves extracting information such as entities, relations, and events from unstructured text. Although most work in IE focuses on tasks that have abundant training data by exploiting supervised machine learning techniques, in practice, most IE problems do not have any supervised training data available. Conditional random fields (CRFs), a state-of-the-art supervised approach, are impractical for such real-world applications because: (1) they require large and expensive labeled corpora, and (2) it is difficult to interpret them and analyze their errors, an often-ignored but important consideration.
This dissertation focuses on information extraction for tasks that have no labeled data available, apart from some seed examples. Supervision using seed examples is usually easier to obtain than fully labeled sentences. In addition, for many tasks, the seed examples can be acquired from existing resources like Wikipedia and other human-curated knowledge bases.
I present Bootstrapped Pattern Learning (BPL), an iterative pattern and entity learning method, as an effective and interpretable approach to entity extraction tasks with only seed examples as supervision. I propose two new tasks: (1) extracting key aspects from
scientific articles to study the influence of sub-communities of a research community, and
(2) extracting medical entities from online web forums. For the first task, I propose three
new categories of key aspects and a new definition of influence based on the key aspects.
This dissertation is the first work to address the second task of extracting drugs & treatments
and symptoms & conditions entities from patient-authored text. Extracting these entities
can aid in studying the efficacy and side effects of drugs and home remedies at a large
scale. I show that BPL, using either dependency patterns or lexico-syntactic surface-word
patterns, is an effective approach to solve both problems. It outperforms existing tools and
CRFs.
Similar to most bootstrapped or semi-supervised systems, BPL systems developed earlier either ignore the unlabeled data or make closed-world assumptions about it, resulting in less accurate classifiers. To address this problem, I propose improvements to BPL's pattern and entity scoring functions by evaluating the unlabeled entities using unsupervised similarity measures, such as word embeddings and contrasting domain-specific and general text. I improve the entity classifier of BPL by expanding the training sets using similarity computed by distributed representations of entities. My systems successfully leverage unlabeled data and significantly outperform the baselines by not making closed-world assumptions.
Developing any learning system usually requires a developer-in-the-loop to tune the parameters. I exploit the interpretability of patterns to humans, a highly desirable attribute for industrial applications, to develop a new diagnostic tool for visualizing the output of multiple pattern-based entity learning systems. Such comparisons can help in diagnosing errors faster, resulting in a shorter and easier development cycle. I make the source code of all tools developed in this dissertation publicly available.
To my wonderful parents, Arvind and Sudha Gupta, and my partner-in-crime, Apurva.
Acknowledgements
I consider myself lucky to have had great advisors during my graduate life. Thank you
Chris for your insightful short and long answers, for showing me the right path whenever I
was in doubt, and for encouraging me to pursue my research whenever I felt disheartened.
You gave me all the freedom to work on the research projects I was excited about. Your
honest and constructive feedback helped me learn how to critically assess ideas and their
implementations. Very often graduate students worry about not publishing enough – I did
too. A lot. Thanks so much for the persistent advice that the number of papers does not matter; what matters is the quality of the research I do and whether I relish the process. I think it is one of the best pieces of advice I have ever gotten.
I am also thankful to my committee members, Jeff and Percy. Jeff, I really enjoyed our
conversations and the brainstorming sessions. I admire your clear thinking and great ideas.
The research project with you and Diana opened the door for many successful projects
subsequently. I have always appreciated your encouragement and high-spiritedness. Percy,
you are so smart and yet so grounded!! Thanks so much for all the great, thoughtful feedback on my research and this dissertation.
I am indebted to Ray Mooney for becoming my mentor during my master's at UT Austin. Ray, it is fair to say that this PhD would not have been possible without you.
Even though I spent only two years with you, I learned the skills for a lifetime. Some of
my best memories are from the time we spent together sightseeing after ECML in Belgium
and NAACL in Los Angeles.
The research in this dissertation has been possible because of the incredible people
around me. First and foremost: Diana, the collaboration and friendship with you has been
one of the highlights of my time at Stanford. Jason and Sanjay, it was a lot of fun to
work with both of you. Thank you Val and Angel for labeling the inter-annotator data for
studying the key aspects of scientific articles. Thank you DanJ and Chris for mapping the
ACL Anthology topics to communities. Whenever someone asked me about the validity of
the mapping, it felt great to point them to two universally-acknowledged experts in NLP!
Thanks to Eric Xing, Kriti, and Jacob for the fun and exciting collaboration during my
quarter at CMU.
The best thing about Stanford is its grad students – brilliant and yet so approachable.
It has been amazing to be a part of the NLP group at Stanford. I have thoroughly enjoyed
hanging out with the group during the almost-daily afternoon tea time (that is, procrastination time). I also have fond memories of the various group hikes and the NLP group retreat.
People in the 2A wing – you know who you are – thanks so much for being there whenever
I needed you.
Finally, I am grateful to my family and friends. Diana, Isa, Nick, Nisha, Reyes, Suyash,
Tejo: thanks for being the stress busters I sorely needed over the years. My parents, Arvind and Sudha Gupta, are the reason I am here. Their dedication and love have always been selfless, and I do not think I can ever repay their sacrifices. My sister Anshu and brother Ankur have always been there for me. Thanks so much! I am very fortunate to have the best parents-in-law, Sandhya and Prakash Samudra. Their love and encouragement have been unconditional. Words are not enough to express my gratitude and love towards my husband, Apurva. We lived apart for many years so that we could pursue our own dreams; however, that distance never came between us. Apurva, you have always been my best friend and counselor, and always will be.
Contents
Abstract iv
Acknowledgements vii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Distant Supervision . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Pattern-based learning . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Bootstrapped Pattern Learning . . . . . . . . . . . . . . . . . . . . 8
1.1.4 Challenges with unlabeled data . . . . . . . . . . . . . . . . . . . 9
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 13
2.1 Entity Extraction Task . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Lexico-syntactic Surface word Patterns . . . . . . . . . . . . . . . 18
2.2.2 Dependency Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Bootstrapped Pattern Learning . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Classifiers and Entity Features . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Dataset: MedHelp Patient Authored Text . . . . . . . . . . . . . . . . . . . 27
3 Related Work 29
3.1 Pattern-based systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Fully supervised . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Distantly supervised . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Distantly supervised Non-pattern-based systems . . . . . . . . . . . . . . . 34
3.3 Distantly supervised Hybrid systems . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Open IE systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Studying Scientific Articles and Communities 37
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Related Work: Scientific Study . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Communities and their Influence . . . . . . . . . . . . . . . . . . . 43
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5 Information Extraction on Medical Forums 58
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Related Work: Medical IE . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Inducing Lexico-Syntactic Patterns . . . . . . . . . . . . . . . . . . . . . . 64
5.5.1 Creating Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.5.2 Learning Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5.3 Learning Phrases . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.1 Test Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.7.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.7.2 Case study: Anecdotal Efficacy . . . . . . . . . . . . . . . . . . . 83
5.8 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.9 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Leveraging Unlabeled Data 93
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Creating Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 Scoring Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Learning Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.2 Labeling Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Word Embeddings Improve Entity Classifiers 112
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8 Visualizing and Diagnosing BPL 122
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2 Learning Patterns and Entities . . . . . . . . . . . . . . . . . . . . . . . . 124
8.3 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4 Visualizing Diagnostic Information . . . . . . . . . . . . . . . . . . . . . . 125
8.5 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.7 Future Work and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 131
9 Conclusions 132
A Stop Words List 136
List of Tables
2.1 Examples of patterns and how they match to sentences. . . . . . . . . . . . 19
2.2 A few examples of sentences from the MedHelp forum. . . . . . . . . . . . 28
3.1 A few examples to give an idea about the types of IE systems developed
based on the amount of supervision and the type of models. . . . . . . . . . 29
4.1 Some examples of dependency patterns that extract information from dependency trees of sentences. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Extracted phrases for some papers. . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Examples of patterns learned using the iterative extraction algorithm. . . . . 47
4.4 The precision, recall and F1 scores of each category for the different approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 The top 5 influential communities with their most influential phrases. . . . . 50
4.6 The next 5 influential communities with their most influential phrases. . . . 51
4.7 The community in the first column has been influenced the most by the
communities in the second column. . . . . . . . . . . . . . . . . . . . . . . 52
4.8 Comparison of our BPL-based approach and supervised CRF for the task. . 56
5.1 F1 scores for labeling with Dictionaries using different types of labeling
schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Token-level Precision, Recall, and F1 scores of our system and the baselines
on the Asthma forum for the label DT. . . . . . . . . . . . . . . . . . . . . 72
5.3 Token-level Precision, Recall, and F1 scores of our system and the baselines
on the Asthma forum for the label SC. . . . . . . . . . . . . . . . . . . . . 72
5.4 Token-level Precision, Recall, and F1 scores of our system and the baselines
on the ENT forum for the label DT. . . . . . . . . . . . . . . . . . . . . . 73
5.5 Token-level Precision, Recall, and F1 scores of our system and the baselines
on the ENT forum for the label SC. . . . . . . . . . . . . . . . . . . . . . 73
5.6 Entity-level Precision, Recall, and F1 scores of our system and the baselines
on the Asthma forum for the label DT. . . . . . . . . . . . . . . . . . . . . 74
5.7 Entity-level Precision, Recall, and F1 scores of our system and the baselines
on the Asthma forum for the label SC. . . . . . . . . . . . . . . . . . . . . 74
5.8 Entity-level Precision, Recall, and F1 scores of our system and the baselines
on the ENT forum for the label DT. . . . . . . . . . . . . . . . . . . . . . 75
5.9 Entity-level Precision, Recall, and F1 scores of our system and the baselines
on the ENT forum for the label SC. . . . . . . . . . . . . . . . . . . . . . 75
5.10 Top 10 patterns learned for the label DT on the Asthma forum. . . . . . . . 76
5.11 Top 10 patterns learned for the label SC on the Asthma forum. . . . . . . . 76
5.12 Top 10 patterns learned for the label DT on the ENT forum. . . . . . . . . 77
5.13 Top 10 patterns learned for the label SC on the ENT forum. . . . . . . . . . 77
5.14 Precision, Recall, and F1 scores of systems that use pattern matching when
labeling data and our system on the Asthma forum. . . . . . . . . . . . . . 86
5.15 Precision, Recall, and F1 scores of systems that use pattern matching when
labeling data and our system on the ENT forum. . . . . . . . . . . . . . . . 87
5.16 Effects of use of GoogleCommonList in OBA on the Asthma forum. . . . . 87
5.17 Effects of use of GoogleCommonList in OBA on the ENT forum. . . . . . 88
5.18 Effects of use of GoogleCommonList in MetaMap on the Asthma forum. . 88
5.19 Effects of use of GoogleCommonList in MetaMap on the ENT forum. . . . 88
5.20 Scores when our system is run with different phrase threshold values. . . . 89
5.21 Scores when our system is run with different pattern threshold values. . . . 89
5.22 Scores when our system is run with different values of N and T . . . . . . . 90
5.23 Scores when our system is run with different values of K. . . . . . . . . . . 90
6.1 Area under Precision-Recall curves of the systems. . . . . . . . . . . . . . 105
6.2 Individual feature effectiveness: Area under Precision-Recall curves when
our system uses individual features during pattern scoring. . . . . . . . . . 107
6.3 Feature ablation study: Area under Precision-Recall curves when individual features are removed from our system during pattern scoring. . . . 108
6.4 Example patterns and the entities extracted by them, along with the rank at
which the pattern was added to the list of learned patterns. . . . . . . . . . 109
6.5 Top 10 (simplified) patterns learned by our system and RlogF-PUN from
the ENT forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1 Area under Precision-Recall curve for all the systems. . . . . . . . . . . . . 117
7.2 Examples of unlabeled entities that were expanded into the training sets. . . 121
List of Figures
1.1 Seed examples can often be automatically curated using existing resources. 5
1.2 Percentage of commercial entity extraction systems that use rule-based,
machine learning-based, or hybrid systems. . . . . . . . . . . . . . . . . . 7
2.1 The dependency tree for ‘We work on extracting information using dependency graphs’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 A flowchart of various steps in a bootstrapped pattern-based entity learning
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 An example pattern learning system for the class ‘animals’ from the text. . . 24
4.1 The F1 scores for TECHNIQUE and DOMAIN categories after every five iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 The influence scores of communities in each year. . . . . . . . . . . . . . . 52
4.3 The popularity of communities in each year. . . . . . . . . . . . . . . . 53
4.4 The influence scores of machine translation related communities. . . . . . . 54
4.5 Popularity of machine translation communities in each year. . . . . . . . . 55
5.1 Top 15 phrases extracted for the Asthma and the ENT forums. . . . . . . . 78
5.2 Top 50 DT phrases extracted by our system for three different forums. . . . 80
5.3 Top 50 SC phrases extracted by our system for three different forums. . . . 81
5.4 Top DT and SC phrases extracted by our system, MetaMap, and MetaMap-C for the Diabetes forum. . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Study of efficacy of ‘Cinnamon’ and ‘Vinegar’, two DTs extracted by our
system, for treating Type II Diabetes. . . . . . . . . . . . . . . . . . . . . . 84
6.1 An example pattern learning system for the class ‘animals’ from the text
starting with the seed entity ‘dog’. . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Precision vs. Recall curves of our system and the baselines for the Asthma
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 Precision vs. Recall curves of our system and the baselines for the ENT
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 Precision vs. Recall curves of our system and the baselines for the Acne
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5 Precision vs. Recall curves of our system and the baselines for the Diabetes
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1 An example of expanding a bootstrapped entity classifier’s training set using word vector similarity. . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Precision vs. Recall curves of our system and the baselines for the Asthma
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3 Precision vs. Recall curves of our system and the baselines for the Acne
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4 Precision vs. Recall curves of our system and the baselines for the Diabetes
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.5 Precision vs. Recall curves of our system and the baselines for the ENT
forum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.1 Entity centric view of SPIED-Viz. . . . . . . . . . . . . . . . . . . . . . . 128
8.2 Pattern centric view of SPIED-Viz. . . . . . . . . . . . . . . . . . . . . . . 129
8.3 When the user clicks on the compare icon for an entity, the explanations of
the entity extraction for both systems (if available) are displayed. . . . . . . 130
Chapter 1
Introduction
1.1 Overview
Information extraction involves extracting information, such as entities, relations between
entities, and events from unstructured text. The most common entity types are names of
people, places, organizations, and locations. Relation extraction systems predict relations
between entities in text, for example, city of birth, spouse of, and employee of. Earlier
systems built extraction modules using hand-written regular expressions and rules. They
have been largely replaced by machine learning-based models, especially in the research
community. The last decade of machine learning-based information extraction research has focused on models that require large amounts of labeled data. State-of-the-art systems based on conditional random fields (Lafferty et al., 2001) have been very successful at entity extraction tasks (Ratinov and Roth, 2009), but they only work well when given large amounts of labeled data. However, most real-world information extraction tasks do not have any fully labeled data. Labeling new data to train a reasonably accurate sequence model is not only expensive, it also requires labeling data for each new domain.
Information extraction tasks that have no readily available labeled data are the norm in practical use, not the exception. Consider a substance-abuse specialist who wants to study substances-of-choice (SoC) of people from online health discussion forums. A post from one such forum on MedHelp.org reads:
(MedHelp is an online health forum. Example altered to preserve privacy.)
Brother was on huge amounts of opiates...80 vics a day plus oxys and morphine...then valium at night to sleep. Then he was on an alcohol bender for 3 weeks... Doc has put him on suboxone.
Two common ways of approaching the above problem are machine learning-based sequence models like CRFs, and hand-written lexicons.
CRFs and other supervised sequence models tend to work well when trained on reasonably sized, fully labeled data. However, there is currently no publicly available data labeled
with SoCs. Labeling such data would be time-consuming and difficult for someone who is
not a trained data annotator with access to good annotation tools.
The second approach, manually constructing a lexicon, is expensive, time-consuming, and can lead to poor recall. Most commonly used medical entity extractors, e.g., MetaMap (Aronson, 2001) and the Open Biomedical Annotator or OBA (Jonquet et al., 2009), are based on keyword matching with manually written ontologies. They have two problems: 1. they have poor recall on patient-authored text (Smith and Wicks, 2008) since most ontologies do not contain colloquial or slang phrases, and 2. dictionary-lookup-based annotators do not model context.
OBA references various manually curated ontologies and currently has no SoC category. When applied using the generic pharmaceutical drugs categories to the above example, it extracts ‘opiates’, ‘morphine’, ‘suboxone’, and ‘valium’. There are two kinds of extraction errors: 1. not extracting ‘vics’, ‘oxys’, and ‘alcohol bender’, and 2. extracting ‘suboxone’ as a SoC. Suboxone can be a SoC but is also prescribed as a treatment for addiction to other, stronger opiates, as in the above example. The example underscores the need for domain-specific context modeling.
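The failure mode is easy to see in code. Below is a toy sketch of context-free lexicon matching (the function and label names are my own, not from MetaMap or OBA): every surface match is tagged, so ‘suboxone’ is labeled a SoC even in a treatment context, while slang like ‘vics’ and ‘oxys’ is missed entirely.

```python
# Toy sketch of context-free dictionary lookup (all names hypothetical).
# A lexicon-based annotator tags every surface match, regardless of context.

def lexicon_extract(tokens, lexicon, label):
    """Label every token that appears in the lexicon; others get 'O'."""
    return [(tok, label) if tok.lower() in lexicon else (tok, "O")
            for tok in tokens]

soc_lexicon = {"opiates", "morphine", "valium", "suboxone"}

post = "Doc has put him on suboxone".split()
print(lexicon_extract(post, soc_lexicon, "SoC"))
# 'suboxone' is tagged SoC even though here it is a treatment,
# and slang like 'vics' or 'oxys' would be missed entirely.
```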
Consider another task, which I focus on in this thesis: extracting drugs & treatments from patient-authored text (PAT). Treatments include anything consumed or applied to improve a symptom or a condition. The following is an example from MedHelp.org:
I plan to start cinnamon and holy basil known to help diabetes in many people.
PAT, such as discussion posts on online forums, is rife with home and alternative remedies, morphological variations, spelling mistakes, and abbreviations of pharmaceutical drugs. It is nearly impossible to manually list all treatments mentioned in such text. No public dataset exists in this domain for these types of entities to train a machine learning classifier.
Information extraction needs like these are widespread. Consider another task: extracting dish names from Yelp.com reviews. The following is an example:

We ordered the Empanadas for an appetizer, it comes with three different sauces, ... My wife ordered the arroz con pollo and I had the lomo saltado and we split an order of mac and cheese. The presentation was awesome ..
One is likely to find ‘mac and cheese’ in existing dish lexicons but less likely to find ‘lomo saltado’. It is not practical to list the dish names of every cuisine by hand. Additionally, restaurants frequently add new dishes to their menus, quickly making any existing lexicon outdated.
These real-world information extraction scenarios are common: people need very specific information extracted from text and they do not have any fully labeled data. Manually listing a few examples, or using existing knowledge bases as examples for the given entity types, however, is very easy. These examples can be used as seed sets, also known as dictionaries or gazettes, for learning more examples of the entity types. To extract food items from text, for example, it is much easier to find a list of dish names on the Internet than to manually label sentences. The list of dish names can be used as a seed set to train a dish name extractor from unlabeled reviews. Similarly, for drugs & treatments, it is easier to use the existing ontologies as a seed set.
In this dissertation, I show that these seed sets can be effectively used to extract entities
from unlabeled text, even though, in contrast to fully labeled data, the supervision provided
by them is weak. I explore machine learned patterns to extract information from text.
Patterns can be thought of as instantiations of feature templates, or in their simplest form,
regular expressions. I use bootstrapped pattern learning to learn patterns and entities from
text automatically, starting with only a seed set of a few examples of each entity type. I
propose two new tasks and show that BPL is effective for both of them. I compare our
pattern-based approach with both lexicon matching and CRFs and show that our system
performs significantly better. I also propose improvements to BPL systems.
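To make the idea concrete, here is a deliberately simplified sketch of one bootstrapping iteration (the names and the naive precision-style pattern score are illustrative only, not the actual scoring functions developed in this dissertation): seeds label the text, contexts around seeds become candidate patterns, patterns are scored against the seeds, and the best pattern extracts new entities for the next round.

```python
# Simplified one-iteration sketch of bootstrapped pattern learning.
# Patterns here are just "the word preceding an entity"; a pattern's score
# is the fraction of its extractions that are already known seed entities.

def one_bpl_iteration(sentences, seeds):
    # 1. Create candidate patterns from contexts around entities:
    #    map each preceding word to the set of words it precedes.
    pattern_matches = {}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens[:-1]):
            pattern_matches.setdefault(tok, set()).add(tokens[i + 1])
    # 2. Score each pattern that extracts at least one seed.
    scored = {p: len(ents & seeds) / len(ents)
              for p, ents in pattern_matches.items() if ents & seeds}
    # 3. Keep the best pattern and collect the new entities it extracts.
    best = max(scored, key=scored.get)
    new_entities = pattern_matches[best] - seeds
    return best, new_entities

sentences = ["he takes albuterol daily",
             "she takes prednisone now",
             "they walk home daily"]
best, new = one_bpl_iteration(sentences, {"albuterol"})
print(best, new)  # the pattern 'takes' is learned; 'prednisone' is a new entity
```

In a full BPL system, the new entities would be added to the seed set and the loop repeated, with far more careful pattern creation and scoring (Chapters 2 and 6).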
Earlier pattern- or rule-based systems were built with hand-written rules (Hobbs et al., 1993; Riloff, 1993). An important distinction between those systems and the systems I work on is that the patterns in my systems are learned. I use various statistical approaches to automatically score computer-generated patterns. Many other pattern-based systems also learn patterns, often using BPL. BPL was a popular research topic from the mid-1990s until the early 2000s (Hearst, 1992; Riloff, 1996; Collins and Singer, 1999); however, academic research on entity extraction has recently focused more on feature-based sequence classifiers. In my opinion, the shift happened mostly due to trends in the wider world: feature-based classifiers (e.g., naive Bayes and logistic regression) and their structured extensions (e.g., hidden Markov models and conditional random fields) became popular beginning in the late 1990s. The popularity of these classifiers is justified – they tend to work very well when trained on fully labeled data. However, it is not clear whether they work better than pattern-based learning approaches, in either fully supervised or distantly supervised settings.
Below I discuss various aspects of my systems.
1.1.1 Distant Supervision
When supervision is in the form of examples, the usual first step is to label the data using the examples. The simplest way is to label all occurrences of the examples in text with the corresponding label. This simple matching of seed sets to text does not take word ambiguity into account. There exist more sophisticated uses of distant supervision (Surdeanu et al., 2012) – in many scenarios, not every token instance in text that matches the seed lexicon need be an instance of that particular class. In my systems, I assume that the seed set entries are unambiguous in text, which is a reasonable assumption for specialized domains. Note that only a few tokens in sentences get labeled, those corresponding to the seed examples; other tokens are unlabeled. In contrast, a more common way of acquiring supervision in semi-supervised settings is to get a few sentences fully labeled. One obvious question is: why use supervision in the form of examples instead of fully labeled sentences? There are two main reasons. First, it is easier for someone to give examples of a particular information need.
(a) Wikipedia Infobox for Barack Obama (b) List of Schedule I drugs in the US from Wikipedia.
Figure 1.1: Seed examples can often be automatically curated using existing resources.
Labeling full sentences is a more cumbersome task. Druck et al. (2008) and Mann and McCallum (2008) showed that a more effective use of a human annotator with limited time is to label features rather than instances. They define labeled features as words that are more indicative of one label than of others (e.g., 'puck' is a stronger indicator of the label hockey than of the label baseball). Acquiring seed sets
is similar to labeling features – it is easier and more efficient for human annotators to
list a few good candidates for each information need. Second, existing lexicons and web
resources can be frequently used as seed sets. For example, the TAC-KBP slot filling task
uses Wikipedia infoboxes as seed examples. Figure 1.1 shows examples of already curated
information about people and entities. The figure on the left is the Wikipedia Infobox for Barack Obama[1], which can be used for learning relation extractors like 'born in' and 'spouse'. The figure on the right lists Schedule I drugs[2] in the US, which can be used to bootstrap learning SoCs.

[1] http://en.wikipedia.org/wiki/Barack_Obama. Accessed April 2015.
[2] http://en.wikipedia.org/wiki/List_of_Schedule_I_drugs_(US). Accessed April 2015.
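The simple seed-matching labeling described above can be sketched as follows; this is a minimal illustration of the idea, not the released system's code, and the seed sets and sentence are made-up examples:

```python
# A minimal sketch (not the released system's code) of distant supervision
# by seed matching: every token span that matches a seed entry is labeled
# with that entry's class; all other tokens stay unlabeled ("O").

def label_with_seeds(tokens, seed_sets):
    """seed_sets maps a label to a set of lowercased entity strings."""
    labels = ["O"] * len(tokens)
    max_len = max(len(e.split()) for s in seed_sets.values() for e in s)
    for start in range(len(tokens)):
        # Try longer spans first so multi-word seeds win over single words.
        for length in range(max_len, 0, -1):
            if start + length > len(tokens):
                continue
            span = " ".join(tokens[start:start + length]).lower()
            for label, entries in seed_sets.items():
                if span in entries and all(l == "O" for l in labels[start:start + length]):
                    labels[start:start + length] = [label] * length
    return labels

seeds = {"DT": {"albuterol", "advair"}, "SC": {"asthma"}}
sent = "I take Advair and Albuterol for asthma".split()
print(label_with_seeds(sent, seeds))
# ['O', 'O', 'DT', 'O', 'DT', 'O', 'SC']
```

Note that, as discussed above, this labeler assumes the seed entries are unambiguous: every match is labeled, regardless of context.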
1.1.2 Pattern-based learning
Patterns are typically created using contexts around the known entities in a text corpus. The
two types of patterns I explore are lexico-syntactic surface word patterns (Hearst, 1992)
and dependency tree patterns (Yangarber et al., 2000). They have been shown to perform
better than state-of-the-art feature-based machine learning methods on some specialized
domains, such as in Chapters 4 and 5, and by Nallapati and Manning (2008). Additionally, pattern-based[3] systems dominate in commercial use (Chiticariu et al., 2013), mainly because patterns are effective, interpretable, and easy for non-experts to customize to cope with errors. Figure 1.2 shows the distribution of pattern or rule-based vs. machine
learning-based entity extraction systems in commercial use, in a study by Chiticariu et al.
(2013).[4]
Comparison with sequence classifiers
One main difference between pattern-based learning systems and sequence classifiers like
CRFs is the representation – whether a system is represented using patterns or features.
Sequence classifiers learn weights on a large number of features. The features commonly
include token-level properties, neighboring tokens and their tags, and distributional simi-
larity word classes. Note that there is a continuum between patterns and features. That is,
one can think of a pattern as a big conjunction of features and use it in a sequence classi-
fier like a CRF. Conversely, many feature-based systems use feature conjunctions, some of them quite specific and hand-engineered. For example, the Stanford part-of-speech tagger (Toutanova and Manning, 2003) models unknown words with a conjunction feature for words that are capitalized and contain both a digit and a dash.
However, in practice, the distinction between patterns and features is clearer: The
feature-based systems typically have a very large number of features, starting with single
element features (such as, word to left is ‘foo’) and then considering simple conjunctions.
Generally, all instantiations of the features and the conjunctions are generated, resulting in
[3] I use the terms patterns and rules interchangeably in this dissertation.
[4] It is not clear from their paper in which category BPL, and thus my systems, fall under. My systems are pattern-based and are machine learned.
Figure 1.2: Percentage of commercial entity extraction systems that use rule-based, machine learning-based, or hybrid systems. The study was conducted by Chiticariu et al. (2013). Pattern or rule-based systems dominate the commercial market, especially among large vendors.
a large number of features, most of which are not individually very useful. The emphasis
is on coverage and recall of features. The advantage is that the features can share statis-
tics better. There has been some work on learning useful feature templates (Martins et al.,
2011). However, the focus is on whether to include entire feature templates, which are
usually much more general than patterns. The pattern-based systems are normally built on
orders of magnitude fewer patterns. Each pattern is normally a quite specific conjunction of several elements (such as tokens and their generalized versions, dependency paths, and wildcards), carefully targeted to extract an information need. The emphasis is more on the precision of the patterns.
The difference is not only in representation; the systems also differ in the typical learn-
ing methods used. Feature-based systems normally optimize weights on every feature
in a classifier using optimization methods like stochastic gradient descent and Newton’s
method. Pattern-based learning also has a loss function, but the optimization methods are
rarely used. The focus is on choosing whether to include or exclude patterns (or to weight
them more or less) based on the change in the loss function value.
Interpretability
Even though feature-based machine learning is very popular in the academic world, indus-
try is slow and reluctant to adopt it, as seen in the earlier figure. One of the reasons is that
developers, who are often not machine learning experts, do not trust black boxes. Patterns
solve this problem because: 1. patterns are understandable to humans, 2. it is easy to find
errors and fix them in a pattern-based system, and 3. patterns are generally high precision.
However, most industrial pattern-based systems are developed using manually defined pat-
terns, which requires significant human effort and expertise in the pattern language. I work
on making the task automated using machine learning to learn good patterns; the system
preserves the interpretability of patterns but does not require much manual effort. Note that
some other forms of machine learning also have much better interpretability than methods
such as feature-based classifiers or neural networks; traditionally, decision tree and decision
list classifiers have been the prototypical examples of more interpretable machine learning
classifiers (Letham et al., 2013). In a decision tree, the conjunction of features from its root
down a path is often not so different in nature from a decision list or a pattern.
1.1.3 Bootstrapped Pattern Learning
In a bootstrapped pattern-based entity learning system, seed dictionaries and/or patterns
provide distant supervision to label data. The BPL system iteratively learns new patterns
and entities belonging to a specific class from unlabeled text (Riloff, 1996; Collins and
Singer, 1999). I discuss individual components of BPL in detail in Chapter 2. A high level
overview is: BPL is an iterative algorithm, which learns a few good patterns and entities for
each entity type in each iteration. First, patterns are created around known entities. They are
scored and ranked by their ability to extract more positive entities and fewer negative entities.
Top ranked patterns are then used to extract candidate entities from text. An entity scorer
is trained to score candidate entities based on the entity features and the scores of patterns
that extracted them. High scoring candidate entities are added to the dictionaries and are
used to generate more candidate patterns around them. The power of BPL comes from two
properties. First, it identifies only a few good patterns and entities in each iteration; it takes
a cautious approach to learning new information. Cautious approaches have been shown to
be more accurate in semi-supervised settings (Abney, 2004). Second, the patterns get used
in two ways: 1. they act as a filter to suggest good candidate entities to be scored by an
entity scorer, and 2. they act as good features in the entity scorer; a highly scored pattern is
more likely to extract good entities.
1.1.4 Challenges with unlabeled data
The attractive part about distantly-supervised IE – the limited supervision – is also its main
challenge. It is very hard to learn effective classifiers with little labeled data. Starting
with unlabeled data and a small seed set, only a few tokens or examples get labeled using
the seed and learned sets of entities. Existing systems either assume unlabeled data to be
negative or just ignore them. Very often, unlabeled data is subsampled to generate a negative training set for an entity or relation classifier (Angeli et al., 2014). Assuming unlabeled data to be negative can be counterproductive, since many examples subsampled as negative can
actually be positive. On the other hand, by ignoring the unlabeled data, a system does not
use the data to its full extent. In this thesis, I propose two improvements to BPL that exploit the unlabeled data to make pattern and entity scoring more accurate.
1.2 Contributions
I make the following contributions in this dissertation:
• I focus on low-resource information extraction problems and show that patterns, both
lexico-syntactic surface word patterns and dependency patterns, learned using BPL
are effective for distantly supervised IE. I bring the academic research in IE closer to
the problems in industry.
• I propose two new tasks and show that BPL is an effective approach for both of them.
1. Studying influence of sub-communities of a scientific community: I propose
new types of key aspects of research papers – focus or main contribution, tech-
niques used, and domain or problem. I also propose a new way of quantifying
influence of one research article on another. There has since been a surge in
interest in the study of academic dynamics, with IARPA funding a research
program called FUSE.
2. Extracting medical entities from patient-authored text: This dissertation is the
first work to extract drugs & treatments and symptoms & conditions from patient-
authored text. Such extractions can be used to study side-effects and the efficacy
of treatments and home remedies at a large scale. My systems outperformed
commonly used medical entity extractors and other machine learning-based
baselines.
• I leverage the unlabeled data to improve bootstrapped pattern learning in two ways.
My systems significantly outperform the existing pattern and entity scoring mea-
sures.
1. Improved pattern scoring: I propose predicting labels of unlabeled entities us-
ing unsupervised measures to improve pattern scoring in BPL. I present a new
pattern scoring method that uses the predicted labels of unlabeled entities. I
predict the labels using five unsupervised measures, such as distributional sim-
ilarity between labeled and unlabeled entities, and edit distances of unlabeled
entities from the labeled entities.
2. Improved entity scoring: I present an improved entity classifier by creating its
training set in a better way. I expand the positive and negative training ex-
amples by adding most similar unlabeled entities, computed using distributed
representations of words, to the training sets.
• I make the source code and some datasets publicly available. In addition, I also re-
lease a visualization and diagnostics tool to compare pattern-based learning systems
to make developing pattern-based systems more effective and efficient.
1.3 Dissertation Structure
Chapter 2 This chapter has details about the entity extraction tasks, and necessary
background information about patterns and bootstrapped pattern learning. I give a detailed
overview of individual components of BPL in this chapter. I discuss contributions made to
the system and its components by other researchers in the next chapter.
Chapter 3 I discuss work related to semi-supervised and distantly supervised IE, and
pattern learning in this chapter. I discuss related work specific to the different tasks in each
corresponding chapter.
Chapter 4 I present a new way of studying influence between sub-communities of
a research community in this chapter. I define three new key aspects to extract from a
research article: focus or main contribution, techniques used, and domains applied to. I
then describe how to use topic models to define sub-communities. I combine article-to-
community scores and key aspects of each article to compute influence of sub-communities
on each other. I present a case-study of influence of sub-communities in the computational
linguistics community, such as Speech Recognition and Machine Learning, on each other.
The content of this chapter is drawn from Gupta and Manning (2011).
Chapter 5 This chapter describes the work published in Gupta et al. (2014b). I
describe a new task of extracting drugs & treatments, and symptoms & conditions from
patient-authored text. I show that BPL using lexico-syntactic surface-word patterns is an
effective technique for extracting the information. It performs significantly better than
other approaches, including existing medical entity extraction tools like MetaMap and
Open Biomedical Annotator, on extracting the entities from posts on four forums on Med-
Help.org.
Chapter 6 I present an improved measure to compute pattern scores in BPL by leveraging unlabeled data. I propose predicting labels of unlabeled entities extracted by patterns
using unsupervised measures and using the predicted labels in the pattern scoring function.
I describe the five unsupervised measures I use to predict the labels of unlabeled entities
and present experimental results on four forums from MedHelp.org. This work has been
published in Gupta and Manning (2014a).
Chapter 7 I present an improved entity classifier for BPL by using distributed repre-
sentation of words. I propose expanding training sets for BPL’s entity classifier, modeled
by a logistic regression, using similarity of entities computed by cosine distance between
word vectors. I present experimental results and show that expanded training sets improve
the performance significantly. This work has been published in Gupta and Manning (2015).
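The training-set expansion summarized above can be sketched roughly as follows; the 2-d vectors, entity names, and function names are illustrative assumptions, not the actual system:

```python
# A rough sketch of expanding a positive training set with the unlabeled
# entities most similar to known positives, by cosine similarity of word
# vectors. Toy 2-d vectors stand in for learned distributed representations.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

vectors = {
    "vicodin": (0.9, 0.1), "advair": (0.85, 0.2),
    "tylenol": (0.8, 0.15), "asthma": (0.1, 0.9),
}

def expand(positives, unlabeled, k=1):
    """Add the k unlabeled entities most similar to any positive entity."""
    def best_sim(e):
        return max(cosine(vectors[e], vectors[p]) for p in positives)
    ranked = sorted(unlabeled, key=best_sim, reverse=True)
    return set(positives) | set(ranked[:k])

print(expand({"vicodin"}, ["advair", "asthma"], k=1))
# {'vicodin', 'advair'}
```

Here 'advair' is added to the positives because its vector is far closer to 'vicodin' than 'asthma' is; the negative training set can be expanded analogously.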
Chapter 8 I present and publicly release a visualization and diagnostics tool to compare pattern-based learning systems. This work has been published in Gupta and Manning
(2014b).
Chapter 9 I conclude this dissertation and discuss avenues for future work.
I release the code for the systems described in this dissertation at http://nlp.stanford.edu/software/patternslearning.shtml. I also release a visualization tool, described in Chapter 8, that can be downloaded at http://nlp.stanford.edu/software/patternviz.shtml.
Chapter 2
Background
This chapter contains the background information necessary for understanding the rest of
the dissertation. First, I describe the entity extraction task and the evaluation measures.
I then explain the patterns I use in this dissertation – dependency patterns and lexico-
syntactic surface-word patterns. I also describe components of a Bootstrapped Pattern
Learning (BPL) system and a few aspects of the classifier I use.
Information extraction encompasses many subtasks, such as entity extraction, relation
and event extraction, and entity linking and canonicalization. Entity extraction is the first
step for other extraction systems; for example, predicting relations between entities is per-
formed jointly with or after entity extraction. In this dissertation, I focus on the entity
extraction task and discuss it below.
2.1 Entity Extraction Task
Entity extraction involves labeling contiguous tokens that form an entity of the desired
type in a sentence. The most common task has been to extract named entities, that is, to
identify the sequence of words that are the names of things, such as PERSON NAME, PLACE,
LOCATION, and ORGANIZATION from text. Below is an example of a sequence of tokens
labeled with named entity tags.
President/TITLE Barack/NAME Obama/NAME lives in D./PLACE C./PLACE
The labeled datasets for the common named entity recognition tasks are publicly avail-
able and widely used. The three common corpora that are used to train such extractors
are CoNLL-03, which is from the shared task of the Conference on Computational Natural
Language Learning in 2003, the MUC-7 dataset (Chinchor, 1998), and the OntoNotes cor-
pus (Hovy et al., 2006), which contains around 10 named entity types and 7 miscellaneous
entity types like TIME and DATE. Generally, the tasks are modeled using BIO enhanced
labels – separate labels for Beginning of an entity, Inside an entity, and Outside token of an
entity. For example, the phrase ‘the capital Washington D. C.’ would be labeled as ‘the/O
capital/O Washington/B-LOCATION D./I-LOCATION C./I-LOCATION’. Any contiguous se-
quence of the same entity class is labeled as an entity for that class. Learning using BIO
enhanced labels increases the number of classes, since each entity type has 2 classes asso-
ciated with it. Because the tasks I work on have limited amount of supervision provided, I
do not use the BIO or other more expressive notations.
Most work in entity extraction has focused on supervised training of the classifiers.
Accuracy for the common named entity tasks using fully supervised data has reached the 90s in F1 – Ratinov and Roth (2009) reported a 90.8 F1 score on the CoNLL-03 dataset and
86.15 F1 score on the MUC-7 dataset. The most informative features when classifying
these entities come from the word itself and other word-level features like capitalization,
prefix characters, suffix characters, and part-of-speech tags. There is ambiguity, such as
whether ‘Washington’ is a NAME or a PLACE, and thus sequence models like CRFs are
very effective for such tasks. However, word-level features are still very predictive of the
labels for most entities. I do not work on the CoNLL-03 or MUC-7-like entity extraction
tasks because they have been well-studied and have large fully labeled corpora available to
train supervised classifiers. I, however, use many of the features used in these systems for
other entity extraction tasks.
Other entity extraction tasks include biomedical named entity recognition (BioNER), such as recognizing proteins, DNA, drugs, and genes in text. Using statistical classifiers for
BioNER has been moderately successful – Finkel (2010) reported around 70 F1 score on
the GENIA corpus (Kim et al., 2003) that has Medline abstracts labeled with entities of
types such as PROTEIN and DNA. Generally word features of these entity types are not very
informative. There is more ambiguity in word level features, and thus context features are
more important. See Gu (2002) for a longer discussion.
In this dissertation, I focus on training entity extractors from unlabeled text and seed
sets of entities. In the majority of the work, I focus on a task that is similar to the BioNER
tasks. The task is to extract symptoms & diseases and drugs & treatments from patient-
authored text (PAT). PAT is more challenging than well written abstracts and news articles
because of the variation in entity naming and the long descriptions of entities. Similar
to the BioNER tasks, word features are not very powerful for this dataset because of the
morphological variations (like ‘vics’ for ‘Vicodin’) and spelling mistakes. I discuss the
dataset and its challenges in Section 2.5.
2.1.1 Evaluation
Entity extraction is usually evaluated by precision, recall, and F1 scores. For each entity type l, the following values are computed: true_positive_l (the number of entities correctly extracted as l by the model), false_positive_l (the number of entities extracted as l by the model that belong to other labels), true_negative_l (the number of entities that are not of type l and are not extracted by the model), and false_negative_l (the number of entities of type l that are not extracted by the model).
Precision
Also known as ‘positive predictive value’, precision is the percentage of correct entities
extracted among all the extracted entities.
    Precision_l = true_positive_l / (true_positive_l + false_positive_l)    (2.1)
Recall
Recall, also known as sensitivity, is the percentage of correct entities extracted among all
correct entities in the text.
    Recall_l = true_positive_l / (true_positive_l + false_negative_l)    (2.2)
F1 Score
Usually F1 score is used as an evaluation measure to compare various systems since it
provides a single score by taking the harmonic mean of precision and recall.
    F1_l = (2 × Precision_l × Recall_l) / (Precision_l + Recall_l)    (2.3)
There are two ways of combining scores of different entity types – macro-averaged
and micro-averaged. To obtain the macro-averaged scores, the precision, recall, and F1
scores are calculated for each entity type (except the background label) and are averaged to
compute the final score of a system. This averaging is commonly reported for text categorization in information retrieval papers. Micro-averaging, which is used for the CoNLL-03
shared task evaluation, computes the precision, recall, and F1 scores for all the entities
together.
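Equations 2.1–2.3 and the two averaging schemes above can be sketched with made-up per-label counts; this is an illustration of the definitions, not evaluation code from the dissertation:

```python
# A sketch of precision, recall, and F1 (Equations 2.1-2.3), plus macro-
# and micro-averaging over entity types. The (tp, fp, fn) counts per label
# are hypothetical.

def prf1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

counts = {"DRUG": (8, 2, 4), "SYMPTOM": (5, 5, 5)}

# Macro: compute per-label scores, then average them.
per_label = [prf1(*c) for c in counts.values()]
macro_f1 = sum(f for _, _, f in per_label) / len(per_label)

# Micro: pool the counts across all labels, then score once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = prf1(tp, fp, fn)[2]

print(round(macro_f1, 3), round(micro_f1, 3))
# 0.614 0.619
```

The two averages differ because macro-averaging weights every label equally, while micro-averaging weights labels by how many entities they contribute.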
Entity-level vs. Token-level Evaluation
The correctness of an extracted entity can be judged at the entity-level or the token-level.
In an entity-level evaluation, a multi-word entity is considered as a single unit. That is,
an entity is considered correctly extracted for l if and only if all the contiguous tokens
labeled as l form an entity. I refer to the entity-level evaluation measures that do not give
partial credits as hard entity-level evaluation. Although the hard entity-level evaluation is
common for NER tasks, when data has inconsistencies or when extracting partial entities
is also important, token level scores are used, such as in Ratinov and Roth (2009). In
a token-level evaluation, each token’s correctness is judged independently of the label of
other tokens. For example, if a system extracts a five-word-long entity correctly, the hard
entity-level evaluation gives it a score of 1 and the token-level evaluation gives it a score of
5.
The hard entity-level evaluation penalizes the system twice for extracting a partial en-
tity. For example, if an entity is “salbutamol inhaler” and a system labels only “inhaler”
as a DRUG, then for the label DRUG, the entity-level number of true positives is 0, false
negatives is 1, and false positives is 1. On the other hand, the token-level number of true
positives is 1, false negative is 1, and false positive is 0. Entity-level evaluation is preferred
over token-level evaluation when extracting all words of an entity is more important than
extracting parts of entity phrases. Entity-level evaluation is commonly used for recogniz-
ing named entities, where, for example, it is not clear if partially extracting ‘Mining Corp.’,
instead of fully extracting ‘Westport Mining Corp.’, should be considered correct.
Some evaluation measures consider a multi-word entity as a single unit but give partial credit for partially extracting an entity. For example, the MUC-7 evaluation adds
scores for partially matching an entity, downweighting the partial match scores by 50% to
prefer systems that extract more fully matched entities.
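The 'salbutamol inhaler' bookkeeping above can be checked with a short script; representing entities as (start, end) token spans is my own choice for illustration:

```python
# A toy comparison of entity-level and token-level counting for one label,
# reproducing the "salbutamol inhaler" example: gold entity = tokens 0-1,
# system labels only token 1 ("inhaler").

def score(gold_spans, pred_spans):
    """Spans are sets of (start, end) token offsets for one label."""
    # Entity level: a prediction counts only if it matches a gold span exactly.
    tp_e = len(gold_spans & pred_spans)
    fp_e = len(pred_spans - gold_spans)
    fn_e = len(gold_spans - pred_spans)
    # Token level: judge each token position independently.
    gold_tok = {i for s, e in gold_spans for i in range(s, e)}
    pred_tok = {i for s, e in pred_spans for i in range(s, e)}
    tp_t = len(gold_tok & pred_tok)
    fp_t = len(pred_tok - gold_tok)
    fn_t = len(gold_tok - pred_tok)
    return (tp_e, fp_e, fn_e), (tp_t, fp_t, fn_t)

print(score({(0, 2)}, {(1, 2)}))
# ((0, 1, 1), (1, 0, 1))
```

The output matches the counts in the text: the hard entity-level evaluation records 0 true positives, 1 false positive, and 1 false negative (penalizing the partial match twice), while the token-level evaluation records 1 true positive, 0 false positives, and 1 false negative.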
True Recall vs. Pooled Recall
True recall is calculated when fully labeled test data is available, that is, when every token of the test sentences is labeled. However, in many situations, hand labeling all tokens in the
test data is not practical because of the size of test data, such as in information retrieval
(Buckley et al., 2007) and the TAC-KBP task settings. In such cases, recall is measured
by pooling (i.e., taking a union) of all correct entities extracted by all the systems. Pooled
Recall for label l is defined in the same way as Equation 2.2, except that the denominator is the size of the pooled set for the label l. See Chapter 8 of Manning et al. (2008) for more
information on evaluation measures and pooled recall. I use true recall in Chapters 4 and 5
and pooled recall in Chapters 6 and 7, where the test sets were too large to hand label in full.
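Pooled recall can be sketched as follows, with hypothetical system outputs standing in for real extractions:

```python
# A minimal sketch of pooled recall: the denominator is the union of the
# correct entities extracted by all participating systems, so no fully
# labeled test set is needed. The system outputs here are hypothetical.

def pooled_recall(system_correct, all_systems_correct):
    pool = set().union(*all_systems_correct)  # union of every system's correct entities
    return len(system_correct) / len(pool)

sys_a = {"vicodin", "advair", "albuterol"}
sys_b = {"advair", "tylenol"}
print(pooled_recall(sys_a, [sys_a, sys_b]))
# 0.75  (system A found 3 of the 4 pooled entities)
```

Note that pooled recall overestimates true recall whenever correct entities exist that no system in the pool extracted.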
2.2 Patterns
There are many ways of defining patterns – lexico-syntactic surface word patterns (Hearst,
1992; Riloff, 1996), dependency patterns (Yangarber et al., 2000), a combination of both
(Illig et al., 2014), or cascaded rules (Hobbs et al., 1997). In this dissertation, I focus on
the first two and describe them below.
2.2.1 Lexico-syntactic Surface word Patterns
Lexico-syntactic surface word patterns[1] consider the context around entity tokens in a sentence. The patterns are formed using a window of words before and after the labeled
tokens. There are several different ways of constructing these patterns. In this section, I
give an overview of how these patterns can be formed; more details on the restrictions and
the parameter values I use are in the individual chapters.
A few options a developer can consider when developing a surface word pattern-based
system are listed below. Table 2.1 shows an example of two patterns and how they match two sentences.
• Target Entity Restrictions: Patterns can be developed to extract any sequence of
words that match a pattern’s context. However, generally, manually providing or
automatically learning restrictions on the target entity, such as part-of-speech and
common named entity tags, can improve the precision of a system. Patterns can also
specify minimum and maximum lengths of entities to be learned.
• Context Length: Surface word patterns are formed by considering the maximum and
minimum window size on either side of the entities. Contexts that consist of only
a few stop words can be discarded because they are too general; sometimes longer
contexts consisting of all stop words can still be useful (for example, ‘I am on X’ is
a good pattern for extracting a DRUG.)
• Context Generalizations: Generalizing context of a pattern helps to reduce sparsity
and thus improve learning, since the pattern matches more entities. It also improves
performance on unseen data. Lemmatization of words is a classic form of generaliza-
tion. Other ways of generalizing tokens include semantic word classes, such as from
Wordnet (Fellbaum, 1998) or Yago (Suchanek et al., 2007) hierarchy, stop words, and
part-of-speech tags. Such generalizations have been used widely in previous work.
For example, Califf and Mooney (1999) used part-of-speech tags and semantic word
class constraints on pattern elements. In addition, words that are labeled with one
of the label dictionaries (seed as well as learned) can be generalized with the label.
[1] I also refer to them as surface word patterns in the dissertation.
Pattern: lemma:put FW* lemma:I FW* lemma:on FW* SW* {X | tag:NN.*}{1,2}
Sentence: dr. put me on some albuterol inhaler

Pattern: {X}{1,2} SW* FW* lemma:in FW* lemma:throat
Sentence: I have this itchiness in the throat.

Table 2.1: Examples of patterns and how they match sentences. X means one token that will be matched; tag means the part-of-speech tag restriction on the target entity; FW* means up to 2 words from {a, an, the}; SW* means up to 2 stop words; .* means zero or more characters can match; and lemma means the lemma of the token. Lemmas, FW*, and SW* are generalizations of the patterns' context.
As an example of how to vary the context window size and generalize, consider the
labeled sentence, ‘I take Advair::DT and Albuterol::DT for asthma::SC’, where ‘::’
indicates the label of the word. The patterns created around the DT word ‘Albuterol’
will be ‘DT and X’, ‘DT and X for SC’, ‘X for SC’, and so on, where X is the target
entity.
• Flexible Matching: One can create flexible patterns by ignoring certain types of
words, such as determiners and function words, while matching the pattern. One
can also allow stop words between the context and the term to be extracted. Some
systems (Agichtein and Gravano, 2000; Brin, 1999) have used vectors of context
words instead of contiguous tokens as patterns to increase flexibility in matching. I
use context as contiguous tokens or their generalized forms.
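As a rough sketch, with a far simpler pattern language than the one described above (no lemmas, FW*/SW* slots, or flexible matching), generating candidate context-window patterns around 'Albuterol' might look like:

```python
# A toy generator of surface patterns around a labeled entity, varying the
# left/right context window. Tokens already labeled with a class (seed or
# learned) are generalized to their label; everything else stays lexical.

def candidate_patterns(tokens, labels, idx, max_window=2):
    """Generate contexts of 0..max_window tokens on each side of tokens[idx]."""
    def ctx(i):
        return labels[i] if labels[i] != "O" else tokens[i]
    patterns = []
    for left in range(max_window + 1):
        for right in range(max_window + 1):
            if left == right == 0:
                continue  # skip the empty context
            lhs = [ctx(i) for i in range(max(0, idx - left), idx)]
            rhs = [ctx(i) for i in range(idx + 1, min(len(tokens), idx + 1 + right))]
            patterns.append(" ".join(lhs + ["X"] + rhs))
    return patterns

# 'I take Advair::DT and Albuterol::DT for asthma::SC', around 'Albuterol'.
toks = "I take Advair and Albuterol for asthma".split()
labs = ["O", "O", "DT", "O", "DT", "O", "SC"]
pats = candidate_patterns(toks, labs, idx=4)
print("DT and X for SC" in pats, "X for SC" in pats)
# True True
```

This reproduces the patterns 'DT and X', 'DT and X for SC', and 'X for SC' from the running example, with X as the target-entity slot.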
2.2.2 Dependency Patterns
A dependency tree of a sentence is a parse tree that gives dependencies (such as direct-
object, subject) between words in the sentence. It is, in my opinion, the best way to trade off semantic meaning representation and 'learnability' using the resources we currently have.
Semantic representations can be more expressive but it is hard to learn how to generate the
representation for a new sentence. The expressive representations require more manually
labeled data, which is very hard to acquire. All work in this dissertation has used the
Stanford English Dependencies (De Marneffe et al., 2006). Many researchers are currently
working on developing the Universal Dependencies[2] to provide a universal collection of
categories with consistent annotations across different languages. My systems can be easily
customized to work with the Universal Dependencies.
Figure 2.1 shows the dependency tree for the sentence ‘We work on extracting informa-
tion using dependency graphs.’. Dependency patterns match dependency trees of sentences
to extract phrase sub-trees. The figure shows matching of two patterns: [using→ (direct-
object)] and [work→ (preposition on)]. The two patterns are part of seed patterns to extract
FOCUS and TECHNIQUE entities from scientific articles in Chapter 4.
A dependency tree matches a pattern [T → (d)], with a trigger word T and a de-
pendency d, if (1) it contains T , and (2) the trigger word’s node has a successor whose
dependency with its parent is d. In the rest of the dissertation, I call the subtree headed by
the successor the matched phrase-tree. The notion of a phrase in a dependency grammar
is the subtree below the head node selected by the pattern. I extract the phrase correspond-
ing to the matched phrase-tree and label it with the pattern’s category. For example, the
dependency tree in Figure 2.1 matches the FOCUS pattern [work→ (preposition on)] and
the TECHNIQUE pattern [using→ (direct-object)]. Thus, the system labels the phrase corre-
sponding to the phrase-tree headed by ‘extracting’, which is ‘extracting information using
dependency graphs’, with the category FOCUS, and similarly labels the phrase ‘dependency
graphs’ as a TECHNIQUE.
I use Stanford CoreNLP (Manning et al., 2014) to get dependency trees of sentences
and use its Semgrex tool[3] to match dependency patterns to the dependency trees.[4]
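A toy version of the [T → (d)] matching just described is below; the nested-tuple tree encoding is an assumption made for illustration, since the actual system matches Semgrex patterns against Stanford dependency trees:

```python
# A toy matcher for patterns of the form [trigger -> (dep)]: find the
# trigger word in the tree, follow the required dependency edge, and
# return the yield of the matched phrase-tree.

def yield_of(tree):
    """Yield of a subtree; assumes children are stored in surface order."""
    token, children = tree
    out = [token]  # simplification: head before its dependents' yields
    for _, child in children:
        out.extend(yield_of(child))
    return out

def match(tree, trigger, dep):
    """Return the phrase selected by [trigger -> (dep)], or None."""
    token, children = tree
    if token == trigger:
        for d, child in children:
            if d == dep:
                return " ".join(yield_of(child))
    for _, child in children:
        found = match(child, trigger, dep)
        if found:
            return found
    return None

# A hand-built, simplified tree for 'work on extracting information using graphs'.
tree = ("work", [("prep_on",
          ("extracting", [("dobj", ("information", [])),
                          ("xcomp", ("using", [("dobj", ("graphs", []))]))]))])
print(match(tree, "using", "dobj"))    # graphs
print(match(tree, "work", "prep_on"))  # extracting information using graphs
```

The two calls mirror the TECHNIQUE pattern [using → (direct-object)] and the FOCUS pattern [work → (preposition on)] from Figure 2.1, each extracting the phrase headed by the matched successor node.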
The options to consider when creating dependency patterns are similar to those for surface word patterns. A few other parameters to consider are: 1. the allowed and disallowed dependen-
cies, both when generating the dependency patterns and when extracting a phrase from a
matched phrase sub-tree, 2. flexible matching of dependency patterns by allowing a cer-
tain number or type of nodes to be skipped between the trigger node and the node that is
connected by the required dependency edge.
[2] http://universaldependencies.github.io/
[3] http://nlp.stanford.edu/software/tregex.shtml
[4] More details about the dependencies are in http://nlp.stanford.edu/software/dependencies_manual.pdf.
CHAPTER 2. BACKGROUND 21
Figure 2.1: The dependency tree for 'We work on extracting information using dependency graphs'. The tree is generated using the collapsed dependencies defined in the Stanford CoreNLP toolkit (the word 'on' is collapsed with the edge 'preposition'). The dependency 'nn' means 'noun compound modifier'. The generated dependencies are not always correct; for example, the correct dependency between 'extracting' and 'using' should have been 'advcl'. Also shown is the matching of two patterns. More details are in Chapter 4.
2.3 Bootstrapped Pattern Learning
Bootstrapped pattern-based entity learning (BPL) generally begins with seed sets of pat-
terns and/or example dictionaries for given labels and iteratively learns new entities from
unlabeled text (Riloff, 1996; Collins and Singer, 1999). I earlier discussed the two types
of patterns our systems learned using this approach – lexico-syntactic surface word pat-
terns (Hearst, 1992) and dependency tree patterns (Yangarber et al., 2000). In each itera-
tion, BPL learns a few patterns and a few entities of each given label. Figure 2.2 shows
the flow of the system when the supervision is provided as seed entities. For ease of ex-
position, I present the approach below for learning entities for one label l. It can easily
be generalized to multiple labels. I refer to entities belonging to l as positive and entities
belonging to all other labels as negative. Patterns are scored by their ability to extract more
positive entities and fewer negative entities. Top-ranked patterns are used to extract candidate
entities from text. High scoring candidate entities are added to the dictionaries and are used
to generate more candidate patterns around them.
Figure 2.2: A flowchart of the various steps in a bootstrapped pattern-based entity learning system.
The bootstrapping process involves the following steps, iteratively performed until no
more patterns or entities can be learned. For ease of understanding, I use a running
example of learning ‘animal’ entities from a short unlabeled text (shown in Figure 2.3),
starting with {dog} as the seed set of entities.
Step 1: Data labeling
The unlabeled text is partially labeled using the label dictionaries, starting with the seed
dictionaries in the first iteration. In later iterations, the text is labeled using both the seed
and the learned dictionaries. A phrase matching a dictionary phrase is labeled with the
dictionary’s label. Often, phrases are soft matched, by using lemmas of words and/or by
matching phrases within a small edit distance. In the example below, both instances of
‘dog’ are labeled as an ‘animal’.
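As a toy illustration of this labeling step, the sketch below pairs tokens with a label when their lemma is in the dictionary. The `lemma` function is a crude stand-in for a real lemmatizer (lower-case, strip a plural 's'), and only single-word entries are handled; real systems also soft-match multi-word phrases within a small edit distance.

```python
# A toy version of Step 1: dictionary-based labeling with crude
# lemma-based soft matching (a real system uses a proper lemmatizer).

def lemma(word):
    w = word.lower()
    return w[:-1] if w.endswith("s") else w

def label_tokens(tokens, dictionary, label):
    """Pair each token with `label` if its lemma is in the dictionary."""
    entries = {lemma(e) for e in dictionary}
    return [(t, label if lemma(t) in entries else None) for t in tokens]

tokens = "My dog chased the neighbor 's dogs".split()
print(label_tokens(tokens, {"dog"}, "animal"))
```

Both 'dog' and 'dogs' receive the 'animal' label, as in the running example.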
Step 2: Pattern generation
Patterns are generated using the context around the labeled entities to create candidate
patterns. I discussed various parameters to consider when generating the patterns in Section
2.2. I generate all possible patterns and learn the good ones in the next step. Figure 2.3
shows two of the many possible candidate patterns and their extractions.
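A minimal sketch of candidate generation for surface word patterns, assuming the entity positions are known from Step 1. The window sizes and the token representation (raw words here, rather than lemmas or part-of-speech generalizations) are among the parameters discussed in Section 2.2.

```python
# A sketch of Step 2: for each labeled entity position, emit every
# left/right context window up to `max_window` tokens, with the entity
# slot replaced by the placeholder "X".

def candidate_patterns(tokens, entity_positions, max_window=2):
    patterns = set()
    for i in entity_positions:
        for w in range(1, max_window + 1):
            left = tuple(tokens[max(0, i - w):i])
            right = tuple(tokens[i + 1:i + 1 + w])
            patterns.add((left, "X", right))
    return patterns

tokens = "I played with my dog yesterday".split()
for p in sorted(candidate_patterns(tokens, [4])):
    print(p)
```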
Step 3: Pattern learning
This is one of the two crucial steps in a BPL system. Candidate patterns generated in
the previous step are scored using a pattern scoring measure. Top ones are added to the
list of learned patterns for l. The maximum number of patterns to be learned and the
threshold to choose a pattern are given as inputs to the system by the developer. In a
supervised setting, the efficacy of patterns can be judged by their performance on a fully
labeled dataset (Califf and Mooney, 1999; Ciravegna, 2001). In a bootstrapped system,
where the data is not fully labeled, a pattern is usually judged by the number of positive,
negative, and unlabeled entities it extracts. Note that a true recall cannot be used because of
the lack of a fully labeled dataset. One of the most commonly used measures is RlogF by
Riloff (1996). It is a combination of reliability of a pattern and the frequency with which
it extracts positive entities. Let pos(p), neg(p), and unlab(p) be the number of positive,
negative, and unlabeled entities extracted by the pattern p, respectively. The RlogF score is
RlogF(p) = pos(p) / (pos(p) + neg(p) + unlab(p)) · log pos(p) (2.4)
The first term is a very rough estimate of the precision of a pattern – it assumes unla-
beled entities to be negative. The log pos(p) term gives higher scores to patterns that extract
more positive entities. In Figure 2.3, the pattern scorer gives scores to the two candidate
patterns. The pos and unlab values for both patterns are 1 and neg is 0. Assuming that the
pattern scorer is good, that is s2 > s1, the second pattern is selected and added to the list of
learned patterns.
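Equation 2.4 translates directly into code; the counts are assumed to have already been computed by matching a pattern's extractions against the current dictionaries. The log base (natural log here) only rescales scores and does not change the ranking of patterns.

```python
# RlogF (Equation 2.4): a rough precision estimate, weighted by the
# (log) number of positive entities the pattern extracts.
import math

def rlogf(pos, neg, unlab):
    if pos == 0:
        return float("-inf")   # a pattern extracting no positives is useless
    return pos / (pos + neg + unlab) * math.log(pos)

print(rlogf(10, 2, 5) > rlogf(2, 2, 5))   # True: more positives, same other counts
```

Note that a pattern with pos(p) = 1 scores exactly 0 regardless of its other counts, since log 1 = 0; the measure only starts separating patterns once they extract multiple positives.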
Figure 2.3: An example pattern learning system for the class 'animals' from the text. Two of the many possible candidate patterns are shown, along with the extracted entities. Text matched with the patterns is shown in italics and the extracted entities are shown in bold.
Step 4: Entity learning
Patterns that are learned for the label in the previous step are applied to the text to extract
candidate entities. An entity scorer ranks the candidate entities and adds the top entities to
l’s dictionary. The maximum number of entities to be learned and the threshold to choose
an entity are given as inputs to the system by the developer. Some systems learn every
entity extracted by the learned patterns; however, that can lead to many noisy entities. In
Chapter 7, I discuss various entity evaluation measures. In our systems, I represent an
entity as a vector of feature values. The features are used to score the entities, either by
taking an average of their values or by training a machine learning-based classifier.
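A sketch of the simpler of the two scoring options, averaging the feature values; each candidate entity is a vector of features scaled to [0, 1]. The feature names are invented for illustration, and the classifier-based alternative is discussed in Chapter 7.

```python
# Step 4 sketch: score each candidate entity by the unweighted average
# of its feature values, then rank. Feature names are illustrative.

def score_entity(features):
    return sum(features.values()) / len(features)

candidates = {
    "cat": {"pattern-tfidf": 0.9, "dict-similarity": 0.7, "web-rarity": 0.8},
    "the": {"pattern-tfidf": 0.2, "dict-similarity": 0.1, "web-rarity": 0.0},
}
ranked = sorted(candidates, key=lambda e: score_entity(candidates[e]), reverse=True)
print(ranked)   # ['cat', 'the']
```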
Iterations
Steps 1-4 are repeated for a given number of iterations. Generally, precision drops and
recall increases with every iteration. The number of iterations can also be determined by a
threshold on precision; in Chapters 6 and 7, I do not consider output of the learning systems
when their precision drops below 75% during the post-hoc analysis.
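Putting Steps 1-4 together, the loop can be shown on a deliberately tiny, hypothetical example in which a 'pattern' is just the word preceding a known entity and every extraction is accepted, a drastic simplification of the real pattern language and scoring.

```python
# A toy rendering of the full bootstrapping loop (Steps 1-4), NOT the
# actual system: patterns are single left-context words, and all
# extractions are accepted, which is exactly the behavior that invites
# semantic drift (a corpus sentence like ('a', 'table', ...) would
# pull in 'table').

def bootstrap(sentences, seeds, max_iters=10):
    entities, patterns = set(seeds), set()
    for _ in range(max_iters):
        # Steps 1-2: label the text, generate left-word context patterns
        new_p = {s[i - 1] for s in sentences
                 for i, w in enumerate(s) if i > 0 and w in entities} - patterns
        # Steps 3-4 collapsed: apply the new patterns, accept every extraction
        new_e = {s[i + 1] for s in sentences for i, w in enumerate(s)
                 if w in new_p and i + 1 < len(s)} - entities
        if not new_p and not new_e:
            break
        patterns |= new_p
        entities |= new_e
    return entities

corpus = [("my", "dog", "barks"), ("my", "cat", "sleeps"),
          ("a", "cat", "purrs"), ("a", "parrot", "talks")]
print(sorted(bootstrap(corpus, {"dog"})))   # ['cat', 'dog', 'parrot']
```

Starting from {dog}, the first iteration learns the pattern 'my' and the entity 'cat'; the second learns 'a' and 'parrot'; the third learns nothing new and the loop stops.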
Parameters
Similar to any learning system, there are many parameters one can tweak to improve a
system’s performance. One set of parameters are BPL related parameters – the thresholds
for learning a pattern (entity), number of patterns (entities) to learn in each iteration, and
the total number of iterations. The second set of parameters, as discussed in the previous
section, are related to the construction of patterns: minimum and maximum window of
context, annotations (such as, part-of-speech tags and word class tags) to consider for the
context tokens and the target entity, and, in the case of dependency patterns, the depth
of ancestors/dependents to consider when matching or constructing a pattern. Some of
these parameters, such as thresholds, can be hand tuned on a development dataset. One
way to avoid tuning the thresholds is to start with high thresholds and reduce them when no
more patterns or entities are learned by the system. I follow this approach in Chapters 5–
7. Another advantage is that initially the systems learn only highly confident patterns and
entities, reducing the chances of semantic drift. Semantic drift occurs when the system
learns a few false positive entities, leading to the learning of more incorrect patterns and
entities over the iterations. Other parameters, such as restrictions to consider for the target
entity, can be learned by the system; patterns with and without the restrictions are generated
and are scored by the pattern scoring function.
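The start-high-and-relax trick can be sketched as control flow. Here `run_iteration` is a hypothetical stand-in for one pass of Steps 1-4 that reports how many items were learned at the given threshold; the numbers are illustrative.

```python
# A sketch of threshold annealing: relax the acceptance threshold only
# when an iteration learns nothing at the current one.

def anneal(run_iteration, start=0.9, floor=0.3, step=0.1, max_calls=50):
    threshold = start
    for _ in range(max_calls):
        if run_iteration(threshold) == 0:
            lowered = round(threshold - step, 2)
            if lowered < floor:
                break          # nothing learnable even at the loosest setting
            threshold = lowered
    return threshold

# Toy stand-in for Steps 1-4: pretends three batches of items score >= 0.7.
budget = {"left": 3}
def run_iteration(threshold):
    if threshold >= 0.7 and budget["left"] > 0:
        budget["left"] -= 1
        return 1               # learned one item at this threshold
    return 0

print(anneal(run_iteration))   # 0.3
```

The loop first exhausts everything learnable at the strict threshold, then steps down until it hits the floor, matching the behavior described above of learning only highly confident items early.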
2.4 Classifiers and Entity Features
Brown clusters
I use Brown clustering (Brown et al., 1992) to cluster words in the MedHelp dataset in
an unsupervised way. It is a greedy bottom-up hierarchical clustering approach based on
n-gram class language models. Brown clusters have been used widely for other tasks, such as for
NER (Ratinov and Roth, 2009), parsing (Koo et al., 2008), and part-of-speech tagging
(Li et al., 2012a). I used its implementation by Liang (2005). I do not use the publicly
available generic word clusters because they are not from the same domain as the datasets.
I mainly used Brown clustering instead of other clustering methods such as distributional
clustering (Clark, 2001) or word embeddings (Collobert and Weston, 2008; Mikolov et al.,
2013a) because it was fast, easy to use, and produced good clusters. Additionally, Turian
et al. (2010) reported that Brown clustering induces better representations for rare words
than the embeddings from Collobert and Weston (2008), when the latter does not receive
sufficient training updates. I use the word embeddings from a neural network model to
enhance the entity classifier in Chapter 7.
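Brown clusters are typically consumed as features via bit-string prefixes of each word's path in the cluster hierarchy, giving features at several granularities. The sketch below assumes the bitstring/word/count column layout of Liang's implementation output; the clusters themselves are made up for illustration.

```python
# A sketch of turning Brown-cluster output into features: prefixes of a
# word's bit-string path act as coarse-to-fine cluster memberships.
# The three-column layout (bitstring<TAB>word<TAB>count) is assumed;
# the bit strings below are invented.

paths_file = "0110\tasthma\t420\n0111\tbronchitis\t57\n1010\tadvil\t88"

cluster = {}
for line in paths_file.splitlines():
    bits, word, _count = line.split("\t")
    cluster[word] = bits

def brown_features(word, prefix_lengths=(2, 4)):
    bits = cluster.get(word, "")
    return {f"brown-{k}={bits[:k]}" for k in prefix_lengths if len(bits) >= k}

print(sorted(brown_features("asthma")))   # ['brown-2=01', 'brown-4=0110']
```

The length-2 prefix puts 'asthma' and 'bronchitis' in the same coarse cluster while the length-4 prefix separates them, which is the property that makes prefixes useful for rare words.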
Google Ngrams
Google Ngrams5 is a resource provided by Google consisting of English phrases of one
to five words and their observed frequency counts on the web (considering around 1 trillion word
tokens). Only phrases with frequency greater than or equal to 40 are included. It is a great
resource for building language models or for estimating usage of a phrase on the Internet. I
5 https://catalog.ldc.upenn.edu/LDC2006T13, accessed in January 2008.
use this for calculating feature values of entities, on the assumption that an entity common
on the Internet (such as, ‘youtube’) is not a useful entity to extract for a specialized domain.
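One way such a feature might be computed is sketched below; the counts are invented stand-ins for Google Ngram frequencies, and the exact scoring form is an illustrative assumption, not the dissertation's formula.

```python
# A sketch of a web-commonness feature: entities very frequent on the
# web score near 0, rare or unseen phrases score near 1. Counts here
# are made up.
import math

ngram_count = {"youtube": 2_000_000_000, "salbutamol": 120_000}
TOTAL_TOKENS = 1_000_000_000_000   # roughly 1 trillion tokens in the corpus

def web_rarity(entity):
    """Close to 1 for phrases rare (or unseen) on the web, near 0 for common ones."""
    count = ngram_count.get(entity.lower(), 0)
    if count == 0:
        return 1.0   # unseen phrases are maximally 'rare'
    return 1.0 - math.log(count) / math.log(TOTAL_TOKENS)

print(web_rarity("salbutamol") > web_rarity("youtube"))   # True
```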
Logistic Regression
I use logistic regression (LR) for entity classifiers since it is one of the most commonly used
classifiers and it worked better than SVMs and Random Forests in the pilot experiments. I
used the implementation of LR in Stanford CoreNLP (Manning et al., 2014) and used the
default settings (with L2 regularization). Note that our training datasets are noisy – automatically
constructed seed sets often have some noise; learned patterns and entities can be
incorrect; and the sampling to create a training set can lead to a wrongly labeled dataset.
There has been some work in modeling annotation noise to learn more robust classifiers,
such as Shift LR (Tibshirani and Manning, 2014) and Natarajan et al. (2013), which model
random labeling noise. The noise in bootstrapped systems is, however, more systematic:
the wrong labels come from noisy dictionaries rather than from wrong annotations by human
annotators, which are presumably more random. I tried using Shift LR in our systems but it
led to poor results.
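For concreteness, here is a from-scratch sketch of an L2-regularized logistic regression classifier of the kind described (the actual systems use the implementation in Stanford CoreNLP with its default settings); the feature vectors and labels below are invented for illustration.

```python
# A minimal L2-regularized logistic regression trained by gradient
# descent, standing in for the CoreNLP implementation. Features and
# labels are made up.
import math

def train_lr(X, y, l2=0.1, lr=0.5, epochs=500):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            # gradient step with an L2 penalty on the weights
            w = [wj - lr * (err * xj + l2 * wj / len(X)) for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

X = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2],   # positive entities
     [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]]   # negative entities
y = [1, 1, 0, 0]

w, b = train_lr(X, y)
print(predict(w, b, [0.85, 0.7, 0.15]) > 0.5)   # True
```

In practice one would use an existing implementation; the sketch only makes the training objective concrete.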
2.5 Dataset: MedHelp Patient Authored Text
For experimental evaluation in Chapters 5, 6, and 7, I use MedHelp.org forum data. Med-
Help is one of the largest online health discussion forums. Similar to other discussion
forums, there are forums under topics like ‘Asthma’, ‘ENT’, and ‘Pregnancy: Sept 2015
babies’. A MedHelp forum consists of thousands of threads; each thread is a sequence of
posts by users. The dataset includes some medical research material posted by users but has
no clinical text. In each thread, the initiator of the thread posts a paragraph or more about
a health concern or comment. The conversations are usually about health topics, but are
also sometimes about emotional support and advice (MacLean et al., 2015). We acquired
the dataset through a research agreement with MedHelp, who anonymized the data prior
to sharing. The data spans from 2007 to May 2011. Other work on this dataset includes
MacLean and Heer (2013), MacLean et al. (2015), and MacLean (2015).
Cold dry air is a common trigger, I’m also haven’t a lot of trouble keeping the asthma under control now that is it winter (only diganosed last spring).
I had actually been feeling spasms in my throat that I thought were palpitations but it ended up not being my heart.
Now I have developed a low grade fever and blisters in my throat.
Would love some feedback as I’m anxious.
No stuffed nose, no discharge.
yes i realize that i should have used ear plugs and yes i’ve learned my lesson that i will use plugs from now on.
I have chronic sinusitis, scars on both ears from past infections, and “fairly severe deviated septum,”.
I went to the doctor and he gave me augmittin it cleared the white patches right up.
I went to the health food store and found Wally’s Ear Oil about 2 weeks ago after reading some of the posts here.
I am interested in Xanax side affect of loosing taste and smell.
It sounds like chronic non-infectious bronchitis.
I’ve had chest x-ray-normal.
Once I had my sinus surgeries my asthma improved dramatically.
Table 2.2: A few examples of sentences from the MedHelp forum. The sentences are labeled with symptoms & conditions (in italics) and drugs & treatments (in bold) labels.
There are several challenges with extracting information from the dataset. Patients use
various slang, colloquial forms of entities, and home remedies that are not found in seed
sets. They are very descriptive about their symptoms and conditions. Some examples
of sentences from the Asthma and ENT forums labeled with symptoms & conditions (in
italics) and drugs & treatments (in bold) labels are shown in Table 2.2. More information
about these two labels is in Chapter 5.
In the next chapter, I discuss previous work related to bootstrapped and pattern-based
learning.
Chapter 3
Related Work
                     Fully Supervised                    Distantly Supervised
Pattern-based        SRV, SLIPPER, WHISK, RAPIER         Snowball, Basilisk, Ravichandran
                                                         and Hovy (2002)
Non-pattern-based    Sequence models like CRFs,          Semi-supervised classifiers,
                     HMMs, CMMs                          TAC-KBP systems like MIML-RE
Hybrid               Boella et al. (2013), Freitag       KnowItAll, NELL, Putthividhya
                     and Kushmerick (2000)               and Hu (2011), Surdeanu et al. (2006)
Table 3.1: A few examples of the types of IE systems developed, categorized by the amount of supervision and the type of model.
The existing IE systems can be roughly categorized along two dimensions – the super-
vision required and the models used. A system can be fully supervised, semi-supervised,
or distantly supervised. There are several names for distantly supervised learning in the
existing literature: bootstrapped, lightly supervised, weakly supervised, minimally super-
vised, or semi-supervised learning. The second dimension is that the model used can be
pattern-based, not pattern-based (most of them are feature-based sequence classifiers), or a
hybrid of pattern and feature-based. Table 3.1 gives a few examples for each category.
CHAPTER 3. RELATED WORK 30
In this chapter, I discuss distantly supervised systems and pattern-based learning ap-
proaches to information extraction (IE). I do not review hand-written systems, fully su-
pervised or semi-supervised machine learning-based systems in this chapter. Conditional
random fields (CRFs), a fully supervised approach, have been very successful at entity ex-
traction tasks. For more information on CRFs and other advances in IE, see Hobbs and
Riloff (2010) and Sarawagi (2008).
3.1 Pattern-based systems
Pattern-based learning, both distantly and fully supervised, has been a topic of interest
for many years. Pattern-based approaches have been widely used for IE (Chiticariu et al.,
2013; Fader et al., 2011; Etzioni et al., 2005). Systems differ in how they create patterns,
learn patterns, and learn the entities they extract. Patterns are useful in two ways: they
are good features in an entity classifier, and they identify promising candidate entities.
Pattern learning can also be thought of as a feature selection approach, in which patterns
are instantiated feature templates. Patwardhan (2010) gives a good overview of the research
in the field.
3.1.1 Fully supervised
The pioneering work by Hearst (1992) used hand-written rules to automatically generate
more rules that were manually evaluated to extract hypernym-hyponym pairs from text.
Other supervised systems like SRV (Freitag, 1998), SLIPPER (Cohen and Singer, 1999),
WHISK (Soderland, 1999), (LP )2 (Ciravegna, 2001), and RAPIER (Califf and Mooney,
1999) used a fully labeled corpus to either create or score rules. WHISK learned sur-
face word patterns with wildcards and semantic classes like digit and numbers. SLIP-
PER (Cohen and Singer, 1999) used boosting to create an ensemble of rules. Here, I de-
scribe RAPIER and (LP )2 in detail as examples of the supervised pattern learning systems.
RAPIER, a bottom-up learning system, used a relational learning algorithm that preferred
overly specific to overly general rules. A rule or pattern was defined as a combination of
the filler (target entity), pre-filler (left context), and post-filler (right context) slots. Initially,
the patterns were created as maximally specific by considering all the left and right context
tokens around the known entities in the labeled documents. RAPIER generalized pairs of
patterns and then specialized pre- and post-fillers. A pattern was evaluated by considering
the positive and negative examples it extracted and the pattern’s complexity, giving prefer-
ence to less complex patterns. Note that since the system was fully supervised, the number
of positive and negative examples were known for each pattern. (LP )2, another bottom-
up pattern learning system, had similar steps as RAPIER. It had two stages: induction of
tagging patterns and induction of correction patterns. Similar to other systems, the training
corpus was manually marked with positive examples and the rest of the corpus was con-
sidered as negative examples. Generalization was performed by relaxing constraints in the
initial patterns, which were created by considering a window of words on the left and the
right of marked tokens. The correction rules, which learned correct boundaries of entities,
were induced from the mistakes made in applying the earlier learned tagging rules on the
training corpus. All entities that matched the patterns were extracted.
IBM Research’s SystemT (Liu et al., 2010) used supervision of correctly and incor-
rectly extracted data to suggest rule refinements to developers. Freitag and Kushmerick
(2000) used extraction patterns as weak learners in a boosting framework to learn patterns
with better recall. Some systems used patterns or patterns-like features in a supervised
machine learning-based classifier. In particular, dependency paths between entities, which
can be thought of as simpler versions of dependency patterns, have been used as features in
various relation extraction systems. Bunescu and Mooney (2005) used dependency-path
similarity to compute kernel scores in an SVM for relation extraction.
3.1.2 Distantly supervised
There has been a lot of recent work on using distant supervision for entity and relation ex-
traction, both using classifiers and patterns. Bootstrapping or distantly supervised learning
has many variants, such as pattern- or rule-based learning, self-training, co-training, and la-
bel propagation. Yarowsky’s style of self-training algorithms (Yarowsky, 1995) have been
shown to be successful at bootstrapping (Collins and Singer, 1999). Co-training (Blum and
Mitchell, 1998) and its bootstrapped adaptation (Collins and Singer, 1999) require disjoint
views of the features of the data. In an entity learning task, the two views are from the
context and the content of the entities. Bellare et al. (2007) learned attributes of entities
in underspecified queries using the DL-CoTrain algorithm proposed by Collins and Singer
(1999). Whitney and Sarkar (2012) proposed a modified Yarowsky algorithm that used la-
bel propagation on graphs, inspired by an algorithm proposed in Subramanya et al. (2010)
that used a large labeled data for domain adaptation.
My dissertation is inspired by the system proposed by Riloff (1996). Riloff used a set
of seed entities to bootstrap learning of patterns for entity extraction from unlabeled text. I
describe the iterative algorithm in Chapter 2. She scored a pattern by a weighted conditional
probability measure estimated by counting the number of positive entities among all the
entities extracted by the rule. Riloff and Jones (1999) added another level of bootstrapping
by retaining only the learned entities and restarting the process after each iteration. Thelen
and Riloff (2002) proposed a system called Basilisk that extended the above bootstrapping
algorithm for multi-class learning. Their systems can be viewed as a form of the Yarowsky
algorithm, with pattern learning as an additional step.
Yangarber et al. (2002) learned surface-word patterns to extract diseases and viruses
from medical text using seed sets. Lin et al. (2003) learned names, such as diseases and
locations, using an approach similar to Yangarber et al. (2002), and tested the system on
multiple languages. Stevenson and Greenwood (2005) used Wordnet to assess semantic
similarity between patterns. Talukdar et al. (2006) used seed sets to learn trigger words
for entities and a pattern automata. Using the learned dictionaries as gazette features, they
improved supervised CRFs performance on the CoNLL NER task.
Snowball (Agichtein and Gravano, 2000) and DIPRE (Brin, 1999) are two classic
pattern-learning systems. Snowball learned patterns to extract (LOCATION, ORGANIZA-
TION) tuples from text using seed sets of examples. It was inspired by the DIPRE system.
Unlike most other pattern-based systems, Snowball represented a pattern by the left, mid-
dle, and right vectors of terms around the entities. StatSnowball (Zhu et al., 2009) is an
enhanced Snowball system that used Markov logic networks to learn scores of patterns and
selected the patterns using L1 regularization.
There are many different ways of creating and representing patterns. Sudo et al. (2003),
for example, used a subtree model to represent patterns, which is based on arbitrary sub-
trees of dependency trees. Sub-trees can potentially capture more varied context than just
dependency paths or surface context.
Distant supervision can also be provided by using seed patterns instead of seed
examples. Yangarber et al. (2000) used seed dependency patterns to divide a text corpus
into relevant and irrelevant documents, and ranked the candidate patterns according to their
frequency of match in relevant vs. irrelevant documents. Pasca (2004) used seed patterns to
learn generic non-domain specific patterns like ‘X [such as | including] N [and |,|.]’ from
web pages to learn named entities (represented by ‘N’) and their categories (represented by
‘X’). In Chapter 4, I use seed patterns to extract key aspects from scientific articles. In the
rest of the chapters, I use seed entities to extract more entities.
Some papers built IE systems in the setting of traditional semi-supervised learning.
McLernon and Kushmerick (2006) acquired patterns using a small amount of labeled data,
in addition to the unlabeled data. In contrast, Hassan et al. (2006) did not use any seed
examples or patterns; they used human annotation for identifying ‘interesting entities’.
They used a HITS-like algorithm (Kleinberg, 1999) on patterns (authorities) and instances
(hubs) to learn generic relation extractors.
Patterns have also been used to extract attributes of entities. Yahya et al. (2014) used
seed patterns to extract attributes of nouns from queries. Gupta et al. (2014a) used text
patterns to learn attributes of entities and described Biperpedia, an ontology with 1.6M
(class, attribute) pairs.
In this dissertation, I do not discuss canonicalization of entities, which is, in some ways,
a harder task because it involves disambiguation along with extraction. Suchanek et al.
(2009) used pattern-based IE combined with logical constraint checking using a Max-SAT
model to extend existing ontologies. They worked on canonicalization of entities, which
is useful for extension of ontologies and other downstream tasks. Buitelaar and Magnini
(2005) gave an overview of the methods to learn ontologies from text.
Distant supervision can also come from existing human-curated resources, such as web
pages and ontologies. I used automatically generated seed sets from medical ontologies and
webpages in my experiments. Other resources include Freebase (Bollacker et al., 2008),
Wikipedia Infoboxes, and Yago (Suchanek et al., 2007). Mintz et al. (2009) used Freebase
as supervision to learn relation extractors. Xu et al. (2007) learned pattern rules for n-ary
relation extraction, starting with seed examples. Most other systems use hybrid approaches,
which are discussed in the next section. Systems developed for the TAC-KBP slot filling
task,1 a shared task for relation extraction, use Wikipedia Infoboxes as distant supervision
(Surdeanu et al., 2012). Jean-Louis et al. (2011) learned patterns for the TAC-KBP slot
filling task.
3.2 Distantly supervised non-pattern-based systems
I work on IE systems that use unlabeled data and seed sets of entities. Existing non-pattern-
based IE systems can be compared with our systems along two aspects. The first is the use
of unlabeled data. Many IE systems use unlabeled data to learn word embeddings or clus-
ters (such as, Brown clusters) to use them as features in a fully supervised feature-based
classifier (Ratinov and Roth, 2009; Turian et al., 2010). These systems are not distantly
supervised because, even though they make use of unlabeled data, they still need a fully
labeled dataset to learn a robust classifier. I also use word embeddings computed using
unlabeled data to improve entity and pattern scoring function, however, in a bootstrapped
setting. Second, many of them, such as CRFs-based systems, use sets of entities, also
called gazettes or dictionaries, as features. However, they do not expand the sets of enti-
ties. Most systems perform direct look-up against the dictionaries. Cohen and Sarawagi
(2004) worked on improving the matching of named entity segments to dictionaries to use
as a feature for a sequence model. They used segment-based conditional markov models
(CMM, also called MEMM) to incorporate similarity of named entity segments (instead of
words) with a dictionary’s entries, and model lengths of entities. I compare pattern-based
systems with sequence classifiers in Section 1.1.2.
3.3 Distantly supervised hybrid systems
1 http://www.nist.gov/tac/2014/KBP/
There are several ways of combining pattern-based and feature-based learning systems.
First, individual components of a pattern-based learning system can use feature-based
learning methods to learn good pattern and entity ranking functions. For example, I use
logistic regression to learn an entity scorer in some of my systems. Second, dependency
paths can be used as features in a classifier, a common practice for building classifier-based
entity and relation extraction systems. Boella et al. (2013) used patterns or syntactic de-
pendencies as features in a SVM for extracting semantic knowledge from legislative text.
Patterns can also be thought of as ‘feature templates’ used in classifiers. In my opinion,
pattern-based learning approaches learn good instantiations of the feature templates. Sur-
deanu et al. (2006) proposed a co-training-based algorithm that used text categorization
along with pattern extraction, starting with seed sets.
Roth and Klakow (2013) used patterns in their system on combining generative and
discriminative relation extraction approaches. Angeli et al. (2014) used learned dependency
patterns, along with a machine learning-based MIML-RE approach (Surdeanu et al., 2012),
to predict relations between two entities.
More recently, DeepDive (Niu et al., 2012) has shown promising results on distantly
supervised relation extraction (Angeli et al., 2014) by using fast inference in Markov logic
networks. Govindaraju et al. (2013) and Zhang et al. (2013) used DeepDive on the task of
extracting structured information like tables from text.
3.3.1 Open IE systems
Open IE, a popular task in recent years, is geared towards learning generic, domain-independent
extractors. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni et
al., 2005) used components such as list extractors, generic and domain specific pattern
learning, and subclass learning. They learned domain-specific patterns using a seed set.
Never-Ending Language Learning (NELL) system (Carlson et al., 2010a) learned multiple
semantic types using coupled semi-supervised training from web-scale data, which is not
feasible for all datasets and entity learning tasks.
Open-IE relation extraction systems like ReVerb (Fader et al., 2011) and OLLIE (Mausam
et al., 2012) learn domain-independent generic relation extractors for web data. However,
using them for a specific domain with a moderately sized corpus leads to poor results. I
tested learning an entity extractor for a given class using ReVerb. I labeled the binary and
unary ReVerb extractions using the class seed entities and retrained its confidence function,
with poor results. Poon and Domingos (2010) found a similar result for inducing a proba-
bilistic ontology: an open information extraction system extracted low accuracy relational
triples on a small corpus.
There has been some work to map generic Open IE extractions to learn extractors for
specific relations. Soderland et al. (2013) manually wrote rules to map Open IE extrac-
tions to TAC-KBP Slot Filling relations in under 3 hours and achieved reasonable perfor-
mance. Improving pattern-based learning systems would also improve the hybrid systems
described above.
Overall, even though pattern-based approaches have been less popular in recent
years than feature-based sequence models, owing to broader trends in the
field, they have been shown to be successful at both supervised and bootstrapped entity
learning. The hybrid systems, which have become popular in the last few years, usually
have a pattern learning component. Improving pattern learning would presumably also im-
prove the performance of hybrid systems. In Chapters 4 and 5, I apply the bootstrapped
pattern-based learning approach to two new problems and domains. The results show that
they are very effective at expanding seed sets of entities in these domains. One key com-
ponent missing from the previous systems is the utilization of unlabeled data beyond the
matching of patterns to text. For example, when scoring patterns, unlabeled entities ex-
tracted by patterns are either considered negative or are ignored. In Chapters 6 and 7, I
propose improvements to bootstrapped pattern-based learning systems that leverage unla-
beled data in a better way.
Chapter 4
Studying Scientific Articles andCommunities
In this chapter, I present how to study the influence of sub-communities of a research
community by extracting key aspects of the articles published. I examine the computational
linguistics community as a case-study. I use bootstrapped pattern learning to extract the
key aspects, starting with only a few seed dependency patterns as supervision. The content
of this chapter is drawn from Gupta and Manning (2011).
4.1 Introduction
The evolution of ideas and the dynamics of a research community can be studied using
the scientific articles published by the community. For instance, we may be interested in
how methods spread from one community to another, or the evolution of a topic from a
focus of research to a problem-solving tool. We might want to find the balance between
technique-driven and domain-driven research within a field. Such rich insight into the
development and progress of scientific research requires an understanding of more than
just “topics” of discussion or citation links between articles. As an example, to determine
whether technique-driven researchers have greater or lesser impact, we need to be able to
identify styles of work. To achieve this level of detail and to be able to connect together
how methods and ideas are being pursued, it is essential to move beyond bag-of-words
CHAPTER 4. STUDYING SCIENTIFIC ARTICLES AND COMMUNITIES 38
topical models. This requires an understanding of sentence and argument structure, and is
therefore a form of information extraction, if of a looser form than the relation extraction
methods that have typically been studied.
To study the application domains, the techniques used to approach the domain prob-
lems, and the focus of scientific articles in a community, I propose to extract the following
concepts from the articles:
FOCUS: an article’s main contribution.
TECHNIQUE: a method or a tool used in an article, for example, expectation maximization and conditional random fields.
DOMAIN: an article’s application domain, such as speech recognition and classification of documents.
For example, if an article concentrates on regularization in support vector machines and
shows improvement in parsing accuracy, then its FOCUS and TECHNIQUE are regularization
and support vector machines, and its DOMAIN is parsing. In contrast, an article that focuses on lexical features to improve parsing accuracy and uses support vector machines to train the model has lexical features and parsing as its FOCUS, lexical features and support vector machines as its TECHNIQUE, and parsing, again, as its DOMAIN.1 In this case, even though
TECHNIQUEs and DOMAIN of both papers are very similar, the FOCUS phrases distinguish
them from each other. Note that a DOMAIN of one article can be a TECHNIQUE of another,
and vice versa. For example, an article that shows improvements in named entity recognition (NER) has NER as its DOMAIN, and an article that uses named entities as an intermediary tool to extract relations has NER as one of its TECHNIQUEs.
I use dependency patterns to extract the above three categories of phrases from arti-
cles, which can then be used to study the influence of communities on each other. The
phrases are extracted by matching semantic (dependency) patterns in dependency trees of
sentences. The input to the extraction system is a set of seed patterns (see Table 4.1 for examples), and the system learns more patterns using a bootstrapping approach similar to the one described in Chapter 2.
1A community vs. a DOMAIN: a community can be as broad as computer science or statistics, whereas a DOMAIN is a specific application such as Chinese word segmentation.
As a case study, I examine the computational linguistics community and consider the
influence of its sub-fields such as parsing and machine translation. For the study, I use the
document collection from the ACL Anthology Network and the ACL Anthology Reference
corpus (Bird et al., 2008; Radev et al., 2009). To get the sub-fields of the community, I use
latent Dirichlet allocation (LDA) (Blei et al., 2003) to find topics and label them by hand.2
However, our general approach can be used to study any case of the influence of academic
communities, including looking more broadly at the influence of statistics or economics
across the social sciences.
Using the approach, I study how communities influence each other in terms of tech-
niques that are reused, and show how some communities ‘mature’ so that the results they
produce get adopted as tools for solving other problems. For example, the products of the
part-of-speech tagging community have been adopted by many other communities. This is
evidenced by many papers that use part-of-speech tagging as an intermediary step to solve
other problems. Overall, our results show that speech recognition and probability theory
have been the most influential fields in the last two decades, since many communities now
use the techniques introduced by papers in those communities. Probability theory, unlike
speech recognition, is not a sub-field of computational linguistics, but it is an important
topic since many papers use and work on probabilistic approaches.
I also show the timeline of influence of communities. For example, the results show
that formal computational semantics and unification-based grammars had a lot of influence
in the late 1980s. The speech recognition and probability theory fields showed an upward trend of influence in the mid-1990s, and even though their influence has decreased in recent years,3 they still have a strong influence on recent papers, mainly due to techniques like expectation maximization and hidden Markov models.
2In this chapter, I use the terms communities, sub-communities, and sub-fields interchangeably.
3Speech recognition has recently made a comeback with advances using deep learning approaches.
Contributions I introduce a new categorization of key aspects of scientific articles, which is (1) FOCUS: main contribution, (2) TECHNIQUE: method or tool used, and (3)
DOMAIN: application domain. I extract them by matching dependency patterns to dependency trees, and learn patterns using bootstrapping. I present a new definition of the influence of one research community on another, and present a case study on the computational linguistics community, both for verifying the results of our system and for showing novel results for the dynamics and the overall influence of computational linguistics sub-fields. I introduce a dataset of abstracts labeled with the novel categories, available at
http://nlp.stanford.edu/pubs/FTDDataset_v1.txt for the research com-
munity.
4.2 Related Work: Scientific Study
While there is some connection to keyphrase selection in text summarization (Radev et al.,
2002), extracting FOCUS, TECHNIQUE and DOMAIN phrases is fundamentally a form of
information extraction, and there has been a wide variety of prior work in this area. Some work, including the seminal work of Hearst (1992), identified IS-A relations using hand-written patterns, while other work has learned patterns over dependency graphs (Bunescu and Mooney, 2005). For more related work on pattern-based systems, see Chapter 3.
Topic models have been used to study the history of ideas (Hall et al., 2008) and schol-
arly impact of papers (Gerrish and Blei, 2010). However, topic models do not extract
detailed information from text as we do. Still, I use topic-to-word distributions from topic
models as a way of describing sub-fields.
Demner-Fushman and Lin (2007) used hand-written knowledge extractors to extract information, such as population and intervention, in their clinical question-answering system to improve the ranking of relevant abstracts. Our categorization of key aspects is applicable to a broader range of communities, and we learn the patterns by bootstrapping. Li et al. (2010) used semantic metadata to create a semantic digital library for chemistry. They applied machine learning techniques to identify experimental paragraphs using keyword
features. Xu et al. (2006) and Ruch et al. (2007) proposed systems, in the clinical-trials
and biomedical domain, respectively, to classify sentences of abstracts corresponding to
categories such as introduction, purpose, method, results and conclusion to improve article
FOCUS: present → (direct object); work → (preposition on); propose → (direct object)
TECHNIQUE: using → (direct object); apply → (direct object); extend → (direct object)
DOMAIN: system → (preposition for); task → (preposition of); framework → (preposition for)

Table 4.1: Some examples of dependency patterns that extract information from dependency trees of sentences. A pattern is of the form T → (d), where T is the trigger word and d is the dependency that the trigger word’s node has with its successor.
retrieval by using either structured abstracts4 or hand-labeled sentences. Some summarization systems also use machine learning approaches to find ‘key sentences’. The systems built in these papers are complementary to ours, since one can first find relevant paragraphs or sentences and then extract the key aspects from them. Note that a sentence can have multiple phrases corresponding to our three categories, and thus classifying whole sentences will not give comparable results.
4.3 Approach
In this section, I explain how to extract phrases for each of the three categories (FOCUS,
TECHNIQUE and DOMAIN) and how to compute the influence of communities.
4.3.1 Extraction
From an article’s abstract and title, I use the dependency trees of sentences and a set of
semantic dependency extraction patterns to extract phrases in each of the three categories.
More details on the dependency patterns and trees are in Chapter 2. Figure 2.1 shows an
example of matching two dependency patterns to a dependency tree. I start with a few
4Structured abstracts, which are used by some journals, have multiple sections such as PURPOSE and METHOD.
handwritten patterns and learn more patterns using a bootstrapping approach. Table 4.1
shows some seed patterns.
To learn more patterns automatically, I run an iterative algorithm that extracts phrases
using semantic patterns, and then learns new patterns from the extracted phrases. Section
2.3 gives a detailed overview of a bootstrapped pattern-based learning approach. Here,
the seed supervision is provided in terms of seed patterns, instead of seed entities. More
specific details of each step are described below.
Extracting Phrases from Patterns
The pattern matching is the same as described in Section 2.2.2. To increase the flexibility of pattern matching, when matching the dependency edge, I consider dependents and grand-dependents up to 4 levels deep. I have special rules for paper titles. I label the whole title as FOCUS if we are not able to extract a FOCUS phrase using the patterns, as authors usually state the main contribution of the paper in the title. For titles from which we can extract a TECHNIQUE phrase, I label the rest of the words (except for trigger words) with DOMAIN.
For example, for the title ‘Studying the history of ideas using topic models’, our system
extracts ‘topic models’ as TECHNIQUE, and then labels ‘Studying the history of ideas’ as
DOMAIN.
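The title-handling rules just described can be sketched in a few lines. The following is a minimal illustration, not the actual system: `label_title` and the `extract_phrases` callback are hypothetical stand-ins for the real pattern matcher.

```python
def label_title(title, extract_phrases):
    """Apply the special title rules (a sketch). extract_phrases(title) is a
    hypothetical stand-in for pattern-based extraction; it returns a dict
    {category: [phrases]} plus the set of trigger words that matched."""
    extracted, triggers = extract_phrases(title)
    if not extracted.get("FOCUS"):
        # Authors usually state the main contribution in the title,
        # so fall back to labeling the whole title as FOCUS.
        extracted.setdefault("FOCUS", []).append(title)
    if extracted.get("TECHNIQUE"):
        # The rest of the title, minus TECHNIQUE words and trigger words,
        # is labeled DOMAIN.
        technique_words = {w for p in extracted["TECHNIQUE"] for w in p.split()}
        rest = [w for w in title.split()
                if w not in technique_words and w not in triggers]
        if rest:
            extracted.setdefault("DOMAIN", []).append(" ".join(rest))
    return extracted
```

For the title ‘Studying the history of ideas using topic models’, with ‘topic models’ extracted as TECHNIQUE and ‘using’ as the trigger, this yields ‘Studying the history of ideas’ as DOMAIN, matching the example in the text.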
Learning Patterns from Phrases
After extracting phrases with patterns, we want to construct and learn new patterns. For each sentence whose dependency tree has a subtree corresponding to one of the
extracted phrases, I construct a pattern T → (d) by considering the ancestor (parent or
grandparent) of the subtree as the trigger word T , and the dependency between the head
of the subtree and its parent as the dependency d. For each category, I weight the patterns
depending on the categories of phrases from which they are derived. The weighting method
is as follows: for a set of phrases P that extract a pattern q, the weight of the pattern q for the category FOCUS is

Σ_{p ∈ P} (1/z_p) · count(p ∈ FOCUS),

where z_p is the total frequency of the phrase p. Similarly, I get the weights of the pattern for the other two categories. Note that we do not need smoothing since the phrase-category ratios are aggregated over all the
phrases from which the pattern is constructed. After weighting all the patterns that have
not been selected in the previous iterations, I select the top k patterns in each category (k=2
in our experiments). Table 4.3 shows some patterns learned through the iterative method.
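The weighting and selection step can be sketched as follows. This is a simplified illustration under assumed data structures (the real system stores the counts differently): `category_counts[p][cat]` holds how often phrase p was extracted for each category, and `total_freq[p]` is z_p.

```python
def pattern_weight(pattern_phrases, category_counts, total_freq, category):
    """Weight of a pattern for one category: sum over the phrases P from
    which the pattern was constructed of count(p in category) / z_p."""
    return sum(category_counts[p].get(category, 0) / total_freq[p]
               for p in pattern_phrases)

def top_k_patterns(candidates, k=2):
    """Select the k highest-weighted new patterns for a category each
    iteration (k = 2 in the experiments). candidates maps pattern -> weight."""
    return sorted(candidates, key=candidates.get, reverse=True)[:k]
```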
4.3.2 Communities and their Influence
I define communities as fields or sub-fields that one wishes to study. To study communities
using the articles published, one needs to know which communities each article belongs to.
The article-to-community assignment can be computed in several ways, such as by manual
assignment, using metadata, or by text categorization of papers. In our case study, I use the
topics formed by applying latent Dirichlet allocation (Blei et al., 2003) to the text of the
papers by considering each topic as one community. In recent years, topic modeling has
been widely used to get ‘concepts’ from text; it has the advantage of defining communities
and soft, probabilistic article-to-community assignment scores in an unsupervised manner.
I combine these soft assignment scores with the phrases extracted in the previous section
to score a phrase for each community and category as follows. The score of a phrase p,
which is extracted from an article a, for a community c and the category TECHNIQUE is
calculated as
techScore(c, p, a) = (1/z_p) · count(p ∈ TECHNIQUE | a) · P(c | a; θ)        (4.1)
where the function P(c | a; θ) gives the probability of a community c (i.e., a topic) for an article a given the topic modeling parameters θ. The normalization constant for the phrase, z_p, is the frequency of the phrase in all the abstracts. In the rest of the section, I use a_i’s for articles, c_i’s for communities, and y’s for years.
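Equation 4.1 is straightforward to compute once the counts and topic-model posteriors are available. A minimal sketch follows; the 0.1 cut-off on P(c | a; θ) reflects the experimental setup described later in this chapter.

```python
def tech_score(phrase_count_in_article, z_p, p_topic_given_article,
               threshold=0.1):
    """techScore(c, p, a) from Equation 4.1 (a sketch).
    phrase_count_in_article: count(p in TECHNIQUE | a).
    z_p: frequency of the phrase across all abstracts.
    p_topic_given_article: P(c | a; theta) from the topic model; values
    below the threshold are zeroed out, as in the experiments."""
    if p_topic_given_article < threshold:
        return 0.0
    return (phrase_count_in_article / z_p) * p_topic_given_article
```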
I define influence such that communities receive higher scores if they use techniques earlier than other communities do, or if they produce tools that are used to solve other problems. For example, since hidden Markov models, introduced by the speech recognition community, and part-of-speech tagging tools, built by the part-of-speech tagging community, have been widely used as techniques in other communities, these communities should receive higher scores compared to nascent or not-so-widely-used ones.
based on the number of times its FOCUS, TECHNIQUE or DOMAIN phrases have been used
as a TECHNIQUE in other communities. To calculate the overall influence of one community on another, we first need to calculate the influence due to individual articles in the community, as follows. The influence of community c_1 on another community c_2 due to a phrase p extracted from an article a_1 is
techInfl(c_1, c_2, p, a_1) = allScore(c_1, p, a_1) · Σ_{a_2 ∈ D, y_{a_2} > y_{a_1}} techScore(c_2, p, a_2) · C(a_2, a_1)        (4.2)
where the function allScore(c, p, a) is computed the same way as in Equation 4.1 but by
using count(p ∈ ALL | a), where ALL means the union of phrases extracted in all three
categories. The variable D is the set of all articles, and y_{a_2} is the year of publication of the article a_2. The function C(a_2, a_1) is a weighting function based on citations, whose value is 1 if a_2 cites a_1, and λ otherwise. If λ is 0, then the system calculates influence based on citations alone, which can be noisy and incomplete. In the experiments, I set λ to 0.5, since we want to study influence even when an article does not explicitly cite another article.
The function allScore measures how often a phrase is used by a community.
Thus, the technique-influence score of community c_1 on community c_2 in a particular year y is computed by summing the previous equation over all phrases in P and over all articles in D published in year y:

techInfl(c_1, c_2, y) = Σ_{p ∈ P} Σ_{a_1 ∈ D, y_{a_1} = y} techInfl(c_1, c_2, p, a_1)        (4.3)
where P is the set of all phrases.
Straightforwardly, the overall influence of community c_1 on community c_2 is calculated as
Paper title: Studying the history of ideas using topic models
  FOCUS: studying the history of ideas using topic
  TECHNIQUE: latent dirichlet allocation; topic; topic; unsupervised topic; historical trends; that all three conferences are converging in the topics
  DOMAIN: studying the history of ideas; topic; model of the diversity of ideas, topic entropy; probabilistic

Paper title: A Bayesian Hybrid Method For Context-Sensitive Spelling Correction
  FOCUS: new hybrid method, based on bayesian classifiers; bayesian hybrid method for context-sensitive spelling correction
  TECHNIQUE: decision lists; bayesian; bayesian classifiers; ambiguous; part-of-speech tags; methods using decision lists; single strongest piece of evidence; spelling
  DOMAIN: context-sensitive spelling correction; for context-sensitive spelling correction; spelling

Table 4.2: Extracted phrases for some papers. The word ‘model’ is missing from the end of some phrases as it was removed during post-processing.
techInfl(c_1, c_2) = Σ_y techInfl(c_1, c_2, y)        (4.4)
The overall influence of a community c_1 is then calculated as
techInfl(c_1) = Σ_{c_2 ≠ c_1} techInfl(c_1, c_2)        (4.5)
Next, I present a case study over the sub-fields of computational linguistics using the
influence scores described above.
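The influence computations in Equations 4.2 and 4.5 can be sketched as follows. This is a simplified illustration: `all_score` and `tech_score` are passed in as callables (their implementations mirror Equation 4.1), and the pairwise scores for Equation 4.5 are assumed precomputed via Equations 4.3 and 4.4.

```python
def citation_weight(a2_cites_a1, lam=0.5):
    """C(a2, a1): 1 if a2 cites a1, and lambda otherwise (lambda = 0.5
    in the experiments)."""
    return 1.0 if a2_cites_a1 else lam

def tech_influence(c1, c2, p, a1, articles, year, cites,
                   all_score, tech_score):
    """Equation 4.2: influence of community c1 on c2 via phrase p from
    article a1. year[a] is the publication year; cites(a2, a1) is True if
    a2 cites a1. Only articles published after a1 contribute."""
    later = (a2 for a2 in articles if year[a2] > year[a1])
    return all_score(c1, p, a1) * sum(
        tech_score(c2, p, a2) * citation_weight(cites(a2, a1))
        for a2 in later)

def overall_influence(c1, communities, tech_infl_by_pair):
    """Equation 4.5: sum the pairwise influence of c1 over all other
    communities; tech_infl_by_pair[(c1, c2)] is assumed precomputed."""
    return sum(tech_infl_by_pair.get((c1, c2), 0.0)
               for c2 in communities if c2 != c1)
```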
4.4 Experiments
I studied the computational linguistics community from 1965 to 2009 using titles and ab-
stracts of 15,016 articles in the ACL Anthology5 dataset (Bird et al., 2008; Radev et al.,
2009), since it has full text of papers available. I use the full text of papers to build a topic
model. I found 52 pairs of abstracts that had more than 80% of words in common with
5http://www.aclweb.org/anthology
each other; I ignored the influence of the earlier-published paper on the later-published pa-
per in the pairs while calculating the influence scores, because double publishing the same
research presumably does not indicate influence.
When extracting phrases from the matched phrase trees, I ignored tokens whose part-of-speech tags were pronoun, number, determiner, punctuation, or symbol, and removed all subtrees in the matched phrase trees that had either a relative-clause-modifier or a clausal-complement dependency with their parents since, even though we want full phrases, including these subtrees introduces extraneous phrases and clauses. I also added phrases from the subtrees of the matched phrase trees to the set of extracted phrases.
I used 13 seed hand-written patterns for FOCUS, 7 for TECHNIQUE, and 15 for DOMAIN. When constructing a new pattern for learning, I ignored ancestors that were not a noun or a verb, since most trigger words are nouns or verbs (such as use and constraints). I also ignored the conjunction, relative-clause-modifier, dependent (the most generic dependency), quantifier-modifier, and abbreviation dependencies6 since they are either too generic or introduce extraneous phrases and clauses.
Learning new patterns did not help in improving the FOCUS category phrases when tested over a hand-labeled test set. The FOCUS category already got relatively high scores when using just the seed patterns and the titles, and hence learning new patterns reduced precision without any significant improvement in recall. Thus, I learned new patterns only for the TECHNIQUE and DOMAIN categories. I ran 50 iterations for both categories. After extracting all the phrases, I removed common phrases that are frequently used in scientific articles, such as ‘this technique’ and ‘the presence of’, using a stop-words list: a set of 3,000 phrases created by taking the most frequent 1-to-3-grams from 100,000 random articles that have an abstract in the ISI Web of Knowledge database.7 I ignored phrases that were either one character or more than 15 tokens long.
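The stop-phrase construction and the length filters can be sketched as follows. This is a minimal illustration; the real list was built from 100,000 random ISI Web of Knowledge abstracts, and the tokenization here is a naive whitespace split.

```python
from collections import Counter

def build_stop_phrases(abstracts, max_n=3, top_k=3000):
    """Build a stop-phrase set from the most frequent 1-to-3-grams
    across a corpus of abstracts (a sketch)."""
    counts = Counter()
    for text in abstracts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {phrase for phrase, _ in counts.most_common(top_k)}

def filter_phrases(phrases, stop_phrases, min_len=2, max_tokens=15):
    """Drop stop phrases, one-character phrases, and phrases longer
    than 15 tokens, as described above."""
    return [p for p in phrases
            if p.lower() not in stop_phrases
            and len(p) >= min_len
            and len(p.split()) <= max_tokens]
```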
In a step towards finding canonical names, I automatically detected abbreviations and their expanded forms from the full text of papers by searching for text between parentheses, and considered the phrase before the parentheses as the expanded form (similar
6See De Marneffe et al. (2006) for details of these dependencies.
7www.isiknowledge.com
TECHNIQUE: model → (nn); rules → (nn); extracting → (direct-object); identify → (direct-object); constraints → (amod); based → (preposition on)
DOMAIN: improve → (direct-object); used → (preposition for); evaluation → (nn); parsing → (nn); domain → (nn); applied → (preposition to)

Table 4.3: Examples of patterns learned using the iterative extraction algorithm. The dependency ‘nn’ is the noun compound modifier dependency.
to Schwartz and Hearst (2003)). I got a high-precision list by picking the most frequently occurring pairs of abbreviations and their expanded forms, and created groups of phrases by merging all the phrases that use the same abbreviation. I then changed all the phrases in the extracted-phrases dataset to their canonical names. I also removed the words ‘model’, ‘approach’, ‘method’, ‘algorithm’, ‘based’, and ‘style’ and their variants when they occurred at the end of a phrase.
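A rough sketch of the parenthesis-based abbreviation spotting follows. This is a simplified variant in the spirit of Schwartz and Hearst (2003), not their exact algorithm: it takes the text inside parentheses as a candidate short form and keeps the pair only when the short form's letters match the initials of the preceding words.

```python
import re

def find_abbreviations(text):
    """Find (long form, abbreviation) pairs by matching up to 8 words
    before a parenthesized token against the token's letters."""
    pairs = []
    for m in re.finditer(r'((?:\w+[ -]){1,8})\(([A-Za-z]{2,10})\)', text):
        before, abbrev = m.group(1), m.group(2)
        words = before.strip().replace('-', ' ').split()
        n = len(abbrev)
        if len(words) >= n:
            candidate = words[-n:]
            # Keep the pair only if word initials spell the abbreviation.
            if all(w[0].lower() == ch.lower()
                   for w, ch in zip(candidate, abbrev)):
                pairs.append((" ".join(candidate), abbrev))
    return pairs
```

For example, "improvements in named entity recognition (NER)" yields the pair ("named entity recognition", "NER").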
To get communities in the computational linguistics literature, I considered the topics
generated using the same ACL Anthology dataset by Bethard and Jurafsky (2010) as com-
munities. They ran latent Dirichlet allocation on the full text of the papers to get 100 topics.
With help from two computational linguistics experts, I hand-labeled the topics and used 72 of them in my study; the rest were about common words. When calculating the scores in Equation 4.1, I considered the value of P(c | a; θ) to be zero if it was less than 0.1.
4.5 Results
The total numbers of phrases extracted were 25,525 for FOCUS, 24,430 for TECHNIQUE, and 33,203 for DOMAIN. The total numbers of phrases after including the phrases extracted from subtrees of the matched phrase trees were 64,041, 38,220, and 46,771, respectively.
Examples of phrases extracted from some papers are shown in Table 4.2.
Approach                      F1      Precision   Recall
FOCUS
  Baseline tf-idf NPs         35.60   24.36       66.07
  Seed Patterns               55.29   44.67       72.54
  Inter-Annotator Agreement   53.33   50.80       56.14
TECHNIQUE
  Baseline tf-idf NPs         26.65   17.87       52.41
  Seed Patterns               20.09   23.46       21.72
  Iteration 50                36.86   30.46       46.68
  Inter-Annotator Agreement   72.02   66.81       78.11
DOMAIN
  Baseline tf-idf NPs         30.13   19.90       62.03
  Seed Patterns               25.27   30.55       26.29
  Iteration 50                37.29   27.60       57.50
  Inter-Annotator Agreement   72.31   75.58       69.32

Table 4.4: The precision, recall, and F1 scores of each category for the different approaches. Note that the inter-annotator agreement is calculated on a smaller set.
For testing, I hand-labeled 474 abstracts with the three categories to measure the precision and recall scores. For each abstract and each category, I compared the unique non-stop-words extracted by my algorithm to the hand-labeled dataset. I calculated precision and recall measures for each abstract and then averaged them to get the results for the dataset. To compare against a non-information-extraction baseline, I extracted all noun phrases (and sub-trees of the noun phrase trees) from the abstracts and labeled them with all three categories. In addition, I labeled the titles (and their sub-trees) with the category FOCUS. I then scored the phrases with a tf-idf-inspired measure, the ratio of the frequency of the phrase in the abstract to the sum of the total frequencies of the individual words, and removed phrases whose tf-idf measure was less than 0.001 (the best threshold out of many tried). I call this approach ‘Baseline tf-idf NPs’.
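The per-abstract evaluation described above can be sketched as follows. This is a simplified version for a single category: each abstract contributes word sets of predicted and gold non-stop-words, precision and recall are computed per abstract and then averaged, and F1 is taken over the averaged scores.

```python
def per_abstract_scores(predicted, gold, stop_words=frozenset()):
    """Average per-abstract precision/recall over unique non-stop-words.
    predicted/gold: lists of word sets, one entry per abstract (a sketch)."""
    precisions, recalls = [], []
    for pred, ref in zip(predicted, gold):
        pred = {w for w in pred if w not in stop_words}
        ref = {w for w in ref if w not in stop_words}
        if not pred or not ref:
            continue
        overlap = len(pred & ref)
        precisions.append(overlap / len(pred))
        recalls.append(overlap / len(ref))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```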
Table 4.4 compares precision, recall and micro-averaged F1 scores for the three cat-
egories when we use: (1) only the seed patterns, (2) the combined set of learned and
seed patterns, and (3) the baseline. I also calculated inter-annotator agreement for 30 ab-
stracts, where each abstract was labeled by 2 annotators,8 and the precision-recall scores
8I annotated all 30 abstracts, and two other doctoral candidates in computational linguistics annotated 15 each.
Figure 4.1: The F1 scores for TECHNIQUE and DOMAIN categories after every five iterations. For reasons explained in the text, I do not learn new patterns for FOCUS.
were calculated by randomly choosing one annotation as gold and another as predicted for
each article. We can see that both precision and recall scores increase for TECHNIQUE
because of the learned patterns, though for DOMAIN, precision decreases but recall increases. The recall scores for the baseline are higher, as expected, but the precision is very low. Three possible reasons explain the mistakes made by our system: (1) authors sometimes use generic phrases to describe their system, which are not annotated with any of the three categories in the test set but are extracted by the system (such as ‘We use a simple method . . . ’, ‘We propose a faster model . . . ’, ‘This paper presents a new approach to . . . ’); (2) the dependency trees of some sentences are wrong; and (3) some of the patterns learned for TECHNIQUE and DOMAIN were low-precision but high-recall; for example, [based → (preposition on)] was learned as a TECHNIQUE pattern. The first problem, erroneous extraction of generic phrases, could perhaps be reduced by allowing restrictions on the target of the dependency or by disallowing certain kinds of generic positive terms like ‘simple’, ‘new’, and ‘faster’. Figure 4.1 shows the F1 scores for TECHNIQUE and DOMAIN
after every 5 iterations.
Speech Recognition (score: 1.35)
  Representative words: recognition, acoustic, error, speaker, rate, adaptation, recognizer, vocabulary, phone
  Most influential phrases: expectation maximization; hidden markov; language; contextually; segment; context independent phone; snn hidden markov; n gram back off language; multiple reference speakers; cepstral; phoneme; least squares; speech recognition; intra; hi gram; bu; word dependent; tree structured; statistical decision trees

Probability Theory (score: 1.31)
  Representative words: probability, probabilities, distribution, probabilistic, estimation, estimate, entropy, statistical, likelihood, parameters
  Most influential phrases: hidden markov; maximum entropy; language; expectation maximization; merging; expectation maximization hidden markov; natural language; variable memory markov; standard hidden markov; part of speech; inside outside; segmentation only; minimum description length principle; continuous density hidden markov; part of speech information; forward backward

Bilingual Word Alignment (score: 1.2)
  Representative words: alignment, alignments, aligned, pairs, align, pair, statistical, parallel, source, target, links, brown, ibm, null
  Most influential phrases: hidden markov; expectation maximization; maximum entropy; spectral clustering; statistical alignment; conditional random fields, a discriminative; statistical word alignment; string to tree; state of the art statistical machine translation system; single word; synchronous context free grammar; inversion transduction grammar; ensemble; novel reordering

POS Tagging (score: 1.13)
  Representative words: tag, tagging, pos, tags, tagger, part-of-speech, tagged, unknown, accuracy, part, taggers, brill, corpora, tagset
  Most influential phrases: maximum entropy; machine learning; expectation maximization hidden markov; part of speech information; decision tree; hidden markov; transformation based error driven learning; entropy; part of speech tagging; part of speech; variable memory markov; viterbi; second stage classifiers; document; wide coverage lexicon; using inductive logic programming

Machine Learning Classification (score: 1.12)
  Representative words: classification, classifier, examples, classifiers, kernel, class, svm, accuracy, decision, methods, labeled, vector, instances
  Most influential phrases: support vector machines; ensemble; machine learning; gaussian mixture; expectation maximization; flat; weak classifiers; statistical machine learning; lexicalized tree adjoining grammar based features; natural language processing; standard text categorization collection; pca; semisupervised learning; standard hidden markov; supervised learning

Table 4.5: The top 5 influential communities with their most influential phrases. For each community, the representative words are the top words that describe the community obtained from the topic model, and the most influential phrases are those that have been widely used as techniques. The score is computed by Equation 4.5.
Statistical Parsing (score: 0.92)
  Representative words: parse, treebank, trees, parses, penn, collins, parsers, charniak, accuracy, wsj, head, statistical, constituent, constituents
  Most influential phrases: propbank; expectation maximization; supervised machine learning; maximum entropy classifier; ensemble; lexicalized tree adjoining grammar based features; neural network; generative probability; incomplete constituents; part of speech tagging; treebank; penn; 50 best parses; lexical functional grammar; maximum entropy; full comlex resource

Statistical Machine Translation (More-Phrase-Based) (score: 0.82)
  Representative words: bleu, statistical, source, target, phrases, smt, reordering, translations, phrase-based
  Most influential phrases: maximum entropy; hidden markov; expectation maximization; language; linguistically structured; ihmm; cross language information retrieval; ter; factored language; billion word; hierarchical phrases; string to tree; state of the art statistical machine translation system; statistical alignment; ist inversion transduction grammar; bleu as a metric; statistical machine translation

Parsing (score: 0.81)
  Representative words: grammars, parse, chart, context-free, edge, edges, production, symbols, symbol
  Most influential phrases: natural language processing; expectation maximization; natural language; inside outside; rule; macro; various filtering strategies; tomita's parser; forward backward; phrase structure; synchronous context free grammars; cky; termination properties; extraposition grammars

Chunking/Memory Based Models (score: 0.8)
  Representative words: chunk, chunking, chunks, pos, accuracy, best, memory-based, daelemans, van, base
  Most influential phrases: state of the art machine learning; conditional random fields; support vector machines; machine learning; using hidden markov; maximum entropy; memory based learning; hidden markov; standard hidden markov; second stage classifiers; weak classifiers; flat; conll 2004; iob; probabilities output; high recall

Discriminative Sequence Models (score: 0.72)
  Representative words: label, conditional, sequence, random, labels, discriminative, inference, crf, fields, labeling
  Most influential phrases: conditional random fields; ensemble; maximum entropy; maximum entropy; conditional random fields, a discriminative; large margin; perceptron; hidden markov; generalized perceptron; pseudo negative examples; natural language processing; entropy; singer; latent variable; character level; named entity

Table 4.6: The next 5 influential communities with their most influential phrases. For each community, the representative words are the top words that describe the community obtained from the topic model, and the most influential phrases are those that have been widely used as techniques. The score is computed by Equation 4.5.
Figure 4.2: The influence scores of communities in each year.
Named Entity Recognition
  Influenced most by (descending order): Chunking/Memory Based Models; Discriminative Sequence Models; POS Tagging; Machine Learning Classification; Coherence Relations; Biomedical NER; Bilingual Word Alignment
Statistical Parsing
  Influenced most by (descending order): Probability Theory; POS Tagging; Discriminative Sequence Models; Speech Recognition; Parsing; Syntactic Theory; Clustering + Distributional Similarity; Chunking/Memory Based Models
Word Sense Disambiguation
  Influenced most by (descending order): Clustering + Distributional Similarity; Machine Learning Classification; Dictionary Lexicons; Collocations/Compounds; Syntax; Speech Recognition; Probability Theory

Table 4.7: Each community listed above has been influenced the most by the communities following it. The scores are calculated using Equation 4.4.
Influence
Tables 4.5 and 4.6 show the most influential communities overall and their respective influential phrases that have been widely adopted as techniques by other communities. The tables also show the score of each community calculated using Equation 4.5. We can see that
Figure 4.3: The popularity of communities in each year. It is measured by summing up the article-to-topic scores for the articles published in that year (see Hall et al. (2008)). The scores are smoothed with weighted scores of the 2 previous and 2 next years, and L1-normalized for each year. The scores are lower for all communities in the late 2000s since the probability mass is more evenly distributed among many communities. Contrast the relative popularity of the communities with their relative influence shown in Figure 4.2.
speech recognition is the most influential community because of techniques like hidden Markov models and other stochastic methods it introduced into the computational linguistics literature. This shows that its long-term seeding influence is still present despite its limited popularity around the 2000s. Probability theory also gets a high score since many papers in the last decade have used stochastic methods. The part-of-speech tagging and parsing communities get high scores because they adopted some techniques that are also used in other communities, and because other communities use part-of-speech tagging and parsing as intermediary steps for solving other problems.
Figure 4.2 shows the change in a community's influence over time. The scores are normalized such that the scores for all communities in a year sum to one. Compare the relative scores of communities in the figure with the relative scores in Figure 4.3, which
Figure 4.4: The influence scores of machine translation related communities. The statistical machine translation community, which is a topic from the topic model, is more phrase-based.
shows the sum of all article-to-topic scores for each community for articles published in a given year, normalized the same way as before. There is a huge spike for the Speech Recognition community for the years 1989–1994. Hall et al. (2008) note, “These years correspond exactly to the DARPA Speech and Natural Language Workshop, held at different locations from 1989–1994. That workshop contained a significant amount of speech until its last year (1994), and then it was revived in 2001 as the Human Language Technology workshop with a much smaller emphasis on speech processing.” See Hall et al. (2008) for more analysis. Note that their analysis uses just bag-of-words-based topic models.
Comparing Figures 4.2 and 4.3, we can see that the influence of a community is different from its popularity in a given year. As mentioned before, although the influence score for speech recognition declined during 1997–2009, the community still has a lot of influence, even though its popularity in recent years is very low. Machine learning classification has been both popular and influential in recent years. Figures 4.4 and 4.5 compare the machine translation communities in the same way as we compare other communities in Figures 4.2 and 4.3. We can see that the statistical machine translation (more phrase-based) community's popularity increased steeply from late 2002 to 2009,
Figure 4.5: Popularity of machine translation communities in each year. The statistical machine translation community, which is a topic from the topic model, is more phrase-based. Contrast the relative popularity scores with the relative influence scores shown in Figure 4.4.
however, its influence has increased at a slower rate. On the other hand, the influence of bilingual word alignment (the most influential community in 2009) has increased during the same period, mainly because of its influence on statistical machine translation. The influence of non-statistical machine translation has been decreasing recently, though more slowly than its popularity. Table 4.7 shows the communities that have the most influence on a given community (the list is in descending order of scores by Equation 4.4).
Comparison with Supervised CRF
In this section, I present an experiment performed after Gupta and Manning (2011) was published. To compare the BPL approach with dependency patterns against a supervised CRF, I divided the labeled examples used as the test set into two halves. One half was reserved for training a CRF model and the other was used to test both BPL and CRF. Note that the supervision provided to BPL and CRF is very different. BPL did not have access to the fully labeled abstracts, which the CRF used. Instead, it used the same seed patterns as before. Since fully labeled abstracts have each token labeled, they are of much higher quality than seed patterns. Table 4.8 shows the scores of the two systems for the labels
                       TECHNIQUE                     DOMAIN
System                 Precision  Recall  F1         Precision  Recall  F1
Supervised CRF         41.55      31.51   35.38      53.90      52.80   55.05
Bootstrapped Patterns  29.37      56.10   38.56      37.56      30.66   48.45

Table 4.8: Comparison of our BPL-based approach and supervised CRF for the task.
TECHNIQUE and DOMAIN. Note that the scores are not directly comparable to the previous results in the chapter since the test set is now half of the previous test set. The supervised CRF performed better than BPL for the label DOMAIN, as expected. Surprisingly, BPL performed better than the supervised CRF for TECHNIQUE, even with weaker supervision.
4.6 Further Reading
In the last few years, there has been a surge of interest in studying academic communities and the technical advancements reported in published literature. IARPA funded a program called Foresight and Understanding from Scientific Exposition (FUSE) to develop automated methods to study technical contributions made by scientific, technical, and patent literature.9 Tsai et al. (2013) used a bootstrapping approach to identify and categorize scientific concepts in research literature. They used the context of citations to cluster the extracted mentions into concepts. Tateisi et al. (2014) released an annotation framework for relation extraction from research literature in computer science. The Semantic Scholar project from the Allen Institute for Artificial Intelligence10 focuses on understanding scientific literature semantically. One of their papers (Valenzuela et al., 2015) studied identifying meaningful citations using a supervised method.
9 http://www.iarpa.gov/index.php/research-programs/fuse
10 http://allenai.org/semantic-scholar.html
4.7 Conclusion
This chapter presented a framework for extracting detailed information from scientific articles, such as main contributions, tools and techniques used, and domain problems addressed, by matching semantic extraction patterns in dependency trees. I start with a few hand-written seed patterns and learn new patterns using a bootstrapping approach. I use this rich information extracted from the articles to study the dynamics of research communities, and define a new way of measuring the influence of one research community on another. I present a case study on the computational linguistics community, where I examine the influence of its sub-fields, and observe that speech recognition and probability theory have had the most seminal influence.
The results show that bootstrapped pattern-based learning is an effective approach for this task. Since the task is new, there exists no dataset of scientific articles fully labeled with the three categories. Bootstrapping with a few hand-written patterns provides enough supervision to learn more patterns and entities. In the next chapter, I apply a similar approach, bootstrapped lexico-syntactic surface word pattern-based learning, to extract entities from a very different domain – patient-authored text.
Chapter 5
Inducing Lexico-Syntactic Patterns for Information Extraction on Medical Forums
In the last chapter, I presented bootstrapped pattern-based learning as an effective approach for extracting key aspects from scientific papers. In this chapter, I describe using the approach for extracting entities from another domain – patient-authored text. Patient-authored text is usually very different from the content in scientific papers; many sentences are ungrammatical, and the sentences contain spelling mistakes, variations in naming entities, and extensive use of slang words. I develop a system to extract drugs & treatments and symptoms & diseases from users' posts on online medical forums. The system outperforms existing medical text annotators and state-of-the-art classifier-based systems. I also discuss how the extractor can be used to study the efficacy of drugs & treatments on a large scale. This work was published in Gupta et al. (2014b).
5.1 Introduction
In 2013, 59% of adults in the United States sought health information on the Internet (Fox
and Duggan, 2013). While these users typically have no formal medical education, they
generate large volumes of patient-authored text (PAT) in the form of medical blogs and
CHAPTER 5. INFORMATION EXTRACTION ON MEDICAL FORUMS 59
discussions on online health forums. Their contributions range from rare disease diagnosis
to drug and treatment efficacy.
My eventual goal is to enable open-ended mining and analysis of PAT to improve health outcomes. In particular, PAT can be a great resource for extracting the efficacy and side-effects of both pharmaceutical and alternative treatments. Prior work demonstrates the knowledge value of PAT in mining adverse drug events (Leaman et al., 2010), predicting flu trends (Carneiro and Mylonakis, 2009) (although caution is needed (Butler, 2013)), exploring drug interactions (White et al., 2013), and replicating results of a double-blind medical
trial (Wicks et al., 2011). Already websites such as http://www.medify.com and
http://www.treato.com aggregate information on the efficacy and side effects of
drugs from PAT. Extraction of sentiment and side effects for drugs and treatments in PAT is
only possible on a large scale when we have tools to discover and robustly identify entities
such as symptoms, conditions, drugs, and treatments in the text. Most of the research in extracting such information has focused on clinicians' notes, and thus most annotation systems are tailored towards them. Unlike expert-authored text, which is composed of terms routinely used by the medical community, PAT contains a great deal of slang and verbose, informal descriptions of symptoms and treatments (for example, 'feels like a brick on my heart' or 'Watson 357' for Vicodin). Previous research has shown that most terms used by consumers are not in ontologies (Smith and Wicks, 2008).
I propose inducing lexico-syntactic patterns using seed dictionaries to identify specific
medical entity types in PAT. The patterns generalize terms from seed dictionaries to learn
new entities. I test our method over two entity types: symptoms & conditions (SC), and
drugs & treatments (DT) on two of MedHelp’s forums: Asthma and Ear, Nose & Throat
(ENT). I also report the results of applying the system to three other forums on MedHelp:
Adult Type II Diabetes, Acne, and Breast Cancer. Our system is able to extract SC and DT
phrases that are not in the seed dictionaries, such as ‘cinnamon pills’ and ‘Opuntia’ as DT
from the Diabetes forum, and ‘achiness’ and ‘lumpy’ as SC from the Breast Cancer forum.
5.2 Objective
The objective is to learn new SC and DT phrases from PAT without using hand-written rules or any hand-labeled sentences. I define SC as any symptom or condition mentioned in text. The DT label refers to any treatment taken or intervention performed in order to improve a symptom or condition. It includes pharmaceutical treatments and drugs, surgeries, interventions (like 'getting rid of cat and carpet' for Asthma patients), and alternative treatments (like 'acupuncture' or 'garlic'). Note that our system ignores negations (for example, in the sentence 'I don't have Asthma', 'Asthma' is labeled SC) since it is preferable to extract all SC and DT mentions and handle the negations separately, if required. The labels include all relevant generic terms (for example, 'meds', 'disease'). Devices used to improve a symptom or condition (like inhalers) are included in DT, but devices that are used for monitoring or diagnosis are not. Some examples of sentences from the Asthma
and ENT forums labeled with SC (in italics) and DT (in bold) labels are shown below:
I don’t agree with my doctor’s diagnostic after research and I think I may have
a case of Sinus Mycetoma
I started using an herbal mixture especially meant for Candida with limited
success.
however, with the consistent green and occasional blood in nasal discharge (but with minimal “stuffy” feeling), I wonder if perhaps a problem with chronic sinusitis and or eustachian tubes
She gave me albuteral and symbicort (plus some hayfever meds and asked
me to use the peak flow meter.
My sinus infections were treated electrically, with high voltage million volt electricity, which solved the problem, but the treatment is not FDA approved and generally unavailable, except under experimental treatment protocols.
5.3 Related Work: Medical IE
Medical term annotation is a long-standing research challenge. However, almost no prior
work focuses on automatically annotating PAT. Tools like TerMINE (Frantzi et al., 2000) and ADEPT (MacLean and Heer, 2013) do not identify specific entity types. Other existing tools like MetaMap (Aronson, 2001), the OBA (Jonquet et al., 2009), and Apache cTakes1 perform poorly mainly because they are designed for fine-grained entity extraction on expert-authored text. They essentially perform dictionary matching on text based on source ontologies (Aronson, 2001; Jonquet et al., 2009; Aronson and Lang, 2010). Despite being the go-to tools for medical text annotation, previous studies (Pratt and Yetisgen-Yildiz, 2003) comparing OBA and MetaMap to human annotator performance underscore two sources of performance error, which we also notice in our results. The first is ontology incompleteness, which results in low recall, and the second is the inclusion of contextually irrelevant terms (MacLean and Heer, 2013). For example, when restricted to the RxNORM ontology and the semantic type Antibiotic (T195), OBA will extract both Today and Penicillin
from the sentence “Today I filled my Penicillin rx”. Other approaches focusing on expert-
authored text show improvement in identifying food and drug allergies (Epstein et al., 2013)
and disease normalization (Kang et al., 2012) with the use of statistical methods. While
these statistically-based approaches tend to perform well, they require hand labeled data,
which is both manually intensive to collect and does not generalize across PAT sources.
The most relevant work to ours is in building the Consumer Health Vocabularies (CHVs).
CHVs are ontologies designed to bridge the gap between patient language and the UMLS
Metathesaurus. We are aware of two CHVs: the (OAC) CHV (Zeng and Tse, 2006)2 and
the MedlinePlus CHV3. To date, most work in this area focuses on identifying candidate
terms of general medical relevance, and not specific entity types. We use the OAC CHV to
construct our seed dictionaries.
There has been some work that extracts information from PAT. In a study investigating
the feasibility of mining adverse drug events from user comments on DailyStrength (www.
1 http://ctakes.apache.org
2 http://www.consumerhealthvocab.org
3 http://www.nlm.nih.gov/medlineplus/xml.html
dailystrength.org), Leaman et al. (2010) achieve an F1-score of 73.9% against human annotators utilizing a lexicon-based approach. This approach differs from ours in that
they do not learn new lexicon terms from their data. Other approaches focusing on expert-
authored text show improvement with the use of statistical methods. For example, Epstein
et al. (2013) utilize RxNorm and several NLP techniques to achieve F1, Precision and Recall scores in the 90s for identifying food and drug allergies entered using non-standard
terminology in allergy and sensitivity entries in the Vanderbilt perioperative information
management system. Kang et al. (2012) were able to improve both MetaMap and Peregrine
disease normalization F1-scores significantly (by about 15%) by post-processing annotator
output using NLP rules for entity resolution.
In this chapter, I extract SC and DT terms by inducing lexico-syntactic surface-word
patterns. The general approach has been shown to be useful in learning different semantic
lexicons, as discussed in Chapter 2.
5.4 Materials and Methods
5.4.1 Dataset
I used discussion forum text from MedHelp4, one of the biggest online health community
websites. See Section 2.5 for more details of the dataset. I excluded from our dataset
sentences from one user who had posted very similar posts several thousand times. I test the
performance of our system in extracting DT and SC phrases on sentences from two forums:
the Asthma forum and the Ear, Nose and Throat (ENT) forum. The Asthma and ENT
forums consist of 39,137 and 215,123 sentences, respectively, in our dataset. In addition,
I present qualitative results of our system run on three other forums: the Adult Type II
Diabetes forum (63,355 sentences), the Acne forum (65,595 sentences), and the Breast
Cancer forum (296,861 sentences). I used the Stanford CoreNLP toolkit (Manning et al., 2014) to tokenize text, split it into sentences, and label the tokens with their part-of-speech tags and lemmas (that is, canonical forms). I converted all text to lowercase because PAT usually contains inconsistent capitalization.
4Data spans from 2007 to May 2011. Available from: http://www.medhelp.org.
Initial Labeling Using Dictionaries
As the first step, I 'partially' label data using matching phrases from our DT and SC dictionaries. Our DT dictionary, comprising 38,684 phrases, was sourced from Wikipedia's list of drugs, surgeries and delivery devices; RxList5; MedlinePlus6; Medicinenet7; phrases with semantic type 'procedures' from MedDRA8; and phrases with relevant semantic types (Antibiotic, Clinical Drug, Laboratory Procedure, Medical Device, Steroid, and Therapeutic or Preventive Procedure) from the NCI thesaurus.9
Our SC dictionary comprises 100,879 phrases, and was constructed using phrases from MedlinePlus, Medicinenet, and MedDRA (with semantic type 'disorders'). We expanded both dictionaries using the OAC Consumer Health Vocabulary10 by adding all synonyms of the phrases previously added. Because the dictionaries are automatically constructed with no manual editing, they might contain some incorrect phrases. However, the results show that they perform effectively.
I label a phrase with the dictionary label when the sequence of non-stop-words (or their lemmas) matches an entry in the dictionary. To match spelling mistakes and morphological variations (like 'tickly'), which are common in PAT, I perform fuzzy matching: a token matches a word in the dictionary if the token is longer than 6 characters and the token and the word are edit distance one away. I ignore the words 'disease', 'disorder', 'chronic', and 'pre-existing' in the dictionaries when matching phrases. I remove phrases that are very common on the Internet by compiling a list of the 2,000 most common words from Google N-grams, called GoogleCommonList henceforth. See Section 2.4 for more information on Google N-grams. This helps exclude words like 'Today' and 'AS', which are also names of medicines. Tokens that are labeled SC by the SC dictionary are not labeled DT, to avoid labeling 'asthma' as DT in the phrase 'asthma meds', in case 'asthma meds' is in the DT dictionary.
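The edit-distance-one matching rule described above can be sketched as follows. This is an illustrative reimplementation, not the system's code; the function names are hypothetical, and the real system additionally handles lemmas and multi-word phrases.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(token, dictionary_words):
    """A token matches a dictionary word exactly, or fuzzily when the token
    is longer than 6 characters and within edit distance one of the word."""
    for word in dictionary_words:
        if token == word:
            return word
        if len(token) > 6 and edit_distance(token, word) <= 1:
            return word
    return None
```

For example, the misspelling 'wheezng' would still match a dictionary entry 'wheezing', while a short token like 'cat' would not fuzzily match 'car', since fuzzy matching is restricted to tokens longer than 6 characters.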
5 www.rxlist.com, accessed January 2013.
6 http://www.nlm.nih.gov/medlineplus, accessed January 2013.
7 http://www.medicinenet.com, accessed January 2013.
8 MedDRA stands for Medical Dictionary for Regulatory Activities. http://www.meddra.org, accessed February 2013.
9 http://ncit.nci.nih.gov, accessed March 2013.
10 Open Access, Collaborative Consumer Health Vocabulary Initiative. http://www.consumerhealthvocab.org, accessed February 2013.
5.5 Inducing Lexico-Syntactic Patterns
In Chapter 2, I gave a high level overview of the steps of a bootstrapped pattern learning
system. Below is a summary of the steps.
1. Label data using dictionaries
2. Create patterns using the labeled data and choose top K patterns
3. Extract phrases using the learned patterns and choose top N words
4. Add new phrases to the dictionaries
5. Repeat 1-4 T times or until converged
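The steps above can be sketched as a generic loop. This is a simplified illustration rather than the actual implementation; the helper functions (make_patterns, apply_pattern, and the two scoring functions) are hypothetical placeholders for the components described in the following subsections.

```python
def bootstrap(sentences, seed_dict, make_patterns, apply_pattern,
              score_pattern, score_phrase, T=20, K=50, N=10):
    """Generic bootstrapped pattern learning: label with the current
    dictionary, keep the top-K patterns, extract and keep the top-N new
    phrases, and repeat T times or until nothing new is learned."""
    dictionary = set(seed_dict)
    learned_patterns = set()
    for _ in range(T):
        # Steps 1-2: create candidate patterns around dictionary mentions.
        candidates = make_patterns(sentences, dictionary) - learned_patterns
        top_patterns = sorted(candidates, key=score_pattern, reverse=True)[:K]
        # Step 3: extract phrases matched by the selected patterns.
        extracted = set()
        for p in top_patterns:
            extracted |= apply_pattern(p, sentences)
        new_phrases = sorted(extracted - dictionary,
                             key=score_phrase, reverse=True)[:N]
        if not top_patterns or not new_phrases:
            break  # converged: nothing new to add
        # Step 4: grow the dictionary and iterate.
        learned_patterns |= set(top_patterns)
        dictionary |= set(new_phrases)
    return dictionary, learned_patterns
```

With a toy corpus and one-word left-context patterns, seeding with 'aspirin' would learn the pattern context 'took' and then the new phrase 'tylenol'.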
I experimented with different phrase and pattern weighting schemes (for example, applying log sublinear scaling in the weighting formulations below) and parameters for our system. I selected the ones that performed best on the Asthma forum test sentences. Below, I explain the algorithm using DT as an example label for ease of explanation.
5.5.1 Creating Patterns
I create potential patterns by looking at two to four words before and after the labeled tokens. I discard contexts that consist of 2 or fewer stop words because they are too general and extract many noisy entities. Contexts with 3 or more stop words are included because the long context makes them less general; for example, 'I am on X' is a good pattern to extract DTs. Words that are labeled with one of the dictionaries are generalized with the class of the dictionary. I create flexible patterns by ignoring the words {'a', 'an', 'the', ',', '.'} while matching the patterns and by allowing at most two stop words between the context and the term to be extracted. I create two sets of the above patterns – with and without the part-of-speech (POS) restriction of the target phrase (for example, that it only contains nouns). Since many symptoms and drugs tend to be more than just one word, I allow matching 1 to 2 tokens. In our experiments, matching 3 or more consecutive terms extracted noisy phrases, mostly by patterns without the POS restriction. Table 2.1 shows an example of two patterns and how they match two sentences.
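The context-window generation described above can be sketched roughly as below, for left contexts only. The function name and the stop-word list are illustrative assumptions; the real system also generates right contexts, generalizes dictionary-labeled words, and builds POS-restricted variants.

```python
STOP_WORDS = {"i", "am", "on", "the", "a", "an", "and", "to", "of"}  # illustrative subset

def candidate_patterns(tokens, target_index, max_window=4):
    """Generate left-context patterns of 2 to max_window words before a
    labeled token; 'X' marks the slot for the phrase to be extracted.
    Contexts made up entirely of 2 or fewer stop words are discarded as
    too general, while all-stop-word contexts of length 3+ are kept."""
    patterns = set()
    for width in range(2, max_window + 1):
        start = target_index - width
        if start < 0:
            continue
        context = tokens[start:target_index]
        if all(w in STOP_WORDS for w in context) and len(context) <= 2:
            continue  # e.g. 'on the X' alone is too general
        patterns.add(" ".join(context) + " X")
    return patterns
```

For the sentence 'I am on albuterol', the two-word context 'am on X' is dropped as too general, while the three-stop-word context 'i am on X' is kept.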
5.5.2 Learning Patterns
I learn new patterns by weighting them using normalization measures and selecting the top
patterns. In essence, we want to trade off precision and recall of the patterns to extract the
correct phrases. The weighting scheme for a pattern i is
pt_i = \frac{\sum_{k=1}^{m} \sqrt{freq(i, w_k)}}{\sum_{j=1}^{n} \sqrt{freq(i, w_j)}}    (5.1)

where m is the number of words with the label DT that match the pattern, n is the number of all words that match the pattern, and freq(i, w_k) is the number of times pattern i matched the phrase w_k. Sublinear scaling of the frequency prevents high-frequency words from overshadowing the contribution of low-frequency words. Using the RlogF pattern scoring
function (Riloff, 1996) led to lower scores in the pilot experiments. I discard patterns that
have weight less than a threshold (=0.5 in our experiments). I also discard patterns when m
is equal to n since adding them would be of no benefit for learning new phrases. I remove
patterns that occur in the top 500 patterns for the other label. After calculating weights for
all the remaining patterns, I choose the top K (=50 in our experiments) patterns.
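Equation 5.1 can be computed directly from a pattern's match counts. The sketch below is an illustrative reimplementation under that reading of the equation, not the system's code, and the function name is hypothetical.

```python
from math import sqrt

def pattern_weight(matches, labeled_words):
    """Equation 5.1: the ratio of square-root-scaled match frequencies of
    label-bearing words to those of all words matched by the pattern.
    `matches` maps each word extracted by pattern i to freq(i, w)."""
    num = sum(sqrt(f) for w, f in matches.items() if w in labeled_words)
    den = sum(sqrt(f) for f in matches.values())
    return num / den if den else 0.0
```

For instance, a pattern that matched 'aspirin' (a DT word) 4 times and 'today' once would get weight 2 / (2 + 1) = 2/3, above the 0.5 threshold.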
5.5.3 Learning Phrases
I apply the patterns selected by the above process to all the sentences and extract the
matched phrases. The phrase weighting scheme is a combination of TF-IDF scoring,
weight of the patterns, and relative frequency of the phrases in different dictionaries. The
latter weighting term assigns higher weight to words that are sub-phrases of phrases in the
entity’s dictionary. The weighting function for a phrase p for the label DT is
weight(p, \mathrm{DT}) = \left( \frac{\sum_{i=1}^{t} num(p, i) \times pt_i}{\log(freq_p)} \right) \times \frac{1 + dictDTFreq_p}{1 + dictSCFreq_p}    (5.2)

where t is the number of patterns that extract the phrase p, num(p, i) is the number of times phrase p is extracted using pattern i, pt_i is the weight of the pattern i from the previous equation, freq_p is the frequency of phrase p in the corpus, and dictDTFreq_p and dictSCFreq_p are the frequency of phrase p in the n-grams of the phrases from the DT dictionary and the
SC dictionary, respectively. I discard phrases with weight less than a threshold (=0.2 in our experiments). I also discard phrases that are matched by fewer than 2 patterns, to improve the precision of the system – phrases extracted by multiple patterns tend to be more accurate.

I remove the following kinds of phrases from the set of potential phrases: (1) a list of specialists and physicians downloaded from WebMD11, (2) words in the GoogleCommonList, (3) the 5,000 most frequent tokens from around 1 million tweets from Twitter, to avoid learning slang words like 'asap', and (4) phrases that are already in any of the dictionaries. I then extract up to the top N (=10 in our experiments) words and label those phrases in the sentences. I also remove body-part phrases (198 phrases that were curated from Wikipedia and manually expanded by us) from the set of potential DT phrases.
I repeat the cycle of learning patterns and learning phrases T times (=20 in our experi-
ments) or until no more patterns and words can be extracted.
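A minimal sketch of Equation 5.2 is shown below, for the label DT. This is an illustration only, with a hypothetical function name; it assumes freq_p > 1 so that the logarithm in the denominator is positive.

```python
from math import log

def phrase_weight(pattern_hits, corpus_freq, dict_dt_freq, dict_sc_freq):
    """Equation 5.2 for the DT label.  `pattern_hits` is a list of
    (num(p, i), pt_i) pairs, one per pattern that extracted the phrase;
    corpus_freq is freq_p (assumed > 1); the dict frequencies count the
    phrase in the n-grams of the DT and SC dictionary phrases."""
    pattern_term = sum(n * pt for n, pt in pattern_hits) / log(corpus_freq)
    dict_term = (1 + dict_dt_freq) / (1 + dict_sc_freq)
    return pattern_term * dict_term
```

As intended, a candidate that appears among the DT dictionary n-grams scores higher than one that appears among the SC dictionary n-grams, all else being equal.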
5.6 Evaluation Setup
5.6.1 Test Data
I tested our system and the baselines on two forums – Asthma and ENT. For each forum, I
randomly sampled 500 sentences, and my collaborator and I annotated 250 sentences each.
The test sentences were removed from the data used in the learning system. The labeling
guidelines for the annotators for the test sentences were to include the minimum number of
words to convey the medical information. To calculate the inter-annotator agreement, the
annotators labeled 50 sentences from the 250 sentences assigned to the other annotator; the
agreement is thus calculated on 100 sentences out of the 500 sentences. The token-level
agreement for the Asthma test sentences was 96% with Cohen’s kappa κ=0.781, and for
the ENT test sentences was 96.2% with Cohen’s kappa κ=0.801. I used the Asthma forum
as a development forum to select parameters, such as the maximum number of patterns and
phrases added in an iteration, total number of iterations, and the thresholds for learning
patterns and phrases. I discuss the effect of varying these parameters in the additional experiments section below. I used ENT as a test forum; no parameters were tuned on the ENT
11 http://www.webmd.com, accessed October 2013
forum test set.
Failed Crowdsourcing Effort
I tried using Amazon Mechanical Turk to acquire labeled test data. However, the annotations were of poor quality, and thus we did not use them. Annotators frequently labeled any medically relevant term as DT or SC, such as 'blood' and 'doctor'. I tried a secondary
verification step – turkers were asked to verify the annotations from the first step, but the
results were not satisfactory. I believe that either they did not read the instructions properly,
did not pay attention when labeling the data, or did not fully understand the task. For ex-
ample, one annotator labeled ‘doc’ and ‘WAIT’ as DTs in the sentence ‘Take care of your
son, take him to the doc regualrly , do as they say and WAIT ... until he grows out of it.’ In
retrospect, labeling the test data ourselves was faster and perhaps cheaper.
5.6.2 Metrics
I present both token-level and entity-level Precision, Recall, and F1 metrics to evaluate our
system and the baselines. I discuss the metrics and the difference between token-level and
entity-level metrics in Chapter 2. All the results in this chapter, unless otherwise noted,
are token-level measures because identifying partial tokens in an entity (that is, ‘inhaler’
in ‘salbutamol inhaler’) is still useful in this domain. Entity-level evaluation is commonly
used for recognizing named entities, where, for example, the distinction between ‘Wash-
ington’ and ‘Washington D. C.’ is more prominent. Note that sometimes extracting partial
phrases in our task will also lead to a wrong number of token-level true positives (for ex-
ample, extracting just ‘looking’ in ‘trouble looking straight ahead’), but I did not observe
it often in our experiments.
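Token-level precision, recall, and F1 for a single label can be computed as below. This is an illustrative sketch (the function name is hypothetical) assuming parallel per-token gold and predicted label sequences.

```python
def token_prf(gold, predicted, label):
    """Token-level precision, recall, and F1 for one label, given parallel
    per-token gold and predicted label sequences."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Tokens mapped to 'none' (including the ignored stop words and common medical terms) simply never count as the evaluated label, which is why the scores are unaffected by labeling everything else 'none'.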
Note that accuracy is not a good measure for the task because most of the tokens are
labeled ‘none’, and thus labeling everything as ‘none’ achieves very high accuracy and zero
recall. I ignore about 200 very common words (like ‘i’, ‘am’), 26 very common medical
terms and their derivatives (like ‘disease’, ‘doctor’), and words that do not start with a letter
(see the Appendix for the full list) when evaluating the systems.12
12This is same as considering them as stop words and fixing their label to ‘none’. Since the F1 scores are
Statistical Significance Testing
I tested the statistical significance of the improvement of our system over the baselines us-
ing approximate randomization (Noreen, 1989; Yeh, 2000) implemented by SIGFv2 (Pado,
2006), commonly used for statistical significance testing for named entity recognition systems. It does not assume that the model is representative; that is, it does not perform sampling with replacement. Instead, it is based on random shuffling of the predictions. I
assumed each token to be an observation and randomized the observations 10,000 times.
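The test can be sketched as follows. This is a simplified illustration of approximate randomization (not SIGF itself, and the function name is hypothetical): the two systems' predictions are randomly swapped per token, and the p-value counts how often the shuffled score difference is at least the observed one.

```python
import random

def approx_randomization(gold, sys_a, sys_b, metric, R=10000, seed=0):
    """Approximate randomization test (Noreen, 1989): per observation,
    randomly swap the two systems' predictions and count how often the
    shuffled score difference reaches the observed difference."""
    rng = random.Random(seed)
    observed = abs(metric(gold, sys_a) - metric(gold, sys_b))
    at_least = 0
    for _ in range(R):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:  # swap this token's predictions
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(metric(gold, shuf_a) - metric(gold, shuf_b)) >= observed:
            at_least += 1
    # p-value with add-one smoothing
    return (at_least + 1) / (R + 1)
```

With a perfectly correct system compared against a completely wrong one, the resulting p-value is tiny, as expected; swapping instead of resampling is what avoids the representativeness assumption mentioned above.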
5.6.3 Baselines
I compare our system to the OBA annotator (Jonquet et al., 2009) and the MetaMap an-
notator (Aronson, 2001). I evaluated both the baselines with the default settings. I also
compare our algorithm with the pattern learning system proposed by Xu et al. (2008). I
describe the details of these systems below.
MetaMap: I used the Java API of MetaMap 2013v2. I used the semantic types An-
tibiotic, Clinical Drug, Drug Delivery Device, Steroid, Therapeutic or Preventive Proce-
dure, Vitamin, Pharmacologic Substance for DT; and semantic types Disease or Syndrome,
Sign or Symptom, Congenital Abnormality, Experimental Model of Disease, Injury or
Poisoning, Mental or Behavioral Dysfunction, Finding for SC. MetaMap-C refers to the
MetaMap system when its output is post-processed by removing common words.
OBA: I used the web service provided by OBA to label the sentences. I used the seman-
tic types Pharmacologic Substance, Steroid, Vitamin, Antibiotic, Therapeutic or Preventive
Procedure, Medical Device, Substance, Clinical Drug, Drug Delivery Device, Biomedical
or Dental Material for DT; and the semantic types Sign or Symptom, Injury or Poisoning,
Disease or Syndrome, Mental or Behavioral Dysfunction, Rickettsia or Chlamydia for SC.
OBA-C refers to the OBA system when its output is post-processed by removing common
words.
Xu et al.: Xu et al. (2008) learned surface patterns for extracting diseases from Medline
paper abstracts. They ranked patterns based on overlap of words extracted by potential
patterns with a seed pattern. Potential words were ranked by the scores of the patterns
not calculated for the label ‘none’, both approaches have the same effect.
that extracted them. I compared our system with their best performing ranking measures:
BalancedRank for patterns and Best-pattern-based rank for words. Since they focus only
on extracting diseases from research paper abstracts, their seed pattern ‘patients with X’
will not perform well on our dataset. Thus, for each label, we create patterns according to
their algorithm and choose the pattern weighted highest by our system as their seed pattern.
The seed patterns were the same as the top patterns shown in Tables 5.10-5.13.
CRF: A conditional random field (CRF) is a Markov random field based classifier that
uses word features and context features, such as the words and labels of nearby words.
Even though the data is only partially labeled using dictionaries, CRFs can learn correct
labels using the context features. I experimented with many different features and settings
and report the best results. I removed sentences in which none of the words were labeled
and fixed the label of words that are labeled by dictionaries. I used distributional similarity
features, which were computed using the Brown clustering method on all sentences of the
MedHelp forums (see Section 2.4 for more details). I built the classifier using the Stanford
NER toolkit (Finkel et al., 2005). I also present results of CRFs with self-training (‘CRF-
2’ and ‘CRF-20’ for 2 and 20 iterations, respectively), in which each CRF is trained on the
sentences labeled by the dictionaries together with the predictions of the CRF trained in the
previous iteration.
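To make the self-training setup concrete, here is a minimal sketch of the loop. The trivial most-frequent-label model below is only a stand-in for the CRF (the actual classifier is built with the Stanford NER toolkit), and all function names here are hypothetical:

```python
from collections import Counter, defaultdict

def train_token_classifier(labeled):
    """Stand-in for CRF training: remember each word's most frequent label.
    `labeled` is a list of (word, label) pairs; label is None when unknown."""
    counts = defaultdict(Counter)
    for word, label in labeled:
        if label is not None:
            counts[word][label] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def self_train(tokens, dict_labels, iterations):
    """Self-training loop: each round trains on the dictionary labels
    (kept fixed) plus the previous model's predictions for other tokens."""
    model = train_token_classifier([(w, dict_labels.get(w)) for w in tokens])
    for _ in range(iterations):
        data = [(w, dict_labels.get(w, model.get(w))) for w in tokens]
        model = train_token_classifier(data)
    return model
```

Unlike this sketch, a CRF also uses context features, which is what lets self-training propagate labels to words never matched by the dictionaries.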
5.7 Results
Fuzzy matching
Table 5.1 shows F1 scores for our system across different dictionary labeling schemes.
‘Dictionary’ refers to the seed dictionary without fuzzy matching or removing common
words. Fuzzy matching (indicated by ‘-F’) and removing common words (indicated by
‘-C’) together increase the F1 scores by 3-5%.
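A rough sketch of the ‘-F’ and ‘-C’ labeling variants is shown below; the exact matching heuristics, such as the minimum word length required for fuzzy matching, are assumptions for illustration:

```python
def within_edit1(a, b):
    """True iff the Levenshtein distance between a and b is at most 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1          # substitution
        j += 1              # otherwise: insertion into the shorter string
    return edits + (len(a) - i) + (len(b) - j) <= 1

def label_token(word, dictionary, common_words, min_len=6):
    """Dictionary labeling: exact match, else fuzzy match for longer
    words ('-F'); words on the common-word list are never labeled ('-C')."""
    if word in common_words:
        return None
    if word in dictionary:
        return "MATCH"
    if len(word) >= min_len and any(within_edit1(word, d) for d in dictionary):
        return "FUZZY"
    return None
```

Restricting fuzzy matching to longer words avoids spurious matches between short, unrelated words.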
System            Asthma–DT   Asthma–SC   ENT–DT   ENT–SC
Dictionary            58.21       71.39    49.66    60.32
Dictionary-C          60.29       73.32    53.93    61.88
Dictionary-F-C        62.50       74.59    54.74    63.13

Table 5.1: F1 scores for labeling with dictionaries using different types of labeling schemes. ‘-F’ means using fuzzy matching and ‘-C’ means pruning words that are in GoogleCommonList.
Our system vs. Other systems
Tables 5.2–5.5 show the scores for DT and SC labels on the Asthma and ENT forums.
The horizontal line separates systems that do not learn new phrases from the systems that
do. An asterisk denotes that our system is statistically significantly better than the given
system (two-tailed p-value < 0.05, computed using the approximate randomization test).
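The significance test can be sketched as follows. For simplicity, this version shuffles paired per-token correctness indicators and compares accuracies instead of recomputing F1 for every shuffle, so treat it as illustrative rather than a reproduction of our exact test:

```python
import random

def approx_randomization(correct_a, correct_b, trials=10000, seed=0):
    """Two-tailed approximate randomization test on paired per-token
    correctness indicators (1 = system got the token right, else 0).
    Returns a p-value for the observed difference in accuracy."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    extreme = 0
    for _ in range(trials):
        sa = sb = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:   # randomly swap the paired outcomes
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) / n >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)
```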
In most cases, our system significantly outperforms current standard tools in medical
informatics. MetaMap and OBA have lower computational time since they do not match
words fuzzily or learn new dictionary phrases, but have lower performance. All systems
extract SC terms with higher recall than DT terms because many simple SC terms (such
as ‘asthma’) occurred frequently and were present in the dictionary. The improvement in
performance of our system over the baselines is higher for DT as compared to SC, mainly
because SC terms are usually verbose and descriptive, and hence are harder to extract using
patterns. In addition, the performance is higher on Asthma than on ENT for two reasons.
First, the system was tuned on the Asthma forum. Second, the Asthma test set had many
easy to label DT and SC phrases, such as ‘asthma’ and ‘inhaler’. On the other hand, many
ENT phrases were longer and not present in seed dictionaries, such as ‘milk free diets’ and
‘smelly nasal discharge’.
One of the reasons that CRF does not perform so well, despite being very popular for ex-
tracting entities from human-labeled text data, is that the data is partially labeled using the
dictionaries. Thus, the data is noisy and lacks full supervision provided in human-labeled
data, making the word-level features not very predictive. CRF missed extracting some
common terms like ‘inhaler’ and ‘inhalers’ as DT (‘inhaler’ occurred only as a sub-phrase
in the seed dictionary), and extracted some noisy terms, such as ‘afraid’ and ‘icecream’. In
addition, CRF uses context for labeling data – we show in the Additional Experiments sec-
tion that using context in the form of patterns performs worse than dictionary matching for
labeling data. Our system, on the other hand, learned new dictionary phrases by exploiting
context but labeled data by dictionary matching. Self-training the CRF initially increased
the F1 score for DT but performed worse in the subsequent iterations. Xu et al.’s system
performed worse because of its overdependence on the seed patterns: it gave low scores to
patterns that extracted phrases that had low overlap with the phrases extracted by the seed
patterns, which resulted in lower recall.
I believe token-level evaluation is better suited than entity-level evaluation for the task.
However, for completeness, I have included entity-level evaluation results in Tables 5.6-5.9.
Scores of all systems are better when measured at the token level than at the entity level
because they get credit for extracting partial entities. The entity-level evaluation results
show a similar trend as the token-level evaluation: our system performs better than other
systems, albeit the difference is smaller for the DT label on the ENT forum.
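The gap between the two evaluation modes can be illustrated with a small sketch (exact-boundary matching for entities is assumed):

```python
def token_f1(gold, pred):
    """F1 over individual tokens; gold/pred are per-token labels (None = no label)."""
    tp = sum(1 for g, p in zip(gold, pred) if p is not None and g == p)
    n_pred = sum(p is not None for p in pred)
    n_gold = sum(g is not None for g in gold)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def spans(labels):
    """Maximal runs of identically labeled tokens, as (start, end, label)."""
    out, start = [], None
    for i, lab in enumerate(list(labels) + [None]):
        if start is not None and lab != labels[start]:
            out.append((start, i, labels[start]))
            start = None
        if lab is not None and start is None:
            start = i
    return out

def entity_f1(gold, pred):
    """F1 over whole entities: a span counts only on an exact boundary match."""
    g, p = set(spans(gold)), set(spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, extracting only ‘inhaler’ from the gold entity ‘steroid inhaler’ receives partial credit at the token level but none at the entity level.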
5.7.1 Analysis
Tables 5.10–5.13 show the top 10 patterns extracted from the Asthma and the ENT forums
for the two labels. To improve readability, I have shown only the sequences of lemmas
from the patterns. X indicates the target phrase and ‘pos:’ indicates the part-of-speech
restriction. As we can see, several context tokens are generalized to their labels. Some
target entities do not have part-of-speech restriction, especially when the context is very
predictive of the label, such as ‘and be diagnose with’ for label SC.
Figure 5.1 shows the top 15 phrases extracted from the Asthma and ENT forums by our
system. Figures 5.2 and 5.3 show the phrases extracted by our system from the following
three forums: Acne, Breast Cancer, and Adult Type II Diabetes. We can broadly group the
extracted phrases into 4 categories, which are described below.
New Terms
One goal of extracting medical information from PAT is to learn new treatments patients
are using or symptoms they are experiencing. Our system extracted phrases like ‘stabbing
System           Precision   Recall       F1
OBA                  52.25    56.50   54.25*
OBA-C                62.06    53.15   57.25*
MetaMap              68.42    57.56   62.52*
MetaMap-C            77.60    54.98   64.36*
Dictionary-F-C       89.65    47.97   62.50*
Xu et al.-25         89.57    53.87   67.28*
Xu et al.-50         85.96    54.24   66.51*
CRF                  87.09    49.81   63.38*
CRF-2                87.74    50.18   63.84*
CRF-20               86.53    49.81   63.23*
Our system           86.88    58.67   70.04

Table 5.2: Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.
System           Precision   Recall       F1
OBA                  78.87    60.08   68.20*
OBA-C                83.62    58.24   68.66*
MetaMap              58.63    80.24   67.75*
MetaMap-C            70.28    75.15   72.63*
Dictionary-F-C       78.73    70.87   74.59*
Xu et al.-25         77.29    72.09   74.60*
Xu et al.-50         76.28    72.70   74.45*
CRF                  77.68    73.72   75.65*
CRF-2                77.63    73.52   75.52*
CRF-20               76.64    73.52   75.05*
Our system           78.10    75.56   76.81

Table 5.3: Token-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.
pain’, ‘flakiness’, ‘plaque buildup’, which are not in the seed dictionaries. It also extracted
alternative and preventative treatments like ‘HEPA’ for high-efficiency particulate air
filter, ‘cinnamon pills’, ‘vinegar pills’, ‘basil’, and ‘opuntia’. Effects of alternative
and new treatments are usually studied in small-scale clinical trials (e.g. the effects of
Opuntia plant and Cinnamon on Diabetes patients in clinical trials have been studied in
System           Precision   Recall       F1
OBA                  43.22    55.73   48.68*
OBA-C                49.73    51.36   50.53*
MetaMap              56.39    53.00   54.64
MetaMap-C            64.08    49.72   55.99
Dictionary-F-C       82.41    40.98   54.74*
Xu et al.-25         76.50    40.98   53.37*
Xu et al.-50         62.80    41.53   49.99*
CRF                  79.38    42.07   55.71
CRF-2                79.20    43.71   56.33
CRF-20               67.79    43.71   53.15*
Our system           82.82    44.80   58.15

Table 5.4: Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.
System           Precision   Recall       F1
OBA                  67.51    50.52   57.79*
OBA-C                70.55    46.18   55.82*
MetaMap              57.01    64.23   60.40*
MetaMap-C            67.40    58.50   62.63*
Dictionary-F-C       74.35    54.86   63.13*
Xu et al.-25         73.48    54.86   62.82*
Xu et al.-50         73.88    57.46   64.64*
CRF                  72.06    56.42   63.29*
CRF-2                71.39    55.90   62.70*
CRF-20               70.61    55.90   62.40*
Our system           71.65    61.45   66.16

Table 5.5: Token-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.
System           Precision   Recall      F1
OBA                  46.64    58.12   51.75
OBA-C                54.80    56.15   55.47
MetaMap              54.50    56.65   55.55
MetaMap-C            61.87    55.17   58.33
Dictionary-F-C       70.28    47.48   56.89
Xu et al.-25         73.33    54.18   62.32
Xu et al.-50         70.44    55.17   61.87
CRF                  68.49    49.26   57.30
CRF-2                69.65    49.75   58.04
CRF-20               69.17    49.75   57.87
Our system           73.00    58.62   65.02

Table 5.6: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label DT.
System           Precision   Recall      F1
OBA                  70.60    57.25   63.23
OBA-C                73.12    55.69   63.23
MetaMap              52.90    75.38   62.17
MetaMap-C            61.71    70.98   66.02
Dictionary-F-C       71.54    68.39   69.93
Xu et al.-25         70.15    69.43   69.79
Xu et al.-50         69.21    70.46   69.83
CRF                  71.05    71.24   71.15
CRF-2                70.43    70.98   70.70
CRF-20               69.36    70.98   70.16
Our system           71.28    73.31   72.28

Table 5.7: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the Asthma forum for the label SC.
System           Precision   Recall      F1
OBA                  29.79    49.57   37.22
OBA-C                34.16    46.21   39.28
MetaMap              40.97    49.57   44.86
MetaMap-C            46.28    47.05   46.66
Dictionary-F-C       66.23    42.85   52.04
Xu et al.-25         60.71    42.85   50.23
Xu et al.-50         47.66    42.85   45.12
CRF                  62.65    43.69   51.48
CRF-2                60.91    44.53   51.45
CRF-20               51.45    44.53   47.74
Our system           63.52    45.37   52.94

Table 5.8: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label DT.
System           Precision   Recall      F1
OBA                  56.57    44.79   50.00
OBA-C                56.78    40.72   47.43
MetaMap              48.51    59.04   53.26
MetaMap-C            56.10    54.07   55.06
Dictionary-F-C       65.57    50.00   56.73
Xu et al.-25         64.63    50.45   56.67
Xu et al.-50         64.78    52.03   57.71
CRF                  63.27    50.67   56.28
CRF-2                62.01    50.22   55.49
CRF-20               61.38    50.00   55.11
Our system           62.53    55.88   59.02

Table 5.9: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.
i be put on (X | pos: noun)
i have be on (X | pos: noun)
use DT and (X | pos: noun)
put he on (X | pos: noun)
prescribe DT and (X | pos: noun)
mg of X
he put I on X
to give he (X | pos: noun)
i have be use X
and put I on X

Table 5.10: Top 10 patterns learned for the label DT on the Asthma forum.
(X | pos: noun) SC etc.
reduce SC (X | pos: noun)
first SC (X | pos: noun)
have history of (X | pos: noun)
develop SC (X | pos: noun)
really bad SC (X | pos: noun)
not cause SC (X | pos: noun)
symptom be (X | pos: noun)
(X | pos: noun) SC feel
and be diagnose with X

Table 5.11: Top 10 patterns learned for the label SC on the Asthma forum.
have endoscopic (X | pos: noun)
include DT (X | pos: noun)
and put I on (X | pos: noun)
(X | pos: noun) 500 mg
2 round of (X | pos: noun)
and be put on (X | pos: noun)
have put I on (X | pos: noun)
(X | pos: adj) DT and use
ent put I on X
(X | pos: noun) and nasal rinse

Table 5.12: Top 10 patterns learned for the label DT on the ENT forum.
persistent SC (X | pos: noun)
have have problem with (X | pos: noun)
diagnose I with SC (X | pos: noun)
morning with SC (X | pos: noun)
(X | pos: noun) SC cause SC
have be treat for (X | pos: noun)
year SC (X | pos: noun)
(X | pos: noun) SC even though
(X | pos: noun) SC like SC
daughter have SC X

Table 5.13: Top 10 patterns learned for the label SC on the ENT forum.
Asthma - DT: inhaler, inhalers, steroid inhaler, albuterol inhaler, b5, preventive inhaler, ventolin inhaler, advar, seritide, steroid inhalers, symbicort trubohaler, agumentin, pantoloc, inahler, puffs

Asthma - SC: flare, flare-up, rad, congestion, mucus, tightness, sinuses, exces mucus, cataracts-along, athsma, vcd, sensation, mites, nasal, ashtma

ENT - DT: otic, z-pack, z-pac, predneson, tylenol sinus, amoxillin, saline nasal, eardrops, regimen, inhaler, peroxide, rinse, amoxcilyn, rinses, anti-nausea, saline, mucodyn, flixonase, vertin, amocicillan

ENT - SC: dysfunction, sinus, sinuses, lymph, gland, tonsilitus, sinues, sensation, congestion, pharynx, tightness, mucus, tonsil, onset, ethmoid sinus

Figure 5.1: Top 15 phrases extracted for the Asthma and the ENT forums.
Frati-Munari et al. (1998) and Khan et al. (2003)). In contrast, our system enables discov-
ering and extracting new DT and SC phrases in PAT and studying their effects reported by
patients at a larger scale in online forums.
Abbreviations
Patterns leverage context to extract abbreviations from PAT, despite the fact that unlike in
well-formed text, abbreviations in PAT tend to lack identifying structure like capitalization
and periods. Some examples of abbreviations our system extracted are: ‘neb’ for nebulizer,
‘labas’ for long-acting beta agonists, and ‘lada’ for latent autoimmune diabetes of adults.
Sub-phrases
Patients frequently do not use full names of diseases and drugs in PAT. For example, it
would not be unusual for patients to refer to ‘vitamin b12’ simply as ‘b12’. These partial
phrases did not get labeled by dictionaries because dictionaries contain long precise phrases
and we label a phrase only when it fully matches a dictionary phrase. When we ran trial
experiments that labeled phrases even when they partially matched a dictionary phrase, it
resulted in low precision. Our pattern learning system learns relevant sub-phrases of the
dictionary phrases without sacrificing much precision. For example, the system is able to
learn that ‘large’ is not a relevant word by itself even though it occurs frequently in the SC
dictionary, but ‘deficiency’ is. More examples include ‘inhaler’, ‘b5’, ‘puffer’ as DT.
Spelling Mistakes
Spelling mistakes are very common in PAT, especially for DT mentions. Context sensitive
patterns allow us to extract a wider range of spelling mistakes than would be possible with
typical edit distance metrics (e.g., Dictionary-F-C matches phrases fuzzily). For example,
the system extracts ‘neurothapy’ for neuropathy, ‘ibubofrin’ for Ibuprofen, and ‘metforim’
for Metformin.
The results show that bootstrapping using patterns gives effective in-domain dictionary
expansion for SC and DT phrases. As we can see from the top extracted phrases for the
three MedHelp forums, our system uncovers novel terms for SCs and DTs, some of which
refer to lesser-known home remedies (such as ‘basil’, ‘cinnamon’ for Diabetes) and com-
ponents of daily care and management. The system extracts some incorrect phrases, which
can be discarded by manual supervision. Such discoveries are valuable on two fronts:
firstly, they may comprise a useful candidate set for future research into alternative treat-
ments; second they can be used to suggest candidate terms for various dictionaries and
ontologies. There are two reasons for the overall lower recall and precision on this dataset
than for extracting some other types of medical entities on clinical text. First, DT and SC
definitions are broad, encompassing any symptom, condition, treatment, or intervention.
Second, PAT contains slang and verbose descriptions that are usually not present in dictio-
naries. One limitation of our system is that it does not identify long descriptive phrases,
such as ‘olive leaf nasal extract nasal spray’ and ‘trouble looking straight ahead’. More
research is needed to robustly identify such phrases and increase the recall of the system. In addition,
incorrect phrases in the dictionaries, which were curated automatically, reduced the pre-
cision of our system. Further research in automatically removing incorrect entries in the
dictionaries will help to improve the precision.
To compare the efficacy of our system for extracting relevant phrases apart from spelling
mistakes with MetaMap, I cluster all the strings from both the systems that are 1 edit
distance away and normalize them to their most frequent spelling variation. I compare the
NEW TERMS
Acne: diane35, retinoid, dianette, retinoids, topical retinoids, femodene, ginette, cilest, dalacin-t, dalacin, piriton, freederm, byebyeblemish, non-hormonal anti-androgen, sudocrem, byebye blemish, dermatologists, dian-35, canneston, microdermabrasions, isotrexin, noxema, proactiv, derm, cleansers, concealer, proactive, creme, microdermabrasion, moisturizer, minocylin
Diabetes: ambulance, basil, bedtime, c-peptide, cinnamon, diaformin, glycomet, glycomet-gp-1, hydrochlorothiazide-quinapril, hydrochlorothiazide-reserpine, lipoic, minidiab, neurotonin, opuntia, rebuilder, sometime, tritace
Breast Cancer: hormonal fluctuations, rads, ayurveda, ameridex, tram flap, bilateral mastectomy, flaps, incision, thinly, taxanes, bisphosphonates, bisphosphonate, mammosite, rad, imagery, stimulation, relicore, bezielle, wle (wide local excision), lymph spread-wle, moisturising, lymphnode, lympe, her2 neu, hormone-suppressing

SUB-PHRASES
Acne: topical, depo, contraceptives, contraceptive, aloe vera, topicals, salicylic, d3, peroxide, androgens-male, cleanser
Diabetes: asprin, bolus, carb, carbohydrate, carbohydrates, ovulation, regimen
Breast Cancer: hormonal, topical, antagonists, excision, vit, sentinel, cmf, primrose, augmentation, depo, flap

ABBREV.
Diabetes: a1c, a1c cutoff, a1cs, endo (endocrinology), ob gyn, ogtt (oral glucose tolerance test), xr (extended release)
Breast Cancer: recon (reconstruction), neu

SPELLING MISTAKES
Acne: oxytetrcaycline, contracpetive, anitbiotics, oxytetracylcine, oxytracycline, lymecylcine, sprionolactone, benzol peroxide, depot-shot, tetracylcines, shampo, dorxy, steriod, moisturising, perscription
Diabetes: actoplusmet, awhile, basil-known, birth-control, blood-cholesterol, condrotin, darvetcet, diabix, excercise, fairley, htis, inslin, klonopin-i, metforim, metform, metformun100mg, metmorfin, omigut40, pils, sutant
Breast Cancer: homonal, steriod, horonal, releif, ibubofrin, tamoxofin, tomoxphen, reloxifen, tamoxafin, tomoxifin, steriods, tamixofin

Figure 5.2: Top 50 DT phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.
NEW TERMS
Acne: squeeze blackheads, squeeze, breakouts, teenager, itchiness, coldsores, blemishes, blemish, breakout, chin, break-outs, re-appearing, outbreaks, poke, puss, flares, bum, outbreak, coldsore, acneic, armpit, teenagers
Diabetes: borderline diabetic, c-peptide, calories, checkup, educator, harden, rarer, sugary, thorough, type2
Breast Cancer: armpit, grandmother, aunt, cancer-grandmother, cancer-having, survivor, aunts, morphologies, diagnosing

SUB-PHRASES
Acne: lesions, bumps, irritation, glands, bump, forehead, lumps, scalp, cheeks, follicles, dryness, gland, flare-up, pilaris rubra, puberty, cystic, follicular, inflamed, follicle, pcos, soreness, groin, occurrence, discoloration, relapse, oily
Diabetes: abdomen, blockage, bowel, calfs, circulatory, cirrhosis, disruptions, dryness, fibro, flour, fluctuations, foggy, lesion, lumps, masturbation, menopause, onset, pcos, precursor, sensations, spike, spikes, thighs, urine
Breast Cancer: lesions, lump, soreness, lumps, phyllodes, situ, ducts, lesion, sensations, needle, menopause, manifestations, variant, mutation, manifestation, onset, duct, lymph, gland, benign, irritation, abnormality, glands, mutations, asymmetry, occurrence, leaking, parenchymal, bump, unilateral, thighs, menstrual, subtypes, ductal, colon, bumps

ABBREV.
Diabetes: a-fib (atrial fibrillation), carbs (carbohydrates), cardio, ha1c, hep (hepatitis), hgba1c, hypo, oj (orange juice), t2 (type 2)
Breast Cancer: hx (history)*, ibc (inflammatory breast cancer)

SPELLING MISTAKES
Acne: becuase, forhead
Diabetes: allegeries, energyless, jsut, neurothapy, tyoe, vomiting-more, weezyness
Breast Cancer: caner, posibility, tratment

Figure 5.3: Top 50 SC phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.
(a) Top DT phrases.
(b) Top SC phrases.
Figure 5.4: Top DT and SC phrases extracted by our system, MetaMap, and MetaMap-C for the Diabetes forum. Numbers in parentheses indicate the number of times the phrase was extracted by the system. Erroneous phrases (as determined by us) are shown in gray.
most frequent phrases extracted from the Diabetes forum in Figure 5.4. We can see that
our system extracts more relevant phrases. The reason we do not extract ‘insulin’ is that
it exists (incorrectly) in the automatically curated SC dictionary and we do not label DT
phrases that are in the SC dictionary. For our system, I concatenated all consecutive words
with the same label as one phrase, in contrast with MetaMap, which many times extracted
consecutive words as different phrases (leading to the difference in the frequency of some
phrases). For example, our system extracted ‘diabetes drug dependency’, but MetaMap
extracted it as ‘diabetes’ and ‘drug dependency’. Similarly, our system extracted ‘latent
autoimmune diabetes in adults’, whereas MetaMap extracted ‘latent’ and ‘autoimmune
diabetes’.
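The spelling-variant normalization used in this comparison can be sketched as follows; the greedy most-frequent-first clustering is an assumption about the exact procedure:

```python
from collections import Counter

def within_edit1(a, b):
    """True iff the Levenshtein distance between a and b is at most 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1          # substitution
        j += 1              # otherwise: insertion into the shorter string
    return edits + (len(a) - i) + (len(b) - j) <= 1

def normalize_variants(extractions):
    """Map each extracted string to the most frequent spelling within
    edit distance 1, and return counts over the normalized forms."""
    counts = Counter(extractions)
    canon_forms, mapping = [], {}
    for word, _ in counts.most_common():      # most frequent first
        match = next((c for c in canon_forms if within_edit1(word, c)), None)
        if match is None:
            canon_forms.append(word)
            match = word
        mapping[word] = match
    return Counter(mapping[w] for w in extractions)
```

For example, occurrences of a one-edit misspelling such as ‘metformim’ are folded into the count for ‘metformin’.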
Below, I demonstrate a use case of the system to explore alternative treatments people
use for a symptom or condition. I manually labeled posts that mentioned new treatments
identified by our system as DTs and explored their efficacy by mining sentiment towards
them in the forum.
5.7.2 Case study: Anecdotal Efficacy
Our system can be used to explore different (possibly previously unknown) treatments peo-
ple are using for a condition. In turn, this can lead to novel insights, which can be further
explored by the medical community. For example, for Diabetes, our system extracted ‘Cin-
namon’ and ‘Vinegar’ as DTs. To study the anecdotal efficacy of ‘Cinnamon’ and ‘Vinegar’
for managing Diabetes, we manually labeled the posts that mentioned the terms as treat-
ment for Diabetes (47 out of 49 posts for ‘Cinnamon’ and 26 out of 30 posts for ‘Vinegar’)
with the sentiment towards that treatment. ‘Strongly positive’ means the treatment helped the person. ‘Weakly
positive’ means the person is using the treatment or has heard positive effects of it. ‘Neu-
tral’ means the user is not using the treatment and did not express an opinion in the post.
‘Weakly negative’ means the person has heard that the treatment does not work. ‘Strongly
negative’ means the treatment did not work for the person. An informal analysis of the
posts reveals that the ‘Cinnamon’ was generally considered helpful by the community and
‘Vinegar’ had mixed reviews (Figure 5.5). Below are more details about each label.
Figure 5.5: Study of efficacy of ‘Cinnamon’ and ‘Vinegar’, two DTs extracted by our system, for treating Type II Diabetes.
• Strongly positive: The person has explicitly mentioned that the treatment is helping
the subject of the post (many times the posts discuss health of a family member)
for Diabetes. Example: “. . . A relative with the same problem told her about taking
cinnamon gel tabs which had greatly helped her. She found a brand at the local health
store by the name of NewChapter titled Cinnamon Force. She was afraid to take it
with so many other medications and it sat in the cabinet about five months. Last week,
she got brave and took two tabs behind the two largest meals of the day.Wow! the
level dropped down into the safe range and has remained there for several days.All
that I can tell you about the product, is that it contains 140mg of cinnamon per gel
tab.We are so thrilled that after so many years of frustration, that we see a great
change in blood sugar levels . . . ”
• Weakly positive: The subject of the post is either using the treatment or heard/read
positive effects of the treatment for Diabetes. Example: “... Some people do think
things such as vinegar help. My belief is those things are worth trying but they are
secondary to tried and true things such as weight loss, exercise and lowering carb
intake.”
• Neutral: The subject of the post is neither using the treatment nor expressed any
sentiment about it in the post. Example: “. . . I may be wrong, but I haven’t heard of
cinnamon lowering glucose levels. Please take your mother to a doctor for a checkup
asap . . . ” Posts that asked a question about using the treatment were also
labeled neutral. For example, “Does vinegar help diabetes” is labeled Neutral.
• Weakly negative: The post mentioned that the user has heard that the treatment does
not work. For example, people citing studies that showed inconclusive evidence of
the efficacy of the treatment. Example: “. . . Studies now show that cinnamon doesn’t
lower glucose levels, but has been known to regulate blood pressure. I can vouch for
the latter . . . ”
• Strongly negative: The post mentioned that the treatment is not working from per-
sonal experience of the subject of the post (for example, a family member). Example:
“I have tried the Apple Cider Vinegar and it didn’t work for me . . . ”
                                          DT                         SC
System                         Precision  Recall     F1   Precision  Recall     F1
Our system                         86.88   58.67  70.04       78.10   75.56  76.81
Pattern Matches (No Gen.)          45.26   13.73  21.07       50.75   12.33  19.85
Pattern Matches                    36.58   16.60  22.84       43.07   17.10  24.48
Pattern Matches in Dictionary      80.00   13.28  22.78       83.51   15.47  26.11

Table 5.14: Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the Asthma forum.
5.8 Additional Experiments
Labeling Data by Matching Patterns
I learn dictionaries for SC and DT phrases and use the dictionaries to label data. The label-
ing is done by dictionary look-up and does not consider context. Context is only considered
to learn patterns that extract new dictionary phrases. Another approach is to label data us-
ing the learned patterns, which uses only context. I compare both the approaches in Tables
5.14 and 5.15.
The system ‘Pattern Matches (No Gen.)’ applied all the patterns learned by our system
for a given label and labeled every extraction as positive for the label. ‘Pattern Matches’
is similar to ‘Pattern Matches (No Gen.)’ except it used the dictionaries for generalizing
the context, which increased the recall. ‘Pattern Matches in Dictionary’ is the most con-
servative approach in which a token was labeled as positive only if it matched both by the
dictionary and the learned patterns. That is, it filtered the output of ‘Pattern Matches’
to the phrases that were also labeled by the dictionaries. All the pattern matching
approaches have very low recall because many correct tokens did not occur in the patterns’
context. ‘Pattern Matches in Dictionary’ has high precision because it is the most restricted
approach of all, but suffers from low recall.
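A toy sketch of the two extremes follows, with regular expressions standing in for the learned token-sequence patterns (the real patterns operate over lemmas with POS restrictions, so this is only illustrative):

```python
import re

def pattern_matches(text, patterns):
    """Phrases extracted by any context pattern; each pattern is a regex
    whose first capture group is the target phrase."""
    hits = set()
    for pat in patterns:
        hits.update(m.group(1) for m in re.finditer(pat, text))
    return hits

def pattern_matches_in_dictionary(text, patterns, dictionary):
    """Most conservative variant: keep only extractions that the
    dictionary also labels."""
    return pattern_matches(text, patterns) & dictionary
```

Intersecting with the dictionary raises precision but discards any correct extraction not yet in the dictionary, which is why this variant has the lowest recall.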
                                          DT                         SC
System                         Precision  Recall     F1   Precision  Recall     F1
Our system                         82.35   45.90  58.94       71.65   61.45  66.16
Pattern Matches (No Gen.)          47.36    4.71   8.57       46.15    0.98   1.92
Pattern Matches                    40.00    6.55  11.26       53.12    8.85  15.17
Pattern Matches in Dictionary      90.90    5.46  10.30       94.11    8.33  15.31

Table 5.15: Precision, Recall, and F1 scores of systems that use pattern matching when labeling data and our system on the ENT forum.
                           DT                         SC
System          Precision  Recall     F1   Precision  Recall     F1
OBA                 52.25   56.50  54.25       78.87   60.08  68.20
OBA-C               62.06   53.15  57.25       83.62   58.24  68.66
OBA-C-T5            64.67   52.02  57.66       85.01   56.61  67.97

Table 5.16: Effects of use of GoogleCommonList in OBA on the Asthma forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Manually removing top negative words from MetaMap and OBA
I sorted all words extracted by MetaMap and OBA by their frequency and manually
identified the top 5 words that I judged to be incorrect (without considering the context).
I ran experiments in which those words were not labeled by OBA-C and MetaMap-C, that is, I added
them to the stop words list. The systems are marked as ‘OBA-C-T5’ and ‘MetaMap-C-
T5’, respectively, in Tables 5.16-5.17 and 5.18-5.19. These systems model the scenario
in which a user is willing to manually identify the top negative words and add them to
the stop words list. Removing the manually identified words generally increased precision,
but reduced recall. I suspect the recall dropped
because the words might be correct when they appeared in some contexts. The reason for
the same scores for MetaMap-C and MetaMap-C-T5 for the SC label on the Asthma forum
is that the negative words were already in the GoogleCommonList.
                           DT                         SC
System          Precision  Recall     F1   Precision  Recall     F1
OBA                 43.22   55.73  48.68       67.51   50.52  57.59
OBA-C               49.73   51.36  50.53       70.55   46.18  55.82
OBA-C-T5            53.71   51.36  52.51       82.31   42.01  55.63

Table 5.17: Effects of use of GoogleCommonList in OBA on the ENT forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
                           DT                         SC
System          Precision  Recall     F1   Precision  Recall     F1
MetaMap             68.42   57.56  62.52       58.63   80.24  67.75
MetaMap-C           77.60   54.98  64.36       70.28   75.15  72.63
MetaMap-C-T5        78.68   53.13  63.43       70.28   75.15  72.63

Table 5.18: Effects of use of GoogleCommonList in MetaMap on the Asthma forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
                           DT                         SC
System          Precision  Recall     F1   Precision  Recall     F1
MetaMap             56.39   53.00  54.64       57.01   64.23  60.40
MetaMap-C           64.08   49.72  55.99       67.40   58.50  62.63
MetaMap-C-T5        64.17   46.99  54.25       70.44   57.11  63.08

Table 5.19: Effects of use of GoogleCommonList in MetaMap on the ENT forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Phrase threshold   Precision   Recall      F1
0.01                   86.78    55.71   67.86
0.1                    87.71    55.35   67.87
0.2                    87.77    58.30   70.06
0.8                    90.53    56.45   69.54
1.0                    90.53    56.45   69.54

Table 5.20: Scores when our system is run with different phrase threshold values. Increasing the threshold increases the precision but reduces recall. The value in bold was used in our final system.
Pattern threshold   Precision   Recall      F1
0.2                     87.77    58.30   70.06
0.5                     87.77    58.30   70.06
0.8                     90.68    53.87   67.59
1.0                     89.65    47.97   62.50

Table 5.21: Scores when our system is run with different pattern threshold values. All other parameters remain unchanged. The thresholds of 0.2 and 0.5 did not make a difference because all patterns extracted had a score of more than 0.5. The threshold of 0.8 led to higher precision but lower recall. A threshold of 1.0 did not extract any patterns. The value in bold was used in our final system.
Parameter Tuning
In our experiments, I tuned the parameters, such as N , K, and T , on the Asthma forum.
In this section, we discuss the effect of varying some of the parameters (keeping others the
same as the final system) on extracting DT phrases from the Asthma forum. We observed
a similar effect when varying the parameters for extracting SC phrases from the Asthma
forum.
Phrase and pattern thresholds
Tables 5.20 and 5.21 show scores of our system when different phrase and pattern thresh-
olds are used. In both cases, generally increasing the threshold resulted in higher precision
but lower recall.
 N    T   Precision   Recall      F1
 5   40       86.41    58.67   69.89
10   20       87.77    58.30   70.06
40    5       89.22    54.98   68.03

Table 5.22: Scores when our system is run with different values of N and T. The values in bold were used in our final system.
K     T    Precision   Recall   F1
20    20   88.75       55.35    68.18
50    20   87.77       58.30    70.06
100   20   86.41       58.67    69.89

Table 5.23: Scores when our system is run with different values of K. Increasing K decreases precision but improves recall. The values shown in bold were used in our final system.
Number of phrases in each iteration (N )
Our system learned a maximum of 200 phrases (with maximum number of phrases in each
iteration N=10 and maximum number of iterations T=20). Table 5.22 shows scores for
different combinations of values of N and T, keeping the total number of phrases learned
constant.
Number of patterns in each iteration (K)
Table 5.23 shows results for different values of K, that is, the maximum number of patterns
learned in each iteration.
5.9 Future Work
Future improvements to performance would allow us to reap enhanced benefits from au-
tomatic medical term extraction. Improving precision, for example, would reduce manual
effort required for verifying extracted terms in order to do an analysis similar to the one shown in
Figure 5.5. Improving recall would increase the range of terms that we extract. For example,
at present, our system still misses relevant terms, such as ‘oatmeal’ as a DT for Diabetes.
Our results open several avenues for future work on mining and analyzing PAT. Extrac-
tion of DT and SC entities allows us to investigate connections and relationships between
drug pairs, and drugs and symptoms. Prior work has successfully identified adverse drug
events in electronic medical records (Tatonetti et al., 2012); using self-report patient data
(such as that found on MedHelp), we might uncover novel information on how particular
drug combinations affect users. One such case study to identify side effects of drugs was
presented in Leaman et al. (2010). Our system can also help to analyze sentiment towards
various treatments, including home remedies and alternative treatments, for a particular
disease – manually enumerating all treatments, along with their morphological variations,
is difficult. Finally, I note that our system does not require any labeled sentence data
and thus can be applied to many different types of PAT (like patient emails) and entity types
(like diagnostic tests).
5.10 Conclusion
I demonstrate a method for identifying medical entity types in patient-authored text. I in-
duce lexico-syntactic patterns using a seed dictionary of desirable terms. Annotating spe-
cific types of medical terms in PAT is difficult because of lexical and semantic mismatches
between experts’ and consumers’ description of medical terms. Previous ontology-based
tools like OBA and MetaMap are good at fine-grained concept mapping on expert-authored
text, but they have low accuracy on PAT.
I demonstrate that our method improves performance for the task of extracting two
entity types: drugs & treatments (DT) and symptoms & conditions (SC), from MedHelp’s
Asthma and ENT forums by effectively expanding dictionaries in context. Our system
extracts new entities missing from the seed dictionaries: abbreviations, relevant sub-phrases
of seed dictionary phrases, and spelling mistakes. In evaluation, in most cases, our system
significantly outperformed MetaMap, OBA, an existing system that uses word patterns for
extracting diseases, and a conditional random field classifier. I believe that the ability to
effectively extract specific entities is the key first step towards deriving novel findings from
PAT.
Pattern and entity scoring are the critical components of a bootstrapped pattern-based
learning system. The system developed in this chapter utilizes only the supervision pro-
vided by seed sets to score patterns and entities. Thus, many entities extracted by patterns
are unlabeled. During the pattern scoring phase, the unlabeled entities extracted by patterns
are considered negative. However, many of these unlabeled entities are actually positive,
resulting in lower scores for good patterns that extract many good (that is, positive) unla-
beled entities. In the next chapter, I propose improvements to the pattern scoring phase by
evaluating unlabeled entities using unsupervised measures. It leads to improved precision
and recall.
Chapter 6
Leveraging Unlabeled Data Improves Pattern Learning
In the previous chapters, I discussed bootstrapped pattern-based learning (BPL) as an effective
approach for entity extraction with minimal distantly supervised data. In this chapter,
I propose improvements to BPL by leveraging unlabeled data to enhance the pattern scoring
function. The work has been published in Gupta and Manning (2014a).
6.1 Introduction
In a pattern-based entity learning system, scoring patterns and scoring entities are the most
important steps. In the pattern-scoring phase, patterns are scored by their ability to extract
more positive entities and fewer negative entities. In a supervised setting, the efficacy of
patterns can be judged by their performance on a fully labeled dataset (Califf and Mooney,
1999; Ciravegna, 2001). In contrast, in a BPL system, seed dictionaries and/or patterns pro-
vide weak supervision. Thus, most entities extracted by candidate patterns are unlabeled,
making it harder for the system to learn good patterns.
Existing systems score patterns by making closed world assumptions about the unla-
beled entities. The problem is similar to the closed world assumption in distantly super-
vised relation extraction systems, when all propositions missing from a knowledge base
are considered false (Ritter et al., 2013; Xu et al., 2013). Consider the example discussed
CHAPTER 6. LEVERAGING UNLABELED DATA 94
in Chapter 2, also shown in Figure 6.1. Current pattern learning systems would score both
patterns, ‘own a X’ and ‘my pet X’, equally by either ignoring the unlabeled entities or
treating them as negative. However, these scoring schemes cannot differentiate between
patterns that extract good versus bad unlabeled entities. Systems that ignore the unlabeled
entities do not leverage the unlabeled data in scoring patterns. Frequently, these systems
learn patterns that extract some positive entities but many bad unlabeled entities. Systems
that assume unlabeled entities to be negative are very conservative; in the example, they
wrongly penalize ‘Pattern 1’, which extracted the good unlabeled entity ‘cat’.
Predicting the labels of unlabeled entities can improve pattern scoring. Features like
distributional similarity can predict that ‘cat’ is closer to the seed set {dog} than ‘house’, and
a pattern learning system can use that information to rank ‘Pattern 1’ higher than ‘Pattern
2’. In this chapter, I improve the scoring of patterns for an entity class by defining a
pattern’s score by the number of positive entities it extracts and the ratio of the number of
positive entities to the expected number of negative entities it extracts. I propose five features to predict
the scores of unlabeled entities. One feature, based on Google Ngrams, exploits the
specialized nature of our dataset: entities that are frequent on the web are less likely to be
drug-and-treatment entities. The other four features can be used to learn entities for generic
domains as well.
My main contribution is introducing the expected number of negative entities into pattern
scoring – I predict the probabilities of unlabeled entities belonging to the negative class.
I estimate an unlabeled entity’s negative class probability by averaging probabilities from
various unsupervised class predictors, such as distributional similarity, string edit distances
from learned entities, and TF-IDF scores. Our system performs significantly better than ex-
isting pattern scoring measures for extracting drug-and-treatment entities from four medical
forums on MedHelp.
6.2 Related Work
I discuss pattern-based systems in Chapter 3. Here, I review the pattern-scoring aspects
of previous pattern-based systems. The pioneering work by Hearst (1992) used hand-written
patterns to automatically generate more rules that were manually evaluated to extract
Figure 6.1: An example pattern learning system for the class ‘animals’ from the text starting with the seed entity ‘dog’. The figure shows two candidate patterns, along with their extracted entities, in the first iteration. Text matched with the patterns is shown in italics and the extracted entities are shown in bold.
hypernym-hyponym pairs from text. Other supervised systems like SRV (Freitag, 1998),
SLIPPER (Cohen and Singer, 1999), (LP )2 (Ciravegna, 2001), and RAPIER (Califf and
Mooney, 1999) used a fully labeled corpus to either create or score patterns.
Riloff (1996) used a set of seed entities to bootstrap learning of rules for entity extrac-
tion from unlabeled text. She scored a rule by a weighted conditional probability measure,
called RlogF, estimated by counting the number of positive entities among all the entities
extracted by the pattern. Thelen and Riloff (2002) extended the above bootstrapping al-
gorithm for multi-class learning. Riloff and Jones (1999) used a pattern scoring measure
similar to that of Riloff (1996) for their multi-level bootstrapping approach. Snowball (Agichtein
and Gravano, 2000) used the same scoring function for patterns as Riloff (1996). Yangar-
ber et al. (2002) and Lin et al. (2003) used a combination of accuracy and confidence of
a pattern for multiclass entity learning, where the accuracy measure ignored the unlabeled
entities and the confidence measure treated them as negative. Talukdar et al. (2006) used
seed sets to learn trigger words for entities and a pattern automaton. Their pattern scoring
measure is the same as that of Lin et al. (2003). In Chapter 5, I use the ratio of scaled frequencies
of positive entities among all extracted entities. None of the above measures predict labels
of unlabeled entities to score patterns. Our system outperforms them in our experiments.
Stevenson and Greenwood (2005) used Wordnet to assess patterns, which is not feasible
for domains that have low coverage in Wordnet, such as medical data. Zhang et al. (2008)
used the HITS algorithm (Kleinberg, 1999) over patterns (authorities) and instances (hubs)
to overcome some of the problems with the above systems – unlabeled entities extracted
by patterns are either considered negative or are ignored when computing pattern scores.
However, they do not use any external unsupervised knowledge for evaluating the unlabeled
entities.
Current open entity extraction systems either ignore the unlabeled entities or consider
them as negative. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni
et al., 2005) used components such as list extractors, generic and domain specific pattern
learning, and subclass learning. They learned domain-specific patterns using a seed set and
scored them by ignoring unlabeled entities. One of our baselines is similar to their domain-
specific pattern learning component. Carlson et al. (2010a) learned multiple semantic types
using coupled semi-supervised training from web-scale data, which is not feasible for all
datasets and entity learning tasks. They assessed patterns by their precision, assuming un-
labeled entities to be negative; one of our baselines is similar to their pattern assessment
method. Other open information extraction systems like ReVerb (Fader et al., 2011) and
OLLIE (Mausam et al., 2012) are mainly geared towards generic, domain-independent
relation extractors for web data. ReVerb used manually written patterns (called constraints)
to extract potential tuples, which were scored using a logistic regression classifier trained
on around 1000 manually labeled sentences. OLLIE ranked patterns by their frequency of
occurrence in the dataset. For more discussion on these systems, see Chapter 2.
6.3 Approach
In Chapter 2, I discussed the skeleton of a bootstrapped pattern-based learning system. In
this chapter, I use the same framework with lexico-syntactic surface word patterns. I extract
entities from unlabeled text starting with seed dictionaries of entities for multiple classes.
The success of bootstrapped pattern learning methods crucially depends on the effec-
tiveness of the pattern scorer and the entity scorer. Here I focus on improving the pattern
scoring measure.
6.3.1 Creating Patterns
Candidate patterns are created using contexts of words or their lemmas in a window of two
to four words before and after a positively labeled token. Context words that are labeled
with one of the classes are generalized with that class. The target term has a part-of-speech
(POS) restriction, which is the POS tag of the labeled token. I create flexible patterns by
ignoring the words {‘a’, ‘an’, ‘the’} and quotation marks when matching patterns to the
text. Some examples of the patterns are shown in Table 6.4.
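As a rough sketch of this step (illustrative only: the actual system builds Stanford TokensRegex patterns; the function name, the `[CLASS]` notation, and the `X/NN` target slot here are hypothetical simplifications), window-based candidate patterns could be generated as follows:

```python
def candidate_patterns(tokens, labels, target_idx, windows=(2, 3, 4)):
    """Generate surface-word context patterns around a positively labeled token.

    tokens: words of the sentence; labels: parallel class labels (None if
    unlabeled); target_idx: index of the positively labeled target token.
    Context words already labeled with a class are generalized to that class,
    and the target slot keeps only a POS restriction ("X/NN" here).
    """
    def generalize(i):
        return f"[{labels[i]}]" if labels[i] else tokens[i].lower()

    patterns = []
    for w in windows:
        left = tuple(generalize(i) for i in range(max(0, target_idx - w), target_idx))
        right = tuple(generalize(i) for i in range(target_idx + 1,
                                                   min(len(tokens), target_idx + 1 + w)))
        patterns.append((left, "X/NN", right))
    return patterns

# 'albuterol' is the positive (DT) target; 'asthma' is already labeled SC.
pats = candidate_patterns(
    ["i", "take", "albuterol", "for", "my", "asthma"],
    [None, None, "DT", None, None, "SC"], target_idx=2)
```

In the real system, flexible matching additionally skips articles and quotation marks when these patterns are applied to text.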
6.3.2 Scoring Patterns
Judging the efficacy of patterns without using a fully labeled dataset can be challenging
because of two types of failures: 1. penalizing good patterns that extract good (that is,
positive) unlabeled entities, and 2. giving high scores to bad patterns that extract bad (that
is, negative) unlabeled entities. Existing systems that assume unlabeled entities as negative
are too conservative in scoring patterns and suffer from the first problem. Systems that
ignore unlabeled entities can suffer from both problems. For a pattern r, let the sets Pr, Nr,
and Ur denote the positive, negative, and unlabeled entities extracted by r, respectively.
One commonly used pattern scoring measure, RlogF (Riloff, 1996), calculates a pattern’s
score by the function ( |Pr| / (|Pr| + |Nr| + |Ur|) ) log(|Pr|). The first term is a rough measure of precision,
which assumes unlabeled entities as negative. The second term gives higher weights to
patterns that extract more positive entities. The function has been shown to be effective for
learning patterns in many systems. However, it gives lower scores to patterns that extract
many unlabeled entities – regardless of whether those entities are good or bad.
I propose to estimate the labels of unlabeled entities to more accurately score the
patterns. The pattern score, ps(r), is calculated as
ps(r) = ( |Pr| / ( |Nr| + Σ_{e ∈ Ur} (1 − score(e)) ) ) · log(|Pr|)        (6.1)
where |.| denotes the size of a set. The function score(e) gives the probability of an entity
e belonging to C. If e is a common word, score(e) is 0. Otherwise, score(e) is calculated
as the average of five feature scores (explained below), each of which gives a score between
0 and 1. The feature scores are calculated using the seed dictionaries, learned entities for
all labels, Google Ngrams, and clustering of domain words using distributional similarity.
The log |Pr| term, inspired by RlogF, gives higher scores to patterns that extract more
positive entities. Candidate patterns are ranked by ps(r) and the top patterns are added to
the list of learned patterns.1
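As a toy rendering of this scoring rule (Equation 6.1), with `score` standing in for the averaged feature score described next; the entities and numbers below are invented for illustration:

```python
import math

def pattern_score(pos, neg, unlabeled, score):
    """Equation 6.1: |Pr| / (|Nr| + sum over e in Ur of (1 - score(e))),
    multiplied by log |Pr| to favor patterns that extract more positives."""
    if len(pos) < 2:          # patterns extracting < 2 positives are discarded
        return 0.0
    denom = len(neg) + sum(1.0 - score(e) for e in unlabeled)
    return len(pos) / denom * math.log(len(pos))

# A pattern whose unlabeled extraction looks positive ('cat', score 0.9)
# outranks one whose unlabeled extraction looks negative ('house', 0.1).
entity_score = {"cat": 0.9, "house": 0.1}
hi = pattern_score({"dog", "hamster"}, set(), ["cat"], entity_score.get)
lo = pattern_score({"dog", "hamster"}, set(), ["house"], entity_score.get)
```

Unlabeled entities that look positive contribute little to the denominator, so the pattern is not penalized for extracting them.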
To calculate score(e), I use features that assess, in an unsupervised way, whether unlabeled
entities are closer to the positive or to the negative entities. I motivate my choice of the five
features below with the following insights. If the dataset consists of informally written
text, many unlabeled entities are spelling mistakes and morphological variations of labeled
entities. I use two edit distance based features to predict labels for these unlabeled entities.
Second, some unlabeled entities are substrings of multi-word dictionary phrases but do
not necessarily belong to the dictionary’s class. For example, for learning drug names,
the positive dictionary might contain ‘asthma meds’, but ‘asthma’ is negative and might
occur in a negative dictionary as ‘asthma disease’. To predict the labels of entities that
are a substring of dictionary phrases, I use SemOdd, which I also used in Chapter 5 to
learn entities. Third, for a specialized domain, unlabeled entities that commonly occur
in generic text are more likely to be negative. I use Google Ngrams (called GN) to get
a fast, non-sparse estimate of the frequency of entities over a broad range of domains.
The above features do not consider the context in which the entities occur in text. I use
the fifth feature, DistSim, to exploit contextual information of the labeled entities using
distributional similarity. The features are defined as:
Edit distance from positive entities (EDP): This feature gives a score of 1 if e has low edit
distance to the positive entities. It is computed as
max_{p ∈ Pr} 1( editDist(p, e) / |p| < 0.2 )

where 1(c) returns 1 if the condition c is true and 0 otherwise, |p| is the length of p,
and editDist(p, e) is the Damerau-Levenshtein string edit distance between p and e.
1 Including the |Pr| term in the denominator of Equation 6.1 resulted in comparable but slightly lower performance in some experiments.
The hard cut-off for the edit distance function resulted in better results in the pilot
experiments as compared to a soft scoring function.
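A hedged sketch of the EDP feature; for brevity I use the optimal-string-alignment variant of the Damerau-Levenshtein distance, and the function names are illustrative:

```python
def osa_distance(a, b):
    """Optimal-string-alignment edit distance: insertions, deletions,
    substitutions, and adjacent transpositions (a restricted variant of
    the Damerau-Levenshtein distance used in this chapter)."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def edp(entity, positives, ratio=0.2):
    """EDP: 1 if the entity is within 20% relative edit distance of some
    positive entity, else 0 (the hard cut-off described above)."""
    return int(any(osa_distance(p, entity) / len(p) < ratio for p in positives))
```

EDN, below, is the mirror image: 1 minus the same indicator computed against the negative entities.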
Edit distance from negative entities (EDN): It is similar to EDP and gives a score of 1 if e
has high edit distance to the negative entities. It is computed as
1 − max_{n ∈ Nr} 1( editDist(n, e) / |n| < 0.2 )

Semantic odds ratio (SemOdd): First, I calculate the ratio of the frequency of the entity term
in the positive entities to its frequency in the negative entities with Laplace smooth-
ing. The ratio is then normalized using a softmax function. The feature values for the
unlabeled entities extracted by all the candidate patterns are then normalized using
the min-max function to scale the values between 0 and 1. I do min-max normalization
on top of the softmax normalization because the maximum and minimum values produced
by softmax might not be close to 1 and 0, respectively. Treating out-of-feature-vocabulary
entities the same as the worst-scored entities for the feature, that is, giving them a score
of 0, performed best on the development dataset.
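The two-stage normalization can be sketched as follows (a simplified illustration; I interpret the two-class softmax as a logistic squashing of the log-odds, which is an assumption about the exact formulation, and the frequencies are invented):

```python
def semodd_scores(pos_freq, neg_freq, candidates):
    """SemOdd sketch: Laplace-smoothed positive/negative frequency ratio,
    squashed to (0, 1) via the logistic of the log-odds, then min-max
    normalized across all candidate entities."""
    raw = {}
    for e in candidates:
        odds = (pos_freq.get(e, 0) + 1) / (neg_freq.get(e, 0) + 1)
        raw[e] = odds / (1.0 + odds)          # logistic of log(odds)
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0                   # guard against a constant feature
    return {e: (v - lo) / span for e, v in raw.items()}

scores = semodd_scores({"meds": 9}, {"asthma": 9}, ["meds", "asthma", "oatmeal"])
```

After min-max normalization the best candidate gets 1 and the worst gets 0, regardless of how compressed the softmax outputs were.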
Google Ngrams score (GN): I calculate the ratio of scaled frequency of e in the dataset to
the frequency in Google Ngrams. The scaling factor balances the two frequencies
and is computed as the ratio of the total number of phrases in the dataset to the total
number of phrases in Google Ngrams. The feature values are normalized in the same way as
SemOdd.
Distributional similarity score (DistSim): Words that occur in similar contexts, such as
‘asthma’ and ‘depression’, are clustered using distributional similarity. Unlabeled
entities that get clustered with positive entities are given a higher score than the ones
clustered with negative entities. To score the clusters, I learn a logistic regression
classifier using cluster IDs as features, and use the learned weights as scores for all the
entities in those clusters. The dataset for logistic regression is created by considering
all positively labeled words as positive and sampling negative and unlabeled words
as negative. The scores for entities are normalized in the same way as SemOdd and
GN.
Entities outside the feature vocabulary are given a score of 0 for the features SemOdd,
GN, and DistSim. I use a simple way of combining the feature values: I give equal weights
to all features and average their scores. Features can be combined using a weighted average
by manually tuning the weights on a development set; I leave this to future work. Another
way of weighting the features is to learn the weights using machine learning. I discuss this
approach in the last section of the chapter.
6.3.3 Learning Entities
I apply the learned patterns to the text and extract candidate entities. I discard common
words, negative entities, and those containing non-alphanumeric characters from the set.
The rest are scored by averaging the scores of the DistSim, SemOdd, EDP, and EDN features
from Section 6.3.2 and the following features.
Pattern TF-IDF scoring (PTF): For an entity e, it is calculated as
( 1 / log(freq_e) ) · Σ_{r ∈ R} ps(r)
where R is the set of learned patterns that extract e, freqe is the frequency of e in
the corpus, and ps(r) is the pattern score calculated in Equation 6.1. Entities that are
extracted by many high-weighted patterns get higher weight. To mitigate the effect
of many commonly occurring entities also getting extracted by several patterns, I
normalize the feature value with the log of the entity’s frequency. The values are
normalized in the same way as DistSim and SemOdd.
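A toy version of this feature (the guard against log values below 1 is my addition, to keep the sketch well-defined for very rare entities; the numbers are invented):

```python
import math

def ptf(entity_freq, pattern_scores):
    """PTF: sum of the scores of the learned patterns extracting the entity,
    damped by the log of the entity's corpus frequency so that very common
    entities extracted by several patterns are not over-rewarded."""
    return sum(pattern_scores) / max(math.log(entity_freq), 1.0)

# A rare entity extracted by two strong patterns outranks a very frequent
# entity extracted by the same patterns.
rare = ptf(entity_freq=20, pattern_scores=[2.0, 1.5])
common = ptf(entity_freq=20000, pattern_scores=[2.0, 1.5])
```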
Domain N-grams TF-IDF (DN): This feature gives higher scores to entities that are more
prevalent in the corpus compared to the general domain. For example, to learn enti-
ties about a specific disease from a disease-related corpus, the feature favors entities
related to the disease over generic medical entities. It is calculated in the same way
as GN except the frequency is computed in the n-grams of the generic domain text.
Including GN in the phrase scoring features or including DN in the pattern scoring
features did not perform well on the development set in our pilot experiments.
6.4 Experiments
6.4.1 Dataset
I evaluate our system on extracting drug-and-treatment (DT) entities in sentences from four
forums on the MedHelp user health discussion website: 1. Acne, 2. Adult Type II Diabetes
(called Diabetes), 3. Ear Nose & Throat (called ENT), and 4. Asthma.
I used Asthma as the development forum for feature engineering and parameter tuning.
Similar to Chapter 5, a DT entity is defined as a pharmaceutical drug, or any treatment
or intervention mentioned that may help a symptom or a condition. It includes surgeries,
lifestyle changes, alternative treatments, home remedies, and components of daily care and
management of a disease, but does not include diagnostic tests and devices. Refer to Chap-
ter 2 for examples of sentences from these forums and the labeled entities. I used entities
from the following classes as negative: symptoms and conditions (SC), medical specialists,
body parts, and common temporal nouns to remove dates and dosage information.
Seed dictionaries
I used the DT and SC seed dictionaries from Chapter 5. The DT seed dictionary (36,091
phrases) and SC seed dictionary (97,211 phrases) were automatically constructed from
various sources on the Internet and expanded using the OAC Consumer Health Vocabulary,
which maps medical jargon to everyday phrases and their variants. Both dictionaries are
large because they contain many variants of entities. The dictionaries matched with 1065
phrases on the Acne forum, 1232 phrases on the Diabetes forum, 2271 phrases on the ENT
forum, and 1007 phrases on the Asthma forum. For each system, the SC dictionary was
further expanded by running the system with the SC class as positive (considering DT and
other classes as negative) and adding the top 50 words extracted by the top 300 patterns to
the SC class dictionary. This helps in adding corpus-specific SC words to the dictionary.
The lists of body parts and temporal nouns were obtained from Wordnet (Fellbaum, 1998).
The common words list was created using most common words on the web and Twitter. I
used the top 10,000 words from Google Ngrams and the most frequent 5,000 words from
Twitter.2
6.4.2 Labeling Guidelines
For evaluation, I hand labeled the learned entities pooled from all systems, to be used only
as a test set. For class DT, I labeled entities belonging to DT as positive and all others as
negative. I queried ‘word + forum name’ on Google and manually inspected the results.
Apart from the definition of the DT class above, the following instructions were followed
for each class.
Positive
The following types of variations of DT entities were allowed: spelling mistakes, abbre-
viations, and phonetically similar variations (for example, ‘brufen’ for ‘ibuprofen’). If a
word or phrase was a part of a DT entity, then it was labeled positive. For example, ‘nux’ is
considered positive because ‘nux vomica’ is sometimes used medicinally. Generic entities
that can be used as a treatment for the medical condition were included (like ‘moisturizer’
for Acne). Brand names of DT entities, like ‘Amway’, were labeled positive. Ways to
administer a medicine were included, such as ‘syrup’, ‘tabs’, and ‘inhalation’. Phrases like
‘anti-bacterial’ or ‘asthma meds’ were also considered positive.
Negative
Entities that were not labeled as positive were considered negative. If a phrase had any
non-DT word then it was considered negative, except when phrases had the name of the
disease or symptom for which the treatment is mentioned. For example, ‘sinus meds’ was
considered positive. Websites, dosages, diagnostic tests or devices, doctors, and specialists were
labeled as negative.
Inter-annotator agreement between the annotator and another researcher was computed
on 200 randomly sampled learned entities from each of the Asthma and ENT forums. The
agreement for the entities from the Asthma forum was 96% and from the ENT forum was
92.46%. The Cohen’s kappa scores were 0.91 and 0.83, respectively. (Footnote 2: www.twitter.com, accessed from May 19 to 25, 2012.) Most disagreements
were on food items like ‘yogurt’, which are hard to label. Note that I do not use the hand
labeled entities for training.
6.4.3 Baselines
As in Section 6.3, the sets Pr, Nr, and Ur are defined as the positive, negative, and unlabeled
entities extracted by a pattern r, respectively. The set Ar is defined as the union of all
three sets. I compare our system with the following pattern scoring algorithms. Candidate
entities are scored in the same way as described in Section 6.3.3. It is important to note that
previous works also differ in how they create patterns, apply patterns, and score entities.
Since I focus on only the pattern scoring aspect, I run experiments that differ in only that
component.
PNOdd: Defined as |Pr|/|Nr|, this measure ignores unlabeled entities and is similar to the
domain specific pattern learning component of Etzioni et al. (2005) since all patterns
with |Pr| < 2 were discarded (more details in the next section).
PUNOdd: Defined as |Pr|/(|Ur|+ |Nr|), this measure treats unlabeled entities as negative
entities.
RlogF: Measure used by Riloff (1996) and Thelen and Riloff (2002), calculated as
Rr log |Pr|, where Rr was defined as |Pr|/|Ar| (labeled RlogF-PUN). It treated
unlabeled entities as negative entities. I also compare with a variant that ignores the
unlabeled entities, that is, by defining Rr as |Pr|/(|Pr| + |Nr|) (labeled RlogF-PN).
Yangarber02: This measure from Yangarber et al. (2002) calculated two scores, accr =
|Pr|/|Nr| and confr = (|Pr|/|Ar|) log |Pr|. Patterns with accr less than a threshold
were discarded and the rest were ranked using confr. I empirically determined that
a threshold of 0.8 performed best on the development forum.
Lin03: A measure proposed in Lin et al. (2003), it was similar to Yangarber02, except
confr was defined as log |Pr|(|Pr| − |Nr|)/|Ar|. In essence, it discards a pattern if it
extracts more negative entities than positive entities.
SqrtRatioAll: This is the pattern scoring method I used in Chapter 5, from Gupta et al.
(2014b), defined as Σ_{k ∈ Pr} √freq_k / Σ_{j ∈ Ar} √freq_j, where freq_i is the number
of times entity i is extracted by r. Sublinear scaling of the term frequency prevents high
frequency words from overshadowing the contribution of low frequency words.
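The baseline measures above can be written side by side in terms of the counts |Pr|, |Nr|, and |Ur| (a hedged sketch; each system's thresholding and pattern-discarding steps are omitted, and the counts below are invented):

```python
import math

def baseline_scores(P, N, U):
    """Baseline pattern-scoring measures, expressed with the counts P = |Pr|,
    N = |Nr|, U = |Ur| of positive, negative, and unlabeled extractions."""
    A = P + N + U
    return {
        "PNOdd":      P / N if N else float("inf"),         # ignores unlabeled
        "PUNOdd":     P / (U + N) if U + N else float("inf"),  # unlabeled = negative
        "RlogF-PUN":  (P / A) * math.log(P) if P else 0.0,
        "RlogF-PN":   (P / (P + N)) * math.log(P) if P else 0.0,
        "Lin03-conf": math.log(P) * (P - N) / A if P else 0.0,
    }

s = baseline_scores(P=4, N=1, U=5)
```

The contrast is visible directly: PUNOdd penalizes every unlabeled extraction, while RlogF-PN ignores them entirely.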
6.4.4 Experimental Setup
I used the same experimental setup for our system and the baselines. When matching
phrases from a seed dictionary to text, a phrase is labeled with the dictionary’s class if
the sequence of phrase words or their lemmas match with the sequence of words of a
dictionary phrase. Since our corpora are from online discussion forums, they have many
spelling mistakes and morphological variations of entities. To deal with the variations, I do
fuzzy matching of words – if two words are one edit distance away and are more than 6
characters long, then they are considered a match.
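This matching rule can be sketched as follows (an illustrative helper; the real system also applies it over lemmas during dictionary lookup):

```python
def fuzzy_match(w1, w2, min_len=7, max_dist=1):
    """Dictionary-to-text fuzzy match: words of more than 6 characters are
    considered a match if they are within one (Levenshtein) edit of each
    other; shorter words must match exactly."""
    if w1 == w2:
        return True
    if min(len(w1), len(w2)) < min_len or abs(len(w1) - len(w2)) > max_dist:
        return False
    prev = list(range(len(w2) + 1))          # standard edit-distance DP
    for i, c1 in enumerate(w1, 1):
        cur = [i]
        for j, c2 in enumerate(w2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1] <= max_dist
```

The length restriction keeps short words like ‘med’ from spuriously matching near-neighbors such as ‘meds’ or ‘mad’.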
I used Stanford TokensRegex (Chang and Manning, 2014) to create and apply surface
word patterns to text, and used the Stanford part-of-speech (POS) tagger (Toutanova and
Manning, 2003) to find POS tags of tokens and lemmatize them. I created patterns in a
similar way as described in Chapter 5; I discarded patterns whose left or right context was
1 or 2 stop words to avoid generating low precision patterns. In each iteration, I learned
a maximum of 20 patterns with ps(r) ≥ θr and a maximum of 10 words with score ≥ 0.2. The
initial value of θr was 1.0, which was reduced to 0.8 × θr whenever the system did not
extract any more patterns and words. I discarded patterns that extracted fewer than 2 positive
entities. I selected these parameters by their performance on the development forum.
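The iteration schedule can be sketched as a toy loop (hypothetical names; the real system decays θr only when neither patterns nor words are extracted, which I simplify here by retrying within the same iteration):

```python
def run_iterations(candidates_by_iter, theta0=1.0, decay=0.8,
                   patterns_per_iter=20):
    """Toy sketch of the learning schedule: keep patterns whose score clears
    the threshold theta; when an iteration yields nothing, multiply theta
    by 0.8 and retry, as in the experimental setup described above."""
    theta, learned = theta0, []
    for scored in candidates_by_iter:
        picked = [p for p, s in scored if s >= theta][:patterns_per_iter]
        if not picked:
            theta *= decay                      # relax the threshold
            picked = [p for p, s in scored if s >= theta][:patterns_per_iter]
        learned.extend(picked)
    return theta, learned

theta, learned = run_iterations([[("own a X", 1.2)], [("my pet X", 0.9)]])
```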
For calculating the DistSim feature used for scoring patterns and entities, I clustered
all of MedHelp’s forum data into 1000 clusters using the Brown clustering algorithm. The
data consisted of around 4 million tokens. Words that occurred fewer than 50 times were
discarded, which resulted in 50,353 unique words. For calculating the DN feature for scoring
entities, I used n-grams from all user forums in MedHelp as the domain n-grams.
I evaluate systems by their precision and recall in each iteration. To extract entities with
reasonably high precision, I stopped learning entities for a system once its precision dropped
below 75%. Recall is defined as the fraction of correct entities among the total unique
[Plot: y-axis Precision (0.76–0.96); x-axis Recall, out of 221 correct entities; curves for OurSystem, RlogF-PUN, Yangarber02, SqrtAllRatio, Lin03, and PUNOdd.]

Figure 6.2: Precision vs. Recall curves of our system and the baselines for the Asthma forum.
correct entities pooled from all systems while maintaining the precision ≥ 75%. Note that
true recall is very hard to compute since our dataset is unlabeled. To compare the systems
overall, I calculate the area under the precision-recall curves (AUC-PR).
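AUC-PR can be computed by trapezoidal integration over recall (one simple convention; the chapter does not state the exact interpolation used):

```python
def auc_pr(points):
    """Area under a precision-recall curve, given (recall, precision) points,
    by trapezoidal integration over recall."""
    pts = sorted(points)                      # sort by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

For example, a flat precision of 0.8 over recall 0.0 to 0.5 yields an area of 0.40.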
System         Asthma   ENT     Diabetes   Acne
OurSystem      68.36    60.71   67.62      68.01
PNOdd          51.62    50.31   05.91      58.45
PUNOdd         42.42    30.44   36.11      58.38
RlogF-PUN      56.13    54.11   48.70      57.04
RlogF-PN       53.46    52.84   16.59      62.35
SqrtRatioAll   41.49    40.44   35.47      46.46
Yangarber02    53.76    48.46   41.45      59.85
Lin03          54.58    47.98   56.15      60.79

Table 6.1: Area under Precision-Recall curves of the systems.
[Plot: y-axis Precision (0.75–1.0); x-axis Recall, out of 645 correct entities; curves for OurSystem, RlogF-PUN, Yangarber02, SqrtAllRatio, Lin03, and PUNOdd.]
Figure 6.3: Precision vs. Recall curves of our system and the baselines for the ENT forum.
[Plot: y-axis Precision (0.75–1.0); x-axis Recall, out of 624 correct entities; curves for OurSystem, RlogF-PUN, Yangarber02, SqrtAllRatio, Lin03, and PUNOdd.]
Figure 6.4: Precision vs. Recall curves of our system and the baselines for the Acne forum.
[Plot: y-axis Precision (0.76–0.90); x-axis Recall, out of 118 correct entities; curves for OurSystem, RlogF-PUN, Yangarber02, SqrtAllRatio, Lin03, and PUNOdd.]

Figure 6.5: Precision vs. Recall curves of our system and the baselines for the Diabetes forum.
Feature        Asthma   ENT     Diabetes   Acne
All Features   68.36    60.71   67.62      68.01
EDP            68.66    59.07   60.03      65.15
EDN            59.39    59.21   16.75      65.96
SemOdd         67.07    58.41   60.51      65.04
GN             57.52    59.53   48.76      68.61
DistSim        64.87    59.05   71.11      69.48

Table 6.2: Individual feature effectiveness: Area under Precision-Recall curves when our system uses individual features during pattern scoring. Other features are still used for entity scoring.
Feature      | Asthma | ENT   | Diabetes | Acne
All Features | 68.36  | 60.71 | 67.62    | 68.01
minusEDP     | 66.29  | 60.45 | 69.84    | 69.46
minusEDN     | 67.19  | 60.39 | 69.89    | 67.57
minusGN      | 65.53  | 60.33 | 66.07    | 67.28
minusSemOdd  | 66.66  | 60.76 | 70.79    | 68.25
minusDistSim | 66.10  | 60.58 | 66.59    | 67.85
Table 6.3: Feature ablation study: Area under Precision-Recall curves when individual features are removed from our system during pattern scoring. The feature is still used for entity scoring.
6.4.5 Results
Figures 6.2–6.5 plot the precision and recall of the systems. I do not show plots of PNOdd
and RlogF-PN to improve clarity; they performed similarly to the other baselines. All systems
extract more entities for Acne and ENT because distinct drugs and treatments are more
prevalent in these forums. Diabetes and Asthma have more interventions and lifestyle
changes, which are harder to extract. Table 6.1 shows AUC-PR scores for all systems. RlogF-
PN and PNOdd have low values for Diabetes because they learned generic patterns in the
initial iterations, which led them to learn incorrect entities. Overall, our system performed
significantly better than the existing systems. This is because the system is able to exploit the
unlabeled data to score patterns better: patterns that extract good unlabeled entities
get ranked higher than the patterns that extract bad unlabeled entities.
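The ranking effect described above can be illustrated with a toy scoring function. This is only a schematic illustration of the idea, not the dissertation's actual pattern scorer; the 0.5 partial credit for predicted-good unlabeled entities is an arbitrary assumption.

```python
def score_pattern(n_pos, n_neg, n_unlab_good, n_unlab_bad):
    """Toy pattern score: fraction of a pattern's extractions that are
    trusted, where unlabeled entities predicted to be good earn partial
    credit. The 0.5 discount is an arbitrary choice for illustration."""
    credit = n_pos + 0.5 * n_unlab_good
    total = n_pos + n_neg + n_unlab_good + n_unlab_bad
    return credit / total if total else 0.0
```

Under such a scheme, two patterns with the same labeled extractions are separated by the predicted quality of their unlabeled extractions, which is exactly the contrast the baselines miss.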
To compare the effectiveness of each feature in our system, Table 6.2 shows the AUC-
PR values when each feature was individually used for pattern scoring (other features were
still used to learn entities). EDP and DistSim were strong predictors of labels of unlabeled
entities because many good unlabeled entities were spelling mistakes of DT entities and
occurred in similar contexts to them. Table 6.3 shows the AUC-PR values when each feature
was removed from the set of features used to score patterns (the feature was still used for
learning entities). Removing GN and DistSim reduced the AUC-PR scores for all forums.
Table 6.4 shows some examples of patterns and the entities they extracted along with
their labels when the pattern was learned. Our system learned the first pattern because
‘pinacillin’ has low edit distance from the positive entity ‘penicillin’. Similarly, it scored
the second pattern higher than the baseline because ‘desoidne’ is a typo of the positive
entity ‘desonide’. Note that the seed dictionaries are noisy – the entity ‘metro’, part of the
positive entity ‘metrogel’, was falsely considered a negative entity because it was in the
common web words list. Our system learned the third pattern for two reasons: ‘inhaler’,
‘inhalers’, and ‘hfa’ occurred frequently as sub-phrases in the DT dictionary, and they
were clustered with positive entities by distributional similarity. Since RlogF-PUN does
not distinguish between unlabeled and negative entities, it does not learn the pattern.
Table 6.5 shows top 10 patterns learned for the ENT forum by our system and RlogF-PUN,
the best performing baseline for that forum. Our system preferred to learn patterns with
longer contexts first; such patterns usually have higher precision.
Forum  | Pattern          | Positive entities                                | Negative | Unlabeled              | OurSystem | Baseline
ENT    | he give I more X | antibiotics, steroid, antibiotic                 |          | pinacillin             | 68        | NA (RlogF-PUN)
Acne   | topical DT ( X   | prednisone, clindamycin, differin, benzoyl peroxide, tretinoin, metrogel | metro | desoidne | 149 | 231 (RlogF-PN)
Asthma | i be put on X    | cortisone, prednisone, asmanex, advair, augmentin, bypass, nebulizer, xolair, steroids, prilosec | | inhaler, inhalers, hfa | 8 | NA (RlogF-PUN)
Table 6.4: Example patterns and the entities extracted by them, along with the rank at which the pattern was added to the list of learned patterns. NA means that the system never learned the pattern. Baseline refers to the best performing baseline system on the forum. The patterns have been simplified to show just the sequence of lemmas. X refers to the target entity; all of them in these examples had noun POS restriction. Terms that have already been identified as the positive class were generalized to their class DT.
Our System        | RlogF-PUN
low dose of X*    | mg of X
mg of X           | treat with X
X 10 mg           | take DT and X
she prescribe X   | be take X
X 500 mg          | she prescribe X
be take DT and X* | put on X
ent put I on X*   | stop take X
DT ( like X:NN    | i be prescribe X
like DT and X     | have be take X
then prescribe X* | tell I to take X
Table 6.5: Top 10 (simplified) patterns learned by our system and RlogF-PUN from the ENT forum. An asterisk denotes that the pattern was never learned by the other system. X is the target entity slot with noun POS restriction.
6.5 Discussion and Conclusion
Our system extracted entities with higher precision and recall than other existing systems.
Since most entities extracted by patterns, especially in the crucial initial iterations, are un-
labeled, existing pattern scoring functions either unfairly penalize good patterns and/or do
not penalize bad patterns enough. Our system successfully leveraged the unlabeled data to
score patterns better – it evaluated unlabeled entities extracted by patterns in an unsuper-
vised way. However, learning entities from an informal text corpus that is partially labeled
from seed entities presents some challenges. Our system made mistakes primarily due to
three reasons. First, it sometimes extracted typos of negative entities that were not easily
predictable by the edit distance measures, such as ‘knowwhere’. Second, patterns that
extracted many good but some bad unlabeled entities got high scores because of the good
unlabeled entities. However, the bad unlabeled entities extracted by the highly weighted
patterns were scored high by the PTF feature during the entity scoring phase, leading to
extraction of the bad entities. Better features to predict negative entities and robust text
normalization would help mitigate both of these problems. Third, we used automatically
constructed seed dictionaries that were not dataset-specific, which led to incorrect labeling
of some entities (for example, ‘metro’ as negative in Table 6.4). Reducing noise in the
dictionaries would increase precision and recall.
In our proposed system, the features are weighted equally by taking the average of the
feature scores. In pilot experiments, learning a logistic regression classifier on heuristically
labeled data did not work well for either pattern scoring or entity scoring. In the next
chapter, I use logistic regression to learn an entity classifier; improved sampling of
examples to create the training set yields better results with a classifier. In retrospect,
this approach could also be successfully applied to the system in this chapter.
One limitation of our system and evaluation is that I learned single word entities, since
calculating some features for multi-word phrases is not straightforward. For example, word
clusters using distributional similarity were constructed for single words. Our future work
includes expanding the features to evaluate multi-word phrases. Another avenue for fu-
ture work is to use our pattern scoring method for learning other kinds of patterns, such
as dependency patterns, and in different kinds of systems, such as hybrid entity learning
systems (Etzioni et al., 2005; Carlson et al., 2010a).
In conclusion, I show that predicting the labels of unlabeled entities in the pattern scorer
of a bootstrapped entity extraction system significantly improves precision and recall of
learned entities. Our experiments demonstrate the importance of having models that con-
trast domain-specific and general domain text, and the usefulness of features that allow
spelling variations when dealing with informal texts. Our pattern scorer outperforms ex-
isting pattern scoring methods for learning drug-and-treatment entities from four medical
web forums.
Chapter 7
Distributed Word Representations to Guide Entity Classifiers
In the last chapter, I improved the pattern scoring function of a bootstrapped pattern-based
learning system using unlabeled data. In this chapter, I leverage the unlabeled data to
improve the entity scoring function. I model it by training a logistic regression and use
the unlabeled data to enhance its training set. The work has been published in Gupta and
Manning (2015).
7.1 Introduction
The limited supervision required by bootstrapped systems, though an attractive quality, is
also one of their main challenges. When seed sets are small, noisy, or do not cover the label
space, the bootstrapped classifiers do not generalize well. I use a major guiding inspiration
of deep learning and earlier approaches such as LSA (Landauer et al., 1998): we can learn
a lot about syntactic and semantic similarities between words in an unsupervised fashion
and capture this information in word vectors. This distributed representation can inform an
inductive bias to generalize in a bootstrapping system.
In the previous chapter, I used averaging of feature values to predict an entity’s class in
a bootstrapped system. In this chapter, I use a logistic regression classifier to predict scores
for candidate entities. My main contribution is a simple approach of using the distributed
CHAPTER 7. WORD EMBEDDINGS IMPROVE ENTITY CLASSIFIERS 113
Figure 7.1: An example of expanding a bootstrapped entity classifier’s training set using word vector similarity. The entities in blue represent known positive entities and the entities in red represent known negative entities. The entities in black are unlabeled but can be incorporated in the corresponding positive and negative sets because of their proximity to the known entities in the word vector space.
vector representations of words to expand training data for entity classifiers. To improve
the step of learning an entity classifier, I first learn a vector representation of entities using
the continuous bag of words model (Mikolov et al., 2013b). I then use kNN to expand the
training set of the classifier by adding unlabeled entities close to seed entities in the training
set. Figure 7.1 shows an example of expansion of a training set for a drugs-and-treatment
entity classifier tailored for online health forums. The unlabeled entities shown in the
figure are usually not found in seed sets that are automatically constructed using medical
ontologies. However, these entities can be incorporated into the training set because they
occur in similar contexts in the dataset. Expanding a training set not only makes it larger
but also less susceptible to false negatives, since the process of sampling the unlabeled
entities as negative is guided by the frequency and context of entities.
The key insight is to use the word vector similarity indirectly by enhancing training
data for the entity classifier. I do not directly label the unlabeled entities using the similar-
ity between word vectors, which I show extracts many noisy entities. I show that classifiers
trained with expanded sets of entities perform better on extracting drug-and-treatment en-
tities from four online health forums from MedHelp.
7.2 Related Work
In a pattern-based system, if the patterns are not very specific, they can extract noisy terms.
On the other hand, overly specific patterns can result in low recall. Many systems, such
as RAPIER (Califf and Mooney, 1999) and Kozareva and Hovy (2013), learn patterns
and extract all fillers that match the patterns. In supervised systems (e.g. RAPIER), the
patterns are scored using fully supervised data, and hence the patterns are presumably more
accurate. Learning all matched entities is a bigger problem in bootstrapped systems since
there is little labeled data to judge patterns. Kozareva and Hovy (2013) extended ontologies
using bootstrapping; they learned very specific ‘doubly-anchored’ patterns.
To mitigate the problem of extracting noisy entities, some BPL systems have an en-
tity evaluation step and they learn only the top ranked entities. There are several ways to
rank the candidate entities. Systems, such as Thelen and Riloff (2002), Lin et al. (2003),
and Agichtein and Gravano (2000), score entities using the number and scores of patterns
that extracted them. In Chapter 5, I used a similar function to rank the entities. Snow-
ball (Agichtein and Gravano, 2000) and DIPRE (Brin, 1999) also took into account how
well a pattern matched a sentence to extract an entity. All of the above systems use only
the patterns to score entities extracted by them. Surprisingly, only a few systems also use
entity-based features to score the entities. StatSnowball proposed Markov logic networks
(MLNs) to extract entities and used token-level features and joint entity-level features. In
Chapter 6, I used five features to evaluate an entity, four of which were entity-based features.
Some open IE systems like KnowItAll use the web to assess the quality of extractions.
KnowItAll’s assessor queried search engines to obtain a PMI score for occurrences of the
entity by itself vs. as a slot of the extractors. The PMI scores are used as features in a naive
Bayes classifier. Downey et al. (2010) proposed a probabilistic urn model and compared
against noisy-or and PMI scoring models.
Most of the BPL systems do not use a machine learning-based classifier for the entity
scoring step. In this chapter, I model the entity scoring function using a logistic regression
classifier. To the best of my knowledge, this work is the first to improve a bootstrapped
system’s entity evaluation by expanding the classifier’s training set. I use distributed rep-
resentations of words to compute unlabeled entities that are similar to known entities. Dis-
tributed representations of words have been shown to be successful at improving general-
ization. Passos et al. (2014) proposed word embeddings that leverage lexicons and used the
embeddings to improve a CRF-based named entity recognition system.
7.3 Approach
In this section, I propose an entity classifier and its enhancement by expanding its training
set using an unsupervised word similarity measure.
I build a one-vs-all entity classifier using logistic regression. In each iteration, for
label l, the entity classifier is trained by treating l’s dictionary entities (seed and learned
in previous iterations) as positive and entities belonging to all other labels as negative. To
improve generalization, I also sample the unlabeled entities that are not function words as
negative. To train with a balanced dataset, I randomly sub-sample the negatives such that
the number of negative instances is equal to the number of positive instances.
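The balanced training-set construction described above can be sketched as follows. This is a minimal illustration; the function and variable names are mine, not SPIED's.

```python
import random

def build_training_set(pos_entities, other_label_entities, unlabeled,
                       function_words, seed=0):
    """Assemble a balanced one-vs-all training set for one label.

    Positives: the label's dictionary entities (seed + learned so far).
    Negative pool: entities of all other labels plus unlabeled entities
    that are not function words; randomly sub-sampled so that the number
    of negatives equals the number of positives.
    """
    rng = random.Random(seed)
    positives = list(pos_entities)
    neg_pool = list(other_label_entities) + \
        [e for e in unlabeled if e not in function_words]
    negatives = rng.sample(neg_pool, min(len(positives), len(neg_pool)))
    return positives, negatives
```

Balancing by sub-sampling, rather than re-weighting, keeps the classifier's decision threshold near 0.5 at the cost of discarding some of the negative pool.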
The features for the entities are similar to the ones described in Chapter 6 from Gupta
and Manning (2014a): edit distances from positive and negative entities, relative frequency
of the entity words in the seed dictionaries, word classes computed using the Brown clus-
tering algorithm, and pattern TF-IDF score. Note that in Chapter 6, I averaged the feature
values to predict an entity’s score; one of the features was the score of the word class clus-
ter belonging to a label. First, the words were clustered using the Brown clustering method
(Brown et al., 1992). Then, each cluster was considered as an instance in a logistic
regression classifier, which was trained to give a probability of whether a cluster belongs to the
given label. I then used this cluster score as a feature in the average function. Here, I simply
include the word cluster id directly as a feature in the logistic regression classifier, which
is trained to give a score of whether an entity belongs to the given label. The last feature,
pattern TF-IDF score, gives higher scores to entities that are extracted by many learned
patterns and have low frequency in the dataset. In the experiments, I call this classifier
NotExpanded.
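The edit-distance features mentioned above can be illustrated with the standard Levenshtein distance. This is an illustrative sketch; the dissertation's exact features and the length normalization shown here are my assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(|a|*|b|)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def min_edit_distance_feature(entity, dictionary):
    """Distance from an entity to its closest dictionary entry,
    normalized by entity length (illustrative normalization)."""
    d = min(edit_distance(entity, e) for e in dictionary)
    return d / max(len(entity), 1)
```

A forum typo such as ‘pinacillin’ sits at distance 2 from the seed entity ‘penicillin’, so a small normalized distance to the positive dictionary is evidence that the candidate is itself positive.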
The lack of labeled data to train a good entity classifier is one of the challenges in
bootstrapped learning. I use distributed representations of words, in the form of word
vectors, to guide the entity classifier by expanding its training set. I expand the positive
training set by labeling the unlabeled entities that are similar to the seed entities of the label
as positive examples, and labeling the unlabeled entities that are similar to seed entities of
other labels as negative examples. I take the cautious approach of finding similar entities
only to the seed entities and not the learned entities. The algorithm can be modified to
find similar entities to learned entities as well. Cautious approaches have been shown to be
better for bootstrapped learning (Abney, 2004; Surdeanu et al., 2006).
To compute similarity of an unlabeled entity to the positive entities, I find k most similar
positive entities, measured by cosine similarity between the word vectors, and average the
scores. Similarly, I compute similarity of the unlabeled entity to the negative entities. If the
entity’s positive similarity score is above a given threshold θ and is higher than its negative
similarity score, it is added to the training set with positive label. I expand the negative
entities similarly. I tried expanding just the positive entities and just the negative entities.
Their relative performance, though higher than the baselines, varied between the datasets.
Expanding both positives and negatives gave more stable results across the datasets. Thus,
I present results only for expanding both positives and negatives.
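The expansion procedure described above can be sketched as follows. This is an illustrative implementation assuming L2-normalized word vectors; the names are mine, not the system's.

```python
import numpy as np

def expand_training_set(unlabeled_vecs, pos_seed_vecs, neg_seed_vecs,
                        k=2, theta=0.4):
    """Expand the training set with unlabeled entities close to seeds
    in word vector space.

    An unlabeled entity is added as positive if its averaged cosine
    similarity to its k most similar positive seeds exceeds theta AND
    exceeds the corresponding score for the negative seeds (and
    symmetrically for negatives). Vectors are assumed L2-normalized,
    so cosine similarity reduces to a dot product.
    """
    def avg_topk_sim(vec, seeds):
        sims = seeds @ vec                    # cosine sim to every seed
        return float(np.sort(sims)[-k:].mean())

    new_pos, new_neg = [], []
    for name, vec in unlabeled_vecs.items():
        p = avg_topk_sim(vec, pos_seed_vecs)
        n = avg_topk_sim(vec, neg_seed_vecs)
        if p > theta and p > n:
            new_pos.append(name)
        elif n > theta and n > p:
            new_neg.append(name)
    return new_pos, new_neg
```

Entities whose positive and negative similarities tie, or that fall below the threshold, are left unlabeled, which keeps the expansion cautious.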
An alternative to our approach is to directly label the entities using the vector simi-
larities. Our experimental results suggest that even though exploiting similarities between
word vectors is useful for guiding the classifier by expanding the training set, it is not ro-
bust enough to use for labeling entities directly. For example, for our development dataset,
when the similarity threshold θ was set as 0.4, 16 out of 41 unlabeled entities that were
expanded into the training set as positive entities were false positives. Increasing θ ex-
tracted far fewer entities. Setting θ to 0.5 extracted only 5 entities, all true positives, and
to 0.6 extracted none. Thus, labeling entities solely based on similarity scores resulted in
lower performance. A classifier, on the other hand, can use other sources of information as
features to predict an entity’s label.
I compute the distributed vector representations using the continuous bag-of-words
model (Mikolov et al., 2013b; Mikolov et al., 2013a) implemented in the word2vec toolkit
(http://code.google.com/p/word2vec/).
Forum    | Expanded | Expanded-M | NotExpanded | Average
Asthma   | 77.01    | 75.68      | 74.48       | 65.42
Acne     | 73.84    | 75.41      | 71.65       | 65.05
Diabetes | 82.37    | 44.25      | 48.75       | 21.82
ENT      | 80.66    | 80.04      | 77.02       | 59.50
Table 7.1: Area under Precision-Recall curve for all the systems. Expanded is our system when word vectors are learned using the Wiki+Twit+MedHelp data and Expanded-M is when word vectors are learned using the MedHelp data. Average is the average of feature values, similar to Gupta and Manning (2014a).
The publicly available word vectors are not tailored towards the online health forums do-
main and thus I train new vector representations. I train 200-dimensional vector representa-
tions on a combined dataset of a 2014 Wikipedia dump (1.6 billion tokens), a sample of 50
million tweets from Twitter (200 million tokens), and an in-domain dataset of all MedHelp
forums (400 million tokens). The three types of datasets have words and context of differ-
ent kinds: the Wikipedia data mainly consists of domain-independent words; the Twitter
data has many slang and colloquial words, also common on online forums; and the Med-
Help data has the in-domain content. I tried learning 500-dimensional and 50-dimensional
vectors; the 200-dimensional vectors worked best on the development data. I removed
words that occurred less than 20 times, resulting in a vocabulary of 89k words. I call this
dataset Wiki+Twit+MedHelp. I used the parameters suggested in Pennington et al. (2014):
negative sampling with 10 samples and a window size of 10. I ran the model for 3 itera-
tions, which were enough to get good results; more iterations would presumably result in
better vectors.
7.4 Experimental Setup
I present results on the same experimental setup, dataset, and seed lists as discussed in
Chapter 6 from Gupta and Manning (2014a). The task is to extract drug-and-treatment
(DT) entities in sentences from four forums on the MedHelp user health discussion website:
1. Asthma, 2. Acne, 3. Adult Type II Diabetes (called Diabetes), and 4. Ear Nose &
Throat (called ENT). A DT entity is defined as a pharmaceutical drug, or any treatment
[Plot: precision (0.75–1.0) vs. recall (0–1, 268.0 correct entities); Asthma forum; curves for Expanded, NotExpanded, and Average.]
Figure 7.2: Precision vs. Recall curves of our system and the baselines for the Asthma forum.
[Plot: precision (0.75–1.0) vs. recall (0–1, 268.0 correct entities); Acne forum; curves for Expanded, NotExpanded, and Average.]
Figure 7.3: Precision vs. Recall curves of our system and the baselines for the Acne forum.
[Plot: precision (0.75–1.0) vs. recall (0–1, 268.0 correct entities); Diabetes forum; curves for Expanded, NotExpanded, and Average.]
Figure 7.4: Precision vs. Recall curves of our system and the baselines for the Diabetes forum.
[Plot: precision (0.75–1.0) vs. recall (0–1, 268.0 correct entities); ENT forum; curves for Expanded, NotExpanded, and Average.]
Figure 7.5: Precision vs. Recall curves of our system and the baselines for the ENT forum.
or intervention mentioned that may help a symptom or a condition. I judged the output
of all systems, following the guidelines in the previous chapter. I used Asthma as the
development forum for parameter and threshold tuning. I set the threshold θ to 0.4 and k
(the number of nearest neighbors) to 2 when expanding the seed sets.
I evaluate systems by their precision and recall (see Chapter 2 for details). Similar to
the previous chapter, I present the precision and recall curves for precision above 75% to
compare systems when they extract entities with reasonably high precision. Recall is de-
fined as the fraction of correct entities among the total unique correct entities pooled from
all systems. Note that precision at lower cutoffs and true recall are very hard to compute.
Our dataset is unlabeled and manually labeling all entities is expensive. Pooling is a com-
mon evaluation strategy in such situations (such as, in information retrieval (Buckley et al.,
2007) and the TAC-KBP shared task). I calculate the area under the precision-recall curves
(AUC-PR) to compare the systems.
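The pooled-recall AUC-PR evaluation can be sketched as follows. This is a minimal illustration; the exact curve construction used in the experiments may differ.

```python
import numpy as np

def auc_pr(ranked_labels, total_pooled_correct, min_precision=0.75):
    """Area under the precision-recall curve for a ranked entity list.

    ranked_labels: 1/0 correctness judgments in system rank order.
    total_pooled_correct: number of unique correct entities pooled from
    all systems (the recall denominator). Only points with precision >=
    min_precision contribute, mirroring the truncated curves in the text.
    """
    tp = np.cumsum(ranked_labels)
    k = np.arange(1, len(ranked_labels) + 1)
    precision = tp / k
    recall = tp / total_pooled_correct
    keep = precision >= min_precision
    r, p = recall[keep], precision[keep]
    # trapezoidal area over the retained (recall, precision) points
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))
```

Because the denominator is the pooled set rather than an exhaustive gold standard, this recall is comparative across systems, not an absolute recall.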
I call our system Expanded in the experiments. To compare the effects of word vectors
learned using different types of datasets, I also study our system when the word vectors are
learned using just the in-domain MedHelp data, called Expanded-M. I compare against two
baselines: NotExpanded as explained in previous section, and Average, in which I average
the feature values, similar to Gupta and Manning (2014a).
7.5 Results and Discussion
Table 7.1 shows AUC-PR of various systems and Figures 7.2–7.5 show the precision-recall
curves. Our systems Expanded and Expanded-M, which used similar entities for training,
improved the scores for all four forums. I believe the improvement for the Diabetes fo-
rum was much higher than other forums because the baseline’s performance on the forum
degraded quickly in later iterations (see Figure 7.4), and improving the classifier helped
in adding more correct entities. Additionally, Diabetes DT entities are more lifestyle-
based and hence occur frequently in web text, making the word vectors trained using the
Wiki+Twit+MedHelp dataset better suited.
In three out of four forums, word vectors trained using a large corpus perform better
than those trained using the smaller in-domain corpus. For the Acne forum, where brand
Forum    | Positives                                                              | Negatives
Asthma   | pranayama, sterilizing, expectorants, inhalable, sanitizers, ayurvedic | block, yougurt, medcine, exertion, hate, virally
Diabetes | quinoa, vinegars, vegatables, threadmill, possilbe, asanas, omegas     | nicely, chiropracter, exhales, paralytic, metabolize, fluffy
Table 7.2: Examples of unlabeled entities that were expanded into the training sets. Gray-colored entities were judged by the authors as falsely labeled.
name DT entities are more frequent, the entities expanded by MedHelp vectors had fewer
false positives than those expanded by Wiki+Twit+MedHelp.
Table 7.2 shows some examples of unlabeled entities that were included as positive/neg-
ative entities in the entity classifiers. Even though some entities were included in the train-
ing data with wrong labels, overall the classifiers benefited from the expansion.
7.6 Conclusion
I improve entity classifiers in bootstrapped entity extraction systems by enhancing the train-
ing set using unsupervised distributed representations of words. The classifiers learned us-
ing the expanded seed sets extract entities with better F1 score. This supports our hypoth-
esis that generalizing labels to entities that are similar according to unsupervised methods
of word vector learning is effective in improving entity classifiers, notwithstanding that the
label generalization is quite noisy. Using the word embedding based similarity measure to
directly label the data resulted in low scores. However, training a classifier with expanded
training sets improved the scores, underscoring its robustness to noise.
In the last three chapters, I worked on applying bootstrapped pattern-based learning
to extract entities from PAT, improving scoring of both patterns and entities by exploiting
unlabeled data. In the next chapter, I turn briefly to another aspect important to real life use
of pattern-based systems – their interpretability and explainability.
Chapter 8
Visualizing and Diagnosing BPL
In the previous chapters, I discussed bootstrapped pattern-based learning, along with its
improvements, as an effective practical tool for entity extraction. In this chapter, I dis-
cuss why patterns are popular in industry and present a visualization tool for developing
a pattern-based system more effectively and efficiently. The work has been published in
Gupta and Manning (2014b).
8.1 Introduction
Entity extraction using patterns dominates commercial industry, mainly because patterns
are effective, interpretable by humans, and easy to customize to cope with errors (Chiti-
cariu et al., 2013). Patterns or rules, which can be hand crafted or learned by a system,
are commonly created by looking at the context around already known entities, such as
lexico-syntactic surface word patterns and dependency patterns. Building a pattern-based
learning system is usually a repetitive process, usually performed by the system developer,
of manually examining a system’s output to identify improvements or errors introduced by
changing the entity or pattern extractor. Interpretability of patterns makes it easier for hu-
mans to identify sources of errors by inspecting patterns that extracted incorrect instances
or instances that resulted in learning of bad patterns. Parameters range from window size
of the context in surface word patterns to thresholds for learning a candidate entity. At
present, there is a lack of tools helping a system developer to understand results and to
CHAPTER 8. VISUALIZING AND DIAGNOSING BPL 123
improve results iteratively.
Visualizing diagnostic information of a system and contrasting it with another system
can make the iterative process easier and more efficient. For example, consider a user trying
to decide on the context window size for surface word patterns. The user suspects that
a part-of-speech (POS) restriction on context words might be required with a reduced
window size to avoid extracting erroneous mentions. A shorter context size usually extracts
entities with higher recall but lower precision. By comparing and contrasting extractions
of two systems with different parameters, the user can investigate the cases in which the
POS restriction is required with smaller window size, and whether the restriction causes the
system to miss some correct entities. In contrast, comparing just the accuracy of two systems
does not allow inspecting the finer details of extractions that increase or decrease accuracy,
or making changes accordingly.
In this chapter, I present a pattern-based entity learning and diagnostics tool, SPIED. It
consists of two components: 1. pattern-based entity learning using bootstrapping (SPIED-
Learn), and 2. visualizing the output of one or two entity learning systems (SPIED-Viz).
SPIED-Viz is independent of SPIED-Learn and can be used with any pattern-based entity
learner. For demonstration, I use the output of SPIED-Learn as an input to SPIED-Viz.
SPIED-Viz has pattern-centric and entity-centric views, which visualize learned patterns
and entities, respectively, and the explanations for learning them. SPIED-Viz can also con-
trast two systems by comparing the ranks of learned entities and patterns. As a concrete ex-
ample, I learn and visualize drug-treatment (DT) entities from unlabeled patient-generated
medical text, starting with seed dictionaries of entities for multiple classes. This is the same
task proposed and developed in Chapter 5 and 6 from Gupta et al. (2014b) and Gupta and
Manning (2014a).
My contributions are: 1. I present a novel diagnostic tool for visualization of output of
multiple pattern-based entity learning systems, and 2. I release the code of an end-to-end
pattern learning system, which learns entities using patterns in a bootstrapped system and
visualizes its diagnostic output. The pattern learning and the visualization code are avail-
able at http://nlp.stanford.edu/software/patternslearning.shtml.
8.2 Learning Patterns and Entities
SPIED-Learn is based on the system described in Chapter 6 published in Gupta and Man-
ning (2014a). The system builds upon the previous bootstrapped pattern-learning work and
proposes an improved measure to score patterns. It learns entities for given classes from
unlabeled text by bootstrapping from seed dictionaries. Patterns are learned using labeled
entities, and entities are learned based on the extractions of learned patterns. The process
is iteratively performed until no more patterns or entities can be learned.
SPIED-Learn provides an option to use any of the pattern scoring measures described
in (Riloff, 1996; Thelen and Riloff, 2002; Yangarber et al., 2002; Lin et al., 2003; Gupta
et al., 2014b). A pattern is scored based on the positive, negative, and unlabeled entities
it extracts. The positive and negative labels of entities are heuristically determined by the
system using the dictionaries and the iterative entity learning process. The oracle labels
of learned entities are not available to the learning system. Note that an entity that the
system considered positive might actually be incorrect, since the seed dictionaries can be
noisy and the system can learn incorrect entities in the previous iterations, and vice-versa.
SPIED-Learn’s entity scorer can be chosen between the systems described in Chapter 6 or
7.
Each candidate entity is scored using weights of the patterns that extract it and other
entity scoring measures, such as TF-IDF. Thus, learning of each entity can be explained by
the learned patterns that extract it, and learning of each pattern can be explained by all the
entities it extracts.
8.3 Design Criteria
The following design criteria are considered when designing the interface.
• Quick summary: The interface should provide a quick summary of the learned en-
tities and patterns, including the percentage of correct and incorrect entities, if gold
labels are provided.
• Provenance: In a pattern-based system, provenance of an extracted entity or pattern
is much easier to trace than in a feature-based system. The visualization needs to be able
to drill down from a dictionary entry (usually a learned entity) to the learned patterns that
extracted it. Similarly, it should have the ability to go from a pattern to the lists of
entities it extracted (divided by gold labels, if provided) and its perceived goodness.
• Individual goodness and its quick identification: Using a heuristic criterion, the
system should identify good and bad learned patterns and entities, which helps in
quickly identifying and diagnosing errors. In SPIED-Viz, an exclamation mark is
shown for a pattern if more than half of the entities extracted by it are incorrect. In
addition, signs such as a trophy (correct entity extracted by only one system)
and a star (unlabeled entity extracted by only one system) are used to identify different
types of entities.
• Comparison: The interface should be able to compare multiple systems, both at a
high level and at a fine-grained entity/pattern level.
• Pattern-centric and entity-centric views: These views can provide detailed informa-
tion, either from a pattern point of view or from an entity point of view.
• Easy and fast: The tool should not require any cumbersome installation and should
be fast to use. Web browser-based tools are easy to use since they do not require
installation of new software.
8.4 Visualizing Diagnostic Information
SPIED-Viz visualizes learned entities and patterns from one or two entity learning systems,
and the diagnostic information associated with them. It optionally uses the oracle labels
of learned entities to color-code them and to contrast the ranks of correct/incorrect
entities when comparing two systems. The oracle labels are usually determined by manually
judging each learned entity as correct or incorrect. SPIED-Viz has two views: 1. a pattern-
centric view that visualizes patterns of one to two systems, and 2. an entity centric view that
mainly focuses on the entities learned. Figure 8.1 shows a screenshot of the entity-centric
view of SPIED-Viz. It displays following information:
Summary: Summary information of each system at each iteration and overall. It shows
for each system the number of iterations, the number of patterns learned, and the
number of correct and incorrect entities learned.
Learned Entities with provenance: It shows a ranked list of entities learned by each system,
along with an explanation of why the entity was learned. The details shown include
the entity’s oracle label, its rank in the other system, and the learned patterns that
extracted the entity. Such information can help the user to identify and inspect the
patterns responsible for learning an incorrect entity. The interface also provides a
link to search the entity on Google, along with any user-provided keywords (such as
the domain of the problem).
System Comparison: SPIED-Viz can be used to compare entities learned by two systems.
It marks entities that are learned by one system but not by the other, displaying either
a trophy sign (if the entity is correct), a thumbs down sign (if the entity is
incorrect), or a star sign (if the oracle label is not provided).
The second view of SPIED-Viz is pattern-centric. Figure 8.2 shows a screenshot of the
pattern-centric view. It displays the following information.
Summary: Summary information of each system including the number of iterations and
number of patterns learned at each iteration and overall.
Learned Patterns with provenance: It shows a ranked list of patterns, along with the
entities each pattern extracts and their labels. Note that each pattern is associated with a set of
positive, negative and unlabeled entities, which were used to determine its score.1 It
also shows the percentage of unlabeled entities extracted by a pattern that were even-
tually learned by the system and assessed as correct by the oracle. A smaller percent-
age means that the pattern extracted many entities that were either never learned or
learned but were labeled as incorrect by the oracle.

1 Note that the positive, negative, and unlabeled labels are different from the oracle labels, correct and incorrect, for the learned entities. The former refer to the entity labels considered by the system when learning the pattern, and they come from the seed dictionaries and the learned entities. A positive entity considered by the system can be labeled as incorrect by the human assessor, in case the system made a mistake in labeling the data, and vice versa.
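The percentage reported here is simple to compute. The helper below is a hypothetical sketch, not SPIED-Viz's actual code.

```python
def pct_unlabeled_learned_correct(unlabeled_extractions, learned_correct):
    """Percentage of a pattern's unlabeled extractions that were eventually
    learned by the system and judged correct by the oracle."""
    if not unlabeled_extractions:
        return 0.0
    hits = sum(1 for e in unlabeled_extractions if e in learned_correct)
    return 100.0 * hits / len(unlabeled_extractions)
```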
Figure 8.3 shows an option in the entity-centric view: hovering over an entity opens a
window on the side that shows the diagnostic information for the same entity as learned by
the other system. This allows the user to directly contrast how each system learned the
entity. For example, it can help the user inspect why one system learned an entity at an
earlier rank than the other.
An advantage of making the entity learning component and the visualization component
independent is that a developer can use any pattern scorer or entity scorer in the system
without depending on the visualization component to provide that functionality.
I developed a list-based visualization since it is easy to navigate and makes it possible to
compare the learning of individual entities/patterns. Additionally, since most pattern-based
systems are iterative, ranking the entities/patterns in the visualization by iteration helps
in diagnosing errors. Other variations, such as clustering entities based on patterns,
can give higher-level insights into the learning process; however, they make it more
difficult to diagnose sources of errors.
8.5 System Details
SPIED-Learn uses TokensRegex (Chang and Manning, 2014) to create and apply surface
word patterns to text. SPIED-Viz takes details of learned entities and patterns as input in a
JSON format. It uses JavaScript, AngularJS, and jQuery to visualize the information in a
web browser.
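The JSON schema itself is not reproduced in this chapter. The record below is a hypothetical sketch of the kind of per-iteration diagnostic input SPIED-Viz could consume; all field names, entities, and patterns are illustrative only, not SPIED-Viz's actual format.

```python
import json

# Hypothetical diagnostic record for one system; field names are
# illustrative, not SPIED-Viz's actual schema.
record = {
    "system": "system-A",
    "iterations": [{
        "iteration": 1,
        "patterns": [{
            "pattern": "suffering from __",
            "score": 2.4,
            "extracted": {"positive": ["migraine"], "negative": [],
                          "unlabeled": ["brain fog"]},
        }],
        "entities": [{
            "entity": "brain fog",
            "score": 0.91,
            "oracle_label": "correct",
            "extracted_by": ["suffering from __"],
        }],
    }],
}
print(json.dumps(record, indent=2))
```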
8.6 Related Work
Most interactive IE systems focus on annotation of text, labeling of entities, and manual
writing of rules. Some annotation and labeling tools are: MITRE’s Callisto2, Knowta-
tor3, SAPIENT (Liakata et al., 2009), brat4, Melita (Ciravegna et al., 2002), and XConc
2 http://callisto.mitre.org
3 http://knowtator.sourceforge.net
4 http://brat.nlplab.org
[Figure 8.1 callout annotations: Score of the entity in this system and the other system, along with a link to search it on Google. A star sign for an entity indicates the entity label is not provided and it was not extracted by the other system. A trophy sign indicates that the entity is correct and was not extracted by the other system. List of entities learned at each iteration. Green color indicates that the entity is correct and red color indicates that the entity is incorrect. List of patterns that extracted the entity. Their details are similar to the details shown in the pattern-centric view.]

Figure 8.1: Entity-centric view of SPIED-Viz. The interface allows the user to drill down the results to diagnose extraction of correct and incorrect entities, and contrast the details of the two systems. The entities that are not learned by the other system are marked with either a trophy (correct entity), a thumbs down (incorrect entity), or a star icon (oracle label missing), for easy identification.
[Figure 8.2 callout annotations: List of entities considered as positive, negative, and unlabeled by the system when it learned this pattern. An exclamation sign indicates that less than half of the unlabeled entities were eventually learned with the correct label. Details of the pattern. Green color of an entity indicates that the entity was learned by the system and the oracle assigned it the 'correct' label. List of patterns learned at each iteration. A blue pattern indicates that the pattern was not learned by the other system.]

Figure 8.2: Pattern-centric view of SPIED-Viz.
Figure 8.3: When the user clicks on the compare icon for an entity, the explanations of the entity extraction for both systems (if available) are displayed. This allows direct comparison of why the two systems learned the entity.
Suite (Kim et al., 2008). Akbik et al. (2013) present a tool that interactively helps non-expert
users manually write patterns over dependency trees. GATE5 provides the JAPE language that
recognizes regular expressions over annotations. Other systems focus on reducing manual effort
for developing extractors (Brauer et al., 2011; Li et al., 2011). ICE (He and Grishman,
2015) is an interface for building entity, relation, and event extractors using dependency
patterns. Valenzuela-Escarcega et al. (2015) built an interactive web-based event extraction
tool for event grammar development via rules. In contrast, our tool focuses on visualizing
and comparing diagnostic information associated with pattern learning systems.
WizIE (Li et al., 2012b) is an integrated environment for annotating text and writing
pattern extractors for information extraction. It also generates regular expressions around
labeled mentions and suggests patterns to users. It is most similar to our tool as it displays
an explanation of the results extracted by a pattern. However, it is focused towards hand
writing and selection of rules. In addition, it cannot be used to directly compare two pattern
learning systems.
What’s Wrong With My NLP?6 is a tool for jointly visualizing various natural language
processing formats such as trees, graphs, and entities. It shares our system’s focus on
diagnosing errors so they can be fixed, but differs in that it provides no tools to
drill down and find the source of errors. Since I focus on a particular task and a learning
mechanism, I am able to develop a specialized tool that can provide more functionality,
which is presumably harder for a generic visualization tool.
5 http://gate.ac.uk
6 https://code.google.com/p/whatswrong
8.7 Future Work and Conclusion
A limitation of the tool is that there is no way for a user to give feedback, such as to provide
the oracle label of a learned entity. Currently, the oracle labels are assigned offline. It would
be useful to extend the interface to visualize diagnostic information of learned relations, in
addition to entities, from a pattern-based relation learning system. Another avenue of future
work is to evaluate SPIED-Viz by studying its users and their interactions with the system.
In addition, the visualization can be improved by summarizing the diagnostic information,
such as which parameters led to what mistakes, to make it easier to understand for systems
that extract a large number of patterns and entities.
In this chapter, I present a novel diagnostic tool for pattern-based entity learning that
visualizes and compares the output of one or two systems. It is a lightweight, web-browser-based
visualization that can be used with any pattern-based entity learner.
I make the code of an end-to-end system freely available. The system learns entities and
patterns using bootstrapping starting with seed dictionaries, and visualizes the diagnostic
output. The tool was crucial in diagnosing the systems I built in Chapters 6 and 7. The
problem with unlabeled entities in a pattern learning system, as described in Chapter 6,
became apparent when I was identifying the sources of errors in earlier systems using the
interface. I hope SPIED will also help other researchers and users diagnose errors and
tune parameters in their pattern-based entity learning systems in an easy and efficient way.
Chapter 9
Conclusions
In this dissertation, I presented bootstrapped pattern-based learning (BPL) as an effective
approach for entity extraction tasks that have no fully labeled data. Even though many real-
world information extraction tasks have no fully labeled data, developers can frequently
gather a few examples, either manually or by using existing knowledge bases. These
examples can be used as seed sets, along with an unlabeled corpus, for bootstrapped learning.
I proposed two new tasks and showed that BPL extracts information effectively, starting
out with only a few handwritten patterns or automatically constructed dictionaries. Ex-
isting BPL systems underutilize the unlabeled data. I proposed improvements to BPL by
leveraging unlabeled data in its pattern and entity scoring functions.
The two tasks I proposed had not been studied before: 1. studying the influence of
academic papers and communities by extracting techniques, domain, and focus entities, and 2.
extracting medical entities of types symptom-and-condition and drug-and-treatment from
challenging and loosely-structured patient-authored text. A bootstrapped system is suitable
for both problems since the tasks are new and there does not exist any fully labeled dataset
to train supervised classifiers.
In Chapter 4, I described the first task, our approach, and a case study of the computa-
tional linguistics academic community. I bootstrapped with a few hand-written patterns for
the three entity types. Since the sentences in scientific articles are well-formed and gram-
matical, I learned dependency patterns using the dependency parses of the sentences. I also
showed that for one entity type our system outperformed a fully supervised conditional
random field (CRF) model. In a case study, I discuss how the Speech Recognition sub-
community has been very influential in the computational linguistics community, mainly
because it introduced now-standard techniques, such as hidden Markov models,
expectation maximization, and language modeling.
I described the second task of extracting medical entities from patient-authored text on
online health forums in Chapter 5. The dataset was challenging because of the mismatch in
the content of the existing bio-informatics resources and the online health forums. Patients
use slang, colloquial, and descriptive phrases, usually not found in existing medical on-
tologies. Thus, our system started the learning process with dictionaries consisting mostly
of well-formed official names of medical entities. Over the iterations, it learned various
informal entities, including spelling mistakes, abbreviations, sub-phrases, and new terms.
Our system outperformed commonly used medical annotators (MetaMap and OBA), sta-
tistical approaches (self-trained CRFs), existing pattern-based learning systems (Xu et al.,
2008), and other dictionary-based approaches. I presented a case study of comparing the
anecdotal efficacy of two new alternate treatments extracted by our system. The analysis
shows that our system can potentially be used to study the efficacy and side effects of drugs
and treatments at a large scale.
In Chapters 6 and 7, I proposed improvements to BPL systems by exploiting the unla-
beled data. Similar to many distantly supervised learning systems, existing bootstrapped
pattern-based learning systems either ignore the unlabeled data or make closed world as-
sumptions. In Chapter 6, I discussed how to improve BPL’s pattern scoring by evaluating
the unlabeled entities extracted by patterns. I proposed five unsupervised features, such
as distributional similarity and contrasting domain vs. generic text, to predict labels of
unlabeled entities, and used the predictions to rank patterns better. My system performed
significantly better than the existing pattern ranking approaches.
In Chapter 7, I proposed a method to improve BPL’s entity scoring using distributed
representation of words obtained by recent unsupervised neural network approaches. I
modeled the entity scoring function using a logistic regression classifier. Existing distantly
supervised approaches create the training set for the classifier using the known entities as
labeled examples and sampling the unlabeled entities as negative examples. This results
in the training set being both limited and noisy. The training set is limited in size because
the number of known entities is small. The dataset is noisy because many of the unlabeled
examples sampled as negative can actually be positive. I proposed a better way to create
the training set by exploiting similarity of known entities to the unlabeled entities. I used
distributed representations of words to find unlabeled entities that are similar to seed en-
tities and incorporated the k nearest neighbors in the training set. The proposed system
outperformed the baseline systems.
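The k-nearest-neighbor expansion step can be sketched as follows. This is a minimal illustration using cosine similarity over pre-computed word vectors, with made-up entity names; it is not the Chapter 7 implementation.

```python
import numpy as np

def expand_positives(seed_vecs, unlabeled_vecs, k=2):
    """Return the k unlabeled entities most similar to the seed set
    (maximum cosine similarity to any seed entity), to be added as
    extra positive training examples.

    seed_vecs / unlabeled_vecs: dicts mapping entity -> embedding vector.
    """
    def unit(v):
        return v / np.linalg.norm(v)

    # Stack normalized seed vectors so one matrix product gives all
    # cosine similarities for a candidate at once.
    seed_matrix = np.stack([unit(v) for v in seed_vecs.values()])
    sims = {ent: float(np.max(seed_matrix @ unit(vec)))
            for ent, vec in unlabeled_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```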
The improvements I made to BPL’s pattern and entity scoring functions underscore the
potential of unlabeled data. Computing similarity between words in an unsupervised way
has always been a topic of interest. However, recent progress in deep learning
approaches has improved accuracy. Our systems illustrate that in a distantly supervised
or semi-supervised system, performance can be improved by leveraging the unlabeled data
using unsupervised measures. This insight can be applied to existing relation learning
systems like NELL (Carlson et al., 2010b) and OLLIE (Mausam et al., 2012).
One of the main benefits of using patterns, apart from being effective and fast, is that
they are interpretable. Non-machine-learning experts can understand a pattern and its ex-
tractions to identify the source of errors and possible improvements. I presented a diagnos-
tic and visualization system for pattern-based learning systems in Chapter 8. The problem
with unlabeled entities extracted by patterns, described in Chapter 6, was diagnosed using
the visualization system. It can compare multiple systems by displaying their output and
the provenance of learned entities and patterns, helping developers to tune the parameters
of a system, which is usually not possible in a classifier-based system.
There are several interesting avenues to explore in the future. I discuss them below.
• Semantic drift: One main challenge of a bootstrapped system is avoiding semantic
drift, the phenomenon in which learning of a few false examples leads to a snowball
effect of learning more false patterns and examples. There has been some work
on detecting semantic drift (McIntosh and Curran, 2009). One solution is to have
a human give feedback occasionally to steer the system in the correct direction, as
implemented by NELL (Carlson et al., 2010b). More work is needed in automatically
detecting semantic drift using machine learning-based approaches and figuring out
the optimal opportunities for human input.
• Seed set quality and quantity: Modeling the amount and quality of supervision
needed to build an effective bootstrapped system is hard but important from a
practical viewpoint. The question is not trivial since the definition of a class
is conveyed via the seed sets. When the seed sets do not cover the label set, the scope
of the task is not clear. For example, if the seed set consists of two football athlete
names, it is unclear whether the task is to learn all athlete names or just names of
footballers. Moreover, Pantel et al. (2009) showed that the seed set composition con-
siderably affects performance. In their experiments, the difference between the best
performing seed set and the worst performing seed set was 42% precision points and
39% recall points. Thus, using human experts to list high quality seed sets might lead
to a big performance boost. However, acquiring manually labeled seed sets would be
unrealistic for a large number of classes, as is the case with several open IE systems.
• Feature-based sequence models: A comprehensive comparison of pattern-based
models with sequence models (such as HMMs and CRFs) is an important piece of work
missing from the existing information extraction literature. In this dissertation, I show
that BPL approaches outperformed unsupervised and self-trained CRFs. Supervised
CRFs, on the other hand, are the go-to tools for building entity extraction systems
in academia. More experiments comparing BPL with supervised CRFs and hand-
written rule-based systems can give insights into strengths and weaknesses of each
type of system. Moreover, these insights can inform better hybrid systems that com-
bine pattern-based approaches and feature-based sequence models.
Pattern-based IE systems are very popular in industry; however, most of them are
handwritten. I hope my work on bootstrapped pattern-based learning will help in automating
these systems in an effective and efficient way. My research on exploiting the unlabeled
data in a bootstrapped system can improve accuracy of not only pattern-based machine
learning systems, but also presumably of feature-based machine learning systems. Fur-
thermore, I hope my research will motivate more researchers to work on the practical and
interpretable IE problems, bringing the academic world closer to the world of industry.
Appendix A
Stop Words List
Medical stop words
The medical stop words list is used to identify words that are common in medical text but are
not relevant to the entity types of the systems I worked on. Some of the words are not
dictionary words because the dataset is from online health forums, where users frequently
use abbreviations and slang words. Following is the list of words.
disease, diseases, disorder, symptom, symptoms, drug, drugs, problems, problem, prob,
probs, med, meds, pill, pills, medicine, medicines, medication, medications, treatment,
treatments, caps, capsules, capsule, tablet, tablets, tabs, doctor, dr, dr., doc, physician,
physicians, test, tests, testing, specialist, specialists, side-effect, side-effects, pharmaceu-
tical, pharmaceuticals, pharma, diagnosis, diagnose, diagnosed, exam, challenge, device,
condition, conditions, suffer, suffering, suffered, feel, feeling, prescription, prescribe, pre-
scribed, over-the-counter, otc
General stop words
The general stop words list is used to identify words that are commonly used in the English
language but do not belong to the entity types of the systems I worked on. It is similar to
many other stop word lists available on the web. It contains stop words for both whitespace-
tokenized text and PTB-tokenized1 text. Below is the list of words.
a, about, above, after, again, against, all, am, an, and, any, are, aren’t, as, at, be, because,
been, before, being, below, between, both, but, by, can, can’t, cannot, could, couldn’t,
did, didn’t, do, does, doesn’t, doing, don’t, down, during, each, few, for, from, further,
had, hadn’t, has, hasn’t, have, haven’t, having, he, he’d, he’ll, he’s, her, here, here’s, hers,
herself, him, himself, his, how, how’s, i, i’d, i’ll, i’m, i’ve, if, in, into, is, isn’t, it, it’s, its,
itself, let’s, me, more, most, mustn’t, my, myself, no, nor, not, of, off, on, once, only, or,
other, ought, our, ours, ourselves, out, over, own, same, shan’t, she, she’d, she’ll, she’s,
should, shouldn’t, so, some, such, than, that, that’s, the, their, theirs, them, themselves,
then, there, there’s, these, they, they’d, they’ll, they’re, they’ve, this, those, through, to,
too, under, until, up, very, was, wasn’t, we, we’d, we’ll, we’re, we’ve, were, weren’t, what,
what’s, when, when’s, where, where’s, which, while, who, who’s, whom, why, why’s,
with, won’t, would, wouldn’t, you, you’d, you’ll, you’re, you’ve, your, yours, yourself,
yourselves, n’t, ’re, ’ve, ’d, ’s, ’ll, ’m.
1 See http://nlp.stanford.edu/software/tokenizer.shtml and https://catalog.ldc.upenn.edu/LDC99T42 for the tokenization details.
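As a hypothetical usage sketch (not code shipped with the systems described in this dissertation), candidate phrases can be filtered against these lists before being considered as entity candidates; the clitic entries (n't, 're, 've, ...) cover PTB-tokenized text. Only excerpts of the lists are shown.

```python
# Excerpts of the two stop word lists above; the full lists are given
# in this appendix.
MEDICAL_STOPS = {"disease", "symptom", "meds", "doctor", "diagnosis"}
GENERAL_STOPS = {"a", "the", "and", "of", "n't", "'re", "'ve"}

def is_candidate_entity(phrase):
    """Reject phrases made up entirely of stop words (case-insensitive)."""
    tokens = phrase.lower().split()
    return bool(tokens) and not all(
        t in MEDICAL_STOPS or t in GENERAL_STOPS for t in tokens
    )
```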
Bibliography
Abney, Steven (2004). “Understanding the Yarowsky Algorithm”. In: Computational Lin-
guistics 30, pp. 365–395.
Agichtein, Eugene and Luis Gravano (2000). “Snowball: Extracting Relations from Large
Plain-text Collections”. In: Proceedings of the Fifth ACM Conference on Digital Li-
braries. DL’00.
Akbik, Alan, Oresti Konomi, and Michail Melnikov (2013). “Propminer: A Workflow for
Interactive Information Extraction and Exploration using Dependency Trees”. In: As-
sociation for Computer Linguistics System Demonstrations.
Angeli, Gabor, Sonal Gupta, Melvin Johnson Premkumar, Christopher D. Manning, Christo-
pher Re, Julie Tibshirani, Jean Y. Wu, Sen Wu, and Ce Zhang (2014). “Stanford’s Dis-
tantly Supervised Slot Filling Systems for KBP 2014”. In: Proceedings of the Text An-
alytics Conference.
Aronson, Alan R (2001). “Effective mapping of biomedical text to the UMLS Metathe-
saurus: the MetaMap program.” In: Proceedings of the AMIA Symposium.
Aronson, Alan R and Francois-Michel Lang (2010). “An overview of MetaMap: historical
perspective and recent advances”. In: Journal of the American Medical Informatics
Association 17, pp. 229–236.
Bellare, Kedar, Partha Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman,
Andrew McCallum, and Mark Dredze (2007). “Lightly-Supervised Attribute Extraction
for Web Search”. In: NIPS 2007 Workshop on Machine Learning for Web Search.
Bethard, Steven and Dan Jurafsky (2010). “Who should I cite: learning literature search
models from citation behavior”. In: Proceedings of the Conference on Information and
Knowledge Management.
Bird, Steven, Robert Dale, Bonnie J. Dorr, Bryan Gibson, Mark T. Joseph, Min-Yen Kan,
Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan (2008). “The ACL
Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Com-
putational Linguistics”. In: Proceedings of the Conference on Language Resources and
Evaluation (LREC).
Blei, David, Andrew Ng, and Michael I. Jordan (2003). “Latent Dirichlet Allocation”. In:
Journal of Machine Learning Research (JMLR) 3, pp. 993–1022.
Blum, Avrim and Tom Mitchell (1998). “Combining Labeled and Unlabeled Data with
Co-training”. In: Conference on Learning Theory (COLT).
Boella, Guido, Luigi Di Caro, and Livio Robaldo (2013). “Semantic Relation Extraction
from Legislative Text Using Generalized Syntactic Dependencies and Support Vector
Machines”. In: Proceedings of the 7th International Conference on Theory, Practice,
and Applications of Rules on the Web. RuleML’13.
Bollacker, Kurt, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor (2008).
“Freebase: A collaboratively created graph database for structuring human knowledge”.
In: International Conference on Management of Data (SIGMOD), pp. 1247–1250.
Brauer, Falk, Robert Rieger, Adrian Mocan, and Wojciech M. Barczynski (2011). “En-
abling information extraction by inference of regular expressions from sample enti-
ties”. In: Proceedings of the International Conference on Information and Knowledge
Management.
Brin, Sergey (1999). Extracting Patterns and Relations from the World Wide Web. Technical
Report SIDL-WP-1999-0119. Stanford InfoLab.
Brown, Peter F., Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jennifer
C. Lai (1992). “Class-Based n-gram Models of Natural Language”.
In: Computational Linguistics 18, pp. 467–479.
Buckley, Chris, Darrin Dimmick, Ian Soboroff, and Ellen Voorhees (2007). “Bias and the
limits of pooling for large collections”. In: Information Retrieval 10, pp. 491–508.
Buitelaar, Paul and Bernardo Magnini (2005). “Ontology Learning from Text: An Overview”.
In: In Ontology Learning from Text: Methods, Applications and Evaluation. IOS Press,
pp. 3–12.
Bunescu, Razvan C. and Raymond J. Mooney (2005). “A Shortest Path Dependency Ker-
nel for Relation Extraction”. In: Empirical Methods in Natural Language Processing
(EMNLP).
Butler, Declan (2013). “When Google got flu wrong”. In: Nature 494, pp. 155–156.
Califf, Mary Elaine and Raymond J. Mooney (1999). “Relational Learning of Pattern-
match Rules for Information Extraction”. In: Association for the Advancement of Arti-
ficial Intelligence (AAAI), pp. 328–334.
Carlson, Andrew, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom
M. Mitchell (2010a). “Coupled Semi-supervised Learning for Information Extraction”.
In: Web Search and Data Mining (WSDM), pp. 101–110.
Carlson, Andrew, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr.,
and Tom M. Mitchell (2010b). “Toward an architecture for never-ending language
learning”. In: Association for the Advancement of Artificial Intelligence (AAAI).
Carneiro, Herman A. and Eleftherios Mylonakis (2009). “Google trends: a web-based tool
for real-time surveillance of disease outbreaks”. In: Clinical Infectious Diseases 10,
pp. 1557–1564.
Chang, Angel X. and Christopher D. Manning (2014). TokensRegex: Defining cascaded
regular expressions over tokens. Tech. rep. Department of Computer Science, Stanford
University (CSTR 2014-02).
Chinchor, Nancy A. (1998). “Proceedings of the Seventh Message Understanding Confer-
ence (MUC-7) Named Entity Task Definition”. In: Proceedings of the Seventh Message
Understanding Conference (MUC-7).
Chiticariu, Laura, Yunyao Li, and Frederick R. Reiss (2013). “Rule-Based Information
Extraction is Dead! Long Live Rule-Based Information Extraction Systems!” In: Em-
pirical Methods in Natural Language Processing (EMNLP), pp. 827–832.
Ciravegna, Fabio (2001). “Adaptive information extraction from text by rule induction and
generalisation”. In: International Joint Conference on Artificial Intelligence (IJCAI),
pp. 1251–1256.
Ciravegna, Fabio, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks (2002). “User-system
cooperation in document annotation based on information extraction”. In: Proceedings
of the 13th International Conference on Knowledge Engineering and Knowledge Man-
agement.
Clark, Alexander (2001). “Unsupervised induction of stochastic context free grammars
with distributional clustering”. In: Computational Natural Language Learning (CoNLL).
Cohen, William W. and Sunita Sarawagi (2004). “Exploiting Dictionaries in Named Entity
Extraction: Combining semi-Markov Extraction Processes and Data Integration Meth-
ods”. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining.
Cohen, William W. and Yoram Singer (1999). “A simple, fast, and effective rule learner”.
In: Association for the Advancement of Artificial Intelligence (AAAI), pp. 335–342.
Collins, Michael and Yoram Singer (1999). “Unsupervised Models for Named Entity Clas-
sification”. In: Empirical Methods in Natural Language Processing (EMNLP).
Collobert, Ronan and Jason Weston (2008). “A Unified Architecture for Natural Language
Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the
25th International Conference on Machine Learning. ICML ’08.
de Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning (2006).
“Generating typed dependency parses from phrase structure parses”. In: Proceedings
of the Conference on Language Resources and Evaluation (LREC).
Demner-Fushman, Dina and Jimmy Lin (2007). “Answering clinical questions with knowledge-
based and statistical techniques”. In: Computational Linguistics 33, pp. 63–103.
Downey, Doug, Oren Etzioni, Stephen Soderland, and Daniel S. Weld (2004). “Learning
Text Patterns for Web Information Extraction and Assessment”. In: Proceedings of the
2004 AAAI Workshop on Adaptive Text Extraction and Mining (ATEM).
Downey, Doug, Oren Etzioni, and Stephen Soderland (2010). “Analysis of a Probabilistic
Model of Redundancy in Unsupervised Information Extraction”. In: Artificial Intelli-
gence 174.11, pp. 726–748. ISSN: 0004-3702.
Druck, Gregory, Gideon Mann, and Andrew McCallum (2008). “Learning from Labeled
Features using Generalized Expectation Criteria”. In: ACM Special Interest Group on
Information Retrieval (SIGIR), pp. 595–602.
Epstein, Richard H., Paul St. Jacques, Michael Stockin, Brian Rothman, Jesse M. Ehren-
feld, and Joshua C. Denny (2013). “Automated identification of drug and food allergies
entered using non-standard terminology”. In: Journal of the American Medical Infor-
matics Association 20, pp. 962–968.
Etzioni, Oren, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen
Soderland, Daniel S. Weld, and Alexander Yates (2005). “Unsupervised named-entity
extraction from the web: An experimental study”. In: Artificial Intelligence 165.1, pp. 91–
134.
Fader, Anthony, Stephen Soderland, and Oren Etzioni (2011). “Identifying Relations for
Open Information Extraction”. In: Empirical Methods in Natural Language Processing
(EMNLP).
Fellbaum, Christiane (1998). WordNet: An Electronic Lexical Database. MIT Press.
Finkel, Jenny R. (2010). “Holistic Language Processing: Joint Models of Linguistic Struc-
ture”. PhD thesis. Stanford University.
Finkel, Jenny R., Trond Grenager, and Christopher Manning (2005). “Incorporating non-
local information into information extraction systems by Gibbs sampling”. In: Associ-
ation for Computational Linguistics (ACL), pp. 363–370.
Fox, Susannah and Maeve Duggan (2013). Health Online.
http://www.pewinternet.org/Reports/2013/Health-online.aspx.
Frantzi, Katerina, Sophia Ananiadou, and Hideki Mima (2000). “Automatic recognition of
multi-word terms: the C-value/NC-value method”. In: International Journal on Digital
Libraries 3.2, pp. 115–130.
Frati-Munari, Alberto C., Blanca E. Gordillo, Perla Altamirano, and C. Raul Ariza (1998).
“Hypoglycemic effect of Opuntia streptacantha Lemaire in NIDDM”. In: Diabetes Care
11, pp. 63–66.
Freitag, Dayne (1998). “Toward General-Purpose Learning for Information Extraction”. In:
International Conference on Computational Linguistics (COLING), pp. 404–408.
Freitag, Dayne and Nicholas Kushmerick (2000). “Boosted Wrapper Induction”. In: Asso-
ciation for the Advancement of Artificial Intelligence (AAAI), pp. 577–583.
Gerrish, Sean M. and David M. Blei (2010). “A language-based approach to measuring
scholarly impact”. In: International Conference on Machine Learning (ICML).
Govindaraju, Vidhya, Ce Zhang, and Christopher Re (2013). “Understanding Tables in
Context Using Standard NLP Toolkits”. In: Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics (Volume 2: Short Papers).
Gu, Baohua (2002). “Recognizing named entities in biomedical texts”. MA thesis. National
University of Singapore.
Gupta, Rahul, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu (2014a).
“Biperpedia: An Ontology for Search Applications”. In: Proceedings of the VLDB En-
dowment 7.7, pp. 505–516.
Gupta, Sonal and Christopher D. Manning (2011). “Analyzing the Dynamics of Research
by Extracting Key Aspects of Scientific Papers”. In: Proceedings of the International
Joint Conference on Natural Language Processing.
— (2014a). “Improved Pattern Learning for Bootstrapped Entity Extraction”. In: Compu-
tational Natural Language Learning (CoNLL).
— (2014b). “SPIED: Stanford Pattern-based Information Extraction and Diagnostics”. In:
Proceedings of the ACL 2014 Workshop on Interactive Language Learning, Visualiza-
tion, and Interfaces (ACL-ILLVI).
— (2015). “Distributed Representations of Words to Guide Bootstrapped Entity Classi-
fiers”. In: North American Association for Computational Linguistics (NAACL).
Gupta, Sonal, Diana MacLean, Jeffrey Heer, and Christopher D. Manning (2014b). “In-
duced Lexico-Syntactic Patterns Improve Information Extraction from Online Medical
Forums”. In: Journal of the American Medical Informatics Association 21, pp. 902–
909.
Hall, David, Daniel Jurafsky, and Christopher D Manning (2008). “Studying the history
of ideas using topic models”. In: Empirical Methods in Natural Language Processing
(EMNLP).
Hassan, Hany, Ahmed Hassan, and Ossama Emam (2006). “Unsupervised Information Ex-
traction Approach Using Graph Mutual Reinforcement”. In: Proceedings of the 2006
Conference on Empirical Methods in Natural Language Processing. EMNLP ’06.
He, Yifan and Ralph Grishman (2015). “ICE: Rapid Information Extraction Customiza-
tion for NLP Novices”. In: Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics – Human Language Tech-
nologies (System Demonstrations).
Hearst, Marti A. (1992). “Automatic acquisition of hyponyms from large text corpora”. In:
International Conference on Computational Linguistics (COLING), pp. 539–545.
Hobbs, Jerry R. and Ellen Riloff (2010). “Information Extraction”. In: Handbook of Natural
Language Processing, Second Edition. ISBN 978-1420085921.
Hobbs, Jerry R., John Bear, David Israel, and Mabry Tyson (1993). “FASTUS: A finite-
state processor for information extraction from real-world text”. In: International Joint
Conference on Artificial Intelligence. IJCAI’93, pp. 1172–1178.
Hobbs, Jerry R., Douglas E. Appelt, John Bear, David J. Israel, Megumi Kameyama, Mark
E. Stickel, and Mabry Tyson (1997). “FASTUS: A Cascaded Finite-State Transducer for
Extracting Information from Natural-Language Text”. In: Computing Research Repos-
itory (CoRR) cmp-lg/9705013.
Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel
(2006). “OntoNotes: The 90% Solution”. In: Proceedings of the Human Language
Technology Conference of the NAACL. NAACL-Short ’06.
Illig, Jens, Benjamin Roth, and Dietrich Klakow (2014). “Unsupervised Parsing for Gener-
ating Surface-Based Relation Extraction Patterns”. In: European Association for Com-
putational Linguistics (EACL).
Jean-Louis, Ludovic, Romaric Besancon, Olivier Ferret, and Wei Wang (2011). “Using
a weakly supervised approach and lexical patterns for the KBP slot filling task”. In:
Proceedings of the Text Analysis Conference - Knowledge Base Population (KBP).
Jonquet, Clement, Nigam H. Shah, and Mark A. Musen (2009). “The Open Biomedical
Annotator”. In: Summit on translational bioinformatics 2009, pp. 56–60.
Kang, Ning, Bharat Singh, Zubair Afzal, Erik M. van Mulligen, and Jan A. Kors (2012).
“Using rule-based natural language processing to improve disease normalization in
biomedical text”. In: Journal of the American Medical Informatics Association 20,
pp. 876–881.
Khan, Alam, Mahpara Safdar, Mohammad Muzaffar Ali Khan, Khan Nawaz Khattak, and
Richard A. Anderson (2003). “Cinnamon improves glucose and lipids of people with
type 2 diabetes”. In: Diabetes Care 26.12, pp. 3215–3218.
Kim, Jin-Dong, Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii (2003). “GENIA corpus -
A semantically annotated corpus for bio-textmining”. In: ISMB (Supplement of Bioin-
formatics), pp. 180–182.
Kim, Jin-Dong, Tomoko Ohta, and Jun’ichi Tsujii (2008). “Corpus annotation for mining
biomedical events from literature.” In: BMC Bioinformatics 9.
Kleinberg, Jon M. (1999). “Authoritative sources in a hyperlinked environment”. In: Jour-
nal of the ACM 46, pp. 604–632.
Koo, Terry, Xavier Carreras, and Michael Collins (2008). “Simple Semi-Supervised De-
pendency Parsing”. In: Human Language Technology and Association for Computa-
tional Linguistics (HLT/ACL).
Kozareva, Zornitsa and Eduard H. Hovy (2013). “Tailoring the automated construction of
large-scale taxonomies using the web.” In: Language Resources and Evaluation 47.3,
pp. 859–890.
Lafferty, John, Andrew McCallum, and Fernando Pereira (2001). “Conditional Random
Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”. In:
International Conference on Machine Learning (ICML), pp. 282–289.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham (1998). “An introduction to latent
semantic analysis”. In: Discourse processes 25, pp. 259–284.
Leaman, Robert, Laura Wojtulewicz, Ryan Sullivan, Annie Skariah, Jian Yang, and Gra-
ciela Gonzalez (2010). “Towards internet-age pharmacovigilance: Extracting adverse
drug reactions from user posts to health-related social networks”. In: Proceedings of
the 2010 workshop on biomedical natural language processing, pp. 117–125.
Letham, Benjamin, Cynthia Rudin, Tyler H. Mccormick, and David Madigan (2013). In-
terpretable classifiers using rules and Bayesian analysis: Building a better stroke pre-
diction model. Tech. rep. Department of Statistics, University of Washington (Report
No. 609).
Li, Na, Leilei Zhu, Prasenjit Mitra, Karl Mueller, Eric Poweleit, and C. Lee Giles (2010).
“OreChem ChemXSeer: A semantic digital library for chemistry”. In: Proceedings of
the Joint Conference on Digital libraries.
Li, Shen, Joao V. Graca, and Ben Taskar (2012a). “Wiki-ly Supervised Part-of-speech Tag-
ging”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natu-
ral Language Processing and Computational Natural Language Learning. EMNLP-
CoNLL ’12.
Li, Yunyao, Vivian Chu, Sebastian Blohm, Huaiyu Zhu, and Howard Ho (2011). “Facili-
tating Pattern Discovery for Relation Extraction with Semantic-signature-based Clus-
tering”. In: Proceedings of the 20th ACM International Conference on Information and
Knowledge Management.
Li, Yunyao, Laura Chiticariu, Huahai Yang, Frederick R. Reiss, and Arnaldo Carreno-
fuentes (2012b). “WizIE: A Best Practices Guided Development Environment for In-
formation Extraction”. In: Proceedings of the ACL 2012 System Demonstrations.
Liakata, Maria, Claire Q, and Larisa N. Soldatova (2009). “Semantic Annotation of Pa-
pers: Interface & Enrichment Tool (SAPIENT)”. In: Proceedings of the BioNLP 2009
Workshop.
Liang, Percy (2005). “Semi-Supervised Learning for Natural Language”. MA thesis. Mas-
sachusetts Institute of Technology.
Lin, Winston, Roman Yangarber, and Ralph Grishman (2003). “Bootstrapped Learning of
Semantic Classes from Positive and Negative Examples”. In: International Conference
on Machine Learning (ICML).
Liu, Bin, Laura Chiticariu, Vivian Chu, H. V. Jagadish, and Frederick R. Reiss (2010).
“Automatic Rule Refinement for Information Extraction”. In: Proceedings of the VLDB
Endowment 3.1-2, pp. 588–597.
MacLean, Diana (2015). “Insights from Patient Authored Text : From Close Reading to
Automated Extraction”. PhD thesis. Stanford University.
MacLean, Diana and Jeffrey Heer (2013). “Identifying Medical Terms in Patient-Authored
Text: A Crowdsourcing-based Approach”. In: Journal of the American Medical Infor-
matics Association 20, pp. 1120–1127.
MacLean, Diana, Sonal Gupta, Anna Lembke, Christopher D. Manning, and Jeffrey Heer
(2015). “Forum77: An Analysis of an Online Health Forum Dedicated to Addiction
Recovery”. In: Computer Supported Cooperative Work and Social Computing (CSCW).
Mann, Gideon and Andrew McCallum (2008). “Generalized Expectation Criteria for Semi-
Supervised Learning of Conditional Random Fields”. In: Human Language Technology
and Association for Computational Linguistics (HLT/ACL), pp. 870–878.
Manning, Christopher, Prabhakar Raghavan, and Hinrich Schutze (2008). Introduction to
Information Retrieval. Vol. 1. Cambridge University Press.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard,
and David McClosky (2014). “The Stanford CoreNLP natural language processing
toolkit”. In: ACL system demonstrations.
Martins, Andre F. T., Noah A. Smith, Pedro M. Q. Aguiar, and Mario A. T. Figueiredo
(2011). “Structured Sparsity in Structured Prediction”. In: Proceedings of the Confer-
ence on Empirical Methods in Natural Language Processing. EMNLP’11.
Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni (2012).
“Open language learning for information extraction”. In: Empirical Methods in Nat-
ural Language Processing and Computational Natural Language Learning (EMNLP/-
CoNLL), pp. 523–534.
McIntosh, Tara and James R. Curran (2009). “Reducing Semantic Drift with Bagging
and Distributional Similarity”. In: Association for Computational Linguistics (ACL),
pp. 396–404.
McLernon, Brian and Nicholas Kushmerick (2006). “Transductive Pattern Learning for
Information Extraction”. In: Proceedings of the Workshop on Adaptive Text Extraction
and Mining (ATEM 2006).
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean (2013a). “Dis-
tributed Representations of Words and Phrases and their Compositionality”. In: Ad-
vances in Neural Information Processing Systems (NIPS).
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013b). Efficient Estimation
of Word Representations in Vector Space. Tech. rep. 1301.3781. arXiv.
Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky (2009). “Distant supervision for
relation extraction without labeled data”. In: Association for Computational Linguistics
(ACL), pp. 1003–1011.
Nallapati, Ramesh and Christopher D. Manning (2008). “Legal Docket-entry Classifica-
tion: Where Machine Learning Stumbles”. In: Empirical Methods in Natural Language
Processing (EMNLP), pp. 438–446.
Natarajan, Nagarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari (2013).
“Learning with Noisy Labels”. In: Advances in Neural Information Processing Systems.
Niu, Feng, Ce Zhang, Christopher Re, and Jude Shavlik (2012). “Elementary: Large-Scale
Knowledge-Base Construction via Machine Learning and Statistical Inference”. In: In-
ternational Journal on Semantic Web and Information Systems 8.3, pp. 42–73.
Noreen, E. (1989). Computer-intensive Methods for Testing Hypotheses: An Introduction.
John Wiley and Sons Inc.
Pado, Sebastian (2006). User’s guide to sigf: Significance testing by approximate ran-
domisation. http://www.nlpado.de/~sebastian/software/sigf.shtml.
Pantel, Patrick, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas
(2009). “Web-scale Distributional Similarity and Entity Set Expansion”. In: Proceed-
ings of the 2009 Conference on Empirical Methods in Natural Language Processing:
Volume 2 - Volume 2. EMNLP ’09, pp. 938–947.
Pasca, Marius (2004). “Acquisition of Categorized Named Entities for Web Search”. In:
Proceedings of the Thirteenth ACM International Conference on Information and Knowl-
edge Management. CIKM ’04.
Passos, Alexandre, Vineet Kumar, and Andrew McCallum (2014). “Lexicon Infused Phrase
Embeddings for Named Entity Resolution”. In: Proceedings of the Eighteenth Confer-
ence on Computational Natural Language Learning. Association for Computational
Linguistics, pp. 78–86.
Patwardhan, S. (2010). “Widening the Field of View of Information Extraction through
Sentential Event Recognition”. PhD thesis. University of Utah.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). “GloVe: Global
Vectors for Word Representation”. In: Empirical Methods in Natural Language Pro-
cessing (EMNLP).
Poon, Hoifung and Pedro Domingos (2010). “Unsupervised Ontology Induction from Text”.
In: Association for Computational Linguistics (ACL).
Pratt, Wanda and Meliha Yetisgen-Yildiz (2003). “A study of biomedical concept identi-
fication: MetaMap vs. people”. In: AMIA Annual Symposium Proceedings. Vol. 2003,
pp. 529–533.
Putthividhya, Duangmanee (Pew) and Junling Hu (2011). “Bootstrapped Named Entity
Recognition for Product Attribute Extraction”. In: Empirical Methods in Natural Lan-
guage Processing (EMNLP), pp. 1557–1567.
Radev, Dragomir R., Eduard Hovy, and Kathleen McKeown (2002). “Introduction to the
special issue on summarization”. In: Computational Linguistics 28, pp. 399–408.
Radev, Dragomir R., Pradeep Muthukrishnan, and Vahed Qazvinian (2009). “The ACL An-
thology Network corpus”. In: Proceedings of the 2009 Workshop on Text and Citation
Analysis for Scholarly Digital Libraries.
Ratinov, Lev and Dan Roth (2009). “Design Challenges and Misconceptions in Named
Entity Recognition”. In: Computational Natural Language Learning (CoNLL).
Ravichandran, Deepak and Eduard Hovy (2002). “Learning Surface Text Patterns for a
Question Answering System”. In: Association for Computational Linguistics (ACL),
pp. 41–47.
Riloff, Ellen (1993). “Automatically Constructing a Dictionary for Information Extraction
Tasks”. In: Proceedings of the Eleventh National Conference on Artificial Intelligence.
AAAI’93, pp. 811–816.
— (1996). “Automatically Generating Extraction Patterns from Untagged Text”. In: Asso-
ciation for the Advancement of Artificial Intelligence (AAAI), pp. 1044–1049.
Riloff, Ellen and Rosie Jones (1999). “Learning Dictionaries for Information Extraction
by Multi-level Bootstrapping”. In: Association for the Advancement of Artificial Intel-
ligence (AAAI).
Ritter, Alan, Luke Zettlemoyer, Mausam, and Oren Etzioni (2013). “Modeling Missing
Data in Distant Supervision for Information Extraction.” In: Transactions of the Asso-
ciation for Computational Linguistics (TACL) 1, pp. 367–378.
Roth, Benjamin and Dietrich Klakow (2013). “Combining Generative and Discriminative
Model Scores for Distant Supervision”. In: Proceedings of the 2013 Conference on
Empirical Methods in Natural Language Processing.
Ruch, Patrick, Celia Boyer, Christine Chichester, Imad Tbahriti, Antoine Geissbuhler, Paul
Fabry, Julien Gobeill, Violaine Pillet, Dietrich Rebholz-Schuhmann, Christian Lovis,
and Anne-Lise Veuthey (2007). “Using argumentation to extract key sentences from
biomedical abstracts”. In: International Journal of Medical Informatics 76, pp. 195–
200.
Sarawagi, Sunita (2008). “Information Extraction”. In: Foundations and Trends in Databases
1.3, pp. 261–377.
Schwartz, Ariel S. and Marti A. Hearst (2003). “A Simple Algorithm For Identifying Ab-
breviation Definitions in Biomedical Text”. In: Proceedings of the Pacific Symposium
on Biocomputing.
Smith, Catherine Arnott and Paul J. Wicks (2008). “PatientsLikeMe: Consumer health vo-
cabulary as a folksonomy”. In: AMIA annual symposium proceedings. Vol. 2008.
Soderland, Stephen (1999). “Learning Information Extraction Rules for Semi-Structured
and Free Text”. In: Machine Learning 34, pp. 233–272.
Soderland, Stephen, John Gilmer, Robert Bart, Oren Etzioni, and Daniel S. Weld (2013).
“Open Information Extraction to KBP Relation in 3 Hours”. In: Proceedings of the Text
Analysis Conference on Knowledge Base Population.
Stevenson, Mark and Mark A. Greenwood (2005). “A Semantic Approach to IE Pattern
Induction”. In: Association for Computational Linguistics (ACL), pp. 379–386.
Subramanya, Amarnag, Slav Petrov, and Fernando Pereira (2010). “Efficient Graph-Based
Semi-Supervised Learning of Structured Tagging Models”. In: Empirical Methods in
Natural Language Processing (EMNLP).
Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum (2007). “YAGO: A core of
semantic knowledge”. In: World Wide Web (WWW), pp. 697–706.
Suchanek, Fabian M., Mauro Sozio, and Gerhard Weikum (2009). “SOFIE: A Self-organizing
Framework for Information Extraction”. In: Proceedings of the 18th International Con-
ference on World Wide Web. WWW ’09.
Sudo, Kiyoshi, Satoshi Sekine, and Ralph Grishman (2003). “An Improved Extraction Pat-
tern Representation Model for Automatic IE Pattern Acquisition”. In: Proceedings of
the 41st Annual Meeting on Association for Computational Linguistics. ACL ’03.
Surdeanu, Mihai, Jordi Turmo, and Alicia Ageno (2006). “A Hybrid Approach for the
Acquisition of Information Extraction Patterns”. In: Proceedings of the EACL 2006
Workshop on Adaptive Text Extraction and Mining. ATEM 2006.
Surdeanu, Mihai, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning (2012).
“Multi-instance Multi-label Learning for Relation Extraction”. In: Proceedings of the
2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning. EMNLP-CoNLL ’12.
Talukdar, Partha Pratim, Thorsten Brants, Mark Liberman, and Fernando Pereira (2006).
“A Context Pattern Induction Method for Named Entity Extraction”. In: Proceedings
of the Tenth Conference on Computational Natural Language Learning. CoNLL’06.
Tateisi, Yuka, Yo Shidahara, Yusuke Miyao, and Akiko Aizawa (2014). “Annotation of
Computer Science Papers for Semantic Relation Extraction”. In: Proceedings of the
Ninth International Conference on Language Resources and Evaluation (LREC’14).
Tatonetti, Nicholas P., Guy Haskin Fernald, and Russ B. Altman (2012). “A novel sig-
nal detection algorithm for identifying hidden drug-drug interactions in adverse event
reports”. In: Journal of the American Medical Informatics Association 19, pp. 79–85.
Thelen, Michael and Ellen Riloff (2002). “A Bootstrapping Method for Learning Semantic
Lexicons using Extraction Pattern Contexts”. In: Empirical Methods in Natural Lan-
guage Processing (EMNLP), pp. 214–221.
Tibshirani, Julie and Christopher D. Manning (2014). “Robust Logistic Regression using
Shift Parameters”. In: Proceedings of the Association for Computational Linguistics.
Toutanova, Kristina and Christopher D. Manning (2003). “Feature-Rich Part-of-Speech
Tagging with a Cyclic Dependency Network”. In: Human Language Technology and
North American Association for Computational Linguistics (HLT/NAACL).
Tsai, Chen-Tse, Gourab Kundu, and Dan Roth (2013). “Concept-based analysis of scien-
tific literature”. In: Proceedings of the 22nd ACM international conference on Confer-
ence on information and knowledge management. CIKM ’13.
Turian, Joseph, Lev Ratinov, and Yoshua Bengio (2010). “Word Representations: A Sim-
ple and General Method for Semi-supervised Learning”. In: Proceedings of the 48th
Annual Meeting of the Association for Computational Linguistics. ACL ’10.
Valenzuela, Marco, Vu Ha, and Oren Etzioni (2015). “Identifying Meaningful Citations”.
In: AAAI Workshop on Scholarly Big Data.
Valenzuela-Escarcega, Marco A., Gustave Hahn-Powell, Thomas Hicks, and Mihai Sur-
deanu (2015). “A Domain-independent Rule-based Framework for Event Extraction”.
In: Proceedings of the 53rd Annual Meeting of the Association for Computational Lin-
guistics and the 7th International Joint Conference on Natural Language Processing
of the Asian Federation of Natural Language Processing: Software Demonstrations
(ACL-IJCNLP).
White, Ryen W., Nicholas P. Tatonetti, Nigam H. Shah, Russ B. Altman, and Eric Horvitz
(2013). “Web-scale pharmacovigilance: listening to signals from the crowd”. In: Jour-
nal of the American Medical Informatics Association 20, pp. 404–408.
Whitney, Max and Anoop Sarkar (2012). “Bootstrapping via Graph Propagation”. In: As-
sociation for Computational Linguistics (ACL).
Wicks, Paul, Timothy E. Vaughan, Michael P. Massagli, and James Heywood (2011). “Ac-
celerated clinical discovery using self-reported patient data collected online and a patient-
matching algorithm”. In: Nature Biotechnology 29, pp. 411–414.
Xu, Feiyu, Hans Uszkoreit, and Hong Li (2007). “A seed-driven bottom-up machine learn-
ing framework for extracting relations of various complexity”. In: Proceedings of the
45th Annual Meeting of the Association for Computational Linguistics.
Xu, Rong, Kaustubh Supekar, Yang Huang, Amar Das, and Alan Garber (2006). “Com-
bining text classification and Hidden Markov Modeling techniques for categorizing
sentences in randomized clinical trial abstracts.” In: AMIA Annual Symposium Pro-
ceedings, pp. 824–828.
Xu, Rong, Kaustubh Supekar, Alex Morgan, Amar Das, and Alan Garber (2008). “Unsu-
pervised method for automatic construction of a disease dictionary from a large free
text collection”. In: AMIA Annual Symposium Proceedings. Vol. 2008, pp. 820–824.
Xu, Wei, Raphael Hoffmann, Le Zhao, and Ralph Grishman (2013). “Filling Knowledge
Base Gaps for Distant Supervision of Relation Extraction.” In: Association for Compu-
tational Linguistics (ACL), pp. 665–670.
Yahya, Mohamed, Steven Euijong Whang, Rahul Gupta, and Alon Halevy (2014). “Re-
Noun: Fact Extraction for Nominal Attributes”. In: Empirical Methods in Natural Lan-
guage Processing (EMNLP).
Yangarber, Roman, Ralph Grishman, and Pasi Tapanainen (2000). “Automatic Acquisition
of Domain Knowledge for Information Extraction”. In: International Conference on
Computational Linguistics (COLING), pp. 940–946.
Yangarber, Roman, Winston Lin, and Ralph Grishman (2002). “Unsupervised Learning
of Generalized Names”. In: International Conference on Computational Linguistics
(COLING).
Yarowsky, David (1995). “Unsupervised word sense disambiguation rivaling supervised
methods”. In: Association for Computational Linguistics (ACL).
Yeh, A. (2000). “More accurate tests for the statistical significance of result differences.”
In: International Conference on Computational Linguistics (COLING).
Zeng, Qing T. and Tony Tse (2006). “Exploring and developing consumer health vocabu-
laries”. In: Journal of the American Medical Informatics Association 13.
Zhang, Ce, Vidhya Govindaraju, Jackson Borchardt, Tim Foltz, Christopher Re, and Shanan
Peters (2013). “GeoDeepDive: Statistical Inference Using Familiar Data-processing
Languages”. In: Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data. SIGMOD ’13.
Zhang, Qi, Yaqian Zhou, Xuanjing Huang, and Lide Wu (2008). “Graph Mutual Rein-
forcement Based Bootstrapping”. In: Proceedings of the 4th Asia Information Retrieval
Conference on Information Retrieval Technology. AIRS’08.
Zhu, Jun, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen (2009). “StatSnowball:
A Statistical Approach to Extracting Entity Relationships”. In: Proceedings of the 18th
International Conference on World Wide Web. WWW ’09.