

MULTILINGUAL ACQUISITION OF

STRUCTURED INFORMATION VIA NOVEL

RELATIONSHIP EXTRACTION MODELS

OVER DIVERSE KNOWLEDGE SOURCES

by

Nikesh Lucky Garera

A dissertation submitted to The Johns Hopkins University in conformity with the

requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

September, 2009

© Nikesh Lucky Garera 2009

All rights reserved


Abstract

This dissertation presents original techniques for a class of problems that can be collectively referred to as relationship extraction. This machine learning task involves extracting tuples from free text, the exemplar instantiations of which help model the target relationship. A wide range of relationships is explored, including semantic relationships between words, their translation equivalents in different languages, and encyclopedic facts about named entities.

This dissertation explores new relationship extraction models which exploit novel knowledge sources across a diverse set of relationship types in multiple languages. It ties together the extraction of diverse relationships in the classic seed-based minimally supervised framework. However, this framework has previously failed to capture information beyond local context, such as transitively-derived information, domain constraints and knowledge, correlations among relationships, and additional novel knowledge sources. Furthermore, the traditional seed-based learning framework fails to extract non-overt relationships, such as an author’s gender or age, when they are not explicitly stated. In contrast, some of these non-overt relationships can be inferred with an accuracy exceeding 95% via novel document-wide, discourse-feature-based and interlocutor-sensitive models. This dissertation presents new relationship extraction methods embedding a wide range of such knowledge sources in the minimally supervised learning framework.

Collectively, these methods outperform previously published algorithms on a diverse set of natural language data sources and genres, including newswire text, biographical articles, raw webpages, conversational speech transcripts and email, and on a large set of languages including Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hindi, Hungarian, Russian, Slovak, Spanish and Swedish.

Thesis Committee:

David Yarowsky, Professor, Department of Computer Science, Johns Hopkins University

Chris Callison-Burch, Assistant Research Professor, Department of Computer Science, Johns Hopkins University

James Mayfield, Principal Research Scientist, Applied Physics Laboratory, Johns Hopkins University


Acknowledgements

I feel truly fortunate to have pursued my Ph.D. studies to completion in an academic institution and a center that has some of the best faculty and students in natural language processing. The faculty, students, staff and the entire research environment here have helped me in numerous ways during the entire course of my doctoral studies at Johns Hopkins University.

Before starting my program at JHU, I had read and heard about how a student’s advisor can be a critical factor during Ph.D. studies. I feel extremely lucky to have been advised by David Yarowsky during my time at JHU. David has been a strong pillar of support and a great mentor. He has incredible faith in his students and I feel proud to be counted as one of them. He has always allowed me to work independently on various research projects of my interest, and at the same time, he has been extremely accessible to provide close guidance whenever I needed it. I have learned a lot from observing his practical and systematic approach towards doing research and contributing to the scientific community. His guidance and insightful vision have helped me immensely throughout my time here at JHU. Thank you David for being such an amazing advisor.

I would like to thank my thesis committee members Chris Callison-Burch and James Mayfield for providing prompt and valuable feedback on several drafts of the thesis. They were very supportive throughout my dissertation writing process and their insightful questions and comments helped improve the shape and content of this thesis considerably.

I would also like to thank Jason Eisner for the influence he has had on me during the PhD program. Although he was not my direct advisor, he was very approachable and open to any kind of discussion. I especially admire the enthusiasm and energy he brings when discussing and solving research problems.

I would like to thank the National Science Foundation and Johns Hopkins University’s new Center of Excellence in Human Language Technology (COE) for their financial support of my graduate work. During the last couple of years of my PhD, I was involved with the COE on various research projects. I would like to convey my thanks to Mark Dredze, Paul McNamee, Christine Piatko, James Mayfield and other COE members for providing a supportive research environment.

One of the best things that happened to me after coming to Hopkins was to find the camaraderie of the students here, who are not only among the most brilliant minds that I know of but are so much fun to hang out with even outside of work! Thanks to V Balakrishnan, John Blatz, Anoop Deoras, Markus Dreyer, Eliott Drabek, Erin Fitzgerald, Arnab Ghoshal, Ann Irvine, Sridhar Krishna, Zhifei Li, Gideon Mann, Carolina Prada, Brock Pytlik, Delip Rao, Ariya Rastrow, David Smith, Noah Smith, Jason Smith, Charles Schafer, Roy Tromble, Chris White and Omar Zaidan.

I would also like to thank the CS and CLSP staff Desiree Cleves, Debbie DeFord, Monique Folk, Laura Graham and Cathy Thornton for helping me navigate through all the administrative procedures.

My family has been a great source of strength and support during my graduate studies. I want to thank my grandparents Parumal Garera and Jaidevi Garera for their love and affection. Though my grandfather is not here in body, he was and still is a constant source of inspiration. My parents Lucky Garera and Madhu Garera have always valued education as a top priority for their children, and their upbringing and values are the strongest reason why I have been able to reach such heights in my education. My in-laws Bharat Doshi and Vidya Doshi have provided their much needed love and support along the way. My brother Deepak Garera has supported me through many milestones along the way and is always available to talk whenever I feel the need for a fun conversation!

Last, and not in any way the least, I would like to thank my wife Sujata Garera for supporting me and encouraging me in just about any goal that I plan to pursue, no matter how hard the challenge. She has been there with me through all the ups and downs, and no words are sufficient to describe how much her presence has helped me in my life.


Contents

Abstract ii

Acknowledgements iv

List of Tables xix

List of Figures xxiv

1 Introduction 1

1.1 Types of relationships explored . . . . . . . . . . . . . . . . . . . . . 3

1.2 Basic approach: Relationship extraction using seed exemplars . . . . . 6

1.2.1 Algorithm (Rapp 1999; Ravichandran and Hovy, 2002) . . . . 7

1.2.2 Context Representations . . . . . . . . . . . . . . . . . . . . . 9

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.1 Internal knowledge sources . . . . . . . . . . . . . . . . . . . . 12

1.3.2 External knowledge sources . . . . . . . . . . . . . . . . . . . 14


1.4 Outline of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.1 Part I: Cross-language relationships . . . . . . . . . . . . . . . 16

1.4.2 Part II: Semantic relationships . . . . . . . . . . . . . . . . . . 16

1.4.3 Part III: Factual relationships . . . . . . . . . . . . . . . . . . 17

I Extracting Cross-language/Translation Relationships 18

2 Part I Literature Review 19

2.1 Using Parallel Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Using Monolingual Corpora and Seed Lexicons . . . . . . . . . . . . . 21

2.3 Using Bridge Languages . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages 25

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2 Resources Utilized . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.1 Splitting compound words and gloss generation with translation lexicon lookup . . . . . 31

3.4.2 Using cross-language evidence from different bilingual dictionaries 32


3.4.3 Ranking translation candidates . . . . . . . . . . . . . . . . . 32

3.5 Evaluation using Exact-match Translation Accuracy . . . . . 33

3.6 Comparison of different compound translation models . . . . . . . . . 34

3.6.1 A simple model using literal English gloss concatenation as the translation . . . . . 35

3.6.2 Using bilingual dictionaries . . . . . . . . . . . . . . . . . . . 36

3.6.3 Using forward and backward ordering for English gloss search 38

3.6.4 Increasing coverage by automatically discovering compound morphology . . . . . 38

3.6.5 Re-ranking using context vector projection . . . . . . . . . . . 42

3.6.6 Using phrase-tables if a parallel corpus is available . . . . . . . 43

3.7 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 45

3.8 Quantifying the Role of Cross-language Selection and Usage . . . . . 46

3.8.1 Coverage/Accuracy Trade-off . . . . . . . . . . . . . . . . . . 46

3.8.2 Varying the size of bilingual dictionaries . . . . . . . . . . . . 46

3.8.3 Greedy vs Random Selection of Utilized Languages . . . . . . 49

3.8.4 Languages found using Greedy selection . . . . . . . . . . . . 52

3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


4 Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences 55

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 Translation by Context Vector Projection . . . . . 60

4.3.1 Models of Context . . . . . . . . . . . . . . . . . . . . . . . . 63

4.3.1.1 Baseline model . . . . . . . . . . . . . . . . . . . . . 63

4.3.1.2 Modeling context using dependency trees . . . . . . . 65

4.4 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.4.2 Evaluation Criterion . . . . . . . . . . . . . . . . . . . . . . . 68

4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.6 Further Extensions: Generalizing to other word types via tagset mapping 74

4.6.1 Mapping Part-of-Speech tagsets in different languages . . . . . 76

4.7 Application to Unrelated Corpora . . . . . . . . . . . . . . . . . . . . 81

4.8 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 81

4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

II Extracting Semantic Relationships 84

5 Part II Literature Review 85


5.1 Extracting relationships in a semantic taxonomy . . . . . . . . . . . . 85

5.1.1 Manually created databases . . . . . . . . . . . . . . . . . . . 86

5.1.2 Hand-crafted Patterns for “is-a” and “part-whole” relationships 87

5.1.3 Weakly supervised approaches . . . . . . . . . . . . . . . . . . 90

5.1.4 Training Supervised Classifiers . . . . . . . . . . . . . . . . . . 91

5.1.5 Clustering Approaches . . . . . . . . . . . . . . . . . . . . . . 91

5.2 Extracting complex semantic relationships . . . . . 92

6 Minimally Supervised Multilingual Taxonomy and Translation Lexicon Induction 95

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.3.1 Independently Bootstrapping Lexical Relationship Models . . . . . 101

6.3.2 A minimally supervised multi-class classifier for identifying different semantic relations . . . . . 103

6.3.3 Evaluation of the Classification Task . . . . . . . . . . . . . . 104

6.4 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 108

6.5 Improving a partial translation dictionary . . . . . 109


6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7 Extraction of Semantic Facts from Unlabeled Corpora targeting Resolution and Generation of Definite Anaphora 114

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.3 Models for Lexical Acquisition . . . . . . . . . . . . . . . . . . . . . . 120

7.3.1 TheY-Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

7.3.2 WordNet-Model (WN) . . . . . . . . . . . . . . . . . . . . . . 123

7.3.3 Combination: TheY+WordNet Model . . . . . . . . . . . . . . 124

7.3.4 OtherY-Modelfreq . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.3.5 OtherY-ModelMI(normalized) . . . . . . . . . . . . . . . . . . 126

7.3.6 Combination: TheY+OtherYMI Model . . . . . . . . . . . . . 127

7.4 Further Anaphora Resolution Results . . . . . . . . . . . . . . . . . . 127

7.5 Generation Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

7.5.1 Human experiment . . . . . . . . . . . . . . . . . . . . . . . . 131

7.5.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7.5.2.1 Individual Models . . . . . . . . . . . . . . . . . . . 133

7.5.2.2 Combining corpus-based approaches and WordNet . 133

7.5.3 Evaluation of Anaphor Generation . . . . . . . . . . . . . . . 135

7.6 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 136

7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


III Extracting Factual Relationships 138

8 Part III Literature Review 139

8.1 Literature for Modeling Explicit Relationships . . . . . 139

8.1.1 Early MUC approaches: Handcrafted Lexico-syntactic Patterns . . . . . 140

8.1.2 Machine Learning Approaches . . . . . . . . . . . . . . . . . . 141

8.1.3 Weakly Supervised Approaches using Seed-exemplars . . . . . 141

8.2 Literature for Modeling Latent Relationships . . . . . 144

8.2.1 Sociolinguistic Studies . . . . . . . . . . . . . . . . . . . . . . 145

8.2.2 Computational Approaches . . . . . . . . . . . . . . . . . . . 145

9 Structural, Transitive and Correlational Models for Biographic Fact Extraction 148

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

9.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

9.4 Contextual Pattern-Based Model . . . . . . . . . . . . . . . . . . . . 154


9.5 Partially Untethered Templatic Contextual Patterns . . . . . 158

9.6 Document-Position-Based Model . . . . . . . . . . . . . . . . . . . . 159

9.6.1 Learning Relative Ordering in the Position-Based Model . . . . . 161

9.7 Implicit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

9.7.1 Extracting Attributes Transitively using Neighboring Person-Names . . . . . 164

9.7.2 Latent-Attribute Models based on Document-Wide Context Profiles . . . . . 165

9.8 Model Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

9.9 Further Extensions: Reducing False Positives . . . . . . . . . . . . . . 168

9.9.1 Using Inter-Attribute Correlations . . . . . . . . . . . . . . . . 170

9.9.2 Using Age Distribution . . . . . . . . . . . . . . . . . . . . . . 171

9.10 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 172

9.11 Extracting factual relationships from noisy sources for a wider range of attributes . . . . . 172

9.11.1 Analysis of pattern learning component for fact extraction . . 174

9.11.2 Manually Filtering Patterns . . . . . . . . . . . . . . . . . . . 174

9.11.3 Filtering Noisy Patterns Automatically . . . . . . . . . . . . . 179


9.11.4 Evaluating automatic pattern filtering measures . . . . . 180

9.11.5 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

9.12 Application of Position-based Model to News Data . . . . . . . . . . 182

9.12.1 Corpora Details . . . . . . . . . . . . . . . . . . . . . . . . . . 183

9.12.2 Global Position Model of “Occupation” Attribute . . . . . 183

9.12.3 Modeling Position with respect to the First Name Mention . . 185

9.12.4 Modeling Position with respect to the Closest Name Mention . . . . . 185

9.12.5 Modeling Position with respect to the Closest Full or Partial Name Mention . . . . . 188

9.12.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

9.13 Using Biographical Facts for Name Disambiguation . . . . . . . . . . 190

9.14 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

10 Modeling Latent Biographical Attributes in Conversational Genres 195

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

10.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

10.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

10.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

10.3 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203


10.4 Modeling Gender via Ngram features (Boulis and Ostendorf, 2005) . . . . . 204

10.4.1 Training Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 204

10.4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

10.5 Modeling Based on the Partner’s Gender . . . . . . . . . . . . . . . . 208

10.5.1 Oracle Experiment . . . . . . . . . . . . . . . . . . . . . . . . 209

10.5.2 Replacing Oracle by a Homogeneous vs Heterogeneous Classifier 210

10.5.3 Modeling partner via conditional model and whole-conversation model . . . . . 211

10.6 Sociolinguistic Features . . . . . . . . . . . . . . . . . . . . . . . . . . 213

10.7 Gender Classification Results . . . . . . . . . . . . . . . . . . . . . . 215

10.7.1 Aggregating results over per-speaker via consensus voting . . . 217

10.8 Effect of Self-Reporting Features on Gender Classification . . . . . . . 219

10.9 Application to Arabic Language . . . . . . . . . . . . . . . . . . . . . 221

10.9.1 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . 221

10.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

10.9.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

10.10 Application to Email Genre . . . . . . . . . . . . . . . . . . . . . 225

10.10.1 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . 225

10.10.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

10.10.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227


10.10.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

10.11 Modeling Other Attributes . . . . . . . . . . . . . . . . . . . . . . 229

10.11.1 Corpus details for Age and Native Language . . . . . 232

10.11.2 Results for Age and Native/Non-Native . . . . . . . . . . . . . 232

10.11.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233

10.12 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . 235

10.12.1 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . 237

10.12.2 Baseline Approaches . . . . . . . . . . . . . . . . . . . . . . . 237

10.12.3 Ngram-based regression model . . . . . . . . . . . . . . . . . . 237

10.12.4 Sociolinguistic features . . . . . . . . . . . . . . . . . . . . . . 238

10.12.5 Top Ngram features . . . . . . . . . . . . . . . . . . . . . . . . 238

10.12.6 Multiple Binary Classifiers Across Different Age Boundaries . . . . . 240

10.12.7 Stacked Models . . . . . . . . . . . . . . . . . . . . . . . . . . 242

10.12.7.1 Linear Combination . . . . . . . . . . . . . . . . . . 242

10.12.7.2 Regression Trees . . . . . . . . . . . . . . . . . . . . 243

10.12.8 Balancing Size of Different Age Groups in Test Set . . . . . . 245

10.13 Effect of Self-Reporting Features on Age Prediction . . . . . 246

10.14 Statistical Significance of Results . . . . . 247

10.15 Conclusion . . . . . 248


11 Contributions and Conclusion 250

11.1 Applications and Future Work . . . . . . . . . . . . . . . . . . . . . . 255

Bibliography 258

Vita 284


List of Tables

1.1 A sample of seed pairs for the three major categories of relationships extracted. A representation of context is learned starting with such seed pairs for extracting new pairs describing the relationship as described in Section 1.2.1 . . . . . 7

3.1 Example lexical resources used in this task and their application to translating compound words in new languages. . . . . 27

3.2 Baseline performance using unreordered literal English glosses as translations. The percentages in parentheses indicate what fraction of all the words in the test (entire) vocabulary were detected and translated as compounds. . . . . 35

3.3 Coverage and accuracy for the standard model using gloss-to-fluent translation mappings learned from bilingual dictionaries in other languages (in forward order only). . . . . 38

3.4 Size of various bilingual dictionaries (with other language as English) . . . . . 39
3.5 Performance for looking up English gloss via both orderings. The percentages in parentheses are relative improvements from the performance in Table 3.3 . . . . . 39

3.6 Top 15 middle glues (fillers) and end glues discovered for each language along with their probability values. Glue characters allow for appropriately splitting the compound words into the root forms of the individual components for lookup in a lexicon. . . . . 41

3.7 Performance for increasing coverage by including compounding morphology. The percentages in parentheses are relative improvements from the performance in Table 3.5 . . . . . 42

3.8 Average performance on German and Swedish with and without using context vector similarity from monolingual corpora. . . . . 43

3.9 Performance of this work’s BiDict approach compared with and augmented with traditional statistical MT learning from bitext. . . . . 44


3.10 Illustrating 3-best cross-languages obtained for each test language (shown in bold). Each row shows the effect of adding the respective cross-language to the set of languages in the rows above it and the corresponding F-scores (Top 1 and Top 10) achieved. . . . . 53

4.1 Contrasting context words derived from the adjacent vs dependency models for the above example . . . . . 65

4.2 Top 10 translation candidates for the Spanish word “camino (way)” and “crecimiento (growth)” for the best adjacent context model (Adjbow) and best dependency context model (Depposn). The bold English terms show the acceptable translations. . . . . 70

4.3 Performance of various context-based models learned from monolingual corpora and phrase-table learned from parallel corpora on Noun translation. . . . . 70

4.4 List of 20 most confident mappings using the dependency context based model for noun translation along with exact match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the first mapping (senores, gentlemen) is the correct one, it was not present in the lexicon used for evaluation and hence is marked as incorrect. . . . . 73

4.5 Performance of dependency context-based model along with addition of part-of-speech mapping model on translating all word-types. . . . . 79

4.6 List of 25 most confident mappings using the dependency context with the part-of-speech mapping model translating all word-types along with exact match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the second best mapping in Table 4.4 for noun-translation is for xenophobia with score 0.87, xenophobia is not among the 1000 most frequent words (of all word-types) and thus is not in this test set. . . . . 80

6.1 Naive pattern scoring: Hyponymy patterns ranked by their raw corpus frequency scores. . . . . 100

6.2 Patterns for hypernymy class re-ranked using evidence from other classes. Patterns distributed fairly evenly across multiple relationship types (e.g. “X and Y”) are deprecated more than patterns focused predominantly on a single relationship type (e.g. “Y such as X”). . . . . 102

6.3 A sample of patterns and their relationship type probabilities P(class|pattern) extracted at the end of training phase for English. . . . . 105

6.4 A sample of patterns and their class probabilities P(class|pattern) extracted at the end of training phase for Hindi. . . . . 105


6.5 A sample of seeds used and model predictions for each class for the taxonomy induction task. For each of the model predictions shown above, its Hyponym/Meronym/Cousin classification was correctly assigned by the model. . . . . 105

6.6 Overall accuracy for4-way classification {hypernym,meronym,cousin,other} using differentpattern scoring methods. . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.7 Test set coverage and accuracy results for inducing different semanticrelationship types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.8 Confusion matrix for English (left) Hindi (right) for the four-way clas-sification task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.9 Accuracy on Hindi to English word translation using different transitivehypernym algorithms. The additional model components in the bi-d(bi-directional) plus Other model are only used to rerank the top20 candidates of the bidirectional model, and are hence limited to itstop-20 performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.10 A sample of correct and incorrect translations using transitive hyper-nymy/hyponym word translation induction . . . . . . . . . . . . . . . 112

7.1 A sample of ranked hyponyms proposed for the definite NP The drugby TheY-Model illustrating the differences in weighting methods. . . 122

7.2 Results using different normalization techniques for the TheY-Modelin isolation. (60 million word corpus) . . . . . . . . . . . . . . . . . . 122

7.3 Accuracy and Average Rank showing combined model performance onthe antecedent selection task. Corpus Size: 60 million words. . . . . . 124

7.4 A sample of output from different models on antecedent selection (60million word corpus). . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.5 Accuracy and Average Rank of Models defined in Section 7.3 on theantecedent selection task. . . . . . . . . . . . . . . . . . . . . . . . . . 128

7.6 Agreement of different generation models with human judge and withdefinite NP used in the corpus. . . . . . . . . . . . . . . . . . . . . . 134

7.7 Sample of decisions made by human judge and the best performingmodel (TheY+OtherY+WN) on the generation task. . . . . . . . . . 135

9.1 A sample of partially untethered and fully tethered patterns alongwith their precision. For some of the attributes, only 4-5 fully tetheredpatterns but relaxing the constraint on the <hook> allows extractionof many partially tethered patterns providing improved performanceas shown in Tables 9.5 9.6. . . . . . . . . . . . . . . . . . . . . . . . . 155

9.2 A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, there are only 4-5 fully tethered patterns, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6. . . . 156

9.3 Majority rank of the correct attribute value in the Wikipedia pages of the seed names used for learning relative ordering among attributes satisfying the domain model . . . 159

9.4 Sample of occupation weight vectors in English and German learned using the latent-attribute-based model. . . . 165

9.5 Average Performance of different models across all biographic attributes . . . 168

9.6 Performance comparison of all the models across several biographic attributes. Bolded accuracies indicate the top-performing model. . . . 169

9.7 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 175

9.8 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 176

9.9 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 177

9.10 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 178

9.11 Pattern relevance based on presence in the high quality pattern list generated by human annotators. Top 5 indicates the fraction of top 5 patterns generated by the algorithm that were marked by annotators as high quality patterns. The results are averaged over all attributes. . . . 180

9.12 Name disambiguation performance for matching first or last name mentions to a Wikipedia person page . . . 192

9.13 Correlation between occupations based on number of people sharing the same occupation . . . 193

10.1 Top 20 ngram features for Gender, ranked by the weights assigned by the linear SVM model . . . 205

10.2 Difference in Gender classification accuracy between mixed gender and same gender conversations using the reference algorithm . . . 209

10.3 Performance for 4-way classification of the entire conversation into (mm, ff, mf, fm) classes using the reference algorithm on the Switchboard corpus. . . . 209

10.4 Results showing improvement in accuracy of gender classifier using partner-sensitive model and sociolinguistic features . . . 216

10.5 Aggregate results on a “per-speaker” basis via majority consensus on different conversations for the respective speaker. The results on Switchboard are significantly higher due to more conversations per speaker as compared to the Fisher corpus . . . 219

10.6 Fraction of conversations containing self-reporting features such as “my wife”, “my boyfriend”, on different corpora. Although Fisher has a significant number of conversations with such features, they have little impact on the overall performance as shown in Table 10.7 . . . 220

10.7 Self-reporting features for gender such as “my wife”, “my boyfriend”, etc. have negligible impact on performance of gender classification. . . . 221

10.8 Gender classification results for a new language (Gulf Arabic) showing consistent improvements via partner-sensitive model and sociolinguistic features. . . . 223

10.9 Application of Ngram model and sociolinguistic features for gender classification in a new genre (Email) . . . 228

10.10 Top 20 ngram features for gender classification in email, ranked by the weights assigned by the linear SVM model. See Section 10.10.4 for more details. . . . 230

10.11 Results showing improvement in the accuracy of age and native language classification using partner-sensitive model and sociolinguistic features . . . 233

10.12 Top 25 ngram features for Age ranked by weights assigned by the linear SVM model . . . 234

10.13 Results for age regression using different feature and model combinations. Substantial performance gains were obtained by utilizing binary classifiers across different age boundaries as features in a stacked SVM model. . . . 239

10.14 Top 20 ngram features for Age ranked by weights assigned by the ngram-based SVM regression model . . . 240

10.15 Results for age regression using different feature and model combinations for an age-wise balanced test set. While the performance of the baseline models degrades due to higher variance, regression models show consistent performance improvements as in Table 10.13 . . . 245

10.16 Self-reporting features such as “in thirties”, “i’m fourty five”, etc. have little impact. The performance after deleting such features is similar to the original model containing all ngrams as features. . . . 247

List of Figures

1.1 Thesis overview: Extracting different types of relationships from unstructured text in multiple languages into a structured multilingual knowledge base. . . . 5

3.1 Illustration of using cross-language evidence from bilingual dictionaries of different languages for compound translation. The basic approach is to translate compound words by modeling the mapping of literal component-word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”) across multiple languages. . . . 30

3.2 Illustration of the problem with generating fluent translation candidates via compositional methods (Grefenstette, 1999; Cao and Li, 2002; Baldwin and Tanaka, 2004) . . . 37

3.3 Illustration of compounding morphology using middle and end glue characters. . . . 40

3.4 Coverage/Accuracy trade-off curve by incrementing the minimum number of languages exhibiting a candidate translation for the source-word’s literal English gloss. Accuracy here is the Top-1 accuracy averaged over all 10 test languages. . . . 47

3.5 F-measure performance given varying sizes of the bilingual dictionaries used for cross-language evidence (as a percentage of words randomly utilized from each dictionary). . . . 48

3.6 Top-1 match F-score performance utilizing K languages for cross-language evidence, for both a random K languages and greedy selection of the most effective K languages (typically the closest or largest dictionaries) . . . 50

3.7 Top-10 match F-score performance utilizing K languages for cross-language evidence, for both a random K languages and greedy selection of the most effective K languages (typically the closest or largest dictionaries) . . . 51

4.1 Illustration of the (Rapp, 1999) model for translating the Spanish word “crecimiento (growth)” via dependency context vectors extracted from respective monolingual corpora, as explained in Section 4.3.1.2 . . . 62

4.2 An illustration of a dependency tree clearly showing the parent and child nodes. The word marked in bold (“crecimiento”) is used as an example source word in the chapter for illustrative purposes, and its adjacent and dependency contexts are shown in Table 4.1. . . . 64

4.3 Precision/Recall curve showing superior performance of the dependency context model as compared to adjacent context at different recall points. Precision is the fraction of tested Spanish words with the Top 1 translation correct and Recall is the fraction of the 1000 Spanish words tested upon. . . . 71

4.4 Illustration of using part-of-speech tag mapping to restrict the candidate space of translations. . . . 75

4.5 Illustration of mapping the Spanish part-of-speech tagset to the English tagset. The tagsets vary greatly in notation and the morphological/syntactic constituents represented, and need to be mapped first, using the algorithm described in Section 4.6.1. . . . 77

4.6 Precision/Recall curve showing superior performance of using part-of-speech equivalences for translating all word-types. Precision is the fraction of tested Spanish words with the Top 1 translation correct and Recall is the fraction of the 1000 Spanish words tested upon. . . . 78

5.1 Example of definite anaphora resolution and generation. Both tasks require the knowledge of a derived semantic relationship that “pseudoephedrine is-a drug”. . . . 93

6.1 Goal: To induce multilingual taxonomy relationships in parallel in multiple languages (such as Hindi and English) for information extraction and machine translation purposes. . . . 96

6.2 Illustration of the models of using induced hyponymy and hypernymy for translation lexicon induction. . . . 109

6.3 Reducing the space of likely translation candidates of the word raaiphala by inducing its hypernym, using a partial dictionary to look up the translation of the hypernym, and generating the candidate translations as induced hyponyms in English space. . . . 110

7.1 Example of definite anaphora resolution and generation. Both tasks require the knowledge of the semantic relationship that “pseudoephedrine is-a drug”; however, the resolution task is easier because there is only a limited set of candidates to choose from (shown by circled nouns). . . . 116

7.2 Illustrating the problem with WordNet for definite anaphora generation. The immediate parent and grandparent of “pseudoephedrine”, “alkaloid” and “organic compound”, do not serve as natural definite anaphoras as compared to “drug”, which is often observed in corpora. . . . 132

8.1 Illustration of the basic weakly supervised approach by Ravichandran and Hovy (2002) for fact extraction. Using a few seeds of the fact in question, contextual patterns occurring with the seeds are extracted and ranked based on their distribution in the monolingual corpora. New pairs observing the given fact (for example, occupation) can then be extracted using co-occurrence with these patterns. . . . 142

9.1 Goal: extracting attribute-value biographic fact pairs from biographic free-text . . . 150

9.2 Distribution of the observed document mentions of Deathdate, Nationality and Religion. . . . 160

9.4 Illustration of modeling “occupation” and “nationality” transitively via consensus from attributes of neighboring names . . . 162

9.3 Empirical distribution of the relative position of the correct (seed) answers among all text phrases satisfying the domain model for “birthplace” and “death date”. . . . 163

9.5 Age distribution of famous people on the web (from www.spock.com) . . . 171

9.6 Global position of the “occupation” attribute in the New York Times articles. The position is given as the fraction of the article length on the X-axis, and the Y-axis describes the number of times an “occupation” attribute was found in that fraction. . . . 184

9.7 Distribution of “occupation” attribute from the first full mention of the name in the New York Times articles. . . . 186

9.8 Distribution of “occupation” attribute from the closest full mention of the name in the New York Times articles. . . . 187

9.9 Distribution of “occupation” attribute from the closest full or partial (first name or last name) mention of the name in the New York Times articles. . . . 189

9.10 Application of biographical attributes for name disambiguation: Disambiguating a mention of “Phil Collins” to the correct Wikipedia entry using the premodifying occupation “rider”. Similarly, other biographical attributes such as the nationality premodifier “British” can also be used for disambiguation. This can be further improved by using compatible occupations as shown in Table 9.13. . . . 191

10.1 A snippet of a Fisher telephone transcript between a female (A) and male (B) speaker. The first two fields indicate the start time and stop time, and the third field contains the utterance. . . . 198

10.2 The effect of varying the amount of each conversation side utilized for training, based on the utilized % of each conversation, starting from the beginning of the conversation. While one would expect the accuracy to improve linearly with increased training data, the anomaly involving the flat portion in the middle could be due to the fact that Fisher and Switchboard participants were complete strangers. The initial ramp up in the curve is probably due to the addition of speaker data starting from no data at all, and the flat portion is probably due to the time taken for the speakers to get familiar and speak comfortably with each other, after which the discourse features for speaker attributes become more prominent. Another reason could be that the middle portion indicates discussion on a specific topic given to the speakers, and after they have spoken enough about the topic, the speakers may move on to more gender biased topics of their choice. . . . 207

10.3 People use stronger gender-specific discourse properties when speaking to someone of a similar gender. Stacking whole-conversation and partner-conditioned models as shown above allows modeling such behavior. The common graphic utilized for individual SVM classifiers first appeared in (Ustun, 2003). . . . 211

10.4 Empirical differences in sociolinguistic features for Gender on the Switchboard corpus . . . 213

10.5 Aggregating results over all the conversations of a given speaker via consensus voting as explained in Section 10.7.1. One can also utilize other ways of combining evidence such as length-weighted voting, confidence-weighted voting, stacking, combining all conversations into one single conversation, etc. However, since the speakers were supposed to speak for a fixed time while collecting the data for the Fisher and Switchboard corpora, the conversations in these corpora are of similar length. Thus the above simple combination technique is also appropriate due to the nature of approximately equal length conversations. . . . 218

10.6 Top 20 Arabic ngram features (along with their Roman transliterations) for Gender, ranked by the weights assigned by the linear SVM model. Section 10.9.3 provides translation and insight into why these are appropriate gender indicators. . . . 224

10.7 Example of an email sent by a male sender in the Enron corpus. The header and signature information containing the sender’s name is removed and only the body of the email is used for gender classification. . . . 226

10.8 Empirical differences in sociolinguistic features for Age. Younger speakers tend to use short utterances, pronouns and auxiliaries more often than older speakers. . . . 231

10.9 Age histograms for training and test speakers of the Switchboard corpus indicating unbalanced age groups of the participating speakers. . . . 236

10.10 Stacking approach for age regression utilizing binary classifiers across different age boundaries and sociolinguistic features as individual components. . . . 244

10.11 Histograms for different age groups in the test set. The horizontal line shows the threshold for balancing the size of the test set across different age groups, retaining a total of 600 examples. . . . 246

Chapter 1

Introduction

The amount of available unstructured data is growing at a rapid rate. The International Data Corporation (IDC) predicts that in 2011 the amount of digital information produced in the year will equal nearly 1,800 exabytes, or 10 times what was reported in a 2006 measurement study.¹ To utilize such vast information in a meaningful manner, it is essential to organize it into a structured form. However, converting this information into structured repositories usually requires a slow, manual annotation process, and most of the information remains unexamined. Furthermore, some such manually created resources, including WordNet and DBPedia, have limited coverage and are available only for a few of the world’s languages.

A large amount of information which is still untapped and unorganized exists in a wide range of genres, including multilingual news articles, blogs, emails, conversation transcripts, and discussion forums.

¹http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf

A primary step in organizing unstructured text is to identify the relationships between those general concepts denoted by words or phrases, both within and across different languages. Techniques for identifying and extracting such relationships provide a basic relational structure that can then be refined and generalized into a fully-fledged crosslingual knowledge base.

This dissertation provides new relationship extraction models that exploit novel knowledge sources, across a diverse set of relationship types in multiple languages. A wide range of relationships are explored, including semantic relationships between words, their translation equivalents in different languages, and encyclopedic facts about named entities.

The goal of this dissertation is multi-faceted:

• to tie together extraction of diverse relationships in a common minimally supervised framework using seed exemplars for learning typical contexts. Using the same starting approach, different representations of context can be leveraged for extracting different relationships.

• to explore novel knowledge sources, including social context, correlations among relationships, cross-language evidence using bilingual dictionaries, and others. Such novel and multilingual knowledge sources not only help in improving the performance of relationship extraction but also allow for extracting new relationship types, such as latent or implicit relationships.

• to develop and evaluate relationship extraction models for various domains, such as conversational speech transcripts, email data, web pages, and formal genres, and for diverse languages including Arabic, Bulgarian, German, Hindi, Spanish and Hungarian.

Having described the motivation and goals of this dissertation, the rest of this chapter is organized as follows: Section 1.1 categorizes the types of relationships explored in this dissertation into three broad categories. Section 1.2 explains how the extraction of such relationships can be tied together under a basic seed-exemplar-based framework, along with a discussion of the variants of context representations explored. Section 1.3 summarizes the contributions according to the novel knowledge sources that were used to build new relationship extraction models. Section 1.4 provides a chapter-wise outline of the dissertation.

1.1 Types of relationships explored

The types of relationships explored in this thesis fall into the following three broad categories, as illustrated in Figure 1.1.

• Cross-language/Translation relationships: To leverage a vast amount of multilingual information, it is necessary to identify how similar concepts are denoted in different languages. This can often be easily identified with manually created bilingual dictionaries, such as a Spanish-English or Chinese-English dictionary. However, the lack of such translation lexicons is a major bottleneck for low-resource languages. Chapters 3 and 4 of this dissertation provide several novel methods for inducing such lexicons automatically and with low annotation effort.

• Semantic relationships: General relationships found in semantic knowledge bases such as WordNet (hypernymy, synonymy and meronymy) are critical to many applications that aim at restoring parts of the meaning in the sentence, such as sentiment analysis tools and semantic search engines. Chapters 6 and 7 show how such semantic relationships can be extracted in multiple languages and also quantify how such information can be applied to downstream tasks.

• Factual relationships: A large number of relationships are domain-specific and express facts about the world. For example, DBPedia is a knowledge base that contains factual relationships about people such as “birthplace”, which is a relationship between phrases denoting person names and locations, or about relationships in organizations such as “founder”, which is a relationship between phrases that denote a person name and a company name. Such factual relationships can be explicitly stated in the text and can also often be implicitly derived when not directly stated. Chapters 9 and 10 provide techniques to extract such relationships in both an explicit and implicit manner, and in different languages.

[Figure 1.1: thesis overview diagram appears here.]

Figure 1.1: Thesis overview: Extracting different types of relationships from unstructured text in multiple languages into a structured multilingual knowledge base.

1.2 Basic approach: Relationship extraction using seed exemplars

A common starting approach in extracting the different kinds of relationships in this thesis is that of bootstrapping from a small set of seed pairs. This approach is inspired by the success of applying self-learning approaches in machine learning (Baum, 1972; Dempster et al., 1977) to natural language processing tasks, such as seed-based approaches for word sense disambiguation (Yarowsky, 1995). Similar approaches have been used for extracting cross-language relationships such as translation equivalence (Rapp, 1999), and semantic and factual relationships (Thelen and Riloff, 2002; Ravichandran and Hovy, 2002). Many variants of this basic seed-exemplar-based approach have been further developed in the literature for specific relationships. While such seed-based techniques have been independently and specifically developed for different relationship types, this section shows that the underlying algorithm remains the same while the context representation varies. Most of the previous approaches use only local context representations. Sections 1.2.2 and 1.3 show other context representations and a wide array of novel knowledge sources that can be embedded under the same framework.

The basic algorithm is outlined below, followed by a description of the different context representations and additional original knowledge sources explored in this thesis.

Cross-language (Spanish-English)   Semantic (Hypernymy)         Factual (Occupation)
(diversidad, diversity)            (car, vehicle)               (Seamus Heaney, Poet)
(chipre, cyprus)                   (copper, metal)              (Amitabh Bachchan, Actor)
(gobierno, government)             (gun, weapon)                (Desmond Dekker, Singer)
(fundamento, certainty)            (yen, currency)              (Elfriede Jelinek, Novelist)
(ruego, thank)                     (dog, animal)                (John Hume, Politician)
(papel, role)                      (hammer, tool)               (Ludwig Wittgenstein, Philosopher)
(fundamento, basis)                (tennis, sport)              (Monty Hall, Game show host)
(de, of)                           (cancer, disease)            (Robert Boyle, Chemist)
(clave, key)                       (English, language)          (Xavier Cugat, Musician)
(entre, between)                   (passport, legal document)   (Rupert Sheldrake, Biologist)

Table 1.1: A sample of seed pairs for the three major categories of relationships extracted. A representation of context is learned starting with such seed pairs for extracting new pairs describing the relationship, as described in Section 1.2.1.

1.2.1 Algorithm (Rapp, 1999; Ravichandran and Hovy, 2002)

The traditional seed-based learning framework is as follows:

Input: A set of seed pairs exhibiting the relationship of interest, and unlabeled monolingual corpora. Table 1.1 shows examples of seed pairs for the different relationships explored in this thesis.

Output: New word/phrase pairs exhibiting the relationship of interest.

Method:

1. Extract individual contexts for each seed pair occurrence in monolingual cor-

pora. A context can be as simple as the set of adjacent words surrounding

the seed pairs. More complex versions, such as pattern templates, dependency contexts, and document-wide contexts, are explained in Section 1.2.2.

For example, in (Ravichandran and Hovy, 2002), the context is simply the se-

quence of words (also called a pattern template) surrounding the appearance of

a seed pair in a sentence.

2. Aggregate individual contexts into a general context representation. This con-

text representation can be thought of as abstractly representing the relationship

of interest. Thus, there is a single aggregate context representation per relationship. Some examples of this aggregate representation include a TF.IDF-weighted bag-of-words context vector, a list of pattern templates ranked according to some pattern reliability score, and position-specific context vectors.

For example, in (Ravichandran and Hovy, 2002), this is a list of pattern tem-

plates ranked according to their precision score, which is the fraction of times

the pattern appears with the seed pair in an unlabeled corpus.

3. Extract new pairs that occur with the aggregated context representation in

monolingual corpora. This extraction process depends on the type of context

representation. For example, for a bag-of-words context vector representation,

this process involves creating bag-of-words vectors using the adjacent words that

occur alongside the candidate pairs. These pairs are treated as candidates for

the relationship of interest. The candidates are then ranked by different measures depending on the context representation. For example, in (Ravichandran and Hovy, 2002), the extracted candidate pairs are ranked according to their

frequency in an unlabeled corpus.

This extraction process and the ranking measures are explained in more detail in their respective thesis chapters.
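The three-step method above can be sketched in a few lines of Python. This is a minimal illustration using whole-sentence bag-of-words contexts and an invented toy corpus; the corpus, seed pairs, and overlap-based ranking score are illustrative assumptions, not the thesis's actual data or measures.

```python
from collections import Counter

def contexts(corpus, pair):
    """Step 1: collect context words from sentences where a pair co-occurs.
    For brevity the context is the whole sentence rather than a fixed window."""
    x, y = pair
    ctx = Counter()
    for sent in corpus:
        if x in sent and y in sent:
            for w in sent:
                if w not in (x, y):
                    ctx[w] += 1
    return ctx

def aggregate(corpus, seeds):
    """Step 2: sum the individual seed contexts into one aggregate vector."""
    total = Counter()
    for pair in seeds:
        total += contexts(corpus, pair)
    return total

def rank(corpus, candidates, agg):
    """Step 3: rank candidate pairs by overlap with the aggregate context."""
    def overlap(pair):
        ctx = contexts(corpus, pair)
        return sum(min(c, agg[w]) for w, c in ctx.items())
    return sorted(candidates, key=overlap, reverse=True)

corpus = [
    "a car is a kind of vehicle".split(),
    "a gun is a kind of weapon".split(),
    "a dog is a kind of animal".split(),
    "the dog chased the red car".split(),
]
seeds = [("car", "vehicle"), ("gun", "weapon")]
agg = aggregate(corpus, seeds)
ranked = rank(corpus, [("dog", "animal"), ("dog", "car")], agg)
```

Here ("dog", "animal") outranks ("dog", "car") because its context ("a ... is a kind of ...") overlaps the aggregate hypernymy context learned from the seeds.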

Given an initial set of candidate pairs extracted using the above seed-exemplar based

approach, novel relationship extraction models using diverse knowledge sources can

then be exploited to substantially improve the extraction. These knowledge sources

and their application are explained in brief in Section 1.3 of this chapter.

1.2.2 Context Representations

Context representation is a key component of the algorithm outlined in Section

1.2.1. The following section illustrates different major variants of context representa-

tion used in the thesis.

1. Narrow bag-of-words context vectors (Rapp, 1999; Schafer and Yarowsky, 2002; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009): This representation uses the words surrounding the seeds over a fixed window size and forms a vector with weights computed from co-occurrence frequencies. The representation can also record the position at which each contextual word occurs, rather than treating the context as an unordered bag of words.


Application: Utilized for cross-language relationships such as translation equiv-

alence and complex semantic relationships such as definite anaphora.

Example: Figure 4.1 in Chapter 4 provides an example of a narrow bag-of-words context vector.

2. Prefix, infix and suffix pattern templates (Ravichandran and Hovy, 2002; Thelen and Riloff, 2002; Pasca et al., 2006; Garera and Yarowsky, 2009): When the relationship is often expressed by words in close proximity to the words of the seed pair, a very useful context representation is the pattern template. For example, “X was born in Y” is an infix pattern template representing one of the “birthplace” relationship contexts. Each context template is assigned a weight computed based on the seed pairs.

Application: Utilized for generic semantic relationships such as “Is-a”, “Part-

of”, etc., explicit factual relationships such as “birthplace” and cross-language

relationships among compound words.

Example: Figure 8.1 in Chapter 8 provides an example of pattern-based context

representation.

3. Document-wide contexts² (Garera and Yarowsky, 2009): This representation is similar to the fixed-window bag-of-words context vector, but the context is expanded to the entire document; it is used for topically modeling a factual relationship, as described in Chapter 9.

Application: Utilized for extracting latent factual relationships that are not explicitly stated in the text but can be inferred indirectly from the topical nature of the document.

Example: Table 9.4 in Chapter 9 provides an example of utilizing document-wide contexts for extracting the “occupation” relationship.

4. Multilingual dependency contexts (Garera et al., 2009): This representation involves obtaining dependency parses in multiple languages for modeling long-range dependencies and word ordering as part of the contextual clues.

Application: Utilized for extracting translation relationships via context projection across languages, as described in Chapter 4.

Example: Figure 4.2 and Table 4.1 in Chapter 4 provide an example of utilizing dependency contexts for extracting the translation relationship.

²Although these were not utilized in the previous seed-based approaches described at the beginning of Section 1.2, such contexts can be naturally embedded under the seed-based approach and are listed here for completeness.
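The pattern-template representation (item 2 above) can be illustrated with a minimal sketch. A single hand-written template and toy sentences are used here; in the thesis such templates are learned from seed pairs and weighted rather than written by hand.

```python
import re

# A toy infix pattern template for the "birthplace" relationship.
# The template and the sentences are illustrative assumptions.
TEMPLATE = re.compile(r"(\w+) was born in (\w+)")

def extract_pairs(text):
    """Return (X, Y) pairs matched by the infix template 'X was born in Y'."""
    return TEMPLATE.findall(text)

pairs = extract_pairs("Mozart was born in Salzburg. Chopin was born in Warsaw.")
```

Each match instantiates the template's X and Y slots as a new candidate pair for the relationship.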


1.3 Contributions

This dissertation makes several novel contributions to the general natural language

learning framework described in Section 1.2. It explores an array of new relationship

extraction techniques, exploiting novel internal and external knowledge sources. In

what follows, I outline the specific contributions of this dissertation. These are orga-

nized according to the knowledge sources that have been used in the development of

new relationship extraction models.

1.3.1 Internal knowledge sources

As defined in this dissertation for the purposes of distinguishing two major classes

of knowledge sources, Internal knowledge sources are directly derived from the corpora

from which relationships are extracted. These include new context representations, transitive knowledge, and several others, as described below:

1. Evidence across different relationship types: Several relationships, especially within the same category, can benefit from joint modeling. Chapter

6 provides a novel minimal-resource model for the acquisition of multilingual

lexical taxonomies (including hyponymy/hypernymy and meronymy) using ev-

idence from multiple relationship-types.

2. Global document-wide contexts: Latent relationships that are not explic-

itly stated in text are difficult to extract using the basic seed-based approach.


Chapter 9 shows how latent-attribute models of document-wide context, both

monolingually and translingually, can capture facts that are not stated directly

in a text.

3. Transitivity information via co-occurrence statistics: Factual relation-

ships often have similar values for related entities, and co-occurrence statistics

can be used to find entities that are related. Chapter 9 provides a transitive

model that predicts values of factual attributes based on consensus voting via

the extracted attributes of neighboring entities.

4. Structural information: Some relationships, such as “birthdate”, tend to occur in characteristic document positions; such positional information can be very useful when a good context model is not available. Chapter 9 provides the first

known work on illustrating a global structural model for factual relationship

extraction, utilizing absolute and relative document-wide positions as opposed

to only modeling local contextual patterns.

5. Social context: For extracting facts about people, social context can play an

important role. This involves modeling of speaker attributes sensitive to partner

speaker attributes, given the differences in lexical usage and discourse style such as those observed between same-gender and mixed-gender conversations. Chapter 10

makes use of such social contexts for improving extraction of implicit factual

relationships from conversation genres.


6. Sociolinguistic information: Chapter 10 also explores a rich variety of novel

sociolinguistic and discourse-based features, including mean utterance length,

passive/active usage, percentage domination of the conversation, speaking rate

and filler word usage.

7. Derived relationships: Successful extraction of simple relationships can be

used as an input for extracting more complicated relationship types. Chap-

ters 6 and 7 show how automatically extracted “is-a” semantic relationships

can be used as a knowledge source for extracting cross-language and anaphoric

relationships.

8. Morphology and sequence information: Chapter 3 illustrates the use of

component-sequence and compound morphology for extracting cross-language

relationships among compound words.

1.3.2 External knowledge sources

These knowledge sources consist of external tools and data such as dependency

parsers, bilingual dictionaries, etc., used alongside the corpora from which relation-

ships are extracted.

1. Bilingual dictionaries for leveraging cross-language evidence: Chapter

3 presents the first known work on non-compositional extraction of compound

word equivalences across different languages, leading to fluent translations. The


key knowledge source that makes this possible is a set of bilingual dictionaries,

which is used for learning multilingual similarities between compound words.

2. Richer contexts via dependency parses: Dependency trees not only provide richer contexts but also help model long-distance relationships and word reordering. Chapter 4 shows a novel use of dependency parsers for extracting cross-language relationships.

3. Part-of-speech equivalences: Part-of-speech tags are usually preserved in

cross-language relationships. Chapter 4 shows how the entropy of candidate translations can be reduced by mapping part-of-speech tagsets. The

chapter also provides a mechanism for learning such a mapping automatically.

4. Domain models: The values of attribute-value pairs of a factual relationship

can often be modeled based on its domain. Chapter 9 shows how external gazetteers and syntactic constraints can be used as domain models for extracting

factual relationships.

5. Correlation statistics among instances of different relationship types:

Different attributes describing facts about the same entity are often correlated

and can be used to reduce the entropy of candidate extraction space. Chapter

9 shows how such correlations can be learned and applied using an external

database of factual relationships3.

³Note that correlations among different relationships can also be derived internally, as described in the first point of Section 1.3.1 and in more detail in Chapter 6.


1.4 Outline of this Dissertation

This dissertation is organized into the following chapters:

1.4.1 Part I: Cross-language relationships

• Chapter 2 presents a literature review for extracting translation equivalents

across different languages, with a focus on minimally supervised methods.

• Chapter 3 presents an approach for fluent, non-compositional translation of

compound words by learning component gloss translation models across multi-

ple languages.

• Chapter 4 presents novel improvements to the induction of translation lexicons

from monolingual corpora by incorporating multilingual dependency parses and

part-of-speech equivalences.

1.4.2 Part II: Semantic relationships

• Chapter 5 presents a literature review for extracting semantic relationships

and their downstream application to anaphora resolution.

• Chapter 6 presents a novel algorithm for the acquisition of multilingual se-

mantic taxonomies, using evidence from different semantic relationship types.


• Chapter 7 shows how corpus-based approaches for extracting semantic rela-

tionships can be utilized for resolving and generating definite anaphora.

1.4.3 Part III: Factual relationships

• Chapter 8 presents a literature review for extracting a third category of lexical

relationships, namely domain-specific factual relationships.

• Chapters 9 and 10 present structural, transitive and latent approaches for

fact extraction in the biographic domain, across different genres.

• Chapter 11 concludes this dissertation.


Part I

Extracting

Cross-language/Translation

Relationships


Chapter 2

Part I Literature Review

This chapter covers the literature review for extracting cross-language relation-

ships in the form of translation equivalences. The different lines of previous work can

be classified according to the nature of the resources utilized, namely, parallel cor-

pora (Section 2.1), monolingual corpora with seed lexicons (Section 2.2) and bridge

languages (Section 2.3).

2.1 Using Parallel Corpora

Learning translation equivalence relationships across words in different languages

can be traced back to the first statistical approach to machine translation by Brown

et al. (1990) from parallel text. This was formally defined as a translation model

in Brown et al. (1993) using the word alignments learned via the expectation maximization algorithm (Dempster et al., 1977). After obtaining word alignments, the

translation lexicon can be induced using the IBM Model 1 (Brown et al., 1993) as

follows:

$$ t(e|f;\mathbf{e},\mathbf{f}) \;=\; \frac{\sum_{(\mathbf{e},\mathbf{f})} c(e|f;\mathbf{e},\mathbf{f})}{\sum_{e} \sum_{(\mathbf{e},\mathbf{f})} c(e|f;\mathbf{e},\mathbf{f})} \qquad (2.1) $$

where t(e|f; e,f) is the translation probability estimated from a sentence-aligned corpus (say, an English-French corpus, denoted by e,f), and c(e|f; e,f) denotes the number of times the English word e aligns with the French word f in the aligned corpus.
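The count normalization in Equation 2.1 can be illustrated with a toy example. The alignment counts below are invented, and the sum over sentence pairs is assumed to be already collapsed into the counts; this shows only the normalization step, not the full EM procedure.

```python
from collections import defaultdict

# Toy illustration of Equation 2.1: alignment counts c(e|f), accumulated
# over an aligned corpus, are normalized into translation probabilities
# t(e|f). The counts below are invented for illustration.
counts = {
    ("house", "maison"): 8.0,
    ("home", "maison"): 2.0,
    ("blue", "bleu"): 5.0,
}

def translation_probs(counts):
    """Normalize c(e|f) over all English words e aligned to each French f."""
    totals = defaultdict(float)
    for (e, f), c in counts.items():
        totals[f] += c
    return {(e, f): c / totals[f] for (e, f), c in counts.items()}

t = translation_probs(counts)
```

The probabilities for each French word sum to one, e.g. t(house|maison) = 0.8 and t(home|maison) = 0.2.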

IBM models 2-5 take into account reordering, fertility and deficiency issues. Since

then there have been several improvements to IBM models including the introduction

of HMM models (Vogel et al., 1996; Toutanova et al., 2002), use of posterior methods

(Kumar and Byrne, 2002) and discriminative training methods (Och and Ney, 2002),

incorporating manual word alignments (Callison-Burch et al., 2004) and using log-

linear model combination of simpler models (Liu et al., 2005). Moving beyond words,

there has also been extensive work in training phrase-based models from parallel text.

These models are rooted in the work by Och and Weber (1998) and Och et al. (1999)

on alignment template models, with many variations in the recent literature that are

beyond the scope of this dissertation. There has also been work on combining parallel corpora with additional noisy parallel text extracted from monolingual corpora


(Munteanu et al., 2004; Fung and Cheung, 2004) and then applying the word-based or phrase-based statistical models, treating the noisy text as part of the sentence-aligned

corpora.

The focus of the translation lexicon induction methods discussed in this dissertation

is on methods not requiring a parallel corpus, in order to alleviate the bottleneck of

manual annotation efforts. Towards this goal, several weakly supervised approaches

have been proposed that are detailed in the sections below.

2.2 Using Monolingual Corpora and Seed

Lexicons

The primary idea behind weakly supervised methods is to exploit noisy clues ex-

tracted from the monolingual corpora of source and target languages. The noisy

clues include diverse similarity measures such as contextual, orthographic, frequency

distribution, etc. A highly effective source of similarity often used in this literature

is that of similar contexts (Schafer and Yarowsky, 2002; Koehn and Knight, 2002;

Haghighi et al., 2008). The idea of words with similar meaning having similar con-

texts in the same language comes from the Distributional Hypothesis (Harris, 1954),

and Rapp (1999) was the first to propose using context of a given word as a clue to

its translation.

The algorithm presented by Rapp (1999) shows how an English translation for a German word can be obtained by first constructing a German context vector by counting

its surrounding words in a monolingual German corpus. Then, using an incomplete

bilingual dictionary, the counts of the German context words with known translations

are projected onto an English vector. The projected vector for the German word is

compared to the vectors constructed for all English words using a monolingual English

corpus. The English words with the highest vector similarity are treated as translation candidates. Rapp (1999) makes use of the city-block similarity metric; however, other researchers have found cosine similarity to perform better (Koehn and Knight,

2002; Schafer and Yarowsky, 2002). The original Rapp (1999) work employed a rel-

atively large bilingual dictionary containing approximately 16,000 words and tested

only on a small collection of 100 manually selected nouns. A detailed illustration

of this approach is described in the context of the improvements presented in this

dissertation in Section 4.3 of Chapter 4.
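The projection-and-comparison procedure described above can be sketched as follows. The seed dictionary and context vectors are tiny invented examples; cosine similarity is used here, following Koehn and Knight (2002), rather than Rapp's city-block metric.

```python
import math

# Sketch of Rapp-style context projection for translation lexicon induction.
seed_dict = {"Haus": "house", "rot": "red"}      # partial German-English lexicon

# German context vector for the source word "Hund" (dog):
de_context = {"Haus": 3, "rot": 1, "bellt": 2}   # "bellt" has no known translation

# Monolingual English context vectors for two candidate translations:
en_contexts = {
    "dog": {"house": 3, "red": 1, "barks": 2},
    "cat": {"house": 1, "milk": 4},
}

def project(de_vec, seed_dict):
    """Project German context counts onto English dimensions via the seed
    lexicon; words without a known translation are dropped."""
    return {seed_dict[w]: c for w, c in de_vec.items() if w in seed_dict}

def cosine(u, v):
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

projected = project(de_context, seed_dict)
best = max(en_contexts, key=lambda e: cosine(projected, en_contexts[e]))
```

The projected vector {"house": 3, "red": 1} is closest to the context vector of "dog", so "dog" is ranked as the top translation candidate.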

Fung (1998) applied a similar approach to Chinese-English, also using a large seed dictionary (20,000 words) to project Chinese and English context vectors into a common vector space for comparison.

Koehn and Knight (2002) tested this idea on a larger test set consisting of the 1000

most frequent words from a German-English lexicon. They also incorporated clues

such as frequency and orthographic similarity in addition to context. Schafer and

Yarowsky (2002) independently proposed using frequency and orthographic similarity,

and also showed improvements using temporal and word-burstiness similarity measures, in addition to context. Haghighi et al. (2008) made use of contextual and

orthographic clues for learning a generative model from monolingual corpora and a

seed lexicon.

A key notion in the above class of work is the similarity between the projected context vector and a candidate translation's context vector. All of the previous literature (Fung, 1998; Rapp, 1999; Koehn and Knight, 2002; Schafer and Yarowsky, 2002; Haghighi et al., 2008) has made use of naive fixed-window adjacent contexts for the construction of context vectors. Chapter 4 shows how richer contexts exploiting dependency

information can allow for dynamic context size, and account for word reordering in

the source and target language.

2.3 Using Bridge Languages

Another highly useful weakly supervised method is that of using “bridge lan-

guages”, often also referred to as “pivot languages” (Hajic et al., 2000; Gollins and

Sanderson, 2001). Often, a low-resource source language (say, Romanian) has a closely related language within its family that has a large translation lexicon for the target language; for example, Spanish is closely related to Romanian, and a large Spanish-English dictionary is easily available. The mapping from the low-resource

language to the close language in the family is established by learning statistical

models of cognate surface similarity.


Mann and Yarowsky (2001) introduced the idea of using bridge languages for tran-

sitive lexical translation induction using cognates. A cognate model was developed

consisting of several string distance measures such as raw Levenshtein distance and

trained single-state probabilistic transducers presented in Ristad and Yianilos (1997).

Schafer and Yarowsky (2002) extended this work by investigating a range of trans-

ducer structures for modeling cognates and further by using diverse similarity mea-

sures such as context similarity, temporal and word-burstiness similarity for ranking

the translation candidates when a cognate is found.
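The string-distance component of such cognate models can be illustrated with raw Levenshtein distance, one of the measures named above. The Romanian word and Spanish candidates below are an invented toy example, not a trained transducer.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Rank Spanish candidates as cognates of a Romanian word by edit distance:
romanian = "noapte"                      # "night"
spanish_candidates = ["noche", "perro", "nube"]
best = min(spanish_candidates, key=lambda s: levenshtein(romanian, s))
```

The closest candidate, "noche", is hypothesized as the cognate bridge into the Spanish-English lexicon.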

Chapter 3 shows another novel use of bridge languages, for compound translation

induction by leveraging cross-language similarity of compound components as a tran-

sitivity bridge. Translating compound words such as “lighthouse”, “fireplace”, etc.,

are often a challenge for corpus-based methods due to their often low frequency and

potentially complex compounding behavior, thus needing special treatment. The rel-

evant literature for compound word translation is provided in Section 3.3 of Chapter

3, all of which are based on surface-based compositional methods. Chapter 3 shows

how the bridge language paradigm can be utilized for generating fluent translation

candidates for compound translation, as opposed to providing glossy translations via

fixed surface-based pattern templates.


Chapter 3

Translating Compounds by

Learning Component Gloss

Translation Models via Multiple

Languages

Summary

This chapter describes an approach to the translation of compound words and

phrases without the need for bilingual training text, by modeling the mapping of

literal component word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”)

across multiple languages. Performance is improved by adding component-sequence


and learned-morphology models along with context similarity from monolingual text

and optional combination with traditional bilingual-text-based translation discovery.

Components of this chapter were originally published by the author of this dissertation

in the forum referenced below1.

3.1 Introduction

Compound words such as lighthouse and fireplace, which are composed of two or

more component words, are often a challenge for machine translation due to their

potentially complex compounding behavior and ambiguous interpretations. Further-

more, compound words/phrases are often poorly covered in bilingual dictionaries.

Compounds also exist in many other languages and some of the examples are shown

below:

• German: “Krankenhaus” (hospital) is a compound word whose individual components are “Kranken” (sick) and “Haus” (house). Another example is “Regenschirm” (umbrella), whose components are “Regen” (rain) and “Schirm” (guard).

• Farsi: “mehmankhane”² (hotel) is a compound word whose individual components are “mehman” (guest) and “khane” (house). Another example is

¹Reference: N. Garera and D. Yarowsky. Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2008.

²All the non-Latin-1 languages were represented in Unicode format while performing the experiments reported in this chapter.


Compound       Splitting       English Gloss   Translation
Input: Distilled glosses from German-English dictionary
Krankenhaus    Kranken-Haus    sick-house      hospital
Regenschirm    Regen-Schirm    rain-guard      umbrella
Worterbuch     Worter-Buch     words-book      dictionary
Eisenbahn      Eisen-Bahn      iron-path       railroad
Input: Distilled glosses from Swedish-English dictionary
Sjukhus        Sjuk-Hus        sick-house      hospital
Jarnvag        Jarn-Vag        iron-path       railway
Ordbok         Ord-Bok         words-book      dictionary
Goal: To translate new Albanian compounds
Hekurudhe      Hekur-Udhe      iron-path       ???

Table 3.1: Example lexical resources used in this task and their application to translating compound words in new languages.

“mizetahrir” (desk), whose individual components are “miz” (table), “e” (filler), and “tahrir” (writing).

For many languages, such words form a significant portion of the lexicon and the

compounding process is further complicated by diverse morphological processes and

the properties of different compound sequences such as Noun-Noun, Adj-Adj, Adj-

Noun, Verb-Verb, etc. Compounds also tend to have a high type frequency but a low token frequency, which makes their translation difficult to learn using corpus-based

algorithms (Tanaka and Baldwin, 2003). Furthermore, most of the literature on com-

pound translation has been restricted to a few languages dealing with compounding

phenomena specific to the language in question.

With these challenges in mind, the primary goal of this work is to improve the cover-

age of translation lexicons for compounds, as illustrated in Table 3.1 and Figure 3.1,

in multiple new languages. The algorithms presented show how using cross-language


compound evidence obtained from bilingual dictionaries can aid in compound trans-

lation. A primary motivating idea for this work is that the literal component gloss for a compound word (such as “iron path” for railway) is often replicated in multiple

languages, providing insight into the fluent translation of a similar literal gloss in a

new (often resource-poor) language.

3.2 Resources Utilized

The only resources utilized by the compound translation lexicon algorithm are a collection of bilingual dictionaries and a small lexicon of the source-target language pair for translating the individual components³. Bilingual dictionary collections for

50 languages were acquired in electronic form over the Internet or via optical character

recognition (OCR) on paper dictionaries. Note that no parallel or even monolingual

corpora are required; their use, described later in the chapter, is optional.

3.3 Related Work

The compound-translation literature typically deals with these steps: 1) Com-

pound splitting, 2) translation candidate generation and 3) translation candidate

scoring. Compound splitting is generally done using translation lexicon lookup and

³The individual component words are usually common, frequent words, and hence it is easier to obtain their translations either via a native speaker or via corpus-based methods.


allowing for different splitting options based on corpus frequency (Zhang et al., 2000;

Koehn and Knight, 2003).

Translation candidate generation is an important phase and this is where this work

differs significantly from the previous literature. Most of the previous work has been

focused on generating compositional translation candidates, that is, the translation

candidates of the compound words are lexically composed of the component word

translations. This has been done by either just concatenating the translations of

component words to form a candidate (Grefenstette, 1999; Cao and Li, 2002), or

using syntactic templates such as “E2 in E1”, “E1 of E2” to form translation candi-

dates from the translation of the component words E2 and E1 (Baldwin and Tanaka,

2004), or using synsets of the component word translations to include synonyms in

the compositional candidates (Navigli et al., 2003).

The above class of work in compositional-candidate generation fails to translate

compounds such as Krankenhaus (hospital) whose component word translations are

Kranken (sick) and Haus (house), and composing sick and house in any order will

not result in the correct translation (hospital). Another problem with using fixed

syntactic templates is that they are restricted to the specific patterns occurring in

the target language. This chapter describes how one can use the gloss patterns of

compounds in multiple other languages to hypothesize translation candidates that

are not lexically compositional.


[Figure 3.1 (flowchart): Goal: to translate the Albanian compound word hekurudhë. Using a small Albanian-English dictionary (hekur → iron, udhë → path, zog → bird, vadis → water), the compound is split by lexicon lookup into hekur (iron) and udhë (path). Words in other languages whose splits also gloss as “iron path” are then looked up: Italian-English dictionary ferrovia → ferro via (railroad ← iron path); German-English dictionary eisenbahn → eisen bahn (railroad ← iron path); Swedish-English dictionary järnväg → järn väg (railway ← iron path); Uighur-English dictionary tömüryol → tömür yol (railroad ← iron path). Algorithm output for hekurudhë, as ranked candidate translations: railroad 0.19, railway 0.14, rail 0.05.]

Figure 3.1: Illustration of using cross-language evidence from bilingual dictionaries of different languages for compound translation. The basic approach is to translate compound words by modeling the mapping of literal component-word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”) across multiple languages.


3.4 Approach

The approach to compound word translation employed here is illustrated in Figure

3.1.

3.4.1 Splitting compound words and gloss

generation with translation lexicon

lookup

First, a given source word such as the Albanian compound hekurudhë is split into a set of component word partitions, such as hekur (iron) and udhë (path). The

initial approach is to consider all possible partitions based on contiguous component

words found in a small dictionary for the language, as in Brown (2002) and Koehn

and Knight (2003)4. For a given split, its English glosses are generated by using all

possible English translations of the component words given in the dictionary of that

language. The algorithm is allowed to generate multiple glosses “iron way,” “iron

road,” etc. based on multiple translations of the component words. Multiple glosses

only add to the number of translation candidates generated.

4 In order to avoid inflections as component-words, the component-word length is limited to at least three characters.
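The splitting and gloss-generation step just described can be sketched in a few lines. This is a minimal illustration under stated assumptions (two-way splits only; a toy lexicon mapping source words to lists of English translations; function names are hypothetical), not the thesis implementation:

```python
def split_compound(word, lexicon, min_len=3):
    """Return all two-way contiguous splits of `word` whose parts both
    appear in the (small) translation lexicon. `min_len` mirrors the
    three-character minimum used to avoid inflections as components."""
    splits = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            splits.append((left, right))
    return splits

def gloss_candidates(word, lexicon, min_len=3):
    """Generate every literal English gloss for every valid split, using
    every translation listed for each component word."""
    glosses = []
    for left, right in split_compound(word, lexicon, min_len):
        for g1 in lexicon[left]:
            for g2 in lexicon[right]:
                glosses.append((g1, g2))
    return glosses

# Toy Albanian-English lexicon from the running example
lexicon = {"hekur": ["iron"], "udhë": ["path"]}
print(gloss_candidates("hekurudhë", lexicon))  # → [('iron', 'path')]
```

Multiple entries per component word simply multiply out into additional glosses, matching the "iron way", "iron road" behavior described above.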


3.4.2 Using cross-language evidence from different

bilingual dictionaries

For many compound words (especially for borrowings), the compounding process

is identical across several languages and the literal English gloss remains the same

across these languages. For example, the English word railway is translated as a

compound word in many languages, and the English gloss of those compounds is

often “iron path” or a similar literal meaning5. Thus knowing the fluent English

translation of the literal gloss “iron path” in some relatively resource-rich language

provides a vehicle for the translation from all other languages sharing that literal

gloss6.

3.4.3 Ranking translation candidates

The confidence in the correctness of a mapping between a literal gloss (e.g. “iron

path”) and fluent translation (e.g. “railroad”) can be based on the number of distinct

languages exhibiting this association. Thus the candidate translations generated via

different languages are ranked as in Figure 3.1 as follows: For a given target com-

pound word, say fc with a set of English glosses G obtained via multiple splitting

options or multiple component word translations, the translation probability for a

5For the gloss "iron path", 10 out of the 49 other languages were found in which some compound word has this English gloss after splitting and component-word translation.

6A small translation lexicon in the target language is used for translating the individual component-words, but these are often higher frequency words and present either in a basic dictionary or discoverable through corpus-based techniques.


candidate translation can be computed as:

p(ec|fc) = Σ_{g∈G} p(ec, g|fc)                    (3.1)

         = Σ_{g∈G} p(g|fc) · p(ec|g, fc)          (3.2)

         ≈ Σ_{g∈G} p(g|fc) · p(ec|g)              (3.3)

where p(g|fc) = p(g1|f1) · p(g2|f2), and f1, f2 are the individual component-words of the compound and g1, g2 are their translations from the existing dictionary. For human

dictionaries, p(g|fc) is uniform for all g ∈ G, while variable probabilities can also be

acquired from bitext or other translation discovery approaches. Also, p(ec|g) can be

estimated using the relative frequency freq(g, ec)/freq(g), where freq(g, ec) is the number of

times the compound word with English gloss g is translated as ec in the bilingual

dictionaries of other languages and freq(g) is the total number of times the English

gloss appears in these dictionaries.
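Equations 3.1–3.3 can be turned into a short scoring routine. The sketch below assumes the gloss probabilities and gloss-to-fluent-translation counts have already been collected from the other languages' dictionaries; the data structures and names are illustrative, not the thesis implementation:

```python
from collections import Counter

def rank_candidates(glosses, gloss_prob, gloss_ec_freq):
    """Score each fluent candidate ec by p(ec|fc) ≈ Σ_g p(g|fc)·p(ec|g).
    gloss_prob:    {g: p(g|fc)} from component-word translation probs
    gloss_ec_freq: {g: Counter({ec: freq(g, ec)})} collected from the
                   bilingual dictionaries of the other languages."""
    scores = Counter()
    for g in glosses:
        p_g = gloss_prob.get(g, 0.0)
        counts = gloss_ec_freq.get(g, Counter())
        total = sum(counts.values())
        for ec, f in counts.items():
            scores[ec] += p_g * f / total  # p(ec|g) = freq(g,ec)/freq(g)
    return scores.most_common()

# Toy counts loosely echoing Figure 3.1's "iron path" example
gloss_ec_freq = {("iron", "path"): Counter({"railroad": 6, "railway": 3, "rail": 1})}
gloss_prob = {("iron", "path"): 1.0}  # uniform: a single gloss
print(rank_candidates([("iron", "path")], gloss_prob, gloss_ec_freq))
# → [('railroad', 0.6), ('railway', 0.3), ('rail', 0.1)]
```

With a human dictionary, p(g|fc) is uniform over the glosses as stated above; corpus-derived probabilities slot into the same `gloss_prob` map.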

3.5 Evaluation using Exact-match

Translation Accuracy

For evaluation, the performance of the algorithm is tested on the following 10

languages: Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hungarian, Russian,

Slovak and Swedish. The evaluation results show both the average performance for


these 10 languages (Avg10), as well as provide individual performance details on

Albanian, Bulgarian, German and Swedish. For each of the compound translation

models, the evaluation results report coverage (the # of compound words for which

a hypothesis was generated by the algorithm) and Top1/Top10 accuracy. Top1 and

Top 10 accuracy are the fraction of words for which a correct translation (listed in

the evaluation dictionary) appears in the Top 1 and Top 10 translation candidates

respectively, as ranked by the algorithm. Because evaluation dictionaries are often

missing acceptable translations (e.g. railroad rather than railway), and any devia-

tion from exact-match is scored as incorrect, these measures will be a lower bound

on acceptable translation accuracy. Also, target language models can often select

effectively among such hypothesis lists in context.

3.6 Comparison of different compound

translation models

This section compares the results of various models for compound translation,

starting from the prior work on using compositional methods (Grefenstette, 1999;

Cao and Li, 2002) and then describing the new non-compositional and cross-language

evidence based methods introduced in this chapter.


Language    Compound words     Top1    Top10    Found
            translated         Acc.    Acc.     Acc.
Albanian     4472 (10.11%)     0.001   0.010    0.020
Bulgarian    9093 (12.50%)     0.001   0.015    0.031
German      15731 (29.11%)     0.004   0.079    0.134
Swedish     18316 (31.57%)     0.005   0.068    0.111
Avg10       14228 (17.84%)     0.002   0.030    0.055

Table 3.2: Baseline performance using unreordered literal English glosses as translations. The percentages in parentheses indicate what fraction of all the words in the test (entire) vocabulary were detected and translated as compounds.

3.6.1 A simple model using literal English gloss

concatenation as the translation

The baseline model is a simple gloss concatenation model for generating compo-

sitional translation candidates on the lines of Grefenstette (1999) and Cao and Li

(2002). The translations of the individual component-words (e.g. for the compound

word hekurudhe, they would be hekur (iron) and udhe (path)) are used for hypothe-

sizing three translation candidate variants: “ironpath”, “iron path” and “iron-path”.

A test instance is scored as correct if any of these translation candidates occur in

the translations of hekurudhe in the bilingual dictionary. This baseline performance

measures how well simple literal glosses serve as translation candidates. In cases such

as the German compound Nußschale (nutshell), which is a simple concatenation of

the individual components Nuß (nut) and Schale (shell), the literal gloss is correct.

For this baseline, if the component-words have multiple translations, then each of the

possible English glosses is ranked randomly. While Grefenstette (1999) and Cao and


Li (2002) proposed re-ranking these candidates using web-data, the potential gains

of this re-ranking are limited: as Table 3.2 shows, even the Found Acc. is very low7; that is, for most cases the correct translation does not appear anywhere in the set of English glosses. One explanation for this could be that for

only a small percentage of compound words, their dictionary translations are formed

by concatenating their English glosses. Also, Grefenstette (1999) reports much higher

accuracies for German on this model because his 724 German test compounds were

chosen in such a way that their correct translation is a concatenation of the possible

component word translations.
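The baseline's candidate generation and exact-match scoring reduce to a few lines. A minimal sketch (the helper names are hypothetical, and the case-insensitive comparison is an assumption not stated in the text):

```python
def baseline_variants(g1, g2):
    """The three literal-gloss candidates used by the baseline:
    closed compound, space-separated, and hyphenated."""
    return [g1 + g2, g1 + " " + g2, g1 + "-" + g2]

def baseline_correct(g1, g2, reference_translations):
    """Scored correct if any variant appears among the evaluation
    dictionary's translations for the compound."""
    refs = {t.lower() for t in reference_translations}
    return any(v.lower() in refs for v in baseline_variants(g1, g2))

# German Nußschale: literal gloss "nut shell" concatenates to the truth
print(baseline_correct("nut", "shell", ["nutshell"]))   # → True
# Krankenhaus: "sick house" can never match "hospital"
print(baseline_correct("sick", "house", ["hospital"]))  # → False
```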

3.6.2 Using bilingual dictionaries

This section describes the results from the model explained in Section 3.4. To

recap, this model attempts to translate every test word such that there is at least one

additional language whose bilingual dictionary supports an equivalent split and literal

English gloss, and bases its translation hypotheses on the consensus fluent transla-

tion(s) corresponding to the literal glosses in these other languages. The performance

is shown in Table 3.3. The substantial increase in accuracy over the baseline indicates

the usefulness of such gloss-to-translation guidance from other languages. The rest

of the sections detail the investigation of improvements to this model.

7Found Acc. is the fraction of examples for which the correct translation appears anywhere in the n-best list.


[Figure 3.2 content: for Nußschale (Nuß + Schale, English gloss "nut shell"), the compositional candidates Nutshell, Shellnut, Nut in shell and Nut of shell include the correct translation; for Krankenhaus (Kranken + Haus, English gloss "sick house"), the candidates Sickhouse, Housesick, Sick of house and House of sick never yield the correct translation hospital.]

Figure 3.2: Illustration of the problem with generating fluent translation candi-

dates via compositional methods (Grefenstette, 1999; Cao and Li, 2002; Baldwin

and Tanaka, 2004)


Language    Compound words      Top1    Top10
            translated          Acc.    Acc.
Albanian     3085 (6.97%)       0.185   0.332
Bulgarian    6719 (9.24%)       0.247   0.416
German      11103 (20.55%)      0.195   0.362
Swedish     12681 (21.86%)      0.188   0.346
Avg10        9320.9 (11.98%)    0.184   0.326

Table 3.3: Coverage and accuracy for the standard model using gloss-to-fluent translation mappings learned from bilingual dictionaries in other languages (in forward order only).

3.6.3 Using forward and backward ordering for

English gloss search

In the standard model, the literal English gloss for a source compound word (for

example, iron path) matches glosses in other language dictionaries only in the identical

order. But given that modifier/head word order often differs between languages,

one can test how searching for both orderings (e.g. “iron path” and “path iron”)

can improve performance, as shown in Table 3.5. The percentages in parentheses

show relative increase from the performance of the standard model in Section 3.4. A

substantial improvement is seen in both coverage and accuracy.
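The forward-and-backward search amounts to querying the cross-language gloss table under both orderings. A minimal sketch (the gloss-to-fluent mapping and all names are hypothetical):

```python
def gloss_orderings(g1, g2):
    """Search keys for a literal gloss in both modifier/head orders,
    e.g. ("iron", "path") and ("path", "iron")."""
    return [(g1, g2), (g2, g1)]

def lookup_both_orders(g1, g2, gloss_to_fluent):
    """Union of the fluent translations attested under either ordering
    in the other languages' dictionaries."""
    found = []
    for key in gloss_orderings(g1, g2):
        found.extend(gloss_to_fluent.get(key, []))
    return found

# Hypothetical mapping mined from other languages' dictionaries, where
# the evidence happened to be stored head-first
gloss_to_fluent = {("path", "iron"): ["railway"]}
print(lookup_both_orders("iron", "path", gloss_to_fluent))  # → ['railway']
```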

3.6.4 Increasing coverage by automatically discov-

ering compound morphology

For many languages, the compounding process introduces its own morphology

(Figure 3.3). For example, in German, the word Geschäftsführer (manager) consists


Language       Dictionary Size    Language       Dictionary Size
Afrikaans           11,389        Malay                9,438
Albanian           188,563        Maltese              7,574
Arabic             167,189        Maori               27,967
Azeri              231,891        Mongolian              948
Bangla               1,606        Nepali               6,812
Basque                 880        Polish             261,463
Bosnian             18,283        Portuguese             840
Bulgarian          316,631        Punjabi             76,311
Chinese             82,080        Romanian           249,479
Czech              262,690        Russian            423,009
Dutch              233,805        Serbian            168,140
Esperanto            3,001        Slovak             233,093
Farsi              198,605        Somali                 230
French             195,627        Spanish            347,441
German             272,230        Swedish            227,849
Greek              160,126        Tagalog            247,662
Hindi               58,179        Tamil              165,004
Hungarian          289,225        Tatar                8,557
Indonesian          67,633        Thai                14,925
Irish                  887        Tibetan             59,083
Italian            166,966        Tigrinya                56
Kapampangan          1,000        Turkish          1,272,881
Kazakh             145,750        Turkmen             91,928
Korean             229,742        Uighur              16,285
Kurdish              9,870        Ukrainian           14,056
Kyrgyz              74,890        Urdu                36,428
Latin               18,884        Uzbek              190,688
Latvian            148,363        Welsh               25,832

Table 3.4: Size of various bilingual dictionaries (with other language as English)

Language    Compound words      Top1             Top10
            translated          Acc.             Acc.
Albanian     3229 (+4.67%)      .217 (+17.30%)   .409 (+23.19%)
Bulgarian    6806 (+1.29%)      .255 (+3.24%)    .442 (+6.25%)
German      11346 (+2.19%)      .199 (+2.05%)    .388 (+7.18%)
Swedish     12970 (+2.28%)      .189 (+0.53%)    .361 (+4.34%)
Avg10        9603 (+3.03%)      .193 (+4.89%)    .362 (+11.04%)

Table 3.5: Performance for looking up English gloss via both orderings. The percentages in parentheses are relative improvements from the performance in Table 3.3.


[Figure 3.3 content: German Geschäftsführer (manager) splits as Geschäft + s + Führer ((business) + (guide)), with s as middle glue; Latin paterfamilias (household head) splits as Pater + Familia + s ((father) + (family)), with s as end glue.]

Figure 3.3: Illustration of compounding morphology using middle and end glue char-

acters.

of the lexemes Geschäft (business) and Führer (guide) joined by the lexeme -s. For

the purposes of these experiments, such lexemes are called fillers or middle glue

characters. Koehn and Knight (2003) used a fixed set of two known fillers s and

es for handling German compounds. To broaden the applicability of this work to

new languages without linguistic guidance, such fillers are estimated directly from

corpora in different languages. In addition to fillers, compounding can also introduce

morphology at the suffix or prefix of compounds, for example, in the Latin language,

the lexeme paterfamilias contains the genitive form familias of the lexeme familia

(family), thus s in this case is referred to as the “end glue” character. To augment

the splitting step outlined in Section 3.4.1, deletion of up to two middle characters

and two end characters is allowed. Then, for each glue candidate (for example es), its

probability is estimated as the relative frequency of unique hypothesized compound

words successfully using that particular glue.

The set of glues is ranked by probability, and the top 10 are taken as middle and end


Albanian        Bulgarian       German          Swedish
Top 15 Middle Glue Character(s)
j    0.059      O    0.129      s    0.133      s    0.132
s    0.048      N    0.046      n    0.090      l    0.051
t    0.042      H    0.036      k    0.066      n    0.049
r    0.042      E    0.025      h    0.042      t    0.045
i    0.038      A    0.025      f    0.037      r    0.035
l    0.031      d    0.024      l    0.036      k    0.030
n    0.030      C    0.025      r    0.032      g    0.026
e    0.022      3    0.023      t    0.031      v    0.023
m    0.022      y    0.021      er   0.027      d    0.023
a    0.021      T    0.021      st   0.024      b    0.023
a    0.021      CT   0.020      en   0.022      f    0.020
k    0.020      l    0.020      b    0.022      e    0.020
sh   0.019      CK   0.019      ge   0.019      m    0.019
h    0.016      P    0.019      e    0.016      st   0.017
u    0.015      K    0.018      ch   0.016      p    0.016
Top 15 End Glue Character(s)
m    0.146      T    0.124      n    0.188      a    0.074
t    0.079      EH   0.092      t    0.167      g    0.073
s    0.059      H    0.063      en   0.130      t    0.059
k    0.048      M    0.049      e    0.069      e    0.057
r    0.037      AM   0.047      d    0.043      d    0.057
es   0.037      E    0.046      r    0.041      re   0.046
e    0.034      �    0.037      er   0.040      k    0.041
je   0.027      K N  0.033      g    0.040      n    0.039
i    0.023      A    0.032      ig   0.024      ng   0.037
e    0.023      O    0.030      nd   0.018      l    0.034
es   0.022      CT   0.027      l    0.015      s    0.031
l    0.021      HE   0.025      s    0.014      r    0.029
n    0.020      K    0.025      ch   0.012      sk   0.025
st   0.019      KA   0.025      i    0.011      ra   0.019
te   0.017      NE   0.018      m    0.010      ad   0.018

Table 3.6: Top 15 middle glues (fillers) and end glues discovered for each language along with their probability values. Glue characters allow for appropriately splitting the compound words into the root forms of the individual components for lookup in a lexicon.


Language    Compound words       Top1            Top10
            translated           Acc.            Acc.
Albanian     3272 (+1.33%)       .214 (-1.38%)   .407 (-0.49%)
Bulgarian    7211 (+5.95%)       .258 (+1.18%)   .443 (+0.23%)
German      13372 (+17.86%)      .200 (+0.50%)   .391 (+0.77%)
Swedish     15094 (+16.38%)      .190 (+0.53%)   .363 (+0.55%)
Avg10       10273 (+6.98%)       .194 (+0.52%)   .363 (+0.28%)

Table 3.7: Performance for increasing coverage by including compounding morphology. The percentages in parentheses are relative improvements from the performance in Table 3.5.

glues for each language. A sample of the glues discovered for some of the languages is

shown in Table 3.6. The performance for the morphology step is shown in Table 3.7.

The relative percentage improvements are with respect to the previous Section 3.6.3.

A statistically significant gain in coverage is observed, as the flexibility of the glue process

allows discovery of more compounds.
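The glue-estimation step can be sketched as a brute-force search over split points and glue lengths. This is a simplified illustration (middle glues only, one glue per word, counts taken per split rather than per unique compound as the text specifies; names are hypothetical):

```python
from collections import Counter

def discover_glues(vocabulary, lexicon, max_glue=2, min_len=3):
    """Estimate middle-glue probabilities: whenever a word splits as
    left + glue + right (glue of up to `max_glue` characters) with both
    parts in the lexicon, count the glue; probabilities are relative
    frequencies over hypothesized compounds. End glues (material
    deleted at the word edges) could be handled analogously."""
    counts = Counter()
    for word in vocabulary:
        for glue_len in range(1, max_glue + 1):
            for i in range(min_len, len(word) - glue_len - min_len + 1):
                left = word[:i]
                glue = word[i:i + glue_len]
                right = word[i + glue_len:]
                if left in lexicon and right in lexicon:
                    counts[glue] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# The Geschäftsführer example: the filler -s- is recovered
lexicon = {"geschäft": ["business"], "führer": ["guide"]}
print(discover_glues(["geschäftsführer"], lexicon))  # → {'s': 1.0}
```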

3.6.5 Re-ranking using context vector projection

Performance can be further improved by re-ranking candidate translations based

on the goodness of semantic “fit” between two words, as measured by their context

similarity. This can be accomplished as in Rapp (1999) and Schafer and Yarowsky

(2002) by creating bag-of-words context vectors around both the source and tar-

get language words and then projecting the source vectors into the (English) target

space via the current small translation dictionary. Once in the same language space,

source words and their translation hypotheses are compared via cosine similarity us-

ing their surrounding context vectors. This experiment was performed for German


Method                      Top1avg    Top10avg
Original ranking            0.196      0.388
Comb. with Context Sim      0.201      0.391

Table 3.8: Average performance on German and Swedish with and without using context vector similarity from monolingual corpora.

and Swedish and report average accuracies with and without this addition in Table

3.8. For monolingual corpora, the German and Swedish side of the Europarl corpus

(Koehn, 2005) was used, consisting of approximately 15 million and 21 million words

respectively. The context vectors could be projected for an average of 4224.5 words in

the two languages among all the possible compound words detected in Section 3.6.4.

The poor Europarl coverage could be due to the fact that compound words are generally technical words with low Europarl corpus frequency, especially in parliamentary

proceedings. Thus, the small performance gains here could be attributed to these

limitations of the monolingual corpora.
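The projection-and-cosine re-ranking can be sketched as follows, in the style of Rapp (1999). The seed dictionary, context counts and names below are toy assumptions, not data from the thesis:

```python
import math
from collections import Counter

def project(src_vector, seed_dict):
    """Project a source-language bag-of-words context vector into
    English via a small seed dictionary; words without a dictionary
    entry are simply dropped."""
    projected = Counter()
    for word, weight in src_vector.items():
        for eng in seed_dict.get(word, []):
            projected[eng] += weight
    return projected

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy German context for Eisenbahn, compared against a candidate's
# English context vector after projection into English space
seed_dict = {"zug": ["train"], "fahren": ["travel"]}
src = Counter({"zug": 3, "fahren": 1})
railway_ctx = Counter({"train": 2, "travel": 1})
proj = project(src, seed_dict)
print(round(cosine(proj, railway_ctx), 3))  # → 0.99
```

Candidates from the n-best list would be re-scored by combining this similarity with the original cross-language ranking, as in the "Comb. with Context Sim" row of Table 3.8.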

3.6.6 Using phrase-tables if a parallel corpus is

available

All previous results presented in this chapter have been for translation lexicon dis-

covery without the need for parallel bilingual text (bitext), which is often in limited

supply for lower-resource languages. However, it is useful to assess how this transla-

tion lexicon discovery work compares with traditional bitext-based lexicon induction

(and how well the approaches can be combined). For this purpose, phrase tables


Method                   # of words    Top1    Top10
                         translated    Acc.    Acc.
German
BiDict                      13372      0.200   0.391
Parallel Corpus SMT          3281      0.423   0.576
Parallel + BiDict            3281      0.452   0.579
Czech
BiDict (thresh=1)            3455      0.276   0.514
Parallel Corpus SMT           309      0.285   0.404
Parallel + BiDict             309      0.359   0.599

Table 3.9: Performance of this work's BiDict approach compared with and augmented with traditional statistical MT learning from bitext.

learned by the standard statistical MT Toolkit Moses (Koehn et al., 2007) were used.

The phrase-table accuracy was tested on two languages: one for which a large amount of parallel data was available (the German-English Europarl corpus, with approximately 15 million words) and one for which relatively little parallel data was available (the Czech-English news-commentary corpus, with approximately 1 million words). This

was done to see how the amount of parallel data available affects the accuracy and

coverage of compound translation. Table 3.9 shows the performance for this exper-

iment. For German, a significant improvement in accuracy is seen; for Czech, there is a small improvement in Top1 accuracy but a decline in Top10 accuracy. Note that these accuracies are still quite low compared to the general performance of phrase tables in an

end-to-end MT system because the evaluation measures exact-match accuracy on a

generally more challenging and often-lower-frequency lexicon subset. The third row

in Table 3.9 for each of the languages shows that if one had a parallel corpus available,

its n-best list can be combined with the n-best list of the Bilingual Dictionaries algorithm


to provide much higher consensus accuracy gains using weighted voting.

3.7 Statistical Significance of Results

All results reported in this chapter are based on very large test sets, greater than

9,000 examples on average, and the algorithms presented result in large gains with

respect to the baseline. Using a binomial test of various sample sizes and baseline

accuracies in different languages, all improvements with respect to the baseline results

(Table 3.2) are statistically significant with p-value less than 0.05. Furthermore, the

accuracy improvements with respect to the Moses system (Koehn et al., 2007) using parallel corpora in Table 3.9 are also statistically significant for Czech and for Top 1

accuracy on German. The Top 10 accuracy on German in this table is not statistically

significant with respect to phrase-table based accuracy. Nevertheless, this does not

contradict the claim of the methods presented in this chapter being more useful for

languages where large amounts of parallel corpora are not available such as Czech.
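The significance test described here is a standard one-sided exact binomial test, which is short enough to write out directly. The counts below are illustrative, not the chapter's actual sample sizes:

```python
from math import comb

def binomial_p_value(successes, n, p0):
    """One-sided exact binomial test: the probability of observing at
    least `successes` correct translations out of `n` trials if the
    true accuracy were the baseline p0."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

# Toy check: 20 correct out of 100 against a 10% baseline accuracy
print(binomial_p_value(20, 100, 0.10) < 0.05)  # → True
```

With test sets in the thousands, as here, even small absolute accuracy gains over the baseline clear the p < 0.05 threshold.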


3.8 Quantifying the Role of

Cross-language Selection

and Usage

3.8.1 Coverage/Accuracy Trade-off

The number of languages offering a translation hypothesis for a given literal En-

glish gloss is a useful parameter for measuring confidence in the algorithm’s selection.

The more distinct languages exhibit a translation for the gloss, the higher the likelihood that the majority translation will be correct rather than noise.

parameter yields the coverage/accuracy trade-off as shown in Figure 3.4.
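The trade-off parameter is simply a threshold on the number of supporting languages. A minimal sketch (the candidate evidence sets are hypothetical):

```python
def filter_by_language_support(candidates, min_langs):
    """Keep only fluent candidates whose literal-gloss mapping is
    attested in at least `min_langs` distinct other languages;
    raising the threshold trades coverage for accuracy."""
    return {ec: langs for ec, langs in candidates.items()
            if len(langs) >= min_langs}

# Hypothetical evidence: which languages support each candidate
candidates = {"railroad": {"it", "de", "sv", "ug"}, "iron way": {"eo"}}
print(sorted(filter_by_language_support(candidates, 3)))  # → ['railroad']
```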

3.8.2 Varying the size of bilingual dictionaries

Figure 3.5 illustrates how the size of the bilingual dictionaries used for providing

cross-language evidence affects translation performance. In order to take both cov-

erage and accuracy into account, the performance measure used was the F-score, which

is a harmonic average of Precision (the accuracy on the subset of words that could

be translated) and Pseudo-recall (which is the correctly translated fraction out of

total words that could be translated using 100% of the dictionary size). Given the

Precision P (same as accuracy) and Pseudo-Recall (R) as defined above, the F-score


[Figure 3.4 content: coverage/accuracy trade-off curves, plotting exact-match accuracy (Avg Top 1 Acc. and Avg Top 10 Acc.) against the number of words translated as compounds, with thresholds from >= 3 up to >= 14 on the number of supporting languages marked along the curves.]

Figure 3.4: Coverage/Accuracy trade-off curve by incrementing the minimum number

of languages exhibiting a candidate translation for the source-word’s literal English

gloss. Accuracy here is the Top1 accuracy averaged over all 10 test languages.


Figure 3.5: F-measure performance given varying sizes of the bilingual dictionaries

used for cross-language evidence (as a percentage of words randomly utilized from

each dictionary).


was computed in the standard manner as:

F = (2 · P · R) / (P + R)                    (3.4)

Figure 3.5 shows that increasing the percentage of dictionary size8 always helps with-

out plateauing, suggesting substantial extrapolation potential from large dictionaries.
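Equation 3.4 is a direct transcription into code; the zero guard is an addition for safety, and the input values below are illustrative:

```python
def f_score(precision, pseudo_recall):
    """Harmonic mean of Precision and Pseudo-recall (Eq. 3.4)."""
    if precision + pseudo_recall == 0:
        return 0.0
    return 2 * precision * pseudo_recall / (precision + pseudo_recall)

# Example: P = 0.25, pseudo-R = 0.15 (illustrative values)
print(round(f_score(0.25, 0.15), 4))  # → 0.1875
```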

3.8.3 Greedy vs Random Selection of Utilized

Languages

A natural question for the compound translation algorithm is how the choice of additional languages affects performance. Results of two experiments are reported

on this question. A simple experiment is to use bilingual dictionaries of randomly

selected languages and test the performance of K randomly selected languages9, in-

crementing K until it is the full set of 50 languages. The dashed lines in Figures

3.6 and 3.7 show this trend. The performance is measured by F-score as in Section

3.8.2, where Pseudo-Recall here is the fraction of correct candidates out of the total

candidates that could be translated, had bilingual dictionaries of all the languages

been used. It can be seen that adding random bilingual dictionaries helps improve

the performance in a close to linear fashion.

Furthermore, it can be observed that certain contributing languages are much more

effective than others (e.g. Arabic/Farsi vs. Arabic/Czech). A greedy heuristic is

8Each run of choosing a percentage of dictionary size was averaged over 10 runs.

9Each run of randomly selecting K languages was averaged over 10 runs.


Figure 3.6: Top-1 match F-score performance utilizing K languages for cross-language

evidence, for both a random K languages and greedy selection of the most effective

K languages (typically the closest or largest dictionaries)


Figure 3.7: Top-10 match F-score performance utilizing K languages for cross-

language evidence, for both a random K languages and greedy selection of the most

effective K languages (typically the closest or largest dictionaries)


used for ranking each additional cross-language: the number of test words for

which the correct English translation can be provided by the bilingual dictionary of

the respective cross-language. Figures 3.6 and 3.7 show that greedy selection of the

most effective K utilized languages using this heuristic substantially accelerates per-

formance. In fact, beyond the best 10 languages, performance plateaus and actually

decreases slightly, indicating that increased noise is outweighing increased coverage.
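One plausible reading of the greedy selection is marginal-gain greedy over the sets of test words each helper language's dictionary translates correctly; the thesis may instead rank by raw counts, and the data below is a toy assumption:

```python
def greedy_select(languages, correct_sets, k):
    """Pick up to k helper languages, at each step taking the language
    that adds the most newly covered test words. `correct_sets` maps a
    language to the set of test words its dictionary gets right."""
    chosen, covered = [], set()
    for _ in range(k):
        if not languages:
            break
        best = max(languages, key=lambda l: len(correct_sets[l] - covered))
        if not correct_sets[best] - covered:
            break  # no remaining language adds anything new
        chosen.append(best)
        covered |= correct_sets[best]
        languages = [l for l in languages if l != best]
    return chosen

# Toy data: Dutch covers the most words, Swedish adds w4, Czech adds nothing
correct_sets = {
    "Dutch":   {"w1", "w2", "w3"},
    "Swedish": {"w3", "w4"},
    "Czech":   {"w1"},
}
print(greedy_select(list(correct_sets), correct_sets, 2))  # → ['Dutch', 'Swedish']
```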

3.8.4 Languages found using Greedy selection

Table 3.10 shows the sets of the most effective three cross-languages per test

language, selected using the greedy heuristic explained in the previous section. Unsurprisingly, related languages tend to help more than distant languages. For example,

Dutch is most effective for the test language German, and Slovak is most effective

for Czech. Interesting symmetries can also be seen between related languages, for

example: Farsi is the top language used for test language Arabic and vice-versa.

Such symmetries can also be seen for other pairs of related languages such as (Czech,

Slovak) and (Russian, Bulgarian). Thus, related languages are most helpful and they

can be related in several ways: etymologically, culturally, or through geographic contact (such as Hungarian contact with the Germanic languages). The second point to note is that

languages having large dictionaries also tend to be especially helpful, even when un-

related. This can be seen by the presence of Hungarian in top three cross-languages

for most of the test languages. This is likely because Hungarian has one of the


Albanian                         Arabic
Russian       0.067  0.116       Farsi         0.051  0.090
+Spanish      0.100  0.169       +Spanish      0.059  0.111
+Bulgarian    0.119  0.201       +French       0.077  0.138

Bulgarian                        Czech
Russian       0.186  0.294       Slovak        0.177  0.289
+Hungarian    0.190  0.319       +Russian      0.222  0.368
+Swedish      0.203  0.339       +Hungarian    0.235  0.407

Farsi                            German
Arabic        0.031  0.047       Dutch         0.130  0.228
+Dutch        0.038  0.070       +Swedish      0.191  0.316
+Spanish      0.044  0.079       +Hungarian    0.204  0.355

Hungarian                        Russian
Swedish       0.073  0.108       Bulgarian     0.185  0.250
+Dutch        0.103  0.158       +Hungarian    0.199  0.292
+German       0.117  0.182       +Swedish      0.216  0.319

Slovak                           Swedish
Czech         0.145  0.218       German        0.120  0.188
+Russian      0.168  0.280       +Hungarian    0.152  0.264
+Hungarian    0.176  0.300       +Dutch        0.182  0.309

Table 3.10: Illustrating the 3-best cross-languages obtained for each test language (shown in bold). Each row shows the effect of adding the respective cross-language to the set of languages in the rows above it and the corresponding F-scores (Top 1 and Top 10) achieved.

largest dictionaries and hence can provide good coverage for obtaining translation

candidates of rarer or technical compounds, which may have more language universal

literal glosses. For reference, the sizes of various bilingual dictionaries are provided in

Table 3.4.

3.9 Conclusion

This chapter presents a successful approach to extracting compound translation

relationships without the need for bilingual training text, by modeling the mapping of


literal component-word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”)

across multiple languages. An interesting property of using such cross-language ev-

idence is that one does not need to restrict the candidate translations to compositional

(or “glossy”) translations, as this model allows the successful generation of more

fluent non-compositional translations. Performance is further improved by adding

component-sequence and learned-morphology models along with context similarity

from monolingual text and optional combination with traditional bilingual-text-based

translation discovery. These models show consistent performance gains across 10 di-

verse test languages.


Chapter 4

Improving Translation Lexicon

Induction from Monolingual

Corpora via Dependency Contexts

and Part-of-Speech Equivalences

Summary

This chapter presents novel improvements to the induction of translation lexi-

cons from monolingual corpora by incorporating multilingual dependency parses. A

dependency-based context model was introduced that incorporates long-range depen-

dencies, variable context sizes, and reordering. It provides a 16% relative improvement


over the baseline approach that uses a fixed context window of adjacent words. Its

Top 10 accuracy for noun translation is higher than that of a statistical translation

model trained on a Spanish-English parallel corpus containing 100,000 sentence pairs.

The evaluation was generalized to other word-types, and it was shown that the

gain can be increased to 18% relative by preserving part-of-speech equivalences

during translation.

Components of this chapter were originally published by the author of this disserta-

tion in the forum referenced below1.

4.1 Introduction

Recent trends in machine translation illustrate that highly accurate word and

phrase translations can be learned automatically given enough parallel training data.

However, large parallel corpora exist for only a small fraction of the world’s languages,

leading to a bottleneck for building translation systems in low resource languages

such as Swahili, Uzbek or Punjabi. While parallel training data is uncommon for

such languages, more readily available resources include small translation dictionaries,

comparable corpora, and large amounts of monolingual data.

The marked difference in the availability of monolingual vs parallel corpora has

led several researchers to develop methods for automatically learning bilingual lex-

1Reference: N. Garera, C. Callison-Burch, D. Yarowsky. Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences. To appear in Proceedings of the Conference on Natural Language Learning (CoNLL), 2009.


icons, either by using monolingual corpora (Rapp, 1999; Koehn and Knight, 2002;

Schafer and Yarowsky, 2002; Haghighi et al., 2008) or by exploiting the cross-language

evidence of closely related “bridge” languages that have more resources (Mann and

Yarowsky, 2001).

This chapter investigates new ways of learning translations from monolingual cor-

pora. The Rapp (1999) model of context vector projection using a seed lexicon is

extended. It is based on the intuition that translation equivalents will have similar

lexical context, even in unrelated corpora. For example, in order to translate the word

“airplane”, the algorithm builds a context vector which might contain terms such as

“passengers”, “runway”, “airport”, etc.; words in the target language that have the

translations of these terms (obtained via the seed lexicon) in their surrounding context

can then be considered likely translations.

The basic approach is extended by formulating a context model that uses depen-

dency parses as an external knowledge source. The use of dependency parses has the

following advantages:

• Long distance dependencies allow associated words to be included in the context

vector even if they fall outside of the fixed-window used in the baseline model.

• Using relationships like parent and child instead of absolute positions like pre-

ceding and following word alleviates problems when projecting vectors between

languages with different word orders.


• It achieves better performance than baseline context models across the board,

and better performance than statistical translation models on Top-10 accuracy

for noun translation when trained on identical data.

It is shown that an extension based on part-of-speech clustering can give similar

accuracy gains for learning translations of all word types, thus deepening the findings

of previous literature which mainly focused on translating nouns (Rapp, 1999; Koehn

and Knight, 2002; Haghighi et al., 2008).

4.2 Related Work

The literature on translation lexicon induction for low resource languages falls into

two broad categories: 1) Effectively utilizing similarity between languages by choosing

a high-resource “bridge” language for translation (Mann and Yarowsky, 2001; Garera

and Yarowsky, 2008) and 2) Extracting noisy clues (such as similar context) from

monolingual corpora with the help of a seed lexicon (Rapp, 1999; Koehn and Knight,

2002; Schafer and Yarowsky, 2002; Haghighi et al., 2008). The latter category is

more relevant to this work and is explained in detail below.

The idea of words with similar meaning having similar contexts in the same lan-

guage comes from the Distributional Hypothesis (Harris, 1985) and Rapp (1999)

was the first to propose using the context of a given word as a clue to its translation.

Given a German word with an unknown translation, a German context vector is con-


structed by counting its surrounding words in a monolingual German corpus. Using

an incomplete bilingual dictionary, the counts of the German context words with

known translations are projected onto an English vector. The projected vector for

the German word is compared to the vectors constructed for all English words using

a monolingual English corpus. The English words with the highest vector similarity

are treated as translation candidates. The original work employed a relatively large

bilingual dictionary containing approximately 16,000 words and tested only on a small

collection of 100 manually selected nouns.

Koehn and Knight (2002) tested this idea on a larger test set consisting of the

1000 most frequent words from a German-English lexicon. They also incorporated

clues such as frequency and orthographic similarity in addition to context. Schafer

and Yarowsky (2002) independently proposed using frequency and orthographic simi-

larity and also showed improvements using temporal and word-burstiness similarity

measures, in addition to context. Haghighi et al. (2008) made use of contextual and

orthographic clues for learning a generative model from monolingual corpora and a

seed lexicon.

All of the aforementioned work defines context similarity in terms of the adjacent

words over a window of some arbitrary size (usually 2 to 4 words), as initially proposed

by Rapp (1999). This work shows that the model for surrounding context can be

improved by using dependency information rather than strictly relying on adjacent

words, based on the success of dependency trees for monolingual clustering tasks (Lin


and Pantel, 2002) and the recent developments in multilingual dependency parsing

literature (Buchholz and Marsi, 2006; Nivre et al., 2007).

This work further includes a second evaluation that examines the accuracy of

translating all word types, rather than just nouns, thus differentiating it from previous

work. While the straightforward application of the context-based model gives a lower

overall accuracy than nouns alone, this work shows how learning a mapping of part-

of-speech tagsets between the source and target language can result in comparable

performance to that of noun translation.

4.3 Translation by Context Vector

Projection

This section details how translations are discovered from monolingual corpora

through context vector projection. Section 4.3.1 defines alternative ways of modeling

context vectors, including the baseline models and the dependency-based model.

The central idea of Rapp’s method for learning translations from unrelated mono-

lingual corpora is based on context vector projection and vector similarity. The good-

ness of semantic “fit” of candidate translations is measured as the vector similarity

between two words. Those vectors are drawn from two different languages, so the

vector for one word must first be projected onto the language space of the other.

Rapp’s algorithm for creating, projecting and comparing vectors is described below,


and illustrated in Figure 4.1.

Algorithm:

1. Extract context vectors:

Given a word in source language, say sw, create a vector using the surrounding

context words and call this reference source vector rssw for source word sw.

The actual composition of this vector varies depending on how the surrounding

context is modeled. The context model is independent of the algorithm, and

various models are explained in later sections.

2. Project reference source vector:

Project all the source vector words contained in the projection dictionary onto

the vector space for the target language, retaining the counts from the source corpus.

This vector now exists in the target language space and is called the reference

target vector rtsw . This vector may be sparse, depending on how complete the

bilingual dictionary is, because words without dictionary entries will receive

zero counts in the reference target vector.

3. Rank candidates by vector similarity:

For each word twi in the target language, a context vector is created using the

target language monolingual corpus as in Step 1. Compute a similarity score

between the context vector of twi = 〈ci1, ci2, ..., cin〉 and the reference target vector


Figure 4.1: Illustration of the (Rapp, 1999) model for translating the Spanish word

“crecimiento (growth)” via dependency context vectors extracted from the respective

monolingual corpora, as explained in Section 4.3.1.2


rtsw = 〈r1, r2, ..., rn〉. The word with the maximum similarity score, t∗wi, is chosen

as the candidate translation of sw.

The vector similarity can be computed in a number of ways. Cosine similar-

ity was used in our implementation setup, and the formula for this similarity

measure is given below:

t∗wi = argmax_twi  (ci1·r1 + ci2·r2 + ... + cin·rn) / (√(ci1² + ci2² + ... + cin²) · √(r1² + r2² + ... + rn²))

Rapp (1999) used the l1-norm metric after normalizing the vectors to unit length,

Koehn and Knight (2002) used Spearman rank order correlation, and Schafer

and Yarowsky (2002) used cosine similarity. Cosine similarity was found to give

the best results in the experimental conditions evaluated. Other similarity mea-

sures may be used equally well.
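As a concrete illustration, the three steps above can be sketched in a few lines of Python. This is a toy sketch under simplifying assumptions, not the actual implementation: context vectors are plain word-count dictionaries, and the seed dictionary, example words, and counts below are hypothetical.

```python
import math

def project(source_vec, seed_dict):
    """Step 2: project a source-language context vector into the
    target-language space via the seed dictionary, retaining counts.
    Source words without a dictionary entry contribute nothing."""
    projected = {}
    for src_word, count in source_vec.items():
        if src_word in seed_dict:
            tgt_word = seed_dict[src_word]
            projected[tgt_word] = projected.get(tgt_word, 0.0) + count
    return projected

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(source_vec, seed_dict, target_vecs):
    """Step 3: rank target-language words by similarity of their
    context vectors to the projected reference vector."""
    ref = project(source_vec, seed_dict)
    scores = {tw: cosine(ref, vec) for tw, vec in target_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: translating Spanish "avion" (airplane).
seed = {"pasajeros": "passengers", "pista": "runway", "aeropuerto": "airport"}
avion_vec = {"pasajeros": 5, "pista": 3, "aeropuerto": 4}
english_vecs = {
    "airplane": {"passengers": 6, "runway": 2, "airport": 5},
    "growth":   {"economy": 7, "rate": 4},
}
print(rank_candidates(avion_vec, seed, english_vecs)[0])  # -> airplane
```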

4.3.1 Models of Context

Several context models are compared in this section. Empirical results for their

ability to find accurate translations are given in Section 4.5.

4.3.1.1 Baseline model

In the baseline model, the context is computed using adjacent words as in

(Rapp, 1999; Koehn and Knight, 2002; Schafer and Yarowsky, 2002; Haghighi et al.,

2008). Given a word in the source language, say sw, count all its immediate context


Figure 4.2: An illustration of a dependency tree, showing the parent and child

nodes. The word marked in bold (“crecimiento”) is used as the example source word

in the chapter for illustrative purposes, and its adjacent and dependency contexts are

shown in Table 4.1.

words appearing in a window of four words. The counts are collected separately for

each position by keeping track of four separate vectors for positions -2, -1, +1 and

+2. Thus each vector is a sparse vector, with dimensionality equal to the size

of the source language vocabulary. In addition to the term frequency, each dimension is

reweighted by multiplying the inverse document frequency (IDF) as in the standard

TF.IDF weighting scheme. Since there were no clear document boundaries in the

corpus, virtual document boundaries for computing the IDF were created by

binning after every 1000 words. These vectors are then concatenated into a single

vector, having dimension four times the size of the vocabulary. This vector is called

the reference source vector rssw for source word sw.
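The construction above can be sketched as follows. This is an illustrative sketch only: the function name, the (position, word) key encoding of the concatenated vector, and the +1 IDF smoothing (added so a single virtual bin still yields nonzero weights) are assumptions, not the chapter's actual implementation.

```python
import math
from collections import Counter, defaultdict

def positional_vectors(tokens, positions=(-2, -1, 1, 2), bin_size=1000):
    """Build position-specific context vectors with TF.IDF reweighting.
    Virtual document boundaries are created every `bin_size` tokens."""
    n_bins = max(1, math.ceil(len(tokens) / bin_size))
    doc_freq = Counter()                 # context word -> number of bins containing it
    for b in range(n_bins):
        for w in set(tokens[b * bin_size:(b + 1) * bin_size]):
            doc_freq[w] += 1
    tf = defaultdict(Counter)            # word -> counts over (position, context word)
    for i, w in enumerate(tokens):
        for p in positions:
            j = i + p
            if 0 <= j < len(tokens):
                tf[w][(p, tokens[j])] += 1
    # One concatenated sparse vector per word, keyed by (position, context word),
    # each dimension reweighted by the IDF of the context word (+1 smoothing).
    return {w: {k: c * math.log((n_bins + 1) / doc_freq[k[1]])
                for k, c in ctx.items()}
            for w, ctx in tf.items()}
```

Keying the dimensions by (position, word) pairs is equivalent to concatenating four vocabulary-sized vectors, while only storing the nonzero entries.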


Position   Adjacent Context   Dependency Context
-2         para               camino
-1         el                 para
+1         y                  prosperidad, y, el
+2         la                 economica

Table 4.1: Contrasting context words derived from the adjacent vs. dependency models for the above example.

4.3.1.2 Modeling context using dependency trees

Dependency parsing is used to extend the context model. Our context vectors

use contexts derived from head-words linked by dependency trees instead of using the

immediate adjacent lexical words. The use of dependency trees for modeling contexts

has been shown to help in monolingual clustering tasks of finding words with similar

meaning (Lin and Pantel, 2002) and this work shows how they can be effectively used

for translation lexicon induction.

The four vectors for positions -1, +1, -2 and +2 in the baseline model get mapped

to immediate parent (-1), immediate child (+1), grandparent (-2) and grandchild

(+2). An example of using the dependency tree context is shown in Figure 4.2, and

the dependency context is shown in contrast with the adjacent context in Table 4.1,

showing the selection of more salient words by using the dependency tree representa-

tion.

Note that while this approach is limited to four positions in the tree, it does not

imply that only a maximum of four context words are selected for a given sentence

since the word can have multiple immediate children depending upon the dependency


parse of the sentence. Hence, this approach allows for a dynamic context size, with

the number of context words varying with the number of children and parents at the

two levels.

Another advantage of this method is that it alleviates the reordering problem

as tree positions (consisting of head-words) are used as compared to usage of the

adjacent position in the baseline context model. For example, if the source Spanish

word to be translated was “prosperidad”, then in the example shown in Figure 4.2, in

the case of adjacent context, the context word “economica” will show up in the +1 position

in Spanish and the -1 position in English (as adjectives come before nouns in English), but

in the case of dependency context, the adjective will be the child of the noun and hence will

show up in the +1 position in both languages. Thus, a bag of words model as in Section

4.3 need not be used in order to avoid learning the explicit mapping that adjectives

and nouns in Spanish and English are reversed.
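A minimal sketch of extracting these four dependency positions from a parsed sentence is given below; the head-index array representation and all names are hypothetical illustrations, not the MST parser's actual output format.

```python
from collections import defaultdict

def dependency_contexts(words, heads):
    """For each token, collect context words at four tree positions:
    parent (-1), grandparent (-2), children (+1), grandchildren (+2).
    `heads[i]` is the index of token i's head, or -1 for the root.
    Note the number of children/grandchildren is unbounded, giving a
    dynamic context size."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)
    contexts = {}
    for i, w in enumerate(words):
        ctx = defaultdict(list)
        if heads[i] >= 0:
            ctx[-1].append(words[heads[i]])          # parent
            gp = heads[heads[i]]
            if gp >= 0:
                ctx[-2].append(words[gp])            # grandparent
        for c in children[i]:
            ctx[+1].append(words[c])                 # children
            for gc in children[c]:
                ctx[+2].append(words[gc])            # grandchildren
        contexts[w] = dict(ctx)                      # keyed by word for simplicity
    return contexts

# Toy parse of "el camino para prosperidad": "camino" is the root,
# "el" and "para" attach to "camino", "prosperidad" attaches to "para".
words = ["el", "camino", "para", "prosperidad"]
heads = [1, -1, 1, 2]
ctx = dependency_contexts(words, heads)
print(ctx["camino"][+1])   # -> ['el', 'para']
```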

4.4 Experimental Design

For our initial set of experiments several different vector-based context models are

compared:

• Adjbow – A baseline model which used a bag of words model with a fixed window

of 4 words, two on either side of the word to be translated.

• Adjposn – A second baseline that used a fixed window of 4 words but which took


positional information into account.

• Depbow – A dependency model which did not distinguish between grandparent,

parent, child and grandchild relations, analogous to the bag of words model.

• Depposn – A dependency model which did include such relationships, and was

analogous to the position-based baseline.

• Depposn + rev – The above Depposn model applied in both directions (Spanish-to-

English and English-to-Spanish) using their sum as the final translation score.

Translation accuracy of the above methods, which use monolingual corpora, is

contrasted with a statistical model trained on bilingual parallel corpora. That model

is referred to as Mosesen-es-100k, because it was trained using the Moses toolkit (Koehn

et al., 2007).

4.4.1 Training Data

All context models were trained on a Spanish corpus containing 100,000 sentences

with 2.13 million words and an English corpus containing 100,000 sentences with 2.07

million words. The Spanish corpus was parsed using the MST dependency parser

(McDonald et al., 2005) trained using dependency trees generated from the English

Penn Treebank (Marcus et al., 1993) and Spanish CoNLL-X data (Buchholz and

Marsi, 2006).


In order to directly compare against statistical translation models, Spanish and

English monolingual corpora were drawn from the Europarl parallel corpus (Koehn,

2005). The fact that the two monolingual corpora are taken from a parallel corpus

ensures that the assumption that similar contexts are a good indicator of translation

holds. This assumption underlies all work on translation lexicon induction from

comparable monolingual corpora, and this work maintains a strong bias toward that

assumption. Despite the bias, the comparison of different context models holds, since

all models are trained on the same data.

4.4.2 Evaluation Criterion

The models were evaluated in terms of exact-match translation accuracy of the

1000 most frequent nouns in an English-Spanish dictionary. The accuracy was calcu-

lated by counting how many mappings exactly match one of the entries in the dic-

tionary. This evaluation criterion is similar to the setup used by Koehn and Knight

(2002). The Top N accuracy was computed in the standard way as the number of

Spanish words whose Top N English translation candidates contain a lexicon trans-

lation entry out of the total number of Spanish words that can be mapped correctly

using the lexicon entries. Thus if “crecimiento, growth” is the correct mapping based

on the lexicon entries, the translation for “crecimiento” will be counted as correct if

“growth” occurs in the Top N English translation candidates for “crecimiento”.

Note that the exact-match accuracy is a conservative estimate as it is possible


that the algorithm may propose a reasonable translation for the given Spanish word

which is marked incorrect if it does not exist in the lexicon.

Because it would be intractable to compare each projected vector against the

vectors for all possible English words, the projected vector from each Spanish

word is compared only against the vectors for the 1000 most frequent English nouns,

following along the lines of previous work (Koehn and Knight, 2002; Haghighi et al.,

2008).
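The Top N exact-match computation just described can be sketched as follows; the data structures and the toy lexicon here are assumptions for illustration only.

```python
def top_n_accuracy(candidates, lexicon, n=10):
    """Exact-match Top N accuracy.
    candidates: source word -> ranked list of proposed translations.
    lexicon: source word -> set of acceptable translations.
    Only source words with a lexicon entry count toward the total."""
    evaluable = [sw for sw in candidates if sw in lexicon]
    correct = sum(
        1 for sw in evaluable
        if any(t in lexicon[sw] for t in candidates[sw][:n])
    )
    return correct / len(evaluable) if evaluable else 0.0

# Toy data: "camino" gets credit at Top 10 (its list contains "path")
# but not at Top 1.
lexicon = {"crecimiento": {"growth"}, "camino": {"way", "path"}}
candidates = {
    "crecimiento": ["growth", "activity", "development"],
    "camino": ["solution", "steps", "path"],
}
print(top_n_accuracy(candidates, lexicon, n=1))   # -> 0.5
print(top_n_accuracy(candidates, lexicon, n=10))  # -> 1.0
```

Note that, as in the chapter's evaluation, a reasonable translation missing from the lexicon is counted as incorrect, so this is a conservative estimate.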

4.5 Results

Table 4.3 gives the Top 1 and Top 10 accuracy for each of the models on their

ability to translate Spanish nouns into English. Examples of the top 10 translations

using the best performing of the baseline and dependency-based models are shown in

Table 4.2. The baseline models Adjposn and Adjbow differ in that the latter disregards

the position information in the context vector and simply uses a bag of words instead.

Table 4.3 shows that Adjbow gains from this simplification. A bag of words vector

approach pools counts together, which helps to reduce data sparsity. In the position

based model the vector is four times as long. Additionally, the bag of words model can

help when there is local re-ordering between the two languages. For instance, Spanish

adjectives often follow nouns whereas in English the ordering is reversed. Thus,

one can either learn position mappings, that is, position +1 for adjectives in Spanish


camino
Depposn Context Model      Adjbow Context Model
way         0.124          intentions   0.22
solution    0.097          way          0.21
steps       0.094          idea         0.20
path        0.093          thing        0.20
debate      0.085          faith        0.18
account     0.082          steps        0.17
means       0.080          example      0.17
work        0.079          news         0.16
approach    0.074          work         0.16
issue       0.073          attitude     0.15

crecimiento
Depposn Context Model      Adjbow Context Model
growth       0.27          growth       0.27
activity     0.14          loss         0.22
development  0.12          source       0.20
recovery     0.12          activity     0.15
integration  0.12          integration  0.15
prosperity   0.11          savings      0.15
creation     0.10          coordination 0.13
cohesion     0.09          taxation     0.13
order        0.08          prosperity   0.13
package      0.08          expense      0.12

Table 4.2: Top 10 translation candidates for the Spanish words “camino (way)” and “crecimiento (growth)” for the best adjacent context model (Adjbow) and the best dependency context model (Depposn). The bold English terms show the acceptable translations.

Model             Acc Top 1   Acc Top 10
Adjbow            35.3%       59.8%
Adjposn           20.9%       46.9%
Depbow            41.0%       62.0%
Depposn           41.0%       64.1%
Depposn + rev     42.9%       65.5%
Mosesen-es-100k   56.4%       62.7%

Table 4.3: Performance of various context-based models learned from monolingual corpora and the phrase-table learned from parallel corpora on noun translation.


Figure 4.3: Precision/Recall curve showing superior performance of dependency

context model as compared to adjacent context at different recall points. Precision

is the fraction of tested Spanish words with the Top 1 translation correct and Recall is

the fraction of the 1000 Spanish words tested upon.


is the same as position -1 in English, or just add the word counts from different

positions into one common vector as considered in the bag of words approach.

Using dependency trees also alleviates the problem of position mapping between

source and target language. Table 4.3 shows that the dependency

tree based models outperform the baseline models substantially. Comparing Depbow

to Depposn shows that ignoring the tree depth and treating it as a bag of words does not

increase the performance. This contrasts with the baseline models. The dependency

positions account for re-ordering automatically. The precision-recall curve in Figure

4.3 shows that the dependency-based context performs better than adjacent context

at almost all recall levels.

The Mosesen-es-100k model shows the performance of the statistical translation

model trained on a bilingual parallel corpus. While the system performs best in

Top 1 accuracy, the dependency context-based model that ignores the sentence

alignments surprisingly performs better in Top 10 accuracy, showing substantial

promise.

While computing the accuracy using the phrase-table learned from parallel cor-

pora (Mosesen-es-100k), the translation probabilities from both directions (p(es|en) and

p(en|es)) were used to rank the candidates. The monolingual context-based model is

also applied in the reverse direction (from English to Spanish) and the row with label

Depposn + rev in Table 4.3 shows further performance gains using both directions.


Spanish       English        Sim Score   In lexicon?
senores       gentlemen      0.99        NO
xenofobia     xenophobia     0.87        YES
diversidad    diversity      0.73        YES
chipre        cyprus         0.66        YES
mujeres       women          0.65        YES
alemania      germany        0.65        YES
explotacion   exploitation   0.63        YES
hombres       men            0.62        YES
republica     republic       0.60        YES
racismo       racism         0.59        YES
comercio      commerce       0.58        YES
continente    continent      0.53        YES
gobierno      government     0.52        YES
israel        israel         0.52        YES
francia       france         0.52        YES
fundamento    certainty      0.51        NO
suecia        sweden         0.50        YES
trafico       space          0.49        NO
television    tv             0.48        YES
francesa      portuguese     0.48        NO

Table 4.4: List of the 20 most confident mappings using the dependency context based model for noun translation, along with the exact match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the first mapping (senores, gentlemen) is the correct one, it was not present in the lexicon used for evaluation and hence is marked as incorrect.


4.6 Further Extensions: Generalizing to

other word types via tagset mapping

Most of the previous literature on this problem focuses on evaluating on nouns

(Rapp, 1999; Koehn and Knight, 2002; Haghighi et al., 2008). However, the vector

projection approach is general and should be applicable to other word-types as well.

The models are evaluated on a new test set containing the 1000 most frequent words

(not just nouns) in the English-Spanish lexicon.

The dependency-based context model is used to create translations for this new

set. The row labeled Depposn in Table 4.5 shows that the accuracy on this set is lower

when compared to evaluating only on nouns. The main reason for lower accuracy is

that closed class words are often the most frequent and tend to have a wide range of

contexts, causing them to appear as reasonable translations for most words, including

open class words, via the context model. For instance, the English preposition “to” appears

as the most confident translation for 147 out of the 1000 Spanish test words, and

(rightly so) for none after restricting the translations by part-of-speech categories.

This problem can be greatly reduced by making use of the intuition that part-of-

speech is often preserved in translation; thus the space of possible candidate

translations can be largely reduced based on part-of-speech restrictions. For example, a

noun in the source language will usually be translated as a noun in the target language,

a determiner as a determiner, and so on. This idea is more clearly illustrated


Figure 4.4: Illustration of using part-of-speech tag mapping to restrict candidate

space of translations.

in Figure 4.4. Rather than imposing a hard restriction, this work computes a rank-

ing based on the conditional probability of the candidate translation’s part-of-speech tag

given the source word’s tag.

An interesting problem in using part-of-speech restrictions is that corpora in dif-

ferent languages have been tagged using widely different tagsets and the following

subsection explains this problem in detail:


4.6.1 Mapping Part-of-Speech tagsets in different

languages

The English tagset was derived from the Penn treebank consisting of 53 tags (in-

cluding punctuation markers) and the Spanish tagset was derived from the Cast3LB

dataset consisting of 57 tags, but there is a large difference in the morphological and

syntactic features marked by the two tagsets. For example, the Spanish tagset has different

tags for masculine and feminine nouns and also has a different tag for coordinated

nouns, all of which need to be mapped to the singular or plural noun category available

in the English tagset. Figure 4.5 shows an illustration of the mapping problem between

the Spanish and English POS tags.

An empirical approach for learning the mapping between tagsets using the English-

Spanish projection dictionary used in the monolingual context-based models for trans-

lation is now described. Given a small English-Spanish bilingual dictionary and an n-best

list of part-of-speech tags for each word in the dictionary2, the conditional probability

of translating a source word with POS tag spos_i to a target word with POS tag tpos_j is

computed as follows:

2The n-best part-of-speech tag list for any word in the dictionary was derived using the relative frequencies in part-of-speech annotated corpora in the respective languages.


Figure 4.5: Illustration of mapping Spanish part-of-speech tagset to English tagset.

The tagsets vary greatly in notation and the morphological/syntactic constituents

represented and need to be mapped first, using the algorithm described in Section

4.6.1.

p(tpos_j | spos_i) = c(spos_i, tpos_j) / c(spos_i)                                            (4.1)

                   = [ ∑_{sw∈S, tw∈T} p(spos_i|sw) · p(tpos_j|tw) · Idict(sw, tw) ]
                     / [ ∑_{sw∈S} p(spos_i|sw) · ∑_{tw∈T} Idict(sw, tw) ]                     (4.2)

where

• S and T are the source and target vocabulary in the seed dictionary, with sw

and tw being any of the words in the respective sets.

• p(spos_i|sw) and p(tpos_j|tw) are obtained using relative frequencies in a part-of-speech

tagged corpus in the source/target languages respectively, and are used as soft

counts.

• Idict(sw, tw) is the indicator function with value 1 if the pair (sw, tw) occurs in

the seed dictionary and 0 otherwise.
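Equation 4.2 can be sketched directly with soft counts as follows; the function name, the tag distributions, tag labels, and seed pairs below are toy assumptions for illustration.

```python
from collections import defaultdict

def tagset_mapping(seed_pairs, src_tag_probs, tgt_tag_probs):
    """Estimate p(tpos_j | spos_i) from seed dictionary pairs using
    per-word tag distributions as soft counts (Equation 4.2)."""
    joint = defaultdict(float)      # (spos, tpos) -> soft count
    marginal = defaultdict(float)   # spos -> soft count over all pairs
    for sw, tw in seed_pairs:       # iterate dictionary entries (Idict = 1)
        for spos, ps in src_tag_probs[sw].items():
            for tpos, pt in tgt_tag_probs[tw].items():
                joint[(spos, tpos)] += ps * pt
            marginal[spos] += ps
    return {k: v / marginal[k[0]] for k, v in joint.items()}

# Toy example: "crecimiento" is always a Spanish noun (hypothetical tag
# NCMS); its translation "growth" is tagged NN 90% / VB 10% in an
# English corpus, so p(NN | NCMS) = 0.9.
seed = [("crecimiento", "growth")]
src_tags = {"crecimiento": {"NCMS": 1.0}}
tgt_tags = {"growth": {"NN": 0.9, "VB": 0.1}}
probs = tagset_mapping(seed, src_tags, tgt_tags)
print(round(probs[("NCMS", "NN")], 2))  # -> 0.9
```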


Figure 4.6: Precision/Recall curve showing superior performance of using part-of-

speech equivalences for translating all word-types. Precision is the fraction of tested

Spanish words with the Top 1 translation correct and Recall is the fraction of the 1000

Spanish words tested upon.


Model     Acc Top 1   Acc Top 10
Depposn   35.1%       62.9%
+ POS     41.3%       66.4%

Table 4.5: Performance of the dependency context-based model with the addition of the part-of-speech mapping model on translating all word-types.

In essence, the mapping between tagsets is learned using the known translations

from a small dictionary.

Given a source word s_w to translate, its most likely tag s*_pos, and the most likely mapping of this tag into English t*_pos computed as above, the translation candidates with part-of-speech tag t*_pos are considered for comparison with vector similarity, and the other candidates, with t_pos_j ≠ t*_pos, are discarded from the candidate space. Figure 4.4 shows an example of restricting the candidate space using part-of-speech tags.
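The candidate-space restriction described above can be sketched as follows; the tag distributions, learned tag map, and word lists are hypothetical toy inputs:

```python
def restrict_candidates(source_word, candidates, src_tag_dist,
                        tgt_tag_dist, tag_map):
    """Keep only the target candidates whose most likely POS matches the
    learned mapping t*_pos of the source word's most likely POS s*_pos."""
    src_tags = src_tag_dist[source_word]
    s_pos = max(src_tags, key=src_tags.get)          # s*_pos
    mapped = {t: p for (s, t), p in tag_map.items() if s == s_pos}
    t_pos = max(mapped, key=mapped.get)              # t*_pos
    return [c for c in candidates
            if max(tgt_tag_dist[c], key=tgt_tag_dist[c].get) == t_pos]

# Toy example: Spanish tags NC/VM, English tags NN/VB.
src_tag_dist = {"expresar": {"VM": 0.9, "NC": 0.1}}
tgt_tag_dist = {"express": {"VB": 0.7, "NN": 0.3}, "expression": {"NN": 1.0}}
tag_map = {("VM", "VB"): 0.8, ("VM", "NN"): 0.2, ("NC", "NN"): 0.9}
result = restrict_candidates("expresar", ["express", "expression"],
                             src_tag_dist, tgt_tag_dist, tag_map)
# result == ["express"]: the noun "expression" is pruned for the verb "expresar".
```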

The row labeled +POS in Table 4.5 shows that the part-of-speech mapping provides a substantial gain in performance compared to the direct application of the dependency context-based model, and its accuracy is also comparable to the accuracy obtained evaluating just on nouns in Table 4.3.


Spanish        English        Sim Score   Is present in lexicon
senores        gentlemen      0.99        NO
chipre         cyprus         0.66        YES
mujeres        women          0.65        YES
alemania       germany        0.65        YES
hombres        men            0.62        YES
expresar       express        0.60        YES
racismo        racism         0.59        YES
interior       internal       0.55        YES
gobierno       government     0.52        YES
francia        france         0.52        YES
cultural       cultural       0.51        YES
suecia         sweden         0.50        YES
fundamento     basis          0.48        YES
francesa       french         0.48        YES
entre          between        0.47        YES
origen         origin         0.46        YES
trafico        traffic        0.45        YES
de             of             0.44        YES
social         social         0.43        YES
ruego          thank          0.43        NO
energia        energy         0.42        YES
clave          key            0.42        YES
papel          role           0.42        YES
institucional  institutional  0.42        YES
transporte     transport      0.41        YES

Table 4.6: List of the 25 most confident mappings using the dependency context with the part-of-speech mapping model translating all word-types, along with the exact-match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the second best mapping in Table 4.4 for noun-translation is for xenophobia with score 0.87, xenophobia is not among the 1000 most frequent words (of all word-types) and thus is not in this test set.


4.7 Application to Unrelated Corpora

One of the challenges in the translation lexicon literature is to apply the monolingual corpus-based methods to completely unrelated corpora, as opposed to the un-aligned bitext or comparable corpora that have been commonly utilized. This section provides an initial foray into studying the effect on performance of utilizing unrelated monolingual corpora for seed-based translation lexicon induction.

As a sample of unrelated corpora, the English Gigaword Corpus (News) and the Spanish Europarl Corpus (Parliamentary Proceedings) were utilized as the target-side and source-side monolingual corpora respectively. Since the bag-of-words word-adjacency context-based model can be applied in a straightforward manner to the new corpora, the difference in performance for this model was calculated when utilizing the unrelated corpora. The Top-10 accuracy was found to drop from 59.8% to 48.8% when using unrelated corpora. A drop of this magnitude is expected, as unrelated corpora will have less salient context vectors available for projection; nevertheless, the performance is still reasonable compared to the accuracies obtained using parallel corpora, showing substantial promise for application to diverse corpora.

4.8 Statistical Significance of Results

Using a binomial test of sample size 1000 and the best baseline accuracies of 35.3%

(Top 1) and 59.8% (Top 10) for noun-translations (Table 4.3), any improvements in


accuracy over 37.8% (Top 1) and 62.3% (Top 10) are statistically significant with a p-value less than 0.05. Furthermore, even the improvement in Top-10 accuracy (row Dep_posn+rev) in Table 4.3 with respect to the MOSES system that utilizes parallel corpora (Koehn et al., 2007) is statistically significant with a p-value less than 0.05. In Table 4.5, showing results for improvements obtained via part-of-speech mapping, any accuracy improvements over 37.6% (Top 1) and 65.4% (Top 10) are statistically significant with a p-value less than 0.05. Thus, the results in Table 4.5 are also statistically significant.
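The significance thresholds above follow from a one-sided exact binomial test, which can be reproduced with a short computation (the accuracy values in the example are illustrative):

```python
from math import comb

def binom_p_value(correct, n, baseline_acc):
    """Exact one-sided binomial tail P(X >= correct) for X ~ Bin(n, baseline_acc):
    the probability of doing at least this well by matching the baseline."""
    return sum(comb(n, k) * baseline_acc**k * (1 - baseline_acc)**(n - k)
               for k in range(correct, n + 1))

# With a test set of 1000 words and a 35.3% Top-1 baseline, an observed
# accuracy of 41.3% (413 correct) is highly significant, whereas 36.0%
# (360 correct) is not:
print(binom_p_value(413, 1000, 0.353) < 0.05)  # True
print(binom_p_value(360, 1000, 0.353) < 0.05)  # False
```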

4.9 Conclusion

This chapter presents a novel contribution to the standard context models used

when learning translation lexicons from monolingual corpora by vector projection. It

is shown that using contexts based on dependency parses can provide more salient

contexts, allow for dynamic context size, and account for word reordering in the source

and target language. An exact-match evaluation shows 16% relative improvement by

using a dependency-based context model over the standard approach. Furthermore,

it is shown that the introduced model, which is trained only on monolingual corpora,

outperforms the standard statistical MT approach to learning phrase tables when

trained on the same amount of sentence-aligned parallel corpora, when evaluated on

Top 10 accuracy.


As a second contribution, this work goes beyond previous literature which evalu-

ated only on nouns. It is shown how preserving a word’s part-of-speech in translation

can improve performance. Furthermore, a solution to an interesting sub-problem en-

countered on the way is proposed. Since part-of-speech tagsets are not identical across

two languages, this work proposes a way of learning their mapping automatically.

Restricting candidate space based on this learned tagset mapping resulted in 18%

improvement over the direct application of the context-based model to all word-types.

Dependency trees substantially improve the context for translation, and their use opens up the question of how the context can be enriched further by making use of hidden structure that may provide clues for a word’s translation. This work also strengthens the belief that learning the mapping between tagsets in two different languages can be used in general for other NLP tasks that make use of cross-language projection of words and morphological/syntactic properties.


Part II

Extracting Semantic Relationships


Chapter 5

Part II Literature Review

This chapter covers the literature review for extracting semantic relations. Section

5.1 describes previous work for extracting relationships such as “is-a” and “part-of”

present in a semantic taxonomy. Section 5.2 describes previous work for extracting

more complex semantic relationships such as definite anaphora. Both these types

are also inter-related and Chapter 7 shows how improved modeling of taxonomic

relationships can aid in extraction of more complex semantic relationships.

5.1 Extracting relationships in a semantic

taxonomy

There has been a plethora of work on extracting generic semantic relationships

such as “is-a”, “part-of”, etc. Some examples of such generic semantic relations include:

• Hypernyms/Hyponyms1: These constitute “is-a” relationships such as “(banana,

fruit)”, “(car, vehicle)”, etc.

• Meronyms: These constitute part-of/member-of relationships such as “(wheel,

car)”, “(floor, house)”, etc.

• Synonyms: These constitute pairs that describe the same concept or have similar meaning within the language, such as “(path, way)”, “(instruct, teach)”, etc.

• Cousins/Siblings: These constitute pairs that share a close common hypernym

such as “(bus, truck)”, “(diabetes, arthritis)”, etc.

The main approaches in the literature for learning such relationships are described

below:

5.1.1 Manually created databases

A major line of research has been on manually creating a semantic taxonomy from

scratch. A popular example of such a database is WordNet (Miller, 1995; Fellbaum

1998) that has been used widely in natural language processing problems. It contains

over 150,000 unique strings laid out in a taxonomy that identifies the hypernymy,

1Given an “is-a” relationship, the hypernym is the parent and the hyponym is the child. Thus for “(banana, fruit)”, “fruit” is the hypernym and “banana” is the hyponym.


meronymy, synonymy and cousin/sibling relationships. Another popular semantic

resource is CYC (Lenat, 1995). Such a vast manual effort has also been replicated for

other languages, leading to the creation of EuroWordNet (Vossen, 1998), Hindi WordNet (Narayan, 2002), Japanese WordNet (Isahara et al., 2008), etc.

However, taxonomy resources such as WordNet are limited or non-existent for most

of the world’s languages. Building a WordNet manually from scratch requires a huge

amount of human effort and for rare languages the required human and linguistic

resources may simply not be available. Hence a major line of research in the com-

putational linguistics literature has been focused on automatically extracting such

semantic relations. These approaches are explained in the following sections.

5.1.2 Hand-crafted Patterns for “is-a” and “part-

whole” relationships

Some of the semantic relationships found in WordNet tend to occur using a few

evocative fixed patterns. Thus given a corpus of the language and a list of hand-

crafted patterns, a large amount of semantic knowledge can be extracted from cor-

pora. This observation was first explored in detail by Hearst (1992) for extracting

hypernymy or “is-a” relationships from unstructured corpora. She observed that the

hypernyms usually co-occur with the following patterns/regular expressions:

1. NP such as NP,* ( or | and ) NP


2. such NP as NP ,* ( or | and ) NP

3. NP, NP* , or other NP

4. NP, NP* , and other NP

5. NP , including NP,* or | and NP

6. NP , especially NP,* or | and NP
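On raw text these patterns are matched over noun-phrase chunks; as a rough sketch of pattern 1, a regular expression can approximate the match, with a single word standing in for each NP (a simplification for illustration, not Hearst's implementation):

```python
import re

# Surface approximation of "NP such as NP (, NP)* ((or|and) NP)", with a
# single word standing in for each NP chunk.
SUCH_AS = re.compile(
    r"(\w+)\s+such\s+as\s+((?:\w+,\s*)*\w+(?:\s+(?:and|or)\s+\w+)?)")

def extract_hyponym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(r",\s*|\s+(?:and|or)\s+", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

print(extract_hyponym_pairs("fruits such as bananas, apples and oranges"))
# [('bananas', 'fruits'), ('apples', 'fruits'), ('oranges', 'fruits')]
```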

A total of approximately 400 word pairs were extracted with the above patterns from a corpus of about 20 million words. While the accuracy was not reported, the extracted pairs were found to exhibit the hypernymy relationship. She also proposed how such patterns may be learned from a set of seeds but did not perform an empirical evaluation of such an approach. Seed-based pattern induction has since become one of the mainstream approaches for information extraction and is explained in Section 5.1.3.

While the above patterns have been widely used for extracting the hypernym relationship between common nouns, Mann (2002) used the manually crafted pattern “<Common Noun> <Proper Noun>” for extracting the “is-a” relationship for proper nouns. This pattern exploits the premodifier position for providing a description of the proper noun. For example, this pattern will match phrases such as “[the] automaker Mercedes Benz”.

Berland and Charniak (1999) used hand-crafted patterns for extracting meronyms

(“part-of”) relationships. The following two patterns were used to extract word pairs


exhibiting meronym relationship:

1. whole-NN ’s part-NN

2. part-NN of the | a whole-NN

Using a corpus of 100,000 words, they were able to extract meronym relations with

55% accuracy. Girju et al. (2003) improved upon their work using a combination

of hand-crafted patterns and supervised learning. They suggested using the following

patterns for extracting meronym pairs:

1. whole-NP ’s part-NP

2. part-NP of whole-NP

3. part-NP VERB whole-NP

After extracting sentences that match the above patterns, they filtered out the bad

examples using decision trees. The decision tree model was trained using features

from WordNet.

The problem with using a few fixed patterns is their often low coverage; thus there is a need for discovering additional informative patterns automatically. Approaches for automatically learning such patterns are described in the following section.


5.1.3 Weakly supervised approaches

Weakly supervised approaches using seed exemplars have been widely used in the

literature for extracting semantic relations. The basic paradigm is similar to the seed-

based approach for crosslingual relationship extraction explained in earlier chapters.

Given a set of seeds of the relationship of interest, patterns such as “X and other

Ys” (for hypernymy) are learned automatically (Ravichandran and Hovy, 2002). The

pattern-learning approaches for learning semantic relationships have been also ap-

plied successfully for learning factual relationships. A more detailed description of

such approaches is provided in the literature review for factual relationships in Chap-

ter 8.

A major problem often noticed in pattern-learning approaches with re-

spect to learning semantic relationships is the high recall but low precision of pat-

terns2. Furthermore, much of the semantic relation extraction work has focused on

extracting a particular relation independently of other relations. Chapter 6 describes

how this problem can be solved by combining evidence from multiple relations in Sec-

tion 6.3.2. Furthermore, Chapter 6 also describes how derived semantic relationships

can be used for extracting cross-lingual relationships.

2also noted by Pantel and Pennacchiotti (2006)


5.1.4 Training Supervised Classifiers

Going beyond the popular pattern-based approaches, several researchers have also

tried training a fully supervised classifier such as logistic regression, using features

derived from the parse tree of the sentence (Snow et al., 2006). However, such detailed annotations and other resources such as parse trees may be difficult to obtain

for other languages. Furthermore, requirements of such resources can also become a

bottleneck when scaling up to large corpora.

5.1.5 Clustering Approaches

The other end of the spectrum involves fully unsupervised clustering-based ap-

proaches. A distinct line of work on inducing taxonomies was based on agglomerative

clustering of words using a notion of word similarity (Caraballo, 1999). Cederberg

and Widdows (2003) use latent semantic analysis and noun co-ordination patterns to

improve the precision and recall of hyponymy extraction. The clustering by commit-

tee (CBC) algorithm (Pantel and Lin, 2002) has also been used in extracting noun

clusters that belong to the same class (Pantel and Ravichandran, 2004).


5.2 Extracting complex semantic

relationships

This section covers literature on extracting more complex semantic relationships

that are difficult to extract via contextual pattern templates. Definite anaphora (see

Figure 5.1) is a typical example of such a relationship where surface local context is

not sufficient. The standard approaches for coreference resolution that are evaluated

on MUC-style (Hirschman and Chinchor, 1997) corpora have been reported to per-

form poorly on resolution of definite anaphors (Connolly et al., 1997; Strube et al.,

2002; Ng and Cardie, 2002; Yang et al., 2003). For instance, the coreference system

for German texts by Strube et al. (2002) reports an F-measure of 33.9% for definite

NPs as compared to 82.8% for personal pronouns.

Definite anaphora is also interesting as a case study because it requires deriving

simple relationships such as “is-a” for successful anaphora resolution/generation. For

example, in Figure 5.1, determining the antecedent to the definite anaphor “the drug”

in text requires knowledge of what previous noun-phrase candidates could be drugs.

Likewise, generating a definite anaphor for the antecedent “Morphine” in text requires

both knowledge of potential hypernyms (e.g. “the opiate”, “the narcotic”, “the drug”,

and “the substance”), as well as selection of the most appropriate level of generality

along the hypernym tree in context (i.e. the “natural” hypernym anaphor). In or-

der to obtain such “is-a” relationship knowledge for dealing with definite anaphors,


Resolution Task: “...pseudoephedrine is found in an allergy treatment, which was given to Wilson by a doctor when he attended Blinn junior college in Houston. In a unanimous vote, the Norwegian sports confederation ruled that Wilson had not taken the drug to enhance his performance...”

Generation Task: “...pseudoephedrine is found in an allergy treatment, which was given to Wilson by a doctor when he attended Blinn junior college in Houston. In a unanimous vote, the Norwegian sports confederation ruled that Wilson had not taken the __?__ to enhance his performance...”

Figure 5.1: Example of definite anaphora resolution and generation. Both the tasks

require the knowledge of a derived semantic relationship that “pseudoephedrine is-a

drug”.


many resolution systems rely on the manually built WordNet database (Poesio et al.,

1997; Meyer and Dale, 2002). WordNet has also been used as an important feature

in machine learning of coreference resolution using supervised training data (Soon et

al., 2001; Ng and Cardie, 2002).

However, there are several disadvantages to using handcrafted ontologies. First, building, extending and maintaining ontologies by hand is expensive. Second,

some of the anaphoric relationships are context dependent. Hearst (1992) raises the

issue of whether underspecified, context or point-of-view dependent hyponymy rela-

tions should be included in a fixed ontology. For example, “corruption” is referred to

as “the tool” in the corpora utilized for this study. This is a metaphoric usage that

would be difficult to predict unless given the usage sentence and its context. Third,

using all senses of the anaphor and potential antecedents in the ontology can result in an incorrect link due to wrong antecedent selection. Finally, the most significant disadvantage is that WordNet has rigid and complicated hierarchy levels. Thus there is no notion of a “natural” parent, which is essential for definite anaphora generation.

In order to alleviate these problems, corpus-based approaches for automatically deriv-

ing “is-a” relationships for definite anaphora have been used in the literature (Poesio

et al., 2002; Markert and Nissim, 2005). The relevant literature for such corpus-

based approaches and contributions to it are described in more detail in Section 7.2

of Chapter 7.


Chapter 6

Minimally Supervised Multilingual

Taxonomy and Translation

Lexicon Induction

Summary

This chapter presents a novel algorithm for the acquisition of multilingual lex-

ical taxonomies (including hyponymy/hypernymy, meronymy and taxonomic cous-

inhood), from monolingual corpora with minimal supervision in the form of seed

exemplars using discriminative learning across the major WordNet semantic relation-

ships. This capability is also extended robustly and effectively to a second language

(Hindi) via cross-language projection of the various seed exemplars. This chapter also


[Figure content: two parallel induced hypernymy trees. Induced English Hypernymy: weapon → grenade, explosive, bomb, gun. Induced Hindi Hypernymy (with glosses): hathiyaara (weapon) → haathagolaa (grenade), baaruuda (explosive), bama (bomb), banduuka (gun).]

Figure 6.1: Goal: To induce multilingual taxonomy relationships in parallel in mul-

tiple languages (such as Hindi and English) for information extraction and machine

translation purposes.

presents a novel model of translation dictionary induction via multilingual transitive

models of hypernymy and hyponymy, using these induced taxonomies. Candidate

lexical translation probabilities are based on the probability that their induced hy-

ponyms and/or hypernyms are translations of one another. All of the above models

are evaluated on English and Hindi.

Components of this chapter were originally published by the author of this disserta-

tion in the forum referenced below1.

1Reference: N. Garera and D. Yarowsky. Minimally Supervised Multilingual Taxonomy and Translation Lexicon Induction. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2008.


6.1 Introduction

Taxonomy resources such as WordNet (Miller, 1995; Fellbaum 1998) are limited

or non-existent for most of the world’s languages. Building a WordNet manually

from scratch requires a huge amount of human effort and for rare languages the

required human and linguistic resources may simply not be available. Most of the

automatic approaches for extracting semantic relations (such as hyponyms) have been

demonstrated for English and some of them rely on various language-specific resources

(such as supervised training data, language-specific lexicosyntactic patterns, shallow

parsers, etc.). This chapter presents a language independent approach for induc-

ing taxonomies such as shown in Figure 6.1 using limited supervision and linguistic

resources. A seed learning based approach for extracting semantic relations (hy-

ponyms, meronyms and cousins) is presented that improves upon existing induction

frameworks by combining evidence from multiple semantic relation types. Using a

joint model for extracting different semantic relations helps to induce more relation-

specific patterns and filter out the generic patterns2. The patterns can then be used

for extracting new word-pairs expressing the relation. Note that the only training

data used in the algorithm are the few seed pairs required to start the bootstrapping

process, which are relatively easy to obtain. The algorithm is evaluated on English

and a second language (Hindi), showing reliable and accurate induction of taxonomies

2The phrase “generic patterns” means patterns that cannot distinguish between different semantic relations. For example, the pattern “X and Y” is a generic pattern whereas the pattern “Y such as X” is a hyponym-specific pattern.


in two diverse languages.

This chapter further describes how induced parallel taxonomies in two languages can be used for augmenting a translation dictionary between those two languages. The translation algorithm makes use of the automatically induced hy-

ponym/hypernym relations in each language to create a transitive “bridge” for dic-

tionary induction. Specifically, it relies on the key observation that words in two

languages (e.g. English and Hindi) have increased probabilities of being translations

of each other if their hypernyms or hyponyms are translations of one another.
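This observation can be sketched as a simple scoring function over induced taxonomies and a seed dictionary; all of the data structures below are hypothetical toy inputs:

```python
def bridge_score(en_word, hi_word, en_hyponyms, hi_hyponyms, seed_dict):
    """Score an (English, Hindi) translation candidate by the fraction of
    the English word's induced hyponyms whose seed-dictionary translations
    appear among the Hindi word's induced hyponyms."""
    en_hypos = en_hyponyms.get(en_word, [])
    if not en_hypos:
        return 0.0
    matched = sum(
        1 for h in en_hypos
        if any(t in hi_hyponyms.get(hi_word, [])
               for t in seed_dict.get(h, [])))
    return matched / len(en_hypos)

# Toy taxonomies and seed dictionary.
en_hyponyms = {"weapon": ["gun", "bomb", "grenade"]}
hi_hyponyms = {"hathiyaara": ["banduuka", "bama"]}
seed_dict = {"gun": ["banduuka"], "bomb": ["bama"]}  # "grenade" is unseen
score = bridge_score("weapon", "hathiyaara", en_hyponyms, hi_hyponyms, seed_dict)
# score == 2/3: two of the three induced hyponyms bridge across languages.
```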

6.2 Related Work

While manually created WordNets for English (Miller, 1995; Fellbaum, 1998) and

Hindi (Narayan, 2002) have been made available, a lot of time and effort is required

in building such semantic taxonomies from scratch. Hence several automatic corpus

based approaches for acquiring lexical knowledge have been proposed in the litera-

ture. Much of this work has been done for English based on using a few evocative

fixed patterns including “X and other Ys”, “Y such as X”, as in the classic work

by Hearst (1992). The problem with using a few fixed patterns is their often low coverage; thus there is a need for discovering additional informative patterns automatically. There has been a plethora of work in the area of information extraction using automatically derived contextual patterns for semantic


categories (e.g. companies, locations, time, person-names, etc.) based on bootstrap-

ping from a small set of seed words (Riloff and Jones, 1999; Agichtein and Gravano,

2000; Ravichandran and Hovy, 2002; Pasca et al. 2006). This framework has been

also shown to work for extracting semantic relations between entities: Girju et al.

(2003) used 100 seed words from WordNet to extract patterns for part-of relations.

Pantel and Pennacchiotti (2006) use pattern-based approaches to extract is-a, part-of

and other semantic relations. While most of the above pattern induction work has

been shown to work well for specific relations (such as “birthdates, companies, etc.”),

Section 6.3.1 explains why directly applying seed learning for semantic relations can

result in high recall but low precision patterns, a problem also noted by Pantel and

Pennacchiotti (2006). Furthermore, much of the semantic relation extraction work

has focused on extracting a particular relation independently of other relations. This

chapter describes how this problem can be solved by combining evidence from multiple

relations in Section 6.3.2. Snow et al. (2006) also describe a probabilistic framework

for combining evidence using constraints from hyponymy and cousin relations. How-

ever, they use a supervised logistic regression model. Moreover, their features rely on

parsing dependency trees which may not be available for most languages.

The key contribution of this work is using evidence from multiple relationship types

in the seed learning framework for inducing these relationships and conducting a mul-

tilingual evaluation for the same. Furthermore, the extraction of semantic relations

in multiple languages can serve as a useful tool for improving a dictionary between


Rank   English          Hindi
1      Y, the X         Y aura X (Gloss: Y and X)
2      Y and X          Y va X (Gloss: Y in addition to X)
3      X and other Y    Y ne X (Gloss: Y (case marker) X)
4      X and Y          X ke Y (Gloss: X’s Y)
5      Y, X             Y me.n X (Gloss: Y in X)

Table 6.1: Naive pattern scoring: hyponymy patterns ranked by their raw corpus frequency scores.

those languages.

6.3 Approach

To be able to automatically create taxonomies such as WordNet, it is useful to be able to learn not only hypernymy/hyponymy directly, but also the additional semantic relationships of meronymy and taxonomic cousinhood. Specifically, given a pair of words (X, Y), the task is to answer the following questions: 1. Is X a hyponym of Y (e.g. gun, weapon)? 2. Is X a part/member of Y (e.g. trigger, gun)? 3. Is X a cousin/sibling3 of Y (e.g. gun, missile)? 4. Do none of the above 3 relations apply but X is observed in the context of Y (e.g. airplane, accident)?4 Class 4 is referred to as “other” in the rest of the chapter.

3Cousins/siblings are words that share a close common hypernym.
4Note that this does not imply X is unrelated or independent of Y. On the contrary, the required sentential co-occurrence implies a topic similarity. Thus, this is a much harder class to distinguish from classes 1-3 than non-co-occurring unrelatedness (such as gun, protozoa) and hence was included in the evaluation.


6.3.1 Independently Bootstrapping Lexical

Relationship Models

Following the pattern induction framework of Ravichandran and Hovy (2002),

one of the ways of extracting different semantic relations is to learn patterns for each

relation independently using seeds of that relation and extract new pairs using the

learned patterns. For example, to build an independent model of hyponymy using

this framework, approximately 50 seed exemplars of hyponym pairs were used for

extracting all the patterns that match with the seed pairs5. As in Ravichandran

and Hovy (2002), the patterns were ranked by corpus frequency and a frequency

threshold was set to select the final patterns. These patterns were then used to

extract new word pairs expressing the hyponymy relation by finding word pairs that

occur with these patterns in an unlabeled corpus. However, the problem with this

approach is that generic patterns (like “X and Y”) occur many times in a corpus and

thus low-precision patterns may end up with high cumulative scores. This problem

is illustrated more clearly in Table 6.1, which shows a list of top five hyponymy

patterns (ranked by their corpus frequency) using this approach. This problem can

be overcome by exploiting the multi-class nature of this task and combine evidence

from multiple relations in order to learn high precision patterns (with high conditional

probabilities) for each relation. The key idea is to weed out the patterns that occur

5A pattern is the n-gram sequence occurring between the seed pair (also called glue text). The length of the pattern was thresholded to 15 words.


Rank   English          Hindi
1      Y like X         X aura anya Y (Gloss: X and other Y)
2      Y such as X      Y, X (Gloss: Y, X)
3      X and other Y    X jaise Y (Gloss: X like Y)
4      Y and X          Y tathaa X (Gloss: Y or X)
5      Y, including X   X va anya Y (Gloss: X and other Y)

Table 6.2: Patterns for the hypernymy class re-ranked using evidence from other classes. Patterns distributed fairly evenly across multiple relationship types (e.g. “X and Y”) are deprecated more than patterns focused predominantly on a single relationship type (e.g. “Y such as X”).

in more than one semantic relation and keep the ones that are relation-specific6, thus

using the relations meronymy, cousins and other as negative evidence for hyponymy

and vice versa. Table 6.2 shows the pattern ranking by using the model developed

in Section 6.3.2 that makes use of evidence from different classes. More hyponymy

specific patterns are ranked at the top7 suggesting the usefulness of this method in

finding class-specific patterns.

6 In the actual algorithm, the common patterns are not entirely removed but an estimate of the conditional class probabilities for each pattern, p(class|pattern), is computed.

7 It is interesting to see in Table 6.2 that the top learned Hindi hyponymy patterns seem to be translations of the English patterns suggested by Hearst (1992). This leads to an interesting future work question: are the most effective hyponym patterns in other languages usually translations of the English hyponym patterns proposed by Hearst (1992), and what are frequent exceptions?
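The pattern-induction step described above can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation; the toy corpus, seed pairs, and function name are invented:

```python
from collections import Counter

def extract_patterns(corpus_sentences, seed_pairs, max_len=15):
    """Collect the glue-text patterns occurring between seed pair
    mentions, thresholding the pattern length to max_len words."""
    pattern_freq = Counter()
    for tokens in corpus_sentences:
        for x, y in seed_pairs:
            if x in tokens and y in tokens:
                i, j = tokens.index(x), tokens.index(y)
                if i < j and (j - i - 1) <= max_len:
                    glue = " ".join(["X"] + tokens[i + 1:j] + ["Y"])
                    pattern_freq[glue] += 1
    return pattern_freq

# Rank patterns by corpus frequency, as in Ravichandran and Hovy (2002).
corpus = [["metal", "such", "as", "copper", "is", "mined"],
          ["tool", "such", "as", "hammer"],
          ["a", "tool", ",", "especially", "a", "hammer"]]
seeds = [("metal", "copper"), ("tool", "hammer")]
freqs = extract_patterns(corpus, seeds)
print(freqs["X such as Y"])  # prints: 2
```

A frequency threshold over such counts then selects the final patterns for each relation.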


6.3.2 A minimally supervised multi-class classifier for identifying different semantic relations

First, a list of patterns is extracted from an unlabeled corpus8 independently for

each relationship type (class) using the seeds9 for the respective class as in Section

6.3.1.10 In order to develop a multi-class probabilistic model, the probability of each

class c given the pattern p is obtained as follows:

P(c|p) = seedfreq(p, c) / Σ_c' seedfreq(p, c')    (6.1)

where seedfreq(p, c) is the number of seeds of class c that were found with the pattern p

in an unlabeled corpus. A sample of the P (class|pattern) tables for English and Hindi

are shown in the Tables 6.3 and 6.4 respectively. It is clear how occurrence of a pat-

tern in multiple classes can be used for finding reliable patterns for a particular class.

For example, in Table 6.3: although the pattern “X and Y” will get a higher seed fre-

quency than the pattern “Y, especially X”, the probability P (“X and Y ”|hyponymy)

is much lower than P (“Y, especially X”|hyponymy), since the pattern “Y, especially

X” is unlikely to occur with seeds of other relations.
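Equation 6.1 can be computed directly from the seed-match counts. The sketch below uses invented counts loosely modeled on Table 6.3; the function name is hypothetical:

```python
from collections import defaultdict

def class_probabilities(seedfreq):
    """Equation 6.1: P(c|p) = seedfreq(p,c) / sum over c' of seedfreq(p,c').
    seedfreq maps (pattern, class) -> number of seeds of that class
    found with the pattern in the unlabeled corpus."""
    totals = defaultdict(float)
    for (p, c), n in seedfreq.items():
        totals[p] += n
    return {(p, c): n / totals[p] for (p, c), n in seedfreq.items()}

# Illustrative counts (not the thesis data): "X and Y" is spread across
# classes, while "Y, especially X" occurs only with hyponym seeds.
seedfreq = {("X and Y", "hyponym"): 23, ("X and Y", "cousin"): 33,
            ("X and Y", "meronym"): 30, ("X and Y", "other"): 14,
            ("Y, especially X", "hyponym"): 4}
probs = class_probabilities(seedfreq)
print(round(probs[("X and Y", "hyponym")], 2))   # prints: 0.23
print(probs[("Y, especially X", "hyponym")])     # prints: 1.0
```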

Now, instead of using the seedfreq(p, c) as the score for a particular pattern with re-

8 Unlabeled monolingual corpora were used for this task; the English corpus was the LDC Gigaword corpus and the Hindi corpus was newswire text extracted from the web, containing a total of 64 million words.

9 The number of seeds used for the classes {hyponym, meronym, cousin, other} were {48, 40, 49, 50} for English and {32, 58, 31, 35} for Hindi respectively. A sample of the seeds used is shown in Table 6.5.

10 Only the patterns that had seed frequency greater than one were retained for extracting new word pairs. The total number of retained patterns across all classes for {English, Hindi} were {455, 117} respectively.


spect to a class, the patterns can be rescored using the probabilities P (class|pattern).

Thus the final score for a pattern p with respect to class c is obtained as:

score(p, c) = seedfreq(p, c) · P(c|p)    (6.2)

This equation can be viewed as balancing recall and precision, where the first term

is the frequency of the pattern with respect to seeds of class c (representing recall),

and the second term represents the relation-specificity of the pattern with respect to

class c (representing precision). The score for each pattern is recomputed in the above manner to obtain a ranked list of patterns for each of the classes for English and

Hindi. Now, to extract new pairs for each class, all the patterns with a seed frequency

greater than 2 are used to extract word pairs from an unlabeled corpus. The semantic

class for each extracted pair is then predicted using the multi-class classifier as follows:

Given a pair of words (X1, X2), note all the patterns that matched with this pair in

the unlabeled corpus, denote this set as P . Choose the predicted class c∗ for this pair

as:

c* = argmax_c Σ_{p ∈ P} score(p, c)    (6.3)
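Equations 6.2 and 6.3 combine into a simple scorer and classifier. The following sketch assumes the seedfreq counts and P(c|p) table have already been computed; all numbers are invented for illustration:

```python
def score(p, c, seedfreq, probs):
    """Equation 6.2: seedfreq(p,c) (recall) times P(c|p) (precision)."""
    return seedfreq.get((p, c), 0) * probs.get((p, c), 0.0)

def predict_class(matched_patterns, classes, seedfreq, probs):
    """Equation 6.3: pick the class with the highest summed pattern score
    over the set P of patterns matching the word pair (X1, X2)."""
    return max(classes, key=lambda c: sum(score(p, c, seedfreq, probs)
                                          for p in matched_patterns))

# Invented counts and probabilities: the high-precision pattern
# "Y such as X" outweighs the generic "X and Y".
classes = ["hyponym", "meronym", "cousin", "other"]
seedfreq = {("Y such as X", "hyponym"): 10,
            ("X and Y", "hyponym"): 23, ("X and Y", "cousin"): 33}
probs = {("Y such as X", "hyponym"): 0.9,
         ("X and Y", "hyponym"): 0.23, ("X and Y", "cousin"): 0.33}
print(predict_class(["Y such as X", "X and Y"], classes, seedfreq, probs))
# prints: hyponym
```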

6.3.3 Evaluation of the Classification Task

Over 10,000 new word relationship pairs were extracted based on the above algo-

rithm. While it is hard to evaluate all the extracted pairs manually, one can certainly


Pattern           Hyponym  Meronym  Cousin/Sibling  Other
X of the Y        0        0.66     0.04            0.3
Y, especially X   1        0        0               0
Y, whose X        0        1        0               0
X and other Y     0.63     0.08     0.18            0.11
X and Y           0.23     0.3      0.33            0.14

Table 6.3: A sample of patterns and their relationship type probabilities P(class|pattern) extracted at the end of the training phase for English.

Pattern                        Hyponym  Meronym  Cousin/Sibling  Other
X aura anya Y (X and other Y)  1        0        0               0
X aura Y (X and Y)             0.09     0.09     0.71            0.11
X jaise Y (X like Y)           1        0        0               0
X va Y (X and Y)               0.11     0        0.89            0
Y kii X (Y's X)                0.33     0.67     0               0

Table 6.4: A sample of patterns and their class probabilities P(class|pattern) extracted at the end of the training phase for Hindi.

          English                                    Hindi
          Seed Pairs            Model Predictions    Seed Pairs                          Model Predictions

Hypernym  tool,hammer           weapon,gun           khela,Tenisa (game,tennis)          paarTii,kaa.ngresa (party,congress)
          currency,yen          sport,hockey         appraadha,hatyaa (crime,murder)     kaagajaata,passporTa (document,passport)
          metal,copper          disease,cancer       jaanvara,bhaaga (animal,tiger)      bhaashhaa,a.ngrejii (language,English)

Meronym   wheel,truck           room,hotel           u.ngalii,haatha (finger,hand)       jeba,sharTa (pocket,shirt)
          headline,newspaper    bark,tree            kamaraa,aspataala (room,hospital)   kaptaana,Tiima (captain,team)
          wing,bird             lens,camera          ma.njila,imaarata (floor,building)  darvaaja,makaana (door,house)

Cousin    dollar,euro           guitar,drum          bhaajapa,kaa.ngresa (bjp,congress)  peTrola,Diijala (petrol,diesel)
          heroin,cocaine        history,geography    Hindii,a.ngrejii (Hindi,English)    Daalara,rupayaa (dollar,rupee)
          helicopter,submarine  diabetes,arthritis   basa,Traka (bus,truck)              talaaba,nadii (pond,river)

Table 6.5: A sample of seeds used and model predictions for each class for the taxonomy induction task. For each of the model predictions shown above, its Hyponym/Meronym/Cousin classification was correctly assigned by the model.


create a representative smaller test set and evaluate performance on that set. The

test set was created by randomly identifying word pairs in WordNet and newswire

corpora and annotating their correct semantic class relationships. Test set construc-

tion was done entirely independently from the algorithm application, and hence some

of the test pairs were missed entirely by the learning algorithm, yielding only partial

coverage.

The total number of test examples across all classes was 200 for English and 140 for Hindi. The overall coverage11 on these test sets was 81% and

79% for English and Hindi respectively. Table 6.6 reports the overall accuracy12 for

the 4-way classification using different pattern scoring methods. Baseline 1 scores patterns by their corpus frequency as in Ravichandran and Hovy (2002); Baseline 2 is another intuitive method that scores patterns by the number of seeds they extract.

The third row in Table 6.6 indicates the result of rescoring patterns by their class

conditional probabilities, giving the best accuracy.

While this method yields some improvement over other baselines, the main point to

note here is that the pattern-based methods which have been shown to work well for

English also perform reasonably well on Hindi, in spite of the fact that the size of the

unlabeled corpus available for Hindi was 15 times smaller than for English.

Table 6.7 shows detailed accuracy results for each relationship type using the model

11 Coverage is defined as the percentage of the test cases that were present in the unlabeled corpus, that is, cases for which an answer was given.

12 Accuracy on a particular set of pairs is defined as the percentage of pairs in that set whose class was correctly predicted.


Model                  English Accuracy  Hindi Accuracy
Baseline 1 [RH02]      65%               63%
Baseline 2: seedfreq   70%               65%
seedfreq · P(c|p)      73%               66%

Table 6.6: Overall accuracy for the 4-way classification {hypernym, meronym, cousin, other} using different pattern scoring methods.

                English                      Hindi
                Total  Coverage  Accuracy    Total  Coverage  Accuracy
Hyponym         83     74%       97%         59     82%       75%
Meronym         41     81%       88%         33     63%       81%
Cousin/Sibling  42     91%       55%         23     91%       71%
Other           34     85%       31%         25     80%       20%
Overall         200    81%       73%         140    79%       66%

Table 6.7: Test set coverage and accuracy results for inducing different semantic relationship types.

        English                       Hindi
        Hypo.  Mero.  Cous.  Other    Hypo.  Mero.  Cous.  Other
Hypo.   59     1      1      0        36     1      10     1
Mero.   1      28     1      3        0      17     4      0
Cous.   14     3      21     0        6      0      15     0
Other   7      3      10     9        1      4      11     4

Table 6.8: Confusion matrix for English (left) and Hindi (right) for the four-way classification task.


developed in Section 6.3.2. It is also interesting to see in Table 6.8 that most of the

confusion is due to the "other" class being classified as "cousin", which is expected since cousin words are only weakly semantically related and use more generic patterns such as "X and Y" that can often be associated with the "other" class as well.

Semantically distinct classes like hypernymy and meronymy seem to be well

discriminated as their induced patterns are less likely to occur in other relationship

types.

6.4 Statistical Significance of Results

Using a binomial test of sample sizes 200 (English) and 140 (Hindi), and the

baseline algorithm performance of 65% (English) and 63% (Hindi), any improvement

in accuracy over 70.3% for English and over 70% for Hindi is statistically significant with a p-value of less than 0.05. Thus the final overall accuracy obtained for English (73%) is statistically significant, while that obtained for Hindi (66%) is not.
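These thresholds can be checked with an exact one-sided binomial tail computation. This is a sketch of one plausible way to perform the test, not necessarily the exact procedure used for the thesis figures:

```python
from math import comb

def binom_sf(k, n, p):
    """Exact one-sided tail P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def significance_threshold(n, p0, alpha=0.05):
    """Smallest accuracy k/n whose one-sided binomial p-value against a
    baseline accuracy p0 falls below alpha, on a test set of size n."""
    for k in range(n + 1):
        if binom_sf(k, n, p0) < alpha:
            return k / n
    return 1.0

# Thresholds for the two test sets discussed in Section 6.4
# (n=200 vs. a 65% baseline; n=140 vs. a 63% baseline).
print(significance_threshold(200, 0.65))
print(significance_threshold(140, 0.63))
```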


[Figure 6.2 depicts the Hindi word hathiyaara with its induced Hindi hyponyms (baaruuda, bama, banduuka, haathagolaa), their English equivalents (bomb, explosive, grenade, gun) obtained via existing dictionary entries or previous induced translations, and the induced English hypernym weapon; the goal is to learn the translation hathiyaara → weapon.]

Figure 6.2: Illustration of the models of using induced hyponymy and hypernymy for translation lexicon induction.

6.5 Improving a partial translation

dictionary

In this section, I describe the application of automatically generated multilingual

taxonomies to the task of translation dictionary induction. The hypothesis is that a

pair of words in two languages would have increased probability of being translations

of each other if their hypernyms or hyponyms are translations of one another.

As illustrated in Figure 6.2, the probability that weapon is a translation of the Hindi

word hathiyaara can be decomposed into the sum of the probabilities that their hy-

ponyms in both languages (as induced in Section 6.3.2) are translations of each other.

Thus:

P_{H→E}(W_E|W_H) = Σ_i P_hyper(W_E | Eng(H_i)) · P_hypo(H_i | W_H)    (6.4)


[Figure 6.3 depicts the Hindi word raaiphala with its induced Hindi hypernym hathiyaara, translated via existing dictionary entries or previous induced translations to the English word weapon, whose induced English hyponyms (missile, grenade, bomb, rifle) form the hypothesis space of candidate translations; the goal is to learn the translation raaiphala → rifle.]

Figure 6.3: Reducing the space of likely translation candidates of the word raaiphala by inducing its hypernym, using a partial dictionary to look up the translation of the hypernym and generating the candidate translations as induced hyponyms in English space.

for induced hyponyms Hi of the source word WH , and using an existing (and likely

very incomplete) Hindi-English dictionary to generate Eng(Hi) for these hyponyms,

and the corresponding induced hypernyms of these translations in English.13 A preliminary evaluation of this idea was conducted for obtaining English translations of

a set of 25 Hindi words. The Hindi candidate hyponym space had been pruned of

function words and non-noun words. The likely English translation candidates for

each Hindi word were ranked according to the probability P_{H→E}(W_E|W_H).

13 One of the challenges of inducing a dictionary via a corpus-based taxonomy is sense disambiguation of the words to be translated. In the current model, the more dominant sense (in terms of corpus frequency of its hyponyms) is likely to get selected by this approach. While the current model can still help in getting translations of the dominant sense, possible future work would be to cluster all the hyponyms according to contextual features such that each cluster can represent the hyponyms for a particular sense. The current dictionary induction model can then be applied again using the hyponym clusters to distinguish different senses for translation.
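Equation 6.4 can be sketched as follows, with toy probability tables loosely based on Figure 6.2; all numbers and helper names are invented for illustration:

```python
def translation_prob(w_h, hypo_h, dictionary, hyper_e):
    """Equation 6.4 sketch: P(W_E|W_H) decomposes over induced Hindi
    hyponyms H_i of W_H, their partial-dictionary translations Eng(H_i),
    and the induced English hypernyms of those translations.
    hypo_h:     P_hypo(H_i|W_H) as {hindi_word: {hyponym: prob}}
    dictionary: partial Hindi -> English lookup
    hyper_e:    P_hyper(W_E|e)  as {english_word: {hypernym: prob}}"""
    scores = {}
    for h_i, p_hypo in hypo_h.get(w_h, {}).items():
        e = dictionary.get(h_i)
        if e is None:  # hyponym not covered by the partial dictionary
            continue
        for w_e, p_hyper in hyper_e.get(e, {}).items():
            scores[w_e] = scores.get(w_e, 0.0) + p_hyper * p_hypo
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy illustration with invented probabilities:
hypo_h = {"hathiyaara": {"bama": 0.5, "banduuka": 0.5}}
dictionary = {"bama": "bomb", "banduuka": "gun"}
hyper_e = {"bomb": {"weapon": 0.8, "device": 0.2},
           "gun": {"weapon": 0.9, "tool": 0.1}}
print(translation_prob("hathiyaara", hypo_h, dictionary, hyper_e))
# "weapon" receives the highest combined score (0.85)
```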


        Accuracy  Accuracy  Accuracy
        (uni-d)   (bi-d)    (bi-d + Other)
Top 1   20%       36%       36%
Top 5   56%       64%       72%
Top 10  72%       72%       80%
Top 20  84%       84%       84%

Table 6.9: Accuracy on Hindi to English word translation using different transitive hypernym algorithms. The additional model components in the bi-d (bidirectional) plus Other model are only used to rerank the top 20 candidates of the bidirectional model, and are hence limited to its top-20 performance.

The first column of Table 6.9 shows the stand-alone performance for this model on

the dictionary induction task. This standalone model has a reasonably good accuracy

for finding the correct translation in the Top 10 and Top 20 English candidates.

This approach can be further improved by also implementing the above model in the

reverse direction and computing P(W_H|W_Ei) for each of the English candidates W_Ei. P(W_H|W_Ei) was computed for the top 20 English candidate translations. The final

score for an English candidate translation given a Hindi word was combined by a simple average of the two directions, that is, by summing P(W_Ei|W_H) + P(W_H|W_Ei).
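The bidirectional combination can be sketched as a simple sum over the two direction-specific scores; the numbers below are invented for illustration:

```python
def bidirectional_score(p_e_given_h, p_h_given_e):
    """Combine the two translation directions by summing
    P(W_Ei|W_H) + P(W_H|W_Ei) for each English candidate W_Ei."""
    keys = set(p_e_given_h) | set(p_h_given_e)
    return {e: p_e_given_h.get(e, 0.0) + p_h_given_e.get(e, 0.0)
            for e in keys}

# Invented direction-specific scores: both directions favor "weapon".
p_e_given_h = {"weapon": 0.85, "device": 0.10}
p_h_given_e = {"weapon": 0.60, "tool": 0.20}
combined = bidirectional_score(p_e_given_h, p_h_given_e)
best = max(combined, key=combined.get)
print(best)  # prints: weapon
```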

The second column of Table 6.9 shows how this bidirectional approach helps in get-

ting the right translations in Top 1 and Top 5 as compared to the unidirectional

approach. Table 6.10 shows a sample of correct and incorrect translations generated

by the above model. It is interesting to see that the incorrect translations seem to be

the words that are very general (like “topic”, “stuff”, etc.) and hence their hyponym

space is very large and diffuse, resulting in incorrect translations. While the columns


Correctly translated     Incorrectly translated
aujaara (tool)           vishaya (topic)
biimaarii (disease)      saamana (stuff)
hathiyaara (weapon)      dala (group,union)
dastaaveja (documents)   tyohaara (festival)
aparaadha (crime)        jagaha (position,location)

Table 6.10: A sample of correct and incorrect translations using transitive hypernymy/hyponymy word translation induction.

1 and 2 of Table 6.9 show the standalone application of the translation dictionary induction method, it can also be combined with existing work on dictionary induction

using other translation induction measures such as relative frequency similarity in multilingual corpora and cross-language context similarity between word

co-occurrence vectors (Schafer and Yarowsky, 2002). The above dictionary induction

measures were implemented and combined with the taxonomy-based dictionary induction model by just summing the two scores.14 The preliminary results for bidirectional

hypernym/hyponym + other features are shown in column 3 of Table 6.9.

6.6 Conclusion

This chapter presents a novel minimal-resource algorithm for the acquisition of

multilingual lexical taxonomies (including hyponymy/hypernymy and meronymy).

The algorithm is based on cross language projection of various monolingual indica-

tors of these taxonomic relationships in free text and via bootstrapping thereof. Using

only 31-58 seed examples, the algorithm achieves accuracies of 73% and 66% for En-

14 After renormalizing each of the individual scores to be in the range 0 to 1.


glish and Hindi respectively on the tasks of hyponymy/meronymy/cousinhood/other

model induction. The robustness of this approach is shown by the fact that the unan-

notated Hindi development corpus was only 1/15th the size of the utilized English

corpus. A novel model of unsupervised translation dictionary induction is also pre-

sented via multilingual transitive models of hypernymy and hyponymy, using these

induced taxonomies and evaluated on Hindi-English. Performance starting from no

multilingual dictionary supervision is quite promising.


Chapter 7

Extraction of Semantic Facts from Unlabeled Corpora targeting Resolution and Generation of Definite Anaphora

Summary

This chapter outlines an original and successful approach for both resolving and

generating definite anaphora. Models for extracting hypernym relations are learned

by mining co-occurrence data of definite NPs and potential antecedents in an un-

labeled corpus. The algorithm outperforms a standard WordNet-based approach to


resolving and generating definite anaphora. It also substantially outperforms recent

related work using pattern-based extraction of such hypernym relations for corefer-

ence resolution.

Components of this chapter were originally published by the author of this disserta-

tion in the forum referenced below1.

7.1 Introduction

Successful resolution and generation of definite anaphora requires knowledge of

hypernym and hyponym relationships. For example, determining the antecedent to

the definite anaphor “the drug” in text requires knowledge of what previous noun-

phrase candidates could be drugs. Likewise, generating a definite anaphor for the

antecedent “Morphine” in text requires both knowledge of potential hypernyms (e.g.

“the opiate”, “the narcotic”, “the drug”, and “the substance”), as well as selection of

the most appropriate level of generality along the hypernym tree in context (i.e. the

“natural” hypernym anaphor). Unfortunately existing manual hypernym databases

such as WordNet are very incomplete, especially for technical vocabulary and proper

names. WordNets are also limited or non-existent for most of the world’s languages.

Finally, WordNets also do not include notation of the “natural” hypernym level for

anaphora generation, and using the immediate parent performs quite poorly, as quan-

1 Reference: N. Garera and D. Yarowsky. Resolving and Generating Definite Anaphora by Modeling Hypernymy using Unlabeled Corpora. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 2006.


[Figure 7.1 shows the same passage twice: in the resolution task, the definite NP "the drug" must be linked to its antecedent pseudoephedrine among the circled candidate nouns; in the generation task, the definite NP is blanked out and must be generated.]

Figure 7.1: Example of definite anaphora resolution and generation. Both tasks require the knowledge of the semantic relationship "pseudoephedrine is-a drug"; however, the resolution task is easier because there is only a limited set of candidates to choose from (shown by circled nouns).


tified in Section 7.5.

The first part of this chapter describes a novel approach for resolving definite anaphora

involving hyponymy relations, which performs substantially better than previous ap-

proaches on the task of antecedent selection. In the second part, the same approach

is successfully extended to the problem of generating a natural definite NP given a

specific antecedent.

The following example taken from the LDC Gigaword corpus (Graff et al., 2005) ex-

plains the antecedent selection task for definite anaphora more clearly (see also Figure

7.1):

(1)...pseudoephedrine is found in an allergy treatment, which was given to Wilson

by a doctor when he attended Blinn junior college in Houston. In a unanimous vote,

the Norwegian sports confederation ruled that Wilson had not taken the drug to

enhance his performance...

In the above example, the task is to resolve the definite NP the drug to its correct

antecedent pseudoephedrine, among the potential antecedents <pseudoephedrine, al-

lergy, blinn, college, houston, vote, confederation, wilson>. Only Wilson can be ruled

out on syntactic grounds (Hobbs, 1978). To be able to resolve the correct antecedent

from the remaining potential antecedents, the system requires the knowledge that

pseudoephedrine is a drug. Thus, the problem is to create such a knowledge source

and apply it to this task of antecedent selection. A total of 177 such anaphoric ex-

amples were extracted randomly from the LDC Gigaword corpus and a human judge


identified the correct antecedent for the definite NP in each example (given a context

of previous sentences).2 Two human judges were asked to perform the same task over

the same examples. The agreement between the judges was 92% (of all 177 exam-

ples), indicating a clearly defined task for evaluation purposes.

This chapter describes an unsupervised approach to this task that extracts examples

containing definite NPs from a large corpus, considers all head words appearing be-

fore the definite NP as potential antecedents and then filters the noisy <antecedent,

definite-NP> pairs using Mutual Information space. The co-occurrence statistics of

such pairs can then be used as a mechanism for detecting a hypernym relation be-

tween the definite NP and its potential antecedents. This approach is compared with

a WordNet-based algorithm and with an approach presented by Markert and Nissim

(2005) on resolving definite NP coreference that makes use of lexico-syntactic patterns

such as ’X and Other Ys’ as utilized by Hearst (1992).

7.2 Related work

There is a rich tradition of work using lexical and semantic resources for anaphora

and coreference resolution. Several researchers have used WordNet as a lexical and

semantic resource for certain types of bridging anaphora (Poesio et al., 1997; Meyer

2 The test examples were selected as follows: first, all the sentences containing a definite NP "The Y" were extracted from the corpus. Then, the sentences containing instances of anaphoric definite NPs were kept and other cases of definite expressions (like existential NPs "The White House", "The weather") were discarded. From this anaphoric set of sentences, 177 sentence instances covering 13 distinct hypernyms were randomly selected as the test set and annotated for the correct antecedent by human judges.


and Dale, 2002). WordNet has also been used as an important feature in machine

learning of coreference resolution using supervised training data (Soon et al., 2001;

Ng and Cardie, 2002). However, several researchers have reported that knowledge

incorporated via WordNet is still insufficient for definite anaphora resolution. And

of course, WordNet is not available for all languages and is missing inclusion of large

segments of the vocabulary even for covered languages. Hence researchers have inves-

tigated use of corpus-based approaches to build a WordNet like resource automatically

(Hearst, 1992; Caraballo, 1999; Berland and Charniak, 1999). Poesio et al. (2002)

have proposed extracting lexical knowledge about part-of relations using Hearst-style

patterns and applied it to the task of resolving bridging references. Markert et al.

(2003) have applied relations extracted from lexico-syntactic patterns such as ’X and

other Ys’ for Other-Anaphora (referential NPs with modifiers other or another) and

for bridging involving meronymy.

There has generally been a lack of work in the existing literature for automatically

building lexical resources for definite anaphora resolution involving hyponym relations such as presented in Example (1). However, this issue was recently addressed by

Markert and Nissim (2005) by extending their work on Other-Anaphora using the lexico-syntactic pattern 'X and other Ys' to antecedent selection for definite NP coreference.

However, the task here is more challenging since the anaphoric definite NPs in the

test set include only hypernym anaphors without including the much simpler cases

of headword repetition and other instances of string matching. For direct evaluation,


their corpus-based approach was also implemented and compared with the models

presented in this chapter on identical test data.

Later in the chapter, a mechanism for combining the knowledge obtained from Word-

Net and the six corpus-based approaches is also presented. The resulting models are

able to overcome the weaknesses of a WordNet-only model and substantially outperform any of the individual models.

7.3 Models for Lexical Acquisition

7.3.1 TheY-Model

The algorithm developed in this section is one of the core contributions of this

chapter. This algorithm is motivated by the observation that in a discourse, the

use of the definite article (“the”) in a non-deictic context is primarily licensed if the

concept has already been mentioned in the text. Hence a sentence such as “The drug

is very expensive” generally implies that either the word drug itself was previously

mentioned (e.g. “He is taking a new drug for his high cholesterol.”) or a hyponym of

drug was previously mentioned (e.g. “He is taking Lipitor for his high cholesterol.”).

Because it is straightforward to filter out the former case by string matching, the

residual instances of the phrase “the drug” (without previous mentions of the word

“drug” in the discourse) are likely to be instances of hypernymic definite anaphora.

One can then determine which nouns earlier in the discourse (e.g. Lipitor) are likely


antecedents by unsupervised statistical co-occurrence modeling aggregated over the

entire corpus. All that is needed is a large corpus without any anaphora annotation

and a basic tool for noun tagging and NP head annotation. The detailed algorithm

is as follows:

1. Find each sentence in the training corpus that contains a definite NP (’the

Y’ ) and does not contain ’a Y’, ’an Y’ or other instantiations of Y appearing

before the definite NP within a fixed window. The window size was set to two sentences; a larger window size of five sentences was also experimented with and

the results obtained were similar. While matching for both ’the Y’ and ’a/an

Y’, the algorithm also accounts for Nouns getting modified by other words such

as adjectives. Thus ’the Y’ will still match to ’the green and big Y’ 3.

2. In the sentences that pass the above definite NP and a/an test, regard all the

head words (X) occurring in the current sentence before the definite NP and

the ones occurring in previous two sentences as potential antecedents.

3. Count the frequency c(X,Y) for each pair obtained in the above two steps and

pre-store it in a table.4 The frequency table can be modified to give other scores

for pair(X,Y) such as standard TF.IDF and Mutual Information scores.

4. Given a test sentence having an anaphoric definite NP Y, consider the nouns

appearing before Y within a fixed window as potential antecedents. Rank the

3 The noun phrase and its head were identified using a simple and noisy heuristic, eliminating the need for parsing the sentences.

4 Note that the count c(X,Y) is asymmetric.


Rank  Raw freq  TF.IDF       MI
1     today     kilogram     amphetamine
2     police    heroin       cannabis
3     kilogram  police       cocaine
4     year      cocaine      heroin
5     heroin    today        marijuana
6     dollar    trafficker   pill
7     country   officer      hashish
8     official  amphetamine  tablet

Table 7.1: A sample of ranked hyponyms proposed for the definite NP "the drug" by the TheY-Model, illustrating the differences in weighting methods.

          Acc    Acc_tag  Av Rank
MI        0.531  0.577    4.82
TF.IDF    0.175  0.190    6.63
Raw Freq  0.113  0.123    7.61

Table 7.2: Results using different normalization techniques for the TheY-Model in isolation (60 million word corpus).

candidates by their pre-computed co-occurrence measures as computed in Step

3.

Since all head words preceding the definite NP are considered as potential correct antecedents, the raw frequency of the pair (X,Y) can be very noisy. This can be seen

clearly in Table 7.1, where the first column shows the top potential antecedents of

definite NP the drug as given by raw frequency. The raw frequency is normalized using

standard TF.IDF and Pointwise Mutual Information scores to filter the noisy pairs.

Note that MI(X, Y) = log [ P(X, Y) / (P(X) P(Y)) ], which for a fixed Y is a monotonically increasing function of P(Y|X) = P(X, Y) / P(X). Thus, one can simply use this conditional probability during implementation, since the definite NP Y is fixed for the task of antecedent selection.

Table 7.2 reports results for antecedent selection using raw frequency c(X,Y), TF.IDF[5] and MI in isolation. Accuracy is the fraction of total examples that were assigned the correct antecedent, and Accuracy_tag is the same excluding the examples that had POS-tagging errors for the correct antecedent.[6] Av Rank is the rank of the true antecedent averaged over the number of test examples.[7] Based on the above experiment, the rest of this chapter makes use of the Mutual Information scoring technique for the TheY-Model.

7.3.2 WordNet-Model (WN)

Because WordNet is considered a standard resource of lexical knowledge and is often used in coreference tasks, it is useful to know how well corpus-based approaches perform compared to a standard model based on WordNet (version 2.0). A simple baseline was also investigated, namely selecting the closest previous headword as the correct antecedent. This recency-based baseline obtained a low accuracy of 15%, and hence a stronger WordNet-based model was used for comparison purposes.

The algorithm for the WordNet-Model is as follows:

Given a definite NP Y and its potential antecedent X, choose X if it occurs as a

hyponym (either direct or indirect inheritance) of Y. If multiple potential antecedents

occur in the hierarchy of Y, choose the one that is closest in the hierarchy.
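A sketch of this selection rule, with a toy hypernym map standing in for real WordNet hypernym chains (the entries below are illustrative assumptions, not actual WordNet output):

```python
# Toy hypernym chains, ordered from nearest parent to root (illustrative).
HYPERNYMS = {
    "gold": ["metal", "substance", "entity"],
    "soccer": ["sport", "activity", "entity"],
    "trumpet": ["instrument", "device", "entity"],
}

def wordnet_model(candidates, y):
    """Choose the candidate X that occurs as a (direct or indirect) hyponym
    of Y, preferring the X whose chain places Y closest to it."""
    best, best_dist = None, None
    for x in candidates:
        chain = HYPERNYMS.get(x, [])
        if y in chain:
            d = chain.index(y)  # 0 = direct hypernym
            if best_dist is None or d < best_dist:
                best, best_dist = x, d
    return best
```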

[5] For the purposes of TF.IDF computation, the document frequency df(X) is defined as the number of unique definite NPs for which X appears as an antecedent.

[6] Since the POS tagging was done automatically, it is possible for any model to miss the correct antecedent because it was not tagged correctly as a noun in the first place. There were 14 such examples in the test set, and none of the model variants can find the correct antecedent in these instances.

[7] Knowing the average rank can be useful when an n-best ranked list from the coreference task is used as an input to other downstream tasks such as information extraction.


          Acc     Acc_tag   Av Rank
TheY+WN   0.695   0.755     3.37
WordNet   0.593   0.644     3.29
TheY      0.531   0.577     4.82

Table 7.3: Accuracy and Average Rank showing combined model performance on the antecedent selection task. Corpus size: 60 million words.

7.3.3 Combination: TheY+WordNet Model

Most of the literature on using lexical resources for definite anaphora has focused on using individual models (either corpus-based or manually built resources such as WordNet) for antecedent selection. Among the difficulties with using WordNet are its limited coverage and its lack of an empirical ranking model. Thus, a combination of the TheY-Model and the WordNet-Model is used in order to overcome these problems. Essentially, the hypotheses found by the WordNet-Model are reranked based on the ranks of the TheY-Model, or a backoff scheme is used if the WordNet-Model does not return an answer due to its limited coverage. Given a definite NP Y and a set of potential antecedents Xs, the detailed algorithm is as follows:

1. Rerank with TheY-Model: Rerank the potential antecedents found in the WordNet-Model table by assigning them the ranks given by the TheY-Model. If the TheY-Model does not return a rank for a potential antecedent, use the rank given by the WordNet-Model. Now pick the top-ranked antecedent after reranking.

2. Backoff: If none of the potential antecedents were found in the WordNet-Model, then pick the correct antecedent from the ranked list of the TheY-Model. If neither of the models returns an answer, then assign ranks uniformly at random.

Summary      Keyword      True         TheY        Truth   WordNet      Truth   TheY+WN      Truth
             (Def. Ana)   Antecedent   Choice      Rank    Choice       Rank    Choice       Rank
Both         metal        gold         gold        1       gold         1       gold         1
correct      sport        soccer       soccer      1       soccer       1       soccer       1
TheY-Model   drug         steroid      steroid     1       NA           NA      steroid      1
helps        drug         azt          azt         1       medication   2       azt          1
WN-Model     instrument   trumpet      king        10      trumpet      1       trumpet      1
helps        drug         naltrexone   alcohol     14      naltrexone   1       naltrexone   1
Both         weapon       bomb         artillery   3       NA           NA      artillery    3
incorrect    instrument   voice        music       9       NA           NA      music        9

Table 7.4: A sample of output from different models on antecedent selection (60 million word corpus).
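The rerank-and-backoff combination can be sketched as follows (the rank dictionaries mapping candidates to ranks, lower being better, are an assumed interface for illustration):

```python
import random

def combine_theY_wordnet(candidates, theY_rank, wn_rank):
    """TheY+WordNet combination: rerank the WordNet hypotheses with
    TheY-Model ranks, back off to the TheY-Model ranking alone, and fall
    back to a random choice if neither model returns an answer."""
    wn_hits = [x for x in candidates if x in wn_rank]
    if wn_hits:
        # Step 1: prefer TheY-Model ranks; keep the WordNet rank otherwise.
        return min(wn_hits, key=lambda x: theY_rank.get(x, wn_rank[x]))
    theY_hits = [x for x in candidates if x in theY_rank]
    if theY_hits:
        # Step 2 (backoff): TheY-Model ranking alone.
        return min(theY_hits, key=lambda x: theY_rank[x])
    return random.choice(candidates)  # neither model applies
```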

The above algorithm harnesses the strength of the WordNet-Model to identify good hyponyms and the strength of the TheY-Model to identify which of them are more likely to be used as an antecedent. Note that this combination algorithm can be applied using any corpus-based technique to account for the poor-ranking and low-coverage problems of WordNet, and Sections 7.3.4, 7.3.5 and 7.3.6 will show the results of backing off to a Hearst-style hypernym model. Table 7.4 shows the decisions made by the TheY-Model, the WordNet-Model and the combined model for a sample of test examples. It is interesting to see how the two models mutually complement each other in these decisions. Table 7.3 shows the results for the models presented so far using a 60 million word training text from the Gigaword corpus. The combined model results in substantially better accuracy than the individual WordNet-Model and TheY-Model, indicating its strong merit for the antecedent selection task.


7.3.4 OtherY-Model_freq

This model is a reimplementation of the corpus-based algorithm proposed by Markert and Nissim (2005) for the equivalent task of antecedent selection for definite NP coreference. Their approach of using the lexico-syntactic pattern 'X and A* other B* Y{pl}' for extracting (X,Y) pairs was replicated. Markert and Nissim (2005) also report a Web algorithm that makes use of hits from Google for instantiations of 'X and other Ys'. They also used 'X{sg} OR X{pl}' in their patterns to take both singulars and plurals into account. The lemmatized form of X was used during test and unsupervised training.

The A* and B* allow for adjectives or other modifiers to be placed within the pattern. The model presented in their article uses raw frequency as the criterion for selecting the antecedent.
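A rough sketch of harvesting (X, Y) pairs with this pattern; the plural stripping and head-noun selection below are crude stand-ins for the POS tagging and lemmatization used in the original work:

```python
import re

def extract_pairs(text):
    """Harvest (X, Y) pairs from the 'X and (A*) other (B*) Y{pl}' pattern,
    one clause at a time. The head noun of the Y phrase is assumed to be
    its last word; 'xs?' lazily strips a plural -s from X."""
    pairs = []
    for clause in re.split(r"[.,;]", text.lower()):
        m = re.search(r"\b(\w+?)s? and (?:\w+ )*?other (.+)", clause)
        if m:
            head = m.group(2).split()[-1]  # head noun of the Y phrase
            if head.endswith("s"):
                pairs.append((m.group(1), head[:-1]))
    return pairs
```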

7.3.5 OtherY-Model_MI (normalized)

Normalization of the OtherY-Model is done using the Mutual Information scoring method. Although Markert and Nissim (2005) report that using Mutual Information performs similarly to using raw frequency, Table 7.5 shows that using Mutual Information makes a substantial impact on results over large training corpora relative to using raw frequency.


7.3.6 Combination: TheY+OtherY_MI Model

The two corpus-based approaches (TheY and OtherY) make use of different linguistic phenomena, and it is interesting to ask whether they are complementary in nature. A combination algorithm similar to that of Section 7.3.3 was used, with the WordNet-Model replaced by the OtherY-Model for hypernym filtering, and the noisy TheY-Model used for reranking and backoff. The results for this approach are shown as the entry TheY+OtherY_MI in Table 7.5. A combination (OtherY+WN) of the OtherY-Model and WordNet-Model was also computed by replacing the TheY-Model with the OtherY-Model in the algorithm described in Section 7.3.3. The respective results are indicated as the OtherY_MI+WN entry in Table 7.5.

7.4 Further Anaphora Resolution Results

Table 7.5 summarizes the results obtained from all the models defined in Section 7.3 on three different sizes of unlabeled training corpora (from the Gigaword corpus). The models are listed in order from high to low accuracy. The OtherY-Model performs particularly poorly on smaller data sizes, where coverage of the Hearst-style patterns may be limited, as also observed by Berland and Charniak (1999). With increased corpus sizes, the Markert and Nissim (2005) OtherY-Model and its MI-based improvement do show substantial relative performance growth, although they still underperform the basic TheY-Model at all tested corpus sizes. Also, the combination


                  Acc     Acc_tag   Av Rank
60 million words
TheY+WN           0.695   0.755     3.37
OtherY_MI+WN      0.633   0.687     3.04
WordNet           0.593   0.644     3.29
TheY              0.531   0.577     4.82
TheY+OtherY_MI    0.497   0.540     4.96
OtherY_MI         0.356   0.387     5.38
OtherY_freq       0.350   0.380     5.39
230 million words
TheY+WN           0.678   0.736     3.61
OtherY_MI+WN      0.650   0.705     2.99
WordNet           0.593   0.644     3.29
TheY+OtherY_MI    0.559   0.607     4.50
TheY              0.519   0.564     4.64
OtherY_MI         0.503   0.546     4.37
OtherY_freq       0.418   0.454     4.52
380 million words
TheY+WN           0.695   0.755     3.47
OtherY_MI+WN      0.644   0.699     3.03
WordNet           0.593   0.644     3.29
TheY+OtherY_MI    0.554   0.601     4.20
TheY              0.537   0.583     4.26
OtherY_MI         0.525   0.571     4.20
OtherY_freq       0.446   0.485     4.36

Table 7.5: Accuracy and Average Rank of the models defined in Section 7.3 on the antecedent selection task.


of corpus-based models (TheY-Model + OtherY-Model) does indeed perform better than either of them in isolation. Finally, note that the basic TheY algorithm still does relatively well by itself on smaller corpus sizes, suggesting its merit for resource-limited languages with smaller available online text collections and no available WordNet. The combined models of the WordNet-Model with the two corpus-based approaches still substantially outperform any of the other individual models.

Also, syntactic coreference candidate filters such as the Hobbs algorithm were not utilized in this study. To assess the performance implications, the Hobbs algorithm was applied to a randomly selected 100-instance subset of the test data. Although the Hobbs algorithm frequently pruned at least one of the coreference candidates, in only 2% of the data did such candidate filtering change system output. However, since both of these changes were improvements, it could be worthwhile to utilize Hobbs filtering in future work, although the gains would likely be modest.


7.5 Generation Task

Having shown positive results for the task of antecedent selection in the first

part, the second part of the chapter presents a more difficult task, namely generating

an anaphoric definite NP given a nominal antecedent. In Example (1), this would

correspond to generating “the drug” as an anaphor knowing that the antecedent is

pseudoephedrine. This task clearly has many applications: current generation systems

often limit their anaphoric usage to pronouns and thus an automatic system that does

well on hypernymic definite NP generation can directly be helpful. It also has strong

potential application in abstractive summarization where rewriting a fluent passage

requires a good model of anaphoric usage.

There are many interesting challenges in this problem: first of all, there may be multiple acceptable choices of definite anaphor for a given antecedent, complicating automatic evaluation. Second, when a system generates a definite anaphor, the space of potential candidates is essentially unbounded, unlike in antecedent selection, where it is limited to the number of potential antecedents in prior context. In spite of the complex nature of this problem, the experiments with human judgments, WordNet and corpus-based approaches show a simple feasible solution. All the approaches are evaluated based on exact-match agreement with the definite anaphora actually used in the corpus (accuracy) and also by agreement with definite anaphora predicted independently by a human judge in the absence of context.


7.5.1 Human experiment

A total of 103 <true antecedent, definite NP> pairs were extracted from the set of test instances used in the resolution task. A human judge (a native speaker of English) was then asked to predict a parent class of the antecedent that could act as a good definite anaphor choice in general, independent of a particular context. Thus, the actual corpus sentence containing the antecedent and definite NP, and its context, was not provided to the judge. The predictions provided by the judge were matched with the actual definite NPs used in the corpus, and the agreement between the corpus and the human judge was 79%, which can thus be considered an upper bound on algorithm performance. Table 7.7 shows a sample of decisions made by the human and how they agree with the definite NPs observed in the corpus. It is interesting to note the challenge of sense variation and figurative usage. For example, "corruption" is referred to as a "tool" in the actual corpus anaphora, a metaphoric usage that would be difficult to predict unless given the usage sentence and its context.

However, a human agreement of 79% indicates that such instances are relatively rare and that the task of predicting a definite anaphor without its context is viable. In general, it appears from the experiments that humans tend to select from a relatively small set of parent classes when generating hypernymic definite anaphora. Furthermore, there appears to be a relatively context-independent concept of the "natural" level in the hypernym hierarchy for generating anaphors.[8] For example, although

[8] This is somewhat similar to the notion of "natural kind" in philosophy, which describes a "natural" grouping as opposed to an artificial grouping of things (Quine, 1969).


[Figure 7.2 image: a WordNet 2.1 search for "pseudoephedrine" ("poisonous crystalline alkaloid occurring with ephedrine and isomorphic with it"), showing its inherited hypernym chain: alkaloid → organic compound → compound, chemical compound → substance, matter → physical entity → entity.]

Figure 7.2: Illustrating the problem with WordNet for definite anaphora generation. The immediate parent and grandparent of "pseudoephedrine", "alkaloid" and "organic compound", do not serve as natural definite anaphors, compared to "the drug" that is often observed in corpora.

<"alkaloid", "organic compound", "compound", "substance", "entity"> are all hypernyms of "pseudoephedrine" in WordNet (see Figure 7.2), "the drug" appears to be the preferred hypernym for definite anaphora in the data, with the other alternatives being either too specific or too general to be natural. This natural level appears to be difficult to define by rule. For example, using just the immediate parent hypernym in the WordNet hierarchy achieves only a 4% match with the corpus data for definite anaphor generation.


7.5.2 Algorithms

The following sections present the corpus-based algorithms as more effective alternatives.

7.5.2.1 Individual Models

For the corpus-based approaches, the TheY-Model and OtherY-Model were trained in the same manner as for the antecedent selection task. The only difference was that in the generation case, the frequency statistics were reversed to provide a hypernym given a hyponym.[9] Additionally, raw frequency outperformed both TF.IDF and Mutual Information and was used for all results in Table 7.6.

The stand-alone WordNet model is also very simple: given an antecedent, its direct hypernym (using the first sense) is looked up in WordNet and used as the definite NP, for lack of a better rule for preferred hypernym location.[10]

7.5.2.2 Combining corpus-based approaches and WordNet

Each of the corpus-based approaches was combined with WordNet, resulting in two different models, as follows: given an antecedent X, the corpus-based approach

[9] On this task, it was found that using raw frequency worked better than other scoring techniques.

[10] The first sense of the antecedent was used to find its location in WordNet. However, using appropriate word sense disambiguation techniques could be very helpful for this task.


                  Agreement        Agreement
                  w/ human judge   w/ corpus
TheY+OtherY+WN    47%              46%
OtherY+WN         43%              43%
TheY+WN           42%              37%
TheY+OtherY       39%              36%
OtherY            39%              36%
WordNet           4%               4%
Human judge       100%             79%
Corpus            79%              100%

Table 7.6: Agreement of different generation models with the human judge and with the definite NP used in the corpus.

looks up the hypernym of X in its table, say Y, and produces Y as output only if Y also occurs in WordNet as a hypernym. Thus WordNet is used as a filtering tool for detecting viable hypernyms. This combination resulted in two models: 'TheY+WN' and 'OtherY+WN'.

The combination of all three approaches, 'TheY', 'OtherY' and WordNet, is represented as 'TheY+OtherY+WN'. The combination was done as follows: first, the models 'TheY' and 'OtherY' were combined using a backoff model, where the first priority is to use the hypernym from the 'OtherY' model and, if none is found, the hypernym from the 'TheY' model. Given a definite NP from the backoff model, the WordNet filtering technique is then applied: the NP is chosen as the correct definite NP if it also occurs as a hypernym in the WordNet hierarchy of the antecedent.
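This three-way combination can be sketched as follows (the dictionary interfaces for the corpus tables and the WordNet hypernym list are assumptions for illustration):

```python
def generate_definite_np(x, otherY_table, theY_table, wn_hypernyms):
    """TheY+OtherY+WN generation: back off from the OtherY table to the
    TheY table for a hypernym proposal, then keep the proposal only if
    WordNet also lists it as a hypernym of x."""
    y = otherY_table.get(x) or theY_table.get(x)  # backoff combination
    if y is not None and y in wn_hypernyms.get(x, ()):
        return y  # passed the WordNet filter
    return None   # no viable definite NP proposed
```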


Antecedent     Corpus    Human    TheY+OtherY+WN
               Def Ana   Choice
racing         sport     sport    sport
azt            drug      drug     drug
missile        weapon    weapon   weapon
alligator      animal    animal   animal
steel          metal     metal    metal
osteoporosis   disease   disease  condition
grenade        device    weapon   device
baikonur       site      city     station
corruption     tool      crime    activity

Table 7.7: Sample of decisions made by the human judge and the best performing model (TheY+OtherY+WN) on the generation task.

7.5.3 Evaluation of Anaphor Generation

The resulting algorithms from Section 7.5.2 were evaluated on the definite NP prediction task as described earlier. Table 7.6 shows the agreement of the algorithm predictions with the human judge as well as with the definite NP actually observed in the corpus. It is interesting to see that WordNet by itself performs very poorly on this task, since it has no word-specific mechanism to choose the correct level in the hierarchy and the correct word sense for selecting the hypernym. However, when combined with the corpus-based approaches, the agreement increases substantially, indicating that the corpus-based approaches are effectively filtering the space of hypernyms that can be used as natural classes. Likewise, WordNet helps to filter the noisy hypernyms from the corpus predictions. This interplay between the corpus-based and WordNet algorithms works out nicely, resulting in the best model being a combination of all three individual models and achieving substantially better agreement with both the corpus and the human judge than any of the individual models. Table 7.7 shows decisions made by this algorithm on a sample of the test data.

7.6 Statistical Significance of Results

This section analyzes the statistical significance of the results reported in Tables 7.5 and 7.6. Using a binomial test and the best baseline accuracies (obtained via WordNet) of 59.3% (Acc) and 64.4% (Acc_tag) for all corpora sizes in Table 7.5, any resulting accuracy over 65.5% (Acc) and over 70.1% (Acc_tag) is statistically significant with a p-value less than 0.05. Thus, the results obtained with the best model for all corpora in Table 7.5 are statistically significant.

For the results on the generation task in Table 7.6, WordNet performs poorly, resulting in only a 4% accuracy, and any resulting accuracy over 7.77% is statistically significant with a p-value less than 0.05. While Markert and Nissim (2005) did not apply their approach to the generation task, applying it there and using the accuracies of the "OtherY" row as the baseline, any resulting accuracy over 46.6% (agreement w/ human judge) and over 43.7% (agreement w/ corpus) is statistically significant. Thus, the accuracies of the best model in Table 7.6 are also statistically significant with respect to the Markert and Nissim (2005) model as applied to the generation task.
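The underlying one-sided binomial test can be sketched as follows; the test-set size n below is an illustrative assumption, not the actual count used in the thesis:

```python
from math import comb

def binom_p_upper(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided p-value for seeing
    k or more correct answers under a baseline accuracy p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def significantly_better(acc, baseline, n=200, alpha=0.05):
    """Is an observed accuracy significantly above the baseline?
    n = 200 is an assumed test-set size for illustration."""
    return binom_p_upper(round(acc * n), n, baseline) < alpha
```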


7.7 Conclusion

This chapter presents a successful solution to the problem of incomplete lexical

resources for definite anaphora resolution and further demonstrates how the resources

built for resolution can be naturally extended for the less studied task of anaphora

generation. First, a simple and noisy corpus-based approach is presented based on

globally modeling headword co-occurrence around likely anaphoric definite NPs. This

was shown to outperform a recent approach by Markert and Nissim (2005) that makes use of standard Hearst-style patterns to extract hypernyms for the same task. Even with relatively small training corpora, the simple TheY-Model was able to achieve relatively high accuracy, making it suitable for resource-limited languages where annotated training corpora and full WordNets are likely not available. Then, several

variants of this algorithm were evaluated based on model combination techniques.

The best combined model was shown to exceed 75% accuracy on the resolution task,

beating any of the individual models. On the much harder anaphora generation task,

where the stand-alone WordNet-based model only achieved an accuracy of 4%, the

corpus-based models can achieve 35%-47% accuracy on blind exact-match evaluation,

thus motivating the use of such corpus-based learning approaches on the generation

task as well.


Part III

Extracting Factual Relationships


Chapter 8

Part III Literature Review

This section covers the literature review for extracting factual relations. Most of

the literature in computational linguistics has focused on extracting “explicit” factual

relationships (described in Section 8.1). Often, factual properties can also be latently

expressed and there has been a plethora of work in the sociolinguistics literature

for extracting such “implicit” or non-overt relationships. Section 8.2 describes the

literature for latent fact extraction.

8.1 Literature for Modeling Explicit Relationships

The literature for extracting explicitly stated facts can be broadly classified into handcrafted rules (Section 8.1.1), supervised machine learning approaches (Section 8.1.2) and seed-based approaches (Section 8.1.3).

8.1.1 Early MUC Approaches: Handcrafted Lexico-syntactic Patterns

Factual relationships convey domain-specific information or knowledge about the properties of a concept or entity and how it is related to other concepts/entities, such as "Mozart-birthplace-Salzburg". The main bottleneck in obtaining such knowledge is the huge amount of manual annotation required to build such a structured database. Furthermore, such manually built databases are limited to only a few of the world's languages.

One key property of identifying such relationships automatically is that, regardless of the type of fact ("birthplace", "occupation", etc.), it is common to observe textual patterns that tie the concepts together with the relationship type. Fixed lexico-syntactic patterns were used in the early Message Understanding Conference (MUC) evaluations, where the goal was to extract segments containing the relevant fact. UMass CIRCUS (Lehnert et al., 1991) was one of the most successful systems in the MUC-3 evaluation and was based on handcrafted patterns, with SRI's FASTUS (Appelt et al., 1993) in MUC-4 setting the trend towards pattern-based approaches by showing that a robust pattern inventory could perform better than even the full parsing-based TACITUS (Hobbs, 1986) system using abductive inference rules.


8.1.2 Machine Learning Approaches

The next direction in the field was towards building supervised models for learning extraction rules from annotated data, with specialized models such as WHISK (Soderland, 1999), and then moving towards more general statistical models such as Hidden Markov Models (Leek, 1997), Conditional Random Fields (Lafferty et al., 2001; Culotta et al., 2006), Support Vector Machines (Culotta and Sorensen, 2004) and logistic regression models (Snow et al., 2006).

8.1.3 Weakly Supervised Approaches using Seed-exemplars

The problem with handcrafted extraction rules, or with annotating data and training supervised models to learn such rules, was the large amount of manual effort needed to annotate the data. In order to overcome this problem, a new direction based on the bootstrapping framework (Yarowsky, 1995) was investigated, leading to a plethora of work in the area of extracting new relationships starting from a small set of seed pairs. The basic seed-based pattern induction work (Brin, 1998; Agichtein and Gravano, 2000; Ravichandran and Hovy, 2002) consisted of two stages: pattern learning and extraction of new pairs. For example, to build a model for extracting "occupation" in this framework, a few seed examples of <Person name, occupation> are used for extracting all the patterns that match the seed pairs. The patterns


[Figure 8.1 image: the bootstrapping loop. Seed pairs (e.g., "Peter Rasmussen, Physicist"; "Alison Wolfe, Singer") match sentences in monolingual corpora, yielding ranked contextual patterns ("worked as a" 0.91, "a well-known" 0.87, "served as a" 0.83, "trained as a" 0.78, "a full-time" 0.72, ..., "is a" 0.21), which in turn extract new pairs from monolingual corpora (e.g., "Mike Beres, Economist"; "James Young, Social worker"), and the process iterates.]

Figure 8.1: Illustration of the basic weakly supervised approach of Ravichandran and Hovy (2002) for fact extraction. Using a few seeds of the fact in question, contextual patterns occurring with the seeds are extracted and ranked based on their distribution in the monolingual corpora. New pairs exhibiting the given fact (for example, occupation) can then be extracted using co-occurrence with these patterns.


are then ranked by corpus frequency and a frequency threshold is set to select the

final patterns. In the extraction stage, these patterns are used to extract occupation

for a new name by finding words or phrases that occur in the occupation slot of the

extracted patterns in an unlabeled corpus.

More formally, the probability of a relationship r ("occupation"), given the surrounding context "A1 p A2 q A3", where p and q are <NAME> and <Occupation Value> respectively, is given by the rote extractor model probability as in (Ravichandran and Hovy, 2002; Mann and Yarowsky, 2005):

\[
P(r(p,q) \mid A_1\,p\,A_2\,q\,A_3) \;=\; \frac{\sum_{x,y \in r} c(A_1\,x\,A_2\,y\,A_3)}{\sum_{x,z} c(A_1\,x\,A_2\,z\,A_3)} \tag{8.1}
\]
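In code, this rote-extractor probability is just the pattern's precision against the seed set; a sketch (the filler-pair lists are assumed inputs):

```python
def rote_precision(filler_pairs, seed_pairs):
    """Eq. 8.1 for one pattern "A1 _ A2 _ A3": the fraction of (x, z)
    filler pairs matched by the pattern that are known seed pairs (x, y)."""
    if not filler_pairs:
        return 0.0
    hits = sum(1 for pair in filler_pairs if pair in seed_pairs)
    return hits / len(filler_pairs)
```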

Variations on the above pattern-based learning approach have differed in their pattern representation and in the measures used for ranking the patterns. Thelen and Riloff (2002) also proposed learning semantic lexicons using collective evidence from a large number of contextual patterns; their system, called Basilisk, used the RlogF metric (Riloff, 1996) for ranking patterns, shown below. This metric is similar to Ravichandran and Hovy's (2002) precision metric except that the precision is multiplied by the log of the seed frequency of the pattern:

RlogF(A_1\, p\, A_2\, q\, A_3) = \frac{\sum_{x,y \in r} c(A_1 x A_2 y A_3)}{\sum_{x,z} c(A_1 x A_2 z A_3)} \cdot \log_2\Big(\sum_{x,y \in r} c(A_1 x A_2 y A_3)\Big) \qquad (8.2)

Extracted candidate words were scored using the AvgLog function, which uses a log sum of the number of patterns that extracted the candidate word, whereas Ravichandran and Hovy (2002) selected a weighted sum, weighted by


pattern precision. An illustration of the basic pattern-based approach is shown in

Figure 8.1.
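The two ranking metrics above can be sketched directly from Equations 8.1 and 8.2. The counts below are invented for illustration; in practice they would come from matching a pattern against seed pairs in an unlabeled corpus.

```python
from math import log2

def pattern_precision(seed_hits, total_hits):
    """Rote-extractor precision (Eq. 8.1): the fraction of a pattern's
    corpus matches whose (hook, target) pair is a known seed pair."""
    return seed_hits / total_hits

def rlogf(seed_hits, total_hits):
    """Basilisk-style RlogF metric (Eq. 8.2): precision multiplied
    by log2 of the pattern's seed frequency."""
    return (seed_hits / total_hits) * log2(seed_hits)

# Hypothetical counts for a pattern like "worked as a <occupation>":
# it matched 20 corpus instances, 18 of them known seed pairs.
p = pattern_precision(18, 20)
r = rlogf(18, 20)
print(round(p, 2), round(r, 2))  # 0.9 3.75
```

Note how RlogF prefers a pattern with many seed matches over an equally precise but rarer one, which is exactly the role of the log term.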

Pantel et al. (2004) proposed an edit-distance-based approach to learn lexico-POS patterns for is-a and part-of relations. Pantel and Pennacchiotti (2006) use pattern-based approaches to extract is-a, part-of and other semantic relations using a mutual-information-based reliability measure to rank the patterns. Mann and Yarowsky (2005) formalized the definition of the basic pattern-based approach, calling it a rote classifier, and showed how bag-of-words contexts used with phrase conditional, naive Bayes and conditional random field models, along with negative examples automatically annotated using spurious targets, can further aid extraction performance.

8.2 Literature for Modeling Latent Relationships

While latent relationships are often not dealt with in the standard information extraction literature, there have been many sociolinguistic studies on identifying properties of a population sample based on their discourse usage. In particular, biographic facts such as age, gender and education level have received much attention in this literature. Section 8.2.1 describes the salient sociolinguistic approaches and Section 8.2.2 describes some of the computational approaches based on the features identified in the sociolinguistics literature.


8.2.1 Sociolinguistic Studies

Much attention has been devoted in the sociolinguistics literature to detection of

age, gender, social class, religion, education, etc. from conversational discourse and

monologues starting as early as the 1950s, making use of morphological features such

as the choice between the -ing and the -in variants of the present participle ending

of the verb (Fischer, 1958), and phonological features such as the pronunciation of

the “r” sound in words such as far, four, cards, etc. (Labov, 1966).

Gender differences have been one of the primary areas of sociolinguistic research.

Coates (1998) and Eckert and McConnell-Ginet (2003) provide a detailed overview of

the sociolinguistic approaches for studying gender differences. Some of the important

features used in these studies, such as the use of pronouns, passive constructions, and specific n-grams like "well", "yeah" and "I mean", are outlined in Section 10.6.

8.2.2 Computational Approaches

There has also been some work on developing computational models based on linguistically interesting clues suggested by the sociolinguistic literature for detecting gender in formal written texts (Koppel et al., 2002; Herring and Paolillo, 2006), but it has primarily relied on small numbers of manually selected features and small collections of formal written texts. Koppel et al. (2002), for example, used a manually selected set of 1,081 features consisting of 405 function words, 76 part-of-speech tags, the 100 most frequent part-of-speech bigrams and the 500 most frequent part-of-speech trigrams. A variant of the exponential gradient algorithm (Kivinen and Warmuth, 1997) was used for training.

Their paper reports an accuracy of approximately 80% on a set of 566 gender-labeled documents from the British National Corpus. However, later in this thesis it is shown that an online system (Gender Genie) based on the algorithm described in this paper performs poorly on conversational speech transcripts.

Another relevant line of work has been in the blog domain, using bag-of-words feature sets to discriminate age and gender (Schler et al., 2006; Burger and Henderson, 2006; Nowson and Oberlander, 2006). Schler et al. (2006) used both style-based and content-based features for gender classification of a blog entry. Style-based features consisted of selected parts of speech, function words and blog-specific features such as hyperlinks. Content-based features are the n-grams used in the body of the blog entry, and they show that words such as "linux, gaming, google, economic" are correlated with "male" gender while words such as "shopping, cute, mom, boyfriend" are correlated with "female" gender. They also report that the top content-based features suggest a pattern of more "personal" writing by female bloggers than male bloggers. They train a Multi-Class Real Winnow (MCRW) model and report an accuracy of 80.1% using all the features. Another study on blog data by Nowson and Oberlander (2006) also shows that learning n-gram contexts performs well in predicting the gender of the blogger. In addition to word-based features, Burger and Henderson (2006) explore


a wide range of non-lexical features for blogger age prediction such as mean post

length, mean number of non-image links per post, language/script in which the blog

is written, location and time of the blog entry, number of blogger friends, etc. On the

applications side, Liu and Mihalcea (2007) study gender preferences for weblogs

based on color, size, time, socialness, affect and cravings. They show how learning

such preferences can be used for improving user interfaces for weblogs and for filtering

gender-specific news data.

While the approaches described above have shed some light on the important features

for extracting latent biographic relationships, most of them have been small-scale

studies. Boulis and Ostendorf (2005) presented the first large-scale study of gender modeling in conversational speech transcripts, using the Fisher corpus (Cieri et al., 2004). Chapter 10 describes several novel contributions beyond this state-of-the-art approach to gender classification. Section 10.2 of Chapter 10 describes the Boulis and Ostendorf (2005) model and other relevant gender modeling approaches in more detail.


Chapter 9

Structural, Transitive and Correlational Models for Biographic Fact Extraction

Summary

This chapter presents novel approaches to biographic fact extraction that model

structural, transitive and latent properties of biographical data. The ensemble of these

proposed models substantially outperforms standard pattern-based biographic fact

extraction methods and performance is further improved by modeling inter-attribute

correlations and distributions over functions of attributes, achieving an average ex-

traction accuracy of 80% over seven types of biographic attributes.


Components of this chapter were originally published by the author of this disserta-

tion in the forum referenced below1.

9.1 Introduction

Extracting biographic facts such as “Birthdate”, “Occupation”, “Nationality”,

etc. is a critical step for advancing the state of the art in information processing

and retrieval. An important aspect of web search is to be able to narrow down

search results by distinguishing among people with the same name, leading to multiple efforts focusing on web person name disambiguation in the literature (Mann and Yarowsky, 2003; Artiles et al., 2007; Cucerzan, 2007). While biographic facts

are certainly useful for disambiguating person names, they also allow for automatic

extraction of encyclopedic knowledge that has been limited to manual efforts such as

Britannica, Wikipedia, etc. Such encyclopedic knowledge can advance vertical search

engines such as http://www.spock.com that are focused on people searches where one

can get an enhanced search interface for searching by various biographic attributes.

Biographic facts are also useful for powerful query mechanisms such as finding what

attributes are common between two people (Auer and Lehmann, 2007).

While there is a large quantity of biographic text available online, there are only a

1Reference: N. Garera and D. Yarowsky. Structural, Transitive and Latent Models for Biographic Fact Extraction. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2009.


Figure 9.1: Goal: extracting attribute-value biographic fact pairs from biographic

free-text


few biographic fact databases available2, and most of them have been created manu-

ally, are incomplete and are available primarily in English.

This chapter presents multiple novel approaches for automatically extracting bio-

graphic facts such as “Birthdate”, “Occupation”, “Nationality”, and “Religion”, mak-

ing use of diverse sources of information present in biographies. In particular, the

following 6 distinct original approaches to this task are evaluated with large collective

empirical gains:

1. An improvement to the Ravichandran and Hovy (2002) algorithm based on

Partially Untethered Contextual Pattern Models

2. Learning a position-based model using absolute and relative positions and se-

quential order of hypotheses that satisfy the domain model. For example,

“Deathdate” very often appears after “Birthdate” in a biography.

3. Using transitive models over attributes via co-occurring entities. For example, other people mentioned in a person's biography page tend to have similar attributes, such as occupation (see Figure 9.4).

4. Using latent wide-document-context models to detect attributes that may not be mentioned directly in the article (e.g. the words "song, hits, album, recorded, ..." all collectively indicate the occupation of singer or musician in the article).

5. Using inter-attribute correlations for filtering unlikely biographic attribute com-

2E.g.: http://www.nndb.com, http://www.biography.com, Infoboxes in Wikipedia.


binations. For example, a tuple consisting of < “Nationality” = India, “Reli-

gion” = Hindu > has a higher probability than a tuple consisting of < “Na-

tionality” = France, “Religion” = Hindu >.

6. Learning distributions over functions of attributes, for example, using an age

distribution to filter tuples containing improbable <deathyear>-<birthyear>

lifespan values.

The rest of the chapter describes and evaluates techniques for exploiting each of the above classes of information.

9.2 Related Work

The literature for biography extraction falls into two major classes. The first

one deals with identifying and extracting biographical sentences and treats the problem as a summarization task (Cowie et al., 2000; Schiffman et al., 2001; Zhou et al., 2004). The second and more closely related class deals with extracting specific

facts such as “birthplace”, “occupation”, etc. For this task, the primary theme of

work in the literature has been to treat the task as a general semantic-class learning

problem where one starts with a few seeds of the semantic relationship of interest

and learns contextual patterns such as “<NAME> was born in <Birthplace>” or

“<NAME> (born <Birthdate>)” (Hearst, 1992; Riloff, 1996; Agichtein and Gra-

vano, 2000; Ravichandran and Hovy, 2002; Mann and Yarowsky, 2003; Mann and


Yarowsky, 2005; Alfonseca et al., 2006; Pasca et al., 2006). There has also been some

work on extracting biographic facts directly from Wikipedia pages. Culotta et al.

(2006) deal with learning contextual patterns for extracting family relationships from

Wikipedia. Ruiz-Casado et al. (2006) learn contextual patterns for biographic facts

and apply them to Wikipedia pages.

While the pattern-learning approach extends well for a few biography classes, some of

the biographic facts like “Gender” and “Religion” do not have consistent contextual

patterns, and only a few of the explicit biographic attributes such as “Birthdate”,

“Birthplace” and “Occupation” have been shown to work well in the pattern-learning

framework.

Secondly, there is a general lack of work that attempts to utilize the typical informa-

tion sequencing within biographic texts for fact extraction, and this chapter illustrates

how the information structure of biographies can be used to improve upon pattern-based models. Furthermore, additional novel models of attribute correlation and age

distribution that aid the extraction process are also presented.

9.3 Approach

First, the standard pattern-based approach is implemented for extracting bio-

graphic facts from the raw prose in Wikipedia people pages. Then, an array of novel

techniques is presented exploiting different classes of information including partially-


tethered contextual patterns, relative attribute position and sequence, transitive at-

tributes of co-occurring entities, broad-context topical profiles, inter-attribute corre-

lations and likely human age distributions.

9.4 Contextual Pattern-Based Model

A standard model for extracting biographic facts is to learn templatic contextual patterns such as <NAME> "was born in" <Birthplace>. Such templatic patterns can be learned using seed examples of the attribute in question, and there is a plethora of work in the seed-based bootstrapping literature which addresses this problem (Ravichandran and Hovy, 2002; Mann and Yarowsky, 2005; Alfonseca et al., 2006; Pasca et al., 2006).

Thus, as a baseline, the standard Ravichandran and Hovy (2002) pattern learning

model was implemented using 100 seed3 examples from an online biographic database

called NNDB (http://www.nndb.com) for each of the biographic attributes: “Birth-

date”, “Birthplace”, “Deathdate”, “Gender”, “Nationality”, “Occupation” and “Re-

ligion”. Given the seed pairs, patterns for each attribute were learned by searching

for seed <Name,Attribute Value> pairs in the Wikipedia page and extracting the left,

middle and right contexts as various contextual patterns. A noisy model of corefer-

ence resolution was implemented by resolving any gender-correct pronoun used in the

3The seed examples were chosen randomly, with a bias against duplicate attribute values to increase training diversity.


Partially Untethered Patterns      Precision
Birthplace
<p> born in <birthplace>           1.0
living in <birthplace>             1.0
grew up in <birthplace>            1.0
family in <birthplace>             1.0
( born in <birthplace>             1.0
...
to return to <birthplace>          0.80
returned to <birthplace>           0.79
...
Birthdate
was born on <birthdate>            1.0
<p> born on <birthdate>            1.0
<birthdate> -                      0.94
) ( <birthdate>                    0.83
...
<birthdate> , is                   0.56
born <birthdate>                   0.41
...
Deathdate
- <deathdate>                      1.0
<deathdate> ) was an               0.91
<deathdate> ) was a                0.89
<deathdate> ) was                  0.62
<deathdate> ) ,                    0.11
; <deathdate>                      0.056
on <deathdate>                     0.05
<deathdate> )                      0.04

Fully Tethered Patterns                          Precision
Birthplace
</p> <p> <name> was born in <birthplace>         1.0
<p> " <name> " ( born <DATE> in <birthplace>     1.0
<p> "<name>" ( born in <birthplace>              1.0
<name> was born on <DATE> in <birthplace>        1.0
<name> was born and raised in <birthplace>       1.0
<name> returned to <birthplace>                  1.0
<birthplace> where <name> was                    0.75
<name> left <birthplace>                         0.67
...
Birthdate
</p> <p> <name> was born on <born>               1.0
<p> " <name> " ( born <born>                     1.0
<name> " ( <born> -                              1.0
" <name> " ( born <DATE> in <birthplace>         1.0
Deathdate
" <name> " ( <DATE> - <died> )                   1.0

Table 9.1: A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns were learned, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6.


Partially Untethered Patterns      Precision
Occupation
reputation as a <occupation>       1.0
<occupation> oscar                 1.0
<occupation> in a musical          1.0
<occupation> and author            1.0
<occupation> and author .          1.0
for best <occupation>              1.0
best supporting <occupation>       1.0
...
a French <occupation>              0.71
singer and <occupation>            0.70
young <occupation>                 0.53
as an <occupation>                 0.47
...
<occupation> , and                 0.06
Nationality
) was an <nationality>             1.0
) was a <nationality>              1.0
native <nationality>               1.0
<nationality> , where he           1.0
...
to become <nationality>            0.8
to return to <nationality>         0.78
founder of <nationality>           0.78
one of <nationality>               0.75
...
Religion
<religion> family in               1.0
<religion> faith .                 1.0
raised as a <religion>             1.0
<religion> faith                   0.86
as an <religion>                   0.58
<religion> family                  0.48
<religion> , but                   0.26
as a <religion>                    0.23
, a <religion>                     0.11
...

Fully Tethered Patterns                                        Precision
Occupation
an <occupation> , <name>                                       1.0
" <name> : <occupation>                                        0.67
<occupation> . <name> was                                      0.4
<name> : <occupation>                                          0.4
and <occupation> . <name>                                      0.1
<occupation> , <name>                                          0.10
<occupation> . <name>                                          0.06
Nationality
</p> <p> <name> was the only <nationality>                     1.0
<p> " <name> " ( <nationality>                                 1.0
, <name> returned to <nationality>                             1.0
term <name> dominated <nationality> politics                   1.0
term of any <nationality> prime minister , and during his
second term <name>...                                          1.0
..
<name> left <nationality>                                      0.50
of <nationality> . <name>                                      0.43
<nationality> . <name> was                                     0.33
...
Religion
<p> <name> was born to a <religion> family                     1.0
<name> was raised as a <religion> , but                        1.0

Table 9.2: A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns were learned, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6.


Wikipedia page to the title person name of the article4. The probability of a rela-

tionship r(Attribute Name), given the surrounding context “A1 p A2 q A3”, where p

and q are <NAME> and <Attrib Val> respectively, is given using the rote extractor

model probability as in (Ravichandran and Hovy, 2002; Mann and Yarowsky, 2005):

P(r(p, q) \mid A_1\, p\, A_2\, q\, A_3) = \frac{\sum_{x,y \in r} c(A_1 x A_2 y A_3)}{\sum_{x,z} c(A_1 x A_2 z A_3)} \qquad (9.1)

Each extracted attribute value q in the given context can thus be ranked according to the above probability. This approach for extracting values was tested for each of the above attributes on a test set of 100 held-out names from NNDB, and Precision, Pseudo-recall and F-score are reported for each attribute, computed in the standard way as follows for, say, the attribute "Birthplace" (bplace):

\text{Precision}_{bplace} = \frac{\#\ \text{people with bplace correctly extracted}}{\#\ \text{people with bplace extracted}} \qquad (9.2)

\text{Pseudo-rec}_{bplace} = \frac{\#\ \text{people with bplace correctly extracted}}{\#\ \text{people with bplace in test set}} \qquad (9.3)

\text{F-score}_{bplace} = \frac{2 \cdot \text{Precision}_{bplace} \cdot \text{Pseudo-rec}_{bplace}}{\text{Precision}_{bplace} + \text{Pseudo-rec}_{bplace}} \qquad (9.4)

Since the true values of each attribute are obtained from a cleaner and normalized

person-database (NNDB), not all the attribute values may be present in the Wikipedia

article for a given name. Thus, the evaluation results also report accuracy on the

subset of names for which the value of a given attribute is also explicitly stated in

the article. This is denoted as:

4Gender is also extracted automatically as a biographic attribute.


\text{Acc}_{\text{truth pres}} = \frac{\#\ \text{people with bplace correctly extracted}}{\#\ \text{people with true bplace stated in article}} \qquad (9.5)

A domain model was further applied for each attribute to filter noisy targets extracted

from lexical patterns. The domain models of attributes include lists of acceptable val-

ues (such as lists of places, occupations and religions) and structural constraints such

as possible date formats for “Birthdate” and “Deathdate”. The row with subscript

“RH02” in Table 9.6 shows the performance of this Ravichandran and Hovy (2002)

model with additional attribute domain modeling for each attribute, and Table 9.5

shows the average performance across all attributes.
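The seed-driven pattern acquisition step of this baseline can be sketched roughly as follows. The article text, names and the fixed-width middle-context window below are invented for illustration; the actual system also extracts left and right contexts and applies coreference and domain-model filtering as described above.

```python
import re
from collections import Counter

def learn_patterns(articles, seeds):
    """Collect middle-context patterns around seed (name, value) pairs,
    in the spirit of the Ravichandran and Hovy (2002) baseline: the
    tokens between hook and target become candidate patterns, to be
    ranked later by the rote-extractor probability (Eq. 9.1)."""
    patterns = Counter()
    for name, value in seeds:
        for text in articles.get(name, []):
            # Hypothetical 60-character cap on the intervening context.
            for m in re.finditer(re.escape(name) + r'(.{0,60}?)' + re.escape(value), text):
                middle = m.group(1).strip()
                patterns['<NAME> ' + middle + ' <VALUE>'] += 1
    return patterns

# Toy illustration with invented article text:
articles = {'Ada Lovelace': ['Ada Lovelace was born in London in 1815.']}
seeds = [('Ada Lovelace', 'London')]
print(learn_patterns(articles, seeds).most_common(1))
# [('<NAME> was born in <VALUE>', 1)]
```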

9.5 Partially Untethered Templatic Contextual Patterns

The pattern-learning literature for fact extraction often consists of patterns with

a “hook” and “target” (Mann and Yarowsky, 2005). For example, in the pattern

“<Name> was born in <Birthplace>”, “<NAME>” is the hook and “<Birthplace>”

is the target. The disadvantage of this approach is that the intervening dually-

tethered patterns can be quite long and highly variable, such as “<NAME> was

highly influential in his role as <Occupation>”. This problem was overcome by

modeling partially untethered variable-length ngram patterns adjacent to only the

158

Page 187: MULTILINGUAL ACQUISITION OF STRUCTURED INFORMATION …ngarera/publications/thesis_ngarera_2009.pdf · 2010-01-30 · Furthermore, the traditional seed-based learning framework fails

target, with the only constraint being that the hook entity appear somewhere in the

sentence. This constraint is particularly viable in biographic text, which tends to fo-

cus on the properties of a single individual. Examples of these new contextual ngram

features include "his role as <Occupation>" and "role as <Occupation>". The pattern probability model here is essentially the same as in Ravichandran and Hovy (2002); only the pattern representation is changed. The rows with subscript "RH02imp"

in Tables 9.6 and 9.5 show performance gains using this improved templatic-pattern-

based model, yielding an absolute 21% gain in accuracy.
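Generating these partially untethered patterns can be sketched as follows, assuming tokenized sentences and a hypothetical maximum n-gram length; the check that the hook entity appears elsewhere in the sentence is omitted here for brevity.

```python
def untethered_patterns(tokens, target_idx, max_len=4):
    """Generate partially untethered patterns: variable-length n-grams
    immediately adjacent to the target slot (the hook entity is only
    required to occur somewhere in the same sentence)."""
    pats = []
    for n in range(1, max_len + 1):
        left = tokens[max(0, target_idx - n):target_idx]
        if left:
            pats.append(' '.join(left) + ' <TARGET>')
        right = tokens[target_idx + 1:target_idx + 1 + n]
        if right:
            pats.append('<TARGET> ' + ' '.join(right))
    return pats

# "... highly influential in his role as senator", target = "senator"
toks = 'he was highly influential in his role as senator'.split()
print(untethered_patterns(toks, toks.index('senator'), max_len=3))
# ['as <TARGET>', 'role as <TARGET>', 'his role as <TARGET>']
```

This reproduces n-gram features of the kind quoted above ("his role as <Occupation>") without requiring a long dually-tethered pattern.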

Attribute     Best rank    P(rank) in seed set
Birthplace    1            0.61
Birthdate     1            0.98
Deathdate     2            0.58
Gender        1            1.0
Occupation    1            0.70
Nationality   1            0.83
Religion      1            0.80

Table 9.3: Majority rank of the correct attribute value in the Wikipedia pages of the seed names, used for learning relative ordering among attributes satisfying the domain model.

9.6 Document-Position-Based Model

One of the properties of biographic genres is that primary biographic attributes 5

tend to appear in characteristic positions, often toward the beginning of the article.

5Hyperlinked phrases were used as potential values for all attributes except "Gender". For "Gender", pronouns were used as potential values, ranked according to their distance from the beginning of the page.


Figure 9.2: Distribution of the observed document mentions of Deathdate, Nationality

and Religion.

Thus, the absolute position (in percentage) can be modeled explicitly using a Gaussian

parametric model as follows for choosing the best candidate value v∗ for a given

attribute A:

v^* = \arg\max_{v \in \text{domain}(A)} f(posn_v \mid A) \qquad (9.6)

where the density f(posn_v \mid A) is given as

f(posn_v \mid A) = \mathcal{N}(posn_v;\, \mu_A, \sigma_A^2) = \frac{1}{\sigma_A \sqrt{2\pi}}\, e^{-(posn_v - \mu_A)^2 / 2\sigma_A^2} \qquad (9.7)

In the above equation, posn_v is the absolute position ratio (position/length) and \mu_A, \sigma_A^2 are the sample mean and variance based on the sample of correct position

ratios of attribute values in biographies with attribute A. Figure 9.2, for example,

shows the positional distribution of the seed attribute values for deathdate, nation-

ality and religion in Wikipedia articles, fit to a Gaussian distribution. Combining


this empirically derived position model with a domain model6 of acceptable attribute

values is effective enough to serve as a stand-alone model.
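The position model of Equations 9.6 and 9.7 can be sketched as follows. The seed position ratios and the candidate dates below are hypothetical, and the real system would of course first filter candidates through the attribute domain model.

```python
from math import sqrt, pi, exp

def fit_gaussian(position_ratios):
    """Fit mu_A and sigma_A^2 (Eq. 9.7) from the position ratios
    (position/length) of correct seed attribute values."""
    n = len(position_ratios)
    mu = sum(position_ratios) / n
    var = sum((x - mu) ** 2 for x in position_ratios) / n
    return mu, var

def density(x, mu, var):
    """Gaussian density f(posn_v | A) from Eq. 9.7."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Hypothetical seed position ratios for "Birthdate" (near article start):
mu, var = fit_gaussian([0.02, 0.03, 0.05, 0.04, 0.06])

# Rank two candidate dates by position (Eq. 9.6): the one whose
# position ratio is more typical for the attribute wins.
candidates = {'1815': 0.03, '1852': 0.40}
best = max(candidates, key=lambda v: density(candidates[v], mu, var))
print(best)  # 1815
```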

9.6.1 Learning Relative Ordering in the Position-Based Model

In practice, for attributes such as birthdate, the first text pattern satisfying the

domain model is often the correct answer for biographical articles. Deathdate also

tends to occur near the beginning of the article, but almost always some point after

the birthdate. This motivates a second, sequence-based position model based on the

rank of the attribute values among other values in the domain of the attribute, as

follows:

v^* = \arg\max_{v \in \text{domain}(A)} P(rank_v \mid A) \qquad (9.8)

where P(rank_v \mid A) is the fraction of biographies having attribute A with the correct value occurring at rank rank_v, where rank is measured according to the relative order in which the values belonging to the attribute domain occur from the beginning of the

article. The seed set was used to learn the relative positions between attributes, that

is, in the Wikipedia pages of seed names what is the rank of the correct attribute.

Table 9.3 shows the most frequent rank of the correct attribute value and Figure 9.3

6The domain model is the same as used in Section 9.4 and remains constant across all the models developed in this chapter.


Figure 9.4: Illustration of modeling “occupation” and “nationality” transitively via

consensus from attributes of neighboring names

shows the distribution of the correct ranks for a sample of attributes. It can be seen

that 61% of the time the first location mentioned in a biography is the individual's

birthplace, while 58% of the time the 2nd date in the article is the deathdate. Thus,

“Deathdate” often appears as the second date in a Wikipedia page as expected.

These empirical distributions for the correct rank provide a direct vehicle for scoring

hypotheses, and the rows with "rel. posn" as the subscript in Table 9.6 show the

improvement in performance using the learned relative ordering. Averaging across

different attributes, Table 9.5 shows an absolute 11% average gain in accuracy of the

position-sequence-based models relative to the improved Ravichandran and Hovy-

based results achieved here.
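The rank-based scoring of Equation 9.8 can be sketched as follows. The seed ranks and the document's candidate dates are hypothetical examples, not data from the thesis.

```python
from collections import Counter

def rank_distribution(seed_ranks):
    """Empirical P(rank | A) (Eq. 9.8): the fraction of seed biographies
    whose correct attribute value was the k-th in-domain mention."""
    counts = Counter(seed_ranks)
    total = len(seed_ranks)
    return {r: c / total for r, c in counts.items()}

def best_by_rank(candidates, dist):
    """Pick the candidate (listed in document order) whose rank has
    the highest empirical probability."""
    return max(enumerate(candidates, start=1),
               key=lambda rc: dist.get(rc[0], 0.0))[1]

# Hypothetical seed ranks for "Deathdate": the 2nd date is usually correct.
dist = rank_distribution([2, 2, 1, 2, 3, 2])
dates_in_doc = ['22 September 1791', '25 August 1867', '17 March 1853']
print(best_by_rank(dates_in_doc, dist))  # 25 August 1867
```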


Figure 9.3: Empirical distribution of the relative position of the correct (seed) answers

among all text phrases satisfying the domain model for “birthplace” and “death date”.


9.7 Implicit Models

Some of the biographic attributes such as “Nationality”, “Occupation” and “Re-

ligion” can be extracted successfully even when the answer is not directly mentioned

in the biographic article. Two such models are presented in the following sections:

9.7.1 Extracting Attributes Transitively using Neighboring Person-Names

Attributes such as “Occupation” are transitive in nature, that is, the people names

appearing close to the target name will tend to have the same occupation as the target

name. Based on this intuition, a transitive model was implemented that predicts

occupation based on consensus voting via the extracted occupations of neighboring

names7 as follows:

v^* = \arg\max_{v \in \text{domain}(A)} P(v \mid A, S_{neighbors}) \qquad (9.9)

where

P(v \mid A, S_{neighbors}) = \frac{\#\ \text{neighboring names with attribute value } v}{\#\ \text{neighboring names in the article}}

7Only the neighboring names whose attribute value can be obtained from an encyclopedic database were used. Furthermore, since this work deals with biographic pages that talk about a single person, all other person-names mentioned in the article whose attributes are present in an encyclopedia were considered for consensus voting.


Occupation     Weight Vector

English
Physicist      <magnetic:32.7, electromagnetic:18.2, wire:18.2, electricity:17.7, optical:14.5, discovered:11.2>
Singer         <song:40, hits:30.5, hit:29.6, reggae:23.6, album:17.1, francis:15.2, music:13.8, recorded:13.6, ...>
Politician     <humphrey:367.4, soviet:97.4, votes:70.6, senate:64.7, democratic:57.2, kennedy:55.9, ...>
Painter        <mural:40.0, diego:14.7, paint:14.5, fresco:10.9, paintings:10.9, museum of modern art:8.83, ...>
Auto racing    <renault:76.3, championship:32.7, schumacher:32.7, race:30.4, pole:29.1, driver:28.1>

German
Physicist      <faraday:25.4, chemie:7.3, vorlesungsserie:7.2, 1846:5.8, entdeckt:4.5, rotation:3.6, ...>
Singer         <song:16.22, jamaikanischen:11.77, platz:7.3, hit:6.7, solounstler:4.5, album:4.1, widmet:4.0, ...>
Politician     <konservativen:26.5, wahlkreis:26.5, romano:21.8, stimmen:18.6, gewahlt:18.4, ...>
Painter        <rivera:32.7, malerin:7.6, wandgemalde:7.3, kunst:6.75, 1940:5.8, maler:5.1, auftrag:4.5, ...>
Auto racing    <team:29.4, mclaren:18.1, teamkollegen:18.1, sieg:11.7, meisterschaft:10.9, gegner:10.9, ...>

Table 9.4: Sample of occupation weight vectors in English and German learned using the latent-attribute-based model.

The set of neighboring names is represented as Sneighbors and the best candidate value

for an attribute A is chosen based on the fraction of neighboring names having

the same value for the respective attribute. Candidates are ranked according to this

probability and the row labeled “trans” in Table 9.6 shows that this model helps

in substantially improving the recall of “Occupation” and “Religion”, yielding a 7%

and 3% average improvement in F-measure respectively, on top of the position model

described in Section 9.6.
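This consensus-voting step can be sketched as follows (a minimal illustration; the function and variable names are mine, not the thesis implementation):

```python
from collections import Counter

def transitive_vote(neighbor_attrs):
    """Pick the candidate attribute value shared by the largest fraction of
    neighboring names; neighbor_attrs holds the database lookups for the
    names in S_neighbors (None when a name is absent from the database)."""
    counts = Counter(a for a in neighbor_attrs if a is not None)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    value, n = counts.most_common(1)[0]
    return value, n / total  # best value and its supporting fraction

# Occupations of names co-occurring in a physicist's biography:
value, score = transitive_vote(["Physicist", "Physicist", "Chemist", None])
print(value)  # 'Physicist', supported by 2/3 of the resolvable neighbors
```

Candidates would then be ranked by this supporting fraction, as described above.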

9.7.2 Latent-Attribute Models based on Document-Wide Context Profiles

In addition to modeling cross-entity attributes transitively, attributes such as "Occupation" can also be modeled successfully using a document-wide context or topic


model. For example, the distribution of words occurring in a biography of a politician would be different from that of a scientist. Thus, even if the occupation is not

explicitly mentioned in the article, one can infer it using a bag-of-words topic profile

learned from the seed examples.

Given a value v for an attribute A (for example, v = "Politician" and A = "Occupation"), a centroid weight vector is learned:

C_v = [w_{1,v}, w_{2,v}, ..., w_{n,v}]    (9.10)

where

w_{t,v} = (1/N) · tf_{t,v} · log( |A| / |t ∈ A| )    (9.11)

tf_{t,v} is the frequency of word t in the articles of people having attribute A = v,
|A| is the total number of values of attribute A,
|t ∈ A| is the total number of values of attribute A such that the articles of people having one of those values contain the term t, and
N is the total number of people in the seed set.
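A minimal sketch of computing these centroid weight vectors, assuming the seed biographies are available as tokenized word lists (data structures and names are illustrative):

```python
import math
from collections import Counter

def centroid_vectors(seed_articles):
    """seed_articles maps an attribute value v (e.g. 'Politician') to the
    tokenized biography articles of the seed people with A = v.
    Returns {v: {term: w_{t,v}}} following Eq. (9.11)."""
    num_values = len(seed_articles)                        # |A|
    N = sum(len(arts) for arts in seed_articles.values())  # people in seed set
    tf = {v: Counter(t for art in arts for t in art)
          for v, arts in seed_articles.items()}            # tf_{t,v}
    df = Counter()                                         # |t in A|
    for counts in tf.values():
        df.update(set(counts))
    return {v: {t: (1.0 / N) * f * math.log(num_values / df[t])
                for t, f in counts.items()}
            for v, counts in tf.items()}
```

Note that a term occurring in the articles of every attribute value gets weight zero, since log(|A|/|t ∈ A|) = log(1) = 0.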

Given a biography article of a test name and an attribute in question, a similar word weight vector C' = [w'_1, w'_2, ..., w'_n] is computed for the test name, and its cosine similarity to the centroid vector of each value of the given attribute is measured. Thus, the best value v* is chosen as:


v* = argmax_v  ( w'_1·w_{1,v} + w'_2·w_{2,v} + ... + w'_n·w_{n,v} ) / ( sqrt(w'_1^2 + w'_2^2 + ... + w'_n^2) · sqrt(w_{1,v}^2 + w_{2,v}^2 + ... + w_{n,v}^2) )    (9.12)
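The classification step in Eq. (9.12) is a standard cosine-similarity nearest-centroid rule over sparse vectors, which can be sketched as (names are illustrative):

```python
import math

def best_value(test_vec, centroids):
    """Return v* maximizing the cosine similarity (Eq. 9.12) between the
    test article's weight vector and each value's centroid.  Vectors are
    sparse dicts {term: weight}."""
    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return max(centroids, key=lambda v: cosine(test_vec, centroids[v]))

centroids = {"Politician": {"votes": 1.0, "senate": 1.0},
             "Physicist": {"magnetic": 1.0, "wire": 1.0}}
print(best_value({"votes": 0.5, "concert": 0.1}, centroids))  # Politician
```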

Tables 9.5 and 9.6 show performance using the latent document-wide-context model.

It can be seen that this model by itself gives the top performance on “Occupation”,

outperforming the best alternative model by 9% absolute accuracy, indicating the

usefulness of implicit attribute modeling via broad-context word frequencies.

This latent-attribute-based model can be further extended using the multilingual

nature of Wikipedia. The corresponding German pages of the training names were

used to model the German word distributions characterizing each seed occupation.

Table 9.6 shows that English attribute classification can be successful using only the

words in a parallel German article, and while underperforming the stand-alone direct English models, this additional information gives up to a 1% additional gain in combination. For some attributes, the performance of the latent-attribute-based model trained via cross-language evidence (denoted latentCL) is close to that of English, suggesting potential future work exploiting this multilingual dimension.

It is interesting to note that although neither the transitive model nor the latent wide-context model relies on the actual "Occupation" being explicitly mentioned in the article, they still outperform the explicit pattern-based and position-based models. This implicit modeling also helps in improving the recall of less often directly mentioned attributes such as a person's "Religion".


Model                                      Fscore   Acc (truth present)
Ravichandran and Hovy, 2002                 0.37          0.43
Improved RH02 Model                         0.54          0.64
Position-Based Model                        0.53          0.75
Combined (above 3 + trans + latent + CL)    0.59          0.78
Combined + Age Dist + Corr                  0.62          0.80
                                           (+24%)        (+37%)

Table 9.5: Average performance of the different models across all biographic attributes.

9.8 Model Combination

While the pattern-based, position-based, transitive and latent-attribute-based

models are all stand-alone models, they can complement each other in combination as

they provide relatively orthogonal sources of information. To combine these models,

a simple backoff-based combination is used for each attribute based on stand-alone

model performance, and the row with subscript “combined” in Tables 9.5 and 9.6

shows an average 14% absolute performance gain of the combined model relative to

the improved Ravichandran and Hovy (2002) model.
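A backoff combination of this kind can be sketched as follows (the model names and their ordering are illustrative, not the exact per-attribute orderings used in the experiments):

```python
def backoff_combine(predictions, model_order):
    """Return the first non-empty answer, trying models in the order of
    their stand-alone performance for the attribute in question;
    predictions maps model name -> extracted value or None."""
    for model in model_order:
        value = predictions.get(model)
        if value is not None:
            return value
    return None

# e.g. for "Occupation": prefer the latent model, then transitive, then position:
print(backoff_combine({"latent": None, "trans": "Painter", "posn": "Singer"},
                      ["latent", "trans", "posn"]))  # Painter
```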

9.9 Further Extensions: Reducing False Positives

Since the position-and-domain-based models will almost always posit an answer,

one of the problems is the high number of false positives yielded by these algorithms.

The following sections introduce further extensions using interesting properties of


Attribute (model)             Precision  Pseudo-recall  Fscore  Acc (truth present)

Birthdate (RH02)                0.86        0.38         0.53        0.88
Birthdate (RH02imp)             0.52        0.52         0.52        0.67
Birthdate (rel. posn)           0.42        0.40         0.41        0.93
Birthdate (combined)            0.58        0.58         0.58        0.95
Birthdate (comb+age dist)       0.63        0.60         0.61        1.00

Deathdate (RH02)                0.80        0.19         0.30        0.36
Deathdate (RH02imp)             0.50        0.49         0.49        0.59
Deathdate (rel. posn)           0.46        0.44         0.45        0.86
Deathdate (combined)            0.49        0.49         0.49        0.86
Deathdate (comb+age dist)       0.51        0.49         0.50        0.86

Birthplace (RH02)               0.42        0.38         0.40        0.42
Birthplace (RH02imp)            0.41        0.41         0.41        0.45
Birthplace (rel. posn)          0.47        0.41         0.44        0.48
Birthplace (combined)           0.44        0.44         0.44        0.48
Birthplace (combined+corr)      0.53        0.50         0.51        0.55

Occupation (RH02)               0.54        0.18         0.27        0.26
Occupation (RH02imp)            0.38        0.34         0.36        0.48
Occupation (rel. posn)          0.48        0.35         0.40        0.50
Occupation (trans)              0.49        0.46         0.47        0.50
Occupation (latent)             0.48        0.48         0.48        0.59
Occupation (latentCL)           0.48        0.48         0.48        0.54
Occupation (combined)           0.48        0.48         0.48        0.59

Nationality (RH02)              0.40        0.25         0.31        0.27
Nationality (RH02imp)           0.75        0.75         0.75        0.81
Nationality (rel. posn)         0.73        0.72         0.71        0.78
Nationality (trans)             0.51        0.48         0.49        0.49
Nationality (latent)            0.56        0.56         0.56        0.56
Nationality (latentCL)          0.55        0.48         0.51        0.48
Nationality (combined)          0.75        0.75         0.75        0.81
Nationality (comb+corr)         0.77        0.77         0.77        0.84

Gender (RH02)                   0.76        0.76         0.76        0.76
Gender (RH02imp)                0.99        0.99         0.99        0.99
Gender (rel. posn)              1.00        1.00         1.00        1.00
Gender (trans)                  0.79        0.75         0.77        0.75
Gender (latent)                 0.82        0.82         0.82        0.82
Gender (latentCL)               0.83        0.72         0.77        0.72
Gender (combined)               1.00        1.00         1.00        1.00

Religion (RH02)                 0.02        0.02         0.04        0.06
Religion (RH02imp)              0.55        0.18         0.27        0.45
Religion (rel. posn)            0.49        0.24         0.32        0.73
Religion (trans)                0.38        0.33         0.35        0.48
Religion (latent)               0.36        0.36         0.36        0.45
Religion (latentCL)             0.30        0.26         0.28        0.22
Religion (combined)             0.41        0.41         0.41        0.76
Religion (combined+corr)        0.44        0.44         0.44        0.79

Table 9.6: Performance comparison of all the models across several biographic attributes. Bolded accuracies indicate the top-performing model.


biographic attributes to reduce the effect of false positives.

9.9.1 Using Inter-Attribute Correlations

One of the ways to filter false positives is to remove empirically incompatible inter-attribute pairings. The motivation here is that the attributes are not

independent of each other when modeled for the same individual. For example,

P(Religion=Hindu | Nationality=India) is higher than P(Religion=Hindu | Nation-

ality=France) and similarly one can find positive and negative correlations among

other attribute pairings. For implementation, all possible 3-tuples of (“Nationality”,

“Birthplace”, “Religion”)8 were considered and searched on NNDB for the presence

of the tuple for any individual in the database (excluding the test data). As an ag-

gressive but effective filter, this model filters the tuples for which no name in NNDB

was found containing the candidate 3-tuples. The rows labeled "combined+corr" in Tables 9.5 and 9.6 show substantial performance gains from using inter-attribute correlations, such as the 7% absolute average gain for Birthplace over the Section 9.8 combined models, and a 3% absolute gain for Nationality and Religion.
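The filter reduces to a set-membership test against the reference database (the tuples shown are illustrative, not actual NNDB entries):

```python
def filter_by_correlation(candidates, observed_tuples):
    """Aggressively drop candidate (nationality, birthplace, religion)
    3-tuples that never co-occur for any individual in the reference
    database (e.g. tuples harvested from NNDB, excluding the test data)."""
    return [c for c in candidates if c in observed_tuples]

observed = {("India", "Mumbai", "Hindu"), ("France", "Paris", "Catholic")}
cands = [("India", "Mumbai", "Hindu"), ("France", "Paris", "Hindu")]
print(filter_by_correlation(cands, observed))  # keeps only the first tuple
```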

8 The test of joint presence among these three attributes was used since they are strongly correlated.


Figure 9.5: Age distribution of famous people on the web (from www.spock.com)

9.9.2 Using Age Distribution

Another way to filter out false positives is to consider distributions on meta-

attributes, for example: while age is not explicitly extracted, one can use the fact that

age is a function of two extracted attributes (<Deathyear>-<Birthyear>) and use

the age distribution to filter out false positives for <Birthdate> and <Deathdate>.

Based on the age distribution for famous people9 on the web shown in Figure 9.5, one

can bias against unusual candidate lifespans and filter out completely those outside

the range of 25-100, as most of the probability mass is concentrated in this range.

Rows with subscript "comb+age dist" in Table 9.6 show the performance gains from using this feature, yielding an average 5% absolute accuracy gain for Birthdate.
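The lifespan filter reduces to a range check on the age implied by the extracted dates (the 25-100 bounds follow the discussion above; the function name is illustrative):

```python
def plausible_lifespan(birth_year, death_year, lo=25, hi=100):
    """Reject (birthdate, deathdate) candidate pairs implying an age
    outside the 25-100 range, where the famous-person age distribution
    of Figure 9.5 concentrates most of its probability mass."""
    age = death_year - birth_year
    return lo <= age <= hi

print(plausible_lifespan(1867, 1934))  # True  (implied age 67)
print(plausible_lifespan(1900, 1910))  # False (implied age 10)
```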

9 Since all the seed and test examples were drawn from nndb.com, the age distribution of famous people on the web was used: http://blog.spock.com/2008/02/08/age-distribution-of-people-on-the-web/


9.10 Statistical Significance of Results

Using a binomial test of sample size 100 and the baseline accuracy of 43%

(Ravichandran and Hovy, 2002) in Table 9.5, any improvement in accuracy over 51%

is statistically significant with a p-value less than 0.05. With respect to the improved

baseline of 64% accuracy (Improved RH02 Model), any improvement in accuracy over

72% is statistically significant. For results at a per-attribute level presented in Table

9.6, all the accuracies obtained using the best model (reported in bold) developed in

this chapter are statistically significant with respect to the baseline model (RH02),

with a p-value less than 0.05 using the binomial test.
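The significance computation is a one-sided binomial tail, which can be sketched as:

```python
from math import comb

def binomial_p_value(successes, n, p0):
    """One-sided p-value: probability of observing at least `successes`
    correct answers out of n trials under a baseline accuracy p0."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes, n + 1))

# With n = 100 and the 43% baseline, an observed accuracy of 52% clears
# the 0.05 threshold, matching the "over 51%" cutoff stated above:
print(binomial_p_value(52, 100, 0.43) < 0.05)  # True
```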

9.11 Extracting factual relationships from noisy sources for a wider range of attributes

This section provides a foray into extracting factual relationships from non-

biographic genres where the target entity for extraction is not clear. More specifically,

the facts presented in the given document can belong to multiple entities, resulting

in more noisy extractions. As a case study, the attribute specifications from the Text Analysis Conference (TAC) 2009 Knowledge Base Population (KBP)10 task were utilized for extraction11.

10 http://apl.jhu.edu/~paulmac/kbp.html

The idea behind the TAC KBP task is to explore information extraction of entities

with reference to an external knowledge source. The main challenge here is to populate and expand an existing ontology of entities and their attributes with knowledge

extracted from free text. The slot filling system is given an ontology of entities where

each node contains different attribute-value pairs of the respective entity, and a document collection that may contain mentions of some of the entities in the ontology.

Using the facts present in the ontology as training data, the system should identify

correct mentions of the query entity in a large document collection and augment the

ontology with the facts extracted from these mentions.

Another challenge of this exercise was to test the robustness of current slot filling

approaches on 42 diverse attributes such as “causes of death”, “alternate names”,

etc.

Such a slot filling system has many different components such as document/sentence

selection based on the query entity, pattern-based and domain models for fact extraction, answer ranking, redundancy detection, linking nodes in the ontology, etc. This section will focus on the pattern-based component for fact extraction given a query-relevant sentence. The pattern-based approach used in the system is similar to that

used for biographic genres with the difference that the target entity is ambiguous and

hence partially untethered patterns are used for modeling the target value, as explained

11 Thanks to Mark Dredze, Tim Finin, Adam Gerber, James Mayfield, Paul McNamee, Christine Piatko and David Yarowsky for helping in developing different components of the Johns Hopkins University TAC KBP system.


in Section 9.5. Some examples of such partially untethered patterns on this data are

also shown in Tables 9.7, 9.8, 9.9 and 9.10.

9.11.1 Analysis of the pattern learning component for fact extraction

The diverse and ambiguous nature of the document collection led to the generation of a large number of noisy patterns. While the noisy nature of the patterns hurts the precision

of candidate facts extracted, the domain model component of the slot filling system is

used for filtering noisy candidates. However, some of the attributes such as “parents”

do not have a strong domain model component and it is essential to filter out noisy

or excessively broad patterns such as “of < A >”12.

9.11.2 Manually Filtering Patterns

One of the lessons learned during the TAC exercise was that pattern lists for

such diverse attribute sets are noisy and a manual filtering step can significantly

aid in identifying clean attribute-specific patterns. A sample of manually filtered

untethered patterns for a few attributes is shown in Tables 9.7, 9.8, 9.9 and 9.10.

Manually filtering patterns, however, was noted to be a time-consuming process. In

12 <A> denotes the attribute value, for example the name of the query entity's "parent".


Title:
by the british <A> | american landscape <A> | was a guest <A> | late british <A> | american screenwriter and <A> | modification ; japanese <A> | ) , Chinese <A> | indian model and <A> | , a scottish <A> | French director and <A> | is american <A> | any Russian <A> | peters , american <A> | was american <A> | is a leading <A> | is an awesome <A> | was a pakistani <A> | influential Russian <A> | young mexican <A> | 84 , american <A>

Spouse:
former husband , <A> | wife of tsar <A> | married actress <A> | widow , <A> | wife of <A> | he married <A> | ) ; married <A> | <A> spouse of | married to actress <A> | husband of <A> | hubby <A> | <A> says his wife | 's wife , <A> | she and husband <A> | <A> married | he and wife <A> | <A> , widow of | her marriage to <A> | was married to <A> | before marrying <A>

Age:
<A> -year-old son . | <A> -year-old man | <A> -year-old girl | <A> years old . | <A> years old ) | <A> -year-old son , | <A> , died | <A> -year-old woman | <A> -year-old son | <A> -year-old | <A> ) died at | <A> years old when | <A> , was born | <A> , was arrested | <A> , was married | he was <A> | <A> , was named | <A> , was appointed | <A> -year-old , who | died at <A>

Alternate Name:
maiden name , <A> | stage name " <A> | been known as <A> | better known as <A> | formerly known as <A> | known as <A> | her stage name <A> | popularly known as <A> | born as <A> | well-known as <A> | professionally known as <A> | known as rapper <A> | is best-known as <A> | otherwise known as <A> | reborn as <A> | known as mrs. <A> | stage name <A> | forever known as <A> | universally known as <A> | once known as <A>

Table 9.7: Sample of untethered patterns that were annotated as high quality by human annotators.


Children:
<A> , son of | , whose son <A> | son , king <A> | a daughter of <A> | own son , <A> | 's son , <A> | daughters : <A> | and her son <A> | infant son , <A> | daughter : <A> | father of singer <A> | his daughter <A> | his oldest son <A> | marrying his daughter <A> | one daughter ; <A> | for her son <A> | a daughter , <A> | <A> 's father , | of the son <A> | and successor , <A>

Other family:
<A> grandfather | her cousin , <A> | a grandson of <A> | <A> , grandson of | nephew of <A> | grandchildren , <A> | niece <A> | cousin of <A> | his grandson <A> | , nephew of <A> | <A> great grandchildren | cousins <A> | <A> cousin | <A> grandson of | <A> a niece | his uncle , <A> | aunt <A> | his uncle <A> | grandparents <A> | grandmother of <A>

Table 9.8: Sample of untethered patterns that were annotated as high quality by human annotators.


Cause of Death:
from complications of <A> | <A> victim ' s | dies of <A> | in 1995 of <A> | , dies of <A> | , died of <A> | her death from <A> | died friday of <A> | <A> by hanging . | he died of <A> | he suffered a <A> | <A> -related complications | attack suffered <A> | was diagnosed with <A> | she died of <A> | , died from <A> | complications from <A> | commit mass <A> | after suffering a <A> | of death was <A>

Charges:
<A> and sentenced to | <A> trial , | <A> conviction | <A> charges . | <A> and sentenced | convicted of <A> | <A> conviction , | <A> conviction , | <A> case | <A> trial | <A> investigation | <A> case , | <A> charges | tried for <A>

Date of Birth:
) ( b. <A> | : born <A> | b. <A> | , born in <A> | was born <A> | <A> - d. | ( b. <A> | was born on <A> | d. <A> | <A> and died | ( born <A> | born on <A> | born in <A> | <A> ; died | was born in <A> | <A> ) is an | b : <A> | born : <A> | , b. <A> | data : born <A>

Date of Death:
died <A> | death date = <A> | ( d. <A> | death in <A> | died in <A> | <A> deathsad : like | was assassinated on <A> | killed in <A> | having died in <A> | <A> - death of | , and died <A> | died on <A> | assassinated on <A> | died c. <A> | ( died <A> | , died <A> | who died <A> | he died <A> | passed away in <A> | death date <A>

Table 9.9: Sample of untethered patterns that were annotated as high quality by human annotators.


Place of Birth:
his birthplace , <A> | 's birthplace in <A> | man born in <A> | : born <A> | birthplace = <A> | 's born <A> | producer. born in <A> | <A> -born " | born <A> | is born at <A> | birth country : <A> | <A> , born | <A> born striker | 1981 birth place : <A> | <A> -born former | although born in <A> | birth : <A> | composer born in <A> | <A> .born | his birthday <A>

Place of Death:
passed away in <A> | being killed in <A> | just died in <A> | died in a <A> | and killed in <A> | died in his <A> | deathplace = <A> | his death in <A> | <A> till his death | death at <A> | death at her <A> | and died at <A> | died : in <A> | and murdered in <A> | <A> death camp | and death at <A> | killed in <A> | died in <A> | soldiers to <A> | ' dies in <A>

Schools attended:
mater , the <A> | <A> , qb | economics at <A> | doctoral degree from <A> | a doctorate from <A> | he attended <A> | economics at the <A> | physics from <A> | qb , <A> | a former <A> | played at <A> | graduate of the <A> | <A> university years | <A> football | year at <A> | <A> quarterback | campus of the <A> | attending <A> | <A> , hb | college career at <A>

Religion:
<A> martyr in scotland | hutchison of the <A> | member of the <A> | <A> and muslim communities | <A> church when | a fundamentalist <A> | <A> church 's top | <A> vs. sunni | the walnut hill <A> | politicization of <A> | <A> shrines in | <A> church south of | <A> church in memphis | <A> faith | of the methodist <A> | <A> churchrev . | <A> denominations and | <A> church today | <A> fellowship , which | <A> and muslim leaders

Table 9.10: Sample of untethered patterns that were annotated as high quality by human annotators.


the next sections, automated approaches for filtering noisy patterns are discussed.

9.11.3 Filtering Noisy Patterns Automatically

Several corpus statistics can be used for automatically filtering the patterns. For

each of the patterns the following measures were recorded during the pattern learning

step:

1. Token count: This is the total number of correct values extracted for a given

attribute.

2. Type count: This is the number of unique correct values extracted for a given attribute. It was used to prevent one overwhelming mention and its attribute value from artificially inflating the pattern score.

3. Slot count: This is the number of different attributes for which the pattern applies. For example, patterns such as "of <A>" will extract many different attributes, while patterns such as "son of <A>" will apply mostly to the "parent" attribute. The idea here is similar to document frequency: the lower its value, the better.

4. TF.IDF1 = Token count · log(N / Slot count), where N is the number of attributes (or slot types).

5. TF.IDF2 = Type count · log(N / Slot count), where N is the number of attributes (or slot types).
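The TF.IDF2 score can be sketched as follows (argument names are illustrative; N = 42 matches the number of TAC KBP attributes mentioned above):

```python
import math

def tfidf2(type_count, slot_count, num_slots):
    """TF.IDF2 = (unique correct values extracted for this attribute)
    * log(N / slot count), where slot count is the number of distinct
    attributes the pattern fires on and N is the number of attributes."""
    return type_count * math.log(num_slots / slot_count)

# A broad pattern like "of <A>" fires on many slots and scores near zero,
# while an attribute-specific pattern like "son of <A>" scores high:
print(tfidf2(type_count=50, slot_count=40, num_slots=42))  # small
print(tfidf2(type_count=50, slot_count=1, num_slots=42))   # large
```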

9.11.4 Evaluating automatic pattern filtering measures

In order to evaluate the automatic pattern filtering measures, the manually selected patterns were used as the gold truth. Each of the pattern scores provides an n-best ranking of the patterns, and the accuracy for each of the pattern lists using this ranking is computed. For a given attribute and the top n patterns for that attribute, the accuracy is computed as the fraction of those n patterns that are present in the manually selected list. Table 9.11 reports the average accuracy over all the attributes.

           Token count  Type count  Slot count  TFIDF1  TFIDF2
Top 5         0.385       0.149       0.077     0.133   0.467
Top 10        0.344       0.174       0.079     0.141   0.421
Top 20        0.277       0.141       0.071     0.140   0.383
Top 50        0.208       0.103       0.061     0.124   0.308

Table 9.11: Pattern relevance based on presence in the high-quality pattern list generated by human annotators. "Top 5" indicates the fraction of the top 5 patterns generated by the algorithm that were marked by annotators as high-quality patterns. The results are averaged over all attributes.

Since the TFIDF2 measure performs best, the unique number of correct values extracted by a given pattern (the type count) is a useful indicator of pattern relevance. The other component of this measure, the number of different slots in which a pattern occurs, indicates that the slot-specificity of the pattern is also useful for determining pattern relevance. Additional factors that can be useful for improving pattern relevance are discussed in the following section.

9.11.5 Error analysis

While Table 9.11 provides some insight into which features may be useful for automatically filtering the patterns, there is still a lot of room for improvement. The following are some of the lessons learned from an error analysis of the generated patterns:

• Numeric attributes: Numeric attributes such as "number of employees" resulted in poor pattern accuracy due to the wide range of values for the attribute.

However, a syntactic domain model that can identify the range of potential

values was also utilized as one of the components in the pipeline to account for

such attributes.

• All the attributes dealing with aliases (“alternate names”) also resulted in poor

pattern accuracy since most of the values for alternate names are mentioned

in parentheses, resulting in noisy generic patterns that can be applied in many

contexts.

• Attributes such as "political and religious affiliations" and "origin" do not naturally occur in typical contexts and hence are difficult to model using a pattern-based approach.

• Annotator errors: Given the time constraints, the annotators were only able to sift through a subset of the patterns in order to select good ones. In some cases, the annotators completely missed good patterns in the initial screening, and in other cases noisy patterns were incorporated into the final set.

• Related patterns: Since the evaluation reports an "exact-match" accuracy, some of the patterns that are similar to the ones selected by the annotators were marked incorrect. For example, even though the pattern "better known as <A>" is similar to "also known as <A>", which is on the list of selected patterns, the former will be marked as incorrect. A possible direction for pattern evaluation is to perform fuzzy matching of patterns using content words.

9.12 Application of Position-based Model to News Data

While formal biographies such as Wikipedia articles have a well-defined structure

that can be easily modeled for salient positions of the attributes, such characteristic

positions are not very clear in non-biographic articles such as news articles. This is

due to the ambiguous nature of news article as opposed to the monosemous nature

of Wikipedia article, with respect to the target named entity. A news article may

contain many named entities and biographic attributes that appear in the article may


belong to any of these entities. This section presents an empirical study on finding

such biographic position indicators in news data using a sample of New York Times

articles.

9.12.1 Corpora Details

The initial set of people names consisted of 58 names along with their occupations.

These names were then queried against the New York Times website to find recent news about the people in this set. A total of 134 articles were found, covering a subset of 27 of the people names from the initial set. This set of articles was used for modeling the position of the "occupation" attribute in the article, both globally and with respect to

the name mention.

9.12.2 Global Position Model of "Occupation" Attribute

Figure 9.6 shows the histogram of the position of correct “occupation” in the

overall article, without any relation to where the respective name was mentioned in

the article. While such a global position model is useful for formal biographies as

explained in Section 9.6, Figure 9.6 shows that the distribution of positions has a

large variance, motivating more localized models as explained in the next sections.


Figure 9.6: Global position of the "occupation" attribute in the New York Times articles. The position is given as the fraction of the article length on the X-axis, and the Y-axis gives the number of times an "occupation" attribute was found in that fraction.


9.12.3 Modeling Position with respect to the First Name Mention

Figure 9.7 shows the histogram of the relative distance of correct “occupation”

with respect to the first full mention of the target name. The first mention of the

name usually licenses the author to provide additional biographic information either

just before (as a premodifier) or near any of the following coreferent mentions. Figure

9.7 shows that the overwhelming indicator of the correct “occupation” of the name is

in the premodifier (-1) position. Furthermore, the majority of the remaining probability

mass is also focused around a small window near the name mention, within a distance

of 10 words.
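The relative-offset computation behind these histograms can be sketched as follows (a toy example; real input would be tokenized NYT text, and no full coreference analysis is attempted):

```python
from collections import Counter

def relative_offsets(tokens, name_tokens, occupation):
    """Token offsets of each occupation mention relative to the first
    full mention of the target name; -1 is the premodifier position."""
    def find(sub):
        for i in range(len(tokens) - len(sub) + 1):
            if tokens[i:i + len(sub)] == sub:
                return i
        return None
    anchor = find(name_tokens)
    if anchor is None:
        return Counter()
    return Counter(i - anchor for i, t in enumerate(tokens) if t == occupation)

tokens = "the physicist Marie Curie , a physicist , died".split()
print(relative_offsets(tokens, ["Marie", "Curie"], "physicist"))
# one premodifier hit (offset -1) and one nearby mention (offset +4)
```

Aggregating these counters over a corpus yields histograms like Figure 9.7.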

9.12.4 Modeling Position with respect to the Closest Name Mention

Often there are multiple mentions of the target name in the article and it is useful

to model the position of the correct occupation nearest to the closest target mention,

in order to obtain a better localized model. Figure 9.8 shows the histogram of the

distance of correct “occupation” from the closest full name mention. One can see

that it looks very similar to that of the position from the first mention (Figure 9.7), and

that may be due to the fact that the full names are usually mentioned only once in


Figure 9.7: Distribution of “occupation” attribute from first full mention of the name

in the New York Times articles.


Figure 9.8: Distribution of “occupation” attribute from the closest full mention of

the name in the New York Times articles.


the article and tend to be the first mention. The later mentions of the name often

use only part of the name such as the first name or the last name and the next section

describes taking partial name matches into account for position modeling.

9.12.5 Modeling Position with respect to the Closest Full or Partial Name Mention

Figure 9.9 shows the histogram of the correct "occupation" attributes with respect to the closest full or partial name mentions. The partial name mentions were

approximated via the usage of the first name or last name of the target name, and a more

exact model would require a full coreference chain analysis. One can see in Figure 9.9

that the premodifier position still remains the most salient indicator of the "occupation" attribute; however, the frequency at the -1 position is higher due to increased coverage

via partial matches. Furthermore, this increased coverage also results in additional

“occupation” matches around a small window of size five to six words.

9.12.6 Analysis

The various histograms in Figures 9.6, 9.7, 9.8 and 9.9 motivate an "occupation" extraction model for news that places a very high prior on the premodifier position of

the target name for extracting correct occupation values. Furthermore, the majority

of the remaining probability mass occurs in a small window centered around any


Figure 9.9: Distribution of “occupation” attribute from the closest full or partial

(first name or last name) mention of the name in the New York Times articles.


mention of the target name. This motivates modeling the “occupation” attribute using a

narrow bag-of-words as candidates in conjunction with an appropriate domain model

for the “occupation” attribute as utilized in Section 9.6 for the biographic genre.
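A minimal sketch of such an extraction model follows. The hand-set position prior (concentrated on the premodifier slot, as the histograms suggest) and the small occupation lexicon are illustrative assumptions, not estimated values; a full model would also incorporate a domain model of occupation words.

```python
def extract_occupation(tokens, name_positions, occupation_lexicon,
                       position_prior, window=6):
    """Pick the best occupation candidate from a narrow window around each
    mention of the target name, scoring candidates by a position prior that
    strongly favors the premodifier (-1) slot."""
    best, best_score = None, 0.0
    for m in name_positions:
        for i in range(max(0, m - window), min(len(tokens), m + window + 1)):
            if tokens[i] in occupation_lexicon:
                score = position_prior.get(i - m, 0.0)
                if score > best_score:
                    best, best_score = tokens[i], score
    return best

# toy prior: most probability mass on the premodifier position
position_prior = {-1: 0.6, -2: 0.15, 1: 0.1, 2: 0.05}
tokens = "british rider phil collins will race tonight".split()
occ = extract_occupation(tokens, name_positions=[2],
                         occupation_lexicon={"rider", "racer"},
                         position_prior=position_prior)
# -> "rider", found at offset -1 from "phil"
```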

9.13 Using Biographical Facts for Name

Disambiguation

This work considers the task of disambiguating a first name or last name mention

in unstructured text to the correct Wikipedia page. This task is along the lines

of Bunescu and Pasca (2006) and Cucerzan (2007), who make use of the entire text

on the Wikipedia page and the mention page to perform disambiguation. However,

the goal here is to test the effectiveness of biographical attributes in disambiguation

and this work reports some preliminary results on how often a name can be disam-

biguated by just using an “occupation” match13.

A preliminary name disambiguation experiment was performed on a set of 100 first

name or last name mentions. In order to test the “occupation” match, the men-

tions were chosen such that an occupation string occurs in the 5-word premodifying

or appositive context14 of the mention and were chosen randomly from the English

13Biographical features have also been used for cross-document coreference by Mann and Yarowsky (2003) in combination with the full bag-of-words model.

14Nenkova and McKeown (2003) showed in their corpus study that name-external evidence such as “occupation” and “nationality” often occurs in the premodifying or appositive context of the mention.


..... British rider Phil Collins will be among the favorites tonight in the 19th U.S. championship speedway motorcycle races at the Orange County Fairgrounds in ........

Phil Collins (1)(Speedway rider)

Phil Collins (2)(Musician)

Phil Collins (3)(Baseball player)

Phil Collins (4)(Artist, Photographer)

Figure 9.10: Application of biographical attributes for name disambiguation: Disam-

biguating mention of “Phil Collins” to the correct Wikipedia entry using the premod-

ifying occupation “rider”. Similarly other biographical attributes such as nationality

premodifier “British” can also be used for disambiguation. This can be further im-

proved by using compatible occupations as shown in Table 9.13.


ACE-2005 (Walker et al., 2006) training set. An example of using occupation for

disambiguating the mention of name “Phil Collins” is shown in Figure 9.10.

The baseline model is to simply string-match the mention against the names in Wikipedia, choosing randomly if more than one name is present. The “occupation”

model extracts the occupation present in the premodifying or appositive context of

the mention and selects the candidate Wikipedia page with the matching “occupation”. The first two rows of Table 9.12 show the performance gain using exact

“occupation” match.

However, exact “occupation” matching was found to be too conservative, as it would count the mention “musician Yanni” as a mismatch to “Composer” or “Pianist”, the occupation mentioned on the Wikipedia page. In order to address this problem, the correlation among “occupation” values was measured using the number of names in a biographical database15 that share those occupations. Table 9.13 shows

a sample of the “occupation” correlations.
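Counts of this kind can be derived in a single pass over a person-to-occupations database. The sketch below uses a toy in-memory dictionary standing in for Freebase-style records; the names and occupations are invented.

```python
from collections import Counter
from itertools import combinations

def occupation_correlations(people):
    """For each unordered pair of occupations, count how many people in the
    database hold both occupations."""
    pair_counts = Counter()
    for occupations in people.values():
        for pair in combinations(sorted(set(occupations)), 2):
            pair_counts[pair] += 1
    return pair_counts

# toy stand-in for a Freebase-style person -> occupations table
people = {
    "Person A": ["Novelist", "Writer"],
    "Person B": ["Novelist", "Writer"],
    "Person C": ["Singer", "Songwriter"],
}
counts = occupation_correlations(people)
# counts[("Novelist", "Writer")] == 2; counts[("Singer", "Songwriter")] == 1
```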

Results for Name Disambiguation
Model                                   Accuracy
Name string match                       0.15
+ Exact occupation match                0.46
+ Occupation match with correlation     0.56

Table 9.12: Name disambiguation performance for matching first or last name mentions to a Wikipedia person page

15The biographical database used was Freebase (www.freebase.com), as it is a Wikipedia-centric database.


The “occupation” correlations were used for fuzzy match in name disambiguation

and the candidate whose “occupation” had the highest correlation with the “occupa-

tion” of the mention was chosen. The third row in Table 9.12 shows the performance

gain using this feature. The preliminary results show promise for using additional

biographical features and a likely fruitful line of future work is to integrate all the

automatically extracted biographical features presented in this work with a full name

disambiguation/cross-document coreference system.
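The exact-then-fuzzy matching strategy can be sketched as follows. The candidate pages, occupations and pair counts are invented for illustration (there is, for instance, no claim that a “Yanni (politician)” page exists); treating an exact occupation match as maximally correlated is a convention of this sketch.

```python
def disambiguate(mention_occupation, candidates, pair_counts):
    """Choose the candidate page whose occupation correlates best with the
    occupation found in the mention's premodifying/appositive context."""
    def score(candidate_occ):
        if candidate_occ == mention_occupation:
            return float("inf")          # exact match wins outright
        pair = tuple(sorted((mention_occupation, candidate_occ)))
        return pair_counts.get(pair, 0)  # fuzzy match via correlation count
    return max(candidates, key=lambda page: score(candidates[page]))

# invented correlation counts and candidate pages
pair_counts = {("Composer", "Musician"): 500, ("Musician", "Politician"): 3}
candidates = {"Yanni (composer)": "Composer",
              "Yanni (politician)": "Politician"}
best = disambiguate("Musician", candidates, pair_counts)
# -> "Yanni (composer)": Musician/Composer co-occur far more often
```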

Correlations for “Occupation” attribute
Occupation Pair                         # of People
Novelist, Writer                        7100
Lawyer, Politician                      3366
Singer, Songwriter                      1552
...
Baseball player, Football player        588
Mathematician, Physicist                366
Film Director, Screenwriter             248
...
Accountant, Baseball player             2
Actor, Store manager                    2

Table 9.13: Correlation between occupations based on the number of people sharing the same occupation

9.14 Conclusion

This chapter describes novel approaches to biographic fact extraction using struc-

tural, transitive and latent properties of biographic data. First, an improvement to

the standard Ravichandran and Hovy (2002) model is shown utilizing untethered


contextual pattern models, followed by a document position and sequence-based ap-

proach to attribute modeling. Next, transitive models were presented exploiting the

tendency for individuals occurring together in an article to have related attribute val-

ues. This chapter also describes how latent-attribute-based models of wide document

context, both monolingually and translingually, can capture facts that are not stated

directly in a text.

Each of these models provides a substantial performance gain, and further gains are achieved via classifier combination. As an additional source of information, inter-attribute correlations are modeled to filter unlikely attribute combinations, and models of functions over attributes, such as deathdate-birthdate distributions, further constrain the candidate space. These approaches collectively achieve 80% average accuracy on a test set of 7 biographic attribute types, yielding a 36% absolute accuracy gain relative to a standard algorithm on the same data.


Chapter 10

Modeling Latent Biographical

Attributes in Conversational

Genres

Summary

This chapter presents and evaluates several original techniques for the latent clas-

sification of biographic attributes such as gender, age and native language, in diverse

genres (conversation transcripts, email) and languages (Arabic, English). First, a

novel partner-sensitive model is presented for extracting biographic attributes in conversations, given the differences in lexical usage and discourse style observed between same-gender and mixed-gender conversations. Then, a rich variety of


novel sociolinguistic and discourse-based features, including mean utterance length,

passive/active usage, percentage domination of the conversation, speaking rate and

filler word usage, is explored. Cumulatively, up to 20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005) algorithm for classifying individual

conversations on Switchboard, and accuracy for gender detection on the Switchboard

corpus (aggregate) and Gulf Arabic corpus exceeds 95%.

Components of this chapter were originally published by the author of this disserta-

tion in the forum referenced below1.

10.1 Introduction

Speaker attributes such as gender, age, dialect, native language and educational

level may be (a) stated overtly in metadata, (b) derivable indirectly from metadata

such as a speaker’s phone number or userid, or (c) derivable from acoustic properties

of the speaker, including pitch and f0 contours (Bocklet et al., 2008).

In contrast, the goal of this work is to model and classify such speaker attributes

from only the latent information found in textual transcripts. In particular, this

work is focused on modeling and classifying speaker attributes such as gender and

age based on lexical and discourse factors including lexical choice, mean utterance

length, patterns of participation in the conversation and filler word usage. Further-

1Reference: N. Garera, D. Yarowsky. Modeling Latent Biographic Attributes in Conversational Genres. Proceedings of the Association for Computational Linguistics (ACL), 2009.


more, a speaker’s lexical choice and discourse style may differ substantially depending

on the gender/age/etc. of the speaker’s interlocutor, and hence improvements may

be achieved via joint conversational dyad modeling or stacked classifiers.

There has been substantial work in the sociolinguistics literature investigating dis-

course style differences due to speaker properties such as gender (Coates, 1997; Eckert and McConnell-Ginet, 2003). While most of the prior work in sociolinguistics has been

approached from a non-computational perspective, Singh (2001) and Koppel et al.

(2002) employed a linear model for gender with manually selected and linguistically interesting words and parts of speech as features, focusing on a small development corpus. Another computational study for gender, using approximately 30

weblog entries was done by Herring and Paolillo (2006), making use of a logistic regres-

sion model to study the effect of different features. While small-scale sociolinguistic

studies on monologues have shed some light on important features, this work focuses

on modeling attributes from transcripts of spoken conversations as shown in Figure

10.1, building upon the work of Boulis and Ostendorf (2005), and shows how gender

and other attributes can be accurately predicted in conversations. In addition to spo-

ken conversations, this work also explores another genre for informal conversations,

namely email2.

2An example email snippet is shown in Figure 10.7.


75.06 75.53 A: no

77.52 78.23 B: actually

78.36 79.00 B: um

79.16 79.54 B: [cough]

79.82 80.65 B: my ah

81.79 82.52 B: my wife

83.02 84.46 B: ah and i

84.63 88.48 B: enjoy having dinner together and i have to tell you i really enjoy eating.

89.82 93.14 A: i definitely agree with you on that one [laugh]

94.44 94.99 A: um

96.59 97.83 B: so what's you're favorite?

98.86 99.80 A: um

100.48 103.05 A: that's a hard one, but i would have to go

103.40 104.55 A: i like to make

104.81 106.68 A: like taco salad a lot

106.68 106.97 B: hm

107.31 109.66 A: stuff like that simple stuff but

111.04 113.02 B: [lipsmack] we had taco

Figure 10.1: A snippet of Fisher telephone transcript between a female (A) and male

(B) speaker. The first two fields indicate the start time and stop time and the third

field contains the utterance.


10.1.1 Applications

Analyzing such differences in latent author/speaker attributes is interesting from

the sociolinguistic and psycholinguistic point of view of language understanding, but

also from an engineering perspective for a wide range of applications described below:

• Call routing: A straightforward application of detecting gender, age, native

language of the speaker is to re-route the call for personalized assistance in

various phone-based services.

• User authentication and security: An important problem on major blogging or

social networking websites is that of detecting whether a user account has been

compromised. The posts or comments written by the user can be analyzed to

see if the author attributes match those of the original profile when determining

the identity/authenticity of the user.

• Filling user profiles: Many websites require users to fill in their biographical

data. Such information could be automatically extracted using the content

posted by the users.

• Gender/age conditioned models: Extracting latent properties enables researchers to build gender/age-conditioned, attribute-specific models for various tasks such as language modeling, machine translation and

speech recognition.


10.1.2 Contributions

Having motivated the goal of predicting latent biographic attributes, the following

points briefly outline the original contributions of the work described in this chapter:

1. Modeling Partner Effect: A speaker may adapt his or her conversation style

depending on the partner, and it is shown how conditioning on the predicted

partner class using a stacked model can provide further performance gains in

gender classification.

2. Sociolinguistic features: The chapter explores a rich set of lexical and non-lexical

features motivated by the sociolinguistic literature for gender classification, and

shows how they can effectively augment the standard ngram-based model of

Boulis and Ostendorf (2005).

3. Application to Arabic Language: This work also reports results for application

to the Arabic language, in addition to the English Fisher transcripts used by

Boulis and Ostendorf (2005). It is shown that the ngram model gives reasonably

high accuracy for Arabic as well. Furthermore, consistent performance gains due

to partner-sensitive models and sociolinguistic features, as observed in English,

are also obtained.

4. Application to Email Genre: It is shown how the models developed in this chapter for the conversational transcript genre extend to email, showing the wide

applicability of the models due to the use of general text-based features.


5. Application to new attributes: This work shows how the lexical model of Boulis

and Ostendorf (2005) can be extended to Age and Native vs. Non-native pre-

diction, with further improvements gained from using the introduced partner-

sensitive models and novel sociolinguistic features.

10.2 Related Work

Conversational speech presents a challenging domain due to the interaction of

genders, recognition errors and sudden topic shifts. Text-based information extraction approaches for the speech genre have also been investigated: Jing et al. (2007) present a

supervised framework for extracting biographical facts from a transcribed conversa-

tional speech collection (MALACH) consisting of interviews of Holocaust survivors.

They present new features for co-reference resolution in conversational speech such as

speaker role identification, speaker turns, name patterns, etc. and use a combination

of lexical, contextual and syntactic features for attribute labeling. However, only the

explicitly mentioned attributes were extracted as they treat the attribute extraction

as a labeling problem and train a maximum-entropy classifier for the same.

In contrast, the goal of the work presented in this chapter is to predict such at-

tributes when they are not necessarily explicitly stated in the utterance, along the

lines of Singh (2001) and Boulis and Ostendorf (2005), which use lexical differences

in conversational speech for gender classification. Singh (2001) performed a pilot


study using conversational speech for identifying gender differences based on lexical

richness measures. A total of thirty subjects were recorded and transcribed in a con-

versational setting and the lexical richness measures were based on word frequencies

of word classes such as noun, pronoun, adjective and verb rate per 100 words, type-

token ratio, etc, achieving a 90% classification accuracy using discriminant analysis

on this small dataset.

Boulis and Ostendorf (2005) presented the first large-scale study of gender modeling

in conversational speech transcripts using the Fisher corpus (Cieri et al., 2004). Their model utilized a simple bag-of-ngrams feature vector in an SVM framework for gender classification, showing how state-of-the-art machine learning approaches utilizing very

high dimensional feature vectors can classify gender with more than 90% accuracy.

While Boulis and Ostendorf (2005) observe that the gender of the partner can have

a substantial effect on their classifier accuracy, given that same-gender conversations

are easier to classify than mixed-gender conversations, they do not utilize this observation in their work. Section 10.5.3 shows how the predicted gender/age etc. of the

partner/interlocutor can be used to improve overall performance via both joint dyad

modeling and classifier stacking. Boulis and Ostendorf (2005) have also constrained

themselves to lexical n-gram features, while this work shows improvements via the

incorporation of non-lexical features such as the percentage domination of the con-

versation, degree of passive usage, usage of subordinate clauses, speaking rate, usage profiles for filler words (e.g. “umm”), mean-utterance length, and other such properties. Finally, this work explores and empirically evaluates original model performance

on additional latent speaker attributes including age and native vs. non-native En-

glish speaking status. The remaining sections describe the approach in detail.

10.3 Corpus Details

Consistent with Boulis and Ostendorf (2005), this work utilized the Fisher tele-

phone conversation corpus (Cieri et al., 2004) and also evaluated performance on the

standard Switchboard conversational corpus (Godfrey et al., 1992), both collected and

annotated by the Linguistic Data Consortium. In both cases, the provided metadata

(including true speaker gender, age, native language, etc.) was utilized only as class

labels for both training and evaluation, but never as features in the classification.

The primary task employed was identical to Boulis and Ostendorf (2005), namely

the classification of gender, etc. of each speaker in an isolated conversation; performance was also evaluated when classifying speaker attributes given the combination

of multiple conversations in which the speaker has participated. The Fisher corpus

contains a total of 11971 speakers and each speaker participated in 1-3 conversations,

resulting in a total of 23398 conversation sides (i.e. the transcript of a single speaker

in a single conversation). The preprocessing steps and experimental setup of Boulis

and Ostendorf (2005) were followed as closely as possible given the details presented in their paper, although some details such as the exact training/test partition were


not currently obtainable from either the paper or personal communication. This re-

sulted in a training set of 9000 speakers with 17587 conversation sides and a test set

of 1000 speakers with 2008 conversation sides.

The Switchboard corpus was much smaller and consisted of 543 speakers, with 443

speakers used for training and 100 speakers used for testing, resulting in a total of

4062 conversation sides for training and 808 conversation sides for testing.

10.4 Modeling Gender via Ngram

features (Boulis and Ostendorf,

2005)

As the reference algorithm, the current state-of-the-art system developed by Boulis and Ostendorf (2005), which uses unigram and bigram features in an SVM framework, was adopted. This model was reimplemented as the reference standard for gender classification; further details are given below.

10.4.1 Training Vectors

For each conversation side, a training example was created using unigram and bigram features with TF.IDF weighting, as done in standard text classification approaches. However, stopwords were retained in the feature set, as various sociolinguistic studies have shown that the use of some stopwords, for instance pronouns and determiners, is correlated with age and gender. Also, only the ngrams with frequency greater than 5 were retained in the feature set, resulting in a total of 227,450 features for the Fisher corpus and 57,914 features for the Switchboard corpus.

Fisher Corpus
Female                         Male
husband           -0.0291      my wife           0.0366
my husband        -0.0281      wife              0.0328
oh                -0.0210      uh                0.0284
laughter          -0.0186      ah                0.0248
have              -0.0169      er                0.0222
mhm               -0.0169      i i               0.0201
so                -0.0163      hey               0.0199
because           -0.0160      you doing         0.0169
and               -0.0155      all right         0.0169
i know            -0.0152      man               0.0160
hi                -0.0147      pretty            0.0156
um                -0.0141      i see             0.0141
boyfriend         -0.0134      yeah i            0.0125
oh my             -0.0124      my girlfriend     0.0114
i have            -0.0119      thats thats       0.0109
but               -0.0118      mike              0.0109
children          -0.0115      guy               0.0109
goodness          -0.0114      is that           0.0108
yes               -0.0106      basically         0.0106
uh huh            -0.0105      shit              0.0102

Switchboard Corpus
Female                         Male
oh                -0.0122      wife              0.0078
laughter          -0.0088      my wife           0.0077
my husband        -0.0077      uh                0.0072
husband           -0.0072      i i               0.0053
have              -0.0069      actually          0.0051
uhhuh             -0.0068      sort of           0.0041
and i             -0.0050      yeah i            0.0041
feel              -0.0048      got               0.0039
umhum             -0.0048      a                 0.0038
i know            -0.0047      sort              0.0037
really            -0.0046      yep               0.0036
women             -0.0043      the               0.0036
um                -0.0042      stuff             0.0035
would             -0.0039      yeah              0.0034
children          -0.0038      pretty            0.0033
too               -0.0036      that that         0.0032
but               -0.0035      guess             0.0031
and               -0.0034      as                0.0029
wonderful         -0.0032      is                0.0028
yeah yeah         -0.0031      i guess           0.0028

Table 10.1: Top 20 ngram features for Gender, ranked by the weights assigned by the linear SVM model
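As one concrete, hedged rendering of this feature scheme, the sketch below uses scikit-learn in place of the SVMlight toolkit used in this chapter. The toy conversation sides and labels are invented, and `min_df` (a document-frequency cutoff) only approximates the raw-frequency pruning described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Unigram + bigram TF.IDF features; stopwords are deliberately retained
# (stop_words=None) and rare ngrams are pruned via min_df. The custom
# token_pattern keeps single-character tokens such as "i" and "a".
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words=None,
                             token_pattern=r"(?u)\b\w+\b")

# toy conversation sides and gender labels (invented for illustration)
sides = ["oh my husband and i know",
         "uh my wife and i i man",
         "oh i know because hi um",
         "yeah i uh all right guy"]
labels = ["F", "M", "F", "M"]

X = vectorizer.fit_transform(sides)
clf = LinearSVC().fit(X, labels)   # linear kernel, as with SVMlight
pred = clf.predict(vectorizer.transform(["oh my husband you know"]))
```

Here the bigram "my husband" is pruned only because it occurs in a single toy document; on a real corpus, the frequency cutoff removes genuinely rare ngrams while keeping discriminative ones like those in Table 10.1.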

10.4.2 Model

After extracting the ngram features, an SVM model was trained via the SVMlight

toolkit (Joachims, 1999) using the linear kernel with the default toolkit settings.

Table 10.1 shows the most discriminative ngrams for gender based on the weights

assigned by the linear SVM model. The negative and positive sign on the weights of

ngrams for the two genders are due to the selection of “-1” as the female class and “+1” as the male class in the SVM model. It is interesting that some of the gender-correlated

words proposed by sociolinguistic literature are also found by this empirical approach,

including the frequent use of “oh” by females and also obvious indicators of gender

such as “my wife” or “my husband”. Also, the named entity “Mike” shows up as a discriminative unigram; this may be due to self-introductions at the beginning of

the conversations and “Mike” being a common male name. For compatibility with

Boulis and Ostendorf (2005), no special preprocessing for names is performed, and

they are treated as just any other unigrams or bigrams in this particular direct com-

parison4.

4A natural extension of this work, however, would be to do explicit extraction of self introductions and then do table-lookup-based gender classification, although this was not implemented for consis-


Figure 10.2: The effect of varying the amount of each conversation side utilized for

training, based on the utilized % of each conversation, starting from the beginning

of the conversation. While one would expect the accuracy to improve linearly with increased training data, the anomalous flat portion in the middle could be due to the fact that Fisher and Switchboard participants were complete strangers. The initial ramp-up in the curve is probably due to the addition of speaker data starting

from no data at all and the flat portion is probably due to the time taken for the

speakers to get familiar and speak comfortably with each other, after which, the

discourse features for speaker attributes become more prominent. Another reason

could be due to the fact that the middle portion indicates discussion on a specific

topic given to the speakers and after they have spoken enough about the topic, the

speakers may move on to more gender biased topics of their choice.


Furthermore, the ngram-based approach scales well with varying the amount of con-

versation utilized in training the model as shown in Figure 10.2.

The “Boulis and Ostendorf, 05” rows in Table 10.4 show the performance of this

reimplemented algorithm on both the Fisher Corpus (90.84%) and Switchboard Cor-

pus (90.22%), under the identical training and test conditions used elsewhere in the

chapter for direct comparison with subsequent results5.

10.5 Modeling Based on the Partner’s

Gender

The original contribution in this section is the successful modeling of speaker prop-

erties (e.g. gender/age) based on the prior and joint modeling of the partner speaker’s

gender/age in the same discourse. The motivation for this work is that people tend

to use stronger gender-specific, age-specific or dialect-specific word/phrase usage and

discourse properties when speaking with someone of a similar gender/age/dialect than

when speaking with someone of a different gender/age/dialect, when they may adopt

a more neutral speaking style. Also, discourse properties such as relative use of the

passive and percentage of the conversation dominated may vary depending on the

tency with the reference algorithm. The handling of names and potentially self-reporting features is studied and handled specifically elsewhere in this chapter, however.

5The modest differences with their reported results may be due to unreported details such as the exact training/test splits or SVM parameterizations, so for the purposes of assessing the relative gain of the subsequent enhancements we base all reported experiments on the internally-consistent configurations as (re-)implemented here.


Fisher Corpus
Same gender conversations       94.01
Mixed gender conversations      84.06

Switchboard Corpus
Same gender conversations       93.22
Mixed gender conversations      86.84

Table 10.2: Difference in Gender classification accuracy between mixed gender and same gender conversations using the reference algorithm

Classifying speaker’s and partner’s gender simultaneously
Male-Male          84.80
Female-Female      81.96
Male-Female        15.58
Female-Male        27.46

Table 10.3: Performance for 4-way classification of the entire conversation into (mm, ff, mf, fm) classes using the reference algorithm on the Switchboard corpus.

gender or age relationship with the speaking partner. Several varieties of classifier

stacking and joint modeling were employed to be effectively sensitive to these dif-

ferences. To illustrate the significance of the “partner effect”, Table 10.2 shows the

difference in the standard algorithm performance between same-gender conversations

(when gender-specific style flourishes) and mixed-gender conversations (where the

more neutral styles are harder to classify):

10.5.1 Oracle Experiment

To assess the potential gains from full exploitation of partner-sensitive modeling,

first results are reported from an oracle experiment, where it is assumed that

the algorithm knows whether the conversation is homogeneous (same gender) or het-

erogeneous (different gender). In order to effectively utilize this information, both


the test conversation side and the partner side are classified, and if the classifier is more confident about the partner side, then the gender of the test conversation side is

chosen based on the heterogeneous/homogeneous information. The overall accuracy

improves to 96.46% on the Fisher corpus using this oracle (from 90.84%), leading to

the following experiment where the oracle is replaced with a non-oracle SVM model

trained on a subset of training data such that all test conversation sides (of the speaker

and the partner) are excluded from the training set.
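The confidence-based decision rule of the oracle experiment can be sketched as follows; treating the signed SVM margin as a confidence score is an assumption of this sketch, not a detail specified above.

```python
def classify_with_homogeneity_oracle(score_self, score_partner, same_gender):
    """Each conversation side gets a signed SVM score (positive = male,
    negative = female). If the classifier is more confident about the
    partner's side, derive the test side's gender from the partner's
    predicted gender plus the (oracle) same/mixed-gender flag."""
    def to_gender(score):
        return "M" if score > 0 else "F"
    if abs(score_partner) > abs(score_self):
        partner_gender = to_gender(score_partner)
        if same_gender:
            return partner_gender
        return "F" if partner_gender == "M" else "M"
    return to_gender(score_self)

# weakly-scored test side, confidently male partner, mixed-gender dyad:
label = classify_with_homogeneity_oracle(0.1, 2.3, same_gender=False)
# -> "F"
```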

10.5.2 Replacing Oracle by a Homogeneous vs

Heterogeneous Classifier

Given the substantial improvement using the oracle information, initially another binary classifier was trained for classifying the conversation as mixed or single-gender. It turns out that this task is much harder than the single-side gender classification task, and it achieved only a low accuracy of 68.35% on the Fisher corpus. Intuitively, the homogeneous vs. heterogeneous partition results in a much harder

classification task because the two diverse classes of male-male and female-female

conversations are grouped into one class (“homogeneous”) resulting in linearly in-

separable classes6. This led (subsequently) to creating two different classifiers for conversations, namely male-male vs rest and female-female vs rest, used in a classifier

6Even non-linear kernels were not able to find a good classification boundary.


Figure 10.3: People use stronger gender-specific discourse properties when speaking

to someone of a similar gender. Stacking whole conversation and partner-conditioned

models as shown above allows such behavior to be modeled. The common graphic used

for individual SVM classifiers first appeared in (Ustun, 2003).

combination framework as follows:

10.5.3 Modeling partner via conditional model

and whole-conversation model

The following classifiers were trained and each of their scores was used as a feature

in a meta SVM classifier:


1. Male-Male vs Rest: Classifying the entire conversation (using test speaker and

partner’s sides) as male-male or other7.

2. Female-Female vs Rest: Classifying the entire conversation (using test speaker

and partner’s sides) as female-female or other.

3. Conditional model of gender given most likely partner’s gender: Two separate

classifiers were trained for classifying the gender of a

given conversation side, one where the partner is male and the other where the

partner is female. Given a test conversation side, first the most likely gender of

the partner’s conversation side is chosen using the ngram-based model8, and then

the gender of the test conversation side is chosen using the appropriate conditional

model.

4. Ngram-based model as explained in Section 10.4

The stacking approach described above is illustrated in Figure 10.3. The row labeled

“+ Partner Model” in Table 10.4 shows the performance gain obtained via this meta-

classifier incorporating conversation type and partner-conditioned models.
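The stacking step above can be sketched as follows: the scores of the four base classifiers become the feature vector of a meta-classifier. The hand-set weights and the sign convention (positive score means "female") are illustrative assumptions standing in for the trained meta SVM.

```python
# A minimal sketch of stacking: four base-classifier scores are assembled
# into the meta-feature vector on which a meta-classifier is trained.

def meta_features(mm_vs_rest, ff_vs_rest, conditional, ngram):
    """Assemble the 4-dimensional meta-feature vector described above."""
    return [mm_vs_rest, ff_vs_rest, conditional, ngram]

# Toy linear meta-classifier with hand-set weights standing in for the
# trained SVM (assumption: positive combined score => "female").
WEIGHTS = [-0.8, 0.8, 0.6, 0.5]

def meta_predict(features):
    score = sum(w * f for w, f in zip(WEIGHTS, features))
    return "female" if score > 0 else "male"

print(meta_predict(meta_features(0.1, 0.7, 0.6, 0.4)))  # "female"
```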

7For classifying the conversations as male-male vs rest or female-female vs rest, all the conversations with either the speaker or the partner present in any of the test conversations were eliminated from the training set, thus creating disjoint training and test conversation partitions.

8All the partner conversation sides of test speakers were removed from the training data and the ngram-based model was retrained on the remaining subset.


Figure 10.4: Empirical differences in sociolinguistic features for Gender on the Switch-

board corpus

10.6 Sociolinguistic Features

The sociolinguistic literature has shown gender differences for speakers due to

features such as speaking rate, pronoun usage and filler word usage. While ngram

features are able to reasonably predict speaker gender due to their high detail and cov-

erage and the overall importance of lexical choice in gender differences while speaking,

the sociolinguistics literature suggests that other non-lexical features can further help

improve performance, and more importantly, advance our understanding of gender

differences in discourse. Thus, on top of the standard Boulis and Ostendorf (2005)

model, the following features were also investigated, motivated by the sociolinguistic

literature on gender differences in discourse (Macaulay, 2005):


1. % of conversation spoken: The speaker’s fraction of conversation spoken was

measured via three features extracted from the transcripts: % of words, utter-

ances and time.

2. Speaker rate: Some studies have shown that males speak faster than females

(Yuan et al., 2006) as can also be observed in Figure 10.4 showing empirical

data obtained from Switchboard corpus. The speaker rate was measured in

words/sec., using starting and ending time-stamps for the discourse.

3. % of pronoun usage: Macaulay (2005) argues that females tend to use more

third-person male/female pronouns (he, she, him, her and his) as compared to

males.

4. % of back-channel responses such as “(laughter)”, “(lipsmacks)” and “(sighs)”.

5. % of passive usage: Passive usage was detected by extracting a list of past-

participle verbs from the Penn Treebank and counting any occurrence of a form

of “to be” + a past participle.

6. % of short utterances (<= 3 words).

7. % of modal auxiliaries and subordinate clauses.

8. % of “mm” tokens such as “mhm”, “um”, “uh-huh”, “uh”, “hm”, “hmm”, etc.

9. Type-token ratio: The number of types divided by the number of tokens in

the conversation side, in order to measure the vocabulary richness of the speaker.


10. Mean inter-utterance time: The average time taken between utterances of the

same speaker.

11. % of “yeah” occurrences.

12. % of WH-question words such as “What”, “When”, “Where”, etc.

13. Mean word and utterance length of the speaker.

The above classes resulted in a total of 16 sociolinguistic features, which were added

(based on feature-ablation studies) as features in the meta SVM classifier along with

the 4 features explained previously in Section 10.5.3.
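Several of the sociolinguistic features above can be computed directly from a transcript. The sketch below is a minimal, assumption-laden illustration (tokenization, utterance representation, and the passive heuristic's verb list are all simplified stand-ins, not the thesis's implementation):

```python
# Hedged sketch: computing a few of the sociolinguistic features listed above
# from a conversation side represented as a list of utterance token lists.

PRONOUNS = {"he", "she", "him", "her", "his"}          # 3rd-person m/f pronouns
PAST_PARTICIPLES = {"given", "taken", "seen", "done"}  # tiny illustrative list
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}

def features(utterances):
    tokens = [t.lower() for utt in utterances for t in utt]
    n = len(tokens)
    # Passive heuristic: a form of "to be" followed by a past participle.
    passives = sum(1 for a, b in zip(tokens, tokens[1:])
                   if a in BE_FORMS and b in PAST_PARTICIPLES)
    return {
        "pronoun_pct": 100.0 * sum(t in PRONOUNS for t in tokens) / n,
        "passive_pct": 100.0 * passives / n,
        "short_utt_pct": 100.0 * sum(len(u) <= 3 for u in utterances) / len(utterances),
        "type_token_ratio": len(set(tokens)) / n,
    }

convo = [["she", "was", "given", "a", "book"], ["yeah"]]
print(features(convo))
```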

The rows in Table 10.4 labeled “+ (any sociolinguistic feature)” show the perfor-

mance gain using the respective features described in this section. Each row indicates

an additive effect in the feature ablation, showing the result of adding the current

sociolinguistic feature to the set of features mentioned in the rows above.

10.7 Gender Classification Results

Table 10.4 combines the results of the experiments reported in the previous sec-

tions, assessed on both the Fisher and Switchboard corpora for gender classification.

The evaluation measure was the standard classifier accuracy, that is, the fraction of

test conversation sides whose gender was correctly predicted. Baseline performance

(always guessing the most frequent gender, female) yields 57.47% and 51.6% on Fisher


Model | Acc. | Error Reduction

Fisher Corpus (57.5% of sides are female)
Gender Genie | 55.63 | -384%
Ngram (Boulis & Ostendorf, 05) | 90.84 | Ref.
+ Partner Model | 91.28 | 4.80%
+ % of “yeah” | 91.33 |
+ % of (laughter) | 91.38 |
+ % of short utterance | 91.43 |
+ % of auxiliaries | 91.48 |
+ % of subord-clauses, “mm” | 91.58 |
+ % of Participation (in utterance) | 91.63 |
+ % of Passive usage | 91.68 | 9.17%

Switchboard Corpus (51.6% of sides are female)
Gender Genie | 55.94 | -350%
Ngram (Boulis & Ostendorf, 05) | 90.22 | Ref.
+ Partner Model | 91.58 | 13.91%
+ Speaker rate, % of fillers | 91.71 |
+ Mean utterance length, % of Ques. | 91.96 |
+ % of Passive usage | 92.08 |
+ % of (laughter) | 92.20 | 20.25%

Table 10.4: Results showing improvement in the accuracy of the gender classifier using the partner-sensitive model and sociolinguistic features


and Switchboard respectively. As noted before, the standard reference algorithm is

Boulis and Ostendorf (2005), and all cited relative error reductions are based on this

established standard. As a second reference, performance is also cited for the

popular “Gender Genie”, an online gender-detector9, based on the manually weighted

word-level sociolinguistic features discussed in Argamon et al. (2003).

The additional rows in the table are described in Sections 10.4-10.6, and cumulatively

yield substantial improvements over the Boulis and Ostendorf (2005) standard.

10.7.1 Aggregating results over per-speaker via

consensus voting

While Table 10.4 shows results for classifying the gender of the speaker on a per

conversation basis (to be consistent and enable fair comparison with the work reported

by Boulis and Ostendorf (2005)), all of the above models can be easily extended to

per-speaker evaluation by pooling the predictions from multiple conversations of

the same speaker. Table 10.5 shows the result of each model on a per-speaker basis

using a majority vote of the predictions made on the individual conversations of

the respective speaker. The consensus model, when applied to the Switchboard corpus,

shows larger gains, as that corpus has 9.38 conversations per speaker on average,

compared to 1.95 conversations per speaker in Fisher. The results on the Switchboard

9http://bookblog.net/gender/genie.php


Conversation 1

Conversation 2

Conversation 3

Voting

Probability

Female Male

Probability

Female Male

Probability

Female Male

Probability

Female Male

Figure 10.5: Aggregating results over all the conversations of a given speaker via

consensus voting as explained in Section 10.7.1. One can also utilize other ways

of combining evidence such as length-weighted voting, confidence-weighted voting,

stacking, combining all conversations into one single conversation, etc. However,

since the speakers were asked to speak for a fixed time when the data for the Fisher

and Switchboard corpora were collected, the conversations in these corpora are of

similar length, making the simple combination technique above appropriate.


Model | Acc. | Error Reduction

Fisher Corpus
Ngram (Boulis & Ostendorf, 05) | 90.50 | Ref.
+ Partner Model | 91.60 | 11.58%
+ Socioling. Features | 91.70 | 12.63%

Switchboard Corpus
Ngram (Boulis & Ostendorf, 05) | 92.78 | Ref.
+ Partner Model | 93.81 | 14.27%
+ Socioling. Features | 96.91 | 57.20%

Table 10.5: Aggregate results on a “per-speaker” basis via majority consensus on different conversations for the respective speaker. The results on Switchboard are significantly higher due to more conversations per speaker as compared to the Fisher corpus

corpus show a very large reduction in error rate of more than 57% with respect to the

standard algorithm, further indicating the usefulness of the partner-sensitive model

and richer sociolinguistic features when more conversational evidence is available.
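The per-speaker majority vote described in Section 10.7.1 can be sketched as follows; this is a minimal illustration, and the data-structure names are assumptions:

```python
from collections import Counter, defaultdict

def per_speaker_vote(predictions):
    """predictions: iterable of (speaker_id, predicted_gender) pairs, one per
    conversation side. Returns one majority-vote label per speaker."""
    by_speaker = defaultdict(list)
    for speaker, label in predictions:
        by_speaker[speaker].append(label)
    # Counter.most_common(1) picks the majority label for each speaker.
    return {s: Counter(labels).most_common(1)[0][0]
            for s, labels in by_speaker.items()}

votes = [("A", "female"), ("A", "female"), ("A", "male"), ("B", "male")]
print(per_speaker_vote(votes))  # {'A': 'female', 'B': 'male'}
```

Since Fisher and Switchboard conversations are of roughly equal length, this unweighted vote is a reasonable stand-in for length- or confidence-weighted variants.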

10.8 Effect of Self-Reporting Features on

Gender Classification

Some ngrams are very strong indicators of gender in a conversation, to the extent

that a single occurrence of one of them can determine the gender of the speaker.

Some examples of such self-reporting features are “my wife”, “my boyfriend”, etc. It

is worthwhile to study the impact of such features in artificially inflating the perfor-

mance, from the scientific perspective of studying whether general discourse features


Corpus | % of conversation sides with self-reporting features

Fisher Corpus
Training data | 25.94%
Test data | 26.69%

Switchboard Corpus
Training data | 3.2%
Test data | 3.09%

Table 10.6: Fraction of conversations containing self-reporting features such as “my wife”, “my boyfriend”, on different corpora. Although Fisher has a significant fraction of conversations with such features, they have little impact on the overall performance as shown in Table 10.7

are helpful in determining the gender.

Table 10.6 shows the statistics for presence of such features in the Fisher and Switch-

board conversation corpora. While only a small fraction of the conversations (approx-

imately 3%) in the Switchboard corpus contain such features, a significant fraction

(approximately 25%) of the conversations in Fisher corpus have such features. Nev-

ertheless, a second experiment by retraining the classifier after removing all such

features shows only a negligible effect on performance. Table 10.7 shows that there is

no change in accuracy on the Switchboard corpus, as expected, and on the Fisher corpus

the accuracy drops by only 0.25%, indicating the robust and successful utilization

of general discourse-based features for gender classification.
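The ablation amounts to filtering the self-reporting ngrams out of the feature set before retraining. A minimal sketch, where the feature list and the dictionary representation of the feature set are illustrative:

```python
# Sketch of the ablation described above: drop ngram features that directly
# self-report gender before retraining. The feature list is illustrative,
# not the thesis's full list.

SELF_REPORTING = {"my wife", "my husband", "my boyfriend", "my girlfriend"}

def remove_self_reporting(feature_weights):
    """feature_weights: dict mapping ngram -> weight (or count)."""
    return {ng: w for ng, w in feature_weights.items()
            if ng not in SELF_REPORTING}

feats = {"my wife": 3, "you know": 12, "my boyfriend": 1}
print(remove_self_reporting(feats))  # {'you know': 12}
```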


Model | Accuracy

Fisher Corpus
Ngram-based model | 90.84
Removing self-reporting features | 90.59

Switchboard Corpus
Ngram-based model | 90.22
Removing self-reporting features | 90.22

Table 10.7: Self-reporting features for gender such as “my wife”, “my boyfriend”, etc. have negligible impact on the performance of gender classification.

10.9 Application to Arabic Language

While differences in discourse based on author attributes have been heavily studied

for English, it would be interesting to see how the ngram-based model along with the

partner-based model and sociolinguistic features would extend to a new language.

Arabic differs from English along many dimensions such as orthography, word order,

capitalization, sound system, etc., making it a good language to test the robustness

of these models. It also allows assessing the contribution of non-lexical sociolinguistic

features such as mean word length, when language specific sociolinguistic features

(e.g. % of passives) are not available. The following subsections explain the corpus

details and results obtained for Arabic.

10.9.1 Corpus Details

The LDC Gulf Arabic telephone conversation corpus (Linguistic Data Consortium,

2006). The training set consisted of 499 conversations, and the test set consisted

of 200 conversations. Each speaker participated in only one conversation, resulting


in the same number of training/test speakers as conversations. Thus there was no

overlap in speakers/partners between training and test sets, which is also appropriate

for training and evaluation of the partner-based models.

10.9.2 Results

The ngram feature vectors and the partner-sensitive models were trained in a

similar way to the models trained for English. Among the sociolinguistic features, only

the non-lexical features, namely, % of conversation spoken, speaker rate, % of short

utterances, type-token ratio, mean inter-utterance time, and mean word and

utterance length were utilized for the feature-ablation study.

The results for Arabic are shown in Table 10.8. Based on the prior distribution,

always guessing the most likely class for gender (“male”) yielded 52.5% accuracy. It

can be seen that the ngram-based model gives a reasonably high accuracy in Arabic

as well. More importantly, consistent performance gains due to partner modeling

are observed, achieving an accuracy of 95%, indicating the robustness of the partner-sensitive

model. Furthermore, using the sociolinguistic features, it is seen that mean word

length and mean utterance length can provide additional performance gains.


Model | Acc. | Error Reduction

Gulf Arabic (52.5% sides are male)
Ngram (Boulis & Ostendorf, 05) | 92.00 | Ref.
+ Partner Model | 95.00 |
+ Mean word length | 95.50 |
+ Mean utterance length | 96.00 | 50.00%

Table 10.8: Gender classification results for a new language (Gulf Arabic) showing consistent improvement gains via the partner-sensitive model and sociolinguistic features.

10.9.3 Analysis

In order to understand which of the ngram features are more discriminative for

gender, the ngrams were ranked according to their weight in the SVM model, as shown

in Figure 10.6, which lists each Arabic ngram in Unicode, its Roman transliteration and its

weight. Some examples of the top male ngram features are “Aaxiy (my brother)”,

“yaA (addressing or calling upon, vocative particle)”, “waAll~ah and waAll~ahi (swearing

to God, often used by males)”, “Al$~abaAb (the guys)”. It is also interesting to see the

use of questions “kam (how much, how many)”, “kayf (how)” among male speakers.

Some examples of top female ngram features are “<intiy (pronoun “you” when refer-

ring to a female)”, “Hiluw and Hilwap (sweet or nice)”, “Guwliy (say, when talking

to a female)”, “AlHamdi lil~ah (Thank God)”. As noted in the sociolinguistic studies

for English, one can also see the use of gender-specific forms such as “GaAlat (she

said)” among female speakers.


Male (transliteration and SVM weight): <inta 4.54, lak 4.37, yaA 2.62, Ainta 2.53, Aaxiy 2.40, yaA_Aaxiy 2.38, Ealayk 2.26, Al- 1.86, kam 1.75, <int 1.63, waAll~ah 1.63, Eindak 1.63, waAll~ahi 1.43, wayn 1.22, Al$~abaAb 1.18, taEaAl 1.17

Female (transliteration and SVM weight): <intiy -4.59, AlHamdi -2.13, Hiluw -2.07, liJ -1.91, <iywaA -1.90, mar~ap -1.80, AlHamdi_lil~ah -1.46, lil~ah -1.44, Aintiy -1.41, GaAlat -1.41, All~ah -1.39, lik -1.32, maA -1.27, $aAaxbaAriJ -1.24, Guwliy -1.14, Hilwap -1.13

Figure 10.6: Top 20 Arabic ngram features (along with their Roman transliterations)

for Gender, ranked by the weights assigned by the linear SVM model. Section 10.9.3

provides translation and insight into why these are appropriate gender indicators.


10.10 Application to Email Genre

A primary motivation for using only the speaker transcripts as compared to also

using acoustic properties of the speaker (Bocklet et al., 2008) was to enable the appli-

cation of the models to other new genres. In order to empirically support this moti-

vation, the performance of the models explored in this chapter was also tested on the

Enron email corpus (Klimt and Yang, 2004). The email genre has also been studied

before for gender classification, primarily from the point of view of computer forensics

for securing user identity. Corney et al. (2002) describe an approach for email gender

classification using structural features such as style markers, email domain features

such as reply-status, number of attachments, use of HTML tags, greeting and farewell

acknowledgments and language features such as number of words ending with “able”,

“ive”, etc. While full-scale modeling of the email domain for gender is certainly

possible, this section describes how the simple ngram features and the

sociolinguistic features used for the conversational speech genre extend to email.

10.10.1 Corpus Details

The Enron corpus consists of email data, mostly from the senior management of Enron.

The corpus is organized into different folders and all the emails from “sent” folder

were used for gender classification. While the dataset does not contain explicit gender

markings, a manual examination and annotation for a subset of users resulted in


John,

Regarding the employment agreement, Mike declined without a counter. Keith said he would sign for $75K cash/$250 equity. I still believe Frank should receive the same signing incentives as Keith.

Figure 10.7: Example of an email sent by a male sender in Enron corpus. The

header and signature information containing the sender’s name are removed and only

the body of the email is used for gender classification.

unambiguous gender labels for 90 users, out of which 54 were male and 30 were female.

Only the emails with more than 20 words were utilized as part of training/test emails.

The emails were further cleaned by removing the header information containing the

sender’s name and other details from the corpus and only the email text was utilized.

Any content that was not written by the sender as a part of a reply-to or forward

message was also removed to a large extent using simple heuristics. Also, for fairness,

the name of the sender at the end of the email was also removed. The resulting

training and test sets after preprocessing consisted of 1579 and 204 emails respectively.


10.10.2 Features

The ngram feature vectors were computed as in Section 10.4 with TF.IDF weight-

ing10. In addition to ngram features, a subset of sociolinguistic features that could

be extracted for email, namely, % of pronoun usage, % of passive usage, % of modal

auxiliaries, % of subordinate clauses, type-token ratio, % of “yeah” occurrences, % of

WH-question words and mean word length were also utilized11.
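A minimal TF.IDF sketch under the email-as-document convention of footnote 10. The tokenization and the exact log-based formula are assumptions, since the thesis does not spell them out:

```python
import math
from collections import Counter

# Hedged sketch of TF.IDF weighting, treating each email as a document.
# tf is the raw term count; idf = log(N / df); both are common choices,
# not necessarily the exact variant used in the thesis.

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf*idf} dict per doc."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

emails = [["meeting", "at", "noon"], ["meeting", "agenda", "attached"]]
vecs = tfidf_vectors(emails)
# A term appearing in every email ("meeting") gets idf 0; a term unique to
# one email ("noon") gets idf log(2).
print(round(vecs[0]["noon"], 3))
```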

10.10.3 Results

Table 10.9 shows the performance for gender classification on email data. Based

on the prior distribution, always guessing the most likely class (“male”) results in

63.2% accuracy. It can be seen that the Boulis and Ostendorf (2005) model based on

all ngram features yields a reasonable performance of 76.78%. It is also interesting to

see that the accuracy drops to 74.61% when the names of other persons mentioned in

the email are removed12. Also, it can be seen that % of sub-ordinate clauses, mean

word length, type-token ratio and % of pronouns help in improving the performance

gain further, yielding 80.5% accuracy.

10The inverse document frequency was computed with the notion of an email as a document.

11The partner-based features were not utilized due to problems with data sparsity: the receiver’s message deleted in the reply, replies to a group, unidentified receivers, and the high noise in extracting the receiver’s reply even when available. Furthermore, the collection of cleaned and annotated emails consisted of only 90 authors, resulting in a significant overlap between the receivers of the test messages and the senders of training messages. Nevertheless, even the gains from some of the sociolinguistic features, as seen from Table 10.9, are quite promising.

12The name of the sender is always removed from the email; this is a consistent preprocessing step in all the experiments on email data.


Model | Acc. | Error Reduction

Enron Email Corpus (63.2% sides are male)
Ngrams with person names removed | 74.61 | Ref.
All Ngrams | 76.78 | Ref.
+ % of subord-clauses, mean word length, type-token ratio | 80.19 |
+ % of pronouns | 80.50 | 16.02%

Table 10.9: Application of the ngram model and sociolinguistic features for gender classification in a new genre (email)

10.10.4 Analysis

The top part of Table 10.10 with heading “using all ngrams as features” shows

the most discriminative ngram features. Even though the name of the sender was

removed from both the email header and body13, the top ngram features show that

names of other people mentioned in email are quite discriminative of the gender of the

sender. For instance, the common male names such as “jeff, john, jim” occur as top

ngram features for male senders and the common female names such as “kim, susan,

sara” occur as top ngram features for female senders. It is also interesting to see

“m” ranked highly (and also “j” in the bottom part of the table), indicating that male senders tend

to use first-name initials as a common signature at the end of the message.

In order to gain further insight into the differences due to lexical usage, all the names

present in the US Census database were removed from the email and a second ngram-

based SVM model was trained using the remaining ngram features. The bottom part

of Table 10.10 shows the resulting top ngrams after training the SVM model using

13Usually part of the signature at the end of the body.


this reduced feature set. It can be seen that the gender-neutral pronoun “it” is

common among males and the gender-specific pronoun “she” is more common among females,

as also observed in sociolinguistic studies of discourse. It is also interesting to see

the unigram “hi” as a strong feature for female senders and the short “bt (best)”

signature for male senders.

Note that the removal of person names was performed in an overly aggressive manner

by filtering out all names that occur in the US Census database, resulting in a

deliberately conservative approach. As a result, some words such as “Best” that are both person

names and common words will get removed from discourse even when not used as a

named entity for referring to a person. Nevertheless, the results obtained using this

approach are still reasonable (Table 10.9) and a more appropriate model of named

entity detection can further improve performance.
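The aggressive name-filtering step can be sketched as follows; the name set here is a tiny illustrative stand-in for the US Census name database:

```python
# Sketch of the aggressive name-removal preprocessing described above:
# every token that appears in the name list is filtered out, even when it
# is being used as a common word rather than a person's name.

CENSUS_NAMES = {"best", "kim", "john", "mark"}   # illustrative subset

def remove_person_names(tokens):
    return [t for t in tokens if t.lower() not in CENSUS_NAMES]

print(remove_person_names(["Best", "regards", ",", "John"]))
# note "Best" is removed even though it is used here as a common word
```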

10.11 Modeling Other Attributes

While gender has been studied heavily in the literature, other speaker attributes

such as age and native/non-native status also correlate highly with lexical choice and

other non-lexical features. The ngram-based model of Boulis and Ostendorf (2005)

was applied, along with the improvements from the partner-sensitive model and richer

sociolinguistic features, to a binary classification of the age of the speaker and to

classifying speakers as native vs. non-native speakers of English.


Using all ngrams as features
Male ngram | weight | Female ngram | weight
best | 1.2735 | kim | -1.8831
bt | 1.2650 | susan | -1.6136
it | 1.1549 | sara | -1.4759
jeff | 1.0164 | tana | -1.4612
john | 0.9985 | guaranty | -1.2856
jim | 0.9624 | master | -1.2821
m | 0.9354 | fax | -1.2737
to mark | 0.8516 | credit | -1.2211
book | 0.8472 | thanks kim | -1.1441
debra | 0.8087 | shelley | -1.0857
mark attached | 0.7366 | lindy | -1.0199
kate | 0.7254 | mark | -1.0049
market | 0.7037 | meeting | -0.9949
talked | 0.7008 | stephanie | -0.9810
andy | 0.6885 | to tana | -0.9798
barry | 0.6840 | form | -0.9697
lavo | 0.6780 | she | -0.9351
is a | 0.6726 | who | -0.9186
regards | 0.6697 | carol | -0.8964
set | 0.6650 | ss | -0.8888

Retraining after removing person names
bt | 1.8437 | fax | -2.0861
it | 1.8297 | hi | -1.7447
m | 1.4442 | guaranty | -1.6986
talked | 1.3735 | thanks | -1.6003
regards | 1.2503 | ss | -1.5832
think | 1.0448 | she | -1.4513
positions | 1.0018 | tw | -1.3503
i think | 1.0008 | request | -1.3384
set | 0.9963 | who | -1.3080
lavo | 0.9885 | counterparty | -1.2976
this we | 0.9693 | agreements | -1.2810
j | 0.9332 | copies | -1.2674
gas | 0.9324 | fyi | -1.2642
make | 0.9224 | agreement | -1.2627
that it | 0.9111 | send | -1.1894
very | 0.9029 | one | -1.1848
i need | 0.8982 | since | -1.1829
send a | 0.8932 | handle | -1.1246
is a | 0.8890 | with | -1.1073
month | 0.8838 | meeting | -1.1021

Table 10.10: Top 20 ngram features for gender classification in email, ranked by the weights assigned by the linear SVM model. See Section 10.10.4 for more details.


Figure 10.8: Empirical differences in sociolinguistic features for Age. Younger speak-

ers tend to use short utterances, pronouns and auxiliaries more often than older

speakers.


10.11.1 Corpus details for Age and

Native Language

For age, the same training and test speakers were used from the Fisher corpus as

explained for gender in Section 10.3, with age binarized into greater-than vs. less-than-

or-equal-to 40 for a parallel binary evaluation. For predicting native/non-native

status, the 1156 non-native speakers in the Fisher corpus were pooled

with a randomly selected equal number of native speakers. The training and test par-

titions consisted of 2000 and 312 speakers respectively, resulting in 3267 conversation

sides for training and 508 conversation sides for testing.

10.11.2 Results for Age and Native/Non-Native

Based on the prior distribution, always guessing the most likely class for age (age

less-than-or-equal-to 40) results in 62.59% accuracy and always guessing the most

likely class for native language (non-native) yields 50.59% accuracy.

Table 10.11 shows the performance of the models discussed in this chapter for age

and native/non-native speaker status. It can be seen that the ngram-based approach

of Boulis and Ostendorf (2005) for gender also gives reasonable performance on other

speaker attributes, and more importantly, both the partner-sensitive model and so-

ciolinguistic features help in reducing the error rate on age and native language sub-

stantially, indicating their usefulness not just on gender but also on other diverse


Model | Accuracy

Age (62.6% of sides have age <= 40)
Ngram Model | 82.27
+ Partner Model | 82.77
+ % of passive, mean inter-utterance time, % of pronouns | 83.02
+ % of “yeah” | 83.43
+ type/token ratio, % of lipsmacks | 83.83
+ % of auxiliaries, % of short utterances | 83.98
+ % of “mm” | 84.03
(Reduction in Error) | (9.93%)

Native vs Non-native (50.6% of sides are non-native)
Ngram | 76.97
+ Partner | 80.31
+ Mean word length | 80.51
(Reduction in Error) | (15.37%)

Table 10.11: Results showing improvement in the accuracy of age and native language classification using the partner-sensitive model and sociolinguistic features

latent attribute classifications.

10.11.3 Analysis

Table 10.12 shows the most discriminative ngram features for binary classification

of age. It is interesting to see the use of “well” right at the top of the list for older speakers,

also found in the sociolinguistic studies for age (Macaulay, 2005). One can also see

that older speakers talk about their children (“my daughter”) and younger speakers

talk about their parents (“my mom”); the use of words such as “wow”, “kinda” and

“cool” is also common in younger speakers. To give maximal consistency/benefit

to the Boulis and Ostendorf (2005) n-gram-based model, the self-reporting n-grams

such as “im forty”, “im thirty” were not filtered, putting the sociolinguistic-literature-


Age >= 40 | weight | Age < 40 | weight
well | 0.0330 | im thirty | -0.0266
im forty | 0.0189 | actually | -0.0262
thats right | 0.0160 | definitely | -0.0226
forty | 0.0158 | like | -0.0223
yeah well | 0.0153 | wow | -0.0189
uhhuh | 0.0148 | as well | -0.0183
yeah right | 0.0144 | exactly | -0.0170
and um | 0.0130 | oh wow | -0.0143
im fifty | 0.0126 | everyone | -0.0137
years | 0.0126 | i mean | -0.0132
anyway | 0.0123 | oh really | -0.0128
isnt | 0.0118 | mom | -0.0112
daughter | 0.0117 | im twenty | -0.0110
well i | 0.0116 | cool | -0.0108
in fact | 0.0116 | think that | -0.0107
whether | 0.0111 | so | -0.0107
my daughter | 0.0111 | mean | -0.0106
pardon | 0.0110 | pretty | -0.0106
gee | 0.0109 | thirty | -0.0105
know laughter | 0.0105 | hey | -0.0103
this | 0.0102 | right now | -0.0100
oh | 0.0102 | cause | -0.0096
young | 0.0100 | im actually | -0.0096
in | 0.0100 | my mom | -0.0096
when they | 0.0100 | kinda | -0.0095

Table 10.12: Top 25 ngram features for Age ranked by weights assigned by the linear SVM model


based and discourse-style-based features at a relative disadvantage.

Among the sociolinguistic features, adding % of passive usage, mean inter-utterance time, % of pronouns, % of "yeah", type-token ratio, % of lipsmacks, % of auxiliaries, % of short utterances and % of "mm" shows improvement in performance. Figure

10.8 shows the empirical distributions for some of the sociolinguistic features for

age. For native vs non-native language, mean word length is a strong feature and

it was observed that native speakers tend to have larger mean word length (2.34) as

compared to non-native speakers (1.61).

10.12 Regression Models

While the binary model for age described in the previous section gives an insight into what features are indicative of age, it is only a step towards predicting the real age of the speaker. A better approach is to use a regression framework, which allows for greater reduction in entropy when predicting age. The following sub-sections explain the different regression models and their performance on this task. The same training and test speakers from the Switchboard corpus were used, as explained for gender in Section 10.3.


Figure 10.9: Age histograms for training and test speakers of Switchboard corpus

indicating unbalanced age groups of the participating speakers.


10.12.1 Evaluation Measures

Table 10.13 reports both the mean absolute error and mean squared error defined

as follows:

Mean absolute error (MAE) = (1/n) * sum_{i=1}^{n} |y_i − ŷ_i|        (10.1)

Mean squared error (MSE) = (1/n) * sum_{i=1}^{n} (y_i − ŷ_i)^2       (10.2)

where n is the number of instances, and y_i and ŷ_i are the true age and predicted age of instance i respectively.
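These two measures can be computed directly from lists of true and predicted ages. The sketch below is illustrative (the function and variable names are not from the thesis):

```python
def mean_absolute_error(true_ages, predicted_ages):
    """MAE: average of |y_i - yhat_i| over all n instances (Eq. 10.1)."""
    n = len(true_ages)
    return sum(abs(y - yhat) for y, yhat in zip(true_ages, predicted_ages)) / n

def mean_squared_error(true_ages, predicted_ages):
    """MSE: average of (y_i - yhat_i)^2 over all n instances (Eq. 10.2)."""
    n = len(true_ages)
    return sum((y - yhat) ** 2 for y, yhat in zip(true_ages, predicted_ages)) / n
```

MSE penalizes large misses more heavily than MAE, which is why the two measures can rank models differently in the tables that follow.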

10.12.2 Baseline Approaches

Simple predictors for age based on the prior distribution are the median and average statistics. As shown in Table 10.13, predicting the median age (38) for every speaker yields a mean absolute error of 8.41 and a mean squared error of 111.98. Furthermore, using the average age instead of the median age results in a mean absolute error of 8.61 and a mean squared error of 108.01.
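A constant-prediction baseline of this kind can be sketched as follows (illustrative code; `statistics.median` and `statistics.fmean` stand in for the corpus statistics):

```python
import statistics

def constant_baseline_errors(train_ages, test_ages, statistic="median"):
    """Evaluate a constant-age predictor (training median or mean) on a test
    set, returning (mean absolute error, mean squared error)."""
    prediction = (statistics.median(train_ages) if statistic == "median"
                  else statistics.fmean(train_ages))
    abs_errors = [abs(y - prediction) for y in test_ages]
    sq_errors = [(y - prediction) ** 2 for y in test_ages]
    return sum(abs_errors) / len(test_ages), sum(sq_errors) / len(test_ages)
```

Note that the median minimizes expected absolute error while the mean minimizes expected squared error, which matches the pattern in Table 10.13: the median baseline has the lower MAE and the average baseline the lower MSE.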

10.12.3 Ngram-based Regression Model

Based on the promising results using ngrams for binary classification of age, results for an SVM-based regression model are also reported, utilizing all the ngrams and resulting in 57,914 features. The regression model utilized was based on support vector machines


(Vapnik, 1995) along with the optimizations described in (Joachims, 1999; Joachims,

2002).

The row labeled “Ngram-based model” in Table 10.13 reports the performance of

this model, resulting in mean absolute error of 7.15 and mean squared error of 79.80

respectively.

10.12.4 Sociolinguistic Features

The rows labeled “+ Socioling.” in Table 10.13 show the results for using soci-

olinguistic features along with the score from other models, combined using a meta-

classifier. One can see that using regression trees as a meta-classifier results in an

improved performance (MAE: 7.06) as compared to SVMs. This may be due to

increased robustness of regression trees in dealing with widely varying feature types

for hypothesizing decision boundaries.

10.12.5 Top Ngram Features

Since the number of ngram features results in a high dimensional feature space,

a simple method of feature selection is to use the ngrams with high positive and

negative weights assigned by the ngram-based SVM model. The row labeled "Top 40 N-grams Only" shows the regression performance based on just these 40 lexical features. These 40 ngram features are also shown in Table 10.14; based on how the SVM


Model                                                      MAE    MSE
Median age (38 years)                                      8.41   111.98
Average age (39.86 years)                                  8.61   108.01
Ngram-based model                                          7.15   79.80
Top 40 N-grams Only                                        7.81   91.96
SVM Stacking
  Ngram + Socioling.                                       7.15   79.74
  Binary Splits                                            6.45   67.32
  Binary Splits + Socioling.                               6.25   63.46
  Ngrams + Socioling. + Binary Splits                      7.15   79.74
  Ngrams + Socioling. + Binary Splits + Top 40 Ngrams      7.14   79.64
Regression Tree Stacking
  Ngram + Socioling.                                       7.06   78.08
  Binary Splits                                            7.33   104.24
  Binary Splits + Socioling.                               7.38   104.04
  Ngrams + Socioling. + Binary Splits                      7.06   78.08
  Ngrams + Socioling. + Binary Splits + Top 40 Ngrams      7.06   78.08

Table 10.13: Results for age regression using different feature and model combinations. Substantial performance gains were obtained by utilizing binary classifiers across different age boundaries as features in a stacked SVM model.

models are trained, features with higher weight are predictive of older speakers and vice versa. This can also be seen from the presence of ngrams such as "grandchildren", "son", "i see" as features with high positive weights and ngrams such as "oh really", "my parents", "pretty much", "yeah i", etc. as features with high

negative weights.

While performance drops as compared to using all 57,914 ngram features, giving a mean absolute error of 7.81, it is still reasonable by comparison.

The primary motivation behind reducing the number of ngram features was to include

them directly in the stacked model and allow for prominent ngrams to influence the

stacked model. The rows labeled "+ Top 40 Ngrams" report performance when


Top +ve features    Weight     Top -ve features    Weight
umhum umhum         0.1454     oh really          -0.1342
isnt                0.1252     oh no              -0.1070
yeah right          0.1189     definitely         -0.1053
anyhow              0.1103     laughteryeah       -0.0982
and um              0.1091     umhum yeah         -0.0973
you mean            0.0998     agree              -0.0917
dallas              0.0957     also               -0.0902
thats right         0.0953     exactly            -0.0883
uhhuh uhhuh         0.0937     because1           -0.0852
son                 0.0897     yeah i             -0.0850
right uhhuh         0.0864     um                 -0.0810
course              0.0853     pretty much        -0.0802
i think             0.0848     do do              -0.0795
yes i               0.0829     well um            -0.0781
isnt it             0.0814     pretty             -0.0780
grandchildren       0.0800     my parents         -0.0776
i see               0.0800     right              -0.0773
i have              0.0796     research           -0.0767
i say               0.0795     only               -0.0751
i just              0.0786     the school         -0.0748

Table 10.14: Top 20 ngram features for Age ranked by weights assigned by the ngram-based SVM regression model.

adding these ngrams as 40 additional features in the stacked model. However, no

performance gains were obtained via this combination.
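This weight-based selection can be sketched as follows (an illustrative sketch: the thesis selects 40 ngrams from a trained linear SVM, here represented as a plain feature-to-weight map):

```python
def top_k_features(weights, k=40):
    """Select the k ngrams with the most positive and most negative weights
    from a trained linear model's {feature: weight} map (k/2 from each end)."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    most_negative = ranked[: k // 2]   # strongest younger-speaker evidence
    most_positive = ranked[-(k // 2):] # strongest older-speaker evidence
    return [feature for feature, _ in most_negative + most_positive]
```

Features with near-zero weight contribute little to a linear model's decision, so dropping them yields a compact feature set that can be fed directly into the stacked model.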

10.12.6 Multiple Binary Classifiers Across

Different Age Boundaries

Training multiple binary SVM models as compared to full regression models can

be helpful due to the much reduced hypothesis space and improved accuracy of the


individual classifiers [15]. This model explores the use of multiple ngram-based binary

classifiers by windowing different age groups. The outputs of these classifiers are then used as features in the regression model, resulting in a much lower dimensionality as compared to the ngram-based regression model. Based on the assumption that the points

closer to the decision boundary are harder to classify, a windowing-based approach

was used resulting in the following three binary classifiers:

1. age < 30 vs age > 40, with standalone performance of 72.09%.

2. age < 40 vs age > 50, with standalone performance of 74.96%.

3. age < 50 vs age > 60 [16], with standalone performance of 95.72%.

Each of these binary classifiers was trained using the ngram features. The rows labeled "Binary Splits" show the performance of using only these three features in the stacking models. Using SVM as the stacking model for these three classifiers as features results in substantially better performance than the previous models with

a mean absolute error of 6.45 and mean squared error of 67.32. Further adding

sociolinguistic features results in the best performance, shown in the rows labeled "Binary Splits + Socioling." under the SVM stacking model. In particular, with these added

features the mean absolute error is reduced to 6.25 and mean squared error is reduced

to 63.46.

[15] The original characterization of SVMs was also with respect to binary classification, making them more suitable for making binary predictions.

[16] The next windowing model of age < 60 vs age > 70 was not utilized since there were no instances with age > 70.
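The windowed binary splits above can be sketched as follows (illustrative code: the window bounds come from the list above, while the function names and example scores are assumptions):

```python
# Each (low, high) pair defines one binary task: speakers with age < low are
# one class, speakers with age > high the other, and the ambiguous middle
# band (low <= age <= high) is excluded from that classifier's training data.
WINDOWS = [(30, 40), (40, 50), (50, 60)]

def window_label(age, low, high):
    """Return 0/1 for a trainable instance, or None if the age falls in the
    excluded band near the decision boundary."""
    if age < low:
        return 0
    if age > high:
        return 1
    return None

def stacked_features(age_scores):
    """Assemble the meta-regressor's input: one score per binary split
    (hypothetical classifier scores, e.g. signed SVM margins)."""
    return [age_scores[window] for window in WINDOWS]
```

Excluding the middle band means each classifier only sees instances it can separate confidently, which is consistent with the high standalone accuracies reported for the individual splits.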


10.12.7 Stacked Models

Stacking or Stacked generalization (Wolpert, 1992) is an approach for constructing

ensemble or committee of classifiers. A classifier ensemble or committee is a set of clas-

sifiers whose individual decisions are combined to classify new instances (Dietterich,

1997). Stacking is a specific instance of ensemble classification where a higher-level or

a meta classifier is utilized for combining the output of multiple classifiers. Using the

analogy of a committee, the meta classifier is the president of the committee and the individual classifiers are the committee members. The motivation for stacking

is that different committee members make different classification errors and hence a

president can learn when to trust each of the members depending upon the nature of

instance to be classified. The following sections describe the use of stacking for age

regression using a linear model (an SVM with a linear kernel) and a regression tree.
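The stacking scheme can be sketched generically: each base model scores an instance, and the meta model combines those scores. All names here are illustrative, not the thesis implementation:

```python
def stack_predict(base_models, meta_model, instance):
    """Stacked generalization: each base model (committee member) scores the
    instance, and the meta model (the 'president') combines those scores."""
    committee_scores = [model(instance) for model in base_models]
    return meta_model(committee_scores)
```

For example, with two toy base models and a mean-combining meta model:

```python
base_models = [lambda x: x + 1, lambda x: 2 * x]
meta_model = lambda scores: sum(scores) / len(scores)
stack_predict(base_models, meta_model, 10)  # scores [11, 20] -> 15.5
```

In the experiments above, the base models are the ngram regression model and the binary age-split classifiers, and the meta model is either a linear-kernel SVM or a regression tree trained on their outputs.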

10.12.7.1 Linear Combination

This approach uses a linear kernel SVM as a meta-regression for combining the

score from ngram-based lexical model, sociolinguistic features, top ngram features and

scores from multiple binary classifiers across different age boundaries. The results for

using various combinations of these features and classifier scores are shown under the

“SVM Stacking” category in Table 10.13. The row labeled “Ngram + Socioling.”

describes the result for using the output of the SVM Ngram based regression model

and sociolinguistic features in a meta SVM regression model. The models utilizing “+


Top 40 Ngrams” describe the performance of using top 40 ngram features as described

in Section 10.12.5 along with other features/scores in the meta SVM regression model.

Finally, the rows labeled "+ Binary Splits" describe the performance of using

output from multiple binary classifiers across different age boundaries along with

other features/scores.

10.12.7.2 Regression Trees

This approach uses regression trees as a meta-regression for combining the score

from ngram-based lexical model, sociolinguistic features, top ngram features and

scores from multiple binary classifiers across different age boundaries. A regression

tree has traditionally been used for combining a small number of signals and hence

can be easily utilized as a meta classifier for various combinations of stacked models.

The REPTree package from Weka [17] (Witten and Frank, 2005) was utilized as the

model for regression tree. REPTree builds a regression tree using the information

gain criterion and uses reduced-error pruning for generalizing the tree. The various

combinations described under “Regression Tree Stacking” category in Table 10.13 are

stacked in a similar fashion as explained in SVM stacking section above.

[17] Weka (Witten and Frank, 2005) is a machine learning toolkit written in Java. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/


Figure 10.10: Stacking approach for age regression utilizing binary classifiers across different age boundaries and sociolinguistic features as individual components. (The figure feeds sociolinguistic features — mean utterance length, active/passive ratio usage, % of modal auxiliaries, type-token ratio, etc. — and the three ngram-based binary age-split classifiers into either an SVM regression model or a regression tree.)


Model                                                      MAE    MSE
Baseline1 (Median: 40.5)                                   9.74   130.29
Baseline2 (Average: 41.19)                                 9.75   129.81
Ngram-based model                                          7.63   88.37
SVM Stacking
  Binary Splits                                            7.17   80.27
  Binary Splits + Socioling.                               6.94   75.69
  Ngrams + Socioling. + Binary Splits + Top 40 Ngrams      7.63   88.37
Regression Tree Stacking
  Binary Splits                                            7.87   116.42
  Binary Splits + Socioling.                               7.95   116.21
  Ngrams + Socioling. + Binary Splits + Top 40 Ngrams      7.53   86.49

Table 10.15: Results for age regression using different feature and model combinations for the age-wise balanced test set. While the performance of the baseline models degrades due to higher variance, regression models show consistent performance improvements as in Table 10.13.

10.12.8 Balancing Size of Different Age Groups in

Test Set

The age sample of speakers that participated in the telephone conversation experiment could be biased due to age limit restrictions, location of sampling, etc. The

age histograms of training and test speakers are shown in Figure 10.9 and one can

see the characteristic age clusters leading to an unbalanced age sample. In order to

obtain a fair performance estimate on a more balanced age distribution set, the test

set was filtered to create a more balanced distribution across different age groups

using a threshold as shown in Figure 10.11. The results of various regression models

on this balanced test set are shown in Table 10.15. While the baselines result in a

lower performance as the age distribution is more uniform, the row labeled “Binary


Figure 10.11: Histograms for different age groups in the test set. The horizontal

line shows the threshold for balancing the size of test set across different age groups,

retaining a total of 600 examples.

Splits + Socioling." still gives the best performance, similar to that on the original biased sample.
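The group-capping used to balance the test set can be sketched as follows (a minimal sketch; `bucket_size` and `threshold` are illustrative parameters — the actual balanced set retained 600 examples, per Figure 10.11):

```python
from collections import defaultdict

def balance_by_age_group(speakers, bucket_size=10, threshold=100):
    """Cap each age bucket (e.g. 30-39) at `threshold` speakers so that no
    single age group dominates the test set."""
    buckets = defaultdict(list)
    for speaker_id, age in speakers:
        buckets[age // bucket_size].append((speaker_id, age))
    balanced = []
    for group in sorted(buckets):
        balanced.extend(buckets[group][:threshold])
    return balanced
```

Truncating each bucket at a common threshold flattens the age histogram, so the constant baselines lose the advantage they enjoyed on the skewed original distribution.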

10.13 Effect of Self-Reporting Features

on Age Prediction

A speaker may sometimes state his or her age during the conversation, saying "I'm thirty two", for example. This may lead to artificial inflation of the results and may cause the


Model                              Mean Absolute Error   Mean Squared Error
Ngram-based model                  7.15                  79.80
Deleting self-reporting features   7.13                  79.90

Table 10.16: Self-reporting features such as "in thirties", "i'm forty five", etc. have little impact. The performance after deleting such features is similar to the original model containing all ngrams as features.

regression model to over-depend on such self-reporting features. In order to study the impact of this effect on performance, all the self-reporting ngrams were removed from the feature set and the regression model was retrained using the remaining features. The results in Table 10.16 show that such features have negligible

impact on performance, indicating the robustness of general discourse features in

predicting age.
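The filtering step can be sketched as follows; the regular expression is an illustrative assumption, since the exact pattern set used in the experiments is not listed here:

```python
import re

# Illustrative pattern for self-reporting ngrams such as "im forty" or
# "in thirties"; the real filter may have covered more variants.
SELF_REPORT = re.compile(
    r"\bim (twenty|thirty|forty|fifty|sixty)\b"
    r"|\bin (my )?(twenties|thirties|forties|fifties)\b")

def drop_self_reporting(ngram_features):
    """Remove ngram features that directly state the speaker's age."""
    return [ngram for ngram in ngram_features if not SELF_REPORT.search(ngram)]
```

Retraining on the filtered feature set isolates the contribution of general discourse style from that of explicit age statements.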

10.14 Statistical Significance of Results

This section analyzes the statistical significance of results reported in Tables 10.4,

10.5, 10.8, 10.9, 10.11 and 10.13. For per conversation results reported in Table 10.4,

using a binomial test of sample sizes 2008 (Fisher) and 808 (Switchboard), and base-

line accuracies of 90.84% (Fisher) and 90.22% (Switchboard), any resulting accuracy over 91.88% (Fisher) and 91.96% (Switchboard) is statistically significant with

a p-value less than 0.05.

For speaker-wise aggregate results reported in Table 10.5, using a binomial test with

sample sizes of 1000 (Fisher) and 100 (Switchboard) speakers, any resulting accuracy


over 92% (Fisher) and over 97% (Switchboard) is statistically significant with a p-value less than 0.05.

For Arabic gender classification results in Table 10.8, using a binomial test with sam-

ple size 200 and baseline accuracy 92%, any resulting accuracy over 95% is statistically

significant with p-value less than 0.05.

For gender classification results on Email in Table 10.9, using a binomial test with

sample size 204 and baseline accuracy 76.78%, any resulting accuracy over 81.37% is

statistically significant with a p-value less than 0.05.

For results on binary classification of age reported in Table 10.11, using a binomial

test with sample size 2008 and baseline accuracy of 82.27%, any resulting accuracy

over 83.66% is statistically significant with a p-value less than 0.05.

For results on native vs non-native speaker reported in Table 10.11, using a binomial

test with sample size 508 and baseline accuracy of 76.97%, any resulting accuracy

over 79.92% is statistically significant with a p-value less than 0.05.
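The one-sided exact binomial test behind these thresholds can be sketched with the standard library (illustrative code; alpha = 0.05 as in the text):

```python
from math import comb

def binomial_p_value(n, successes, baseline_rate):
    """One-sided exact binomial test: probability of observing at least
    `successes` correct out of n under the baseline accuracy (the null)."""
    return sum(comb(n, k) * baseline_rate**k * (1 - baseline_rate)**(n - k)
               for k in range(successes, n + 1))

def significance_threshold(n, baseline_rate, alpha=0.05):
    """Smallest accuracy whose one-sided p-value falls below alpha."""
    for successes in range(int(n * baseline_rate), n + 1):
        if binomial_p_value(n, successes, baseline_rate) < alpha:
            return successes / n
    return 1.0
```

For instance, with 100 trials against a 50% baseline, the threshold works out to 59%; the larger sample sizes above explain why the quoted thresholds sit only one to three points above their baselines.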

10.15 Conclusion

This chapter has presented and evaluated several original techniques for the la-

tent classification of speaker gender, age and native language in diverse genres and

languages. A novel partner-sensitive model shows performance gains from the joint


modeling of speaker attributes along with partner speaker attributes, given the dif-

ferences in lexical usage and discourse style such as observed between same-gender

and mixed-gender conversations. The robustness of the partner-sensitive model is

substantially supported based on the consistent performance gains achieved in di-

verse languages and attributes. This chapter has also explored a rich variety of novel

sociolinguistic and discourse-based features, including mean utterance length, pas-

sive/active usage, percentage domination of the conversation, speaking rate and filler

word usage. In addition to these novel models, this work also shows how these mod-

els and the previous work extend to new languages and genres. Cumulatively up to

20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005)

algorithm for classifying individual conversations on Switchboard, and accuracy for

gender detection on the Switchboard corpus (aggregate) and Gulf Arabic exceeds

95%.


Chapter 11

Contributions and Conclusion

This dissertation has presented several scientific contributions to the areas of trans-

lation lexicon induction, semantic knowledge extraction and textual fact extraction.

In particular, the following is a brief summary of some of the distinct scientific con-

tributions contained herein.

Chapter 3

1. Fluent, Non Compositional Translation of Compound Words: This is

the first work on fluent, non-compositional translation of compound words via

cross-language transitivity as opposed to compositional (or “glossy”) translation

methods used in the previous literature. Successful translation of compounds

can be achieved without the need for bilingual training text, by modeling the

mapping of literal component-word glosses (e.g. “iron-path”) into fluent English

(e.g. “railway”) across multiple languages.


2. Modeling Sequence of Compound Components and

Compound Morphology: Performance of compound translation is further

improved by adding component-sequence and learned-morphology models along

with context similarity from monolingual text and optional combination with

traditional bilingual-text-based translation discovery.

3. Application to Diverse World Languages: This is the first known work on

compound translation induction to be evaluated broadly on 10 diverse languages

(Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hungarian, Russian, Slo-

vak and Swedish), showing robust language-independence as compared to pre-

vious literature that has focused on using fixed syntactic patterns modeling the

compounding phenomena of one or two languages in question. The models de-

veloped in this dissertation show consistent performance gains in translation

accuracy across all these languages.

Chapter 4

4. Dependency Contexts for Translation Lexicon Induction: While depen-

dency contexts have been successfully used for monolingual natural language

processing tasks, this is the first work to report their contribution to trans-

lation lexicon induction. In addition to providing empirical gains, this work

clearly shows why such richer contexts are helpful with respect to modeling

long-distance relationships and word-reordering.


5. Reducing Entropy of Candidate Space via Mapping Part-of-speech

Tagsets: This is the first work in the minimally supervised translation lexi-

con literature to show how preserving a word’s part of speech in translation can

improve performance and provided a mechanism for mapping part-of-speech

tagsets automatically. Such mapping was used to restrict the candidate space

(which can be large depending on the size of the vocabulary), allowing monolingual corpus-based methods for translating words to be improved across all part-of-speech categories.

Chapter 6

6. Learning Semantic Taxonomy in Multiple Languages using Information

from Different Relationship-types: This work provides a novel minimal-

resource algorithm for the acquisition of multilingual lexical taxonomies (in-

cluding hyponymy/hypernymy and meronymy) using evidence from multiple

relationship-types.

This is also the first work to show successful application of corpus-based meth-

ods for fact extraction to Hindi and the robustness of this approach is shown by

the fact that the unannotated Hindi development corpus was only 1/15th the

size of the utilized English corpus.

7. Semantic Taxonomy as Transitive Bridge for

Translation Lexicon Induction: This is the first work to present a novel


model of unsupervised translation lexicon induction via multilingual transitive

models of hypernymy and hyponymy, using corpus-based induced taxonomies.

Chapter 7

8. Extracting Natural Parents in the Hypernymy Chain for Definite

Anaphora Resolution: This chapter presents a successful solution to the

problem of identifying natural hypernyms for definite anaphora resolution, by

illustrating a simple and noisy corpus-based approach globally modeling head-

word co-occurrence around likely anaphoric definite NPs as compared to the

approaches in previous literature utilizing standard Hearst-style patterns for extracting hypernyms to identify likely antecedents.

9. Generation of Definite Anaphors using Natural Parents: This is the first

work in the coreference modeling literature to present a perspective on gen-

erating definite anaphors using natural parents extracted from corpus-based

methods. On this much harder anaphora generation task, where the stand-

alone WordNet-based model only achieved an accuracy of 4%, the corpus-based

models can achieve 35%-47% accuracy on blind exact-match evaluation.

Chapter 9

10. Global Document-level Structural Model: This is the first work illustrat-

ing a global structural model for biographic fact extraction utilizing absolute


and relative document-wide positions as opposed to modeling local contextual

patterns.

11. Transitive Model: Another property exploited in a novel manner in this work

is the tendency for individuals occurring together in an article to have related

attribute values. Based on this intuition, a transitive model was implemented

that predicts attributes based on consensus voting via the extracted attributes

of neighboring names.

12. Correlation-based Model: This work also presents the novel use of correla-

tions between attributes for biographic fact extraction, learning compatible and

incompatible inter-attribute pairings. The motivation here is that the attributes

(such as nationality and religion) are not independent of each other when mod-

eled for the same individual, leading to performance gains via exploiting this

correlation.

Chapter 10

13. Modeling Partner-Effect: This work is the first to show performance gains

from the novel modeling of speaker attributes sensitive to partner speaker at-

tributes, given the differences in lexical usage and discourse style such as ob-

served between same-gender and mixed-gender conversations.

14. Use of Sociolinguistic Features: In contrast to lexical n-gram focused models

developed for gender prediction in the computational linguistics literature, this


work explores a rich variety of novel sociolinguistic and discourse-based features,

including features such as mean utterance length, passive/active usage ratio,

percentage domination of the conversation, speaking rate and filler word usage.

15. Application to Variety of Attributes: This work shows how the lexical

models of gender classification in the previous literature can be extended to

Age and Native vs. Non-native prediction, with further improvements gained

from partner-sensitive models and novel sociolinguistic features.

11.1 Applications and Future Work

The approaches, models and algorithms presented in this dissertation for cross-

language, semantic and factual relationship extraction have broad potential utilization

in a number of major applications:

• Fine-grained information retrieval: Building structured knowledge bases

containing a wide range of relationships allows for powerful query mechanisms

for search. For example, a relational database containing biographic facts

can return results for finding what attributes are common between two peo-

ple. General semantic relationships such as hypernymy can also be helpful for

query expansion. Also, learning translation lexicons is important for improving

query/document term translation in cross language information retrieval.

• Machine translation: Given enough parallel bilingual text, current statis-


tical machine translation systems can learn highly accurate word and phrase

translation lexicons. However, large parallel corpora exist for only a few of the

world’s languages and the methods described in this thesis show several novel

approaches to building translation lexicons without the need for parallel cor-

pora.

• Disambiguation of concepts and named entities: A major problem in

natural language processing and search systems is that a word can refer to

multiple concepts in different languages, and similarly a named entity can have

multiple referent persons, organizations, etc. The extraction of entity attributes

and relations can provide powerful features for ambiguous entity disambiguation

and linking.

• Personalized services/recommendations: Approaches that can extract bi-

ographical attributes from unstructured user content can provide additional

meta information about the user. Meta information such as “gender”, “age”,

“education level”, etc., can enable more personalized user assistance such as

custom news feeds, call routing, book recommendations, etc. based on the

extracted attributes.

• Education: Generating structured repositories of the relationships between

words, their translations in different languages, and distilled facts about enti-

ties can provide a better way for students to learn about domains of current


interest. In addition, a system could detect that the student is a non-native

speaker and use the translingual and synonymy relationships to improve stu-

dent’s vocabulary.


Bibliography

[1] E. Agichtein and L. Gravano. Snowball: extracting relations from large plain-

text collections. In Proceedings of the 5th ACM International Conference on

Digital Libraries, pages 85–94, 2000.

[2] E. Alfonseca, P. Castells, M. Okumura, and M. Ruiz-Casado. A rote extractor

with edit distance-based generalisation and multi-corpora precision calculation.

Proceedings of International Conference on Computational Linguistics and As-

sociation for Computational Linguistics, pages 9–16, 2006.

[3] D.E. Appelt, J.R. Hobbs, J. Bear, D. Israel, and M. Tyson. FASTUS: A finite-

state processor for information extraction from real-world text. In International

Joint Conference on Artificial Intelligence, volume 13, pages 1172–1172, 1993.

[4] S. Argamon, M. Koppel, J. Fine, and A.R. Shimoni. Gender, genre, and writing

style in formal written texts. Text-Interdisciplinary Journal for the Study of

Discourse, 23(3):321–346, 2003.

[5] J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In Proceedings of SemEval, pages 64–69, 2007.

[6] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Ex-

tracting Semantics from Wiki Content. Proceedings of ESWC, pages 503–517,

2007.

[7] A. Bagga and B. Baldwin. Entity-Based Cross-Document Coreferencing Using

the Vector Space Model. In Proceedings of International Conference on Compu-

tational Linguistics and Association for Computational Linguistics, pages 79–

85, 1998.

[8] T. Baldwin and T. Tanaka. Translation by Machine of Complex Nominals: Get-

ting it Right. In Proceedings of the Association for Computational Linguistics

Workshop on Multiword Expressions, pages 24–31, 2004.

[9] M. Berland and E. Charniak. Finding parts in very large corpora. In Proceedings

of the 37th Annual Meeting of the Association for Computational Linguistics,

pages 57–64, 1999.

[10] T. Bocklet, A. Maier, and E. Noth. Age Determination of Children in Preschool and Primary School Age with GMM-Based Supervectors and Support Vector Machines/Regression. In Proceedings of Text, Speech and Dialogue; 11th International Conference, volume 1, pages 253–260, 2008.


[11] C. Boulis and M. Ostendorf. A quantitative analysis of lexical differences be-

tween genders in telephone conversations. In Proceedings of Association for

Computational Linguistics, pages 435–442, 2005.

[12] S. Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop at the 6th International Conference on Extending Database Technology, EDBT98, 1998.

[13] P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty,

R.L. Mercer, and P.S. Roossin. A statistical approach to machine translation.

Computational Linguistics, 16(2):79–85, 1990.

[14] P.F. Brown, V.J. Della Pietra, S.A. Della Pietra, and R.L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.

[15] R.D. Brown. Corpus-driven splitting of compound words. In Proceedings of

TMI, 2002.

[16] S. Buchholz and E. Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Conference on Natural Language Learning, pages 189–210, 2006.

[17] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity


disambiguation. In Proceedings of European Chapter of the Association for

Computational Linguistics, pages 3–7, 2006.

[18] J.D. Burger and J.C. Henderson. An exploration of observable features related

to blogger age. In Computational Approaches to Analyzing Weblogs: Papers

from the 2006 American Association for Artificial Intelligence Spring Sympo-

sium, pages 15–20, 2006.

[19] M. J. Cafarella, D. Downey, S. Soderland, and O. Etzioni. Knowitnow: Fast,

scalable information extraction from the web. In Proceedings of Empirical Meth-

ods in Natural Language Processing and Human Language Technologies, pages

563–570, 2005.

[20] C. Callison-Burch, D. Talbot, and M. Osborne. Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora. In Proceedings of the Association for Computational Linguistics, pages 175–182, 2004.

[21] Y. Cao and H. Li. Base Noun Phrase translation using web data and the EM algorithm. In Proceedings of the International Conference on Computational Linguistics, Volume 1, pages 1–7, 2002.

[22] S. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy

from text. In Proceedings of the 37th Annual Meeting of the Association for

Computational Linguistics, pages 120–126, 1999.


[23] B. Carterette, R. Jones, W. Greiner, and C. Barr. N semantic classes are

harder than two. In Proceedings of Association for Computational Linguistics

and International Conference on Computational Linguistics, pages 49–56, 2006.

[24] S. Cederberg and D. Widdows. Using LSA and noun coordination informa-

tion to improve the precision and recall of automatic hyponymy extraction. In

Proceedings of Conference on Natural Language Learning, pages 111–118, 2003.

[25] C. Cieri, D. Miller, and K. Walker. The Fisher Corpus: a resource for the next

generations of speech-to-text. In Proceedings of LREC, 2004.

[26] H. H. Clark. Bridging. In Proceedings of the Conference on Theoretical Issues

in Natural Language Processing, pages 169–174, 1975.

[27] J. Coates. Language and Gender: A Reader. Blackwell Publishers, 1998.

[28] D. Connolly, J. D. Burger, and D. S. Day. A machine learning approach to

anaphoric reference. In Proceedings of the International Conference on New

Methods in Language Processing, pages 133–144, 1997.

[29] M. Corney, O. de Vel, A. Anderson, and G. Mohay. Gender-preferential text

mining of e-mail discourse. In Proceedings of Annual Computer Security Appli-

cations Conference, pages 21–27, 2002.

[30] J. Cowie, S. Nirenburg, and H. Molina-Salgado. Generating personal profiles.

In The International Conference On MT And Multilingual NLP, 2000.


[31] S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data.

In Proceedings of Empirical Methods in Natural Language Processing and Con-

ference on Natural Language Learning, pages 708–716, 2007.

[32] A. Culotta, A. McCallum, and J. Betz. Integrating probabilistic extraction

models and data mining to discover relations and patterns in text. In Pro-

ceedings of Human Language Technologies and North American Chapter of the

Association for Computational Linguistics, pages 296–303, 2006.

[33] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In

Proceedings of Association for Computational Linguistics, 2004.

[34] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-

complete data via the EM algorithm. Journal of the Royal Statistical Society.

Series B (Methodological), pages 1–38, 1977.

[35] T.G. Dietterich. An experimental comparison of three methods for constructing

ensembles of decision trees: Bagging, boosting, and randomization. Machine

learning, 40(2):139–157, 2000.

[36] P. Eckert and S. McConnell-Ginet. Language and Gender. Cambridge Univer-

sity Press, 2003.

[37] O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland,


D. S. Weld, and A. Yates. Unsupervised named-entity extraction from the web:

an experimental study. Artif. Intell., 165(1):91–134, 2005.

[38] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[39] E. Filatova and J. Prager. Tell me what you do and I'll tell you what you are: Learning occupation-related activities for biographies. Proceedings of Human Language Technologies and Empirical Methods in Natural Language Processing, pages 113–120, 2005.

[40] J.L. Fischer. Social influences on the choice of a linguistic variant. Word,

14:47–56, 1958.

[41] R. Florian and D. Yarowsky. Modeling consensus: Classifier combination for

word sense disambiguation. In Proceedings of the conference on Empirical meth-

ods in natural language processing, pages 25–32, 2002.

[42] P. Fung. A statistical view on bilingual lexicon extraction: from parallel corpora

to non-parallel corpora. Lecture Notes in Computer Science, 1529:1–17, 1998.

[43] P. Fung and P. Cheung. Multi-level bootstrapping for extracting parallel sen-

tences from a quasi-comparable corpus. In Proceedings of International Con-

ference on Computational Linguistics, pages 1051–1057, 2004.

[44] P. Fung and L.Y. Yee. An IR Approach for Translating New Words from Non-


parallel, Comparable Texts. In Proceedings of Association for Computational

Linguistics, volume 36, pages 414–420, 1998.

[45] N. Garera, C. Callison-Burch, and D. Yarowsky. Improving translation lexicon

induction from monolingual corpora via dependency contexts and part-of-speech

equivalences. In Proceedings of the Conference on Computational Natural Lan-

guage Learning, pages 129–137, 2009.

[46] N. Garera and A. I. Rudnicky. Briefing assistant: Learning human summariza-

tion behavior over time. In AAAI Spring Symposium on Persistent Assistants,

2005.

[47] N. Garera and D. Yarowsky. Resolving and generating definite anaphora by

modeling hypernymy using unlabeled corpora. In Proceedings of the Conference

on Natural Language Learning, pages 37–44, 2006.

[48] N. Garera and D. Yarowsky. Minimally supervised multilingual taxonomy and

translation lexicon induction. In Proceedings of the International Joint Confer-

ence on Natural Language Processing, pages 465–472, 2008.

[49] N. Garera and D. Yarowsky. Translating compounds by learning component

gloss translation models via multiple languages. In Proceedings of the Interna-

tional Joint Conference on Natural Language Processing, pages 403–410, 2008.

[50] N. Garera and D. Yarowsky. Modeling latent biographic attributes in conver-


sational genres. In Proceedings of the Joint Conference of Association of Com-

putational Linguistics and International Joint Conference on Natural Language

Processing (ACL-IJCNLP), pages 710–718, 2009.

[51] N. Garera and D. Yarowsky. Structural, transitive and latent models for bi-

ographic fact extraction. In Proceedings of the Conference of the European

Chapter of the Association of Computational Linguistics, pages 300–308, 2009.

[52] R. Girju, A. Badulescu, and D. Moldovan. Learning semantic constraints for the

automatic discovery of part-whole relations. In Proceedings of Human Language

Technologies and North American Chapter of the Association for Computational

Linguistics, pages 1–8, 2003.

[53] R. Girju, A. Badulescu, and D. Moldovan. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135, 2006.

[54] J.J. Godfrey, E.C. Holliman, and J. McDaniel. Switchboard: Telephone speech

corpus for research and development. In Proceedings of ICASSP, volume 1,

1992.

[55] T. Gollins and M. Sanderson. Improving cross language retrieval with triangu-

lated translation. In Proceedings of the 24th annual international ACM SIGIR

conference on Research and development in information retrieval, pages 90–95,

2001.


[56] D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Second Edition.

Linguistic Data Consortium, catalog number LDC2005T12, 2005.

[57] G. Grefenstette. The World Wide Web as a Resource for Example-Based Ma-

chine Translation Tasks. In ASLIB’99 Translating and the Computer 21., 1999.

[58] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning bilingual

lexicons from monolingual corpora. In Proceedings of Association for Compu-

tational Linguistics and Human Language Technologies, pages 771–779, 2008.

[59] J. Hajic, J. Hric, and V. Kubon. Machine translation of very close languages.

In Proceedings of the sixth conference on Applied natural language processing,

pages 7–12, 2000.

[60] S. Harabagiu, R. Bunescu, and S. J. Maiorano. Text and knowledge mining

for coreference resolution. In Proceedings of the Second Meeting of the North

American Chapter of the Association for Computational Linguistics, pages 55–

62, 2001.

[61] Z. Harris. Distributional structure. Word, 10(23):146–162, 1954.

[62] T. Hasegawa, S. Sekine, and R. Grishman. Discovering relations among named

entities from large corpora. In Proceedings of Association for Computational

Linguistics, pages 415–422, 2004.

[63] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In


Proceedings of International Conference on Computational Linguistics, pages

539–545, 1992.

[64] S.C. Herring and J.C. Paolillo. Gender and genre variation in weblogs. Journal

of Sociolinguistics, 10(4):439–459, 2006.

[65] L. Hirschman and N. Chinchor. MUC-7 coreference task definition. In MUC-7

proceedings, 1997.

[66] J. Hobbs. Resolving pronoun references. In Readings in Natural Language Processing, pages 339–352, 1986.

[67] J.R. Hobbs. Overview of the TACITUS Project. Computational Linguistics,

12(3):220–222, 1986.

[68] H. Isahara, F. Bond, K. Uchimoto, M. Utiyama, and K. Kanzaki. Development

of the Japanese WordNet. In Proceedings of the 6th International Conference

on Language Resources and Evaluation (LREC 2008), 2008.

[69] V. Jijkoun, M. de Rijke, and J. Mur. Information extraction for question answer-

ing: improving recall through syntactic patterns. In Proceedings of International

Conference on Computational Linguistics, page 1284, 2004.

[70] H. Jing, N. Kambhatla, and S. Roukos. Extracting social networks and bio-

graphical facts from conversational speech transcripts. In Proceedings of Asso-

ciation for Computational Linguistics, pages 1040–1047, 2007.


[71] J. Kivinen and M.K. Warmuth. Exponentiated Gradient versus Gradient De-

scent for Linear Predictors. Information and Computation, 132(1):1–63, 1997.

[72] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit, 2005.

[73] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of Association for Computational Linguistics, companion volume, pages 177–180, 2007.

[74] P. Koehn and K. Knight. Learning a translation lexicon from monolingual

corpora. In Proceedings of Association for Computational Linguistics Workshop

on Unsupervised Lexical Acquisition, pages 9–16, 2002.

[75] P. Koehn and K. Knight. Empirical methods for compound splitting. In Proceed-

ings of the European Chapter of the Association for Computational Linguistics,

Volume 1, pages 187–193, 2003.

[76] M. Koppel, S. Argamon, and A.R. Shimoni. Automatically Categorizing Writ-

ten Texts by Author Gender. Literary and Linguistic Computing, 17(4):401–412,

2002.

[77] M. Kumar, N. Garera, and A. I. Rudnicky. Learning from the report-writing behavior of individuals. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1641–1646, 2007.

[78] S. Kumar and W. Byrne. Minimum Bayes-Risk word alignments of bilingual

texts. In Proceedings of the conference on Empirical methods in natural language

processing, pages 140–147, 2002.

[79] W. Labov. The Social Stratification of English in New York City. Center for

Applied Linguistics, Washington DC, 1966.

[80] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Prob-

abilistic models for segmenting and labeling sequence data. In Proceedings of

International Conference on Machine Learning, pages 282–289, 2001.

[81] T.R. Leek. Information Extraction Using Hidden Markov Models. PhD thesis,

University of California, San Diego, 1997.

[82] W. Lehnert, C. Cardie, D. Fisher, E. Riloff, and R. Williams. University of

Massachusetts: Description of the CIRCUS System as Used for MUC-3. In

Proceedings of the 3rd conference on Message understanding, pages 223–233,

1991.

[83] D. B. Lenat. Cyc: a large-scale investment in knowledge infrastructure. Com-

mun. ACM, 38(11):33–38, 1995.

[84] J.N. Levi. The Syntax and Semantics of Complex Nominals. Academic Press, 1978.


[85] D. Lin and P. Pantel. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360, 2002.

[86] H. Liu and R. Mihalcea. Of Men, Women, and Computers: Data-Driven Gender

Modeling for Improved User Interfaces. In International Conference on Weblogs

and Social Media, 2007.

[87] Y. Liu, Q. Liu, and S. Lin. Log-linear models for word alignment. In Proceedings

of the 43rd Annual Meeting on Association for Computational Linguistics, pages

459–466, 2005.

[88] R.K.S. Macaulay. Talk that Counts: Age, Gender, and Social Class Differences

in Discourse. Oxford University Press, USA, 2005.

[89] G.S. Mann and D. Yarowsky. Multipath translation lexicon induction via bridge

languages. In Proceedings of North American Chapter of the Association for

Computational Linguistics, pages 151–158, 2001.

[90] G.S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. In

Proceedings of Conference on Natural Language Learning, pages 33–40, 2003.

[91] G.S. Mann and D. Yarowsky. Multi-field information extraction and cross-

document fusion. In Proceedings of Association for Computational Linguistics,

pages 483–490, 2005.

[92] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated


corpus of English: the Penn treebank. Computational Linguistics, 19(2):313–

330, 1993.

[93] K. Markert and M. Nissim. Comparing knowledge sources for nominal anaphora

resolution. Computational Linguistics, 31(3):367–402, 2005.

[94] K. Markert, M. Nissim, and N. N. Modjeska. Using the web for nominal

anaphora resolution. In Proceedings of the European Chapter of the Association

for Computational Linguistics Workshop on the Computational Treatment of

Anaphora, pages 39–46, 2003.

[95] R. McDonald, F. Pereira, K. Ribarov, and J. Hajic. Non-projective dependency

parsing using spanning tree algorithms. In Proceedings of Empirical Methods

in Natural Language Processing and Human Language Technologies, pages 523–

530, 2005.

[96] J. Meyer and R. Dale. Mining a corpus to support associative anaphora res-

olution. In Proceedings of the Fourth International Conference on Discourse

Anaphora and Anaphor Resolution, 2002.

[97] G.A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[98] D.S. Munteanu, A. Fraser, and D. Marcu. Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora. In Proceedings of Human Language Technologies and North American Chapter of the Association for Computational Linguistics, pages 265–272, 2004.

[99] D.S. Munteanu and D. Marcu. Improving machine translation performance

by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504,

2005.

[100] D. Narayan, D. Chakrabarty, P. Pande, and P. Bhattacharyya. An experience in building the Indo WordNet: a WordNet for Hindi. In International Conference on Global WordNet, 2002.

[101] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application

to automated terminology translation. Intelligent Systems, IEEE, 18(1):22–31,

2003.

[102] A. Nenkova and K. McKeown. References to named entities: a corpus study.

Proceedings of Human Language Technologies and North American Chapter of

the Association for Computational Linguistics companion volume, pages 70–72,

2003.

[103] V. Ng and C. Cardie. Improving machine learning approaches to coreference

resolution. In Proceedings of the 40th Annual Meeting of the Association for

Computational Linguistics, pages 104–111, 2002.

[104] J. Nivre, J. Hall, S. Kubler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the Conference on Natural Language Learning Shared Task Session of Empirical Methods in Natural Language Processing and Conference on Natural Language Learning, pages 915–932, 2007.

[105] S. Nowson and J. Oberlander. The identity of bloggers: Openness and gender

in personal weblogs. In Proceedings of the American Association for Artifi-

cial Intelligence Spring Symposia on Computational Approaches to Analyzing

Weblogs, 2006.

[106] F.J. Och and H. Ney. Discriminative training and maximum entropy models

for statistical machine translation. In Proc. of the 40th Annual Meeting of the

Association for Computational Linguistics (ACL), volume 8, 2002.

[107] F.J. Och, C. Tillmann, and H. Ney. Improved alignment models for statistical machine translation. In Proceedings of the conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, 1999.

[108] F.J. Och and H. Weber. Improving statistical natural language translation

with categories and rules. In Proceedings of the 17th international conference

on Computational linguistics-Volume 2, pages 985–989, 1998.

[109] M. Pasca, L. Dekang, J. Bigham, A. Lifchits, and A. Jain. Names and similari-

ties on the web: Fact extraction in the fast lane. In Proceedings of Association


for Computational Linguistics and International Conference on Computational

Linguistics, pages 809–816, 2006.

[110] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the World Wide Web of facts - step one: the one-million fact extraction challenge. In Proceedings of American Association for Artificial Intelligence, pages 1400–1405, 2006.

[111] P. Pantel and M. Pennacchiotti. Espresso: Leveraging generic patterns for

automatically harvesting semantic relations. In Proceedings of Association for

Computational Linguistics and International Conference on Computational Lin-

guistics, pages 113–120, 2006.

[112] P. Pantel and D. Ravichandran. Automatically labeling semantic classes. In

Proceedings of Human Language Technologies and North American Chapter of

the Association for Computational Linguistics, pages 321–328, 2004.

[113] P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale knowledge acquisi-

tion. In Proceedings of International Conference on Computational Linguistics,

2004.

[114] M. Pasca, B. V. Durme, and N. Garera. The role of documents vs. queries

in extracting class attributes from text. In Proceedings of the Conference on

Information and Knowledge Management, pages 485–494, 2007.


[115] M. Poesio, T. Ishikawa, S. Schulte im Walde, and R. Vieira. Acquiring lexical knowledge for anaphora resolution. In Proceedings of the Third Conference on Language Resources and Evaluation, pages 1220–1224, 2002.

[116] M. Poesio, R. Mehta, A. Maroudas, and J. Hitzeman. Learning to resolve bridg-

ing references. In Proceedings of the 42nd Annual Meeting of the Association

for Computational Linguistics, pages 143–150, 2004.

[117] M. Poesio, R. Vieira, and S. Teufel. Resolving bridging references in unrestricted

text. In Proceedings of the Association for Computational Linguistics Workshop

on Operational Factors in Robust Anaphora, pages 1–6, 1997.

[118] W.V. Quine. Natural kinds. Essays in honor of Carl G. Hempel, pages 5–23, 1969.

[119] U. Rackow, I. Dagan, and U. Schwall. Automatic translation of noun compounds. In Proceedings of the International Conference on Computational Linguistics, Volume 4, pages 1249–1253, 1992.

[120] D. Rao, N. Garera, and D. Yarowsky. JHU1: An unsupervised approach to person name disambiguation using web snippets. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval), pages 199–202, 2007.

[121] R. Rapp. Automatic identification of word translations from unrelated En-


glish and German corpora. In Proceedings of Association for Computational

Linguistics, pages 519–526, 1999.

[122] D. Ravichandran and E. Hovy. Learning surface text patterns for a question

answering system. In Proceedings of Association for Computational Linguistics,

pages 41–47, 2002.

[123] Y. Ravin and Z. Kazi. Is Hillary Rodham Clinton the President? Disambiguat-

ing Names across Documents. In Proceedings of Association for Computational

Linguistics, 1999.

[124] M. Remy. Wikipedia: The Free Encyclopedia. Online Information Review, 26(6), 2002.

[125] E. Riloff. Automatically Generating Extraction Patterns from Untagged Text.

In Proceedings of American Association for Artificial Intelligence, pages 1044–

1049, 1996.

[126] E. Riloff and R. Jones. Learning dictionaries for information extraction by

multi-level bootstrapping. In Proceedings of American Association for Artificial

Intelligence and Innovative Applications of Artificial Intelligence, pages 474–

479, 1999.

[127] E. Riloff and J. Shepherd. A corpus-based approach for building semantic

lexicons. CoRR, cmp-lg/9706013, 1997.


[128] E. S. Ristad and P. N. Yianilos. Learning string edit distance. In Machine

Learning: Proceedings of the Fourteenth International Conference, pages 287–

295, 1997.

[129] M. Ruiz-Casado, E. Alfonseca, and P. Castells. Automatic extraction of seman-

tic relationships for wordnet by means of pattern learning from wikipedia. In

Proceedings of NLDB 2005, 2005.

[130] M. Ruiz-Casado, E. Alfonseca, and P. Castells. From Wikipedia to semantic

relationships: a semiautomated annotation approach. In Proceedings of ESWC,

2006.

[131] A. Sayeed, T. Elsayed, N. Garera, D. Alexander, T. Xu, D. Oard, D. Yarowsky, and C. Piatko. Arabic cross-document coreference resolution. In Proceedings of the Joint Conference of Association of Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), Conference Short Papers, pages 357–360, 2009.

[132] C. Schafer and D. Yarowsky. Inducing translation lexicons via diverse similarity

measures and bridge languages. In Proceedings of CONLL, pages 146–152, 2002.

[133] C. Schafer and D. Yarowsky. Inducing translation lexicons via diverse similarity

measures and bridge languages. In Proceedings of International Conference on

Computational Linguistics, pages 1–7, 2002.


[134] C. Schafer and D. Yarowsky. Exploiting aggregate properties of bilingual dic-

tionaries for distinguishing senses of English words and inducing English sense

clusters. In Proceedings of Association for Computational Linguistics, pages

118–121, 2004.

[135] B. Schiffman, I. Mani, and K.J. Concepcion. Producing biographical sum-

maries: combining linguistic knowledge with corpus statistics. In Proceedings

of Association for Computational Linguistics, pages 458–465, 2001.

[136] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and gender

on blogging. In Proceedings of the American Association for Artificial Intelli-

gence Spring Symposia on Computational Approaches to Analyzing Weblogs,

2006.

[137] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and

gender on blogging. In AAAI Spring Symposium on Computational Approaches

to Analyzing Weblogs, 2006.

[138] I. Shafran, M. Riley, and M. Mohri. Voice signatures. In Proceedings of ASRU,

pages 31–36, 2003.

[139] S. Singh. A pilot study on gender differences in conversational speech on lexical

richness measures. Literary and Linguistic Computing, 16(3):251–264, 2001.

[140] R. Snow, D. Jurafsky, and A. Y. Ng. Semantic taxonomy induction from het-


erogenous evidence. In Proceedings of Association for Computational Linguis-

tics and International Conference on Computational Linguistics, pages 801–808,

2006.

[141] S. Soderland. Learning information extraction rules for semi-structured and

free text. Machine learning, 34(1):233–272, 1999.

[142] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to

coreference resolution of noun phrases. Computational Linguistics, 27(4):521–

544, 2001.

[143] M. Strube, S. Rapp, and C. Muller. The influence of minimum edit distance

on reference resolution. In Proceedings of the 2002 Conference on Empirical

Methods in Natural Language Processing, pages 312–319, 2002.

[144] I. Szpektor, H. Tanev, I. Dagan, and B. Coppola. Scaling web-based acquisition of entailment relations. In Proceedings of Empirical Methods in Natural Language Processing, 2004.

[145] T. Tanaka and T. Baldwin. Noun-Noun Compound Machine Translation: A

Feasibility Study on Shallow Processing. In Proceedings of the Association for

Computational Linguistics Workshop on Multiword Expressions, pages 17–24,

2003.

[146] M. Thelen and E. Riloff. A bootstrapping method for learning semantic lexi-


cons using extraction pattern contexts. In Proceedings of Empirical Methods in

Natural Language Processing, pages 214–221, 2002.

[147] K. Toutanova, H. T. Ilhan, and C. D. Manning. Extensions to HMM-based statistical word alignment models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 87–94, 2002.

[148] B. Ustun. A comparison of support vector machines and partial least squares regression on spectral data, 2003.

[149] R. Vieira and M. Poesio. An empirically-based system for processing definite descriptions. Computational Linguistics, 26(4):539–593, 2000.

[150] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, pages 836–841, 1996.

[151] P. Vossen. EuroWordNet: A multilingual database with lexical semantic networks. Computational Linguistics, volume 25, 1998.

[152] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation of proper names in text. In Proceedings of ANLP, pages 202–208, 1997.

[153] C. Walker, S. Strassel, J. Medero, and K. Maeda. ACE 2005 multilingual training corpus. Linguistic Data Consortium, 2006.


[154] R. Weischedel, J. Xu, and A. Licuanan. A hybrid approach to answering biographical questions. New Directions in Question Answering, pages 59–70, 2004.

[155] M. Wick, A. Culotta, and A. McCallum. Learning field compatibilities to extract database records from unstructured text. In Proceedings of Empirical Methods in Natural Language Processing, pages 603–611, 2006.

[156] D. Widdows. Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proceedings of Human Language Technologies and North American Chapter of the Association for Computational Linguistics, pages 197–204, 2003.

[157] I. H. Witten and E. Frank. Data mining: Practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record, 31(1):76–77, 2002.

[158] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.

[159] X. Yang, G. Zhou, J. Su, and C. L. Tan. Coreference resolution using competition learning approach. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 176–183, 2003.

[160] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of Association for Computational Linguistics, pages 189–196, 1995.

[161] J. Zhang, J. Gao, and M. Zhou. Extraction of Chinese compound words: An experimental study on a very large corpus. In Proceedings of the Second Workshop on Chinese Language Processing, pages 132–139, 2000.

[162] L. Zhou, M. Ticrea, and E. Hovy. Multidocument biography summarization. In Proceedings of Empirical Methods in Natural Language Processing, pages 434–441, 2004.


Vita

Nikesh Garera grew up in Mumbai, India, where he also attended college, obtaining his bachelor's degree in Computer Engineering from the University of Mumbai in May 2002. He came to the United States to pursue graduate studies and earned his Master of Science in Language Technologies from the School of Computer Science at Carnegie Mellon University in May 2005. He then moved on to doctoral studies in the Computer Science Department at Johns Hopkins University, where he earned his Master of Science in Computer Science in May 2007 and his Doctor of Philosophy in Computer Science in September 2009.
