A Metric-based Framework for Automatic Taxonomy Induction Hui Yang and Jamie Callan Language Technologies Institute Carnegie Mellon University ACL2009, Singapore


A Metric-based Framework for Automatic Taxonomy Induction

Hui Yang and Jamie Callan

Language Technologies Institute

Carnegie Mellon University

ACL2009, Singapore


ROADMAP

Introduction

Related Work

Metric-Based Taxonomy Induction Framework

The Features

Experimental Results

Conclusions


INTRODUCTION

Semantic taxonomies, such as WordNet, play an important role in solving knowledge-rich problems

Limitations of manually created taxonomies:

  • Rarely complete
  • Difficult to extend with new terms from emerging or changing domains
  • Time-consuming to create, which may make them infeasible for specialized domains and personalized tasks


INTRODUCTION

Automatic taxonomy induction is a solution to:

  • Augment existing resources
  • Quickly produce new taxonomies for specialized domains and personalized tasks

Subtasks in automatic taxonomy induction:

  • Term extraction
  • Relation formation

This paper focuses on relation formation.


RELATED WORK

Pattern-based approaches:

  • Define lexical-syntactic patterns for relations, and use these patterns to discover instances
  • Have been applied to extract is-a, part-of, sibling, synonym, causal, and other relations
  • Strength: highly accurate
  • Weakness: sparse coverage of patterns

Clustering-based approaches:

  • Hierarchically cluster terms based on similarities of their meanings, usually represented by a feature vector
  • Have only been applied to extract is-a and sibling relations
  • Strength: allow discovery of relations that do not explicitly appear in text; higher recall
  • Weaknesses: generally fail to produce coherent clusters for small corpora [Pantel and Pennacchiotti 2006]; hard to label non-leaf nodes


A UNIFIED SOLUTION

  • Combine the strengths of both approaches in a unified framework
  • Flexibly incorporate heterogeneous features
  • Use lexical-syntactic patterns as one type of feature in a clustering framework

Metric-based Taxonomy Induction


THE FRAMEWORK

A novel framework, which:

  • Incrementally clusters terms
  • Transforms taxonomy induction into a multi-criteria optimization
  • Uses heterogeneous features

Optimization is based on two criteria:

  • Minimization of taxonomy structures (Minimum Evolution Assumption)
  • Modeling of term abstractness (Abstractness Assumption)


LET’S BEGIN WITH SOME IMPORTANT DEFINITIONS

A taxonomy is a data model defined by:

  • Concept Set
  • Relationship Set
  • Domain


MORE DEFINITIONS

A Full Taxonomy:

[Figure: taxonomy rooted at Game Equipment, with ball and table below it]

AssignedTermSet = {game equipment, ball, table, basketball, volleyball, soccer, table-tennis table, snooker table}
UnassignedTermSet = {}


MORE DEFINITIONS

A Partial Taxonomy:

[Figure: taxonomy rooted at Game Equipment, with ball and table below it]

AssignedTermSet = {game equipment, ball, table, basketball, volleyball}
UnassignedTermSet = {soccer, table-tennis table, snooker table}
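The full and partial taxonomies differ only in which terms have already been attached. A minimal sketch of that bookkeeping, assuming a parent-pointer tree; the class name `Taxonomy`, its methods, and the parent links of the leaf terms are illustrative assumptions (the figure is lost, so the links are guessed from the term names), not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    """Parent-pointer tree plus the set of terms not yet attached."""
    parent: dict = field(default_factory=dict)    # term -> parent term (None for the root)
    unassigned: set = field(default_factory=set)  # terms still waiting to be placed

    @property
    def assigned(self):
        return set(self.parent)

    def is_full(self):
        # A full taxonomy has an empty unassigned term set.
        return not self.unassigned

    def attach(self, term, parent):
        # Move a term from the unassigned set into the tree.
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        self.unassigned.remove(term)
        self.parent[term] = parent


# The partial taxonomy from the slide (parent links are assumed).
t = Taxonomy(
    parent={"game equipment": None, "ball": "game equipment",
            "table": "game equipment", "basketball": "ball", "volleyball": "ball"},
    unassigned={"soccer", "table-tennis table", "snooker table"},
)
was_full_before = t.is_full()
t.attach("soccer", "ball")
t.attach("table-tennis table", "table")
t.attach("snooker table", "table")
print(was_full_before, t.is_full(), len(t.assigned))
```

Once all three unassigned terms are attached, the structure matches the full taxonomy's term sets.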


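MORE DEFINITIONS

Ontology Metric

[Figure: example taxonomy with edge distances 1.5, 2, 1, and 1, annotated with three pairwise distances: d( , ) = 2, d( , ) = 1, and d( , ) = 4.5. The terms in each pair were shown as images; the labels ball and table appear beside the last two.]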
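The three d( , ) values on this slide are consistent with reading the ontology metric as the summed edge distance along the tree path between two terms. A small sketch under that reading: the assignment of the four edge distances to specific edges (Game Equipment to ball 1.5, Game Equipment to table 2, ball to basketball 1, ball to volleyball 1) is an assumption, because the figure itself is lost, and `path_distance` is a hypothetical helper name.

```python
# Edge distances keyed by (child, parent). Which edge carries which
# distance is assumed; only the four numbers come from the slide.
EDGES = {
    ("ball", "game equipment"): 1.5,
    ("table", "game equipment"): 2.0,
    ("basketball", "ball"): 1.0,
    ("volleyball", "ball"): 1.0,
}
PARENT = {child: parent for child, parent in EDGES}


def ancestors(term):
    """Return [(node, distance from term)] walking from term up to the root."""
    chain, dist = [(term, 0.0)], 0.0
    while term in PARENT:
        dist += EDGES[(term, PARENT[term])]
        term = PARENT[term]
        chain.append((term, dist))
    return chain


def path_distance(x, y):
    """Sum of edge distances on the tree path between x and y."""
    up_x = dict(ancestors(x))
    for node, dist_y in ancestors(y):
        if node in up_x:  # first shared node is the lowest common ancestor
            return up_x[node] + dist_y
    raise ValueError("terms are not in the same tree")


print(path_distance("basketball", "volleyball"))  # 2.0
print(path_distance("basketball", "ball"))        # 1.0
print(path_distance("basketball", "table"))       # 4.5
```

With this edge assignment the sketch reproduces the slide's values 2, 1, and 4.5.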


ASSUMPTIONS

Minimum Evolution Assumption: The optimal ontology is the one that introduces the least information change.

ILLUSTRATION

Minimum Evolution Assumption

[Figure sequence over seven slides: the taxonomy is built up step by step. The label ball appears first, then table, then Game Equipment.]


ASSUMPTIONS

Abstractness Assumption: Each abstraction level has its own information function.

[Figure: the example taxonomy with Game Equipment, ball, and table]


MULTIPLE CRITERION OPTIMIZATION

[Equation: a combined objective whose labeled parts are the Minimum Evolution objective function, the Abstractness objective function, and a scalarization variable]
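Scalarization is a standard way to fold two objectives into one: weight them with a single variable and minimize the weighted sum. A minimal sketch, assuming a convex combination with a weight `lam` in [0, 1]; the exact form used on the slide did not survive extraction, and the candidate positions and scores below are made-up placeholders.

```python
def scalarize(u, v, lam=0.5):
    """Combine two objective values with a scalarization weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * u + (1.0 - lam) * v


# Hypothetical candidates: position -> (minimum-evolution score, abstractness score).
candidates = {"under ball": (0.2, 0.9), "under table": (0.6, 0.1)}


def best(lam):
    """Pick the candidate with the smallest combined objective."""
    return min(candidates, key=lambda pos: scalarize(*candidates[pos], lam=lam))


print(best(0.9))  # weight mostly on the first objective
print(best(0.1))  # weight mostly on the second objective
```

Changing the weight changes which candidate wins, which is the point of exposing it as a tunable variable.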


ESTIMATING ONTOLOGY METRIC

Assume the ontology metric is a linear interpolation of some underlying feature functions.

Use ridge regression to estimate and predict the ontology metric.
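To make the estimation step concrete, here is a sketch of closed-form ridge regression that learns one weight per feature function and predicts a distance as the weighted sum of feature scores. It is written in plain Python so it runs without dependencies; the training rows, target distances, and the names `fit_ridge` and `predict_distance` are illustrative assumptions, not the authors' setup.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ridge(X, y, alpha=1.0):
    """Ridge regression weights: solve (X^T X + alpha * I) w = X^T y."""
    k = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(k)]
    return solve(gram, xty)


def predict_distance(w, features):
    """Predicted metric value: weighted sum of feature-function outputs."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Made-up training data: one row of feature scores per term pair,
# with the known taxonomy distance of that pair as the target.
X = [[0.1, 0.9], [0.4, 0.5], [0.8, 0.2], [0.9, 0.1]]
y = [1.0, 2.0, 4.0, 4.5]

w = fit_ridge(X, y, alpha=0.1)
print(w, predict_distance(w, [0.5, 0.5]))
```

The `alpha` penalty shrinks the weights, which helps when feature functions are correlated or the training taxonomies are small.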


THE FEATURES

Our framework allows a wide range of features to be used.

  • Input for the feature functions: two terms
  • Output: a numeric score measuring the semantic distance between these two terms

We can use the following types of feature functions, but are not restricted to these:

  • Contextual Features
  • Term Co-occurrence
  • Lexical-Syntactic Patterns
  • Syntactic Dependency Features
  • Word Length Difference
  • Definition Overlap, etc.


EXPERIMENTAL RESULTS

Task: Reconstruct taxonomies from WordNet and ODP (not the entire WordNet or ODP, but fragments of each).

Ground Truth: 50 hypernym taxonomies from WordNet; 50 hypernym taxonomies from ODP; 50 meronym taxonomies from WordNet.

Auxiliary Datasets: 1000 Google documents per term or per term pair; 100 Wikipedia documents per term.

Evaluation Metrics: F1-measure (averaged by Leave-One-Out Cross Validation).
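As a reminder of what the F1-measure computes, here is a sketch that scores an induced taxonomy against a gold one. Comparing the two as sets of (parent, child) relations is an assumption about the evaluation unit, and the example relations are made up for illustration.

```python
def f1_score(predicted, gold):
    """F1 between two collections of (parent, child) relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("game equipment", "ball"), ("game equipment", "table"),
        ("ball", "basketball"), ("ball", "volleyball")}
predicted = {("game equipment", "ball"), ("game equipment", "table"),
             ("ball", "basketball"), ("table", "volleyball")}

print(f1_score(predicted, gold))  # 3 of 4 relations match on both sides
```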


DATASETS


PERFORMANCE OF TAXONOMY INDUCTION

Compare our system (ME) with other state-of-the-art systems:

  • HE: 6 is-a patterns [Hearst 1992]
  • GI: 3 part-of patterns [Girju et al. 2003]
  • PR: a probabilistic framework [Snow et al. 2006]
  • ME: our metric-based framework


PERFORMANCE OF TAXONOMY INDUCTION

Our system (ME) consistently gives the best F1 for all three tasks.

Systems using heterogeneous features (ME and PR) achieve a significant absolute F1 gain (>30%)


FEATURES VS. RELATIONS

This is the first study of the impact of using different features on taxonomy induction for different relations

Co-occurrence and lexical-syntactic patterns are good for is-a, part-of, and sibling relations

Contextual and syntactic dependency features are only good for the sibling relation


FEATURES VS. ABSTRACTNESS

This is the first study of the impact of using different features on taxonomy induction for terms at different abstraction levels

Contextual, co-occurrence, lexical-syntactic patterns, and syntactic dependency features work well for concrete terms;

Only co-occurrence works well for abstract terms


CONCLUSIONS

This paper presents a novel metric-based taxonomy induction framework, which:

  • Combines the strengths of pattern-based and clustering-based approaches
  • Achieves better F1 than 3 state-of-the-art systems

It is also the first study of the impact of using different features on taxonomy induction for different types of relations and for terms at different abstraction levels.


CONCLUSIONS

This work is a general framework, which:

  • Allows a wider range of features
  • Allows different metric functions at different abstraction levels

This work has the potential to learn more complex taxonomies than previous approaches.


THANK YOU AND QUESTIONS

[email protected]@cs.cmu.edu


EXTRA SLIDES


FORMAL FORMULATION OF TAXONOMY INDUCTION

The task of taxonomy induction:

The construction of a full ontology T given a set of concepts C and an initial partial ontology T0 (note that T0 could be empty).

Keep adding concepts in C into T0 until a full ontology is formed.
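The incremental procedure can be sketched as a greedy loop: take each remaining concept, try it under every node already in the tree, and keep the position that changes the taxonomy's information the least. This is a simplified illustration, not the authors' algorithm. It assumes that information is the sum of tree-path distances over all term pairs, and it uses a toy `pair_distance` function as a stand-in for the learned ontology metric; all names are hypothetical.

```python
def path_to_root(term, parent, weight):
    """Map each ancestor of term (including itself) to its distance from term."""
    dist, out = 0.0, {term: 0.0}
    while parent[term] is not None:
        dist += weight[term]  # weight[t] is the length of the edge t -> parent[t]
        term = parent[term]
        out[term] = dist
    return out


def tree_distance(x, y, parent, weight):
    up_x = path_to_root(x, parent, weight)
    up_y = path_to_root(y, parent, weight)
    # The lowest common ancestor yields the shortest combined path.
    return min(up_x[n] + up_y[n] for n in up_x if n in up_y)


def information(parent, weight):
    """Assumed definition: sum of distances over all term pairs in the tree."""
    terms = list(parent)
    return sum(tree_distance(a, b, parent, weight)
               for i, a in enumerate(terms) for b in terms[i + 1:])


def induce(parent, weight, unassigned, pair_distance):
    """Greedily attach each term where the information change is smallest."""
    for term in unassigned:
        before = information(parent, weight)
        best = None
        for host in list(parent):
            trial_p = {**parent, term: host}
            trial_w = {**weight, term: pair_distance(term, host)}
            change = abs(information(trial_p, trial_w) - before)
            if best is None or change < best[0]:
                best = (change, trial_p, trial_w)
        _, parent, weight = best
    return parent


# Toy stand-in for a learned metric: short distance to the intended parent.
CLOSE = {("soccer", "ball"), ("snooker table", "table")}


def toy_distance(a, b):
    return 1.0 if (a, b) in CLOSE else 5.0


t0 = {"game equipment": None, "ball": "game equipment", "table": "game equipment"}
w0 = {"ball": 1.5, "table": 2.0}
result = induce(t0, w0, ["soccer", "snooker table"], toy_distance)
print(result["soccer"], result["snooker table"])
```

The loop never mutates `t0`; each step builds trial copies, so a rejected position leaves no trace in the tree.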


GOAL OF TAXONOMY INDUCTION

Find the optimal full ontology such that the information change since T0 is the least.

Note that this follows from the Minimum Evolution Assumption.
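The formula on this slide did not survive extraction. One way to write the stated goal as an optimization, using assumed notation (T' for a candidate full ontology, and ΔInfo for the information change between two ontologies), is:

```latex
\hat{T} = \arg\min_{T'} \Delta\mathit{Info}\left(T^{0}, T'\right)
```

Here T̂ denotes the optimal full ontology and T⁰ the initial partial ontology; the symbols are illustrative and may differ from the original slide.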


GET TO THE GOAL

Goal:

Since the optimal set of concepts is always C

Concepts are added incrementally


GET TO THE GOAL

Plug in the definition of information change

Transform into a minimization problem

Minimum Evolution objective function


EXPLICITLY MODEL ABSTRACTNESS

Model abstractness for each level by least squares fit

Plug in the definition of the amount of information for an abstraction level

Abstractness objective function


THE OPTIMIZATION ALGORITHM


MORE DEFINITIONS

Information in a Taxonomy T

[Figure: the example taxonomy with edge distances 1.5, 2, 1, and 1 and pairwise distances d( , ) = 2, d( , ) = 1, and d( , ) = 4.5, as on the Ontology Metric slide]


MORE DEFINITIONS

Information in a Level L

[Figure: one level of the example taxonomy (ball is labeled), with pairwise distances d( , ) = 2, d( , ) = 1, and d( , ) = 1]

EXAMPLES OF FEATURES

Contextual Features

  • Global Context KL-Divergence = KL-Divergence(1000 Google documents for Cx, 1000 Google documents for Cy)
  • Local Context KL-Divergence = KL-Divergence(left two and right two words for Cx, left two and right two words for Cy)

Term Co-occurrence

  • Point-wise Mutual Information (PMI), computed from counts, where the count for the term(s) is one of:
      ◦ # of sentences containing the term(s)
      ◦ # of documents containing the term(s)
      ◦ n as in “Results 1-10 of about n for …” in Google
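Both feature families reduce to short computations once counts or context words are available. The sketch below uses the textbook probability-normalized PMI and a KL divergence over smoothed unigram distributions. The normalization by `total`, the additive smoothing, and the function names are assumptions, since the slide's formulas did not survive, and the inputs are toy data rather than Google results.

```python
import math
from collections import Counter


def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information from raw counts (sentences, documents, or hits)."""
    if min(count_xy, count_x, count_y) == 0:
        return float("-inf")
    return math.log((count_xy * total) / (count_x * count_y))


def kl_divergence(words_x, words_y, smoothing=1.0):
    """KL(P_x || P_y) over additively smoothed unigram distributions of two contexts."""
    cx, cy = Counter(words_x), Counter(words_y)
    vocab = set(cx) | set(cy)
    nx = sum(cx.values()) + smoothing * len(vocab)
    ny = sum(cy.values()) + smoothing * len(vocab)
    total = 0.0
    for w in vocab:
        p = (cx[w] + smoothing) / nx
        q = (cy[w] + smoothing) / ny
        total += p * math.log(p / q)
    return total


# Toy local contexts: words seen around each term.
ctx_apple = "eat fresh red apple juice eat red".split()
ctx_pear = "eat fresh green pear juice eat ripe".split()

print(pmi(count_xy=10, count_x=100, count_y=100, total=1000))
print(kl_divergence(ctx_apple, ctx_pear))
```

Smoothing keeps the divergence finite when a word appears in only one of the two contexts.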


EXAMPLES OF FEATURES

Syntactic Dependency Features

  • Minipar Syntactic Distance = average length of syntactic paths in syntactic parse trees for sentences containing the terms
  • Modifier Overlap = # of overlaps between modifiers of the terms; e.g., red apple, red pear
  • Object Overlap = # of overlaps between objects of the terms when the terms are subjects; e.g., A dog eats apple; A cat eats apple
  • Subject Overlap = # of overlaps between subjects of the terms when the terms are objects; e.g., A dog eats apple; A dog eats pear
  • Verb Overlap = # of overlaps between verbs of the terms when the terms are subjects/objects; e.g., A dog eats apple; A cat eats pear
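The overlap features are set intersections over parser output. A sketch, assuming the sentences have already been parsed into (subject, verb, object) triples and a modifier table; the parsing step (Minipar on the slide) is not reproduced here, and the data structures and function names are hypothetical.

```python
def _collect(triples, term, match_pos, take_pos):
    """Values at take_pos from triples whose match_pos element equals term."""
    return {t[take_pos] for t in triples if t[match_pos] == term}


def object_overlap(triples, x, y):
    """Shared objects of x and y when each appears as a subject."""
    return len(_collect(triples, x, 0, 2) & _collect(triples, y, 0, 2))


def subject_overlap(triples, x, y):
    """Shared subjects of x and y when each appears as an object."""
    return len(_collect(triples, x, 2, 0) & _collect(triples, y, 2, 0))


def verb_overlap(triples, x, y):
    """Shared verbs of x and y when each appears as a subject or an object."""
    verbs_x = _collect(triples, x, 0, 1) | _collect(triples, x, 2, 1)
    verbs_y = _collect(triples, y, 0, 1) | _collect(triples, y, 2, 1)
    return len(verbs_x & verbs_y)


def modifier_overlap(modifiers, x, y):
    """Shared modifiers of x and y, e.g. 'red' for red apple and red pear."""
    return len(modifiers.get(x, set()) & modifiers.get(y, set()))


# Triples for the slide's example sentences (verbs lemmatized by hand).
triples = [("dog", "eat", "apple"), ("cat", "eat", "apple"), ("dog", "eat", "pear")]
modifiers = {"apple": {"red"}, "pear": {"red"}}

print(object_overlap(triples, "dog", "cat"))
print(subject_overlap(triples, "apple", "pear"))
print(verb_overlap(triples, "dog", "cat"))
print(modifier_overlap(modifiers, "apple", "pear"))
```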


EXAMPLES OF FEATURES

Lexical-Syntactic Patterns


EXAMPLES OF FEATURES

Miscellaneous Features

  • Definition Overlap = # of non-stopword overlaps between definitions of two terms
  • Word Length Difference
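Both miscellaneous features fit in a few lines. In this sketch the stopword list is a tiny illustrative one, the definitions are made-up glosses rather than real dictionary entries, and word length is taken to mean the number of words in a term, which is an assumption because the slide does not say how length is measured.

```python
import re

# A deliberately small stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "or", "and", "with", "is", "by", "to"}


def content_words(text, stopwords=STOPWORDS):
    """Lowercased alphabetic tokens with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - stopwords


def definition_overlap(def_x, def_y):
    """# of non-stopword words shared by two definitions."""
    return len(content_words(def_x) & content_words(def_y))


def word_length_difference(x, y):
    """Absolute difference in the number of whitespace-separated words (assumed unit)."""
    return abs(len(x.split()) - len(y.split()))


# Made-up definitions, not taken from WordNet.
def_ball = "A round object used in games."
def_soccer_ball = "A round object kicked in soccer games."

print(definition_overlap(def_ball, def_soccer_ball))
print(word_length_difference("table", "table-tennis table"))
```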