Entity Resolution and Clustering: Inventor Disambiguation and Related Tasks

Nicholas Monath, Ari Kobren, Andrew McCallum

Entity Resolution

Author/Inventor Coreference

A. Banerjee, S. Chassang, E. Snowberg. Decision Theoretic Approaches to Experiment Design and External Validity. Handbook of Field Experiments. 2016.

Arindam Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh. Clustering with Bregman Divergences. JMLR. 2006.

A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. Journal of Machine Learning Research. 2005

Entity Resolution

Cross Document Coreference

Dengue viruses are members of the Flaviviridae, transmitted principally in a cycle involving humans and mosquito vectors.

The virus-encoded RNA-dependent RNA polymerase (RdRp), which is required for replication of the positive-strand RNA genome, is a key enzyme of members of the virus family Flaviviridae

We present several modifications of the original recurrent neural network language model

Unlike Toutanova et al. (2015), we also consider RNNs, specifically Long-Short Term Memory Networks (LSTMs) (Hochreiter and Schmidhuber, 1997)

Entity Resolution as Clustering

Given mentions M = {m1, m2, …, mN}

Partition M into entities E = {e1, e2, …, ek}, where k is unknown in advance

(Figure: mentions m1, m2, m3, m4, m5, m6, …, mN grouped into entities e1, e2, e3, …, ek; one entity is labeled "A. Banerjee, D.O.B. 12/4/87")
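One way to read "partition M with k unknown in advance" is as threshold-based clustering: score every pair of mentions, link the pairs that score high enough, and take the connected components as entities. The sketch below is only an illustration under stated assumptions, not the method from the talk. The `similarity` function (exact name match plus Jaccard overlap of title words), the threshold of 0.2, and the five sample mentions are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two collections of tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(m1, m2):
    """Hypothetical pairwise score: names must match, then compare title words."""
    if m1["name"] != m2["name"]:
        return 0.0
    return jaccard(m1["title"].lower().split(), m2["title"].lower().split())

def partition(mentions, threshold=0.2):
    """Single-link clustering: link every pair scoring >= threshold and
    return the connected components. The number of entities k is an output."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if similarity(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    entities = {}
    for i in range(len(mentions)):
        entities.setdefault(find(i), []).append(i)
    return sorted(entities.values())

# Assumed sample mentions (indices 0..4), for illustration only.
mentions = [
    {"name": "J. Smith", "title": "LCD display wiring"},
    {"name": "J. Smith", "title": "Display wiring methods"},
    {"name": "J. Smith", "title": "LCD Displays"},
    {"name": "J. Smith", "title": "Metal Alloy Composition"},
    {"name": "J. Smith", "title": "Music Theory"},
]
entities = partition(mentions)
print(entities)  # lists of mention indices, one list per entity
```

Scoring all pairs is quadratic in the number of mentions, so a sketch like this is only practical for small inputs.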

Entity Resolution Challenges

• Clustering hundreds of millions of entity mentions
• Many singleton clusters: a large number of entities
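To see why this scale is a challenge, it helps to count the pairwise comparisons a naive all-pairs approach would need. The short calculation below assumes N = 100 million mentions and a rate of one million comparisons per second; both figures are illustrative assumptions, not numbers from the talk.

```python
def num_pairs(n: int) -> int:
    """Number of unordered mention pairs an all-pairs comparison would score."""
    return n * (n - 1) // 2

n_mentions = 100_000_000           # assumed: one hundred million mentions
pairs = num_pairs(n_mentions)      # 4,999,999,950,000,000 pairs

rate = 1_000_000                   # assumed: comparisons per second
years = pairs / rate / (365 * 24 * 3600)

print(f"{pairs:,} pairs, about {years:.0f} years at {rate:,} comparisons/second")
```

Under these assumptions the naive approach needs roughly 5 × 10^15 comparisons, which motivates methods that only look at a small set of candidate clusters per mention.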

Entity Resolution in a Growing Knowledge Base

• Build KB of inventors / researchers / entities
• Receive new data constantly
• Update KB with new data immediately

KB:

• Jane Smith: Physics
• J. Smith: Music Theory
• Jack Smith: Ecology
• John Smith: Control Systems
• J. Smith: Organometallic Compounds
• J. Smith: Metal Alloy Composition
• J. Smith: LCD display wiring
• J. Smith: Display wiring methods
• J. Smith: LCD Displays
• J. Smith: LCD Displays

Entity Resolution in a Growing Knowledge Base

New mention mi = J. Smith: Organometallic Compounds for LCD Displays

• Determine entity for new data & update existing entities
• Merge entities!

KB:

• J. Smith: Organometallic Compounds
• J. Smith: Metal Alloy Composition
• J. Smith: LCD display wiring
• J. Smith: Display wiring methods
• J. Smith: LCD Displays
• J. Smith: LCD Displays
• Jane Smith: Physics
• J. Smith: Music Theory
• Jack Smith: Ecology
• John Smith: Control Systems
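The update step above can be sketched as follows: when a new mention arrives, score it against each existing entity, attach it to every entity it matches, and merge those entities if there is more than one. This is a minimal sketch under assumptions, not the system described in the talk. The `score` function, the threshold, and the initial grouping of KB mentions into entities are hypothetical and chosen only to show a merge being triggered.

```python
def tokens(title):
    """Lowercased word set of a title."""
    return set(title.lower().split())

def score(mention, entity):
    """Hypothetical mention-entity score: best Jaccard overlap with any title
    already in the entity, or 0 if the entity has never used this name."""
    if mention["name"] not in entity["names"]:
        return 0.0
    t = tokens(mention["title"])
    return max(len(t & tokens(s)) / len(t | tokens(s)) for s in entity["titles"])

def add_mention(kb, mention, threshold=0.2):
    """Insert one mention into the KB (a list of entities), merging every
    entity it matches into a single entity. With no match, a new singleton
    entity is created."""
    matched = [e for e in kb if score(mention, e) >= threshold]
    merged = {"names": {mention["name"]}, "titles": [mention["title"]]}
    for e in matched:
        merged["names"] |= e["names"]
        merged["titles"] += e["titles"]
    kb[:] = [e for e in kb if not any(e is m for m in matched)] + [merged]
    return merged

# Assumed starting KB: the grouping into entities is illustrative only.
kb = [
    {"names": {"J. Smith"}, "titles": ["Organometallic Compounds", "Metal Alloy Composition"]},
    {"names": {"J. Smith"}, "titles": ["LCD display wiring", "Display wiring methods",
                                       "LCD Displays", "LCD Displays"]},
    {"names": {"J. Smith"}, "titles": ["Music Theory"]},
    {"names": {"Jane Smith"}, "titles": ["Physics"]},
]

new_mention = {"name": "J. Smith", "title": "Organometallic Compounds for LCD Displays"}
merged = add_mention(kb, new_mention)
print(len(kb), "entities;", len(merged["titles"]), "titles in the merged entity")
```

In this toy setup the new mention overlaps with both the chemistry entity and the display entity, so the two are merged, while the "Music Theory" and "Jane Smith" entities are left untouched.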

Related Work: Author & Inventor Coreference

• “Author-ity” System for Disambiguation in MEDLINE & U.S. Patents: [Lai, D’Amour, Doolin, Li, Sun, Torvik, Yu, Fleming; 2014], [Torvik and Smalheiser; 2009], [Torvik, Weeber, Swanson and Smalheiser; 2005]

• Disambiguation using the DBSCAN clustering algorithm & a learned distance function: [Khabsa, Treeratpituk, Giles; 2014]

• Using Random Forest Classifiers: [Ventura, Nugent, Fuchs; 2015], [Treeratpituk, Giles; 2009]

• Numerous other approaches: [Ge, Huang, Png; 2014], [Liu et al; 2014], [Levin, Krawczyk, Bethard, Jurafsky; 2012], [Lissoni, Maurino, Pezzoni, Tarasconi; 2010], [Raffo, Lhuillery; 2009], [Carayol, Cassi; 2009], [Yang, Peng, Jiang, Lee, Ho; 2008], [Fleming, King, Juda; 2007], [Lissoni, Sanditov, Sanditov; 2006], [Han, Zha, Giles; 2005] & many others!

Related Work

Cross Document Coreference

• Hierarchical Models: [Wick, Singh, McCallum, 2013], [Singh et al, 2011]

• Sampling based Spectral Clustering: [Dutta and Weikum, 2015]

• Streaming Models: [Rao, McNamee, Dredze, 2010]

• [Gooi and Allan, 2004], [Nicolae and Nicolae, 2006], and many others!

Entity Linking

• Local & Global Contexts: [Chang and Roth, 2013]

• Selective Context Selection: [Lazic, Subramanya, Ringgaard, Pereira, 2015]

• Neural Networks: [Huang et al, 2015], [Sun et al, 2015], [Yamada et al, 2016]

• Graph/Coherence Based: [Hoffart et al, 2011], [Han and Zhao, 2009]

• Wikification: [Milne and Witten, 2008]

• Many others!

Large Scale Clustering

• Efficiently find candidate clusters for new data

• Tree structures for efficient search

• Inspired by work in extreme multiclass classification: [Daume, et al 2016], [Choromanska & Langford, 2015], [Beygelzimer, et al 2010] and others

(Figure: entities e1, e2, …, e9 arranged in a tree)
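The idea of using a tree to find candidate clusters can be illustrated with a small sketch. It assumes each entity is summarized by a 2-D vector, builds a balanced binary tree by median splits, stores the mean vector at every node, and descends with a beam search to collect a few candidate entities for a query. The embeddings, the split rule, and the beam width are assumptions made for illustration; this is not the data structure from the talk or from the cited papers.

```python
import math

class Node:
    def __init__(self, ids, points):
        self.ids = ids  # entity ids stored under this node
        dim = len(points[ids[0]])
        # Mean vector of all entities under this node.
        self.center = [sum(points[i][d] for i in ids) / len(ids) for d in range(dim)]
        self.children = []

def build(ids, points):
    """Recursively split entities at the median of the widest coordinate."""
    node = Node(ids, points)
    if len(ids) > 1:
        dim = len(points[ids[0]])
        spread = lambda d: max(points[i][d] for i in ids) - min(points[i][d] for i in ids)
        axis = max(range(dim), key=spread)
        ordered = sorted(ids, key=lambda i: points[i][axis])
        mid = len(ordered) // 2
        node.children = [build(ordered[:mid], points), build(ordered[mid:], points)]
    return node

def candidates(root, query, beam=2):
    """Beam search: at each level keep the `beam` nodes whose centers are
    closest to the query. The result is approximate, not exact nearest neighbors."""
    frontier = [root]
    while any(n.children for n in frontier):
        expanded = [c for n in frontier for c in (n.children or [n])]
        expanded.sort(key=lambda n: math.dist(n.center, query))
        frontier = expanded[:beam]
    return [n.ids[0] for n in frontier]

# Hypothetical 2-D summaries of nine entities.
points = {
    "e1": (0.0, 0.0), "e2": (0.1, 0.2), "e3": (0.2, 0.1),
    "e4": (5.0, 5.0), "e5": (5.2, 5.1), "e9": (5.1, 4.8),
    "e6": (9.0, 0.0), "e7": (9.1, 0.3), "e8": (9.3, 0.1),
}
root = build(list(points), points)
found = candidates(root, query=(5.1, 5.0), beam=2)
print(found)  # a short list of candidate entity ids for the new mention
```

Each query visits a number of nodes proportional to the beam width times the tree depth instead of scoring all entities. A wider beam trades speed for a lower chance of missing the right cluster.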

Thank you for listening.

Questions?
