Nicholas Monath, Ari Kobren, Andrew McCallum
Entity Resolution and Clustering:
Inventor Disambiguation and Related Tasks
Entity Resolution
A. Banerjee, S. Chassang, E. Snowberg. Decision Theoretic Approaches to Experiment Design and External Validity. Handbook of Field Experiments. 2016.
Arindam Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh. Clustering with Bregman Divergences. JMLR. 2006.
A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. Journal of Machine Learning Research. 2005
Author/Inventor Coreference
Entity Resolution
Dengue viruses are members of the Flaviviridae, transmitted principally in a cycle involving humans and mosquito vectors.
The virus-encoded RNA-dependent RNA polymerase (RdRp), which is required for replication of the positive-strand RNA genome, is a key enzyme of members of the virus family Flaviviridae
We present several modifications of the original recurrent neural network language model
Unlike Toutanova et al. (2015), we also consider RNNs, specifically Long-Short Term Memory Networks (LSTMs) (Hochreiter and Schmidhuber, 1997)
Cross Document Coreference
Entity Resolution as Clustering
Given mentions M = {m1, m2, …, mN}
Partition M into entities E = {e1, e2, …, ek}, where k is unknown in advance
[Figure: mentions m1, m2, …, mN grouped into entities e1, e2, e3, …, ek; example entity record: A. Banerjee, D.O.B. 12/4/87]
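A minimal sketch of the clustering view above, under stated assumptions: each mention is a hypothetical bag of tokens (loosely based on the paper titles shown earlier), pairwise similarity is plain Jaccard overlap, and mentions are linked whenever similarity reaches a threshold. The function names, the threshold, and the similarity measure are illustrative choices, not the method used by the authors. The point is only that the number of entities k falls out of the data rather than being fixed in advance.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def resolve(mentions, threshold=0.25):
    """Partition mentions into entities by linking similar pairs.

    Returns a sorted list of clusters, each a list of mention indices.
    The number of clusters is not specified in advance.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if jaccard(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical token features for three "A. Banerjee" mentions.
mentions = [
    ["banerjee", "clustering", "bregman", "divergences"],
    ["banerjee", "clustering", "hypersphere", "mises", "fisher"],
    ["banerjee", "experiment", "design", "validity"],
]
entities = resolve(mentions)
print(entities)
```

With these toy features the two clustering papers are linked and the experiment-design paper stays separate. The all-pairs loop is quadratic in the number of mentions, so it is only meant to illustrate the problem statement.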
Entity Resolution Challenges
Clustering hundreds of millions of entity mentions
Many Singleton Clusters — Large number of Entities
Entity Resolution in a Growing Knowledge Base
Build KB of inventors / researchers / entities
Receive new data constantly
Update KB with new data immediately
KB mentions:
Jane Smith: Physics
J. Smith: Music Theory
Jack Smith: Ecology
John Smith: Control Systems
J. Smith: Organometallic Compounds
J. Smith: Metal Alloy Composition
J. Smith: LCD display wiring
J. Smith: Display wiring methods
J. Smith: LCD Displays
J. Smith: LCD Displays
Entity Resolution in a Growing Knowledge Base
New mention mi: J. Smith, Organometallic Compounds for LCD Displays
Determine entity for new data & update existing entities
Merge entities!
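The update step on this slide could be sketched as follows. This is an illustration under assumptions, not the authors' system: the `KnowledgeBase` class, the Jaccard score, the threshold of 0.4, and the topic-token features are all hypothetical, and all mentions are assumed to already share a compatible name ("J. Smith"). A new mention is compared against every existing entity, and if it matches more than one, those entities are merged together with it.

```python
def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


class KnowledgeBase:
    """Toy incremental KB: each entity is a list of mentions (token sets)."""

    def __init__(self, threshold=0.4):
        self.threshold = threshold
        self.entities = []

    def _score(self, mention, entity):
        # Compare the mention with the union of the entity's tokens.
        entity_tokens = set().union(*entity)
        return jaccard(mention, entity_tokens)

    def add(self, mention):
        """Insert a mention, merging every entity it matches."""
        mention = frozenset(mention)
        merged, kept = [mention], []
        for entity in self.entities:
            if self._score(mention, entity) >= self.threshold:
                merged.extend(entity)   # absorb the matching entity
            else:
                kept.append(entity)     # leave unrelated entities alone
        self.entities = kept + [merged]
        return merged


kb = KnowledgeBase()
kb.add({"organometallic", "compounds"})
kb.add({"lcd", "displays"})
kb.add({"music", "theory"})
before = len(kb.entities)

# New mention m_i overlaps two existing entities, so they are merged.
merged = kb.add({"organometallic", "compounds", "lcd", "displays"})
after = len(kb.entities)
print(before, after, len(merged))
```

In this toy run the new mention bridges the "organometallic compounds" and "LCD displays" entities, which is one plausible reading of the "Merge entities!" step. Scanning every entity for each new mention does not scale to hundreds of millions of mentions.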
Related Work: Author & Inventor Coreference
• “Author-ity” System for Disambiguation in MEDLINE & U.S. Patents: [Lai, D’Amour, Doolin, Li, Sun, Torvik, Yu, Fleming; 2014], [Torvik and Smalheiser; 2009], [Torvik, Weeber, Swanson and Smalheiser; 2005]
• Disambiguation using the DBSCAN clustering algorithm & a learned distance function: [Khabsa, Treeratpituk, Giles; 2014]
• Using Random Forest classifiers: [Ventura, Nugent, Fuchs; 2015], [Treeratpituk, Giles; 2009]
• Numerous other approaches: [Ge, Huang, Png; 2014], [Liu et al; 2014], [Levin, Krawczyk, Bethard, Jurafsky; 2012], [Lissoni, Maurino, Pezzoni, Tarasconi; 2010], [Raffo, Lhuillery; 2009], [Carayol, Cassi; 2009], [Yang, Peng, Jiang, Lee, Ho; 2008], [Fleming, King, Juda; 2007], [Lissoni, Sanditov, Sanditov; 2006], [Han, Zha, Giles; 2005] & many others!
Related Work: Cross Document Coreference and Entity Linking
• Hierarchical Models: [Wick, Singh, McCallum, 2013], [Singh et al, 2011]
• Sampling based Spectral Clustering: [Dutta and Weikum, 2015]
• Streaming Models: [Rao, McNamee, Dredze, 2010]
• Local & Global Contexts: [Chang and Roth, 2013]
• Selective Context Selection: [Lazic, Subramanya, Ringgaard, Pereira, 2015]
• Neural Networks: [Huang et al, 2015], [Sun et al, 2015], [Yamada et al, 2016]
• Graph/Coherence Based: [Hoffart et al, 2011], [Han and Zhao, 2009]
• Wikification: [Milne and Witten, 2008]
• [Gooi and Allan, 2004], [Nicolae and Nicolae, 2006], and many others!
Large Scale Clustering
Efficiently find candidate clusters for new data
[Figure: entities e1 through e9]
Tree structures for efficient search
Inspired by work in extreme multiclass classification: [Daume, et al 2016], [Choromanska & Langford, 2015], [Beygelzimer, et al 2010] and others
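As a rough illustration of why a tree helps, the sketch below arranges entities as leaves of a binary tree, stores at each internal node the union of its descendants' tokens, and routes a new mention greedily toward the child it overlaps with most. Everything here is an assumption made for illustration (the `Node` class, the pairing order, the overlap score, and the toy entities); the slides do not describe the actual tree construction or routing rule.

```python
class Node:
    """Tree node; leaves carry an entity_id, internal nodes carry children."""

    def __init__(self, tokens, left=None, right=None, entity_id=None):
        self.tokens = tokens
        self.left = left
        self.right = right
        self.entity_id = entity_id


def build(entities):
    """Build a binary tree over a non-empty list of (entity_id, tokens) pairs."""
    nodes = [Node(frozenset(tokens), entity_id=eid) for eid, tokens in entities]
    while len(nodes) > 1:
        paired = []
        for k in range(0, len(nodes) - 1, 2):
            left, right = nodes[k], nodes[k + 1]
            # An internal node summarizes everything beneath it.
            paired.append(Node(left.tokens | right.tokens, left, right))
        if len(nodes) % 2:
            paired.append(nodes[-1])  # carry the unpaired node up a level
        nodes = paired
    return nodes[0]


def search(root, query):
    """Greedy root-to-leaf descent; returns one candidate entity id."""
    query = frozenset(query)
    node = root
    while node.entity_id is None:
        left_score = len(query & node.left.tokens)
        right_score = len(query & node.right.tokens)
        node = node.left if left_score >= right_score else node.right
    return node.entity_id


# Hypothetical entities keyed by id, described by topic tokens.
entities = [
    ("e1", {"physics"}),
    ("e2", {"music", "theory"}),
    ("e3", {"ecology"}),
    ("e4", {"lcd", "displays", "wiring"}),
]
root = build(entities)
print(search(root, {"lcd", "displays"}))
```

A descent visits one node per level, so with a balanced tree the number of comparisons grows with the logarithm of the number of entities instead of linearly. Greedy routing can miss the best entity, which is why such a tree is better seen as a way to propose candidates than as a final decision.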
Thank you for listening.
Questions?