Link Mining. Lise Getoor, University of Maryland, College Park. August 22, 2012

Page 1: Lise Getoor, "

Link Mining

Lise Getoor, University of Maryland, College Park

August 22, 2012

Page 2: Lise Getoor, "

Alternate Title...

What Machine Learning/Statistics/Data Mining can do for YOU!

1. Predict future values
2. Fill in missing values
3. Identify anomalies
(Supervised Learning)

4. Find patterns
5. Identify clusters
(Unsupervised Learning)

What are some common machine learning algorithms?

Page 3: Lise Getoor, "

So, what’s Link Mining? Machine learning when you have graphs (or networks)

Nodes are entities
• People
• Places
• Organizations
• Text

Links are relationships
• Friends
• MemberOf
• LivesIn
• Tweeted
• Posted

e.g., heterogeneous multi-relational data, multimodal data ...
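One way to hold such data in memory is a graph whose nodes and links both carry a type. The sketch below is only an illustration: the class name `TypedGraph`, its methods, and the sample nodes are assumptions, not part of the talk.

```python
from collections import defaultdict


class TypedGraph:
    """Minimal multi-relational graph: typed nodes, typed directed links."""

    def __init__(self):
        self.node_type = {}            # node id -> entity type
        self.links = defaultdict(set)  # (source, relation) -> set of targets

    def add_node(self, node, kind):
        self.node_type[node] = kind

    def add_link(self, source, relation, target):
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.links[(source, relation)].add(target)

    def neighbors(self, node, relation):
        # .get avoids creating empty entries in the defaultdict
        return set(self.links.get((node, relation), set()))


# Hypothetical sample data using the entity and link types named above.
g = TypedGraph()
g.add_node("ann", "Person")
g.add_node("bob", "Person")
g.add_node("umd", "Organization")
g.add_node("college_park", "Place")
g.add_link("ann", "Friends", "bob")
g.add_link("ann", "MemberOf", "umd")
g.add_link("bob", "LivesIn", "college_park")
```

Keying links by (source, relation) keeps each relationship type separately queryable, which is what the later relational features need.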

Page 4: Lise Getoor, "

Ex: Social Media Relationships

User-User: Friends, Collaborators, Family, Fan/Follower, Replies, Co-Edits, Co-Mentions, etc.

User-Doc: Comments, Edits, etc.

User-Query-Click: User, Query, URL

User-Tag-Doc: User, Tag, Doc

Page 5: Lise Getoor, "

Link Mining Tasks
• Node Labeling
• Link Prediction
• Entity Resolution
• Group Detection

Page 6: Lise Getoor, "

Node Labeling

What is Harry’s political persuasion?

[Figure: network with labeled nodes Harry and Natasha]

Page 7: Lise Getoor, "

Link Prediction

Friends?

Page 8: Lise Getoor, "

Entity Resolution
Aka: deduplication, co-reference resolution, record linkage, reference consolidation, etc.

Page 9: Lise Getoor, "

Abstract Problem Statement

[Figure: the Real World alongside the Digital World, which holds Records / Mentions]

Page 10: Lise Getoor, "

Deduplication Problem Statement
Cluster the records/mentions that correspond to the same entity

Page 11: Lise Getoor, "

Deduplication Problem Statement
Cluster the records/mentions that correspond to the same entity
Intensional Variant: Compute cluster representative

Page 12: Lise Getoor, "

Record Linkage Problem Statement
Link records that match across databases

[Figure: databases A and B]

Page 13: Lise Getoor, "

Reference Matching Problem
Match noisy records to clean records in a reference table

[Figure: Reference Table]

Page 14: Lise Getoor, "

InfoVis Co-Author Network Fragment

[Figure: two panels, before and after]

Page 15: Lise Getoor, "

Group Detection

Page 16: Lise Getoor, "

Link Mining Algorithms
• Node Labeling
• Link Prediction
• Entity Resolution
• Group Detection

Page 17: Lise Getoor, "

Link Mining Algorithms
• Node Labeling
  1. Relational Classifiers
  2. Collective Classifiers
• Link Prediction
• Entity Resolution
• Group Detection

Page 18: Lise Getoor, "

Relational Classifiers

Given: [Figure: example network with nodes labeled 1-5, a-e, and w-z]

Task: Predict attribute of some of the entities

Alternate task: Predict existence of relationship between entities

Instances are described by local features and relational features, e.g.:
• same-attribute-value
• number of neighbors
• avg value of neighbors
• number of shared neighbors
• participate in relation

Page 19: Lise Getoor, "

Relational Classifiers
• Values are represented as a fixed-length feature vector
• Instances are treated independently of each other
• Relational features are computed by aggregating over related entities
• Any classification or regression model can be used for learning and prediction
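The aggregation step can be sketched as follows. This is a minimal illustration under assumptions: the function name, the single numeric attribute, and the toy data are hypothetical, and the resulting rows could be passed to any off-the-shelf classifier or regressor.

```python
def relational_features(node, attrs, neighbors):
    """Build a fixed-length feature vector for one node.

    attrs: node -> numeric attribute value (hypothetical, e.g. age)
    neighbors: node -> set of related nodes
    """
    related = neighbors.get(node, set())
    values = [attrs[n] for n in related if n in attrs]
    avg = sum(values) / len(values) if values else 0.0
    return [
        attrs[node],   # local feature: the node's own attribute
        len(related),  # relational feature: number of neighbors
        avg,           # relational feature: average value of neighbors
    ]


# Toy data: each node gets a vector of the same length no matter
# how many neighbors it has.
attrs = {"a": 30.0, "b": 20.0, "c": 40.0, "d": 50.0}
neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
X = [relational_features(n, attrs, neighbors) for n in sorted(attrs)]
```

Aggregates such as count and average are what make the vector fixed-length: a node with two neighbors and a node with none both map to three numbers.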

Page 20: Lise Getoor, "

Application Case Studies
• Two example applications that use relational classifiers
• Focus is on types of relational features used

Case Study 1: Predicting click-through rate of search result ads
Case Study 2: Predicting friendships in a social network

Page 21: Lise Getoor, "

Case Study 1: Predicting Ad Click-Through Rate

Task: Predict the click-through rate (CTR) of an online ad, given that it is seen by the user, where the ad is described by:
• URL to which user is sent when clicking on ad
• Bid terms used to determine when to display ad
• Title and text of ad

Our description is based on approach by [Richardson et al., WWW07]

Page 22: Lise Getoor, "

Relational Features Used

[Figure: predicting "CTR?" for an ad from ads Ad1-Ad6 and bid terms BT1-BT6. Link labels shown: contains-bid-term, related-bid-term (containing subsets or supersets of the term), contains-bid-term (according to search engine), queried-bid-term. Aggregates shown: Average CTR, Count]
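The "Average CTR" and "Count" aggregates can be illustrated with a small sketch. It is not the feature set of the cited paper: the function name, the dictionaries, and the CTR values are made up for illustration, and only the contains-bid-term relation is used.

```python
def bid_term_features(ad, ad_terms, known_ctr):
    """Aggregate CTRs of other ads that share at least one bid term with `ad`."""
    terms = ad_terms[ad]
    related = [
        other
        for other, other_terms in ad_terms.items()
        if other != ad and other in known_ctr and terms & other_terms
    ]
    count = len(related)
    avg_ctr = sum(known_ctr[o] for o in related) / count if count else 0.0
    return {"count": count, "avg_ctr": avg_ctr}


# Hypothetical ads, bid terms, and observed click-through rates.
ad_terms = {
    "ad1": {"red shoes"},
    "ad2": {"red shoes", "boots"},
    "ad3": {"boots"},
    "new_ad": {"red shoes"},
}
known_ctr = {"ad1": 0.02, "ad2": 0.04, "ad3": 0.10}
features = bid_term_features("new_ad", ad_terms, known_ctr)
```

Here `new_ad` has no click history of its own, so the average CTR of ads with a shared bid term stands in as evidence; the count indicates how much evidence backs that average.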

Page 23: Lise Getoor, "

Case Study 2: Predicting Friendships

Task: Predict new friendships among users, based on their descriptive attributes, their existing friendships, and their family ties.

Our description is based on approach by [Zheleva et al., SNAKDD08]

Page 24: Lise Getoor, "

Relational Features Used
“Petworks” - social networks of pets

[Figure: pets P1-P11 and families F1, F2, with a candidate "Friends?" link between two pets. Feature labels shown: count, density, proportion, same-breed, in-family, Jaccard coeff]

Page 25: Lise Getoor, "

Key Idea: Feature Construction
Feature informativeness is key to the success of a relational classifier

Features can be:
• Attributes of entity/entities
• Match predicate on attributes of entities
• Attributes of related entities
• Encode structural features
• Based on overlap in sets
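For a candidate link between two entities, a few of these feature kinds can be sketched as below. The function name, the dictionaries, and the pet data are illustrative assumptions loosely modeled on the pet example; they are not the features of the cited paper.

```python
def pair_features(u, v, breed, friends, family):
    """Features for a candidate friendship link (u, v)."""
    fu, fv = friends.get(u, set()), friends.get(v, set())
    shared, union = fu & fv, fu | fv
    return {
        "same_breed": breed[u] == breed[v],    # match predicate on attributes
        "in_family": family[u] == family[v],   # same family group
        "shared_friends": len(shared),         # structural feature (count)
        "jaccard": len(shared) / len(union) if union else 0.0,  # overlap in sets
    }


# Hypothetical pets, breeds, families, and existing friendships.
breed = {"P1": "beagle", "P2": "beagle", "P3": "poodle", "P4": "pug"}
family = {"P1": "F1", "P2": "F2", "P3": "F1", "P4": "F2"}
friends = {"P1": {"P3", "P4"}, "P2": {"P3"}, "P3": {"P1", "P2"}, "P4": {"P1"}}
f = pair_features("P1", "P2", breed, friends, family)
```

The Jaccard coefficient normalizes the shared-friend count by the size of the combined friend set, so well-connected pairs are not favored simply for having many friends.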

Page 26: Lise Getoor, "

Link Mining Algorithms
• Node Labeling
  1. Relational Classifiers
  2. Collective Classifiers
• Link Prediction
• Entity Resolution
• Group Detection

Page 27: Lise Getoor, "

Collective Classification
[Neville & Jensen, SRL00; Lu & Getoor, ICML03; Sen et al., AI Mag08]

• Extends relational classifiers by allowing relational features to be functions of predicted attributes/relations of neighbors
• At training time, these features are computed based on observed values in the training set
• At inference time, the algorithm iterates, computing relational features based on the current prediction for any unobserved attributes
• In the first, bootstrap, iteration, only local features are used

Page 28: Lise Getoor, "

CC: Learning

[Figure: training network of nodes P1-P10, with the label set shown]

Learn models (local and relational) from fully labeled training set

Page 29: Lise Getoor, "

CC: Inference (1)

[Figure: network of nodes P1-P5]

Step 1: Bootstrap using entity attributes only

Page 30: Lise Getoor, "

CC: Inference (2)

[Figure: network of nodes P1-P5]

Step 2: Iteratively update the category of each entity, based on related entities’ categories
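The two inference steps can be sketched as a simple iterative loop. The two "models" below are hand-written stand-ins (a threshold and a neighbor majority vote) rather than learned classifiers, and all names and data are assumptions made for illustration.

```python
from collections import Counter


def local_predict(x):
    """Stand-in local model: thresholds a single attribute (assumption)."""
    return "A" if x >= 0.5 else "B"


def relational_predict(x, neighbor_labels):
    """Stand-in relational model: majority label among neighbors, falling
    back to the local model on ties or when there are no neighbors."""
    top = Counter(neighbor_labels).most_common(2)
    if not top or (len(top) == 2 and top[0][1] == top[1][1]):
        return local_predict(x)
    return top[0][0]


def collective_classify(attrs, neighbors, observed, max_iters=10):
    # Step 1: bootstrap unobserved nodes from their own attributes only.
    labels = dict(observed)
    unknown = [n for n in sorted(attrs) if n not in observed]
    for n in unknown:
        labels[n] = local_predict(attrs[n])
    # Step 2: re-predict each unobserved node from its neighbors' current
    # labels until nothing changes (or the iteration cap is reached).
    for _ in range(max_iters):
        changed = False
        for n in unknown:
            new = relational_predict(attrs[n], [labels[m] for m in neighbors[n]])
            if new != labels[n]:
                labels[n], changed = new, True
        if not changed:
            break
    return labels


attrs = {"P1": 0.9, "P2": 0.6, "P3": 0.4, "P4": 0.55, "P5": 0.1}
neighbors = {
    "P1": ["P2", "P3"], "P2": ["P1", "P3"], "P3": ["P1", "P2", "P4"],
    "P4": ["P3", "P5"], "P5": ["P4"],
}
observed = {"P1": "A", "P5": "B"}
labels = collective_classify(attrs, neighbors, observed)
```

In this toy run P3 is bootstrapped to "B" from its own attribute, then flips to "A" once its neighbors' labels are taken into account. Labels are updated in place as the loop visits nodes, which corresponds to incremental rather than batch updating.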

Page 31: Lise Getoor, "

CC Key Idea
Rather than make predictions independently, begin with a relational classifier, and then ‘propagate’ classification

Variations:
• Propagate probabilities, rather than mode (related to Gibbs Sampling)
• Batch vs. Incremental updates
• Ordering strategies

Active area of research: active learning, semi-supervised learning, more principled joint probabilistic models, etc.

Page 32: Lise Getoor, "

Link Mining Algorithms
• Node Labeling
• Link Prediction
• Entity Resolution
• Group Detection

Page 33: Lise Getoor, "

The Entity Resolution Problem

[Figure: real-world entities (John Smith, James Smith, Jonathan Smith) and the references that mention them: “John Smith”, “Jim Smith”, “James Smith”, “J Smith”, “Jon Smith”, “Jonthan Smith”]

Issues:
1. Identification
2. Disambiguation

Page 34: Lise Getoor, "

Relational Identification

Very similar names. Added evidence from shared co-authors

Page 35: Lise Getoor, "

Relational Disambiguation

Very similar names but no shared collaborators

Page 36: Lise Getoor, "

Collective Entity Resolution

One resolution provides evidence for another => joint resolution

Page 37: Lise Getoor, "

P1: “JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson

P2: “Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus

P3: “Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm”, C. Walshaw, M. Cross, M. G. Everett

P4: “Code Generation for Machines with Multiregister Operations”, Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman

P5: “Deterministic Parsing of Ambiguous Grammars”, A. Aho, S. Johnson, J. Ullman

P6: “Compilers: Principles, Techniques, and Tools”, A. Aho, R. Sethi, J. Ullman

Page 39: Lise Getoor, "

Relational Clustering (RC-ER)

P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson

P2: C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus

P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman

P5: A. Aho, S. Johnson, J. Ullman

Page 43: Lise Getoor, "

Cut-based Formulation of RC-ER

[Figure: two alternative clusterings of the references M. G. Everett, M. Everett, S. Johnson, Stephen C. Johnson, A. Aho, and Alfred V. Aho]

First clustering: Good separation of attributes. Many cluster-cluster relationships: Aho-Johnson1, Aho-Johnson2, Everett-Johnson1

Second clustering: Worse in terms of attributes. Fewer cluster-cluster relationships: Aho-Johnson1, Everett-Johnson2

Page 44: Lise Getoor, "

Objective Function

Minimize:

$$\sum_i \sum_j \left[ w_A \, sim_A(c_i, c_j) + w_R \, sim_R(c_i, c_j) \right]$$

where $w_A$ is the weight for attributes, $sim_A$ is the similarity of attributes, $w_R$ is the weight for relations, and $sim_R$ is the similarity based on relational edges between $c_i$ and $c_j$.

Greedy clustering algorithm: merge cluster pair with max reduction in objective function

$$sim(c_i, c_j) = w_A \, sim_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)|$$

where the first term is the similarity of attributes and the second term is the common cluster neighborhood.
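The combined measure is simple to compute once each cluster's neighborhood is known. The sketch below assumes equal weights, a caller-supplied attribute similarity, and made-up cluster ids and neighborhoods; none of these values come from the talk.

```python
def combined_similarity(ci, cj, attr_sim, cluster_neighbors, w_a=0.5, w_r=0.5):
    """Compute w_A * sim_A(ci, cj) + w_R * |N(ci) & N(cj)|.

    attr_sim: function (ci, cj) -> attribute similarity, supplied by the caller
    cluster_neighbors: cluster id -> set of cluster ids it co-occurs with
    The weights 0.5 / 0.5 are arbitrary defaults (assumption).
    """
    common = cluster_neighbors[ci] & cluster_neighbors[cj]
    return w_a * attr_sim(ci, cj) + w_r * len(common)


# Hypothetical clusters: two "S. Johnson" references with shared co-author
# clusters, and one "Stephen C. Johnson" with different co-authors.
names = {"c1": "S. Johnson", "c2": "S. Johnson", "c3": "Stephen C. Johnson"}
cluster_neighbors = {
    "c1": {"walshaw", "cross", "everett"},
    "c2": {"walshaw", "cross", "everett", "mcmanus"},
    "c3": {"aho", "ullman"},
}


def exact_name_match(ci, cj):
    """Toy attribute similarity: 1.0 only for identical name strings."""
    return 1.0 if names[ci] == names[cj] else 0.0


s12 = combined_similarity("c1", "c2", exact_name_match, cluster_neighbors)
s13 = combined_similarity("c1", "c3", exact_name_match, cluster_neighbors)
```

Because the neighborhood term is a raw count, it is not on the same scale as an attribute similarity in [0, 1]; a practical implementation would likely normalize it or tune the weights accordingly.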

Page 45: Lise Getoor, "

Relational Clustering Algorithm
1. Find similar references using ‘blocking’
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into priority queue
4. Repeat until priority queue is empty
5.    Find ‘closest’ cluster pair
6.    Stop if similarity below threshold
7.    Merge to create new cluster
8.    Update similarity for ‘related’ clusters

O(n k log n) algorithm w/ efficient implementation
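Steps 3 to 8 can be sketched with a heap-based priority queue. This simplified version is an assumption-laden illustration: it skips blocking and bootstrapping, starts from singleton clusters, and recomputes similarities against every remaining cluster after a merge, so it does not achieve the stated complexity. The toy similarity uses names only.

```python
import heapq
from itertools import combinations


def greedy_cluster(items, similarity, threshold):
    """Greedy agglomerative clustering driven by a priority queue.

    similarity: function (frozenset, frozenset) -> float between two clusters
    """
    clusters = {frozenset([x]) for x in items}
    heap = []     # max-heap emulated by pushing negated similarities
    counter = 0   # tie-breaker so heapq never has to compare frozensets

    def push(a, b):
        nonlocal counter
        heapq.heappush(heap, (-similarity(a, b), counter, a, b))
        counter += 1

    for a, b in combinations(list(clusters), 2):
        push(a, b)
    while heap:
        neg_sim, _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue              # stale entry: one side was already merged
        if -neg_sim < threshold:
            break                 # closest remaining pair is not similar enough
        merged = a | b
        clusters -= {a, b}
        for other in clusters:    # queue similarities for the new cluster
            push(merged, other)
        clusters.add(merged)
    return clusters


def name_sim(x, y):
    """Toy attribute similarity: same last name and same first initial."""
    return 1.0 if x.split()[-1] == y.split()[-1] and x[0] == y[0] else 0.0


def single_link(a, b):
    return max(name_sim(x, y) for x in a for y in b)


refs = ["A. Aho", "Alfred V. Aho", "S. Johnson", "Stephen C. Johnson", "J. Ullman"]
result = greedy_cluster(refs, single_link, threshold=0.5)
```

Stale heap entries are skipped lazily instead of being deleted, which keeps each queue operation logarithmic in the queue size.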

Page 46: Lise Getoor, "

Evaluation Datasets

CiteSeer
• 1,504 citations to machine learning papers (Lawrence et al.)
• 2,892 references to 1,165 author entities

arXiv
• 29,555 publications from High Energy Physics (KDD Cup ’03)
• 58,515 refs to 9,200 authors

Elsevier BioBase
• 156,156 Biology papers (IBM KDD Challenge ’05)
• 831,991 author refs
• Keywords, topic classifications, language, country and affiliation of corresponding author, etc.

Page 47: Lise Getoor, "

Baselines

A: Pair-wise duplicate decisions w/ attributes only
• Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
• Other textual attributes: TF-IDF

A*: Transitive closure over A

A+N: Add attribute similarity of co-occurring refs

A+N*: Transitive closure over A+N

Evaluate pair-wise decisions over references
• F1-measure (harmonic mean of precision and recall)
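The transitive-closure baselines and the pair-wise F1 measure can be illustrated together. The sketch below assumes a union-find structure for the closure and uses made-up references and match decisions; it is not the evaluation code behind the reported numbers.

```python
from itertools import combinations


def transitive_closure(refs, matched_pairs):
    """Union-find over pairwise match decisions; returns ref -> cluster root."""
    parent = {r: r for r in refs}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    return {r: find(r) for r in refs}


def pairwise_f1(predicted, truth):
    """F1 over all reference pairs; both arguments map ref -> cluster label."""
    pairs = list(combinations(sorted(truth), 2))
    pred_pairs = {p for p in pairs if predicted[p[0]] == predicted[p[1]]}
    true_pairs = {p for p in pairs if truth[p[0]] == truth[p[1]]}
    tp = len(pred_pairs & true_pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_pairs), tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: r1, r2, r3 refer to entity e1; r4 refers to e2.
refs = ["r1", "r2", "r3", "r4"]
truth = {"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"}
pairwise = [("r1", "r2"), ("r2", "r3")]      # pair-wise decisions only
closed = transitive_closure(refs, pairwise)  # closure also links r1 and r3
score = pairwise_f1(closed, truth)
```

In this example the closure recovers the missing r1-r3 pair. Closure can equally merge clusters through a single wrong pair-wise decision, which lowers precision.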

Page 48: Lise Getoor, "

ER over Entire Dataset

Algorithm   CiteSeer   arXiv   BioBase
A           0.980      0.976   0.568
A*          0.990      0.971   0.559
A+N         0.973      0.938   0.710
A+N*        0.984      0.934   0.753
RC-ER       0.995      0.985   0.818

• RC-ER outperforms baselines in all datasets
• Collective resolution better than naïve relational resolution

Page 49: Lise Getoor, "

ER over Entire Dataset

• CiteSeer: Near perfect resolution; 22% error reduction
• arXiv: 6,500 additional correct resolutions; 20% error reduction
• BioBase: Biggest improvement over baselines

Page 50: Lise Getoor, "

Flipside….

Page 51: Lise Getoor, "

Privacy breaches in OSNs

Identity disclosure
• A mapping from a record to a specific individual
• Example: "Who is ?"

Attribute disclosure
• Find attribute value that the user intended to stay private
• Example: "Is liberal?"

Social link disclosure
• Participation in a sensitive relationship or communication
• Example: "Friends?"

Affiliation link disclosure
• Participation in a group revealing a sensitive attribute value
• Example: "Support gay marriage"

Page 52: Lise Getoor, "

Other Linqs Projects
• Key Opinion Leader Identification
• Active Surveying in Social Networks
• Ontology Alignment and Folksonomy construction
• Label Acquisition & Active Learning in Network Data
• Inference & Search in Camera Networks
• Identifying Roles in Social Networks
• Group Recommendation in Social Networks
• Social Search
• Analysis of Dynamic Networks: loyalty, stability, diversity
• Ranking and Retrieval in Biological Networks
• Discourse-level sentiment analysis
• Bilingual Word Sense Disambiguation
• Visual Analytics: D-Dupe, C-Group, G-Pare
• Others …

http://www.cs.umd.edu/linqs

Page 54: Lise Getoor, "

Conclusion
• Link mining algorithms can be useful tools for social media
• Need algorithms that can handle the multi-modal, multi-relational, temporal nature of social media
• Collective algorithms make use of structure to define features and propagate information, which allows us to improve the overall accuracy
• While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs (improved personalization and context-aware predictions!)

Page 55: Lise Getoor, "

http://www.cs.umd.edu/linqs

Work sponsored by the National Science Foundation, Maryland Industrial Partners (MIPS), National Geospatial Agency, Airforce Research Laboratory, DARPA, Google, Microsoft, and Yahoo!