
Page 1: Declarative analysis of noisy information networks

Declarative Analysis of Noisy Information Networks

Walaa Eldin Moustafa, Galileo Namata

Amol Deshpande, Lise Getoor

University of Maryland

Page 2: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 3: Declarative analysis of noisy information networks

Motivation

Page 4: Declarative analysis of noisy information networks

Motivation

• Users/objects are modeled as nodes, relationships as edges

• The observed networks are noisy and incomplete.
– Some users may have more than one account
– Communication may contain a lot of spam

• Missing attributes, missing links, and multiple references to the same entity

• Need to extract the underlying information network.

Page 5: Declarative analysis of noisy information networks

Inference Operations

• Attribute Prediction
– To predict values of missing attributes

• Link Prediction
– To predict missing links

• Entity Resolution
– To predict if two references refer to the same entity

• These prediction tasks can use:
– Local node information
– Relational information surrounding the node

Page 6: Declarative analysis of noisy information networks

Attribute Prediction

[Figure: example citation network; nodes are papers (e.g., "Automatic Rule Refinement for Information Extraction", "Language Model Based Arabic Word Segmentation") labeled DB, NL, or ? per the legend]

Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]

Task: Predict topic of the paper

Page 7: Declarative analysis of noisy information networks

Attribute Prediction

[Figure: the same citation network; two papers with unknown topic are marked P1 and P2]

Task: Predict topic of the paper


Page 8: Declarative analysis of noisy information networks

Attribute Prediction

[Figure: the same citation network; the predictions for P1 and P2 can inform each other (collective prediction)]

Task: Predict topic of the paper


Page 9: Declarative analysis of noisy information networks

Link Prediction

• Goal: Predict new links
• Using local similarity
• Using relational similarity [Liben-Nowell et al., CIKM 2003]

[Figure: example co-authorship network among Divesh Srivastava, Vladislav Shkapenyuk, Nick Koudas, Avishek Saha, Graham Cormode, Flip Korn, Lukasz Golab, and Theodore Johnson]

Page 10: Declarative analysis of noisy information networks

Entity Resolution

• Goal: to deduce that two references refer to the same entity

• Can be based on node attributes (local)
– e.g. string similarity between titles or author names

• Local information only may not be enough

[Figure: two references both named "Jian Li"]

Page 11: Declarative analysis of noisy information networks

Entity Resolution

[Figure: the two "Jian Li" references shown with their co-authors (William Roberts, Petre Stoica, Prabhu Babu vs. Amol Deshpande, Samir Khuller, Barna Saha)]

Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]

Page 12: Declarative analysis of noisy information networks

Joint Inference

• Each task helps the others make better predictions.
• How to combine the tasks?
– One after the other (pipelined), or interleaved?

• GAIA:
– A Java library for applying multiple joint AP, LP, and ER learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]
– Inference can be pipelined or interleaved.

Page 13: Declarative analysis of noisy information networks

Our Goal and Contributions

• Motivation: To support declarative network inference

• Desiderata:
– User declaratively specifies the prediction features
• Local features
• Relational features
– Declaratively specify tasks
• Attribute prediction, Link prediction, Entity resolution
– Specify arbitrary interleaving or pipelining
– Support for complex prediction functions
• Handle all that efficiently

Page 14: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 15: Declarative analysis of noisy information networks

Unifying Framework

Specify the domain
Compute features
Make Predictions, and Compute Confidence in the Predictions
Choose Which Predictions to Apply

For attribute prediction, the domain is a subset of the graph nodes.

For link prediction and entity resolution, the domain is a subset of pairs of nodes.

Page 16: Declarative analysis of noisy information networks

Unifying Framework

Specify the domain
Compute features
Make Predictions, and Compute Confidence in the Predictions
Choose Which Predictions to Apply

Local: word frequency, income, etc.
Relational: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.

Page 17: Declarative analysis of noisy information networks

Unifying Framework

Specify the domain
Compute features
Make Predictions, and Compute Confidence in the Predictions
Choose Which Predictions to Apply

Attribute prediction: the missing attribute

Link prediction: add link or not?

Entity resolution: merge two nodes or not?

Page 18: Declarative analysis of noisy information networks

Unifying Framework

Specify the Domain
Compute Features
Make Predictions, and Compute Confidence in the Predictions
Choose Which Predictions to Apply

After predictions are made, the graph changes:
Attribute prediction changes local attributes.
Link prediction changes the graph links.
Entity resolution changes both local attributes and graph links.

Page 19: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 20: Declarative analysis of noisy information networks

Datalog

• Use Datalog to express:
– Domains
– Local and relational features

• Extend Datalog with operational semantics (vs. fix-point semantics) to express:
– Predictions (in the form of updates)
– Iteration

Page 21: Declarative analysis of noisy information networks

Specifying Features

Degree:
Degree(X, COUNT<Y>) :- Edge(X, Y)

Number of neighbors with attribute 'A':
NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')

Clustering coefficient:
NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)
ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C=2*N/(D*(D-1))

Jaccard coefficient:
IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
UnionCount(X, Y, D) :- Degree(X, D1), Degree(Y, D2), IntersectionCount(X, Y, D3), D=D1+D2-D3
Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
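In standard graph notation, writing N(X) for the neighbor set of X and assuming COUNT<Y,Z> counts unordered neighbor pairs, the last two rules compute:

$$\mathrm{ClusteringCoeff}(X) = \frac{2\,\bigl|\{\,\{Y,Z\} \subseteq N(X) : (Y,Z) \in E\,\}\bigr|}{\deg(X)\,(\deg(X)-1)}
\qquad
\mathrm{Jaccard}(X,Y) = \frac{|N(X)\cap N(Y)|}{|N(X)\cup N(Y)|}$$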

Page 22: Declarative analysis of noisy information networks

Specifying Domains

• Domains are used to restrict the space of computation for the prediction elements.

• Space for this feature is |V|²

Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2)

• Using this domain, the space becomes |E|:
DOMAIN D(X,Y) :- Edge(X, Y)

• Other DOMAIN predicates:
– Equality
– Locality-sensitive hashing
– String similarity joins
– Traverse edges
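For example (an illustrative sketch, not from the slides), a domain that traverses edges could restrict candidate pairs to nodes two hops apart, using the same construct:

DOMAIN D2(X, Y) :- Edge(X, Z), Edge(Z, Y)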

Page 23: Declarative analysis of noisy information networks

Feature Vector

• Features of prediction elements are combined in a single predicate to create the feature vector:

DOMAIN D(X, Y) :- …
{
P1(X, Y, F1) :- …
…
Pn(X, Y, Fn) :- …
Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
}
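For instance, a hypothetical instantiation over candidate pairs could reuse the Jaccard and common-neighbor features defined earlier (the predicate names here are illustrative):

DOMAIN CandPairs(X, Y) :- Edge(X, Z), Edge(Y, Z)
{
P1(X, Y, J) :- Jaccard(X, Y, J)
P2(X, Y, N) :- IntersectionCount(X, Y, N)
Features(X, Y, J, N) :- P1(X, Y, J), P2(X, Y, N)
}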

Page 24: Declarative analysis of noisy information networks

Update Operation

DEFINE Merge(X, Y)
{
INSERT Edge(X, Z) :- Edge(Y, Z)
DELETE Edge(Y, Z)
UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew=(AX+AY)/2
UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew=max(BX, BY)
DELETE Node(Y)
}

Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
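By analogy (a sketch, not from the slides), a link-prediction update can apply INSERT directly, with no DEFINE block needed:

INSERT Edge(X, Y) :- Features(X, Y, F1,…,Fn), predict-LP(F1,…,Fn) = true, confidence-LP(F1,…,Fn) > 0.95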

Page 25: Declarative analysis of noisy information networks

Prediction and Confidence Functions

• The prediction and confidence functions are user defined functions

• Can be based on logistic regression, Bayes classifier, or any other classification algorithm

• The confidence is the class membership value
– In logistic regression, the confidence can be the value of the logistic function
– In a Bayes classifier, the confidence can be the posterior probability value
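For example, with a logistic-regression model over a feature vector f with learned weights w (a standard formulation; the symbols f and w are not from the slides):

$$\mathrm{confidence}(\mathbf{f}) = \sigma(\mathbf{w}^{\top}\mathbf{f}) = \frac{1}{1 + e^{-\mathbf{w}^{\top}\mathbf{f}}},
\qquad
\mathrm{predict}(\mathbf{f}) = \mathrm{true} \iff \sigma(\mathbf{w}^{\top}\mathbf{f}) > 0.5$$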

Page 26: Declarative analysis of noisy information networks

Iteration

• Iteration is supported by the ITERATE construct.
• Takes the number of iterations as a parameter, or * to iterate until no more predictions are made.

ITERATE(*)
{
Merge(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%
}

Page 27: Declarative analysis of noisy information networks

Pipelining

DOMAIN ER(X,Y) :- …
{
ER1(X,Y,F1) :- …
ER2(X,Y,F2) :- …
Features-ER(X,Y,F1,F2) :- …
}

DOMAIN LP(X,Y) :- …
{
LP1(X,Y,F1) :- …
LP2(X,Y,F2) :- …
Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
}
ITERATE(*)
{
Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}

Page 28: Declarative analysis of noisy information networks

Interleaving

DOMAIN ER(X,Y) :- …
{
ER1(X,Y,F1) :- …
ER2(X,Y,F2) :- …
Features-ER(X,Y,F1,F2) :- …
}

DOMAIN LP(X,Y) :- …
{
LP1(X,Y,F1) :- …
LP2(X,Y,F2) :- …
Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}

Page 29: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 30: Declarative analysis of noisy information networks

Implementation

• Prototype based on Java Berkeley DB
• Implemented a query parser, plan generator, and query evaluation engine
• Incremental maintenance:
– Aggregate/non-aggregate incremental maintenance
– DOMAIN maintenance

Page 31: Declarative analysis of noisy information networks

Incremental Maintenance

• Predicates in the program correspond to materialized tables (key/value maps).

• Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges.
– Insertions: | Record | +1 |
– Deletions: | Record | -1 |
– Updates: a deletion followed by an insertion

• Aggregate maintenance is performed by aggregating the change table then refreshing the old table.
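As a sketch for the Degree feature, assuming ΔEdge rows carry a +1/-1 count C and a SUM aggregate is available (illustrative syntax, not necessarily the system's exact syntax):

ΔDegree(X, SUM<C>) :- ΔEdge(X, Y, C)
UPDATE Degree(X, DNew) :- Degree(X, DOld), ΔDegree(X, DDelta), DNew = DOld + DDelta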

• DOMAIN:

DOMAIN L(X) :- Subgoals of L
{
P1(X,Y) :- Subgoals of P1
}

L(X) :- Subgoals of L
P1'(X) :- L(X), Subgoals of P1
P1(X) :- L(X) >> Subgoals of P1

Page 32: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 33: Declarative analysis of noisy information networks

Synthetic Experiments

• Synthetic graphs, generated using the forest fire and preferential attachment generation models.

• Three tasks:
– Attribute Prediction, Link Prediction, and Entity Resolution

• Two approaches:
– Recomputing features after every iteration
– Incremental maintenance

• Varied parameters:
– Graph size
– Graph density
– Confidence threshold (update size)

Page 34: Declarative analysis of noisy information networks

Changing Graph Size

• Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges

Page 35: Declarative analysis of noisy information networks

Comparison with Derby

• Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors and Jaccard.

Page 36: Declarative analysis of noisy information networks

Real-world Experiment

• Real-world PubMed graph
– Set of publications from the medical domain, their abstracts, and citations

• 50,634 publications, 115,323 citation edges
• Task: Attribute prediction
– Predict whether the paper is categorized as Cognition, Learning, Perception, or Thinking

• Choose the top 10% of predictions after each iteration, for 10 iterations

• Incremental: 28 minutes vs. recompute: 42 minutes (a speedup of roughly 1.5×)

Page 37: Declarative analysis of noisy information networks

Program

DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
ThinkingNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
PerceptionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
CognitionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
LearningNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
Features-AP(X, A, B, C, D, Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X, Abstract, _, _)
}

ITERATE(10)
{
UPDATE Node(X, _, P, 'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
}

Page 38: Declarative analysis of noisy information networks

Outline

Motivations/Contributions

Framework

Declarative Language

Implementation

Results

Related and Future Work

Page 39: Declarative analysis of noisy information networks

Related Work

• Dedupalog [Arasu et al., ICDE 2009]:
– Datalog-based entity resolution
• User defines hard and soft rules for deduplication
• System satisfies hard rules and minimizes violations of soft rules when deduplicating references

• Swoosh [Benjelloun et al., VLDBJ 2008]:
– Generic entity resolution

• Match function for pairs of nodes (based on a set of features)

• Merge function determines which pairs should be merged

Page 40: Declarative analysis of noisy information networks

Conclusions and Ongoing Work

• Conclusions:
– We built a declarative system to specify graph inference operations
– We implemented the system on top of Berkeley DB, with incremental maintenance techniques
• Future work:
– Direct computation of top-k predictions
– Multi-query evaluation (especially on graphs)
– Employing a graph DB engine (e.g., Neo4j)
– Support recursive queries and recursive view maintenance

Page 41: Declarative analysis of noisy information networks

References

• [Sen et al., AI Magazine 2008]
– Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3): 93-106, 2008.
• [Liben-Nowell et al., CIKM 2003]
– David Liben-Nowell, Jon M. Kleinberg: The Link Prediction Problem for Social Networks. CIKM, 2003.
• [Bhattacharya et al., TKDD 2007]
– I. Bhattacharya and L. Getoor: Collective Entity Resolution in Relational Data. ACM TKDD, 1:1-36, 2007.
• [Namata et al., MLG 2009]
– G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
• [Namata et al., KDUD 2009]
– G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
• [Arasu et al., ICDE 2009]
– A. Arasu, C. Re, and D. Suciu: Large-Scale Deduplication with Constraints Using Dedupalog. ICDE, 2009.
• [Benjelloun et al., VLDBJ 2008]
– O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom: Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, 2008.