
Page 1: Introduction to text mining and insights on bridging structured and unstructured data

Introduction to text mining: With a dive into structured and unstructured data

Sayali Kulkarni

October 23, 2010

Page 2: Introduction to text mining and insights on bridging structured and unstructured data

Outline for today

I Quick refresher on data mining

I What is so special about text?

I Introduction to CSAW

I Annotation system

I Distributed indexing and retrieval system

I Future work

Page 3: Introduction to text mining and insights on bridging structured and unstructured data

Data Mining I

I Data is useless if it does not make sense!

I Analyzing the data from different angles

I Important to know:
    I Data: What we get
    I Information: What we can use
    I Knowledge: How we use it

I Classes, clusters, association rules, patterns, sequences ...

Page 4: Introduction to text mining and insights on bridging structured and unstructured data

Data Mining II

I Different kinds of data
    I Protein sequences
    I Genetic data
    I Network monitoring
    I Text data
    I Images/sound – multimedia data

I Different challenges in each case

I Scaling, noise, generalization, overfitting, incorporating domain knowledge

Page 5: Introduction to text mining and insights on bridging structured and unstructured data

Text Mining I

I Sources
    I Textual data from the web
    I Data collected within the organizations
    I Survey data and feedback

I Representation
    I Using words as features
    I Data cleaning is a big task: spelling corrections, stop word handling, stemming
    I The weight of a word depends on its importance in the document and its overall uniqueness across the corpus (see the TF-IDF sketch below)

I Mining Tasks
    I Summarization
    I Document clustering
    I Document labelling
    I Search
    I ...
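To make the weighting idea concrete, here is a minimal TF-IDF sketch in Python. The function name and the toy corpus are illustrative, not part of CSAW: a word's weight grows with its frequency inside a document and shrinks with the number of documents that contain it.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight of a word = term frequency in the document * log(N / document frequency):
    high for words frequent in one document but rare across the corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # in how many documents does each word appear?
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
            for tokens in tokenized]

docs = ["india won the cricket match",
        "cricket scores and cricket news today",
        "stock markets fell sharply today"]
for vec in tfidf_vectors(docs):
    print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])   # top-weighted words per document
```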

Page 6: Introduction to text mining and insights on bridging structured and unstructured data

Text Mining II

I Structure of the data is important

I Web data is diverse in nature
    I Completely unstructured data like news, blogs, mails, forums
    I Partly structured data like Wikipedia, PubMed, and other domain-specific encyclopedias and dictionaries
    I Data contained in the text in the form of lists and tables is much more structured

I Adding semantics to such data

I Linking the structured and unstructured data

I One of the major applications of this is semantic search

Page 7: Introduction to text mining and insights on bridging structured and unstructured data

Search today: Impedance mismatch

Figure: keyword queries against a search engine (impedance mismatch)

Page 8: Introduction to text mining and insights on bridging structured and unstructured data

Our vision of next-gen search

Figure: a search engine over the annotated Web

Curating and Searching the Annotated Web


Page 10: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm I: Data Model

I IR indexes - limited expressiveness

I Relational databases - intricate schema knowledge

I CSAW: IR index (unstructured) + annotation and catalog index (structured)

Page 11: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm II

Query Capabilities

Querying text with type annotations


Response

Tables of entities, quantities (a special type of entity), and text fields

Page 12: Introduction to text mining and insights on bridging structured and unstructured data

High level block diagram

Figure: CSAW - high level block diagram

Page 13: Introduction to text mining and insights on bridging structured and unstructured data

Annotation System

Figure: Annotation Engine in CSAW

Page 14: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies I

Figure: A plain page from unstructured data source

Page 15: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies II

Spots

Figure: A spot on a page

A spot is an occurrence of text on a page that can possibly be linked to a Wikipedia article.

Related notation:

    S0       All candidate spots in a Web page
    S ⊆ S0   An arbitrary set of spots
    s ∈ S    One spot, including its surrounding context

Page 16: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies III

Possible attachments

Figure: Possible attachments for a spot

Attachments are Wikipedia entities that can possibly be linked to a spot.

Related notation:

    Γs                 Candidate entity labels for spot s
    Γ0 = ∪_{s∈S0} Γs   All candidate labels for the page
    Γ ⊆ Γ0             An arbitrary set of entity labels
    γ ∈ Γ              An entity label value, here a Wikipedia entity
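As a rough illustration (not CSAW's data structures), the notation above might map onto plain Python objects like this, with one record per spot and Γ0 computed as the union of the per-spot candidate sets; the entity names are illustrative.

```python
# One record per spot on a page; "candidates" plays the role of Gamma_s.
page_spots = [
    {"spot": "Jordan",
     "context": "... Jordan scored 32 points for the Bulls last night ...",
     "candidates": ["Michael_Jordan", "Air_Jordan", "Jordan_(country)"]},
    {"spot": "Bulls",
     "context": "... 32 points for the Bulls last night ...",
     "candidates": ["Chicago_Bulls", "Bull"]},
]

# Gamma_0: the union of candidate labels over all spots S0 on the page.
gamma_0 = {label for s in page_spots for label in s["candidates"]}
print(sorted(gamma_0))
```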

Page 17: Introduction to text mining and insights on bridging structured and unstructured data

Entity Disambiguation


Figure: Disambiguation based on compatibility between spot and label

SemTag and Seeker [D+03] exploited this for entity disambiguation. It was the first Web-scale entity disambiguation system.


Page 19: Introduction to text mining and insights on bridging structured and unstructured data

Collective Entity Disambiguation


Figure: Disambiguation based on local compatibility and topical coherence of spots

Example: Page with spots for Air Jordan, Michael Jordan, Chicago Bulls

I Cucerzan [Cuc07] was the first to recognize general interdependence between entity labels

I Work by Milne et al. [MW08] includes a limited form of collective disambiguation


Page 21: Introduction to text mining and insights on bridging structured and unstructured data

Topical coherence based on entity catalog

Page 22: Introduction to text mining and insights on bridging structured and unstructured data

Relatedness information from entity catalog

I How related are two entities γ, γ′ in Wikipedia?

I Embed γ in some space using g : Γ → R^c

I Define relatedness as r(γ, γ′) = g(γ) · g(γ′), or a related measure

I Cucerzan’s proposal: c = number of categories; g(γ)[τ] = 1 if γ belongs to category τ, 0 otherwise; the length of g(γ) is c.

      r(γ, γ′) = g(γ)ᵀg(γ′) / ( √(g(γ)ᵀg(γ)) · √(g(γ′)ᵀg(γ′)) )

I Milne and Witten’s proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.

      r(γ, γ′) = ( log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|} ) / ( log c − log min{|g(γ)|, |g(γ′)|} )
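A small Python sketch of the two relatedness measures as stated on this slide, representing g(γ) either as the set of categories of γ (Cucerzan) or as the set of pages linking to γ (Milne and Witten). The function names are mine, and the sign convention of the second measure follows the slide, where 0 means maximally related.

```python
import math

def cucerzan_relatedness(cats_a, cats_b):
    """Cosine similarity of 0/1 category-indicator vectors, which reduces to
    |A ∩ B| / sqrt(|A| * |B|) for the category sets A and B of the two entities."""
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / math.sqrt(len(cats_a) * len(cats_b))

def milne_witten_relatedness(inlinks_a, inlinks_b, num_pages):
    """The measure as written on the slide, over in-link sets:
    (log|A ∩ B| - log max(|A|,|B|)) / (log c - log min(|A|,|B|)).
    Values are <= 0; closer to 0 means more related under this sign convention."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return float("-inf")  # no shared in-links: maximally unrelated here
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    return (math.log(common) - math.log(big)) / (math.log(num_pages) - math.log(small))
```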


Page 25: Introduction to text mining and insights on bridging structured and unstructured data

Dataset for evaluation I

I Documents (IITB) crawled from popular sites

I Publicly available data from Cucerzan’s experiments (CZ)

                                  IITB      CZ
    Number of documents           107       19
    Total number of spots         17,200    288
    Spots per 100 tokens          30        4.48
    Average ambiguity per spot    5.3       18

Figure: Corpus statistics.

Page 26: Introduction to text mining and insights on bridging structured and unstructured data

Dataset for evaluation II

More on IITB dataset

I Collected a total of about 19,000 annotations

I Done by 6 volunteers

I About 50 man-hours spent in collecting the annotations

I Exhaustive tagging by volunteers

I About 40% of the spots were labeled NA

    #Spots tagged by more than one person     1390
    #NA among these spots                      524
    #Spots with disagreement                   278
    #Spots with disagreement involving NA      218

Figure: Inter-annotator agreement.

Page 27: Introduction to text mining and insights on bridging structured and unstructured data

Human Supervision

I System identifies spots and mentions

I Shows pull-down list of (subset of) Γs for each s

I User selects γ∗ ∈ Γs ∪ NA

Page 28: Introduction to text mining and insights on bridging structured and unstructured data

Our Approach

I Main contributions:
    I Refined node features (feature design)
    I Using inlink-based features for defining the coherence score (feature design)
    I Modified approach for collective inference (algorithm design)

Page 29: Introduction to text mining and insights on bridging structured and unstructured data

Modeling local compatibility

I Feature vector fs(γ) ∈ R^d expresses local textual compatibility between (the context of) spot s and candidate label γ

I Components of fs(γ) are based on Wikipedia TF-IDF vectors of:

    1. Snippet
    2. Full text
    3. Anchor text
    4. Anchor text with some tokens around it

  and use the similarity measures:

    1. Dot product
    2. Cosine similarity
    3. Jaccard similarity
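A sketch of how such a feature vector might be assembled, assuming simple bag-of-words counts in place of real Wikipedia TF-IDF vectors; `similarities` and `node_features` are illustrative names, and the log sense prior from the next slide is appended as the last component.

```python
import math
from collections import Counter

def similarities(context_tokens, entity_tokens):
    """Dot product, cosine, and Jaccard similarity between two token bags."""
    a, b = Counter(context_tokens), Counter(entity_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    union = set(a) | set(b)
    jaccard = len(set(a) & set(b)) / len(union) if union else 0.0
    return [dot, cosine, jaccard]

def node_features(spot_context_tokens, entity_field_tokens, log_sense_prior):
    """f_s(gamma): similarities of the spot context against each Wikipedia text
    field of the candidate (snippet, full text, anchor text, ...), plus the
    log sense prior log Pr0(gamma | s) introduced on the next slide."""
    feats = []
    for field_tokens in entity_field_tokens:
        feats.extend(similarities(spot_context_tokens, field_tokens))
    feats.append(log_sense_prior)
    return feats
```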

Page 30: Introduction to text mining and insights on bridging structured and unstructured data

Sense probability prior

I What entity does “Intel” refer to?
    I Chip design and manufacturing company
    I Fictional cartel in a 1961 BBC TV serial

I Pr0(γ|s) is very high for chip maker, low for cartel

I Append element log Pr0(γ|s) to fs(γ)

Page 31: Introduction to text mining and insights on bridging structured and unstructured data

Components of the objective

Node score

I Node scoring model w ∈ R^d

I Node score defined as wᵀfs(γ)

I w is trained to give suitable weights to the different compatibility measures

I At test time, the greedy choice local to s would be argmax_{γ∈Γs} wᵀfs(γ)

Clique Score

I Use Milne’s relatedness formulation

Page 32: Introduction to text mining and insights on bridging structured and unstructured data

Two-part objective to maximize

Node potential:

    NP(y) = ∏_s NPs(ys) = ∏_s exp( wᵀfs(ys) )

Clique potential:

    CP(y) = ∏_{s≠s′} exp( r(ys, ys′) ) = exp( ∑_{s≠s′} r(ys, ys′) )

After taking logs and rescaling terms:

    (1/|S0|) ∑_s wᵀfs(ys)  +  (1/C(|S0|, 2)) ∑_{s≠s′} r(ys, ys′)

where C(|S0|, 2) denotes “|S0| choose 2”.
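In code, the rescaled log-objective is just a mean node score plus a mean pairwise relatedness. The sketch below sums over unordered spot pairs; since r is symmetric this differs from the ordered-pair sum above only by a constant factor. `node_score` and `relatedness` are assumed callables, not CSAW APIs.

```python
from itertools import combinations

def joint_objective(labels, node_score, relatedness):
    """Log of NP(y) * CP(y) after the rescaling on this slide: mean node score
    plus mean pairwise relatedness over all spot pairs."""
    n = len(labels)
    node_part = sum(node_score(s, y) for s, y in enumerate(labels)) / n
    pairs = list(combinations(range(n), 2))
    clique_part = (sum(relatedness(labels[s], labels[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
    return node_part + clique_part
```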


Page 34: Introduction to text mining and insights on bridging structured and unstructured data

ILP formulation

I Casting as 0/1 integer linear program

I Relaxing it to an LP

I Using up to |Γ0| + |Γ0|² variables

Variables:

    zsγ  = [spot s is assigned label γ ∈ Γs]
    uγγ′ = [both γ and γ′ are assigned to some spots]

Page 35: Introduction to text mining and insights on bridging structured and unstructured data

ILP formulation

Objective:

    max over {zsγ, uγγ′} of (NP′) + (CP1′)

Node potential:

    (1/|S0|) ∑_{s∈S0} ∑_{γ∈Γs} zsγ wᵀfs(γ)                        (NP′)

Clique potential:

    (1/C(|S0|, 2)) ∑_{s≠s′∈S0} ∑_{γ∈Γs, γ′∈Γs′} uγγ′ r(γ, γ′)      (CP1′)

Subject to constraints:

    ∀s, γ : zsγ ∈ {0, 1},   ∀γ, γ′ : uγγ′ ∈ {0, 1}                 (1)
    ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′                          (2)
    ∀s : ∑_γ zsγ = 1                                                (3)
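For intuition about what the ILP computes, the following brute-force reference enumerates every joint labeling of a (tiny) page and keeps the one maximizing the same node-plus-clique objective; the ILP/LP machinery exists precisely because this enumeration is exponential in the number of spots. This is an illustrative sketch, not the solver used in CSAW.

```python
from itertools import combinations, product

def best_joint_labeling(candidates, node_score, relatedness):
    """Brute-force reference for the ILP: enumerate every joint labeling y of a
    tiny page and return the one maximizing the node + clique objective."""
    n = len(candidates)
    pairs = list(combinations(range(n), 2))
    best_val, best_y = float("-inf"), None
    for y in product(*candidates):          # one candidate label per spot
        node = sum(node_score(s, y[s]) for s in range(n)) / n
        clique = (sum(relatedness(y[s], y[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
        val = node + clique
        if val > best_val:
            best_val, best_y = val, y
    return best_y, best_val
```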


Page 39: Introduction to text mining and insights on bridging structured and unstructured data

LP relaxation for the ILP formulation

I Relax the constraints in the formulation as:

      ∀s, γ : 0 ≤ zsγ ≤ 1,   ∀γ, γ′ : 0 ≤ uγγ′ ≤ 1
      ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′
      ∀s : ∑_γ zsγ = 1

I The margin between the objective of the relaxed LP and the rounded LP is quite thin

Figure: total objective vs. tuning parameter for LP1-rounded and LP1-relaxed

Page 40: Introduction to text mining and insights on bridging structured and unstructured data

Hill climbing algorithm

I Initialization mechanismsI Label updates
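A minimal sketch of hill climbing with single-spot label updates, assuming the same `node_score` and `relatedness` callables as above: initialize each spot with its locally best label, then repeatedly re-label one spot at a time whenever that improves the joint objective. The details (update order, restarts) are illustrative, not the exact CSAW algorithm.

```python
import random
from itertools import combinations

def hill_climb(candidates, node_score, relatedness, iters=50, seed=0):
    """candidates[s] is the list of candidate labels for spot s."""
    rng = random.Random(seed)
    n = len(candidates)

    def score(y):
        node = sum(node_score(s, y[s]) for s in range(n)) / n
        pairs = list(combinations(range(n), 2))
        clique = (sum(relatedness(y[s], y[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
        return node + clique

    # Initialization: locally best label for each spot (greedy on the node score only).
    y = [max(candidates[s], key=lambda g: node_score(s, g)) for s in range(n)]
    current = score(y)
    for _ in range(iters):
        s = rng.randrange(n)                 # pick one spot and try to re-label it
        best_g = max(candidates[s], key=lambda g: score(y[:s] + [g] + y[s + 1:]))
        new_y = y[:s] + [best_g] + y[s + 1:]
        new_score = score(new_y)
        if new_score > current:              # keep the update only if it helps
            y, current = new_y, new_score
    return y, current
```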

Page 41: Introduction to text mining and insights on bridging structured and unstructured data

Backoff strategy I

I Allow backoff from tagging some spots

I Assign a special label “NA” to mark a “no attachment”

I Reward a spot for attaching to NA: R_NA

I Spots marked NA do not contribute to the clique potential

I The smaller the value of R_NA, the more aggressive the tagging

How this affects our objective

    N0 ⊆ S0 : spots assigned NA
    A0 = S0 \ N0 : remaining spots

Final objective:

    max_y  (1/|S0|) ( ∑_{s∈N0} R_NA + ∑_{s∈A0} wᵀfs(ys) )           (NP)
           + (1/C(|A0|, 2)) ∑_{s≠s′∈A0} r(ys, ys′)                    (CP1)


Page 43: Introduction to text mining and insights on bridging structured and unstructured data

Backoff strategy II

Issues

A0 depends on y, and hence the resulting optimization can no longer be written as an ILP.

Way around:

I Treat NA as a label with zero topical coherence:

      r(NA, ·) = r(·, NA) = r(NA, NA) = 0

I Contribution to NP is still equal to R_NA

Modified objective:

    max_y  (1/|S0|) ( ∑_{s∈N0} R_NA + ∑_{s∈A0} wᵀfs(ys) )           (NP)
           + (1/C(|S0|, 2)) ∑_{s≠s′∈A0} r(ys, ys′)                    (CP1)

Page 44: Introduction to text mining and insights on bridging structured and unstructured data

Multi-topic model

I The current clique potential encourages a single-cluster model

I The single cluster hypothesis is not always true

I Partition the set of possible attachments as C = Γ1, . . . , ΓK

I Refined clique potential for supporting multitopic model

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|, 2)) ∑_{s,s′ : ys, ys′ ∈ Γk} r(ys, ys′)        (CPK)

I Using C(|Γk|, 2) instead of C(|S0|, 2) to reward smaller coherent clusters

I Node score is not disturbed
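A sketch of the CPK term, assuming `clusters` is the partition Γ1, …, ΓK of candidate labels and `labels` holds the label assigned to each spot; each cluster's pairwise relatedness is normalized by C(|Γk|, 2) as on the slide (illustrative code, not CSAW's).

```python
from itertools import combinations
from math import comb

def multitopic_clique_potential(labels, clusters, relatedness):
    """CPK: average over topic clusters of the pairwise relatedness of the
    assigned labels falling in that cluster, each normalized by C(|Gamma_k|, 2)."""
    if not clusters:
        return 0.0
    total = 0.0
    for cluster in clusters:                      # cluster = Gamma_k, a set of labels
        members = [y for y in labels if y in cluster]
        norm = comb(len(cluster), 2)              # C(|Gamma_k|, 2) as on the slide
        if norm == 0:
            continue
        total += sum(relatedness(a, b) for a, b in combinations(members, 2)) / norm
    return total / len(clusters)
```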

Page 45: Introduction to text mining and insights on bridging structured and unstructured data

System design of the annotation system

Page 46: Introduction to text mining and insights on bridging structured and unstructured data

Evaluation of the annotation system

Evaluation measures:

Precision: number of spots tagged correctly out of the total number of spots tagged

Recall: number of spots tagged correctly out of the total number of spots in the ground truth

F1: 2 × Recall × Precision / (Recall + Precision)
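These measures compute directly from the predicted and ground-truth spot labelings; a small sketch follows (the spot identifiers and dict layout are illustrative).

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: dicts mapping a spot id to its entity label
    (gold comes from the volunteer annotations)."""
    correct = sum(1 for s, label in predicted.items() if gold.get(s) == label)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1
```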

Page 47: Introduction to text mining and insights on bridging structured and unstructured data

Results summary

I Selection of NP features is important

I Collective inference adds value

Evaluation:

                 Our system    CZ        Milne
    Recall       70.7%         31.43%    66.1%
    Precision    68.7%         53.41%    19.35%
    F1           69.69%        39.57%    29.94%

Page 48: Introduction to text mining and insights on bridging structured and unstructured data

Results summary I

Figure: Annotated page related to cricket

Page 49: Introduction to text mining and insights on bridging structured and unstructured data

Results summary II

Figure: Annotated page related to finance

Page 50: Introduction to text mining and insights on bridging structured and unstructured data

Query building blocks

Matcher: a word, phrase, or mention of a specific entity in the catalog, or a quantity

Target: a placeholder that the engine must instantiate, e.g., an entity of a given type or a quantity with a given unit (but possibly just an uninterpreted token sequence)

Context: any token segment that contains specified matchers and instantiations of targets

Predicates: constraints over targets and context, e.g., text proximity, membership of an entity in a category, containment of a quantity in a range, ...

Aggregators: collect evidence in favor of candidate target instantiations from multiple contexts
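To make the building blocks concrete, here is one hypothetical way to encode them as plain data objects. This is not CSAW's query language or API, just an illustration of how matchers, targets, contexts, and an aggregator fit together; it pre-builds the French-films example from the next slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Target:
    name: str                       # e.g. "?f"
    category: Optional[str] = None  # entity type from the catalog, e.g. "French Films"
    qtype: Optional[str] = None     # quantity type, e.g. "Number"

@dataclass
class Context:
    name: str                                           # e.g. "?c"
    matchers: List[str] = field(default_factory=list)   # words/phrases that must occur
    targets: List[str] = field(default_factory=list)    # names of targets to instantiate

@dataclass
class Query:
    targets: List[Target]
    contexts: List[Context]
    aggregator: str = "Consensus"   # how evidence from multiple contexts is combined

# The "French films / Academy Awards" example from the next slide, roughly:
q = Query(
    targets=[Target("?f", category="French Films"), Target("?a", qtype="Number")],
    contexts=[Context("?c", matchers=["academy awards", "won"], targets=["?f", "?a"])],
)
```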

Page 51: Introduction to text mining and insights on bridging structured and unstructured data

Query example: Category targets

Tabulate French films with the number of Academy Awards that each won

I ?f ∈+ Category:French Films,

I ?a ∈ Qtype:Number,

I InContext(?c; ?f, ?a, "academy awards", won)

I Evidence aggregator Consensus(?c)

I Resulting in an output table with two columns 〈?f, ?a〉

Tabulate physicists and musical instruments they played

I ?p ∈+ Category:Physicist

I ?m ∈+ Category:Musical Instrument

I InContext(?c; ?p, ?m, played)

I Evidence aggregator Consensus(?c)

Page 52: Introduction to text mining and insights on bridging structured and unstructured data

Subqueries and joins

I ?f ∈+ Category:French Film,

I ?a ∈ Qtype:Number, ?p ∈ Qtype:MoneyAmount,

I InContext(?c1; ?f, ?a, "academy awards", won),

I InContext(?c2; ?f, ?p, production cost, budget),

I Consensus(?c1, ?c2)

I Output 〈?f, ?a, ?p〉

I Note that the number of academy awards and the production cost may come from different Web pages

Page 53: Introduction to text mining and insights on bridging structured and unstructured data

Distributed indexing and storage

Figure: distributed indexing and storage architecture

I Hadoop used for distributed storage and processing

I Distributed Index stored as Lucene posting lists

I Lucene payload carries additional data like annotation confidence, quantities

I Adapted Katta for distributed index retrieval

Page 54: Introduction to text mining and insights on bridging structured and unstructured data

Distributed search

Figure: distributed search architecture

I Local Ranking Engine (LRE) scores and ranks a document with respect to a user query

I Query Consensus Engine (QCE) aggregates evidence from different pages

Page 55: Introduction to text mining and insights on bridging structured and unstructured data

Distributed query processing

Figure: distributed query processing flow

Page 56: Introduction to text mining and insights on bridging structured and unstructured data

... hence completing the big picture of CSAW


Figure: Detailed CSAW system

Page 57: Introduction to text mining and insights on bridging structured and unstructured data

Road ahead

Annotation system

I Extending collective inferencing beyond page-level boundaries

I Extending inference algorithms to multitopic models

I Associating confidence with annotations

Query system

I Enhancing the data model and query language

I Entity consensus algorithms

Others

I Ranking of entities in dropdown on the Annotation UI

I Alternative methods for storing annotations suitable for performing interesting mining tasks

Page 58: Introduction to text mining and insights on bridging structured and unstructured data

Thank you all for your interest in this topic

Page 59: Introduction to text mining and insights on bridging structured and unstructured data

References I

Somnath Banerjee, Soumen Chakrabarti, and Ganesh Ramakrishnan, Learning to rank for quantity consensus queries, SIGIR Conference, 2009.

S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, EMNLP Conference, 2007, pp. 708–716.

S. Dill et al., SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation, WWW Conference, 2003.

Michael I. Jordan (ed.), Learning in graphical models, MIT Press, 1999.

Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti, Collective annotation of Wikipedia entities in Web text, SIGKDD Conference, 2009.

R. Mihalcea and A. Csomai, Wikify!: linking documents to encyclopedic knowledge, CIKM, 2007, pp. 233–242.

Page 60: Introduction to text mining and insights on bridging structured and unstructured data

References II

Rada Mihalcea, Paul Tarau, and Elizabeth Figa, PageRank on semantic networks, with application to word sense disambiguation, COLING '04: Proceedings of the 20th International Conference on Computational Linguistics (Morristown, NJ, USA), Association for Computational Linguistics, 2004, p. 1126.

David Milne and Ian H. Witten, Learning to link with Wikipedia, CIKM, 2008.

Page 61: Introduction to text mining and insights on bridging structured and unstructured data

Support slides

I Evaluation of our system in more details

I CSAW search paradigm description

I Objective value comparison

I Scaling and performance measurement

I About Katta

I Comparison of Local, hill climbing, LP while training R_NA

I Sample malformed dendrograms in category space

I Multi-topical model and dendrogram for the same

I Dendrogram-based algorithm

Page 62: Introduction to text mining and insights on bridging structured and unstructured data

Effect of NP learning

Figure: F1 (%) using single node features (Wiki full page + cosine, anchor text + cosine, anchor text context + cosine, Wiki full page + Jaccard) vs. all features with learned w

Figure: precision vs. recall (%) for Local, Local+Prior, M&W, and Cucerzan

I Learning w is better than commonly-used single features

I Enough to beat leave-one-out and anchor-based approaches


Page 64: Introduction to text mining and insights on bridging structured and unstructured data

Benefits of collective annotation

Figure: recall vs. precision (%) on the IITB dataset for Local, Local+prior, Hill1, Hill1+prior, LP1, and LP1+prior

Figure: recall vs. precision (%) on the CZ dataset for Local, Hill1, LP1, Milne, and Cucerzan (F1 = 63% and F1 = 69% marked)

I Evaluated on two different data sets

I Can significantly push recall while preserving precision

Page 65: Introduction to text mining and insights on bridging structured and unstructured data

Is our belief about the objective correct?

Figure: F1 (%) versus objective (normalized) for six documents (doc1–doc6)

I As the objective value increases, F1 increases

I This validates our belief about the objective

Page 66: Introduction to text mining and insights on bridging structured and unstructured data

Effect of tuning R_NA I

Figure: F1 (%) for Local, Hill1, and LP1 for different R_NA values

I The best R_NA for Local is smaller than the best R_NA for Hill1 and LP1

Page 67: Introduction to text mining and insights on bridging structured and unstructured data

Effect of tuning R_NA II

Figure: precision (%) for different R_NA values (Local, Hill1, LP1)

Figure: recall (%) for different R_NA values (Local, Hill1, LP1)

I The smaller the value of R_NA, the more aggressive the tagging

I Precision increases as R_NA increases

I Recall decreases as R_NA increases

Page 68: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm description

I Data model
    I Two extremes in currently available systems: IR systems and relational databases
    I Our goal: bridging the gap between the two

I Query capabilities
    I Two extremes in current systems: keyword queries and structured SQL-like queries
    I Our goal: allow composite representations and combine textual proximity with structured data (from some catalog)

I Response
    I Current search systems return URLs or highly structured data (as in SQL)
    I Our goal: return lists of entities, quantities (a special type of entity), or tables of entities and quantities


Page 71: Introduction to text mining and insights on bridging structured and unstructured data

Objective value comparison for Local, hill climbing, LP

Figure: total objective vs. ρ_NA for Hill1, LP1-rounded, and LP1-relaxed

Page 72: Introduction to text mining and insights on bridging structured and unstructured data

Scaling and performance measurement

Figure: scaling the annotation process with the number of spots being annotated (time in seconds, Hill1 vs. LP1)

I Scaling is mildly quadratic with respect to |S0|

I Hill climbing takes about 2–3 seconds

I LP takes around 4–6 seconds

Page 73: Introduction to text mining and insights on bridging structured and unstructured data

About Katta

I Salient features:
    I Scalable
    I Failure tolerant
    I Distributed
    I Indexed
    I Data storage

I Serves very large Lucene indexes as index shards on many servers

I Replicates shards on different servers for performance and fault-tolerance

I Supports pluggable network topologies

I Master fail-over

I Plays well with Hadoop clusters

Page 74: Introduction to text mining and insights on bridging structured and unstructured data

Comparison of Local, hill climbing, and LP while training R_NA

                Local      Hill1      LP1
    No prior    63.45%     64.87%     67.02%
    +Prior      68.75%     67.46%     69.69%

Page 75: Introduction to text mining and insights on bridging structured and unstructured data

Sample dendrograms I

Page 76: Introduction to text mining and insights on bridging structured and unstructured data

Sample dendrograms II

Page 77: Introduction to text mining and insights on bridging structured and unstructured data

Dendrogram with multitopic model

Page 78: Introduction to text mining and insights on bridging structured and unstructured data

Multi-topical model

I The current clique potential encourages a single-cluster model

I The single cluster hypothesis is not always true

I Refined clique potential for supporting multitopic model

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|, 2)) ∑_{s,s′ : ys, ys′ ∈ Γk} r(ys, ys′)        (CPK)

I Using C(|Γk|, 2) instead of C(|S0|, 2), to reward smaller coherent clusters as desired