
Page 1: Introduction to text mining and insights on bridging structured and unstructured data

Introduction to text mining: With a dive into structured and unstructured data

Sayali Kulkarni

October 23, 2010

Page 2: Introduction to text mining and insights on bridging structured and unstructured data

Outline for today

I Quick refresher on data mining

I What is so special about text?

I Introduction to CSAW

I Annotation system

I Distributed indexing and retrieval system

I Future work

Page 3: Introduction to text mining and insights on bridging structured and unstructured data

Data Mining I

I Data is useless if it does not make sense!

I Analyzing the data from different angles

I Important to know:
    I Data: What we get
    I Information: What we can use
    I Knowledge: How we use it

I Classes, clusters, association rules, patterns, sequences ...

Page 4: Introduction to text mining and insights on bridging structured and unstructured data

Data Mining II

I Different kinds of data
    I Protein sequences
    I Genetic data
    I Network monitoring
    I Text data
    I Images/sound – multimedia data

I Different challenges in each case

I Scaling, noise, generalization, overfitting, incorporating domain knowledge

Page 5: Introduction to text mining and insights on bridging structured and unstructured data

Text Mining I

I Sources
    I Textual data from the web
    I Data collected within the organizations
    I Survey data and feedback

I Representation
    I Using words as features
    I Data cleaning is a big task: spelling corrections, stop word handling, stemming
    I The weight of a word depends on its importance in the document and its overall uniqueness across the corpus (see the TF-IDF sketch below)

I Mining Tasks
    I Summarization
    I Document clustering
    I Document labelling
    I Search
    I ...
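To make the weighting idea concrete, here is a minimal TF-IDF sketch in Python. The function name and the toy corpus are illustrative, not part of CSAW: a word's weight grows with its frequency inside a document and shrinks with the number of documents that contain it.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight of a word = term frequency in the document * log(N / document frequency):
    high for words frequent in one document but rare across the corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # in how many documents does each word appear?
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
            for tokens in tokenized]

docs = ["india won the cricket match",
        "cricket scores and cricket news today",
        "stock markets fell sharply today"]
for vec in tfidf_vectors(docs):
    print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])   # top-weighted words per document
```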

Page 6: Introduction to text mining and insights on bridging structured and unstructured data

Text Mining II

I Structure of the data is important

I Web data is diverse in nature
    I Completely unstructured data like news, blogs, mails, forums
    I Partly structured data like Wikipedia, PubMed, and other domain-specific encyclopedias and dictionaries
    I Data contained in the text in the form of lists and tables is much more structured

I Adding semantics to such data

I Linking the structured and unstructured data

I One of the major applications of this is semantic search

Page 7: Introduction to text mining and insights on bridging structured and unstructured data

Search today: Impedance mismatch

Figure: keyword queries against a search engine (impedance mismatch)

Page 8: Introduction to text mining and insights on bridging structured and unstructured data

Our vision of next-gen search

Figure: a search engine over the annotated Web

Curating and Searching the Annotated Web


Page 10: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm I: Data Model

I IR indexes - limited expressiveness

I Relational databases - intricate schema knowledge

I CSAW: IR index (unstructured) + annotation and catalog index (structured)

Page 11: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm II

Query Capabilities

Querying text with type annotations


Response

Tables of entities, quantities (a special type of entity), and text fields

Page 12: Introduction to text mining and insights on bridging structured and unstructured data

High level block diagram

Figure: CSAW - high level block diagram

Page 13: Introduction to text mining and insights on bridging structured and unstructured data

Annotation System

Figure: Annotation Engine in CSAW

Page 14: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies I

Figure: A plain page from unstructured data source

Page 15: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies II

Spots

Figure: A spot on a page

A spot is an occurrence of text on a page that can possibly be linked to a Wikipedia article.

Related notation:

    S0       All candidate spots in a Web page
    S ⊆ S0   An arbitrary set of spots
    s ∈ S    One spot, including its surrounding context

Page 16: Introduction to text mining and insights on bridging structured and unstructured data

Terminologies III

Possible attachments

Figure: Possible attachments for a spot

Attachments are Wikipedia entities that can possibly be linked to a spot.

Related notation:

    Γs                 Candidate entity labels for spot s
    Γ0 = ∪_{s∈S0} Γs   All candidate labels for the page
    Γ ⊆ Γ0             An arbitrary set of entity labels
    γ ∈ Γ              An entity label value, here a Wikipedia entity
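As a rough illustration (not CSAW's data structures), the notation above might map onto plain Python objects like this, with one record per spot and Γ0 computed as the union of the per-spot candidate sets; the entity names are illustrative.

```python
# One record per spot on a page; "candidates" plays the role of Gamma_s.
page_spots = [
    {"spot": "Jordan",
     "context": "... Jordan scored 32 points for the Bulls last night ...",
     "candidates": ["Michael_Jordan", "Air_Jordan", "Jordan_(country)"]},
    {"spot": "Bulls",
     "context": "... 32 points for the Bulls last night ...",
     "candidates": ["Chicago_Bulls", "Bull"]},
]

# Gamma_0: the union of candidate labels over all spots S0 on the page.
gamma_0 = {label for s in page_spots for label in s["candidates"]}
print(sorted(gamma_0))
```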

Page 17: Introduction to text mining and insights on bridging structured and unstructured data

Entity Disambiguation


Figure: Disambiguation based on compatibility between spot and label

SemTag and Seeker [D+03] exploited this for entity disambiguation. It was the first Web-scale entity disambiguation system.


Page 19: Introduction to text mining and insights on bridging structured and unstructured data

Collective Entity Disambiguation


Figure: Disambiguation based on local compatibility and topical coherence of spots

Example: Page with spots for Air Jordan, Michael Jordan, Chicago Bulls

I Cucerzan [Cuc07] was the first to recognize general interdependence between entity labels

I Work by Milne et al. [MW08] includes a limited form of collective disambiguation


Page 21: Introduction to text mining and insights on bridging structured and unstructured data

Topical coherence based on entity catalog

Page 22: Introduction to text mining and insights on bridging structured and unstructured data

Relatedness information from entity catalog

I How related are two entities γ, γ′ in Wikipedia?

I Embed γ in some space using g : Γ → R^c

I Define relatedness as r(γ, γ′) = g(γ) · g(γ′), or a related measure

I Cucerzan’s proposal: c = number of categories; g(γ)[τ] = 1 if γ belongs to category τ, 0 otherwise; the length of g(γ) is c.

      r(γ, γ′) = g(γ)ᵀg(γ′) / ( √(g(γ)ᵀg(γ)) · √(g(γ′)ᵀg(γ′)) )

I Milne and Witten’s proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.

      r(γ, γ′) = ( log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|} ) / ( log c − log min{|g(γ)|, |g(γ′)|} )
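A small Python sketch of the two relatedness measures as stated on this slide, representing g(γ) either as the set of categories of γ (Cucerzan) or as the set of pages linking to γ (Milne and Witten). The function names are mine, and the sign convention of the second measure follows the slide, where 0 means maximally related.

```python
import math

def cucerzan_relatedness(cats_a, cats_b):
    """Cosine similarity of 0/1 category-indicator vectors, which reduces to
    |A ∩ B| / sqrt(|A| * |B|) for the category sets A and B of the two entities."""
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / math.sqrt(len(cats_a) * len(cats_b))

def milne_witten_relatedness(inlinks_a, inlinks_b, num_pages):
    """The measure as written on the slide, over in-link sets:
    (log|A ∩ B| - log max(|A|,|B|)) / (log c - log min(|A|,|B|)).
    Values are <= 0; closer to 0 means more related under this sign convention."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return float("-inf")  # no shared in-links: maximally unrelated here
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    return (math.log(common) - math.log(big)) / (math.log(num_pages) - math.log(small))
```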


Page 25: Introduction to text mining and insights on bridging structured and unstructured data

Dataset for evaluation I

I Documents (IITB) crawled from popular sites

I Publicly available data from Cucerzan’s experiments (CZ)

                                  IITB      CZ
    Number of documents           107       19
    Total number of spots         17,200    288
    Spots per 100 tokens          30        4.48
    Average ambiguity per spot    5.3       18

Figure: Corpus statistics.

Page 26: Introduction to text mining and insights on bridging structured and unstructured data

Dataset for evaluation II

More on IITB dataset

I Collected a total of about 19,000 annotations

I Done by 6 volunteers

I About 50 man-hours spent in collecting the annotations

I Exhaustive tagging by volunteers

I About 40% of the spots were labeled NA

    #Spots tagged by more than one person     1390
    #NA among these spots                      524
    #Spots with disagreement                   278
    #Spots with disagreement involving NA      218

Figure: Inter-annotator agreement.

Page 27: Introduction to text mining and insights on bridging structured and unstructured data

Human Supervision

I System identifies spots and mentions

I Shows pull-down list of (subset of) Γs for each s

I User selects γ∗ ∈ Γs ∪ NA

Page 28: Introduction to text mining and insights on bridging structured and unstructured data

Our Approach

I Main contributions:
    I Refined node features (feature design)
    I Using inlink-based features for defining the coherence score (feature design)
    I Modified approach for collective inference (algorithm design)

Page 29: Introduction to text mining and insights on bridging structured and unstructured data

Modeling local compatibility

I Feature vector fs(γ) ∈ R^d expresses local textual compatibility between (the context of) spot s and candidate label γ

I Components of fs(γ) are based on Wikipedia TF-IDF vectors of:

    1. Snippet
    2. Full text
    3. Anchor text
    4. Anchor text with some tokens around it

  and use the similarity measures:

    1. Dot product
    2. Cosine similarity
    3. Jaccard similarity
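A sketch of how such a feature vector might be assembled, assuming simple bag-of-words counts in place of real Wikipedia TF-IDF vectors; `similarities` and `node_features` are illustrative names, and the log sense prior from the next slide is appended as the last component.

```python
import math
from collections import Counter

def similarities(context_tokens, entity_tokens):
    """Dot product, cosine, and Jaccard similarity between two token bags."""
    a, b = Counter(context_tokens), Counter(entity_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    union = set(a) | set(b)
    jaccard = len(set(a) & set(b)) / len(union) if union else 0.0
    return [dot, cosine, jaccard]

def node_features(spot_context_tokens, entity_field_tokens, log_sense_prior):
    """f_s(gamma): similarities of the spot context against each Wikipedia text
    field of the candidate (snippet, full text, anchor text, ...), plus the
    log sense prior log Pr0(gamma | s) introduced on the next slide."""
    feats = []
    for field_tokens in entity_field_tokens:
        feats.extend(similarities(spot_context_tokens, field_tokens))
    feats.append(log_sense_prior)
    return feats
```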

Page 30: Introduction to text mining and insights on bridging structured and unstructured data

Sense probability prior

I What entity does “Intel” refer to?
    I Chip design and manufacturing company
    I Fictional cartel in a 1961 BBC TV serial

I Pr0(γ|s) is very high for chip maker, low for cartel

I Append element log Pr0(γ|s) to fs(γ)

Page 31: Introduction to text mining and insights on bridging structured and unstructured data

Components of the objective

Node score

I Node scoring model w ∈ R^d

I Node score defined as wᵀfs(γ)

I w is trained to give suitable weights to the different compatibility measures

I At test time, the greedy choice local to s would be argmax_{γ∈Γs} wᵀfs(γ)

Clique Score

I Use Milne’s relatedness formulation

Page 32: Introduction to text mining and insights on bridging structured and unstructured data

Two-part objective to maximize

Node potential:

    NP(y) = ∏_s NPs(ys) = ∏_s exp( wᵀfs(ys) )

Clique potential:

    CP(y) = ∏_{s≠s′} exp( r(ys, ys′) ) = exp( ∑_{s≠s′} r(ys, ys′) )

After taking logs and rescaling terms:

    (1/|S0|) ∑_s wᵀfs(ys)  +  (1/C(|S0|, 2)) ∑_{s≠s′} r(ys, ys′)

where C(|S0|, 2) denotes “|S0| choose 2”.
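In code, the rescaled log-objective is just a mean node score plus a mean pairwise relatedness. The sketch below sums over unordered spot pairs; since r is symmetric this differs from the ordered-pair sum above only by a constant factor. `node_score` and `relatedness` are assumed callables, not CSAW APIs.

```python
from itertools import combinations

def joint_objective(labels, node_score, relatedness):
    """Log of NP(y) * CP(y) after the rescaling on this slide: mean node score
    plus mean pairwise relatedness over all spot pairs."""
    n = len(labels)
    node_part = sum(node_score(s, y) for s, y in enumerate(labels)) / n
    pairs = list(combinations(range(n), 2))
    clique_part = (sum(relatedness(labels[s], labels[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
    return node_part + clique_part
```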


Page 34: Introduction to text mining and insights on bridging structured and unstructured data

ILP formulation

I Casting as 0/1 integer linear program

I Relaxing it to an LP

I Using up to |Γ0| + |Γ0|² variables

Variables:

    zsγ  = [spot s is assigned label γ ∈ Γs]
    uγγ′ = [both γ and γ′ are assigned to some spots]

Page 35: Introduction to text mining and insights on bridging structured and unstructured data

ILP formulation

Objective:

    max over {zsγ, uγγ′} of (NP′) + (CP1′)

Node potential:

    (1/|S0|) ∑_{s∈S0} ∑_{γ∈Γs} zsγ wᵀfs(γ)                        (NP′)

Clique potential:

    (1/C(|S0|, 2)) ∑_{s≠s′∈S0} ∑_{γ∈Γs, γ′∈Γs′} uγγ′ r(γ, γ′)      (CP1′)

Subject to constraints:

    ∀s, γ : zsγ ∈ {0, 1},   ∀γ, γ′ : uγγ′ ∈ {0, 1}                 (1)
    ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′                          (2)
    ∀s : ∑_γ zsγ = 1                                                (3)
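For intuition about what the ILP computes, the following brute-force reference enumerates every joint labeling of a (tiny) page and keeps the one maximizing the same node-plus-clique objective; the ILP/LP machinery exists precisely because this enumeration is exponential in the number of spots. This is an illustrative sketch, not the solver used in CSAW.

```python
from itertools import combinations, product

def best_joint_labeling(candidates, node_score, relatedness):
    """Brute-force reference for the ILP: enumerate every joint labeling y of a
    tiny page and return the one maximizing the node + clique objective."""
    n = len(candidates)
    pairs = list(combinations(range(n), 2))
    best_val, best_y = float("-inf"), None
    for y in product(*candidates):          # one candidate label per spot
        node = sum(node_score(s, y[s]) for s in range(n)) / n
        clique = (sum(relatedness(y[s], y[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
        val = node + clique
        if val > best_val:
            best_val, best_y = val, y
    return best_y, best_val
```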


Page 39: Introduction to text mining and insights on bridging structured and unstructured data

LP relaxation for the ILP formulation

I Relax the constraints in the formulation as:

      ∀s, γ : 0 ≤ zsγ ≤ 1,   ∀γ, γ′ : 0 ≤ uγγ′ ≤ 1
      ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′
      ∀s : ∑_γ zsγ = 1

I The margin between the objective of the relaxed LP and the rounded LP is quite thin

Figure: total objective vs. tuning parameter for LP1-rounded and LP1-relaxed

Page 40: Introduction to text mining and insights on bridging structured and unstructured data

Hill climbing algorithm

I Initialization mechanismsI Label updates
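A minimal sketch of hill climbing with single-spot label updates, assuming the same `node_score` and `relatedness` callables as above: initialize each spot with its locally best label, then repeatedly re-label one spot at a time whenever that improves the joint objective. The details (update order, restarts) are illustrative, not the exact CSAW algorithm.

```python
import random
from itertools import combinations

def hill_climb(candidates, node_score, relatedness, iters=50, seed=0):
    """candidates[s] is the list of candidate labels for spot s."""
    rng = random.Random(seed)
    n = len(candidates)

    def score(y):
        node = sum(node_score(s, y[s]) for s in range(n)) / n
        pairs = list(combinations(range(n), 2))
        clique = (sum(relatedness(y[s], y[t]) for s, t in pairs) / len(pairs)) if pairs else 0.0
        return node + clique

    # Initialization: locally best label for each spot (greedy on the node score only).
    y = [max(candidates[s], key=lambda g: node_score(s, g)) for s in range(n)]
    current = score(y)
    for _ in range(iters):
        s = rng.randrange(n)                 # pick one spot and try to re-label it
        best_g = max(candidates[s], key=lambda g: score(y[:s] + [g] + y[s + 1:]))
        new_y = y[:s] + [best_g] + y[s + 1:]
        new_score = score(new_y)
        if new_score > current:              # keep the update only if it helps
            y, current = new_y, new_score
    return y, current
```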

Page 41: Introduction to text mining and insights on bridging structured and unstructured data

Backoff strategy I

I Allow backoff from tagging some spots

I Assign a special label “NA” to mark a “no attachment”

I Reward a spot for attaching to NA: R_NA

I Spots marked NA do not contribute to the clique potential

I The smaller the value of R_NA, the more aggressive the tagging

How this affects our objective

    N0 ⊆ S0 : spots assigned NA
    A0 = S0 \ N0 : remaining spots

Final objective:

    max_y  (1/|S0|) ( ∑_{s∈N0} R_NA + ∑_{s∈A0} wᵀfs(ys) )           (NP)
           + (1/C(|A0|, 2)) ∑_{s≠s′∈A0} r(ys, ys′)                    (CP1)


Page 43: Introduction to text mining and insights on bridging structured and unstructured data

Backoff strategy II

Issues

A0 depends on y, and hence the resulting optimization can no longer be written as an ILP.

Way around:

I Treat NA as a label with zero topical coherence:

      r(NA, ·) = r(·, NA) = r(NA, NA) = 0

I Contribution to NP is still equal to R_NA

Modified objective:

    max_y  (1/|S0|) ( ∑_{s∈N0} R_NA + ∑_{s∈A0} wᵀfs(ys) )           (NP)
           + (1/C(|S0|, 2)) ∑_{s≠s′∈A0} r(ys, ys′)                    (CP1)

Page 44: Introduction to text mining and insights on bridging structured and unstructured data

Multi-topic model

I The current clique potential encourages a single-cluster model

I The single cluster hypothesis is not always true

I Partition the set of possible attachments as C = Γ1, . . . , ΓK

I Refined clique potential for supporting multitopic model

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|, 2)) ∑_{s,s′ : ys, ys′ ∈ Γk} r(ys, ys′)        (CPK)

I Using C(|Γk|, 2) instead of C(|S0|, 2) to reward smaller coherent clusters

I Node score is not disturbed
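A sketch of the CPK term, assuming `clusters` is the partition Γ1, …, ΓK of candidate labels and `labels` holds the label assigned to each spot; each cluster's pairwise relatedness is normalized by C(|Γk|, 2) as on the slide (illustrative code, not CSAW's).

```python
from itertools import combinations
from math import comb

def multitopic_clique_potential(labels, clusters, relatedness):
    """CPK: average over topic clusters of the pairwise relatedness of the
    assigned labels falling in that cluster, each normalized by C(|Gamma_k|, 2)."""
    if not clusters:
        return 0.0
    total = 0.0
    for cluster in clusters:                      # cluster = Gamma_k, a set of labels
        members = [y for y in labels if y in cluster]
        norm = comb(len(cluster), 2)              # C(|Gamma_k|, 2) as on the slide
        if norm == 0:
            continue
        total += sum(relatedness(a, b) for a, b in combinations(members, 2)) / norm
    return total / len(clusters)
```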

Page 45: Introduction to text mining and insights on bridging structured and unstructured data

System design of the annotation system

Page 46: Introduction to text mining and insights on bridging structured and unstructured data

Evaluation of the annotation system

Evaluation measures:

Precision: number of spots tagged correctly out of the total number of spots tagged

Recall: number of spots tagged correctly out of the total number of spots in the ground truth

F1: 2 × Recall × Precision / (Recall + Precision)
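These measures compute directly from the predicted and ground-truth spot labelings; a small sketch follows (the spot identifiers and dict layout are illustrative).

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: dicts mapping a spot id to its entity label
    (gold comes from the volunteer annotations)."""
    correct = sum(1 for s, label in predicted.items() if gold.get(s) == label)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f1
```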

Page 47: Introduction to text mining and insights on bridging structured and unstructured data

Results summary

I Selection of NP features is important

I Collective inference adds value

Evaluation:

                 Our system    CZ        Milne
    Recall       70.7%         31.43%    66.1%
    Precision    68.7%         53.41%    19.35%
    F1           69.69%        39.57%    29.94%

Page 48: Introduction to text mining and insights on bridging structured and unstructured data

Results summary I

Figure: Annotated page related to cricket

Page 49: Introduction to text mining and insights on bridging structured and unstructured data

Results summary II

Figure: Annotated page related to finance

Page 50: Introduction to text mining and insights on bridging structured and unstructured data

Query building blocks

Matcher: a word, phrase, or mention of a specific entity in the catalog, or a quantity

Target: a placeholder that the engine must instantiate, e.g., an entity of a given type or a quantity with a given unit (but possibly just an uninterpreted token sequence)

Context: any token segment that contains specified matchers and instantiations of targets

Predicates: constraints over targets and context, e.g., text proximity, membership of an entity in a category, containment of a quantity in a range, ...

Aggregators: collect evidence in favor of candidate target instantiations from multiple contexts
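To make the building blocks concrete, here is one hypothetical way to encode them as plain data objects. This is not CSAW's query language or API, just an illustration of how matchers, targets, contexts, and an aggregator fit together; it pre-builds the French-films example from the next slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Target:
    name: str                       # e.g. "?f"
    category: Optional[str] = None  # entity type from the catalog, e.g. "French Films"
    qtype: Optional[str] = None     # quantity type, e.g. "Number"

@dataclass
class Context:
    name: str                                           # e.g. "?c"
    matchers: List[str] = field(default_factory=list)   # words/phrases that must occur
    targets: List[str] = field(default_factory=list)    # names of targets to instantiate

@dataclass
class Query:
    targets: List[Target]
    contexts: List[Context]
    aggregator: str = "Consensus"   # how evidence from multiple contexts is combined

# The "French films / Academy Awards" example from the next slide, roughly:
q = Query(
    targets=[Target("?f", category="French Films"), Target("?a", qtype="Number")],
    contexts=[Context("?c", matchers=["academy awards", "won"], targets=["?f", "?a"])],
)
```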

Page 51: Introduction to text mining and insights on bridging structured and unstructured data

Query example: Category targets

Tabulate French films with the number of Academy Awards that each won

I ?f ∈+ Category:French Films,

I ?a ∈ Qtype:Number,

I InContext(?c; ?f, ?a, "academy awards", won)

I Evidence aggregator Consensus(?c)

I Resulting in an output table with two columns 〈?f, ?a〉

Tabulate physicists and musical instruments they played

I ?p ∈+ Category:Physicist

I ?m ∈+ Category:Musical Instrument

I InContext(?c; ?p, ?m, played)

I Evidence aggregator Consensus(?c)

Page 52: Introduction to text mining and insights on bridging structured and unstructured data

Subqueries and joins

I ?f ∈+ Category:French Film,

I ?a ∈ Qtype:Number, ?p ∈ Qtype:MoneyAmount,

I InContext(?c1; ?f, ?a, "academy awards", won),

I InContext(?c2; ?f, ?p, production cost, budget),

I Consensus(?c1, ?c2)

I Output 〈?f, ?a, ?p〉

I Note that the number of academy awards and the production cost may come from different Web pages

Page 53: Introduction to text mining and insights on bridging structured and unstructured data

Distributed indexing and storage

Figure: distributed indexing and storage architecture

I Hadoop used for distributed storage and processing

I Distributed Index stored as Lucene posting lists

I Lucene payload carries additional data like annotation confidence, quantities

I Adapted Katta for distributed index retrieval

Page 54: Introduction to text mining and insights on bridging structured and unstructured data

Distributed search

Figure: distributed search architecture

I Local Ranking Engine (LRE) scores and ranks a document with respect to a user query

I Query Consensus Engine (QCE) aggregates evidence from different pages

Page 55: Introduction to text mining and insights on bridging structured and unstructured data

Distributed query processing

Figure: distributed query processing flow

Page 56: Introduction to text mining and insights on bridging structured and unstructured data

... hence completing the big picture of CSAW


Figure: Detailed CSAW system

Page 57: Introduction to text mining and insights on bridging structured and unstructured data

Road ahead

Annotation system

I Extending collective inferencing beyond page-level boundaries

I Extending inference algorithms to multitopic models

I Associating confidence with annotations

Query system

I Enhancing the data model and query language

I Entity consensus algorithms

Others

I Ranking of entities in dropdown on the Annotation UI

I Alternative methods for storing annotations suitable for performing interesting mining tasks

Page 58: Introduction to text mining and insights on bridging structured and unstructured data

Thank you all for your interest in this topic

Page 59: Introduction to text mining and insights on bridging structured and unstructured data

References I

Somnath Banerjee, Soumen Chakrabarti, and Ganesh Ramakrishnan, Learning to rank for quantity consensus queries, SIGIR Conference, 2009.

S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, EMNLP Conference, 2007, pp. 708–716.

S. Dill et al., SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation, WWW Conference, 2003.

Michael I. Jordan (ed.), Learning in graphical models, MIT Press, 1999.

Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti, Collective annotation of Wikipedia entities in Web text, SIGKDD Conference, 2009.

R. Mihalcea and A. Csomai, Wikify!: linking documents to encyclopedic knowledge, CIKM, 2007, pp. 233–242.

Page 60: Introduction to text mining and insights on bridging structured and unstructured data

References II

Rada Mihalcea, Paul Tarau, and Elizabeth Figa, PageRank on semantic networks, with application to word sense disambiguation, COLING '04: Proceedings of the 20th International Conference on Computational Linguistics (Morristown, NJ, USA), Association for Computational Linguistics, 2004, p. 1126.

David Milne and Ian H. Witten, Learning to link with Wikipedia, CIKM, 2008.

Page 61: Introduction to text mining and insights on bridging structured and unstructured data

Support slides

I Evaluation of our system in more details

I CSAW search paradigm description

I Objective value comparison

I Scaling and performance measurement

I About Katta

I Comparison of Local, hill climbing, LP while training R_NA

I Sample malformed dendrograms in category space

I Multi-topical model and dendrogram for the same

I Dendrogram-based algorithm

Page 62: Introduction to text mining and insights on bridging structured and unstructured data

Effect of NP learning

Figure: F1 (%) using single node features (Wiki full page + cosine, anchor text + cosine, anchor text context + cosine, Wiki full page + Jaccard) vs. all features with learned w

Figure: precision vs. recall (%) for Local, Local+Prior, M&W, and Cucerzan

I Learning w is better than commonly-used single features

I Enough to beat leave-one-out and anchor-based approaches


Page 64: Introduction to text mining and insights on bridging structured and unstructured data

Benefits of collective annotation

Figure: recall vs. precision (%) on the IITB dataset for Local, Local+prior, Hill1, Hill1+prior, LP1, and LP1+prior

Figure: recall vs. precision (%) on the CZ dataset for Local, Hill1, LP1, Milne, and Cucerzan (F1 = 63% and F1 = 69% marked)

I Evaluated on two different data sets

I Can significantly push recall while preserving precision

Page 65: Introduction to text mining and insights on bridging structured and unstructured data

Is our belief about the objective correct?

Figure: F1 (%) versus objective (normalized) for six documents (doc1–doc6)

I As the objective value increases, F1 increases

I This validates our belief about the objective

Page 66: Introduction to text mining and insights on bridging structured and unstructured data

Effect of tuning R_NA I

Figure: F1 (%) for Local, Hill1, and LP1 for different R_NA values

I The best R_NA for Local is smaller than the best R_NA for Hill1 and LP1

Page 67: Introduction to text mining and insights on bridging structured and unstructured data

Effect of tuning R_NA II

Figure: precision (%) for different R_NA values (Local, Hill1, LP1)

Figure: recall (%) for different R_NA values (Local, Hill1, LP1)

I The smaller the value of R_NA, the more aggressive the tagging

I Precision increases as R_NA increases

I Recall decreases as R_NA increases

Page 68: Introduction to text mining and insights on bridging structured and unstructured data

CSAW search paradigm description

I Data model
    I Two extremes in currently available systems: IR systems and relational databases
    I Our goal: bridging the gap between the two

I Query capabilities
    I Two extremes in current systems: keyword queries and structured SQL-like queries
    I Our goal: allow composite representations and combine textual proximity with structured data (from some catalog)

I Response
    I Current search systems return URLs or highly structured data (as in SQL)
    I Our goal: return lists of entities, quantities (a special type of entity), or tables of entities and quantities


Page 71: Introduction to text mining and insights on bridging structured and unstructured data

Objective value comparison for Local, hill climbing, LP

Figure: total objective vs. ρ_NA for Hill1, LP1-rounded, and LP1-relaxed

Page 72: Introduction to text mining and insights on bridging structured and unstructured data

Scaling and performance measurement

Figure: scaling the annotation process with the number of spots being annotated (time in seconds, Hill1 vs. LP1)

I Scaling is mildly quadratic with respect to |S0|

I Hill climbing takes about 2–3 seconds

I LP takes around 4–6 seconds

Page 73: Introduction to text mining and insights on bridging structured and unstructured data

About Katta

I Salient features:
    I Scalable
    I Failure tolerant
    I Distributed
    I Indexed
    I Data storage

I Serves very large Lucene indexes as index shards on many servers

I Replicates shards on different servers for performance and fault-tolerance

I Supports pluggable network topologies

I Master fail-over

I Plays well with Hadoop clusters

Page 74: Introduction to text mining and insights on bridging structured and unstructured data

Comparison of Local, hill climbing, and LP while training R_NA

                Local      Hill1      LP1
    No prior    63.45%     64.87%     67.02%
    +Prior      68.75%     67.46%     69.69%

Page 75: Introduction to text mining and insights on bridging structured and unstructured data

Sample dendrograms I

Page 76: Introduction to text mining and insights on bridging structured and unstructured data

Sample dendrograms II

Page 77: Introduction to text mining and insights on bridging structured and unstructured data

Dendrogram with multitopic model

Page 78: Introduction to text mining and insights on bridging structured and unstructured data

Multi-topical model

I The current clique potential encourages a single-cluster model

I The single cluster hypothesis is not always true

I Refined clique potential for supporting multitopic model

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|, 2)) ∑_{s,s′ : ys, ys′ ∈ Γk} r(ys, ys′)        (CPK)

I Using C(|Γk|, 2) instead of C(|S0|, 2), to reward smaller coherent clusters as desired