1 Text Retrieval and Applications – More Advanced Topics J. H. Wang May 20, 2008


Text Retrieval and Applications – More Advanced Topics

J. H. Wang, May 20, 2008


Outline

• Text Mining and Information Extraction
  – Introduction to Text Mining
  – Methods of Information Extraction
  – Applications to Information Extraction

• Applications in Digital Libraries
  – OAI
  – Unencoded character problem

• Advanced Topics
  – Mobile Search
  – LiveClassifier


Text Mining and Information Extraction

• Introduction to Text Mining
• Methods of Information Extraction
• Applications to Information Extraction


References

• Marti Hearst, What is Text Mining, http://www.sims.berkeley.edu/~hearst/text-mining.html

• Marti Hearst, Untangling Text Data Mining, ACL 1999.

• Douglas E. Appelt and David J. Israel, Introduction to Information Extraction Technology, IJCAI 1999 Tutorial.

• Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI 1999 Workshop on Machine Learning for Information Extraction.

• Andrew McCallum and William Cohen, Information Extraction from the World Wide Web, KDD 2003 tutorial. (Also earlier version in NIPS 2002 tutorial)

• Hamish Cunningham, Named Entity Recognition, RANLP 2003 tutorial.


Text Mining

• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]


Text Mining vs. Web Search

• In search, the user is typically looking for something that is already known and has been written by someone else

• In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down


Text Mining vs. Data Mining

• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases

• In text mining, the patterns are extracted from natural language text rather than from structured databases of facts


Text Mining vs. Computational Linguistics (or NLP)

• NLP is making a lot of progress in doing small subtasks in text analysis
  – Word segmentation, part-of-speech tagging, word sense disambiguation, …

• Text understanding vs. mining


Text Mining vs. Information Extraction

• There are programs that can, with reasonable accuracy, extract information from text with somewhat regularized structure

• Discovering new knowledge vs. showing trends


What is “Information Extraction”?

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is “Information Extraction”?

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

Segments extracted from the news passage (aka “named entity extraction”):

• Microsoft Corporation
• CEO
• Bill Gates
• Microsoft
• Gates
• Microsoft
• Bill Veghte
• Microsoft
• VP
• Richard Stallman
• founder
• Free Software Foundation



IE in Context

[Figure: IE pipeline diagram with the stages Spider, Document collection, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Database, Query/Search and Data mine; supporting steps: Create ontology, Label training data, Train extraction models.]


Information Extraction Tasks

• Information Extraction (IE) pulls facts and structured information from the content of large text collections
  – Unstructured or semi-structured text

• Federal government funded research
  – MUC: Message Understanding Conferences (1987–1998) by DARPA
  – TIPSTER (1991–1998) by DARPA
  – ACE: Automatic Content Extraction (1999–) by NIST
    • http://www.nist.gov/speech/tests/ace/
    • http://www.ldc.upenn.edu/Projects/ACE/


Landscape of IE Techniques: Models

Any of these models can be used to capture words, formatting or both.

Example sentence used for every model: “Abraham Lincoln was born in Kentucky.”

• Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: a classifier decides which positions are BEGIN and END boundaries
• Context Free Grammars: find the most likely parse (part-of-speech tags combined into NP, PP, VP and S constituents)
• Finite State Machines: find the most likely state sequence


Methods of Information Extraction – Three Approaches

• NLP (Linguistic) approach
  – Named entity recognition
• Extraction pattern/template approach
  – Wrapper generation/induction
• Statistical approach
  – Class-based language model
  – PAT-tree-based


NLP (Linguistic) Approach to IE

• MUC-7 tasks
• Named entity recognition
  – Preprocessing
  – Two kinds of approaches
  – Baseline
  – Rule-based approach
  – Learning-based approach


MUC-7 Tasks

• NE: Named Entity recognition and typing

• CO: Co-reference resolution

• TE: Template Elements (attributes)

• TR: Template Relations

• ST: Scenario Templates


An Example

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"

• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same

• TE: the rocket is "shiny red" and Head's "brainchild".

• TR: Dr. Head works for We Build Rockets Inc.

• ST: a rocket launching event occurred with the various participants.


Performance Levels

• Vary according to text type, domain, scenario, language

• NE: up to 97% (tested in English, Spanish, Japanese, Chinese)

• CO: 60-70% resolution

• TE: 80%

• TR: 75-80%

• ST: 60% (but human level may be only 80%)


What are Named Entities?

• NER involves identification of proper names in texts, and classification into a set of predefined categories of interest
  – Person names
  – Organizations (companies, government organizations, committees, etc.)
  – Locations (cities, countries, rivers, etc.)
  – Date and time expressions


What are Named Entities (2)

• Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.

• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

• MUC-7 entity definition guidelines [Chinchor’97]

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html


Problems in NE

• Variation of NEs, e.g. John Smith, Mr Smith, John
• Ambiguity of NE types
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
• More complex problems in NE
  – Issues of style, structure, domain, genre etc.
  – Punctuation, spelling, spacing, formatting, ...


The Evaluation Metric

• Precision vs. recall
• F-Measure = (β² + 1)PR / (β²P + R) [van Rijsbergen 75]
  – β reflects the weighting between precision and recall, typically β = 1
• We may also want to take account of partially correct answers:
  – Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
  – Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
  – Why: NE boundaries are often misplaced, so some results are only partially correct
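The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the official MUC scorer: the function name `ne_scores` and its argument names are assumptions, and the F-measure uses the standard van Rijsbergen form with β²P + R in the denominator.

```python
def ne_scores(correct, partial, incorrect, missing, beta=1.0):
    """Precision, recall and F-measure with half credit for partial matches."""
    credit = correct + 0.5 * partial
    produced = correct + incorrect + partial   # answers the system returned
    expected = correct + missing + partial     # answers in the gold standard
    precision = credit / produced if produced else 0.0
    recall = credit / expected if expected else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall.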


Pre-processing for NE Recognition

• Tokenization
  – Word segmentation
• Lexical or morphological processing
  – Part of speech tagging
  – Word sense tagging
• Syntactic analysis
  – Parsing
• Domain-specific module
  – Coreference
  – Merging partial results


Two kinds of NE approaches

Knowledge Engineering

• Rule based
• Developed by experienced language engineers, linguistic resources required
• Make use of human intuition
• Requires only small amount of training data
• Development could be very time consuming
• Good performance
• Some changes may be hard to accommodate

Learning Systems

• Use statistics or other machine learning

• Developers do not need NE expertise

• Domain independent

• Requires large amounts of annotated training data, which may be difficult to obtain

• Some changes may require re-annotation of the entire training corpus

• Annotators are cheap (but you get what you pay for!)


Baseline: List Lookup Approach

• System that recognizes only entities stored in its lists (gazetteers)
  – Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
  – Location lists: US GEOnet Names Server (GNS) data, 3.9 million locations with 5.37 million names (e.g., [Manov03])
  – Automatic collection from annotated training data
• Advantages: simple, fast, language independent, easy to retarget (just create lists)
• Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
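A list-lookup baseline can be sketched as a greedy longest-match search over token n-grams. The gazetteer contents, the `MAX_LEN` limit and the function name `list_lookup` below are toy assumptions for illustration, not data from any of the cited systems.

```python
# Hypothetical toy gazetteers; real systems load lists with up to millions of entries.
GAZETTEERS = {
    "LOCATION": {"Kentucky", "Sherwood Forest", "Washington"},
    "PERSON": {"Abraham Lincoln", "John Smith"},
}
MAX_LEN = 3  # longest gazetteer entry in tokens (assumption for this toy data)

def list_lookup(text):
    """Return (phrase, type) pairs found by greedy longest-match lookup."""
    tokens = text.replace(".", " ").replace(",", " ").split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = next((t for t, names in GAZETTEERS.items() if phrase in names), None)
            if label:
                found.append((phrase, label))
                i += n          # skip past the matched phrase
                break
        else:
            i += 1              # no entry starts at this token
    return found
```

Calling `list_lookup("Lincoln visited Ohio")` returns an empty list, which illustrates the stated disadvantages: the variant "Lincoln" and the unlisted "Ohio" are both missed.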


Rule-based: Shallow Parsing Approach (Internal Structure)

• Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
  – Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
  – Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
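The two internal-structure rules can be written as a single regular expression. The keyword sets are copied from the slide; the regular expression and the function name `find_locations` are illustrative assumptions, and "Cap. Word" is simplified to one capitalized ASCII word.

```python
import re

# Keyword sets from the slide; the pattern itself is a simplifying assumption.
NATURAL = r"(?:City|Forest|Center|River)"
ADDRESS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
LOCATION_RULE = re.compile(r"\b([A-Z][a-z]+) (" + NATURAL + "|" + ADDRESS + r")\b")

def find_locations(text):
    """Return location candidates of the form 'Cap. Word + keyword'."""
    return [m.group(0) for m in LOCATION_RULE.finditer(text)]
```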


Problems with the Shallow Parsing Approach

• Ambiguously capitalized words (first word in sentence)
  – [All American Bank] vs. All [State Police]
• Semantic ambiguity
  – "John F. Kennedy" = airport (location)
  – "Philip Morris" = organization
• Structural ambiguity
  – [Cable and Wireless] vs. [Microsoft] and [Dell]
  – [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]


Shallow Parsing Approach with Context

• Use of context-based patterns is helpful in ambiguous cases
  – "David Walton" and "Goldman Sachs" are indistinguishable
  – But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognized, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly
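The "[Person] of [Organization]" pattern can be sketched as follows. The function name `classify_by_context` and the `CAP_SEQ` approximation of a name (a run of capitalized words) are assumptions made for this example only.

```python
import re

CAP_SEQ = r"(?:[A-Z][a-z]+ ?)+"  # one or more capitalized words (simplifying assumption)

def classify_by_context(text, known_persons):
    """Apply the pattern '[Person] of [Organization]' around already recognized persons."""
    organizations = []
    for person in known_persons:
        match = re.search(re.escape(person) + r" of (" + CAP_SEQ + ")", text)
        if match:
            organizations.append(match.group(1).strip())
    return organizations
```

The pattern only fires once the Person entity is known, which mirrors the point above: context resolves what internal evidence alone cannot.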


Examples of Context Patterns

• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]


Rule-based Examples

• FACILE: used in MUC-7 [Black et al 98]
• ANNIE: part of GATE, Sheffield’s open-source infrastructure for language processing
• Gazetteer Lists for Rule-based NE
  – Internal location indicators, e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
  – Internal organization indicators, e.g., company designators {GmbH, Ltd, Inc, …}


Using Co-reference to Classify Ambiguous NEs

• Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs

• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
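A minimal sketch of this idea, assuming only two matching heuristics (surname match and acronym match). The function names and the designator list are hypothetical and far simpler than what a real co-reference module does.

```python
def acronym(name):
    """Initials of the capitalized words, ignoring company designators."""
    designators = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}
    return "".join(w[0] for w in name.split() if w not in designators and w[0].isupper())

def resolve_type(unknown, classified):
    """Return the type of a classified NE that `unknown` is a surname or acronym of."""
    for name, ne_type in classified.items():
        if name.split()[-1] == unknown or acronym(name) == unknown:
            return ne_type
    return None  # no related classified entity found
```

For example, with `{"Sir Peter Bonfield": "PERSON", "International Business Machines Ltd.": "ORGANIZATION"}` as the classified entities, "Bonfield" resolves to PERSON and "IBM" to ORGANIZATION.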


Machine Learning Approaches

• ML approaches frequently break down the NE task in two parts:
  – Recognizing the entity boundaries
  – Classifying the entities in the NE categories
• Example approaches
  – IdentiFinder [Bikel et al 99] (Hidden Markov Models)
  – MENE [Borthwick et al 98], combining rule-based and ML NE (Maximum Entropy)
  – NE Recognition without Gazetteers [Mikheev et al 99], combining rule-based grammars and statistical (MaxEnt) models
  – Fine-grained Classification of NEs [Fleischman 02]
    • Ex: Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police


ML Approaches to Named Entity Identification

• Boosting
• Bootstrapping
• Class-based Language Model
• Conditional Markov Model
• Decision Tree
• Hidden Markov Model
• Maximum Entropy Model
• Memory-based Learning
• Stacking
• Support Vector Machine
• Transform-based Learning
• Voted Perceptron
• Others


Extraction Pattern/Template Approach

• Types of extraction patterns (rules)
  – Syntactic/semantic constraints
  – Delimiter-based
  – Combination of both
• Types of documents to be extracted
  – Semi-structured documents: Web pages, …
  – Unstructured documents: free text


IE from Free Text

• “The parliament was bombed by the guerrillas.”

• AutoSlog [Riloff 1993]
• LIEP [Huffman 1996]
• PALKA [Kim & Moldovan 1995]
• CRYSTAL [Soderland et al. 1995]
• CRYSTAL+Webfoot [Soderland 1997]
• HASTEN [Krupka 1995]


IE from Online Documents

• WHISK [Soderland 1999]
  – A special type of regular expression
• RAPIER [Califf & Mooney 1997]
  – Robust Automated Production of Information Extraction Rules
• SRV [Freitag 1998]
  – First-order logic


Wrapper Induction Systems

• Wrapper: a procedure for extracting a particular resource’s content

• WIEN [Kushmerick, Weld & Doorenbos, 1997]
  – First wrapper induction system
• SoftMealy [Hsu & Dung, 1998]
  – Finite State Transducer (FST)
• STALKER [Muslea, Minton & Knoblock, 1999]
  – Hierarchical information extraction
• BWI [Freitag & Kushmerick 2000]
  – Boosting
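To make the notion of a wrapper concrete, here is a sketch of a delimiter-based wrapper that extracts tuples using one left/right delimiter pair per attribute. It only shows how such a wrapper is executed; the delimiters are written by hand here, whereas a wrapper induction system learns them from labeled example pages. The function name `lr_wrapper` and the sample page are assumptions.

```python
def lr_wrapper(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows          # unterminated attribute: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
```

Because the wrapper depends on one site's markup, it has to be rebuilt whenever the page layout changes, which is the maintenance cost that induction aims to reduce.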


Statistical Approach to IE

• Class-based language model [Brown et al. 1992] for Chinese NE – [COLING 2002]

• PAT-tree-based [SIGIR 1997]
  – Long, repeated patterns without length limitation
  – Space and time efficient
  – Incremental
  – http://pattree.openfoundry.org/


Class-based Language Model

• Types of classes– Person, location, organization, terms in dictionary

• Number of classes = |V|+3 if the size of vocabulary is |V|

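\[
(C^*, W^*) = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C)
\]

\[
P(C) = P(c_1 \ldots c_m) \approx P(c_1 \mid \langle s \rangle)\, P(c_2 \mid \langle s \rangle, c_1) \Big\{ \prod_{i=3}^{m} P(c_i \mid c_{i-2}, c_{i-1}) \Big\}\, P(\langle /s \rangle \mid c_{m-1}, c_m)
\]

\[
P(W \mid C) = P(w_1 \ldots w_m \mid c_1 \ldots c_m) \approx \prod_{i=1}^{m} P(w_i \mid c_i)
\]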


PAT-tree-based Approach

• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low frequency n-grams tend to be discarded
• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low frequency n-grams can be extracted
• SCPCD: a combination of the two


Association Measure

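\[
SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}
\]

\[
CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2}
\]

\[
SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n)\, CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}
\]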
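The three association measures can be computed from plain n-gram counts, as in the sketch below. It assumes that LC and RC count the distinct left- and right-adjacent tokens of an n-gram, and it uses a dictionary of counts instead of a PAT tree; the function names are illustrative. The measures are defined for n-grams of length two or more.

```python
from collections import Counter, defaultdict

def ngram_stats(tokens, max_n):
    """Count n-gram frequencies and the distinct left/right neighbours of each n-gram."""
    freq, left, right = Counter(), defaultdict(set), defaultdict(set)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            if i > 0:
                left[gram].add(tokens[i - 1])
            if i + n < len(tokens):
                right[gram].add(tokens[i + n])
    return freq, left, right

def scp(gram, freq):
    """Symmetric conditional probability from raw frequencies (len(gram) >= 2)."""
    n = len(gram)
    avg = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    return freq[gram] ** 2 / avg

def cd(gram, freq, left, right):
    """Context dependency: LC * RC / freq^2 (LC, RC = distinct neighbours, an assumption)."""
    return len(left[gram]) * len(right[gram]) / freq[gram] ** 2

def scpcd(gram, freq, left, right):
    """Combined measure SCP * CD."""
    return scp(gram, freq) * cd(gram, freq, left, right)
```

Multiplying the two measures cancels the squared frequency, so SCPCD rewards n-grams that are both cohesive and found in varied contexts.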


Term Extraction Performance

Table 1. The obtained extraction accuracy including precision, recall, and average recall-precision of auto-extracted translation candidates using different methods.

Association Measure  Precision  Recall  Avg. R-P
CD                   68.1 %     5.9 %   37.0 %
SCP                  62.6 %     63.3 %  63.0 %
SCPCD                79.3 %     78.2 %  78.7 %


Speed Performance

Table 2. The obtained average speed performance of different term extraction methods.

Term Extraction Method           Time for Preprocessing  Time for Extraction
LocalMaxs (Web Queries)          0.87 s                  0.99 s
PATtree+LocalMaxs (Web Queries)  2.30 s                  0.61 s
LocalMaxs (1,367 docs)           63.47 s                 4,851.67 s
PATtree+LocalMaxs (1,367 docs)   840.90 s                71.24 s
LocalMaxs (5,357 docs)           47,247.55 s             350,495.65 s
PATtree+LocalMaxs (5,357 docs)   11,086.67 s             759.32 s


State of the Art Performance

• Named entity recognition
  – Person, Location, Organization, …
  – F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in 60’s (events) or 70’s (facts) or 80’s (attributes)
• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30min) required on each site


Broader View

[Figure: IE pipeline diagram with the stages Spider, Document collection, Filter by relevance, IE (Tokenize, Segment, Classify, Associate, Cluster), Load DB, Database, Query/Search and Data mine; supporting steps: Create ontology, Label training data, Train extraction models. Numbered callouts 1 to 5 mark some other issues.]


Applications to Information Extraction

• LiveTrans
• LiveClassifier


LiveTrans: Cross-language Web Search


Web Mining Approach to Term Translation Extraction

• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html

[Figure: the LiveTrans Engine takes the source query “Academia Sinica”, uses anchor texts and search results from the Web, and returns the target translations 中央研究院 / 中研院.]


National Palace Museum vs. 故宮博物院 -- Search-Result Page

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

[Figure: search-result page for the query, with noisy text marked as “Noises”.]


Yahoo vs. 雅虎 -- Anchor-Text Set

• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    − Yahoo/雅虎/야후
    − America/美国/アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets

[Figure: pages from Taiwan, China, Japan and Korea link to http://www.yahoo.com with anchor texts such as “Yahoo Search Engine”, “美国雅虎”, “雅虎搜尋引擎”, “Yahoo! America”, “야후 -USA” and “アメリカの Yahoo!”.]
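Building an anchor-text-set corpus amounts to grouping link texts by target URL. The sketch below assumes the crawler already yields `(anchor_text, url)` pairs; both function names are hypothetical, and the candidate step is a plain co-occurrence filter, not the similarity estimation used by LiveTrans.

```python
from collections import defaultdict

def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to."""
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text)
    return dict(sets)

def cooccurring_terms(term, anchor_sets):
    """Anchor texts that share an anchor-text set with a text containing `term`."""
    candidates = set()
    for texts in anchor_sets.values():
        if any(term in t for t in texts):
            candidates |= {t for t in texts if term not in t}
    return candidates
```

Texts that co-occur with the source term in the same set are translation candidates; ranking them is a separate step.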


Term Translation Extraction from Different Resources

[Figure: a source query (“National Palace Museum”) is sent to a search engine; term extraction is applied to the search-result pages and to an anchor-text corpus collected by a Web spider, and similarity estimation yields the target translations 國立故宮博物院, 故宮, 故宮博物院.]


IE Resources

• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC)
    • Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data
• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet
• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html
  – http://www.cs.umass.edu/~mccallum/ie


Text Mining Related Workshops

• Workshops in Data Mining conferences
  – KDD 2000, Workshop on Text Mining (TextKDD 2000)
  – ICDM 2001, Workshop on Text Mining (TextDM 2001)
  – PAKDD 2002, Workshop on Text Mining
  – SDM 2001-2003, 2006-2008, Workshop on Text Mining
• Workshops in Machine Learning conferences
  – ECML 1998, Workshop on Text Mining
  – ICML 1999, Workshop on Machine Learning in Text Data Analysis
  – ICML 2002, Workshop on Text Learning (TextML 2002)
• Others
  – RANLP 2005 Text Mining Workshop


Workshops on Machine Learning and IE

• AAAI 1999, Workshop on Machine Learning for Information Extraction

• ECAI 2000, Workshop on Machine Learning for Information Extraction

• IJCAI 2001, Workshop on Adaptive Text Extraction and Mining (ATEM 2001)

• ECML 2003, Workshop on Adaptive Text Extraction and Mining (ATEM 2003)

• AAAI 2004, Workshop on Adaptive Text Extraction and Mining (ATEM 2004)

• EACL 2006, Workshop on Adaptive Text Extraction and Mining (ATEM 2006)


Other Workshops

• Workshop on Text Mining and Link Analysis
  – TextLink 2007, in IJCAI 2007
  – TextLink 2003, in IJCAI 2003
• Workshop on Link Analysis
  – LinkKDD 2003-2006, in KDD 2003-2006
  – AAAI 2005 workshop on Link Analysis


Applications in Digital Libraries

• OAI
• Unencoded Character Problem


Some Problems in Digital Libraries

• Variety in objects
  – Museums, libraries, …
  – Difficult to integrate metadata
• Ancient characters in archives
  – Difficult to input, display, distribute,


OAI (Open Archives Initiative)

• Starting from Oct. 1999
• OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) version 2.0 of 2002
  – Service Provider
    • Harvesters
  – Data Provider
    • Repositories
  – Metadata


Service Provider vs. Data Provider

[Figure: a User interacts with the Service Provider, which sends Requests to Data Providers (repositories) and receives Responses.]
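A harvester's request to a data provider is an ordinary HTTP request whose query string carries an OAI-PMH verb. The sketch below builds a `ListRecords` request; the base URL in the usage note is a placeholder, and the function name `list_records_url` is an assumption. `verb`, `metadataPrefix`, `from` and `set` are request arguments defined by OAI-PMH 2.0, and `oai_dc` is the metadata prefix for unqualified Dublin Core.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a harvester."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec        # selective harvesting by set
    return base_url + "?" + urlencode(params)
```

For instance, `list_records_url("http://repository.example.org/oai", from_date="2008-01-01")` asks a (hypothetical) repository for all records changed since that date; the response is an XML document that the service provider parses and indexes.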


OAI-PMH vs. Z39.50

• What is the relationship between the OAI-PMH and other protocols such as Z39.50?

• The OAI technical framework is intentionally simple
  – Providing a low barrier for participants
  – Easy-to-implement and easy-to-deploy alternative, not intended to replace other approaches

• Protocols such as Z39.50 have more complete functionality
  – Session management, results sets, specification of predicates that filter the records returned
  – An increase in difficulty of implementation and cost

Why Dublin Core?

• Why does the protocol mandate a common metadata format (and why is that common format Dublin Core)?

• Mapping among multiple metadata formats
  – Creating services such as common search interfaces across heterogeneous metadata formats
• A less burdensome and ultimately more deployable solution is to require repositories to map to a simple and common metadata format
• The fifteen elements in Dublin Core as a de facto standard for simple cross-discipline metadata
  – Dublin Core Metadata Element Set (DCMES)
• Cooperation between the OAI and the Dublin Core Metadata Initiative (DCMI) has led to a common XML schema for unqualified Dublin Core that is available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd
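To make the common format concrete, here is a minimal sketch of reading an unqualified Dublin Core record as it appears inside an OAI-PMH response. The sample record is invented and the function name `parse_dc` is hypothetical; the fifteen element names and the two XML namespaces are the standard ones.

```python
import xml.etree.ElementTree as ET

# The fifteen elements of the Dublin Core Metadata Element Set.
DCMES = ["title", "creator", "subject", "description", "publisher",
         "contributor", "date", "type", "format", "identifier",
         "source", "language", "relation", "coverage", "rights"]

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record; the values are placeholders, not real data.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample rubbing of a bronze inscription</dc:title>
  <dc:creator>Example Museum</dc:creator>
  <dc:type>Image</dc:type>
  <dc:language>zh</dc:language>
</oai_dc:dc>"""


def parse_dc(xml_text):
    """Map each Dublin Core element present in the record to its values."""
    root = ET.fromstring(xml_text)
    record = {}
    for name in DCMES:
        # Every element is optional and may repeat, so collect a list.
        values = [el.text for el in root.findall(f"dc:{name}", NS)]
        if values:
            record[name] = values
    return record


record = parse_dc(SAMPLE)
print(record)
```

Because every repository exposes at least this format, a service provider can build one search interface over museums and libraries without writing a mapping for each native schema.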

Unencoded Character Problem

• Large numbers of Chinese characters exist in different forms, such as Bronze Script (金文) and Seal Script (小篆)

• People can better appreciate the long history of the Chinese character evolution process and the Chinese culture in general

• However, digitization of the heritage materials brings a big problem: these characters are not included in common character encodings in computers (the missing characters)

Example Unencoded Characters

[Figure: example unencoded characters in Bronze Script (金文) and Seal Script (小篆)]

Goal

• We intend to develop an integrated technology to make it easier to process large numbers of missing characters

• This includes the input, representation, font generation, display, distribution, and search for all the missing characters

• We propose an effective composite approach to handle the formation and basic components of Chinese characters
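The slides do not spell out how a composite representation is stored, so the following is only an illustration of the general idea: describe a character by a layout operator plus its components, then search by component. It swaps in Unicode's Ideographic Description Characters as the operators, which may differ from the method in the JCDL 2005 work. The table `GLYPHS` and both function names are hypothetical, and the examples use familiar encoded characters purely for readability.

```python
# Unicode Ideographic Description Characters used as layout operators.
LEFT_RIGHT = "\u2ff0"   # ⿰ components side by side
TOP_BOTTOM = "\u2ff1"   # ⿱ components stacked

# Hypothetical glyph table: glyph id -> (operator, components).
GLYPHS = {
    "hao": (LEFT_RIGHT, ("女", "子")),    # the encoded character 好
    "ming": (LEFT_RIGHT, ("日", "月")),   # the encoded character 明
    "nan": (TOP_BOTTOM, ("田", "力")),    # the encoded character 男
}


def describe(glyph_id):
    """Return the composition string for a glyph, e.g. operator + parts."""
    operator, parts = GLYPHS[glyph_id]
    return operator + "".join(parts)


def find_by_component(component):
    """Return ids of all glyphs that contain the given component."""
    return sorted(glyph_id for glyph_id, (_, parts) in GLYPHS.items()
                  if component in parts)


print(describe("hao"))
print(find_by_component("日"))
```

A description of this kind can serve several of the goals above at once: it can be typed from existing components, exchanged as plain text, and indexed for search even when the character itself has no code point.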

Composite Approach to Unencoded Chinese Characters [JCDL 2005]

Advanced Topics

• Mobile Search
• LiveClassifier
• Concept Search

Mobile Search

• Introduction to Mobile Search
• Existing Services
• Google Mobile/SMS
• Issues

References

• R. Schusteritsch, S. Rao, and K. Rodden, Mobile Search with Text Messages: Designing the User Experience for Google SMS, Proceedings of CHI 2005, pp. 1777-1780 (poster).

• B. Miller, China’s Internet Portals and Content Providers Look to a Future Beyond Mobile Text Messaging, The Yankee Group Report, Feb. 2004.

• Communications of the ACM, Vol.48, No.7, Designing for the Mobile Devices, Jul. 2005.

Introduction to Mobile Search

• Internet users (as of 2003)
  – US: 170 million (population: 292 million)
  – China: 80 million (population: 1.3 billion)
  – Japan: 70 million (population: 126 million)

• Currently, there are more than two billion mobile phone users worldwide, which is more than three times the number of PC users

• More than a half-billion mobile phones sold each year (in 2004)

Internet and Mobile Users in China

SMS-based Mobile Internet

• SMS (Short Message Service)
  – 1 billion messages worldwide every day
  – Nearly 200 billion messages in China in 2003
• Content Explodes
  – News alerts
  – Sports news
  – Weather
  – Special interest information
    • Ex.: Yao Ming

• From SMS to search

Two Common Modes of Mobile Search

• Mobile Web browsing
• Text Messaging (SMS)

Google Mobile (1/2)

http://mobile.google.com/

(XHTML)

(WML)

Google Mobile (2/2)

(Images) (Mobile Web)

Google SMS

http://www.google.com/sms/

Google Maps for Mobile

http://www.google.com/gmm/

Existing Mobile Search Services

• Google (SMS short code: 46645)
  – Google Mobile, http://mobile.google.com/
  – Google SMS, http://www.google.com/sms/
  – Google Maps for Mobile, http://www.google.com/gmm/
• Yahoo (SMS short code: 92466)
  – Yahoo! Mobile, http://mobile.yahoo.com/
  – Yahoo! Go, http://go.yahoo.com
• 4INFO (SMS short code: 44636)
• AOL
  – AOL Mobile Search, http://mobile.aolsearch.com/
• MSN
  – MSN Mobile, http://mobile.msn.com/
• Synfonic, UpSNAP, …

Google SMS

• Specialized information
  – Business listings
  – Residential listings
  – Product prices
  – Dictionary definitions
  – Area codes
  – Zip codes
  – …

• Google SMS attempts to return the desired information directly, rather than returning hyperlinks

Design Constraints

• Conceptual Model
  – 1-to-1 communication vs. mobile search
  – Abbreviation interpretation
• Inherent Limitations of Mobile Devices and SMS
  – Text entry is slow
  – Charge per message sent
  – Small, low-resolution screens
  – SMS message size limitation: 160 characters
  – No guarantee of the order in which messages are received
  – SMS interface: text-only, one-dimensional with no menus, forms, or buttons to help users understand its affordances
  – Not possible for the system to offer instructions or a prompt
  – To get any feedback, the user must wait for a new message
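Two of these limits, the 160-character cap and the lack of ordering guarantees, interact: a long answer has to be cut into several messages, and each part has to say where it belongs. A minimal sketch follows, assuming a word-based split and a prefix of the form "(1of2) ". The function name `split_sms` and the exact prefix layout are assumptions for illustration, not a description of any deployed system.

```python
SMS_LIMIT = 160  # maximum characters per SMS message


def split_sms(text, limit=SMS_LIMIT):
    """Split text into numbered parts that each fit in one SMS."""
    # Reserve room for a prefix such as "(1of2) "; this sketch assumes
    # fewer than 100 parts, so "(99of99) " is the longest prefix.
    room = limit - len("(99of99) ")
    parts, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= room:
            current = candidate
        else:
            if current:
                parts.append(current)
            # A single word longer than one message is truncated here.
            current = word[:room]
    if current:
        parts.append(current)
    if len(parts) <= 1:
        return parts  # a single message needs no numbering
    total = len(parts)
    return [f"({i}of{total}) {part}" for i, part in enumerate(parts, 1)]


for message in split_sms("word " * 60):
    print(len(message), message[:20])
```

With the numbering in place, the recipient can reassemble the answer even if the second part arrives first.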

Addressing Users’ Existing Conceptual Models

• Most users had some initial problems understanding how SMS could be used for search

• Users’ existing conceptual model of Google searching also caused some initial problems
• Changes to message interpretation
  – “froogle”, “shopping”, “yellow pages”, “white pages”, “dictionary” combined with “help”, “tips”, “instructions”
  – “price” or “prices” vs. product search

• Communicating affordances
  – Minimal and prominent sequence of instructions and the features that are considered most useful are shown on the Google SMS home page
  – Sending the message “help” should return a concise set of instructions on how to use it
  – Collaborate with PR team to mention the “help” command and highlight the most important features
  – Work with Marketing team to develop a wallet-size instruction card

Addressing the Limitations of Mobile Devices and SMS

• Order of messages
  – “1of2”
• Limited input technology
  – Query refinement is more difficult (“cofee”)
  – Immediately returning search results for the closest match
• Limited output technology
  – No results, “help”
  – No more than 3 messages in response to each help request
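Returning results for the closest match saves the user a costly round trip when a query such as “cofee” is mistyped. A minimal sketch using Python's standard `difflib` module follows; the vocabulary, the cutoff value, and the reply texts are all hypothetical, and a production system would likely use query logs rather than a fixed word list.

```python
import difflib

# Hypothetical vocabulary of known query terms.
VOCABULARY = ["coffee", "pizza", "weather", "movies", "hotel"]


def closest_term(term, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the vocabulary entry most similar to term, or None."""
    matches = difflib.get_close_matches(term.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


def answer(query):
    """Answer directly for the closest match instead of asking again."""
    term = closest_term(query)
    if term is None:
        # Nothing similar enough: point the user at the help command.
        return 'No results. Send "help" for instructions.'
    return f"Showing results for: {term}"


print(answer("cofee"))
print(answer("zzzz"))
```

The design choice is to act on the best guess immediately, since asking "did you mean ...?" would cost the user another message and another wait.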

Issues in Mobile Search

• User interface for input
• Mobile search result output
• Context-aware, location-based service
• User preference

• To allow users to get updated or additional information with fewer keystrokes
• To transform the search experience found on a PC to a mobile device
  – Tiny screens, network bandwidth, …
  – Wireless Application Protocol (WAP) to shrink Web pages down to more manageable sizes
    • Google WML
  – Alternatively, SMS
• To provide users who are not working with a mouse or a keyboard with a simple way to enter queries
  – Shortcuts (for example, “w” for “weather”…)
• To integrate the search vendors’ local and mobile search functions
• To integrate commerce and mobile search applications
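The shortcut idea is simple to sketch: the first word of a message is looked up in a table and expanded before the query is processed. Only the pair “w” for “weather” comes from the slide; the other entries and the function name `expand_query` are hypothetical.

```python
# Shortcut table; only "w" -> "weather" appears in the slides,
# the remaining entries are invented for illustration.
SHORTCUTS = {"w": "weather", "d": "define", "m": "movies"}


def expand_query(query):
    """Expand a leading shortcut so that "w taipei" means "weather taipei"."""
    words = query.strip().split()
    if words and words[0].lower() in SHORTCUTS:
        words[0] = SHORTCUTS[words[0].lower()]
    return " ".join(words)


print(expand_query("w taipei"))
print(expand_query("pizza 10001"))
```

Restricting expansion to the first word keeps ordinary queries that happen to contain a single letter from being rewritten.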

Possible Related Directions

• Context-aware Retrieval
• Ubiquitous Computing
  – Communications of the ACM, Vol.48, No.3, The Disappearing Computer, Mar. 2005.
  – Communications of the ACM, Vol.45, No.12, Issues and Challenges in Ubiquitous Computing, Dec. 2002.

• …

LiveClassifier

A system that creates classifiers through Web mining

LiveClassifier

Users create topic hierarchies and define classes/keywords

LiveClassifier

Web

Auto-extracted training data; No manually-labeled data provided

Exploiting the inherent structure information for training
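The idea of training without manually labeled data can be sketched as follows: each class is described only by keywords, text about those keywords is fetched from the Web, and the fetched text becomes the training data. The sketch below is not the LiveClassifier algorithm; it swaps in a plain term-frequency centroid with cosine similarity, and it replaces the Web search with canned, invented snippets (`FAKE_WEB`) so that it runs offline. All names are hypothetical.

```python
import math
from collections import Counter

# Stand-in for a Web search call. A real system would query a search
# engine; these snippets are invented so the sketch runs offline.
FAKE_WEB = {
    "basketball": ["nba player scores points in basketball game",
                   "basketball team wins playoff game"],
    "cooking": ["recipe for cooking noodles with sauce",
                "chef shares cooking recipe and kitchen tips"],
}


def fetch_snippets(keyword):
    return FAKE_WEB.get(keyword, [])


def vectorize(text):
    """Bag-of-words term frequencies."""
    return Counter(text.lower().split())


def train(class_keywords):
    """Build one term-frequency centroid per class from fetched snippets."""
    model = {}
    for label, keywords in class_keywords.items():
        centroid = Counter()
        for keyword in keywords:
            for snippet in fetch_snippets(keyword):
                centroid.update(vectorize(snippet))
        model[label] = centroid
    return model


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(model, text):
    """Return the class whose centroid is most similar to the text."""
    vector = vectorize(text)
    return max(model, key=lambda label: cosine(vector, model[label]))


# The user supplies only class names and keywords, no labeled documents.
model = train({"Sports": ["basketball"], "Food": ["cooking"]})
print(classify(model, "playoff game tonight"))
```

The same `classify` call works for full documents and for short texts such as queries, although short texts share few words with any centroid and would benefit from being expanded with fetched snippets as well.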

LiveClassifier

People

Place

Subjects

Sub-subjects

LiveClassifier

Classifying documents

Into classes

LiveClassifier

Classifying short texts

Into classes

LiveClassifier

LiveClassifier

Concept Search

• Conventional search
  – Keyword search for “researcher” and “AI” and “Taiwan”
• Concept-level search
  – [Figure: an interesting document containing “professor”, “NTU”, and “neural network” is matched through the concepts researcher, AI, and Taiwan]
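The difference between the two modes can be shown with a small sketch. A keyword search requires the literal query words, while a concept-level search first maps the terms in a document to concepts and then matches on those. The term-to-concept table below is hand-written and hypothetical; in the setting of these slides such a mapping would presumably come from a classifier rather than a fixed dictionary.

```python
import re

# Hypothetical term-to-concept mapping, written by hand for illustration.
TERM_TO_CONCEPT = {
    "researcher": "researcher",
    "professor": "researcher",
    "AI": "AI",
    "neural network": "AI",
    "Taiwan": "Taiwan",
    "NTU": "Taiwan",
}


def contains(text, term):
    """Whole-word, case-insensitive match of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def keyword_match(doc, keywords):
    """Conventional search: every keyword must literally appear."""
    return all(contains(doc, keyword) for keyword in keywords)


def concepts_in(doc):
    """Concepts evoked by the terms that appear in the document."""
    return {concept for term, concept in TERM_TO_CONCEPT.items()
            if contains(doc, term)}


def concept_match(doc, concepts):
    """Concept-level search: every query concept must be evoked."""
    return set(concepts) <= concepts_in(doc)


doc = "A professor at NTU publishes work on neural network training."
query = ["researcher", "AI", "Taiwan"]
print(keyword_match(doc, query))   # literal words are absent
print(concept_match(doc, query))   # concepts are all present
```

The whole-word match matters here: a plain substring test would find "ai" inside "training" and report a spurious keyword hit.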

LiveTrans + LiveClassifier

Thanks for Your Attention!