
1

Text Retrieval and Applications – More Advanced Topics

J. H. Wang, May 20, 2008

2

Outline

• Text Mining and Information Extraction
  – Introduction to Text Mining
  – Methods of Information Extraction
  – Applications to Information Extraction

• Applications in Digital Libraries
  – OAI
  – Unencoded character problem

• Advanced Topics
  – Mobile Search
  – LiveClassifier

3

Text Mining and Information Extraction

• Introduction to Text Mining
• Methods of Information Extraction
• Applications to Information Extraction

4

References

• Marti Hearst, What is Text Mining?, http://www.sims.berkeley.edu/~hearst/text-mining.html

• Marti Hearst, Untangling Text Data Mining, ACL 1999.

• Douglas E. Appelt and David J. Israel, Introduction to Information Extraction Technology, IJCAI 1999 Tutorial.

• Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI 1999 Workshop on Machine Learning for Information Extraction.

• Andrew McCallum and William Cohen, Information Extraction from the World Wide Web, KDD 2003 Tutorial. (An earlier version appeared as a NIPS 2002 tutorial.)

• Hamish Cunningham, Named Entity Recognition, RANLP 2003 Tutorial.

5

Text Mining

• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]

6

Text Mining vs. Web Search

• In search, the user is typically looking for something that is already known and has been written by someone else

• In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down

7

Text Mining vs. Data Mining

• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases

• In text mining, the patterns are extracted from natural language text rather than from structured databases of facts

8

Text Mining vs. Computational Linguistics (or NLP)

• NLP is making a lot of progress on small subtasks in text analysis
  – Word segmentation, part-of-speech tagging, word sense disambiguation, …

• Text understanding vs. mining

9

Text Mining vs. Information Extraction

• There are programs that can, with reasonable accuracy, extract information from text with somewhat regularized structure

• Discovering new knowledge vs. showing trends

10

What is “Information Extraction”?
Filling slots in a database from sub-segments of text.
As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

11

What is “Information Extraction”?
Filling slots in a database from sub-segments of text.
As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

IE
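To make the slot-filling view concrete, here is a rough, illustrative sketch that pulls (name, title, organization) triples out of appositive phrases like "Bill Veghte, a Microsoft VP" with a single hand-written regular expression. The pattern and helper names are assumptions for illustration only and are not taken from any system discussed in these slides; real IE systems use far richer models, as the following slides show.

```python
import re

# A hypothetical, hand-written pattern for appositives of the form
# "<First> <Last>, a <Org> <Title>" (e.g., "Bill Veghte, a Microsoft VP").
APPOSITIVE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), an? (?P<org>[A-Z][A-Za-z]+) (?P<title>[A-Z]{2,}|[a-z]+)"
)

def extract_slots(text):
    """Return (name, title, organization) tuples found by the pattern."""
    return [(m.group("name"), m.group("title"), m.group("org"))
            for m in APPOSITIVE.finditer(text)]

if __name__ == "__main__":
    sentence = ('"We can be open source. We love the concept of shared source," '
                "said Bill Veghte, a Microsoft VP.")
    print(extract_slots(sentence))   # [('Bill Veghte', 'VP', 'Microsoft')]
```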

12

What is “Information Extraction”?
Information Extraction = segmentation + classification + clustering + association

As a family of techniques:
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation)

aka “named entity extraction”

13

What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering

As a family of techniques:
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation)

14

What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering

As a family of techniques:
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation)

15

What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering

As a family of techniques:
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

(Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation)

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

16

IE in Context

(Pipeline diagram with components: Create ontology; Spider; Document collection; Filter by relevance; IE – Segment, Classify, Associate, Cluster; Label training data; Train extraction models; Load DB; Database; Query/Search; Data mine.)

17

Information Extraction Tasks

• Information Extraction (IE) pulls facts and structured information from the content of large text collections
  – Unstructured or semi-structured text

• Federal government funded research
  – MUC: Message Understanding Conferences (1987–1998) by DARPA
  – TIPSTER (1991–1998) by DARPA
  – ACE: Automatic Content Extraction (1999–) by NIST
    • http://www.nist.gov/speech/tests/ace/
    • http://www.ldc.upenn.edu/Projects/ACE/

18

Landscape of IE Techniques: Models

Any of these models can be used to capture words, formatting, or both. (Each model is illustrated on the sentence "Abraham Lincoln was born in Kentucky.")

• Lexicons: check whether a candidate string is a member of a list (e.g., a gazetteer of state names: Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify pre-segmented candidates: a classifier assigns a class to each candidate segment
• Sliding window: a classifier is applied to windows of text, trying alternate window sizes
• Boundary models: classifiers predict BEGIN and END boundaries of entities
• Context-free grammars: find the most likely parse
• Finite state machines: find the most likely state sequence
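To make the sliding-window idea above concrete, here is a minimal sketch: windows of tokens are slid over the sentence and a scoring function decides whether each window looks like a person name. The capitalization-based scorer is an assumption made purely for illustration; a real system would plug in a trained classifier (e.g., Naive Bayes or MaxEnt) in its place.

```python
def window_candidates(tokens, sizes=(1, 2, 3)):
    """Enumerate all token windows of the given sizes."""
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            yield i, tokens[i:i + n]

def looks_like_person(window):
    """Toy stand-in for a trained classifier: every token capitalized."""
    return all(tok[0].isupper() and tok[1:].islower() for tok in window)

tokens = "Abraham Lincoln was born in Kentucky .".split()
hits = [(" ".join(w), i) for i, w in window_candidates(tokens) if looks_like_person(w)]
print(hits)   # includes ('Abraham Lincoln', 0) plus single-token windows
```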

19

Methods of Information Extraction – Three Approaches

• NLP (linguistic) approach
  – Named entity recognition

• Extraction pattern/template approach
  – Wrapper generation/induction

• Statistical approach
  – Class-based language model
  – PAT-tree-based

20

NLP (Linguistic) Approach to IE

• MUC-7 tasks
• Named entity recognition
  – Preprocessing
  – Two kinds of approaches
  – Baseline
  – Rule-based approach
  – Learning-based approach

21

MUC-7 Tasks

• NE: Named Entity recognition and typing

• CO: Co-reference resolution
• TE: Template Elements (attributes)
• TR: Template Relations
• ST: Scenario Templates

22

An Example

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"

• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same

• TE: the rocket is "shiny red" and Head's "brainchild".

• TR: Dr. Head works for We Build Rockets Inc.

• ST: a rocket launching event occurred with the various participants.

23

Performance Levels

• Vary according to text type, domain, scenario, language

• NE: up to 97% (tested in English, Spanish, Japanese, Chinese)

• CO: 60–70% resolution
• TE: 80%
• TR: 75–80%
• ST: 60% (but human level may be only 80%)

24

What are Named Entities?

• NER involves identification of proper names in texts, and classification into a set of predefined categories of interest
  – Person names
  – Organizations (companies, government organizations, committees, etc.)
  – Locations (cities, countries, rivers, etc.)
  – Date and time expressions

25

What are Named Entities (2)

• Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.

• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

• MUC-7 entity definition guidelines [Chinchor’97]

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

26

Problems in NE

• Variation of NEs – e.g. John Smith, Mr Smith, John.

• Ambiguity of NE types
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)

• Ambiguity with common words, e.g. "may"
• More complex problems in NE
  – Issues of style, structure, domain, genre, etc.
  – Punctuation, spelling, spacing, formatting, ...

27

The Evaluation Metric

• Precision vs. recall
• F-Measure = (β² + 1)·P·R / (β²·R + P)   [van Rijsbergen 75]
  – β reflects the weighting between precision and recall, typically β = 1
• We may also want to take account of partially correct answers:
  – Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
  – Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
  – Why: NE boundaries are often misplaced, so there are some partially correct results
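A small sketch of the scoring scheme above, assuming the counts of correct, partial, incorrect, and missing answers have already been tallied (the step of deciding when a boundary mismatch still counts as "partial" is left out, and the example counts are made up for illustration):

```python
def ne_scores(correct, partial, incorrect, missing, beta=1.0):
    """Precision/recall with half credit for partially correct answers,
    plus the F-measure as defined on this slide."""
    precision = (correct + 0.5 * partial) / (correct + incorrect + partial)
    recall = (correct + 0.5 * partial) / (correct + missing + partial)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * recall + precision)
    return precision, recall, f

# Example: 80 fully correct, 10 partial, 10 spurious, 20 missed entities.
p, r, f = ne_scores(correct=80, partial=10, incorrect=10, missing=20)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```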

28

Pre-processing for NE Recognition

• Tokenization
  – Word segmentation

• Lexical or morphological processing
  – Part-of-speech tagging
  – Word sense tagging

• Syntactic analysis
  – Parsing

• Domain-specific module
  – Coreference
  – Merging partial results

29

Two kinds of NE approaches

Knowledge Engineering

• Rule-based
• Developed by experienced language engineers; linguistic resources required
• Makes use of human intuition
• Requires only a small amount of training data
• Development can be very time consuming
• Good performance
• Some changes may be hard to accommodate

Learning Systems

• Use statistics or other machine learning
• Developers do not need NE expertise
• Domain independent
• Require large amounts of annotated training data, which may be difficult to obtain
• Some changes may require re-annotation of the entire training corpus
• Annotators are cheap (but you get what you pay for!)

30

Baseline: List Lookup Approach

• System that recognizes only entities stored in its lists (gazetteers)
  – Online phone directories and yellow pages for person and organisation names (e.g., [Paskaleva02])
  – Location lists: US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g., [Manov03])
  – Automatic collection from annotated training data

• Advantages: simple, fast, language independent, easy to retarget (just create lists)

• Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
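A minimal sketch of this list-lookup baseline: a gazetteer is just a dictionary from (possibly multi-word) names to types, and tagging is greedy longest-match lookup over the token stream. The tiny gazetteer below is illustrative only; real lists would come from sources like the phone directories or GNS data mentioned above.

```python
GAZETTEER = {
    ("New", "York"): "LOCATION",
    ("Kentucky",): "LOCATION",
    ("Abraham", "Lincoln"): "PERSON",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def lookup_tag(tokens):
    """Greedy longest-match tagging against the gazetteer."""
    i, spans = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((" ".join(key), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return spans

print(lookup_tag("Abraham Lincoln was born in Kentucky".split()))
# [('Abraham Lincoln', 'PERSON'), ('Kentucky', 'LOCATION')]
```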

31

Rule-based: Shallow Parsing Approach (Internal Structure)

• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. for locations:

  • Cap. Word + {City, Forest, Center, River}
    – e.g. Sherwood Forest

  • Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
    – e.g. Portobello Street
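The two location rules above translate almost directly into a regular expression; a minimal sketch follows, where the trigger-word lists are just the examples given on this slide:

```python
import re

NATURAL = r"(?:City|Forest|Center|River)"
STREET = r"(?:Street|Boulevard|Avenue|Crescent|Road)"

# A capitalized word followed by a location trigger word.
LOCATION_RULE = re.compile(rf"\b[A-Z][a-z]+ (?:{NATURAL}|{STREET})\b")

text = "They met in Sherwood Forest and later on Portobello Street."
print(LOCATION_RULE.findall(text))   # ['Sherwood Forest', 'Portobello Street']
```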

32

Problems with the Shallow Parsing Approach

• Ambiguously capitalized words (first word in sentence): [All American Bank] vs. All [State Police]

• Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organization

• Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

33

Shallow Parsing Approach with Context

• Use of context-based patterns is helpful in ambiguous cases
  – "David Walton" and "Goldman Sachs" are indistinguishable on their own
  – But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognized, we can use the pattern "[PERSON] of [ORGANIZATION]" to identify "Goldman Sachs" correctly

34

Examples of Context Patterns

• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
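A sketch of how one such context pattern can be applied once some entities are already recognized: "[PERSON] of [ORGANIZATION]" is instantiated with a known person name, and the capitalized words after "of" inherit the ORGANIZATION label (as in the "David Walton of Goldman Sachs" example above). The pattern encoding and matcher below are illustrative assumptions, not a particular system's rule language.

```python
import re

# "[PERSON] of [ORGANIZATION]" as a concrete regex: a known person name,
# the literal word "of", then a capitalized candidate to be labeled.
def person_of_org(text, known_persons):
    """Return (candidate, 'ORGANIZATION') pairs licensed by the pattern."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of ((?:[A-Z][A-Za-z]+ ?)+)")
        for m in pattern.finditer(text):
            found.append((m.group(1).strip(), "ORGANIZATION"))
    return found

text = "Analysts quoted David Walton of Goldman Sachs on the results."
print(person_of_org(text, known_persons=["David Walton"]))
# [('Goldman Sachs', 'ORGANIZATION')]
```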

35

Rule-based Examples

• FACILE – used in MUC-7 [Black et al 98]
• ANNIE – part of GATE, Sheffield's open-source infrastructure for language processing
• Gazetteer lists for rule-based NE
  – Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
  – Internal organization indicators – e.g., company designators {GmbH, Ltd, Inc, …}

36

Using Co-reference to Classify Ambiguous NEs

• Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs

• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
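A sketch of the two matching heuristics mentioned above, assuming a table of already-classified entities: an unclassified name inherits a type if it is the surname of a known person or matches the initialism of a known organization. This is only the string-matching core; a full coreference module does considerably more.

```python
def initialism(name):
    """'International Business Machines Ltd.' -> 'IBML' (crude)."""
    return "".join(w[0] for w in name.split() if w[0].isupper())

def classify_by_coreference(unknown, classified):
    """classified: dict mapping already-recognized entity strings to types."""
    for full_name, etype in classified.items():
        if etype == "PERSON" and unknown == full_name.split()[-1]:
            return etype                  # surname matches a classified full name
        if etype == "ORGANIZATION" and initialism(full_name).startswith(unknown):
            return etype                  # abbreviation matches, e.g. IBM
    return None

known = {"Sir Peter Bonfield": "PERSON",
         "International Business Machines Ltd.": "ORGANIZATION"}
print(classify_by_coreference("Bonfield", known))   # PERSON
print(classify_by_coreference("IBM", known))        # ORGANIZATION
```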

37

Machine Learning Approaches

• ML approaches frequently break the NE task down into two parts:
  – Recognizing the entity boundaries
  – Classifying the entities into the NE categories

• Example approaches
  – IdentiFinder [Bikel et al 99] (Hidden Markov Models)
  – MENE [Borthwick et al 98], combining rule-based and ML NE (Maximum Entropy)
  – NE recognition without gazetteers [Mikheev et al 99], combining rule-based grammars and statistical (MaxEnt) models
  – Fine-grained classification of NEs [Fleischman 02]
    • Ex: person classification into 8 sub-categories – athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police

38

ML Approaches to Named Entity Identification

• Boosting
• Bootstrapping
• Class-based Language Model
• Conditional Markov Model
• Decision Tree
• Hidden Markov Model
• Maximum Entropy Model
• Memory-based Learning
• Stacking
• Support Vector Machine
• Transformation-based Learning
• Voted Perceptron
• Others

39

Extraction Pattern/Template Approach

• Types of extraction patterns (rules)
  – Syntactic/semantic constraints
  – Delimiter-based
  – Combination of both

• Types of documents to be extracted from
  – Semi-structured documents: Web pages, …
  – Unstructured documents: free text

40

IE from Free Text

• “The parliament was bombed by the guerrillas.”

• AutoSlog [Riloff 1993]
• LIEP [Huffman 1996]
• PALKA [Kim & Moldovan 1995]
• CRYSTAL [Soderland et al. 1995]
• CRYSTAL + Webfoot [Soderland 1997]
• HASTEN [Krupka 1995]

41

IE from Online Documents

• WHISK [Soderland 1999]
  – A special type of regular expression

• RAPIER [Califf & Mooney 1997]
  – Robust Automated Production of Information Extraction Rules

• SRV [Freitag 1998]
  – First-order logic

42

Wrapper Induction Systems

• Wrapper: a procedure for extracting a particular resource’s content

• WIEN [Kushmerick, Weld & Doorenbos, 1997]
  – First wrapper induction system

• SoftMealy [Hsu & Dung, 1998]
  – Finite State Transducer (FST)

• STALKER [Muslea, Minton & Knoblock, 1999]
  – Hierarchical information extraction

• BWI [Freitag & Kushmerick 2000]
  – Boosting

43

Statistical Approach to IE

• Class-based language model [Brown et al. 1992] for Chinese NE – [COLING 2002]

• PAT-tree-based [SIGIR 1997]
  – Long, repeated patterns without length limitation
  – Space and time efficient
  – Incremental
  – http://pattree.openfoundry.org/

44

Class-based Language Model

• Types of classes
  – Person, location, organization, terms in the dictionary

• Number of classes = |V|+3 if the size of vocabulary is |V|

$(C^*, W^*) = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C)$

$P(C) = P(c_1 \cdots c_m) \approx P(c_1 \mid \langle s \rangle)\, P(c_2 \mid c_1, \langle s \rangle) \left[ \prod_{i=3}^{m} P(c_i \mid c_{i-1}, c_{i-2}) \right] P(\langle /s \rangle \mid c_m, c_{m-1})$

$P(W \mid C) = P(w_1 \cdots w_m \mid c_1 \cdots c_m) \approx \prod_{i=1}^{m} P(w_i \mid c_i)$

45

PAT-tree-based Approach

• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low-frequency n-grams tend to be discarded

• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low-frequency n-grams can be extracted

• SCPCD: a combination of the two

46

Association Measure

$SCP(w_1 \ldots w_n) = \frac{p(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n)} = \frac{freq(w_1 \ldots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$

$CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{freq(w_1 \ldots w_n)^2}$

$SCPCD(w_1 \ldots w_n) = SCP(w_1 \ldots w_n)\, CD(w_1 \ldots w_n) = \frac{LC(w_1 \ldots w_n)\, RC(w_1 \ldots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} freq(w_1 \ldots w_i)\, freq(w_{i+1} \ldots w_n)}$

where LC and RC are the left- and right-context counts of the n-gram (cf. the dependence on left- and right-adjacent words/characters on the previous slide).

47

Term Extraction Performance

Association Measure   Precision   Recall   Avg. R-P
CD                    68.1 %      5.9 %    37.0 %
SCP                   62.6 %      63.3 %   63.0 %
SCPCD                 79.3 %      78.2 %   78.7 %

Table 1. Extraction accuracy (precision, recall, and average recall-precision) of automatically extracted translation candidates using different methods.

48

Speed Performance

Table 2. Average speed performance of different term extraction methods.

Term Extraction Method                 Time for Preprocessing   Time for Extraction
LocalMaxs (Web queries)                0.87 s                   0.99 s
PAT-tree + LocalMaxs (Web queries)     2.30 s                   0.61 s
LocalMaxs (1,367 docs)                 63.47 s                  4,851.67 s
PAT-tree + LocalMaxs (1,367 docs)      840.90 s                 71.24 s
LocalMaxs (5,357 docs)                 47,247.55 s              350,495.65 s
PAT-tree + LocalMaxs (5,357 docs)      11,086.67 s              759.32 s

49

State of the Art Performance

• Named entity recognition
  – Person, Location, Organization, …
  – F1 in the high 80's or low- to mid-90's

• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in the 60's (events), 70's (facts), or 80's (attributes)

• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30 min) required for each site

50

Broader View

(The same pipeline diagram as "IE in Context", with Tokenize added to the IE step and five numbered markers pointing to "some other issues": Create ontology; Spider; Document collection; Filter by relevance; IE – Tokenize, Segment, Classify, Associate, Cluster; Label training data; Train extraction models; Load DB; Database; Query/Search; Data mine.)

51

Applications to Information Extraction

• LiveTrans
• LiveClassifier

52

LiveTrans: Cross-language Web Search

53

Web Mining Approach to Term Translation Extraction

• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html

(Diagram: the LiveTrans engine takes a source query, e.g. "Academia Sinica", collects anchor texts and search results from the Web, and returns target translations such as 中央研究院 / 中研院.)

54

National Palace Museum vs. 故宮博物院 – Search-Result Page

• Mixed-language characteristic in Chinese pages
• How to extract translation candidates?
• Which candidates to choose?

(Screenshot of a search-result page with noisy regions marked.)

55

Yahoo vs. 雅虎 -- Anchor-Text Set

• Anchor text (link text)
  – The descriptive text of a link on a Web page

• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    − Yahoo / 雅虎 / 야후
    − America / 美国 / アメリカ

• Anchor-text-set corpus
  – A collection of anchor-text sets

(Diagram: pages from America, Taiwan, China, Japan, and Korea link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", "美国雅虎", "雅虎搜尋引擎", "Yahoo! America", "야후-USA", "アメリカの Yahoo!".)

56

Term Translation Extraction from Different Resources

(Diagram: a source query, e.g. "National Palace Museum", is sent to a search engine and a Web spider; term extraction is applied to the search-result pages and to the anchor-text corpus; similarity estimation then selects target translations such as 國立故宮博物院, 故宮, 故宮博物院.)

57

IE Resources

• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC)
    • Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data

• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet

• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html
  – http://www.cs.umass.edu/~mccallum/ie

58

Text Mining Related Workshops

• Workshops in data mining conferences
  – KDD 2000, Workshop on Text Mining (TextKDD 2000)
  – ICDM 2001, Workshop on Text Mining (TextDM 2001)
  – PAKDD 2002, Workshop on Text Mining
  – SDM 2001–2003, 2006–2008, Workshop on Text Mining

• Workshops in machine learning conferences
  – ECML 1998, Workshop on Text Mining
  – ICML 1999, Workshop on Machine Learning in Text Data Analysis
  – ICML 2002, Workshop on Text Learning (TextML 2002)

• Others
  – RANLP 2005 Text Mining Workshop

59

Workshops on Machine Learning and IE

• AAAI 1999, Workshop on Machine Learning for Information Extraction

• ECAI 2000, Workshop on Machine Learning for Information Extraction

• IJCAI 2001, Workshop on Adaptive Text Extraction and Mining (ATEM 2001)

• ECML 2003, Workshop on Adaptive Text Extraction and Mining (ATEM 2003)

• AAAI 2004, Workshop on Adaptive Text Extraction and Mining (ATEM 2004)

• EACL 2006, Workshop on Adaptive Text Extraction and Mining (ATEM 2006)

60

Other Workshops

• Workshop on Text Mining and Link Analysis
  – TextLink 2007, at IJCAI 2007
  – TextLink 2003, at IJCAI 2003

• Workshops on Link Analysis
  – LinkKDD 2003–2006, at KDD 2003–2006
  – AAAI 2005 Workshop on Link Analysis

61

Applications in Digital Libraries

• OAI
• Unencoded Character Problem

62

Some Problems in Digital Libraries

• Variety in objects
  – Museums, libraries, …
  – Difficult to integrate metadata

• Ancient characters in archives
  – Difficult to input, display, distribute, …

63

OAI (Open Archives Initiative)

• Started in Oct. 1999
• OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), version 2.0 released in 2002
  – Service Provider
    • Harvesters
  – Data Provider
    • Repositories
  – Metadata
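For concreteness, an OAI-PMH harvest is just an HTTP GET with a verb parameter. Below is a minimal sketch of a harvester issuing a ListRecords request for Dublin Core records; the repository URL is a placeholder, and error handling and resumption tokens are omitted.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/oai"          # placeholder repository endpoint

def list_records(metadata_prefix="oai_dc"):
    """Issue an OAI-PMH ListRecords request and return the parsed XML root."""
    params = urllib.parse.urlencode({"verb": "ListRecords",
                                     "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{BASE_URL}?{params}") as resp:
        return ET.fromstring(resp.read())

# Example usage (against a real endpoint):
# root = list_records()
# ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
# for rec in root.findall(".//oai:record", ns):
#     print(rec.find(".//oai:identifier", ns).text)
```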

64

Service Provider vs. Data Provider

(Diagram: the user queries the Service Provider; the Service Provider sends requests to the Data Providers' repositories and receives responses.)

65

OAI-PMH vs. Z39.50

• What is the relationship between the OAI-PMH and other protocols such as Z39.50?

• The OAI technical framework is intentionally simple
  – Providing a low barrier for participants
  – An easy-to-implement and easy-to-deploy alternative, not intended to replace other approaches

• Protocols such as Z39.50 have more complete functionality
  – Session management, result sets, specification of predicates that filter the records returned
  – At the cost of increased difficulty of implementation

66

Why Dublin Core?

• Why does the protocol mandate a common metadata format (and why is that common format Dublin Core)?

• Mapping among multiple metadata formats
  – Creating services such as common search interfaces across heterogeneous metadata formats

• A less burdensome and ultimately more deployable solution is to require repositories to map to a simple and common metadata format

• The fifteen elements of Dublin Core serve as a de facto standard for simple cross-discipline metadata
  – Dublin Core Metadata Element Set (DCMES)

• Cooperation between the OAI and the Dublin Core Metadata Initiative (DCMI) has led to a common XML schema for unqualified Dublin Core, available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd

67

Unencoded Character Problem

• Large numbers of Chinese characters exist in different forms such as Bronze Script (金文) and Seal Script (小篆)

• People can better appreciate the long history of the Chinese character evolution process and the Chinese culture in general

• However, digitization of the heritage materials brings a big problem: these characters are not included in common character encodings in computers (the missing characters)

68

Example Unencoded Characters

Bronze Script (金文)   Seal Script (小篆)

69

Goal

• We intend to develop an integrated technology to facilitate easier processing of large amounts of missing characters

• This includes the input, representation, font generation, display, distribution, and search for all the missing characters

• We propose an effective composite approach to handle the formation and basic components of Chinese characters

70

Composite Approach to Unencoded Chinese Characters [JCDL 2005]

71

Advanced Topics

• Mobile Search
• LiveClassifier
• Concept Search

72

Mobile Search

• Introduction to Mobile Search
• Existing Services
• Google Mobile/SMS
• Issues

73

References

• R. Schusteritsch, S. Rao, and K. Rodden, Mobile Search with Text Messages: Designing the User Experience for Google SMS, Proceedings of CHI 2005, pp. 1777-1780 (poster).

• B. Miller, China’s Internet Portals and Content Providers Look to a Future Beyond Mobile Text Messaging, The Yankee Group Report, Feb. 2004.

• Communications of the ACM, Vol.48, No.7, Designing for the Mobile Devices, Jul. 2005.

74

Introduction to Mobile Search

• Internet users (as of 2003)
  – US: 170 million (population: 292 million)
  – China: 80 million (population: 1.3 billion)
  – Japan: 70 million (population: 126 million)

• Currently, there are more than two billion mobile phone users worldwide, which is more than three times the number of PC users

• More than a half-billion mobile phones sold each year (in 2004)

75

Internet and Mobile Users in China

76

SMS-based Mobile Internet

• SMS (Short Message Service)
  – 1 billion messages worldwide every day
  – Nearly 200 billion messages in China in 2003

• Content explodes
  – News alerts
  – Sports news
  – Weather
  – Special-interest information
    • Ex.: Yao Ming

• From SMS to search

77

Two Common Modes of Mobile Search

• Mobile Web browsing
• Text Messaging (SMS)

78

Google Mobile (1/2)

http://mobile.google.com/

(XHTML)

(WML)

79

Google Mobile (2/2)

(Images) (Mobile Web)

80

Google SMS

http://www.google.com/sms/

81

Google Maps for Mobile

http://www.google.com/gmm/

82

Existing Mobile Search Services

• Google (46645)
  – Google Mobile, http://mobile.google.com/
  – Google SMS, http://www.google.com/sms/
  – Google Maps for Mobile, http://www.google.com/gmm/

• Yahoo (92466)
  – Yahoo! Mobile, http://mobile.yahoo.com/
  – Yahoo! Go, http://go.yahoo.com

• 4INFO (44636)

• AOL
  – AOL Mobile Search, http://mobile.aolsearch.com/

• MSN
  – MSN Mobile, http://mobile.msn.com/

• Synfonic, UpSNAP, …

(The numbers in parentheses are SMS short codes.)

83

Google SMS

• Specialized information
  – Business listings
  – Residential listings
  – Product prices
  – Dictionary definitions
  – Area codes
  – Zip codes
  – …

• Google SMS attempts to return the desired information directly, rather than returning hyperlinks

84

Design Constraints

• Conceptual model
  – 1-to-1 communication vs. mobile search
  – Abbreviation interpretation

• Inherent limitations of mobile devices and SMS
  – Text entry is slow
  – Charge per message sent
  – Small, low-resolution screens
  – SMS message size limitation: 160 characters
  – No guarantee on the order in which messages are received
  – SMS interface: text-only, one-dimensional, with no menus, forms, or buttons to help users understand its affordances
  – Not possible for the system to offer instructions or a prompt
  – To get any feedback, the user must wait for a new message

85

Addressing Users’ Existing Conceptual Models

• Most users had some initial problems understanding how SMS could be used for search

• Users' existing conceptual model of Google searching also caused some initial problems

• Changes to message interpretation
  – "froogle", "shopping", "yellow pages", "white pages", "dictionary" combined with "help", "tips", "instructions"
  – "price" or "prices" vs. product search

86

• Communicating affordances
  – A minimal and prominent sequence of instructions, and the features considered most useful, are shown on the Google SMS home page
  – Sending the message "help" should return a concise set of instructions on how to use the service
  – Collaborate with the PR team to mention the "help" command and highlight the most important features
  – Work with the Marketing team to develop a wallet-size instruction card

87

Addressing the Limitations of Mobile Devices and SMS

• Order of messages
  – "1of2"

• Limited input technology
  – Query refinement is more difficult ("cofee")
  – Immediately returning search results for the closest match

• Limited output technology
  – No results, "help"
  – No more than 3 messages in response to each help request
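As an illustration of the "1of2" ordering fix under the 160-character SMS limit, here is a sketch that splits a long reply into numbered parts so the user can reassemble them even if messages arrive out of order. The exact header format Google SMS used is not specified in these slides; "NofM" is simply taken from the bullet above, and the example query text is made up.

```python
def split_sms(reply, limit=160):
    """Split a long reply into numbered parts, each within the SMS limit."""
    header_len = len("(XofX) ")              # rough reserve for the "1of2" header
    chunk = limit - header_len
    parts = [reply[i:i + chunk] for i in range(0, len(reply), chunk)]
    return [f"({i + 1}of{len(parts)}) {p}" for i, p in enumerate(parts)]

for msg in split_sms("pizza near mountain view ca: " + "A" * 200):
    print(len(msg), msg[:40])
```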

88

Issues in Mobile Search

• User interface for input
• Mobile search result output
• Context-aware, location-based service
• User preference

89

• To allow users to get updated or additional information with fewer keystrokes

• To transform the search experience found on a PC to a mobile device
  – Tiny screens, network bandwidth, …
  – Wireless Application Protocol (WAP) to shrink Web pages down to more manageable sizes
    • Google WML
  – Alternatively, SMS

• To provide users who are not working with a mouse or a keyboard with a simple way to enter queries
  – Shortcuts (for example, "w" for "weather", …)

• To integrate the search vendors’ local and mobile search functions

• To integrate commerce and mobile search applications

90

Possible Related Directions

• Context-aware Retrieval
• Ubiquitous Computing

– Communications of the ACM, Vol.48, No.3, The Disappearing Computer, Mar. 2005.

– Communications of the ACM, Vol.45, No.12, Issues and Challenges in Ubiquitous Computing, Dec. 2002.

• …

91

LiveClassifier

A system that creates classifiers through Web mining

92

LiveClassifier

Users create topic hierarchies and define classes/keywords

93

LiveClassifier

Web

Automatically extracted training data; no manually labeled data is required

Exploits the structure information inherent in the topic hierarchy for training

94

LiveClassifier

People

Place

Subjects

Sub-subjects

95

LiveClassifier

Classifying documents into classes

96

LiveClassifier

Classifying short texts into classes

97

LiveClassifier

98

LiveClassifier

99

100

Concept Search

• Conventional search

• Concept-level search

(Diagram: conventional keyword search for "researcher", "AI", and "Taiwan" only matches documents containing those exact words; concept-level search also finds an interesting document that expresses the same concepts through related terms such as "professor", "NTU", and "neural network".)

101

LiveTrans + LiveClassifier

102

Thanks for Your Attention!
