1
Text Retrieval and Applications – More Advanced Topics
J. H. Wang, May 20, 2008
2
Outline
• Text Mining and Information Extraction
  – Introduction to Text Mining
  – Methods of Information Extraction
  – Applications to Information Extraction
• Applications in Digital Libraries
  – OAI
  – Unencoded character problem
• Advanced Topics
  – Mobile Search
  – LiveClassifier
3
Text Mining and Information Extraction
• Introduction to Text Mining
• Methods of Information Extraction
• Applications to Information Extraction
4
References
• Marti Hearst, What is Text Mining?, http://www.sims.berkeley.edu/~hearst/text-mining.html
• Marti Hearst, Untangling Text Data Mining, ACL 1999.
• Douglas E. Appelt and David J. Israel, Introduction to Information Extraction Technology, IJCAI 1999 tutorial.
• Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI 1999 Workshop on Machine Learning for Information Extraction.
• Andrew McCallum and William Cohen, Information Extraction from the World Wide Web, KDD 2003 tutorial (earlier version: NIPS 2002 tutorial).
• Hamish Cunningham, Named Entity Recognition, RANLP 2003 tutorial.
5
Text Mining
• Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources [Marti Hearst]
6
Text Mining vs. Web Search
• In search, the user is typically looking for something that is already known and has been written by someone else
• In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down
7
Text Mining vs. Data Mining
• Data mining tries to find interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from large databases
• In text mining, the patterns are extracted from natural language text rather than from structured databases of facts
8
Text Mining vs. Computational Linguistics (or NLP)
• NLP is making a lot of progress in doing small subtasks in text analysis
  – Word segmentation, part-of-speech tagging, word sense disambiguation, …
• Text understanding vs. mining
9
Text Mining vs. Information Extraction
• There are programs that can, with reasonable accuracy, extract information from text with somewhat regularized structure
• Discovering new knowledge vs. showing trends
10
What is “Information Extraction”?
Filling slots in a database from sub-segments of text.
As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
11
What is “Information Extraction”?
Filling slots in a database from sub-segments of text.
As a task:
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
12
What is “Information Extraction”?
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka “named entity extraction”
16
IE in Context
(Pipeline diagram) Spider the document collection → filter by relevance → IE (segment, classify, associate, cluster) → load DB → query/search and data mine. Supporting steps: create ontology, label training data, train extraction models.
17
Information Extraction Tasks
• Information Extraction (IE) pulls facts and structured information from the content of large text collections
  – Unstructured or semi-structured text
• Federal-government-funded research
  – MUC: Message Understanding Conferences (1987–1998), DARPA
  – TIPSTER (1991–1998), DARPA
  – ACE: Automatic Content Extraction (1999–), NIST
    • http://www.nist.gov/speech/tests/ace/
    • http://www.ldc.upenn.edu/Projects/ACE/
18
Landscape of IE Techniques: Models
Any of these models can be used to capture words, formatting, or both. (Running example: “Abraham Lincoln was born in Kentucky.”)
• Lexicons: test whether a token is a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify pre-segmented candidates: a classifier assigns a class to each candidate segment
• Sliding window: a classifier labels the window contents, trying alternate window sizes
• Boundary models: classifiers detect BEGIN and END boundaries of entities
• Context-free grammars: find the most likely parse of the sentence
• Finite state machines: find the most likely state sequence
19
Methods of Information Extraction – Three Approaches
• NLP (linguistic) approach
  – Named entity recognition
• Extraction pattern/template approach
  – Wrapper generation/induction
• Statistical approach
  – Class-based language model
  – PAT-tree-based
20
NLP (Linguistic) Approach to IE
• MUC-7 tasks
• Named entity recognition
  – Preprocessing
  – Two kinds of approaches
  – Baseline
  – Rule-based approach
  – Learning-based approach
21
MUC-7 Tasks
• NE: Named Entity recognition and typing
• CO: co-reference resolution
• TE: Template Elements (attributes)
• TR: Template Relations
• ST: Scenario Templates
22
An Example
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
• TE: the rocket is "shiny red" and Head's "brainchild".
• TR: Dr. Head works for We Build Rockets Inc.
• ST: a rocket launching event occurred with the various participants.
23
Performance Levels
• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
• CO: 60–70% resolution
• TE: 80%
• TR: 75–80%
• ST: 60% (but human performance may be only 80%)
24
What are Named Entities?
• NER involves identifying proper names in text and classifying them into a set of predefined categories of interest
  – Person names
  – Organizations (companies, government organizations, committees, etc.)
  – Locations (cities, countries, rivers, etc.)
  – Date and time expressions
25
What are Named Entities (2)
• Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
• MUC-7 entity definition guidelines [Chinchor’97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
26
Problems in NE
• Variation of NEs – e.g. John Smith, Mr Smith, John
• Ambiguity of NE types:
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
• More complex problems in NE
  – Issues of style, structure, domain, genre, etc.
  – Punctuation, spelling, spacing, formatting, …
27
The Evaluation Metric
• Precision vs. recall
• F-Measure = (β² + 1)PR / (β²R + P)   [van Rijsbergen 75]
  – β reflects the weighting between precision and recall; typically β = 1
• We may also want to take account of partially correct answers:
  – Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
  – Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
  – Why: NE boundaries are often misplaced, so some results are only partially correct
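As an illustration (not part of the original slides), the metrics above can be sketched in a few lines; the evaluation counts in the example are hypothetical.

```python
def f_measure(p, r, beta=1.0):
    """F-measure as on the slide: (beta^2 + 1) * P * R / (beta^2 * R + P)."""
    denom = beta ** 2 * r + p
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0

def lenient_precision(correct, partial, incorrect):
    """Precision giving half credit to partially correct answers."""
    denom = correct + incorrect + partial
    return (correct + 0.5 * partial) / denom if denom else 0.0

def lenient_recall(correct, partial, missing):
    """Recall giving half credit to partially correct answers."""
    denom = correct + missing + partial
    return (correct + 0.5 * partial) / denom if denom else 0.0

# Hypothetical NE evaluation counts:
p = lenient_precision(correct=80, partial=10, incorrect=10)  # 0.85
r = lenient_recall(correct=80, partial=10, missing=20)
f1 = f_measure(p, r)
```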
28
Pre-processing for NE Recognition
• Tokenization
  – Word segmentation
• Lexical or morphological processing
  – Part-of-speech tagging
  – Word sense tagging
• Syntactic analysis
  – Parsing
• Domain-specific modules
  – Coreference
  – Merging partial results
29
Two kinds of NE approaches
Knowledge Engineering
• Rule-based
• Developed by experienced language engineers; linguistic resources required
• Makes use of human intuition
• Requires only a small amount of training data
• Development can be very time-consuming
• Good performance
• Some changes may be hard to accommodate

Learning Systems
• Use statistics or other machine learning
• Developers do not need NE expertise
• Domain independent
• Requires large amounts of annotated training data, which may be difficult to obtain
• Some changes may require re-annotation of the entire training corpus
• Annotators are cheap (but you get what you pay for!)
30
Baseline: List Lookup Approach
• A system that recognizes only entities stored in its lists (gazetteers)
  – Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
  – Location lists: US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g. [Manov03])
  – Automatic collection from annotated training data
• Advantages: simple, fast, language independent, easy to retarget (just create lists)
• Disadvantages: impossible to enumerate all names; collection and maintenance of lists; cannot deal with name variants; cannot resolve ambiguity
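A minimal sketch of the list-lookup baseline (not from the slides); the tiny lists here stand in for real gazetteers.

```python
# Illustrative gazetteers; real systems use lists with millions of names.
GAZETTEERS = {
    "LOCATION": {"Alabama", "Alaska", "Wisconsin", "Wyoming"},
    "ORGANIZATION": {"Microsoft", "Free Software Foundation"},
}

def list_lookup(tokens):
    """Tag each token that appears in a gazetteer; everything else is 'O'."""
    tags = []
    for tok in tokens:
        label = "O"
        for entity_type, names in GAZETTEERS.items():
            if tok in names:
                label = entity_type
                break
        tags.append(label)
    return tags

list_lookup(["Abraham", "Lincoln", "was", "born", "in", "Alabama"])
# "Alabama" is tagged LOCATION; "Abraham Lincoln" is missed, illustrating
# the disadvantage above: lists can never enumerate all names.
```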
31
Rule-based: Shallow Parsing Approach (Internal Structure)
• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
  • Cap. Word + {City, Forest, Center, River} – e.g. Sherwood Forest
  • Cap. Word + {Street, Boulevard, Avenue, Crescent, Road} – e.g. Portobello Street
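The internal-evidence rules above map naturally onto regular expressions; a sketch (the keyword lists mirror the slide, while real systems store many more indicators):

```python
import re

# "Cap. Word + {keyword}" rules from the slide, as regexes.
NATURAL_LOC = re.compile(r"\b([A-Z][a-z]+) (City|Forest|Center|River)\b")
ADDRESS_LOC = re.compile(r"\b([A-Z][a-z]+) (Street|Boulevard|Avenue|Crescent|Road)\b")

def find_locations(text):
    """Return all location candidates matched by the internal-structure rules."""
    hits = [m.group(0) for m in NATURAL_LOC.finditer(text)]
    hits += [m.group(0) for m in ADDRESS_LOC.finditer(text)]
    return hits

find_locations("Meet me on Portobello Street, near Sherwood Forest.")
# -> ['Sherwood Forest', 'Portobello Street']
```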
32
Problems with the Shallow Parsing Approach
• Ambiguously capitalized words (e.g. first word in a sentence)
  – [All American Bank] vs. All [State Police]
• Semantic ambiguity
  – "John F. Kennedy" = airport (location)
  – "Philip Morris" = organization
• Structural ambiguity
  – [Cable and Wireless] vs. [Microsoft] and [Dell]
  – [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
33
Shallow Parsing Approach with Context
• Use of context-based patterns is helpful in ambiguous cases
  – "David Walton" and "Goldman Sachs" are indistinguishable in isolation
  – But given the phrase "David Walton of Goldman Sachs", with the Person entity "David Walton" already recognized, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly
34
Examples of Context Patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
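A sketch of how one such context pattern ("[Person] of [Organization]") can classify an unknown capitalized phrase once the Person entity is known; the known-person set and regex are simplified illustrations, not a production NE grammar.

```python
import re

# Entities already recognized by an earlier NE pass (illustrative).
KNOWN_PERSONS = {"David Walton"}

# "[Person] of [Organization]": two capitalized phrases joined by "of".
PERSON_OF_ORG = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) of "
    r"(?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def classify_by_context(text):
    """Return (person, organization) pairs found via the context pattern."""
    pairs = []
    for m in PERSON_OF_ORG.finditer(text):
        if m.group("person") in KNOWN_PERSONS:
            pairs.append((m.group("person"), m.group("org")))
    return pairs

classify_by_context("A report by David Walton of Goldman Sachs said so.")
# -> [('David Walton', 'Goldman Sachs')]
```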
35
Rule-based Examples
• FACILE – used in MUC-7 [Black et al 98]
• ANNIE – part of GATE, Sheffield's open-source infrastructure for language processing
• Gazetteer lists for rule-based NE
  – Internal location indicators – e.g. {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
  – Internal organization indicators – e.g. company designators {GmbH, Ltd, Inc, …}
36
Using Co-reference to Classify Ambiguous NEs
• Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
37
Machine Learning Approaches
• ML approaches frequently break the NE task down into two parts:
  – Recognizing the entity boundaries
  – Classifying the entities into the NE categories
• Example approaches
  – IdentiFinder [Bikel et al 99] (Hidden Markov Models)
  – MENE [Borthwick et al 98], combining rule-based and ML NE (Maximum Entropy)
  – NE recognition without gazetteers [Mikheev et al 99], combining rule-based grammars and statistical (MaxEnt) models
  – Fine-grained classification of NEs [Fleischman 02]
    • Ex: Person classification into 8 sub-categories – athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police
38
ML Approaches to Named Entity Identification
• Boosting
• Bootstrapping
• Class-based Language Model
• Conditional Markov Model
• Decision Tree
• Hidden Markov Model
• Maximum Entropy Model
• Memory-based Learning
• Stacking
• Support Vector Machine
• Transformation-based Learning
• Voted Perceptron
• Others
39
Extraction Pattern/Template Approach
• Types of extraction patterns (rules)
  – Syntactic/semantic constraints
  – Delimiter-based
  – Combination of both
• Types of documents to be extracted
  – Semi-structured documents: Web pages, …
  – Unstructured documents: free text
40
IE from Free Text
• “The parliament was bombed by the guerrillas.”
• AutoSlog [Riloff 1993]
• LIEP [Huffman 1996]
• PALKA [Kim & Moldovan 1995]
• CRYSTAL [Soderland et al. 1995]
• CRYSTAL+Webfoot [Soderland 1997]
• HASTEN [Krupka 1995]
41
IE from Online Documents
• WHISK [Soderland 1999]
  – A special type of regular expression
• RAPIER [Califf & Mooney 1997]
  – Robust Automated Production of Information Extraction Rules
• SRV [Freitag 1998]
  – First-order logic
42
Wrapper Induction Systems
• Wrapper: a procedure for extracting a particular resource's content
• WIEN [Kushmerick, Weld & Doorenbos 1997]
  – First wrapper induction system
• SoftMealy [Hsu & Dung 1998]
  – Finite State Transducer (FST)
• STALKER [Muslea, Minton & Knoblock 1999]
  – Hierarchical information extraction
• BWI [Freitag & Kushmerick 2000]
  – Boosting
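To make the wrapper idea concrete, here is a minimal sketch in the spirit of WIEN's left-right (LR) wrappers: each field is extracted by a pair of delimiter strings. The HTML snippet and delimiters are illustrative, not from a real site.

```python
def lr_extract(page, rules):
    """LR wrapper: rules is a list of (field, left_delim, right_delim);
    the page is scanned left to right, one record per pass over the rules."""
    records, pos = [], 0
    while True:
        record = {}
        for field, left, right in rules:
            start = page.find(left, pos)
            if start == -1:          # no more records
                return records
            start += len(left)
            end = page.find(right, start)
            record[field] = page[start:end]
            pos = end
        records.append(record)

# Hypothetical country/phone-code listing page:
page = "<li><b>Congo</b> <i>242</i></li><li><b>Spain</b> <i>34</i></li>"
rules = [("country", "<b>", "</b>"), ("code", "<i>", "</i>")]
lr_extract(page, rules)
# -> [{'country': 'Congo', 'code': '242'}, {'country': 'Spain', 'code': '34'}]
```

Wrapper induction systems learn such delimiter pairs automatically from labeled sample pages instead of having a human write them.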
43
Statistical Approach to IE
• Class-based language model [Brown et al. 1992] for Chinese NE [COLING 2002]
• PAT-tree-based [SIGIR 1997]
  – Long, repeated patterns without length limitation
  – Space and time efficient
  – Incremental
  – http://pattree.openfoundry.org/
44
Class-based Language Model
• Types of classes
  – Person, location, organization, terms in dictionary
• Number of classes = |V| + 3 if the size of the vocabulary is |V|

$$(C^*, W^*) = \arg\max_{C,W} P(C, W) = \arg\max_{C,W} P(W \mid C)\, P(C)$$

$$P(C) = P(c_1 \cdots c_m) = P(c_1 \mid s)\, P(c_2 \mid c_1, s) \left\{ \prod_{i=3}^{m} P(c_i \mid c_{i-2}, c_{i-1}) \right\} P(/s \mid c_{m-1}, c_m)$$

$$P(W \mid C) = P(w_1 \cdots w_m \mid c_1 \cdots c_m) = \prod_{i=1}^{m} P(w_i \mid c_i)$$

(s and /s denote the sentence-boundary markers.)
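As an illustration (not from the slides), scoring one candidate class sequence under the class trigram model above looks as follows; all probabilities here are hypothetical toy values, and the two "<s>" padding symbols approximate the boundary terms of the formula.

```python
import math

# Hypothetical toy probabilities; "<s>"/"</s>" mark sentence boundaries.
CLASS_TRIGRAM = {  # P(c_i | c_{i-2}, c_{i-1})
    ("<s>", "<s>", "PER"): 0.3,
    ("<s>", "PER", "ORG"): 0.2,
    ("PER", "ORG", "</s>"): 0.4,
}
EMISSION = {  # P(w_i | c_i)
    ("Gates", "PER"): 0.01,
    ("Microsoft", "ORG"): 0.05,
}

def score(words, classes):
    """log P(W, C) = log P(C) + log P(W | C) for one candidate labeling."""
    padded = ["<s>", "<s>"] + classes + ["</s>"]
    logp = 0.0
    for i in range(2, len(padded)):        # class trigram chain
        logp += math.log(CLASS_TRIGRAM[(padded[i - 2], padded[i - 1], padded[i])])
    for w, c in zip(words, classes):       # word emissions
        logp += math.log(EMISSION[(w, c)])
    return logp

score(["Gates", "Microsoft"], ["PER", "ORG"])
```

The decoder then takes the argmax of this score over all candidate segmentations and class sequences.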
45
PAT-tree-based Approach
• SCP (Symmetric Conditional Probability)
  – Cohesion holding the words together
  – Low-frequency n-grams tend to be discarded
• CD (Context Dependency)
  – Dependence on the left- or right-adjacent word/character
  – Low-frequency n-grams can be extracted
• SCPCD: a combination of the two
46
Association Measure

$$\mathrm{SCP}(w_1 \cdots w_n) = \frac{p(w_1 \cdots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \cdots w_i)\, p(w_{i+1} \cdots w_n)} = \frac{\mathit{freq}(w_1 \cdots w_n)^2}{\frac{1}{n-1} \sum_{i=1}^{n-1} \mathit{freq}(w_1 \cdots w_i)\, \mathit{freq}(w_{i+1} \cdots w_n)}$$

$$\mathrm{CD}(w_1 \cdots w_n) = \frac{\mathrm{LC}(w_1 \cdots w_n)\, \mathrm{RC}(w_1 \cdots w_n)}{\mathit{freq}(w_1 \cdots w_n)^2}$$

$$\mathrm{SCPCD}(w_1 \cdots w_n) = \mathrm{SCP}(w_1 \cdots w_n)\, \mathrm{CD}(w_1 \cdots w_n) = \frac{\mathrm{LC}(w_1 \cdots w_n)\, \mathrm{RC}(w_1 \cdots w_n)}{\frac{1}{n-1} \sum_{i=1}^{n-1} \mathit{freq}(w_1 \cdots w_i)\, \mathit{freq}(w_{i+1} \cdots w_n)}$$

(LC and RC denote the numbers of distinct words adjacent to the left and right of the n-gram, respectively.)
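As a sketch (not from the slides), the SCP, CD, and SCPCD measures can be computed directly from an n-gram frequency table; the counts below are hypothetical.

```python
def scp(freq, ngram):
    """Symmetric Conditional Probability of w1..wn from frequency counts."""
    n = len(ngram)
    avg = sum(freq[ngram[:i]] * freq[ngram[i:]] for i in range(1, n)) / (n - 1)
    return freq[ngram] ** 2 / avg

def cd(freq, lc, rc, ngram):
    """Context Dependency: distinct left/right neighbor counts over freq^2."""
    return lc[ngram] * rc[ngram] / freq[ngram] ** 2

def scpcd(freq, lc, rc, ngram):
    """Combined measure: SCP * CD."""
    return scp(freq, ngram) * cd(freq, lc, rc, ngram)

# Hypothetical counts for the bigram ("palace", "museum"):
freq = {("palace",): 50, ("museum",): 40, ("palace", "museum"): 30}
lc = {("palace", "museum"): 12}   # distinct words seen left of the bigram
rc = {("palace", "museum"): 15}   # distinct words seen right of it
scpcd(freq, lc, rc, ("palace", "museum"))
# -> 0.09  (SCP = 0.45, CD = 0.2)
```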
47
Term Extraction Performance
Association Measure   Precision   Recall   Avg. R-P
CD                    68.1 %      5.9 %    37.0 %
SCP                   62.6 %      63.3 %   63.0 %
SCPCD                 79.3 %      78.2 %   78.7 %

Table 1. Extraction accuracy (precision, recall, and average recall-precision) of auto-extracted translation candidates using different methods.
48
Speed Performance
Table 2. Average speed performance of different term extraction methods.

Term Extraction Method             Time for Preprocessing   Time for Extraction
LocalMaxs (Web Queries)            0.87 s                   0.99 s
PATtree+LocalMaxs (Web Queries)    2.30 s                   0.61 s
LocalMaxs (1,367 docs)             63.47 s                  4,851.67 s
PATtree+LocalMaxs (1,367 docs)     840.90 s                 71.24 s
LocalMaxs (5,357 docs)             47,247.55 s              350,495.65 s
PATtree+LocalMaxs (5,357 docs)     11,086.67 s              759.32 s
49
State of the Art Performance
• Named entity recognition
  – Person, Location, Organization, …
  – F1 in the high 80's or low- to mid-90's
• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in the 60's (events), 70's (facts), or 80's (attributes)
• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30 min) required for each site
50
Broader View
(Pipeline diagram, extending "IE in Context" with five numbered issue points) Spider the document collection → filter by relevance → tokenize → IE (segment, classify, associate, cluster) → load DB → query/search and data mine. Supporting steps: create ontology, label training data, train extraction models. Some other issues arise at each numbered stage.
51
Applications to Information Extraction
• LiveTrans• LiveClassifier
52
LiveTrans: Cross-language Web Search
53
Web Mining Approach to Term Translation Extraction
• LiveTrans: http://livetrans.iis.sinica.edu.tw/lt.html
(Diagram) The LiveTrans engine takes a source query (e.g. "Academia Sinica"), collects anchor texts and search results from the Web, and returns target translations (中央研究院 / 中研院).
54
National Palace Museum vs. 故宮博物院 – Search-Result Page
• Mixed-language characteristic of Chinese pages
• How to extract translation candidates?
• Which candidates to choose? (Many candidates are noise.)
55
Yahoo vs. 雅虎 -- Anchor-Text Set
• Anchor text (link text)
  – The descriptive text of a link on a Web page
• Anchor-text set
  – A set of anchor texts pointing to the same page (URL)
  – Multilingual translations
    − Yahoo / 雅虎 / 야후
    − America / 美国 / アメリカ
• Anchor-text-set corpus
  – A collection of anchor-text sets

(Diagram) Pages from Taiwan, China, Japan, Korea, and the USA link to http://www.yahoo.com with anchor texts such as "Yahoo Search Engine", 美国雅虎, 雅虎搜尋引擎, "Yahoo! America", 야후-USA, and アメリカの Yahoo!.
56
Term Translation Extraction from Different Resources
(Diagram) Two extraction pipelines: (1) a search engine returns search-result pages for the source query, and term extraction plus similarity estimation yield the target translations; (2) a Web spider builds an anchor-text corpus, to which the same term extraction and similarity estimation are applied. Example: "National Palace Museum" → 國立故宮博物院, 故宮, 故宮博物院.
57
IE Resources
• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC)
    • Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data
• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet
• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html
  – http://www.cs.umass.edu/~mccallum/ie
58
Text Mining Related Workshops
• Workshops at data mining conferences
  – KDD 2000, Workshop on Text Mining (TextKDD 2000)
  – ICDM 2001, Workshop on Text Mining (TextDM 2001)
  – PAKDD 2002, Workshop on Text Mining
  – SDM 2001-2003, 2006-2008, Workshop on Text Mining
• Workshops at machine learning conferences
  – ECML 1998, Workshop on Text Mining
  – ICML 1999, Workshop on Machine Learning in Text Data Analysis
  – ICML 2002, Workshop on Text Learning (TextML 2002)
• Others
  – RANLP 2005 Text Mining Workshop
59
Workshops on Machine Learning and IE
• AAAI 1999, Workshop on Machine Learning for Information Extraction
• ECAI 2000, Workshop on Machine Learning for Information Extraction
• IJCAI 2001, Workshop on Adaptive Text Extraction and Mining (ATEM 2001)
• ECML 2003, Workshop on Adaptive Text Extraction and Mining (ATEM 2003)
• AAAI 2004, Workshop on Adaptive Text Extraction and Mining (ATEM 2004)
• EACL 2006, Workshop on Adaptive Text Extraction and Mining (ATEM 2006)
60
Other Workshops
• Workshop on Text Mining and Link Analysis
  – TextLink 2007, at IJCAI 2007
  – TextLink 2003, at IJCAI 2003
• Workshops on Link Analysis
  – LinkKDD 2003-2006, at KDD 2003-2006
  – AAAI 2005 Workshop on Link Analysis
61
Applications in Digital Libraries
• OAI• Unencoded Character Problem
62
Some Problems in Digital Libraries
• Variety in objects
  – Museums, libraries, …
  – Difficult to integrate metadata
• Ancient characters in archives
  – Difficult to input, display, distribute, …
63
OAI (Open Archives Initiatives)
• Started in Oct. 1999
• OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), version 2.0 of 2002
  – Service Provider
    • Harvesters
  – Data Provider
    • Repositories
  – Metadata
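As a sketch of how a harvester talks to a repository, OAI-PMH requests are ordinary HTTP requests built from a verb plus arguments. The repository base URL below is hypothetical; the verbs and the metadataPrefix argument come from OAI-PMH v2.0.

```python
from urllib.parse import urlencode

BASE = "http://repository.example.org/oai"  # hypothetical data provider

def oai_request(verb, **args):
    """Compose an OAI-PMH request URL for the given protocol verb."""
    return BASE + "?" + urlencode({"verb": verb, **args})

# A harvester (service provider) asking a repository for Dublin Core records:
oai_request("ListRecords", metadataPrefix="oai_dc")
# -> 'http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc'
oai_request("Identify")
```

The repository answers with an XML response, which is the low implementation barrier mentioned on the next slides.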
64
Service Provider vs. Data Provider
(Diagram) The user queries a Service Provider; the Service Provider sends requests to Data Providers and receives responses from their repositories.
65
OAI-PMH vs. Z39.50
• What is the relationship between the OAI-PMH and other protocols such as Z39.50?
• The OAI technical framework is intentionally simple
  – Provides a low barrier for participants
  – An easy-to-implement and easy-to-deploy alternative, not intended to replace other approaches
• Protocols such as Z39.50 have more complete functionality
  – Session management, result sets, specification of predicates that filter the records returned
  – At the cost of greater implementation difficulty and expense
66
Why Dublin Core?
• Why does the protocol mandate a common metadata format (and why is that common format Dublin Core)?
• Mapping among multiple metadata formats
  – Enables services such as common search interfaces across heterogeneous metadata formats
• A less burdensome and ultimately more deployable solution is to require repositories to map to a simple, common metadata format
• The fifteen elements of Dublin Core serve as a de facto standard for simple cross-discipline metadata
  – Dublin Core Metadata Element Set (DCMES)
• Cooperation between the OAI and the Dublin Core Metadata Initiative (DCMI) has led to a common XML schema for unqualified Dublin Core, available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd
67
Unencoded Character Problem
• Large numbers of Chinese characters exist in historical forms such as Bronze Script (金文) and Seal Script (小篆)
• Through them, people can better appreciate the long evolution of Chinese characters and Chinese culture in general
• However, digitizing these heritage materials poses a big problem: such characters are not included in common character encodings (the missing characters)
68
Example Unencoded Characters
Bronze Script (金文 ) Seal Script ( 小篆 )
69
Goal
• We intend to develop an integrated technology to facilitate easier processing of large amounts of missing characters
• This includes the input, representation, font generation, display, distribution, and search for all the missing characters
• We propose an effective composite approach to handle the formation and basic components of Chinese characters
70
Composite Approach to Unencoded Chinese Characters [JCDL 2005]
71
Advanced Topics
• Mobile Search• LiveClassifier• Concept Search
72
Mobile Search
• Introduction to Mobile Search
• Existing Services
• Google Mobile/SMS
• Issues
73
References
• R. Schusteritsch, S. Rao, and K. Rodden, Mobile Search with Text Messages: Designing the User Experience for Google SMS, Proceedings of CHI 2005, pp. 1777-1780 (poster).
• B. Miller, China’s Internet Portals and Content Providers Look to a Future Beyond Mobile Text Messaging, The Yankee Group Report, Feb. 2004.
• Communications of the ACM, Vol.48, No.7, Designing for the Mobile Devices, Jul. 2005.
74
Introduction to Mobile Search
• Internet users (as of 2003)
  – US: 170 million (population: 292 million)
  – China: 80 million (population: 1.3 billion)
  – Japan: 70 million (population: 126 million)
• Currently there are more than two billion mobile phone users worldwide – more than three times the number of PC users
• More than half a billion mobile phones are sold each year (as of 2004)
75
Internet and Mobile Users in China
76
SMS-based Mobile Internet
• SMS (Short Message Service)
  – 1 billion messages worldwide every day
  – Nearly 200 billion messages in China in 2003
• Content explodes
  – News alerts
  – Sports news
  – Weather
  – Special-interest information
    • Ex.: Yao Ming
• From SMS to search
77
Two Common Modes of Mobile Search
• Mobile Web browsing• Text Messaging (SMS)
78
Google Mobile (1/2)
http://mobile.google.com/
(XHTML)
(WML)
79
Google Mobile (2/2)
(Images) (Mobile Web)
80
Google SMS
http://www.google.com/sms/
81
Google Maps for Mobile
http://www.google.com/gmm/
82
Existing Mobile Search Services
• Google (46645)
  – Google Mobile, http://mobile.google.com/
  – Google SMS, http://www.google.com/sms/
  – Google Maps for Mobile, http://www.google.com/gmm/
• Yahoo (92466)
  – Yahoo! Mobile, http://mobile.yahoo.com/
  – Yahoo! Go, http://go.yahoo.com
• 4INFO (44636)
• AOL
  – AOL Mobile Search, http://mobile.aolsearch.com/
• MSN
  – MSN Mobile, http://mobile.msn.com/
• Synfonic, UpSNAP, …

(Numbers in parentheses are SMS short codes.)
83
Google SMS
• Specialized information
  – Business listings
  – Residential listings
  – Product prices
  – Dictionary definitions
  – Area codes
  – Zip codes
  – …
• Google SMS attempts to return the desired information directly, rather than returning hyperlinks
84
Design Constraints
• Conceptual model
  – 1-to-1 communication vs. mobile search
  – Abbreviation interpretation
• Inherent limitations of mobile devices and SMS
  – Text entry is slow
  – Charge per message sent
  – Small, low-resolution screens
  – SMS message size limit: 160 characters
  – No guarantee of receiving order
  – The SMS interface is text-only and one-dimensional, with no menus, forms, or buttons to help users understand its affordances
  – It is not possible for the system to offer instructions or a prompt
  – To get any feedback, the user must wait for a new message
85
Addressing Users’ Existing Conceptual Models
• Most users had some initial problems understanding how SMS could be used for search
• Users' existing conceptual model of Google searching also caused some initial problems
• Changes to message interpretation
  – "froogle", "shopping", "yellow pages", "white pages", "dictionary" combined with "help", "tips", "instructions"
  – "price" or "prices" vs. product search
86
• Communicating affordances
  – A minimal, prominent sequence of instructions and the features considered most useful are shown on the Google SMS home page
  – Sending the message "help" returns a concise set of instructions on how to use the service
  – Collaborate with the PR team to mention the "help" command and highlight the most important features
  – Work with the Marketing team to develop a wallet-size instruction card
87
Addressing the Limitations of Mobile Devices and SMS
• Order of messages
  – "1of2" prefixes
• Limited input technology
  – Query refinement is more difficult (e.g. "cofee")
  – Immediately return search results for the closest match
• Limited output technology
  – No results: "help"
  – No more than 3 messages in response to each help request
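A sketch (not Google's actual implementation) of packing a long reply into 160-character SMS messages with "1of2"-style sequence prefixes, since delivery order is not guaranteed; the sample query text is hypothetical.

```python
SMS_LIMIT = 160

def to_sms(text):
    """Split a reply into numbered parts that each fit in one SMS."""
    header_len = len("9of9 ")            # assume fewer than 10 parts
    body = SMS_LIMIT - header_len
    chunks = [text[i:i + body] for i in range(0, len(text), body)]
    n = len(chunks)
    if n == 1:
        return chunks                    # short replies need no prefix
    return [f"{i + 1}of{n} {c}" for i, c in enumerate(chunks)]

# Hypothetical business-listing reply:
msgs = to_sms("Hypothetical Pizza Co, 123 Example Ave, Palo Alto CA ... " * 5)
# every message fits within the 160-character limit
```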
88
Issues in Mobile Search
• User interface for input
• Mobile search result output
• Context-aware, location-based services
• User preferences
89
• To allow users to get updated or additional information with fewer keystrokes
• To transform the search experience found on a PC to a mobile device
  – Tiny screens, network bandwidth, …
  – Wireless Application Protocol (WAP) to shrink Web pages down to more manageable sizes
    • Google WML
  – Alternatively, SMS
• To provide users who are not working with a mouse or keyboard a simple way to enter queries
  – Shortcuts (for example, "w" for "weather" …)
• To integrate the search vendors' local and mobile search functions
• To integrate commerce and mobile search applications
90
Possible Related Directions
• Context-aware retrieval
• Ubiquitous computing
  – Communications of the ACM, Vol. 48, No. 3, The Disappearing Computer, Mar. 2005.
  – Communications of the ACM, Vol. 45, No. 12, Issues and Challenges in Ubiquitous Computing, Dec. 2002.
• …
91
LiveClassifier
A system that creates classifiers through Web mining
92
LiveClassifier
Users create topic hierarchies and define classes/keywords
93
LiveClassifier
Training data are auto-extracted from the Web; no manually labeled data need be provided. The structure inherent in the topic hierarchy is exploited for training.
94
LiveClassifier
Example topic hierarchy levels: People, Place, Subjects, Sub-subjects
95
LiveClassifier
Classifying documents into classes
96
LiveClassifier
Classifying short texts into classes
97
LiveClassifier
98
LiveClassifier
99
100
Concept Search
• Conventional search: keyword search for "researcher" AND "AI" AND "Taiwan" – the document must contain the exact keywords
• Concept-level search: a document mentioning "professor", "NTU", and "neural network" matches the concepts researcher, Taiwan, and AI, and is therefore an interesting document even though the exact keywords never appear
101
LiveTrans + LiveClassifier
102
Thanks for Your Attention!