1999. Yu.Demchenko. TERENA
Multilinguality in Indexing, Searching and Metadata
Slide 2_1
Multilinguality and
cross-language searching
Multilingual aspects
in Indexing, Searching and Metadata (Resource Description)
Slide2_2
Multilingual aspects in Indexing, Searching and Metadata
- IETF Model of Multilingual support in Internet Applications
  - Electronic Mail
  - Interactive applications
- Charset and Language tagging
  - MIME types
  - XML Language and Charset tagging
  - DC language definition
- Metadata and RDF
  - DC.Language
- Existing solutions
  - TUSTEP
  - Search Engines and Subject Gateways
- Multilingual framework for the REIS Project
Slide2_3
IETF Model of Multilingual support in Internet Applications
- Electronic Mail
  - Language
  - Character Encoding Scheme
  - Transfer Encoding Scheme
- Interactive applications
  - WWW: HTTP/HTML
    <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  - XML/DOM
  - LDAP and X.500 (?)
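
The META tag above announces the character encoding inside the document itself. As a minimal sketch of how an indexer might read that announcement, the following uses only the Python standard library. The class name `MetaCharsetParser` and the helper `declared_charset` are illustrative names, not part of any tool mentioned in these slides.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the charset given in <META http-equiv="Content-Type" ...>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "content-type":
            # Reuse the stdlib MIME header parser for "type/subtype; charset=..."
            header = Message()
            header["Content-Type"] = attr.get("content", "")
            self.charset = header.get_content_charset()


def declared_charset(html_text):
    """Return the charset declared via META http-equiv, or None if absent."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


page = '<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">'
print(declared_charset(page))
```

In practice the charset parameter of the HTTP Content-Type header, when present, would be consulted as well; this sketch only covers the in-document declaration.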
Slide2_4
XML:Language and Charset tagging
- A character is the atomic unit of text
  - All ISO 10646 characters + TAB, CR, LF
- The mechanism for encoding characters can vary from entity to entity
  - All XML processors must accept UTF-8 and UTF-16
- Character Encoding in Entities (XML 4.3.3)
    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    <?xml encoding='UTF-8'?>
    <?xml encoding='EUC-JP'?>
- Autodetection of Character Encoding
- Language identification (XML 2.12)
  - Tag for identification of languages
    LanguageID ::= Langcode ('-' Subcode)*
    Langcode ::= ISO639Code | IanaCode | UserCode
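
To make the language tagging concrete, here is a small sketch in Python. The regular expression transcribes the LanguageID grammar as printed in the 1998 XML 1.0 Recommendation (two-letter ISO 639 codes, `i-` IANA codes, `x-` user codes, optional subcodes). The function names and the sample document are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# LanguageID ::= Langcode ('-' Subcode)*
# Langcode   ::= ISO639Code | IanaCode | UserCode
LANGUAGE_ID = re.compile(
    r"(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*\Z"
)


def is_language_id(tag):
    """True if tag matches the LanguageID production sketched above."""
    return LANGUAGE_ID.match(tag) is not None


def texts_by_language(xml_bytes):
    """Return (language, text) pairs; xml:lang is inherited by child elements."""
    found = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        if elem.text and elem.text.strip():
            found.append((lang, elem.text.strip()))
        for child in elem:
            walk(child, lang)

    walk(ET.fromstring(xml_bytes), None)
    return found


SAMPLE = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    "<doc xml:lang='en'><p>The quick brown fox</p>"
    "<p xml:lang='de'>Der schnelle braune Fuchs</p></doc>"
).encode("utf-8")

for lang, text in texts_by_language(SAMPLE):
    print(lang, is_language_id(lang), text)
```

An indexer could use the (language, text) pairs to route each fragment to a language-specific index.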
Slide2_5
Charset and Language tagging
- MIME types: text, image, audio, video
- Charset = Character Set + Character Encoding Scheme
- Transfer Encoding Scheme
  - base64
  - quoted-printable
- Language
  - RFC 1766
  - ISO 639-2
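
The layering above (text, then charset, then transfer encoding) can be sketched with the Python standard library. The helper names `encode_for_transfer` and `decode_from_transfer` are hypothetical; only `base64` and `quopri` are real modules.

```python
import base64
import quopri


def encode_for_transfer(text, charset, transfer_encoding):
    """Apply the charset first, then the transfer encoding."""
    raw = text.encode(charset)  # character set + character encoding scheme
    if transfer_encoding == "base64":
        return base64.b64encode(raw)
    if transfer_encoding == "quoted-printable":
        return quopri.encodestring(raw)
    raise ValueError("unsupported transfer encoding: " + transfer_encoding)


def decode_from_transfer(payload, charset, transfer_encoding):
    """Undo the transfer encoding first, then decode with the charset."""
    if transfer_encoding == "base64":
        raw = base64.b64decode(payload)
    elif transfer_encoding == "quoted-printable":
        raw = quopri.decodestring(payload)
    else:
        raise ValueError("unsupported transfer encoding: " + transfer_encoding)
    return raw.decode(charset)


word = "Überprüfung"
for scheme in ("base64", "quoted-printable"):
    wire = encode_for_transfer(word, "iso-8859-1", scheme)
    print(scheme, wire, decode_from_transfer(wire, "iso-8859-1", scheme))
```

The receiver needs both labels to recover the text: the transfer encoding to get the bytes back, and the charset to interpret them.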
Slide2_6
Language Definition in DC Metadata set
<meta name="DC.language" scheme="rfc1766" content="es">
(the scheme may alternatively be "ISO639-2")
<meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
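
A harvester could collect these DC elements, together with their `scheme` and `lang` qualifiers, roughly as follows. This is a sketch using Python's `html.parser`; the class name `DublinCoreParser` and the record layout are illustrative assumptions.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collects <meta name="DC.*"> elements with their scheme and lang."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append({
                "element": name[3:].lower(),   # e.g. "language", "title"
                "content": attr.get("content"),
                "scheme": attr.get("scheme"),  # e.g. "rfc1766"
                "lang": attr.get("lang"),      # language of the content value
            })


HEAD = """
<meta name="DC.language" scheme="rfc1766" content="es">
<meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
"""

parser = DublinCoreParser()
parser.feed(HEAD)
for record in parser.records:
    print(record)
```

Note the two distinct roles: `DC.language` describes the language of the resource, while the `lang` attribute describes the language of the metadata value itself.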
Slide2_7
Multilingual Subject Gateway
- Developing multilingual subject gateways (SOSIG as an example)
  - SOSIG accepts resources in any language, evaluated for quality
  - Translations should be coherent and checked
  - Different language versions should be equally well maintained
- SOSIG cataloguing rules
  - TITLE will be displayed in the first language
  - ALTERNATIVE TITLE in other languages
  - DESCRIPTION will mention the different languages in which the resource is available
  - URI of all language versions
  - Labeling of the URI language
- Library standards for multilingual provision
  - NISO Z39.53 Language codes
  - USMARC Language codes
Slide2_8
Multilingual provision in popular Internet Search Engines
- AltaVista
  - Search in 25 languages
  - Documents indexed as is
  - Automatic translation - very simple and naive
- Other sites that have dedicated national sites: Euroseek, Excite, Lycos, Infoseek
  - interface language
  - language resources
  - no special language policy
Slide2_9
New Developments in Subject Gateways, Indexing, Searching
- NREN projects
- Subject gateways
- Commercial Search Engines
- Multilingual Text Retrieval and Processing: TUSTEP system
Slide2_10
NREN projects
Social Science Information Gateway http://sosig.esrc.bris.ac.uk/
ROADS Project Software/Documentation Server - http://www.roads.lut.ac.uk/
CHIP-Pilot (Clearing House for Internet Projects) - http://www.terena.nl/chip/
IMesh - International Collaboration on Internet Subject Gateways - http://www.desire.org/html/subjectgateways/community/imesh/
DFN Indexing and Searching projects - http://www.dfn.de/links/suchen.html
X.500 Directory E-mail Addresses Search (AMBIX-D) - http://ambix.uni-tuebingen.de:8889
TUSTEP Multilingual Textdata Processing and Fuzzy Searching - http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html
IKEM Toolkit - http://bikit.rug.ac.be:80/ikem/
DRUID Classification Tools, University of Twente - http://twentyone.tpd.tno.nl/druid/
Slide2_11
Search Engines news
CLEVER project at IBM Almaden Research Center - http://www.almaden.ibm.com/cs/k53/clever.html
Cora Search Engine - http://www.cora.justresearch.com/about.html
Google Search Engine - http://www.google.com/why_use.html
Free AltaVista Search Intranet v2.3A Entry Level Software - http://www.altavista.software.digital.com/search/intranet/free_3k/index.asp
Ultraseek Server for Linux Platforms - http://software.infoseek.com/products/ultraseek/linux/ultrareq.htm
Slide2_12
TUSTEP: TUebingen System of Text Processing Programs
1. File structure
2. Multilingual capabilities
3. Internal data presentation
4. Database publishing/output data presentation
5. CGI
6. Sample implementation http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
Try entries like Smith or Meier or...
http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery
Slide2_13
TUSTEP: File structure
- TUSTEP can handle basically all kinds of (explicitly or implicitly) structured text files
- Special support for XML
- "Databases" (i.e. files with a repeated and regular structure) are only a special case of this
- Fuzzy search and other retrieval actions can then be used to access the data
Slide2_14
TUSTEP: Multilingual capabilities
TUSTEP supports the following scripts:
- Latin
- Cyrillic
- Greek (classical and modern)
- Hebrew (with support for Yiddish)
- Arabic
- Estrangelo
- Coptic
- Old Church Slavonic
More:
- Phonetics
- Egyptian hieroglyphs
- allows use of combining diacritics
Experimental: Indic scripts and Armenian
Slide2_15
TUSTEP: Internal data presentation and transformation
TUSTEP uses internally a script tagging system with transliteration into ASCII which allows all data to be encoded in a human-readable and easily transmittable form
TUSTEP has a module for importing from and exporting to the UCS (UTF-8 and UTF-16)
Example: #r+Novij rafiqnij clovnik ykra^ins^bko%:^i movi#r-
Transformation module allows use of other tagging systems and other transliteration schemes
Slide2_16
TUSTEP: Database publishing
TUSTEP's typesetting module offers a high-quality, fast and easy way of publishing all or part of the database in paper (or PDF) form
Slide2_17
TUSTEP: CGI
- Complete control over input and output forms
- Possibility to configure exactly the kind of search(es), e.g.
  - exact matches only
  - Soundex
  - "intelligent" fuzzy search
  - "brute" fuzzy search that allows a number of different letters
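
The search kinds listed above can be illustrated with a short Python sketch. It is a generic illustration, not TUSTEP's implementation: the Soundex function follows the common American Soundex rules, the "brute" fuzzy mode is modeled here as Levenshtein edit distance with a threshold, and the "intelligent" fuzzy search is left out because the slides do not describe how it works. All function names are hypothetical.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the run (code is "")
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def search(names, query, mode, max_diff=1):
    """Filter names by the configured kind of search."""
    if mode == "exact":
        return [n for n in names if n == query]
    if mode == "soundex":
        return [n for n in names if soundex(n) == soundex(query)]
    if mode == "fuzzy":
        return [n for n in names if edit_distance(n, query) <= max_diff]
    raise ValueError("unknown search mode: " + mode)


NAMES = ["Meier", "Meyer", "Maier", "Mayer", "Smith", "Smyth", "Schmidt"]
for mode in ("exact", "soundex", "fuzzy"):
    print(mode, search(NAMES, "Meier", mode))
```

Soundex groups names by sound class, so it tolerates spelling variants of any length, whereas the edit-distance mode tolerates a fixed number of differing letters regardless of how they sound.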
Slide2_18
Multilinguality framework of the project
- Multiple language indexing
  - multiple language documents/indexes
- Cross-language Searching
  - Multiple language indexes/documents
  - Automatic Query forwarding based on thesauri
- Automatic translation
  - Multilingual information retrieval
  - Translation Request Protocol
- Language and Character Encoding tagging
  - XML as internal presentation of data
  - Using XML language and charset tagging
- Metadata
  - DC.Language definition
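
One way to picture query forwarding based on thesauri is the toy sketch below. Everything in it is an assumption made for illustration: the tiny `THESAURUS`, the language-tagged `DOCUMENTS`, and the function names do not come from the project. The idea shown is only that a query term is mapped to a concept, the concept's terms in each language are forwarded, and each document is matched against the terms for its own tagged language.

```python
# Hypothetical concept-keyed thesaurus: concept -> language -> terms.
THESAURUS = {
    "library": {"en": ["library"], "de": ["Bibliothek"], "es": ["biblioteca"]},
    "dictionary": {"en": ["dictionary"], "de": ["Wörterbuch"], "es": ["diccionario"]},
}

# Hypothetical documents as (id, language tag, text).
DOCUMENTS = [
    ("d1", "en", "the university library catalogue"),
    ("d2", "de", "Katalog der Bibliothek"),
    ("d3", "es", "horario de la biblioteca"),
    ("d4", "de", "Mensa Speiseplan"),
]


def forward_query(term, source_lang, thesaurus=THESAURUS):
    """Map a term to per-language query terms via its thesaurus concept."""
    wanted = term.casefold()
    for entry in thesaurus.values():
        if wanted in (t.casefold() for t in entry.get(source_lang, [])):
            return {lang: list(terms) for lang, terms in entry.items()}
    return {source_lang: [term]}  # unknown term: search the source language only


def cross_language_search(term, source_lang, documents=DOCUMENTS):
    """Return ids of documents matching the term in their own language."""
    queries = forward_query(term, source_lang)
    hits = []
    for doc_id, lang, text in documents:
        words = text.casefold().split()
        if any(q.casefold() in words for q in queries.get(lang, [])):
            hits.append(doc_id)
    return hits


print(forward_query("library", "en"))
print(cross_language_search("library", "en"))
```

The language tag on each document is what makes the forwarding targeted, which is why the framework pairs cross-language searching with language and charset tagging.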