Multilingual support to a proposed Semantic Web architecture Andrea Ferrato TOP-UIC MS Thesis, 2003/’04 Advisor: Laura Farinetti

Page 1: Multilingual support to a proposed Semantic Web architecture

Multilingual support to a proposed Semantic Web architecture

Andrea Ferrato
TOP-UIC MS Thesis, 2003/’04

Advisor: Laura Farinetti

Page 2: Multilingual support to a proposed Semantic Web architecture

A. Ferrato, TOP-UIC 2003-'04 2

Purpose of this work

• Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
• Provide an approach that is as general as possible
• Exploit features of the pre-existing architecture
• Cope with the typically chaotic structure of currently available resources

Page 3: Multilingual support to a proposed Semantic Web architecture

Outline

• Semantic Web
• Multilinguality
• The DOSE platform
• Proposed solution
• Given implementation
• Experimental results
• Conclusions

Page 4: Multilingual support to a proposed Semantic Web architecture

Semantic Web

• The next evolutionary stage for the WWW
• Goal: make network data usable by intelligent agents
• Deployable only on top of existing infrastructure
• Two pressing tasks
  • Transform existing contents to include semantics
  • Set up ad hoc user agents to work on them

Page 5: Multilingual support to a proposed Semantic Web architecture

Transform existing contents

• Basic data units: resources
  • Every single information entity that can be semantically isolated
• Features to be given
  • Identification: URI
  • Structure: XML
  • Meaning: RDF
  • Knowledge: ontologies

Page 6: Multilingual support to a proposed Semantic Web architecture

Set up ad hoc user agents

• Major players in Semantic Web deployment
• Invoked by users, can proceed autonomously
• Key facilities to be supported
  • Logic
  • Proof
  • Trust

Page 7: Multilingual support to a proposed Semantic Web architecture

Semantic Web: layer cake view (Berners-Lee)

[Figure: layered stack, bottom to top: Unicode and URI; XML + NS + XML Schema; RDF + RDF Schema; Ontology vocabulary; Logic; Proof; Trust. Side labels: Digital signatures, Self-desc. doc., Data, Rules.]

Page 8: Multilingual support to a proposed Semantic Web architecture

Multilinguality

• The extension to multiple languages of tasks already performed in a monolingual context
• Typical issues from cross-language mapping
  • Lexical gaps
  • Role of the context
  • Lack of pre-acquired knowledge

Page 9: Multilingual support to a proposed Semantic Web architecture

Multilinguality and Semantic Web

• A problem of Text Retrieval in multiple languages (NLP)
  • Start from popular approaches (Controlled Vocabulary, Free text, etc.)
• Two main requirements
  • Recognize the language ID of resources
  • Map contents independently of language

Page 10: Multilingual support to a proposed Semantic Web architecture

Language ID retrieval

• Two possible scenarios
  • Retrieve a given ID via resource parsing
  • Recreate the ID via resource analysis
• When retrieving a declared language attribute, conform to existing language specification standards

Page 11: Multilingual support to a proposed Semantic Web architecture

Language ID specification

[Figure: sources of a language declaration: Content-Language, CSS-level declarations, the “lang” attribute, plus language inheritance.]
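
To make the idea of retrieving a declared language ID by parsing more concrete, here is a minimal Python sketch. It tracks the "lang" attribute and its inheritance down the element tree, and uses an optional Content-Language value as the outermost fallback. This is only an illustration, not the DOSE code: LangTracker, effective_languages and the sample markup are hypothetical names invented for this example, and CSS-level declarations are not handled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LangTracker(HTMLParser):
    """Records (effective language, text) for every text fragment."""

    def __init__(self, default_lang=None):
        super().__init__()
        # Each stack entry is (tag name, effective language at that depth).
        self.stack = [("#root", default_lang)]
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        declared = dict(attrs).get("lang")
        inherited = self.stack[-1][1]
        # A declared "lang" wins; otherwise the parent's language is inherited.
        self.stack.append((tag, declared or inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append((self.stack[-1][1], text))


def effective_languages(html, content_language=None):
    """Return the effective language of each text fragment in `html`."""
    tracker = LangTracker(default_lang=content_language)
    tracker.feed(html)
    tracker.close()
    return tracker.fragments


if __name__ == "__main__":
    sample = '<body lang="en"><p>Hello</p><div lang="it"><p>Ciao</p></div></body>'
    print(effective_languages(sample))
```

Fragments without any declaration in their ancestry come back with the Content-Language value, or None when that is absent too; that is the case where the language would have to be recreated by analysis.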

Page 12: Multilingual support to a proposed Semantic Web architecture

Language-independent contents mapping

• Investigate the form/meaning relationship
• Ontology design is crucial
• Three main requirements
  1. Consistency (based on linguistic evidence)
  2. Flexibility (meaningful for all languages)
  3. Extensibility (easy addition of new languages)

Page 13: Multilingual support to a proposed Semantic Web architecture

Ontology models

• Conceptual: founded upon general knowledge
• Language-based: built on a particular language
• Interlingua: a combination of the above two
• None is definitely superior for multilinguality

Page 14: Multilingual support to a proposed Semantic Web architecture

The DOSE platform

• Distributed Open Semantic Elaboration platform
• Key features
  • Modularity
  • Scalability
  • Semantic integration
• Main functionalities offered
  • Annotation
  • Search

Page 15: Multilingual support to a proposed Semantic Web architecture

DOSE: layered view

[Figure: DOSE modules grouped into three layers (front-end, service, back-end): Indexer, Search Engine, Semantic Mapper, Fragment Retriever, Substructure Extractor, Annotation Repository, Ontology, Synset.]

Page 16: Multilingual support to a proposed Semantic Web architecture

DOSE: distributed view

[Figure: the Ontology, Synset, Fragment Retriever, Substructure Extractor, Semantic Mapper, Annotation Repository, Indexer and Search Engine modules connected through an XML-RPC infrastructure.]

Page 17: Multilingual support to a proposed Semantic Web architecture

DOSE: annotation

[Figure: annotation workflow in 11 numbered steps involving the Indexer, Fragment Retriever, Substructure Extractor, Semantic Mapper, Annotation Repository and the Web.]

Page 18: Multilingual support to a proposed Semantic Web architecture

DOSE: search

[Figure: search workflow in 8 numbered steps involving the Search Engine, Semantic Mapper, Annotation Repository, Fragment Retriever and the Web.]

Page 19: Multilingual support to a proposed Semantic Web architecture

DOSE and multilinguality

• Traditionally: a new ontology for each different language
• DOSE: the ontology language is totally independent of the synset language
  • Use synsets to store lexical representations only
  • Let the ontology focus on knowledge modeling

Page 20: Multilingual support to a proposed Semantic Web architecture

Practical requirements for multilinguality

• Indexing
  • Recognize the language of resources and set up the system accordingly
  • Store language IDs with annotations
• Search
  • Interpret user queries expressed in natural languages
  • Allow for cross-language search tasks

Page 21: Multilingual support to a proposed Semantic Web architecture

Extension to language

• Proposed approach: one ontology, many synsets
  • A concept is expressed by a different synset for each supported language
  • Each synset contains multiple lexical representations of a related concept in a single language
• Separate semantic and textual layers

Page 22: Multilingual support to a proposed Semantic Web architecture

Extension to language (cont’d)

[Figure: the concept “job” linked to three synsets (one concept, three synsets):
  Italian: lavoro, stipendio, datore di lavoro, …
  English: salary, job, employment, …
  French: travail, chômeur, …]
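
The "one ontology, many synsets" idea can be sketched as a small data structure. The Python below is a hedged illustration only: the dictionary layout, the concept identifier "job" used as a key, and the helper names build_index, concepts_for and add_language are assumptions made for this example. The slides say the actual English synset is stored as RDF.

```python
from collections import defaultdict

# One concept, three synsets: the words are those shown on the slide.
# The concept key "job" and this dictionary layout are hypothetical.
SYNSETS = {
    "job": {
        "it": {"lavoro", "stipendio", "datore di lavoro"},
        "en": {"salary", "job", "employment"},
        "fr": {"travail", "chômeur"},
    }
}


def build_index(synsets):
    """Map (language, lexical form) to the set of concepts it expresses."""
    index = defaultdict(set)
    for concept, by_language in synsets.items():
        for language, words in by_language.items():
            for word in words:
                index[(language, word.lower())].add(concept)
    return index


def concepts_for(index, language, word):
    """Look up the concepts expressed by `word` in the given language."""
    return index.get((language, word.lower()), set())


def add_language(synsets, language, words_by_concept):
    """Adding a language only adds synsets; the concepts stay untouched."""
    for concept, words in words_by_concept.items():
        synsets[concept][language] = set(words)


if __name__ == "__main__":
    index = build_index(SYNSETS)
    # Words from different languages resolve to the same concept.
    print(concepts_for(index, "it", "lavoro"), concepts_for(index, "en", "employment"))
```

Because the lookup key pairs a lexical form with a language, the same concept is reachable from any supported language, which is what makes cross-language search possible on top of a single ontology.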

Page 23: Multilingual support to a proposed Semantic Web architecture

Advantages

• Reduced implementation requirements
  • Ontology design
  • Resource occupation
• Simplicity (in ontology management)
• Flexibility
  • A new language just brings a new bag of synsets
• Expansion of indexing word set

Page 24: Multilingual support to a proposed Semantic Web architecture

Language recognition

• Proposed approach
  • Retrieve language IDs whenever present
  • Otherwise, recognize language(s)
• Design constraints
  • To be activated in the annotation phase
  • Refined at the document substructure level
  • Has to deal with the typically low authoring quality of Web documents

Page 25: Multilingual support to a proposed Semantic Web architecture

Language recognition (cont’d)

1. Validate explicit request
2. Retrieve “lang” value
   Example: <P lang="ru"> is Russian
3. Guess via heuristics
   Example: “There was an Old Man of Coblenz, The length of whose legs was immense…” is English
4. Retrieve from ancestor
   Example: <H1 lang="fr">Le Bilboquet</H1><P>C’était un vieux passe-temps… so <P> is French
5. Accept default
   Example: default = “it” gives Italian

Hindi: is a Hindi synset available?
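
The five-step order above can be read as a priority cascade. The following Python sketch is one possible reading, not the thesis's Language Detector (a Java XML-RPC module). The function names, the tiny stop-word lists used as the "heuristics", and the treatment of "explicit request" as a caller-supplied language code are all assumptions made for illustration.

```python
import re

# Hypothetical stop-word lists standing in for the unspecified heuristics.
STOPWORDS = {
    "en": {"the", "of", "was", "an", "and", "there"},
    "it": {"il", "di", "che", "un", "e", "era"},
    "fr": {"le", "un", "était", "de", "et", "les"},
}


def guess_language(text, min_hits=2):
    """Guess a language by counting stop-word hits; None if too few hits."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


def resolve_language(text, explicit=None, lang_attr=None, ancestor=None, default="it"):
    """Return (language, source), trying the five sources in slide order."""
    candidates = (
        ("explicit", explicit),            # 1. explicit request
        ("lang", lang_attr),               # 2. "lang" attribute value
        ("heuristic", guess_language(text)),  # 3. guess via heuristics
        ("ancestor", ancestor),            # 4. language of an ancestor
        ("default", default),              # 5. system default
    )
    for source, code in candidates:
        if code:
            return code.lower(), source
    return None, "unknown"


def has_synset(language, supported=frozenset({"it", "en"})):
    """Check whether a synset exists for the resolved language (assumed set)."""
    return language in supported


if __name__ == "__main__":
    limerick = "There was an Old Man of Coblenz, the length of whose legs was immense"
    print(resolve_language(limerick))
    print(resolve_language("Gomuku"))
```

The has_synset check reflects the Hindi case on the slide: a language can be resolved correctly and still be unusable for annotation if no synset exists for it.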

Page 26: Multilingual support to a proposed Semantic Web architecture

Current implementation

• A new English synset to couple with a disability ontology (~500 concepts)
• A set of 20 bilingual documents (Italian, English) on disability
• A basic Language Detector XML-RPC module implemented in Java
• Testing scenarios
  • Parallel annotation
  • Language recognition

Page 27: Multilingual support to a proposed Semantic Web architecture

Implementation work

• Language Detector module (Java, ~1000 lines of code)
• Additions to pre-existing modules (Java, ~1000 lines of code)
• English synset (RDF, ~3500 lines of code)
• ~24 Mb of annotations produced
• Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)

Page 28: Multilingual support to a proposed Semantic Web architecture

Multilingual DOSE in action

Page 29: Multilingual support to a proposed Semantic Web architecture

Parallel annotation

• Two parallel documents have
  • The same structure elements with the same contents
  • Two different languages of expression
• Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (“parallel annotation”)
• Both sets indexed separately, with language explicitly specified

Page 30: Multilingual support to a proposed Semantic Web architecture

Parallel annotation (cont’d)

• Test methodology: “Vector Space Model”
• Document fragments described as vectors
  • Dimensions are ontology concepts
  • Components are weighted (tf/idf) occurrences of such concepts
• The correlation between two fragments is quantified as the cosine of the angle between their vectors

Page 31: Multilingual support to a proposed Semantic Web architecture

Parallel annotation (cont’d)

[Figure: two fragment vectors plotted on concept axes X (Part-time job) and Y (Retirement); their correlation is the cosine of the angle between them.
  Italian, IT/html/body/p[3]: X = 2.5, Y = 0
  English, EN/html/body/p[3]: X = 1.5, Y = 1.5]
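
As a worked example of the correlation measure, the Python sketch below computes the cosine between the two fragment vectors of this slide. It treats the numbers in parentheses as the weighted components on the Part-time job and Retirement axes. The function name cosine and the dictionary representation are assumptions for illustration; how the weights themselves are derived is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse concept vectors (dicts)."""
    concepts = set(u) | set(v)
    dot = sum(u.get(c, 0.0) * v.get(c, 0.0) for c in concepts)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # A fragment with no annotated concepts correlates with nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Components taken from the slide for the fragments at /html/body/p[3].
italian = {"Part-time job": 2.5, "Retirement": 0.0}
english = {"Part-time job": 1.5, "Retirement": 1.5}

if __name__ == "__main__":
    print(round(cosine(italian, english), 3))  # 0.707
```

For these values the dot product is 3.75 and the norms are 2.5 and about 2.121, so the correlation is about 0.707, which corresponds to an angle of 45 degrees between the two fragments.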

Page 32: Multilingual support to a proposed Semantic Web architecture

Parallel annotation results at <BODY> level

[Figure: normalized frequency (0 to 0.45) versus correlation factor (0 to 1); series: Parallel fragments, Others.]

Page 33: Multilingual support to a proposed Semantic Web architecture

Correlation results at <BODY> level

[Figure: correlation factor for pairs of Italian pages and English pages (axis ticks 1 to 19 and S1 to S19), shown in bands 0-0.2, 0.2-0.4, 0.4-0.6 and 0.6-0.8.]

Page 34: Multilingual support to a proposed Semantic Web architecture

Correlation results at <BODY> level (alt)

[Figure: alternative view of the correlation factor (0 to 0.8) for pairs of Italian pages and English pages, shown in bands 0-0.2, 0.2-0.4, 0.4-0.6 and 0.6-0.8.]

Page 35: Multilingual support to a proposed Semantic Web architecture

Parallel annotation results at <Hx> level

[Figure: normalized frequency (0 to 0.6) versus correlation factor (0 to 1); series: Parallel fragments, Others.]

Page 36: Multilingual support to a proposed Semantic Web architecture

Parallel annotation: notes

• Parallel and nonparallel pairs can be grouped into two different distributions, modeled here as Gaussian distributions
• Average values of the two distributions are clearly separated, both for <BODY> and <Hx> levels
  • This shows that the indexing system is able to annotate relevant document fragments independently of language

Page 37: Multilingual support to a proposed Semantic Web architecture

Language recognition

• Separate testing on the same document set
• Italian and English documents are alternated in batch processing
  • Avoid reuse of default settings for contiguous documents of the same language
• Two ways to retrieve ancestor language
  • Via Annotation Repository (acceptable)
  • Via a “Language Stack” (still inefficient)

Page 38: Multilingual support to a proposed Semantic Web architecture

Annotation Repository vs. Language Stack

<BODY lang="en">
<H1 lang="it">Passatempi</H1>
<H2 lang="en">Board Games</H2>
<P>Gomuku</P><P>Dama</P>…

• All cyan, underlined words are to be annotated (they are included in the synsets)
• Language Stack: Dama is ignored (language “en” inherited from <H2>)
• Annotation Repository: Dama is annotated (language “it” inherited from <H1>, which is annotated)

Page 39: Multilingual support to a proposed Semantic Web architecture

Language recognition results (via Annotation Repository)

[Figure: recognition percentage (axis 0 to 120) for each analyzed page; series: Hit percentage (%), Hit average (%).]

Page 40: Multilingual support to a proposed Semantic Web architecture

Conclusions

• Typical issues discussed
• Overall validity of the approach shown
• Further work and improvements
  • Synset composition
  • Annotation testing with more languages
  • Optimize proposed language recognition techniques, add new ones

Page 41: Multilingual support to a proposed Semantic Web architecture

Thank you…

Questions?

Page 42: Multilingual support to a proposed Semantic Web architecture

Language recognition (2)

[Figure: recognition percentage (axis 0 to 120) for each analyzed page; series: Percentage Anno, Percentage Stack, Average Anno, Average Stack.]