A Semantic Model for End-to-End Multilingual Web Content Processing
David Lewis, Alex O’Connor, Dominic Jones
Centre for Next Generation Localisation,
Trinity College Dublin
© 2010 TCD
Challenges for the Localisation Industry
Localisation Industry
Web & mobile, Apps & SaaS, perpetual beta, long tail
Demand reflects shifts in global trade
User Generated Content
Centre for Next Generation Localisation
DCU
• Machine Translation
• Translation Technology
• Multilingual Information Retrieval
• Semantic Model Evolution
UCD
• Speech Synthesis
• Speech Recognition
• Semantic Model Generation
UL
• Localisation Research Centre
• Localisation Standards
• Localisation Workflow
TCD
• Personalisation
• Adaptive Hypermedia
• Text Analytics
• Interaction Design
• Service Integration & Management
• Semantic Interoperability
Supporting the Global Customer
WWW
User Forum
Knowledge Article
Language Service Provider
Customer
Enterprise
• Multi-lingual IR
• Real time social media translation
• Enterprise MT for user forum
• Assisted spoken interaction with users
• Community rating/post-editing
• LSP offer quality on-demand
Conventional Localisation Interoperability
[Diagram: Content Developer, LSP and Translator; Content Management System, Translation Management System and Computer Assisted Translation tool; XLIFF, TMX and TBX hand-offs between them]
Hand-off standards in XML
Import & Export functions in monolithic tools
Web Services – Potential Benefits for Localisation
Views sought from developers, LSPs and tool vendors
• Scalability through automated interfaces
• Coping with unpredictable workloads in an agile world
• Building blocks for automated workflows
• 24x7 availability for a geographically distributed workforce
• 'Pay as you use' models
• More granular cost management
• Easy to “rip and replace” components
• Easy deployment and version consistency
• Guarantees from automated workflows
• Support for specialist companies delivering 'niche' services
Language Technology and Language Resource Curation
Web Services: Needed Now - Need to Innovate
1. Machine Translating
2. QA checking
3. Reporting
4. File Parsing
5. Leveraging
6. Segmentation
7. Pseudo translating
8. Updating
9. Generate localised file
10. Archiving
11. Artwork/Multimedia processing
12. Translating
13. Language reviewing
14. Testing
Language Technology services
Translator
SMT Domain tuned
RBMT
Hybrid
Text Analyser
Text classifier
Topic identifier
POS tagger
Sentiment analyser
CLIR
Speech processor
Industry Survey – Barriers to Adoption
• Need for open standards
• Reuse tools solutions
• Building industry consensus
• Performance
• Reliability and robustness
• Security: need buy-in from IT departments at content developers and LSPs
• Maturity of workflow engines
Choice Interoperability
Flexibility Automation
SOA
Web Service Interoperability
[Diagram: Content Developer, LSP and Translator; Content Management System, Translation Management System and Computer Assisted Translation exchange “XLIFF” and call Language Web Services (MT, TM, TA, TB)]
• Tools vendors and service providers already offering web services
• Repurposing handoff standard requires careful profiling and definition of processing expectations
Semantic Web to the Rescue?
• Scalable in extension, good for incremental adoption
• Link XML content and RDF meta-data (RDFa)
• Leverage existing vocabularies, e.g. Dublin Core, Open Provenance Model, Creative Commons
• Maturing Tools & Stores
RDF(S)
• Subject-Property-Object
• Name Spaces
• URI
• Classes and Subclasses
• Range and Domain
Relational Data
This example from http://research.talis.com/2005/rdf-intro/
ISBN        Title          Author        Publisher ID   Pages
596002637   PracticalRDF   S. Powers     7642           350
596000480   Javascript     D. Flanagan   3556           936
...         ...            ...           ...            ...
Row = Subject
Column = Property
Cell = Value
Graph View
This example from http://research.talis.com/2005/rdf-intro/
aBook (subject) --title (property)--> “Javascript” (object)
Graph View
aBook --title--> “Javascript”
aBook --isbn--> 0596000480
aBook --publisher--> 3556
3556 --name--> “O’Reilly”
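The row, column and cell mapping above can be sketched in a few lines of Python. This is only an illustration using plain tuples rather than an RDF library; the `BASE` namespace and the property names derived from the column headers are assumptions, not part of the Talis example.

```python
# Hypothetical namespace for illustration only; it does not appear in the slides.
BASE = "http://example.org/book/"

# Column headers from the table, rewritten as property names (an assumption).
COLUMNS = ["isbn", "title", "author", "publisherId", "pages"]
ROWS = [
    ("596002637", "PracticalRDF", "S. Powers", "7642", "350"),
    ("596000480", "Javascript", "D. Flanagan", "3556", "936"),
]


def rows_to_triples(columns, rows, key="isbn"):
    """Turn each table row into (subject, property, value) triples."""
    key_index = columns.index(key)
    triples = []
    for row in rows:
        subject = BASE + row[key_index]  # row = subject
        for column, cell in zip(columns, row):
            # column = property, cell = value
            triples.append((subject, column, cell))
    return triples


TRIPLES = rows_to_triples(COLUMNS, ROWS)
```

Each row yields one triple per column, so adding a column to the table only adds triples and does not change the ones already produced.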
Classes, Properties, Specialisation and Name Spaces yield Scalable Extensibility
Upper Layer Ontology: Reusable concepts and associations
@prefix ulo: <http://www.eg.org/ontologies/upperLayer#>
ulo:rootClass
ulo:concept1 ulo:concept2 ulo:concept3 ulo:concept4
ulo:assoc xsd:Type1 xsd:Type2
Domain Specific Model: Concepts multiply inherit from upper layer and extend
@prefix dsm: <http://www.domain.org/ontologies/domainSpecifc#>
dsm:domainSpecific1 dsm:domainSpecific2 dsm:assoc
Subsequent Domain Specific Model: Refer to upper layer for consistency, easing association with other domain specific models
@prefix sdsm: <http://www.metoo.org/ontologies/yaDomainSpecifc#>
sdsm:domainSpecific3 sdsm:assoc
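A rough Python sketch of why a shared upper layer eases association between domain models: two classes from different name spaces can be related through their common upper-layer ancestors. The class names come from the slide, but the specific subclass links are assumed, since the diagram's arrows do not survive in the text.

```python
# Assumed subclass links; only the class names are taken from the slide.
SUBCLASS_OF = {
    "ulo:concept1": ["ulo:rootClass"],
    "ulo:concept2": ["ulo:rootClass"],
    "dsm:domainSpecific1": ["ulo:concept1", "ulo:concept2"],  # multiple inheritance
    "sdsm:domainSpecific3": ["ulo:concept2"],
}


def ancestors(cls, subclass_of=SUBCLASS_OF):
    """Return every class reachable by following subclass links upwards."""
    seen = set()
    stack = list(subclass_of.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, []))
    return seen


def shared_ancestors(a, b, subclass_of=SUBCLASS_OF):
    """Upper-layer classes that two domain-specific classes have in common."""
    return ancestors(a, subclass_of) & ancestors(b, subclass_of)
```

Under these assumed links, `shared_ancestors("dsm:domainSpecific1", "sdsm:domainSpecific3")` contains `ulo:concept2` and `ulo:rootClass`, which is the hook a later model can use to align with an earlier one.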
Semantics for Web Service Interoperability
• Web service input and output passed as structured data: XML, JSON
• Semantic Annotations
  • Operation/interface to service category
  • Input/Output to semantic structures
• W3C Semantic Annotation for WSDL (SAWSDL)
• WSMO-Lite for RESTful services
[Diagram: a Client calls an MT service over http; syntactic input and output are related to the semantic level, including the service category, by lifting and lowering]
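Lifting and lowering can be illustrated with a small Python sketch: lowering turns semantic triples into the syntactic request a service expects, and lifting turns the syntactic response back into triples. The JSON field names (`text`, `source_lang`, `target_lang`, `translation`) and the segment identifiers are hypothetical; they are not defined by SAWSDL, WSMO-Lite or the slides. Only the `:Machine Translated` class name mirrors the seed taxonomy.

```python
import json


def lower(triples, subject, target_lang):
    """Lowering: semantic triples -> syntactic service input (JSON)."""
    values = {p: o for s, p, o in triples if s == subject}
    request = {
        "text": values["value"],
        "source_lang": values["xml:lang"],
        "target_lang": target_lang,
    }
    return json.dumps(request, sort_keys=True)


def lift(response_json, subject):
    """Lifting: syntactic service output (JSON) -> semantic triples."""
    data = json.loads(response_json)
    return [
        (subject, "rdf:type", ":Machine Translated"),
        (subject, "value", data["translation"]),
        (subject, "xml:lang", data["target_lang"]),
    ]
```

The client only ever works with triples, so swapping one MT service for another should mean replacing the lifting and lowering pair rather than the client logic.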
Propose RDFS Semantic Models
Content Model
• The core processed content taxonomy that we use to record the content transformations of various types (including training) delivered by services
Service Model
• Taxonomy of service category classifications for different services; dovetails with the content model
NGL Content: Seed Taxonomy
:Managed Content
:Analysed Content
:Prepared Content
:Localised Content
:Published Content
:Generated Content
:Asset Content
:Manually Processed Content
:Automatically Processed Content
:Personalised Content
:Presented Content
:Retrieved Content
:Prepared For Localisation
:Prepared For Adaptation
:Serialised Content
:Rights Asserted Content
:Grouped Content
:Counted Content
NGL Content: Seed taxonomy for localised content
:Localised Content
:Translated Content
:Reviewed Content
:Reengineered Content
:Reassembled Content
:Segmented Content
:Machine Translated
:Human Translated
:Postedited
:Leveraged from TM
NGL Service Model: Seed Taxonomy
:NglService
:Transform Content
:Annotate Content
:Process Group Content
:Create Service
:Generate Content
:Create MTSvc
:Create TM LeverageSvc
:Create Composite Svc
:Create Personalised Svc
:Order Group
:Filter Group
:Select From Group
:Classify Text
:Rate Content
:Tag Content
:Machine Translate
:LeverageTM
:Synthesise Speech
:Author Text
:Recognise Speech
…
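One way to picture how the service model dovetails with the content model is a lookup from a service category to the content class it produces, followed by a walk up the content taxonomy. The sketch below is hypothetical: the pairings and the parent links are assumptions chosen to match the class names in the seed taxonomies, since the slides do not give the exact hierarchy.

```python
# Assumed pairing of service categories with the content class they produce.
SERVICE_OUTPUT = {
    ":Machine Translate": ":Machine Translated",
    ":LeverageTM": ":Leveraged from TM",
}

# Assumed fragment of the content taxonomy (child -> parent).
CONTENT_PARENT = {
    ":Machine Translated": ":Translated Content",
    ":Leveraged from TM": ":Translated Content",
    ":Translated Content": ":Localised Content",
}


def content_classes_after(service):
    """Content classes that apply to the output of a service, most specific first."""
    cls = SERVICE_OUTPUT[service]
    chain = [cls]
    while cls in CONTENT_PARENT:
        cls = CONTENT_PARENT[cls]
        chain.append(cls)
    return chain
```

With these assumed links, the output of `:Machine Translate` is classified as `:Machine Translated`, then `:Translated Content`, then `:Localised Content`.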
Model Refinement
Seed Semantic Model
Apply to Data Flow of specific process models
Capture semantic annotation for service contracts
Extend semantic model
Revise core model
Fine-grained Roundtrips
[Diagram: Customer, Content Developer, LSP and Translator; Content Management System, Translation Management System and Computer Assisted Translation exchange “XLIFF” and call Language Technology Web Services (TA, TM, MT, TB)]
Linked Open Data
>2B RDF triples
Around 3M RDF links
http://en.wikipedia.org/wiki/Linked_Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Interoperability through Linked Data
[Diagram: Content Developer, LSP and Translator; CMS, Translation Management System and Computer Assisted Translation are linked through xml:tm, termlink and RDFa and call Language Technology Web Services (TA, TM, MT, TB)]
Content State Transformation
• Content semantics aim to express the state transformations that operate on content and its meta-data as the result of content processing by different services
• A provenance-oriented model based on the Open Provenance Vocabulary is used to capture both process transformations and details of the agents and content of those operations
• This allows processes to be defined in terms of specific transforms (additions, changes) on content
Linked Localisation Data
Open Provenance Vocabulary
• Lightweight version of Open Provenance Model
• http://openprovenance.org/
Author, segment and source QA
1: Register source Document (typically from CMS)
2: Register doc for localisation and segment
3: Annotate each segment with source QA and then revise specific source segments as appropriate
[Provenance graph: nodes example1, 1001312345, 12396, 12401, “I’m a string”, “I am a string”, en-IE, dlewis, rschaler, CmsWrite, Loc Prep, Lkr-l10n, and timestamps 2010-02-07T09:00:00 and 2010-02-07T09:30:00; properties isPartOf, value, xml:lang, wasGeneratedBy, wasGeneratedAt, wasControlledBy, wasDerivedFrom, wasRevisedFrom, wasannotatedWith; classes :Generated Content, :Segmented Content and :Author Text linked by isa]
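The three steps can be sketched as provenance triples that are then queried for a segment's history. The property names and identifiers are reused from the slide, but the wiring between nodes is an assumed reading, because the graph's edges do not survive in the text; treat it as an illustration rather than the authors' actual data.

```python
# Assumed edges; identifiers and property names are reused from the slide.
PROVENANCE = [
    # 1: register source document
    ("example1", "wasGeneratedBy", "CmsWrite"),
    ("CmsWrite", "wasControlledBy", "dlewis"),
    ("example1", "xml:lang", "en-IE"),
    # 2: register for localisation and segment
    ("12396", "isPartOf", "example1"),
    ("12396", "value", "I'm a string"),
    ("12396", "wasGeneratedBy", "Loc Prep"),
    ("Loc Prep", "wasControlledBy", "rschaler"),
    # 3: source QA revises a segment
    ("12401", "wasRevisedFrom", "12396"),
    ("12401", "value", "I am a string"),
    ("12401", "wasGeneratedBy", "Lkr-l10n"),
    ("Lkr-l10n", "wasControlledBy", "rschaler"),
]


def objects(subject, prop, triples=PROVENANCE):
    """All objects of triples matching a subject and property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def revision_history(segment, triples=PROVENANCE):
    """Follow wasRevisedFrom links back to the original segment."""
    history = [segment]
    while True:
        earlier = objects(history[-1], "wasRevisedFrom", triples)
        if not earlier:
            return history
        history.append(earlier[0])


def controllers(segment, triples=PROVENANCE):
    """Agents that controlled the processes which generated a segment."""
    return [
        agent
        for process in objects(segment, "wasGeneratedBy", triples)
        for agent in objects(process, "wasControlledBy", triples)
    ]
```

Because every revision is a new node linked to its predecessor, earlier values stay queryable, which is what makes fine-grained roundtrips traceable.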
Next Steps
• Applying the semantic model to different use cases and revising it
• CNGL Web+LT+Loc demonstrators
• Annotating available APIs
• Semantic sandpit
• Content mark-up for links - RDFa
• Avoid standardising semantics
• Linked Data vocabularies established through adoption
• Stress testing semantic technology
• Access control to federated triple stores
• Service pre-conditions and effects with rules
Conclusions
Extensible semantic models offer interoperability without stifling innovation
Semantic Annotation offers improved service-oriented interoperability
Provenance-based linked data for fine-grained process roundtrips
Quality-driven language resource management
Visit: www.cngl.ie
Watch: http://www.youtube.com/user/nextgenlocalisation
Contact: [email protected]