After OWL: defacto standards for semantic technologies (or: what do you get for €40m EU research money?) http://gate.ac.uk/ http://nlp.shef.ac.uk/ Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Wim Peters, Niraj Aswani, Milena Yankova, Yaoyong Li, Akshay Java, Michael Dowman ILASH workshop, March 2004

After OWL: defacto standards for semantic technologies

(or: what do you get for €40m EU research money?)

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Wim Peters, Niraj Aswani, Milena Yankova, Yaoyong Li, Akshay Java, Michael Dowman

ILASH workshop, March 2004

2(24)

Structure of the talk

• Context:
    • increasing use of “semantic” technology in IT
    • the role(s) of human language technology
    • substantial investment in the next phase of semantic web research
• Semantic Web: moving on from formal standards
• Acronym soup:
    • GATE: HLT API 4 SDK SW & KT
• An application: Ontology-Based IE in KIM
• Issues in API design, next steps

3(24)

The Knowledge Economy and Human Language

Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our information spaces are in informal and ambiguous natural language

The challenge: to reconcile these two phenomena

4(24)

HLT: Closing the Loop

[Diagram linking Human Language and Formal Knowledge (ontologies and instance bases). Labels on the links: (A)IE, CLIE, OIE, (M)NLG, Controlled Language. Application areas shown: Semantic Web; Semantic Grid; Semantic Web Services.]

KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE

5(24)

SEKT: Semantic Knowledge Technology
• 6th framework IP project
• Duration: 36 months from 1/1/2004, €12.5m
• http://sekt.semanticweb.org/
• Improve automation of ontology and metadata generation
• Develop highly-scalable solutions
• Research sound inferencing despite inconsistent models
• Develop semantic knowledge access tools
• Develop methodology for deployment

6(24)

PrestoSpace (20th Century Rot)
• 20th Century audio-visual media is rapidly disappearing
• Preservation and restoration are high cost
• The costs must be justified by increased access
• “Metadata”: descriptive information about content
• PrestoSpace (€9m IP, 40 months from 02/04):
    – rich metadata and semantic access
    – cross-lingual access
    – syndicated delivery
    – repurposeable content

7(24)

The “SDK” research cluster

• “Building the European Research Area” in KM through collaboration with related IP and NoE projects in this area for a coordinated impact strategy

• SEKT, DIP, KnowledgeWeb – SDK cluster: http://sdk.semanticweb.org/

• Other related projects:
    • AceMedia IP (semantic knowledge systems)
    • PrestoSpace IP (cultural heritage / digital libraries)
    • BRICKS IP (cultural heritage / digital libraries)

• Total EU/6FP investment in semantic tech. research €40m: potential to influence the emergence of de facto standards

8(24)

Next step for Semantics tech: from formal to de facto standards?

• Computer scientists love standards, so we have many
• For any given problem there are usually 3 “standards”
• OWL is no exception: Lite, DL, Full
• There are good reasons, but cf. RDF(S) implementation history: applications will of necessity mix and match
• If we can achieve standard practice and libraries in applications we will have made a next step and will promote takeup
• (Pathological) example: TCP/IP vs. OSI

9(24)

HLT API 4 SDK SW & KT
• What sorts of software do we need?
• Ontology and metadata management: storage; versioning; caching; inferencing; etc. (below)
• Human language technology components and services (not monolithic systems, not unproven research prototypes)
• The role of measurement in scaling and robustness: in HLT this means MUC, TREC, ACE, TIDES, ...
• Here’s one we baked earlier....

10(24)

GATE (the Volkswagen Beetle of Language Processing) is:

• Eight years old, with the largest user constituency of its type
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al, a graphical development environment.
• Some free components... ...and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

11(24)

Critical mass: 000s people, 00s sites

GATE team projects. Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• EMILLE: S. Asian language corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging

Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• ETCSL: Sumerian digital library
• MiAKT: medical informatics / AKT
• SEKT: Semantic Knowledge Tech
• PrestoSpace: AV Preservation
• KnowledgeWeb; h-TechSight

GATE users = significant proportion of community. A small sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, CubReporter, Poesia...

12(24)

                                                                                                                           

Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka, interoperation with SCHUG in MUMIS)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI
• Why does this matter? It means that GATE works well with other tools, embeds easily, and achieves robustness through focus (API requirements)

13(24)

All the world’s a Java Bean....

CREOLE: a Collection of REusable Objects for Language Engineering:

• GATE components: modified Java Beans with XML configuration

• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

Why bother?
• Allows the system to load arbitrary language processing components
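The idea of declaring components in configuration, so that a host can load ones it was not compiled against, can be sketched as follows. This is an analogy in Python, not GATE's Java API or its CREOLE XML format: the `Tokeniser` class, the `REGISTRY` dictionary (standing in for class loading from a URL) and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET


class Tokeniser:
    """Hypothetical component: splits text on whitespace."""

    def __init__(self, lowercase="false"):
        # Parameters arrive as strings because they come from XML attributes.
        self.lowercase = lowercase == "true"

    def execute(self, text):
        tokens = text.split()
        return [t.lower() for t in tokens] if self.lowercase else tokens


# Stand-in for loading classes from a URL: maps declared names to classes.
REGISTRY = {"Tokeniser": Tokeniser}

# Hypothetical configuration; the element and attribute names are assumptions.
CONFIG = """
<components>
  <component name="tok" class="Tokeniser">
    <param name="lowercase" value="true"/>
  </component>
</components>
"""


def load_components(xml_text, registry):
    """Instantiate every component declared in the XML, keyed by its name."""
    components = {}
    for node in ET.fromstring(xml_text).findall("component"):
        params = {p.get("name"): p.get("value") for p in node.findall("param")}
        components[node.get("name")] = registry[node.get("class")](**params)
    return components


components = load_components(CONFIG, REGISTRY)
tokens = components["tok"].execute("All the World")
```

The host code only knows the `execute` convention; which components exist and how they are parameterised is decided by the configuration.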

14(24)

[Architecture diagram: GATE APIs, organised in layers]

• Document Format Layer (LRs): inputs are HTML docs, RTF docs, XML docs, PDF docs, email; format classes are XMLDocumentFormat, HTMLDocumentFormat, PDFDocumentFormat, …DocumentFormat
• DataStore Layer: XML, Oracle, PostgreSql, .ser
• Corpus Layer (LRs): Corpus, Document, DocumentContent, AnnotationSet, Annotation, FeatureMap
• Processing Layer (PRs): NE, Co-ref, TEs, TRs, POS, …
• Language Resource Layer (LRs): Ontology, Protégé Ontology, WordNet, Gazetteers, ...
• Application Layer: ANNIE, OBIE, …
• IDE GUI Layer (VRs): ADiff, OntolVR, DocVR, ...
• Web Services

NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity, high extensibility

NOTES (2)
• eg: Protégé LR & VR both wrapped in Res. (bean) API
• ontology repositories and inference are the same: KAON + Sesame + Orenge + ?

15(24)

Issues (1): a common HLT API

• OGSA, WSMO in the web services layer?

• Eclipse: less code for us, more services for users? (A free OWL/UML drawing tool, for example)

• ISO TC37/SC4: JNLE special; LIRICS consortium

16(24)

API Application: Ontology-based IE

XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …

[Diagram: Ontology & KB. Classes: Company, Location, City, Country. Relations: type, partOf, HQ, establOn. Instances: XYZ, London, UK, Bulgaria; date value “03/11/1978”.]

17(24)

Classes, instances & metadata

“Gordon Brown met George Bush during his two day visit.”

[Diagram: classes and instances, before and after. Classes: Entity, Person, Job-title; job titles: president, chancellor, minister; instances: G.Brown, Bush.]

<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
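Annotation metadata of the kind shown on this slide can be read back into plain records with a standard XML parser. The sketch below is only an illustration: the element names and offsets are copied from the slide, but the `http://example.org/...` values are hypothetical stand-ins for the elided “…” parts, and `read_annotations` is not part of KIM or GATE.

```python
import xml.etree.ElementTree as ET

# Element names and offsets follow the slide; the example.org URLs are
# hypothetical replacements for the values elided with "…" on the slide.
METADATA = """
<metadata>
  <DOC-ID>http://example.org/1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>http://example.org/onto#Person</class>
    <inst>http://example.org/kb#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>http://example.org/onto#Person</class>
    <inst>http://example.org/kb#Person67890</inst>
  </Annotation>
</metadata>
"""


def read_annotations(xml_text):
    """Return one dict per <Annotation>, each tagged with the document id."""
    root = ET.fromstring(xml_text)
    doc_id = root.findtext("DOC-ID").strip()
    records = []
    for ann in root.findall("Annotation"):
        records.append({
            "doc": doc_id,
            "start": int(ann.findtext("s_offset")),  # int() ignores the padding spaces
            "end": int(ann.findtext("e_offset")),
            "text": ann.findtext("string"),
            "class": ann.findtext("class"),
            "instance": ann.findtext("inst"),
        })
    return records


annotations = read_annotations(METADATA)
```

Each record ties a text span to both a class and a specific instance, which is what distinguishes this metadata from flat named-entity tags.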

18(24)

OBIE in KIM

Popov et al. KIM. ISWC’03

• An ontology (KIMO) and 200K instances KB
• High ambiguity of instances with the same label – uses disambiguation step
• Lookup phase marks mentions from the ontology
• Combined with GATE-based IE system to recognise new instances of concepts and relations
• KB enrichment stage where some of these new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
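A priority ordering of same-label entities by corpus statistics might look roughly like the sketch below. It is not KIM's actual Entity Ranking algorithm, whose details the slides do not give: the ranking criterion (raw mention counts), the instance identifiers and the numbers are all made up for illustration.

```python
from collections import Counter


def rank_candidates(label, kb, mention_counts):
    """Order the KB instances sharing `label`, most frequently mentioned first."""
    candidates = kb.get(label, [])
    return sorted(candidates, key=lambda inst: mention_counts.get(inst, 0), reverse=True)


# Hypothetical toy knowledge base: one label, two competing instances.
kb = {"Paris": ["Paris_Texas", "Paris_France"]}

# Made-up corpus statistics: how often each instance was mentioned.
mention_counts = Counter({"Paris_France": 950, "Paris_Texas": 50})

ranking = rank_candidates("Paris", kb, mention_counts)
best_guess = ranking[0] if ranking else None
```

With no other context, the top-ranked instance serves as the default reading of an ambiguous mention; a fuller system would combine this prior with evidence from the surrounding text.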

19(24)

OBIE in KIM (2)

Popov et al. KIM. ISWC’03

20(24)

KIM demo...

Next steps in OBIE
• Continue to exploit the pluggability and community effects of GATE (and Sesame, Lucene, ...)
• SWAN: Semantic Web Annotator at DERI/Galway
• Syndication
• Social networking
• Evaluation (below)

21(24)

(The “P” in OLP) Challenge: Evaluating Richer NE Tagging

• Need for new metrics when evaluating hierarchy/ontology-based NE tagging

• Need to take into account distance in the hierarchy

• Tagging a company as a charity is less wrong than tagging it as a person
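One way to make “less wrong” measurable is to score a predicted class by its path distance from the gold class in the hierarchy. The sketch below is an assumption-laden illustration, not a metric proposed in the slides: the toy hierarchy in `PARENT` and the `1 / (1 + distance)` scoring formula are both invented here.

```python
# Hypothetical toy class hierarchy (child -> parent); not taken from the slides.
PARENT = {
    "Company": "Organization",
    "Charity": "Organization",
    "Organization": "Entity",
    "Person": "Entity",
    "Entity": None,
}


def ancestors(cls):
    """Return [cls, parent, grandparent, ...] up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def hierarchy_distance(a, b):
    """Number of edges between two classes via their nearest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, cls in enumerate(up_a):
        if cls in up_b:
            return steps_a + up_b.index(cls)
    raise ValueError("classes share no common ancestor")


def tag_score(gold, predicted):
    """1.0 for an exact match, shrinking as the two classes get further apart."""
    return 1.0 / (1.0 + hierarchy_distance(gold, predicted))


charity_score = tag_score("Company", "Charity")  # siblings under Organization
person_score = tag_score("Company", "Person")    # related only through the root
```

Under this scoring a company tagged as a charity earns more credit than one tagged as a person, whereas a flat exact-match metric would count both as equally wrong.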

22(24)

SW IE Evaluation tasks
• Detection of entities and events, given a target ontology of the domain.
• Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated “Cambridge” in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
• Decision when a new instance needs to be added to the ontology, because the text contains a new instance that does not already exist in the ontology.

23(24)

Issues (2): a common OMM API
• Two design approaches:
    A. the “richest set of features” approach: pool experience, cover all the bases, be relevant to very many users (“top-down”)
    B. the “highest common factors” approach: analyse software, pick common features, create pluggability layer (“bottom-up”)
• Both useful; can be combined
• Approach B. has some key advantages:
    – leads to quicker version 1.0
    – minimises arguments (criterion: the feature exists in several systems, not that it is “good”)
• Problems:
    – features present several places but not all – “operation not supported”?
    – new work not prefigured in version 1.0 – roadmaps, placeholders
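Approach B can be sketched as counting which features appear in several systems and then wrapping each system in an adapter that reports “operation not supported” for anything it lacks. The system names, feature names and the `Adapter` class below are all hypothetical; they illustrate the selection rule and are not drawn from any real ontology store.

```python
from collections import Counter

# Hypothetical feature inventories; system and feature names are made up.
SYSTEMS = {
    "store_a": {"load", "save", "query", "infer"},
    "store_b": {"load", "save", "query"},
    "store_c": {"load", "save", "infer", "version"},
}


def common_features(systems, min_systems):
    """Features that at least `min_systems` of the systems implement."""
    counts = Counter(f for features in systems.values() for f in features)
    return {f for f, n in counts.items() if n >= min_systems}


class Adapter:
    """Thin pluggability layer over one system's feature set."""

    def __init__(self, name, features):
        self.name = name
        self.features = features

    def call(self, operation):
        # A feature in the common API may still be missing from this system.
        if operation not in self.features:
            raise NotImplementedError(f"{self.name}: operation not supported: {operation}")
        return f"{self.name}.{operation}"


api_v1 = common_features(SYSTEMS, min_systems=2)
```

Setting `min_systems` to the number of systems gives a strict intersection with no unsupported operations; lowering it widens the API at the cost of the “operation not supported” case listed under Problems.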

24(24)

The end

• Tutorial on HLT for the Semantic Web at European Semantic Web Symposium: http://www.esws2004.org/

• These slides: http://gate.ac.uk/sale/talks/ilash-semweb-mar2004.ppt

• More information: http://gate.ac.uk/ http://nlp.shef.ac.uk/