PANACEA - Y2: After the 2nd Annual Review, 28th February 2012, Barcelona


Objectives

• Join together a number of advanced interoperable tools to build a platform/factory/production line that automates the stages involved in acquiring, processing and producing the Language Resources required by MT and other Language Technologies

Partners

• WP1 – Management (UPF)
• WP3 – The Platform (UPF)
• WP4 – Corpus Acquisition & Annotation (ILSP)
• WP5 – Parallel corpus & derivatives (DCU)
• WP6 – Lexical Acquisition (UCAM)
• WP7 – Integration & resource evaluation (ILC)
• WP8 – Evaluation in industrial environment (LT)
• WP2 – Dissemination and Exploitation (ELDA)

Platform

• The PANACEA platform is an interoperability space based on tools, guidelines, a Common Interface definition, and a “Travelling Object” specification
• Tools: Taverna, BioCatalogue, myExperiment, Soaplab
• Common Interface: WS interoperability
• Travelling Object: XCES and GrAF
• Documentation (video tutorials, how-tos, deliverables, etc.) at http://www.panacea-lr.eu

Tools

Web Services: SOAPLAB 2 (SOAP)
- Web application for deploying command line tools as WS
- No coding needed! Metadata only
- Services deployed by ILSP at http://nlp.ilsp.gr/ws/

Workflow editor: TAVERNA
- Open source desktop application
- Imports Soaplab and other types of WS
- Allows for combination of WS in workflows (http://www.taverna.org.uk/)

Registry: BioCatalogue
- Web application for registering and documenting WSs (http://registry.elda.org)
- Search function
- Auto-checks web services status
- Annotations: tags, categories, etc.

Social network: myExperiment
- Share workflows, files, data, etc.
- Share opinions and comments, create work groups, etc. (http://myexperiment.elda.org)
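The "metadata only" idea can be illustrated with a minimal sketch. It is not Soaplab's own metadata format or API: the dictionary layout, the `run_service` helper and the toy uppercase tool are all assumptions made for illustration. The point is only that a generic wrapper can expose a command line tool once its command and parameters are described as data.

```python
import subprocess
import sys

# Hypothetical description of one command line tool.
# A plain dict is used here for illustration; it is NOT Soaplab's metadata format.
UPPERCASE_TOOL = {
    "name": "uppercase",
    # A portable toy "tool": a Python one-liner that upper-cases its argument.
    "command": [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"],
    "parameters": ["text"],
}


def run_service(metadata, **params):
    """Generic wrapper: builds the command line from metadata and runs it."""
    missing = [p for p in metadata["parameters"] if p not in params]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    argv = metadata["command"] + [str(params[p]) for p in metadata["parameters"]]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()


output = run_service(UPPERCASE_TOOL, text="panacea")
print(output)
```

Adding another tool in this sketch means adding another dictionary, with no change to the wrapper code.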

Interoperability

• Three levels of interoperability:
– COMMUNICATION PROTOCOLS: SOAP, REST
– DATA
– PARAMETERS

[Figure: two Tool A → Tool B chains. In one, both tools show the format "ABCD" (caption: All tools understand the previous format). In the other, Tool A shows "YTQZ" and Tool B shows "ABCD" (caption: Tool B does not “understand” format N!)]

Travelling Object

• The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other (syntactic interoperability)
• First TO for annotations up to tagging and lemmatization
– Based on XCES (XML files with p, s, and t elements)
– Tools: formatConverters and stylesheets
• Second TO for everything else (NER, DepParsing, etc.)
– Based on GrAF (standoff annotation)
– One file for primary data
– One file for each annotation layer
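The difference between the two Travelling Objects can be sketched as follows. This is an illustration under stated assumptions, not the PANACEA schemas: only the element names p, s and t come from the slide, while the attribute names `lemma` and `pos`, the offset-based layer layout and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_inline(sentences):
    """Inline style (first TO): a <p> holds <s> sentences, which hold <t> tokens.

    The "lemma" and "pos" attribute names are illustrative assumptions.
    """
    p = ET.Element("p")
    for sentence in sentences:
        s = ET.SubElement(p, "s")
        for word, lemma, pos in sentence:
            t = ET.SubElement(s, "t", {"lemma": lemma, "pos": pos})
            t.text = word
    return p


def build_token_layer(text, words):
    """Standoff style (second TO): the primary text stays untouched and the
    layer only stores character offsets that point into it."""
    layer = []
    cursor = 0
    for word in words:
        start = text.index(word, cursor)
        end = start + len(word)
        layer.append({"start": start, "end": end})
        cursor = end
    return layer


sentence = [("Tools", "tool", "NNS"), ("interoperate", "interoperate", "VBP")]

# Inline: text and annotations live in the same XML tree.
inline = build_inline([sentence])
print(ET.tostring(inline, encoding="unicode"))

# Standoff: one object for primary data, one per annotation layer.
primary = "Tools interoperate"
token_layer = build_token_layer(primary, [word for word, _, _ in sentence])
pos_layer = [{"token": i, "pos": pos} for i, (_, _, pos) in enumerate(sentence)]
print(token_layer)
print(pos_layer)
```

In the standoff variant a new layer (for example NER) can be added as another list without rewriting the primary data or the existing layers.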

Common Interface

• A Common Interface (CI) defines the mandatory parameters for every type of WS

http://panacea-lr.eu/en/info-for-professionals/documents/
http://registry.elda.org
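A rough sketch of what "mandatory parameters per type of WS" can mean in practice is given below. The service types and parameter names are hypothetical placeholders, since the slide does not list them; the actual definitions are in the CI documentation linked above.

```python
# Hypothetical mandatory parameters per service type.
# These names are placeholders, not the parameters defined by the PANACEA CI.
MANDATORY = {
    "tokenizer": {"input", "language"},
    "aligner": {"source", "target"},
}


def missing_parameters(service_type, request):
    """Return the mandatory parameters that the request does not provide."""
    required = MANDATORY[service_type]
    return sorted(required - set(request))


# A tokenizer call that forgets to say which language the input is in.
print(missing_parameters("tokenizer", {"input": "http://example.org/doc.txt"}))
```

With a shared table of this kind, a client can check a call against the service type before sending it, whichever partner hosts the service.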

Soaplab Web Services

• 28 Corpus Acquisition and Annotation Web Services
• NLP WSs focusing on sentence splitting, tokenization, tagging, lemmatization and parsing, e.g.:
– EN, FR: Berkeley tagger and parser (DCU)
– ES: UPF tools, Freeling; IT: ILC’s DESR, Freeling
– DE and EL: LT’s and ILSP’s in-house tools
• WSs for conversion from and to PANACEA’s Travelling Object (@UPF and ILC)
• WSs for alignment of parallel data (@DCU)

Corpus Acquisition WS

• Focused Bilingual Crawler (FBC)
– Documentation: http://registry.elda.org/services/127
– Test at http://nlp.ilsp.gr/soaplab2-axis/#ilsp.ilsp_bilingual_crawl_row
– Sample topic definition for crawling EN-FR pages in the Environment domain: http://nlp.ilsp.gr/panacea/testinput/bilingual/ENV_topics/ENV_EN_FR_topic.txt
– Seed URL for crawling EN-FR ENV data: http://nlp.ilsp.gr/panacea/testinput/bilingual/ENV_EN_FR_greenfacts.txt
• Focused Monolingual Crawler (FMC)
– Documentation: http://registry.elda.org/services/160
– Test at http://nlp.ilsp.gr/soaplab2-axis/#ilsp.ilsp_fmc_row
– Topic definition for crawling EN ENV data: http://nlp.ilsp.gr/panacea/testinput/monolingual/ENV_topics/ENV_EN_topic.txt
– List of seed URLs for crawling EN ENV: http://nlp.ilsp.gr/panacea/testinput/monolingual/ENV_seeds/ENV_EN_seeds.txt

Taverna Workflow Demo

• How can I align crawled data?
• Search for a DCU hosted alignment service at http://myexperiment.elda.org/workflows?query=align
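The search link above is just the workflows page with a `query` parameter, so links for other search terms can be built the same way. A minimal sketch follows; the helper name `workflow_search_url` is an assumption, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the workflow listing, taken from the slide.
BASE = "http://myexperiment.elda.org/workflows"


def workflow_search_url(term):
    """Build a search link by URL-encoding the term into the query string."""
    return f"{BASE}?{urlencode({'query': term})}"


url = workflow_search_url("align")
print(url)
```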