The 3rd European Language Resources and Technologies Forum
Language Resources in the Sharing Age ‐ the Strategic Agenda
Venezia, 26‐27 May 2011
Proceedings
Edited by: Calzolari N., Baroni P., Soria C., Goggi S., Monachini M., Quochi V.
Istituto di Linguistica Computazionale del CNR ‐ Pisa, ITALY
Table of Contents
Program
Opening Session
Session 1 – Identification and tracking of Language Resources
Session 2 – Open Data
Session 3 – Go green: reuse, repurpose and recycle resources
Session 4 – Innovation needs data
Session 5 – Data for all languages: think big
Session 6 – Long life to our resources
Closing Session – From recommendations to actions
Organisation
Program

Wednesday 25th May 2011
20:30 Welcome Concert

Thursday 26th May 2011
9:00  Registration
10:00 Opening Session
11:00 Coffee Break
11:30 S1 Identification and tracking of Language Resources
13:15 Lunch
14:30 S2 Open Data
16:20 Coffee Break
16:50 S3 Go green: reuse, repurpose and recycle resources
18:30 End 1st Day
20:00 Social Dinner
Posters 9:00-18:30

Friday 27th May 2011
9:15  S4 Innovation needs data
11:00 Coffee Break
11:30 S5 Data for all languages: think big
13:15 Lunch
14:45 S6 Long life to our resources
16:30 Closing Session – From recommendations to actions
17:30 End 2nd Day
Posters 9:15-17:30
Thursday 26th May 2011
Opening Session – 10:00‐11:00
Chair: Nicoletta Calzolari
Nicoletta Calzolari (ILC‐CNR, IT / FLaReNet Coordinator)
Aleksandra Wesolowska (EC ‐ DG Information Society & Media ‐ Unit INFSO.E1 ‐ LTs & MT, LUX)
Flavio Gregori (Ca’ Foscari University ‐ Department of Comparative Linguistic and Cultural Studies, IT / Director)
Rodolfo Delmonte (Ca’ Foscari University, IT / Local Host)
The FLaReNet Recommendations Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet)
The third FLaReNet Forum
Nicoletta Calzolari, Claudia Soria
Istituto di Linguistica Computazionale, CNR, Pisa, Italy
FLaReNet – Fostering Language Resources Network – is an international Forum, composed of a steadily growing community, whose goals are:
o to coordinate a community-wide effort to analyse the sector of language resources and technologies along all the relevant dimensions: technical and scientific, but also organisational, economic, political and legal;
o to promote and sustain international cooperation;
o to identify short-, medium- and long-term strategic objectives and provide consensual recommendations in the form of a plan of action targeted to a broad range of stakeholders, from the industrial and scientific community to funding agencies and policy makers.
The FLaReNet Forum is the venue where leading experts in the field of Language Resources and Technologies (LRT) gather to present their visions, discuss some of the hot topics identified by FLaReNet, and consensually identify a set of priorities and strategic objectives. Many messages have recurred across the various meetings organised by FLaReNet, a sign both of great convergence around these ideas and of their relevance and importance to the field. The Forum is also meant to validate ideas that have been “in the air” for several years – and, in some cases, fostered and/or developed by specific groups – as having entered the mainstream of thought and practice within the LRT community.
To date, the major high-level FLaReNet recommendations are presented in the FLaReNet Blueprint of Actions and Infrastructures, which gathers the recommendations collected across the many meetings, panels and consultations of the community, as well as the results of the surveying and analysis activities carried out under the FLaReNet work packages. The Blueprint is the
result of a permanent and cyclical consultation that FLaReNet has conducted inside the community
it represents – with more than 300 members – and outside it, through connections with
neighbouring projects, associations, initiatives, funding agencies and government institutions.
The Blueprint is organised along three main “directions”: Infrastructural Aspects, Research and
Development, Political and Strategic Issues. They reflect three major development directions that
can boost or hinder the growth of the field of Language Resources and Technologies. Altogether
these directions are intended to contribute to the creation of a sustainable LRT ecosystem.
We present in a leaflet the three tables that synthesise – for the three directions – the main
challenges and the corresponding recommendations.
In this third Forum, the Blueprint – together with the recommendations in the major areas and along the various critical dimensions around Language Resources – is opened again to the community for improvement and validation, and the community is called upon to form a consensus on the top priorities. As a result of this community consultation, a document will be produced and then sent to all FLaReNet Institutional Members and National Contact Points for adoption and endorsement. This is an important step in demonstrating a critical mass around our recommendations, in order to sensitise those responsible for implementing the financial and political frameworks necessary to sustain the actions over the long term.
S1. Identification and tracking of Language Resources – 11:30‐13:15
Chair: Joseph Mariani
Introduction
Joseph Mariani (LIMSI/IMMI‐CNRS, FR / Chair)
Introductory Talks
Towards a Comprehensive Model for Language Resource Catalogs (with emphasis on non‐traditional resources) Chris Cieri (University of Pennsylvania ‐ Linguistic Data Consortium, USA)
The concept of BRIF (Bioresource research impact factor) as a tool to foster bioresource sharing in research: is it applicable to other domains? Anne Cambon‐Thomsen (CNRS, FR)
Contributions
Capturing Community Knowledge of LRs: the LRE Map Claudia Soria (CNR‐ILC, IT / FLaReNet)
A journey from LRE Map to Language Matrixes Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet)
Proposal for the International Language Resource Number Khalid Choukri (ELDA, FR / FLaReNet)
Discussants
Antonio Pareja‐Lora (Universidad Complutense de Madrid, SP)
Gil Francopoulo (Tagmatica & IMMI‐CNRS, FR)
Towards a Comprehensive Model for Language Resource Catalogs,
with emphasis on non-traditional resources
Christopher Cieri
Linguistic Data Consortium, University of Pennsylvania
Rapid growth in the inventory of language resources (LRs) has also led to a proliferation of the sites one must search in order to identify, evaluate and acquire those resources. Some intrepid
researchers have undertaken to ameliorate this problem by developing union catalogs and
technical infrastructure to support such catalogs. Those efforts have typically focused on what we
argue is a subset of all LRs. We argue here that the material cataloged must be expanded to include
– in addition to data sets – tools, specifications and, particularly, published papers that describe,
critique or build upon LRs.
In 1992, when the Linguistic Data Consortium (LDC)1 was founded, its goal was to serve as a central location where language resources, principally data sets at that time, could be hosted, archived and distributed under consistent terms in order to lower barriers to research. LDC principals recognized that the organization’s origins in the American experience would make it best suited to serve that market, despite early and ongoing efforts to adapt to international needs. Not surprisingly, a number of data centers have since opened around the world, including some others in the U.S. The European Language Resource Association (ELRA2) was created in France in 1995, followed by Gengo-Shigen-Kyokai3 (GSK) in Japan in 2003, the Chinese LDC4, and the LDC for Indian Languages5 in 2007, to name a few. In addition, there have been a number of national corpus initiatives, including the British6, American7, Dutch and Danish, among others. Finally, the number of individual laboratories and projects that create and distribute LRs directly has only grown over time.
Since, as is clear, some LR creators will continue to distribute independently and multiple data
centers will continue to operate, the communities that rely upon LRs also need union catalogs or
other portals that harvest metadata from multiple providers and present them in a unified
interface. Indeed, the Open Language Archives Community (OLAC) has developed a specification for such a catalog, along with multiple instantiations. According to their site, “OLAC Archives contain over
100,000 records, covering resources in half of the world's living languages”8. In addition, OLAC
gathers metadata records from 45 different providers.
LR catalogs differ in their treatment of LR types. The LDC9 and ELRA10 Catalogs focus on data sets.
The ELRA Universal Catalog11 contains corpora, lexicons and tools but apparently not academic
1 www.ldc.upenn.edu
2 www.elra.info
3 www.gsk.or.jp
4 www.chineseldc.org
5 www.ldcil.org
6 www.natcorp.ox.ac.uk
7 www.americannationalcorpus.org
8 www.language-archives.org
9 www.ldc.upenn.edu/Catalog
10 catalog.elra.info
11 universal.elra.info
papers or specifications except when they are included within data sets. Among the resource types
listed, there is no entry for either, and searches for appropriate terms return no papers. In contrast, OLAC archives do contain many records describing academic papers; however, they typically do not distinguish such papers from other textual resources such as corpora and specifications. The
LDCIL provides separate but consistent treatment of data, tools and specifications but apparently
not academic papers.
Notwithstanding the current practice among LR providers, LR users rely upon academic papers
and specifications as they work to exploit data sets and tools. Specifications describe the goals of
the LR as well as the assumptions of its developers, the methods used and the formats included.
Academic papers may further describe the LR, provide criticism, and explain prior attempts to
exploit the LR, their success and any lessons learned. In short, data sets, tools, specifications and
academic papers form a network of LRs that should be considered as a whole when any component
is exploited. Unfortunately, the LR user communities currently lack any easy way to find such
networks of resources. To resolve this problem we need a concerted effort to update our catalogs
with metadata entries for missing components and to develop links among the entries to express
the relations that connect the individual LRs. Ongoing maintenance must then follow the one-time
effort.
LDC has recently begun to catalog all academic papers that mention LDC corpora. To date we have
identified approximately 5000 papers mentioning about half of all our corpora. Our goal is to complete one pass over all LDC corpora to find related papers, making the results available and allowing paper authors to update them. Naturally, we encounter several challenges in this effort. There are many corpora in the LDC Catalog and multiple papers mentioning each. The papers span multiple research publications covering diverse communities. Paper authors cite LRs variably, sometimes providing the full name or catalog number, sometimes abbreviating in different ways. Therefore, determining that a paper mentions an LR requires reading at least part of the paper, yet no single researcher commands the skills required to adequately understand whether and how the LR and paper are related. Still, we believe this effort is worthwhile. The results will prove beneficial to: 1) LR users, who will gain a better understanding of prior uses and related LRs developed; 2) LR creators, who will better monitor critiques of their work and also understand its impact; 3) new development projects, which benefit from lessons learned in prior efforts.
The concept of BRIF (Bioresource research impact factor) as a tool
to foster bioresource sharing in research: is it applicable to other
domains?
Anne Cambon-Thomsen
CNRS
For the working group BRIF
Inserm and University of Toulouse, UMR 1027, 31000 Toulouse, France
Introduction
Concept. Numerous health research funding institutions have recently expressed their strong will
to promote data sharing. As underlined in a recent editorial in Nature Medicine, an operational
approach is needed to achieve this goal1. Bioresources such as biobanks, databases and
bioinformatics tools are important elements in this landscape. Bioresources need to be easily
accessible to facilitate advancement of research. Besides technical and ethical aspects, a major
obstacle for sharing them is the absence of recognition of the effort behind establishing and
maintaining such resources. The main objective of a Bioresource Research Impact Factor (BRIF) is
to promote the sharing of bioresources by creating a link between their initiators/implementers
and the impact of scientific research using them.2 A BRIF would make it possible to trace: 1. the
quantitative use of a bioresource, 2. the kind of research utilising it, and 3. the efforts of people and
institutions that construct and make it available.
An international working group (http://www.gen2phen.org/groups/brif-bio-resource-impact-
factor). In the context of EU projects, a BRIF working group has been set up, so far including 105 participants. The work involves several steps: 1. creating a unique identifier; 2. standardising
bioresource acknowledgment in papers; 3. cataloguing bioresource data access and sharing
policies; 4. identifying other parameters to take into account; and 5. prototype testing, involving
volunteer bioresources and the help of journal editors.
A workshop. The first BRIF workshop was held in Toulouse, France (17-18 January 2011),
gathering 34 people from 10 countries, representing various domains: biobanks, genome
databases, epidemiological longitudinal cohorts, bioinformatics, scientific publishing, bibliometry,
health law and bioethics3. The lack of objective measures of the use of bioresources was recognised by all; we focused on shared aims, but underlined that each community has specific aspects to consider and resolve (http://precedings.nature.com/collections/brif-workshop-january-2011).
Main avenues explored and further steps
Identification. Bioresources need to be identified by a unique digital identifier (ID), ideally via
existing mechanisms4. Digital Object Identifiers (DOIs) may be interesting5. Several issues must be
1 Cambon-Thomsen A. Assessing the impact of biobanks. Nat Genet 2003; 34: 25–26.
2 Kauffmann F, and Cambon-Thomsen A. Tracing Biological Collections Between Books and Clinical Trials. JAMA 2008; 299: 2316–18
3 Cambon-Thomsen et al. The role of a bioresources impact factor as an incentive to share human bioresources. Nat genet 2011, in press
4 Peterson J, and Campbell J. Marker papers and data citation. Nat Genet 2010; 42: 919
considered, including: what to identify (biobank, collection, database, dataset, subset); identifier
requirements (persistent over time, globally-unique, citable); and which international and
independent body should be responsible for assigning bioresource IDs. Working subgroups were
created to address those questions. Attribution of credit to scientists for different kinds of work (in
addition to publications) using researcher IDs was also discussed. The ORCID initiative (http://www.orcid.org) is building a new contributor ID framework which should in principle enable credit to be given both to bioresources and to the individuals involved in their creation and maintenance.
Citation. Standardisation is necessary, but could be combined with existing referencing standards and conventions6, such as citing marker papers, using standardised sentences in the Materials & Methods or acknowledgements sections of papers, granting co-authorship when justified, and including the resource name in the paper title. Specific requirements for citing bioresources are lacking in the Uniform Requirements for Manuscripts Submitted to Biomedical Journals and should be added. To enable automated tracking of bioresource use, the bioresource ID should ideally appear in or under the abstract, so that it is visible even without access to the full text of articles.
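To illustrate why a visible ID enables automated tracking, the sketch below scans abstract text for identifiers with a purely hypothetical "BRIF:" prefix; no such identifier format has actually been standardised, and the pattern is an assumption for the example only.

```python
import re

# Hypothetical ID shape: "BRIF:" followed by an alphanumeric token
# (letters, digits, dots, hyphens). This format is illustrative only.
BRIF_ID = re.compile(r"\bBRIF:([A-Za-z0-9.-]+)")

def extract_ids(abstract: str) -> list:
    """Return all bioresource identifiers mentioned in an abstract."""
    return BRIF_ID.findall(abstract)
```

With the ID placed under the abstract, a journal crawler could attribute use of a bioresource without ever fetching the full text.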
Factors to take into account in impact factor calculation. BRIF should not be a citation index only.
Factors such as time and domain of bioresources need to be considered in the calculation process
and its weighting. Although the BRIF scope could be extended to measure many different aspects
of bioresource utilisation, including economic implications, it was decided to concentrate first on
use and impact in research settings.
Access and sharing policies. These have been developed over the years. However, incentivising bioresources to promote access needs to be balanced with appropriate provisions compatible with their interests: proper recognition of scientific contribution, and sustainability supported by the capacity to measure their own impact. At present, there are no mechanisms in place to measure this impact. Empowering bioresources with tools such as BRIF is, therefore, urgent.
Perspectives. The full impact of bioresources is wider than BRIF, but BRIF is an essential
operational step. The present proliferation of ideas, statements and proposals around data sharing
from different perspectives and stakeholders favours the emergence of tools such as BRIF in order
to make data sharing principles operational. In the same spirit it could be applied to the domain of
language studies.
5 International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Writing and Editing for Biomedical Publication. Version April 2010: http://www.icmje.org/urm_main.html (accessed Feb 11, 2011).
6 Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature 2009; 461: 168–70.
Capturing Community Knowledge of LRs: the LRE Map
Claudia Soria
CNR-ILC, Italy
The LRE Map of Language Resources and Tools is an initiative jointly launched by FLaReNet and ELRA in May 2010 with the purpose of developing an entirely new instrument for discovering, searching and documenting language resources, here intended in a broad sense as both data and tools. The LRE Map was initially launched in conjunction with the LREC 2010 Conference, conceived as a campaign for collecting information about the language resources and technologies underlying the scientific work presented at that conference. To collect this information, authors who submitted a paper were requested to provide information about the language resources they either developed or used. The required information was fairly basic: the type of the resource, the language and modality represented, the intended or actual application purposes, the degree of availability for further use, the maturity status, the size, the type of license, and the availability of documentation.
In a rather short time, the LRE Map contained more than 2000 descriptions of resources, and it soon became a very popular initiative, joined by the COLING and EMNLP conferences as well. Other conferences, such as Interspeech, ACL-HLT and IJCNLP, have already agreed to take part and will use the LRE Map at their next conferences. This shows that the idea has great potential, and there is reasonable confidence that it can become a “standard and regular” instrument at LRT conferences. Growth in the number of resources and the quantity of information about LRT can easily be foreseen.
So far the LRE Map is both a set of metadata about LRT collected at three major conferences (during 2010) and a web interface designed to search and browse this data. The web interface currently provided to the community is a very simple one based on non-normalised data, while a new release (currently an alpha version) offers a better visualization of the data and a login system for simple access management to resources. Moreover, it is based on (partially) normalised data.
The LRE Map web interface provides the possibility of searching according to a fixed set of metadata (designed for conference submission) and of viewing/editing the details of extracted resources (editing is allowed only if the user is directly related to the resource, i.e. when the user is an author of the paper that cites it). In addition, the database contains many implicit relations, for example among authors (who are related through the same resources) and among resources (which are cited by the same authors in different papers).
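As an illustration of how such implicit relations could be extracted – this is a sketch under assumed field names, not the actual LRE Map implementation – the following derives author–author relations from raw (author, resource) pairs:

```python
from collections import defaultdict
from itertools import combinations

def author_relations(pairs):
    """From (author, resource) pairs, return the set of author pairs
    implicitly related by being associated with at least one shared resource."""
    by_resource = defaultdict(set)
    for author, resource in pairs:
        by_resource[resource].add(author)
    related = set()
    for authors in by_resource.values():
        # Every pair of authors of the same resource is implicitly related.
        for a, b in combinations(sorted(authors), 2):
            related.add((a, b))
    return related
```

The symmetric trick (grouping by author instead of by resource) yields resource–resource relations.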
The potential of the LRE Map for becoming a powerful aggregator of information related to
language resources was immediately clear, as was the possibility of deriving and discovering new
combinations of information in entirely new ways. The database underlying the LRE Map can yield
interesting matrices of the language resources available for the various languages, modalities, or
applications. Such matrices have already been used, for example, in META-NET to provide a
picture of the situation of resources for the various European languages.
Although the LRE Map was not realized as a social platform, it was conceived as a community-based initiative. Unlike other catalogues maintained by institutions worldwide (ELRA, LDC, the National Institute of Information and Communications Technology (NICT), the ACL Data and Code Repository, OLAC, LT World, etc.), the LRE Map presents a set of innovative features, since it is built by the community according to a bottom-up strategy and without conforming to a strict template of possible metadata and values. The Map truly captures the knowledge of the LRT community about language resources.
The immediate success and impact of the LRE Map, together with the spontaneous agreement of major conferences/associations in the LRT field to adopt it, now require moving from a prototype to a stable and solid service that is both a repository of information about language resources and, at the same time, a community for users of resources: a place to share and discover resources, discuss opinions, provide feedback, and spot new trends.
A Journey from the LRE Map to the Language Matrixes
Joseph Mariani
LIMSI-CNRS & IMMI, France
The objective of the Language Matrixes developed within META-NET is to provide a clear picture of what exists in terms of Language Resources (LRs), in the broad sense (Data, Tools, Evaluation and Meta-resources), for the various languages, and to highlight the languages that lack such resources. The goal would then be to ensure the production of the corresponding resources to fill the gaps for those languages.
The Language Matrixes provide an easy way to get that picture and to access the details of the corresponding resources.
We built those matrixes from the LRE Map, which was produced from the information provided by the authors of the papers submitted to the LREC 2010 conference, which gathers the international community working in the field of Language Resources and Evaluation. Each author was requested to provide a set of information on the Language Resource(s) mentioned in their paper, through an online questionnaire that includes suggestions as an aid to the author. This resulted in a table of close to 2,000 entries. This information was then made available to the scientific community, through the same interface as the one used for the information acquisition.
The Language Matrixes were automatically built from that table. In this first analysis, we considered the 23 official languages of the European Union, together with a category for “Regional European languages” and one for “Non-EU European languages”, as well as “Multilingual”, “Language Independent” and “Not Applicable” categories. We produced 8 Language Matrixes on: Multimodal/Multimedia Data and Tools, Written Language Data and Tools, Spoken Language Data and Tools, Evaluation, and Meta-resources (standards, metadata, guidelines). Several Types of resources are listed in each matrix, corresponding either to the 24 Types suggested in the questionnaire or to the author’s own entry when no suggested Type was found appropriate. This results in a total of about 150 Language Resource Types, with a variable number for each matrix (from 5 Types for Evaluation to 78 Types for Written Language Tools).
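As a sketch of how such matrixes can be derived from the cleaned table – an illustration, not the actual META-NET code – the snippet below counts (language, Type) pairs into a nested dictionary; a zero cell then signals a potential gap for that language.

```python
from collections import defaultdict

def build_matrix(entries):
    """Build a {language: {resource_type: count}} matrix
    from cleaned (language, resource_type) entries."""
    matrix = defaultdict(lambda: defaultdict(int))
    for language, resource_type in entries:
        matrix[language][resource_type] += 1
    return matrix

# Toy entries standing in for the cleaned LRE Map table.
entries = [
    ("English", "Corpus"), ("English", "Lexicon"),
    ("English", "Corpus"), ("Maltese", "Corpus"),
]
m = build_matrix(entries)
# m["English"]["Corpus"] is 2, while m["Maltese"]["Lexicon"] is 0 (a gap)
```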
Those matrixes show that English is by far the most resourced language, followed by French and German, Spanish, Italian and Dutch. Some languages are clearly under-resourced, such as Irish Gaelic, Slovak or Maltese. Given the large number of Types expressed by the authors, some may exist for only one language, and the matrixes therefore show a large number of zeroes for all the other languages. We nevertheless preferred to keep that information as such rather than merging it into an “Other Type” category, as those singletons may be weak signals announcing a new research trend. Another option would be to merge those singletons into a single “Other” category to facilitate the browsing of the Language Matrixes.
In order to produce the Language Matrixes, we had to conduct a tedious process of cleaning up the data, as the information was not always provided in the proper format, despite the suggested terms, and as each declared LR should only be counted once. This process reduced the number of entries from 2,000 to about 1,500.
We first cleaned up the names of the LRs, as different wordings may be used by different authors for the same LR. We found that those different wordings may not sort together in alphabetical order (e.g. acronyms), and we faced the issue of how to treat different versions of the same LR over time, as well as the subparts of an LR. This has to be decided by hand on a case-by-case basis. For this purpose, we found it useful to gather some LRs into LR families. This cleaning process clearly demonstrated the need to label LRs in a persistent and unique way (PUiD: Persistent and Unique Identifier) in order to better identify and track them.
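Part of this name clean-up can be mechanised by normalising names before comparison, as in the sketch below; the normalisation rules here are illustrative assumptions, and versions, subparts and families still require the manual, case-by-case decisions described above.

```python
import re

def normalise_name(name: str) -> str:
    """Reduce superficial wording variation: case, punctuation, extra spaces."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)   # drop punctuation such as hyphens
    name = re.sub(r"\s+", " ", name).strip()
    return name

def group_variants(names):
    """Group differently-worded names that normalise to the same key."""
    groups = {}
    for name in names:
        groups.setdefault(normalise_name(name), []).append(name)
    return groups
```

Each resulting group is a candidate for a single catalogue entry, to be confirmed by hand.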
We then cleaned up the LR modality (Multimodal/Multimedia, Written Language, Spoken Language, Evaluation, Meta-resources and Not Applicable). The main problem here is that an LR may address several modalities (written and spoken language, for example); in this case, we counted the LR in both modalities. We also had to harmonize possible differences among authors.
The next cleaning addressed the LR Types. In addition to the 24 suggested Types organized in 4 categories (Data, Tools, Meta-resources and Evaluation), the authors proposed 127 new Types. Some of them correspond to mistakes by the authors (i.e. the Type already existed, possibly with a different wording). Others (23) correspond to Types that were missed when producing the list of suggested Types (in this case, the new Type is often mentioned by several authors, sometimes with different wordings). Others correspond to LRs belonging to several Types. Finally, there is a long tail of other Types, most of them mentioned only once by a single author, which together represent only 5% of the LRs.
The last cleaning process addressed the Languages, as authors may have used different spellings or codes (such as the ISO ones). This could be facilitated in the future by providing the list of existing languages. In line with our objective, only the 23 official EU languages were considered, while the European non-EU languages were grouped into a single category, and likewise for the EU regional languages. A “multilingual” category was used when the number of languages mentioned was large.
Despite this cleaning process, and although it may have contained obvious mistakes, all initial information provided by the authors has been retained as the reference information. The cleaned information is used for conducting data analysis and for searching. The general need for data cleaning is nowadays clearly identified, and Google offers the Google Refine1 tool to facilitate this task.
Since the LRE Map was produced at the LREC 2010 conference, more data have been harvested at other conferences such as EMNLP 2010 and COLING 2010, and will be included in the next versions of the Language Matrixes, along with the LRs appearing in journals, such as the Language Resources and Evaluation journal, and in Language Resource catalogues, such as the LDC and ELRA ones. More conferences have agreed to participate in the building of the LRE Map (Interspeech, ACL-HLT, Oriental Cocosda, etc.). Based on our findings during the data cleaning, we will complement and improve the suggested information provided to the authors in order to facilitate their task, which will also lighten the cleaning process.
Building the Language Matrixes from actual data provided by authors makes it possible to reflect, over time, a landscape that is continuously evolving with more and more LRs. Within META-NET, the Language Matrixes have already started being used for identifying the Language Gaps and for writing the Language Tables in the Language Reports.
1 http://code.google.com/p/google-refine/
The next steps will be to:
o increase the number of identified LRs through more conferences, journals and catalogues;
o extend the range of languages being considered;
o include the analysis of Sign Languages;
o improve the coherence of the inputs in a loop where the taxonomy and the metadata are refined through the analysis of the matrixes and thanks to the work on metadata within META-NET, and reflected in the terms suggested in the questionnaire;
o track the use and quality of the identified LRs as they appear in conference and journal papers.
In order to improve the analysis of LRs, a major step would be to attach a PUiD to each LR, taking into account LR families, their various parts and their evolution over time, including all contributors.
Proposal for the International Language Resource Number
Khalid Choukri, Jungyeul Park, Victoria Arranz, Olivier Hamon
ELRA/ELDA, France
In this paper, we propose a new principle for assigning a Persistent IDentifier (PID) to LRs within the human language technology area.
Every object in the world requires a kind of identification to be correctly recognized and easily "discoverable". Traditional printed materials such as books, for example, have generally used the International Standard Book Number (ISBN), the Library of Congress Control Number (LCCN), the Digital Object Identifier (DOI) and several other numeric identifiers as unique identification schemes. Book identifiers allow us to easily "identify" books in a unique way. There are already
several identifier schemes in other domains. In computer programming languages, Identifiers
(IDs) are usually lexical tokens that name language entities. The Electronic Product Code (EPC) is
also designed as a Universal Identifier that provides a unique identity for every physical object. The
canonical representation of an EPC is also a Uniform Resource Identifier (URI). A Part Number
(PN) is a unique identifier of a part used in a particular industry and unambiguously defines a part
within a single manufacturer. A Universally Unique IDentifier (UUID) is an identifier standard
used in software construction, standardized by the Open Software Foundation. An Accession
Number in bioinformatics is a unique identifier given to a Deoxyribonucleic acid (DNA) or protein
sequence record to allow for tracking of different versions of that sequence record and the
associated sequence over time in a single data repository. Biomedical research articles already have
a PubMed Identifier (PMID).
In this presentation, we propose the use of new identifier schemes for Language Resources (LRs) to
be identified, and consequently to be recognized as proper references. It is also a major step in the
emerging NLP networked and shared world. Unique resources must be identified as what they are
and meta-catalogues need a common identification format to manage such data correctly.
The ELRA Catalogue offers a repository of the LRs made available through ELRA. The catalogue contains over 1000 LRs in more than 25 languages. Other LRs identified all over the world, but not available through ELRA, can also be viewed in the Universal Catalogue. The current LR identifiers in the ELRA Catalogue consist of 4 digits followed by a systematic pattern (B|S|E|W|M|T|L), where B signifies a bundle that can contain several LRs, and S|E|W|M|T|L denote Speech, Evaluation, Written and Multilingual corpora, and Terminology and Lexicon resources, respectively. LDC uses its publisher code with a year number followed by (S|T|V|L) and 2 digits, where S|T|V|L denote speech, text, voice and lexical(-related) corpora, respectively. ISO is also working towards a PID framework and practices for referencing and citing LRs, both in documents and in the LRs themselves. The DOI System likewise provides a good framework for PIDs, usable for any form of data management.
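As a concrete illustration, the two catalogue numbering schemes just described can be captured with simple patterns. The sketch below is only an assumption based on this description; the prefixes and exact digit counts are not the official specifications.

```python
import re

# Hypothetical parsers for the identifier patterns described above:
# ELRA: 4 digits followed by a type letter (B|S|E|W|M|T|L);
# LDC: publisher code, year number, type letter (S|T|V|L) and a serial.
ELRA_ID = re.compile(r"^ELRA-(\d{4})([BSEWMTL])$")
LDC_ID = re.compile(r"^LDC(\d{2,4})([STVL])(\d{1,2})$")

def resource_type(identifier):
    """Return the single-letter resource-type code, or None if unrecognized."""
    for pattern in (ELRA_ID, LDC_ID):
        match = pattern.match(identifier)
        if match:
            return match.group(2)
    return None
```

Under these assumed layouts, `resource_type("LDC93S01")` would yield `"S"` (speech), while an unrecognized string yields `None`.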
We propose a practical implementation of the International Language Resource Number and elaborate on an international framework, with the requirements and expectations to take into account for such a set-up.
S2. Open Data – 14:30‐16:20
Chair: Nancy Ide
Introduction
Nancy Ide (Vassar College, USA / Chair)
Introductory Talks
Open Data and Language Resources Timos Sellis (IMIS ‐ "Athena" R.C., GR)
Multilingual Linking Open Data Key‐Sun Choi (KAIST, KR)
Beyond Open Data to Open Advancement Eric Nyberg (Carnegie Mellon University, USA)
Contributions
To Create Commons in order to Open Data Danièle Bourcier (CNRS & Creative Commons France, FR)
Is Our Relationship With Open Data Sustainable? Denise DiPersio (LDC, USA)
Opening the Language Library: let's build it together! Nicoletta Calzolari (CNR‐ILC, Italy / FLaReNet)
Discussants
Thierry Declerck (DFKI, DE)
António Branco (University of Lisbon, PT)
Thibault Grouas (Ministry of Culture of France ‐ Office for the French Language and Languages of France, FR)
Open Data and Language Resources
Timos Sellis, Spiros Athanasiou
Institute for the Management of Information Systems, “Athena” Research Center
A brief introduction on Open Data
Open Data, i.e. data that can be freely used, reused and redistributed by anyone [1], is not a recent trend, but rather a historically established scientific practice. The goal of providing open scientific data is noble and a necessity for scientific advancement: sharing knowledge, evaluating research, educating, and promoting interdisciplinary activities. Recently, open data practices have spread beyond the S&T realm and into the mainstream political and technological agenda. A new movement has been formed: the open-data movement.
The reasons for this development stem from needs established in two separate but interlinked domains: politics and S&T. On a political level, open data is perceived as a means to enforce transparency and accountability, i.e. core democratic values. Further, the reuse of Public Sector Information (PSI) can boost the economy, enabling the private sector to develop and monetize value-added services. On an S&T level, the need to share data to promote scientific advancement is ever increasing. Further, the World Wide Web in its current and upcoming iterations (Web 2.0/3.0) offers significant technical opportunities for collaboration and knowledge management, yet to be harnessed.
Consequently, this mix of technological, political, and economic support for open data is unique, since it provides excellent opportunities for sustainable growth and business development, advances in research, and cross-disciplinary research agendas.
Learning from geospatial data
Geospatial data offers a great source of examples, applied technical solutions, and policy models that attest to the tangible benefits of open data, data reuse and common sharing. Geospatial data is important in this respect for three reasons. First, geospatial data accounts for roughly 80% of public data. Second, several mature international, EU, and national efforts exist to promote reuse and openness (e.g. the INSPIRE Directive). Third, we can learn from well-documented use cases concerning real-world return on investment (ROI) and cost-benefit analysis (CBA).
In a nutshell, the status quo in the ecosystem of geospatial data rests on four complementary driving forces: standardization, maturity, vendor support and an active FLOSS community.
• Standardization activities for metadata, data, and services are handled by international organizations with a mutual and deep understanding of the benefits of interoperability. ISO, the Open Geospatial Consortium, and the INSPIRE Directive have established a common ground for all stakeholders, actively evolve standards, and promote uptake in all fields relating to geospatial data.
• Geospatial data sharing among public bodies has long been established for practical reasons: geospatial data is expensive to produce, must be compatible, and is needed for policy making at all levels of the administration and across various domains. The concept of Spatial Data Infrastructures (SDIs), i.e. formal technical, administrative, and legislative frameworks to support geospatial data sharing, was coined in the early 80s, even in an analogue form.
• Vendor support for standards-based access and use of geospatial data, as well as for SDIs, is practically unanimous. Almost all commercial products (GIS, geospatial databases) are compliant with ISO/OGC standards. Therefore, data reuse and common sharing is offered almost out of the box.
• Finally, the FLOSS community has a pivotal role, offering open and ready-to-deploy software and services for geospatial data.
Out of several examples in the literature, the case of Denmark [2] is remarkable and clearly establishes the economic gains of open data. In 2002, the Danish government decided to provide the address dataset free of cost, since addresses are an integral part of the majority of IT systems and services, in both the public and private sectors. The costs of curating and publishing the addresses were around 2M Euros (2005-2009). However, the direct financial benefits for the same period were around 62M Euros, allocated 30%/70% between the public and the private sector respectively. This corresponds to an ROI of 31. Moreover, for 2010, the costs for the public sector will be 0.2M Euros and the benefits 14M Euros, i.e. an ROI of 70.
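The ROI figures above follow directly from dividing benefits by costs, as this small check shows:

```python
# Sanity check of the Danish address-data figures cited above:
# ROI is computed as direct benefits divided by costs.
costs_2005_2009 = 2.0      # million euros (curation and publishing)
benefits_2005_2009 = 62.0  # million euros (direct financial benefits)
roi_2005_2009 = benefits_2005_2009 / costs_2005_2009
print(roi_2005_2009)  # 31.0

costs_2010 = 0.2
benefits_2010 = 14.0
roi_2010 = benefits_2010 / costs_2010
print(roi_2010)  # 70.0
```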
Opening up data
Opening up data is actually the last step one should tackle, and certainly not a technical challenge;
it is just a matter of choosing an appropriate license. Instead, the focus should be on how to make
our resources (data and services) discoverable and reusable. Without delving into technical details,
practice dictates that opening up data is a three step process:
Step 1. Set up a catalogue. Provide standard metadata, assign unique identifiers, and build a catalogue of your resources. In this manner a user can discover your resources and use a static identifier to reference them. Note that the data themselves do not need to be available at this point.
Step 2. Harmonization. Provide standard APIs (de facto or de jure) to query the catalogue. In this manner your catalogue can be used by third-party systems, and catalogues can be federated. Again, no actual data need be published at this point.
Step 3. Licensing and business models. This is where you have to decide on whether to
publish the data or not, and if so, under which license. Decisions must be based on a
business model (e.g. sell the data, share data amongst a closed group, provide the data for
free) which then shapes the catalogue and its services.
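The three steps might be sketched as a minimal catalogue record. All field names and the identifier scheme below are illustrative assumptions, not a prescribed format:

```python
import json

# Step 1: standard metadata plus a stable identifier; note that neither a
# license nor the data itself needs to exist yet.
record = {
    "id": "urn:example:lr:0001",   # static identifier used for referencing
    "title": "Example Speech Corpus",
    "language": "da",
    "metadata_standard": "Dublin Core",
    "license": None,               # decided only at Step 3
    "data_url": None,              # data need not be published yet
}

# Step 2: exposing the record in a standard, machine-readable form lets
# third-party systems query and federate the catalogue.
serialized = json.dumps(record)
```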
It is important to consider that choosing an “open” license does not equate to complete loss of
rights. Open vs. closed is not equivalent to black vs. white. There is a spectrum of open licenses
with all shades of grey in between. For example, the Creative Commons licenses provide several optional restrictions when licensing open data, to accommodate various needs. These restrictions can be combined to create a specific CC-compatible license. Specifically, CC provides clauses
that (a) require attribution (indirect gains for the publisher), (b) forbid commercial work (direct
gains for the publisher in commercial applications), (c) forbid derivative work (direct gains for the
publisher), and (d) demand share-alike (indirect gains for the publisher).
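Enumerating the combinations of these clauses reproduces the familiar CC license family. One caveat, encoded below, is that in practice the no-derivatives and share-alike clauses are never combined, since share-alike only applies to derivative works; the rest of the enumeration is a direct sketch of the clauses above.

```python
from itertools import combinations

# Build license names from the optional clauses; attribution (BY) is part
# of every license in this sketch.
optional_clauses = ["NC", "ND", "SA"]  # non-commercial, no-derivatives, share-alike
licenses = []
for r in range(len(optional_clauses) + 1):
    for combo in combinations(optional_clauses, r):
        if "ND" in combo and "SA" in combo:
            continue  # share-alike presupposes derivative works
        licenses.append("CC-BY" + "".join("-" + c for c in combo))

print(licenses)
# ['CC-BY', 'CC-BY-NC', 'CC-BY-ND', 'CC-BY-SA', 'CC-BY-NC-ND', 'CC-BY-NC-SA']
```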
Where do Language Resources stand?
The question of whether Language Resources should follow the open data movement is misleading. Providing open data should not be the primary objective, but rather one of the instruments to promote growth in the LR community at the EU level. All stakeholders (the S&T community, SMEs, the EC) should study the current landscape and Europe’s competitive advantage and unique attributes, in order to agree on a common agenda for growth and cooperation in the LR field.
There are many questions that need to be meticulously answered.
What is the market value of LR in the EU? Establishing a data economy on LR should first
focus on identifying exactly where we stand, on an international level.
What is the potential market value of LR data? Given the EU’s needs, priorities, and goals, is there room for growth? If so, how can growth be achieved? Does the EC need to intervene, or can market forces pursue this goal alone?
Are LR data discoverable? To promote the data economy we should first ensure that LRs can be found, even at the lowest technical level.
What do we gain by closing data? Are business models based on closed and heavily guarded
LR actually successful? Are we losing opportunities for growth by not systematically
exploiting common sharing and synergies?
What do we lose by opening up data? Can the direct income lost be compensated by direct or indirect financial gains?
Experience in similar data-driven ICT markets has demonstrated the steps the LR community
should follow. Standardization, harmonized licensing, LR marketplaces, and common sharing
agreements, are a necessity to bootstrap the LR economy in Europe.
[1] www.opendefinition.org
[2] http://www.adresse-
info.dk/Portals/2/Benefit/Value_Assessment_Danish_Address_Data_UK_2010-07-07b.pdf
Multilingual Linking Open Data
Key-Sun Choi, Eun-Kyung Kim, Dong-Hyun Choi
Web Science and Technology Division, Computer Science Department, KAIST, Korea
Ontology Population and Enrichment for Semantic Evolution of LOD
Linking Open Data is a syntactic linking of datasets: it cannot establish whether linked objects are really the same, and equivalent objects may fail to be linked simply because different strings are used to label them.
Ontology population links instances in the real world to an already-made ontology; ontology enrichment mines the relations between those instances, as defined in the ontology schema.
It is not easy to view the whole LOD space as one ontology, and whether this is necessary for real applications is itself controversial. Yet almost all of LOD is a DBpedia-centric linked structure, built through exact matches of URI names, and problems remain: the redundant presence of objects makes it hard to aggregate all relevant data, the semantic interoperability/equivalence of links cannot be proven, and even non-linked objects can introduce inconsistency into the whole space of linked open data. We therefore pursue an elegant ecosystem that maintains LOD semantics by incorporating ontology population and enrichment mechanisms into the evolving LOD, incrementally converging LOD towards LOD2, a so-called semantically stable LOD.
The basic strategy is as follows. It should be recognized that DBpedia and its mother resource Wikipedia are the backbone of LOD, of its evolving semantic versions, and finally of their ontology structure. The first step is to build a taxonomy backbone based on the Wikipedia structure; this is then turned into an ontology by bootstrapping from semantic annotation and the infobox-based template structure. The second step is ontology enrichment using Wikipedia text, in order to strengthen the ontology structure. The third step, ontology population, accepts each dataset as one user ontology, a local aggregate of LOD: after transforming each local aggregate into a possible ontology structure with a taxonomy, it is populated from the Wikipedia-based ontology. The fourth step is to enrich the local LOD ontology, and to feed back to the Wikipedia-based ontology for the overall stability of semantic LOD.
Multi-lingual Synchronization
It is necessary to explore an automated approach to synchronizing multilingually-linked articles in Wikipedia. Two linked articles in two different languages contain different amounts of information, and some entries in one language have no entry at all in the other. Synchronization means translation between the two languages plus information balancing, to equalize their degree of information. Even if the full translation between two languages’ linked articles is successfully accomplished, the translated content must still be readjusted against the already existing article in the target language; one of the languages may even contain obsolete information. The next stage is therefore to find the temporal ordering of the facts and events represented in the articles’ sentences, and to generate up-to-date, information-synchronized articles in both languages.
There are many challenges in massive translation among multilingually-linked articles while synchronizing their information content. This approach envisions a role for the infobox of each Wikipedia article in improving infobox quality while keeping data categories semantically interoperable. Although the current shape of infoboxes is governed by “templates”, these templates are not organized well enough to give a standardized, concise guide for mapping article content into the infobox. One solution is to give an ontology structure to infobox templates and to provide a multilingual standard for their data categories; we have a proposal for this called OntoCloud. Once it is set up, the next step is to perform indirect translation among different languages through the infobox pivot. Here we need machine learning between a Wikipedia article’s sentences and its infobox: to align each infobox row with sentences, we will try to extract the relevant information from the Wikipedia article into its infobox, as well as to generate relevant sentences from the infobox table. The infobox thus provides a minimal common exchange of information based on already-made templates; of course, one Wikipedia article can require several infobox templates. Multilingual wordnets are also crucial resources for linking them.
The ontology structure based on infobox templates is related to the category structure in Wikipedia. But because Wikipedia categories are user-given tags, the category structure is not that of an ontology, and its taxonomy is far from what we expect of WordNet. We will investigate the mutual feedback between the template ontology and the category structure, towards a persistent ecosystem that can evolve while keeping its collective-intelligence advantage.
Issues
Issues raised are summarized as follows:
1. Is it possible and necessary to construct a backbone ontology for LOD?
2. Is it also good to align each dataset of LOD to the backbone ontology?
3. Infobox Schema Management:
a. ontology structure for templates of infoboxes
b. Multilingual standard for data categories in templates
4. Multilingual Infobox alignment:
a. Finding semantic correspondences between multilingual contents
b. Multilingual thesauri are used to link between different lingual sources
5. Infobox Population: filling the missing and unknown contents
a. Extracting relevant information from other sources such as Wikipedia articles or
external resources
b. Generating up-to-date information-synchronization in multilingual environment
Beyond Open Data to Open Advancement
Eric Nyberg
Carnegie Mellon University
Open Data Isn’t Enough
In the spirit of open data, several initiatives have developed shared language resources, annotation
schemas, and datasets for research and associated programmatic evaluations. One branch of
Language Technology, Question Answering, has benefited greatly from the availability of open data
resources, especially the research datasets created for the yearly TREC question answering track.
Nevertheless, recent experience with the Jeopardy! Challenge shows that open data isn’t enough.
Sustained interoperability and steady improvement in task performance requires additional work
to formalize, standardize and share related language processing resources, such as:
• Shared component APIs
• Shared software (open source)
• Shared configurations / data flows for specific language tasks
• Shared metrics, measurements and evaluation frameworks for specific language tasks
• A collaborative development process (driven by measured improvement)
The Open Advancement of Question Answering (OAQA)
The Watson QA system developed by IBM Research is based on a foundational architecture and
methodology for accelerating collaborative research in automatic question answering (OAQA). By
making long-term commitments to component APIs, shared software, configurations, metrics and
process, OAQA led to rapid acceleration in the state of the art, and a Jeopardy! victory for Watson.
OAQA combines object-oriented software architecture with comprehensive metrics, measurement
and error analysis at the system and module levels, so that the sources of overall task error can be
traced to component-level errors to be debugged. Detailed error analysis allows the team to focus
on the most important sources of error during each development iteration, resulting in steady
progress in task performance.
Beyond Question Answering to other Language Tasks
I believe that the same OA approach can be applied to other human language processing tasks, and
I strongly advocate the position that new language resources (i.e., corpora, datasets, etc.) should
not be developed without a specific set of tasks and task metrics in mind, along with a formal
software design to support automatic configuration management, automatic comparative /
incremental evaluation, and detailed error analysis. These techniques were used to great effect in
the development of Watson; they can have a similar positive effect on other areas of language
research, if we are willing to invest the effort required to specify the task and component APIs and
metrics along with the shared language resources for the task.
To Create Commons in order to Open Data
Danièle Bourcier
CNRS and Creative Commons France, France
In this talk, I will propose to link the Commons movement with that of open data.
Since 2004, Science Commons has been focusing its efforts on expanding the use of Creative Commons licenses to scientific and technical research. Creative Commons played an instrumental role in the Open Access movement, which is making scholarly research and journals more widely available on the Web. The world’s largest Open Access publishers all use CC licenses to publish their content online: 10% of scholarly journals are CC licensed. The Science Commons movement also extends Open Access to research institutions. Creative Commons licenses are directly integrated into institutional repository software platforms. The university libraries at MIT offer access to the Scholar’s Copyright Addendum Engine, which helps faculty members upload their research for public use while retaining the rights they want to keep. Finally, Science Commons has created policy briefings and guidelines to help institutions implement Open Access in their frameworks.
Is Our Relationship With Open Data Sustainable?
Denise DiPersio
Linguistic Data Consortium, Philadelphia, PA 19104 USA
Introduction
The language resource development and user communities have publicly declared their affection for open data. But what has the community embraced? “Open data” is a term susceptible to multiple and often inconsistent interpretations. In any relationship, long-term success depends on how well we know our partner. This paper will examine some characteristics that affect data’s “openness”, review emerging trends and conclude with some ideas for considering the notion of openness as it applies to language resources.
Open Data is ....
The concept of open data has its roots in discussions among the scientific community in the 1950s
that resulted in various efforts to promote the sharing of scientific data in order to minimize loss
and to maximize accessibility. The emergence of the web as a data sharing mechanism gave new life
to this idea premised on the assumption that data could be made available online and downloaded
at little or no cost. But there remains some confusion about the meaning of open data.
Some view open data as describing a method of distribution. In that context, open data is “data
made openly available without permission or payment barriers” (MIT Libraries), or “a piece of
content or data is open if you are free to use, reuse, and distribute it – subject only, at most, to the
requirement to attribute and share-alike” (Open Knowledge Foundation).
Others use open data to describe content. Examples here include information collected by some
public body or whose collection was publicly supported. The default assumption of this approach is
that all such information – except perhaps material subject to privacy or sensitivity concerns –
should be publicly available at no cost. Indeed a large number of current open data “initiatives” are
efforts by national and local government bodies to release their data collections online.
The language resource community has tended to think of open data in the former sense: digital
resources distributed under open source-type licenses that can in turn be used, modified and
redistributed. However, openness can depend on a number of factors including:
LR type/design: this includes formatting and compatibility with existing related data sets as well
as metadata, all of which affect usability. An “accessible” corpus that is not usable is not open.
Source data: this is where most legal restrictions apply, mostly copyright, but this can also
include data that contains private or sensitive information. Lack of appropriate permissions could
impair openness.
Legal issues: there are implications in copyright law and, in many jurisdictions, in laws governing rights in databases. Moreover, not all open source licenses are consistent. The goal is a license that promotes openness to the greatest extent possible, but this can be difficult to achieve uniformly.
Access: includes user access and downloadability. Restrictions on user groups and alternative distribution methods can be perceived to limit openness.
Sustainability: the data should be available for the long term. This may require additional infrastructure, which in turn requires maintenance.
Cost: open data may be distributed at no cost, but the data is not necessarily “free”. A free price usually masks cost-shifting that has to be paid by someone.
The State of our Relationship with Open Data
There is a theory of relationships called the “uncertainty reduction theory” (URT). URT assumes
that personal relationships are filled with uncertainty at the beginning and that people try to
reduce that uncertainty through knowledge and understanding. One way to view the language
resource community’s interaction with open data is to say that it is on a quest for knowledge and
understanding. Here are some examples:
CKAN – the Data Hub – Comprehensive Knowledge Archive Network (Open Knowledge Foundation), “the easy way to get, use and share data”; includes an Open Linguistics Resources group that contains 18 LR “packages” available at no cost under open source-type licenses via web download; corpora and tools; includes a wiki function edited by the community. Openness assessment – some dependence on individual contributions; inconsistent usability information; the main site contains at least one closed dataset.
Language Commons – “Open, Online Encyclopedia and Data Repository for all 7,000 Human Languages”, hosted by archive.org under a no-cost Creative Commons license; fewer than 10 datasets; the Brown Corpus is the most downloaded (300+). Openness assessment – some licenses have use restrictions; inconsistent usability information.
Language Grid – “an online multilingual service platform which enables easy registration and sharing of language services such as online dictionaries, bilingual corpora, and machine translations”; users must apply to join the grid and agree to use the services for non-profit, research purposes; provides 100+ language services. Openness assessment – limited user accessibility; inconsistent usability information.
Toward a Viable Relationship
The worthy initiatives described above highlight the imbalances in our current relationship with open data. These include the need for data variety, a way to address legal and related issues so as to achieve variety and reduce license restrictions, better data description, and community-wide access. Since a hallmark of open data initiatives is community input, it may be useful to survey the language resource community about open data: what is it, what data should be open, what are the needs, how has work been hampered? Is cost the main barrier to accessibility? How problematic are license restrictions? Responses to such questions could be helpful as we move ahead to a viable relationship with open data.
References
Anderson, Chris. Free: The Future of a Radical Price. New York: Hyperion (2009).
CKAN – the Data Hub. <http://ckan.net/group/linguistics> (14 April 2011).
Interpersonal Communication Theories and Concepts: Social Penetration Theory, Self-Disclosure, Uncertainty Reduction Theory, and Relational Dialectics Theory. <http://oregonstate.edu/instruct/comm321/gwalker/relationships.htm> (29 April 2011).
Language Commons. <http://www.archive.org/details/LanguageCommons> (15 April 2011).
Language Grid. <http://langrid.nict.go.jp/en/index.html> (30 April 2011).
Open science data. <http://en.wikipedia.org/wiki/Open_science_data> (26 April 2011).
Scholarly Publishing – MIT Libraries | Open Data. <http://info-libraries.mit.edu/scholarly/mit-open-access/general-information> (14 April 2011).
What is Open? <http://opendatamanual.org/what-is-open-data/what-is-open> (14 April 2011).
What “open data” means – and what it doesn’t | opensource.com. <http://opensource.com/government/10/12/what-“open-data”-means> (14 April 2011).
Opening the Language Library: Let’s Build it Together!
Nicoletta Calzolari
Istituto di Linguistica Computazionale, CNR, Italy
I present here a vision briefly introduced last year at a COLING panel on “crazy ideas”. We must now turn the crazy into reality!
The rationale and the vision
We have recognised that Computational Linguistics is a data-intensive discipline, but we must be coherent and take concrete actions leading to the coordinated gathering – in a shared effort – of as much (annotated/encoded) language data as we are able to produce.
The time is ripe for “opening” a big “Language Library”, collaboratively built by the LRT (Language Resources and Technologies) community. The Language Library is conceived as a facility for assembling, and making available through services, “all the linguistic knowledge the field is able to produce”, putting in place social-networking ways of collaborating within the LRT community.
We can compare the proposed initiative to the accumulation by astronomers and astrophysicists of huge amounts of observation data for a better understanding of the universe.
The rationale behind the Language Library initiative is that accumulation of massive amounts of
(high-quality) multi-dimensional data about language is the key to foster advancement in our
knowledge about language and its mechanisms, in particular for finding new facts about language.
The Language Library must be community built, with the entire LRT community providing data
about language resources and annotated/encoded language data and freely using them.
As ongoing initiatives (FLaReNet, META-SHARE and CLARIN among others) have shown, the
LRT field is mature enough to require consolidation of its foundations and steady increase of its
major assets, minimising dispersion of efforts and enabling synergies based on common knowledge
and collaborative initiatives.
To better achieve these goals we need:
1. Information on the LRT that constitute the real infrastructure of the field
2. Knowledge of best practices for the major LRT and all the languages
3. Facilities for gathering in a unique virtual repository all the linguistic knowledge the field is able to produce.
These requirements can be satisfied by three – strongly interrelated – collaborative initiatives:
1. The LRE Map, started last year, collecting metadata about LRT
2. The Repository of Standards, Best Practices and Documentation
3. The Language Library, collecting language data to explore the language universe
FLaReNet has acted as an incubator for the proposed initiatives, and they will constitute a great
contribution to META-SHARE, both in terms of services around the resources and in terms of new
and innovative strategies for the creation of new resources.
The Language Library
The Language Library can be seen as a sort of big Genome project for languages, where the
community will collectively deposit/create all its “knowledge” about language.
Our understanding of language phenomena is inherent in annotations, in lexicon encoding, and in the capability of annotating/extracting linguistic information. Annotation is at the core of training and testing systems, i.e. at the core of our technologies. But we are currently over-simplifying the task, focusing mostly on one specific linguistic layer at a time, without (having the possibility of) paying attention to the relations among the different layers. However, relations among phenomena at different linguistic levels are of the essence of language properties. At the same time, our efforts are too scattered and dispersed, with little possibility of exploiting others’ achievements.
Today we have enough capability and resources for addressing the complexities hidden in
multilayer interrelations. Moreover, we must exploit the current trend towards sharing for
initiating a collective movement that works towards creating synergies and harmonisation among
all the annotation/encoding work that is now dispersed and fragmented.
With the Language Library we thus want to enable a more global approach to language, starting a movement aimed at – and providing the facilities for – collecting all possible annotations/encodings at all possible levels (and we can certainly manage tens of different annotation layers and types), to enable better analysis and exploitation of language phenomena that we tend today to disregard (in particular the interrelations among levels, which are so important and yet not taken care of). For example, a coreference annotation on top of simpler annotation layers could improve machine translation performance. Part of this multi-layer and multi-language annotation could be performed on parallel texts, so as to foster comparability of new achievements and equality among languages.
Moreover, the Language Library will be both a repository of annotation layers and a set of
tools/services that allow easy browsing, annotation, conversion, porting through languages, and
exploitation of the existing layered data ...
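The standoff, multi-layer annotation idea described above can be illustrated with a minimal sketch (the example text, layer names and spans are all invented for illustration):

```python
# Standoff annotation: each layer refers to character spans in the same
# immutable source text, so new layers can be added and cross-related
# without touching existing ones.
text = "Maria bought a book. She read it."

layers = {
    "pos": [
        (0, 5, "PROPN"), (6, 12, "VERB"), (13, 14, "DET"),
        (15, 19, "NOUN"), (21, 24, "PRON"), (25, 29, "VERB"),
        (30, 32, "PRON"),
    ],
    # A coreference layer built on top of the same offsets: chains of
    # spans that refer to the same entity.
    "coref": [
        [(0, 5), (21, 24)],    # "Maria" ... "She"
        [(13, 19), (30, 32)],  # "a book" ... "it"
    ],
}

def surface(span):
    """Recover the surface string for a (start, end) span."""
    start, end = span
    return text[start:end]

# Relating layers: print the mentions of every coreference chain.
for chain in layers["coref"]:
    print(" <-> ".join(surface(s) for s in chain))
```

Because every layer addresses the same text by offsets, adding a tenth or twentieth layer is no harder than adding a second, which is what makes the interrelations between levels analysable at all.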
This collaborative approach to the creation of massive amounts of annotated data has a clear relation to interoperability: it will push towards convergence on best practices, also making available tools that simplify the adoption of the most widely used annotation guidelines, while at the same time encouraging exploratory diversity. This could create a fruitful, community-driven loop between the most used annotation schemes and the establishment of best practices.
A first experiment at LREC 2012!
At LREC 2012 we will launch a new initiative aimed at laying the foundation for such a community
built Language Library.
Without going into details now: we will distribute texts in all the LREC languages and ask LREC authors to return them processed/annotated in all the ways their submissions can manage. All the collected language data will be made available to the LREC community (and beyond), possibly before the Conference, so that something can be organised around, or on top of, them.
We’ll announce/describe this initiative in the LREC Call for Papers next month, and we call for
suggestions now from the FLaReNet community!
S3. Go green: reuse, repurpose and recycle resources – 16:50‐18:30
Chair: Stelios Piperidis
Introduction
Stelios Piperidis (ILSP ‐ A.C. "Athena", GR / Chair)
Introductory Talks
An Interoperability Challenge for the NLP Community James Pustejovsky (Brandeis University, USA)
Community Co‐Creation in Cultural Domain Virach Sornlertlamvanich (NECTEC, TH)
Imagine we have 100 Billion Translated Words at our Disposal Jaap van der Meer (TAUS Data Association, NL)
Contributions
U‐Compare: interoperability of text mining tools with UIMA Sophia Ananiadou (University of Manchester, UK)
Opening, Sharing, Re‐using Language Resources: Who, What, When, Where and How Stelios Piperidis (ILSP ‐ A.C. "Athena", GR)
Discussants
Petya Osenova (BAS, BG)
Anna Braasch (University of Copenhagen ‐ Centre for Language Technology, DK)
An Interoperability Challenge for the NLP Community
Nancy Ide1 and James Pustejovsky2
1Vassar College and 2Brandeis University, USA
Web services are becoming increasingly more sophisticated and responsive to user needs over a
range of applications and areas. However, at this time a robust, interoperable software
infrastructure to support natural language processing (NLP) research and development does not
exist. The need for robust language processing capabilities across academic disciplines, education,
and industry is without question of vital importance. The goal of our NSF-funded SILT project, which has worked together with European and Asian collaborators, has been to advance interoperability among NLP tools and resources. The ultimate goal of this research is to build on
the foundation laid in SILT and other projects, to create the momentum toward establishing a
comprehensive network of Language Apps (LAPPs) web services and resources within the NLP
community. To this end, in this talk, we define a shared task for the NLP community we call "The
SILT Interoperability Challenge", which will call for development of interoperable web services. We
anticipate that the challenge will be carried out in two or three iterations over the next two years.
Community Co-Creation in Cultural Domain
Virach Sornlertlamvanich
National Electronics and Computer Technology Center (NECTEC), Thailand
Introduction
Under a collaboration between the Ministry of Culture and the Ministry of Science and Technology, through the National Electronics and Computer Technology Center (NECTEC), a collection of cultural knowledge has been built up since 2010. We did not start the work from scratch: some years ago a set of servers was set up, operated individually by each province, with each province responsible for content about its own area. The initiative was carried out to create a reference site for local cultural knowledge, and the distributed system aimed to decentralize management and preserve the uniqueness of each specific area. However, there is a trade-off between this independent design and the cost of maintenance, which covers service operation, interoperability and integrity. There are currently 77 provinces in Thailand, each allocated an office for a provincial cultural center. With the distributed approach described above, maintaining the service and a standard for data interchange proved too costly.
A newly designed, platform-based approach to digital cultural communication has therefore been introduced. Its aim is to build a co-creative relationship between cultural institutions and the community, using new media to produce audience-focused, interactive cultural experiences (Russo and Watkins 2005). First, we collected the existing provincial cultural knowledge and converted it to conform to a standardized set of metadata, preparing the cultural knowledge for an open data schema and interoperability. The metadata follows the Dublin Core Metadata Element Set, with some additional elements to capture the information required during the recording process. Second, we assigned representatives from each province and trained them as a core cultural content development team for community co-creation; contributed content must be approved by this core team before it becomes publicly visible. Third, the cultural knowledge is put into service for audiences such as scholars interested in cultural practice, business developers who may benefit from attaching cultural knowledge to their products, and tourists seeking cultural tourism.
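As a rough sketch, a record following the Dublin Core Metadata Element Set might look like the one below. The field values, the choice of elements, and the use of `coverage` for GPS coordinates are invented for illustration, not taken from the actual NECTEC schema:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A hypothetical record for one cultural item. The "coverage" element
# here carries lat,lon coordinates so the item can be placed on a map.
record = {
    "title": "Loy Krathong Festival",
    "creator": "Chiang Mai Provincial Cultural Office",
    "subject": "festival; tradition",
    "description": "Annual festival of floating decorated baskets.",
    "coverage": "18.7883,98.9853",  # lat,lon for map visualization
    "language": "th",
    "type": "Event",
}

root = ET.Element("record")
for element, value in record.items():
    child = ET.SubElement(root, f"{{{DC}}}{element}")
    child.text = value

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

Keeping every provincial record in one such standardized shape is what makes the later steps (central aggregation, filtering by province, semantic linking) tractable.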
These cultural media assets will be linked and annotated through a governed conceptual scheme such as the Asian WordNet (Sornlertlamvanich et al. 2009). The semantically annotated and linked data will be served as fine-grained cultural knowledge for higher-level applications. The new media for recording the cultural knowledge take the form of narratives, photos, videos, animations and images, combined with GPS data for visualization on a map.
Cultural Knowledge Co-Creation
The existing cultural data has been collected and cleaned up to conform to the designated standard metadata. Missing data is revised and augmented by experts from the Ministry of Culture. A few tens of thousands of records have been collected, but most are captured only as coarse-grained images. Narratives and images are revised by a group of trained experts to create a seed for a standardized, annotated cultural knowledge base. Some new records have been added, together with animation, video, panoramic photographs, etc. New techniques for capturing cultural images are actively introduced to add value and attract more interest from the audience.
The standardized, annotated cultural knowledge base is presented to the audience through a set of viewing utilities. Filtering by location and province allows the page to be customized for each province, giving each a unique presentation. The administration of each province is responsible for the correctness and coverage of its content, and attractive presentation and narratives are needed to draw the audience in.
Figure 1: Community Co-Creation Cultural Knowledge Base
A social networking system has been introduced to invite participation from the communities. Institution representatives are actively encouraged to create their own communities, and the resulting co-creation keeps the content maintained and cleaned up as communities compete with one another.
Significantly, the provided framework encourages data accumulation and fulfills the needs of the audience. Community co-creation feeds back actual requirements that can improve the quality of the content, with the institution playing an important mediating role between community and audience. As a result, multiple types of content are generated according to a designated standard, and the annotated metadata can serve as guidance for higher-level data manipulation such as semantic annotation, cross-language linking and link analysis.
References:
Russo, A. and Watkins, J. 2005. ‘Digital Cultural Communication: Audience and Remediation’ in
Theorizing Digital Cultural Heritage eds. F. Cameron and S. Kenderdine, Cambridge, Mass., MIT
Press.
Sornlertlamvanich, V., Charoenporn, T., Robkop, K., Mokarat, C. and Isahara, H. 2009. ‘Review on
Development of Asian WordNet’ in JAPIO 2009 Year Book, Japan Patent Information
Organization, Tokyo, Japan.
Imagine we have 100 Billion Translated Words at our Disposal
Jaap van der Meer
TAUS Data Association
Not just anonymous words scraped from numerous web sites, but good quality translations from
trusted sources, from government bodies and institutions, from companies large and small and
from professional translators.
What could we do with it?
We could transform the translation industry!
Here is how we do it:
1. Terminology mining and dictionary building
Today glossaries are built by terminologists: best-in-class language specialists. It is laborious and frustrating work: because language keeps changing, the terminologist is always behind, and the glossary is often ignored.
Imagine we have 100 billion translated words at our disposal. Terminology is harvested in real time. Synonyms and related terms are identified automatically; part-of-speech is tagged, context is listed, sources are quoted, meanings are described. It is not rocket science. In fact, all the tools exist to do this, and to do it well.
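A toy sketch of the harvesting idea: candidate terms ranked by frequency over a (tiny, invented) domain corpus, with one source context quoted per candidate. A real system would add part-of-speech filtering, synonym detection and much more:

```python
from collections import Counter
import re

# Invented miniature "domain corpus"; a real pipeline would stream
# millions of translated segments from trusted sources.
corpus = [
    "The translation memory stores each translation unit.",
    "A translation unit pairs source and target segments.",
    "Translation memory tools reuse previous translations.",
]

def candidate_terms(sentences, n=2):
    """Harvest n-gram term candidates with frequencies and contexts."""
    counts = Counter()
    contexts = {}
    for sent in sentences:
        words = re.findall(r"[a-z]+", sent.lower())
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            counts[gram] += 1
            contexts.setdefault(gram, sent)  # quote one source context
    return counts, contexts

counts, contexts = candidate_terms(corpus)
term, freq = counts.most_common(1)[0]
print(term, freq, "|", contexts[term])
```

Even this crude frequency count surfaces plausible domain terms; with 100 billion words behind it, the same idea scales into continuous, automatic terminology harvesting.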
2. Customize automated translation
Today we use MT on the internet and accept its stupid failures, due to the engines' lack of domain knowledge. Some of us go through the lengthy and costly process of customizing an engine for our company's use.
Imagine we have 100 billion translated words at our disposal. We will do fully automatic semantic
clustering to find the translations that match our own domain. We will do automatic genre
identification to make sure that we use the right style. We will go deeper in advancing the MT
technology with syntax and concept descriptions.
3. Global market and customer analytics
Today translation is an isolated function and cost center in most companies and organizations. We
push translations out but we have no means to listen, learn and connect with our customers
worldwide.
Imagine we have 100 billion translated words at our disposal. We will integrate our translation
process and skills with text analytics and social media management. We will do multilingual
sentiment analysis, search engine optimization, opinion mining, customer engagement,
competitive analytics, etcetera. From a cost center, the translation function then becomes a very valuable strategic ally in every global organization.
4. Quality management
Today we struggle to deliver adequate quality in translations. We miss the local flavor, the right term or the subject knowledge. The source texts may be in bad shape, causing all kinds of trouble for the translator or the MT engine. The craftsmanship typical of the translation industry stops all innovation.
Imagine we have 100 billion translated words at our disposal. We will automatically clean and
improve source texts for translation. We will run automatic scoring and benchmarks on quality. We
will improve consistency and comprehensibility.
5. Interoperability
Today the lack of interoperability and compliance with standards costs a fortune. Buyers and
providers of translation lose 10% to 40% of their budgets or revenues because language resources
are not stored in compatible standard formats.
Imagine we have 100 billion translated words at our disposal. Imagine that it is common practice in the global translation industry to share most of your public translation data in a common industry repository. All vendors and translation tools are then naturally driven towards one hundred percent compatibility. Jobs and resources will travel without any loss of value. Benefits on an industry scale add up to billions of dollars or euros.
Stakes are high. Risks are low. Only fear can stop us.
A quarter of a million professional translators produce 625 million good-quality translated words every day, some 150 billion a year, of which an estimated 70% is published on the internet. We can collect and share 100 billion translated words every year. We should give unlimited access to this gigantic supercloud of translations. The translation supercloud must be not-for-profit and directed by the data contributors.
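The arithmetic behind these figures can be reconstructed roughly. The per-translator daily output and the number of working days per year below are our assumptions, not stated in the text:

```python
translators = 250_000     # "a quarter of a million"
words_per_day = 2_500     # assumed daily output per translator
working_days = 240        # assumed working days per year

daily_words = translators * words_per_day
yearly_words = daily_words * working_days
published = yearly_words * 0.70   # "an estimated 70% is published"

print(f"{daily_words:,} words/day, "
      f"{yearly_words / 1e9:.0f}B words/year, "
      f"{published / 1e9:.0f}B published")
```

Under these assumptions the numbers line up with the text: 625 million words a day, 150 billion a year, and roughly 100 billion shareable after the 70% published fraction.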
The stakes are high. The translation industry will flourish. The world will communicate better
across all language barriers.
The risks are low. It is just a choice to participate proactively, or leave it to the ‘pirates’ to change
the industry.
Only fear can stop us. Fear of change and fear of losing control will be replaced by fear of being left behind: many industry leaders have already been sharing their translations in the TDA repository since July 2008.
U-Compare: Interoperability of Text Mining Tools with UIMA
Sophia Ananiadou1, Yoshinobu Kano2
1University of Manchester, UK and 2Database Center for Life Science (DBCLS), Japan
Due to the increasing number of NLP resources (software tools and corpora), interoperability issues are becoming significant obstacles to their effective use. UIMA, the Unstructured Information
Management Architecture, is an open framework designed to aid in the construction of more
interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a
concrete framework for out-of-the-box text mining and a sophisticated evaluation platform
allowing users to run specific tools on any target text, generating both detailed statistics and
instance-based visualizations of outputs.
U-Compare is a joint project, providing the world's largest collection of UIMA resources that are
compatible with a single, shared type system. The collection includes resources developed by
different groups for a variety of domains. Whilst the current emphasis is on English biomedical text
processing components, planned work within META-NET will significantly increase the inventory
of components available for several European languages, including multi-lingual components.
Japanese components are also in development.
U-Compare can be launched straight from the web, without needing to be manually installed. All
U-Compare components are provided ready-to-use and can be combined into workflows easily via
a drag-and-drop interface without any programming. External UIMA components can also simply
be mixed with U-Compare components, without distinguishing between locally and remotely
deployed resources.
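The shared-type-system idea at the heart of this interoperability can be illustrated with a small sketch. This is not the actual UIMA API (which is Java-based); all names here are invented:

```python
from dataclasses import dataclass

# One shared annotation type: any component that reads and writes this
# type can be combined with any other, which is the core of the
# UIMA/U-Compare interoperability story.
@dataclass
class Annotation:
    begin: int
    end: int
    layer: str
    label: str

def tokenizer(text, annotations):
    """Toy component: emits token annotations over the shared type."""
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        annotations.append(Annotation(start, start + len(word), "token", word))
        pos = start + len(word)
    return annotations

def shouter(text, annotations):
    """A second, independently written component consuming the same type."""
    for ann in [a for a in annotations if a.layer == "token"]:
        annotations.append(Annotation(ann.begin, ann.end, "upper",
                                      ann.label.upper()))
    return annotations

# Components compose into a workflow precisely because they agree on
# the type system, without knowing anything about each other.
text = "interoperability matters"
anns = shouter(text, tokenizer(text, []))
print([a.label for a in anns if a.layer == "upper"])
```

Once components agree on the types, workflow composition can be mechanical, which is what makes a drag-and-drop interface like U-Compare's possible.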
Opening, sharing, re-using language resources : who, what, when,
where and how
Stelios Piperidis
Institute for Language and Speech Processing, “Athena” Research Center, Greece
In its report titled “Riding the Wave: how Europe can gain from the rising tide of
scientific data”, October 2010, the High-Level Group on Scientific Data states: “A
fundamental characteristic of our age is the rising tide of data – global, diverse,
valuable and complex. In the realm of science, this is both an opportunity and a
challenge.” In the context of language research and technology, the essence of this
statement has been in focus for the last two decades. The current prevailing
methodologies in language technology development, the sheer number of languages and
the vast volumes of digital content together with a wide palette of useful content
processing applications, render new models for managing the underlying language
resources indispensable.
META-SHARE tries to respond to this need by building a network of distributed
repositories of language resources, including language data and basic language
processing tools (e.g. morphological analysers, POS taggers, etc.). Repositories play either a local or a non-local (central) role. Local repositories are set up and maintained by organisations participating in the META-SHARE network, storing their own resources. Non-local
(central) repositories are also set up and maintained by organisations participating in the META-SHARE network, acting as storage and documentation facilities for resources developed by organisations that do not wish to set up their own repository, as well as for donated or orphan resources. Language resources are described
according to a metadata schema (the META-SHARE metadata schema). Actual resources
and their metadata reside in the local repositories, which export metadata records and
allow their harvesting. Central network servers harvest, host and mirror META-SHARE’s
metadata and point to local repositories for browsing, downloading, etc actual resources.
Users (language resource seekers/consumers) will be able to log in once, search the central inventory using multifaceted search facilities, and access the actual resources by visiting their repositories for browsing and downloading, as well as getting additional information about the usage of specific resources, their relation (e.g. compatibility, suitability) to other resources, recommendations, etc.
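The harvesting model described above can be sketched minimally. The class names and the record format below are illustrative, not the actual META-SHARE schema or harvesting protocol:

```python
# Local repositories hold the actual resources plus metadata; the
# central server harvests only the metadata and keeps a pointer back.
class LocalRepository:
    def __init__(self, url, records):
        self.url = url
        self._records = records          # metadata records it exports

    def export_metadata(self):
        # Each harvested record carries a pointer to its home repository.
        return [dict(rec, repository=self.url) for rec in self._records]

class CentralServer:
    def __init__(self):
        self.inventory = []

    def harvest(self, repositories):
        for repo in repositories:
            self.inventory.extend(repo.export_metadata())

    def search(self, **facets):
        # Multifaceted search runs over harvested metadata only;
        # downloads happen at the repository the record points to.
        return [r for r in self.inventory
                if all(r.get(k) == v for k, v in facets.items())]

repo = LocalRepository("https://example.org/repo", [
    {"name": "MorphoTagger", "type": "tool", "language": "el"},
    {"name": "NewsCorpus", "type": "corpus", "language": "el"},
])
central = CentralServer()
central.harvest([repo])
hits = central.search(type="corpus")
print(hits[0]["name"], "->", hits[0]["repository"])
```

The design choice mirrored here is the key one in the text: metadata is mirrored centrally for search, while the resources themselves stay in, and are served from, the local repositories.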
In this brief presentation, I will discuss the background of the emerging infrastructure,
e.g. community needs this infrastructure should cater for, form and modes of operation,
as well as instruments foreseen to help achieve maximum usability and sustainability.
We will discuss the principles that META-SHARE uses regarding language resource
sharing and the instruments that support them. We will conclude by elaborating on
potential synergies with neighbouring initiatives and future plans at large.
Friday 27th May 2011
S4. Innovation needs data – 9:15‐11:00
Chair: Jan Odijk
Introduction
Jan Odijk (University of Utrecht, NL / Chair)
Introductory Talks
Such Stuff as Dreams are Made on... the Consequences of Grand Visions on Linguistic Resources Hans Uszkoreit (DFKI, DE)
Parallel Multilingual Data from Monolingual Speakers Bill Dolan (Microsoft Research, USA)
Turning water into wine: transforming data sources to satisfy the thirst of the knowledge era Frederique Segond (Xerox, FR)
Contributions
How to get more data for under‐resourced languages and domains? Andrejs Vasiljevs (TILDE, LV)
User feedback collection for MT and dictionaries: current status and strategies Théo Hoffenberg (Reverso ‐ Softissimo, FR)
Discussants
Maria Teresa Pazienza (University of Rome ‘Tor Vergata’, IT)
Guido Vetere (IBM ‐ Senso Comune, IT)
Martine Garnier‐Rizet (Vecsys & IMMI‐CNRS, FR)
Such Stuff as Dreams are Made on ...
the Consequences of Grand Visions on Linguistic Resources
Hans Uszkoreit
DFKI, Germany
The European Network of Excellence META-NET has conducted a complex brainstorming and collective deliberation process with the aim of arriving at a shared technology vision for the European language technology community. Such a shared vision is a cornerstone of a planned strategic research agenda for European LT research and innovation over the next ten years. More than one hundred commercial companies and other organizations have actively participated in the process, represented by recognized experts in technology and commercialization. A central part of the resulting vision paper, "The Future European Multilingual Information Society. Towards a Strategic Research Agenda for Multilingual Europe", is a selection of commercially attractive and socially relevant application scenarios.
In this talk I will summarize the consequences of these application visions and their enabling technologies for the availability of resources. Although I will concentrate on monolingual and multilingual data, I will also comment on linguistic descriptions and tools such as lexical resources and basic processing components. To provide all these prerequisites for our dream technologies when they are needed, novel ways of collecting, producing, maintaining and sharing data will have to be pursued.
Parallel Multilingual Data from Monolingual Speakers
Bill Dolan
Microsoft Research, USA
This talk will describe joint work with David Chen (University of Texas Austin) aimed at eliciting
large volumes of fluent, native multilingual data from a surprising source: monolingual speakers.
Our approach relies on large numbers of Mechanical Turk contributors, each of whom was asked to
watch a short video snippet depicting a simple action and then write a brief description of the
action. Contributors were encouraged to write the description using their native language; given
the international composition of contributors on Mechanical Turk, this task produced clusters of
multilingual sentences all describing the same basic semantic content.
Contributors found the task to be engaging and easy to do, and because it does not require deep
bilingual skills, it is open to virtually any crowdsourcing contributor. As a result, our method offers
a simple and very inexpensive way to gather large volumes of “almost parallel” text, potentially
focused on specific domains like sports, cooking, or automobiles. We will describe the data collection technique and results, provide a pointer to the dataset (which is being released to the research community), and finally discuss some of the interesting qualitative differences between this unique "semantically parallel" data and "linguistically parallel" data.
Turning water into wine: transforming data sources to satisfy the
thirst of the knowledge era
Frédérique Segond
Xerox Research Centre Europe, France
Since the beginning of humanity, data, in its different forms, has been recognized as essential to
knowledge and the principal ingredient of innovation. In this short positioning paper which follows
the "Data Information Knowledge and Wisdom (DIKW)" paradigm, we present what is specific to
the era of Information Technology. Using the example of rare diseases we conclude that not only
the amount of data but the capacity to make sense out of it, learn from it, and turn it into
knowledge will speed up the innovation process.
History is full of examples that show how collecting data and making sense of it has been central to
radical changes in culture and science. Greek philosophers such as Aristotle were able to build a scientific theory with little data, but little by little the qualitative approach has been complemented by the quantitative, as large amounts of data are required to sustain scientific results and theories.
The Ancient Library of Alexandria is one example of data collection in Antiquity that aimed at
capturing knowledge from the world for scholars to study and hopefully to innovate.
Monks and later on, copyists were part of the tradition of collecting data and knowledge of the
world to learn from them and to then educate others.
At the beginning of the 17th century Galileo collected observations with his telescope, and the theory he developed on the basis of these observations has served as the foundation of modern astronomy, which today continues to interpret large amounts of data to obtain scientific results.
In the 18th century more and more scientists and philosophers supported observation and
experience rather than purely intellectually based theories.
The French naturalist Comte de Buffon influenced peers like Lamarck and Cuvier with the publication of his thirty-six volumes of "Histoire naturelle, générale et particulière", and was considered by Darwin the first author to treat evolution in a scientific manner.
In the same period, led by Diderot, the "Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers" collected data on the sciences and mechanical arts with the goal of "changing the way people think". It is recognized as an important intellectual vector of the French Revolution, which eventually led to new political models.
In the 19th century Durkheim proposed a scientific approach to society using quantitative methods
and gave birth to modern sociology.
In the same century and closer to the domain of FLaReNet, linguists and ethnographers such as
Sapir and Lévi-Strauss spent their life collecting data on different languages and cultures and
influencing the work of several generations of linguists, anthropologists and ethnographers.
What has dramatically changed with the advent of the Internet and information technologies is that this data, previously so difficult to collect, became, in the course of only a few years, extremely easy to access and available in much greater quantity. All of a sudden we went from the dream of having more data to the nightmare of data overload, or data obesity. Nowadays data is no longer only encyclopedic, as before: it can be emails, Facebook walls and exchanges on Twitter. Today, data is gathered not only from the Internet but also from supermarket receipts, mobile phones, cars and planes, and soon even refrigerators, ovens and every type of electronic device we use will provide data. Much of the data that previously simply disappeared after serving a specific purpose is now stored, distributed and even resold for analysis, interpretation or other purposes, the best, if not the most frequent, of which is innovation.
The definition of what data is has evolved over the course of history. We adopt the general definition of data as symbols such as words, numbers, codes or tables. These symbols (data) can then be linked into sentences, paragraphs, equations, concepts and ideas to give birth to information. Information can in turn be structured and interpreted to become knowledge. With recent advances in the semantic web, natural language processing and knowledge management, to cite only the fields most relevant to our purpose, the analysis of data has made huge progress. So what is the link to innovation?
When looking at the many existing definitions of innovation, a distinction is often made between invention and innovation. Today, innovation is generally associated with two ingredients: technology, and people willing to use or buy that technology, while an invention may have no commercial value. Innovation is usually associated with the idea of benefit. Almost any company dealing with data which claims to be innovative communicates on its capacity to turn data into wine, giving you a competitive advantage because it performs semantic analysis, knowledge discovery, business intelligence or analytics in general.
What these companies offer their customers is support in understanding their data so as to make better use of it in marketing, technical development or strategic decisions. There are many examples: opinion mining for companies selling products of any type, including politicians selling a political discourse; or making sense of huge amounts of data for the risk societies we now live in, be it for homeland security, environmental risk or risk associated with drugs, to name but a few. The opportunity of making sense of data, of linking information generated from different sources and of reasoning over it has completely changed the way investigations are pursued in law, crime and... medicine.
Medicine has always been a big consumer of data for innovative purposes: the more data a medical domain has, the more medical progress is made. National health institutions invest large amounts of time and money to obtain real user data, for instance blood tests from pregnant women for the early detection of Down syndrome, or the collection of data on the human genome, which has enabled great progress in treating and curing genetic diseases. To better understand diseases and how to properly prevent and cure them, medical doctors need to relate many types of knowledge, such as symptoms, treatments, genes and phenotypes. To do so they use data from collections, communications, publications, patient records and medical archives. Many hospitals hold archives of numerous and very precious data that could be used for epidemiological studies. However, access to and links within and across this data are as important as its actual quantity. In the same medical domain, the study of rare diseases is, by definition, characterized by the fact that very little data exists; but it is precisely because such data is rare that it is important to capture it and link it with other data, such as, in the case of rare diseases, data on genes.
We have given examples of how data is the basic building block of innovation, prior to becoming information and knowledge. We conclude that the quantity of data alone is not sufficient for innovation: of equal importance is the ability to link the information carried by this data to discover and develop new paradigms.
How to get more data for under-resourced languages and
domains?
Andrejs Vasiļjevs
Tilde
Abstract
The explosive growth of digital information on the web enables the rapid development of data-driven techniques, and significant breakthroughs have been achieved in many areas of language technology. Statistical methods based on huge volumes of data have replaced the laborious human work that was previously required to encode linguistic knowledge. In the new paradigm, the more data you have, the better results you get.
However, dependence on data creates new disparities for under-resourced languages and domains.
Naturally, smaller language communities produce much less data than speakers of the languages
dominating the globe. The same problems occur for language data in narrow domains with their
own specific terminological and stylistic requirements.
The essential question is how to get more data for under-resourced languages and domains. Not only does innovation need data; the collection of data also needs innovation. This presentation will briefly discuss some inventive strategies.
When the Web is not enough
Google estimates that 95% of Web pages are in the top 20 languages [1]. Although the language identification method used in this estimation is not very reliable, it clearly demonstrates the huge disparity in the representation of languages on the Web.
Smaller languages often have a complex morphological structure and free word order. To learn this
complexity from corpus data by statistical methods, much larger volumes of training data are
needed than for languages with simpler linguistic structure.
Shared services from non-sharable data
Motivating users to share their data is a powerful strategy to boost public language resources. In data-driven machine translation, most public MT systems are built on parallel texts collected from the web. But a lot of translated data still resides on the local systems of translation agencies, multinational corporations, public and private institutions, and on the desktops of individual users. The TAUS Data Association [2] is an example of successfully involving the key players of the localization industry in sharing their translation memories.
Still, in many cases data holders are not able or willing to share their data, for reasons of competitiveness or confidentiality. New cloud-based services can provide a solution whereby the community can benefit from restricted data. An example is the machine translation platform being developed in the ICT-PSP project LetsMT! [3]. It provides a fully automated self-service for MT generation from user-submitted data. As opposed to traditional sharing platforms, users of the LetsMT! system may only upload their data to the online repository; this data is not downloadable and can be used only for generating and running statistical models for machine translation. The uploaded proprietary data is not directly exposed or shared, yet the community benefits from being able to use it for training and running MT systems. In this way even small companies and institutions can create their own user-tailored MT solutions while contributing to the expansion of online MT training data and to a variety of custom MT systems.
Motivating the crowd
The crowdsourcing approach is boosting the acquisition and expansion of language resources.
Obviously, crowdsourcing needs a crowd. If the total population is in the tens or hundreds of
millions, even a small percentage of active people willing to become involved in crowdsourcing
activities makes quite a big group. But what about smaller language communities? Boosting
enthusiasm or providing monetary incentives (e.g., Amazon Mechanical Turk) are only temporary
solutions for raising the number of participants.
There is room for innovation in finding new motivation schemes. One example is a successful
trial of collaborative translation in Latvia using the CTF tool by Microsoft Research [4], and the
organization of translation competitions in social networks.
Use data wisely
The scarcity of data for under-resourced languages and domains is a strong motivation to look for
ways to use the available data more efficiently. For example, a potentially very useful resource is
multilingual comparable corpora: collections of texts about the same or similar topics that are not
direct translations. Several research activities seek efficient methods for collecting and
analyzing comparable corpora. The FP7 project ACCURAT [5] is researching how to use comparable
corpora for statistical MT, while the FP7 project TTC [6] focuses on the extraction of multilingual
terminology.
Profound research is needed to find new ways of teaching computers language tasks. It should be
possible to get much better results from much less data than current data-driven methods require.
Proof of this is the human child, who has a remarkable ability to generalize the quite limited
language input received from the outside world into complete fluency. If we could mimic this
ability in online "agents" that learn language using web data and interact with participants from
the crowd, it could be a principal solution for closing the technology gap between larger and
smaller languages.
References
[1] Daniel Pimienta, Daniel Prado and Álvaro Blanco. 2009. Twelve years of measuring linguistic
diversity in the Internet: balance and perspectives. UNESCO, Paris.
[2] http://www.tausdata.org
[3] Vasiljevs, Andrejs, Tatiana Gornostay and Raivis Skadins. 2010. LetsMT! – Online Platform for
Sharing Training Data and Building User Tailored Machine Translation. Proceedings of the
Fourth International Conference Baltic HLT 2010, Riga.
[4] http://blogs.msdn.com/b/translation/archive/2010/03/15/collaborative-translations-
announcing-the-next-version-of-microsoft-translator-technology-v2-apis-and-widget.aspx
[5] http://www.accurat-project.eu
[6] http://www.ttc-project.eu
User feedback collection for MT and dictionaries: current status
and strategies
Théo Hoffenberg
Reverso – Softissimo (France)
MT systems and online dictionaries are becoming more mature, and gaining the next 1% in
accuracy or coverage requires a lot of effort. Users of online systems such as Reverso, which
serves over a billion page views each year, can be of great help in identifying issues and adding
phrases to dictionaries; however, the feedback that people provide is not always usable, nor
available in large enough quantities.
Through the Faust project, we started to address this issue, allowing users to rate translations,
suggest better ones, comment on them and, in a second stage, to suggest phrases directly for
inclusion in user or general dictionaries.
As next steps, we will allow users to give their opinion on the self-assessment of MT tools
(confidence scores) and to choose the best translation when several engines are connected.
In parallel, online dictionaries like Reverso’s are now open to user contributions, votes and
comments.
Several thousand such contributions are collected each month, and this information is still not
sufficiently used.
We will describe the methods currently used to collect user data and feedback and to filter and
analyze it. We will also comment on the results obtained and on plans to improve the quality and
quantity of this data and, if possible, on ways to make it directly usable for MT systems or online
dictionaries.
S5. Data for all languages: think big – 11:30‐13:30
Chair: Núria Bel
Introduction
Núria Bel (UPF, Spain / Chair)
Introductory Talks
Plan B Kenneth Church (Johns Hopkins University ‐ HLTCOE, USA)
Europeana and multi‐lingual access: Can we do it without Google or Microsoft? Can we do it without an open community? David Haskiya (Europeana Foundation)
Contributions
Cross‐lingual knowledge extraction Dunja Mladenic (JSI, SL) and Marko Grobelnik (JSI, SL)
Developing Language Resources in an African Context: the South African Case Justus C. Roux (CTexT, North‐West University, South Africa)
Identifying and networking forces: an international panorama Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet) and Claudia Soria (CNR‐ILC, IT / FLaReNet)
Discussants
Christopher Cieri (University of Pennsylvania ‐ Linguistic Data Consortium, USA)
Aurélien Max (LIMSI‐CNRS & Université Paris‐Sud, FR)
Plan B
Kenneth Church
Johns Hopkins University, HLTCOE
Abstract
Ideally a corpus should be large, balanced and annotated. But what should we do when we can't
have it all? This paper will discuss various fall-back positions, which we will refer to as Plan B:
1. Large unbalanced archives of newswire and web crawls
2. Google-ology: The Web is exciting
3. Catch-as-Catch-Can: Hybrid Combinations of a trillion of this and a billion of that and a
million of the other thing.
4. Just-in-Time Data Collection (with Amazon's Mechanical Turk)
5. Zero Resources (Do Without): It is sometimes possible to get something from nothing
Large Unbalanced Archives of Newswire and Web Crawls
Should a corpus be balanced? This is an old question that keeps coming up again and again. There
was an Oxford Debate on this question where the house was predisposed to vote for balance, but
ended up voting for quantity (with some encouragement from a few passionate American engineers
from IBM and AT&T). Patrick Hanks was really excited in the 1980s with what you could do with
tens of millions of words, far more than the Brown Corpus (one million). Much more could be
done with hundreds of millions of words such as the British National Corpus (BNC). These days,
web counts are 1000 times larger than BNC counts. More is more (Church, 2010), despite
criticisms of "Google-ology" (Kilgarriff, 2007).
Google-ology
The Web is exciting. There is hope that everyone will have access to vast quantities of data, well
beyond the means of consortia such as the British National Corpus (BNC). Government funding
agencies should focus their limited resources on targets of opportunity where they have an unfair
advantage, and they should avoid areas where commercial entities such as search companies have
an unfair advantage.
Web companies are doing what they can to work with researchers (Huang et al, 2010) and (Wang
et al, 2010).1 In addition, a number of researchers are discovering clever ways to use search
engines to do computational lexicography. For example, Turney (2001, 2002) has shown how to
use web queries to estimate pointwise mutual information (PMI), an improvement over Church
and Hanks (1990) since the web is so much larger than research corpora such as the BNC. The web
opens up all sorts of opportunities to use search engines for applications in computational
1 See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html, http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx, http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf and http://ngrams.googlelabs.com/.
lexicography: Hearst (1992), KnowItAll (Lin et al, 2009 & Etzioni et al, 2005), Sekine (2006),
Fang and Chang (2011) and much more.
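As a concrete illustration of the PMI-IR idea, the score is just the log ratio of observed co-occurrence to what chance would predict. The sketch below is an illustrative toy, not Turney's actual system; the counts are invented stand-ins for search-engine hit counts:

```python
from math import log2

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) * P(y))),
    estimated from raw counts over a notional collection of `total` pages."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return log2(p_xy / (p_x * p_y))

# Invented counts: a word pair that co-occurs 30x more often than chance predicts.
score = pmi(count_xy=30_000, count_x=2_000_000, count_y=500_000, total=1_000_000_000)
print(round(score, 2))  # → 4.91, i.e. log2(30)
```

In the web setting the counts come from search queries instead of a fixed corpus, which is exactly where the size advantage over the BNC comes in.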
Catch-as-Catch-Can
When we can't have what we want (large, balanced and annotated), can we come up with a hybrid
combination of what we have? (Some collections are large, some others are balanced, and a few
small ones are annotated.) In Bergsma et al. (2011), for a particular application (disambiguating a
special case of conjunction), we propose a combination of a trillion words of this (Google Ngrams),
a billion words of that (parallel corpora) and a million words of annotated text (Penn Treebank).
Just-in-Time Data Collection with Amazon’s Mechanical Turk
Mechanical Turk (Artificial Artificial Intelligence) is a hot topic. See Callison-Burch and Dredze
(2010) for a recent workshop on this subject.
While there is plenty of evidence that "there is no data like more data," you never know where the
next opportunity or crisis will be. Crowd-sourcing (Amazon's Mechanical Turk) makes it possible to
mobilize large armies of human talent around the world with just the right language skills, so that
it is feasible to collect what we need when we need it, even during a crisis such as the recent
earthquake in Haiti or the floods in Pakistan. According to Callison-Burch's home page
(http://www.cs.jhu.edu/~ccb/), he is #6 on the leaderboard of MTurk Requesters.
Zero Resources (Do Without)
Much of the work in Information Retrieval on web pages is based on simple counts (bags of words),
with few dependencies on language-specific resources. In Dredze et al (2010) and Jansen et al
(2010), we propose bags of pseudo-terms for spoken documents, which are similar to bags of words
for web search. Pseudo-terms are computed by linking long repetitions in the audio. We are
finding that bags of pseudo-terms are almost as good as bags of words, at least for some
applications. When we have resources, we ought to use them. But when we don't, it may be
possible to find something from nothing.
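A minimal sketch of the shared representation: with text the tokens are words, while with speech and no ASR they would be pseudo-term IDs assigned to clusters of long repeated audio segments. The documents below are invented examples:

```python
from collections import Counter
from math import sqrt

def bag(tokens):
    """Bag of words (or bag of pseudo-terms): token -> count."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

doc1 = bag("language resources for machine translation".split())
doc2 = bag("machine translation needs language resources".split())
doc3 = bag("digital preservation of cultural heritage".split())

print(round(cosine(doc1, doc2), 2))  # → 0.8: four of five terms shared
print(round(cosine(doc1, doc3), 2))  # → 0.0: no shared terms
```

Nothing in `cosine` cares whether the keys are words or pseudo-terms, which is why the same retrieval machinery carries over to spoken documents.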
References
Shane Bergsma, David Yarowsky and Kenneth Church. 2011. Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation. In ACL.
Chris Callison-Burch and Mark Dredze. 2010. Workshop on Creating Speech and Language Data With Mechanical Turk at NAACL-HLT.
Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29.
Kenneth Church. 2010. More is more. In G.-M. de Schryver, editor, A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, chapter 7. Menha Publishers, Kampala, Uganda.
Kenneth Church. 2008. Approximate Lexicography and Web Search. International Journal of Lexicography 21(3), 325-336.
Mark Dredze, Aren Jansen, Glen Coppersmith, and Kenneth Church. 2010. NLP on spoken documents without ASR. In EMNLP.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence 165(1), 91-134.
Yuan Fang and Kevin Chen-Chuan Chang. 2011. Searching patterns for relation extraction over the web: rediscovering the pattern-relation duality. In Proceedings of WSDM '11, 825-834.
Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING '92, Vol. 2, 539-545.
Jian Huang, Jianfeng Gao, Jiangbo Miao, Xiaolong Li, Kuansan Wang and Fritz Behr. 2010. Exploring web scale language models for search query processing. In WWW 2010.
Aren Jansen, Kenneth Church, and Hynek Hermansky. 2010. Towards spoken term discovery at scale with zero resources. In INTERSPEECH.
Frank Keller and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29(3), 459-484.
Adam Kilgarriff. 2007. Googleology is Bad Science. Computational Linguistics 33(1), 147-151.
Thomas Lin, Oren Etzioni, and James Fogarty. 2009. Identifying interesting assertions from the web. In Proceedings of CIKM '09, 1787-1790.
Peter Turney. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML 2001, Freiburg, Germany, 491-502.
Peter Turney. 2002. Coherent keyphrase extraction via web mining. In Proceedings of IJCAI, 434-439.
Satoshi Sekine. 2006. On-demand information extraction. In ACL.
Kuansan Wang, Christopher Thrasher, Evelyne Viegas, Xiaolong Li, and Bo-june (Paul) Hsu. 2010. An overview of Microsoft web N-gram corpus and applications. In Proceedings of the NAACL HLT 2010 Demonstration Session, 45-48.
Europeana and multi-lingual access:
Can we do it without Google or Microsoft?
Can we do it without an open community?
David Haskiya
The Europeana Foundation
“Europeana enables people to explore the digital resources of Europe's museums, libraries,
archives and audio-visual collections. It promotes discovery and networking opportunities in a
multilingual space where users can engage, share in and be inspired by the rich diversity of
Europe's cultural and scientific heritage.”
The key phrase here in relation to FLaReNet Forum is “multi-lingual” space. How can Europeana
make digitised heritage accessible across language boundaries? The Europeana metadata
repository holds descriptive metadata in at least 27 languages and those metadata records describe
original objects that can contain information in many other languages e.g. ancient texts written in
Latin, Old Norse, or even Sumerian! Apart from users from all European countries, we also know
that we have users from across the world, e.g. in India, South Korea, and Japan.
This poses massive challenges in localising user interfaces, in translating editorial texts, in
language identification of metadata records, in query translation, in multi-lingual semantic
enrichment, and in translation of the metadata records.
In this presentation I'll touch on some of these challenges and on whether they can be solved
without big companies like Google or Microsoft, and without open communities like Wikipedia.
Cross-lingual Knowledge Extraction
Dunja Mladenic and Marko Grobelnik
J. Stefan Institute, Ljubljana, Slovenia
Knowledge extraction has been a challenge for many years, and it remains one. On this long path
of exploring language while simultaneously modeling the world, we have transitioned through many
stages: from the early years, when linguists were exploring the structure of words and sentences
and philosophers were discussing ontology, up to today, when a large part of written information
is available to almost everybody and modern communication allows researchers to join their efforts
and exploit the opportunities offered by technology. Still, the apparent opportunities seem to be
much bigger than the actual depth of the solutions developed in recent years in the field of text
understanding.
One of the key properties of natural languages is redundancy in the encoded information and the
structure used. As a consequence, different techniques can extract different aspects of information
from a text. They range from simple techniques such as character counting, to more sophisticated
ones such as linear algebra, to the advanced techniques which exploit the structural aspects of text.
Many of these techniques deliver something useful and solve somebody’s problem. Some examples
of such problems are: language identification (solved with character counting), document
categorization (solved with linear algebra methods), question-answering (solved typically with
shallow linguistic methods), and reasoning (solved typically using logic).
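To make the "character counting" remark concrete, here is a minimal language identifier based on character-bigram frequency profiles; the training snippets are toy examples, far too small for real use:

```python
from collections import Counter

def char_profile(text, n=2):
    """Relative frequencies of character n-grams (bigrams by default)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(query, model):
    """Shared n-gram probability mass between two profiles."""
    return sum(min(p, model.get(g, 0.0)) for g, p in query.items())

# Toy "training" data, one snippet per language:
models = {
    "en": char_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": char_profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}

def identify(text):
    profile = char_profile(text)
    return max(models, key=lambda lang: similarity(profile, models[lang]))

print(identify("the dog and the cat"))  # → en
```

Real systems use larger n-gram models and far more training text, but the principle is the same: simple counts are enough for this particular task.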
Having many techniques for dealing with text has the unfortunate consequence that those
techniques fall into different research areas, which often do not communicate effectively. Text
understanding, as a long-term research goal, can significantly benefit from a diversity of insights
coming from related research areas.
This talk will address the problem of knowledge extraction from social language in a cross-lingual
setting and the need for a semantic interlingua, based on logic, to support knowledge extraction.
Developing language resources in an African context:
The South African case
Justus C Roux
Centre for Text Technology (CTexT)
North-West University, Potchefstroom, South Africa
Background
Developing appropriately balanced and annotated text and speech corpora for resource scarce
languages is indeed a challenge, particularly in an African context where many indigenous
languages have a relatively short written tradition. This presentation will focus on the South
African situation, with eleven proclaimed official languages and an official language policy that
implicitly views the development of human language technologies as a means to convey
information to citizens in a language of choice. Apart from English and Afrikaans (a Germanic
language closely related to Flemish and Dutch), the other nine official languages can be classified
into four language families: Nguni (isiZulu, isiXhosa, Siswati, isiNdebele), Sotho (Sesotho
sa Leboa, Setswana and Sesotho), Tshivenda and Xitsonga. The ideal of developing appropriate text
and speech applications in a variety of languages obviously relies on the availability of
resources. Developing these is, however, a painstaking and costly exercise, and it has become
necessary to adopt a variety of data-gathering approaches to speed up the collection and processing
of text and speech resources.
Factors influencing data acquisition and processing
- Limited printed materials in the African languages, e.g. newspapers, periodicals, etc.
- Limited presence on the Internet, restricting web crawling as a means of gathering text data.
- Different writing systems (conjunctive and disjunctive) requiring considerable preprocessing and linguistic knowledge in the processing phase.
- Non-standardized orthographies and spelling conventions influencing the development of spelling checkers.
- Unsophisticated users of technology.
- Relatively small HLT community, with a resulting lack of capacity.
Potential data sources
- Multilingual Hansard data from local and national parliaments (text and speech).
- Multilingual documents generated by government departments at provincial and national levels.
- Access to the speech archives of the South African Broadcasting Corporation (SABC), which transmits programs in all languages on a continuous basis.
- Private media houses.
Methodological approaches
These include, inter alia:
- Telephone-based speech acquisition (especially mobile): different projects since 2000.
- Use of automated speech segmentation and annotation techniques.
- Creation of software (language-specific morphological analysers) to account for the agglutinative languages.
- Creation of pronunciation dictionaries: bootstrapping techniques to speed up the process.
- Software development for verifying pronunciation dictionaries.
Resource Centre for reusable resources
In December 2008 the South African Government approved the establishment of a National Centre
for Human Language Technologies (NCHLT) as part of the implementation of a National Language
Policy Framework approved by Cabinet in 2003. This policy explicitly refers to the development of
Human Language Technologies (HLT) in South Africa.
The NCHLT currently functions as a virtual entity under the auspices of the national Department of
Arts and Culture (DAC) with the support of an Expert Panel for HLT (HLTEP), mainly comprising
academics actively working in the field. Over the last number of years this department has been
sponsoring a number of text and speech related HLT projects mainly conducted by academic and
research institutions functioning as “agencies”. It was initially foreseen that resources developed
through these projects would need to be managed in a professional way and thus a Resource
Management Agency (RMA) needed to be established. A blueprint for the functioning of this RMA
was developed in 2010 and a call for interest for running this agency is to be issued in June 2011. It
is foreseen that this RMA will be operative before the end of 2011.
Aims of the RMA
- To function as a single deposit point for the various types of electronic data of the official languages of South Africa, for research and development purposes in the field of Human Language Technologies. The data types may include broad categories such as text, speech, language-related video, multimodal resources including sign language, as well as pathological language and forensic language data;
- To function as the official language resource management and distribution point for data related to Human Language Technologies for all of the South African official languages. Ongoing projects of the DAC/NLS (National Language Service) will require that all project data be deposited with and managed by the RMA, in order to prevent data loss and to promote reusability of the data;
- To position South Africa strategically with respect to collaboration with other similar agencies worldwide, with a long-term vision of becoming the hub for language resource management in Africa.
Identifying and networking forces: an international panorama
Joseph Mariani1 and Claudia Soria2
1LIMSI-CNRS and IMMI, France and 2CNR-ILC, Italy
In order to conduct a survey of the National and Transnational initiatives in the area of Language
Resources (LR), a network of international FLaReNet Contact Points was created in August 2010,
comprising 102 people from 76 countries or regions in the European Union (26 Member States and
6 regions), non-EU European countries (9 countries) and non-European countries (35 countries).
The survey shows that almost all European countries now take care of gathering Language
Resources for their languages in order to conduct research and to develop and test systems for
those languages. The languages that were considered "low-resourced" are rapidly catching up, even
if they still need many more LR, given that no language has enough LR available for the needs of
the research and industrial communities. Surprisingly, the UK has no national program for (British)
English. The reason may be the importance of US activities regarding the processing of the
(American) English language. Baltic and Nordic countries are conscious of the importance of LR for
the promotion and survival of their languages, and they accordingly have a policy to support that
area, including for minority languages. There is also important activity in some EU regions, either
for specific languages (Basque, Catalan) or in general (the Trento region in Italy).
Activities in other parts of the world are also impressive. Governmental initiatives in India and
South Africa to develop Language Technologies for all the official languages of those multilingual
countries, in order to meet the needs of all citizens, are exemplary. The creation of associations
for the specific development of Language Technologies for the Arabic language or for African
languages is also an interesting trend, while Asia continues to organize activities country by
country, with individual initiatives dealing with cultural heritage, including the preservation of
the languages spoken in each country.
This work represented the basis for conducting a survey in the META-NET project of ongoing and
recent projects and initiatives at the national, EU and transnational level. The main purpose was to
identify projects addressing Machine Translation, multilingual issues, language resources and
technologies, or infrastructural issues at large. The focus of the survey is on Europe but relevant
initiatives outside Europe have been reviewed as well. This work resulted in the collection of up-to-
date and quality-checked information about 66 EU, 28 transnational and 183 national projects,
thus contributing to a comprehensive and reliable overview of recent and current activities.
Now that the importance of Language Resources and of Language Technology evaluation is recognized
as essential, the coordination of activities should be the aim, in order to avoid a multiplication
of independent initiatives that may result in an inextricable landscape, and to build on Best
Practices. The needs should be carefully identified and properly addressed, both in the framework
of the Information and Communication Technologies and of the Human and Social Sciences, while
devoting an investment commensurate with the size of the corresponding challenge.
It is our intention to build upon this permanent network of International Contact Points, and to
maintain a wiki allowing for the regular update of the harvested information. The Internet brings
the availability of a worldwide Web, which provides a border-less infrastructure allowing for a
network approach to LR production, distribution, validation, evaluation and sharing. This may be
the chance to achieve full multilingualism.
S6. Long life to our resources – 14:45‐16:30
Chair: Khalid Choukri
Introduction
Khalid Choukri (ELDA, FR / Chair)
Talks
Preparing to share the effort of preservation using a new EU preservation e‐Infrastructure David Giaretta (STFC, UK)
Digital Humanities – its challenges and opportunities Daan Broeder (MPI, NL) and Peter Wittenburg (MPI, NL)
Contributions
Durable digital data at DANS Peter Doorn (DANS, NL)
Life is Longer and Better in OpenAIRE Yannis Ioannidis (University of Athens & ILSP ‐ A.C. "Athena", GR)
Discussants
Bob Boelhouver (Instituut voor Nederlandse Lexicologie, NL)
Edouard Geoffrois (DGA, FR)
Preparing to share the effort of preservation using a new EU
preservation e-Infrastructure
David Giaretta
STFC, UK
This talk will describe the background and the capabilities of the preservation e-Infrastructure
components which are to be produced by the SCIDIP-ES EU project, and how these can benefit this
domain. These components are based on the techniques proven through the CASPAR project
(www.casparpreserves.eu) and justified by the surveys and case studies of PARSE.Insight
(www.parse-insight.eu).
Closely linked to these is the APARSEN network of excellence (www.aparsen.eu), which is bringing
together researchers from libraries, academia, industry, vendors and science laboratories to create
a common vision for digital preservation research. By this we mean not a straitjacket but rather
a direction of travel and a common understanding, which should ensure that the individual pieces
of research can be seen as parts of a greater whole and can thereby work together rather than as
disjoint universes of discourse. The individual pieces of research will, for the most part, create
metadata which are essential for the working of the domain-neutral e-Infrastructure components.
An additional benefit, which will also be described, arises from viewing preservation as enabling
those unfamiliar with the data to use it; for preservation, this unfamiliarity comes from distance
in time. However, the same infrastructure, tools and techniques also allow those who, right now,
are unfamiliar with the data through distance in expertise to use the data.
Digital Humanities – its challenges and opportunities
Peter Wittenburg
MPI, The Netherlands
Topics that dominate some of the current discussions in the humanities are a) the paradigm change
in humanities research towards digital humanities, b) the need to focus on data curation and
integration, c) ways to foster computational humanities and d) how to ensure that not only a few
specialists can profit from an infrastructure offering many opportunities. These topics are closely
related: for example, computational humanities can only be done efficiently and at a reasonable
level when the data curation problem is tackled, and digital humanities can only be realized when
humanities researchers have access to well-integrated data resources and tools/web services.
Finally, we need to develop a strategy for how we want to educate future generations of humanities
researchers: do we expect all of them to be able to do smart scripting in a technologically
increasingly complex domain, or do we expect to offer smart services and easy-to-use virtual
research environments supporting the mass of researchers? Infrastructure initiatives need to find
answers to these questions and tackle them in a convincing way, and this at a time when
collaboration towards an eco-system of infrastructures is much needed.
The talk will touch on these different aspects and outline directions that are currently being debated.
Durable digital data at DANS
Peter Doorn
Director, Data Archiving and Networked Services
Summary
Data Archiving and Networked Services (DANS) promotes durable access to digital research data.
For this purpose, DANS encourages scientific researchers to archive and reuse data in a sustainable
manner, e.g. by means of our online archiving system EASY. DANS also provides access, through
Narcis.nl, to thousands of scientific datasets, e-publications and other research information in the
Netherlands. In addition, the institute provides training and advice and performs research into
sustained access to digital information.
The data collections in our archives focus on the humanities and social sciences. DANS will also act
as a CLARIN data center, concentrating on the role of long-term preservation of language data and
text corpora. There are still important challenges to tackle; preserving services (or software) is for
instance much more complicated than preserving raw data.
We developed the Data Seal Of Approval (www.datasealofapproval.org), which specifies criteria for
trusted digital deposit, preservation and use of data.
As coordinator of the preparation phase, DANS is also involved in DARIAH, the European-wide
Digital Research Infrastructure for the Arts and Humanities. DARIAH’s mission is to enhance and
support digitally enabled research across the humanities and arts. DARIAH aims to develop and
maintain an infrastructure in support of ICT-based research practices across the arts and
humanities, and works with communities of practice to:
- Explore and apply ICT-based methods and tools to enable new research questions to be asked and old questions to be posed in new ways
- Link distributed digital source materials of many kinds
- Exchange knowledge, expertise, methodologies and practices across domains and disciplines
DARIAH aims to create a single, unified European data area in which scholars and students can
easily survey the available information in their field – data which is dependable in terms of both
quality and durability.
Driven by data, DANS ensures that access to digital research data keeps improving, by its services
and by taking part in (international) projects and networks. Go to www.dans.knaw.nl for more
information and contact details.
Life is Longer and Better in OpenAIRE
Yannis Ioannidis
University of Athens & ATHENA Research Center, Greece
OpenAIRE is a European project that delivers an electronic infrastructure and supporting
mechanisms for the identification, deposition, open access, and monitoring of FP7 and ERC funded
articles, through the establishment and operation of the European Helpdesk. All deposited articles
resulting from EU-funded research are freely accessible through the www.openaire.eu portal,
which also supports a special repository for articles that can be stored neither in institutional nor in
subject-based/thematic repositories. The OpenAIRE electronic infrastructure is based on state-of-
the-art software services of the D-NET package developed within the DRIVER and DRIVER-II
predecessor projects and the Invenio digital repository software developed at CERN. Thematically,
the project focuses on peer-reviewed publications in at least seven disciplines: energy,
environment, health, cognitive systems-interaction-robotics, electronic infrastructures, science in
society, and socioeconomic sciences-humanities. Geographically, the OpenAIRE project has a
definitive European footprint by covering the European Union in its entirety, engaging people and
scientific repositories in almost all 27 member states and beyond. In this presentation, I will
discuss the architecture and overall philosophy behind the infrastructure software built as part of
OpenAIRE and will highlight the quality characteristics it offers: flexibility, availability,
sustainability, adaptability, and reusability.
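Repository networks in the DRIVER/D-NET tradition are typically harvested over OAI-PMH, the standard protocol for exposing repository metadata. As a minimal sketch, assuming a hypothetical endpoint (the URL and helper function below are illustrative, not part of the actual OpenAIRE API), a harvester composes requests such as:

```python
from urllib.parse import urlencode

def build_oai_request(base_url, verb, **params):
    """Compose an OAI-PMH request URL from a base endpoint, a protocol verb
    (e.g. ListRecords, Identify), and optional arguments such as metadataPrefix."""
    query = {"verb": verb, **params}
    return base_url + "?" + urlencode(query)

# Placeholder endpoint for illustration only; a real harvester would use the
# base URL advertised by the repository being harvested.
url = build_oai_request("https://repo.example.org/oai",
                        "ListRecords", metadataPrefix="oai_dc")
print(url)  # https://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The response to such a request is an XML document of Dublin Core records, which an aggregator like D-NET can then normalize and index.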
Closing Session – 16:30‐17:30
Chair: Nicoletta Calzolari
Nicoletta Calzolari (ILC‐CNR, IT / FLaReNet Coordinator)
"Highlights from the Sessions" Sessions' Chairs
Community Endorsement of the FLaReNet Recommendations Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet)
The Language Resources Sharing Charter Stelios Piperidis (ILSP ‐ A.C. "Athena", GR)
Language Resources in the LT Strategic Research Agenda Hans Uszkoreit (DFKI, DE / META‐NET)
The FLaReNet Community: the Way Forward Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet) and Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet)
Organisation
Scientific Committee
Nicoletta Calzolari (ILC‐CNR, Pisa, ITALY)
Khalid Choukri (ELDA, Paris, FRANCE)
Stelios Piperidis (ILSP / “Athena” R. C., Athens, GREECE)
Jan Odijk (Universiteit Utrecht, Utrecht, THE NETHERLANDS)
Núria Bel (Universitat Pompeu Fabra, Barcelona, SPAIN)
Joseph Mariani (LIMSI/IMMI‐CNRS, Paris, FRANCE)
Claudia Soria (ILC‐CNR, Pisa, ITALY)
Organising Committee
Nicoletta Calzolari (ILC‐CNR, Pisa, ITALY)
Paola Baroni (ILC‐CNR, Pisa, ITALY)
Riccardo Del Gratta (ILC‐CNR, Pisa, ITALY)
Francesca Frontini (ILC‐CNR, Pisa, ITALY)
Sara Goggi (ILC‐CNR, Pisa, ITALY)
Monica Monachini (ILC‐CNR, Pisa, ITALY)
Valeria Quochi (ILC‐CNR, Pisa, ITALY)
Irene Russo (ILC‐CNR, Pisa, ITALY)
Claudia Soria (ILC‐CNR, Pisa, ITALY)
Local Committee
Rodolfo Delmonte (Università Ca' Foscari, Venezia, ITALY)
Rocco Tripodi (Università Ca' Foscari, Venezia, ITALY)