The 3rd European Language Resources and Technologies Forum
Language Resources in the Sharing Age ‐ the Strategic Agenda
Venezia, 26‐27 May 2011
Proceedings
Edited by: Calzolari N., Baroni P., Soria C., Goggi S., Monachini M., Quochi V.
Istituto di Linguistica Computazionale del CNR ‐ Pisa, ITALY
Table of Contents
Program
Opening Session
Session 1 – Identification and tracking of Language Resources
Session 2 – Open Data
Session 3 – Go green: reuse, repurpose and recycle resources
Session 4 – Innovation needs data
Session 5 – Data for all languages: think big
Session 6 – Long life to our resources
Closing Session – From recommendations to actions
Organisation
Program

Wednesday 25th May 2011
20:30 Welcome Concert

Thursday 26th May 2011
9:00  Registration
10:00 Opening Session
11:00 Coffee Break
11:30 S1 Identification and tracking of Language Resources
13:15 Lunch
14:30 S2 Open Data
16:20 Coffee Break
16:50 S3 Go green: reuse, repurpose and recycle resources
18:30 End 1st Day
20:00 Social Dinner
Posters 9:00-18:30

Friday 27th May 2011
9:15  S4 Innovation needs data
11:00 Coffee Break
11:30 S5 Data for all languages: think big
13:15 Lunch
14:45 S6 Long life to our resources
16:30 Closing Session – From recommendations to actions
17:30 End 2nd Day
Posters 9:15-17:30
Thursday 26th May 2011
Opening Session – 10:00‐11:00
Chair: Nicoletta Calzolari
Nicoletta Calzolari (ILC‐CNR, IT / FLaReNet Coordinator)
Aleksandra Wesolowska (EC ‐ DG Information Society & Media ‐ Unit INFSO.E1 ‐ LTs & MT, LUX)
Flavio Gregori (Ca’ Foscari University ‐ Department of Comparative Linguistic and Cultural Studies, IT / Director)
Rodolfo Delmonte (Ca’ Foscari University, IT / Local Host)
The FLaReNet Recommendations Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet)
The third FLaReNet Forum
Nicoletta Calzolari, Claudia Soria
Istituto di Linguistica Computazionale, CNR, Pisa, Italy
FLaReNet – Fostering Language Resources Network – is an international Forum, composed of a steadily growing community, whose goals are:
o to coordinate a community-wide effort to analyse the sector of language resources and technologies along all the relevant dimensions: technical and scientific, but also organisational, economic, political and legal;
o to promote and sustain international cooperation;
o to identify short-, medium- and long-term strategic objectives and provide consensual recommendations in the form of a plan of action targeted to a broad range of stakeholders, from the industrial and scientific community to funding agencies and policy makers.
The FLaReNet Forum is the venue where leading experts in the field of Language Resources and Technologies (LRT) gather to present their visions, discuss some of the hot topics identified by FLaReNet, and consensually identify a set of priorities and strategic objectives. Many messages have recurred across the various meetings organised by FLaReNet, a sign both of great convergence around these ideas and of their relevance and importance to the field. The Forum is also meant to validate ideas that have been “in the air” for several years – and, in some cases, fostered and/or developed by specific groups – as having entered the mainstream of thought and practice within the LRT community.
To date, the major high-level FLaReNet recommendations are presented in the FLaReNet Blueprint of Actions and Infrastructures, which gathers the recommendations collected across the many meetings, panels and consultations of the community, as well as the results of the surveying and analysis activities carried out under the FLaReNet work packages. The Blueprint is the
result of a permanent and cyclical consultation that FLaReNet has conducted inside the community
it represents – with more than 300 members – and outside it, through connections with
neighbouring projects, associations, initiatives, funding agencies and government institutions.
The Blueprint is organised along three main “directions”: Infrastructural Aspects, Research and
Development, Political and Strategic Issues. They reflect three major development directions that
can boost or hinder the growth of the field of Language Resources and Technologies. Altogether
these directions are intended to contribute to the creation of a sustainable LRT ecosystem.
We present in a leaflet the three tables that synthesise – for the three directions – the main
challenges and the corresponding recommendations.
In this third Forum, the Blueprint – together with the recommendations in the major areas and along the various critical dimensions around Language Resources – is opened again to the community for improvement and validation, and the community is called upon to form a consensus on the top priorities. As a result of this community consultation, a document will be produced and then sent to all FLaReNet Institutional Members and National Contact Points for adoption and endorsement. This is an important step in demonstrating a critical mass around our recommendations, in order to sensitise those responsible for implementing the financial and political frameworks necessary to sustain the actions over the long term.
S1. Identification and tracking of Language Resources – 11:30‐13:15
Chair: Joseph Mariani
Introduction
Joseph Mariani (LIMSI/IMMI‐CNRS, FR / Chair)
Introductory Talks
Towards a Comprehensive Model for Language Resource Catalogs (with emphasis on non‐traditional resources) Chris Cieri (University of Pennsylvania ‐ Linguistic Data Consortium, USA)
The concept of BRIF (Bioresource research impact factor) as a tool to foster bioresource sharing in research: is it applicable to other domains? Anne Cambon‐Thomsen (CNRS, FR)
Contributions
Capturing Community Knowledge of LRs: the LRE Map Claudia Soria (CNR‐ILC, IT / FLaReNet)
A journey from LRE Map to Language Matrixes Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet)
Proposal for the International Language Resource Number Khalid Choukri (ELDA, FR / FLaReNet)
Discussants
Antonio Pareja‐Lora (Universidad Complutense de Madrid, SP)
Gil Francopoulo (Tagmatica & IMMI‐CNRS, FR)
Towards a Comprehensive Model for Language Resource Catalogs,
with emphasis on non-traditional resources
Christopher Cieri
Linguistic Data Consortium, University of Pennsylvania
Rapid growth in the inventory of language resources (LRs) has also led to a proliferation of the sites one must search in order to identify, evaluate and acquire those resources. Some intrepid
researchers have undertaken to ameliorate this problem by developing union catalogs and
technical infrastructure to support such catalogs. Those efforts have typically focused on what we
argue is a subset of all LRs. We argue here that the material cataloged must be expanded to include
– in addition to data sets – tools, specifications and, particularly, published papers that describe,
critique or build upon LRs.
In 1992, when the Linguistic Data Consortium (LDC)1 was founded, its goal was to serve as a central location where language resources, principally data sets at that time, could be hosted, archived and distributed under consistent terms in order to lower barriers to research. LDC principals recognized that the organization’s origins in the American experience would make it best suited to serve that market, despite early and ongoing efforts to adapt to international needs. Not surprisingly, a number of data centers have since opened around the world, including some others in the U.S. The European Language Resource Association (ELRA2) was created in France in 1995, followed by Gengo-Shigen-Kyokai3 (GSK) in Japan in 2003, the Chinese LDC4, and the LDC for Indian Languages5 in 2007, to name a few. In addition, there have been a number of national corpus initiatives, including the British6, American7, Dutch and Danish, among others. Finally, the number of individual laboratories and projects that create and distribute LRs directly has only grown over time.
Since, as is clear, some LR creators will continue to distribute independently and multiple data
centers will continue to operate, the communities that rely upon LRs also need union catalogs or
other portals that harvest metadata from multiple providers and present them in a unified
interface. Indeed, the Open Language Archives Community (OLAC) has developed a specification for such a catalog, along with multiple instantiations. According to their site, “OLAC Archives contain over
100,000 records, covering resources in half of the world's living languages”8. In addition, OLAC
gathers metadata records from 45 different providers.
LR catalogs differ in their treatment of LR types. The LDC9 and ELRA10 Catalogs focus on data sets.
The ELRA Universal Catalog11 contains corpora, lexicons and tools but apparently not academic
1 www.ldc.upenn.edu
2 www.elra.info
3 www.gsk.or.jp
4 www.chineseldc.org
5 www.ldcil.org
6 www.natcorp.ox.ac.uk
7 www.americannationalcorpus.org
8 www.language-archives.org
9 www.ldc.upenn.edu/Catalog
10 catalog.elra.info
11 universal.elra.info
papers or specifications except when they are included within data sets. Among the resource types
listed, there is no entry for either, and searches for appropriate terms return no papers. In contrast, OLAC archives do contain many records describing academic papers; however, they typically do not distinguish such papers from other textual resources such as corpora and specifications. The
LDCIL provides separate but consistent treatment of data, tools and specifications but apparently
not academic papers.
Notwithstanding the current practice among LR providers, LR users rely upon academic papers
and specifications as they work to exploit data sets and tools. Specifications describe the goals of
the LR as well as the assumptions of its developers, the methods used and the formats included.
Academic papers may further describe the LR, provide criticism, and explain prior attempts to
exploit the LR, their success and any lessons learned. In short, data sets, tools, specifications and
academic papers form a network of LRs that should be considered as a whole when any component
is exploited. Unfortunately, the LR user communities currently lack any easy way to find such
networks of resources. To resolve this problem we need a concerted effort to update our catalogs
with metadata entries for missing components and to develop links among the entries to express
the relations that connect the individual LRs. Ongoing maintenance must then follow the one-time
effort.
LDC has recently begun to catalog all academic papers that mention LDC corpora. To date we have
identified approximately 5000 papers mentioning about half of all our corpora. Our goal is to complete one pass over all LDC corpora to find related papers, making the results available and allowing paper authors to update them. Naturally, we encounter several challenges in this effort. There are many corpora in the LDC Catalog and multiple papers mentioning each. The papers span multiple research publications covering diverse communities. Paper authors cite LRs variably, sometimes providing the full name or catalog number, sometimes abbreviating in different ways. Therefore, determining that a paper mentions an LR requires reading at least part of the paper, yet no single researcher commands the skills required to adequately understand whether and how the LR and paper are related. Still, we believe this effort is worthwhile. The results will prove beneficial to: 1) LR users, who will gain a better understanding of prior uses and related LRs developed; 2) LR creators, who will better monitor critiques of their work and also understand its impact; 3) new development projects, which benefit from lessons learned in prior efforts.
The concept of BRIF (Bioresource research impact factor) as a tool
to foster bioresource sharing in research: is it applicable to other
domains?
Anne Cambon-Thomsen
CNRS
For the working group BRIF
Inserm and University of Toulouse, UMR 1027, 31000 Toulouse, France
Introduction
Concept. Numerous health research funding institutions have recently expressed their strong will
to promote data sharing. As underlined in a recent editorial in Nature Medicine, an operational
approach is needed to achieve this goal1. Bioresources such as biobanks, databases and
bioinformatics tools are important elements in this landscape. Bioresources need to be easily
accessible to facilitate advancement of research. Besides technical and ethical aspects, a major
obstacle for sharing them is the absence of recognition of the effort behind establishing and
maintaining such resources. The main objective of a Bioresource Research Impact Factor (BRIF) is
to promote the sharing of bioresources by creating a link between their initiators/implementers
and the impact of scientific research using them.2 A BRIF would make it possible to trace: 1. the
quantitative use of a bioresource, 2. the kind of research utilising it, and 3. the efforts of people and
institutions that construct and make it available.
An international working group (http://www.gen2phen.org/groups/brif-bio-resource-impact-
factor). In the context of EU projects, a BRIF working group has been set up, so far including 105 participants. The work involves several steps: 1. creating a unique identifier; 2. standardising
bioresource acknowledgment in papers; 3. cataloguing bioresource data access and sharing
policies; 4. identifying other parameters to take into account; and 5. prototype testing, involving
volunteer bioresources and the help of journal editors.
A workshop. The first BRIF workshop was held in Toulouse, France (17-18 January 2011),
gathering 34 people from 10 countries, representing various domains: biobanks, genome
databases, epidemiological longitudinal cohorts, bioinformatics, scientific publishing, bibliometry,
health law and bioethics3. The lack of objective measures of the use of bioresources was recognised by all; we focused on shared aims, but underlined that each community has specific aspects to consider and resolve (http://precedings.nature.com/collections/brif-workshop-january-2011).
Main avenues explored and further steps
Identification. Bioresources need to be identified by a unique digital identifier (ID), ideally via
existing mechanisms4. Digital Object Identifiers (DOIs) may be interesting5. Several issues must be
1 Cambon-Thomsen A. Assessing the impact of biobanks. Nat Genet 2003; 34: 25–26.
2 Kauffmann F, and Cambon-Thomsen A. Tracing Biological Collections Between Books and Clinical Trials. JAMA 2008; 299: 2316–18
3 Cambon-Thomsen et al. The role of a bioresources impact factor as an incentive to share human bioresources. Nat genet 2011, in press
4 Peterson J, and Campbell J. Marker papers and data citation. Nat Genet 2010; 42: 919
considered, including: what to identify (biobank, collection, database, dataset, subset); identifier
requirements (persistent over time, globally-unique, citable); and which international and
independent body should be responsible for assigning bioresource IDs. Working subgroups were
created to address those questions. Attribution of credit to scientists for different kinds of work (in
addition to publications) using researcher IDs was also discussed. The ORCID initiative (http://www.orcid.org) is building a new contributor ID framework which should in principle enable credit to be given both to bioresources and to the individuals involved in their creation and maintenance.
Citation. Standardisation is necessary, but could be combined with existing referencing standards and conventions6, such as citing marker papers, using standardised sentences in the Materials & Methods or acknowledgements sections of papers, granting co-authorship when justified, and including the resource name in the paper title. Specific requirements for citing bioresources are lacking in the Uniform Requirements for Manuscripts Submitted to Biomedical Journals and should be added. To enable automated tracking of bioresource use, the bioresource ID should ideally appear in or under the abstract, so that it is visible even without access to the full text of articles.
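To illustrate why a visible ID enables automated tracking, the sketch below scans abstract text for identifiers with a purely hypothetical "BRIF:" prefix; no such identifier format has actually been standardised, and the pattern is an assumption for the example only.

```python
import re

# Hypothetical ID shape: "BRIF:" followed by an alphanumeric token
# (letters, digits, dots, hyphens). This format is illustrative only.
BRIF_ID = re.compile(r"\bBRIF:([A-Za-z0-9.-]+)")

def extract_ids(abstract: str) -> list:
    """Return all bioresource identifiers mentioned in an abstract."""
    return BRIF_ID.findall(abstract)
```

With the ID placed under the abstract, a journal crawler could attribute use of a bioresource without ever fetching the full text.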
Factors to take into account in impact factor calculation. BRIF should not be a citation index only.
Factors such as time and domain of bioresources need to be considered in the calculation process
and its weighting. Although the BRIF scope could be extended to measure many different aspects
of bioresource utilisation, including economic implications, it was decided to concentrate first on
use and impact in research settings.
Access and sharing policies. These have been developed over the years. However, incentivising bioresources to promote access needs to be balanced with appropriate provisions compatible with their interests: proper recognition of scientific contribution, and sustainability supported by the capacity to measure their own impact. At present, there are no mechanisms in place to measure this impact. Empowering bioresources with tools such as BRIF is, therefore, urgent.
Perspectives. The full impact of bioresources is wider than BRIF, but BRIF is an essential
operational step. The present proliferation of ideas, statements and proposals around data sharing
from different perspectives and stakeholders favours the emergence of tools such as BRIF in order
to make data sharing principles operational. In the same spirit it could be applied to the domain of
language studies.
5 International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Writing and Editing for Biomedical Publication. Version April 2010: http://www.icmje.org/urm_main.html (accessed Feb 11, 2011).
6 Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature 2009; 461: 168–70.
Capturing Community Knowledge of LRs: the LRE Map
Claudia Soria
CNR-ILC, Italy
The LRE Map of Language Resources and Tools is an initiative jointly launched by FLaReNet and ELRA in May 2010 with the purpose of developing an entirely new instrument for discovering, searching and documenting language resources, here intended in a broad sense as both data and tools. The LRE Map was initially launched in conjunction with the LREC 2010 Conference, conceived as a campaign for collecting information about the language resources and technologies underlying the scientific work presented at that conference. To collect this information, authors who submitted a paper were requested to provide information about the language resources they either developed or used. The required information was fairly basic: the type of the resource, the language and modality represented, the intended or actual application purposes, the degree of availability for further use, the maturity status, the size, the type of license, and the availability of documentation.
In a rather short time, the LRE Map contained more than 2000 descriptions of resources, and it soon became a very popular initiative, joined by the COLING and EMNLP conferences as well. Other conferences, such as Interspeech, ACL-HLT and IJCNLP, have already agreed to take part and will use the LRE Map at their next conferences. This shows that the idea has great potential, and there is reasonable confidence that it can become a “standard and regular” instrument at LRT conferences. Growth in the number of resources and the quantity of information about LRT can easily be foreseen.
So far the LRE Map is both a set of metadata about LRT collected at three major conferences (during 2010) and a web interface designed to search and browse this data. The web interface currently provided to the community is a very simple one based on non-normalised data, while a new release (currently an alpha version) offers a better visualization of the data and a login system for simple access management to resources. Moreover, it is based on (partially) normalised data.
The LRE Map web interface provides the possibility of searching according to a fixed set of metadata (designed for conference submission) and of viewing/editing the details of extracted resources (editing is allowed only if the user is directly related to the resource, i.e. when the user is an author of the paper that cites it). In addition, the database contains many implicit relations, for example among authors (who are related through the same resources) and among resources (which are cited by the same authors in different papers).
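As an illustration of how such implicit relations could be extracted – this is a sketch under assumed field names, not the actual LRE Map implementation – the following derives author–author relations from raw (author, resource) pairs:

```python
from collections import defaultdict
from itertools import combinations

def author_relations(pairs):
    """From (author, resource) pairs, return the set of author pairs
    implicitly related by being associated with at least one shared resource."""
    by_resource = defaultdict(set)
    for author, resource in pairs:
        by_resource[resource].add(author)
    related = set()
    for authors in by_resource.values():
        # Every pair of authors of the same resource is implicitly related.
        for a, b in combinations(sorted(authors), 2):
            related.add((a, b))
    return related
```

The symmetric trick (grouping by author instead of by resource) yields resource–resource relations.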
The potential of the LRE Map for becoming a powerful aggregator of information related to
language resources was immediately clear, as was the possibility of deriving and discovering new
combinations of information in entirely new ways. The database underlying the LRE Map can yield
interesting matrices of the language resources available for the various languages, modalities, or
applications. Such matrices have already been used, for example, in META-NET to provide a
picture of the situation of resources for the various European languages.
Although the LRE Map was not realized as a social platform, it was conceived as a community-based initiative. Unlike other catalogues maintained by institutions worldwide (ELRA, LDC, the National Institute of Information and Communications Technology (NICT), the ACL Data and Code Repository, OLAC, LT World, etc.), the LRE Map presents a set of innovative features, since it is built by the community according to a bottom-up strategy and without conforming to a strict template of possible metadata and values. The Map truly captures the knowledge of the LRT community about language resources.
The immediate success and impact of the LRE Map, together with the spontaneous agreement of major conferences/associations in the LRT field to adopt it, now require moving from a prototype to a stable and solid service that is both a repository of information about language resources and, at the same time, a community for users of resources: a place to share and discover resources, discuss opinions, provide feedback, and spot new trends.
A Journey from the LRE Map to the Language Matrixes
Joseph Mariani
LIMSI-CNRS & IMMI, France
The objective of the Language Matrixes developed within META-NET is to provide a clear picture of what exists in terms of Language Resources (LRs), in the broad sense (Data, Tools, Evaluation and Meta-resources), for the various languages, and to highlight the languages that lack such resources. The goal would then be to ensure the production of the corresponding resources to fill the gaps for those languages.
The Language Matrixes provide an easy way to get that picture and to access the details of the corresponding resources.
We built those matrixes from the LRE Map, which was produced from the information provided by the authors of the papers submitted to the LREC 2010 conference, which gathers the international community working in the field of Language Resources and Evaluation. Each author was requested to provide a set of information on the Language Resource(s) mentioned in their paper, through an online questionnaire that includes suggestions as an aid to the author. This resulted in a table of close to 2,000 entries. This information was then made available to the scientific community, through the same interface as the one used for the information acquisition.
The Language Matrixes were automatically built from that table. In this first analysis, we considered the 23 official languages of the European Union, together with a category for “Regional European languages” and one for “Non-EU European languages”, as well as “Multilingual”, “Language Independent” and “Not Applicable” categories. We produced 8 Language Matrixes on: Multimodal/Multimedia Data and Tools, Written Language Data and Tools, Spoken Language Data and Tools, Evaluation, and Meta-resources (standards, metadata, guidelines). Several Types of resources are listed in each matrix, corresponding either to the 24 Types suggested in the questionnaire or to the author’s own entry when no suggested Type was found appropriate. This results in a total of about 150 Language Resource Types, with a variable number for each matrix (from 5 Types for Evaluation to 78 Types for Written Language Tools).
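As a sketch of how such matrixes can be derived from the cleaned table – an illustration, not the actual META-NET code – the snippet below counts (language, Type) pairs into a nested dictionary; a zero cell then signals a potential gap for that language.

```python
from collections import defaultdict

def build_matrix(entries):
    """Build a {language: {resource_type: count}} matrix
    from cleaned (language, resource_type) entries."""
    matrix = defaultdict(lambda: defaultdict(int))
    for language, resource_type in entries:
        matrix[language][resource_type] += 1
    return matrix

# Toy entries standing in for the cleaned LRE Map table.
entries = [
    ("English", "Corpus"), ("English", "Lexicon"),
    ("English", "Corpus"), ("Maltese", "Corpus"),
]
m = build_matrix(entries)
# m["English"]["Corpus"] is 2, while m["Maltese"]["Lexicon"] is 0 (a gap)
```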
Those matrixes show that English is by far the most resourced language, followed by French and German, Spanish, Italian and Dutch. Some languages are clearly under-resourced, such as Irish Gaelic, Slovak or Maltese. Given the large number of Types expressed by the authors, some may exist for only one language, and the matrixes therefore show a large number of zeroes for all the other languages. We nevertheless preferred to keep that information as such rather than merging it into an “Other Type” category, as those singletons may be weak signals announcing a new research trend. Another option would be to merge those singletons into a single “Other” category to facilitate the browsing of the Language Matrixes.
In order to produce the Language Matrixes, we had to conduct a tedious process of cleaning up the data, as the information was not always provided in the proper format, despite the suggested terms, and as each declared LR should only be counted once. This process reduced the number of entries from 2,000 to about 1,500.
We first cleaned up the names of the LRs, as different wordings may be used by different authors for the same LR. We found that those different wordings may not sort together in alphabetical order (e.g. acronyms), and we faced the issue of how to treat different versions of the same LR over time, as well as the subparts of an LR. This has to be decided by hand on a case-by-case basis. For this purpose, we found it useful to gather some LRs into LR families. This cleaning process clearly demonstrated the need to label LRs in a persistent and unique way (PUiD: Persistent and Unique Identifier) in order to better identify and track them.
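Part of this name clean-up can be mechanised by normalising names before comparison, as in the sketch below; the normalisation rules here are illustrative assumptions, and versions, subparts and families still require the manual, case-by-case decisions described above.

```python
import re

def normalise_name(name: str) -> str:
    """Reduce superficial wording variation: case, punctuation, extra spaces."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)   # drop punctuation such as hyphens
    name = re.sub(r"\s+", " ", name).strip()
    return name

def group_variants(names):
    """Group differently-worded names that normalise to the same key."""
    groups = {}
    for name in names:
        groups.setdefault(normalise_name(name), []).append(name)
    return groups
```

Each resulting group is a candidate for a single catalogue entry, to be confirmed by hand.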
We then cleaned up the LR modality (Multimodal/Multimedia, Written Language, Spoken Language, Evaluation, Meta-resources and Not Applicable). The main problem here is that an LR may address several modalities (written and spoken language, for example); in this case, we counted the LR in both modalities. We also had to harmonize possible differences among authors.
The next cleaning addressed the LR Types. In addition to the 24 suggested Types organized in 4 categories (Data, Tools, Meta-resources and Evaluation), the authors proposed 127 new Types. Some of them correspond to mistakes by the authors (i.e. the Type already existed, possibly with a different wording). Others (23) correspond to Types that were missed when producing the list of suggested Types (in this case, the new Type is often mentioned by several authors, sometimes with different wordings). Others correspond to LRs belonging to several Types. Finally, there is a long tail of other Types, most of them mentioned only once by a single author, which together represent only 5% of the LRs.
The last cleaning process addressed the Languages, as authors may have used different spellings or codes (such as the ISO ones). This could be facilitated in the future by providing the list of existing languages. In line with our objective, only the 23 official EU languages were considered, while the European non-EU languages were grouped into a single category, and likewise for the EU regional languages. A “multilingual” category was used when the number of languages mentioned was large.
Despite this cleaning process, and although it may have contained obvious mistakes, all initial information provided by the authors has been retained as the reference information. The cleaned information is used for conducting data analysis and for searching. The general need for data cleaning is nowadays clearly identified, and Google offers the Google Refine1 tool to facilitate this task.
Since the LRE Map was produced at the LREC 2010 conference, more data have been harvested at other conferences such as EMNLP 2010 and COLING 2010, and will be included in the next versions of the Language Matrixes, along with the LRs appearing in journals, such as the Language Resources and Evaluation journal, and in Language Resource catalogues, such as the LDC and ELRA ones. More conferences have agreed to participate in the building of the LRE Map (Interspeech, ACL-HLT, Oriental Cocosda, etc.). Based on our findings during the data cleaning, we will complement and improve the suggested information provided to the authors in order to facilitate their task, which will also lighten the cleaning process.
Building the Language Matrixes from actual data provided by authors makes it possible to reflect, over time, a landscape that is continuously evolving with more and more LRs. Within META-NET, the Language Matrixes have already started being used for identifying the Language Gaps and for writing the Language Tables in the Language Reports.
1 http://code.google.com/p/google-refine/
The next steps will be to:
o increase the number of identified LRs through more conferences, journals and catalogues;
o extend the range of languages being considered;
o include the analysis of Sign Languages;
o improve the coherence of the inputs in a loop where the taxonomy and the metadata are refined through the analysis of the matrixes and thanks to the work on metadata within META-NET, and reflected in the terms suggested in the questionnaire;
o track the use and quality of the identified LRs as they appear in conference and journal papers.
In order to improve the analysis of LRs, a major step would be to attach a PUiD to each LR, taking into account LR families, their various parts and their evolution over time, including all contributors.
Proposal for the International Language Resource Number
Khalid Choukri, Jungyeul Park, Victoria Arranz, Olivier Hamon
ELRA/ELDA, France
In this paper, we propose a new principle for assigning a Persistent IDentifier (PID) to LRs within the human language technology area.
Every object in the world requires a kind of identification to be correctly recognized and easily "discoverable". Traditional printed materials such as books, for example, have generally used the International Standard Book Number (ISBN), the Library of Congress Control Number (LCCN), the Digital Object Identifier (DOI) and several other numeric identifiers as unique identification schemes. Book identifiers allow us to easily "identify" books in a unique way. There are already
several identifier schemes in other domains. In computer programming languages, Identifiers
(IDs) are usually lexical tokens that name language entities. The Electronic Product Code (EPC) is
also designed as a Universal Identifier that provides a unique identity for every physical object. The
canonical representation of an EPC is also a Uniform Resource Identifier (URI). A Part Number
(PN) is a unique identifier of a part used in a particular industry and unambiguously defines a part
within a single manufacturer. A Universally Unique IDentifier (UUID) is an identifier standard
used in software construction, standardized by the Open Software Foundation. An Accession
Number in bioinformatics is a unique identifier given to a Deoxyribonucleic acid (DNA) or protein
sequence record to allow for tracking of different versions of that sequence record and the
associated sequence over time in a single data repository. Biomedical research articles already have
a PubMed Identifier (PMID).
In this presentation, we propose the use of new identifier schemes for Language Resources (LRs) to
be identified, and consequently to be recognized as proper references. It is also a major step in the
emerging NLP networked and shared world. Unique resources must be identified as what they are
and meta-catalogues need a common identification format to manage such data correctly.
The ELRA Catalogue offers a repository of the LRs made available through ELRA. The catalogue contains over 1000 LRs in more than 25 languages. Other LRs identified all over the world, but not available through ELRA, can also be viewed in the Universal Catalogue. The current LR identifiers in the ELRA Catalogue consist of 4 digits followed by a systematic pattern (B|S|E|W|M|T|L), where B signifies a bundle that can contain several LRs, and S|E|W|M|T|L denote Speech, Evaluation, Written and Multilingual corpora, and Terminology and Lexicon resources, respectively. LDC uses its publisher code with a year number followed by (S|T|V|L) and 2 digits, where S|T|V|L denote speech, text, voice and lexical(-related) corpora, respectively. ISO is also working towards a PID framework and practices for referencing and citing LRs, both in documents and in the LRs themselves. The DOI System likewise provides a good framework for PIDs, usable for any form of data management.
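As a concrete illustration, the two catalogue numbering schemes just described can be captured with simple patterns. The sketch below is only an assumption based on this description; the prefixes and exact digit counts are not the official specifications.

```python
import re

# Hypothetical parsers for the identifier patterns described above:
# ELRA: 4 digits followed by a type letter (B|S|E|W|M|T|L);
# LDC: publisher code, year number, type letter (S|T|V|L) and a serial.
ELRA_ID = re.compile(r"^ELRA-(\d{4})([BSEWMTL])$")
LDC_ID = re.compile(r"^LDC(\d{2,4})([STVL])(\d{1,2})$")

def resource_type(identifier):
    """Return the single-letter resource-type code, or None if unrecognized."""
    for pattern in (ELRA_ID, LDC_ID):
        match = pattern.match(identifier)
        if match:
            return match.group(2)
    return None
```

Under these assumed layouts, `resource_type("LDC93S01")` would yield `"S"` (speech), while an unrecognized string yields `None`.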
We propose a practical implementation of the International Language Resource Number and elaborate on an international framework, with the requirements and expectations to take into account for such a set-up.
S2. Open Data – 14:30‐16:20
Chair: Nancy Ide
Introduction
Nancy Ide (Vassar College, USA / Chair)
Introductory Talks
Open Data and Language Resources Timos Sellis (IMIS ‐ "Athena" R.C., GR)
Multilingual Linking Open Data Key‐Sun Choi (KAIST, KR)
Beyond Open Data to Open Advancement Eric Nyberg (Carnegie Mellon University, USA)
Contributions
To Create Commons in order to Open Data Danièle Bourcier (CNRS & Creative Commons France, FR)
Is Our Relationship With Open Data Sustainable? Denise DiPersio (LDC, USA)
Opening the Language Library: let's build it together! Nicoletta Calzolari (CNR‐ILC, Italy / FLaReNet)
Discussants
Thierry Declerck (DFKI, DE)
António Branco (University of Lisbon, PT)
Thibault Grouas (Ministry of Culture of France ‐ Office for the French Language and Languages of France, FR)
Open Data and Language Resources
Timos Sellis, Spiros Athanasiou
Institute for the Management of Information Systems, “Athena” Research Center
A brief introduction on Open Data
Open Data, i.e. data that can be freely used, reused and redistributed by anyone [1], is not a recent trend, but rather a historically established scientific practice. The goal of providing open scientific data is noble and a necessity for scientific advancement: sharing knowledge, evaluating research, educating, and promoting interdisciplinary activities. Recently, open data practices have spread beyond the S&T realm and into the mainstream political and technological agenda. A new movement has been formed: the open-data movement.
The reasons for this development stem from needs established in two separate but interlinked domains: politics and S&T. On a political level, open data is perceived as a means to enforce transparency and accountability, i.e. core democratic values. Further, the reuse of Public Sector Information (PSI) can boost the economy, enabling the private sector to develop and monetize value-added services. On an S&T level, the need to share data to promote scientific advancement is ever increasing. Further, the World Wide Web in its current and upcoming iterations (Web 2.0/3.0) offers significant technical opportunities for collaboration and knowledge management, yet to be harnessed.
Consequently, this mix of technological, political, and economic support for open data is unique, since it provides excellent opportunities for sustainable growth and business development, advances in research, and cross-disciplinary research agendas.
Learning from geospatial data
Geospatial data offers a great source of examples, applied technical solutions, and policy models that attest to the tangible benefits of open data, data reuse and common sharing. Geospatial data is important in this respect for three reasons. First, geospatial data accounts for roughly 80% of public data. Second, several mature international, EU, and national efforts exist to promote reuse and openness (e.g. the INSPIRE Directive). Third, we can learn from well-documented use cases concerning real-world return on investment (ROI) and cost-benefit analysis (CBA).
In a nutshell, the status quo in the ecosystem of geospatial data rests on four complementary driving forces: standardization, maturity, vendor support and an active FLOSS community.
• Standardization activities for metadata, data, and services are handled by international organizations with a mutual and deep understanding of the benefits of interoperability. ISO, the Open Geospatial Consortium, and the INSPIRE Directive have established a common ground for all stakeholders, actively evolve standards, and promote uptake in all fields relating to geospatial data.
• Geospatial data sharing among public bodies has long been established for practical reasons: geospatial data is expensive to produce, must be compatible, and is needed for policy making at all levels of the administration and across various domains. The concept of Spatial Data Infrastructures (SDIs), i.e. formal technical, administrative, and legislative frameworks to support geospatial data sharing, was coined in the early 80s, even in an analogue form.
• Vendor support for standards-based access and use of geospatial data, as well as for SDIs, is practically unanimous. Almost all commercial products (GIS, geospatial databases) are compliant with ISO/OGC standards. Therefore, data reuse and common sharing is offered almost out of the box.
• Finally, the FLOSS community has a pivotal role, offering open and ready-to-deploy software and services for geospatial data.
Out of several examples in the literature, the case of Denmark [2] is remarkable and clearly establishes the economic gains of open data. In 2002, the Danish government decided to provide the address dataset free of cost, since addresses are an integral part of the majority of IT systems and services, in both the public and private sectors. The costs of curating and publishing the addresses were around 2M Euros (2005-2009). However, the direct financial benefits for the same period were around 62M Euros, allocated 30%/70% between the public and the private sector respectively. This corresponds to an ROI of 31. Moreover, for 2010, the costs for the public sector will be 0.2M Euros and the benefits 14M Euros, i.e. an ROI of 70.
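The ROI figures above follow directly from dividing benefits by costs, as this small check shows:

```python
# Sanity check of the Danish address-data figures cited above:
# ROI is computed as direct benefits divided by costs.
costs_2005_2009 = 2.0      # million euros (curation and publishing)
benefits_2005_2009 = 62.0  # million euros (direct financial benefits)
roi_2005_2009 = benefits_2005_2009 / costs_2005_2009
print(roi_2005_2009)  # 31.0

costs_2010 = 0.2
benefits_2010 = 14.0
roi_2010 = benefits_2010 / costs_2010
print(roi_2010)  # 70.0
```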
Opening up data
Opening up data is actually the last step one should tackle, and certainly not a technical challenge;
it is just a matter of choosing an appropriate license. Instead, the focus should be on how to make
our resources (data and services) discoverable and reusable. Without delving into technical details,
practice dictates that opening up data is a three step process:
Step 1. Set up a catalogue. Provide standard metadata, assign unique identifiers, and build a catalogue of your resources. In this manner a user can discover your resources and use a static identifier to reference them. Note that the data themselves do not need to be available at this point.
Step 2. Harmonization. Provide standard APIs (de facto or de jure) to query the catalogue. In this manner your catalogue can be used by third-party systems, and catalogues can be federated. Again, no actual data need be published at this point.
Step 3. Licensing and business models. This is where you have to decide on whether to
publish the data or not, and if so, under which license. Decisions must be based on a
business model (e.g. sell the data, share data amongst a closed group, provide the data for
free) which then shapes the catalogue and its services.
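The three steps might be sketched as a minimal catalogue record. All field names and the identifier scheme below are illustrative assumptions, not a prescribed format:

```python
import json

# Step 1: standard metadata plus a stable identifier; note that neither a
# license nor the data itself needs to exist yet.
record = {
    "id": "urn:example:lr:0001",   # static identifier used for referencing
    "title": "Example Speech Corpus",
    "language": "da",
    "metadata_standard": "Dublin Core",
    "license": None,               # decided only at Step 3
    "data_url": None,              # data need not be published yet
}

# Step 2: exposing the record in a standard, machine-readable form lets
# third-party systems query and federate the catalogue.
serialized = json.dumps(record)
```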
It is important to consider that choosing an “open” license does not equate to complete loss of
rights. Open vs. closed is not equivalent to black vs. white. There is a spectrum of open licenses
with all shades of grey in between. For example, the Creative Commons licenses provide several optional restrictions when licensing open data, to accommodate various needs. These restrictions can be combined to create a specific CC-compatible license. Specifically, CC provides clauses
that (a) require attribution (indirect gains for the publisher), (b) forbid commercial work (direct
gains for the publisher in commercial applications), (c) forbid derivative work (direct gains for the
publisher), and (d) demand share-alike (indirect gains for the publisher).
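Enumerating the combinations of these clauses reproduces the familiar CC license family. One caveat, encoded below, is that in practice the no-derivatives and share-alike clauses are never combined, since share-alike only applies to derivative works; the rest of the enumeration is a direct sketch of the clauses above.

```python
from itertools import combinations

# Build license names from the optional clauses; attribution (BY) is part
# of every license in this sketch.
optional_clauses = ["NC", "ND", "SA"]  # non-commercial, no-derivatives, share-alike
licenses = []
for r in range(len(optional_clauses) + 1):
    for combo in combinations(optional_clauses, r):
        if "ND" in combo and "SA" in combo:
            continue  # share-alike presupposes derivative works
        licenses.append("CC-BY" + "".join("-" + c for c in combo))

print(licenses)
# ['CC-BY', 'CC-BY-NC', 'CC-BY-ND', 'CC-BY-SA', 'CC-BY-NC-ND', 'CC-BY-NC-SA']
```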
Where do Language Resources stand?
The question of whether Language Resources should follow the open data movement is misleading. Providing open data should not be the primary objective, but rather one of the instruments to promote growth in the LR community at the EU level. All stakeholders (the S&T community, SMEs, the EC) should study the current landscape and Europe’s competitive advantage and unique attributes, in order to agree on a common agenda for growth and cooperation in the LR field.
There are many questions that need to be meticulously answered.
What is the market value of LR in the EU? Establishing a data economy on LR should first
focus on identifying exactly where we stand, on an international level.
What is the potential market value of LR data? Given the EU’s needs, priorities, and goals, is there room for growth? If so, how can growth be achieved? Does the EC need to intervene, or can market forces pursue this goal alone?
Are LR data discoverable? To promote the data economy we should first ensure that LRs can be found, even at the lowest technical level.
What do we gain by closing data? Are business models based on closed and heavily guarded
LR actually successful? Are we losing opportunities for growth by not systematically
exploiting common sharing and synergies?
What do we lose by opening up data? Can the direct income lost be compensated by direct or indirect financial gains?
Experience in similar data-driven ICT markets has demonstrated the steps the LR community
should follow. Standardization, harmonized licensing, LR marketplaces, and common sharing
agreements, are a necessity to bootstrap the LR economy in Europe.
[1] www.opendefinition.org
[2] http://www.adresse-
info.dk/Portals/2/Benefit/Value_Assessment_Danish_Address_Data_UK_2010-07-07b.pdf
Multilingual Linking Open Data
Key-Sun Choi, Eun-Kyung Kim, Dong-Hyun Choi
Web Science and Technology Division, Computer Science Department, KAIST, Korea
Ontology Population and Enrichment for Semantic Evolution of LOD
Linking Open Data is a syntactic linking of datasets: it cannot establish whether linked objects are really the same, and equivalent objects may fail to be linked simply because different strings are used to label them.
Ontology population links instances in the real world to an already-made ontology; ontology enrichment mines the relations between those instances, as defined in the ontology schema.
It is not easy to view the whole LOD space as one ontology, and whether this is necessary for real applications is itself controversial. Yet almost all of LOD is a DBpedia-centric linked structure, built through exact matches of URI names, and problems remain: the redundant presence of objects makes it hard to aggregate all relevant data, the semantic interoperability/equivalence of links cannot be proven, and even non-linked objects can introduce inconsistency into the whole space of linked open data. We therefore pursue an elegant ecosystem that maintains LOD semantics by incorporating ontology population and enrichment mechanisms into the evolving LOD, incrementally converging LOD towards LOD2, a so-called semantically stable LOD.
The basic strategy is as follows. It should be recognized that DBpedia and its mother resource Wikipedia are the backbone of LOD, of its evolving semantic versions, and finally of their ontology structure. The first step is to build a taxonomy backbone based on the Wikipedia structure; this is then turned into an ontology by bootstrapping from semantic annotation and the infobox-based template structure. The second step is ontology enrichment using Wikipedia text, in order to strengthen the ontology structure. The third step, ontology population, accepts each dataset as one user ontology, a local aggregate of LOD: after transforming each local aggregate into a possible ontology structure with a taxonomy, it is populated from the Wikipedia-based ontology. The fourth step is to enrich the local LOD ontology, and to feed back to the Wikipedia-based ontology for the overall stability of semantic LOD.
Multi-lingual Synchronization
It is necessary to explore an automated approach to synchronizing multilingually-linked articles in Wikipedia. Two linked articles in two different languages contain different amounts of information, and some entries in one language have no entry at all in the other. Synchronization means translation between the two languages plus information balancing, to equalize their degree of information. Even if the full translation between two languages’ linked articles is successfully accomplished, the translated content must still be readjusted against the already existing article in the target language; one of the languages may even contain obsolete information. The next stage is therefore to find the temporal ordering of the facts and events represented in the articles’ sentences, and to generate up-to-date, information-synchronized articles in both languages.
There are many challenges in massive translation among multilingually-linked articles while synchronizing their information content. This approach envisions a role for the infobox of each Wikipedia article in improving infobox quality while keeping data categories semantically interoperable. Although the current shape of infoboxes is governed by “templates”, these templates are not organized well enough to give a standardized, concise guide for mapping article content into the infobox. One solution is to give an ontology structure to infobox templates and to provide a multilingual standard for their data categories; we have a proposal for this called OntoCloud. Once it is set up, the next step is to perform indirect translation among different languages through the infobox pivot. Here we need machine learning between a Wikipedia article’s sentences and its infobox: to align each infobox row with sentences, we will try to extract the relevant information from the Wikipedia article into its infobox, as well as to generate relevant sentences from the infobox table. The infobox thus provides a minimal common exchange of information based on already-made templates; of course, one Wikipedia article can require several infobox templates. Multilingual wordnets are also crucial resources for linking them.
The ontology structure based on infobox templates is related to the category structure in Wikipedia. But because Wikipedia categories are user-given tags, the category structure is not that of an ontology, and its taxonomy is far from what we expect of WordNet. We will investigate the mutual feedback between the template ontology and the category structure, towards a persistent ecosystem that can evolve while keeping its collective-intelligence advantage.
Issues
Issues raised are summarized as follows:
1. Is it possible and necessary to construct a backbone ontology for LOD?
2. Is it also good to align each dataset of LOD to the backbone ontology?
3. Infobox Schema Management:
a. ontology structure for templates of infoboxes
b. Multilingual standard for data categories in templates
4. Multilingual Infobox alignment:
a. Finding semantic correspondences between multilingual contents
b. Multilingual thesauri are used to link between different lingual sources
5. Infobox Population: filling the missing and unknown contents
a. Extracting relevant information from other sources such as Wikipedia articles or
external resources
b. Generating up-to-date information-synchronization in multilingual environment
Beyond Open Data to Open Advancement
Eric Nyberg
Carnegie Mellon University
Open Data Isn’t Enough
In the spirit of open data, several initiatives have developed shared language resources, annotation
schemas, and datasets for research and associated programmatic evaluations. One branch of
Language Technology, Question Answering, has benefited greatly from the availability of open data
resources, especially the research datasets created for the yearly TREC question answering track.
Nevertheless, recent experience with the Jeopardy! Challenge shows that open data isn’t enough.
Sustained interoperability and steady improvement in task performance requires additional work
to formalize, standardize and share related language processing resources, such as:
• Shared component APIs
• Shared software (open source)
• Shared configurations / data flows for specific language tasks
• Shared metrics, measurements and evaluation frameworks for specific language tasks
• A collaborative development process (driven by measured improvement)
The Open Advancement of Question Answering (OAQA)
The Watson QA system developed by IBM Research is based on a foundational architecture and
methodology for accelerating collaborative research in automatic question answering (OAQA). By
making long-term commitments to component APIs, shared software, configurations, metrics and
process, OAQA led to rapid acceleration in the state of the art, and a Jeopardy! victory for Watson.
OAQA combines object-oriented software architecture with comprehensive metrics, measurement
and error analysis at the system and module levels, so that the sources of overall task error can be
traced to component-level errors to be debugged. Detailed error analysis allows the team to focus
on the most important sources of error during each development iteration, resulting in steady
progress in task performance.
Beyond Question Answering to other Language Tasks
I believe that the same OA approach can be applied to other human language processing tasks, and
I strongly advocate the position that new language resources (i.e., corpora, datasets, etc.) should
not be developed without a specific set of tasks and task metrics in mind, along with a formal
software design to support automatic configuration management, automatic comparative /
incremental evaluation, and detailed error analysis. These techniques were used to great effect in
the development of Watson; they can have a similar positive effect on other areas of language
research, if we are willing to invest the effort required to specify the task and component APIs and
metrics along with the shared language resources for the task.
To Create Commons in order to Open Data
Danièle Bourcier
CNRS and Creative Commons France, France
In this talk, I will propose to link the Commons movement with that of open data.
Since 2004, Science Commons has been focusing its efforts on expanding the use of Creative Commons licenses to scientific and technical research. Creative Commons played an instrumental role in the Open Access movement, which is making scholarly research and journals more widely available on the Web. The world’s largest Open Access publishers all use CC licenses to publish their content online: 10% of scholarly journals are CC licensed. The Science Commons movement also extends Open Access to research institutions. Creative Commons licenses are directly integrated into institutional repository software platforms. The university libraries at MIT offer access to the Scholar’s Copyright Addendum Engine, which helps faculty members upload their research for public use while retaining the rights they want to keep. Finally, Science Commons has created policy briefings and guidelines to help institutions implement Open Access in their frameworks.
Is Our Relationship With Open Data Sustainable?
Denise DiPersio
Linguistic Data Consortium, Philadelphia, PA 19104 USA
Introduction
The language resource development and user communities have publicly declared their affection for open data. But what has the community embraced? “Open data” is a term susceptible to multiple and often inconsistent interpretations. In any relationship, long-term success depends on how well we know our partner. This paper will examine some characteristics that affect data’s “openness”, review emerging trends and conclude with some ideas for considering the notion of openness as it applies to language resources.
Open Data is ....
The concept of open data has its roots in discussions among the scientific community in the 1950s
that resulted in various efforts to promote the sharing of scientific data in order to minimize loss
and to maximize accessibility. The emergence of the web as a data sharing mechanism gave new life
to this idea premised on the assumption that data could be made available online and downloaded
at little or no cost. But there remains some confusion about the meaning of open data.
Some view open data as describing a method of distribution. In that context, open data is “data
made openly available without permission or payment barriers” (MIT Libraries), or “a piece of
content or data is open if you are free to use, reuse, and distribute it – subject only, at most, to the
requirement to attribute and share-alike” (Open Knowledge Foundation).
Others use open data to describe content. Examples here include information collected by some
public body or whose collection was publicly supported. The default assumption of this approach is
that all such information – except perhaps material subject to privacy or sensitivity concerns –
should be publicly available at no cost. Indeed a large number of current open data “initiatives” are
efforts by national and local government bodies to release their data collections online.
The language resource community has tended to think of open data in the former sense: digital
resources distributed under open source-type licenses that can in turn be used, modified and
redistributed. However, openness can depend on a number of factors including:
LR type/design: this includes formatting and compatibility with existing related data sets as well
as metadata, all of which affect usability. An “accessible” corpus that is not usable is not open.
Source data: this is where most legal restrictions apply, mostly copyright, but this can also
include data that contains private or sensitive information. Lack of appropriate permissions could
impair openness.
Legal issues: there are implications in copyright law and, in many jurisdictions, in laws governing rights in databases. Moreover, not all open source licenses are consistent. The goal is a license that promotes openness to the greatest extent possible, but this can be difficult to achieve uniformly.
Access: includes user access and downloadability. Restrictions on user groups and alternative distribution methods can be perceived to limit openness.
Sustainability: the data should be available for the long term. This may require additional infrastructure, which in turn requires maintenance.
Cost: open data may be distributed at no cost, but the data is not necessarily “free”. A free price usually masks cost-shifting that has to be paid by someone.
The State of our Relationship with Open Data
There is a theory of relationships called the “uncertainty reduction theory” (URT). URT assumes
that personal relationships are filled with uncertainty at the beginning and that people try to
reduce that uncertainty through knowledge and understanding. One way to view the language
resource community’s interaction with open data is to say that it is on a quest for knowledge and
understanding. Here are some examples:
CKAN – the Data Hub – Comprehensive Knowledge Archive Network (Open Knowledge Foundation), “the easy way to get, use and share data”; includes an Open Linguistics Resources group that contains 18 LR “packages” available at no cost under open source-type licenses via web download; corpora and tools; includes a wiki function edited by the community. Openness assessment – some dependence on individual contributions; inconsistent usability information; the main site contains at least one closed dataset.
Language Commons – “Open, Online Encyclopedia and Data Repository for all 7,000 Human Languages”, hosted by archive.org under a no-cost Creative Commons license; fewer than 10 datasets; the Brown Corpus is the most downloaded (300+). Openness assessment – some licenses have use restrictions; inconsistent usability information.
Language Grid – “an online multilingual service platform which enables easy registration and sharing of language services such as online dictionaries, bilingual corpora, and machine translations”; users must apply to join the grid and agree to use the services for non-profit, research purposes; provides 100+ language services. Openness assessment – limited user accessibility; inconsistent usability information.
Toward a Viable Relationship
The worthy initiatives described above highlight the imbalances in our current relationship with open data. These include the need for data variety, a way to address legal and related issues so as to achieve variety and reduce license restrictions, better data description, and community-wide access. Since a hallmark of open data initiatives is community input, it may be useful to survey the language resource community about open data: what is it, what data should be open, what are the needs, how has work been hampered? Is cost the main barrier to accessibility? How problematic are license restrictions? Responses to such questions could be helpful as we move ahead to a viable relationship with open data.
References
Anderson, Chris. Free: The Future of a Radical Price. New York: Hyperion (2009).
CKAN – the Data Hub. <http://ckan.net/group/linguistics> (14 April 2011).
Interpersonal Communication Theories and Concepts: Social Penetration Theory, Self-Disclosure, Uncertainty Reduction Theory, and Relational Dialectics Theory. <http://oregonstate.edu/instruct/comm321/gwalker/relationships.htm> (29 April 2011).
Language Commons. <http://www.archive.org/details/LanguageCommons> (15 April 2011).
Language Grid. <http://langrid.nict.go.jp/en/index.html> (30 April 2011).
Open science data. <http://en.wikipedia.org/wiki/Open_science_data> (26 April 2011).
Scholarly Publishing – MIT Libraries | Open Data. <http://info-libraries.mit.edu/scholarly/mit-open-access/general-information> (14 April 2011).
What is Open? <http://opendatamanual.org/what-is-open-data/what-is-open> (14 April 2011).
What “open data” means – and what it doesn’t | opensource.com. <http://opensource.com/government/10/12/what-“open-data”-means> (14 April 2011).
Opening the Language Library: Let’s Build it Together!
Nicoletta Calzolari
Istituto di Linguistica Computazionale, CNR, Italy
I present here a vision briefly introduced last year at a COLING panel on “crazy ideas”. We must now turn the crazy into reality!
The rationale and the vision
We have recognised that Computational Linguistics is a data-intensive discipline, but we must be coherent and take concrete actions leading to the coordinated gathering – in a shared effort – of as much (annotated/encoded) language data as we are able to produce.
The time is ripe for “opening” a big “Language Library”, collaboratively built by the LRT (Language Resources and Technologies) community. The Language Library is conceived as a facility for assembling, and making available through services, “all the linguistic knowledge the field is able to produce”, putting in place social-networking ways of collaborating within the LRT community.
We can compare the proposed initiative to the accumulation by astronomers and astrophysicists of huge amounts of observation data for a better understanding of the universe.
The rationale behind the Language Library initiative is that accumulation of massive amounts of
(high-quality) multi-dimensional data about language is the key to foster advancement in our
knowledge about language and its mechanisms, in particular for finding new facts about language.
The Language Library must be community built, with the entire LRT community providing data
about language resources and annotated/encoded language data and freely using them.
As ongoing initiatives (FLaReNet, META-SHARE and CLARIN among others) have shown, the
LRT field is mature enough to require consolidation of its foundations and steady increase of its
major assets, minimising dispersion of efforts and enabling synergies based on common knowledge
and collaborative initiatives.
To better achieve these goals we need:
1. Information on the LRT that constitute the real infrastructure of the field
2. Knowledge of best practices for the major LRT and all the languages
3. Facilities for gathering in a unique virtual repository all the linguistic knowledge the field is able to produce.
These requirements can be satisfied by three – strongly interrelated – collaborative initiatives:
1. The LRE Map, started last year, collecting metadata about LRT
2. The Repository of Standards, Best Practices and Documentation
3. The Language Library, collecting language data to explore the language universe
FLaReNet has acted as an incubator for the proposed initiatives, and they will constitute a great
contribution to META-SHARE, both in terms of services around the resources and in terms of new
and innovative strategies for the creation of new resources.
The Language Library
The Language Library can be seen as a sort of big Genome project for languages, where the
community will collectively deposit/create all its “knowledge” about language.
Our understanding of language phenomena is inherent in annotations, in lexicon encoding, and in the capability of annotating/extracting linguistic information. Annotation is at the core of training and testing systems, i.e. at the core of our technologies. But we are currently over-simplifying the task, focusing mostly on one specific linguistic layer at a time, without (having the possibility of) paying attention to the relations among the different layers. However, relations among phenomena at different linguistic levels are of the essence of language properties. At the same time, our efforts are too scattered and dispersed, with little possibility of exploiting others’ achievements.
Today we have enough capability and resources for addressing the complexities hidden in
multilayer interrelations. Moreover, we must exploit the current trend towards sharing for
initiating a collective movement that works towards creating synergies and harmonisation among
all the annotation/encoding work that is now dispersed and fragmented.
With the Language Library we thus want to enable a more global approach to language, starting a movement aimed at – and providing the facilities for – collecting all possible annotations/encodings at all possible levels (and we can certainly manage tens of different annotation layers and types), to enable better analysis and exploitation of language phenomena that we tend today to disregard (in particular the interrelations among levels, which are so important and yet not taken care of). For example, a coreference annotation on top of simpler annotation layers could improve machine translation performance. Part of this multi-layer and multi-language annotation could be performed on parallel texts, so as to foster comparability of new achievements and equality among languages.
Moreover, the Language Library will be both a repository of annotation layers and a set of
tools/services that allow easy browsing, annotation, conversion, porting through languages, and
exploitation of the existing layered data ...
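The standoff, multi-layer annotation idea described above can be illustrated with a minimal sketch (the example text, layer names and spans are all invented for illustration):

```python
# Standoff annotation: each layer refers to character spans in the same
# immutable source text, so new layers can be added and cross-related
# without touching existing ones.
text = "Maria bought a book. She read it."

layers = {
    "pos": [
        (0, 5, "PROPN"), (6, 12, "VERB"), (13, 14, "DET"),
        (15, 19, "NOUN"), (21, 24, "PRON"), (25, 29, "VERB"),
        (30, 32, "PRON"),
    ],
    # A coreference layer built on top of the same offsets: chains of
    # spans that refer to the same entity.
    "coref": [
        [(0, 5), (21, 24)],    # "Maria" ... "She"
        [(13, 19), (30, 32)],  # "a book" ... "it"
    ],
}

def surface(span):
    """Recover the surface string for a (start, end) span."""
    start, end = span
    return text[start:end]

# Relating layers: print the mentions of every coreference chain.
for chain in layers["coref"]:
    print(" <-> ".join(surface(s) for s in chain))
```

Because every layer addresses the same text by offsets, adding a tenth or twentieth layer is no harder than adding a second, which is what makes the interrelations between levels analysable at all.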
This collaborative approach to the creation of massive amounts of annotated data has a clear relation to interoperability: it will push towards convergence on best practices, also making available tools that simplify the adoption of the most widely used annotation guidelines, while at the same time encouraging exploratory diversity. This could create a fruitful, community-driven loop between the most used annotation schemes and the establishment of best practices.
A first experiment at LREC 2012!
At LREC 2012 we will launch a new initiative aimed at laying the foundation for such a community
built Language Library.
Without going into details now: we will distribute texts in all the LREC languages and ask LREC authors to return them processed/annotated in all the ways their submissions can manage. All the collected language data will be made available to the LREC community (and beyond), possibly before the Conference, so that something can be organised around, or on top of, them.
We’ll announce/describe this initiative in the LREC Call for Papers next month, and we call for
suggestions now from the FLaReNet community!
S3. Go green: reuse, repurpose and recycle resources – 16:50‐18:30
Chair: Stelios Piperidis
Introduction
Stelios Piperidis (ILSP ‐ A.C. "Athena", GR / Chair)
Introductory Talks
An Interoperability Challenge for the NLP Community James Pustejovsky (Brandeis University, USA)
Community Co‐Creation in Cultural Domain Virach Sornlertlamvanich (NECTEC, TH)
Imagine we have 100 Billion Translated Words at our Disposal Jaap van der Meer (TAUS Data Association, NL)
Contributions
U‐Compare: interoperability of text mining tools with UIMA Sophia Ananiadou (University of Manchester, UK)
Opening, Sharing, Re‐using Language Resources: Who, What, When, Where and How Stelios Piperidis (ILSP ‐ A.C. "Athena", GR)
Discussants
Petya Osenova (BAS, BG)
Anna Braasch (University of Copenhagen ‐ Centre for Language Technology, DK)
An Interoperability Challenge for the NLP Community
Nancy Ide1 and James Pustejovsky2
1Vassar College and 2Brandeis University, USA
Web services are becoming increasingly more sophisticated and responsive to user needs over a
range of applications and areas. However, at this time a robust, interoperable software
infrastructure to support natural language processing (NLP) research and development does not
exist. The need for robust language processing capabilities across academic disciplines, education,
and industry is without question of vital importance. The goal of our NSF-funded SILT project, which has worked together with European and Asian collaborators, has been to advance interoperability among NLP tools and resources. The ultimate goal of this research is to build on
the foundation laid in SILT and other projects, to create the momentum toward establishing a
comprehensive network of Language Apps (LAPPs) web services and resources within the NLP
community. To this end, in this talk, we define a shared task for the NLP community we call "The
SILT Interoperability Challenge", which will call for development of interoperable web services. We
anticipate that the challenge will be carried out in two or three iterations over the next two years.
Community Co-Creation in Cultural Domain
Virach Sornlertlamvanich
National Electronics and Computer Technology Center (NECTEC), Thailand
Introduction
Under a collaboration between the Ministry of Culture and the Ministry of Science and Technology, through the National Electronics and Computer Technology Center (NECTEC), a collection of cultural knowledge has been built up since 2010. We did not start the work from scratch: some years ago a set of servers was set up, operated individually by each province, with each province responsible for content about its own area. The initiative was carried out to create a reference site for local cultural knowledge, and the distributed system aimed to decentralize management and preserve the uniqueness of each specific area. However, there is a trade-off between this independent design and the cost of maintenance, which covers service operation, interoperability and integrity. There are currently 77 provinces in Thailand, each allocated an office for a provincial cultural center. With the distributed approach described above, maintaining the service and a standard for data interchange proved too costly.
A newly designed, platform-based approach to digital cultural communication has therefore been introduced. Its aim is to build a co-creative relationship between cultural institutions and the community, using new media to produce audience-focused, interactive cultural experiences (Russo and Watkins 2005). First, we collected the existing provincial cultural knowledge and converted it to conform to a standardized set of metadata, preparing the cultural knowledge for an open data schema and interoperability. The metadata follows the Dublin Core Metadata Element Set, with some additional elements to capture the information required during the recording process. Second, we assigned representatives from each province and trained them as a core cultural content development team for community co-creation; contributed content must be approved by this core team before it becomes publicly visible. Third, the cultural knowledge is put into service for audiences such as scholars interested in cultural practice, business developers who may benefit from attaching cultural knowledge to their products, and tourists seeking cultural tourism.
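As a rough sketch, a record following the Dublin Core Metadata Element Set might look like the one below. The field values, the choice of elements, and the use of `coverage` for GPS coordinates are invented for illustration, not taken from the actual NECTEC schema:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A hypothetical record for one cultural item. The "coverage" element
# here carries lat,lon coordinates so the item can be placed on a map.
record = {
    "title": "Loy Krathong Festival",
    "creator": "Chiang Mai Provincial Cultural Office",
    "subject": "festival; tradition",
    "description": "Annual festival of floating decorated baskets.",
    "coverage": "18.7883,98.9853",  # lat,lon for map visualization
    "language": "th",
    "type": "Event",
}

root = ET.Element("record")
for element, value in record.items():
    child = ET.SubElement(root, f"{{{DC}}}{element}")
    child.text = value

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

Keeping every provincial record in one such standardized shape is what makes the later steps (central aggregation, filtering by province, semantic linking) tractable.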
These cultural media assets will be linked and annotated through a governed conceptual scheme such as the Asian WordNet (Sornlertlamvanich et al. 2009). The semantically annotated and linked data will be served as fine-grained cultural knowledge for higher-level applications. The new media for recording the cultural knowledge take the form of narratives, photos, videos, animations and images, combined with GPS data for visualization on a map.
Cultural Knowledge Co-Creation
The existing cultural data has been collected and cleaned up to conform to the designated standard metadata. Missing data is revised and augmented by experts from the Ministry of Culture. A few tens of thousands of records have been collected, but most are captured only as coarse-grained images. Narratives and images are revised by a group of trained experts to create a seed for a standardized, annotated cultural knowledge base. Some new records have been added, together with animation, video, panoramic photographs, etc. New techniques for capturing cultural images are actively introduced to add value and attract more interest from the audience.
The standardized, annotated cultural knowledge base is presented to the audience through a set of viewing utilities. Filtering by location and province allows the page to be customized for each province, giving each a unique presentation. The administration of each province is responsible for the correctness and coverage of its content, and attractive presentation and narratives are needed to draw the audience in.
Figure 1: Community Co-Creation Cultural Knowledge Base
A social networking system has been introduced to invite participation from the communities. Institution representatives are actively encouraged to create their own communities, and the resulting co-creation keeps the content maintained and cleaned up as communities compete with one another.
Significantly, the provided framework encourages data accumulation and fulfills the needs of the audience. Community co-creation feeds back actual requirements that can improve the quality of the content, with the institution playing an important mediating role between community and audience. As a result, multiple types of content are generated according to a designated standard, and the annotated metadata can serve as guidance for higher-level data manipulation such as semantic annotation, cross-language linking and link analysis.
References:
Russo, A. and Watkins, J. 2005. ‘Digital Cultural Communication: Audience and Remediation’ in
Theorizing Digital Cultural Heritage eds. F. Cameron and S. Kenderdine, Cambridge, Mass., MIT
Press.
Sornlertlamvanich, V., Charoenporn, T., Robkop, K., Mokarat, C. and Isahara, H. 2009. ‘Review on
Development of Asian WordNet’ in JAPIO 2009 Year Book, Japan Patent Information
Organization, Tokyo, Japan.
Imagine we have 100 Billion Translated Words at our Disposal
Jaap van der Meer
TAUS Data Association
Not just anonymous words scraped from numerous web sites, but good quality translations from
trusted sources, from government bodies and institutions, from companies large and small and
from professional translators.
What could we do with it?
We could transform the translation industry!
Here is how we do it:
1. Terminology mining and dictionary building
Today glossaries are built by terminologists: best-in-class language specialists. It is laborious and frustrating work: because language keeps changing, the terminologist is always behind, and the glossary is often ignored.
Imagine we have 100 billion translated words at our disposal. Terminology is harvested in real time. Synonyms and related terms are identified automatically; part-of-speech is tagged, context is listed, sources are quoted, meanings are described. It is not rocket science. In fact, all the tools exist to do this, and to do it well.
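A toy sketch of the harvesting idea: candidate terms ranked by frequency over a (tiny, invented) domain corpus, with one source context quoted per candidate. A real system would add part-of-speech filtering, synonym detection and much more:

```python
from collections import Counter
import re

# Invented miniature "domain corpus"; a real pipeline would stream
# millions of translated segments from trusted sources.
corpus = [
    "The translation memory stores each translation unit.",
    "A translation unit pairs source and target segments.",
    "Translation memory tools reuse previous translations.",
]

def candidate_terms(sentences, n=2):
    """Harvest n-gram term candidates with frequencies and contexts."""
    counts = Counter()
    contexts = {}
    for sent in sentences:
        words = re.findall(r"[a-z]+", sent.lower())
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            counts[gram] += 1
            contexts.setdefault(gram, sent)  # quote one source context
    return counts, contexts

counts, contexts = candidate_terms(corpus)
term, freq = counts.most_common(1)[0]
print(term, freq, "|", contexts[term])
```

Even this crude frequency count surfaces plausible domain terms; with 100 billion words behind it, the same idea scales into continuous, automatic terminology harvesting.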
2. Customize automated translation
Today we use MT on the internet and accept its stupid failures, due to the engines' lack of domain knowledge. Some of us go through the lengthy and costly process of customizing an engine for our company's use.
Imagine we have 100 billion translated words at our disposal. We will do fully automatic semantic
clustering to find the translations that match our own domain. We will do automatic genre
identification to make sure that we use the right style. We will go deeper in advancing the MT
technology with syntax and concept descriptions.
3. Global market and customer analytics
Today translation is an isolated function and cost center in most companies and organizations. We
push translations out but we have no means to listen, learn and connect with our customers
worldwide.
Imagine we have 100 billion translated words at our disposal. We will integrate our translation
process and skills with text analytics and social media management. We will do multilingual
sentiment analysis, search engine optimization, opinion mining, customer engagement,
competitive analytics, etcetera. From a cost center, the translation function then becomes a very valuable strategic ally in every global organization.
4. Quality management
Today we struggle to deliver adequate quality in translations. We miss the local flavor, the right term or the subject knowledge. The source texts may be in bad shape, causing all kinds of trouble for the translator or the MT engine. The craftsmanship typical of the translation industry stops all innovation.
Imagine we have 100 billion translated words at our disposal. We will automatically clean and
improve source texts for translation. We will run automatic scoring and benchmarks on quality. We
will improve consistency and comprehensibility.
5. Interoperability
Today the lack of interoperability and compliance with standards costs a fortune. Buyers and
providers of translation lose 10% to 40% of their budgets or revenues because language resources
are not stored in compatible standard formats.
Imagine we have 100 billion translated words at our disposal. Imagine that it is common practice in the global translation industry to share most of your public translation data in a common industry repository. All vendors and translation tools are then naturally driven towards one hundred percent compatibility. Jobs and resources will travel without any loss of value. Benefits on an industry scale add up to billions of dollars or euros.
Stakes are high. Risks are low. Only fear can stop us.
A quarter of a million professional translators produce 625 million good-quality translated words every day, some 150 billion a year, of which an estimated 70% is published on the internet. We can collect and share 100 billion translated words every year. We should give unlimited access to this gigantic supercloud of translations. The translation supercloud must be not-for-profit and directed by the data contributors.
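The arithmetic behind these figures can be reconstructed roughly. The per-translator daily output and the number of working days per year below are our assumptions, not stated in the text:

```python
translators = 250_000     # "a quarter of a million"
words_per_day = 2_500     # assumed daily output per translator
working_days = 240        # assumed working days per year

daily_words = translators * words_per_day
yearly_words = daily_words * working_days
published = yearly_words * 0.70   # "an estimated 70% is published"

print(f"{daily_words:,} words/day, "
      f"{yearly_words / 1e9:.0f}B words/year, "
      f"{published / 1e9:.0f}B published")
```

Under these assumptions the numbers line up with the text: 625 million words a day, 150 billion a year, and roughly 100 billion shareable after the 70% published fraction.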
The stakes are high. The translation industry will flourish. The world will communicate better
across all language barriers.
The risks are low. It is just a choice to participate proactively, or leave it to the ‘pirates’ to change
the industry.
Only fear can stop us. Fear of change and fear of losing control will be replaced by fear of being left behind: many industry leaders have already been sharing their translations in the TDA repository since July 2008.
U-Compare: Interoperability of Text Mining Tools with UIMA
Sophia Ananiadou1, Yoshinobu Kano2
1University of Manchester, UK and 2Database Center for Life Science (DBCLS), Japan
Due to the increasing number of NLP resources (software tools and corpora), interoperability issues are becoming significant obstacles to their effective use. UIMA, the Unstructured Information
Management Architecture, is an open framework designed to aid in the construction of more
interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a
concrete framework for out-of-the-box text mining and a sophisticated evaluation platform
allowing users to run specific tools on any target text, generating both detailed statistics and
instance-based visualizations of outputs.
U-Compare is a joint project, providing the world's largest collection of UIMA resources that are
compatible with a single, shared type system. The collection includes resources developed by
different groups for a variety of domains. Whilst the current emphasis is on English biomedical text
processing components, planned work within META-NET will significantly increase the inventory
of components available for several European languages, including multi-lingual components.
Japanese components are also in development.
U-Compare can be launched straight from the web, without needing to be manually installed. All
U-Compare components are provided ready-to-use and can be combined into workflows easily via
a drag-and-drop interface without any programming. External UIMA components can also simply
be mixed with U-Compare components, without distinguishing between locally and remotely
deployed resources.
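The shared-type-system idea at the heart of this interoperability can be illustrated with a small sketch. This is not the actual UIMA API (which is Java-based); all names here are invented:

```python
from dataclasses import dataclass

# One shared annotation type: any component that reads and writes this
# type can be combined with any other, which is the core of the
# UIMA/U-Compare interoperability story.
@dataclass
class Annotation:
    begin: int
    end: int
    layer: str
    label: str

def tokenizer(text, annotations):
    """Toy component: emits token annotations over the shared type."""
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        annotations.append(Annotation(start, start + len(word), "token", word))
        pos = start + len(word)
    return annotations

def shouter(text, annotations):
    """A second, independently written component consuming the same type."""
    for ann in [a for a in annotations if a.layer == "token"]:
        annotations.append(Annotation(ann.begin, ann.end, "upper",
                                      ann.label.upper()))
    return annotations

# Components compose into a workflow precisely because they agree on
# the type system, without knowing anything about each other.
text = "interoperability matters"
anns = shouter(text, tokenizer(text, []))
print([a.label for a in anns if a.layer == "upper"])
```

Once components agree on the types, workflow composition can be mechanical, which is what makes a drag-and-drop interface like U-Compare's possible.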
Opening, sharing, re-using language resources : who, what, when,
where and how
Stelios Piperidis
Institute for Language and Speech Processing, “Athena” Research Center, Greece
In its report titled “Riding the Wave: how Europe can gain from the rising tide of
scientific data”, October 2010, the High-Level Group on Scientific Data states: “A
fundamental characteristic of our age is the rising tide of data – global, diverse,
valuable and complex. In the realm of science, this is both an opportunity and a
challenge.” In the context of language research and technology, the essence of this
statement has been in focus for the last two decades. The current prevailing
methodologies in language technology development, the sheer number of languages and
the vast volumes of digital content together with a wide palette of useful content
processing applications, render new models for managing the underlying language
resources indispensable.
META-SHARE tries to respond to this need by building a network of distributed
repositories of language resources, including language data and basic language
processing tools (e.g. morphological analysers, POS taggers, etc.). Repositories play either a local or a non-local (central) role. Local repositories are set up and maintained by organisations participating in the META-SHARE network, storing their own resources. Non-local
(central) repositories are also set up and maintained by organisations participating in the META-SHARE network, acting as storage and documentation facilities for resources developed by organisations that do not wish to set up their own repository, as well as for donated or orphan resources. Language resources are described
according to a metadata schema (the META-SHARE metadata schema). Actual resources
and their metadata reside in the local repositories, which export metadata records and
allow their harvesting. Central network servers harvest, host and mirror META-SHARE’s
metadata and point to local repositories for browsing, downloading, etc actual resources.
Users (language resource seekers/consumers) will be able to log in once, search the central inventory using multifaceted search facilities, and access the actual resources by visiting their repositories for browsing and downloading, as well as getting additional information about the usage of specific resources, their relation (e.g. compatibility, suitability) to other resources, recommendations, etc.
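The harvesting model described above can be sketched minimally. The class names and the record format below are illustrative, not the actual META-SHARE schema or harvesting protocol:

```python
# Local repositories hold the actual resources plus metadata; the
# central server harvests only the metadata and keeps a pointer back.
class LocalRepository:
    def __init__(self, url, records):
        self.url = url
        self._records = records          # metadata records it exports

    def export_metadata(self):
        # Each harvested record carries a pointer to its home repository.
        return [dict(rec, repository=self.url) for rec in self._records]

class CentralServer:
    def __init__(self):
        self.inventory = []

    def harvest(self, repositories):
        for repo in repositories:
            self.inventory.extend(repo.export_metadata())

    def search(self, **facets):
        # Multifaceted search runs over harvested metadata only;
        # downloads happen at the repository the record points to.
        return [r for r in self.inventory
                if all(r.get(k) == v for k, v in facets.items())]

repo = LocalRepository("https://example.org/repo", [
    {"name": "MorphoTagger", "type": "tool", "language": "el"},
    {"name": "NewsCorpus", "type": "corpus", "language": "el"},
])
central = CentralServer()
central.harvest([repo])
hits = central.search(type="corpus")
print(hits[0]["name"], "->", hits[0]["repository"])
```

The design choice mirrored here is the key one in the text: metadata is mirrored centrally for search, while the resources themselves stay in, and are served from, the local repositories.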
In this brief presentation, I will discuss the background of the emerging infrastructure,
e.g. community needs this infrastructure should cater for, form and modes of operation,
as well as instruments foreseen to help achieve maximum usability and sustainability.
We will discuss the principles that META-SHARE uses regarding language resource
sharing and the instruments that support them. We will conclude by elaborating on
potential synergies with neighbouring initiatives and future plans at large.
Friday 27th May 2011
S4. Innovation needs data – 9:15‐11:00
Chair: Jan Odijk
Introduction
Jan Odijk (University of Utrecht, NL / Chair)
Introductory Talks
Such Stuff as Dreams are Made on... the Consequences of Grand Visions on Linguistic Resources Hans Uszkoreit (DFKI, DE)
Parallel Multilingual Data from Monolingual Speakers Bill Dolan (Microsoft Research, USA)
Turning water into wine: transforming data sources to satisfy the thirst of the knowledge era Frederique Segond (Xerox, FR)
Contributions
How to get more data for under‐resourced languages and domains? Andrejs Vasiljevs (TILDE, LV)
User feedback collection for MT and dictionaries: current status and strategies Théo Hoffenberg (Reverso ‐ Softissimo, FR)
Discussants
Maria Teresa Pazienza (University of Rome ‘Tor Vergata’, IT)
Guido Vetere (IBM ‐ Senso Comune, IT)
Martine Garnier‐Rizet (Vecsys & IMMI‐CNRS, FR)
Such Stuff as Dreams are Made on ...
the Consequences of Grand Visions on Linguistic Resources
Hans Uszkoreit
DFKI, Germany
The European Network of Excellence META-NET has conducted a complex brainstorming and collective deliberation process with the aim of arriving at a shared technology vision for the European language technology community. Such a shared vision is a cornerstone of a planned strategic research agenda for European LT research and innovation over the next ten years. More than one hundred commercial companies and other organizations have actively participated in the process, represented by recognized experts in technology and commercialization. A central part of the resulting vision paper, "The Future European Multilingual Information Society. Towards a Strategic Research Agenda for Multilingual Europe", is a selection of commercially attractive and socially relevant application scenarios.
In this talk I will summarize the consequences of these application visions and their enabling technologies for the availability of resources. Although I will concentrate on monolingual and multilingual data, I will also comment on linguistic descriptions and tools such as lexical resources and basic processing components. To provide all these prerequisites for our dream technologies when they are needed, novel ways of collecting, producing, maintaining and sharing data will have to be pursued.
Parallel Multilingual Data from Monolingual Speakers
Bill Dolan
Microsoft Research, USA
This talk will describe joint work with David Chen (University of Texas Austin) aimed at eliciting
large volumes of fluent, native multilingual data from a surprising source: monolingual speakers.
Our approach relies on large numbers of Mechanical Turk contributors, each of whom was asked to
watch a short video snippet depicting a simple action and then write a brief description of the
action. Contributors were encouraged to write the description using their native language; given
the international composition of contributors on Mechanical Turk, this task produced clusters of
multilingual sentences all describing the same basic semantic content.
Contributors found the task to be engaging and easy to do, and because it does not require deep
bilingual skills, it is open to virtually any crowdsourcing contributor. As a result, our method offers
a simple and very inexpensive way to gather large volumes of “almost parallel” text, potentially
focused on specific domains like sports, cooking, or automobiles. We will describe the data collection technique and results, provide a pointer to the dataset (which is being released to the research community), and finally discuss some of the interesting qualitative differences between this unique "semantically parallel" data and "linguistically parallel" data.
Turning water into wine: transforming data sources to satisfy the
thirst of the knowledge era
Frédérique Segond
Xerox Research Centre Europe, France
Since the beginning of humanity, data, in its different forms, has been recognized as essential to
knowledge and the principal ingredient of innovation. In this short positioning paper which follows
the "Data Information Knowledge and Wisdom (DIKW)" paradigm, we present what is specific to
the era of Information Technology. Using the example of rare diseases we conclude that not only
the amount of data but the capacity to make sense out of it, learn from it, and turn it into
knowledge will speed up the innovation process.
History is full of examples that show how collecting data and making sense of it has been central to
radical changes in culture and science. Greek philosophers such as Aristotle were able to build a scientific theory with little data, but little by little the qualitative approach has been complemented by the quantitative, as large amounts of data are required to sustain scientific results and theories.
The Ancient Library of Alexandria is one example of data collection in Antiquity that aimed at
capturing knowledge from the world for scholars to study and hopefully to innovate.
Monks and later on, copyists were part of the tradition of collecting data and knowledge of the
world to learn from them and to then educate others.
At the beginning of the 17th century Galileo collected observations with his telescope, and the theory he developed on the basis of these observations has served as the foundation of modern astronomy, which today continues to interpret large amounts of data to obtain scientific results.
In the 18th century more and more scientists and philosophers supported observation and
experience rather than purely intellectually based theories.
The French naturalist Comte de Buffon influenced peers like Lamarck and Cuvier with the publication of his thirty-six volumes of "Histoire naturelle, générale et particulière", and was considered by Darwin the first author to treat evolution in a scientific manner.
In the same period, led by Diderot, the "Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers" collected data on the sciences and mechanical arts with the goal of "changing the way people think". It is recognized as an important intellectual vector of the French Revolution, which eventually led to new political models.
In the 19th century Durkheim proposed a scientific approach to society using quantitative methods
and gave birth to modern sociology.
In the same century and closer to the domain of FLaReNet, linguists and ethnographers such as
Sapir and Lévi-Strauss spent their life collecting data on different languages and cultures and
influencing the work of several generations of linguists, anthropologists and ethnographers.
What has dramatically changed with the advent of the Internet and information technologies is that this data, previously so difficult to collect, became, in the course of only a few years, extremely easy to access and available in much greater quantity. All of a sudden we went from the dream of having more data to the nightmare of data overload, or data obesity. Nowadays data is no longer only encyclopedic, as before: it can be emails, Facebook walls and exchanges on Twitter. Today, data is gathered not only from the Internet but also from supermarket receipts, mobile phones, cars and planes, and soon even refrigerators, ovens and every type of electronic device we use will provide data. Much of the data that previously simply disappeared after serving a specific purpose is now stored, distributed and even resold for analysis, interpretation or other purposes, the best, if not the most frequent, of which is innovation.
The definition of what data is has evolved over the course of history. We adopt the general definition of data as symbols such as words, numbers, codes or tables. These symbols (data) can then be linked into sentences, paragraphs, equations, concepts and ideas to give birth to information. Information can in turn be structured and interpreted to become knowledge. With recent advances in the semantic web, natural language processing and knowledge management, to cite only the fields most relevant to our purpose, the analysis of data has made huge progress. So what is the link to innovation?
When looking at the many existing definitions of innovation, a distinction is often made between invention and innovation. Today, innovation is generally associated with two ingredients: technology, and people willing to use or buy that technology, while an invention may have no commercial value. Innovation is usually associated with the idea of benefit. Almost any company dealing with data which claims to be innovative communicates on its capacity to turn data into wine, giving you a competitive advantage because it performs semantic analysis, knowledge discovery, business intelligence or analytics in general.
What these companies offer their customers is support in understanding their data so as to make better use of it in marketing, technical development or strategic decisions. There are many examples: opinion mining for companies selling products of any type, including politicians selling a political discourse; or making sense of huge amounts of data for the risk societies we now live in, be it for homeland security, environmental risk or risk associated with drugs, to name but a few. The opportunity of making sense of data, of linking information generated from different sources and of reasoning over it has completely changed the way investigations are pursued in law, crime and... medicine.
Medicine has always been a big consumer of data for innovative purposes: the more data a medical domain has, the more medical progress is made. National health institutions invest large amounts of time and money to obtain real user data, for instance blood tests from pregnant women for the early detection of Down syndrome, or the collection of data on the human genome, which has enabled great progress in treating and curing genetic diseases. To better understand diseases and how to properly prevent and cure them, medical doctors need to relate many types of knowledge, such as symptoms, treatments, genes and phenotypes. To do so they use data from collections, communications, publications, patient records and medical archives. Many hospitals hold archives of numerous and very precious data that could be used for epidemiological studies. However, access to and links within and across this data are as important as its actual quantity. In the same medical domain, the study of rare diseases is, by definition, characterized by the fact that very little data exists; but it is precisely because such data is rare that it is important to capture it and link it with other data, such as, in the case of rare diseases, data on genes.
We have given examples of how data is the basic building block of innovation, prior to becoming information and knowledge. We conclude that the quantity of data alone is not sufficient for innovation: of equal importance is the ability to link the information carried by this data to discover and develop new paradigms.
How to get more data for under-resourced languages and
domains?
Andrejs Vasiļjevs
Tilde
Abstract
The explosive growth of digital information on the web enables the rapid development of data-driven techniques, and significant breakthroughs have been achieved in many areas of language technology. Statistical methods based on huge volumes of data have replaced the laborious human work that was previously required to encode linguistic knowledge. In the new paradigm, the more data you have, the better results you get.
However, dependence on data creates new disparities for under-resourced languages and domains.
Naturally, smaller language communities produce much less data than speakers of the languages
dominating the globe. The same problems occur for language data in narrow domains with their
own specific terminological and stylistic requirements.
The essential question is how to get more data for under-resourced languages and domains. Not only does innovation need data; the collection of data also needs innovation. This presentation will briefly discuss some inventive strategies.
When the Web is not enough
Google estimates that 95% of Web pages are in the top 20 languages [1]. Although the language identification method used in this estimation is not very reliable, it clearly demonstrates the huge disparity in the representation of languages on the Web.
Smaller languages often have a complex morphological structure and free word order. To learn this
complexity from corpus data by statistical methods, much larger volumes of training data are
needed than for languages with simpler linguistic structure.
Shared services from non-sharable data
Motivating users to share their data is a powerful strategy to boost public language resources. In data-driven machine translation, most public MT systems are built on parallel texts collected from the web. But a lot of translated data still resides on the local systems of translation agencies, multinational corporations, public and private institutions, and on the desktops of individual users. The TAUS Data Association [2] is an example of successfully involving the key players of the localization industry in sharing their translation memories.
Still, in many cases data holders are not able or willing to share their data, for reasons of competitiveness or confidentiality. New cloud-based services can provide a solution whereby the community can benefit from restricted data. An example is the machine translation platform being developed in the ICT-PSP project LetsMT! [3]. It provides a fully automated self-service for MT generation from user-submitted data. As opposed to traditional sharing platforms, users of the LetsMT! system may only upload their data to the online repository; this data is not downloadable and can be used only for generating and running statistical models for machine translation. The uploaded proprietary data is not directly exposed or shared, yet the community benefits from being able to use it for training and running MT systems. In this way even small companies and institutions can create their own user-tailored MT solutions while contributing to the expansion of online MT training data and to a variety of custom MT systems.
Motivating the crowd
The crowdsourcing approach is boosting the acquisition and expansion of language resources.
Obviously, crowdsourcing needs a crowd. If the total population is in the tens or hundreds of
millions, even a small percentage of active people willing to become involved in crowdsourcing
activities makes quite a big group. But what about smaller language communities? Boosting
enthusiasm or providing monetary incentives (e.g., Amazon Mechanical Turk) are only temporary
solutions for raising the number of participants.
There is room for innovation in finding new motivation schemes. One example is a successful
trial of collaborative translation in Latvia using the CTF tool by Microsoft Research [4], and the
organization of translation competitions in social networks.
Use data wisely
The scarcity of data for under-resourced languages and domains is a strong motivation to look for
ways to use the available data more efficiently. For example, a potentially very useful resource is
multilingual comparable corpora: collections of texts about the same or similar topics that are not
direct translations. Several research activities seek efficient methods for collecting and
analyzing comparable corpora. The FP7 project ACCURAT [5] is researching how to use comparable
corpora for statistical MT, while the FP7 project TTC [6] focuses on the extraction of multilingual
terminology.
Profound research is needed to find new ways of teaching computers language tasks. It should be
possible to get much better results from much less data than current data-driven methods require.
Proof of this is the human child, who has a remarkable ability to generalize the quite limited
language input received from the outside world into complete fluency. If we could mimic this
ability in online "agents" that learn language using web data and interact with participants from
the crowd, it could be a principal solution for closing the technology gap between larger and
smaller languages.
References
[1] Daniel Pimienta, Daniel Prado and Álvaro Blanco. 2009. Twelve years of measuring linguistic
diversity in the Internet: balance and perspectives. UNESCO, Paris.
[2] http://www.tausdata.org
[3] Vasiljevs, Andrejs, Tatiana Gornostay and Raivis Skadins. 2010. LetsMT! – Online Platform for
Sharing Training Data and Building User Tailored Machine Translation. Proceedings of the
Fourth International Conference Baltic HLT 2010, Riga.
[4] http://blogs.msdn.com/b/translation/archive/2010/03/15/collaborative-translations-
announcing-the-next-version-of-microsoft-translator-technology-v2-apis-and-widget.aspx
[5] http://www.accurat-project.eu
[6] http://www.ttc-project.eu
User feedback collection for MT and dictionaries: current status
and strategies
Théo Hoffenberg
Reverso – Softissimo (France)
MT systems and online dictionaries are becoming more mature, and gaining the next 1% in
accuracy or coverage requires a lot of effort. Users of online systems such as Reverso, which
serves over a billion page views each year, can be of great help in identifying issues and adding
phrases to dictionaries; however, the feedback that people provide is not always usable, nor
available in large enough quantities.
Through the Faust project, we started to address this issue, allowing users to rate translations,
suggest better ones, comment on them and, in a second stage, to suggest phrases directly for
inclusion in user or general dictionaries.
As next steps, we will allow users to give their opinion on the self-assessment of MT tools
(confidence scores) and to choose the best translation when several engines are connected.
In parallel, online dictionaries like Reverso’s are now open to user contributions, votes and
comments.
Several thousand such contributions are collected each month, and this information is still not
sufficiently used.
We will describe the methods currently used to collect user data and feedback and to filter and
analyze it. We will also comment on the results obtained and on plans to improve the quality and
quantity of this data and, if possible, on ways to make it directly usable for MT systems or online
dictionaries.
S5. Data for all languages: think big – 11:30‐13:30
Chair: Núria Bel
Introduction
Núria Bel (UPF, Spain / Chair)
Introductory Talks
Plan B Kenneth Church (Johns Hopkins University ‐ HLTCOE, USA)
Europeana and multi‐lingual access: Can we do it without Google or Microsoft? Can we do it without an open community? David Haskiya (Europeana Foundation)
Contributions
Cross‐lingual knowledge extraction Dunja Mladenic (JSI, SL) and Marko Grobelnik (JSI, SL)
Developing Language Resources in an African Context: the South African Case Justus C. Roux (CTexT, North‐West University, South Africa)
Identifying and networking forces: an international panorama Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet) and Claudia Soria (CNR‐ILC, IT / FLaReNet)
Discussants
Christopher Cieri (University of Pennsylvania ‐ Linguistic Data Consortium, USA)
Aurélien Max (LIMSI‐CNRS & Université Paris‐Sud, FR)
Plan B
Kenneth Church
Johns Hopkins University, HLTCOE
Abstract
Ideally a corpus should be large, balanced and annotated. But what should we do when we can't
have it all? This paper will discuss various fall-back positions, which we will refer to as Plan B:
1. Large unbalanced archives of newswire and web crawls
2. Google-ology: The Web is exciting
3. Catch-as-Catch-Can: Hybrid Combinations of a trillion of this and a billion of that and a
million of the other thing.
4. Just-in-Time Data Collection (with Amazon's Mechanical Turk)
5. Zero Resources (Do Without): It is sometimes possible to get something from nothing
Large Unbalanced Archives of Newswire and Web Crawls
Should a corpus be balanced? This is an old question that keeps coming up again and again. There
was an Oxford Debate on this question where the house was predisposed to vote for balance, but
ended up voting for quantity (with some encouragement from a few passionate American engineers
from IBM and AT&T). Patrick Hanks was really excited in the 1980s with what you could do with
tens of millions of words, far more than the Brown Corpus (one million). Much more could be
done with hundreds of millions of words such as the British National Corpus (BNC). These days,
web counts are 1000 times larger than BNC counts. More is more (Church, 2010), despite
criticisms of "Google-ology" (Kilgarriff, 2007).
Google-ology
The Web is exciting. There is hope that everyone will have access to vast quantities of data, well
beyond the means of consortia such as the British National Corpus (BNC). Government funding
agencies should focus their limited resources on targets of opportunity where they have an unfair
advantage, and they should avoid areas where commercial entities such as search companies have
an unfair advantage.
Web companies are doing what they can to work with researchers (Huang et al, 2010) and (Wang
et al, 2010).1 In addition, a number of researchers are discovering clever ways to use search
engines to do computational lexicography. For example, Turney (2001, 2002) has shown how to
use web queries to estimate pointwise mutual information (PMI), an improvement over Church
and Hanks (1990) since the web is so much larger than research corpora such as the BNC. The web
opens up all sorts of opportunities to use search engines for applications in computational
1 See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html, http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx, http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf and http://ngrams.googlelabs.com/.
lexicography: Hearst (1992), KnowItAll (Lin et al, 2009 & Etzioni et al, 2005), Sekine (2006),
Fang and Chang (2011) and much more.
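As a concrete illustration of the PMI-IR idea, the score is just the log ratio of observed co-occurrence to what chance would predict. The sketch below is an illustrative toy, not Turney's actual system; the counts are invented stand-ins for search-engine hit counts:

```python
from math import log2

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) * P(y))),
    estimated from raw counts over a notional collection of `total` pages."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return log2(p_xy / (p_x * p_y))

# Invented counts: a word pair that co-occurs 30x more often than chance predicts.
score = pmi(count_xy=30_000, count_x=2_000_000, count_y=500_000, total=1_000_000_000)
print(round(score, 2))  # → 4.91, i.e. log2(30)
```

In the web setting the counts come from search queries instead of a fixed corpus, which is exactly where the size advantage over the BNC comes in.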
Catch-as-Catch-Can
When we can't have what we want (large, balanced and annotated), can we come up with a hybrid
combination of what we have? (Some collections are large, some others are balanced, and a few
small ones are annotated.) In Bergsma et al. (2011), for a particular application (disambiguating a
special case of conjunction), we propose a combination of a trillion words of this (Google Ngrams),
a billion words of that (parallel corpora) and a million words of annotated text (Penn Treebank).
Just-in-Time Data Collection with Amazon’s Mechanical Turk
Mechanical Turk (Artificial Artificial Intelligence) is a hot topic. See Callison-Burch and Dredze
(2010) for a recent workshop on this subject.
While there is plenty of evidence that "there is no data like more data," you never know where the
next opportunity or crisis will be. Crowd-sourcing (Amazon's Mechanical Turk) makes it possible to
mobilize large armies of human talent around the world with just the right language skills, so that
it is feasible to collect what we need when we need it, even during a crisis such as the recent
earthquake in Haiti or the floods in Pakistan. According to Callison-Burch's home page
(http://www.cs.jhu.edu/~ccb/), he is #6 on the leaderboard of MTurk Requesters.
Zero Resources (Do Without)
Much of the work in Information Retrieval on web pages is based on simple counts (bags of words),
with few dependencies on language-specific resources. In Dredze et al (2010) and Jansen et al
(2010), we propose bags of pseudo-terms for spoken documents, which are similar to bags of words
for web search. Pseudo-terms are computed by linking long repetitions in the audio. We are
finding that bags of pseudo-terms are almost as good as bags of words, at least for some
applications. When we have resources, we ought to use them. But when we don't, it may be
possible to find something from nothing.
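A minimal sketch of the shared representation: with text the tokens are words, while with speech and no ASR they would be pseudo-term IDs assigned to clusters of long repeated audio segments. The documents below are invented examples:

```python
from collections import Counter
from math import sqrt

def bag(tokens):
    """Bag of words (or bag of pseudo-terms): token -> count."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

doc1 = bag("language resources for machine translation".split())
doc2 = bag("machine translation needs language resources".split())
doc3 = bag("digital preservation of cultural heritage".split())

print(round(cosine(doc1, doc2), 2))  # → 0.8: four of five terms shared
print(round(cosine(doc1, doc3), 2))  # → 0.0: no shared terms
```

Nothing in `cosine` cares whether the keys are words or pseudo-terms, which is why the same retrieval machinery carries over to spoken documents.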
References
Shane Bergsma, David Yarowsky and Kenneth Church. 2011. Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation. In ACL.
Chris Callison-Burch and Mark Dredze. 2010. Workshop on Creating Speech and Language Data With Mechanical Turk at NAACL-HLT.
Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22-29.
Kenneth Church. 2010. More is more. In G.-M. de Schryver, editor, A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, chapter 7. Menha Publishers, Kampala, Uganda.
Kenneth Church. 2008. Approximate Lexicography and Web Search. International Journal of Lexicography 21(3), 325-336.
Mark Dredze, Aren Jansen, Glen Coppersmith, and Kenneth Church. 2010. NLP on spoken documents without ASR. In EMNLP.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence 165(1), 91-134.
Yuan Fang and Kevin Chen-Chuan Chang. 2011. Searching patterns for relation extraction over the web: rediscovering the pattern-relation duality. In Proceedings of WSDM '11, 825-834.
Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING '92, Vol. 2, 539-545.
Jian Huang, Jianfeng Gao, Jiangbo Miao, Xiaolong Li, Kuansan Wang and Fritz Behr. 2010. Exploring web scale language models for search query processing. In WWW 2010.
Aren Jansen, Kenneth Church, and Hynek Hermansky. 2010. Towards spoken term discovery at scale with zero resources. In INTERSPEECH.
Frank Keller and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29(3), 459-484.
Adam Kilgarriff. 2007. Googleology is Bad Science. Computational Linguistics 33(1), 147-151.
Thomas Lin, Oren Etzioni, and James Fogarty. 2009. Identifying interesting assertions from the web. In Proceedings of CIKM '09, 1787-1790.
Peter Turney. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of ECML 2001, Freiburg, Germany, 491-502.
Peter Turney. 2002. Coherent keyphrase extraction via web mining. In Proceedings of IJCAI, 434-439.
Satoshi Sekine. 2006. On-demand information extraction. In ACL.
Kuansan Wang, Christopher Thrasher, Evelyne Viegas, Xiaolong Li, and Bo-june (Paul) Hsu. 2010. An overview of Microsoft web N-gram corpus and applications. In Proceedings of the NAACL HLT 2010 Demonstration Session, 45-48.
Europeana and multi-lingual access:
Can we do it without Google or Microsoft?
Can we do it without an open community?
David Haskiya
The Europeana Foundation
“Europeana enables people to explore the digital resources of Europe's museums, libraries,
archives and audio-visual collections. It promotes discovery and networking opportunities in a
multilingual space where users can engage, share in and be inspired by the rich diversity of
Europe's cultural and scientific heritage.”
The key phrase here in relation to FLaReNet Forum is “multi-lingual” space. How can Europeana
make digitised heritage accessible across language boundaries? The Europeana metadata
repository holds descriptive metadata in at least 27 languages and those metadata records describe
original objects that can contain information in many other languages e.g. ancient texts written in
Latin, Old Norse, or even Sumerian! Apart from users from all European countries, we also know
that we have users from across the world, e.g. in India, South Korea, and Japan.
This poses massive challenges in localising user interfaces, in translating editorial texts, in
language identification of metadata records, in query translation, in multi-lingual semantic
enrichment, and in translation of the metadata records.
In this presentation I'll touch on some of these challenges and on whether they can be solved
without big companies like Google or Microsoft, and without open communities like Wikipedia.
Cross-lingual Knowledge Extraction
Dunja Mladenic and Marko Grobelnik
J. Stefan Institute, Ljubljana, Slovenia
Knowledge extraction has been a challenge for many years, and it remains one. On this long path
of exploring language while simultaneously modeling the world, we have transitioned through many
stages: from the early years, when linguists were exploring the structure of words and sentences
and philosophers were discussing ontology, up to today, when a large part of written information
is available to almost everybody and modern communication allows researchers to join their efforts
and exploit the opportunities offered by technology. Still, the apparent opportunities seem to be
much bigger than the actual depth of the solutions developed in recent years in the field of text
understanding.
One of the key properties of natural languages is redundancy in the encoded information and the
structure used. As a consequence, different techniques can extract different aspects of information
from a text. They range from simple techniques such as character counting, to more sophisticated
ones such as linear algebra, to the advanced techniques which exploit the structural aspects of text.
Many of these techniques deliver something useful and solve somebody’s problem. Some examples
of such problems are: language identification (solved with character counting), document
categorization (solved with linear algebra methods), question-answering (solved typically with
shallow linguistic methods), and reasoning (solved typically using logic).
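To make the "character counting" remark concrete, here is a minimal language identifier based on character-bigram frequency profiles; the training snippets are toy examples, far too small for real use:

```python
from collections import Counter

def char_profile(text, n=2):
    """Relative frequencies of character n-grams (bigrams by default)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(query, model):
    """Shared n-gram probability mass between two profiles."""
    return sum(min(p, model.get(g, 0.0)) for g, p in query.items())

# Toy "training" data, one snippet per language:
models = {
    "en": char_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": char_profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}

def identify(text):
    profile = char_profile(text)
    return max(models, key=lambda lang: similarity(profile, models[lang]))

print(identify("the dog and the cat"))  # → en
```

Real systems use larger n-gram models and far more training text, but the principle is the same: simple counts are enough for this particular task.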
Having many techniques for dealing with text has the unfortunate consequence that those
techniques fall into different research areas, which often do not communicate effectively. Text
understanding, as a long-term research goal, can significantly benefit from a diversity of insights
coming from related research areas.
This talk will address the problem of knowledge extraction from social language in a cross-lingual
setting and the need for a semantic interlingua, based on logic, to support knowledge extraction.
Developing language resources in an African context:
The South African case
Justus C Roux
Centre for Text Technology (CTexT)
North-West University, Potchefstroom, South Africa
Background
Developing appropriately balanced and annotated text and speech corpora for resource scarce
languages is indeed a challenge, particularly in an African context where many indigenous
languages have a relatively short written tradition. This presentation will focus on the South
African situation, with eleven proclaimed official languages and an official language policy that
implicitly views the development of human language technologies as a means to convey
information to citizens in a language of choice. Apart from English and Afrikaans (a Germanic
language closely related to Flemish and Dutch), the other nine official languages can be classified
into four language families: Nguni (isiZulu, isiXhosa, Siswati, isiNdebele), Sotho (Sesotho
sa Leboa, Setswana and Sesotho), Tshivenda and Xitsonga. The ideal of developing appropriate text
and speech applications in a variety of languages obviously relies on the availability of
resources. Developing these is, however, a painstaking and costly exercise, and it has become
necessary to adopt a variety of data-gathering approaches to speed up the collection and processing
of text and speech resources.
Factors influencing data acquisition and processing
- Limited printed materials in the African languages, e.g. newspapers, periodicals, etc.
- Limited presence on the Internet, restricting web crawling as a means of gathering text data.
- Different writing systems (conjunctive and disjunctive) requiring considerable preprocessing and linguistic knowledge in the processing phase.
- Non-standardized orthographies and spelling conventions influencing the development of spelling checkers.
- Unsophisticated users of technology.
- Relatively small HLT community, with a resulting lack of capacity.
Potential data sources
- Multilingual Hansard data from local and national parliaments (text and speech).
- Multilingual documents generated by government departments at provincial and national levels.
- Access to the speech archives of the South African Broadcasting Corporation (SABC), which transmits programs in all languages on a continuous basis.
- Private media houses.
Methodological approaches
These include, inter alia:
- Telephone-based speech acquisition (especially mobile): different projects since 2000.
- Use of automated speech segmentation and annotation techniques.
- Creation of software (language-specific morphological analysers) to account for the agglutinative languages.
- Creation of pronunciation dictionaries: bootstrapping techniques to speed up the process.
- Software development for verifying pronunciation dictionaries.
Resource Centre for reusable resources
In December 2008 the South African Government approved the establishment of a National Centre
for Human Language Technologies (NCHLT) as part of the implementation of a National Language
Policy Framework approved by Cabinet in 2003. This policy explicitly refers to the development of
Human Language Technologies (HLT) in South Africa.
The NCHLT currently functions as a virtual entity under the auspices of the national Department of
Arts and Culture (DAC) with the support of an Expert Panel for HLT (HLTEP), mainly comprising
academics actively working in the field. Over the last number of years this department has been
sponsoring a number of text and speech related HLT projects mainly conducted by academic and
research institutions functioning as “agencies”. It was initially foreseen that resources developed
through these projects would need to be managed in a professional way and thus a Resource
Management Agency (RMA) needed to be established. A blueprint for the functioning of this RMA
was developed in 2010 and a call for interest for running this agency is to be issued in June 2011. It
is foreseen that this RMA will be operative before the end of 2011.
Aims of the RMA
- To function as a single deposit point for the various types of electronic data of the official languages of South Africa, for research and development purposes in the field of Human Language Technologies. The data types may include broad categories such as text, speech, language-related video, multimodal resources including sign language, as well as pathological language and forensic language data;
- To function as the official language resource management and distribution point for data related to Human Language Technologies for all of the South African official languages. Ongoing projects of the DAC/NLS (National Language Service) will require that all project data be deposited with and managed by the RMA, in order to prevent data loss and to promote reusability of the data;
- To position South Africa strategically with respect to collaboration with other similar agencies worldwide, with a long-term vision of becoming the hub for language resource management in Africa.
Identifying and networking forces: an international panorama
Joseph Mariani1 and Claudia Soria2
1LIMSI-CNRS and IMMI, France and 2CNR-ILC, Italy
In order to conduct a survey of the National and Transnational initiatives in the area of Language
Resources (LR), a network of international FLaReNet Contact Points was created in August 2010,
comprising 102 people from 76 countries or regions in the European Union (26 Member States and
6 regions), non-EU European countries (9 countries) and non-European countries (35 countries).
The survey shows that almost all European countries now take care of gathering Language
Resources for their languages in order to conduct research and to develop and test systems for
those languages. The languages that were considered "low-resourced" are rapidly catching up, even
if they still need many more LR, given that no language has enough LR available for the needs of
the research and industrial communities. Surprisingly, the UK has no national program for (British)
English. The reason may be the importance of US activities regarding the processing of the
(American) English language. Baltic and Nordic countries are conscious of the importance of LR for
the promotion and survival of their languages, and they accordingly have a policy to support that
area, including for minority languages. There is also important activity in some EU regions, either
for specific languages (Basque, Catalan) or in general (the Trento region in Italy).
Activities in other parts of the world are also impressive. Governmental initiatives in India and
South Africa to develop Language Technologies for all the official languages of those multilingual
countries, in order to meet the needs of all citizens, are exemplary. The creation of associations
for the specific development of Language Technologies for the Arabic language or for African
languages is also an interesting trend, while Asia continues to organize activities country by
country, with individual initiatives dealing with cultural heritage, including the preservation of
the languages spoken in each country.
This work represented the basis for conducting a survey in the META-NET project of ongoing and
recent projects and initiatives at the national, EU and transnational level. The main purpose was to
identify projects addressing Machine Translation, multilingual issues, language resources and
technologies, or infrastructural issues at large. The focus of the survey is on Europe but relevant
initiatives outside Europe have been reviewed as well. This work resulted in the collection of up-to-
date and quality-checked information about 66 EU, 28 transnational and 183 national projects,
thus contributing to a comprehensive and reliable overview of recent and current activities.
Now that the importance of Language Resources and of Language Technology evaluation is recognized
as essential, the coordination of activities should be the aim, in order to avoid a multiplication
of independent initiatives that may result in an inextricable landscape, and to build on Best
Practices. The needs should be carefully identified and properly addressed, both in the framework
of the Information and Communication Technologies and of the Human and Social Sciences, while
devoting an investment commensurate with the size of the corresponding challenge.
It is our intention to build upon this permanent network of International Contact Points, and to
maintain a wiki allowing for the regular update of the harvested information. The Internet brings
the availability of a worldwide Web, which provides a border-less infrastructure allowing for a
network approach to LR production, distribution, validation, evaluation and sharing. This may be
the chance to achieve full multilingualism.
S6. Long life to our resources – 14:45‐16:30
Chair: Khalid Choukri
Introduction
Khalid Choukri (ELDA, FR / Chair)
Talks
Preparing to share the effort of preservation using a new EU preservation e‐Infrastructure David Giaretta (STFC, UK)
Digital Humanities – its challenges and opportunities Daan Broeder (MPI, NL) and Peter Wittenburg (MPI, NL)
Contributions
Durable digital data at DANS Peter Doorn (DANS, NL)
Life is Longer and Better in OpenAIRE Yannis Ioannidis (University of Athens & ILSP ‐ A.C. "Athena", GR)
Discussants
Bob Boelhouver (Instituut voor Nederlandse Lexicologie, NL)
Edouard Geoffrois (DGA, FR)
Preparing to share the effort of preservation using a new EU
preservation e-Infrastructure
David Giaretta
STFC, UK
This talk will describe the background and the capabilities of the preservation e-Infrastructure
components which are to be produced by the SCIDIP-ES EU project, and how these can benefit this
domain. These components are based on the techniques proven through the CASPAR project
(www.casparpreserves.eu) and justified by the surveys and case studies of PARSE.Insight
(www.parse-insight.eu).
Closely linked to these is the APARSEN network of excellence (www.aparsen.eu), which is bringing
together researchers from libraries, academia, industry, vendors and science laboratories to create
a common vision for digital preservation research. By this we mean not a straitjacket but rather
a direction of travel and a common understanding, which should ensure that the individual pieces
of research can be seen as parts of a greater whole and can thereby work together rather than as
disjoint universes of discourse. The individual pieces of research will, for the most part, create
metadata which are essential for the working of the domain-neutral e-Infrastructure components.
An additional benefit, which will also be described, arises from viewing preservation as enabling
those unfamiliar with the data to use it; for preservation, this unfamiliarity comes from distance
in time. However, the same infrastructure, tools and techniques also allow those who, right now,
are unfamiliar with the data through distance in expertise to use the data.
Digital Humanities – its challenges and opportunities
Peter Wittenburg
MPI, The Netherlands
Topics that dominate some of the current discussions in the humanities are a) the paradigm change
in humanities research towards digital humanities, b) the need to focus on data curation and
integration, c) ways to foster computational humanities and d) how to ensure that not only a few
specialists can profit from an infrastructure offering many opportunities. These topics are closely
related: for example, computational humanities can only be done efficiently and at a reasonable
level when the data curation problem is tackled, and digital humanities can only be realized when
humanities researchers have access to well-integrated data resources and tools/web services.
Finally, we need to develop a strategy for how we want to educate future generations of humanities
researchers: do we expect all of them to be able to do smart scripting in a technologically
increasingly complex domain, or do we expect to offer smart services and easy-to-use virtual
research environments supporting the mass of researchers? Infrastructure initiatives need to find
answers to these questions and tackle them in a convincing way, and this at a time when
collaboration towards an eco-system of infrastructures is much needed.
The talk will touch on these different aspects and outline directions that are currently being debated.
Durable digital data at DANS
Peter Doorn
Director, Data Archiving and Networked Services
Summary
Data Archiving and Networked Services (DANS) promotes durable access to digital research data.
For this purpose, DANS encourages scientific researchers to archive and reuse data in a sustainable
manner, e.g. by means of our online archiving system EASY. DANS also provides access, through
Narcis.nl, to thousands of scientific datasets, e-publications and other research information in the
Netherlands. In addition, the institute provides training and advice and performs research into
sustained access to digital information.
The data collections in our archives focus on the humanities and social sciences. DANS will also act
as a CLARIN data center, concentrating on the role of long-term preservation of language data and
text corpora. There are still important challenges to tackle; preserving services (or software) is for
instance much more complicated than preserving raw data.
We developed the Data Seal Of Approval (www.datasealofapproval.org), which specifies criteria for
trusted digital deposit, preservation and use of data.
As coordinator of the preparation phase, DANS is also involved in DARIAH, the European-wide
Digital Research Infrastructure for the Arts and Humanities. DARIAH’s mission is to enhance and
support digitally enabled research across the humanities and arts. DARIAH aims to develop and
maintain an infrastructure in support of ICT-based research practices across the arts and
humanities, and works with communities of practice to:
- Explore and apply ICT-based methods and tools to enable new research questions to be asked and old questions to be posed in new ways
- Link distributed digital source materials of many kinds
- Exchange knowledge, expertise, methodologies and practices across domains and disciplines
DARIAH aims to create a single, unified European data area in which scholars and students can
easily survey the available information in their field – data which is dependable in terms of both
quality and durability.
Driven by data, DANS ensures that access to digital research data keeps improving, by its services
and by taking part in (international) projects and networks. Go to www.dans.knaw.nl for more
information and contact details.
Life is Longer and Better in OpenAIRE
Yannis Ioannidis
University of Athens & ATHENA Research Center, Greece
OpenAIRE is a European project that delivers an electronic infrastructure and supporting
mechanisms for the identification, deposition, open access, and monitoring of FP7 and ERC funded
articles, through the establishment and operation of the European Helpdesk. All deposited articles
resulting from EU-funded research are freely accessible through the www.openaire.eu portal,
which also supports a special repository for articles that can be stored neither in institutional nor in
subject-based/thematic repositories. The OpenAIRE electronic infrastructure is based on state-of-
the-art software services of the D-NET package developed within the DRIVER and DRIVER-II
predecessor projects and the Invenio digital repository software developed at CERN. Thematically,
the project focuses on peer-reviewed publications in at least seven disciplines: energy,
environment, health, cognitive systems-interaction-robotics, electronic infrastructures, science in
society, and socioeconomic sciences-humanities. Geographically, the OpenAIRE project has a
definitive European footprint by covering the European Union in its entirety, engaging people and
scientific repositories in almost all 27 member states and beyond. In this presentation, I will
discuss the architecture and overall philosophy behind the infrastructure software built as part of
OpenAIRE and will highlight the quality characteristics it offers: flexibility, availability,
sustainability, adaptability, and reusability.
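Repository networks in the DRIVER/D-NET tradition are typically harvested over OAI-PMH, the standard protocol for exposing repository metadata. As a minimal sketch, assuming a hypothetical endpoint (the URL and helper function below are illustrative, not part of the actual OpenAIRE API), a harvester composes requests such as:

```python
from urllib.parse import urlencode

def build_oai_request(base_url, verb, **params):
    """Compose an OAI-PMH request URL from a base endpoint, a protocol verb
    (e.g. ListRecords, Identify), and optional arguments such as metadataPrefix."""
    query = {"verb": verb, **params}
    return base_url + "?" + urlencode(query)

# Placeholder endpoint for illustration only; a real harvester would use the
# base URL advertised by the repository being harvested.
url = build_oai_request("https://repo.example.org/oai",
                        "ListRecords", metadataPrefix="oai_dc")
print(url)  # https://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The response to such a request is an XML document of Dublin Core records, which an aggregator like D-NET can then normalize and index.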
Closing Session – 16:30‐17:30
Chair: Nicoletta Calzolari
Nicoletta Calzolari (ILC‐CNR, IT / FLaReNet Coordinator)
"Highlights from the Sessions" Sessions' Chairs
Community Endorsement of the FLaReNet Recommendations Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet)
The Language Resources Sharing Charter Stelios Piperidis (ILSP ‐ A.C. "Athena", GR)
Language Resources in the LT Strategic Research Agenda Hans Uszkoreit (DFKI, DE / META‐NET)
The FLaReNet Community: the Way Forward Nicoletta Calzolari (CNR‐ILC, IT / FLaReNet) and Joseph Mariani (LIMSI/IMMI‐CNRS, FR / FLaReNet)
Organisation
Scientific Committee
Nicoletta Calzolari (ILC‐CNR, Pisa, ITALY)
Khalid Choukri (ELDA, Paris, FRANCE)
Stelios Piperidis (ILSP / “Athena” R. C., Athens, GREECE)
Jan Odijk (Universiteit Utrecht, Utrecht, THE NETHERLANDS)
Núria Bel (Universitat Pompeu Fabra, Barcelona, SPAIN)
Joseph Mariani (LIMSI/IMMI‐CNRS, Paris, FRANCE)
Claudia Soria (ILC‐CNR, Pisa, ITALY)
Organising Committee
Nicoletta Calzolari (ILC‐CNR, Pisa, ITALY)
Paola Baroni (ILC‐CNR, Pisa, ITALY)
Riccardo Del Gratta (ILC‐CNR, Pisa, ITALY)
Francesca Frontini (ILC‐CNR, Pisa, ITALY)
Sara Goggi (ILC‐CNR, Pisa, ITALY)
Monica Monachini (ILC‐CNR, Pisa, ITALY)
Valeria Quochi (ILC‐CNR, Pisa, ITALY)
Irene Russo (ILC‐CNR, Pisa, ITALY)
Claudia Soria (ILC‐CNR, Pisa, ITALY)
Local Committee
Rodolfo Delmonte (Università Ca' Foscari, Venezia, ITALY)
Rocco Tripodi (Università Ca' Foscari, Venezia, ITALY)