
Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market


Page 1: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

META-NET has received funding from the EU’s Horizon 2020 research and innovation programme through the contract CRACKER (grant agreement no.: 645357). Formerly co-funded by FP7 and ICT PSP through the contracts T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U (grant agreement no.: 270893) and META-NORD (grant agreement no.: 270899).

Language Technologies for Big Data: A Strategic Agenda for the Multilingual Digital Single Market

Georg Rehm
Coordinator CRACKER, General Secretary META-NET

[email protected]

BDVA Summit – Valencia, Spain, 1st December 2016

Page 2: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Language Technology

• Text and Content Analytics (NLP, Multimedia Processing)
  - Information Extraction, Relation Extraction, Knowledge Extraction
  - Mining of: Topics, Sentiments, Opinions, Arguments, Rumours etc.
  - Goal: From Data to Knowledge (Smart Data) – no deep NLU yet
• Semantic Technologies – Knowledge Technologies
  - Semantic Web, semantic descriptions, ontologies, reasoning
  - RDF, Linked Data, Knowledge Graphs, grounding
• Multilingual and Crosslingual Technologies (Machine Translation)
  - Technologies to bridge language barriers
• Text and Report Generation – From data to text
• Curation Technologies – mix of tech to create and curate content
• Conversational Interaction Technologies (ASR, DS, TTS etc.)
• Multilingual Europe – Technologies for all European languages

http://www.meta-net.eu 2

Page 3: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Outline

• Initiatives for Multilingual Europe
• Towards the Multilingual Digital Single Market
• MDSM SRIA V0.9
• Next Steps

http://www.meta-net.eu – http://www.cracker-project.eu 3

Page 4: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

• META-NET: 60 research centres in 34 countries (founded in 2010)
  Chair of Executive Board: Jan Hajic (CUNI)
  Dep.: J. van Genabith (DFKI), A. Vasiljevs (Tilde)
  General Secretary: Georg Rehm (DFKI)
• META: Multilingual Europe Technology Alliance. 826 members in 67 countries

(published in 2013) (31 volumes; published in 2012)


Page 5: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

1  DFKI       Germany         Georg Rehm
2  CUNI       Czech Republic  Jan Hajic
3  ELDA       France          Khalid Choukri
4  FBK        Italy           Marcello Federico
5  ATHENA RC  Greece          Stelios Piperidis
6  UEDIN      UK              Philipp Koehn
7  USFD       UK              Lucia Specia

Coordination and Support Action, H2020-ICT17, 2015–2017, 36 months – http://www.cracker-project.eu

Cracking the Language Barrier: Coordination, Evaluation and Resources for European MT Research

THREE PRIORITY AREAS FOR ACHIEVING THE MULTILINGUAL DIGITAL SINGLE MARKET

1. Multilingual access to all digital goods and services across Europe

Geo-blocking: […] due to nationality, location, or residence […] customers […]

Language-blocking: […] languages they do not speak […] however, current online translation is insufficient […] trying to conduct […] common languages […]

Geo-blocking and language-blocking are barriers to access

Both geo-blocking and language-blocking are daily problems for tens of millions of EU citizens.

Customers are six times more likely to buy from sites in their native language.

Most EU languages address less than 3% of the market, fundamentally limiting SMEs operating in countries where those languages are spoken.

Lack of language technology support (automatic translation, tools to assist human translators, and multilingual support in […] European businesses.

Language can be expensive for SMEs

Online businesses face around €5,000 in up-front costs for each new language they translate their websites into, plus similar […] and marketing costs.

Even when sites are translated, the vast majority of SMEs cannot respond to support requests or customer feedback in other languages. Such responsiveness is needed to achieve customer satisfaction and build brand loyalty.

English is not the answer

52% of EU customers do not purchase […]

Adding even a few languages to an SME’s website beyond English can have a major impact on revenue. Large organizations today […] to increase market share.

[Chart: likelihood of purchasing, site in buyer’s native language vs. site in foreign language; 6x more likely to purchase]


Communities

• META-NET incl. META-SHARE and META
• MT evaluation initiatives – WMT, IWSLT, MT Marathons
• MT and other LT industry
• Language resources – META-SHARE, ELRA
• HT/MT evaluation tools – translate5
• Translation industry, translation profession
• MT user communities

Strategic Agenda for the Multilingual Digital Single Market

• Version 0.5 presented at META-FORUM 2015 (Riga)
• Version 0.9 presented at META-FORUM 2016 (Lisbon)

Strategic Research and Innovation Agenda

Language as a Data Type and Key Challenge for Big Data

Enabling the Multilingual Digital Single Market through technologies for translating, analysing, processing and curating natural language content

SRIA Editorial Team

Version 0.9 – July 2016


Page 6: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

http://www.cracker-project.eu • http://www.meta-net.eu

• Riga Summit 2015 and Riga Declaration.
• Federation of European projects and organisations working on technologies for a multilingual Europe.
• Multi-lateral Memorandum of Understanding; 10 organisations and 24 projects on board.
• Getting new members on a regular basis.
• Selected areas of collaboration: data management and repositories, tools, shared tasks, evaluations, events.
• Goal: provide one umbrella organisation for the whole community.

Page 7: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

• The Digital Single Market is a top priority in the European Union.
• It is expected to add €400 billion to European GDP and create hundreds of thousands of new jobs.
• Unfortunately, the language topic is not included in the EC’s Digital Single Market strategy (published in May 2015).


Page 8: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market
Page 9: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market
Page 10: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market


Page 11: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Facts and Figures


Page 12: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

A. Ansip’s May 2016 Blog Post

• Posted on 27 May 2016.
• First public acknowledgment by the EC that the language topic is of very high relevance for the Digital Single Market.
• “Overcoming language barriers is vital for building the DSM, which is by definition multilingual. It is now time to reduce and remove the language barriers that are holding back its advance, and turn them into competitive advantages.”


Page 13: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market


Page 14: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Language as a Data Type

• Language technology is a necessary ingredient of the multilingual DSM and mandatory enabler for the European data economy.
• Big Data is never only numerical – there’s always a language component: unstructured text content, column heads, metadata etc.
• Without language technology, Big Data analytics won’t happen.
• The EU Data Economy needs Multilingual Big Data Content Analytics and Multilingual Big Data Content Generation.


[Diagram: Language Technology turns unstructured data into structured data, heterogeneous data into homogeneous data, big data into knowledge, unorganised data into organised data, and multilingual big data into crosslingual analytics.]

Page 15: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

• Overall goal: “deliver new Big Data technology allowing for deep analytics capacities on data-at-rest and data-in-motion while providing sufficient privacy guarantees, optimized user experience support and a sound data engineering framework.”
• “In Europe, text-based data resources occur in many different languages […].”
• “This multilingualism of data sources makes it often impossible to use existing tools and to align available resources, because they are generally provided only in the English language.”
• “Thus, the seamless aligning of data sources for data analysis or business intelligence applications is hindered by the lack of language support and availability of appropriate resources.” (p. 23)

Page 16: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

BDVA SRIA V2.0: Challenges and Needs

• BDVA SRIA Technical Priority “Data Management”:
  - Tools for handling unstructured and semi-structured data for different languages.
  - Annotation frameworks for integration of annotation technologies and data formats.
  - Techniques for semantic interoperability such as standardised data models and interoperable architectures for different sectors.
  - Standards and multilingual knowledge repositories that allow the seamless linking of data.
• BDVA SRIA Technical Priority “Data Analytics”:
  - Improved, more accurate statistical models, especially with regard to semantic analysis.
  - Deep learning, contextualisation, machine learning, NLP, smart data analytics and real-time semantic analysis, including event and pattern discovery.
  - Methods for unstructured multimedia analytics and data mining, linking algorithms to deliver cross-domain and cross-sector intelligence.
• BDVA SRIA Technical Priority “Data Processing”:
  - Real-time analytics and event processing of highly heterogeneous data sources and formats.
  - Processing, linking, aligning data sets with one another, including semantic representations, unstructured, semi-structured and structured data, and multimedia data etc.
  - Knowledge extraction out of heterogeneous data sets.
  - Special emphasis on quality, precision, robustness.


Page 17: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

MDSM SRIA

• Version 0.5 unveiled at META-FORUM 2015
• Version 0.9 unveiled at META-FORUM 2016
• Version 1.0 foreseen for early 2017
• Prepared and presented by Cracking the Language Barrier federation (editorial team: 13 colleagues)
• SRIA addresses how the LT community is going to act united in order to make the DSM multilingual
• Aligned to three of the BDVA SRIA V2.0’s technical priorities: Data Management, Data Analytics, Data Processing.


Strategic Agenda for the Multilingual Digital Single Market

Technologies for Overcoming Language Barriers towards a truly integrated European Online Market

DRAFT

Version 0.5 – April 22, 2015


Page 18: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Strategic Research and Innovation Agenda

Language as a Data Type and Key Challenge for Big Data

Enabling the Multilingual Digital Single Market through technologies for translating, analysing, processing and curating natural language content

SRIA Editorial Team

Version 0.9 – July 2016

http://www.cracker-project.eu

http://www.cracking-the-language-barrier.eu

Page 19: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

MLV Programme

• Multilingual Value Programme*
  - Three-year programme
  - Requires modest investment
• “Enabling the Multilingual Digital Single Market through technologies for translating, analysing, processing and curating natural language content”
• Three components address the main needs of the Multilingual DSM (MDSM) and how to put them into practice:
  1. Multilingual Application Areas
  2. Multilingual Services
  3. Research


* SRIA V0.9 and MLV Programme devised before re-organisation of DG CONNECT.

Page 20: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Multilingual Digital Single Market

[Diagram with three layers: Multilingual Applications, Multilingual Services, Research. Labels: Automated Translation; E-Commerce; Content, Media, Verticals; Translation, Language, Knowledge, Data; Knowledge and Data Repositories; Crosslingual Big Data Language Analytics; Meaning, Semantics, Knowledge; High-Quality Machine Translation; Multimodal Interaction; Conversational Technologies; Language Processing, Analysis and Production – Language Resources. Stakeholders: Citizens, Public, Business; SMEs, CEF DSIs, IT Integrators, Research. Funding: H2020 RIAs; H2020 CSAs, IAs, RIAs; H2020 CSAs, RAs, national funding. Annotations: provide innovative applications; fills gaps; interoperable and standardised; collaboration with member states.]


MLV Programme

Page 21: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

MDSM: Goals and Needs

• Crosslingual communication for SMEs, public institutions, citizens
• Crosslingual SME presales communication and aftersales services
• Multilingual (big) data, language and knowledge value chains
• Multilingual websites, product catalogues, product descriptions
• Multilingual knowledge bases and knowledge graphs (and services)
• Multilingual conversational interfaces for connected devices (IoT)
• Crosslingual business intelligence (e.g., based on UGC)
• Crosslingual social media analytics for EU-wide societal issues
• Multilingual text and report generation (knowledge/data to text)
• All services must be domain-adaptable (no one size fits all)
• Translation Centre (Cloud) – HQ automated translation for all


Page 22: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Application Areas (Selection)

• Multilingual E-commerce
  - Customer-facing vs. back-office facing (after-market, after-sales)
  - Crosslingual search, CRM, helpdesks, processes, workflows
  - Semantic, crosslingual product descriptions and catalogues
  - Online dispute resolution
• Multilingual Content, Media, Verticals
  - Content analytics, curation, generation (incl. authoring support)
  - Multimodal communication (conversational, written, IoT)
  - Vertical domains: health, government, mobility, energy, legal.
• Translation, Language, Knowledge, Data
  - Translation Cloud – written/spoken, automatic/human
  - Crosslingual public and social intelligence, business intelligence
  - HQ resources, under-resourced languages, domain-specific LRs


Page 23: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Three Phases: 2018–2020

• Pre-MLV (2016/2017): stakeholder discussions and consensus building; finalisation of strategy and roadmap; selection and prioritisation of topics; small prototype projects for proofs of concept; CSAs for planning, coordination, support, community building.
• MLV Phase 1 (2018): first Multilingual Services; conceptualise Multilingual Applications; integration of services into applications; start and continue activities in the priority research themes.
• MLV Phase 2 (2019): extension of Multilingual Services (coverage, quality, precision); standardisation activities; deployment of first applications; business models; continue research activities.
• MLV Phase 3 (2020): extension of Multilingual Services (incl. standardisation); deployment of Multilingual Applications; transformation of projects into sustainable entities; continue research.
• Post-MLV (2021+): scaling up and extending Multilingual Applications and Multilingual Services; expanding language and domain coverage; going beyond Europe, penetrating other markets; exploration of novel research strands etc.


Page 24: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Setup – Timeframe

• Close collaboration with EC, EP and all other stakeholders (including SMEs, research centres, universities, NGOs etc.).
• Mix of funding sources:
  - Horizon 2020 (WP 2018-2020) for EU projects (RA, RIA, CSA)
  - National/regional funding sources:
    - Production of monolingual technologies and data sets
    - Support and grow SMEs in this area
• Timeframe 2018, 2019, 2020:
  - Includes set of mission-critical services and applications


Page 25: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Next Steps

a) extend Cracking the Language Barrier federation (BDVA?);
b) discuss MDSM SRIA V0.9 with BDVA: what are BDVA’s concrete requirements and priorities with regard to LT?
c) specify concrete set of research results and services; prioritise needed applications, services, research areas;
d) MDSM SRIA V1.0 to be finalised in Q1 2017.


Language Technology for Big Data Applications

Multilingual Digital Single Market

Page 26: Language Technologies for Big Data – A Strategic Agenda for the Multilingual Digital Single Market

Thank you for your attention.

[email protected]

http://www.meta-net.eu
http://www.facebook.com/META.Alliance
http://www.cracker-project.eu
http://www.cracking-the-language-barrier.eu
