
Infrastructure crossroads... and the way we walked them in DKPro


Text and Data Mining in Europe: Defining the Challenges and Actions @ The Hague

Infrastructure crossroads

Richard Eckart de Castilho UKP LAB

Technische Universität Darmstadt

...and the way we walked them in DKPro


PRESENTER

Dr. Richard Eckart de Castilho

•  Interoperability WP lead @ OpenMinTeD
•  Technical Lead @ UKP
•  Java developer
•  Open source guy
•  NLP software infrastructure researcher
•  Apache UIMA developer
•  DKPro person

@i_am_rec https://github.com/reckart

Ubiquitous Knowledge Processing Lab Technische Universität Darmstadt


Ubiquitous Knowledge Processing Lab

•  Argumentation Mining
•  Language Technology for Digital Humanities
•  Lexical-Semantic Resources & Algorithms
•  Text Mining & Analytics
•  Writing Assistance and Language Learning

@UKPLab http://www.ukp.tu-darmstadt.de

Prof. Dr. Iryna Gurevych Technische Universität Darmstadt


DKPro – reuse, not reinvent

•  What?
   •  Collection of open-source projects related to NLP
   •  Community of communities
   •  Interoperability between projects
   •  Target group: programmers, researchers, application developers

•  Why?
   •  Flexibility and control – liberal licensing and redistributable software
   •  Sustainability – open community not bound to specific grants
   •  Replicability – portable software distributed through repositories
   •  Usability – takes the edge out of installation

•  Projects
   •  DKPro Core – linguistic preprocessing, interoperable third-party tools
   •  DKPro TC – text classification experimentation suite
   •  UBY – unified semantic resource
   •  CSniper – integrated search and annotation
   •  …

https://dkpro.github.io


… but why like this? … how else could it be done?


Analytics stack

•  Analytics layer
   •  Analytics tools (tagger, parser, etc.)

•  Interoperability layer
   •  Input/output conversion
   •  Tool wrappers
   •  Pivot data model

•  Workflow layer
   •  Workflow descriptions
   •  Workflow engines

•  UI layer
   •  Workflow editors
   •  Annotation editors
   •  Exploration / visualization

All four layers together form a complete solution.
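The interoperability layer's pivot data model can be sketched in a few lines: every tool's output is converted into one shared stand-off annotation format, so downstream components never need to know which tool produced it. This is a minimal illustrative sketch, not the actual UIMA/DKPro types; all names here are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class PivotModelSketch {

    // One stand-off annotation over the document text (illustrative type).
    record Annotation(int begin, int end, String type, String value) {}

    // The pivot document: raw text plus a flat list of annotations.
    static class Document {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();
        Document(String text) { this.text = text; }
    }

    // A tool wrapper maps tool-specific output into the pivot model.
    interface ToolWrapper {
        void process(Document doc);
    }

    // Toy wrapper standing in for a real tokenizer.
    static class WhitespaceTokenizer implements ToolWrapper {
        public void process(Document doc) {
            int pos = 0;
            for (String tok : doc.text.split(" ")) {
                int begin = doc.text.indexOf(tok, pos);
                doc.annotations.add(
                    new Annotation(begin, begin + tok.length(), "Token", tok));
                pos = begin + tok.length();
            }
        }
    }

    public static void main(String[] args) {
        Document doc = new Document("This is a test");
        new WhitespaceTokenizer().process(doc);
        // Any later component reads the same Annotation list, regardless
        // of which wrapped tool created it.
        doc.annotations.forEach(a ->
            System.out.println(a.type() + " [" + a.begin() + "," + a.end() + "] " + a.value()));
    }
}
```
The point of the design: wrappers absorb tool-specific formats at the edge, so workflow and UI layers only ever see the pivot model.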


Automatic text analysis

•  Pragmatic
   •  Gain insight about a particular field of interest
   •  Investigate data
   •  Use the latest data available
   •  Results relevant for the moment
   •  No need for reproducibility

•  Principled
   •  Interest in reproducibility
   •  Investigate methods
   •  Use a fixed data set
   •  Results should be reproducible


Manual text analysis

•  Pragmatic
   •  Collaborative analysis
   •  Get as much done as quickly as possible
   •  All see/edit the same data / annotations
   •  No means of measuring quality / no single truth

•  Principled
   •  Training data for supervised machine learning
   •  Evaluation of automatic methods
   •  Distributed analysis
   •  Guideline-driven process
   •  Multiple independent analyses/annotations
   •  Inter-annotator agreement as quality indicator

•  Human in the loop
   •  Analytics make suggestions / guide the human
   •  Human input guides the analytics
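Inter-annotator agreement as a quality indicator can be made concrete with Cohen's kappa, one common agreement measure for two annotators labeling the same items. A minimal sketch (the labels and annotator data below are invented for illustration):

```java
import java.util.List;

public class KappaSketch {

    // a and b are the two annotators' labels for the same items, in order.
    static double cohensKappa(List<String> a, List<String> b) {
        int n = a.size();
        // Observed agreement: fraction of items both labeled identically.
        double po = 0;
        for (int i = 0; i < n; i++) {
            if (a.get(i).equals(b.get(i))) po++;
        }
        po /= n;
        // Expected chance agreement, from each annotator's label distribution.
        double pe = 0;
        for (String label : a.stream().distinct().toList()) {
            double pa = a.stream().filter(label::equals).count() / (double) n;
            double pb = b.stream().filter(label::equals).count() / (double) n;
            pe += pa * pb;
        }
        // Kappa corrects observed agreement for agreement expected by chance.
        return (po - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        List<String> ann1 = List.of("POS", "POS", "NEG", "NEG", "POS", "NEG");
        List<String> ann2 = List.of("POS", "NEG", "NEG", "NEG", "POS", "NEG");
        System.out.printf("kappa = %.3f%n", cohensKappa(ann1, ann2)); // prints: kappa = 0.667
    }
}
```
Kappa of 1 means perfect agreement, 0 means agreement no better than chance; low kappa signals that the guidelines or the task definition need work before the annotations can serve as training or evaluation data.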


Deployment

•  Distributed / static
   •  Service-oriented
   •  High network traffic
   •  Running cost
   •  Risk of decay / limited availability of older versions
   •  More control to providers

•  Localized / dynamic
   •  Cloud computing
   •  Reduced cost
   •  Data locality
   •  Scalability
   •  Large freedom in choosing a version
   •  More control to users

•  Gateways
   •  Make a dynamic setup appear static
   •  Handle input/output and workflow management
   •  Walled garden vs. convenience


“Openness”

•  Open
   •  Liberal licensing
   •  Freedom to choose deployment
   •  Integrate custom resources/analytics
   •  Control to the user

•  Not open / closed
   •  Copyleft/proprietary licensing
   •  Prescribed deployment
   •  Difficult for the user to customize
   •  Control to the provider


A peek at the landscape

Service-based
•  ARGO*
   •  Pipeline builder, annotation editor
   •  Online platform accessible through a gateway
   •  Internally dynamic deployment (afaik)
   •  Closed source
•  WebLicht / Alveo / LAPPS
   •  Pipeline builder
   •  Online platform accessible through a gateway
   •  Many services distributed over multiple locations/stakeholders
   •  Some offer access to non-public content/analytics
   •  Some are partially open source

Software-based
•  DKPro Core* / ClearTK
   •  Component collection
   •  Pipeline scripting / programming
   •  Repository-based
   •  Easy to deploy/embed anywhere
   •  Open source
•  GATE workbench*
   •  Pipeline builder, annotation editor, +++
   •  Desktop application
   •  GATE Cloud
   •  Open source
•  …


DKPro Core – runnable example

#!/usr/bin/env groovy
// Fetches all required dependencies – no manual installation!
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl',
      version='1.5.0')
import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;

// Input
def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";

// Analytics pipeline – language-specific resources fetched automatically
SimplePipeline.runPipeline(jcas,
    createEngineDescription(OpenNlpSegmenter),
    createEngineDescription(OpenNlpPosTagger),
    createEngineDescription(OpenNlpParser,
        OpenNlpParser.PARAM_WRITE_PENN_TREE, true));

// Output
select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }
select(jcas, PennTree).each { println it.pennTree }



Why is this cool?

•  This is an actual running example!
•  Requires only a JVM + Groovy (+ internet connection)
•  Easy to parallelize / scale
•  Trivial to embed in applications
•  Trivial to wrap as a service
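The "easy to parallelize" claim follows from each document being processed independently: a repository-based pipeline can simply be fanned out over documents. A minimal Java sketch, where `analyze()` is a stand-in for running the actual per-document pipeline (the real DKPro call would go there; everything here is illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelSketch {

    // Stub for a per-document analysis step (here: a simple token count).
    static int analyze(String document) {
        return document.split("\\s+").length;
    }

    static Map<String, Integer> analyzeAll(List<String> docs) {
        // parallelStream() distributes documents across worker threads;
        // documents share no state, so no synchronization is needed.
        return docs.parallelStream()
                   .collect(Collectors.toMap(d -> d, ParallelSketch::analyze));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("This is a test", "Another short document");
        System.out.println(analyzeAll(corpus));
    }
}
```
The same embarrassingly parallel structure is what lets the pipeline scale out to a cluster: swap the parallel stream for a job per document batch.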


Conclusion / challenges

•  Data is growing / analytics get more complex
   •  Need more powerful systems to process it

•  Human in the loop
   •  Human interaction influences analytics, and vice versa

•  Need to move data and analytics around
   •  Often conflicts with the interest in protecting investments

•  Need interoperability
   •  To discover data, resources, and analytics
   •  To access data and resources
   •  To deploy analytics
   •  To retrieve and further use results


What comes next?


Tomorrow @ The Hague: interoperability

[Diagram: an interoperability architecture sketch – data sources and sinks; data conversion; analysis steps (automatic components, nested workflows, human annotation/correction); resource and software repositories with IDs/versions, including new IDs/versions for results; APIs connecting all parts; provenance tracking; rights and restrictions aggregation; and deployment on desktop, server, cloud, or cluster for portability, scalability, and sustainability. Working groups WG1–WG4 map onto these areas.]


Thanks


References

•  Alveo – http://alveo.edu.au/
•  ARGO – http://argo.nactem.ac.uk
•  ClearTK – http://cleartk.github.io/cleartk/
•  DKPro – https://dkpro.github.io
•  GATE – https://gate.ac.uk
•  LAPPS – http://www.lappsgrid.org
•  Apache UIMA – http://uima.apache.org
•  WebLicht – https://weblicht.sfs.uni-tuebingen.de/