Infrastructure crossroads... and the way we walked them in DKPro


Text and Data Mining in Europe: Defining the Challenges and Actions @ The Hague

Infrastructure crossroads

Richard Eckart de Castilho UKP LAB

Technische Universität Darmstadt

...and the way we walked them in DKPro


PRESENTER

Dr. Richard Eckart de Castilho

•  Interoperability WP lead @ OpenMinTeD
•  Technical Lead @ UKP
•  Java developer
•  Open source guy
•  NLP software infrastructure researcher
•  Apache UIMA developer
•  DKPro person

@i_am_rec https://github.com/reckart

Ubiquitous Knowledge Processing Lab Technische Universität Darmstadt


Ubiquitous Knowledge Processing Lab

•  Argumentation Mining
•  Language Technology for Digital Humanities
•  Lexical-Semantic Resources & Algorithms
•  Text Mining & Analytics
•  Writing Assistance and Language Learning

@UKPLab http://www.ukp.tu-darmstadt.de

Prof. Dr. Iryna Gurevych Technische Universität Darmstadt


DKPro – reuse, not reinvent

•  What?
   •  Collection of open-source projects related to NLP
   •  Community of communities
   •  Interoperability between projects
   •  Target group: programmers, researchers, application developers

•  Why?
   •  Flexibility and control – liberal licensing and redistributable software
   •  Sustainability – open community not bound to specific grants
   •  Replicability – portable software distributed through repositories
   •  Usability – takes the edge out of installation

•  Projects
   •  DKPro Core – linguistic preprocessing, interoperable third-party tools
   •  DKPro TC – text classification experimentation suite
   •  UBY – unified semantic resource
   •  CSniper – integrated search and annotation
   •  …

https://dkpro.github.io


… but why like this?
… how else could it be done?


Analytics stack

•  Analytics layer
   •  Analytics tools (tagger, parser, etc.)
•  Interoperability layer
   •  Input/output conversion
   •  Tool wrappers
   •  Pivot data model
•  Workflow layer
   •  Workflow descriptions
   •  Workflow engines
•  UI layer
   •  Workflow editors
   •  Annotation editors
   •  Exploration / visualization

Complete solution!
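The interoperability layer's tool wrappers and pivot data model can be sketched in a few lines. This is a deliberately simplified illustration, not DKPro Core's actual (UIMA-based) type system; the `Token` record and `wrapTaggerOutput` method are hypothetical names. The idea: every wrapped third-party tool converts its native output into one shared span-based representation, so downstream components never see tool-specific formats.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a pivot data model: wrapped tools read from and write to this
// shared representation instead of exchanging their native formats.
public class PivotModelSketch {

    // Hypothetical pivot annotation: a character span with a POS label.
    record Token(int begin, int end, String pos) {}

    // Wrapper for a fictitious tagger whose native output is "word/TAG" pairs:
    // it aligns each word with the original text and emits pivot annotations.
    static List<Token> wrapTaggerOutput(String text, String taggerOutput) {
        List<Token> tokens = new ArrayList<>();
        int cursor = 0;
        for (String pair : taggerOutput.split(" ")) {
            String[] parts = pair.split("/");
            int begin = text.indexOf(parts[0], cursor);
            int end = begin + parts[0].length();
            cursor = end;
            tokens.add(new Token(begin, end, parts[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This is a test";
        // Pretend a wrapped tagger produced this native output.
        for (Token t : wrapTaggerOutput(text, "This/DT is/VBZ a/DT test/NN")) {
            System.out.println(text.substring(t.begin(), t.end()) + " " + t.pos());
        }
    }
}
```

Because annotations are stand-off spans over the unmodified text, independently produced layers (tokens, parses, entities) can coexist on the same document – the property the interoperability layer depends on.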


Automatic text analysis

•  Pragmatic
   •  Gain insight about a particular field of interest
   •  Investigate data
   •  Use the latest data available
   •  Results relevant for the moment
   •  No need for reproducibility
•  Principled
   •  Interest in reproducibility
   •  Investigate methods
   •  Use a fixed data set
   •  Results should be reproducible


Manual text analysis

•  Pragmatic
   •  Collaborative analysis
   •  Get as much done as quickly as possible
   •  All see/edit the same data / annotations
   •  No means of measuring quality / single truth
•  Principled
   •  Training data for supervised machine learning
   •  Evaluation of automatic methods
   •  Distributed analysis
   •  Guideline-driven process
   •  Multiple independent analyses/annotations
   •  Inter-annotator agreement as quality indicator
•  Human in the loop
   •  Analytics make suggestions / guide the human
   •  Human input guides the analytics

Human / Machine


Deployment

•  Distributed / static
   •  Service oriented
   •  High network traffic
   •  Running cost
   •  Risk of decay / limited availability of older versions
   •  More control to providers
•  Localized / dynamic
   •  Cloud computing
   •  Reduced cost
   •  Data locality
   •  Scalability
   •  Large freedom in choosing a version
   •  More control to users
•  Gateways
   •  Make a dynamic setup appear static
   •  Handle input/output and workflow management
   •  Walled garden vs. convenience

[Diagram: software repository feeding dynamic deployments, fronted by gateways]
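The gateway idea – making a dynamically deployed component appear as a stable service endpoint – can be sketched with the JDK's built-in HTTP server. This is purely illustrative (the `/analyze` route and the trivial `analyze` stand-in are invented for the sketch, not any actual OpenMinTeD or DKPro gateway): the client always talks to one fixed URL, while the component behind it can be deployed, replaced, or scaled freely.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a gateway: a static HTTP endpoint in front of a dynamically
// deployed analytics component.
public class GatewaySketch {

    // Stand-in for the dynamically deployed component; in a real setup this
    // would dispatch to whatever analytics version is currently deployed.
    static String analyze(String text) {
        return "tokens=" + text.trim().split("\\s+").length;
    }

    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/analyze", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            byte[] result = analyze(new String(body, StandardCharsets.UTF_8))
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, result.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(result);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0); // port 0 = pick any free port
        System.out.println("Gateway listening on port "
                + server.getAddress().getPort());
        server.stop(0);
    }
}
```

The trade-off from the slide shows up directly: clients get convenience (one stable address, no deployment knowledge needed), but the gateway operator controls which analytics are reachable – the "walled garden".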


“Openness”

•  Open
   •  Liberal licensing
   •  Freedom to choose deployment
   •  Integrate custom resources/analytics
   •  Control to the user
•  Not open / closed
   •  Copyleft/proprietary licensing
   •  Prescribed deployment
   •  Difficult to customize for the user
   •  Control to the provider


A peek at the landscape

Service-based
•  ARGO*
   •  Pipeline builder, annotation editor
   •  Online platform accessible through gateway
   •  Internally dynamic deployment (afaik)
   •  Closed source
•  WebLicht / Alveo / LAPPS
   •  Pipeline builder
   •  Online platform accessible through gateway
   •  Many services distributed over multiple locations/stakeholders
   •  Some offer access to non-public content/analytics
   •  Some are partially open source

Software-based
•  DKPro Core* / ClearTK
   •  Component collection
   •  Pipeline scripting / programming
   •  Repository-based
   •  Easy to deploy/embed anywhere
   •  Open source
•  GATE workbench*
   •  Pipeline builder, annotation editor, +++
   •  Desktop application
   •  GATE Cloud
   •  Open source
•  …



DKPro Core – Runnable example

#!/usr/bin/env groovy
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl',
      version='1.5.0')

import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;

def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";

SimplePipeline.runPipeline(jcas,
    createEngineDescription(OpenNlpSegmenter),
    createEngineDescription(OpenNlpPosTagger),
    createEngineDescription(OpenNlpParser,
        OpenNlpParser.PARAM_WRITE_PENN_TREE, true));

select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }
select(jcas, PennTree).each { println it.pennTree }

@Grab fetches all required dependencies – no manual installation! The analytics pipeline fetches language-specific resources automatically.

Why is this cool?

•  This is an actual running example!
•  Requires only a JVM + Groovy (+ Internet connection)
•  Easy to parallelize / scale
•  Trivial to embed in applications
•  Trivial to wrap as a service


Conclusion / Challenges

•  Data is growing / analytics get more complex
   •  Need more powerful systems to process it
•  Human in the loop
   •  Human interaction influences analytics and vice versa
•  Need to move data and analytics around
   •  Often conflicts with interest in protection of investment
•  Need interoperability
   •  To discover data, resources, and analytics
   •  To access data and resources
   •  To deploy analytics
   •  To retrieve and further use results


What comes next?


Tomorrow @ The Hague: interoperability

[Diagram: interoperability workflow – a data source feeds through data conversion into analysis steps (automatic analysis components, nested workflows, human annotation/correction) and on to a data sink, with provenance recorded (ID/version in, new ID/version out). Resource repositories (auxiliary data), software repositories, and analysis services are accessed via APIs. Deployment targets range from desktop/server to cloud and cluster (portability / scalability / sustainability), and rights and restrictions are aggregated along the way. The parts map onto working groups WG1–WG4.]


Thanks


References

•  Alveo – http://alveo.edu.au/
•  Argo – http://argo.nactem.ac.uk
•  ClearTK – http://cleartk.github.io/cleartk/
•  DKPro – https://dkpro.github.io
•  GATE – https://gate.ac.uk
•  LAPPS – http://www.lappsgrid.org
•  UIMA – http://uima.apache.org
•  WebLicht – https://weblicht.sfs.uni-tuebingen.de/