D2.8 EUDAT service compatibility

Document information
Title: EUDAT service compatibility
ID: CLARINPLUS-D2.8 (CE-2017-1034)
Author(s): Claus Zinn
Responsible WP leader: Erhard Hinrichs
Contractual Delivery Date: 2017-06-01
Actual Delivery Date:
Distribution: Public
Document status in workplan: Deliverable

Project information
Project name: CLARIN-PLUS
Project number: 676529
Call: H2020-INFRADEV-1-2015-1
Duration: 2015-09-01 – 2017-08-31
Website: www.clarin.eu
Contact address: [email protected]


Table of contents
1 Executive Summary
2 Introduction
3 Background
  3.1 Generic Execution Framework (GEF)
  3.2 WebLicht
4 WebLicht – GEF Integration Scenarios
  4.1 GEF Integration – Front-end Adaptation
  4.2 GEF Integration: Back-end Implications
    4.2.1 Making the Cut through the Tool Space
    4.2.2 Transporting data through the workflow
  4.3 Liaising between WebLicht and the GEF
5 GEF Perspective
6 Conclusion
References


1 Executive Summary

The WebLicht software suite is one of the most robust and well-known workflow engines in the CLARIN infrastructure. It allows users to execute predefined workflows for common processing tasks and to define new workflows with the help of the WebLicht orchestrator. The services of a given workflow, however, are distributed geographically, and hence a significant amount of data (input data and result data originating from intermediate processing stages) is transferred between WebLicht and the individual services. The distributed nature of the WebLicht infrastructure makes it hard, if not impossible, to use WebLicht on big data sets, or on data sets with restrictive property rights attached to them. In these cases, data transfer becomes very costly, or is ruled out altogether because sensitive data may not be allowed to leave the data’s home institution.

The EUDAT project is currently developing the Generic Execution Framework (GEF). The GEF aims at providing an infrastructure that permits the execution of generic workflows in a computing environment located close to the data (at the data’s host institution). Such an execution environment would address the problems of transferring large amounts of data, or data with restrictive property and privacy rights.

In this deliverable, we report on the progress made in bringing together the CLARIN-based WebLicht workflow engine and the EUDAT-based Generic Execution Framework. Given the current and evolving state of the GEF, our work has to be seen as a starting point. While we attempt to describe various possible integration scenarios at a good level of detail, we need to emphasize that subsequent implementations may need to deviate from our technical specification.

The potential usage of WebLicht within a GEF environment has already had a positive impact on WebLicht’s design rationale, in particular with respect to organising the transfer of data between the services of any given workflow. Our input has also benefited the GEF development team, yielding concrete GEF usage scenarios and defining specific user requirements.


2 Introduction

This deliverable reports on progress on task 2.3.3 of work package 2 of the CLARIN-PLUS project. In brief, the objectives for this subtask are to address two challenges that affect the applicability of workflows to some data sets. First, restrictive property rights may forbid research data from leaving their home institution, so that transferring the data to other institutions for processing is not allowed. The second issue concerns the size of the data, with big data often causing prohibitive overhead once it is necessary to send such data back and forth to the various tools of a scientific tool pipeline. In both cases, it is desirable to bring the workflow engine to the data, rather than having the data travel to the tools.

The EUDAT project is currently developing the Generic Execution Framework (GEF). The GEF aims at providing a framework that allows the execution of scientific workflows in a computing environment close to the data. In this deliverable, we explore a number of scenarios and technical adaptations to WebLicht to render its services compatible with the GEF.

The remainder of the deliverable is structured as follows. In Section 3, we give technical background on the Generic Execution Framework and WebLicht. In Section 4, we specify a number of scenarios that describe the adaptation of WebLicht towards a GEF integration. This includes user-centric front-end considerations as well as necessary adaptations on the WebLicht back-end and the processing services connected to WebLicht, which need to be converted to GEF-compatible web services. An integral part is a translation mechanism that liaises between the WebLicht workflow engine and the GEF environment, and which ensures that all data transfers occur at the GEF’s host institution, which also hosts all data. In Section 5, we review the service compatibility of WebLicht from the point of view of a GEF administrator. In Section 6, we conclude.


3 Background

3.1 Generic Execution Framework (GEF)

One underlying motivation for the GEF is that data sets have become much larger than the tools that process them. It is thus more efficient to move the tools to the data rather than the data to the tools, provided that the tools do not require a large amount of processing power. This argument is often reinforced by restrictive property or privacy rights attached to the data; here, data transfer to tools not under the control of the property holder is often not possible. Also, the transfer of big data sets across the tools of a tool chain often leads to bottlenecks, and transfer time is sometimes higher than the actual processing time.

The GEF aims at a framework that allows developers to pack tools (and their computation) into containers that can be moved towards the data. GEF makes use of Docker, a virtualization mechanism that wraps an application into a container that encapsulates and sandboxes all computation. Following the explanations at https://www.docker.com/what-docker, a container image is a “lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.” Docker-based technology performs very well, with container images that can be small in size and are fast to start. Container images are immutable by design and relatively easy to create, using Docker scripts and pre-built container images to build upon.

Figure 1 shows the design rationale of GEF. Docker software runs on the operating system of the host, on the host’s hardware, and is able to execute Docker containers (holding the tools’ virtualization).

Figure 1. The GEF architecture.

GEF is built upon Docker, allowing community admin users to upload their tools as Dockerfiles (from which Docker container images are generated), or directly as Docker container images. Once uploaded, the Docker container image becomes a GEF service and can be invoked on any EUDAT data set.

The GEF source code is available at GitHub, see https://github.com/EUDAT-GEF/GEF. Its back-end is built upon the Go programming language (https://golang.org), and its front-end is built upon the React JavaScript library for building user interfaces (https://facebook.github.io/react/).

There is no official software release of GEF available yet. Also, at the time of writing, documentation is rather sparse. The EUDAT deliverable from 2015 outlines some of GEF’s design rationale (Dima, Pagé, and Budich, 2015). For general information about EUDAT’s Collaborative Data Infrastructure (CDI), see https://www.eudat.eu/eudat-cdi/about.


It is easy to install a GEF environment from the GitHub repository. Figure 2 shows the front-end of the GEF. In the left pane, there are three items: “Build a service”, “Browse Services” and “Browse Jobs”. When choosing the first item, the GEF admin (“power user”) can build a new GEF service, see (1). For this, the user needs to upload a Dockerfile, together with other files which should be part of the container. With the file upload complete, GEF’s underlying Docker software will build a Docker-based container image, which the GEF will use to provide the corresponding GEF service of the application.

Figure 2: Main page of the GEF front-end.

The standard distribution of the GEF comes with a number of predefined Dockerfiles that can be used to test and play with the GEF. Figure 3 shows the Dockerfile for the Stanford Parser for English. It builds a Docker container based on the operating system Ubuntu, version 16.04. It updates the system’s list of packages using Ubuntu’s “apt-get” package manager, installs the software packages for ‘curl’ and ‘unzip’, and then uses this software to fetch the Stanford parser from a given location (the parser is precompiled and can be started with ‘lex-parser.sh’). Note the number of LABEL instructions in the Dockerfile that fill in some of the metadata for the GEF-enabled parser service. In particular, they specify the input (/root/input) and output (/root/output) service locations on the file system.


Figure 3: A Dockerfile for the Stanford Parser for English

Figure 2 shows that we have built two services from the GEF distribution; clicking on (2) shows all available services: the “Stanford Parser for English” and the “NLTK POS-tagging” software. A GEF-enabled service can be run by supplying a Persistent Identifier (PID) or URL that points to the input to be processed. Hitting the “Submit” button, see (4), will start the service asynchronously. All running jobs can be inspected via (3). Once the job terminates, its output volume holds the result of the computation.

At the time of writing, GEF services can only deal with a single input parameter, the input volume, and a single output parameter, the output volume. Also, note that the current version of the GEF is not capable of executing workflows. To execute a workflow, it must be orchestrated by an external engine, which makes reference to a GEF-based repository of (immutable) application containers, each of which can be referenced by a PID. In the rest of this section, we will describe how the WebLicht orchestrator can be used to execute a GEF-based WebLicht workflow; here each processing tool of the chain is hosted on a GEF environment. Clearly, we will also need to flesh out how any data transfer between the GEF-external WebLicht orchestrator and the GEF-internal services is handled.

3.2 WebLicht

WebLicht is a workflow engine giving users web-based access to over fifty tools for analysing digital texts (Hinrichs, Hinrichs, and Zastrow 2010). Its pipelining engine offers predefined workflows and supports users in configuring their own. With WebLicht, users can analyse texts at different levels such as morphology analysis, tokenization, part-of-speech tagging, lemmatization, dependency parsing, constituent parsing, co-occurrence analysis and word frequency analysis, supporting mainly German, English, and Dutch. Note that WebLicht does not implement any of the tools itself but mediates their use via pre-defined as well as user-configurable process pipelines. These workflows schedule the succession of tools so that one tool is called after another to achieve a given task, for instance, named entity recognition.

WebLicht is a good step forward in increasing (web-based) tool access and usability, as its TCF format mediates between the various input and output formats the tools require, and as it calls the tools (hosted on many different servers located nation- and world-wide) without requiring much user engagement. WebLicht has now been used for many years in the linguistics community; it is actively maintained, profits from regular tool updates and new tool integrations, and has recently been integrated with TüNDRA, a treebank search and visualization tool that allows WebLicht users to inspect linguistically annotated data (Chernov, Hinrichs, and Hinrichs 2017). With TüNDRA, users can search for specific linguistic phenomena at the word and sentence level, and visualize such phenomena.

Development of WebLicht started in October 2008 as part of the D-SPIN project (https://weblicht.sfs.uni-tuebingen.de/englisch/index.shtml); in the last eight years, it has accumulated a solid user base and is the workflow engine of choice for many national and European researchers in the CLARIN context.

WebLicht as a Service. WebLicht as a Service (WaaS) is a REST service that executes WebLicht chains.[1] Unlike the WebLicht web application, WaaS does not require a browser, and hence avoids browser-specific issues such as session timeouts and file size limits. With WaaS, users can run chains from their UNIX shell, scripts, or programs. Once users have defined a chain in the WebLicht browser interface, they can download the chain, and then execute an HTTP POST request with the multipart/form-data encoding to invoke WaaS with the chain in question and the input data, e.g., by executing the following curl command:[2]

curl -X POST -F [email protected] -F content=@inputFile -F apikey=apiKey URL > result
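The same request can also be prepared from a program. The following Python sketch builds, but does not send, a multipart/form-data POST request using only the standard library. The field names `content` and `apikey` follow the curl command; the field name `chains` and the file names `chain.xml` and `input.txt` are assumptions made for illustration, and the helper names are hypothetical.

```python
import urllib.request
import uuid

# Endpoint as given in the footnote on the WaaS URL.
WAAS_URL = "https://weblicht.sfs.uni-tuebingen.de/WaaS/api/1.0/chain/process"


def encode_multipart(fields, files, boundary=None):
    """Encode text fields and file uploads as multipart/form-data.

    fields: dict of name -> str; files: dict of name -> (filename, bytes).
    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    for name, (filename, data) in files.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
            + data + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_waas_request(chain_xml, input_data, api_key, url=WAAS_URL):
    """Build (but do not send) the POST request that invokes WaaS."""
    body, content_type = encode_multipart(
        fields={"apikey": api_key},
        files={
            "chains": ("chain.xml", chain_xml),    # field name is an assumption
            "content": ("input.txt", input_data),  # field name from the curl command
        },
    )
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

Sending the request would then be a matter of passing the returned object to `urllib.request.urlopen` and reading the response body, which corresponds to redirecting the curl output to `result`.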

WebLicht Harvesting of Tool Metadata. At regular intervals, WebLicht harvests tool metadata from tool providers (as ingested into the metadata repositories of the centre registries). The description of a WebLicht tool is based on the CMDI Profile WebLichtServiceProfile.[3] Among other information, it provides information about the tool’s URL and its input and output parameters. Each WebLicht tool has a persistent identifier that refers to its CMDI description. The WebLicht orchestrator uses the tools’ metadata to help users define workflows.

Workflow definition and execution. WebLicht offers users two modes: an “Easy Mode” and an “Advanced Mode”. The first mode offers a dozen predefined easy-chains that each define a tool workflow to perform a complex textual analysis. The second mode supports users in defining their own pipelines. Each tool in the workflow is identified by a persistent identifier that WebLicht’s execution system uses to invoke the tool. With each tool invocation, WebLicht passes on a TCF-compliant data file that contains the tool’s input. The tool processes selected parts of the input, and enriches the TCF file with the output of the computation, which is then sent back to the WebLicht orchestrator. WebLicht then passes the enriched TCF file to the next tool of the workflow until all tools in the workflow have been executed in the given order.

Figure 4 shows the data flow between the WebLicht orchestrator and the tools. Note that each tool is invoked via the Hypertext Transfer Protocol (HTTP) using POST methods, with the tools’ output data being captured in the body of the response. This design decision makes it hard to use WebLicht on huge data sets, or on data that cannot leave the host institution because of property rights or privacy concerns.
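The execution scheme just described can be illustrated with a small Python sketch. It is a simplified model rather than WebLicht’s actual implementation: tools are plain functions standing in for remote services, the TCF document is an opaque byte string, and all names are hypothetical. The counter makes visible that every tool costs two data transfers, one for the request and one for the response.

```python
from typing import Callable, List, Tuple

# A tool maps a TCF document to an enriched TCF document.
Tool = Callable[[bytes], bytes]


def run_chain(tools: List[Tool], tcf: bytes) -> Tuple[bytes, int]:
    """Run the tools in the given order and count the data transfers."""
    transfers = 0
    for tool in tools:
        transfers += 1   # orchestrator -> tool (body of the POST request)
        tcf = tool(tcf)  # the tool enriches the TCF document
        transfers += 1   # tool -> orchestrator (body of the response)
    return tcf, transfers


# Stand-ins for remote services; real tools would be invoked over HTTP.
def tokenizer(tcf: bytes) -> bytes:
    return tcf + b"[tokens]"


def pos_tagger(tcf: bytes) -> bytes:
    return tcf + b"[pos]"


def named_entities(tcf: bytes) -> bytes:
    return tcf + b"[ne]"
```

Calling `run_chain([tokenizer, pos_tagger, named_entities], b"text")` passes the growing document through all three stand-in tools and reports six transfers, which is where the cost for large inputs comes from.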

[1] https://weblicht.sfs.uni-tuebingen.de/WaaS/
[2] Note that URL points to https://weblicht.sfs.uni-tuebingen.de/WaaS/api/1.0/chain/process and that an API key to access the service needs to be obtained as well.
[3] The CMDI XSD definition is available in the CLARIN component registry, available at https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1320657629644/xsd.


Figure 4. Data Flow in WebLicht.

In the GEF mindset, WebLicht’s tools need to be “migrated” to a GEF environment that encapsulates all computation. The GEF environment is located close to the data, that is, placed on the same computer network that also hosts the data. WebLicht then calls GEF-ified services rather than the original tools. To keep all data transfer local to the GEF environment that also hosts the data, the way the WebLicht orchestrator invokes the services needs to be changed as well.

In the next section, we outline a number of possible scenarios to make WebLicht service-compatible with the GEF framework.[4]

[4] WebLicht is not a web service but a web-based/browser-based tool that provides users with a user interface to specify the input data, select a predefined easy-chain, or in advanced mode, define a new workflow. It also visualises the results of the pipeline, including information about the individual stages. At the current development stage of GEF, WebLicht cannot be “GEF-ified”, but “WebLicht As a Service” could be GEF-ified.


4 WebLicht – GEF Integration Scenarios

A user with big data, or with data carrying restrictive property rights, would like to inform WebLicht that a processing workflow should only include tools that run at his or her home institution. This assumes, however, that the given home institution has set up a GEF environment that hosts all the “relevant” web services that the WebLicht users want to employ.

4.1 GEF Integration – Front-end Adaptation

There are two possible approaches by which users could specify GEF-related information; only the first requires a change to WebLicht’s user interface.

Figure 5. Sketch of WebLicht GEF extension (proposal)

Figure 5 depicts a proposed UI change to WebLicht’s entry page. In addition to the “Input Selection” pane, there is a “Tool Location” pane, where users can select the tool environment that the WebLicht orchestrator should use. The environment “GLOBAL – all tools” is the default tool environment where WebLicht assumes traditional operation (all tools are eligible for inclusion in the workflow). When the user selects the environment “GEF@sfs_tuebingen”, the WebLicht orchestrator will only make use of workflows whose tools are running in the GEF environment at the Seminar für Sprachwissenschaft (SfS) in Tübingen.[5]

Note, however, that a WebLicht–GEF integration need not materialise on the front-end. In “Easy-Mode”, WebLicht might offer predefined easy-chains that run entirely in a given GEF environment; those easy-chains are merely named appropriately, say “Named Entities (GEF@sfs_tuebingen)”. Users will then select the GEF environment by selecting the appropriately named easy-chain.

In WebLicht, it is already the case that applicable tools are often prefixed by their location, e.g., “SfS: To TCF Converter”, “SfS: Stanford Tokenizer”, or “SfS: Illinois Named Entity”. Here, GEF-ified variants of the tools will need to be named accordingly, say, “GEF_sfs_tuebingen: To TCF Converter”. In “Advanced Mode”, users can construct a workflow that suits their computing requirements, say, by selecting only tools that are part of a given GEF environment as indicated by the tool’s name. The usability of this approach, however, decreases as the number of GEF environments grows, for instance, when a user has to choose among a dozen different instances of “To TCF converter”, all running in different GEF environments. Here, the aforementioned front-end adaptation for WebLicht seems more suitable.[6]

[5] Sometimes, a user might be happy to have partial GEF support, that is, a user preference for tools in a given GEF environment. If WebLicht can identify a service available in the user’s preferred GEF environment, then WebLicht would use it; otherwise WebLicht would use a service at a different location.

4.2 GEF Integration: Back-end Implications

The back-end adaptations to WebLicht are more substantial. Two changes to WebLicht’s back-end are discussed: how should the orchestrator organise its tool space; and how should the orchestrator call the individual tools to minimize data transfers?

4.2.1 Making the Cut through the Tool Space

The WebLicht orchestrator works on the entire tool space that results from regularly harvesting tools’ metadata from tool providers. Here, we assume that providers of GEF-ified tools advertise them in the same way as their non-GEF-ified equivalents: their tools’ metadata is entered in the metadata repository of the respective CLARIN centre repository so that WebLicht can harvest these tools as well.

For the following, we assume that the WebLicht user has specified a GEF environment of her choice (Figure 5). Now, the user can choose between two modes, see Figure 6.

Figure 6. WebLicht: Mode Choice.

In Easy Mode, the WebLicht orchestrator needs to identify which predefined workflows are available in the GEF organisation specified by the user. Here, a workflow is available in a GEF environment if each tool of the workflow is available in the GEF environment.

In Advanced Mode, the WebLicht orchestrator makes a cut through the tool space gathered from the harvester. All tools not originating in the given GEF environment can be ignored; only tools that are part of a given GEF environment are considered for workflow construction.

[6] The WebLicht orchestrator could, however, take the partially constructed workflow into account; it can identify and present services first that are in the same GEF environment as those services already part of the workflow.
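The selection logic for both modes can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of the orchestrator’s code: the `Tool` record, its `environment` field, and the representation of a workflow as a list of tool PIDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tool:
    pid: str          # persistent identifier of the tool's metadata record
    name: str
    environment: str  # hypothetical field, e.g. "GLOBAL" or "GEF@sfs_tuebingen"


def cut_tool_space(tools: List[Tool], environment: str) -> List[Tool]:
    """Advanced Mode: keep only tools that are part of the chosen GEF environment."""
    return [tool for tool in tools if tool.environment == environment]


def available_workflows(workflows: Dict[str, List[str]],
                        tools: List[Tool],
                        environment: str) -> List[str]:
    """Easy Mode: a workflow is available if each of its tools is in the environment."""
    local_pids = {tool.pid for tool in cut_tool_space(tools, environment)}
    return [name for name, pids in workflows.items()
            if all(pid in local_pids for pid in pids)]
```

With three harvested tools of which only two live in “GEF@sfs_tuebingen”, a workflow that needs all three is filtered out in Easy Mode, while Advanced Mode offers just the two local tools for workflow construction.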

4.2.2 Transporting data through the workflow

Figure 4 displays the data flow in the current version of WebLicht. It shows the back-and-forth movement of data between the WebLicht orchestrator and the tools it is executing. If a given workflow consists of n tools, then the data is transferred 2n times between the orchestrator and the tools, potentially across networks of varying bandwidth and latency.

With data required to stay in the GEF environment, there are two options, given that all tools of a workflow live in the same GEF environment:

1. The WebLicht orchestrator becomes a GEF-ified variant of WaaS, that is, WebLicht as a Service also becomes a service running in the GEF environment.

2. The interplay between the orchestrator and the tools changes; rather than passing along actual data, references to data are now passed, with all references originating in the GEF environment.

Ad 1. The first option does not affect the amount of data transfer between the players, but makes the data transfer less of a bottleneck given that all tools of a given GEF environment have access to the same high-bandwidth, low-latency network. On the other hand, the first option is less flexible than the second, as WebLicht chains must be known and defined in advance (so that the WaaS can be called with the chain in question), but this might be acceptable for most GEF use cases.

Ad 2. The second option requires either changing the tools connected to WebLicht so that they now accept references to data rather than the data, or building wrappers around the tools that show this behaviour. Here, a wrapper approach seems to be the better choice, as it does not require interaction with and contribution from tool developers.

Step 1. Figure 7 shows the wrapping of a tool (left) resulting in a wrapped tool (right). The wrapped tool: (i) retrieves the input data from a given URL, (ii) starts the tool it wraps by posting the input data to the tool via HTTP, (iii) accepts the response of the HTTP request, (iv) stores the output data locally (no transfer), and (v) returns this location as a corresponding URL. With the workflow consisting only of tools with such wrappers, no data is sent back and forth to WebLicht, but rather URL-based pointers to the data.

Figure 7: Wrapping a Tool.
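Steps (i) to (v) can be sketched as a higher-order function in Python. The sketch makes no claim about how an actual wrapper is implemented: retrieval, tool invocation and storage are injected as functions, and the in-memory `LocalStore` class is a hypothetical stand-in for storage inside the GEF environment.

```python
from typing import Callable, Dict


def make_wrapped_tool(fetch: Callable[[str], bytes],
                      post_to_tool: Callable[[bytes], bytes],
                      store: Callable[[bytes], str]) -> Callable[[str], str]:
    """Return a wrapped tool that maps an input URL to an output URL."""
    def wrapped(input_url: str) -> str:
        data = fetch(input_url)      # (i) retrieve the input data from the URL
        result = post_to_tool(data)  # (ii) + (iii) POST to the tool, accept the response
        return store(result)         # (iv) + (v) store locally, return the location as URL
    return wrapped


class LocalStore:
    """In-memory stand-in for storage inside the GEF environment."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self.items: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        url = f"{self.base_url}/{len(self.items)}"
        self.items[url] = data
        return url

    def get(self, url: str) -> bytes:
        return self.items[url]
```

Because a wrapped tool consumes and produces URLs, wrapped tools compose directly: the URL returned by one can be handed to the next, and only these short references would travel between the orchestrator and the GEF environment.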

Note that the wrappers will be required to monitor the execution of the tools they wrap. That is, when an HTTP POST request yields an error message, this message is sent back to the WebLicht orchestrator, which then prematurely terminates the execution of the workflow with an appropriate error message.

Step 2. The wrapped tools are dockerized. We have dockerized a good number of tools connected to WebLicht. Each web service shown in Figure 8, for instance, already has a Docker variant. This means that there is a Dockerfile that can be used to generate the Docker container image, and hence the web service can already be incorporated in a GEF environment, taking the approach illustrated with the help of Figure 2 and Figure 3.

Given that we have Dockerfiles for a good number of WebLicht web services, it should be investigated whether a Dockerfile for a web service can be automatically extended to a Dockerfile for a wrapped web service, see Figure 7.

Note: when a tool is integrated into WebLicht, it receives a “web service wrapping”. For the GEF use case it might be profitable to remove its WebLicht wrapping, that is, to take the original tool as the starting point for the GEF-based wrapping. With tools no longer being web services, no HTTP requests take place, just plain tool invocations that need to get their data somehow (depending on the tool). On the other hand, starting from the web service wrapping is easier, and it serves as an abstraction shared by all tools. Wrapping the tool rather than the web service requires looking at the specifics of each tool individually.

4.3 Liaising between WebLicht and the GEF

A translation mechanism (working title “BridgIt”) is needed to liaise between WebLicht and the GEF-enabled services in a GEF environment. We discuss the liaison by example. Consider a user wanting to annotate sensitive data, say an English text with named entities; for this, the user selects a GEF-based WebLicht workflow, cf. Figure 8.

Figure 8. A WebLicht Workflow for Named Entity Recognition (English texts)

This workflow has a corresponding XML representation, see Figure 9. Here, we have used hypothetical handles.


Figure 9. The Named Entity Recognition Easy Chain in XML.

Each of the three tools in the tool chain is referenced with a persistent URL. The persistent tool URL points to metadata that describes the tool. Consider the second tool of the chain, referenced with http://hdl.handle.net/11022/0000-GEF-2518-C. This (hypothetical) handle refers to a CMDI-based[7] instance of the CMDI profile WebLichtWebService. The CMDI instance is located in a CLARIN centre repository.[8] The CMDI instance holds the information given in Figure 10.

Figure 10. Metadata for the Stanford Tokenizer.

[7] CMDI stands for Component MetaData Infrastructure, see www.clarin.eu/cmdi for details.
[8] See http://weblicht.sfs.uni-tuebingen.de/apps/harvester-beta/resources/services.


The metadata has a URL slot that is used to invoke the tool from WebLicht. Note that the URL does not point to the GEF environment, but rather to the translation mechanism, see below. Also note that the type of the input is not “text/tcf+xml” but “text/tcf+url”, indicating that WebLicht will need to invoke the tool by passing a reference to the data to the tool, rather than the actual data.

For the following discussion, please consult Figure 11, which summarises the interaction between the WebLicht orchestrator and the GEF environment.

Figure 11: Architecture of WebLicht-GEF interaction.

As with all other web services connected to WebLicht, the orchestrator invokes the given URL with a synchronous POST request. Given the new type “text/tcf+url”, no input data is being passed, only a reference to the data (a URL).

The synchronous request invokes BridgIt, the translation mechanism, which liaises with the GEF environment. BridgIt maintains the following information:

• A pointer to the GEF environment, say, http://gef.sfs.uni-tuebingen.de:4041.
• A JSON-based table that maps URL-encoded information such as “/StanfordTokenizer” to a service code that identifies the corresponding service in the GEF environment.

With this information a URL of the following form is constructed:

http://gef.sfs.uni-tuebingen.de:4041/jobs?serviceId=<id>&pid=<pointerToData>
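A minimal Python sketch of this construction step is given below. The base URL and the “/StanfordTokenizer” path come from the text; the service code `service-0001`, the table contents and the function name are invented for illustration and do not reflect actual GEF identifiers.

```python
from urllib.parse import urlencode

# Pointer to the GEF environment, as in the example above.
GEF_BASE = "http://gef.sfs.uni-tuebingen.de:4041"

# Hypothetical table contents; the service code is made up for illustration.
SERVICE_TABLE = {"/StanfordTokenizer": "service-0001"}


def build_job_url(path: str, pointer_to_data: str,
                  base: str = GEF_BASE, table=SERVICE_TABLE) -> str:
    """Map a BridgIt path to a GEF service and build the job-submission URL."""
    service_id = table[path]  # raises KeyError if BridgIt does not know the service
    query = urlencode({"serviceId": service_id, "pid": pointer_to_data})
    return f"{base}/jobs?{query}"
```

Using `urlencode` ensures that a pointer which is itself a URL, such as a handle, is percent-encoded before it is placed in the query string.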

BridgIt sends this URL as a POST request to start the GEF service. The given GEF environment answers with a JSON structure (as body of the response to the POST request) that gives the job identifier of the corresponding GEF service, e.g.,

{ jobId : 42 }

BridgIt then polls the GEF environment at regular intervals to check whether the job has terminated. Technically, this is a GET request of the form:

http://gef.sfs.uni-tuebingen.de:4041/jobs/<jobId>

This GET request returns all information about the GEF job with the given job ID:

{ job: {
    ID: 42,
    State: {
      Status: “….”
      Error: …
      Code: -1
    }
    OutputVolume: “…”
  }
}

The information includes the job’s id, state, and output volume (for the result of the processing). If the job’s state signals that the job is still running, BridgIt sleeps for a second before sending another GET request, and this continues until the job has terminated. When the job has terminated successfully, BridgIt returns a reference to the output volume associated with the job id; otherwise BridgIt passes on the error code encoded in the response of the GET request. Any result is returned to WebLicht as the response to the synchronous POST request to BridgIt.

Notes:

• Each GEF environment has its own associated BridgIt process.

• GEF needs to support multiple parameters (other than a pointer to the input data). For instance, for the invocation of BridgIt via

http://bridgit.sfs.uni-tuebingen.de/StanfordTokenizer?model=<someModel>

BridgIt will need to map the parameter “model” and its value to the corresponding parameters of the GEF-enabled service. Here, BridgIt must be aware of the metadata of the GEF-enabled service to perform the mapping.

• For various reasons, BridgIt and its associated GEF environment might be out of sync. Wrong GEF invocations (e.g., erroneous parameter associations) will yield a server error (code 500); here the POST request to the GEF environment will also return a body with the error specification, e.g., “argument mismatch”.
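The polling loop and the parameter mapping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the BridgIt implementation: the status values "running" and "succeeded", the Code values in the canned responses, the GEF-side parameter name "tokenizerModel", the output volume "vol-42" and all function names are hypothetical, since the text does not specify them. The job lookup is passed in as a function so that the sketch runs without a GEF environment.

```python
import time
from typing import Callable, Dict

# Hypothetical status values; the text does not list the actual ones.
RUNNING = "running"
SUCCEEDED = "succeeded"


class GefJobError(Exception):
    """Raised when a GEF job ends unsuccessfully; carries the job's state."""


def poll_job(fetch_job: Callable[[int], dict], job_id: int,
             interval: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job is no longer running; return its output volume."""
    while True:
        job = fetch_job(job_id)["job"]  # body of GET <gef>/jobs/<jobId>
        status = job["State"]["Status"]
        if status == RUNNING:
            sleep(interval)  # wait before sending the next GET request
            continue
        if status == SUCCEEDED:
            return job["OutputVolume"]
        raise GefJobError(job["State"])  # pass on the error information


def map_parameters(query: Dict[str, str],
                   mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename WebLicht-side query parameters to their GEF-side names."""
    unknown = set(query) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped parameters: {sorted(unknown)}")
    return {mapping[name]: value for name, value in query.items()}


# Usage with canned responses instead of real HTTP calls (all values made up).
responses = iter([
    {"job": {"ID": 42, "State": {"Status": RUNNING, "Error": "", "Code": -1},
             "OutputVolume": ""}},
    {"job": {"ID": 42, "State": {"Status": SUCCEEDED, "Error": "", "Code": 0},
             "OutputVolume": "vol-42"}},
])
print(poll_job(lambda job_id: next(responses), 42, sleep=lambda s: None))
print(map_parameters({"model": "english"}, {"model": "tokenizerModel"}))
```

In a deployment, fetch_job would issue the GET request shown earlier and parse the JSON body, and the mapping table would be derived from the metadata of the GEF-enabled service. Rejecting unmapped parameters before the job is started is one way to catch an out-of-sync configuration before the GEF environment answers with a server error.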


5 GEF Perspective

It is quite likely that any GEF environment will only include a small set of WebLicht services. However, setting up a GEF environment comes at a substantial cost. Technical and logistical support is needed to keep costs manageable. From the perspective of a GEF community manager, it is highly desirable that all tools he or she wants to integrate into a GEF environment are pre-packaged. The dockerization of a tool requires intricate knowledge about the tool and its best runtime environment. In general, such information is rarely available.

As a consequence, we assume that a GEF community manager will contact the WebLicht developers and ask them for support. There is a good chance that a WebLicht web service of interest has already been dockerized by the WebLicht development team. Ideally, the wrapping of web services to GEF-ifiable tools, as discussed in Section 4, can be automated: a script extends the existing Dockerfile for a given web service with instructions that perform the wrapping. The extended Dockerfile can then be uploaded to the GEF environment as discussed in Section 3.1.

Setting up a GEF environment must also include the provision of metadata records that describe all GEF-ified tools using the CMDI profile WebLichtWebService. The metadata must be included in the GEF host’s CLARIN centre repository so that WebLicht can harvest this data and hence knows about the existence of the GEF environment. For this, the existing metadata of the original web service should be copied and adapted accordingly.

The GEF community manager must also complement the GEF environment with the translation mechanism we discussed in Section 4.3. Here, we assume that a reference implementation of BridgIt will be provided by the WebLicht team so that a GEF admin can use and configure the translation mechanism at little cost.

Given the required expertise and high cost of setting up a GEF environment for WebLicht services, the WebLicht team may provide GEF environments for popular natural language processing chains. Once a GEF environment has been set up, the cost of moving it to a new location (close to the data) should be relatively low.


6 Conclusion

In this deliverable, we have sketched a potential integration of the CLARIN WebLicht workflow engine and its services with EUDAT’s Generic Execution Framework. Our integration would allow users to bring the language processing tools integrated into WebLicht to an execution environment that also hosts the data, hence allowing language processing close to the data.

From the implementation perspective, initial steps towards such integration have been taken. A good number of WebLicht web services have been dockerized. The containerisation of services is the first step towards “movable computation”. All other steps still await implementation. The location of the application container, for instance, needs to be advertised so that workflow engines relying on the computation can address the container accordingly. For the “advertisement”, the CMDI-based WebLichtWebService profile must be used. The metadata must be added to CLARIN centre repositories for tool metadata so that WebLicht can harvest it from there.

At the time of writing, there is a significant amount of data transfer between the WebLicht orchestrator and the individual services that define a workflow. Such transfer becomes prohibitively expensive for big data, and impermissible for sensitive data. Independently of GEF’s existence, WebLicht may decide to invoke the services in future by giving them references to the data rather than by posting the data to them.

At the time of writing, GEF offers no support for the execution of workflows. The WebLicht orchestrator is thus needed to perform this task. As we have demonstrated, a translation mechanism (“BridgIt”) will need to be implemented to liaise between WebLicht and the GEF-enabled services.

There are a few steps to be taken to integrate WebLicht with GEF. A proof-of-concept GEF environment for a popular language processing workflow (e.g., Named Entity Recognition for English) should be set up. So far, we have dockerized all three tools of this workflow. We need to extend the respective Dockerfiles so that the tools receive their GEF packaging with regard to data transfer. It needs to be investigated whether the Dockerfile extension from WebLicht web service to GEF-enabled service can be mechanized.

For testing purposes, we will create a WebLicht branch that offers GEF-enabled services to users. At the same time, we will need to patch the production server of WebLicht so that it does not show those services to the users. Also, we need to build a demonstrator for the translation mechanism BridgIt.

With this work, we have demonstrated, in theory, that CLARIN web services can be exposed as GEF services, and that those GEF-enabled services can be used within the CLARIN WebLicht workflow engine. More implementation work is required to show the feasibility of our approach.

We will continue to cooperate closely with the EUDAT-GEF development team to ensure that user requirements stemming from the WebLicht use case lead to GEF feature requests that will find their way into the official GEF specification and implementation.

