FOSDEM20 Interactive applications on HPC systems · 2020-02-02

Interactive applications on HPC systems

Erich Birngruber (erich.birngruber@gmi.oeaw.ac.at, @ebirn)

Vienna BioCenter

FOSDEM20


sh$ not good enough?

XPRA

• https://xpra.org/

• “screen for X11”

• Allows disconnect / re-connect to existing X sessions

• Web interface for X11 rendering (HTML5 canvas)

• For arbitrary GUI applications

• Containerized in SLURM

• Custom middleware for job management

Launch XPRA job

XPRA job submitted

XPRA session

XPRA setup

[Diagram labels: launch job, submit request, connect to xpra client, IT services, batch scheduler, middleware]
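The launch / submit / connect flow can be sketched roughly as below. This is a minimal illustration, not the middleware described in the talk: the helper names (`build_xpra_command`, `build_sbatch_command`, `session_url`), the display number, the port, and the `xterm` application are all hypothetical, and the container layer is left out. The code only builds the command lines; it does not submit anything.

```python
import shlex


def build_xpra_command(display: int, port: int, app: str) -> list[str]:
    """Build an xpra command line that runs one GUI application (assumed values)."""
    return [
        "xpra", "start", f":{display}",
        f"--bind-tcp=0.0.0.0:{port}",  # serve the session over TCP
        "--html=on",                    # enable the built-in HTML5 client
        f"--start-child={app}",         # GUI application to run in the session
        "--exit-with-children",         # end the session when the application exits
        "--daemon=no",                  # stay in the foreground so the batch job stays alive
    ]


def build_sbatch_command(job_name: str, inner: list[str]) -> list[str]:
    """Wrap the inner command in an sbatch submission (hypothetical job layout)."""
    return [
        "sbatch",
        "--parsable",                   # print only the job id on success
        f"--job-name={job_name}",
        f"--wrap={shlex.join(inner)}",  # run the inner command as the job script
    ]


def session_url(node: str, port: int) -> str:
    """URL a browser would open once the job runs on the given node."""
    return f"http://{node}:{port}/"


if __name__ == "__main__":
    xpra_cmd = build_xpra_command(display=100, port=14500, app="xterm")
    print(shlex.join(build_sbatch_command("xpra-demo", xpra_cmd)))
    print(session_url("node01", 14500))
```

A real middleware would additionally have to pick a free port, learn which node the job landed on, and authenticate the connecting user.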

RStudio

• https://rstudio.com/

• IDE for R language

• Desktop and Web version (RStudio server)

• Commercial version for advanced features

• RStudio (the company) has become a public benefit company: https://blog.rstudio.com/2020/01/29/rstudio-pbc

RStudio setup

[Diagram labels: batch scheduler, RStudio server, job launcher, session, connect, session]

Galaxy

• https://galaxyproject.org/

• Web based workflow tool

• Tools as building blocks (parameters, input, output)

• Tool definitions in XML

• Multiple instances: dev - testing - production
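To make "tools as building blocks" concrete, here is a small sketch that generates a Galaxy-style tool definition with parameters, one input, and one output. The tool itself (`head_demo`, wrapping `head`) and the helper `make_tool_xml` are invented for illustration and are not taken from the talk; only the general element layout (`tool`, `command`, `inputs`/`param`, `outputs`/`data`) follows the Galaxy tool XML format.

```python
import xml.etree.ElementTree as ET


def make_tool_xml(tool_id: str, name: str, command: str) -> str:
    """Return a minimal Galaxy-style tool definition as an XML string."""
    tool = ET.Element("tool", id=tool_id, name=name, version="0.1.0")

    # Command template; $input, $lines and $output refer to the names below.
    ET.SubElement(tool, "command").text = command

    # Inputs: one dataset and one numeric parameter.
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input", type="data",
                  format="txt", label="Input file")
    ET.SubElement(inputs, "param", name="lines", type="integer",
                  value="10", label="Number of lines")

    # Outputs: one dataset produced by the command.
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format="txt")

    return ET.tostring(tool, encoding="unicode")


if __name__ == "__main__":
    print(make_tool_xml("head_demo", "Head (demo)",
                        "head -n $lines '$input' > '$output'"))
```

Because each tool is a plain XML file, tool definitions can be kept in version control and promoted between instances like any other source file.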

Galaxy setup

[Diagram labels: develop, testing, production, batch scheduler, Git repo branches, test job, session]

JupyterHub

• https://jupyter.org/

• Web-Based IDE (standalone vs. hub)

• Notebooks = Code + Outputs

• Interpreters as “Kernels”
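"Notebooks = Code + Outputs" can be seen directly in the notebook file, which is a JSON document. The sketch below builds a one-cell notebook following the nbformat 4 layout using only the standard library. The helper `make_notebook` and the cell contents are illustrative assumptions; the `kernelspec` entry is where the notebook records which kernel (interpreter) it expects.

```python
import json


def make_notebook(source: str, stdout: str, kernel: str = "python3") -> dict:
    """Build a one-cell notebook as a plain dict (nbformat 4 layout)."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {
            # The kernel is named in the metadata, separate from the code.
            "kernelspec": {
                "name": kernel,
                "display_name": "Python 3",
                "language": "python",
            }
        },
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": source,  # the code
                "outputs": [       # the stored result of running it
                    {"output_type": "stream", "name": "stdout", "text": stdout}
                ],
            }
        ],
    }


if __name__ == "__main__":
    notebook = make_notebook("print(1 + 1)", "2\n")
    print(json.dumps(notebook, indent=1))
```

Saving the printed JSON with an `.ipynb` extension should give a file that Jupyter can open, showing the stored output without re-running the code.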

JupyterHub setup

[Diagram labels: batch scheduler, JupyterHub, connects, session]

Summary

• XPRA: special use cases, X11 applications (Fiji) in containers

• RStudio: R (from env modules), web-based IDE

• Galaxy: pre-configured workflows

• JupyterHub: Python (per-user kernels), plugins

Others

• OpenOnDemand: interactive/remote desktop portal, https://openondemand.org/

• Apache Zeppelin: data exploration “notebooks”, https://zeppelin.apache.org/

• Eclipse Che: cloud-based editor, https://www.eclipse.org/che/

Then this happened

What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities

Souti Chattopadhyay¹, Ishita Prasad², Austin Z. Henley³, Anita Sarma¹, Titus Barik²

Oregon State University¹, Microsoft², University of Tennessee-Knoxville³

{chattops, anita.sarma}@oregonstate.edu, {ishita.prasad, titus.barik}@microsoft.com, azh@utk.edu

ABSTRACT

Computational notebooks, such as Azure, Databricks, and Jupyter, are a popular, interactive paradigm for data scientists to author code, analyze data, and interleave visualizations, all within a single document. Nevertheless, as data scientists incorporate more of their activities into notebooks, they encounter unexpected difficulties, or pain points, that impact their productivity and disrupt their workflow. Through a systematic, mixed-methods study using semi-structured interviews (n = 20) and survey (n = 156) with data scientists, we catalog nine pain points when working with notebooks. Our findings suggest that data scientists face numerous pain points throughout the entire workflow (from setting up notebooks to deploying to production) across many notebook environments. Our data scientists report essential notebook requirements, such as supporting data exploration and visualization. The results of our study inform and inspire the design of computational notebooks.

Author Keywords

Computational notebooks; challenges; data science; interviews; pain points; survey

CCS Concepts

• Human-centered computing → Interactive systems and tools; Empirical studies in HCI; • Software and its engineering → Development frameworks and environments;

INTRODUCTION

Computational notebooks are an interactive paradigm for combining code, data, visualizations, and other artifacts, all within a single document [21, 36, 32, 30]. This interface, essentially, is organized as a collection of input and output cells. For example, a data scientist might write Python code in an input code cell, whose result renders a plot in an output cell. Although these cells are linearly arranged, they can be reorganized or executed in any order. The code executes in a kernel, the computational engine behind the notebook.
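A toy model can show why execution order matters when all cells share one kernel. The sketch below is an illustration only and is not how a real Jupyter kernel is implemented: a plain dict stands in for the kernel's state, and the cell contents and the helper `run_cells` are made up for the example.

```python
def run_cells(cells: list[str], order: list[int]) -> dict:
    """Execute cells in the given order against one shared namespace."""
    namespace: dict = {}  # stands in for the kernel's in-memory state
    for index in order:
        exec(cells[index], namespace)
    return namespace


cells = [
    "x = 1",       # cell 0
    "x = x + 10",  # cell 1
    "y = x * 2",   # cell 2
]

if __name__ == "__main__":
    top_to_bottom = run_cells(cells, [0, 1, 2])
    out_of_order = run_cells(cells, [0, 2, 1])
    print(top_to_bottom["y"], out_of_order["y"])
```

The same three cells give `y = 22` when run top to bottom and `y = 2` when cell 2 runs before cell 1, although the notebook on screen looks identical in both cases.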

This interactive paradigm has made notebooks an appealing choice for data scientists, and this demand has sparked multiple open source and commercial implementations, including Azure,¹ Databricks,² Colab,³ Jupyter,⁴ and nteract.⁵ While originally intended for exploring and constructing computational narratives [29, 31], data scientists are now increasingly orchestrating more of their activities within this paradigm [33]: through long-running statistical models, transforming data at scale, collaborating with others, and executing notebooks directly in production pipelines. But as data scientists try to do so, they encounter unexpected difficulties (pain points) from limitations in affordances and features in the notebooks, which impact their productivity and disrupt their workflow.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

CHI ’20, April 25–30, 2020, Honolulu, HI, USA.

Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-6708-0/20/04 ...$15.00. http://dx.doi.org/10.1145/3313831.3376729

To investigate the pain points and needs of data scientists who work in computational notebooks, across multiple notebook environments, we conducted a systematic mixed-method study using field observations, semi-structured interviews, and a confirmation survey with data science practitioners. While prior work has studied specific facets of difficulties in notebooks [24, 17], such as versioning [18, 19] or cleaning unused code [13, 34], the central contribution of this paper is a taxonomy of validated pain points across data scientists’ notebook activities.

Our findings identify that data scientists face considerable pain points through the entire analytics workflow (from setting up the notebook to deploying to production) across many notebook environments. While our participants reported workarounds, these were ad hoc, required manual interventions, and were prone to errors. Our data scientists report that their key needs are support for deploying notebooks to production and scheduling time-consuming batch executions, as well as under-the-hood software engineering support for managing code and history. Our findings further our understanding of requirements for supporting data scientists’ day-to-day activities, and suggest design opportunities for researchers and toolsmiths to improve computational notebooks and streamline data science workflows.

STUDY DESIGN

Our investigation consisted of two studies. Study 1, a mix of complementary field observations and interviews, investigates the difficulties that data scientists face in their day-to-day activities. Study 2 confirms our findings from Study 1 through a survey of 156 data scientists.

¹ https://notebooks.azure.com/

² https://databricks.com/product/

³ https://colab.research.google.com/

⁴ https://jupyter.org/

⁵ https://nteract.io/

What is wrong?

References

• XPRA https://xpra.org/

• RStudio https://rstudio.com/

• JupyterHub https://jupyter.org/hub

• Galaxy https://galaxyproject.org/

• What is wrong with computational notebooks? http://web.eecs.utk.edu/~azh/blog/notebookpainpoints.html
