Interactive applications on HPC systems Erich Birngruber ([email protected], @ebirn) Vienna BioCenter FOSDEM20

FOSDEM20 · 2020-02-02

sh$ not good enough?

XPRA

• https://xpra.org/

• “screen for X11”

• Allows disconnect / re-connect to existing X sessions

• Web interface for X11 rendering (HTML5 canvas)

• For arbitrary GUI applications

• Containerized in SLURM

• Custom middleware for job management
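As a rough illustration of the "containerized in SLURM" bullet, the sketch below renders a batch script that starts an xpra session inside a container. It is not the middleware from the talk. The container runtime (`singularity`), the image path, the port, the resource limits and the `fiji` executable name are all assumptions chosen for the example.

```python
import shlex
from dataclasses import dataclass


@dataclass
class XpraJob:
    """Parameters for one interactive session (all values are illustrative)."""
    app: str            # GUI program to start inside the session (assumed name)
    image: str          # container image path (hypothetical)
    port: int = 14500   # TCP port the HTML5 client connects to (assumed)
    hours: int = 4      # wall-time limit requested from SLURM
    cpus: int = 2
    mem_gb: int = 8


def render_sbatch(job: XpraJob) -> str:
    """Return the text of a SLURM batch script that runs xpra in a container."""
    xpra_cmd = [
        "xpra", "start",
        f"--bind-tcp=0.0.0.0:{job.port}",  # listen for client connections
        "--html=on",                       # serve the HTML5 client
        f"--start-child={job.app}",        # GUI application to run
        "--exit-with-children",            # end the session when the app exits
        "--daemon=no",                     # stay in the foreground for SLURM
    ]
    # Assumption: Singularity is the container runtime; the talk does not say.
    container_cmd = ["singularity", "exec", job.image, *xpra_cmd]
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=xpra-session",
        f"#SBATCH --time={job.hours:02d}:00:00",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --mem={job.mem_gb}G",
        shlex.join(container_cmd),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch(XpraJob(app="fiji", image="/path/to/fiji.sif")))
```

A job-management middleware could pass such a script to `sbatch` and later report the node and port the session listens on, so that the user can connect with a browser or an xpra client.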

Launch XPRA job

XPRA job submitted

XPRA session

XPRA setup

[Diagram. Steps: launch job, submit request, connect to xpra client. Components: IT services, middleware, batch scheduler.]

RStudio

• https://rstudio.com/

• IDE for R language

• Desktop and Web version (RStudio server)

• Commercial version for advanced features

• RStudio, the company, has become a public benefit corporation: https://blog.rstudio.com/2020/01/29/rstudio-pbc

RStudio setup

[Diagram. Labels: RStudio server, job launcher, batch scheduler, session, connect.]

Galaxy

• https://galaxyproject.org/

• Web based workflow tool

• Tools as building blocks (parameters, input, output)

• Tool definitions in XML

• Multiple instances: dev - testing - production
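To make the "tools as building blocks" and "tool definitions in XML" bullets concrete, here is a minimal sketch that parses a toy tool definition and lists its parameters, inputs and outputs. The tool itself (`line_count`, a wrapper around `wc -l`) is invented for illustration and is not taken from the talk. The XML follows the general shape of a Galaxy tool file but omits most optional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool definition: one input dataset, one output dataset.
TOOL_XML = """\
<tool id="line_count" name="Line count" version="0.1.0">
  <description>count lines in a text file</description>
  <command><![CDATA[wc -l < '$input' > '$output']]></command>
  <inputs>
    <param name="input" type="data" format="txt" label="Input file"/>
  </inputs>
  <outputs>
    <data name="output" format="txt"/>
  </outputs>
</tool>
"""


def describe_tool(xml_text: str) -> dict:
    """Extract the building-block view of a tool: id, command, inputs, outputs."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.get("name"),
        "command": root.findtext("command").strip(),
        "inputs": [p.get("name") for p in root.iterfind("inputs/param")],
        "outputs": [d.get("name") for d in root.iterfind("outputs/data")],
    }


if __name__ == "__main__":
    for key, value in describe_tool(TOOL_XML).items():
        print(f"{key}: {value}")
```

Because each tool is a plain XML file, tool definitions can be versioned in Git and promoted from a development instance through testing to production.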

Galaxy setup

[Diagram. Labels: Git repo branches (develop, testing, production), batch scheduler, job, test job, session.]

JupyterHub

• https://jupyter.org/

• Web-Based IDE (standalone vs. hub)

• Notebooks = Code + Outputs

• Interpreters as “Kernels”
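The "Notebooks = Code + Outputs" bullet can be illustrated with the on-disk notebook structure: a JSON document whose code cells carry both their source and the outputs a kernel produced. The sketch below builds a one-cell notebook by hand and pairs each code cell with its text output. The cell content and the `python3` kernel name are assumptions for the example, and only the fields needed here are included.

```python
import json

# A hand-written, minimal notebook document (nbformat 4). Values are illustrative.
NOTEBOOK = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        # The kernelspec names the interpreter ("kernel") that runs the cells.
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        }
    ],
}


def code_and_outputs(nb: dict) -> list:
    """Return (source, text output) pairs for every code cell."""
    pairs = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        texts = []
        for out in cell["outputs"]:
            if out["output_type"] == "stream":
                texts.append("".join(out["text"]))
            elif out["output_type"] in ("execute_result", "display_data"):
                texts.append("".join(out["data"].get("text/plain", [])))
        pairs.append(("".join(cell["source"]), "".join(texts)))
    return pairs


if __name__ == "__main__":
    nb = json.loads(json.dumps(NOTEBOOK))  # round-trip as if read from an .ipynb file
    print(nb["metadata"]["kernelspec"]["name"], code_and_outputs(nb))
```

Storing outputs next to the code is what lets a notebook be viewed without re-running it.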

JupyterHub setup

[Diagram. Labels: JupyterHub (proxy, hub, api), batch scheduler, job, session, connects.]

Summary

• XPRA: special use cases, X11 applications (Fiji) in containers

• RStudio: R (from env modules), web-based IDE

• Galaxy: pre-configured workflows

• JupyterHub: Python (per-user kernels), plugins

Others

• Open OnDemand: interactive/remote desktop portal, https://openondemand.org/

• Apache Zeppelin: data exploration “notebooks”, https://zeppelin.apache.org/

• Eclipse Che: cloud-based editor, https://www.eclipse.org/che/

Then this happened

😳

What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities

Souti Chattopadhyay¹, Ishita Prasad², Austin Z. Henley³, Anita Sarma¹, Titus Barik²

Oregon State University¹, Microsoft², University of Tennessee-Knoxville³

{chattops, anita.sarma}@oregonstate.edu, {ishita.prasad, titus.barik}@microsoft.com, [email protected]

ABSTRACT
Computational notebooks, such as Azure, Databricks, and Jupyter, are a popular, interactive paradigm for data scientists to author code, analyze data, and interleave visualizations, all within a single document. Nevertheless, as data scientists incorporate more of their activities into notebooks, they encounter unexpected difficulties, or pain points, that impact their productivity and disrupt their workflow. Through a systematic, mixed-methods study using semi-structured interviews (n = 20) and survey (n = 156) with data scientists, we catalog nine pain points when working with notebooks. Our findings suggest that data scientists face numerous pain points throughout the entire workflow (from setting up notebooks to deploying to production) across many notebook environments. Our data scientists report essential notebook requirements, such as supporting data exploration and visualization. The results of our study inform and inspire the design of computational notebooks.

Author Keywords
Computational notebooks; challenges; data science; interviews; pain points; survey

CCS Concepts
• Human-centered computing → Interactive systems and tools; Empirical studies in HCI; • Software and its engineering → Development frameworks and environments;

INTRODUCTION
Computational notebooks are an interactive paradigm for combining code, data, visualizations, and other artifacts, all within a single document [21, 36, 32, 30]. This interface, essentially, is organized as a collection of input and output cells. For example, a data scientist might write Python code in an input code cell, whose result renders a plot in an output cell. Although these cells are linearly arranged, they can be reorganized or executed in any order. The code executes in a kernel, the computational engine behind the notebook.

This interactive paradigm has made notebooks an appealing choice for data scientists, and this demand has sparked multiple open source and commercial implementations, including Azure,¹ Databricks,² Colab,³ Jupyter,⁴ and nteract.⁵ While originally intended for exploring and constructing computational narratives [29, 31], data scientists are now increasingly orchestrating more of their activities within this paradigm [33]: through long-running statistical models, transforming data at scale, collaborating with others, and executing notebooks directly in production pipelines. But as data scientists try to do so, they encounter unexpected difficulties (pain points) from limitations in affordances and features in the notebooks, which impact their productivity and disrupt their workflow.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. CHI ’20, April 25–30, 2020, Honolulu, HI, USA.

Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-6708-0/20/04 ...$15.00. http://dx.doi.org/10.1145/3313831.3376729

To investigate the pain points and needs of data scientists who work in computational notebooks, across multiple notebook environments, we conducted a systematic mixed-method study using field observations, semi-structured interviews, and a confirmation survey with data science practitioners. While prior work has studied specific facets of difficulties in notebooks [24, 17], such as versioning [18, 19] or cleaning unused code [13, 34], the central contribution of this paper is a taxonomy of validated pain points across data scientists’ notebook activities.

Our findings identify that data scientists face considerable pain points through the entire analytics workflow (from setting up the notebook to deploying to production) across many notebook environments. While our participants reported workarounds, these were ad hoc, required manual interventions, and were prone to errors. Our data scientists report that their key needs are support for deploying notebooks to production and scheduling time-consuming batch executions, as well as under-the-hood software engineering support for managing code and history. Our findings further our understanding of requirements for supporting data scientists’ day-to-day activities, and suggest design opportunities for researchers and toolsmiths to improve computational notebooks and streamline data science workflows.

STUDY DESIGN
Our investigation consisted of two studies. Study 1, a mix of complementary field observations and interviews, investigates the difficulties that data scientists face in their day-to-day activities. Study 2 confirms our findings from Study 1 through a survey of 156 data scientists.

¹ https://notebooks.azure.com/

² https://databricks.com/product/

³ https://colab.research.google.com/

⁴ https://jupyter.org/

⁵ https://nteract.io/

What is wrong?

References

• XPRA https://xpra.org/

• RStudio https://rstudio.com/

• JupyterHub https://jupyter.org/hub

• Galaxy https://galaxyproject.org/

• What is wrong with computational notebooks? http://web.eecs.utk.edu/~azh/blog/notebookpainpoints.html