
Cooperative Data Exploration with IPython Notebook


Page 1: Cooperative Data Exploration with IPython Notebook

Piotr Lusakowski

Cooperative Data Exploration with IPython Notebook

Page 2: Cooperative Data Exploration with IPython Notebook

Motivation


● Big Data computations require lots of resources

○ CPU
○ RAM

● Sharing results is difficult in most current setups

○ Precomputed datasets
○ Trained models
○ Insights

Page 3: Cooperative Data Exploration with IPython Notebook

Solution

Created for the Seahorse 1.0 release

● Single Spark application as the backend

○ Results of other team members easily accessible in-memory
○ No unnecessary duplication of data

● Multiple IPython Notebooks as clients


Page 4: Cooperative Data Exploration with IPython Notebook

Challenges

● How to use the SparkContext and SQLContext of an application running on a cluster?

● How to execute Python code on a cluster?

Page 5: Cooperative Data Exploration with IPython Notebook

Py4J

● A library for Python-Java communication

● “Wraps” JVM-based objects

● Exposes their API in Python

● Internally, uses a custom TCP client/server protocol

● In the JVM: a Gateway Server

● On the Python side: a client called the Java Gateway
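As a minimal sketch of this pattern, assuming some JVM process has already started a GatewayServer on Py4J's default port:

    # Minimal Py4J sketch. Assumes a JVM process has already started a
    # GatewayServer on the default port (25333), e.g. in Java:
    #   GatewayServer server = new GatewayServer(entryPoint);
    #   server.start();
    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()  # the Java Gateway client, connects over TCP

    # Any JVM class is reachable through the "jvm" view; the call executes
    # in the JVM and only the result crosses back to Python.
    random = gateway.jvm.java.util.Random()
    print(random.nextInt(10))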

Page 6: Cooperative Data Exploration with IPython Notebook

Using an Existing SparkContext

● Spark application exposes its SparkContext and SQLContext

○ It’s actually quite easy, once you know what you’re doing

● Notebook connects to the Spark application via Py4J on startup

○ sc and sqlContext variables are added to the user’s environment
○ This setup is completely transparent to the user
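A sketch of how a notebook can wrap the application's JVM-side contexts, assuming PySpark 1.x; the entry-point method names (getJavaSparkContext, getSQLContext) are hypothetical stand-ins for whatever the Spark application actually exposes:

    # Sketch: reuse a remote application's contexts from PySpark 1.x.
    # The entry-point method names below are hypothetical.
    from py4j.java_gateway import JavaGateway
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    gateway = JavaGateway()

    # Wrap the JVM-side JavaSparkContext in a Python SparkContext...
    jsc = gateway.entry_point.getJavaSparkContext()
    conf = SparkConf(_jconf=jsc.getConf())
    sc = SparkContext(gateway=gateway, jsc=jsc, conf=conf)

    # ...and do the same for the SQLContext.
    sqlContext = SQLContext(sc, gateway.entry_point.getSQLContext())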

Page 7: Cooperative Data Exploration with IPython Notebook

Notebook Architecture Overview


● User’s code is executed by kernels: processes spawned by the Notebook Server

● Kernels execute user’s code on the Notebook Server host
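For contrast with what follows, a hedged illustration of this stock behaviour using the kernel-manager machinery (jupyter_client is the modern packaging of it):

    # The stock setup: the Notebook Server spawns a kernel process on its
    # own host and ships code to it over ZeroMQ.
    from jupyter_client import KernelManager

    km = KernelManager()
    km.start_kernel()               # kernel process starts on *this* host
    client = km.client()
    client.start_channels()
    client.execute("print(1 + 1)")  # user's code runs in that local process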

Page 8: Cooperative Data Exploration with IPython Notebook

Requirements


● User’s code is executed on the Spark driver

● No assumptions about the driver being visible from the Notebook Server

Page 9: Cooperative Data Exploration with IPython Notebook

Custom Kernel

● Forwarding Kernel

● Executing Kernel

● Message Queue
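A hedged sketch of how these three pieces could fit together: the Forwarding Kernel (spawned by the Notebook Server) pushes execute requests onto the Message Queue, and the Executing Kernel (living on the Spark driver) consumes them. Both sides only dial out to the broker, so the driver never has to be visible from the Notebook Server. The broker choice (Redis) and message shape are illustrative, not from the talk:

    import json
    import redis  # illustrative broker; the talk only says "Message Queue"

    broker = redis.StrictRedis(host="broker-host")  # both sides dial out

    # Forwarding Kernel: spawned by the Notebook Server, runs no user code.
    def forward(code):
        broker.rpush("execute_requests", json.dumps({"code": code}))

    # Executing Kernel: runs on the Spark driver, next to sc / sqlContext.
    def serve():
        while True:
            _, body = broker.blpop("execute_requests")
            exec(json.loads(body)["code"])  # replies travel back the same way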

Page 10: Cooperative Data Exploration with IPython Notebook

The Interaction Between Users

● Storage object accessible via Py4J

○ Each client connected to the Spark application can reuse any entity from the storage

■ DataFrames
■ Models
■ Even code snippets

○ Access control

■ Sharing with only selected colleagues
■ Private storage

○ Notifications: “Hey, look, Susan published a new result!”
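What talking to that storage could look like from a notebook, reusing the gateway from the earlier sketches; the method names (getStorage, publish, get) are hypothetical, since the talk only states that the storage object is reachable via Py4J:

    # Hypothetical storage API; only "accessible via Py4J" is from the talk.
    storage = gateway.entry_point.getStorage()

    # Publish an in-memory entity under a name, passing its JVM-side handle
    # (here: a trained pyspark.ml model).
    storage.publish("susans_model", model._java_obj)

    # Any teammate connected to the same Spark application pulls it back:
    # no recomputation, no second copy of the data.
    shared = storage.get("susans_model")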

Page 11: Cooperative Data Exploration with IPython Notebook

Cooperative Data Exploration

● John defines a DataFrame: “Something Interesting”

● Alex explores it

● Susan bases her models on it

● John uses a model shared by Susan
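Spelled out with the same hypothetical storage API (the path and column name are made up):

    # John: define and publish "Something Interesting".
    df = sqlContext.read.json("hdfs:///data/something_interesting.json")
    storage.publish("something_interesting", df._jdf)  # JVM-side handle

    # Alex, in a different notebook attached to the same Spark application:
    from pyspark.sql import DataFrame
    shared = DataFrame(storage.get("something_interesting"), sqlContext)
    shared.groupBy("country").count().show()  # explores it, no recomputation

    # Susan trains on it and publishes her model; John fetches it the same way.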

Page 12: Cooperative Data Exploration with IPython Notebook

Thank you!

Piotr Lusakowski
Senior Software Engineer

[email protected]