Piotr Lusakowski
Cooperative Data Exploration with IPython Notebook
Motivation
● Big Data computations require lots of resources
○ CPU
○ RAM
● Sharing the results is difficult in most current setups
○ Precomputed datasets
○ Trained models
○ Insights
Solution
Created for the Seahorse 1.0 release
● Single Spark application as the backend
○ Results of other team members easily accessible in-memory
○ No unnecessary duplication of data
● Multiple IPython Notebooks as clients
Challenges
● How to use the SparkContext and SQLContext of an application running on a cluster?
● How to execute Python code on the cluster?
Py4J
A library for Python-Java communication
● “Wraps” JVM-based objects
● Exposes their API in Python
● Internally, uses custom TCP client/server communication
● In the JVM: a GatewayServer
● On the Python side: a client called JavaGateway
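The wrapping idea above can be illustrated with a toy sketch built only on the Python standard library. It is not Py4J and does not use Py4J's wire protocol: the JSON-lines format and the names `Counter`, `GatewayHandler`, `RemoteProxy`, and `start_gateway_server` are assumptions made for illustration. A server process holds the real object, and a client-side proxy turns attribute access into method calls sent over TCP.

```python
import json
import socket
import socketserver
import threading


class Counter:
    """Stand-in for a JVM-side object that lives in the server process."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount
        return self.value


class GatewayHandler(socketserver.StreamRequestHandler):
    """Reads one JSON request per line and calls the named method on the target."""

    def handle(self):
        for line in self.rfile:
            request = json.loads(line)
            method = getattr(self.server.target, request["method"])
            result = method(*request["args"])
            self.wfile.write((json.dumps({"result": result}) + "\n").encode())


class RemoteProxy:
    """Client-side wrapper: attribute access becomes a remote method call."""

    def __init__(self, host, port):
        self._sock = socket.create_connection((host, port))
        self._reader = self._sock.makefile("r")

    def __getattr__(self, name):
        # Only called for attributes that are not found locally.
        if name.startswith("_"):
            raise AttributeError(name)

        def call(*args):
            payload = json.dumps({"method": name, "args": args}) + "\n"
            self._sock.sendall(payload.encode())
            return json.loads(self._reader.readline())["result"]

        return call

    def close(self):
        self._reader.close()
        self._sock.close()


def start_gateway_server(target):
    """Starts the toy server on a free local port in a background thread."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), GatewayHandler)
    server.target = target
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With a server started via `start_gateway_server(Counter())`, a `RemoteProxy` connected to `server.server_address` can call `proxy.add(2)` as if the object were local, which is the experience Py4J gives for JVM objects.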
Using an Existing SparkContext
● The Spark application exposes its SparkContext and SQLContext
○ It’s actually quite easy, once you know what you’re doing
● The Notebook connects to the Spark application via Py4J on startup
○ sc and sqlContext variables are added to the user’s environment
○ This setup is completely transparent to the user
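A rough sketch of what such a startup hook might look like is given below. It is not the Seahorse implementation. The entry-point accessors `getSparkContext()` and `getSqlContext()` are hypothetical and would have to be defined by the JVM application, the function name `connect_to_shared_spark` is made up, and the `SQLContext(sc, jvm_sql_context)` call assumes the Spark 1.x constructor. The sketch has not been checked against a specific Spark or Py4J version.

```python
def connect_to_shared_spark(host="127.0.0.1", port=25333):
    """Returns (sc, sqlContext) backed by an already running Spark application."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from py4j.java_gateway import GatewayParameters, JavaGateway, java_import
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    gateway = JavaGateway(
        gateway_parameters=GatewayParameters(
            address=host, port=port, auto_convert=True
        )
    )

    # PySpark looks these JVM classes up through the gateway's JVM view.
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    java_import(gateway.jvm, "org.apache.spark.api.python.*")
    java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    java_import(gateway.jvm, "org.apache.spark.sql.*")
    java_import(gateway.jvm, "scala.Tuple2")

    entry_point = gateway.entry_point
    # Hypothetical accessors: the Spark application must expose them itself.
    java_spark_context = entry_point.getSparkContext()  # expected: JavaSparkContext
    java_sql_context = entry_point.getSqlContext()      # expected: JVM SQLContext

    # Wrap the existing JVM objects instead of creating new ones.
    conf = SparkConf(_jvm=gateway.jvm, _jconf=java_spark_context.getConf())
    sc = SparkContext(conf=conf, gateway=gateway, jsc=java_spark_context)
    sql_context = SQLContext(sc, java_sql_context)  # Spark 1.x signature assumed
    return sc, sql_context
```

A kernel could run `sc, sqlContext = connect_to_shared_spark()` before handing control to the user, so both variables are already present in the first cell.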
Notebook Architecture Overview
● User’s code is executed by kernels - processes spawned by the Notebook Server
● Kernels execute user’s code on the Notebook Server host
Requirements
● User’s code is executed on the Spark driver
● No assumptions about the driver being visible from the Notebook Server
Custom Kernel
● Forwarding Kernel
● Executing Kernel
● Message Queue
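The three components can be sketched as follows. This is a simplified, single-process illustration under several assumptions: `queue.Queue` stands in for the message queue (the slides do not name the broker), the message fields `id`, `code`, `status`, and `output` are invented, and the real kernels would also have to speak the notebook's kernel protocol. The point is only the division of work: the forwarding side never runs user code, and the executing side, which would live on the Spark driver, never needs to be reachable from the Notebook Server.

```python
import contextlib
import io
import queue
import threading


class ExecutingKernel(threading.Thread):
    """Runs on the driver side: takes code from the queue and executes it."""

    def __init__(self, requests, replies, namespace):
        super().__init__(daemon=True)
        self.requests = requests
        self.replies = replies
        self.namespace = namespace  # user environment, e.g. holding sc

    def run(self):
        while True:
            message = self.requests.get()
            if message is None:  # sentinel: shut down
                break
            stdout = io.StringIO()
            try:
                # Note: redirect_stdout swaps sys.stdout for the whole process.
                with contextlib.redirect_stdout(stdout):
                    exec(message["code"], self.namespace)
                reply = {"id": message["id"], "status": "ok",
                         "output": stdout.getvalue()}
            except Exception as error:
                reply = {"id": message["id"], "status": "error",
                         "output": repr(error)}
            self.replies.put(reply)


class ForwardingKernel:
    """Runs on the Notebook Server host: forwards code, executes nothing."""

    def __init__(self, requests, replies):
        self.requests = requests
        self.replies = replies
        self._next_id = 0

    def execute(self, code, timeout=5.0):
        self._next_id += 1
        self.requests.put({"id": self._next_id, "code": code})
        return self.replies.get(timeout=timeout)
```

In this sketch state persists between cells because the executing side keeps one namespace, so a variable defined in one `execute` call is visible in the next.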
The Interaction Between Users
● Storage object accessible via Py4J
○ Each client connected to the Spark application can reuse any entity from the storage
■ DataFrames
■ Models
■ Even code snippets
○ Access control
■ Sharing with only selected colleagues
■ Private storage
○ Notifications: “Hey, look, Susan published a new result!”
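A minimal in-process sketch of such a storage object is shown below. The class name `SharedStorage` and its methods are hypothetical, and in the setup described here the object would live inside the Spark application and be reached through Py4J rather than being a plain Python class. The sketch covers the three ideas from the slide: reuse of published entities, per-entity access control, and notifications.

```python
class SharedStorage:
    """Toy shared storage with access control and publish notifications."""

    def __init__(self):
        self._entries = {}    # name -> (owner, entity, readers or None)
        self._listeners = {}  # user -> callback(owner, name)

    def subscribe(self, user, callback):
        """Registers a callback invoked when something readable is published."""
        self._listeners[user] = callback

    def _can_read(self, user, name):
        _owner, _entity, readers = self._entries[name]
        return readers is None or user in readers

    def publish(self, owner, name, entity, shared_with=None):
        """shared_with=None shares with everyone; an empty set keeps it private."""
        readers = None if shared_with is None else set(shared_with) | {owner}
        self._entries[name] = (owner, entity, readers)
        # Notify only the users who are allowed to read the new entity.
        for user, callback in self._listeners.items():
            if user != owner and self._can_read(user, name):
                callback(owner, name)

    def get(self, user, name):
        if not self._can_read(user, name):
            raise PermissionError(f"{user} may not read {name!r}")
        return self._entries[name][1]
```

For example, `storage.publish("susan", "model", model, shared_with={"john"})` would make the model readable by Susan and John only, and only John's subscription callback would fire.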
Cooperative Data Exploration
● John defines a DataFrame: “Something Interesting”
● Alex explores it
● Susan bases her models on it
● John uses a model shared by Susan