Spark + IPython
The remix
theblackbox.io (@theblackboxio)
shakespearecode.io (@shksprcodeio)
27 March 2015 at @Itnig with @pybcn
Index
● Motivation
● Walkthrough
● Demo
A little about me
● Guillermo Blasco
● Graduate in Mathematics and Software Engineering
● Developing theblackbox.io
● Working as a Data Scientist
Spark, What?
● Distributed computation engine
● Based on the Resilient Distributed Dataset (RDD)
● Runs on the JVM, but available from Java, Scala and Python
● Open source
Spark, Why?
● Mainly, scalability in terms of:
○ Commodity costs
○ Computation time
○ Dataset size
● Hadoop was hard to maintain
● MapReduce is a computational pattern
● RDD is a distributed data model
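The MapReduce pattern itself is just functional composition; a minimal pure-Python word count (not Spark code, just an illustration of the pattern) shows the map and reduce steps that Spark distributes across an RDD:

```python
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# map step: turn each line into a bag of per-word counts
mapped = [Counter(line.split()) for line in lines]

# reduce step: merge the partial counts pairwise, much like reduceByKey
counts = reduce(lambda a, b: a + b, mapped)

print(counts["to"])  # "to" appears 4 times across both lines
```

Spark's contribution is running the map step on partitions spread over many machines and shuffling the partial results for the reduce step.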
IPython, What?
● Interactive computing framework
● “Python with batteries”
● Open source
● Expanding to other languages (Jupyter)
IPython, Why?
● Powerful interactive remote shells:
○ Terminal
○ Qt
○ Notebook
● Easy data visualization
● Configurable in cluster and in parallel
● Embeddable, flexible, extensible
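A minimal sketch of standing up the cluster side, assuming the IPython 3.x tooling of the time (in later versions the parallel machinery moved to the separate ipyparallel package, and the notebook to Jupyter):

```shell
# Start a local controller plus 4 engines for parallel (%px) work.
ipcluster start -n 4

# In another terminal, open a notebook that can talk to those engines.
ipython notebook
```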
Wait...
● IPython is cluster-configurable
● Spark has interactive Scala and Python shells

Aren’t they pretty much the same?
Well, not at all.
Gin and Vodka are not the same.
Spark + IPython, Why?
● Spark is today’s leading general-purpose distributed computation engine when you weigh performance against developer productivity.
● IPython is great for experimenting with and developing scientific applications.
Mix them together to get the best of both.
So, what is the goal?
● Connecting your IPython environment to a Spark cluster lets you develop interactively against much larger datasets.
And an extra benefit...
Since Spark is production-ready, you just have to export* your IPython project to a Python script. Meaning:
● No code translation to move to the production environment
Before mixing it up, understand
Spark
● Slave-Master-Client
IPython
● (Cluster-)Master-Client
Spark architecture
● The master node coordinates distribution and resilience of RDDs.
● Slave nodes compute the operations over RDDs.
● Client nodes connect to the master to request computations.
IPython architecture
● Master node with a kernel (the computation unit)
● Slave nodes handle computations tagged as distributed (%px)
● Client nodes connect to the master to request computations.
The plan
● Configure a Spark cluster
● Link the IPython kernel to a Spark context
● Use IPython clients to develop scripts with Spark
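One way the linking step was typically done in the Spark 1.x era (a sketch; `SPARK_HOME` and the `spark://master-host:7077` URL are placeholders for your own installation and cluster):

```shell
# Use IPython as the driver shell for pyspark, so the shell that opens
# already has a SparkContext (sc) bound to the cluster.
PYSPARK_DRIVER_PYTHON=ipython \
  $SPARK_HOME/bin/pyspark --master spark://master-host:7077

# Or launch the notebook server as the driver instead of a terminal shell:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
  $SPARK_HOME/bin/pyspark --master spark://master-host:7077
```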
Hands On!
Let’s drink Gin with Vodka
https://github.com/theblackboxio/spark-ipython
Conclusions
● Computational power of Spark
● Interactivity of IPython
● Viable, and not that hard to configure
● Also fun
Complexities
● Sysadmin work
● Python dependencies
Thanks! Questions?
Thanks to:
Python BCN (@pybcn)
Itnig (@Itnig)