Spark + IPython
The remix

theblackbox.io (@theblackboxio)

shakespearecode.io (@shksprcodeio)

27 March 2015 at @Itnig with @pybcn

Index

● Motivation
● Walkthrough
● Demo

A little about me

● Guillermo Blasco
● Graduate in Mathematics and Software Engineering
● Developing theblackbox.io
● Working as a Data Scientist

Spark, What?

● Distributed computation engine
● Based on the Resilient Distributed Dataset (RDD)
● Runs on the JVM, with APIs for Java, Scala and Python
● Open Source

Spark, Why?

● Mainly, scalability in terms of
  ○ Commodity costs
  ○ Computation time
  ○ Dataset size
● Hadoop was hard to maintain
● MapReduce is a computational pattern
● RDD is a distributed data model
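To make "MapReduce is a computational pattern" concrete, here is a minimal sketch in plain Python, with no Spark involved. The function names and the sample sentences are illustrative assumptions, not part of the talk. Each inner list plays the role of one partition of a distributed dataset: the map step runs on each partition independently, and the reduce step merges the partial results.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words inside one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial results (the operation is associative)."""
    return left + right


def word_count(partitions):
    # In a real cluster each partition would be processed on a different machine.
    partials = map(map_partition, partitions)
    return reduce(merge, partials, Counter())


# Two hypothetical partitions of a tiny dataset.
partitions = [
    ["to be or not to be"],
    ["that is the question", "to be"],
]
totals = word_count(partitions)
print(totals["to"], totals["be"], totals["question"])
```

Because the merge is associative, the partial results can be combined in any grouping, which is what lets an engine such as Spark spread the work over many nodes.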

IPython, What?

● Interactive computing framework
● “python with batteries”
● Open Source
● Expanding to other languages (Jupyter)

IPython, Why?

● Powerful interactive remote shells
  ○ Terminal
  ○ Qt
  ○ Notebook
● Easy data visualization
● Can be configured to run on a cluster and in parallel
● Embeddable, flexible, extensible

Wait...

● IPython is cluster configurable
● Spark has interactive Scala and Python shells

Aren't they pretty much the same?

Well, not at all

Gin and Vodka are not the same

Spark + IPython, Why?

● Spark is the leading general-purpose distributed computation system today in terms of performance in production.

● IPython is great for experimenting with and developing scientific applications.

Mix them together to get the best of both.

So, what is the goal?

● Connecting your IPython environment to a Spark cluster lets you develop against much larger datasets

And an extra benefit...

Since Spark is production-ready, you just have to export* your IPython project to a Python script. Meaning:
● No code translation for the production environment
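The export step is normally done with the `nbconvert` tool. The sketch below is not that tool: it is a small standard-library stand-in that shows what the export amounts to, assuming the notebook is stored in the nbformat 4 JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The function name and the sample notebook are hypothetical.

```python
import json


def notebook_to_script(notebook_json):
    """Concatenate the code cells of an nbformat 4 notebook into one script."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# A tiny, made-up notebook used only to exercise the function.
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Exploration"]},
        {"cell_type": "code", "source": ["data = [1, 2, 3]\n", "total = sum(data)"]},
        {"cell_type": "code", "source": "print(total)"},
    ],
})
script = notebook_to_script(example)
print(script)
```

Cells that use IPython magics (lines starting with `%`) are not plain Python, so a real export needs more care than this sketch takes.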

Before mixing it up, understand

Spark
● Slave-Master-Client

IPython
● (Cluster-)Master-Client

Spark architecture

● Master node coordinates distribution and resilience of RDDs.

● Slave nodes compute the operations over RDDs.

● Client nodes connect to the master to request computations.
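The client role described above can be sketched with PySpark as follows. This is a minimal sketch under stated assumptions, not the speaker's setup: it assumes a standalone Spark master on the default port 7077 and PySpark importable on the client. The host name `master.example`, the application name and the helper function names are hypothetical.

```python
def standalone_master_url(host, port=7077):
    """Build the URL a client uses to reach a standalone Spark master."""
    return "spark://{}:{}".format(host, port)


def connect(master_url, app_name="ipython-client"):
    """Create a SparkContext on the client, pointed at the given master."""
    # Imported lazily so this module still loads where PySpark is absent.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master_url).setAppName(app_name)
    return SparkContext(conf=conf)


def sum_of_squares(sc, n):
    """Request a computation: the map runs on the cluster, the sum comes back."""
    return sc.parallelize(range(n)).map(lambda x: x * x).sum()


url = standalone_master_url("master.example")  # hypothetical host
# With a running cluster:
#   sc = connect(url)
#   sum_of_squares(sc, 1000)
#   sc.stop()
```

The client only describes the computation; the master schedules it and the slave nodes evaluate the `map` over the partitions of the RDD.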

IPython architecture

● Master node with a kernel (computation unit)
● Slave nodes handle computations tagged as distributed (%px)
● Client nodes connect to master to request computations.
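In a notebook, the `%px` magic runs a statement on every engine. The sketch below shows the programmatic counterpart using the `ipyparallel` package, which is where this functionality lives today (at the time of the talk it shipped as `IPython.parallel`). It assumes a cluster is already running; the function names are illustrative.

```python
def square(x):
    """Work unit that each engine applies to its share of the data."""
    return x * x


def squares_on(view, items):
    """Distribute `square` over `items` using an engine view's map_sync."""
    return view.map_sync(square, items)


def connect_to_engines():
    """Return a view over all engines of the default cluster profile."""
    # Imported lazily: needs the ipyparallel package and a running cluster.
    import ipyparallel as ipp

    client = ipp.Client()
    return client[:]  # DirectView addressing every engine


# With a running cluster (for example started with `ipcluster start -n 4`):
#   view = connect_to_engines()
#   squares_on(view, range(8))
```

`squares_on` only relies on the view having a `map_sync` method, so the same function can be tried locally with any stand-in object before pointing it at real engines.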

The plan

● Configure Spark cluster
● Link IPython kernel to one Spark context
● Use IPython clients to develop scripts with Spark
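One way to link IPython to a Spark context is to let the `pyspark` launcher start IPython as its driver, using the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` environment variables. The sketch below only builds the environment and the command line; it is an assumption about a reasonable setup, not necessarily what the repository linked in this talk does. The Spark home path and master URL are hypothetical.

```python
import os


def ipython_driver_env(base_env=None, notebook=True):
    """Return an environment that makes `pyspark` start IPython as the driver."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    if notebook:
        # Extra argument passed to the driver: open the notebook server.
        env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def pyspark_command(spark_home, master_url):
    """Build the command that launches the PySpark shell against a master."""
    return [os.path.join(spark_home, "bin", "pyspark"), "--master", master_url]


env = ipython_driver_env(base_env={})
cmd = pyspark_command("/opt/spark", "spark://master.example:7077")  # hypothetical values
# To actually launch it: subprocess.call(cmd, env=ipython_driver_env())
```

Started this way, the shell creates the Spark context and exposes it as `sc` inside the IPython session.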

Hands On!

Let’s drink Gin with Vodka

https://github.com/theblackboxio/spark-ipython

Conclusions

● Computational power of Spark
● Interactivity of IPython
● Viable, not that hard to configure
● Also fun

Complexities

● Sysadmin work
● Python dependencies

Thanks!

Questions?

Thanks to:
● Python BCN (@pybcn)
● Itnig (@Itnig)