Spark + IPython
The remix
theblackbox.io (@theblackboxio)
shakespearecode.io (@shksprcodeio)
27 March 2015 at @Itnig with @pybcn
Index
● Motivation
● Walkthrough
● Demo
A little about me
● Guillermo Blasco
● Graduate in Mathematics and Software Engineering
● Developing theblackbox.io
● Working as a Data Scientist
Spark, What?
● Distributed computation engine
● Based on the Resilient Distributed Dataset (RDD)
● Runs on the JVM, but available from Java, Scala and Python
● Open source
Spark, Why?
● Mainly, scalability in terms of:
○ Commodity costs
○ Computation time
○ Dataset size
● Hadoop was hard to maintain
● MapReduce is a computational pattern
● RDD is a distributed data model
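The MapReduce pattern itself is just functional composition; a minimal pure-Python word count (not Spark code, just an illustration of the pattern) shows the map and reduce steps that Spark distributes across an RDD:

```python
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# map step: turn each line into a bag of per-word counts
mapped = [Counter(line.split()) for line in lines]

# reduce step: merge the partial counts pairwise, much like reduceByKey
counts = reduce(lambda a, b: a + b, mapped)

print(counts["to"])  # "to" appears 4 times across both lines
```

Spark's contribution is running the map step on partitions spread over many machines and shuffling the partial results for the reduce step.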
IPython, What?
● Interactive computing framework
● “Python with batteries”
● Open source
● Expanding to other languages (Jupyter)
IPython, Why?
● Powerful interactive remote shells:
○ Terminal
○ Qt
○ Notebook
● Easy data visualization
● Configurable in cluster and in parallel
● Embeddable, flexible, extensible
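A minimal sketch of standing up the cluster side, assuming the IPython 3.x tooling of the time (in later versions the parallel machinery moved to the separate ipyparallel package, and the notebook to Jupyter):

```shell
# Start a local controller plus 4 engines for parallel (%px) work.
ipcluster start -n 4

# In another terminal, open a notebook that can talk to those engines.
ipython notebook
```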
Wait...
● IPython is cluster-configurable
● Spark has interactive Scala and Python shells

Aren’t they pretty much the same?
Well, not at all.
Gin and Vodka are not the same.
Spark + IPython, Why?
● Spark is today’s leading general-purpose distributed computation engine when you weigh performance against developer productivity.
● IPython is great for experimenting with and developing scientific applications.
Mix them together to get the best of both.
So, what is the goal?
● Connecting your IPython environment to a Spark cluster lets you develop interactively against much larger datasets.
And an extra benefit...
Since Spark is production-ready, you just have to export* your IPython project to a Python script. Meaning:
● No code translation to move to the production environment
Before mixing it up, understand
Spark
● Slave-Master-Client
IPython
● (Cluster-)Master-Client
Spark architecture
● The master node coordinates distribution and resilience of RDDs.
● Slave nodes compute the operations over RDDs.
● Client nodes connect to the master to request computations.
IPython architecture
● Master node with a kernel (the computation unit)
● Slave nodes handle computations tagged as distributed (%px)
● Client nodes connect to the master to request computations.
The plan
● Configure a Spark cluster
● Link the IPython kernel to a Spark context
● Use IPython clients to develop scripts with Spark
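One way the linking step was typically done in the Spark 1.x era (a sketch; `SPARK_HOME` and the `spark://master-host:7077` URL are placeholders for your own installation and cluster):

```shell
# Use IPython as the driver shell for pyspark, so the shell that opens
# already has a SparkContext (sc) bound to the cluster.
PYSPARK_DRIVER_PYTHON=ipython \
  $SPARK_HOME/bin/pyspark --master spark://master-host:7077

# Or launch the notebook server as the driver instead of a terminal shell:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
  $SPARK_HOME/bin/pyspark --master spark://master-host:7077
```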
Hands On!
Let’s drink Gin with Vodka
https://github.com/theblackboxio/spark-ipython
Conclusions
● Computational power of Spark
● Interactivity of IPython
● Viable, and not that hard to configure
● Also fun
Complexities
● Sysadmin work
● Python dependencies
Thanks! Questions?
Thanks to:
Python BCN (@pybcn)
Itnig (@Itnig)