Analyzing Data With Docker Andreas Dewes (@japh44) EuroPython 2016 - Bilbao


Page 1: Analyzing data with docker   v4

Analyzing Data With Docker

Andreas Dewes (@japh44)

EuroPython 2016 - Bilbao

Page 2: Analyzing data with docker   v4

Outline

Data Analysis: Small & Large-Scale, Easy & Difficult

Introduction To Docker

Containerizing our Data Analysis

Possible Approaches

Relevant Technologies & Outlook

Page 3: Analyzing data with docker   v4

Data Analysis: Use Cases

Small-scale, interactive: interactive, UI-based analysis (e.g. IPython notebook)

Small-scale, automated: analysis scripts using local data sources (e.g. databases)

Large-scale, automated: non-interactive analysis pipelines (e.g. Apache Hadoop)

Large-scale, interactive: interactive "Big Data" tools, e.g. Apache Spark or Google BigQuery

Page 4: Analyzing data with docker   v4

So what's so difficult about data analysis?

Page 5: Analyzing data with docker   v4

Sharing Data & Tools

Page 6: Analyzing data with docker   v4

Reproducibility

Page 7: Analyzing data with docker   v4

Scaling

Page 8: Analyzing data with docker   v4

Enter Docker....

Page 9: Analyzing data with docker   v4

What is Docker?

A tool that allows us to deploy applications inside "software containers".

Containers work at the process level and isolate the view of the operating system (i.e. the processes, resources and files an application sees)

Provides a high-level API to manage, version-control, deploy and network containers.

Page 10: Analyzing data with docker   v4

Docker Core Concepts

(Diagram: CLI, Docker API, Docker Engine, Docker Swarm, Registry, Images, Containers)

Page 11: Analyzing data with docker   v4

Images Are Space-Efficient (or at least more efficient than VMs)

Page 12: Analyzing data with docker   v4

Containers Have Little Overhead

https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf

Page 13: Analyzing data with docker   v4

Containers Are Self-Sufficient

Page 14: Analyzing data with docker   v4

Containers Are "Lego" For Data Analytics!

(Diagram: a container connected to inputs, configuration, data, networked containers, and output)
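One way to read the "Lego" picture is as a mapping onto `docker run` options: inputs as a read-only volume, output as a writable volume, configuration as environment variables. The sketch below only builds the argument list and does not start anything. The host paths, the container paths `/data/in` and `/data/out`, the `N_WORKERS` variable and the script name are all hypothetical.

```python
def docker_run_command(image, input_dir, output_dir, config=None, command=()):
    """Build (but do not execute) a `docker run` argument list."""
    argv = ["docker", "run", "--rm"]
    # Inputs are mounted read-only so the container cannot modify them.
    argv += ["-v", f"{input_dir}:/data/in:ro"]
    # Results are written to a separate, writable mount.
    argv += ["-v", f"{output_dir}:/data/out"]
    # Configuration is passed as environment variables (sorted for stable output).
    for key, value in sorted((config or {}).items()):
        argv += ["-e", f"{key}={value}"]
    argv.append(image)
    argv.extend(command)
    return argv


cmd = docker_run_command(
    image="python:3",
    input_dir="/srv/logs",        # hypothetical host path
    output_dir="/srv/results",    # hypothetical host path
    config={"N_WORKERS": "4"},    # hypothetical setting
    command=["python", "/data/in/analyze.py"],  # hypothetical script
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run` would start the container on a machine with Docker installed.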

Page 15: Analyzing data with docker   v4

We Can Build Reproducible Data-Analysis Workflows With Them

(Diagram: workflow steps: Map Apache logs, Map Nginx logs, Aggregate results, Filtering, Monitoring, BI, Archiving)

Page 16: Analyzing data with docker   v4

Example: Analyzing GitHub Data

(Diagram: log files from GitHub, analysis script, analysis process(es), output)

Repository with code: https://github.com/adewes/docker-map-reduce-example
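A minimal sketch of what such an analysis script might look like, written as a map step per chunk of log lines and a reduce step that merges the partial results. This is not the code from the repository above. It assumes the log files contain one JSON event per line with a `type` field, and the sample chunks are made up.

```python
import json
from collections import Counter


def map_events(lines):
    """Map step: count event types in one chunk of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Two tiny, made-up chunks standing in for real log files.
chunk_a = ['{"type": "PushEvent"}', '{"type": "WatchEvent"}']
chunk_b = ['{"type": "PushEvent"}', "", '{"type": "ForkEvent"}']

totals = reduce_counts(map_events(chunk) for chunk in (chunk_a, chunk_b))
print(totals.most_common())
```

Because each map call only needs its own chunk, the map steps can run in separate processes or containers, and only the small counters need to be collected.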

Page 17: Analyzing data with docker   v4

Live Demo (fingers crossed)

Page 18: Analyzing data with docker   v4

Containerizing Our Analysis

(Diagram: log files from GitHub, analysis script, image, supervisor, several analysis containers, output)
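The supervisor's job can be sketched as: split the input files into partitions, start one analysis container per partition, and collect the results. The sketch below is an assumption about how this could be done, not the demo's actual code. The image name and file names are hypothetical, and the runner is a dry-run stand-in so that nothing is executed. It also assumes the image has an entrypoint that accepts file names as arguments and that the files are reachable inside the container.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts round-robin slices (some may be empty)."""
    return [items[i::n_parts] for i in range(n_parts)]


def container_command(image, files):
    """Argument list for one analysis container (not executed here)."""
    return ["docker", "run", "--rm", image, *files]


def supervise(image, files, n_containers, run):
    """Start one container per partition and collect the results in order."""
    commands = [
        container_command(image, part)
        for part in partition(files, n_containers)
        if part  # skip empty partitions
    ]
    with ThreadPoolExecutor(max_workers=n_containers) as pool:
        return list(pool.map(run, commands))


def dry_run(command):
    """Stand-in runner that only reports what it would execute."""
    return " ".join(command)


files = [f"2016-01-0{day}.json.gz" for day in range(1, 6)]  # hypothetical names
for line in supervise("analysis-image", files, n_containers=2, run=dry_run):
    print(line)
```

Swapping `dry_run` for a function that calls `subprocess.run(command, check=True)` would launch real containers. Threads are enough here because the supervisor only waits on external processes.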

Page 19: Analyzing data with docker   v4

Live demo (what could go wrong?)

Page 20: Analyzing data with docker   v4

Advantages:

Easy to share

Each analysis step is self-sufficient

Analysis components are "plug & play"

Easy to parallelize (for the right problems)

Versioning included

Disadvantages:

Requires preparing containers

Requires Docker on each machine

Slightly decreases interactivity & flexibility

Page 21: Analyzing data with docker   v4

Which Parts Are Missing?

Page 22: Analyzing data with docker   v4

Orchestration

Page 23: Analyzing data with docker   v4

Dependency Management

Page 24: Analyzing data with docker   v4

Resource Management

Page 25: Analyzing data with docker   v4

Rouster: A Python Tool for Containerized Data Analysis

Built on top of the Docker API

"Make for Docker"

Resource Management

Container Orchestration

Dependency Management

Page 26: Analyzing data with docker   v4

Rouster Uses Recipes to Describe Data Analysis Workflows

Resources (including dependencies): versioning, dependency calculation, backup / copying, distribution, ...

Services: startup (including dependencies), resource provisioning, networking, ...

Actions: scheduling, monitoring, exception handling, logging, ...
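The "Make for Docker" idea rests on dependency resolution: a step may only run once everything it depends on is available. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+). It is not Rouster's recipe format or API. The step names are hypothetical and loosely follow a CSV-to-Postgres import.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe: each entry maps a step to the steps it depends on.
recipe = {
    "csv_file": set(),                       # resource
    "postgres": set(),                       # service
    "import_csv": {"csv_file", "postgres"},  # action
    "report": {"import_csv"},                # action
}


def execution_order(dependencies):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(recipe)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the recipe contains a circular dependency, which is a useful check before any container is started.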

Page 27: Analyzing data with docker   v4

Live Demo: CSV -> Postgres

Page 28: Analyzing data with docker   v4

Open Questions

How to handle communication between containers (through files, network, ...)?

How to provide resources/data to containers in a distributed environment?

Page 29: Analyzing data with docker   v4

Other Relevant Technologies

Pachyderm

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Built on top of Kubernetes.

http://www.pachyderm.io

Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

https://github.com/spotify/luigi

Page 30: Analyzing data with docker   v4

Summary & Outlook

Containers are here to stay!

They are useful in various data analysis contexts.

They don't solve all our problems though.

We need additional tools to use them effectively.

Page 31: Analyzing data with docker   v4

Thanks!

Want to contribute?

https://github.com/7scientists/rouster

Andreas Dewes (@japh44)

Image Licenses:

https://commons.wikimedia.org/wiki/File:Matryoshka_dolls_(3671820040)_(2).jpg
https://pixabay.com/de/nordlichter-lager-zelt-abenteuer-1203289/
https://en.wikipedia.org/wiki/Orchestra
https://de.wikipedia.org/wiki/Graph_(Graphentheorie)
http://www.library.illinois.edu/prescons/disaster_response/high_density_storage_disaster_plan/
https://brookeborel.com/2011/06/02/363/
https://en.wikipedia.org/wiki/Data_sharing