Analyzing data with docker v4

Analyzing Data With Docker

Andreas Dewes (@japh44)

EuroPython 2016 - Bilbao

Outline

Data Analysis: Small & Large-Scale, Easy & Difficult

Introduction To Docker

Containerizing our Data Analysis

Possible Approaches

Relevant Technologies & Outlook

Data Analysis: Use Cases

Small-scale, interactive: UI-based analysis (e.g. IPython notebook)

Small-scale, automated: analysis scripts using local data sources (e.g. databases)

Large-scale, automated: non-interactive analysis pipelines (e.g. Apache Hadoop)

Large-scale, interactive: “Big Data” tools, e.g. Apache Spark or Google BigQuery

So what's so difficult about data analysis?

Sharing Data & Tools

Reproducibility

Scaling

Enter Docker...

What is Docker?

A tool that allows us to deploy applications inside "software containers".

Containers work at the process level and isolate the view of the operating system (i.e. the processes, resources and files an application sees)

Provides a high-level API to manage, version-control, deploy and network containers.

Docker Swarm

Docker Core-Concepts

[Diagram: CLI, Docker API, Docker Engine, Registry, Images, Containers]

Images Are Space-Efficient (or at least more efficient than VMs)

Containers Have Little Overhead

https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf

Containers Are Self-Sufficient

Containers Are "Lego" For Data Analytics!

[Diagram: a container with inputs, configuration and data; it produces output and can be linked to networked containers]
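The "Lego" picture can be made a little more concrete with a small sketch. It only builds a `docker run` command, it does not execute it. The image name, the mount points `/data/in` and `/data/out`, and the helper name `docker_run_command` are assumptions made for illustration; they do not come from the talk.

```python
def docker_run_command(image, inputs_dir, output_dir, config=None, network=None):
    """Build (but do not execute) a `docker run` command for one analysis step.

    Hypothetical helper: the container-side paths /data/in and /data/out
    are an assumed convention, not something Docker prescribes.
    """
    cmd = ["docker", "run", "--rm"]
    # Inputs are mounted read-only, the output directory read-write.
    cmd += ["-v", f"{inputs_dir}:/data/in:ro"]
    cmd += ["-v", f"{output_dir}:/data/out"]
    # Configuration is passed as environment variables, in a stable order.
    for key, value in sorted((config or {}).items()):
        cmd += ["-e", f"{key}={value}"]
    # Optionally attach the container to a network shared with other containers.
    if network:
        cmd += ["--network", network]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    print(" ".join(docker_run_command(
        "analysis:1.0", "/tmp/in", "/tmp/out", {"MODE": "count"})))
```

The resulting list could be handed to `subprocess.run` on a machine where Docker is installed.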

We Can Build Reproducible Data-Analysis Workflows With Them

[Diagram: workflow of containers: Map Apache logs, Map Nginx logs, Aggregate results, Filtering, Monitoring, BI, Archiving]
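A minimal sketch of the map, aggregate and filter steps of such a workflow, written as plain Python functions that could each run in their own container. It assumes access logs in the common/combined log format and counts HTTP status codes; the function names and the choice of statistic are illustrative assumptions, not part of the talk.

```python
import re
from collections import Counter

# Matches the request and status fields of a common/combined log line, e.g.
# 1.2.3.4 - - [10/Oct/2016:13:55:36 +0200] "GET / HTTP/1.1" 200 2326
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def map_log(lines):
    """Map step: count status codes in one log, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


def aggregate(partials):
    """Aggregate step: merge the partial counts of all map steps."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def filter_errors(counts):
    """Filter step: keep only server errors (5xx)."""
    return {status: n for status, n in counts.items() if status.startswith("5")}
```

Because every step only consumes and produces plain data, the steps can be recombined freely, for example by feeding the aggregated counts into a monitoring or archiving step.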

Example: Analyzing GitHub Data

[Diagram: log files from GitHub → analysis script, run by one or more analysis processes → output]

Repository with code: https://github.com/adewes/docker-map-reduce-example
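The following sketch shows what such an analysis script might look like. It is not the code from the repository above. It assumes that each log file contains one JSON event per line with a `type` field, and it counts events per type; the function names are hypothetical.

```python
import gzip
import json
from collections import Counter


def count_event_types(path):
    """Map step: count the event types found in one log file."""
    # Assumption: files are newline-delimited JSON, optionally gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts


def analyze(paths, map_fn=map):
    """Reduce step: merge the per-file counts produced by map_fn."""
    total = Counter()
    for partial in map_fn(count_event_types, paths):
        total.update(partial)
    return total
```

With the default `map_fn` the files are processed one after another. Passing the `map` method of a `multiprocessing.Pool` instead distributes the files over several analysis processes without changing the rest of the script.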

Live Demo (fingers crossed)

Containerizing Our Analysis

[Diagram: a supervisor starts several analysis containers from one image; each container runs the analysis script on log files from GitHub and writes output]
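A rough sketch of what the supervisor could do: split the log files into chunks, start one container per chunk, and merge the partial results. The callable `run_container` is a hypothetical placeholder for "start an analysis container for these files and return its counts"; how it talks to Docker is deliberately left open here.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_chunks):
    """Split items round-robin into at most n_chunks non-empty chunks."""
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def supervise(files, n_containers, run_container):
    """Run one container per chunk of files and merge the partial counts.

    run_container(chunk) is a hypothetical callable that blocks until the
    container has finished and returns a dict mapping keys to counts.
    """
    chunks = partition(list(files), n_containers)
    # Threads suffice here: each one mostly waits for its container to finish.
    with ThreadPoolExecutor(max_workers=len(chunks) or 1) as pool:
        partials = list(pool.map(run_container, chunks))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged
```

Keeping `run_container` as a parameter makes the supervisor testable without Docker, since a plain Python function can stand in for the container.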

Live demo (what could go wrong?)

Advantages

Easy to share

Each analysis step is self-sufficient

Analysis components are "plug & play"

Easy to parallelize (for the right problems)

Versioning included

Disadvantages

Requires preparing containers

Requires Docker on each machine

Slightly decreases interactivity & flexibility

Which Parts Are Missing?

Orchestration

Dependency Management

Resource Management

Rouster: A Python Tool for Containerized Data Analysis

Built on top of the Docker API

"Make for Docker"

Resource Management

Container Orchestration

Dependency Management

Rouster Uses Recipes to Describe Data Analysis Workflows

Resources (including dependencies): versioning, dependency calculation, backup / copying, distribution, ...

Services: startup (including dependencies), resource provisioning, networking, ...

Actions: scheduling, monitoring, exception handling, logging, ...
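The "Make for Docker" idea boils down to resolving dependencies before anything is started. The sketch below illustrates this with a topological sort from the standard library (Python 3.9+). The recipe layout, the `depends_on` key and all entry names are invented for illustration and are not Rouster's actual recipe format.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe, loosely modeled on a CSV -> Postgres workflow.
RECIPE = {
    "resources": {"csv_data": {"depends_on": []}},
    "services": {"postgres": {"depends_on": []}},
    "actions": {
        "import_csv": {"depends_on": ["csv_data", "postgres"]},
        "report": {"depends_on": ["import_csv"]},
    },
}


def execution_order(recipe):
    """Return all recipe entries in an order that respects their dependencies."""
    graph = {}
    for section in ("resources", "services", "actions"):
        for name, spec in recipe.get(section, {}).items():
            # Each node maps to the set of entries that must come first.
            graph[name] = set(spec.get("depends_on", []))
    # static_order() raises graphlib.CycleError for circular dependencies.
    return list(TopologicalSorter(graph).static_order())
```

Any order returned by `execution_order(RECIPE)` provisions `csv_data` and starts `postgres` before `import_csv` runs, and runs `report` last.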

Live Demo: CSV -> Postgres

Open Questions

How to handle communication between containers (through files, network, ...)?

How to provide resources/data to containers in a distributed environment?
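For the file-based option, one possible convention is a directory shared between containers (for example a mounted volume) into which each container publishes its result as a JSON file. The sketch below is only one way to do this under that assumption; the function names and the `.json` naming scheme are hypothetical.

```python
import json
import os
import tempfile


def publish_result(shared_dir, name, result):
    """Write a result so that readers never observe a half-written file."""
    # Write to a temporary file in the same directory first ...
    fd, tmp_path = tempfile.mkstemp(dir=shared_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    # ... then rename it into place; the rename stays within one directory.
    final_path = os.path.join(shared_dir, name + ".json")
    os.replace(tmp_path, final_path)
    return final_path


def collect_results(shared_dir):
    """Read every published result, keyed by the name it was published under."""
    results = {}
    for entry in sorted(os.listdir(shared_dir)):
        if entry.endswith(".json"):
            path = os.path.join(shared_dir, entry)
            with open(path, encoding="utf-8") as handle:
                results[entry[:-len(".json")]] = json.load(handle)
    return results
```

This leaves the second question open: in a distributed environment the shared directory itself would have to live on storage that all machines can reach.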

Other relevant technologies

Pachyderm

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Built on top of Kubernetes.

http://www.pachyderm.io

Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

https://github.com/spotify/luigi

Summary & Outlook

Containers are here to stay!

They are useful in various data analysis contexts.

They don't solve all our problems though.

We need additional tools to use them effectively.

Thanks!

Want to contribute?

https://github.com/7scientists/rouster

Andreas Dewes (@japh44)

Image Licenses:

https://commons.wikimedia.org/wiki/File:Matryoshka_dolls_(3671820040)_(2).jpg

https://pixabay.com/de/nordlichter-lager-zelt-abenteuer-1203289/

https://en.wikipedia.org/wiki/Orchestra

https://de.wikipedia.org/wiki/Graph_(Graphentheorie)

http://www.library.illinois.edu/prescons/disaster_response/high_density_storage_disaster_plan/

https://brookeborel.com/2011/06/02/363/

https://en.wikipedia.org/wiki/Data_sharing
