31
www.mimeria.com DataOps in practice 2019-03-14 Lars Albertsson Mimeria 1

Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

DataOps in practice

2019-03-14Lars Albertsson

Mimeria1

Page 2: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Friction to production

2

Which NoSQL database would you recommend for this use case? It won’t matter. Use something

that you are used to. MySQL, Oracle?

Well, I’d prefer something else.

?

If we use an RDBMS, Ops have rules and opinions that slow us down.

Page 3: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Distance to production

3

What are your data scientists up to?

They have received a data dump and built a model.

We will hand it off to a developer team, who hands it off to operations when the model is translated to Java.

Great, that is the first 1%. What’s next?

Page 4: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Risky operations

4

How to I test the pipeline?

You temporarily change the output path and run manually.

Don’t do that.

What if I forget to change path?

Page 7: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Properties of disruptors

7

Sustained ROI from machine

learning

Short time from idea to production

Homogeneous data platform

Data internally democratised

Major cloud + open source

Organisation aligned by use

case

Data processing pipelines

Diverse teams: data science,

dev, ops

Page 8: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

DataOps

8

Sustained ROI from machine

learning

Short time from idea to production

Homogeneous data platform

Data internally democratised

Major cloud + open source

Organisation aligned by use

case

Data processing pipelines

Diverse teams: data science,

dev, ops

Purpose

Context

Means

Building data processing software

Running data processing software

Page 9: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Big data - a collaboration paradigm

9

Stream storage

Data lake

Data democratised

Page 10: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Data pipelines

10

Data lake

Page 11: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Nearline

● Stream storage (Kafka)● Asynchronous event

processing● 10 ms - 1 hour

Data integration timescales

11

Job

Stream

Offline

● File storage (Hadoop)● Asynchronous batch

processing● 10 minutes -

Online

● SOA / microservices● Synchronous RPC● 1-100 ms

Stream

Job

Stream

Page 12: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Upgrade

● Careful rollout● Risk of user impact● Proactive QA

Operational manoeuvres - online

12

Service failure

● User impact● Data loss● Cascading outage

Bug

● User impact● Data corruption● Cascading corruption

Page 13: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Data platform overview

13

Data lake

Cold store

Service

Service

Onlineservices

Offlinedata platform

Batchprocessing

Page 14: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Data platform overview

14

Data lake

Cold store

DatasetJob

Service

Service

Onlineservices

Offlinedata platform

Batchprocessing

Page 15: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Data platform overview

15

Data lake

Cold store

DatasetPipeline

Service

Service

Onlineservices

Offlinedata platform

Job

Workfloworchestration

(Luigi, Airflow)

Onlineservices

Service

Data feature

Page 16: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Operational manoeuvres - offline

16

Upgrade

● Instant rollout● No user impact● Reactive QA

Service failure

● Pipeline delay● No data loss● No downstream impact

Bug

● Temporary data corruption

● Downstream impact

Page 17: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Production critical upgrade

17

● Dual datasets during transition

● Run downstream parallel pipelines

○ Cheap

○ Low risk

○ Easy rollback

● Easy to test end-to-end

○ Upstream team can do the change No dev & staging environment needed!

∆?

Page 18: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Life of an error, batch pipelines

18

● Faulty job, emits bad data

1. Revert serving datasets to old2. Fix bug3. Remove faulty datasets4. Backfill is automatic (Luigi)Done!

● Low cost of error

○ Reactive QA

○ Production environment sufficient

Page 19: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Operational manoeuvres - nearline

19

Upgrade

● Swift rollout● Parallel pipelines● User impact, QA?

Service failure

● Pipeline delay● No data loss● Downstream impact?

Bug

● Data corruption● Downstream impact

Job

Stream

Stream

Job

Stream

Job

Stream

Stream

Job

Stream

Job

Stream

Stream

Job

Stream

Page 20: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Life of an error, streaming

20● Works for a single job, not pipeline. :-(

Job

StreamStream Stream

Stream Stream Stream

Job

Job

Stream Stream Stream

Job

Job Job

Reprocessing in Kafka Streams

Page 21: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Deployment example, cloud native

21

source repo Luigi DSL, jars, config

my-pipe:7

Luigidaemon

WorkerWorker

WorkerWorker

WorkerWorker

WorkerWorker

Redundant cron schedule, higher frequency

kind: CronJobspec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily"

Docker image Docker registry

S3 / GCS

Dataproc / EMR

Page 22: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Monitoring timeliness, examples● Datamon - Spotify internal● Twitter Ambrose (dead?)● Airflow

22

Page 23: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com23

Measuring correctness: counters● Processing tool (Spark/Hadoop) counters

○ Odd code path => bump counter○ System metrics

Hadoop / Spark counters DB

Standard graphing tools

Standard alerting service

Page 24: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com24

Measuring correctness: pipelines● Processing tool (Spark/Hadoop) counters

○ Odd code path => bump counter○ System metrics

● Dedicated quality assessment pipelines

DB

Quality assessment job

Quality metadataset (tiny)

Standard graphing tools

Standard alerting service

Page 25: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com25

Machine learning operations● Multiple trained models

○ Select at run time● Measure user behaviour

○ E.g. session length, engagement, funnel● Ready to revert to

○ old models○ simpler models

Measure interactionsRendez-vous

DB

Standard alerting service

Stream Job

Page 26: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Nearline

Data processing tradeoff

26

Job

Stream

OfflineOnline

Stream

Job

Stream

Data speed Innovation speed

Page 27: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Bonus slides

27

Page 28: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Machine learning products

28

Configuration Data collection Monitoring

Size = effort Credits: “Hidden Technical Debt inColour = code complexity Machine Learning Systems”,

Google, NIPS 2015

Serving infrastructure

Feature extraction Process management tools

Analysis tools

Machine resource

management

Data verification

ML

Page 29: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Data science

Mature machine learning

29

Configuration Data collection Monitoring

Serving infrastructure

Feature extraction Process management tools

Analysis tools

Machine resource

management

Data verification

ML

DataOps

Page 30: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Complex business logic - MDM @ Spotify● 10 pipelines like this● Pipeline dev environment● Pipeline continuous deployment

infrastructure

One team of five engineers

30

Page 31: Mimeria Lars Albertsson DataOps in practice 2019-03-14 · Complex business logic - all Hadoop @ Spotify 2K unique jobs, 20K daily ~100 teams Almost all features involve lake Multiple

www.mimeria.com

Complex business logic - all Hadoop @ Spotify● 2K unique jobs, 20K daily● ~100 teams● Almost all features involve lake● Multiple processing tools● Homogeneous infrastructure

○ Storage○ Workflow management

● 2500 nodes, 50K cores, 100+TB mem, 100+PB store

31

● Migrating to Google cloud● 750 BigQuery users● 3M queries = 500 PB / month