Migration from Redshift to Spark/Presto
Sky Yin, Data scientist @ Stitch Fix

Open Air 2016 Mini Talk



// Stitch Fix is an online personal clothes shopping service

// We recommend clothes through a combination of human stylists and algorithms

Company

// Data scientist in the merchandising algorithm team

// Former data engineer at Shazam

Me

Redshift: Managed data warehouse from AWS

Spark: Distributed general-purpose computing engine

Presto: Distributed SQL query engine

// Fast

// Familiar to use: SQL

// Managed by AWS

// Expensive

// Can scale up and down on demand*

Redshift

// Too many people querying in the morning

// Production pipelines and ad-hoc queries together on the same cluster

Trouble

But Redshift is scalable. Why not just scale it up in the morning and back down afterwards?

Scalable ≠ Easy to scale

// One for production, one for ad-hoc

// Sync between the two clusters

// Too expensive

// Will hit the scalability problem again

Solution: yet another Redshift cluster?

// Presto for light ad-hoc queries

// Spark for heavy jobs

// Store data in S3 as the single source of truth

// Running on EMR: scale up and down quickly

Solution
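
The split on this slide can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the `Query` class, the `choose_engine` and `table_location` helpers, the 50 GB cutoff, the `warehouse/` path layout, and the idea of sending production pipelines to Spark are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cutoff between "light" and "heavy"; the talk gives no number.
LIGHT_SCAN_LIMIT_GB = 50.0


@dataclass(frozen=True)
class Query:
    sql: str
    estimated_scan_gb: float     # assumed to be known from table metadata
    is_production: bool = False  # assumption: pipelines count as heavy jobs


def choose_engine(query: Query) -> str:
    """Return 'spark' for heavy or production work, 'presto' otherwise."""
    if query.is_production or query.estimated_scan_gb > LIGHT_SCAN_LIMIT_GB:
        return "spark"
    return "presto"


def table_location(bucket: str, table: str) -> str:
    """Build the shared S3 path for a table (the layout is hypothetical).

    Both engines read the same location, so there is no second copy
    of the data to keep in sync.
    """
    return f"s3://{bucket}/warehouse/{table}/"


if __name__ == "__main__":
    light = Query("SELECT count(*) FROM orders WHERE day = current_date", 0.5)
    heavy = Query("SELECT * FROM orders JOIN items USING (order_id)", 800.0)
    print(choose_engine(light), choose_engine(heavy))
    print(table_location("example-bucket", "orders"))
```

Because both engines point at the same S3 path, adding or removing compute on EMR does not involve moving data, which is what makes scaling up in the morning and back down later cheap compared with a second Redshift cluster.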