Best Practices for Using Alluxio with Spark Gene Pang, Alluxio, Inc. Spark Summit EU - October 2017

About Me

•  Gene Pang

•  Software engineer @ Alluxio, Inc.

•  Alluxio open source PMC member

•  Ph.D. from AMPLab @ UC Berkeley

•  Worked at Google before UC Berkeley

•  Twitter: @unityxx

•  Github: @gpang

©2017 Alluxio, Inc. All Rights Reserved 2

Outline

1. Alluxio Overview

2. Alluxio + Spark Use Cases

3. Alluxio Architecture

4. Using Spark with Alluxio

5. Experiments

Data Ecosystem Yesterday

•  One Compute Framework

•  Single Storage System

•  Co-located

Data Ecosystem Today

•  Many Compute Frameworks

•  Multiple Storage Systems

•  Most not co-located

Data Ecosystem Issues

•  Each application manages multiple data sources

•  Adding/removing data sources requires application changes

•  Storage optimizations require application changes

•  Lower performance due to lack of locality

Data Ecosystem with Alluxio

•  Apps only talk to Alluxio

•  Simple Add/Remove

•  No App Changes

•  Memory Performance

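[Figure: Alluxio between applications and storage. Application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, Fuse Compatible File System. Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface.]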

Next Gen Analytics with Alluxio

Apps, Data & Storage at Memory Speed

•  Big Data/IoT

•  AI/ML

•  Deep Learning

•  Cloud Migration

•  Multi Platform

•  Autonomous

Fastest Growing Big Data Open Source Projects

Fastest Growing open-source project in the big data ecosystem

Running in large production clusters

600+ Contributors from 100+ organizations

[Figure: Github Open Source Contributors by Month. Y-axis: Number of Contributors (0 to 500); x-axis ticks from 0 to 45. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]

Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.

[Figure: architecture diagram with Spark and Teradata.]

Solution – ETL data from Teradata to Alluxio

Impact – Faster Time to Market – “Now we don’t have to work Sundays”

http://bit.ly/2oMx95W

Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.

[Figure: architecture diagram with Spark and Baidu File System.]

Solution – With Alluxio, data queries are 30X faster

Impact – Higher operational efficiency

http://bit.ly/2pDHS3O

Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data for $5B Travel Site. Queries were slow / not interactive, resulting in operational inefficiency.

[Figure: architecture diagram with Spark, Flink, HDFS, and Ceph.]

Solution – With Alluxio, 300x improvement in performance

Impact – Increased revenue from immediate response to user behavior

Use case: http://bit.ly/2pDJdrq

Machine Learning Case Study –

Challenge – Disparate data both on-prem and cloud. Heterogeneous types of data. Scaling of exabyte-size data. Slow due to disk-based approach.

[Figure: architecture diagram with Spark, HDFS, Minio, and Mesos.]

Solution – Using Alluxio to prevent I/O bottlenecks

Impact – Orders of magnitude higher performance than before.

http://bit.ly/2p18ds3

Alluxio Architecture

[Figure: architecture diagram showing an Application with the Alluxio Client, the Alluxio Master, two Alluxio Workers, and two Storage systems.]

Alluxio Client

Applications interact with Alluxio via the Alluxio client

•  Java Native Alluxio Filesystem Client
   •  Alluxio-specific operations like [un]pin, [un]mount, [un]set TTL

•  HDFS-Compatible Filesystem Client
   •  No code change necessary

•  S3 API
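
With the HDFS-compatible client, pointing an existing job at Alluxio is typically a matter of changing the path string. The sketch below is illustrative only: the helper name `to_alluxio_uri`, the host name `alluxio-master`, and the assumption that the under-storage directory is mounted at the same path in the Alluxio namespace are not from the talk. 19998 is the default Alluxio master RPC port.

```python
from urllib.parse import urlsplit

ALLUXIO_DEFAULT_PORT = 19998  # default Alluxio master RPC port


def to_alluxio_uri(path, master_host="alluxio-master", port=ALLUXIO_DEFAULT_PORT):
    """Rewrite an hdfs:// (or scheme-less) path as an alluxio:// URI.

    Assumption: the under-storage directory is mounted at the same
    path inside the Alluxio namespace. `master_host` is a placeholder.
    """
    parts = urlsplit(path)
    file_path = parts.path or "/"
    if not file_path.startswith("/"):
        file_path = "/" + file_path
    return f"alluxio://{master_host}:{port}{file_path}"
```

An existing call such as `sc.textFile("hdfs://namenode:8020/data/events")` would then read `sc.textFile(to_alluxio_uri("hdfs://namenode:8020/data/events"))`, with the rest of the job unchanged, provided the Alluxio client jar is on Spark's classpath.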

Alluxio Master

Master is responsible for managing metadata

•  Filesystem namespace metadata

•  Blocks / workers metadata

Primary master writes journal for durable operations

•  Secondary masters replay journal entries
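
The journal idea can be illustrated with a toy model. This is a sketch of the general write-ahead-and-replay pattern, not Alluxio's actual journal format or code: the class names, the tuple-shaped entries, and the shared Python list standing in for durable storage are all assumptions made for illustration.

```python
class ToyMaster:
    """Toy in-memory namespace; not Alluxio's real data structures."""

    def __init__(self):
        self.namespace = {}  # path -> metadata dict

    def apply(self, entry):
        # Apply one journal entry to the in-memory namespace.
        op, path, meta = entry
        if op == "create":
            self.namespace[path] = dict(meta)
        elif op == "delete":
            self.namespace.pop(path, None)
        else:
            raise ValueError(f"unknown journal op: {op}")


class ToyPrimary(ToyMaster):
    def __init__(self, journal):
        super().__init__()
        self.journal = journal  # append-only list standing in for durable storage

    def create(self, path, **meta):
        entry = ("create", path, meta)
        self.journal.append(entry)  # record the operation first...
        self.apply(entry)           # ...then update in-memory state

    def delete(self, path):
        entry = ("delete", path, {})
        self.journal.append(entry)
        self.apply(entry)


class ToySecondary(ToyMaster):
    def __init__(self, journal):
        super().__init__()
        self.journal = journal
        self.next_index = 0  # position of the next entry to replay

    def catch_up(self):
        # Replay every journal entry not yet applied locally.
        while self.next_index < len(self.journal):
            self.apply(self.journal[self.next_index])
            self.next_index += 1
```

After the primary performs some operations and the secondary calls `catch_up()`, both hold the same namespace, which is what lets a secondary take over with current metadata.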

Alluxio Worker

Worker is responsible for managing block data

Worker stores block data on various storage media

•  HDD, SSD, Memory

Reads and writes data to underlying storage systems

Sharing Data via Memory

Storage Engine & Execution Engine Same Process

•  Two copies of data in memory – double the memory used

•  Sharing Slowed Down by Network / Disk I/O

[Figure: two Spark jobs, each with Spark Compute and Spark Storage holding its own copy of block 1 and block 3, both reading from HDFS / Amazon S3 (blocks 1 to 4).]

Sharing Data via Memory

Storage Engine & Execution Engine Different process

•  Half the memory used

•  Sharing Data at Memory Speed

[Figure: two Spark jobs (Spark Compute, Spark Storage) reading block 1, block 3 and block 4 from Alluxio, which sits in front of HDFS / Amazon S3 (blocks 1 to 4).]

Data Resilience During Crash

Storage Engine & Execution Engine Same Process

[Figure: Spark Compute and Spark Storage (block 1, block 3) in one process, reading from HDFS / Amazon S3 (blocks 1 to 4).]

Data Resilience During Crash

Storage Engine & Execution Engine Same Process

•  Process Crash Requires Network and/or Disk I/O to Re-read Data

[Figure: after the crash, the blocks held in Spark Storage (block 1, block 3) are gone; only the copies in HDFS / Amazon S3 (blocks 1 to 4) remain.]

Data Resilience During Crash

Storage Engine & Execution Engine Different process

[Figure: Spark Compute and Spark Storage reading block 1, block 3 and block 4 from Alluxio, a separate process in front of HDFS / Amazon S3 (blocks 1 to 4).]

Data Resilience During Crash

Storage Engine & Execution Engine Different process

•  Process Crash – Data is Re-read at Memory Speed

[Figure: the Spark process has crashed; block 1, block 3 and block 4 are still held in Alluxio, in front of HDFS / Amazon S3 (blocks 1 to 4).]

Accessing Alluxio Data From Spark

Writing Data

•  Write to an Alluxio file

Reading Data

•  Read from an Alluxio file

Code Example for Spark RDDs

Writing RDD to Alluxio

    rdd.saveAsTextFile(alluxioPath)
    rdd.saveAsObjectFile(alluxioPath)

Reading RDD from Alluxio

    rdd = sc.textFile(alluxioPath)
    rdd = sc.objectFile(alluxioPath)
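
The `saveAsObjectFile` / `objectFile` calls belong to the Scala and Java RDD API. As a hedged PySpark sketch of the same write-then-read pattern, the helpers below use `saveAsTextFile` / `textFile` and the pickle-based `saveAsPickleFile` / `pickleFile`. The helper names are illustrative assumptions, and the Alluxio client jar is assumed to be on Spark's classpath.

```python
def text_round_trip(sc, rdd, alluxio_path):
    """Write an RDD of strings to Alluxio as text and read it back.

    `sc` is a SparkContext. saveAsTextFile fails if the output
    directory already exists, so use a fresh path for each run.
    """
    rdd.saveAsTextFile(alluxio_path)
    return sc.textFile(alluxio_path)


def pickle_round_trip(sc, rdd, alluxio_path):
    """Write an RDD of Python objects to Alluxio and read it back.

    This is the PySpark counterpart of the object-file calls on the slide.
    """
    rdd.saveAsPickleFile(alluxio_path)
    return sc.pickleFile(alluxio_path)
```

With a live context, a call might look like `text_round_trip(sc, sc.parallelize(["a", "b"]), "alluxio://alluxio-master:19998/tmp/lines")`, where the host name and path are placeholders.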

Code Example for Spark DataFrames

Writing to Alluxio

    df.write.parquet(alluxioPath)

Reading from Alluxio

    df = spark.read.parquet(alluxioPath)
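
As a minimal PySpark sketch of the DataFrame case, the helpers below build an `alluxio://` path and perform a Parquet write followed by a read. The function names and the host name used in the usage note are assumptions for illustration; 19998 is the default Alluxio master RPC port, and `spark` is expected to be a `SparkSession`.

```python
def alluxio_uri(master_host, path, port=19998):
    """Build an alluxio:// URI; 19998 is the default master RPC port."""
    return f"alluxio://{master_host}:{port}/{path.lstrip('/')}"


def parquet_round_trip(spark, df, path):
    """Write a DataFrame to Alluxio as Parquet, then read it back.

    `spark` is a SparkSession and `df` a DataFrame. mode("overwrite")
    replaces any existing output at `path`; drop it to fail instead.
    """
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)
```

For example, `parquet_round_trip(spark, df, alluxio_uri("alluxio-master", "/data/out.parquet"))` writes through Alluxio and returns a new DataFrame backed by the Alluxio copy.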

Deploying Alluxio with Spark

[Figure: two deployment diagrams, each showing Spark, Alluxio, and Storage.]

Colocate Alluxio Workers with Spark for optimal I/O performance

Deploy Alluxio between Spark and Storage
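
For Spark to resolve `alluxio://` paths, the Alluxio client jar has to be visible to both the driver and the executors. One common way to do that is through the standard Spark properties `spark.driver.extraClassPath` and `spark.executor.extraClassPath`. The sketch below only builds those settings; the function names and the jar location are hypothetical placeholders, not values given in the talk.

```python
def alluxio_spark_conf(client_jar):
    """Return Spark properties that put the Alluxio client jar on the classpath.

    `client_jar` is the local path to the Alluxio client jar on every
    Spark node; the path used in examples is a placeholder.
    """
    return {
        "spark.driver.extraClassPath": client_jar,
        "spark.executor.extraClassPath": client_jar,
    }


def to_spark_defaults(conf):
    """Render properties as lines suitable for spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))
```

Calling `to_spark_defaults(alluxio_spark_conf("/path/to/alluxio-client.jar"))` yields two lines that can be appended to `spark-defaults.conf`; the same keys can also be passed with `--conf` on `spark-submit`.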

Experiments

Spark 2.2.0 + Alluxio 1.6.0

Single worker: Amazon r3.2xlarge

Compare reading cached parquet files

Reading Cached DataFrame (parquet)

New Context: 50 GB DataFrame (S3)

6x – 8x speedup

Conclusion

Easy to use Alluxio with Spark

Alluxio enables improved I/O performance

Easily interact with various storage systems with Alluxio

Thank you!

Gene Pang [email protected]
Twitter: @unityxx

Social Media: Twitter.com/alluxio, Linkedin.com/alluxio

Website: www.alluxio.com

E-mail: [email protected]