Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark
Gavin Li, Jaebong Kim, Andy Feng (Yahoo)
Agenda
• Audience Expansion Spark application
• Spark scalability: problems and our solutions
• Performance tuning
AUDIENCE EXPANSION
How we built audience expansion on Spark
Audience Expansion
• Train a model to find users who behave similarly to sample users
• Find more potential “converters”
System
• Large-scale machine learning system
• Logistic regression
• TBs of input data, up to TBs of intermediate data
• Hadoop pipeline uses 30,000+ mappers, 2,000 reducers, 16 hrs run time
• All Hadoop streaming, ~20 jobs
• Use Spark to reduce latency and cost
Pipeline
Labeling
• Label positive/negative samples
• 6-7 hrs, IO intensive, 17 TB of intermediate IO in Hadoop
Feature Extraction
• Extract Features from raw events
Model Training
• Logistic regression phase, CPU bound
Score/Analyze models
• Validate trained models and parameter combinations, select new model
Validation/Metrics
• Validate and publish new model
How to adopt Spark efficiently?
• Very complicated system
• 20+ Hadoop streaming map-reduce jobs
• 20k+ lines of code
• TBs of data; person-months to do data validation
• 6+ people, 3 quarters to rewrite the system from scratch based on Scala
Our migration solution
• Build a transition layer that automatically converts Hadoop streaming jobs to Spark jobs
• No need to change any Hadoop streaming code
• 2 person-quarters
• Private Spark
[Architecture diagram: the Audience Expansion pipeline (20+ Hadoop Streaming jobs) runs unchanged on ZIPPO (Hadoop Streaming over Spark), on top of Spark and HDFS]
ZIPPO
• A layer (ZIPPO) between Spark and the application
• Implements all Hadoop Streaming interfaces
• Migrate the pipeline without rewriting code
• Can focus on rewriting perf bottlenecks
• Plan to open source
ZIPPO - Supported Features
• Partition related
  – Hadoop Partitioner class (-partitioner)
  – num.map.key.fields, num.map.partition.fields
• Distributed cache
  – -cacheArchive, -file, -cacheFile
• Independent working directory for each task instead of each executor
• Hadoop Streaming aggregation
• Input data combination (to mitigate many small files)
• Customized OutputFormat, InputFormat
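As a rough illustration of the contract such a layer must honor, here is a toy, hypothetical sketch (not ZIPPO's actual code) of running unmodified streaming-style commands over an in-memory partition, similar to how Spark's RDD.pipe() feeds data through an external process. The word-count mapper and reducer are illustrative stand-ins for real streaming binaries.

```python
import subprocess
import sys

def run_streaming(cmd, lines):
    """Pipe input lines through an external Hadoop-streaming-style process."""
    result = subprocess.run(cmd, input="\n".join(lines) + "\n",
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Streaming contract: read lines from stdin, emit "key\tvalue" lines on stdout.
mapper = [sys.executable, "-c",
          "import sys\n"
          "for line in sys.stdin:\n"
          "    for w in line.split():\n"
          "        print(w + '\\t1')"]

reducer = [sys.executable, "-c",
           "import sys\n"
           "from itertools import groupby\n"
           "rows = (l.rstrip('\\n').split('\\t') for l in sys.stdin)\n"
           "for k, g in groupby(rows, key=lambda r: r[0]):\n"
           "    print(k + '\\t' + str(sum(int(v) for _, v in g)))"]

partition = ["spark spark hadoop", "hadoop spark"]
mapped = run_streaming(mapper, partition)
shuffled = sorted(mapped)            # the shuffle: bring equal keys together
reduced = run_streaming(reducer, shuffled)
print(reduced)                       # ['hadoop\t2', 'spark\t3']
```

Because the mapper and reducer are opaque external commands, a transition layer only needs to reproduce this stdin/stdout contract plus the surrounding options (partitioner, distributed cache, formats) to run the pipeline unmodified.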
Performance Comparison (1 TB data)
• ZIPPO Hadoop streaming
  – Spark cluster: 1 hard drive, 40 hosts
  – Perf data: 1 hr 25 min
• Original Hadoop streaming
  – Hadoop cluster: 1 hard drive, 40 hosts
  – Perf data: 3 hrs 5 min
SPARK SCALABILITY
Spark Shuffle
• The mapper side of the shuffle writes all output to disk (shuffle files)
• The data can be large scale, so it cannot all be held in memory
• Reducers transfer all the shuffle files for each partition, then process them
Spark Shuffle
[Diagram: mappers 1…m each write shuffle files 1…n, one for every reducer partition 1…n]
On each Reducer
• Every partition needs to hold all the data from all the mappers
• In a hash map
• In memory
• Uncompressed
[Diagram: a 4-core reducer host holds partitions 1…4, each collecting shuffle output from mappers 1…n]
How many partitions?
• Need partitions small enough that each fits entirely in memory
[Diagram: a 4-core host cycling through partitions 1…n, far more partitions than cores]
Spark needs many Partitions
• So a common pattern when using Spark is to have a very large number of partitions
On each Reducer
• For a 64 GB memory host with a 16-core CPU
• With a 30:1 compression ratio and 2x in-memory overhead
• To process 3 TB of data, needs 46,080 partitions
• To process 3 PB of data, needs ~46 million partitions
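The slide's arithmetic can be reproduced directly (assuming binary units, the stated 30:1 compression ratio, and 2x in-memory overhead):

```python
# Reproduce the slide's partition count: per-core memory on a 64 GB /
# 16-core host, versus the uncompressed in-memory size of the data.
MEM_PER_CORE_GB = 64 / 16     # 4 GB of heap per reducer core
COMPRESSION_RATIO = 30        # 30:1 on-disk compression
OVERHEAD = 2                  # 2x in-memory (hash map) overhead

def partitions_needed(data_tb):
    uncompressed_gb = data_tb * 1024 * COMPRESSION_RATIO * OVERHEAD
    return int(uncompressed_gb / MEM_PER_CORE_GB)

print(partitions_needed(3))         # 46080 partitions for 3 TB
print(partitions_needed(3 * 1024))  # ~47 million for 3 PB
```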
Non Scalable
• Not linearly scalable: no matter how many hosts we have in total, we always need 46k partitions
Issues of huge number of partitions
• Issue 1: OOM on the mapper side
  – Each mapper core needs to write to 46k shuffle files simultaneously
  – 1 shuffle file = OutputStream + FastBufferStream + CompressionStream
  – Memory overhead:
    • FD and related kernel overhead
    • FastBufferStream (for turning random IO into sequential IO), default 100 KB buffer per stream
    • CompressionStream, default 64 KB buffer per stream
  – So by default, total buffer size: 164 KB * 46k * 16 = 100+ GB
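A quick back-of-the-envelope check of that buffer total, using the default buffer sizes above and the tuned 4 KB + 8 KB settings described in these slides:

```python
# Mapper-side buffer memory: each of 16 cores keeps one open stream per
# shuffle file, and each stream carries a write buffer plus a
# compression buffer. Partition and core counts are the slides'.
PARTITIONS = 46080
CORES = 16

def total_buffer_gb(stream_buffer_kb, codec_buffer_kb):
    per_file_kb = stream_buffer_kb + codec_buffer_kb
    return per_file_kb * PARTITIONS * CORES / 1024 ** 2

print(total_buffer_gb(100, 64))  # defaults (100k + 64k): ~115 GB -> OOM
print(total_buffer_gb(4, 8))     # tuned (4k buffer, 8k snappy): ~8.4 GB
```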
Issues of huge number of partitions
• Our solution to mapper OOM
  – Set spark.shuffle.file.buffer.kb to 4k (kernel block size) for FastBufferStream
  – Based on our contributed patch https://github.com/mesos/spark/pull/685
  – Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
  – Set spark.snappy.block.size to 8192 to reduce buffer size (while Snappy still gets a good compression ratio)
  – Total buffer size after this: 12 KB * 46k * 16 = 10 GB
Issues of huge number of partitions
• Issue 2: large number of small files
  – Each input split in the mapper is broken down into at least 46k partitions
  – Large numbers of small files cause lots of random R/W IO
  – When each shuffle file is smaller than 4 KB (kernel block size), the overhead becomes significant
  – Significant metadata overhead in the FS layer
  – Example: merely deleting the whole tmp directory manually can take 2 hours, because we have so many small files
  – Especially bad when splits are not balanced
  – 5x slower than Hadoop
[Diagram: each of n input splits produces shuffle files 1…46080]
Reduce side compression
• Currently, reducer-side shuffle data in memory is not compressed
• Can take 10-100x more memory
• With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, while compression overhead is less than 3%
• Without this patch, Spark doesn't work for our case
• 5x-10x performance improvement
Reduce side compression
• Reducer side
  – With compression: 1.6k shuffle files
  – Without compression: 46k shuffle files
Reducer Side Spilling
[Diagram: the reducer writes into compression buckets 1…n, which spill to disk as Spill 1, Spill 2, … Spill n]
Reducer Side Spilling
• Spills over-size data in the aggregation hash table to disk
• Spilling: more IO, more sequential IO, fewer seeks
• All in memory: less IO, more random IO, more seeks
• Fundamentally resolves Spark's scalability issue
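A minimal sketch of the spilling idea, assuming a sum aggregation (names and thresholds are illustrative; Spark's real implementation differs): aggregate in an in-memory hash table, spill sorted runs to disk when it grows too large, then merge the sorted runs sequentially.

```python
import heapq
import itertools
import pickle
import tempfile

def external_sum(pairs, max_in_mem=2):
    """Sum values per key, spilling the hash table to disk when over-size."""
    spills, table = [], {}
    for k, v in pairs:
        table[k] = table.get(k, 0) + v
        if len(table) > max_in_mem:            # over-size: spill to disk
            f = tempfile.TemporaryFile()
            pickle.dump(sorted(table.items()), f)   # sorted run
            f.seek(0)
            spills.append(f)
            table = {}
    # Merge the in-memory remainder with all on-disk runs (sequential IO).
    runs = [sorted(table.items())] + [pickle.load(f) for f in spills]
    merged = heapq.merge(*runs)
    return [(k, sum(v for _, v in g))
            for k, g in itertools.groupby(merged, key=lambda kv: kv[0])]

print(external_sum([("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)],
                   max_in_mem=2))             # [('a', 3), ('b', 4), ('c', 1)]
```

The trade-off in the bullets above shows up directly: the spill path pays extra IO, but it is sequential writes and a sequential merge, instead of an ever-growing in-memory table.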
Align with previous Partition function
• Our input data comes from another map-reduce job
• We use exactly the same hash function to reduce the number of shuffle files
[Diagram: previous job generating input data feeds the Spark job]
Align with previous Partition function
• New hash function, more even distribution

[Diagram: the previous job partitions keys with mod 4 (keys 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…); the Spark job repartitions with mod 5, so each input partition produces 5 shuffle files]
Align with previous Partition function
• Use the same hash function
[Diagram: input data partitioned with mod 4 (keys 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…); the Spark job also uses mod 4, so each input partition produces exactly 1 shuffle file]
Align with previous Hash function
• Our case:
  – 16M shuffle files, 62 KB on average (5-10x slower)
  – 8k shuffle files, 125 MB on average
• Several different input data sources
• Partition function taken from the major one
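The fan-out effect of a misaligned partition function can be demonstrated with a small sketch (illustrative only, not the production hash function):

```python
# Count how many Spark shuffle files each upstream partition fans out to,
# for a given upstream hash (mod upstream_mod) and Spark hash (mod spark_mod).
def shuffle_files(upstream_mod, spark_mod, keys=range(100)):
    fanout = {}
    for k in keys:
        fanout.setdefault(k % upstream_mod, set()).add(k % spark_mod)
    return {p: len(targets) for p, targets in fanout.items()}

print(shuffle_files(4, 5))   # misaligned: every partition fans out to 5 files
print(shuffle_files(4, 4))   # aligned: exactly 1 shuffle file per partition
```

With misaligned functions, the shuffle-file count multiplies (the 16M-files case above); reusing the upstream function collapses it to one file per partition.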
PERFORMANCE TUNING
All About Resource Utilization
• Maximize resource utilization
• Use as much CPU, memory, disk, and network as possible
• Monitor vmstat, iostat, sar
Resource Utilization
• Ideally CPU/IO should be fully utilized
• Mapper phase – IO bound
• Final reducer phase – CPU bound
Shuffle file transfer
• Spark transfers all shuffle files into reducer memory before it starts processing
• Non-streaming (very hard to change to streaming)
• To avoid poor resource utilization:
  – Make sure maxBytesInFlight is set big enough
  – Consider allocating 2x more threads than the physical core count
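A toy model (not Spark's actual fetch logic; block sizes and budgets are made up) of why too small a maxBytesInFlight budget serializes shuffle-block fetches:

```python
# Model a reducer fetching shuffle blocks under an in-flight byte budget:
# blocks are batched until the budget is full, and each batch costs one
# round trip. A tiny budget degenerates into one round per block.
def fetch_rounds(block_sizes, max_bytes_in_flight):
    rounds, batch = 0, 0
    for size in block_sizes:
        if batch + size > max_bytes_in_flight and batch > 0:
            rounds += 1          # wait for the current batch to land
            batch = 0
        batch += size
    return rounds + (1 if batch else 0)

blocks = [48] * 8                # eight 48 MB shuffle blocks (made-up sizes)
print(fetch_rounds(blocks, 48))  # tiny budget: 8 sequential rounds
print(fetch_rounds(blocks, 192)) # larger budget: 2 rounds
```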
Thanks.
Gavin Li [email protected]
Jaebong Kim [email protected]
Andy Feng [email protected]