Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark
Gavin Li, Jaebong Kim, Andy Feng (Yahoo)
Agenda
• Audience Expansion Spark application
• Spark scalability: problems and our solutions
• Performance tuning
AUDIENCE EXPANSION
How we built audience expansion on Spark
Audience Expansion
• Train a model to find users who behave similarly to sample users
• Find more potential “converters”
System
• Large-scale machine learning system
• Logistic regression
• TBs of input data, up to TBs of intermediate data
• Hadoop pipeline uses 30,000+ mappers, 2,000 reducers, 16 hrs run time
• All Hadoop streaming, ~20 jobs
• Use Spark to reduce latency and cost
Pipeline
Labeling
• Label positive/negative samples
• 6-7 hrs, IO intensive, 17 TB of intermediate IO in Hadoop
Feature Extraction
• Extract Features from raw events
Model Training
• Logistic regression phase, CPU bound
Score/Analyze models
• Validate trained models and parameter combinations, select new model
Validation/Metrics
• Validate and publish new model
How to adopt Spark efficiently?
• Very complicated system
• 20+ Hadoop streaming map-reduce jobs
• 20k+ lines of code
• TBs of data; person-months to do data validation
• 6+ people, 3 quarters to rewrite the system from scratch based on Scala
Our migration solution
• Build a transition layer that automatically converts Hadoop streaming jobs to Spark jobs
• No need to change any Hadoop streaming code
• 2 person-quarters
• Private Spark
[Architecture diagram: the Audience Expansion pipeline (20+ Hadoop Streaming jobs) runs unchanged on ZIPPO (Hadoop Streaming over Spark), on top of Spark and HDFS]
ZIPPO
• A layer (ZIPPO) between Spark and the application
• Implements all Hadoop Streaming interfaces
• Migrate the pipeline without rewriting code
• Can focus on rewriting perf bottlenecks
• Plan to open source
ZIPPO - Supported Features
• Partition related
  – Hadoop Partitioner class (-partitioner)
  – num.map.key.fields, num.map.partition.fields
• Distributed cache
  – -cacheArchive, -file, -cacheFile
• Independent working directory for each task instead of each executor
• Hadoop Streaming aggregation
• Input data combination (to mitigate many small files)
• Customized OutputFormat, InputFormat
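As a rough illustration of the contract such a layer must honor, here is a toy, hypothetical sketch (not ZIPPO's actual code) of running unmodified streaming-style commands over an in-memory partition, similar to how Spark's RDD.pipe() feeds data through an external process. The word-count mapper and reducer are illustrative stand-ins for real streaming binaries.

```python
import subprocess
import sys

def run_streaming(cmd, lines):
    """Pipe input lines through an external Hadoop-streaming-style process."""
    result = subprocess.run(cmd, input="\n".join(lines) + "\n",
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Streaming contract: read lines from stdin, emit "key\tvalue" lines on stdout.
mapper = [sys.executable, "-c",
          "import sys\n"
          "for line in sys.stdin:\n"
          "    for w in line.split():\n"
          "        print(w + '\\t1')"]

reducer = [sys.executable, "-c",
           "import sys\n"
           "from itertools import groupby\n"
           "rows = (l.rstrip('\\n').split('\\t') for l in sys.stdin)\n"
           "for k, g in groupby(rows, key=lambda r: r[0]):\n"
           "    print(k + '\\t' + str(sum(int(v) for _, v in g)))"]

partition = ["spark spark hadoop", "hadoop spark"]
mapped = run_streaming(mapper, partition)
shuffled = sorted(mapped)            # the shuffle: bring equal keys together
reduced = run_streaming(reducer, shuffled)
print(reduced)                       # ['hadoop\t2', 'spark\t3']
```

Because the mapper and reducer are opaque external commands, a transition layer only needs to reproduce this stdin/stdout contract plus the surrounding options (partitioner, distributed cache, formats) to run the pipeline unmodified.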
Performance Comparison (1 TB data)
• ZIPPO Hadoop streaming
  – Spark cluster: 1 hard drive, 40 hosts
  – Perf data: 1 hr 25 min
• Original Hadoop streaming
  – Hadoop cluster: 1 hard drive, 40 hosts
  – Perf data: 3 hrs 5 min
SPARK SCALABILITY
Spark Shuffle
• The mapper side of the shuffle writes all output to disk (shuffle files)
• The data can be large scale, so it cannot all be held in memory
• Reducers transfer all the shuffle files for each partition, then process them
Spark Shuffle
[Diagram: mappers 1…m each write shuffle files 1…n, one for every reducer partition 1…n]
On each Reducer
• Every partition needs to hold all the data from all the mappers
• In a hash map
• In memory
• Uncompressed
[Diagram: a 4-core reducer host holds partitions 1…4, each collecting shuffle output from mappers 1…n]
How many partitions?
• Need partitions small enough that each fits entirely in memory
[Diagram: a 4-core host cycling through partitions 1…n, far more partitions than cores]
Spark needs many Partitions
• So a common pattern when using Spark is to have a very large number of partitions
On each Reducer
• For a 64 GB memory host with a 16-core CPU
• With a 30:1 compression ratio and 2x in-memory overhead
• To process 3 TB of data, needs 46,080 partitions
• To process 3 PB of data, needs ~46 million partitions
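The slide's arithmetic can be reproduced directly (assuming binary units, the stated 30:1 compression ratio, and 2x in-memory overhead):

```python
# Reproduce the slide's partition count: per-core memory on a 64 GB /
# 16-core host, versus the uncompressed in-memory size of the data.
MEM_PER_CORE_GB = 64 / 16     # 4 GB of heap per reducer core
COMPRESSION_RATIO = 30        # 30:1 on-disk compression
OVERHEAD = 2                  # 2x in-memory (hash map) overhead

def partitions_needed(data_tb):
    uncompressed_gb = data_tb * 1024 * COMPRESSION_RATIO * OVERHEAD
    return int(uncompressed_gb / MEM_PER_CORE_GB)

print(partitions_needed(3))         # 46080 partitions for 3 TB
print(partitions_needed(3 * 1024))  # ~47 million for 3 PB
```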
Non Scalable
• Not linearly scalable: no matter how many hosts we have in total, we always need 46k partitions
Issues of huge number of partitions
• Issue 1: OOM on the mapper side
  – Each mapper core needs to write to 46k shuffle files simultaneously
  – 1 shuffle file = OutputStream + FastBufferStream + CompressionStream
  – Memory overhead:
    • FD and related kernel overhead
    • FastBufferStream (for turning random IO into sequential IO), default 100 KB buffer per stream
    • CompressionStream, default 64 KB buffer per stream
  – So by default, total buffer size: 164 KB * 46k * 16 = 100+ GB
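A quick back-of-the-envelope check of that buffer total, using the default buffer sizes above and the tuned 4 KB + 8 KB settings described in these slides:

```python
# Mapper-side buffer memory: each of 16 cores keeps one open stream per
# shuffle file, and each stream carries a write buffer plus a
# compression buffer. Partition and core counts are the slides'.
PARTITIONS = 46080
CORES = 16

def total_buffer_gb(stream_buffer_kb, codec_buffer_kb):
    per_file_kb = stream_buffer_kb + codec_buffer_kb
    return per_file_kb * PARTITIONS * CORES / 1024 ** 2

print(total_buffer_gb(100, 64))  # defaults (100k + 64k): ~115 GB -> OOM
print(total_buffer_gb(4, 8))     # tuned (4k buffer, 8k snappy): ~8.4 GB
```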
Issues of huge number of partitions
• Our solution to mapper OOM
  – Set spark.shuffle.file.buffer.kb to 4k (kernel block size) for FastBufferStream
  – Based on our contributed patch https://github.com/mesos/spark/pull/685
  – Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
  – Set spark.snappy.block.size to 8192 to reduce buffer size (while Snappy still gets a good compression ratio)
  – Total buffer size after this: 12 KB * 46k * 16 = 10 GB
Issues of huge number of partitions
• Issue 2: large number of small files
  – Each input split in the mapper is broken down into at least 46k partitions
  – Large numbers of small files cause lots of random R/W IO
  – When each shuffle file is smaller than 4 KB (kernel block size), the overhead becomes significant
  – Significant metadata overhead in the FS layer
  – Example: merely deleting the whole tmp directory manually can take 2 hours, because we have so many small files
  – Especially bad when splits are not balanced
  – 5x slower than Hadoop
[Diagram: each of n input splits produces shuffle files 1…46080]
Reduce side compression
• Currently, reducer-side shuffle data in memory is not compressed
• Can take 10-100x more memory
• With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, while compression overhead is less than 3%
• Without this patch, Spark doesn't work for our case
• 5x-10x performance improvement
Reduce side compression
• Reducer side
  – With compression: 1.6k shuffle files
  – Without compression: 46k shuffle files
Reducer Side Spilling
[Diagram: the reducer writes into compression buckets 1…n, which spill to disk as Spill 1, Spill 2, … Spill n]
Reducer Side Spilling
• Spills over-size data in the aggregation hash table to disk
• Spilling: more IO, more sequential IO, fewer seeks
• All in memory: less IO, more random IO, more seeks
• Fundamentally resolves Spark's scalability issue
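A minimal sketch of the spilling idea, assuming a sum aggregation (names and thresholds are illustrative; Spark's real implementation differs): aggregate in an in-memory hash table, spill sorted runs to disk when it grows too large, then merge the sorted runs sequentially.

```python
import heapq
import itertools
import pickle
import tempfile

def external_sum(pairs, max_in_mem=2):
    """Sum values per key, spilling the hash table to disk when over-size."""
    spills, table = [], {}
    for k, v in pairs:
        table[k] = table.get(k, 0) + v
        if len(table) > max_in_mem:            # over-size: spill to disk
            f = tempfile.TemporaryFile()
            pickle.dump(sorted(table.items()), f)   # sorted run
            f.seek(0)
            spills.append(f)
            table = {}
    # Merge the in-memory remainder with all on-disk runs (sequential IO).
    runs = [sorted(table.items())] + [pickle.load(f) for f in spills]
    merged = heapq.merge(*runs)
    return [(k, sum(v for _, v in g))
            for k, g in itertools.groupby(merged, key=lambda kv: kv[0])]

print(external_sum([("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)],
                   max_in_mem=2))             # [('a', 3), ('b', 4), ('c', 1)]
```

The trade-off in the bullets above shows up directly: the spill path pays extra IO, but it is sequential writes and a sequential merge, instead of an ever-growing in-memory table.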
Align with previous Partition function
• Our input data comes from another map-reduce job
• We use exactly the same hash function to reduce the number of shuffle files
[Diagram: previous job generating input data feeds the Spark job]
Align with previous Partition function
• New hash function, more even distribution

[Diagram: the previous job partitions keys with mod 4 (keys 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…); the Spark job repartitions with mod 5, so each input partition produces 5 shuffle files]
Align with previous Partition function
• Use the same hash function
[Diagram: input data partitioned with mod 4 (keys 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…); the Spark job also uses mod 4, so each input partition produces exactly 1 shuffle file]
Align with previous Hash function
• Our case:
  – 16M shuffle files, 62 KB on average (5-10x slower)
  – 8k shuffle files, 125 MB on average
• Several different input data sources
• Partition function taken from the major one
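The fan-out effect of a misaligned partition function can be demonstrated with a small sketch (illustrative only, not the production hash function):

```python
# Count how many Spark shuffle files each upstream partition fans out to,
# for a given upstream hash (mod upstream_mod) and Spark hash (mod spark_mod).
def shuffle_files(upstream_mod, spark_mod, keys=range(100)):
    fanout = {}
    for k in keys:
        fanout.setdefault(k % upstream_mod, set()).add(k % spark_mod)
    return {p: len(targets) for p, targets in fanout.items()}

print(shuffle_files(4, 5))   # misaligned: every partition fans out to 5 files
print(shuffle_files(4, 4))   # aligned: exactly 1 shuffle file per partition
```

With misaligned functions, the shuffle-file count multiplies (the 16M-files case above); reusing the upstream function collapses it to one file per partition.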
PERFORMANCE TUNING
All About Resource Utilization
• Maximize resource utilization
• Use as much CPU, memory, disk, and network as possible
• Monitor vmstat, iostat, sar
Resource Utilization
• Ideally CPU/IO should be fully utilized
• Mapper phase – IO bound
• Final reducer phase – CPU bound
Shuffle file transfer
• Spark transfers all shuffle files into reducer memory before it starts processing
• Non-streaming (very hard to change to streaming)
• To avoid poor resource utilization:
  – Make sure maxBytesInFlight is set big enough
  – Consider allocating 2x more threads than the physical core count
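A toy model (not Spark's actual fetch logic; block sizes and budgets are made up) of why too small a maxBytesInFlight budget serializes shuffle-block fetches:

```python
# Model a reducer fetching shuffle blocks under an in-flight byte budget:
# blocks are batched until the budget is full, and each batch costs one
# round trip. A tiny budget degenerates into one round per block.
def fetch_rounds(block_sizes, max_bytes_in_flight):
    rounds, batch = 0, 0
    for size in block_sizes:
        if batch + size > max_bytes_in_flight and batch > 0:
            rounds += 1          # wait for the current batch to land
            batch = 0
        batch += size
    return rounds + (1 if batch else 0)

blocks = [48] * 8                # eight 48 MB shuffle blocks (made-up sizes)
print(fetch_rounds(blocks, 48))  # tiny budget: 8 sequential rounds
print(fetch_rounds(blocks, 192)) # larger budget: 2 rounds
```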
Thanks.
Gavin Li [email protected]
Jaebong Kim [email protected]
Andy Feng [email protected]