Petabyte-Scale Text Processing with Spark
Oleksii Sliusarenko, Grammarly Inc.
E-mail: aliaxey90 (at) gmail (dot) com
Read the full article on the Grammarly tech blog
Modern error correction
depending from the weather → depending on the weather
Common Crawl = internet dump
Size: 3 petabytes
Format: WARC (raw HTTP protocol dump)
We need: 1 PB, or 2000 x 480 GB SSDs
High-level pipeline view
Extract texts → English filter → Deduplicate → Break into words → Count frequencies
Typical processing step example
Processing example: count each n-gram frequency
Input data example:
<sentence> <tab> <frequency>
My name is Bob. 12
Kiev is a capital. 25
Output data example:
<n-gram> <tab> <frequency>
name is 12
is 37
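The step above can be sketched in plain Python. This is a minimal single-machine illustration, not the pipeline's actual code: the function names, the punctuation stripping, and the choice of unigrams plus bigrams are assumptions made for the example.

```python
import string
from collections import Counter


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation (assumed rule)."""
    words = (w.strip(string.punctuation) for w in sentence.split())
    return [w for w in words if w]


def count_ngrams(lines, max_n=2):
    """Sum sentence frequencies over every n-gram of length 1..max_n."""
    counts = Counter()
    for line in lines:
        # Each input line is "<sentence><tab><frequency>".
        sentence, freq = line.rstrip("\n").rsplit("\t", 1)
        words = tokenize(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += int(freq)
    return counts


if __name__ == "__main__":
    sample = ["My name is Bob.\t12", "Kiev is a capital.\t25"]
    for ngram, freq in sorted(count_ngrams(sample).items()):
        print(f"{ngram}\t{freq}")
```

On a cluster the same logic would typically be split into a per-sentence expansion step followed by a sum per n-gram key.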
Classic and modern approaches
Our alternatives
$12000
$3000
$1000
Default choice: Amazon EMR
$12000
$24000
OOM
segfault
Our MapReduce
12x faster than Hadoop
Easy to learn
Full support
2x2=4
Our MapReduce
Distributed fail-safe difficulties:
◈ Hardware failures
◈ Network failures
Fixing Spark
3 months!
First of all
Latest stable
◈ Build Spark with patch
◈ Don’t forget Hadoop native libraries
The hardest button
S3 HEAD request failed for "file path" -
ResponseCode=403, ResponseMessage=Forbidden.
Why???
HTTP HEAD request
The HTTP body contains the error description, but it’s not fetched!
No body!
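Why the 403 arrives without an explanation can be reproduced locally. The sketch below is an illustration only: it uses a throwaway server on 127.0.0.1 instead of S3, and the handler, the `/some/key` path, and the XML error text are made up for the example. The same 403 is requested once with HEAD and once with GET.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative error document; real S3 error bodies differ in detail.
ERROR_BODY = b"<Error><Code>SignatureDoesNotMatch</Code></Error>"


class ForbiddenHandler(BaseHTTPRequestHandler):
    """Answers every request with 403; only GET gets the explanation."""

    def _send_403(self, include_body):
        self.send_response(403, "Forbidden")
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(ERROR_BODY)))
        self.end_headers()
        if include_body:
            self.wfile.write(ERROR_BODY)

    def do_GET(self):
        self._send_403(include_body=True)

    def do_HEAD(self):
        # A HEAD response carries headers only, never a body.
        self._send_403(include_body=False)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(method, port):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, "/some/key")
        resp = conn.getresponse()
        return resp.status, resp.reason, resp.read()
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch("HEAD", port), fetch("GET", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    head, get = demo()
    print("HEAD:", head)
    print("GET: ", get)
```

Both calls report 403 Forbidden, but only the GET result contains the error document, which is why the reason has to be dug out at a lower level.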
Possible reasons:
◈ AccessDenied
◈ AccountProblem
◈ CrossLocationLoggingProhibited
◈ InvalidAccessKeyId
◈ InvalidObjectState
◈ InvalidPayer
◈ InvalidSecurity
◈ NotSignedUp
◈ RequestTimeTooSkewed
◈ SignatureDoesNotMatch
We need to go deeper!
Spark → Hadoop → JetS3t → HttpClient
Fix here
Fixing Spark
Fixing S3
◈ Choose the latest filesystem: S3A, not S3 or S3N
◈ conf.setInt("fs.s3a.connection.maximum", 100)
◈ Use DirectOutputCommitter
◈ --conf spark.hadoop.fs.s3a.access.key=…
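Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration, which is how the S3A settings above can be passed on the command line. A minimal sketch that builds the `--conf` flags; the helper name, the `AWS_ACCESS_KEY_ID` environment variable, and the `<access-key>` placeholder are assumptions for illustration.

```python
import os


def spark_submit_conf(hadoop_props):
    """Turn Hadoop properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(hadoop_props.items()):
        # The spark.hadoop. prefix forwards the property to Hadoop.
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    s3a_props = {
        "fs.s3a.connection.maximum": 100,
        # Placeholder: the real key is deliberately not shown.
        "fs.s3a.access.key": os.environ.get("AWS_ACCESS_KEY_ID", "<access-key>"),
    }
    print(" ".join(spark_submit_conf(s3a_props)))
```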
Fixing Spark
Fixing OOM
◈ spark.default.parallelism = cores * 3
◈ spark_mb = system_ram_mb * 4 // 5
◈ set("spark.akka.frameSize", "2047")
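The sizing rules above are plain arithmetic and can be sketched as a helper. The function name and the example machine (32 cores, 60 GB of RAM) are hypothetical, and whether `cores` means per node or per cluster is not stated on the slide, so the argument is simply passed through.

```python
def spark_sizing(cores, system_ram_mb):
    """Apply the slide's rules of thumb to one set of hardware numbers."""
    return {
        # Three tasks per core.
        "spark.default.parallelism": str(cores * 3),
        # Four fifths of system RAM, rounded down, for Spark.
        "spark_mb": system_ram_mb * 4 // 5,
        # Value taken from the slide as-is.
        "spark.akka.frameSize": "2047",
    }


if __name__ == "__main__":
    # Hypothetical machine: 32 cores, 60 GB RAM.
    for key, value in spark_sizing(32, 60 * 1024).items():
        print(key, "=", value)
```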
Fixing Spark
Fixing miscellaneous
◈ Don’t force Kryo class registration
◈ Use bzip2 compression for input files
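Preparing bzip2 input needs nothing beyond the standard library. A minimal sketch, assuming line-oriented UTF-8 text; the helper names and the file name are illustrative. Hadoop's bzip2 codec is splittable, so a large `.bz2` file can still be read by several tasks in parallel.

```python
import bz2
import os
import tempfile


def write_bz2_lines(path, lines):
    """Write one record per line into a bzip2-compressed text file."""
    with bz2.open(path, "wt", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def read_bz2_lines(path):
    """Read the records back, dropping the trailing newlines."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sentences.tsv.bz2")
        write_bz2_lines(path, ["My name is Bob.\t12", "Kiev is a capital.\t25"])
        print(read_bz2_lines(path))
```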
Our Ultimate Spark Recipe
See the Grammarly tech blog for more info
Use spot instances
Regular instance: safe, expensive
Spot instance: transient, cheap (80% cheaper!)
Was It All Worth It?
◈ We spent the same amount of money
◈ Further experiments will be cheaper
◈ You can save three months!
Take-aways
◈ Don’t reinvent the wheel
◈ New technology will eat a lot of time
◈ Don’t be afraid to dive into code
◈ Look at problems from various angles
◈ Use spot instances
Thanks!
Any questions?
You can find me at aliaxey90 (at) gmail (dot) com