vasil-remeniuk
Quick Guide
What is Scalding?
• Scala wrapper for Cascading
What is Cascading?
• Tap / Pipe / Sink abstraction over Map / Reduce in Java
What is Scalding?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
    TextLine(args("input"))
      .flatMap('line -> 'word) { line: String => tokenize(line) }
      .groupBy('word) { _.size }
      .write(Tsv(args("output")))
• No more scripting and UDFs!
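
The comparison with in-memory collections can be made concrete with plain Scala. The following is a minimal sketch that uses only the standard library, not Scalding itself. The `tokenize` helper is an assumption (the slide calls it but does not show its definition), and the sample sentences in `main` are invented for illustration.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer: the slide calls tokenize(line) but does not define it.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the Scalding pipeline: flatMap lines to words,
  // group by word, then take the size of each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "To scald or not"))
    // Print tab-separated pairs, similar in spirit to a Tsv sink.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

The Scalding version reads almost the same, except that the source and sink are files (or S3/HDFS paths) and the grouping runs as a Map / Reduce step rather than in memory.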
Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests
• Run the WordCountJob in local mode with the given input and output
Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload produces the jar and uploads it to S3
• Configure TeamCity
Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar \
    com.twitter.scalding.Tool \
    com.adform.dspr.WordCountJob \
    --hdfs \
    --input s3://adform-dsp-metadata/countries/countries.txt \
    --output s3://dev-adform-temp-results/wordcount
  Here com.twitter.scalding.Tool is the entry class, com.adform.dspr.WordCountJob is the Scalding job class, --hdfs runs the job in HDFS mode, and --input / --output are job parameters.
Under the covers
• sbt "run-main \
    com.twitter.scalding.Tool \
    com.adform.dspr.WordCountJob \
    --hdfs \
    --tool.graph \
    --input dummy --output dummy"
• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png
• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png
Development
• Different APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction
Development
• Fields:
  • No need to parse columns
  • Redundant
  • No IDE support like auto-completion
• Typed:
  • All benefits of types
  • More manual work with parsing
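
The trade-off between the two styles can be illustrated without Scalding. The sketch below is only an analogy in plain Scala: rows keyed by column name with string values stand in for the Fields API, and a case class stands in for the Typed API. The `Request` fields (`country`, `bidPrice`) and the tab-separated layout are assumptions made for the example; the real Request/Transaction classes are not shown in the slides.

```scala
import scala.util.Try

object FieldsVsTyped {
  // Hypothetical record type; the real Request class is not shown in the slides.
  final case class Request(country: String, bidPrice: Double)

  // "Fields" style: columns are looked up by name and every value is a string.
  // A typo in a column name or a bad number only fails at run time.
  def totalByCountryUntyped(rows: Seq[Map[String, String]]): Map[String, Double] =
    rows
      .groupBy(row => row("country"))
      .map { case (country, group) =>
        country -> group.map(row => row("bidPrice").toDouble).sum
      }

  // "Typed" style: the extra manual work is a parser written once up front.
  def parse(line: String): Option[Request] =
    line.split("\t") match {
      case Array(country, price) => Try(Request(country, price.toDouble)).toOption
      case _                     => None
    }

  // After parsing, field access is checked by the compiler and auto-completed by the IDE.
  def totalByCountryTyped(requests: Seq[Request]): Map[String, Double] =
    requests
      .groupBy(_.country)
      .map { case (country, group) => country -> group.map(_.bidPrice).sum }
}
```

With the typed variant, malformed lines are dropped (or could be routed elsewhere) at the parsing boundary, so the aggregation code never sees a string that fails to convert.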
Resources
• https://github.com/twitter/scalding
• https://github.com/twitter/scalding/tree/develop/tutorial
• https://github.com/twitter/scalding/wiki
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb
My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging !!!!111
• More optimal job plans
My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data
Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is also nice for matrix calculations, and Twitter provides many monoids (aggregation algorithms) for approximations, e.g. HyperLogLog and CountMinSketch (see Algebird).
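
To show why monoids fit this kind of aggregation, here is a minimal sketch in plain Scala. The `Monoid` trait below is a toy written for this example and is not Algebird's actual API; the set-based monoid counts distinct values exactly, which is the job HyperLogLog does approximately with a small fixed-size structure.

```scala
object MonoidSketch {
  // Toy type class for illustration only; Algebird's real Monoid has a richer API.
  trait Monoid[A] {
    def zero: A
    def plus(left: A, right: A): A
  }

  // Exact distinct counting expressed as a monoid over sets.
  // An approximate sketch would replace the full set with a compact summary.
  implicit def setMonoid[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(left: Set[T], right: Set[T]): Set[T] = left ++ right
  }

  // Because plus is associative, partial results computed on separate
  // chunks of the data can be combined afterwards in any grouping.
  def sum[A](values: Seq[A])(implicit m: Monoid[A]): A =
    values.foldLeft(m.zero)(m.plus)
}
```

For example, summing the sets from two separate chunks and then summing the two partial results gives the same distinct count as summing everything in one pass.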
process-logs-rtb
• Had to hack Scalding:
  • WritableMultiSinkTap
  • Records
  • CompressedTsv
  • ModelKryoInstantiator
• Uses typed API
• Helpers like FluentJob