Slides from Adam Ilardi's presentation at the 5/21 NY Scala Meetup held at eBay NYC.
Scala and Hadoop @ eBay
What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros
I have no clue what those things are
What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!
Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?
Someone on Hacker News said Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness
Madness?
trait Lazy[+T, P] {
  var creationParameters: P = None.asInstanceOf[P]
  lazy val lazyThing: Either[Throwable, T] =
    try { Right(create(creationParameters)) }
    catch { case e => Left(e) }
  def get(createParams: P): Either[Throwable, T] = {
    creationParameters = createParams
    lazyThing
  }
  def create(params: P): T
}
Madness?
def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
  lazyCreator.get(params) match {
    case Right(successValue) => successValue
    case Left(exception) => throw new StackException(exception)
  }
}
This is used by ONE client class
• Show some self-restraint
Hadoop
• void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
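The two signatures above boil down to "emit pairs, group by key, fold each group". A minimal in-memory sketch of that contract, using plain Scala collections rather than any Hadoop classes (the names MapReduceSketch, run, and wordCount are made up for illustration):

```scala
object MapReduceSketch {
  // Hypothetical stand-in for a MapReduce run; nothing here touches Hadoop.
  def run[K1, V1, K2, V2, K3, V3](
      input: Seq[(K1, V1)])(
      mapper: (K1, V1) => Seq[(K2, V2)])(
      reducer: (K2, Seq[V2]) => Seq[(K3, V3)]): Seq[(K3, V3)] = {
    // Map phase: each input record may emit zero or more intermediate pairs.
    val intermediate = input.flatMap { case (k, v) => mapper(k, v) }
    // Shuffle phase: collect all intermediate values that share a key.
    val grouped = intermediate.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Reduce phase: each key with its values may emit zero or more output pairs.
    grouped.toSeq.flatMap { case (k, vs) => reducer(k, vs) }
  }

  // Example use: word count, with offsets as K1 and lines as V1.
  def wordCount(lines: Seq[(Long, String)]): Map[String, Int] =
    run(lines)((offset, line) => line.split("\\s+").toSeq.filter(_.nonEmpty).map(_ -> 1))(
      (word, ones) => Seq(word -> ones.sum)).toMap
}
```

The output order of run is unspecified because the shuffle goes through a Map, which is why wordCount converts the result back to a Map.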
BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items
How should I use Map Reduce?
• Raw map reduce
• Pig
• Hive
• Cascading
• Scoobi
• Scalding
Decision Time
• “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”
• “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
I believe!
• Scalding combines the best of Pig and Cascading
Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
// do joins and group by also
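For readers who have not used Pig, the script above is a filter followed by a projection. A rough in-memory analogue with Scala collections, assuming integer fields (the slide gives no types; GoodPigSketch and Row are illustrative names):

```scala
object GoodPigSketch {
  // Hypothetical row type; the Pig script names the fields x, y, z without types.
  case class Row(x: Int, y: Int, z: Int)

  // B = FILTER A BY x > 5
  def filterRows(a: Seq[Row]): Seq[Row] = a.filter(_.x > 5)

  // C = FOREACH B GENERATE y, z
  def project(b: Seq[Row]): Seq[(Int, Int)] = b.map(r => (r.y, r.z))
}
```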
Bad Pig
DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as name, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as name1;
Other Pig Issues
• Scheduling and DAG creation
Cascading Rocks!
• What is it?
• Supports large workflows and reusable components
  – DAG generation
  – Parallel Executions
Cascading code in Scala
// Assumes masterPipe was declared earlier as a var holding the source pipe.
masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)
Someone should really code review this
Cascading Issues
This page intentionally left blank
Scalding Time
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
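What the job above computes can be checked locally without a cluster. The sketch below reuses the slide's tokenize verbatim and replaces the Scalding pipeline with ordinary collections (WordCountLocal and count are illustrative names, not part of Scalding):

```scala
object WordCountLocal {
  // Same tokenization as the slide: lowercase, strip punctuation, split on whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // In-memory analogue of flatMap('line -> 'word) followed by groupBy('word) { _.size }.
  def count(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(line => tokenize(line).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

One thing this makes easy to see: a line with leading whitespace yields an empty first token, because String.split keeps a leading empty string. A real job might want to filter those out.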
Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires
Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!
class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
  implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)

  class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions

  trait CommonFunctions {
    import Dsl._
    import RichPipe.assignName
    def pipe: Pipe
    def reallyComplexFunction(field: Fields, param: Long) = {
      //mind blowing code here
    }
  }
}
CheckoutTransactionsPipe(/* default path logic */)
  .project(/* fields I need */)
  .countUserInteractions(/* params */)
  .doScoreCalculation(/* params */)
  .doConfidenceCalculation(/* params */)
Seems a bit too readable for Scala
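The enrichment trick behind those chained calls can be tried without Scalding on the classpath. This is only a sketch: Pipe here is a made-up case class standing in for Cascading's Pipe, and the method bodies just record which step was called.

```scala
import scala.language.implicitConversions

object EnrichSketch {
  // Hypothetical stand-in for Cascading's Pipe: a list of step descriptions.
  case class Pipe(steps: List[String])

  // Shared operations live in a trait, as CommonFunctions does on the slide.
  trait CommonFunctions {
    def pipe: Pipe
    def countUserInteractions(window: Int): Pipe =
      Pipe(pipe.steps :+ s"countUserInteractions($window)")
    def doScoreCalculation(weight: Double): Pipe =
      Pipe(pipe.steps :+ s"doScoreCalculation($weight)")
  }

  // The wrapper supplies the abstract `pipe` member and inherits the operations.
  class RichPipeSketch(val pipe: Pipe) extends CommonFunctions

  // The implicit conversion makes the extra methods appear on every Pipe in scope.
  implicit def pipe2Rich(p: Pipe): RichPipeSketch = new RichPipeSketch(p)

  // Each call returns a plain Pipe, which is converted again for the next call.
  def example: Pipe =
    Pipe(List("source")).countUserInteractions(30).doScoreCalculation(0.5)
}
```

Because every operation returns a plain Pipe, new operations added to the trait chain with the existing ones for free.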
Collaborative Filtering
• Typically hard to run on large datasets
Structured Data Importance
• Do people shop by brand?
[Chart: Handbags and Purses. Legend: Supply. Vertical axis from 0 to 1.2. Attributes: Bag Depth, Bag Height, Bag Length, Brand, Color, Country of Manufacture, Material, Shade, Size, Strap Drop, Style]
Markov Chains
• Investigation of buying patterns in ~50 lines of code
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(x => x._1.toString+"-"+x._2.toString)
val sizes = grouped.map { x => { x._1 -> x._2.size }}.toList
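The snippet above counts transitions between consecutive purchases. A self-contained sketch of the same idea follows, with two liberties that are assumptions rather than anything from the slides: transitions are keyed by a tuple instead of an "a-b" string, and a second function normalizes counts into estimated transition probabilities.

```scala
object MarkovSketch {
  // Count transitions between consecutive purchases,
  // starting from the synthetic "firsttime" state used on the slide.
  def transitionCounts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" :: history.toList
    val pairs = purchases.zip(purchases.tail)
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
  }

  // Divide each count by the total leaving its source state.
  def transitionProbabilities(
      counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    val totals = counts
      .groupBy { case ((from, _), _) => from }
      .map { case (from, cs) => from -> cs.values.sum }
    counts.map { case ((from, to), n) => (from, to) -> n.toDouble / totals(from) }
  }
}
```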
Mining Search Queries
• 20+ billion user queries - give me the top ones per user
• De-Dupe
• Rank
• Validate
• Sample Data
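"Top queries per user" is a group, count, and rank problem. At toy scale it can be sketched with collections as below; the (userId, query) record shape and the names TopQueriesSketch and topPerUser are assumptions for illustration, and the real job would run as a distributed pipeline rather than in memory.

```scala
object TopQueriesSketch {
  // Hypothetical record shape: (userId, query).
  def topPerUser(log: Seq[(String, String)], n: Int): Map[String, Seq[(String, Int)]] =
    log.groupBy(_._1).map { case (user, entries) =>
      // Count how often this user issued each distinct query.
      val counts = entries.groupBy(_._2).map { case (q, es) => q -> es.size }.toSeq
      // Sort by descending count, then by query text so ties are deterministic.
      user -> counts.sortBy { case (q, c) => (-c, q) }.take(n)
    }
}
```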
Automation
Hadoop Proxy Batch Database Load Machines
Cassandra
Jenkins
MySQL
Mongo
Questions?
www.ebaynyc.com