BDX 2016 - Tzach Zohar @ Kenshoo

Scala: The Language for Big Data

Tzach Zohar @ Kenshoo, March/2016

Who am I

System Architect @ Kenshoo

Java backend for 10 years

Working with Scala + Spark for 2 years

https://www.linkedin.com/in/tzachzohar

Who’s Kenshoo

10-year-old, Tel Aviv-based startup

500+ employees

Industry Leader in Digital Marketing

Heavy data shop

http://kenshoo.com/

And who are you?

Agenda

NOT the usual Scala pitch

Scala - Short Intro

Scala

Created by Martin Odersky and his research group at EPFL, 2003

Open Source

Runs on the JVM, Seamless Java Interoperability

Strongly Typed

Object Oriented

Functional

Functional Programming

From Wikipedia:

“[…] functional programming […] treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data”

Functional Languages

Functions are “first-class citizens”, i.e. values

Higher-Order Functions

Minimize side effects

Minimize mutability
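
A minimal Scala sketch of the four points above. The names timesTwo and applyTwice are made up for illustration and do not come from the talk:

```scala
// Functions are values: this one is stored in a val like any other data.
val timesTwo: Int => Int = x => x * 2

// A higher-order function: it takes another function as a parameter.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// No side effects, no mutation: map builds a new list
// and leaves the original untouched.
val original = List(1, 2, 3)
val doubled = original.map(timesTwo)
```

Here applyTwice(timesTwo, 3) evaluates timesTwo(timesTwo(3)), and original still holds 1, 2, 3 after the call to map.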

Example: Imperative to Functional

Java / Imperative:

private static class Person {
    String firstName;
    String lastName;
}

private List<Person> firstNFamilies(int n, List<Person> persons) {
    final List<String> familiesSoFar = new LinkedList<>();
    final List<Person> result = new LinkedList<>();
    for (Person p : persons) {
        if (familiesSoFar.contains(p.lastName)) {
            result.add(p);
        } else if (familiesSoFar.size() < n) {
            familiesSoFar.add(p.lastName);
            result.add(p);
        }
    }
    return result;
}

Scala / Functional:

case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  persons.filter(p => firstFamilies.contains(p.lastName))
}
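
To see what the functional version returns, here is a self-contained run on made-up sample data. The people listed are invented for illustration; the function body is the one from the slide:

```scala
case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  // The first n distinct last names, in order of appearance.
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  // Everyone who belongs to one of those families.
  persons.filter(p => firstFamilies.contains(p.lastName))
}

// Hypothetical sample data.
val people = List(
  Person("Ann", "Smith"),
  Person("Bob", "Jones"),
  Person("Cat", "Smith"),
  Person("Dan", "Brown"),
  Person("Eve", "Jones")
)

val firstTwoFamilies = firstNFamilies(2, people)
```

With n = 2 the first two families are Smith and Jones, so Dan Brown is the only person filtered out.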

Hey, won’t this look rather similar with Java 8’s Streams + Lambdas?

Scala + Big Data

What can a language do for Big Data?

Language “requirements”

Open Source

Strongly Typed

Java/JVM Friendly

Interactive

Performant

Abstracts “machinery” away

Java Interoperability

class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
  extends ParquetOutputCommitter(outputPath, context) { … }

ParquetOutputCommitter: Java class from org.apache.parquet:parquet-hadoop

DirectParquetOutputCommitter: Scala class from org.apache.spark:spark-core_2.10
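
The same pattern can be tried with nothing but the JDK. In this sketch a Scala class extends the Java class java.io.FilterWriter and passes its constructor argument up, just as the Spark class above does. UpperCaseWriter and shout are hypothetical names, and the code assumes plain ASCII text so that upper-casing does not change the string length:

```scala
import java.io.{FilterWriter, StringWriter, Writer}

// A Scala class extending a Java class from the JDK.
class UpperCaseWriter(target: Writer) extends FilterWriter(target) {
  // Overrides a Java method; super.write forwards to the wrapped Writer.
  override def write(str: String, off: Int, len: Int): Unit =
    super.write(str.toUpperCase, off, len)
}

// Uses the Scala subclass through the plain Java Writer API.
def shout(text: String): String = {
  val sink = new StringWriter()
  val writer = new UpperCaseWriter(sink)
  writer.write(text)
  writer.flush()
  sink.toString
}
```

No wrappers or bindings are involved: the Scala subclass is an ordinary JVM class as far as the Java code is concerned.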

Scala is Interactive

Scala has a built-in REPL, extensible by Scala-based tools

Performant

Benchmarking languages is hard and suffers from bias

Most benchmarks show Scala is at least on par with Java, e.g. Google’s benchmark

Nonsense!

No Way!

RAGE!!11

Does it matter?

Performant

Ability to scale out is more significant than per-CPU performance

http://vmturbo.com/wp-content/uploads/2015/05/ScaleUpScaleOut_sm-min.jpg

Abstracts “machinery” away - What?

Think about MapReduce

Hadoop’s Mapper and Reducer - code what to do, not:

How

Where

In what order

How to handle failures

Leaves these concerns for the framework to figure out

Think about MapReduce

Hadoop’s Java API imitates Functional Programming:

Mapper and Reducer are Functions

Executed by “Higher Order Functions”

No Side Effects / Mutability
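
A toy, single-machine sketch of that idea, not Hadoop's actual API: mapper and reducer are plain functions, and a hypothetical runJob plays the role of the framework that decides how to execute them. All three names are invented for illustration:

```scala
// The "mapper": one input line becomes (word, 1) pairs.
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

// The "reducer": all counts collected for one word become a single total.
def reducer(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

// The "framework": a higher-order function that takes both functions
// and owns the how, where and in what order (here: trivially, in memory).
def runJob(
    input: Seq[String],
    mapFn: String => Seq[(String, Int)],
    reduceFn: (String, Seq[Int]) => (String, Int)
): Map[String, Int] =
  input
    .flatMap(mapFn)
    .groupBy(_._1)
    .map { case (word, pairs) => reduceFn(word, pairs.map(_._2)) }

val wordCounts = runJob(Seq("to be or", "not to be"), mapper, reducer)
```

Because mapper and reducer neither read nor modify anything outside their arguments, runJob could be swapped for a distributed implementation without touching them.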

Abstracts “machinery” away - Why?

Functional makes concurrency easy

val numbers = 1 to 100000
val result = numbers.map(slowF)

Functional makes concurrency easy

val numbers = 1 to 100000
val result = numbers.par.map(slowF)

Parallelizes subsequent operations across the available CPUs
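
Why this is safe can be shown without any threads. The sketch below is a single-threaded simulation, not how .par is implemented: it splits the input into chunks, processes them in a deliberately scrambled order, and reassembles the result. The slowF here is a hypothetical stand-in for the function on the slide:

```scala
// Hypothetical stand-in for the slide's slowF: pure, no shared state.
def slowF(n: Int): Int = n * n

val numbers = (1 to 20).toVector
val sequential = numbers.map(slowF)

// Split the work into chunks, roughly one per worker, tagged with their position.
val chunks = numbers.grouped(5).toVector.zipWithIndex

// Process the chunks in an arbitrary order (reversed here), independently.
val processed = chunks.reverse.map { case (chunk, index) =>
  (index, chunk.map(slowF))
}

// Put the partial results back in their original order.
val reassembled = processed.sortBy(_._1).flatMap(_._2)
```

Since slowF depends only on its argument, reassembled equals sequential regardless of the order in which the chunks were handled. That freedom is what a parallel collection exploits.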

Functional makes distribution easy

val numbers = 1 to 100000
val result = sparkContext.parallelize(numbers).map(slowF)

Parallelizes subsequent operations across a scalable cluster by creating a Spark RDD (Resilient Distributed Dataset)

“Spark RDDs are the ultimate Scala collections”

- Martin Odersky

photo: http://www.swissict-award.ch/fileadmin/award/Pressebilder/Martin_Odersky_Scala.jpg

Functional makes Resiliency easy

Pure functions are idempotent: re-running one on the same input gives the same result, so a failed task can safely be retried
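
A minimal sketch of that retry idea, assuming a made-up FlakyWorker that simulates a node failing its first few calls. The names retry, FlakyWorker, mapWithRetry and square are all hypothetical; the only mutable state lives in the simulated infrastructure, never in the function being applied:

```scala
import scala.util.{Failure, Try}

// The pure function we want to apply to every element.
def square(n: Int): Int = n * n

// Re-run a task until it succeeds or the attempts run out.
def retry[A](attempts: Int)(task: () => A): Try[A] =
  Try(task()) match {
    case Failure(_) if attempts > 1 => retry(attempts - 1)(task)
    case result                     => result
  }

// Simulated unreliable node: its first `failures` calls throw.
class FlakyWorker(failures: Int) {
  private var calls = 0
  def run[A](f: () => A): A = {
    calls += 1
    if (calls <= failures) throw new RuntimeException("simulated node failure")
    f()
  }
}

// A map that survives worker failures by retrying each element.
def mapWithRetry(inputs: Seq[Int], worker: FlakyWorker, attempts: Int): Seq[Try[Int]] =
  inputs.map(n => retry(attempts)(() => worker.run(() => square(n))))

val retried = mapWithRetry(Seq(1, 2, 3), new FlakyWorker(2), attempts = 3)
```

Retrying is only safe because square has no side effects. If it also wrote to a database, each retry could repeat the write.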

[Diagram: several Map tasks, one of them retried]

What if we always coded this way?

A functional language means just that

Language “requirements”

Open Source

Strongly Typed

Java/JVM Friendly

Interactive

Performant

Abstracts “machinery” away

But “Scala is hard!”

It’s really not that scary...

From Manuel Bernhardt's "Debunking Some Myths About Scala And Its Environment":

“I need to become a mathematician and know all about Monads before I can get started”

“I can throw all of my object-orientation knowledge out of the window”

“There is no good IDE support for Scala”

Thank You!
