The 10 Apache Spark Features You (Likely) Didn’t Hear About – Roger Brinkley, Technical Evangelist

10 Things About Spark


DESCRIPTION

A presentation prepared for DataStax as part of their interview process on July 20. This presentation, in Ignite format, features 10 things that you might not know about the Spark 1.0 release.


Page 1: 10 Things About Spark

The 10 Apache Spark Features You (Likely) Didn’t Hear About

Roger Brinkley, Technical Evangelist

Page 2: 10 Things About Spark

The 10 Apache Spark Features You (Likely) Didn’t Hear About

• 10 minutes – 10 slides
• Ignite format

• No stopping!
• No going back!
• Questions? Sure, but only while time remains on the slide (otherwise, save them for later)

• Hire me, I’ll find 45 more

Page 3: 10 Things About Spark

It’s Fast. Really Fast.

• 10–100x faster than MapReduce
• 10–100x faster than Hive
• Historical perspective
  – JRuby is only 2–3x faster with the JVM’s invokedynamic support
  – Hardware rarely improves more than 10x/year

MapReduce is Listed as the Last Most Important Software Innovation

And Spark Blew the Lid Off of MapReduce

Page 4: 10 Things About Spark

• Commons-based Peer Production
  – Apache Software Foundation Top-Level Project
  – 200 people from 50 organizations contributing
  – 12 organizations committing
  – Peer governance
  – Participative decision making

It’s Pure Open Source

“The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party.”

John C. Calhoun, 2/13/1835

The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party.

John C. Calhoun, adapted for modern FOSS

Page 5: 10 Things About Spark

Strong Enterprise Relationships

• Spark is in every major Hadoop distribution
• Vertical enterprise use
  – Internet companies, government, financials
  – Churn analysis, fraud detection, risk analytics
• Used in other data stores
  – DataStax (Cassandra)
  – MongoDB
• Databricks has a cloud-based implementation

Page 6: 10 Things About Spark

Enhances Other Big Data Implementations

• Hadoop – replacement of MapReduce
• Cassandra – analytics
• Hive – faster SQL processing
• SAP HANA – faster interactive analysis

Page 7: 10 Things About Spark

API Stability

• Guaranteed stability of its core API for 1.X
• Spark has always been conservative with API changes
• Clearly defined annotations for future APIs
  – Experimental
  – Alpha
  – Developer

Page 8: 10 Things About Spark

Don’t Need to Learn a New Language

• Scala
• Java – 25%
• Python – 30%
• And soon R

Page 9: 10 Things About Spark

Java 8 Lambda Support

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");

The same job with Java 8 lambdas:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
       .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");
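Spark’s APIs aside, the shape of this pipeline is plain Java 8: flatMap the lines into words, then a grouped reduction. A minimal standalone sketch, with no Spark dependency (the class and method names here are illustrative, not from the deck):

```java
import java.util.*;
import java.util.stream.*;

public class WordCount {
    // Count words across lines, mirroring flatMap -> mapToPair -> reduceByKey.
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Same word-count result as the Spark job, computed in one JVM.
        Map<String, Long> counts = count(Arrays.asList("to be or", "not to be"));
        System.out.println(counts); // to=2, be=2, or=1, not=1 (unordered)
    }
}
```

The difference, of course, is that Spark runs the same lambda-shaped pipeline partitioned across a cluster.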

Page 10: 10 Things About Spark

val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()

val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")

Real-Time Stream Processing

Page 11: 10 Things About Spark

Caching for Iterative Algorithms

val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
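The update rule in that snippet can be run outside Spark, too. Here is a minimal plain-Java sketch of the same full-batch logistic-regression gradient step, on a tiny hard-coded 1-D dataset (the class name and data are illustrative; in Spark, cache() is what keeps the points in memory across these repeated iterations):

```java
public class LogisticSketch {
    // Full-batch gradient descent for logistic regression with labels in {-1, +1},
    // mirroring: gradient = sum((1 / (1 + exp(-y * (w dot x))) - 1) * y * x); w -= gradient
    public static double train(double[] xs, double[] ys, int iterations) {
        double w = 0.0; // current separating plane (a scalar in 1-D)
        for (int i = 0; i < iterations; i++) {
            double gradient = 0.0;
            for (int j = 0; j < xs.length; j++) {
                gradient += (1.0 / (1.0 + Math.exp(-ys[j] * (w * xs[j]))) - 1.0)
                        * ys[j] * xs[j];
            }
            w -= gradient; // each iteration re-reads the full dataset
        }
        return w;
    }

    public static void main(String[] args) {
        double[] xs = {2.0, -2.0, 1.5, -1.5}; // points
        double[] ys = {1.0, -1.0, 1.0, -1.0}; // labels
        System.out.println("Final separating plane: " + train(xs, ys, 10));
    }
}
```

Every iteration scans the whole dataset, which is exactly why re-reading it from disk each time (as plain MapReduce would) is so costly and in-memory caching pays off.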

Page 12: 10 Things About Spark

New Security Integration

• Complete integration with the Hadoop/YARN security model
  – Authenticate job submissions
  – Securely transfer HDFS credentials
  – Authenticate communication between components

• Other deployments supported:

val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")

Page 13: 10 Things About Spark

And Lots More

• Apache Spark Website
• Databricks – making big data easy
  – Introduction to Apache Spark
    • Jul 28 – Austin, TX – More Info & Registration
    • Aug 25 – Chicago, IL – More Info & Registration