A presentation prepared for Data Stack as part of their interview process on July 20. This ignite-format presentation features 10 items that you might not know about the Spark 1.0 release.
The 10 Apache Spark Features You (Unlikely) Didn't Hear About
Roger Brinkley, Technical Evangelist

The 10 Apache Spark Features You (Unlikely) Didn't Hear About
• 10 minutes – 10 slides
• Ignite format
• No stopping!
• No going back!
• Questions? Sure, but only while time remains on the slide (otherwise, save for later)
• Hire me, I'll find 45 more
It's Fast. Really Fast
• 10–100x faster than MapReduce
• 10–100x faster than Hive
• Historical perspective
  – JRuby only got 2–3x faster with invokedynamic on the JVM
  – Hardware rarely improves more than 10x/year
MapReduce is Listed as the Last Most Important Software Innovation
And Spark Blew the Lid Off of MapReduce
• Commons-based peer production
  – Apache Software Foundation Top-Level Project
  – 200 people from 50 organizations contributing
  – 12 organizations committing
  – Peer governance
  – Participative decision making
It’s Pure Open Source
The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party.
— John C. Calhoun, 2/13/1835
The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party.
— John C. Calhoun, adapted for modern FOSS
Strong Enterprise Relationships
• Spark is in every major Hadoop distribution
• Vertical enterprise use
  – Internet companies, government, financials
  – Churn analysis, fraud detection, risk analytics
• Used with other data stores
  – DataStax (Cassandra)
  – MongoDB
• Databricks has a cloud-based implementation
Enhances Other Big Data Implementations
• Hadoop – replacement for MapReduce
• Cassandra – analytics
• Hive – faster SQL processing
• SAP HANA – faster interactive analysis
API Stability
• Guaranteed stability of its core API for 1.X
• Spark has always been conservative with API changes
• Clearly defined annotations for future APIs
  – Experimental
  – Alpha
  – Developer
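Audience annotations like the ones above can be sketched in plain Java. The annotation and method below are invented for illustration and are not Spark's actual `org.apache.spark.annotation` classes:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class ApiAnnotationsSketch {
    // Illustrative audience annotation (not Spark's actual class).
    // RUNTIME retention lets tools inspect it reflectively.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Experimental {}

    // A hypothetical API marked as subject to change.
    @Experimental
    static String newFeature() { return "may change in a future release"; }

    public static void main(String[] args) throws Exception {
        boolean marked = ApiAnnotationsSketch.class
                .getDeclaredMethod("newFeature")
                .isAnnotationPresent(Experimental.class);
        System.out.println("newFeature experimental? " + marked);
    }
}
```

Marking API maturity this way lets documentation generators and reviewers flag unstable surface area mechanically rather than by convention alone.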
Don’t Need to Learn a New Language
• Scala
• Java – 25%
• Python – 30%
• And soon R
Java 8 Lambda Support

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");
JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
       .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");
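The semantics of that word-count pipeline can be sketched with ordinary Java streams, no Spark required; `Collectors.groupingBy` + `counting` stands in for `reduceByKey`, and the input lines are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java sketch of the word-count pipeline above: a stream stands in
// for the RDD, and groupingBy + counting stands in for reduceByKey.
public class WordCountSketch {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))   // flatMap step
                .collect(Collectors.groupingBy(w -> w,             // (word, 1) + reduce
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("spark is fast", "spark is fun"));
        // TreeMap just sorts the output for readability
        new TreeMap<>(counts).forEach((w, n) -> System.out.println(w + " " + n));
    }
}
```

The structural correspondence (flatMap, then pair-and-reduce by key) is exactly why the Java 8 lambda version of the Spark code reads so much like an ordinary streams pipeline.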
val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")
Real-Time Stream Processing
Caching Interactive Algorithms
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
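The iterative pattern above is the reason for the `.cache()` call: the same dataset is re-read on every pass, so keeping it in memory pays off. A minimal plain-Java sketch of the same gradient loop on a tiny in-memory 1-D dataset (data and iteration count invented for illustration):

```java
// Plain-Java sketch of the iterative logistic-regression loop above,
// on a tiny 1-D dataset held in memory (the role .cache() plays for an RDD).
public class GradientSketch {
    // points[i] = {x, y}, with label y in {-1, +1}
    static double fit(double[][] points, int iterations) {
        double w = 1.0; // current separating plane (1-D weight)
        for (int i = 0; i < iterations; i++) {
            double gradient = 0.0;
            for (double[] p : points) {        // full pass over the data
                double x = p[0], y = p[1];
                gradient += (1.0 / (1.0 + Math.exp(-y * (w * x))) - 1.0) * y * x;
            }
            w -= gradient;                     // same update as the Spark loop
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] data = { {2.0, 1.0}, {3.0, 1.0}, {-2.0, -1.0}, {-3.0, -1.0} };
        System.out.println("Final separating plane: " + fit(data, 10));
    }
}
```

Because each iteration scans the whole dataset, an uncached Spark version would re-read from HDFS every pass; caching makes the per-iteration cost a memory scan instead.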
New Security Integration
• Complete integration with the Hadoop/YARN security model
  – Authenticate job submissions
  – Securely transfer HDFS credentials
  – Authenticate communication between components
• Other deployments supported:

val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")
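The shared-secret idea behind `spark.authenticate.secret` can be sketched as an HMAC challenge–response in plain Java. This is an illustrative sketch of the general technique, not Spark's actual wire protocol; all names and values are invented:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SharedSecretSketch {
    // Both sides compute an HMAC of a random challenge with the shared
    // secret; a peer that answers correctly must hold the secret, and the
    // secret itself never crosses the wire.
    static byte[] respond(String secret, byte[] challenge) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8),
                "HmacSHA256"));
        return mac.doFinal(challenge);
    }

    static boolean authenticate(String serverSecret, String clientSecret,
                                byte[] challenge) throws Exception {
        // Real code would use a constant-time comparison here.
        return Arrays.equals(respond(serverSecret, challenge),
                             respond(clientSecret, challenge));
    }

    public static void main(String[] args) throws Exception {
        byte[] challenge = "nonce-123".getBytes(StandardCharsets.UTF_8);
        System.out.println(authenticate("good", "good", challenge)); // matches
        System.out.println(authenticate("good", "bad", challenge));  // rejected
    }
}
```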
And Lots More
• Apache Spark website
• Databricks – making big data easy
  – Introduction to Apache Spark
    • Jul 28 – Austin, TX – More Info & Registration
    • Aug 25 – Chicago, IL – More Info & Registration