Data Science Bootcamp Day-3
Presented by: Chetan Khatri, Volunteer Teaching Assistant, Data Science Lab, University of Kachchh
Guidance by: Prof. Devji D. Chhanga, University of Kachchh.
Agenda
An Introduction to Apache Spark
Apache Spark single node configuration
MapReduce Program on Spark Cluster
An Introduction to Apache Kafka
Apache Kafka single node configuration
Create Topic, Push Messages to Topic
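The Kafka hands-on items above boil down to two shell commands. A minimal sketch, assuming a single-node Kafka broker with ZooKeeper on localhost and a hypothetical topic name "bootcamp" (newer Kafka versions replace --zookeeper with --bootstrap-server):

# Create a topic with one partition and no replication (fine for a single node)
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic bootcamp

# Push messages to the topic: each line typed into the console producer becomes one message
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic bootcamp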
Spark Terminology
» Spark and SQL Contexts: A Spark program first creates a SparkContext object
» SparkContext tells Spark how and where to access a cluster
» The program next creates a sqlContext object
» Use sqlContext to create DataFrames
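A minimal sketch of that setup as a standalone Scala program, assuming Spark 1.x; the app name is illustrative. (In the spark-shell used later in this deck, sc and sqlContext are already created for you.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Tell Spark how and where to access a cluster; "local[*]" means all local cores
val conf = new SparkConf().setAppName("BootcampDay3").setMaster("local[*]")
val sc = new SparkContext(conf)

// Create a sqlContext on top of the SparkContext; use it to create DataFrames
val sqlContext = new SQLContext(sc)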
Review: DataFrames
The primary abstraction in Spark:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on a collection of elements in parallel

You construct DataFrames:
» by parallelizing existing Scala collections (lists)
» by transforming existing Spark DataFrames
» from files in HDFS or any other storage system
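A minimal sketch of all three construction routes, assuming the Spark 1.x shell (sc and sqlContext predefined); the column names and file path are illustrative:

// 1) Parallelize an existing Scala collection and convert it to a DataFrame
import sqlContext.implicits._
val df = sc.parallelize(Seq(("alice", 1), ("bob", 2))).toDF("name", "score")

// 2) Transform an existing DataFrame; the result is a new, immutable DataFrame
val highScores = df.filter($"score" > 1)

// 3) Read from files in HDFS or any other storage system
val fromFile = sqlContext.read.json("hdfs:///data/people.json")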
Review: DataFrames
Two types of operations: transformations and actions
» Transformations are lazy (not computed immediately)
» A transformed DF is executed only when an action runs on it
» You can persist (cache) DFs in memory or on disk
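A minimal sketch of laziness and caching, reusing the df DataFrame from the sketch above:

// filter is a transformation: nothing runs yet, Spark only records the plan
val highScores = df.filter($"score" > 1)

// count is an action: it triggers execution of the whole plan
highScores.count()

// cache marks the DF for in-memory persistence
highScores.cache()
highScores.count()   // this action materializes the cache
highScores.count()   // later actions read from the cached data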
Resilient Distributed Datasets
The untyped Spark abstraction underneath DataFrames:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on a collection of elements in parallel

You construct RDDs:
» by parallelizing existing Scala collections (lists)
» by transforming existing RDDs or DataFrames
» from files in HDFS or any other storage system
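The same three routes for RDDs, as a sketch assuming the spark-shell (sc predefined) and an illustrative file path:

// 1) Parallelize an existing Scala collection
val nums = sc.parallelize(List(1, 2, 3, 4, 5))

// 2) Transform an existing RDD (df.rdd gives you the RDD underneath a DataFrame)
val doubled = nums.map(_ * 2)

// 3) Read from files in HDFS or any other storage system
val lines = sc.textFile("hdfs:///data/input.txt")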
When to use DataFrames?
» You need high-level transformations and actions, and want high-level control over your dataset
» You have typed (structured or semi-structured) data
» You want DataFrame optimization and performance benefits
  » Catalyst Optimization Engine
    • 75% reduction in execution time
  » Project Tungsten off-heap memory management
    • 75+% reduction in memory usage (less GC)
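One way to see Catalyst at work, as a sketch reusing the df DataFrame from earlier (with sqlContext.implicits._ in scope): explain prints the plans Catalyst produced for a query.

// explain(true) prints the parsed, analyzed, optimized, and physical plans
df.filter($"score" > 1).select($"name").explain(true)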
Apache Spark MapReduce
1) Start the Apache Spark shell:
./bin/spark-shell

2) Read a text file:
scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")

3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:
scala> textFile.count()
scala> textFile.first()

4) Now let's use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> linesWithSpark.collect()   // collect is an action: it returns the transformation's output
Apache Spark MapReduce
5) We can chain transformations and actions together:
scala> textFile.filter(line => line.contains("Spark")).count()
6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
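As a small extension of the walkthrough (not in the original steps), you can sort the pair RDD to find the most frequent words:

// Swap (word, count) to (count, word), sort descending, and take the top 10
scala> wordCounts.map(_.swap).sortByKey(ascending = false).take(10)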