
Page 1: Data science bootcamp day 3

Data Science Bootcamp Day-3
Presented by: Chetan Khatri, Volunteer Teaching Assistant, Data Science lab, University of Kachchh

Guidance by: Prof. Devji D. Chhanga, University of Kachchh.

Page 2: Data science bootcamp day 3

Agenda

An Introduction to Apache Spark

Apache Spark single node configuration

MapReduce Program on Spark Cluster

An Introduction to Apache Kafka

Apache Kafka single node configuration

Create Topic, Push Messages to Topic

Page 3: Data science bootcamp day 3

Spark Terminology

» Spark and SQL contexts: A Spark program first creates a SparkContext object

» SparkContext tells Spark how and where to access a cluster

» The program next creates a sqlContext object

» Use sqlContext to create DataFrames
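In the spark-shell both objects are pre-built as sc and sqlContext; a minimal sketch of the same setup in a standalone Scala program might look like this (the app name and the local master URL are assumptions, Spark 1.x API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Tell Spark how (app name) and where (master URL) to access a cluster.
// "local[*]" runs Spark on all cores of a single local node.
val conf = new SparkConf().setAppName("BootcampDay3").setMaster("local[*]")
val sc = new SparkContext(conf)

// The sqlContext is built on top of the SparkContext and is used to create DataFrames.
val sqlContext = new SQLContext(sc)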

Page 4: Data science bootcamp day 3

Review: DataFrames

The primary abstraction in Spark:

» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on a collection of elements in parallel

You construct DataFrames (as sketched below):

» by parallelizing existing Scala collections (lists)
» by transforming an existing Spark DataFrame
» from files in HDFS or any other storage system
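A rough sketch of all three construction routes in the spark-shell (Spark 1.x API; the column names and the file path are placeholders):

import sqlContext.implicits._

// From a parallelized Scala collection (list).
val df1 = sc.parallelize(List(("a", 1), ("b", 2))).toDF("key", "value")

// By transforming an existing DataFrame.
val df2 = df1.filter(df1("value") > 1)

// From a file in HDFS or any other storage system.
val df3 = sqlContext.read.json("hdfs:///path/to/data.json")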

Page 5: Data science bootcamp day 3

Review: DataFrames

Two types of operations: transformations and actions.

Transformations are lazy (not computed immediately).

A transformed DataFrame is executed when an action runs on it.

You can persist (cache) DataFrames in memory or on disk.
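A minimal sketch of lazy evaluation and caching (the logs data, its level column, and the file path are assumptions):

// Build a DataFrame from a JSON file (path is a placeholder).
val logs = sqlContext.read.json("hdfs:///path/to/logs.json")

// Transformation: only the lineage is recorded, nothing is computed yet.
val errors = logs.filter(logs("level") === "ERROR")

// Ask Spark to keep the result in memory once it is computed.
errors.cache()

// Actions trigger the actual computation.
val n = errors.count()  // first action computes and caches the DataFrame
errors.show()           // later actions reuse the cached result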

Page 6: Data science bootcamp day 3

Resilient Distributed Datasets

The low-level Spark abstraction underneath DataFrames:
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on a collection of elements in parallel

You construct RDDs (as sketched below):

» by parallelizing existing Scala collections (lists)
» by transforming an existing RDD or DataFrame
» from files in HDFS or any other storage system
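The same three routes for RDDs, as a sketch (the spark-shell built-ins sc and sqlContext are assumed; paths and column names are placeholders):

import sqlContext.implicits._

// From a parallelized Scala collection (list).
val nums = sc.parallelize(List(1, 2, 3, 4))

// By transforming an existing RDD.
val doubled = nums.map(_ * 2)

// From a DataFrame: every DataFrame exposes its underlying RDD of Rows.
val df = sc.parallelize(List(("a", 1), ("b", 2))).toDF("key", "value")
val rows = df.rdd

// From files in HDFS or any other storage system.
val lines = sc.textFile("hdfs:///path/to/file.txt")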

Page 7: Data science bootcamp day 3

When to use DataFrames?

Need high-level transformations and actions, and want high-level control over your dataset.

Have typed (structured or semi-structured) data.

You want DataFrame optimization and performance benefits:
» Catalyst Optimization Engine
• 75% reduction in execution time
» Project Tungsten off-heap memory management
• 75+% reduction in memory usage (less GC)

Page 8: Data science bootcamp day 3

Apache Spark MapReduce

1) Start the Apache Spark shell:
./bin/spark-shell

2) Let's read the text file:
scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")

3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:
scala> textFile.count()
scala> textFile.first()

4) Now let's use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Get the transformation output.
scala> linesWithSpark.collect()

Page 9: Data science bootcamp day 3

Apache Spark MapReduce

5) We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count()

6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

scala> wordCounts.collect()
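For comparison, a hedged sketch of the same word count written against the DataFrame API from the earlier slides (Spark 1.x functions API; the input path is the one from step 2):

import sqlContext.implicits._
import org.apache.spark.sql.functions.{explode, split}

// Turn each line of the file into a one-column DataFrame.
val linesDF = sc.textFile("file:///home/chetan306/inputfile.txt").toDF("line")

// Split lines into words, one word per row, then count per word.
val wordCountsDF = linesDF
  .select(explode(split($"line", " ")).as("word"))
  .groupBy("word")
  .count()

wordCountsDF.show()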