Tulsa TechFest Spark Core, Aug 5th 2016


Remove Duplicates: Basic Spark Functionality

Spark

Spark Core

• Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems

Spark Core

• Spark introduces the concept of an RDD (Resilient Distributed Dataset):
  • an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel
  • contains any type of object and is created by loading an external dataset or distributing a collection from the driver program
• RDDs support two types of operations:
  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
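
As a minimal sketch of the difference, assuming a spark-shell where sc is already defined (the colleges.csv path is the one used later in this deck):

// Transformations are lazy: they only describe a new RDD.
val lines   = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
val nonHdr  = lines.filter(line => !line.startsWith("UNITID"))   // transformation -> new RDD
val lengths = nonHdr.map(line => line.length)                    // transformation -> new RDD

// Actions trigger the computation and return a value to the driver.
val rows  = lengths.count()        // action -> Long
val chars = lengths.reduce(_ + _)  // action -> Int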

Spark DataFrames

• The DataFrames API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications:
  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)
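
As a hedged illustration of that API (not part of the original slides), assuming the Spark 1.x sqlContext and the spark-csv package that appear in the demo at the end; the column names come from the colleges.csv schema shown there:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/marksmith/TulsaTechFest2016/colleges.csv")

// Declarative, SQL-like operations on named columns, optimized by Catalyst.
df.select("INSTNM", "CITY", "STABBR")
  .filter(df("STABBR") === "OK")     // illustrative filter on Oklahoma schools
  .groupBy("CITY")
  .count()
  .show(10)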

Remove Duplicates

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]

val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]

college: RDD
  "as,df,asf"  "qw,e,qw"  "mb,kg,o"
  "as,df,asf"  "q3,e,qw"  "mb,kg,o"
  "as,df,asf"  "qw,e,qw"  "mb,k2,o"

cNoDups: RDD
  "as,df,asf"  "qw,e,qw"  "mb,kg,o"
  "q3,e,qw"    "mb,k2,o"

val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]]

val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x)
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]

college: RDD
  "as,df,asf"  "qw,e,qw"  "mb,kg,o"
  "as,df,asf"  "q3,e,qw"  "mb,kg,o"
  "as,df,asf"  "qw,e,qw"  "mb,k2,o"

cRows: RDD
  Array(as,df,asf)  Array(qw,e,qw)  Array(mb,kg,o)
  Array(as,df,asf)  Array(q3,e,qw)  Array(mb,kg,o)
  Array(as,df,asf)  Array(qw,e,qw)  Array(mb,k2,o)

cKeyRows: RDD
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)

val cGrouped = cKeyRows
  .groupBy(x => x._1)
  .map(x => (x._1, x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]

val cDups = cGrouped.filter(x => x._2.length > 1)

cKeyRows: RDD
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)

cGrouped: RDD
  key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
  key->(Array(mb,kg,o), Array(mb,kg,o))    key->(Array(mb,k2,o))
  key->(Array(qw,e,qw), Array(qw,e,qw))    key->(Array(q3,e,qw))

val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]

val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]

cGrouped: RDD
  key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
  key->(Array(mb,kg,o), Array(mb,kg,o))    key->(Array(mb,k2,o))
  key->(Array(qw,e,qw), Array(qw,e,qw))    key->(Array(q3,e,qw))

cNoDups: RDD
  "as,df,asf"  "qw,e,qw"  "mb,kg,o"
  "q3,e,qw"    "mb,k2,o"

cDups: RDD
  key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
  key->(Array(mb,kg,o), Array(mb,kg,o))
  key->(Array(qw,e,qw), Array(qw,e,qw))
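
Putting the steps above together, here is a self-contained sketch of the same key-based de-duplication on the toy rows from the diagrams (the real demo builds its key from four columns; the toy rows only have three, so the key here uses three):

// Toy data standing in for colleges.csv
val college = sc.parallelize(Seq(
  "as,df,asf", "qw,e,qw", "mb,kg,o",
  "as,df,asf", "q3,e,qw", "mb,kg,o",
  "as,df,asf", "qw,e,qw", "mb,k2,o"))

val cRows    = college.map(_.split(",", -1))                              // each line -> Array of columns
val cKeyRows = cRows.map(x => "%s_%s_%s".format(x(0), x(1), x(2)) -> x)   // composite key -> row
val cGrouped = cKeyRows.groupBy(_._1)                                     // group rows sharing a key
  .map(x => (x._1, x._2.to[scala.collection.mutable.ArrayBuffer]))

val cDups   = cGrouped.filter(_._2.length > 1)   // keys seen more than once
val cNoDups = cGrouped.map(_._2(0)._2)           // keep the first row per key

println(cDups.count())                                      // 3 duplicated keys in this toy data
cNoDups.collect().foreach(r => println(r.mkString(",")))    // 5 distinct rows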

Previously the RDD was Spark's primary abstraction; today the Spark DataFrames API is considered the primary way to interact with Spark, but the RDD API is still available if needed.

What is partitioning in Apache Spark?

Partitioning is the main mechanism for using all of your hardware resources while executing a job: more partitions = more parallelism. Conceptually, you should check how many task slots your hardware provides, i.e. how many tasks each executor can handle. Partitions are spread across the executors.
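
A small sketch of inspecting and changing the partition count, assuming the same colleges.csv RDD (actual counts will vary with your input and cluster):

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")

// One task per partition per stage, so this bounds the parallelism.
println(college.partitions.length)

// Increase partitions (full shuffle) to use more task slots...
val college16 = college.repartition(16)

// ...or reduce them without a full shuffle.
val college4 = college16.coalesce(4)
println(college4.partitions.length)   // 4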

DataFrames

• A DataFrame has a column structure, and each record is a row (a line of the source file).
• You can run statistics naturally, since it works much like SQL or a Python/R data frame.
• With an RDD, to process only the last 7 days of data Spark has to go through the entire dataset; with a DataFrame you already have a time column to handle this, so Spark never even looks at data older than 7 days (see the sketch below).
• Easier to program.
• Better performance and storage in the executor heap.
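
A hedged sketch of the "last 7 days" point, using a hypothetical events dataset (not part of the colleges demo) with an event_time column; because the filter is expressed on a column, Catalyst and the data source can skip the older data instead of scanning every record the way an RDD job would:

import org.apache.spark.sql.functions.{col, current_date, date_sub}

// Hypothetical Parquet dataset with an event_time timestamp column.
val events = sqlContext.read.parquet("/tmp/events_parquet")

// Only the last 7 days; the predicate is visible to the optimizer.
val lastWeek = events.filter(col("event_time") >= date_sub(current_date(), 7))
lastWeek.count()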

How does a DataFrame read less data?

• You can skip partitions while reading the data.
• Using Parquet.
• Skipping data using statistics (i.e. min, max).
• Using partitioning (i.e. year=2015/month=06, ...).
• Pushing predicates down into the storage systems (two of these techniques are sketched below).
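
Partition pruning and predicate pushdown, sketched with a hypothetical logs dataset (paths and column names are illustrative, not from the slides):

import org.apache.spark.sql.functions.col

// Hypothetical source with year, month, and status columns.
val logs = sqlContext.read.json("/tmp/logs_json")

// Lay the data out as year=.../month=... directories in Parquet.
logs.write.partitionBy("year", "month").parquet("/tmp/logs_parquet")

// Partition pruning: only the year=2015/month=6 directories are read.
// Predicate pushdown: the status filter reaches the Parquet reader,
// which can skip row groups using its min/max statistics.
val june2015Errors = sqlContext.read.parquet("/tmp/logs_parquet")
  .filter(col("year") === 2015 && col("month") === 6)
  .filter(col("status") === 500)
june2015Errors.count()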

What is Parquet?

• Parquet should be the source format for any operation or ETL. If the data is in a different format, the preferred approach is to convert the source to Parquet and then process it (see the sketch after this list).
• If a dataset is in JSON or a comma-separated file, first ETL it to convert it to Parquet.
• It limits I/O, scanning/reading only the columns that are needed.
• Parquet uses a columnar layout, so it compresses better and saves space.
• Parquet stores each column's data together, column by column. So if a table has 3 columns and the SQL query touches only 2 of them, Parquet won't even consider reading the 3rd column.
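
A minimal sketch of that convert-to-Parquet-first workflow, assuming the spark-csv package from the demo and a hypothetical output path:

// One-time ETL: convert the CSV source to Parquet, then run later jobs on the Parquet copy.
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/marksmith/TulsaTechFest2016/colleges.csv")

csv.write.parquet("/tmp/colleges_parquet")   // hypothetical output path

// Columnar read: only the three selected columns are scanned from disk.
val slim = sqlContext.read.parquet("/tmp/colleges_parquet")
  .select("INSTNM", "CITY", "STABBR")
slim.show(5)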

Demo RDD Code

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805

val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27

val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29

cNoDups.count
res7: Long = 7805

college.count
res8: Long = 9000

val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29

val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x)
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31

cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(

val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33

val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35

cDups.count
res12: Long = 1195

val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35

cNoDups.count
res13: Long = 7805

val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35

cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245

Demo DataFrame Code

import org.apache.spark.sql.SQLContext

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI: string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string, SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string, ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...

val dfd = df.distinct
dfd.count
res0: Long = 7804

df.count
res1: Long = 8998

val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804

val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
res8: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]

dfCnt.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|

df.registerTempTable("colleges")

val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]

dfCnt2.show

+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
|  156921| 696100|  6961|Jefferson Communi...|  1|
