Apache Spark
Concepts - Spark SQL, GraphX, Streaming
Petr Zapletal
Cake Solutions

Apache Spark and Big Data
- History and market overview
- Installation
- MLlib and Machine Learning on Spark
- Porting R code to Scala and Spark
- Concepts - Spark SQL, GraphX, Streaming
- Spark's distributed programming model
- Deployment

Table of contents
- Resilient Distributed Datasets
- Spark SQL
- GraphX
- Spark Streaming
- Q & A

Spark Modules

Resilient Distributed Datasets
- Immutable, distributed collection of records
- Lazy evaluation, optional caching, can be persisted
- A number of operations & transformations
- Can be created from data storage or from another RDD
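
The lazy-evaluation and caching behaviour listed above can be illustrated with a small model in plain Python. This is only a sketch of the idea, not Spark's API: the class `LazyDataset` and its `evaluations` counter are hypothetical names invented for this example, and nothing here is distributed.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark API)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # deferred recipe, not the data
        self._cached: Optional[List] = None
        self._cache_enabled = False
        self.evaluations = 0                 # how often the recipe really ran

    @classmethod
    def from_storage(cls, records: Iterable) -> "LazyDataset":
        data = tuple(records)                # immutable snapshot of the input
        return cls(lambda: data)

    # Transformations: return a NEW dataset and compute nothing yet.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._materialize()))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._materialize() if pred(x)))

    def cache(self) -> "LazyDataset":
        self._cache_enabled = True
        return self

    def _materialize(self) -> List:
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = list(self._compute())
        if self._cache_enabled:
            self._cached = result
        return result

    # Actions: these trigger the actual computation.
    def collect(self) -> List:
        return list(self._materialize())

    def count(self) -> int:
        return len(self._materialize())


source = LazyDataset.from_storage(range(1, 11))
even_squares = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
evaluations_before_action = even_squares.evaluations
collected = even_squares.collect()
counted = even_squares.count()
evaluations_after_actions = even_squares.evaluations
```

Building `even_squares` runs nothing. The first action computes the result, and because `cache()` was requested the second action reuses it instead of recomputing.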

Spark SQL
- Spark's interface for working with structured or semi-structured data
- Structured data
  - known set of fields for each record - schema
- Main capabilities
  - load data from a variety of structured sources
  - query the data with SQL
  - integration between Spark (Java, Scala and Python API) and SQL (joining RDDs and SQL tables, using SQL functionality)
- More than SQL
  - Unified interface for structured data

SchemaRDD
- RDD of row objects, each representing a record
- Known schema (i.e. data fields) of its rows
- Behaves like a regular RDD, stored in a more efficient manner
- Adds new operations, especially running SQL queries
- Can be created from
  - external data sources
  - results of queries
  - regular RDD
- Used in ML Pipeline API
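
To make "records with a known schema, queried with SQL" concrete, here is a minimal sketch that loads JSON records into a table and runs a query over them. It deliberately uses Python's built-in `json` and `sqlite3` modules as a stand-in, not Spark SQL; the table name `tweets`, its columns and the sample records are assumptions made up for illustration.

```python
import json
import sqlite3

# Hypothetical input: one JSON record per line, all sharing the same fields.
RAW_JSON_LINES = [
    '{"author": "alice", "body": "spark sql", "retweets": 12}',
    '{"author": "bob", "body": "graphx rocks", "retweets": 40}',
    '{"author": "carol", "body": "streaming", "retweets": 7}',
]


def load_records(lines):
    """Parse each JSON line into a dict (a row with named fields)."""
    return [json.loads(line) for line in lines]


def top_tweets(records, limit):
    """Store rows under an explicit schema and query them with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (author TEXT, body TEXT, retweets INTEGER)")
    conn.executemany(
        "INSERT INTO tweets VALUES (:author, :body, :retweets)", records
    )
    rows = conn.execute(
        "SELECT body, retweets FROM tweets ORDER BY retweets DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return rows


top_two = top_tweets(load_records(RAW_JSON_LINES), 2)
```

The point of the schema is that the query can refer to fields by name and the engine knows their types up front, which is what lets a SchemaRDD store rows more efficiently than a generic RDD of objects.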

Getting Started
- Entry points:
  - HiveContext
    - superset functionality, Hive related
  - SQLContext

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.

Query Example
- Loads input JSON file into SchemaRDD
- Uses context to execute query

Loading and Saving Data
- Supports a number of structured data sources
- Apache Hive
  - data warehouse infrastructure on top of Hadoop
  - summarization, querying (SQL-like interface) and analysis
- Parquet
  - column-oriented storage format in the Hadoop ecosystem
  - efficient storage of records with nested fields
- JSON
- RDDs
- JDBC/ODBC Server
  - connecting Business Intelligence tools
  - remote access to Spark cluster

GraphX
- New Spark API for graphs and graph-parallel computation
- Resilient Distributed Property Graph (RDPG, extends RDD)
  - directed multigraph (-> parallel edges)
  - properties attached to each vertex and edge
- Common graph operations (subgraph computation, joining vertices, ...)
- Growing collection of graph algorithms

Motivation
- Growing scale and importance of graph data
- Application of data-parallel algorithms to graph computation is inefficient
- Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient execution of graph algorithms
  - do not address graph construction & transformation
  - limited fault tolerance & data mining support

Connected Components and PageRank algorithms
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf

For Spark we implemented the algorithms both using idiomatic dataflow operators (Naive Spark, as described in Section 3.2) and using an optimized implementation (Optimized Spark) that eliminates movement of edge data by pre-partitioning the edges to match the partitioning adopted by GraphX.
We have excluded Giraph and Optimized Spark from Figure 7c because they were unable to scale to the larger web-graph in the allotted memory of the cluster. While the basic Spark implementation did not crash, it was forced to re-compute blocks from disk and exceeded 8000 seconds per iteration. We attribute the increased memory overhead to the use of edge-cut partitioning and the need to store bi-directed edges and messages for the connected components algorithm.

Property Graph
- Directed multigraph with user-defined objects attached to each vertex and edge

Triplet View
- Logical join of vertex and edge properties
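
The property graph and its triplet view can be sketched in a few lines of plain Python. This is an illustrative model only, not the GraphX API: the `Edge` and `Triplet` types, the function `triplet_view` and the sample people are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Edge(NamedTuple):
    src: int          # source vertex id
    dst: int          # destination vertex id
    attr: str         # user-defined edge property


class Triplet(NamedTuple):
    src: int
    src_attr: Tuple[str, str]
    dst: int
    dst_attr: Tuple[str, str]
    attr: str


# Vertex id -> user-defined vertex property (name, role).
vertices: Dict[int, Tuple[str, str]] = {
    1: ("alice", "student"),
    2: ("bob", "professor"),
    3: ("carol", "postdoc"),
}

# A multigraph: the two edges 1 -> 2 are parallel but carry different properties.
edges: List[Edge] = [
    Edge(1, 2, "advisor"),
    Edge(3, 2, "colleague"),
    Edge(1, 2, "coauthor"),
]


def triplet_view(vertices, edges) -> List[Triplet]:
    """Join each edge with the properties of both of its endpoints."""
    return [
        Triplet(e.src, vertices[e.src], e.dst, vertices[e.dst], e.attr)
        for e in edges
    ]


triplets = triplet_view(vertices, edges)
descriptions = [f"{t.src_attr[0]} is the {t.attr} of {t.dst_attr[0]}" for t in triplets]
```

Each triplet carries everything needed to reason about one relationship without further lookups, which is why the view is convenient for per-edge computations.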

Graph Operations
- Basic information (numEdges, numVertices, inDegrees, ...)
- Views (vertices, edges, triplets)
- Caching (persist, cache, ...)
- Transformation (mapVertices, mapEdges, ...)
- Structure modification (reverse, subgraph, ...)
- Neighbour aggregation (collectNeighbors, aggregations, ...)
- Pregel API
- Graph builders (various I/O operations)
- ...

Graph Algorithms
- Built-in algorithms
  - PageRank, Connected Components, Triangle Count, ...
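
As a rough illustration of how a Pregel-style computation expresses one of the built-in algorithms, the sketch below finds connected components by repeatedly sending the smallest known vertex id along every edge. It is a single-machine model written for this example, not GraphX's implementation; the function name and the sample graph are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(
    vertex_ids: Iterable[int], edges: List[Tuple[int, int]]
) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertex_ids}      # initially each vertex is alone
    changed = True
    while changed:                            # one loop pass = one superstep
        # Message phase: every edge offers each endpoint the other's label.
        inbox: Dict[int, int] = {}
        for src, dst in edges:
            for target, offered in ((dst, labels[src]), (src, labels[dst])):
                if offered < labels[target]:
                    inbox[target] = min(offered, inbox.get(target, offered))
        # Vertex phase: adopt the smallest label received, if any.
        changed = bool(inbox)
        for vertex, label in inbox.items():
            labels[vertex] = label
    return labels


sample_edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(range(1, 7), sample_edges)
```

Vertices 1, 2 and 3 end up sharing label 1, vertices 4 and 5 share label 4, and the isolated vertex 6 keeps its own id. The loop stops as soon as a superstep delivers no messages.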

Spark Streaming
- Scalable, high-throughput, fault-tolerant stream processing

Architecture
- Streams are chopped up into batches
- Each batch is processed in Spark
- Results are pushed out in batches

Streaming Word Count
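
The code shown on the word-count slide did not survive in this text, so the following is a stand-in sketch in plain Python rather than the original Spark program. It models the architecture described above: timestamped lines are chopped into fixed-interval batches and each batch is counted on its own. The function names, the timestamps and the one-second interval are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def discretize(
    events: Iterable[Tuple[float, str]], batch_interval: float
) -> List[List[str]]:
    """Chop a stream of (timestamp, line) events into consecutive batches."""
    by_index: Dict[int, List[str]] = {}
    for timestamp, line in events:
        by_index.setdefault(int(timestamp // batch_interval), []).append(line)
    if not by_index:
        return []
    # Intervals without any event still produce an (empty) batch.
    return [by_index.get(i, []) for i in range(max(by_index) + 1)]


def word_count_per_batch(batches: Iterable[List[str]]) -> List[Dict[str, int]]:
    """Apply the same word count independently to every batch."""
    results = []
    for batch in batches:
        words = (word for line in batch for word in line.split())
        results.append(dict(Counter(words)))
    return results


events = [
    (0.2, "hello spark"),
    (0.7, "hello streaming"),
    (1.1, "spark spark"),
    (3.5, "bye"),
]
batches = discretize(events, batch_interval=1.0)
counts = word_count_per_batch(batches)
```

Note that the counts are per batch: "spark" is counted once in the first batch and twice in the second, and nothing carries over between them.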

StreamingContext
- Entry point for all streaming functionality
  - define input sources
  - stream transformations
  - output operations to DStreams
  - starts & stops streaming process
- Limitations
  - once started, no new computations can be added
  - once stopped, it cannot be restarted
  - only one StreamingContext can be active per JVM

Discretized Streams (DStreams)
- Basic abstraction, represents a continuous stream of data
- Implemented as a series of RDDs

Stateless Transformations
- Processing of each batch does not depend on previous batches
- Transformation is applied separately to every batch
  - map, flatMap, filter, reduce, groupBy
- Combining data from multiple DStreams
  - join, cogroup, union, ...

cogroup - When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
join - When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
union - Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
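
The three combining operations can be modelled on a single batch of key-value pairs. The functions below are plain Python written for this note, not the DStream API, and the sample pairs are made up; in Spark Streaming the same logic would be applied to each pair of corresponding batches.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, object]


def union(batch_a: List[Pair], batch_b: List[Pair]) -> List[Pair]:
    """All elements of both batches, duplicates kept."""
    return list(batch_a) + list(batch_b)


def join(batch_a: List[Pair], batch_b: List[Pair]) -> List[Tuple[str, tuple]]:
    """(K, (V, W)) for every combination of values that share a key."""
    right = defaultdict(list)
    for key, w in batch_b:
        right[key].append(w)
    return [(key, (v, w)) for key, v in batch_a for w in right.get(key, [])]


def cogroup(batch_a: List[Pair], batch_b: List[Pair]) -> Dict[str, tuple]:
    """K -> (all V for K, all W for K); keys present on either side are kept."""
    grouped = defaultdict(lambda: ([], []))
    for key, v in batch_a:
        grouped[key][0].append(v)
    for key, w in batch_b:
        grouped[key][1].append(w)
    return dict(grouped)


clicks = [("x", 1), ("y", 2), ("x", 3)]
labels = [("x", "a"), ("z", "b")]

unioned = union(clicks, labels)
joined = join(clicks, labels)
cogrouped = cogroup(clicks, labels)
```

The difference to watch for: `join` drops keys that appear on only one side ("y" and "z" here), while `cogroup` keeps them with an empty list for the missing side.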

Stateful Transformations
- Use data or intermediate results from previous batches to compute the result of the current batch
- Windowed operations
  - act over a sliding window of time periods
- updateStateByKey
  - maintains state while continuously updating it with new information
- Require checkpointing
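
Both kinds of stateful transformation can be sketched over a list of batches in plain Python. This is a model of the semantics only, not the Spark Streaming API: window length and slide interval are expressed in numbers of batches rather than durations, and the function names and sample data are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def windowed_counts(
    batches: List[List[str]], window_length: int, slide_interval: int
) -> List[Dict[str, int]]:
    """Count words over the last `window_length` batches, every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        total: Counter = Counter()
        for batch in batches[start:end]:
            total.update(batch)
        results.append(dict(total))
    return results


def update_state_by_key(
    batches: List[List[Tuple[str, int]]],
    update_fn: Callable[[List[int], Optional[int]], Optional[int]],
) -> List[Dict[str, int]]:
    """Carry per-key state across batches; returning None drops the key."""
    state: Dict[str, int] = {}
    snapshots = []
    for batch in batches:
        new_values: Dict[str, List[int]] = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        for key in set(state) | set(new_values):
            updated = update_fn(new_values.get(key, []), state.get(key))
            if updated is None:
                state.pop(key, None)
            else:
                state[key] = updated
        snapshots.append(dict(state))
    return snapshots


word_batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
windows = windowed_counts(word_batches, window_length=2, slide_interval=1)

pair_batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 2)]]
running_totals = update_state_by_key(
    pair_batches, lambda new, old: sum(new) + (old or 0)
)
```

A window forgets data once it slides past, whereas the keyed state keeps accumulating indefinitely. That long-lived dependency on earlier batches is the reason these operations need checkpointing in a real deployment.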

Output Operations
- Specify what needs to be done with the final transformed data
- Pushing to an external DB, printing
- If no output operation is performed, the DStream is not evaluated

Input Sources
- Built-in support for a number of different data sources
- Often in additional libraries (e.g. spark-streaming-kafka)
  - HDFS
  - Akka Actor Stream
  - Apache Kafka
  - Apache Flume
  - Twitter Stream
  - Kinesis
  - Custom Sources
  - ...

Demo

Conclusion
- RDD recap
- Spark Modules Overview
  - Spark SQL
  - GraphX
  - Spark Streaming

Questions