© 2016 IBM Corporation
IBM Analytics
Analyzing Flight Data
Jeff Carlson
Rich Tarro
July 21, 2016
Agenda
Spark Overview – a quick review
Introduction to Graph Processing and Spark GraphX
GraphX Overview
Demo Scenario Overview
Demo
Wrap-up
What is Spark?
Spark is an open source, in-memory application framework for distributed
data processing and iterative analysis on massive data volumes
“Analytic Operating System”
Key reasons for interest in Spark
Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks
Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in the data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)
Leverages existing investments
– Works well within the existing Hadoop ecosystem
Improves with age
– Large and growing community of contributors continuously improves the full analytics stack and extends its capabilities
Spark includes a set of core libraries that enable various analytic methods
which can process data from many sources
Spark Core Engine: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
Spark SQL: executes SQL statements
Spark Streaming: performs streaming analytics using micro-batches
MLlib (machine learning): common machine learning and statistical algorithms
GraphX (graph): distributed graph processing framework
Data sources: a large variety of data sources and formats can be supported, both on premises and in the cloud
– BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
Deployment: IBM Cloud, other cloud, cloud apps, on-premises
Spark Application Architecture
A Spark application is initiated from a driver program
Spark execution modes:
– Standalone with the built-in cluster manager
– Use Mesos as the cluster manager
– Use YARN as the cluster manager
– Standalone cluster on any cloud (Bluemix, IBM SoftLayer, Amazon, Azure, …)
Spark RDDs
Immutable
Two types of operations
– Transformations ~ DDL (Create View V2 as…): lazy evaluation
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map(x => x + 1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place at this point (lazy evaluation)
– Actions ~ Select (Select * From V2…): perform computations
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the transformations and the action
• Returns a value (or writes to a file)
Fault tolerance
– If data in memory is lost, it will be recreated from lineage
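The split between lazy transformations and eager actions can be sketched without Spark. The following is a toy model in plain Python, not the Spark API: the class name LazyDataset and its methods are hypothetical stand-ins that only mimic how a transformation records lineage while an action replays it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original input, never modified
        self._lineage = tuple(lineage)  # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset with one more lineage step.
        # No element is processed here.
        return LazyDataset(self._source, self._lineage + (fn,))

    def collect(self):
        # Action: replay the recorded lineage over the source values.
        values = list(self._source)
        for fn in self._lineage:
            values = [fn(v) for v in values]
        return values


numbers = LazyDataset(range(1, 11))      # numbers from 1 to 10
numbers2 = numbers.map(lambda x: x + 1)  # lineage recorded, nothing computed yet
result = numbers2.collect()              # the computation happens here
```

Because the result is never stored inside the dataset, calling collect() again simply replays the lineage, which is the same idea Spark relies on to rebuild lost in-memory data.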
Graphs are Central to Analytics
Data is not just getting bigger, it’s getting more connected
In many use cases, the relationship between data points provides as
much value or more than the data points themselves
Discovering data relationships and interdependencies is critical to many applications
– fraud detection
– better understanding customer relationships
– ranking web pages or people in social networks
Graph analytics is a powerful tool for understanding and exploiting
the connections in data
Graph applications are everywhere today and are a critical
component of many next generation applications
What is a Graph?
A graph is a mathematical structure used to model relations between
objects. A graph is made up of vertices and edges that connect them. The
vertices are the objects and the edges are the relationships between
them.
Directed graph
– A graph where the edges have a direction associated with them. An example of a directed graph is Twitter followers: user Bob can follow user Carol without implying that user Carol follows user Bob.
Undirected graph
– A graph where the edges have no direction, so every relationship holds both ways. An example is Facebook friends: if Bob is a friend of Carol, then Carol is also a friend of Bob. (A regular graph is a different concept: one where each vertex has the same number of edges.)
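The difference between a one-way "follows" relationship and a mutual "friend" relationship can be shown with a small adjacency map. This is a plain-Python sketch; the helper name build_adjacency is hypothetical and is not part of any Spark library.

```python
def build_adjacency(edges, directed):
    """Return {vertex: set of neighbours} for a list of (src, dst) pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())  # make sure every vertex has an entry
        if not directed:
            adjacency[dst].add(src)       # a mutual relationship holds both ways
    return adjacency


# Bob follows Carol (one direction only)
follows = build_adjacency([("Bob", "Carol")], directed=True)

# Bob and Carol are friends (both directions)
friends = build_adjacency([("Bob", "Carol")], directed=False)
```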
Spark GraphX
Graph processing system, NOT a database
GraphX extends the Spark RDD by introducing a Graph abstraction
– A directed multigraph with properties attached to each vertex and edge
GraphX exposes a set of fundamental operators to support graph computation
– subgraph, joinVertices, aggregateMessages, …
Algorithms to simplify graph analytics tasks
– In addition to a highly flexible API, GraphX comes with a growing library of graph algorithms
– PageRank, Triangle Counting, …
Spark GraphX – Flexible Graphing
GraphX unifies ETL, exploratory analysis, and iterative graph
computation
You can view the same data as both graphs and collections,
transform and join graphs with RDDs efficiently, and write custom
iterative graph algorithms with the API
GraphX and the Alternatives
GraphX
– Optimized for running complex algorithms on the entire graph
Relational databases are inadequate for any real type of graph analysis
Graph databases
– Support database transactions (updates and deletes)
– Typically work with small sections of the graph
• Ex. query small groups of vertices
Graph Databases
The same restrictions that enable graph databases to achieve
substantial performance gains also limit their ability to express many
of the important stages in a typical graph-analytics pipeline
Often require data-movement outside of the graph topology to
express operations that are more naturally expressed as
relational/table operations
GraphX Benefits
Unify graph and data centric computation in one system with a single
composable API
Enables users to view data both as graphs and as collections (i.e.,
RDDs) or tables (DataFrames) without data movement or duplication
Property Graphs
GraphX implements an object called the property graph
– Directed multigraph with user-defined objects attached to each vertex and edge
Like RDDs, property graphs are immutable, distributed, and fault-tolerant
Directed multigraphs can have multiple edges in parallel
– Every edge and vertex has user-defined properties associated with it
– The parallel edges allow multiple relationships between the same vertices
Vertex and Edge RDDs
GraphX exposes RDD views of the vertices and edges stored within
the graph
VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint
that each VertexID occurs only once
EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks
partitioned using one of the partitioning strategies defined in
PartitionStrategy
Example Property Graph
Example – Constructing a Property Graph
Construct a property graph consisting of the various collaborators
– Vertex properties might contain the username and occupation
– Edges carry a string describing the relationship between collaborators
Example – Working with a Property Graph
Deconstructing a graph
– Vertex and edge views
– Use the ‘graph.vertices’ and ‘graph.edges’ members
Alternatively, deconstruct edges by pattern matching on the Edge case class type constructor
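The collaborator example can be modelled outside Spark to make the structure concrete. This is a plain-Python sketch rather than GraphX code: the Edge tuple, the user names, and the relationship labels are all made up for illustration. It keeps separate vertex and edge views and includes two parallel edges between the same pair of vertices, which is what makes the graph a multigraph.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src_id: int  # ID of the source vertex
    dst_id: int  # ID of the destination vertex
    attr: str    # edge property: a string describing the relationship


# Vertex view: vertex ID -> (username, occupation). All values are made up.
vertices = {
    1: ("ann", "student"),
    2: ("ben", "professor"),
    3: ("cat", "postdoc"),
}

# Edge view: directed edges, including two parallel edges from vertex 2 to vertex 1.
edges = [
    Edge(2, 1, "advisor"),
    Edge(2, 1, "coauthor"),
    Edge(1, 3, "collaborator"),
]

# Query the vertex view by property.
students = [name for name, occupation in vertices.values() if occupation == "student"]

# Query the edge view: every relationship that runs from vertex 2 to vertex 1.
ben_to_ann = [e.attr for e in edges if (e.src_id, e.dst_id) == (2, 1)]
```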
Triplet Views
Logically joins the vertex and edge properties
RDD[EdgeTriplet[VD, ED]] contains instances of the EdgeTriplet class
This join can be expressed as a SQL join of the edges with the vertices, matching each edge's source and destination IDs to vertex IDs
EdgeTriplet Class
Extends the Edge class by adding the srcAttr and dstAttr members
Can be used to render a collection of strings describing relationships between users
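The triplet join can be sketched as a lookup of both endpoint properties for every edge. This is plain Python under assumed names: the fields src_attr and dst_attr mirror the srcAttr and dstAttr members mentioned above, while the triplets helper and the sample users are hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src_id: int
    dst_id: int
    attr: str


class EdgeTriplet(NamedTuple):
    src_id: int
    dst_id: int
    attr: str
    src_attr: tuple  # properties of the source vertex
    dst_attr: tuple  # properties of the destination vertex


def triplets(vertices, edges):
    """Join each edge with the properties of its source and destination vertices."""
    return [
        EdgeTriplet(e.src_id, e.dst_id, e.attr, vertices[e.src_id], vertices[e.dst_id])
        for e in edges
    ]


# Made-up sample data: vertex ID -> (username, occupation)
vertices = {1: ("ann", "student"), 2: ("ben", "professor")}
edges = [Edge(2, 1, "advisor")]

# Render one descriptive string per relationship.
facts = [
    f"{t.src_attr[0]} is the {t.attr} of {t.dst_attr[0]}"
    for t in triplets(vertices, edges)
]
```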
Graph Operators
Property graphs have basic operators, similar to RDD operations like map, filter, and reduceByKey
Core operators have optimized implementations
Graph operator types:
– Property operators (mapVertices, mapEdges, mapTriplets)
– Structural operators (reverse, subgraph, mask, groupEdges)
– Join operators (joinVertices, outerJoinVertices)
Graph Operators - Subgraph
The subgraph operator takes vertex and edge predicates and returns the graph
containing only the vertices that satisfy the vertex predicate and the edges
that satisfy the edge predicate
– Retained edges must also connect vertices that satisfy the vertex predicate
The subgraph operator can be used in a number of situations to restrict the
graph to the vertices and edges of interest or to eliminate broken links
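The filtering rule can be sketched in a few lines of plain Python. The subgraph function below is a hypothetical illustration of the semantics, not the GraphX operator itself: an edge survives only if it passes the edge predicate and both of its endpoints pass the vertex predicate. The sample data uses a vertex with no properties to stand in for a broken link.

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Keep vertices passing vpred, and edges passing epred whose endpoints were both kept."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [
        (src, dst, attr)
        for src, dst, attr in edges
        if src in kept_vertices and dst in kept_vertices and epred(src, dst, attr)
    ]
    return kept_vertices, kept_edges


# Made-up data: vertex 3 has no properties, so edges touching it are broken links.
vertices = {1: "A", 2: "B", 3: None}
edges = [(1, 2, "x"), (2, 3, "y")]

valid_vertices, valid_edges = subgraph(
    vertices, edges, vpred=lambda vid, attr: attr is not None
)
```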
Graph Algorithms - PageRank
An algorithm created by Google to rank websites in its search engine results
– Named after Larry Page, one of the founders of Google
PageRank works by counting the number and quality of links to a page to
determine a rough estimate of how important the website is
– The underlying assumption is that more important websites are likely to receive more links from other websites
The mathematics of PageRank is entirely general and applies to any graph or network in any domain
– E.g. Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow
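The idea that a link from an important page counts for more can be sketched with the textbook power-iteration form of PageRank. This is plain Python and an assumption-laden illustration: the damping factor of 0.85, the fixed iteration count, and the normalization (ranks sum to 1) are common textbook choices and are not taken from the GraphX implementation, which may normalize differently.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over (src, dst) link pairs; ranks sum to 1."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        # Each vertex passes its rank on, split equally across its outgoing links.
        for src in vertices:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up link structure: A and B both link to C, and C links back to A.
links = [("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(links)
top_vertex = max(ranks, key=ranks.get)
```

In this small example C receives the most links and ends up with the highest rank, while B, which nothing links to, ends up with the lowest.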
Graph Algorithms - Triangle Counting
GraphX implements a triangle counting algorithm
– A triangle is a small graph of three nodes in which every pair of nodes is connected
– Used in many real-world applications as a measure of clustering
Determines the number of triangles passing through each vertex
– A vertex is part of a triangle when it has two adjacent vertices with an edge between them
TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy
– E.g. RandomVertexCut colocates all same-direction edges between two vertices by hashing the source and destination vertex IDs
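Per-vertex triangle counting and the role of canonical orientation can be sketched in plain Python. This is not the GraphX TriangleCount implementation: the function name and the sample edges are hypothetical, and the sketch simply stores each edge once as (low ID, high ID) before counting connected neighbour pairs.

```python
from itertools import combinations


def triangle_counts(edges):
    """Return {vertex: number of triangles passing through it} for (src, dst) pairs."""
    # Canonical orientation: keep each connection once as (low, high),
    # which also removes reverse duplicates and self-loops.
    canonical = {(min(s, d), max(s, d)) for s, d in edges if s != d}

    neighbours = {}
    for a, b in canonical:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # A vertex closes a triangle for every pair of its neighbours that are themselves connected.
    return {
        v: sum(1 for a, b in combinations(sorted(adj), 2) if b in neighbours[a])
        for v, adj in neighbours.items()
    }


# Made-up edges: 1-2-3 form a triangle (with a duplicate 2->1), and 4 hangs off 3.
sample_edges = [(1, 2), (2, 1), (2, 3), (3, 1), (3, 4)]
counts = triangle_counts(sample_edges)
```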
Demo Scenario
Explore and analyze airline data
– Vertices representing airports
– Edges representing flights between airports and their associated distance
Use a number of operators provided by GraphX to analyze the data in the graph and the relationships within it
– E.g. find the airports with the greatest number of inbound and outbound flights
Employ graph operators to transform the graphs into new graphs
– Based on transformation criteria, like the distance between airports
Employ graph algorithms included with GraphX, like PageRank and Triangle Counting, to determine the busiest airports
Demo Scenario Data
Airline data in CSV format is readily available on the US Bureau of Transportation Statistics (BTS) website
– http://www.rita.dot.gov/bts/home
This demo employs US flight data for March 2016
Demo Flow
Download the data (CSV format)
Read in the CSV file as a DataFrame (infer the schema)
Clean up the DataFrame
– Remove the blank column and any rows that contain nulls
Convert the DataFrame to an RDD
– Use a custom case class
– GraphX is based on RDDs, so the DataFrame must be converted into an RDD
Extract data (airport IDs and airport codes) for the graph vertices
Extract data (origin airport ID, destination airport ID, distance between airports) for the graph edges
Create the EdgeRDD
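The cleanup and extraction steps can be sketched on a tiny in-memory CSV. This is plain Python using only the standard library, not the Spark DataFrame code used in the demo. The column names, airport IDs, airport codes, and distances below are all made up for illustration and should be replaced with the headers of the file actually downloaded. The trailing comma on each line is an assumption about how the blank column arises.

```python
import csv
import io

# Made-up rows for illustration. The trailing comma on each line yields a blank column.
SAMPLE_CSV = """ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DISTANCE,
10001,AAA,10002,BBB,500.00,
10002,BBB,10001,AAA,500.00,
10001,AAA,10003,CCC,,
"""


def load_graph(csv_text):
    """Return (vertices, edges): {airport_id: code} and [(origin_id, dest_id, distance)]."""
    vertices = {}
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row.pop("", None)  # drop the blank column
        if any(value in ("", None) for value in row.values()):
            continue       # skip rows that contain missing values
        origin_id = int(row["ORIGIN_AIRPORT_ID"])
        dest_id = int(row["DEST_AIRPORT_ID"])
        vertices[origin_id] = row["ORIGIN"]
        vertices[dest_id] = row["DEST"]
        edges.append((origin_id, dest_id, float(row["DISTANCE"])))
    return vertices, edges


airports, flights = load_graph(SAMPLE_CSV)
```

The third sample row has no distance, so it is dropped during cleanup and airport 10003 never becomes a vertex.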
Example Demo Graph
Demo Flow (continued)
Create the graph
Investigate the graph
– Show vertices
– Count the number of vertices/airports
– Show edges/flights
– Count the number of edges/flights and distinct routes
– Query the graph based on vertex and edge attributes and properties
Create a triplet view of the graph
– Query the triplet view
Compute the highest degree vertices (in, out, and total)
Calculate PageRank scores for the graph vertices
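Counting flights, distinct routes, and vertex degrees comes down to simple tallies over the edge list. The following plain-Python sketch uses made-up airport codes and a hypothetical helper name; it does not use the GraphX degree operators.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree, total_degree) Counters for (src, dst) pairs."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree, in_degree + out_degree


# Made-up flights; the AAA -> BBB route is flown twice.
sample_flights = [
    ("AAA", "BBB"), ("AAA", "BBB"), ("AAA", "CCC"),
    ("BBB", "AAA"), ("CCC", "AAA"), ("BBB", "CCC"),
]

num_flights = len(sample_flights)      # every edge is one flight
num_routes = len(set(sample_flights))  # distinct (origin, destination) pairs

in_degree, out_degree, total_degree = degree_counts(sample_flights)
busiest_airport, busiest_total = total_degree.most_common(1)[0]
```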
Demo Flow (conclusion)
Create a subgraph
Explore the subgraph
– Using both vertex predicates and edge predicates
Create a subgraph for Triangle Counting
– TriangleCount requires the edges to be in canonical orientation
– It also requires the graph to be partitioned
Create a Triangle Count graph
Investigate the vertices/airports with the highest triangle count
Summary
Graphs provide a powerful way to model and analyze connected data
GraphX builds on the massively parallel, fault-tolerant foundation of
Spark to provide graph processing
Spark provides the ability to complement graph processing with
relational processing in a single consistent framework and set of
APIs
GraphX is a graph processing system and not a database
GraphX provides a number of operators and algorithms to facilitate
working with and understanding the connections in the data
GraphX Challenges
Scala API only
– No Python or Java APIs
Utilizes the lower-level RDD-based API (vs. DataFrames)
Does not benefit from Spark DataFrame optimizations such as the Catalyst query optimizer or Tungsten memory management
Enter Spark GraphFrames
DataFrame-based graphs for Apache Spark
– Vertices and edges are represented as DataFrames
– Enables arbitrary data to be stored with each vertex and edge
Python, Java and Scala APIs
Simplified interactive queries
– Phrase queries in the familiar, powerful Spark SQL and DataFrame APIs
Supports motif finding for structural pattern search
– For example, to recommend whom to follow, you might search for triplets of users A, B, C where A follows B and B follows C, but A does not follow C
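The follow-recommendation pattern can be sketched as a search over pairs of edges. This is plain Python and does not use the GraphFrames motif API; the function name and the sample follow relationships are hypothetical.

```python
def recommend_follows(follows):
    """Find (A, C) pairs where A follows B, B follows C, and A does not yet follow C."""
    follow_set = set(follows)
    recommendations = set()
    for a, b in follow_set:
        for b2, c in follow_set:
            # Chain the two edges through the shared user B, skip A == C,
            # and keep the pair only when the closing edge A -> C is absent.
            if b2 == b and c != a and (a, c) not in follow_set:
                recommendations.add((a, c))
    return sorted(recommendations)


# Made-up data: A follows B and D; B follows C and D.
sample_follows = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "D")]
suggested = recommend_follows(sample_follows)
```

Here A is pointed to C through B, while D is not suggested because A already follows D.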
Benefits from Spark DataFrame optimizations
GraphFrames fully integrate with GraphX via conversions between
the two representations