


© 2016 IBM Corporation

IBM Analytics

Analyzing Flight Data

Jeff Carlson

Rich Tarro

July 21, 2016



Spark Overview – a quick review

Introduction to Graph Processing and Spark GraphX

GraphX Overview

Demo Scenario Overview




What is Spark?

Spark is an open source application framework for distributed data processing and iterative analysis on massive data volumes

“Analytic Operating System”


Key reasons for interest in Spark

Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks

Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in the data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
– Works well within the existing Hadoop ecosystem

Improves with age
– A large and growing community of contributors continuously improves the full analytics stack and extends its capabilities


Spark includes a set of core libraries that enable various analytic methods which can process data from many sources

[Figure: the Spark stack]
– Spark Core: the underlying engine; handles distributed task scheduling and basic I/O
– Spark SQL: executes SQL
– Further libraries cover streaming analytics, machine learning, and graph processing (GraphX)
– A large variety of data sources and formats can be supported, both on premises and in the cloud

Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:
– Standalone with the built-in cluster manager
– Use Mesos as the cluster manager
– Use YARN as the cluster manager
– Standalone cluster on any cloud (Bluemix, IBM SoftLayer, Amazon, Azure, …)
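
A minimal sketch of a driver program, assuming Spark is on the classpath. The application name and the `local[*]` master are illustrative choices, and the hosts in the commented master URLs are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver program builds the configuration and creates the SparkContext.
// "local[*]" runs everything in-process; typical alternatives are
//   spark://<host>:7077  (standalone cluster manager)
//   mesos://<host>:5050  (Mesos)
//   yarn                 (YARN; yarn-client or yarn-cluster before Spark 2.0)
val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Work defined in the driver is split into tasks that run on the executors.
val total = sc.parallelize(1 to 100).reduce(_ + _)
println(s"Sum computed by the executors: $total")
```

In practice the master is often left out of the code and supplied when the application is submitted, so the same program can run under any of the cluster managers listed above.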


Spark RDDs


Two types of operations

– Transformations ~ DDL (Create View V2 as…): lazily evaluated
• val rddNumbers = sc.parallelize(1 to 10): numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map(x => x + 1): numbers from 2 to 11
• The lineage of how to obtain rddNumbers2 from rddNumbers is recorded
• The lineage is a Directed Acyclic Graph (DAG)
• No data is processed at this point (lazy evaluation)

– Actions ~ Select (Select * From V2…): perform computations
• rddNumbers2.collect(): Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
• Performs the recorded transformations and then the action
• Returns a value (or writes to a file)

Fault tolerance
– If data in memory is lost, it is recreated from the lineage
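
The transformation and action above can be tried with a sketch like the following, assuming a local Spark installation. The application name is arbitrary, and `toDebugString` is used only to make the recorded lineage visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-lineage-sketch").setMaster("local[*]"))

// Transformations only record lineage; nothing is computed yet.
val rddNumbers = sc.parallelize(1 to 10)
val rddNumbers2 = rddNumbers.map(x => x + 1)

// The recorded lineage (a DAG) can be inspected without running a job.
println(rddNumbers2.toDebugString)

// The action triggers the computation and returns the result to the driver.
val result: Array[Int] = rddNumbers2.collect()
println(result.mkString(", "))
```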


Graphs are Central to Analytics

Data is not just getting bigger, it’s getting more connected

In many use cases, the relationship between data points provides as much value or more than the data points themselves

Discovering data relationships and interdependencies is critical to many applications
– Fraud detection
– Better understanding customer relationships
– Ranking web pages or people in social networks

Graph analytics is a powerful tool for understanding and exploiting the connections in data

Graph applications are everywhere today and are a critical component of many next generation applications


What is a Graph?

A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them.

Directed graph
– A graph where the edges have a direction associated with them. An example of a directed graph is Twitter followers: user Bob can follow user Carol without implying that user Carol follows user Bob.

Regular graph
– A graph where each vertex has the same number of edges.

Undirected graph
– A graph where the edges have no direction. An example is Facebook friends: if Bob is a friend of Carol, then Carol is also a friend of Bob.


Spark GraphX

Graph processing system, NOT a database

GraphX extends the Spark RDD by introducing a Graph abstraction
– A directed multigraph with properties attached to each vertex and edge

GraphX exposes a set of fundamental operators to support graph computation
– subgraph, joinVertices, aggregateMessages, …

Algorithms to simplify graph analytics tasks
– In addition to a highly flexible API, GraphX comes with a growing library of graph algorithms
– PageRank, Triangle Counting, …


Spark GraphX – Flexible Graphing

GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system

You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API


GraphX and the Alternatives

GraphX
– Optimized for running complex algorithms on the entire graph

Relational databases are inadequate for any real type of graph processing

Graph databases
– Database transactions: updates and deletes
– Typically work with small sections of the graph
• Ex. query small groups of vertices


Graph Databases

The same restrictions that enable graph databases to achieve substantial performance gains also limit their ability to express many of the important stages in a typical graph-analytics pipeline

Graph databases often require data movement outside of the graph topology to express operations that are more naturally expressed as relational/table operations


GraphX Benefits

Unifies graph-centric and data-centric computation in one system with a single composable API

Enables users to view data both as graphs and as collections (i.e., RDDs) or tables (DataFrames) without data movement or duplication


Property Graphs

GraphX implements an object called the property graph
– A directed multigraph with user-defined objects attached to each vertex and edge

Like RDDs, property graphs are immutable, distributed, and fault-tolerant

Directed multigraphs can have multiple edges in parallel
– Every edge and vertex has user-defined properties associated with it
– The parallel edges allow multiple relationships between the same vertices


Vertex and Edge RDDs

GraphX exposes RDD views of the vertices and edges stored within the graph

VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint that each VertexID occurs only once

EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy


Example Property Graph


Example – Constructing a Property Graph

Construct a property graph consisting of the various collaborators
– Each vertex property might contain the username and occupation
– Each edge carries a string describing the relationship between collaborators
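
The slide's code is not preserved in this text. The following sketch follows the collaborator example from the GraphX programming guide, which this slide appears to be based on; the user names, occupations, and relationship labels are that guide's sample data, not the presenters' own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("property-graph-sketch").setMaster("local[*]"))

// Vertices: (id, (username, occupation))
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Edges: source id, destination id, relationship description
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Attribute given to any vertex that an edge mentions but `users` does not define.
val defaultUser = ("John Doe", "Missing")

val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
```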


Example – Working with a Property Graph

Deconstructing a graph
– Vertex and edge views
– Use the ‘graph.vertices’ and ‘graph.edges’ members

Alternatively, use the Edge case class type constructor to deconstruct edges
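
A sketch of both ways of deconstructing the graph, again using the sample collaborators from the GraphX programming guide as assumed data. The two edge filters are equivalent; the second pattern matches on the `Edge` case class.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graph-views-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships)

// Vertex view: count the users who are postdocs.
val postdocs = graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count()

// Edge view: count the edges whose source id is greater than the destination id.
val viaMembers = graph.edges.filter(e => e.srcId > e.dstId).count()

// The same filter, written with the Edge case class type constructor.
val viaPattern = graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count()
```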


Triplet Views

The triplet view logically joins the vertex and edge properties

The resulting RDD[EdgeTriplet[VD, ED]] contains instances of the EdgeTriplet class

This join can be expressed as a SQL join of the edge table with the vertex table on the source and destination vertex IDs
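
The join behind the triplet view can be made concrete with a sketch that builds it by hand from the edge and vertex RDDs and compares it with `graph.triplets`. The sample data is assumed (taken from the GraphX programming guide), and the SQL in the comment is the form that guide uses to describe the join.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triplet-join-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships)

// SELECT src.id, dst.id, src.attr, e.attr, dst.attr
// FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
// ON e.srcId = src.Id AND e.dstId = dst.Id
val joinedByHand = graph.edges
  .map(e => (e.srcId, e))
  .join(graph.vertices)                                // attach the source attribute
  .map { case (_, (e, src)) => (e.dstId, (e, src)) }
  .join(graph.vertices)                                // attach the destination attribute
  .map { case (_, ((e, src), dst)) => (e.srcId, e.dstId, src, e.attr, dst) }

// The triplet view exposes the same fields without the explicit joins.
val viaTriplets = graph.triplets
  .map(t => (t.srcId, t.dstId, t.srcAttr, t.attr, t.dstAttr))
```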


EdgeTriplet Class

Extends the Edge class by adding the srcAttr and dstAttr members

The triplet view can render a collection of strings describing relationships between users
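
A sketch of rendering relationship strings from the triplet view, using `srcAttr`, `attr`, and `dstAttr`. The collaborator data is the assumed sample from the GraphX programming guide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("edge-triplet-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships)

// srcAttr and dstAttr hold the (username, occupation) pairs of the two endpoints.
val facts: RDD[String] = graph.triplets.map(t =>
  t.srcAttr._1 + " is the " + t.attr + " of " + t.dstAttr._1)

facts.collect().foreach(println)
```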


Graph Operators

Property graphs have basic operators similar to RDD operations like map, filter, and reduceByKey

Core operators have optimized implementations

Graph operator types:
– Property operators (mapVertices, mapEdges, mapTriplets)
– Structural operators (reverse, subgraph, mask, groupEdges)
– Join operators (joinVertices, outerJoinVertices)
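
One operator from each family can be sketched as follows, on the assumed collaborator sample from the GraphX programming guide: `mapVertices` (property), `reverse` (structural), and `outerJoinVertices` (join).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graph-operators-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships)

// Property operator: keep only the username on each vertex.
val names: Graph[String, String] = graph.mapVertices((id, attr) => attr._1)

// Structural operator: flip the direction of every edge.
val reversed = graph.reverse

// Join operator: attach each vertex's out-degree. Vertices without outgoing
// edges are absent from outDegrees, hence the Option and the default of 0.
val withOutDegree: Graph[(String, Int), String] =
  names.outerJoinVertices(names.outDegrees) { (id, name, deg) => (name, deg.getOrElse(0)) }

withOutDegree.vertices.collect().foreach(println)
```

Each operator returns a new graph; the original `graph` is left unchanged, in line with the immutability of property graphs.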


Graph Operators - Subgraph

The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate and the edges that satisfy the edge predicate
– Retained edges must also connect vertices that satisfy the vertex predicate

The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or to eliminate broken links
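
A sketch of eliminating broken links with `subgraph`, modeled on the example in the GraphX programming guide. The data is assumed: two edges point at vertex 0, which has no user record and therefore receives the default attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subgraph-sketch").setMaster("local[*]"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
  (4L, ("peter", "student"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
  Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))  // vertex 0 is undefined

// Undefined vertices get this default attribute when the graph is built.
val defaultUser = ("John Doe", "Missing")
val graph = Graph(users, relationships, defaultUser)

// Drop the placeholder vertex; edges that touch it are dropped as well.
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

println(s"before: ${graph.numVertices} vertices, ${graph.numEdges} edges")
println(s"after:  ${validGraph.numVertices} vertices, ${validGraph.numEdges} edges")
```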


Graph Algorithms - PageRank

An algorithm created by Google to rank websites in its search engine results
– Named after Larry Page, one of the founders of Google

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is
– The underlying assumption is that more important websites are likely to receive more links from other websites

The mathematics of PageRank are entirely general and apply to any graph or network in any domain
– E.g. Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow
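
A small sketch of running the PageRank implementation that ships with GraphX. The four-page link structure and the tolerance of 0.0001 are made up for illustration; page 0 is linked to by every other page.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

// Hypothetical links (from page, to page).
val links: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((1L, 0L), (2L, 0L), (3L, 0L), (3L, 1L)))
val web = Graph.fromEdgeTuples(links, defaultValue = 1)

// Iterate until the ranks change by less than the tolerance.
val ranks = web.pageRank(0.0001).vertices

// Sort on the driver, most important page first.
val ranked = ranks.collect().sortBy { case (_, rank) => -rank }
ranked.foreach { case (id, rank) => println(s"page $id: $rank") }
```

`staticPageRank(numIter)` is the fixed-iteration alternative when a convergence tolerance is not wanted.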


Graph Algorithms - Triangle Counting

GraphX implements a triangle counting algorithm
– A triangle is a small three-node graph in which every two nodes are connected
– Used in many real-world applications as a measure of clustering

Determines the number of triangles passing through each vertex
– A vertex is part of a triangle when it has two adjacent vertices with an edge between them

TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy
– E.g. RandomVertexCut collocates all same-direction edges between two vertices by hashing the source and destination vertex IDs
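
A sketch of triangle counting on a made-up four-vertex graph whose edges are already in canonical orientation. Vertices 1, 2, and 3 form a triangle, and vertex 4 hangs off vertex 3.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("triangle-count-sketch").setMaster("local[*]"))

// Hypothetical edges, each written with srcId < dstId.
val edges: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L)))

// Partition the graph before counting, as the algorithm requires.
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Each vertex attribute becomes the number of triangles through that vertex.
val triCounts = graph.triangleCount().vertices
triCounts.collect().sortBy(_._1).foreach { case (id, n) => println(s"vertex $id: $n") }
```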


Demo Scenario

Explore and analyze airline data
– Vertices represent airports
– Edges represent flights between airports and their associated distance

Use a number of operators provided by GraphX to analyze the data in the graph and the relationships within it
– E.g. find the airports with the greatest number of inbound and outbound flights

Employ graph operators to transform the graphs into new graphs
– Based on transformation criteria, like the distance between airports

Employ graph algorithms included with GraphX, like PageRank and Triangle Counting, to determine the busiest airports


Demo Scenario Data

Airline data in CSV format is readily available on the US Bureau of Transportation Statistics (BTS) website

This demo employs US flight data for March 2016


Demo Flow

Download the data (CSV format)

Read in the CSV file as a DataFrame (infer the schema)

Clean up the DataFrame
– Remove the blank column and rows that contain nulls

Convert the DataFrame to an RDD
– Use a custom case class
– GraphX is based on RDDs, so the DataFrame must be converted into an RDD

Extract data (airport IDs and airport codes) for the graph vertices

Extract data (origin airport ID, destination airport ID, distance between airports) for the graph edges

Create the EdgeRDD
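
The preparation steps above might look like the following sketch. It rests on several assumptions: Spark 2.x (where the CSV reader is built in), a tiny made-up file standing in for the BTS download, and column names (`ORIGIN_AIRPORT_ID`, `ORIGIN`, `DEST_AIRPORT_ID`, `DEST`, `DISTANCE`) chosen to resemble the BTS export. The `Flight` case class is hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.graphx.{Edge, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("flight-prep-sketch").getOrCreate()

// Stand-in for the downloaded file: a few made-up rows, the last one incomplete.
val sample = Seq(
  "ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DISTANCE",
  "1,ATL,2,ORD,606",
  "2,ORD,1,ATL,606",
  "1,ATL,3,LAX,1947",
  "3,LAX,,,").mkString("\n")
val csvPath = Files.createTempFile("flights", ".csv")
Files.write(csvPath, sample.getBytes(StandardCharsets.UTF_8))

// Read the CSV as a DataFrame and infer the schema.
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath.toString)

// Clean up: selecting the needed columns drops any blank extra column,
// and na.drop() removes rows that contain nulls.
val clean = raw
  .select("ORIGIN_AIRPORT_ID", "ORIGIN", "DEST_AIRPORT_ID", "DEST", "DISTANCE")
  .na.drop()

// Convert the DataFrame to an RDD of a custom case class.
case class Flight(originId: Long, origin: String, destId: Long, dest: String, distance: Double)
val flights: RDD[Flight] = clean.rdd.map(r => Flight(
  r.getAs[Number]("ORIGIN_AIRPORT_ID").longValue, r.getAs[String]("ORIGIN"),
  r.getAs[Number]("DEST_AIRPORT_ID").longValue, r.getAs[String]("DEST"),
  r.getAs[Number]("DISTANCE").doubleValue))

// Vertices: distinct (airport ID, airport code) pairs.
val airports: RDD[(VertexId, String)] =
  flights.flatMap(f => Seq((f.originId, f.origin), (f.destId, f.dest))).distinct()

// Edges: one per flight, carrying the distance between the airports.
val routes: RDD[Edge[Double]] = flights.map(f => Edge(f.originId, f.destId, f.distance))
```

On Spark 1.6 the reading step would instead go through the separate spark-csv package, so the reader call is the part most likely to differ from the original demo.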


Example Demo Graph


Demo Flow (continued)

Create the graph

Investigate the graph
– Show vertices
– Count the number of vertices/airports
– Show edges/flights
– Count the number of edges/flights and distinct routes
– Query the graph based on vertex and edge attributes and properties

Create a triplet view of the graph
– Query the triplet view

Compute the highest degree vertices (in, out, and total)

Calculate PageRank for the graph vertices
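
These steps can be sketched on a small in-memory graph. Everything in the sample is made up for illustration: the vertex IDs stand in for airport IDs, the distances are approximate, and the helper `busiest` is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flight-graph-sketch").setMaster("local[*]"))

// Vertices: (airport ID, airport code). Edges: one per flight, distance in miles.
val airports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "ATL"), (2L, "ORD"), (3L, "LAX"), (4L, "JFK")))
val flights: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 606), Edge(1L, 2L, 606), Edge(2L, 1L, 606), Edge(1L, 3L, 1947),
  Edge(3L, 1L, 1947), Edge(1L, 4L, 760), Edge(4L, 3L, 2475)))
val graph = Graph(airports, flights, "unknown")

// Investigate the graph: counts of airports, flights, and distinct routes.
val numAirports = graph.numVertices
val numFlights = graph.numEdges
val numRoutes = graph.edges.map(e => (e.srcId, e.dstId)).distinct().count()

// Query the triplet view: flights longer than 1000 miles, with airport codes.
val longFlights = graph.triplets.filter(t => t.attr > 1000)
  .map(t => s"${t.srcAttr} -> ${t.dstAttr}: ${t.attr} miles").collect()

// Highest-degree airport for a given degree view (in, out, or total).
def busiest(degrees: VertexRDD[Int]): (String, Int) =
  degrees.join(airports).map { case (_, (d, code)) => (code, d) }.collect().maxBy(_._2)
val topOut = busiest(graph.outDegrees)
val topTotal = busiest(graph.degrees)

// PageRank per airport, highest first.
val ranked = graph.pageRank(0.001).vertices.join(airports)
  .map { case (_, (rank, code)) => (code, rank) }.collect().sortBy(-_._2)
ranked.foreach(println)
```

Because the graph is a multigraph, the repeated ATL to ORD flight counts twice toward `numFlights` and the degrees, but only once toward `numRoutes`.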


Demo Flow (conclusion)

Create a subgraph

Explore the subgraph
– Using both vertex predicates and edge predicates

Create a subgraph for Triangle Counting
– TriangleCount requires the edges to be in canonical orientation
– The graph must also be partitioned

Create a Triangle Count graph

Investigate the vertices/airports with the highest triangle count
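
A sketch of these last steps on a made-up flight graph. The sample data and thresholds are assumptions; the canonicalization step (one edge per airport pair, smaller ID first) is one possible way to meet the TriangleCount requirements, not necessarily the one used in the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flight-triangles-sketch").setMaster("local[*]"))

val airports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "ATL"), (2L, "ORD"), (3L, "LAX"), (4L, "JFK")))
val flights: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 606), Edge(2L, 1L, 606), Edge(1L, 3L, 1947),
  Edge(3L, 1L, 1947), Edge(1L, 4L, 760), Edge(4L, 3L, 2475)))
val graph = Graph(airports, flights, "unknown")

// Subgraph with both predicates: long flights that do not involve JFK.
val longHaul = graph.subgraph(
  epred = t => t.attr > 1000,
  vpred = (id, code) => code != "JFK")

// Canonical orientation: one edge per airport pair, with srcId < dstId.
val canonical: RDD[(VertexId, VertexId)] = graph.edges
  .map(e => if (e.srcId < e.dstId) (e.srcId, e.dstId) else (e.dstId, e.srcId))
  .distinct()

// Build and partition the route graph, then count triangles per airport.
val routeGraph = Graph.fromEdgeTuples(canonical, defaultValue = 1)
  .partitionBy(PartitionStrategy.RandomVertexCut)
val trianglesByAirport = routeGraph.triangleCount().vertices.join(airports)
  .map { case (_, (n, code)) => (code, n) }.collect().toMap

trianglesByAirport.toSeq.sortBy(-_._2).foreach(println)
```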



Graphs provide a powerful way to model and analyze connected data

GraphX builds on the massively parallel, fault-tolerant foundation of Spark to provide graph processing

Spark provides the ability to complement graph processing with relational processing in a single consistent framework and set of APIs

GraphX is a graph processing system and not a database

GraphX provides a number of operators and algorithms to facilitate working with and understanding the connections in the data


GraphX Challenges

Scala API only
– No Python or Java APIs

Utilizes the lower-level RDD-based API (vs. DataFrames)

Does not benefit from Spark DataFrame optimizations such as the Catalyst query optimizer or Tungsten memory management


Enter Spark GraphFrames

DataFrame-based graphs for Apache Spark
– Vertices and edges are represented as DataFrames
– Enables arbitrary data to be stored with each vertex and edge

Python, Java and Scala APIs

Simplified interactive queries
– Phrase queries in the familiar, powerful Spark SQL and DataFrame APIs

Supports motif finding for structural pattern search
– For example, to recommend whom to follow, you might search for triplets of users A, B, C where A follows B and B follows C, but A does not follow C

Benefits from Spark DataFrame optimizations

GraphFrames fully integrate with GraphX via conversions between the two representations
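
The who-to-follow motif can be sketched as below. This assumes Spark 2.x and that the separate GraphFrames package is on the classpath; the three users and their `follows` edges are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.master("local[*]").appName("graphframes-sketch").getOrCreate()
import spark.implicits._

// GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges.
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b", "follows"), ("b", "c", "follows")).toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// Motif: A follows B, B follows C, and A does not already follow C.
val motifs = g.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
  .filter("a.id != c.id")

// Each match is a recommendation: user a might want to follow user c.
val recommendations = motifs.select("a.id", "c.id").collect()
  .map(r => (r.getString(0), r.getString(1)))
recommendations.foreach { case (who, whom) => println(s"$who might follow $whom") }

// Conversion to the GraphX representation, for use with GraphX operators.
val asGraphX = g.toGraphX
```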
