Analyzing Flight Data - Meetupfiles.meetup.com/9505222/Spark GraphX.pdf · 2016-07-24 · 2 © 2016 IBM Corporation Agenda Spark Overview –a quick review Introduction to Graph Processing

© 2016 IBM Corporation

IBM Analytics

Analyzing Flight Data

Jeff Carlson

Rich Tarro

July 21, 2016


Agenda

Spark Overview – a quick review

Introduction to Graph Processing and Spark GraphX

GraphX Overview

Demo Scenario Overview

Demo

Wrap-up




What is Spark?

Spark is an open source, in-memory application framework for distributed data processing and iterative analysis on massive data volumes

“Analytic Operating System”


Key reasons for interest in Spark

Performant

– In-memory architecture greatly reduces disk I/O

– Anywhere from 20-100x faster for common tasks

Productive

– Concise and expressive syntax, especially compared to prior approaches

– Single programming model across a range of use cases and steps in the data lifecycle

– Integrated with common programming languages: Java, Python, Scala

– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)

Leverages existing investments

– Works well within the existing Hadoop ecosystem

Improves with age

– Large and growing community of contributors continuously improves the full analytics stack and extends capabilities


Spark includes a set of core libraries that enable various analytic methods which can process data from many sources

– Spark Core Engine: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions

– Spark SQL: executes SQL statements

– Spark Streaming: performs streaming analytics using micro-batches

– MLlib (machine learning): common machine learning and statistical algorithms

– GraphX (graph): distributed graph processing framework

A large variety of data sources and formats can be supported, both on premise and in the cloud: BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others

Diagram labels: IBM Cloud, Other Cloud, Cloud Apps, On-Premise


Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:

– Standalone with the built-in cluster manager

– Use Mesos as the cluster manager

– Use YARN as the cluster manager

– Standalone cluster on any cloud (Bluemix, IBM SoftLayer, Amazon, Azure, …)


Spark RDDs

Immutable

Two types of operations:

– Transformations ~ DDL (Create View V2 as…): lazy evaluation

• val rddNumbers = sc.parallelize(1 to 10) // numbers from 1 to 10

• val rddNumbers2 = rddNumbers.map(x => x + 1) // numbers from 2 to 11

• The lineage describing how to obtain rddNumbers2 from rddNumbers is recorded

• The lineage is a Directed Acyclic Graph (DAG)

• No actual data processing takes place at this point (lazy evaluation)

– Actions ~ Select (Select * From V2…): perform computations

• rddNumbers2.collect() // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

• Performs the recorded transformations and then the action

• Returns a value (or writes to a file)

Fault tolerance:

– If data in memory is lost, it is recreated from the lineage
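The split between lazy transformations and eager actions can be illustrated without Spark. The following is a minimal plain-Python sketch, not the Spark API: `ToyRDD` and `add_one` are hypothetical names, and the lineage is simplified to a linear list of steps rather than a full DAG.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations only record lineage."""

    def __init__(self, source, lineage=()):
        self.source = list(source)   # base data; never mutated
        self.lineage = lineage       # tuple of recorded (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the base data.
        data = list(self.source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []

def add_one(x):
    calls.append(x)      # records when the function actually runs
    return x + 1

rdd_numbers = ToyRDD(range(1, 11))
rdd_numbers2 = rdd_numbers.map(add_one)
ran_before_action = len(calls)       # nothing has executed yet
result = rdd_numbers2.collect()      # the action triggers the computation
print(ran_before_action, result)
```

Because the result is never stored, calling `collect()` again simply replays the lineage, which is the same idea Spark relies on to rebuild lost partitions.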




Graphs are Central to Analytics

Data is not just getting bigger, it’s getting more connected

In many use cases, the relationship between data points provides as much value or more than the data points themselves

Discovering data relationships and interdependencies is critical to many applications:

– fraud detection

– better understanding customer relationships

– ranking web pages or people in social networks

Graph analytics is a powerful tool for understanding and exploiting the connections in data

Graph applications are everywhere today and are a critical component of many next generation applications


What is a Graph?

A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and the edges that connect them. The vertices are the objects and the edges are the relationships between them.

Directed graph

– A graph where the edges have a direction associated with them. An example of a directed graph is Twitter followers: user Bob can follow user Carol without implying that user Carol follows user Bob.

Undirected graph

– A graph where the edges have no direction, so every relationship holds both ways. An example of an undirected graph is Facebook friends: if Bob is a friend of Carol, then Carol is also a friend of Bob.

Regular graph

– A graph where each vertex has the same number of edges.
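The Bob and Carol examples can be made concrete with an adjacency structure. This is a small plain-Python sketch; `build_adjacency` is a hypothetical helper and is not part of any graph library.

```python
def build_adjacency(edges, directed):
    """Return {vertex: set of neighbours} for a list of (src, dst) pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())     # make sure dst exists as a vertex
        if not directed:
            adjacency[dst].add(src)          # mutual relationship
    return adjacency

follows = build_adjacency([("Bob", "Carol")], directed=True)
friends = build_adjacency([("Bob", "Carol")], directed=False)

print("Carol follows Bob:", "Bob" in follows["Carol"])
print("Carol is a friend of Bob:", "Bob" in friends["Carol"])
```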


Spark GraphX

Graph processing system, NOT a database

GraphX extends the Spark RDD by introducing a Graph abstraction:

– A directed multigraph with properties attached to each vertex and edge

GraphX exposes a set of fundamental operators to support graph computation:

– subgraph, joinVertices, aggregateMessages, …

Algorithms to simplify graph analytics tasks:

– In addition to a highly flexible API, GraphX comes with a growing library of graph algorithms

– PageRank, Triangle Counting, …


Spark GraphX – Flexible Graphing

GraphX unifies ETL, exploratory analysis, and iterative graph computation

You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API


GraphX and the Alternatives

GraphX

– Optimized for running complex algorithms on the entire graph

Relational databases

– Inadequate for any real type of graph analysis

Graph databases

– Database transactions: updates and deletes

– Typically work with small sections of the graph

• Ex. query small groups of vertices


Graph Databases

The same restrictions that enable graph databases to achieve substantial performance gains also limit their ability to express many of the important stages in a typical graph-analytics pipeline

Often require data movement outside of the graph topology to express operations that are more naturally expressed as relational/table operations


GraphX Benefits

Unify graph and data centric computation in one system with a single composable API

Enables users to view data both as graphs and as collections (i.e., RDDs) or tables (DataFrames) without data movement or duplication




Property Graphs

GraphX implements an object called the property graph:

– A directed multigraph with user defined objects attached to each vertex and edge

Like RDDs, property graphs are immutable, distributed, and fault-tolerant

Directed multigraphs can have multiple parallel edges between the same pair of vertices:

– Every edge and vertex has user defined properties associated with it

– The parallel edges allow multiple relationships between the same vertices


Vertex and Edge RDDs

GraphX exposes RDD views of the vertices and edges stored within the graph

VertexRDD[A] extends RDD[(VertexId, A)] and adds the additional constraint that each VertexId occurs only once

EdgeRDD[ED], which extends RDD[Edge[ED]], organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy


Example Property Graph


Example – Constructing a Property Graph

Construct a property graph consisting of the various collaborators:

– Each vertex property might contain the username and occupation

– Each edge carries a string describing the relationship between collaborators
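A collaborator graph of this shape can be sketched in plain Python. This is an illustration of the data model only, not GraphX code; the `Edge` class, the user names and the relationship strings are all made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: int     # source vertex id
    dst: int     # destination vertex id
    attr: str    # edge property: the relationship

# Hypothetical collaborators: vertex id -> (username, occupation).
vertices = {
    1: ("alice", "student"),
    2: ("bob", "postdoc"),
    3: ("carol", "professor"),
}

# A multigraph is modelled as a list, so parallel edges are allowed.
edges = [
    Edge(1, 2, "collaborator"),
    Edge(3, 1, "advisor"),
    Edge(3, 2, "colleague"),
    Edge(3, 2, "principal investigator"),   # parallel edge, second relationship
]

professors = [name for name, job in vertices.values() if job == "professor"]
parallel = [e.attr for e in edges if (e.src, e.dst) == (3, 2)]
print(professors, parallel)
```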


Example – Working with a Property Graph

Deconstructing a graph:

– Vertex and edge views

– Use the ‘graph.vertices’ and ‘graph.edges’ members

Alternatively, use the case class type constructor [code example not captured in this text]


Triplet Views

The triplet view logically joins the vertex and edge properties

It yields an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class

This join can be expressed as a SQL expression or graphically [SQL expression and diagram not captured in this text]


EdgeTriplet Class

Extends the Edge class by adding the srcAttr and dstAttr members

The triplet view can be used to render a collection of strings describing relationships between users
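The join behind the triplet view can be sketched in plain Python. The field names `src_attr` and `dst_attr` mirror the srcAttr and dstAttr members mentioned above, while the `triplets` function and the sample users are assumptions made for illustration.

```python
from collections import namedtuple

Edge = namedtuple("Edge", "src dst attr")
EdgeTriplet = namedtuple("EdgeTriplet", "src dst attr src_attr dst_attr")

def triplets(vertices, edges):
    """Join each edge with the properties of its source and destination vertex."""
    return [
        EdgeTriplet(e.src, e.dst, e.attr, vertices[e.src], vertices[e.dst])
        for e in edges
    ]

# Made-up users and relationships.
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [Edge(1, 2, "collaborator"), Edge(3, 1, "advisor")]

# Render one descriptive string per triplet.
facts = [
    f"{t.src_attr} is the {t.attr} of {t.dst_attr}"
    for t in triplets(vertices, edges)
]
print(facts)
```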


Graph Operators

Similar to RDD basic operations like map, filter, and reduceByKey

Core operators have optimized implementations

Graph operator types:

– Property operators (mapVertices, mapEdges, mapTriplets)

– Structural operators (reverse, subgraph, mask, groupEdges)

– Join operators (joinVertices, outerJoinVertices)
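One operator from each family can be approximated in plain Python to show what it changes and what it leaves alone. The function names echo the GraphX operators, but these are simplified stand-ins written for this example, with signatures that differ from the real API.

```python
def map_vertices(vertices, fn):
    """Property operator: new vertex attributes, same ids and structure."""
    return {vid: fn(vid, attr) for vid, attr in vertices.items()}

def reverse(edges):
    """Structural operator: flip the direction of every edge."""
    return [(dst, src, attr) for src, dst, attr in edges]

def outer_join_vertices(vertices, other, fn):
    """Join operator: combine each vertex attribute with optional external data."""
    return {vid: fn(vid, attr, other.get(vid)) for vid, attr in vertices.items()}

# Made-up sample graph.
vertices = {1: "alice", 2: "bob"}
edges = [(1, 2, "follows")]

upper = map_vertices(vertices, lambda vid, name: name.upper())
flipped = reverse(edges)
with_age = outer_join_vertices(vertices, {1: 30}, lambda vid, name, age: (name, age))
print(upper, flipped, with_age)
```

Vertices with no match in the external data receive `None`, which is the role `Option` plays in the Scala API.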


Graph Operators - Subgraph

The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate and the edges that satisfy the edge predicate:

– Only edges that connect vertices satisfying the vertex predicate are kept

The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or to eliminate broken links
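The filtering rule can be sketched in plain Python. This is an approximation: in GraphX the edge predicate receives a triplet, whereas here it receives `(src, dst, attr)` for brevity, and the airport codes and distances are made up.

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Keep vertices passing vpred and edges passing epred between kept vertices."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [
        (src, dst, attr)
        for src, dst, attr in edges
        if src in kept_vertices and dst in kept_vertices and epred(src, dst, attr)
    ]
    return kept_vertices, kept_edges

# Made-up airports; vertex 4 has no code and stands in for a broken link.
airports = {1: "AAA", 2: "BBB", 3: "CCC", 4: None}
flights = [(1, 2, 1800), (2, 3, 700), (1, 3, 2500), (3, 4, 100)]

long_haul_airports, long_haul_flights = subgraph(
    airports,
    flights,
    vpred=lambda vid, code: code is not None,   # drop the broken vertex
    epred=lambda src, dst, dist: dist > 1000,   # keep only long flights
)
print(long_haul_airports, long_haul_flights)
```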


Graph Algorithms - PageRank

An algorithm created by Google to rank websites in its search engine results:

– Named after Larry Page, one of the founders of Google

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is:

– The underlying assumption is that more important websites are likely to receive more links from other websites

The mathematics of PageRank is entirely general and applies to any graph or network in any domain:

– E.g. Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow
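The idea of counting both the number and the quality of links can be sketched as a fixed number of power iterations. This is the textbook normalized formulation written in plain Python, not the GraphX implementation; the damping factor of 0.85, the iteration count and the four-page graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {vertex: rank}; ranks sum to 1."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each vertex splits its rank evenly across its out-links.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Made-up link graph: three pages link to C, and C links back to A.
edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

A ends up ranked well above B and D even though it has a single inbound link, because that link comes from the highest ranked page.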


Graph Algorithms - Triangle Counting

GraphX implements a triangle counting algorithm:

– A triangle is a small graph of three nodes in which every pair of nodes is connected

– Used in many real world applications as a measure of clustering

Determines the number of triangles passing through each vertex:

– A vertex is part of a triangle when it has two adjacent vertices with an edge between them

TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy:

– E.g. RandomVertexCut colocates all same-direction edges between two vertices by hashing the source and destination vertex IDs
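The per-vertex count and the role of canonical orientation can be sketched in plain Python. This is a single-machine illustration with a made-up edge list; it ignores partitioning entirely and is not how GraphX distributes the work.

```python
from itertools import combinations

def triangle_count(edges):
    """Return {vertex: number of triangles passing through it}."""
    # Canonical orientation: store each edge once as (low, high), dropping
    # self-loops and duplicates such as (a, b) and (b, a).
    canonical = {(min(s, d), max(s, d)) for s, d in edges if s != d}
    neighbours = {}
    for a, b in canonical:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    counts = {v: 0 for v in neighbours}
    for v, nbrs in neighbours.items():
        # v is in a triangle whenever two of its neighbours share an edge.
        for a, b in combinations(sorted(nbrs), 2):
            if (a, b) in canonical:
                counts[v] += 1
    return counts

# Made-up routes: 1-2-3 form a triangle, 4 hangs off vertex 3.
flights = [(1, 2), (2, 1), (2, 3), (3, 1), (3, 4)]
counts = triangle_count(flights)
print(counts)
```

Without the canonical step, the pair `(1, 2)` and `(2, 1)` would be treated as two different edges.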




Demo Scenario

Explore and analyze airline data:

– Vertices representing airports

– Edges representing flights between airports and their associated distance

Use a number of operators provided by GraphX to analyze data in the graph and the relationships between the data:

– E.g. find the airports with the greatest number of inbound and outbound flights

Employ graph operators to transform the graphs into new graphs:

– Based on transformation criteria, like the distance between airports

Employ graph algorithms included with GraphX, like PageRank and Triangle Counting, to determine the busiest airports


Demo Scenario Data

Airline data in CSV format is readily available on the US Bureau of Transportation Statistics (BTS) website:

– http://www.rita.dot.gov/bts/home

This demo employs US flight data for March 2016


Demo Flow

Download the data (CSV format)

Read in the CSV file as a DataFrame (infer the schema)

Clean up the DataFrame:

– Remove the blank column and rows that contain nulls

Convert the DataFrame to an RDD:

– Use a custom case class

– GraphX is based on RDDs, so the DataFrame must be converted into an RDD

Extract data (airport IDs and airport codes) for the graph vertices

Extract data (origin airport ID, destination airport ID, distance between airports) for the graph edges

Create the EdgeRDD
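The cleaning and extraction steps can be sketched with the Python standard library in place of Spark DataFrames. The column names, airport codes, IDs and distances here are assumptions made up for the example and are not taken from the demo data; the trailing comma on each line produces the kind of blank column the clean-up step removes.

```python
import csv
import io

# Made-up sample rows with assumed column names.
SAMPLE_CSV = """ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DISTANCE,
101,AAA,102,BBB,500,
102,BBB,103,CCC,300,
101,AAA,103,CCC,,
101,AAA,102,BBB,500,
"""

def load_graph(text):
    """Return (vertices, edges) built from the cleaned CSV rows."""
    reader = csv.DictReader(io.StringIO(text))
    vertices, edges = {}, []
    for row in reader:
        row.pop("", None)                       # drop the blank trailing column
        if any(value in (None, "") for value in row.values()):
            continue                            # skip rows that contain nulls
        src = int(row["ORIGIN_AIRPORT_ID"])
        dst = int(row["DEST_AIRPORT_ID"])
        vertices[src] = row["ORIGIN"]           # vertex: airport id -> airport code
        vertices[dst] = row["DEST"]
        edges.append((src, dst, int(row["DISTANCE"])))   # edge: one flight
    return vertices, edges

vertices, edges = load_graph(SAMPLE_CSV)
print(vertices)
print(edges)
```

Each flight becomes one edge, so repeated flights on the same route appear as parallel edges, matching the multigraph model.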


Example Demo Graph


Demo Flow (continued)

Create the graph

Investigate the graph:

– Show vertices

– Count the number of vertices/airports

– Show edges/flights

– Count the number of edges/flights and distinct routes

– Query the graph based on vertex and edge attributes and properties

Create a triplet view of the graph:

– Query the triplet view

Compute the highest degree vertices (in, out, and total)

Calculate PageRank values for the graph vertices
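The counting and degree steps can be sketched in plain Python over an edge list of `(origin, destination, distance)` tuples. The airports and flights are made up for the example, and `degrees` is a hypothetical helper rather than a GraphX call.

```python
from collections import Counter

def degrees(edges):
    """Return (in-degree, out-degree, total degree) counters per vertex."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst, _ in edges:
        out_deg[src] += 1      # one more departure from src
        in_deg[dst] += 1       # one more arrival at dst
    return in_deg, out_deg, in_deg + out_deg

# Made-up airports and flights.
airports = {101: "AAA", 102: "BBB", 103: "CCC"}
flights = [
    (101, 102, 500),
    (101, 102, 500),   # second flight on the same route
    (102, 103, 300),
    (103, 102, 300),
    (101, 103, 750),
]

in_deg, out_deg, total = degrees(flights)
busiest_id, busiest_degree = total.most_common(1)[0]
distinct_routes = len({(src, dst) for src, dst, _ in flights})
print(airports[busiest_id], busiest_degree, len(flights), distinct_routes)
```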


Demo Flow (conclusion)

Create a subgraph

Explore the subgraph:

– Using both vertex predicates and edge predicates

Create a subgraph for Triangle Counting:

– TriangleCount requires the edges to be in canonical orientation

– It also requires the graph to be partitioned

Create a Triangle Count graph

Investigate the vertices/airports with the highest triangle count




Summary

Graphs provide a powerful way to model and analyze connected data

GraphX builds on the massively parallel, fault-tolerant foundation of Spark to provide graph processing

Spark provides the ability to complement graph processing with relational processing in a single consistent framework and set of APIs

GraphX is a graph processing system and not a database

GraphX provides a number of operators and algorithms to facilitate working with and understanding the connections in the data


GraphX Challenges

Scala API only:

– No Python or Java APIs

Utilizes the lower level RDD-based (vs. DataFrame-based) API

Does not benefit from Spark DataFrame optimizations such as the Catalyst query optimizer or Tungsten memory management


Enter Spark GraphFrames

DataFrame-based graphs for Apache Spark:

– Vertices and edges are represented as DataFrames

– Enables arbitrary data to be stored with each vertex and edge

Python, Java and Scala APIs

Simplified interactive queries:

– Phrase queries in the familiar, powerful Spark SQL and DataFrame APIs

Supports motif finding for structural pattern search:

– For example, to recommend whom to follow, you might search for triplets of users A, B, C where A follows B and B follows C, but A does not follow C

Benefits from Spark DataFrame optimizations

GraphFrames fully integrate with GraphX via conversions between the two representations
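The follow-recommendation motif can be expressed as a small search over an edge list. This is a plain-Python sketch of the pattern itself, not the GraphFrames API; `recommend_follows` and the sample users are hypothetical.

```python
def recommend_follows(follows):
    """Find (a, b, c) where a follows b, b follows c, and a does not follow c."""
    edge_set = set(follows)
    out = {}
    for src, dst in edge_set:
        out.setdefault(src, set()).add(dst)
    matches = []
    for a, b in sorted(edge_set):
        for c in sorted(out.get(b, ())):
            # Skip c == a so that nobody is recommended to follow themselves.
            if c != a and (a, c) not in edge_set:
                matches.append((a, b, c))
    return matches

# Made-up follower relationships as (follower, followed) pairs.
follows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("bob", "dave"),
]
suggestions = recommend_follows(follows)
print(suggestions)
```

Each match `(a, b, c)` reads as "suggest c to a, because a follows b and b follows c".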
