Machine Learning and GraphX

Massive Graph Mining
Apache Spark’s GraphX and Data Mining

Who we are

Andy

@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool

Rand

@randhindi
@snips
Entrepreneur
PhD bioinformatics, etc.
Love data & ML

Graph 101

A graph is a mathematical representation of linked data.
It’s defined in terms of its Vertices and Edges, G(V,E).

A vertex is an entity that can carry a bag of data (generally small).
An edge connects vertices, and can also own a bag of data.
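As a minimal sketch of G(V,E) with bags of data on both vertices and edges, the plain-Python structure below may help. The class name `Graph`, its methods, and the sample attributes are illustrative assumptions, not part of GraphX.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    # vertex id -> bag of data attached to that vertex
    vertices: dict = field(default_factory=dict)
    # (source id, destination id) -> bag of data attached to that edge
    edges: dict = field(default_factory=dict)

    def add_vertex(self, vid, **data):
        self.vertices[vid] = data

    def add_edge(self, src, dst, **data):
        # An edge may only connect vertices that already exist.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges[(src, dst)] = data


# Hypothetical sample data, for illustration only.
g = Graph()
g.add_vertex(1, name="Andy")
g.add_vertex(2, name="Rand")
g.add_edge(1, 2, relation="co-speaker")
```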

Graph 101

A graph represents data in a way that is less convenient for classical processing frameworks.

The burden is not on the observations themselves (the rows) but on their linkage, and specifically on its density.

Thus, the problem is often translated into a self-join.

Graph 101

A graph G(V,E) has a reverse representation, its Dual.

A Dual is nothing other than the graph G’(V’,E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G that is shared by at least two edges.
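The dual construction can be sketched in plain Python as follows: every edge of G becomes a vertex of G’, and two of those vertices are linked whenever the original edges share an endpoint. The function name `dual` and the undirected edge-list input are assumptions made for illustration.

```python
from itertools import combinations


def dual(edges):
    """Return (V', E') for an undirected graph given as a list of (u, v) edges.

    Vertices of the dual are the indices of the edges of G.
    """
    dual_vertices = list(range(len(edges)))

    # For each vertex of G, collect the indices of the edges touching it.
    incident = {}
    for idx, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(idx)
        if v != u:  # do not register a self-loop twice
            incident.setdefault(v, []).append(idx)

    # Each vertex of G links every pair of edges that meet there.
    dual_edges = set()
    for edge_ids in incident.values():
        for a, b in combinations(edge_ids, 2):
            dual_edges.add((a, b))
    return dual_vertices, sorted(dual_edges)


# A path a - b - c - d: three edges, chained through b and c.
path = [("a", "b"), ("b", "c"), ("c", "d")]
dual_vertices, dual_edges = dual(path)
```

Here the end vertices a and d touch a single edge each, so they contribute nothing to G’.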

Graph 101

The classical way to store or share the connectivity of a graph is using its tabular version, that is, its Adjacency Matrix.

ref: http://en.wikipedia.org/wiki/Adjacency_matrix
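A small sketch of the tabular form, assuming vertices are numbered from 0 to n-1 and edges are given as pairs. The function name `adjacency_matrix` and the sample graph are illustrative.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix where cell [u][v] is 1 if an edge links u to v."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        if not directed:
            # An undirected edge is visible from both endpoints.
            matrix[v][u] = 1
    return matrix


# A triangle 0-1-2 with a tail 2-3.
triangle_plus_tail = [(0, 1), (1, 2), (2, 0), (2, 3)]
A = adjacency_matrix(4, triangle_plus_tail)
```

The matrix needs n² cells regardless of how many edges exist, which is one reason sparse graphs are usually stored as edge lists instead.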


GraphX (Apache Spark)

Spark 101


Offers a Graph API on top of Spark.
Enabling cross-world manipulations


How it differs from other classical systems...




Plenty of operators on both RDDs, but




1. Sends messages to neighbors
2. Returns an RDD of aggregated messages
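The two steps above can be imitated in plain Python. This is only an analogue of the idea, not the GraphX operator: the function `aggregate_messages`, its `send` and `merge` parameters, and the dictionary it returns (instead of an RDD) are all assumptions made for this sketch.

```python
def aggregate_messages(edges, send, merge):
    """Send messages along edges, then merge the messages each vertex receives.

    edges: iterable of (src, dst, attr) triplets
    send:  function (src, dst, attr) -> iterable of (target, message)
    merge: function combining two messages addressed to the same vertex
    """
    inbox = {}
    for src, dst, attr in edges:
        for target, msg in send(src, dst, attr):
            # First message is stored as-is, later ones are merged in.
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


# In-degree: every edge sends the message 1 to its destination, messages are summed.
triplets = [(1, 2, None), (1, 3, None), (2, 3, None)]
in_degree = aggregate_messages(
    triplets,
    send=lambda src, dst, attr: [(dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertices that receive no message, such as vertex 1 here, simply do not appear in the result.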


Offers higher-level operators and algorithms, like


This one rules them all (and more)

More later...

PageRank and Pregel

Everybody knows PageRank, right?

If not: it’s our oil, our friend, our preferred black box…

It’s why Google Search works so well!

PageRank and Pregel

Essentially, PageRank is all about the importance of a node in a graph → Link Analysis.

The bottom line is:
● In-links are votes
● In-links from important nodes are more important → recursion

PageRank and Pregel

https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf



PageRank and Pregel

TL;DR
The importance of a node is the probability that a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its neighbors
2. the probability that he crosses a link from the neighbor to it
3. an arbitrary probability of teleportation

PageRank and Pregel

Solution: Power Method/Iteration (recursive)

r_new = A x r_old

Matrix algebra is a pain in a distributed environment…
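A minimal single-machine sketch of the power iteration, written over out-link lists rather than an explicit matrix. The function name, the teleportation parameter `beta` (set to the commonly used 0.85), the tolerance, and the even redistribution of rank from dangling nodes are assumptions of this sketch. Every link target is assumed to also appear as a key of `out_links`.

```python
def pagerank_power_iteration(out_links, beta=0.85, tol=1e-10, max_iter=100):
    """Iterate r_new = A x r_old with teleportation until r stops changing."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation: every node receives a fixed share.
        new_rank = {v: (1.0 - beta) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # A node splits its rank evenly among its out-links.
                share = beta * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += beta * rank[v] / n
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(web)
```

Each pass touches every link once, which is the same work a matrix-vector product would do on the sparse matrix A.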

But wait, the process is rather graph oriented!

PageRank and Pregel

Pregel (Google again)

Based on BSP, Bulk Synchronous Parallel

BSP works in a message-passing style

PageRank and Pregel

During Superstep i, a vertex can:

● use messages received from Superstep i-1
● execute a function
● send messages
● vote to halt
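The superstep cycle can be sketched with a toy, single-machine loop. The example below propagates the maximum value through a graph, a common illustration of Pregel-style computation. The function name `pregel_max`, the inbox/outbox dictionaries, and the rule that a vertex is reactivated only by an incoming message are assumptions of this sketch, not the GraphX or Google Pregel API. Every vertex is assumed to have an entry in `neighbors`.

```python
def pregel_max(values, neighbors, max_supersteps=50):
    """Spread the largest vertex value to every reachable vertex."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # in superstep 0 every vertex is active
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            # Use the messages received from the previous superstep
            # and execute the vertex function (here: keep the maximum).
            candidate = max(inbox[v], default=values[v])
            changed = candidate > values[v]
            if changed:
                values[v] = candidate
            # Send messages: everyone in superstep 0, then only changed vertices.
            if superstep == 0 or changed:
                for n in neighbors[v]:
                    outbox[n].append(values[v])
            # Vote to halt: implicit, the vertex sleeps until a message arrives.
        # Synchronization barrier: messages become visible in the next superstep.
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values, superstep


# Hypothetical chain 1 - 2 - 3 - 4 with made-up starting values.
start = {1: 3, 2: 6, 3: 2, 4: 1}
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
final_values, supersteps_used = pregel_max(start, chain)
```

The computation stops on its own once no vertex has anything new to say, which is the effect of every vertex having voted to halt.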

PageRank and Pregel

In GraphX, as usual with Spark, it’s simple:

mapReduceTriplets

PageRank and Pregel

PageRank with Pregel:

PageRank and Pregel

Applying it to our USA.csv file:

OpenStreetMap

Founded by Steve Coast (UK, 2004)

Aims to take geodata out of the governments’ hands and give it to the crowd

Actually, the crowd has to create it...

OSM

So it’s a Graph!

Node = Vertex
A single point in space defined by its latitude, longitude and node id

Way = Edge
A way can have between 2 and 2,000 nodes

OSM

The network is over-complex for what we need, thus:

● reducing cyclic ways like roundabouts to a single one

● transforming the nodes into sections, i.e. pieces of streets between 2 intersections

OSM

Hence, OSM ~ G(Node, Way)

Even if the match is not exact, we can still manipulate it as a graph

In our case, we don’t need the connectivity of an intersection, but the connectivity of a section.
This is given by G’ (dual of G)

Dataset

● 80 cities
● 3M edges in total
● smallest city 200 edges (Tempe)
● largest city 200,000 edges (Los Angeles)

● Hypothesis: Cities with similar connectivity have similar PageRank distribution

NYC Chicago

Comparing Cities

Fort Worth = Philadelphia?

Looks the same!

Smells like Spurious Correlation

● Problem: PageRank is correlated with the size of the city

● size of city = number of sections (edges) in the graph

● Normalized PageRank = PageRank / size_of_city

● Now we can compare cities of different sizes!
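The normalization above is a single division per section. A minimal sketch, where the function name `normalize_pagerank` and the sample scores are made up for illustration:

```python
def normalize_pagerank(pagerank_by_section, size_of_city):
    """Divide every PageRank score by the number of sections in the city."""
    if size_of_city <= 0:
        raise ValueError("size_of_city must be positive")
    return {
        section: score / size_of_city
        for section, score in pagerank_by_section.items()
    }


# Hypothetical city with two sections and made-up PageRank scores.
small_city = {"s1": 0.5, "s2": 1.5}
normalized = normalize_pagerank(small_city, size_of_city=len(small_city))
```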

Normalizing PageRank distributions

Fort Worth != Philadelphia!

Totally different!

Fort Worth before and after

Note that the range of PageRank is preserved

● How to compare PageRank distributions?
● It’s not always a normal distribution!
● Can use the Kullback-Leibler divergence from information theory
● The Kullback–Leibler divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P

Distance between PageRank Distributions

● Easy to compute
● Unit is nats (can be bits if using log2 instead of ln)
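A short sketch of the computation, using the standard discrete definition DKL(P||Q) = Σ P(i) · log(P(i) / Q(i)). It assumes both distributions have already been turned into histograms over the same bins, each summing to 1. How the PageRank values were binned is not stated here, so the sample histograms below are made up.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists.

    The result is in nats with the default base e, and in bits with base=2.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a zero-probability bin of P contributes nothing
        if qi == 0:
            return math.inf  # Q gives zero probability to something P allows
        total += pi * math.log(pi / qi, base)
    return total


# Made-up two-bin histograms, for illustration only.
p = [0.5, 0.5]
q = [0.9, 0.1]
d_nats = kl_divergence(p, q)
d_bits = kl_divergence(p, q, base=2)
```

The divergence is not symmetric, so DKL(P||Q) and DKL(Q||P) generally differ; the order of the two cities matters when comparing them.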

KL Divergence

● KL divergence = 18.407
● Dallas is irregular, Seattle is a perfect grid

Very different cities: Dallas & Seattle

● KL divergence = 0.36
● Both are very irregular

Very similar cities: Atlanta & Boston

● Using multiple street topology indicators to measure the risk of car accident

Next steps

Q.E.D

Thanks for keeping up!

Question => Future[(Option[Response], Future[Question])]
