Massive Graph Mining Apache Spark’s GraphX and Data Mining

Machine Learning and GraphX


Page 1: Machine Learning and GraphX

Massive Graph Mining

Apache Spark’s GraphX and Data Mining

Page 2: Machine Learning and GraphX

Who we are

Andy

@Noootsab

@NextLab_be

@Wajug co-driver

@Devoxx4Kids organizer

Maths & CS

Data lover: geo, open, massive

Fool

Rand

@randhindi

@snips

Entrepreneur

PhD bioinformatics, etc.

Love data & ML

Page 3: Machine Learning and GraphX

Graph 101

A graph is a mathematical representation of linked data.

It’s defined in terms of its Vertices and Edges, G(V,E).

A vertex is an entity that can carry a bag of data (generally small).

An edge connects vertices, and can also carry its own bag of data.
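As a minimal sketch of this definition in plain Python (not the GraphX API), the classes `Vertex` and `Edge`, the helper `neighbors`, and all sample values below are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    data: dict = field(default_factory=dict)   # the vertex's "bag of data"

@dataclass
class Edge:
    src: int
    dst: int
    data: dict = field(default_factory=dict)   # an edge can carry data too

# G(V, E): a tiny illustrative graph (all values are made up)
V = [Vertex(1, {"name": "a"}), Vertex(2, {"name": "b"}), Vertex(3, {"name": "c"})]
E = [Edge(1, 2, {"weight": 1.0}), Edge(2, 3, {"weight": 2.5})]

def neighbors(vid, edges):
    """Return the ids of vertices reachable from `vid` along one outgoing edge."""
    return [e.dst for e in edges if e.src == vid]
```

Here `neighbors(1, E)` walks the edge list once, which is fine for a toy graph but is exactly the kind of lookup a graph engine indexes.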

Page 4: Machine Learning and GraphX

Graph 101

A Graph represents data in a way that is less convenient for classical processing frameworks.

This is because the burden is not put on the observations themselves (rows) but on their linkage, and specifically on its density.

Thus, the problem often translates into a self-join.
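To illustrate the self-join point, here is a small plain-Python sketch, assuming the graph is stored as a table of `(src, dst)` rows. The function name `two_hop_paths` and the sample rows are hypothetical:

```python
# Edge table: one row per link (src, dst). Values are illustrative.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]

def two_hop_paths(edges):
    """Self-join the edge table on left.dst == right.src."""
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    return sorted(
        (a, b, c)
        for a, b in edges              # left side of the join
        for c in by_src.get(b, [])     # right side: edges starting at b
    )

paths = two_hop_paths(edges)
```

Each extra hop requires another join of the edge table with itself, which is why dense graphs become expensive in row-oriented frameworks.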

Page 5: Machine Learning and GraphX

Graph 101

A Graph, G(V,E), has a reverse representation, its Dual.

A Dual is nothing other than the graph, G’(V’,E’), where

● a vertex is an edge in G, and

● an edge is a vertex in G, which has at least one edge.
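A possible sketch of this construction in plain Python, under the assumption that two edges of G become adjacent in G’ when they share a vertex (the usual line-graph reading), and that G is undirected without self-loops. The function name `dual` is made up:

```python
from itertools import combinations

def dual(edges):
    """Build G'(V', E') from the undirected edge list of G.

    V' is the list of edges of G (identified by their index);
    E' links two edges of G that share a vertex of G.
    """
    incident = {}                        # vertex of G -> indices of edges touching it
    for i, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(i)
        incident.setdefault(v, []).append(i)
    dual_edges = set()
    for touching in incident.values():
        for i, j in combinations(touching, 2):
            dual_edges.add((i, j))
    return list(edges), sorted(dual_edges)

# A path a - b - c - d: three edges, chained through b and c.
v_prime, e_prime = dual([("a", "b"), ("b", "c"), ("c", "d")])
```

In the example, vertices `b` and `c` of G each become one edge of G’, while the end points `a` and `d` touch a single edge and produce nothing.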

Page 6: Machine Learning and GraphX

Graph 101

The classical way to store or share the connectivity of a graph is to use its tabular version, that is, its Adjacency Matrix.

ref: http://en.wikipedia.org/wiki/Adjacency_matrix
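As a quick illustration, an adjacency matrix can be built from an edge list as follows. This is a plain-Python sketch; the function name and the assumption that vertices are numbered from 0 to n-1 are ours:

```python
def adjacency_matrix(n, edges, directed=False):
    """Return an n x n matrix A with A[u][v] = 1 when an edge links u to v."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        if not directed:
            A[v][u] = 1      # undirected graphs give a symmetric matrix
    return A

# Path 0 - 1 - 2
A = adjacency_matrix(3, [(0, 1), (1, 2)])
```

The matrix needs n x n cells regardless of how many edges exist, so for sparse graphs an edge list is usually the more compact storage.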

Page 7: Machine Learning and GraphX

GraphX (Apache Spark)

Spark 101

Page 8: Machine Learning and GraphX

GraphX (Apache Spark)

Offers a Graph API on top of Spark, enabling cross-world manipulations.

Page 9: Machine Learning and GraphX

GraphX (Apache Spark)

How it differs from other classical systems...

Page 10: Machine Learning and GraphX

GraphX (Apache Spark)

Page 11: Machine Learning and GraphX

GraphX (Apache Spark)

Page 12: Machine Learning and GraphX

GraphX (Apache Spark)

Plenty of operators on both RDDs, but

Page 13: Machine Learning and GraphX

GraphX (Apache Spark)

Plenty of operators on both RDDs, but

Page 14: Machine Learning and GraphX

GraphX (Apache Spark)

1. Sends messages to neighbors

2. Returns an RDD of aggregated messages
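The two steps above can be mimicked in plain Python. This sketch only mirrors the send-then-merge shape of the operator and is not the GraphX API; `aggregate_messages`, its parameters, and the sample graph are hypothetical:

```python
def aggregate_messages(vertices, edges, send, merge):
    """Visit every edge triplet, collect the messages `send` emits,
    and combine messages addressed to the same vertex with `merge`."""
    inbox = {}
    for src, dst in edges:
        triplet = (src, vertices[src], dst, vertices[dst])
        for target, msg in send(triplet):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox             # only vertices that received a message appear

vertices = {1: "a", 2: "b", 3: "c"}
edges = [(1, 2), (1, 3), (2, 3)]

# In-degree: each edge sends 1 to its destination; messages are summed.
in_degree = aggregate_messages(
    vertices, edges,
    send=lambda t: [(t[2], 1)],
    merge=lambda a, b: a + b,
)
```

Because `merge` is applied pairwise, it should be commutative and associative so that the result does not depend on the order in which edges are visited.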

Page 15: Machine Learning and GraphX

GraphX (Apache Spark)

Offers higher-level operators and algorithms, like

Page 16: Machine Learning and GraphX

GraphX (Apache Spark)

This one rules them all (and more)

More later...

Page 17: Machine Learning and GraphX

PageRank and Pregel

Everybody knows PageRank, right?

If not: it’s our oil, our friend, our preferred black box…

It’s why Google Search works so well!

Page 18: Machine Learning and GraphX

PageRank and Pregel

Essentially, PageRank is all about the importance of a node in a Graph → Link Analysis.

The bottom line is:

● In-Links are votes

● In-Links from important nodes are more important → recursion

Page 20: Machine Learning and GraphX

PageRank and Pregel

TL;DR

The importance of a node is the probability that a random (drunk) walker lands on that node.

So, it depends on:

1. the probability that he lands on one of its neighbors

2. the probability that he crosses a link from that neighbor to it

3. an arbitrary probability of teleportation
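These three ingredients map onto the common textbook form of PageRank. The symbols are our assumptions, not taken from the slides: r_j is the rank of node j, d is the damping factor (so 1 - d is the teleportation probability), N is the number of nodes, and out(i) is the number of out-links of node i.

```latex
r_j \;=\; \underbrace{\frac{1 - d}{N}}_{\text{(3) teleportation}}
\;+\; d \sum_{i \to j}
\underbrace{r_i}_{\text{(1) walker is on neighbor } i}
\cdot
\underbrace{\frac{1}{\mathrm{out}(i)}}_{\text{(2) crosses the link } i \to j}
```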

Page 21: Machine Learning and GraphX

PageRank and Pregel

Solution: Power Method/Iteration (recursive)

r_new = A x r_old

Matrix algebra is a pain in a distributed environment…

But wait, the process is rather graph oriented!
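A minimal power-iteration sketch in plain Python, written edge by edge rather than as a matrix product to show why the process is graph oriented. The damping factor `d`, the tolerance, the uniform handling of dangling nodes, and the function name are our assumptions:

```python
def power_iteration(links, n, d=0.85, tol=1e-10, max_iter=100):
    """links: list of (src, dst) pairs; n: number of nodes (ids 0..n-1).
    Returns the rank vector, which sums to 1."""
    out_deg = [0] * n
    for src, _ in links:
        out_deg[src] += 1
    r = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(r[i] for i in range(n) if out_deg[i] == 0)
        r_new = [(1.0 - d) / n + d * dangling / n] * n
        for src, dst in links:
            # Each link carries an equal share of its source's rank.
            r_new[dst] += d * r[src] / out_deg[src]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < tol:
            return r_new
        r = r_new
    return r

ranks = power_iteration([(0, 1), (1, 2), (2, 0), (0, 2)], n=3)
```

The inner loop is one pass over the links in which every source pushes a value to its destination, which is a message-passing pattern.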

Page 22: Machine Learning and GraphX

PageRank and Pregel

Pregel (Google again)

Based on BSP, Bulk Synchronous Parallel

BSP works in a message-passing style

Page 23: Machine Learning and GraphX

PageRank and Pregel

During Superstep i, a vertex can:

● use messages received from Superstep i-1

● execute a function

● send messages

● vote to halt
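The superstep cycle can be sketched in plain Python with the classic "propagate the maximum value" example. This is an illustrative, single-machine sketch, not Pregel or GraphX code; the function name and the sample graph are made up:

```python
def propagate_max(values, edges, max_supersteps=50):
    """Spread the maximum value through a directed graph, BSP style.

    values: {vertex: initial value}; edges: list of (src, dst).
    """
    out = {v: [] for v in values}
    for src, dst in edges:
        out[src].append(dst)

    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v in values:
        for n in out[v]:
            inbox[n].append(values[v])

    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            if not msgs:
                continue                  # halted and not woken up by a message
            best = max(msgs)              # use messages from superstep i-1
            if best > values[v]:          # execute a function
                values[v] = best
                for n in out[v]:          # send messages
                    outbox[n].append(best)
            # otherwise the vertex votes to halt: it sends nothing
        inbox = outbox                    # barrier: delivered in the next superstep
        supersteps += 1
    return values

result = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1},
                       [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
```

The computation stops on its own once no messages are in flight, i.e. when every vertex has voted to halt.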

Page 24: Machine Learning and GraphX

PageRank and Pregel

Page 25: Machine Learning and GraphX

PageRank and Pregel

In GraphX, as usual with Spark, it’s simple:

mapReduceTriplets

Page 26: Machine Learning and GraphX

PageRank and Pregel

PageRank with Pregel:

Page 27: Machine Learning and GraphX

PageRank and Pregel

Applying on our USA.csv file:

Page 28: Machine Learning and GraphX

OpenStreetMap

Founded by Steve Coast (UK, 2004)

Aims to take geodata out of governments’ hands and give it to the crowd

Actually, the crowd has to create it...

Page 29: Machine Learning and GraphX

OSM

Page 30: Machine Learning and GraphX

OSM

Page 31: Machine Learning and GraphX

OSM

So it’s a Graph!

Node = Vertex: a single point in space defined by its latitude, longitude and node id

Way = Edge: a way can have between 2 and 2,000 nodes

Page 32: Machine Learning and GraphX

OSM

The network is overly complex for what we need, so we simplify it by:

● reducing cyclic ways, such as roundabouts, to a single one

● transforming the nodes into sections, i.e. pieces of streets between 2 intersections
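The second simplification can be sketched as follows, under the assumption that an intersection is any node shared by more than one way (a way crossing itself is ignored). The input layout, the function name, and the sample ways are hypothetical and do not reflect the authors' actual pipeline:

```python
from collections import Counter

def split_into_sections(ways):
    """ways: {way_id: [node ids in order]}.
    Returns a list of sections, each a list of node ids running from one
    intersection (or way end) to the next."""
    # Count in how many distinct ways each node appears.
    usage = Counter(n for nodes in ways.values() for n in set(nodes))
    sections = []
    for nodes in ways.values():
        current = [nodes[0]]
        for n in nodes[1:]:
            current.append(n)
            if usage[n] > 1:              # node shared by several ways
                sections.append(current)
                current = [n]             # next section starts at the intersection
        if len(current) > 1:
            sections.append(current)
    return sections

ways = {"main": [1, 2, 3, 4, 5], "cross": [6, 3, 7]}
sections = split_into_sections(ways)
```

In the example, node 3 is the only intersection, so each of the two ways is cut into two sections at that node.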

Page 33: Machine Learning and GraphX

OSM

Hence, OSM ~ G(Node, Way)

Even if the match is not exact, we can still manipulate them.

In our case, we don’t need the connectivity of an intersection, but the connectivity of a section.

This is given by G’ (the dual of G).

Page 34: Machine Learning and GraphX

Dataset

● 80 cities

● 3M edges in total

● smallest city: 200 edges (Tempe)

● largest city: 200,000 edges (Los Angeles)

Page 35: Machine Learning and GraphX

Comparing Cities

● Hypothesis: Cities with similar connectivity have similar PageRank distributions

[Figure panels: NYC, Chicago]

Page 36: Machine Learning and GraphX

Fort Worth = Philadelphia?

Looks the same!

Page 37: Machine Learning and GraphX

Smells like Spurious Correlation

Page 38: Machine Learning and GraphX

Normalizing PageRank distributions

● Problem: PageRank is correlated with the size of the city

● size of city = number of sections (edges) in the graph

● Normalized PageRank = PageRank / size_of_city

● Now we can compare cities of different sizes!
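A minimal sketch of the normalization formula, with made-up scores for two hypothetical cities of different sizes (the function name is ours):

```python
def normalize_pagerank(ranks, size_of_city):
    """Divide every PageRank score by the number of sections in the city."""
    return [r / size_of_city for r in ranks]

# Made-up scores: one score per section, so size_of_city == len(ranks).
small_city = normalize_pagerank([0.5, 1.0, 1.5], size_of_city=3)
large_city = normalize_pagerank([0.5, 1.0, 1.5, 1.0, 1.0, 1.0], size_of_city=6)
```

In this toy example both normalized vectors sum to 1, so they can be compared on the same scale even though the cities differ in size.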

Page 39: Machine Learning and GraphX

Fort Worth != Philadelphia!

Totally different!

Page 40: Machine Learning and GraphX

Fort Worth before and after

Note that the range of PageRank is preserved

Page 41: Machine Learning and GraphX

Distance between PageRank Distributions

● How to compare PageRank distributions?

● It’s not always a normal distribution!

● Can use the Kullback–Leibler divergence from information theory

● The Kullback–Leibler divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P
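For discrete distributions P and Q over the same bins i, the standard definition is the following (the bin index i is our notation; the natural logarithm gives a result in nats):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{i} P(i) \, \ln \frac{P(i)}{Q(i)}
```

Note that the measure is not symmetric: in general, D_KL(P||Q) differs from D_KL(Q||P).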

Page 42: Machine Learning and GraphX

KL Divergence

● Easy to compute

● Units are nats (or bits if using log2 instead of ln)
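A plain-Python sketch of the computation, assuming the scores of each city are first binned into a histogram over a common range. The bin layout, the `eps` smoothing that avoids empty bins, and both function names are our assumptions:

```python
import math

def histogram(values, bins, lo, hi, eps=1e-9):
    """Bin `values` (assumed to lie in [lo, hi]) into `bins` equal-width
    buckets and return probabilities; `eps` keeps every bucket non-zero."""
    counts = [eps] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)   # hi falls in the last bucket
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q): nats by default, bits with base=2."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
nats = kl_divergence(p, q)
bits = kl_divergence(p, q, base=2)
```

Both distributions must use the same bins, and Q must be non-zero wherever P is, which is what the smoothing term guarantees here.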

Page 43: Machine Learning and GraphX

Very different cities: Dallas & Seattle

● KL divergence = 18.407

● Dallas is irregular, Seattle is a perfect grid

Page 44: Machine Learning and GraphX

Very similar cities: Atlanta & Boston

● KL divergence = 0.36

● Both are very irregular

Page 45: Machine Learning and GraphX

Next steps

● Using multiple street topology indicators to measure the risk of car accidents

Page 46: Machine Learning and GraphX

Q.E.D

Thanks for keeping up!

Question => Future[(Option[Response], Future[Question])]