Tree and Graph Processing On Hadoop

Ted Malaska

Schedule

• Intro
• Overview of Hadoop and Eco-System
• Summarize Tree Rooting
• MR Overview/Implementation Options
• HBase Overview/Implementation Options
• Giraph Overview/Implementation Options
• Spark Overview/Implementation Options
• Summary
• Questions

• Hi there

Overview of Hadoop and Eco-System

Search, NoSQL, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch

HDFS

Security and Access Controls

Auditing and Monitoring

R Pyth

In Scope for Tonight

Search, NoSQL, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch

HDFS

Security and Access Controls

Auditing and Monitoring

R Pyth

Summarize Tree Rooting

• Basic Tree

True Root

Branches

Vertex

• More Complex Tree

Circular Link

Multiple Parents

• Merging Trees
• Borderline True Graph Problem

Multi-Rooted Vertex

True Root

True Root

• Know your data

Basic Storage Format

• <NodeID>|<EdgeID>

• Example
  • 101
  • 101|201
  • 101|202
  • 201
  • 202|301
  • 301
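The storage format above can be read into an adjacency map with a few lines of Python. This is a minimal sketch, not the presenter's code: it assumes that `<EdgeID>` is the ID of the vertex at the far end of the edge, and the name `parse_records` is hypothetical.

```python
from collections import defaultdict


def parse_records(lines):
    """Parse '<NodeID>' or '<NodeID>|<EdgeID>' lines into {node: [edge targets]}.

    Assumption: <EdgeID> names the vertex the edge points to.
    """
    adjacency = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node_id, _, edge_id = line.partition("|")
        adjacency[node_id]  # touching the key registers the node
        if edge_id:
            adjacency[node_id].append(edge_id)
            adjacency[edge_id]  # make sure the target exists as a node too
    return dict(adjacency)


records = ["101", "101|201", "101|202", "201", "202|301", "301"]
graph = parse_records(records)
print(graph)
```

Keeping only the linkage in this structure, separate from the heavier node and edge payloads, keeps the working set small.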

Preprocessing

• Trimming Data
  • Nodes and edges have data
  • Data has weight
  • Normally linkage information is under 10% of true data size

• Organize Data by Partitioning

Basic Solution

• Step 1: Identify Roots
  • Echo to all edges
  • Vertexes that receive no echoes are roots
  • Root the root

• Step 2: Walk the tree
  • Echo from the last newly rooted vertex to all edges
  • If a vertex is not already rooted, then root it

• 101
• 101|201
• 101|202
• 201
• 202|301
• 301

• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:Null
• 202|301|R:Null
• 301|R:Null

• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:101
• 202|301|R:101
• 301|R:Null

• 101|R:101
• 101|201|R:101
• 101|202|R:101
• 201|R:101
• 202|301|R:101
• 301|R:101
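The two steps can be sketched on a single machine as follows. This is an illustrative in-memory version under the assumption that the graph fits in a dictionary of `{vertex: [edge targets]}`; the function names `find_roots` and `root_tree` are hypothetical.

```python
def find_roots(graph):
    """Step 1: every vertex echoes to its edges; vertices that hear nothing are roots."""
    echoed = {target for targets in graph.values() for target in targets}
    return [vertex for vertex in graph if vertex not in echoed]


def root_tree(graph):
    """Step 2: walk outward from the roots, rooting each vertex the first time it is reached."""
    roots = {vertex: None for vertex in graph}
    frontier = []
    for root in find_roots(graph):
        roots[root] = root  # "root the root"
        frontier.append(root)
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for target in graph[vertex]:
                if roots[target] is None:  # skip vertices that are already rooted
                    roots[target] = roots[vertex]
                    next_frontier.append(target)
        frontier = next_frontier
    return roots


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted = root_tree(graph)
print(rooted)
```

Because a vertex is rooted only once, a circular link cannot cause an endless walk, and a vertex with multiple parents keeps the first root that reaches it. Vertices that sit only on a cycle with no true root above them stay un-rooted (`None`) in this sketch.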

Map Reduce

• Massively parallel processing on Hadoop
• Based on the Google 2004 MapReduce white paper
• Able to process PBs of data

Map Reduce

Data Blocks

Mapper

Sort & Shuffle

Reducer

Data Blocks

Map Reduce

• Self Joins
• Always producing two outputs:
  • Newly Rooted
  • Still Un-Rooted

All Data

Un-Rooted

Newly Rooted

Un-Rooted

Newly Rooted

Old Rooted 0

MR - Stage0

Root Identifying

MR – Stage1

Rooting

Un-Rooted

Newly Rooted

Old Rooted 0

MR – Stage2

Rooting

Old Rooted 1
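The staged flow in the diagram can be imitated in plain Python to show what each job reads and writes. This is only a simulation sketch, not Hadoop code: the map, shuffle and reduce functions below are hypothetical stand-ins, and the record tags `"RECORD"` and `"ROOT"` are assumptions made for the illustration.

```python
from collections import defaultdict


def map_phase(unrooted, newly_rooted):
    """Emit each un-rooted record under its own key, plus a root echo along every edge
    of a newly rooted vertex, so the shuffle performs the self join."""
    for vertex, edges in unrooted.items():
        yield vertex, ("RECORD", edges)
    for vertex, (edges, root) in newly_rooted.items():
        for target in edges:
            yield target, ("ROOT", root)


def shuffle(pairs):
    """Group values by key, as the sort and shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Split the joined records into the two outputs: newly rooted and still un-rooted."""
    newly_rooted, still_unrooted = {}, {}
    for vertex, values in grouped.items():
        edges = next((v[1] for v in values if v[0] == "RECORD"), None)
        if edges is None:
            continue  # the echo reached a vertex that was rooted in an earlier stage
        root = next((v[1] for v in values if v[0] == "ROOT"), None)
        if root is None:
            still_unrooted[vertex] = edges
        else:
            newly_rooted[vertex] = (edges, root)
    return newly_rooted, still_unrooted


def run(graph):
    # Stage 0: root identifying
    echoed = {target for edges in graph.values() for target in edges}
    newly = {v: (e, v) for v, e in graph.items() if v not in echoed}
    unrooted = {v: e for v, e in graph.items() if v in echoed}
    old_rooted, stages = {}, 0
    # Stage 1..n: each pass reads only the un-rooted and newly rooted outputs
    while newly:
        stages += 1
        next_newly, unrooted = reduce_phase(shuffle(map_phase(unrooted, newly)))
        old_rooted.update({v: root for v, (_, root) in newly.items()})
        newly = next_newly
    return old_rooted, unrooted, stages


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted, leftover, stages = run(graph)
print(rooted, leftover, stages)
```

Each pass sets the previously newly rooted records aside as "old rooted", so later stages join only the shrinking un-rooted set against the latest layer.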

Map Reduce

• Great for large batch operations
• No memory limit
• Not good at iterations

HBase

• Largest and most used NoSQL implementation in the world
• Based on the Google 2006 BigTable white paper
• Imagine it like a giant HashMap with keys and values
• Handles 100k operations a second even on a small 10-node cluster

HBase Getting

Client

HBase Master

HBase Region Server HBase Region Server HBase Region Server

Block Cache Block Cache Block Cache

HBase Putting

Client

HBase Master

HBase Region Server HBase Region Server HBase Region Server

MemStore

• Good for graph traversing
• Bad for large batch processing

• Scan rate about 8x slower than HDFS
• Good for the end of a long tail
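To illustrate why a key-value store suits traversal, the sketch below walks a tree using nothing but point lookups. It is not HBase client code: a Python dictionary stands in for the table, in line with the "giant HashMap" picture, and the names `table`, `get` and `descendants` are hypothetical.

```python
from collections import deque

# Stand-in for an HBase table: row key = vertex ID, value = edge targets (assumed layout).
table = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}


def get(row_key):
    """Stand-in for a single-row lookup; returns the edge list stored under the key."""
    return table.get(row_key, [])


def descendants(start):
    """Breadth-first walk that issues one lookup per visited vertex."""
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for target in get(vertex):
            if target not in seen:  # guards against circular links
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(descendants("101"))
```

The walk touches only the rows it needs, which is cheap for a few vertices and expensive when nearly every row must be read.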

Giraph

• System built for large batch graph processing
• Based on the Pregel 2009 white paper
• Hardened by LinkedIn and Facebook
• Reported to handle up to a trillion edges

Giraph Loading

Data Blocks

Worker

Master

Giraph (Bulk Synchronous Parallel)

Worker Worker Worker

Barrier synchronization

Giraph

• Most mature bulk graph processing out there
• Of all the solutions, most graph focused
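The bulk synchronous parallel model can be imitated in a few lines to show how tree rooting fits it. This is a single-process sketch of the vertex-centric idea, not the Giraph API: the function `bsp_root` and the inbox/outbox dictionaries are hypothetical, and every edge target is assumed to appear as a key in the graph.

```python
def bsp_root(graph):
    """Root a tree in supersteps: vertices read their inbox, update, send, then wait at a barrier."""
    roots = {vertex: None for vertex in graph}
    # Root identifying: vertices that no edge points to start with their own ID in the inbox.
    echoed = {target for edges in graph.values() for target in edges}
    inbox = {vertex: [vertex] for vertex in graph if vertex not in echoed}
    supersteps = 0
    while inbox:  # the job ends when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            if roots[vertex] is not None:
                continue  # already rooted: ignore the message and stay halted
            roots[vertex] = messages[0]
            for target in graph[vertex]:
                outbox.setdefault(target, []).append(roots[vertex])
        inbox = outbox  # barrier synchronization: messages are delivered in the next superstep
        supersteps += 1
    return roots, supersteps


graph = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
roots, supersteps = bsp_root(graph)
print(roots, supersteps)
```

In this sketch the graph stays in memory between supersteps, so only the messages move at each barrier.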

• At Berkeley around 2011, some asked if we could do better than MR
• Take advantage of lower cost memory
• Building on everything before

Worker

DAG Scheduler

(Like a queue planner)

Spark Worker

RDD Objects

Task Threads

Block Manager

Rdd1.join(rdd2).groupBy(…).filter(…)

Task Scheduler

Threads

Block Manager

Cluster Manager

• Implementations
  • Onion MR approach with basic Spark
  • Pregel approach with Bagel or GraphX

• Bagel is a façade over generic Spark functionality
• GraphX is an effort to extend Spark

• Less code
• Learning curve
• It's raw and will be changing a lot in the next year
