Tree Representation in MapReduce World
IPL weekly-seminar
Yu Liu@NII
2011-11-22
Distributed File System of MapReduce
• A GFS/HDFS cluster consists of a single master (namenode) and multiple chunkservers (datanodes) and is accessed by multiple clients.
• The master maintains all filesystem metadata.
• Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the datanodes.
Distributed File System of MapReduce
• Architecture of GFS/HDFS
– Files are divided into fixed-size chunks
– Each chunk is identified by an immutable and globally unique 64 bit integer (chunk handle)
– Each chunk is replicated on multiple chunkservers
– Chunks of a file are placed as evenly as possible across the cluster.
(The Google File System, SOSP03)
Apache HDFS: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Introduction
Inputs and Outputs of MapReduce
• The MapReduce framework operates exclusively on <key, value> pairs.
• Each pair is called a record.
• Applications specify the input/output locations and supply map and reduce functions; together with other job parameters, these comprise the job configuration. The job client then submits the job to the framework.
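The key/value model above can be sketched as a small sequential program. This is only an illustration: `mapReduce`, `wcMap` and `wcReduce` are hypothetical names, not part of Hadoop or any library, and the shuffle phase is modeled by a plain sort and group.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Sequential model of one MapReduce job (illustrative, not the Hadoop API).
mapReduce
  :: Ord k2
  => ((k1, v1) -> [(k2, v2)])   -- map: one input record to intermediate records
  -> (k2 -> [v2] -> v3)         -- reduce: one key with all of its values
  -> [(k1, v1)]                 -- input records
  -> [(k2, v3)]                 -- output records, ordered by key
mapReduce mapper reducer records =
  [ (k, reducer k (map snd grp))
  | grp@((k, _) : _) <- groupBy ((==) `on` fst) (sortOn fst (concatMap mapper records))
  ]

-- Word count: the input key (a line number) is ignored.
wcMap :: (Int, String) -> [(String, Int)]
wcMap (_, line) = [(w, 1) | w <- words line]

wcReduce :: String -> [Int] -> Int
wcReduce _ = sum
```

Calling `mapReduce wcMap wcReduce` on a list of numbered lines returns one record per distinct word.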
Tree Data Structure inside MapReduce
• Currently, GFS/HDFS favors flat data structures and flat files; tree-structured files such as XML are not well supported.
• We already know how to represent a file which contains a large list in HDFS (EuroPar2011)
• Tree representation is still a problem.
How to Represent a Tree in MapReduce
• Suppose we can represent the tree by a list such that:
– when the list is split into arbitrary contiguous sublists, each sublist represents a subtree
– after any tree-contraction operations on each subtree, the concatenated sublists still form a tree
• Then such a list is what we want.
Tree Representation: Balanced Parenthesis
• Balanced Parenthesis (BP) for an ordered tree (Munro and Raman, 2001)
BP: ( ( ( ) ( ) ( ) ) ( ( ) ( ) ) )
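[Figure: the example ordered tree, captioned "Outer-planar sequence". Node 1 is the root with children 2 and 6; node 2 has children 3, 4, 5; node 6 has children 7, 8.]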
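A minimal sketch of the BP encoding, assuming a simple rose-tree type. `Tree`, `toBP` and `example` are hypothetical names introduced here for illustration; they do not appear in the slides.

```haskell
-- A rose tree: a node id plus an ordered list of children.
data Tree = Node Int [Tree]

-- Balanced-parenthesis encoding: open, encode the children in order, close.
toBP :: Tree -> String
toBP (Node _ kids) = "(" ++ concatMap toBP kids ++ ")"

-- The example tree: 1 has children 2 and 6; 2 has 3, 4, 5; 6 has 7, 8.
example :: Tree
example =
  Node 1
    [ Node 2 [Node 3 [], Node 4 [], Node 5 []]
    , Node 6 [Node 7 [], Node 8 []]
    ]
```

`toBP example` yields the 16-bracket sequence shown above, without the spaces.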
BP Can Be A Solution
• A tree node can be represented by a pair of parentheses:
• node = ( ‘(’ , ‘)’ )
• We want to represent a tree as a list of nodes, so the nodes should be sortable
• data HalfNode = HalfNode {lr::Char, id::Int, index::Int}
– E.g.: left1 : HalfNode {lr=‘L’, id=1, index=0},
right1 : HalfNode {lr=‘R’, id=1, index=15}
• data Node = Node {left::HalfNode, right::HalfNode}
– E.g.: the root ①: Node {left=left1, right=right1}
Parenthesis / HalfNode
• For simplicity we define
type HalfNode = (Bool, Int, Int)
leftPar: (False, _, _)
rightPar: (True, _, _)
so that a node can be expressed by two HalfNodes,
E.g.: the root ①: { (False, 1, 0), (True, 1, 15) }
the node ②: { (False, 2, 1), (True, 2, 8) }
the node ⑦: { (False, 7, 10), (True, 7, 11) }
Comparable HalfNode
• A set of HalfNodes can be sorted by index to get a BP sequence
– type HalfNode = (Bool, Int, Int)
– The Bool tells us whether each bracket is left or right
• { (False, 1, 0), (False, 2, 1), (False, 3, 2), (True, 3, 3), (False, 4, 4), (True, 4, 5),
(False, 5, 6), (True, 5, 7), (True, 2, 8), (False, 6, 9), (False, 7, 10), (True, 7, 11),
(False, 8, 12), (True, 8, 13), (True, 6, 14), (True, 1, 15) }
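The list above can be produced mechanically from a tree. The following is a sketch under the assumption of a rose-tree type; `Tree`, `halfNodes`, `render` and `example` are hypothetical names, and the tuple layout follows the slide: (is right bracket, node id, index).

```haskell
import Data.List (sortOn)

type HalfNode = (Bool, Int, Int)   -- (isRight, node id, index)

data Tree = Node Int [Tree]

-- Walk the tree in document order, threading the next free index.
halfNodes :: Tree -> [HalfNode]
halfNodes t = fst (go t 0)
  where
    go (Node nid kids) i =
      let (inner, j) = goAll kids (i + 1)
      in ((False, nid, i) : inner ++ [(True, nid, j)], j + 1)
    goAll [] i = ([], i)
    goAll (k : ks) i =
      let (a, j) = go k i
          (b, l) = goAll ks j
      in (a ++ b, l)

-- Sort by index (third component) and render each half node as a bracket.
render :: [HalfNode] -> String
render hs = [if isRight then ')' else '(' | (isRight, _, _) <- sortOn third hs]
  where third (_, _, ix) = ix

example :: Tree
example =
  Node 1
    [ Node 2 [Node 3 [], Node 4 [], Node 5 []]
    , Node 6 [Node 7 [], Node 8 []]
    ]
```

Because `render` sorts by index first, the half nodes may arrive in any order, which is the property the slide relies on.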
Sub Trees
• A sub-sequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
1 2 3 4 5
[Figure: the example tree with root 1; children 2 and 6; leaves 3, 4, 5, 7, 8]
Sub Trees
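• A sub-sequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Left sub-sequence: ( ( ( ) ( ) ( ) with nodes 1 2 3 4 5; contracted: ( ( with nodes 1 2
[Figure: the left subtree before contraction (1; 2, d; d, 3, 4, 5) and after contraction (1; 2, d; d)]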
Sub Trees
• A sub-sequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
1 2 3 4 5 | 2 6 7 8
[Figure: the tree with d in place of the root; children 2 and 6; leaves 3, 4, 5, 7, 8]
Sub Trees
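• A sub-sequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Right sub-sequence: ) ( ( ) ( ) ) ) with nodes 2 6 7 8; contracted: ) ( ) ) with nodes 2 6
[Figure: the right subtree before contraction (1; 6; 7, 8) and after contraction (1; 6)]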
Bottom-up Tree contraction
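• When we concatenate the two contracted sublists:
( ( ) ( ) )
1 2 2 6
[Figure: the contracted left piece (1; 2, d; d) and the contracted right piece (1; 6) combine into the tree with root 1 and children 2 and 6]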
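One way to check that independently processed sublists still fit together is to summarize each sublist by its unmatched brackets. This is a simpler stand-in for the contraction shown on the slides, not the same operation: it cancels every matched pair, whereas the slides keep node 6 in the right piece. `Summary`, `summarize` and `combine` are hypothetical names.

```haskell
-- After cancelling every matched pair inside a sub-sequence, what remains
-- is some unmatched ')' followed by some unmatched '('.
data Summary = Summary { closes :: Int, opens :: Int }
  deriving (Eq, Show)

summarize :: String -> Summary
summarize = foldl step (Summary 0 0)
  where
    step (Summary c o) '(' = Summary c (o + 1)
    step (Summary c o) ')'
      | o > 0     = Summary c (o - 1)   -- closes a pending '(' in this chunk
      | otherwise = Summary (c + 1) o   -- its '(' lies in an earlier chunk
    step s _ = s                        -- ignore spaces and other characters

-- Summary of the concatenation of two adjacent chunks.
combine :: Summary -> Summary -> Summary
combine (Summary c1 o1) (Summary c2 o2)
  | o1 >= c2  = Summary c1 (o1 - c2 + o2)
  | otherwise = Summary (c1 + c2 - o1) o2
```

For the split used in the slides, the left half leaves two unmatched opening brackets (nodes 1 and 2), the right half leaves two unmatched closing brackets, and `combine` of the two gives `Summary 0 0`, i.e. a well-formed tree. Since `combine` is associative, the per-chunk summaries can be merged in any grouping.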
Sub Trees
• { (False, 1, 0), (False, 2, 1), (False, 3, 2),
(True, 3, 3), (False, 4, 4), (True, 4, 5),
(False, 5, 6), (True, 5, 7), (True, 2, 8),
(False, 6, 9), (False, 7, 10), (True, 7, 11),
(False, 8, 12), (True, 8, 13), (True, 6, 14),
(True, 1, 15) }
Splitting and Grouping
• We can split a list and group the elements of each sub-list in MapReduce.
– We extend the definition: type HalfNode = (Bool, Int, (Int, Int))
• Here (Int, Int) is the pair (index / d, index)
• For the BP-MR sequence, this means we can split a tree by the number of brackets
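A sketch of the grouping step, assuming that d is the number of brackets per sub-list (the slides do not define d, so this is a guess) and that the division is integer division. `withGroup` and `groupByKey` are hypothetical names.

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- (isRight, node id, (index `div` d, index))
type HalfNode = (Bool, Int, (Int, Int))

-- Attach the group key to a plain (isRight, id, index) triple.
-- Assumption: d is the number of brackets per sub-list.
withGroup :: Int -> (Bool, Int, Int) -> HalfNode
withGroup d (r, nid, ix) = (r, nid, (ix `div` d, ix))

-- Split an index-sorted list into sub-lists that share the same group key.
groupByKey :: [HalfNode] -> [[HalfNode]]
groupByKey = groupBy ((==) `on` key)
  where key (_, _, (g, _)) = g
```

With d = 8, the 16 half nodes of the example tree fall into two groups of eight, which is the split marked by the bar in the earlier slides.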
Practical Data
• Real data are attached to the left half-node
– type HalfNode = ((Int, Int), Bool, Int, Map)
– For right half-nodes, the Map is always empty/null
Bottom-up Build a Tree
• A list of items as input
• Make a sparse list of “leaves”, e.g.:
[ ((0,100), False, 100, data1), ((0,101), True, 100, null), ((0,200), False, 200, data2), ((0,201), True, 200, null), .. ]
( 100 ) ( 200 ) ( 300 ) ( 400 ) ….
• Insert parents (here 50 and 250):
( ( 100 ) ( 200 ) ) ( ( 300 ) ( 400 ) ) ….
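The leaf-generation step can be sketched as follows. The spacing of 100 between leaf ids is taken from the example above; `leaves` is a hypothetical name, and `Maybe a` stands in for the Map payload, with `Nothing` playing the role of null on right half-nodes. Parent insertion is not covered.

```haskell
-- ((group, index), isRight, node id, payload)
type HalfNode a = ((Int, Int), Bool, Int, Maybe a)

-- One bracket pair per item, with ids 100, 200, 300, ... so that
-- parents can later be inserted into the gaps.
leaves :: [a] -> [HalfNode a]
leaves items = concat (zipWith leaf [100, 200 ..] items)
  where
    leaf nid x =
      [ ((0, nid), False, nid, Just x)       -- left half carries the data
      , ((0, nid + 1), True, nid, Nothing)   -- right half carries none
      ]
```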
Examples
• XML
– An XML file is just a BP representation
An example of an XML file:
<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget </body>
</note>
Examples
• An XML file can easily be transformed to BP-MR
– Operations:
• Query
– By XPath
– By id / index
• Parallel parsing?
Hierarchical Clustering
• This work is not finished
• Clustering algorithms usually fall into two categories: hierarchical and partitioning
• The more popular hierarchical agglomerative clustering (HAC) algorithms use a bottom-up approach to merge items into a hierarchy of clusters.
Hierarchical Clustering
• Average-link is one of the most popular algorithms for hierarchical clustering
• Average link: the distance between two clusters is the average distance over all pairs of points that take one point from each cluster
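A minimal sketch of the average-link distance, assuming 2-D points with Euclidean distance and non-empty clusters. `Point`, `dist` and `averageLink` are hypothetical names; the slides do not fix a point type or a metric.

```haskell
type Point = (Double, Double)

-- Euclidean distance between two points (an assumed metric).
dist :: Point -> Point -> Double
dist (x1, y1) (x2, y2) = sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2)

-- Average over all pairs that take one point from each cluster.
-- Both clusters are assumed to be non-empty.
averageLink :: [Point] -> [Point] -> Double
averageLink xs ys =
  sum [dist x y | x <- xs, y <- ys] / fromIntegral (length xs * length ys)
```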
GTA Algorithm for Hierarchical Clustering
Currently only for the first merge step
• Initial data are a set of items
• map makeNode items
where makeNode item = ((0,0), False, 1, item), ((0,0), True, 1, item)
• Input is a BP-MR sequence, but only the left halves
• Generate: all possible bags
• Test: only keep pairs
• Aggregate: the minimum-distance pair
• Post-process: a new HalfNode pair which is the parent of the aggregate’s result
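The generate, test and aggregate steps of the first merge can be sketched sequentially. This is not the GTA library's API: `pairs` and `closestPair` are hypothetical names, the generate and test steps are fused by enumerating pairs directly, and at least two items are assumed.

```haskell
import Data.List (minimumBy, tails)
import Data.Ord (comparing)

-- Generate + test: all unordered pairs of distinct positions in the list.
pairs :: [a] -> [(a, a)]
pairs xs = [(x, y) | (x : rest) <- tails xs, y <- rest]

-- Aggregate: the pair with the smallest distance.
-- Assumes the input has at least two items.
closestPair :: (a -> a -> Double) -> [a] -> (a, a)
closestPair d = minimumBy (comparing (uncurry d)) . pairs
```

The post-process step would then wrap the chosen pair in a new parent HalfNode pair.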
Problems
• Insertion is hard
– Appending to the tail is easy, but inserting at other places is difficult
• Generating BP-MR sequences in parallel
– Idea: first generate the skeleton of a tree
Skeletons of a Tree
• For example
a = 1000, b = 2000, c = 3000 …
index_a = 1000, index_b = 2000, index_c = 3000 …
index_a’ = 8000, index_b’ = 4000, index_c’ = 6000 …
[Figure: tree skeleton with root 1; children a and e; nodes b, c, d, f, g on the level below]
End
• Thanks