Upload
truongthuy
View
228
Download
1
Embed Size (px)
Citation preview
computer science
saarlanduniversity
Jens Dittrich
EfficientlyIndexing AND QueryingBig Datain Hadoop MapReduce
MapReduceIntro
HAIL
Hadoop++
MapReduceIntro
Big Data
Copyright of all slides: Jens Dittrich 2012
This talk consists of three parts.
First Part
Big data is the new “very large“.
[Physics]
Big data is everywhere: CERN...
http://cdsweb.cern.ch/record/1295244
...moving objects indexing...
http://www.istockphoto.com/stock-video-4518244-la-trafic-a-time-lapse.php
...astronomy...
http://www.flickr.com/photos/14924974@N02/2992963984/
basically whenever you point a satellite dish up in the air, you collect tons of data
but also in...
\href{http://it.wikipedia.org/wiki/File:KSC_radio_telescope.jpg}{http://it.wikipedia.org/wiki/File:KSC\_radio\_telescope.jpg}
[Dean et al, OSDI’04]
...genomics...
http://www.istockphoto.com/stock-illustration-16136234-dna-strands.php
...social networks...
...and search engines.
They proposed a system to effectively analyze big data.
MapReduce
Big Data Tutorial[VLDB 2012b]
Semantics:
That system was coined “MapReduce“. The system is Google-proprietary.
Hadoop is the open source variant. It has a large community of developers and start-up companies.
We presented a tutorial on Big Data Processing in Hadoop MapReduce at VLDB 2012.It contains many details on data layouts, indexing, query processing and so forth.The tutorial slides are available online:http://infosys.uni-saarland.de/publications/BigDataTutorialSlides.pdf
Let‘s briefly revisit the MapReduce interface:
map(key, value) -> set of (ikey, ivalue)
reduce(ikey, set of ivalue) -> (fkey, fvalue)
Google-Use Case:
Web-Index Creation
map(key, value)->
set of (ikey, ivalue)
just two functions: map() and reduce()
Let‘s look at a concrete use-case:
This is vital for Google‘s search service you use everyday.
In this use-case the map function...
map(docID, document)->
set of (term, docID)
map(44, ´´This is text on a website!´´)->{ (``This´´, 44), (``is´´, 44), (``text´´, 44), (``on´´, 44), (``a´´, 44), (``website´´, 44)}
map(42, ´´This is just another website!´´)->{ (``This´´, 42), (``is´´, 42), (``just´´, 42), (``another´´, 42), (``website´´, 42)}
map(43, ´´One more boring website!´´)->{ (``One´´, 43), (``more´´, 43), (``boring´´, 43), (``website´´, 43)}
...takes a docID and a document (the contents of the document)and returns a set of (term,docID)-pairs.
For instance...
...map() will be called for document 44 with its contents “This is text on a website!“.The map()-function breaks this into pairs, one pair for each term occurring on website 44.
the same happens for document 42
and so forth
reduce(ikey, set of ivalue)->
(fkey, fvalue)
reduce(term, set of docID)->
(term, (posting list of docID, count))
reduce(``This´´, {42, 43})->(``This´´, ([42, 43], 2))
reduce(``is´´, {42, 43})->(``is´´, ([42, 43], 2))
What about reduce()?
For Web-index creation reduce() receives a term and the set of docIDs containing that term.reduce() then returns a pair of the input term and an ordered posting loist of docIDs plus a count, i.e. the number of web pages having that term.
Note: there are many variants how to do web-indexing with MapReduce. The actual semantics used by Google may differ; the core idea however is the same.
For instance: documents 42 and 43 contain “This“.reduce() simply returns an ordered posting list plus the count.
Documents 42 and 43 contain “is“.reduce() simply returns an ordered posting list plus count for this as well.
reduce(``boring´´, {43})->(``boring´´, ([43], 1))
etc.
Other Applications:
Search rec.a==42,rec.contains(``bla´´),rec.contains(0011001)
Machine Learningk-means,mahout library
Web-Analysissum of all accesses to page Y from user X
etc.
Big Data ?
map() and reduce() with
and so forth
Many things can be mapped to the map()/reduce()-interface, but not all.Think about twice before blindly using MapReduce. It is useful for many things, but not all.Many important extensions have been done in the past to support more application classes, i.e. iterative problems.
HDFS
...
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
Bob
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
horizontal partitions
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
HDFS blocks64MB (default)horizontal partitions
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
1
4
7
2 3
5 6
8 9
11 22 33
Let‘s assume a user Bob who wants to analyze a large file.
Notice that this is a simplified explanation.For details on how Hadoop works see our paper:Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), VLDB 2010http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf
http://www.istockphoto.com/file\_closeup.php?id=591134
Bob first needs to upload his file to Hadoop‘s Distributed File System (HDFS).HDFS partitions his data into large horizontal partitions.
Those horizontal partitions are termed HDFS blocks. They are relatively large: at least 64MB up to 1GB.Do not confuse these large HDFS blocks with the typically small database pages (which are only a few KB in size).
Each HDFS block receives a unique ID.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4
7
2
3
5 6
8 9
11 22
3
3
44 55 66
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4
7
2
3 56
8 9
11 22
3
3
4 455 6 6
77 88 99
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
Failover
HDFS
The HDFS blocks get distributed and replicated over the cluster. Each HDFS block gets replicated to at least three different data nodes (DN1, ...DNn in this example).
HDFS does this for every HDFS block of the input file.
Eventually all HDFS blocks have been sent to the datanodes.
We gain nice failover properties: even if two datanodes go offline, we still have one copy of the block.
Assume that we want to retrieve HDFS block 3. We lost the copies on DN2 and DN5. However,we can still retrieve a copy of block 3 from DN6.Notice that once datanodes go offline HDFS (should) copy blocks to other nodes to get back to having three copies of each block again.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
Load Balancing
I would like to have block 4!
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
HDFS
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
HDFS
MapReduce
map(docID, document) -> set of (term, docID)
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
Map Phase
HDFS
MapReduce map(docID, document) -> set of (term, docID)
Another advantage of having three copies for each block is load balancing. Whenever a user or an application asks for a particular block, we have three options for retrieving that block. The decision which datanode to use may be made based on network locality, network congestion, and the current load of the datanodes.
So now we have our data stored in HDFS.What about MapReduce?And by “MapReduce“ I mean “Hadoop MapReduce“ in the following.
MapReduce is another software layer on top of HDFS.MapReduce consists of three phases.
In the first phase (the Map Phase) only the map-function is considered.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4 7
2
3 56
89 11 22
3
3
4 455 6 6
77
8
89
9
Map Phase
HDFS
MapReduce
M1 M2 M3 M4 M5 M6 M7
map(docID, document) -> set of (term, docID)
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
Bob
...1
4 7
2
3 5
891 22
34 55 6 6
77
8
89
9
Map Phase
6
HDFS
MapReduce
1 3
4
M1 M2 M3 M4 M5 M6 M7
map(docID, document) -> set of (term, docID)
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
Bob
...1
4
2
3 5
81
34 55 6 6
77 89
9
Map Phase
6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘
9 1 23
76 8
9 1 23
76 8
2
HDFS
MapReduce
4
M1 M2 M3 M4 M5 M6 M7
map(docID, document) -> set of (term, docID)
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...12
45
81 2
34 455 6 6
77 89
9
Map Phase
5‘ 4‘
453
6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘
9 1 23
76 8
HDFS
MapReduce
M8 M9
map(docID, document) -> set of (term, docID)
MapReduce assigns a thread (AKA Mapper) at every datanode having data to be processed for this job.
Each mapper reads one of the HDFS blocks...
..and breaks that HDFS blocks into records. This can be customized with the RecordReader.For each record map() is called. The output to that file (also called intermediate results) is collected on the local disks of the datanodes. For instance, for block 6 the output is collected in file 6‘.
This is done for every block of the input file. Obviously we do not have to do this with every copy of am HDFS block. Processing one copy is enough.With this the Map Phase is finished.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Shuffle Phase
5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘
HDFS
MapReduce group by term
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Shuffle Phase
5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘
network
HDFS
MapReduce group by term
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Shuffle Phase
5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘
networknetwork
HDFS
MapReduce group by term
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Shuffle Phase
A-B C-D E-F G-H I-J K-L W-Z
networknetwork
HDFS
MapReduce group by term
Now, the Shuffle Phase starts.
In the Shuffle Phase all intermediate results are redistributed over the different datanodes. In this example we want to redistribute the intermediate results based on the term, i.e. we want to group all intermediate results by term.
This means, after shuffling, we obtain a range partitioning on terms....
For instance, DN1 contains all intermediate results having terms starting with A or B. In turn, DN2 has only terms starting with C or D and so forth.Once all data has been redistributed, the Shuffle Phase is finished.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Reduce Phase
A-B C-D E-F G-H I-J K-L W-ZA-B C-D E-F G-H I-J K-L W-Z
HDFS
MapReduce
reduce(term, set of docID) -> set of (term, (posting list of docID, count))
R1 R2 R3 R4 R5 R6 Rn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
Reduce Phase
A-B C-D E-F G-H I-J K-L W-Z
A-B‘ C-D‘ E-F‘ G-H‘ I-J‘ K-L‘ W-Z‘
A-B C-D E-F G-H I-J K-L W-Z
HDFS
MapReduce reduce(term, set of docID) -> set of (term, (posting list of docID, count))
R1 R2 R3 R4 R5 R6 Rn
HadoopMapReduceAdvantages
failoverscalability
schema-later
ease of use
In the final Reduce Phase, MapReduce assigns threads to the different datanodes having intermediate results. These threads are termed reducers. The reducers read the intermediate results and for each distinct key they call reduce(). Notice that reduce() may only be called once for each distinct key on the entire cluster. Otherwise the semantics of the map()/reduce()-paradigm would be broken.
The output of the reduce()-calls is stored on disk again. In this example this is visualized to store the output on local disk. However, Hadoop stores the output on HDFS by default, i.e. the output of the Reduce Phase gets replicated by HDFS again.
When to replicate which data for fault tolerance in MapReduce is an interesting discussion. See our paper RAFT for more details:http://infosys.uni-saarland.de/publications/QPSD11.pdf
HadoopMapReduce Disadvantages
Performance
MapReduceIntro Hadoop++
HAIL
Hadoop++
And this and how to fix it is what the following material is about.
Second Part
[VLDB 2010a]
VLDB 2010, Sep 15, Singapore
The “Map Reduce Plan“ partition load map reduce
fits well with the simplicity philosophy of Hadoop.Hadoop++ changes the internal layout of a split — a large hori-
zontal partition of the data — and/or feeds Hadoop with appropri-ate UDFs. However, Hadoop++ does not change anything in theHadoop framework.
Contributions. In this paper we make the following contributions:
(1.) The Hadoop Plan. We demonstrate that Hadoop is noth-ing but a hard-coded, operator-free, physical query execution planwhere ten User Defined Functions block, split, itemize, mem,map, sh, cmp, grp, combine, and reduce are injected at pre-determined places. We make Hadoop’s hard-coded query process-ing pipeline explicit and represent it as a DB-style physical queryexecution plan (The Hadoop Plan). As a consequence, we are thenable to reason on that plan. (Section 2)(2.) Trojan Index. We provide a non-invasive, DBMS-independent indexing technique coined Trojan Index. A TrojanIndex enriches logical input splits by bulkloaded read-optimizedindexes. Trojan Indexes are created at data load time and thus haveno penalty at query time. Notice, that in contrast to HadoopDB weneither change nor replace the Hadoop framework at all to integrateour index, i.e. the Hadoop framework is not aware of the Trojan In-dex. We achieve this by providing appropriate UDFs. (Section 3)(3.) Trojan Join. We provide a non-invasive, DBMS-independentjoin technique coined Trojan Join. Trojan join allows us to co-partition the data at data load time. Similarly to Trojan Indexes,Trojan Joins do neither require a DBMS nor SQL to do so. Tro-jan Index and Trojan Join may be combined to create arbitrarilyindexed and co-partitioned data inside the same split. (Section 4)(4.) Experimental comparison. To provide a fair experimen-tal comparison, we implemented all Trojan-techniques on top ofHadoop and coin the result Hadoop++. We benchmark Hadoop++against Hadoop as well as HadoopDB as proposed at VLDB2009 [3]. As in [3] we used the benchmark from SIGMOD2009 [16]. All experiments are run on Amazon’s EC2 Cloud.Our results confirm that Hadoop++ outperforms Hadoop and evenHadoopDB for index and join-based tasks. (Section 5)
In addition, Appendix A illustrates processing strategies systemsfor analytical data processing. Appendix B shows that MapReduceand DBMS have the same expressiveness, i.e. any MapReduce taskmay be run on a DBMS and vice-versa. Appendices C to E containadditional details and results of our experimental study.
2. HADOOP AS A PHYSICAL QUERY EX-ECUTION PLAN
In this section we examine how Hadoop computes a MapReducetask. We have analyzed Yahoo!’s Hadoop version 0.19. Note thatHadoop uses a hard-coded execution pipeline. No operator-modelis used. However Hadoop’s query execution strategy may be ex-pressed as a physical operator DAG. To our knowledge, this paperis the first to do so in that detail and we term it The Hadoop Plan.Based on this we then reason on The Hadoop Plan.
2.1 The Hadoop PlanThe Hadoop Plan is shaped by three user-defined parameters
M, R, and P setting the number of mappers, reducers, and datanodes, respectively [10]. An example for a plan with four map-pers (M = 4), two reducers (R = 2), and four data nodes (P = 4)is shown in Figure 1. We observe, that The Hadoop Plan consistsof a subplan L ( ) and P subplans H1–H4 ( ) which correspondto the inital load phase (HDFS) into Hadoop’s distributed file sys-
T
L PPartblock
Replicate Replicate Replicate Replicate Replicate Replicate
T1 T6
H1 Fetch Fetch Fetch
Store Store Store
Scan Scan
PPartsplit PPartsplit
H2
. . .
H3 Fetch Fetch Fetch
Store Store Store
Scan
PPartsplit
H4
. . .
M1 Union
RecReaditemize
MMapmap
PPartmem
LPartsh LPartsh LPartsh
Sortcmp Sortcmp Sortcmp
SortGrpgrp SortGrpgrp SortGrpgrp
MMapcombine MMapcombine MMapcombine
Store Store Store
Mergecmp
SortGrpgrp
MMapcombine
Store
PPartsh
M2
. . .
M3
RecReaditemize
MMapmap
LPartsh
Sortcmp
SortGrpgrp
MMapcombine
Store
PPartsh
M4
. . .
T1 T5 T2 T4 T3 T6
Fetch Fetch Fetch Fetch
Buffer Buffer Buffer Buffer
Store Store Merge
Store
Mergecmp
SortGrpgrp
MMapreduce
R1 Store
T �1
. . .
R2
T �2
Dat
aLo
adPh
ase
Map
Phas
eSh
u⇥e
Phas
eR
educ
ePh
ase
Figure 1: The Hadoop Plan: Hadoop’s processing pipeline ex-pressed as a physical query execution plan
tem. M determines the number of mapper subplans ( ), whereas Rdetermines the number of reducer subplans ( ).
Let’s analyze The Hadoop Plan in more detail:Data Load Phase. To be able to run a MapReduce job, we firstload the data into the distributed file system. This is done by par-titioning the input T horizontally into disjoint subsets T1, . . . ,Tb.See the physical partitioning operator PPart in subplan L. In theexample b = 6, i.e. we obtain subsets T1, . . . ,T6. These subsetsare called blocks. The partitioning function block partitions theinput T based on the block size. Each block is then replicated(Replicate). The default number of replicas used by Hadoop is 3,but this may be configured. For presentation reasons, in the exam-ple we replicate each block only once. The figure shows 4 di�erentdata nodes with subplans H1–H4. Replicas are stored on di�erentnodes in the network (Fetch and Store). Hadoop tries to storereplicas of the same block on di�erent nodes.Map Phase. In the map phase each map subplan M1–M4 reads asubset of the data called a split2 from HDFS. A split is a logicalconcept typically comprising one or more blocks. This assignmentis defined by UDF split. In the example, the split assigned to M1consists of two blocks which may both be retrieved from subplanH1. Subplan M1 unions the input blocks T1 and T5 and breaks theminto records (RecRead). The latter operator uses a UDF itemizethat defines how a split is divided into items. Then subplan M1calls map on each item and passes the output to a PPart opera-tor. This operator divides the output into so-called spills based
2Not to be confused with spills. See below. We use the terminologyintroduced in [10].
§ The “MapReduce Plan“ has 10 user-definedfunctions (UDFs):
! block! split! itemize! mem! map! sh! cmp! grp! combine! reduce
Data Load Phase
Map Phase
Shuffle Phase
Reduce Phase
Good Trojans!
Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg SchadHadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)VLDB 2010/PVLDB, Singaporehttp://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdfslides:http://infosys.cs.uni-saarland.de/publications/DQJ+10talk.pdf
figure shows example with 4 mappers and 2 reducersHadoop MapReduce uses a hard-coded pipeline. This pipeline cannot be changed. This is in sharp contrast to database systems which may use different pipelines for different queries.
However, Hadoop Map Reduce uses 10 user-defined functions (UDFs). Theses UDFs can be used to inject arbitrary code into Hadoop...
...including code that was not intended to be injected into Hadoop.
Our idea is somewhat similar to a trojan horse or a trojan, i.e. a virus that is injected into a computer system to harm or destory the system. However, we inject trojans to improve or heal the system.
Therefore we are calling them...
http://www.istockphoto.com/stock-photo-1824642-trojan-horse.php?st=0bab152
Selection Task
0
20
40
60
80
100
120
140
10 nodes 50 nodes 100 nodes
runt
ime
[sec
onds
]
HadoopHadoopDB
HadoopDB Chunks
Hadoop++(256MB)Hadoop++(1GB)
Join Task
0
500
1000
1500
2000
2500
10 nodes 50 nodes 100 nodes
runt
ime
[sec
onds
]
HadoopHadoopDB
Hadoop++(256MB)Hadoop++(1GB)
1Good Trojans in a DBMS?...ROW TROJAN SQL VECTORWISE VECTORWISE2 VERTICA (other machine) Factor
Q1
Q6
Q12
Q14
76.730296 19.293983 22.0 31.276143 6.0 41.7 1.1 3.277.589034 8.6532381 16.0 25.845965 4.6 11.4 1.8 1.992.486038 37.331905 33.0 29.785149 4.3 15.3 0.9 8.781.207649 30.788114 28.0 22.291128 5.0 13.9 0.9 6.2
0
25
50
75
100
Q1 Q6 Q12 Q14
Que
ry T
ime
(sec
)
Standard Row Trojan ColumnsDBMS-Y DBMS-Z (a)DBMS-Z (b)
[CIDR 2013a]
Good Trojans Versus Closed Source Column Store DBMS
results taken from our VLDB 2010-paper
Hadoop++ is in the same ballpark or even faster than HadoopDB (now spun-off as Hadapt)
Hadoop++ is up to a factor 18 faster than Hadoop
...even though we do not modify the underlying HDFS and Hadoop MapReduce source code at all!Our improvements are all done through UDFs only.
But wait: UDFs are also available in traditional database systems. What happens if we exploit those UDFs to inject “better technology“ into an existing database system?
Say we inject column store technology into a commercial, closed-source row store. How would that look like?
It looks like this. For details see our paper: Alekh Jindal, Felix Martin Schuhknecht, Jens Dittrich, Karen Khachatryan, Alexander Bunte. How Achaeans Would Construct Columns in Troy. CIDR 2013, Asilomar, USA. http://infosys.uni-saarland.de/publications/How%20Achaeans%20Would%20Construct%20Columns%20in%20Troy.pdf You can get much faster than a row store. We are not as fast as a from scratch implementation of a column store. This has to do with the different QP technology used. However for some customers the performance of a native column store might not even be required - especially for medium-sized datasets. ...
Q Standard Row
Trojan Columns
Projected View
Standard Row
Projected View
Trojan Columns
Table
Q1Q6Q12Q14Q3Q5Q10_oQ10_lQ19Q2Q4Q8Q15
76.730295686 19.293982609 27.42381835 3.9769029152 1.4213663869 230.19088706 82.27146 57.881947828 lineitem77.589033799 8.6532381493 21.787439044 8.9664738749 2.5178365218 232.7671014 65.36232 25.959714448 lineitem76.73051647 16.504629093 26.582446376 4.6490300412 1.610605499 230.19154941 79.747339128 49.513887278 lineitem
76.333567335 25.592795096 20.15226419 2.9826194071 0.7874194325 229.00070201 60.456792571 76.778385288 lineitem0 0 0 5.1437565596 lineitem0 0 0 order0 0 0 order0 0 0 lineitem0 0 0
2.935810086 1.9092920003 1.816512607 8.807430258 5.449537821 5.727876001 part15.286783871 15.13420846 5.1889663903 45.860351612 15.566899171 45.402625379 order
0 0 0 part0 0 0 lineitem
0
25
50
75
100
Q1 Q6 Q12 Q14
Que
ry T
ime
(sec
)
Standard Row Trojan ColumnsMaterialized View
Good Trojans in a Closed Source Row Store DBMS
[CIDR 2013a]
1...Good Trojans in a DBMS!
Problems:
Upload-Times
0
10000
20000
30000
40000
50000
10 nodes 50 nodes 100 nodes
runt
ime
[sec
onds
]
HadoopHadoopDB
Hadoop++(256MB)Hadoop++(1GB)
(I)ndex Creation(C)o-PartitioningData (L)oading
L
C
I
L
C
I
L
C
I
L
C
I
L
C
I
L
C
I
... Why buy a car with 1000hp if 200hp are just enough?
An interesting result from our work is also that we can beat materialized views for some queries.
With this let‘s close this footnote and go back to Hadoop MapReduce performance.UDFs in Hadoop allow us to boost query performance without changing the underlying system.However, what are the...
from Hadoop++ paper: The problem is that in order to have fast queries we first have to “massage“ the data before, i.e. create indexes, co-partition and so forth. This takes time. Hadoop does not have to spend this time and therefore uplloading the data to HDFS is fast. In contrast, for Hadoop++ (but als HadoopDB) we also have to do a lot of extra work. This extra work is very costly. So costly that only after running many queries these investments are amortized. In other words, if we only want to run a few queries exploiting our indexes and co-partitioning, we shouldn‘t use Hadoop++ in the first place but rather run the queries directly on Hadoop! How could we fix this problem?
=> coarse-granular indexes
=> index selection algos?
=> back to scanning?
MapReduceIntro Hadoop++
HAIL
HAIL
Hadoop Aggressive Indexing Library
[VLDB 2012a][SOCC 2011]
[int‘ patent]
We could drop the idea of using indexes: just scan everything.Well we are not gonna follow this approach.We could invest into better index selection algorithms. If we pick the wrong index, index creation is unlikely to be amortized. Therfore making the right choice is important. Therefore...Well we are not gonna follow this approach.Or: as it is expensive to create all these indexes, we better investigate coarse-granular indexes, i.e. indexes that are cheaper to construct and yet give some benefit at query time.Well we are not gonna follow this approach.
We do something different. Which brings me to...
... the third part of my talk.The approach I would like to present is coined HAIL.
HAIL means Hadoop Aggressive Indexing Library.For details see our paper:Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. Only Aggressive Elephants are Fast Elephants. VLDB 2012/PVLDB, Istanbul, Turkey. http://infosys.uni-saarland.de/publications/HAIL.pdfA predecessor of this work focussing on data layouts in HDFS is our paper:Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. ACM SOCC 2011, Cascais, Portugal. http://infosys.uni-saarland.de/publications/JQD11.pdf
HAIL
...
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
Bob
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
horizontal partitions
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
HDFS blocks64MB (default)horizontal partitions
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
1
4
7
2 3
5 6
8 9
11 22 33
So back to Bob again. Recall tath Bob wants to analyze a large file with Hadoop MapReduce.So he first has to upload his file to HDFS.In our approach we replace HDFS with HAIL. HAIL is an extension of HDFS.
As before Bob‘s file gets partiitoned into HDFS blocks...
those blocks are relatively large, at least 64MB
then those blocks get partitioned to the different datanodes (just as above)
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1
4
7
2
3
5 6
8 9
11 22
3
3
44 55 66
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
4
7
5 6
8 9
12
3
11 22
3
3
44 55 66
1 1 12 2 23
33
HAIL
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
7 8 9
1 1 12 2 23
33 456 4 455 6 6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
7 8 9
HAIL
1 1 12 2 23
33 456 4 455 6 664656 4 45 5
the HDFS blocks also get replicated (just as above)
but then, before writing the data to the local disks on the different datanodes, we do something in addition:
we sort the data on each HDFS block in main memory. Each replica is sorted using a different sort cirteria. This means after sorting each HDFS block is available in three different sort orders - roughly corresponding to three different clustered indexes.Notice that we do not redistribute data across HDFS blocks! Data that was on one particular block in standard HDFS will sit on the same HDFS block in HAIL. In other words: the different copies of the block contain the same data - yet in different sort orders.
Again, we do this for each and every copy of a block.
Notice that this is done without introducing additional I/O. We fully piggy-back on the existing HDFS upload pipeline.
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
HAIL
1 1 12 2 23
33 64656 4 45 5
7 8 977 88 99
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
HAIL
1 1 12 2 23
33 64656 4 45 5
9 9
8
8
9
87 7
7
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...
HAIL
1 1 12 2 23
33 64656 4 45 5
9 9
8
8
9
87 7
78
8879 9
9 7
7
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDNn
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN6
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN5
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN4
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN3
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN2
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
...1 1 12 2 23
33 64656 4 45 5 8
8879 9
9 7
7
Failover
HAIL
Eventually uploading (and indexing) is finished.
What does this mean for HDFS failover?
Well actually, nothing changes. All data sits on the same HDFS blocks as before. For instance, if we lose DN2 and DN6, we can still retrieve block 3 from DN6. That block might not be sorted along the desired sort criteria, but it contains all the data. And we can use the remaining block to recreate additional copies in other sort orders.
little
Details
Network
Network
01101000101110101000110110
01010101010100011010001100
Block Metadata01010010101110110010111010
C
B
A
PAX Block
01101000101110101000110110
01010101010100011010001100
Block Metadata01010010101110110010111010
C
B
A
PAX Block
Network
...
forward PCK
2
01101000101110101000110110
01010101010100011010001100
Block Metadata01010010101110110010111010
C
B
A
PAX Block
PCK 1
PCK 2
PCK 1
append
HAIL Client Datanode DN1upload
2
45
6
7
check
acknowledge
reassemble
PCK 2
PCK 1
reassemble
8
HAIL Block
10001010100010111101100110
10110001101101000101110101
Block Metadata00001001000110011110111111
C
B
A
Index MetadataIndex
00101110
HAIL Block
00010011000110111100111111
10011101110111010101101010
Block Metadata00100111011011101001101101
C
B
A
Index MetadataIndexA C
ACK 13 2 1
ACK 23 2 1
ACK 13
ACK 13 2 1
ACK 23 2forward
1
Bob
1015
13
9
buildA CB
1preprocess
build
12
HDFS Namenode
Block directory HAIL Replica directory
14registerregister
ACK 23
convert
11
notify
get location
3
OK
Datanode DN3
HAIL Upload Pipeline
HAIL Query Pipeline
MapReduce PipelineHadoop MapReduce Pipeline
HDFSHDFS
Task TrackerJob TrackerJob Client
Split Phase Scheduler Map Phase
for each split in splits { allocate split to closest DataNode storing block}
Record Reader:- Perform Index Scan- Perform Post-Filtering- for each Record invoke map(HailRecord)
sendsplits[]
allocateMap Task
chosecomputing
Node
read block42
DN3 DN4 DN5 DN6 DN7 DNnDN2
...C BA
block42 block42 block42
1
2
3
4 6
5
7 storeoutput
...@HailQuery(filter="@3 between(1999-01-01, 2000-01-01)", projection={@1})void map(Text k, HailRecord v) { output(v.getInt(1), null);}...
MapReduce JobMain Class
map(...)
reduce(...)
writeJob
runJob
Bob's Perspective System's Perspective
ii
i
HAIL Annotation
Bob
DN1
for each block in input { locations = block .getHostWithIndex(@3); splitBuilder.add(locations, block );}splits[] = splitBuilder.result;
i
i
i
Experiments
Well it is all somewhat more complex than explained in the previous example.
We play some other tricks. For instance, while reading the input file, we immediately parse the data into a binary PAX (column like) layout. The PAX data is then sent to the different datanodes. We also had to make sure to not break the involved data consistency-checks used by HDFS (packet acknowledge). In addition, we extended the namenode to record additional information on the sort orders and layouts used for the different copies of an HDFS block. The latter is needed at query time. None of these changes affects principle properties of HDFS. We just extend and piggyback.
At query time HAIL needs to know how to filter input records and which attributes to project to in intermediate result tuples. This can be solved in many ways. In our current implementation we allow users to annotate their map-function with the filter and projection conditions. However this could also be done fully transparently by using static code analysis as shown in: Michael J. Cafarella, Christopher Ré: Manimal: Relational Optimization for Data-Intensive Programs. WebDB 2010. That code analysis could be used directly with HAIL. Another option, if the map()/reduce()-functions are not produced by a user, is to adjust the application. ...
...Possible “applications“ might be Pig, Hive, Impala or any other system producing map()/reduce() programs as its output. In addition, any other application not relying on MapReduce but just relying on HDFS might use our system, i.e. applications diurectly working with HDFS files.
Upload Times
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
What happens if we upload a file to HDFS?
Let‘s start with the case that no index is created by any of the systems.Hadoop takes about 1132 seconds
Hadoop++ is considerably slower. Even though we switch off index creation here, Hadoop++ runs an extra job to convert the input data to binary. This takes a while.What about HAIL?
HAIL is faster than Hadoop HDFS. How can this be? We are doing more work than Hadoop HDFS, right? For instance, we convert the input file to binary PAX during upload directly. This can only be slower than Hadoop HDFS but NOT faster. Well, when converting to binary layout it turns out that the binary represenation of this dataset is smaller than the textual representation. Therefore we have to write less data and SAVE I/O. Therefore HAIL is faster for this dataset. This does not necessarily hold for all datasets. Notice that for this and the following experiments we are not using compression yet. We expect compression to be even more beneficial for HAIL.
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
Hadoop Hadoop++ HAIL0123
1132 3472 6715766 704
712717
0
1700
3400
5100
6800
0 1 2 3
717712704671
5766
3472
1132
Upl
oad
time
[sec
]
Number of created indexes
Hadoop Hadoop++ HAIL
all with 3 replicas
Upload Time
0
1075
2150
3225
4300
3 5 6 7 10
170012541089956717
3710
27122256
1773
1132
Upl
oad
time
[sec
]
Number of created replicas
Hadoop HAIL
(default)
Hadoop upload time with 3 replicas
Replica Scalability
So what happens if we start creating indexes? We should then feel the additional index creation effort in HAIL.
For Hadoop++ we observe long runtimes.For HAIL, in contrast to what we expected, we observe only a small increase in the upload time.
The same observation holds when creating two clustered indexes with HAIL...
...or three.This is because, standard file upload in HDFS is I/O-bound. The CPUs are mostly idle. HAIL simply exploits the unused CPU ticks that would be idling otherwise. Therefore the additional effort for indexing is hardly noticeable.
Disk space is cheap. For some situations It could be affordable to store more than three copies of an HDFS block. What would be the impact on the upload times?The next experiment shows the results...
Here we create up to 10 replicas - corresponding to 10 different clustered indexes.
We observe that in the same time HDFS uploads the data without creating any index, HAIL uploads the data, converts to binary PAX, and creates six different clustered indexes.
10 NODESSynthetic1st Run(sec)2nd Run (sec)3rd (run)AveragePavlo1st Run (sec)2nd Run (sec)3rd Run (sec)Average
50 NODESSynthetic1st Run(sec)2nd Run (sec)3rd (run)AveragePavlo1st Run (sec)2nd Run (sec)3rd Run (sec)Average
100 NODESSynthetic1st Run(sec)2nd Run (sec)3rd (run)AveragePavlo1st Run (sec)2nd Run (sec)3rd Run (sec)Average
Hadoop Hail809 608827 599845 593
827 600Hadoop Hail
1300 16781296 18661257 16821284.3333333 1742
Hadoop Hail983 720892 686878 647917.66666667 684.33333333
Hadoop Hail1874 14941612 15582022 1538
1836 1530
Hadoop Hail1227 6151014 645836 6391025.6666667 633
Hadoop Hail1582 15191527 14201319 1518
1476 1485.6666667
0
550
1100
1650
2200
Syn UV Syn UV Syn UV
1486
633
1530
684
1742
600
1476
1026
1836
918
1284
827
Upl
oad
Tim
e [s
ec]
Number of Nodes
Hadoop HAIL
10 nodes 50 nodes 100 nodes
min hadoop min hail18 7 18 8
27.333333333 64 15.666666667 12439.666666667 37.333333333 65.333333333 35.666666667
224 36 186 28189.66666667 18 201.33333333 12
157 65.666666667 106 33.333333333
Scale-Out
modulo variance, see [VLDB 2010b]
Query Times
Individual Jobs: Weblog, RecordReader
0
1000
2000
3000
4000
Bob-Q1 Bob-Q2 Bob-Q3 Bob-Q4 Bob-Q5
683333
527573
28642917
5383
277624422470
21122156
3358
RR
Run
time
[ms]
MapReduce Jobs
Hadoop Hadoop ++ HAIL
Individual Jobs: Weblog, Job
0
375
750
1125
1500
Bob-Q1 Bob-Q2 Bob-Q3 Bob-Q4 Bob-Q5
602598598598601
11451143
651705
1160 109910999421006
1094
Job
Run
time
[sec
]
MapReduce Jobs
Hadoop Hadoop++ HAIL
We also evaluated upload times on the cloud using EC2 nodes. Notice that experiments on the cloud are somewhat problematic due to the high runtime variance in those environments.For details see our paper:Jörg Schad, Jens Dittrich, Jorge-Arnulfo Quiane-RuizRuntime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance VLDB 2010/PVLDB, Singapore.http://infosys.cs.uni-saarland.de/publications/SDQ10.pdfslides: http://infosys.cs.uni-saarland.de/publications/SDQ10talk.pdf
What about query times?
Here we display the RecordReader times. They correspond roughly to the data access time, i.e. the time for further processing, which is equal in all systems - is factored out.
We observe that HAIL improves query runtimes dramatically.Hadoop resorts to full scan in all cases.Hadoop++ can benefit from its index if the query happens to hit the right filter condition.In contrast, HAIL supports many more filter conditions.
What does this mean for the overall job runtimes?
Well, those results are not so great.The benefits of HAIL over the other approaches are marginal?How come?It has to do with...
Scheduling Overhead
0
375
750
1125
1500
Bob-Q1 Bob-Q2 Bob-Q3 Bob-Q4 Bob-Q5
Job
Run
time
[sec
]
MapReduce Jobs
Hadoop Hadoop++ HAIL Overhead
HAIL Scheduling
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
BobDN1
1
6
4
7
Hadoop Scheduling
HAIL
MapReduce
5
12
11 22
23
12
2
2119
15 14
⇒ 7 map tasks (aka waves)
sort order
⇒ 7 times scheduling overhead
map(row) -> set of (ikey, value)
HDFS
DN4 DN5 DN6 DN7 DNnDN2
...BA
block42 block42
Bob's Perspective
Bob
HAIL SplitDN1
6
4
7
HAIL Scheduling
HAIL
MapReduce
5
12
11 22
12
2
2119
15 14
⇒ 1 map task (aka wave)
sort order
⇒ 1 times scheduling overhead
1
23
map(row) -> set of (ikey, value)
...the Hadoop scheduling overhead. Hadoop was designed having long running tasks in mind. Accessing indexes in HAIL, however, is in the order of milliseconds. These milliseconds of index access are overshadowed by scheduling latencies.You can try this out with a simple experiment. Write a MapReduce job that does not read any input, does not do anything, and does not produce any output. This takes about 7 seconds - for doing nothing.How could we fix this problem?By...
...introducing “HAIL scheduling“.
But let‘s first look back at standard Hadoop scheduling:
In Hadoop, each HDFS block will be processed by a different map task. This leads to waves of map tasks each having a certain overhead.However the assignment of HDFS blocks to map tasks is not fixed.Therefore, ...
...in HAIL Scheduling we assign all HDFS blocks that need to be processed on a datanode to a single map task. This is achieved by defining appropriate “splits“. see our paper for details.
The overall effect is that we only have to pay the scheduling overhead once rather than 7 times (in this example).Notice that in case of failover we can simply reschedule index access tasks - they are fast anyways. Additionally, we could combine this with the recovery techniques from RAFT, ICDE 2011.
Query Timeswith HAIL Scheduling
0
375
750
1125
1500
Bob-Q1 Bob-Q2 Bob-Q3 Bob-Q4 Bob-Q5
6522151516
11451143
651705
1160 109910999421006
1094
Job
Run
time
[sec
]
MapReduce Jobs
Hadoop Hadoop++ HAIL
Individual Jobs: Weblog
Failover
Failover
FailoverHAIL-1IdxHadoop
113 63 33 1212 661 631598 598
1099
0
375
750
1125
1500
Hadoop HAIL HAIL-1Idx
598598
1099
Job
Run
time
[sec
]
Systems
Hadoop HAIL Slowdown
10.5 % slowdown 5.5 % slowdown
10.3 % slowdown
10 nodes, one killed
What are the end-to-end query runtimes with HAIL Scheduling?
Now the good RecordReader times seen above translate to (very) good query times.
What about failover?
We use two configurations for HAIL. First, we configure HAIL to create indexes on three different attributes, one for each replica. Second, we use a variant of HAIL, coined HAIL-1Idx, where we create an index on the same attribute for all three replicas. We do so to measure the performance impact of HAIL falling back to full scan for some blocks after the node failure. This happens for any map task reading its input from the killed node. Notice that, in the case of HAIL-1Idx, all map tasks will still perform an index scan as all blocks have the same index.Overall result: HAIL inherits Hadoop MapReduce‘s failover properties.
Summary(Summary(...))
BigData=>MapReduce
Intro Hadoop++
HAIL
BigData=>
HAIL
Hadoop++
fast queryingAND
fast indexing
BigData=> Hadoop++
HAIL
of this talk
What I tried to explain to you in my talk is:
Hadoop MapReduce is THE engine for big data analytics.
By using good trojans you can improve system performance dramatically AND after the fact --- even for closed-source systems.
BigData=> Hadoop++
fast queryingAND
fast indexing
infosys.uni-saarland.de
HAIL allows you to have fast index creation AND fast query processing at the same time.
project page:http://infosys.uni-saarland.de/hadoop++.php
Copyright of alls Slides Jens Dittrich 2012
annotated slides are available on that page