Efficiently Indexing AND Querying Big Data in Hadoop MapReduce

computer science

saarlanduniversity

Jens Dittrich

EfficientlyIndexing AND QueryingBig Datain Hadoop MapReduce

MapReduceIntro

HAIL

Hadoop++

MapReduceIntro

Big Data

Copyright of all slides: Jens Dittrich 2012

This talk consists of three parts.

First Part

Big data is the new “very large“.

[Physics]

Big data is everywhere: CERN...

http://cdsweb.cern.ch/record/1295244

...moving objects indexing...

http://www.istockphoto.com/stock-video-4518244-la-trafic-a-time-lapse.php

...astronomy...

http://www.flickr.com/photos/14924974@N02/2992963984/

basically whenever you point a satellite dish up in the air, you collect tons of data

but also in...

\href{http://it.wikipedia.org/wiki/File:KSC_radio_telescope.jpg}{http://it.wikipedia.org/wiki/File:KSC\_radio\_telescope.jpg}

[Dean et al, OSDI’04]

...genomics...

http://www.istockphoto.com/stock-illustration-16136234-dna-strands.php

...social networks...

...and search engines.

They proposed a system to effectively analyze big data.

MapReduce

Big Data Tutorial[VLDB 2012b]

Semantics:

That system was coined “MapReduce“. The system is Google-proprietary.

Hadoop is the open source variant. It has a large community of developers and start-up companies.

We presented a tutorial on Big Data Processing in Hadoop MapReduce at VLDB 2012.It contains many details on data layouts, indexing, query processing and so forth.The tutorial slides are available online:http://infosys.uni-saarland.de/publications/BigDataTutorialSlides.pdf

Let‘s briefly revisit the MapReduce interface:

map(key, value) -> set of (ikey, ivalue)

reduce(ikey, set of ivalue) -> (fkey, fvalue)

Google-Use Case:

Web-Index Creation

map(key, value)->

set of (ikey, ivalue)

just two functions: map() and reduce()

Let‘s look at a concrete use-case:

This is vital for Google‘s search service you use everyday.

In this use-case the map function...

map(docID, document)->

set of (term, docID)

map(44, ´´This is text on a website!´´)->{ (``This´´, 44), (`ìs´´, 44), (``text´´, 44), (`òn´´, 44), (`à´´, 44), (``website´´, 44)}

map(42, ´´This is just another website!´´)->{ (``This´´, 42), (`ìs´´, 42), (``just´´, 42), (`ànother´´, 42), (``website´´, 42)}

map(43, ´Óne more boring website!´´)->{ (`Òne´´, 43), (``more´´, 43), (``boring´´, 43), (``website´´, 43)}

...takes a docID and a document (the contents of the document)and returns a set of (term,docID)-pairs.

For instance...

...map() will be called for document 44 with its contents “This is text on a website!“.The map()-function breaks this into pairs, one pair for each term occurring on website 44.

the same happens for document 42

and so forth

reduce(ikey, set of ivalue)->

(fkey, fvalue)

reduce(term, set of docID)->

(term, (posting list of docID, count))

reduce(``This´´, {42, 43})->(``This´´, ([42, 43], 2))

reduce(`ìs´´, {42, 43})->(`ìs´´, ([42, 43], 2))

What about reduce()?

For Web-index creation reduce() receives a term and the set of docIDs containing that term.reduce() then returns a pair of the input term and an ordered posting loist of docIDs plus a count, i.e. the number of web pages having that term.

Note: there are many variants how to do web-indexing with MapReduce. The actual semantics used by Google may differ; the core idea however is the same.

For instance: documents 42 and 43 contain “This“.reduce() simply returns an ordered posting list plus the count.

Documents 42 and 43 contain “is“.reduce() simply returns an ordered posting list plus count for this as well.

reduce(``boring´´, {43})->(``boring´´, ([43], 1))

etc.

Other Applications:

Search rec.a==42,rec.contains(``bla´´),rec.contains(0011001)

Machine Learningk-means,mahout library

Web-Analysissum of all accesses to page Y from user X

etc.

Big Data ?

map() and reduce() with

and so forth

Many things can be mapped to the map()/reduce()-interface, but not all.Think about twice before blindly using MapReduce. It is useful for many things, but not all.Many important extensions have been done in the past to support more application classes, i.e. iterative problems.

HDFS

...

HDFS

DN4 DN5 DN6 DN7 DNnDN2

...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

Bob

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

horizontal partitions

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

HDFS blocks64MB (default)horizontal partitions

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

1

4

7

2 3

5 6

8 9

11 22 33

Let‘s assume a user Bob who wants to analyze a large file.

Notice that this is a simplified explanation.For details on how Hadoop works see our paper:Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), VLDB 2010http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf

http://www.istockphoto.com/file\_closeup.php?id=591134

Bob first needs to upload his file to Hadoop‘s Distributed File System (HDFS).HDFS partitions his data into large horizontal partitions.

Those horizontal partitions are termed HDFS blocks. They are relatively large: at least 64MB up to 1GB.Do not confuse these large HDFS blocks with the typically small database pages (which are only a few KB in size).

Each HDFS block receives a unique ID.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4

7

2

3

5 6

8 9

11 22

3

3

44 55 66

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4

7

2

3 56

8 9

11 22

3

3

4 455 6 6

77 88 99

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

Failover

HDFS

The HDFS blocks get distributed and replicated over the cluster. Each HDFS block gets replicated to at least three different data nodes (DN1, ...DNn in this example).

HDFS does this for every HDFS block of the input file.

Eventually all HDFS blocks have been sent to the datanodes.

We gain nice failover properties: even if two datanodes go offline, we still have one copy of the block.

Assume that we want to retrieve HDFS block 3. We lost the copies on DN2 and DN5. However,we can still retrieve a copy of block 3 from DN6.Notice that once datanodes go offline HDFS (should) copy blocks to other nodes to get back to having three copies of each block again.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

Load Balancing

I would like to have block 4!

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

HDFS

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

HDFS

MapReduce

map(docID, document) -> set of (term, docID)

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

Map Phase

HDFS

MapReduce map(docID, document) -> set of (term, docID)

Another advantage of having three copies for each block is load balancing. Whenever a user or an application asks for a particular block, we have three options for retrieving that block. The decision which datanode to use may be made based on network locality, network congestion, and the current load of the datanodes.

So now we have our data stored in HDFS.What about MapReduce?And by “MapReduce“ I mean “Hadoop MapReduce“ in the following.

MapReduce is another software layer on top of HDFS.MapReduce consists of three phases.

In the first phase (the Map Phase) only the map-function is considered.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4 7

2

3 56

89 11 22

3

3

4 455 6 6

77

8

89

9

Map Phase

HDFS

MapReduce

M1 M2 M3 M4 M5 M6 M7


HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

HDFS


...BA

block42 block42

Bob's Perspective

Bob

...1

4 7

2

3 5

891 22

34 55 6 6

77

8

89

9

Map Phase

6

HDFS

MapReduce

1 3

4

M1 M2 M3 M4 M5 M6 M7


HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

HDFS


...BA

block42 block42

Bob's Perspective

Bob

...1

4

2

3 5

81

34 55 6 6

77 89

9

Map Phase

6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘

9 1 23

76 8

9 1 23

76 8

2

HDFS

MapReduce

4

M1 M2 M3 M4 M5 M6 M7


HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...12

45

81 2

34 455 6 6

77 89

9

Map Phase

5‘ 4‘

453

6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘

9 1 23

76 8

HDFS

MapReduce

M8 M9


MapReduce assigns a thread (AKA Mapper) at every datanode having data to be processed for this job.

Each mapper reads one of the HDFS blocks...

..and breaks that HDFS blocks into records. This can be customized with the RecordReader.For each record map() is called. The output to that file (also called intermediate results) is collected on the local disks of the datanodes. For instance, for block 6 the output is collected in file 6‘.

This is done for every block of the input file. Obviously we do not have to do this with every copy of am HDFS block. Processing one copy is enough.With this the Map Phase is finished.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Shuffle Phase

5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘

HDFS

MapReduce group by term

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Shuffle Phase

5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘

network

HDFS


HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Shuffle Phase

5‘ 4‘6‘ 9‘ 8‘ 1‘ 3‘ 2‘ 7‘

networknetwork

HDFS


HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Shuffle Phase

A-B C-D E-F G-H I-J K-L W-Z

networknetwork

HDFS


Now, the Shuffle Phase starts.

In the Shuffle Phase all intermediate results are redistributed over the different datanodes. In this example we want to redistribute the intermediate results based on the term, i.e. we want to group all intermediate results by term.

This means, after shuffling, we obtain a range partitioning on terms....

For instance, DN1 contains all intermediate results having terms starting with A or B. In turn, DN2 has only terms starting with C or D and so forth.Once all data has been redistributed, the Shuffle Phase is finished.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Reduce Phase

A-B C-D E-F G-H I-J K-L W-ZA-B C-D E-F G-H I-J K-L W-Z

HDFS

MapReduce

reduce(term, set of docID) -> set of (term, (posting list of docID, count))

R1 R2 R3 R4 R5 R6 Rn

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

Reduce Phase


A-B‘ C-D‘ E-F‘ G-H‘ I-J‘ K-L‘ W-Z‘


HDFS

MapReduce reduce(term, set of docID) -> set of (term, (posting list of docID, count))

R1 R2 R3 R4 R5 R6 Rn

HadoopMapReduceAdvantages

failoverscalability

schema-later

ease of use

In the final Reduce Phase, MapReduce assigns threads to the different datanodes having intermediate results. These threads are termed reducers. The reducers read the intermediate results and for each distinct key they call reduce(). Notice that reduce() may only be called once for each distinct key on the entire cluster. Otherwise the semantics of the map()/reduce()-paradigm would be broken.

The output of the reduce()-calls is stored on disk again. In this example this is visualized to store the output on local disk. However, Hadoop stores the output on HDFS by default, i.e. the output of the Reduce Phase gets replicated by HDFS again.

When to replicate which data for fault tolerance in MapReduce is an interesting discussion. See our paper RAFT for more details:http://infosys.uni-saarland.de/publications/QPSD11.pdf

HadoopMapReduce Disadvantages

Performance

MapReduceIntro Hadoop++

HAIL

Hadoop++

And this and how to fix it is what the following material is about.

Second Part

[VLDB 2010a]

VLDB 2010, Sep 15, Singapore

The “Map Reduce Plan“ partition load map reduce

fits well with the simplicity philosophy of Hadoop.Hadoop++ changes the internal layout of a split — a large hori-

zontal partition of the data — and/or feeds Hadoop with appropri-ate UDFs. However, Hadoop++ does not change anything in theHadoop framework.

Contributions. In this paper we make the following contributions:

(1.) The Hadoop Plan. We demonstrate that Hadoop is noth-ing but a hard-coded, operator-free, physical query execution planwhere ten User Defined Functions block, split, itemize, mem,map, sh, cmp, grp, combine, and reduce are injected at pre-determined places. We make Hadoop’s hard-coded query process-ing pipeline explicit and represent it as a DB-style physical queryexecution plan (The Hadoop Plan). As a consequence, we are thenable to reason on that plan. (Section 2)(2.) Trojan Index. We provide a non-invasive, DBMS-independent indexing technique coined Trojan Index. A TrojanIndex enriches logical input splits by bulkloaded read-optimizedindexes. Trojan Indexes are created at data load time and thus haveno penalty at query time. Notice, that in contrast to HadoopDB weneither change nor replace the Hadoop framework at all to integrateour index, i.e. the Hadoop framework is not aware of the Trojan In-dex. We achieve this by providing appropriate UDFs. (Section 3)(3.) Trojan Join. We provide a non-invasive, DBMS-independentjoin technique coined Trojan Join. Trojan join allows us to co-partition the data at data load time. Similarly to Trojan Indexes,Trojan Joins do neither require a DBMS nor SQL to do so. Tro-jan Index and Trojan Join may be combined to create arbitrarilyindexed and co-partitioned data inside the same split. (Section 4)(4.) Experimental comparison. To provide a fair experimen-tal comparison, we implemented all Trojan-techniques on top ofHadoop and coin the result Hadoop++. We benchmark Hadoop++against Hadoop as well as HadoopDB as proposed at VLDB2009 [3]. As in [3] we used the benchmark from SIGMOD2009 [16]. All experiments are run on Amazon’s EC2 Cloud.Our results confirm that Hadoop++ outperforms Hadoop and evenHadoopDB for index and join-based tasks. (Section 5)

In addition, Appendix A illustrates processing strategies systemsfor analytical data processing. Appendix B shows that MapReduceand DBMS have the same expressiveness, i.e. any MapReduce taskmay be run on a DBMS and vice-versa. Appendices C to E containadditional details and results of our experimental study.

2. HADOOP AS A PHYSICAL QUERY EX-ECUTION PLAN

In this section we examine how Hadoop computes a MapReducetask. We have analyzed Yahoo!’s Hadoop version 0.19. Note thatHadoop uses a hard-coded execution pipeline. No operator-modelis used. However Hadoop’s query execution strategy may be ex-pressed as a physical operator DAG. To our knowledge, this paperis the first to do so in that detail and we term it The Hadoop Plan.Based on this we then reason on The Hadoop Plan.

2.1 The Hadoop PlanThe Hadoop Plan is shaped by three user-defined parameters

M, R, and P setting the number of mappers, reducers, and datanodes, respectively [10]. An example for a plan with four map-pers (M = 4), two reducers (R = 2), and four data nodes (P = 4)is shown in Figure 1. We observe, that The Hadoop Plan consistsof a subplan L ( ) and P subplans H1–H4 ( ) which correspondto the inital load phase (HDFS) into Hadoop’s distributed file sys-

T

L PPartblock

Replicate Replicate Replicate Replicate Replicate Replicate

T1 T6

H1 Fetch Fetch Fetch

Store Store Store

Scan Scan

PPartsplit PPartsplit

H2

. . .

H3 Fetch Fetch Fetch

Store Store Store

Scan

PPartsplit

H4

. . .

M1 Union

RecReaditemize

MMapmap

PPartmem

LPartsh LPartsh LPartsh

Sortcmp Sortcmp Sortcmp

SortGrpgrp SortGrpgrp SortGrpgrp

MMapcombine MMapcombine MMapcombine

Store Store Store

Mergecmp

SortGrpgrp

MMapcombine

Store

PPartsh

M2

. . .

M3

RecReaditemize

MMapmap

LPartsh

Sortcmp

SortGrpgrp

MMapcombine

Store

PPartsh

M4

. . .

T1 T5 T2 T4 T3 T6

Fetch Fetch Fetch Fetch

Buffer Buffer Buffer Buffer

Store Store Merge

Store

Mergecmp

SortGrpgrp

MMapreduce

R1 Store

T �1

. . .

R2

T �2

Dat

aLo

adPh

ase

Map

Phas

eSh

u⇥e

Phas

eR

educ

ePh

ase

Figure 1: The Hadoop Plan: Hadoop’s processing pipeline ex-pressed as a physical query execution plan

tem. M determines the number of mapper subplans ( ), whereas Rdetermines the number of reducer subplans ( ).

Let’s analyze The Hadoop Plan in more detail:Data Load Phase. To be able to run a MapReduce job, we firstload the data into the distributed file system. This is done by par-titioning the input T horizontally into disjoint subsets T1, . . . ,Tb.See the physical partitioning operator PPart in subplan L. In theexample b = 6, i.e. we obtain subsets T1, . . . ,T6. These subsetsare called blocks. The partitioning function block partitions theinput T based on the block size. Each block is then replicated(Replicate). The default number of replicas used by Hadoop is 3,but this may be configured. For presentation reasons, in the exam-ple we replicate each block only once. The figure shows 4 di�erentdata nodes with subplans H1–H4. Replicas are stored on di�erentnodes in the network (Fetch and Store). Hadoop tries to storereplicas of the same block on di�erent nodes.Map Phase. In the map phase each map subplan M1–M4 reads asubset of the data called a split2 from HDFS. A split is a logicalconcept typically comprising one or more blocks. This assignmentis defined by UDF split. In the example, the split assigned to M1consists of two blocks which may both be retrieved from subplanH1. Subplan M1 unions the input blocks T1 and T5 and breaks theminto records (RecRead). The latter operator uses a UDF itemizethat defines how a split is divided into items. Then subplan M1calls map on each item and passes the output to a PPart opera-tor. This operator divides the output into so-called spills based

2Not to be confused with spills. See below. We use the terminologyintroduced in [10].

§ The “MapReduce Plan“ has 10 user-definedfunctions (UDFs):

! block! split! itemize! mem! map! sh! cmp! grp! combine! reduce

Data Load Phase

Map Phase

Shuffle Phase

Reduce Phase

Good Trojans!

Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg SchadHadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)VLDB 2010/PVLDB, Singaporehttp://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdfslides:http://infosys.cs.uni-saarland.de/publications/DQJ+10talk.pdf

figure shows example with 4 mappers and 2 reducersHadoop MapReduce uses a hard-coded pipeline. This pipeline cannot be changed. This is in sharp contrast to database systems which may use different pipelines for different queries.

However, Hadoop Map Reduce uses 10 user-defined functions (UDFs). Theses UDFs can be used to inject arbitrary code into Hadoop...

...including code that was not intended to be injected into Hadoop.

Our idea is somewhat similar to a trojan horse or a trojan, i.e. a virus that is injected into a computer system to harm or destory the system. However, we inject trojans to improve or heal the system.

Therefore we are calling them...

http://www.istockphoto.com/stock-photo-1824642-trojan-horse.php?st=0bab152

Selection Task

0

20

40

60

80

100

120

140

10 nodes 50 nodes 100 nodes

runt

ime

[sec

onds

]

HadoopHadoopDB

HadoopDB Chunks

Hadoop++(256MB)Hadoop++(1GB)

Join Task

0

500

1000

1500

2000

2500


runt

ime

[sec

onds

]

HadoopHadoopDB


1Good Trojans in a DBMS?...ROW TROJAN SQL VECTORWISE VECTORWISE2 VERTICA (other machine) Factor

Q1

Q6

Q12

Q14

76.730296 19.293983 22.0 31.276143 6.0 41.7 1.1 3.277.589034 8.6532381 16.0 25.845965 4.6 11.4 1.8 1.992.486038 37.331905 33.0 29.785149 4.3 15.3 0.9 8.781.207649 30.788114 28.0 22.291128 5.0 13.9 0.9 6.2

0

25

50

75

100

Q1 Q6 Q12 Q14

Que

ry T

ime

(sec

)

Standard Row Trojan ColumnsDBMS-Y DBMS-Z (a)DBMS-Z (b)

[CIDR 2013a]

Good Trojans Versus Closed Source Column Store DBMS

results taken from our VLDB 2010-paper

Hadoop++ is in the same ballpark or even faster than HadoopDB (now spun-off as Hadapt)

Hadoop++ is up to a factor 18 faster than Hadoop

...even though we do not modify the underlying HDFS and Hadoop MapReduce source code at all!Our improvements are all done through UDFs only.

But wait: UDFs are also available in traditional database systems. What happens if we exploit those UDFs to inject “better technology“ into an existing database system?

Say we inject column store technology into a commercial, closed-source row store. How would that look like?

It looks like this. For details see our paper: Alekh Jindal, Felix Martin Schuhknecht, Jens Dittrich, Karen Khachatryan, Alexander Bunte. How Achaeans Would Construct Columns in Troy. CIDR 2013, Asilomar, USA. http://infosys.uni-saarland.de/publications/How%20Achaeans%20Would%20Construct%20Columns%20in%20Troy.pdf You can get much faster than a row store. We are not as fast as a from scratch implementation of a column store. This has to do with the different QP technology used. However for some customers the performance of a native column store might not even be required - especially for medium-sized datasets. ...

Q Standard Row

Trojan Columns

Projected View

Standard Row

Projected View

Trojan Columns

Table

Q1Q6Q12Q14Q3Q5Q10_oQ10_lQ19Q2Q4Q8Q15

76.730295686 19.293982609 27.42381835 3.9769029152 1.4213663869 230.19088706 82.27146 57.881947828 lineitem77.589033799 8.6532381493 21.787439044 8.9664738749 2.5178365218 232.7671014 65.36232 25.959714448 lineitem76.73051647 16.504629093 26.582446376 4.6490300412 1.610605499 230.19154941 79.747339128 49.513887278 lineitem

76.333567335 25.592795096 20.15226419 2.9826194071 0.7874194325 229.00070201 60.456792571 76.778385288 lineitem0 0 0 5.1437565596 lineitem0 0 0 order0 0 0 order0 0 0 lineitem0 0 0

2.935810086 1.9092920003 1.816512607 8.807430258 5.449537821 5.727876001 part15.286783871 15.13420846 5.1889663903 45.860351612 15.566899171 45.402625379 order

0 0 0 part0 0 0 lineitem

0

25

50

75

100

Q1 Q6 Q12 Q14

Que

ry T

ime

(sec

)

Standard Row Trojan ColumnsMaterialized View

Good Trojans in a Closed Source Row Store DBMS

[CIDR 2013a]

1...Good Trojans in a DBMS!

Problems:

Upload-Times

0

10000

20000

30000

40000

50000


runt

ime

[sec

onds

]

HadoopHadoopDB


(I)ndex Creation(C)o-PartitioningData (L)oading

L

C

I

L

C

I

L

C

I

L

C

I

L

C

I

L

C

I

... Why buy a car with 1000hp if 200hp are just enough?

An interesting result from our work is also that we can beat materialized views for some queries.

With this let‘s close this footnote and go back to Hadoop MapReduce performance.UDFs in Hadoop allow us to boost query performance without changing the underlying system.However, what are the...

from Hadoop++ paper: The problem is that in order to have fast queries we first have to “massage“ the data before, i.e. create indexes, co-partition and so forth. This takes time. Hadoop does not have to spend this time and therefore uplloading the data to HDFS is fast. In contrast, for Hadoop++ (but als HadoopDB) we also have to do a lot of extra work. This extra work is very costly. So costly that only after running many queries these investments are amortized. In other words, if we only want to run a few queries exploiting our indexes and co-partitioning, we shouldn‘t use Hadoop++ in the first place but rather run the queries directly on Hadoop! How could we fix this problem?

=> coarse-granular indexes

=> index selection algos?

=> back to scanning?

MapReduceIntro Hadoop++

HAIL

HAIL

Hadoop Aggressive Indexing Library

[VLDB 2012a][SOCC 2011]

[int‘ patent]

We could drop the idea of using indexes: just scan everything.Well we are not gonna follow this approach.We could invest into better index selection algorithms. If we pick the wrong index, index creation is unlikely to be amortized. Therfore making the right choice is important. Therefore...Well we are not gonna follow this approach.Or: as it is expensive to create all these indexes, we better investigate coarse-granular indexes, i.e. indexes that are cheaper to construct and yet give some benefit at query time.Well we are not gonna follow this approach.

We do something different. Which brings me to...

... the third part of my talk.The approach I would like to present is coined HAIL.

HAIL means Hadoop Aggressive Indexing Library.For details see our paper:Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. Only Aggressive Elephants are Fast Elephants. VLDB 2012/PVLDB, Istanbul, Turkey. http://infosys.uni-saarland.de/publications/HAIL.pdfA predecessor of this work focussing on data layouts in HDFS is our paper:Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. ACM SOCC 2011, Cascais, Portugal. http://infosys.uni-saarland.de/publications/JQD11.pdf

HAIL

...

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

Bob

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

horizontal partitions

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

HDFS blocks64MB (default)horizontal partitions

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

1

4

7

2 3

5 6

8 9

11 22 33

So back to Bob again. Recall tath Bob wants to analyze a large file with Hadoop MapReduce.So he first has to upload his file to HDFS.In our approach we replace HDFS with HAIL. HAIL is an extension of HDFS.

As before Bob‘s file gets partiitoned into HDFS blocks...

those blocks are relatively large, at least 64MB

then those blocks get partitioned to the different datanodes (just as above)

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1

4

7

2

3

5 6

8 9

11 22

3

3

44 55 66

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

4

7

5 6

8 9

12

3

11 22

3

3

44 55 66

1 1 12 2 23

33

HAIL

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

7 8 9

1 1 12 2 23

33 456 4 455 6 6

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

7 8 9

HAIL

1 1 12 2 23

33 456 4 455 6 664656 4 45 5

the HDFS blocks also get replicated (just as above)

but then, before writing the data to the local disks on the different datanodes, we do something in addition:

we sort the data on each HDFS block in main memory. Each replica is sorted using a different sort cirteria. This means after sorting each HDFS block is available in three different sort orders - roughly corresponding to three different clustered indexes.Notice that we do not redistribute data across HDFS blocks! Data that was on one particular block in standard HDFS will sit on the same HDFS block in HAIL. In other words: the different copies of the block contain the same data - yet in different sort orders.

Again, we do this for each and every copy of a block.

Notice that this is done without introducing additional I/O. We fully piggy-back on the existing HDFS upload pipeline.

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

HAIL

1 1 12 2 23

33 64656 4 45 5

7 8 977 88 99

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

HAIL

1 1 12 2 23

33 64656 4 45 5

9 9

8

8

9

87 7

7

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...

HAIL

1 1 12 2 23

33 64656 4 45 5

9 9

8

8

9

87 7

78

8879 9

9 7

7

HDFS


...BA

block42 block42

Bob's Perspective

BobDNn

HDFS


...BA

block42 block42

Bob's Perspective

BobDN6

HDFS


...BA

block42 block42

Bob's Perspective

BobDN5

HDFS


...BA

block42 block42

Bob's Perspective

BobDN4

HDFS


...BA

block42 block42

Bob's Perspective

BobDN3

HDFS


...BA

block42 block42

Bob's Perspective

BobDN2

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

...1 1 12 2 23

33 64656 4 45 5 8

8879 9

9 7

7

Failover

HAIL

Eventually uploading (and indexing) is finished.

What does this mean for HDFS failover?

Well actually, nothing changes. All data sits on the same HDFS blocks as before. For instance, if we lose DN2 and DN6, we can still retrieve block 3 from DN6. That block might not be sorted along the desired sort criteria, but it contains all the data. And we can use the remaining block to recreate additional copies in other sort orders.

little

Details

Network

Network

01101000101110101000110110

01010101010100011010001100

Block Metadata01010010101110110010111010

C

B

A

PAX Block

01101000101110101000110110

01010101010100011010001100

Block Metadata01010010101110110010111010

C

B

A

PAX Block

Network

...

forward PCK

2

01101000101110101000110110

01010101010100011010001100

Block Metadata01010010101110110010111010

C

B

A

PAX Block

PCK 1

PCK 2

PCK 1

append

HAIL Client Datanode DN1upload

2

45

6

7

check

acknowledge

reassemble

PCK 2

PCK 1

reassemble

8

HAIL Block

10001010100010111101100110

10110001101101000101110101

Block Metadata00001001000110011110111111

C

B

A

Index MetadataIndex

00101110

HAIL Block

00010011000110111100111111

10011101110111010101101010

Block Metadata00100111011011101001101101

C

B

A

Index MetadataIndexA C

ACK 13 2 1

ACK 23 2 1

ACK 13

ACK 13 2 1

ACK 23 2forward

1

Bob

1015

13

9

buildA CB

1preprocess

build

12

HDFS Namenode

Block directory HAIL Replica directory

14registerregister

ACK 23

convert

11

notify

get location

3

OK

Datanode DN3

HAIL Upload Pipeline

HAIL Query Pipeline

MapReduce PipelineHadoop MapReduce Pipeline

HDFSHDFS

Task TrackerJob TrackerJob Client

Split Phase Scheduler Map Phase

for each split in splits { allocate split to closest DataNode storing block}

Record Reader:- Perform Index Scan- Perform Post-Filtering- for each Record invoke map(HailRecord)

sendsplits[]

allocateMap Task

chosecomputing

Node

read block42

DN3 DN4 DN5 DN6 DN7 DNnDN2

...C BA

block42 block42 block42

1

2

3

4 6

5

7 storeoutput

...@HailQuery(filter="@3 between(1999-01-01, 2000-01-01)", projection={@1})void map(Text k, HailRecord v) { output(v.getInt(1), null);}...

MapReduce JobMain Class

map(...)

reduce(...)

writeJob

runJob

Bob's Perspective System's Perspective

ii

i

HAIL Annotation

Bob

DN1

for each block in input { locations = block .getHostWithIndex(@3); splitBuilder.add(locations, block );}splits[] = splitBuilder.result;

i

i

i

Experiments

Well it is all somewhat more complex than explained in the previous example.

We play some other tricks. For instance, while reading the input file, we immediately parse the data into a binary PAX (column like) layout. The PAX data is then sent to the different datanodes. We also had to make sure to not break the involved data consistency-checks used by HDFS (packet acknowledge). In addition, we extended the namenode to record additional information on the sort orders and layouts used for the different copies of an HDFS block. The latter is needed at query time. None of these changes affects principle properties of HDFS. We just extend and piggyback.

At query time HAIL needs to know how to filter input records and which attributes to project to in intermediate result tuples. This can be solved in many ways. In our current implementation we allow users to annotate their map-function with the filter and projection conditions. However this could also be done fully transparently by using static code analysis as shown in: Michael J. Cafarella, Christopher Ré: Manimal: Relational Optimization for Data-Intensive Programs. WebDB 2010. That code analysis could be used directly with HAIL. Another option, if the map()/reduce()-functions are not produced by a user, is to adjust the application. ...

...Possible “applications“ might be Pig, Hive, Impala or any other system producing map()/reduce() programs as its output. In addition, any other application not relying on MapReduce but just relying on HDFS might use our system, i.e. applications diurectly working with HDFS files.

Upload Times

Hadoop Hadoop++ HAIL0123

1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]

Number of created indexes

Hadoop Hadoop++ HAIL

all with 3 replicas

Upload Time


1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]



all with 3 replicas

Upload Time


1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]



all with 3 replicas

Upload Time

What happens if we upload a file to HDFS?

Let‘s start with the case that no index is created by any of the systems.Hadoop takes about 1132 seconds

Hadoop++ is considerably slower. Even though we switch off index creation here, Hadoop++ runs an extra job to convert the input data to binary. This takes a while.What about HAIL?

HAIL is faster than Hadoop HDFS. How can this be? We are doing more work than Hadoop HDFS, right? For instance, we convert the input file to binary PAX during upload directly. This can only be slower than Hadoop HDFS but NOT faster. Well, when converting to binary layout it turns out that the binary represenation of this dataset is smaller than the textual representation. Therefore we have to write less data and SAVE I/O. Therefore HAIL is faster for this dataset. This does not necessarily hold for all datasets. Notice that for this and the following experiments we are not using compression yet. We expect compression to be even more beneficial for HAIL.


1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]



all with 3 replicas

Upload Time


1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]



all with 3 replicas

Upload Time


1132 3472 6715766 704

712717

0

1700

3400

5100

6800

0 1 2 3

717712704671

5766

3472

1132

Upl

oad

time

[sec

]



all with 3 replicas

Upload Time

0

1075

2150

3225

4300

3 5 6 7 10

170012541089956717

3710

27122256

1773

1132

Upl

oad

time

[sec

]

Number of created replicas

Hadoop HAIL

(default)

Hadoop upload time with 3 replicas

Replica Scalability

So what happens if we start creating indexes? We should then feel the additional index creation effort in HAIL.

For Hadoop++ we observe long runtimes.For HAIL, in contrast to what we expected, we observe only a small increase in the upload time.

The same observation holds when creating two clustered indexes with HAIL...

...or three.This is because, standard file upload in HDFS is I/O-bound. The CPUs are mostly idle. HAIL simply exploits the unused CPU ticks that would be idling otherwise. Therefore the additional effort for indexing is hardly noticeable.

Disk space is cheap. For some situations It could be affordable to store more than three copies of an HDFS block. What would be the impact on the upload times?The next experiment shows the results...

Here we create up to 10 replicas - corresponding to 10 different clustered indexes.

We observe that in the same time HDFS uploads the data without creating any index, HAIL uploads the data, converts to binary PAX, and creates six different clustered indexes.

10 NODESSynthetic1st Run(sec)2nd Run (sec)3rd (run)AveragePavlo1st Run (sec)2nd Run (sec)3rd Run (sec)Average



Hadoop Hail809 608827 599845 593

827 600Hadoop Hail

1300 16781296 18661257 16821284.3333333 1742

Hadoop Hail983 720892 686878 647917.66666667 684.33333333

Hadoop Hail1874 14941612 15582022 1538

1836 1530

Hadoop Hail1227 6151014 645836 6391025.6666667 633

Hadoop Hail1582 15191527 14201319 1518

1476 1485.6666667

0

550

1100

1650

2200

Syn UV Syn UV Syn UV

1486

633

1530

684

1742

600

1476

1026

1836

918

1284

827

Upl

oad

Tim

e [s

ec]

Number of Nodes

Hadoop HAIL


min hadoop min hail18 7 18 8

27.333333333 64 15.666666667 12439.666666667 37.333333333 65.333333333 35.666666667

224 36 186 28189.66666667 18 201.33333333 12

157 65.666666667 106 33.333333333

Scale-Out

modulo variance, see [VLDB 2010b]

Query Times

Individual Jobs: Weblog, RecordReader

0

1000

2000

3000

4000

Bob-Q1 Bob-Q2 Bob-Q3 Bob-Q4 Bob-Q5

683333

527573

28642917

5383

277624422470

21122156

3358

RR

Run

time

[ms]

MapReduce Jobs

Hadoop Hadoop ++ HAIL

Individual Jobs: Weblog, Job

0

375

750

1125

1500


602598598598601

11451143

651705

1160 109910999421006

1094

Job

Run

time

[sec

]

MapReduce Jobs


We also evaluated upload times on the cloud using EC2 nodes. Notice that experiments on the cloud are somewhat problematic due to the high runtime variance in those environments.For details see our paper:Jörg Schad, Jens Dittrich, Jorge-Arnulfo Quiane-RuizRuntime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance VLDB 2010/PVLDB, Singapore.http://infosys.cs.uni-saarland.de/publications/SDQ10.pdfslides: http://infosys.cs.uni-saarland.de/publications/SDQ10talk.pdf

What about query times?

Here we display the RecordReader times. They correspond roughly to the data access time, i.e. the time for further processing, which is equal in all systems - is factored out.

We observe that HAIL improves query runtimes dramatically.Hadoop resorts to full scan in all cases.Hadoop++ can benefit from its index if the query happens to hit the right filter condition.In contrast, HAIL supports many more filter conditions.

What does this mean for the overall job runtimes?

Well, those results are not so great.The benefits of HAIL over the other approaches are marginal?How come?It has to do with...

Scheduling Overhead

0

375

750

1125

1500


Job

Run

time

[sec

]

MapReduce Jobs

Hadoop Hadoop++ HAIL Overhead

HAIL Scheduling

HDFS


...BA

block42 block42

Bob's Perspective

BobDN1

1

6

4

7

Hadoop Scheduling

HAIL

MapReduce

5

12

11 22

23

12

2

2119

15 14

⇒ 7 map tasks (aka waves)

sort order

⇒ 7 times scheduling overhead

map(row) -> set of (ikey, value)

HDFS


...BA

block42 block42

Bob's Perspective

Bob

HAIL SplitDN1

6

4

7

HAIL Scheduling

HAIL

MapReduce

5

12

11 22

12

2

2119

15 14

⇒ 1 map task (aka wave)

sort order

⇒ 1 times scheduling overhead

1

23

map(row) -> set of (ikey, value)

...the Hadoop scheduling overhead. Hadoop was designed having long running tasks in mind. Accessing indexes in HAIL, however, is in the order of milliseconds. These milliseconds of index access are overshadowed by scheduling latencies.You can try this out with a simple experiment. Write a MapReduce job that does not read any input, does not do anything, and does not produce any output. This takes about 7 seconds - for doing nothing.How could we fix this problem?By...

...introducing “HAIL scheduling“.

But let‘s first look back at standard Hadoop scheduling:

In Hadoop, each HDFS block will be processed by a different map task. This leads to waves of map tasks each having a certain overhead.However the assignment of HDFS blocks to map tasks is not fixed.Therefore, ...

...in HAIL Scheduling we assign all HDFS blocks that need to be processed on a datanode to a single map task. This is achieved by defining appropriate “splits“. see our paper for details.

The overall effect is that we only have to pay the scheduling overhead once rather than 7 times (in this example).Notice that in case of failover we can simply reschedule index access tasks - they are fast anyways. Additionally, we could combine this with the recovery techniques from RAFT, ICDE 2011.

Query Timeswith HAIL Scheduling

0

375

750

1125

1500


6522151516

11451143

651705

1160 109910999421006

1094

Job

Run

time

[sec

]

MapReduce Jobs


Individual Jobs: Weblog

Failover

Failover

FailoverHAIL-1IdxHadoop

113 63 33 1212 661 631598 598

1099

0

375

750

1125

1500

Hadoop HAIL HAIL-1Idx

598598

1099

Job

Run

time

[sec

]

Systems

Hadoop HAIL Slowdown

10.5 % slowdown 5.5 % slowdown

10.3 % slowdown

10 nodes, one killed

What are the end-to-end query runtimes with HAIL Scheduling?

Now the good RecordReader times seen above translate to (very) good query times.

What about failover?

We use two configurations for HAIL. First, we configure HAIL to create indexes on three different attributes, one for each replica. Second, we use a variant of HAIL, coined HAIL-1Idx, where we create an index on the same attribute for all three replicas. We do so to measure the performance impact of HAIL falling back to full scan for some blocks after the node failure. This happens for any map task reading its input from the killed node. Notice that, in the case of HAIL-1Idx, all map tasks will still perform an index scan as all blocks have the same index.Overall result: HAIL inherits Hadoop MapReduce‘s failover properties.

Summary(Summary(...))

BigData=>MapReduce

Intro Hadoop++

HAIL

BigData=>

HAIL

Hadoop++

fast queryingAND

fast indexing

BigData=> Hadoop++

HAIL

of this talk

What I tried to explain to you in my talk is:

Hadoop MapReduce is THE engine for big data analytics.

By using good trojans you can improve system performance dramatically AND after the fact --- even for closed-source systems.

BigData=> Hadoop++

fast queryingAND

fast indexing

infosys.uni-saarland.de

HAIL allows you to have fast index creation AND fast query processing at the same time.

project page:http://infosys.uni-saarland.de/hadoop++.php

Copyright of alls Slides Jens Dittrich 2012

annotated slides are available on that page