
Architectural Overview of MapR's Apache Hadoop Distribution


DESCRIPTION

Describes the thinking behind MapR's architecture. MapR's Hadoop achieves better reliability on commodity hardware than anything else on the planet, including custom, proprietary hardware from other vendors. Apache HDFS and Cassandra replication are also discussed, as are SAN and NAS storage systems such as NetApp and EMC.


Page 1: Architectural Overview of MapR's Apache Hadoop Distribution

1

An Architectural Overview of MapR M. C. Srivas, CTO/Founder

Page 2: Architectural Overview of MapR's Apache Hadoop Distribution

2

100% Apache Hadoop

With significant enterprise-grade enhancements

Comprehensive management

Industry-standard interfaces

Higher performance

MapR Distribution for Apache Hadoop

Page 3: Architectural Overview of MapR's Apache Hadoop Distribution

3

MapR: Lights Out Data Center Ready

Reliable Compute

• Automated stateful failover

• Automated re-replication

• Self-healing from HW and SW failures

• Load balancing

• Rolling upgrades

• No lost jobs or data

• 99.999% uptime (five nines)

Dependable Storage

• Business continuity with snapshots and mirrors

• Recover to a point in time

• End-to-end checksumming

• Strong consistency

• Built-in compression

• Mirror between two sites by RTO policy

Page 4: Architectural Overview of MapR's Apache Hadoop Distribution

4

MapR does MapReduce (fast)

TeraSort Record: 1 TB in 54 seconds (1003 nodes)

MinuteSort Record: 1.5 TB in 59 seconds (2103 nodes)

Page 5: Architectural Overview of MapR's Apache Hadoop Distribution

5

MapR does MapReduce (faster)

TeraSort Record: 1 TB in 54 seconds (1003 nodes)

MinuteSort Record: 1.5 TB in 59 seconds (2103 nodes)


Page 6: Architectural Overview of MapR's Apache Hadoop Distribution

6

Google chose MapR to provide Hadoop on Google Compute Engine

The Cloud Leaders Pick MapR

Amazon EMR is the largest Hadoop provider in revenue and # of clusters

Deploying OpenStack? MapR partnership with Canonical and Mirantis on OpenStack support.

Page 7: Architectural Overview of MapR's Apache Hadoop Distribution

7

1. Make the storage reliable – Recover from disk and node failures

How to make a cluster reliable?

Page 8: Architectural Overview of MapR's Apache Hadoop Distribution

8

1. Make the storage reliable – Recover from disk and node failures

2. Make services reliable – Services need to checkpoint their state rapidly

– Restart failed service, possibly on another node

– Move check-pointed state to restarted service, using (1) above

How to make a cluster reliable?

Page 9: Architectural Overview of MapR's Apache Hadoop Distribution

9

1. Make the storage reliable – Recover from disk and node failures

2. Make services reliable – Services need to checkpoint their state rapidly

– Restart failed service, possibly on another node

– Move check-pointed state to restarted service, using (1) above

3. Do it fast – Instant-on … (1) and (2) must happen very, very fast

– Without maintenance windows

• No compactions (e.g., Cassandra, Apache HBase)

• No “anti-entropy” that periodically wipes out the cluster (e.g., Cassandra)

How to make a cluster reliable?

Page 10: Architectural Overview of MapR's Apache Hadoop Distribution

10

No NVRAM

Reliability With Commodity Hardware

Page 11: Architectural Overview of MapR's Apache Hadoop Distribution

11

No NVRAM

Cannot assume special connectivity – no separate data paths for "online" vs. replica traffic

Reliability With Commodity Hardware

Page 12: Architectural Overview of MapR's Apache Hadoop Distribution

12

No NVRAM

Cannot assume special connectivity – no separate data paths for "online" vs. replica traffic

Cannot even assume more than 1 drive per node – no RAID possible

Reliability With Commodity Hardware

Page 13: Architectural Overview of MapR's Apache Hadoop Distribution

13

No NVRAM

Cannot assume special connectivity – no separate data paths for "online" vs. replica traffic

Cannot even assume more than 1 drive per node – no RAID possible

Use replication, but … – cannot assume peers have equal drive sizes

– what if the drive on one machine is 10x larger than the drive on another?

Reliability With Commodity Hardware

Page 14: Architectural Overview of MapR's Apache Hadoop Distribution

14

No NVRAM

Cannot assume special connectivity – no separate data paths for "online" vs. replica traffic

Cannot even assume more than 1 drive per node – no RAID possible

Use replication, but … – cannot assume peers have equal drive sizes

– what if the drive on one machine is 10x larger than the drive on another?

No choice but to replicate for reliability

Reliability With Commodity Hardware

Page 15: Architectural Overview of MapR's Apache Hadoop Distribution

15

Replication is easy, right? All we have to do is send the same bits to the master and replica.

Reliability via Replication

(Diagrams: clients writing to a primary server with a replica. Left: normal replication, where the primary forwards writes to the replica. Right: Cassandra-style replication.)

Page 16: Architectural Overview of MapR's Apache Hadoop Distribution

16

When the replica comes back, it is stale – it must be brought up to date

– until then, it is exposed to failure

But crashes occur…

(Diagrams: clients, primary server, replica. Left: the primary re-syncs the replica. Right: the replica remains stale until an "anti-entropy" process is kicked off by the administrator.)

Who re-syncs?

Page 17: Architectural Overview of MapR's Apache Hadoop Distribution

17

HDFS solves the problem a third way

Unless it's Apache HDFS …

Page 18: Architectural Overview of MapR's Apache Hadoop Distribution

18

HDFS solves the problem a third way

Make everything read-only – Nothing to re-sync

Unless it's Apache HDFS …

Page 19: Architectural Overview of MapR's Apache Hadoop Distribution

19

HDFS solves the problem a third way

Make everything read-only – Nothing to re-sync

Single writer, no reads allowed while writing

Unless it's Apache HDFS …

Page 20: Architectural Overview of MapR's Apache Hadoop Distribution

20

HDFS solves the problem a third way

Make everything read-only – Nothing to re-sync

Single writer, no reads allowed while writing

File close is the transaction that allows readers to see data – unclosed files are lost

– cannot write any further to closed file

Unless it's Apache HDFS …

Page 21: Architectural Overview of MapR's Apache Hadoop Distribution

21

HDFS solves the problem a third way

Make everything read-only – Nothing to re-sync

Single writer, no reads allowed while writing

File close is the transaction that allows readers to see data – unclosed files are lost

– cannot write any further to closed file

Real-time not possible with HDFS – to make data visible, must close file immediately after writing

– Too many files is a serious problem with HDFS (a well documented limitation)

Unless it's Apache HDFS …

Page 22: Architectural Overview of MapR's Apache Hadoop Distribution

22

HDFS solves the problem a third way

Make everything read-only – Nothing to re-sync

Single writer, no reads allowed while writing

File close is the transaction that allows readers to see data – unclosed files are lost

– cannot write any further to closed file

Real-time not possible with HDFS – to make data visible, must close file immediately after writing

– Too many files is a serious problem with HDFS (a well documented limitation)

HDFS therefore cannot do NFS, ever – NFS has no “close” … data can be lost at any time

Unless it's Apache HDFS …

Page 23: Architectural Overview of MapR's Apache Hadoop Distribution

23

To support normal apps, need full read/write support

Let's return to the issue: re-syncing the replica when it comes back

This is the 21st century…

(Diagrams: clients, primary server, replica. Left: the primary re-syncs the replica. Right: the replica remains stale until an "anti-entropy" process is kicked off by the administrator.)

Who re-syncs?

Page 24: Architectural Overview of MapR's Apache Hadoop Distribution

24

24 TB / server – @ 1000MB/s = 7 hours

– practical terms, @ 200MB/s = 35 hours

How long to re-sync?

Page 25: Architectural Overview of MapR's Apache Hadoop Distribution

25

24 TB / server – @ 1000MB/s = 7 hours

– practical terms, @ 200MB/s = 35 hours

Did you say you want to do this online?

How long to re-sync?

Page 26: Architectural Overview of MapR's Apache Hadoop Distribution

26

24 TB / server – @ 1000MB/s = 7 hours

– practical terms, @ 200MB/s = 35 hours

Did you say you want to do this online? – throttle re-sync rate to 1/10th

– 350 hours to re-sync (= 15 days)

How long to re-sync?

Page 27: Architectural Overview of MapR's Apache Hadoop Distribution

27

24 TB / server – @ 1000MB/s = 7 hours

– practical terms, @ 200MB/s = 35 hours

Did you say you want to do this online? – throttle re-sync rate to 1/10th

– 350 hours to re-sync (= 15 days)

What is your Mean Time To Data Loss (MTTDL)?

How long to re-sync?

Page 28: Architectural Overview of MapR's Apache Hadoop Distribution

28

24 TB / server – @ 1000MB/s = 7 hours

– practical terms, @ 200MB/s = 35 hours

Did you say you want to do this online? – throttle re-sync rate to 1/10th

– 350 hours to re-sync (= 15 days)

What is your Mean Time To Data Loss (MTTDL)? – how long before a double disk failure?

– a triple disk failure?

How long to re-sync?
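
A quick back-of-the-envelope check of the numbers above, as a minimal Python sketch (the 24 TB per server and the MB/s rates are the slide's assumptions, not measurements):

```python
# Rough check of the re-sync times quoted above (approximate figures only).
TB = 10**12   # bytes
MB = 10**6    # bytes

data_per_server = 24 * TB

def hours_to_resync(rate_mb_per_s):
    """Hours needed to copy 24 TB at the given re-sync rate."""
    return data_per_server / (rate_mb_per_s * MB) / 3600

print(hours_to_resync(1000))      # ~6.7 h   -> "7 hours"
print(hours_to_resync(200))       # ~33 h    -> "35 hours"
print(hours_to_resync(200 / 10))  # ~333 h   -> "350 hours", roughly two weeks
```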

Page 29: Architectural Overview of MapR's Apache Hadoop Distribution

29

Use dual-ported disk to side-step this problem

Traditional solutions

Clients

Primary Server

Replica

Dual Ported Disk Array

RAID-6 with idle spares

Servers use NVRAM

Page 30: Architectural Overview of MapR's Apache Hadoop Distribution

30

Use dual-ported disk to side-step this problem

Traditional solutions

Clients

Primary Server

Replica

Dual Ported Disk Array

RAID-6 with idle spares

Servers use NVRAM

COMMODITY HARDWARE LARGE SCALE CLUSTERING

Page 31: Architectural Overview of MapR's Apache Hadoop Distribution

31

Use dual-ported disk to side-step this problem

Traditional solutions

Clients

Primary Server

Replica

Dual Ported Disk Array

RAID-6 with idle spares

Servers use NVRAM

COMMODITY HARDWARE LARGE SCALE CLUSTERING

Page 32: Architectural Overview of MapR's Apache Hadoop Distribution

32

Use dual-ported disk to side-step this problem

Traditional solutions

Clients

Primary Server

Replica

Dual Ported Disk Array

RAID-6 with idle spares

Servers use NVRAM

COMMODITY HARDWARE LARGE SCALE CLUSTERING

Large Purchase Contracts, 5-year spare-parts plan

Page 33: Architectural Overview of MapR's Apache Hadoop Distribution

33

Forget Performance?

(Diagram: Traditional architecture: applications and an RDBMS run on "function" nodes, with all data held centrally in a SAN/NAS.)

Page 34: Architectural Overview of MapR's Apache Hadoop Distribution

34

Forget Performance?

(Diagram: The traditional SAN/NAS-plus-RDBMS architecture contrasted with Hadoop, where data and function are co-located on every node.)

Geographically dispersed also?

Page 35: Architectural Overview of MapR's Apache Hadoop Distribution

35

Chop the data on each node into 1000's of pieces – not millions of pieces, only 1000's

– pieces are called containers

What MapR does

Page 36: Architectural Overview of MapR's Apache Hadoop Distribution

36

Chop the data on each node into 1000's of pieces – not millions of pieces, only 1000's

– pieces are called containers

What MapR does

Page 37: Architectural Overview of MapR's Apache Hadoop Distribution

37

Chop the data on each node into 1000's of pieces – not millions of pieces, only 1000's

– pieces are called containers

Spread replicas of each container across the cluster

What MapR does

Page 38: Architectural Overview of MapR's Apache Hadoop Distribution

38

Chop the data on each node into 1000's of pieces – not millions of pieces, only 1000's

– pieces are called containers

Spread replicas of each container across the cluster

What MapR does
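
A toy placement sketch may help picture this; it is purely illustrative and not MapR's actual placement algorithm (node names, container counts, and the random placement policy are all hypothetical):

```python
import random

# Chop each node's data into a few thousand containers and spread the replicas
# across the cluster, so every node ends up holding small slices of many other
# nodes' data.
NODES = [f"node{i:03d}" for i in range(100)]
CONTAINERS_PER_NODE = 2000   # "1000's of pieces, not millions"
REPLICAS = 3

placement = {}   # container id -> nodes holding a copy
for node in NODES:
    for c in range(CONTAINERS_PER_NODE):
        cid = f"{node}/container{c:04d}"
        others = random.sample([n for n in NODES if n != node], REPLICAS - 1)
        placement[cid] = [node] + others

# If node000 dies, which nodes hold copies of its containers (and can help re-replicate)?
helpers = {n for nodes in placement.values() if "node000" in nodes
             for n in nodes if n != "node000"}
print(len(helpers))   # ~99: essentially the whole cluster shares the re-sync work
```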

Page 39: Architectural Overview of MapR's Apache Hadoop Distribution

39

Why does it improve things?

Page 40: Architectural Overview of MapR's Apache Hadoop Distribution

40

100-node cluster

each node holds 1/100th of every node's data

MapR Replication Example

Page 41: Architectural Overview of MapR's Apache Hadoop Distribution

41

100-node cluster

each node holds 1/100th of every node's data

when a server dies

MapR Replication Example

Page 42: Architectural Overview of MapR's Apache Hadoop Distribution

42

100-node cluster

each node holds 1/100th of every node's data

when a server dies

MapR Replication Example

Page 43: Architectural Overview of MapR's Apache Hadoop Distribution

43

100-node cluster

each node holds 1/100th of every node's data

when a server dies

MapR Replication Example

Page 44: Architectural Overview of MapR's Apache Hadoop Distribution

44

100-node cluster

each node holds 1/100th of every node's data

when a server dies

entire cluster re-syncs the dead node's data

MapR Replication Example

Page 45: Architectural Overview of MapR's Apache Hadoop Distribution

45

100-node cluster

each node holds 1/100th of every node's data

when a server dies

entire cluster re-syncs the dead node's data

MapR Replication Example

Page 46: Architectural Overview of MapR's Apache Hadoop Distribution

46

99 nodes re-syncing in parallel

MapR Re-sync Speed

Page 47: Architectural Overview of MapR's Apache Hadoop Distribution

47

99 nodes re-syncing in parallel – 99x the number of drives

– 99x the number of Ethernet ports

– 99x the CPUs

MapR Re-sync Speed

Page 48: Architectural Overview of MapR's Apache Hadoop Distribution

48

99 nodes re-syncing in parallel – 99x the number of drives

– 99x the number of Ethernet ports

– 99x the CPUs

Each is re-syncing 1/100th of the data

MapR Re-sync Speed

Page 49: Architectural Overview of MapR's Apache Hadoop Distribution

49

99 nodes re-syncing in parallel – 99x the number of drives

– 99x the number of Ethernet ports

– 99x the CPUs

Each is re-syncing 1/100th of the data

MapR Re-sync Speed

Net speed up is about 100x – 350 hours vs. 3.5

Page 50: Architectural Overview of MapR's Apache Hadoop Distribution

50

99 nodes re-syncing in parallel – 99x the number of drives

– 99x the number of Ethernet ports

– 99x the CPUs

Each is re-syncing 1/100th of the data

MapR Re-sync Speed

Net speed up is about 100x – 350 hours vs. 3.5

MTTDL is 100x better
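
The arithmetic behind the ~100x claim, as a minimal sketch (350 hours is the throttled single-replica figure from the earlier slide):

```python
# Rough arithmetic behind the ~100x re-sync speed-up and MTTDL improvement.
serial_resync_hours = 350        # one node re-syncing a full 24 TB at a throttled rate
cluster_size = 100
fraction_each = 1 / cluster_size # 99 nodes work in parallel, each copying ~1/100th

parallel_resync_hours = serial_resync_hours * fraction_each
print(parallel_resync_hours)     # 3.5 hours instead of 350

# The window during which a second or third failure can cause data loss shrinks
# by the same factor, so MTTDL improves roughly 100x.
```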

Page 51: Architectural Overview of MapR's Apache Hadoop Distribution

51

Why is this so difficult?

Page 52: Architectural Overview of MapR's Apache Hadoop Distribution

52

Writes are synchronous

Data is replicated in a "chain" fashion – utilizes full-duplex network

Meta-data is replicated in a "star" manner – better response time

MapR's Read-write Replication

(Diagrams: chain replication of data and star replication of metadata, each serving clients client1 … clientN.)
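
A minimal sketch of the two patterns (the node names and the print-based "transport" are hypothetical; this shows the shape of the traffic, not MapR's wire protocol):

```python
def replicate_chain(nodes, payload):
    """Chain: each node receives the write and forwards it to the next node.

    Every link carries the payload once in each direction, which makes good
    use of full-duplex networks - the pattern used for data."""
    sender = "client"
    for node in nodes:
        print(f"  {sender} -> {node}: {len(payload)} bytes")
        sender = node            # this node forwards the payload onward

def replicate_star(primary, replicas, payload):
    """Star: the primary fans the write out directly to every replica.

    Fewer hops before the write is acknowledged, so better response time -
    the pattern used for metadata."""
    print(f"  client -> {primary}: {len(payload)} bytes")
    for node in replicas:
        print(f"  {primary} -> {node}: {len(payload)} bytes")

data = b"x" * 1024
replicate_chain(["A", "B", "C"], data)
replicate_star("A", ["B", "C"], data)
```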

Page 53: Architectural Overview of MapR's Apache Hadoop Distribution

53

As data size increases, writes spread more, like dropping a pebble in a pond

Larger pebbles spread the ripples farther

Space balanced by moving idle containers

Container Balancing

• Servers keep a bunch of containers "ready to go".

• Writes get distributed around the cluster.

Page 54: Architectural Overview of MapR's Apache Hadoop Distribution

54

MapR Container Resync

MapR is 100% random write – very tough problem

On a complete crash, all replicas diverge from each other

On recovery, which one should be master?

Complete crash

Page 55: Architectural Overview of MapR's Apache Hadoop Distribution

55

MapR Container Resync

MapR can detect exactly where replicas diverged – even at 2000 MB/s update rate

Resync means – roll back the rest to the divergence point – roll forward to converge with the chosen master

Done while online – with very little impact on normal operations

New master after crash
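
Conceptually, the re-sync looks like the following sketch, where each replica is modeled as an ordered log of updates (MapR's real mechanism works on its own transactional on-disk structures, not a Python list):

```python
def divergence_point(master_log, replica_log):
    """Index of the first operation where the two logs differ."""
    i = 0
    while (i < len(master_log) and i < len(replica_log)
           and master_log[i] == replica_log[i]):
        i += 1
    return i

def resync(master_log, replica_log):
    """Roll the stale replica back to the divergence point, then roll it
    forward so it converges with the chosen master."""
    d = divergence_point(master_log, replica_log)
    return replica_log[:d] + master_log[d:]

master  = ["op1", "op2", "op3a", "op4a"]   # chosen master after the crash
replica = ["op1", "op2", "op3b"]           # diverged replica
print(resync(master, replica))             # ['op1', 'op2', 'op3a', 'op4a']
```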

Page 56: Architectural Overview of MapR's Apache Hadoop Distribution

56

Resync traffic is “secondary”

Each node continuously measures RTT to all its peers

More throttle to slower peers – an idle system runs at full speed

All of this happens automatically

MapR does Automatic Resync Throttling
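
A toy sketch of the idea (not MapR's actual algorithm; the baseline RTT and rates are made-up numbers): scale the re-sync rate down for peers whose measured RTT shows they are busy, and let an idle peer run at full speed.

```python
def resync_rate(base_rate_mb_s, measured_rtt_ms, idle_rtt_ms=0.2):
    """Give slower (busier) peers a smaller share of the re-sync bandwidth."""
    return base_rate_mb_s * min(1.0, idle_rtt_ms / measured_rtt_ms)

for rtt in (0.2, 1.0, 5.0):       # idle peer, moderately loaded peer, busy peer
    print(f"RTT {rtt} ms -> {resync_rate(200, rtt):.0f} MB/s")
```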

Page 57: Architectural Overview of MapR's Apache Hadoop Distribution

57

Where/how does MapR exploit this unique advantage?

Page 58: Architectural Overview of MapR's Apache Hadoop Distribution

58

MapR's No-NameNode Architecture

HDFS Federation:

• Multiple single points of failure

• Limited to 50-200 million files

• Performance bottleneck

• Commercial NAS required

MapR (distributed metadata):

• HA with automatic failover

• Instant cluster restart

• Up to 1T files (> 5000x advantage)

• 10-20x higher performance

• 100% commodity hardware

(Diagram: HDFS Federation, with multiple NameNodes backed by a NAS appliance and DataNodes holding blocks A-F, contrasted with MapR, where file metadata is distributed across the data nodes themselves.)

Page 59: Architectural Overview of MapR's Apache Hadoop Distribution

59

Relative performance and scale

(Charts: file creates per second vs. number of files (millions), MapR distribution vs. another distribution.)

Benchmark: file creates (100 B). Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM drives.

                    MapR      Other      MapR advantage
Rate (creates/s)    14-16K    335-360    40x
Scale (files)       6B        1.3M       4615x

Page 60: Architectural Overview of MapR's Apache Hadoop Distribution

60

Where/how does MapR exploit this unique advantage?

Page 61: Architectural Overview of MapR's Apache Hadoop Distribution

61

MapR’s NFS allows Direct Deposit

Connectors not needed

No extra scripts or clusters to deploy and maintain

Random Read/Write

Compression

Distributed HA

(Diagram: web servers, database servers, and application servers writing directly into the MapR cluster over NFS.)
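
For example, an application server can append log records straight into the cluster through the NFS mount. The sketch below assumes the cluster is NFS-mounted at the conventional /mapr/<cluster-name>/ location; the cluster name and log path are hypothetical:

```python
import datetime

LOG_PATH = "/mapr/my.cluster.com/weblogs/access.log"   # hypothetical path

def log_request(client_ip, url, status):
    line = f"{datetime.datetime.utcnow().isoformat()} {client_ip} {url} {status}\n"
    # Ordinary POSIX append over NFS: the data is immediately visible to
    # MapReduce jobs, Hive queries, etc. - no "close" transaction as in HDFS.
    with open(LOG_PATH, "a") as f:
        f.write(line)

log_request("10.1.2.3", "/index.html", 200)
```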

Page 62: Architectural Overview of MapR's Apache Hadoop Distribution

62

Where/how does MapR exploit this unique advantage?

Page 63: Architectural Overview of MapR's Apache Hadoop Distribution

63

MapR Volumes

100K volumes are OK – create as many as desired!

/projects
  /tahoe
  /yosemite
/user
  /msmith
  /bjohnson

Volumes dramatically simplify the management of Big Data

• Replication factor

• Scheduled mirroring

• Scheduled snapshots

• Data placement control

• User access and tracking

• Administrative permissions
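
A minimal sketch of creating per-project and per-user volumes from a script; it assumes the maprcli tool is installed and that the option names match your release (check maprcli volume create -help), so treat the flags as an assumption rather than a reference:

```python
import subprocess

def create_volume(name, mount_path):
    # -name and -path are the basic options; replication factor, quotas, and
    # snapshot/mirror schedules can be added with further options per the docs.
    subprocess.run(
        ["maprcli", "volume", "create", "-name", name, "-path", mount_path],
        check=True,
    )

create_volume("projects.tahoe", "/projects/tahoe")
create_volume("users.msmith", "/user/msmith")
```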

Page 64: Architectural Overview of MapR's Apache Hadoop Distribution

64

Where/how does MapR exploit this unique advantage?

Page 65: Architectural Overview of MapR's Apache Hadoop Distribution

65

M7 Tables

M7 tables integrated into storage – always available on every node, zero admin

Unlimited number of tables – Apache HBase is typically 10-20 tables (max 100)

No compactions

Instant-On – zero recovery time

5-10x better performance

Consistent low latency – at the 95th and 99th percentiles
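
Because M7 tables live in the filesystem namespace, an application can address a table by its path. A hedged sketch using the happybase client, assuming an HBase Thrift gateway is running on gateway-host; the table path and column family are hypothetical:

```python
import happybase

connection = happybase.Connection("gateway-host")      # HBase Thrift gateway
table = connection.table("/user/msmith/web_events")    # M7 table addressed by path

table.put(b"user42#2014-01-01",
          {b"cf:url": b"/index.html", b"cf:status": b"200"})
print(table.row(b"user42#2014-01-01"))
```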

Page 66: Architectural Overview of MapR's Apache Hadoop Distribution

66

M7 vs. CDH: 50-50 Mix (Reads)

Page 67: Architectural Overview of MapR's Apache Hadoop Distribution

67

M7 vs. CDH: 50-50 load (read latency)

Page 68: Architectural Overview of MapR's Apache Hadoop Distribution

68

Where/how does MapR exploit this unique advantage?

Page 69: Architectural Overview of MapR's Apache Hadoop Distribution

69

ALL Hadoop components are Highly Available, e.g., YARN

MapR makes Hadoop truly HA

Page 70: Architectural Overview of MapR's Apache Hadoop Distribution

70

ALL Hadoop components are Highly Available, e.g., YARN

ApplicationMaster (old JT) and TaskTracker record their state in MapR

MapR makes Hadoop truly HA

Page 71: Architectural Overview of MapR's Apache Hadoop Distribution

71

ALL Hadoop components are Highly Available, e.g., YARN

ApplicationMaster (old JT) and TaskTracker record their state in MapR

On node-failure, AM recovers its state from MapR – Works even if entire cluster restarted

MapR makes Hadoop truly HA

Page 72: Architectural Overview of MapR's Apache Hadoop Distribution

72

ALL Hadoop components are Highly Available, e.g., YARN

ApplicationMaster (old JT) and TaskTracker record their state in MapR

On node-failure, AM recovers its state from MapR – Works even if entire cluster restarted

All jobs resume from where they were – Only from MapR

MapR makes Hadoop truly HA

Page 73: Architectural Overview of MapR's Apache Hadoop Distribution

73

ALL Hadoop components are Highly Available, e.g., YARN

ApplicationMaster (old JT) and TaskTracker record their state in MapR

On node-failure, AM recovers its state from MapR – Works even if entire cluster restarted

All jobs resume from where they were – Only from MapR

Allows pre-emption – MapR can pre-empt any job, without losing its progress

– ExpressLane™ feature in MapR exploits it

MapR makes Hadoop truly HA
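
The same idea applies to any long-running task: checkpoint progress to shared MapR storage and resume from it after a restart on any node. A generic illustration of that pattern (not YARN's actual recovery code; the checkpoint path is hypothetical):

```python
import json, os

CHECKPOINT = "/mapr/my.cluster.com/apps/myjob/state.json"   # hypothetical path

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_record": 0}

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)   # atomic rename: a crash never leaves a torn checkpoint

state = load_state()
for record in range(state["next_record"], state["next_record"] + 1000):
    # ... process one record ...
    state["next_record"] = record + 1
save_state(state)   # on restart, processing resumes from the last saved record
```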

Page 74: Architectural Overview of MapR's Apache Hadoop Distribution

74

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Page 75: Architectural Overview of MapR's Apache Hadoop Distribution

75

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Save service-state in MapR

Save data in MapR

Page 76: Architectural Overview of MapR's Apache Hadoop Distribution

76

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Save service-state in MapR

Save data in MapR

Use Zookeeper to notice service failure

Page 77: Architectural Overview of MapR's Apache Hadoop Distribution

77

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Save service-state in MapR

Save data in MapR

Use Zookeeper to notice service failure

Restart anywhere, data+state will move there automatically

Page 78: Architectural Overview of MapR's Apache Hadoop Distribution

78

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Save service-state in MapR

Save data in MapR

Use Zookeeper to notice service failure

Restart anywhere, data+state will move there automatically

That’s what we did!

Page 79: Architectural Overview of MapR's Apache Hadoop Distribution

79

Where/how can YOU exploit MapR’s unique advantage?

ALL your code can easily be scale-out HA

Save service-state in MapR

Save data in MapR

Use Zookeeper to notice service failure

Restart anywhere, data+state will move there automatically

That’s what we did!

Only from MapR: HA for Impala, Hive, Oozie, Storm, MySQL, SOLR/Lucene, Kafka, …
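
A sketch of that recipe using the kazoo ZooKeeper client (the znode path, ZooKeeper hosts, and state directory are hypothetical): the active instance holds an ephemeral znode, standbys watch it, and whichever standby wins the re-create race takes over and reads the service's state back from MapR.

```python
import socket
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

STATE_DIR = "/mapr/my.cluster.com/services/mysvc"   # service state saved in MapR
ZNODE = "/services/mysvc/active"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def try_become_active():
    try:
        # Ephemeral znode disappears automatically if this process or node dies.
        zk.create(ZNODE, socket.gethostname().encode(),
                  ephemeral=True, makepath=True)
        print("active: recovering state from", STATE_DIR)
        # ... load checkpointed state from STATE_DIR and start serving ...
    except NodeExistsError:
        print("standby: watching the active instance")
        zk.exists(ZNODE, watch=lambda event: try_become_active())

try_become_active()
```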

Page 80: Architectural Overview of MapR's Apache Hadoop Distribution

80

Build cluster brick by brick, one node at a time

Use commodity hardware at rock-bottom prices

Get enterprise-class reliability: instant-restart, snapshots, mirrors, no-single-point-of-failure, …

Export via NFS, ODBC, Hadoop, and other standard protocols

MapR: Unlimited Scale

# files, # tables     trillions
# rows per table      trillions
# data                1-10 exabytes
# nodes               10,000+