
Crossing the Chasm


DESCRIPTION

Sanjay Radia, Founder and Architect at Hortonworks, talks about Apache Hadoop and its uptake in the industry.


Page 1: Crossing the Chasm

Crossing the Chasm
Hadoop for the Enterprise
Sanjay Radia – Hortonworks Founder & Architect
Formerly Hadoop Architect @ Yahoo! (4 years @ Yahoo!)
@srr (@hortonworks)

© Hortonworks Inc. 2011 June 29, 2011

Page 2: Crossing the Chasm

Crossing the Chasm
• Apache Hadoop grew rapidly, charting new territories in features, abstractions, APIs, scale, fault tolerance, multi-tenancy, operations …
− A small number of early customers who needed a new platform
− Provided Hadoop as a service to make adoption easy

• Today:
− Dramatic growth in adoption and customer base
− Growth of the Hadoop stack and applications
− New requirements and expectations


[Diagram: Geoffrey A. Moore's technology adoption chasm – Early Adopters, Early Majority, Late Majority, Mission Critical]

Page 3: Crossing the Chasm

Crossing the Chasm: Overview
• How the chasm is being crossed
− Security
− SLAs & predictability
− Scalability
− Availability & data integrity
− Backward compatibility
− Quality & testing

• Fundamental architectural improvements – Federation & MR.next
− Adapt to changing, sometimes unforeseen, needs
− Fuel innovation and rapid development

• The Community Effect


Page 4: Crossing the Chasm

Security
Early Gains
− No authorization or authentication requirements
− Added permissions; passed the client-side userid to the server (0.16)
    Addresses accidental deletes by another user
− Service authorization (0.18, 0.20)

Issues: stronger authorization required
− Shared clusters – multiple tenants
− Critical data
− New categories of users (financial)
− SOX compliance

Our Response
− Authentication using Kerberos (0.20.203) – see the config sketch below
− 10 person-year effort by Yahoo!
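The sketch below shows roughly what turning this on looks like in core-site.xml for a 0.20.20x-style secure deployment. It is illustrative only: a real setup also needs per-daemon Kerberos principals and keytabs in hdfs-site.xml and mapred-site.xml, and details vary by release.

    <!-- core-site.xml: minimal sketch of Kerberos authentication plus
         service-level authorization (the ACLs then live in hadoop-policy.xml).
         The default for hadoop.security.authentication is "simple", which
         trusts the client-side userid. -->
    <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
    </property>
    <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>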


Page 5: Crossing the Chasm

SLAs and Predictability
Issue: customers uncomfortable with shared clusters
− Customers traditionally plan for peaks with dedicated hardware
− Dedicated clusters had poor utilization

Response: Capacity Scheduler (0.20) – see the queue config sketch below
− Guaranteed capacities in a multi-tenant shared cluster
    Almost like dedicated hardware
    Each organization is given queue(s) with a guaranteed capacity; the organization
    • controls who is allowed to submit jobs to its queues
    • sets the priorities of jobs within its queue
    • creates sub-queues (0.21) for finer-grained control within its capacity
    Unused capacity is given to tasks in other queues
    • Better than a private cluster – access to unused capacity when in a crunch
− Resource limits for tasks – deals with misbehaving apps

Response: FairShare Scheduler (0.20)
− Focus is fair share of resources, but it does have pools
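A rough sketch of how a guaranteed queue capacity is expressed with the 0.20-era Capacity Scheduler. The queue name "analytics" and the 30% figure are hypothetical, and exact property names differ across releases, so treat this as illustrative rather than definitive.

    <!-- mapred-site.xml: enable the Capacity Scheduler and declare the queues -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>
    <property>
      <name>mapred.queue.names</name>
      <value>default,analytics</value>
    </property>

    <!-- capacity-scheduler.xml: guarantee the analytics organization 30% of the
         cluster's slots; its jobs may still borrow unused capacity elsewhere -->
    <property>
      <name>mapred.capacity-scheduler.queue.analytics.capacity</name>
      <value>30</value>
    </property>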


Page 6: Crossing the Chasm

Scalability
Early Gains
− Simple design allowed rapid improvements
    Single master, namespace in RAM, simpler locking
    Cluster size improvements: 1K → 2K → 4K nodes
    Vertical scaling: tuned GC + efficient memory usage
    Archive file system – reduces files and blocks (0.20)

Current Issues
− Growth of files and storage limited by the single NameNode (0.20)
    Only an issue for very, very large clusters
− JobTracker does not scale beyond 30K tasks – needs a redesign

Our Response
− Read/write locks in the NN (0.22)
− MR.next – complete rewrite of the MR servers (JT, TT) – 100K tasks (0.23)
− Federation: horizontal scaling of the namespace – a billion files (0.23); see the config sketch below
− A NN that keeps only part of the namespace in memory – a trillion files (0.23.x)
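A rough sketch of what a federated namespace looks like in configuration. Hostnames, nameservice IDs, and mount points are hypothetical, and the property names changed between 0.23 and later releases (e.g. dfs.federation.nameservices was later renamed dfs.nameservices), so this is a shape-of-the-thing illustration only.

    <!-- hdfs-site.xml: two independent NameNodes, each owning part of the namespace -->
    <property>
      <name>dfs.federation.nameservices</name>
      <value>ns-user,ns-data</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns-user</name>
      <value>nn-user.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns-data</name>
      <value>nn-data.example.com:8020</value>
    </property>

    <!-- core-site.xml (client side): a ViewFs mount table stitches the
         namespaces back into a single client-visible view -->
    <property>
      <name>fs.viewfs.mounttable.default.link./user</name>
      <value>hdfs://nn-user.example.com:8020/user</value>
    </property>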


Page 7: Crossing the Chasm

HDFS Availability & Data Integrity: Early Gains

• Simple design, Java, storage fault tolerance
− Java – saved from the pointer errors that lead to data corruption
− Simplicity – a subset of POSIX; random writers not supported
− Storage: rely on the OS's file system rather than using raw disk
− Storage fault tolerance: multiple replicas, active monitoring
− Single NameNode master
    Persistent state: multiple copies + checkpoints
    Restart on failure

• How well did it work?
− Lost 650 blocks out of 329M on 10 clusters with 20K nodes in 2009
    82%: abandoned open files (append bug, fixed in 0.21)
    15%: files created with a single replica (data reliability not needed)
    3%: due to roughly 7 bugs that were then fixed (0.21)
− Over the last 18 months, 22 failures on 25 clusters
    Only 8 would have benefited from HA failover!! (0.23 failures per cluster-year)
− The NN is very robust and can take a lot of abuse
    The NN is resilient against overload caused by misbehaving apps


Page 8: Crossing the Chasm

HDFS Availability & Data Integrity: Response

• Data Integrity
− Append/flush/sync redesign (0.21) – see the hflush sketch at the end of this slide
− The write pipeline recruits new replicas rather than just removing them on failure (0.23)

• Improving availability of the NN
− Faster HDFS restarts
    NN bounce in 20 minutes (0.23)
    Federation allows smaller NNs (0.23)
− Federation will significantly improve NN isolation and hence availability (0.23)

• Why did we wait this long for an HA NN?
− The failure rates did not demand making this a high priority
− Failover requires corner cases to be addressed correctly
    Correct fencing of shared state during failover is critical
    • Getting it wrong can corrupt data and reduce availability!!
− Many factors impact availability, not just failover
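A minimal sketch of the flush/sync API this redesign exposes to clients: hflush pushes buffered data to the datanode pipeline so it is visible to new readers, while hsync additionally forces it to disk. The path and record are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A log-style writer (path and contents are illustrative)
        FSDataOutputStream out = fs.create(new Path("/tmp/app-edits.log"));
        out.writeBytes("record-1\n");
        out.hflush();   // data now reaches the pipeline and is visible to new readers
        out.close();
      }
    }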

Page 9: Crossing the Chasm

HDFS Availability & Data Integrity: Response: HA NN
Active work has started on an HA NN (failover)
• HA NN – detailed design (HDFS-1623)
− Community effort
− HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073

• HA: prototype work
− Backup NN (0.21)
− Avatar NN (Facebook)
− HA NN prototype using Linux HA (Yahoo!)
− HA NN prototype with Backup NN and block report replicator (eBay)

HA is the highest priority for 0.23.x


Page 10: Crossing the Chasm

MapReduce: Fault Tolerance and Availability

Early Gains: Fault-tolerance of tasks and compute nodes

Current Issues: Loss of the job queue if the JobTracker is restarted

Our Response

MR.next is designed with fault tolerance and availability in mind
− HA Resource Manager (0.23.x)
− Loss of the Resource Manager – degraded mode; recover via restart or failover
    Apps continue with their current resources
    App Managers can reschedule work within their current resources
    New apps cannot be submitted or launched; new resources cannot be allocated
− Loss of an App Manager – recovers
    The app is restarted and its state is recovered
− Loss of tasks and nodes – recovers
    Recovered as in the old MapReduce


Page 11: Crossing the Chasm

Backwards Compatibility

Early Gains
− Early success stemmed from a philosophy of "ship early and often", resulting in changing APIs
    Data and metadata compatibility was always maintained
    The early customers paid the price
    • Current customers reap the benefits of more mature interfaces

Issues
− Increased adoption leads to increased expectations of backwards compatibility


Page 12: Crossing the Chasm

Backward Compatibility: Response
• Interface classification – audience and stability tags (0.21); see the annotation sketch below
− Patterned on enterprise-quality software processes

• Evolve interfaces but maintain backward compatibility
− Newer, forward-looking interfaces added; old interfaces maintained

• Test for compatibility
− Run old jars of automation tests and real Yahoo! applications

• Applications adopting higher abstractions (Pig, Hive)
− Insulates them from the lower, primitive interfaces

• Wire compatibility (HADOOP-7347)
− Maintain compatibility with the current protocol (Java serialization)
− Adapters for addressing future discontinuities, e.g. a serialization or protocol change
− Moved to Protocol Buffers for the data transfer protocol
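A sketch of what the audience and stability tags look like on an API class. The annotations are Hadoop's real org.apache.hadoop.classification ones; the classes themselves are hypothetical examples.

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    /**
     * A hypothetical public, stable client API: downstream users may depend on it,
     * and incompatible changes are reserved for major releases.
     */
    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public class ExampleClient {
      // ...
    }

    /**
     * A hypothetical internal helper: only the named projects may use it,
     * and it may still change between minor releases.
     */
    @InterfaceAudience.LimitedPrivate({"MapReduce"})
    @InterfaceStability.Evolving
    class ExampleInternalHelper {
      // ...
    }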


Page 13: Crossing the Chasm

Testing & Quality
Nightly Testing
− Against 1,200 automated tests on 30 nodes
− Against live data and live applications

QE Certification for Release
− Large variety and scale tests on 500 nodes
− Performance benchmarking
− QE HIT integration testing of the whole stack

Release Testing
• Sandbox clusters – 3 clusters, each with 400–1K nodes
− Major releases: 2 months of testing on actual data; all production projects must sign off
• Research clusters – 6 clusters, 4K nodes (non-revenue production jobs)
− Major releases: a minimum of 2 months before moving to production
− 0.25 to 0.5 million jobs per week
− If it clears research, it is mostly fine in production

Release
• Production clusters – 11 clusters (4.5K nodes)
− Revenue generating, stricter SLAs


Page 14: Crossing the Chasm

Fundamental Architecture Changes that cut across several issues

• HDFS storage: mostly a separate layer – but one customer: one NN
− Federation generalizes the layer

• MapReduce: compute resource scheduling is tightly coupled to MapReduce job management
− MapReduce.next separates the layers


[Diagram: today's coupled, one-to-one layering – the HDFS Namesystem over storage resources, and the MapReduce Job Manager coupled to the Resource Scheduler over compute resources]

Page 15: Crossing the Chasm


Fundamental Architecture Changes that cut across several issues

− Scalability, isolation, availability
− Generic lower layer: first-class support for new applications on top
    MR tmp, HBase, MPI, …
− Layering facilitates faster development of new work
    A NN that caches the namespace – a few months of work
    New implementations of the MR App Manager
− Compatibility: support multiple versions of MR
    Tenants upgrade at their own pace – crucial for shared clusters

[Diagram: layered, one-to-many architecture – multiple HDFS Namesystems (including an alternate NN implementation) and MR tmp over storage resources; a shared Resource Scheduler over compute resources running MR Apps (different MR versions via the MR lib), an MPI App, and HBase]

Page 16: Crossing the Chasm

The Community Effect
• Some projects are done entirely by teams at Yahoo!, Facebook, or Cloudera

• But several projects are joint work
− Yahoo! & Facebook on NN scalability and concurrency, especially in the face of misbehaving apps
− Edits log v2 and refactoring of the edits log (Cloudera and Yahoo!/Hortonworks)
    HDFS-1073, 2003, 1557, 1926
− NN HA – Yahoo!/Hortonworks, Cloudera, Facebook, eBay
    HDFS-1623, 1971, 1972, 1973, 1974, 1975, 2005
− Features to support HBase: Facebook, Cloudera, Yahoo!, and the HBase community

• Expect to see rapid improvements in the very near future
− Further scalability – a NN that caches part of the namespace
− Improved I/O performance – DataNode performance improvements
− Wire compatibility – wire protocols, operational improvements
− New App Managers for MR.next
− Continued improvement of management and operability


Page 17: Crossing the Chasm

Hadoop is Successfully Crossing the Chasm

• Hadoop used in enterprises for revenue generating applications

• Apache Hadoop is improving at a rapid rate
− Addressing many issues, including HA
− Fundamental design improvements to fuel innovation
    The might of a large, growing developer community

• Battle tested on large clusters and a variety of applications
− At Yahoo!, Facebook, and many other Hadoop customers
− Data integrity has been a focus from the early days
    A level of testing that even the large commercial vendors cannot match!

Can you trust your data to anything less?

Page 18: Crossing the Chasm

Q & A

Hortonworks @ Hadoop Summit
• 1:45pm: Next Generation Apache Hadoop MapReduce – Community track, by Arun Murthy
• 2:15pm: Introducing HCatalog (Hadoop Table Manager) – Community track, by Alan Gates
• 4:00pm: Large Scale Math with Hadoop MapReduce – Applications and Research track, by Tsz-Wo Sze
• 4:30pm: HDFS Federation and Other Features – Community track, by Suresh Srinivas and Sanjay Radia


Page 19: Crossing the Chasm


About Hortonworks

• Mission: Revolutionize and commoditize the storage and processing of big data via open source

• Vision: Half of the world’s data will be stored in Apache Hadoop within five years

• Strategy: Drive advancements that make Apache Hadoop projects more consumable for the community, enterprises, and the ecosystem
− Make Apache Hadoop easy to install, manage, and use
− Improve Apache Hadoop performance and availability
− Make Apache Hadoop easy to integrate and extend


Page 20: Crossing the Chasm

Thank You.

© Hortonworks Inc. 2011