Productionizing Hadoop: 7 Architectural Best Practices Mike Gualtieri, Principal Analyst


Page 1: Productionizing Hadoop: 7 Architectural Best Practices

Productionizing Hadoop: 7 Architectural Best Practices
Mike Gualtieri, Principal Analyst

Page 2: Productionizing Hadoop: 7 Architectural Best Practices

#BigData

Page 3: Productionizing Hadoop: 7 Architectural Best Practices

© 2013 Forrester Research, Inc. Reproduction Prohibited

Implemented, not expanding: 7%
Expanding/upgrading implementation: 13%
Planning to implement in the next 12 months: 7%
Planning to implement in more than 1 year: 17%
Interested but no plans: 31%

Base: 634 business intelligence users and planners

“What best describes your firm's current usage/plans to adopt Big Data technologies and solutions?”

Source: Forrsights BI/Big Data Survey, Q3 2012

Big Data has momentum

20% have implemented some big data technology

37% are planning some big data technology project

Page 4: Productionizing Hadoop: 7 Architectural Best Practices

“Big Data is the frontier of a firm’s ability to store, process,

and access (SPA) all of the data it needs to operate, make

decisions, reduce risks, and serve customers.”

DEFINITION

FORRESTER

Page 5: Productionizing Hadoop: 7 Architectural Best Practices


Other: 2%
Don't know: 3%
Earlier generation technology is too expensive: 21%
The velocity of data is too high for earlier technologies: 22%
We can achieve (or are achieving) significant cost reductions by changing our data management and analytic architecture: 28%
Data changes or becomes available much faster than we can process in support of business decisions: 32%
The number of data formats that we must be able to deal with exceeds our ability to cost-effectively integrate: 32%
Analysis requirements change too fast to keep up with: 36%
We want to access data that was not accessible for us with existing technologies: 36%
Data volumes have grown beyond what we can cost effectively manage: 38%
We don't know what our entire data universe contains, we need new ways to explore data and discover patterns and insights, before we even understand what we are looking for: 41%

“What are the main business and technical requirements or inadequacies of earlier-generation BI technologies that lead you to consider new BI techniques and technologies?”

Firms seek more value in data, struggle to wrangle it, & seek lower cost solutions

Page 6: Productionizing Hadoop: 7 Architectural Best Practices


Integrating data from a variety of data sources is a top challenge

Page 7: Productionizing Hadoop: 7 Architectural Best Practices


Big Data architecture must support three core capabilities (SPA):

• Store: Can you capture and store all your data++?
• Process: Do you have the compute power to cleanse, enrich, & analyze your data++?
• Access: Can you retrieve, search, integrate, and visualize all your data++?


Page 8: Productionizing Hadoop: 7 Architectural Best Practices


Page 9: Productionizing Hadoop: 7 Architectural Best Practices

#Production

Page 10: Productionizing Hadoop: 7 Architectural Best Practices

How can you keep your Big Data operations running smoothly?

Production

Page 11: Productionizing Hadoop: 7 Architectural Best Practices


Productionizing Big Data can be complex because of:
• Integration with heterogeneous infrastructure
• Use of multiple analytical software applications
• Reliance on 3rd-party cloud services
• Always available modeling and visualization sandboxes
• Increasing volume, velocity, variety of data from multiple data sources
• Compute intensive analytics

Page 12: Productionizing Hadoop: 7 Architectural Best Practices

Big Data production requires sound architecture.

Production

Page 13: Productionizing Hadoop: 7 Architectural Best Practices

The 7 architectural qualities of Big Data production platforms

Quality: What it means
1 Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
2 Availability: The readiness of the service or application to perform its functions when needed
3 Performance: The speed to perform functions to meet business and user expectations
4 Scalability: Handle increasing volumes of data, transactions, services, and applications.
5 Adaptability: The ease with which an application or service can be changed or extended
6 Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7 Economy: Minimize cost to build, operate, & change an application or service without compromising its business value

Page 14: Productionizing Hadoop: 7 Architectural Best Practices

Operational experience is critical to production.

1. Experience

Page 15: Productionizing Hadoop: 7 Architectural Best Practices

Best practices: User experience

The usefulness, usability, and desirability of applications require ease of use combined with power

Developers
• Standard Tools
• Linux Commands
• Direct Access with NFS

Administrators
• Visibility
• Self Healing
• Architectural Simplicity

Page 16: Productionizing Hadoop: 7 Architectural Best Practices

Easy Workflow Management

Workload Automation with Cisco Tidal Enterprise Scheduler

• Detailed, dependency-driven event execution

• Point-and-click dynamic variables and parameters

• Scalable, extensible architecture

• Granular notification and alerts

Page 17: Productionizing Hadoop: 7 Architectural Best Practices

High-availability strategy and architecture are often overlooked in proofs of concept.

2. Availability

Page 18: Productionizing Hadoop: 7 Architectural Best Practices

What does high availability mean?

Uptime %*            Downtime per year
99.999% (5 nines)    5.26 minutes
99.99% (4 nines)     52.6 minutes
99.5%                1.83 days
99% (2 nines)        3.65 days
98%                  7.30 days
95%                  18.25 days

*Uptime calculations assume no scheduled downtime.
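The downtime figures in the table follow directly from the uptime percentage. A minimal sketch of that arithmetic is shown here, assuming a 365-day year (which is what the slide's numbers appear to use); the function name `downtime_per_year` is illustrative and not part of the original deck.

```python
from datetime import timedelta

# Assumption: the slide's figures correspond to a 365-day year.
DAYS_PER_YEAR = 365


def downtime_per_year(uptime_percent: float) -> timedelta:
    """Return the yearly downtime implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return timedelta(days=DAYS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # Print each uptime level from the table in both minutes and days.
    for uptime in (99.999, 99.99, 99.5, 99.0, 98.0, 95.0):
        seconds = downtime_per_year(uptime).total_seconds()
        print(f"{uptime}% uptime: {seconds / 60:.2f} minutes "
              f"({seconds / 86400:.2f} days) of downtime per year")
```

As the footnote says, scheduled maintenance windows are not counted, so a real service-level calculation would need to subtract them separately.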

Page 19: Productionizing Hadoop: 7 Architectural Best Practices

©MapR Technologies - Confidential

High Availability and Dependability

Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99999’s of uptime

Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end check summing
• Strong consistency
• Data safe
• Mirror across sites to meet Recovery Time Objectives

Page 20: Productionizing Hadoop: 7 Architectural Best Practices

Unexpected latencies can emerge from rapid fluctuations in volume, velocity, & variety of data and interactions of the larger Big Data ecosystem.

3. Performance

Page 21: Productionizing Hadoop: 7 Architectural Best Practices


World Record Performance

New Minute Sort World Record: 1.5 TB in 1 minute, 2103 nodes
Previous Record: 1.4 TB

Benchmark                                          MapR 2.1.1       CDH 4.1.1        MapR Speed Increase
Terasort (1x replication, compression disabled)
  Total                                            13m 35s          26m 6s           1.9x
  Map                                              7m 58s           21m 8s           2.7x
  Reduce                                           13m 32s          23m 37s          1.7x
DFSIO throughput/node
  Read                                             1003 MB/s        656 MB/s         1.5x
  Write                                            924 MB/s         654 MB/s         1.4x
YCSB (50% read, 50% update)
  Throughput                                       36,584.4 op/s    12,500.5 op/s    2.9x
  Runtime                                          3.80 hr          11.11 hr         2.9x
YCSB (95% read, 5% update)
  Throughput                                       24,704.3 op/s    10,776.4 op/s    2.3x
  Runtime                                          0.56 hr          1.29 hr          2.3x
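The "MapR Speed Increase" column can be reproduced from the two measurement columns. The sketch below shows one way to do so, assuming that for runtimes the ratio is slower time divided by faster time, and for throughput it is higher rate divided by lower rate; the helper names are illustrative and do not come from the deck.

```python
def parse_duration(text: str) -> float:
    """Convert a duration such as '13m 35s' into seconds."""
    seconds = 0.0
    for part in text.split():
        if part.endswith("m"):
            seconds += 60 * float(part[:-1])
        elif part.endswith("s"):
            seconds += float(part[:-1])
        else:
            raise ValueError(f"unrecognized duration component: {part!r}")
    return seconds


def speedup_from_runtime(baseline_seconds: float, candidate_seconds: float) -> float:
    """Lower runtime is better, so the baseline goes in the numerator."""
    return baseline_seconds / candidate_seconds


def speedup_from_throughput(candidate_rate: float, baseline_rate: float) -> float:
    """Higher throughput is better, so the candidate goes in the numerator."""
    return candidate_rate / baseline_rate


if __name__ == "__main__":
    # Terasort total runtime, values taken from the benchmark table.
    total = speedup_from_runtime(parse_duration("26m 6s"), parse_duration("13m 35s"))
    print(f"Terasort total: {total:.1f}x")
    # YCSB 50/50 throughput in op/s, values taken from the benchmark table.
    ycsb = speedup_from_throughput(36584.4, 12500.5)
    print(f"YCSB 50/50 throughput: {ycsb:.1f}x")
```

Rounded to one decimal place, these ratios agree with the figures reported in the table.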

Page 22: Productionizing Hadoop: 7 Architectural Best Practices

Scalability is as much about scaling up as it is about scaling down.

4. Scalability

Page 23: Productionizing Hadoop: 7 Architectural Best Practices


MapR’s Relative Scale

[Chart: file creates/s versus number of files (M), comparing the MapR distribution with another distribution]

Testing completed on 10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm

Scale Advantage: 4600x

Page 24: Productionizing Hadoop: 7 Architectural Best Practices

Firms have barely scratched the surface of what is possible with Big Data analytics. Change is always in the wind.

5. Adaptability

Page 25: Productionizing Hadoop: 7 Architectural Best Practices

I am a data scientist.


Data scientists will constantly have new requirements

Page 26: Productionizing Hadoop: 7 Architectural Best Practices

Production must address and help compress the full Big Data analytics life cycle

Compress… to accelerate the pace of discovery

Page 27: Productionizing Hadoop: 7 Architectural Best Practices


Direct Integration with Existing Applications

100% POSIX compliant

Industry standard APIs - NFS, ODBC, LDAP, REST

More 3rd party solutions

Proprietary connectors unnecessary

Language neutral
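To illustrate what direct integration through standard interfaces means in practice, here is a minimal sketch. It assumes the cluster's file system is exported over NFS and mounted at some local path; the mount location and the helper name `count_records` are hypothetical. The point is only that ordinary file I/O is enough, with no cluster-specific client library.

```python
import tempfile
from pathlib import Path


def count_records(data_dir: Path, pattern: str = "*.csv") -> int:
    """Count non-empty lines in matching files using plain POSIX-style file I/O."""
    total = 0
    for file_path in sorted(data_dir.glob(pattern)):
        with file_path.open("r", encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total


if __name__ == "__main__":
    # A temporary directory stands in for the (hypothetical) NFS mount point,
    # so the sketch runs anywhere; on a real system this would be the mount path.
    with tempfile.TemporaryDirectory() as stand_in_mount:
        mount = Path(stand_in_mount)
        (mount / "events.csv").write_text("a,1\nb,2\n", encoding="utf-8")
        print(count_records(mount))
```

Because the code only touches the standard file API, the same function works unchanged against local disk or any mounted file system.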

Page 28: Productionizing Hadoop: 7 Architectural Best Practices

A breach can devastate an organization's reputation with customers or have legal repercussions.

6. Security

Page 29: Productionizing Hadoop: 7 Architectural Best Practices

All, some, or none of these 6 security properties may apply to Big Data

• Confidentiality: Information is available only to the people intended to use it or see it
• Integrity: Information is only changed in appropriate ways by people authorized to change it
• Readiness: Applications are available when needed and perform acceptably
• Authentication: A person’s identity is determined before access is granted if anonymous people are not allowed
• Authorization: People are allowed or denied access to applications or application resources
• Nonrepudiation: A person cannot perform an action and then later deny performing that action

Page 30: Productionizing Hadoop: 7 Architectural Best Practices


Securing Big Data

Corporate Security Requirements
• Authentication
• Wire-level security
• Authorization (Access Control)
– Standard: UID, GID based
– Granular: File, Table, Column Family, Column, Cell
• Integration into Existing Environments
– Kerberos or non-Kerberos
– Use existing Directory for credential lookups
• Seamless Access with Single Sign-On

Page 31: Productionizing Hadoop: 7 Architectural Best Practices

Every architectural decision has an impact on the return on investment for Big Data analytics platforms.

7. Economy

Page 32: Productionizing Hadoop: 7 Architectural Best Practices

Production Sweet Spot

Beware of pilot programs that don’t scale economically

[Chart: business value of big data versus investment, comparing people-intensive platforms with technology-intensive platforms]

Page 33: Productionizing Hadoop: 7 Architectural Best Practices


Maximizing Economic Value

• Analytics
– Ability to perform broader and deeper analytics
– Real-time streaming
– Mission critical SLAs
– Cloud based analysis
• Ease of Development
• Ease of Administration
• Value of Uptime
• Value of Data Protection
• Hardware Efficiency
• First Class Support

Page 34: Productionizing Hadoop: 7 Architectural Best Practices


One Platform for Big Data

[Diagram: a single platform supporting MapReduce, file-based applications, SQL, database, search, and stream processing across batch, interactive, and real-time workloads, with 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, and multi-tenancy]

Page 35: Productionizing Hadoop: 7 Architectural Best Practices

The 7 qualities of Big Data production platforms

Quality: What it means
1 Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
2 Availability: The readiness of the service or application to perform its functions when needed
3 Performance: The speed to perform functions to meet business and user expectations
4 Scalability: Handle increasing or decreasing volumes of transactions, services, and data
5 Adaptability: The ease with which an application or service can be changed or extended
6 Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7 Economy: Minimize cost to build, operate, & change an application or service without compromising its business value

Page 36: Productionizing Hadoop: 7 Architectural Best Practices

Big Data is about innovation, but not if you don’t productionize it.


Collectors
• Capture
• Store

Journalists
• Reports
• Dashboards

Innovators
• Predictive analytics

Operations | Business Intelligence | Predictive Power

Page 37: Productionizing Hadoop: 7 Architectural Best Practices

Frontier

Big data is about pushing limits. Exponential growth in data means the frontier is vast.

Page 38: Productionizing Hadoop: 7 Architectural Best Practices

Thank you

Mike Gualtieri

[email protected]

Twitter: @mgualtieri