Productionizing Hadoop: 7 Architectural Best Practices
Mike Gualtieri, Principal Analyst
#BigData
© 2013 Forrester Research, Inc. Reproduction Prohibited
7% 13% 7% 17% 31%
Implemented, not expanding Expanding/upgrading implementation
Planning to implement in the next 12 months Planning to implement in more than 1 year
Interested but no plans
Base: 634 business intelligence users and planners
“What best describes your firm's current usage/plans to adopt Big Data technologies and solutions?”
Source: Forrsights BI/Big Data Survey, Q3 2012
Big Data has momentum
20% have implemented some big data technology
37% are planning some big data technology project
“Big Data is the frontier of a firm’s ability to store, process,
and access (SPA) all of the data it needs to operate, make
decisions, reduce risks, and serve customers.”
DEFINITION
FORRESTER
Other: 2%
Don't know: 3%
Earlier generation technology is too expensive: 21%
The velocity of data is too high for earlier technologies: 22%
We can achieve (or are achieving) significant cost reductions by changing our data management and analytic architecture: 28%
Data changes or becomes available much faster than we can process in support of business decisions: 32%
The number of data formats that we must be able to deal with exceeds our ability to cost-effectively integrate: 32%
Analysis requirements change too fast to keep up with: 36%
We want to access data that was not accessible for us with existing technologies: 36%
Data volumes have grown beyond what we can cost effectively manage: 38%
We don't know what our entire data universe contains, we need new ways to explore data and discover patterns and insights, before we even understand what we are looking for: 41%
“What are the main business and technical requirements or inadequacies of earlier-generation BI technologies that lead you to consider new BI techniques and technologies?”
Firms seek more value in data, struggle to wrangle it, & seek lower cost solutions
Integrating data from a variety of data sources is a top challenge
Big Data architecture must support three core capabilities (SPA):
• Store: Can you capture and store all your data++?
• Process: Do you have the compute power to cleanse, enrich, & analyze your data++?
• Access: Can you retrieve, search, integrate, and visualize all your data++?
#Production
How can you keep your Big Data operations running smoothly?
Production
Productionizing Big Data can be complex because of:
• Integration with heterogeneous infrastructure
• Use of multiple analytical software applications
• Reliance on 3rd-party cloud services
• Always available modeling and visualization sandboxes
• Increasing volume, velocity, variety of data from multiple data sources
• Compute intensive analytics
Big Data production requires sound architecture.
The 7 architectural qualities of Big Data production platforms
Quality: What it means
1 Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
2 Availability: The readiness of the service or application to perform its functions when needed
3 Performance: The speed to perform functions to meet business and user expectations
4 Scalability: Handle increasing volumes of data, transactions, services, and applications.
5 Adaptability: The ease with which an application or service can be changed or extended
6 Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7 Economy: Minimize cost to build, operate, & change an application or service without compromising its business value
Operational experience is critical to production.
1. Experience
Best practices: User experience
The usefulness, usability, and desirability of applications require ease of use combined with power
Developers
• Standard Tools
• Linux Commands
• Direct Access with NFS
Administrators
• Visibility
• Self Healing
• Architectural Simplicity
Easy Workflow Management
Workload Automation with Cisco Tidal Enterprise Scheduler
• Detailed, dependency-driven event execution
• Point-and-click dynamic variables and parameters
• Scalable, extensible architecture
• Granular notification and alerts
High-availability strategy and architecture are often overlooked in proofs of concept.
2. Availability
What does high availability mean?
Uptime %* Downtime per year
99.999% (5 nines) 5.26 minutes
99.99% (4 nines) 52.6 minutes
99.5% 1.83 days
99% (2 nines) 3.65 days
98% 7.30 days
95% 18.25 days
*Uptime calculations assume no scheduled downtime.
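The downtime figures in the table follow directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a 365-day year and no scheduled downtime (the function names are illustrative, not from the slides):

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumes a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


def downtime_days_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in days, implied by an uptime percentage."""
    return downtime_minutes_per_year(uptime_percent) / MINUTES_PER_DAY


if __name__ == "__main__":
    # Print both units for every uptime level listed in the table.
    for uptime in (99.999, 99.99, 99.5, 99.0, 98.0, 95.0):
        minutes = downtime_minutes_per_year(uptime)
        days = downtime_days_per_year(uptime)
        print(f"{uptime}% uptime: {minutes:.2f} minutes ({days:.2f} days) per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the step from 99.99% to 99.999% is so demanding operationally.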
©MapR Technologies - Confidential
High Availability and Dependability
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• 99999’s of uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end check summing
• Strong consistency
• Data safe
• Mirror across sites to meet Recovery Time Objectives
Unexpected latencies can emerge from rapid fluctuations in volume, velocity, & variety of data and interactions of the larger Big Data ecosystem.
3. Performance
World Record Performance
New Minute Sort World Record
1.5 TB in 1 minute
2103 nodes
Previous Record: 1.4 TB
Benchmark MapR 2.1.1 CDH 4.1.1 MapR Speed Increase
Terasort (1x replication, compression disabled)
Total 13m 35s 26m 6s 1.9x
Map 7m 58s 21m 8s 2.7x
Reduce 13m 32s 23m 37s 1.7x
DFSIO throughput/node
Read 1003 MB/s 656 MB/s 1.5x
Write 924 MB/s 654 MB/s 1.4x
YCSB (50% read, 50% update)
Throughput 36,584.4 op/s 12,500.5 op/s 2.9x
Runtime 3.80 hr 11.11 hr 2.9x
YCSB (95% read, 5% update)
Throughput 24,704.3 op/s 10,776.4 op/s 2.3x
Runtime 0.56 hr 1.29 hr 2.3x
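The "MapR Speed Increase" column can be reproduced from the other two columns. A minimal sketch, assuming that elapsed-time rows use the ratio of the slower time to the faster time and throughput rows use the ratio of the higher rate to the lower rate (the helper names are illustrative):

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration written like '13m 35s' into seconds."""
    match = re.fullmatch(r"(\d+)m\s+(\d+)s", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


def time_speedup(mapr_time: str, other_time: str) -> float:
    """Speedup for elapsed-time rows, where a lower value is better."""
    return to_seconds(other_time) / to_seconds(mapr_time)


def throughput_speedup(mapr_rate: float, other_rate: float) -> float:
    """Speedup for throughput rows, where a higher value is better."""
    return mapr_rate / other_rate


if __name__ == "__main__":
    # Values copied from the benchmark table above.
    print(f"Terasort total: {time_speedup('13m 35s', '26m 6s'):.1f}x")
    print(f"DFSIO read:     {throughput_speedup(1003, 656):.1f}x")
    print(f"YCSB 50/50:     {throughput_speedup(36584.4, 12500.5):.1f}x")
```

Keeping the direction of "better" explicit per metric avoids inverting the ratio when time-based and rate-based rows sit in the same table.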
Scalability is as much about scaling up as it is about scaling down.
4. Scalability
MapR’s Relative Scale
Testing completed on 10 node cluster, 2x Quad-Core, 24G DRAM 12 x 1TB SATA Drives @ 7200 rpm
[Chart: file creates/s versus number of files (M), comparing the MapR distribution with the other distribution]
Scale Advantage: 4600x
Firms have barely scratched the surface of what is possible with Big Data analytics. Change is always in the wind.
5. Adaptability
I am a data scientist.
Data scientists will constantly have new requirements
Compress…
…to accelerate the pace of discovery
Production must address and help compress the full Big Data analytics life cycle
Direct Integration with Existing Applications
100% POSIX compliant
Industry standard APIs - NFS, ODBC, LDAP, REST
More 3rd party solutions
Proprietary connectors unnecessary
Language neutral
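As an illustration of what POSIX-compliant NFS access implies, the sketch below reads and writes cluster data with nothing but ordinary file calls. It assumes the cluster is exposed as a mounted directory; the `CLUSTER_MOUNT` environment variable and the file name are hypothetical, and a temporary directory stands in for the mount so the example runs anywhere.

```python
import os
import tempfile
from pathlib import Path


def append_record(data_file: Path, record: str) -> None:
    """Append one line to a file using plain POSIX-style file I/O."""
    with data_file.open("a", encoding="utf-8") as handle:
        handle.write(record + "\n")


def count_records(data_file: Path) -> int:
    """Count the lines in a file with the same standard file API."""
    with data_file.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # CLUSTER_MOUNT is a hypothetical variable naming the NFS mount point;
    # a temporary directory is used when it is not set.
    mount_point = Path(os.environ.get("CLUSTER_MOUNT", tempfile.mkdtemp()))
    events = mount_point / "events.log"  # hypothetical file name
    append_record(events, "user=42 action=login")
    append_record(events, "user=42 action=logout")
    print(count_records(events), "records in", events)
```

Because the code depends only on the standard library, the same logic could be written in any language with file I/O, which is the point of the "language neutral" and "proprietary connectors unnecessary" bullets.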
A breach can devastate an organization's reputation with customers or have legal repercussions.
6. Security
All, some, or none of these 6 security properties may apply to Big Data
• Confidentiality: Information is available only to the people intended to use it or see it
• Integrity: Information is only changed in appropriate ways by people authorized to change it
• Readiness: Applications are available when needed and perform acceptably
• Authentication: A person’s identity is determined before access is granted if anonymous people are not allowed
• Authorization: People are allowed or denied access to applications or application resources
• Nonrepudiation: A person cannot perform an action and then later deny performing that action
Securing Big Data
Corporate Security Requirements
Authentication
Wire-level security
Authorization (Access Control)
• Standard: UID, GID based
• Granular: File, Table, Column Family, Column, Cell
Integration into Existing Environments
• Kerberos or non-Kerberos
• Use existing Directory for credential lookups
Seamless Access with Single Sign-On
Every architectural decision has an impact on the return on investment for Big Data analytics platforms.
7. Economy
Production Sweet Spot
Beware of pilot programs that don’t scale economically
[Chart: business value of big data versus investment, for people-intensive platforms and technology-intensive platforms]
Maximizing Economic Value
Analytics
– Ability to perform broader and deeper analytics
– Real-time streaming
– Mission critical SLAs
– Cloud based analysis
Ease of Development
Ease of Administration
Value of Uptime
Value of Data Protection
Hardware Efficiency
First Class Support
One Platform for Big Data
…
99.999% HA
Data Protection
Disaster Recovery
Scalability & Performance
Enterprise Integration
Multi-tenancy
MapReduce, File-Based Applications, SQL, Database, Search, Stream Processing
Batch, Interactive, Real-time
The 7 qualities of Big Data production platforms
Quality: What it means
1 Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
2 Availability: The readiness of the service or application to perform its functions when needed
3 Performance: The speed to perform functions to meet business and user expectations
4 Scalability: Handle increasing or decreasing volumes of transactions, services, and data
5 Adaptability: The ease with which an application or service can be changed or extended
6 Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
7 Economy: Minimize cost to build, operate, & change an application or service without compromising its business value
Big Data is about innovation, but not if you don’t productionize it.
Collectors
• Capture
• Store
Journalists
• Reports
• Dashboards
Innovators
• Predictive analytics
Operations Business Intelligence
Predictive Power
Frontier
Big data is about pushing limits. Exponential growth in data means the frontier is vast.