Upload
saratoga
View
76
Download
1
Tags:
Embed Size (px)
Citation preview
Jeff Fletcher
Building Hadoop Solutions as a Service as an ISP
DEBUNKING THE MYTHS Speaker 12 of 17
Followed by
Barry Devlin
2
A quick history of big data
Source: IDCs Digital Universe Study, sponsored by EMC, December 2012
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Exabytes
3
Be warned, my son, of anything in addition to them. Of making many books there is no end, and much study wearies the body. Ecclesiastes 12:12
4
A quick summary of big data
“Big Data is a rapid increase in public awareness that data is a valuable resource for discovering useful and sometimes potentially harmful knowledge.” Stephen Few
5
A quick summary of big data
6
Understanding the market
Data collection
Log files
Location data
Public data
Social data
Transaction data
Usage data
Data services
Analytics
Visualisation
Data storage and processing
Hadoop Analytical RDBMS
In memory DBMS
NoSQL NewSQL Streaming
Dedicated servers MCP VDC
Pig Splunk MongoDB
SAP HANA
HBASE
Scribe Cassandra Spark
VoltDB
Hive
Elastisearch Vertica GreenPlum
7
How to build a minimum viable product
Like this!
Not like this…
8
Our minimum viable product (for tomorrow’s demo) 4 CentOS servers on cheaper VMs running CDH5
VDC
Server 1 4 x CPU 32 GB 1 TB
Server 2 4 x CPU 32 GB 1 TB
Server 3 4 x CPU 32 GB 1 TB
Server 4 4 x CPU 32 GB 1 TB
9
An short history of Hadoop
10
A short history of the Apache Hadoop ecosystem
Ambari Provisioning, managing and monitoring Hadoop clusters
Sqoop Data exchange
Flume Log collector
Zookeeper Coordination
Oozie Workflow
Pig Scripting
Mahout Machine learning
R Connectors Statistics
Hive SQL query
YARN Map Reduce v2 Distributed processing framework
HDFS Hadoop distributed file system
Hbase Columnar store
11
Understanding the market
Source: Gartner (July 2013)
12
Why Hadoop and not [insert name here]?
? ?
13
Choosing a Hadoop distribution
14
Fragmentation: Distribution wars
15
The reference architecture
HP 5900 1u HP 5900 1u DL360p Gen8 1u DL360p Gen8 1u DL360p Gen8 1u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u
2 x ToR switches
1 x Management node 1 x Job tracker node 1 x Name node
SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u
8 x SL4540 (3 x15) 24 x Worker nodes
16
Web scale considerations Hadoop changes the game | Storage and compute on one platform
The old way
Expensive, special purpose ‘Reliable’ servers Expensive licenced software • Hard to scale • Network is a bottleneck • Only handles relational data • Difficult to add new fields and data types
Expensive and unattainable US$30 000+ per TB
Network
Compute (RDBMS, EDW) Data Storage (SAN, NAS)
The Hadoop way
Commodity ‘Unreliable’ servers Hybrid open source software • Scales out forever • No bottlenecks • Easy to ingest any data • Agile data access
Affordable and Attainable US$300 – US$1 000 per TB
Compute (CPU) Memory Storage (Disk)
17
Dedicated vs virtualised
Virtual Hadoop as a service provider
Containers
vs
Hypervisor
18
Data considerations Cost of transferring data
19
Data considerations Security
20
More realistic starting architecture 4 CentOS servers on cheaper VMs running CDH5
Virtual Data Centre
Server 1 4 x CPU 32 GB 1 TB
Server 2 4 x CPU 32 GB 1 TB
Server 3 4 x CPU 32 GB 1 TB
Server 4 4 x CPU 32 GB 1 TB
21
Customer migration path