
Jeff Fletcher - Building a Hadoop-based infrastructure as a service product as an ISP


Page 1

Jeff Fletcher

Building Hadoop Solutions as a Service as an ISP

DEBUNKING THE MYTHS Speaker 12 of 17

Followed by

Barry Devlin

Page 2

A quick history of big data

[Chart: exabytes per year, 2009 to 2020; vertical axis from 0 to 40,000]

Source: IDC's Digital Universe Study, sponsored by EMC, December 2012

Page 3

Be warned, my son, of anything in addition to them. Of making many books there is no end, and much study wearies the body. Ecclesiastes 12:12

Page 4

A quick summary of big data

“Big Data is a rapid increase in public awareness that data is a valuable resource for discovering useful and sometimes potentially harmful knowledge.” Stephen Few

Page 5

A quick summary of big data

Page 6

Understanding the market

Data collection: Log files, Location data, Public data, Social data, Transaction data, Usage data

Data services: Analytics, Visualisation

Data storage and processing: Hadoop, Analytical RDBMS, In-memory DBMS, NoSQL, NewSQL, Streaming

Dedicated servers, MCP, VDC

Pig, Splunk, MongoDB, SAP HANA, HBase, Scribe, Cassandra, Spark, VoltDB, Hive, Elasticsearch, Vertica, Greenplum

Page 7

How to build a minimum viable product

Like this!

Not like this…

Page 8

Our minimum viable product (for tomorrow’s demo)

4 CentOS servers on cheaper VMs running CDH5

VDC

Server 1: 4 x CPU, 32 GB RAM, 1 TB disk
Server 2: 4 x CPU, 32 GB RAM, 1 TB disk
Server 3: 4 x CPU, 32 GB RAM, 1 TB disk
Server 4: 4 x CPU, 32 GB RAM, 1 TB disk
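
To put the four-server demo cluster in perspective, the sketch below totals its resources and estimates usable HDFS capacity. It assumes the HDFS default replication factor of 3 and that every disk is given entirely to HDFS; neither is stated on the slide, and the names `Server` and `cluster_totals` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    cpus: int
    ram_gb: int
    disk_tb: float


# The four identical VMs listed on the slide.
CLUSTER = [Server(f"Server {i}", cpus=4, ram_gb=32, disk_tb=1.0) for i in range(1, 5)]

# Assumption: HDFS default replication factor (3); the slide does not say.
REPLICATION_FACTOR = 3


def cluster_totals(servers, replication=REPLICATION_FACTOR):
    """Sum CPU, RAM and disk, and estimate usable HDFS space after replication."""
    raw_tb = sum(s.disk_tb for s in servers)
    return {
        "cpus": sum(s.cpus for s in servers),
        "ram_gb": sum(s.ram_gb for s in servers),
        "raw_disk_tb": raw_tb,
        # Ignores OS, logs and temporary space, so this is an upper bound.
        "usable_hdfs_tb": raw_tb / replication,
    }


totals = cluster_totals(CLUSTER)
print(totals)
```

Under these assumptions only about a third of the raw disk is available for data, which is worth keeping in mind when sizing a demo.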

Page 9

A short history of Hadoop

Page 10

A short history of the Apache Hadoop ecosystem

Ambari: Provisioning, managing and monitoring Hadoop clusters
Sqoop: Data exchange
Flume: Log collector
ZooKeeper: Coordination
Oozie: Workflow
Pig: Scripting
Mahout: Machine learning
R Connectors: Statistics
Hive: SQL query
YARN / MapReduce v2: Distributed processing framework
HDFS: Hadoop distributed file system
HBase: Columnar store
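
As a rough illustration of the processing model behind the "Distributed processing framework" entry, here is a single-process word count split into map, shuffle and reduce steps. It is a teaching sketch only: it does not use any Hadoop API, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data lake"])
print(counts)
```

On a real cluster the map and reduce steps run in parallel across worker nodes, with the input read from HDFS.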

Page 11

Understanding the market

Source: Gartner (July 2013)

Page 12

Why Hadoop and not [insert name here]?


Page 13

Choosing a Hadoop distribution

Page 14

Fragmentation: Distribution wars

Page 15

The reference architecture

2 x ToR switches: HP 5900 (1U each)

1 x Management node, 1 x Job tracker node, 1 x Name node: DL360p Gen8 (1U each)

8 x SL4540 Gen8 (4.3U each, 3 x 15 configuration): 24 x Worker nodes
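
The rack arithmetic can be checked with a short sketch. It reads "3 x 15" as three server nodes per SL4540 chassis with 15 drives each, which is an interpretation rather than something the slide spells out; the variable names are illustrative.

```python
# (label, count, height in rack units), as given on the slide.
RACK = [
    ("ToR switch (HP 5900)", 2, 1.0),
    ("Master node (DL360p Gen8)", 3, 1.0),
    ("SL4540 Gen8 chassis", 8, 4.3),
]

# Assumption: "3 x 15" means 3 server nodes per chassis, 15 drives per node.
NODES_PER_CHASSIS = 3
DRIVES_PER_NODE = 15

chassis_count = next(count for label, count, _ in RACK if "SL4540" in label)
worker_nodes = chassis_count * NODES_PER_CHASSIS
data_drives = worker_nodes * DRIVES_PER_NODE
rack_units = sum(count * height for _, count, height in RACK)

print(f"worker nodes: {worker_nodes}")
print(f"data drives:  {data_drives}")
print(f"rack units:   {rack_units:.1f}U")
```

Under this reading, the eight chassis account for the 24 worker nodes quoted on the slide.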

Page 16

Web scale considerations

Hadoop changes the game | Storage and compute on one platform

The old way

Expensive, special-purpose ‘Reliable’ servers
Expensive licensed software
•  Hard to scale
•  Network is a bottleneck
•  Only handles relational data
•  Difficult to add new fields and data types

Expensive and unattainable: US$30 000+ per TB

Components: Compute (RDBMS, EDW), Network, Data Storage (SAN, NAS)

The Hadoop way

Commodity ‘Unreliable’ servers
Hybrid open source software
•  Scales out forever
•  No bottlenecks
•  Easy to ingest any data
•  Agile data access

Affordable and attainable: US$300 – US$1 000 per TB

Components: Compute (CPU), Memory, Storage (Disk)
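
To make the per-terabyte figures concrete, the sketch below compares storage cost for a given data volume using only the numbers on the slide. The US$30 000 figure is treated as a lower bound because the slide says "30 000+"; the function names are illustrative.

```python
# Figures taken from the slide, in US$ per TB.
OLD_WAY_USD_PER_TB = 30_000        # "US$30 000+": treated as a minimum
HADOOP_USD_PER_TB = (300, 1_000)   # quoted range


def storage_cost(tb):
    """Cost in US$ of storing `tb` terabytes under each approach."""
    low, high = HADOOP_USD_PER_TB
    return {
        "old_way_min": tb * OLD_WAY_USD_PER_TB,
        "hadoop_low": tb * low,
        "hadoop_high": tb * high,
    }


def cost_ratio():
    """How many times cheaper Hadoop is, as a (worst case, best case) pair."""
    low, high = HADOOP_USD_PER_TB
    return OLD_WAY_USD_PER_TB / high, OLD_WAY_USD_PER_TB / low


print(storage_cost(100))
print(cost_ratio())
```

On these figures the gap is between one and two orders of magnitude, before counting any difference in operating cost.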

Page 17

Dedicated vs virtualised

Virtual Hadoop as a service provider

Containers

vs

Hypervisor

Page 18

Data considerations: Cost of transferring data
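
One way to reason about transfer cost is the time needed to move a data set over a given link. The sketch below is a back-of-the-envelope estimate; the 10 TB volume and 100 Mbit/s link are hypothetical inputs, not figures from the talk, and `transfer_hours` is an illustrative name.

```python
MEGABITS_PER_TB = 8_000_000  # decimal units: 1 TB = 10**12 bytes = 8 * 10**6 Mbit


def transfer_hours(data_tb, link_mbps, efficiency=1.0):
    """Hours needed to move `data_tb` terabytes over a `link_mbps` link.

    `efficiency` is the fraction of the nominal link speed actually achieved
    (1.0 means the ideal case with no protocol overhead or contention).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = data_tb * MEGABITS_PER_TB / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: 10 TB over a 100 Mbit/s line.
hours = transfer_hours(10, 100)
print(f"{hours:.1f} hours")
```

Estimates like this help decide whether data should be uploaded over the wire or processed close to where it already lives.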

Page 19

Data considerations: Security

Page 20

More realistic starting architecture

4 CentOS servers on cheaper VMs running CDH5

Virtual Data Centre

Server 1: 4 x CPU, 32 GB RAM, 1 TB disk
Server 2: 4 x CPU, 32 GB RAM, 1 TB disk
Server 3: 4 x CPU, 32 GB RAM, 1 TB disk
Server 4: 4 x CPU, 32 GB RAM, 1 TB disk

Page 21

Customer migration path