Jeff Fletcher - Building a Hadoop based infrastructure as a service product and ISP

Jeff Fletcher

Building Hadoop Solutions as a Service as an ISP

DEBUNKING THE MYTHS Speaker 12 of 17

Followed by

Barry Devlin

2

A quick history of big data

Source: IDCs Digital Universe Study, sponsored by EMC, December 2012

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

Exabytes

3

Be warned, my son, of anything in addition to them. Of making many books there is no end, and much study wearies the body. Ecclesiastes 12:12

4

A quick summary of big data

“Big Data is a rapid increase in public awareness that data is a valuable resource for discovering useful and sometimes potentially harmful knowledge.” Stephen Few

5

A quick summary of big data

6

Understanding the market

Data collection

Log files

Location data

Public data

Social data

Transaction data

Usage data

Data services

Analytics

Visualisation

Data storage and processing

Hadoop Analytical RDBMS

In memory DBMS

NoSQL NewSQL Streaming

Dedicated servers MCP VDC

Pig Splunk MongoDB

SAP HANA

HBASE

Scribe Cassandra Spark

VoltDB

Hive

Elastisearch Vertica GreenPlum

7

How to build a minimum viable product

Like this!

Not like this…

8

Our minimum viable product (for tomorrow’s demo) 4 CentOS servers on cheaper VMs running CDH5

VDC

Server 1 4 x CPU 32 GB 1 TB




9

An short history of Hadoop

10

A short history of the Apache Hadoop ecosystem

Ambari Provisioning, managing and monitoring Hadoop clusters

Sqoop Data exchange

Flume Log collector

Zookeeper Coordination

Oozie Workflow

Pig Scripting

Mahout Machine learning

R Connectors Statistics

Hive SQL query

YARN Map Reduce v2 Distributed processing framework

HDFS Hadoop distributed file system

Hbase Columnar store

11

Understanding the market

Source: Gartner (July 2013)

12

Why Hadoop and not [insert name here]?

? ?

13

Choosing a Hadoop distribution

14

Fragmentation: Distribution wars

15

The reference architecture

HP 5900 1u HP 5900 1u DL360p Gen8 1u DL360p Gen8 1u DL360p Gen8 1u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u

2 x ToR switches

1 x Management node 1 x Job tracker node 1 x Name node

SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u SL4540 Gen8 4.3u

8 x SL4540 (3 x15) 24 x Worker nodes

16

Web scale considerations Hadoop changes the game | Storage and compute on one platform

The old way

Expensive, special purpose ‘Reliable’ servers Expensive licenced software •  Hard to scale •  Network is a bottleneck •  Only handles relational data •  Difficult to add new fields and data types

Expensive and unattainable US$30 000+ per TB

Network

Compute (RDBMS, EDW) Data Storage (SAN, NAS)

The Hadoop way

Commodity ‘Unreliable’ servers Hybrid open source software •  Scales out forever •  No bottlenecks •  Easy to ingest any data •  Agile data access

Affordable and Attainable US$300 – US$1 000 per TB

Compute (CPU) Memory Storage (Disk)

17

Dedicated vs virtualised

Virtual Hadoop as a service provider

Containers

vs

Hypervisor

18

Data considerations Cost of transferring data

19

Data considerations Security

20

More realistic starting architecture 4 CentOS servers on cheaper VMs running CDH5

Virtual Data Centre





21

Customer migration path

Data & Analytics

Jeff Fletcher - Building a Hadoop based infrastructure as a service product and ISP