Big data, Hadoop - lunchtime talk 2015.02.26

Big Data Consulting

Hadoop, big data
Robert Gibbon - www.bigindustries.be

The information age

■ The “economic third wave” has badly hit many blue chip organisations
■ Manufacturing and retail are in rapid decline in Europe and the US
■ Tech, connectivity and information are restructuring our societies
■ Levels of political and social engagement have surged
■ Trading platforms are empowering small businesses

Innovation

■ Mass-production hates innovation
■ Innovation means change – a huge cost with little benefit for production-line economies
■ Continuous improvement mentality
■ Knowledge services need to innovate to differentiate
■ Change in a virtual world can be cheap and yield huge rewards
■ Continuous reinvention mentality

The rover bicycle, 1885

Big data viz. innovation

■ In a free market like the web, innovation can open up new opportunities
■ Consumer access to grid computing tech is a recent innovation
■ Grid computing opens up new opportunities that would otherwise not be viable
■ Ideal for ventures architected around the long-tail economic model

The future - thingternet

■ The internet of things is with us
■ Billions of connected devices, even digital tattoos

Big data viz. internet of things

■ Billions of connected devices create a huge amount of data

■ Until big data tech, Internet of Things was nearly impossible to monetize

The internet of things is a wild west

■ Many new, unsolved challenges
    ■ Privacy
    ■ Governance
    ■ Civil liberties

■ New challenges = new opportunities

let's get back to hadoop

Hadoop refresher

■ FOSS solution for processing terabytes to petabytes of data
■ Using arrays of regular servers
■ Hadoop core:
    ■ HDFS - a scale-out file system
    ■ YARN - a scale-out application resource manager
■ Runtimes:
    ■ Spark, Impala, Flink, MapReduce, Kafka, SolrCloud etc.
■ Components for data protection, access control and operational management
■ NoSQL databases
    ■ HBase, Accumulo, Cassandra etc.
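MapReduce, one of the runtimes named in the Hadoop refresher, is easiest to picture with a word count. The following is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use the Hadoop API; the function names and the sample input are assumptions made for illustration, and each string merely stands in for one block of a file stored on HDFS.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    """Run map, shuffle and reduce over all blocks in one process."""
    pairs = (pair for block in blocks for pair in map_phase(block))
    return reduce_phase(shuffle(pairs))


# Hypothetical input: each string stands in for one HDFS block.
blocks = ["big data big ideas", "data on regular servers"]
counts = word_count(blocks)
print(counts)
```

On a real cluster the map calls would run in parallel on the servers holding each block, and the shuffle would move the pairs across the network to the reducers.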

what can you do with hadoop?

Storage

■ Pure online data storage, with no other processing
■ Low cost per-GB for petascale online storage
■ Option to directly query and analyse the data is available if required.

Search

■ Example: huge, constantly changing catalogue of products – like eBay and Amazon
■ SolrCloud – an advanced search engine serving terabytes of content from Hadoop

Messaging

■ A distributed message queue backed by a Hadoop cluster - Apache Kafka
■ Elastically scalable
■ Messages are persisted and replicated for durability
■ TBs of messages per broker with predictable performance
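The idea of messages that are persisted and replicated can be pictured as an append-only log that is copied to several replicas, with readers tracking their position by offset. The sketch below is a toy in-memory model of that idea only. It is not the Kafka client API; the class names, the replica count and the messages are assumptions made for illustration.

```python
class PartitionLog:
    """An append-only list of messages addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages messages starting at offset."""
        return self._messages[offset:offset + max_messages]


class ReplicatedPartition:
    """Writes every message to all replicas so any one of them can serve reads."""

    def __init__(self, replicas=3):
        self.replicas = [PartitionLog() for _ in range(replicas)]

    def publish(self, message):
        """Append to every replica; all replicas assign the same offset."""
        offsets = [replica.append(message) for replica in self.replicas]
        return offsets[0]

    def read(self, offset, replica=0):
        """Read from the chosen replica, e.g. after another one has failed."""
        return self.replicas[replica].read(offset)


partition = ReplicatedPartition(replicas=3)
first = partition.publish("page_view user=1")
second = partition.publish("page_view user=2")
print(partition.read(second, replica=2))
```

Because messages are never modified after being appended, a consumer that remembers its last offset can resume from any surviving replica.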

Targeting

■ Personalised content for users
■ Generates and consumes a huge amount of log data
    ■ for reporting
    ■ for predictive analysis
■ Predictive analysis is compute intensive
■ Can be TBs of data per day

Self-service Business Intelligence

■ Enterprise Data Hub paradigm
■ A very popular emerging use case

■ Business users directly access raw datasets using specialised discovery tools built on top of Hadoop - DataMeer, Platfora and others

Data warehousing

■ Migration of Enterprise Data Warehouse to Hadoop
■ Big cost savings versus traditional vendors like Oracle and Teradata

Machine learning

■ Predictive analytics with Spark MLlib or Revolution R Enterprise

■ Automatically predict component failures for proactive intervention

Big Database

■ Low latency, high throughput, high concurrency, high volume
■ Algotrading
■ Realtime ad auctions
■ Volumes of 200BN transactions per day reliably served in real time
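To put the 200BN transactions per day figure in perspective, it helps to convert it to an average rate per second. The short calculation below does only that; it assumes the load is spread evenly over 24 hours, which real traffic rarely is, so peak rates would be higher.

```python
TRANSACTIONS_PER_DAY = 200_000_000_000  # the 200BN figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400 seconds

# Average rate, assuming an even spread over the whole day.
average_per_second = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY

print(f"{average_per_second:,.0f} transactions per second on average")
```

That works out to roughly 2.3 million transactions per second on average.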

Device management

■ Analysis and response to threats detected by SPI module on remote switch
■ Automated systems management – shut down heating when nobody home to reduce heating bill and emissions
■ Monitor driver propensity to break the speed limit - offer lower insurance premiums to good drivers
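The driver-monitoring example can be reduced to a simple measure: the share of speed readings that exceed the local limit. The sketch below shows one way this might be computed. The function names, the 5% threshold, the 10% discount and the sample readings are all hypothetical values chosen for illustration, not figures from any insurer.

```python
def speeding_ratio(readings):
    """Return the fraction of (speed_kmh, limit_kmh) readings above the limit."""
    if not readings:
        return 0.0
    over_limit = sum(1 for speed, limit in readings if speed > limit)
    return over_limit / len(readings)


def premium_discount(ratio, threshold=0.05, discount=0.10):
    """Grant the (hypothetical) discount when the speeding ratio is low enough."""
    return discount if ratio <= threshold else 0.0


# Hypothetical telemetry: (measured speed, speed limit) in km/h.
frequent_speeder = [(48, 50), (52, 50), (90, 100), (100, 100)]
careful_driver = [(45, 50)] * 20

print(premium_discount(speeding_ratio(frequent_speeder)))
print(premium_discount(speeding_ratio(careful_driver)))
```

At scale the same aggregation would run over the telemetry of every connected vehicle, which is where a cluster becomes useful.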

hadoop - mature?

Choice of vendors

Solid operational management

Impala v Teradata

Free grid computing

Free scale-out database

Growing commercial ecosystem

Secure and available

■ RPC authentication and encryption with PKI
■ Data encryption at rest and in transit
■ Kerberos resource access control - HDFS, YARN
■ Table cell level permissions - Accumulo
■ Online snapshot backups
■ No SPoF

thanks for listening
be.linkedin.com/in/robertgibbon