Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event

Big Data App Server

Lance Riedel

Big Data App Server

A new application framework for (4 V’s):
•  Volume of raw data (Petabytes)
•  Velocity at which it is being generated/ingested
•  Variety of data sources and schemas
•  Advanced data sciences and analytics that can be applied to extract Value

Big Data App Server Use Cases

•  Log/Machine Analytics
•  Security/Fraud Detection
•  Sensor Data Analytics
•  Financial Analytics
•  Retail Analytics
•  Ad Targeting
•  Recommendation (e.g. Netflix, Amazon)

Components

[Figure: Big Data Platform]

APP SERVER COMPONENTS

Storage and Compute

[Figure: Big Data Platform]

Storage and Compute

Motivation
Google needed to capture the web and process it efficiently
•  Calculate importance of pages, words, domains against each other
•  The more cost-effective they could make it, the more they could process, index, understand

Storage/Compute: Centralized

•  Centralized doesn’t scale!
•  Move a lot of data - bottleneck

Storage/Compute: Sharding

•  Sharding is splitting the problem into isolated chunks
•  Sharding scales, but fails when you need to look across the data
•  E.g. how to calculate term weights or top pages across shards?
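The term-weight problem can be sketched in a few lines. This is a minimal illustration, assuming a toy corpus and an inverse-document-frequency weight; the shard contents and the `doc_freq` and `idf` helpers are made up for the example and are not from the talk.

```python
import math
from collections import Counter

# Hypothetical corpus split into two isolated shards (lists of documents).
shards = [
    ["hadoop scales out", "hadoop stores files"],
    ["search needs weights", "search ranks pages", "hadoop feeds search"],
]

def doc_freq(docs):
    """Count how many documents in `docs` contain each term."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df

def idf(term, df, n_docs):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(n_docs / df[term])

# Each shard sees only its own documents, so its weight for a term is local.
local_idf = [idf("hadoop", doc_freq(s), len(s)) for s in shards]

# A corpus-wide weight needs the counts merged across every shard.
global_df = sum((doc_freq(s) for s in shards), Counter())
global_idf = idf("hadoop", global_df, sum(len(s) for s in shards))
```

The two shards disagree with each other and with the corpus-wide value, so a correct weight requires an extra cross-shard aggregation step that isolated shards do not provide on their own.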


DFS, MapReduce

•  Used a new programming model to distribute computation AND data (NOT sharding)
•  Runs on commodity hardware
•  Failure resilience using software control
•  Easy to calculate across corpus
•  Two parts of a complete solution:
   •  Distributed File System (DFS)
   •  MapReduce

Distributed File System

MapReduce

•  Process where the data resides (data and compute are local to each other)
•  Map (read the data, emit a key and a value)
•  Reduce (group all values per key, perform another operation)
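The map and reduce steps above can be sketched as a single-process word count. This is only an illustration of the programming model, not the Hadoop API; `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are names chosen for this example, and a real cluster would run the map step on the nodes that hold each block of data.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read one record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (kv for line in lines for kv in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = run_job(["big data app server", "big data platform"])
```

Because every value for a key ends up in one reduce call, corpus-wide totals fall out naturally, which is the property the sharding approach lacked.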

Hadoop

•  Open source implementation of Google’s DFS and MapReduce whitepapers
•  Huge ecosystem
•  Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more!

Management

[Figure: Big Data Platform]

Data Ingestion

Motivation
•  Data originating from a variety of sources
•  Some data more valuable than others:
   •  Time-to-live (TTL)
   •  Guarantees on delivery

Data Ingestion: Apache Flume

•  A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
•  Configurable delivery guarantees: routing, replication, failover
•  Extensible sources and sinks allow for pluggable data sources
•  Scales out horizontally to hundreds of thousands of messages/sec

Workflow

Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data

Workflow: Oozie

A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
•  Jobs triggered by time (frequency) and data availability
•  Integrated with the rest of the Hadoop stack
•  Scalable, reliable and extensible system.
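The idea of a dependency graph of work can be sketched with Python's standard `graphlib` module. This is not Oozie's interface (Oozie workflows are defined in XML); the step names and the `run_workflow` helper are hypothetical, and the sketch only shows steps running in dependency order with a log that can be reported on afterwards.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "join": {"clean"},
    "aggregate": {"join"},
    "report": {"aggregate", "clean"},
}

def run_workflow(workflow, actions):
    """Run every step after its dependencies and return the execution log."""
    log = []
    for step in TopologicalSorter(workflow).static_order():
        actions[step]()   # run the work attached to this step
        log.append(step)  # record it so the run can be traced or replayed
    return log

# Placeholder actions; a real step would launch a job.
actions = {step: (lambda: None) for step in workflow}
log = run_workflow(workflow, actions)
```

Keeping the graph as data, separate from the actions, is what makes a run repeatable: the same definition can be re-executed or replayed from a recorded log.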

Schema Management

Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern

Schema: HCatalog

•  A table and storage management layer for Hadoop

•  Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid.

Schema: Avro

•  A data serialization system
•  When Avro data is stored in a file, its schema is stored with it
•  Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
•  Most technologies in the Hadoop stack understand Avro, which gives interoperability and easy data passing
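How same-named, missing, and extra fields get resolved can be sketched without the Avro library. The schema below uses Avro's JSON record syntax, but the `Event` record, its fields, and the `resolve` helper are assumptions made for this example; it is a simplified illustration of the resolution idea, not Avro's implementation.

```python
import json

# Hypothetical reader schema, written in Avro's JSON schema syntax.
reader_schema = json.loads("""
{"type": "record", "name": "Event",
 "fields": [
   {"name": "host", "type": "string"},
   {"name": "status", "type": "int", "default": 200}
 ]}
""")

def resolve(record, reader_schema):
    """Project a written record onto the fields the reader expects."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]       # same-named field: copy it
        elif "default" in field:
            out[name] = field["default"]   # missing field: use the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out                             # extra written fields are dropped

# The writer included "bytes" (unknown to the reader) and omitted "status".
resolved = resolve({"host": "web01", "bytes": 512}, reader_schema)
```

Because the writer's schema travels with the data, a reader can apply this kind of projection without any out-of-band agreement about the file's layout.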

Data Access, Querying

[Figure: Big Data Platform]

Data Access

Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key value store that needs random access to data.

Solution(s)
There are a number of solutions depending on the use case.
•  Google’s BigTable whitepaper
•  SQL has been adapted to Hadoop

Data Access: HBase

•  The Hadoop database: a distributed, scalable, big data store (sorted map), based on Google’s BigTable and backed by Hadoop DFS
•  Linear and modular scalability.
•  Automatic and configurable sharding of tables
•  Automatic failover support
•  Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
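The "sorted map" idea can be sketched in memory: rows are kept ordered by key, so a point lookup and a range scan over adjacent keys are both cheap. This is not the HBase client API; `SortedTable` and its methods are hypothetical names for illustration, and a real store would split the key range across servers.

```python
import bisect

class SortedTable:
    """Toy key-value table that keeps its row keys in sorted order."""

    def __init__(self):
        self._keys = []   # row keys, always sorted
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        """Insert or overwrite a row, keeping the key list sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        """Random access to a single row by key."""
        return self._rows.get(row_key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

table = SortedTable()
table.put("user2#click", 7)
table.put("user1#login", 1)
table.put("user1#click", 3)
user1_rows = table.scan("user1#", "user1$")  # "$" sorts right after "#"
```

Choosing row keys so that related rows sort next to each other, as with the `user1#` prefix here, is what turns a range scan into a useful query.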

Data Access: SQL – Hive, Impala

•  SQL querying of raw data on the distributed file system

•  Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
•  Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
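The kind of query meant here (SELECT, JOIN, and an aggregate) can be illustrated with Python's built-in SQLite as a stand-in. SQLite is not Hive or Impala, and the `hosts` and `requests` tables are invented for this example; the sketch only shows the query shape, not how either engine executes it against files on HDFS.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (host TEXT, datacenter TEXT);
CREATE TABLE requests (host TEXT, bytes INTEGER);
INSERT INTO hosts VALUES ('web01', 'east'), ('web02', 'west');
INSERT INTO requests VALUES ('web01', 100), ('web01', 250), ('web02', 40);
""")

# Join request logs to host metadata, then aggregate per datacenter.
rows = conn.execute("""
SELECT h.datacenter, COUNT(*) AS hits, SUM(r.bytes) AS total_bytes
FROM requests r JOIN hosts h ON r.host = h.host
GROUP BY h.datacenter
ORDER BY h.datacenter
""").fetchall()
```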

Analytics

[Figure: Big Data Platform]

Data Analytics

Motivation
•  Discover the latent value of the data. The core motivation behind Big Data!
•  Clustering, Machine Learning, Correlations, Modeling: the guts of data science, often with extremely diverse use cases.

Solution(s)
A pluggable architecture that can share schemas, but allows for a suite of tools appropriate for the use case

Data Analytics: Example Frameworks

•  Mahout
   •  Machine learning, clustering
•  Pattern - Machine Learning DSL for Hadoop from Cascading
•  0xData
   •  Open source math and prediction engine for big data
•  Sample Algorithms
   •  Random Forest algorithm
   •  K-Means Clustering
   •  Hierarchical Clustering
   •  Linear Regression
   •  Logistic Regression
   •  Support Vector Machines
   •  Artificial Neural Networks
   •  Association Rule Learning

Serving

[Figure: Big Data Platform]

Serving

Motivation
•  Powering applications for end users
•  Search/browse and recommendation engines allow real-time access to data

Serving: Search – Solr Cloud

•  Builds indexes on top of Hadoop
•  Horizontally scalable, fault tolerant
•  Incredible flexibility in indexing options
   •  Tokenization
   •  Field types
   •  Data storage
•  Search options just as flexible
   •  AND, OR, NOT, wildcard
   •  Facets (counts from a derived ontology)
   •  Extensive algorithm and weighting pluggability
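Tokenization, an AND query, and facet counts can be sketched with a tiny in-memory inverted index. This is not Solr's API; the documents and the `build_index`, `search_and`, and `facet` helpers are hypothetical, and the tokenizer is a plain whitespace split chosen to keep the example short.

```python
from collections import Counter, defaultdict

# Hypothetical documents: free text plus one field used for faceting.
docs = {
    1: {"text": "hadoop cluster failure", "type": "log"},
    2: {"text": "hadoop job finished", "type": "log"},
    3: {"text": "cluster login failure", "type": "alert"},
}

def build_index(docs):
    """Tokenize each document and map every token to the ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["text"].lower().split():
            index[token].add(doc_id)
    return index

def search_and(index, terms):
    """AND query: ids of documents that contain every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def facet(docs, doc_ids, field):
    """Facet: count the matching documents per value of `field`."""
    return Counter(docs[i][field] for i in doc_ids)

index = build_index(docs)
hits = search_and(index, ["cluster", "failure"])
type_counts = facet(docs, hits, "type")
```

The facet counts are computed only over the documents that matched the query, which is what lets a search UI show how the current result set breaks down by category.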

Serving: Manas – Matching Engine

•  The Hive’s massively scalable matching engine

•  Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
•  Nothing exists today in the Open Source community that has these capabilities

EXAMPLE APP USE-CASE

App Server Data Flow

SecurityX on App Server

Technology
