Big Data App Server
Lance Riedel
Big Data App Server
A new application framework for the 4 V’s:
• Volume of raw data (petabytes)
• Velocity at which it is being generated/ingested
• Variety of data sources and schemas
• Advanced data sciences and analytics that can be applied to extract Value
Big Data App Server Use Cases
• Log/Machine Analytics
• Security/Fraud Detection
• Sensor Data Analytics
• Financial Analytics
• Retail Analytics
• Ad Targeting
• Recommendation (e.g. Netflix, Amazon)
Components
[Diagram: Big Data Platform]
APP SERVER COMPONENTS
Storage and Compute
[Diagram: Big Data Platform]
Storage and Compute
Motivation
Google needed to capture the web and process it efficiently
• Calculate importance of pages, words, domains against each other
• The more cost-effective they could make it, the more they could process, index, understand
Storage/Compute: Centralized
• Centralized doesn’t scale!
• Move a lot of data: bottleneck
Storage/Compute: Sharding
• Sharding is splitting the problem into isolated chunks
• Sharding scales, but fails when you need to look across the data
  • E.g., how to calculate term weights or top pages across shards?
DFS, MapReduce
• Used a new programming model to distribute computation AND data (NOT sharding)
• Runs on commodity hardware
• Failure resilience using software control
• Easy to calculate across corpus
• Two parts of a complete solution:
  • Distributed File System (DFS)
  • MapReduce
Distributed File System
MapReduce
• Process where the data resides (data and compute are local to each other)
• Map (read the data, emit a key and a value)
• Reduce (group all values per key, perform another operation)
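The map, group-by-key, and reduce steps above can be sketched in a few lines of plain Python. This is only a single-process illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example, and nothing here is distributed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: read the data and emit a (key, value) pair per word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done by the framework in a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: perform another operation on all values of one key (here, a sum)."""
    return key, sum(values)


def run_job(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a small in-memory corpus."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data app server", "big data platform"])
```

In a real deployment each `map_phase` call would run on the node that holds its block of the file, and only the emitted pairs would travel over the network to the reducers.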
Hadoop
• Open source implementation of Google’s DFS and MapReduce whitepapers
• Huge ecosystem
• Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more!
Management
[Diagram: Big Data Platform]
Data Ingestion
Motivation
• Data originating from a variety of sources
• Some data more valuable than others:
  • Time-to-live (TTL)
  • Guarantees on delivery
Data Ingestion: Apache Flume
• A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
• Configurable delivery guarantees: routing, replication, failover
• Extensible sources and sinks allow for pluggable data sources
• Scales out horizontally to hundreds of thousands of messages/sec
Workflow
Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data
Workflow: Oozie
A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
• Jobs triggered by time (frequency) and data availability
• Integrated with the rest of the Hadoop stack
• Scalable, reliable and extensible system
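To make the idea of a dependency graph of work concrete, here is a minimal Python sketch that runs steps in dependency order and keeps a log for reporting. It is not Oozie's interface (Oozie workflows are defined in XML); the step names, the `run_workflow` helper, and the use of the standard library's `graphlib` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow: Dict[str, Set[str]] = {
    "ingest_logs": set(),
    "ingest_users": set(),
    "join": {"ingest_logs", "ingest_users"},
    "report": {"join"},
}


def run_workflow(
    graph: Dict[str, Set[str]], actions: Dict[str, Callable[[], None]]
) -> List[str]:
    """Run every step after all of its dependencies and return the execution log."""
    log: List[str] = []
    for step in TopologicalSorter(graph).static_order():
        actions[step]()   # execute the step's work
        log.append(step)  # record it so the run can be reported on or replayed
    return log


# Placeholder actions that do nothing; a real step would launch a job.
actions = {name: (lambda: None) for name in workflow}
execution_log = run_workflow(workflow, actions)
```

Because the log records the order in which steps ran, the same graph can be replayed from any step, which is the repeatability and traceability the motivation slide asks for.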
Schema Management
Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern
Schema: HCatalog
• A table and storage management layer for Hadoop
• Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid.
Schema: Avro
• A data serialization system
• When Avro data is stored in a file, its schema is stored with it
• Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
• Most technologies in the Hadoop stack understand Avro, which gives interoperability and easy data passing
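The field-resolution rule above can be illustrated with a small Python sketch. It does not use the Avro library; it only mimics the idea that a reader's schema is matched against the written record by field name, with extra fields ignored and missing fields filled from a default. The `resolve` function and the example field names are hypothetical.

```python
from typing import Any, Dict, List


def resolve(record: Dict[str, Any], reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a written record onto the reader's fields, matching by name."""
    resolved: Dict[str, Any] = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Same-named field: take the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field missing from the written data: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields present only in the written record are simply not copied.
    return resolved


written = {"user": "alice", "clicks": 3, "legacy_flag": True}
reader_fields = [
    {"name": "user"},
    {"name": "clicks"},
    {"name": "country", "default": "unknown"},
]
resolved = resolve(written, reader_fields)
```

Here `legacy_flag` is dropped because the reader does not ask for it, and `country` is filled in from the reader's default.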
Data Access, Querying
[Diagram: Big Data Platform]
Data Access
Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.
Solution(s)
There are a number of solutions depending on the use case.
• Google’s BigTable whitepaper
• SQL has been adapted to Hadoop
Data Access: HBase
• The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google’s BigTable and backed by Hadoop DFS
• Linear and modular scalability
• Automatic and configurable sharding of tables
• Automatic failover support
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
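The "sorted map" description can be sketched as a toy in-memory table in Python. This is not the HBase client API; the `SortedTable` class, its method names, and the row-key layout are assumptions for illustration. The point is that keeping row keys sorted gives both random access by key and cheap range scans.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]


class SortedTable:
    """Toy sorted map: row key -> {column: value}, with keys kept in sorted order."""

    def __init__(self) -> None:
        self._keys: List[str] = []       # sorted row keys
        self._rows: Dict[str, Row] = {}  # row key -> columns

    def put(self, row_key: str, column: str, value: str) -> None:
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # keep the key list sorted
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: str) -> Optional[Row]:
        """Random access to a single row."""
        return self._rows.get(row_key)

    def scan(self, start: str, stop: str) -> List[Tuple[str, Row]]:
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


table = SortedTable()
table.put("video#001", "title", "intro")
table.put("user#002", "name", "bob")
table.put("user#001", "name", "alice")

# "$" sorts immediately after "#", so this range covers every key starting with "user#".
users = table.scan("user#", "user$")
```

Splitting the sorted key space into contiguous ranges is also what makes automatic sharding of a table straightforward: each range can be served by a different node.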
Data Access: SQL – Hive, Impala
• SQL querying of raw data on the distributed file system
• Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
• Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
Analytics
[Diagram: Big Data Platform]
Data Analytics
Motivation
• Discover the latent value of the data. The core motivation behind Big Data!
• Clustering, Machine Learning, Correlations, Modeling: the guts of Data Science, often with extremely diverse use cases.
Solution(s)
A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case
Data Analytics: Example Frameworks
• Mahout
  • Machine learning, clustering
• Pattern - Machine Learning DSL for Hadoop from Cascading
• 0xData
  • Open source math and prediction engine for big data
• Sample Algorithms
  • Random Forest algorithm
  • K-Means Clustering
  • Hierarchical Clustering
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Artificial Neural Networks
  • Association Rule Learning
Serving
[Diagram: Big Data Platform]
Serving
Motivation
• Powering applications for end users
• Search/browse and recommendation engines allow real-time access to data
Serving: Search – Solr Cloud
• Builds indexes on top of Hadoop
• Horizontally scalable, fault tolerant
• Incredible flexibility in indexing options
  • Tokenization
  • Field types
  • Data storage
• Search options just as flexible
  • AND, OR, NOT, wildcard
  • Facets (counts from a derived ontology)
  • Extensive algorithm and weighting pluggability
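As a rough illustration of the indexing and search options listed above, the following Python sketch builds a tiny inverted index, answers AND/OR queries, and computes facet counts over the hits. It is not Solr's API; the sample documents, the whitespace tokenizer, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Set

# Hypothetical documents: a text field to index and a category field to facet on.
docs = {
    1: {"text": "hadoop distributed file system", "category": "storage"},
    2: {"text": "hadoop mapreduce compute", "category": "compute"},
    3: {"text": "distributed search index", "category": "serving"},
}


def build_index(documents: Dict[int, Dict[str, str]]) -> Dict[str, Set[int]]:
    """Tokenize each text field on whitespace and map every token to its doc ids."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in doc["text"].lower().split():
            index[token].add(doc_id)
    return index


def search_and(index: Dict[str, Set[int]], *terms: str) -> Set[int]:
    """Documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index: Dict[str, Set[int]], *terms: str) -> Set[int]:
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


def facet_counts(documents: Dict[int, Dict[str, str]], hits: Set[int], field: str) -> Counter:
    """Count the hits per value of a field, as a facet would."""
    return Counter(documents[doc_id][field] for doc_id in hits)


index = build_index(docs)
hadoop_hits = search_or(index, "hadoop")
facets = facet_counts(docs, hadoop_hits, "category")
```

Swapping the whitespace split for a different tokenizer changes what can be matched, which is the kind of flexibility the indexing options refer to.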
Serving: Manas – Matching Engine
• The Hive’s massively scalable matching engine
• Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
• Nothing exists today in the open source community that has these capabilities
EXAMPLE APP USE-CASE
App Server Data Flow
SecurityX on App Server