The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? Ted Dunning explains what is needed for three important use cases.
1©MapR Technologies - Confidential
Remembering the Future
My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
MapR Technologies
Silicon Valley Startup
– Top investors
– Top technical and management team
  • Google, Microsoft, EMC, NetApp, Oracle
Enterprise quality distribution for Hadoop
Many extensions to basic Hadoop function
Strong supporter of Apache Drill
Philosophy First
What is History?
The study of the past
(what came before now)
What is the future?
(it comes after now)
But the future also has a past!
Do you remember the future?
Some things turned out as expected
Guys wearing Fedoras
Many things are different!
Hadoop has a history
Hadoop also has a future
The Old Future of Hadoop
Map-reduce and HDFS
– more and more, but not really different
Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query
Stands apart from other computing
– Required by HDFS and other limitations
The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should interface directly
Fast and flexible computation
– Drill logical plan language
Example #1: Search Abuse
History matrix
One row per user
One column per thing
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
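The step from a user-by-thing history matrix to a thing-by-thing cooccurrence matrix can be sketched in a few lines. This is only an illustration under stated assumptions: it uses raw pair counts on toy data, whereas a production system such as Mahout may apply further statistical filtering. The names `cooccurrence`, `recommend`, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(history):
    """Count, for each ordered pair of items, how many users have both items."""
    counts = defaultdict(int)
    for items in history.values():
        # Each user (one row of the history matrix) contributes once per item pair.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)

def recommend(history, user, counts):
    """Rank unseen items by their total cooccurrence with the user's items."""
    seen = set(history[user])
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical toy history: one entry per user, listing the things they touched.
history = {
    "u1": ["apple", "banana"],
    "u2": ["apple", "banana", "cherry"],
    "u3": ["banana", "cherry"],
    "u4": ["apple"],
}
counts = cooccurrence(history)
print(recommend(history, "u4", counts))
```

Here user `u4` has only seen `apple`, so the ranking is driven entirely by what cooccurs with `apple` in other users' histories.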
Cooccurrence matrix can also be implemented as a search index
[Diagram: Solr indexing. Labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexers, index shards]
[Diagram: Solr search. Labels: user history, web tier, item meta-data, Solr indexers, index shards]
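The idea of serving cooccurrence through a search index can be illustrated without Solr: each item becomes a document whose indicator field lists the items it cooccurs with, and the user's history becomes the query. The sketch below is a plain-Python stand-in, not Solr's API; `build_index`, `search`, and the sample indicators are hypothetical, and scoring is a simple count of matching terms rather than a real relevance function.

```python
def build_index(indicators):
    """Invert item -> indicator items into indicator -> set of items (a toy inverted index)."""
    index = {}
    for item, related in indicators.items():
        for term in related:
            index.setdefault(term, set()).add(item)
    return index

def search(index, user_history):
    """Use the user's history as query terms and rank items by number of matching terms."""
    hits = {}
    for term in set(user_history):
        for item in index.get(term, ()):
            hits[item] = hits.get(item, 0) + 1
    seen = set(user_history)
    # Drop items the user already has; highest match count first, ties alphabetical.
    ranked = [item for item in hits if item not in seen]
    return sorted(ranked, key=lambda item: (-hits[item], item))

# Hypothetical indicator lists, as an offline cooccurrence job might produce them.
indicators = {
    "cherry": ["apple", "banana"],
    "date": ["banana"],
    "banana": ["apple", "cherry"],
}
index = build_index(indicators)
results = search(index, ["apple", "banana"])
print(results)
```

In this framing the expensive cooccurrence analysis runs offline, and the online recommendation is just a query against the index.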
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
Example #2: Web Technology
[Diagram: fast analysis (Storm). Labels: real-time data, raw logs, analytic output]
[Diagram: large analysis (map-reduce). Labels: raw logs, analytic output]
[Diagram: presentation tier (d3 + node.js). Labels: browser query, analytic output, raw logs]
Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
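One way to picture seamless real-time plus long-time analysis is a query that adds a batch result computed up to some cutoff to a running count of everything since that cutoff. The sketch below is an assumption-laden illustration in plain Python: it does not use Storm or map-reduce, and `batch_counts`, `RealTimeCounts`, `query`, and the event fields `ts` and `page` are all hypothetical.

```python
from collections import Counter

def batch_counts(raw_logs, cutoff):
    """Long-time side: count page hits for all events strictly before the cutoff."""
    return Counter(event["page"] for event in raw_logs if event["ts"] < cutoff)

class RealTimeCounts:
    """Real-time side: count page hits for events at or after the cutoff as they arrive."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def on_event(self, event):
        if event["ts"] >= self.cutoff:
            self.counts[event["page"]] += 1

def query(batch, realtime, page):
    """Presentation tier: the answer is the batch total plus the real-time increment."""
    return batch[page] + realtime.counts[page]

# Hypothetical log events; both sides read the same raw data, so nothing is moved.
logs = [
    {"ts": 1, "page": "/a"},
    {"ts": 2, "page": "/b"},
    {"ts": 3, "page": "/a"},
    {"ts": 4, "page": "/a"},
    {"ts": 5, "page": "/b"},
]
cutoff = 4
batch = batch_counts(logs, cutoff)
realtime = RealTimeCounts(cutoff)
for event in logs:
    realtime.on_event(event)
print(query(batch, realtime, "/a"), query(batch, realtime, "/b"))
```

Because the two sides split the data at the same cutoff, no event is counted twice and none is missed.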
Example #3: Apache Drill
Big Data Processing – Hadoop
                     Batch processing
Query runtime        Minutes to hours
Data volume          TBs to PBs
Programming model    MapReduce
Users                Developers
Google project       MapReduce
Open source project  Hadoop MapReduce
Big Data Processing – Hadoop and Storm
                     Batch processing  Stream processing
Query runtime        Minutes to hours  Never-ending
Data volume          TBs to PBs        Continuous stream
Programming model    MapReduce         DAG (pre-programmed)
Users                Developers        Developers
Google project       MapReduce
Open source project  Hadoop MapReduce  Storm or Apache S4
Big Data Processing – The missing part
                     Batch processing  Interactive analysis     Stream processing
Query runtime        Minutes to hours                           Never-ending
Data volume          TBs to PBs                                 Continuous stream
Programming model    MapReduce                                  DAG (pre-programmed)
Users                Developers                                 Developers
Google project       MapReduce
Open source project  Hadoop MapReduce                           Storm and S4
Big Data Processing – The missing part
                     Batch processing  Interactive analysis     Stream processing
Query runtime        Minutes to hours  Milliseconds to minutes  Never-ending
Data volume          TBs to PBs        GBs to PBs               Continuous stream
Programming model    MapReduce         Queries (ad hoc)         DAG (pre-programmed)
Users                Developers        Analysts and developers  Developers
Google project       MapReduce
Open source project  Hadoop MapReduce                           Storm and S4
Big Data Processing
                     Batch processing  Interactive analysis     Stream processing
Query runtime        Minutes to hours  Milliseconds to minutes  Never-ending
Data volume          TBs to PBs        GBs to PBs               Continuous stream
Programming model    MapReduce         Queries                  DAG
Users                Developers        Analysts and developers  Developers
Google project       MapReduce         Dremel
Open source project  Hadoop MapReduce                           Storm and S4
Big Data Processing
                     Batch processing  Interactive analysis     Stream processing
Query runtime        Minutes to hours  Milliseconds to minutes  Never-ending
Data volume          TBs to PBs        GBs to PBs               Continuous stream
Programming model    MapReduce         Queries                  DAG
Users                Developers        Analysts and developers  Developers
Google project       MapReduce         Dremel
Open source project  Hadoop MapReduce  Apache Drill             Storm and S4
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Simple Architecture
Standard Interfaces
Logical Plan Syntax:
query: [
  {
    op: "sequence",
    do: [
      {
        op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"}
      },
      {
        op: "transform",
        transforms: [
          { ref: "donuts.quanity", expr: "donuts.sales" }
        ]
      },
      {
        op: "filter",
        expr: "donuts.ppu < 1.00"
      },
      …
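To make the scan, transform, filter sequence concrete, here is a toy interpreter over in-memory records. It is not Drill code and does not parse Drill's expression language: expressions are supplied as Python callables, which is an assumption made to keep the sketch short. `run_sequence` and the sample records are hypothetical; the field name `donuts.quanity` is spelled as in the slide.

```python
def run_sequence(steps, sources):
    """Apply each step of a 'sequence' in order to a list of rows."""
    rows = []
    for step in steps:
        op = step["op"]
        if op == "scan":
            # Wrap each source record under the step's ref name, e.g. {"donuts": {...}}.
            rows = [{step["ref"]: dict(record)} for record in sources[step["source"]]]
        elif op == "transform":
            # Compute each expression and store it under the dotted ref "table.field".
            for row in rows:
                for transform in step["transforms"]:
                    table, field = transform["ref"].split(".")
                    row[table][field] = transform["expr"](row)
        elif op == "filter":
            # Keep only the rows for which the expression is true.
            rows = [row for row in rows if step["expr"](row)]
        else:
            raise ValueError(f"unsupported op: {op}")
    return rows

# Hypothetical stand-in for the "local-logs" source.
sources = {
    "local-logs": [
        {"name": "plain", "ppu": 0.55, "sales": 10},
        {"name": "filled", "ppu": 1.25, "sales": 4},
    ]
}
plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform", "transforms": [
        {"ref": "donuts.quanity", "expr": lambda row: row["donuts"]["sales"]},
    ]},
    {"op": "filter", "expr": lambda row: row["donuts"]["ppu"] < 1.00},
]
result = run_sequence(plan, sources)
print(result)
```

The point of the sketch is only that a logical plan is data: a list of operator descriptions that an engine is free to execute however it likes.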
Logical Streaming Example
{
  @id: <refnum>,
  op: "window-frame",
  input: <input>,
  keys: [ <name>, ... ],
  ref: <name>,
  before: 2,
  after: here
}
Input rows:    0 1 2 3 4
Window frames: [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
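The frames in this example can be reproduced with a short function. This is a sketch under two assumptions: `after: here` is read as "the frame ends at the current row" (zero rows after), and the `keys` partitioning is ignored. The function name `window_frames` is hypothetical.

```python
def window_frames(values, before, after=0):
    """For each position, return the slice covering `before` rows back and `after` rows ahead."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)             # clip at the start of the input
        end = min(len(values), i + after + 1)  # clip at the end of the input
        frames.append(values[start:end])
    return frames

frames = window_frames([0, 1, 2, 3, 4], before=2)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Each input row yields exactly one frame, so the operator's output has as many frames as there are input rows.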
Logical Plan
Execution Plan
Representing a DAG
{
  @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>, ...],
  aggregations: [
    { ref: <name>, expr: <aggexpr> },
    ...
  ]
}
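Operators that refer to each other by `@id` and `input` form a DAG that can be evaluated by following the references. The sketch below shows one possible way to do that, assuming only the `simple` aggregate type and Python callables in place of `<aggexpr>`. The `literal` operator, the function names, and the sample rows are hypothetical and exist only to give operator 19 something to read.

```python
from collections import defaultdict

def aggregate(rows, keys, aggregations):
    """Group rows by the key fields and compute each aggregation once per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    out = []
    for key, members in groups.items():
        result = dict(zip(keys, key))
        for agg in aggregations:
            result[agg["ref"]] = agg["expr"](members)
        out.append(result)
    return out

def evaluate(operators, target_id):
    """Evaluate the operator with the given @id, resolving `input` references recursively."""
    by_id = {op["@id"]: op for op in operators}
    cache = {}  # each operator runs once even if several others reference it

    def run(op_id):
        if op_id in cache:
            return cache[op_id]
        op = by_id[op_id]
        if op["op"] == "literal":  # hypothetical source operator holding rows inline
            rows = op["rows"]
        elif op["op"] == "aggregate":
            rows = aggregate(run(op["input"]), op["keys"], op["aggregations"])
        else:
            raise ValueError(f"unsupported op: {op['op']}")
        cache[op_id] = rows
        return rows

    return run(target_id)

operators = [
    {"@id": 18, "op": "literal", "rows": [
        {"type": "glazed", "sales": 3},
        {"type": "plain", "sales": 5},
        {"type": "glazed", "sales": 4},
    ]},
    {"@id": 19, "op": "aggregate", "input": 18, "keys": ["type"],
     "aggregations": [
         {"ref": "total", "expr": lambda rows: sum(r["sales"] for r in rows)},
     ]},
]
totals = evaluate(operators, 19)
print(totals)
```

Referring to inputs by id rather than nesting operators is what lets one operator's output feed several consumers, which a purely nested tree could not express.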
Non-SQL queries
The future is not what we thought it would be
It is better!
Get Involved!
Tweet: #hcj13w #mapr
@ted_dunning
Get Involved!
Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013
Join the Drill project
– [email protected]
– #apachedrill
Contact me:
– [email protected]
– [email protected]
– @ted_dunning
Join MapR (in Japan!)
– [email protected]