DESCRIPTION
Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major drawbacks: query latency and data freshness. At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style, from simple key/value lookups to complex queries. Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind. There is a lot of traction in this area today, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that will enable real-time queries on Internet-scale data sets. After discussing the evolution of common Hadoop platform deployments, a hybrid approach called the lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact.
Enabling Real-time Queries to End Users
Benoit Perroud
SoftShake, Geneva, October 24, 2013
2 Verisign Public
About Me
• Benoit Perroud
• Software Engineer @ Verisign
• Leading Hadoop Infrastructure Team
• Apache Committer
• @killerwhile
Agenda
• What’s going on
• Data lifecycle
• Batch and Realtime
• Hadoop Deployments
• Next Steps
What’s going on
• Mainframes are obsolete, replaced by clusters of commodity hardware
• TenG (10 Gb/s) links are the new standard
• RESTful APIs are everywhere
• Everybody wants to visit Paxos Island
• Firehoses do not only carry water
• Asynchronous non-blocking functional programming is taught at primary school
• NoSQL is the new way to store data at scale
• API management startups are rising (and raising)
• Hadoop keywords boost your LinkedIn profile by 2000%
• Public clouds are responsible for more than 50% of the global Internet traffic
• … and counting …
Source: http://dev.datasift.com/blog/high-scalability
Note: the diagram dates from 2009, so it is probably partially or even completely outdated today.
A Possible Deployment
Data Lifecycle
Producers → Data Ingestion → Data Storage → Data Processing → Data Retrieval → Consumers
• Copying internal and external sources of data into the cluster
• Pre-processing: data cleanup, proper format, …
• Time vs. block-size tradeoff
• Targeted property: Availability
Diagram: Source of Data → Ingesting the flow → Local buffering → Uploading to HDFS → HDFS
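The time vs. block-size tradeoff in the local buffering step can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the class name `LocalBuffer`, the `flush` callback standing in for the upload to HDFS, and the default thresholds are all assumptions. Waiting longer produces larger files, flushing sooner makes data available earlier.

```python
import time
from typing import Callable, List


class LocalBuffer:
    """Accumulate records locally and hand them over as one chunk.

    Hypothetical sketch: `flush` stands in for "upload to HDFS".
    The age check only runs when a record is appended (no background timer).
    """

    def __init__(self,
                 flush: Callable[[bytes], None],
                 max_bytes: int = 128 * 1024 * 1024,  # assumed size threshold
                 max_age_s: float = 300.0,            # assumed time threshold
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._records: List[bytes] = []
        self._size = 0
        self._opened_at = clock()

    def append(self, record: bytes) -> None:
        if not self._records:
            # The first record of a new chunk starts the age counter.
            self._opened_at = self._clock()
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._opened_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._records:
            return
        self._flush(b"".join(self._records))
        self._records = []
        self._size = 0
```

A caller would pass a function that writes the chunk to the cluster as `flush`, and call `flush()` once more on shutdown so the last partial chunk is not lost.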
• Hadoop HDFS is a well-established distributed file system
• The file system is the central component of every data-driven approach
• Space vs. network tradeoff
• Targeted property: Reliability
Diagram: File1 is uploaded to HDFS and spread across DataNode1, DataNode2, DataNode3 and DataNode4
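The space vs. network tradeoff can be made concrete with a little arithmetic. The sketch below assumes a 128 MiB block size and a replication factor of 3; both are configurable in HDFS and neither value comes from the slides. The function name `hdfs_footprint` and the returned keys are made up for the illustration.

```python
def hdfs_footprint(file_bytes: int,
                   block_bytes: int = 128 * 1024 ** 2,  # assumed block size
                   replication: int = 3) -> dict:       # assumed replication
    """Estimate what storing one file costs in blocks, disk and network."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return {
        # Number of blocks the file is split into.
        "blocks": blocks,
        # Each block is stored `replication` times across DataNodes.
        "block_replicas": blocks * replication,
        # Disk space used: the last block only takes its actual size.
        "raw_bytes": file_bytes * replication,
        # Bytes copied between DataNodes to create the extra replicas.
        "pipeline_bytes": file_bytes * (replication - 1),
    }
```

Raising the replication factor improves reliability at the price of more disk space and more traffic between DataNodes on every write.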
• Hadoop MapReduce
• Higher level tools (Hive, Pig, Impala) help
• Data catalog needs to be maintained
Targeted property: parallelism
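As a reminder of the programming model, here is an in-memory sketch of the map, shuffle and reduce phases using a word count. It runs in a single process and does not use the Hadoop API; the function names are chosen for the illustration only. The point is that each `map_phase` call and each `reduce_phase` call is independent, which is what allows the framework to run them in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit one (word, 1) pair per word of the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```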
• Only way to make use of the data
• Business-driven need
• At scale, data needs to be stored the way it is queried
• DPI: Data Programmable Interfaces
Targeted property: user friendliness, reliability
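"Stored the way it is queried" can be illustrated with a precomputed key/value view. In this sketch the view is keyed by (domain, identifier) so that a request shaped like the URI in the description, `/api/v2/domain/identifier`, resolves with a single lookup. The record fields and function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Record = Dict[str, str]
View = Dict[Tuple[str, str], Record]


def build_view(records: Iterable[Record]) -> View:
    """Batch step: index records by the key the API will be queried with."""
    return {(r["domain"], r["identifier"]): r for r in records}


def get(view: View, path: str) -> Optional[Record]:
    """Retrieval step: resolve /api/v2/<domain>/<identifier> in one lookup."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[:2] != ["api", "v2"]:
        return None
    return view.get((parts[2], parts[3]))
```

The cost of the query is paid once, when the view is built, instead of on every request.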
Batch and Realtime
Batch Processing
Timeline (figure):
• t1: Batch 1 starts processing
• t2: Batch 1 ready to be served; queries return data from batch 1
• t3: Batch 2 starts processing
• t4: Batch 2 ready to be served; queries return data from batch 2
• t5: Batch 3 starts processing
• Data gap (shown twice): data not yet covered by the batch currently being served
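The data gap on this timeline can be computed for any query time. The sketch below assumes that a batch contains all data up to the moment it starts processing, which is a simplification made for the illustration; the `Batch` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Batch:
    name: str
    starts_at: float  # assumed to be the timestamp of the newest data included
    ready_at: float   # time at which the results can be served


def served_batch(batches: List[Batch], query_time: float) -> Optional[Batch]:
    """Return the most recently completed batch at query time, if any."""
    ready = [b for b in batches if b.ready_at <= query_time]
    return max(ready, key=lambda b: b.ready_at) if ready else None


def data_gap(batches: List[Batch], query_time: float) -> Optional[float]:
    """Age of the newest data visible to a query issued at query_time."""
    batch = served_batch(batches, query_time)
    return None if batch is None else query_time - batch.starts_at
```

With batch 1 running from t1 to t2 and batch 2 from t3 to t4, a query issued just before t4 still sees data no newer than t1, so the gap is largest right before a new batch becomes available.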
Batch Processing in Detail
Timeline (figure):
• Batch with data from yesterday
• New batch granularity period
• Leave some time for data to finish uploading
• Processing time
• Load results in a data store
• Notify the retrieval system that a new batch is ready to be served
• Query data from the day before yesterday?
Realtime Query
• Interactive query
• REST-like request/response queries
• With SLA
And
• Query the latest version of the data
• Latest means n seconds ago, with n predictable
Hadoop Deployments
Naïve Hadoop Deployment
Diagram:
• Gateway
• Commands: hdfs dfs -put, mapred job …jar, hdfs dfs -get
• NameNode, JobTracker
• DataNode (×10)
• Processing
Industry Hadoop Deployment
Diagram:
• Data In GW, Data Out GW, Metadata Store, Monitoring, Gateway
• Several Hadoop clusters, each with NameNode, JobTracker and DataNodes
• Cluster labels: Processing; Research, Data Science
Realtime Hadoop Deployment
Diagram:
• Data In GW, Gateway, RT Data Out GW
• NameNode, JobTracker, DataNodes
• Processing, RT processing
Hybrid Approach
Timeline (figure):
• t1: Batch 1 starts processing
• t2: Batch 1 ready to be served
• t3: Batch 2 starts processing
• t4: Batch 2 ready to be served
• Complementary data for batch 1
• Complementary data for batch 2
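One way to read the hybrid approach: a query combines the result of the last completed batch with complementary data collected since that batch's cutoff. The sketch below does this for simple per-key event counts. The class `HybridView`, its method names and the in-memory storage are assumptions made for the illustration, not components named in the talk.

```python
from collections import Counter
from typing import Dict, List, Tuple


class HybridView:
    """Serve counts as batch result + events newer than the batch cutoff."""

    def __init__(self) -> None:
        self.batch_counts: Counter = Counter()
        self.batch_cutoff: float = float("-inf")
        self.recent: List[Tuple[float, str]] = []  # (timestamp, key)

    def on_event(self, ts: float, key: str) -> None:
        """Realtime path: keep every incoming event until a batch covers it."""
        self.recent.append((ts, key))

    def on_batch_ready(self, counts: Dict[str, int], cutoff: float) -> None:
        """Batch path: swap in the new batch and drop events it already covers."""
        self.batch_counts = Counter(counts)
        self.batch_cutoff = cutoff
        self.recent = [(ts, k) for ts, k in self.recent if ts > cutoff]

    def query(self, key: str) -> int:
        complementary = sum(1 for _, k in self.recent if k == key)
        return self.batch_counts[key] + complementary
```

The answer stays the same whether an event is still in the realtime part or has already been absorbed by a batch, as long as the batch counts every event up to its cutoff exactly once.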
Realtime Search with Hadoop
Diagram:
• Data In GW, Gateway, RT Data Out GW
• NameNode, JobTracker, DataNodes
• Generate Indexes, Update indexes
• Coordinator
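The "Generate Indexes" and "Update indexes" steps can be sketched with a toy inverted index: a batch job builds the index from the full data set, and new documents are added incrementally afterwards. This is an illustration only. It does not model the Coordinator from the diagram or any specific search engine, and the function names are made up.

```python
from typing import Dict, Set

Index = Dict[str, Set[str]]


def generate_index(docs: Dict[str, str]) -> Index:
    """Batch step: build term -> document ids from the whole corpus."""
    index: Index = {}
    for doc_id, text in docs.items():
        update_index(index, doc_id, text)
    return index


def update_index(index: Index, doc_id: str, text: str) -> None:
    """Realtime step: add one new document to an existing index.

    Terms are only added; re-indexing a changed document would also
    require removing its old terms, which this sketch does not do.
    """
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)


def search(index: Index, query: str) -> Set[str]:
    """Return the ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))
```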
Next Steps
Hadoop Ecosystem
… is moving … really fast
• Interactive Queries: Cloudera Impala, Apache Drill, Tez, …
• Search: SolrCloud, ElasticSearch, Cloudera Search
• Hybrid layer: Twitter Summingbird
• … and counting…
Thanks for the attention!
Follow @killerwhile
bperroud@verisign.com
“Copyright © 2013 VeriSign, Inc. All rights reserved. The VERISIGN word mark, the Verisign logo, and other Verisign trademarks, service marks, and designs that may appear herein are registered or unregistered trademarks or service marks of VeriSign, Inc., and its subsidiaries in the United States and foreign countries. All other trademarks, service marks, and designs are property of their respective owners. Verisign has made efforts to ensure the accuracy and completeness of the information in this document. However, Verisign makes no warranties of any kind (whether express, implied or statutory) with respect to the information contained herein. Verisign assumes no liability to any party for any loss or damage (whether direct or indirect) caused by any errors, omissions, or statements of any kind contained in this document. Further, Verisign assumes no liability arising from the application or use of the products, services, or materials described or referenced herein and specifically disclaims any representation that any such products, services, or materials do not infringe upon any existing or future intellectual property rights.”