alex-lefur
©2014 Cloudera, Inc. All rights reserved.
About Me
• 15 years of moving data around
• Formerly a consultant
• Now a Cloudera engineer:
  – Flume
  – Sqoop
  – Kafka
Getting Data from Kafka to Hadoop
There are only bad options.
It's about finding the best one.
Camus
[Diagram: Camus job flow, steps labeled A to I. Labeled elements: Kafka, ZooKeeper, Setup, Topic Offsets, three Tasks, In-process Avro Files, Audit Counts, Clean Up, HDFS, Other Processes, Other Systems.]
Missing in Action
• Kafka has no MR layer
  – InputFormat, OutputFormat, Utils…
• Sqoop is a generic batch ingest framework
  – Why no Kafka?
How does Flume work?
Flume Agent: Sources -> Interceptors -> Selectors -> Channels -> Sinks
• Sources: Twitter, logs, webserver, Kafka…
• Interceptors: mask, re-format, validate…
• Selectors: DR, critical
• Channels: memory, file
• Sinks: HDFS, HBase, Solr, Kafka
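The agent stages named on this slide (sources, interceptors, selectors, channels, sinks) can be illustrated with a toy model. This is only a sketch in plain Python, not Flume's API or configuration format: the function names, the event layout (`headers` plus `body`), and the `priority` header are all assumptions made up for illustration.

```python
from collections import deque

def mask_interceptor(event):
    """Interceptor: mask a sensitive field before the event is routed."""
    body = dict(event["body"])
    if "user" in body:
        body["user"] = "***"
    return {"headers": event["headers"], "body": body}

def selector(event):
    """Selector: choose channel names from a header (hypothetical 'priority' header)."""
    if event["headers"].get("priority") == "critical":
        return ["critical"]
    return ["default"]

def run_agent(source_events, interceptors, select, channels, sinks):
    # Source side: run each event through the interceptors,
    # then put it on every channel the selector picks.
    for event in source_events:
        for intercept in interceptors:
            event = intercept(event)
        for name in select(event):
            channels[name].append(event)
    # Sink side: drain each channel into the sink attached to it.
    for name, channel in channels.items():
        while channel:
            sinks[name].append(channel.popleft())

events = [
    {"headers": {"priority": "critical"}, "body": {"user": "alice", "msg": "disk full"}},
    {"headers": {}, "body": {"user": "bob", "msg": "login"}},
]
channels = {"critical": deque(), "default": deque()}  # stand-ins for memory channels
sinks = {"critical": [], "default": []}               # stand-ins for HDFS, HBase, ...
run_agent(events, [mask_interceptor], selector, channels, sinks)
```

The channel sits between the source side and the sink side, so in this model the two halves of `run_agent` could run at different speeds without losing events that are already buffered.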
Spark Streaming
[Diagram: Source -> raw input DStream -> RDDs, shown for the pre-first, first, and second batch. Each batch's RDD goes through Filter -> Count -> Print in a single pass.]
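The batch-by-batch Filter, Count, Print flow in this diagram can be mimicked in a few lines of plain Python. This is a sketch of the micro-batch idea only and does not use Spark: real Spark Streaming cuts batches by time interval, whereas this toy version cuts them by record count, and the helper names and sample log lines are assumptions for illustration.

```python
def micro_batches(records, batch_size):
    """Chop a record stream into batches (stand-in for time-based batch intervals)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def process_batch(batch):
    """Filter then Count, done in a single pass over one batch."""
    return sum(1 for line in batch if "ERROR" in line)

lines = ["ERROR a", "ERROR b", "INFO c", "INFO d", "ERROR e"]
counts = [process_batch(batch) for batch in micro_batches(lines, 2)]

# Print step: one result per batch.
for number, count in enumerate(counts, start=1):
    print(f"batch {number}: {count} error line(s)")
```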
Storm
[Diagram: word-count topology. Spout layer (spouts reading from the source) -> fan out -> layer 1 (split-words bolts) -> shuffle -> layer 2 (count bolts).]
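The two-layer word count in this diagram can be sketched in plain Python, without Storm itself. Everything here is an assumption for illustration: the sample sentences, the function names, and in particular the routing step, which sends each word to a counter chosen by a hash of the word so that every occurrence of a word reaches the same counter.

```python
import zlib
from collections import Counter

def spout():
    """Spout: emit sentences (stand-in for reading from a source such as Kafka)."""
    yield "the quick brown fox"
    yield "the lazy dog"
    yield "the fox"

def split_bolt(sentence):
    """Layer 1 bolt: split a sentence into words."""
    return sentence.split()

def route(word, num_counters):
    """Pick a counter by a deterministic hash of the word."""
    return zlib.crc32(word.encode("utf-8")) % num_counters

NUM_COUNTERS = 3
counters = [Counter() for _ in range(NUM_COUNTERS)]  # layer 2 bolts

for sentence in spout():
    for word in split_bolt(sentence):
        counters[route(word, NUM_COUNTERS)][word] += 1

# Merge the per-bolt counts to check the overall result.
totals = sum(counters, Counter())
```

Routing by word hash means no counter ever holds a partial count for a word, so the per-bolt results can be read independently.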
Schema
• Data often has a schema
• At least it should
• Kafka is unaware of it, which is good
• Need a way to figure out the schema for each event
• Without including the schema in every event
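One way to meet the last two requirements on the Schema slide is to keep schemas in a shared lookup and put only a small schema ID in each event. The slides do not say which mechanism is intended, so the following is a sketch under assumptions: the in-memory `SCHEMA_REGISTRY`, the 4-byte ID prefix, and JSON as the payload encoding are all made up for illustration.

```python
import json
import struct

# Hypothetical in-memory registry: schema ID -> schema definition.
SCHEMA_REGISTRY = {
    1: {"name": "PageView", "fields": ["user", "url"]},
}

def encode(schema_id, record):
    """Serialize field values only, prefixed with a 4-byte big-endian schema ID."""
    fields = SCHEMA_REGISTRY[schema_id]["fields"]
    payload = json.dumps([record[name] for name in fields]).encode("utf-8")
    return struct.pack(">I", schema_id) + payload

def decode(message):
    """Read the schema ID, look up the schema, and rebuild the record."""
    (schema_id,) = struct.unpack(">I", message[:4])
    fields = SCHEMA_REGISTRY[schema_id]["fields"]
    values = json.loads(message[4:].decode("utf-8"))
    return dict(zip(fields, values))

msg = encode(1, {"user": "alice", "url": "/home"})
record = decode(msg)
```

The message carries the values and the ID but not the field names, and the broker never has to interpret any of it, which matches the point that Kafka stays unaware of the schema.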