2 © 2014 Cloudera, Inc. All rights reserved.
Background
• HiveKa was our Cloudera 2014 Hackathon project
• Ashish Singh
• Gwen Shapira
• Szehon Ho
Background
• Enable SQL on all of a user’s data, even data that sits in a Kafka cluster
• Implementation: a HiveStorageHandler for Kafka
Apache Kafka
• General-purpose distributed publish-subscribe framework developed at LinkedIn
• Ingest problem: how to get data into Hadoop
• Standardize data pipelines: eliminate ad-hoc pipelines
• Scalable, resilient, and low-latency
Apache Kafka
• Producer
• Consumer
• Cluster = Brokers
  • Message store
  • Message replication
Apache Kafka
• Message
  • Key-Value
  • Offset
• Topics
• Partitions
  • Messages in order
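The message model on this slide can be sketched as a toy in-memory topic. This is only an illustration under stated assumptions: the `Topic` and `Message` classes, the topic name, and the CRC32-based partitioner are hypothetical and are not Kafka's client API or its actual partitioning scheme. The sketch shows that each partition is an append-only list, that an offset is a message's position within its partition, and that ordering holds per partition.

```python
import zlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Message:
    key: bytes
    value: bytes
    offset: int  # position of the message within its partition


class Topic:
    """Toy topic: a fixed number of append-only partitions (hypothetical)."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions: List[List[Message]] = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: bytes) -> Tuple[int, int]:
        # Assumed partitioner: a stable hash of the key picks the partition,
        # so messages sharing a key land in the same partition, in order.
        partition = zlib.crc32(key) % len(self.partitions)
        offset = len(self.partitions[partition])
        self.partitions[partition].append(Message(key, value, offset))
        return partition, offset

    def read(self, partition: int, start_offset: int, max_messages: int) -> List[Message]:
        # Consumers read a contiguous offset range from a single partition.
        log = self.partitions[partition]
        return log[start_offset:start_offset + max_messages]


topic = Topic("clicks", num_partitions=2)
first = topic.append(b"user-1", b"page-a")
second = topic.append(b"user-1", b"page-b")
values = [m.value for m in topic.read(first[0], 0, 10)]
```

Both appends use the same key, so they go to the same partition and receive consecutive offsets; reading that partition from offset 0 returns the values in the order they were written.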
Existing Solution: Camus
• LinkedIn developed a Kafka -> HDFS pipeline called Camus
1. Camus’s InputFormat pulls the latest messages from Kafka into HDFS
2. Pluggable MessageDecoder (Kafka message bytes -> Writable)
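The idea of a pluggable decoder in step 2 can be sketched as a small registry that maps a decoder name to a function from raw bytes to a record. This is not Camus's Java `MessageDecoder` interface; the registry, the decoder names, and the `topic_config` dictionary are assumptions made for illustration, and a plain `dict` stands in for a Hadoop `Writable`.

```python
import json
from typing import Callable, Dict

# A decoder turns raw message bytes into a record (a dict in this sketch).
Decoder = Callable[[bytes], dict]

# Hypothetical registry: decoder name -> decoder function.
DECODERS: Dict[str, Decoder] = {}


def register_decoder(name: str) -> Callable[[Decoder], Decoder]:
    """Decorator that registers a decoder under the given name."""
    def wrap(func: Decoder) -> Decoder:
        DECODERS[name] = func
        return func
    return wrap


@register_decoder("json")
def decode_json(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))


@register_decoder("kv")
def decode_kv(payload: bytes) -> dict:
    # Assumed wire format for this sketch: "key,value" as UTF-8 text.
    key, _, value = payload.decode("utf-8").partition(",")
    return {"key": key, "value": value}


def decode(topic_config: dict, payload: bytes) -> dict:
    # The decoder is chosen by configuration, not hard-coded in the reader.
    return DECODERS[topic_config["decoder"]](payload)


json_record = decode({"decoder": "json"}, b'{"id": 7}')
kv_record = decode({"decoder": "kv"}, b"color,blue")
```

Because the reader only calls `decode`, supporting a new message format means registering one more function rather than changing the pipeline.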
HiveKa
• We implemented Hive storage handlers to access Kafka messages directly from Hive
• ETL: Load data directly into Hive, bypassing Camus
• Analytics: Run Hive queries directly on Kafka data
KafkaStorageHandler
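What "querying Kafka data directly" involves can be sketched conceptually: a Hive storage handler supplies an input format that plans and reads splits, and a SerDe that turns raw records into rows. The code below is not HiveKa's implementation. The in-memory `TOPIC`, the one-split-per-partition plan, the JSON payloads, and the `__partition` / `__offset` metadata columns are all assumptions chosen to keep the example small.

```python
import json
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for a Kafka topic: partition id -> ordered message bytes.
TOPIC: Dict[int, List[bytes]] = {
    0: [b'{"user": "ann", "clicks": 3}', b'{"user": "bob", "clicks": 9}'],
    1: [b'{"user": "cat", "clicks": 5}'],
}

Split = Tuple[int, int, int]  # (partition, start offset, end offset exclusive)


def plan_splits(topic: Dict[int, List[bytes]]) -> List[Split]:
    # Input-format role (assumed plan): one split per partition, all offsets.
    return [(p, 0, len(msgs)) for p, msgs in sorted(topic.items())]


def read_split(topic: Dict[int, List[bytes]], split: Split) -> Iterator[dict]:
    # Record-reader plus SerDe role: fetch bytes by offset, decode into a row.
    partition, start, end = split
    for offset in range(start, end):
        row = json.loads(topic[partition][offset].decode("utf-8"))
        row["__partition"] = partition  # hypothetical metadata column
        row["__offset"] = offset        # hypothetical metadata column
        yield row


def scan(topic: Dict[int, List[bytes]]) -> Iterator[dict]:
    for split in plan_splits(topic):
        yield from read_split(topic, split)


# Roughly what a query like: SELECT user FROM clicks WHERE clicks > 4
# would compute over these rows.
result = [row["user"] for row in scan(TOPIC) if row["clicks"] > 4]
```

In this sketch no data is copied to HDFS first: rows are produced from the message log at query time, which is the contrast with the Camus pipeline.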
HiveKa Design
• Future:
  • Avro schema
  • Expose pluggable MessageDecoder/SerDe pairs for different Kafka messages
Conclusion
• Guide to implementing Hive Storage Handlers: http://szehon3.wordpress.com/2014/11/09/kafkaesque-hive-thoughts-on-storage-handlers/
• Website with source code and examples: http://hiveka.weebly.com/
• Source code: https://github.com/HiveKa/HiveKa
• Will contribute back to Hive