© Cloudera, Inc. All rights reserved.
Introducing Kudu
Jeremy Beard | Senior Solutions Architect
December 2015
Presenter
• Jeremy Beard
• Senior Solutions Architect at Cloudera
• Three years in big data
• Six years in data warehousing
Current storage landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
Changing hardware landscape
• Spinning disk -> solid state storage
  • NAND flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind
• Takeaway 2: Column stores are feasible for random access
Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Simplifies the architecture for building analytic applications on changing data
• Designed for fast analytic performance
• Natively integrated with Hadoop
• Donated as incubating project at Apache Software Foundation
• Beta now available
[Diagram: Kudu in the Hadoop platform stack. INTEGRATE: Sqoop (structured); Kafka, Flume (unstructured). STORE: HDFS (filesystem), Kudu (relational), HBase (NoSQL). UNIFIED SERVICES: YARN (resource management); Sentry, RecordService (security). PROCESS, ANALYZE, SERVE: Spark, Hive, Pig, MapReduce (batch); Spark (stream); Impala (SQL); Solr (search); Kite (SDK).]
Kudu design goals
• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum design)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)
Kudu basic design
• Apache-licensed open source software
• Structured data model
• Basic construct: tables
  • Tables broken down into tablets (roughly equivalent to partitions)
• Architecture supports geographically disparate, active/active systems
  • Not the initial design goal
What Kudu is not
• Not a SQL interface
  • Just the storage layer
  • “BYO SQL”
• Not a file system
  • Data must have tabular structure
• Not an application that runs on HDFS
  • An alternative, native Hadoop storage engine
• Not a replacement for HDFS or HBase
  • Select the right storage for the right use case
  • Cloudera will continue to support and invest in all three
Kudu data model
• Tables have an RDBMS-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a primary key
• Fast random reads/writes by primary key
  • No secondary indexes (yet)
• Columnar layout on disk
  • Lazy materialization
  • Encoding and compression options
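To make the columnar layout and lazy materialization points concrete, here is a minimal Python sketch. It is an illustration only, not Kudu's storage format: the sample rows, the column names, and the helpers `scan_column` and `select_where` are all hypothetical. Each column is stored as its own list, so a scan that needs one column never touches the others, and full rows are assembled only for positions that pass the filter.

```python
# Illustrative only: a toy column store, not Kudu's on-disk format.
# The rows, column names, and helper names are made up for this example.
rows = [
    {"host": "a", "ts": 1, "cpu": 0.25},
    {"host": "a", "ts": 2, "cpu": 0.75},
    {"host": "b", "ts": 1, "cpu": 0.50},
]

# Columnar layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in ("host", "ts", "cpu")}


def scan_column(name):
    """Read a single column without touching the others."""
    return columns[name]


def select_where(filter_col, predicate, output_cols):
    """Evaluate the predicate on one column, then build only the matching rows."""
    matches = [i for i, value in enumerate(columns[filter_col]) if predicate(value)]
    # Lazy materialization: output columns are read only at matching positions.
    return [{c: columns[c][i] for c in output_cols} for i in matches]


average_cpu = sum(scan_column("cpu")) / len(rows)
busy = select_where("cpu", lambda v: v > 0.4, ["host", "ts"])
```

An aggregate such as `average_cpu` reads only the `cpu` list, which is the access pattern that makes column stores attractive for analytic scans.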
Table partitioning
• Hash bucketing
  • Distribute records by hash of partition column(s)
  • N buckets leads to N tablets
• Range partitioning
  • Distribute records by ranges of the partition column(s)
  • N split keys leads to N+1 tablets
• Can be a mix for different columns of the primary key
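The two partitioning schemes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Kudu's implementation: `zlib.crc32` is a stand-in hash chosen because it is deterministic, and the function names and the `host`/`ts` columns are hypothetical.

```python
import bisect
import zlib


def hash_bucket(key, num_buckets):
    """Map a partition-column value to one of num_buckets buckets."""
    # crc32 is a stand-in; no claim is made about the hash Kudu uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def range_partition(key, split_keys):
    """Return the index of the range containing key, given sorted split keys."""
    return bisect.bisect_right(split_keys, key)


def tablet_for(host, ts, num_buckets, split_keys):
    """Mixed scheme: hash one primary key column, range-partition another."""
    return (hash_bucket(host, num_buckets), range_partition(ts, split_keys))


splits = [100, 200]
# Keys below 100 land in range 0, keys in [100, 200) in range 1, the rest in range 2.
ranges = [range_partition(ts, splits) for ts in (5, 100, 150, 250)]
```

Hashing spreads writes evenly across buckets, while range partitioning keeps neighboring keys together, so combining them on different key columns is a way to get some of both properties.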
Consistency model
• Consistency and replication enforced by Raft consensus (similar to Paxos)
  • Replication by operation, not data
• Single-row transactions now
  • Multi-row transactions later
• Geo-distributed replicas will be possible under strict time synchronization
• Techniques drawn from Google Spanner and others
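"Replication by operation, not data" means that replicas receive the operation itself and each applies it locally, rather than receiving copied data files. The Python sketch below illustrates that idea together with a majority commit rule. It is deliberately not Raft: leader election, terms, and log repair are omitted, and the `Replica` class and `replicate` function are hypothetical names for this example.

```python
# Illustrative only: majority-acknowledged replication of operations.
# Leader election, terms, and log repair are omitted, so this is not Raft itself.


class Replica:
    def __init__(self, available=True):
        self.available = available
        self.log = []    # operations accepted by this replica, in order
        self.state = {}  # primary key -> row value

    def append(self, op):
        """Accept an operation into the log if this replica is reachable."""
        if not self.available:
            return False
        self.log.append(op)
        return True

    def apply(self, op):
        """Apply a committed operation to the local state."""
        kind, key, value = op
        if kind == "upsert":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)


def replicate(replicas, op):
    """Ship the operation (not the resulting data) and commit on a majority."""
    acked = [r for r in replicas if r.append(op)]
    committed = len(acked) > len(replicas) // 2
    for r in acked:
        if committed:
            r.apply(op)
        else:
            r.log.pop()  # drop the entry that never reached a majority
    return committed


replicas = [Replica(), Replica(), Replica(available=False)]
ok = replicate(replicas, ("upsert", "row1", {"cpu": 0.5}))
```

With three replicas, the write succeeds while one replica is unreachable, because two acknowledgements form a majority.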
Kudu interfaces
• NoSQL-style APIs
  • Insert(), Update(), Delete(), Scan()
  • Java and C++ now
  • Python soon
• Integrations with MapReduce, Spark, and Impala
• No direct access to underlying Kudu tablet files
• Beta does not have authentication, authorization, encryption
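The shape of a NoSQL-style API over a table with a primary key can be sketched with a toy in-memory table. This is not the Kudu client API: `ToyTable` and its methods are hypothetical and only mirror the four operation names listed above.

```python
# Illustrative only: a toy in-memory table, not the Kudu client API.


class ToyTable:
    def __init__(self, key_columns):
        self.key_columns = tuple(key_columns)
        self.rows = {}  # primary key tuple -> row dict

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row):
        key = self._key(row)
        if key in self.rows:
            raise KeyError(f"duplicate primary key: {key}")
        self.rows[key] = dict(row)

    def update(self, row):
        key = self._key(row)
        if key not in self.rows:
            raise KeyError(f"no row with primary key: {key}")
        self.rows[key].update(row)

    def delete(self, row):
        del self.rows[self._key(row)]

    def scan(self, predicate=lambda row: True):
        """Yield copies of matching rows in primary key order."""
        for key in sorted(self.rows):
            if predicate(self.rows[key]):
                yield dict(self.rows[key])


table = ToyTable(key_columns=["host", "ts"])
table.insert({"host": "a", "ts": 1, "cpu": 0.25})
table.insert({"host": "b", "ts": 1, "cpu": 0.50})
table.update({"host": "a", "ts": 1, "cpu": 0.75})
table.delete({"host": "b", "ts": 1})
remaining = list(table.scan())
```

Every mutation addresses a row by its primary key, which is what lets single-row reads and writes stay fast.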
Impala integration
• Opens up Kudu to JDBC/ODBC clients
• Intuitive way to get data into Kudu
  • INSERT INTO kudu SELECT * FROM csv;
• Additional commands
  • UPDATE
  • DELETE
  • Efficient INSERT VALUES
• Runs on the Kudu C++ client
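The statement pattern above can be tried locally with Python's built-in `sqlite3` module. SQLite is only a stand-in here so that the statements run without a cluster; Impala's DDL for Kudu tables and its execution semantics differ. The table names `kudu` and `csv` come from the slide, while the two-column schema and the sample rows are assumptions.

```python
import sqlite3

# Illustrative only: SQLite stands in for Impala so the statements can run locally.
# The schema and sample rows are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE csv (id INTEGER, name TEXT);
    CREATE TABLE kudu (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO csv VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    """
)

# Bulk load from the staging table, then mutate individual rows in place.
conn.execute("INSERT INTO kudu SELECT * FROM csv")
conn.execute("UPDATE kudu SET name = 'BETA' WHERE id = 2")
conn.execute("DELETE FROM kudu WHERE id = 3")
conn.execute("INSERT INTO kudu VALUES (4, 'delta')")

result = conn.execute("SELECT id, name FROM kudu ORDER BY id").fetchall()
```

The point of the sketch is the workflow: one bulk `INSERT ... SELECT` to load data, followed by row-level `UPDATE`, `DELETE`, and `INSERT VALUES` against the same table.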
Performance characteristics
• Very CPU efficient
  • Written in modern C++, uses specialized CPU instructions, JIT compilation
• Latency mostly driven by storage hardware capabilities
  • Expect sub-millisecond response on SSDs and upcoming technologies
• No garbage collection allows very large memory footprint with no pauses
• Bloom filters reduce the need for many disk accesses
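A Bloom filter answers "might this key be present?" from memory, so a lookup can skip the disk whenever the answer is a definite no. The Python sketch below shows the general technique, not Kudu's implementation: the filter size, the SHA-256-based hashing, and the `lookup` and `read_from_disk` helpers are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(key, bloom, read_from_disk):
    """Consult the in-memory filter first and read from disk only if needed."""
    if not bloom.might_contain(key):
        return None
    return read_from_disk(key)


on_disk = {"row1": "a", "row2": "b"}  # hypothetical stored rows
bloom = BloomFilter()
for k in on_disk:
    bloom.add(k)

disk_reads = []


def read_from_disk(key):
    disk_reads.append(key)  # record each simulated disk access
    return on_disk.get(key)


hit = lookup("row1", bloom, read_from_disk)
```

False positives are possible, so a "possibly present" answer still requires the disk read; the saving comes from the keys the filter can rule out.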
Operating Kudu
• Easiest through Cloudera Manager integration
  • Separate parcel for now
• Kudu is always compacting
  • No minor vs. major compaction
  • No compaction latency spikes
• Web UI is full of metrics and logs
Cluster layout
• One or multiple masters
  • Only one in current beta
  • Low CPU and memory impact
• One tablet server per worker node
  • Can share disks with HDFS
  • One SSD per worker node just for Kudu WAL can speed up writes
• No dependencies on other Hadoop ecosystem components
  • But interfacing components like Impala or Spark do
Real-time analytics in Hadoop today
Merging in new data = storage complexity
Downsides:
● Multiple storage layers
● Latest data is hidden
● Files are messy
● Complex to do updates without breaking running queries
[Diagram: Incoming Data (Messaging System) is written to HBase as the New Partition. A check, "Have we accumulated enough data?", triggers "Reorganize HBase file into Parquet", which involves two steps: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. The Most Recent Partition and Historic Data are Parquet files in HDFS + Impala, which serves the Reporting Request.]
Real-time analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No schedules or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
[Diagram labels: Incoming Data (Messaging System); Kudu + Impala; Historical and Real-time Data; Reporting Request]
Kudu for data warehousing
• Near real time data visibility
  • BI tools can display events that happened seconds earlier
• Excellent for star schemas
  • Fast scans of deep fact tables
  • Efficient wide fact tables
  • Simplified updates of slowly changing dimensions
Near real time data warehousing on Kudu
Resources
Join the community: http://[email protected]
Download the beta: cloudera.com/downloads
Read the whitepaper: getkudu.io/kudu.pdf
Thank you
[email protected]