High cardinality time series search
A new level of scale

Eric Sammer – CTO and co-founder, @esammer
DataEngConf 2016
Context

• We build a system for large scale realtime collection, processing, and analysis of event-oriented machine data
• On prem or in the cloud, but not SaaS
• Supportability is a big deal for us
  • Predictability of performance, including under failures
  • Ease of configuration and operation
  • Behavior in wacky environments
• All of our decisions are informed by this – YMMV
What I mean by “scale”

• Typical: 10s of TB of new data per day
• Average event size ~200-500 bytes
• At 20TB per day:
  • @200 bytes = 1.2M events / second, ~109.9B events / day, ~40.1T events / year
  • @500 bytes = 509K events / second, ~43.9B events / day, ~16T events / year
• Retaining years online for query
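As a sanity check on those figures, a quick back-of-the-envelope calculation (assuming binary terabytes, i.e. 20 × 2^40 bytes per day, which is what the numbers above imply):

```java
// Back-of-the-envelope event rates for 20 TB/day of ~200-500 byte events.
public class ScaleMath {
    public static void main(String[] args) {
        double bytesPerDay = 20.0 * Math.pow(2, 40); // 20 TB/day, binary terabytes
        for (int eventSize : new int[] {200, 500}) {
            double eventsPerDay = bytesPerDay / eventSize;
            double eventsPerSecond = eventsPerDay / 86_400; // seconds in a day
            double eventsPerYear = eventsPerDay * 365;
            System.out.printf("@%d bytes: %.2fM events/sec, %.2fB events/day, %.1fT events/year%n",
                    eventSize, eventsPerSecond / 1e6, eventsPerDay / 1e9, eventsPerYear / 1e12);
        }
    }
}
```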
General purpose search – the good parts

• We originally built against SolrCloud (but most of this goes for Elasticsearch too)
• Amazing feature set for general purpose search
• Good support for moderate scale
• Excellent at:
  • Content search – news sites, document repositories
  • Finite size datasets – product catalogs, job postings, things you prune
  • Low(er) cardinality datasets that (mostly) fit in memory
Problems with general purpose search systems

• Fixed shard allocation models – always N partitions
• Multi-level and semantic partitioning is painful without building your own macro query planner
• All shards open all the time; poor resource control for high retention
• APIs are record-at-a-time focused for NRT indexing; poor ingest performance (aka: please stop making everything REST!)
• Ingest concurrency is wonky
• High write amplification on data we know won’t change
• Other smaller stuff…
“Well actually…”

Plenty of ways to push general purpose systems (we tried many of them):

• Using multiple collections as partitions, macro query planning
• Running multiple JVMs per node for better utilization
• Pushing historical searches into another system
• Building weirdo caches of things

At some point the cost of hacking outweighed the cost of building.
Warning!

• This is not a condemnation of general purpose search systems!
• Unless the sky is falling, use one of those systems
We built a thing: Rocana Search

A high cardinality, low latency, parallel search system for time-oriented events
Features of Rocana Search

• Fully parallelized ingest and query, built for large clusters
• Every node is an indexer, query coordinator, and executor
• Optimized for high cardinality time-oriented event data
• Built to keep all data online and queryable without wasting resources on infrequently used data
• Fully durable, resistant to node failures
• Operationally friendly: online ops, predictable resource usage and performance
• Uses battle tested open source components (Kafka, Lucene, HDFS, ZooKeeper)
Major differences

• Storage and partition model looks more like range-partitioned tables in databases: new partitions are easily added, old ones dropped; supports multi-field partitioning; allows for fine grained resource management
• Partitions are subdivided into slices for parallel writes
• Query engine aggressively prunes partitions by analyzing predicates (see the sketch below)
• Ingestion path is Kafka, built for extremely high throughput of small events

What we know about our data allows us to optimize.
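To make the pruning point concrete, here is a minimal sketch of time-range partition elimination; the `Partition` type and `pruneByTimeRange` helper are illustrative names, not Rocana's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: eliminate partitions whose time range cannot
// overlap the query's time predicate, so their indexes are never searched.
public class PartitionPruner {
    static class Partition {
        final String id;
        final long startMillis, endMillis; // partition covers [startMillis, endMillis)
        Partition(String id, long startMillis, long endMillis) {
            this.id = id;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    static List<Partition> pruneByTimeRange(List<Partition> all, long queryStart, long queryEnd) {
        List<Partition> survivors = new ArrayList<>();
        for (Partition p : all) {
            // Keep a partition only if its range overlaps [queryStart, queryEnd).
            if (p.startMillis < queryEnd && p.endMillis > queryStart) {
                survivors.add(p);
            }
        }
        return survivors;
    }
}
```

A query like `time:[x TO y] AND host:z` then fans out only to the surviving partitions.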
Architecture (a single node)
Collections, partitions, and slices

• A search collection is split into partitions by a partition strategy (see the sketch below)
  • Think: “by year, month, day, hour”
• Partitioning is invisible to queries (e.g. `time:[x TO y] AND host:z` works normally)
• Partitions are divided into slices to support (mostly) lock-free parallel writes
  • Think: “this hour has 20 slices, each of which is independent for write”
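For illustration, a minimal sketch of what a time-based partition strategy might look like; `DayHourPartitionStrategy` is a hypothetical name and shape, not Rocana's actual type:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of a time-based partition strategy: map an event's
// timestamp to a partition key like "2016/04/07/15" (year/month/day/hour).
public class DayHourPartitionStrategy {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    public String partitionFor(long eventTimestampMillis) {
        return FORMAT.format(Instant.ofEpochMilli(eventTimestampMillis));
    }

    public static void main(String[] args) {
        DayHourPartitionStrategy strategy = new DayHourPartitionStrategy();
        // Events with timestamps in the same hour land in the same partition.
        System.out.println(strategy.partitionFor(System.currentTimeMillis()));
    }
}
```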
Collections, partitions, and slices (diagram)
From events to partitions to slices (diagram)
Assigning slices to nodes (diagram)
Following the write path

• One of the search nodes is the exclusive owner of Kafka partitions (KP) 0 and 1
• Consume a batch of events
• Use the partition strategy to figure out which RS (Rocana Search) partition each event belongs to
• Kafka messages carry the Kafka partition number, so we know the slice
• Each event is written to the proper partition/slice
• Eventually the indexes are committed
• If the partition or slice is new, the metadata service is informed (a sketch of this loop follows)
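A minimal sketch of that loop, assuming the modern Kafka consumer API; the broker address, topic name, and the `partitionFor` / `indexInto` helpers are illustrative placeholders, not Rocana's actual code:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical write-path sketch: consume a batch from Kafka, route each
// event to an (RS partition, slice) pair, and index it.
public class WritePathSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "rs-indexer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    // The partition strategy picks the RS partition (e.g. by event time);
                    // the Kafka partition number identifies the slice within it.
                    String rsPartition = partitionFor(record.timestamp());
                    int slice = record.partition();
                    indexInto(rsPartition, slice, record.value());
                }
                // Periodically commit the Lucene indexes, then the Kafka offsets,
                // and notify the metadata service of any new partitions or slices.
            }
        }
    }

    // Placeholder: map an event timestamp to an RS partition key, e.g. "2016/04/07/15".
    static String partitionFor(long timestampMillis) { return ""; }

    // Placeholder: hand the event to the Lucene IndexWriter for that partition/slice.
    static void indexInto(String partition, int slice, String event) {}
}
```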
Query engine basics

• Queries are submitted to a coordinator via RPC
• The coordinator (smart) parses, plans, schedules and monitors fragments, merges results, and responds to the client
• Fragments are submitted to executors for processing
• Executors (dumb) search exactly what they’re told and stream results to the coordinator
• A fragment is generated for every partition/slice that may contain data (see the sketch below)
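A minimal sketch of the coordinator's fan-out step: one fragment per surviving (partition, slice) pair. `Fragment` and `FragmentPlanner` are illustrative names; the RPC and streaming machinery is not shown:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the coordinator expands a query into one fragment
// per surviving (partition, slice) pair and hands those to executors.
public class FragmentPlanner {
    static class Fragment {
        final String partition;
        final int slice;
        final String query;
        Fragment(String partition, int slice, String query) {
            this.partition = partition;
            this.slice = slice;
            this.query = query;
        }
    }

    static List<Fragment> plan(String query, List<String> prunedPartitions, int slicesPerPartition) {
        List<Fragment> fragments = new ArrayList<>();
        for (String partition : prunedPartitions) { // only partitions the pruner kept
            for (int slice = 0; slice < slicesPerPartition; slice++) {
                fragments.add(new Fragment(partition, slice, query));
            }
        }
        return fragments;
    }
}
```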
Some implications

• Search processes run on the same nodes as the HDFS DataNode
• The first replica of any event received by search from Kafka is written locally
• Result: unless nodes fail, all reads are local (HDFS short circuit reads)
  • The Linux kernel page cache is useful here
  • HDFS caching can be used
  • Search has an off-heap block cache as well
• In case of failure, any search node can read any index
• HDFS overhead winds up being very little, and we still get its advantages
Contrived query scenario

• 80 Kafka partitions (80 slices)
• Collection partitioned by day
• 80 nodes, 16 executor threads each
• Query: time:[2015-01-01 TO 2016-01-01] AND service:sshd
  • 365 * 80 = 29,200 fragments generated for the query (a lot!)
  • 29,200 / (80 * 16) = ~22 “waves” of fragments
  • If each “wave” takes ~0.5 seconds, the query takes ~11 seconds
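The arithmetic above, spelled out (a back-of-the-envelope model, not the actual scheduler):

```java
// Fragment "wave" math for the contrived scenario above.
public class QueryWaves {
    public static void main(String[] args) {
        int daysInRange = 365;          // time:[2015-01-01 TO 2016-01-01], daily partitions
        int slicesPerPartition = 80;    // one slice per Kafka partition
        int nodes = 80;
        int executorThreadsPerNode = 16;

        int fragments = daysInRange * slicesPerPartition;   // 29,200
        int parallelism = nodes * executorThreadsPerNode;   // 1,280 concurrent fragments
        double waves = (double) fragments / parallelism;    // ~22.8
        double secondsPerWave = 0.5;

        System.out.printf("%d fragments, %d-way parallel, %.1f waves, ~%.1f seconds%n",
                fragments, parallelism, waves, waves * secondsPerWave);
    }
}
```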
More real, but a little outdated

• 24 AWS EC2 d2.2xlarge nodes, instance storage
• Ingesting data at ~3 million events per minute (~50K events/sec)
• 24 Kafka partitions / RS slices
• Index size: 5.9 billion events
• Query: all events, faceted by 3 fields
  • No tuning (default config): ~10 seconds (with a silly bug)
  • 10 concurrent instances of the same query: ~21 seconds total
  • 50 concurrent instances: ~41 seconds
• We do much better today
What we’ve really shown

In the context of search, scale means:

• High cardinality: billions of events per day
• High speed ingest: millions of events per second
• Not having to age data out of the collection
• Handling large, concurrent queries while ingesting data
• Fully utilizing modern hardware

These things are very possible.
Next steps

• Read replicas
• Smarter partition elimination in complex queries
• Speculative execution of query fragments
• Additional metadata for index fields to improve storage efficiency
• Smarter cache management
• Better visibility into performance and health
• Strong consensus (e.g. Raft, multi-paxos) for metadata?
Thank you!
Hopefully I still have time for questions.
rocana.com
@esammer
(ask me for stickers)
The (amazing) core search team:
• Michael Peterson - @quux00
• Mark Tozzi - @not_napoleon
• Brad Cupit
• Joey Echeverria - @fwiffo