High cardinality time series search
A new level of scale

Eric Sammer – CTO and co-founder, @esammer
DataEngConf 2016
Context

• We build a system for large scale realtime collection, processing, and analysis of event-oriented machine data
• On prem or in the cloud, but not SaaS
• Supportability is a big deal for us
  • Predictability of performance, including under failures
  • Ease of configuration and operation
  • Behavior in wacky environments
• All of our decisions are informed by this – YMMV
What I mean by “scale”

• Typical: 10s of TB of new data per day
• Average event size ~200-500 bytes
• At 20TB per day:
  • @200 bytes = 1.2M events / second, ~109.9B events / day, ~40.1T events / year
  • @500 bytes = 509K events / second, ~43.9B events / day, ~16T events / year
• Retaining years online for query
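As a sanity check on those figures, a quick back-of-the-envelope calculation (assuming binary terabytes, i.e. 20 × 2^40 bytes per day, which is what the numbers above imply):

```java
// Back-of-the-envelope event rates for 20 TB/day of ~200-500 byte events.
public class ScaleMath {
    public static void main(String[] args) {
        double bytesPerDay = 20.0 * Math.pow(2, 40); // 20 TB/day, binary terabytes
        for (int eventSize : new int[] {200, 500}) {
            double eventsPerDay = bytesPerDay / eventSize;
            double eventsPerSecond = eventsPerDay / 86_400; // seconds in a day
            double eventsPerYear = eventsPerDay * 365;
            System.out.printf("@%d bytes: %.2fM events/sec, %.2fB events/day, %.1fT events/year%n",
                    eventSize, eventsPerSecond / 1e6, eventsPerDay / 1e9, eventsPerYear / 1e12);
        }
    }
}
```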
General purpose search – the good parts

• We originally built against SolrCloud (but most of this goes for Elasticsearch too)
• Amazing feature set for general purpose search
• Good support for moderate scale
• Excellent at:
  • Content search – news sites, document repositories
  • Finite size datasets – product catalogs, job postings, things you prune
  • Low(er) cardinality datasets that (mostly) fit in memory
Problems with general purpose search systems

• Fixed shard allocation models – always N partitions
• Multi-level and semantic partitioning is painful without building your own macro query planner
• All shards open all the time; poor resource control for high retention
• APIs are record-at-a-time focused for NRT indexing; poor ingest performance (aka: please stop making everything REST!)
• Ingest concurrency is wonky
• High write amplification on data we know won’t change
• Other smaller stuff…
“Well actually…”

Plenty of ways to push general purpose systems (we tried many of them):

• Using multiple collections as partitions, macro query planning
• Running multiple JVMs per node for better utilization
• Pushing historical searches into another system
• Building weirdo caches of things

At some point the cost of hacking outweighed the cost of building.
Warning!

• This is not a condemnation of general purpose search systems!
• Unless the sky is falling, use one of those systems
We built a thing: Rocana Search

A high cardinality, low latency, parallel search system for time-oriented events
Features of Rocana Search

• Fully parallelized ingest and query, built for large clusters
• Every node is an indexer, query coordinator, and executor
• Optimized for high cardinality time-oriented event data
• Built to keep all data online and queryable without wasting resources on infrequently used data
• Fully durable, resistant to node failures
• Operationally friendly: online ops, predictable resource usage and performance
• Uses battle tested open source components (Kafka, Lucene, HDFS, ZooKeeper)
Major differences

• Storage and partition model looks more like range-partitioned tables in databases: new partitions are easily added, old ones dropped; supports multi-field partitioning; allows for fine grained resource management
• Partitions are subdivided into slices for parallel writes
• Query engine aggressively prunes partitions by analyzing predicates (see the sketch below)
• Ingestion path is Kafka, built for extremely high throughput of small events

What we know about our data allows us to optimize.
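To make the pruning point concrete, here is a minimal sketch of time-range partition elimination; the `Partition` type and `pruneByTimeRange` helper are illustrative names, not Rocana's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: eliminate partitions whose time range cannot
// overlap the query's time predicate, so their indexes are never searched.
public class PartitionPruner {
    static class Partition {
        final String id;
        final long startMillis, endMillis; // partition covers [startMillis, endMillis)
        Partition(String id, long startMillis, long endMillis) {
            this.id = id;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    static List<Partition> pruneByTimeRange(List<Partition> all, long queryStart, long queryEnd) {
        List<Partition> survivors = new ArrayList<>();
        for (Partition p : all) {
            // Keep a partition only if its range overlaps [queryStart, queryEnd).
            if (p.startMillis < queryEnd && p.endMillis > queryStart) {
                survivors.add(p);
            }
        }
        return survivors;
    }
}
```

A query like `time:[x TO y] AND host:z` then fans out only to the surviving partitions.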
Architecture (a single node)
Collections, partitions, and slices

• A search collection is split into partitions by a partition strategy (see the sketch below)
  • Think: “by year, month, day, hour”
• Partitioning is invisible to queries (e.g. `time:[x TO y] AND host:z` works normally)
• Partitions are divided into slices to support (mostly) lock-free parallel writes
  • Think: “this hour has 20 slices, each of which is independent for write”
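For illustration, a minimal sketch of what a time-based partition strategy might look like; `DayHourPartitionStrategy` is a hypothetical name and shape, not Rocana's actual type:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of a time-based partition strategy: map an event's
// timestamp to a partition key like "2016/04/07/15" (year/month/day/hour).
public class DayHourPartitionStrategy {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    public String partitionFor(long eventTimestampMillis) {
        return FORMAT.format(Instant.ofEpochMilli(eventTimestampMillis));
    }

    public static void main(String[] args) {
        DayHourPartitionStrategy strategy = new DayHourPartitionStrategy();
        // Events with timestamps in the same hour land in the same partition.
        System.out.println(strategy.partitionFor(System.currentTimeMillis()));
    }
}
```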
Collections, partitions, and slices (diagram)
From events to partitions to slices (diagram)
Assigning slices to nodes (diagram)
Following the write path

• One of the search nodes is the exclusive owner of Kafka partitions (KP) 0 and 1
• Consume a batch of events
• Use the partition strategy to figure out which RS (Rocana Search) partition each event belongs to
• Kafka messages carry the Kafka partition number, so we know the slice
• Each event is written to the proper partition/slice
• Eventually the indexes are committed
• If the partition or slice is new, the metadata service is informed (a sketch of this loop follows)
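A minimal sketch of that loop, assuming the modern Kafka consumer API; the broker address, topic name, and the `partitionFor` / `indexInto` helpers are illustrative placeholders, not Rocana's actual code:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical write-path sketch: consume a batch from Kafka, route each
// event to an (RS partition, slice) pair, and index it.
public class WritePathSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "rs-indexer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    // The partition strategy picks the RS partition (e.g. by event time);
                    // the Kafka partition number identifies the slice within it.
                    String rsPartition = partitionFor(record.timestamp());
                    int slice = record.partition();
                    indexInto(rsPartition, slice, record.value());
                }
                // Periodically commit the Lucene indexes, then the Kafka offsets,
                // and notify the metadata service of any new partitions or slices.
            }
        }
    }

    // Placeholder: map an event timestamp to an RS partition key, e.g. "2016/04/07/15".
    static String partitionFor(long timestampMillis) { return ""; }

    // Placeholder: hand the event to the Lucene IndexWriter for that partition/slice.
    static void indexInto(String partition, int slice, String event) {}
}
```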
Query engine basics

• Queries are submitted to a coordinator via RPC
• The coordinator (smart) parses, plans, schedules and monitors fragments, merges results, and responds to the client
• Fragments are submitted to executors for processing
• Executors (dumb) search exactly what they’re told and stream results to the coordinator
• A fragment is generated for every partition/slice that may contain data (see the sketch below)
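A minimal sketch of the coordinator's fan-out step: one fragment per surviving (partition, slice) pair. `Fragment` and `FragmentPlanner` are illustrative names; the RPC and streaming machinery is not shown:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the coordinator expands a query into one fragment
// per surviving (partition, slice) pair and hands those to executors.
public class FragmentPlanner {
    static class Fragment {
        final String partition;
        final int slice;
        final String query;
        Fragment(String partition, int slice, String query) {
            this.partition = partition;
            this.slice = slice;
            this.query = query;
        }
    }

    static List<Fragment> plan(String query, List<String> prunedPartitions, int slicesPerPartition) {
        List<Fragment> fragments = new ArrayList<>();
        for (String partition : prunedPartitions) { // only partitions the pruner kept
            for (int slice = 0; slice < slicesPerPartition; slice++) {
                fragments.add(new Fragment(partition, slice, query));
            }
        }
        return fragments;
    }
}
```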
Some implications

• Search processes run on the same nodes as the HDFS DataNode
• The first replica of any event received by search from Kafka is written locally
• Result: unless nodes fail, all reads are local (HDFS short circuit reads)
  • The Linux kernel page cache is useful here
  • HDFS caching can be used
  • Search has an off-heap block cache as well
• In case of failure, any search node can read any index
• HDFS overhead winds up being very little, and we still get its advantages
Contrived query scenario

• 80 Kafka partitions (80 slices)
• Collection partitioned by day
• 80 nodes, 16 executor threads each
• Query: time:[2015-01-01 TO 2016-01-01] AND service:sshd
  • 365 * 80 = 29,200 fragments generated for the query (a lot!)
  • 29,200 / (80 * 16) = ~22 “waves” of fragments
  • If each “wave” takes ~0.5 seconds, the query takes ~11 seconds
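The arithmetic above, spelled out (a back-of-the-envelope model, not the actual scheduler):

```java
// Fragment "wave" math for the contrived scenario above.
public class QueryWaves {
    public static void main(String[] args) {
        int daysInRange = 365;          // time:[2015-01-01 TO 2016-01-01], daily partitions
        int slicesPerPartition = 80;    // one slice per Kafka partition
        int nodes = 80;
        int executorThreadsPerNode = 16;

        int fragments = daysInRange * slicesPerPartition;   // 29,200
        int parallelism = nodes * executorThreadsPerNode;   // 1,280 concurrent fragments
        double waves = (double) fragments / parallelism;    // ~22.8
        double secondsPerWave = 0.5;

        System.out.printf("%d fragments, %d-way parallel, %.1f waves, ~%.1f seconds%n",
                fragments, parallelism, waves, waves * secondsPerWave);
    }
}
```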
More real, but a little outdated

• 24 AWS EC2 d2.2xlarge nodes, instance storage
• Ingesting data at ~3 million events per minute (~50K events/sec)
• 24 Kafka partitions / RS slices
• Index size: 5.9 billion events
• Query: all events, faceted by 3 fields
  • No tuning (default config): ~10 seconds (with a silly bug)
  • 10 concurrent instances of the same query: ~21 seconds total
  • 50 concurrent instances: ~41 seconds
• We do much better today
What we’ve really shown

In the context of search, scale means:

• High cardinality: billions of events per day
• High speed ingest: millions of events per second
• Not having to age data out of the collection
• Handling large, concurrent queries while ingesting data
• Fully utilizing modern hardware

These things are very possible.
Next steps

• Read replicas
• Smarter partition elimination in complex queries
• Speculative execution of query fragments
• Additional metadata for index fields to improve storage efficiency
• Smarter cache management
• Better visibility into performance and health
• Strong consensus (e.g. Raft, multi-paxos) for metadata?
Thank you!
Hopefully I still have time for questions.
rocana.com
@esammer
(ask me for stickers)
The (amazing) core search team:
• Michael Peterson - @quux00
• Mark Tozzi - @not_napoleon
• Brad Cupit
• Joey Echeverria - @fwiffo