
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics


Erik Freed, Flurry/Yahoo
erikfreed@yahoo-inc.com

Brian Anderson, Flurry/Yahoo
briananderson@yahoo-inc.com

Abstract

We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.

NOTE: This copy has had performance numbers updated and is not the same as the one submitted to Tech Pulse.

1. Introduction

The promise of the Flurry Explorer product is to invite the user into an unstructured interactive discovery session where they can easily pose arbitrary off-the-cuff and potentially complex questions about end user behavior. If they get back answers quickly enough, their next question starts a virtuous cycle of more targeted questions continuously leading to more specific and valuable results. The first major release of the back-end query engine engineered to fully support this type of exploration was developed in the Flurry Analytics group in Q1 2015 and delivered as part of a limited beta of the Explorer feature within Flurry Analytics. We successfully utilized a unique hyper distributed/parallel/concurrent object tree scanning model with a simple daily batched ingestion system for this limited audience. The next major release of this scanning architecture replaces the batched ingestion system with a more scalable incremental data ingestion pipeline to expand the reach of Explorer to all Flurry customers. Here we present the architectural basis and specifics of the previous and upcoming releases.

2. Background

For those of us who have spent any time with production-scale SQL databases, seeing large table scans being sorted and joined in a query plan is cause for panic. We can only relax once we find a way to constrain that query and/or implement heavyweight indices so the query transforms into pure index lookups and partial joins. However, for analytics the use cases are inherently unbounded, personalized, and constantly evolving, while the corpora are typically enormous. This makes adding indices intractable in most cases. These limitations forced us to reevaluate our previous nemesis, the full table scan. We determined that if we could make the scans efficient enough, distribute the scans across enough nodes and CPU cores, and develop a query language that could take an arbitrary ad hoc analytic question and transform it into an instance of this hyper parallel-distributed-concurrent scan model, then we would have an attractively simple general purpose model. We reasoned that this model would scale well not only in terms of input size and general query complexity, but in terms of feature development time, risk, and effort.


3. Top Level View 

 The basic components of the Burst ecosystem are: 

1. External Datasource(s)
2. Ingestion Subsystem
3. Data Model
4. Sample Store
5. Dataset Store
6. Query Subsystem

The previous release of Burst had a simplified batched ingestion model where the exporting MapReduce jobs wrote the entire history of a given mobile application's event stream into new HDFS sequence files on a daily basis. These datasets were then read into memory on demand as users posed queries. This initial beta pipeline design is being replaced by the incremental version described in subsequent sections. The rest of the architecture described here is as currently deployed.

Each of these components (other than the external data sources) is deployed on one or more clusters called a Cell, where each Cell is composed of a master node, a failover master node, and a set of worker nodes. Each Cell has its own Apache Kafka [KAFKA], Apache HBase [HBASE], and Apache Spark [SPARK] clusters deployed. The Master (and failover Master) node contains the master process for each of these systems as well as a Docker [DOCKER] container populated with all of the Burst-specific JVM service processes. The Worker nodes are populated only by the associated Spark, HBase, and Kafka worker-specific deployments. Burst does not itself deploy anything directly onto Worker nodes.

4. Data Sources

Burst is inherently schema independent as well as agnostic to the specific technology of the external datasource. However, the data source must have the following basic characteristics:

1. It must be in a schema that can be expressed in the relationships and datatypes of the Burst Data Model.
2. The external data model can be partitioned into two levels of well-defined shards:

a. The first level is composed of a set of Domain instances that each represent a subset of data that is the input to a single query, e.g. for Flurry Explorer, this is an event stream associated with a single 'Mobile Application' or constructed 'Mobile Application Group'. A query can only be executed against a single Domain at a time.

b. The second level is a strict partitioning across a Domain creating order-independent subsets of Item instances, each of which has a well-defined rooted acyclic object model (tree) that can be scanned in a depth-first, preferably time-ordered, traversal. For Flurry Explorer, this is a single 'Mobile Device', each of which has a set of time-ordered 'sessions', each of which has a set of time-ordered 'events', each of which has a set of unordered key-value map 'parameters'. (A sketch of this two-level model follows the list.)


3. The external data source's physical form can be exported as both a periodic historical batch and a continuous incremental update and fed to the Burst Kafka-based Ingestion API, e.g. for Flurry this is our 2,000 node, six petabyte, ~50 trillion mobile device events, ever-growing HBase cluster with custom MapReduce jobs performing both initializing batch and daily incremental update feeds.
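To make the two-level sharding concrete, here is a minimal sketch of how Flurry-style data could be modeled under it. The class and field names (Domain, Item, Session, Event) are illustrative assumptions, not the actual Burst schema definition.

```java
import java.util.List;
import java.util.Map;

// First-level shard: the unit a single query runs against,
// e.g. one Mobile Application or Mobile Application Group.
class Domain {
    String domainId;
    List<Item> items;       // second-level shards, order independent
}

// One Item, e.g. a single Mobile Device, a rooted object tree
// scannable in a depth-first, time-ordered traversal.
class Item {
    long deviceId;
    List<Session> sessions; // time ordered
}

class Session {
    long startTime;
    List<Event> events;     // time ordered
}

class Event {
    long eventTime;
    int eventType;
    Map<String, String> parameters; // unordered key-value map
}
```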

5. Ingestion

The new Burst Ingestion Subsystem design starts with Kafka queues that provide a control-plane (control/administration) and a data-plane (data feeds). The data source is responsible for sending and responding to the control-plane, as well as feeding the data-plane in response to control-plane messages. An Apache Spark based process model manages control-plane and data-plane operations. It is responsible for transforming the schema of the external system into an appropriate Burst schema, updating the Sample Store as data arrives.
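A minimal sketch of how an external data source might feed the two planes over Kafka. The topic names, message formats, and the idea of keying data-plane records by domain/item are assumptions for illustration; the actual Burst Ingestion API contracts are not shown here.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class IngestionFeedSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "cell-master:9092"); // hypothetical broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // Control-plane: announce that an incremental update for a domain is ready.
            producer.send(new ProducerRecord<>("burst-control", "domain-42",
                    "INCREMENTAL_UPDATE_READY".getBytes()));

            // Data-plane: push item updates, keyed so that all updates for an item
            // land in the same partition and stay ordered per item.
            byte[] encodedItemUpdate = new byte[0]; // placeholder for an encoded update
            producer.send(new ProducerRecord<>("burst-data", "domain-42/item-12345",
                    encodedItemUpdate));
        }
    }
}
```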

6. Data Model

The Burst Data Model has the following requirements/features/implementation details:

1. It is schema independent, but schema defined.
2. It is schema versioned, and supports heterogeneous versioned collections.
3. The data model/schema supports type structures, singular and plural structure reference relationships, value collections, value maps, and atomic data types (boolean, byte, short, int, long, double, string).
4. The data model/schema inherently defines a tree with a well-defined root as part of a well-defined traversal.
5. Data is encoded in a single byte array where the disk storage encoding is identical to the in-memory format.
6. This encoding is an unrolled depth-first traversal of the object tree as a linear sequence of bytes. The reading from disk into memory and the traversal scans are in the same exact byte order and thus can take direct advantage of the OS disk mmap semantics, with the associated high-performance kernel buffer management and aggressive prefetching. The data can be cached in memory or not, depending on your preferences with respect to repeated queries on identical datasets [1].
7. All interpretation of atomic data fields is done in-situ within the byte array, on demand, iff any given field is accessed in a query (sketched below, after the footnotes). The data model structures are never deserialized and no ephemeral objects are created. This is similar to columnar storage, as it eliminates much of the cost of accessing unused columns in standard bulk serializing models, but with a higher degree of inherent simplicity and attendant efficiency. A truly ad hoc system, where it is not known what fields will be accessed at what frequency, if at all, is not an ideal columnar storage candidate.
8. Fetching, in-memory storage, and scans of the data model generate zero JVM objects. They bypass the JVM memory models as well. The byte sequence traversal is scanned using efficient stack-based protocols with data accesses performed via 'unsafe off heap' libraries [2]. The problems associated with large JVM heaps are minimized, as none of this memory is actually 'seen' by the JVM. The JVM processes have quite small heap sizes.
9. There are various optimizations for immutable encodings, e.g. for value maps we store the keys and the values as twin sorted arrays, using a binary search to look up key values. We also use dictionaries to reduce string storage requirements.

[1] Burst may support streaming query processing in a future release.
[2] 'Unsafe' refers to a design pattern where Java code is written using the same techniques the Java libraries use to access non-JVM heap memory (e.g. network and disk IO). It is called unsafe because JVM manufacturers do not offer support for these lower-level libraries, even though they are extensively used and quite reliable.
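As an illustration of the in-situ field access described in item 7, here is a minimal sketch of reading atomic fields directly out of the encoded byte array without deserializing anything. The layout constants and field offsets are hypothetical; the real Brio encoding and its offset arithmetic are not shown. In production Burst uses 'unsafe' off-heap access rather than ByteBuffer, but the principle of interpreting bytes in place is the same.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class InSituReadSketch {
    // Hypothetical fixed offsets within one encoded structure; the real
    // encoding derives these from the schema and traversal position.
    private static final int SESSION_START_TIME_OFFSET = 8;
    private static final int SESSION_EVENT_COUNT_OFFSET = 16;

    // Read fields straight out of the (possibly memory-mapped) byte sequence.
    // No objects are created for the structure itself; only primitives come back.
    static long sessionStartTime(ByteBuffer encoded, int sessionBase) {
        return encoded.getLong(sessionBase + SESSION_START_TIME_OFFSET);
    }

    static int sessionEventCount(ByteBuffer encoded, int sessionBase) {
        return encoded.getInt(sessionBase + SESSION_EVENT_COUNT_OFFSET);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[64]; // stand-in for an encoded Item
        ByteBuffer encoded = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        encoded.putLong(8, 1_450_000_000_000L); // fake session start time
        encoded.putInt(16, 3);                  // fake event count
        System.out.println(sessionStartTime(encoded, 0) + " / " + sessionEventCount(encoded, 0));
    }
}
```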


7. Sample Store 

The Burst architecture uses an Apache HBase key-value store to reliably and efficiently store the continuous, largely unordered incremental feed of assorted Item updates from assorted Domains coming from one or more external data sources. This data is stored in one of a plurality of tables, each called a Province [3]. Each arriving update is a new cell, encoded in the Burst Data Model, in a row keyed by the specific Item, Domain, and Channel [4] in the single Province table where the given Domain is hosted.

[3] Provinces are used to subdivide the overall dataset into separate tables so that efficient table operations can be used to manage, move, and clean up data as needed in manageable chunks.
[4] An Ingestion API/Sample Store management artifact.
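A minimal sketch of what writing one such update into a Province table might look like with the standard HBase client. The table name, column family, and the way the Domain, Channel, and Item are folded into the row key are illustrative assumptions, not the actual Sample Store schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SampleStoreWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table province = conn.getTable(TableName.valueOf("province_007"))) {

            // Hypothetical composite row key: domain + channel + item.
            byte[] rowKey = Bytes.add(Bytes.toBytes(42L),          // domain id
                                      Bytes.toBytes((short) 1),    // channel id
                                      Bytes.toBytes(1234567L));    // item id

            byte[] encodedUpdate = new byte[0]; // Burst Data Model encoded update (placeholder)

            // Each arriving update becomes a new cell in the item's row;
            // the cell timestamp keeps successive updates distinguishable.
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("update"),
                          System.currentTimeMillis(), encodedUpdate);
            province.put(put);
        }
    }
}
```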

8. Dataset Store 

For a query to be executed over a Domain, the appropriate rows in the Sample Store, and the appropriate update cells for each Item, must be scanned and transformed into a Dataset in the Brio Data Model encoding. This transformation is called melding and happens locally on each worker node. Each node creates and stores a single partition of the Dataset [5]. These partitions are the most recent 'view' of the data, as a single byte array cached on local disk (magnetic or solid state). When a query is executed, if the local Worker node has cached the partition, and if it is not considered 'stale', then it is read directly from disk and no meld is required. The melding can also customize the dataset by down-sampling items, along with other forms of object tree filtering, if it is desired to reduce the dataset's size for performance/resource utilization reasons. It is also possible to have more than one defined and reified custom Dataset 'view' per Domain.

[5] i.e. without replication or fault tolerance. In the case of worker node failure, these dataset partitions are recreated on whatever replica location is targeted by HBase/Spark for the next query.

Caching

It is vital that the Dataset partitions be loaded into memory quickly and released aggressively in order to manage expensive/limited DRAM resources efficiently. The load of a Dataset partition is a simple mmap() call of a single file as a single byte array into off-heap memory managed directly by the OS. The scan can proceed before the file has been fully read due to the natural OS semantics of paged disk reads with linear-order prefetching. Since there are essentially zero on-heap artifacts associated with this load, the release of the byte array has minimal GC implications. In this way, the local disk, especially if it is SSD, acts as a cost-effective second-level DRAM cache [6].

[6] If desired, a future version of Burst may support 'streaming' semantics where the scan is executed as the data is read from disk and never cached in memory.
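A minimal sketch of this partition load path using the standard Java NIO memory-mapping API. The file path and treating the whole file as one mapped region are illustrative assumptions (a single MappedByteBuffer is limited to 2 GB, so a real implementation may need to map in chunks).

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PartitionLoadSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                Paths.get("/data/burst/domain-42/partition-007.brio"), // hypothetical cache file
                StandardOpenOption.READ)) {

            // mmap the whole partition file; the OS pages it in lazily with
            // read-ahead, so a scan can begin before the file is fully resident.
            MappedByteBuffer partition = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // A depth-first scan would now walk this buffer in linear byte order.
            System.out.println("mapped " + partition.capacity() + " bytes");
        }
    }
}
```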

9. Query Engine 

The Query Subsystem has an API that consists of a programmer-friendly declarative query language called SILQ, which is translated into a machine-friendly imperative query language called GIST. Both of these are textual languages with a well-defined grammar and syntax [7]. The details are described in [SILQ]. Here we will say that these languages provide a rich and extensible set of aggregation, dimensioning, filtering, and causal/temporal reasoning features. Burst clients form their queries as SILQ, which the SILQ pipeline transforms into GIST. The GIST pipeline transforms those into well-defined execution plans that are multicast to worker nodes. The multidimensional result model is gathered and delivered back to the client.

[7] Very convenient for unit and system testing!

Execution Models

These execution plans contain:

1. Traversal Model: a simple numeric-array-based state machine holding the semantics of what to do where in the object tree traversal
2. Result Schema: the semantics of all aggregations, dimensions, and merges and joins
3. Closures: filters and traversal data model updates in generated and JIT-optimized JVM byte code
4. Routes: log-structured records of graph automata paths

Zap Data Structures

Because of the extreme number of objects visited and the prolific object churn associated with standard data structures, Burst requires specialized data structures, called Zap [8] structures, for inner loops. These are designed to use nothing but simple off-heap blocks of memory, pre-allocated in per-thread chunks, re-used over and over again, and with all needed functions coded using unsafe access patterns (a minimal sketch of this per-thread reuse pattern follows this section). There are just two of these currently [9]:

● ZapMaps: The object tree scan requires a nested overlay of lightweight hash maps with the ability to join [10] with child/peer maps on the fly as the traversal unfolds from parent to child. The ways these nested self-joins can be expressed is an important part of how GIST creates complex ad hoc multidimensional result models. The performance of Zap Maps is a key factor in the overall performance of the system.

● Zap Routes: For causal/temporal reasoning we implemented an off-heap log-structured recording structure with a graph automaton to discover and capture 'paths' through sequences of events. This is how 'Funnels' are implemented in the Explorer product.

[8] 'Zero Allocation Protocol'
[9] We are working on another structure, a Zap Lexicon, that eliminates the use of standard JVM strings, which are quite noisy from the perspective of JVM object creation.
[10] Something like a cross join.
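To illustrate the 'zero allocation' idea behind the Zap structures, here is a minimal sketch of per-thread, pre-allocated, reusable off-heap blocks. The block size, the ThreadLocal pooling, and the toy use of the block are assumptions; the real ZapMap/ZapRoute layouts and their unsafe access code are not shown.

```java
import java.nio.ByteBuffer;

public class ZapBlockSketch {
    private static final int BLOCK_BYTES = 1 << 20; // 1 MB per-thread block (illustrative)

    // One pre-allocated off-heap block per scan thread, reused for every Item
    // traversal so inner loops allocate no JVM objects.
    private static final ThreadLocal<ByteBuffer> BLOCK =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BLOCK_BYTES));

    static ByteBuffer acquire() {
        ByteBuffer block = BLOCK.get();
        block.clear(); // reuse: reset position/limit, no new allocation
        return block;
    }

    public static void main(String[] args) {
        ByteBuffer block = acquire();
        // Toy use: write a couple of longs into the block and read them back in place.
        block.putLong(0, 1234L).putLong(8, 5678L);
        System.out.println(block.getLong(0) + " -> " + block.getLong(8));
    }
}
```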

Concurrency

Because each of the Item instances in a Dataset partition is part of a sequence of individual, order-independent object trees, we refine our concurrency model to a single core/thread dedicated to each traversal. Each of these can be executed in parallel on available cores using a fixed pool model. This makes the hardware happy, as the linear byte array being scanned is read solely by a single core.
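A minimal sketch of the fixed-pool, one-thread-per-Item-traversal concurrency model. The partition/offset representation and the scanItem callback are hypothetical stand-ins for the actual scan plan execution.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionScanSketch {
    // Placeholder for executing one scan plan over one Item's object tree.
    static void scanItem(byte[] partition, int itemOffset) {
        // ... depth-first traversal of the encoded Item starting at itemOffset ...
    }

    static void scanPartition(byte[] partition, List<Integer> itemOffsets) throws InterruptedException {
        // One fixed pool sized to the available cores; each Item traversal is a task,
        // so a given byte range is only ever read by a single thread at a time.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int offset : itemOffsets) {
            pool.submit(() -> scanItem(partition, offset));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```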

Spark

Like the Ingestion Subsystem, the Query Subsystem is built on top of Apache Spark [11], with a Spark Executor on each worker node initialized with a Query Kernel that can execute scan plans. The scan traversals are carefully designed to use a minimum of JVM memory and create a minimum of JVM objects. There is essentially no JVM memory overhead in the storage and execution models other than that created by the IPC protocols.

[11] Burst does not use Spark features extensively; in fact, for the most part it uses Spark as a distributed process manager. The actual Spark execution model is a very simple single-stage scatter/gather model. The implementation abstracts this facility so as to make it easy to move to a different distributed process manager or to roll our own multicast execution model, such as with JGroups.

10. Performance

Because of the efficiency of the scanning techniques involved, one can think of Burst as an 'objects scanned per second' machine, and so the performance of queries is almost exclusively about how many objects the query needs to visit. As an example, in the Flurry mobile analytics world, queries that only look at the top-level object in the tree (the User or Mobile Device) run much faster than queries that need to visit the sessions associated with that User. At the next level, queries that need to visit the events in each session run slower than ones that only look at sessions. Generally, the complexity of the query in terms of what data is accessed and what results are generated at each object is not nearly as impactful.

In our 250-node, 6-SATA-spindle, 48-Haswell-hyperthread cluster, we see a sustained 50 QPS with >1,000 applications in memory. Datasets cold load in <10 s and cache load in <1 s. Generally we scan about 200K objects/sec/hthread.
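As a rough back-of-the-envelope aggregate, assuming the hyperthread count is per node and scaling stays linear, the quoted per-thread rate works out to about 200K objects/sec/hthread × 48 hthreads/node × 250 nodes ≈ 2.4 billion objects scanned per second across the Cell; this is a derived ceiling, not a measured number.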

11. Future Work

The Burst architecture was designed to be extensible, and the GIST language is implemented on top of a 'plugin' abstraction. We have a working first-version plugin of a next generation of SILQ/GIST called HYDRA that combines both into a single language that is more performant in a few key areas. One is that any number of queries can be combined into a single concurrent scan [12]. We are also well into developing more efficient filtering using code-generated predicates that can be used both by HYDRA and for melding.

[12] This is an important optimization for multiple use cases including: 1) 'dashboards', where a mobile application displays an initial UI view with a fixed set of personalized queries; 2) when a dataset is melded, it is critical to provide metadata about that dataset to the query clients in terms of a fixed set of queries, e.g. for the Flurry product the UI needs to display user, session, event, and parameter counts as well as parameter keys and value frequencies to help inform users about formed query relevance during interactive query sessions.

12. Conclusions

By rigorously constraining the data to be queried in terms of a two-level partition model, where the first-level partition (Domains) subdivides the entire dataset into individually queryable subsets and the second-level partition (Items) defines unordered parallel/distributed partitions of sequences of scannable object graphs, and by implementing hyper parallel-distributed-concurrent scans, we can provide a linearly scaling, cost-effective, completely general purpose, ad hoc, low-latency query engine. The first version is deployed in beta behind the recently released Explorer product. The next release introduces an incremental ingestion pipeline allowing this query system to scale to serve all Flurry Explorer customers.

13. References

● [DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets", Proc. of the 36th Int'l Conf on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
● [DRUID] Druid, "Open Source Data Store for Interactive Analytics at Scale": http://druid.io/
● [BLINK] AMPLab, "Queries with Bounded Errors and Bounded Response Times on Very Large Data": http://blinkdb.org/
● [DRILL] MapR, "Industry's First Schema-Free SQL Engine for Big Data": https://www.mapr.com/products/apache-drill
● [TEZ] https://tez.apache.org/
● [PRESTO] https://prestodb.io/
● [SPARK] http://spark.apache.org/
● [DOCKER] https://www.docker.com/
● [HBASE] http://hbase.apache.org/
● [KAFKA] http://kafka.apache.org/
● [SILQ] https://docs.google.com/a/yahoo-inc.com/document/d/1of2GDtLJuItLdNQxDO7E24D6T8hOGspd-Knm8lFnDkM/edit?usp=sharing
