
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics


Erik Freed, Flurry/Yahoo
erikfreed@yahoo-inc.com

Brian Anderson, Flurry/Yahoo
briananderson@yahoo-inc.com

Abstract

We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.

NOTE: This copy has had performance numbers updated and is not the same as the one submitted to Tech Pulse.

1. Introduction

The promise of the Flurry Explorer product is to invite the user into an unstructured interactive discovery session where they can easily pose arbitrary off-the-cuff and potentially complex questions about end user behavior. If they get back answers quickly enough, their next question starts a virtuous cycle of more targeted questions continuously leading to more specific and valuable results. The first major release of the back-end query engine engineered to fully support this type of exploration was developed in the Flurry Analytics group in Q1 2015 and delivered as part of a limited beta of the Explorer feature within Flurry Analytics. We successfully utilized a unique hyper distributed/parallel/concurrent object tree scanning model with a simple daily batched ingestion system for this limited audience. The next major release of this scanning architecture replaces the batched ingestion system with a more scalable incremental data ingestion pipeline to expand the reach of Explorer to all Flurry customers. Here we present the architectural basis and specifics of the previous and upcoming releases.

2. Background

For those of us who have spent any time with production-scale SQL databases, seeing large table scans being sorted and joined in a query plan is cause for panic. We can only relax once we find a way to constrain that query and/or implement heavyweight indices so the query transforms into pure index lookups and partial joins. However, for analytics the use cases are inherently unbounded, personalized, and constantly evolving, while the corpora are typically enormous. This makes adding indices intractable in most cases. These limitations forced us to reevaluate our previous nemesis, the full table scan. We determined that if we could make the scans efficient enough, distribute the scans across enough nodes and CPU cores, and develop a query language that could take an arbitrary ad hoc analytic question and transform it into an instance of this hyper parallel-distributed-concurrent scan model, then we would have an attractively simple general purpose model. We reasoned that this model would scale well not only in terms of input size and general query complexity, but in terms of feature development time, risk, and effort.


3. Top Level View 

 The basic components of the Burst ecosystem are: 

1. External Datasource(s)
2. Ingestion Subsystem
3. Data Model
4. Sample Store
5. Dataset Store
6. Query Subsystem

The previous release of Burst had a simplified batched ingestion model where the exporting MapReduce jobs wrote the entire history of a given mobile application's event stream into new HDFS sequence files on a daily basis. These datasets were then read into memory on demand as users posed queries. This initial beta pipeline design is being replaced by the incremental version described in subsequent sections. The rest of the architecture described here is as currently deployed.

Each of these components (other than the external data sources) is deployed on one or more clusters called a Cell, where each Cell is composed of a master node, a failover master node, and a set of worker nodes. Each Cell has its own Apache Kafka [KAFKA], Apache HBase [HBASE], and Apache Spark [SPARK] clusters deployed. The Master (and failover Master) node contains the master process for each of these systems as well as a Docker [DOCKER] container populated with all of the Burst-specific JVM service processes. The Worker nodes are populated only by the associated Spark, HBase, and Kafka worker-specific deployments. Burst does not itself deploy anything directly onto Worker nodes.

4. Data Sources

Burst is inherently schema independent as well as agnostic to the specific technology of the external datasource. However, the data source must have the following basic characteristics:

1. It must be in a schema that can be expressed in the relationships and datatypes of the Burst Data Model.
2. The external data model can be partitioned into two levels of well-defined shards:

a. The first level is composed of a set of Domain instances that each represent a subset of data that is the input to a single query, e.g. for Flurry Explorer, this is an event stream associated with a single 'Mobile Application' or constructed 'Mobile Application Group'. A query can only be executed against a single Domain at a time.

b. The second level is a strict partitioning across a Domain creating order-independent subsets of Item instances, each of which has a well-defined rooted acyclic object model (tree) that can be scanned in a depth-first, preferably time-ordered, traversal. For Flurry Explorer, this is a single 'Mobile Device', each of which has a set of time-ordered 'sessions', each of which has a set of time-ordered 'events', each of which has a set of unordered key-value map 'parameters'. (A sketch of this two-level model follows the list.)


3. The external data source's physical form can be exported as both a periodic historical batch and a continuous incremental update and fed to the Burst Kafka-based Ingestion API, e.g. for Flurry this is our 2,000 node, six petabyte, ~50 trillion mobile device events, ever-growing HBase cluster with custom MapReduce jobs performing both initializing batch and daily incremental update feeds.
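To make the two-level sharding concrete, here is a minimal sketch of how Flurry-style data could be modeled under it. The class and field names (Domain, Item, Session, Event) are illustrative assumptions, not the actual Burst schema definition.

```java
import java.util.List;
import java.util.Map;

// First-level shard: the unit a single query runs against,
// e.g. one Mobile Application or Mobile Application Group.
class Domain {
    String domainId;
    List<Item> items;       // second-level shards, order independent
}

// One Item, e.g. a single Mobile Device, a rooted object tree
// scannable in a depth-first, time-ordered traversal.
class Item {
    long deviceId;
    List<Session> sessions; // time ordered
}

class Session {
    long startTime;
    List<Event> events;     // time ordered
}

class Event {
    long eventTime;
    int eventType;
    Map<String, String> parameters; // unordered key-value map
}
```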

5. Ingestion

The new Burst Ingestion Subsystem design starts with Kafka queues that provide a control-plane (control/administration) and a data-plane (data feeds). The data source is responsible for sending and responding to the control-plane, as well as feeding the data-plane in response to control-plane messages. An Apache Spark based process model manages control-plane and data-plane operations. It is responsible for transforming the schema of the external system into an appropriate Burst schema, updating the Sample Store as data arrives.
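A minimal sketch of how an external data source might feed the two planes over Kafka. The topic names, message formats, and the idea of keying data-plane records by domain/item are assumptions for illustration; the actual Burst Ingestion API contracts are not shown here.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class IngestionFeedSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "cell-master:9092"); // hypothetical broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // Control-plane: announce that an incremental update for a domain is ready.
            producer.send(new ProducerRecord<>("burst-control", "domain-42",
                    "INCREMENTAL_UPDATE_READY".getBytes()));

            // Data-plane: push item updates, keyed so that all updates for an item
            // land in the same partition and stay ordered per item.
            byte[] encodedItemUpdate = new byte[0]; // placeholder for an encoded update
            producer.send(new ProducerRecord<>("burst-data", "domain-42/item-12345",
                    encodedItemUpdate));
        }
    }
}
```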

6. Data Model

The Burst Data Model has the following requirements/features/implementation details:

1. It is schema independent, but schema defined.
2. It is schema versioned, and supports heterogeneous versioned collections.
3. The data model/schema supports type structures, singular and plural structure reference relationships, value collections, value maps, and atomic data types (boolean, byte, short, int, long, double, string).
4. The data model/schema inherently defines a tree with a well-defined root as part of a well-defined traversal.
5. Data is encoded in a single byte array where the disk storage encoding is identical to the in-memory format.
6. This encoding is an unrolled depth-first traversal of the object tree as a linear sequence of bytes. The reading from disk into memory and the traversal scans are in the same exact byte order and thus can take direct advantage of the OS disk mmap semantics, with the associated high-performance kernel buffer management and aggressive prefetching. The data can be cached in memory or not, depending on your preferences with respect to repeated queries on identical datasets [1].
7. All interpretation of atomic data fields is done in-situ within the byte array, on demand, iff any given field is accessed in a query (sketched below, after the footnotes). The data model structures are never deserialized and no ephemeral objects are created. This is similar to columnar storage, as it eliminates much of the cost of accessing unused columns in standard bulk serializing models, but with a higher degree of inherent simplicity and attendant efficiency. A truly ad hoc system, where it is not known what fields will be accessed at what frequency, if at all, is not an ideal columnar storage candidate.
8. Fetching, in-memory storage, and scans of the data model generate zero JVM objects. They bypass the JVM memory models as well. The byte sequence traversal is scanned using efficient stack-based protocols with data accesses performed via 'unsafe off heap' libraries [2]. The problems associated with large JVM heaps are minimized, as none of this memory is actually 'seen' by the JVM. The JVM processes have quite small heap sizes.
9. There are various optimizations for immutable encodings, e.g. for value maps we store the keys and the values as twin sorted arrays, using a binary search to look up key values. We also use dictionaries to reduce string storage requirements.

[1] Burst may support streaming query processing in a future release.
[2] 'Unsafe' refers to a design pattern where Java code is written using the same techniques the Java libraries use to access non-JVM heap memory (e.g. network and disk IO). It is called unsafe because JVM manufacturers do not offer support for these lower-level libraries, even though they are extensively used and quite reliable.
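As an illustration of the in-situ field access described in item 7, here is a minimal sketch of reading atomic fields directly out of the encoded byte array without deserializing anything. The layout constants and field offsets are hypothetical; the real Brio encoding and its offset arithmetic are not shown. In production Burst uses 'unsafe' off-heap access rather than ByteBuffer, but the principle of interpreting bytes in place is the same.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class InSituReadSketch {
    // Hypothetical fixed offsets within one encoded structure; the real
    // encoding derives these from the schema and traversal position.
    private static final int SESSION_START_TIME_OFFSET = 8;
    private static final int SESSION_EVENT_COUNT_OFFSET = 16;

    // Read fields straight out of the (possibly memory-mapped) byte sequence.
    // No objects are created for the structure itself; only primitives come back.
    static long sessionStartTime(ByteBuffer encoded, int sessionBase) {
        return encoded.getLong(sessionBase + SESSION_START_TIME_OFFSET);
    }

    static int sessionEventCount(ByteBuffer encoded, int sessionBase) {
        return encoded.getInt(sessionBase + SESSION_EVENT_COUNT_OFFSET);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[64]; // stand-in for an encoded Item
        ByteBuffer encoded = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        encoded.putLong(8, 1_450_000_000_000L); // fake session start time
        encoded.putInt(16, 3);                  // fake event count
        System.out.println(sessionStartTime(encoded, 0) + " / " + sessionEventCount(encoded, 0));
    }
}
```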


7. Sample Store 

The Burst architecture uses an Apache HBase key-value store to reliably and efficiently store the continuous, largely unordered incremental feed of assorted Item updates from assorted Domains coming from one or more external data sources. This data is stored in one of a plurality of tables, each called a Province [3]. Each arriving update is a new cell, encoded in the Burst Data Model, in a row keyed by the specific Item, Domain, and Channel [4] in the single Province table where the given Domain is hosted.

[3] Provinces are used to subdivide the overall dataset into separate tables so that efficient table operations can be used to manage, move, and clean up data as needed in manageable chunks.
[4] An Ingestion API/Sample Store management artifact.
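A minimal sketch of what writing one such update into a Province table might look like with the standard HBase client. The table name, column family, and the way the Domain, Channel, and Item are folded into the row key are illustrative assumptions, not the actual Sample Store schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SampleStoreWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table province = conn.getTable(TableName.valueOf("province_007"))) {

            // Hypothetical composite row key: domain + channel + item.
            byte[] rowKey = Bytes.add(Bytes.toBytes(42L),          // domain id
                                      Bytes.toBytes((short) 1),    // channel id
                                      Bytes.toBytes(1234567L));    // item id

            byte[] encodedUpdate = new byte[0]; // Burst Data Model encoded update (placeholder)

            // Each arriving update becomes a new cell in the item's row;
            // the cell timestamp keeps successive updates distinguishable.
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("update"),
                          System.currentTimeMillis(), encodedUpdate);
            province.put(put);
        }
    }
}
```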

8. Dataset Store 

For a query to be executed over a Domain, the appropriate rows in the Sample Store, and the appropriate update cells for each Item, must be scanned and transformed into a Dataset in the Brio Data Model encoding. This transformation is called melding and happens locally on each worker node. Each node creates and stores a single partition of the Dataset [5]. These partitions are the most recent 'view' of the data, as a single byte array cached on local disk (magnetic or solid state). When a query is executed, if the local Worker node has cached the partition, and if it is not considered 'stale', then it is read directly from disk and no meld is required. The melding can also customize the dataset by down-sampling items, along with other forms of object tree filtering, if it is desired to reduce the dataset's size for performance/resource utilization reasons. It is also possible to have more than one defined and reified custom Dataset 'view' per Domain.

[5] i.e. without replication or fault tolerance. In the case of worker node failure, these dataset partitions are recreated on whatever replica location is targeted by HBase/Spark for the next query.

Caching

It is vital that the Dataset partitions be loaded into memory quickly and released aggressively in order to manage expensive/limited DRAM resources efficiently. The load of a Dataset partition is a simple mmap() call of a single file as a single byte array into off-heap memory managed directly by the OS. The scan can proceed before the file has been fully read due to the natural OS semantics of paged disk reads with linear-order prefetching. Since there are essentially zero on-heap artifacts associated with this load, the release of the byte array has minimal GC implications. In this way, the local disk, especially if it is SSD, acts as a cost-effective second-level DRAM cache [6].

[6] If desired, a future version of Burst may support 'streaming' semantics where the scan is executed as the data is read from disk and never cached in memory.
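A minimal sketch of this partition load path using the standard Java NIO memory-mapping API. The file path and treating the whole file as one mapped region are illustrative assumptions (a single MappedByteBuffer is limited to 2 GB, so a real implementation may need to map in chunks).

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PartitionLoadSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                Paths.get("/data/burst/domain-42/partition-007.brio"), // hypothetical cache file
                StandardOpenOption.READ)) {

            // mmap the whole partition file; the OS pages it in lazily with
            // read-ahead, so a scan can begin before the file is fully resident.
            MappedByteBuffer partition = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // A depth-first scan would now walk this buffer in linear byte order.
            System.out.println("mapped " + partition.capacity() + " bytes");
        }
    }
}
```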

9. Query Engine 

The Query Subsystem has an API that consists of a programmer-friendly declarative query language called SILQ, which is translated into a machine-friendly imperative query language called GIST. Both of these are textual languages with a well-defined grammar and syntax [7]. The details are described in [SILQ]. Here we will say that these languages provide a rich and extensible set of aggregation, dimensioning, filtering, and causal/temporal reasoning features. Burst clients form their queries as SILQ, which the SILQ pipeline transforms into GIST. The GIST pipeline transforms those into well-defined execution plans that are multicast to worker nodes. The multidimensional result model is gathered and delivered back to the client.

[7] Very convenient for unit and system testing!

Execution Models

These execution plans contain:

1. Traversal Model: a simple numeric-array-based state machine holding the semantics of what to do where in the object tree traversal
2. Result Schema: the semantics of all aggregations, dimensions, and merges and joins
3. Closures: filters and traversal data model updates in generated and JIT-optimized JVM byte code
4. Routes: log-structured records of graph automata paths

Zap Data Structures

Because of the extreme number of objects visited and the prolific object churn associated with standard data structures, Burst requires specialized data structures, called Zap [8] structures, for inner loops. These are designed to use nothing but simple off-heap blocks of memory, pre-allocated in per-thread chunks, re-used over and over again, and with all needed functions coded using unsafe access patterns (a minimal sketch of this per-thread reuse pattern follows this section). There are just two of these currently [9]:

● ZapMaps: The object tree scan requires a nested overlay of lightweight hash maps with the ability to join [10] with child/peer maps on the fly as the traversal unfolds from parent to child. The ways these nested self-joins can be expressed is an important part of how GIST creates complex ad hoc multidimensional result models. The performance of Zap Maps is a key factor in the overall performance of the system.

● Zap Routes: For causal/temporal reasoning we implemented an off-heap log-structured recording structure with a graph automaton to discover and capture 'paths' through sequences of events. This is how 'Funnels' are implemented in the Explorer product.

[8] 'Zero Allocation Protocol'
[9] We are working on another structure, a Zap Lexicon, that eliminates the use of standard JVM strings, which are quite noisy from the perspective of JVM object creation.
[10] Something like a cross join.
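To illustrate the 'zero allocation' idea behind the Zap structures, here is a minimal sketch of per-thread, pre-allocated, reusable off-heap blocks. The block size, the ThreadLocal pooling, and the toy use of the block are assumptions; the real ZapMap/ZapRoute layouts and their unsafe access code are not shown.

```java
import java.nio.ByteBuffer;

public class ZapBlockSketch {
    private static final int BLOCK_BYTES = 1 << 20; // 1 MB per-thread block (illustrative)

    // One pre-allocated off-heap block per scan thread, reused for every Item
    // traversal so inner loops allocate no JVM objects.
    private static final ThreadLocal<ByteBuffer> BLOCK =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BLOCK_BYTES));

    static ByteBuffer acquire() {
        ByteBuffer block = BLOCK.get();
        block.clear(); // reuse: reset position/limit, no new allocation
        return block;
    }

    public static void main(String[] args) {
        ByteBuffer block = acquire();
        // Toy use: write a couple of longs into the block and read them back in place.
        block.putLong(0, 1234L).putLong(8, 5678L);
        System.out.println(block.getLong(0) + " -> " + block.getLong(8));
    }
}
```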

Concurrency

Because each of the Item instances in a Dataset partition is part of a sequence of individual, order-independent object trees, we refine our concurrency model to a single core/thread dedicated to each traversal. Each of these can be executed in parallel on available cores using a fixed pool model. This makes the hardware happy, as the linear byte array being scanned is read solely by a single core.
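A minimal sketch of the fixed-pool, one-thread-per-Item-traversal concurrency model. The partition/offset representation and the scanItem callback are hypothetical stand-ins for the actual scan plan execution.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionScanSketch {
    // Placeholder for executing one scan plan over one Item's object tree.
    static void scanItem(byte[] partition, int itemOffset) {
        // ... depth-first traversal of the encoded Item starting at itemOffset ...
    }

    static void scanPartition(byte[] partition, List<Integer> itemOffsets) throws InterruptedException {
        // One fixed pool sized to the available cores; each Item traversal is a task,
        // so a given byte range is only ever read by a single thread at a time.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int offset : itemOffsets) {
            pool.submit(() -> scanItem(partition, offset));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```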

Spark

Like the Ingestion Subsystem, the Query Subsystem is built on top of Apache Spark [11], with a Spark Executor on each worker node initialized with a Query Kernel that can execute scan plans. The scan traversals are carefully designed to use a minimum of JVM memory and create a minimum of JVM objects. There is essentially no JVM memory overhead in the storage and execution models other than that created by the IPC protocols.

[11] Burst does not use Spark features extensively; in fact, for the most part it uses Spark as a distributed process manager. The actual Spark execution model is a very simple single-stage scatter/gather model. The implementation abstracts this facility so as to make it easy to move to a different distributed process manager or to roll our own multicast execution model, such as with JGroups.

10. Performance

Because of the efficiency of the scanning techniques involved, one can think of Burst as an 'objects scanned per second' machine, and so the performance of queries is almost exclusively about how many objects the query needs to visit. As an example, in the Flurry mobile analytics world, queries that only look at the top-level object in the tree (the User or Mobile Device) run much faster than queries that need to visit the sessions associated with that User. At the next level, queries that need to visit the events in each session run slower than ones that only look at sessions. Generally, the complexity of the query in terms of what data is accessed and what results are generated at each object is not nearly as impactful.

In our 250-node, 6-SATA-spindle, 48-Haswell-hyperthread cluster, we see a sustained 50 QPS with >1,000 applications in memory. Datasets cold load in <10 s and cache load in <1 s. Generally we scan about 200K objects/sec/hthread.
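As a rough back-of-the-envelope aggregate, assuming the hyperthread count is per node and scaling stays linear, the quoted per-thread rate works out to about 200K objects/sec/hthread × 48 hthreads/node × 250 nodes ≈ 2.4 billion objects scanned per second across the Cell; this is a derived ceiling, not a measured number.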

11. Future Work

The Burst architecture was designed to be extensible, and the GIST language is implemented on top of a 'plugin' abstraction. We have a working first-version plugin of a next generation of SILQ/GIST called HYDRA that combines both into a single language that is more performant in a few key areas. One is that any number of queries can be combined into a single concurrent scan [12]. We are also well into developing more efficient filtering using code-generated predicates that can be used both by HYDRA and for melding.

[12] This is an important optimization for multiple use cases including: 1) 'dashboards', where a mobile application displays an initial UI view with a fixed set of personalized queries; 2) when a dataset is melded, it is critical to provide metadata about that dataset to the query clients in terms of a fixed set of queries, e.g. for the Flurry product the UI needs to display user, session, event, and parameter counts as well as parameter keys and value frequencies to help inform users about formed query relevance during interactive query sessions.

12. Conclusions

By rigorously constraining the data to be queried in terms of a two-level partition model, where the first-level partition (Domains) subdivides the entire dataset into individually queryable subsets and the second-level partition (Items) defines unordered parallel/distributed partitions of sequences of scannable object graphs, and by implementing hyper parallel-distributed-concurrent scans, we can provide a linearly scaling, cost-effective, completely general purpose, ad hoc, low-latency query engine. The first version is deployed in beta behind the recently released Explorer product. The next release introduces an incremental ingestion pipeline allowing this query system to scale to serve all Flurry Explorer customers.

13. References

● [DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets", Proc. of the 36th Int'l Conf on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
● [DRUID] Druid, "Open Source Data Store for Interactive Analytics at Scale": http://druid.io/
● [BLINK] AMPLab, "Queries with Bounded Errors and Bounded Response Times on Very Large Data": http://blinkdb.org/
● [DRILL] MapR, "Industry's First Schema-Free SQL Engine for Big Data": https://www.mapr.com/products/apache-drill
● [TEZ] https://tez.apache.org/
● [PRESTO] https://prestodb.io/
● [SPARK] http://spark.apache.org/
● [DOCKER] https://www.docker.com/
● [HBASE] http://hbase.apache.org/
● [KAFKA] http://kafka.apache.org/
● [SILQ] https://docs.google.com/a/yahoo-inc.com/document/d/1of2GDtLJuItLdNQxDO7E24D6T8hOGspd-Knm8lFnDkM/edit?usp=sharing
