Native Store Extension for SAP HANA - VLDB · 2019-08-21


Native Store Extension for SAP HANA

Reza Sherkat, Colin Florendo, Mihnea Andrei, Rolando Blanco, Adrian Dragusanu, Amit Pathak, Pushkar Khadilkar, Neeraj Kulkarni, Christian Lemke, Sebastian Seifert, Sarika Iyer, Sasikanth Gottapu, Robert Schulze, Chaitanya Gottipati, Nirvik Basak, Yanhong Wang, Vivek Kandiyanallur, Santosh Pendap, Dheren Gala, Rajesh Almeida, Prasanta Ghosh

SAP SE

[email protected]

ABSTRACT

We present an overview of SAP HANA's Native Store Extension (NSE). This extension substantially increases database capacity, allowing HANA to scale far beyond available system memory. NSE is based on a hybrid in-memory and paged column store architecture composed from data access primitives. These primitives enable the processing of hybrid columns using the same algorithms optimized for HANA's traditional in-memory columns. Using only three key primitives, we fabricated byte-compatible counterparts for complex memory-resident data structures (e.g. dictionary and hash index), compression schemes (e.g. sparse and run-length encoding), and exotic data types (e.g. geo-spatial). We developed a new buffer cache which optimizes the management of paged resources with smart strategies sensitive to page type and access patterns. The buffer cache integrates with HANA's new execution engine, which issues pipelined prefetch requests to improve disk access patterns. A novel load unit configuration, along with a unified persistence format, allows the hybrid column store to dynamically switch between in-memory and paged data access, balancing performance and storage economy according to application demands while reducing Total Cost of Ownership (TCO). A new partitioning scheme supports load unit specification at table, partition, and column level. Finally, a new advisor recommends optimal load unit configurations. Our experiments illustrate the performance and memory footprint improvements on typical customer scenarios.

PVLDB Reference Format:
Reza Sherkat, Colin Florendo, Mihnea Andrei, Rolando Blanco, Adrian Dragusanu, Amit Pathak, Pushkar Khadilkar, Neeraj Kulkarni, Christian Lemke, Sebastian Seifert, Sarika Iyer, Sasikanth Gottapu, Robert Schulze, Chaitanya Gottipati, Nirvik Basak, Yanhong Wang, Vivek Kandiyanallur, Santosh Pendap, Dheren Gala, Rajesh Almeida, and Prasanta Ghosh. Native Store Extension for SAP HANA. PVLDB, 12(12): 2047-2058, 2019.
DOI: https://doi.org/10.14778/3352063.3352123

1. INTRODUCTION

The SAP HANA database (referred to simply as HANA) created a hybrid transactional and analytical processing paradigm, known as translytics or HTAP, by performing efficiently over the same shared data for both workloads. Performance for such combined workloads is achieved by combining the columnar store engine with highly optimized in-memory processing technologies. The combination of columnar store and in-memory processing optimizes performance by providing fast columnar access while avoiding per-column page access. HANA's original qualities were thus HTAP, multi-model, and performance. From the beginning, HANA's approach was to focus on qualities and treat features as being subservient to these desired qualities. Likewise, in cloud processing, the qualities of a service are more important than its features; a service with a high TCO or a low availability will fail to be adopted even if the service is feature-rich. HANA naturally evolves and extends its focus to cloud qualities, e.g. TCO, availability, elasticity, and automation.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. 12
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3352063.3352123

HANA's Native Store Extension (NSE) is part of this major shift and one of the technology enhancements which reshape HANA as a cloud-native database. The focus of NSE is to give HANA a cloud-native Infrastructure-as-a-Service cost structure with low TCO and the capability to cover a wide range of cost and performance choices while preserving HANA's original qualities. NSE extends HANA with a block-device-oriented store without introducing a specialized disk engine. Instead, we have taken a hybrid approach: the seamless extension of the in-memory store and engine. The extension is based on having a unified persistence format for in-memory processing and for paged access. This unified format is friendly to paging, thus enabling efficient disk processing, e.g. using a buffer cache, touching a minimal number of pages for point access, and prefetching pages when appropriate for serial scans. The content of persistent pages is byte-compatible with contiguous sections of HANA's original in-memory columnar format. Therefore, the disk extension brings no performance degradation.

We extended the in-memory engine at the store access level and at the higher relational operator level. At the store access level (the leaves of the query plan), disk processing reuses the in-memory code within the limited scope of each page. The relational operators of HANA's execution engine are pipelined whenever on-the-fly processing is possible (e.g. for filter, project, hash probe). There are pipeline breakers whenever the whole input must be consumed before generating results. The execution engine's disk extension required no change for the pipelined operators. Only the pipeline breakers are extended with disk-based variants to limit their processing memory footprint, keeping HANA's in-memory processing performance the same for loaded columns.

2. AN OVERVIEW OF SAP HANA

HANA supports both analytical and transactional access patterns [26, 20] and different data representations (structured, semi-structured and unstructured). HANA makes use of modern hardware features like multi-core CPUs [31], non-volatile memory (NVM) [6] and large main memory sizes. With parallelization at all levels (scale-up, scale-out) and MVCC-based transaction management [16], HANA provides scalability for increasingly complex enterprise workloads.

2.1 In-Memory Column Store

The relational capabilities of HANA are built around an in-memory column store. Columns are internally split into a read-optimized main fragment, holding the majority of the column data, and a small write-optimized delta fragment. Inserted or updated records are always appended to the end of the delta fragment. Queries are executed independently on both fragments and the two result sets are joined and returned, with some rows removed after applying row visibility rules. The tuples of the delta fragment are regularly merged into the main fragment by the delta merge operation. Data is stored using domain coding: the column values are substituted by bit-packed value ids which point into a dictionary that contains the distinct, sorted (main) or unsorted (delta) original column values. Reverse lookups are accelerated by an optional inverted index. Depending on the data distribution, the value identifier array in main fragments is subject to further compaction using prefix, sparse, run-length, indirect, and cluster compression [18]. Domain coding, in conjunction with advanced compression, reduces memory footprint by up to a factor of 10. The processing of queries in HANA is rapidly converging towards HEX (HANA Execution Engine), a new state-of-the-art query engine. HEX achieves great locality for queries using two features. First, HEX is able to merge adjacent physical plan operators and compile them into native code using LLVM. This is useful for CPU-intensive parts of a query where as much data as possible shall be retained in CPU registers. Second, HEX pipes tuple batches of approximately CPU-cache size through a chain of plan operators. Compared to the traditional row-at-a-time model, pipelining amortizes operator invocation costs and improves cache behavior significantly.
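To make the domain-coding idea above concrete, the following is a minimal sketch (not HANA's actual implementation): column values are replaced by value ids into a distinct, sorted dictionary, and the id width in bits follows from the dictionary size. The function names `domain_encode` and `decode` are illustrative only.

```python
import math

def domain_encode(values):
    """Sketch of domain coding: substitute column values by value ids
    that point into a distinct, sorted dictionary (main-fragment style)."""
    dictionary = sorted(set(values))                 # distinct, sorted values
    index = {v: i for i, v in enumerate(dictionary)}
    value_ids = [index[v] for v in values]
    # n-bit packing: each id needs only ceil(log2(#distinct)) bits
    bits = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, value_ids, bits

def decode(dictionary, value_ids):
    """Materialize the original values from dictionary and value ids."""
    return [dictionary[i] for i in value_ids]

dictionary, ids, bits = domain_encode(["DE", "US", "DE", "FR", "US", "DE"])
# 3 distinct values, so each id fits in 2 bits instead of a full string
assert decode(dictionary, ids) == ["DE", "US", "DE", "FR", "US", "DE"]
```

The further compaction schemes mentioned above (prefix, sparse, run-length, indirect, cluster) then operate on this value-id array rather than on the raw values.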

2.2 In-Memory and On-Disk Processing

HANA's in-memory store serves many use cases well. However, database capacity is inherently constrained by the available amount of memory. As a remedy, SAP recently added NVM as a new storage option to HANA [6]. NVM continues to be a strategic direction; however, its size is expected to be only factors larger than DRAM, whereas disk storage is orders of magnitude larger than DRAM. To substantially increase HANA's capacity and scale with the best cost/performance ratio, we extended HANA with native disk store technologies. Our solution implements a hybrid column store architecture which provides in-memory and page loadability across all compression types, data types, and database objects (schema, table, and column). With this new hybrid column store, HANA not only enables seamless migration of data to slower storage as data volume increases, it also retains full functionality with predictable and elastic degradation of performance in resource-constrained environments such as cloud deployments.

3. ARCHITECTURAL OVERVIEW OF HYBRID COLUMN STORE IN SAP HANA

HANA's NSE adds complementary disk-based column store features to HANA's in-memory column store, which together form a flexible, hybrid column store. This hybrid column store offers in-memory processing for performance-critical operations and buffer-managed paged processing for less critical, economical data access. This hybrid capability extends up from a unified persistence format which can be used to load paged or in-memory primitive structures, which in turn form paged or in-memory column store structures (data vectors, dictionaries, and indexes) that are ultimately arranged by the hybrid column store according to each column's load configuration. Hybrid columns can be configured to load all in-memory structures, all paged structures, or a mixture of both. This configuration can be manually set by the user or application, as well as automatically set based on workload, data access patterns, and intelligent algorithms. HANA's hybrid column store architecture is optimized towards in-memory processing in that the original HANA in-memory column store performance is preserved for in-memory configurations, while paged configurations are aligned as closely as possible to the in-memory structure to provide maximum code sharing at the top level. Fortunately, the existing set of in-memory column implementations (uncompressed and compressed simple values, as well as specialized representations for materialized hash index, persistent row id, and spatial data type) share some common high-level structures (dictionaries, indexes, data vectors and block vectors). These pieces can be encapsulated and interchanged with paged implementations.

The hybrid column store (Figure 1) is built on the concept of primitive building blocks. A primitive hides the in-memory or paged nature of these building blocks behind standard APIs. As an example, the most widely used in-memory primitive in the HANA column store is the n-bit compressed data vector, which provides a compressed representation of dictionary-encoded values in contiguous memory, while the paged counterpart provides the same compressed, byte-compatible representation in paged memory. Both primitives provide the same API but different memory footprint and performance characteristics. HANA's Data Definition Language (DDL) SQL has been extended to allow the specification of load unit hints for database objects. A load unit hint declares the preferred access type (either paged or fully in-memory) of a database object. Load unit specifications allow for flexible multi-level table partitioning options, where all or selected table partitions can be declared with a given load unit tag such that the columns on the main fragments of the specified partitions follow a given load unit. New DDL statements are also provided for the efficient conversion of legacy database objects (originally created without NSE's unified persistence format) into the unified persistence format. An elastic buffer cache is provided to manage paged memory efficiently within a configurable size limit of the total memory allocated to the HANA database server. This limit is dynamically adjustable, and it works in tandem with the memory resource manager to ensure resources using paged memory do not interfere with memory resources allocated for in-memory objects.
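The idea that both primitives expose the same API with different footprint and performance characteristics can be sketched as follows. The class names and the `per_page` parameter are hypothetical; this is not HANA's actual interface, only an illustration of the pattern.

```python
from abc import ABC, abstractmethod

class DataVector(ABC):
    """Hypothetical uniform primitive API: callers are unaware of whether
    the values live in contiguous memory or on buffer-cache pages."""
    @abstractmethod
    def __len__(self): ...
    @abstractmethod
    def get(self, pos): ...

class InMemoryVector(DataVector):
    """Contiguous in-memory layout: fast, fully resident."""
    def __init__(self, values): self._v = list(values)
    def __len__(self): return len(self._v)
    def get(self, pos): return self._v[pos]

class PagedVector(DataVector):
    """Paged layout: a point access touches only the one page holding pos."""
    def __init__(self, pages, per_page):
        self._pages, self._per_page = pages, per_page
    def __len__(self): return sum(len(p) for p in self._pages)
    def get(self, pos):
        return self._pages[pos // self._per_page][pos % self._per_page]

def count_matches(vec: DataVector, needle):
    """A scan written once against the API runs unchanged on either layout."""
    return sum(1 for i in range(len(vec)) if vec.get(i) == needle)

assert count_matches(InMemoryVector([1, 2, 2, 3]), 2) == 2
assert count_matches(PagedVector([[1, 2], [2, 3]], per_page=2), 2) == 2
```

The key design point is that the scan algorithm is shared; only the storage layer behind `get` differs, which is what lets NSE reuse HANA's in-memory operator code.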

[Figure 1: HANA's Hybrid Column Store Architecture. Components shown: memory primitives (vector, n-bit compressed vector, bit vector, custom memory block structures); paged primitives (paged vector, paged n-bit vector, paged bit vector, generic on-page block storage); the hybrid column store with specialized column implementations (uncompressed, compressed, spatial, rowid, hash key) and hybrid column structures (dictionaries, data vectors, indexes) that may be loaded in memory or paged; the HEX query engine, supporting hybrid column store features including prefetch; data temperature statistics, recording data access patterns and frequency; partition management, with flexible data partitioning and aging with the native storage extension; the NSE Advisor, a heuristics-based page-ability recommendation tool; NSE configuration and conversion DDL, covering paging eligibility for tables, partitions, and columns as well as efficient, non-blocking conversion from existing storage to NSE; the elastic buffer cache, managing the paged memory pool, replacement policy, and prefetch for page-loaded structures; and the unified persistence format.]

An NSE Load Unit Advisor utility is provided to assist users with configuring the load unit for the database objects accessed by a given workload. The Load Unit Advisor provides a heuristics-based engine that interprets data usage patterns collected by the hybrid column store to make recommendations on load unit changes; for example, that a table or column should be page loadable. These recommendations are aimed at placing smaller and frequently accessed database objects fully in memory while allowing larger and less frequently accessed objects in paged memory, balancing memory footprint with performance. Our contributions are presented as follows:

• We introduce the unified persistency (Section 4), which is consumed by the pageable primitives (Section 5) and designed to integrate well with the in-memory column store.

• We present the design of our new buffer cache (Section 6) and its integration with the new execution engine (Section 7).

• We present metadata and online dynamic load unit conversion at column, table, and partition granularities (Section 8).

• We present the design of the recommendation engine that provides hints for load unit conversion (Section 9).

• We share experimental results (Section 10) demonstrating memory footprint reduction and performance impact in end-to-end evaluation on large workloads in different scenarios.

4. UNIFIED PERSISTENCY FORMAT

Piecewise columnar access for read-optimized main fragments was previously introduced to HANA [29]. This paged access was designed to achieve minimal memory utilization as a complement to the high-performance data access algorithms developed and optimized for fully memory-resident columns. This paged access empowered several use cases that relied on clear resolution of the load unit in advance. The configuration of the load unit had to be done explicitly by an application expert who knows the data access patterns and the tradeoff between memory consumption and performance. One of the use cases was data aging, where recently added data is considered hot and configured to use the in-memory load unit, while older data is considered warm and configured to use the paged load unit. When data is aged, hot data is converted to warm data by a lengthy persistence change from memory-resident to page-loadable. Once a column is chosen to be in either format, it will remain so throughout its life cycle. An explicit load unit conversion can change a column's persistency format (Section 8.2). However, the format conversion is slow, as it changes the format for every subcomponent of a column, i.e. the data vector, the dictionary, and any index present. With the introduction of a common persistence format, referred to as unified persistence in this paper, we bridge the gap between the two representations and avoid expensive format conversions when the load unit changes.

Unified persistence is achieved by providing an identical byte-compatible representation for each columnar data structure. This is achieved by composing primitives to construct a page-loadable equivalent of any data structure of each column. If a subcomponent of a column needs to be fully loaded in memory for optimal performance, instead of needing a persistency format change, we switch the access mode for each data structure and primitive by allocating contiguous memory which is populated quickly and efficiently by copying data from temporarily loaded pages (in the buffer cache) to memory. This step can be performed independently for each of the used primitives. When a subcomponent is small enough, e.g. less than the minimum page size, it can be co-located with other such subcomponents of the same column on the same page chain, and it will always be loaded in memory directly and independently.
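Because the persisted pages are byte-compatible with the in-memory layout, switching a subcomponent to the in-memory access mode reduces to a straight copy, as in this minimal sketch. The `load_page` callback stands in for a buffer-cache lookup and is purely illustrative.

```python
def materialize_in_memory(page_chain, load_page):
    """Sketch of a load-unit switch under unified persistence: no format
    conversion is needed, only a copy of the byte-compatible page contents
    into one contiguous in-memory buffer. `load_page` is a stand-in for
    fetching a page through the buffer cache."""
    out = bytearray()
    for page_id in page_chain:
        out += load_page(page_id)   # each page is pinned only transiently
    return bytes(out)               # contiguous in-memory representation

# toy "buffer cache": page id -> page payload bytes
pages = {1: b"\x01\x02", 2: b"\x03\x04"}
assert materialize_in_memory([1, 2], pages.__getitem__) == b"\x01\x02\x03\x04"
```

The reverse switch (in-memory to paged) is equally cheap for the same reason: the bytes already are in the persisted page format.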

The unified persistency enables a new class of use cases by reducing the total cost of ownership for both cloud and on-premise installations. Not only is there no need to perform a conversion between the two representations, but we can also dynamically load a column in either load unit type, and do so for each of its subcomponents. If needed, applications can still provide a load unit type hint at table, partition or column level. However, this is entirely optional, as the NSE Load Unit Advisor (Section 9) can analyze the workload of the application and provide the load unit type hints. Moreover, to achieve an optimal cost/performance ratio, a strategy can keep the most frequently used substructures of selected columns in memory and the others in paged format residing on less performant media. This is possible mainly because the cost of switching from one load unit type to another is minimal (there is no persistency format conversion to switch load unit).

[Figure 2: The multi-page vectors primitive: two in-memory vectors (left) and their byte-compatible paged primitives (right) on a single page chain. Vector1 and Vector2 each carry their own content metadata and share pages 1-3 of the chain.]

5. PAGEABLE PRIMITIVES

Most of the common data structures of a column (i.e. data vector, dictionary, and inverted index) are implemented in HANA with the help of one or more compressed in-memory vectors. Vectors were chosen intentionally to store these data structures in contiguous memory. This enables optimized processor cache access to any element in constant time. This contiguous representation also enables optimized parallel scan operators that are fine-tuned to take advantage of multi-core vectorized processing [31]. The architecture of the native store extension is based on the unified persistency format and the ability of transient paged and in-memory structures to load from it efficiently. This requires not only providing a byte-compatible representation across the in-memory and page-loadable representations of every data structure, but also employing identical algorithms to process classic and page-loadable pieces of a column. This brings two key challenges: to provide paged counterparts of memory-resident data structures, and to adapt existing algorithms to work seamlessly with either format without regressing HANA's column store performance. To achieve this, we introduced the concept of paged primitives. Primitives are a key enabler of the unified persistency in NSE. We motivate the idea of composing primitives to support compressed pageable columns as well as advanced memory reduction features, and remark on our performance enhancements.

5.1 Primitives

Paged primitives provide a uniform interface with their in-memory counterparts to read-write algorithms. This allows our code base to seamlessly operate on either format with minimal adaptation, hiding the details of operating on data in the native store extension. Furthermore, these primitives are composable and can be enriched with auxiliary, performant search structures and indexes.

5.1.1 Multi-page Vectors

A large vector can be stored on a single page chain with fixed-size pages. Having a fixed number of objects per page simplifies identifying the page that has the content for a given vector position [29]. For NSE, we enriched this primitive with the possibility of storing more than one vector on each page chain (Figure 2), with each vector having its own metadata. Once a vector is sorted, each multi-page vector can be extended with a helper structure to facilitate search and to avoid loading pages that are guaranteed not to contain a value satisfying the search.
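The helper structure for a sorted multi-page vector can be sketched as follows: keep only the last (maximum) value of each page in memory and binary-search that small list, so a point lookup loads at most the single candidate page. This is an illustrative reconstruction, not HANA's actual data layout.

```python
import bisect

def page_to_probe(page_max_values, needle):
    """Given the per-page maxima of a sorted multi-page vector, return the
    index of the only page that can contain `needle`, or None if the value
    lies beyond the vector's range. Only this one page must be loaded."""
    i = bisect.bisect_left(page_max_values, needle)
    return i if i < len(page_max_values) else None

# per-page maxima for a sorted vector spread over four pages
maxima = [10, 25, 40, 90]
assert page_to_probe(maxima, 26) == 2   # only page 2 can contain 26
assert page_to_probe(maxima, 95) is None
```

Pages whose maximum is below the needle, and pages after the candidate, are guaranteed not to contain it and are never loaded.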

5.1.2 Paged Mapped Vectors

Unlike multi-page vectors, we may need to store many small vectors on a single page chain, and to have performant access to these small vectors. This is required by several compression schemes and features (Sections 5.2 and 5.3). Such vectors can fit within a single page. During scans, a small vector is created directly pointing to the memory of a loaded page, so it does not need to be copied, thereby reducing memory consumption.
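The zero-copy behavior during scans can be illustrated with Python's `memoryview`, which plays the role of the small vector pointing directly into a loaded page (the `vector_view` helper is hypothetical):

```python
def vector_view(page, offset, length):
    """Sketch of a paged mapped vector during a scan: expose a small vector
    as a zero-copy view directly into the loaded page's memory instead of
    copying the bytes out, keeping transient memory consumption low."""
    return memoryview(page)[offset:offset + length]

page = bytes(range(16))          # stands in for a page held in the buffer cache
view = vector_view(page, 4, 3)   # no bytes are copied here
assert list(view) == [4, 5, 6]   # reads go straight to the page's memory
```

The view is valid only while the page stays loaded, which is why such scans keep the page pinned for their duration.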

5.1.3 Arbitrary Sized Items

Some data items are variable-sized in nature, e.g. geometric objects (Section 5.3.1). Paging of such items requires a primitive to store blocks of variable size. A block is fully stored on a single page. The number of blocks stored on a page depends on block sizes. For data items larger than the page size, data is internally divided into multiple blocks. This primitive provides a key to its consumers when a new item is added. The key can be used to later retrieve the stored object.
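A minimal sketch of this primitive's contract follows: appending a variable-size block to fixed-size pages (spilling to a fresh page when the current one is full) and handing back a key for later retrieval. The key shape `(page, offset, length)` and the splitting of oversized items are assumptions for illustration; the latter is omitted here.

```python
class BlockStore:
    """Hypothetical arbitrary-sized-items primitive: variable-size blocks
    on fixed-size pages, with a retrieval key returned on insert."""
    def __init__(self, page_size):
        self.page_size = page_size
        self.pages = [bytearray()]

    def put(self, item: bytes):
        # a block is fully stored on a single page; spill when it won't fit
        if len(self.pages[-1]) + len(item) > self.page_size:
            self.pages.append(bytearray())
        page_no, offset = len(self.pages) - 1, len(self.pages[-1])
        self.pages[-1] += item
        return page_no, offset, len(item)       # the consumer's key

    def get(self, key):
        page_no, offset, length = key
        return bytes(self.pages[page_no][offset:offset + length])

store = BlockStore(page_size=8)
k1, k2 = store.put(b"point"), store.put(b"polygon")
assert store.get(k2) == b"polygon"
```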

5.2 Compressed Pageable Columns

HANA extensively uses advanced compression algorithms to reduce the memory footprint of columnar data and to achieve performance [18]. With the introduction of pageable columns [29], we mainly focused on supporting paging of data vectors compressed using uniform bit packing. While plain domain encoding provides a good compression rate for data vectors, it could not exploit the increased savings achieved by the compression techniques that HANA uses for in-memory columns [18]. Consequently, applications had to choose between an uncompressed pageable format and a compressed in-memory format. This restricted the data vector space that could benefit from the unified persistency (Section 4) to domain-coded columns. To tackle this limitation, we provide pageable counterparts to all compression formats supported by HANA. Beyond the obvious reduction in disk and memory footprint together with increased performance, this closes the gap between page-loadable and in-memory columns and enables fast load unit conversion. In this section, we briefly overview three compression techniques and describe their page-loadable counterparts.

5.2.1 Pageable Run-Length Encoding

The Run-Length Encoding (RLE) technique is suitable for a data vector with regions of repeated values (runs). It reduces this redundancy by representing each run by a pair of value identifier and frequency. In HANA we trade off some compression and prefer to store the start position of each run instead of the frequency. Therefore, a data vector with RLE compression stores two vectors: an integer vector containing the identifier of the value of each run, and a vector with the start position of each run. Both vectors can become pageable by using paged mapped vectors. With RLE, we trade disk space for performance. The performance impact can be prohibitive when we want to access the value at a given row position r; a brute-force approach would load many compressed pages to locate the run corresponding to r. To mitigate this problem, we introduced a compact memory-resident helper structure that contains the last value recorded on each page of the start-position vector; a binary search on this memory-resident structure narrows the number of pages loaded to two.
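The paged RLE point access can be sketched as below. Pages are modeled as plain Python lists, so "loading" a page is just indexing; the helper of per-page last start positions is built once and kept in memory, as described above. This is an illustrative reconstruction, not HANA's code.

```python
import bisect

def rle_value_at(row, start_pages, values):
    """Return the value at row position `row` for an RLE-compressed vector.
    `start_pages`: the run start positions, split across pages.
    `values`: one value identifier per run, in run order.
    The memory-resident helper narrows the search to one start-position
    page, so at most two pages need loading in total."""
    helper = [page[-1] for page in start_pages]     # last start per page
    p = min(bisect.bisect_left(helper, row), len(helper) - 1)
    i = bisect.bisect_right(start_pages[p], row) - 1
    if i < 0:                       # row belongs to the previous page's last run
        p -= 1
        i = len(start_pages[p]) - 1
    runs_before = sum(len(s) for s in start_pages[:p])
    return values[runs_before + i]

# runs start at rows 0, 5, 8, 12, 20 with values a..e; starts split over pages
starts = [[0, 5], [8, 12], [20]]
vals = ["a", "b", "c", "d", "e"]
assert rle_value_at(9, starts, vals) == "c"
assert rle_value_at(6, starts, vals) == "b"
```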

5.2.2 Pageable Sparse Encoding

When the value distribution of a column contains one very frequent value, in a simplified form, the compressed data can be represented by two vectors: an integer vector (the original vector with the most frequent value removed) and a bit vector with bits set at the positions where the most frequent value is observed. Pageable sparse encoding uses multi-page vectors for storing the reduced data vector and a simplified multi-page vector for storing the bit vector that represents the positions of the most frequent value.
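In the simplified form described above, reconstructing the value at a row position works as in this sketch (the bit vector is modeled as a list of 0/1; a real implementation would use packed bits and precomputed rank counters):

```python
def sparse_value_at(row, frequent, bits, reduced):
    """Point access for sparse encoding: `bits[row]` marks positions of the
    most frequent value; any other row indexes the reduced vector after
    subtracting the frequent-value occurrences before it (a rank query)."""
    if bits[row]:
        return frequent
    return reduced[row - sum(bits[:row])]

# original column: [7, 7, 3, 7, 9, 7], with 7 as the dominant value
bits = [1, 1, 0, 1, 0, 1]
reduced = [3, 9]
assert [sparse_value_at(r, 7, bits, reduced) for r in range(6)] == [7, 7, 3, 7, 9, 7]
```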

5.2.3 Pageable Indirect Encoding

This encoding scheme can save more space by finding blocks in a data vector that have few distinct values. The compressed data vector is represented by an integer vector that keeps the block dictionaries and the uncompressed data, along with a vector of compressed blocks containing a pointer to the individual values and metadata. Pageable indirect encoding uses a multi-page vector to encode the integer vector that stores one dictionary per block; each dictionary contains the distinct values observed in its block. Each compressed block is stored using a small integer vector (a paged mapped vector) which can be loaded on demand from the underlying pageable storage. We also store a vector of page logical pointers for each block (a null pointer if a block is not compressed), followed by a vector of offsets into the value vector (one offset per block), using paged mapped vectors and arbitrary size items, respectively.
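The per-block dictionary idea can be sketched as follows. Block size, the compression threshold, and the `None` marker for uncompressed blocks are assumptions for illustration only:

```python
BLOCK = 4          # rows per block (tiny, for illustration)
MAX_DISTINCT = 2   # compress a block only if it has this few distinct values

def indirect_encode(data_vector):
    """Per block: (dictionary, codes) if compressible, else (None, raw)."""
    blocks = []
    for i in range(0, len(data_vector), BLOCK):
        chunk = data_vector[i:i + BLOCK]
        dictionary = sorted(set(chunk))
        if len(dictionary) <= MAX_DISTINCT:
            codes = [dictionary.index(v) for v in chunk]
            blocks.append((dictionary, codes))     # compressed block
        else:
            blocks.append((None, chunk))           # uncompressed block
    return blocks

def indirect_get(row, blocks):
    """Decode the value at `row`, loading only its block."""
    dictionary, payload = blocks[row // BLOCK]
    code = payload[row % BLOCK]
    return code if dictionary is None else dictionary[code]
```

In the paged variant each block would be an on-demand paged mapped vector, addressed through the per-block pointer and offset vectors described above.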

5.3 Advanced Memory Reduction Features

5.3.1 Paging of Geo-Spatial Columns

Geo-spatial columns typically store large data sets and can easily require large amounts of memory when loaded. Users are forced to choose between expensive hardware to host all of their geo-spatial data or managing only a subset of the data. Paging of geo-spatial columns allows full HANA functionality over smaller memory footprints. Geo-spatial columns store one cell per block using the arbitrary sized items primitive (Section 5.1.3), and a block can span multiple pages when needed. The block address for each cell is stored separately using the multi-page vector primitive (Section 5.1.1). To scan a value for a row position, the multi-page vector primitive is used to retrieve the corresponding block address, which is then used to load the specified page using the arbitrary sized items primitive. Without paging, geo-spatial columns are loaded as a vector of reference counted objects. Each object points to its geometry data, holds size information, and carries a reference count which determines the life cycle of the data. Similarly, a paged geo-spatial column cell is represented as a reference counted object, where the object points directly to the geometry data in the corresponding page. The object also holds a handle to the underlying page to manage the page's life cycle.

5.3.2 Paging of RowID Column

HANA uses a persistent RowID column to uniquely identify rows across delta merges and similar operations which physically reorder rows for optimal storage. The RowID column is stored in a compressed format where consecutive 1,024 values are compressed together to form a block. We store such word-boundary-aligned blocks on the page chains. Alignment ensures that the blocks can be accessed directly from pages when needed, without requiring them to be copied into aligned temporary memory. The start position of each block is also stored at the end of the page chain, in the so-called block address vector. We use the multi-page vector primitive (Section 5.1.1) to store the block address vector. Given the maximum number of rows per partition (2 billion), the block address vector is guaranteed to remain small. During load of the column, the block address vector is fully loaded into memory from the pages. A frequent operation in HANA is to determine the position of a row containing a given RowID. This is determined by using the block address vector to identify the page where the RowID resides and its offset. Only that page is loaded, and the blocks are accessed using the page mapped vectors' API (Section 5.1.2).
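The RowID-to-position lookup can be sketched as below. The sketch assumes, for illustration, that RowIDs are stored in ascending order and that "loading a block" is simulated by slicing a list; the block size matches the 1,024-value blocks described above.

```python
import bisect

BLOCK_SIZE = 1024

def build_block_addresses(row_ids):
    """First RowID of each block; the memory-resident block address vector."""
    return [row_ids[i] for i in range(0, len(row_ids), BLOCK_SIZE)]

def find_row_position(row_id, row_ids, block_addresses):
    """Return the row position holding `row_id`, loading one block only."""
    block = bisect.bisect_right(block_addresses, row_id) - 1
    lo = block * BLOCK_SIZE
    page = row_ids[lo:lo + BLOCK_SIZE]             # load just this block
    idx = bisect.bisect_left(page, row_id)
    if idx < len(page) and page[idx] == row_id:
        return lo + idx
    return -1                                      # RowID not present
```

The binary search over the block address vector happens entirely in memory; only the single identified block is then fetched and searched.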

5.3.3 Paging Dictionary

In [29], we introduced on-demand dictionary access structures for variable size data types (e.g. VARCHAR). For fixed size data types (e.g. INTEGER), a sorted dictionary for the main fragment of a column is represented as a vector of fixed size values. There are two key operations on a column dictionary: 1) materialization of the value corresponding to the value identifier observed in a data vector at a desired row position, and 2) probing of the dictionary to perform the reverse mapping (i.e. from a value to its identifier in the data vector). For fixed size data types, we designed the paged dictionary using a variant of the multi-page vector primitive. To materialize a value, the primitive can provide a single-page access guarantee, because each page holds a fixed number of values. To probe a sorted dictionary, however, a binary search is required, which may load many pages to locate the single page that contains the probed value. To mitigate this problem, we introduced a compact memory-resident sorted vector (a helper for the dictionary) that stores the last value on each dictionary page. The probe operation performs a binary search on the helper and loads a dictionary page only if there is a match.
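The probe path can be sketched as follows; the page size and names are illustrative assumptions, and "loading a page" is again simulated by a slice:

```python
import bisect

DICT_PAGE = 4  # dictionary entries per page (tiny, for illustration)

def build_dict_helper(dictionary):
    """Memory-resident helper: the last value on each dictionary page."""
    return [dictionary[min(i + DICT_PAGE - 1, len(dictionary) - 1)]
            for i in range(0, len(dictionary), DICT_PAGE)]

def dict_probe(value, dictionary, helper):
    """Return the value id for `value`, loading at most one page."""
    page = bisect.bisect_left(helper, value)    # binary search in memory
    if page == len(helper):
        return -1                               # larger than every value
    lo = page * DICT_PAGE
    chunk = dictionary[lo:lo + DICT_PAGE]       # load one dictionary page
    idx = bisect.bisect_left(chunk, value)
    if idx < len(chunk) and chunk[idx] == value:
        return lo + idx
    return -1                                   # value not in dictionary
```

Because the helper stores the last value of each page, the in-memory search identifies the only page that could contain the probed value; a page is loaded only if that candidate exists.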

5.3.4 Paging Hash-based Indexes

Hash-based indexes are unique multi-column indexes created over a table. Being unique, the dictionary is the major space-consuming subcomponent of these indexes. We implemented a two-pronged approach to reduce the memory footprint. First, we replaced the dictionary in the hash indexes with a hash function; second, functionality for paging the row position mapping is provided through a hybrid hash column implementation. A variant of the multi-page vector primitive (Section 5.1.1) is used to abstract paging. The hash values stored for the indexes are hidden and almost never projected through user queries. Owing to this peculiarity of the hash-based index, the amount of index data accessed while executing a query is reduced; in particular, this removes the need for page accesses during dictionary lookup.

5.4 Remark on Performance Optimization

Paged primitives can be slower than their in-memory counterparts because of the need to access unloaded pages. Paged primitives are constructed for the main fragment of a column (the read-only, but large, piece) as opposed to the delta fragment. Large scans are typical on the main fragment.


Figure 3: Architecture of HANA's buffer cache: per-page-size buffer pools (4K, 16K, ..., 16M pages) that grow and shrink, each with a free list, a hot buffers queue, an LRU list (MRU to LRU end), and domain-specific buffer containers.

To mitigate the natural impact of I/O latency when accessing paged primitives, a combination of optimization techniques can be used. First, as will be described in Section 7, HANA's execution engine can provide prefetch hints to the buffer cache for sequential scans that expose patterns when accessing primitives. Second, page pinning can prevent a trip to the buffer cache by storing page handles in the context of the operator's stack, ensuring repeated accesses to the page are not impacted by buffer cache eviction. Third, small memory-resident data structures, e.g. page synopses [4, 14, 24], can be designed to reduce the number of page accesses and to cache statistics, as the main fragment of a column is guaranteed not to change between two consecutive delta merges.

6. HANA BUFFER CACHE

The introduction of the Native Store Extension in HANA necessitated a performant and scalable buffer management system for page resources. HANA has a resource manager that manages different kinds of resources in memory and evicts them under high memory pressure using a weighted LRU strategy, where each kind of resource (columns, pages) can be associated with a specific weight. However, for better eviction strategies, we needed to further classify pages based on their access patterns. To this end, we introduced a buffer cache, maintained separately from the resource manager, to manage the page resources within a memory limit. The buffer cache holds used and free pages and can grow and shrink on demand. The most used pages are retained in memory while the least used ones are replaced when needed, reducing HANA's overall memory footprint. HANA supports multiple page sizes based on their usage for data, dictionary, or index, and multiple buffer pools of different page sizes are maintained within the buffer cache. In this section, we explain the features and requirements that motivated the design of the buffer cache architecture (Figure 3). HANA's buffer cache supports the creation of domain-specific page pools for objects (e.g. data, dictionaries, and indexes), helping to retain one set of heavily used pages over another to improve system performance.

6.1 Buffer Cache Elasticity

The buffer cache supports all HANA page sizes by maintaining a buffer pool per size. On server start, the cache contains no memory. When a free buffer is requested, the buffer cache grows by pre-allocating free buffers in the background, so that subsequent requests for free buffers of the same page size need not wait on memory allocation. All buffer pools can cumulatively grow up to the cache capacity. The growth size of each buffer pool can be tuned based on the frequency of free page requests of its page size. The cache capacity can also be tuned; when it is reduced, the buffer cache asynchronously shrinks by removing free buffers, followed by unused pages, proportionately across all pools until the cache size falls below the reduced capacity.

6.2 Hot Buffer Retention and Page Stealing

Although the goal is to keep the memory footprint low by limiting usage through the bounded buffer cache, the overall performance of the system should remain on par with that of the in-memory system. To achieve this, it is essential to reduce I/O by retaining frequently used buffers while maintaining enough free buffers for future use. We provide an adaptive version of the LRU replacement policy that not only retains the simplicity of the LRU mechanism but also avoids the synchronization overhead associated with it. We maintain a separate list of working-set buffers in a lockless queue. A housekeeper thread monitors the overall buffer requests in the system and rebalances buffers from the working-set queue into the LRU list and free list. This policy retains frequently used buffers, extending the stay of hot buffers in the cache until there is an occasional burst of buffer requests, e.g. by a delta merge operation. As buffers exist in different pools, pool sizes are balanced by stealing buffers from the pools with the least activity. This is critical for the frequent scenario in which a specific pool is heavily used, causing LRU replacements that reduce performance due to page I/O.
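The interplay of working-set queue and housekeeper can be illustrated with a single-threaded toy (all names and the rebalancing rule are assumptions; the real implementation uses a lockless queue and a background thread):

```python
from collections import Counter, OrderedDict, deque

class AdaptiveCache:
    """Toy adaptive LRU: accesses only append to a queue (cheap, no list
    reordering); a housekeeper later promotes frequently touched buffers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()       # buffer id -> page (MRU at the end)
        self.working_set = deque()     # stand-in for the lockless queue

    def access(self, buf_id, page=None):
        if buf_id in self.lru:
            page = self.lru[buf_id]
        else:
            self.lru[buf_id] = page
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)   # evict the coldest buffer
        self.working_set.append(buf_id)        # record the touch, lock-free
        return page

    def housekeep(self):
        """Drain the queue and move hot buffers to the MRU end."""
        touches = Counter(self.working_set)
        self.working_set.clear()
        for buf_id, count in touches.items():
            if count > 1 and buf_id in self.lru:
                self.lru.move_to_end(buf_id)   # keep hot buffers resident
```

The point of the split is that the access path never reorders the LRU list (avoiding synchronization on every hit); only the periodic housekeeper pays that cost.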

6.3 Page Prefetch

The buffer cache allows queries to asynchronously prefetch pages before accessing them, to reduce waits on I/O. Every prefetched page follows the same life cycle as an LRU page. For example, analytical queries access multiple pages of each column, and many such queries may run concurrently, resulting in many concurrent page prefetch requests. However, the buffer cache prefetch subsystem limits the total number of prefetched pages to a certain percentage of the cache capacity, so that synchronous page loads and page allocations do not starve for free buffers. To fulfill large prefetch requests despite a limited prefetch budget, the prefetch subsystem allows queries to incrementally prefetch a list of pages. Queries run with multiple jobs; before the jobs for a query begin, the first few pages of the range each job will access are prefetched. As each job begins its processing, the remaining pages are prefetched.

6.4 Out-of-buffers and Emergency Pool

Queries fail if the buffer cache cannot find a free buffer despite a reasonable number of retries; in this case, the user is alerted to increase the buffer cache's capacity. But some critical tasks, like the undo of a transaction, crash recovery, or continuous log replay on a secondary site, cannot fail, as their failure would leave the database in an inconsistent state. For these cases, a free buffer request is fulfilled by an overflow resource provider managed outside the buffer cache. Here, the free buffer provided for the page is destroyed immediately after the page has been used for its designated purpose. By allocating on demand and immediately destroying emergency buffers, the memory consumed by the overflow resource provider is kept as low as possible.


7. HEX INTEGRATION

An incoming query in HANA is first parsed and optimized by the SQL frontend. In this step, various static rules (e.g. convert expressions to conjunctive normal form, eliminate constants and tautologies, apply static partition pruning) and cost-based heuristics (e.g. enumerate different join orders, pick the plan alternative with the lowest estimated cost) are applied to produce an optimized logical execution plan. The logical operators are subsequently substituted by physical operators (e.g. a generic join is replaced by a hash, nested loop, or index join). If a physical HEX alternative exists for each operator, HEX is used for execution; otherwise the query is sent to one of HANA's classical query engines.

7.1 Query Execution in HEX

Relational databases traditionally evaluate a physical operator tree in a root-to-leaf fashion using the iterator model first popularized by the Volcano system [13]. In that model, parent operators repeatedly pull single tuples from their child operators by calling a virtual getNext() method. Volcano-style processing is simple and versatile and had its merits in a time when disk I/O was the bottleneck. However, getNext() is not only invoked very frequently (once per tuple and operator), each call also needs to be resolved via a vtable. This makes the iterator model generally ill-suited to the branch prediction of modern CPUs. HEX instead adopted a data-centric, leaf-to-root push model [23, 21]. In this approach, the control flow is reversed: operators push tuples towards their parent operators using produce() and consume() methods. Neighboring HEX operators are said to form a pipeline if they pass batches of tuples, as many as fit the Lx cache, without spilling them to DRAM or to disk. To achieve maximum flexibility, HEX operators can add two kinds of code to a pipeline. One kind inserts precompiled code, which usually wraps complex logic like decompression of domain-coded tuples or disk page fetches. The other kind emits code in a high-level programming language (L, an SAP-internal language) which is by default either interpreted upon the first query invocation or, upon repeated invocation, compiled to native code at runtime using LLVM. Adjacent generated operators can be amalgamated into a single, fused operator. The resulting tight nested loop greatly benefits code and data locality, as tuples typically reside fully in CPU registers during execution.
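The push model can be illustrated with a minimal produce/consume pipeline. Operator names and interfaces here are generic illustrations of data-centric execution, not HEX's actual classes:

```python
class Scan:
    """Leaf operator: drives the pipeline by pushing rows upward."""
    def __init__(self, rows, parent):
        self.rows, self.parent = rows, parent
    def produce(self):
        for row in self.rows:
            self.parent.consume(row)

class Filter:
    """Intermediate operator: tuples flow leaf-to-root through consume()."""
    def __init__(self, predicate, parent):
        self.predicate, self.parent = predicate, parent
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)

class Collect:
    """Root operator: materializes the result."""
    def __init__(self):
        self.out = []
    def consume(self, row):
        self.out.append(row)

# Pipeline Scan -> Filter -> Collect: wired root-to-leaf, driven leaf-to-root.
sink = Collect()
plan = Scan(range(10), Filter(lambda r: r % 2 == 0, sink))
plan.produce()    # sink.out == [0, 2, 4, 6, 8]
```

In a compiling engine these calls are fused into one tight loop rather than dispatched virtually, which is what makes the model friendly to branch prediction and register residency.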

7.2 Column Scans with Prefetch

Some of the most I/O-intensive operations in many HEX plans are pipelines fed by a main fragment scan. Such pipelines are typically generated from WHERE clauses in SQL queries with predicates on one or more columns. A main fragment scan comprises two steps. In the first step, called dictionary lookup, the scan predicate is translated into a set of value ids using the dictionary. For most predicates, this breaks down to (few) binary dictionary searches with logarithmic runtime. The second step, called data vector scan, compares each value id of the data vector against the value ids found in the previous step. Data vector scans tend to be more expensive than dictionary scans because the size of the data vector usually far exceeds the dictionary size and the scan advances row by row with linear runtime. To that end, the HEX framework breaks up the scan range into subranges and eventually scans the subranges using multiple threads in parallel. If the scanned column is paged

Algorithm 1 GetLoadUnitAffinity

1: if column load unit hint ≠ "default loadable" then
2:     return column load unit hint
3: else if partition load unit hint ≠ "default loadable" then
4:     return partition load unit hint
5: else if table load unit hint ≠ "default loadable" then
6:     return table load unit hint
7: else
8:     return "column loadable"

and not loaded, scan jobs will repeatedly trigger blocking page loads via the buffer cache. To avoid such unnecessary roundtrips to persistency, a prefetch pipeline is created and populated with prefetch operators, one per column fragment, and run as early as possible. The prefetch operator requests the buffer cache to preload as many pages of the respective subrange as possible. Prefetching generally has best-effort semantics, i.e. if the page cache area is already highly utilized or if the I/O system is contended, the buffer cache may decide to drop parts of the prefetch request. We found that fairness is a desirable property of prefetching: approximately the same number of pages should be loaded for all subranges, and pages should be loaded in the same order in which the scan visits them. If prefetching is not fair, some scan jobs are given a head start over other scan threads and finish, on average, earlier than the penalized threads, which works against the scalability and resource utilization of the database system. To ensure fairness, our implementation of prefetch asks one subrange after the other to advance a per-subrange row watermark by which the scan can proceed, assuming it can access a constant number of pages. Prefetch stops when the buffer cache prefetch limit has been exhausted or all subranges have been prefetched. During the data vector scan, a best-effort attempt is made to prefetch any remaining pages in each subrange not accepted during the first prefetch request.
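The fair, watermark-based advance can be sketched as a round-robin loop; the step size, budget model, and names are illustrative assumptions rather than HEX internals:

```python
def fair_prefetch(subrange_page_counts, budget, step=2):
    """Advance per-subrange watermarks round-robin, `step` pages at a time,
    until the global prefetch budget or all subranges are exhausted.
    Returns the number of pages prefetched per subrange, in scan order."""
    watermarks = [0] * len(subrange_page_counts)
    progress = True
    while budget > 0 and progress:
        progress = False
        for i, total in enumerate(subrange_page_counts):   # round-robin pass
            advance = min(step, total - watermarks[i], budget)
            if advance > 0:
                watermarks[i] += advance    # pages loaded in visit order
                budget -= advance
                progress = True
            if budget == 0:
                break
    return watermarks
```

Because every pass gives each subrange at most `step` pages, no scan job gets a large head start: with three 5-page subranges and a budget of 9, the watermarks end at [4, 3, 2] rather than [5, 4, 0].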

8. LOAD UNIT CONFIGURATION

For HANA to use the features introduced so far, database objects (tables, partitions, and columns) need to be configured to use NSE. This section describes the metadata and SQL interface used to configure NSE, as well as the process of load unit conversion, which is required to convert existing database objects to the unified persistency format.

8.1 Persistent Metadata and SQL Interface

HANA provides the functionality to either page individual columns or to use the data aging feature, in which a single partition is in-memory and all other partitions are paged [29]. HANA's NSE adds flexibility by providing a SQL interface to tag columns, partitions, and tables with load unit hints and to convert the underlying column persistency to the preferred storage format. The load unit hint tag is persistent and acts as a hint to determine the loading behavior of columns, i.e. in-memory or paged. A column, partition, or table can be tagged with one of three load unit hints: column-loadable indicates that columns are to be loaded fully in-memory, page-loadable indicates that columns are to be loaded as paged, and default-loadable indicates the absence of any explicit load unit hint. The desired loading behavior for a column in a given partition is derived during the load unit tagging operation as described in Algorithm 1.


8.2 Online Load Unit Conversion

Prior to NSE, HANA used a different persistency format for in-memory vs. page loadable column structures [29]. This is because in-memory structures do not require page alignment, so the most efficient way to store them was simple serialization. NSE hybrid column structures use a page-aligned unified persistency format (Section 4). This means that tables with persistency created prior to NSE need to convert their persistency to the unified format to take advantage of the hybrid column store features. This persistency conversion is termed load unit conversion and is performed through an explicit SQL alter-column DDL.

In HANA, table columns have a numeric identifier which is unique within a table and consistent across all partitions of that table. Before the introduction of NSE, alter-column operations in HANA were implemented by internally creating a new table column with a new identifier, copying data from the old column to the new column, and subsequently dropping the old column. DDLs that alter the load unit of a given partition, which internally use the alter-column infrastructure, would therefore have to operate on all partitions of the table to guarantee a consistent column identifier across partitions. HANA also uses internal columns with fixed unique column identifiers where load unit conversion must be supported. With NSE, the online load unit conversion within a partitioned table is no longer done by introducing a new column, but by versioning existing columns while keeping their identifiers. The load unit conversion DDL creates a new version of the column with the preferred persistency format, while existing readers referring to the old column version can continue to access it. The new column version is activated when the DDL commits, from which time readers use the new column version. This column versioning mechanism allows HANA to execute load unit conversions on specific partitions while preserving the same unique column identifier across partitions.

8.3 Partitioning with NSE

Unbalanced partitioning is a flexible partitioning strategy that departs from the conventional balanced partitioning scheme. It allows separate partitioning schemes (e.g. hash) to be defined at selected nodes of the second-level partitioning subtrees, distinct from the first level, which uses range partitioning. For instance, one could decide on a range-range partitioning scheme and have different ranges at the second level for some of the first-level ranges, or no second-level ranges at all for some of them. This scheme is very useful for organizing a table based on data temperature, i.e. the hot vs. cold nature of the data. It is possible, for instance, to organize the cold partitions differently, as their access requirements are optimized by design. The partitioning specification for such a scheme is internally represented as a JSON document, very different from the linear text string used in balanced partitioning.

The partitioning schema offers the flexibility to declare specific properties for each physical partition, e.g. the location, the load unit specification, and the partitioning group types. Declaring inheritance rules allows a level to inherit from its parent level (i.e. the second level can inherit from the first level). Furthermore, one can partition at the second level using a different partitioning statement, e.g. range(Column1) and range(Column2) for some of the first-level ranges but range(Column1) and range(Column3) for other partition nodes. This enables specializing and balancing the partitioning tree based on the actual data distribution. Unbalanced partitioning integrates well with the Native Store Extension. As opposed to classic data aging, where each partitioned table could have one hot current partition and a set of cold aged partitions, we can designate selected nodes in the partitioning tree to be page loadable with NSE. This ensures that the data for all columns in those partitions resides on disk and is paged via the buffer cache when the data is queried, relaxing the memory requirements for storing very large tables in HANA. The NSE advisor can analyze workloads on partitioned tables and recommend a change of load unit for each partition node.
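To make the idea concrete, here is a hypothetical JSON-like sketch of an unbalanced specification: a range-range tree in which the hot first-level range has no second level, while the cold range defines its own second-level scheme and a page-loadable load unit. The actual HANA document format is not public; every field name here is an assumption.

```python
import json

# Hypothetical unbalanced partitioning spec (field names are illustrative).
spec = {
    "level1": {"type": "range", "column": "Column1"},
    "nodes": [
        {   # hot range: kept fully in memory, no second level
            "range": ["2019-01-01", None],
            "load_unit": "column-loadable",
        },
        {   # cold range: own second-level scheme, paged via NSE
            "range": [None, "2019-01-01"],
            "load_unit": "page-loadable",      # inherited by its children
            "level2": {"type": "range", "column": "Column3"},
        },
    ],
}
print(json.dumps(spec, indent=2))
```

The asymmetry between the two nodes is exactly what a balanced, linear-text specification cannot express.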

9. LOAD UNIT ADVISOR FOR NSE

To achieve the best cost/performance ratio, it is necessary to know which objects should be made page loadable to reduce the total memory footprint without significantly affecting performance. This section describes a solution that helps identify the objects (tables, partitions, or columns) that are suitable to be converted to page loadable (to save memory) or to column loadable (to improve performance). Ideally, objects that are small and accessed frequently should be kept in memory to improve column data access performance. Large, rarely accessed objects are better kept on disk to reduce memory usage, and only brought into memory when needed and only for the pages that are accessed. Our solution consists of two phases: access pattern collection and rule-based load unit recommendation.

9.1 Access Pattern Collection

To build the data access pattern for a workload, each physical access to a column fragment is counted and updated at query level in an internal cache unit called the access statistics cache. Because only the main fragment is paged for a page loadable column, the access count is only recorded for accesses to the main fragment. When the delta fragment is merged into the main fragment, the access count for the original main fragment is retained. When a column is unloaded from memory during the workload, its access count is also retained in the access statistics cache and continues to be updated after the column is loaded into memory again. DDLs that drop an object clean up the statistics entries related to the dropped objects. Building the data access statistics has a small impact on the ongoing workload performance; therefore, it is only enabled through a configuration option. The statistics cache can be cleared when it is not needed or when a new cache needs to be built.

9.2 Object Load Unit Recommendation

Load unit recommendation uses a heuristic rule-based approach that relies on the following measurements and thresholds:

Scan density d(o): the ratio of the access scan count of object o over the object's memory size

Hot object threshold θh: the minimum scan density for an object to be considered hot

Cold object threshold θc: the maximum scan density for an object to be considered cold

Object size threshold θs: the minimum object size to be considered for recommendation

The thresholds (θh, θc, and θs) are system parameters. A list of objects is recommended to convert their load units


based on their scan density. For an object o¹, if d(o) < θc and the object's memory size is greater than θs, it is recommended to be page loadable. If d(o) > θh, it is recommended to be column loadable. A table may be recommended to be page loadable due to an overall low scan density, while a column within the table may be recommended to be column loadable due to its high scan density; in that case, the column-level recommendation takes precedence over the table-level recommendation. Similarly, partition-level recommendations take precedence over table-level recommendations.
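The rule can be sketched directly from the definitions above. The threshold values below are placeholders chosen for illustration; the paper treats them as tunable system parameters and gives no concrete values.

```python
THETA_HOT = 10.0    # θh: min scan density to be "hot"       (assumed value)
THETA_COLD = 1.0    # θc: max scan density to be "cold"      (assumed value)
THETA_SIZE = 1024   # θs: min size (bytes) worth converting  (assumed value)

def recommend(scan_count, memory_size):
    """Rule-based load unit recommendation for one object."""
    density = scan_count / memory_size          # d(o)
    if density < THETA_COLD and memory_size > THETA_SIZE:
        return "page-loadable"      # large, rarely scanned: move to NSE
    if density > THETA_HOT:
        return "column-loadable"    # hot: keep fully in memory
    return None                     # between the thresholds: no change
```

In the full advisor, recommendations are computed per table, partition, and column, with the finer-grained levels taking precedence as described above.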

10. EVALUATION

We report the performance analysis and the memory footprint reduction for SAP HANA. We compare default column loadable tables, where columns are completely loaded into memory (CL), vs. columns that use the Native Store Extension (NSE). For brevity, we report only the results from a selected set of experiments.

10.1 Workload and Metrics

We used an in-house benchmark tool (ML4). ML4 simulates a production-environment SQL workload at a customer site with a configurable scale factor. It simulates users' interactions on an ABAP stack based on the SAP S/4HANA application; ABAP is the server programming platform for SAP applications, and it sends SQL statements to the database. The basis of the workload is a sell-from-stock process including the creation of sales orders, outbound deliveries, goods movements, and billing documents (OLTP part), with reporting queries (OLAP part). A single scenario loop in ML4 simulates business transactions consisting of multiple user interaction steps. Each step is separated by an artificial constant waiting time of 10 seconds (think time). A typical test setup consists of multiple concurrently simulated scenarios executed by different users, where each user runs a series of loops. The first set of transactions sometimes includes the execution of a reporting query, creating the OLAP part of the workload; the remaining transactions form the OLTP part. When simulating a certain number of users, OLTP and OLAP loads are balanced by default, statistically allowing 1 out of 100 users to perform a reporting query in the first transaction. The underlying HANA database used by ML4 has at least 100,000 tables with every column type used in a typical SAP S/4HANA installation, with different compression formats (i.e. Clustered, Indirect, Prefixed, RLE, Sparse) as well as no compression. We conducted two sets of experiments: CL, where accessed columns were completely loaded, and NSE, where the majority of tables were converted to use the Native Store Extension. In the NSE experiments, we turned on the Load Unit Advisor for NSE (Section 9) to collect statistics from the workload and to provide recommendations on converting columns to the NSE format. We report two sets of metrics: memory consumption and performance. Memory consumption was measured at two levels: total system memory (M1) and the aggregate memory footprint of tables (M2). The latter metric excludes memory not used by the column store, e.g. for storing intermediate results of queries. We report performance as average total database time for running OLTP and OLAP transactions (we call these OLTP MMV and OLAP MMV, where MMV stands for ML4-metric value). We restrict the amount of memory used by

¹ A table, a partition, or a column.

Figure 4: Normalized memory (OLTP workload): peak memory M1 and aggregate table memory M2 for CL vs. NSE (top, with 60% and 75% savings annotated), and memory consumption over time (bottom).

HANA server to around 2X the size of the database, and we conducted the NSE experiments fixing the buffer cache size at 10% of the data size as default. We also report an experiment varying the buffer cache size.

10.2 Results and Analysis

Figure 4 (top) presents the normalized memory measurements (M1 and M2) for the OLTP workload. Both CL and NSE benefit from compressing index vectors, but NSE does not need to completely load each column accessed by an OLTP operation into memory. Because of this, we observed immediate memory savings for both measures. We also monitored the memory consumption of the two settings (M2) over time, shown in Figure 4 (bottom). For NSE, the memory consumption is restricted by the buffer cache size, whereas for CL it is limited by the total memory available to the HANA server. Notice the fluctuations for CL at around 40 seconds. These correspond to accesses to columns designated as CL that had not yet been loaded into memory when the benchmark executed and whose read-only portion needed to be accessed, e.g. for checking a uniqueness constraint or for system-level statistics collection. As the buffer cache operates at full capacity, the same operation triggers page evictions to accommodate new requests. This kept the total memory consumption for NSE balanced in this experiment, a very desirable property for cloud deployment. The majority of OLTP requests that require modifications of any type (INSERT/UPDATE/DELETE) were handled by the delta store of each column. Because the NSE technology only targets the read-optimized section of the column store (i.e. the main store), the performance of OLTP operations was not significantly impacted (less than 3.7%) by choosing the NSE configuration over the fully loaded setting.

We conducted experiments with a mixture of OLTP and OLAP operations and observed trends similar to the OLTP experiments for normalized runtime (Figure 5). The buffer cache again guaranteed that the total memory usage remained smaller than the amount of memory consumed by the CL setting (Figure 6, top), with a system memory usage pattern (Figure 6, bottom) very similar to the OLTP-only experiments in Figure 4 (bottom). Converting to NSE does not significantly affect the runtime, as shown in Figure 7. We should note that in the extreme case, when the buffer

2055

Page 10: Native Store Extension for SAP HANA - VLDB · 2019-08-21 · 2. AN OVERVIEW OF SAP HANA HANA supports both analytical and transactional access patterns [26,20] and di erent data representations

4.53%

5.97%

OLTP_MMV OLAP_MMV

CL NSE

Figure 5: Normalized runtime (mixed workload)

44% savings

66% savings

Normalized Peak Memory (M1) Normalized Agg. Memory (M2)

CL NSE

1 51 101 151 201Time interval (seconds)

CL NSE

Figure 6: Normalized memory (mixed workload)

cache is very small, any operation that requires a portionof data which is not in the buffer cache would incur I/O toload required page(s). However, there is a significant dif-ference with when a needed column is not present in thememory; NSE only loads sections of a column (e.g. piecesof compressed index vector, fractions of dictionary, and in-verted index). However, CL needs every subcomponent of acolumn to be entirely present in memory. Because of this,CL may suffer more from column evictions, whereas NSE’sgranularity of eviction from the buffer cache is page units.
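The difference in eviction granularity can be made concrete with back-of-the-envelope arithmetic. The page size and subcomponent sizes below are invented for illustration; the point is only that CL must make all subcomponents of a column resident, while NSE pays for just the touched pages:

```python
PAGE_SIZE = 256 * 1024  # illustrative page size in bytes

# Hypothetical subcomponent sizes of one column, in bytes.
column = {
    "index_vector": 8 * 2**20,   # compressed index vector
    "dictionary": 2 * 2**20,
    "inverted_index": 1 * 2**20,
}

def cl_resident_bytes(col):
    """Column-loadable: every subcomponent must be fully in memory."""
    return sum(col.values())

def nse_resident_bytes(pages_touched):
    """NSE: only the pages the operation actually reads are loaded."""
    return pages_touched * PAGE_SIZE

full = cl_resident_bytes(column)       # 11 MiB to access this column under CL
lookup = nse_resident_bytes(3)         # a point lookup touching 3 pages
print(f"CL: {full // 2**20} MiB, NSE: {lookup // 2**10} KiB")
```

Under these assumed sizes, a single point lookup costs roughly 768 KiB of resident memory under NSE versus 11 MiB under CL, and an eviction reclaims memory at the same respective granularities.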

To evaluate the impact of buffer cache size, we conducted both OLTP and OLAP experiments while varying the size of the buffer cache (Figures 7 and 8). NSE benefits from the NSE advisor integrated with the buffer cache, which enhances page eviction decisions and initiates prefetch requests when needed. Because HANA's memory resources in the CL setting are managed separately by HANA's resource manager, as opposed to the NSE buffer cache, we did not observe a difference in memory consumption or runtime for CL. As the buffer cache size decreases, we see an impact on the runtime for both OLTP and OLAP, with OLAP being more affected because it accesses more pages (Figure 7). This is expected: the buffer cache is pressured to evict more pages as its size decreases, which translates into more page misses in the buffer cache and a lower average page lifetime. Reducing the buffer cache size has no impact on peak and average memory for CL, but directly impacts NSE (Figure 8).
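The relationship between cache size and miss count can be reproduced with a toy LRU simulation over a synthetic page trace. The trace and unit page sizes are invented (the real experiment replays HANA page accesses), but the trend is the same: once the cache is smaller than the working set, misses rise sharply:

```python
from collections import OrderedDict

def simulate_misses(trace, capacity):
    """Count misses of an LRU cache over a page-access trace;
    all pages are unit-sized for simplicity."""
    resident, misses = OrderedDict(), 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)       # hit: refresh LRU position
        else:
            misses += 1                      # miss: would trigger I/O
            if len(resident) >= capacity:
                resident.popitem(last=False) # evict coldest page
            resident[page] = True
    return misses

# Cyclic scan over 8 pages, repeated 10 times: LRU degrades to
# all-miss behavior as soon as the cache holds fewer than 8 pages.
trace = [p for _ in range(10) for p in range(8)]
for capacity in (8, 4, 2):
    print(capacity, simulate_misses(trace, capacity))
```

Because LRU is a stack algorithm, the miss count never decreases as the cache shrinks, matching the monotone runtime degradation seen in Figure 7.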

Figure 7: Normalized runtime varying buffer cache size (mixed workload). [Bars compare CL, NSE (500 GB), NSE (50 GB), and NSE (5 GB) for OLTP_MMV and OLAP_MMV.]

Figure 8: Normalized peak memory varying buffer cache size (mixed workload). [Normalized peak memory and normalized aggregate memory for CL, NSE (500 GB), NSE (50 GB), and NSE (5 GB).]

Figure 9: Normalized peak memory (top) and normalized runtime (bottom) varying global allocation limit (OLAP workload). [Bars compare CL and NSE for GAL = 1 TB, 512 GB, and 256 GB.]

We conducted another experiment by reducing the global allocation limit (GAL). This limit restricts the amount of memory available to CL objects and has no impact on NSE-resident objects. Reducing the GAL imposes memory pressure on the HANA server, which causes the resource manager to evict database objects, at table and column granularity, to free memory. This eviction is orthogonal to page eviction by the buffer cache. The former reduces normalized peak memory (Figure 9, top) and increases memory traffic, which impacts normalized query response time for CL (Figure 9, bottom). The buffer cache size is not affected by the total memory reduction, as long as the reduction does not affect the amount of resources the cache pre-allocates. Hence, there is no performance impact for NSE (Figure 9, bottom).

11. RELATED WORK

We focus on selected components of NSE and place our contributions in the context of the state of the art.

Columnar databases use compression techniques extensively [2], and HANA implemented highly optimized lossless packing techniques for its in-memory column store [18]. Pairing paging with compression is a challenging problem that has been studied for data pages [27, 8]. Storing geo-spatial data in paged format has been implemented before, with integration into specialized query processing engines [10, 32]. In our work, we proposed primitives designed for the generic use case, which can be used to provide pageable compression and pageable geo-spatial columns. The byte compatibility with in-memory columns provided seamless integration with query processing at the upper layer, without the need to specialize highly optimized algorithms.

Enterprise-class main-memory translytics systems have started focusing on cloud-native infrastructure as a service to balance TCO. For instance, Oracle [1] uses a disk-based row store backed by a buffer cache as the warm store, while an in-memory store with compression units supports fast access to the hot columnar store. These two separate layouts require format conversions and data migration to move data to a slower volume. Our unified persistency format mitigates this problem and enables dynamic load unit conversion to keep TCO at a desired level for cloud deployment. Stoica et al. [30] migrate cold data to secondary storage efficiently by relying directly on the virtual memory paging mechanism of the operating system. Unlike NSE, their focus is only on row stores and OLTP. As this approach does not use a buffer cache, it is hard to benefit from the optimizations made possible by maintaining query execution statistics at a lower granularity, e.g. primitives. SQL Server [15] maintains a columnar store index (CSI) on memory-optimized row store tables for operational analytics on hot data. Operational analytics on warm data is done by creating and maintaining a secondary CSI on disk-based row store tables. This incurs the additional overhead of maintaining dual stores and moving data.

The use of buffer pools as a performance optimization component dates back to the early days of database systems [28], and several modern in-memory databases use a form of buffer pool [5]. New hardware technology has recently been adopted to enhance the performance of the classic buffer pool. On et al. [25] propose a replacement policy that separates the queue of managed resources into clean and dirty pools when the database is stored on flash disk. Do et al. [9] propose alternative eviction policies that exploit the high I/O bandwidth of SSD drives and predicted block access patterns. Liu and Salem [19] propose cost-aware replacement algorithms for hybrid storage configurations (SSD and HDD drives). Arulraj et al. [7] propose multi-tier buffer management using DRAM, NVM, and SSD, which can serve different workloads with different storage hierarchies for optimum performance. Graefe et al. [12] propose a design that uses pointer swizzling to eliminate buffer pool overheads for memory-resident data, keeping raw pointers to loaded pieces as references to avoid lookups in separate queues managed by the buffer pool. LeanStore [17] uses pointer swizzling, a replacement strategy that identifies infrequently accessed pages, and an epoch-based technique to pin pages, achieving performance comparable to main-memory systems. In HANA's NSE, the read-optimized section of a hybrid column benefits from the NSE architecture and is managed by NSE's buffer cache, enabling tuning and optimizations that are only applicable to paged resources. Our buffer cache was designed from the ground up to take advantage of lessons learned from SAP's ASE [13] and HANA, and to provide extensibility as well as integration with HANA's new execution engine HEX and NSE's hybrid columns and primitives. NSE uses advanced replacement strategies to retain frequently accessed pages while maintaining a balance of buffers for reuse across pools of different page sizes. HANA's page accesses avoid the hash lookup by caching buffer pointers in the page chain metadata.

The design and configuration of individual database features is a challenging task that significantly affects performance and the total cost of ownership. The database community has studied this problem in the context of self-management for relational databases [11]: to recommend the design of the physical layout [19, 33], to create new sets of indexes [3, 33], to define new data statistics objects [24], to alter the partitioning landscape [22], or to tweak knobs and switches [3], all with the common goal of running the database at its highest performance level. The combination of NSE's hybrid column store and partitioning offers the opportunity to balance TCO against optimum performance goals. With this goal in mind, our advisor pairs with byte-compatible primitives to support the dynamic load unit conversion unique to NSE and to maintain a desired balance between performance and TCO.

12. SUMMARY

We presented the architecture of HANA's Native Store Extension, with a unified persistence format and pageable primitives at its foundation. By integrating NSE into HANA's in-memory columnar store, we support hybrid columns with advanced paged compression and reduced memory requirements. Hybrid columns, as first-class citizens of HANA's column store, require a new buffer cache to effectively manage pageable resources. This buffer cache works in conjunction with HANA's resource manager, which oversees the in-memory structures. The buffer cache was optimized for the nature of hybrid columns and page access patterns, and for tight integration with HANA's HEX query execution engine; HEX can feed the buffer cache with hints that enhance its operation, e.g. buffer retention policies, among the many other optimized algorithms we have implemented. Through comprehensive end-to-end experiments, we have shown that on typical customer scenarios NSE reduces the memory footprint with a minimal loss of performance compared to in-memory column stores, especially when the resources required by a query are provided by the buffer cache. Our approach proved fruitful: beyond its initial goal, NSE and the unified persistence format, in conjunction with a flexible load unit configuration and partitioning, enable several desirable applications, including data tiering, K-safe high availability with reduced TCO, and a conventional disk database with fast restart.

13. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments and suggestions, which contributed to improving the presentation of this paper. We give special thanks to our colleagues and the members of the NSE, Store, HEX, Attribute Engine, Basis and Persistency, and SPeeD teams for their contributions to the NSE project, and to Tony Zhou for contributing to Section 9.


14. REFERENCES

[1] Oracle Database 19c In-Memory Guide. https://docs.oracle.com/en/database/oracle/oracle-database/19/inmem/, 2019. [Online; accessed 03-February-2019].

[2] D. Abadi, P. Boncz, and S. Harizopoulos. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., 2013.

[3] D. V. Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic Database Management System Tuning Through Large-scale Machine Learning. In ACM SIGMOD, pages 1009–1024, 2017.

[4] K. Alexiou, D. Kossmann, and P.-A. Larson. Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia. PVLDB, 6(14):1714–1725, 2013.

[5] T. Anderson. Microsoft SQL Server 14 man: Nothing stops a Hekaton transaction. http://www.theregister.co.uk/2013/06/03/microsoft_sql_server_14_teched/, 2013. [Online; accessed 03-February-2019].

[6] M. Andrei, C. Lemke, G. Radestock, R. Schulze, C. Thiel, R. Blanco, A. Meghlan, M. Sharique, S. Seifert, S. Vishnoi, D. Booss, T. Peh, I. Schreter, W. Thesing, M. Wagle, and T. Willhalm. SAP HANA Adoption of Non-Volatile Memory. PVLDB, 10(12):1754–1765, 2017.

[7] J. Arulraj, A. Pavlo, and K. T. Malladi. Multi-Tier Buffer Management and Storage System Design for Non-Volatile Memory. CoRR, abs/1901.10938, 2019.

[8] B. Bhattacharjee, L. Lim, T. Malkemus, G. Mihaila, K. Ross, S. Lau, C. McArthur, Z. Toth, and R. Sherkat. Efficient Index Compression in DB2 LUW. PVLDB, 2(2):1462–1473, 2009.

[9] J. Do, D. Zhang, J. M. Patel, D. J. DeWitt, J. Naughton, and A. Halverson. Turbocharging DBMS Buffer Pool Using SSDs. In ACM SIGMOD, pages 1113–1124, 2011.

[10] A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial Partitioning Techniques in SpatialHadoop. PVLDB, 8(12):1602–1605, 2015.

[11] S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical Database Design for Relational Databases. ACM TODS, 13(1):91–128, 1988.

[12] G. Graefe, H. Volos, H. Kimura, H. Kuno, J. Tucek, M. Lillibridge, and A. Veitch. In-memory Performance for Big Data. PVLDB, 8(1):37–48, 2014.

[13] A. Gurajada, D. Gala, F. Zhou, A. Pathak, and Z. F. Ma. BTrim: Hybrid In-memory Database Architecture for Extreme Transaction Processing in VLDBs. PVLDB, 11(12):1889–1901, 2018.

[14] H. Lang, T. Muhlbauer, F. Funke, P. Boncz, T. Neumann, and A. Kemper. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage Using Both Vectorization and Compilation. In ACM SIGMOD, pages 311–326, 2016.

[15] P.-A. Larson, A. Birka, E. N. Hanson, W. Huang, M. Nowakiewicz, and V. Papadimos. Real-time Analytical Processing with SQL Server. PVLDB, 8(12):1740–1751, 2015.

[16] J. Lee, H. Shin, C. G. Park, S. Ko, J. Noh, Y. Chuh, W. Stephan, and W.-S. Han. Hybrid Garbage Collection for Multi-Version Concurrency Control in SAP HANA. In ACM SIGMOD, pages 1307–1318, 2016.

[17] V. Leis, M. Haubenschild, A. Kemper, and T. Neumann. LeanStore: In-Memory Data Management beyond Main Memory. In IEEE ICDE, pages 185–196, 2018.

[18] C. Lemke, K. Sattler, F. Faerber, and A. Zeier. Speeding Up Queries in Column Stores - A Case for Compression. In DAWAK, pages 117–129, 2010.

[19] X. Liu and K. Salem. Hybrid Storage Management for Database Systems. PVLDB, 6(8):541–552, 2013.

[20] N. May, A. Bohm, and W. Lehner. SAP HANA - The Evolution of an In-Memory DBMS from Pure OLAP Processing Towards Mixed Workloads. In BTW, pages 545–563, 2017.

[21] P. Menon, T. C. Mowry, and A. Pavlo. Relaxed Operator Fusion for In-memory Databases: Making Compilation, Vectorization, and Prefetching Work Together at Last. PVLDB, 11(1):1–13, 2017.

[22] R. Nehme and N. Bruno. Automated Partitioning Design in Parallel Database Systems. In ACM SIGMOD, pages 1137–1148, 2011.

[23] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9):539–550, 2011.

[24] A. Nica, R. Sherkat, M. Andrei, X. Cheng, M. Heidel, C. Bensberg, and H. Gerwens. Statisticum: Data Statistics Management in SAP HANA. PVLDB, 10(12):1658–1669, 2017.

[25] S. T. On, Y. Li, B. He, M. Wu, Q. Luo, and J. Xu. FD-buffer: A Buffer Manager for Databases on Flash Disks. In ACM CIKM, pages 1297–1300, 2010.

[26] H. Plattner. The Impact of Columnar In-memory Databases on Enterprise Systems: Implications of Eliminating Transaction-maintained Aggregates. PVLDB, 7(13):1722–1729, 2014.

[27] M. Poess and D. Potapov. Data Compression in Oracle. In VLDB, pages 937–947, 2003.

[28] G. M. Sacco and M. Schkolnick. A Mechanism for Managing the Buffer Pool in a Relational Database System Using the Hot Set Model. In VLDB, pages 257–262, 1982.

[29] R. Sherkat, C. Florendo, M. Andrei, A. Goel, A. Nica, P. Bumbulis, I. Schreter, G. Radestock, C. Bensberg, D. Booss, and H. Gerwens. Page As You Go: Piecewise Columnar Access In SAP HANA. In ACM SIGMOD, pages 1295–1306, 2016.

[30] R. Stoica and A. Ailamaki. Enabling Efficient OS Paging for Main-memory OLTP Databases. In ACM DaMoN, pages 7:1–7:7, 2013.

[31] T. Willhalm, I. Oukid, I. Muller, and F. Faerber. Vectorizing Database Column Scans with Complex Predicates. In ADMS@VLDB, pages 1–12, 2013.

[32] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient In-Memory Spatial Analytics. In ACM SIGMOD, pages 1071–1085, 2016.

[33] D. C. Zilio, J. Rao, S. Lightstone, G. Lohman, A. Storm, C. Garcia-Arellano, and S. Fadden. DB2 Design Advisor: Integrated Automatic Physical Database Design. In VLDB, pages 1087–1097, 2004.
