
A Study on Database Cracking with GPUs

Eleazar Leal
University of Minnesota Duluth
Duluth, Minnesota, USA
[email protected]

Le Gruenwald
University of Oklahoma
Norman, Oklahoma, USA
[email protected]

ABSTRACT

In dynamic environments, a DBMS may not have enough idle time to build indexes that can help speed up query processing. Moreover, in these same environments, there is usually no prior knowledge about the nature of the query workload that could help improve query processing times. Database cracking deals with these two issues by incrementally building indexes while processing queries. With this incremental index construction, database cracking techniques sacrifice the query response time of the present query in order to improve the response time of future queries. Additionally, database cracking can adapt to changing workloads because the incremental index construction is query-driven. Database cracking can be made more efficient by taking advantage of parallelism; while multicore CPUs have been studied for this purpose, no existing work has done the same for GPUs. In this paper, we perform the first study on the use of GPUs versus multicore CPUs for cracking. For this, we propose two GPU cracking techniques, Crack GPU and Crack GPU Hybrid, and compare them with the best-performing multicore CPU cracking technique in the literature, P-CCGI. Our experiments showed that in terms of query response time, the best-performing techniques are Sort GPU if the workload is non-random, and P-CCGI if the workload is random.

1. INTRODUCTION

In dynamic database environments, there is not enough idle time before the processing of the queries to build an index structure (like a B+-tree [2] or an ART-tree [17]) that can help speed up those queries [7, 9, 24]. This issue is addressed by an adaptive indexing technique called database cracking [8, 9]. Database cracking helps improve the query response time by incrementally building an index while simultaneously answering queries. As a consequence of this, each time a new query is issued, database cracking devotes part of the processing time of each query to doing a partial

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium, as well as allowing derivative works, provided that you attribute the original work to the author(s) and ADMS 2019. 10th International Workshop on Accelerating Analytics and Data Management Systems (ADMS'19), August 26, 2019, Los Angeles, California, USA.

[Figure 1, recovered from extracted text:
Query 1: SELECT R.A FROM R WHERE R.A ≥ 12 AND R.A ≤ 20.
Query 2: SELECT R.A FROM R WHERE R.A ≥ 5 AND R.A ≤ 28.
Cracker column A before any query: 9, 10, 14, 11, 2, 29, 7, 33, 1, 3, 21, 13, 6, 8.
Cracker column after Query 1, with cracker index pieces: A < 12 (9, 10, 11, 2, 7, 1, 3, 6, 8); 12 ≤ A ≤ 20 (14, 13); A > 20 (29, 33, 21).
Cracker column after Queries 1 and 2: A < 5 (2, 1, 3); 5 ≤ A < 12 (9, 10, 11, 7, 6, 8); 12 ≤ A ≤ 20 (14, 13); 20 < A ≤ 28 (21); A > 28 (29, 33).]

Figure 1: Example of database cracking

physical re-organization of the database tables, so that future queries can benefit from the re-organizations introduced by past queries [9]. Figure 1 shows an example of database cracking, which will be described in detail in Section 2.

However, the lack of enough idle time for indexing is not the only issue in these environments. It is also the case that the nature and the probability distribution of the query workloads are not known a priori. Nonetheless, database cracking's adaptive nature addresses this issue, too. This is because the locations within the tables that are reorganized are chosen such that those portions of the table that are queried most often will be reorganized most frequently. Therefore, when workloads evolve over time, cracking shifts its focus to reorganizing a different portion of the tables, thereby improving the performance on the new workload's future queries.

Database cracking has several challenges [23, 24], among which we find the following: (i) the query response time of the first query in a sequence of queries should not be overly long as a result of cracking; this challenge is partly a consequence of the assumption that there is not enough idle time before the first query to build an index; (ii) the convergence time; the convergence time refers to the number of cracking operations that are required until the average query response time when using cracking equals the average query response time when using a regular index (like a B+-tree or an ART-tree [17]). Ideally, this number should be as small as possible; and (iii) query workload robustness. Some database cracking techniques work well under the assumption that the queries have a uniform distribution, but do not offer the same good performance if queries have a sequential distribution; see, for example, Idreos et al. [9]. A database cracking technique should be able to offer consistently good performance independently of the distribution of the query workload.

Faced with these issues, one question is whether parallel architectures can be exploited to improve the query response time of database cracking. So far, most of the works in this area have devoted their efforts towards studying database cracking in single-threaded applications [9, 10, 11, 12, 22, 23, 24]. While a few of the existing works have studied database cracking on parallel architectures [4, 5, 24], they focus on multicore systems only; there are no works studying how Graphics Processing Units (GPUs) can be exploited for database cracking. This is a natural question concerning GPUs because the latter have many advantages that could be exploited for database cracking. Among them we have: (i) in certain tasks, GPUs can achieve up to an order of magnitude higher instruction processing throughput than comparable multicore CPUs [16]; (ii) GPUs are very energy-efficient [13]; (iii) GPUs are present in all kinds of computers, ranging from desktops to mobile phones; and (iv) GPUs have larger memory bandwidth than comparable multicore architectures.

However, there are also several challenges associated with using GPUs for database cracking, among which we find the following: (i) GPUs are designed to maximize instruction throughput, as opposed to minimizing instruction latency [15]. This means that GPUs could potentially achieve high query throughput, but not necessarily low query response times; and (ii) GPUs have a separate memory space and are connected to the host through the PCIe bus. Since a GPU's memory space is separate from the host's, each query result set needs to be materialized in the host by bringing it through the slow PCIe bus, thereby affecting the query response time. This issue is compounded by the fact that most GPUs are connected to their host computers through the PCIe bus, which is significantly slower than the maximum theoretical throughput of main-memory DDR4 RAM.

Another question relates to the conjecture made in the work of Alvarez et al. [1]. There it was suggested that parallel sorting techniques could be big contenders against database cracking techniques, meaning that on parallel architectures, a parallel sort could provide better performance than cracking. Given this observation, we seek to determine whether this is the case for GPUs, i.e., whether GPU database cracking algorithms are a match for GPU sorting.

The contributions of this paper are the following: (i) we present the first study on the use of GPUs for database cracking, and to this end, we propose two GPU cracking algorithms, Crack GPU and Crack GPU Hybrid, and two non-cracking algorithms for GPUs, Sort GPU and Filter GPU. These last two algorithms are introduced for comparison purposes; and (ii) we perform comprehensive experiments comparing the proposed algorithms with P-CCGI, which is the best-performing multicore CPU algorithm for database cracking reported by Schuhknecht et al. [24].

The remainder of this paper is organized as follows. Section 2 presents the background and related work. Section 3 presents our GPU algorithms for database cracking. Section 4 contains an experimental comparison between our proposed GPU techniques and the state-of-the-art multicore algorithm for cracking. Finally, Section 5 provides conclusions and future research directions.

2. BACKGROUND AND RELATED WORK

In this section, we provide the GPU background and discuss the related work in the area of database cracking.

GPUs. GPUs are highly parallel coprocessors for general-purpose programming installed on the PCIe bus of most computers. GPU applications contain special C functions, called kernels, which are the functions that are run on the GPU. Since GPUs have a memory space that is separate from the host's, GPU programs are in general structured in this manner: at the start of the execution of the program, all the input data are transferred from the host's RAM through the PCIe bus to the GPU's memory, and then the kernels are launched. Once all kernels finish, the output data are sent back to the host's RAM. Since the throughput of the PCIe bus is lower than the throughput of both the GPU's RAM and the host's RAM, this back-and-forth communication over the PCIe bus can be a bottleneck for applications that do not perform a substantial amount of work per unit of data transferred.

Multiple works have proposed techniques for indexing and query processing on GPUs [20, 21]. For example, Kim et al. introduced FAST [14], which is a tree index designed to allow efficient searches on both CPUs and GPUs. GPUs have also been used to accelerate other aspects of databases [3]. However, to the best of our knowledge, they have not been used for database cracking.

Database cracking. Database cracking [8] is an adaptive indexing technique developed for column stores [9]. In this work we call the original database cracking algorithm [8, 9] Standard Crack. Each table in a column store, called a BAT, consists of two attributes: an identifier oid, and a second attribute attr [9]. We use the example given in Figure 1 to illustrate how Standard Crack works given the two queries 1 and 2. As in previous works in the literature, here we focus on queries of the type select R.A from R where R.A between a and b.

As soon as Query 1 arrives, Standard Crack initializes an index, called the cracker index, makes a copy of the BAT, called the cracker column, and partitions the cracker column into pieces according to the parameters of the query. In Figure 1 we observe that after Query 1, the cracker column is partitioned into 3 pieces: the piece with the elements strictly less than 12, the piece with the elements between 12 and 20, and the piece with the elements strictly greater than 20. Then, Query 1 simply returns the intermediate (yellow) piece as the result set. When Query 2 (5 ≤ R.A ≤ 28) arrives, the technique will employ the cracker index to find the piece where 5 would be found, which is the top (green) piece in the middle column, and then will partition only that piece according to 5, so that the top (green) piece of the middle column turns into two pieces: the piece with the elements < 5 and the piece with the elements 5 ≤ R.A < 12. A similar procedure takes place when searching and partitioning the piece where 28 would be found (the bottom (pink) piece in the middle column). The advantage of cracking is that if a future query, say Query 3, wishes to retrieve elements in the range 6 ≤ R.A ≤ 11, then it does not need to search for its result in the whole cracker column. Instead, it would use the cracker index to find the piece that contains 6, which is the second (cyan) piece from top to bottom in the right-most column of the figure, then find the piece that contains 11, which is again the same piece that contains 6, and then search for the result in that piece only.

There is a large number of works devoted to adaptive indexing, i.e., techniques that can adapt to the changing distribution of the workloads. Among these techniques is standard database cracking [6, 8, 9, 19], which works by incrementally building an index while simultaneously processing the queries. Several variants have been proposed to address a number of issues associated with standard database cracking. For example, stochastic cracking [7] was proposed to address the issue of convergence in non-uniformly-random workloads; sideways cracking [11], to address the problem of inefficient tuple reconstruction in database cracking; and hybrid cracking [12], to improve the speed of the convergence of Standard Crack. Schuhknecht et al. [22] observed that each of these is designed to address a single separate issue, like slow convergence, workload robustness, etc., but there does not exist a single technique that simultaneously addresses all of these issues. The index of Schuhknecht et al. [22] is thus the first meta-adaptive index that can emulate several of the different database cracking algorithms, thereby making it able to simultaneously address many of the issues of database cracking.

Parallel cracking algorithms. All the existing parallel cracking techniques are designed for multicore systems, unlike the techniques we propose in this paper, which are designed for GPU systems. The existing parallel cracking techniques can be grouped as follows [1]: (i) algorithms based on Parallel Standard Cracking (SC) [1, 4, 5]. These algorithms exploit the fact that as more queries arrive and the number of cracks increases, the most recent queries are more likely to involve distinct cracker pieces, which can be cracked in parallel; and (ii) algorithms based on Coarse-Granular Indexing (CGI). These algorithms behave very similarly to Standard Crack but pre-process the cracker column by range-partitioning it instead. Among the CGI-based parallel cracking techniques are Parallel Coarse-Granular Index (P-CGI) and Parallel-Chunked Coarse-Granular Index (P-CCGI) [1]. This latter algorithm has been shown to be the best-performing multicore CPU algorithm [24]; it pre-processes the cracker column by dividing the data in this column into equal-sized partitions, called chunks; then, to process each query, it assigns a separate thread to crack each chunk using CGI and then merges the results of the chunks.

3. PROPOSED ALGORITHMS FOR GPUS

In this work we propose two database cracking algorithms for GPUs, Crack GPU and Crack GPU Hybrid, and also propose two competing non-cracking techniques for GPUs, Sort GPU and Filter GPU, which can also be used for query processing and therefore serve as the baseline algorithms in our work to compare against the proposed GPU cracking algorithms. Crack GPU works by keeping a copy of the cracker column in the GPU, and then answers queries by introducing cracks in parallel on the GPU cracker column, while keeping the cracker index in the host's RAM. Crack GPU Hybrid (Crack-GH) is a hybrid CPU-GPU algorithm for standard cracking. This algorithm works by initially copying the cracker column to the GPU's global memory and then performing device-wide parallel partitioning operations [18] to crack the cracker column. Then, after a specified number of queries, the cracker column is copied from the GPU back to the host's RAM, which is then used to continue processing queries from then on. Sort GPU is a GPU algorithm that sorts the cracker column and then performs in-parallel binary searches on the GPU to answer each query. Filter GPU uses the GPU to search in parallel for the elements that belong to the query result set. The details of these four algorithms are given in the following subsections.

3.1 Cracking Algorithms for GPUs

In this subsection, we describe the two proposed cracking algorithms for GPUs: Crack GPU and Crack GPU Hybrid (which we abbreviate as Crack-GH).

3.1.1 Crack GPU

Crack GPU is based on the Standard Crack algorithm [9]. Crack GPU keeps the cracker index in the host's RAM and the cracker column in the GPU's global memory. For every query of the form select R.A from R where R.A between a and b, Crack GPU will introduce one crack for a and one for b in the cracker column residing in the GPU, and add one cracker index entry for a and another one for b in the host. Each of these cracks is introduced in parallel using device-wide GPU partition operations [18]. However, the introduction of the crack for a is serialized with respect to the operations that introduce the crack for b. After adding the cracks to the cracker column and the entries in the cracker index, Crack GPU will materialize the query result set in the host's RAM by running the Filter GPU algorithm only on the qualifying piece of the cracker column, instead of running it on the whole cracker column.

The pseudo-code of Crack GPU is presented in Algorithm 1. Just before the first query is processed (Line 1), the BAT on which cracking is to be done is copied from the host's RAM to the GPU's global memory (Line 2), and then the cracker index is created in the host's memory (Line 3). Each time a query of the form select R.A from R where R.A between a and b arrives, the algorithm Crack GPU is called by passing gC, the pointer to the cracker column in the GPU, C, the pointer to the cracker column in the host, and the values a and b.

Algorithm 1: Crack GPU

Input: GPU cracker column gC, host cracker column C, a, b
Output: Retrieves the values in C that belong to the range [a, b] and cracks the cracker column gC according to [a, b]

1  if numQueries = 0 then
2      gC ← Copy C to the GPU's global memory;
3      Initialize the cracker index in the host's memory;
4  posLow ← add_gpu_crack(gC, a);
5  posHigh ← add_gpu_crack(gC, b);
6  result ← GPUToHost(gC, posLow, posHigh);
7  numQueries++;
8  return result;

Crack GPU then proceeds to crack only the cracker column gC residing in the GPU (Lines 4 and 5). To crack the cracker column gC in the GPU, add_gpu_crack uses the host's cracker index to search for the piece that contains the value v (Line 1 of add_gpu_crack in Algorithm 2). Then, add_gpu_crack uses a device-wide parallel GPU partition algorithm [18] that uses all the computational resources of the GPU to find an index position such that every element in the gC column whose index is strictly less than position is strictly smaller than v, and every element in the gC column whose index is greater than or equal to position is greater than or equal to v (Line 2 of add_gpu_crack). Once position has been found, add_gpu_crack inserts the key v with value position into the host's cracker index and returns position (Lines 3 and 4 of add_gpu_crack). After cracking, Crack GPU materializes the results in the host's main memory (Line 6 of Crack GPU) and increments the number of queries processed (Line 7).

Algorithm 2: add_gpu_crack

Input: GPU cracker column gC, integer v
Output: Cracks the cracker column gC residing in the GPU on v

1  [low, high] ← Use the host's cracker index to find the lower and upper indexes of the piece that contains v;
   // Parallel GPU algorithm to split gC into two sections
2  position ← gpu_partition(gC[low, high], v);
3  Insert the entry (v, position) into the host's cracker index;
4  return position;

3.1.2 Crack GPU Hybrid (Crack-GH)

Crack GPU Hybrid (Crack-GH) uses both the GPU and the CPU for query processing. The difference between the two algorithms, Crack-GH and Crack GPU, is that Crack-GH takes as an input parameter a constant T, such that as soon as the first T queries have been processed on the GPU, the partially partitioned cracker column is transferred from the GPU's global memory to the host's, and future queries are then processed exclusively using the CPU.

Like Crack GPU, Crack-GH keeps the cracker index in the host's RAM and the cracker column in the GPU's global memory. The advantage of Crack-GH over Crack GPU is that the GPU's high instruction throughput can be exploited to quickly partition the cracker column during the first queries of the query sequence, and after the first T queries have been processed, when the amount of work assigned to the GPU diminishes, it switches back to CPU processing.

Algorithm 3 contains the pseudo-code of Crack-GH. Before the first query, Crack-GH makes a copy of the cracker column in the GPU and initializes the cracker index in the host (Lines 1 to 3). The algorithm then checks if the number of queries processed so far is still below the threshold T (Line 4), and if it is, it proceeds to crack only the cracker column gC residing in the GPU (Lines 5 and 6). To crack the cracker column gC in the GPU, Crack-GH also uses add_gpu_crack, which we described in Section 3.1.1. Otherwise, if the number of queries processed is equal to T, the cracker column is copied to the host's RAM (Lines 9 and 10). Then, processing takes place using the CPU as with Standard Crack (Lines 11 to 15).

Algorithm 3: Crack GPU Hybrid

Input: GPU cracker column gC, host cracker column C, a, b
Output: Retrieves the values in C that belong to the range [a, b] and cracks the cracker column gC according to [a, b]

1   if numQueries = 0 then
2       gC ← Copy C to the GPU's global memory;
3       Initialize the cracker index in the host's memory;
4   if numQueries < T then
5       posLow ← add_gpu_crack(gC, a);
6       posHigh ← add_gpu_crack(gC, b);
7       result ← GPUToHost(gC, posLow, posHigh);
8   else
9       if numQueries = T then
10          Copy gC to the cracker column C in the host;
11      posLow ← add_cpu_crack(C, a);
12      posHigh ← add_cpu_crack(C, b);
13      result ← C[posLow, posHigh];
14  numQueries++;
15  return result;

Under a random workload distribution, as more and more queries are processed, the pieces of the cracker column become smaller and smaller, so the performance of processing queries using cracking should become closer and closer to that of a fully-indexed approach. If this is the case, then the issue with Crack GPU is that, since the pieces are so small, the amount of work that the GPU must perform to process a single query is also very small. In such a case, the GPU starts to become underutilized, because the overhead of launching a GPU kernel and transferring the very small query result set to the host dominates the time spent processing the query. This is the reason why we have considered Crack GPU Hybrid as a potential competitor: we can exploit the high throughput of GPUs to partition the cracker column while the pieces are large, but as soon as they become small enough, we can switch to CPU processing.

3.2 Non-Cracking Algorithms for GPUs

In this subsection, we describe the two non-cracking algorithms, Filter GPU and Sort GPU, that are natural competitors against the GPU cracking algorithms introduced in Section 3.1.

3.2.1 Filter GPU

Filter GPU is the GPU analog of the Scan algorithm [7, 24]. We assume that all queries are of the form select R.A from R where R.A between a and b. Filter GPU works as follows. If Filter GPU is processing the first query in the sequence, it will make a copy of the cracker column and send it to the GPU's global memory. This is done in this way because we work under one of the assumptions that give rise to database cracking: there is no idle time before the first query. If Filter GPU is processing any query other than the first query in the sequence, then the cracker column should already be in the GPU's global memory. Each GPU thread will then be assigned at least one element in the cracker column and will check if the element(s) in question are in the range [a, b]; if they are, they will be collected in a separate array in the GPU's global memory using an in-parallel prefix sum [18].
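The per-element test and compaction can be illustrated with a CPU analog. This is our own sketch: on the GPU, the matches are compacted with a parallel prefix sum [18], for which `std::copy_if` is the sequential counterpart.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Scan the whole column and compact the elements that satisfy the
// range predicate a <= v <= b into a separate result array.
std::vector<int> filter_range(const std::vector<int>& col, int a, int b) {
    std::vector<int> result;
    std::copy_if(col.begin(), col.end(), std::back_inserter(result),
                 [a, b](int v) { return a <= v && v <= b; });
    return result;
}
```

Unlike the cracking algorithms, this approach does no reorganization at all, so every query pays the full scan cost; that trade-off is exactly what the experiments in Section 4 measure.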

3.2.2 Sort GPU

Sort GPU performs, during the first query, a device-wide radix sort to sort all the elements of the cracker column [18]. Then, to process a query of the form that we study in this work, select R.A from R where R.A between a and b, Sort GPU performs two device-wide searches in parallel: one for a and one for b. These GPU searches are called device-wide because they concurrently employ all the thread blocks of a GPU kernel, instead of using a separate thread block for different elements. Once Sort GPU has found the index of the first occurrence in the cracker column of a, which we will call ia, and the index of the last occurrence of b, called ib, Sort GPU knows that the query result set is the array gC[ia, ib], which is then copied from the GPU's global memory to the host's.
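Sketched on the CPU (our illustration: `std::sort` stands in for the device-wide radix sort and `std::lower_bound`/`std::upper_bound` for the two device-wide searches), the scheme is:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Given a column that was sorted once up front, answer a range query
// [a, b] with two binary searches. Returns the half-open index range
// [ia, ib) of the qualifying elements.
std::pair<std::size_t, std::size_t>
sorted_range(const std::vector<int>& sortedCol, int a, int b) {
    auto ia = std::lower_bound(sortedCol.begin(), sortedCol.end(), a);
    auto ib = std::upper_bound(sortedCol.begin(), sortedCol.end(), b);
    return {static_cast<std::size_t>(ia - sortedCol.begin()),
            static_cast<std::size_t>(ib - sortedCol.begin())};
}
```

The entire sorting cost is paid by the first query; every later query is just two logarithmic searches plus the result transfer, which is why sorting is a serious contender against cracking on parallel hardware [1].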

4. EXPERIMENTAL ANALYSIS

In this section, we present the experiments comparing the proposed GPU algorithms for database cracking, the proposed GPU non-cracking algorithms, single-threaded Standard Crack, and multi-threaded P-CCGI. We have organized the presentation of this experimental comparison in three stages. In the first stage, we compare the non-cracking algorithms, which, although they do not perform cracking operations, still compete against cracking algorithms as alternative approaches for query processing in environments where the workload is not known a priori. In the second stage, we compare the GPU cracking algorithms. In the third stage, we compare the best-performing non-cracking algorithm and the best-performing GPU cracking algorithm against Standard Crack and P-CCGI.

4.1 Hardware and Software

Our experiments were run on an Ubuntu 18.04 workstation with two Intel Xeon Gold 6136 chips running at 3 GHz. Each processor has 12 cores, so the machine has a total of 12 hardware threads per chip. As was done in the work of Alvarez et al. [1], TurboBoost was disabled. This workstation is equipped with 128 GB of DDR4 RAM and an Nvidia Quadro P4000 with 8 GB of GDDR5 RAM. We implemented our GPU algorithms using CUDA 10.0, Thrust, and CUDA CUB 1.8.0 [18]. The GPU code was compiled using the -arch=sm_60 flag. We used the single-threaded CPU C++ implementations of Halim et al. [7] for the Standard Crack, Sort, and Filter algorithms. These CPU implementations were compiled with g++ using -O3 optimizations.

4.2 Datasets and Query WorkloadsIn our experiments, we used a dataset that was generated

from a random uniform distribution as it has been done inprevious database cracking studies [7, 24]. We tested severalworkloads that have been previously used in the literature:ConsRandom, Mixed, Periodic, Random, SeqAlt, SeqInv,SeqOver, SeqRand, and ZoomOut [7].

Like many other works in the literature, we assume a main-memory column store with a single column (BAT) of 100 million tuples [1, 4, 5, 7, 9, 12, 23]. Each row in this column consists of a single 4-byte integer

Table 1: Experiment Parameters

Parameter Name      Range of Values                        Default Value
Cracker Col. Size   1 × 10^8 – 6 × 10^8                    1 × 10^8
Selectivity         0.01                                   0.01
Num. of Queries     1,000                                  1,000
T                   300                                    300
Num. Chunks         10                                     10
Workload            ConsRandom, Mixed, Periodic, Random,   Random
                    SeqAlt, SeqInv, SeqOver, SeqRand,
                    ZoomOut

in the range, and there are no duplicates in this column [7]. By default, the selectivity used in all our queries is 0.01. Our GPU algorithms store the cracker column in the GPU's global memory. All our GPU algorithms use pinned memory for the database column [15, 25]; pinned memory cannot be swapped out of RAM, so data transfers between the GPU and pinned memory locations are much faster than those involving only non-pinned memory locations [15].

4.3 Performance Measures

In our experiments, we used two performance measures: the individual query response time and the cumulative query response time. The individual query response time of a query is the total time elapsed from the moment it was issued until the moment its result set is found. The cumulative query response time at the i-th query is the sum of the query response times of all queries up to and including query i. To obtain these measures, we ran each query 10 times and took the average execution time across the 10 repetitions. For all our GPU algorithms, we included the cost of copying the cracker column from the host's RAM to the GPU's RAM as part of the query response time of the first query in the sequence. This is done because in database cracking it is assumed that there is no idle time before the first query.
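The cumulative measure is simply a running (prefix) sum over the individual measurements; a small sketch, where the function name cumulative_times is our own:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// The cumulative query response time at query i is the sum of the individual
// query response times of queries 1..i, i.e., a prefix sum over the per-query
// measurements.
std::vector<double> cumulative_times(const std::vector<double>& individual_ms) {
    std::vector<double> cum(individual_ms.size());
    std::partial_sum(individual_ms.begin(), individual_ms.end(), cum.begin());
    return cum;
}
```

For example, individual times of 750 ms, 1 ms, 1 ms yield cumulative times of 750, 751, and 752 ms, which is why a large first-query penalty dominates the cumulative curve for many subsequent queries.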

Table 1 lists the parameters of the experimental studies conducted in this paper, together with their ranges of values and default values. We studied the impact of each of these parameters on our performance measures. To study the impact of a parameter, we varied its value along its range while fixing all the other parameters at their default values.

4.4 Non-Cracking Algorithms for Query Processing

In this subsection, we compare two sub-groups of non-cracking algorithms for query processing: (i) the sub-group of single-threaded CPU algorithms, which consists of Filter and Sort [7]; and (ii) the sub-group of GPU baseline algorithms, which consists of Filter GPU and Sort GPU. None of these algorithms relies on cracking. The best-performing of these algorithms will be compared in Section 4.6 against the best-performing GPU cracking algorithm, Standard Crack, and P-CCGI.

4.4.1 Impact on Individual Query Response Time



[Figure 2: log-log plots of individual query response time (ms) versus query sequence (1–1000), Random workload, selectivity 0.01. Panel (a) compares Filter, Sort, Sort GPU, and Filter GPU; panel (b) decomposes Sort GPU into total time, host/GPU transfer, cracker index time, and sort time; panel (c) decomposes Filter GPU into total time, host/GPU transfer, and cracker index time.]

Figure 2: (a) Comparison between the individual query response times of the non-cracking algorithms. (b) Time decomposition of the individual query response time of Sort GPU. (c) Time decomposition of the individual query response time of Filter GPU.

Filter GPU receives each query as input and performs a search in parallel over the cracker array in order to retrieve the query result set, whereas Sort GPU sorts the cracker column and then performs in-parallel searches over the index.
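A CPU analogue of the Filter approach is a full scan with a range predicate; the sketch below is ours (the function name filter_range is hypothetical) and illustrates why Filter's cost is constant per query: no state carries over between queries.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Sketch (CPU analogue of Filter): every query scans the entire column and
// keeps the tuples in [a, b]. Nothing is reused across queries, so the
// per-query cost stays constant.
std::vector<int> filter_range(const std::vector<int>& col, int a, int b) {
    std::vector<int> result;
    std::copy_if(col.begin(), col.end(), std::back_inserter(result),
                 [a, b](int v) { return a <= v && v <= b; });
    return result;
}
```

On the GPU, the same predicate scan is what Filter GPU parallelizes across thread blocks; the contrast with Sort GPU is that the latter pays a one-time sort so that later queries are binary searches instead of scans.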

Figure 2 (a) shows the query response time, in milliseconds, as a function of the query sequence. In this figure, we observe that the query response times of all non-cracking algorithms converge after the first 2 or 3 queries. Our experimental results for the two single-threaded algorithms, Filter and Sort, are almost identical to those obtained in previous studies [7, 24]. In the case of Sort, the first query pays a cost of around 10 seconds to sort the cracker column, and from then on, all queries perform (very cheap) binary searches on this sorted column. Filter, on the other hand, has a constant execution time, which is expected since it scans the entire column for every query.

The two non-cracking GPU algorithms have query response times that are in between those of Filter and Sort. Compared to Sort, which pays a significant time penalty for the first query (10.5 seconds), Sort GPU pays a first-query cost (approximately 750 milliseconds) that is competitive with Filter GPU's (740 milliseconds), though not as low as Filter's (500 milliseconds). This is a consequence of the highly parallelized nature of CUDA Cub's radix sort algorithm [18]. From the second query on, the query response times of Sort GPU and Filter GPU decrease significantly to about 1 millisecond and 6 milliseconds, respectively. These query response times are still significantly more expensive than those of Sort (3 microseconds) because Sort GPU has much larger overhead as a result of launching CUDA kernels, performing binary searches, and transferring the data to and from the GPU.

Figures 2 (b) and (c) show the decomposition of the individual query response times of Sort GPU and Filter GPU into the host/GPU transfer time, cracker index time, and sort time (applicable to Sort GPU only). The host/GPU transfer time refers to the time spent communicating to and from the GPU through the PCIe bus, and the cracker index time refers to the time spent searching the cracker index. In Figures 2 (b) and (c), we see in the first query a gap between the total time and the sum of all other times because the total time for this query includes the cost of allocating and initializing memory on the GPU, which takes around 600 ms. Notice that in Figure 2 (b) only the first query has a non-zero sorting time, as expected. From Figure 2 (b) we conclude that in most cases the host/GPU transfer time of Sort GPU accounts for half of the query response time, while the other half includes searching for the query result set and calling GPU kernels. From Figure 2 (c) we conclude that the host/GPU transfer time represents only a small portion (around 10%) of the overall query response time of Filter GPU, while the search time accounts for the majority of it.

4.4.2 Impact on Cumulative Query Response Time

Figure 3 shows the cumulative query response times in seconds as a function of the query sequence for all four non-cracking algorithms. This means that for any particular curve on this plot, say Sort GPU's, the value at the 100th query equals the sum of the query response times from the 1st query up to the 100th query. Here we observe that Sort GPU has much shorter cumulative query response times than Sort (Sort GPU's cumulative query response time is 12% of Sort's on average), because of the elevated penalty paid by Sort on the first query. Moreover, after the first 100 queries, the cumulative costs of the two algorithms begin to draw slightly closer to each other, which is also expected because Sort's query response times after the 1st query are extremely low, whereas Sort GPU's are larger. We also notice in this figure that Filter GPU's cumulative query response times are longer than Sort GPU's, though initially close to each other; starting from around the 65th query, Sort GPU's cost improves upon Filter GPU's, and this difference grows to almost an order of magnitude by the 1,000th query. This is expected because, in general, the query response time of Filter GPU is higher than that of Sort GPU.

Summary. The experiments in this section suggest that Sort GPU is the best-performing non-cracking candidate for query processing in a scenario like the one that gives rise to standard database cracking. This is because Sort GPU



[Figure 3: log-log plot of cumulative query response time (s) versus query sequence (1–1000), Random workload, selectivity 0.01, for Filter, Sort, Sort GPU, and Filter GPU.]

Figure 3: Cumulative query response times of the non-cracking algorithms

imposes a significantly smaller penalty (almost 14X) on the first query than single-threaded Sort, while offering a penalty comparable to those of Filter GPU and Filter. Moreover, Sort GPU is able to maintain low cumulative query response times when compared against Filter GPU.

4.5 Crack GPU versus Crack GPU Hybrid

In this section, we compare our two proposed GPU cracking algorithms, Crack GPU and Crack GPU Hybrid. The better of the two will then be compared against the best non-cracking algorithm, Standard Crack, and P-CCGI in Section 4.6.

We set T, the threshold used by Crack GPU Hybrid, to 300, meaning that Crack GPU Hybrid copies the cracker column from the GPU to the host at the 300th query. The algorithm that results from setting T = 300 in Crack GPU Hybrid is called CrackGH300. The idea behind this algorithm is to exploit the GPU's massive throughput to partition the cracker column during the initial queries of the sequence and then, once the pieces of the cracker column become so small that the communication costs and GPU kernel launching start dominating the execution time, to switch to CPU processing. We chose T = 300 because we observed good results across all the workloads tested in this study.
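The threshold switch described above amounts to a simple dispatcher around the two cracking back ends. The following is only a control-flow sketch under our own naming: HybridCracker, crack_on_gpu, copy_column_to_host, and crack_on_cpu are hypothetical stand-ins for the actual routines.

```cpp
#include <cassert>

// Control-flow sketch of Crack GPU Hybrid: crack on the GPU for the first T
// queries, then ship the cracker column back to the host once and continue
// with CPU cracking from then on.
struct HybridCracker {
    int  threshold = 300;   // T: query number at which we switch to the CPU
    int  queries_seen = 0;
    bool on_gpu = true;

    template <typename GpuCrack, typename CopyBack, typename CpuCrack>
    void answer(int a, int b, GpuCrack crack_on_gpu,
                CopyBack copy_column_to_host, CpuCrack crack_on_cpu) {
        ++queries_seen;
        if (on_gpu) {
            crack_on_gpu(a, b);
            if (queries_seen == threshold) { // pieces are now small; kernel
                copy_column_to_host();       // launches and PCIe transfers
                on_gpu = false;              // dominate, so fall back to CPU
            }
        } else {
            crack_on_cpu(a, b);
        }
    }
};
```

The one-time copy at query T is what produces the spike discussed below, and everything after it is ordinary Standard Crack on the host.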

Figure 4 (a) presents a plot of the query response times of Crack GPU and CrackGH300 as a function of the query identifier, i.e., it displays the sequence of query response times of each of the 10,000 queries. In this plot, we observe that the query response times of CrackGH300 are exactly the same as those of Crack GPU for the first 300 queries, which is expected because both algorithms perform the same activities in exactly the same way during those first queries. At the 300th query, we observe a spike, which is a consequence of the communication cost of shipping the cracker column back to the host's RAM. The variance of the query response times of CrackGH300 is much larger than that of Crack GPU; the main contributor to this variance is the CPU version of Standard Crack. Nonetheless, CrackGH300 provides an average speedup of 50% over Crack GPU. The average query response time of CrackGH300 decreases at a significantly quicker rate than that of Crack GPU after the first 1,000 queries: the query response time of Crack GPU stabilizes at 0.7 milliseconds, while that of CrackGH300 keeps decreasing despite its variance.

Figures 4 (b) and (c) present a time breakdown of the query response times of Crack GPU and CrackGH300, respectively, into the host/GPU transfer time and the cracker index time as a function of the query sequence. Host/GPU transfer time refers to the time spent sending data back and forth through the PCIe bus, while cracker index time refers to the time spent by the CPU searching the cracker index. From these figures, we conclude that the cracker index time represents a very small portion of the overall execution time (2 to 4 orders of magnitude smaller than the total execution time). During the first 10 queries, it is the cracking time (not shown in the figure because it matches the total time almost exactly) that accounts for the majority of the query response time; this makes sense since the GPU is partitioning larger pieces of the cracker array. After the first 10 queries, the host/GPU transfer time dominates the total execution time, as expected, because the pieces are much smaller, so most of the time is spent in memory transactions to and from the GPU. In Figure 4 (c) we observe that the host/GPU transfer plot has no points beyond the 300th query, which is expected since the algorithm begins processing queries on the CPU at that point.

Summary. Overall, we conclude that Crack GPU Hybrid with T = 300 is the better of the two GPU cracking algorithms compared, since it achieves on average up to 50% shorter query response times than Crack GPU. However, Crack GPU has the advantage of a lower variance in query response times, which is a desirable property in database query processing algorithms because it makes for more consistent behavior.

4.6 Cracking Algorithms for Query Processing

In this subsection, we compare the best-performing cracking GPU algorithm from Section 4.5, Crack GPU Hybrid (CrackGH300), against the best-performing non-cracking algorithm from Section 4.4, Sort GPU, as well as single-threaded Standard Crack and multithreaded P-CCGI [24] with 16 threads and 10 chunks. We study the impacts of the query sequence size, the type of workload, and the size of the cracker column on both the individual query response time and the cumulative query response time.

4.6.1 Impact on Individual Query Response TimeFigure 5 (d) shows the average query response time (also

called the per query cost [7]) for every single query in thesequence of 1,000 queries tested, assuming a random work-load. This figure shows that we were able to replicate theperformance results of Standard Crack observed in previ-

7

Page 8: A Study on Database Cracking with GPUs - cs.ou.edudatabase/HIGEST-DB/publications/cracking.… · BAT, called the cracker column, and partitions the cracker columns into pieces according

● ●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●●●●●●●●●●●●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●

●●

●●●

●●●●●●

●●

●●●

●●●●

●●

●●

●●●

●●

●●

●●●

●●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●●

●●●

●●

●●●

●●●

●●

●●●●●●●

●●●●●

●●●

●●

●●●

●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●●●●●●

●●●

●●

●●●

●●

●●●●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●●●

●●

●●●

●●●

●●

●●●●●●●

●●

●●●●

●●●●●

●●●●●●●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●●

●●

●●●●●

●●

●●

●●●

●●●●●

●●

●●

●●●

●●●●●

●●●●

●●●●●

●●●●

●●

●●●

●●

●●●

●●

●●●●●●

●●

●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●●

●●●●●●

●●

●●●

●●●

●●●●●●

●●

●●●●

●●

●●●●●

●●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●●●●●●

●●●●

●●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●

●●●●●

●●

●●●

●●

●●●

●●

●●●

●●

●●●●●●●

●●

●●●

●●●

●●●

●●

●●

●●●

●●●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●●●●●●

●●

●●●

●●●●

●●

●●

●●●●●●●

●●

●●

●●

●●

●●

●●

●●●●●

●●●●●

●●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●●●●

●●

●●

●●●●●●●

●●

●●

●●●●

●●

●●●

●●

●●●●●

●●●

●●●

●●●●●●

●●●

●●

●●●●●

●●

●●●●

●●●

●●●●●●●

●●●

●●

●●●

●●●●

●●

●●●●●●

●●

●●

●●●

●●

●●●

●●●

●●

●●●

●●

●●●

●●●

●●

●●●

●●

●●

●●

●●●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●●●●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●●

●●

●●●

●●●

●●

●●

●●●●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●●●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●●●●●

●●

●●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●●

●●●

●●●

●●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●●●●●

●●

●●●

●●

●●●●●

●●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●●

●●●●●

●●●●●

●●

●●●●

●●

●●●●●●

●●●●

●●

●●●●●●

●●●●●●●●●

●●

●●●●

●●

●●●●●●

●●●●

●●●●●

●●●●●

●●

●●

●●

●●

●●

●●●●●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●●●●●●

●●●●

●●●●●

●●●

●●●

●●

●●

●●

●●

●●●

●●●●●●

●●

●●●●●●

●●

●●

●●

●●

●●●●●●

●●

●●●

●●●●●

●●●●●●●●

●●

●●

●●●●●●●●

●●

●●

●●●

●●●

●●●●●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●●

●●●●●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●●●

●●●●●

●●

●●●

●●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●

●●

●●●

●●●

●●●

●●

●●●●●

●●●

●●

●●

●●●●●●●●●

●●

●●

●●●●●●●

●●

●●

●●●●

●●

●●●●

●●●

●●●●

●●

●●

●●

●●●

●●●●●●

●●

●●

●●

●●●

●●●

●●●●●

●●●

●●●●●●

●●

●●●

●●●●●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●●●

●●

●●●●●

●●

●●●●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●●●●●

●●●●

●●

●●

●●●●●●

●●●

●●

●●

●●●

●●●

●●●●

●●

●●

●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●●●

●●●

●●●●●●●

●●●●

●●

●●

●●●

●●

●●●●

●●●

●●●

●●

●●

●●

●●●

●●

●●●●

●●

●●●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●●

●●

●●

●●●●

●●

●●●●●●

●●●

●●

●●●●●●

●●

●●●

●●●

●●●●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●

●●

●●●

●●

●●

●●●●●

●●

●●

●●●

●●●

●●

●●

●●●

●●●●

●●

●●●

●●●●

●●●

●●●

●●●●

●●

●●

●●●

●●●●

●●

●●

●●●●

●●

●●

●●●●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●●

●●●●

●●

●●●●●●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●●

●●●

●●

●●●●

●●●●●●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●●●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●●●

●●

●●●

●●●●●

●●●●

●●

●●

●●●●●

●●

●●●

●●●●●●●●●

●●

●●●

●●●●

●●

●●●●

●●

●●●

●●●●

●●●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●●●

●●●●

●●●●

●●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●

●●●●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●●●●

●●

●●

●●●●●●●

●●

●●●

●●●

●●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●●●●●●

●●

●●

●●

●●●●●●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●●●

●●●

●●●

●●

●●●●

●●

●●

●●●●●

●●●

●●●

●●

●●●●

●●

●●●●●●

●●●

●●

●●●

●●

●●●●

●●●

●●●

●●

●●●

●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●●

●●●

●●

●●●

●●

●●●●●●

●●

●●

●●●●●

●●●●●●

●●●●

●●●

●●●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●

●●

●●●●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●●●

●●

●●●

●●●

●●●●

●●●

●●

●●

●●

●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●●●●

●●●●

●●

●●●●●●●●●

●●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●●●●

●●●●

●●

●●●●

●●●

●●

●●

●●

●●●

●●

●●

●●●●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●●●●●●●●●●●

●●

●●

●●●●●

●●●

●●●●

●●●●

●●●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●●●●●●

●●●●●

●●●

●●

●●●

●●●

●●

●●

●●●●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●●●●●

●●●●

●●●●

●●●

●●

●●

●●●

●●

●●●●●

●●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●●●●●●

●●

●●

●●●●●

●●●●●

●●●●●

●●

●●●

●●●

●●●●

●●

●●●●●●

●●●

●●

●●

●●

●●●

●●●

●●

●●●●

●●●●●●

●●●●

●●●

●●●●●●●

●●●●

●●●

●●●●●

●●

●●

●●

●●●

●●●●

●●

●●

●●●●●

●●●

●●●

●●●

●●

●●

●●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●●●●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●●

●●

●●●

●●●

●●

●●

●●●

●●●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●

●●●●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●●●●●

●●●●●

●●

●●

●●●

●●●

●●●●

●●●

●●

●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●●

●●

●●●●●

●●

●●

●●●

●●●●●

●●

●●

●●●

●●●

●●

●●●●

●●

●●●●●

●●

●●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●●

●●●

●●

●●●

●●

●●●

●●●●●●●●●●●●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●

●●●

●●●●●

●●●●●●

●●●

●●

●●

●●

●●●

●●●●●●●

●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●●●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●●

●●

●●

●●●●

●●

●●●●●●●

●●

●●

●●●

●●

●●●●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●●●●

●●

●●●

●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●●●

●●●●

●●●●●●

●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●●

●●●●●●●

●●

●●●

●●

●●

●●●

●●●

●●●●●

●●

●●●●●●●

●●

●●

●●●●

●●

●●●

●●

●●

●●●

●●

●●●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●●●●

●●

●●●●●

●●

●●●

●●●

●●

●●●

●●●

●●●●

●●

●●●

●●

●●

●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●●

●●

●●●

●●

●●●●●●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●●

●●●

●●

●●●

●●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●●●●

●●

●●●

●●●●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●●

●●●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●●●●●●●●●

●●

●●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●

●●●●●●

●●●●

●●

●●●●●

●●

●●●

●●●

●●●

●●●

●●●●●

●●●●●●

●●

●●●●●●●

●●

●●●●●

●●●●●●

●●

●●●

●●

●●

●●●●

●●

●●●

●●●●

●●●

●●

●●●●●

●●●

●●

●●

●●●

●●

●●

●●●●

●●

●●●●●●

●●●●

●●●●

●●

●●

●●●

●●●●●●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●●●●

●●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●●

●●●●

●●

●●

●●●

●●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●●●●●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●●

●●●

●●●●●●●●

●●

●●

●●●●●●

●●

●●●●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●

●●●

●●

●●●●●●

●●●●

●●

●●

●●

●●

●●●●●●

●●●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●●●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●●●●●●

●●

●●●●●●●

●●

●●●●

●●●

●●●

●●

●●●●●●●

●●

●●●

●●●●●●●

●●

●●

●●●●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●●●

●●●●

●●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●●●●●

●●●●●●

●●

●●

●●●●●●●●●●

●●●

●●●

●●

●●●●●●●●

●●●●●

●●

●●

●●

●●

●●●

●●●

●●●●●

●●●

●●●

●●●●●

●●

●●●●●

●●

●●

●●

●●●●●●

●●

●●

●●

●●

●●●

●●

●●

Query Sequence

Indi

vidu

al Q

uery

Res

pons

e T

ime

(ms)

1 10 100 1000 100000.00

10.

11

1010

010

00(a) Random, Selectivity = 0.01

● CrackGH300Crack GPU

● ● ●●

●●

●●●

●●

●●●●

●●

●●

●●

●●●●●●●●●●●●●●

●●

●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

Query Sequence

Indi

vidu

al Q

uery

Res

pons

e T

ime

(ms)

1 10 100 10000.00

10.

11

1010

010

00

(b) Crack GPU: Random, Selectivity = 0.01● Total Time

Host/GPU TransferCracker Index Time

● ● ●●

●●

●●●

●●

●●●●

●●

●●

●●

●●●●●●●●●●●●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●

●●

●●●

●●●●●●

●●

●●●

●●●●

●●

●●

●●●

●●

●●

●●●

●●●

●●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●●

●●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●

●●●

●●

●●●

●●●

●●

●●●●●●●●

●●●●●

●●●

●●

●●●

●●●

●●

●●●

●●●●

●●

●●

●●

●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●●●●●●

●●●

●●

●●

●●

●●●●●●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●●●●●

●●●

●●●

●●

●●●●●●●●

●●

●●●●

●●●●●

●●●●●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●

●●●●

●●

Query Sequence

Indi

vidu

al Q

uery

Res

pons

e T

ime

(ms)

1 10 100 10000.00

10.

11

1010

010

00

(c) CrackGH300: Random, Selectivity = 0.01● Total Time

Host/GPU TransferCracker Index Time

Figure 4: (a) Comparison between the individual query response times of CrackGH300 and Crack GPU. (b)Time decomposition of the individual query response time of Crack GPU. (c) Time decomposition of theindividual query response time of CrackGH300.

ous works [7, 24], where the first query has a response time of around 500 milliseconds and then converges to around 1 millisecond.

In Figure 5 (d) we observe that CrackGH300's response times show a red spike at the 300th query, marked with a dark vertical striped line. This spike rises to around 45 milliseconds and is due to the fact that, after processing that query, the cracker column is transferred from the GPU's global memory to the host's RAM so that CrackGH300 can start processing queries on the CPU. The transfer time is slightly under 50 milliseconds, which corresponds to a PCIe throughput of at least 7.6 GB/s; this is consistent with the experiments we have performed on our GPU using pinned memory. According to those experiments, if the host's cracker column were stored in non-pinned memory, that memory transfer would take around twice as long, i.e., around 100 milliseconds, which would make the spike more noticeable.
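The relationship between column size, PCIe throughput, and transfer time can be checked with a quick back-of-the-envelope computation. The sketch below is illustrative only; the 10^8-key column and 4-byte keys are assumptions, not measurements from the experiments:

```python
def transfer_ms(num_keys, key_bytes=4, throughput_gb_s=7.6):
    """Estimated host/GPU copy time in milliseconds for a cracker column
    of num_keys fixed-width keys at the given PCIe throughput."""
    return num_keys * key_bytes / (throughput_gb_s * 1e9) * 1e3

pinned = transfer_ms(10**8)                          # pinned memory: ~7.6 GB/s
pageable = transfer_ms(10**8, throughput_gb_s=3.8)   # non-pinned: roughly half
```

With 10^8 4-byte keys (about 0.4 GB), the pinned estimate comes out around 50 milliseconds, on the order of the observed spike, and halving the throughput for non-pinned memory doubles the estimate.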

In Figure 5 (d) we notice that Sort GPU has consistently lower query response times than CrackGH300, Standard Crack and P-CCGI for approximately the first 10 queries. After that, CrackGH300 starts exhibiting execution times competitive with those of Sort GPU and P-CCGI, and beyond the 12th query, CrackGH300, Standard Crack and multi-threaded P-CCGI offer shorter query execution times than Sort GPU. The reason for this is that once Sort GPU sorts the cracker array in the GPU's memory, every query from that point on simply performs a single in-parallel GPU binary search to retrieve the elements in the cracker column that belong to the interval [a, b]. The execution times of these parallel binary searches usually show small variance, which explains why, in Figure 5 (d), the query response time of Sort GPU stabilizes around 1 millisecond after the first or second query and then exhibits low variance, while the query response times of CrackGH300, Standard Crack and P-CCGI possess higher variance. This is an advantage for Sort GPU because it leads to more predictable execution times. Despite this, P-CCGI shows on average around an order of magnitude shorter query execution times than all other techniques.
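Conceptually, the post-sort lookup that Sort GPU performs in parallel on the GPU amounts to two binary searches over the sorted cracker column. A minimal single-threaded sketch (inclusive bounds assumed) is:

```python
from bisect import bisect_left, bisect_right

def range_query(sorted_col, a, b):
    """Return the contiguous slice of a sorted column whose keys lie in [a, b]."""
    lo = bisect_left(sorted_col, a)   # first index with key >= a
    hi = bisect_right(sorted_col, b)  # first index with key > b
    return sorted_col[lo:hi]

range_query([1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 14, 21, 29, 33], 5, 28)
```

Because both searches are O(log n) regardless of where [a, b] falls, the per-query cost is nearly constant, which matches the low variance observed for Sort GPU.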

Impact of the Workload Type. Figure 5 shows the individual query response times, measured in milliseconds, of the different algorithms as a function of the query sequence. The workloads shown in this figure are ConsRandom, Mixed, Periodic, Random, SeqAlt, SeqInv, SeqOver, SeqRand, and ZoomOut. In this figure we see that in non-uniformly-random workloads (Periodic, SeqAlt, SeqInv, SeqOver, ZoomOut), Sort GPU is able to provide better query response times than Standard Crack, multicore P-CCGI and CrackGH300. However, in uniformly-random workloads (Mixed, Random, SeqRand), both multicore P-CCGI and CrackGH300 perform better than all other algorithms, including Sort GPU, after the first 10 to 20 queries. Between multicore P-CCGI and CrackGH300, the former shows smaller individual query response times. CrackGH300 performs better than Sort GPU for uniformly-random workloads because such workloads tend to scatter the locations of the first cracks more widely across the cracker column, which means that the GPU can be better exploited for partitioning the big pieces first. In the case of non-uniformly random workloads, the queries may fall more frequently in the same pieces, reducing their sizes more and more to the point where the work assigned to the GPU is very small. We also observe that in most of these random workloads (Random and SeqRand), CrackGH300, Crack GPU and P-CCGI have much higher variability in their query response times than Sort GPU.
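To make the piece-size argument concrete, the basic crack step that every variant applies to a piece can be sketched as follows. This is a simplified out-of-place version for illustration; actual implementations partition in place inside the cracker column:

```python
def crack_piece(piece, a, b):
    """Partition one piece of the cracker column around a query range [a, b):
    keys < a, keys in [a, b), and keys >= b. Cracking variants differ mainly
    in which pieces get partitioned, and with how much parallelism."""
    low  = [k for k in piece if k < a]
    mid  = [k for k in piece if a <= k < b]
    high = [k for k in piece if k >= b]
    return low, mid, high

crack_piece([9, 10, 14, 11, 2, 29, 7, 33, 1, 3, 21, 13, 6, 8], 12, 20)
```

The cost of a crack is linear in the size of the piece, so workloads that keep hitting the same pieces quickly shrink them to sizes where a GPU has little parallel work left, whereas scattered cracks leave large pieces for the GPU to partition.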

Summary. In terms of individual query response times, for the non-uniformly random workloads tested (Periodic, SeqAlt, SeqInv, SeqOver and ZoomOut), Sort GPU performed the best; P-CCGI performed the best for random workloads (ConsRandom, Mixed, Random and SeqRand), followed by CrackGH300.

Figure 5: Individual query response times of CrackGH300, Standard Crack, Sort GPU and P-CCGI for different workloads and with a selectivity of 0.01

4.6.2 Impact on Cumulative Query Response Time

Figure 6 (d) shows the cumulative query response times in milliseconds as a function of the query sequence. In our experiments, we were able to replicate the results obtained by Halim et al. [7] concerning the cumulative query response times for Standard Crack, which starts at just under 1,000 milliseconds for the 1st query and then progressively reduces its rate of increase. This behavior is due to the fact that a random workload is equally likely to add cracks at any point in the cracker column, and as more cracks are added, the query processing time decreases because the pieces inside the column become progressively smaller.

Figure 6: Cumulative query response times of CrackGH300, Standard Crack, Sort GPU and P-CCGI for different workloads and with a selectivity of 0.01

In Figure 6 (d), we can see that both CrackGH300 and Sort GPU show small rates of increase for the first 300 queries, after which they start increasing in a linear fashion. However, we notice that the behavior Sort GPU exhibits is different from the one that (single-threaded) Sort showed in Halim et al.'s work [7], where the plot of the cumulative query response time of (single-threaded) Sort is a horizontal line at 10 seconds. The reason for this discrepancy is that (single-threaded) Sort has a much longer response time for the first query (around 10.5 seconds) than Sort GPU (around 760 milliseconds), so that when we add Sort's very short subsequent query response times, of the order of 0.001 milliseconds, to the cumulative response times, they barely change the cumulative time of 10.5 seconds that Sort exhibits. In the case of Sort GPU, the query response times of all queries starting from the 2nd are higher, around 0.4 milliseconds, than those of Sort for the same queries. Standard Crack, on the other hand, behaves exactly as in Halim et al.'s work [7], where its cumulative time increases but at a rate that decreases over time. P-CCGI shows a cumulative query response time curve with a shape very similar to that of Standard Crack, but shifted down an order of magnitude; this is expected because of P-CCGI's design, which initially breaks the cracker column into equally-sized chunks without following any semantic restriction, and then uses independent threads to apply Standard Crack on each chunk.
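A minimal sketch of this chunking idea is shown below; it is illustrative only (P-CCGI itself cracks each chunk in place and maintains a per-chunk cracker index rather than rebuilding lists):

```python
from concurrent.futures import ThreadPoolExecutor

def crack_chunk(chunk, a, b):
    # Partition one chunk around [a, b]; a real implementation would do this
    # in place and record the two new crack positions in the chunk's index.
    return ([k for k in chunk if k < a],
            [k for k in chunk if a <= k <= b],
            [k for k in chunk if k > b])

def parallel_crack(column, a, b, num_chunks=4):
    """Split the column into equally-sized chunks and crack them independently,
    so a single query introduces cracks in every chunk simultaneously."""
    n = len(column)
    size = max(1, (n + num_chunks - 1) // num_chunks)
    chunks = [column[i:i + size] for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        parts = list(pool.map(lambda c: crack_chunk(c, a, b), chunks))
    # The query result is the union of the per-chunk middle pieces.
    return [k for _, mid, _ in parts for k in mid]
```

Because the chunks share no state, the threads never contend with each other, which is why the cumulative curve keeps the shape of Standard Crack while being shifted down by roughly the degree of parallelism.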

Impact of the Workload Type. Figure 6 shows the cumulative query response times, measured in milliseconds, of the different algorithms as a function of the query sequence. The workloads shown in this figure are ConsRandom, Mixed, Periodic, Random, SeqAlt, SeqInv, SeqOver, SeqRand, and ZoomOut. Each panel shows a vertical striped black line at query 300, which denotes the query at which CrackGH300 switches from GPU processing to CPU processing. In Figure 6 we notice that in the random workloads (ConsRandom, Random and SeqRand), the cumulative query response times of P-CCGI are the shortest across all algorithms, followed by CrackGH300. We also observe that in several sequential workloads (Periodic, SeqAlt, SeqInv, SeqOver), Standard Crack is able to achieve lower cumulative query response times during the first 10 queries than Sort GPU and CrackGH300. However, in those same workloads, starting from the 10th query, (multicore) P-CCGI provides the shortest cumulative query response times for approximately the first 100 queries, and after those, Sort GPU has the shortest cumulative query response times.

One might consider an alternative to the algorithm used by CrackGH300: instead of employing GPU cracking for the first T = 300 queries and then switching to single-threaded cracking, it could employ GPU cracking for the same initial T = 300 queries and then switch to multi-threaded cracking. However, from the results shown in Figure 6 we observe that this hypothetical algorithm would not perform any better than P-CCGI, because P-CCGI consistently outperforms CrackGH300 across all workloads.

Consistent with the behavior observed in Section 4.6.1, the cumulative query response times of Sort GPU are better than those of CrackGH300 and Standard Crack for non-uniformly-random workloads; this is a natural consequence of the shorter query response times exhibited by Sort GPU. The only exceptions are the ConsRandom, Random, and SeqRand workloads, where CrackGH300 is able to outperform Sort GPU and Standard Crack.

Impact of the Size of the Cracker Column. Figure 7 illustrates the impact of the size of the cracker column on the total execution time, measured in seconds, of 10,000 queries for the different algorithms. In this experiment, we vary the size of the cracker column in the range 1 × 10^8 – 6 × 10^8, and we assume a random workload and a selectivity of 0.01. Notice that, unlike many other figures in this work, this illustration is not in logarithmic scale. In this figure we observe that, for the cracker column sizes tested, all algorithms seem to have total execution times that scale linearly as functions of the size of the cracker column. This makes sense if these techniques converge to a fully-indexed approach well before the 10,000 queries are completed, and if the execution time to answer a query on an already sorted cracker column is not significantly smaller than the amount of time invested in performing the actual cracking operations. We also observe that on average CrackGH300 has a 2.4X shorter total execution time than Standard Crack, and that Sort GPU and Standard Crack have competitive total execution times. However, P-CCGI shows consistently shorter total execution times than all other algorithms, exhibiting on average a 3.4X shorter total execution time than CrackGH300 across different column sizes.

Figure 7: Impact of the size of the cracker column on the total execution time (random workload, selectivity = 0.01, 10,000 queries)

Summary. Our experiments have shown that, in terms of cumulative response time, the best algorithm for non-uniformly-random workloads is P-CCGI for the first queries and Sort GPU after the first 100 queries, and that for uniformly-random workloads the best-performing algorithm is P-CCGI. On average, in terms of cumulative query execution time, Sort GPU had 2.4X better performance than P-CCGI for non-random workloads like SeqInv, while P-CCGI outperformed Sort GPU by a factor of 1.98X in random workloads. We have also found that, with the cracker column sizes we tested and under random workloads, CrackGH300, Standard Crack, P-CCGI and Sort GPU exhibit what appears to be linear scaling of the total execution time for 10,000 queries as a function of the size of the cracker column.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we performed the first study on the use of GPUs for database cracking. To this end, we proposed two GPU algorithms for database cracking, Crack GPU and Crack GPU Hybrid, and also proposed two non-cracking GPU techniques, Sort GPU and Filter GPU. We compared these algorithms against single-threaded Standard Cracking on the CPU and the best performing multithreaded CPU algorithm for cracking, P-CCGI [24], running with 16 threads.

Our experiments showed that in terms of cumulative query response time, Sort GPU achieved on average a 2.4X improvement over P-CCGI in non-random workloads, while P-CCGI achieved on average a 1.98X improvement over Sort GPU in random workloads. These results on non-random workloads corroborate the conjecture of Alvarez et al. [1] that sorting is a very strong competitor to database cracking techniques on parallel architectures.

In this paper, we studied the question of whether GPUs can be effectively used to help improve the query response time of database cracking. The results reported in this paper show that Crack GPU and Crack GPU Hybrid do not improve over the best performing multicore CPU algorithm, P-CCGI, because of two main problems: a) the introduction of cracks one at a time in Crack GPU and Crack GPU Hybrid, which does not exploit the full parallel potential of GPUs; and b) the high transfer time from CPU to GPU and vice versa.

One approach to eliminating the first problem consists in using the GPU to split the cracker column into chunks and then performing cracking independently in each chunk, which permits the introduction of multiple cracks simultaneously. To eliminate the second problem, we can exploit GPU streams. GPU streams make it possible to perform memory transfers from the host to the GPU, and vice versa, simultaneously with kernel execution [15]. In the case of GPU algorithms for database cracking, GPU streams could potentially help decrease, or even completely hide, the spike that we observed in CrackGH300 after processing the 300th query, which was a product of copying the cracker column from the GPU to the host's RAM.

6. ACKNOWLEDGMENTS

This work is supported in part by the National Science Foundation under Grants No. 1302439 and 1302423.

7. REFERENCES

[1] V. Alvarez, F. M. Schuhknecht, J. Dittrich, and S. Richter. Main memory adaptive indexing for multi-core systems. In Proceedings of the International Workshop on Data Management on New Hardware – DaMoN, 2014.

[2] D. Comer. The Ubiquitous B-Tree. ACM Computing Surveys, 11(2):121–137, 1979.

[3] E. Furst, M. Oskin, and B. Howe. Profiling a GPU database implementation: a holistic view of GPU resource utilization on TPC-H queries. In Proceedings of the International Workshop on Data Management on New Hardware – DaMoN, 2017.

[4] G. Graefe, F. Halim, S. Idreos, H. Kuno, and S. Manegold. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment, 5(7):656–667, 2012.

[5] G. Graefe, F. Halim, S. Idreos, H. Kuno, S. Manegold, and B. Seeger. Transactional support for adaptive indexing. The VLDB Journal, 23(2):303–328, 2014.

[6] I. Haffner, F. M. Schuhknecht, and J. Dittrich. An analysis and comparison of database cracking kernels. In Proceedings of the International Workshop on Data Management on New Hardware – DaMoN, 2018.

[7] F. Halim, S. Idreos, P. Karras, and R. H. C. Yap. Stochastic database cracking. Proceedings of the VLDB Endowment, 5(6):502–513, 2012.

[8] S. Idreos. Database cracking: Towards auto-tuning database kernels. PhD thesis, 2010.

[9] S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking. In CIDR, 2007.

[10] S. Idreos, M. L. Kersten, and S. Manegold. Updating a cracked database. In Proceedings of the ACM SIGMOD International Conference on Management of Data – SIGMOD, 2007.

[11] S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In Proceedings of the ACM International Conference on Management of Data – SIGMOD, 2009.

[12] S. Idreos, S. Manegold, H. Kuno, and G. Graefe. Merging what's cracked, cracking what's merged: adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment, 4(9):586–597, 2011.

[13] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the Future of Parallel Computing. IEEE Micro, 31(5):7–17, 2011.

[14] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the ACM International Conference on Management of Data – SIGMOD, 2010.

[15] D. Kirk and W.-m. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2016.

[16] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. ACM SIGARCH Computer Architecture News, 38(3), 2010.

[17] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. In Proceedings of the IEEE International Conference on Data Engineering – ICDE, 2013.

[18] D. Merrill. CUDA CUB.

[19] H. Pirk, E. Petraki, S. Idreos, S. Manegold, and M. Kersten. Database cracking: fancy scan, not poor man's sort! In Proceedings of the International Workshop on Data Management on New Hardware – DaMoN, 2014.

[20] R. Rui, H. Li, and Y.-C. Tu. Join algorithms on GPUs: A revisit after seven years. In Proceedings of the IEEE International Conference on Big Data (Big Data), 2015.

[21] R. Rui and Y.-C. Tu. Fast Equi-Join Algorithms on GPUs: Design and Implementation. In Proceedings of the International Conference on Scientific and Statistical Database Management – SSDBM, 2017.

[22] F. M. Schuhknecht, J. Dittrich, and L. Linden. Adaptive Adaptive Indexing. In Proceedings of the IEEE International Conference on Data Engineering – ICDE, 2018.

[23] F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. Proceedings of the VLDB Endowment, 7(2):97–108, 2013.

[24] F. M. Schuhknecht, A. Jindal, and J. Dittrich. An experimental evaluation and analysis of database cracking. The VLDB Journal, 25(1):27–52, 2016.
