Examining single server database systems in the context of real-time sensor data

Christian Gorenflo

Abstract: In this course project for CS 848 – Main/In-Memory Database Systems, I examine the write and read performance of three different database systems. The benchmarks used are tailored to real-world requirements originating from the WeBike project.

I. INTRODUCTION

Big Data analysis is ubiquitous today. Especially with the advent of the Internet of Things and smart gadgets of all sorts, collecting data from a plethora of sensors is becoming commonplace. To avoid being left behind by big players like Google, Microsoft and Facebook, small research groups and enterprises need database solutions that can perform analytics on vast data sets without having to deploy multiple server farms.

This report is based on the work of the WeBike project [1][2], which uses sensor arrays to study technological improvements to increase the convenience of electric bicycles as car alternatives. The goal of this course project is to give a recommendation for research teams and small businesses that work with real-time data, usually provided as time series from one or multiple sensor arrays. The project aims to answer how Big Data analytics can be done on limited resources like a single server instance.

A. The WeBike project

Global warming will be a very real threat to modern society in the coming years. The damage done by severe weather conditions will raise government expenses dramatically. Additionally, the increase in the average global temperature will have dire consequences for crop harvests all over the world.

In light of the agreement at the 2015 United Nations Climate Change Conference to limit the warming "to well below 2 °C" [3] and the current increase of already about 1 °C [4], it is paramount to take action. It will probably not be possible to reach the goal by improving technology alone; a change in consumer behavior will also have to accompany it.

Climate change is mainly driven by the excessive emission of greenhouse gases like CO (carbon monoxide), CO2 (carbon dioxide) and CH4 (methane). The transportation sector is a major producer of these gases, as most vehicles are fuel-driven and emit (primarily) carbon oxides in the process.

Nowadays, many people use their car even for short trips that could easily be made by bike or even on foot. As long as the price of gas is low enough that these trips are not noticeable financially, convenience and comfort will far outweigh any regard for the environment. However, raising gas prices to a point where short trips hurt people financially would elevate cars to a luxury item and in turn lead to huge ramifications for society. On the other hand, changing the perception that a majority of the population has of the dangers of climate change before severe consequences actually occur (by which point it would be too late) is unrealistic.

Technological improvements can work, provided they don’t suffer from disadvantages in comparison to current solutions. Namely, they need to be at least as comfortable and convenient as established technology and be available at a comparable price if they don’t have any additional appeal (e.g., as a status symbol). However, if technology changes too drastically, consumers will mistrust it even if it offers the same benefits as known solutions. For example, electric cars have a high buy-in cost and a shorter range than traditional cars. Even though lower energy prices might make up for the price difference in the long run, the perceived financial benefit and the greater convenience due to the longer range of gas-fueled cars hamper the adoption of this new technology.

1) Electric bicycles: As stated before, many people shy away from using their bicycle even for short trips because of a perceived inconvenience compared to using their car. This can be partially mitigated by making bicycles more appealing. One solution is to equip bicycles with an electric motor to support the efforts of the rider. Even so, as for electric cars, skepticism towards this technology born from range anxiety exists in the general population. To study the concerns about and the adoption behaviour of electric bicycles, a project called WeBike [1][2] was started by the ISS4E group at the University of Waterloo in mid-2014.

Fig. 1: (a) Electric bicycle with battery and sensor kit. (b) The smart phone in the sensor kit attached to the battery connects to the University’s Wi-Fi network to upload the collected sensor data as a batch to the server database.

After a survey, a fleet of about 30 electric bicycles (e-bikes) was distributed among University faculty and students who participated in the study. These e-bikes are equipped with a battery that has the capacity to support the rider for about 40 km and a sensor kit that is attached to the battery. This sensor kit consists mainly of a Samsung Galaxy S3 smart phone with its built-in sensors (GPS, clock, gyroscope, accelerometer, magnetometer) and additional sensors for measuring ambient temperature and charge/discharge current and voltage. The sensor kit is automatically charged directly from the battery.

The battery plus sensor kit can be removed from the e-bike and carried with the rider in order to charge it from a power supply. In order to conserve energy and therefore the supported range of the e-bike, the sensors were until recently activated only twice per minute for 2 seconds. This resolution is high enough to detect charging events and riding trips. From these data points, intermediate time intervals can be interpolated. Now the sensors are activated every second as soon as a trip or charging event is detected.

2) Architecture: The gathered data is enriched with the current time stamp and IMEI information of the phone and then saved to the smart phone’s internal storage space. Then, whenever study participants take their e-bikes to the University of Waterloo campus, the smart phone connects to the University’s Wi-Fi network to upload these sensor logs to the file system of a server, from where the raw data is inserted into a MySQL database.

This raw data is used to cut the time series into trip pieces, which are then stored in different tables per rider in order to get quick access to all data points belonging to a particular trip.

II. PROBLEM STATEMENT

Currently the phones on the e-bikes collect a batch of sensor logs spanning an hour before they send it to the server. Additionally, the insert job on the server that gathers those logs and writes them into the database runs only about once every 10 minutes. The trip detection algorithm that queries the database runs only every 30 minutes. Until recently, every bike took sensor readings only twice per minute. With currently 28 e-bikes collecting data, the read and write requirements were easily fulfilled by a MySQL database.

Several improvements to the WeBike project are being discussed:

• The software has been modified so that data is collected every second whenever an e-bike is in use.

• Instead of only Wi-Fi, the e-bikes could also use the phone network to transfer data, which opens up the possibility of continuously sending data to the server.

• The bike pool might be increased to about a thousand electric bicycles.

• Real time analyses could be put into place to allow for increased functionality. Participants could get always up-to-date dashboards, and researchers would be able to detect defects early and would always have a current overview of the project.

All of these points combined lead to a vast increase in load on the database system holding all the data. To test this scenario, per-second data transfer from a thousand e-bikes is simulated in this work. The following requirements are therefore defined:

• The system must be deployed on a single server with RAM size >50 GB but ≪1 TB.

• Inserts of one row of data from each of the 1000 e-bikes must complete in under 1 second.

• Queries can be categorized into two groups: queries for current data (of the day) and queries for historical data that spans a longer period of time in the past. Preferably both types, but at least the queries for current data, need to execute in under 1 second.

• Queries must adhere to the previous requirement even under concurrency with inserts.

A last requirement stems from a lack of major funding: the examined databases are all either open source or free under an academic license.

III. RELATED WORK

kdb+ is mostly applied in the financial sector and is mentioned by name in very few papers. However, Chan et al. [5] compared the performance of their ATLAS visualization platform, which is based on a kdb+ database, favorably with a MySQL database. Closer to this work, both Ilic et al. [6] and Pungila et al. [7] compare a MySQL database and a MonetDB database for their performance while writing and analyzing smart meter sensor data. This work targets the same area of research by working on time series sensor data.

What sets this work apart is that kdb+ has never before been tested with continually incoming sensor data time series, that the amount of processed data exceeds that of the previously mentioned related work by a factor of roughly 10,000, and that the requirements on the performance of the databases are much higher. Lastly, none of the previous work tested the systems in a concurrency scenario where databases have to simultaneously write and analyze the stored data.

IV. SETUP DESCRIPTION

The database systems and corresponding benchmarks were run on a single server that is equipped with two Intel Xeon E5620 processors @ 2.4 GHz and 100 GB of RAM. Each processor consists of 4 cores and, due to Hyper-Threading, 16 threads are available in total. The MySQL database was stored on an SSD, while MonetDB and kdb+ ran on traditional hard drives because of size restrictions of the SSD.

[Figure 2: bar chart, aggregated table size in GB. Single table: MySQL 112.62, MonetDB 110.31, kdb+ 109.28. Separate tables per IMEI: MySQL 145.81, MonetDB 109.27, kdb+ 128.95.]

Fig. 2: Aggregated table sizes for the different database versions with one week of sensor data

Each database was set up in two different versions:

1) A single table, gathering the collective sensor data of all simulated e-bikes

2) A separate table for each e-bike (split by the unique IMEI of the phone transferring the data)

The latter results in a thousand tables with the same schema but builds a filter on the specific bike directly into the database structure. Since queries concerning multiple e-bikes happen only on rare occasions, the structural overhead might be compensated by increased execution speed for common queries.

In order to set up the historical queries, a week’s worth of sensor data from a thousand simulated e-bikes was created. This means that the number of rows inserted into each database amounted to:

#rows = #bikes × #seconds per week = 1000 × 604,800 = 604,800,000

Both MySQL and MonetDB had to do this by bulk insert from file, while kdb+ was able to do it directly thanks to the vectorization abilities of its q language. Fig. 2 shows the size of the different databases. While the single-table versions reach similar sizes across the three systems (with MySQL having only a small, almost negligible overhead compared to the other two), there is obvious overhead for the separate-table versions with MySQL and kdb+. The size of a single row amounts to roughly 200 bytes.
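The row count and the order of magnitude of the table sizes can be cross-checked with a few lines of arithmetic. The following Python sketch is only illustrative: the 200-byte row size is the rough figure quoted above, decimal gigabytes are an assumption, and the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_WEEK = 7


def weekly_volume(bikes: int, bytes_per_row: int) -> tuple[int, float]:
    """Return (rows per week, size in GB) for one row per bike per second."""
    rows = bikes * SECONDS_PER_DAY * DAYS_PER_WEEK
    size_gb = rows * bytes_per_row / 1e9  # decimal gigabytes (assumption)
    return rows, size_gb


rows, size_gb = weekly_volume(bikes=1000, bytes_per_row=200)
print(f"{rows:,} rows, about {size_gb:.0f} GB")
```

The estimate lands in the same range as the table sizes reported in Fig. 2, which suggests that the raw rows account for most of the stored volume.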

A. MySQL

MySQL is a traditional row-based RDBMS. It is used for the current instance of the WeBike database. It supports a version of the SQL language and its APIs are usable with all major programming languages. In this work a Python connector [8] was used for data creation and benchmarking. By default, MySQL mostly works on data stored on disk. In order to speed it up and make it more comparable to the other two systems, its buffer size was increased from the default of 128 MB to 50 GB. Indexes were created for the Timestamp column and, in the case of the single table, for the IMEI column.

All interactions with the database were done via Python scripts, so that transactions could be timed. The historical data was written to a CSV file and then imported with a LOAD DATA INFILE file INTO TABLE table statement. It is recommended to use the file import for large data sets [9], and tests showed that no other approach was feasible for over half a billion rows. The simulated sensor data inflow was inserted into the single table with a bulk INSERT INTO table VALUES values statement and into the separate IMEI tables with separate INSERT INTO table VALUE value statements, combined into a single transaction, all timed to trigger every second. The execution timing was done with time.perf_counter() from the Python package time.
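The two insert patterns and their timing can be sketched as follows. This is a minimal sketch under stated assumptions, not the benchmark code of this work: SQLite from the Python standard library stands in for MySQL so that it runs anywhere, and the table names, column names and row contents are hypothetical. The measured times are therefore not comparable to the results reported here.

```python
import sqlite3
import time

N_BIKES = 1000  # number of simulated e-bikes, as in the benchmark


def make_rows(second: int) -> list[tuple[str, int, float]]:
    """One hypothetical reading (imei, timestamp, discharge current) per bike."""
    return [(f"imei{i:04d}", second, 0.0) for i in range(N_BIKES)]


def timed(fn) -> float:
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
schema = "(imei TEXT, ts INTEGER, discharge_current REAL)"
conn.execute(f"CREATE TABLE sensor {schema}")
for i in range(N_BIKES):
    conn.execute(f"CREATE TABLE sensor_{i:04d} {schema}")


def insert_single_table(rows):
    # One bulk insert covering all bikes, committed as one transaction.
    with conn:
        conn.executemany("INSERT INTO sensor VALUES (?, ?, ?)", rows)


def insert_separate_tables(rows):
    # One statement per per-bike table, still combined into one transaction.
    with conn:
        for i, row in enumerate(rows):
            conn.execute(f"INSERT INTO sensor_{i:04d} VALUES (?, ?, ?)", row)


rows = make_rows(second=0)
t_single = timed(lambda: insert_single_table(rows))
t_separate = timed(lambda: insert_separate_tables(rows))
print(f"single table: {t_single:.4f} s, separate tables: {t_separate:.4f} s")
```

Wrapping the whole call in one timer, as done here, measures the time until the transaction is committed, which is the quantity that has to stay below the 1-second budget.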

B. MonetDB

As one of the earliest column store databases [10], MonetDB has reached a level of maturity that many newer column stores still lack. Like MySQL, it understands SQL and has a Python connector [11]. In contrast to MySQL, however, it is optimized for in-memory processing [12]. Again, commands to create indexes for Timestamp and IMEI were executed. However, MonetDB treats index creation commands merely as suggestions and can disregard them in favor of its own internal indexing [13].

The creation of the historical data and the real-time simulation were done the same way as for MySQL, with small variations to the syntax (use COPY INTO table FROM file for bulk insert from file).
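Both bulk-load statements read from a delimited text file, so the simulated history first has to be written out in that form. The sketch below shows one way this could look in Python; the column layout, value ranges and coordinates are made up for illustration and are not the schema used in this work.

```python
import csv
import io
import random


def write_simulated_csv(out, n_bikes: int, n_seconds: int, seed: int = 0) -> int:
    """Write one row per bike per second to a file-like object; return row count."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    count = 0
    for second in range(n_seconds):
        for bike in range(n_bikes):
            imei = f"{bike:015d}"                    # placeholder 15-digit identifier
            lat = 43.47 + rng.uniform(-0.05, 0.05)   # made-up coordinates
            lon = -80.54 + rng.uniform(-0.05, 0.05)
            current = rng.uniform(0.0, 10.0)         # made-up discharge current
            writer.writerow(
                [imei, second, f"{lat:.6f}", f"{lon:.6f}", f"{current:.3f}"]
            )
            count += 1
    return count


# Small in-memory demonstration; a real run would pass an open file instead.
buffer = io.StringIO()
n_rows = write_simulated_csv(buffer, n_bikes=3, n_seconds=2)
print(n_rows, "rows")
print(buffer.getvalue().splitlines()[0])
```

Writing row by row keeps memory use constant, which matters when the same routine is scaled to a thousand bikes and a full week.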

C. kdb+

In addition to being a column store and capable of both in-memory querying for real-time data and on-disk querying for historical data, kdb+ is optimized for time-series processing [14]. Its built-in proprietary language q is able to execute vectorized operations on large data sets. Those operations are easily parallelizable and can take full advantage of a multiprocessor setup [15].

Fig. 3: Recommended publish/subscribe schema for feeding data to a kdb+ database

1) Real time and historical databases: kdb+ differentiates between real time databases (rdb), which are usually held completely in memory, and the commonly much larger historical databases (hdb), which are stored on disk. Depending on their size, tables in hdbs can be stored differently:

1) Small tables can be stored in a single file per table.

2) Large tables (the table is bigger than main memory) can be stored in one file per column, so that columns can be loaded independently. These are called splayed tables.

3) Enormous tables (a single column does not fit into main memory) can be further partitioned horizontally, either by date (yearly, monthly or daily) or simply by an arbitrary integer. These are called partitioned tables.

When the database is loaded by a process, splayed and partitioned tables are automatically recognized and can be interacted with just like normal single file tables. In this work an rdb with one or multiple in-memory tables is used for the incoming sensor data of the current day, and an hdb with one or multiple partitioned tables with a daily partition domain is used for the stored data of the simulated past week. In the rdb, the kdb+ equivalent of an index was used for IMEI, and the hdb stores the data sorted by IMEI (indexing is not supported for on-disk tables). Since there is no Date column in the rdb before the data gets stored in the hdb and the historical data is already partitioned by date, no further modifications were made to the timestamp.
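The idea behind a daily partition domain can be illustrated with a toy structure in Python. This is not how kdb+ stores data on disk; it is a sketch that only shows why a query for a single date never has to touch the other days. The class name, its methods and the example dates are hypothetical.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy date-partitioned table: one list of rows per calendar day."""

    def __init__(self):
        self.partitions = defaultdict(list)  # date -> rows (imei, second, value)

    def append(self, day: date, imei: str, second: int, value: float) -> None:
        self.partitions[day].append((imei, second, value))

    def close_day(self, day: date) -> None:
        # Mimics storing the historical data sorted by IMEI.
        self.partitions[day].sort(key=lambda row: row[0])

    def select(self, imei: str, days: list[date]) -> list[tuple]:
        # Only the requested partitions are scanned; all others stay untouched.
        return [
            row
            for day in days
            for row in self.partitions.get(day, [])
            if row[0] == imei
        ]


table = PartitionedTable()
d1, d2 = date(2016, 1, 1), date(2016, 1, 2)
table.append(d1, "B", 0, 1.0)
table.append(d1, "A", 0, 2.0)
table.append(d2, "A", 0, 3.0)
table.close_day(d1)
print(table.select("A", [d1]))
```

A weekly query simply passes seven dates to `select`, so its cost grows with the number of partitions it names rather than with the total size of the history.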

2) Publish/subscribe schema: kdb+ favors a setup with multiple processes, and it is recommended to use a publish/subscribe model for feed-like processes such as the e-bike sensor data simulation in this work [16]. The setup is shown in fig. 3. One or more external feeds send data to an interfacing process called a tickerplant. It might clean the data or trigger different processes. The tickerplant then pushes the data to subscribing processes (which could also subscribe to just a subset of the feed) like the real time database. At the end of the day, the tickerplant sends a signal to subscribers to carry out their end-of-day operations (like flushing the data to the historical database).
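The message flow described above can be sketched in a single Python process. Real kdb+ setups run the tickerplant and its subscribers as separate q processes; the classes and method names below are hypothetical and only mirror the roles, not any kdb+ API.

```python
from typing import Callable, Optional


class Tickerplant:
    """Toy publish/subscribe hub: feeds publish rows, subscribers receive them."""

    def __init__(self):
        self._subscribers = []  # (on_row, on_end_of_day, optional IMEI filter)

    def subscribe(self, on_row: Callable, on_end_of_day: Callable,
                  imeis: Optional[set] = None) -> None:
        # imeis=None subscribes to the whole feed, a set to that subset only.
        self._subscribers.append((on_row, on_end_of_day, imeis))

    def publish(self, row: dict) -> None:
        for on_row, _, imeis in self._subscribers:
            if imeis is None or row["imei"] in imeis:
                on_row(row)

    def end_of_day(self) -> None:
        for _, on_end_of_day, _ in self._subscribers:
            on_end_of_day()


class RealTimeDB:
    """In-memory table that hands its rows to a 'historical' list at end of day."""

    def __init__(self):
        self.today = []
        self.historical = []

    def on_row(self, row: dict) -> None:
        self.today.append(row)

    def on_end_of_day(self) -> None:
        self.historical.extend(self.today)
        self.today.clear()


tp = Tickerplant()
rdb = RealTimeDB()
tp.subscribe(rdb.on_row, rdb.on_end_of_day)
tp.publish({"imei": "A", "ts": 0})
tp.publish({"imei": "B", "ts": 0})
tp.end_of_day()
print(len(rdb.today), len(rdb.historical))
```

Because feeds only ever talk to the tickerplant, further subscribers (for example a dashboard that follows a subset of IMEIs) can be added without changing the feed or the real time database.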

V. QUERIES & BENCHMARKS

The purpose of this work is to compare the performance of three different database systems in a real world environment. The requirements of the WeBike project have to be met, which means near real time write speeds for a large number of data feeds and near real time execution of at least the most common queries. As far as technically possible, both insert speeds and queries were tested on two different experimental setups. The benchmarks were set up to answer the question of whether splitting a single table for the complete data set into separate tables for each IMEI would increase overall performance.

The chosen queries are detailed in Tab. I. Usually, the GPS sensor data is the most important and most relevant for real time analysis in the WeBike project. Therefore the GPS data is always queried, while the query constraints are varied. A battery discharge current above a 50% threshold means that a rider is using the bike; this is therefore an interesting constraint to look into for trip detection.

All queries were benchmarked both against a static database and in a concurrency setup, where the real time feed would insert new data points for each bike every second.

In order to get an upper bound for the daily queries, the database was first filled with the complete data for the day, and the simulated current data was then inserted in addition to that.

TABLE I: Benchmarked queries. Every query was run with and without concurrent inserts.

Qd: Select GPS (lat & long) data for a specific IMEI for a single date.

Qdc: Qd with the constraint of the discharge current value being above a 50% threshold.

Qw: Qd for a whole week’s worth of data instead of a single day.

Qwc: Qdc for a whole week’s worth of data instead of a single day.
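The four queries of Tab. I could be phrased in SQL roughly as shown below. The sketch uses SQLite from the Python standard library as a stand-in so that it is runnable; the table name, the column names, the text-valued day column (instead of a timestamp) and the numeric threshold are all assumptions made for illustration, not the schema of the WeBike database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor ("
    "imei TEXT, day TEXT, lat REAL, lon REAL, discharge_current REAL)"
)
conn.executemany(
    "INSERT INTO sensor VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "2016-01-01", 43.1, -80.1, 0.2),
        ("A", "2016-01-01", 43.2, -80.2, 0.8),
        ("A", "2016-01-03", 43.3, -80.3, 0.9),
        ("B", "2016-01-01", 43.4, -80.4, 0.9),
    ],
)

THRESHOLD = 0.5  # hypothetical stand-in for the 50% discharge-current threshold

BASE = "SELECT lat, lon FROM sensor WHERE imei = ? AND "
QUERIES = {
    "Qd": BASE + "day = ?",
    "Qdc": BASE + "day = ? AND discharge_current > ?",
    "Qw": BASE + "day BETWEEN ? AND ?",
    "Qwc": BASE + "day BETWEEN ? AND ? AND discharge_current > ?",
}

day, week = ("2016-01-01",), ("2016-01-01", "2016-01-07")
qd = conn.execute(QUERIES["Qd"], ("A",) + day).fetchall()
qdc = conn.execute(QUERIES["Qdc"], ("A",) + day + (THRESHOLD,)).fetchall()
qw = conn.execute(QUERIES["Qw"], ("A",) + week).fetchall()
qwc = conn.execute(QUERIES["Qwc"], ("A",) + week + (THRESHOLD,)).fetchall()
print(len(qd), len(qdc), len(qw), len(qwc))
```

In the separate-table version the `imei = ?` condition disappears from every query and is replaced by the choice of table, which is the filter "built into the database structure" mentioned in the setup description.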

[Figure 4: bar chart, execution time in s. Single table: MySQL 0.103, MonetDB 0.413, kdb+ 0.065. Separate tables per IMEI: MySQL 7.6, MonetDB 3.5; no value is shown for kdb+.]

Fig. 4: Inserting one row for each e-bike. 1000-row insert for single tables, 1-row insert into 1000 tables for separate tables. Repeated 1000 times.

VI. ANALYSIS

All benchmarks were taken after a fresh start of the database system and a warm up period to allow the loading of data into memory. Therefore the benchmarks should represent a real world scenario, where the databases are repeatedly queried by the same processes to provide real time updates.

A. Inserting Data

The first benchmark to look into is shown in fig. 4. It simply demonstrates the feasibility of the real time feed for a given scenario. For better comparability (both other databases persist inserted data), the timing for the kdb+ database takes both inserting into the rdb and flushing to the hdb on disk into account. It is therefore expected that kdb+ will be much faster without the continuous saving to disk. All insert commands were repeated 1000 times to guarantee statistical relevance. All three systems are able to stay within the 1-second requirement of the project when they insert into a single sensor data table. The timings have little to no variance between repeats. kdb+ manages the inserts slightly faster than MySQL, while MonetDB is more than four times slower.

The separate table scenario demonstrates how slow multiple small writes to disk are compared to a single bigger one. All systems fail this test by a wide margin, and the benchmark for kdb+ was not even able to complete. However, for kdb+ this does not reflect the real world scenario, in which it only has to store the data at the end of the day. In this case, kdb+ stays within the limit.

[Figure 5: bar charts, execution time in s for Qd, Qdc, Qw, Qwc. (a) Single table: 3.81, 2.05, 8.31, 4.24; separate tables per IMEI: 2.54, 1.29, 8.4, 4.18. (b) Single table: 12.09, 4.43, 8.4, 4.28; separate tables per IMEI: 2.59, 1.29, 8.37, 4.21.]

Fig. 5: Query execution on a MySQL database. 5a without concurrency, 5b with concurrency. Queries executed for 10 different IMEIs, then cycle repeated once.

B. Queries – MySQL

Fig. 5 shows the results of the query benchmarking on the MySQL database. Due to the long execution time, the queries were only repeated 20 times. In fig. 5a it can be seen that even without concurrent inserts, the only query that comes close to the goal of per-second updates is Qdc on separate IMEI tables. Furthermore, once the query result reaches a certain size, the partition into separate tables no longer boosts the execution speed. Interestingly, the only non-negligible variance in execution time occurs for queries of the daily data on the single table.

[Figure 6: line chart, execution time in s (vertical axis 0 to 800) over a horizontal axis from 0 to 20, with one series without concurrency and one with concurrency.]

Fig. 6: Warm up period for a MySQL database with and without concurrency. Qd executed for 10 different IMEIs, then cycle repeated once.

This behavior repeats with concurrency (fig. 5b). Concurrency affects the daily data queries the most. Given that data for the current day is constantly inserted and has to be indexed, this result seems reasonable. For the queries on the complete data set the index of the Timestamp column plays no role, so the result changes only slightly compared to the no-concurrency scenario. Similarly, for the separate table database the inserts have hardly any effect. This has two reasons: first, the insert (as seen in fig. 4) completes only about once every eight seconds; second, in every table, only one new data point has to be indexed.

By far the biggest impact concurrency has on a MySQL database is on the shape of its warm up period (fig. 6). Here 10 different IMEIs are queried for the daily data without constraints on the discharge current (Qd) and then the queries are repeated. The complete first pass of queries under concurrency takes more than four times longer to complete than without concurrent insertions. The first execution alone runs for more than 10 minutes. Only when the queries are repeated for the same IMEIs do the query execution times drop to levels comparable with non-concurrent query execution.
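The measurement pattern behind this warm up curve (time every query of a pass, then repeat the whole pass) can be sketched as a small harness. The sketch below is an assumption about how such a harness might look, not the benchmark script of this work; the cached `fake_query` only imitates a cold first pass and a warm second pass and does not touch a database.

```python
import time
from typing import Callable, Sequence


def run_cycles(query: Callable[[str], object], imeis: Sequence[str],
               cycles: int) -> list[list[float]]:
    """Time query(imei) for every IMEI, repeating the whole pass `cycles` times."""
    timings = []
    for _ in range(cycles):
        pass_times = []
        for imei in imeis:
            start = time.perf_counter()
            query(imei)
            pass_times.append(time.perf_counter() - start)
        timings.append(pass_times)
    return timings


# Fake query with a cache: the first call per IMEI is slow, later calls are fast.
cache = {}


def fake_query(imei: str) -> int:
    if imei not in cache:
        cache[imei] = sum(range(200_000))  # pretend this is the slow disk read
    return cache[imei]


imeis = [f"imei{i}" for i in range(10)]
first_pass, second_pass = run_cycles(fake_query, imeis, cycles=2)
print(f"first pass {sum(first_pass):.4f} s, second pass {sum(second_pass):.4f} s")
```

Keeping the per-query timings of each pass separate, rather than averaging them, is what makes a warm up effect like the one in fig. 6 visible at all.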

[Figure 7: bar chart, execution time in ms for Qd, Qdc, Qw, Qwc. Single table: 420, 436, 249, 390; separate tables per IMEI: 75, 83, 7, 5.]

Fig. 7: Query execution on a MonetDB database without concurrency. Query executed for 100 different IMEIs, then cycle repeated ten times.

C. Queries – MonetDB

The four queries executed on a MonetDB database without concurrency (fig. 7) perform about an order of magnitude better than on a MySQL database for the same scenario (notice the millisecond scale instead of a second scale). All queries for all table types are executed well below the one second requirement, with the separate table version significantly speeding up the commands. However, as seen before in fig. 4, this is not a viable overall strategy.

Somewhat surprisingly, the more constrained queries performed more poorly than the queries requesting the complete time period. An explanation for this behavior could be that MonetDB ignored the index on the Timestamp column and therefore queries constraining the date take longer to execute.

Even though MonetDB employs optimistic concurrency control and concurrency in this work consists solely of insert and select statements without delete or update, the benchmark constantly ran into concurrency issues and crashed. This means that a more sophisticated queuing system needs to be put in place in the application if a MonetDB database is to be used as a continuous time series database. This is beyond the scope of this work, and therefore further concurrency benchmark results for MonetDB are omitted.
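One simple form such an application-side queue could take is a single worker thread that executes all statements in submission order, so that the database never sees two statements at once. The following Python sketch illustrates the idea only; it was not part of this work, it uses SQLite as a stand-in rather than a MonetDB connection, and the class and table names are hypothetical.

```python
import queue
import sqlite3
import threading


class QueryQueue:
    """Runs all statements on one worker thread, in submission order."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # The connection is created and used only inside the worker thread.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE sensor (imei TEXT, ts INTEGER)")
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: stop the worker
                break
            sql, params, result = job
            try:
                result.put(conn.execute(sql, params).fetchall())
            except sqlite3.Error as exc:
                result.put(exc)  # hand the error back to the caller
        conn.close()

    def submit(self, sql: str, params: tuple = ()) -> list:
        """Enqueue a statement and block until the worker has executed it."""
        result = queue.Queue(maxsize=1)
        self._jobs.put((sql, params, result))
        outcome = result.get()
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    def close(self):
        self._jobs.put(None)
        self._worker.join()


qq = QueryQueue()
writers = [
    threading.Thread(
        target=qq.submit,
        args=("INSERT INTO sensor VALUES (?, ?)", (f"imei{i}", 0)),
    )
    for i in range(20)
]
for t in writers:
    t.start()
for t in writers:
    t.join()
count = qq.submit("SELECT COUNT(*) FROM sensor")[0][0]
qq.close()
print(count)
```

The price of this design is that selects wait behind pending inserts, so it trades true concurrency for the absence of conflicts.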

[Figure 8: bar charts, execution time in ms, two bar series per chart (no legend survives). (a) Qd, Qdc, Qw, Qwc; bar labels in extraction order: 0.77, 0.12, 7.58, 2.36, 0.55, 0.06, 5.72, 1.77. (b) Qd, Qdc; bar labels in extraction order: 763, 126, 535, 66.]

Fig. 8: Query execution on a kdb+ database. 8a without concurrency, 8b with concurrency. Queries executed for all 1000 different IMEIs in 100 cycles. Queries Qw and Qwc omitted in 8b as the results are identical to 8a.

D. Queries – kdb+

The results from the kdb+ benchmarks are on a different scale entirely. kdb+ beats the other two database systems by three to four orders of magnitude. Due to the fast execution time, a bigger query statistic could be accumulated. In both the case without concurrency (fig. 8a) and with concurrency (fig. 8b), the queries were executed for each of the 1000 IMEIs and repeated 100 times each. While querying the historical data set takes about 10 times longer to complete, the execution time still remains below 10 ms per query. The gains from splitting the table by IMEI are irrelevant at this execution speed, especially considering the price paid when inserting data. Although in a real world scenario the inserts would not be persisted directly after each row insert, the added complexity outweighs the benefits. Moreover, in a separate table version, the database has to be modified every time a new e-bike joins the project, whereas a single table is completely independent of the number of sensor feeds.

The performance for single day queries takes a big hit from concurrent inserts, as shown in fig. 8b. Still, the execution time lies well within the requirements. The execution times for historical data under concurrency are omitted, since concurrency only occurs in the rdb and queries on older data are completely unaffected as long as the system resources are not under full load.

An additional insight can be gained by comparing the execution times of Qd and Qdc in both charts. Qdc performs significantly better, even though the constraint is imposed on a column that is not indexed. Due to kdb+’s column store nature and q’s vectorization abilities, the discharge current column is essentially a vast integer vector on which a comparison with a scalar can be performed cheaply.
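The access pattern behind this observation can be illustrated with a column-oriented layout in Python. The sketch only mimics the idea: a Python loop is of course not vectorized the way a q primitive is, and the column names and values are made up.

```python
from array import array

# Column-oriented layout: one typed vector per column (values are made up).
columns = {
    "lat": array("d", [43.1, 43.2, 43.3, 43.4]),
    "lon": array("d", [-80.1, -80.2, -80.3, -80.4]),
    "discharge_current": array("i", [10, 70, 55, 20]),
}


def select_gps_above(cols: dict, threshold: int) -> list[tuple[float, float]]:
    """Compare one column against a scalar, then gather lat/lon for the hits."""
    current = cols["discharge_current"]
    # Only the constrained column is scanned to find the matching positions.
    hits = [i for i, value in enumerate(current) if value > threshold]
    # The GPS columns are touched only at those positions.
    return [(cols["lat"][i], cols["lon"][i]) for i in hits]


print(select_gps_above(columns, threshold=50))
```

Since the constraint shrinks the set of positions before the GPS columns are read, a constrained query can end up cheaper than the unconstrained one, which is consistent with Qdc outperforming Qd.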

VII. LIMITATIONS & POSSIBLE FUTURE WORK

In the scope of this work only easily accessible optimizations were performed:

• Indexes were set when possible (see IV-B)

• Buffer size was increased for MySQL to take advantage of the large amount of RAM available

• kdb+ tables were partitioned by date and processes started with 8 slaves so they could run on multiple threads in parallel

However, there are possibly still numerous other ways to slightly increase the performance of the systems. Also, a construct like a query queue could allow quasi-concurrent inserts and selects for MonetDB. Given the clear lead of the kdb+ database system, these possible improvements are of little practical interest.
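The query queue idea can be sketched as follows. This is a minimal illustration under stated assumptions, not something that was implemented or measured in this work: a plain Python list stands in for the database connection, and the names `QueryQueue`, `insert` and `select_count` are hypothetical. All operations are funneled through one worker thread, so clients can submit inserts and selects at any time while the database only ever sees one statement at a time.

```python
import queue
import threading
from concurrent.futures import Future


class QueryQueue:
    """Serializes operations through a single worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._tasks.get()
            if item is None:  # sentinel: stop the worker
                break
            func, args, future = item
            try:
                future.set_result(func(*args))
            except Exception as exc:  # hand the error back to the caller
                future.set_exception(exc)

    def submit(self, func, *args):
        """Enqueue an operation and return a Future for its result."""
        future = Future()
        self._tasks.put((func, args, future))
        return future

    def close(self):
        self._tasks.put(None)
        self._worker.join()


# Stand-in for a database connection (assumption): a plain list of rows.
table = []


def insert(row):
    table.append(row)


def select_count():
    return len(table)


qq = QueryQueue()
for i in range(100):
    qq.submit(insert, (i, i * 10))
# The select is queued behind the inserts and therefore sees all of them.
count = qq.submit(select_count).result()
qq.close()
```

Because the queue is first-in, first-out, a select always observes every insert submitted before it. The price is that reads wait behind pending writes.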

Future work that goes beyond the requirements of the WeBike project could test more complex queries and additional functionality of the kdb+ system, like MapReduce, against other platforms that support this natively, such as Hadoop and Spark.

VIII. CONCLUSION

In this work three different database systems with varying degrees of in-memory optimization were examined for suitability as the database back-end for time series sensor data. It has been shown that it is detrimental to overall performance if tables are split manually into unrelated horizontal slices, when these slices follow the same table schema and are used roughly evenly. In contrast to that, kdb+ clearly achieves a performance boost by partitioning by date, since older partitions are not written to anymore. The research question was driven by a specific need to adhere to the requirements of an expansion of the WeBike project, but the results can be generalized.
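Why date partitioning helps can be sketched in a few lines of Python. This is an illustration only and does not reflect kdb+ internals; the class name `DatePartitionedTable`, the dates and the values are assumptions made up for the example. New rows only ever land in the partition of their own day, so older partitions stay untouched and a single day query scans only one partition.

```python
from collections import defaultdict
from datetime import date


class DatePartitionedTable:
    """Keeps one list of rows per calendar day."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, day, imei, value):
        # New rows only ever land in the partition of their own day.
        self._partitions[day].append((imei, value))

    def query_day(self, day, imei):
        # Only the requested day's partition is scanned.
        return [v for i, v in self._partitions.get(day, []) if i == imei]

    def partition_sizes(self):
        return {day: len(rows) for day, rows in self._partitions.items()}


table = DatePartitionedTable()
table.insert(date(2016, 4, 1), 1001, 12)
table.insert(date(2016, 4, 1), 1002, 7)
table.insert(date(2016, 4, 2), 1001, 15)

# Scans the partition for 2 April only; the older partition is not read.
result = table.query_day(date(2016, 4, 2), 1001)
```

In contrast, splitting by IMEI produces slices that are all written to continuously, which is the case the benchmarks found detrimental.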

Whenever the data is organized in a time series and must be written and read continuously in large amounts, kdb+ is by far the best choice, provided the data does not have to be persisted to a hard drive instantly. Due to the vectorization capabilities of the proprietary q language, it hardly matters what type of column is constrained by the query and whether there are any index equivalents in place.

Extending previous work, it could be shown that in-memory column stores, and specifically kdb+, can handle Big (Sensor) Data on server configurations that are realistically accessible for small research groups and enterprises.

REFERENCES

[1] Tommy Carpenter. Measuring & Mitigating Electric Vehicle Adoption Barriers. PhD thesis, University of Waterloo, 2015.

[2] ISS4E group. WeBike web page. http://blizzard.cs.uwaterloo.ca/iss4e/?page_id=3661, 2015.

[3] United Nations. Adoption of the Paris Agreement. In 2015 United Nations Climate Change Conference, 2015.

[4] Stocker, T.F., D. Qin, G.-K. Plattner, M. Tignor, S.K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, and P.M. Midgley (eds.). Summary for policymakers. In: Climate change 2013: The physical science basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. In IPCC. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 2013.

[5] Sye-Min Chan, Ling Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while exploring massive time series. In Visual Analytics Science and Technology, 2008. VAST ’08. IEEE Symposium on, pages 59–66, Oct 2008.

[6] D. Ilic, S. Karnouskos, and M. Wilhelm. A comparative analysis of smart metering data aggregation performance. In 2013 11th IEEE International Conference on Industrial Informatics (INDIN), pages 434–439, July 2013.

[7] Ciprian Pungila, Teodor-Florin Fortis, and Ovidiu Aritoni. Benchmarking database systems for the requirements of sensor readings. IETE Technical Review, 26(5):342–349, 2009.

[8] PyMySQL Python package. https://github.com/PyMySQL/PyMySQL, 2016.

[9] Speed of insert statements. https://dev.mysql.com/doc/refman/5.7/en/insert-speed.html, 2016.

[10] MonetDB history. https://www.monetdb.org/AboutUs, 2016.

[11] https://github.com/gijzelaerr/pymonetdb, 2016.

[12] https://www.monetdb.org/content/column-store-features, 2016.

[13] https://www.monetdb.org/Documentation/Manuals/SQLreference/Indices, 2016.

[14] https://kx.com/real-time-in-memory-analytics.php, 2016.

[15] https://kx.com/time-series-database.php, 2016.

[16] q For Mortals Version 3: An Introduction to q Programming. q4m LLC, 3rd edition, 2015.