Università degli Studi di Cagliari
Facoltà di Scienze Matematiche Fisiche e Naturali
Corso di Laurea in Informatica
Laurea Magistrale
SparkSQL vs RDBMS Database Query Benchmark
Candidate: Carlo Corona (matr. 65009)
Supervisor: Prof. Diego Reforgiato Recupero
Coordinator: Prof. G. Michele Pinna
Academic Year 2016/2017
Abstract
In the near future everything will be connected to the network: people, things,
machines and operating processes will daily contribute to a permanent channel
between the real world and the virtual dimensions enabled by the Internet.
The amount of data generated by these connections will be enormous.
Big Data, their analysis and exploitation will enable the birth of a new society
and a new economy based on the value of digital data: the Data-Driven Society.
The term “Big Data” tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data analytics methods that
extract value from data, and seldom to a particular size of data set.
Analysis of datasets can find new correlations to business trends, prevent
diseases, combat crime and so on.
Scientists, business executives, practitioners of medicine, advertising and
governments regularly meet difficulties with large datasets in areas including
Internet search, fintech, urban informatics, and business informatics.
Scientists encounter limitations in e-Science work, including meteorology,
genomics, connectomics, complex physics simulations, biology and environmental
research.
Relational database management systems (RDBMS) and desktop statistics
and visualization packages often have difficulty handling big data.
The work may require massively parallel software running on tens, hundreds,
or even thousands of servers.
Contents
1 Introduction
  1.1 Argument of the thesis
  1.2 Context of the thesis
  1.3 Purpose of thesis
2 Big Data
3 RDBMS Database
  3.1 Oracle
  3.2 MySQL
  3.3 PostgreSQL
4 Map Reduce
5 Apache Spark
  5.1 Spark Core
  5.2 Spark SQL
  5.3 Spark Streaming
  5.4 MLlib Machine Learning Library
  5.5 GraphX
  5.6 Cluster Managers
  5.7 Spark Architecture
    5.7.1 The Driver
    5.7.2 The Executor
    5.7.3 The Cluster Manager
6 Test Environment
  6.1 Hardware Requirement
  6.2 Software Requirement
  6.3 Query List
    6.3.1 OnTime1/OnTime2
    6.3.2 Unica1/Unica2
  6.4 Test Results
7 Conclusions
Appendices
Bibliography
Chapter 1
Introduction
1.1 Argument of the thesis
Some queries have been selected and executed on two datasets containing
data from United States air traffic and from the University of Cagliari's database.
1.2 Context of the thesis
The optimization and processing speed of large amounts of data will attract
the attention of several sectors with totally different purposes: study, research,
commerce, security and so on.
This has stimulated the development of new, highly sophisticated methodologies,
algorithms and data processing tools.
1.3 Purpose of thesis
The purpose of this thesis was to measure the speed of data extraction between
an Apache Spark engine and three relational databases (RDBMS).
Chapter 2
Big Data
Figure 2.1:
The term Big Data indicates a data collection so large that it is difficult
or impossible to store in a traditional database system such as an
RDBMS (Relational Database Management System). Although it does not
refer to a particular quantity, it is commonly used for quantities that are
at least as large as a terabyte, i.e. when data can no longer be stored or
processed by a single machine.
Big Data has many features that differentiate it from traditional data
collections. The most important is Volume, that is, the amount of data that
must be stored.
Another Big Data feature is Variety: data can come from different sources
and in different forms, for example they can be structured, semi-structured or
unstructured. Think about the text of a tweet, pictures or data from sensors:
they obviously correspond to different types of data, which means that their
integration requires special efforts.
Unstructured data cannot be stored in an RDBMS; it is stored in
NoSQL databases, which are more appropriate for managing data variability.
RDBMSs require the database structure to be fixed before use, so that it
remains unchanged.
An increasing percentage of the population has Internet access and a Smart-
phone, and there is an explosion of sensors due to the emerging Internet Of
Things. For this reason a great amount of data must be stored quickly.
The third feature of Big Data is Speed, which indicates how quickly new
data become available. Technologies to control this aspect of Big Data are
called streaming data and complex event processing; they analyze data as
it arrives and answer questions like: "How many times was the word 'apple'
searched yesterday?"
The fourth feature, Variability, refers to data inconsistency, which
hampers the manipulation process and effective data management.
Complexity, the fifth and last Big Data feature, indicates that data
coming from different sources need to be linked to each other to obtain useful
information.
The need for high scalability and the necessity to store unstructured data
make traditional DBMSs unsuitable for storing Big Data. For this
reason, new systems now allow you to store non-relational data types, offering
horizontal scalability and, consequently, improved performance. This contrasts
with vertical scaling, i.e. assigning more resources to a single machine to
improve its overall performance.
Chapter 3
RDBMS Database
A Relational Database Management System (RDBMS) is a database management
system (DBMS) based on the relational model invented by Edgar F. Codd at
IBM's San Jose Research Laboratory. In 2017, many of the databases in
widespread use are based on the relational model.
RDBMSs have been a common choice for the storage of information in new
databases used for financial records, manufacturing and logistical information,
personnel data, and other applications since the 1980s. Relational databases
have often replaced legacy hierarchical databases and network databases be-
cause they are easier to understand and use.
A database is a collection of structured and logically related data. A database
consists of one or more tables, and each table is composed of records and fields.
Each table must contain a field that identifies each record uniquely: this field
is called the primary key. When designing a database, you start from the
definition of the tables that are part of it. For each table you define the
fields that represent the table structure. Then you set the relationships
between tables, which allow you to normalize the schema (breaking the fat
table, containing all the information, into leaner tables), avoiding redundancies,
achieving an adequate degree of efficiency and providing a check on errors
(insert, delete and update anomalies) by enforcing referential integrity.
Some examples of RDBMS databases are Oracle, MySQL and PostgreSQL.
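The design process described above can be sketched with an in-memory SQLite database (the table and column names are illustrative, not part of the test datasets): a "fat" table is split into two lean tables linked by a primary key, the foreign key enforces referential integrity, and a join recombines the data.

```python
import sqlite3

# In-memory database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The fat "orders with customer info" table is split into two lean
# tables; customer_id is the primary key referenced by orders.
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL)""")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL)""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")

# Referential integrity: an order for a missing customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 42, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# The normalized tables are recombined with a join.
row = conn.execute("""SELECT c.name, o.amount
                      FROM orders o JOIN customers c
                      ON o.customer_id = c.customer_id""").fetchone()
print(row)  # ('Ada', 99.5)
```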
Figure 3.1: RDBMS Database Architecture
3.1 Oracle
Oracle Database is one of the most popular database management systems
(DBMS).
Oracle Corporation, one of the largest software companies in the world, was
founded in 1977 by Lawrence J. Ellison (chief executive officer, Chief Technology
Officer and major shareholder), Bob Miner and Ed Oates, and is headquartered
in California.
The first publicly available version of Oracle Database dates back to 1979
and, since then, numerous changes and improvements have been introduced to
follow technology developments, up to version 12c R2.
Figure 3.2: Oracle Database 12c Architecture
3.2 MySQL
MySQL is an open-source relational database management system (RDBMS).
Its name is a combination of "My", the name of co-founder Michael Widenius's
daughter, and "SQL", the abbreviation for Structured Query Language.
The MySQL development project has made its source code available under the
terms of the GNU General Public License, as well as under a variety of pro-
prietary agreements. MySQL was owned and sponsored by a single for-profit
firm, the Swedish company MySQL AB, now owned by Oracle Corporation.
For proprietary use, several paid editions are available, and offer additional
functionality.
Figure 3.3: Mysql Architecture
3.3 PostgreSQL
PostgreSQL, often simply Postgres, is an object-relational database manage-
ment system (ORDBMS) with an emphasis on extensibility and standards
compliance. As a database server, its primary functions are to store data
securely and return that data in response to requests from other software
applications. It can handle workloads ranging from small single-machine ap-
plications to large Internet-facing applications (or for data warehousing) with
many concurrent users.
PostgreSQL is an ACID-compliant, transactional database; it has updatable
views and materialized views, triggers and foreign keys, and supports functions
and stored procedures, offering extensibility comparable to Oracle Database.
PostgreSQL is developed by the PostgreSQL Global Development Group,
a diverse group of many companies and individual contributors. It is free and
open-source, released under the terms of the PostgreSQL License, a permissive
software license.
Figure 3.4: PostgreSQL Architecture
Chapter 4
Map Reduce
Figure 4.1: Map Reduce
MapReduce is a programming template to process large datasets on parallel
computing systems. A MapReduce job is defined by:
- input data
- a Map procedure that, for each input element, generates a number of
key/value pairs
- a network shuffle phase
- a Reduce procedure, which receives as input the elements with the same key
and generates summary information from those elements
- output data
MapReduce ensures that all items with the same key will be processed by the
same reducer, as all mappers use the same hash function to decide which
reducer each key/value pair is sent to. This programming paradigm is very
complicated to use directly, given the number of jobs needed to perform
complex data operations. Tools like Pig and Hive have been created to offer
a high-level language (Pig Latin and HiveQL) and transform their queries into
a set of MapReduce jobs that are run in succession.
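The five steps above can be simulated in plain Python, with no Hadoop involved (the job and its data are illustrative): the map phase emits one key/value pair per record, the shuffle routes each pair by the hash of its key so that every pair with the same key reaches the same reducer, and the reduce phase sums the values sharing a key.

```python
from collections import defaultdict

def map_phase(record):
    # Map: for each input element, emit a number of key/value pairs.
    carrier, _delayed = record
    return [(carrier, 1)]

def shuffle(pairs, num_reducers=2):
    # Shuffle: the hash of the key decides which reducer receives each
    # pair, so all pairs with the same key land in the same bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(key, values):
    # Reduce: summary information for all values sharing a key.
    return key, sum(values)

records = [("AA", 1), ("DL", 0), ("AA", 0), ("UA", 1), ("DL", 1)]
pairs = [p for r in records for p in map_phase(r)]
result = {}
for bucket in shuffle(pairs):
    for key, values in bucket.items():
        k, total = reduce_phase(key, values)
        result[k] = total
print(result)  # counts per carrier: AA=2, DL=2, UA=1
```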
Chapter 5
Apache Spark
Figure 5.1: Web Console
Apache Spark is a cluster computing platform designed to be fast and
general-purpose. Spark provides an interface to program entire clusters through
implicit data parallelism and fault-tolerance.
Speed is important in processing large datasets, as it means the difference
between exploring data interactively and waiting minutes or hours. One of
the main features Spark offers for speed is the ability to run computations in
memory, but the system is also more efficient than MapReduce for complex
applications running on disk.
On the generality side, Spark is designed to cover a wide range of workloads
that previously required separate distributed systems, including batch applica-
tions, iterative algorithms, interactive queries, and streaming. By supporting
these workloads in the same engine, Spark makes it easy and inexpensive to
combine different processing types, which is often necessary in production data
analysis pipelines.
Apache Spark provides programmers with an application programming in-
terface centered on a data structure called the Resilient Distributed Dataset
(RDD), a read-only multiset of data items distributed over a cluster of ma-
chines, that is maintained in a fault-tolerant way. It was developed in response
to limitations in the MapReduce cluster computing paradigm, which forces a
particular linear dataflow structure on distributed programs: MapReduce pro-
grams read input data from disk, map a function across the data, reduce the
results of the map, and store reduction results on disk. Spark’s RDDs function
as a working set for distributed programs that offers a (deliberately) restricted
form of distributed shared memory.
The availability of RDDs facilitates the implementation of both iterative
algorithms, that visit their dataset multiple times in a loop, and interactive/-
exploratory data analysis, i.e., the repeated database style querying of data.
The latency of such applications (compared to a MapReduce implementation,
as was common in Apache Hadoop stacks) may be reduced by several orders
of magnitude.
Spark is designed to be highly accessible, offering simple APIs in Python,
Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with
other Big Data tools.
Apache Spark requires a cluster manager and a distributed storage system.
For cluster management, Spark supports standalone (native Spark cluster),
Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface
with a wide variety, including Hadoop Distributed File System (HDFS), MapR
File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu,
or a custom solution can be implemented. Spark also supports a pseudo-
distributed local mode, usually used only for development or testing purposes,
where distributed storage is not required and the local file system can be used
instead; in such a scenario, Spark is run on a single machine with one executor
per CPU core.
5.1 Spark Core
Figure 5.2: Spark Stack
Spark Core contains the basic functionality of Spark, including components for
task scheduling, memory management, fault recovery, interacting with storage
systems, and more. Spark Core is also home to the API that defines resilient
distributed datasets (RDDs), which are Spark's main programming abstraction.
RDDs represent a collection of items distributed across many compute nodes
that can be manipulated in parallel. Spark Core provides many APIs for
building and manipulating these collections.
Spark Core provides distributed task dispatching, scheduling, and basic
I/O functionalities, exposed through an application programming interface
(for Java, Python, Scala, and R) centered on the RDD abstraction, but is
also usable for some other non-JVM languages. This interface mirrors a
functional/higher-order model of programming: a driver program invokes par-
allel operations such as map, filter or reduce on an RDD by passing a function
to Spark, which then schedules the function’s execution in parallel on the clus-
ter. These operations, and additional ones such as joins, take RDDs as input
and produce new RDDs. RDDs are immutable and their operations are lazy;
fault-tolerance is achieved by keeping track of the “lineage” of each RDD (the
sequence of operations that produced it) so that it can be reconstructed in
the case of data loss. RDDs can contain any type of Python, Java, or Scala
objects.
Aside from the RDD-oriented functional style of programming, Spark pro-
vides two restricted forms of shared variables: broadcast variables reference
read-only data that needs to be available on all nodes, while accumulators can
be used to program reductions in an imperative style.
A typical example of RDD-centric functional programming is the following
Scala program that computes the frequencies of all words occurring in a set of
text files and prints the most common ones. Each map, flatMap (a variant of
map) and reduceByKey takes an anonymous function that performs a simple
operation on a single data item (or a pair of items), which Spark applies
to transform an RDD into a new RDD.
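The referenced Scala program is not reproduced in this text; as a stand-in, the same flatMap, map and reduceByKey pipeline can be sketched in plain Python without Spark (the input lines are illustrative):

```python
lines = ["to be or not to be", "to do is to be"]

# flatMap: split each line into words (one flat sequence of items).
words = [w for line in lines for w in line.split()]

# map: pair each word with the count 1.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts of all pairs sharing the same key.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

# Print the most common words, as the referenced program does.
top = sorted(counts.items(), key=lambda kv: -kv[1])[:3]
print(top)  # [('to', 4), ('be', 3), ('or', 1)]
```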
5.2 Spark SQL
Spark SQL is a component on top of Spark Core that introduced a data ab-
straction called DataFrames, which provides support for structured and semi-
structured data. Spark SQL provides a domain-specific language (DSL) to
manipulate DataFrames in Scala, Java, or Python. It also provides SQL lan-
guage support, with command-line interfaces and ODBC/JDBC server.
Spark SQL is Spark's package for working with structured data. It allows
querying data via SQL as well as the Apache Hive variant of SQL, called the
Hive Query Language (HQL), and it supports many sources of data, including
Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark,
Spark SQL allows developers to intermix SQL queries with the programmatic
data manipulations supported by RDDs in Python, Java, and Scala, all within
a single application, thus combining SQL with complex analytics. This tight
integration with the rich computing environment provided by Spark makes
Spark SQL unlike any other open source data warehouse tool.
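At a much smaller scale, this intermixing of declarative SQL with programmatic data manipulation can be illustrated with SQLite (the table and values are illustrative; Spark SQL does the same kind of thing against DataFrames on a cluster):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (carrier TEXT, arrdelay INTEGER)")
conn.executemany("INSERT INTO ontime VALUES (?, ?)",
                 [("AA", 45), ("AA", 5), ("DL", 90), ("DL", 10)])

# Declarative part: aggregate with a SQL query.
rows = conn.execute(
    "SELECT carrier, AVG(arrdelay) FROM ontime GROUP BY carrier").fetchall()

# Programmatic part: arbitrary host-language logic on the result,
# the kind of mixing Spark SQL allows within a single application.
flagged = [c for c, avg in rows if avg > 20]
print(sorted(flagged))  # ['AA', 'DL']
```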
5.3 Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams
of data. Examples of data streams include logfiles generated by production
web servers, or queues of messages containing status updates posted by users
of a web service.
Spark Streaming provides an API for manipulating data streams that
closely matches Spark Core's RDD API, making it easy for programmers to
learn the project and move between applications that manipulate data stored in
memory, on disk, or arriving in real time. Underneath its API, Spark Streaming
was designed to provide the same degree of fault tolerance, throughput,
and scalability as Spark Core.
Spark Streaming leverages Spark Core’s fast scheduling capability to per-
form streaming analytics. It ingests data in mini-batches and performs RDD
transformations on those mini-batches of data. This design enables the same
set of application code written for batch analytics to be used in streaming ana-
lytics, thus facilitating easy implementation of lambda architecture. However,
this convenience comes with the penalty of latency equal to the mini-batch
duration. Other streaming data engines that process event by event rather
than in mini-batches include Storm and the streaming component of Flink.
Spark Streaming has built-in support for consuming from Kafka, Flume,
Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
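The mini-batch design can be sketched in plain Python: an ordinary batch function is reused unchanged on each small chunk of the stream, which is the property that lets the same application code serve batch and streaming analytics (the data and batch size are illustrative):

```python
def count_errors(batch):
    # Ordinary batch logic, reused unchanged for streaming.
    return sum(1 for line in batch if "ERROR" in line)

def stream_in_mini_batches(stream, batch_size):
    # Cut the incoming stream into mini-batches and apply the batch
    # function to each one; the price is a latency equal to the
    # mini-batch duration.
    for i in range(0, len(stream), batch_size):
        yield count_errors(stream[i:i + batch_size])

log_stream = ["ok", "ERROR x", "ok", "ERROR y", "ERROR z", "ok"]
print(list(stream_in_mini_batches(log_stream, batch_size=2)))
# [1, 1, 1]
```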
5.4 MLlib Machine Learning Library
Spark comes with a library containing common machine learning (ML)
functionality, called MLlib. MLlib provides multiple types of machine learning
algorithms, including classification, regression, clustering, and collaborative
filtering, as well as supporting functionality such as model evaluation and data
import. It also provides some lower-level ML primitives, including a generic
gradient descent optimization algorithm. All of these methods are designed to
scale out across a cluster.
5.5 GraphX
GraphX is a library for manipulating graphs and performing graph-parallel
computations. Like Spark Streaming and Spark SQL, GraphX extends the
Spark RDD API, allowing us to create a directed graph with arbitrary proper-
ties attached to each vertex and edge. GraphX also provides various operators
for manipulating graphs (e.g., subgraph and mapVertices) and a library of
common graph algorithms (e.g., PageRank and triangle counting).
5.6 Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many
thousands of compute nodes. To achieve this while maximizing flexibility,
Spark can run over a variety of cluster managers, including Hadoop YARN,
Apache Mesos, and a simple cluster manager included in Spark itself called
the Standalone Scheduler. If you are just installing Spark on an empty set of
machines, the Standalone Scheduler provides an easy way to get started; if you
already have a Hadoop YARN or Mesos cluster, however, Spark's support for
these cluster managers allows your applications to also run on them.
5.7 Spark Architecture
In general, there are a number of running processes for each Spark application:
one driver and many executors.
The driver is the manager of a Spark program, deciding the tasks to be
performed by the executor processes running in the cluster. The driver itself
may run on the client machine.
In the main program of a Spark application (the driver) there is an object
called SparkContext, whose instance communicates with the cluster resource
manager to request a set of resources (RAM, cores, etc.) for the executors.
Several cluster managers are supported, including YARN, Mesos, EC2 and
Spark's Standalone Cluster Manager. A master/slave architecture is used,
with one coordinator process (the driver) and many worker processes (the
executors).
Since each executor is a separate process, different applications cannot
share data unless they first write it to disk. If you work on a single node you
only have one process that contains both the driver and an executor, but this
is a special case. Working on a single node allows you to test applications, as
you use the same API that you would use if you were working on a cluster. A
Spark application consists of jobs, one for each action. Each job consists of a
set of stages that depend on one another, performed in sequence, each of
which is executed by a multitude of tasks carried out in parallel by the executors.
Figure 5.3: Spark Architecture
5.7.1 The Driver
The driver is the main process that contains the main method and the user
code. The user code applies transformation operations and actions on RDDs
(distributed datasets), and is run in parallel by the executor processes deployed
in the cluster. The driver can run both within the cluster and on the client
machine that is running the Spark application. It performs the following two
functions:
- convert the user program into a set of tasks, the smallest working
units in Spark. Every Spark program is structured in this way: you read
data from disk into one or more RDDs, transform them and retrieve the
computation result. Transformation operations are performed only when a
result is asked for; in fact, Spark stores a directed acyclic graph (DAG) of
the operations needed to get the contents of an RDD. Transformation and
save/retrieval operations are turned into a series of stages performed
sequentially, each of which is composed of a set of tasks that are performed
by the executors.
- do task scheduling on the executor nodes. Task scheduling is based
on where the data are stored, to avoid as much as possible transferring
them over the network. If a node fails, the platform automatically reschedules
the work on another node, and only lost data is recalculated.
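The lazy evaluation described here can be mimicked with a minimal RDD-like class that only records its lineage of pending operations until an action is called (a sketch under simplifying assumptions, not Spark's actual implementation):

```python
class MiniRDD:
    """Records transformations lazily; computes only on an action."""
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # the DAG of pending operations

    def map(self, fn):
        # Transformation: nothing is computed yet, only recorded.
        return MiniRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._data, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage. After data loss, the
        # same lineage would allow recomputing the result from scratch.
        items = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = MiniRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; collect() triggers the evaluation.
print(rdd.collect())  # [0, 4, 16]
```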
5.7.2 The Executor
Executors are the processes that execute the tasks assigned by the driver.
Each application has its own executors (i.e. its own processes), each of which
can have multiple threads running.
Each executor has a certain (configurable) amount of memory assigned,
which allows it to store data in memory if requested by the user application
(via the cache statement on an RDD).
Executors of different Spark applications do not communicate with each
other, so different applications cannot share data unless they first write it
to disk.
Executors live for the duration of an application; if an executor fails, Spark
can continue to run the program by recalculating only the lost data.
The driver and the executor nodes should be on the same network, since
the driver communicates with them continually.
5.7.3 The Cluster Manager
Cluster managers handle resources within a cluster.
For example, when multiple applications require cluster resources, the
cluster manager schedules them on nodes based on the free memory and
CPU cores.
Some cluster managers also allow you to give different priorities to different
applications.
Spark supports the following cluster managers:
- YARN: Hadoop’s new resource manager
- Mesos
- Standalone cluster manager
In addition, Spark provides a script to run on an Amazon EC2 cluster.
Chapter 6
Test Environment
The test environment consists of two separate components: the database
servers (Oracle, MySQL, PostgreSQL) installed on one virtual machine,
and the Spark Standalone Cluster installed on two other servers.
The test measures the performance of selected SQL queries executed
locally on the three databases, and the times are compared with the same
SparkSQL queries executed remotely via the Spark cluster.
To execute the SparkSQL queries, the tables involved were mapped to the
Spark DataFrame data abstraction.
To interface the Spark cluster with the remote databases, it was necessary
to use the appropriate JDBC driver for each database.
In this configuration network latency is minimized, because all VMs reside
on the same physical server and the same network (no IP routing is performed).
Figure 6.1: Server Interconnection
6.1 Hardware Requirement
The Spark cluster consists of two VMware virtual machines named Spark-Master
and Spark-Slave.
The database server, named SparkDB, and the Spark cluster have been installed
on three VMware virtual machines with 64-bit CentOS 7, 8 cores and 12 GB
RAM each.
All virtual machines reside on a SUN X4550 physical server with 16 cores
and 36 GB RAM, on which the virtualization software VMware ESX 5.1 is
installed.
6.2 Software Requirement
Three different database servers are installed on the SparkDB virtual machine:
- MySQL Database server (MariaDB) version 5.5.52
- PostgreSQL Database version 9.2.18
- Oracle Database 12c R2
On the Spark cluster, Apache Spark version 2.2.0 with Hadoop 2.7 is installed
and configured in cluster mode. It is necessary to download the appropriate
JDBC connectors to connect it to all the database servers.
6.3 Query List
Two types of queries have been chosen: one on a table with millions of records
and hundreds of fields, and another built on dozens of tables joined with
each other but containing only thousands of records.
Some SQL queries had to be adapted, for compatibility reasons, to the
ANSI standard SQL format and were subsequently executed in all the
environments (RDBMS databases, SparkSQL) considered in the test.
The main reason for choosing these queries is that they are difficult to
optimize in RDBMS databases.
Partitioned tables were also used in the test queries to help reduce
contention at the RDBMS level.
Note that the partitionColumn option used in the SparkSQL queries does
not require the RDBMS table to be partitioned.
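The partitionColumn, lowerBound, upperBound and numPartitions options make Spark split a single JDBC read into several parallel queries by generating WHERE predicates over the chosen column. A sketch of how such stride-based predicates can be derived follows (the exact clauses Spark emits may differ; the OnTime1 script in this chapter uses year with bounds 2007 to 2017 and 11 partitions, while 5 partitions are used here for a clean integer stride):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    # Stride-based splitting in the spirit of Spark's JDBC source:
    # the first and last partitions are left open-ended so that rows
    # outside [lower, upper) are not skipped, only read serially.
    stride = (upper - lower) // num_partitions
    preds = []
    bound = lower
    for i in range(num_partitions):
        lo, hi = bound, bound + stride
        if i == 0:
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
        bound = hi
    return preds

# Each predicate becomes the WHERE clause of one parallel query.
for p in jdbc_partition_predicates("year", 2007, 2017, 5):
    print(p)
```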
The example queries that are considered are:

Query Name  Description                                       Tables Used  Total Records Processed
OnTime1     Total delayed flights per airline                 1            65.971.419
OnTime2     Total flights per day of week                     1            65.971.419
Unica1      Number of student evaluations of study            30           21.106.599
            programs for years different from 2016
Unica2      Number of student evaluations of study            30           21.106.599
            programs for years different from 2016 (grouped)
6.3.1 OnTime1/OnTime2
The table used in this query is:

Table Alias  Table Real Name  Description                   Num Rows
ONTIME       ONTIME           Airlines On-Time Performance  65.971.419

This table has millions of records and hundreds of fields.
Datasets Population
To populate all databases follow the instructions in the Readme.txt file at the
link:
https://github.com/ccorona70/tesimagistrale/blob/master/QUERY/
ONTIME/DataSet%20Population/
Database Query SQL Scripts
OnTime1 (Total delayed flights per airline):

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
       sum(case when ArrDelayMinutes > 30 then 1 else 0 end) as flights_delayed,
       round(sum(case when ArrDelayMinutes > 30 then 1 else 0 end) / count(*), 2) as rate
from ontime
where DayOfWeek not in (6, 7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
group by Carrier
having count(*) > 100000 and max(year) > 2010
order by rate desc, count(*) desc;
OnTime2 (Total flights per day of week):

select dayofweek, count(*) from ontime group by dayofweek;
All SQL database queries can be found at the link https://github.com/
ccorona70/tesimagistrale/tree/master/QUERY/ONTIME/DMLScripts
Python SparkSQL Script Example
In Spark, a Python script has been created for each query.
Each script uses the SparkSQL syntax, which relies on the DataFrame
concept to manipulate table data.
The queries, in ANSI format, are executed in all relational databases.
OnTime1 example Python script for Oracle Database:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
import time

conf = SparkConf().setMaster("spark://spark-master:7077") \
    .setAppName("ontime1oracle") \
    .set("spark.executor.memory", "8G") \
    .set("spark.driver.memory", "4G")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

tab1 = sqlContext.read.format("jdbc").options(
    url="jdbc:oracle:thin:esse3/esse3@//spark-db:1521/orcl.unica.it",
    dbtable="ontime", fetchSize=10000, partitionColumn="year",
    lowerBound=2007, upperBound=2017, numPartitions=11).load()

tab1.registerTempTable("ontime")

q1 = sqlContext.sql("select min(year), max(year) as max_year, Carrier, "
    "count(*) as cnt, sum(if(ArrDelayMinutes > 30, 1, 0)) as flights_delayed, "
    "round(sum(if(ArrDelayMinutes > 30, 1, 0)) / count(*), 2) as rate "
    "FROM ontime WHERE DayOfWeek not in (6, 7) "
    "and OriginState not in ('AK', 'HI', 'PR', 'VI') "
    "and DestState not in ('AK', 'HI', 'PR', 'VI') "
    "GROUP by Carrier HAVING cnt > 100000 and max_year > 2010 "
    "ORDER by rate DESC, cnt desc LIMIT 10")

start = time.time()
q1.show()
print(time.time() - start)
All Python scripts are available at the following links:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/
ONTIME/SparkScripts
6.3.2 Unica1/Unica2
SQL query code for all databases is available at link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/
UNICA/DMLScripts
Tables used in this query are:

Table Alias                    Table Real Name                Description                      N Rows
P02_QUESITI                    P02_QUESITI                    Questions                        8.830
ELEMENTO_P02_QUESITI           ELEMENTI                       Element                          1.379
ELEMENTO_QUESITI_PADRE         ELEMENTI                       Element                          1.379
P02_RISPOSTE                   P02_RISPOSTE                   Answers                          9.875.135
P02_QUEST_COMP_RISPOSTE        P02_QUEST_COMP                 Answers                          724.956
V02_RISPOSTE_ROW_TESTO_LIBERO  V02_RISPOSTE_ROW_TESTO_LIBERO  Free text answers                9.970.651
P02_TIPI_FORMATO               P02_TIPI_FORMATO               Format Types                     10
Q35_DATI_COMP                  Q35_DATI_COMP                                                   412.536
Q35_FAC_COMP                   P06_FAC                        Faculties                        81
Q35_CDS_COMP                   P06_CDS                        Course of study                  892
Q35_DOCENTE_AD_VAL             DOCENTI                        Teachers                         11.595
Q35_DOCENTE_TIT_AD_VAL         DOCENTI                        Teachers                         11.595
Q35_CDS_AD_VAL                 P06_CDS                        Course of study                  892
Q35_FAC_AD_VAL                 P06_FAC                        Faculties                        81
Q35_P09_AD_GEN                 P09_AD_GEN                     Educational activities           18.607
Q35_SCUOLA                     P01_SCUOLA                     High School                      13.050
Q35_TIPI_TITOLO_SUP            TIPI_TITOLO_SUP                High school degree type          240
Q35_P09_UD_CDS                 P09_UD_CDS                     Didactic units                   219.255
Q35_TIPI_CORSO_AD_VAL          TIPI_CORSO                     Course type                      52
Q35_NORMATIVA_CDS_AD_VAL       P07_NORMATIVA                  Regulations                      10
Q35_INVIO_SEGNALAZIONE         Q35_INVIO_SEGNALAZIONE         Send Report                      6.508
Q35_NUM_QUEST_CDS_DOC_UD       Q35_NUM_QUEST_CDS_DOC_UD       Number of questionnaires         27.008
Q35_CARICHE_FAC_AD_VAL         V06_CARICHE_SDR_VALIDE         List of assignments              416
Q35_PRESIDE_FAC_AD_VAL         DOCENTI                        Faculty members                  11.595
Q35_CARICHE_CDS_AD_VAL         V06_CARICHE_SDR_VALIDE         List of assignments              416
Q35_PRESIDE_CDS_AD_VAL         DOCENTI                        President of the study program   11.595
Q35_DOC_AD_VAL_DIP_AFFERENZA   P06_DIP                        Department                       61
Q35_UD_TIPO_COPERTURA          P09_UD_PDSORD_DOC              Didactic unit                    31.596
QUESITI_PADRE                  P02_QUESITI                    Questions                        8.830
Q35_FAC_CDS_AD_VAL             P06_FAC_CDS                    Relationship between             1.833
                                                              Faculty/Study Courses
Dataset Population
In the Oracle database, tables and data were imported using the expdp/impdp
utilities in schema (owner) mode.
In all other RDBMS databases (MySQL, PostgreSQL), tables and indexes
were recreated with DDL scripts.
Data were migrated from the Oracle database to the remaining RDBMS
databases using the Linux program sqldata ("SQLines Data - Database
Migration and ETL"), available for download at http://www.sqlines.com/sqldata.
DDL scripts are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/DDLScripts
Database Query SQL Scripts
All SQL database queries are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/DMLScripts
Python SparkSQL Script Example
In Spark, a Python script has been created for each query.
Each script uses SparkSQL syntax, which relies on the DataFrame concept to
manipulate table data.
The queries, in ANSI format, are executed in all relational databases.
To speed up the queries, the SparkSQL option "partitionColumn" has been
used to parallelize the reads on some large tables. So a table is imported
into a Spark DataFrame with a command like:
tab1 = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://spark-db:3306/ESSE3?user=esse3&password=esse3",
    dbtable="ontime", fetchSize=10000, partitionColumn="Year",
    lowerBound=2007, upperBound=2017, numPartitions=12).load()
tab1.registerTempTable("ontime")
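With these options, Spark issues numPartitions JDBC queries in parallel, each carrying a WHERE clause that covers one stride of the partition column between lowerBound and upperBound. The splitting can be modeled as follows (a simplified sketch of the documented behaviour, not Spark's actual planner, which also handles NULL values and uneven strides):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    # Simplified model of how a partitioned JDBC read splits the value
    # range [lower, upper] into one WHERE clause per partition.
    stride = (upper - lower) // num_partitions
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            preds.append("%s < %d" % (column, current + stride))
        elif i == num_partitions - 1:
            preds.append("%s >= %d" % (column, current))
        else:
            preds.append("%s >= %d AND %s < %d"
                         % (column, current, column, current + stride))
        current += stride
    return preds

# Five partitions over Year 2007-2017, for illustration:
for p in jdbc_partition_predicates("Year", 2007, 2017, 5):
    print(p)
```

Each predicate becomes a separate query against the source database, so the executors can read the large table concurrently instead of through a single JDBC connection.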
Python scripts are available at the following link:
https://github.com/ccorona70/tesimagistrale/tree/master/QUERY/UNICA/SparkScripts
6.4 Test Results
OnTime queries
Figure 6.2: Benchmark OnTime Queries
Figure 6.3: OnTime1
Figure 6.4: OnTime2
Unica queries
Figure 6.5: Benchmark Unica Queries
Figure 6.6: Unica1
Figure 6.7: Unica2
Chapter 7
Conclusions
Retrieving data from an RDBMS and loading it into Spark is not free.
Spark does not work well for fast queries, i.e. those that can
efficiently use an index.
Spark is recommended when the tables used in queries have millions or
billions of records.
Spark's advantage is even more pronounced, with RDBMS databases like
MySQL or Oracle, when large tables are indexed and partitioned.
It can increase the performance of the OnTime queries by up to four times,
and of the Unica queries by more than one hundred times.
Using Apache Spark as an additional engine level on top of RDBMS
databases can help to speed up the slow reporting queries and add more scal-
ability for the long running queries.
In addition, Spark, combined with its query caching feature, can speed up
the execution of frequent queries.
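The effect described here, where the first execution pays the full cost and repeated executions are served from memory, corresponds to caching the DataFrame (for example with tab1.cache()) before querying it. It can be sketched with a plain-Python analogy in which memoization stands in for Spark's cache (this is an illustration, not Spark code):

```python
import functools

calls = {"n": 0}

@functools.lru_cache(maxsize=None)
def run_query(sql):
    # Stands in for an expensive JDBC read plus aggregation.
    calls["n"] += 1
    return "result of " + sql

run_query("select count(*) from ontime")  # computed
run_query("select count(*) from ontime")  # served from the cache
print(calls["n"])   # 1
```

As with Spark's cache, the benefit only materializes when the same query (or the same cached table) is hit repeatedly; one-off queries still pay the full retrieval cost.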
Acknowledgments
I thank all my family for the support and patience they have shown during
these eight years of work and study.
Bibliography
[1] Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning
Spark, Lightning-Fast Big Data Analysis. O’Reilly Media, 2015.
[2] Alexander Rubin. How Apache Spark makes your slow MySQL queries
10x faster (or more). https://www.percona.com/blog/2016/08/17/apache-
spark-makes-slow-mysql-queries-10x-faster/