IN DEGREE PROJECT INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Big data scalability for high throughput processing and analysis of vehicle engineering data

FENG LU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY



Feng Lu Master of Science Thesis February 4, 2017

Big data scalability for high throughput processing and analysis of vehicle engineering data

Feng Lu
[email protected]

February 4, 2017

KTH Department of Computer Science
Thesis examiner: Mihhail Matskin
Thesis supervisor: Ashok Chaitanya Koppisetty


Abstract

"Sympathy for Data" is a platform used for automated Big Data analytics. It is based on a visual interface and workflow configurations. The main purpose of the platform is to reuse parts of code for structured analysis of vehicle engineering data. However, Sympathy for Data has performance issues when processing large amounts of data on a single machine. There are also disk and CPU IO intensive issues when the data is too large to fit comfortably in memory. In addition, for data at the TB or PB level, Sympathy for Data needs separate functionality for efficient simultaneous processing and must scale out to distributed computation.

This paper focuses on exploring the possibilities and limitations of using the Sympathy for Data platform in various data analytics scenarios within the Volvo Cars vision and strategy. This project rewrites the CDE workflow of over 300 nodes into pure Python script code and makes it executable on Apache Spark and Dask infrastructure. We explore and compare both distributed computing frameworks, implemented on Amazon Web Services EC2 using four machines of a 4x instance type for the distributed cluster measurements. The benchmark results show that Spark is superior to Dask from a performance perspective. Apache Spark and Dask are combined with Sympathy for Data as a Big Data processing engine to optimize system disk and CPU IO utilization. There are several challenges when using Spark and Dask to analyze large-scale scientific data on such systems. For instance, parallel file systems are shared among all computing machines, in contrast to shared-nothing architectures. Moreover, accessing data stored in commonly used scientific data formats such as HDF5 is not natively supported in Spark.

This report presents research carried out on the next generation of Big Data platforms in the automotive industry, called "Sympathy for Data". The research questions focus on improving I/O performance and scalable distributed functionality to promote Big Data analytics. During this project, we used the Dask.Array parallelism features for interpreting the data sources as rasters shown in table format, while Apache Spark was used as the data processing engine, loading data sources into memory in parallel to improve big data computation capacity. The experiments chapter demonstrates a 640GB engineering data benchmark in single-node and distributed computation modes, evaluating Sympathy for Data's disk, CPU and memory metrics. Finally, the outcome of this project improved performance sixfold over the original Sympathy for Data by developing a middleware, SparkImporter. It is used in Sympathy for Data for distributed computation and connects to Apache Spark for data processing, maximizing the utilization of system resources. This improves throughput, scalability, and performance. It also increases the capacity of Sympathy for Data to process Big Data and avoids big data cluster infrastructures.


Contents

1 Introduction
1.1 Sympathy for Data
1.2 Problem Statement
1.3 Purpose
1.4 Goal
1.5 Benefits, Ethics and Sustainability
1.6 Methodology
1.7 Outline

2 Literature Study
2.1 Sympathy for Data Platform
2.2 Vehicle Engineer Data Sources
2.3 Sympathy for Data Workflow
2.4 Parallelism Frameworks
2.5 Python Global Interpreter Lock
2.6 How Spark and Dask Avoid the Global Interpreter Lock
2.7 Spark Resilient Distributed Dataset
2.8 Spark Distributed Computation
2.9 Dask Parallel Computation Library
2.10 Dask Distributed Computation

3 Methodology
3.1 Goal
3.2 Spark Solution
3.3 Dask Solution
3.4 Integration

4 Sympathy for Data Optimization
4.1 Development Environment Specifications
4.2 CDE Workflow Formulas and Filter Rules
4.3 SparkImporter
4.4 Dask Internals of Sympathy for Data
4.5 Sympathy for Data Distributed Computation
4.6 Dask and Spark Distributed Computation

5 Experiment and Evaluation
5.1 Results
5.2 Parallelism Throughput and Scalability
5.3 CPU Throughput Measurement
5.4 Cluster Measurement

6 Conclusions and Discussion
6.1 Spark and Dask Comparison and Discussion
6.2 Discussion
6.3 Restrictions
6.4 Conclusions
6.5 Future Work


Preface

This is a Master's Thesis project carried out at Volvo Car Group, Sweden, and supervised by KTH's ICT department. The project began on 2016-01-18 and finished on 2016-06-30, as dictated by the contract of employment that I, Feng Lu, a Master's student in the Software Engineering of Distributed Systems program, signed with the company.

The research direction of this project was mutually agreed upon, balancing the company's needs against the research agenda. As a Master's Thesis project, it exposes the scientific aspect in which the engineering effort is rooted.

All the necessary equipment, software, hardware and experiment data has been kindly provided by Volvo Car Group. Since the majority of the data relates to company integrity and privacy, this thesis elaborates on it only to a limited extent and emphasizes the academic research. Dr. Ashok Chaitanya Koppisetty is the project's supervisor from Volvo Car Group, and I appreciate all his help. Prof. Mihhail Matskin is the examiner of this project from KTH, and I would like to thank him for all his feedback.


Acronyms

AWS: Amazon Web Services
EC2: Amazon Elastic Compute Cloud
HDF5: Hierarchical Data Format version 5
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
RDD: Resilient Distributed Dataset
RAM: Random Access Memory
IO: Input/Output
W/R: Write/Read
JVM: Java Virtual Machine
GPU: Graphics Processing Unit
SSH: Secure Shell
CPU: Central Processing Unit
ADAF: Internal Data Type in Sympathy for Data
CDE: Sympathy for Data Workflow for Big Data Analytics
SfD: Sympathy for Data
TB: Terabyte
PB: Petabyte
GB: Gigabyte
GPL: General Public License
NCSA: National Center for Supercomputing Applications
BSD: Berkeley Software Distribution
GIL: Global Interpreter Lock
MDF: Measurement Data Format


1 Introduction

Industrial enterprises have moved into the big data era with the rapid growth of data in various automotive industrial systems. Vehicles will generate and consume roughly 4000GB of data for every eight hours of driving [1]. The change is driven by several factors: large volumes, high velocities, and high complexity, as well as unprecedented growth in speed. The ability to store and compute on huge data sets is becoming more challenging every day. The data come from various types of devices such as vehicle monitors, cameras, radar, drones etc., and these data sources are extremely hard to collect and manage. Ensuring the rapid transfer of such huge amounts of data from the vehicles' devices to the Big Data analytics platform is a challenging task, because several terabytes of time-series data accumulate, requiring high-bandwidth data transfer that is frequently unavailable. The variety and number of devices increase the massive amount of data, which is generally based on time series and hierarchical structure. Therefore, data analytics methodologies and the performance of data analysis platforms confront unpredictable challenges[2].

The rapid growth of the automotive industry is changing vehicle embedded devices in a variety of ways. This leads to the need to further explore ways to ensure that there are no delays and that valuable data is not lost. As a result, a reliable, high-performance big data platform is essential for big data processes. This platform must assist in data collection and big data analytics for the organization to enable data-driven decisions[2].

Big data analytics is a research direction focused on data collection, missing-data computation, specific data filtration, organization, result generation, etc. The results are further used to discover patterns and other valuable information, thereby supporting organizations in making the right decisions. Big data analytics in the automotive industry can help organizations better understand a vehicle's status and thereby optimize its functions. The huge amount of data produced needs to be analyzed more quickly and easily[3].

Sympathy for Data is a visual software platform for offline and online big data analytics. Its goal is to support the industry in performing complex data analytics and computation. It allows building workflows by dragging and dropping nodes, and it focuses on automating data analysis: importing data, preparing and analyzing it, and generating visualized reports accordingly [4][5]. The platform currently has performance issues when analyzing large amounts of data due to the required processing time, which conflicts with quick data analysis requirements. This project focuses on mapping the key bottlenecks in existing Sympathy workflows and on developing and implementing possible solutions for improving disk and CPU IO intensive performance and scalability. Also, since big data is growing at an exponential rate, data beyond the TB or PB level commonly requires scaling Sympathy for Data out to distributed computation to make data processing more efficient.

1.1 Sympathy for Data

Sympathy for Data is easy and flexible for organizations to use for big data analytics, because it is a free, lightweight, cross-platform tool. It also supports direct connections to third-party server environments such as SQL Server, SharePoint or CSV files. The platform is mainly used to handle large unstructured data, time series and complex vehicle engineering data sources. These data formats generally exist in multiple groups such as meta-data and time series[5].

Sympathy for Data is developed by the systems engineering software community as an open source project[5]. It supports high-precision and complex industrial data processing. The internal data file representation of Sympathy is ADAF, which is based on a hierarchical data format. The core purpose of the platform is to reuse parts of the code for structured workflow analytics of big data[5]. A workflow consists of nodes developed in Python, and users can customize the nodes based on their requirements.

In this project, the CDE workflow is constructed from multiple nodes, each performing a specific data analytics function. It is used for exploring the typical issues that exist in Sympathy for Data and for investigating how to optimize disk and CPU IO intensive performance. The main solution focuses on parallelizing the scheduled tasks to improve system utilization, in order to maximize throughput and scalability.

1.2 Problem Statement

The Sympathy platform has performance and scalability issues in analyzing large amounts of data, which creates several inefficiencies. This project primarily focuses on mapping the critical bottlenecks in the existing Sympathy for Data and on developing and implementing possible solutions for increasing computation performance and scalability. In order to enhance capacity, it is necessary to improve both parallelism and distributed computation. Since Sympathy for Data is based on a Python environment, which is plagued by the Global Interpreter Lock[6], parallel capacity is limited and it is difficult to gather results from multi-threaded or concurrent programming. Figure 1 shows the time consumed when the Sympathy for Data CDE workflow analyzes 80GB of vehicle engineering data. Approximately 84% of the time is consumed on data import, which is the disk and CPU IO intensive part, while data filtration and result generation account for only the remaining 16%. During the data import task, the major operation is loading a batch of .dat files and converting them to ADAF (the Sympathy for Data internal data format), with the ADAF automatically interpolating the signals. The main research of this project focuses on this workflow, since typical Sympathy for Data workflows involve operations that are disk and CPU IO intensive. The primary challenge is to enable Sympathy for Data to process large amounts of data more efficiently and conveniently. The disk and CPU IO intensive operations require separate functionality for efficient processing. It was hypothesized that by improving disk IO, CPU and memory utilization, computation performance would improve, thereby avoiding unnecessary waste of time and large distributed infrastructure.

Figure 1: Time consumption graph of the CDE workflow analyzing 80GB of vehicle engineering data
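The GIL constraint mentioned above can be illustrated with a small standard-library sketch (a hypothetical workload, not Sympathy for Data code): a CPU-bound task gains little from a thread pool because only one thread may execute Python bytecode at a time, which is why process-based engines such as Spark and Dask workers are needed for parallel computation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stdlib-only sketch of the GIL effect on CPU-bound work: threads give
# little or no speedup because only one thread executes Python bytecode
# at a time; separate processes (as Spark and Dask workers use) sidestep
# the lock entirely.

def cpu_bound(n):
    """Purely CPU-bound work, standing in for a signal computation."""
    return sum(i * i for i in range(n))

def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

jobs, n = 4, 1_000_000

serial, t_serial = timed(lambda: [cpu_bound(n) for _ in range(jobs)])

with ThreadPoolExecutor(max_workers=jobs) as ex:
    threaded, t_threads = timed(lambda: list(ex.map(cpu_bound, [n] * jobs)))

assert serial == threaded  # identical results, but typically similar wall time
print(f"serial: {t_serial:.2f}s  threads: {t_threads:.2f}s")
```

On CPython the two timings are usually close, confirming that threading alone cannot fix the import bottleneck described above.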

1.3 Purpose

The purpose of this project is to optimize the Sympathy for Data architecture with a reliable solution, and to optimize the existing data analysis workflow whose disk, memory and CPU IO intensiveness limits its parallelism and scalability. The project's experiment environment is based on Amazon Web Services with an Apache Spark cluster and a Dask cluster. It provides a comparison benchmark of the different solutions with IO metrics on a single machine and on a cluster. For the accuracy of our measurements, we prepared from 10GB to 640GB of vehicle engineering data for the data analytics.

The methods and architecture developed here will be the basis for future functionality for scaling the Sympathy for Data platform when analyzing large amounts of engineering data. The main purposes of this project are:

1. Map the current bottlenecks in the Sympathy for Data architecture and the data analysis workflows that limit its scalability.
2. Propose reliable solutions for improving Sympathy for Data throughput and scalability.
3. By comparing performance, decide on the most efficient data architecture solution to integrate into Sympathy for improving the product.

1.4 Goal

The goal of this project is to provide a solution for improving Sympathy for Data performance by fully utilizing system resources, in order to speed up data analytics. This reduces unnecessary waste of time and improves big data processing performance. The goals of this project include the following:

1. Review and document existing state-of-the-art methods for efficiently handling disk and CPU IO intensive operations, in the context of and as applicable to Sympathy for Data.
2. Present the advantages and disadvantages of each of the chosen methods in the current context, and propose, develop, implement and benchmark suggested solutions or architectures for scaling the data analysis workflow to large amounts of data.
3. Decide on the most efficient solution to develop into Sympathy for Data for improving its Big Data processing capacity.
4. Integrate HDF5 support into Apache Spark for analyzing large-scale scientific data.
5. Document the results of the proposed solutions together with their limitations and suggestions for future development.

The thesis project benefits the industry by providing standard solutions for analyzing large amounts of engineering data.

1.5 Benefits, Ethics and Sustainability

Big data analytics places high demands on the system resources of the analytics platform, because a high-performance platform generates results more quickly. It can also provide significant support to organizations so that they can efficiently make informed decisions. The Sympathy for Data platform has the potential to allow these organizations to perform their data analysis tasks faster and more easily. The overall benefit would be an efficient use of resources and an expedited product development life-cycle[2].

There is no doubt that integrating the possible solutions into Sympathy for Data would significantly improve the parallelism, throughput, and scalability of the platform. A high-performance data analytics platform can significantly reduce both data computation time and costs.

The measurement data sources came from vehicle devices and were provided by Volvo Car Group. The measurement data was collected from real vehicles. Therefore, due to the company's integrity, privacy and confidentiality requirements, this project only demonstrates limited amounts of critical data, and the test data sources will not be published. The company agreed to publish the project's development code, benchmarks and solutions in the open source community on GitHub.

1.6 Methodology

This paper uses experimental research methods to propose possible solutions and employs comparison to measure the performance of the proposed solutions in Sympathy for Data. By comparing solution performance, we decided on the most efficient solution to integrate into Sympathy for Data. Qualitative and quantitative methods are also used for analyzing the experimental results[24]. Finally, the measurements execute different scales of data and observe the performance of the proposed solutions through disk, CPU IO and memory metrics.


1.7 Outline

This thesis discusses how to maximize the utilization of CPU, memory and parallel reading and writing from disk to improve Sympathy for Data IO performance, and how to integrate the platform with current open source frameworks to improve its scalability and throughput. Chapter 2 discusses the background of this project and provides details on the CDE workflow used for measuring the performance of Sympathy for Data. It also highlights the platform's internal ADAF file. By analyzing the workflow execution time, it identifies common issues that exist on the platform. It then discusses currently popular big data processing frameworks such as Apache Hadoop, Apache Spark and Dask. This chapter also introduces the Python Global Interpreter Lock issues and the background of distributed cluster frameworks. Additional details are discussed in Chapter 5.

Chapter 3 introduces the development methods and research methodologies as well as a comparison of the solutions at the theoretical level. Chapter 4 discusses the challenges of applying the Chapter 3 methods in the research. It also describes the development of the proposed solutions for Sympathy for Data and the experiment details. Chapter 5 is the most important part of this project: it presents the measurements and experiment results for different scales of data, and demonstrates the benchmark used for evaluating IO performance and parallelism in a cluster. Chapter 6 discusses the differences between Spark and Dask and presents the outcomes of this project, together with suggestions for future work.


2 Literature Study

This chapter presents the necessary background information on Sympathy for Data, the CDE workflow, and the data sources. Apache Spark and Dask are introduced in order to understand the project. The first section presents the Sympathy for Data internal data structure, HDF5, the ADAF file format and the CDE data analytics workflow. The second section explores the background of the Global Interpreter Lock and the approaches used in Python to avoid it. The third section highlights the proposed solutions, Apache Spark and Dask, providing a theoretical-level comparison and the advantages and disadvantages of the standalone and distributed mode scenarios respectively. There is also a discussion of the Spark and Dask properties available to the Sympathy for Data platform architecture to improve disk, memory and CPU IO scalability and throughput.

2.1 Sympathy for Data Platform

Sympathy for Data is based on a Python 2.7 environment and includes many scientific data analysis libraries such as Pandas, SciPy and NumPy. Pandas provides the DataFrame structure, NumPy provides n-dimensional arrays for data analysis in Python, and SciPy serves as a mathematics library with abundant calculation formulas[13]. The fundamental function of the Sympathy for Data platform is to import or customize third-party libraries and self-defined nodes for use in workflows[13]. While working in Sympathy, it is possible to build workflows by dragging nodes and configuring filter rules or computation formula scripts. A workflow is constructed from multiple nodes and sub-flows, with each sub-flow connecting to perform a particular data analysis task. The workflow visually presents the task steps, but beneath the connected nodes it also contains all the Python code necessary to carry out those data analysis tasks. This execution model makes the actual data analysis process more transparent and properly structured, and it allows different users to use the tool in a variety of forms to join the workflow. Sympathy for Data provides a comprehensive number of workflows by default, some of which will only run within existing workflows. Users can also create new workflows or modify existing ones, and create nodes, the components used to build workflows[8].
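The node-and-workflow idea described above can be sketched in plain Python (all node names here are hypothetical illustrations, not the platform's actual node API): each node is a function over a column-oriented table, and a workflow is simply a chain of nodes.

```python
# Minimal, hypothetical sketch of a node-based workflow: each "node" is a
# Python function taking a table (dict of column -> list) and returning a
# new table; a workflow chains nodes, mirroring how connected Sympathy for
# Data nodes each carry the Python code for one analysis step.

def import_node(rows):
    """Turn raw measurement rows into a column-oriented table."""
    return {
        "speed": [r[0] for r in rows],
        "rpm": [r[1] for r in rows],
    }

def filter_node(table):
    """Keep only samples where speed exceeds 50 (an example filter rule)."""
    keep = [i for i, v in enumerate(table["speed"]) if v > 50]
    return {col: [vals[i] for i in keep] for col, vals in table.items()}

def compute_node(table):
    """Add a derived column, like a configured computation formula."""
    table = dict(table)
    table["rpm_per_speed"] = [r / s for r, s in zip(table["rpm"], table["speed"])]
    return table

def run_workflow(rows, nodes):
    data = rows
    for node in nodes:
        data = node(data)
    return data

result = run_workflow(
    [(40, 1500), (60, 1800), (80, 2400)],
    [import_node, filter_node, compute_node],
)
print(result["rpm_per_speed"])  # → [30.0, 30.0]
```

Because every node has the same table-in/table-out signature, nodes compose freely into sub-flows and can be reused across workflows, which is the reuse property the platform is built around.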

Sympathy for Data is built to encourage reuse and sharing at all levels of the analysis tasks. Nodes can be customised and use a few standardised data types to guarantee that they work well together. Parts of workflows can be divided into modular linked sub-flows, where each sub-flow is constructed from multiple nodes and reused in other workflows, thereby reducing wasted time. As a natural step, Sympathy for Data and its standard library are both free software and open source, licensed under the GPL and BSD respectively (adopted from [5]).

Internally, Sympathy for Data uses ADAF as its data type. ADAF is a highly complex data type whose internal structure is based on HDF5. HDF5 describes files precisely and refers to a couple of different technology sets, each consisting of a data file format and a suite of software for manipulating data stored in that format; it is mostly used to store scientific data[7][8]. The ADAF is internally constructed from different groups: meta-data, a results group with aggregated/calculated data, and time-series groups with accumulated time-resolved data. The groups exist as primary containers used to store different files, and the groups are connected to one another. Most signal channels are time-series based and are presented as rasters bound to tables. Figure 2 shows the internal ADAF file structure, which has 15 groups. Each group consists of metadata, results and time-series groups, and the time-series data binds to a raster. A single .dat file contains up to 600 columns, where each column stores a large volume of signals; the .dat file format uses MDF as its file interpreter. The ADAF is very powerful since it embeds many file interpreters that can be selected according to the data source format, automatically converting and rendering data into the respective groups[5].
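The group/raster organization can be mimicked with NumPy (a conceptual sketch only; the group and field names are illustrative, and a real ADAF is an HDF5 container): each group holds metadata, a results section, and a time-series raster whose signal columns share one time basis.

```python
import numpy as np

# Conceptual sketch of an ADAF-like container (illustrative names only; a
# real ADAF is an HDF5 file): each group holds metadata, a results group,
# and a time-series "raster" whose signal columns share one time basis.
# Real files carry ~600 columns and 20,000+ rows; smaller numbers here.

def make_group(n_samples, n_signals, rate_hz):
    basis = np.arange(n_samples) / rate_hz       # shared time basis (seconds)
    signals = {f"signal_{i}": np.random.rand(n_samples)
               for i in range(n_signals)}
    return {
        "meta": {"rate_hz": rate_hz, "n_signals": n_signals},
        "results": {},                           # aggregated values go here
        "timeseries": {"basis": basis, "raster": signals},
    }

adaf_like = {f"group_{g}": make_group(2_000, 60, 100.0) for g in range(3)}

# Each raster behaves like a table: rows are time samples, columns signals.
ts = adaf_like["group_0"]["timeseries"]
print(len(ts["raster"]), "columns x", len(ts["basis"]), "rows")
```

The key design point this mirrors is that every signal column in a raster is indexed by the same basis array, which is what makes the raster presentable as a single table.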

Figure 2: Sympathy for Data Internal ADAF Structure

The data types currently supported in Sympathy for Data include datasource, Table, ADAF, and text files. This project's implementation and measurements use the ADAF data source type for the data analysis. The ADAF data type is designed according to the HDF5 structure. Compared to traditional data formats, HDF5 has many benefits: it can represent complex data objects with a variety of metadata, with no limitation on the number or size of data objects in a collection. It also has excellent performance characteristics for storing time-series data, which it makes easy to manipulate, view and analyze[8]. ADAF is internally based on the HDF5 structure for storing and managing scientific data sources, since the platform aims to support and analyze vehicle engineering data, for which HDF5 has significant storage benefits. HDF5, designed by NCSA, seeks to provide self-describing, flexible, versatile, scalable and cross-platform file functionality. It is based on a hierarchical structure covering symbols, numbers, time series and graphics, since these are necessary for scientific data, and it connects all data together in different groups within one file[8][7].
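The hierarchical layout HDF5 provides can be seen with a few lines of `h5py` (the group names here merely mirror the ADAF organization and are not the platform's actual schema):

```python
import tempfile
from pathlib import Path

import numpy as np
import h5py

# Build a small HDF5 file with an ADAF-like hierarchy (illustrative group
# names, not the real ADAF schema): one measurement group holding metadata
# attributes, an empty results group, and a time-series raster.
path = Path(tempfile.mkdtemp()) / "adaf_like.h5"
with h5py.File(path, "w") as f:
    grp = f.create_group("group_0")
    grp.attrs["vehicle"] = "test_car"              # metadata as attributes
    grp.create_group("results")                    # aggregated results group
    ts = grp.create_group("timeseries")
    ts.create_dataset("basis", data=np.arange(1000) / 100.0)  # time basis
    ts.create_dataset("speed", data=np.random.rand(1000))     # one signal

# Reading back: datasets are addressed by hierarchical paths, like files.
with h5py.File(path, "r") as f:
    vehicle = f["group_0"].attrs["vehicle"]
    basis = f["group_0/timeseries/basis"][:]
print(vehicle, basis.shape)
```

Path-style addressing of groups and datasets within a single file is what "connecting all data together in different groups in one file" means in practice.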

2.2 Vehicle Engineer Data Sources

Figure 3: N-dimensional array representation of an ADAF file [27]

The CDE workflow uses the .dat format files for conducting the data analytics. The .dat file internallyis interpreted by MDF which is a binary file format that can be used for recording, exchanging and post-measurement analysis of measurement data. The ADAF procedure is done by importing the .dat filesvia formula or computation script for conversion to different group files according to the source’s internalstructure. The data format is pivotal and is constructed by a large number of columns, signals and groups,where each group performs a raster and demonstrates as an N-dimensional array as shown in Figure 3. Afterthe execution of the workflow, the data import nodes will be responsible for loading a batch of .dat files


to ADAF. This procedure is performed during data loading. Most of the signals are interpolated to the raster, after which the ADAF files can be stored in .sydata format. A .sydata file is a re-sampled data source file generated by Sympathy for Data. It is normally relatively small because it only stores the concrete columns needed for the data analysis and computation.
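The re-sampling step can be illustrated with NumPy: signals recorded on their own time bases are interpolated onto one common raster. The signal names and sample rates below are invented for illustration:

```python
import numpy as np

# Two signals sampled at different rates on their own time bases.
t_fast = np.linspace(0.0, 1.0, 101)   # ~100 Hz signal
t_slow = np.linspace(0.0, 1.0, 11)    # ~10 Hz signal
speed = np.sin(2 * np.pi * t_fast)
temp = 20.0 + 5.0 * t_slow

# Choose one raster (common time base) and interpolate every signal onto
# it, as the data import nodes do when building the re-sampled .sydata file.
raster = np.linspace(0.0, 1.0, 51)
speed_rs = np.interp(raster, t_fast, speed)
temp_rs = np.interp(raster, t_slow, temp)

print(speed_rs.shape, temp_rs.shape)   # (51,) (51,)
```

After re-sampling, every signal shares the same time base, which is why the resulting file only needs to keep the columns of interest and stays comparatively small.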

2.3 Sympathy for Data Workflow

The CDE workflow is built from a number of nodes, each of which performs a particular function of the analysis procedure. The workflow aims to analyze vehicle engineering data. The data sets are separated into multiple groups: metadata, results and time-series. The time-series are represented as rasters with multiple groups, where each group contains 500-600 columns and over 20,000 rows. The file format and structure are shown in Figure 2 (the internal ADAF file). Figure 4 shows the CDE workflow used for processing vehicle engineering data. This workflow typically involves data ingestion, data preparation (for example sampling, extracting, transforming and loading) and data analytics, and it exports data as plots or reports. The workflow is separated into six modules, where each module is composed of multiple connected nodes. The module details are shown below:

Figure 4: CDE data analytics workflow

Import data is used to choose an input source folder and an output folder. After the workflow executes the analysis tasks, it automatically interpolates the needed signals in the selected folder; this procedure is called re-sampling. It loads a batch of .dat files into system memory. The ADAF data type is used to interpret the files, and other sub-flows are responsible for filtering specific columns. In the analysis work, specific signals can be selected for analysis and computation. The original .dat files are converted to .sydata format via ADAF. Within the Import data part, the ADAF file interpolation and signal interpolation nodes take the longest time to compute; this is primarily because they are disk and CPU IO intensive and can run out of memory.
Vehicle config is used for specifying the user metadata and the groups used to compute the pre-configured formula scripts.
Filter files is used for filtering specific data groups or signals when generating the results.
Create subsets is used for evaluating a particular data group.
B_KatDgnos is used for generating reports according to the group and making the decision.

Through the execution and analysis of Figure 4, the CDE workflow reflects the common issues that exist in Sympathy for Data, because it consists of hundreds of nodes and covers many common scenarios. When executing the CDE workflow from data import to the generation of the end results, the majority of the time is spent on data import and re-sampling. Since this part typically involves


operations that are disk (IO) and CPU intensive, processing large amounts of data requires the IO-intensive and CPU-intensive operations to be handled separately for efficiency. Therefore, high-throughput parallel and scalable computation frameworks are proposed to solve these issues. By integrating those frameworks into Sympathy for Data, this project aims to maximize the utilization of system resources and improve the effectiveness of the data analytics.

2.4 Parallelism Frameworks

In the big data era, workflow-based platforms need to utilize data-parallel computing techniques to process and analyze big data more efficiently. Parallel and high-performance computing techniques are used to increase the utilisation of a single machine or workstation. The aim of these frameworks is to maximise the utilisation of the CPU cores, the memory and parallel reads/writes from disk in order to improve performance. Several technologies are currently available that exploit multiple levels of parallelization (e.g. multicore, many-core, GPU, cluster, etc.)[9]. These frameworks must trade off performance against system usage, cost, failure handling, data recovery, maintenance, availability and usability in order to provide solutions that are well adapted to applications[28]. Popular big data processing frameworks include Apache Hadoop, Apache Spark and Dask (from the Blaze ecosystem)[9]. Apache Hadoop is a greatly simplified, largely distributed big data processing platform developed in Java. Its core idea is the MapReduce method, which processes batches of data in parallel on distributed systems. MapReduce is a framework and an associated implementation for processing and generating massive data sets. It specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that aggregates all of the intermediate values associated with the same intermediate key [10]. A disadvantage of the platform is that it is developed in Java, which does not integrate well with Python. Moreover, loading big data from disk makes it heavily disk IO intensive, which is very expensive.

A number of open source Python frameworks have been released to address the disk, memory and CPU IO intensiveness issues. The most popular, Apache Spark and Dask, process big data by maximizing the utilization of system resources and by distributed computation. Spark is based on in-memory computation and performs very well for computation and parallelization. Spark uses resilient distributed datasets (RDDs), collections of objects that can be split into many partitions across the cluster. A partition can also be rebuilt if its task fails[9].

Dask is compatible with both Python and Sympathy for Data. It is a versatile parallel programming library with dynamic task scheduling. It enables parallel, blocked-algorithm, out-of-memory computation and can process large volumes of data. Dask provides several collections for big data analytics, such as Bag, Array and DataFrame. The basic idea is to load batches of data in parallel and use a blocked algorithm to break up large tasks into multiple small tasks; each chunk performs a small task computation, and the intermediate results are then aggregated. Through such tricks, the Dask array can solve one large problem by solving many small problems[11].
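The chunk-and-aggregate principle can be shown without Dask itself. This is a plain-Python sketch of the idea, not Dask's actual implementation:

```python
def blocked_sum(data, chunk_size):
    """Sum a large sequence by summing small chunks, then aggregating.

    This mimics Dask's blocked algorithms: each chunk is a small task
    whose intermediate result is combined at the end, so the whole
    data set never has to be reduced in a single pass.
    """
    partial_sums = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]   # one small task
        partial_sums.append(sum(chunk))          # intermediate result
    return sum(partial_sums)                     # aggregation step

values = list(range(1_000_000))
assert blocked_sum(values, chunk_size=10_000) == sum(values)
print(blocked_sum(values, chunk_size=10_000))   # 499999500000
```

Because each chunk is independent, the small tasks can also be streamed from disk or scheduled across threads, processes or workers, which is exactly what Dask's schedulers add on top of this idea.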

2.5 Python Global Interpreter Lock

Before discussing the Spark and Dask platforms in detail, it is necessary to mention a fundamental problem in Python: the Global Interpreter Lock (GIL). Since Sympathy for Data is developed in Python, it inherits the parallelism issues that limit the capacity of multiple threads or processes. The GIL is a bottleneck in Python; it has been around as long as the interpreter, has survived several attempts to remove it, and has caused many problems for Python programmers[12]. Since big data processing requires parallel disk R/W and CPU-intensive tasks, parallel and concurrent computation is necessary for Sympathy for Data to process big data.

The Global Interpreter Lock exists in the CPython interpreter and introduces a significant speed penalty for multi-threaded Python programs. CPython's primary mechanism is a mutex that prevents multiple native threads from executing Python bytecode concurrently; the lock exists because CPython's memory management is not thread-safe[14]. The GIL has been controversial for a long time, because it prevents multiple threads from


fully utilizing the CPU processors in some circumstances. Since Dask is based on the NumPy interface, potentially blocking or long-running operations, such as frequent I/O, happen outside the GIL. Multi-threaded programs that spend considerable time interpreting CPython bytecode, however, find the GIL becoming the Python performance bottleneck. For a better understanding of the GIL problem, one thread and multiple threads were used to execute a one-billion-count task. The multiple threads were 50% slower than the single thread because the CPU spends considerable time waiting [29]. Figure 5 shows two threads executing computation tasks on a dual-core CPU. The green line represents threads that are executing tasks, while the red line shows threads that are awakened by the scheduler but unable to obtain the GIL, which leads to an unnecessary waste of time. The figure illustrates how the GIL bottleneck undermines CPU parallelism.

Figure 5: Python two threads execution on double core CPU performance [29]
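The counting experiment can be reproduced with the standard threading module. A much smaller count than one billion is used here, and since wall-clock ordering depends on the machine, only the correctness of the result is checked:

```python
import threading

N = 5_000_000

def count(n, out, idx):
    # Pure-bytecode CPU work: every increment holds the GIL.
    total = 0
    for _ in range(n):
        total += 1
    out[idx] = total

# Single thread does all the work.
single = [0]
count(N, single, 0)

# Two threads split the same work, but the GIL lets only one of them
# execute Python bytecode at a time, so on CPython this is not faster
# (and is often slower due to lock contention, as Figure 5 shows).
results = [0, 0]
t1 = threading.Thread(target=count, args=(N // 2, results, 0))
t2 = threading.Thread(target=count, args=(N // 2, results, 1))
t1.start(); t2.start(); t1.join(); t2.join()

assert single[0] == sum(results) == N
```

Timing the two variants with time.perf_counter() on a multi-core machine typically shows the threaded version taking as long as, or longer than, the single-threaded one for this kind of CPU-bound loop.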

Python works around the Global Interpreter Lock by using multiple processes instead of multiple threads. The multiprocessing standard library is used to avoid the GIL issue; it imitates the threading library interface, which makes it easy to integrate and use. With multiple processes, each process has its own interpreter and its own lock, so processes cannot snatch each other's data from memory. Multiprocessing is not an ideal solution, however, because synchronization and communication are harder than with threads. For instance, if multiple threads accumulate into a variable declared as a global value, the threads can simply coordinate with a lock; but processes are separate operating system processes, and sharing memory between them is laborious: it is typically implemented by declaring a queue and using its put/get methods. This is why the GIL remains one of the most difficult problems in Python[14].
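The queue-based sharing described above looks like this with the multiprocessing library. This is a minimal sketch (assuming a POSIX fork start method; on Windows/macOS spawn, the code would need an `if __name__ == "__main__":` guard):

```python
import multiprocessing as mp

def worker(n, queue):
    # Each process counts in its own interpreter (its own GIL) and
    # returns the partial result through the shared queue.
    queue.put(sum(1 for _ in range(n)))

queue = mp.Queue()
procs = [mp.Process(target=worker, args=(1_000_000, queue)) for _ in range(2)]
for p in procs:
    p.start()
total = sum(queue.get() for _ in procs)   # aggregate via put/get
for p in procs:
    p.join()
print(total)   # 2000000
```

Unlike the threaded counter, the two processes really do run on two cores, but every shared value has to travel through the queue rather than ordinary variables, which is the synchronization cost the text refers to.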

2.6 How Spark and Dask Avoid the Global Interpreter Lock

Spark runs on the JVM and is developed in Scala, but it provides a Python interface. How does Spark avoid the GIL problem? When Python is used to manipulate Spark, the RDD exists only at the interface layer; the real computation is still performed by Scala Spark. The JVM spawns a Python process to run the Python scripts, so there is no GIL issue. With Scala Spark, all processes and threads run in the same JVM without creating additional process space or copying data between the JVM and a Python process[15].

Dask releases the Global Interpreter Lock via the Pandas and NumPy libraries: because the Dask array implements the Pandas and NumPy interfaces, whose operations release the GIL during computation, multiple threads can run simultaneously, potentially improving performance[11].

2.7 Spark Resilient Distributed Datasets

Apache Spark is a fast and general engine for large-scale data processing, open sourced by Apache and developed in Scala. It supports multiple language interfaces such as Java, Python, Scala and R[16]. Spark has a pivotal position in big data processing and distributed computation, and it is compatible with Hadoop HDFS. Hadoop primarily uses MapReduce for processing data, whereas Spark uses resilient distributed datasets (RDDs). An RDD is a fault-tolerant collection of elements that can be operated on in parallel, and RDDs endow Spark with superior computation performance. That is why many organizations have adopted Spark, using it for machine learning, big data analysis and real-time data computation. This project implements Python Spark for the research and experiments. However, Python Spark is slower than Scala Spark because it only inherits


the Scala interface. With Python Spark, the JVM has to spawn a Python process to run the Python scripts, while with the Scala version all processes and threads run in the same JVM without creating additional process space or copying data between the JVM and a Python process[15][16][17].

RDDs are motivated by iterative algorithms and interactive data mining, where existing systems are inefficient. In both cases, keeping data in memory can improve performance by an order of magnitude. RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. RDD operations are divided into transformations and actions: no matter how many transformation operations are executed, an RDD performs no actual computation, because only an action triggers the real execution. In the RDD's internal implementation, the bottom interface is based on iterators, which makes data access more efficient and avoids excessive memory consumption[17]. An RDD is an immutable collection of objects spread across a cluster; as a data structure it is a collection of read-only partitions, where each partition is a slice of the dataset[15]. Figure 6 shows that when Spark RDD loads a batch of engineering data, the RDD processes the .dat files in parallel, with the number of partitions configurable according to the CPU cores.

Figure 6: Spark RDD processing .dat files in parallel
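The transformation/action distinction can be illustrated in plain Python with generators, which are lazy in the same sense. This is only an analogy, not PySpark code (the PySpark equivalents would be rdd.map, rdd.filter and an action such as collect):

```python
data = range(10)

# "Transformations": building the pipeline does no work yet, just as
# RDD map/filter only return a new RDD description (the lineage).
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# "Action": only consuming the pipeline triggers the computation,
# just as RDD collect/count trigger execution of the whole lineage.
result = list(evens)
print(result)   # [0, 4, 16, 36, 64]
```

The practical benefit is the same in both cases: since no intermediate collection is materialized until the action runs, the scheduler (or the generator chain) can stream elements through the pipeline instead of holding every stage in memory.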

RDD parallelism
Spark provides persistence and guarantees data processing performance. RDD partitioning improves parallelism, and more partitions increase the parallel computing capability. This ensures that Spark can maximize the utilization of hardware and system resources. The combination of persistence and partitioning allows big data to be processed more efficiently. The RDD parallelism level can be configured in Spark[15].

RDD fault tolerance
An RDD is an immutable data structure. Immutability rules out a significant set of potential problems due to concurrent updates from multiple threads, and an immutable data type is safe to share across processes. The usual approaches to fault tolerance are data replication and logging; however, both are costly for big data systems, because the data is oversized and replicating it across platforms is expensive. RDDs instead use lineage for fault tolerance: an RDD memorizes the graph of operations used to build it, so when a task fails, it can be recomputed from the previous steps in the lineage[17].

Since Spark does not rely on replication for fault tolerance, it reduces the cost of data transfer across the network. However, in certain scenarios Spark also requires a logging mode to support fault tolerance; examples are Spark Streaming operations that update data or invoke streaming window functions, which need to resume intermediate state. The checkpoint mechanism can be used to regenerate a previous status[17].


2.8 Spark Distributed Computation

Since Sympathy for Data needs scalable distributed functionality for big data processing, it can be combined with Spark distributed computation. Spark distributed computation builds on the Hadoop ecosystem, in which YARN and HDFS are the primary components of the distributed framework. The distributed computation is designed to run on a large number of commodity devices. YARN acts as the task scheduler, coordinating the computation tasks and jobs. HDFS provides the distributed storage system in the cluster, using a master/slave structure: the master node is responsible for storing metadata and for scheduling, and the slave nodes store the files. This project focuses on how to maximize the utilization of each node in order to avoid a large cluster infrastructure, because maintaining large amounts of commodity hardware is expensive[18].

Figure 7 shows the internal architecture of HDFS, which consists of a NameNode that manages the cluster metadata and a number of DataNodes that store the files and directories. Internally, a big data file is split into multiple 128 MB blocks, whose integrity is guaranteed by hashing; the blocks are stored on DataNodes, and each block is independently replicated on multiple DataNodes. The NameNode performs file system namespace operations such as opening, closing and renaming files and directories, and it determines the mapping between blocks and DataNodes. The DataNodes serve read/write requests from the file system clients, and they perform block creation, deletion and replication upon instruction from the NameNode [19].

Figure 7: Hadoop Distributed File System Structure [19]

YARN is the task scheduler in the Spark cluster; its purpose is to split up large tasks and jobs into separate daemons. Figure 8 shows the execution of Spark on the YARN architecture. When Spark executes on YARN, the Spark executors run as containers. The ResourceManager is the ultimate authority that arbitrates resources among all of the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitors system resource usage such as CPU, memory, disk and network, and reports the status to the ResourceManager. Spark supports two modes for running on YARN: yarn-client mode and yarn-cluster mode. In yarn-client mode, the driver runs in the client process and the application master is only used for requesting resources from YARN. In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster, and the client can leave after initiating the application [20].

Running Spark on YARN has numerous benefits, because YARN allows the cluster resources to be shared dynamically between different frameworks that execute on YARN, such as Dask and Hadoop. For instance, a MapReduce job can execute after a Spark job without any changes to the YARN configuration. YARN can also schedule workloads by categorizing, isolating and prioritizing them. YARN is the only cluster manager for Spark that supports security; with YARN, Spark can be combined more efficiently with HDFS and can use secure authentication between its processes.


Figure 8: Spark on YARN Architecture.

The Spark RDD uses partitioning to improve parallelism and IO throughput when tackling big data on a single node or a cluster, where IO intensiveness otherwise causes inefficiency. RDD uses several mechanisms to avoid such issues. For example, RDD applies caching for persistence on disk and in memory. The Kryo library is used for object serialization, avoiding unnecessary time delays. Snappy and the LZF algorithm are used for compressing objects, which improves space usage and performance. Through RDD, Spark reduces the IO and network transfer overhead and achieves an irreplaceable high-performance framework[15][21].
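These mechanisms are switched on through Spark configuration. A minimal sketch of a spark-defaults.conf fragment is shown below; the property names are standard Spark properties, but the chosen values are illustrative rather than the thesis setup:

```
# Serialize objects with Kryo instead of Java serialization.
spark.serializer              org.apache.spark.serializer.KryoSerializer

# Compress serialized RDD partitions to trade CPU for space and IO.
spark.rdd.compress            true

# Codec used for internal compression (lzf or snappy, among others).
spark.io.compression.codec    lzf
```

The same settings can equivalently be passed programmatically through SparkConf or on the spark-submit command line with --conf.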

2.9 Dask Parallel Computation Library

Dask is a flexible library for data analysis and parallel computation. It emphasizes out-of-memory computation, utilizes multiple cores and blocked algorithms, and acts as a dynamic task scheduler. Dask integrates the Pandas and NumPy interfaces for more efficient big data analysis. Dask has good scalability, flexibility and high-throughput characteristics, and it maximizes the utilization of CPU cores and memory to improve high-performance computation[11].

Dask tackles large datasets through parallel computation and blocked algorithms. It provides rich big data collections, such as arrays, bags and dataframes, to fit various big data analysis scenarios. Figure 9 shows the primary Dask collections, which are based on graphs for task scheduling[22].

Figure 9: Dask components [22]

Dask is composed of dynamic task scheduling and big data collections such as arrays, bags and dataframes. These parallel collections work on top of the dynamic task schedulers. The core principle of the task scheduler when working on large datasets is to break up large arrays into many small arrays; each small array performs a particular task, and the results are aggregated. Working on these pieces minimizes the memory footprint of the computation and efficiently streams data from disk. Dask's parallel computation characteristics can maximize the utilization of the computer's CPU cores and memory[22].

Dask arrays are suitable for analyzing large complex data types, such as n-dimensional, time-series, N-matrix weather data or hierarchical data. Figure 2 shows the Sympathy for Data internal data source signals


that are shown as rasters; these can be represented by a Dask array. Sympathy for Data uses Pandas arrays for interpolating signals, and a Dask array can perform better on an n-dimensional array. The Dask graphs are used to create a NumPy-like library that maximizes the utilization of system resources and operates on data sources larger than memory. A Dask graph is a dictionary mapping keys to values or tasks, where graph creation and graph execution are separable problems. Dask.array is commonly used to perform data analysis or to speed up expensive in-memory computations that use multiple cores and threads, such as in image analysis or statistical and machine learning applications. Figure 10 shows how the Dask array arranges many small NumPy arrays into a grid[22].

Figure 10: Dask arrays coordinate many NumPy arrays arranged into a grid [22]
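The grid layout in Figure 10 can be mimicked directly with NumPy. This is a sketch of the idea only; dask.array builds such a chunk grid automatically (via dask.array.from_array with a chunks argument) and schedules one task per block:

```python
import numpy as np

x = np.arange(36, dtype=float).reshape(6, 6)

# Split the 6x6 array into a 3x3 grid of 2x2 blocks, like Dask chunks.
blocks = [[x[i:i + 2, j:j + 2] for j in range(0, 6, 2)]
          for i in range(0, 6, 2)]

# A whole-array reduction becomes one small task per block plus an
# aggregation step over the intermediate results.
block_sums = [b.sum() for row in blocks for b in row]
assert sum(block_sums) == x.sum()
print(sum(block_sums))   # 630.0
```

Since each block fits comfortably in memory, the blocks of a much larger array could be loaded one at a time from disk, which is how the blocked approach supports larger-than-memory data.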

A Dask bag parallelizes computations across a large collection of elements. It is particularly useful for dealing with large quantities of semi-structured data such as JSON or CSV files. Dask bags have parallel and iterating features: they allow extensive data to be split up and executed in parallel on multiple machines with multiple cores. The principle is similar to the Spark RDD: execution is lazy, which allows smooth execution on larger-than-memory data[22].

A Dask dataframe is a collection that implements the Pandas DataFrame interface, using blocked algorithms to split one large DataFrame into many small DataFrames. One operation on a Dask DataFrame triggers many Pandas operations on the constituent Pandas DataFrames, in a way that accounts for potential parallelism and memory constraints. This allows computation on dataframes that are larger than memory while utilizing all of the CPU cores[22].
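The blocked DataFrame idea can be sketched with plain Pandas. This is an illustration only; dask.dataframe generates and schedules these per-partition operations itself:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a", "b"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Split one large DataFrame into partitions, as dask.dataframe does.
partitions = [df.iloc[:3], df.iloc[3:]]

# One logical operation (a grouped sum) becomes one Pandas operation
# per partition plus a combine step over the partial results.
partials = [p.groupby("group")["value"].sum() for p in partitions]
combined = pd.concat(partials).groupby(level=0).sum()

print(combined.to_dict())   # {'a': 9, 'b': 12}
```

Because each partition is processed independently before the combine step, the partitions can be evaluated in parallel or streamed from disk, which is what makes the approach work for larger-than-memory dataframes.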

2.10 Dask Distributed Computation

Dask distributed computation is a component implemented as a centrally managed distributed system with a dynamic task scheduler. The distributed scheduler process coordinates the actions of several Dask workers spread across multiple machines in a cluster and responds concurrently to the requests of several clients. The concurrent.futures and Dask interface APIs are used for moderately sized clusters. The scheduler is asynchronous and event-driven: it simultaneously responds to computation requests from various clients and tracks the task progress of multiple worker nodes. This event-driven, asynchronous design makes it flexible enough to handle a variety of workloads from multiple clients at the same time, while also coping with worker failures and additions. Within the cluster, the workers exchange data over TCP [22].
Dask distributed computation has many properties, listed below:
Low latency: with the Dask distributed scheduler, each task suffers around 1 ms of scheduling overhead.
P2P data sharing: workers in the cluster communicate and share data over a peer-to-peer network.
Complex scheduling: supports complex task scheduling, including nd-arrays, machine learning matrix processing and statistics.
Data locality: data migration is expensive in distributed systems; Dask distributed minimizes data movement when possible and enables the user to take control when necessary.

The Dask distributed architecture can be seen in Figure 11. The system has a single centralized scheduler and several workers with potential clients. The clients send Dask graphs to the central scheduler, which distributes the tasks to worker nodes and coordinates the execution of the graph tasks. Although the scheduler centralizes the metadata, the workers themselves handle the transfer of intermediate data in a peer-to-peer manner. Once the Dask graph is completed, the workers send the results back to the scheduler


Figure 11: Dask distributed scheduling, adapted from [23]

that automatically matches them to the appropriate workers. Internally, the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. Using this acyclic graph, the scheduler separates big tasks into many small tasks and submits them to workers. Each worker performs its computation using its own resources and reports the results to the scheduler[23].

Dask implements its high degree of parallelism through the following scheduler methods, which avoid the GIL and give the collections flexibility in how tasks are scheduled:

dask.threaded.get implements a scheduler backed by a thread pool.
dask.multiprocessing.get implements a scheduler backed by a process pool.
dask.async.get_sync is a synchronous scheduler, which is useful for debugging parallel performance.
distributed.Executor.get is a distributed scheduler for executing graphs and splitting graph tasks.

During distributed computation, failures happen frequently, and Dask distributed is no exception. The Dask client provides a mechanism to recover all of the workers in the cluster: the client's restart method kills all workers, clears all scheduler state and then brings all worker nodes back online, resulting in a clean cluster. Compared to Spark, Dask distributed lacks convenience and automation, but it fits well with future extensions; a more detailed comparison is discussed in the conclusion chapter.


3 Methodology

This chapter discusses the methods used for the project research and development. The project seeks a solution to the problem presented in Section 1.2 and concludes by discussing the methodology used to implement the solutions. It discusses how to integrate Apache Spark and Dask into Sympathy for Data internally, how to implement distributed, parallel computation in Sympathy for Data to process big data, and the limitations encountered.

3.1 Goal

When Sympathy for Data executes the CDE workflow for large-scale data analytics, over 84% of the time is spent on the import data part. That part loads large numbers of ADAF files into memory for the interpolation of specific signals and the computation of missing signals; as a result, it causes disk and CPU IO intensiveness and out-of-memory situations. In order to solve this bottleneck, the project addresses the following challenges:
Firstly, analyze and document the current distribution between IO- and CPU-intensive operations in Sympathy for Data analysis workflows.
Secondly, evaluate Spark and Dask, both highly parallel computation frameworks proposed for efficiently handling IO- and CPU-intensive operations, in the context of the Sympathy platform; map the pros and cons of each of the chosen methods in the current context and propose solutions.
Thirdly, develop, implement and benchmark the suggested solutions and architectures for scaling data analysis workflows in order to analyze large amounts of data.
Fourthly, present the results of the proposed solutions as well as the limitations and suggestions for future development.
Finally, develop nodes for Sympathy for Data that execute big data analytics on Spark in order to effectively improve the throughput of engineering data.

3.2 Spark Solution

This section discusses the Spark implementation in Sympathy for Data for optimizing disk, memory and CPU IO intensiveness and scalability. Sympathy for Data provides potential benefits as a software platform designed to simplify the reuse of code through well-defined data exchange formats. As described in Section 1.2, the Sympathy for Data workflow execution for data analysis spends 84% of its time on the data import part, with the remainder spent on selecting specific data and generating results. This project focuses on that 84%. However, there are several challenges to integrating Spark into Sympathy for Data:

First, HDF5 files are not natively supported by Spark. This project considers two methods to solve this problem: converting the target files to make them compatible with a Spark file format, or implementing a file system for Spark. This project migrated the internal ADAF data type to be executable on Spark in order to handle the HDF5 files.

Second, the data sources for the data analysis are very large, which makes data migration very expensive. It was therefore necessary to design a middleware connecting the workflow and Spark. The middleware is a shared folder between Spark and Sympathy for Data, which reduces the time required to transfer data to the Spark platform.

Third, as can be seen in Figure 4, the import data sub-flow of the CDE data analytics workflow internally consists of over 300 nodes, each with a particular function. Extracting the rules was a major challenge: over 300 nodes were rewritten into pure .py files, which the middleware submits to Spark over SSH.

Finally, cluster computation has a global variable problem: when Spark works with multiple nodes, the .py file must be shared with all of the computation nodes. Spark provides two mechanisms, broadcast variables and accumulators, for solving such problems. Before the middleware connects to Spark to start the computation, it broadcasts the .py file to the distributed cluster.


3.3 Dask Solution

Dask is a flexible parallel computing library for data analytics, developed in Python, which provides various data structures that can be used for big data analysis and machine learning, such as arrays, bags and dataframes. As discussed previously for the Sympathy for Data internal ADAF file structure in Figure 2, our target data source is in a hierarchical data format that is time-series and n-dimensional. In Sympathy for Data, the biggest problem is the data interpolation process. The interpolation operation loads a batch of time-series or hierarchical files into memory and uses the ADAF for sampling, computing missing signals, filtering signals and so on. Sympathy for Data natively uses pandas arrays to process these time-series data, converting the signals to ADAF and showing them as rasters. Since the signals in this research are too large for pandas, which has some limitations here, the pandas arrays were replaced by Dask arrays. The conversion of signals to rasters then runs in parallel, and when a signal is too large, Dask can use distributed computation to set multiple workers to compute the signal together.

Sympathy for Data is a node-based visual graph data analysis platform. Figure 12 shows the SparkImporter architecture of the implemented solution: Sympathy for Data is built on its node library and third-party libraries, and the SparkImporter, developed in Python, exists as a node in the workflow. It connects to the Spark machine over SSH to submit computation tasks, while the SparkImporter and Spark both share the Hadoop HDFS storage. This project developed the SparkImporter middleware to connect Spark and the workflow. Given its configuration, the SparkImporter can automatically submit tasks to the Spark platform. Spark uses RDDs to process data in parallel and returns the sampled signals to a folder shared with Sympathy for Data. When the data size is too large for a single machine, the Spark or Dask distributed systems must be set up manually. For cluster management there are many types of software, such as Hortonworks and Cloudera; such commercial cluster management software is used to manage Spark or Hadoop clusters, while Anaconda makes it easier to manage the Dask cluster.

Figure 12: SparkImporter Architecture
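The SparkImporter's SSH submission could be assembled as sketched below; the host name, user and script path are hypothetical placeholders for values the real node takes from its configuration dialog.

```python
def build_submit_cmd(host, user, script, master="yarn"):
    # Wrap a spark-submit invocation in an ssh call; the middleware would
    # execute the returned command with subprocess.run(cmd).
    remote = "spark-submit --master {} {}".format(master, script)
    return ["ssh", "{}@{}".format(user, host), remote]

cmd = build_submit_cmd("spark-master", "sympathy", "/shared/cde_workflow.py")
```

Keeping command construction separate from execution makes the node easy to test without a live cluster.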

3.4 Integration

Based on the theoretical-level investigation and research, it was necessary to re-write the over 300 CDE workflow nodes as pure Python script code, and to ensure that the script was executable in both the Spark and Dask environments for benchmark measurement. Finally, based on the benchmark results, it was necessary to decide which framework should be combined with the Sympathy for Data product. Although Spark and Dask are both big data processing frameworks, they implement different mechanisms for data processing and do not conflict when combined in a product. The main focus of this thesis is on developing the software and researching system resource utilization in order to improve Sympathy for Data's utilization for big data.

This research used qualitative and quantitative methods[24], with several benchmarks to analyze various scales of vehicle engineering data. It also used a comparative method to analyze the data processing speed and internal structure of Spark and Dask. Both frameworks were implemented on the platform to process different scales of data, and the completion time together with the disk, memory and CPU utilization were used to choose the suitable solution to integrate into Sympathy for Data


and contribute to future data processing. Finally, we compared the speed and resource usage to verify whether our hypothesis was correct.


4 Sympathy for Data Optimization

This chapter summarizes the details of the implementation work completed in Sympathy for Data and describes the methodologies used in the development. There were several challenges, described below:

1. Load HDF5 files in Spark: HDF5 is a hierarchical data format that Spark does not yet natively support.

2. Re-write the over 300 CDE workflow nodes and formulas as pure Python code to ensure that they are executable on Spark and Dask for benchmarking.

3. Develop the middleware connecting Sympathy for Data and Apache Spark.

4. Submit the Sympathy for Data task to Spark and return the sampled .sydata files to the shared folder so that the CDE workflow can continue the rest of the data analytics.

5. Replace the pandas arrays in the Sympathy for Data internal architecture with Dask arrays.

6. Broadcast the Spark distributed computation rule files to all of the cluster nodes.

The majority of this chapter addresses these challenges. It also discusses the blocked algorithm internally and how the Dask array uses it to process hierarchical data sets in parallel.

4.1 Development Environment Specifications

The development was conducted on Windows 7 and on Ubuntu 14.04 Desktop virtual machines, because Spark is more compatible with a Linux environment. Since this project used several cutting-edge frameworks, all with fast release iterations in which each version made a substantial difference, Table 1 records the specific details of the development environment.

Table 1: The Development Environment and Framework Versions.

4.2 CDE Workflow Formulas and Filter Rules

A workflow is constructed from a number of sub-flows, which improve its structure; as a result, sub-flows can be created from some of the nodes in a workflow. The CDE workflow shown in Figure 4, discussed in Chapter 2, is constructed from multiple sub-flows. Each sub-flow is formed by multiple nodes that perform a particular function, such as a filter rule or a computation formula. During execution, it loads a batch of .dat files together with the filter rule and math formula scripts into memory to compute missing signals or sample the required data. When the computation finishes, it generates the sampled .sydata file in the output folder. For this project, the CDE workflow was re-written as Python script code, since the nodes behind the workflow are implemented in Python anyway. Because Spark and Dask use completely different collections and operations, this project implemented a Dask CDE workflow and a Spark CDE workflow. Both execute the same logic as the original CDE workflow, running on Dask and Spark respectively. The next chapter discusses how they were used to measure performance.
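Re-writing visual nodes as script code amounts to turning each node into a plain function and each sub-flow into a composition of those functions, which is what makes the same logic portable to both Dask and Spark. A minimal sketch with illustrative (not actual CDE) node names:

```python
from functools import reduce

def load(record):
    # A node that ingests a record.
    return dict(record, loaded=True)

def filter_rule(record):
    # A node acting as a filter rule: drop records without a VIN.
    return record if record.get("vin") else None

def resample(record):
    # A node applying a computation formula.
    return dict(record, sampled=True)

def pipeline(record, nodes):
    # Run the record through the node functions; a dropped record stays None.
    return reduce(lambda r, node: node(r) if r is not None else None,
                  nodes, record)

kept = pipeline({"vin": "ABC123"}, [load, filter_rule, resample])
dropped = pipeline({}, [load, filter_rule, resample])
```

Once nodes are plain functions, mapping them over an RDD or a Dask bag is a mechanical change.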


Algorithm 1 below shows the execution steps of the Spark CDE workflow. The hdfsFile is obtained from Hadoop, and the map function reads over 3000 individual files in parallel; a foreach then aggregates the results back to Sympathy for Data. In Figure 4 there are six steps: Import data, Vehicle config, Filter files, Create subflows, Get Item list and B katDiagnos. Each module is encapsulated in a different method, as shown in Algorithm 1; the modules are passed as parameters and executed by the Spark RDD for parallel processing.

Algorithm 1 Spark Algorithm for the CDE workflow

adaf_objs = hdfsFile.map(read_dat_hdfs) \
                    .map(ExtractVIN) \
                    .map(process_dat_adaf).map(sort_adaf) \
                    .map(vehical_config) \
                    .filter(do_filter) \
                    .map(subsetMetaData)
adaf_objs.foreach(saveSydata)

Spark supports various file formats. However, the vehicle engineering data is more naturally represented as a hierarchical ADAF file, which is irreplaceable for Sympathy for Data. The original data format used MDF for interpretation into ADAF, mainly to automatically convert the .dat files into groups and lists of data signals shown as rasters in Figure 2. To combine Spark with Sympathy for Data, the internal ADAF library needed to be migrated to the Spark platform; this is possible because Spark is compatible with Python libraries. To implement the internal file type on Spark, we extracted the internal Sympathy for Data libraries onto Spark.

Figure 13 shows the data import sub-flow portion of the CDE workflow. The ADAF node takes the longest time, and the other nodes wait until it completes: a total of 84% of the time is consumed on ADAF conversion to rasters and signal interpolation. Given these circumstances, the contribution of this part is the design of the SparkImport middleware that connects Sympathy for Data and Spark. When Sympathy for Data executes the workflow for data analytics tasks, SparkImport submits the data import task to Spark, which is responsible for the data filtering and computation. Spark processes data primarily in RDD partitions, so the number of partitions influences the computation performance and the available parallelism; as a result, it improves the parallel capacity and CPU/I/O throughput.

Figure 13: The data import sub-flow of the CDE workflow

4.3 SparkImporter

Apache Spark is at home in the Linux environment and can better utilize the capacity of its resources. The test results in the next chapter verify that Spark has excellent performance. Therefore, we decided to implement Spark as the processing engine for Sympathy for Data. The outcome of the project is the SparkImport middleware shown in Figure 14. It requires configuring the Spark host IP, port, login information and data source information to connect to the Spark client. It also


requires Spark to be installed on the Ubuntu VM and Sympathy for Data on Windows, connected through a folder shared between the VM and the Windows host. The SparkImport middleware is implemented as a pipeline between Spark and Sympathy for Data: when Sympathy for Data executes the workflow, the SparkImport node can be dragged and dropped to connect the workflow to Spark. Since big data migration is costly, SparkImport uses SSH for communication between the machines. When the workflow triggers the data analysis, the SparkImport node gathers the filter formulas and rules of the preceding nodes, encapsulates them in a Python script, and submits it to the Spark master node for data processing. Because Spark and Sympathy run in isolated environments, the re-sampled data sources are generated into the folder shared between Spark and Sympathy for Data. Once Spark completes the execution, it signals the workflow node and Sympathy for Data continues with the rest of the data analysis.

Figure 14: Middle-ware SparkImporter Configuration

4.4 Dask Internals in Sympathy for Data

Dask naturally uses the standard Python, NumPy and pandas interfaces, transparently executing operations and applying blocked algorithms when datasets are larger than the machine's memory. When Sympathy for Data processes .dat files, the procedure loads a batch of .dat files into memory and the ADAF node automatically directs the interpretation of the .dat files into .sydata. Some missing signals, however, require the execution of math scripts for re-sampling. The signal data is represented as rasters, shown in Figure 2: the ADAF internally consists of 500-600 columns, each a pandas array. Pandas arrays have limited parallelism and scalability. Since Dask has good parallelism properties and uses blocked algorithms, it can dramatically improve the utilization of the system's CPU, memory and disk. Dask also supports distributed computation: when the data is too large for a single node, a Dask array can be submitted to a Dask scheduler, which uses several nodes to perform the computation. This significantly improves parallelism and scalability.

Figure 15 shows the internal sequence of operations of a Dask array as a scheduling graph: a large task is split into many small tasks as chunks, and each chunk performs a small computation. The chunk size has critical performance implications. If the chunks are oversized, queuing up operations becomes extremely slow, because Dask transforms each operation into a huge graph mapped across the chunks. Computation on Dask arrays with very small chunks can also be slow, because each operation on a chunk carries some fixed overhead from the Python interpreter and the Dask task executor. A reasonable chunk size should therefore be based on the data format. In this project, we used chunk sizes between 1,000 and 1,000,000 elements for measuring performance, as demonstrated in the next chapter. Algorithm 2 shows a short computation using a Dask array with chunks of 200 elements to find a minimum value. In Algorithm 2, the result of the computation is not a value but a Dask array object containing the sequence of steps needed to compute that value. Figure 15 shows the computation graph of Algorithm 2, read from bottom to top; it shows exactly what happens when the computation is triggered. First the array is ingested by da.from_array and split into five chunks of 200 elements. Each of those chunks is multiplied by 4 and its minimum is found. Eventually, the global minimum is found among these chunk-wise minima and the result is aggregated back up[25].


Figure 15: Dask Chunks Aggregation Graph

Algorithm 3 shows the Dask array in the internal process of converting .dat files to ADAF and showing signals as rasters. The original conversion procedure used a pandas array to interpret the signal channels; when the signal size is very large, this leads to poor performance, because the pandas array must load the signals repeatedly. The Dask array is the better solution in these circumstances: when the signals are oversized, it can split them into many chunks and execute the computation in parallel. Algorithm 3 shows the pandas array replaced with a Dask array able to process signals in chunks of 1,000,000 elements. In the subsequent development, when the workflow loads a batch of .dat files for computation, the Dask array is responsible for processing the signals and converting them to rasters in parallel. Because this experiment involves a massive amount of data, chunks of up to 1,000,000 elements were used, which improved performance 2-3 times compared to the pandas array.

Algorithm 2 Dask Array with chunks

import numpy as np
from dask.dot import dot_graph
import dask.array as da
# Create an array of random numbers using numpy
randn = np.random.randn(1000)
# Create a dask array with chunks of 200 elements to store the data
da_array = da.from_array(randn, chunks=200)
# Multiply this array by a parameter 4
minimum = da_array * 4
result = minimum.min()
# Generate the dask graph
dot_graph(result.dask)


Algorithm 3 Dask array optimized Dask CDE workflow

import dask.array as da
da_array = da.from_array(signal.value, chunks=1000000)

4.5 Sympathy for Data Distributed Computation

The Apache Spark cluster system depends on the Hadoop ecosystem, with HDFS as the distributed storage system and YARN for distributed task scheduling. For Sympathy for Data to support distributed computation, large-scale computation tasks need to be submitted to the Spark master node, which is responsible for scheduling the computation and distributing tasks, while the Spark cluster obtains the target data sources from HDFS. However, Spark is only responsible for the data import part of the workflow computation and for generating the sampled .sydata from the .dat files. The rest of the computation and analytics is still completed by the Sympathy for Data workflow, which means the sampled data needs to be re-stored in the Sympathy for Data shared folder. There were three possible solutions: 1. manual transfer of the data; 2. a shared folder between Spark and Sympathy for Data; and 3. using the network file system to map the files to the target folder. We chose the third method: the network file system protocol is used to map HDFS, so when Spark finishes the computation, the sampled data is saved on the master node, which shares the folder with Sympathy for Data. Another problem is Spark's distributed computation global variables: the workflow rules and formula scripts need to be obtained by every participating slave node. Fortunately, Spark provides a broadcast and accumulator mechanism that can share the .py file with the slaves. The implementation is shown in Algorithm 4, which uses Spark broadcast to share the filter rules and formula scripts so that each node performs a consistent computation.

Algorithm 4 Spark broadcast variables on the cluster

bc_bad_signals = sc.broadcast(get_bad_signals(bad_signal_file))
bc_vehical_config = sc.broadcast(get_config_spark(vehical_config_file))
bc_output_dir = sc.broadcast(output_dir)
bc_spec = sc.broadcast(get_spec())

4.6 Dask and Spark Distributed Computation

Compared with Spark, Dask distributed computation is lightweight and flexible. Dask is Python-based and its distributed scheduler is a lightweight framework. Dask distributed also supports HDFS and AWS S3, and Dask is compatible with the ADAF, which makes the development work efficient. Sympathy for Data uses the workflow to execute data analytics, and Dask distributed can dynamically schedule a vast array across multiple nodes; each node uses its own CPU and memory to perform part of the Dask array computation and eventually returns the results to the client.
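The scatter/gather shape of this scheduling can be mimicked locally: chunks of a large array are submitted to a pool of workers and the partial results are gathered on the client, much as `Client.submit`/`gather` behave against a Dask scheduler. A stdlib stand-in (an analogy, not the Dask API itself):

```python
from concurrent.futures import ThreadPoolExecutor

def worker_sum(chunk):
    # The piece of work each "worker node" performs on its chunk.
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker_sum, c) for c in chunks]  # scatter tasks
    total = sum(f.result() for f in futures)                # gather results
```

Replacing the thread pool with a `dask.distributed.Client` turns the same code shape into a genuinely multi-machine computation.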

Distributed computation can significantly improve computation scalability. Both Dask and Spark primarily use high parallelism to increase throughput and scalability through maximum utilization of system resources, which enhances performance. Maximum usage of a single system not only helps performance but also avoids the complexity of a cluster infrastructure. The computation ability and cluster environment performance measurements and metrics are demonstrated in the next chapter.


5 Experiment and Evaluation

This chapter focuses on the experiments and their analysis. It introduces the experiment environment and analyzes the scalability and throughput of the optimized Sympathy for Data. This is important since 84% of the time is spent importing the .dat files into ADAF and converting them to .sydata. In this project, it was decided to measure and optimize the data import part of the Sympathy for Data workflow shown in Figure 4.

Since this research used a comparison methodology for measurement, we compare the original Sympathy for Data CDE workflow, the Dask CDE workflow and the Spark CDE workflow on different scales of vehicle engineering data. During the experiment, we ran both a single-machine test and a cluster test deployed on Amazon Web Services to measure distributed computation performance. In this analysis, different scales of data are measured to evaluate CPU, memory and disk throughput and parallelism, which will show whether or not our hypothesis is correct.

5.1 Results

The evaluation environment was based on Amazon Web Services EC2 in the Virginia data center, as shown in the table below: one c4.4xlarge machine with Ubuntu Desktop 14.04 for the stand-alone test and four c4.2xlarge machines for the distributed cluster test. Both the Dask distributed and the Spark cluster have 32 cores and a total of 64 GB of memory. Since the aim of this research is to promote big data processing, the measurements used over 640GB of engineering data in over 3000 individual files. The internal data structure is shown in Figures 2 and 3.

Table 2: The EC2 Cluster Machines.

CDE workflow: for an accurate comparison with the Spark-optimized and Dask-optimized CDE workflows, the baseline CDE workflow was re-written from over 300 nodes to Python script without any optimization, and run in a pure Python 2.7 Anaconda environment. Table 3 shows the results of the CDE workflow execution with 80GB, 200GB and 640GB of engineering data. The procedure starts from loading a batch of .dat files, converts them to .sydata, and continues to generate reports and graphs. The total time consumed is 4526, 8781 and 15353 seconds respectively. Compared to Figure 1, the 80GB data import time is exactly the same; the other parts, however, are faster than on the original Sympathy platform, because the Python script code is executed without the Sympathy for Data framework.

Table 3: Time consumed by each part of the original Sympathy for Data CDE workflow (in seconds).

Dask CDE workflow: Table 4 shows the test results of the Dask array optimized CDE workflow. We examined 80GB, 200GB and 640GB of data, which consumed 2436, 6134 and 12012 seconds respectively.


Compared to the original CDE workflow (Table 3), it improved the data processing speed significantly, with a 40% time saving at 80GB, but only about a 20% time saving for the 200GB and 640GB data sets. The other parts, such as vehicle config, filter files, create subsets and B_KatDiagnos, take the same amount of time. Dask is faster because it uses array parallelism to process the signal channels and splits the chunks into reasonable quantities. Additional details about the chunk size distribution are discussed in the next section.

Table 4: Time consumed by each part of the Dask-optimized CDE workflow (in seconds).

Spark CDE workflow: the Spark CDE workflow results are presented in Table 5 below, showing the test results of the Spark-optimized workflow. The 80GB, 200GB and 640GB test data consumed 1777, 2560 and 8336 seconds respectively. Compared with the original CDE workflow and the Dask-optimized workflow, it dramatically improved data processing performance, roughly 3 times faster than the original and 2 times faster than the Dask-optimized version. However, the other processes, such as vehicle config, filter files, create subsets and B_KatDiagnos, took more time than in the original CDE workflow and the Dask-optimized version. This is because Spark is developed in Scala and based on the JVM; Python Spark is only an interface for invoking methods in Scala Spark. When pure Python script executes on Spark, the JVM and RDD occupy much of the system's resources and there is more complexity involved in executing the Python file, which led to more time being spent on the other parts. The Dask-optimized workflow and the original Sympathy for Data workflow remain within the Python ecosystem and execute in the ordinary way.

Table 5: Time consumed by each part of the Spark-optimized Sympathy for Data CDE workflow (in seconds).

Figure 16 compares the total time consumed by the original CDE workflow, the Dask CDE workflow and the Spark CDE workflow. Dask improved performance to two times faster than the original workflow, while Spark was 1-2 times faster than the original overall. The critical observation is that in the Spark CDE workflow test with 80GB of vehicle engineering data, the whole run from loading the .dat files to generating the reports takes about 2500 seconds in total; only 20-50% of the time is spent on the import data part, while vehicle config, filter files, create subsets and B_KatDiagnos take half of the time. Although the Spark-optimized workflow is much faster than the original, it still spends its time unevenly, because Spark is based on Scala and uses the JVM for garbage collection, so pure Python computation takes more time.

However, the goal of this project is to optimize the data import sub-flow, which does not influence the other data analytics. The comparison of the Dask CDE workflow with the original CDE workflow, and the improved results, show that our hypothesis is valid. We can conclude that increasing parallelism and I/O throughput significantly improves performance and speed. The next section discusses the related system resource usage and scalability analysis in more detail.


Figure 16: Total time consumed by the original workflow, the Dask-optimized workflow and the Spark-optimized workflow

5.2 Parallelism Throughput and Scalability

This section analyzes why Dask and Spark are faster. Comparing the Dask and Spark parallel computation frameworks proves that parallel computation can significantly improve performance, but it requires a trade-off between system resource utilization and a reasonable run time. According to the measurement results, the Dask chunks influence the parallel performance throughout the computation. During the measurement, we tested chunk sizes from 1,000 to 100,000,000 elements. Figure 17 shows the Dask chunk performance for 10-20GB of data, which is out of memory. When the chunk size is 1,000, processing is much slower; increased to 10,000 or more, it becomes naturally faster and subsequently remains stable. With small chunks the Dask array has to slice one file into chunks many times over, but once the chunk size is increased to 10,000 the speed gain persists. The chunk size thus influences performance and speed: a proper chunk configuration scales linearly with the available resources, and the chunk size should be based on the experiment's data sources, with higher-dimensional files needing more chunks.

Figure 17: Dask array Chunks Performance
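The chunk-size trade-off behind Figure 17 can be modelled roughly as total cost ≈ per-task overhead × number of chunks + compute time spread over the workers: very small chunks drown in scheduling overhead, while very large chunks lose parallelism. A back-of-the-envelope sketch with purely illustrative constants (not measured values):

```python
def estimated_cost(n_elements, chunk_size, overhead_per_task=1e-3,
                   compute_per_element=1e-7, workers=8):
    # cost ~= scheduling overhead per chunk + compute spread over workers
    n_tasks = -(-n_elements // chunk_size)  # ceiling division
    compute = n_elements * compute_per_element / min(workers, n_tasks)
    return overhead_per_task * n_tasks + compute

tiny = estimated_cost(10**8, 1_000)          # 100,000 tasks: overhead-bound
balanced = estimated_cost(10**8, 1_000_000)  # 100 tasks: overhead negligible
```

The model predicts the same qualitative curve as the measurement: cost falls steeply as chunks grow past a threshold, then flattens.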

Spark RDDs maintain parallelism and scalability through partitions. Once a large data set is loaded into an RDD, it is separated into many partitions according to the configured cores, and each partition performs a specific part of the computation. The partitions govern the parallelism: more partitions mean more parallelism, but parallelism is not omnipotent. Performance does not improve linearly with partitions, because while the CPU is processing large and complex data it also needs separate cores for the data computation itself. Since the partition numbers are tied to the CPU cores, too many partitions will overwhelm the system or reduce performance, and can therefore be counterproductive. Figure 18 below shows the RDD partition performance. We


used the Spark-optimized workflow to measure 10GB, 15GB and 20GB of data respectively. Performance peaks when the number of partitions increases to 4 or 8, since we used an eight-core CPU. In Figure 18, there is a linear speedup from adding partitions. In this project, the RDD is used for parallelism while the core processes still use ADAF, which occupies most of the system's resources.

Figure 18: Spark RDD partition parallel performance
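The partition/core relationship above can be captured in a small helper. Note the hedge: Spark's general tuning guidance often suggests two to three partitions per core, while this project's measurements peaked at 4-8 partitions on an eight-core machine, i.e. roughly one per core, so the factor is left as a parameter rather than hard-coded.

```python
import os

def suggested_partitions(cores=None, per_core=1):
    # Tie the partition count to the available cores; per_core is tunable
    # since general guidance and this project's measurements differ.
    cores = cores or os.cpu_count() or 1
    return cores * per_core

parts = suggested_partitions(cores=8)  # matches the 8-core test machine
```

Such a helper keeps the partition count from silently drifting away from the hardware it runs on.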

5.3 CPU Throughput Measurement

To measure scalability and throughput, Glances with InfluxDB was used on the Ubuntu system to observe system performance. Glances is a Linux tool for real-time monitoring of system status, such as disk throughput, CPU usage, network bandwidth and memory usage[26]. Figure 19 shows Dask and Spark processing 80GB of data for a CPU performance comparison: the top section is the Dask CPU status and the section below is the Spark CPU status. Since the RDD uses multiple partitions to process the .dat files, the Spark CPU usage maintains a level of 30%-40%, while Dask maintains only 10%, which means that Dask parallelizes primarily at the signal processing level. In this workflow, the majority of the internal computation is filtering, used to select particular columns, which does not involve frequent math computation. Compared to the Dask CPU usage, the Spark CPU usage is roughly 40% higher, which explains why Spark is faster than Dask.

Figure 19: Dask and Spark CPU performance comparison

The section above discussed CPU usage. Figure 20 shows the Dask and Spark disk I/O performance comparison. Dask is based on a blocked algorithm for memory optimization; as a result, its disk I/O has no particular optimization, leaving read performance the same as the original Sympathy for Data: a maximum read of 181 MB per second with an average of 59 MB per second, and a maximum write of 191 MB per second with an average of 60 MB per second. However, we observe that with the


Spark disk IO, the maximum read is 486 MB/s with an average of 104 MB/s, and the maximum write is 412 MB/s with an average of 99 MB/s. It takes a total of 20 minutes to read all of the files; since the monitor averages over one hour, the actual average during processing is between 300-380 MB/s. This is because Spark reads the data source simultaneously across four partitions, which improves the IO throughput and increases the speed.

Figure 20: Dask and Spark disk IO performance comparison

Dask and Spark use different mechanisms, which makes their memory throughput exhibit different trends. The Dask array procedure loads batches of data into ADAF, and ADAF slices the signal channel into many small chunks; each chunk is executed as a specific task, and all of the intermediate chunk results are then aggregated. Figure 21 compares the Dask and Spark memory performance. The Dask memory curve has a critical point: when usage nears it, the system automatically releases memory, so it never crashes the system. Because Spark runs on the JVM, Java garbage collection helps to reclaim memory. In addition, Spark loads the .dat files in parallel; loading all files simultaneously pushes Spark's memory usage to its maximum, and it takes only 20 minutes to complete the data processing.

Figure 21: Dask and Spark Memory Performance Comparison
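The blocked pattern described above — a small partial result per chunk, followed by an aggregation step — can be sketched in plain Python (dask.array automates and parallelizes exactly this kind of decomposition):

```python
def blocked_mean(values, chunk_size):
    """Blocked mean in the style of dask.array: compute a (sum, count)
    pair per chunk, then aggregate, so no step holds the full data
    beyond one chunk at a time."""
    partials = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        partials.append((sum(chunk), len(chunk)))  # one per-chunk task
    total, count = map(sum, zip(*partials))        # aggregation step
    return total / count
```

Because only per-chunk intermediates are kept, memory stays bounded by the chunk size rather than the dataset size, which is why Dask can release memory near its critical point instead of collapsing.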

Spark uses an RDD with many partitions to maximize the disk, memory and CPU throughput. This dramatically improves the performance of the system, as well as that of Sympathy for Data. Due to these benefits, we decided to use the Spark RDD as our final solution and integrated it into the Sympathy for Data product to support big data processing and analytics. Figure 22 shows the SparkImporter performance results in the Sympathy for Data product. As the time-consumption graph shows, it processed 700GB of engineering data in around 3000 individual .dat files. Sympathy for Data originally took 65,600 seconds, while the Spark-combined version took 23,000 seconds, roughly a threefold speedup. The Dask array is the internal memory layer; combined with Spark and Sympathy for Data, it can increase performance 4-6 times by maximizing the utilization of CPU, memory and disk, thus speeding up computation.
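The reported speedup follows directly from the two measured runtimes:

```python
baseline_s = 65_600   # original Sympathy for Data run, from the measurement above
spark_s = 23_000      # Spark-combined version
speedup = baseline_s / spark_s
print(f"{speedup:.2f}x")  # 2.85x, i.e. roughly the threefold improvement reported
```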


Figure 22: SparkImporter performance results in Sympathy for Data

5.4 Cluster measurement

In the cluster experiments, we evaluate the Spark and Dask cluster performance by deploying a cluster on Amazon Web Services with one master and three slaves of machine type c4.2xlarge, each with 16 GB of memory and an 8-core CPU; the configuration details are shown in Table 2. The test used 50GB of semi-structured data, comprising 720 individual files, stored on HDFS. Spark used YARN as the distributed scheduler for the distributed computation. Dask used Dask Distributed for the cluster, with one master as the scheduler and three slaves as workers responsible for the task computation.
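The two cluster setups can be launched roughly as follows (hostnames, file names and resource numbers are illustrative, not the exact thesis configuration):

```shell
# Dask Distributed: one scheduler on the master, one worker per slave.
dask-scheduler                              # on the master; listens on port 8786
dask-worker MASTER_HOST:8786 --nthreads 8   # on each of the three slaves

# Spark on YARN: submit the PySpark workflow job from the master.
spark-submit --master yarn \
    --num-executors 3 --executor-cores 8 --executor-memory 12g \
    workflow_job.py hdfs:///data/engineering/
```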

Figure 23 shows the Dask cluster parallel computation CPU status, where one scheduler is responsible for scheduling tasks to workers. The real computation is carried out by three workers, each retrieving data from HDFS; Dask uses hdfs3 to interact with the Hadoop ecosystem. In Figure 23, it can be seen that 24 cores execute tasks in parallel, since three slave machines were used, and each rectangle symbolizes a scheduled task. This is significantly more efficient than the single-core processing capacity of Sympathy for Data.

Figure 23: Dask Cluster Parallel Computation CPU status A

Figure 24 shows the Dask cluster parallel CPU status from task initialization until the task results are complete. There are 24 cores executing in parallel, and the job took a total of 730 seconds to complete. During the cluster computation, each slave's disk throughput averages 100-300 MB/s, the CPU stays at 30-100% usage, and memory usage rises to 100%; the details can be seen in Appendix A. The monitoring shows that the slaves' performance is significantly faster than a single node, since all nodes execute tasks in parallel. In Figure 24, the purple rectangles symbolize the file-processing status: each core reads from HDFS into memory, and the majority of the time is consumed reading data, because the experimental data source has 720 pieces and each core processes one file. The yellow rectangles symbolize the task computation and aggregation.

For the Spark computation, processing the same 50GB of semi-structured data in the cluster takes only 360 seconds, half the time of Dask, since the RDD processes the data in parallel in memory. Because Python Spark does not natively support cluster computation, this experiment used YARN to schedule Spark's distributed computation in the cluster. Monitoring the CPU, memory and disk throughput shows that Spark


Figure 24: Dask Cluster Parallel Computation CPU status B

averages up to 200-400 MB/s, the CPU stays at 100% usage, and memory is stable at 50% usage (see Appendix A). Compared to the Dask workers' utilization, Spark uses almost all of the system resources to achieve its high computation speed.


6 Conclusions and Discussion

This chapter contains reflections on the overall comparison between Spark and Dask. It also discusses and draws conclusions on the outcome of the Sympathy for Data optimization, and proposes possible directions for future work.

6.1 Spark and Dask Comparison and Discussion

Spark is an open-source, IO-intensive framework that combines with the Hadoop ecosystem for distributed computing. It uses RDDs for big data processing, with partitions that cut the dataset for parallel execution. Spark RDDs have a flexible interface that made it easier to migrate the CDE workflow to run on the Spark platform for benchmark testing. Combining Spark with Sympathy for Data improved the computation performance by four times. In the distributed cluster experiment, using three slaves to execute tasks simultaneously only improved performance by an additional factor of two. This project used the Spark CDE workflow to test different scales of data. The results proved that the number of partitions affects performance: additional partitions result in more parallelism, but more parallelism is not always faster. Thanks to the capabilities of the Spark RDD, the disk read/write performance also improved, reaching 400 MB/s during the tests.

Dask is a flexible parallel computing library for data analytics. It is a dynamic, parallel programming library that combines with numeric Python libraries such as pandas, NumPy and SciPy. Dask provides parallel data collections such as arrays, bags and dataframes, and as a Python library it can easily be integrated into a third-party Python platform. However, Sympathy for Data is based on workflows for executing data analytics tasks, and during this project only the Dask array proved useful for parallel signal processing; because the project is based on the Sympathy workflow, Dask.dataframe and Dask.bag were difficult to implement in the workflow for testing. Dask parallelizes and maximizes the utilization of CPU cores by splitting large computations into chunks; in this project it handled over 10,000,000 Dask chunks, and a properly chosen chunk size makes the calculation easier. Despite Dask's superb performance, its test results still show a gap with Spark: the read/write measurements show Dask reaching only 200 MB/s, while Spark reached 400 MB/s. This is because the Dask array could not play to its full potential here.

Spark and Dask both have their own distributed computation ecosystems. Dask Distributed is relatively flexible and lightweight, while Spark, being based on the Hadoop ecosystem, is heavyweight and powerful. According to the experimental results of the cluster computation, Spark is still more prominent than Dask, in either cluster or stand-alone mode, because in Spark all of the slaves maximize their utilization of the system resources. In the clustering results, Spark is two times faster than Dask.

6.2 Discussion

The computer science community has two directions for Big Data processing solutions. The first direction is centralized computing: increasing the number of CPU processors to enhance the computing capability of a single computer, thereby improving the speed of processing data. The second direction is a distributed system: a group of commodity nodes connected by a network, used to process large amounts of data by splitting it into multiple parts, where each node is responsible for part of the data computation and the partial results are then aggregated. In a distributed system, where commodity hardware without considerable individual power is used, each node calculates only part of the data, but thousands of devices working together simultaneously process the data much faster than a single computer. As Figure 25 shows, the trade-off between scaling up and scaling out allows us to increase the number of high-performance nodes to replace a quantity of commodity hardware reaching the same level of performance. However, a large amount of commodity hardware poses big challenges in terms of failures and infrastructure size.

Spark and Dask each depend on a distributed computation ecosystem; Dask Distributed is relatively flexible and lightweight, while Spark, based on the Hadoop ecosystem, is heavyweight and powerful. Whether in a distributed or a stand-alone framework, Spark and Dask


Figure 25: Trade-off Between Scale up or Scale out [9]

both have remarkable performance, but Spark is still superior because it maximally utilizes the parallelism and node resources; this project's experimental results show that Spark is two times faster than Dask in the distributed framework.

6.3 Restrictions

This project involved leading-edge framework development and research, and there are some restrictions. The most obvious concerns the Dask implementation: since Dask comprises arrays, dataframes and bags, it was very difficult to combine them all with the Sympathy workflow for the experiments, which prevented Dask from showing its full power. The experimental tests used HDF5 test files; different file formats would, however, affect the test results and performance. Finally, although Sympathy for Data is cross-platform, the majority of its libraries and plugins are only comfortable on Windows, so when the workflow drags in the SparkImporter node for computation, tasks must be executed on a Linux virtual machine with Spark installed.

6.4 Conclusions

This work covered the Sympathy for Data workflow in conducting the research and exploration. The workflow is huge, with over 300 nodes performing data analytics tasks. The general issues were addressed by executing the workflow to analyze different scales of data on various systems and identifying the problems; examples include the global interpreter lock, and scalability and parallelism issues.

This project implemented both the Spark and Dask frameworks in the Sympathy for Data workflow to improve the parallelism and scalability of the platform. By analyzing system utilization in the experiments, such as IO and CPU utilization and the nodes' IO throughput in the distributed cluster, we confirmed that our research hypothesis was correct: increasing the number of parallel CPU cores speeds up performance roughly linearly. In these experiments, Spark, via an RDD with four cores, improved Sympathy for Data's performance 3-5 times; moreover, more partitions and cores exhibit more parallelism. Dask, with its blocked algorithm and a generous chunk size, only improved performance two times on a single machine. We also designed the SparkImporter middleware to give Sympathy for Data distributed cluster functionality for handling large data; using SparkImporter on a three-node cluster doubled the performance again.

In the research results, our proposed solutions significantly improved the performance of Sympathy for Data by 5-7 times, and with a three-node cluster performance improves by up to 10 times. We successfully made ADAF executable in the Spark environment, and via SparkImporter the platform can dramatically scale its computation and data storage volumes, since it is based on Hadoop.


6.5 Future work

The efficient utilization of system resources to improve computational speed is a dominant factor in future Big Data processing. Many parallelism frameworks have been proposed, such as Alluxio for memory-based storage, or Numba, which is used to exploit GPU capacity with multiple threads.

In this project, we applied Spark and Dask to Sympathy for Data and, based on the benchmark tests, dramatically improved the computation performance. Spark and Dask rely on parallelism and on maximizing the utilization of system resources for speed. Spark is based on Java, which takes considerable time and resources; therefore, Spark 2.0 has proposed Project Tungsten, which can substantially improve the efficiency of memory and CPU utilization, and also builds on the GPU for improving graph computation and machine learning.

However, to improve speed further, not only parallelism across multiple cores but also the GPU is considered one of the solutions for increasing computation capacity. Python also provides CUDA libraries as a GPU-based parallel computation platform for computation and data processing. This project had considerable limitations in its research, development and experimental work, so in the future we would like to combine GPU computation with the platform and measure its effect on performance.


References

[1] P. Nelson, "Just one autonomous car will use 4,000 GB of data," Dec 7, 2016. [Online]. Available: http://www.networkworld.com/article/3147892/internet/one-autonomous-car-will-use-4000-gb-of-dataday.html

[2] D. G. O. G. Anders Holst, Bjorn Bjurling, "Big Data Analytics: A Research and Innovation Agenda for Sweden," 2013. [Online]. Available: http://www.vinnova.se/PageFiles/0/Big%20Data%20Analytics.pdf

[3] M. Bauhammer, "Unboxing engineering data: Big Data solutions for the automotive industry," February 2016. [Online]. Available: https://www.hpe.com/h20195/v2/GetPDF.aspx/4AA6-4089ENW.pdf

[4] B. A. Jan Wassen, "Cross-functional business analytics at Volvo Cars," 08 Oct 2015. [Online]. Available: https://www.chalmers.se/en/areas-of-advance/ict/research/big-data/Documents/JanWassen-VolvoCars_8Oct2015.pdf

[5] S. for Data organization, “Sympathy Release 1.3.4,” Sep 30, 2016. [Online]. Available:https://media.readthedocs.org/pdf/sympathy-for-data/1.3/sympathy-for-data.pdf

[6] D. Beazley, “Understanding the Python GIL,” 2010. [Online]. Available: http://www.dabeaz.com/python/UnderstandingGIL.pdf

[7] E. P. Mike Folk, "Balancing Performance and Preservation: Lessons learned with HDF5," 2010. [Online]. Available: https://support.hdfgroup.org/pubs/papers/HDFandpreservation_NIST_2010_paper_Folk.pdf

[8] Q. K. R. S. C. A. G. L. G. S. B. M. F. R. P. Jialin Liu, Evan Racah, "H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems," 2016. [Online]. Available: https://cug.org/proceedings/cug2016_proceedings/includes/files/pap137.pdf

[9] K. Overholt, “Unlocking the True Value of Hadoop with Open Data Science,” June 7, 2016.[Online]. Available: http://schd.ws/hosted_files/bigdatatechday2016/4e/Open%20Data%20Science%20Continuum.pdf

[10] S. G. Jeffrey Dean, “MapReduce: Simplified Data Processing on Large Clusters,” 2004.[Online]. Available: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

[11] M. Rocklin, “Dask: Parallel Computation with Blocked algorithms and Task Scheduling,” 2015.[Online]. Available: http://conference.scipy.org/proceedings/scipy2015/pdfs/matthew_rocklin.pdf

[12] A. K. Erik Froese, "Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions." Spring, 2010. [Online]. Available: http://www.cs.nyu.edu/~lerner/spring10/projects/Python_GIL.pdf

[13] P. D. Group, “powerful Python data analysis toolkit,” October 09, 2015. [Online]. Available:http://pandas.pydata.org/pandas-docs/version/0.17.0/

[14] D. Beazley, “Understanding the Python GIL,” 2010. [Online]. Available: http://www.dabeaz.com/python/UnderstandingGIL.pdf

[15] T. D. A. D. J. M. M. M. M. J. F. S. S. I. S. Matei Zaharia, Mosharaf Chowdhury, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," 2012. [Online]. Available: http://www-bcf.usc.edu/~minlanyu/teach/csci599-fall12/papers/nsdi_spark.pdf


[16] S. official, “Spark Overview Documents,” 2016. [Online]. Available: http://spark.apache.org/docs/latest/index.html

[17] M. J. F. S. S. I. S. Matei Zaharia, Mosharaf Chowdhury, "Spark: Cluster Computing with Working Sets," Jun 22, 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113

[18] R. W. E. J. Jesus Carretero, Javier Garcia Blas, "Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems," 2015. [Online]. Available: http://e-archivo.uc3m.es/bitstream/handle/10016/21995/log_NESUS_2015.pdf?sequence=1

[19] A. H. Group, “HDFS Architecture Guide,” 2013. [Online]. Available: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

[20] A. Hadoop, "Apache Hadoop YARN," 2016. [Online]. Available: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

[21] A. S. official, “Apache Spark Configuration,” 2016. [Online]. Available: http://spark.apache.org/docs/latest/configuration.html

[22] M. Rocklin, “Dask 0.11.1,” 2016. [Online]. Available: http://dask.pydata.org/en/latest/

[23] D. D. Offical, “Dask Distributed,” 2016. [Online]. Available: https://distributed.readthedocs.io/en/latest/

[24] A. Hakansson, "Portal of Research Methods and Methodologies for Research Projects and Degree Projects," 22-25 July, 2013. [Online]. Available: http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A677684&dswid=2403

[25] J. Vanderplas, “Out-of-Core Dataframes in Python: Dask and OpenStreetMap,” May, 2015. [Online].Available: https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/

[26] N. Hennion, “Glances Documentations,” 2016. [Online]. Available: http://glances.readthedocs.io/en/latest/

[27] E. H. Leah A Wasser, "Intro to Working with Hyperspectral Remote Sensing Data in HDF5 Format in R," 2014. [Online]. Available: http://neondataskills.org/HDF5/Imaging-Spectroscopy-HDF5-In-R/

[28] D. A. Jorge L. Reyes-Ortiz, “Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP onBeowulf,” 2015. [Online]. Available: http://fulltext.study/preview/pdf/484820.pdf

[29] D. Beazley, “The Python GIL Visualized,” January 05, 2010. [Online]. Available: http://dabeaz.blogspot.se/2010/01/python-gil-visualized.html


Appendix A

Figure 26: Dask Three Slaves CPU Utilization Status

Figure 27: Spark Three Slaves CPU Utilization Status


Figure 28: Dask Three Slaves MEM Utilization Status

Figure 29: Spark Three Slaves MEM Utilization Status
