Final Presentation IRT - Jingxuan Wei V1.2

PARALLEL DATA ACQUISITION AND RETRIEVAL APPROACH FOR HEAVY HAUL RAILWAYS (FINAL PRESENTATION)

Faculty of Information Technology
Supervisor: Assoc Prof David Taniar

BY: JINGXUAN WEI (Tom) 25025031

PRESENTATION OUTLINE

• Research Background
  • Instrumented Ore Car Program issue
  • Problem of the existing database
• Research Question
• Related works
• Research Aim
• Data Acquisition
  • MongoDB Import and Export Tools
  • Spark-MongoDB Application
  • Result analysis
• Data Retrieval
  • Data retrieval by Spark SQL
  • Data retrieval by Spark filter operation
  • How to improve searching efficiency?
• Conclusion and Future work

RESEARCH BACKGROUND (INSTRUMENTED ORE CAR (IOC) PROGRAM)

• Mining railway in the Pilbara region, WA
• Wagons are loaded with iron ore
• Wagons are equipped with sensors that collect data as the train runs
• Trained professionals maintain the sensors

Aim of the program:
• Monitor track and wagon performance
• Detect track abnormalities

INSTRUMENTED ORE CAR PROGRAM ISSUE

What are the issues?
• Sensor selection
  • Smart sensors are expensive.
  • Less expensive sensors are inaccurate (semi-structured data).
• Database issues
  • Low data ingestion speed
  • Too much time spent on searching

Expected outcome: equip the wagons with many cheap sensors to collect data and still obtain the desired outcome (reduce cost).

PROBLEM ANALYSIS

Low data ingestion speed in the current database
• High-velocity data input:
  • Each wagon is fitted with 16 sensors
  • Each sensor produces 25 records per second
  • Approximately 200 wagons in one train
  • At least 30 trains running at the same time
• Transaction management of the relational database

Too much time spent on searching
• Large volume of unstructured data
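The input figures above multiply out to a concrete ingestion rate. The sketch below is plain arithmetic on the slide's numbers; the constant and function names are illustrative, and the wagon and train counts are the approximate values quoted on the slide.

```python
# Figures taken from the slide; names are illustrative only.
SENSORS_PER_WAGON = 16
RECORDS_PER_SENSOR_PER_SECOND = 25
WAGONS_PER_TRAIN = 200   # "approximately 200"
TRAINS_RUNNING = 30      # "at least 30"


def records_per_second(sensors, rate, wagons, trains):
    """Records generated each second by `trains` trains."""
    return sensors * rate * wagons * trains


per_wagon = SENSORS_PER_WAGON * RECORDS_PER_SENSOR_PER_SECOND
per_train = records_per_second(
    SENSORS_PER_WAGON, RECORDS_PER_SENSOR_PER_SECOND, WAGONS_PER_TRAIN, 1
)
whole_fleet = records_per_second(
    SENSORS_PER_WAGON, RECORDS_PER_SENSOR_PER_SECOND, WAGONS_PER_TRAIN, TRAINS_RUNNING
)

print(per_wagon, per_train, whole_fleet)  # 400 80000 2400000
```

One train alone therefore produces 80000 records per second, which matches the requirement quoted later in the result analysis, and 30 trains produce 2.4 million records per second.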

THE CURRENT SYSTEM: WHAT DOES THE DATA LOOK LIKE?

Data information:
• Twenty-one attributes, including train acceleration and geographic information (latitude and longitude)
• Track information is missing

Solution:
• Append track information

Concept used:
• Geo hashing algorithm (Wolfson & Rigoutsos, 1997)
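As a rough illustration of appending track information by hashing coordinates, the sketch below maps each latitude/longitude pair to a grid cell and looks the cell up in a table. It is a plain grid-cell hash, which may differ from the algorithm the project actually used; the cell size, track names, coordinates and field names are all hypothetical.

```python
import math

CELL_DEG = 0.01  # hypothetical cell size in degrees


def cell_key(lat, lon, cell=CELL_DEG):
    """Hash a coordinate to the integer grid cell that contains it."""
    return (math.floor(lat / cell), math.floor(lon / cell))


# Hypothetical track survey points: (latitude, longitude, track name).
TRACK_POINTS = [
    (-22.315, 118.105, "Track A"),
    (-22.485, 118.255, "Track B"),
]

# Build the lookup table once: grid cell -> track name.
TRACK_BY_CELL = {cell_key(lat, lon): name for lat, lon, name in TRACK_POINTS}


def append_track(record):
    """Return a copy of the sensor record with a 'track' field added."""
    key = cell_key(record["latitude"], record["longitude"])
    return {**record, "track": TRACK_BY_CELL.get(key)}


record = {"latitude": -22.3172, "longitude": 118.1093, "accR3": 4.7}
tagged = append_track(record)
unmatched = append_track({"latitude": -20.005, "longitude": 117.005, "accR3": 1.0})
print(tagged["track"], unmatched["track"])
```

Because the lookup is a dictionary access, the cost per record does not grow with the number of track points, which matters at the input rates described earlier.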

RESEARCH QUESTION

• “How to improve the performance of data ingestion into the database?”
• “How to perform fast data retrieval in the IRT project?”

RELATED WORKS

• Using MongoDB to enhance the management of unstructured data (Stevic, Milosavljevic, & Perisic, 2015)
• Improvement of MongoDB auto-sharding (Liu, Wang, & Jin, 2012)
• Spark SQL (Armbrust et al., 2015)

Previous work (benchmark model)
• With the infrastructure we have for processing, we have successfully processed 40,000 records per second.
• With the same infrastructure, based on the file system (CSV files provided by IRT), we have successfully retrieved results for 40 GB of data in less than 85 seconds.

RESEARCH AIM

Scalable techniques for parallel data acquisition and retrieval of high-velocity data

WHY DO WE USE MONGODB?

• NoSQL document database
• Handles unstructured data well
• Improves storage capacity

HOW TO IMPLEMENT FAST DATA ACQUISITION?

Approaches taken:
• MongoDB default import tool
  • Regular MongoDB
  • MongoDB sharded cluster
• Spark-MongoDB application

MONGODB DEFAULT IMPORT TOOL

Command:
mongoimport --db RegularDB --collection railwayDataCollection --type csv --headerline --file /mnt/data/IRTRailwayData80K.csv
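With `--type csv --headerline`, the first line of the file supplies the field names and every following line becomes one document. The sketch below imitates that mapping in plain Python, without touching MongoDB; the three-column sample and the conversion of every value to a float are assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-row sample; the real file has 21 attributes per record.
SAMPLE_CSV = (
    "latitude,longitude,accR3\n"
    "-22.3172,118.1093,4.7\n"
    "-22.3180,118.1101,5.2\n"
)


def csv_to_documents(text):
    """Use the header line as field names and return one dict per data row."""
    reader = csv.DictReader(io.StringIO(text))
    # Every column in this sample is numeric, so convert each value to float.
    return [{name: float(value) for name, value in row.items()} for row in reader]


documents = csv_to_documents(SAMPLE_CSV)
print(documents[0])
```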

MONGODB SHARDING TECHNOLOGY

Sharded MongoDB cluster
• Divides the data set and distributes the data over multiple shards.
• Each shard is an independent database.

PARALLEL DATA ACQUISITION IN SHARDED DATABASE

• Reads / Writes
• Storage capacity
• High availability

CHANGE DATA DISTRIBUTION IN SHARDED DATABASE

Hashed sharding:
sh.shardCollection("<database>.<collection>", { <shard key field> : "hashed" })

Ranged sharding:
sh.shardCollection("<database>.<collection>", { <shard key field> : 1 })
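The difference between the two strategies shows up most clearly with monotonically increasing keys. The sketch below is a simplified model, not MongoDB's implementation: it picks a shard with a modulo of an MD5 hash, whereas MongoDB assigns ranges of hash values to chunks. The shard count, range boundaries and key values are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3


def ranged_shard(key, boundaries):
    """Ranged sharding: the shard is chosen by the key range containing the key."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)


def hashed_shard(key, num_shards=NUM_SHARDS):
    """Hashed sharding: the shard is chosen from a hash of the key (MD5 stand-in)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical increasing keys, such as timestamps of newly arriving records.
new_keys = range(100000, 100300)
# Shard 0 holds keys below 50000, shard 1 keys below 100000, shard 2 the rest.
boundaries = [50000, 100000]

ranged = Counter(ranged_shard(k, boundaries) for k in new_keys)
hashed = Counter(hashed_shard(k) for k in new_keys)

print("ranged:", dict(ranged))
print("hashed:", dict(hashed))
```

In this model every new key falls into the last range, so one shard takes all the writes under ranged sharding, while the hashed keys spread over all three shards.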

MONGOIMPORT RESULT ANALYSIS (1)

[Chart: Hashed Sharding vs Ranged Sharding. X-axis: number of records (40K, 80K, 160K, 320K). Y-axis: seconds (0.0 to 30.0). Series: Ranged Sharding, Hashed Sharding.]

MONGOIMPORT RESULT ANALYSIS (2)

[Chart: Sharded Database vs Regular Database. X-axis: number of records (40K, 80K, 160K, 320K). Y-axis: seconds (0.0 to 25.0). Series: Sharded MongoDB, MongoDB (Regular).]

MONGOIMPORT RESULT ANALYSIS (3)

• The bottleneck occurs in the first section.
• Compared with the sharding-enabled database, the regular database performs better in data acquisition.
• The acquisition result cannot meet the industry requirement (80000 records per second).

SPARK-MONGODB APPLICATION FOR DATA ACQUISITION

1. Set up the Spark environment first.
2. Create 80000 records as the input batch.
3. Store the batch into MongoDB.
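The pattern behind these steps (build a batch, split it into partitions, write the partitions in parallel) can be sketched without Spark or MongoDB. Everything below is a stand-in: a thread pool plays the role of the Spark workers, `InMemoryCollection` is a hypothetical class rather than a real driver API, and the record layout is invented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryCollection:
    """Stand-in for a database collection; not a real MongoDB driver class."""

    def __init__(self):
        self.docs = []
        self._lock = threading.Lock()

    def insert_many(self, docs):
        # Guard the shared list because several workers write at once.
        with self._lock:
            self.docs.extend(docs)
        return len(docs)


def make_batch(size):
    """Create `size` hypothetical records (the real ones have 21 attributes)."""
    return [{"recordId": i, "accR3": (i % 100) / 10.0} for i in range(size)]


def split(records, parts):
    """Split the batch round-robin into `parts` partitions of similar size."""
    return [records[i::parts] for i in range(parts)]


batch = make_batch(80000)
partitions = split(batch, 4)
collection = InMemoryCollection()

# Each partition is written by its own worker, as a parallel job would do.
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(collection.insert_many, partitions))

print(sum(written), len(collection.docs))
```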

SPARK-MONGODB APPLICATION RESULT

mongoimport:
• Sharded database: 4.3 s
• Regular database: 4.0 s

Spark program: 1.4 s

Data inserting: Router (Master), 4 CPUs

Number of records    Milliseconds
40000                822
50000                1007
60000                1167
70000                1302
80000                1444

SPARK-MONGODB APPLICATION RESULT

New record: 140000 records per second

Data inserting: Server, 16 CPUs

Number of records    Milliseconds
120000               860
130000               931
140000               1031
150000               1053
160000               1134
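To compare the two machines on the same scale, the batch timings can be converted to records per second. The sketch below only rearranges the numbers read from the two insert charts; the variable and function names are illustrative.

```python
# (records, milliseconds) pairs read from the two charts.
ROUTER_4_CPUS = [(40000, 822), (50000, 1007), (60000, 1167), (70000, 1302), (80000, 1444)]
SERVER_16_CPUS = [(120000, 860), (130000, 931), (140000, 1031), (150000, 1053), (160000, 1134)]


def throughput(records, millis):
    """Records inserted per second."""
    return records * 1000 / millis


router_rates = [throughput(r, ms) for r, ms in ROUTER_4_CPUS]
server_rates = [throughput(r, ms) for r, ms in SERVER_16_CPUS]

for label, runs in (("router, 4 CPUs", ROUTER_4_CPUS), ("server, 16 CPUs", SERVER_16_CPUS)):
    for records, millis in runs:
        rate = throughput(records, millis)
        print(f"{label}: {records} records in {millis} ms = {rate:,.0f} records/s")
```

On these figures the 4-CPU router sustains roughly 49,000 to 55,000 records per second, and the 16-CPU server roughly 136,000 to 142,000 records per second.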

DATA RETRIEVAL APPROACH

• Database level
• Application level

PARALLEL DATA RETRIEVAL BY MONGODB

[Chart: Searching performance between sharded MongoDB and regular MongoDB. X-axis: number of records, labelled 2970000, 5940000, 8910000, 11880000, with a second row of labels 196814, 393628, 590442, 787256. Y-axis: milliseconds (0 to 16000).]

Conclusion:
1. Sharded MongoDB performs faster searching than regular MongoDB.
2. It is hard to measure query execution time when the dataset is too big.

Query:
db.getCollection('Test').find({'accR3': { $gt: 4, $lt: 6 }}).explain('executionStats')

DATA RETRIEVAL BY SPARK SQL QUERY

• Create the Spark SQL object.
• Register a temp table and run the search query.
• Sample result

DATA RETRIEVAL BY SPARK FILTER OPERATION

• Perform the search by using the filter operation.

SPARK-MONGODB APPLICATION RESULT

Data searching (times in milliseconds)

                 Local machine (2 CPUs, i5)         Server (16 CPUs), regular database
Data size        Filter search   Spark SQL query    Filter search   Spark SQL query
4G (3.92G)       16881           877                5736            557/602
8G (7.85G)       52229           2012               13281           1753
12G (11.78G)     N/A             N/A                19556           3323
16G (15.70G)     N/A             N/A                31179           4518
40G (39.93G)     N/A             N/A                79893           8783
45G (44.83G)     N/A             N/A                86399           10883

Query

1. SELECT * FROM railwayData Where accR3 > 4 and accR3 < 6

2. val result = readData.filter(readData("acc.r3") >= 4 && readData("acc.r3") <= 6)
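The two query styles can be compared on a small scale without Spark. In the sketch below SQLite stands in for Spark SQL and a list comprehension stands in for the filter operation; the six sample rows and the `id` column are hypothetical. Both forms use the exclusive bounds of query 1 so that they return the same rows.

```python
import sqlite3

# Hypothetical sample rows; only the accR3 column from the query is modelled.
rows = [(1, 3.9), (2, 4.0), (3, 4.5), (4, 5.9), (5, 6.0), (6, 7.2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE railwayData (id INTEGER, accR3 REAL)")
conn.executemany("INSERT INTO railwayData VALUES (?, ?)", rows)

# 1. Declarative form: the WHERE clause of query 1.
sql_result = conn.execute(
    "SELECT * FROM railwayData WHERE accR3 > 4 AND accR3 < 6 ORDER BY id"
).fetchall()
conn.close()

# 2. Programmatic form: a predicate applied to every record.
filter_result = [row for row in rows if 4 < row[1] < 6]

# The same predicate with inclusive bounds (>= and <=), for comparison.
inclusive_result = [row for row in rows if 4 <= row[1] <= 6]

print(sql_result)
print(filter_result)
print(inclusive_result)
```

With inclusive bounds, as written in query 2, the rows whose value is exactly 4 or 6 are also returned, so the two queries agree only when no record sits on a boundary.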

HOW TO IMPROVE SEARCHING EFFICIENCY?

• Approach 1
• Approach 2 (Key-Value)

SPARK-MONGODB APPLICATION 2 RESULT

• Adopt a hash partitioner to partition the data and use mapPartitionsWithIndex to get the target partition.
• Perform the search in the target partition only.
• This narrows the search scope.
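The idea of searching only the target partition can be sketched in plain Python, with lists standing in for Spark partitions. The key-value layout (wagon id as key, acceleration reading as value), the partition count and the generated data are assumptions made for illustration.

```python
NUM_PARTITIONS = 4


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Hash partitioner: map a key to a partition index."""
    return hash(key) % num_partitions


def build_partitions(records, num_partitions=NUM_PARTITIONS):
    """Distribute key-value records over partitions by the hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def search_target_partition(partitions, key, predicate):
    """Scan only the one partition that can contain `key`."""
    target = partitions[partition_of(key, len(partitions))]
    return [(k, v) for k, v in target if k == key and predicate(v)]


def search_all(records, key, predicate):
    """Baseline: scan every record."""
    return [(k, v) for k, v in records if k == key and predicate(v)]


def in_range(value):
    return 4 < value < 6


# Hypothetical data: 200 wagon ids, 10 readings each.
records = [(i % 200, 3.0 + (i % 37) / 10.0) for i in range(2000)]
partitions = build_partitions(records)

fast = search_target_partition(partitions, 15, in_range)
full = search_all(records, 15, in_range)
scanned = len(partitions[partition_of(15)])

print(len(fast), "matches after scanning", scanned, "of", len(records), "records")
```

Both searches return the same matches, but the partitioned search inspects a quarter of the records in this setup, which is the narrowing of scope described above.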

SPARK-MONGODB APPLICATION 2 RESULT

Compare the performance between two searching approaches (milliseconds)

Test number    Search by Hash Partition    Search for all data
1              30724                       44737
2              29058                       42773
3              27440                       43200
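The three test runs can be summarised as an average speed-up. The sketch below assumes that the first row of chart values (30724, 29058, 27440) belongs to the hash partition search and the second row to the search over all data, following the order of the legend.

```python
from statistics import mean

# Milliseconds per test run; series assignment follows the legend order (assumed).
hash_partition_ms = [30724, 29058, 27440]
all_data_ms = [44737, 42773, 43200]

mean_hash = mean(hash_partition_ms)
mean_all = mean(all_data_ms)

speedup = mean_all / mean_hash
time_saved = 1 - mean_hash / mean_all

print(f"hash partition: {mean_hash} ms, all data: {mean_all} ms")
print(f"speed-up: {speedup:.2f}x, time saved: {time_saved:.0%}")
```

Under that assumption, searching the target partition takes about two thirds of the time needed to search all data.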

CONCLUSION

• We have successfully created a system that is able to accept data as a batch or as streams.
• We solved the low data ingestion speed problem by writing a Spark program.
• We have successfully imported 140000 records in one second into the MongoDB server.
• We perform searching by using Spark SQL and can execute an SQL query on 40 GB of data within 11 seconds.

FUTURE WORK

• How to measure MongoDB query execution time in a very large database.
• An efficient searching mechanism in sharded MongoDB by using Spark.

REFERENCES

Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.

Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.

Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB Auto-Sharding in cloud environment. Paper presented at the Computer Science & Education (ICCSE), 2012 7th International Conference on.

Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.

THE TEAM

• Dr. Maria Indrawan-Santiago, Senior Lecturer, Faculty of IT
• Prajwol Sangat, Research Assistant, Faculty of IT
• Assoc Prof David Taniar, Associate Professor, Faculty of IT
• Jingxuan Wei, Student, Faculty of IT
• Subudh Sali, Student, Faculty of IT

Thank You

Questions?
