PARALLEL DATA ACQUISITION AND RETRIEVAL APPROACH FOR HEAVY HAUL RAILWAYS (FINAL PRESENTATION)
Faculty of Information Technology
Supervisor: Assoc Prof David Taniar
BY: JINGXUAN WEI (Tom) 25025031

Final Presentation IRT - Jingxuan Wei V1.2


PRESENTATION OUTLINE
• Research Background
  • Instrumented Ore Car Program issue
  • Problem of the existing database
• Research Question
• Related works
• Research Aim
• Data Acquisition
  • MongoDB Import and Export Tools
  • Spark-MongoDB Application
  • Result analysis
• Data Retrieval
  • Data retrieval by Spark SQL
  • Data retrieval by Spark filter operation
  • How to improve searching efficiency?
• Conclusion and Future work

RESEARCH BACKGROUND (INSTRUMENTED ORE CAR (IOC) PROGRAM)
• Railway in mining, Pilbara region, WA
• Loaded with iron ore
• Equipped with sensors to collect data as the train runs
• Trained professionals maintain the sensors
Aim of the program:
• Monitor track and wagon performance
• Detect track abnormalities

INSTRUMENTED ORE CAR PROGRAM ISSUE
What are the issues?
• Sensor selection
  • Smart sensors are expensive.
  • Less expensive sensors are inaccurate (semi-structured data).
• Database issue
  • Low data ingestion speed
  • Too much time spent on searching
Expected outcome: Equip wagons with many cheap sensors to collect data in order to obtain the desired outcome (reduce cost).

PROBLEM ANALYSIS
Low data ingestion speed in the current database
• High-velocity data input:
  • Each wagon is fitted with 16 sensors
  • Each sensor produces 25 records per second
  • Approximately 200 wagons in one train
  • At least 30 trains running at the same time
• Transaction management of the relational database
Too much time spent on searching
• Large volume of unstructured data
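The figures on this slide imply an aggregate input rate that can be worked out directly. The following is a minimal sketch, assuming every sensor on every wagon reports at the stated rate at the same time; the variable names are illustrative and do not come from the project.

```python
# Worked arithmetic for the input rate implied by the slide's figures.
SENSORS_PER_WAGON = 16
RECORDS_PER_SENSOR_PER_SECOND = 25
WAGONS_PER_TRAIN = 200   # "approximately 200" on the slide
TRAINS_RUNNING = 30      # "at least 30" on the slide

records_per_wagon = SENSORS_PER_WAGON * RECORDS_PER_SENSOR_PER_SECOND
records_per_train = records_per_wagon * WAGONS_PER_TRAIN
records_all_trains = records_per_train * TRAINS_RUNNING

print(f"per wagon:  {records_per_wagon} records/s")
print(f"per train:  {records_per_train} records/s")
print(f"all trains: {records_all_trains} records/s")
```

Under these assumptions a single train produces 80000 records per second, and 30 trains together produce 2400000 records per second.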

THE CURRENT SYSTEM: WHAT DOES THE DATA LOOK LIKE?
Data information:
• Twenty-one attributes, including train acceleration and geographic information (latitude and longitude)
• Missing track information
Solution:
• Append track information
Concept used:
• Geo Hashing Algorithm (Wolfson & Rigoutsos, 1997)
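One way to picture appending track information from latitude and longitude is a grid-cell lookup. The sketch below is a stand-in for illustration only: it is a plain grid hash, not the algorithm of the cited paper and not necessarily what the project implemented. The cell size, the field names `latitude` and `longitude`, the track names and the coordinates are all assumptions.

```python
# Hypothetical sketch: append a track label to a record by bucketing its
# latitude/longitude into a fixed-size grid cell and looking the cell up.
import math

CELL_DEGREES = 0.01  # assumed cell size; illustrative only


def cell_key(lat, lon, cell=CELL_DEGREES):
    """Map a coordinate to an integer (row, column) grid cell."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def build_index(track_points):
    """track_points: iterable of (lat, lon, track_name) reference points."""
    index = {}
    for lat, lon, name in track_points:
        index[cell_key(lat, lon)] = name
    return index


def append_track(record, index):
    """Return a copy of the record with a 'track' field (None if unknown)."""
    enriched = dict(record)
    key = cell_key(record["latitude"], record["longitude"])
    enriched["track"] = index.get(key)
    return enriched


# Made-up reference points and record, for demonstration only.
index = build_index([(-22.315, 118.905, "track-A"), (-22.345, 118.955, "track-B")])
record = {"latitude": -22.3151, "longitude": 118.9052, "accR3": 4.7}
print(append_track(record, index))
```

Records that fall in a cell with no reference point get `track: None`, which makes missing coverage easy to spot.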

RESEARCH QUESTION
• “How to improve the performance of data ingestion into the database?”
• “How to perform fast data retrieval in the IRT project?”

RELATED WORKS
• Using MongoDB to enhance the management of unstructured data (Stevic, Milosavljevic, & Perisic, 2015)
• Improvement of MongoDB Auto-Sharding (Liu, Wang, & Jin, 2012)
• Spark SQL (Armbrust et al., 2015)
Previous work (benchmark model)
• Given the infrastructure we have for processing, we have successfully processed 40,000 records per second.
• With the same infrastructure, based on the file system (CSV files provided by IRT), we have successfully retrieved results for 40 GB of data in less than 85 seconds.

RESEARCH AIM
Scalable Techniques for Parallel Data Acquisition and Retrieval of High-Velocity Data

WHY DO WE USE MONGODB?
• NoSQL document database
• Handles unstructured data well
• Improves storage capacity

HOW TO IMPLEMENT FAST DATA ACQUISITION?
Approaches taken:
• MongoDB Default Import Tool
  • Regular MongoDB
  • MongoDB Sharded Cluster
• Spark-MongoDB Application

MONGODB DEFAULT IMPORT TOOL
Command:
mongoimport --db RegularDB --collection railwayDataCollection --type csv --headerline --file /mnt/data/IRTRailwayData80K.csv

MONGODB SHARDING TECHNOLOGY
Sharded MongoDB Cluster
• Divides the data set and distributes the data over multiple shards.
• Each shard is an independent database.

PARALLEL DATA ACQUISITION IN SHARDED DATABASE
• Reads / Writes
• Storage Capacity
• High Availability

CHANGE DATA DISTRIBUTION IN SHARDED DATABASE
Hashed Sharding
sh.shardCollection( "<database>.<collection>", { <shard key field> : "hashed" } )
Ranged Sharding
sh.shardCollection( "<database>.<collection>", { <shard key> : <direction> } )
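The practical difference between the two strategies shows up when a batch of monotonically increasing keys arrives. The sketch below is an illustrative simulation in plain Python, not MongoDB code: the shard count and split points are assumptions, and the modulo step is a simplification (MongoDB assigns ranges of hash values to chunks rather than taking a modulo).

```python
# Illustrative simulation: how ranged and hashed placement spread a batch
# of increasing keys over shards.
import hashlib
from collections import Counter

NUM_SHARDS = 3                      # assumed cluster size
RANGE_BOUNDS = [100_000, 200_000]   # assumed split points for ranged placement


def ranged_shard(key):
    """Place the key in the first range whose upper bound exceeds it."""
    for shard, bound in enumerate(RANGE_BOUNDS):
        if key < bound:
            return shard
    return len(RANGE_BOUNDS)


def hashed_shard(key):
    """Place the key by a hash of its value (MD5 here, for determinism)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# A new batch whose keys (for example timestamps) all exceed the last split point.
batch = range(200_000, 201_000)
ranged_counts = Counter(ranged_shard(k) for k in batch)
hashed_counts = Counter(hashed_shard(k) for k in batch)
print("ranged:", dict(ranged_counts))
print("hashed:", dict(hashed_counts))
```

In this simulation the ranged placement sends the whole batch to the last shard, while the hashed placement spreads it over all three.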

MONGOIMPORT RESULT ANALYSIS (1)
[Chart: Hashed Sharding vs Ranged Sharding. x-axis: number of records (40K, 80K, 160K, 320K); y-axis: seconds (0 to 30); series: Ranged Sharding, Hashed Sharding.]

MONGOIMPORT RESULT ANALYSIS (2)
[Chart: Sharded Database vs Regular Database. x-axis: number of records (40K, 80K, 160K, 320K); y-axis: seconds (0 to 25); series: Sharded MongoDB, MongoDB (Regular).]

MONGOIMPORT RESULT ANALYSIS (3)
• The bottleneck occurs in the first section.
• Compared with the sharding-enabled database, the regular database performs better in data acquisition.
• The acquisition result cannot meet the industry requirement (80000 records per second).

SPARK-MONGODB APPLICATION FOR DATA ACQUISITION
1. Set up the Spark environment.
2. Create 80000 records as the input batch.
3. Store the batch into MongoDB.

SPARK-MONGODB APPLICATION RESULT
• mongoimport: sharded database 4.3 s; regular database 4.0 s
• Spark program: 1.4 s
[Chart: Data inserting, Router (Master), 4 CPUs. x-axis: number of records; y-axis: milliseconds (0 to 1600).]
• 40000 records: 822 ms
• 50000 records: 1007 ms
• 60000 records: 1167 ms
• 70000 records: 1302 ms
• 80000 records: 1444 ms

SPARK-MONGODB APPLICATION RESULT
New record: 140000 records per second
[Chart: Data inserting, Server, 16 CPUs. x-axis: number of records; y-axis: milliseconds (0 to 1200).]
• 120000 records: 860 ms
• 130000 records: 931 ms
• 140000 records: 1031 ms
• 150000 records: 1053 ms
• 160000 records: 1134 ms
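The insert times on this chart can be converted into records per second to check the headline figure. This is a small worked calculation; pairing the data labels with the batch sizes in left-to-right order is an assumption based on the chart layout.

```python
# Records-per-second throughput derived from the chart's data labels
# (batch size in records -> insert time in milliseconds, 16-CPU server).
timings_ms = {
    120_000: 860,
    130_000: 931,
    140_000: 1031,
    150_000: 1053,
    160_000: 1134,
}


def throughput(records, millis):
    """Records inserted per second for one batch."""
    return records / (millis / 1000.0)


for records, millis in timings_ms.items():
    rate = throughput(records, millis)
    print(f"{records} records in {millis} ms -> {rate:,.0f} records/s")
```

Every batch works out to roughly 136000 to 142000 records per second, which is in line with the stated figure of about 140000 records per second.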

DATA RETRIEVAL APPROACH
• Database level
• Application level

PARALLEL DATA RETRIEVAL BY MONGODB
[Chart: Searching performance between sharded MongoDB and regular MongoDB. x-axis: number of records (tick labels: 2970000, 5940000, 8910000, 11880000 and 196814, 393628, 590442, 787256); y-axis: milliseconds (0 to 16000).]
Conclusion:
1. Sharded MongoDB performs searches faster than regular MongoDB.
2. It is hard to measure query execution time when the dataset is too big.
Query:
db.getCollection('Test').find({'accR3': { $gt: 4, $lt: 6 }}).explain('executionStats')

DATA RETRIEVAL BY SPARK SQL QUERY
• Create the Spark SQL object.
• Register a temp table and run the search query.
• Sample result

DATA RETRIEVAL BY SPARK FILTER OPERATION
• Perform searching by using the filter operation

SPARK-MONGODB APPLICATION RESULT
Data searching

Data size       Local machine (2 CPUs, i5)         Server (16 CPUs), regular database
                Filter search   Spark SQL query    Filter search   Spark SQL query
4G (3.92G)      16881           877                5736            557/602
8G (7.85G)      52229           2012               13281           1753
12G (11.78G)    N/A             N/A                19556           3323
16G (15.70G)    N/A             N/A                31179           4518
40G (39.93G)    N/A             N/A                79893           8783
45G (44.83G)    N/A             N/A                86399           10883

Query
1. SELECT * FROM railwayData WHERE accR3 > 4 AND accR3 < 6
2. val result = readData.filter(readData("acc.r3") >= 4 && readData("acc.r3") <= 6)
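As written, the two queries do not use the same bounds: the SQL query excludes the values 4 and 6, while the filter includes them. The sketch below restates both predicates on plain Python values to show where the results can differ; the sample values are made up.

```python
# The two predicates shown on the slide, expressed on plain Python values.
def sql_predicate(acc_r3):
    """WHERE accR3 > 4 AND accR3 < 6 (bounds excluded)."""
    return 4 < acc_r3 < 6


def filter_predicate(acc_r3):
    """acc.r3 >= 4 && acc.r3 <= 6 (bounds included)."""
    return 4 <= acc_r3 <= 6


samples = [3.9, 4.0, 5.2, 6.0, 6.1]  # made-up values
sql_hits = [x for x in samples if sql_predicate(x)]
filter_hits = [x for x in samples if filter_predicate(x)]
print(sql_hits)     # [5.2]
print(filter_hits)  # [4.0, 5.2, 6.0]
```

If the data contains readings exactly equal to 4 or 6, the two approaches return different row counts, which is worth keeping in mind when comparing their timings.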


HOW TO IMPROVE SEARCHING EFFICIENCY?

Approach 1

Approach 2

(Key-Value)

SPARK-MONGODB APPLICATION 2 RESULT
• Adopt a hash partitioner to partition the data and use mapPartitionsWithIndex to get the target partition.
• Perform searching in the target partition.
• Narrow the search scope.
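The idea of narrowing the search to one partition can be illustrated without Spark. The sketch below is plain Python rather than the project's Spark code: it partitions key-value records by a hash of the key and then scans only the partition that can hold the requested key. The partition count and the integer wagon id used as the key are assumptions.

```python
# Plain-Python illustration: partition key-value records by hash of the key,
# then search only the target partition instead of all data.
NUM_PARTITIONS = 4  # assumed partition count


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Non-negative modulo of the key's hash, as a hash partitioner does."""
    return hash(key) % num_partitions


def build_partitions(records, num_partitions=NUM_PARTITIONS):
    """Distribute (key, value) pairs into a list of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def search(partitions, key):
    """Scan only the partition that can contain the key."""
    target = partition_of(key, len(partitions))
    return [value for k, value in partitions[target] if k == key]


# Made-up records keyed by an integer wagon id (hypothetical field).
wagon_ids = [7, 12, 7, 3, 12, 9]
records = [(wagon_id, f"reading-{i}") for i, wagon_id in enumerate(wagon_ids)]
partitions = build_partitions(records)
print(search(partitions, 7))
```

Because every record with the same key lands in the same partition, the search touches only a fraction of the data, at the cost of partitioning the data up front.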

SPARK-MONGODB APPLICATION 2 RESULT
[Chart: Compare the performance between two searching approaches. x-axis: number of testing (1, 2, 3); y-axis: milliseconds (0 to 50000).]
• Search by Hash Partition: 30724, 29058, 27440
• Search for all data: 44737, 42773, 43200
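The three test runs can be summarised as an average time per approach and a speedup ratio. This is a small worked calculation; it assumes the first row of data labels belongs to "Search by Hash Partition" and the second to "Search for all data", following the legend order.

```python
# Average search time per approach and the resulting speedup.
from statistics import mean

hash_partition_ms = [30724, 29058, 27440]   # assumed: search by hash partition
all_data_ms = [44737, 42773, 43200]         # assumed: search over all data

mean_hash = mean(hash_partition_ms)
mean_all = mean(all_data_ms)
speedup = mean_all / mean_hash
time_saved = 1 - mean_hash / mean_all

print(f"hash partition: {mean_hash} ms on average")
print(f"all data:       {mean_all} ms on average")
print(f"speedup:        {speedup:.2f}x ({time_saved:.0%} less time)")
```

Under that assumption the partitioned search is about 1.5 times faster, which corresponds to roughly a third less search time.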

CONCLUSION
• We have successfully created a system that is able to accept data as a batch or as streams.
• We solved the low data ingestion speed problem by writing a Spark program.
• We have successfully imported 140000 records in one second into the MongoDB server.
• We perform searching by using Spark SQL and executed an SQL query on 40 GB of data within 11 seconds.

FUTURE WORK
• How to measure MongoDB query execution time in a very large database.
• An efficient searching mechanism in sharded MongoDB using Spark.

REFERENCE
Wolfson, H. J., & Rigoutsos, I. (1997). Geometric hashing: An overview. IEEE Computational Science and Engineering, 4(4), 10-21.
Stevic, M. P., Milosavljevic, B., & Perisic, B. R. (2015). Enhancing the management of unstructured data in e-learning systems using MongoDB. Program, 49(1), 91-114.
Liu, Y., Wang, Y., & Jin, Y. (2012). Research on the improvement of MongoDB Auto-Sharding in cloud environment. Paper presented at the Computer Science & Education (ICCSE), 2012 7th International Conference on.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Ghodsi, A. (2015). Spark SQL: Relational data processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.

THE TEAM
• Dr. Maria Indrawan-Santiago, Senior Lecturer, Faculty of IT
• Prajwol Sangat, Research Assistant, Faculty of IT
• Assoc Prof David Taniar, Associate Professor, Faculty of IT
• Jingxuan Wei, Student, Faculty of IT
• Subudh Sali, Student, Faculty of IT

Thank You
Questions?