Big Data Analysis of Airline Data Set on Cloud Computing

JIIBR Symposium 2015, Cal State LA, CA, October 9, 2015

Nillohit Bhattacharya, [email protected]

Jongwook Woo, PhD, [email protected]

High-Performance Information Computing Center (HiPIC)

Cloudera Academic Partner and Grants Awardee of Amazon AWS

California State University Los Angeles

High Performance Information Computing Center, Jongwook Woo, CSULA

Contents

Airline Data Set

Hadoop: Data Intensive Computing

Hadoop on Cloud Computing

Hive and its Architecture on Azure

Experimental Results

Conclusions

Characteristics of the Airline Data Set

Data taken from the US Department of Transportation

Consists of the arrival and departure records of domestic airlines

Time period: January 2005 – December 2014 (10 years)

Total number of files: 120

File format: CSV (comma-separated values)

Total file size: 13.1 GB

Total number of records: 66 million
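
As a rough sanity check, the per-file averages implied by these totals can be worked out directly. The sketch below uses only the figures on this slide; the one-file-per-month layout is an assumption inferred from 120 files over 10 years, and the decimal GB-to-MB conversion is likewise an assumption.

```python
# Totals copied from the slide above.
YEARS = 10
TOTAL_FILES = 120
TOTAL_SIZE_GB = 13.1
TOTAL_RECORDS = 66_000_000


def dataset_averages():
    """Derive per-file and per-record averages from the slide's totals."""
    files_per_year = TOTAL_FILES / YEARS                 # 12, i.e. one file per month (assumption)
    mb_per_file = TOTAL_SIZE_GB * 1000 / TOTAL_FILES     # decimal megabytes (assumption)
    records_per_file = TOTAL_RECORDS / TOTAL_FILES
    bytes_per_record = TOTAL_SIZE_GB * 1e9 / TOTAL_RECORDS
    return files_per_year, mb_per_file, records_per_file, bytes_per_record


if __name__ == "__main__":
    per_year, mb, records, size = dataset_averages()
    print(f"{per_year:.0f} files/year, {mb:.1f} MB/file, "
          f"{records:,.0f} records/file, {size:.0f} bytes/record")
```

Under these assumptions each monthly file holds roughly 109 MB and about 550,000 records, or about 200 bytes per record.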

Traditional Computing Challenges

Not easy for a single computer to store and process all the data by itself.

Approached the problem in a different way

Traditional Parallel Computer

– Processor-intensive computing

• by increasing the processing speed and power of the computer

As the data grows exponentially,

– the processing power of a single computer becomes a bottleneck

– and it mostly does not work for large-scale data because of the latency of data transfer over the network and disk I/O

A New Approach (Hadoop)

Many inexpensive commodity computers, all working together

Data Intensive Computing

– Break the data into smaller chunks and process the data locally where it is stored

– Data Locality

• Computation occurs where the data resides

All the computers process the data in parallel.

Provides the ability to harness the power of multiple computers simultaneously.
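
The chunk-and-process-in-parallel idea can be illustrated on a single machine. This is a minimal sketch, not Hadoop itself: worker threads stand in for cluster nodes, in-memory lists stand in for stored data blocks, and the record layout `(year, cancelled_flag)` and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample records: (year, cancelled flag).
SAMPLE = [(2005, 1), (2005, 0), (2006, 1), (2006, 1), (2007, 0), (2007, 1)]


def split_into_chunks(records, chunk_size):
    """Break the data set into smaller chunks (a stand-in for stored blocks)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_cancelled(chunk):
    """Work done on one chunk only: cancelled flights per year."""
    counts = Counter()
    for year, cancelled in chunk:
        counts[year] += cancelled
    return counts


def parallel_cancelled_by_year(records, chunk_size=2, workers=4):
    """Process every chunk in parallel, then merge the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_cancelled, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    print(parallel_cancelled_by_year(SAMPLE))
```

Each worker touches only its own chunk, and only the small per-chunk counts are combined at the end, which is the point of moving computation to the data rather than the data to the computation.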

Hadoop on Cloud

Create Hadoop clusters with minimal investment.

No overhead of maintaining the cluster.

Delete the cluster when no longer needed.

Increase/decrease resources on demand.

Deleting the cluster does not result in loss of data.

Apache Hive

SQL-like language

Developed at Facebook

HQL (Hive Query Language) differs from SQL

Runs MapReduce jobs under the hood

Batch processing

Queries have high latency

Read-based

Not appropriate for transaction processing
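
To make "runs MapReduce jobs under the hood" concrete, the sketch below shows how a grouped aggregate (for example, cancelled flights per year and month) can be decomposed into map, shuffle and reduce phases. It is a conceptual illustration in plain Python, not Hive's actual execution plan; the row fields and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names are assumptions for illustration.
ROWS = [
    {"year": 2005, "month": 1, "cancelled": 1},
    {"year": 2005, "month": 1, "cancelled": 0},
    {"year": 2005, "month": 1, "cancelled": 1},
    {"year": 2005, "month": 2, "cancelled": 1},
]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for row in rows:
        yield (row["year"], row["month"]), row["cancelled"]


def shuffle_phase(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into one aggregate (here, a sum)."""
    return {key: sum(values) for key, values in sorted(groups.items())}


def run_job(rows):
    """Chain the three phases into one batch job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_job(ROWS))
```

Every phase reads its whole input before results appear, which is consistent with the batch, high-latency, read-oriented behavior listed on the slide.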

Microsoft Azure HDInsight

Deploys and provisions Hadoop clusters in the cloud

HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution

HDInsight cluster configuration

Number of data nodes: 4

CPU: 4 cores

Memory: 7 GB

Operating system: Windows Server 2012 R2 Datacenter

Hadoop clusters can be launched using

Linux operating system

Windows Server operating system
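
For a sense of scale, the cluster's aggregate resources can be estimated from the configuration above. This sketch assumes the CPU and memory figures are per data node, which the slide does not state explicitly, and it ignores any storage replication; the 13.1 GB figure comes from the data set slide.

```python
DATA_NODES = 4
CORES_PER_NODE = 4        # assumes "4 cores" is per data node
MEMORY_GB_PER_NODE = 7    # assumes "7 GB" is per data node
DATASET_GB = 13.1         # total file size from the data set slide


def cluster_totals():
    """Aggregate resources and the data share per node, under the assumptions above."""
    return {
        "total_cores": DATA_NODES * CORES_PER_NODE,
        "total_memory_gb": DATA_NODES * MEMORY_GB_PER_NODE,
        "data_gb_per_node": DATASET_GB / DATA_NODES,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the cluster offers 16 cores and 28 GB of memory in total, with about 3.3 GB of input data per data node.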

System Architecture

Experimental Results

Total number of flights cancelled each month for the period 2005-2014

Time taken: 210.862 seconds, Fetched: 120 row(s)

Total number of flights diverted each month for the period 2005-2014

Time taken: 216.704 seconds, Fetched: 120 row(s)
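
The shape of these two results (one row per year and month, hence 10 × 12 = 120 rows) can be reproduced on a small scale. The sketch below is not the query used in the study; it groups a few hypothetical CSV rows by year and month with Python's standard library. The column names `Year`, `Month`, `Cancelled` and `Diverted` and the `1.00`/`0.00` flag encoding are assumptions about the file layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed column layout.
SAMPLE_CSV = """Year,Month,Cancelled,Diverted
2005,1,1.00,0.00
2005,1,0.00,1.00
2005,1,1.00,0.00
2005,2,0.00,0.00
2006,1,0.00,1.00
"""


def monthly_totals(csv_text):
    """Return {(year, month): (cancelled_count, diverted_count)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["Year"]), int(row["Month"]))
        totals[key][0] += int(float(row["Cancelled"]))
        totals[key][1] += int(float(row["Diverted"]))
    return {key: tuple(counts) for key, counts in sorted(totals.items())}


if __name__ == "__main__":
    for (year, month), (cancelled, diverted) in monthly_totals(SAMPLE_CSV).items():
        print(year, month, cancelled, diverted)
```

On the full data set the same grouping would yield one output row for each of the 120 months in the period, matching the row counts reported above.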

Cancelled and Diverted flights by month

Figure: Cancelled/Diverted vs. Time (y-axis: number of cancelled/diverted flights, 0 to 35,000; series: Cancelled, Diverted)

Experimental Results

Total number of flights cancelled every year for the period 2005-2014

Time taken: 302.465 seconds, Fetched: 10 row(s)

Total number of flights diverted every year for the period 2005-2014

Time taken: 461.433 seconds, Fetched: 10 row(s)

Cancelled and Diverted flights by year

Figure: Number of cancelled/diverted flights vs. Year (x-axis: year, 2005 to 2014; y-axis: number of cancelled flights, 0 to 180,000; series: Cancelled, Diverted)

Experimental Results

Effect of flight distance on flight diversions

Time taken: 675.725 seconds, Fetched: 1500 row(s)

Diverted Flights vs. Distance

Figure: Number of diverted flights vs. Distance (x-axis: flight distance in miles, 0 to 6,000; y-axis: number of diverted flights (count), 0 to 1,000; series: Diverted (Count))

Experimental Results

Effect of flight distance on flight cancellations

Time taken: 576.925 seconds, Fetched: 1500 row(s)

Cancelled Flights vs. Distance

Figure: Number of cancelled flights vs. Distance (x-axis: flight distance in miles, 0 to 6,000; y-axis: number of cancelled flights (count), 0 to 14,000; series: Cancellation (Count))

Experimental Results

Effect of flight distance on average departure delay

Time taken: 992.911 seconds, Fetched: 1500 row(s)

Average Departure Delay vs. Flight Distance

Figure: Average Departure Delay vs. Flight Distance (x-axis: flight distance in miles, 0 to 6,000; y-axis: average departure delay in minutes, 0 to 250; series: Avg Dep Delay)

Experimental Results

Monthly average departure delay for the period 2005-2014

Time taken: 973.695 seconds, Fetched: 13 row(s)
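
An average computed over data that is split across chunks has to carry a sum and a count per group; averaging the per-chunk averages would weight small chunks too heavily. The sketch below illustrates this for monthly departure delay. It is an illustration only, with hypothetical sample values; skipping rows that have no recorded delay is an assumption.

```python
# Hypothetical chunks of (month, departure delay in minutes or None).
CHUNK_A = [(6, 10.0), (6, 20.0), (12, 30.0)]
CHUNK_B = [(6, 60.0), (12, None)]


def partial_stats(rows):
    """Per-chunk step: month -> [sum of delays, number of delays]."""
    stats = {}
    for month, dep_delay in rows:
        if dep_delay is None:      # no recorded delay: skipped (assumption)
            continue
        entry = stats.setdefault(month, [0.0, 0])
        entry[0] += dep_delay
        entry[1] += 1
    return stats


def merge_stats(partials):
    """Combine per-chunk sums and counts into global sums and counts."""
    merged = {}
    for stats in partials:
        for month, (total, count) in stats.items():
            entry = merged.setdefault(month, [0.0, 0])
            entry[0] += total
            entry[1] += count
    return merged


def monthly_average(partials):
    """Final step: divide the global sum by the global count for each month."""
    merged = merge_stats(partials)
    return {month: total / count for month, (total, count) in sorted(merged.items())}


if __name__ == "__main__":
    print(monthly_average([partial_stats(CHUNK_A), partial_stats(CHUNK_B)]))
```

In this sample the June average is 30 minutes (90 minutes over 3 flights), whereas averaging the two chunk averages (15 and 60) would wrongly give 37.5.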

Average Departure Delay by month

Figure: Average Departure Delay vs. Month (x-axis: month, Jan to Dec; y-axis: average departure delay in minutes, 0 to 14; series: Avg Dep Delay)

Experimental Results

Yearly average departure delay for the period 2005-2014

Time taken: 623.694 seconds, Fetched: 11 row(s)

Average Departure Delay by year

Figure: Average Departure Delay vs. Year (x-axis: year, 2005 to 2014; y-axis: average departure delay in minutes, 0 to 12; series: Avg Dep Delay)
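
The query times reported across the result slides can be put on a common scale as records processed per second. The times below are copied from the slides; the sketch assumes every query scanned all 66 million records, which the slides do not state, and the short query labels are ours.

```python
TOTAL_RECORDS = 66_000_000  # from the data set slide

# Query times in seconds, copied from the result slides.
QUERY_SECONDS = {
    "cancelled flights by month": 210.862,
    "diverted flights by month": 216.704,
    "cancelled flights by year": 302.465,
    "diverted flights by year": 461.433,
    "diversions by distance": 675.725,
    "cancellations by distance": 576.925,
    "avg departure delay by distance": 992.911,
    "avg departure delay by month": 973.695,
    "avg departure delay by year": 623.694,
}


def records_per_second(seconds, records=TOTAL_RECORDS):
    """Throughput, assuming the query scanned every record (assumption)."""
    return records / seconds


def summarize(times=QUERY_SECONDS):
    """Return (query, seconds, records per second), fastest query first."""
    rows = [(name, secs, records_per_second(secs)) for name, secs in times.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, secs, rate in summarize():
        print(f"{name:34s} {secs:8.3f} s {rate:10,.0f} records/s")
```

Under that assumption, throughput ranges from roughly 313,000 records per second for the fastest query down to roughly 66,000 for the slowest.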

Conclusion

Interesting trends and patterns exist in large data sets

Average departure delay peaks in the middle and at the end of the year, i.e. during June, July and December

Within the period 2005-2014, the highest number of flights was cancelled in 2007

Cloud infrastructure has enabled the use of Hadoop for big data systems with minimal investment and cost of ownership

Hive provides an easy way to query the data without worrying about the underlying complex structure of the system

Big Data systems built in the cloud can be decommissioned without losing the data

Any large-scale data set in business can be analyzed

Marketing, Finance, Economics, Management

Contact Prof. Jongwook Woo ([email protected]) if you would like to collaborate

Questions?
