
LATENCY AWARE ELASTIC STREAMING FOR

ESTIMATING ONLINE VACANCY IN TRAFFIC DATA

Roshni P, Surekha Mariam Varghese

Mar Athanasius College of Engineering Kothamangalam Kerala

[email protected], [email protected]

Real-time road traffic monitoring is now an efficient method for managing road transport. In smart cities, trajectory data from vehicles are collected in real time, and processing this massive amount of data with low latency and minimal resource utilization is a considerable challenge. The trajectory sensor data stream can be used to predict and prevent traffic jams. An elastic and distributed system that performs real-time analysis of road traffic data is proposed to predict the vacancies in road traffic. The Apache Spark parallel framework is used to implement the distributed system that processes the huge volume of trajectory data. The amount of data generated depends on the time and day, so there is a need for elastic provisioning of resources to handle the varying workload. In order to manage and aggregate resources elastically, the system is deployed on the Amazon EC2 cloud. Since adding or removing computational resources takes several minutes, performance is impaired while new capacity is launched. The system therefore predicts the need for extra capacity using a decision tree classifier and provisions new resources in advance to meet the computational needs. The elasticity property of the system optimizes resource utilization and increases overall performance. Elasticity is implemented with a shell script and vacancy estimation with multiple linear regression.

Keywords. Amazon EMR, Apache Spark, Elasticity, Over-provisioning, Under-provisioning.

1. INTRODUCTION

Real-time streaming of big data has led to large-scale processing of data. Road-traffic congestion can be prevented by using real-time big data for new products and improved services built on traffic monitoring systems. For new use cases, the processing application needs to be flexible, simple and adaptable. Workload and resource use have to be balanced, and the system should adapt at run time to variations in workload by adding or removing computational nodes and redistributing the work.

Road transport becomes more efficient with real-time road traffic monitoring. In smart cities, traffic trajectory sensors produce trajectory data streams in real time. Processing real-time traffic sensor data in a large city is challenging because of the large number of vehicles present. Data generated from vehicles are used to predict traffic jams in a location, but processing and streaming real-time traffic data at the same time is quite difficult. Real-time traffic is calculated from data collected by wireless communication devices installed in the vehicles, and the trajectory sensors generating continuous data create a bottleneck.


Multiple linear regression is used to estimate traffic vacancy. The correlation coefficients are computed from a multiple linear regression model established with sample items, and these coefficients yield the vacancy estimate.

The proposed system presents an optimized method for efficient allocation of computing resources that accommodates the random nature of vehicle movement. An online, real-time approach to the problem is realized with the Apache Spark framework on Elastic MapReduce. The system processes the constant stream of location information from vehicles and addresses the major defect, sparseness, by employing real-time data. The vehicles emit new location information every second, so the data processing should take place close to the data providers, meaning the data transfer time between the stream operator and the source should be minimized [4]. The real-time processing (geographical as well as operational) of this transport data demands a distributed stream processing engine, where each stream processing operator can be deployed on different clouds as an individual operator node. Since traffic data arrive in real time and fluctuate in a non-continuous manner, the system streams the data elastically in the cloud [5].

Cloud computing is a kind of Internet-based computing that provides shared processing resources and data to computers and other devices on demand [2]. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources [2]. Cloud computing has become a highly demanded service or utility due to its advantages of high computing power, cheap cost of services, high performance, scalability, accessibility and availability [2]. Users access cloud computing using networked client devices such as desktop computers, laptops, tablets, smartphones and any Ethernet-enabled device such as home automation gadgets. The system uses Amazon Elastic MapReduce to rapidly and cost-effectively process vast amounts of data: computing instances can be provisioned to process data at any scale [6]. Increasing or decreasing the computing resources according to the requirements or workload is referred to as over-provisioning and under-provisioning respectively. The system uses the Apache Spark framework for parallel execution.

Figure 1.1 Spark Architecture

Figure 1.1 shows the Apache Spark cluster, where the driver node communicates with the executor nodes (logically similar to execution cores).
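As a minimal sketch of how such a cluster is addressed from code (the master URL and application name below are assumptions, not taken from the paper), a Spark session is created on the driver and work is then distributed to the executors:

from pyspark.sql import SparkSession

# Hypothetical cluster endpoint; on Amazon EMR the resource manager is typically YARN.
spark = (SparkSession.builder
         .master("yarn")
         .appName("traffic-vacancy-estimation")
         .getOrCreate())

# The driver owns the SparkContext; individual tasks run on the executor nodes.
print(spark.sparkContext.defaultParallelism)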


2. MOTIVATION

With the arrival of data-intensive applications that generate enormous volumes of real-time data, distributed stream processing systems have become increasingly important in road-traffic monitoring. Data stream processing is needed to handle incoming data streams from large numbers of sensors in a real-time fashion. Even when the incoming data rate fluctuates in a non-continuous manner, the data stream system must work with low latency. This is almost impossible on a local workstation or in a laboratory because its computational resources are finite. A cost-efficient stream processing engine is therefore needed, one that adopts elastic resource usage while maintaining real-time processing. Elastic resource usage means elastically increasing and decreasing the computational resources according to the requirements. In the proposed system, the method changes the computing environment based on the data rate of the input data stream. By using the cloud environment, additional computational resources can be added or removed within a few minutes, and there is no need to consider where the new resources are located. Moreover, the number of virtual machines can be changed dynamically, so the system can deal with the elastic nature of the workload by temporarily adding or removing computational nodes. Green computing is the environmentally responsible and eco-friendly use of computers and their resources [3] & [7]. Reducing electronic waste and power dissipation contributes strongly to green computing, so using computing equipment in proportion to the workload helps achieve it. To support this, the elasticity property should be implemented wherever computations must be performed on massive amounts of data.

3. PROPOSED SYSTEM

The increasing power of computers has brought innovations to the world of information. Analysing massive real-time data requires massively distributed computing, which is well adapted to the cloud, where data can be processed elastically. The main objective of the proposed system is to make the cloud services elastic for processing the real-time data coming from traffic sensors and to help decision making for transport systems. By using the cloud environment, the amount of additional computational resources can be varied, and there is no need to consider where the new resources are located. In addition, the number of virtual machines can be changed dynamically, so the system can deal with situations where the data rate suddenly becomes high by temporarily adding cloud nodes. The system takes the continuously generated trajectory data, adds computational nodes in the cloud environment by using an appropriate number of virtual machines, and processes the data in parallel. The main challenges in processing the traffic data streams are the inhomogeneous sparseness in both spatial and temporal dimensions that is introduced by probe vehicles moving at their own will, and processing the stream data in a real-time manner with low latency [1]. The proposed system presents a real-time road traffic monitoring system that collects data from vehicles to monitor the real-time traffic scenario. Since the data continue to be generated in massive amounts, elasticity is established to meet the computational requirements.

The proposed system mainly consists of:


1. Real time road traffic monitoring in Apache Spark

2. Design of elastic stream processing in Amazon Web Service cloud

An Elastic Stream Processing Platform for transport data analysis is proposed to estimate the vacancies in raw traffic data. The road traffic monitoring module is executed in the Amazon Web Services cloud. The monitor checks the storage and the average CPU usage of the cluster instances to fire the trigger, and various threshold metrics are given to check whether there is a need for over-provisioning or under-provisioning. If the CPU usage is above a limit, the cluster is resized to add more task nodes, thereby providing more storage and memory; if the usage is less than the threshold, the cluster instances are shrunk. The usage monitor observes the load of the stream source and analyses its CPU and RAM usage. Based on the thresholds, it is decided whether to scale up or down. This monitoring data is preprocessed within the usage monitor to derive metrics such as the average system load per minute and is forwarded to the reasoner, which decides between the over-provisioning and under-provisioning scenarios.

The processing node management component takes care of allocating new processing nodes to ensure the real-time processing capabilities of the stream processing engine. It allocates cloud computing resources and deploys the processing nodes on these computational resources. Velocity estimates computed from vehicle speeds are used to decide whether there is a blockage at the study (estimated) region. It can take up to 10 minutes for an instance to launch in the cloud.

Figure 3.1 Proposed System


That is, up to ten minutes pass between the moment the system detects the requirement for extra capacity and the time when that capacity is actually available, which amounts to ten minutes of impaired computational performance. If proper capacity prediction is done beforehand, capacity can be added before it is needed, ensuring that the proper capacity is always in place. The system handles this case by predicting the need for extra capacity with a decision tree classifier.

3.1. Real-time Traffic Monitoring Design

The real-time road traffic monitoring design collects real-time data from the GPS wireless communication devices in vehicles to monitor the traffic scenario. The velocity or vacancy estimation problem in traffic data is addressed using multiple linear regression in [1]. The velocity is taken as the criterion for predicting traffic congestion in a region, under the assumption that good traffic conditions always result in higher velocities.

1. The location is identified and converted to (latitude, longitude).

2. The eight neighborhood regions of the location are found.

3. The vehicles in each region are identified, and the velocity of each vehicle is calculated from its distance travelled and time.

4. The average velocity over every region's velocities is computed and taken as the real value.

5. The multiple regression approach is used to find the estimated velocity of the region.

If there are m neighboring regions, construct the matrix X:

X = \begin{bmatrix} 1 & v_{11} & \cdots & v_{m1} \\ 1 & v_{12} & \cdots & v_{m2} \\ \vdots & \vdots & & \vdots \\ 1 & v_{1h} & \cdots & v_{mh} \end{bmatrix}

6. The estimate of the correlation coefficient \hat{\beta} is obtained as

\hat{\beta} = (X^T X)^{-1} (X^T V)

7. The model representing the correlations between region r_i and its neighboring regions at time t is

v_{it} = \beta_0 + \beta_1 v_{1t} + \beta_2 v_{2t} + \cdots + \beta_m v_{mt} + \mu_t

8. v_{it} is the average velocity at the estimated region.
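The closed-form estimate above can be sketched in a few lines of Python with NumPy. This is only an illustration under assumed sample values, not the authors' implementation; lstsq is used as a numerically safer equivalent of the normal-equation formula.

import numpy as np

# Velocities of m = 2 neighboring regions over h = 4 time steps,
# with a leading column of ones for the intercept (matrix X).
X = np.array([[1.0, 40.2, 35.1],
              [1.0, 42.5, 33.8],
              [1.0, 38.9, 30.4],
              [1.0, 41.0, 34.6]])
# Average velocities observed at the estimated region (vector V).
V = np.array([37.5, 38.2, 34.9, 36.8])

# beta_hat solves min ||X beta - V||, equivalent to (X^T X)^{-1} (X^T V).
beta_hat, *_ = np.linalg.lstsq(X, V, rcond=None)

# Estimated velocity at the region for a new observation of the neighbors.
v_new = np.array([1.0, 43.0, 35.0])
print(beta_hat, v_new @ beta_hat)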

The velocities in the neighborhood regions are also considered when estimating the vacancy at a particular region, so based on the velocity at a region one can predict whether there is more or less traffic there. In the traffic vacancy estimation problem, the first step is to find all nearby vehicles around the study region within a specific time window. The geohash method divides points on the earth's surface into grid cells. The pseudocode for the traffic vacancy estimation is given below [1].

Input:
  Location_i : longitude and latitude
  scope_i : range of the region
  d_i : direction of traffic condition estimation
  Δt_i : time window
  t_i : time of traffic condition
Output:
  ETCM : traffic condition estimation matrix


1: r(r_i, t_i, d_i) ← geohash((longitude, latitude), scope_i, Δt_i)
2: R_nb(i) ← SearchNeighboring(r(r_0, t_0, d_0)), |R_nb(i)| = 8
3: for k = 1 to |R_nb(i)| do
4:   R_k(r_i, t_i, d_i) = {S_ki(t) | r(r, r_i) < scope_i ∧ |t − t_i| < Δt_i ∧ d = d_i}
5:   v̄_k(r_i, t_i, d_i) = (1/N) Σ_{S ∈ R_k} v(S), where N = |R_k(r_i, t_i, d_i)|
6:   C_kTCM(d_i) ← v̄_k(r_i, t_i, d_i)
7: end for
8: while d_i == d_estimate do
9:   X ← C_kTCM(d_i)                                      // sample from two flanks
10:  β̂ = (β̂_0, β̂_1, β̂_2, ..., β̂_m)^T = (X^T X)^{-1}(X^T V)
11:  v̂_it = β̂_0 + β̂_1 v_1t + β̂_2 v_2t + ... + β̂_m v_mt    // sample in the same direction
12: end while
13: ETCM ← v̂_it
14: return ETCM

The region is divided into nine rectangular areas: the estimation region and the eight neighboring regions around it. The estimate is made using the two rectangular areas that lie in the same direction as the estimation region.
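To make the grid division concrete, the sketch below maps a (latitude, longitude) point to a rectangular cell and lists its eight neighbors. The cell size is an assumption for illustration; the paper itself relies on the geohash method rather than this hand-rolled grid.

# Assumed grid resolution of about 0.01 degrees per cell (hypothetical).
CELL_DEG = 0.01

def cell_of(lat, lon):
    # Index of the rectangular grid cell containing the point.
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def neighbors(cell):
    # The eight cells surrounding the estimation region.
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]

study = cell_of(39.915, 116.404)
print(study, neighbors(study))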

3.2. Elasticity Design

Elasticity is defined as the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible [8]. For the elastic streaming of transport data, there must be both under-provisioning (de-provisioning) and over-provisioning (provisioning) cases: over-provisioning means allocating more resources than required, and under-provisioning means allocating fewer than the fixed number of resources. Adjusting to a varying workload by altering the number or use of computing resources is called "elastic computing" [9].

Elasticity can be illustrated with an example. Consider a road traffic monitoring office where road traffic is monitored with five deployed systems. During peak time (say, Saturday evening) the number of vehicles is high, so more computational nodes are needed to store and process the data generated randomly by the vehicles; that is, over-provisioning is required, with more worker nodes added to handle the traffic load without overloading any node. Consider another scenario where there are hardly any vehicles (say, midnight); then there is no need for five worker nodes to process the data. The system should therefore provision the cluster elastically according to the requirements. Elasticity is expressed as thresholds that set the conditions triggering over-provisioning or under-provisioning: provisioning and de-provisioning are triggered when the average CPU utilization is above an upper utilization threshold or below a lower utilization threshold. Different metrics are used as the thresholds [10].

Reconfiguration actions aim at achieving an average CPU utilization as close as possible to the target utilization threshold. Load balancing is triggered when the standard deviation of the CPU utilization is above the upper imbalance threshold.


In order to enforce the elasticity rules, the system periodically collects monitoring information, including the average CPU usage, from all instances in each sub-cluster. The system then computes the average CPU usage per sub-cluster. If it is outside the allowed range, the number of instances required to cope with the current load is computed. If the sub-cluster is under-provisioned, new instances are allocated; if it is over-provisioned, the load of the unneeded instances is transferred to the remaining instances by an offload function.
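One simple way to turn the measured load into a required instance count, sketched below as an assumption rather than the paper's exact rule, is to scale the current count by the ratio of observed to target CPU utilization.

import math

# Hypothetical thresholds; the paper states that a target and an allowed
# range exist but does not fix their exact values here.
TARGET_UTIL = 0.70
LOWER_UTIL, UPPER_UTIL = 0.30, 0.90

def required_instances(current_count, avg_cpu_util):
    # Outside the allowed range: rescale so utilization moves toward the target.
    if avg_cpu_util > UPPER_UTIL or avg_cpu_util < LOWER_UTIL:
        return max(1, math.ceil(current_count * avg_cpu_util / TARGET_UTIL))
    return current_count

print(required_instances(4, 0.95))  # under-provisioned sub-cluster: grow
print(required_instances(4, 0.10))  # over-provisioned sub-cluster: shrink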

(a) Over-Provisioning Scenario

The upscaling algorithm uses two upscaling thresholds. The reasoner decides to scale up either when the load of the incoming queue exceeds a specific threshold or when the average CPU usage of the processing nodes exceeds 95% of the available CPU resources.

def scaleUp():
    # Scale up when the incoming queue or the average CPU usage crosses its threshold.
    if incomingData > specificThreshold:
        addmoreNodes()
    elif avgCPUusage() > 0.95:
        addmoreNodes()
    else:
        doNothing()

The CPU usage of the processing nodes is monitored continuously. Since the amount of traffic data recorded in the continuous flow varies with time, the CPU load is analysed to check whether it rises above the threshold. If it reaches a value above the threshold, the scale-up function is called to add more instances to process the data [11].

(b) Under-Provisioning Scenario

The downscaling operation is triggered if the load of the incoming queue falls below a specific threshold and there are at least two processing nodes assigned to an operator node. If this is the case, the reasoner iterates through all processing nodes to select a suitable one to remove.

def scaleDown():
    # Scale down only when the incoming load is low and more than one node remains.
    if incomingData < specificThreshold:
        if processingNode > 1:
            scaleDownNodes()
        elif avgCPUusage() < 0.20:
            removeNodes()

The under-provisioning design starts by analysing the incoming queue of traffic data [13]. If the incoming queue load is less than the specified threshold and there is more than one processing node, the system scales down the number of instances.

(c) Prediction to foresee Capacity Requirements

Amazon and other clouds cannot respond instantly to increased capacity needs: it can take more than five minutes for instances to launch, which means several minutes of impaired and delayed computational performance. Only if a proper prediction about capacity is made can this be avoided, by resizing the cluster instances before the workload arrives. This is done by considering the capacity needed on past days at the same time of day.


For example, to foresee whether provisioning is needed at 12.00 pm, the traffic data streamed at 12.00 pm on the past five days is analysed to check whether it exceeded the threshold and required a larger instance. The workload 10 minutes earlier on the same day is also taken as a criterion. From these, it is predicted whether provisioning of an instance is needed. Thus the latency during instance allocation can be decreased to an extent, thereby increasing the performance.

def capacitypred():
    # Compare the capacity needed at this time on past days with the threshold.
    currentTime()
    calculatepastcapacities()
    if capacitiesexceeds_more:
        resize_instances()
    else:
        elasticity()

The algorithm for the prediction analysis is shown above. It takes the current time and analyses the past capacities [14]. If most of the past capacities exceed the threshold, resizing is performed; otherwise the provisioning or de-provisioning algorithm is applied.

3.3. Dataset

A taxi trajectory data set that contains the one-day trajectories of approximately 7,648 taxis is used [12]. The total number of records in the dataset is about 18 million.

ID Timestamp Longitude Latitude

Figure 3.2 An extract from the dataset

The dataset consists of the vehicle number of the taxi, the time at which the information is sent, and the longitude and latitude of the current location from which the information is sent. Each record is represented as a four-tuple < ID, Timestamp, Longitude, Latitude > as in Figure 3.2.
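A small PySpark sketch of how such four-tuples might be loaded and aggregated per grid cell is given below. The file path, column names and cell size are assumptions for illustration; only the < ID, Timestamp, Longitude, Latitude > layout is taken from Figure 3.2.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trajectory-aggregation").getOrCreate()

# Hypothetical CSV location with columns in the order of the four-tuple.
df = (spark.read.csv("s3://example-bucket/taxi_trajectories.csv")
      .toDF("id", "timestamp", "longitude", "latitude")
      .withColumn("longitude", F.col("longitude").cast("double"))
      .withColumn("latitude", F.col("latitude").cast("double")))

# Assign each record to an assumed 0.01-degree grid cell and count the
# distinct vehicles per cell as a crude indicator of regional load.
cells = (df.withColumn("cell_lat", F.floor(F.col("latitude") / 0.01))
           .withColumn("cell_lon", F.floor(F.col("longitude") / 0.01)))
per_region = cells.groupBy("cell_lat", "cell_lon").agg(F.countDistinct("id").alias("vehicles"))
per_region.show(5)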

4. IMPLEMENTING ELASTICITY

The resource usage is continuously monitored and analysed against the current resource-usage thresholds. There are two scenarios, namely under-provisioning and over-provisioning [15]. The pseudocode below decides between scaling up (adding a new processing node), maintaining the current state, and scaling down (removing a processing node).

def capacitypred():
    # Predict from past capacities whether the cluster must be resized now.
    currentTime()
    calculatepastcapacities()
    if capacitiesexceeds_more:
        resize_instances()
    else:
        elasticity()

def elasticity():
    # Decide between scaling up, keeping the current state and scaling down.
    if queueIncomingload > upscalingThreshold:
        scaleUP()
    elif avgCPUload > 90:
        scaleUP()
    elif queueIncomingload < downscalingThreshold:
        if processingNodes > 1:
            scaleDownNode()
        elif avgCPUload < 30:
            scaleDownNode()
    else:
        doNothing()

The upscaling algorithm uses two upscaling thresholds: the reasoner decides to scale up either when the load of the incoming queue exceeds a specific threshold or when the average CPU usage of the processing nodes exceeds 90% of the available CPU resources. The downscaling operation is triggered when the load of the incoming queue falls below a specific threshold. Prediction to foresee the capacity requirement before the arrival of the real-time traffic data is also added to the elasticity design, in order to absorb the time taken to resize the cluster instances; thereby the latency of the computation can be reduced.

The decision about resizing also considers the following steps. In order to predict the capacity requirement at time hh:mm:ss, the data stream of the preceding ten minutes is considered. The prediction is done with a decision tree classifier. DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset [16] & [17]. Here there are two target classes, 'yes' and 'no': 'yes' stands for the need for resizing and 'no' for the converse. If 'no' is predicted, the conditions for over-provisioning and under-provisioning are considered instead. The workloads of the past five days are used for the prediction: if the threshold was exceeded on more than three of them, the sample falls in class 'yes', otherwise in class 'no'. Since there are five binary attributes, there are 32 possible training instances; exceeding the threshold is represented as '1' and not exceeding it as '0'. The decision tree classifier is implemented in Python using the scikit-learn library, a simple and efficient set of tools for data mining and data analysis built on NumPy, SciPy and matplotlib. If the prediction results in the 'no' class, the under-provisioning or over-provisioning algorithm is triggered [18] & [19]. In the Amazon Web Services cloud, the instance usage is monitored using CloudWatch.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-6bcb57c4 \
  --statistics Average \
  --start-time `date -u '+%FT%TZ' -d '10 mins ago'` \
  --end-time `date -u '+%FT%TZ'` \
  --period 60 | jq '.Datapoints[0] | .Average'

Code 1: CloudWatch Monitoring

The output is the average CPU usage of the cluster instance over a 10-minute window. It is compared with the threshold value (say an upper limit of 90%). If it is greater than the threshold, the master and provisioned nodes are running out of CPU and more computational resources are needed [20], so the cluster is resized to a larger configuration by adding more task nodes; this is how over-provisioning is carried out. If the usage is below the threshold, the cluster is shrunk to its minimum instance type.
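The resizing step itself can also be issued programmatically. The sketch below uses the boto3 SDK to change the target size of a task instance group; the cluster and instance-group identifiers are hypothetical, and the paper's own implementation uses a shell script rather than this Python call.

import boto3

emr = boto3.client("emr")

def resize_task_group(cluster_id, instance_group_id, new_count):
    # Request a new target instance count for the task group of the EMR cluster.
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": instance_group_id,
            "InstanceCount": new_count,
        }],
    )

# Hypothetical identifiers; real values come from the running cluster.
resize_task_group("j-EXAMPLECLUSTER", "ig-EXAMPLETASKGROUP", 4)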

from sklearn import tree

# Train on the 32 workload patterns and their 'yes'/'no' labels, then
# predict whether the current pattern requires resizing the cluster.
clf = tree.DecisionTreeClassifier()
clf = clf.fit(dataset, target)
predctn = clf.predict(input)

Code 2: Pseudocode for prediction
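As a concrete illustration of the training data described above, the sketch below enumerates the 32 possible patterns of the five binary attributes (one per past day) and labels a pattern 'yes' when the threshold was exceeded on more than three days. The labeling rule follows the text; the variable names are assumed.

from itertools import product
from sklearn import tree

# 2^5 = 32 training rows: one binary attribute per past day at the same time.
dataset = [list(bits) for bits in product([0, 1], repeat=5)]
# Class 'yes' when the threshold was exceeded on more than three of the days.
target = ["yes" if sum(bits) > 3 else "no" for bits in dataset]

clf = tree.DecisionTreeClassifier().fit(dataset, target)

# Example query: threshold exceeded on four of the past five days.
print(clf.predict([[1, 1, 0, 1, 1]]))  # expected: ['yes']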

5. PERFORMANCE ANALYSIS

This section analyses the behaviour of the proposed system in terms of properties such as execution time, data input and liveness. The analysis shows the need for elasticity when the data load is low or high and when the CPU usage is high. The conditions used to perform the analysis are:

1. Under-provisioning when the input load is less than a threshold (based on the instance type in the cluster).

2. Over-provisioning when the input load is greater than a threshold (the threshold varies with the instance type in the cluster).

3. Over-provisioning when the CPU usage reaches 90% (meaning the instance has reached its saturation point and will not perform well).

The instance type used during cluster creation is m1.large, a general purpose instance type. One master node and one core node are allocated for the cluster. A core (slave) node is an instance in the cluster that runs tasks and stores data; a master node typically runs the master components of the distributed applications installed on the cluster. More task nodes are added to the cluster when under-provisioning or over-provisioning occurs; up to 48 additional task groups can be added, and when this limit is reached the instance type itself can be modified. The analysis is also carried out with the m1.medium instance type (when the system is under-provisioned) and by adding more task nodes (when the system is over-provisioned) to demonstrate the elastic property. The resource utilization of the running cluster is shown in Figure 5.1. The resource usage in terms of CPU, memory and storage is plotted against time: the x-axis is the time duration and the y-axis is the resource utilized in each hour while performing the computations on traffic data, with points taken at one-hour intervals.


Figure 5.1 Resource Utilization

In the starting phase, the resource usage is plotted for computations in the early morning. At that time the number of vehicles is very small, so there is no need for heavy utilization of resources; most of the time the allocated resources remain idle or in a low-usage state, resulting in a waste of computational resources. Even in the low-usage state there is considerable power dissipation and cost, since the allocated resources have high configurations [21]. The usage of resources then gradually increases. At peak time, especially in the morning and evening, there is a large number of vehicles and thus a large amount of streaming data, so resource usage becomes very high, which lengthens execution time and increases latency. The system aims at reducing the latency and increasing the throughput. Under-provisioning is therefore applied when the resource usage falls below a certain threshold (say 30%), and a small instance is allocated for processing the traffic data at those times. Similarly, more resources are added to process the data at peak times, since most of the resources are saturated under heavy load. A threshold of 80% is set as the condition for over-provisioning and a threshold of 30% from the usage monitoring as the condition for under-provisioning [22]. The cluster is provisioned with more task nodes when the CPU usage reaches a threshold of 85%. When more task nodes are assigned, the memory usage gradually decreases as the workload is distributed among the task nodes. Figure 5.3 shows the memory usage after the provisioning of nodes.


Figure 5.2 Capacity-Utilization Curve

Figure 5.3 Memory usage after provisioning of nodes

A comparison graph, shown in Figure 5.4, compares the memory usage before and after provisioning. It shows that memory usage can be reduced by provisioning more task nodes, and thus both the latency and the execution time can be decreased [23].

Figure 5.4 Comparison


6. CONCLUSION

The demand for new, efficient methods for processing large-scale heterogeneous data in real time is growing. Currently, a key challenge in Big Data is performing low-latency analysis on real-time data. In vehicle traffic, continuous high-speed data streams generate large data volumes. The system is deployed on a distributed and parallel computing framework with Apache Spark, and elasticity is deployed in the Amazon EC2 cloud to handle the varying storage and resource demands. The system evaluates the need for elastic, distributed, real-time analysis of heterogeneous data, and it is shown that elasticity can contribute to low resource utilization, low latency and high throughput. Predicting the need for extra capacity reduces the time taken to launch clusters by providing capacity when it is required. The road traffic monitoring also provides an efficient mechanism to predict congestion in a region by leveraging real-time data. The system further discovers trends in system resources by predicting the workload, so as to increase the system capacity and performance, and it contributes strongly to green computing by optimizing the use of computational resources.

7. REFERENCES

[1] F. Wang et al., Estimating online vacancies in real-time road traffic monitoring with traffic sensor data stream, Ad Hoc Networks, Volume 35, December 2015, Pages 3-13.

[2] Konstantinou, I., Angelou, E., Boumpouka, C., Tsoumakos, D., and Koziris, N. On

the elasticity of nosql databases over cloud management platforms. In Proceedings of the

20th ACM international conference on Information and knowledge management (2011),

ACM, pp. 2385–2388.

[3] Jugraj Veer Singh, Sonia Vatta, Green Computing: Eco Friendly Technology, International Journal of Engineering Research and General Science, Volume 4, Issue 1, Jan-Feb 2016, ISSN 2091-2730.

[4] Michael Franklin, Alon Halevy, From Databases to Dataspaces: A New Abstraction for Information Management, ACM SIGMOD Record, Volume 34, Issue 4, December 2005, Pages 27-33.

[5] Gulisano, Ricardo Jimenez-Peris, Marta Patino-Martinez, Claudio Soriente, and

Patrick Valduriez. 2012. StreamCloud: An Elastic and Scalable Data Streaming System.

IEEE Trans. Parallel Distrib. Syst. 23, 12 (December 2012), 2351-2365.

DOI=http://dx.doi.org/10.1109/TPDS.2012.24

[6] Thomas Heinze, Valerio Pappalardo, Zbigniew Jerzak, and Christof Fetzer. 2014. Auto-scaling techniques for elastic data stream processing. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems.

[7] Yingjun Wu, Kian-Lee Tan, ChronoStream: Elastic Stateful Stream Computation in the Cloud, Data Engineering (ICDE), 2015 IEEE 31st International Conference, doi:10.1109/ICDE.2015.7113328.


[8] F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, C. Ratti, Real-time urban

monitoring using cell phones: A case study in rome, IEEE Trans. Intell.Transp. Syst. 12

(1) (2011) 141–151, doi:10.1109/tits.2010.2074196.

[9] N. Caceres, J.P. Wideberg, G. Benitez, Review of traffic data estimations extracted from cellular networks, IET Intell. Transp. Syst. 2 (3) (2008) 179-192, doi:10.1049/iet-its:20080003.

[10] J.C. Herrera, D.B. Work, R. Herring, X.G. Ban, Q. Jacobson, A.M. Bayen, Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment, Transp. Res. Part C: Emerging Technol. 18 (4) (2010) 568-583, doi:10.1016/j.trc.2009.10.006.

[11] H. Su, K. Zheng, J. Huang, H. Jeung, L. Chen, X. Zhou, Crowdplanner: A crowd-

based route recommendation system, in: Data Engineering (ICDE), 2014 IEEE 30th

International Conference on, 2014, doi:10.1109/ICDE.2014.6816730.

[12] Trajectory Data, https://www.microsoft.com/enus/research/publication/trajectory-data-mining-an-overview/

[13] R. Frank, M. Mouton, T. Engel, Towards collaborative traffic sensing using mobile

phones (poster), in: 2012 IEEE Vehicular Networking Conference, VNC 2012, November

14- November 16, IEEE Computer Society, 2012, pp. 115–120,

doi:10.1109/VNC.2012.6407419.

[14] J. Zhou, C.L. Philip Chen, L. Chen, A small-scale traffic monitoring system in urban

wireless sensor networks, in: Proceedings of the IEEE International Conference on

Systems, Man, and Cybernetics, SMC 2013, October 13, 2013 - October 16, 2013, IEEE

Computer Society, 2013, pp. 4929–4934, doi:10.1109/SMC.2013.842.

[15] M. Whaiduzzaman, M. Sookhak, A. Gani, R. Buyya, A survey on vehicular cloud

computing, J. Network Comput Appl.325–344, doi:10.1016/j.jnca.2013.08.004.

[16] M. Gerla, Vehicular cloud computing, in: 11th Annual Mediterranean Ad Hoc

Networking Workshop, Med-Hoc-Net 2012, June 19,2012 - June 22, 2012, IEEE

Computer Society, 2012, pp. 152– 155,doi:10.1109/MedHocNet.2012.6257116.

[17] C.Y. Goh, J. Dauwels, N. Mitrovic, M.T. Asif, A. Oran, P. Jaillet, Online map-

matching based on hidden markov model for real-time traffic sensing applications,

Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on,

2012, pp. 776–781, doi:10.1109/ITSC.2012.6338627.

[18] S Rajeswari, K Suthendran, K Rajakumar and S Arumugam, “An Overview of the

MapReduce Model”, International Conference on Theoretical Computer Science and

Discrete Mathematics, Springer-LNCS,Vol.10398,pp.312-317,2016.

[19] S.Rajeswari and K. Suthendran , “Chi-Square MapReduce Model for Agricultural

Data”, Journal of Cyber Security and Mobility,Vol.7(1),pp.13-24,2018.

[20] Thulasi Mohan,Shilpa Sudheendran, Fepslin AthishMon and K. Suthendran,

“Divisioning and Replicating Data in Cloud for Optimal Performance and Security”, International Journal of Pure and Applied Mathematics,Vol.118,pp. 271-275,2018.

[21] Z. Shan, D. Zhao, Y. Xia, Urban road traffic speed estimation for missing probe vehicle data based on multiple linear regression model, in: Intelligent Transportation Systems (ITSC), 2013 16th International IEEE Conference on, 2013, pp. 118-123, doi:10.1109/ITSC.2013.6728220.

[22] Hamzeh Khazaei, Saeed Zareian, Rodrigo Veleda, Marin Litoiu Sipresk: A Big Data

Analytic Platform for Smart Transportation

[23] Yisheng Lv, 'Traffic Flow Prediction with Big Data: A Deep Learning Approach', State Key Lab. of Manage. & Control for Complex Syst., Inst. of Autom., Beijing, China.
