
[IEEE 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE) - Sydney, Australia (2013.12.3-2013.12.5)] 2013 IEEE 16th International Conference on Computational


5Ws Model for BigData Analysis and Visualization

Jinson Zhang, IEEE Member
School of Software, Faculty of Engineering & IT
University of Technology, Sydney
Sydney, Australia
[email protected]

Mao Lin Huang
School of Computer Software, Tianjin University, Tianjin, China
School of Software, Faculty of Engineering & IT, University of Technology, Sydney
Sydney, Australia
[email protected]

2013 IEEE 16th International Conference on Computational Science and Engineering

978-0-7695-5096-1/13 $31.00 © 2013 IEEE

DOI 10.1109/CSE.2013.149

Abstract: BigData, which contains image, video, text, audio and other forms of data collected from multiple datasets, is difficult to process using traditional database management tools or applications. In this paper, we establish the 5Ws model by using the 5Ws data dimensions for BigData analysis and visualization. The 5Ws data dimensions stand for: What the data content is, Why the data occurred, Where the data came from, When the data occurred, Who received the data and How the data was transferred. This framework not only classifies BigData attributes and patterns, but also establishes density patterns that provide more analytical features. We use visual clustering to display data sending and receiving densities, which demonstrate BigData patterns. The model is tested by using the network security ISCX2012 dataset. The experiment shows that this new model with clustered visualization can be efficiently used for BigData analysis and visualization.

Keywords: BigData analysis; BigData pattern; data dimensions; data density; BigData visualization

I. INTRODUCTION

BigData, according to Wikipedia (Aug 2013), is “the term for a collection of data set so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”. The datasets not only contain structured databases, but also include unstructured databases such as social media data or GPS (Global Positioning System) data.

According to the Gartner 3Vs definition [2], BigData has three characteristics: volume, velocity and variety, as shown in Fig 1.

Figure 1. 3Vs model for BigData [3]

Volume describes datasets that are extremely large and easily amass into Terabytes, even Zettabytes, of information. Table I shows dataset volume sizes. Too large a volume of data is not only a storage issue, but also a massive analysis issue.

TABLE I. DATASET VOLUME SIZE

VALUE     ABBREVIATION   NAME
1000^1    KB             Kilobytes
1000^2    MB             Megabytes
1000^3    GB             Gigabytes
1000^4    TB             Terabytes
1000^5    PB             Petabytes
1000^6    EB             Exabytes
1000^7    ZB             Zettabytes
1000^8    YB             Yottabytes

Velocity is how fast the dataset is being produced. Based on statistics from Pingdom 2012 [1], there are more than 38000 Google searches every second, 5 billion mobile phone users using 1.3 Exabytes of global mobile data traffic per month, 2.2 billion email users sending 144 billion emails per day, 2.7 billion likes on Facebook every day, 7 petabytes of photo content added on Facebook every month and 4 billion hours of video watched on YouTube a month.

Variety is how datasets contain both structured and unstructured data, such as documents, emails, audio files, images, videos, click streams, log files, or financial transactions. Hundreds, even thousands, of different attributes in multiple dimensions in the dataset provide too much information for traditional database management tools or applications to handle.

BigData comes from everywhere and influences our lives; it is too big, too complex and moves too fast. For example, posting pictures and writing comments on Facebook, uploading and watching videos on YouTube, sending and receiving messages through smart phones, and sending voice messages through WeChat all count as BigData. To analyse BigData, new analytical methods have to be developed to meet business, government and organization needs.

Distributed computing and parallel processing techniques are widely used in industry for BigData applications. Hadoop (High-availability distributed object oriented platform), the most popular open-source platform for reliable, scalable, distributed computing, is often referred to by BigData researchers. The two core frameworks in Hadoop, the Hadoop Distributed File System (HDFS) and MapReduce, have been deployed in industry to manage clustered, distributed data centers at companies such as Facebook, Google, Yahoo, Amazon.com, eBay and Twitter (hadoop.apache.org).

A. Motivation

The six data dimensions, or 5Ws data dimensions (why the data occurred, where the data came from, what the data is, how the data was transferred, who received the data and when the data occurred), for BigData analysis and visualization are not addressed by previous works, to the best of our knowledge.

Current BigData visualization approaches often reduce high dimension data to low dimension, and omit some data trends or relationships. The visual graph displays raw data through multiple clusters or multiple views, using different shapes and colours to help identify data patterns. But too many lines or nodes are used to display these massive links or nodes, which makes the graphs hard for businesses, governments and organizations to understand [4][5][7].

Our approach uses the 5Ws data dimensions to establish BigData patterns and to uncover the correlations which traditional database management tools could not reveal. The clustered visualization methods display the 5Ws data dimensions without any linking between data sources and data destinations.

B. Our contributions

In this paper, we have further developed our visual analytics model [8] and [13] for BigData analysis. The BigData pattern established can handle multiple datasets. First, we analyzed the attributes of datasets and then introduced the 5Ws data dimensions for classification. Second, we established the 5Ws model that measures data sending and receiving patterns. Third, we introduced a visualization method to illustrate BigData patterns and its evaluation for different datasets and resources.

The 5Ws model with clustered visualization provides a clear outline of data patterns that significantly changes the measurement for BigData analysis and visualization. Our contributions are:

• Introduced 5Ws data dimensions to illustrate BigData attributes across different datasets
• Established density patterns based on data subsets to measure BigData behaviours
• Introduced a visual clustering method to display data patterns without any linking between data sources and data destinations

The paper is organized as follows: Section 2 illustrates our 5Ws data analytics model, Section 3 demonstrates the implementation and visualization, Section 4 describes related works, and Section 5 summarises our approach and future works.

II. 5Ws MODEL

A. 5Ws data dimensions

Each data item can be classified into the 5Ws data dimensions, which stand for what the data is, where the data came from, when the data occurred, who received the data, why the data occurred and how the data was transferred. Fig. 2 shows the example of BigData in the 5Ws data dimensions crossing multiple datasets and resources.

Figure 2. Example of BigData in 5Ws data dimensions crossing multiple datasets


We define six sets to represent the 5Ws data dimensions.

• set T={t1, t2, tj,…, tm} representing when the data occurred
• set X={x1, x2, xj,…, xm} representing how the data was transferred
• set Y={y1, y2, yj,…, ym} representing why the data occurred
• set Z={z1, z2, zj,…, zm} representing what the data is
• set P={p1, p2, pj,…, pm} representing where the data came from
• set Q={q1, q2, qj,…, qm} representing who received the data

Therefore, each data item can be defined as a node

f (t, x, y, z, p, q)

where

• t ∈ T is the time stamp for each data incidence.
• x ∈ X represents how the data was transferred, such as “by Internet”, “by email” or “online transferred”.
• y ∈ Y represents why the data occurred, such as “sharing photos”, “finding new friends” or “spreading a virus”.
• z ∈ Z represents what the data is, such as “video”, “image”, “text” or “number”.
• p ∈ P represents where the data came from, such as “from twitter”, “smart phone” or “hacker”.
• q ∈ Q represents who received the data, such as “friend”, “bank account” or “victim”.

All data in the time slot T, represented as a set F with n incidences, is defined as

$$F = \{f_1, f_2, f_3, \ldots, f_n\} \tag{1}$$

F contains all incident nodes within a certain time period. For example, there were 9.66 million tweets during the Opening Ceremony of the London 2012 Olympic Games [1]. The twitter dataset for the Opening Ceremony therefore has |F| = 9.66 million.

For a particular attribute node where x=α, y=β, z=γ, p=δ and q=ε, the node can then be defined as

$$f(\alpha, \beta, \gamma, \delta, \varepsilon) = f(t, x(\alpha), y(\beta), z(\gamma), p(\delta), q(\varepsilon)) \tag{2}$$

A subset F(α, β, γ, δ, ε) that contains all the particular attributed nodes f(α, β, γ, δ, ε) in the time slot T is therefore defined as

$$F(\alpha, \beta, \gamma, \delta, \varepsilon) = \{\, f \in F \mid f(t, x, y, z, p, q),\ x=\alpha,\ y=\beta,\ z=\gamma,\ p=\delta,\ q=\varepsilon \,\} \tag{3}$$

The subset F(α, β, γ, δ, ε) represents the particular incident nodes by the 5Ws data dimensions. For example, during the Opening Ceremony of the London 2012 Olympic Games, the 9.66 million tweets contain multiple patterns such as α = “sent or received”, β = “sharing opening ceremony” or “enjoying ceremony”, γ = “London” + “Olympics” + “Opening” + “Ceremony” plus more, δ = twitter, ε = users and t = 27-Jul-2012, 21:00 – 00:45.
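As a minimal sketch of the node definition and the subset in (3), the following Python fragment filters a small list of nodes by their 5Ws attributes. The `Node` class, the `subset` helper and the three sample records are illustrative assumptions and are not part of the paper; leaving an attribute as `None` is also an assumption, used here to mean "any value".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One data incidence f(t, x, y, z, p, q)."""
    t: str  # when the data occurred
    x: str  # how the data was transferred
    y: str  # why the data occurred
    z: str  # what the data is
    p: str  # where the data came from (sender)
    q: str  # who received the data (receiver)


def subset(F, x=None, y=None, z=None, p=None, q=None):
    """Return the nodes of F that match every attribute that is not None."""
    wanted = {"x": x, "y": y, "z": z, "p": p, "q": q}
    return [
        f for f in F
        if all(v is None or getattr(f, k) == v for k, v in wanted.items())
    ]


# Hypothetical sample records, loosely modelled on the tweet example.
F = [
    Node("21:00", "sent", "sharing opening ceremony", "text", "twitter", "users"),
    Node("21:05", "sent", "sharing opening ceremony", "image", "twitter", "users"),
    Node("21:07", "received", "enjoying ceremony", "text", "twitter", "users"),
]

matches = subset(F, x="sent", y="sharing opening ceremony",
                 z="text", p="twitter", q="users")
print(len(matches), "of", len(F), "nodes match")
```

Fixing all five attributes corresponds to the subset in (3); fixing fewer attributes gives the coarser subsets that the density measures rely on.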

The dataset size |F| illustrates the statistical results in volume and velocity. The subset F(α, β, γ, δ, ε) demonstrates the variety for the particular incident pattern, which provides more analytical features for business, government and organizations.

B. Data transfer patterns

The subset F(α, β, γ, δ, ε) contains information about where the data came from (sender δ), who received the data (receiver ε) and how the data was transferred (α). The sender (δ) and receiver (ε) can be a person, location, system, or any attribute that sent or received data. Three basic data transfer patterns are shown in Fig. 3.

Figure 3. Three basic data transfer patterns: (a) 1:1, (b) 1:N, (c) N:1

(a) represents a data transfer pattern of 1:1, which means that the data occurred only between the sender (δ) and the receiver (ε). If the pattern is 1:N, as shown in (b), it indicates that the sender (δ) sent multiple data and (ε) is one of the receivers. A pattern of N:1 is shown in (c), which indicates that multiple data is sent to the receiver (ε) and (δ) is one of the senders. The pattern N:N can be described as a combination of these three basic patterns, N:N = (1:1) + (1:N) + (N:1).
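One way to make the three basic patterns concrete is to count the distinct counterparts of a sender and a receiver. The sketch below does this for a list of (sender, receiver) pairs. Deciding the pattern from distinct-counterpart counts is an assumption made for illustration, and the function name and sample pairs are hypothetical; the paper itself classifies patterns through the density measures introduced next.

```python
from collections import defaultdict


def transfer_pattern(pairs, sender, receiver):
    """Label the pattern seen from one (sender, receiver) pair."""
    receivers_of = defaultdict(set)  # sender -> distinct receivers
    senders_of = defaultdict(set)    # receiver -> distinct senders
    for p, q in pairs:
        receivers_of[p].add(q)
        senders_of[q].add(p)

    many_receivers = len(receivers_of[sender]) > 1
    many_senders = len(senders_of[receiver]) > 1
    if many_receivers and many_senders:
        return "N:N"
    if many_receivers:
        return "1:N"
    if many_senders:
        return "N:1"
    return "1:1"


# Hypothetical (sender, receiver) pairs.
pairs = [
    ("a", "u"),              # a talks only to u
    ("b", "v"), ("b", "w"),  # b sends to several receivers
    ("c", "z"), ("d", "z"),  # z receives from several senders
]

for s, r in [("a", "u"), ("b", "v"), ("c", "z")]:
    print(s, "->", r, transfer_pattern(pairs, s, r))
```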

C. Data sending density

We introduce the sending density (SD) to measure the sender’s pattern during data transfer. Based on (3), the sending density for the particular attributes x=α, y=β, z=γ and p=δ, in time slot T, is defined as SD(α, β, γ, δ):

$$SD(\alpha, \beta, \gamma, \delta) = \frac{|F(\alpha, \beta, \gamma, \delta)|}{|F|} = \frac{|\{\, f(t, x, y, z, p, q) \in F : x=\alpha,\ y=\beta,\ z=\gamma,\ p=\delta \,\}|}{|F|} \tag{4}$$

where 0 ≤ SD(α, β, γ, δ) ≤ 1.

SD(α, β, γ, δ) represents the 5Ws dimensions for the sender’s pattern: the content was γ, transferred by α, in t ⊂ T time, for reason β and sent by δ. A high value of SD( ) indicates where the most data came from. Fig 3 (b) illustrates the example of a high value of SD( ).

D. Data receiving density

The receiving density for x=α, y=β, z=γ and q=ε is defined as RD(α, β, γ, ε):

$$RD(\alpha, \beta, \gamma, \varepsilon) = \frac{|F(\alpha, \beta, \gamma, \varepsilon)|}{|F|} = \frac{|\{\, f(t, x, y, z, p, q) \in F : x=\alpha,\ y=\beta,\ z=\gamma,\ q=\varepsilon \,\}|}{|F|} \tag{5}$$

where 0 ≤ RD(α, β, γ, ε) ≤ 1.

RD(α, β, γ, ε) represents the 5Ws dimensions for the receiver’s pattern: the content was γ, transferred by α, in t ⊂ T time, for reason β and received by ε. A high value of RD( ) indicates who received the most data. Fig 3 (c) displays the example of a high value of RD( ).

E. Density cross datasets

SD( ) and RD( ) not only measure the patterns for one dataset, but can also be used to compare multiple datasets. For example, a Facebook dataset and a bank transaction dataset are two different datasets, but similar attributes exist in both, such as δ = “users” and α = “mobile connection”. Comparing the two datasets for δ = “users” and α = “mobile connection” gives the ratio of internet banking mobile users to Facebook mobile users. These two densities provide more measurement features for BigData analysis.

F. Density classification

When SD(α, β, γ, δ) = RD(α, β, γ, ε), the data transfer pattern is 1:1, as shown in Fig 3 (a). If SD(α, β, γ, δ) > RD(α, β, γ, ε), the data transfer pattern is 1:N, as shown in Fig 3 (b). For N:1, SD(α, β, γ, δ) < RD(α, β, γ, ε), as shown in Fig 3 (c).

Here, we introduce a coefficient θ( ) to classify the density patterns for the attributes α, β, γ.

$$\theta(\alpha, \beta, \gamma) = \frac{RD(\alpha, \beta, \gamma, \varepsilon)}{SD(\alpha, \beta, \gamma, \delta)} = \frac{|F(\alpha, \beta, \gamma, \varepsilon)|}{|F(\alpha, \beta, \gamma, \delta)|} \tag{6}$$

When SD(α, β, γ, δ) = RD(α, β, γ, ε), θ( ) = 1. If SD(α, β, γ, δ) < RD(α, β, γ, ε), then θ( ) > 1, and θ( ) → ∞ if SD(α, β, γ, δ) → 0. When SD(α, β, γ, δ) > RD(α, β, γ, ε), then θ( ) < 1, and θ( ) → 0 if RD(α, β, γ, ε) → 0. Fig. 4 shows an example of density patterns for SD( ) and RD( ).

Three curves (green, red and orange) are examples of the different patterns of SD( ) and RD( ). The yellow area represents SD( ) < RD( ) and θ( ) > 1. The grey area represents SD( ) > RD( ) and θ( ) < 1.

Figure 4. Example of density patterns

When θ( ) → ∞ and RD( ) → 1, the value of SD( ) is very low. This shows that huge amounts of data with the same attributes (α, β, γ) were sent to one receiver by multiple senders. When θ( ) → 0 and SD( ) = 1, RD( ) is very low, which demonstrates that huge amounts of data with the same attributes (α, β, γ) were sent from one sender to multiple receivers.

The density pattern θ( ) can be used for many purposes. Return to the example of the 9.66 million tweets during the Opening Ceremony of the London 2012 Olympic Games. Suppose α = “iPhone apps”, β = “sharing photo”, γ = “James Bond” + “Queen” + “parachute”, δ = “users” and ε = “Twitter receiving server in London”. The value of θ( ) >> 1 because a huge amount of text messages and images were sent to one receiver from multiple iPhones. θ( ) can also be used for comparing different attributes, such as α1 = “iPhone apps” and α2 = “Android apps”, or γ1 = “USA team” and γ2 = “UK team”.

G. Noise data calculation

The noise data is defined as the unknown nodes in the density algorithm methods, such as x = unknown_x, y = unknown_y, z = unknown_z, p = unknown_p and q = unknown_q. A subset for unknown nodes can be defined as

$$F(\text{unknown}) = \{\, f \in F \mid f(t, x, y, z, p, q),\ x=\text{unknown\_x} \lor y=\text{unknown\_y} \lor z=\text{unknown\_z} \lor p=\text{unknown\_p} \lor q=\text{unknown\_q} \,\} \tag{7}$$

(4) would be amended as

$$SD(\alpha, \beta, \gamma, \delta) = \frac{|F(\alpha, \beta, \gamma, \delta)|}{|F| - |F(\text{unknown})|} \tag{8}$$
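The density definitions above can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the authors' implementation: the dictionary layout, the helper names, the convention that noise values start with the string "unknown", and the five sample flows are all assumptions.

```python
import math


def node(x, y, z, p, q):
    """A node reduced to the five attributes the densities use."""
    return {"x": x, "y": y, "z": z, "p": p, "q": q}


def is_noise(f):
    # Assumption: unknown attributes are labelled "unknown_x", "unknown_y", ...
    return any(str(v).startswith("unknown") for v in f.values())


def _ratio(F, wanted, drop_noise):
    hits = sum(1 for f in F if all(f[k] == v for k, v in wanted.items()))
    noise = sum(1 for f in F if is_noise(f)) if drop_noise else 0
    total = len(F) - noise
    return hits / total if total else 0.0


def sending_density(F, alpha, beta, gamma, delta, drop_noise=False):
    """Share of nodes with x, y, z fixed and sender p = delta."""
    return _ratio(F, {"x": alpha, "y": beta, "z": gamma, "p": delta}, drop_noise)


def receiving_density(F, alpha, beta, gamma, epsilon, drop_noise=False):
    """Share of nodes with x, y, z fixed and receiver q = epsilon."""
    return _ratio(F, {"x": alpha, "y": beta, "z": gamma, "q": epsilon}, drop_noise)


def theta(sd, rd):
    """Coefficient RD / SD; infinite when SD is zero."""
    return math.inf if sd == 0 else rd / sd


def classify(sd, rd):
    if math.isclose(sd, rd):
        return "1:1"
    return "1:N" if sd > rd else "N:1"


# Hypothetical flows: three sources hit one destination.
F = [
    node("TCP", "HTTP", "Attack", "10.0.0.1", "10.0.0.9"),
    node("TCP", "HTTP", "Attack", "10.0.0.2", "10.0.0.9"),
    node("TCP", "HTTP", "Attack", "10.0.0.3", "10.0.0.9"),
    node("TCP", "HTTP", "Normal", "10.0.0.1", "10.0.0.8"),
    node("UDP", "unknown_y", "Normal", "10.0.0.4", "10.0.0.8"),
]

sd = sending_density(F, "TCP", "HTTP", "Attack", "10.0.0.1")
rd = receiving_density(F, "TCP", "HTTP", "Attack", "10.0.0.9")
sd_clean = sending_density(F, "TCP", "HTTP", "Attack", "10.0.0.1", drop_noise=True)
print(sd, rd, theta(sd, rd), classify(sd, rd), sd_clean)
```

In this toy data the receiver's density exceeds each sender's density, so θ is greater than 1 and the pair is labelled N:1. Passing `drop_noise=True` shrinks the denominator by the number of unknown nodes, which raises the densities slightly.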


(5) would be amended as

$$RD(\alpha, \beta, \gamma, \varepsilon) = \frac{|F(\alpha, \beta, \gamma, \varepsilon)|}{|F| - |F(\text{unknown})|} \tag{9}$$

SD( ) and RD( ) represent the sender’s and receiver’s patterns, which significantly improves the accuracy of BigData analysis, because both densities avoid noise data.

H. 5Ws clustered visualization

Fig. 5 shows an example of a 5Ws tree map. The tree map is widely used to illustrate data structures and their contents. Hundreds, even thousands, of different attributes can be classified by our 5Ws data dimensions tree.

Figure 5. 5Ws data dimension tree

Clustering is the process of organizing data into groups with similar elements in some attributes [6]. Each data item represents a node that contains the 5Ws data dimensions for each classification. After obtaining the values of RD( ) and SD( ), we use five clustering levels to illustrate both densities. The first clustered level is x=α, the second is z=γ, the third is y=β, and the fourth and fifth are p=δ and q=ε.

We use a visual circle to represent the attributes and their patterns, with points in the graph. We do not show any linking because the data patterns have been calculated before the clustering visualization. Accordingly, the radius of the circle for a particular attribute is calculated by its density, which is defined as

$$R(\alpha) = \frac{|F(\alpha)|}{|F|}, \qquad F(\alpha) = \{\, f(t, x, y, z, p, q) \in F : x=\alpha \,\} \tag{10}$$

where R(α) is the radius for x=α.

$$R(\beta) = \frac{|F(\beta)|}{|F|}, \qquad F(\beta) = \{\, f(t, x, y, z, p, q) \in F : y=\beta \,\} \tag{11}$$

where R(β) is the radius for y=β.

$$R(\gamma) = \frac{|F(\gamma)|}{|F|}, \qquad F(\gamma) = \{\, f(t, x, y, z, p, q) \in F : z=\gamma \,\} \tag{12}$$

where R(γ) is the radius for z=γ.

p=δ and q=ε are not used for the radius because they would take too much space to display in the graph. Instead, p and q are displayed separately after the clustering visualization is finalized, which is shown in the next section.

III. IMPLEMENTATION AND VISUALIZATION

We have tested the 5Ws model by using the network security dataset ISCX2012 [9], which contains the 20 attributes shown in the Appendix. Table II displays the summary of one ISCX2012 dataset.

TABLE II. ISCX2012 dataset - TestbedTueJun15c

Name                      Amount
Network traffic nodes     130288
Source IPs                36
Destination IPs           1656
ICMP traffics             31
TCP traffics              119242
Unknown TCP traffics      3
UDP traffics              11015
Unknown UDP traffics      36
Connecting methods        19
Source ports              23653
Destination ports         222
Attacks                   37375

Assume that α = “TCP traffic”, β = “HTTP connection”, γ = “Attack”, δ = “Source IP and port” and ε = “Destination IP and port” for this simulation. The result illustrates that the destination IP address ε = 192.168.5.122:80 is targeted as the victim, and faced γ = “attack” by α = “TCP” traffic with β = “HTTP” connection. There are six different source IPs (δ) that sent the attacks to the victim. The values of the six different SD(α, β, γ, δ) and one RD(α, β, γ, ε) are shown as bubbles in Fig. 6. The six different density patterns, represented as the values of θ(α, β, γ), are shown in Fig. 7.
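As a small worked example, the counts reported in Table II can be turned into single-attribute ratios of the same form as the density definitions. The sketch below treats the "Unknown TCP" and "Unknown UDP" rows as the noise subset, which is an assumption made for illustration; the paper does not state how those rows were handled, and the variable names are hypothetical.

```python
# Counts copied from Table II (ISCX2012, TestbedTueJun15c).
table_ii = {
    "Network traffic nodes": 130288,
    "ICMP traffics": 31,
    "TCP traffics": 119242,
    "Unknown TCP traffics": 3,
    "UDP traffics": 11015,
    "Unknown UDP traffics": 36,
    "Attacks": 37375,
}

total = table_ii["Network traffic nodes"]

# The three protocol rows account for every node in the dataset.
protocol_sum = (table_ii["ICMP traffics"]
                + table_ii["TCP traffics"]
                + table_ii["UDP traffics"])

# Assumption: the two "Unknown" rows play the role of the noise subset.
noise = table_ii["Unknown TCP traffics"] + table_ii["Unknown UDP traffics"]

tcp_share = table_ii["TCP traffics"] / total
tcp_share_clean = table_ii["TCP traffics"] / (total - noise)
attack_share = table_ii["Attacks"] / total

print(f"TCP share of all nodes:        {tcp_share:.4f}")
print(f"TCP share with noise removed:  {tcp_share_clean:.4f}")
print(f"Attack share of all nodes:     {attack_share:.4f}")
```

Because only 39 of the 130288 nodes are unknown, removing them changes the TCP ratio only in the fourth decimal place for this dataset.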


Figure 6. Values of SD( ) and RD( ) for α=TCP, β=HTTP, γ=Attack. Series: RD (ε=192.168.5.122) and SD (δ=192.168.1.103, 192.168.1.105, 192.168.2.110, 192.168.2.113, 192.168.4.118, 192.168.4.120); the vertical axis runs from 0.00 to 0.70.

Figure 7. Values of θ(α, β, γ)

In the simulation, the values of SD( ) and RD( ) start increasing from 16:05, as demonstrated in Fig 6. During the threat, both SD( ) and RD( ) are kept at high levels until 17:20. This gives an alert for network intrusion detection at an early stage. The density patterns are illustrated in Fig. 7, which shows that the hacker sent different attack patterns from six source IPs.

Fig. 10 shows the visual nodes in random spots before the clustering visualization. The clustered structure is dependent on the values of SD( ) and RD( ), which are represented as yellow nodes. The different attributes of SD( ) and RD( ) use different scales to save space. For example, SD(3) may represent the value of FTP connection, and RD(6) may indicate the value of ICMP traffic. Before the clustering process, there is no linking edge connected to any node.

Figure 10. The nodes before clustering process

We preset SD( )=0.80 and RD( )=0.80 to start the visualization process, but no clustered structure appeared. This is because the values of SD( ) and RD( ) do not reach these thresholds. The values of SD( ) and RD( ) were decreased until RD( ) = 0.50 and correspondingly SD( ) = 0.10. The final clustered structure generated is shown in Fig. 11.

The two top-value SD( ) nodes and the one top-value RD( ) node appear in the final clustered visualization. The attribute x=α is the first clustered level for visualization, z=γ is the second and y=β is the third. All nodes that have the same classifications for RD(k) and SD(j) have been gathered into groups, linked by the different clustered levels (k and j are used for separating them from other nodes). p and q are displayed in the bottom field, which illustrates where the data came from and who received the data.

The nodes that do not belong to RD(k) and SD(j) are still shown in the graph as random points. The clustering visualization scales down the density patterns into the particular attributes. This enables network intrusion detection to be seen easily, efficiently and clearly.

Our clustered visualization avoids the mass of crossing links between nodes, and therefore shows a clear structure for pattern detection. The higher value of SD( ) indicates that the hacker sent widespread attack traffic to the victim’s systems across the network. The high value of RD( ) means that the victim suffered flood attacks from multiple sources.

Therefore, the 5Ws model with clustered visualization provides clear outlines of data patterns, which significantly changes the measurement of BigData analysis and visualization.

IV. RELATED WORKS

Researchers have used HDFS and MapReduce for data processing, data sharing or data clustering because they can take a query over a dataset, divide it into many small fragments, and run it in parallel across the cluster. In addition, they provide a distributed system that stores data on the computer nodes, with high aggregate bandwidth across the cluster [6][10][11][12].

SAS Visual Analytics Explorer [4] scales, visualizes and analyses massive datasets to find visual patterns. Heat maps and tree maps were used in the visualization. An example of a heat map for BankWorld’s activity is shown in Fig. 8. They created the heat map based on the latitude and longitude of each log entry. The raw data is displayed in the graph.

Figure 8. Heat map of BankWorld’s activity [4]

Cheng-Long Ma et al [5] used the K-means clustering method to find the clustering centers for 3-D visualization. The distances on the three coordinate axes correspond to the data in the original space. They tested their model using the Iris database, as shown in Fig. 9.

Figure 9. Iris 3-D visualization [5]

Unfortunately, current visualization approaches cannot display BigData patterns crossing multiple datasets, as too many lines or nodes are used in the graph. We could not make many comparisons because there is, to the best of our knowledge, no previous work which addresses the 5Ws data dimensions for BigData analysis and visualization.

V. CONCLUSIONS & FUTURE WORK

In this paper, we have established the 5Ws model, a framework for BigData analysis and visualization. The model analyses the 5Ws data dimensions crossing multiple datasets and builds sending density and receiving density to measure BigData patterns. This is done to uncover correlations which traditional database management tools could not reveal.

The 5Ws model not only measures BigData patterns, but also enables density comparisons between multiple datasets. This provides more analytical features of BigData analysis for business, government and organizational needs.

Our clustered visualization method displays the 5Ws data dimensions without any linking between data sources and destinations. This allows users to interactively select and scale down views of BigData patterns. The experiment shows that this model, with the clustered visualization, can be used effectively for BigData analysis and visualization.

For future work, we plan to develop our 5Ws model in three directions. Firstly, we plan to deploy the density classification in more areas. Secondly, we plan to apply our 5Ws model to more datasets. Thirdly, we plan to use the Gapminder visualization technique developed by Dr. Hans Rosling (www.gapminder.org), the world famous visualization presenter, to display 5D data dimensions through 2D graphs.


Figure 11. The higher value of SD( ) and RD( ) after clustering visualization

REFERENCES

[1] Pingdom, “Internet 2012 in numbers”, posted on Jan 16, 2013, http://royal.pingdom.com/2013/01/16/internet-2012-in-numbers/

[2] Stamford, “Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data”, posted on June 27, 2011, http://www.gartner.com/newsroom/id/1731916

[3] D. Klein, P. Tran-Gia, M. Hartmann, “Big Data”, Informatik-Spektrum, vol 36, issue 3, pp 319-323, June 2013

[4] N.A. Abousalh-Neto and S. Kazgan, “Big Data Exploration through Visual Analytics”, In Proc. IEEE Symposium on Visual Analytics Science and Technology 2012, pp 285-286, Oct 2012

[5] C.L. Ma, X.F. Shang, and Y.B. Yuan, “A Three-Dimensional Display for Big Data Sets”, In Proc. 2012 International Conference on Machine Learning and Cybernetics, pp 1541-1545, July 2012

[6] J. Hurwitz, A. Nugent, F. Halper, M. Kaufman, Big Data for Dummies, Published by John Wiley & Sons. Inc, 2013

[7] J. Choo, H. Park, “Customizing Computational Methods for Visual Analytics with Big Data”, IEEE Computer Graphics and Applications, vol 33, No 4, pp 22-28, July 2013

[8] J. Zhang and M.L. Huang, “Visual Analytics Model for Intrusion Detection in Flood Attack”, In Proc. TrustCom 2013, 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp 277-284, July 2013

[9] A. Shiravi, H. Shiravi, M. Tavallaee, and A.A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol 31, issue 3, May 2012, pp 357-374, ISSN 0167-4048

[10] T. Kraska, "Finding the Needle in the Big Data Systems Haystack," Internet Computing, IEEE, vol.17, no.1, pp 84,86, Jan-Feb. 2013

[11] S. Narayan, S. Bailey, A. Daga, "Hadoop Acceleration in an OpenFlow-Based Cluster," High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pp 535,538, 10-16 Nov. 2012

[12] A. Menon, “Big Data @ Facebook”, In Proc. 2012 workshop on management of big data systems, pp 31-32, 2012

[13] J. Zhang, M.L. Huang and D. Hoang, “Visual analytics for intrusion detection in spam emails”, International Journal of Grid and Utility Computing, vol 4, no 2/3, pp 178-186, 2013

APPENDIX

The ISCX2012 dataset contains 20 attributes. An example of the attributes in the 5Ws dimensions is shown below.

When (T)    StartDateTime, StopDateTime
How (X)     ProtocolName, Direction
Why (Y)     AppName, SourceTCPFlagsDescription, DestinationTCPFlagsDescription
What (Z)    Tag, SourcePayloadAsBase64, SourcePayloadAsUTF, DestinationPayloadAsBase64, DestinationPayloadAsUTF
Where (P)   Source, SourcePort, TotalSourceBytes, TotalSourcePackets
Who (Q)     Destination, DestinationPort, TotalDestinationBytes, TotalDestinationPackets
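The appendix mapping can be written down as a lookup table and used to regroup a flat flow record by 5Ws dimension. The sketch below is illustrative only: the attribute names come from the appendix, while the `FIVE_WS` dictionary, the `to_5ws` helper and the sample values in `flow` are hypothetical.

```python
# Attribute names as listed in the appendix, grouped by 5Ws dimension.
FIVE_WS = {
    "when (T)": ["StartDateTime", "StopDateTime"],
    "how (X)": ["ProtocolName", "Direction"],
    "why (Y)": ["AppName", "SourceTCPFlagsDescription",
                "DestinationTCPFlagsDescription"],
    "what (Z)": ["Tag", "SourcePayloadAsBase64", "SourcePayloadAsUTF",
                 "DestinationPayloadAsBase64", "DestinationPayloadAsUTF"],
    "where (P)": ["Source", "SourcePort", "TotalSourceBytes",
                  "TotalSourcePackets"],
    "who (Q)": ["Destination", "DestinationPort", "TotalDestinationBytes",
                "TotalDestinationPackets"],
}


def to_5ws(flow):
    """Regroup a flat record; attributes missing from the record become None."""
    return {
        dim: {name: flow.get(name) for name in names}
        for dim, names in FIVE_WS.items()
    }


# Hypothetical, partial flow record for illustration.
flow = {
    "ProtocolName": "tcp_ip",
    "AppName": "HTTPWeb",
    "Tag": "Attack",
    "Source": "192.168.1.103",
    "Destination": "192.168.5.122",
    "DestinationPort": 80,
}

grouped = to_5ws(flow)
all_names = [name for names in FIVE_WS.values() for name in names]
print(len(all_names), "attributes in", len(FIVE_WS), "dimensions")
```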
