
Page 1

An Architecture-based Framework For Understanding Large-Volume Data Distribution

Chris A. Mattmann

USC CSSE Annual Research Review

March 17, 2009

Page 2

Agenda

• Research Problem and Importance
• Our Approach
  – Classification
  – Selection
  – Analysis
• Evaluation
  – Precision, Recall, Accuracy Measurements
  – Speed
• Conclusion & Future Work

Page 3

Research Problem and Importance

• Content repositories are growing rapidly in size

• At the same time, we expect more immediate dissemination of this data

• How do we distribute it…
  – In a performant manner?
  – Fulfilling system requirements?

[Chart: NASA Planetary Data System Archive Volume Growth, accumulated terabytes (TB) per year, 1990-2008]

Page 4

Data Distribution Scenarios

A medium-sized volume of data, e.g., on the order of a gigabyte, needs to be delivered across a LAN to a single user, using multiple delivery intervals of 10 megabytes of data per interval.

A Backup Site periodically connects across the WAN to the Digital Movie Repository to back up its entire catalog and archive of over 20 terabytes of movie data and metadata.

Page 5

Data Distribution Problem Space

Page 6

Insight: Software Architecture

• The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition
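These building blocks can be written down directly; the following is a minimal sketch (assumed names and structure, not a representation from the talk or from DISCO) of an architecture as components, connectors, and a configuration:

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A computational unit in the system."""
    name: str

@dataclass
class Connector:
    """Models the communication and interaction between components."""
    name: str

@dataclass
class Configuration:
    """An arrangement of components and connectors, plus their attachments."""
    components: list = field(default_factory=list)
    connectors: list = field(default_factory=list)
    attachments: list = field(default_factory=list)  # (producer, connector, consumer) triples

# Example: a producer distributing data to a consumer over an HTTP/REST connector
producer, consumer = Component("Data Producer"), Component("Data Consumer")
http = Connector("HTTP/REST")
architecture = Configuration([producer, consumer], [http], [(producer, http, consumer)])
```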

Page 7

Data Distribution Systems

[Diagram: a Data Producer component delivers data through a connector to many Data Consumer components]

Insight: Use software connectors to model data distribution technologies

Page 8

Impact of Data Distribution Technologies

• Broad variety of data distribution technologies

• Some are highly efficient, some more reliable

• P2P, Grid, Client/Server, and Event-based

• Some are entirely appropriate for a given scenario; others are not

Page 9

Data Movement Technologies

• Wide array of available OTS “large-scale” connector technologies
  – GridFTP, Aspera, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW, and more
• Which one is the best one?
• How do we compare them
  – Given our current architecture?
  – Given our distribution scenarios & requirements?

Page 10

Research Question

• What types of software connectors are best suited to delivering vast amounts of data to users in these hugely distributed data systems, in a manner that satisfies their particular scenarios and is performant and scalable?

Page 11

Broad variety of distribution connector families

• P2P, Grid, Client/Server, and Event-based

• Though each connector family varies slightly in some form or fashion
  – They all share 3 common atomic connector constituents
    • Data Access, Stream, Distributor
    • Adapted from our group’s ICSE 2000 Connector Taxonomy

Page 12

Connector Tradeoff Space

• Surveyed properties of 13 representative distribution connectors, across all 4 distribution connector families, and classified them
  – Client/Server
    • SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP, Commercial UDP Technology
  – Peer-to-Peer
    • Bittorrent
  – Grid
    • GridFTP, bbFTP
  – Event-based
    • GLIDE, Siena

Page 13

Large Heterogeneity in Connector Properties

[Charts: four bar charts (y-axis: number of connectors) showing the spread of property values across the surveyed connectors]
• Procedure Call Connector Breakdown (5 connectors, 2 families): properties include proc_call_params_return_value, proc_call_cardinality_senders, proc_call_invocation_explicit, proc_call_params_invocation_record, proc_call_params_datatransfer, proc_call_accessibility, proc_call_semantics
• Data Access Connector Breakdown (8 connectors, 4 families): properties include data_access_locality, data_access_persistence, data_access_avail_transient, data_access_cardinality_receivers, data_access_accesses, data_access_cardinality_senders
• Distributor Connector Breakdown (8 connectors, 4 families): properties include distributor_routing_membership, distributor_delivery_type, distributor_naming_type, distributor_naming_structures, distributor_routing_type, distributor_delivery_semantics, distributor_routing_path, distributor_delivery_mechanisms
• Stream Connector Breakdown (8 connectors, 4 families): properties include stream_formats, stream_cardinality_senders, stream_localities, stream_deliveries, stream_throughput, stream_cardinality_receivers, stream_state, stream_identity, stream_bounds, stream_synchronicity, stream_buffering

Page 14

How do experts make these decisions?

• Performed survey of 33 “experts”
• Experts defined to be
  – Practitioners in industry, building data-intensive systems
  – Researchers in data distribution
  – Admitted architects of data distribution technologies
• General consensus?
  – They don’t know the how and the why about which connector(s) are appropriate
  – They rely on anecdotal evidence and “intuition”

[Pie chart: Percentage Breakdown of Expert Responses: No Response 67%, Not Comfortable 15%, No Time 15%, Full Response 3%]

[Pie chart: Expert Survey Demographic: Cancer Research 6%, Planetary Science 18%, Earth Science 12%, Industry 12%, Grid Computing 6%, Professors 22%, Web Technologies 6%, Open Source 12%, Students 6%]

45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.

Page 15

Why is it bad to have these types of experts?

• Employ a small set of COTS and/or pervasive distribution technologies, and stick to them
  – Regardless of the scenario requirements
  – Regardless of the capabilities at users’ institutions
• Lack a comprehensive understanding of benefits/tradeoffs amongst available distribution technologies
  – They have “pet technologies” that they have used in similar situations
  – These technologies are not always applicable and frequently satisfy only one or two scenario requirements while ignoring the rest

Page 16

Our Approach: DISCO

• Develop a software framework for:
  – Connector Classification
    • Build metadata profiles of connector technologies, describing their intrinsic properties (DCPs); a sketch of such a profile follows below
  – Connector Selection
    • Adaptable, extensible algorithm development framework for selecting the “right” connectors (and identifying wrong ones)
  – Connector Selection Analysis
    • Measurement of accuracy of results
  – Connector Performance Analysis
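As a rough illustration of connector classification, the sketch below encodes a metadata profile (DCP) using a handful of the property names from the breakdown charts earlier in the talk; the connector chosen and the concrete values are illustrative guesses, not the profiles actually shipped with DISCO:

```python
# Hypothetical DCP for the HTTP/REST distribution connector.  The keys follow
# the atomic-constituent naming (proc_call_*, data_access_*, stream_*,
# distributor_*) used in the connector breakdown charts; the values are
# plausible placeholders, not DISCO's real profile for this connector.
http_rest_dcp = {
    "id": "HTTP/REST",
    "family": "Client/Server",
    "proc_call_cardinality_senders": "one sender",
    "data_access_cardinality_receivers": "Many Receivers",
    "stream_deliveries": "Exactly Once",
    "stream_synchronicity": "Synchronous",
    "distributor_delivery_mechanisms": "Unicast",
    "distributor_routing_membership": "ad-hoc",
}
```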

Page 17

DISCO in a Nutshell

Page 18

Scenario Language

• Describes distribution scenarios

[Diagram: dimensions of a data distribution scenario]
• Total Volume (e.g., 10 MB, 100 GB; int + higher-order unit)
• Delivery Schedule: Number of Intervals, Volume Per Interval, Timing of Interval (automatic or initiated)
• Number of Users and Number of User Types (Producers, Consumers) (e.g., 1, 10; int)
• Types of Data (Data, Metadata) and Number of Data Types (e.g., 1, 10; int)
• Geographic Distribution (LAN, WAN)
• Access Policies (e.g., SSL/HTTP 1.0, Linux File System Perms; string from a controlled value range)
• Performance Requirements: Consistency, Scalability, Dependability, Efficiency (1-10, computed scale)
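A scenario along these dimensions could be written down roughly as follows; the field names paraphrase the diagram above, and this is only a sketch, not DISCO's actual scenario language syntax:

```python
from dataclasses import dataclass, field

@dataclass
class DeliverySchedule:
    number_of_intervals: int      # e.g., 1, 10
    volume_per_interval: str      # e.g., "10 MB", "100 GB"
    timing_of_interval: str       # e.g., "automatic" or "initiated"

@dataclass
class DistributionScenario:
    total_volume: str             # e.g., "20 TB"
    number_of_users: int
    number_of_user_types: int     # producers vs. consumers
    number_of_data_types: int     # data vs. metadata
    geographic_distribution: str  # "LAN" or "WAN"
    delivery_schedule: DeliverySchedule
    access_policies: list = field(default_factory=list)  # e.g., ["SSL/HTTP 1.0"]
    # performance requirements on the 1-10 computed scale
    consistency: int = 5
    scalability: int = 5
    dependability: int = 5
    efficiency: int = 5

# The backup-site scenario from the earlier slide, roughly encoded
# (the performance values here are invented for illustration):
backup = DistributionScenario(
    total_volume="20 TB", number_of_users=1, number_of_user_types=1,
    number_of_data_types=2, geographic_distribution="WAN",
    delivery_schedule=DeliverySchedule(1, "20 TB", "automatic"),
    dependability=9, efficiency=7,
)
```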

Page 19

Distribution Connector Model

• Developed a model for distribution connectors

• Identified the combination of primitive connectors that a distribution connector is made from

Page 20

Distribution Connector Model

• The model defines the important properties of each of the important “modules” within a distribution connector
  – Defines the value space for each property
  – Defines each property
• Properties are based on the combination of underlying “primitive” connector constituents
• The model forms the basis for a metadata description (or profile) of a distribution connector

Page 21

Selection Algorithms

• So far
  – Let data system architects encode the data distribution scenarios within their system using the scenario language
  – Let connector gurus describe important properties of connectors using architectural metadata (the connector model)
• Selection Algorithms
  – Use scenario(s) and connector properties to identify the “best” connectors for the given scenario(s)

Page 22

Selection Algorithms

• Formal statement of the problem

Page 23

Selection Algorithms

• Selection algorithm interface: scenario + Connector KB → ranked list of (connector, score) pairs

(bbFTP, 0.157) (FTP, 0.157) (GridFTP, 0.157) (HTTP/REST, 0.157) (SCP, 0.157) (UFTP, 0.157) (Bittorrent, 0.021) (CORBA, 0.005) (Commercial UDP Technology, 0.005) (GLIDE, 0.005) (RMI, 0.005) (Siena, 0.005) (SOAP, 0.005)

• This interface is desirable because it allows a user to rank and compare how “appropriate” each connector is, rather than making a binary decision
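In code, this interface amounts to a function from a scenario and a connector knowledge base to a ranked list; a minimal sketch with assumed names (select_connectors and the shape of connector_kb are not from the talk):

```python
def select_connectors(scenario, connector_kb, score):
    """Rank every connector in the knowledge base against a scenario.

    connector_kb maps a connector name to its knowledge-base record (e.g. the
    DCP metadata profile sketched earlier); score is a pluggable selection
    algorithm returning a non-negative appropriateness value for that record.
    """
    ranked = [(name, score(scenario, record)) for name, record in connector_kb.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# e.g. select_connectors(backup, kb, bayesian_score)
#   -> [("bbFTP", 0.157), ("FTP", 0.157), ..., ("SOAP", 0.005)]
```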

Page 24

Selection Algorithm Approach

• White box
  – Consider the internal properties of a connector (e.g., its internal architecture) when selecting it for a distribution scenario
• Black box
  – Consider the external (observable) properties of the connector (such as performance) when selecting it for a distribution scenario

Page 25

Develop complementary selection algorithms

• Users familiar with connector technologies develop score functions
  – Relating observable properties (performance requirements) of a connector to scenario dimensions
• Software architects fill out Bayesian domain profiles containing conditional probabilities
  – The likelihood that a connector, given attribute A and its value, and given a scenario requirement, is appropriate for scenario S
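A rough idea of the Bayesian variant, as a sketch under simplifying assumptions (the "bayes_profile" key, the 0.5 fallback, and the naive product of probabilities are mine, not the dissertation's exact formulation):

```python
def bayesian_score(scenario, record):
    """Combine per-attribute conditional probabilities into one score.

    record["bayes_profile"][(attribute, value)] is an expert-supplied estimate
    of P(this connector is appropriate | the scenario attribute has this value);
    attributes with no entry fall back to an uninformative 0.5.
    """
    profile = record.get("bayes_profile", {})
    score = 1.0
    for attribute, value in vars(scenario).items():
        score *= profile.get((attribute, str(value)), 0.5)
    return score

def normalize(ranked):
    """Scale scores so they sum to 1, matching the ranked lists shown on these slides."""
    total = sum(s for _, s in ranked) or 1.0
    return [(name, s / total) for name, s in ranked]

# Usage with the earlier sketch: normalize(select_connectors(backup, kb, bayesian_score))
```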

Page 26

Selection Analysis

• How do we make decisions based on a rank list?

• Insight: looking at the rank list, it is apparent that many connectors are similarly ranked, while many are not
  – Appropriate versus inappropriate?

Page 27

Selection Analysis

appropriate:
(bbFTP, 0.15789473684210525) (FTP, 0.15789473684210525) (GridFTP, 0.15789473684210525) (HTTP/REST, 0.15789473684210525) (SCP, 0.15789473684210525) (UFTP, 0.15789473684210525)

inappropriate:
(Bittorrent, 0.02105263157894737) (CORBA, 0.005263157894736843) (Commercial UDP Technology, 0.005263157894736843) (GLIDE, 0.005263157894736843) (RMI, 0.005263157894736843) (Siena, 0.005263157894736843) (SOAP, 0.005263157894736843)

Page 28

Selection Analysis

Page 29

Selection Analysis

• Employed the k-means data clustering algorithm
  – The k parameter defines how many sets the data is partitioned into
• Allows for clustering of data points (x, y) around a “centroid” or mean value
• We developed an exhaustive connector clustering algorithm based on k-means (see the sketch below)
  – Clusters connectors into 2 groups: appropriate and inappropriate
  – Uses the connector rank value as the y parameter (x is the connector name)
  – Exhaustive in the sense that it iterates over all possible connector clusters (vanilla k-means is heuristic and possibly incomplete)
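Because connectors can be ordered by rank value, the two-cluster case reduces to trying every split point of the sorted scores and keeping the split with the lowest within-cluster scatter; a small sketch in that spirit (the function name and exact objective are assumptions, not DISCO's literal algorithm):

```python
def split_appropriate(ranked):
    """Partition (connector, score) pairs into (appropriate, inappropriate).

    Sorts by score, then exhaustively tries every split point and keeps the one
    minimizing the summed squared distance of scores to their cluster means --
    the k=2 objective of k-means, solved exactly rather than heuristically.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    if len(ordered) < 2:
        return ordered, []
    scores = [s for _, s in ordered]

    def scatter(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = min(range(1, len(scores)),
               key=lambda i: scatter(scores[:i]) + scatter(scores[i:]))
    return ordered[:best], ordered[best:]
```

Applied to the ranked list on the previous slide, the six connectors near 0.157 land in the first cluster and the rest in the second, matching the appropriate/inappropriate split shown there.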

Page 30

Tool Support

• Allows a user to utilize different connector knowledge bases, configure selection algorithms, execute them, and visualize their results

Page 31

Decision Process

• Precision (87%): the fraction of connectors correctly identified as appropriate for a scenario
• Accuracy (80.5%): the fraction of connectors correctly identified as appropriate or inappropriate for a scenario
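Given an expert ground truth of which connectors really are appropriate for a scenario, both measures reduce to simple set arithmetic; a small hypothetical helper (the 87% and 80.5% figures above come from the actual evaluation, not from this code):

```python
def precision_and_accuracy(selected, truly_appropriate, all_connectors):
    """Precision: fraction of selected connectors that really are appropriate.
    Accuracy: fraction of all connectors classified correctly, i.e. selected
    ones that are appropriate plus unselected ones that are inappropriate."""
    selected, truth = set(selected), set(truly_appropriate)
    true_pos = len(selected & truth)
    true_neg = len(set(all_connectors) - selected - truth)
    precision = true_pos / len(selected) if selected else 0.0
    accuracy = (true_pos + true_neg) / len(all_connectors)
    return precision, accuracy
```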

Page 32

Decision Process: Speed

Page 33

Conclusions & Future Work

• Conclusions
  – Domain experts (gurus) rely on tacit knowledge and often cannot explain design rationale
  – DISCO provides a quantification of, and a framework for understanding, an ad hoc process
  – The Bayesian algorithm has a higher precision rate
• Future Work
  – Explore the tradeoffs between white-box and black-box approaches
  – Investigate the role of architectural mismatch in connectors for data system architectures

Page 34

Thank You!

Questions?

Page 35

Backup

Page 36

Related Work

• Software Connectors
  – Mehta00 (Taxonomy), Spitznagel01, Spitznagel03, Arbab04, Lau05
• Data Distribution/Grid Computing
  – Crichton01, Chervenak00, Kesselman01
• COTS Component/Connector Selection
  – Bhuta07, Mancebo05, Finkelstein05
• Data Dissemination
  – Franklin/Zdonik97