SALSA Group Research Activities April 27, 2011


Page 1: SALSA Group Research Activities

SALSA Group Research Activities

April 27, 2011

Page 2: SALSA Group Research Activities

Research Overview

• MapReduce Runtime
    Twister
    Azure MapReduce
• Dryad and Parallel Applications
• NIH Projects
    Bioinformatics Workflow
    Data Visualization: GTM/MDS/PlotViz
• Education

Page 3: SALSA Group Research Activities

Twister & Azure MapReduce

Page 4: SALSA Group Research Activities

What is Twister?

Twister is an iterative MapReduce framework which supports:

• Customized static input data partition
• Cacheable map/reduce tasks
• Combining operation to converge intermediate outputs to the main program
• Fault recovery between iterations
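The iterative pattern listed on this slide can be illustrated with a minimal single-process sketch. It is not Twister's API. The function names (`kmeans_map`, `kmeans_reduce`, `run_iterative_mapreduce`) and the choice of 1-D K-means are assumptions made for illustration. The static input partitions are loaded once and reused by every iteration, and the combine step gathers the reduce outputs back into the main loop. Fault recovery is not modelled.

```python
from collections import defaultdict


def kmeans_map(points, centroids):
    """Map task: assign each 1-D point to its nearest centroid, emit partial sums."""
    partial = {}
    for x in points:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        total, count = partial.get(k, (0.0, 0))
        partial[k] = (total + x, count + 1)
    return list(partial.items())


def kmeans_reduce(values):
    """Reduce task: merge the (sum, count) pairs emitted for one centroid."""
    return sum(s for s, _ in values), sum(n for _, n in values)


def run_iterative_mapreduce(partitions, centroids, max_iters=20, tol=1e-9):
    """Main program: iterate map -> reduce -> combine until the centroids settle."""
    cached = [list(p) for p in partitions]  # static data, loaded once and reused
    centroids = list(centroids)
    iterations = 0
    for iterations in range(1, max_iters + 1):
        intermediate = defaultdict(list)
        for part in cached:  # map phase over the cached partitions
            for key, value in kmeans_map(part, centroids):
                intermediate[key].append(value)
        reduced = {k: kmeans_reduce(v) for k, v in intermediate.items()}
        # Combine: collect every reduce output in the main program.
        new_centroids = list(centroids)
        for k, (total, count) in reduced.items():
            new_centroids[k] = total / count
        shift = max(abs(a - b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, iterations
```

Only the small, changing state (the centroids) travels between iterations; the large static partitions stay where they were first loaded.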

Page 5: SALSA Group Research Activities

Twister Programming Model

Page 6: SALSA Group Research Activities

Twister Architecture

Page 7: SALSA Group Research Activities

Applications and Performance

Page 8: SALSA Group Research Activities

MapReduceRoles for Azure

• MapReduce framework for Azure Cloud
• Built using highly-available and scalable Azure cloud services
    Distributed, highly scalable & highly available services
    Minimal management / maintenance overhead
    Reduced footprint
• Hides the complexity of cloud & cloud services from the users
• Co-exists with eventual consistency & high latency of cloud services
• Decentralized control avoids single point of failure

Page 9: SALSA Group Research Activities

MapReduceRoles for Azure

• Supports dynamic scaling up and down of the compute resources
• Fault tolerance
• Combiner step
• Web based monitoring console
• Easy testing and deployment

Page 10: SALSA Group Research Activities

Twister for Azure

Iterative MapReduce framework for Microsoft Azure Cloud.

• Merge step
• In-memory caching of static data
• Cache aware hybrid scheduling using queues as well as a bulletin board

[Figure: job flow. Job Start -> Map/Combine tasks (backed by the Data Cache) -> Reduce -> Merge -> "Add Iteration?". Yes: hybrid scheduling of the new iteration. No: Job Finish.]

[Figure: Worker Role. Map Workers (Map 1, Map 2, ... Map n), Reduce Workers (Red 1, Red 2, ... Red n), In Memory Data Cache, Task Monitoring and Role Monitoring, together with the Map Task Table (MapID ... Status), the Job Bulletin Board and the Scheduling Queue.]

[Figure: Kmeans performance with/without data caching.]
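The cache aware hybrid scheduling named on this slide might work roughly as in the sketch below. This is a guess at the idea under stated assumptions, not the Twister for Azure implementation. The `Worker` class, the `schedule_iteration` function and the round-robin queue hand-out are all hypothetical. The assumption is that workers first claim tasks whose data they already cache (the bulletin board part) and that leftover tasks go through a shared queue.

```python
from collections import deque


class Worker:
    """Hypothetical worker role that remembers which data partitions it has cached."""

    def __init__(self, name):
        self.name = name
        self.cache = set()  # ids of data partitions held in memory


def schedule_iteration(workers, task_ids, first_iteration):
    """Return {task_id: worker_name} for one iteration."""
    assignment = {}
    pending = set(task_ids)
    if not first_iteration:
        # Bulletin-board phase: a worker claims tasks whose data it already caches.
        for w in workers:
            for t in sorted(pending & w.cache):
                assignment[t] = w.name
                pending.discard(t)
    # Queue phase: tasks nobody has cached are handed out from a shared queue
    # (round-robin here, purely for determinism in this sketch).
    queue = deque(sorted(pending))
    turn = 0
    while queue:
        t = queue.popleft()
        w = workers[turn % len(workers)]
        w.cache.add(t)  # the worker loads the data and keeps it for later iterations
        assignment[t] = w.name
        turn += 1
    return assignment
```

In this sketch, later iterations reuse the placement from the first one, so cached static data is not loaded again.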

Page 11: SALSA Group Research Activities

Performance Comparisons

[Figure: BLAST Sequence Search. Parallel efficiency (0% to 100%) vs. number of query files (128 to 728) for Twister4Azure, Hadoop-Blast and DryadLINQ-Blast.]

[Figure: adjusted time (s) vs. Num. of Cores * Num. of Blocks for Twister4Azure, Amazon EMR and Apache Hadoop.]

[Figure: parallel efficiency (50% to 100%) vs. Num. of Cores * Num. of Files for Twister4Azure, Amazon EMR and Apache Hadoop.]

[Figure: relative parallel efficiency and time (s) vs. Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M).]

Panel titles: BLAST Sequence Search; Cap3 Sequence Assembly; Smith Waterman Sequence Alignment; Kmeans scaling speedup; Kmeans increasing number of iterations.

Page 12: SALSA Group Research Activities

Dryad & Parallel Applications

Page 13: SALSA Group Research Activities

DryadLINQ CTP Evaluation

• The beta version was released in Dec 2010
• Motivation:
    Evaluate key features and interfaces in DryadLINQ
    Study the parallel programming model in DryadLINQ
• Three applications:
    SW-G bioinformatics application
    Matrix-Matrix Multiplication
    PageRank

Page 14: SALSA Group Research Activities

Parallel programming model

• DryadLINQ stores input data as DistributedQuery<T> objects
• It splits distributed objects into partitions with the following APIs:
    AsDistributed()
    RangePartition()

[Figure: Windows HPC Server 2008 R2 cluster. Workstation computer: HPC Client Utilities, DSC Client Service, DryadLINQ Provider. Head node: Dryad graph manager, DSC Service, HPC Job Scheduler Service. Compute nodes: Vertex 1, Vertex 2, ... Vertex n, each with its Data and DSC.]

Common LINQ providers

Provider           Base class
LINQ-to-objects    IEnumerable<T>
PLINQ              ParallelQuery<T>
LINQ-to-SQL        IQueryable<T>
LINQ-to-?          IQueryable<T>
DryadLINQ          DistributedQuery<T>
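The idea of splitting a data set into partitions that separate vertices then process can be sketched in Python. This is only an illustration of range partitioning in general. It does not reproduce the DryadLINQ `RangePartition()` signature or semantics, and the names `range_partition` and `run_vertices` and the `boundaries` parameter are hypothetical.

```python
from bisect import bisect_right


def range_partition(records, key, boundaries):
    """Split records into len(boundaries) + 1 partitions by key range.

    `boundaries` must be sorted; a key equal to a boundary goes to the
    partition on its right (an arbitrary choice for this sketch).
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect_right(boundaries, key(record))].append(record)
    return partitions


def run_vertices(partitions, vertex_fn):
    """Apply one 'vertex' function per partition (sequentially in this sketch)."""
    return [vertex_fn(part) for part in partitions]
```

In a real deployment each partition would live on a different compute node; here everything runs in one process to keep the example self-contained.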

Page 15: SALSA Group Research Activities
Page 16: SALSA Group Research Activities

Matrix-Matrix Multiplication

• Parallel programming algorithms
    Row split
    Row Column split
    2 dimensional block decomposition in Fox algorithm
• Multi core technologies in .NET
    TPL, PLINQ, Thread pool
• Hybrid parallel model
    Port multi-core to Dryad task to improve performance

[Figure: bar chart comparing Fox-DSC, RowColumn-DSC and RowSplit-DSC, with series TPL, Thread, Task and PLINQ; y-axis from 0 to 250 (axis label not recoverable).]
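Of the algorithms listed on this slide, the row split variant is the simplest to sketch: matrix A is cut into blocks of rows, each block is multiplied by the whole of B independently, and the results are stacked. The Python below is a minimal illustration under that reading. The function names are hypothetical, and a thread pool stands in for the Dryad tasks and .NET multi-core technologies named on the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a_rows]


def row_split_matmul(a, b, num_partitions):
    """Row-split A into at most num_partitions blocks and multiply each by B."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not a:
        return []
    size = -(-len(a) // num_partitions)  # ceiling division
    blocks = [a[i:i + size] for i in range(0, len(a), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(lambda block: multiply_block(block, b), blocks))
    # Stack the partial results back in row order.
    return [row for block in results for row in block]
```

Each block needs all of B, which is the cost that the row/column split and Fox decompositions listed on the slide try to reduce.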

Page 17: SALSA Group Research Activities

PageRank Grouped Aggregation

• A core primitive of many distributed programming models
• Two stages:
    1) Partition the data into groups by some keys
    2) Perform an aggregation over each group
• DryadLINQ provides two types of grouped aggregation:
    GroupBy(), without partial aggregation optimization
    GroupAndAggregate(), with partial aggregation

[Figure: time in seconds (0 to 3500) vs. number of am files (1280, 960, 640, 320) for GroupAndAggregate, TwoApplyPerPartition, OneApplyPerPartition, GroupBy and HierarchicalAggregation.]
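The effect of partial aggregation can be shown with a small Python sketch. It is a generic illustration, not the DryadLINQ `GroupBy()` or `GroupAndAggregate()` implementation. The function names and the `shuffled` counter, which stands in for the number of records that would cross the network, are assumptions.

```python
from collections import defaultdict


def group_then_aggregate(partitions):
    """No partial aggregation: every (key, value) record is shuffled, then summed."""
    groups = defaultdict(list)
    shuffled = 0
    for part in partitions:
        for key, value in part:
            groups[key].append(value)
            shuffled += 1
    return {k: sum(v) for k, v in groups.items()}, shuffled


def aggregate_with_partials(partitions):
    """Partial aggregation: sum inside each partition first, shuffle one record per key."""
    totals = defaultdict(float)
    shuffled = 0
    for part in partitions:
        local = defaultdict(float)
        for key, value in part:
            local[key] += value
        shuffled += len(local)  # only the per-partition partial sums move
        for key, partial_sum in local.items():
            totals[key] += partial_sum
    return dict(totals), shuffled
```

Both functions return the same totals. The second moves fewer records whenever a key repeats within a partition, as it does when many pages link to the same target in PageRank.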

Page 18: SALSA Group Research Activities

NIH Projects

Page 19: SALSA Group Research Activities

Sequence Clustering

[Figure: sequence clustering pipeline. Gene Sequences -> Pairwise Alignment & Distance Calculation -> Distance Matrix -> Pairwise Clustering (output: Cluster Indices) and Multi-Dimensional Scaling (output: Coordinates) -> Visualization (3D Plot).]

Annotations in the figure:

• Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity
• MPI.NET Implementation (shown three times)
• Chi-Square / Deterministic Annealing
• C# Desktop Application based on VTK

* Note: the implementations of the Smith-Waterman and Needleman-Wunsch algorithms are from the Microsoft Biology Foundation library.
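As an illustration of the distance calculation step, the sketch below turns one already-aligned pair of sequences into a distance. The slides do not give the exact formulas the group used, so this assumes the textbook definitions: the p-distance (fraction of mismatched sites, i.e. one minus the identity) and the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3). The function names and the convention of skipping gap columns are assumptions.

```python
import math


def p_distance(a, b, gap="-"):
    """Fraction of mismatched sites between two aligned sequences, ignoring gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    sites = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(1 for x, y in sites if x != y) / len(sites)


def jukes_cantor_distance(a, b):
    """Jukes-Cantor corrected distance; infinite once p reaches saturation (0.75)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Filling the full distance matrix means evaluating such a function for every pair of sequences, which is why that step is parallelized.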

Page 20: SALSA Group Research Activities

Scale-up Sequence Clustering with Twister

[Figure: scaled-up pipeline. Gene Sequences (N = 1 Million) -> Select Reference -> Reference Sequence Set (M = 100K) and N - M Sequence Set (900K). Reference set: Pairwise Alignment & Distance Calculation -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates (x, y, z). N - M set: Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates (x, y, z). Both coordinate sets -> Visualization (3D Plot).]

Cost labels in the figure: O(MxM), O(MxM), O(Mx(N-1)), "e.g. 25 Million".

Page 21: SALSA Group Research Activities

Services and Support

• Web Portal and Metadata Management
• CGB work

// todo - Ryan

Page 22: SALSA Group Research Activities

GTM vs. MDS

                       GTM                        MDS (SMACOF)
Purpose                (both) Non-linear dimension reduction; find an optimal
                       configuration in a lower dimension; iterative
                       optimization method
Objective function     Maximize log-likelihood    Minimize STRESS or SSTRESS
Complexity             O(KN) (K << N)             O(N^2)
Optimization method    EM                         Iterative Majorization (EM-like)
Input                  Vector-based data          Non-vector (pairwise similarity matrix)
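The MDS objective functions named in the comparison can be written out in a few lines. The sketch below assumes the standard unweighted definitions: STRESS sums (d_ij - delta_ij)^2 over all pairs i < j, and SSTRESS sums (d_ij^2 - delta_ij^2)^2, where d_ij is the Euclidean distance in the low-dimensional configuration and delta_ij the input dissimilarity. The function name and the omission of per-pair weights are choices made for this example.

```python
import math


def stress(points, delta, squared=False):
    """Unweighted STRESS (or SSTRESS when squared=True) of a configuration.

    points: list of coordinate tuples in the target dimension.
    delta:  N x N matrix of input dissimilarities.
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # requires Python 3.8+
            if squared:
                total += (d * d - delta[i][j] ** 2) ** 2
            else:
                total += (d - delta[i][j]) ** 2
    return total
```

The double loop over all pairs is where the O(N^2) complexity listed for MDS comes from.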

Page 23: SALSA Group Research Activities

PlotViz

[Figure: PlotViz and Chem2Bio2RDF. Labels in the figure: Visualization Algorithms (parallel dimension reduction algorithms), 3-D Map File, Chem2Bio2RDF (aggregated public databases: PubChem, CTD, DrugBank, QSAR), SPARQL query, Meta data, PlotViz (light-weight client).]

Page 24: SALSA Group Research Activities

Education

Page 25: SALSA Group Research Activities

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09

Demonstrate the concept of Science on Clouds on FutureGrid

[Figure: Dynamic Cluster Architecture. SW-G using Hadoop, SW-G using Hadoop and SW-G using DryadLINQ run on Linux Bare-system, Linux on Xen and Windows Server 2008 Bare-system, over the XCAT Infrastructure on iDataplex Bare-metal Nodes (32 nodes).]

[Figure: Monitoring & Control Infrastructure. Monitoring Interface, Pub/Sub Broker Network, Summarizer, Switcher, Virtual/Physical Clusters, XCAT Infrastructure, iDataplex Bare-metal Nodes.]

Page 26: SALSA Group Research Activities

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09

Demonstrate the concept of Science on Clouds using a FutureGrid cluster

http://salsahpc.indiana.edu/b534
http://salsahpc.indiana.edu/b534projects