
Page 1

Superlink-Online: Harnessing the world’s computers to hunt for disease-provoking genes

Mark Silberstein, CS, Technion
Dan Geiger, Computational Biology Lab
Assaf Schuster, Distributed Systems Lab
Genetics Research Institutes in Israel, EU, US

Computational Biology Laboratory / Distributed Systems Laboratory
MS eScience Workshop 2008

Page 2

Familial Onychodysplasia and dysplasia of distal phalanges (ODP)

[Figure: affected individuals III-15, IV-10, IV-7]

Page 3

Family Pedigree

Page 4

Marker Information Added

Id      dad     mom     sex  aff   Marker 1   Marker 2
III-21  II-10   II-11   f    h     0   0      0  0
II-5    I-3     I-4     f    h     155 157    A  A
III-7   II-4    II-5    f    a     155 157    A  T
III-13  II-4    II-5    m    a     151 155    A  T
III-14  II-1    II-2    f    h     151 155    A  A
III-15  II-4    II-5    m    a     151 155    A  A
III-16  II-10   II-11   f    h     151 159    A  A
III-5   II-4    II-5    f    h     151 155    A  A
IV-1    III-13  III-14  f    h     151 155    A  T
IV-2    III-13  III-14  f    a     151 155    A  T
IV-3    III-13  III-14  f    a     155 155    A  T
...
(sex: f/m; aff: a = affected, h = healthy)

[Figure: chromosome pair with markers M1, M2]

Page 5

Maximum Likelihood Evaluation

[Figure: pedigree fragment with marker genotypes (e.g. III-15: 151,159; III-16: 151,155; 202,209 / 202,202; 139,141 / 139,146; 1,2 / 3,3), affection status a/h, loci D1, M1–M4, D2, and the recombination fraction θ]

The computational problem:

find a value of θ maximizing Pr(data|θ)

LOD score (to quantify how confident we are): Z(θ) = log10[ Pr(data|θ) / Pr(data|θ=½) ]
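To make the θ search and LOD computation concrete, here is a minimal Python sketch for a toy, fully informative two-point case (r recombinant meioses out of n), where Pr(data|θ) = θ^r (1-θ)^(n-r). This is our illustration, not the Superlink likelihood code; real pedigrees require the full computation described on the following slides.

    import math

    def lod(theta, r, n):
        # Toy two-point likelihood: r recombinants out of n informative meioses,
        # so ln Pr(data|theta) = r*ln(theta) + (n - r)*ln(1 - theta).
        ll_theta = r * math.log(theta) + (n - r) * math.log(1.0 - theta)
        ll_null = n * math.log(0.5)                    # theta = 1/2, i.e. no linkage
        return (ll_theta - ll_null) / math.log(10.0)   # Z(theta) in log10 units

    # Grid search for the value of theta maximizing Pr(data|theta).
    r, n = 2, 20
    best_lod, best_theta = max((lod(t / 1000.0, r, n), t / 1000.0)
                               for t in range(1, 500))
    print(f"max LOD {best_lod:.2f} at theta = {best_theta:.3f}")   # ~3.20 at theta ~0.10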

Page 6

Results of Multipoint Analysis

Position (centi-Morgans)   Ln(Likelihood)   LOD
 0.0000 (Marker 3)         -216.0217        -14.74
 0.5500                    -192.2385         -4.41
 1.1000 (Marker 4)         -216.0210        -14.74
 3.6000                    -176.3810          2.47
 6.1000 (Marker 5)         -174.3392          3.35
 8.6500                    -173.9743          3.51
11.2000 (Marker 6)         -173.7030          3.63
16.5500                    -173.3106          3.80
21.9000 (Marker 9)         -172.9497          3.96
25.2500                    -173.6540          3.65
28.6000 (Marker 10)        -177.5622          1.95
40.3001                    -178.9946          1.33

Page 7

The Bayesian network model

[Figure: Bayesian network fragment for individual i over Locus 1, Locus 2 (Disease), Locus 3, Locus 4, showing paternal/maternal allele variables (L_i1f/L_i1m, L_i2f/L_i2m, L_i3f/L_i3m), selector variables (S_i3f/S_i3m), genotype variables (X_i1, X_i2, X_i3), and phenotype variables (Y_1, Y_2, Y_3)]

This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.
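To illustrate what "specifying the joint distribution" means, here is a minimal sketch (our toy example, not the actual Superlink tables): founder allele variables L get population allele frequencies, the observed genotype X is a deterministic function of the two inherited alleles, and marginalizing the hidden alleles gives the probability of an observed unordered genotype.

    import itertools

    allele_freq = {"A": 0.6, "T": 0.4}     # P(L^f), P(L^m) for a founder

    def p_genotype(x, lf, lm):
        # The genotype X is deterministic given the two inherited alleles
        # (an unordered pair), mirroring the X nodes in the network above.
        return 1.0 if x == "".join(sorted(lf + lm)) else 0.0

    # P(X = "AT") = sum over the hidden alleles of P(L^f) P(L^m) P(X | L^f, L^m)
    p_at = sum(allele_freq[lf] * allele_freq[lm] * p_genotype("AT", lf, lm)
               for lf, lm in itertools.product(allele_freq, repeat=2))
    print(p_at)    # 2 * 0.6 * 0.4 = 0.48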

Page 8

The Computational Task

Computing Pr(data|θ) for a specific value of θ:

    Pr(data|θ) = Σ_{x1,...,xk} Π_{i=1..n} P(x_i | pa_i)

Finding the best order is equivalent to finding the best order of sum-product operations over high-dimensional matrices, e.g.

    Y_ij = Σ_{m,n,l,k} A_ikl · B_kjm · C_lmn

Exponential time and space in:
• #variables (five per person, #markers, #gene loci)
• #values per variable (#alleles, non-typed persons)
• table dimensionality (cycles in pedigree)
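The ordering point can be seen directly on the matrix example above. The following sketch (dimensions and factors are made up) evaluates Y_ij = Σ_{m,n,l,k} A_ikl B_kjm C_lmn with numpy's einsum, once as a single all-at-once contraction and once letting einsum choose a pairwise contraction order, the same kind of ordering decision described above, only over far larger probability tables.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((8, 12, 10))    # A[i, k, l]
    B = rng.random((12, 9, 11))    # B[k, j, m]
    C = rng.random((10, 11, 7))    # C[l, m, n]

    # One monolithic sum-product: cost grows with the product of ALL index ranges.
    Y_naive = np.einsum('ikl,kjm,lmn->ij', A, B, C, optimize=False)

    # Pairwise contractions in a good order: same result, far fewer operations.
    Y_fast = np.einsum('ikl,kjm,lmn->ij', A, B, C, optimize='optimal')

    assert np.allclose(Y_naive, Y_fast)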

Page 9

Divisible Tasks through Variable Conditioning

Non-trivial parallelization overhead
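A minimal sketch of the conditioning idea behind the divisible tasks (the variable names and job interface are illustrative, not Superlink's): fix a small set of variables to each combination of their values, run every restricted likelihood computation as an independent job, and sum the partial results, since Pr(data|θ) = Σ_v Pr(data, V=v | θ). The more values the conditioning variables take, the more jobs are produced, which is where the parallelization overhead mentioned above comes from.

    from itertools import product

    def make_jobs(conditioning_vars):
        # One independent job per assignment of the conditioning variables.
        names = list(conditioning_vars)
        return [dict(zip(names, values))
                for values in product(*conditioning_vars.values())]

    # Hypothetical example: two selector variables with two values each -> 4 jobs.
    jobs = make_jobs({"S_1": [0, 1], "S_2": [0, 1]})

    # Each job would compute a partial likelihood with its variables clamped;
    # the master sums the partial results:
    #   Pr(data | theta) = sum(run_restricted_likelihood(j) for j in jobs)
    print(len(jobs), jobs)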

Page 10

Terminology

• Basic unit of execution – batch job
  – Non-interactive mode: “enqueue – wait – execute – return”
  – Self-contained execution sandbox
• A linkage analysis request – a task
  – A bag of (millions of) jobs
  – Turnaround time is important

Page 11

Requirements

• The system must be geneticist-friendly
  – Interactive experience
    • Low response time for short tasks
    • Prompt user feedback
  – Simple, secure, reliable, stable, overload-resistant; concurrent tasks, multiple users...
  – Fast computation of previously infeasible long tasks via parallel execution
    • Harness all available resources: grids, clouds, clusters
    • Use them efficiently!

Page 12

Grids or Clouds?

[Figure: remaining jobs in queue vs. time for a cloud (k CPUs) and a grid (k CPUs); the grid curve shows a long queue waiting time and a long tail due to failures. Insets: queuing time in EGEE; error rate, UW Madison; preempted jobs, UW Madison]

Small tasks are severely slow on grids: what takes 5 minutes on a 10-node dedicated cluster may take several hours on a grid.

Should we move scientific loads to the cloud? YES!

Page 13

Grids or Clouds?

Consider 3.2×10^6 jobs, ~40 min each:
• It took 21 days on ~6000-8000 CPUs
• It would cost about $10K on Amazon’s EC2

Should we move scientific loads to the cloud? NO!

Page 14

Clouds or Grids? Clouds and Grids!

                                          Opportunistic   Dedicated
Reliability                               Low             High
Performance predictability                Low             High
Potential amount of available resources   High            Low
Reuse of existing infrastructure          High            Low

Opportunistic resources suit throughput computing; dedicated resources suit “burst” computing.

Page 15

Cheap and Expensive Resources

Task sensitivity to QoS differs in different stages:
• High throughput phase: use cheap, unreliable resources (grids, community grids, non-dedicated clusters)
• High performance phase: use expensive, reliable resources (dedicated clusters, clouds)

[Figure: remaining jobs in queue over time, marking the point where the task enters tail mode]

Dynamically determine when the task enters tail mode and switch to expensive resources (gracefully); a policy sketch follows.
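A minimal sketch of such a switching policy (the threshold and names are our illustrative assumptions, not the actual Superlink-Online policy): keep the bulk of the bag on cheap opportunistic resources and reroute the remaining jobs to reliable dedicated resources once only a small tail is left.

    def choose_pool(remaining_jobs, total_jobs, tail_fraction=0.05):
        # Cheap, unreliable resources carry the bulk of the work; the expensive,
        # reliable resources are reserved for finishing the tail quickly.
        if remaining_jobs <= tail_fraction * total_jobs:
            return "dedicated"       # clusters, clouds
        return "opportunistic"       # grids, community grids

    # With a bag of 1,000,000 jobs, the last 50,000 are rerouted.
    assert choose_pool(400_000, 1_000_000) == "opportunistic"
    assert choose_pool(40_000, 1_000_000) == "dedicated"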


Page 16

Glue pools together via overlay

[Diagram: a scheduling server (scheduler + job queue) and a virtual cluster maintainer feed submitters to Grid 1, Grid 2, Cloud 1, and Cloud 2]

Issues: granularity, load balancing, firewalls, failed resources, scheduler scalability…

Page 17

Practical considerations

• Overlay scalability and firewall penetration
  – The server may not be able to initiate a connection to the agent
• Compatibility with community grids
  – The server is based on BOINC; agents are upgraded BOINC clients
• Elimination of failed resources from scheduling
  – Performance statistics are analyzed
• Resource allocation depending on the task state
  – Dynamic policy update via the Condor ClassAd mechanism

Page 18

SUPERLINK@TECHNION

[Diagram of the architecture: an upgraded BOINC server (scheduler, HTTP frontend, database with jobs, monitoring and system statistics) drives the task execution and monitoring workflow; a virtual cluster maintainer manages submitters to the Technion pools, the EC2 cloud, OSG, and any other grid/cluster/cloud, plus BOINC-client submitters for EGEE and for the Madison pool; a dedicated cluster serves as fallback; a web portal exposes the task state]

Page 19

Superlink-online 1.0: http://bioinfo.cs.technion.ac.il

Page 20

Task Submission

Page 21

Superlink-online statistics

• ~1720 CPU years for ~18,000 tasks during 2006-2008 (and counting)
• ~37 citations (several mutations found)
  – Examples: ichthyosis, “uncomplicated” hereditary spastic paraplegia (1-9 people per 100,000)
• Over 250 users (and counting): Israeli and international
  – Soroka H., Be'er Sheva; Galil Ma'aravi H., Nahariya; Rabin H., Petah Tikva; Rambam H., Haifa; Beney Tzion H., Haifa; Sha'arey Tzedek H., Jerusalem; Hadassa H., Jerusalem; Afula H.
  – NIH, universities and research centers in the US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others...
• Task example
  – 250 days on a single computer; 7 hours on 300-700 computers
  – Short tasks: a few seconds even during severe overload

Page 22

Using our system in Israeli Hospitals

• Rabin Hospital, by Motti Shochat’s group
  – New locus for mental retardation
  – Infantile bilateral striatal necrosis
• Soroka Hospital, by Ohad Birk’s group
  – Lethal congenital contractural syndrome
  – Congenital cataract
• Rambam Hospital, by Eli Shprecher’s group
  – Congenital recessive ichthyosis
  – CEDNIK syndrome
• Galil Ma’aravi Hospital, by Tzipi Falik’s group
  – Familial Onychodysplasia and dysplasia
  – Familial juvenile hypertrophy

Page 23

Utilizing Community Computing

~3.4 TFLOPs, ~3000 users, from 75 countries

Page 24

Superlink-online V2 (beta) deployment

[Diagram: a submission server dispatching to a dedicated cluster, the Technion Condor pools, the EGEE-II BIOMED VO, Superlink@Technion, Superlink@Campus, the UW Madison Condor pool, and the OSG GLOW VO]

~12,000 hosts operational during the last month

Page 25

3.1 million jobs in 21 days (60 dedicated CPUs only)

Page 26

Conclusions

• Our system integrates clusters, grids, clouds, community grids, etc.
  – Geneticist friendly
• Minimizes the use of expensive resources while providing QoS for tasks
• Generic mechanism for scheduling policy
  – Can dynamically reroute jobs from one pool to another according to a given optimization function (budget, energy, etc.)

Page 27

NVIDIA Compute Unified Device Architecture (CUDA)

[Figure: GPU organization (16 MP x 8 SP x 4): each multiprocessor (MP) contains 8 scalar processors (SP), a register file, and 16 KB of shared memory (~1 cycle latency, ~TB/s); all multiprocessors access global memory and a cached read-only memory]

Page 28

Key ideas (joint work with John Owens, UC Davis)

• Software-managed cache
  – We implement the cache replacement policy in software
• Maximization of data reuse
  – Better compute/memory access ratio
  – A simple model for performance bounds (“Yes, we are (optimal)”); a sketch of such a bound follows this list
• Use special function units for hardware-assisted execution
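As a sketch of the kind of simple performance-bound model mentioned in the list above (a generic arithmetic-intensity bound in our own formulation, not necessarily the exact model from the talk): a kernel can run no faster than the hardware peak, and no faster than the rate at which global memory can feed it operands, so increasing data reuse raises the memory-side bound.

    def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
        # Bounded both by peak arithmetic throughput and by memory bandwidth
        # times the number of operations performed per byte fetched.
        return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)

    # Hypothetical G80-like numbers: ~345 GFLOP/s peak, ~86 GB/s global memory.
    for reuse in (0.5, 2.0, 8.0):                  # FLOPs per byte of global traffic
        print(reuse, attainable_gflops(345.0, 86.4, reuse))
    # 0.5 -> 43.2, 2.0 -> 172.8, 8.0 -> 345.0 (compute-bound once reuse is high)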

Page 29

Results summary

Experiment setup:
• CPU: single-core Intel Core 2, 2.4 GHz, 4 MB L2
• GPU: NVIDIA G80 (GTX8800), 750 MB GDDR4, 128 SPs, 16 KB shared memory / 512 threads
• Only kernel runtime included (no memory transfers, no CPU setup time)

[Chart: speedup breakdown, ~2500 ≈ 2 x 25 x 25 x 2, annotated with the contributions of the hardware itself, software-managed caching, and use of the SFUs: expf is about 6x slower than “+” on the GPU, but ~200x slower on the CPU]

Page 30

Acknowledgments

• Superlink-online team:
  – Alumni: Anna Tzemach, Julia Stolin, Nikolay Dovgolevsky, Maayan Fishelson, Hadar Grubman, Ophir Etzion
  – Current: Artyom Sharov, Oren Shtark
• Prof. Miron Livny (Condor pool UW Madison, OSG)
• EGEE BIOMED VO and OSG GLOW VO
• Microsoft TCI program, NIH grant, SciDAC Institute for ultrascale visualization

If your grid is underutilized, let us know!
Visit us at: http://bioinfo.cs.technion.ac.il/superlink-online
Superlink@TECHNION project home page: http://cbl-boinc-server2.cs.technion.ac.il/superlinkattechnion

Page 31

QUESTIONS?

Visit us at:

http://bioinfo.cs.technion.ac.il/superlink-online