Sandra Gesing
Center for Research Computing
19 August 2016
Increasing the Efficiency of Workflows:
Use Cases in the Life Sciences
University of Notre Dame
• In the middle of nowhere in northern Indiana (1.5 h from Chicago)
• 4 undergraduate colleges
• ~35 research institutes and centers
• ~12,000 students
Center for Research Computing
• Software development and profiling
• Cyberinfrastructure/science gateway development
• Computational scientist support
• Collaborative research/grant development
• System administration/prototype architectures
• Computational resources: 25,000+ cores
• Storage resources: 3 PB
• National resources (e.g., XSEDE)
• ~40 researchers, research programmers, HPC specialists
CRC and OIT building
http://crc.nd.edu
CRC HPC Center (old Union Station)
Life Sciences
• Genomics
• Proteomics
• Metabolomics
• Immunomics
• Systems biology
• Molecular simulations
• Docking
• Epidemiology
• …
Black Swallowtail – larvae and butterfly
The Genomics Boom
February 16, 2001 biotech company Celera
February 15, 2001 The Human Genome Project
Big Data
• Explosion in the quantity, variety and complexity of data
• Questions can now be answered that were impossible even to ask about 10 years ago
• Costs far reduced (e.g., Human Genome Project: 15 years, ~$2 billion; today: ~3 days, $1000)
Analysis of Data
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Slide copied from: Stuart Owen „Workflows with Taverna“
A sequence of connected steps in a defined order based on their control and data dependencies
Workflows!
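The definition above (connected steps, ordered by their dependencies) can be illustrated with a small sketch. The step names below are hypothetical and not taken from the slides; the sketch only shows how a valid execution order can be derived from data dependencies with Python's standard `graphlib` module, not how any particular workflow system does it.

```python
from graphlib import TopologicalSorter

# Hypothetical analysis steps (assumed for illustration only).
# Each step maps to the set of steps whose output it depends on.
dependencies = {
    "fetch_sequence": set(),
    "quality_filter": {"fetch_sequence"},
    "align": {"quality_filter"},
    "predict_genes": {"quality_filter"},
    "report": {"align", "predict_genes"},
}


def execution_order(deps):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Steps without a dependency between them, such as `align` and `predict_genes` here, may appear in either order, which is exactly the freedom a workflow engine can exploit to run them in parallel.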
Workflows
• Different workflow concepts
• Different workflow languages
• Different workflow constructs
Taverna WorkWays
Workflow Editors
• Different technologies (workbenches, web-based)
• Different look-and-feel
State of the Art
• Data and compute-intensive problems
• High-speed networks
• Users generally not IT specialists
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
Challenge for Developers
• Data and compute-intensive problems
• High-speed networks
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
• Users generally not IT specialists
Need for intuitive and efficient workflows!
Usability
“After all, usability really just means making sure that something works well: that a person … can use the thing - whether it's a Web site, a fighter jet, or a revolving door - for its intended purpose without getting hopelessly frustrated.” (Steve Krug in “Don't Make Me Think!: A Common Sense Approach to Web Usability”, 2005)
Workflow Enhancements
• Logical level: Meta-workflows
Herres-Pawlis, S., Hoffmann, A., Rösener, T., Krüger, J., Grunzke, R., and Gesing, S. “Multi-layer Meta-metaworkflows for the Evaluation of Solvent and Dispersion Effects in Transition Metal Systems Using the MoSGrid Science Gateways”, Science Gateways (IWSG), 2015 7th International Workshop on, pp. 47-52, 3-5 June 2015, IEEE Xplore, doi: 10.1109/IWSG.2015.13
• System level: Combination of strengths of workflow systems
Hazekamp, N., Sarro, J., Choudhury, O., Gesing, S., Emrich, S., and Thain, D. “Scaling Up Bioinformatics Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow”, e-Science (e-Science), 2015 IEEE 11th International Conference on, pp. 332-341, Aug. 31 2015-Sept. 4 2015
• Prediction: Model for optimization of tasks and threads
Choudhury, O., Rajan, D., Hazekamp, N., Gesing, S., Thain, D., and Emrich, S. “Balancing Thread-level and Task-level Parallelism for Data-Intensive Workloads on Clusters and Clouds”, Cluster Computing (CLUSTER), 2015 IEEE International Conference on, pp. 390-393, 8-11 Sept. 2015, doi: 10.1109/CLUSTER.2015.60
MoSGrid Science Gateway
Molecular Simulation Grid
• Science gateway integrated with underlying compute and data management infrastructure
• Distributed workflow management
• Data repository
• Metadata management
MoSGrid Science Gateway
Architecture diagram (layers):
• User Interface: WS-PGRADE, Liferay
• High-Level Middleware Service Layer: gUSE
• Middleware Layer: UNICORE, XtreemFS
• DCI Resources
Scaling Up Workflows
Simple Workflow in Galaxy
Problem: as the data size increases, so does the run time
Scaling Up Workflows
Workflow with Parallelism added in Galaxy
Problem: tools must be updated with every change in parallelism, which relies on the scientist
Scaling Up Workflows
Makeflow
• Task structure:
  OUTPUTS : INPUTS
      COMMAND
• Directed Acyclic Graph (DAG)
• Programmatically generated
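As a rough sketch of what "programmatically generated" can mean, the snippet below builds the text of Makeflow rules for a split, process, and merge pattern. Makeflow rules follow Make syntax, with output files left of the colon, input files right of it, and the command on a tab-indented line. The helper name `makeflow_rules`, the file naming scheme, and the commands `./split_input` and `./analyze` are hypothetical placeholders, not tools named in the talk.

```python
def makeflow_rules(input_file, n_chunks,
                   splitter="./split_input", tool="./analyze"):
    """Build Makeflow rule text for a split / process / merge pattern.

    splitter and tool are hypothetical executables, listed as rule inputs
    so that they travel with the tasks that need them.
    """
    chunks = [f"{input_file}.part{i}" for i in range(n_chunks)]
    results = [f"{chunk}.out" for chunk in chunks]
    chunk_list = " ".join(chunks)
    result_list = " ".join(results)

    rules = []
    # One rule that splits the input into n_chunks pieces.
    rules.append(f"{chunk_list} : {input_file} {splitter}\n"
                 f"\t{splitter} {input_file} {n_chunks}\n")
    # One independent rule per chunk; these can run in parallel.
    for chunk, result in zip(chunks, results):
        rules.append(f"{result} : {chunk} {tool}\n"
                     f"\t{tool} {chunk} > {result}\n")
    # One rule that merges the partial results.
    rules.append(f"merged.out : {result_list}\n"
                 f"\tcat {result_list} > merged.out\n")
    return rules


print("\n".join(makeflow_rules("reads.fa", 2)))
```

Changing the degree of parallelism then only means calling the generator with a different `n_chunks`, rather than editing the tool definition by hand.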
Scaling Up Workflows
Dynamic Job Expansion
• Work Queue: we utilized 100s of cores from a Condor pool
• Cleaning the sandbox using knowledge of intermediates and logging
• Explored methods to transmit needed environments such as executables and Java
61.5X speed-up on a 32 GB dataset utilizing these methods
Thread-level and Task-level Parallelism
• Develop predictive performance models for an application domain
• Achieve acceptable performance the first time
• Optimize resource utilization
  • Execution time
  • Memory usage
Thread-level and Task-level Parallelism
• Work Queue master-worker framework
• Sun Grid Engine (SGE) batch system
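Thread-level and Task-level Parallelism
1. Application-level model for time: $T(R, Q, N) = \beta_1 \frac{RQ}{N} + \beta_2$
2. Application-level model for memory: $M(R, N) = \gamma_1 R + \gamma_2 N$
3. System-level model for time: $T_{\mathrm{Total}} = \eta_1 \frac{QK}{D} + \eta_2 \left( \frac{Q}{B} + \frac{RKN}{BC} \right) + \eta_3 \, T\!\left(R, \frac{Q}{K}, N\right) \cdot \frac{KN}{MC} + \eta_4 \frac{O}{B} + \eta_5 \frac{OK}{D}$
4. System-level model for memory: $M_{\mathrm{Master}}(R, Q) = \phi_1 R + \phi_2 Q$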
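To make the two application-level models (1 and 2) concrete, the sketch below evaluates them for a few candidate values of N and picks the fastest one that fits a memory budget. All coefficient values, input sizes, and the memory limit are made-up placeholders chosen for illustration, and units are left unspecified; in practice the coefficients would come from fitting measured runs.

```python
def predicted_time(r, q, n, beta1, beta2):
    """Application-level time model: T(R, Q, N) = beta1 * R * Q / N + beta2."""
    return beta1 * r * q / n + beta2


def predicted_memory(r, n, gamma1, gamma2):
    """Application-level memory model: M(R, N) = gamma1 * R + gamma2 * N."""
    return gamma1 * r + gamma2 * n


def best_n(r, q, candidates, memory_limit, beta, gamma):
    """Return the candidate N with the lowest predicted time within the memory limit."""
    feasible = [n for n in candidates
                if predicted_memory(r, n, *gamma) <= memory_limit]
    if not feasible:
        raise ValueError("no candidate fits within the memory limit")
    return min(feasible, key=lambda n: predicted_time(r, q, n, *beta))


# Hypothetical coefficients and sizes (assumptions, not values from the talk).
beta = (2.0, 5.0)
gamma = (0.5, 1.5)
choice = best_n(r=4, q=10, candidates=[1, 2, 4, 8],
                memory_limit=10, beta=beta, gamma=gamma)
print(choice, predicted_time(4, 10, choice, *beta))
```

Because predicted time falls with N while predicted memory grows with it, the memory model is what caps the choice of N in this sketch.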
Thread-level and Task-level Parallelism
• Data collection: 7 data points each for R, Q, and N, giving 343 data points
• Training: training data, regression model, regression coefficients
• Testing: testing data, accuracy test (MAPE)
Thread-level and Task-level Parallelism
Avg. MAPE = 3.1
MAPE = Mean Absolute Percentage Error
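The fit-and-test procedure can be sketched in a few lines. The example below generates synthetic timings on a 7 x 7 x 7 grid of (R, Q, N) values (343 points), fits the time model T = β1·RQ/N + β2 by ordinary least squares on a training subset, and reports MAPE on the held-out points. The data are artificial, and the "true" coefficients, noise level, and 80/20 split are assumptions made only for this illustration; they are not the measurements or settings behind the reported average MAPE.

```python
import random


def fit_time_model(samples):
    """Least-squares fit of T = beta1 * (R*Q/N) + beta2.

    samples is a list of (r, q, n, measured_time) tuples.
    """
    xs = [r * q / n for r, q, n, _ in samples]
    ts = [t for _, _, _, t in samples]
    mean_x = sum(xs) / len(xs)
    mean_t = sum(ts) / len(ts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    beta1 = sxt / sxx
    beta2 = mean_t - beta1 * mean_x
    return beta1, beta2


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)


# Synthetic data: 7 levels each for R, Q and N (all values are assumptions).
rng = random.Random(0)
levels = [1, 2, 3, 4, 5, 6, 7]
samples = []
for r in levels:
    for q in levels:
        for n in levels:
            true_time = 3.0 * r * q / n + 20.0
            noisy_time = true_time * (1 + rng.uniform(-0.05, 0.05))
            samples.append((r, q, n, noisy_time))

rng.shuffle(samples)
split = int(0.8 * len(samples))
train, test = samples[:split], samples[split:]

beta1, beta2 = fit_time_model(train)
predictions = [beta1 * r * q / n + beta2 for r, q, n, _ in test]
error = mape([t for _, _, _, t in test], predictions)
print(f"beta1={beta1:.2f} beta2={beta2:.2f} MAPE={error:.1f}%")
```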
Result
# Cores/Task   # Tasks   Predicted Time (min)   Speedup   Estimated EC2 Cost ($)   Estimated Azure Cost ($)
1              360       70                     6.6       50.4                     64.8
2              180       38                     12.3      25.2                     32.4
4              90        24                     19.5      18.9                     32.4
8              45        27                     17.3      18.9                     32.4
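A quick arithmetic check of the table: every configuration uses the same total number of cores (cores per task times tasks), and multiplying each predicted time by its speedup gives the baseline run time the speedups appear to be measured against. That baseline is not stated on the slide, so the value computed below is only an inference from the tabulated numbers.

```python
# Rows copied from the result table:
# (cores_per_task, tasks, predicted_min, speedup, ec2_cost, azure_cost)
rows = [
    (1, 360, 70, 6.6, 50.4, 64.8),
    (2, 180, 38, 12.3, 25.2, 32.4),
    (4, 90, 24, 19.5, 18.9, 32.4),
    (8, 45, 27, 17.3, 18.9, 32.4),
]

# Total cores in use for each configuration.
total_cores = [cores * tasks for cores, tasks, *_ in rows]

# Baseline time implied by each row (inferred, not given in the source).
implied_baseline = [minutes * speedup for _, _, minutes, speedup, *_ in rows]

# Configuration with the shortest predicted time.
fastest = min(rows, key=lambda row: row[2])

print(total_cores)
print([round(b, 1) for b in implied_baseline])
print(fastest)
```

The rows agree on a total of 360 cores and on an implied baseline of a little under 8 hours, and the shortest predicted time falls at 4 cores per task rather than at either extreme.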
Acknowledgements
Logical level
• Richard Grunzke
• Sonja Herres-Pawlis
• Alexander Hoffmann
• Jens Krüger
• MTA SZTAKI
• University of Westminster
System level and prediction
• Nicholas Hazekamp
• Olivia Choudhury
• Douglas Thain
• Scott Emrich
• Notre Dame Bioinformatics Lab
• The Cooperative Computing Lab, University of Notre Dame
Science Gateways Community Institute
01 August 2016 – 31 July 2021
1. Incubator: shared expertise in business and sustainability planning, cybersecurity, user interface design, and software engineering practices
2. Extended Developer Support: expert developers for up to one year
3. Scientific Software Collaborative: open-source, extensible framework for gateway design, integration, and services
4. Community Engagement and Exchange: forum for communication and shared experiences
5. Workforce Development: training programs and helping universities form gateway support groups
http://sciencegateways.org/