www.iplantcollaborative.org [email protected]
The iPlant Collaborative
IBP Annual Meeting – June 1st 2011
Steve Goff
iPlant Collaborative, BIO5 Institute
School of Plant Science
University of Arizona
What is iPlant?
• iPlant’s mission is to build the cyberinfrastructure (CI) to support plant biology’s Grand Challenge solutions
• Phase I – Community Input
• Phase II – Building the CI Foundation
• Next Phase – Enabling Plant Science Discovery
Now need to integrate workflows and test theories
Will support tool integration and synthesis activities
NSF Cyberinfrastructure Vision
• High Performance Computing
• Data and Data Analysis
• Virtual Organizations
• Learning and Workforce
Ref: “Cyberinfrastructure Vision for 21st Century Discovery”, NSF Cyberinfrastructure Council, March 2007.
CI for Plant Science: Observations
• Investment in data creation is high
• Sources of data are disparate
• Investment in existing tools is significant
• Tools shouldn’t be discarded
• Tools shouldn’t be reproduced, but they lack:
– Interoperability with other tools
– Data standards
– Scalability
– Consistency of interface access & use
– Experimental reproducibility
iPlant is a process and a platform
(or set of platforms, depending on your point of view).
Computational & Storage Capability
– Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
• ~700 Teraflops
– Storage: Corral, Ranch (UT), Ocotillo (ASU)
• > 10 Petabytes of storage available for the project
– Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
• Among the world’s largest visualization systems
– Virtualized/Cloud Services: iPlant, TeraGrid, vendor clouds
• Cloud tech to deliver persistent gateways and user services
Thanks to large-scale NSF investments, iPlant has excellent CI access
[Figure: iPlant Cyberinfrastructure. Labels: Bench Biologists, Computational Biologists, Discovery Environment, Data Store, Atmosphere, APIs, Data, Algorithms, Semantic Web Layer]
Overview of Components
• iPlant Discovery Environment – Core Software
• iRODS Integration – Core Services
• Atmosphere Cloud – Core Services
• Semantic Web Tech – SSWAP Team
• iPlant Tool/Workflow API – Core Software & Engagement Teams
[Figure: iPlant architecture layers, top to bottom]
Discovery Environment, DNA Subway, 3rd Party Science Gateways, User Scripts & Applications
Public APIs
Low-Level Services: Event, I/O, Data, Apps, Job, Profile, Auth
Condor, PBS, SGE, LSF, LL, iRODS, MySQL, LDAP, Eucalyptus, ActionFolders, Shibboleth, Globus/Unicore, GPIR, MyProxy, XSEDE
iPlant Hardware Resources: High Perf Computing, Databases, Storage, Cloud Systems
Semantic Web
iRODS: Integrated Rule-Oriented Data System
www.irods.org
• Why iRODS?
– Large data storage in simple format
– Sharing of large data among iPlant CI Resources
– Sharing of large data with colleagues and collaborators
– Processing large data with TACC resources
• General information on iRODS: www.irods.org
• Access iPlant’s iRODS: irodsweb.iplantcollaborative.org
• Documentation: https://pods.iplantcollaborative.org/wiki/display/systems/iRODS
Atmosphere: iPlant’s Cloud Computing Resources
http://atmosphere.iplantcollaborative.org
• Tutorial: https://pods.iplantcollaborative.org/wiki/display/atmosphere/Demo+with+picture+walkthrough
• Why Atmosphere?
– Use a virtual machine (VM) with preinstalled software
– Create a VM to install complex software
– Create and share an image of a VM (VMI)
– Mount data from iPlant iRODS for use by your VM
Semantic Web
http://www.iplantcollaborative.org/communities/developers/semanticweb
• Why Semantic Web Technology?
– Provides a means for web services to communicate and be aware of one another
[Figure: pairs linked through the Semantic Web: iPlant Consumer and Remote Service; User-Created Service in Atmosphere and iPlant’s Discovery Environment; iPlant Service and Remote Consumer]
iPG2P: From Genotype to Phenotype
• Visual Analytics
– R. Grene and G. Abram: Information visualization tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations
• Data Integration
– D. Ware and C. Jordan: Methods for describing and unifying data sets into systems that support iPG2P activities
• Statistical Inference
– D. Kliebenstein and E. Buckler: Platform for using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools
– J. White, C. Myers, S. Welch: Framework for the construction, simulation and analysis of computational models of plants
• Ultra High Throughput Sequencing
– T. Brutnell and M. Vaughn: HPC resources and applications to process large-volume sequence data
Genome Services
Ultra High-Throughput Sequencing
Scalable computing
Data
• NCBI SRA
• Desktop
• Amazon S3
• FTP
• HTTP
Data Wrangling
• Quality Control
• Preprocessing
• Rescaling
• Barcoding
Alignments
• BWA
• TopHat
Cufflinks
SAMTools
SAM Alignments
Expression Levels (RPKM)
Genome Variants (VCF 3.3)
Community Use Cases
• Expression studies
• Forward genetic screens
• Association studies
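The expression levels in this pipeline are reported as RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch of that normalization follows. The function and parameter names are illustrative assumptions, not taken from Cufflinks or from any iPlant code, and the example numbers are made up.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 1,000 reads on a 2 kb gene in a 10-million-read library
example_value = rpkm(1000, 2000, 10_000_000)
```

Dividing by both gene length and library size is what makes values comparable across genes and across sequencing runs of different depth.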
High Throughput Image Analysis
Scope: Enable image-based plant sciences research by incorporating image processing algorithms, grid computing, and databasing into an analysis pipeline
Objectives
1. Integrate Phytomorph and BISQUE as PhytoBisque
2. Broaden access to algorithms that benefit the community
3. Automate workflows so that plant biologists need not be computer scientists
Storage
Authentication
APIs
Compute cluster
E. Spalding @ U of Wisconsin; B.S. Manjunath and K. Kvilekval @ UCSB
PhytoBisque: Example Use Case
Given a flatbed scanner image of Arabidopsis seeds, measure the length, width, and area of each seed and produce a population estimate for each trait
Seed trait QTL can be mapped when this is applied to mapped populations like Ler x CVI
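The use case can be sketched in a few lines, assuming the scan has already been thresholded into a binary mask (1 = seed pixel). This is not the Phytomorph or BISQUE algorithm: it labels 4-connected blobs and reports axis-aligned bounding-box dimensions in pixels, where a real pipeline would likely fit ellipses and convert to physical units. All names here are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def measure_seeds(mask):
    """Return area, length, and width (in pixels) for each 4-connected blob."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    seeds = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one seed
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            rmin = rmax = r
            cmin = cmax = c
            while queue:
                y, x = queue.popleft()
                area += 1
                rmin, rmax = min(rmin, y), max(rmax, y)
                cmin, cmax = min(cmin, x), max(cmax, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            height = rmax - rmin + 1
            span = cmax - cmin + 1
            seeds.append({"area": area,
                          "length": max(height, span),
                          "width": min(height, span)})
    return seeds


def population_estimate(seeds, trait):
    """Mean and sample standard deviation of one trait across all seeds."""
    values = [s[trait] for s in seeds]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Tiny made-up mask containing two "seeds"
demo_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
demo_seeds = measure_seeds(demo_mask)
```

The per-trait mean and spread from `population_estimate` are the kind of phenotype values a QTL analysis would then take as input for each line of a mapped population.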
A Strategy for Association Studies
Basic QTL/GWAS analysis
• R/qtl, QTL Cartographer, et al.
• Community can integrate these into the CI
Iterative analyses
• iPlant workflow management simplifies automation
• Compare methods!
Exploratory methods
• Hand-built R, Python, SAS, C codes
• Easy integration into iPlant CI via API
• Adopt common data model
Scalability Challenges: High-density markers, large populations, combinatorial analyses
• iPlant-authored parallel GLM (etc) implementations
• Common data model
• Utilize workflow framework
Statistical Inference: Scalable GLM
• Simplest case*: a few minutes using GLM on desktop TASSEL
• 1000-replicate bootstrap: 75-150 hours / trait
• Runtimes only get larger (days to years) for more complex analyses
* One trait x 40 million markers with no bootstrapping or epistasis testing
6 traits of interest
40 million markers in maize NAM
1000 replicate analyses
Epistasis testing
[Figure: Genotype X Phenotype, ANOVA]
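At the quoted 75-150 hours per trait, bootstrapping all six traits serially would take 450-900 hours, which is why the scan needs to scale out. The sketch below shows the shape of the computation under simplifying assumptions: a one-way ANOVA F statistic per marker stands in for the GLM, and individuals are resampled with replacement for each bootstrap replicate. It is not TASSEL's implementation, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def marker_f_stat(genotypes, phenotype):
    """One-way ANOVA F statistic of phenotype grouped by genotype class."""
    groups = defaultdict(list)
    for g, y in zip(genotypes, phenotype):
        groups[g].append(float(y))
    n, k = len(phenotype), len(groups)
    if k < 2 or n <= k:
        return 0.0  # not enough classes or degrees of freedom to test
    grand_mean = sum(phenotype) / n
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    if ss_within == 0.0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def scan_markers(marker_rows, phenotype):
    """F statistic for every marker; each row holds one marker's genotypes."""
    return [marker_f_stat(row, phenotype) for row in marker_rows]


def bootstrap_scan(marker_rows, phenotype, n_replicates, seed=0):
    """Repeat the scan on individuals resampled with replacement."""
    rng = random.Random(seed)
    n = len(phenotype)
    replicates = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [phenotype[i] for i in idx]
        resampled = [[row[i] for i in idx] for row in marker_rows]
        replicates.append(scan_markers(resampled, y))
    return replicates
```

Each marker is tested independently and each replicate is independent of the others, so the work divides cleanly across markers, replicates, and traits.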
GPU-based QTL Mapping
• Aspects of the problem are highly parallel
• Re-architect data flow and mapping algorithms for GPU architecture
• Interface for C and GPU implementations will be identical
Ali Akoglu and Dave Lowenthal, UArizona
Alignment-based protein searches sped up 6-10x
iPlant Tree of Life (iPToL)
• Large phylogenetic inference: Building a tree of life for up to 500,000 green plants
• Tree Visualization: Scalable visualization for small to large trees
• Data Assembly and Integration: Acquisition, organization, and processing of the data
• Taxonomic Intelligence: Sorting out different names for the same species
• Tree Reconciliation: Resolving discordant gene and species trees
• Trait Evolution: Using the tree to understand how traits evolved
Phyloviewer: visualization of large phylogenetic trees
My-Plant
• Social networking for plant biologists
• Organized by clade
• Used to organize the data collection for the “big tree”
Taxonomic Name Resolution Service
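The idea of sorting out different names for the same species can be illustrated with a small lookup. This is only a sketch of the concept, not the service's actual algorithm or API: the synonym table is a tiny hand-written example, and fuzzy matching uses the Python standard library's `difflib.get_close_matches` with an arbitrarily chosen cutoff.

```python
import difflib

# Illustrative table: normalized submitted name -> accepted name
ACCEPTED_NAME = {
    "solanum lycopersicum": "Solanum lycopersicum",
    "lycopersicon esculentum": "Solanum lycopersicum",  # older synonym for tomato
    "arabidopsis thaliana": "Arabidopsis thaliana",
    "zea mays": "Zea mays",
}


def resolve_name(raw_name, cutoff=0.85):
    """Map a submitted name to an accepted name, or None if nothing is close."""
    # Normalize case and whitespace before matching
    key = " ".join(raw_name.lower().split())
    if key in ACCEPTED_NAME:
        return ACCEPTED_NAME[key]
    # Fall back to the single closest known name above the similarity cutoff
    close = difflib.get_close_matches(key, list(ACCEPTED_NAME), n=1, cutoff=cutoff)
    return ACCEPTED_NAME[close[0]] if close else None
```

Exact synonyms resolve through the table, while misspellings such as "Arabidopsis thalliana" are caught by the fuzzy step. Names with no close match return `None` rather than a guess.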
Integration of New Tools w/o Programming
This part is done!!!
This part is coming soon!
Related Activities
• Integrated Breeding Platform
– Social networking portal for plant breeders
– R analysis packages
– Breeders fieldbook
• 1kp (1,000 plant transcriptomes)
• DOE’s Knowledgebase (Kbase)
• Seed projects
• Elixir
• CoGe
Future Workshop Activities
• Small tool/workflow integration meetings
– 2-3 days each, 10-20 local participants
– 4-5 meetings starting in June 2011
• Addressing specific biological questions
– With appropriate test data and available software
• Building on iPlant’s cyberinfrastructure
– Complementary tools and additional data access
– Preference for broad use, high impact tools & workflows
– Can be kept private until published
– Positive results will stimulate additional support
iPlant’s Building Blocks
Metadata Data Tools Workflows Viz
Executive Team: Steve Goff, Dan Stanzione
Faculty Advisors: Greg Andrews, Kobus Barnard, Susan Brown, Vicki Chandler, John Hartman, Nirav Merchant, Sudha Ram, Ann Stapleton, Lincoln Stein, Doreen Ware, Sue Wessler, Ramin Yadegari
Students: Storme Briscoe, Steven Gregory, Monica Lent, Bansri Poduval, Pavithra Ravi, Shannon Wermes, Jill Yarmchuk
Staff: Greg Abram, Victoria Bryan, Rion Dooley, Andy Edmonds, Juan Antonio Raygoza Garay, Karla Gendler, Damian Gessler, Cornel Ghiban, Michael Gonzales, Hariolf Häfele, Matthew Helmke, Natalie Henriques, Uwe Hilgert, Nicole Hopkins, Lisa Howells, Kathleen Kennedy, Mohammed Khalfan, Seung-jin Kim, Adam Kubach, Sangeeta Kuchimanchi, Tina Lee, Andrew Lenards, Sonya Lowry, Jerry Lu, Eric Lyons, Naim Matasci, Sheldon McKay, Dave Micklos, Andy Muir, Martha Narro, Christos Noutos, Dennis Roberts, Bernice Rogowitz, Jerry Schneider, Bruce Schumaker, Edwin Skidmore, Sriram Srinivasan, Mary Margaret Sprinkle, Matthew Vaughn, Liya Wang, Sharon Wei, Jason Williams, Frank Willmore, John Wregglesworth, Weijia Xu