High Performance Cyberinfrastructure Enables Data-Driven Science in the Globally Networked World Keynote Presentation Sequencing Data Storage and Management Meeting at The X-GEN Congress and Expo San Diego, CA March 14, 2011 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD Follow me on Twitter: lsmarr 1

Page 1

High Performance Cyberinfrastructure Enables Data-Driven Science in the Globally Networked World

Keynote Presentation

Sequencing Data Storage and Management Meeting at The X-GEN Congress and Expo

San Diego, CA

March 14, 2011

Dr. Larry Smarr

Director, California Institute for Telecommunications and Information Technology

Harry E. Gruber Professor, Dept. of Computer Science and Engineering

Jacobs School of Engineering, UCSD

Follow me on Twitter: lsmarr

Page 2

Abstract

High performance cyberinfrastructure (10Gbps dedicated optical channels end-to-end) enables new levels of discovery for data-intensive research projects, such as next generation sequencing. In addition to international and national optical fiber infrastructure, we need local campus high performance research cyberinfrastructure (HPCI) to provide “on-ramps,” as well as scalable visualization walls and compute and storage clouds, to augment the emerging remote commercial clouds. I will review how UCSD has built out just such an HPCI and is in the process of connecting it to a variety of high throughput biomedical devices. I will show how high performance collaboration technologies allow distributed interdisciplinary teams to analyze these large data sets in real time.

Page 3

Two Calit2 Buildings Provide Laboratories for “Living in the Future”

• “Convergence” Laboratory Facilities
– Nanotech, BioMEMS, Chips, Radio, Photonics
– Virtual Reality, Digital Cinema, HDTV, Gaming
• Over 1000 Researchers in Two Buildings
– Linked via Dedicated Optical Networks

UC San Diego

www.calit2.net

Over 400 Federal Grants, 200 Companies

UC Irvine

Page 4

The Required Components of High Performance Cyberinfrastructure

• High Performance Optical Networks
• Scalable Visualization and Analysis
• Multi-Site Collaborative Systems
• End-to-End Wide Area CI
• Data-Intensive Campus Research CI

Page 5

The OptIPuter Project: Creating High Resolution Portals Over Dedicated Optical Channels to Global Science Data

Picture Source: Mark Ellisman, David Lee, Jason Leigh

Calit2 (UCSD, UCI), SDSC, and UIC Leads (Larry Smarr, PI)

Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST

Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent

Scalable Adaptive Graphics Environment (SAGE)

OptIPortal

Page 6

Visual Analytics--Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome (5 Million Bases)

Acidobacteria bacterium Ellin345 Soil Bacterium 5.6 Mb; ~5000 Genes

Source: Raj Singh, UCSD

Page 7

Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome

Source: Raj Singh, UCSD

Page 8

Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome

Source: Raj Singh, UCSD

Page 9

Large Data Challenge: Average Throughput to End User on Shared Internet is 10-100 Mbps

http://ensight.eos.nasa.gov/Missions/terra/index.shtml

Transferring 1 TB:
-- 50 Mbps = 2 Days
-- 10 Gbps = 15 Minutes

Tested January 2011
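
The two transfer times on this slide can be checked with a back-of-envelope calculation. The sketch below assumes a decimal terabyte and an ideal link that sustains its full rate with no protocol overhead; the slide states neither, and the function name is illustrative only.

```python
def transfer_time_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal time to move size_bytes over a link sustaining rate_bits_per_s."""
    return size_bytes * 8 / rate_bits_per_s


TB = 10**12  # decimal terabyte (assumption; the slide does not say which)

shared = transfer_time_seconds(1 * TB, 50e6)     # 50 Mbps on the shared Internet
dedicated = transfer_time_seconds(1 * TB, 10e9)  # 10 Gbps dedicated channel

print(f"50 Mbps: {shared / 86400:.2f} days")
print(f"10 Gbps: {dedicated / 60:.1f} minutes")
print(f"speedup: {shared / dedicated:.0f}x")
```

Under these assumptions the ideal times come out at about 1.85 days and 13.3 minutes, which the slide rounds to 2 days and 15 minutes.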

Page 10

c = λ * f

Solution: Give Dedicated Optical Channels to Data-Intensive Users

“Lambdas” (WDM)

Source: Steve Wallach, Chiaro Networks

Parallel Lambdas are Driving Optical Networking The Way Parallel Processors Drove 1990s Computing

10 Gbps per User ~ 100-1000x Shared Internet Throughput

Page 11

Dedicated 10Gbps Lightpaths Tie Together State and Regional Fiber Infrastructure

NLR 40 x 10Gb Wavelengths

Interconnects Two Dozen State and Regional Optical Networks

Internet2 Dynamic Circuit Network Is Now Available

Page 12

Visualization courtesy of Bob Patterson, NCSA.

www.glif.is

Created in Reykjavik, Iceland 2003

The Global Lambda Integrated Facility--Creating a Planetary-Scale High Bandwidth Collaboratory

Research Innovation Labs Linked by 10G Dedicated Lambdas

Page 13

Launch of the 100 Megapixel OzIPortal Kicked Off a Rapid Build Out of Australian OptIPortals

Covise, Phil Weber, Jurgen Schulze, Calit2

CGLX, Kai-Uwe Doerr, Calit2

http://www.calit2.net/newsroom/release.php?id=1421

January 15, 2008

No Calit2 Person Physically Flew to Australia to Bring This Up!

Page 14

“Blueprint for the Digital University”--Report of the UCSD Research Cyberinfrastructure Design Team

• Focus on Data-Intensive Cyberinfrastructure

research.ucsd.edu/documents/rcidt/RCIDTReportFinal2009.pdf

No Data Bottlenecks--Design for Gigabit/s Data Flows

April 2009

Page 15

Source: Jim Dolgonas, CENIC

Campus Preparations Needed to Accept CENIC CalREN Handoff to Campus

Page 16

Current UCSD Prototype Optical Core: Bridging End-Users to CENIC L1, L2, L3 Services

Source: Phil Papadopoulos, SDSC/Calit2 (Quartzite PI, OptIPuter co-PI)

Quartzite Network MRI #CNS-0421555; OptIPuter #ANI-0225642

Lucent

Glimmerglass

Force10

Endpoints:

>= 60 endpoints at 10 GigE

>= 32 Packet switched

>= 32 Switched wavelengths

>= 300 Connected endpoints

Approximately 0.5 TBit/s Arrive at the “Optical” Center of Campus.

Switching is a Hybrid of: Packet, Lambda, Circuit -- OOO and Packet Switches

Page 17

Calit2 Sunlight Optical Exchange Contains Quartzite

Maxine Brown, EVL, UIC

OptIPuter Project Manager

Page 18

UCSD Planned Optical Networked Biomedical Researchers and Instruments

Cellular & Molecular Medicine West

National Center for Microscopy & Imaging

Biomedical Research

Center for Molecular Genetics

Pharmaceutical Sciences Building

Cellular & Molecular Medicine East

CryoElectron Microscopy Facility

Radiology Imaging Lab

Bioengineering

Calit2@UCSD

San Diego Supercomputer Center

• Connects at 10 Gbps:

– Microarrays

– Genome Sequencers

– Mass Spectrometry

– Light and Electron Microscopes

– Whole Body Imagers

– Computing

– Storage

Page 19

UCSD Campus Investment in Fiber Enables Consolidation of Energy Efficient Computing & Storage

Source: Philip Papadopoulos, SDSC, UCSD

OptIPortal Tiled Display Wall

Campus Lab Cluster

Digital Data Collections

N x 10Gb/s

Triton – Petascale Data Analysis

Gordon – HPD System

Cluster Condo

WAN 10Gb: CENIC, NLR, I2

Scientific Instruments

DataOasis (Central) Storage

GreenLight Data Center

Page 20

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis

http://camera.calit2.net/

Page 21

Calit2 Microbial Metagenomics Cluster - Next Generation Optically Linked Science Data Server

512 Processors, ~5 Teraflops

~200 Terabytes Storage

1GbE and 10GbE Switched/Routed Core

~200TB Sun X4500 Storage

10GbE

Source: Phil Papadopoulos, SDSC, Calit2

4000 Users From 90 Countries

Page 22

OptIPuter Persistent Infrastructure Enables Calit2 and U Washington CAMERA Collaboratory

Ginger Armbrust’s Diatoms: Micrographs, Chromosomes, Genetic Assembly

Photo Credit: Alan Decker Feb. 29, 2008

iHDTV: 1500 Mbits/sec Calit2 to UW Research Channel Over NLR
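
To put the 1500 Mbits/sec iHDTV stream in context, the sketch below compares it with the shared-Internet throughput range (10-100 Mbps) and the 10 Gbps dedicated channels quoted earlier in the talk. It assumes the stream runs over a single 10 Gbps wavelength, which the slide does not state; the constant names are illustrative.

```python
LAMBDA_BPS = 10e9                  # one dedicated 10 Gbps wavelength (assumption)
IHDTV_BPS = 1500e6                 # iHDTV stream rate quoted on the slide
SHARED_BPS_RANGE = (10e6, 100e6)   # typical shared-Internet throughput from the talk

# Share of one wavelength consumed by a single stream.
fraction_of_lambda = IHDTV_BPS / LAMBDA_BPS

# Whole streams that fit on one wavelength, ignoring protocol overhead.
streams_per_lambda = int(LAMBDA_BPS // IHDTV_BPS)

# How far the stream exceeds shared-Internet throughput (best case, worst case).
shortfall = tuple(IHDTV_BPS / rate for rate in reversed(SHARED_BPS_RANGE))

print(f"{fraction_of_lambda:.0%} of one wavelength")
print(f"{streams_per_lambda} streams per wavelength")
print(f"{shortfall[0]:.0f}x to {shortfall[1]:.0f}x shared-Internet throughput")
```

A single stream needs 15 to 150 times the typical shared-Internet throughput, yet fills only a fraction of one dedicated wavelength.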

Page 23

Creating CAMERA 2.0 - Advanced Cyberinfrastructure Service Oriented Architecture

Source: CAMERA CTO Mark Ellisman

Page 24

The GreenLight Project: Instrumenting the Energy Cost of Computational Science

• Focus on 5 Communities with At-Scale Computing Needs:
– Metagenomics
– Ocean Observing
– Microscopy
– Bioinformatics
– Digital Media
• Measure, Monitor, & Web Publish Real-Time Sensor Outputs
– Via Service-oriented Architectures
– Allow Researchers Anywhere To Study Computing Energy Cost
– Enable Scientists To Explore Tactics For Maximizing Work/Watt
• Develop Middleware that Automates Optimal Choice of Compute/RAM Power Strategies for Desired Greenness
• Data Center for School of Medicine Illumina Next Gen Sequencer Storage and Processing

Source: Tom DeFanti, Calit2; GreenLight PI
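
The "Work/Watt" metric named on this slide can be illustrated with a small sketch. Everything below is hypothetical: the `Run` record, the field names, and the sample numbers are made up for illustration and are not GreenLight's actual data model or measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One benchmark run; all values used below are hypothetical."""
    label: str
    work_units: float          # e.g. sequences processed during the run
    power_watts: List[float]   # power samples taken at a fixed interval


def mean_power(run: Run) -> float:
    """Average power draw over the run's samples."""
    return sum(run.power_watts) / len(run.power_watts)


def work_per_watt(run: Run) -> float:
    """Work completed per watt of average power draw."""
    return run.work_units / mean_power(run)


def greenest(runs: List[Run]) -> Run:
    """Pick the run with the highest work per watt."""
    return max(runs, key=work_per_watt)


runs = [
    Run("config-a", work_units=100.0, power_watts=[200.0, 200.0]),
    Run("config-b", work_units=90.0, power_watts=[150.0, 150.0]),
]
best = greenest(runs)
print(best.label, round(work_per_watt(best), 2))
```

With these made-up samples, config-b is selected even though it completes less total work, because it draws proportionally less power.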

Page 25

Moving to Shared Enterprise Data Storage & Analysis Resources: SDSC Triton Resource & Calit2 GreenLight

http://tritonresource.sdsc.edu

SDSC Large Memory Nodes
• 256/512 GB/sys
• 8TB Total
• 128 GB/sec
• ~ 9 TF
x28

SDSC Shared Resource Cluster
• 24 GB/Node
• 6TB Total
• 256 GB/sec
• ~ 20 TF
x256

UCSD Research Labs

SDSC Data Oasis Large Scale Storage
• 2 PB
• 50 GB/sec
• 3000 – 6000 disks
• Phase 0: 1/3 TB, 8GB/s

Campus Research Network

Calit2 GreenLight

N x 10Gb/s

Source: Philip Papadopoulos, SDSC, UCSD

Page 26

NSF Funds a Data-Intensive Track 2 Supercomputer: SDSC’s Gordon - Coming Summer 2011

• Data-Intensive Supercomputer Based on SSD Flash Memory and Virtual Shared Memory SW
– Emphasizes MEM and IOPS over FLOPS
– Supernode has Virtual Shared Memory:
  – 2 TB RAM Aggregate
  – 8 TB SSD Aggregate
– Total Machine = 32 Supernodes
– 4 PB Disk Parallel File System >100 GB/s I/O
• System Designed to Accelerate Access to Massive Data Bases being Generated in Many Fields of Science, Engineering, Medicine, and Social Science

Source: Mike Norman, Allan Snavely SDSC
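
The per-supernode figures above imply machine-wide totals that the slide leaves implicit. The sketch below multiplies them out and estimates how long one full pass over the disk file system would take at the quoted I/O rate. It assumes decimal units and treats the ">100 GB/s" figure as exactly 100 GB/s, so the scan time is an upper bound; the variable names are illustrative.

```python
SUPERNODES = 32
RAM_TB_PER_SUPERNODE = 2
SSD_TB_PER_SUPERNODE = 8
DISK_PB = 4
IO_GB_PER_S = 100  # slide says ">100 GB/s"; using 100 gives an upper bound on time

total_ram_tb = SUPERNODES * RAM_TB_PER_SUPERNODE
total_ssd_tb = SUPERNODES * SSD_TB_PER_SUPERNODE

# One sequential pass over the whole file system (decimal units: 1 PB = 10**6 GB).
full_scan_hours = DISK_PB * 10**6 / IO_GB_PER_S / 3600

print(f"aggregate RAM: {total_ram_tb} TB")
print(f"aggregate SSD: {total_ssd_tb} TB")
print(f"full disk scan: at most {full_scan_hours:.1f} hours")
```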

Page 27

Data Mining Applications will Benefit from Gordon

• De Novo Genome Assembly from Sequencer Reads & Analysis of Galaxies from Cosmological Simulations & Observations
• Will Benefit from Large Shared Memory
• Federations of Databases & Interaction Network Analysis for Drug Discovery, Social Science, Biology, Epidemiology, Etc.
• Will Benefit from Low Latency I/O from Flash

Source: Mike Norman, SDSC

Page 28

Rapid Evolution of 10GbE Port Prices Makes Campus-Scale 10Gbps CI Affordable

2005: $80K/port, Chiaro (60 Max)

2007: $5K, Force 10 (40 max)

2009: $500, Arista, 48 ports

~$1000 (300+ Max)

2010: $400, Arista, 48 ports

• Port Pricing is Falling
• Density is Rising – Dramatically
• Cost of 10GbE Approaching Cluster HPC Interconnects

Source: Philip Papadopoulos, SDSC/Calit2
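
The price trend on this slide can be summarized as a fold reduction and an average annual rate of decline. The sketch below assumes the year-to-price pairing $80K (2005), $5K (2007), $500 (2009), $400 (2010), which is one reading of the flattened slide, and leaves out the "~$1000 (300+ Max)" entry because its year is not clear from the text.

```python
# Per-port 10GbE prices in USD, keyed by year (pairing is an assumption).
prices = {2005: 80_000, 2007: 5_000, 2009: 500, 2010: 400}

first_year, last_year = min(prices), max(prices)
years = last_year - first_year

# How many times cheaper a port became over the whole period.
fold_reduction = prices[first_year] / prices[last_year]

# Compound annual rate of change (negative means the price is falling).
annual_change = (prices[last_year] / prices[first_year]) ** (1 / years) - 1

print(f"{fold_reduction:.0f}x cheaper over {years} years")
print(f"average change per year: {annual_change:.1%}")
```

Under this reading, per-port cost fell 200-fold in five years, an average decline of roughly 65% per year.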

Page 29

10G Switched Data Analysis Resource: SDSC’s Data Oasis

Radical Change Enabled by Arista 7508 10G Switch (384 10G Capable)

[Diagram: Arista 7508 switch with 10Gbps links to OptIPuter, Co-Lo, UCSD RCI, CENIC/NLR, Trestles (100 TF), Dash, Gordon, Triton, Existing Commodity Storage (1/3 PB), and Oasis storage (2000 TB, > 50 GB/s)]

Oasis Procurement (RFP)

• Phase 0: > 8 GB/s Sustained Today
• Phase I: > 50 GB/sec for Lustre (May 2011)
• Phase II: > 100 GB/s (Feb 2012)

Source: Philip Papadopoulos, SDSC/Calit2

4

Page 30

Calit2 CAMERA Automatic Overflows into SDSC Triton

Triton Resource @ SDSC

CAMERA @ CALIT2

DATA

CAMERA-Managed Job Submit Portal (VM)

10Gbps

Transparently Sends Jobs to Submit Portal on Triton

Direct Mount == No Data Staging
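
The overflow behavior described on this slide can be sketched as a simple routing rule: run a job locally while slots are free, otherwise forward it to the remote cluster. This is a minimal illustration only. The `Cluster` class, the `submit` function, and the slot counts are hypothetical and do not represent CAMERA's or Triton's actual scheduler or portal interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Hypothetical stand-in for a compute cluster with a fixed number of slots."""
    name: str
    slots: int
    running: List[str] = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.running) < self.slots


def submit(job_id: str, local: Cluster, overflow: Cluster) -> str:
    """Run locally when a slot is free, otherwise forward to the overflow cluster.

    Only the job description moves. The data is assumed to be directly mounted
    on both sides, so nothing is staged or copied. The overflow cluster is
    treated as always having room, which a real scheduler would need to check.
    """
    target = local if local.has_capacity() else overflow
    target.running.append(job_id)
    return target.name


camera = Cluster("CAMERA", slots=2)    # slot counts are made up
triton = Cluster("Triton", slots=100)
placements = [submit(f"job{i}", camera, triton) for i in range(3)]
print(placements)
```

The caller never names the target cluster, which is the sense in which the overflow is transparent to the user.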

Page 31

California and Washington Universities Are Testing a 10Gbps Connected Commercial Data Cloud

• Amazon Experiment for Big Data
– Only Available Through CENIC & Pacific NW GigaPOP
– Private 10Gbps Peering Paths
– Includes Amazon EC2 Computing & S3 Storage Services
• Early Experiments Underway
– Robert Grossman, Open Cloud Consortium
– Phil Papadopoulos, Calit2/SDSC Rocks

Page 32

Academic Research OptIPlanet Collaboratory: A 10Gbps “End-to-End” Lightpath Cloud

National LambdaRail

Campus Optical Switch

Data Repositories & Clusters

HPC

HD/4k Video Repositories

End User OptIPortal

10G Lightpaths

HD/4k Live Video

Local or Remote Instruments

Page 33

You Can Download This Presentation at lsmarr.calit2.net