SAN DIEGO SUPERCOMPUTER CENTER

What Can SDSC Do For You?
Michael L. Norman, Director
Distinguished Professor of Physics


Page 1: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Can SDSC Do For You? Michael L. Norman, Director

Distinguished Professor of Physics

Page 2: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Mission: Transforming Science and Society Through “Cyberinfrastructure”

“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.”

D. Atkins, NSF Office of Cyberinfrastructure

Page 3: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Does SDSC Do?

Page 4: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps

• >300,000 times as fast as SDSC’s first supercomputer
• 1,000,000 times as much memory

Page 5: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Industrial Computing Platform: Triton

• Fast
• Flexible
• Economical
• Responsive service
• Being upgraded now with faster CPUs and GPU nodes

Page 6: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

First all 10Gig Multi-PB Storage System

Page 7: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

High Performance Cloud Storage Analogous to AWS S3

• Data preservation and sharing
• Low cost
• High reliability
• Web-accessible

Page 8: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Awesome connectivity to the outside world

External links: 100G CENIC, 100G ESnet, 10G XSEDE, 20G UCSD RCI, commercial Internet, and your 10G link here (www.yourdatacollection.edu), fanned out through two 384-port 10G switches.

Page 9: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Does SDSC Do?

Page 10: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Over 100 in-house researchers and technical staff

Core competencies:
• Modeling & simulation
• Parallel computing
• Cloud computing
• Energy-efficient computing
• Advanced networking
• Software development
• Database systems
• Data mining/BI tools
• Data modeling & integration
• Data management
• Data processing workflows
• Datacenter management

Page 13: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data

• Assemble complex processing easily
• Access diverse resources transparently
• Incorporate multiple software tools
• Assure reproducibility
• Community development model

MapReduce BLAST

Ilkay Altintas

Page 14: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Natasha Balac

Page 15: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Big Data Predictive Analytics for UCSD Smart Grid

Page 16: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Over 70,000 sensor streams from UCSD Smart Grid processed on Gordon

Page 17: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Does SDSC Do?

Page 18: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Center for Large Scale Data Systems Research (CLDS)

Chaitan Baru

Jim Short

Page 20: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Does SDSC Do?

Page 21: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

HPWREN: A Unique Regional Capability for Public-Private Partnerships

Page 23: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Does SDSC Do?

Page 25: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What Can SDSC Do for You?

• Just about anything involving high-capability/capacity technical computing, data management, and networking
• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs
• There is a spectrum of ways we can interact

Page 26: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

How do you begin working with us?

• You have already taken the first step by coming here today
• Join the IPP program to learn more about SDSC expertise and resources
  • POC: Ron Hawkins ([email protected])
• Enjoy the rest of the program

Page 27: Download Presentation Slide Set

SDSC Data Initiatives

Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-scale Data Systems Research (CLDS)
SDSC, UC San Diego
[email protected]

Page 28: Download Presentation Slide Set

SDSC IPP Research Review, June 12, 2013

Outline

• SDSC and Data
• Center for Large-Scale Data Systems Research (CLDS)
• Graduate Student Engagement
• Data Science Education and Training

Page 29: Download Presentation Slide Set

SDSC’s Data DNA
• 25+ year history as a supercomputer center focused on data
• Applied informatics is what we do
  ▫ At the intersection of science, data, and computational science
  ▫ Applied and applications-driven research and development
• Multidisciplinary projects and interdisciplinary collaborations are how we do it
  ▫ They are our strength and the secret sauce in our above-average success rate on highly competitive proposals
• Advancing the state of the art in science and improving the science research process is why we do it
  ▫ Lessons can be applied to business applications as well
  ▫ We believe many science applications are precursors to future business apps

Page 30: Download Presentation Slide Set

Data: A rapidly evolving set of problems
• Analytics: Real-time and historical trend analysis (data velocity and volume)
• Integration: More comprehensive, holistic analysis (data variety)
• Costs
  ▫ Hardware, energy, software, people
• Skill sets
  ▫ Need for “cross-trained,” data-savvy individuals
  ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments
  ▫ Break out of narrow academic silos / corporate roles and departments
  ▫ A real shortage
• Competition
  ▫ Global talent
  ▫ Increasingly, local problems
• Privacy

Page 31: Download Presentation Slide Set

SDSC R&D Activities in Data
• Informatics collaborations in:
  ▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and labs in:
  ▫ Benchmarking, bioinformatics, computational science, data warehousing, data and information visualization, large graph and text data, machine learning, performance modeling, predictive analytics, scientific data management, spatial data management, workflow systems
• Centers of Excellence
  ▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director
  ▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director
  ▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director

Page 32: Download Presentation Slide Set

CLDS: Center for Large-Scale Data Systems Research

• Focus: Technical and technology-management aspects of big data
• Key initiatives
  ▫ Big Data Benchmarking
  ▫ Data Value and How Much Information?
• Principals: Chaitan Baru, James Short

Page 33: Download Presentation Slide Set

Big Data Benchmarking – 1
• Community activity to develop a system-level big data benchmark, like TPC
  ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc
  ▫ [email protected]: biweekly phone meetings
• A proposed BigData Top100 List, bigdatatop100.org
• Two proposals under discussion
  ▫ BigBench: extending TPC-DS for big data
  ▫ Data Analytics Pipeline: end-to-end analysis of event stream data
• Discussions with TPC and SPEC

Page 34: Download Presentation Slide Set

Big Data Benchmarking – 2

• Workshops on Big Data Benchmarking (WBDB)
  ▫ 1st WBDB: May 2012, San Jose
  ▫ 2nd WBDB: December 2012, Pune, India
  ▫ 3rd WBDB: July 2013, Xi’an, China
  ▫ 4th WBDB: October 2013, San Jose

Page 35: Download Presentation Slide Set

Big Data Reference Datasets

• An initiative of the Cloud Security Alliance Big Data Working Group
  ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA)
• Objective:
  ▫ Make reference datasets available on one or more platforms, for algorithm-level benchmarking
• Hosted by SDSC
  ▫ http://clds.sdsc.edu/bdbc/referencedata

Page 36: Download Presentation Slide Set

NIST Big Data Working Group
• http://bigdatawg.nist.gov
• Co-chairs: Chaitan Baru; Robert Marcus, CTO, ET-Strategies; Wo Chang, NIST; Chris Greer, NIST
• Objective: one-year time frame
  ▫ Definitions
  ▫ Taxonomies
  ▫ Reference architectures
  ▫ Technology roadmap
• First meeting: June 19th
• Open to community

Page 37: Download Presentation Slide Set

Current CLDS Programs
• Big Data Benchmarking (Pivotal lead)
• Project on Data Value (NetApp lead)
  ▫ Develop definitions, frameworks, assessment methodology, and tools for data value
  ▫ Proposed Workshop on Data Value, Jan-Feb 2014
• How Much Information 2013 (Seagate lead)
  ▫ Consumer information; enterprise information
• Data Science Institute (Brocade)
  ▫ SDSC-level program

Page 38: Download Presentation Slide Set

CLDS Sponsorship
• Current sponsors
  ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon)
• Goals
  ▫ Small, focused group of core sponsors representing major industry quadrants (non-competitive); 6-8 companies
  ▫ Extended network of members who provide scale and scope and help fund industry events; 20-30 companies
• Sponsor structure:
  ▫ Founding ($100K, multi-year)
  ▫ Program ($50K, annual)
• Member structure:
  ▫ Continuing ($10K+, pay as you go)
  ▫ Member ($5K, per workshop event)

Page 39: Download Presentation Slide Set

Big Data Benchmarking: How you can participate
• BDBC
  ▫ Join the BDBC mailing list: ~150 members, ~75 organizations
  ▫ Attend biweekly meetings, every other Thursday
  ▫ Present at biweekly meetings
• WBDB
  ▫ Submit papers to workshops; attend workshops
• Reference Datasets
  ▫ Participate in the Reference Datasets activity; contribute reference datasets
• NIST Big Data Working Group
  ▫ Join and contribute to the NIST Big Data Working Group
• Join CLDS as a sponsor

Page 40: Download Presentation Slide Set

Data Value and How Much Information? How you can participate

• Data Value
  ▫ Join workshop planning and organization
  ▫ Contribute use cases
  ▫ Join CLDS as a sponsor
• How Much Information?
  ▫ Contribute use cases
  ▫ Join CLDS as a sponsor

Page 41: Download Presentation Slide Set

SDSC Data Science Institute (DSI)
• Objective: Provide training and education in data science
• Audience: Industry attendees and academic researchers dealing with data
• Format
  ▫ Coverage of end-to-end issues in data science
  ▫ Emphasis on hands-on learning using short-course formats, e.g., 1-day, 2-day, 1-week, and up to 1-month
  ▫ Inclusion of modules taught by industry
  ▫ At-home and on-the-road programs
  ▫ Possible internships associated with DSI
• First offering: SDSC Summer Institute, “Discovering Big Data,” Aug 5-12, 2013

Page 42: Download Presentation Slide Set

DSI: How you can participate
• Naming opportunity
  ▫ The <Your Company Name Here> Data Science Institute
• Sign up for the SDSC Summer Institute
  ▫ $2K for 3 days or 5 days
• Sign up for future DSI offerings
• On-the-road program
  ▫ Work with us to create an on-the-road program for your company or your customers
• Contribute training modules
  ▫ Contribute modules based on your technology
• Provide your case studies for use in DSI

Page 43: Download Presentation Slide Set

SDSC Graduate Projects Program
• SDSC projects for CSE MS graduate students
  ▫ Students work on projects with vendor hardware/software
  ▫ Upon successful completion, students receive internships at companies
  ▫ Companies have the option to hire students permanently
• A testbed/sandbox for big data, data science, and computational science
  ▫ Currently a 32-node Hadoop cluster with Hortonworks HDP
  ▫ Plan to also install Intel Hadoop; would like to extend to 96 nodes
  ▫ Discussing a project with Brocade to test the performance of one of their Ethernet switches
• Two students just completed their MS projects (presentations today!)
  ▫ Joining Google and Zynga

Page 44: Download Presentation Slide Set

Graduate Projects Program: How you can participate

• Create a project
  ▫ Announce a project and offer an internship after successful completion
• Contribute hardware/software for projects
• Contribute data with application scenarios / use cases

Page 45: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Bioinformatics Meets Big Data

Wayne Pfeiffer, SDSC/UCSD

June 12, 2013

Page 46: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions for today

• What is causing the flood of data in bioinformatics?
• How much data are we talking about?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational challenges?

Page 47: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

The cost of DNA sequencing has dropped much faster than the cost of computing in recent years, producing the flood of data.

Page 48: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Size matters: how much data are we talking about?

• 3.1 GB for the human genome
  • Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000
  • 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to 1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage
  • 300 GB for 40x coverage
  • 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis
  • 45 TB on disk at SDSC for the W115 project! (~10,000x a single genome)
  • Multiple genomes per person!
  • May only be looking for kB or MB in the end
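The coverage figures above follow from simple arithmetic: genome size x coverage x bytes per base. A small sketch of that estimate (the `read_storage_bytes` helper is illustrative, not an SDSC tool):

```python
def read_storage_bytes(genome_bases: float, coverage: float,
                       bytes_per_base: float = 2.5) -> float:
    """Estimate raw FASTQ read storage for a resequencing run.

    bytes_per_base = 2.5 reflects FASTQ overhead (base plus quality
    plus headers), per the slide's rule of thumb.
    """
    return genome_bases * coverage * bytes_per_base

HUMAN_GENOME_BASES = 3.1e9  # ~3.1 Gbases

for cov in (40, 130):
    gb = read_storage_bytes(HUMAN_GENOME_BASES, cov) / 1e9
    print(f"{cov}x coverage: ~{gb:.1f} GB of FASTQ input")
```

At 40x this gives ~310 GB and at 130x ~1 TB, matching the slide's round numbers.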

Page 49: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC has a rich set of bioinformatics software; representative codes are listed here

• Pairwise sequence alignment
  • ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway)
  • ClustalW, MAFFT
• RNA-Seq analysis
  • GSNAP, TopHat
• De novo assembly
  • ABySS, Edena, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway)
  • BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light
• Tool kits
  • BEDTools, GATK, SAMtools

Page 50: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Many bioinformatics projects use SDSC supercomputers

• HuTS: Human Tumor Study (STSI/SDSC)
  • Find mutations in a tumor and select appropriate chemotherapy
• W115: Study of somatic mutations in the genome of a 115-year-old woman (VU Amsterdam et al.)
  • Find somatic mutations in white blood cells
• MRSA (STSI/UCSD)
  • Characterize genomes of MRSA found in local hospitals
• Larry Smarr’s microbiome (UCSD)
  • Analyze Larry Smarr’s microbiome
• Various phylogenetics studies (via CIPRES gateway)
  • Calculate phylogenetic trees

Page 51: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational workflow for read mapping & variant calling

DNA reads in FASTQ format + reference genome in FASTA format
→ Read mapping, i.e., pairwise alignment (BFAST, BWA, …)
→ Alignment info in BAM format
→ Variant calling (GATK, …)
→ Variants: SNPs, indels, others

Goal: identify simple variants, e.g.,
• single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)
• short insertions & deletions (indels)

SNP example (single mismatched base):
CACCGGCGCAGTCATTCTCATAAT
||||||||||| ||||||||||||
CACCGGCGCAGACATTCTCATAAT

Indel example (three-base deletion):
CACCGGCGCAGTCATTCTCATAAT
||||||||||   |||||||||||
CACCGGCGCA---ATTCTCATAAT
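Production variant callers such as GATK weigh base qualities, read depth, and mapping scores; purely to illustrate the SNP case in the alignment example, a naive comparison of two already-aligned strings might look like this (the `call_snps` helper is hypothetical, not part of any of the listed tools):

```python
def call_snps(reference: str, read: str) -> list[tuple[int, str, str]]:
    """Report (position, ref_base, read_base) for each mismatch between
    two already-aligned, equal-length sequences ('-' marks a gap)."""
    snps = []
    for i, (r, q) in enumerate(zip(reference, read)):
        if r != q and r != '-' and q != '-':
            snps.append((i, r, q))
    return snps

ref  = "CACCGGCGCAGTCATTCTCATAAT"
read = "CACCGGCGCAGACATTCTCATAAT"
print(call_snps(ref, read))  # [(11, 'T', 'A')]
```

The single reported mismatch, T→A at position 11 (0-based), is the SNP marked by the gap in the pipe row above.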

Page 52: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

A pileup diagram shows the mapping of reads to the reference; this example from HuTS shows a SNP in the KRAS gene, which means that cetuximab is not effective for chemotherapy.

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

Page 53: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

The CIPRES gateway lets biologists run phylogenetics codes at SDSC via a browser interface: http://www.phylo.org/index.php/portal

Page 54: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational challenges abound in bioinformatics

• Large amounts of data, which can grow substantially during analysis

• Complex workflows, often with different computational requirements along the way

• Parallelism that varies between steps in the workflow pipeline

• Large shared memory needed for some analyses

Page 55: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Distributed Workflow-Driven Analysis of Biological Big Data

Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]

Page 56: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

So, what is a scientific workflow?

Scientific workflows emerged as an answer to the need to combine multiple cyberinfrastructure components in automated process networks.

Page 57: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

The Big Picture is Supporting the Scientist

From “napkin drawings” (a conceptual scientific workflow) to an executable workflow; example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS

Page 58: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Scientific Workflow Automation Technologies @ SDSC
• Housed in the San Diego Supercomputer Center at UCSD since 2004
• Mission: Support CI projects, scientists, and engineers in computational practices involving process management
• Research and development focus
  • Scientific workflow management
  • Data and process provenance
  • Distributed execution using scientific workflows
  • Engineering and streaming workflows for environmental observatories
  • Fault tolerance in scientific workflows
  • Sensor network management and monitoring
  • Role of scientific workflows in eScience infrastructures
  • Understanding collaborative work in workflow-driven eScience
• Scientific collaborations
  • Bioinformatics, environmental observatories, oceanography, computational chemistry, fusion, geoinformatics, …

Page 59: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Workflows are Used as Toolboxes in Biological Sciences

Data lifecycle: acquisition/generation → analysis → publication/archival

Workflows foster collaborations!
• Flexibility and synergy
• Optimization of resources
• Increasing reuse
• Standards compliance

Need expertise to identify which tool to use, when, and how! Require computation models to schedule and optimize execution!

– Assemble complex processing easily
– Access diverse resources transparently
– Incorporate multiple software tools
– Assure reproducibility
– Community development model
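Scheduling a workflow means executing each step only after its inputs exist. Python's standard-library `graphlib` (3.9+) can sketch that ordering over a toy pipeline DAG; the task names are illustrative, not Kepler actors:

```python
from graphlib import TopologicalSorter

# Toy workflow DAG: task -> set of prerequisite tasks,
# mirroring a simple read-mapping pipeline.
dag = {
    "align": {"fetch_reads", "fetch_reference"},
    "call_variants": {"align"},
    "report": {"call_variants"},
}

# static_order() yields every task with prerequisites before dependents.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real workflow engine layers data movement, provenance capture, and parallel dispatch of independent tasks (here, the two fetches) on top of exactly this dependency analysis.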

Page 60: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Kepler is a Scientific Workflow System

• KEPLER = “Ptolemy II + X” for scientific workflows
  • Ptolemy II: a laboratory for investigating design
  • KEPLER: a problem-solving environment for scientific workflow
• An open collaboration, initiated August 2003
• Builds upon the open-source Ptolemy II framework
• Kepler 2.4 released 04/2013

www.kepler-project.org

Page 61: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

CAMERA Example: Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA): http://camera.calit2.net

Page 62: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

CAMERA is a Collaborative Environment

• Data Cart: multiple available; mixed collections of CAMERA data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Group Workspace: share specified User Workspace data with collaborators
• Data Discovery: GIS and advanced query options
• Data Analysis: workflow-based analysis

Page 63: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Workflows are a Central Part of CAMERA

All can be reached through the CAMERA portal at http://portal.camera.calit2.net

Inputs: from local or CAMERA file systems; user-supplied parameters
Outputs: sharable with a group of users, with links to the semantic database

More than 1,500 workflow submissions monthly!

Page 64: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Pushing the boundaries of existing infrastructure and workflow system capabilities!
• Increase reuse
• Increase programmability by end users
• Increase resource utilization
• Make analysis a part of the end-to-end scientific model, from data generation to publication

Add to these large amounts of next-generation sequencing data!

Page 65: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

bioKepler is a coordinated ecosystem of biological and technological packages!

• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
• Builds on the Kepler and Provenance Framework (www.kepler-project.org) atop cyberinfrastructure platforms: Bio-Linux, Galaxy, Hadoop, and Stratosphere
• Improvement in usability and programmability by end users!

www.bioKepler.org

Page 66: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

What can we do for you?
• Training
  • Workflow-driven data- and compute-intensive processing
• Consulting
  • Designing, scaling, and tracking pipelines and workflows
• Development services
  • Production workflows for you

Access to technology and biology packages: Bio-Linux, Galaxy via Amazon, Amazon Cloud, Hadoop, Stratosphere

Partners: individual researchers and labs; research projects, e.g., CAMERA; academic institutions; private labs, e.g., JCVI

JCVI selection criteria: programmability, modularity, customizability, scalability

www.bioKepler.org

MapReduce BLAST
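MapReduce BLAST follows the classic split/search/merge pattern: partition the query set, run one search task per chunk in parallel, and concatenate the hits. A minimal sketch of that shape, with a toy exact-substring search standing in for the real BLAST executable (all function names here are illustrative, not bioKepler APIs):

```python
from itertools import chain

def chunk(seqs, size):
    """Map-stage input: split the query set into fixed-size chunks."""
    for i in range(0, len(seqs), size):
        yield seqs[i:i + size]

def toy_blast(queries, subject):
    """Stand-in for one search task: report exact substring hits as
    (query, position). The real system would launch BLAST per chunk."""
    return [(q, subject.find(q)) for q in queries if q in subject]

def mapreduce_blast(queries, subject, chunk_size=2):
    # Map: one toy_blast task per chunk; reduce: concatenate hit lists.
    partial = (toy_blast(c, subject) for c in chunk(queries, chunk_size))
    return list(chain.from_iterable(partial))

subject = "CACCGGCGCAGTCATTCTCATAAT"
print(mapreduce_blast(["GGCG", "TTTT", "CATAAT"], subject))
# [('GGCG', 4), ('CATAAT', 18)]
```

Because each chunk's task is independent, the map stage parallelizes trivially across Hadoop workers or cluster nodes; only the cheap merge is serial.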

Page 67: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Ilkay Altintas
[email protected]

Thanks! & Questions…

How to download Kepler: https://kepler-project.org/users/downloads
Please start with the short Getting Started Guide: https://kepler-project.org/users/documentation

Page 68: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Building a Semantic Information Infrastructure

Research Review: SDSC Industrial Partnership Program

Amarnath Gupta
Advanced Query Processing Lab
San Diego Supercomputer Center, University of California at San Diego

Page 69: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Lots of Data, Little Glue

A science enterprise: lab resources; experiment design; reagent catalogs; experimental data; chains of derived data; analysis results (incl. outputs of software tools); external information; publications and presentations; …

An industrial enterprise: customer data; product specifications; product sales data; production process data; customer call records; internal memos; legal documents; emails; social intranet; …

What binds them together?

Wide variation in data types, models, volume, systems, usage, updates, …

Page 70: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Some Consequences of Not Having a Glue

• An orphan-disease researcher
  • Spends five months to determine that her disease of interest relates to the mouse version of a gene studied in breast cancer
• A health institution
  • Takes over a year to integrate clinical trial management data, drug characteristics, and patient reports to get a 360-degree view of drug effects
• A reagent vendor
  • Spends significant time and money to determine the effectiveness of a product line
• A legal department
  • Takes much longer to discover relevant documents and data for a complex, multi-party litigation
• A utilities company
  • Cannot easily perform synchrophasor analytics for understanding grid behavior because SCADA, EMS, and PMU data cannot be integrated

Page 71: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Integration through Semantics

We have been building NIF, a Semantic Information System for Neuroscience.

In NIF:
• Data can be relational, XML, RDF, OWL, wiki content, manuscripts, publications, blogs, multimedia, annotations, …
• Any new domain goes through semantic processing
• Every piece of data gets some semantic markup
• Data are integrated and searched using ontological indices
• Any search or discovery process is interpreted through an ontology
• Any workflow that performs a mining-style computation utilizes semantic properties

Beyond what NIF is designed to answer, it can address unanticipated questions:
• Are NIH studies gender-biased? Which?
• Is the CRE mouse line being used by anyone?
• Which diseases have data but have been underfunded?
• Is the method section in this paper under-specified?

Page 72: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Technology Research

• Semantic Prospecting for any industry and application domain

• Semi-automatic Construction of Semantic Models for any problem domain

• Developing general-purpose, yet domain-specific Semantic Search Engines for heterogeneous data

• Complex Information Discovery using semantic graph analysis and mining-style computation

• Multi-domain Information Integration using semantic information bridges
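As an illustration of mining-style discovery over a semantic graph (echoing the orphan-disease example earlier), a toy triple store plus a breadth-first reachability check might look like this; the entities and predicates are invented for the example, not NIF data:

```python
# Toy triple store: (subject, predicate, object)
triples = {
    ("geneA", "studied_in", "breast_cancer"),
    ("geneA", "homolog_of", "mouse_geneA"),
    ("mouse_geneA", "associated_with", "orphan_diseaseX"),
}

def related(entity, triples):
    """One-hop neighbors of an entity, in either direction."""
    out = set()
    for s, p, o in triples:
        if s == entity:
            out.add((p, o))
        if o == entity:
            out.add((p, s))
    return out

def connected(a, b, triples):
    """Breadth-first search: is b reachable from a over any predicates?"""
    frontier, seen = {a}, {a}
    while frontier:
        nxt = set()
        for e in frontier:
            for _, n in related(e, triples):
                if n == b:
                    return True
                if n not in seen:
                    seen.add(n)
                    nxt.add(n)
        frontier = nxt
    return False

print(connected("breast_cancer", "orphan_diseaseX", triples))  # True
```

The path breast_cancer → geneA → mouse_geneA → orphan_diseaseX is exactly the kind of cross-domain link the researcher in the earlier anecdote spent five months finding by hand.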

Page 73: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Why SDSC?
• Semantic processing can be complex and costly
  • Domain model construction (incl. active learning)
  • Semantic indexing and correlation indexing
  • Graph query processing
  • Semantic search algorithms
  • Landscape and discovery analytics
• The SDSC infrastructure handles complexity at scale
  • Configurable compute nodes
  • Large-memory systems
  • SSD drives

Let SDSC help with your infrastructure, research, and service needs.

Page 74: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

A biomedical data integration system and web search engine

Julia Ponomarenko, PhD
Michael Baitaluk, PhD

San Diego Supercomputer Center

Page 75: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Biomedical data span taxonomies, sequences, publications, structures, variations, expression data, networks, annotations, biochemical data, and epigenetic data.

Page 76: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

These data live in 2,000+ databases and countless web pages. How can a researcher embrace such an amount of data in their entirety?

Page 77: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Data integration resources sit between researchers and the 2,000+ databases and web pages.

Page 78: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Existing data integration resources cover only part of these sources. This leaves a researcher to work with partial, incomplete, non-comprehensive data sets!

Page 79: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

The alternative: a data warehouse (0.1 PB) integrating the 2,000+ databases and web pages, built on biological ontologies and Semantic Web technologies.

Page 80: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Other web pages Database web pages

Page 87: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Database web pages and other web pages are crawled; then:
• Automatically extract data and map them into the internal database schema
• For each ontological term A and page X, calculate the relevance score of X to A
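The slides do not specify the scoring function; a common baseline for term-to-page relevance is TF-IDF, sketched here with an invented two-page corpus (nothing below is IntegromeDB's actual algorithm):

```python
import math

def relevance(term: str, page: str, corpus: list[str]) -> float:
    """TF-IDF-style relevance of a page to an ontological term:
    tf = term frequency within the page,
    idf = rarity of the term across the corpus."""
    words = page.lower().split()
    if not words:
        return 0.0
    tf = words.count(term.lower()) / len(words)
    df = sum(1 for p in corpus if term.lower() in p.lower().split())
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

corpus = [
    "KRAS is a proto-oncogene studied in colorectal cancer",
    "phylogenetics of marine microbial communities",
]
print(relevance("KRAS", corpus[0], corpus) >
      relevance("KRAS", corpus[1], corpus))  # True
```

In practice an ontology also contributes synonyms and parent terms, so the score for term A would aggregate over A's whole synonym set rather than a single literal string.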

Page 88: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

IntegromeDB architecture: public data on the web and users’ private data feed IntegromeDB, which serves the user community through the integromeDB.org web portal & API and the BiologicalNetworks.org Java application.

Page 89: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

integromedb.org

Page 93: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Integromedb.org visit statistics

Page 94: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

User Community

Web-portal & API integromeDB.org

Java-application BiologicalNetworks.org

IntegromeDB

Public Data on the Web User’s Private Data

Page 95: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

The same architecture can serve your user community: your database, fed by public data on the web and users’ private data, exposed through web portals, APIs, and applications.

Page 96: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Your user community, served by your Google-like “search-in-the-box” appliance.

Page 97: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

PACE: Predictive Analytics and Data Mining Research and Applications

Natasha Balac, Ph.D.
Director, PACE
Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD

Page 98: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

PACE: Closing the gap between Government, Industry and Academia

PACE is a non-profit, public educational organization:
• To promote, educate, and innovate in the area of predictive analytics
• To leverage predictive analytics to improve the education and well-being of the global population and economy
• To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics

Predictive Analytics Center of Excellence focus areas:
• Inform, educate, and train
• Develop standards and methodology
• High-performance scalable data mining
• Foster research and collaboration
• Data mining repository of very large data sets
• Provide predictive analytics services
• Bridge the industry–academia gap

Page 99: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Foster Research and Collaboration


• Fraud detection
• Modeling user behaviors
• Smart Grid analytics
• Solar-powered system modeling
• Microgrid anomaly detection
• Distributed energy generation
• Manufacturing
• Sports analytics
• Genomics

Page 100: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

UCSD Smart Grid
• UCSD Smart Grid sensor network data set
• 45 MW peak microgrid; daily population of over 54,000 people
• Self-generates 92% of its annual electricity load
• Smart Grid data: over 100,000 measurements/sec
• Sensor and environmental/weather data
• Large amounts of multivariate, heterogeneous data streaming from complex sensor networks
• Predictive analytics throughout the microgrid

Page 101: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

4 V’s of Big Data

IBM, 2012

Page 102: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

What to do with big data? Eric Sall’s (IBM) list:
• Big data exploration: get an overall understanding of what is there
• 360-degree view of the customer: combine internally available and external information to gain a deeper understanding of the customer
• Monitoring cyber-security and fraud in real time
• Operational analysis: leveraging machine-generated data to improve business effectiveness
• Data warehouse augmentation: enhancing warehouse solutions with new information models and architecture

Page 103: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Big Data – Big Training
• “Data Scientist”: the hot new gig in town (O’Reilly report)
• “Data Scientist: The Sexiest Job of the 21st Century” (Harvard Business Review, October 2012): the future belongs to the companies and people that turn data into products
• Fortune: “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”

Page 104: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Data Science Job Growth

By 2018, the US faces a projected shortage of 140,000–190,000 predictive analysts and 1.5 million managers/analysts.

Page 105: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Page 106: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

PACE Education
• Data mining boot camps
• Boot Camp 1: September 12–13, 2013
• Boot Camp 2: October 17–18, 2013
• On-site personalized boot camps (10-15; 20-30)
• Tech Talks: every 3rd Wednesday
• Workshops, webinars
• Interesting reads, “tool-offs”
• “Bring your own data”

Page 107: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Predictive Analytics Consulting
• Full-service consulting and development services
• Targeted projects with industry and agency partners
• Applied and applications-oriented research
• Technical expertise and industry experience


Page 109: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions?

• http://pace.sdsc.edu/

• For further information, contact Natasha Balac [email protected]

Page 110: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

Performance, Modeling, and Characterization

(PMaC) Lab

Laura Carrington, Ph.D.

PMaC Lab Director

University of California, San Diego

San Diego Supercomputer Center

Page 111: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

The PMaC Lab

Mission Statement: Research the complex interactions between HPC systems and applications to predict and understand the factors that affect performance and power on current and projected HPC platforms.

• Develop tools and techniques to deconstruct HPC systems and HPC applications to provide detailed characterizations of their power and performance.

Page 112: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

The PMaC Lab utilizes the characterizations to construct power and performance models that guide:

– Improvement of application performance: 2 Gordon Bell finalists, DoD HPCMP applications, NSF Blue Waters, etc.
– System procurement, acceptance, and installation: DoD HPCMP procurement team, DoE upgrade of ORNL Jaguar, installation of NAVO PWR6
– Accelerator assessment for a workload: performance assessment and prediction for GPUs & FPGAs
– Hardware customization / hardware-software co-design: performance and power analysis for exascale
– Improvement of energy efficiency and resiliency: Green Queue project, DoE SUPER Institute, DoE BSM, etc.

Page 113: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

Automated tools & techniques to characterize HPC systems & applications

HPC system: characterize how the computational (and communication) patterns affect the overall power draw.

HPC application: characterize the computational (and communication) behavior of the application, broken down by code structure (Loop #1, Loop #2, Loop #3, Func. Foo).

Design software- and hardware-aware energy and performance optimization techniques.

Page 114: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

Hardware Customization

• 10x10 project (PI: A. Chien @ U. of Chicago): heterogeneous processor architecture
• Reconfigurable memory hierarchy: L1, L2, and L3
• Selection of energy-optimal configuration via simulation and reuse distance

Space name   # Configurations in search space   # Unique configurations selected   Avg. energy savings (%)
Full         2652                               33                                 68.8
2N           469                                26                                 66.0
Restricted   224                                23                                 65.5
Cluster      10                                 10                                 63.7

Analysis of 37 workloads: ~70% energy savings.

Page 115: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

Goal: Use machine and application characterization to make application-aware energy optimizations during execution.

PMaC’s Green Queue framework (optimizing for performance & power).

[Figure: power (kW) vs. time (s) for a 1,024-core run on Gordon, with per-phase energy savings annotations of 4.8%, 5.3%, 6.5%, 6.5%, 19%, 21%, and 32%.]

Page 116: Download Presentation Slide Set

PMaC Performance Modeling and Characterization

Questions


Page 117: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Benchmarking and Tuning Big Data Software

David Nadeau, Ph.D.

Page 118: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What to tune?
• Application benchmarking/profiling finds what to tune for one system
• But we want answers for many systems, for now and (hopefully) years to come
• That requires understanding fundamental trends: processor speeds, core counts, memory bandwidth, memory latency, network bandwidth, etc.

Page 119: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

A few trends

Clock speeds: almost flat. Cycles per math op: almost flat.

Raw math ability per core is not improving.

Page 120: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

A few trends

Core count per CPU: way up. SPEC float per CPU: way up.

Float math per core (thread): up 8 to 20% per year. Why is this trend upward if raw math speed is flat?

Page 121: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

A few trends

DDR bandwidth up 15%/year

Memory bandwidth/core up 0 to 12%/year

Improvement is primarily from better memory bandwidth, not math ability.

Page 122: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

What this means
• SPEC, Dhrystone, etc. are now dominated by memory performance, not math:
• 1 multiply = 1 cycle
• 1 memory access = ~300 cycles
• Not likely to change soon
• Application performance is dominated by memory access costs.
• So tune access patterns or data order.
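A toy illustration of “tune access patterns”: the two functions below touch exactly the same elements, one with unit stride and one with large jumps. In pure Python the cache effect is masked by interpreter overhead, so this is only a sketch of the two access patterns, not a benchmark:

```python
def sum_unit_stride(data):
    # Consecutive addresses: the cache-friendly order.
    total = 0.0
    for x in data:
        total += x
    return total

def sum_strided(data, stride):
    # Same elements, visited in stride-sized jumps; in compiled
    # code this pattern can pay a cache miss on nearly every access.
    total = 0.0
    n = len(data)
    for start in range(stride):
        for i in range(start, n, stride):
            total += data[i]
    return total
```

Both produce the same answer; only the order of memory references differs, which is precisely what the hardware cares about.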

Page 123: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume
Task: store a 3D volume in memory. Options: array of arrays of arrays, one big array, etc.
Worst (blue): array of arrays of arrays. Best (black): one big array with simple 3D indexing.
Fewer memory references is much faster.
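A sketch of the “one big array with simple 3D indexing” layout, assuming row-major order with k varying fastest:

```python
def make_volume(nx, ny, nz, fill=0.0):
    # One contiguous allocation instead of nested arrays of arrays.
    return [fill] * (nx * ny * nz)

def idx(i, j, k, ny, nz):
    # (i, j, k) -> flat offset: a couple of multiply-adds and ONE
    # memory reference, versus three dependent pointer chases for
    # an array of arrays of arrays.
    return (i * ny + j) * nz + k
```

The index map is a bijection onto 0..nx*ny*nz-1, so every cell has a unique slot.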

Page 124: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume sweep
Task: sweep a plane through the volume along each of 6 axis directions: +X, -X, +Y, -Y, +Z, -Z.
4 sweep directions are slow; 2 are 10x to 30x faster.
Sweeping in natural data order is much faster.

Page 125: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Example: 3D volume bricking
Task: sweep in any direction. Brick the data: a cube of cubes (unbricked, 2x2x2, 4x4x4, 8x8x8, 16x16x16).
Bricking makes all sweep directions have similar performance.
Bricked data order is more cache friendly.
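Bricking changes the index map so that each b×b×b sub-cube is stored contiguously; a sweep in any direction then stays inside a small, cache-resident brick for several steps at a time. A sketch, assuming the volume side n is a multiple of the brick side b:

```python
def brick_idx(i, j, k, n, b):
    # Which brick the cell falls in, then the offset inside it.
    # All cells of one brick occupy a contiguous run of memory.
    bi, bj, bk = i // b, j // b, k // b
    oi, oj, ok = i % b, j % b, k % b
    bricks = n // b                        # bricks per side
    brick = (bi * bricks + bj) * bricks + bk
    cell = (oi * b + oj) * b + ok
    return brick * b * b * b + cell
```

Like plain row-major indexing, this is a bijection; it simply reorders which cells are neighbors in memory.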

Page 126: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Example: desktop compression
Task: compress and send the desktop in real time, using many codecs.
Clever codecs are slower, despite producing smaller result data.
Codecs with fewer memory references are faster.

Page 127: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Example: parallel compositing
Task: composite N images on a cluster of N nodes. Many algorithms.
Many small messages vs. few big messages: same amount of data, but better network use is faster.

Page 128: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

And so on…

Page 129: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gordon: A First-of-its-Kind Data-intensive Supercomputer

SDSC Research Review June 5, 2013

Shawn Strande Gordon Project Manager

Page 130: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gordon is a highly flexible system for exploring a wide range of data-intensive technologies and applications:
• High-performance flash technology
• High-speed InfiniBand interconnect
• On-demand Hadoop and data-intensive environments
• Massively large memory environments
• High-performance parallel file system
• Scientific databases
• Complex application architectures
• New algorithms and optimizations

Page 131: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gordon is a data movement machine
• Sandy Bridge compute nodes (1,024): 64 TB memory, 341 Tflop/s
• Flash-based I/O nodes (64): 300 TB Intel eMLC flash, 35M IOPS
• Large-memory nodes: vSMP Foundation 5.0, 2 TB of cache-coherent memory per node
• “Data Oasis” Lustre PFS: 100 GB/sec, 4 PB
• Dual-rail 3D torus interconnect: 7 GB/s

Page 132: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

SSD latencies are 2 orders of magnitude lower than HDDs’ (a big deal for some data-intensive applications):
• Typical hard drive: ~10 ms (0.010 s), IOPS = 200
• Solid-state disk: ~100 µs (0.0001 s), IOPS = 35,000 read / 3,000 write

Page 133: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Protein Data Bank (flash-based I/O node) The RCSB Protein Data Bank (PDB) is the leading primary database that provides access to the experimentally determined structures of proteins, nucleic acids and complex assemblies. In order to allow users to quickly identify more distant 3D relationships, the PDB provides a pre-calculated set of all possible pairwise 3D protein structure alignments.

Although the pairwise structure comparisons are computationally intensive, the bottleneck is the centralized server responsible for assigning work, collecting results, and updating the MySQL database. Using a dedicated Gordon I/O node and the associated 16 compute nodes, the work could be accomplished 4–6x faster than using the OSG.

Configuration      Time for 15M alignments   Speedup
Reference (OSG)    24 hours                  1
Lyndonville        6.3 hours                 3.8
Taylorsville       4.1 hours                 5.8

Page 134: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

OpenTopography Facility (flash-based I/O node) The NSF funded OpenTopography Facility provides online access to Earth science-oriented high-resolution LIDAR topography data along with online processing tools and derivative products. Point cloud data are processed to produce digital elevation models (DEMs) - 3D representations of the landscape.

High-resolution bare earth DEM of San Andreas fault south of San Francisco, generated using OpenTopography LIDAR processing tools Source: C. Crosby, UNAVCO

Illustration of local binning geometry. Dots are LIDAR shots; ‘+’ marks indicate locations of DEM nodes at which elevation is estimated.

Dataset and processing configuration: Lake Tahoe, 208 million LIDAR returns, 0.2-m grid resolution and 0.2-m radius.

# concurrent jobs   OT servers   Gordon ION   Speed-up
1                   3297 sec     1102 sec     3x
4                   29607 sec    1449 sec     20x

The local binning algorithm uses the elevation information from only the points inside a circular search area with a user-specified radius. An out-of-core (memory) version of the algorithm exploits secondary storage for saving intermediate results when the size of a grid exceeds that of memory. Using a dedicated Gordon I/O node with fast SSD drives reduces run times of massive concurrent out-of-core processing jobs by a factor of 20.
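The in-memory core of local binning is simple; what Gordon's SSDs accelerate is the out-of-core spill when the grid exceeds memory. A hypothetical sketch of the binning step (OpenTopography's actual implementation differs):

```python
def bin_points(points, cell_size):
    # Hash each (x, y, z) LIDAR return into its 2D grid cell.
    grid = {}
    for x, y, z in points:
        key = (int(x // cell_size), int(y // cell_size))
        grid.setdefault(key, []).append(z)
    return grid

def cell_elevation(grid, key):
    # DEM node estimate from the returns in one cell (mean here;
    # real tools support other estimators and a search radius).
    zs = grid.get(key, [])
    return sum(zs) / len(zs) if zs else None
```

An out-of-core variant would flush the per-cell lists to fast storage whenever the grid dictionary grows too large, then merge them per cell.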

Page 135: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

IntegromeDB (flash-based I/O node) IntegromeDB is a large-scale data integration system and biomedical search engine. IntegromeDB collects and organizes heterogeneous data from over a thousand databases covered by Nucleic Acids Research and millions of public biomedical, biochemical, drug- and disease-related resources.

IntegromeDB is a distributed system stored in a PostgreSQL database containing over 5,000 tables, 500 billion rows, and 50 TB of data. New content is acquired using a modified version of the SmartCrawler web crawler, and pages are indexed using Apache Lucene. The project was awarded two Gordon I/O nodes, the accompanying compute nodes, and 50 TB of space on Data Oasis. The compute nodes are used primarily for post-processing of raw data. Using the I/O nodes dramatically increased the speed of read/write file operations (10x) and I/O database operations (50x).

Source: Michael Baitaluk (UCSD) Used by permission 2013

Page 136: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Structural response of bone to stress (vSMP)

Source: Matthew Goff, Chris Hernandez (Cornell University) Used by permission. 2012

The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This has relevance for paleontologists seeking to infer habitual locomotion of ancient people and animals, and for treatment strategies for populations with fragile bones, such as the elderly.

• 5 million quadratic, 8-node elements

• Model created with a custom Matlab application that converts 253 micro-CT images into voxel-based finite element models

Page 137: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Managing HPC Systems at the

National and Campus Levels

Rick Wagner, Ph.D. candidate; HPC Systems Manager, SDSC

Page 138: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Systems

Gordon

Trestles

TSCC

Dev Data Oasis

Page 139: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Challenges
Enabling the unique & differentiating features of each system…

…without maintaining N different systems.

Gordon

Trestles

TSCC

Data Oasis

Dev

Technology

Policy

Business Model

Agnosticism

Isolation

Page 140: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Solutions

Part I:
• Common systems management: Rocks
• Shared staff responsibility across systems

Part II:
• Build on (cower behind) core SDSC services

Page 141: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

User Services

Page 142: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Operations

Page 143: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Storage

Page 144: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

VM Hosting

Page 145: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Security

Page 146: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Networking

Page 147: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

SDSC Sandbox

With support from:

Page 148: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Ilkay Altintas, Ph.D., Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD; Lab Director, Scientific Workflow Automation Technologies; [email protected]

SDSC’s Myriad Areas of Expertise

Page 149: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Visualization Services Group

Lead: Amit Chourasia
• Develops new ways to represent data visually
• Research collaborations in many science and engineering disciplines
• Visualization support and consulting
• Provides visualization education and training
• Website: http://www.sdsc.edu/us/visservices/

Page 150: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Ross Walker ([email protected]) Andreas Goetz ([email protected])

Technical Expertise
• GPU computing
• CUDA Teaching Center
• Parallel computing
• Workstation and cluster design for biomolecular simulations and computational drug discovery
• Cloud computing

Page 151: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Ross Walker ([email protected]) Andreas Goetz ([email protected])

Scientific Expertise
• Molecular dynamics
• Quantum chemistry
• Force field development, automatic parameter fitting
• Drug discovery
• Biomolecular simulations

Page 152: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

SDSC Spatial Information Systems Lab
• Services-based spatial information integration infrastructure
• Advanced online and high-performance GIS and geospatial databases
• Information interoperability in the geosciences
• Long-term spatial data preservation
• Information models and data standards (adopted by the federal government and internationally)
• Innovative user interfaces for connecting people, projects, resources…
• Large distributed data systems and catalogs (for scientific field observations from hydrology, critical zone, and others)
• Hydrologic Information System (largest in the world)

Example projects: Brain data integration, Ecosystem Services Dashboard, Katrina portal, Mexico Health Atlas, NSF EarthCube, CZO.

Contact: Ilya Zaslavsky ([email protected])

Page 153: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

High Performance Wireless Research and Education Network (HPWREN): http://hpwren.ucsd.edu/

An extension of the Area Situational Awareness for Public Safety Network (ASAPnet). Existing ~60 HPWREN/ASAPnet fire agency sites in June 2013 (from Google Earth KML object).

Project partners include:
• the County of San Diego
• the California Department of Forestry and Fire Protection (CAL FIRE)
• the United States Forest Service (USFS)
• San Diego Gas and Electric (SDG&E)
• San Diego State University (SDSU)

Page 154: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Mathematical anthropology

James Moody, Douglas R. White. Structural Cohesion and Embeddedness: A Hierarchical Concept of Social Groups. American Sociological Review 68(1):103-127. 2003

The identification of cohesive subgroups in large networks is of key importance to problems spanning the social and biological sciences. A k-cohesive subgroup has the property that it cannot be disconnected by removing fewer than k of its nodes. By Menger’s theorem, this is equivalent to a set of vertices in which every pair is joined by at least k vertex-independent paths.

Doug White (UCI) and his collaborators are using software developed using R and the igraph package to study social networks. The software was parallelized using the R multicore package and ported to Gordon’s vSMP nodes by SDSC computational scientist Robert Sinkovits. Analyses for large problems (2400 node Watts-Strogatz model) are achieving estimated speedups of 243x on 256 compute cores. Work is underway to identify cohesive subgroups in large co-authorship networks
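The definition can be checked directly on small graphs: the vertex connectivity of a graph is the smallest number of vertices whose removal disconnects it, and a group is k-cohesive when that number is at least k. A brute-force sketch, fine for toy graphs only (the R/igraph code run on Gordon uses far more scalable algorithms):

```python
from itertools import combinations

def _connected(adj, removed):
    # BFS on the graph with the `removed` vertices deleted.
    left = [v for v in adj if v not in removed]
    if not left:
        return True
    seen, stack = {left[0]}, [left[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(left)

def vertex_connectivity(adj):
    # Smallest k such that some k-vertex removal disconnects the graph.
    n = len(adj)
    for k in range(n):
        for removed in combinations(adj, k):
            if not _connected(adj, set(removed)):
                return k
    return n - 1   # complete graph: no removal disconnects it
```

For a 4-cycle this returns 2 (remove two opposite vertices), matching the intuition that a ring is 2-cohesive but not 3-cohesive.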

Page 155: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Impact of high-frequency trading
To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell stock at a specified price. Analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.

Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012

Symbol   Wall time (s), orig. code   Wall time (s), opt. code   Speedup
SWN      8400                        128                        66x
AMZN     55200                       437                        126x
AAPL     129914                      1145                       113x

Optimizations by SDSC computational scientists Robert Sinkovits and DongJu Choi to the original thread-parallel code resulted in greater than 100x speedups. It is now possible to analyze an entire day of NASDAQ activity in a few hours using 16 Gordon nodes. With these new capabilities, the team is beginning to consider analysis of options data, which has 100x greater memory requirements.

Run times for LOB construction of heavily traded NASDAQ securities (June 4, 2010)
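A limit order book replay boils down to three operations per feed message: add an order, cancel it, and query the best price on each side. A minimal hypothetical sketch (the optimized Gordon code is thread-parallel and far more involved):

```python
import heapq

class LimitOrderBook:
    def __init__(self):
        self.bids = []   # max-heap of bids (prices negated)
        self.asks = []   # min-heap of asks
        self.live = {}   # order_id -> (side, price, size)

    def add(self, order_id, side, price, size):
        self.live[order_id] = (side, price, size)
        if side == "B":
            heapq.heappush(self.bids, (-price, order_id))
        else:
            heapq.heappush(self.asks, (price, order_id))

    def cancel(self, order_id):
        # Lazy deletion: stale heap entries are skipped in best(),
        # which keeps cancellation O(1) for quote-stuffing bursts.
        self.live.pop(order_id, None)

    def best(self, side):
        heap = self.bids if side == "B" else self.asks
        while heap and heap[0][1] not in self.live:
            heapq.heappop(heap)
        if not heap:
            return None
        return -heap[0][0] if side == "B" else heap[0][0]
```

Lazy deletion is one plausible way to keep the add/cancel path cheap when most orders are cancelled almost immediately, as in the quote-stuffing pattern described above.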

Page 156: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

SDSC’s Education, Outreach and Training (EOT) Programs

Diane Baxter, Ph.D., Ange Mason, Jeff Sale

San Diego Supercomputer Center

University of California, San Diego

Page 157: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

SDSC EOT Program Challenges
• Prepare teachers to teach their students the skills and knowledge for a future in which technology means power and computational skills mean success
• Give students access to the computational tools, knowledge, and thinking skills to seek their dreams and create their future
• Train researchers at all levels to use HPC and data-intensive computing tools to accelerate discovery in science, engineering, technology, mathematics, and other data-related fields

Page 158: Download Presentation Slide Set

UNIVERSITY OF CALIFORNIA, SAN DIEGO

SAN DIEGO SUPERCOMPUTER CENTER

Ilkay Altintas

[email protected]

Thanks! & Questions…

Page 159: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

Industrial Engagement at SDSC
• Industrial Partners Program (IPP): “gateway” program; annual membership; large company, small company, and individual categories
• CLDS: focus on big data
• PACE: focus on predictive analytics
• Research contracts: specific defined projects
• Service agreements: for use of SDSC resources/services

Page 160: Download Presentation Slide Set

SAN DIEGO SUPERCOMPUTER CENTER

THANK YOU!