SAN DIEGO SUPERCOMPUTER CENTER
What Can SDSC Do For You? Michael L. Norman, Director
Distinguished Professor of Physics
Mission: Transforming Science and Society Through “Cyberinfrastructure”
“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.”
D. Atkins, NSF Office of Cyberinfrastructure
What Does SDSC Do?
Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps
• >300,000 times as fast as SDSC’s first supercomputer
• 1,000,000 times as much memory
Industrial Computing Platform: Triton
• Fast • Flexible • Economical • Responsive service
• Being upgraded now with faster CPUs and GPU nodes
First all 10Gig Multi-PB Storage System
High Performance Cloud Storage Analogous to AWS S3
• Data preservation and sharing • Low cost • High reliability • Web-accessible
Awesome connectivity to the outside world
• External links: 100G CENIC, 100G ESnet, 10G XSEDE, 20G UCSD RCI, commercial Internet, and room for your 10G link
• Two 384-port 10G switches connect SDSC to these networks and to external data collections (e.g., www.yourdatacollection.edu)
What Does SDSC Do?
Over 100 in-house researchers and technical staff
Core competencies
• Modeling & simulation • Parallel computing • Cloud computing • Energy-efficient computing • Advanced networking • Software development • Database systems • Data mining/BI tools • Data modeling & integration • Data management • Data processing workflows • Datacenter management
Application Domains
• Fluid dynamics • Structural engineering • Biomolecular simulation • Computational chemistry • Seismic modeling • Coastal hydrology • Geoinformatics • Neuroinformatics • Bioinformatics/genomics • Radiology • Smart energy grids • Medicare fraud detection
SDSC is at the nexus of the genomic medicine revolution
Wayne Pfeiffer
Assemble complex processing easily
Access transparently to diverse resources
Incorporate multiple software tools
Assure reproducibility
Community development model
bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data
MapReduce BLAST
Ilkay Altintas
Natasha Balac
Big Data Predictive Analytics for UCSD Smart Grid
Over 70,000 sensor streams from UCSD Smart Grid processed on Gordon
What Does SDSC Do?
Center for Large Scale Data Systems Research (CLDS)
Chaitan Baru
Jim Short
What Does SDSC Do?
HPWREN: A Unique Regional Capability for Public-Private Partnerships
SDSC Teaming with CALFIRE and SDG&E to Respond to and Prevent Wildfires
What Does SDSC Do?
What Can SDSC Do for You?
• Just about anything involving high-capability/capacity technical computing, data management, and networking
• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs
• There is a spectrum of ways we can interact
How do you begin working with us?
• You have already taken the first step by coming here today
• Join the IPP program to learn more about SDSC expertise and resources ▫ POC: Ron Hawkins ([email protected])
• Enjoy the rest of the program
SDSC Data Initiatives
Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-scale Data Systems Research (CLDS)
SDSC, UC San Diego
[email protected]
SDSC IPP Research Review, June 12, 2013
Outline
• SDSC and Data
• Center for Large-Scale Data Systems Research (CLDS)
• Graduate Student Engagement
• Data Science Education and Training
SDSC’s Data DNA
• 25+ year history as a supercomputer center focused on data
• Applied informatics is what we do ▫ At the intersection of science, data, and computational science ▫ Applied and applications-driven research and development
• Multidisciplinary projects and interdisciplinary collaborations are how we do it ▫ They are our strength and the secret sauce in our above-average success rate on highly competitive proposals
• Advancing the state of the art in science and improving the science research process is why we do it ▫ Lessons can be applied to business applications as well ▫ We believe many science applications are precursors to future business apps
Data: A rapidly evolving set of problems
• Analytics: real-time and historical trend analysis (data velocity and volume)
• Integration: more comprehensive, holistic analysis (data variety)
• Costs ▫ Hardware, energy, software, people
• Skill sets ▫ Need for “cross-trained,” data-savvy individuals ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments ▫ Break out of narrow academic silos / corporate roles and departments ▫ A real shortage
• Competition ▫ Global talent ▫ Increasingly, local problems
• Privacy
SDSC R&D Activities in Data
• Informatics collaborations in ▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and labs in ▫ Benchmarking ▫ Bioinformatics ▫ Computational science ▫ Data warehousing ▫ Data and info visualization ▫ Large graph and text data ▫ Machine learning ▫ Performance modeling ▫ Predictive analytics ▫ Scientific data management ▫ Spatial data management ▫ Workflow systems
• Centers of Excellence ▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director ▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director ▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director
CLDS: Center for Large-Scale Data Systems Research
• Focus: technical and technology management aspects related to big data
• Key initiatives ▫ Big Data Benchmarking ▫ Data Value and How Much Information?
• Principals: Chaitan Baru, James Short
Big Data Benchmarking – 1
• Community activity for development of a system-level big data benchmark, like TPC ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc ▫ [email protected]: biweekly phone meetings
• A proposed BigData Top100 List, bigdatatop100.org
• Two proposals under discussion ▫ BigBench: extending TPC-DS for big data ▫ Data Analytics Pipeline: end-to-end analysis of event stream data
• Discussions with TPC and SPEC
Big Data Benchmarking – 2
• Workshops on Big Data Benchmarking (WBDB)
▫ 1st WBDB: May 2012, San Jose
▫ 2nd WBDB: December 2012, Pune, India
▫ 3rd WBDB: July 2013, Xi’an, China
▫ 4th WBDB: October 2013, San Jose
Big Data Reference Datasets
• An initiative of the Cloud Security Alliance Big Data Working Group ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA)
• Objective ▫ Make reference datasets available on one or more platforms, for algorithm-level benchmarking
• Hosted by SDSC ▫ http://clds.sdsc.edu/bdbc/referencedata
NIST Big Data Working Group
• http://bigdatawg.nist.gov
• Co-chairs: Chaitan Baru; Robert Marcus, CTO, ET-Strategies; Wo Chang and Chris Greer, NIST
• Objective: one-year time frame ▫ Definitions ▫ Taxonomies ▫ Reference architectures ▫ Technology roadmap
• First meeting: June 19th
• Open to community
Current CLDS Programs
• Big Data Benchmarking (Pivotal lead)
• Project on Data Value (NetApp lead) ▫ Develop definitions, frameworks, assessment methodology, and tools for data value ▫ Proposed Workshop on Data Value, Jan-Feb 2014
• How Much Information 2013 (Seagate lead) ▫ Consumer information; enterprise information
• Data Science Institute (Brocade) ▫ SDSC-level program
CLDS Sponsorship
• Current sponsors ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon)
• Goals ▫ Small, focused group of core sponsors representing major, non-competitive industry quadrants (6-8 companies) ▫ Extended network of members who provide scale and scope and help fund industry events (20-30 companies)
• Sponsor structure ▫ Founding ($100K, multi-year) ▫ Program ($50K, annual)
• Member structure ▫ Continuing ($10K+, pay as you go) ▫ Member ($5K, per workshop event)
Big Data Benchmarking: How you can participate
• BDBC ▫ Join the BDBC mailing list: ~150 members, ~75 organizations ▫ Attend biweekly meetings, every other Thursday ▫ Present at biweekly meetings
• WBDB ▫ Submit papers to workshops; attend workshops
• Reference Datasets ▫ Participate in the Reference Datasets activity; contribute reference datasets
• NIST Big Data Working Group ▫ Join and contribute to the NIST Big Data Working Group
• Join CLDS as a sponsor
Data Value; How Much Information? How you can participate
• Data Value ▫ Join workshop planning and organization ▫ Contribute use cases ▫ Join CLDS as a sponsor
• HMI? ▫ Contribute use cases ▫ Join CLDS as a sponsor
SDSC Data Science Institute (DSI)
• Objective: provide training and education in data science
• Audience: industry attendees and academic researchers dealing with data
• Format ▫ Coverage of end-to-end issues in data science ▫ Emphasis on hands-on learning using short course formats, e.g., 1-day, 2-day, 1-week, and up to 1-month ▫ Inclusion of modules taught by industry ▫ At-Home and On-The-Road programs ▫ Possible internships associated with DSI
• First offering: SDSC Summer Institute, “Discovering Big Data,” Aug 5-12, 2013
DSI: How you can participate
• Naming opportunity ▫ The <Your Company Name Here> Data Science Institute
• Sign up for the SDSC Summer Institute ▫ $2K for 3 days or 5 days
• Sign up for future DSI offerings
• On-the-road program ▫ Work with us to create an on-the-road program for your company or your customers
• Contribute training modules ▫ Contribute modules based on your technology
• Provide your case studies for use in DSI
SDSC Graduate Projects Program
• SDSC projects for CSE MS graduate students ▫ Students work on projects with vendor hardware/software ▫ Upon successful completion, students receive internships at companies ▫ Companies have the option to hire students permanently
• A testbed/sandbox for big data / data science / computational science ▫ Currently a 32-node Hadoop cluster with Hortonworks HDP ▫ Plan to also install Intel Hadoop; would like to extend to 96 nodes ▫ Discussing a project with Brocade to test the performance of one of their Ethernet switches
• Two students just completed their MS projects (presentations today!) ▫ Joining Google and Zynga
Graduate Projects Program: How you can participate
• Create a project ▫ Announce a project and offer an internship after successful completion
• Contribute hardware/software for projects
• Contribute data with application scenarios / use cases
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Bioinformatics Meets Big Data
Wayne Pfeiffer SDSC/UCSD
June 12, 2013
Questions for today
• What is causing the flood of data in bioinformatics?
• How much data are we talking about?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational challenges?
Cost of DNA sequencing has dropped much faster than cost of computing in recent years,
producing the flood of data
Size matters: how much data are we talking about?
• 3.1 GB for the human genome ▫ Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000 ▫ 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to 1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage ▫ 300 GB for 40x coverage ▫ 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis ▫ 45 TB on disk at SDSC for the W115 project! (~10,000x a single genome) ▫ Multiple genomes per person! ▫ May only be looking for kB or MB in the end
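The 300 GB and 1 TB figures above follow directly from genome size, coverage, and the 2.5 B-per-base FASTQ overhead; a quick back-of-the-envelope check (an illustrative sketch, not an SDSC tool):

```python
GENOME_BASES = 3.1e9         # haploid human genome, ~3.1 Gbases
FASTQ_BYTES_PER_BASE = 2.5   # bases plus quality scores and headers

def fastq_input_size_gb(coverage):
    """Approximate FASTQ input size (GB) for whole-genome analysis at a given coverage."""
    return GENOME_BASES * coverage * FASTQ_BYTES_PER_BASE / 1e9

print(fastq_input_size_gb(40))   # ~310 GB, matching the "300 GB for 40x" estimate
print(fastq_input_size_gb(130))  # ~1008 GB, i.e., about 1 TB for 130x
```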
SDSC has a rich set of bioinformatics software; representative codes are listed here
• Pairwise sequence alignment • ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway) • ClustalW, MAFFT
• RNA-Seq analysis • GSNAP, Tophat
• De novo assembly • ABySS, Edena, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway) • BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light
• Tool kits • BEDTools, GATK, SAMtools
Many bioinformatics projects use SDSC supercomputers
• HuTS: Human Tumor Study (STSI/SDSC) • Find mutations in tumor, and select appropriate chemotherapy
• W115: Study of somatic mutations in genome of 115-year-old woman (VU Amsterdam, et al.) • Find somatic mutations in white blood cells
• MRSA (STSI/UCSD) • Characterize genomes of MRSA found in local hospitals
• Larry Smarr’s microbiome (UCSD) • Analyze Larry Smarr’s microbiome
• Various phylogenetics studies (via CIPRES gateway) • Calculate phylogenetic trees
Computational workflow for read mapping & variant calling
• Inputs: DNA reads in FASTQ format and a reference genome in FASTA format
• Read mapping, i.e., pairwise alignment (BFAST, BWA, …), produces alignment info in BAM format
• Variant calling (GATK, …) then identifies variants: SNPs, indels, others
• Goal: identify simple variants, e.g., ▫ single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs) ▫ short insertions & deletions (indels)

SNP example (read differs from the reference at one base):
CACCGGCGCAGTCATTCTCATAAT
||||||||||| ||||||||||||
CACCGGCGCAGACATTCTCATAAT

Indel example (read carries a short deletion relative to the reference):
CACCGGCGCAGTCATTCTCATAAT
||||||||||   |||||||||||
CACCGGCGCA---ATTCTCATAAT
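To make the goal concrete, here is a minimal sketch of the per-position comparison involved in SNP calling on a gapless alignment. The helper is hypothetical: real callers such as GATK also weigh base qualities, read depth, and mapping quality across many overlapping reads, which this sketch does not.

```python
def call_snps(reference, read, offset=0):
    """Report (position, ref_base, read_base) mismatches for a gapless alignment.

    `offset` is the 0-based position in the reference where the read maps.
    Illustrative only; production variant callers aggregate evidence
    across many reads rather than trusting a single alignment.
    """
    snps = []
    for i, (r, q) in enumerate(zip(reference[offset:], read)):
        if r != q:
            snps.append((offset + i, r, q))
    return snps

# The SNP example from the slide: T -> A at 0-based position 11
ref  = "CACCGGCGCAGTCATTCTCATAAT"
read = "CACCGGCGCAGACATTCTCATAAT"
print(call_snps(ref, read))  # [(11, 'T', 'A')]
```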
Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in the KRAS gene; this means that cetuximab is not effective for chemotherapy
BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI
The CIPRES gateway lets biologists run phylogenetics codes at SDSC via a browser interface;
http://www.phylo.org/index.php/portal
Computational challenges abound in bioinformatics
• Large amounts of data, which can grow substantially during analysis
• Complex workflows, often with different computational requirements along the way
• Parallelism that varies between steps in the workflow pipeline
• Large shared memory needed for some analyses
UNIVERSITY OF CALIFORNIA, SAN DIEGO
SAN DIEGO SUPERCOMPUTER CENTER
Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]
Distributed Workflow-Driven Analysis of Biological Big Data
So, what is a scientific workflow?
Scientific workflows emerged as an answer to the need to combine multiple cyberinfrastructure components in automated process networks.
The Big Picture is Supporting the Scientist
From “Napkin Drawings” to Executable Workflows: a conceptual SWF is refined into an executable SWF (e.g., Fasta File → Circonspect → Average Genome Size → PHACCS → Combine Results)
Scientific Workflow Automation Technologies @ SDSC • Housed in San Diego Supercomputer Center at UCSD since 2004
• Mission: Support CI projects, scientists and engineers for computational practices involving process management
• Research and development focus • Scientific workflow management
• Data and process provenance • Distributed execution using scientific workflows • Engineering and streaming workflows for environmental observatories • Fault tolerance in scientific workflows
• Sensor network management and monitoring • Role of scientific workflows in eScience infrastructures • Understanding collaborative work in workflow-driven eScience
• Scientific collaborations
• Bioinformatics, Environmental Observatories, Oceanography, Computational Chemistry, Fusion, Geoinformatics, …
Workflows are Used as Toolboxes in Biological Sciences
Data Acquisition/Generation → Data Analysis → Data Publication/Archival
Workflows foster collaborations!
• Flexibility and synergy • Optimization of resources • Increasing reuse • Standards compliance
Need expertise to identify which tool to use, when, and how! Require computation models to schedule and optimize execution!
– Assemble complex processing easily
– Access transparently to diverse resources
– Incorporate multiple software tools
– Assure reproducibility
– Community development model
Kepler is a Scientific Workflow System
• KEPLER = “Ptolemy II + X” for scientific workflows ▫ Ptolemy II: a laboratory for investigating design ▫ KEPLER: a problem-solving environment for scientific workflow
• An open collaboration … initiated August 2003
• Kepler 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
CAMERA Example: Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA)
http://camera.calit2.net
CAMERA is a Collaborative Environment
• Data Cart: multiple available mixed collections of CAMERA data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Group Workspace: share specified User Workspace data with collaborators
• Data Discovery: GIS and advanced query options
• Data Analysis: workflow-based analysis
Workflows are a Central Part of CAMERA
All can be reached through the CAMERA portal at: http://portal.camera.calit2.net
Inputs: from local or CAMERA file systems; user-supplied parameters
Outputs: sharable with a group of users and links to the semantic database
More than 1500 workflow submissions monthly!
Pushing the boundaries of existing infrastructure and workflow system capabilities!
• Increase reuse
• Increase programmability by end users
• Increase resource utilization
• Make analysis a part of the end-to-end scientific model from data generation to publication
Add to these large amounts of next-generation sequencing data!
bioKepler is a coordinated ecosystem of biological and technological packages!
• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
• Builds on Kepler and its provenance framework atop cyberinfrastructure platforms such as Bio-Linux, Galaxy, Hadoop, and Stratosphere
• Improvement on usability and programmability by end users!
www.bioKepler.org
www.kepler-project.org
What can we do for you?
• Training ▫ Workflow-driven data- and compute-intensive processes
• Consulting ▫ Designing, scaling, and tracking pipelines and workflows
• Development services ▫ Production workflows for you
Access to technology and biology packages: • Bio-Linux • Galaxy via Amazon • Amazon Cloud • Hadoop • Stratosphere
Partners: • Individual researchers and labs • Research projects, e.g., CAMERA • Academic institutions • Private labs, e.g., JCVI
JCVI selection criteria: • Programmability • Modularity • Customizability • Scalability
www.bioKepler.org
MapReduce BLAST
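The “MapReduce BLAST” example named above follows the classic map/reduce pattern: partition the query sequences, search each partition independently, then merge the hit lists. A language-level sketch of that pattern, with a toy exact-substring matcher standing in for BLAST (which a bioKepler workflow would invoke instead):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def search_chunk(queries, subject):
    """Map step: toy stand-in for BLAST on one chunk of query sequences.
    Reports (query, offset) for exact substring hits in the subject sequence."""
    return [(q, subject.find(q)) for q in queries if q in subject]

def mapreduce_search(queries, subject, n_chunks=4):
    """Split queries into chunks, search each chunk independently, merge hits."""
    chunks = [queries[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = pool.map(search_chunk, chunks, [subject] * n_chunks)
    return sorted(chain.from_iterable(results))  # reduce step: merge and order

subject = "CACCGGCGCAGTCATTCTCATAAT"
hits = mapreduce_search(["GGCG", "TTCT", "AAAA"], subject)
print(hits)  # [('GGCG', 4), ('TTCT', 14)] -- 'AAAA' does not occur
```

In a real deployment the map step would shell out to `blastn` against a shared database and the reduce step would concatenate tabular hit files; the partition/merge structure is the same.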
Ilkay Altintas
Thanks! & Questions…
How to download Kepler?
https://kepler-project.org/users/downloads Please start with the short Getting Started Guide: https://kepler-project.org/users/documentation
Building a Semantic Information Infrastructure
Research Review: SDSC Industrial Partnership Program
Amarnath Gupta
Advanced Query Processing Lab
San Diego Supercomputer Center, University of California, San Diego
Lots of Data, Little Glue
A Science Enterprise: • Lab resources • Experiment design • Reagent catalogs • Experimental data • Chains of derived data • Analysis results (incl. outputs of software tools) • External information • Publications and presentations • …
An Industrial Enterprise: • Customer data • Product specifications • Product sales data • Production process data • Customer call records • Internal memos • Legal documents • Emails • Social intranet • …
What binds them together?
Wide variation in data types, models, volume, systems, usage, updates, …
Some Consequences of Not Having Glue
• An orphan-disease researcher ▫ Spends five months to determine that her disease of interest relates to the mouse version of a gene studied in breast cancer
• A health institution ▫ Takes over a year to integrate clinical trial management data, drug characteristics, and patient reports to get a 360-degree view of drug effects
• A reagent vendor ▫ Spends significant time and money to determine the effectiveness of a product line
• A legal department ▫ Takes much longer to discover relevant documents and data for a complex, multi-party litigation
• A utilities company ▫ Cannot easily perform synchrophasor analytics for grid-behavior understanding because SCADA, EMS, and PMU data cannot be integrated
Integration through Semantics
We have been building NIF, a Semantic Information System for Neuroscience.
• In NIF ▫ Data can be relational, XML, RDF, OWL, wiki content, manuscripts, publications, blogs, multimedia, annotations, … ▫ Any new domain goes through semantic processing ▫ Every piece of data gets some semantic markup ▫ Data are integrated and searched using ontological indices ▫ Any search or discovery process is interpreted through an ontology ▫ Any workflow that performs a mining-style computation utilizes semantic properties
• What NIF is designed to answer: some unanticipated questions ▫ Are NIH studies gender-biased? Which? ▫ Is the CRE mouse line being used by anyone? ▫ Which diseases have data but have been underfunded? ▫ Is the method section in this paper under-specified?
Technology Research
• Semantic Prospecting for any industry and application domain
• Semi-automatic Construction of Semantic Models for any problem domain
• Developing general-purpose, yet domain-specific Semantic Search Engines for heterogeneous data
• Complex Information Discovery using semantic graph analysis and mining-style computation
• Multi-domain Information Integration using semantic information bridges
Why SDSC? • Semantic processing can be complex and costly
• Domain model construction (incl. active learning) • Semantic indexing and correlation indexing • Graph query processing • Semantic search algorithms • Landscape and discovery analytics
• The SDSC infrastructure handles complexity at scale • Configurable compute nodes • Large-Memory Systems • SSD drives
Let SDSC help your infrastructure, research and service needs
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD
Michael Baitaluk, PhD
San Diego Supercomputer Center
Data types: taxonomies, sequences, publications, structures, variations, expression data, networks, annotations, biochemical data, epigenetic data
These data live in 2,000+ databases and in web pages. How can a researcher embrace this amount of data in its entirety?
Data Integration Resources
Existing data integration resources cover only part of these sources. This leaves a researcher to work with partial, incomplete data sets!
Data warehouse (0.1 PB) integrating the 2,000+ databases and web pages, built on biological ontologies and Semantic Web technologies
Other web pages Database web pages
Automatically extract data and map them into the internal database schema. For each ontological term A and page X, calculate the relevance score of X to A.
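The slides do not specify the scoring function, so as a purely hypothetical illustration of relating a page X to an ontological term A, here is a normalized term-frequency sketch (IntegromeDB's actual scoring is not described here):

```python
import re

def relevance(page_text, term, synonyms=()):
    """Hypothetical relevance of page X to ontological term A:
    the fraction of words on the page matching the term or its ontology
    synonyms. Illustrative only; not the IntegromeDB algorithm."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    vocab = {term.lower(), *(s.lower() for s in synonyms)}
    hits = sum(1 for w in words if w in vocab)
    return hits / len(words)

score = relevance("TP53 regulates apoptosis; p53 mutations are common in tumors.",
                  "TP53", synonyms=("p53",))
print(score)  # 2 of 9 words match the term or a synonym
```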
Architecture: public data on the web and users’ private data feed IntegromeDB, which serves the user community through the integromeDB.org web portal & API and the BiologicalNetworks.org Java application.
integromedb.org
Integromedb.org visit statistics
The same architecture can serve your user community and your database: public data on the web and users’ private data, exposed through your own web portal, APIs, and applications.
Your Google-like “Search-in-the-Box” Appliance
PACE: Predictive Analytics and Data Mining Research and Applications
Natasha Balac, Ph.D.
Director, PACE
Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD
PACE: Closing the gap between government, industry, and academia
PACE is a non-profit, public educational organization:
o To promote, educate, and innovate in the area of predictive analytics
o To leverage predictive analytics to improve the education and well-being of the global population and economy
o To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics
PACE focus areas: inform, educate, and train; develop standards and methodology; high-performance scalable data mining; foster research and collaboration; a data mining repository of very large data sets; provide predictive analytics services; bridge the industry-academia gap
Foster Research and Collaboration
• Fraud detection
• Modeling user behaviors
• Smart grid analytics
• Solar-powered system modeling
• Microgrid anomaly detection
• Distributed energy generation
• Manufacturing
• Sport analytics
• Genomics
UCSD Smart Grid
• UCSD Smart Grid sensor network data set ▫ 45 MW peak microgrid; daily population of over 54,000 people ▫ Self-generates 92% of its own annual electricity load
• Smart Grid data: over 100,000 measurements/sec ▫ Sensor and environmental/weather data
• Large amount of multivariate and heterogeneous data streaming from complex sensor networks
• Predictive analytics throughout the microgrid
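At the quoted rate, the raw stream volume is easy to bound. A quick sketch; the 16 bytes per timestamped measurement is an assumed record size for illustration, not a figure from the slide:

```python
def daily_volume_gb(measurements_per_sec, bytes_per_measurement):
    """Measurements per day and GB per day for a continuous sensor stream."""
    per_day = measurements_per_sec * 86_400  # seconds per day
    return per_day, per_day * bytes_per_measurement / 1e9

# 100,000 measurements/sec from the slide; 16 B/record is a hypothetical size
per_day, gb = daily_volume_gb(100_000, 16)
print(per_day)            # 8.64e9 measurements/day
print(f"{gb:.1f} GB/day") # 138.2 GB/day under the assumed record size
```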
4 V’s of Big Data
IBM, 2012
What to do with big data? Eric Sall (IBM)’s list
• Big data exploration ▫ To get an overall understanding of what is there
• 360-degree view of the customer ▫ Combine both internally available and external information to gain a deeper understanding of the customer
• Monitoring cyber-security and fraud in real time
• Operational analysis ▫ Leveraging machine-generated data to improve business effectiveness
• Data warehouse augmentation ▫ Enhancing warehouse solutions with new information models and architecture
Big Data – Big Training
• “Data Scientist” ▫ The “hot new gig in town” (O’Reilly report)
• “Data Scientist: The Sexiest Job of the 21st Century” ▫ Harvard Business Review, October 2012 ▫ The future belongs to the companies and people that turn data into products
• Article in Fortune ▫ “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”
Data Science Job Growth
By 2018, a projected US shortage of 140,000-190,000 predictive analysts and 1.5 million managers/analysts
PACE Education
• Data mining boot camps ▫ Boot Camp 1: September 12-13, 2013 ▫ Boot Camp 2: October 17-18, 2013 ▫ On-site personalized boot camp (10-15; 20-30)
• Tech Talks: every 3rd Wednesday
• Workshops, webinars
• Interesting Reads, “Tool-off”
• “Bring your own data”
Predictive Analytics Consulting
• Full-service consulting and development services
• Targeted projects with industry and agency partners
• Applied and applications-oriented research
• Technical expertise and industry experience
Mike Gualtieri's blog
Questions?
• http://pace.sdsc.edu/
• For further information, contact Natasha Balac [email protected]
PMaC Performance Modeling and Characterization
Performance, Modeling, and Characterization
(PMaC) Lab
Laura Carrington, Ph.D.
PMaC Lab Director
University of California, San Diego
San Diego Supercomputer Center
The PMaC Lab
Mission Statement: Research the complex interactions between HPC systems and applications
to predict and understand the factors that affect performance and power on current and projected
HPC platforms.
• Develop tools and techniques to deconstruct HPC systems and HPC applications to provide detailed characterizations of their power and performance.
The PMaC Lab utilizes these characterizations to construct power and performance models that guide:
– Improvement of application performance (2 Gordon Bell finalists, DoD HPCMP applications, NSF Blue Waters, etc.)
– System procurement, acceptance, and installation (DoD HPCMP procurement team, DoE upgrade of ORNL Jaguar, installation of NAVO PWR6)
– Accelerator assessment for a workload (performance assessment and prediction for GPUs & FPGAs)
– Hardware customization / hardware-software co-design (performance and power analysis for exascale)
– Improvement of energy efficiency and resiliency (Green Queue project, DoE SUPER Institute, DoE BSM, etc.)
Automated tools & techniques to characterize HPC systems & applications
HPC System: characterize the computational (& communication) patterns that affect the overall power draw.
HPC Application: characterize the computational (& communication) behavior of the application.
[Diagram: the application decomposed into regions such as Loop #1, Func. Foo, Loop #2, and Loop #3 for per-region characterization]
Design software- and hardware-aware energy and performance optimization techniques
Hardware Customization
• 10x10 project (PI: A. Chien, U. of Chicago): heterogeneous processor architecture
• Reconfigurable memory hierarchy: L1, L2, and L3
• Selection of energy-optimal configuration via simulation and reuse distance analysis

Space name   Configurations in search space   Unique configurations selected   Avg. energy savings (%)
Full         2652                             33                               68.8
2N           469                              26                               66.0
Restricted   224                              23                               65.5
Cluster      10                               10                               63.7
Analysis of 37 workloads
~70% energy savings
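The reuse-distance analysis mentioned above can be sketched in a few lines. This is a hypothetical illustration, not PMaC's actual tooling; the function name and trace are invented for the example:

```python
def reuse_distances(trace):
    """For each access, the number of distinct addresses touched
    since the previous access to the same address (None = first use)."""
    last_seen = {}          # address -> index of most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses accessed strictly between the two uses
            between = set(trace[last_seen[addr] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances

# A fully associative LRU cache with capacity C hits exactly when the
# reuse distance is < C, so the distance histogram predicts hit rates
# for every candidate cache configuration from one trace.
trace = ["A", "B", "C", "A", "B", "B"]
print(reuse_distances(trace))  # [None, None, None, 2, 2, 0]
```

This is why reuse distance is attractive for configuration search: one pass over the application's trace evaluates many memory-hierarchy configurations without re-simulating each one.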
Goal: Use machine and application characterization to make application-aware energy optimizations
during execution
PMaC’s Green Queue Framework (optimizing for performance & power)
[Figure: power (kW) vs. time (seconds) for a 1,024-core run on Gordon, with per-phase energy savings annotations ranging from 4.8% to 32%]
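The intuition behind application-aware frequency selection can be sketched with a toy energy model: during a memory-bound phase, lowering the clock barely slows the run, so energy drops. All numbers here (the power curve, phase fractions, frequencies) are illustrative assumptions, not Gordon measurements or Green Queue's actual model:

```python
def phase_energy(freq_ghz, base_ghz, compute_frac, base_time_s, power_w):
    """Toy model: only the compute-bound fraction of a phase slows
    down when the clock drops; memory-bound time is unchanged."""
    time_s = base_time_s * (compute_frac * base_ghz / freq_ghz
                            + (1.0 - compute_frac))
    return power_w(freq_ghz) * time_s

# Illustrative power curve: static power plus frequency-dependent term
power_w = lambda f: 80.0 + 15.0 * f ** 2

def best_freq(freqs, base_ghz, compute_frac, base_time_s):
    return min(freqs, key=lambda f: phase_energy(f, base_ghz, compute_frac,
                                                 base_time_s, power_w))

freqs = [1.2, 1.6, 2.0, 2.6]
# Memory-bound phase (20% compute): the lowest clock wins
print(best_freq(freqs, 2.6, 0.2, 10.0))   # 1.2
# Compute-bound phase (95% compute): a higher clock wins
print(best_freq(freqs, 2.6, 0.95, 10.0))  # 2.0
```

With these toy numbers the memory-bound phase picks 1.2 GHz while the compute-bound phase picks 2.0 GHz, matching the slide's premise that the right setting depends on the application's per-phase behavior.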
Questions
Benchmarking and Tuning Big Data Software
David Nadeau, Ph.D.
What to tune? • Application benchmarking/profiling finds what to
tune for one system
• But... we want answers for many systems for now and (hopefully) years to come
• Requires understanding fundamental trends • Processor speeds, core counts, memory bandwidth,
memory latency, network bandwidth, etc.
A few trends
Clock speeds almost flat Cycles/math op almost flat
Raw math ability/core is not improving.
A few trends
Core count/CPU way up SPEC float/CPU way up
Float math/core (thread) up 8 to 20%/year Why is this trend upwards if math speed is flat?
A few trends
DDR bandwidth up 15%/year
Memory bandwidth/core up 0 to 12%/year
Improvement is primarily from better memory bandwidth, not math ability.
What this means
• SPEC, Dhrystone, etc. are now dominated by memory performance, not math:
  • 1 multiply = 1 cycle
  • 1 memory access = ~300 cycles
  • Not likely to change soon
• Application performance is dominated by memory access costs.
• So tune access patterns or data order.
Example: 3D volume Task: Store 3D volume in memory. Array of arrays of arrays, one big array, etc.
Worst (blue): array of array of arrays. Best (black): one big array with simple 3D indexing.
Fewer memory references is much faster.
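The two layouts can be sketched directly. The point of the "one big array" scheme is that element (x, y, z) is found with one index computation and one memory reference, where the nested layout needs three dependent references (pointer chases). N and the fill values are invented for the illustration:

```python
N = 4  # small hypothetical volume edge length

# Worst case from the slide: array of arrays of arrays, v[z][y][x]
nested = [[[z * 100 + y * 10 + x for x in range(N)]
           for y in range(N)] for z in range(N)]

# Best case: one big array. Element (x, y, z) lives at index
# (z*N + y)*N + x, so an x-order walk is a contiguous scan.
flat = [nested[z][y][x] for z in range(N) for y in range(N) for x in range(N)]

def at(v, x, y, z):
    return v[(z * N + y) * N + x]   # one reference, no pointer chasing

# The flat layout is a faithful relabeling of the nested one
assert all(at(flat, x, y, z) == nested[z][y][x]
           for x in range(N) for y in range(N) for z in range(N))
print(at(flat, 1, 2, 3))  # 321
```

In C or Fortran the difference shows up directly as cache misses; in an interpreted language the structural point (one reference vs. three) is the same even though interpreter overhead masks the timing gap.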
Example: 3D volume sweep
Task: Sweep a plane through the volume along any of 6 axis directions: +X, -X, +Y, -Y, +Z, -Z.
4 sweep directions are slow; 2 are 10x to 30x faster.
Sweeping in natural data order is much faster.
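The asymmetry follows from the strides each sweep axis implies in a flat, x-fastest layout: consecutive X-sweep accesses are adjacent in memory, while Z-sweep accesses are a whole plane apart. N is a hypothetical edge length:

```python
N = 256  # hypothetical volume edge length, elements stored x-fastest

# Address of (x, y, z) in the flat layout
def index(x, y, z):
    return (z * N + y) * N + x

# Stride (in elements) between consecutive accesses along each axis
stride_x = index(1, 0, 0) - index(0, 0, 0)   # innermost axis: contiguous
stride_y = index(0, 1, 0) - index(0, 0, 0)   # one row apart
stride_z = index(0, 0, 1) - index(0, 0, 0)   # one full plane apart

print(stride_x, stride_y, stride_z)  # 1 256 65536
```

A stride-1 sweep reuses every cache line it loads; a stride-65536 sweep touches a new line (and likely a new page) on every access, which is where the 10x to 30x gap comes from.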
Example: 3D volume bricking
Task: Sweep in any direction. Brick the volume: a cube of cubes (unbricked, 2x2x2, 4x4x4, 8x8x8, 16x16x16).
Bricking makes all sweep directions have similar performance.
Bricked data order is more cache friendly.
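A bricked index can be computed as follows: split each coordinate into a brick number and an offset inside the brick, so every B³ sub-cube is stored contiguously. B and N are illustrative; this is a sketch of the idea, not the talk's actual code:

```python
B = 4    # brick edge (the slide tried 2, 4, 8, 16)
N = 16   # volume edge, assumed to be a multiple of B
BRICKS = N // B

def brick_index(x, y, z):
    """Flat index when the volume is stored as a cube of B^3 bricks:
    bricks ordered z, y, x; elements inside each brick likewise."""
    bx, by, bz = x // B, y // B, z // B          # which brick
    ox, oy, oz = x % B, y % B, z % B             # offset inside it
    brick = (bz * BRICKS + by) * BRICKS + bx
    return brick * B ** 3 + (oz * B + oy) * B + ox

# Sanity check: the mapping is a valid permutation of the volume
idx = {brick_index(x, y, z)
       for x in range(N) for y in range(N) for z in range(N)}
assert idx == set(range(N ** 3))

# A +Z neighbour is now only B*B elements away instead of N*N,
# so all six sweep directions see similar cache behaviour.
print(brick_index(0, 0, 1) - brick_index(0, 0, 0))  # 16, not N*N = 256
```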
Example: Desktop compression
Task: Real-time compress & send the desktop; many codecs tried.
Clever codecs are slower, despite producing smaller result data.
Codecs with fewer memory references are faster.
Example: Parallel compositing Task: Composite N images on cluster of N nodes.
Many algorithms.
Many small messages vs. few big messages: the same amount of data moves either way.
Better network use is faster.
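The "few big messages" result is what the classic alpha-beta communication model predicts: each message pays a fixed latency (alpha) plus a bandwidth term (beta), so batching amortizes the latency. The latency and bandwidth numbers below are invented for illustration, not measurements from the compositing study:

```python
def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bps):
    """Alpha-beta cost model: per-message latency plus bytes/bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bps

total = 64 * 1024 ** 2        # 64 MB of image data (same in both cases)
latency = 20e-6               # 20 us per message (illustrative)
bw = 1e9                      # 1 GB/s link (illustrative)

many_small = transfer_time(65536, total, latency, bw)  # 1 KB messages
few_big = transfer_time(16, total, latency, bw)        # 4 MB messages

print(f"many small: {many_small:.3f}s  few big: {few_big:.3f}s")
```

With these numbers the small-message schedule spends over a second in per-message latency alone, while the big-message schedule is dominated by the (unavoidable) bandwidth term.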
And so on…
Gordon: A First-of-its-Kind Data-Intensive Supercomputer
SDSC Research Review June 5, 2013
Shawn Strande Gordon Project Manager
Gordon is a highly flexible system for exploring a wide range of data-intensive technologies and applications:
• High performance flash technology
• High speed InfiniBand interconnect
• On-demand Hadoop and data-intensive environments
• Massively large memory environments
• High performance parallel file system
• Scientific databases
• Complex application architectures
• New algorithms and optimizations
Gordon is a data movement machine
Sandy Bridge Compute Nodes (1,024) • 64 TB memory • 341 Tflop/s
Flash based I/O Nodes (64) • 300 TB Intel eMLC flash • 35M IOPS
Large Memory Nodes • vSMP Foundation 5.0 • 2 TB of cache-coherent
memory per node
“Data Oasis” Lustre PFS 100 GB/sec, 4 PB
Dual-rail, 3D Torus Interconnect • 7GB/s
SSD latencies are 2 orders of magnitude lower than HDDs' (a big deal for some data-intensive applications):
• Typical hard drive: ~10 ms (0.010 s), 200 IOPS
• Solid state disk: ~100 µs (0.0001 s), 35,000 / 3,000 IOPS (read/write)
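The latency figures above directly bound what one request stream can achieve: with a single outstanding request, IOPS cannot exceed 1/latency. A minimal sketch using the slide's latency numbers (the single-queue assumption is ours, for illustration):

```python
def max_iops_single_queue(latency_s):
    """Upper bound on IOPS with one outstanding request at a time."""
    return 1.0 / latency_s

hdd_latency = 0.010    # ~10 ms (from the slide)
ssd_latency = 0.0001   # ~100 us (from the slide)

print(max_iops_single_queue(hdd_latency))  # 100.0
print(max_iops_single_queue(ssd_latency))  # 10000.0
```

The slide's measured figures (200 HDD IOPS, 35,000 SSD read IOPS) exceed these naive bounds because real controllers keep many requests in flight; the two-orders-of-magnitude gap between the device classes is what carries over.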
Protein Data Bank (flash-based I/O node) The RCSB Protein Data Bank (PDB) is the leading primary database that provides access to the experimentally determined structures of proteins, nucleic acids and complex assemblies. In order to allow users to quickly identify more distant 3D relationships, the PDB provides a pre-calculated set of all possible pairwise 3D protein structure alignments.
Although the pairwise structure comparisons are computationally intensive, the bottleneck is the centralized server that is responsible for assigning work, collecting results, and updating the MySQL database. Using a dedicated Gordon I/O node and the associated 16 compute nodes, the work could be accomplished 4-6x faster than using the Open Science Grid (OSG).

Configuration     Time for 15M alignments   Speedup
Reference (OSG)   24 hours                  1
Lyndonville       6.3 hours                 3.8
Taylorsville      4.1 hours                 5.8
OpenTopography Facility (flash-based I/O node) The NSF funded OpenTopography Facility provides online access to Earth science-oriented high-resolution LIDAR topography data along with online processing tools and derivative products. Point cloud data are processed to produce digital elevation models (DEMs) - 3D representations of the landscape.
High-resolution bare earth DEM of San Andreas fault south of San Francisco, generated using OpenTopography LIDAR processing tools Source: C. Crosby, UNAVCO
Illustration of local binning geometry: dots are LIDAR shots; '+' indicates locations of DEM nodes at which elevation is estimated.
Dataset: Lake Tahoe, 208 million LIDAR returns, 0.2-m grid resolution and 0.2-m search radius.

Concurrent jobs   OT servers   Gordon ION   Speedup
1                 3297 sec     1102 sec     3x
4                 29607 sec    1449 sec     20x

The local binning algorithm uses the elevation information from only the points inside a circular search area with a user-specified radius. An out-of-core (memory) version of the algorithm exploits secondary storage for saving intermediate results when the size of a grid exceeds that of memory. Using a dedicated Gordon I/O node with fast SSDs reduces run times of massive concurrent out-of-core processing jobs by a factor of 20x.
IntegromeDB (flash-based I/O node) IntegromeDB is a large-scale data integration system and biomedical search engine. It collects and organizes heterogeneous data from over a thousand databases covered by Nucleic Acids Research and millions of public biomedical, biochemical, drug- and disease-related resources.
IntegromeDB is a distributed system stored in a PostgreSQL database containing over 5,000 tables, 500 billion rows, and 50 TB of data. New content is acquired using a modified version of the SmartCrawler web crawler, and pages are indexed using Apache Lucene. The project was awarded two Gordon I/O nodes, the accompanying compute nodes, and 50 TB of space on Data Oasis. The compute nodes are used primarily for post-processing of raw data. Using the I/O nodes dramatically increased the speed of read/write file operations (10x) and I/O database operations (50x).
Source: Michael Baitaluk (UCSD) Used by permission 2013
Structural response of bone to stress (vSMP)
Source: Matthew Goff, Chris Hernandez (Cornell University) Used by permission. 2012
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This has relevance for paleontologists inferring the habitual locomotion of ancient people and animals, and in treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom MATLAB application that converts 253 micro-CT images into voxel-based finite element models
Managing HPC Systems at the
National and Campus Levels
Rick Wagner, Ph.D. Candidate
HPC Systems Manager, SDSC
Systems
Gordon
Trestles
TSCC
Dev Data Oasis
Challenges
Enabling unique & differentiating features of each system…
…without maintaining N different systems.
[Diagram: Gordon, Trestles, TSCC, Data Oasis, and Dev, kept distinct across technology, policy, and business model through agnosticism and isolation]
Solutions
Part I:
• Common systems management (Rocks)
• Shared staff responsibility across systems
Part II:
• Build on (cower behind) core SDSC services
Core SDSC services:
• User Services
• Operations
• Storage
• VM Hosting
• Security
• Networking
• SDSC Sandbox
With support from:
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Ilkay ALTINTAS, Ph.D. Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD Lab Director, Scientific Workflow Automation Technologies [email protected]
SDSC’s Myriad Areas of Expertise
Visualization Services Group
Lead: Amit Chourasia
• Develops new ways to represent data visually
• Research collaborations in many science and engineering disciplines
• Visualization support and consulting
• Provides visualization education and training
• Website: http://www.sdsc.edu/us/visservices/
Ross Walker ([email protected]) Andreas Goetz ([email protected])
Technical Expertise • GPU Computing • CUDA Teaching Center • Parallel computing • Workstation and Cluster Design for
Biomolecular Simulations and Computational Drug Discovery
• Cloud computing
Ross Walker ([email protected]) Andreas Goetz ([email protected])
Scientific Expertise • Molecular Dynamics • Quantum Chemistry • Force Field Development,
Automatic Parameter Fitting • Drug Discovery • Biomolecular Simulations
SDSC Spatial Information Systems Lab • Services-based spatial information
integration infrastructure • Advanced online and high performance GIS
and Geospatial Databases • Information Interoperability in the
geosciences • Long-term spatial data preservation • Information models and data standards
(adopted by federal government and internationally)
• Innovative user interfaces for connecting people, projects, resources…
• Large distributed data systems and catalogs (for scientific field observations from hydrology, critical zone and others)
Example projects: Hydrologic Information System (largest in the world), brain data integration, Ecosystem Services Dashboard, Katrina portal, Mexico Health Atlas, NSF EarthCube, CZO
Contact: Ilya Zaslavsky ([email protected])
High Performance Wireless Research and Education Network (HPWREN)
http://hpwren.ucsd.edu/
An extension of the Area Situational Awareness for Public Safety Network (ASAPnet); ~60 HPWREN/ASAPnet fire agency sites existed in June 2013 (from a Google Earth KML object). Project partners include:
• the County of San Diego
• the California Department of Forestry and Fire Protection (CAL FIRE)
• the United States Forest Service (USFS)
• San Diego Gas and Electric (SDG&E)
• San Diego State University (SDSU)
Mathematical anthropology
James Moody, Douglas R. White. Structural Cohesion and Embeddedness: A Hierarchical Concept of Social Groups. American Sociological Review 68(1):103-127. 2003.
The identification of cohesive subgroups in large networks is of key importance to problems spanning the social and biological sciences. A k-cohesive subgroup has the property that it cannot be disconnected by the removal of fewer than k of its nodes. This has been shown to be equivalent to a set of vertices in which every pair of members is joined by at least k vertex-independent paths (Menger's theorem).
Doug White (UCI) and his collaborators are using software developed in R with the igraph package to study social networks. The software was parallelized using the R multicore package and ported to Gordon's vSMP nodes by SDSC computational scientist Robert Sinkovits. Analyses for large problems (a 2,400-node Watts-Strogatz model) are achieving estimated speedups of 243x on 256 compute cores. Work is underway to identify cohesive subgroups in large co-authorship networks.
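The k-cohesion measure is exactly the graph's vertex connectivity, which Menger's theorem ties to vertex-independent paths. A small sketch using NetworkX (the project itself uses R and igraph; this is an illustration, not its code):

```python
import networkx as nx

# A 5-node complete graph is 4-cohesive: every pair of members is
# joined by 4 vertex-independent paths, and no removal of 3 nodes
# can disconnect the group.
K5 = nx.complete_graph(5)
print(nx.node_connectivity(K5))  # 4

# A cycle is only 2-cohesive: removing 2 well-chosen nodes splits it.
C8 = nx.cycle_graph(8)
print(nx.node_connectivity(C8))  # 2
```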
Impact of high-frequency trading. To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell stock at a specified price. Analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012
Symbol   Wall time (s), orig. code   Wall time (s), opt. code   Speedup
SWN      8400                        128                        66x
AMZN     55200                       437                        126x
AAPL     129914                      1145                       113x
Optimizations by SDSC computational scientists Robert Sinkovits and DongJu Choi to the original thread-parallel code resulted in greater than 100x speedups. It is now possible to analyze an entire day of NASDAQ activity in a few hours using 16 Gordon nodes. With these new capabilities, the team is beginning to consider analysis of options data, which has 100x greater memory requirements.
Run times for LOB construction of heavily traded NASDAQ securities (June 4, 2010)
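A limit order book can be sketched as a map from price levels to resting shares, replaying add and cancel messages; quote stuffing is precisely a flood of add/cancel pairs. This toy price-level book is far simpler than the nanosecond-resolution reconstruction described above, and all names and orders are invented:

```python
import heapq
from collections import defaultdict

class LimitOrderBook:
    """Toy price-level book tracking resting shares per (side, price)
    and the best bid/ask. Real reconstruction replays every message
    in the exchange feed at nanosecond resolution."""
    def __init__(self):
        self.depth = defaultdict(int)       # (side, price) -> shares
        self.bids, self.asks = [], []       # heaps of candidate prices

    def add(self, side, price, shares):
        self.depth[(side, price)] += shares
        if side == "B":
            heapq.heappush(self.bids, -price)   # max-heap via negation
        else:
            heapq.heappush(self.asks, price)

    def cancel(self, side, price, shares):
        self.depth[(side, price)] -= shares

    def best(self, side):
        heap = self.bids if side == "B" else self.asks
        while heap:
            price = -heap[0] if side == "B" else heap[0]
            if self.depth[(side, price)] > 0:
                return price
            heapq.heappop(heap)             # lazily drop empty levels
        return None

book = LimitOrderBook()
book.add("B", 100, 500)
book.add("B", 101, 200)     # a quote...
book.cancel("B", 101, 200)  # ...cancelled immediately (quote stuffing)
book.add("S", 103, 300)
print(book.best("B"), book.best("S"))  # 100 103
```

The lazy deletion in `best` hints at why stuffing is costly for everyone downstream: cancelled levels still flow through the feed and the book's data structures even though they never trade.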
SDSC’s Education, Outreach and Training (EOT) Programs
Diane Baxter, Ph.D., Ange Mason, Jeff Sale
San Diego Supercomputer Center
University of California, San Diego
SDSC EOT Program Challenges
• Prepare teachers to teach their students the skills and knowledge for a future in which technology = power and computational skills = success
• Give students access to the computational tools, knowledge, and thinking skills to seek their dreams and create their future
• Train researchers at all levels to use HPC and data-intensive computing tools to accelerate discovery in science, engineering, technology, mathematics, and other data-related fields
Ilkay Altintas
Thanks! & Questions…
Industrial Engagement at SDSC
• Industrial Partners Program (IPP): the “gateway” program; annual membership; large company, small company, and individual categories
• CLDS: focus on big data
• PACE: focus on predictive analytics
• Research contracts: specific project defined
• Service agreements: for use of SDSC resources/services
THANK YOU!