SAN DIEGO SUPERCOMPUTER CENTER
What Can SDSC Do For You? Michael L. Norman, Director
Distinguished Professor of Physics
Mission: Transforming Science and Society Through “Cyberinfrastructure”
“The comprehensive infrastructure needed to capitalize on dramatic advances in information technology has been termed cyberinfrastructure.”
D. Atkins, NSF Office of Cyberinfrastructure
What Does SDSC Do?
Gordon – World’s First Flash-based Supercomputer for Data-intensive Apps
• >300,000 times as fast as SDSC’s first supercomputer
• 1,000,000 times as much memory
Industrial Computing Platform: Triton
• Fast • Flexible • Economical • Responsive service
• Being upgraded now with faster CPUs and GPU nodes
First all 10Gig Multi-PB Storage System
High Performance Cloud Storage Analogous to AWS S3
• Data preservation and sharing • Low cost • High reliability • Web-accessible
Awesome connectivity to the outside world
• External links: 100G CENIC, 100G ESnet, 10G XSEDE, 20G UCSD RCI, commercial Internet, and room for your 10G link
• Two 384-port 10G switches connect SDSC to these networks and to external data collections (e.g., www.yourdatacollection.edu)
What Does SDSC Do?
Over 100 in-house researchers and technical staff
Core competencies
• Modeling & simulation • Parallel computing • Cloud computing • Energy-efficient computing • Advanced networking • Software development • Database systems • Data mining/BI tools • Data modeling & integration • Data management • Data processing workflows • Datacenter management
Application Domains
• Fluid dynamics • Structural engineering • Biomolecular simulation • Computational chemistry • Seismic modeling • Coastal hydrology • Geoinformatics • Neuroinformatics • Bioinformatics/genomics • Radiology • Smart energy grids • Medicare fraud detection
SDSC is at the nexus of the genomic medicine revolution
Wayne Pfeiffer
Assemble complex processing easily
Access transparently to diverse resources
Incorporate multiple software tools
Assure reproducibility
Community development model
bioKepler: Programmable and Scalable Workflows for Distributed Analysis of Large-Scale Biological Data
MapReduce BLAST
Ilkay Altintas
Natasha Balac
Big Data Predictive Analytics for UCSD Smart Grid
Over 70,000 sensor streams from UCSD Smart Grid processed on Gordon
What Does SDSC Do?
Center for Large Scale Data Systems Research (CLDS)
Chaitan Baru
Jim Short
What Does SDSC Do?
HPWREN: A Unique Regional Capability for Public-Private Partnerships
SDSC Teaming with CALFIRE and SDG&E to Respond to and Prevent Wildfires
What Does SDSC Do?
What Can SDSC Do for You?
• Just about anything involving high-capability/capacity technical computing, data management, and networking
• Our technical experts are eager to engage on R&D projects and service agreements customized to meet your needs
• There is a spectrum of ways we can interact
How do you begin working with us?
• You have already taken the first step by coming here today
• Join the IPP program to learn more about SDSC expertise and resources ▫ POC: Ron Hawkins ([email protected])
• Enjoy the rest of the program
SDSC Data Initiatives
Chaitan Baru
Associate Director, Data Initiatives
Director, Center for Large-scale Data Systems Research (CLDS)
SDSC, UC San Diego
[email protected]
SDSC IPP Research Review, June 12, 2013
Outline
• SDSC and Data
• Center for Large-Scale Data Systems Research (CLDS)
• Graduate Student Engagement
• Data Science Education and Training
SDSC’s Data DNA
• 25+ year history as a supercomputer center focused on data
• Applied informatics is what we do ▫ At the intersection of science, data, and computational science ▫ Applied and applications-driven research and development
• Multidisciplinary projects and interdisciplinary collaborations are how we do it ▫ They are our strength and the secret sauce in our above-average success rate on highly competitive proposals
• Advancing the state of the art in science and improving the science research process is why we do it ▫ Lessons can be applied to business applications as well ▫ We believe many science applications are precursors to future business apps
Data: A rapidly evolving set of problems
• Analytics: real-time and historical trend analysis (data velocity and volume)
• Integration: more comprehensive, holistic analysis (data variety)
• Costs ▫ Hardware, energy, software, people
• Skill sets ▫ Need for “cross-trained,” data-savvy individuals ▫ Ability to thrive in multidisciplinary, holistic, data-driven environments ▫ Break out of narrow academic silos / corporate roles and departments ▫ A real shortage
• Competition ▫ Global talent ▫ Increasingly, local problems
• Privacy
SDSC R&D Activities in Data
• Informatics collaborations in ▫ High-energy physics, astrophysics/astronomy, computational chemistry, bioinformatics, biomedical informatics, geoinformatics, ecoinformatics, social science, neurosciences, smart energy grids, anthropology, archaeology, …
• Expertise and labs in ▫ Benchmarking ▫ Bioinformatics ▫ Computational science ▫ Data warehousing ▫ Data and info visualization ▫ Large graph and text data ▫ Machine learning ▫ Performance modeling ▫ Predictive analytics ▫ Scientific data management ▫ Spatial data management ▫ Workflow systems
• Centers of Excellence ▫ CLDS: Center for Large-scale Data Systems Research, Chaitan Baru, Director ▫ PACE: Predictive Analytics Center of Excellence, Natasha Balac, Director ▫ CAIDA: Center for Applied Internet Data Analysis, KC Claffy, Director
CLDS: Center for Large-Scale Data Systems Research
• Focus: technical and technology management aspects related to big data
• Key initiatives ▫ Big Data Benchmarking ▫ Data Value and How Much Information?
• Principals: Chaitan Baru, James Short
Big Data Benchmarking – 1
• Community activity for development of a system-level big data benchmark, like TPC ▫ Coordinated by SDSC, http://clds.sdsc.edu/bdbc ▫ [email protected]: biweekly phone meetings
• A proposed BigData Top100 List, bigdatatop100.org
• Two proposals under discussion ▫ BigBench: extending TPC-DS for big data ▫ Data Analytics Pipeline: end-to-end analysis of event stream data
• Discussions with TPC and SPEC
Big Data Benchmarking – 2
• Workshops on Big Data Benchmarking (WBDB)
▫ 1st WBDB: May 2012, San Jose
▫ 2nd WBDB: December 2012, Pune, India
▫ 3rd WBDB: July 2013, Xi’an, China
▫ 4th WBDB: October 2013, San Jose
Big Data Reference Datasets
• An initiative of the Cloud Security Alliance Big Data Working Group ▫ Sreeranga Rajan, Fujitsu (Chair); Neel Sundaresan, eBay (Co-Chair); Wilco van Ginkel, Verizon (Co-Chair); Arnab Roy, Fujitsu (Crypto co-lead in BDWG/CSA)
• Objective ▫ Make reference datasets available on one or more platforms, for algorithm-level benchmarking
• Hosted by SDSC ▫ http://clds.sdsc.edu/bdbc/referencedata
NIST Big Data Working Group
• http://bigdatawg.nist.gov
• Co-chairs: Chaitan Baru; Robert Marcus, CTO, ET-Strategies; Wo Chang and Chris Greer, NIST
• Objective: one-year time frame ▫ Definitions ▫ Taxonomies ▫ Reference architectures ▫ Technology roadmap
• First meeting: June 19th
• Open to community
Current CLDS Programs
• Big Data Benchmarking (Pivotal lead)
• Project on Data Value (NetApp lead) ▫ Develop definitions, frameworks, assessment methodology, and tools for data value ▫ Proposed Workshop on Data Value, Jan-Feb 2014
• How Much Information 2013 (Seagate lead) ▫ Consumer information; enterprise information
• Data Science Institute (Brocade) ▫ SDSC-level program
CLDS Sponsorship
• Current sponsors ▫ Seagate, Pivotal, NetApp, Brocade, Intel (soon)
• Goals ▫ Small, focused group of core sponsors representing major, non-competitive industry quadrants (6-8 companies) ▫ Extended network of members who provide scale and scope and help fund industry events (20-30 companies)
• Sponsor structure ▫ Founding ($100K, multi-year) ▫ Program ($50K, annual)
• Member structure ▫ Continuing ($10K+, pay as you go) ▫ Member ($5K, per workshop event)
Big Data Benchmarking: How you can participate
• BDBC ▫ Join the BDBC mailing list: ~150 members, ~75 organizations ▫ Attend biweekly meetings, every other Thursday ▫ Present at biweekly meetings
• WBDB ▫ Submit papers to workshops; attend workshops
• Reference Datasets ▫ Participate in the Reference Datasets activity; contribute reference datasets
• NIST Big Data Working Group ▫ Join and contribute to the NIST Big Data Working Group
• Join CLDS as a sponsor
Data Value; How Much Information? How you can participate
• Data Value ▫ Join workshop planning and organization ▫ Contribute use cases ▫ Join CLDS as a sponsor
• HMI? ▫ Contribute use cases ▫ Join CLDS as a sponsor
SDSC Data Science Institute (DSI)
• Objective: provide training and education in data science
• Audience: industry attendees and academic researchers dealing with data
• Format ▫ Coverage of end-to-end issues in data science ▫ Emphasis on hands-on learning using short course formats, e.g., 1-day, 2-day, 1-week, and up to 1-month ▫ Inclusion of modules taught by industry ▫ At-Home and On-The-Road programs ▫ Possible internships associated with DSI
• First offering: SDSC Summer Institute, “Discovering Big Data,” Aug 5-12, 2013
DSI: How you can participate
• Naming opportunity ▫ The <Your Company Name Here> Data Science Institute
• Sign up for the SDSC Summer Institute ▫ $2K for 3 days or 5 days
• Sign up for future DSI offerings
• On-the-road program ▫ Work with us to create an on-the-road program for your company or your customers
• Contribute training modules ▫ Contribute modules based on your technology
• Provide your case studies for use in DSI
SDSC Graduate Projects Program
• SDSC projects for CSE MS graduate students ▫ Students work on projects with vendor hardware/software ▫ Upon successful completion, students receive internships at companies ▫ Companies have the option to hire students permanently
• A testbed/sandbox for big data / data science / computational science ▫ Currently a 32-node Hadoop cluster with Hortonworks HDP ▫ Plan to also install Intel Hadoop; would like to extend to 96 nodes ▫ Discussing a project with Brocade to test the performance of one of their Ethernet switches
• Two students just completed their MS projects (presentations today!) ▫ Joining Google and Zynga
Graduate Projects Program: How you can participate
• Create a project ▫ Announce a project and offer an internship after successful completion
• Contribute hardware/software for projects
• Contribute data with application scenarios / use cases
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Bioinformatics Meets Big Data
Wayne Pfeiffer SDSC/UCSD
June 12, 2013
Questions for today
• What is causing the flood of data in bioinformatics?
• How much data are we talking about?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational challenges?
Cost of DNA sequencing has dropped much faster than cost of computing in recent years,
producing the flood of data
Size matters: how much data are we talking about?
• 3.1 GB for the human genome ▫ Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000 ▫ 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to 1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage ▫ 300 GB for 40x coverage ▫ 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis ▫ 45 TB on disk at SDSC for the W115 project! (~10,000x a single genome) ▫ Multiple genomes per person! ▫ May only be looking for kB or MB in the end
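The 300 GB and 1 TB figures above follow directly from genome size, coverage, and the 2.5 B-per-base FASTQ overhead; a quick back-of-the-envelope check (an illustrative sketch, not an SDSC tool):

```python
GENOME_BASES = 3.1e9         # haploid human genome, ~3.1 Gbases
FASTQ_BYTES_PER_BASE = 2.5   # bases plus quality scores and headers

def fastq_input_size_gb(coverage):
    """Approximate FASTQ input size (GB) for whole-genome analysis at a given coverage."""
    return GENOME_BASES * coverage * FASTQ_BYTES_PER_BASE / 1e9

print(fastq_input_size_gb(40))   # ~310 GB, matching the "300 GB for 40x" estimate
print(fastq_input_size_gb(130))  # ~1008 GB, i.e., about 1 TB for 130x
```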
SDSC has a rich set of bioinformatics software; representative codes are listed here
• Pairwise sequence alignment • ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway) • ClustalW, MAFFT
• RNA-Seq analysis • GSNAP, Tophat
• De novo assembly • ABySS, Edena, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway) • BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light
• Tool kits • BEDTools, GATK, SAMtools
Many bioinformatics projects use SDSC supercomputers
• HuTS: Human Tumor Study (STSI/SDSC) • Find mutations in tumor, and select appropriate chemotherapy
• W115: Study of somatic mutations in genome of 115-year-old woman (VU Amsterdam, et al.) • Find somatic mutations in white blood cells
• MRSA (STSI/UCSD) • Characterize genomes of MRSA found in local hospitals
• Larry Smarr’s microbiome (UCSD) • Analyze Larry Smarr’s microbiome
• Various phylogenetics studies (via CIPRES gateway) • Calculate phylogenetic trees
Computational workflow for read mapping & variant calling
• Inputs: DNA reads in FASTQ format and a reference genome in FASTA format
• Read mapping, i.e., pairwise alignment (BFAST, BWA, …), produces alignment info in BAM format
• Variant calling (GATK, …) then identifies variants: SNPs, indels, others
• Goal: identify simple variants, e.g., ▫ single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs) ▫ short insertions & deletions (indels)

SNP example (read differs from the reference at one base):
CACCGGCGCAGTCATTCTCATAAT
||||||||||| ||||||||||||
CACCGGCGCAGACATTCTCATAAT

Indel example (read carries a short deletion relative to the reference):
CACCGGCGCAGTCATTCTCATAAT
||||||||||   |||||||||||
CACCGGCGCA---ATTCTCATAAT
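To make the goal concrete, here is a minimal sketch of the per-position comparison involved in SNP calling on a gapless alignment. The helper is hypothetical: real callers such as GATK also weigh base qualities, read depth, and mapping quality across many overlapping reads, which this sketch does not.

```python
def call_snps(reference, read, offset=0):
    """Report (position, ref_base, read_base) mismatches for a gapless alignment.

    `offset` is the 0-based position in the reference where the read maps.
    Illustrative only; production variant callers aggregate evidence
    across many reads rather than trusting a single alignment.
    """
    snps = []
    for i, (r, q) in enumerate(zip(reference[offset:], read)):
        if r != q:
            snps.append((offset + i, r, q))
    return snps

# The SNP example from the slide: T -> A at 0-based position 11
ref  = "CACCGGCGCAGTCATTCTCATAAT"
read = "CACCGGCGCAGACATTCTCATAAT"
print(call_snps(ref, read))  # [(11, 'T', 'A')]
```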
Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in the KRAS gene; this means that cetuximab is not effective for chemotherapy
BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI
The CIPRES gateway lets biologists run phylogenetics codes at SDSC via a browser interface;
http://www.phylo.org/index.php/portal
Computational challenges abound in bioinformatics
• Large amounts of data, which can grow substantially during analysis
• Complex workflows, often with different computational requirements along the way
• Parallelism that varies between steps in the workflow pipeline
• Large shared memory needed for some analyses
UNIVERSITY OF CALIFORNIA, SAN DIEGO
SAN DIEGO SUPERCOMPUTER CENTER
Ilkay Altintas, Ph.D.
Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD
Lab Director, Scientific Workflow Automation Technologies
[email protected]
Distributed Workflow-Driven Analysis of Biological Big Data
So, what is a scientific workflow?
Scientific workflows emerged as an answer to the need to combine multiple cyberinfrastructure components in automated process networks.
The Big Picture is Supporting the Scientist
From “Napkin Drawings” to Executable Workflows: a conceptual SWF is refined into an executable SWF (e.g., Fasta File → Circonspect → Average Genome Size → PHACCS → Combine Results)
Scientific Workflow Automation Technologies @ SDSC • Housed in San Diego Supercomputer Center at UCSD since 2004
• Mission: Support CI projects, scientists and engineers for computational practices involving process management
• Research and development focus • Scientific workflow management
• Data and process provenance • Distributed execution using scientific workflows • Engineering and streaming workflows for environmental observatories • Fault tolerance in scientific workflows
• Sensor network management and monitoring • Role of scientific workflows in eScience infrastructures • Understanding collaborative work in workflow-driven eScience
• Scientific collaborations
• Bioinformatics, Environmental Observatories, Oceanography, Computational Chemistry, Fusion, Geoinformatics, …
Workflows are Used as Toolboxes in Biological Sciences
Data Acquisition/Generation → Data Analysis → Data Publication/Archival
Workflows foster collaborations!
• Flexibility and synergy • Optimization of resources • Increasing reuse • Standards compliance
Need expertise to identify which tool to use, when, and how! Require computation models to schedule and optimize execution!
– Assemble complex processing easily
– Access transparently to diverse resources
– Incorporate multiple software tools
– Assure reproducibility
– Community development model
Kepler is a Scientific Workflow System
• KEPLER = “Ptolemy II + X” for scientific workflows ▫ Ptolemy II: a laboratory for investigating design ▫ KEPLER: a problem-solving environment for scientific workflow
• An open collaboration … initiated August 2003
• Kepler 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
CAMERA Example: Using Scientific Workflows and Related Provenance for Collaborative Metagenomics Research
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA)
http://camera.calit2.net
CAMERA is a Collaborative Environment
• Data Cart: multiple available mixed collections of CAMERA data (e.g., projects, samples)
• User Workspace: single workspace with access to all data and results (private and shared)
• Group Workspace: share specified User Workspace data with collaborators
• Data Discovery: GIS and advanced query options
• Data Analysis: workflow-based analysis
Workflows are a Central Part of CAMERA
All can be reached through the CAMERA portal at: http://portal.camera.calit2.net
Inputs: from local or CAMERA file systems; user-supplied parameters
Outputs: sharable with a group of users and links to the semantic database
More than 1500 workflow submissions monthly!
Pushing the boundaries of existing infrastructure and workflow system capabilities!
• Increase reuse
• Increase programmability by end users
• Increase resource utilization
• Make analysis a part of the end-to-end scientific model from data generation to publication
Add to these large amounts of next-generation sequencing data!
bioKepler is a coordinated ecosystem of biological and technological packages!
• Development of a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
• Builds on Kepler and its provenance framework atop cyberinfrastructure platforms such as Bio-Linux, Galaxy, Hadoop, and Stratosphere
• Improvement on usability and programmability by end users!
www.bioKepler.org
www.kepler-project.org
What can we do for you?
• Training ▫ Workflow-driven data- and compute-intensive processes
• Consulting ▫ Designing, scaling, and tracking pipelines and workflows
• Development services ▫ Production workflows for you
Access to technology and biology packages: • Bio-Linux • Galaxy via Amazon • Amazon Cloud • Hadoop • Stratosphere
Partners: • Individual researchers and labs • Research projects, e.g., CAMERA • Academic institutions • Private labs, e.g., JCVI
JCVI selection criteria: • Programmability • Modularity • Customizability • Scalability
www.bioKepler.org
MapReduce BLAST
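The “MapReduce BLAST” example named above follows the classic map/reduce pattern: partition the query sequences, search each partition independently, then merge the hit lists. A language-level sketch of that pattern, with a toy exact-substring matcher standing in for BLAST (which a bioKepler workflow would invoke instead):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def search_chunk(queries, subject):
    """Map step: toy stand-in for BLAST on one chunk of query sequences.
    Reports (query, offset) for exact substring hits in the subject sequence."""
    return [(q, subject.find(q)) for q in queries if q in subject]

def mapreduce_search(queries, subject, n_chunks=4):
    """Split queries into chunks, search each chunk independently, merge hits."""
    chunks = [queries[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = pool.map(search_chunk, chunks, [subject] * n_chunks)
    return sorted(chain.from_iterable(results))  # reduce step: merge and order

subject = "CACCGGCGCAGTCATTCTCATAAT"
hits = mapreduce_search(["GGCG", "TTCT", "AAAA"], subject)
print(hits)  # [('GGCG', 4), ('TTCT', 14)] -- 'AAAA' does not occur
```

In a real deployment the map step would shell out to `blastn` against a shared database and the reduce step would concatenate tabular hit files; the partition/merge structure is the same.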
Ilkay Altintas
Thanks! & Questions…
How to download Kepler?
https://kepler-project.org/users/downloads Please start with the short Getting Started Guide: https://kepler-project.org/users/documentation
Building a Semantic Information Infrastructure
Research Review: SDSC Industrial Partnership Program
Amarnath Gupta
Advanced Query Processing Lab
San Diego Supercomputer Center, University of California, San Diego
Lots of Data, Little Glue
A Science Enterprise: • Lab resources • Experiment design • Reagent catalogs • Experimental data • Chains of derived data • Analysis results (incl. outputs of software tools) • External information • Publications and presentations • …
An Industrial Enterprise: • Customer data • Product specifications • Product sales data • Production process data • Customer call records • Internal memos • Legal documents • Emails • Social intranet • …
What binds them together?
Wide variation in data types, models, volume, systems, usage, updates, …
Some Consequences of Not Having Glue
• An orphan-disease researcher ▫ Spends five months to determine that her disease of interest relates to the mouse version of a gene studied in breast cancer
• A health institution ▫ Takes over a year to integrate clinical trial management data, drug characteristics, and patient reports to get a 360-degree view of drug effects
• A reagent vendor ▫ Spends significant time and money to determine the effectiveness of a product line
• A legal department ▫ Takes much longer to discover relevant documents and data for a complex, multi-party litigation
• A utilities company ▫ Cannot easily perform synchrophasor analytics for grid-behavior understanding because SCADA, EMS, and PMU data cannot be integrated
Integration through Semantics
We have been building NIF, a Semantic Information System for Neuroscience.
• In NIF ▫ Data can be relational, XML, RDF, OWL, wiki content, manuscripts, publications, blogs, multimedia, annotations, … ▫ Any new domain goes through semantic processing ▫ Every piece of data gets some semantic markup ▫ Data are integrated and searched using ontological indices ▫ Any search or discovery process is interpreted through an ontology ▫ Any workflow that performs a mining-style computation utilizes semantic properties
• What NIF is designed to answer: some unanticipated questions ▫ Are NIH studies gender-biased? Which? ▫ Is the CRE mouse line being used by anyone? ▫ Which diseases have data but have been underfunded? ▫ Is the method section in this paper under-specified?
Technology Research
• Semantic Prospecting for any industry and application domain
• Semi-automatic Construction of Semantic Models for any problem domain
• Developing general-purpose, yet domain-specific Semantic Search Engines for heterogeneous data
• Complex Information Discovery using semantic graph analysis and mining-style computation
• Multi-domain Information Integration using semantic information bridges
Why SDSC? • Semantic processing can be complex and costly
• Domain model construction (incl. active learning) • Semantic indexing and correlation indexing • Graph query processing • Semantic search algorithms • Landscape and discovery analytics
• The SDSC infrastructure handles complexity at scale • Configurable compute nodes • Large-Memory Systems • SSD drives
Let SDSC help your infrastructure, research and service needs
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD
Michael Baitaluk, PhD
San Diego Supercomputer Center
Data types: taxonomies, sequences, publications, structures, variations, expression data, networks, annotations, biochemical data, epigenetic data
These data live in 2,000+ databases and in web pages. How can a researcher embrace this amount of data in its entirety?
Data Integration Resources
Existing data integration resources cover only part of these sources. This leaves a researcher to work with partial, incomplete data sets!
Data warehouse (0.1 PB) integrating the 2,000+ databases and web pages, built on biological ontologies and Semantic Web technologies
Other web pages Database web pages
Automatically extract data and map them into the internal database schema. For each ontological term A and page X, calculate the relevance score of X to A.
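The slides do not specify the scoring function, so as a purely hypothetical illustration of relating a page X to an ontological term A, here is a normalized term-frequency sketch (IntegromeDB's actual scoring is not described here):

```python
import re

def relevance(page_text, term, synonyms=()):
    """Hypothetical relevance of page X to ontological term A:
    the fraction of words on the page matching the term or its ontology
    synonyms. Illustrative only; not the IntegromeDB algorithm."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    vocab = {term.lower(), *(s.lower() for s in synonyms)}
    hits = sum(1 for w in words if w in vocab)
    return hits / len(words)

score = relevance("TP53 regulates apoptosis; p53 mutations are common in tumors.",
                  "TP53", synonyms=("p53",))
print(score)  # 2 of 9 words match the term or a synonym
```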
Architecture: public data on the web and users’ private data feed IntegromeDB, which serves the user community through the integromeDB.org web portal & API and the BiologicalNetworks.org Java application.
integromedb.org
Integromedb.org visit statistics
The same architecture can serve your user community and your database: public data on the web and users’ private data, exposed through your own web portal, APIs, and applications.
Your Google-like “Search-in-the-Box” Appliance
PACE: Predictive Analytics and Data Mining Research and Applications
Natasha Balac, Ph.D.
Director, PACE
Predictive Analytics Center of Excellence @ San Diego Supercomputer Center, UCSD
PACE: Closing the gap between government, industry, and academia
PACE is a non-profit, public educational organization:
o To promote, educate, and innovate in the area of predictive analytics
o To leverage predictive analytics to improve the education and well-being of the global population and economy
o To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics
PACE focus areas: inform, educate, and train; develop standards and methodology; high-performance scalable data mining; foster research and collaboration; a data mining repository of very large data sets; provide predictive analytics services; bridge the industry-academia gap
Foster Research and Collaboration
• Fraud detection
• Modeling user behaviors
• Smart grid analytics
• Solar-powered system modeling
• Microgrid anomaly detection
• Distributed energy generation
• Manufacturing
• Sport analytics
• Genomics
UCSD Smart Grid
• UCSD Smart Grid sensor network data set ▫ 45 MW peak microgrid; daily population of over 54,000 people ▫ Self-generates 92% of its own annual electricity load
• Smart Grid data: over 100,000 measurements/sec ▫ Sensor and environmental/weather data
• Large amount of multivariate and heterogeneous data streaming from complex sensor networks
• Predictive analytics throughout the microgrid
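At the quoted rate, the raw stream volume is easy to bound. A quick sketch; the 16 bytes per timestamped measurement is an assumed record size for illustration, not a figure from the slide:

```python
def daily_volume_gb(measurements_per_sec, bytes_per_measurement):
    """Measurements per day and GB per day for a continuous sensor stream."""
    per_day = measurements_per_sec * 86_400  # seconds per day
    return per_day, per_day * bytes_per_measurement / 1e9

# 100,000 measurements/sec from the slide; 16 B/record is a hypothetical size
per_day, gb = daily_volume_gb(100_000, 16)
print(per_day)            # 8.64e9 measurements/day
print(f"{gb:.1f} GB/day") # 138.2 GB/day under the assumed record size
```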
4 V’s of Big Data
IBM, 2012
What to do with big data? Eric Sall (IBM)’s list
• Big data exploration ▫ To get an overall understanding of what is there
• 360-degree view of the customer ▫ Combine both internally available and external information to gain a deeper understanding of the customer
• Monitoring cyber-security and fraud in real time
• Operational analysis ▫ Leveraging machine-generated data to improve business effectiveness
• Data warehouse augmentation ▫ Enhancing warehouse solutions with new information models and architecture
Big Data – Big Training
• “Data Scientist” ▫ The “hot new gig in town” (O’Reilly report)
• “Data Scientist: The Sexiest Job of the 21st Century” ▫ Harvard Business Review, October 2012 ▫ The future belongs to the companies and people that turn data into products
• Article in Fortune ▫ “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”
Data Science Job Growth
By 2018, a projected US shortage of 140,000-190,000 predictive analysts and 1.5 million managers/analysts
PACE Education
• Data mining boot camps ▫ Boot Camp 1: September 12-13, 2013 ▫ Boot Camp 2: October 17-18, 2013 ▫ On-site personalized boot camp (10-15; 20-30)
• Tech Talks: every 3rd Wednesday
• Workshops, webinars
• Interesting Reads, “Tool-off”
• “Bring your own data”
Predictive Analytics Consulting
• Full-service consulting and development services
• Targeted projects with industry and agency partners
• Applied and applications-oriented research
• Technical expertise and industry experience
Mike Gualtieri's blog
Questions?
• http://pace.sdsc.edu/
• For further information, contact Natasha Balac [email protected]
PMaC Performance Modeling and Characterization
Performance, Modeling, and Characterization
(PMaC) Lab
Laura Carrington, Ph.D.
PMaC Lab Director
University of California, San Diego
San Diego Supercomputer Center
The PMaC Lab
Mission Statement: Research the complex interactions between HPC systems and applications
to predict and understand the factors that affect performance and power on current and projected
HPC platforms.
• Develop tools and techniques to deconstruct HPC systems and HPC applications to provide detailed characterizations of their power and performance.
The PMaC Lab utilizes these characterizations to construct power and performance models that guide:
– Improvement of application performance (2 Gordon Bell finalists, DoD HPCMP applications, NSF Blue Waters, etc.)
– System procurement, acceptance, and installation (DoD HPCMP procurement team, DoE upgrade of ORNL Jaguar, installation of NAVO PWR6)
– Accelerator assessment for a workload (performance assessment and prediction for GPUs & FPGAs)
– Hardware customization / hardware-software co-design (performance and power analysis for exascale)
– Improvement of energy efficiency and resiliency (Green Queue project, DoE SUPER Institute, DoE BSM, etc.)
Automated tools & techniques to characterize HPC systems & applications
HPC System: characterize the computational (& communication) patterns that affect the overall power draw.
HPC Application: characterize the computational (& communication) behavior of the application.
[Diagram: the application decomposed into regions such as Loop #1, Func. Foo, Loop #2, and Loop #3 for per-region characterization]
Design software- and hardware-aware energy and performance optimization techniques
Hardware Customization
• 10x10 project (PI: A. Chien, U. of Chicago): heterogeneous processor architecture
• Reconfigurable memory hierarchy: L1, L2, and L3
• Selection of energy-optimal configuration via simulation and reuse distance analysis

Space name   Configurations in search space   Unique configurations selected   Avg. energy savings (%)
Full         2652                             33                               68.8
2N           469                              26                               66.0
Restricted   224                              23                               65.5
Cluster      10                               10                               63.7
Analysis of 37 workloads
~70% energy savings
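The reuse-distance analysis mentioned above can be sketched in a few lines. This is a hypothetical illustration, not PMaC's actual tooling; the function name and trace are invented for the example:

```python
def reuse_distances(trace):
    """For each access, the number of distinct addresses touched
    since the previous access to the same address (None = first use)."""
    last_seen = {}          # address -> index of most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses accessed strictly between the two uses
            between = set(trace[last_seen[addr] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances

# A fully associative LRU cache with capacity C hits exactly when the
# reuse distance is < C, so the distance histogram predicts hit rates
# for every candidate cache configuration from one trace.
trace = ["A", "B", "C", "A", "B", "B"]
print(reuse_distances(trace))  # [None, None, None, 2, 2, 0]
```

This is why reuse distance is attractive for configuration search: one pass over the application's trace evaluates many memory-hierarchy configurations without re-simulating each one.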
Goal: Use machine and application characterization to make application-aware energy optimizations
during execution
PMaC’s Green Queue Framework (optimizing for performance & power)
[Figure: power (kW) vs. time (seconds) for a 1,024-core run on Gordon, with per-phase energy savings annotations ranging from 4.8% to 32%]
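The intuition behind application-aware frequency selection can be sketched with a toy energy model: during a memory-bound phase, lowering the clock barely slows the run, so energy drops. All numbers here (the power curve, phase fractions, frequencies) are illustrative assumptions, not Gordon measurements or Green Queue's actual model:

```python
def phase_energy(freq_ghz, base_ghz, compute_frac, base_time_s, power_w):
    """Toy model: only the compute-bound fraction of a phase slows
    down when the clock drops; memory-bound time is unchanged."""
    time_s = base_time_s * (compute_frac * base_ghz / freq_ghz
                            + (1.0 - compute_frac))
    return power_w(freq_ghz) * time_s

# Illustrative power curve: static power plus frequency-dependent term
power_w = lambda f: 80.0 + 15.0 * f ** 2

def best_freq(freqs, base_ghz, compute_frac, base_time_s):
    return min(freqs, key=lambda f: phase_energy(f, base_ghz, compute_frac,
                                                 base_time_s, power_w))

freqs = [1.2, 1.6, 2.0, 2.6]
# Memory-bound phase (20% compute): the lowest clock wins
print(best_freq(freqs, 2.6, 0.2, 10.0))   # 1.2
# Compute-bound phase (95% compute): a higher clock wins
print(best_freq(freqs, 2.6, 0.95, 10.0))  # 2.0
```

With these toy numbers the memory-bound phase picks 1.2 GHz while the compute-bound phase picks 2.0 GHz, matching the slide's premise that the right setting depends on the application's per-phase behavior.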
Questions
Benchmarking and Tuning Big Data Software
David Nadeau, Ph.D.
What to tune? • Application benchmarking/profiling finds what to
tune for one system
• But... we want answers for many systems for now and (hopefully) years to come
• Requires understanding fundamental trends • Processor speeds, core counts, memory bandwidth,
memory latency, network bandwidth, etc.
A few trends
Clock speeds almost flat Cycles/math op almost flat
Raw math ability/core is not improving.
A few trends
Core count/CPU way up SPEC float/CPU way up
Float math/core (thread) up 8 to 20%/year Why is this trend upwards if math speed is flat?
A few trends
DDR bandwidth up 15%/year
Memory bandwidth/core up 0 to 12%/year
Improvement is primarily from better memory bandwidth, not math ability.
What this means
• SPEC, Dhrystone, etc. are now dominated by memory performance, not math:
  • 1 multiply = 1 cycle
  • 1 memory access = ~300 cycles
  • Not likely to change soon
• Application performance is dominated by memory access costs.
• So tune access patterns or data order.
Example: 3D volume Task: Store 3D volume in memory. Array of arrays of arrays, one big array, etc.
Worst (blue): array of array of arrays. Best (black): one big array with simple 3D indexing.
Fewer memory references is much faster.
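The two layouts can be sketched directly. The point of the "one big array" scheme is that element (x, y, z) is found with one index computation and one memory reference, where the nested layout needs three dependent references (pointer chases). N and the fill values are invented for the illustration:

```python
N = 4  # small hypothetical volume edge length

# Worst case from the slide: array of arrays of arrays, v[z][y][x]
nested = [[[z * 100 + y * 10 + x for x in range(N)]
           for y in range(N)] for z in range(N)]

# Best case: one big array. Element (x, y, z) lives at index
# (z*N + y)*N + x, so an x-order walk is a contiguous scan.
flat = [nested[z][y][x] for z in range(N) for y in range(N) for x in range(N)]

def at(v, x, y, z):
    return v[(z * N + y) * N + x]   # one reference, no pointer chasing

# The flat layout is a faithful relabeling of the nested one
assert all(at(flat, x, y, z) == nested[z][y][x]
           for x in range(N) for y in range(N) for z in range(N))
print(at(flat, 1, 2, 3))  # 321
```

In C or Fortran the difference shows up directly as cache misses; in an interpreted language the structural point (one reference vs. three) is the same even though interpreter overhead masks the timing gap.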
Example: 3D volume sweep
Task: Sweep a plane through the volume along any of 6 axis directions: +X, -X, +Y, -Y, +Z, -Z.
4 sweep directions are slow; 2 are 10x to 30x faster.
Sweeping in natural data order is much faster.
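The asymmetry follows from the strides each sweep axis implies in a flat, x-fastest layout: consecutive X-sweep accesses are adjacent in memory, while Z-sweep accesses are a whole plane apart. N is a hypothetical edge length:

```python
N = 256  # hypothetical volume edge length, elements stored x-fastest

# Address of (x, y, z) in the flat layout
def index(x, y, z):
    return (z * N + y) * N + x

# Stride (in elements) between consecutive accesses along each axis
stride_x = index(1, 0, 0) - index(0, 0, 0)   # innermost axis: contiguous
stride_y = index(0, 1, 0) - index(0, 0, 0)   # one row apart
stride_z = index(0, 0, 1) - index(0, 0, 0)   # one full plane apart

print(stride_x, stride_y, stride_z)  # 1 256 65536
```

A stride-1 sweep reuses every cache line it loads; a stride-65536 sweep touches a new line (and likely a new page) on every access, which is where the 10x to 30x gap comes from.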
Example: 3D volume bricking
Task: Sweep in any direction. Brick the volume: a cube of cubes (unbricked, 2x2x2, 4x4x4, 8x8x8, 16x16x16).
Bricking makes all sweep directions have similar performance.
Bricked data order is more cache friendly.
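A bricked index can be computed as follows: split each coordinate into a brick number and an offset inside the brick, so every B³ sub-cube is stored contiguously. B and N are illustrative; this is a sketch of the idea, not the talk's actual code:

```python
B = 4    # brick edge (the slide tried 2, 4, 8, 16)
N = 16   # volume edge, assumed to be a multiple of B
BRICKS = N // B

def brick_index(x, y, z):
    """Flat index when the volume is stored as a cube of B^3 bricks:
    bricks ordered z, y, x; elements inside each brick likewise."""
    bx, by, bz = x // B, y // B, z // B          # which brick
    ox, oy, oz = x % B, y % B, z % B             # offset inside it
    brick = (bz * BRICKS + by) * BRICKS + bx
    return brick * B ** 3 + (oz * B + oy) * B + ox

# Sanity check: the mapping is a valid permutation of the volume
idx = {brick_index(x, y, z)
       for x in range(N) for y in range(N) for z in range(N)}
assert idx == set(range(N ** 3))

# A +Z neighbour is now only B*B elements away instead of N*N,
# so all six sweep directions see similar cache behaviour.
print(brick_index(0, 0, 1) - brick_index(0, 0, 0))  # 16, not N*N = 256
```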
Example: Desktop compression
Task: Real-time compress & send the desktop; many codecs tried.
Clever codecs are slower, despite producing smaller result data.
Codecs with fewer memory references are faster.
Example: Parallel compositing Task: Composite N images on cluster of N nodes.
Many algorithms.
Many small messages vs. few big messages: the same amount of data moves either way.
Better network use is faster.
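The "few big messages" result is what the classic alpha-beta communication model predicts: each message pays a fixed latency (alpha) plus a bandwidth term (beta), so batching amortizes the latency. The latency and bandwidth numbers below are invented for illustration, not measurements from the compositing study:

```python
def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bps):
    """Alpha-beta cost model: per-message latency plus bytes/bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bps

total = 64 * 1024 ** 2        # 64 MB of image data (same in both cases)
latency = 20e-6               # 20 us per message (illustrative)
bw = 1e9                      # 1 GB/s link (illustrative)

many_small = transfer_time(65536, total, latency, bw)  # 1 KB messages
few_big = transfer_time(16, total, latency, bw)        # 4 MB messages

print(f"many small: {many_small:.3f}s  few big: {few_big:.3f}s")
```

With these numbers the small-message schedule spends over a second in per-message latency alone, while the big-message schedule is dominated by the (unavoidable) bandwidth term.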
And so on…
Gordon: A First-of-its-Kind Data-Intensive Supercomputer
SDSC Research Review June 5, 2013
Shawn Strande Gordon Project Manager
Gordon is a highly flexible system for exploring a wide range of data-intensive technologies and applications:
• High performance flash technology
• High speed InfiniBand interconnect
• On-demand Hadoop and data-intensive environments
• Massively large memory environments
• High performance parallel file system
• Scientific databases
• Complex application architectures
• New algorithms and optimizations
Gordon is a data movement machine
Sandy Bridge Compute Nodes (1,024) • 64 TB memory • 341 Tflop/s
Flash based I/O Nodes (64) • 300 TB Intel eMLC flash • 35M IOPS
Large Memory Nodes • vSMP Foundation 5.0 • 2 TB of cache-coherent
memory per node
“Data Oasis” Lustre PFS 100 GB/sec, 4 PB
Dual-rail, 3D Torus Interconnect • 7GB/s
SSD latencies are 2 orders of magnitude lower than HDDs' (a big deal for some data-intensive applications):
• Typical hard drive: ~10 ms (0.010 s), 200 IOPS
• Solid state disk: ~100 µs (0.0001 s), 35,000 / 3,000 IOPS (read/write)
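The latency figures above directly bound what one request stream can achieve: with a single outstanding request, IOPS cannot exceed 1/latency. A minimal sketch using the slide's latency numbers (the single-queue assumption is ours, for illustration):

```python
def max_iops_single_queue(latency_s):
    """Upper bound on IOPS with one outstanding request at a time."""
    return 1.0 / latency_s

hdd_latency = 0.010    # ~10 ms (from the slide)
ssd_latency = 0.0001   # ~100 us (from the slide)

print(max_iops_single_queue(hdd_latency))  # 100.0
print(max_iops_single_queue(ssd_latency))  # 10000.0
```

The slide's measured figures (200 HDD IOPS, 35,000 SSD read IOPS) exceed these naive bounds because real controllers keep many requests in flight; the two-orders-of-magnitude gap between the device classes is what carries over.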
Protein Data Bank (flash-based I/O node) The RCSB Protein Data Bank (PDB) is the leading primary database that provides access to the experimentally determined structures of proteins, nucleic acids and complex assemblies. In order to allow users to quickly identify more distant 3D relationships, the PDB provides a pre-calculated set of all possible pairwise 3D protein structure alignments.
Although the pairwise structure comparisons are computationally intensive, the bottleneck is the centralized server that is responsible for assigning work, collecting results, and updating the MySQL database. Using a dedicated Gordon I/O node and the associated 16 compute nodes, the work could be accomplished 4-6x faster than using the Open Science Grid (OSG).

Configuration     Time for 15M alignments   Speedup
Reference (OSG)   24 hours                  1
Lyndonville       6.3 hours                 3.8
Taylorsville      4.1 hours                 5.8
OpenTopography Facility (flash-based I/O node) The NSF funded OpenTopography Facility provides online access to Earth science-oriented high-resolution LIDAR topography data along with online processing tools and derivative products. Point cloud data are processed to produce digital elevation models (DEMs) - 3D representations of the landscape.
High-resolution bare earth DEM of San Andreas fault south of San Francisco, generated using OpenTopography LIDAR processing tools Source: C. Crosby, UNAVCO
Illustration of local binning geometry: dots are LIDAR shots; '+' indicates locations of DEM nodes at which elevation is estimated.
Dataset: Lake Tahoe, 208 million LIDAR returns, 0.2-m grid resolution and 0.2-m search radius.

Concurrent jobs   OT servers   Gordon ION   Speedup
1                 3297 sec     1102 sec     3x
4                 29607 sec    1449 sec     20x

The local binning algorithm uses the elevation information from only the points inside a circular search area with a user-specified radius. An out-of-core (memory) version of the algorithm exploits secondary storage for saving intermediate results when the size of a grid exceeds that of memory. Using a dedicated Gordon I/O node with fast SSDs reduces run times of massive concurrent out-of-core processing jobs by a factor of 20x.
IntegromeDB (flash-based I/O node) IntegromeDB is a large-scale data integration system and biomedical search engine. It collects and organizes heterogeneous data from over a thousand databases covered by Nucleic Acids Research and millions of public biomedical, biochemical, drug- and disease-related resources.
IntegromeDB is a distributed system stored in a PostgreSQL database containing over 5,000 tables, 500 billion rows, and 50 TB of data. New content is acquired using a modified version of the SmartCrawler web crawler, and pages are indexed using Apache Lucene. The project was awarded two Gordon I/O nodes, the accompanying compute nodes, and 50 TB of space on Data Oasis. The compute nodes are used primarily for post-processing of raw data. Using the I/O nodes dramatically increased the speed of read/write file operations (10x) and I/O database operations (50x).
Source: Michael Baitaluk (UCSD) Used by permission 2013
Structural response of bone to stress (vSMP)
Source: Matthew Goff, Chris Hernandez (Cornell University) Used by permission. 2012
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This has relevance for paleontologists inferring the habitual locomotion of ancient people and animals, and in treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom MATLAB application that converts 253 micro-CT images into voxel-based finite element models
Managing HPC Systems at the
National and Campus Levels
Rick Wagner, Ph.D. Candidate
HPC Systems Manager, SDSC
Systems
Gordon
Trestles
TSCC
Dev Data Oasis
Challenges
Enabling unique & differentiating features of each system…
…without maintaining N different systems.
[Diagram: Gordon, Trestles, TSCC, Data Oasis, and Dev, kept distinct across technology, policy, and business model through agnosticism and isolation]
Solutions
Part I:
• Common systems management (Rocks)
• Shared staff responsibility across systems
Part II:
• Build on (cower behind) core SDSC services
Core SDSC services:
• User Services
• Operations
• Storage
• VM Hosting
• Security
• Networking
• SDSC Sandbox
With support from:
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Ilkay ALTINTAS, Ph.D. Deputy Coordinator for Research, San Diego Supercomputer Center, UCSD Lab Director, Scientific Workflow Automation Technologies [email protected]
SDSC’s Myriad Areas of Expertise
Visualization Services Group
Lead: Amit Chourasia
• Develops new ways to represent data visually
• Research collaborations in many science and engineering disciplines
• Visualization support and consulting
• Provides visualization education and training
• Website: http://www.sdsc.edu/us/visservices/
Ross Walker ([email protected]) Andreas Goetz ([email protected])
Technical Expertise • GPU Computing • CUDA Teaching Center • Parallel computing • Workstation and Cluster Design for
Biomolecular Simulations and Computational Drug Discovery
• Cloud computing
Ross Walker ([email protected]) Andreas Goetz ([email protected])
Scientific Expertise • Molecular Dynamics • Quantum Chemistry • Force Field Development,
Automatic Parameter Fitting • Drug Discovery • Biomolecular Simulations
SDSC Spatial Information Systems Lab • Services-based spatial information
integration infrastructure • Advanced online and high performance GIS
and Geospatial Databases • Information Interoperability in the
geosciences • Long-term spatial data preservation • Information models and data standards
(adopted by federal government and internationally)
• Innovative user interfaces for connecting people, projects, resources…
• Large distributed data systems and catalogs (for scientific field observations from hydrology, critical zone and others)
Example projects: Hydrologic Information System (largest in the world), brain data integration, Ecosystem Services Dashboard, Katrina portal, Mexico Health Atlas, NSF EarthCube, CZO
Contact: Ilya Zaslavsky ([email protected])
High Performance Wireless Research and Education Network (HPWREN)
http://hpwren.ucsd.edu/
An extension of the Area Situational Awareness for Public Safety Network (ASAPnet); ~60 HPWREN/ASAPnet fire agency sites existed in June 2013 (from a Google Earth KML object). Project partners include:
• the County of San Diego
• the California Department of Forestry and Fire Protection (CAL FIRE)
• the United States Forest Service (USFS)
• San Diego Gas and Electric (SDG&E)
• San Diego State University (SDSU)
Mathematical anthropology
James Moody, Douglas R. White. Structural Cohesion and Embeddedness: A Hierarchical Concept of Social Groups. American Sociological Review 68(1):103-127. 2003.
The identification of cohesive subgroups in large networks is of key importance to problems spanning the social and biological sciences. A k-cohesive subgroup has the property that it cannot be disconnected by the removal of fewer than k of its nodes. This has been shown to be equivalent to a set of vertices in which every pair of members is joined by at least k vertex-independent paths (Menger's theorem).
Doug White (UCI) and his collaborators are using software developed in R with the igraph package to study social networks. The software was parallelized using the R multicore package and ported to Gordon's vSMP nodes by SDSC computational scientist Robert Sinkovits. Analyses for large problems (a 2,400-node Watts-Strogatz model) are achieving estimated speedups of 243x on 256 compute cores. Work is underway to identify cohesive subgroups in large co-authorship networks.
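The k-cohesion measure is exactly the graph's vertex connectivity, which Menger's theorem ties to vertex-independent paths. A small sketch using NetworkX (the project itself uses R and igraph; this is an illustration, not its code):

```python
import networkx as nx

# A 5-node complete graph is 4-cohesive: every pair of members is
# joined by 4 vertex-independent paths, and no removal of 3 nodes
# can disconnect the group.
K5 = nx.complete_graph(5)
print(nx.node_connectivity(K5))  # 4

# A cycle is only 2-cohesive: removing 2 well-chosen nodes splits it.
C8 = nx.cycle_graph(8)
print(nx.node_connectivity(C8))  # 2
```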
Impact of high-frequency trading. To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell stock at a specified price. Analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012
Symbol   Wall time (s), orig. code   Wall time (s), opt. code   Speedup
SWN      8400                        128                        66x
AMZN     55200                       437                        126x
AAPL     129914                      1145                       113x
Optimizations by SDSC computational scientists Robert Sinkovits and DongJu Choi to the original thread-parallel code resulted in greater than 100x speedups. It is now possible to analyze an entire day of NASDAQ activity in a few hours using 16 Gordon nodes. With these new capabilities, the team is beginning to consider analysis of options data, which has 100x greater memory requirements.
Run times for LOB construction of heavily traded NASDAQ securities (June 4, 2010)
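A limit order book can be sketched as a map from price levels to resting shares, replaying add and cancel messages; quote stuffing is precisely a flood of add/cancel pairs. This toy price-level book is far simpler than the nanosecond-resolution reconstruction described above, and all names and orders are invented:

```python
import heapq
from collections import defaultdict

class LimitOrderBook:
    """Toy price-level book tracking resting shares per (side, price)
    and the best bid/ask. Real reconstruction replays every message
    in the exchange feed at nanosecond resolution."""
    def __init__(self):
        self.depth = defaultdict(int)       # (side, price) -> shares
        self.bids, self.asks = [], []       # heaps of candidate prices

    def add(self, side, price, shares):
        self.depth[(side, price)] += shares
        if side == "B":
            heapq.heappush(self.bids, -price)   # max-heap via negation
        else:
            heapq.heappush(self.asks, price)

    def cancel(self, side, price, shares):
        self.depth[(side, price)] -= shares

    def best(self, side):
        heap = self.bids if side == "B" else self.asks
        while heap:
            price = -heap[0] if side == "B" else heap[0]
            if self.depth[(side, price)] > 0:
                return price
            heapq.heappop(heap)             # lazily drop empty levels
        return None

book = LimitOrderBook()
book.add("B", 100, 500)
book.add("B", 101, 200)     # a quote...
book.cancel("B", 101, 200)  # ...cancelled immediately (quote stuffing)
book.add("S", 103, 300)
print(book.best("B"), book.best("S"))  # 100 103
```

The lazy deletion in `best` hints at why stuffing is costly for everyone downstream: cancelled levels still flow through the feed and the book's data structures even though they never trade.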
SDSC’s Education, Outreach and Training (EOT) Programs
Diane Baxter, Ph.D., Ange Mason, Jeff Sale
San Diego Supercomputer Center
University of California, San Diego
SDSC EOT Program Challenges
• Prepare teachers to teach their students the skills and knowledge for a future in which technology = power and computational skills = success
• Give students access to the computational tools, knowledge, and thinking skills to seek their dreams and create their future
• Train researchers at all levels to use HPC and data-intensive computing tools to accelerate discovery in science, engineering, technology, mathematics, and other data-related fields
Ilkay Altintas
Thanks! & Questions…
Industrial Engagement at SDSC
• Industrial Partners Program (IPP): the “gateway” program; annual membership; large company, small company, and individual categories
• CLDS: focus on big data
• PACE: focus on predictive analytics
• Research contracts: specific project defined
• Service agreements: for use of SDSC resources/services
THANK YOU!