Using&the&Open&Science&DataCloud&&...

Using the Open Science Data Cloud for Data Science Research

Robert Grossman University of Chicago

Open Cloud Consor=um June 17, 2013

Data: 1 PB of OSDC data across several disciplines

Instrument: 3000 cores / 5 PB OSDC science cloud

Team: you and your colleagues

Discoveries

correla=on algorithms +

Part 1 What Instrument Do we Use to Make Big Data Discoveries?

How do we build a “datascope?”

What is big data?

TB? PB? EB?

W? KW? MW?

An algorithm and compu=ng infrastructure is “big-‐data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computa=on in the same =me but over more data.

Commercial Cloud Service Provider (CSP) 15 MW Data Center

100,000 servers 1 PB DRAM

100’s of PB of disk

Automa=c provisioning and infrastructure management

Monitoring, network security and forensics

Accoun=ng and billing Customer

Facing Portal

Data center network

~1 Tbps egress bandwidth

25 operators for 15 MW Commercial Cloud

OSDC’s vote for a datascope: a (bou=que) data center scale facility with a big-‐data scalable analy=c infrastructure.

Discoveries

Discipline Dura2on Size # Devices

HEP -‐ LHC 10 years 15 PB/year* One

Astronomy -‐ LSST 10 years 12 PB/year** One

Genomics -‐ NGS 2-‐4 years 0.5 TB/genome 1000’s

Some Examples of Big Data Science

*At full capacity, the Large Hadron Collider (LHC), the world's largest par=cle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambi=ous project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: hhp://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-‐en.html **As it carries out its 10-‐year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resul=ng in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: hhp://www.lsst.org/News/enews/teragrid-‐1004.html

One large instrument Many smaller instruments

Part 2. What is a Cloud and Why Do We Care?

There Are Two Essen=al Characteris=cs of a Cloud

1.  Self service 2.  Scale

•  Clouds enable you to compute over large amounts of data with the necessity of first downloading the data.

•  Clouds can be designed to be secure and compliant.

Self Service

Types of Clouds

•  Public Clouds – Amazon

•  Private Clouds – Run internally by universi=es or companies

•  Community Clouds – Run by organiza=ons (either formally or informally), such as the Open Cloud Consor=um

Amazon Web Services (AWS)?

Community clouds, science clouds, etc.

•  Lower cost (at medium scale) •  Data too important for

commercial cloud •  Compu=ng over scien=fic

data is a core competency •  Can support any required

governance / security

•  Scale •  Simplicity of a credit card •  Wide variety of offerings.

OCC supports AWS interop and burs=ng when permissible. 16

Science Clouds

NFP Science Clouds Commercial Clouds POV Democra=ze access to

data. Integrate data to make discoveries. Long term archive.

As long as you pay the bill; as long as the business model holds.

Data & Storage

Data intensive compu=ng & HP storage

Internet style scale out and object-‐based storage

Flows Large & small data flows Lots of small web flows Streams Streaming processing

required NA

Accoun=ng Essen=al Essen=al Lock in Moving environment

between CSPs essen=al Lock in is good

Interop Cri=cal, but difficult Customers will drive to some degree 17

Essen=al Services for a Science CSP •  Support for data intensive compu=ng •  Support for big data flows •  Account management, authen=ca=on and authoriza=on services

•  Health and status monitoring •  Billing and accoun=ng •  Ability to rapidly provision infrastructure •  Security services, logging, event repor=ng •  Access to large amounts of public data •  High performance storage •  Simple data export and import services

Datascope – Science Cloud Service Provider (Sci CSP)

Data scien=st

Sci CSP services

Cloud Services Opera=ons Centers (CSOC)

•  The OSDC operates Cloud Services Opera=ons Center (or CSOC).

•  It is a CSOC focused on suppor=ng Science Clouds for researchers.

•  Compare to Network Opera=ons Center or NOC.

•  Both are an important part of cyber infrastructure for big data science.

Datascope – Science Cloud Service Provider (Sci CSP)

Data scien=st

Sci CSP services

Cloud Service Opera=ons Center (CSOC)

Part 3 Data Science

Founda=ons of data science

General and discipline specific souware applica=ons and tools

Models and algorithms

Establish best prac=ces, strategies for data science in general and discipline specific data science in par=cular

Analy=c infrastructure

What are the founda=ons for data science?

Theory to Big Data Spectrum

Simple counts and sta=s=cs over big data

Mathema=cal theorems

No data Small data

Big data

Tradi=onal sta=s=cal modeling

Medium data

(Semi-‐)Automa=ng sta=s=cal modeling

GB TB PB

OSDC Datascope 0.5-‐2.0 MW

Part 4 The Open Science Data Cloud

www.opensciencedatacloud.org

Discoveries

2013 Open Science Data Cloud (IaaS)

5 PB 2013 (OpenStack & GlusterFS)

Infrastructure automa=on & management

(Yates)

Compliance, & security

(OpenFISMA)

Accoun=ng & billing

(Salesforce.com)

Customer Facing Portal (Tukey)

Data center network

~10-‐100 Gbps bandwidth

5 engineers to operate 0.5 MW Science Cloud

Science Cloud SW & Services

•  Virtual Machine (VM) containing common applica=ons & pipelines

•  Tukey (OSDC portal & middleware v0.3) •  Yates (infrastructure automa=on and management v0.1) 28

•  Tukey (based in part on Horizon). •  We have factored out digital ID service, file sharing, and transport from Bionimbus and Matsu.

•  Automa=on installa=on of OSDC souware stack on rack of computers.

•  Based upon Chef •  Version 0.1

•  UDT is a high performance network transport protocol •  UDR = rsync + UDT •  It is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.

•  We are using it to distribute c. 1 PB from the OSDC

Open Science Data Cloud Services

•  Digital ID services •  Data sharing services •  Data transport services (UDR) •  What other core services are essen&al? •  Of course, working groups and applica=ons always add their own services

•  These core services will hopefully make the OSDC ahrac=ve as a plaxorm (PaaS) for scien=fic discovery.

33 www.opencloudconsor=um.org

•  U.S based not-‐for-‐profit corpora=on. •  Manages cloud compu=ng infrastructure to

support scien=fic research: Open Science Data Cloud.

•  Manages cloud compu=ng infrastructure to support medical and health care research: Biomedical Commons Cloud

•  Manages cloud compu=ng testbeds: Open Cloud Testbed.

OCC Members & Partners

•  Companies: Cisco, Yahoo!, Intel, … •  Universi=es: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …

•  Federal agencies and labs: NASA •  Interna=onal Partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …

•  Partners: Na=onal Lambda Rail

Third party open source souware

Open source souware developed by the OCC and open standards

Data center

+ Data with permissions

+ Authoriza=on of users access to data

+ Policies, procedures, controls, etc.

+ Governance, legal agreements

+ Sustainability model 35

Part 5 OSDC Data

Discoveries

OSDC Public Data Sets

•  Over 800 TB of open access data in the OSDC •  Earth sciences data •  Biological sciences data •  Social sciences data •  Digital humani=es

Part 6 OSDC Working Groups

Just look around you

Matsu Working Group: Clouds to Support Earth Science

matsu.opensciencedatacloud.org

Matsu Architecture

Hadoop HDFS

Matsu Web Map Tile Service (WMTS)

Matsu MR-‐based Tiling Service

NoSQL Database

Images at different zoom layers suitable for OGC Web Mapping Server

Level 0, Level 1 and Level 2 images

MapReduce used to process Level n to Level n+1 data and to par==on images for different zoom levels

NoSQL-‐based Analy=c Services

Streaming Analy=c Services

MR-‐based Analy=c Services

Analy=c Services Storage for WMS =les and derived data products

Presenta=on Services

Web Coverage Processing Service

(WCPS)

Workflow Services

Hadoop-‐Based Re-‐Analysis Zoom Level 1: 4 images Zoom Level 2: 16 images

Zoom Level 3: 64 images Zoom Level 4: 256 images

Bionimbus Working Group

bionimbus.opensciencedatacloud.org (biological data)

Bionimbus Protected Data Cloud

Analyzing Data From The Cancer Genome Atlas (TCGA)

1.  Apply to dbGaP for access to data.

2.  Hire staff, set up and operate secure compliant compu=ng environment to mange 10 – 100+ TB of data.

3.  Get environment approved by your research center.

4.  Setup analysis pipelines. 5.  Download data from CG-‐

Hub (takes days to weeks). 6.  Begin analysis.

Current Prac2ce With Protected Data Cloud (PDC)

1.  Apply to dbGaP for access to data.

2.  Use your eRA commons creden=als to login to the PDC, select the data that you want to analyze, and the pipelines that you want to use.

3.  Begin analysis.

One Million Genomes •  Sequencing a million genomes would most likely fundamentally change the way we understand genomic varia=on.

•  The genomic data for a pa=ent is about 1 TB (including samples from both tumor and normal =ssue).

•  One million genomes is about 1000 PB or 1 EB •  With compression, it may be about 100 PB •  At $1000/genome, the sequencing would cost about $1B

Big data driven discovery on 1,000,000 genomes and 1 EB of data.

Genomic-‐driven

diagnosis

Improved understanding of genomic science

Genomic-‐ driven drug development

Precision diagnosis and treatment. Preven=ve

health care.

Biomedical Commons Cloud (BCC) Working Group

Cloud for Public Data

Cloud for Controlled Genomic Data

Cloud for EMR, PHI,

Example: Open Cloud Consor=um’s Biomedical Commons Cloud (BCC)

Medical Research Center A

Medical Research Center B

Hospital D

Medical Research Center C

Resource Who users Who operates Open Science Data Cloud (OSDC)

Pan science data for researchers

Open Cloud Consor=um (OCC) supported by University OCC members

Biomedical Commons Clouds (BCC)

(Interna=onal) biomedical researchers

OCC Biomedical Commons Cloud Working Group supported by OCC University members

Bionimbus Protected Data Cloud

Genomics researchers

University of Chicago supported by the OCC

OpenFlow-‐Enabled Hadoop WG

•  When running Hadoop some map and reduce jobs take significantly longer than others.

•  These are stragglers and can significantly slow down a MapReduce computa=on.

•  Stragglers are common (dirty secret about Hadoop) •  Infoblox and UChicago are leading a OCC Working Group on OpenFlow-‐enabled Hadoop that will provide addi=onal bandwidth to stragglers.

•  We have a testbed for a wide area version of this project.

OSDC PIRE Project We select OSDC PIRE Fellows (US ci=zens or permanent residents): •  We give them tutorials and training on big data science.

•  We provide them fellowships to work with OSDC interna=onal partners.

•  We give them preferred access to the OSDC.

Nominate your favorite scien=st as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)

Part 7 Key Ques=ons for This Workshop

•  Ques=on 1. How can we add partner sites at other loca=ons that extend the OSDC? In par=cular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?

•  Ques=on 2. What data can we add to the OSDC to facilitate data intensive cross-‐disciplinary discoveries?

•  Ques=on 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communi=es?

•  Ques=on 4. What tools and applica=ons can we add to the OSDC facilitate data intensive cross-‐disciplinary discoveries?

•  Ques=on 5. How can we beher integrate digital IDs and file sharing services into the OSDC?

•  Ques=on 6. What are 3-‐5 grand challenge ques=ons that leverage the OSDC?

Ques=ons

Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informa=cs Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computa=on Ins=tute and the Ins=tute for Genomics and Systems Biology, and a Professor of Medicine in the Sec=on of Gene=c Medicine. His research group focuses on big data, biomedical informa=cs, data science, cloud compu=ng, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predic=ve models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called the Structure of Digital Compu=ng: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.

Using&the&Open&Science&DataCloud&&...

Documents

Osdc php preday odata 2011

Wissbi osdc pdf

Osdc - Meteor Intorduction

IEEE DataCloud 2012 2datasys.cs.iit.edu/events/MTAGS12/closing-remarks.pdfIEEE DataCloud 2012 3rd International Workshop on Data Intensive Computing in the Clouds 2012 Keynotes Daniel

Trac Osdc Nov 2007

OSDC 2014 ONIE by Nat Morris

OSDC 2013 - Configuration Management and Linux Packages

Analytics and Benchmarking, Powered by ADP DataCloud · PDF fileAnalytics and Benchmarking, Powered by ADP ® DataCloud Is Your Workforce Empowered by Data Insights? Employers struggle

Osdc group 8_project_-_part_1

Firefox Extension Development | By JIIT OSDC

The VERCE Architecture for Data-Intensive Seismologyciara.fiu.edu/osdc-verce_Iraklis.pdf · The VERCE Architecture for Data-Intensive Seismology OSDC Workshop, Edinburgh, 19 June

The$INDIGODataCloud Data&$Computing$Platform$ … · 2016-07-06 · INDIGODataCloud General$ Architecture 6 JSAGA/JSAGA Adaptors FutureGateway Engine FutureGateway REST API Other

OSDC Keynote: How Public Lab collaborates

Designing Website for Mobile Safari (OSDC 2010)

Using&the&Open&Science&DataCloud&& …pire.opensciencedatacloud.org/talks/osdc-overview-13-v3.pdf · Using&the&Open&Science&DataCloud&& for&DataScience&Research& RobertGrossman& University&of&Chicago&

JBoss AS7 OSDC 2011

BitSpace OSDC Presentation

Osdc services catalogue

ADP DataCloud0F916658-03DA... · AWESOME NEW TECHNOLOGY 2016 Benchmarking, powered by ADP DataCloud TOP HR PRODUCT 2016 Reporting, powered by ADP DataCloud BEST IN CLASS - HR 2016

Osdc Complex Event Processing