Keynote on 2015 Yale Day of Data

Big Data & Analytics: Five Trends and Five Research Challenges

Robert Grossman University of Chicago

& Open Data Group

September 18, 2015

Part 1 What is Big Data?

Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.

Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.

•  Volume
•  Velocity
•  Variety
•  Veracity
•  Value

•  Megabytes
•  Gigabytes
•  Terabytes
•  Petabytes
•  Exabytes
•  Zettabytes

The Name Changes

1830  statistics
1980  computationally intensive statistics
1993  data mining & knowledge discovery in databases
1997  business analytics
2004  predictive analytics
2011  big data, data science & data analytics

Source: Google Trends, www.google.com/trends

What is Big Data? (Operations POV)

A marketing term introduced by O’Reilly: Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Source: Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.

What is Big Data? (POV: New Types of Data that IT Cannot Manage)

Period    New types of data                           Term used
1990’s    Clicks on the Internet, POS transactions    Data mining
2000’s    Unstructured data, graph data               Predictive Analytics
2010’s    Mobile data, IoT data                       Big Data

What Is Small Data?

•  100 million movie ratings
•  480 thousand customers
•  17,000 movies
•  From 1998 to 2005
•  Less than 2 GB of data.
•  Fits into memory, but very sophisticated models required to win.

What are the origins of big data?

Basic Choice with Hardware: Scale Up or Out

More memory, more processors, more disk ($K)

Specialized hardware (e.g. connects)($100K)

Specialized devices ($M)

One machine Cluster (racks) ($100K)

Cyber Pod $M

Distributed cyber pods $10M+

Source: Interior of one of Google’s Data Center, www.google.com/about/datacenters/

Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).

The Google Data Stack

•  The Google File System (2003)
•  MapReduce: Simplified Data Processing… (2004)
•  BigTable: A Distributed Storage System… (2006)


Source: Terence Kawaja, http://www.slideshare.net/tkawaja

•  The leaders in big data analytics measure data in megawatts.
– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.

– Facebook’s new Prineville data center is 30 MW.

What is Big Data? (My computer is a data center POV)

Part 2 What is Analytics?

Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.

What is Analytics?

Short Definition
•  Using data to make decisions.

Longer Definition
•  Using data to take actions and make decisions using models that are statistically valid and empirically derived.

Definition of Statistics from ASA web page:
•  Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …


Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.

[Figure: timeline of names for the field]
1984  Computationally Intensive Statistics
1993  Data Mining & KDD
2004  Predictive Analytics
2011  Big Data & Data Science
Data sources shown: Direct marketing, POS, Internet, Devices/IoT
Algorithms shown: ID3 & C4.5, PageRank, Spanner TX algorithm

1.  Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).

2.  Compute where to put additional armor to maximize the chance that planes return.
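
The problem stated above can be illustrated with a small sketch. It is a deliberately simplified illustration, not the original analysis. It assumes that hits land on each section in proportion to its share of the plane's surface, so a section with far fewer holes than expected on returning planes is one where hits tended to bring planes down. The hole counts, the area fractions and the function name `armor_priority` are made up for illustration.

```python
from typing import List

SECTIONS = ["tail", "wing", "fuselage", "other"]  # j = 1, 2, 3, 4


def armor_priority(holes: List[List[int]], area_fraction: List[float]) -> List[str]:
    """Rank sections from most to least in need of armor.

    holes[i][j] is b_ij, the bullet holes counted in section j of returning plane A_i.
    area_fraction[j] is the ASSUMED share of the plane's surface in section j.
    """
    observed = [sum(plane[j] for plane in holes) for j in range(len(SECTIONS))]
    total = sum(observed)
    if total == 0:
        raise ValueError("no bullet holes recorded")
    # Ratio of observed hits to the hits expected if hits were spread by area.
    # A low ratio means hits there are under-represented among survivors.
    ratio = {
        name: observed[j] / (total * area_fraction[j])
        for j, name in enumerate(SECTIONS)
    }
    return sorted(SECTIONS, key=lambda name: ratio[name])


# Hypothetical counts for three returning planes (rows) and four sections (columns).
holes = [
    [2, 4, 1, 3],
    [3, 5, 0, 2],
    [1, 6, 1, 4],
]
area_fraction = [0.15, 0.35, 0.30, 0.20]  # assumed, sums to 1

priority = armor_priority(holes, area_fraction)
print(priority)
```

With these made-up numbers the fuselage ranks first: it is a large part of the surface yet shows very few holes on the planes that came back.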

Part 3. Data Science

A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732

Some fields have a single billion-dollar (or more) instrument that generates big data.

A genomics sequencing facility might have 3-5 next-generation sequencing instruments that cost $250,000 or more each.

Some fields have hundreds or thousands of million dollar instruments that in aggregate produce big data.

Some fields have millions of hundred dollar sensors that in aggregate produce big data.

Math & Sta-s-cs

Computer Science

Disciplinary Science

Data Science

Understanding Salmon (A Cautionary Tale)

Source: Salmo salar, (Atlantic Salmon), wikipedia.org

Methods

Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.

Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain, further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.

The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
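
A little arithmetic shows how large the multiple testing problem is in the salmon example. The search volume of 8064 voxels comes from the study description above. The per-voxel threshold of 0.001 and the overall level of 0.05 are assumptions chosen for illustration. Bonferroni and Benjamini-Hochberg are standard corrections, not necessarily the ones the study's authors applied, and the short list of p-values is made up.

```python
from typing import List

N_VOXELS = 8064          # search volume from the study description
PER_VOXEL_ALPHA = 0.001  # ASSUMED uncorrected per-voxel threshold
OVERALL_ALPHA = 0.05     # ASSUMED family-wise / false discovery level


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests passing the threshold if every test is pure noise."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: List[float], q: float) -> List[bool]:
    """Return, for each p-value, whether it is rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k such that the k-th smallest p-value is <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


noise_hits = expected_false_positives(N_VOXELS, PER_VOXEL_ALPHA)
strict = bonferroni_threshold(N_VOXELS, OVERALL_ALPHA)
print(f"expected chance hits without correction: {noise_hits:.1f}")
print(f"Bonferroni per-voxel threshold: {strict:.2e}")

# Hypothetical p-values, only to show the Benjamini-Hochberg procedure running.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example_p, OVERALL_ALPHA))
```

Under these assumptions roughly eight voxels would pass an uncorrected threshold by chance alone, which is the same order of magnitude as the 16 voxels reported for the dead salmon.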

Part 4. What Instrument Do we Use to Make Discoveries in Data Science?

How do we build a “datascope?”

[Figure: instruments for discovery, by type of science]
Labels: experimental science, simulation science, data science
Milestones: 1609 (30x), 1670 (250x), 1976 (10x-100x), 2004 (10x-100x)
“Cyberpod”

Could we continuously re-analyze the world’s cancer data?

Complex statistical models over small data that are highly manual and updated infrequently.

Simpler statistical models over large data that are highly automated and updated frequently.

[Figure: scale of systems]
Systems: memory, databases, datapods, cyber pods
Data: GB, TB, PB
Power: W, KW, MW

Part 5 Five Trends

Source: Google Trends, for term “data commons”, www.google.com/trends.

Trend 1 Data Commons

Source: NEXRAD, NOAA, www.noaa.gov

The Standard Model of Biomedical Computing No Longer Works

Public data repositories

Private local storage & compute

Network download

Local data ($1K)

Community software

Software, sweat and tears ($100K)

Data Commons

Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.

Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/

Open Science Data Cloud (Open Cloud Consortium, 2012)

NCI Data Commons (UChicago, Nov 2015)

Bionimbus Protected Data Cloud (UChicago, 2013)

NOAA Data Commons (Open Cloud Consortium, Oct 2015)

Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.

Hospitals, medical research centers and doctors

Data commons containing genomic and clinical data.

Patients

Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.

Trend 2 Analytics of Things, People and Places

Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/

People and things generating streaming data that are relevant for research.

Places that generate data

Source: Jane Macfarlane, Here, a Division of Nokia.

Trend 3 Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis

Source: M. Bostock, http://bl.ocks.org/mbostock/4063318

Portable Format for Analytics (PFA)

Predictive Model Markup Language (PMML)

Grammar of Graphics

d3.js

Trend 4 More Policies That Make Data Available and Analytics Repeatable

Executive Order 13642 (May 9, 2013) Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)

OMB Guidance, President’s Executive Order

Trend 5 Translational Data Science

How do we translate data driven discoveries into actions that impact society?

Imaging Informatics

Clinical Informatics

Bioinformatics

Public Health Informatics

Basic Research

Applied Research

Practice (dx, treatment and prevention)

Molecular & cellular processes

Tissues & organs

Individuals (patients)

Groups & populations

Quality & outcomes

Translational Informatics

New algorithms, new statistical models (data science)

Applications to genomics, analysis of EMR, etc.

Software stacks for data intensive computing (data engineering)

Data driven discoveries

Data driven diagnosis

Data driven therapeutics

Develop a software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)

Translational Data Science

Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.

Part 6 Five Challenges

Challenge 1. Is More Different?

Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.

Do New Phenomena Emerge at Scale in Data?

Challenge 2. One Million Genomes

•  Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.

•  The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).

•  One million genomes is about 1000 PB or 1 EB.
•  With compression, it may be about 100 PB.
•  At $1000/genome, the sequencing would cost about $1B.

•  Think of this as one hundred studies with 10,000 patients each over three years.
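
The back-of-the-envelope figures in the bullets above can be checked with a few lines of arithmetic. This sketch assumes decimal storage units (1 PB = 1000 TB, 1 EB = 1000 PB) and a 10x compression ratio, which is what going from 1000 PB to 100 PB implies. The function name `million_genome_estimate` is only illustrative.

```python
# Decimal storage units are assumed: 1 PB = 1000 TB and 1 EB = 1000 PB.
TB_PER_PB = 1000
PB_PER_EB = 1000


def million_genome_estimate(
    genomes: int = 1_000_000,
    tb_per_patient: float = 1.0,       # tumor plus normal tissue, per the slide
    compression_ratio: float = 10.0,   # ASSUMED; implied by 1000 PB -> 100 PB
    cost_per_genome_usd: float = 1000.0,
) -> dict:
    """Storage and sequencing cost estimates for a large genome collection."""
    raw_pb = genomes * tb_per_patient / TB_PER_PB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / PB_PER_EB,
        "compressed_pb": raw_pb / compression_ratio,
        "sequencing_cost_usd": genomes * cost_per_genome_usd,
    }


estimate = million_genome_estimate()
patients_in_studies = 100 * 10_000  # one hundred studies of 10,000 patients each

print(estimate)
print(patients_in_studies)
```

Changing `tb_per_patient` or `compression_ratio` shows how sensitive the storage estimate is to those two inputs, while the sequencing cost depends only on the per-genome price.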

Challenge 3. Datapods

•  Databases have fundamentally changed the way we manage and analyze scientific data.

•  NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.

•  If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.

•  Call this a “datapod.”
•  It could support open source data commons and allow them to peer.

Challenge 4. A Billion Predictive Models

•  Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models.

•  Applications
– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).

– 1 million cancer genomes x 1,000 models / genome.

– Urban science: instrumenting cities.
– Consumer marketing: large advertisers will see 1-3 billion different consumers.
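
The idea of segmented models, one model per segment rather than a single global model, can be sketched in miniature. This is only an illustration under simple assumptions: each segment gets an ordinary least-squares line, the records and segment keys are made up, and nothing here addresses the engineering needed to reach billions of models.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, float]  # (segment key, x, y), a hypothetical layout
Line = Tuple[float, float]         # (slope, intercept)


def fit_line(points: List[Tuple[float, float]]) -> Line:
    """Ordinary least-squares line through the points of one segment."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All x values are equal: fall back to a constant model.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_segmented_models(records: Iterable[Record]) -> Dict[str, Line]:
    """Group records by segment key and fit one independent model per segment."""
    by_segment: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for segment, x, y in records:
        by_segment[segment].append((x, y))
    return {segment: fit_line(points) for segment, points in by_segment.items()}


# Made-up records for two segments with clearly different behavior.
records = [
    ("a", 0.0, 1.0), ("a", 1.0, 3.0), ("a", 2.0, 5.0),
    ("b", 0.0, 10.0), ("b", 2.0, 6.0),
]
models = fit_segmented_models(records)
print(models)
```

Because each segment is fitted independently, the work parallelizes naturally across segments, which is one reason the approach is attractive at very large scale.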

Challenge 5. HDSI

•  Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.

•  Think of Human Data Science Interaction (HDSI) as the study of how humans interact with the software supporting data science analysis at the scale of datapods, with billions of models and trillions of hypotheses.

•  How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?

Questions?


rgrossman.com @bobgrossman

For More Information

cdis.uchicago.edu

www.opendatagroup.com

rgrossman.com

Data & Analytics
