Big Data & Analytics: Five Trends and Five Research Challenges
Robert Grossman University of Chicago
& Open Data Group
September 18, 2015
Part 1 What is Big Data?
Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations.
Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
• Volume • Velocity • Variety • Veracity • Value
• Megabytes • Gigabytes • Terabytes • Petabytes • Exabytes • Zettabytes
The Name Changes
1830 statistics
1980 computationally intensive statistics
1993 data mining & knowledge discovery in databases
1997 business analytics
2004 predictive analytics
2011 big data, data science & data analytics
Source: Google Trends, www.google.com/trends
What is Big Data? (Operations POV)
A marketing term introduced by O’Reilly: Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
What is Big Data? (POV: New Types of Data that IT Cannot Manage)
Period | New types of data | Term used
1990s | Clicks on the Internet, POS transactions | Data mining
2000s | Unstructured data, graph data | Predictive analytics
2010s | Mobile data, IoT data | Big data
What Is Small Data?
• 100 million movie ratings
• 480 thousand customers
• 17,000 movies
• From 1998 to 2005
• Less than 2 GB data.
• Fits into memory, but very sophisticated models required to win.
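The "less than 2 GB" figure above can be checked with a rough calculation. The sketch below assumes a compact fixed-width encoding per rating. The byte widths are illustrative assumptions, not something stated on the slides: 4 bytes for a customer id, 2 bytes for a movie id (17,000 movies fit in 16 bits), 1 byte for the rating, and 2 bytes for a day offset covering 1998 to 2005.

```python
# Back-of-the-envelope memory footprint for the ratings data set.
NUM_RATINGS = 100_000_000  # from the slide: 100 million movie ratings

# ASSUMPTION: hypothetical fixed-width encoding, chosen for illustration only.
BYTES_PER_FIELD = {
    "customer_id": 4,  # 480 thousand customers fit easily in 32 bits
    "movie_id": 2,     # 17,000 movies fit in 16 bits
    "rating": 1,       # assumes a rating fits in one byte
    "day_offset": 2,   # days since the start of 1998, fits in 16 bits
}

bytes_per_rating = sum(BYTES_PER_FIELD.values())
total_bytes = NUM_RATINGS * bytes_per_rating
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{bytes_per_rating} bytes per rating, about {total_gb:.1f} GB in total")
```

Under these assumptions the whole data set is well under 2 GB, so it fits in the memory of a single machine.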
What are the origins of big data?
Basic Choice with Hardware: Scale Up or Out
More memory, more processors, more disk ($K)
Specialized hardware (e.g. connects)($100K)
Specialized devices ($M)
One machine Cluster (racks) ($100K)
Cyber Pod $M
Distributed cyber pods $10M+
Source: Interior of one of Google’s Data Center, www.google.com/about/datacenters/
Computational advertising finds the “best match” between a given user in a given context and a suitable advertisement ($100+ B market).
The Google Data Stack
• The Google File System (2003) • MapReduce: Simplified Data Processing… (2004) • BigTable: A Distributed Storage System… (2006)
Source: Terence Kawaja, http://www.slideshare.net/tkawaja
• The leaders in big data analytics measure data in megawatts.
– As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
– Facebook’s new Prineville data center is 30 MW.
What is Big Data? (My computer is a data center POV)
Part 2 What is Analytics?
Source: Aaron Parecki, Everywhere I’ve Been, aaronparecki.com.
What is Analytics?
Short Definition
• Using data to make decisions.
Longer Definition
• Using data to take actions and make decisions using models that are statistically valid and empirically derived.
Definition of Statistics from ASA web page:
• Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty …
Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
[Figure: timeline of names for the field]
1984 Computationally Intensive Statistics
1993 Data Mining & KDD
2004 Predictive Analytics
2011 Big Data & Data Science
Algorithms shown: ID3 & C4.5, PageRank, Spanner TX algorithm
Data sources shown: Direct marketing, POS, Internet, Devices/IoT
1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively).
2. Compute where to put additional armor to maximize the chance that planes return.
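One way to see why this problem is subtle is that the counts b_ij are only observed on planes that returned. The simulation below is a minimal sketch of that selection effect, not the method on the slide. The per-section lethality values, the number of hits per plane, and the uniform spread of hits are all hypothetical assumptions made for illustration.

```python
import random

SECTIONS = ["tail", "wing", "fuselage", "other"]

# ASSUMPTION: hypothetical probability that a single hit in a section
# brings the plane down. These numbers are invented for illustration.
LETHALITY = {"tail": 0.1, "wing": 0.1, "fuselage": 0.2, "other": 0.6}


def simulate(n_planes, hits_per_plane, rng):
    """Return (number of returning planes, hole counts seen on them)."""
    holes = {section: 0 for section in SECTIONS}
    returned = 0
    for _ in range(n_planes):
        # ASSUMPTION: every hit is equally likely to land on any section.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if every hit turns out to be non-lethal.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        if survived:
            returned += 1
            for section in hits:
                holes[section] += 1
    return returned, holes


rng = random.Random(0)  # fixed seed so the run is repeatable
returned, holes = simulate(10_000, 3, rng)
fewest_holes = min(holes, key=holes.get)

print(f"{returned} planes returned; holes observed on them: {holes}")
print(f"section with the fewest observed holes: {fewest_holes}")
```

In this toy setup the section with the fewest holes on returning planes is the one where hits are most lethal, which is why counting holes on survivors alone can point the armor at the wrong place.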
Part 3. Data Science
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Some fields have a single billion-dollar (or more) instrument that generates big data.
A genomics sequencing facility might have 3-5 next-generation sequencing instruments that cost $250,000 or more each.
Some fields have hundreds or thousands of million-dollar instruments that in aggregate produce big data.
Some fields have millions of hundred-dollar sensors that in aggregate produce big data.
Math & Statistics
Computer Science
Disciplinary Science
Data Science
Understanding Salmon (A Cautionary Tale)
Source: Salmo salar, (Atlantic Salmon), wikipedia.org
Methods
Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.
Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.
Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
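The arithmetic behind the multiple-testing warning can be sketched in a few lines. The search volume of 8064 voxels comes from the slide. The per-voxel threshold of 0.001, the independence of the tests, and the 0.05 family-wise target are assumptions made here for illustration.

```python
# Search volume reported on the slide.
NUM_VOXELS = 8064

# ASSUMPTION: uncorrected per-voxel significance threshold.
ALPHA_PER_TEST = 0.001

# ASSUMPTION: conventional family-wise error rate target.
FAMILY_WISE_ALPHA = 0.05

# Expected number of false positives if no voxel carries any real signal.
expected_false_positives = NUM_VOXELS * ALPHA_PER_TEST

# Chance of at least one false positive, treating the tests as independent.
prob_any_false_positive = 1 - (1 - ALPHA_PER_TEST) ** NUM_VOXELS

# Bonferroni correction: divide the family-wise rate by the number of tests.
bonferroni_threshold = FAMILY_WISE_ALPHA / NUM_VOXELS

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_any_false_positive:.4f}")
print(f"Bonferroni per-voxel threshold: {bonferroni_threshold:.2e}")
```

Under these assumptions roughly 8 voxels would pass the threshold by chance alone, so a handful of "active" voxels in a dead fish is what an uncorrected analysis should be expected to produce.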
Part 4. What Instrument Do We Use to Make Discoveries in Data Science?
How do we build a “datascope?”
experimental science
simulation science
data science
1609 30x
1670 250x
1976 10x-100x
2004 10x-100x
“Cyberpod”
Could we continuously re-analyze the world’s cancer data?
Complex statistical models over small data that are highly manual and updated infrequently.
Simpler statistical models over large data that are highly automated and updated frequently.
memory databases
GB TB PB
W KW MW
datapods
cyber pods
Part 5 Five Trends
Source: Google Trends, for term “data commons”, www.google.com/trends.
Trend 1 Data Commons
Source: NEXRAD, NOAA, www.noaa.org
The Standard Model of Biomedical Computing No Longer Works
Public data repositories
Private local storage & compute
Network download
Local data ($1K)
Community software
Software, sweat and tears ($100K)
Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
Open Science Data Cloud (Open Cloud Consortium, 2012)
NCI Data Commons (UChicago, Nov 2015)
Bionimbus Protected Data Cloud (UChicago, 2013)
NOAA Data Commons (Open Cloud Consortium, Oct 2015)
Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
Hospitals, medical research centers and doctors
Data commons containing genomic and clinical data.
Patients
Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
Trend 2 Analytics of Things, People and Places
Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/
People and things generating streaming data that are relevant for research.
Places that generate data
Source: Jane Macfarlane, Here, a Division of Nokia.
Trend 3 Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis
Source: M. Bostock, http://bl.ocks.org/mbostock/4063318
Portable Format for Analytics (PFA)
Predictive Model Markup Language (PMML)
Grammar of Graphics
d3.js
Trend 4 More Policies That Make Data Available and Analytics Repeatable
Executive Order 13642 (May 9, 2013) Making Open and Machine Readable the Default for Government Information (“Open Data Policy”)
OMB Guidance President’s Ex Order
Trend 5 Translational Data Science
How do we translate data driven discoveries into actions that impact society?
[Figure: informatics disciplines across biological scale and stages of research]
Disciplines: Bioinformatics, Imaging Informatics, Clinical Informatics, Public Health Informatics, Translational Informatics
Stages: Basic Research, Applied Research, Practice (dx, treatment and prevention)
Levels: Molecular & cellular processes, Tissues & organs, Individuals (patients), Groups & populations
Quality & outcomes
New algorithms, new statistical models (data science)
Applications to genomics, analysis of EMR, etc.
Software stacks for data intensive computing (data engineering)
Data driven discoveries
Data driven diagnosis
Data driven therapeutics
Develop a software stack that scales to a “datapod”, to create “commons” for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)
Translational Data Science
Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
Part 6 Five Challenges
Challenge 1. Is More Different?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
Do New Phenomena Emerge at Scale in Data?
Challenge 2. One Million Genomes
• Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
• Think of this as one hundred studies with 10,000 patients each over three years.
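The estimates above follow from simple arithmetic, sketched here with decimal units (1 PB = 1000 TB, 1 EB = 1000 PB). The 10x compression ratio is an assumption inferred from the drop from about 1000 PB to about 100 PB; the slides do not state it directly.

```python
TB_PER_GENOME = 1            # from the slide: about 1 TB per patient
NUM_GENOMES = 1_000_000      # one million genomes
TB_PER_PB = 1_000            # decimal units
PB_PER_EB = 1_000
COMPRESSION_RATIO = 10       # ASSUMPTION: implied by 1000 PB -> about 100 PB
COST_PER_GENOME_USD = 1_000  # from the slide: $1000/genome

raw_pb = TB_PER_GENOME * NUM_GENOMES / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / COMPRESSION_RATIO
sequencing_cost_usd = COST_PER_GENOME_USD * NUM_GENOMES

# One hundred studies with 10,000 patients each cover the same cohort.
patients_in_studies = 100 * 10_000

print(f"raw: {raw_pb:.0f} PB ({raw_eb:.0f} EB), compressed: {compressed_pb:.0f} PB")
print(f"sequencing cost: ${sequencing_cost_usd:,}")
print(f"patients across studies: {patients_in_studies:,}")
```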
Challenge 3. Datapods
• Databases have fundamentally changed the way we manage and analyze scientific data.
• NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate.
• If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source “database” that scales to a cyberpod.
• Call this a “datapod.”
• It could support open source data commons and allow them to peer.
Challenge 4. A Billion Predictive Models
• Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models
• Applications
– George Church’s challenge: individual predictive models for each human genome (6.5 billion humans).
– 1 million cancer genomes x 1,000 models / genome.
– Urban science: instrumenting cities.
– Consumer marketing: large advertisers will see 1-3 billion different consumers
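To make "segmented models" concrete, the sketch below fits one small model per segment and then checks the model count for the cancer genome example. It is a minimal illustration under stated assumptions, not the technology the challenge calls for. The function names, the `(segment, x, y)` record layout, and the choice of a least-squares line as the per-segment model are all hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # flat line if x never varies
    return slope, mean_y - slope * mean_x


def fit_segmented_models(records):
    """Group (segment, x, y) records by segment and fit one model each."""
    by_segment = defaultdict(list)
    for segment, x, y in records:
        by_segment[segment].append((x, y))
    return {seg: fit_line(pts) for seg, pts in by_segment.items()}


# Hypothetical toy records: two segments with different behavior.
records = [("a", 0, 1), ("a", 1, 3), ("a", 2, 5), ("b", 0, 10), ("b", 2, 6)]
models = fit_segmented_models(records)

# Model count for the cancer example: 1 million genomes x 1,000 models each.
cancer_model_count = 1_000_000 * 1_000

print(models)
print(f"{cancer_model_count:,} models")
```

A single global model would blur segments "a" and "b" together. The per-segment fit keeps them separate, and building, validating, and updating such models without manual work is what becomes difficult at a billion segments.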
Challenge 5. HDSI
• Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert.
• Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting data science analysis at the scale of datapods, with billions of models and trillions of hypotheses.
• How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?
Questions?
rgrossman.com @bobgrossman
For More Information
cdis.uchicago.edu
www.opendatagroup.com
rgrossman.com