Laws and Limits of Data Science: The Next Decade
Michael L. Brodie

Laws and limits of data science 11 10-14



Keynote, Analytics Week, Boston, MA, November 7, 2014. Big Data is in its infancy and is opening the door to profound change: Grand Opportunities (Accelerating Scientific Discovery) and Grand Challenges to be addressed over the next decade. We explore the premise that Data Science is to data-intensive discovery as the Scientific Method is to scientific discovery, leading us to potential Laws and Limits of Data Science, and then to Best Practices.


Page 1: Laws and limits of data science 11 10-14

Laws and Limits of Data Science: The Next Decade

Michael L. Brodie

Page 2: Laws and limits of data science 11 10-14


Big Data is opening the door to …

Page 3: Laws and limits of data science 11 10-14


Grand Opportunities: Accelerating Scientific Discovery …

Page 4: Laws and limits of data science 11 10-14


Grand Challenges: Many – efficacy, efficiency, …

Page 5: Laws and limits of data science 11 10-14

What is Big Data?

•  Defining Big Data constrains this emerging phenomenon
•  Since Big Data is not
   —  About data, but a problem-solving ecosystem
   —  A discipline, but a multidisciplinary sub-domain of most disciplines*
•  What matters is what we will do with Big Data
•  Big Data is opening the door to profound change in
   —  Processing
   —  Thinking
•  Let’s use the potential of profound change to understand Big Data


* “transformative … changing academia (… emerged .. on the critical path for their sub-discipline)” and is changing society” Michael Jordan.

Page 6: Laws and limits of data science 11 10-14

Starting to Understand Big Data

•  Listen to Data
   —  Hypothesis generation → overcome limits of human cognition*
•  Multiple, Simultaneous Perspectives
   —  Ensemble models → Accelerating Scientific Discovery*
•  And many more …


* Necessary condition: human guidance

Page 7: Laws and limits of data science 11 10-14


Big Data is in its infancy, with at least decade-long challenges

Page 8: Laws and limits of data science 11 10-14

Outline
•  Big Picture: Why and What
•  Grand Opportunities
•  Grand Challenges
   —  Efficacy, amongst many
•  Laws and Limits of Data Science

Page 9: Laws and limits of data science 11 10-14

Big Picture: Scientific Method

[Diagram labels: Phenomenon, Hypothesis, Experiment, Model, Causality]

Page 10: Laws and limits of data science 11 10-14

Big Picture: Why & What

[Diagram labels: Phenomenon, Experiment, Model]

What (Big Data)
Correlation: What might occur

Why (Empiricism)
Causation: Why it occurs

Page 11: Laws and limits of data science 11 10-14

Why: Scientific Method and the Search for Causation

History of Science and the Scientific Method
Mature Disciplines: Empiricism, Clinical Studies, Drug Discovery

The Holy Grail of science is to identify accurate causality.

Empirical, clinical trial, and drug discovery methods take time: 100+ years

Three Ages of Medicine [The Remedy: Goetz]
Free-for-All: 1850s–1940s
Rise of Trials: 1940s–2010s
Beyond the Lab: Post-2010

Page 12: Laws and limits of data science 11 10-14

What: Models and the Search for Meaningful Correlations

•  History of Modelling: mathematics, sciences, computing, …
•  Disciplines
   —  Mature (theory-driven): math, physics, statistics, …
   —  Emerging (data-driven): data mining, machine learning, neural networks, support vector machines, …
•  Methodologies
   —  Mature: 100s of years
   —  Emerging: at least a decade

The Holy Grail of data-intensive discovery is correlations that are meaningful.
The Holy Grail of data-intensive discovery is correlations that are accurate and reliable.

Correlation does not imply causation
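
The point that correlation does not imply causation can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not something taken from the talk: the hidden confounder `z`, the noise levels, and the function names are all assumptions made for illustration. Two variables that never influence each other correlate strongly because both depend on `z`; once `x` is assigned at random (a stand-in for an experiment), the correlation should largely disappear.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=5000, seed=0):
    rng = random.Random(seed)
    # Hypothetical hidden confounder z drives both x and y.
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 0.5) for zi in z]
    y = [zi + rng.gauss(0, 0.5) for zi in z]
    # "Experiment": x is assigned at random, independently of z.
    # y is left unchanged because x never caused y in this toy model.
    x_assigned = [rng.gauss(0, 1) for _ in range(n)]
    return pearson(x, y), pearson(x_assigned, y)


observed, intervened = simulate()
print(f"observational correlation: {observed:.2f}")
print(f"correlation after random assignment: {intervened:.2f}")
```

The observational number answers "what might occur"; only the randomized comparison speaks to "why it occurs".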

Page 13: Laws and limits of data science 11 10-14

GRAND OPPORTUNITIES Big Data

Page 14: Laws and limits of data science 11 10-14

Accelerating Scientific Discovery

[Diagram labels: Experiment; Model; Correlations; Hypotheses; Why: Causation (Theory Driven); What: Correlation (Data Driven)]

Page 15: Laws and limits of data science 11 10-14

Accelerating Scientific Discovery

[Diagram labels: Experiment; Model; Correlations; Hypotheses; Why: Causation (Theory Driven); What: Correlation (Data Driven); Watson; Baylor; Scientists]

Wonderful Use Case

Page 16: Laws and limits of data science 11 10-14

Grand Challenges
•  Big Data is in its infancy: 10+ year evolution
   —  Efficiency: expression/language → execution (stack)
   —  Open Data: data use/reuse / sharing
   —  Efficacy

“major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.” Michael Jordan

Page 17: Laws and limits of data science 11 10-14

“wrt to Big Data we’re now at the what are the principles? point in time”. Michael Jordan

Page 18: Laws and limits of data science 11 10-14

What is Data Science @ Scale?

Data Science @ scale is to data-intensive discovery as The Scientific Method is to scientific discovery

Reframe Empiricism*
—  Data Science is the data component of the Scientific Method for data
—  Concepts, tools, and techniques for data-intensive discovery
   •  Data-intensive discovery = virtual experiment
—  Laws and Limits of Data Science

* With Dr. Jennie Duggan, MIT & Northwestern University

Page 19: Laws and limits of data science 11 10-14

First Law of Data Science

Meaning of a correlation requires empirical verification

What is seldom enough
Why is not always necessary

Best Practice #1: Efficacy-driven data discovery

(Efficacy before efficiency)
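
One way to read the First Law is that a correlation found by searching data still has to be checked against fresh observations. The sketch below is an illustration under stated assumptions, not the author's method: every feature and the target are pure synthetic noise, and the sample sizes and names (`discovery_r`, `fresh_r`) are hypothetical. Picking the best of many noise features tends to produce a sizeable correlation on the discovery sample that should not survive re-measurement.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(1)
n_discovery, n_fresh, n_features = 50, 2000, 200


def noise(n):
    """Independent standard-normal draws; no variable depends on any other."""
    return [rng.gauss(0, 1) for _ in range(n)]


# Discovery step: search many unrelated features for the one that
# correlates most strongly with the target on a small sample.
target = noise(n_discovery)
features = [noise(n_discovery) for _ in range(n_features)]
best = max(range(n_features), key=lambda j: abs(pearson(features[j], target)))
discovery_r = pearson(features[best], target)

# Verification step: measure the same (noise) feature and target again
# on fresh data that played no part in the search.
fresh_r = pearson(noise(n_fresh), noise(n_fresh))

print(f"best feature on discovery sample: r = {discovery_r:.2f}")
print(f"same relationship on fresh data:  r = {fresh_r:.2f}")
```

The discovery-sample figure is the "What" that is seldom enough on its own; the fresh-data check is a very small stand-in for empirical verification.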

Page 20: Laws and limits of data science 11 10-14

Second Law of Data Science*

Causality can be determined from correlations only by community-accepted mechanisms and metrics**, e.g., empiricism.

* With Gregory Piatetsky-Shapiro, KDnuggets

** for What and Why

Page 21: Laws and limits of data science 11 10-14

Limits of Data Science

We do not know where our concepts, tools, and techniques break on massive data sets!

Caution: Big Data Winter Potential (Michael Jordan)

Best Practice #2: Experiment + Error bars everywhere
—  Common Practice: not so much

Best Practice #3: Machine-driven, human guided
—  Common Practice: not so much
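
As one possible reading of "error bars everywhere", a reported correlation can carry an interval rather than a single number. The sketch below uses a percentile bootstrap, which is a technique chosen here for illustration and is not named in the slides; the synthetic data, the helper `bootstrap_ci`, and its parameters are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic example data: y depends weakly on x plus noise.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

r = pearson(x, y)
lo, hi = bootstrap_ci(x, y)
print(f"r = {r:.2f}, 95% bootstrap interval = [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside `r` makes it visible when a correlation rests on too little data to be relied on.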

Page 22: Laws and limits of data science 11 10-14

Best Practice Not So Common*
•  BP1: Efficacy-driven data discovery
   —  Best eScience, Journalism, Economics, Computational X, …
   —  Big Data not so much (<5%)
•  BP2: Experiment + Error bars everywhere
   —  Above + Best Data Scientists (~5%, w/scientific, ML, … training)
   —  Big Data (<5%): Customers don’t ask; data scientists don’t practice
•  BP3: Machine-driven, human guided
   —  ~5% strict; 95% not so much, e.g., ~60 Data Curation products
   —  50% partial: supervised / trained
•  Example: based on the above Laws and Best Practices

*Personal un-scientific study, limited data, yet so unbiased and oh so true

Page 23: Laws and limits of data science 11 10-14

Laws of Data Science Less So …

1st Correlations ≠ Causation

Common confusion in science*, more in Data Science, even more in business

2nd Causality (meaning) requires verification by community-accepted norms

Cornerstone of Science, hopefully emerging in Data Science**

* Richard Feynman, 1974
** If #1 is rare, #2 is more so

Page 24: Laws and limits of data science 11 10-14

Conclusions
•  Big Data is in its infancy and is opening the door to …
•  Grand Opportunities
•  Grand Challenges
•  10+ year evolution
•  Data Science ~= Scientific Method For Data
•  Laws of Data Science
   1  Correlations must be verified
   2  Verification relative to community-accepted norms
•  Data Science Best Practices
   1  Efficacy-driven discovery
   2  Experiment + Error Bars everywhere
   3  Machine-Driven – Human Guided
•  Limit of Data Science: we do not know where our tools break
