
Detecting Novel Associations in Large Data Sets


Page 1: Detecting  Novel Associations in Large Data  Sets

Detecting Novel Associations in Large Data Sets

Sean Patrick Murphy [email protected]

A pragmatic discussion of "Detecting Novel Associations in Large Data Sets" by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti

Page 2: Detecting  Novel Associations in Large Data  Sets

Getting Started

• Blog overview - http://theoreticalecology.wordpress.com/2011/12/16/the-maximal-information-coefficient/

• MINE code (Java-based with Python and R wrappers) - http://www.exploredata.net/Downloads/MINE-Application

• MINE homepage - http://www.exploredata.net/

• Science article and supplemental information - http://www.sciencemag.org/content/334/6062/1518.abstract

• Discussion on Andrew Gelman's blog - http://andrewgelman.com/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/

Page 3: Detecting  Novel Associations in Large Data  Sets

So who actually read the paper?

Page 4: Detecting  Novel Associations in Large Data  Sets

Outline

1. Motivation

2. Explanation

3. Application

Page 5: Detecting  Novel Associations in Large Data  Sets

The Problem

• 10,000+ variables

• Hundreds, thousands, or millions of observations

• Your boss wants you to find all possible relationships between all different variable pairs …

• Where do you start?

Motivation

Page 6: Detecting  Novel Associations in Large Data  Sets

Scatter Plots?

Motivation

Page 7: Detecting  Novel Associations in Large Data  Sets

50 variables: 1,225 different scatter plots to examine!
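The 1,225 figure is just the number of unordered pairs, 50 × 49 / 2. A quick check in Python, which also shows how fast the count grows for the 10,000-variable case from the problem statement (the function name is only illustrative):

```python
from math import comb


def pair_count(num_variables: int) -> int:
    """Number of distinct unordered variable pairs (one scatter plot each)."""
    return comb(num_variables, 2)


print(pair_count(50))      # 1225
print(pair_count(10_000))  # 49995000
```

At 10,000 variables there are almost 50 million pairs, which is why eyeballing scatter plots does not scale.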

Motivation

Page 8: Detecting  Novel Associations in Large Data  Sets

Other Options?

• Correlation Matrix

• Factor Analysis/Principal Component Analysis

• Audience recommendations?

Motivation

Page 9: Detecting  Novel Associations in Large Data  Sets

Possible Problems

• A large number of possible relationships

• Each has a different statistical test

• Need to have a hypothesis about the relationship that might be present in the data

Motivation

Page 10: Detecting  Novel Associations in Large Data  Sets

Desired Properties

• Generality – the correlation coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions.

• Equitability – the score of the coefficient should be influenced by noise, but not by the form of the dependency between variables
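To see why generality matters, here is a minimal sketch in plain Python with a hand-rolled Pearson correlation. The two toy relationships are illustrative assumptions, not data from the paper: a noiseless line and a noiseless parabola are both perfect functional relationships, yet Pearson only detects the first.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: x evenly spaced on [-1, 1], symmetric around zero.
xs = [i / 100 for i in range(-100, 101)]
line = [2.0 * x + 1.0 for x in xs]   # noiseless linear relationship
parabola = [x ** 2 for x in xs]      # noiseless nonlinear relationship

r_line = pearson(xs, line)
r_parabola = pearson(xs, parabola)
print(f"line: {r_line:.3f}, parabola: {r_parabola:.3f}")
```

The line scores essentially 1 and the parabola essentially 0, even though neither contains any noise. A general, equitable statistic should score both highly.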

Motivation

Page 11: Detecting  Novel Associations in Large Data  Sets

Enter the Maximal Information Coefficient (MIC)

Explanation

Page 12: Detecting  Novel Associations in Large Data  Sets

Algorithm Intuition

Explanation

Page 13: Detecting  Novel Associations in Large Data  Sets

[Figure: dataset D plotted on x and y axes]

We have a dataset D

Explanation

Page 14: Detecting  Novel Associations in Large Data  Sets

Explanation

Page 15: Detecting  Novel Associations in Large Data  Sets

Definition of mutual information (for discrete random variables)
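For reference, the standard textbook definition for discrete X and Y is I(X;Y) = Σ p(x,y) log[ p(x,y) / (p(x) p(y)) ], summed over all (x, y) cells. A minimal sketch of that formula applied to a grid of cell counts follows. The use of log base 2 (bits) and the function name are assumptions made for illustration.

```python
from math import log2


def mutual_information(counts):
    """Mutual information (in bits) of a grid given as rows of cell counts."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, cell in enumerate(row):
            if cell == 0:
                continue  # empty cells contribute nothing to the sum
            # p(x,y) / (p(x) p(y)) simplifies to cell * total / (row * col)
            ratio = cell * total / (row_totals[i] * col_totals[j])
            mi += (cell / total) * log2(ratio)
    return mi


# All points fall on the diagonal: knowing the row fixes the column.
print(mutual_information([[5, 0], [0, 5]]))      # 1.0
# Points spread evenly: the row says nothing about the column.
print(mutual_information([[25, 25], [25, 25]]))  # 0.0
```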

Explanation

Page 16: Detecting  Novel Associations in Large Data  Sets

[Figure: three candidate grids over D, with MI = 0.5, MI = 0.6, and MI = 0.7]

Maximum mutual information

Explanation

Page 17: Detecting  Novel Associations in Large Data  Sets

Characteristic Matrix

Explanation

We have to normalize by min {log x, log y}, where x and y are the numbers of bins on each axis of the grid, to enable comparison across grids of different sizes.
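Since log is increasing, min {log x, log y} equals log min {x, y}, which is also the largest mutual information an x-by-y grid can carry. A small sketch of the normalization, assuming mutual information is measured in bits (log base 2); the function name is illustrative:

```python
from math import log2


def characteristic_entry(max_mi_bits, x_bins, y_bins):
    """Normalize the best mutual information found for an x-by-y grid to [0, 1]."""
    if x_bins < 2 or y_bins < 2:
        raise ValueError("each axis needs at least 2 bins")
    return max_mi_bits / log2(min(x_bins, y_bins))


# With 2 bins on the smaller axis the divisor is log2(2) = 1.
print(characteristic_entry(0.7, 2, 2))
# A 3-by-4 grid can carry at most log2(3) bits, which normalizes to about 1.
print(characteristic_entry(log2(3), 3, 4))
```

Without this step, finer grids would win simply because they can carry more bits.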

Page 18: Detecting  Novel Associations in Large Data  Sets

2x3

Explanation

[Figure: three candidate 2x3 grids over D, with MI = 0.65, MI = 0.56, and MI = 0.71]

Page 19: Detecting  Novel Associations in Large Data  Sets

Characteristic Matrix

Explanation

Page 20: Detecting  Novel Associations in Large Data  Sets

Characteristic Matrix

Explanation

Page 21: Detecting  Novel Associations in Large Data  Sets

This highest value is the Maximal Information Coefficient (MIC)

This surface is just a 3D representation of the characteristic matrix.

1. Every entry of the characteristic matrix is between 0 and 1, inclusive

2. MIC(X,Y) = MIC(Y,X) – symmetric

3. MIC is invariant under order-preserving transformations of the axes

Explanation

Page 22: Detecting  Novel Associations in Large Data  Sets

How Big is the Characteristic Matrix?

• Technically, infinite in size

• This is unwieldy

• So we set a bound on the grid size: xy < B(n) = n^0.6, where n = number of data points

• This is an empirically set value
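To get a feel for what the bound allows, this sketch lists every grid shape (x, y) with at least 2 bins per axis and xy < n^0.6. The function name and the 2-bin minimum are assumptions made for illustration.

```python
def allowed_grids(n, exponent=0.6):
    """All (x_bins, y_bins) shapes with x, y >= 2 and x * y < n ** exponent."""
    bound = n ** exponent
    limit = int(bound)
    return [
        (x, y)
        for x in range(2, limit + 1)
        for y in range(2, limit + 1)
        if x * y < bound
    ]


grids = allowed_grids(100)   # 100 ** 0.6 is about 15.85
print(len(grids))            # 16
print(max(grids, key=lambda g: g[0] * g[1]))
```

For 100 data points only 16 grid shapes need to be searched, and the list is symmetric in x and y.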

Explanation

Page 23: Detecting  Novel Associations in Large Data  Sets

How Do We Compute the Maximum Information for a Particular xy Grid?

• Heuristic-based, dynamic programming

• Pseudo-code in supplemental materials

• Only an approximate solution, but it seems to work

• Authors acknowledge that a better algorithm should be found

• At the moment, mostly irrelevant, as the authors have released a Java implementation of the algorithm
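For intuition only, here is a much cruder stand-in for the search. This is not the authors' dynamic-programming heuristic. It tries a single equal-frequency (rank-based) grid per shape instead of optimizing the grid lines, so it can only underestimate the true MIC. All function names are hypothetical; the bound xy < n^0.6 is the one described earlier.

```python
from math import log2


def equal_frequency_bins(values, num_bins):
    """Bin index (0..num_bins-1) for each value, by rank; ties split by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * num_bins // len(values)
    return bins


def grid_mi(x_bins, y_bins):
    """Mutual information (bits) between two sequences of bin labels."""
    n = len(x_bins)
    joint, px, py = {}, {}, {}
    for a, b in zip(x_bins, y_bins):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(
        c / n * log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items()
    )


def crude_mic(xs, ys, exponent=0.6):
    """Max normalized MI over equal-frequency grids with x * y < n ** exponent."""
    bound = len(xs) ** exponent
    best = 0.0  # stays 0.0 if n is too small for even a 2-by-2 grid
    x_count = 2
    while x_count * 2 < bound:
        xb = equal_frequency_bins(xs, x_count)
        y_count = 2
        while x_count * y_count < bound:
            yb = equal_frequency_bins(ys, y_count)
            score = grid_mi(xb, yb) / log2(min(x_count, y_count))
            best = max(best, score)
            y_count += 1
        x_count += 1
    return best


xs = list(range(100))
print(crude_mic(xs, [2 * x + 1 for x in xs]))  # noiseless line: close to 1
```

Even this crude version scores a noiseless monotone relationship at about 1. The real algorithm earns its keep on non-monotone shapes, where the placement of grid lines matters.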

Explanation

Page 24: Detecting  Novel Associations in Large Data  Sets

Useful Properties of the MIC Statistic

With probability approaching 1 as sample size grows:

(i) MIC assigns scores that tend to 1 for all never-constant noiseless functional relationships

(ii) MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships)

(iii) MIC assigns scores that tend to 0 to statistically independent variables

Application

Page 25: Detecting  Novel Associations in Large Data  Sets

MIC

Application

Page 26: Detecting  Novel Associations in Large Data  Sets

Application

Page 27: Detecting  Novel Associations in Large Data  Sets

So what does the MIC mean?

• Uncorrected p-value tables are available to download for various sample sizes of data

• The null hypothesis is that the variables are statistically independent

• http://www.exploredata.net/Downloads/P-Value-Tables
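Because the tables are uncorrected, screening many variable pairs at once usually calls for a multiple-comparison correction. A minimal sketch using the Bonferroni correction follows. It is one conservative, standard choice, and neither the slides nor the tables prescribe a method. The pair names and p-values are made up for illustration.

```python
def significant_pairs(p_values, alpha=0.05):
    """Pairs whose uncorrected p-value beats the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [pair for pair, p in p_values.items() if p < threshold]


# Hypothetical uncorrected p-values for three variable pairs.
example = {("a", "b"): 0.001, ("a", "c"): 0.02, ("b", "c"): 0.2}
print(significant_pairs(example))  # [('a', 'b')]
```

With all 1,225 pairs from 50 variables, the same rule would require p < 0.05 / 1225, or roughly 4 × 10^-5, for a single pair.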

Application

Page 28: Detecting  Novel Associations in Large Data  Sets

MINE = Maximal Information-based Nonparametric Exploration

Hopefully this part is self-explanatory now

Nonparametric vs parametric could be a session unto itself.

Here, we do not rely on assumptions that the data in question are drawn from a specific probability distribution (such as the normal distribution).

Application

MINE statistics leverage the extra information captured by the characteristic matrix to offer more insight into the relationships between variables.

Page 29: Detecting  Novel Associations in Large Data  Sets

Minimum Cell Number (MCN) - measures the complexity of an association in terms of the number of cells required

Application

Maximum Edge Value (MEV <= MIC) – measures closeness to being a function (vertical line test)

Maximum Asymmetry Score (MAS <= MIC) – measures deviations from monotonicity
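As a rough sketch of how these statistics can be read off a characteristic matrix, the code below follows the definitions as I understand them from the paper's supplemental material. MEV is the largest entry with 2 bins on some axis, MAS is the largest gap between an entry and its transpose, and MCN is the smallest log2(xy) whose entry reaches (1 - eps) × MIC. The matrix values and all names are hypothetical.

```python
from math import log2


def mine_statistics(matrix, eps=0.0):
    """MINE statistics from a dict mapping (x_bins, y_bins) to a score in [0, 1]."""
    mic = max(matrix.values())
    # MEV: best score among grids that have only 2 bins on one axis.
    mev = max((v for (x, y), v in matrix.items() if x == 2 or y == 2), default=0.0)
    # MAS: largest difference between an entry and its transposed entry.
    mas = max(
        (abs(v - matrix[(y, x)]) for (x, y), v in matrix.items() if (y, x) in matrix),
        default=0.0,
    )
    # MCN: log2 of the fewest cells needed to get within eps of MIC.
    mcn = min(log2(x * y) for (x, y), v in matrix.items() if v >= (1 - eps) * mic)
    return {"MIC": mic, "MEV": mev, "MAS": mas, "MCN": mcn}


# Hypothetical characteristic matrix for a single variable pair.
toy_matrix = {(2, 2): 0.4, (2, 3): 0.7, (3, 2): 0.5, (3, 3): 0.9}
stats = mine_statistics(toy_matrix)
print(stats)
```

Because MEV and MAS are built from entries (or differences of entries) of the same matrix, neither can exceed MIC, which matches the inequalities above.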

Page 30: Detecting  Novel Associations in Large Data  Sets

Application

MAS – monotonicity

MEV – vertical line test

MCN – complexity

Page 31: Detecting  Novel Associations in Large Data  Sets

Application

http://www.exploredata.net/Usage-instructions

this takes too long … change it first

R: MINE("MLB2008.csv", "one.pair", var1.id=2, var2.id=12)

Java: java -jar MINE.jar MLB2008.csv -onePair 2 12

Seeks relationships between salary and home runs, 338 pairs

Usage

Page 32: Detecting  Novel Associations in Large Data  Sets

Notes

• Does not work on textual data (must be numeric)

• Long execution times

• Outputs MIC and other mentioned MINE statistics, not the Characteristic Matrix

• Output is .csv, a row per variable pair

Application

Page 33: Detecting  Novel Associations in Large Data  Sets

Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License

You are free to:

• copy, distribute and transmit the work

With the following conditions:

• Attribution – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

• Noncommercial – You may not use this work for commercial purposes.

• No Derivative Works – You may not alter, transform, or build upon this work.

Application

Page 34: Detecting  Novel Associations in Large Data  Sets

Now What? Data Triage Pipeline

Application

Complex Data Set -> MIC -> Ranked list of variable relationships to examine in more depth with the tool(s) of your choice
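The ranking step can be as simple as sorting the per-pair output by MIC. A minimal sketch with the standard csv module follows. The column names (var1, var2, MIC) and the sample rows are hypothetical placeholders, not the actual MINE output format or real results.

```python
import csv
import io

# Made-up rows standing in for a results file with one row per variable pair.
SAMPLE_RESULTS = """var1,var2,MIC
a,b,0.21
a,c,0.87
b,c,0.55
"""


def rank_pairs(csv_text, top=10):
    """Return the top variable pairs as (var1, var2, MIC), strongest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["MIC"]), reverse=True)
    return [(row["var1"], row["var2"], float(row["MIC"])) for row in rows[:top]]


for var1, var2, mic in rank_pairs(SAMPLE_RESULTS, top=2):
    print(f"{var1} vs {var2}: MIC = {mic:.2f}")
```

The pairs at the top of the list are the ones worth a scatter plot and a closer look with more specialized tools.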

Page 35: Detecting  Novel Associations in Large Data  Sets

Lingering Questions

• Can this be extended to higher-dimensional relationships?

• Just how approximate is the current MIC algorithm?

• Who wants to develop an open source implementation?

• What other MINE statistics are waiting for discovery?

• Execution time – the algorithm is embarrassingly parallel – easily HADOOPified

• Many tests reported by the paper only introduced vertical noise into the data?

• There is also some question as to its power vs Pearson and Dcor (http://www-stat.stanford.edu/~tibs/reshef/comment.pdf)

Page 36: Detecting  Novel Associations in Large Data  Sets

Comment by N. Simon and R. Tibshirani

http://www-stat.stanford.edu/~tibs/reshef/script.R

[Figure: several panels plotting Power (y-axis) against Noise Level (x-axis)]

Page 37: Detecting  Novel Associations in Large Data  Sets
Page 38: Detecting  Novel Associations in Large Data  Sets

Backup Slides

Page 39: Detecting  Novel Associations in Large Data  Sets