Correlations Genomic Data

8/7/2019 Correlations Genomic Data

1/41

David Brody

Maya Krishnan

Rajiv McCoy

Gourab Mukherjee

Identifying significant correlations between sets of genomic data


2/41

Key issues in finding relationships

between tracks of genomic data

From a statistical standpoint, how do you find these

relationships?

How do you know if the relationships are statistically

significant?

How can you generalize the methods you use so they will

work for multiple types of data?


3/41

DNA methylation may physically impede

transcriptional machinery or recruit proteins

that modify histones (packaging proteins) in a

way that alters transcription.

One class of histone modification that can

influence transcription is methylation. One example of a transcription-activating histone

modification is the trimethylation of the 4th

lysine residue of the 3rd histone, abbreviated

as H3K4me3. Annotated active promoters

show peaks of this methylation signature, but

enhancers do not.

Heintzman et al.

Nature Genetics

2007;39:311-318

Roh T et al. PNAS 2006;103:15782-15787

Histone methylation H3K4me3 is

associated with active transcription


4/41

CpG islands are regions of the genome with a high density of cytosine bases

followed by guanine bases. Such regions are chemically unstable, and

therefore evolutionary pressure is required to maintain them.

Most human promoters contain CpG islands that lack DNA methylation

and are enriched for histone modifications characteristic of active transcription, such as H3K4me3.

Recent studies have shown that methylation-free CpGs act as a signal

to recruit the binding protein Cfp1 which is involved in H3K4 methylation

(Bird et al. 2010).

CpG islands lack DNA methylation

signaling H3K4me3 at these promoters


5/41

Function

Measurement

+ + - +

Given a function track and a measurement track

Preprocessing of tracks for statistical tests


6/41

Function

Measurement

+ + - +




7/41

Function

Measurement

Binning 10,000K / 200

+ + - +




8/41

Function

Measurement

+ + - +





9/41

Function

Measurement

+ + - +


Binning 100,000K / 200

3 3 2 1
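The binning step shown in these slides can be sketched in Python. This is a minimal sketch under stated assumptions, not the presenters' code: the names `upstream_window` and `count_per_bin` are hypothetical, coordinates are assumed to be 0-based and half-open, the "+" and "-" labels are read as gene strands, and the second parameter (200) is read as the length of the upstream window.

```python
def upstream_window(gene_start, strand, padding):
    """Return the (start, end) window of `padding` bp upstream of a gene start.

    Assumption: on the "-" strand, upstream lies at higher coordinates.
    """
    if strand == "+":
        return (max(0, gene_start - padding), gene_start)
    return (gene_start, gene_start + padding)


def count_per_bin(intervals, genome_length, bin_size):
    """Count, for each fixed-size bin, how many intervals touch it."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start, end in intervals:
        if end <= start:
            continue  # skip empty intervals
        first = max(0, start) // bin_size
        last = min((end - 1) // bin_size, n_bins - 1)
        for b in range(first, last + 1):
            counts[b] += 1
    return counts


# Toy data (hypothetical): four gene starts with strands, on a 40 kb sequence.
genes = [(1_200, "+"), (9_900, "+"), (25_000, "-"), (31_000, "+")]
windows = [upstream_window(pos, strand, 200) for pos, strand in genes]
function_bins = count_per_bin(windows, genome_length=40_000, bin_size=10_000)
print(function_bins)
```

A bin counts as a "hit" for a track whenever its count is non-zero, which is the form the later tests appear to use.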




10/41

Code Verification


0 1 3 1 0

Output


11/41

Given a function track and two measurement tracks


Function

Measurement 1

+ + - +

Measurement 2


12/41

Function

Measurement 1

+ + - +

Measurement 2

Binning 100,000K / 200

Given a function track and two measurement tracks


2 1 0 3


13/41

Enrichment over assumption of

independence

For all bins, count:

how many have H3K4me3 markers

how many have promoters

how many have both

Do more bins have both H3K4me3 and the promoters than we

would expect if the two factors were independent?
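The comparison described above can be sketched as follows. This is an illustrative sketch only: the function name `independence_enrichment` and the toy counts are assumptions, and the slides do not state which test produced the reported p-values, so no p-value is computed here.

```python
def independence_enrichment(n_bins, n_marker, n_promoter, n_both):
    """Compare observed co-occurrence with the independence expectation.

    Returns (expected, observed, fold), where
      expected = P(marker) * P(promoter)   # co-occurrence if independent
      observed = fraction of bins that contain both
      fold     = observed / expected
    """
    p_marker = n_marker / n_bins
    p_promoter = n_promoter / n_bins
    expected = p_marker * p_promoter
    observed = n_both / n_bins
    return expected, observed, observed / expected


# Toy counts (hypothetical): 1000 bins, 100 with markers, 200 with promoters,
# and 50 with both.
expected, observed, fold = independence_enrichment(1000, 100, 200, 50)
print(f"expected={expected:.3f} observed={observed:.3f} fold={fold:.1f}")
```

A fold value above 1 means that more bins contain both features than independence would predict.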


14/41


Enrichment over assumption of independence

Bin Size   Padding Size   P(co if indep)   P(co in reality)   p-value
10000      200            0.013            0.040              2.28 × 10^(-3389)
10000      500            0.014            0.041              2.91 × 10^(-3462)
10000      1000           0.014            0.042              9.26 × 10^(-3539)
50000      200            0.117            0.200              5.19 × 10^(-865)
50000      500            0.118            0.200              7.76 × 10^(-867)
50000      1000           0.118            0.201              3.61 × 10^(-866)
100000     200            0.241            0.343              3.95 × 10^(-371)
100000     500            0.242            0.343              2.58 × 10^(-371)
100000     1000           0.242            0.344              1.23 × 10^(-370)


15/41


Enrichment over assumption of independence

[Figure: Difference between assumption and reality. P(Indep) and P(Reality) plotted for trials 1 to 9.]


16/41

Impact that varying input parameters has

on final results

Bin size   Padding size   Upstream hits   Histone marks   Overlaps
10000      200            40264           30508           11976
10000      500            40264           31308           12270
10000      1000           40264           32596           12695
50000      200            21808           19641           12073
50000      500            21808           19702           12107
50000      1000           21808           19823           12162
100000     200            15627           14096           10353
100000     500            15627           14115           10365
100000     1000           15627           14157           10387

SUMMARY

- Bin size affects the number of upstream hits and histone marks, as would be expected, but does not seem to have a large impact on the number of overlaps

- Padding size does not have much of an effect on the number of overlaps detected


17/41


Impact that varying input parameters has on final results

[Figure: number of overlaps detected (y-axis, 0 to 14,000) versus bin size (x-axis, 0 to 120,000).]


18/41


Impact that varying input parameters has on final results

[Figure: Bin size versus overlap percentage. Overlap percentage (y-axis, 0 to 0.4) versus bin size (x-axis, 0 to 120,000).]


19/41

Foldwise enrichment to further explore

low p-values and impact of bin size

Abandons binning approach

Examines percentage of genome covered by H3K4me3 markers

and/or upstream region

Conceptually similar to previous enrichment, but more direct

Base-pair-wise intersection

Compare percentage expected and actual percentage
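A base-pair-wise version of the comparison might look like the sketch below. It is a sketch under stated assumptions rather than the presenters' implementation: the helper names (`merge`, `covered_bases`, `intersect_bases`, `fold_enrichment`) are hypothetical, and intervals are assumed to be 0-based, half-open (start, end) pairs on a single sequence.

```python
def merge(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Number of distinct base pairs covered by a track."""
    return sum(end - start for start, end in merge(intervals))


def intersect_bases(track_a, track_b):
    """Number of base pairs covered by both tracks (two-pointer sweep)."""
    a, b = merge(track_a), merge(track_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def fold_enrichment(track_a, track_b, genome_length):
    """Return (expected, observed, fold) as fractions of the genome."""
    expected = (covered_bases(track_a) / genome_length) * (
        covered_bases(track_b) / genome_length
    )
    observed = intersect_bases(track_a, track_b) / genome_length
    return expected, observed, observed / expected


# Toy tracks (hypothetical coordinates) on a 100 bp sequence.
markers = [(0, 10), (5, 20)]
upstream = [(15, 30), (40, 50)]
print(fold_enrichment(markers, upstream, 100))
```

Merging first keeps overlapping features within one track from being counted twice.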


20/41

Results of performing foldwise

enrichment on two tracks

Upstream length   Upstream bases   Histone bases   Both bases
200               7555840          120425900       71978675
500               18142155         120425900       72108300
1000              34863595         120425900       72268775

Upstream length   % Indep    % Reality
200               %0.00001   %0.02316
500               %0.00023   %0.02320
1000              %0.00043   %0.02325


21/41

Expanding test to analyze two tracks of

measurement data

This test can be expanded to test for a correlation between

multiple tracks of functional data

Are they pairwise independent? And does P(A ∩ B ∩ C) = P(A)P(B)P(C)?

In this example:

CpG islands and H3K4me3 markers are the two measurement

tracks

Upstream regions form the functional track
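The pairwise and three-way checks can be sketched on binned tracks as follows. This is an illustrative sketch: the function name `joint_vs_independent` and the toy bin vectors are assumptions, and each track is assumed to be a list of 0/1 flags with one entry per bin.

```python
import math
from itertools import combinations


def joint_vs_independent(tracks):
    """For every pair and triple of tracks, compare the observed joint
    frequency with the product of the marginal frequencies.

    `tracks` maps a track name to a list of 0/1 flags, one per bin.
    Returns {combination: (expected_if_independent, observed)}.
    """
    names = list(tracks)
    n_bins = len(tracks[names[0]])
    marginal = {name: sum(flags) / n_bins for name, flags in tracks.items()}
    results = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            both = sum(all(tracks[name][i] for name in combo) for i in range(n_bins))
            expected = math.prod(marginal[name] for name in combo)
            results[combo] = (expected, both / n_bins)
    return results


# Toy bins (hypothetical): two measurement tracks and one function track.
tracks = {
    "M1": [1, 1, 0, 0],
    "M2": [1, 0, 1, 0],
    "F": [1, 1, 1, 0],
}
for combo, (expected, observed) in joint_vs_independent(tracks).items():
    print(" & ".join(combo), expected, observed)
```

Pairwise independence alone does not imply the three-way product rule, which is why both are checked.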


22/41

Results of test between H3K4me3, CpG

islands, and upstream regions

Parameter        M1 & M2 (Indep)   M1 & M2 (Reality)
(10000, 200)     0.00850           0.04185
(10000, 500)     0.00850           0.04185
(10000, 1000)    0.00850           0.04185
(50000, 200)     0.09285           0.19181
(50000, 500)     0.09285           0.19181
(50000, 1000)    0.09285           0.19181

Parameter        M1 and F (Indep)   M1 and F (Reality)
(10000, 200)     0.01117            0.03437
(10000, 500)     0.01142            0.03513
(10000, 1000)    0.01181            0.03633
(50000, 200)     0.11239            0.19975
(50000, 500)     0.11264            0.20031
(50000, 1000)    0.11322            0.20122

Parameter        M2 and F (Indep)   M2 and F (Reality)
(10000, 200)     0.00626            0.03329
(10000, 500)     0.00640            0.03376
(10000, 1000)    0.00662            0.03421
(50000, 200)     0.08012            0.18621
(50000, 500)     0.08030            0.18648
(50000, 1000)    0.08071            0.18697

Parameter        M1 and M2 and F (Indep)   M1 and M2 and F (Reality)
(10000, 200)     0.00077                   0.02408
(10000, 500)     0.00078                   0.02437
(10000, 1000)    0.00081                   0.02470
(50000, 200)     0.02891                   0.15121
(50000, 500)     0.02898                   0.15140
(50000, 1000)    0.02913                   0.15177



23/41

[Figure: M1 & F: Assumption vs. Reality. Assumption and Reality plotted for trials 1 and 2; y-axis 0 to 0.25.]

[Figure: M1 & M2: Assumption vs. Reality. Assumption and Reality plotted for trials 1 and 2; y-axis 0 to 0.25.]




24/41

[Figure: M2 & F: Assumption vs. Reality. Assumption and Reality plotted for trials 1 and 2; y-axis 0 to 0.2.]

[Figure: M1 & M2 & F: Assumption vs. Reality. Assumption and Reality plotted for trials 1 and 2; y-axis 0 to 0.16.]




25/41

Hypergeometric test for two tracks

Function track: upstream window of known genes

Measurement track: peaks of H3K4me3

k = # bins with both H3K4me3 peak and upstream region of known gene

n = # bins with at least one H3K4me3 peak

K = # bins with upstream region of a known gene

N = total # bins

bin size   upstream window   corrected p-value
10 Kb      200 bp            6.292483e-275
10 Kb      500 bp            7.351025e-283
10 Kb      1000 bp           2.413367e-299
50 Kb      200 bp            4.425021e-170
50 Kb      500 bp            8.912015e-172
50 Kb      1000 bp           2.539715e-171
100 Kb     200 bp            4.538280e-119
100 Kb     500 bp            3.354191e-120
100 Kb     1000 bp           1.425142e-119
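With k, n, K and N defined as on this slide, the upper-tail hypergeometric probability P(X >= k) can be computed in log space, which matters because p-values this small underflow ordinary floating point. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the slides do not say which correction produced the "corrected" p-values, so none is applied here.

```python
import math


def log_choose(n, r):
    """Natural log of the binomial coefficient C(n, r)."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def hypergeom_log10_pvalue(k, n, K, N):
    """log10 of P(X >= k), where X counts hits among n bins drawn without
    replacement from N bins, K of which are hits."""
    lowest = max(0, n + K - N)   # smallest possible overlap
    highest = min(n, K)          # largest possible overlap
    if k > highest:
        return float("-inf")
    k = max(k, lowest)
    log_total = log_choose(N, n)
    log_terms = [
        log_choose(K, i) + log_choose(N - K, n - i) - log_total
        for i in range(k, highest + 1)
    ]
    # Log-sum-exp keeps the sum stable when every term is tiny.
    peak = max(log_terms)
    log_tail = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return min(0.0, log_tail) / math.log(10)


# Toy counts (hypothetical): 10 bins, 5 with upstream regions, 5 with peaks,
# and 4 with both.
print(hypergeom_log10_pvalue(k=4, n=5, K=5, N=10))
```

Working with log10 values lets results far below the smallest representable float be reported directly as exponents.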


26/41

Hypergeometric test for three tracks

Function track: upstream window of known genes (proxy for promoter)

Measurement track 1: peaks of H3K4me3

Measurement track 2: CpG islands

Based on restricting counts to cases where the function track is a hit.

k = # bins with H3K4me3 peak, CpG island, and upstream region of known gene

n = # bins with H3K4me3 and upstream region of known gene

K = # bins with CpG island and upstream region of a known gene

N = total # bins with upstream regions of known gene

bin size   upstream window   corrected p-value
10 Kb      200 bp            6.69E-200
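The restriction to bins where the function track is a hit can be sketched as a single pass over the binned tracks. This is a sketch under stated assumptions: the function name `restricted_counts` is hypothetical, and each track is assumed to be a list of 0/1 flags with one entry per bin.

```python
def restricted_counts(h3k4me3, cpg, upstream):
    """Return (k, n, K, N) counted only over bins where `upstream` is a hit."""
    k = n = K = N = 0
    for has_peak, has_cpg, has_upstream in zip(h3k4me3, cpg, upstream):
        if not has_upstream:
            continue  # bins without an upstream region are excluded entirely
        N += 1
        n += bool(has_peak)
        K += bool(has_cpg)
        k += bool(has_peak and has_cpg)
    return k, n, K, N


# Toy bins (hypothetical): five bins, three of which contain an upstream region.
h3k4me3 = [1, 1, 0, 1, 0]
cpg = [1, 0, 1, 1, 1]
upstream = [1, 1, 1, 0, 0]
print(restricted_counts(h3k4me3, cpg, upstream))
```

The resulting four counts play the same roles as k, n, K and N in a two-track hypergeometric test, only over the reduced set of bins.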


27/41

+ + - +

Measurement

Function

Region Overlap & Non-linearity


28/41

+ + - +


Function

Region

overlap

Region

overlap



29/41

+ + - +


Function

Region

overlap

Region

overlap



30/41

Segmentation


31/41

Segmentation

Min segmentation

length

Min segmentation

length


32/41

Block-wise Sub-sampling


33/41


Select Segment


34/41



Select Sub-segment


35/41



Select Sub-segment

Repeat it and rescale to get null distribution

Calculate Test Statistics
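The slides give only the outline of block-wise sub-sampling, so the sketch below is one plausible reading, not the presenters' procedure. Assumptions: each track is a list of 0/1 flags per position within one segment, the test statistic is the observed overlap fraction minus the product of the marginal fractions, and the spread of the statistic over random sub-segments is rescaled by the factor block_len / n to approximate its spread over the full segment. All function names are hypothetical.

```python
import math
import random


def overlap_stat(a, b):
    """Observed joint fraction minus the product of the marginal fractions."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    return p_ab - p_a * p_b


def block_subsample_z(a, b, block_len, n_blocks, rng):
    """Return (z, full_stat, sd_full) using random contiguous sub-segments."""
    n = len(a)
    full_stat = overlap_stat(a, b)
    block_stats = []
    for _ in range(n_blocks):
        start = rng.randrange(0, n - block_len + 1)  # select a sub-segment
        block_stats.append(
            overlap_stat(a[start:start + block_len], b[start:start + block_len])
        )
    mean = sum(block_stats) / n_blocks
    var_block = sum((s - mean) ** 2 for s in block_stats) / (n_blocks - 1)
    # Rescale: variance of an average-type statistic shrinks with length.
    sd_full = math.sqrt(var_block * block_len / n)
    z = full_stat / sd_full if sd_full > 0 else float("inf")
    return z, full_stat, sd_full


# Toy tracks (hypothetical): a random track compared with itself.
rng = random.Random(0)
track = [1 if rng.random() < 0.3 else 0 for _ in range(2000)]
z, full_stat, sd_full = block_subsample_z(track, track, 200, 50, random.Random(1))
print(z > 0)
```

Sampling contiguous blocks rather than individual positions is meant to preserve local dependence along the genome, which independent-bin tests ignore.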


36/41

Googol^{-1} P-Values !!


37/41

Expected P-Values


38/41

Expected P-Values (n=2600)


39/41



40/41



41/41

Conclusion: finding relationships

between tracks of genomic data

The tests we implemented successfully identify significantly

correlated tracks

Given a statistical test and a p-value, able to determine the expected correlation

Using different preprocessing techniques the statistical tests

can be extended for multiple tracks to identify new biological

associations
