Correlations Genomic Data

  • 8/7/2019 Correlations Genomic Data

    David Brody
    Maya Krishnan
    Rajiv McCoy
    Gourab Mukherjee

    Identifying significant correlations between sets of genomic data

    Key issues in finding relationships between tracks of genomic data

    From a statistical standpoint, how do you find these relationships?

    How do you know if the relationships are statistically significant?

    How can you generalize the methods you use so they will work for multiple types of data?

    Histone methylation H3K4me3 is associated with active transcription

    DNA methylation may physically impede transcriptional machinery or recruit proteins that modify histones (packaging proteins) in a way that alters transcription.

    One class of histone modification that can influence transcription is methylation. One example of a transcription-activating histone modification is the trimethylation of the 4th lysine residue of histone H3, abbreviated as H3K4me3. Annotated active promoters show peaks of this methylation signature, but enhancers do not.

    Heintzman et al. Nature Genetics 2007;39:311-318

    Roh T et al. PNAS 2006;103:15782-15787

    CpG islands lack DNA methylation, signaling H3K4me3 at these promoters

    CpG islands are regions of the genome rich in CpG sites, where a cytosine base is followed by a guanine base. Such regions are chemically unstable, and therefore evolutionary pressure is required to maintain them.

    Most human promoters contain CpG islands that lack DNA methylation and are enriched for histone modifications characteristic of active transcription, such as H3K4me3.

    Recent studies have shown that methylation-free CpGs act as a signal to recruit the binding protein Cfp1, which is involved in H3K4 methylation (Bird et al. 2010).

    Preprocessing of tracks for statistical tests

    Given a function track and a measurement track

    [Figure: a Function track with strand annotations (+ + - +) drawn above a Measurement track. Labels: "Binning 10,000K / 200" and "Binning 100,000K / 200". Example per-bin counts: 3 3 2 1]
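
    The binning step can be sketched roughly as follows. This is a minimal sketch and not the authors' code: the helper names `upstream_window` and `bin_counts`, the half-open (start, end) interval representation, and the toy coordinates are all assumptions made for illustration. The slides only state that the function track carries strand annotations and that tracks are binned with a bin size and a padding size.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs


def upstream_window(tss: int, strand: str, padding: int) -> Interval:
    """Return the window of `padding` bases upstream of a gene start.

    Upstream means lower coordinates on the + strand and higher
    coordinates on the - strand.
    """
    if strand == "+":
        return (max(0, tss - padding), tss)
    return (tss, tss + padding)


def bin_counts(intervals: List[Interval], bin_size: int, n_bins: int) -> List[int]:
    """Count, for every bin, how many intervals overlap it."""
    counts = [0] * n_bins
    for start, end in intervals:
        if end <= start:
            continue  # skip empty intervals
        first = start // bin_size
        last = (end - 1) // bin_size
        for b in range(first, min(last, n_bins - 1) + 1):
            counts[b] += 1
    return counts


# Hypothetical toy data: four gene starts with strands + + - +
genes = [(150, "+"), (420, "+"), (600, "-"), (950, "+")]
windows = [upstream_window(tss, strand, padding=50) for tss, strand in genes]
function_bins = bin_counts(windows, bin_size=250, n_bins=4)

# Hypothetical measurement peaks; the third one straddles two bins
peaks = [(90, 130), (400, 430), (480, 520), (900, 960)]
measurement_bins = bin_counts(peaks, bin_size=250, n_bins=4)
```

    A feature that straddles a bin boundary is counted in every bin it touches, which is one of several reasonable conventions; counting it only in the bin holding its midpoint would be another.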

    Preprocessing of tracks for statistical tests

    Code Verification

    Output: 0 1 3 1 0

    Preprocessing of tracks for statistical tests

    Given a function track and two measurement tracks

    [Figure: a Function track with strand annotations (+ + - +) drawn above the Measurement 1 and Measurement 2 tracks. Label: "Binning 100,000K / 200". Example per-bin counts: 2 1 0 3]

    Enrichment over assumption of independence

    For all bins, count:

    how many have H3K4me3 markers
    how many have promoters
    how many have both

    Do more bins have both H3K4me3 and promoters than we would expect if the two factors were independent?
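
    The comparison described above can be sketched as follows. This is an illustrative sketch under assumed inputs: the function name `cooccurrence` and the ten toy bins are hypothetical, and each track is assumed to have already been reduced to one True/False value per bin.

```python
from typing import Sequence, Tuple


def cooccurrence(a: Sequence[bool], b: Sequence[bool]) -> Tuple[float, float]:
    """Return (co-occurrence expected under independence, observed co-occurrence)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("tracks must be non-empty and equally long")
    p_a = sum(a) / n  # fraction of bins with feature A
    p_b = sum(b) / n  # fraction of bins with feature B
    p_both = sum(1 for x, y in zip(a, b) if x and y) / n
    return p_a * p_b, p_both


# Hypothetical per-bin indicators (True = the bin contains the feature)
has_h3k4me3 = [True, True, False, False, True, False, False, False, False, False]
has_promoter = [True, True, False, False, False, True, False, False, False, False]

expected, observed = cooccurrence(has_h3k4me3, has_promoter)
```

    In this toy case each feature occupies 3 of 10 bins, so independence predicts co-occurrence in 9% of bins, while 2 of 10 bins (20%) actually contain both.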

    Enrichment over assumption of independence

    Bin Size   Padding Size   P(co if indep)   P(co in reality)   p-value
    10000      200            0.013            0.040              2.28 x 10^(-3389)
    10000      500            0.014            0.041              2.91 x 10^(-3462)
    10000      1000           0.014            0.042              9.26 x 10^(-3539)
    50000      200            0.117            0.200              5.19 x 10^(-865)
    50000      500            0.118            0.200              7.76 x 10^(-867)
    50000      1000           0.118            0.201              3.61 x 10^(-866)
    100000     200            0.241            0.343              3.95 x 10^(-371)
    100000     500            0.242            0.343              2.58 x 10^(-371)
    100000     1000           0.242            0.344              1.23 x 10^(-370)

    Enrichment over assumption of independence

    [Figure: "Difference between assumption and reality". Series: P(Indep) and P(Reality); x-axis: Trial (1 to 9)]

    Impact that varying input parameters has on final results

    Bin size   Padding size   Upstream hits   Histone marks   Overlaps
    10000      200            40264           30508           11976
    10000      500            40264           31308           12270
    10000      1000           40264           32596           12695
    50000      200            21808           19641           12073
    50000      500            21808           19702           12107
    50000      1000           21808           19823           12162
    100000     200            15627           14096           10353
    100000     500            15627           14115           10365
    100000     1000           15627           14157           10387

    SUMMARY

    - Bin size affects the number of upstream hits and histone marks, as would be expected, but does not seem to have a large impact on the number of overlaps

    - Padding size does not have much of an effect on the number of overlaps detected

    Impact that varying input parameters has on final results

    [Figure: number of overlaps detected (y-axis, 0 to 14000) plotted against bin size (x-axis, 0 to 120000)]

    Impact that varying input parameters has on final results

    [Figure: "Bin size versus overlap percentage". x-axis: bin size (0 to 120000); y-axis: 0 to 0.4]

    Foldwise enrichment to further explore low p-values and impact of bin size

    Abandons binning approach

    Examines percentage of genome covered by H3K4me3 markers and/or upstream region

    Conceptually similar to previous enrichment, but more direct

    Base-pair-wise intersection

    Compare expected percentage and actual percentage
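
    A base-pair-wise version of the comparison might look like the sketch below. It is only an illustration of the idea: the function names, the half-open interval convention, the toy coordinates, and the 1000 bp "genome" are assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: List[Interval]) -> int:
    """Number of distinct bases covered by a track."""
    return sum(end - start for start, end in merge(intervals))


def intersection_bases(a: List[Interval], b: List[Interval]) -> int:
    """Count bases covered by both tracks using a two-pointer sweep."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def fold_enrichment(a: List[Interval], b: List[Interval], genome_length: int):
    """Return (expected fraction, observed fraction, observed / expected)."""
    expected = (covered(a) / genome_length) * (covered(b) / genome_length)
    observed = intersection_bases(a, b) / genome_length
    return expected, observed, observed / expected


# Hypothetical toy tracks on a 1000 bp genome
upstream = [(100, 150), (400, 450)]   # 100 bases in total
marks = [(120, 220), (430, 530)]      # 200 bases in total
expected, observed, fold = fold_enrichment(upstream, marks, genome_length=1000)
```

    Here 50 bases are shared, so the observed fraction is 0.05 against an expected 0.02, a 2.5-fold enrichment. The sketch does not guard against an empty track, for which the expected fraction would be zero.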

    Results of performing foldwise enrichment on two tracks

    Upstream Length   Upstream Bases   Histone Bases   Both Bases
    200               7555840          120425900       71978675
    500               18142155         120425900       72108300
    1000              34863595         120425900       72268775

    Upstream Length   % Indep   % Reality
    200               0.00001   0.02316
    500               0.00023   0.02320
    1000              0.00043   0.02325

    Expanding test to analyze two tracks of measurement data

    This test can be expanded to test for a correlation between multiple tracks of functional data

    Are they pairwise independent? And does P(A ∩ B ∩ C) = P(A)P(B)P(C)?

    In this example:

    CpG islands and H3K4me3 markers are the two measurement tracks

    Upstream regions form the functional track
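
    Both questions (pairwise independence and the three-way product) can be checked with the same counting idea. The sketch below is illustrative only: `joint_vs_product` is a hypothetical helper, the labels M1, M2 and F stand in for the two measurement tracks and the function track, and the eight toy bins are invented.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def joint_vs_product(
    tracks: Dict[str, Sequence[bool]]
) -> Dict[Tuple[str, ...], Tuple[float, float]]:
    """For every subset of two or more tracks, return
    (product of marginal frequencies, observed joint frequency)."""
    names = list(tracks)
    n = len(tracks[names[0]])
    marginal = {name: sum(tracks[name]) / n for name in names}
    result = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            product = 1.0
            for name in combo:
                product *= marginal[name]
            joint = sum(all(tracks[name][i] for name in combo) for i in range(n)) / n
            result[combo] = (product, joint)
    return result


# Hypothetical per-bin indicators over 8 bins
tracks = {
    "M1": [i in {0, 1, 2, 3} for i in range(8)],
    "M2": [i in {0, 1, 2, 4} for i in range(8)],
    "F": [i in {0, 1, 3, 4} for i in range(8)],
}
result = joint_vs_product(tracks)
```

    In this toy data every pair co-occurs in 3 of 8 bins against an expected 2 of 8, and all three co-occur in 2 of 8 bins against an expected 1 of 8.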

    Results of test between H3K4me3, CpG islands, and upstream regions

    Parameters       M1 & M2 (Indep)   M1 & M2 (Reality)
    (10000, 200)     0.00850           0.04185
    (10000, 500)     0.00850           0.04185
    (10000, 1000)    0.00850           0.04185
    (50000, 200)     0.09285           0.19181
    (50000, 500)     0.09285           0.19181
    (50000, 1000)    0.09285           0.19181

    Parameters       M1 & F (Indep)    M1 & F (Reality)
    (10000, 200)     0.01117           0.03437
    (10000, 500)     0.01142           0.03513
    (10000, 1000)    0.01181           0.03633
    (50000, 200)     0.11239           0.19975
    (50000, 500)     0.11264           0.20031
    (50000, 1000)    0.11322           0.20122

    Parameters       M2 & F (Indep)    M2 & F (Reality)
    (10000, 200)     0.00626           0.03329
    (10000, 500)     0.00640           0.03376
    (10000, 1000)    0.00662           0.03421
    (50000, 200)     0.08012           0.18621
    (50000, 500)     0.08030           0.18648
    (50000, 1000)    0.08071           0.18697

    Parameters       M1 & M2 & F (Indep)   M1 & M2 & F (Reality)
    (10000, 200)     0.00077               0.02408
    (10000, 500)     0.00078               0.02437
    (10000, 1000)    0.00081               0.02470
    (50000, 200)     0.02891               0.15121
    (50000, 500)     0.02898               0.15140
    (50000, 1000)    0.02913               0.15177

    Results of test between H3K4me3, CpG islands, and upstream regions

    [Figures: four "Assumption vs. Reality" charts, one each for M1 & F, M1 & M2, M2 & F, and M1 & M2 & F. x-axis: Trial (1, 2); series: Assumption and Reality]

    Hypergeometric test for two tracks

    Function track: upstream window of known genes

    Measurement track: peaks of H3K4me3

    k = # bins with both H3K4me3 peak and upstream region of known gene
    n = # bins with at least one H3K4me3 peak
    K = # bins with upstream region of a known gene
    N = total # bins

    bin size   upstream window   corrected p-value
    10 Kb      200 bp            6.292483e-275
    10 Kb      500 bp            7.351025e-283
    10 Kb      1000 bp           2.413367e-299
    50 Kb      200 bp            4.425021e-170
    50 Kb      500 bp            8.912015e-172
    50 Kb      1000 bp           2.539715e-171
    100 Kb     200 bp            4.538280e-119
    100 Kb     500 bp            3.354191e-120
    100 Kb     1000 bp           1.425142e-119
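
    With k, n, K and N defined as on this slide, the upper-tail hypergeometric probability P(X >= k) can be computed in log space. The sketch below is an assumption about how such a test could be coded, not the authors' implementation; the function names are hypothetical and no multiple-testing correction is applied. It works in log space because p-values far below about 1e-308 cannot be represented as ordinary double-precision numbers.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_log10_sf(k: int, N: int, K: int, n: int) -> float:
    """log10 of P(X >= k), where X is the number of function-track hits
    among n bins drawn without replacement from N bins, K of which are hits."""
    lo = max(k, n + K - N, 0)  # smallest feasible overlap that is >= k
    hi = min(n, K)             # largest feasible overlap
    if lo > hi:
        return float("-inf")   # k is larger than any possible overlap
    denom = log_comb(N, n)
    terms = [
        log_comb(K, i) + log_comb(N - K, n - i) - denom
        for i in range(lo, hi + 1)
    ]
    peak = max(terms)
    # log-sum-exp keeps the sum stable when individual terms are tiny
    log_p = peak + math.log(sum(math.exp(t - peak) for t in terms))
    return min(log_p, 0.0) / math.log(10)


# Small hand-checkable case: N=10 bins, K=4 hits, n=3 drawn, k=2 observed
log10_p = hypergeom_log10_sf(k=2, N=10, K=4, n=3)
```

    For the small case, P(X >= 2) = (C(4,2)C(6,1) + C(4,3)C(6,0)) / C(10,3) = 40/120 = 1/3, which the function reproduces.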

    Hypergeometric test for three tracks

    Function track: upstream window of known genes (proxy for promoter)

    Measurement track 1: peaks of H3K4me3

    Measurement track 2: CpG islands

    Based on restricting counts to cases where the function track is a hit.

    k = # bins with H3K4me3 peak, CpG island, and upstream region of known gene
    n = # bins with H3K4me3 and upstream region of known gene
    K = # bins with CpG island and upstream region of a known gene
    N = total # bins with upstream regions of known gene

    bin size   upstream window   corrected p-value
    10 Kb      200 bp            6.69E-200
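
    The restriction to function-track hits amounts to recomputing the four counts over a subset of bins. A minimal sketch, assuming each track is a list of per-bin True/False values (the name `conditional_counts` and the six toy bins are hypothetical):

```python
from typing import Sequence, Tuple


def conditional_counts(
    function: Sequence[bool], m1: Sequence[bool], m2: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (k, n, K, N) counted only over bins where the function track is a hit."""
    hits = [i for i, f in enumerate(function) if f]
    N = len(hits)                                  # bins with a function hit
    n = sum(1 for i in hits if m1[i])              # ... that also have M1
    K = sum(1 for i in hits if m2[i])              # ... that also have M2
    k = sum(1 for i in hits if m1[i] and m2[i])    # ... that have both
    return k, n, K, N


# Hypothetical toy tracks over 6 bins
function = [True, True, True, True, False, False]
h3k4me3 = [True, True, False, False, True, False]
cpg = [True, False, True, False, True, True]
counts = conditional_counts(function, h3k4me3, cpg)
```

    Bins 4 and 5 are ignored because the function track is not a hit there, even though bin 4 carries both measurements.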

    Region Overlap & Non-linearity

    [Figure: a Function track with strand annotations (+ + - +) and a Measurement track, with two spans labeled "Region overlap"]

    Segmentation

    [Figure label: "Min segmentation length"]

    Block-wise Sub-sampling

    Select Segment

    Select Sub-segment

    Repeat it and rescale to get null distribution

    Calculate Test Statistics
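
    The steps above can be illustrated with a heavily simplified sketch. It is not the procedure used in the slides: it treats the whole track as a single segment, uses an assumed statistic (observed co-occurrence minus the value expected under independence), and rescales the spread of the block statistics by sqrt(block length / track length), which is the usual scaling for an average-like statistic. All names and the simulated tracks are hypothetical.

```python
import math
import random
from statistics import stdev
from typing import List, Sequence


def excess_overlap(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Observed co-occurrence minus the value expected under independence."""
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x and y) / n
    return both - (sum(a) / n) * (sum(b) / n)


def block_subsample(
    a: Sequence[bool], b: Sequence[bool],
    block_len: int, n_samples: int, rng: random.Random,
) -> List[float]:
    """Compute the statistic on randomly placed contiguous blocks."""
    n = len(a)
    if n != len(b) or not 0 < block_len <= n:
        raise ValueError("tracks must match and block_len must fit inside them")
    stats = []
    for _ in range(n_samples):
        start = rng.randrange(n - block_len + 1)
        stats.append(
            excess_overlap(a[start:start + block_len], b[start:start + block_len])
        )
    return stats


# Simulated, deliberately correlated tracks (hypothetical data)
rng = random.Random(0)
track_a = [rng.random() < 0.3 for _ in range(2000)]
track_b = [x if rng.random() < 0.8 else (rng.random() < 0.3) for x in track_a]

block_len = 200
stats = block_subsample(track_a, track_b, block_len, n_samples=200, rng=rng)
full_stat = excess_overlap(track_a, track_b)

# Rescale the block-level spread to the full track length
scaled_sd = stdev(stats) * math.sqrt(block_len / len(track_a))
z_score = full_stat / scaled_sd
```

    Sampling contiguous blocks rather than individual bins preserves local dependence between neighboring positions, so the estimated spread is not artificially narrow.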

    Googol^{-1} P-Values !!

    Expected P-Values

    Expected P-Values (n=2600)

    Conclusion: finding relationships between tracks of genomic data

    The tests we implemented successfully identify significantly correlated tracks

    Given a statistical test and a p-value, we are able to determine the expected correlation

    Using different preprocessing techniques, the statistical tests can be extended to multiple tracks to identify new biological associations