MSPC: Joint analysis of ChIP-seq replicates


Politecnico di Milano

Department of Electronics, Information and Bioengineering

July 20, 2015

Using combined evidence from replicates to evaluate ChIP-seq peaks

Vahid Jalili

Vahid Jalili (vahid.jalili@polimi.it)

Matteo Matteucci (matteo.matteucci@polimi.it)

Marco Masseroli (marco.masseroli@polimi.it)

Marco Morelli (marco.morelli@iit.it)

Website: https://mspc.codeplex.com

Politecnico di Milano, Vahid Jalili, July 20, 2015 (slide 2 of 35)

Motivation

[Figure: tag count along the genomic DNA of a ChIP-seq sample, with signal and background. A stringent threshold and a permissive threshold are marked, and regions are labeled True Positive, False Positive, False Negative, and True Negative.]


Motivation

Benefit from Replicates

Use replicates to discriminate sub-threshold binding from truly non-bound regions.

[Figure: tag count along the genomic DNA for Replicate 1 and Replicate 2, with signal and background.]


Method

Notations

Strong threshold: $\mathcal{T}_s$

Weak threshold: $\mathcal{T}_w$

Strong peak: $p\text{-value} \leq \mathcal{T}_s$

Weak peak: $\mathcal{T}_s < p\text{-value} \leq \mathcal{T}_w$


Method

Combining Evidence

How to combine evidence?

Fisher's combined probability test:

$X^2_{2k} = -2 \sum_{i=1}^{k} \ln p_i$

$X^2_{2k}$ follows a $\chi^2$ distribution with $2k$ degrees of freedom.

Confirm if $X^2_{2k} \geq \text{threshold}$

Discard if $X^2_{2k} < \text{threshold}$

Alternatives for combining test statistics:

Liptak's method (Liptak, 1958)

Mudholkar and George (Mudholkar & George, 1979)

Wilkinson's method (Wilkinson, 1951)

Truncated product method (Zaykin, Zhivotovsky, Westfall, & Weir, 2002)

…
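Fisher's statistic and the confirm/discard rule can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the closed-form tail probability used here is a general property of the chi-squared distribution with an even number of degrees of freedom, not something taken from the slides.

```python
import math
from typing import Sequence


def fisher_statistic(p_values: Sequence[float]) -> float:
    """X^2_{2k} = -2 * sum(ln p_i) over the k combined p-values."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_sf_even_dof(x: float, k: int) -> float:
    """P(X >= x) for a chi-squared variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    half = x / 2.0
    term = total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def is_confirmed(p_values: Sequence[float], threshold: float) -> bool:
    """Confirm when the combined statistic reaches the threshold."""
    return fisher_statistic(p_values) >= threshold


# Two illustrative p-values (k = 2, so 4 degrees of freedom).
x2 = fisher_statistic([0.05, 0.05])
combined_p = chi2_sf_even_dof(x2, k=2)
```

Working with the statistic directly, as the slide does, avoids computing a tail probability for every test; the tail function is included only to relate the threshold on $X^2_{2k}$ to a combined p-value.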


Method

Combining Evidence

Which evidence to combine?

[Figure: Replicate 1, Replicate 2, Replicate 3, and Replicate 4.]


Method

Intersection Determination

The challenge: an optimal method for finding the intersections

Some Possible Methods and Their Complexity

Sorted lists: $O(mn)$

Naïve method: $O(n^m)$

Hashing based: $O\left(\frac{n \log^2 w}{w} + mr\right)$

Interval trees: $O(n \log_2 n)$

• $n$: average peak count per sample

• $m$: sample count

• $w$: number of bits in a machine word

• $r$: intersection size
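As a baseline for the methods listed above, the naïve approach compares every region of one sample with every region of another. The sketch below covers only two samples, represents regions as closed `(start, end)` tuples, and uses hypothetical function names.

```python
from itertools import product
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def naive_intersections(
    sample_a: List[Interval], sample_b: List[Interval]
) -> List[Tuple[Interval, Interval]]:
    """Check every pair: len(sample_a) * len(sample_b) comparisons."""
    return [(a, b) for a, b in product(sample_a, sample_b) if overlaps(a, b)]


pairs = naive_intersections([(0, 3), (5, 8)], [(6, 10), (15, 23)])
```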


Method

Intersection Determination: Interval Trees

[Figure: an interval tree over the coordinate range 0 to 31. Each node stores an interval and its associated data. Intervals shown: [16, 21], [8, 9], [25, 30], [17, 19], [26, 27], [19, 20], [15, 23], [5, 8], [6, 10], [0, 3].]
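To make the interval-tree idea concrete, here is a minimal sketch of an augmented binary search tree keyed on the interval start, where each node also records the largest end coordinate in its subtree. It is unbalanced for brevity (a production version would use a balanced tree), the names are hypothetical, and it is not the implementation used by the tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    low: int
    high: int
    max_high: int  # largest 'high' in the subtree rooted at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], low: int, high: int) -> Node:
    """Insert a closed interval, ordering nodes by their low endpoint."""
    if root is None:
        return Node(low, high, high)
    if low < root.low:
        root.left = insert(root.left, low, high)
    else:
        root.right = insert(root.right, low, high)
    root.max_high = max(root.max_high, high)
    return root


def query(root: Optional[Node], low: int, high: int,
          out: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Collect every stored interval that overlaps [low, high]."""
    if root is None or root.max_high < low:
        return out  # nothing in this subtree ends at or after 'low'
    query(root.left, low, high, out)
    if root.low <= high and low <= root.high:
        out.append((root.low, root.high))
    if root.low <= high:  # right subtree only holds starts >= root.low
        query(root.right, low, high, out)
    return out


# The ten intervals from the figure.
intervals = [(16, 21), (8, 9), (25, 30), (17, 19), (26, 27),
             (19, 20), (15, 23), (5, 8), (6, 10), (0, 3)]
tree = None
for lo, hi in intervals:
    tree = insert(tree, lo, hi)

hits = query(tree, 22, 25, [])
```

The `max_high` field is what lets the search skip whole subtrees, which is where the logarithmic behavior of a balanced interval tree comes from.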


Method

Algorithm … an example

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region).]

Method

Algorithm … an example

Determine intersecting regions across all samples

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region); R1 is shown together with R2 and R3.]

Method

Algorithm … an example

Determine intersecting regions across all samples

If multiple regions on a sample are found to intersect, choose the strongest one

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region).]
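The selection rule can be sketched as follows, assuming that "strongest" means the intersecting region with the smallest p-value. The coordinates, p-values, and the placement of R2 and R3 in the same replicate are illustrative assumptions, as are the function names.

```python
from typing import List, Optional, Tuple

Region = Tuple[str, int, int, float]  # (name, start, end, p_value)


def overlaps(a: Region, b: Region) -> bool:
    """Closed-interval overlap on the (start, end) fields."""
    return a[1] <= b[2] and b[1] <= a[2]


def strongest_overlap(query: Region, sample: List[Region]) -> Optional[Region]:
    """Among the regions of one sample that intersect 'query',
    return the one with the smallest p-value, or None if none intersect."""
    hits = [r for r in sample if overlaps(query, r)]
    return min(hits, key=lambda r: r[3]) if hits else None


# Illustrative values only: R1 intersects both R2 and R3 in another replicate.
r1 = ("R1", 100, 200, 1e-5)
replicate_2 = [("R2", 90, 120, 1e-6), ("R3", 180, 260, 1e-5)]
best = strongest_overlap(r1, replicate_2)
```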

Method

Algorithm … an example

Determine intersecting regions across all samples

If multiple regions on a sample are found to intersect, choose the strongest one

Combine test statistics using Fisher's method

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region).]

Method

Algorithm … an example

Determine intersecting regions across all samples

If multiple regions on a sample are found to intersect, choose the strongest one

Combine test statistics using Fisher's method

$X^2 \geq \text{Threshold}$? NO!

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region).]

Method

Algorithm … an example

Intermediate Sets

[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Entries shown: R1, R2.]

Method

Algorithm … an example

Determine intersecting regions across all samples

Since R2 intersects only with R1, and the R1-R2 test has already been performed, no further processing is needed.

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region); R2 is the region under consideration.]

Method

Algorithm … an example

Determine intersecting regions across all samples

Combine test statistics using Fisher's method

$X^2 \geq \text{Threshold}$? YES!

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region); R3 is shown together with R4 and R1.]

Method

Algorithm … an example

Intermediate Sets

[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Entries shown: R1, R2, R3, R4.]

Method

Algorithm … an example

Determine intersecting regions across all samples

Combine test statistics using Fisher's method

$X^2 \geq \text{Threshold}$? YES!

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region); R4 is shown together with R3.]

Method

Algorithm … an example

Intermediate Sets

[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Entries shown: R2, R3, R4, R1.]

Method

Algorithm … an example

Intermediate Sets

[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Entries shown: R2, R3, R4, and R1 (listed twice).]

Output Sets

[Figure: output sets, color-coded as Output Set.]
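The walkthrough above can be approximated end to end with the sketch below. It rests on several assumptions that the slides do not state explicitly: every region in a passing test is added to the confirmed set and every region in a failing test to the discarded set; a region that ends up in both sets is reported in the output; a region with no intersecting peer is tested on its own p-value. All coordinates, p-values, the threshold of 60, and the assignment of regions to replicates are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Peak:
    name: str
    start: int
    end: int
    p_value: float


def overlaps(a: Peak, b: Peak) -> bool:
    return a.start <= b.end and b.start <= a.end


def run(samples: Dict[str, List[Peak]],
        threshold: float) -> Tuple[Set[str], Set[str], Set[str]]:
    confirmed: Set[str] = set()
    discarded: Set[str] = set()
    tested: Set[FrozenSet[str]] = set()  # combinations already evaluated
    for rep, peaks in samples.items():
        for peak in peaks:
            group = [peak]
            for other_rep, other_peaks in samples.items():
                if other_rep == rep:
                    continue
                hits = [q for q in other_peaks if overlaps(peak, q)]
                if hits:  # keep only the strongest intersecting region
                    group.append(min(hits, key=lambda q: q.p_value))
            key = frozenset(q.name for q in group)
            if key in tested:  # e.g. R2 vs R1 after R1 vs R2
                continue
            tested.add(key)
            x2 = -2.0 * sum(math.log(q.p_value) for q in group)
            target = confirmed if x2 >= threshold else discarded
            target.update(q.name for q in group)
    output = set(confirmed)  # assumption: one passing test is enough
    return confirmed, discarded, output


# Illustrative data loosely mirroring the example's R1..R4.
samples = {
    "Replicate 1": [Peak("R1", 100, 200, 1e-5)],
    "Replicate 2": [Peak("R2", 90, 120, 1e-6), Peak("R3", 180, 260, 1e-5)],
    "Replicate 3": [Peak("R4", 240, 300, 1e-12)],
}
confirmed, discarded, output = run(samples, threshold=60.0)
```

With these invented values R1 fails its own test with R2 but passes as part of the R3 test, so it appears in both intermediate sets, which is one way to read the duplicated R1 entry in the figure.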

Method

Algorithm … an example

[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region), shown before and after processing.]


Results

[Figure: plots for the samples Myc2_1, Myc2_2, Myc3_1, and Myc3_2, with paired In and Re columns grouped into Set 1, Set 2, and Set 3.]

Abbreviation    File name

Myc2_1          wgEncodeSydhTfbsK562CmycIggrabAlnRep1

Myc2_2          wgEncodeSydhTfbsK562CmycIggrabAlnRep2

Myc3_1          wgEncodeSydhTfbsK562CmycStdAlnRep1

Myc3_2          wgEncodeSydhTfbsK562CmycStdAlnRep2

Category                   Abbreviation    Color implication

Input (source BED file)    In              Strong; Weak

Analysis Results           Re              Strong Confirmed; Weak Confirmed; Weak Discarded


Results

Presence of E-box

[Figure legend: motif was enriched in the sequences defined by peaks; motif was NOT enriched in the sequences defined by peaks.]


Implementation

Performance

[Figure: Running Time. x-axis: Peaks Count (× 10000), 0 to 45; y-axis: Time (seconds), 0 to 50; series: 2-Replicates, 4-Replicates, 6-Replicates.]

Demo


Questions

Questions are welcome at: https://mspc.codeplex.com/discussions
