Overrepresented Segment Strings (Aug/8/2011)
Bob Harris, Penn State Center for Comparative Genomics and Bioinformatics
[email protected]


Page 1: Overrepresented Segment Strings (Aug/8/2011)

Overrepresented Segment Strings (Aug/8/2011)

Bob Harris
Penn State

Center for Comparative Genomics and Bioinformatics

[email protected]

Page 2: Overrepresented Segment Strings (Aug/8/2011)

Overview

• Analysis of segmentation sequences, incorporating longer local context

• Update of previous enrichment/depletion plots
  – For the round 8 segmentations

Page 3: Overrepresented Segment Strings (Aug/8/2011)

Motivation

> segway.k562.coordinated chr10:812820-872329
AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNXCNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWICYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXAGAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVLQVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL

Quick eyeball test using one-character class encoding: A = class 0, B = class 1, …; classes 2, 13, 24 are C, N, Y
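As a minimal sketch of the one-character encoding described above (A = class 0, B = class 1, and so on), the following Python converts between class numbers and letters. The function names `encode_classes` and `decode_classes` are hypothetical, and the sketch assumes at most 26 classes; the slides do not say how larger label sets would be encoded.

```python
import string


def encode_classes(classes):
    """Map class numbers to letters: 0 -> A, 1 -> B, ..., 25 -> Z (assumed limit)."""
    return "".join(string.ascii_uppercase[c] for c in classes)


def decode_classes(text):
    """Inverse mapping for uppercase letters: A -> 0, B -> 1, ..."""
    return [ord(ch) - ord("A") for ch in text]


# Classes 2, 13, 24 correspond to the letters C, N, Y.
print(encode_classes([2, 13, 24]))
print(decode_classes("CNY"))
```

With this encoding a whole segmentation region becomes a short string, which is what makes the repeated C/N/Y runs visible by eye.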

Page 4: Overrepresented Segment Strings (Aug/8/2011)

Redundancy Apparent, but…

• How surprising are the C,N,Y (2,13,24) groups?
  – Together these classes have only average probability
  – But 1st and 2nd order probabilities favor continuing in this group

> segway.k562.coordinated chr10:812820-872329
AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNXCNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWICYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXAGAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVLQVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL
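One rough way to check the first-order claim is to estimate, from an encoded sequence, how often a segment in the group is followed by another segment in the same group. The Python below is only an illustrative sketch under that assumption, not the procedure used for these slides; `stay_probability` is a hypothetical name.

```python
def stay_probability(seq, group):
    """Estimate pr(next segment in group | current segment in group).

    seq   -- one-character class encoding of consecutive segments
    group -- iterable of class letters, e.g. "CNY"
    """
    group = set(group)
    pairs = list(zip(seq, seq[1:]))
    inside = sum(1 for a, _ in pairs if a in group)
    stay = sum(1 for a, b in pairs if a in group and b in group)
    return stay / inside if inside else float("nan")


# Toy input: C->N, N->C, C->Y stay in the group, Y->X leaves it.
print(stay_probability("CNCYX", "CNY"))
```

Comparing this value with the overall frequency of the group's classes would show whether the runs are more persistent than the class frequencies alone suggest.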

Page 5: Overrepresented Segment Strings (Aug/8/2011)

Overrepresented Strings

• String of 2N segments

• Estimate expected probability with Nth order model
  – e.g. pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC)

• “Evaluate” strings with high observed:expected ratio
  – Comparison to “features”. In this case RNAseq contigs

• Caveat(?): length of segments ignored
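The expected-probability bullet can be sketched for N = 2 (length-4 strings) as follows. This is a minimal sketch, assuming the probabilities are maximum-likelihood estimates taken from n-gram counts in the same class sequence; the slides do not state how they were estimated. The names `ngram_counts`, `obs_exp_len4` and the `min_obs` threshold are hypothetical.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every run of n consecutive segment classes in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def obs_exp_len4(seq, min_obs=1):
    """Rank length-4 strings by observed/expected count under a 2nd-order model.

    pr(ABCD) = pr(AB) * pr(C|AB) * pr(D|BC), all estimated from counts (assumption).
    """
    c2, c3, c4 = ngram_counts(seq, 2), ngram_counts(seq, 3), ngram_counts(seq, 4)
    # Number of times each pair occurs with a following segment (the context count).
    context = Counter()
    for (a, b, _), k in c3.items():
        context[(a, b)] += k
    total2, total4 = sum(c2.values()), sum(c4.values())
    rows = []
    for (a, b, c, d), obs in c4.items():
        if obs < min_obs:  # drop rare observations
            continue
        p = (c2[(a, b)] / total2) \
            * (c3[(a, b, c)] / context[(a, b)]) \
            * (c3[(b, c, d)] / context[(b, c)])
        exp = p * total4
        rows.append(((a, b, c, d), obs, exp, obs / exp))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


for string, obs, exp, ratio in obs_exp_len4(list("ABABABAB")):
    print("-".join(string), obs, round(exp, 2), round(ratio, 6))
```

As in the caveat above, this treats each segment as one symbol and ignores segment lengths.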

Page 6: Overrepresented Segment Strings (Aug/8/2011)

Overrepresented Strings, Example

• Length-4 strings in segway.k562.coordinated
  – Highest obs/exp ratio, after eliminating rare observations

string        #obs’d   #exp’d    obs/exp
21-10-0-21    3761     970.80    3.874112
21-0-10-21    3561     966.65    3.683865
13-23-20-13   5227     2386.44   2.190296
13-20-23-13   5177     2371.56   2.182953
13-23-17-13   3205     1530.04   2.094711
13-17-23-13   3156     1535.76   2.055004
16-21-11-16   4833     2466.86   1.959174
14-23-17-14   3263     1711.13   1.906928
16-11-21-16   4629     2443.15   1.894687
10-6-0-10     6980     3686.84   1.893222
14-17-23-14   3180     1686.41   1.885658
10-0-6-10     6846     3632.72   1.884536
23-0-6-23     3265     1748.77   1.867023
23-6-0-23     3254     1749.80   1.859644
23-6-14-23    8780     4821.21   1.821121
23-14-6-23    8933     4927.23   1.812985
24-13-3-24    5419     3007.67   1.801727
23-0-14-23    7142     4023.34   1.775141
24-3-13-24    5270     2987.69   1.763906
23-6-10-3     3045     1734.93   1.755115
24-3-10-3     3192     1832.07   1.742287
3-10-6-23     3046     1751.86   1.738724
23-14-0-23    7000     4028.87   1.737461
3-10-3-24     3126     1809.36   1.727681
…

Page 7: Overrepresented Segment Strings (Aug/8/2011)

CSHL RNAseq contigs

• CSHL RNAseq contigs
  – ftp://genome.crg.es/pub/Encode/data_analysis/ForDeadZones/Contigs_IDR0.1_CSHL.tar.gz

• Differentiated by cell line (14), compartment (6), RNA fraction (4)

• and attributed to 11 biotypes (gencode v7 exons)
  – non coding, protein coding, etc.
  – and a 12th type: empty, or “no exon”

• From Sarah Djebali, Felix Schlesinger, Wei Lin

Page 8: Overrepresented Segment Strings (Aug/8/2011)

Measuring Enrichment

• V_{f,s} = enrichment of string s for feature f

  $$V_{f,s} = \frac{\#\{fs\} / \#\{s\}}{\#\{f\} / \#\{F\}}$$

  {s} = set of bases covered by string s (in either direction)
  {f} = set of bases covered by feature f
  {fs} = intersection of {f} and {s}
  {F} = union of {f’} for all features f’
  # = size of set

• I plot log2(V_{f,s}), fold enrichment
  – Or, if negative, fold depletion
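The enrichment measure can be sketched directly from these set definitions. The Python below is a minimal sketch that represents each set as explicit base positions built from half-open (start, end) intervals; that representation, the function names, and returning `None` when the string and feature never overlap are all assumptions made for illustration.

```python
import math


def bases(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def log2_enrichment(string_intervals, feature_intervals, all_feature_intervals):
    """log2 of V = (#{fs} / #{s}) / (#{f} / #{F}); None if {fs} is empty."""
    s = bases(string_intervals)
    f = bases(feature_intervals)
    big_f = bases(all_feature_intervals)
    fs = f & s
    if not fs:
        return None  # no overlap, so the log is undefined
    v = (len(fs) / len(s)) / (len(f) / len(big_f))
    return math.log2(v)


# Half of the string's 10 bases fall in a feature that makes up a quarter of {F}.
print(log2_enrichment([(0, 10)], [(5, 10)], [(0, 20)]))
```

Explicit position sets are only practical for small regions; a genome-scale version would more likely use interval arithmetic, which this sketch does not attempt.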

Page 9: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.k562.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 10: Overrepresented Segment Strings (Aug/8/2011)

Length-4 Strings Enrichment

segway.k562.coordinated vs CSHL RNAseq contigs (highest observed/expected strings)

white = no occurrences

Page 11: Overrepresented Segment Strings (Aug/8/2011)

Length-4 Strings Enrichment

segway.k562.coordinated vs CSHL RNAseq contigs (highest observed/expected strings)

Page 12: Overrepresented Segment Strings (Aug/8/2011)

To Do

• Incorporate single-segment enrichment into evaluation of multi-segment strings

• Longer strings

• Run on all 14 round 8 segmentations
  – And the bake-off composites

Page 13: Overrepresented Segment Strings (Aug/8/2011)

Aligning Class Sequences

• Work in progress, with these questions…

• Do longer, highly similar sequences indicate similar function?

segway.k562.coordinated chr10:88422790-88427017     CYCNCYNCNYNCNCNCNCN
segway.k562.coordinated chr13:113696011-113701344   CYCNCYNCNYNCNCNCNCN

• Or do small changes indicate functional differences?

segway.k562.coordinated chr10:133868081-133875219   NCNXnXNXNXNCYNCNCNCNXNCN
segway.k562.coordinated chr13:113638232-113645027-  NCNXoXNXNXNCYNCNCNCNXNCN

Page 14: Overrepresented Segment Strings (Aug/8/2011)

Aligning Class Sequences

• Do longer, highly similar sequences indicate similar function?

Page 15: Overrepresented Segment Strings (Aug/8/2011)

Aligning Class Sequences

• Or do small changes indicate functional differences?

Page 16: Overrepresented Segment Strings (Aug/8/2011)

Alignments

• Confounded by presence of 2- and 3-segment cycles
  – Implement separate search for short repeated cycles
  – Then align with those masked

• Should incorporate segment lengths

• May be better to align in peak space
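The proposed search for short repeated cycles, followed by masking, could look roughly like the sketch below. It is an assumption about one possible approach, not an implementation from this work: `find_cycles`, `mask_cycles`, the `min_repeats` threshold and the `-` mask character are all hypothetical, and segment lengths are ignored.

```python
def find_cycles(seq, period, min_repeats=3):
    """Return (start, end) spans where seq repeats with the given period.

    A span is reported when the cycle unit occurs at least min_repeats times.
    """
    runs = []
    n = len(seq)
    i = 0
    while i + period < n:
        j = i
        # Extend while each class equals the class one period later.
        while j + period < n and seq[j] == seq[j + period]:
            j += 1
        if j - i + period >= period * min_repeats:
            runs.append((i, j + period))
        i = j + 1
    return runs


def mask_cycles(seq, runs, mask_char="-"):
    """Replace every position inside a reported span with mask_char."""
    chars = list(seq)
    for start, end in runs:
        for k in range(start, end):
            chars[k] = mask_char
    return "".join(chars)


seq = "XCYCYCYN"
runs = find_cycles(seq, period=2)
print(runs, mask_cycles(seq, runs))
```

Running the search once with `period=2` and once with `period=3` and masking both sets of spans would leave only the non-cyclic context for an aligner to work with.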

Page 17: Overrepresented Segment Strings (Aug/8/2011)

Appendix

• The following slides show single-segment enrichment heatmaps for all 14 round 8 segmentations

Page 18: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.gm12878.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 19: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.h1hesc.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 20: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.helas3.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 21: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.hepg2.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 22: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.huvec.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 23: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.k562.all vs CSHL RNAseq contigs

white = no occurrences

Page 24: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.k562.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 25: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

segway.tier1-2.coordinated vs CSHL RNAseq contigs

white = no occurrences

Page 26: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.GM12878_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences

Page 27: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.H1_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences

Page 28: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.HELA_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences

Page 29: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.HEPG2_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences

Page 30: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.HUVEC_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences

Page 31: Overrepresented Segment Strings (Aug/8/2011)

Single-segment Enrichment

chromhmm.K562_concatenate_25 vs CSHL RNAseq contigs

white = no occurrences