How do we represent position-specific preferences?
BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F
Statistical representation (alignment column 9, 7 sequences)
G: 5 -> 71%
S: 1 -> 14%
C: 1 -> 14%
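The counts above can be reproduced directly from the alignment. The following is a minimal sketch, assuming the seven 12-residue rows shown on the slide; the names `ALIGNMENT` and `column_profile` are made up for this illustration.

```python
from collections import Counter

# Alignment rows from the slide, one 12-residue string per protein.
ALIGNMENT = {
    "BID_MOUSE":  "IARHLAQIGDEM",
    "BAD_MOUSE":  "YGRELRRMSDEF",
    "BAK_MOUSE":  "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS":       "IAQELRRIGDEF",
    "HRK_HUMAN":  "TAARLKALGDEL",
    "Egl-1":      "IGSKLAAMCDDF",
}


def column_profile(sequences, column):
    """Count residues in one alignment column (1-based) and add percentages."""
    residues = [seq[column - 1] for seq in sequences]
    counts = Counter(residues)
    total = len(residues)
    # Map residue -> (count, rounded percentage), most frequent first.
    return {res: (n, round(100 * n / total)) for res, n in counts.most_common()}


profile = column_profile(ALIGNMENT.values(), 9)
for residue, (count, percent) in profile.items():
    print(f"{residue}: {count} -> {percent}%")
```

Running the same function over every column gives a full position-specific profile of the alignment.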
Basic concept of motif identification 2.
Practice: identify potential transcription factor binding sites on a promoter
sequence.
Using TESS: Transcription Element Search System
http://www.cbil.upenn.edu/cgi-bin/tess/tess33?RQ=WELCOME
TESS result
Why are there so many false positives in a TF binding site scan?
Contextual dependency is not considered.
Stringency of the matrices.
Stringency of the matrices
  A    C    G    T   Consensus
 40   13   23   23   N
 20    3   70    5   G
 55    3   40    0   R
  0   93    0    5   C
 53    8    8   30   W
 15    0    3   82   T
  0    0  100    0   G
  0   50    0   50   Y
  0   68    0   30   C
 12   35    3   48   Y
  A    C    G    T   Consensus
  4    0   13    0   G
  5    0   12    0   G
 15    0    2    0   A
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0   13    0    4   C
  0   17    0    0   C
  0   17    0    0   C
  0    0   17    0   G
  0    0   17    0   G
  2    0   15    0   G
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0    2    0   15   T
  0   13    0    4   C
  0    7    2    7   Y
P53_01
P53_02
Consensus – 10 bp
Consensus – 20 bp
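To see how the cutoff applied to a matrix affects the number of hits, here is a minimal sketch that scans a sequence with the 10-position percent matrix shown above. The scoring scheme (sum of the percentages of the observed bases, divided by the best possible sum) is a deliberate simplification chosen for this illustration and is not the scoring used by TESS. The function names and the demo sequence are hypothetical.

```python
# Percent matrix from the slide: one row per position, columns A, C, G, T.
MATRIX = [
    (40, 13, 23, 23),
    (20, 3, 70, 5),
    (55, 3, 40, 0),
    (0, 93, 0, 5),
    (53, 8, 8, 30),
    (15, 0, 3, 82),
    (0, 0, 100, 0),
    (0, 50, 0, 50),
    (0, 68, 0, 30),
    (12, 35, 3, 48),
]
BASES = "ACGT"
MAX_SCORE = sum(max(row) for row in MATRIX)


def window_score(window):
    """Score one window (uppercase A/C/G/T only); the best site scores 1.0."""
    total = sum(row[BASES.index(base)] for row, base in zip(MATRIX, window))
    return total / MAX_SCORE


def scan(sequence, cutoff):
    """Return (start, score) for every window scoring at least `cutoff`."""
    width = len(MATRIX)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = window_score(sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits


# Hypothetical demo sequence with one best-scoring site planted at offset 2.
demo = "TTAGACATGCCTTT"
for cutoff in (1.0, 0.5, 0.3):
    print(cutoff, [start for start, _ in scan(demo, cutoff)])
```

With this toy sequence, a cutoff of 0.5 reports only the planted site, while a cutoff of 0.3 reports every window. A relaxed cutoff on a short matrix behaves the same way on a real promoter, which is one source of false positives.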
DNA Pattern – Transcription factor binding site
• Pattern strings and matrices are extracted from known binding sequences.
• Core vs whole.
• Some short and/or ambiguous patterns will have many hits.
Sequence logo
Pos   Info    N    A    C    G    T   Consensus
1     0.679   27    0    5   17    5   G
2     0.883   27    6    2   19    0   G
3     1.771   27    1    0   26    0   G
4     1.619   27   25    2    0    0   A
5     2       27    0    0    0   27   T
6     1.771   27    0    0    1   26   T
7     1.771   27   26    0    0    1   A
8     0.192   27    8    2   11    6   R
[Figure: sequence logo; vertical axis: information content, with ticks at 1.0 and 2.0]
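The Info column in the table is consistent with the usual per-position measure for DNA, 2 minus the Shannon entropy of the base frequencies (in bits), with no small-sample correction. The sketch below recomputes it from the counts; the name `information_content` is made up for this illustration.

```python
import math

# Base counts (A, C, G, T) for positions 1 to 8, copied from the table.
COUNTS = [
    (0, 5, 17, 5),
    (6, 2, 19, 0),
    (1, 0, 26, 0),
    (25, 2, 0, 0),
    (0, 0, 0, 27),
    (0, 0, 1, 26),
    (26, 0, 0, 1),
    (8, 2, 11, 6),
]


def information_content(counts):
    """Return 2 - H (bits) for one column of DNA base counts."""
    total = sum(counts)
    # Bases with zero counts contribute nothing to the entropy.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts if n > 0)
    return 2.0 - entropy


for position, counts in enumerate(COUNTS, start=1):
    print(position, round(information_content(counts), 3))
```

A fully conserved position (position 5) reaches the maximum of 2 bits, while a nearly uniform position (position 8) is close to 0. The height of each letter stack in the logo reflects this value.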
Comparing genomes
For understanding genome organization.
For identifying functionally conserved regions / sequences:
• 3', 5' UTRs (e.g. microRNA binding sites)
• Transcription factor binding sites / regulatory modules
Vista Genome Browser
Practice & Observe: cross-genome comparison using the Vista Genome Browser
Identifying conserved regulatory modules
• Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.
• Functional requirement leads to conservation at the binding site (sequence) level.
Ways to Identify conserved regulatory modules
• Based on sequence similarity: MEME, rVista, Whole genome rVista for model organisms…
• Based on binding site identity: BLISS
Practice: Identifying conserved TF binding sites using rVista
1.) Search for your gene in Whole genome rVista.
Or
2.) Compile the corresponding genomic regions from different species (can be >2) and load them into rVista. This can also be used to identify shared regulatory modules in related genes within the same organism.
rVista
Practice & Observe: Load genomic sequences from human, rat, and opossum into rVista. Choose TF matrices (e.g. E2F, P53, ATF).
Representation of Deep Seq data
Chrom.   Start      End        Name   Score   Strand
chr2L    10000192   10000217   U0     0       +
chr2L    10000227   10000252   U1     0       -
chr2R    10000310   10000335   U2     0       +
chr3L    10000496   10000521   U1     0       -
chr21    10000556   10000581   U2     0       +
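The six columns above correspond to the first six fields of the BED format. The following is a minimal sketch of reading such lines into typed records; `BedRecord` and `parse_bed` are names invented for this illustration, and the data rows are the ones shown on the slide.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # BED starts are 0-based
    end: int     # BED ends are exclusive, so length = end - start
    name: str
    score: int
    strand: str


BED_TEXT = """\
chr2L 10000192 10000217 U0 0 +
chr2L 10000227 10000252 U1 0 -
chr2R 10000310 10000335 U2 0 +
chr3L 10000496 10000521 U1 0 -
chr21 10000556 10000581 U2 0 +
"""


def parse_bed(text):
    """Parse whitespace-separated BED lines, skipping header-style lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        records.append(
            BedRecord(chrom, int(start), int(end), name, int(score), strand)
        )
    return records


records = parse_bed(BED_TEXT)
for rec in records:
    print(rec.chrom, rec.start, rec.end, rec.end - rec.start, rec.strand)
```

Each record here spans 25 bases, which is what one would expect from short sequencing reads of fixed length.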
Representation of Deep Seq data
The importance of reference genome
• All coordinates are only meaningful for a given genome assembly.
• One assembly may have multiple releases (annotations).
Manipulating Deep Seq data with Galaxy
Practice & Observe:
1. Load the PolII.H99.Bed file into Galaxy with the Get Data tool.
2. Sort the data by chromosome location (column c2).
3. Filter out lines with U2 using the expression c4!='U2'
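For readers who want to see what the sort and filter steps do to the data, here is a rough Python equivalent. It is only a sketch: it uses the five example rows from the earlier slide (deliberately shuffled) rather than the PolII.H99.Bed file, and it mimics the Galaxy column names c2 and c4 with list indices.

```python
# Example rows from the slide, shuffled so that sorting has a visible effect.
RAW = """\
chr3L 10000496 10000521 U1 0 -
chr2L 10000192 10000217 U0 0 +
chr21 10000556 10000581 U2 0 +
chr2R 10000310 10000335 U2 0 +
chr2L 10000227 10000252 U1 0 -
"""

rows = [line.split() for line in RAW.splitlines()]

# Galaxy numbers columns from 1, so c2 is index 1 and c4 is index 3.
sorted_rows = sorted(rows, key=lambda row: int(row[1]))

# The expression c4!='U2' keeps only rows whose fourth column is not 'U2'.
kept_rows = [row for row in sorted_rows if row[3] != "U2"]

for row in kept_rows:
    print("\t".join(row))
```

Note that the start column is converted to an integer before sorting; sorting it as text would order the values alphabetically instead of numerically.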
Visualizing Deep Seq data with UCSC genome browser
Practice & Observe I:
1. Load the PolII.H99.Bed file as a custom track in the browser by copying and pasting its URL.
2. View the ‘dense’ and then the ‘full’ presentation of the track.
Visualizing Deep Seq data with UCSC genome browser
Practice & Observe II:
1. Save the landmark.bed file to your local computer. View the contents with Notepad.
2. Load the local file into the UCSC browser.
3. Edit the color value, save, resubmit, and observe the differences.
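As an illustration of where a color value can live in a custom track, the sketch below builds a small track whose color is set in the `track` line as `color=R,G,B`. This is an assumption about the file layout: the actual landmark.bed may set its color differently (for example per feature), and the track name and feature rows used here are hypothetical.

```python
def make_custom_track(name, color, features):
    """Build custom-track text: one 'track' line followed by BED rows."""
    red, green, blue = color
    header = f'track name="{name}" color={red},{green},{blue}'
    rows = ["\t".join(str(field) for field in feature) for feature in features]
    return "\n".join([header, *rows]) + "\n"


# Hypothetical features; replace with the rows of your own file.
features = [
    ("chr2L", 10000192, 10000217, "landmark1"),
    ("chr2L", 10000227, 10000252, "landmark2"),
]

red_track = make_custom_track("landmarks", (255, 0, 0), features)
blue_track = make_custom_track("landmarks", (0, 0, 255), features)
print(red_track)
```

Submitting the two versions one after the other should show the same features drawn in different colors.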
Apollo Genome annotation tools
Observe: Using Apollo to organize information for studying complex genomic regions.