Transcriptional regulation & Clustering
Elena Nikolaeva, University of Tartu, Estonia
MTAT.03.239 Bioinformatics
• Part 1: Transcriptional regulation - Gene regulation in eukaryotes - PWM - TFBS prediction using PWM
• Part 2: Clustering - Goal - Types of clustering - Distance measures - Applications
Information flow in a eukaryotic cell
http://www.nature.com/scitable/topicpage/gene-expression-14121669
An intron is any nucleotide sequence within a gene that is removed by RNA splicing while the final mature RNA product of the gene is being generated.
An exon is any nucleotide sequence encoded by a gene that remains present within the final mature RNA product of that gene.
Transcription factor
Image from “Optimization of PWMs using statistically synchrofasostatic morphogenetic infrastructural modeling” by Konstantin Tretjakov
TF1
TF2
A transcription factor (TF) is a protein that binds to specific DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to messenger RNA.
TFs perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA).
Transcriptional regulators can determine cell types
http://www.nature.com/scitable/topicpage/gene-expression-14121669
Gene
Enhancer
TSS: Transcription Start Site
“Proximal” promoter (100bp-2Kb 5’ Upstream)
How is gene expression regulated?
Transcription begins when an RNA polymerase binds to a so-called promoter sequence on the DNA molecule.
Binding of regulatory proteins to an enhancer sequence causes a shift in chromatin structure that either promotes or inhibits RNA polymerase and transcription factor binding.
Promoter analysis. TFBS Detection by D.Rico
Promoters
• Promoters are DNA segments upstream of transcripts that initiate transcription
• The promoter attracts RNA polymerase to the transcription start site
5’ Promoter 3’
Enhancers
• An enhancer is a short region of DNA that can be bound by proteins to enhance transcription levels of genes (it does not need to be particularly close to the genes it acts on)
http://www.nature.com/scitable/topicpage/gene-expression-14121669
Transcription repression
An inactive repressor protein can become activated by another molecule. The active repressor can then interfere with RNA polymerase binding to the promoter, effectively preventing transcription.
http://www.nature.com/scitable/topicpage/gene-expression-14121669
How to identify Transcription Factor Binding Sites (TFBS)?
http://www.nature.com/scitable/topicpage/gene-expression-14121669
Transcription factors recognize specific sequences.
http://www.bio.jhu.edu/Faculty/Privalov/
TGAGTCATGACTCA
Gcn4
DNA
Some positions can have different nucleotides (IUPAC ambiguity codes)
TGAGTCATGACTCA
TGASTCA
Gcn4 consensus sequence
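A consensus written with IUPAC ambiguity codes can be searched for mechanically. The following is a minimal sketch, assuming the standard IUPAC nucleotide table (for example, S stands for C or G); the helper names `iupac_to_regex` and `find_sites` are hypothetical and not part of the original slides.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join("[" + IUPAC[code] + "]" for code in consensus.upper())

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets overlapping occurrences be reported too.
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# The Gcn4 consensus TGASTCA searched in the 14-bp sequence from the slide.
sites = find_sites("TGASTCA", "TGAGTCATGACTCA")
print(sites)
```

Only the forward strand is searched here; a fuller scan would also test the reverse complement.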
TFBS: Detection methods
• in vivo: functional analysis, ChIP
• in vitro on cloned fragment: footprinting reactions, exonuclease digests, gel retardation (EMSA), UV crosslinking
• in vitro on artificial DNA: SELEX (Systematic Evolution of Ligands by Exponential enrichment)
Slide from Promoter analysis. TFBS Detection by D.Rico
ChIP-Seq can be used to detect TF binding sites
Not all nucleotides are likely to be present at each position
TF Binding Sites
• Problems:
– often poorly defined consensus
– sequences not conserved within species, and even worse between species
– examples of enhancers functionally conserved but not sequence-conserved
– most of the TFBS sequence data comes from just a few species
– very often in vitro experiments
– 2 completely different binding sites could be merged in the same matrix/consensus
Binding sites and motifs
• Transcription factor binding is specific, hence binding sites are similar to each other, but variability is often seen
• A motif is the common sequence pattern among the binding sites of a transcription factor
From PFM to PWM/PSSM
Data collection
Probabilities can be calculated and corrected for background
PWMs are also called position-specific scoring matrices (PSSMs). Values are in log scale.
http://www.nature.com/nrg/journal/v5/n4/box/nrg1315_BX2.html
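The step from counts to log-scale weights can be sketched as follows. This is an illustrative sketch, not the exact procedure of any particular tool: the uniform background, the pseudocount of 1 spread according to the background, and the function name `pfm_to_pwm` are all assumptions. The counts are those of the four example sites AAGTTC, AAGCTC, AGGCTC and AAGGTC.

```python
import math

def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert a position frequency matrix into a log2-odds weight matrix.

    pfm maps each base to a list of counts, one entry per motif position.
    """
    bases = sorted(pfm)
    if background is None:
        # Assumption: all four bases are equally likely in the background.
        background = {b: 1.0 / len(bases) for b in bases}
    length = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(length):
        total = sum(pfm[b][i] for b in bases)
        for b in bases:
            # The pseudocount keeps zero counts from producing log(0).
            prob = (pfm[b][i] + pseudocount * background[b]) / (total + pseudocount)
            # Correct for background and move to log scale.
            pwm[b].append(math.log2(prob / background[b]))
    return pwm

pfm = {
    "A": [4, 3, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 0, 4],
    "G": [0, 1, 4, 1, 0, 0],
    "T": [0, 0, 0, 1, 4, 0],
}
pwm = pfm_to_pwm(pfm)
for base in "ACGT":
    print(base, [round(w, 2) for w in pwm[base]])
```

Positive weights mark bases seen more often than the background predicts, negative weights mark bases seen less often.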
SEQUENCE LOGOS: The information content of a matrix column ranges from 0 bits (no base preference) to 2 bits (only one base used).
http://weblogo.berkeley.edu/ http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
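The 0-to-2-bit range follows from defining the information content of a column as 2 minus its Shannon entropy. The sketch below assumes a uniform background and applies no small-sample correction (logo software may apply one); the function name `column_information` is hypothetical.

```python
import math

def column_information(counts):
    """Information content in bits of one matrix column: 2 - Shannon entropy."""
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:  # a base that never occurs contributes nothing
            p = count / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy

# Columns are given as counts in the order A, C, G, T.
only_one_base = column_information([4, 0, 0, 0])
no_preference = column_information([1, 1, 1, 1])
mixed_column = column_information([3, 0, 1, 0])
print(only_one_base, no_preference, round(mixed_column, 3))
```

In a sequence logo, the total height of the letter stack at a position corresponds to this value, and each letter is scaled by its relative frequency.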
Sites: AAGTTC, AAGCTC, AGGCTC, AAGGTC
Position frequency matrix:
A  4 3 0 0 0 0
C  0 0 0 2 0 4
G  0 1 4 1 0 0
T  0 0 0 1 4 0
Consensus: ARGBTC
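The counts and the consensus above can be reproduced from the four sites. This is a minimal sketch under one loud assumption: every base observed at least once in a column is included in the IUPAC code for that column (real tools usually apply frequency thresholds instead). The names `build_pfm` and `consensus` are hypothetical.

```python
# IUPAC code for every non-empty set of bases.
CODE_FOR_SET = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def build_pfm(sites):
    """Count how often each base occurs at each position of aligned sites."""
    length = len(sites[0])
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

def consensus(pfm):
    """Assumption: any base seen at least once is part of the column's code."""
    length = len(pfm["A"])
    codes = []
    for i in range(length):
        observed = frozenset(b for b in "ACGT" if pfm[b][i] > 0)
        codes.append(CODE_FOR_SET[observed])
    return "".join(codes)

example_sites = ["AAGTTC", "AAGCTC", "AGGCTC", "AAGGTC"]
example_pfm = build_pfm(example_sites)
example_consensus = consensus(example_pfm)
print(example_pfm)
print(example_consensus)
```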
Summary
Transfac: not free, >848 matrices, loads of information and references, quality score based on methods used
Jaspar: open source, 174 matrices, minimal information, majority based on SELEX method (80%)
PWM databases
TRANSFAC®
http://www.gene-regulation.com/pub/databases.html
http://jaspar.genereg.net/
Jaspar example: Pax6
Futility Theorem: Essentially all predicted TFBSs will have no functional role. It’s necessary to constrain the search space.
Multiple approaches to constrain the search space
• Promoter regions
• Conserved sequences
• Open chromatin
• Integrate over a promoter region
• Proximity to transcription start site (TSS)
• etc.
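One of these constraints, restricting predictions to a promoter window upstream of the TSS, can be sketched as a sliding-window PWM scan. Everything specific here is an assumption made for illustration: the toy 3-column weight matrix, the threshold, the 2000-base default window, and the function names `scan` and `scan_promoter`. Only the forward strand is scored.

```python
def scan(sequence, pwm, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The window score is the sum of the weights of its bases.
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

def scan_promoter(sequence, pwm, threshold, tss, upstream=2000):
    """Scan only the `upstream` bases that precede the TSS index."""
    begin = max(0, tss - upstream)
    region = sequence[begin:tss]
    # Report positions in the coordinates of the full sequence.
    return [(begin + pos, score) for pos, score in scan(region, pwm, threshold)]

# Hypothetical toy matrix that prefers the motif TGA.
toy_pwm = {
    "A": [-1, -1, 1],
    "C": [-1, -1, -1],
    "G": [-1, 1, -1],
    "T": [1, -1, -1],
}
toy_sequence = "CCTGACCCCTGACC"
all_hits = scan(toy_sequence, toy_pwm, threshold=3)
promoter_hits = scan_promoter(toy_sequence, toy_pwm, threshold=3, tss=8, upstream=6)
print(all_hits)
print(promoter_hits)
```

The unrestricted scan reports both occurrences of the motif, while the promoter-restricted scan keeps only the one that lies in the window before the TSS.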
Cluster Analysis
Adapted from Meelis Kull’s slides, Bioinformatics course 2011
What is cluster analysis?
Clustering is finding groups of objects such that the objects in a group are:
• similar (or related) to the other objects in the same group, and
• different from (or unrelated to) the objects in other groups
Why cluster biological data?
• Intuition building
• Hypothesis generation
• Summarizing / compressing large data
Partitional vs Hierarchical
Fuzzy vs Non-Fuzzy
• Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)
• Non-fuzzy: each object belongs to exactly one cluster
Hierarchical clustering
Hierarchical clustering is usually depicted as a dendrogram (tree)
• Each subtree corresponds to a cluster
• Height of branching shows distance
Hierarchical clustering (0)
Algorithm for Agglomerative Hierarchical Clustering: Join the two closest objects
Hierarchical clustering (1)
Join the two closest objects
Hierarchical clustering (2)
Keep joining the closest pairs
Hierarchical clustering (3)
Keep joining the closest pairs
Hierarchical clustering (4)
Keep joining the closest pairs
Hierarchical clustering (5)
Keep joining the closest pairs
Hierarchical clustering (10)
After 10 steps we have 4 clusters left
Several ways to measure distance between clusters:
• Single linkage (MIN)
• Complete linkage (MAX)
• Average linkage
– Weighted
– Unweighted
• ...
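The agglomerative procedure and the linkage choices can be sketched together. This is a naive illustration, not an efficient implementation; the function names, the one-dimensional toy data and the `n_clusters` stopping parameter are assumptions, and "average" here means the unweighted average over all cross-cluster pairs.

```python
def linkage_distance(a, b, dist, method):
    """Distance between clusters a and b under the chosen linkage."""
    pairwise = [dist(x, y) for x in a for y in b]
    if method == "single":      # MIN: closest pair of members
        return min(pairwise)
    if method == "complete":    # MAX: farthest pair of members
        return max(pairwise)
    if method == "average":     # unweighted mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)

def agglomerate(objects, dist, method="single", n_clusters=1):
    """Join the two closest clusters until n_clusters remain.

    Returns the remaining clusters and the height of every merge.
    """
    clusters = [[obj] for obj in objects]  # start: every object is a cluster
    heights = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, heights

def absolute_difference(x, y):
    return abs(x - y)

values = [1.0, 2.0, 6.0, 7.0, 20.0]
single_clusters, single_heights = agglomerate(values, absolute_difference, "single", 2)
_, complete_heights = agglomerate(values, absolute_difference, "complete", 2)
_, average_heights = agglomerate(values, absolute_difference, "average", 2)
print(single_clusters, single_heights, complete_heights, average_heights)
```

The merge heights are what a dendrogram draws as branching heights; on this toy data the three linkages agree on the clusters but differ in the height of the last merge.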
Hierarchical clustering (11)
In this example and at this stage we have the same result as in partitional clustering
Hierarchical clustering (12)
In the final step the two remaining clusters are joined into a single cluster
Hierarchical clustering (13)
In the final step the two remaining clusters are joined into a single cluster
Examples of Hierarchical Clustering in Bioinformatics
• Phylogeny
• Gene expression clustering
K-means clustering
• Partitional, non-fuzzy
• Partitions the data into K clusters
• K is given by the user
Algorithm:
• Choose K initial centers for the clusters
• Assign each object to its closest center
• Recalculate cluster centers
• Repeat the assignment and recalculation steps until convergence
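The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: initial centers are drawn at random from the data with a fixed seed, squared Euclidean distance is used for the assignment, and the names `k_means`, `squared_distance` and `mean_point` are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: choose K initial centers (assumption: random data points).
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to its closest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: recalculate each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(mean_point(members) if members else centers[c])
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(data, k=2)
print(centers, labels)
```

Because the result depends on the initial centers, practical implementations usually run several restarts or use a smarter seeding strategy.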
K-means (1) to K-means (6) [figure slides]
K-means clustering summary
• One of the fastest clustering algorithms
• Therefore very widely used
• Sensitive to the choice of initial centres
– many algorithms exist to choose initial centres cleverly
• Assumes that the mean can be calculated
– can be used on vector data
– cannot be used on sequences (what is the mean of A and T?)
Distance measures
Distance of vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient
Distance of sequences ACCTTG and TACCTG
• Hamming distance => 3
• Levenshtein distance => 2 (alignment: .ACCTTG / TACC.TG)
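The five measures can be written out directly. This is a minimal sketch; the function names are hypothetical, the Levenshtein distance uses the usual dynamic-programming recurrence with unit costs, and the example values are the sequences from the slide.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient r(x, y)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

def correlation_distance(x, y):
    return 1.0 - pearson(x, y)

def hamming(s, t):
    """Number of mismatching positions; defined for equal lengths only."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimal number of insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(hamming("ACCTTG", "TACCTG"), levenshtein("ACCTTG", "TACCTG"))
```

Hamming distance compares position by position, so a single shift makes many positions mismatch, whereas Levenshtein distance absorbs the shift with one insertion and one deletion.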
K-medoids clustering
• The same as K-means, except that the center is required to be at an object
• Medoid: an object which has minimal total distance to all other objects in its cluster
• Can be used on more complex data, with any distance measure
• Slower than K-means
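Because only pairwise distances are needed, K-medoids can cluster sequences directly. The sketch below alternates assignment and medoid update in the same way as K-means; it is one simple variant rather than a specific published algorithm, and the default choice of the first K objects as initial medoids, the toy sequences and the function names are assumptions.

```python
def hamming(s, t):
    """Number of mismatching positions between equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))

def k_medoids(objects, k, dist, initial=None, max_iter=100):
    # Assumption: without explicit initial medoids, take the first k objects.
    medoids = list(initial) if initial is not None else list(objects[:k])
    clusters = []
    for _ in range(max_iter):
        # Assign each object to its closest medoid.
        clusters = [[] for _ in range(k)]
        for obj in objects:
            nearest = min(range(k), key=lambda c: dist(obj, medoids[c]))
            clusters[nearest].append(obj)
        # New medoid: the member with minimal total distance to its cluster.
        new_medoids = []
        for c in range(k):
            members = clusters[c]
            if members:
                new_medoids.append(
                    min(members, key=lambda m: sum(dist(m, o) for o in members))
                )
            else:
                new_medoids.append(medoids[c])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

sequences = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
medoids, clusters = k_medoids(sequences, k=2, dist=hamming)
print(medoids, clusters)
```

The medoid update compares every member with every other member of its cluster, which is the main reason this is slower than recomputing a mean.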
K-medoids (1) to K-medoids (9) [figure slides]
Examples of K-means and K-medoids in Bioinformatics
• Gene expression clustering
• Sequence clustering
Summary of Clustering
• Aims: intuition, hypothesis generation, summarization
• Types:
– Hierarchical/Partitional
– Fuzzy/Non-Fuzzy
– Vector-based/Distance-based etc.
• Distance measures
– Euclidean, Manhattan, Correlation
– Hamming, Levenshtein
– etc.
• Applications:
– Clustering genes, sequences, organisms, etc.