
Transcriptional regulation & Clustering

Elena Nikolaeva [email protected] University of Tartu, Estonia

MTAT.03.239 Bioinformatics

•  Part 1: Transcriptional regulation
   -  Gene regulation in eukaryotes
   -  PWM
   -  TFBS prediction using PWM

•  Part 2: Clustering
   -  Goal
   -  Types of clustering
   -  Distance measures
   -  Applications

Information flow in eukaryotic cell

http://www.nature.com/scitable/topicpage/gene-expression-14121669

An intron is any nucleotide sequence within a gene that is removed by RNA splicing while the final mature RNA product of the gene is being generated.

An exon is any nucleotide sequence encoded by a gene that remains present within the final mature RNA product of that gene.

Transcription factor

Image from “Optimization of PWMs using statistically synchrofasostatic morphogenetic infrastructural modeling” by Konstantin Tretjakov

TF1

TF2

A transcription factor is a protein that binds to specific DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to messenger RNA.

Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA).

Transcriptional regulators can determine cell types

Gene

Enhancer

TSS: Transcription Start Site

“Proximal” promoter (100bp-2Kb 5’ Upstream)

How is gene expression regulated?

Transcription begins when an RNA polymerase binds to a so-called promoter sequence on the DNA molecule.

Binding of regulatory proteins to an enhancer sequence causes a shift in chromatin structure that either promotes or inhibits RNA polymerase and transcription factor binding.

Promoter analysis. TFBS Detection by D.Rico

Promoters
•  Promoters are DNA segments upstream of transcripts that initiate transcription
•  A promoter attracts RNA polymerase to the transcription start site

5’ Promoter 3’

Enhancers

An enhancer is a short region of DNA that can be bound by proteins to enhance transcription levels of genes (it does not need to be particularly close to the genes it acts on).

Transcription repression

An inactive repressor protein can become activated by another molecule. The active repressor can then interfere with RNA polymerase binding to the promoter, effectively preventing transcription.


How to identify Transcription Factor Binding Sites (TFBS)?


Transcription factors recognize specific sequences.

http://www.bio.jhu.edu/Faculty/Privalov/

TGAGTCATGACTCA

Gcn4

DNA


Some positions can have different nucleotides (IUPAC ambiguity codes)

TGAGTCATGACTCA TGASTCA

Gcn4 consensus sequence
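The use of IUPAC ambiguity codes in a consensus such as TGASTCA (S stands for C or G) can be illustrated with a small sketch. It translates a consensus string into a regular expression and reports where it matches. The function names `consensus_to_regex` and `find_sites` are made up for this illustration; only the IUPAC code table itself is standard.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(
        IUPAC[c] if len(IUPAC[c]) == 1 else "[" + IUPAC[c] + "]"
        for c in consensus.upper()
    )

def find_sites(consensus, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead makes the search report overlapping occurrences as well.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]

print(consensus_to_regex("TGASTCA"))                # TGA[CG]TCA
print(find_sites("TGASTCA", "TGAGTCATGACTCA"))      # [0, 7]
```

Both halves of the Gcn4 site from the slide (TGAGTCA and TGACTCA) match the consensus, which is why two positions are reported.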

TFBS: Detection methods

in vivo: Functional analysis, ChIP

in vitro on cloned fragment: Footprinting reactions, Exonuclease digests, Gel retardation (EMSA), UV crosslinking

in vitro on artificial DNA: SELEX (Systematic Evolution of Ligands by Exponential enrichment)

Slide from Promoter analysis. TFBS Detection by D.Rico

ChIP-Seq can be used to detect TF binding sites

Not all nucleotides are likely to be present at each position

TF Binding Sites

•  Problems:
   –  often poorly defined consensus
   –  sequences not conserved within species, and even worse between species
   –  examples of enhancers functionally conserved but not sequence-conserved
   –  most of the TFBS sequence data comes from just a few species
   –  very often in vitro experiments
   –  two completely different binding sites could be merged in the same matrix/consensus

Binding sites and motifs

•  Transcription factor binding is specific, hence binding sites are similar to each other, but variability is often seen

•  A motif is the common sequence pattern among the binding sites of a transcription factor

From PFM to PWM/PSSM

Data collection

Probabilities can be calculated and corrected for background

Position weight matrices (PWMs) are also called position-specific scoring matrices (PSSMs); their entries are on a log scale.

http://www.nature.com/nrg/journal/v5/n4/box/nrg1315_BX2.html

SEQUENCE LOGOS: The information content of a matrix column ranges from 0 bits (no base preference) to 2 bits (only one base used).

http://weblogo.berkeley.edu/
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
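The 0 to 2 bit range can be checked with a short calculation. This is a minimal sketch that assumes a uniform background (each base expected with probability 0.25) and applies no small-sample correction, so the information content of a column is 2 minus its Shannon entropy. The function name `column_information` is made up for this illustration.

```python
import math

def column_information(counts):
    """Information content (bits) of one column, given counts for A, C, G, T."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c > 0:  # a base that never occurs contributes nothing to the entropy
            p = c / total
            entropy -= p * math.log2(p)
    # Assumption: uniform background, so the maximum entropy is log2(4) = 2 bits.
    return 2.0 - entropy

print(column_information([4, 0, 0, 0]))  # only one base used   -> 2.0
print(column_information([1, 1, 1, 1]))  # no base preference   -> 0.0
print(column_information([0, 2, 1, 1]))  # C twice, G and T once -> 0.5
```

In a sequence logo, the height of the letter stack at each position is this value, and each letter's share of the stack is proportional to its frequency.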

AAGTTC AAGCTC AGGCTC AAGGTC

     1  2  3  4  5  6
A    4  3  0  0  0  0
C    0  0  0  2  0  4
G    0  1  4  1  0  0
T    0  0  0  1  4  0

Consensus: ARGBTC
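The step from aligned sites to a position frequency matrix (PFM) and then to a log-scale PWM can be sketched as follows. The pseudocount of 1 (spread evenly over the four bases) and the uniform background of 0.25 are assumptions chosen for illustration, not values given on the slides; the function names are also made up.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count how often each base occurs at each position of the aligned sites."""
    length = len(sites[0])
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds scores against a background frequency."""
    length = len(pfm["A"])
    n_sites = sum(pfm[b][0] for b in BASES)
    pwm = {b: [0.0] * length for b in BASES}
    for b in BASES:
        for i in range(length):
            # The pseudocount gives unseen bases a finite negative score
            # instead of minus infinity.
            p = (pfm[b][i] + pseudocount / 4) / (n_sites + pseudocount)
            pwm[b][i] = math.log2(p / background)
    return pwm

sites = ["AAGTTC", "AAGCTC", "AGGCTC", "AAGGTC"]
pfm = build_pfm(sites)
pwm = pfm_to_pwm(pfm)
for b in BASES:
    print(b, pfm[b], [round(s, 2) for s in pwm[b]])
```

The counts printed for the four example sites agree with the matrix above; positive PWM entries mark bases that are over-represented relative to the background, negative entries mark under-represented ones.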

Summary

PWM databases

Transfac: not free, >848 matrices, loads of information and references, quality score based on methods used

Jaspar: open source, 174 matrices, minimal information, majority based on SELEX method (80%)

TRANSFAC®

http://www.gene-regulation.com/pub/databases.html

http://jaspar.genereg.net/

Jaspar example: Pax6

Multiple approaches to constrain the search space

Futility Theorem: Essentially all predicted TFBSs will have no functional role. It is necessary to constrain the search space:

• Promoter regions
• Conserved sequences
• Open chromatin
• Integrate over a promoter region
• Proximity to transcription start site (TSS)
• etc.
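TFBS prediction with a PWM amounts to sliding the matrix along a sequence and reporting windows that score above a threshold. The sketch below scans only a short stretch standing in for a promoter region, which is one way of constraining the search space. The counts come from the four example sites shown earlier (AAGTTC, AAGCTC, AGGCTC, AAGGTC); the pseudocount, the background of 0.25, the threshold of 5.0 and the `promoter` sequence are all assumptions made up for this illustration.

```python
import math

BASES = "ACGT"

# Counts from the four example sites AAGTTC, AAGCTC, AGGCTC, AAGGTC.
PFM = {
    "A": [4, 3, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 0, 4],
    "G": [0, 1, 4, 1, 0, 0],
    "T": [0, 0, 0, 1, 4, 0],
}

def log_odds(pfm, pseudocount=1.0, background=0.25):
    """Turn a count matrix into log2-odds scores (assumed uniform background)."""
    n = sum(pfm[b][0] for b in BASES)
    return {
        b: [math.log2((c + pseudocount / 4) / (n + pseudocount) / background)
            for c in pfm[b]]
        for b in BASES
    }

def scan(sequence, pwm, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The score of a window is the sum of the per-position PWM entries.
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

pwm = log_odds(PFM)
promoter = "TTAAGCTCGGAGGTTCAA"  # made-up sequence, for illustration only
print(scan(promoter, pwm, threshold=5.0))
```

With these settings the scan reports two windows, AAGCTC at position 2 and AGGTTC at position 10. On a whole genome the same threshold would produce a very large number of hits, which is the practical content of the Futility Theorem.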

Cluster Analysis

Adapted from Meelis Kull’s slides, Bioinformatics course 2011

What is cluster analysis?

Clustering is finding groups of objects such that the objects in a group are:
•  similar (or related) to the objects in the same group, and
•  different from (or unrelated to) the objects in other groups

Why cluster biological data?

• Intuition building
• Hypothesis generation
• Summarizing / compressing large data

Partitional vs Hierarchical

Fuzzy vs Non-Fuzzy

Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)

Non-fuzzy: each object belongs to exactly one cluster

Hierarchical clustering

Hierarchical clustering is usually depicted as a dendrogram (tree)

•  Each subtree corresponds to a cluster
•  Height of branching shows distance

Algorithm for Agglomerative Hierarchical Clustering:
•  Join the two closest objects
•  Keep joining the closest pairs

After 10 steps we have 4 clusters left


Several ways to measure distance between clusters:
•  Single linkage (MIN)
•  Complete linkage (MAX)
•  Average linkage
•  Weighted
•  Unweighted
•  ...


In this example and at this stage we have the same result as in partitional clustering


In the final step the two remaining clusters are joined into a single cluster
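The agglomerative procedure and the linkage choices can be put together in a minimal sketch. It uses Euclidean distance between points and a naive search over all cluster pairs at every step, which is fine for a handful of points but slow for large data. The function names and the example points are made up for this illustration, and "average" here means the plain (unweighted) mean over all member pairs.

```python
import math

def linkage_distance(a, b, points, method):
    """Distance between clusters a and b (lists of point indices)."""
    pair = [math.dist(points[i], points[j]) for i in a for j in b]
    if method == "single":      # MIN: closest pair of members
        return min(pair)
    if method == "complete":    # MAX: farthest pair of members
        return max(pair)
    if method == "average":     # unweighted mean over all member pairs
        return sum(pair) / len(pair)
    raise ValueError("unknown linkage method: " + method)

def agglomerative(points, method="single", n_clusters=1):
    """Join the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per object
    merges = []  # (cluster_a, cluster_b, distance): the dendrogram, bottom-up
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y], points, method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, method="single", n_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [4]]
```

Stopping early (here at two clusters) corresponds to cutting the dendrogram at a given height; running with `n_clusters=1` records every merge, including the final step in which the last two clusters are joined.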





Examples of Hierarchical Clustering in Bioinformatics

Phylogeny

Gene expression clustering

K-means clustering

•  Partitional, non-fuzzy
•  Partitions the data into K clusters
•  K is given by the user

Algorithm:
•  Choose K initial centers for the clusters
•  Assign each object to its closest center
•  Recalculate cluster centers
•  Repeat until convergence
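The four steps of the algorithm can be written out as a minimal sketch. It assumes the objects are tuples of numbers compared with Euclidean distance, picks the initial centers as K random data points (one simple choice among many), and treats an unchanged assignment as convergence. The function name, the `seed` parameter and the example points are made up for this illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    # Choose K initial centers (here: K distinct random data points).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each object to its closest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Converged: the assignment did not change since the last round.
        if new_labels == labels:
            break
        labels = new_labels
        # Recalculate each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

For this small, well-separated example the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initial centers.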







K-means clustering summary

•  One of the fastest clustering algorithms
•  Therefore very widely used
•  Sensitive to the choice of initial centres
   •  many algorithms to choose initial centres cleverly
•  Assumes that the mean can be calculated
   •  can be used on vector data
   •  cannot be used on sequences (what is the mean of A and T?)

Distance measures

Distance of vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:

• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

• Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient

Distance of sequences ACCTTG and TACCTG:

• Hamming distance: ACCTTG vs TACCTG => 3

• Levenshtein distance: .ACCTTG vs TACC.TG => 2
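The five distance measures can be written down directly. This is a minimal sketch in plain Python with illustrative function names; the Levenshtein function uses the standard dynamic-programming recurrence with unit costs for insertion, deletion and substitution.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - r(x, y), where r is the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)  # undefined (division by zero) for constant vectors

def hamming(s, t):
    """Number of mismatching positions; needs sequences of equal length."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

print(hamming("ACCTTG", "TACCTG"))      # 3, as on the slide
print(levenshtein("ACCTTG", "TACCTG"))  # 2, as on the slide
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))  # 5.0 7
```

Correlation distance ignores scale and offset: two expression profiles with the same shape, such as (1, 2, 3) and (2, 4, 6), are at distance 0 even though their Euclidean distance is not 0.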

K-medoids clustering

•  The same as K-means, except that the center is required to be at an object

•  Medoid: an object which has minimal total distance to all other objects in its cluster

•  Can be used on more complex data, with any distance measure

•  Slower than K-means
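Because a medoid is always one of the objects, K-medoids needs only a distance function, so it can cluster sequences directly. The sketch below alternates assignment and medoid update, mirroring the K-means steps; this simple alternating scheme is an assumption for illustration and not necessarily the variant used in the course. The function names, the `init` parameter and the example sequences are made up.

```python
import random

def hamming(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))

def assign(objects, medoids, dist):
    """Map each medoid index to the indices of the objects closest to it."""
    clusters = {m: [] for m in medoids}
    for i, obj in enumerate(objects):
        closest = min(medoids, key=lambda m: dist(obj, objects[m]))
        clusters[closest].append(i)
    return clusters

def kmedoids(objects, k, dist, init=None, max_iter=100, seed=0):
    """Alternate assignment and medoid update until the medoids stop changing."""
    if init is None:
        init = random.Random(seed).sample(range(len(objects)), k)
    medoids = list(init)
    for _ in range(max_iter):
        clusters = assign(objects, medoids, dist)
        new_medoids = []
        for m, members in clusters.items():
            members = members or [m]  # guard against an empty cluster
            # Medoid: the member with minimal total distance to the others.
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(objects[c], objects[j]) for j in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids), assign(objects, medoids, dist)

seqs = ["ACCTTG", "ACCTTA", "ACCTAG", "TGGAAC", "TGGAAT", "TGGATC"]
medoids, clusters = kmedoids(seqs, k=2, dist=hamming, init=[0, 1])
print([seqs[m] for m in medoids])  # ['ACCTTG', 'TGGAAC']
```

Every update evaluates all pairwise distances within a cluster, which is the main reason K-medoids is slower than K-means. Like K-means, the result depends on the initial choice: on this data a start from sequences 2 and 1 (`init=[2, 1]`) stops with both medoids in the first group.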










Examples of K-means and K-medoids in Bioinformatics

Gene expression clustering

Sequence clustering

Summary of Clustering

• Aims: intuition, hypothesis generation, summarization
• Types:
  • Hierarchical/Partitional
  • Fuzzy/Non-Fuzzy
  • Vector-based/Distance-based
  • etc.
• Distance measures:
  • Euclidean, Manhattan, Correlation
  • Hamming, Levenshtein
  • etc.
• Applications:
  • Clustering genes, sequences, organisms, etc.
