Multiple Alignments Motifs/Profiles • What is multiple alignment? • HOW does one do this? • WHY does one do this? • What do we mean by a motif or profile? BIO520 Bioinformatics Jim Lund Prev. reading: Ch 1-5 Assigned reading: Ch 6.4, 6.5, 6.6

Multiple Alignments
Motifs/Profiles

• What is multiple alignment?
• HOW does one do this?
• WHY does one do this?
• What do we mean by a motif or profile?

BIO520 Bioinformatics
Jim Lund

Prev. reading: Ch 1-5
Assigned reading: Ch 6.4, 6.5, 6.6

Information from Alignments

• Infer biological function
  – Conserved elements critical for function
  – Divergent elements relate to divergent function
• Infer structure (2°, 3°)
• Infer phylogeny
  – History
  – Evolutionary forces (selection…)

How do I find similar sequences?

Multiple Alignment

•Global, Optimal

•Theory

•Computation

•Progressive Alignment

Multiple Alignment: better alignments

Alignment Methods/Programs

• GAP (GCG suite)
  – Optimal Alignment
• MSA
  – (nearly) Optimal Alignment
• Clustal W/X
  – Progressive Alignment
• PSI-BLAST
  – Searches for matching sequences iteratively
  – The search (query) sequence is the invariant master for the alignment.

MSA Strategy

$c(A) = \sum_{i<j} c(A_{i,j})$
(the cost of a multiple alignment A is the sum of the costs of its pairwise projections A_{i,j})
Minimize score!

• HUGE matrix (aa^(# of seqs)) can CRASH the computer
  – time ~ product of sequence lengths
  – 1000x10,000 OK, but 200x200x200x200 NOT
• Alignment procedure
  – nearly optimal (only considers a subset of all alignments)
  – weight sequences via distance
  – branch-and-bound algorithm
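
The scoring idea and the size problem on this slide can be sketched in a few lines of Python. This is only an illustration, not the MSA program's implementation: the function names and the cost values (mismatch = 1, gap = 2) are assumptions made up for the example, and the sequence weighting and branch-and-bound pruning that MSA uses are left out.

```python
from itertools import combinations
from math import prod


def lattice_cells(lengths):
    """Approximate size of the full dynamic-programming matrix:
    the product of the sequence lengths."""
    return prod(lengths)


def pair_cost(row_a, row_b, mismatch=1, gap=2):
    """Cost of the pairwise alignment induced by two rows of a multiple
    alignment. The cost values are hypothetical, chosen for illustration."""
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column does not exist in this pairwise projection
        if a == "-" or b == "-":
            cost += gap
        elif a != b:
            cost += mismatch
    return cost


def sum_of_pairs_cost(rows):
    """c(A): sum of the pairwise costs over all pairs of rows i < j."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    print(lattice_cells([1000, 10000]))   # two sequences
    print(lattice_cells([200] * 4))       # four sequences
    print(sum_of_pairs_cost(["AC-G", "ACTG", "A-TG"]))
```

Four sequences of only 200 residues need far more cells than two long sequences, which is why the optimal approach stops being practical as sequences are added.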

Running MSA

• Download and run it locally (UNIX):
  – http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/genetic_analysis.html
• On the internet:
  – http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
• Rerun on segments AFTER Clustal...

Clustal Strategy

1. Rapid pairwise alignments each-to-each
2. Calculate distance matrix
  – Create guide tree (neighbor joining)
3. Align
  – Closest pairs first
  – Add pairs or align sub-alignments
  – Adjust similarity matrix as alignment proceeds
4. Add sequences
  – introduce gaps
    • gaps at loops, not inside known 2° structures
    • Dynamic gap weighting
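
Steps 1 to 3 can be illustrated with a small Python sketch that produces a "closest pairs first" joining order. It is a simplification under stated assumptions, not Clustal's code: the distance is a k-mer Jaccard distance standing in for the rapid pairwise alignments, and simple average-linkage clustering is swapped in for neighbor joining. All function names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Jaccard distance between k-mer sets (stand-in for a fast pairwise score)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def distance_matrix(seqs, k=2):
    """Each-to-each distances, keyed by the unordered pair of names."""
    return {frozenset((x, y)): kmer_distance(seqs[x], seqs[y], k)
            for x, y in combinations(seqs, 2)}


def join_order(seqs, k=2):
    """Average-linkage clustering (NOT neighbor joining).
    Returns the merges in the order they would be aligned."""
    dist = distance_matrix(seqs, k)
    clusters = [(name,) for name in seqs]
    merges = []

    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    toy = {"a": "KAENGKLVING", "b": "KAENGKLVING", "c": "WWWWPPPPYYY"}
    for left, right in join_order(toy):
        print(left, "+", right)
```

Each merge corresponds to one alignment step: two sequences, a sequence and a sub-alignment, or two sub-alignments.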

Clustal Strategy

Pairwise alignments → Guide tree → Align

Clustal W(X) Strategy
1. Pairwise alignments

The pairwise alignment number here is a dissimilarity measure.
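
One common way to turn a pairwise alignment into a dissimilarity is 1 minus the fraction of identical residues over the aligned (ungapped) columns. The sketch below assumes that definition; the exact number a given Clustal version reports may be computed differently, and the function name is hypothetical.

```python
def dissimilarity(row_a, row_b):
    """1 - fractional identity over columns where neither sequence has a gap."""
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 1.0  # nothing aligned: treat as maximally dissimilar
    return 1.0 - identical / compared


if __name__ == "__main__":
    print(dissimilarity("AC-GT", "ACTGA"))
```

Identical sequences score 0, and the value grows toward 1 as the sequences diverge.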

Clustal W(X) Strategy
2. Unrooted neighbor tree (dendrogram)

Clustal W(X) Strategy
3. Guide tree

Clustal W(X) Strategy
4. Progressive alignment using guide tree

Running Clustal W/X

• WWW, Win, Mac, UNIX
  – http://www2.ebi.ac.uk/clustalw/
• Input
  – Multiple sequence file (PIR, FASTA,…)
• Can FORCE alignments
• Specify secondary structures
• Considerations
  – Fast, easy, widely used
  – Divergent proteins OK (trees misleading)
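
For reference, a multiple sequence file in FASTA format is just a series of ">" header lines, each followed by sequence lines. The minimal reader below is a sketch for inspecting such a file before submitting it; the function name is hypothetical, and it does no validation of the residues.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text.
    The name is the first word after '>' on each header line."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = """>Rabbit GAPDH fragment
KAENGKLVINGKAITIFQERDPANIKWGDAG
>Chick GAPDH fragment
KAENGKLVINGHAITIFQERDPSNIKWADAG
"""
    for name, seq in parse_fasta(example).items():
        print(name, len(seq))
```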

“The Right Proteins”
GAPDH

Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

*********** :**********.:***.*******************************

“The Right Proteins”
GAPDH

Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

Human KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118

Tobacco KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110

Entamoeba EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105

:. : :: * : : :*:* :* *. *. :********* ** **:*****::

Alignment Interpretation

• DNA sequences
  – >50% “worth looking at” (eyeball test)
  – ~75% needed for phylogeny
• Polypeptide sequences
  – 80% similar = SAME tertiary structure
  – 30-80% domains = similar structure
  – 15-30% ????
  – <15% short motifs

Uses of Alignment

• Understanding or predicting mutant function

• Finding motifs in DNA or polypeptides

• Directing experiments--e.g. PCR primers

• Phylogeny

“The Right Proteins”

Rabbit KAENGKLVING-KAITIFQERDPANIKWGDAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

Chick KAENGKLVING-HAITIFQERDPSNIKWADAGAEYVVESTGVFTTMEKAGAHLKGGAKRV 117

Human KAEDGKLVIDG-KAITIFQERDPENIKWGDAGTAYVVESTGVFTTMEKAGAHLKGGAKRI 118

Tobacco KVKDEKTLLFGEKSVRVFGIRNPEEIPWAEAGADFVVESTGVFTDKDKAAAHLKGGAKKV 110

Entamoeba EAGENAIIVNGHKIV-VKAERDPAQIGWGALGVDYVVESTGVFTTIPKAEAHIKGGAKKV 105

:. : :: * : : :*:* :* *. *. :********* ** **:*****::

Viewing and interpreting alignments

• Color residues by property
  • Conservation in the alignment
  • Known properties
    • Substitution groups: STA, HY
    • Physicochemical property
      • charge
      • hydrophobicity
• Programs for visualization
  • Jalview
  • AMAS
  • Alscript
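
The conservation line printed under an alignment can be approximated with a short sketch: "*" where every residue in a column is identical, ":" where all residues fall inside one substitution group. Only the two groups named on the slide (STA, HY) are included here, so this is deliberately incomplete; real viewers use a longer group list, and the function names are hypothetical.

```python
# Partial list: only the substitution groups named on the slide.
GROUPS = ("STA", "HY")


def column_symbol(column):
    """'*' = identical, ':' = all residues within one group, ' ' otherwise."""
    residues = set(column)
    if "-" in residues:
        return " "  # any gap: column is not marked as conserved
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in GROUPS):
        return ":"
    return " "


def conservation_line(rows):
    """Build the marker line for a list of equal-length aligned rows."""
    return "".join(column_symbol(col) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["KAS", "KTY", "KAH"]
    for row in rows:
        print(row)
    print(conservation_line(rows))
```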

Viewing alignments

JalView alignment viewer

How to build multiple alignments

1. Find sequences to align (db search).

2. Choose which regions of each protein to include.

• Sequences should be of similar lengths.

3. Run multiple alignment program.

4. Inspect multiple alignment for problems.

• Regions with many gaps have aligned poorly.

5. Remove disruptive sequences and re-run alignment.

6. Add back remaining sequences avoiding disruption.
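
Steps 4 and 5 can be partly automated. The sketch below flags columns where most sequences carry a gap and counts, per sequence, how many residues sit in those columns; a sequence with a high count is a candidate "disruptive" sequence. The 0.5 threshold and the function names are assumptions for illustration, and visual inspection is still needed.

```python
def gap_fractions(rows):
    """Fraction of gap characters in each alignment column."""
    return [col.count("-") / len(col) for col in zip(*rows)]


def gappy_columns(rows, threshold=0.5):
    """Indices of columns whose gap fraction exceeds the threshold."""
    return [i for i, f in enumerate(gap_fractions(rows)) if f > threshold]


def residues_in_gappy_columns(rows, threshold=0.5):
    """Per row: number of residues placed in mostly-gap columns.
    A high count suggests the row is forcing gaps into the others."""
    flagged = gappy_columns(rows, threshold)
    return [sum(1 for i in flagged if row[i] != "-") for row in rows]


if __name__ == "__main__":
    rows = ["A-C", "A-C", "ATC"]
    print(gappy_columns(rows))
    print(residues_in_gappy_columns(rows))
```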

Interpro

• Pfam 7.3 (3865 domains)
• PRINTS 33.0 (1650 fingerprints)
• PROSITE 17.5 (1565 and 252 preliminary profiles)
• ProDom 2001.3 (1346 domains)
• SMART 3.1 (509 domains)
• TIGRFAMs 1.2 (814 domains)
• SWISS-PROT 40.27 (113470 entries)
• TrEMBL 21.12 (685610 entries)

Interpro
A database of protein families, domains and functional sites

• PROSITE, home of regular expressions and profiles;
• Pfam, SMART, TIGRFAMs, PIRSF, and SUPERFAMILY, keepers of hidden Markov models (HMMs);
• PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs);
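
To make "regular expression" concrete: a PROSITE-style pattern such as N-{P}-[ST]-{P} (the N-glycosylation site pattern) reads "N, then anything but P, then S or T, then anything but P". The converter below is a rough sketch that handles the common syntax elements (x, [..], {..}, repeat counts, and < or > anchors outside brackets); it is not a complete PROSITE parser, and the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex)
    for seq in ("AANGSAA", "AANPSAA"):
        match = re.search(regex, seq)
        print(seq, match.start() if match else None)
```

A pattern either matches or it does not. Profiles and HMMs instead score each position, which lets them detect more divergent family members.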

Interpro

NCBI CDD (Conserved Domain Database)

Domains from:
• Pfam (Protein families)
  – A database of protein families that currently contains > 7973 entries.
• SMART (a Simple Modular Architecture Research Tool)
  – More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable.
  – Domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
• COGs (Clusters of Orthologous Groups)
  – Proteins or groups of paralogs from at least 3 lineages that correspond to an ancient conserved domain