It & Health 2009 Summary Thomas Nordahl Petersen

Page 1: It & Health 2009 Summary Thomas Nordahl Petersen

It & Health 2009 Summary

Thomas Nordahl Petersen

Page 2: It & Health 2009 Summary Thomas Nordahl Petersen

Teachers

Thomas Nordahl Petersen

Rasmus Wernersson

Lisbeth Nielsen Fink

Anders Gorm Pedersen

Bent Petersen

Ramneek Gupta

Thomas Blicher

Page 3: It & Health 2009 Summary Thomas Nordahl Petersen

Outline of the course

• Topics will cover a general introduction to bioinformatics
  – Evolution
  – DNA / Protein
  – Alignment and scoring matrices
    • How does it work & what are the numbers
  – Visualization of multiple alignments
    • Phylogenetic trees and logo plots
  – Commonly used databases
    • Uniprot/Genbank & Genome browsers
  – Protein 3D-structure
  – Artificial neural networks & case stories
  – Practical use of bioinformatics tools
• Preparation for exam

Page 4: It & Health 2009 Summary Thomas Nordahl Petersen

Topics covered - (some of them)

Page 5: It & Health 2009 Summary Thomas Nordahl Petersen

Information flow in biological systems

Page 6: It & Health 2009 Summary Thomas Nordahl Petersen

Amino Acids

Each amino acid has an amine group and a carboxyl group. The side chain ‘R’ is attached to the C-alpha carbon.

The amino acids found in living organisms are L-amino acids.

Page 7: It & Health 2009 Summary Thomas Nordahl Petersen

Amino Acids - peptide bond

N-terminal C-terminal

Page 8: It & Health 2009 Summary Thomas Nordahl Petersen

1 and 3-letter codes

1. There are 20 naturally occurring amino acids
2. Normally the one- or three-letter codes are used

Ala - A    Met - M
Cys - C    Asn - N
Asp - D    Pro - P
Glu - E    Gln - Q
Phe - F    Arg - R
Gly - G    Ser - S
His - H    Thr - T
Ile - I    Val - V
Lys - K    Trp - W
Leu - L    Tyr - Y
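The code table above maps directly onto a lookup dictionary. The following is a minimal sketch in Python; the helper names `to_one_letter` and `to_three_letter` are hypothetical and chosen only for illustration.

```python
# Three-letter to one-letter codes for the 20 amino acids listed on the slide.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
# Reverse mapping, built from the table so the two can never disagree.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_codes):
    """Convert a list such as ['Met', 'Ala'] to the string 'MA'."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_codes)


def to_three_letter(sequence):
    """Convert a one-letter string such as 'MA' to ['Met', 'Ala']."""
    return [ONE_TO_THREE[residue] for residue in sequence.upper()]


if __name__ == "__main__":
    print(to_one_letter(["Met", "Ala", "Lys"]))
    print(to_three_letter("MAK"))
```

Unknown codes raise a `KeyError` in this sketch, which is usually preferable to silently skipping a residue.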

Page 9: It & Health 2009 Summary Thomas Nordahl Petersen

Center for Biological Sequence Analysis

Theory of evolution

Charles Darwin, 1809-1882

Page 10: It & Health 2009 Summary Thomas Nordahl Petersen

Phylogenetic tree

Page 11: It & Health 2009 Summary Thomas Nordahl Petersen

Global versus local alignments

Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).

Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).

Global alignment

Seq 1

Seq 2

Local alignment
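A local alignment score can be sketched with the Smith-Waterman recurrence, where every cell is floored at zero so that an alignment can start and end anywhere in the two sequences. The scoring values below (match 2, mismatch -1, linear gap -2) are assumptions for illustration only; in practice a substitution matrix and tuned gap penalties would be used.

```python
def local_alignment_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score.

    The scoring parameters are illustrative assumptions, not course values.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # H[i][j] = best score of a local alignment ending at seq1[i-1], seq2[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align the two residues
                H[i - 1][j] + gap,    # gap in seq2
                H[i][j - 1] + gap,    # gap in seq1
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # Only the shared 'ACG' core contributes to the local score.
    print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

The floor at zero is the only difference from the global recurrence: poorly matching flanks are simply dropped instead of being penalized.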

Page 12: It & Health 2009 Summary Thomas Nordahl Petersen

Pairwise alignment: the solution

“Dynamic programming” (the Needleman-Wunsch algorithm)
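The dynamic programming idea can be sketched as follows: fill a score matrix cell by cell from the three possible predecessors, then trace back from the bottom-right corner to recover one optimal global alignment. The scores (match 1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2).

    Scoring parameters are illustrative assumptions.
    """
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                a1.append(seq1[i - 1])
                a2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(seq1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(seq2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(score)
    print(top)
    print(bottom)
```

When several alignments share the optimal score, this traceback returns only one of them (it prefers the diagonal move).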

Page 13: It & Health 2009 Summary Thomas Nordahl Petersen

Sequence alignment - Blast

Page 14: It & Health 2009 Summary Thomas Nordahl Petersen

Sequence alignment - Blast

Page 15: It & Health 2009 Summary Thomas Nordahl Petersen

Blosum & PAM matrices

• Blosum matrices are the most commonly used substitution matrices.
• Blosum50, Blosum62, Blosum80
• PAM - Point Accepted Mutation (often also read as Percent Accepted Mutations)
• PAM-0 is the identity matrix.
• PAM-1: the diagonal has small deviations from 1, the off-diagonal has small deviations from 0
• PAM-250 is PAM-1 raised to the 250th power, i.e. PAM-1 multiplied by itself until 250 factors are included.
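The relation between PAM-0, PAM-1 and PAM-250 can be illustrated with plain matrix powers. The sketch below uses a hypothetical 3-letter alphabet with made-up transition probabilities (`PAM1_TOY`); the real PAM-1 matrix is 20 x 20 and estimated from data, so only the mechanics carry over.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M raised to the n-th power; n = 0 gives the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical toy "PAM-1": diagonal close to 1, off-diagonal close to 0,
# each row sums to 1 (row i = probabilities of i being replaced by each letter).
PAM1_TOY = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

PAM0_TOY = mat_power(PAM1_TOY, 0)      # identity matrix
PAM250_TOY = mat_power(PAM1_TOY, 250)  # much flatter than PAM-1

if __name__ == "__main__":
    for row in PAM250_TOY:
        print([round(x, 3) for x in row])
```

Each row still sums to 1 after any number of multiplications, while the diagonal shrinks as more substitutions accumulate.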

Page 16: It & Health 2009 Summary Thomas Nordahl Petersen

Sequence profiles (1J2J.B)

>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

Page 17: It & Health 2009 Summary Thomas Nordahl Petersen

Log-odds scores

• BLOSUM is a log-likelihood (log-odds) matrix:
• The likelihood of observing j given you have i is
  – $P(j|i) = P_{ij}/P_i$
• The prior likelihood of observing j is
  – $Q_j$, which is simply the frequency of j
• The log-likelihood score is
  – $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$, taking $P_i = Q_i$
  – where $\log_2(x) = \ln(x)/\ln(2)$
  – S has been normalized to half bits, hence the factor 2
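The score formula above can be evaluated directly. This is a minimal sketch; the frequencies in the example are hypothetical numbers chosen so the arithmetic is easy to follow, not values from a real BLOSUM matrix.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """Log-odds score in half bits: 2 * log2(P_ij / (Q_i * Q_j))."""
    return 2 * math.log2(p_ij / (q_i * q_j))


if __name__ == "__main__":
    # Hypothetical frequencies: the pair is seen twice as often as expected
    # by chance (0.01 observed vs 0.1 * 0.05 = 0.005 expected).
    score = log_odds_half_bits(p_ij=0.01, q_i=0.1, q_j=0.05)
    # Published matrices store scores rounded to the nearest integer.
    print(round(score))
```

A pair observed exactly as often as chance predicts scores 0, more often gives a positive score, and less often gives a negative score.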

Page 18: It & Health 2009 Summary Thomas Nordahl Petersen

BLAST Exercise

Page 19: It & Health 2009 Summary Thomas Nordahl Petersen

Genome browsers - UCSC

Intron - Exon structure

Single Nucleotide polymorphism - SNP

Page 20: It & Health 2009 Summary Thomas Nordahl Petersen

SNPs

Page 21: It & Health 2009 Summary Thomas Nordahl Petersen

Protein 3D-structure

Page 22: It & Health 2009 Summary Thomas Nordahl Petersen

Protein structure

Primary structure: Amino acid sequence

Secondary structure: Helix/Beta sheet

Tertiary structure: Fold, 3D coordinates

Page 23: It & Health 2009 Summary Thomas Nordahl Petersen

Protein structure - helix

3₁₀-helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
π-helix: 4.1 residues/turn - very rare

Page 24: It & Health 2009 Summary Thomas Nordahl Petersen

Protein structure - strand/sheet

Page 25: It & Health 2009 Summary Thomas Nordahl Petersen

Protein folds

Class: the 4th class is ‘few secondary structures’

Architecture: overall shape of a domain

Topology: shared secondary structure connectivity

Page 26: It & Health 2009 Summary Thomas Nordahl Petersen

Protein 3D-structure

Page 27: It & Health 2009 Summary Thomas Nordahl Petersen

Neural Networks: From knowledge to information

Protein sequence Biological feature

Page 28: It & Health 2009 Summary Thomas Nordahl Petersen

Use of artificial neural networks

• A data-driven method to predict a feature, given a set of training data
• In biology, input features could be amino acid sequences or nucleotides
  – Secondary structure prediction
  – Signal peptide prediction
  – Surface accessibility
  – Propeptide prediction

N - Signal peptide - Propeptide - Mature/active protein - C
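One common way to present an amino acid sequence to a neural network is a sliding window with sparse (one-hot) encoding, so that each residue is predicted from its local sequence context. The sketch below assumes this encoding, a window of 7 residues and zero-vector padding at the termini; none of these choices are stated on the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes


def one_hot(residue):
    """20-element vector with a single 1; all zeros for padding or unknowns."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec


def encode_windows(sequence, window=7):
    """Return one input vector per residue, built from a centred window.

    The window size and the '-' padding symbol are illustrative assumptions.
    """
    half = window // 2
    padded = "-" * half + sequence + "-" * half
    inputs = []
    for center in range(len(sequence)):
        vec = []
        for residue in padded[center:center + window]:
            vec.extend(one_hot(residue))
        inputs.append(vec)
    return inputs


if __name__ == "__main__":
    vectors = encode_windows("NVIFEDEEKSK")
    # 11 residues, each described by 7 * 20 = 140 input values.
    print(len(vectors), len(vectors[0]))
```

Each vector would be paired with a known label for the central residue (for example helix/strand/coil, or buried/exposed) to form one training example.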

Page 29: It & Health 2009 Summary Thomas Nordahl Petersen

Prediction of biological features: Surface accessibility

Predict surface accessibility from the amino acid sequence only.

Page 30: It & Health 2009 Summary Thomas Nordahl Petersen

Logo plots

Information content: how is it calculated, and what does it mean?

Page 31: It & Health 2009 Summary Thomas Nordahl Petersen

Logo plots - Information Content

Sequence-logo

Calculate Information Content

$I = \sum_a p_a \log_2 p_a + \log_2(4)$, where $p_a$ is the frequency of letter $a$ at the position. The maximal value is 2 bits (for DNA, 4 letters).

• Total height at a position is the ‘Information Content’ measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.

~0.5 each

Completely conserved
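The information content formula for a DNA logo can be computed per alignment column as sketched below. The function names are hypothetical, and the sketch omits the small-sample correction that logo tools often apply.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """I = sum_a p_a * log2(p_a) + log2(alphabet_size), in bits.

    `column` is a string holding one alignment column, e.g. 'AAAC'.
    """
    counts = Counter(column)
    total = len(column)
    entropy_term = sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy_term + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Height of each letter = its frequency times the column's information."""
    info = information_content(column, alphabet_size)
    total = len(column)
    return {letter: (count / total) * info
            for letter, count in Counter(column).items()}


if __name__ == "__main__":
    print(information_content("AAAA"))  # completely conserved column
    print(information_content("ACGT"))  # all four letters equally frequent
    print(letter_heights("AACC"))       # two letters, 50% each
```

For protein logos the same function applies with `alphabet_size=20`, which raises the maximum to log2(20), about 4.32 bits.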