It & Health 2009 Summary


Thomas Nordahl Petersen

Teachers


Thomas Nordahl Petersen

Rasmus Wernersson

Lisbeth Nielsen Fink

Anders Gorm Pedersen

Bent Petersen

Ramneek Gupta

Thomas Blicher

Outline of the course

• Topics will cover a general introduction to bioinformatics
– Evolution
– DNA / Protein
– Alignment and scoring matrices
  • How does it work & what are the numbers
– Visualization of multiple alignments
  • Phylogenetic trees and logo plots
– Commonly used databases
  • Uniprot/Genbank & Genome browsers
– Protein 3D-structure
– Artificial neural networks & case stories
– Practical use of bioinformatics tools
• Preparation for exam

Topics covered - (some of them)

Information flow in biological systems

Amino Acids

Amine and carboxyl groups. The side chain ‘R’ is attached to the C-alpha carbon.

The amino acids found in living organisms are L-amino acids.

Amino Acids - peptide bond

N-terminal C-terminal

1 and 3-letter codes

1. There are 20 naturally occurring amino acids
2. Normally the one- or three-letter codes are used

Ala - A, Cys - C, Asp - D, Glu - E, Phe - F, Gly - G, His - H, Ile - I, Lys - K, Leu - L

Met - M, Asn - N, Pro - P, Gln - Q, Arg - R, Ser - S, Thr - T, Val - V, Trp - W, Tyr - Y
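The code table on this slide maps directly onto a lookup dictionary. The sketch below is only an illustration; the function names `to_one_letter` and `to_three_letter` are made up for this example and are not from the course material.

```python
# Three-letter to one-letter codes for the 20 standard amino acids,
# taken from the code table on the slide.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
# Reverse lookup: one-letter code to three-letter code.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def to_three_letter(sequence):
    """Convert a one-letter string into a list of three-letter codes."""
    return [ONE_TO_THREE[aa] for aa in sequence.upper()]


print(to_one_letter(["Met", "Ala", "Lys"]))
print(to_three_letter("NVIF"))
```

Unknown codes raise a `KeyError` here, which is usually preferable to silently passing through a typo.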

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Theory of evolution

Charles Darwin, 1809-1882

Phylogenetic tree

Global versus local alignments

Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).

Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).
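The difference between the two modes can be sketched in a few lines. The function below is a minimal illustration, not the course's implementation: the name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example. The only changes for local alignment are that cell scores never drop below zero and that the best cell anywhere in the matrix is reported.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of strings a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("GGGGACGT", "ACGT"))              # global
print(alignment_score("GGGGACGT", "ACGT", local=True))  # local
```

With these toy sequences the global score is dragged down by the four unmatched leading residues, while the local score only reflects the shared `ACGT` stretch.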

Figure: global alignment versus local alignment of Seq 1 and Seq 2

Pairwise alignment: the solution

“Dynamic programming” (the Needleman-Wunsch algorithm)
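As a rough sketch of how dynamic programming yields an actual alignment, the function below fills the score matrix and then traces back from the bottom-right corner. The name `needleman_wunsch` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration; in practice a substitution matrix such as Blosum62 would replace the match/mismatch scores.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback: walk back from the corner, re-deriving which move was taken.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

When several alignments share the best score, this traceback prefers the diagonal move, so it returns one optimal alignment rather than all of them.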

Sequence alignment - Blast


Blosum & PAM matrices

• Blosum matrices are the most commonly used substitution matrices.
• Blosum50, Blosum62, Blosum80
• PAM - Point Accepted Mutation (often expanded as Percent Accepted Mutations)
• PAM-0 is the identity matrix.
• PAM-1: the diagonal has small deviations from 1, the off-diagonal has small deviations from 0
• PAM-250 is PAM-1 raised to the 250th power (PAM-1 multiplied by itself 250 times).
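The relationship between PAM-0, PAM-1 and PAM-250 can be illustrated with plain matrix multiplication. The sketch below uses a made-up three-letter alphabet and invented probabilities (`toy_pam1` is hypothetical, not real PAM data); only the mechanics carry over to the real 20x20 matrix.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Power 0 gives the identity matrix (the PAM-0 case).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probability matrix for a 3-letter alphabet.
# Diagonal close to 1, off-diagonal close to 0, every row sums to 1.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

pam0 = mat_pow(toy_pam1, 0)
pam250 = mat_pow(toy_pam1, 250)

for row in pam250:
    print([round(value, 3) for value in row])
```

After 250 steps the diagonal has dropped well below 1 and the off-diagonal entries have grown, which is why high-numbered PAM matrices suit distantly related sequences.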

Sequence profiles (1J2J.B)

>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

Log-odds scores

• BLOSUM is a log-likelihood matrix.
• The likelihood of observing j given that you have i is
  – $P(j|i) = P_{ij}/P_i$
• The prior likelihood of observing j is
  – $Q_j$, which is simply the frequency of j
• The log-likelihood score is
  – $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$, taking $P_i = Q_i$
  – where $\log_2(x) = \log_n(x)/\log_n(2)$
  – S has been normalized to half bits, hence the factor 2
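A small numeric sketch may make the score easier to read. The function name `log_odds_half_bits` and the frequencies used below are hypothetical values chosen for round numbers, not entries from a real BLOSUM matrix.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """Log-odds score in half bits: 2 * log2(P_ij / (Q_i * Q_j))."""
    return 2 * math.log2(p_ij / (q_i * q_j))


# Hypothetical frequencies: the pair is seen 4 times more often than chance.
print(log_odds_half_bits(p_ij=0.04, q_i=0.1, q_j=0.1))
# Pair seen exactly as often as chance predicts.
print(log_odds_half_bits(p_ij=0.01, q_i=0.1, q_j=0.1))
# Pair seen less often than chance predicts.
print(log_odds_half_bits(p_ij=0.0025, q_i=0.1, q_j=0.1))
```

Positive scores mark substitutions observed more often than expected by chance, zero means no preference, and negative scores mark substitutions that are avoided. Published matrices round these values to integers.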

BLAST Exercise

Genome browsers - UCSC

Intron - Exon structure

Single Nucleotide polymorphism - SNP

SNPs

Protein 3D-structure

Protein structure

Primary structure: Amino acid sequence

Secondary structure: Helix/Beta sheet

Tertiary structure: Fold, 3D coordinates

Protein structure: helix

3-10 helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
Pi-helix: 4.1 residues/turn - very rare

Protein structure: beta strand/sheet

Protein folds

Class: the 4th class is ‘few secondary structures’

Architecture: overall shape of a domain

Topology: domains share the same secondary structure connectivity

Protein 3D-structure

Neural Networks: From knowledge to information

Protein sequence -> Biological feature

• A data-driven method to predict a feature, given a set of training data

• In biology input features could be amino acid sequence or nucleotides

• Secondary structure prediction

• Signal peptide prediction

• Surface accessibility

• Propeptide prediction

Use of artificial neural networks

Figure: precursor protein from N- to C-terminus: signal peptide, propeptide, mature/active protein

Prediction of biological features: surface accessibility


Predict surface accessibility from the amino acid sequence only.
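Before a network can predict a per-residue feature from sequence alone, the sequence has to be turned into numbers. One common choice is a sliding window with sparse (one-hot) encoding, sketched below. The slides do not state which encoding or window size was used, so the window of 7 and the names `one_hot` and `window_inputs` are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-element vector with a single 1; all zeros for padding or unknown letters."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window_inputs(sequence, window=7):
    """Build one input vector per residue from a window centred on it.

    Assumes an odd window size. Positions beyond the sequence ends are
    padded with '-' and encoded as all zeros.
    """
    half = window // 2
    padded = "-" * half + sequence + "-" * half
    inputs = []
    for i in range(len(sequence)):
        vector = []
        for residue in padded[i:i + window]:
            vector.extend(one_hot(residue))
        inputs.append(vector)
    return inputs


vectors = window_inputs("NVIFEDEEK", window=7)
print(len(vectors))     # one input vector per residue
print(len(vectors[0]))  # window size times 20 values
```

Each vector would then be paired with the known feature of the central residue (for example exposed or buried) to form one training example.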

Logo plots

Information content: how is it calculated, and what does it mean?

Logo plots - Information Content

Sequence-logo

Calculate Information Content

$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits

• Total height at a position is the ‘Information Content’ measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.

~0.5 each

Completely conserved
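The information content formula for DNA can be checked with a short calculation per alignment column. This is a minimal sketch: the function names are made up for the example, gaps are not handled, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Information content in bits: sum_a p_a*log2(p_a) + log2(alphabet_size)."""
    counts = Counter(column)
    total = len(column)
    entropy_term = sum((n / total) * math.log2(n / total)
                       for n in counts.values())
    return entropy_term + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Height of each letter: its frequency times the column's information content."""
    info = information_content(column, alphabet_size)
    total = len(column)
    return {letter: (n / total) * info for letter, n in Counter(column).items()}


print(information_content("AAAA"))  # completely conserved column
print(information_content("AACC"))  # two letters, equally frequent
print(information_content("ACGT"))  # all four letters equally frequent
print(letter_heights("AACC"))
```

A completely conserved DNA column reaches the 2-bit maximum, while a column with all four nucleotides equally frequent carries no information. For protein alignments, `alphabet_size=20` gives a maximum of log2(20), about 4.32 bits.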
