Basic Overview of Bioinformatics Tools and Biocomputing Applications I Dr Tan Tin Wee Director Bioinformatics Centre



Software Tools

• Data stored in retrievable forms in database systems
• Data generated by machines, DNA / Protein sequencers, automated systems

[Figure: diagram with the labels Biological Data, Automated Machines, Research Labs, Databases, Analytical Tools, New Knowledge]

Common Computational Analyses

• Sequence Assembly
• Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
• Detection of active sites, functional residues, characteristic structures, substrates, and processing signals

Common Computational Analyses

• Database sequence search

• Multiple alignment

• Secondary (2°) and tertiary (3°) structure prediction; transmembrane helix detection

• Structure modeling

• Docking prediction and design

• Hidden Markov model searches

Sequence Assembly

• Fragmented data from DNA sequencers
• Detection of Overlap
• Merging of Contigs
• Assembly into continuous sequence

[Figure: assembled sequence drawn 5' to 3']
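The overlap-and-merge steps above can be sketched as a greedy procedure. This is only an illustration under simplifying assumptions: reads are error-free, all come from the same strand, and the function names `overlap` and `greedy_assemble` are hypothetical, not taken from any assembly package.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: remaining fragments stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATCGGA", "GGATTC", "TTCAAG"])
print(contigs)
```

Real assemblers also handle reverse-complemented reads, sequencing errors, and repeats, none of which this sketch attempts.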

Sequence Format Interconversion

• DNA/Protein and other sequence data come in different formats.

• Annotations

• Different programs use different formats

• Interconversion utility tools, e.g. READSEQ, TOGCG, TOSTADEN, etc.

Simple Sequence Analysis

1. Linear Sequence eg. DNA/ Protein

2. Open a Window: n = 1, n = variable, n = sliding

3. Calculate based on list of criteria
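As a minimal sketch of this three-step recipe, the code below slides a window along a protein sequence and averages a per-residue value in each window. The choice of the Kyte-Doolittle hydropathy scale as the "criterion" and the function name `window_scores` are assumptions made for illustration; any per-residue table could be substituted.

```python
# Kyte-Doolittle hydropathy values (one per amino acid, one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(sequence, scale, window=5):
    """Average `scale` over every window of `window` residues.

    Returns one score per window position, moving one residue at a time.
    """
    scores = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scores.append(sum(scale[res] for res in segment) / window)
    return scores


profile = window_scores("MKTLLVAGGA", KYTE_DOOLITTLE, window=3)
print(len(profile), "window positions")
```

With `window=1` the function simply returns the per-residue values, which corresponds to the n = 1 case.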


Some Simple Sequence Analysis Applications

• DNA complementary strand, e.g. COMPLEMENT & REVERSE
  – Open window size 1
  – A ---> T
  – C ---> G
  – T ---> A
  – G ---> C
  – Slide to next window of 1
  – Proceed to end of sequence
  – Reverse order of complement

    5' ...ATCTCGATACTACTACG...3'
          |||||||||||||||||
    3' ...TAGAGCTATGATGATGC...5'
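The window-of-1 procedure above can be sketched in a few lines. The function names are illustrative only and are not the COMPLEMENT or REVERSE programs themselves; the sketch assumes an upper-case sequence containing only A, C, G, and T.

```python
# Base-pairing rules applied one base (window size 1) at a time.
PAIRS = {"A": "T", "C": "G", "T": "A", "G": "C"}


def complement(dna):
    """Complement each base, keeping the original left-to-right order."""
    return "".join(PAIRS[base] for base in dna)


def reverse_complement(dna):
    """Complement the strand, then reverse it so it reads 5' to 3'."""
    return complement(dna)[::-1]


top = "ATCTCGATACTACTACG"
print("5'", top, "3'")
print("3'", complement(top), "5'")
print("reverse complement, 5' to 3':", reverse_complement(top))
```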

• DNA to Protein sequence translation, e.g. TRANSLATE
  – Open window of 3 bases
  – Look up the codon in the genetic code table
  – Assign amino acid residue
  – Slide window to next 3 bases
  – Proceed till stop codon detected
  – Repeat whole procedure for six frames

[Figure: the sequence ATACTACTGAGATCTAGGCTAGTACTGCGTGCG with Frames 1-3 on the given strand and Frames 4-6 on the complement]
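The translation steps can be sketched as follows, assuming the standard genetic code and an upper-case A/C/G/T sequence. The names `translate` and `six_frame_translation` are hypothetical and are not the TRANSLATE program's interface; stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PAIRS = {"A": "T", "C": "G", "T": "A", "G": "C"}


def translate(dna, stop_at_stop=False):
    """Translate in windows of 3 bases, starting at the first base."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*" and stop_at_stop:
            break
        protein.append(residue)
    return "".join(protein)


def six_frame_translation(dna):
    """Frames 1-3 read the given strand; frames 4-6 read its reverse complement."""
    reverse = "".join(PAIRS[base] for base in reversed(dna))
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[offset + 4] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translation("ATACTACTGAGATCTAGGCTAGTACTGCGTGCG").items()):
    print(frame, protein)
```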

Some Simple Sequence Analysis Applications

• Detect Open Reading Frame, e.g. ORF
  – Translate sequence, report long stretches between start and stop codons
• Compositional analysis
  – e.g. Calculate total A, T, G, C
  – e.g. Calculate total molecular mass of protein, analyse percentages of amino acids
  – e.g. Total charge composition, pI
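A rough sketch of both ideas, under stated assumptions: the ORF scan only looks at the three forward frames, treats ATG as the only start codon, and uses a hypothetical `min_codons` cutoff; the composition helpers simply count bases. None of these names come from the programs mentioned above.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch in frames 1-3.

    `start` and `end` are 0-based slice positions; `end` includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return orfs


def base_composition(dna):
    """Count of each base in the sequence."""
    return dict(Counter(dna))


def gc_content(dna):
    """Fraction of bases that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)


print(find_orfs("CCATGAAATTTTAGCC"))
print(base_composition("AATTGGCC"), gc_content("AATTGGCC"))
```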

Some Simple Sequence Analysis Applications

• Simple prediction of secondary structure of Protein sequence
  – decide a window size
  – compute for each window of amino acids the statistical potential to form helix, beta sheet, turn, etc. (Chou-Fasman, GOR, etc. algorithms)
  – use a statistical potential chart
  – plot potentials in graphical or pictorial format

Some Simple Sequence Analysis Applications

• Restriction Mapping, e.g. MAP, MAPPLOT, MAPSORT, PLASMIDMAP, etc.
  – Table of Restriction Enzymes and cut sites, e.g. EcoRI, BamHI, AluI and their cut sites, e.g. GAATTC, AATT
  – Take a DNA sequence
  – Pattern match against the list of cut sites
  – For each match, assign Restriction enzyme
  – Calculate distance between cut sites
  – Display in table, graphical, or restriction map, etc.
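The mapping steps can be sketched with a small enzyme table. The three recognition sequences and cut offsets below (EcoRI G^AATTC, BamHI G^GATCC, AluI AG^CT) are well known, but the table layout and the function names are assumptions for illustration; the sketch treats the DNA as linear and scans only the given strand, which suffices for these palindromic sites.

```python
# enzyme name -> (recognition sequence, cut offset within the site on the top strand)
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "AluI": ("AGCT", 2),
}


def find_cut_sites(dna, enzymes=ENZYMES):
    """Return sorted (cut_position, enzyme) pairs for every site match."""
    cuts = []
    for name, (site, offset) in enzymes.items():
        pos = dna.find(site)
        while pos != -1:
            cuts.append((pos + offset, name))
            pos = dna.find(site, pos + 1)  # step by one so overlapping sites are found
    return sorted(cuts)


def fragment_lengths(dna, cuts):
    """Distances between consecutive cut positions on a linear molecule."""
    positions = [0] + sorted({pos for pos, _ in cuts}) + [len(dna)]
    return [b - a for a, b in zip(positions, positions[1:])]


dna = "AAGAATTCAAGGATCCAA"
cuts = find_cut_sites(dna)
print(cuts)
print(fragment_lengths(dna, cuts))
```

The fragment lengths are the raw material for a table, a map, or a simulated gel display.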

Some Simple Sequence Analysis Applications

[Figure: PLASMIDMAP output and gel]

• Protein sequence Motifs pattern matching eg. PROSITEMAP, MOTIFS, BLOCKS etc

– Table/Database of Sequence Patterns/Motifs and their signature sequence eg. Arg-Gly-Asp (RGD) or consensus sequence (eg. PROSITE, BLOCKS db)

– Take Protein sequence

– Pattern match against the list of signature sites

– For each match, assign potential function according to database

– Display in table or graphically, or hyperlinked
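A minimal sketch of motif scanning with regular expressions. The two-entry table is illustrative only: RGD is the cell attachment sequence mentioned above, and `N[^P][ST][^P]` is a regular-expression rendering of the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}. The function name `scan_motifs` is hypothetical, and a real scan would load the full pattern database instead.

```python
import re

# motif name -> regular expression over one-letter amino acid codes
MOTIFS = {
    "RGD cell attachment": "RGD",
    "N-glycosylation site": "N[^P][ST][^P]",
}


def scan_motifs(protein, motifs=MOTIFS):
    """Return sorted (position, motif name, matched residues); positions are 1-based."""
    hits = []
    for name, pattern in motifs.items():
        # The lookahead lets overlapping occurrences be reported as well.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((match.start() + 1, name, match.group(1)))
    return sorted(hits)


for position, name, residues in scan_motifs("MARGDKNGSA"):
    print(position, name, residues)
```

A match only suggests a potential function; short patterns such as RGD occur by chance in many unrelated proteins.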

Some Simple Sequence Analysis Applications

• Peptide Cleavage Maps, e.g. PEPTIDESORT, PEPTIDE MAP
  – Table of Protease vs Cleavage sites, e.g. Trypsin, chymotrypsin, and chemical cleavage sites, e.g. cyanogen bromide
  – Pattern match with entire protein sequence
  – Calculate size of peptide fragments
  – Sort and Map, Plot as electrophoretic patterns on a log-linear simulated digest
  – Compute Partial Digest patterns
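The complete-digest case can be sketched as below. The rules used are the commonly quoted simplifications (trypsin cuts after K or R unless the next residue is P; cyanogen bromide cuts after M), and the function name `cleave` is hypothetical. Partial digests and the gel-style plot are not attempted.

```python
def cleave(protein, after, not_before=""):
    """Cut after any residue in `after`, unless the following residue is in `not_before`."""
    fragments = []
    start = 0
    for i, residue in enumerate(protein[:-1]):
        if residue in after and protein[i + 1] not in not_before:
            fragments.append(protein[start:i + 1])
            start = i + 1
    if protein[start:]:
        fragments.append(protein[start:])
    return fragments


protein = "AAKGGRPCCKDD"
tryptic = cleave(protein, after="KR", not_before="P")   # simplified trypsin rule
cnbr = cleave("AMGGMK", after="M")                      # cyanogen bromide rule

# Sort by fragment size, largest first, as a stand-in for a size-ordered table.
for fragment in sorted(tryptic, key=len, reverse=True):
    print(len(fragment), fragment)
```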

Some Simple Sequence Analysis Applications

• DOTPLOT - self-comparison

– Take a Window size

– Compare against entire length of own sequence

– Report matches above a threshold

– Plot on Graph

– Slide window, repeat till end of sequence

– Detection of Internal repeats

• Pairwise comparison - detection of homology
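The windowed comparison can be sketched as follows. The function name `dotplot` and the default window and threshold values are assumptions for illustration, not the DOTPLOT program's parameters; passing the same sequence twice gives the self-comparison, and two different sequences give the pairwise comparison.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return (i, j) positions where the windows starting at i and j
    share at least `threshold` identical positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= threshold:
                dots.append((i, j))
    return dots


seq = "GATCAAGATC"
dots = set(dotplot(seq, seq))

# Crude text plot: rows are window positions in one copy, columns in the other.
size = len(seq) - 3 + 1
for i in range(size):
    print("".join("*" if (i, j) in dots else "." for j in range(size)))
```

In a self-comparison the main diagonal is always filled; internal repeats show up as shorter off-diagonal runs.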

Some Simple Sequence Analysis Applications

[Figure: dot plot of Sequence A against Sequence A]

• RNA secondary structure analysis
• Mfold, PlotFold, FoldRNA, Squiggles, Circles, Domes, Mountains, StemLoop
• Folding of RNA into stems, loops
• Calculation of energy - prediction of stability of structure
• Display of structure and alternatives
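To give a feel for how folding can be computed, here is a sketch of the Nussinov base-pair maximisation recurrence. This is a deliberately simpler stand-in: the programs listed above work from energy calculations, whereas this sketch only counts pairs. The allowed pairs (Watson-Crick plus G-U), the `min_loop` value, and the function name are assumptions.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases inside every hairpin loop."""
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])       # i or j left unpaired
            if (rna[i], rna[j]) in ALLOWED_PAIRS:
                score = max(score, best[i + 1][j - 1] + 1)    # i pairs with j
            for k in range(i + 1, j):                          # two independent substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

Recovering the actual stems and loops would need a traceback through the table, which is omitted here.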

Some Simple Sequence Analysis Applications

[Figure: RNA sequence ...AUCGA AUCUC... folded so that the segments AUGC and UACG pair to form a stem with a loop]

Database Searching

• Text-based Database Searching - using a text string to match an annotation in a sequence database record, i.e. keyword search

• Sequence-based Database Searching - using a biological sequence, in whole or in part, to match against the sequence of every record in a sequence database

Text-Based Database Searching

• Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
• Search Concepts
  – Boolean Search - AND, OR, NOT
  – Broadening Search
  – Narrowing the Search
  – Proximity searching, soundex
  – Wild Card, Stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and boolean operations, vocabulary matches
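Boolean and wildcard matching over annotations can be sketched with the standard library. Everything below is a toy: the records are invented placeholders, and the `search` function with its `all_of` / `any_of` / `none_of` parameters is a hypothetical interface, not how Entrez, SRS, DBGET, or AceDB are queried.

```python
import re
from fnmatch import fnmatchcase

# Invented toy annotations, for illustration only.
RECORDS = {
    "rec1": "human beta globin thalassemia mutation",
    "rec2": "mouse period gene",
    "rec3": "human period homolog",
    "rec4": "human thalassemic patient sample",
}


def search(records, all_of=(), any_of=(), none_of=()):
    """AND terms in `all_of`, OR terms in `any_of`, NOT terms in `none_of`.

    Terms may use shell-style wildcards such as 'thala*'.
    """
    hits = []
    for record_id, text in records.items():
        words = re.findall(r"\w+", text.lower())

        def has(term):
            return any(fnmatchcase(word, term.lower()) for word in words)

        if (
            all(has(term) for term in all_of)
            and (not any_of or any(has(term) for term in any_of))
            and not any(has(term) for term in none_of)
        ):
            hits.append(record_id)
    return hits


print(search(RECORDS, all_of=["human", "thala*"]))        # narrowing with AND
print(search(RECORDS, any_of=["mouse", "thala*"]))        # broadening with OR
print(search(RECORDS, all_of=["period"], none_of=["mouse"]))
```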

Text-based Database Searching

• Example: To find the human homolog of the Drosophila per gene
• Procedure
  – Web to Entrez
  – All Fields: enter "human" "per"
  – Hits returned, irrelevant - broaden search
  – "human" "period" - more hits
  – check every one, find the human RIGUI gene
• Hit and miss, clever guess work, free form or controlled vocabulary (MeSH terms)? Use Boolean searches?

Sequence-based Database Searching

• Homology Search

• Global or Local Sequence Alignment

• Needleman-Wunsch Algorithm

• Smith-Waterman Algorithm

• Lipman - Pearson FASTA

• Altschul's BLAST

• Take a sequence, pairwise comparison with each sequence in the database

Sequence-based Database Searching

• Basic Assumptions:
• Sequences of homologous Genes/Protein diverge over time even though structure and/or function change little
• Significant sequence similarity inferred as potential structural/functional similarity or common evolutionary origin

• Based on well-characterised protein, infer the function of an unknown sequence at gene or protein sequence level.

Sequence-based Database Searching

• Global Alignment - forces complete alignment of the two input sequences in the pairwise comparison

• Local Alignment - looks for local stretches of similarity and tries to align the most similar segments

• Algorithms used may be similar, but the outputs differ; statistics are needed to assess results
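The contrast between the two modes can be seen in a small dynamic-programming sketch that returns only the optimal score. The scoring values (match +1, mismatch -1, linear gap -2) and the function names are arbitrary assumptions for illustration; real searches use substitution matrices and more elaborate gap costs.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of segments; cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


x, y = "AAAGATTACATTT", "CCCGATTACAGGG"
print("global:", global_score(x, y))
print("local: ", local_score(x, y))
```

The two sequences share only a central segment, so the local score reflects that segment alone while the global score is pulled down by the mismatching flanks.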

Sequence-based Database Searching

• Alignment Scoring
• Substitution score and substitution matrix: PAM, BLOSUM
• Affine gap costs/gap penalty and gap scores
• Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
• Additional heuristics - FASTA, BLAST
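For reference, the scoring terms above fit together in the usual textbook recurrences. The symbols are assumptions introduced here, not notation from the slides: $s(x_i, y_j)$ is the substitution score from a matrix such as PAM or BLOSUM, $d$ is the gap-open penalty, $e$ the gap-extension penalty, and $F(i,j)$ the best score for the prefixes $x_1 \dots x_i$ and $y_1 \dots y_j$. The recurrences are written with a linear gap penalty $d$ for brevity.

```latex
% Global alignment (Needleman-Wunsch), linear gap penalty d > 0
F(i,j) = \max
\begin{cases}
  F(i-1,j-1) + s(x_i, y_j) \\
  F(i-1,j) - d \\
  F(i,j-1) - d
\end{cases}

% Local alignment (Smith-Waterman): the same recurrence with a floor of zero
F(i,j) = \max
\begin{cases}
  0 \\
  F(i-1,j-1) + s(x_i, y_j) \\
  F(i-1,j) - d \\
  F(i,j-1) - d
\end{cases}

% Affine gap cost for a gap of length g: opening costs d, each extension costs e
\gamma(g) = -d - (g-1)\,e
```

With an affine cost, one long gap is penalised less than several short gaps of the same total length, since typically $e < d$.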