21
BioInformatics - What and Why? The following power point presentation is designed to give some background information on Bioinformatics. This presentation is modified from information supplied by Dr. Bruno Gaeta, and with permission from eBioInformatics Pty Ltd (c) Copywright

BioInformatics - What and Why? The following power point presentation is designed to give some background information on Bioinformatics. This presentation

Embed Size (px)

Citation preview

BioInformatics - What and Why?BioInformatics - What and Why?

The following power point presentation is designed to give

some background information on Bioinformatics.

This presentation is modified from information supplied by Dr. Bruno Gaeta, and with permission from eBioInformatics Pty

Ltd (c) Copywright

The need for bioinformaticists. The number of entries in data bases of gene sequences is increasing exponentially. Bioinformaticians are needed to understand and use this information.

82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

GenBank growth

Genome sequencing projects, including the human genome project are producing vast amounts of information. The challenge is to use this information in a useful way

Genome sequencing projects, including the human genome project are producing vast amounts of information. The challenge is to use this information in a useful way

COMPLETE/PUBLIC

Aquifex aeolicus

Pyrococcus horikoshii

Bacillus subtilis

Treponema pallidum

Borrelia burgdorferi

Helicobacter pylori

Archaeoglobus fulgidus

Methanobacterium thermo.

Escherichia coli

Mycoplasma pneumoniae

Synechocystis sp. PCC6803

Methanococcus jannaschii

Saccharomyces cerevisiae

Mycoplasma genitalium

Haemophilus influenzae

COMPLETE/PENDING PUBLICATIONRickettsia prowazekii Pseudomonas aeruginosa

Pyrococcus abyssii

Bacillus sp. C-125

Ureaplasma urealyticum

Pyrobaculum aerophilum

ALMOST/PUBLIC

Pyrococcus furiosus

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis CSU93

Neisseria gonorrhea

Neisseria meningiditis

Streptococcus pyogenes

Terry Gaasterland, Siv Andersson, Christoph Sensenhttp://www.mcs.anl.gov/home/gaasterl/genomes.html

Publically available genomes (April 1998)

”..We must hook our individual computers into the worldwide network that gives us access to daily changes in the databases and also makes immediate our communications with each other. The programs that display and analyze the material for us must be improved - and we must learn to use them more effectively. Like the purchased kits, they will make our life easier, but also like the kits, we must understand enough of how they work to use them effectively…”

Walter Gilbert (1991) “Towards a paradigm shift in biology” Nature News and Views 349:99

Bioinformatics impacts on all aspects of biological research.

Promises of genomics and bioinformatics Promises of genomics and bioinformatics

Medicine Knowledge of protein structure facilitates drug design Understanding of genomic variation allows the tailoring

of medical treatment to the individual’s genetic make-up Genome analysis allows the targeting of genetic

diseases The effect of a disease or of a therapeutic on RNA and

protein levels can be elucidated

The same techniques can be applied to biotechnology, crop and livestock improvement, etc...

What is bioinformatics?What is bioinformatics?

Application of information technology to the storage, management and analysis of biological information

Facilitated by the use of computers

What is bioinformatics?What is bioinformatics? Sequence analysis

Geneticists/ molecular biologists analyse genome sequence information to understand disease processes

Molecular modeling Crystallographers/ biochemists design drugs using computer-aided

tools

Phylogeny/evolution Geneticists obtain information about the evolution of organisms by

looking for similarities in gene sequences

Ecology and population studies Bioinformatics is used to handle large amounts of data obtained in

population studies

Medical informatics Personalised medicine

Sequence analysis: overviewSequence analysis: overview

Nucleotide sequence file

Search databases for similar sequences

Sequence comparison

Multiple sequence analysis

Design further experimentsRestriction mappingPCR planning

Translate into protein

Search for known motifs

RNA structure prediction

non-coding

coding

Protein sequence analysis

Search for protein coding regions

Manual sequence entry

Sequence database browsing

Sequencing project management

Protein sequence file

Search databases for similar sequences

Sequence comparison

Search for known motifs

Predict secondary structure

Predict tertiary

structureCreate a multiple sequence alignment

Edit the alignment

Format the alignment for publication

Molecular phylogeny

Protein family analysis

Nucleotide sequence analysis

Sequence entry

Gene Sequencing: Automated chemcial sequencing methods allow rapid generation of large data banks of gene sequences

Database similarity searching: The BLAST program has been written to allow rapid comparison of a new gene sequence with the 100s of 1000s of gene sequences in data bases

Sequences producing significant alignments: (bits) Value

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae] 112 7e-26gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce... 106 5e-24gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi... 69 7e-13gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae] 30 0.66gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae] 29 1.1gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae] 29 1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae] Length = 478 Score = 112 bits (278), Expect = 7e-26 Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2 QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50 + PWG+ RV G G GV VLDTGI T H D R + +Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51 PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110 P D NGHGTH AG I + + GVA + ++ +G+ESbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288

Sequence comparison: Gene sequences can be aligned to see similarities between gene from different sources

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813 || || || | | ||| | |||| ||||| ||| ||| 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135 . . . . .814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863 | | | | |||||| | |||| | || | |136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172 . . . . .864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913 ||| | ||| || || ||| | ||||||||| || |||||| |173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

Restriction mapping: Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes

AceIII 1 CAGCTCnnnnnnn’nnn...AluI 2 AG’CTAlwI 1 GGATCnnnn’n_ApoI 2 r’AATT_yBanII 1 G_rGCy’CBfaI 2 C’TA_GBfiI 1 ACTGGGBsaXI 1 ACnnnnnCTCCBsgI 1 GTGCAGnnnnnnnnnnn...

BsiHKAI 1 G_wGCw’CBsp1286I 1 G_dGCh’C

BsrI 2 ACTG_Gn’BsrFI 1 r’CCGG_yCjeI 2 CCAnnnnnnGTnnnnnn...CviJI 4 rG’CyCviRI 1 TG’CADdeI 2 C’TnA_GDpnI 2 GA’TCEcoRI 1 G’AATT_CHinfI 2 G’AnT_CMaeIII 1 ’GTnAC_MnlI 1 CCTCnnnnnn_n’MseI 2 T’TA_AMspI 1 C’CG_GNdeI 1 CA’TA_TG

Sau3AI 2 ’GATC_SstI 1 G_AGCT’CTfiI 2 G’AwT_C

Tsp45I 1 ’GTsAC_Tsp509I 3 ’AATT_

TspRI 1 CAGTGnn’

50 100 150 200 250

PCR Primer Design:Oligonucleotides for use in the polymerisation chain

reaction can be designed using computer based prgrams

OPTIMAL primer length --> 20MINIMUM primer length --> 18MAXIMUM primer length --> 22 OPTIMAL primer melting temperature --> 60.000MINIMUM acceptable melting temp --> 57.000MAXIMUM acceptable melting temp --> 63.000MINIMUM acceptable primer GC% --> 20.000MAXIMUM acceptable primer GC% --> 80.000Salt concentration (mM) --> 50.000 DNA concentration (nM) --> 50.000MAX no. unknown bases (Ns) allowed --> 0 MAX acceptable self-complementarity --> 12 MAXIMUM 3' end self-complementarity --> 8 GC clamp how many 3' bases --> 0

Plot created using codon preference (GCG)

Gene discovery: Computer program can be used to recognise the protein coding regions in DNA

0 1,000 2,000 3,000 4,000

4,0003,0002,0001,0000

2.0

1.5

1.0

0.5

-0.0

2.0

1.5

1.0

0.5

-0.0

2.0

1.5

1.0

0.5

-0.0

RNA structure prediction: Structural features of RNA can be predicted

G

GA

C

A

G

G

A

G

G

A

U

ACCG

CG

G

U

C

C

UGC

CG G U C C

U CA

CUU

GGACUUAGU

A

U

CA

U

C

A

G

U

C

UGCGC

AAU

A

G

G

UA A

C

G CGU

Protein structure prediction: Particular structural features can be recognised in protein sequences

Protein structure prediction: Particular structural features can be recognised in protein sequences

KD Hydrophobicity

Surface Prob.

Flexibility

Antigenic Index

CF TurnsCF Alpha Helices

CF Beta Sheets

GOR Alpha HelicesGOR Turns

GOR Beta SheetsGlycosylation Sites

0.8

1.20.0

-1.7

1.7

10

-5.0

5.0

50 100

50 100

Protein Structure : the 3-D structure of proteins is used to understand protein function and design new drugs

Multiple sequence alignment: Sequences of proteins from different organisms can be aligned to see similarities and differences

Alignment formatted using MacBoxshade

Phylogeny inference: Analysis of sequences allows evolutionary relationships to be determined

E.coli

C.botulinum

C.cadavers

C.butyricum

B.subtilis

B.cereusPhylogenetic tree constructed using the Phylip package

Large scale bioinformatics: genome projectsLarge scale bioinformatics: genome projects

MappingIdentifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping

SequencingAssembling clone sequence reads into large (eventually complete) genome sequences

Gene discoveryIdentifying coding regions in genomic DNA by database searching and other methods

Function assignmentUsing database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene

Data mining

Searching for relationships and correlations in the information

Genome comparisonComparing different complete genomes to infer evolutionary history and genome rearrangements

Challenges in bioinformaticsChallenges in bioinformatics Explosion of information

Need for faster, automated analysis to process large amounts of data

Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels etc…)

Need for “smarter” software to identify interesting relationships in very large data sets

Lack of “bioinformaticians” Software needs to be easier to access, use and

understand Biologists need to learn about the software, its

limitations, and how to interpret its results