Bioinformatics Databases: Getting Knowledge from Information Kristen Chambers Director of Bioinformatics Dartmouth Medical School BioInformatics @ Dartmouth Medical School


Bioinformatics Databases: Getting Knowledge from Information

Kristen Chambers
Director of Bioinformatics
Dartmouth Medical School

BioInformatics @ Dartmouth Medical School


What is Bioinformatics?

Bioinformatics provides the backbone computational tools, databases and domain expertise that facilitate modern biomedical, biological and genomic research.


What is Bioinformatics?

The expertise is multidisciplinary, and the skills fall on a continuum from ‘pure’ science to ‘pure’ computing:

• ‘Wet-lab’ science
• Sequence analysis
• Modeling & structural work
• Algorithm development
• Hardware & software infrastructure


With a field this extensive and skill sets so varied, where do we begin?


From Information Design, Nathan Shedroff


Moving from Information to Knowledge to Understanding:

Genetic testing

• BRCA1 and BRCA2 gene mutations: what is the real risk to women carriers? Estimates range from 25% to 80%.

• Huntington’s Disease: mechanism defined, but what does that mean for the individual in terms of age of onset, severity of disease, or how disease will progress?


How can Bioinformatics facilitate the extraction of information?

• Development of tools that support laboratory experiments

• Design, implementation and integration of biological databases

• Development of various analytical tools, algorithms and models


Bioinformatics will not replace experiments, but can greatly streamline and enable the discovery process.


One of the fundamental tools of bioinformatics: Database

• A database is a body of information stored in two dimensions (rows and columns)

• The power of the database lies in the relationships that you construct between the pieces of information (tables)

• SQL (Structured Query Language) - interactive and embedded

• Good design and application ensure data integrity

• Interoperability
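
The points above about tables, relationships, SQL and data integrity can be illustrated with a minimal sketch using Python's standard-library sqlite3 module. The schema (tables gene and annotation), the symbols GENE_A and GENE_B, and the notes are hypothetical placeholders, not taken from any real biological database.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the relationship
    conn.execute(
        "CREATE TABLE gene ("
        " gene_id INTEGER PRIMARY KEY,"
        " symbol TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE annotation ("
        " annotation_id INTEGER PRIMARY KEY,"
        " gene_id INTEGER NOT NULL REFERENCES gene(gene_id),"
        " note TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO gene (gene_id, symbol) VALUES (?, ?)",
        [(1, "GENE_A"), (2, "GENE_B")],  # placeholder symbols
    )
    conn.executemany(
        "INSERT INTO annotation (gene_id, note) VALUES (?, ?)",
        [(1, "placeholder note 1"), (1, "placeholder note 2"), (2, "placeholder note 3")],
    )
    conn.commit()
    return conn

def notes_for(conn, symbol):
    """Follow the gene -> annotation relationship with a JOIN."""
    rows = conn.execute(
        "SELECT a.note FROM gene AS g"
        " JOIN annotation AS a ON a.gene_id = g.gene_id"
        " WHERE g.symbol = ? ORDER BY a.annotation_id",
        (symbol,),
    )
    return [row[0] for row in rows]

def insert_orphan(conn):
    """Try to add an annotation for a gene that does not exist; return True if accepted."""
    try:
        conn.execute("INSERT INTO annotation (gene_id, note) VALUES (?, ?)", (99, "orphan"))
        return True
    except sqlite3.IntegrityError:
        return False  # the foreign-key constraint protected data integrity
```

Calling notes_for(build_demo_db(), "GENE_A") returns both placeholder notes for that gene, while insert_orphan is rejected because the foreign key has no matching row.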


Industry Challenge #1: Genome annotation

The Human Genome is sequenced. It is estimated that 5% of the human genome codes for genes. The function of the remaining 95% is largely unknown.


What does the genome data look like?

  1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
 61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
...

Multiply times eighteen million
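
To make the raw listing above slightly less opaque, here is a minimal sketch of how such a numbered sequence listing could be read into a plain string and summarized. The function names are illustrative choices, and the GC fraction is just one simple summary statistic; neither comes from the slides.

```python
def parse_numbered_sequence(text):
    """Drop the position numbers and spaces from a numbered sequence listing."""
    bases = []
    for token in text.split():
        if token.isdigit():
            continue  # position index at the start of each row
        bases.append(token.lower())
    return "".join(bases)

def gc_fraction(seq):
    """Fraction of bases that are g or c (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "gc") / len(seq)

# First row of the listing shown on the slide.
FIRST_ROW = "1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga"
seq = parse_numbered_sequence(FIRST_ROW)
print(len(seq), round(gc_fraction(seq), 2))
```

The same two functions work unchanged on the full listing, since every row follows the same "index, then blocks of ten bases" layout.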


What does the genome annotation look like today?


The value of a genome is only as good as its annotation

• Two steps: annotation & curation
• Each genome is annotated individually
• Manual curation is standard practice
• New tools, e.g. NCBI Mapviewer
• Many databases available …


Nucleic Acids Research article lists 719 public databases (freely available to the public):


Industry Challenge #2: Too much unintegrated data

• Data sources incompatible

• No standard naming convention

• No common interface (varying tools for browsing, querying and visualizing data)


Public Data Resources

• “Mandatory” sequence submissions
• Cover enormously wide range of informational topics
• Broad (sequence) to very specific (proteins associated with tooth decay) issues
• No standard database format: poor interoperability, difficulty with integration
• Ongoing efforts to address annotation problem


Major Sequence Repositories

• GenBank: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration

• EMBL Nucleotide Sequence Database: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration

• DNA Data Bank of Japan (DDBJ): All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration

• TIGR HGI: Non-redundant, gene-oriented clusters (and many curated microbial genome databases)

• UniGene: Non-redundant, gene-oriented clusters


GenBank


GenBank Growth

GenBank Data

Year      Base Pairs   Sequences
1982         680,338         606
1983       2,274,029       2,427
1984       3,368,765       4,175
1985       5,204,420       5,700
1986       9,615,371       9,978
1987      15,514,776      14,584
1988      23,800,000      20,579
1989      34,762,585      28,791
1990      49,179,285      39,533
1991      71,947,426      55,627
1992     101,008,486      78,608
1993     157,152,442     143,492
1994     217,102,462     215,273
1995     384,939,485     555,694
1996     651,972,984   1,021,211
1997   1,160,300,687   1,765,847
1998   2,008,761,784   2,837,897
1999   3,841,163,011   4,864,570
2000  11,101,066,288  10,106,023
2001  15,849,921,438  14,976,310
2002  28,507,990,166  22,318,883
2003  36,553,368,485  30,968,418
2004  44,575,745,176  40,604,319


GenBank Growth

• 1982: Database contains 606 sequences

• 2004: Database contains more than 40 million sequences (40,604,319); the number of bases approximately doubles every 14 months

• 140,000 different species represented, with new species added at a rate of 1700/month

• 17 divisions (including sequencing strategies)
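
As a rough illustration of what a doubling time means, the sketch below computes an average doubling time from two rows of the GenBank Data table, assuming steady exponential growth between them. The function name is an illustrative choice. Averaged over the whole 1982 to 2004 span this simple estimate comes out near 16.5 months; the 14-month figure quoted on the slide presumably reflects a different time window or method.

```python
import math

def doubling_time_months(start_year, start_bases, end_year, end_bases):
    """Average doubling time in months, assuming steady exponential growth."""
    months = (end_year - start_year) * 12
    doublings = math.log2(end_bases / start_bases)
    return months / doublings

# Base-pair totals for 1982 and 2004 as listed in the GenBank Data table.
full_span = doubling_time_months(1982, 680_338, 2004, 44_575_745_176)
print(f"{full_span:.1f} months")
```

Passing a narrower pair of years from the table gives the estimate for that window instead.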


Potential Errors in GenBank

• Sequence errors estimated at between 0.37 and 35 (!) errors per 1000 bases

• Recombination
• Contamination
• Annotation errors - propagated misannotations
  – Transfer by similarity is problematic
  – Errors not always corrected in a timely way
  – Genes with varying unrelated functions depending on context
  – Functional annotation is often unsystematic

• Name-function disconnect


Potential Errors in GenBank

• Naming conflicts
  – One gene, many acronyms
  – Many genes, shared acronym
  – Spelling errors
  – Cultural differences (US, UK)
  – Representation of non-ASCII characters
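
The naming conflicts listed above can be made concrete with a small sketch. It folds trivially different spellings together and reports when an alias is shared by more than one gene. The alias table is entirely hypothetical (GENE1, GENE2 and their aliases are made-up names), and the normalization rule is a deliberately simple assumption rather than an established convention.

```python
from collections import defaultdict

# Hypothetical alias table: (alias, official symbol). All names are made up.
ALIASES = [
    ("GENE1", "GENE1"),
    ("G1", "GENE1"),
    ("gene-1", "GENE1"),   # one gene, many spellings
    ("GENE2", "GENE2"),
    ("G1", "GENE2"),       # many genes, shared acronym
]

def normalize(name):
    """Fold case and drop separators so trivially different spellings compare equal."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())

def build_index(pairs):
    """Map each normalized alias to the set of official symbols that use it."""
    index = defaultdict(set)
    for alias, symbol in pairs:
        index[normalize(alias)].add(symbol)
    return index

def resolve(index, name):
    """Return (symbol, None) if unambiguous, else (None, sorted candidate list)."""
    candidates = sorted(index.get(normalize(name), ()))
    if len(candidates) == 1:
        return candidates[0], None
    return None, candidates

index = build_index(ALIASES)
print(resolve(index, "Gene 1"))  # spelling variant of a single gene
print(resolve(index, "g1"))      # ambiguous alias shared by two genes
```

Returning the full candidate list for an ambiguous alias, rather than picking one, leaves the decision to a curator.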


Many Databases available:

• Comparative Genomics
• Gene Expression
• Gene Identification & structure
• Genetic Maps
• Genomic Databases
• Intermolecular Interactions
• Metabolic Pathways and Cellular Regulation
• Mutation Databases
• Pathology
• Protein Databases
• Protein Sequence Motifs
• Proteome Resources
• Retrieval Systems & Database Structure
• RNA Sequences
• Structure
• Transgenics
• Varied Biomedical Content


Comparative Genomics: COG

• Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages.

• Each cluster corresponds to an ancient conserved domain.


Gene Expression


Gene Identification & Structure


Genetic Maps


Genomic Databases


Intermolecular Interactions


Metabolic Pathways and Cellular Regulation


Mutation Databases


Pathology


Protein Databases


Protein Databases: Swiss-Prot

• Extremely well curated protein database

• Link to BLAST

• Powerful cross-references

• Est. 1986

• Maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library


Protein Sequence Motifs: BLOCKs

• Profiles of protein families (weighted matrix at each position in multiple alignment)

• Alignment between a pattern and your query sequence
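
The idea of a weighted matrix at each alignment position, scored against a query, can be sketched as follows. This is a simplified illustration and not the weighting scheme BLOCKS itself uses: it assumes an ungapped toy alignment (TOY_BLOCK is made up), a uniform background over the 20 amino acids, and a pseudocount of 1.

```python
import math

def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background (simplified)."""
    n_seqs = len(alignment)
    width = len(alignment[0])
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(width):
        column = [seq[col] for seq in alignment]
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / (
                n_seqs + pseudocount * len(alphabet)
            )
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def best_window(pssm, query):
    """Slide the matrix along the query; return (best_score, start_index) or None."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues outside the alphabet would raise KeyError in this sketch.
        score = sum(pssm[i][res] for i, res in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TOY_BLOCK = ["GKST", "GKSS", "GKTT", "AKST"]  # made-up ungapped alignment
pssm = build_pssm(TOY_BLOCK, ALPHABET)
print(best_window(pssm, "MLLGKSTAAV"))
```

The highest-scoring window is the stretch of the query that best matches the residues conserved in each column of the toy block.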


Proteome Resources: Proteome BKL


RNA Sequences


Structure


Transgenics


Varied Biomedical Content


National Center for Biotechnology Information (NCBI): A network of linked resources

• Database access: GenBank, structure, function, SNP, taxonomy...
• Literature (PubMed)
• Whole genomes
• Tools
• Contacts & research information
• FTP


NCBI resources

• Nucleotide databases
• Protein databases
• Structure databases
• Taxonomy databases
• Genome databases
• Expression databases
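
The slides do not say how these databases are reached programmatically. One possibility, offered here as an assumption rather than something the talk covers, is NCBI's Entrez E-utilities web interface. The sketch below only builds an EFetch request URL and does not contact the server; the accession "EXAMPLE123" is a placeholder, not a real record.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service (assumed access route,
# not mentioned in the slides).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Build an EFetch URL that requests one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)

# "EXAMPLE123" is a placeholder accession used only for illustration.
print(efetch_url("EXAMPLE123"))
```

Changing db to "protein" would target the protein databases in the same way.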


Nucleic Acids Research, 2005, Vol 33, Database issue

http://www3.oup.co.uk/nar/database/cap/