Bioinformatics Databases: Getting Knowledge from Information
Kristen Chambers, Director of Bioinformatics, Dartmouth Medical School
BioInformatics @ Dartmouth Medical School
What is Bioinformatics?
Bioinformatics provides the backbone computational tools, databases and domain
expertise that facilitate modern biomedical, biological and genomic research.
What is Bioinformatics?
The expertise is multidisciplinary, and the skills fall on a continuum from ‘pure’ science to ‘pure’ computing:
• ‘Wet-lab’ science
• Sequence analysis
• Modeling & structural work
• Algorithm development
• Hardware & software infrastructure
With a field this extensive and skill sets so varied, where do we begin?
From Information Design, Nathan Shedroff
Moving from Information to Knowledge to Understanding:
Genetic testing
• BRCA1 and BRCA2 gene mutations: what is the real risk to women carriers? Estimates range from 25% to 80%.
• Huntington’s Disease: mechanism defined, but what does that mean for the individual in terms of age of onset, severity of disease, or how disease will progress?
How can Bioinformatics facilitate the extraction of information?
• Development of tools that support laboratory experiments
• Design, implementation and integration of biological databases
• Development of various analytical tools, algorithms and models
Bioinformatics will not replace experiments, but can greatly
streamline and enable the discovery process.
One of the fundamental tools of bioinformatics: Database
• A relational database stores information in tables, each organized in two dimensions (rows and columns)
• The power of the database lies in the relationships that you construct between the pieces of information (tables)
• SQL (Structured Query Language) - interactive and embedded
• Good design and application ensure data integrity
• Interoperability
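The points above about relationships between tables, SQL, and data integrity can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database. The table names (gene, sequence), the column names, and the gene symbols GENE_A and GENE_B are hypothetical and chosen only for illustration; they do not come from any real bioinformatics schema.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Two hypothetical tables linked by gene_id (the "relationship").
conn.executescript("""
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL UNIQUE
);
CREATE TABLE sequence (
    seq_id  INTEGER PRIMARY KEY,
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    bases   TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO gene (gene_id, symbol) VALUES (?, ?)",
                 [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany("INSERT INTO sequence (gene_id, bases) VALUES (?, ?)",
                 [(1, "gcggagggtg"), (1, "cgtgcgggcc"), (2, "aacaaaggag")])

# A join follows the relationship: sequences per gene and total bases stored.
rows = conn.execute("""
    SELECT g.symbol, COUNT(s.seq_id), SUM(LENGTH(s.bases))
    FROM gene AS g
    JOIN sequence AS s ON s.gene_id = g.gene_id
    GROUP BY g.symbol
    ORDER BY g.symbol
""").fetchall()

# Data integrity: a sequence that points at a non-existent gene is rejected.
try:
    conn.execute("INSERT INTO sequence (gene_id, bases) VALUES (99, 'acgt')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows)
print("orphan row rejected:", orphan_rejected)
```

The foreign key is what turns two independent tables into related ones: the join can rely on every sequence row having a valid gene, because the database refuses rows that break that rule.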
Industry Challenge #1: Genome annotation
The Human Genome is sequenced. It is estimated that 5% of the human genome codes for genes. The function of the remaining 95% is largely unknown.
What does the genome data look like?
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
...
Multiply by eighteen million
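A numbered sequence listing like the one above is awkward to work with until the position numbers and spacing are stripped. The following is a minimal sketch of that step, assuming the layout shown here (a leading position number followed by space-separated blocks of bases). The function names parse_numbered_sequence and gc_fraction are illustrative, not part of any library, and the sample is just the first two lines of the listing.

```python
from collections import Counter

def parse_numbered_sequence(text: str) -> str:
    """Join numbered, space-separated sequence lines into one uppercase string.

    Position numbers and a trailing "..." marker are skipped.
    """
    blocks = []
    for token in text.split():
        if token.isdigit() or token == "...":
            continue  # position counter or elision marker, not sequence data
        blocks.append(token)
    return "".join(blocks).upper()

def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    return (counts["G"] + counts["C"]) / len(seq)

# First two lines of the listing, used as a small sample.
sample = """
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
"""

seq = parse_numbered_sequence(sample)
print(len(seq), "bases; GC fraction", round(gc_fraction(seq), 3))
```

Real sequence files carry much more than the bases (accession numbers, features, references), so production work would normally go through an established parser rather than a hand-written one like this.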
What does the genome annotation look like today?
The value of a genome is only as good as its annotation
• Two steps: annotation & curation
• Each genome is annotated individually
• Manual curation is standard practice
• New tools, e.g. NCBI Map Viewer
• Many databases available …
Nucleic Acids Research article lists 719 public databases
(freely available to the public):
Industry Challenge #2: Too much unintegrated data
• Data sources incompatible
• No standard naming convention
• No common interface (varying tools for browsing, querying and visualizing data)
Public Data Resources
• “Mandatory” sequence submissions
• Cover an enormously wide range of informational topics
• Broad (sequence) to very specific (proteins associated with tooth decay) issues
• No standard database format: poor interoperability, difficulty with integration
• Ongoing efforts to address annotation problem
Major Sequence Repositories
• GenBank: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
• EMBL Nucleotide Sequence Database: All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
• DNA Data Bank of Japan (DDBJ): All known nucleotide and protein sequences; International Nucleotide Sequence Database Collaboration
• TIGR HGI: Non-redundant, gene-oriented clusters (and many curated microbial genome databases)
• UniGene: Non-redundant, gene-oriented clusters
GenBank
GenBank Growth
GenBank Data
Year Base Pairs Sequences
1982 680,338 606
1983 2,274,029 2,427
1984 3,368,765 4,175
1985 5,204,420 5,700
1986 9,615,371 9,978
1987 15,514,776 14,584
1988 23,800,000 20,579
1989 34,762,585 28,791
1990 49,179,285 39,533
1991 71,947,426 55,627
1992 101,008,486 78,608
1993 157,152,442 143,492
1994 217,102,462 215,273
1995 384,939,485 555,694
1996 651,972,984 1,021,211
1997 1,160,300,687 1,765,847
1998 2,008,761,784 2,837,897
1999 3,841,163,011 4,864,570
2000 11,101,066,288 10,106,023
2001 15,849,921,438 14,976,310
2002 28,507,990,166 22,318,883
2003 36,553,368,485 30,968,418
2004 44,575,745,176 40,604,319
GenBank Growth
• 1982: Database contains 606 sequences
• 2004: Database contains more than 40 million sequences (40,604,319); the number of bases approximately doubles every 14 months
• 140,000 different species represented, with new species added at a rate of 1,700/month
• 17 divisions (including sequencing strategies)
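A doubling time can be estimated directly from the GenBank Data table. The sketch below assumes steady exponential growth between two table rows and treats each yearly entry as exactly twelve months apart, which the table does not state. The function name doubling_time_months is illustrative.

```python
import math

def doubling_time_months(start_value: float, end_value: float,
                         elapsed_months: float) -> float:
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(end_value / start_value)
    return elapsed_months / doublings

# Base-pair counts taken from the 1982 and 2004 rows of the GenBank Data table.
bases_1982 = 680_338
bases_2004 = 44_575_745_176
elapsed = (2004 - 1982) * 12  # assumption: rows are exactly one year apart

average = doubling_time_months(bases_1982, bases_2004, elapsed)
print(f"average doubling time 1982-2004: {average:.1f} months")
```

Over the full 1982 to 2004 span this average comes out at about 16.5 months. The result depends on the window chosen, so it need not reproduce the 14-month figure quoted above, which may refer to a different period or method.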
Potential Errors in GenBank
• Sequence errors estimated at between 0.37 and 35 (!) errors per 1000 bases
• Recombination
• Contamination
• Annotation errors - propagated misannotations
– Transfer by similarity is problematic
– Errors not always corrected in a timely way
– Genes with varying unrelated functions depending on context
– Functional annotation is often unsystematic
• Name-function disconnect
Potential Errors in GenBank
• Naming conflicts
– One gene, many acronyms
– Many genes, shared acronym
– Spelling errors
– Cultural differences (US, UK)
– Representation of non-ASCII characters
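The naming conflicts listed above can be made concrete with a small alias index. This is a minimal sketch under stated assumptions: the gene symbols (GENE_A, GENE_B, GENE_C) and all aliases are invented for illustration, and the normalization rule (fold case, strip accents, drop punctuation) is one possible choice rather than an established standard.

```python
import unicodedata
from collections import defaultdict

def normalize(name: str) -> str:
    """Fold case, strip accents and punctuation so trivial variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed
                              if not unicodedata.combining(ch))
    return "".join(ch for ch in without_accents.casefold() if ch.isalnum())

# Hypothetical alias table: canonical symbol -> names seen in different sources.
ALIASES = {
    "GENE_A": ["GeneA", "gene-a", "TSF1"],           # one gene, many acronyms
    "GENE_B": ["GeneB", "TSF1"],                     # shares "TSF1" with GENE_A
    "GENE_C": ["Tumour factor", "Tumor factor",      # UK and US spellings
               "Muller factor"],                     # ASCII form of "Müller factor"
}

def build_index(aliases: dict) -> dict:
    """Map each normalized name to the set of canonical symbols it may refer to."""
    index = defaultdict(set)
    for symbol, names in aliases.items():
        for name in [symbol, *names]:
            index[normalize(name)].add(symbol)
    return index

def resolve(name: str, index: dict) -> list:
    """Return candidate symbols: empty = unknown, more than one = ambiguous."""
    return sorted(index.get(normalize(name), set()))

index = build_index(ALIASES)
for query in ["gene-a", "tsf1", "Müller factor", "Tumour Factor", "unknown"]:
    print(query, "->", resolve(query, index))
```

Normalization handles case, punctuation, and accented characters, but spelling differences such as "tumour" and "tumor" still need explicit alias entries, and a shared acronym can only be flagged as ambiguous, not resolved automatically.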
Many Databases available:
• Comparative Genomics
• Gene Expression
• Gene Identification & structure
• Genetic Maps
• Genomic Databases
• Intermolecular Interactions
• Metabolic Pathways and Cellular Regulation
• Mutation Databases
• Pathology
• Protein Databases
• Protein Sequence Motifs
• Proteome Resources
• Retrieval Systems & Database Structure
• RNA Sequences
• Structure
• Transgenics
• Varied Biomedical Content
Comparative Genomics: COG
• Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages.
• Each cluster corresponds to an ancient conserved domain.
Gene Expression
Gene Identification & Structure
Genetic Maps
Genomic Databases
Intermolecular Interactions
Metabolic Pathways and Cellular Regulation
Mutation Databases
Pathology
Protein Databases
Protein Databases: Swiss-Prot
• Extremely well curated protein database
• Link to BLAST
• Powerful cross-references
• Est. 1986
• Maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library
Protein Sequence Motifs: BLOCKs
• Profiles of protein families (weighted matrix at each position in multiple alignment)
• Alignment between a pattern and your query sequence
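The idea of a weighted matrix per alignment position, scored against a query, can be sketched as follows. This is a simplified illustration and not the weighting scheme used by the Blocks database itself: it assumes an ungapped alignment, a uniform background over the 20 standard amino acids, and a pseudocount of 1. The four aligned segments and the query are invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def build_pwm(alignment, pseudocount=1.0):
    """Build log-odds scores per column from an ungapped multiple alignment."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pwm = []
    for col in range(width):
        column = [row[col] for row in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def best_match(pwm, query):
    """Slide the matrix along the query; return (start, score) of the best window.

    The query must contain only the 20 standard residues and be at least
    as long as the matrix.
    """
    width = len(pwm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(pwm[i][aa] for i, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical ungapped block of four aligned segments.
block = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]
pwm = build_pwm(block)
start, score = best_match(pwm, "MAVPGKSTLQE")
print(f"best window starts at index {start} with score {score:.2f}")
```

Residues that are frequent in a column score above zero and rare ones below, so the window that best matches the family pattern gets the highest total.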
Proteome Resources: Proteome BKL
RNA Sequences
Structure
Transgenics
Varied Biomedical Content
National Center for Biotechnology Information (NCBI): A network of linked resources
• Database access: GenBank, structure, function, SNP, taxonomy...
• Literature (PubMed)
• Whole genomes
• Tools
• Contacts & research information
• FTP
NCBI resources
• Nucleotide databases
• Protein databases
• Structure databases
• Taxonomy databases
• Genome databases
• Expression databases
Nucleic Acids Research, 2005, Vol 33, Database issue
http://www3.oup.co.uk/nar/database/cap/