BASIC BIOINFORMATICS BOOTCAMP
Welcome to Day 1
bioteach.ubc.ca/bootcamp
The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
Web Access: www.ncbi.nlm.nih.gov
Number of Users and Hits Per Day
[Figure: number of users per day, 1997 to 2003; y-axis from 0 to 450,000 users; dips marked at Christmas and New Year's Days]
Currently averaging
10,000,000 to 35,000,000
hits per day!
The NCBI ftp site
30,000 files per day
620 Gigabytes per day
NCBI Databases and Services
• GenBank: largest sequence database
• Free public access to biomedical literature
– PubMed: free Medline
– PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service
• VAST: structure similarity searches
• Software and Databases
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
What is GenBank? NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL) Database
[Diagram: the three collaborating databases exchange data with one another. GenBank (NCBI, NIH) is served through Entrez; EMBL (EBI) through SRS; DDBJ (NIG, CIB) through getentry. Each site accepts submissions and updates.]
GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/
Records 101,530,711
Total Bases 181,489,883,388 (includes WGS)
August 2007 Release 161
• full release every two months
• incremental updates daily
• available only via ftp
The Growth of GenBank
Release 161
Doubling time 12-14 months
Non-WGS: 79.5 billion bases
WGS: 102 billion bases
Organization of GenBank: Traditional Divisions
Records are divided into 18 Divisions: 12 Traditional and 6 Bulk.
Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian
PHG  Phage
SYN  Synthetic (cloning vectors)
ENV  Environmental Samples
UNA  Unannotated
Entrez query: gbdiv_xxx[Properties]
Organization of GenBank: Bulk Divisions
Records are divided into 18 Divisions: 12 Traditional and 6 Bulk.
Bulk Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized
EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA
PAT  Patent
Entrez query: gbdiv_xxx[Properties]
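The gbdiv_xxx[Properties] pattern can be filled in with any of the 18 division codes. The following is a minimal sketch of that substitution; the helper name division_query is hypothetical and not part of any NCBI tool, and the lowercase form of the code is an assumption based on the pattern shown on the slide.

```python
# Division codes and names as listed on the Traditional and Bulk slides.
TRADITIONAL = {
    "PRI": "Primate", "PLN": "Plant and Fungal", "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate", "ROD": "Rodent", "VRL": "Viral",
    "VRT": "Other Vertebrate", "MAM": "Mammalian", "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)", "ENV": "Environmental Samples",
    "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA", "PAT": "Patent",
}


def division_query(code: str) -> str:
    """Return an Entrez term restricting a search to one GenBank division.

    Hypothetical helper: it only fills in the gbdiv_xxx[Properties] pattern.
    """
    code = code.upper()
    if code not in TRADITIONAL and code not in BULK:
        raise ValueError(f"unknown GenBank division code: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


print(division_query("EST"))
print(division_query("PLN"))
```

The resulting term can be combined with other terms using AND, in the same way as the field-restricted queries later in these slides.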
A Traditional GenBank Record
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
Header
Feature Table
Sequence
The Flatfile Format
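The three parts of the flatfile (Header, Feature Table, Sequence) can be told apart by the keywords that open them: FEATURES starts the feature table, ORIGIN starts the sequence, and // ends the record. A minimal sketch of that split follows; split_flatfile is a hypothetical helper written for illustration, and the demo text is a heavily abbreviated copy of the AY182241 record, not a complete entry.

```python
def split_flatfile(record: str) -> dict:
    """Split one GenBank flatfile record into header, feature table, and sequence."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features      # feature table begins here
        elif line.startswith("ORIGIN"):
            current = sequence      # sequence block begins here
        elif line.startswith("//"):
            break                   # end-of-record marker
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}


# Abbreviated illustration only; most header and feature lines are omitted.
demo = """LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
FEATURES             Location/Qualifiers
     CDS             54..1784
                     /gene="AFS1"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
//
"""

parts = split_flatfile(demo)
for name, lines in parts.items():
    print(name, len(lines))
```

For real work, an established parser is a safer choice than a hand-written splitter like this one, which ignores multi-record files and line continuation rules.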
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
• Stable
• Reportable
• Universal
Version
• Tracks changes in sequence
GI number
• NCBI internal use
well annotated
the sequence is the data
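The VERSION line carries three identifiers at once: the stable accession, the version suffix that tracks sequence changes, and the GI number. A small sketch of pulling them apart is shown here; the regular expression and the function name parse_version_line are assumptions made for this example and cover only the simple layout shown on the slide.

```python
import re

# Assumed layout: "VERSION <accession>.<version> GI:<number>", GI optional.
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z]+_?\d+)\.(?P<version>\d+)(?:\s+GI:(?P<gi>\d+))?"
)


def parse_version_line(line: str) -> dict:
    """Return accession, version number, and GI number from a VERSION line."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": int(gi) if gi is not None else None,
    }


print(parse_version_line("VERSION U07418.1 GI:466461"))
print(parse_version_line("VERSION     AY182241.2  GI:32265057"))
```

In the second example the version is 2, which agrees with the COMMENT line of the AY182241 record stating that this sequence version replaced an earlier GI.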
Primary vs. Derivative Databases
[Diagram: primary vs. derivative databases]
GenBank (primary): receives sequences from sequencing centers and labs; updated ONLY by submitters
– Traditional divisions shown: PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
– Bulk divisions shown: EST, STS, GSS, HTG
Derivative Databases: built from GenBank by curators and algorithms; updated continually by NCBI
– RefSeq: LocusLink and Genomes Pipelines
– RefSeq: Annotation Pipeline
– UniGene
– UniSTS
Entrez Protein: Derivative Database
Data Source               Sequences
GenPept                   11,585,396
RefSeq                    3,889,502
Third Party Annotation    5,263
Swiss Prot                273,209
PRF                       12,079
PIR                       29,456
Total                     17,360,570
BLAST nr total            5,267,602 (no patents or env_nr; now 6 million)
(PAT Division             723,998)
PDB                       99,187
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GenPept: GenBank CDS translations
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
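A GenPept record is the conceptual translation of a CDS feature annotated on a GenBank nucleotide record. The sketch below illustrates the idea using the first 120 bases of the apple AFS1 record shown earlier (AY182241, CDS 54..1784, codon_start=1). The helper translate_cds is hypothetical; it assumes the standard genetic code, a forward-strand CDS with no joins, and it stops at the end of the available bases rather than at the real end of the CDS.

```python
# Standard genetic code, laid out in TCAG order for first, second, third base.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# First two ORIGIN lines of AY182241 (bases 1 to 120), spaces removed.
ORIGIN_START = (
    "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat "
    "tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg"
).replace(" ", "")


def translate_cds(sequence: str, cds_start: int, codon_start: int = 1) -> str:
    """Translate from a 1-based CDS start, using only complete codons."""
    begin = (cds_start - 1) + (codon_start - 1)
    coding = sequence[begin:].upper()
    usable = len(coding) - len(coding) % 3   # drop a trailing partial codon
    return "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, usable, 3))


protein_start = translate_cds(ORIGIN_START, cds_start=54)
print(protein_start)  # MEFRVHLQADNEQKIFQNQMKP
```

The printed peptide matches the beginning of the /translation qualifier in the AFS1 record, which is the same sequence a GenPept entry would carry.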
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
– chicken
– honeybee
– sea urchin
• Chromosome records
– Human genome
– microbial
– organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
Entrez query: srcdb_refseq[Properties]
Selected RefSeq Accession Numbers
mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
NR_123456  Curated non-coding RNA
XM_123456  Predicted mRNA
XP_123456  Predicted Protein
XR_123456  Predicted non-coding RNA
Gene Records
NG_123456  Reference Genomic Sequence
Chromosome
NC_123455  Microbial replicons, organelle genomes, human chromosomes
Assemblies
NT_123456  Contig
NW_123456  WGS Supercontig
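Because each RefSeq record type has its own two-letter prefix, the type can be read straight from the accession. A minimal lookup sketch follows; the table contains only the prefixes listed on this slide, and refseq_type is a hypothetical helper name.

```python
# Prefix meanings as given on the "Selected RefSeq Accession Numbers" slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Chromosome (microbial replicons, organelle genomes, human chromosomes)",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def refseq_type(accession: str):
    """Return the record type for a RefSeq accession, or None if not recognized."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_123456", "XP_123456", "AY182241"):
    print(acc, "->", refseq_type(acc))
```

A GenBank accession such as AY182241 has no underscore, so the function returns None for it; this mirrors the "distinct accession series" benefit of RefSeq.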
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
Other NCBI Databases
• Structure: imported structures (PDB); Cn3D viewer, NCBI curation
• CDD: conserved domain database
– Protein families (COGs and KOGs)
– Single domains (PFAM, SMART, CD)
• dbSNP: nucleotide polymorphism
• Gene: gene records; unifies LocusLink and Microbial Genomes
• HomoloGene: neighboring function for Gene
WWW Access: Entrez & BLAST
Gene
Homologene
Entrez: Database Integration
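[Diagram: Entrez integrates four databases: PubMed abstracts, Nucleotide sequences, Protein sequences, and 3-D Structure]
• Hard Links: connections between records in different databases
• Neighbors: related records within the same database
– PubMed: word weight
– Nucleotide and Protein: related sequences (BLAST)
– Protein: BLink, Domains
– Structure: related structures (VAST)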
The Links Menu: Access Links/Neighbors
SNP
GEO
Gene
PubMed
Protein
The Links Menu: Access Neighbors/Links
Neighbors: BLAST Link
pre-computed BLAST
Neighbors:
pre-computed CDD search
The Links Menu: Access Neighbors/Links
Neighbors
Hard Links
Database Searching with Entrez
Using limits and field restriction to find human MutL homolog
Linking and neighboring with MutL
Global NCBI (Entrez) Search
colon cancer
Global Entrez Search Results
OMIM: Human Disease Genes
Nucleotide Sequences
Nucleotide database now three parts
• EST: expressed sequence tags
• GSS: genome survey sequences
• CoreNucleotide: everything else
Advanced Search Options: Tabs
colon cancer[Title] AND nonpolyposis[Title]
colon cancer[Title] AND nonpolyposis[Title] AND
biomol_mrna[Properties] AND srcdb_refseq[Properties]
Advanced Search Options: Tabs
More Precise Nucleotides Search
colon cancer[Title] AND nonpolyposis[Title] AND human[Organism]
AND biomol_mrna[Properties] AND srcdb_refseq[Properties]
Useful Field Restrictions
[Title]: Definition line in GenBank / GenPept format, shown in Summary format
    glyceraldehyde 3 phosphate dehydrogenase[Title]
[Organism]: NCBI's taxonomy, the organizing system for molecular databases
    mouse[organism]; green plants[organism]; Streptomyces coelicolor[organism]
[Properties]: molecule type, location, database source
    biomol_mrna[properties]; biomol_genomic[properties]; gene_in_mitochondrion[properties]; srcdb_pdb[properties]
[Filter]: subsets of data, Entrez links
    all[filter]; nucleotide mapview[filter]; nucleotide_omim[filter]
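Field-restricted queries are just terms tagged with a field name in square brackets and joined with Boolean operators. The sketch below rebuilds the "More Precise Nucleotides Search" query from its parts; the helper names tag and all_of are hypothetical and exist only for this illustration.

```python
def tag(term: str, field: str) -> str:
    """Attach an Entrez field restriction to a search term."""
    return f"{term}[{field}]"


def all_of(*terms: str) -> str:
    """Join already-tagged terms with AND."""
    return " AND ".join(terms)


query = all_of(
    tag("colon cancer", "Title"),
    tag("nonpolyposis", "Title"),
    tag("human", "Organism"),
    tag("biomol_mrna", "Properties"),
    tag("srcdb_refseq", "Properties"),
)
print(query)
```

Building the query from pieces makes it easy to drop or swap a single restriction, for example replacing human[Organism] with mouse[Organism], and compare result counts.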
Organism Field: NCBI’s Taxonomy
All molecular
databases
Entrez: Use Gene for everything
HomoloGene
Entrez
Protein
Gene
Other Entrez DBs
BLink
Homologene:
Gene Neighbors
MLH1 Gene Record
MLH1 Gene Record: Interactions + GO
MLH1 Gene Record: Sequences
MLH1: Sequence Links
Finding Homologs: HomoloGene
Protein
mRNA
Genomic
HomoloGene Cluster
Gene Links Protein Links
Finding Homologs 2: BLink
BLink: BLAST Link (Best Hits)
BLAST
Opossum homolog
Redundant Proteins
First 200 only
Navigate to: bioteach.ubc.ca/bootcamp
Follow the link to the practical exercise page at the NCBI, where you'll find step-by-step instructions.
Strategy #1: search nt
Strategy #2: search Entrez Gene
Let's compare our results
Check your History
Use the preview tab and feature keys

Search  Result  Most Recent Queries
#1      215     Search human[Organism] AND cancer[Text Word] AND promoter[Feature key]
                (Approach A: Entrez CoreNucleotide search)
#2      48178   CoreNucleotide Links for Gene (Search human[Organism] AND cancer[Text Word] AND gene_nucleotide[Filter])
                (Approach B: Entrez Gene, follow link to CoreNucleotide)
#3      317     Search #2 AND promoter[Feature key]
                (limit Approach B search to records with promoter annotated)
#4      173     Search #1 NOT #3
                (unique hits from Approach A: straight to Entrez CoreNucleotide search)
#5      275     Search #3 NOT #1
                (unique hits from Approach B: Entrez Gene to CoreNucleotide)
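The NOT searches in the History behave like set differences, so the counts can be cross-checked with simple arithmetic. The sketch below uses only the result counts reported in the History table; the variable names are ours.

```python
# Result counts taken from the History table.
approach_a = 215   # search #1: direct CoreNucleotide search
approach_b = 317   # search #3: Entrez Gene links, limited to promoter features
only_a = 173       # search #4: #1 NOT #3
only_b = 275       # search #5: #3 NOT #1

# Records found by both approaches, computed from each side.
shared_from_a = approach_a - only_a
shared_from_b = approach_b - only_b
assert shared_from_a == shared_from_b, "the two differences should agree"

shared = shared_from_a
total_distinct = only_a + only_b + shared
print(shared, total_distinct)  # 42 490
```

Both differences give 42 shared records, so the two strategies overlap only slightly and together retrieve 490 distinct records; the same number could be confirmed in Entrez with "#1 AND #3".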
Searching PubMed
• How many papers in PubMed are there:
– about cancer?
– about carrots?
• Using Entrez PubMed, can you see if there are any scientific links between carrots and cancer?
– How many papers are there about “carrots AND cancer”?
– What is the active chemical substance in carrots that may play a role in cancers?
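The same counting questions can also be asked programmatically. The slides use the Entrez web interface only; the sketch below assumes NCBI's E-utilities ESearch endpoint, which is not covered in this session, and it only builds the request URLs without sending them. The function name pubmed_count_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not part of these slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term: str) -> str:
    """Build an ESearch URL asking PubMed for the number of matching records."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


for term in ("cancer", "carrots", "carrots AND cancer"):
    print(pubmed_count_url(term))
```

Opening one of these URLs in a browser should return a short XML document with a count; comparing the three counts answers the first two questions, while the third still requires reading the retrieved abstracts.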
You can make up your own
examples to search PubMed…
or the Bookshelf…
http://www.ncbi.nih.gov/Database/datamodel
Links
• The About Entrez page at the NCBI
http://www.ncbi.nlm.nih.gov/Database/index.html
• Model of Entrez Databases from NCBI
http://www.ncbi.nih.gov/Database/datamodel/index.html
• PubMed Tutorial from NLM
http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html