122
Common tools, useful databases, and tricks of the trade. Bioinformatics bioteach.ubc.ca/bioinfo2008 [email protected] 1

[email protected]€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Common tools, useful databases, and tricks of the trade.

Bioinformatics

bioteach.ubc.ca/bioinfo2008

[email protected]

1

Page 2: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Workshop Schedule

• Laptops, available here for your use 9am - 4:30pm

• wireless login

mslguest

4myguest

• Vancouver guide books available

2

Page 3: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Today’s Topics

• DNA Sequencing - Generating Data & Emerging Technologies

• Sequence Databases - Public Resources at the NCBI

• GUIDED TOUR - Advanced Tips & Tricks for Searching Entrez

• PRACTICAL EXERCISES - Navigating Links, Retrieving Data with Entrez, and Searching PubMed

3

Page 4: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

What is Bioinformatics?

4

Page 5: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Bioinformatics for BiologistsGovernment Employees

5

Page 6: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

6

Page 7: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

By the end of the morning session, you will define bioinformatics

in your own words

7

Page 8: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

0

50

100

150

200

1990 1995 2000 2001 2002 2003 2004 2005 2006 2007

Growth of GenBank

In 2005, International sequence databases

exceed 100 gigabases 8

Page 9: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

9

Page 10: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Personalized Medicine?

10

Page 11: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

11

Page 12: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Linking Biological Information• Nucleic acid & Protein

Sequence data

• Sequence similarity implies homology or similar function

• Literature / Books

• Papers on related topics

• Macromolecular 3D Structure Data

• Similar protein fold implies homology

• Regulatory Pathways, Expression Data

• Two genes / enzymes in the same pathway

• Taxonomy

• Linkage of organisms by evolution

12

Page 13: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

=

13

Page 14: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

What is Bioinformatics?

14

Page 15: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Bioinformatics is an fast-paced interdisciplinary research field that involves the integration of

computers, software tools, and databases in an effort to address biological questions.

15

Page 16: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Genomics refers to the analysis of all of the genes and transcripts included within the genome. Proteomics, on the

other hand, refers to the analysis of the complete set of

proteins or proteome.

16

Page 17: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Bioinformatics Questions• What is encoded by the

genome?

- Genes, regulatory, and functional regions

• How is genome information expressed?

- Function of genes and gene products (proteins)

- Structure of proteins

• How can we interpret the information encoded in the genome?

- Linking knowledge to the biological entities.

- Systems biology approach

- drugs, metabolites, ...

• How does the genome interact with its environment?

17

Page 18: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Summary

• An article called, “What is Bioinformatics?” is available from the Science Creative Quarterly.

http://www.scq.ubc.ca/what-is-bioinformatics/

18

Page 19: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Generating Data & Emerging Technologies

DNA Sequencing

19

Page 20: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Technology

20

Page 21: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

ggagcctcgg gaggtggtgg agtgacctgg ccccagtgct gcgtccttat cagccgagccggtcccagct cttgctcctg cctgtttgcc tggaaatggc cacgcttctc cttctccttggggtgctggt ggtaagccca gacgctctgg ggagcacaac agcagtgcag acacccacctccggagagcc tttggtctct actagcgagc ccctgagctc aaagatgtac accacttcaataacaagtga ccctaaggcc gacagcactg gggaccagac ctcagcccta cctccctcaacttccatcaa tgagggatcc cctctttgga cttccattgg tgccagcact ggttcccctttacctgagcc aacaacctac caggaagttt ccatcaagat gtcatcagtg ccccaggaaacccctcatgc aaccagtcat cctgctgttc ccataacagc aaactctcta ggatcccacaccgtgacagg tggaaccata acaacgaact ctccagaaac ctccagtagg accagtggagcccctgttac cacggcagct agctctctgg agacctccag aggcacctct ggaccccctcttaccatggc aactgtctct ctggagactt ccaaaggcac ctctggaccc cctgttaccatggcaactga ctctctggag acctccactg ggaccactgg accccctgtt accatgacaactggctctct ggagccctcc agcggggcca gtggacccca ggtctctagc gtaaaactatctacaatgat gtctccaacg acctccacca acgcaagcac tgtgcccttc cggaacccagatgagaactc acgaggcatg ctgccagtgg ctgtgcttgt ggccctgctg gcggtcatagtcctcgtggc tctgctcctg ctgtggcgcc ggcggcagaa gcggcggact ggggccctcgtgctgagcag aggcggcaag cgtaacgggg tggtggacgc ctgggctggg ccagcccaggtccctgagga gggggccgtg acagtgaccg tgggagggtc cgggggcgac aagggctctgggttccccga tggggagggg tctagccgtc ggcccacgct caccactttc tttggcagacctggctctct ggagccctcc agcggggcca gtggacccca ggtctctagc gtaaaactatctacaatgat gtctccaacg acctccacca acgcaagcac tgtgcccttc cggaacccagatgagaactc acgaggcatg ctgccagtgg ctgtgcttgt ggccctgctg gcggtcatag

21

Page 22: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

22

Page 23: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

23

Page 24: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Figure 1. The Sanger sequencing reaction. Single stranded DNA is

amplified in the presence of fluorescently labelled ddNTPs that

serve to terminate the reaction and label all the fragments of DNA

produced. The fragments of DNA are then separated via polyacrylamide

gel electrophoresis and the sequence read using a laser beam and

computer.

source: http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/

24

Page 25: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Every few years, a new technology comes along that dramatically changes how fundamental questions in biology are

addressed. The impact of the technology is not always appreciated at first ...

- Stanley Fields

25

Page 26: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

• DNA sequencing by synthesis

• approach built around very large number of short sequence reads

• key points:

solid phase amplification = no cloning necessary

reversible chemistry

data generated by imaging

read lengths 30-50 bp

< 1% cost, ultra high-throughput

Solexa Technology

26

Page 27: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

source: http://www.illumina.com/27

Page 28: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

source: http://www.illumina.com/28

Page 29: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

source: http://www.illumina.com/29

Page 30: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

source: http://www.illumina.com/30

Page 31: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Illumina/Solexa instrument

• Laser-based TIRF* Optics

• 4-colour Detection

• CCD camera

• 8-channel flow cell

• 1 Gb / run at launch

source: Dr. Steven Jones, GSC 31

Page 32: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Data acquisitionsource: Dr. Steven Jones, GSC32

Page 33: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

33

Page 34: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

0

100

200

300

400

1990 1995 2000 20012002

20032004

20052006

2007

GenBank

34

Page 35: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

0

100

200

300

400

1990 1995 2000 20012002

20032004

20052006

2007

Two Days

One machine, running for two days, can generate

~1 Gb data.35

Page 36: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

0

100

200

300

400

1990 1995 2000 20012002

20032004

20052006

2007

One Month

One machine, running for one month, could generate

~15 Gb data.36

Page 37: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

0

100

200

300

400

1990 1995 2000 20012002

20032004

20052006

2007

One Year

Two machines, running for one year, could generate

~365 Gb data.37

Page 38: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

source: Fields, Science 2007, p144138

Page 39: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

• ultra high-throughput sequencing

- DNA binding site identification

- genome re-sequencing; SNPs, expression

low cost

whole genome

any genome

Frontiers in Bioinformatics

39

Page 40: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Credits & References

• Technology Spotlight on DNA Sequencing with Solexa Technology:

http://www.illumina.com/downloads/SS_DNAsequencing.pdf

• Dr. Steven Jones, GSC

several slides/images used with permission

• Stanley Fields, “Site-Seeing by Sequencing”, Science, 8 June 2007

40

Page 41: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Public Resources at the NCBI

Sequence Databases

41

Page 42: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

The National Center for Biotechnology Information

Bethesda,MD

42

Page 43: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

NCBI

• Created in 1988 as a part of the National Library of Medicine at NIH

• Establish public databases

• Research in computational biology

• Develop software tools for sequence analysis

• Disseminate biomedical information

43

Page 44: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

www.ncbi.nlm.nih.gov

44

Page 45: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Number of Users and Hits Per Day

Currently averaging10,000,000 to 35,000,000

hits per day!

1997 1998 1999 2000 2001 2002 2003

Christmas &New Year’s Days

45

Page 46: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

46

Page 47: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

The NCBI ftp site

• 30,000 files per day

• 620 Gigabytes per day47

Page 48: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

NCBI Databases & Services• GenBank largest sequence database

• Free public access to biomedical literature

• PubMed free Medline

• PubMed Central full text online access

• Entrez integrated molecular & literature databases

• BLAST highest volume sequence search service

• VAST structure similarity searches

• Software and Databases

48

Page 49: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Types of Databases

Primary Databases

Original submissions by experimentalists

Content controlled by the submitter

Examples: GenBank, SNP, GEO

Derivative

Databases

Built from primary data

Content controlled by third party (NCBI)

Examples: Refseq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

49

Page 50: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

What is GenBank? NCBI’s Primary Sequence Database

• Nucleotide only sequence database

• Archival in nature

• Historical

• Reflective of submitter point of view (subjective)

• Redundant

GenBank Data

Direct submissions (traditional records)

Batch submissions (EST, GSS, STS)

ftp accounts (genome data)

50

Page 51: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

GenBank

EMBLDDBJ

Entrez

SRSgetentry

NCBI

EBI

CIB

International Sequence Database

Collaboration- submit anywhere

- daily updates51

Page 52: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

GenBank: NCBI’s Primary Sequence Database

52

Page 53: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

ftp://ftp.ncbi.nih.gov/genbank/

• full release every two months

• incremental updates daily

• available only via ftp

Release 161 August 2007 101,530,711 Records181,489,883,388* Total Bases

*includes WGS

53

Page 54: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Growth of GenBank

0

50

100

150

200

1990 1995 2000 2001 2002 2003 2004 2005 2006 2007

GenBank WGS

Current Release 163Doubling time 12-14 months

54

Page 55: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Organization of GenBank

Traditional Divisions:

- Direct Submissions (Sequin and BankIt)

- Accurate

- Well characterized

PRI Primate PLN Plant and FungalBCT Bacterial and Archeal INV InvertebrateROD RodentVRL ViralVRT Other VertebrateMAM Mammalian PHG PhageSYN Synthetic(cloning vectors)ENV Environmental Samples UNA Unannotated

Records are divided into 18 Divisions.12 Traditional

6 Bulk

55

Page 56: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Organization of GenBank

BULK Divisions:

- Batch Submission (Email and FTP)

- Inaccurate

- Poorly characterized

EST Expressed Sequence Tag GSS Genome Survey SequenceHTG High Throughput GenomicSTS Sequence Tagged SiteHTC High Throughput cDNAPAT Patent

Records are divided into 18 Divisions.12 Traditional

6 Bulk

Entrez query: gbdiv_xxx[Properties]56

Page 57: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

A TraditionalGenBank Record

Header

Feature Table

Sequence57

Page 58: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Traditional GenBank Record

58

ACCESSION U07418

VERSION U07418.1 GI:466461

Accession•Stable•Reportable•Universal

GI number• NCBI internal

use

Version•Tracks changes in sequence

Page 59: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

the sequence is the datawell annotated

59

Page 60: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Primary vs. Derivative Databases

ACGTGC

CG

TG

A

ATTGACTAACGTGC

AC

GT

GC

TTG

ACA

TATA

GCCG

GenBank

Sequencing

Centers

GAGA

ATT

C

C

GAGA

ATT

C

C

RefSeq:

Entrez Gene and

Genomes Pipelines

Labs

Curators

TATAGCCG

AGCTCCGATA

CCGATGACAA

UniGene

RefSeq:

Annotation

Pipeline

Algorithms

Updated

continually

by NCBI

60

Page 61: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Derivative Databases

61

Page 62: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Entrez Protein0 3,750,000 7,500,000 11,250,000 15,000,000

GenPept

RefSeq

Third Party Annotation

Swiss Prot

PIR

PRF

PDB

PAT Division

BLAST nr

62

Page 63: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

GenPept• GenBank CDS translations

FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

63

Page 64: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

RefSeq• The goal is to provide a comprehensive,

standard dataset that represents sequence information for a species.

- transcript, protein, assembled genomic (contigs), and chromosome records

- known and predicted

- reviewed

- human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genome (proteins), organelles, and more

64

Page 65: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

RefSeq Accession Numbers

• mRNAs and Proteins

NM_123456 Curated mRNA

NP_123456 Curated Protein

NR_123456 Curated nc RNA

XM_123456 Predicted mRNA

XP_123456 Predicted Protein

XR_123456 Predicted nc RNA

• Genomic Records

NG_123456 Reference Genomic

Sequence

• Chromosome

NC_123455 Microbial

replicons, organelle, genomes,

human chromosomes

• Assemblies

NT_123456 Contig

NW_123456 WGS Supercontig

65

Page 66: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Other NCBI Databases

Structure: imported structures (PDB)

Cn3D viewer, NCBI curation

CDD: conserved domain database

Protein families (COGs and

KOGs); Single domains (PFAM,

SMART, CD)

dbSNP: nucleotide polymorphism

Gene: gene records Unifies LocusLink and

Microbial Genomes

HomoloGene: homologs neighboring function for

Gene

66

Page 67: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

67

Page 68: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

GUIDED TOUR: Retrieving Data

Sequence Databases

68

Page 69: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

WWW

Entrez

BLAST

http://www.ncbi.nlm.nih.gov/

69

Page 70: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

http://www.ncbi.nih.gov/Database/datamodel

70

Page 71: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Gene

Homologene

PubMed abstracts

Nucleotide sequences

Protein sequences

3-D Structure

3 -D

Structure

Hard LinkNeighborsRelated Sequences

NeighborsRelated SequencesBlinkDomains

NeighborsRelated Structures

NeighborsRelated Articles

BLAST

NeighborsRelated Sequences

VAST

BLAST

Word weight

71

Page 72: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Neighbors in Entrez

72

Page 73: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Blink & DomainsNeighbors: BLAST Linkpre-computed BLAST

Neighbors:pre-computed CDD search

73

Page 74: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Links

Neighbors

Hard Links

74

Page 75: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Database searching with Entrez

• Scenario: Let’s find out more about the MutL gene

• Using limits and field restriction to find human MutL homolog

• Linking and neighboring with MutL

75

Page 76: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Start with a search for “colon cancer”

76

Page 77: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

77

Page 78: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Human Disease Genes

78

Page 79: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Search CoreNucleotide

Nucleotide database now three parts:EST expressed sequence tagsGSS genome survey sequencesCoreNucleotide everything else

79

Page 80: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Advanced Search OptionsTabs

80

Page 81: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

colon cancer[Title] AND nonpolyposis[Title]81

Page 82: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

colon cancer[Title] AND nonpolyposis[Title] AND biomol_mrna[Properties] AND srcdb_refseq[Properties]

82

Page 83: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Advanced Search OptionsTabs

83

Page 84: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

84

Page 85: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Refining your Search

colon cancer[Title] AND nonpolyposis[Title] AND human[Organism] AND biomol_mrna[Properties] AND srcdb_refseq[Properties]

85

Page 86: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Useful Field Restrictions

• [Title]: Definition line in GenBank / GenPept format shown in Summary formatglyceraldehyde 3 phosphate dehydrogenase[Title]

• [Organism]: NCBI’s taxonomy. Organizing system for molecular databasesmouse[organism]; green plants[organism]; Streptomyces coelicolor

[organism]

• [Properties]: molecule type, location, database sourcebiomol_mrna[properties]; biomol_genomic[properties];

gene_in_mitochondrion[properties]; srcdb_pdb[properties]

• [Filter]: subsets of data, Entrez linksall[filter]; nucleotide mapview[filter]; nucleotide_omim[filter]

86

Page 87: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

87

Page 88: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

88

Page 89: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

89

Page 90: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Taxonomy

All molecular databases

90

Page 91: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Goal: Find MLH1 homologs

• Tip: Use Entrez Gene as your hub to connect to everything else!

Entrez Gene

HomologeneProtein

Gene neighborsBLink Other Entrez

Databases91

Page 92: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

92

Page 93: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

MLH1 Gene Record

93

Page 94: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Interactions + GO

94

Page 95: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Sequences

95

Page 96: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

MLH1: Sequence Links

96

Page 97: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

97

Page 98: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Finding Homologs:

ProteinmRNA

Genomic

98

Page 99: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

HomoloGene Cluster

Gene Links Protein Links99

Page 100: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Finding Homologs 2: BLink

100

Page 101: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

BLink: BLAST Link (Best Hits)

BLAST

Opossum homolog

Redundant ProteinsFirst 200 only

101

Page 102: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

102

Page 103: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

PRACTICAL EXERCISES: Navigating Links, Retrieving Data with Entrez, and Searching PubMed

Sequence Databases

103

Page 104: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

navigate to:bioteach.ubc.ca/bioinfo2008

Follow link to practical exercisepage at the NCBI where you’ll find

step-by-step instructions

Strategy #1: search nt

Let’s compareour results

Strategy #2: search entrez gene

Use the preview tab and feature keys

104

Page 105: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Search Most Recent Queries Result

#5Search #3 NOT #1 (unique hits from Approach B:

Entrez Gene to CoreNucleotide) 286

#4Search #1 NOT #3 (unique hits from Approach A:

straight to Entrez CoreNucleotide search) 183

#3Search #2 AND promoter[Feature key]

(limit Approach B search to records with promoter annotated)

329

#2

CoreNucleotide Links for Gene (Search human[Organism] AND cancer[Text Word] AND

gene_nucleotide[Filter])(Approach B: Entrez gene follow link to

CoreNucleotide)

53844

#1Search human[Organism] AND cancer[Text Word]

AND promoter[Feature key](Approach A: Entrez CoreNucleotide search)

226

Check your History

105

Page 106: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Searching PubMed• How many papers in PubMed are there:

- about cancer?

- about carrots?

• Using Entrez PubMed, can you see if there is any scientific links between carrots and cancer?

• How many papers are there about “carrots AND cancer”?

• What is the active chemical substance in carrots that may play a role in cancers?

106

Page 107: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

107

Page 108: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

108

Page 109: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

109

Page 110: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

110

Page 111: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

111

Page 112: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

112

Page 113: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

113

Page 114: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

114

Page 115: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

115

Page 116: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

116

Page 117: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

117

Page 118: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

You can make up your own examples, to search

Pubmed… or the Bookshelf…

118

Page 119: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

119

Page 120: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Web References

• The “About Entrez” page at the NCBI

http://www.ncbi.nlm.nih.gov/Database/index.html

• Model of Entrez Databases from NCBI

http://www.ncbi.nih.gov/Database/datamodel/index.html

• PubMed Tutorial from NLM

http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html

120

Page 121: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

Credits

• Materials for this presentation have been adapted from the following sources:

NCBI HelpDesk - Field Guide Course Materials

Bioinformatics: A practical guide to the analysis of genes and proteins

• Questions? Please contact:

Dr. Joanne FoxMichael Smith [email protected]

121

Page 122: joanne@msl.ubc€¦ · NP_123456 Curated Protein NR_123456 Curated nc RNA XM_123456 Predicted mRNA XP_123456 Predicted Protein XR_123456 Predicted nc RNA • Genomic Records NG_123456

BLAST background, guided tour & practical exercises

Let’s start at 9:30am

[email protected]