Computational Biology and Bioinformaticscschweikert/cisc4020/sequenceData… · Computational Biology and Bioinformatics • Computational biology – Development of algorithms to

Computational Biology and Bioinformatics

• Computational biology

– Development of algorithms to solve problems in biology

• Bioinformatics

– Application of computational biology to the analysis and management of biological data

• Applied bioinformatics

– Intelligent use of tools to navigate the sequence space

Better experiments can be designed by a careful Bioinformatics analysis before the bench work

Bioinformatics 7,850,000 hits

DNA sequencing

• DNA sequences are the most abundant type of sequences (5 * 1010 )

• Generated by the chain termination method (Sanger sequencing)

• Based on the action of a DNA polymerase that adds nucleotides tocomplementary strand

• Fluorescently labeled ddNTP (Dideoxynucleotides) stop synthesis acting as chain terminators. They are included in amounts so as to terminate every time the base appears in the template

• Requires template DNA, and primer and one ddNTP for each base: A,C,G, and T

• Products are separated by electrophoresis

A

AA

GG

G

CC

C

TT

T

Primer

Summary of chain termination sequencing

In the past, four different reactions, one for each ddNTP, were separated on a gel that could resolve one-base differences. The sequence was then read from the bottom of the gel to the top.

Sequence reading of fluorescently labeled reactions

• Fluorescently labeled reactions scanned by laser as particular point is passed

• Color picked up by detector

• Output sent directly to computer

In its simplest form a sequence can be represented as a string of nucleotides with a basic tag or identifier after a greater than character “>”: FASTA format

Definition line (commonly called “def line”)

>U54469.1

CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA

GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG

TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT

TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA

GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC

GTGGCTACATCATCATTGTGTTCACCGATTATTTTTTGCACAATTGCTTAATATTAATTGTACTTGCACG

CTATTGTCTACGTCATAGCTATCGCTCATCTCTGTCTGTCTCTATCAAGCTATCTCTCTTTCGCGGTCAC

TCGTTCTCTTTTTTCTCTCCTTTCGCATTTGCATACGCATACCACACGTTTTCAGTGTTCTCGCTCTCTC

TCTCTTGTCAAGACATCGCGCGCGTGTGTGTGGGTGTGTCTCTAGCACATATACATAAATAGGAGAGCGG

More information can be be added to the FASTA definition line

>gb|U54469.1|DMU54469 Drosophila melanogaster

eukaryotic initiation factor 4E (eIF4E) gene.

GenBank

Accession.versionLocusDescritpion

Editing Errors

Capillary electrophoresis

• Newer automated

sequencers use very

thin capillary tubes

• Run all four

fluorescently tagged

reactions in same

capillary

• Can have 96

capillaries running at

the same time

96–well plate

robotic arm and syringe

96 glass capillaries

load bar

New sequencing technologies454 Life Sciences: FLX titanium: 1,000,000 reads 400 bp= 400,000,000 high quality bp in 10h run

http://www.454.com/products-solutions/how-it-works/index.asp

Present and Future of sequencing

• Sequencing costs

– Dropping each year

•Opens possibility of sequencing genomes of individuals

•Greatly facilitates comparative genomics.

DNA RNA

cDNAESTsUniGene

phenotype

genomicDNAdatabases

protein sequence databases

protein

GenBankEMBL DDBJ

Housedat EBI

EuropeanBioinformatics

Institute

There are three major public DNA databases

Housed at NCBINational

Center forBiotechnology

Information

Housed in Japan

Taxonomy at NCBI:>200,000 species are represented in GenBank

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi

The most sequenced organisms in GenBank

Homo sapiens 13.1 billion basesMus musculus 8.4b

Rattus norvegicus 6.1b

Bos taurus 5.2bZea mays 4.6b

Sus scrofa 3.6bDanio rerio 3.0b

Oryza sativa (japonica) 1.5bStrongylocentrotus purpurata 1.4b

Nicotiana tabacum 1.1b

GenBank release 168.0

Sequence databases

• Examples of sequence databases– Primary databases (archival)

• GenBank

• EMBL (European Molecular Biology Laboratory)

• DDBJ (DNA Data Bank of Japan)

– Secondary databases (curated)• RefSeq

• EMBL Genome Reviews

• Protein databases

• TPA (Third party annotations)

• What is a database?– An indexed set of records

– Records retrieved using a query language

Web Browser

BLAST Search

Engine

Database

Web Server

Biologist query

The client–server model has made access to sequence databases fast and easy

Data flow of submissions between primary databases (Chp. 1)

GenBank

EMBL

DDBJ

•Submissions

•Updates

(Sequin)NCBI

Entrez

NIHNational Institute of

Health(USA)

•Submissions

•Updates

EBI

Ensambl

EMBLEuropean

MolecularBiology

Laboratory

•Submissions

•Updates

(Sakura)

NIGNational

Institute of

Genetics(JAPAN)

International Nucleotide

Sequence Database

CollaborationUpdated every 24 hs

National Center for

Biotechnology Information

DNA Data Bank of Japan

European

Bioinformatics Institute

CIB

Getentry

Center for Information Biology

http://getentry.ddbj.nig.ac.jp/getstart-e.html

Integrated information retrieval systemIs an interface not a database

http://www.ensembl.org/index.html

http://www.ensembl.org/index.html

Nucleotide sequence flatfilesHeader: Locus (10 characters, 1st letter, arbitrary name, not

useful), Length, Molecule type (DNA or RNA), Division code (includes functional divisions as ExpressedSeqTags,

SeqTagsSites, WholeGenomeSeq, etc.), Last release date EMBL).

Definition: Summary of biological information

Accession number: Primary key to reference a record in the database. Used in publications (1+5 or 2+6).

Version: The version is increased every time the sequence is updated. For each version there is a dif. GI geninfo identifier, which is specific of GenBank, not very useful

Keywords: abandoned because of absence of controlled vocabulary.

Source / Organism : taxonomic information top-down.

Reference: submission credit and published paper.

Feature Table: direct representation of the biological information in the record:

Source: org, chromosome, mapGene may include regulatory sequences

mRNA from start of translation to polyadenilation signal

CDS (coding sequences): from start to stop codon, provides

exon coordinates. Every CDS is assigned a protein_id.version (3+5)

Sequence: actual sequence ending in //

Secondary databases

Refseq– Comprehensive, integrated, and non-redundant set of sequences

– Genomic DNA – RNA – Protein explicitly linked

– 2_6 Format (the underscore is never present in GenBank accessions):

• Experimental Predicted

• NT_123456 (genomic contig)

• NM_123456 (mRNAs) [XM_123456 (model mRNAs)]

• NP_123456 (Proteins) [XP_123456 (model protein)]

– Undergo continuous curation: most up to date sequence

– Each RefSeq is a synthesis of information, not a piece of a primary research:

equivalent to a“review article”

– Message for

• Removed

• Secondary

Secondary databases

Third Party Annotation (TPA)– Includes

• Reannotations,

• Combinations of novel andexisting primary entries

• Annotations of trace archives

• Whole genome Shotgun data

– Provides

• GenBank accession. Version numbers and nucleotide locations for all primary entries to which the TPA sequence relates

EMBL Genome reviews– Includes

• Add information from UniProtknowledgebase, Gene Ontology Annotation, InterPro, and others

• Curated versions of entries representing complete genomes

• Standardize annotations

Protein databases

• GenPept: translations of all CDS. Not curated

• Uniprot (Swiss-Prot/TrEMBL/PIR-PSD)

– UniParc: most comprehensive, public

nonredundant protein database

• Swiss-Prot (manual)/TrEMBL(computer)/PIR-PSD

• GenBank, Patents, Int. Pr. Index (IPI)

• Protein Data Bank

– UniProt Knowledgebase: curated subset of

UniParc

• Function Postranslational modifications

• Domains Catalytic sites

• Structures Associated diseases

• Pathways Etc.

– UniRef: UniProt nonredundant reference

database: 95%, 90% and 50% sets.

• Functional groups

– Pfam

– Prosite

– IternPro

Merge 95% =

Merge 90% =

Merge 50% =

All predicted coding regions

http://www.ebi.ac.uk/blast2/index.html?UniProt

www.ncbi.nlm.nih.gov

PubMed is…

• National Library of Medicine's search service

• 19 million citations in MEDLINE

• links to participating online journals

• PubMed tutorial

Entrez integrates…

• the scientific literature; • DNA and protein sequence databases;

• 3D protein structure data; • population study data sets; • assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

BLAST is…

• Basic Local Alignment Search Tool

• NCBI's sequence similarity search tool

• supports analysis of DNA and protein databases

• 100,000 searches per day

OMIM is…

•Online Mendelian Inheritance in Man

•catalog of human genes and genetic disorders

•created by Dr. Victor McKusick; led by Dr. Ada Hamosh

at JHMI

Bookshelf is…

• searchable resource of on-line books

Taxonomy Browser is…

• browser for the major divisions of living organisms

(archaea, bacteria, eukaryota, viruses)

• taxonomy information such as genetic codes

• molecular data on extinct organisms

• practically useful to find a protein or gene from a species

Structure site includes…

• Molecular Modeling Database (MMDB)

• biopolymer structures obtained from

the Protein Data Bank (PDB)

• Cn3D (a 3D-structure viewer)

• vector alignment search tool (VAST)

Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.

You may want to acquire information beginning with a

query such as the name of a protein of interest, or theraw nucleotides comprising a DNA sequence of interest.

DNA sequences and other molecular data are tagged with

accession numbers that are used to identify a sequenceor other record relevant to molecular data.

Page 26

What is an accession number?

An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that

corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

X02775 GenBank genomic DNA sequence

NT_030059 Genomic contig

Rs7079946 dbSNP (single nucleotide polymorphism)

N91759.1 An expressed sequence tag (1 of 170)

NM_006744 RefSeq DNA sequence (from a transcript)

NP_007635 RefSeq protein

AAC02945 GenBank protein

Q28369 SwissProt protein

1KT7 Protein Data Bank structure record

protein

DNA

RNA

Page 27

NCBI’s important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI)provides an expertly curated accession number that

corresponds to the most stable, agreed-upon “reference”

version of a sequence.

RefSeq identifiers include the following formats:

Complete genome NC_######Complete chromosome NC_######

Genomic contig NT_######

mRNA (DNA format) NM_###### e.g. NM_006744Protein NP_###### e.g. NP_006735

Page 27

Accession Molecule Method Note

AC_123456 Genomic Mixed Alternate complete genomic

AP_123456 Protein Mixed Protein products; alternate

NC_123456 Genomic Mixed Complete genomic molecules

NG_123456 Genomic Mixed Incomplete genomic regions

NM_123456 mRNA Mixed Transcript products; mRNA

NM_123456789 mRNA Mixed Transcript products; 9-digit

NP_123456 Protein Mixed Protein products;

NP_123456789 Protein Curation Protein products; 9-digit

NR_123456 RNA Mixed Non-coding transcripts

NT_123456 Genomic Automated Genomic assemblies

NW_123456 Genomic Automated Genomic assemblies

NZ_ABCD12345678 Genomic Automated Whole genome shotgun data

XM_123456 mRNA Automated Transcript products

XP_123456 Protein Automated Protein products

XR_123456 RNA Automated Transcript products

YP_123456 Protein Auto. & Curated Protein products

ZP_12345678 Protein Automated Protein products

NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

Access to sequences: Entrez Gene at NCBI

Entrez Gene is a great starting point: it collectskey information on each gene/protein from

major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA

corresponding to mRNA) or protein (NP_000509)

Page 29

From the NCBI home

page, type “beta globin”and hit “Search”

Follow the link to “Gene”

Entrez Gene is in the headerNote the “Official Symbol” HBB for beta globin

Note the “limits” option

By applying limits, there are now far fewer entries

Entrez Gene (top of page)

Note that links tomany other HBB database entries are available

Entrez Gene (middle of page): genomic region, bibliography

Entrez Gene (middle of page, continued): phenotypes, function

Entrez Gene (bottom of page): RefSeq accession numbers

Entrez Protein:

accession, organism,

literature…

Entrez Protein:…features of a protein, and its sequence

in the one-letter amino acid code

FASTA format:versatile, compact with one header line

followed by a string of nucleotides or amino acids in the single letter code

Documents

Computational Biology and Bioinformaticscschweikert/cisc4020/sequenceData… · Computational Biology and Bioinformatics • Computational biology – Development of algorithms to