Databases in bioinformatics II - Göteborgs universitet
bio.lundberg.gu.se/courses/ht09/bio2/2009_DB2_toprint.pdf

Databases in bioinformatics II

Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology

Institute of Biomedicine

BIOINFORMATICS AND SYSTEMS BIOLOGY, MSC PROGR Sequence analysis, UMF018, 2009


Overview

– Genome sequencing

– Sequencing methods
  • Sanger, Maxam
  • Next generation methods (2nd, 3rd)
  • Uses
  • Implications

– RefSeq vs GenBank

– Trace Archive

– Refining searches at Entrez

– eUtils (programming utilities)


Organization of GenBank

Query-specific subsets: particular technique; interpretation of data from a proper biological point of view

Traditional divisions: direct submissions (Sequin and BankIt); accurate; well characterized

PRI Primate
ROD Rodent
MAM Mammalian
VRT Other Vertebrate
INV Invertebrate
PLN Plant and Fungal
BCT Bacterial and Archaeal
VRL Viral
PHG Phage
SYN Synthetic (cloning vectors)
UNA Unannotated

Bulk divisions: batch submission (email and FTP); inaccurate; poorly characterized

EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
STS Sequence Tagged Site
HTC High Throughput cDNA
PAT Patent
WGS Whole Genome Shotgun
ENV Environmental Samples
CON Constructed sequences

Benson DA, et al. 2008. Nucleic Acids Research


Why Sequencing Genomes



Organisms are remarkably similar at the molecular level despite their obvious outward differences

Genes with similar DNA sequences tend to perform similar functions

By understanding the function of a gene in one organism, we may get an idea of what function that gene performs in a more complex organism (humans)

Applied to various fields: medicine, biological engineering, forensics, etc.


Archon X Prize

"the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome." $10 million

HGP: 1993; 1st draft 2000; final 2003 ($3 billion)

2007: $2 million in 2 months

James Watson
Craig Venter

2008: $60,000-$100,000 in 4 weeks


Personalized medicine

Deep sequencing: mutations, cancer genetics, pharmacogenetics

Lab computer infrastructure: data storage, data transfer capacity

Training lab personnel

Statistical methods: incidental discoveries, uncertain clinical significance

Ethics
• Consent (children, incompetent adults)
• Results with uncertain clinical significance
• Prediction of serious diseases that can't be prevented/treated
• Results with implications for family members
• Whole genome data to analyze a small portion
• Time and place of data storage
• Access to data: patient, physician, insurance companies, police
• When can it be used: identification of disaster victims, confirmation of citizenship


Sequencing methods

1954 Whitfeld PR. - Sequencing by degradation

1975-1977 W. Gilbert – A. Maxam (chemical modification); F. Sanger (chain termination)

Next generation

Cyclic array sequencing
• Illumina/Solexa
• Roche/454
• AB SOLiD
• Helicos/HeliScope

Sequencing by hybridization
• Affymetrix

Sequencing in real time (3rd generation)
• Oxford Nanopore Technologies
• Pacific Biosciences SMRT


Maxam-Gilbert sequencing

- Chemical modification of DNA (radiolabelling)

- Cleavage at specific bases (G, G+A, C, C+T)

- Size separation (gel electrophoresis)

- Autoradiography (X-ray film)

Reading the lanes:
Strong band in the 1st lane with a weaker band in the 2nd: A
Strong band in the 2nd lane with a weaker band in the 1st: G
Band in the 3rd and 4th lanes: C
Band only in the 4th lane: T
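The four lane-reading rules can be sketched as a small lookup. This is only an illustration: the numeric intensity encoding (0 = no band, 1 = weak, 2 = strong) and the function names are assumptions made for the sketch, not part of the original method description.

```python
# Band intensity per lane: 0 = no band, 1 = weak band, 2 = strong band.
# This numeric encoding is an assumption made for the sketch.
NONE, WEAK, STRONG = 0, 1, 2


def call_base(lane1, lane2, lane3, lane4):
    """Apply the four reading rules to one gel position."""
    if lane1 == STRONG and lane2 == WEAK:
        return "A"
    if lane2 == STRONG and lane1 == WEAK:
        return "G"
    if lane3 != NONE and lane4 != NONE:
        return "C"
    if lane4 != NONE and lane3 == NONE:
        return "T"
    return "N"  # band pattern not covered by the rules


def read_gel(positions):
    """Read a list of gel positions, ordered from shortest fragment to longest."""
    return "".join(call_base(*position) for position in positions)


# Hypothetical band pattern for four consecutive positions
gel = [(2, 1, 0, 0), (0, 0, 2, 2), (1, 2, 0, 0), (0, 0, 0, 2)]
print(read_gel(gel))
```

Each tuple holds the intensities of lanes 1 to 4 at one fragment length, so the example reads one base per tuple.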

Maxam AM, Gilbert W., A new method for sequencing DNA, Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4

PROS: Purified DNA can be used directly

CONS:
• Technically complex
• Use of hazardous chemicals
• Difficult to scale up


Sanger method

dNTP (deoxynucleotide) ddNTP (dideoxynucleotide)

Arthur Kornberg: DNA replication
Chain termination


Sanger method: labeled dNTP

Radioactively or fluorescently labeled dNTP

Reaction components: DNA template, polymerase, primer, dNTP, ddNTP

Gel lanes: A, C, T, G
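As a rough illustration of why four termination reactions are enough to read a sequence, the sketch below groups fragment lengths by the ddNTP that ended them and then reads the lanes in order of increasing length. The function names are hypothetical, the input is the newly synthesized strand, and lengths are counted from the end of the primer (an assumption made for simplicity).

```python
def termination_fragments(synthesized):
    """Group fragment lengths by the ddNTP that terminated the chain.

    A fragment of length n ends with the n-th base of the synthesized
    strand, so that length appears in the lane of that base.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)
    return lanes


def read_lanes(lanes):
    """Read the sequence by visiting the bands from shortest to longest."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = termination_fragments("GATTACA")
print(lanes)
print(read_lanes(lanes))
```

The second function mirrors what is done on the gel: the shortest fragment identifies the first base, the next shortest the second base, and so on.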


Sanger method: dye-labeled primer

Dye-labeled primer

PROS: Upon completion, these four reactions can be combined into one lane on a gel, and run on a machine that can scan the lanes with a laser

http://www.escience.ws/b572/L8/L8.htm


Sanger method: dye-labeled terminator

Dye-labeled terminator

PROS: Uses an optical system: faster, more economical, automated

Single reaction (a different dye for each nucleotide)


Large scale sequencing strategies

Sanger:
• Not practical for sequencing a complete genome
• Only about 1000 bases can be sequenced accurately
• A primer of known sequence is required

A privately funded sequencing project: Celera Genomics

The publicly funded HGP: NIH/DOE




Cyclic array sequencing

1.- DNA library preparation (ligation of adaptors)

2.- Amplification: emulsion PCR (ePCR), bridge PCR

3.- Sequencing reaction: polymerase-based, ligation-based, pyrosequencing

4.- Imaging

5.- Bioinformatics: image analysis, statistical measures, assembly …


Illumina/Solexa genome analyzer

http://www.illumina.com/media.ilmn?Title=Sequencing-Workflow-Video&Cap=&Img=spacer.gif&PageName=illumina%20sequencing%20technology&PageURL=203&Media=10

http://www.illumina.com/

Sequencing by synthesis

Detects the fluorescence of the added nucleotide at each position while synthesizing the complementary strand.

Reversible terminator

http://www.biotagebio.com/DynPage.aspx?id=7454


Pyrosequencing

Figure: pyrogram (C G T C C G G A) and the light-producing enzyme cascade: PPi → ATP (sulfurylase); luciferin → oxyluciferin (luciferase)

http://www.roche-applied-science.com/publications/multimedia/genome_sequencer/flx_multimedia/wbt.htm

Detects the activity of DNA polymerase with a chemiluminescent enzyme while synthesizing the complementary strand.
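A pyrogram can be sketched as follows: nucleotides are dispensed one at a time, and the light signal at each dispensation is proportional to the number of bases incorporated in that step. The function name, the cyclic dispensation order and the reading of the slide's letters as the synthesized strand are assumptions made for this illustration.

```python
def pyrogram(synthesized, dispensation):
    """Return (nucleotide, peak height) for each dispensed nucleotide.

    The peak height is the number of consecutive bases incorporated,
    so a homopolymer run such as CC gives a peak of height 2.
    """
    position = 0
    peaks = []
    for nucleotide in dispensation:
        run = 0
        while position < len(synthesized) and synthesized[position] == nucleotide:
            run += 1
            position += 1
        peaks.append((nucleotide, run))
    return peaks


# Strand being synthesized, with three cycles of A, C, G, T dispensations
peaks = pyrogram("CGTCCGGA", "ACGT" * 3)
for nucleotide, height in peaks:
    print(nucleotide, "#" * height)
```

A dispensation that does not match the next template position produces no light, which is why the printed pyrogram contains empty rows.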


Applied Biosystems / SOLiD™ System

http://appliedbiosystems.cnpg.com/Video/flatFiles/699/index.aspx

http://www3.appliedbiosystems.com/AB_Home/index.htm

Sequencing by ligation

Uses the enzyme DNA ligase to identify the nucleotide present at a given position in a DNA sequence.

2-base color encoding data

1 dye = 4 possible dinucleotides

2 bases are interrogated in each ligation reaction providing increased specificity
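The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes; the colors are shown as numbers 0-3 rather than dye names, and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode(sequence):
    """Translate a base sequence into colors, one per overlapping dinucleotide."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode(first_base, colors):
    """Rebuild the bases from a known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


def dinucleotides_for_color(color):
    """List the four dinucleotides that share one color."""
    return [a + b for a in CODE_BASE for b in CODE_BASE
            if BASE_CODE[a] ^ BASE_CODE[b] == color]


colors = encode("ATGGC")
print(colors)
print(decode("A", colors))
print(dinucleotides_for_color(0))
```

Because every base takes part in two overlapping dinucleotides, it is interrogated twice, and decoding needs one known starting base (in practice the last base of the adaptor).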



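Primer round 1, primer round 2, … (total of 5 primer rounds)

Re-sequencing

Figure: reference sequence (Ref seq), color-space reference (CS Ref), color-space reads (CS Reads), color-space consensus (CS consensus) and base-space consensus (BS consensus), marking a polymorphism and an error

Higher accuracy: built-in error checking capability; discrimination between measurement errors and SNPs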
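The discrimination between errors and SNPs follows from the 2-base encoding: a real single-base change alters two adjacent colors in a consistent way, whereas an isolated color mismatch points to a measurement error. A minimal sketch, assuming the XOR-based color code (A=0, C=1, G=2, T=3) and hypothetical function names:

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence):
    """Color of each overlapping dinucleotide (XOR of the 2-bit base codes)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Label color mismatches as ('snp', index) or ('error', index).

    Two adjacent mismatches count as a SNP only if the XOR of the two
    colors is unchanged, which is what a single-base substitution produces.
    """
    mismatches = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    calls = []
    i = 0
    while i < len(mismatches):
        pos = mismatches[i]
        nxt = mismatches[i + 1] if i + 1 < len(mismatches) else None
        if nxt == pos + 1 and (ref_colors[pos] ^ ref_colors[nxt]) == (
            read_colors[pos] ^ read_colors[nxt]
        ):
            calls.append(("snp", pos))
            i += 2
        else:
            calls.append(("error", pos))
            i += 1
    return calls


reference = encode("ACGTACGT")
snp_read = encode("ACGCACGT")          # T -> C substitution at the 4th base
error_read = [1, 3, 1, 3, 0, 3, 1]     # one miscalled color
print(classify_mismatches(reference, snp_read))
print(classify_mismatches(reference, error_read))
```

The reported index is the position of the first affected color, so the substituted base sits between that color and the next one.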


Helicos HeliScope™

http://www.helicosbio.com/Technology/TrueSingleMoleculeSequencing/tSMStradeHowItWorks/tabid/162/Default.aspx

http://www.helicosbio.com/Default.aspx?base

Sequencing by synthesis


Affymetrix
http://www.affymetrix.com/index.affx

Sequencing by hybridization

Microarray – DNA chip (non-enzymatic)

Figure: a chip of 16 three-base probes (TGC ATG CCC GTA / CTA CAA GAT AAA / GCG GGG TAG CAT / TGA TTC TTT CGT) hybridized with the sample ACGCATC; labels: probe, hybridization, image

1. DNA sample

2. Hybridization

3. Spectrum

4. Reconstruct the sequence: A C G C A T C

Drmanac R et al. Adv Biochem Eng Biotechnol. 2002





Sequencing by hybridization

Oligomers on chip = 4^(number of bases)

In our example: 3 bases = 4^3 = 64 oligomers

25 bases = 4^25 = 1,125,899,906,842,624 oligomers!

Probe: 5-25 bases

Probe overlap: each base is read by multiple probes (useful for SNP detection)

Limitations:
• Hybridization conditions are not homogeneous: melting temperature depends strongly on the GC/AT ratio
• Repeats
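The spectrum and reconstruction steps can be sketched for the example sequence ACGCATC with 3-base probes. This is a brute-force illustration with hypothetical function names, not the algorithm used by any particular platform; it lists the spectrum in terms of the sample's own 3-mers (the probes on the chip would be their complements).

```python
def spectrum(sequence, k):
    """Set of all k-mers present in the sequence (what the chip reports)."""
    return sorted({sequence[i:i + k] for i in range(len(sequence) - k + 1)})


def reconstruct(kmers):
    """Return every sequence that uses each k-mer once, overlapping by k-1 bases."""
    results = set()

    def extend(path, remaining):
        if not remaining:
            results.add(path[0] + "".join(kmer[-1] for kmer in path[1:]))
            return
        for i, kmer in enumerate(remaining):
            if path[-1][1:] == kmer[:-1]:
                extend(path + [kmer], remaining[:i] + remaining[i + 1:])

    for i, start in enumerate(kmers):
        extend([start], kmers[:i] + kmers[i + 1:])
    return sorted(results)


print(4 ** 3, 4 ** 25)          # number of possible 3-mers and 25-mers
probes = spectrum("ACGCATC", 3)
print(probes)
print(reconstruct(probes))
```

Because the spectrum is a set, a k-mer that occurs twice is recorded only once, which is one way to see why repeats are a problem for this approach.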


Pacific Biosciences / SMRT™ technology

http://www.pacificbiosciences.com/video_lg.html

http://www.pacificbiosciences.com/

Single Molecule Real Time
• Not commercially available
• Platform for single-molecule, real-time detection based on DNA polymerase activity


Oxford Nanopore™ Technologies

http://www.nanoporetech.com/sequences/index/34

http://www.nanoporetech.com/

Reads the sequence as a DNA strand transits through nanopores (transmembrane cellular proteins)

An applied voltage produces an electrical current through the nanopore

The amount of current is very sensitive to the size and shape of the nanopore.

Figure labels: G, T, A, C


More on sequencing methods …

VisiGen Biotechnologies, Inc
http://visigenbio.com/home.html
http://visigenbio.com/technology_movie_streaming.html

Intelligent BioSystems
http://www.intelligentbiosystems.com/index%20mod%201.html

Complete Genomics
http://www.completegenomics.com/


Sequencing and gene expression

Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression.

Expression in normal circumstances

altered state (?)

Identify and study the protein(s) coded by a gene
Identify gene (Genome bioinformatics)


Redundancy at GenBank

Many sequences are represented more than once in GenBank

A huge degree of redundancy

2003 RefSeq collection: curated secondary database
• non-redundant
• selected organisms

• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein


RefSeq vs GenBank

GenBank                                   RefSeq
Not curated                               Curated
Author submits                            NCBI creates from existing data
Only author can revise                    NCBI revises as new data emerge
Multiple records from same loci common    Single record for each molecule of major organisms
Records can contradict each other
No limit to species included              Limited to model organisms
Data exchange among INSDC members         Exclusive NCBI database
Akin to primary literature                Akin to review articles
Proteins identified and linked            Proteins and transcripts identified and linked
Access via NCBI Nucleotide db             Access via Nucleotide and Protein dbs

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook


RefSeq accession numbers
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

Curated
mRNA: NM_000000
RNA: NR_000000
Protein: NP_000000
Gene: NG_000000
Chromosome: NC_000000

Automated
Model mRNA: XM_000000
Model RNA: XR_000000
Model protein: XP_000000
Contig: NT_000000, NW_000000
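Since the two-letter prefix identifies the kind of record, an accession can be classified with a simple lookup. The sketch below covers only the prefixes listed on this slide; the function name and the use of `None` for anything else are choices made for the illustration.

```python
# Prefixes taken from the slide; other RefSeq prefixes are not handled here.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NR": "RNA",
    "NP": "protein",
    "NG": "gene",
    "NC": "chromosome",
    "NT": "contig",
    "NW": "contig",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def refseq_type(accession):
    """Return the record type for a RefSeq accession, or None if unrecognized."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # RefSeq accessions contain an underscore after the prefix
    return REFSEQ_PREFIXES.get(prefix.upper())


for accession in ["NM_000000", "XP_000000", "NW_000000", "AB123456"]:
    print(accession, refseq_type(accession))
```

The underscore check is a quick way to tell RefSeq accessions apart from GenBank-style accessions, which have no underscore.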


Trace Archive

2001: NCBI and EMBL/Ensembl

Purpose: collect raw data from sequencing centers worldwide

PERMANENT repository of single-pass reads (300-1,000 nt)

Hunt for polymorphisms in gene sequences

Insights into the impact of genetic variation on health

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?

2,1112,309,330 traces (2009-11-06)


Entrez


Refining search results
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Searching_Entrez_usi
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Searching_PubMed


Limits

Refine search results retrieve only the most relevant documents

Allow restriction of a search to a defined subset of the database


Refining search results


Index

Alphabetical lists of terms from searchable database fields

Used to browse and/or select the terms by which records and/or data are described


Refining search results


Search Field Descriptions and Qualifiers

Index search field    Qualifier
Accession             [ACCN] or [ACCESSION]
All Fields            [ALL] or [ALL FIELDS]
Author                [AUTH] or [AUTHOR]
EC/RN Number          [ECNO]
Feature Key           [FKEY]
Filter                [FILT] or [SB]
Gene Name             [GENE]
Issue                 [ISS] or [ISSUE]
Keyword               [KYWD] or [KEYWORD]
Journal Name          [JOUR] or [JOURNAL]
Modification Date     [MDAT]
Organism              [ORGN] or [ORGANISM]
Page Number           [PAGE]
Primary Accession     [PACC]
Properties            [PROP]
Protein Name          [PROT]
Publication Date      [PDAT]
SeqID String          [SQID]
Sequence Length       [SLEN]
Substance Name        [SUBS]
Text Word             [WORD]
Title                 [TITL]
Volume                [VOL]
Entrez date           [EDAT]
Journal title         [TA]
Language              [LA]
MeSH term             [MH]
Title/Abstract        [TIAB]

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.table.EntrezHelp.T7
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip


Advanced search statements

term [field] OPERATOR term [field]

Find all human nucleotide sequences with D-loop annotations:
D-loop[FKEY] AND human[ORGN] in Nucleotide database

Find Drosophila population studies published in the Journal of Molecular Evolution:
j mol evol[JOUR] AND drosophila[ORGN] in PopSet database
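The `term [field] OPERATOR term [field]` pattern is easy to assemble programmatically, which helps when the same search is later sent through the eUtils. A minimal sketch with hypothetical helper names; `quote_plus` from the Python standard library does the URL encoding.

```python
from urllib.parse import quote_plus


def clause(term, field):
    """One qualified search term, e.g. human[ORGN]."""
    return f"{term}[{field}]"


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(clauses)


dloop_query = combine("AND", clause("D-loop", "FKEY"), clause("human", "ORGN"))
popset_query = combine("AND", clause("j mol evol", "JOUR"), clause("drosophila", "ORGN"))

print(dloop_query)
print(popset_query)
print(quote_plus(dloop_query))  # form suitable for a term= URL parameter
```

The encoded form replaces spaces with `+` and the square brackets with `%5B` and `%5D`, so the statement can be placed in a URL unchanged in meaning.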


History

Provides a record of the searches performed during a search session.

Database specific

Lost after eight hours of inactivity

Used to review, revise, or combine the results of earlier searches.


Combining results


Query translation


Details

Display your search strategy as translated using Entrez's search and syntax rules

Error messages, when applicable


Author search


Example - author


Example - journal


eUtils: Entrez Programming Utilities

• Tools that provide access to Entrez data outside of the regular web query interface
• Set of 8 server-side programs
• Helpful for retrieving search results (manipulated in another environment)
• Perl, Python, Java, and C++
• Currently includes 35 databases

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

ESearch

ESummary

EGQuery

EInfo

EFetch

ELink

EPost

ESpell

• Perform searches on large datasets
• Implement data pipelines for genomic, proteomic, or microarray analysis
• Create automated searches to keep local databases current
• Create and download customized datasets
• Seamlessly combine local data with NCBI data
• Develop a focused interface to NCBI data

URL Result(XML)


Common Entrez Engine

Assemble a list of UIDs:
• ESearch (for a given db)
• EGQuery (global version, all dbs)

Retrieve a brief summary record (DocSum):
• ESummary (for a list of UIDs)


URL

http://www.ncbi.nlm.nih.gov/sites/gquery?term=cancer+stem+cells

[Base_URL][Eutils_URL][Query]

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml

[Base_URL][Eutils_URL][DB][Query]
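The URL anatomy can be sketched as a small builder that concatenates the base URL, the utility name and the encoded parameters. The function name is hypothetical, and the base URL is copied from the slide (current NCBI documentation uses https for the same path).

```python
from urllib.parse import urlencode

# Base URL as shown on the slide
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, **params):
    """Assemble [Base_URL][Eutils_URL][DB][Query] for one eUtils call."""
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


summary_url = eutils_url("esummary", db="taxonomy", id=9913, retmode="xml")
spell_url = eutils_url("espell", db="pubmed", term="brest cancer")
print(summary_url)
print(spell_url)
```

`urlencode` keeps the parameters in the order given and turns the space in a term into `+`, so the first call reproduces the ESummary URL shown above.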


URL: DB

Entrez Database    E-Utility Database Name
3D Domains         domains
Domains            cdd
Genome             genome
Nucleotide         nucleotide
OMIM               omim
PopSet             popset
Protein            protein
ProbeSet           geo
PubMed             pubmed
Structure          structure
SNP                snp
Taxonomy           taxonomy
UniGene            unigene
UniSTS             unists

Each Entrez DB has an E-Utility name (used instead of its original name)

eSearch = [Base_URL][Eutils_URL][DB][Query]


URL: Query

Utilities (columns): EFetch (Tax, Seq, Lit), EGQuery, ESpell, EInfo, ESearch, ESummary, ELink, EPost

db X X X X X X X X X

term X X X X

field X

reldate X X

mindate X X

maxdate X X

datetype X X

retstart X X X X

retmax X X X X

retmode X X X X X X

rettype X X X X

history X X X X X X X

WebEnv X X X X X X X

query_key X X X X X X X

id X X X X X X

report X

strand X

seq_start X

seq_stop X

dbfrom X

cmd X

eSearch = [Base_URL][Eutils_URL][DB][Query]


ESpell

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=brest+cancer

Retrieves spelling suggestions when available

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?

Only PubMed


EInfo

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

Provides detailed information about a given database: term counts, last update and available links

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?


EGQuery

Provides Entrez database counts in XML for a single search using GQuery

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2&rettype=html


ESummary

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml

retmode: xml, ref, html, text, asn.1

Retrieves DocSums from a list of primary IDs

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?


ELink

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=protein&id=7140346

Checks for the existence of an external or Related Articles link from a list of UIDs
Retrieves IDs related to a list of UIDs (same db, external db)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?


ELink

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10611131&retmode=ref&cmd=prlinks

Creates a hyperlink to the primary LinkOut provider for a specific ID
Lists LinkOut URLs and attributes for multiple IDs


ESearch

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=10

Returns a list of matching UIDs (text search) in a given Entrez database

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?

datetype: edat, mdat, pdat


EFetch

Generates formatted output for a list of input IDs:
• abstracts from PubMed
• FASTA format from Protein

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?

DBs:
• Literature databases: PubMed, Journals, PubMed Central, OMIM
• Sequence and other molecular biology databases: Nucleotide, Protein, Gene, etc.
• Taxonomy


EFetch - Literature

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12091962,9997&retmode=html&rettype=abstract


Rettype

Rettype     Scope              Description
count       PubMed             Hit counts
uilist      all                Default format for viewing hits
sort        PubMed and Gene
abstract    PubMed
citation    PubMed
medline     PubMed
full        PubMed
native      all                Default format for viewing sequences
fasta       sequence           FASTA view of a sequence
gb          nucleotide         GenBank view for sequences
est         dbEST              EST report
gp          protein            GenPept view
seqid       sequence           Converts a list of GIs into a list of seqids
acc         sequence           Converts a list of GIs into a list of accessions
chr         dbSNP only         SNP chromosome report


EFetch - Sequences

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=fasta

strand: 1 (+), 2 (-)


EFetch - Taxonomy

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=44689&report=docsum

report: uilist, brief, docsum, xml


Exercise

Search in Journals for the term obstetrics:

In PubMed, display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type:

From Entrez Gene, display the GenomeID 2 as xml:

Retrieve PubMed related articles for protein 61742829 with a publication date from 1995 to the present:


Combining eUtils calls

The eUtils are useful when used by themselves in single URLs; however, their full potential is reached when successive eUtils URLs are combined to create a data pipeline.

• Retrieving data records matching an Entrez query

ESearch → ESummary
ESearch → EFetch

• Finding IDs linked to records matching an Entrez query

ESearch → ELink

• Retrieving data records in database B linked to records in database A matching an Entrez query

ESearch → ELink → ESummary
ESearch → ELink → EFetch
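An ESearch → EFetch pipeline can be sketched as two chained URL calls, where the second call reuses the history identifiers returned by the first. In this illustration the HTTP request is passed in as a `fetch` function so the flow can be followed offline; `fake_fetch`, its canned XML and the `EXAMPLE_WEBENV` value are stand-ins invented for the sketch, not real server output.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_then_efetch(db, term, fetch, rettype="fasta", retmode="text"):
    """Run ESearch with history, then EFetch the stored result set."""
    search_url = BASE_URL + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"})
    search_xml = fetch(search_url)
    # Pull the history identifiers out of the ESearch XML
    query_key = re.search(r"<QueryKey>(\d+)</QueryKey>", search_xml).group(1)
    web_env = re.search(r"<WebEnv>(\S+)</WebEnv>", search_xml).group(1)
    fetch_url = BASE_URL + "efetch.fcgi?" + urlencode(
        {"db": db, "query_key": query_key, "WebEnv": web_env,
         "rettype": rettype, "retmode": retmode})
    return fetch(fetch_url)


requested = []


def fake_fetch(url):
    """Offline stand-in for an HTTP GET; records the URL and returns canned text."""
    requested.append(url)
    if "esearch.fcgi" in url:
        return ("<eSearchResult><QueryKey>1</QueryKey>"
                "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>")
    return "(EFetch output would appear here)"


result = esearch_then_efetch("protein", "factor ix human", fake_fetch)
print("\n".join(requested))
print(result)
```

For real use, `fetch` could be any function that takes a URL and returns the response body as text; the ELink variants of the pipeline would add one more stage between the two calls.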


A Perl example

TASK: Retrieve protein sequences of factor IX in FASTA format

use strict;
use warnings;

my $Base_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";

my $esearch_URL = "esearch.fcgi?";

my $DB = "db=protein&";

# Spaces in the query must be URL-encoded as "+"
my $Query = "term=factor+ix+human";

my $esearch_Parameters = "retmax=1&usehistory=y&";

# Full ESearch URL: base + utility + database + parameters + query
my $E_search = "$Base_URL$esearch_URL$DB$esearch_Parameters$Query";

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&retmax=1&usehistory=y&term=factor+ix+human

ESearch → EFetch


Output from ESearch


QueryKey - WebEnv

$WebEnv: cookie value (an encoded server address) used with EFetch in place of the primary ID result list

$QueryKey: history search number (label) that corresponds to a UID list for subsequent search strategies
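Both values are ordinary elements of the ESearch XML output, so they can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample document is hand-written for illustration, with placeholder values rather than real server output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESearch result; values are placeholders.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>1</RetMax>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
  <IdList><Id>000000</Id></IdList>
</eSearchResult>"""


def history_from_esearch(xml_text):
    """Return (QueryKey, WebEnv) from ESearch output obtained with usehistory=y."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


query_key, web_env = history_from_esearch(SAMPLE_ESEARCH_XML)
print(f"query_key={query_key}&WebEnv={web_env}")
```

The printed string is the fragment that a later EFetch or ESummary URL would carry instead of an explicit list of IDs.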


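A Perl example

use LWP::Simple;   # provides get()

# Run the ESearch and pull the history identifiers out of its XML output
my $esearch_result = get($E_search);
my ($QueryKey) = $esearch_result =~ m|<QueryKey>(\d+)</QueryKey>|;
my ($WebEnv) = $esearch_result =~ m|<WebEnv>(\S+)</WebEnv>|;

my $efetch_URL = "efetch.fcgi?";

my $efetch_Parameters = "rettype=fasta&retmode=text&query_key=$QueryKey&WebEnv=$WebEnv";

my $E_fetch = "$Base_URL$efetch_URL$DB$efetch_Parameters";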

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&query_key=1&WebEnv=0ujfmXBW0U0hNr3FjaUutLkz1bR-NnJ9kp5vybL3u1AbTQdD7uMETHEtG5N@1EE047D172B3B8D0_0015SID

ESearch → EFetch

TASK: Retrieve protein sequences of factor IX in FASTA format


Output from EFetch
