Databases in bioinformatics II
Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology
Institute of Biomedicine
BIOINFORMATICS AND SYSTEMS BIOLOGY, MSC PROGR Sequence analysis, UMF018, 2009
Overview
– Genome sequencing
– Sequencing methods
• Sanger, Maxam
• Next generation methods (2nd, 3rd)
• Uses
• Implications
– RefSeq vs GenBank
– TraceArchive
– Refining searches at Entrez
– eUtils (programming utilities)
Organization of GenBank
Divisions make it possible to query specific subsets (e.g. data from a particular technique) and to interpret data from a proper biological point of view

Traditional divisions: Direct Submissions (Sequin and BankIt); accurate, well characterized
PRI Primate
ROD Rodent
MAM Mammalian
VRT Other Vertebrate
INV Invertebrate
PLN Plant and Fungal
BCT Bacterial and Archaeal
VRL Viral
PHG Phage
SYN Synthetic (cloning vectors)
UNA Unannotated

Bulk divisions: Batch Submission (Email and FTP); inaccurate, poorly characterized
EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
STS Sequence Tagged Site
HTC High Throughput cDNA
PAT Patent
WGS Whole Genome Shotgun
ENV Environmental Samples
CON Constructed sequences
Benson DA, et al. 2008. Nucleic Acids Research
Why Sequencing Genomes
Organisms are remarkably similar at the molecular level despite their obvious outward differences
Genes with similar DNA sequences tend to perform similar functions
By understanding the function of a gene in one organism, we may get an idea of what function that gene performs in a more complex organism (e.g. humans)
Applied to various fields: medicine, biological engineering, forensics, etc.
Archon X Prize
"the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."
Prize: $10 million
HGP: 1993; 1st draft 2000; final 2003 ($3 billion)
2007: James Watson, Craig Venter; $2 million in 2 months
2008: $60,000-$100,000 in 4 weeks
Personalized medicine
Deep sequencing: mutations, cancer genetics, pharmacogenetics
Lab computer infrastructure: data storage, data transfer capacity
Training lab personnel
Statistical methods: incidental discoveries, uncertain clinical significance
Ethics
• Consent (children, incompetent adults)
• Results with uncertain clinical significance
• Prediction of serious diseases that can't be prevented/treated
• Results with implications for family members
• Whole genome data to analyze a small portion
• Time and place of data storage
• Access to data: patient, physician, insurance companies, police
• When can it be used: identification of disaster victims, confirmation of citizenship
Sequencing methods
1954 Whitfeld PR. - Sequencing by degradation
1975-1977 W. Gilbert – A. Maxam (chemical modification); F. Sanger (chain termination)
Next generation
Cyclic array sequencing: Illumina/Solexa, Roche/454, AB SOLiD, Helicos/HeliScope
Sequencing by hybridization: Affymetrix
Sequencing in real time (3rd generation): Oxford Nanopore Technologies, Pacific Biosciences SMRT
Maxam-Gilbert sequencing
- Chemical modification of DNA (radiolabelling)
- Cleavage at specific bases (G, G+A, C, C+T)
- Size separation (gel electrophoresis)
- Autoradiography (X-ray film)
Reading the four lanes:
Strong band in the 1st lane with a weaker band in the 2nd: A
Strong band in the 2nd lane with a weaker band in the 1st: G
Band in the 3rd and 4th lanes: C
Band only in the 4th lane: T
Maxam AM, Gilbert W., A new method for sequencing DNA, Proc Natl Acad Sci U S A. 1977 Feb;74(2):560-4
PROS: Purified DNA could be used directly
CONS: Technically complex; use of hazardous chemicals; difficult to scale up
Sanger method
dNTP (deoxynucleotide) ddNTP (dideoxynucleotide)
Arthur Kornberg: DNA replication
Chain termination
Sanger method: labeled dNTP
Radio-/fluorescently labeled dNTP
Reaction mix: DNA template, polymerase, primer, dNTP, ddNTP
(Figure: gel with lanes A, C, T, G)
Sanger method: dye-labeled primer
Dye-labeled primer
PROS: Upon completion, these four reactions can be combined into one lane on a gel, and run on a machine that can scan the lanes with a laser
http://www.escience.ws/b572/L8/L8.htm
Sanger method: dye-labeled terminator
Dye-labeled terminator
PROS: Uses an optical system: faster, more economical, automated
Single reaction (a different dye for each nucleotide)
Large scale sequencing strategies
Sanger: not practical for sequencing a complete genome
• Only about 1000 bases can be sequenced accurately
• A primer of known sequence is required
A privately funded sequencing project: Celera Genomics
The publicly funded HGP: NIH/DOE
Cyclic array sequencing
1.- DNA library preparation (ligation of adaptors)
2.- Amplification: emulsion PCR (ePCR) or bridge PCR
3.- Sequencing reaction: polymerase-based, ligation-based, pyrosequencing
4.- Imaging
5.- Bioinformatics: image analysis, statistical measures, assembly …
Illumina/Solexa genome analyzer
http://www.illumina.com/media.ilmn?Title=Sequencing-Workflow-Video&Cap=&Img=spacer.gif&PageName=illumina%20sequencing%20technology&PageURL=203&Media=10
http://www.illumina.com/
Sequencing by synthesis
Detects the fluorescence of the added nucleotide at each position while synthesizing the complementary strand.
Reversible terminator
http://www.biotagebio.com/DynPage.aspx?id=7454
Pyrosequencing
(Figure: enzyme cascade in which sulfurylase converts PPi to ATP and luciferase converts luciferin to oxyluciferin; pyrogram for the sequence C G T C C G G A)
http://www.roche-applied-science.com/publications/multimedia/genome_sequencer/flx_multimedia/wbt.htm
Detects the activity of DNA polymerase with a chemiluminescent enzyme while synthesizing the complementary strand.
Applied Biosystems / SOLiD™ System
http://appliedbiosystems.cnpg.com/Video/flatFiles/699/index.aspx
http://www3.appliedbiosystems.com/AB_Home/index.htm
Sequencing by ligation
Uses the enzyme DNA ligase to identify the nucleotide present at a given position in a DNA sequence.
2-base color encoding data
1 dye = 4 possible dinucleotides
2 bases are interrogated in each ligation reaction providing increased specificity
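The 2-base color encoding can be illustrated with a small sketch. It assumes the commonly described scheme in which each base gets a 2-bit code and the color of a dinucleotide is the XOR of the two codes; the function names are hypothetical and are not part of any vendor software.

```python
# Sketch of 2-base color encoding. Assumption: color = XOR of 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colors(sequence):
    """Return one color (0-3) for every pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the list of colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


def dinucleotides_for(color):
    """List the dinucleotides that share a given color."""
    return sorted(a + b for a in BASE_CODE for b in BASE_CODE
                  if BASE_CODE[a] ^ BASE_CODE[b] == color)


def changed_positions(colors_a, colors_b):
    """Indexes at which two color reads of equal length differ."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]
```

Under this scheme a true single-base change alters two adjacent colors, while an isolated color change is more likely a measurement error, which is the basis of the error checking described for re-sequencing.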
Sequencing by ligation
Primer round 1
Primer round 2
Total of 5 primer rounds
Sequencing by ligation
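Re-sequencing
(Figure: reference sequence (Ref seq), color-space reference (CS Ref), color-space reads (CS Reads), color-space consensus (CS consensus) and base-space consensus (BS consensus), with one polymorphism and one error marked)
Higher accuracy: the built-in error checking capability allows discrimination between measurement errors and SNPs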
Helicos HeliScope™
http://www.helicosbio.com/Technology/TrueSingleMoleculeSequencing/tSMStradeHowItWorks/tabid/162/Default.aspx
http://www.helicosbio.com/Default.aspx?base
Sequencing by synthesis
Affymetrix
http://www.affymetrix.com/index.affx
Sequencing by hybridization
Microarray – DNA chip (non-enzymatic)
Hybridization
Probe
Image
Sequencing by hybridization
1. DNA sample: many copies of ACGCATC
2. Hybridization to a chip of 3-mer probes:
TGC ATG CCC GTA
CTA CAA GAT AAA
GCG GGG TAG CAT
TGA TTC TTT CGT
3. Spectrum (the probes that hybridize)
4. Reconstruct the sequence: A C G C A T C
Drmanac R et al. Adv Biochem Eng Biotechnol. 2002
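The four steps can be sketched in a few lines. This is a simplified illustration, not the method of the cited paper: it treats the spectrum as the set of 3-mers present in the sample itself (ignoring that chip probes are complementary to the target) and assumes every k-mer occurs once, so repeats make it fail. The function names are hypothetical.

```python
def spectrum(sequence, k=3):
    """Return the set of all k-mers contained in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(kmers):
    """Rebuild a sequence by chaining k-mers that overlap by k-1 bases."""
    kmers = set(kmers)
    # The starting k-mer is the one whose prefix is no other k-mer's suffix.
    starts = [k for k in kmers
              if not any(o != k and o[1:] == k[:-1] for o in kmers)]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting k-mer")
    sequence = starts[0]
    overlap_len = len(sequence) - 1
    remaining = kmers - {sequence}
    while remaining:
        overlap = sequence[-overlap_len:]
        candidates = [k for k in remaining if k[:-1] == overlap]
        if len(candidates) != 1:
            raise ValueError("ambiguous or missing overlap (e.g. a repeat)")
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])
    return sequence
```

For the slide's sample, `reconstruct(spectrum("ACGCATC"))` chains ACG, CGC, GCA, CAT and ATC back into the original sequence.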
Sequencing by hybridization
ACC GCG CCT CCA CCG TCC GCC CTC
Sequencing by hybridization
Oligomers in chip = 4^(# bases)
In our example: 3 bp = 64 oligomers
25 bases = 1,125,899,906,842,624 oligomers!
Probe: 5-25 bases
Probe overlap: each base is read by multiple probes (SNP detection)
Hybridization conditions are not homogeneous: melting temperature depends strongly on the GC/AT ratio
Repeats
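The probe counts follow directly from the four-letter alphabet, and the dependence of melting temperature on base composition can be illustrated with the Wallace rule of thumb for short oligomers. The Wallace rule is an assumption brought in for illustration and does not come from the slides; the function names are hypothetical.

```python
def oligomers_on_chip(probe_length):
    """Number of distinct probes of a given length over A, C, G, T."""
    return 4 ** probe_length


def wallace_tm(probe):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc
```

Two probes of the same length can therefore have quite different melting temperatures, which is why a single set of hybridization conditions does not suit every probe on a chip.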
Pacific Biosciences / SMRT™ technology
http://www.pacificbiosciences.com/video_lg.html
http://www.pacificbiosciences.com/
Single Molecule Real Time
Not commercially available
Platform for single molecule real time detection based on DNA polymerase activity.
Oxford Nanopore™ Technologies
http://www.nanoporetech.com/sequences/index/34
http://www.nanoporetech.com/
Reads the sequence as a DNA strand transits through nanopores (transmembrane cellular proteins)
Voltage → electrical current
Amount of current is very sensitive to the size and shape of the nanopore.
(Figure labels: G, T, A, C)
More on sequencing methods …
VisiGen Biotechnologies, Inc: http://visigenbio.com/home.html
http://visigenbio.com/technology_movie_streaming.html
Intelligent BioSystems: http://www.intelligentbiosystems.com/index%20mod%201.html
Complete Genomics: http://www.completegenomics.com/
Sequencing and gene expression
Although important goals of any sequencing project may be to obtain a genomic sequence and identify a complete set of genes, the ultimate goal is to gain an understanding of when, where, and how a gene is turned on, a process commonly referred to as gene expression.
Expression in normal circumstances
altered state (?)
Identify and study the protein(s) coded by a gene
Identify gene (Genome bioinformatics)
Redundancy at GenBank
Many sequences are represented more than once in GenBank
High degree of redundancy
2003: RefSeq collection: curated secondary database, non-redundant, selected organisms
• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein
RefSeq vs GenBank
GenBank                                   RefSeq
Not curated                               Curated
Author submits                            NCBI creates from existing data
Only author can revise                    NCBI revises as new data emerge
Multiple records from same loci common    Single records for each molecule of major organisms
Records can contradict each other
No limit to species included              Limited to model organisms
Data exchange among INSDC members         Exclusive NCBI database
Akin to primary literature                Akin to review articles
Proteins identified and linked            Proteins and transcripts identified and linked
Access via NCBI Nucleotide db             Access via Nucl. and Protein db
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook
RefSeq accession numbers
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook
Curated:
mRNA: NM_000000
RNA: NR_000000
Protein: NP_000000
Gene: NG_000000
Automated:
Model mRNA: XM_000000
Model RNA: XR_000000
Model protein: XP_000000

Contig: NT_000000, NW_000000
Chromosome: NC_000000
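Because the molecule type is encoded in the two-letter prefix, an accession can be classified with a simple lookup. The sketch below covers only the prefixes listed on this slide; the table and function names are hypothetical.

```python
# Prefix table taken from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NR": "RNA",
    "NP": "protein",
    "NG": "gene",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
    "NT": "contig",
    "NW": "contig",
    "NC": "chromosome",
}


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, separator, number = accession.partition("_")
    if not separator or not number:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())
```

A GenBank-style accession without an underscore is returned as `None` rather than guessed.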
Trace Archive
2001: NCBI and EMBL/ENSEMBL
Purpose: collect raw data from sequencing centers worldwide
PERMANENT repository of single-pass reads (300-1,000 nt)
Hunt for polymorphisms in gene sequences
Insights into the impact of genetic variation on health
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?
2,1112,309,330 traces (2009-11-06)
Entrez
Refining search results
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.section.EntrezHelp.Searching_Entrez_usi
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Searching_PubMed
Limits
Refine search results to retrieve only the most relevant documents
Allow restriction of a search to a defined subset of the database
Index
Alphabetical lists of terms from searchable database fields
Used to browse and/or select the terms by which records and/or data are described
Search Field Descriptions and Qualifiers
Index search field Qualifier
Accession [ACCN] or [ACCESSION]
All Fields [ALL] or [ALL FIELDS]
Author [AUTH] or [AUTHOR]
EC/RN Number [ECNO]
Feature Key [FKEY]
Filter [FILT] or [SB]
Gene Name [GENE]
Issue [ISS] or [ISSUE]
Keyword [KYWD] or [KEYWORD]
Journal Name [JOUR] or [JOURNAL]
Modification Date [MDAT]
Organism [ORGN] or [ORGANISM]
Page Number [PAGE]
Primary Accession [PACC]
Properties [PROP]
Protein Name [PROT]
Publication Date [PDAT]
SeqID String [SQID]
Sequence Length [SLEN]
Substance Name [SUBS]
Text Word [WORD]
Title [TITL]
Volume [VOL]
Entrez date [EDAT]
Journal title [TA]
Language [LA]
MeSH term [MH]
Title/Abstract [TIAB]
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.table.EntrezHelp.T7
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.pubmedhelp.Search_Field_Descrip
Advanced search statements
term [field] OPERATOR term [field]
Find all human nucleotide sequences with D-loop annotations:
D-loop[FKEY] AND human[ORGN] in Nucleotide database
Find Drosophila population studies published in the Journal of Molecular Evolution:
j mol evol[JOUR] AND drosophila[ORGN] in PopSet database
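The `term [field] OPERATOR term [field]` pattern is easy to assemble programmatically. The helper names below are hypothetical; only `urllib.parse.quote_plus` from the Python standard library is used, to show how such a statement would be encoded for use in a URL.

```python
from urllib.parse import quote_plus


def field_term(term, qualifier):
    """Attach a field qualifier such as ORGN or FKEY to a search term."""
    return f"{term}[{qualifier}]"


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(clauses)


# The first example from the slide, built from its parts.
d_loop_query = combine("AND",
                       field_term("D-loop", "FKEY"),
                       field_term("human", "ORGN"))
encoded_query = quote_plus(d_loop_query)  # brackets and spaces made URL-safe
```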
History
Provides a record of the searches performed during a search session.
Database specific
Lost after eight hours of inactivity
Used to review, revise, or combine the results of earlier searches.
Combining results
Query translation
Details
Display your search strategy as translated using Entrez's search and syntax rules
Error messages, when applicable
Author search
Example - author
Example - journal
eUtils: Entrez Programming Utilities
• Tools that provide access to Entrez data outside of the regular web query interface
• Set of 8 server-side programs
• Helpful for retrieving search results (to be manipulated in another environment)
• Perl, Python, Java, and C++
• Currently includes 35 databases
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
ESearch, ESummary, EGQuery, EInfo, EFetch, ELink, EPost, ESpell
• Perform searches on large datasets
• Implement data pipelines for genomic, proteomic, or microarray analysis
• Create automated searches to keep local databases current
• Create and download customized datasets
• Seamlessly combine local data with NCBI data
• Develop a focused interface to NCBI data
URL → Result (XML)
Common Entrez Engine
Assemble a list of UIDs
ESearch (for a given db)
EGQuery (global version, all db)
Retrieve a brief summary record (DocSum)
ESummary (for a list of UIDs)
URL
http://www.ncbi.nlm.nih.gov/sites/gquery?term=cancer+stem+cells
[Base_URL][Eutils_URL][Query]
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=taxonomy&id=9913&retmode=xml
[Base_URL][Eutils_URL][DB][Query]
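The URL anatomy above can be expressed as a small builder. This is a minimal sketch: `eutils_url` is a hypothetical helper name, the base URL is the one shown on the slide, and the function only assembles the string without contacting NCBI.

```python
from urllib.parse import urlencode

# [Base_URL] as shown on the slide.
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, **params):
    """Assemble [Base_URL][Eutils_URL][DB][Query] for one utility.

    utility is the program name without extension, e.g. "esummary";
    params become the query string in the order they are given.
    """
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


summary_url = eutils_url("esummary", db="taxonomy", id=9913, retmode="xml")
```

Passing the parameters as keyword arguments keeps the database and query parts separate until the last step, which mirrors the bracketed layout on the slide.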
URL: DB
Entrez Database E-Utility Database Name
3D Domains domains
Domains cdd
Genome genome
Nucleotide nucleotide
OMIM omim
PopSet popset
Protein protein
ProbeSet geo
PubMed pubmed
Structure structure
SNP snp
Taxonomy taxonomy
UniGene unigene
UniSTS unists
Each Entrez DB has an E-Utility name (used instead of its original name)
eSearch = [Base_URL][Eutils_URL][DB][Query]
URL: Query
Parameters (rows) by utility (columns): EFetch (Tax, Seq, Lit), EGQuery, ESpell, EInfo, ESearch, ESummary, ELink, EPost
db X X X X X X X X X
term X X X X
field X
reldate X X
mindate X X
maxdate X X
datetype X X
retstart X X X X
retmax X X X X
retmode X X X X X X
rettype X X X X
history X X X X X X X
WebEnv X X X X X X X
query_key X X X X X X X
id X X X X X X
report X
strand X
seq_start X
seq_stop X
dbfrom X
cmd X
ESpell
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?db=pubmed&term=brest+cancer
Retrieves spelling suggestions when available
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?
Only PubMed
EInfo
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
Provides detailed information about a given database: term counts, last update and available links
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?
EGQuery
Provides Entrez database counts in XML for a single search using GQuery
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2&rettype=html
ESummary
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml
xml, ref, html, text, asn.1
Retrieves DocSums from a list of primary IDs
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?
ELink
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nuccore&db=protein&id=7140346
Checks for the existence of an external or Related Articles link from a list of UIDs
Retrieves IDs related to a list of UIDs (same db, external db)
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
ELink
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=10611131&retmode=ref&cmd=prlinks
Creates a hyperlink to the primary LinkOut provider for a specific ID
Lists LinkOut URLs and attributes for multiple IDs.
ESearch
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&reldate=60&datetype=edat&retmax=10
Returns a list of matching UIDs (text search) in a given Entrez database
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
datetype: edat, mdat, pdat
EFetch
Generates formatted output for a list of input IDs: abstracts from PubMed, FASTA format from Protein
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
DBs:
Literature databases: PubMed, Journals, PubMed Central, OMIM
Sequence and other molecular biology databases: Nucleotide, Protein, Gene, etc.
Taxonomy
EFetch - Literature
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12091962,9997&retmode=html&rettype=abstract
Rettype
Rettype scope Description
count PubMed Hits counts
uilist all Default format for viewing hits
sort PubMed and gene
abstract PubMed
citation PubMed
medline PubMed
full PubMed
native all Default format for viewing sequences
fasta sequence FASTA view of a sequence
gb nucleotide GenBank view for sequences
est dbEST EST Report.
gp protein GenPept view
seqid sequence To convert list of gis into list of seqids.
acc sequence To convert list of gis into list of accessions
chr dbSNP only SNP Chromosome Report.
EFetch - Sequences
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=fasta
strand: 1 (+), 2 (-)
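A sequence request with the optional `strand`, `seq_start` and `seq_stop` parameters from the parameter table could be assembled as follows. This is a sketch that only builds the URL; `efetch_sequence_url` is a hypothetical helper name and the validation rule simply mirrors the two strand values listed above.

```python
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_sequence_url(db, uid, rettype="fasta",
                        strand=None, seq_start=None, seq_stop=None):
    """Build an EFetch URL for one sequence record."""
    if strand not in (None, 1, 2):
        raise ValueError("strand must be 1 (+) or 2 (-)")
    params = {"db": db, "id": uid, "rettype": rettype}
    optional = {"strand": strand, "seq_start": seq_start, "seq_stop": seq_stop}
    # Only send the optional parameters that were actually supplied.
    params.update({k: v for k, v in optional.items() if v is not None})
    return BASE_URL + "efetch.fcgi?" + urlencode(params)
```

Called as `efetch_sequence_url("nucleotide", 5)` it reproduces the example URL on this slide.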
EFetch - Taxonomy
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=44689&report=docsum
uilist, brief, docsum, xml
Exercise
Search in Journals for the term obstetrics:
In PubMed, display PMIDs 12091962 and 9997 in html retrieval mode and abstract retrieval type:
From Entrez Gene, display as xml the GenomeID 2:
To retrieve PubMed related articles for protein 61742829 with a publication date from 1995 to the present:
Combining eUtils calls
The eUtils are useful when used by themselves in single URLs; however their full potential is reached when successive eUtils URLs are combined to create a data pipeline
• Retrieving data records matching an Entrez query
ESearch → ESummary
ESearch → EFetch
• Finding IDs linked to records matching an Entrez query
ESearch → ELink
• Retrieving data records in database B linked to records in database A matching an Entrez query
ESearch → ELink → ESummary
ESearch → ELink → EFetch
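As an illustration of the first pipeline, the sketch below chains ESearch → ESummary by passing the UIDs explicitly. It assumes the ESearch XML lists each hit in an `<Id>` element; `search_then_summarize` and `http_get` are hypothetical names, and the `get` argument can be swapped for a stub so the logic can be tried without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url):
    """Default transport: fetch a URL and return the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def search_then_summarize(db, term, retmax=10, get=http_get):
    """ESearch for UIDs, then ESummary for their DocSums (XML text)."""
    search_url = BASE_URL + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})
    # Assumption: every matching UID appears as an <Id> element.
    uids = [node.text for node in ET.fromstring(get(search_url)).iter("Id")]
    if not uids:
        return None  # nothing matched, so there is nothing to summarize
    summary_url = BASE_URL + "esummary.fcgi?" + urlencode(
        {"db": db, "id": ",".join(uids), "retmode": "xml"}, safe=",")
    return get(summary_url)
```

The other pipelines follow the same shape: each step turns the previous XML result into the parameters of the next URL.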
a PERL example
TASK: Retrieve protein sequences of the factor IX in fasta format
ESearch → EFetch
use strict;
use warnings;
# Build the ESearch URL piece by piece
my $Base_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/";
my $esearch_URL = "esearch.fcgi?";
my $DB = "db=protein&";
# Spaces are not valid in a URL, so the query terms are joined with "+"
my $Query = "term=factor+ix+human";
my $esearch_Parameters = "retmax=1&usehistory=y&";
my $E_search = "$Base_URL$esearch_URL$DB$esearch_Parameters$Query";
Resulting URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&retmax=1&usehistory=y&term=factor+ix+human
Output from ESearch
QueryKey - WebEnv
$WebEnv: cookie value (encoded server address) used with EFetch in place of the primary ID result list
$QueryKey: value used for a history search number (label); it corresponds to a UID list for subsequent search strategies
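Extracting the two history values and reusing them can be sketched as follows. It assumes the ESearch XML carries them in `<QueryKey>` and `<WebEnv>` elements directly under the root; the function names and the sample WebEnv string used with them are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL as used in the deck's Perl example.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"


def history_handles(esearch_xml):
    """Return (QueryKey, WebEnv) from ESearch output run with usehistory=y."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def efetch_from_history_url(db, query_key, web_env,
                            rettype="fasta", retmode="text"):
    """Build an EFetch URL that refers to the stored result list."""
    params = {"db": db, "rettype": rettype, "retmode": retmode,
              "query_key": query_key, "WebEnv": web_env}
    return BASE_URL + "efetch.fcgi?" + urlencode(params)
```

Using the history values avoids sending a long list of UIDs back to the server in the second request.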
a PERL example
TASK: Retrieve protein sequences of the factor IX in fasta format
ESearch → EFetch
# $Base_URL and $DB are defined in the ESearch step;
# $QueryKey and $WebEnv are taken from the ESearch output
my $efetch_URL = "efetch.fcgi?";
my $efetch_Parameters = "rettype=fasta&retmode=text&query_key=$QueryKey&WebEnv=$WebEnv";
my $E_fetch = "$Base_URL$efetch_URL$DB$efetch_Parameters";
Resulting URL: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&query_key=1&WebEnv=0ujfmXBW0U0hNr3FjaUutLkz1bR-NnJ9kp5vybL3u1AbTQdD7uMETHEtG5N@1EE047D172B3B8D0_0015SID
Output from EFetch