Module 1: Sequence databases. Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
Basic bioinformatics concepts, databases and tools
Introduction to the training
and Sequence databases
Joachim Jacob - http://www.bits.vib.be
Updated 22 February 2012 - http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf
Scope
Introductory training to Bioinformatics
Exploring and understanding
databases and software
for everyday bioinformatics use
If there is any term which is unclear, please stop me and ask me!
Bio
all data is derived from living samples
Informatics
that data is stored and analyzed in and with computers to obtain understanding
This is an extremely broad description, but we will extract common principles from it during the course
Bioinformatics ...
Bioinformatics is present in every aspect of life sciences research
Bioinformatics ...
Bio
- different types of living samples
Informatics
- storing and categorizing the information and making it easily accessible
- interpreting that information reliably
Bioinformatics … and its companion
Statistics
- large numbers, observational data
The siblings of Bioinformatics
Based on the biological component extracted from life, the measured properties and the ultimate goal of the analysis, different sub-disciplines of bioinformatics exist.
DNA → Genomics
RNA → Transcriptomics
proteins → Proteomics
metabolites → Metabolomics
Other sub-disciplines: Epigenomics, Structural bioinformatics, Systems biology, Microbiomics, Interactomics, Metagenomics, Functional genomics, Comparative genomics
Mere data is worth nothing
Data = symbols
Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata.
Knowledge: application of data and information; answers "how" questions
Understanding: appreciation of "why"
Wisdom
Example:
Data: CGCTACGCATATCGCT
Information: Dasypus novemcinctus; found in my garden; part of genome; sequenced in June 2010
Knowledge: This species seems to be related to my neighbor's pet, because it also has this sequence
Understanding: Has the same mother
http://www.systems-thinking.org/dikw/dikw.htm
Biology Computer Statistics
Bioinformatics research, as a specific branch on the boundary of life science, mathematics and computer science: 'tool manufacturer'
Tools and approaches
Life sciences research as major 'end user' of the bioinformatics tools and conclusions: 'tool user'
data (?) → knowledge (!)
This course is organised in several modules
Module 1: Sequence databases: what, where, how
Module 2: Sequence comparisons: searching, aligning
Module 3: Sequence analysis – domains in protein sequences and predicting functionality, standardisation and useful links
Module 4: Beyond sequences - additional important data sources
Module 5: Genome Browsers - integrating biological data and performing reproducible bioinformatics research in the Galaxy
Overview of the crash course
One tip for the future
Be prepared for change...
Information is fluid
So are bioinfo tools
Learn how to accommodate change
Major resources are more stable
Important concepts do not change often
Module 1
Sequence databases
Module 1: Sequence databases
Sequence databases store DNA and RNA sequences. In bioinformatics they are still by far the largest collections of biological data, and they are used by many subdisciplines of bioinformatics.
http://www.ebi.ac.uk/embl/Services/DBStats/
... and growing
Three major nucleotide databanks host primary sequence data
European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/
- Division EMBL-Bank (European Molecular Biology Laboratory) (single)
- Trace Archive
- SRA Archive
GenBank at NCBI - http://www.ncbi.nlm.nih.gov/
- maintained at NCBI (National Center for Biotechnology Information, USA)
DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/
- maintained at NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan)
These databases are filled with NA sequence information by scientists and consortia
[Figure: individual scientists, large-scale sequencing projects and patent offices submit primary sequence data (e.g. ACTGCTGCTA, GCTAGCTGATCTATGCTAGCTGTAGCTGAG) to the primary sequence database]
Each primary sequence = one experiment
Basically, all 'source' nucleotide material
Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena
Primary NA sequence can be produced by Sanger-based technologies or NGS technologies
sample → DNA, or RNA → (RT) → cDNA
Sanger: low output in number of seqs, high quality, 400-850 bp. Read profiles in .abi format. Stored in the Trace Archive.
NGS: different technologies. Extremely high output rate, low quality, 30 bp – 600 bp. Reads in .fastq format, stored in the SRA.
These techniques can only read DNA strands, so RNA must first be converted to cDNA with reverse transcriptase (RT) before it is loaded onto the machines.
Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader/sanger_method_page.htm
NGS overview: http://seqanswers.com/forums/showthread.php?t=3561
Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab
Overview of major DNA reading technologies
In the primary sequence dbs, submissions fall into two major categories
High quality single submissions (Sanger)
- gene sequence (genomic – 'STD' data class)
- mRNA sequence (via cDNA – 'STD')
- BAC/YAC/cosmid sequences
- genome sequencing projects (contigs, assemblies, WGS)
- genome markers, STS (sequence tagged sites, unique short sequences from a genome)
Low quality batch submissions
- Expressed Sequence Tags (EST)
- Genome Survey Sequences (GSS)
- high-throughput sequence data (e.g. NGS)
Source material: DNA, RNA, cDNA
http://www.ebi.ac.uk/ena/about/formats
The batch submissions originate mostly from sequencing centers
[Figure: a large-scale sequencing project fragments a chromosome into a sequencing library, produces sequence reads, assembles the sequence and annotates it (e.g. cyp30, cyp309, insvcg343). Submissions are made at several of these stages, e.g. whole genome shotgun.]
Each primary database stores its sequences and batch submissions in its own way...
- NCBI: ESTs are stored in dbEST (separate database)
- ENA: ESTs are part of EMBL-Bank in the 'EST' data class
Similar for GSS (see dbGSS at NCBI)
ESTs: expressed sequence tags, often partial sequences derived from RNA and submitted in batch. See example
sample → RNA → RNA-seq
>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC
Batch submissions are marked and/or stored differently than single submissions
ENA structure
TYPE / TIER / CLASS
ENA-Reads: Sequencing and sampling information
ENA-Assembly: Assembly information
ENA-Annotation: Feature annotation
1) EMBL-Bank
2) Trace Archive - Raw data (capillary sequencing)
3) Sequence Read Archive - Raw data (Next Gen sequencing)
http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
Batch submissions
Data class ESTs are also batch submissions
The 'normal' submissions are a minority in primary sequence databases
http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass
Primary sequence dbs are synchronised and every sequence receives a unique identifier
The database maintainers assign each sequence a unique accession number (AC) that is shared between all three databases, in addition to each database's own ID number (info at NCBI). When a sequence is updated, the accession number stays the same and its version suffix is incremented, e.g. .1 becomes .2 (see SVA)
http://www.insdc.org/ - Collaboration on features, taxonomy, ...
Example of acc number: BC010109.2
http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)
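The accession.version convention can be illustrated with a minimal Python sketch. It only splits an identifier such as BC010109.2 into its stable accession and its version number; the function name `split_accession` and the simple pattern are assumptions made for this illustration, not an official INSDC validation rule.

```python
import re
from typing import NamedTuple, Optional


class VersionedAccession(NamedTuple):
    accession: str          # stable part, e.g. "BC010109"
    version: Optional[int]  # number after the dot, None if absent


# Letters, digits and underscores, optionally followed by ".<digits>".
# This is a deliberately loose pattern for illustration only.
_PATTERN = re.compile(r"^([A-Za-z0-9_]+)(?:\.(\d+))?$")


def split_accession(identifier: str) -> VersionedAccession:
    """Split 'ACCESSION.VERSION' into its two parts."""
    match = _PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession-like identifier: {identifier!r}")
    accession, version = match.groups()
    return VersionedAccession(accession, int(version) if version else None)


example = split_accession("BC010109.2")
print(example.accession, example.version)
```

Citing the identifier with its version pins down exactly which revision of the sequence was used.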
International Nucleotide Sequence Database Collaboration (INSDC): ENA, GenBank + SRA, DDBJ
Synchronized daily
All use the same
- Accession IDs
- Project IDs
- Feature tables (see later)
One sequence entry contains three categories of different types of information
1. Info about sequence, submitters and literature (metadata)
2. Annotations of the sequence (metadata related to the seq)
3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)
• A sequence record is called 'annotated' when biological information is added and linked to a position in the sequence
• Annotations, also called 'features', are abbreviated as codes, which can be found in the Feature Tables
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
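The three-part layout of a record can be sketched in Python. The record below is made up for illustration (it is not a real database entry) and follows the GenBank flat-file layout, in which the FEATURES line opens the annotation block, ORIGIN opens the sequence block and // closes the record. The function name `split_record` is hypothetical, and real records are better read with a dedicated parser.

```python
from typing import Dict, List, Union

# Made-up record in GenBank-like flat-file layout (NOT a real entry).
TOY_RECORD = """\
LOCUS       TOY0001        16 bp    DNA     linear   01-JAN-2012
DEFINITION  Made-up example record.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..16
                     /organism="Dasypus novemcinctus"
ORIGIN
        1 cgctacgcat atcgct
//
"""


def split_record(text: str) -> Dict[str, Union[List[str], str]]:
    """Split one flat-file record into general info, features and sequence."""
    general: List[str] = []
    features: List[str] = []
    chunks: List[str] = []
    section = "general"
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # start of the annotation block
            section = "features"
            continue
        if line.startswith("ORIGIN"):    # start of the sequence block
            section = "sequence"
            continue
        if section == "general":
            general.append(line)
        elif section == "features":
            features.append(line.strip())
        else:
            # Drop the position numbers and spaces, keep only the letters.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return {
        "general": general,
        "features": features,
        "sequence": "".join(chunks).upper(),
    }


parts = split_record(TOY_RECORD)
print(parts["sequence"])
```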
This sequence information can be written in different formats
(Plain) text format, e.g. GenBank
1. General info
Official shared accession
GenBank-specific identifier (simply increments with each new record)
A lot of different identifiers, roughly as many as there are databases → conversion tools can translate identifiers when needed (see exercises)
*In humans, the HUGO Gene Nomenclature Committee determines the right gene name
http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt
db_xref = cross references = links to related records in other databases (see later). The format is dbname:identifier
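A short Python sketch of reading the dbname:identifier format. The function name `parse_db_xref` is hypothetical and "ExampleDB:ABC:123" is a made-up value; the split happens at the first colon only, on the assumption that an identifier may itself contain a colon.

```python
from typing import Tuple


def parse_db_xref(value: str) -> Tuple[str, str]:
    """Split a db_xref value 'dbname:identifier' at the first colon."""
    cleaned = value.strip().strip('"')
    dbname, sep, identifier = cleaned.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"not in dbname:identifier form: {value!r}")
    return dbname, identifier


# "taxon:9606" points to the NCBI Taxonomy entry for Homo sapiens.
print(parse_db_xref("taxon:9606"))
# Made-up value showing an identifier that contains a colon itself.
print(parse_db_xref('"ExampleDB:ABC:123"'))
```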
2. Annotation
Feature name Qualifier name
Each protein sequence also receives an accession number
3. Sequence
Other sequence formats
Fasta (minimal metadata, basically only sequence)
>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGCCGGACAGCTCTCGAGCGCATCGACGACGAC
ASN.1 (Abstract Syntax Notation One)
EMBL: all info as in gb, online referred to as 'plain text'
XML
Fastq: sequence info and base 'call' quality
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
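To make the difference between Fasta and Fastq concrete, here is a minimal Python sketch of both. Assumptions: the function names are hypothetical, each Fastq record occupies exactly four lines, and qualities use the common Phred+33 encoding (character code minus 33). The read in the Fastq example is made up; the Fasta example reuses the EST sequences shown earlier.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each '>' header line to its (possibly multi-line) sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def parse_fastq(text: str) -> List[Tuple[str, str, List[int]]]:
    """Return (name, sequence, Phred scores) for 4-line Fastq records."""
    lines = [raw.strip() for raw in text.splitlines() if raw.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per Fastq record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding.
        scores = [ord(ch) - 33 for ch in quality]
        records.append((header[1:], sequence, scores))
    return records


FASTA_TEXT = ">est1\nATCGACTAGCATCA\n>est2\nTCGACTAGCGACTA\n>est3\nCAGCATCATCGAC\n"
FASTQ_TEXT = "@read1\nACGT\n+\nII5!\n"  # made-up read

print(parse_fasta(FASTA_TEXT))
print(parse_fastq(FASTQ_TEXT))
```

Fasta keeps only a name and the sequence, while Fastq adds one quality value per base, which is why NGS reads are stored as Fastq.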
Important
'Format' has nothing to do with the program you use to save your file! You don't have a choice: it needs to be plain text (.txt), not a word-processor file such as the .doc or .rtf files used by MS Word. Wordpad is a good choice for this, as long as you save as plain text. 'Format' in bioinfo is all about how the information is structured and written down in the plain text file.
Degree of annotation differs between entries
Good seq annotations
Experiment information is of most importance in batch submissions (e.g. which species, which technique, ...)
Batch submitted sequences are poorly annotated; single submissions are better annotated
SRA contains batch submitted records of which experiment information is of most importance
Since the sequences are barely (or not) annotated, the experiment description is important: which machine, which organism, which tissue, which developmental stage, disease, treatment, …
How to get sequences into the db, and back out
Submit
- Sequin (GenBank, stand-alone)
- BankIt (GenBank web tool)
- Webin (EMBL online submission)
Retrieve
One or few sequences → use one of the numerous web-based tools
- GenBank: Entrez
- EMBL: EB-eye
- MRS: developed for easy retrieval
Many sequences (batch retrieval)
→ use ftp (file transfer protocol)
→ use perl (flexible programming language)
→ BioMart http://www.biomart.org/
Always submit your sequence data (mostly required by journals) and include your ACC number in articles (not any other number).
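Scripted batch retrieval can be sketched as follows. The slides suggest ftp or perl; this sketch instead uses Python and NCBI's E-utilities `efetch` endpoint, which the slides do not mention, so treat that choice and the helper names `build_efetch_url` and `fetch` as assumptions. Only the URL is built when the block runs; the download function is defined but not called.

```python
from typing import Iterable
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities efetch endpoint (not part of the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions: Iterable[str],
                     db: str = "nuccore",
                     rettype: str = "fasta") -> str:
    """Build one request URL for a batch of accession numbers."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # several records in one request
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch(accessions: Iterable[str]) -> str:
    """Download the records as text (needs network access)."""
    with urlopen(build_efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


print(build_efetch_url(["BC010109.2"]))
```

For large batches, check the usage guidelines of the service first and split the accession list into smaller requests.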
Example of a primary NA sequence record (ENA)
http://www.ebi.ac.uk/ena/about/formats
Text format: each line starts with a code usable for searching, followed by the data linked to that code
Primary sequence data contains a lot of redundancy!
- Several gene sequences from different labs
- EST sequences from transcripts
- Chromosome sequence
- cDNA sequence
All match the same gene. Often your database search returns all of these sequences... A lot of redundancy!
The primary sequences are the basis for analyses that generate derived sequence data
Scientists/Consortia → primary databases
– Source for further analyses. Which?
• Create protein sequences
• Curate the sequence database
• Assemble genomes
• Search for similarities
• Aggregate information about one gene
• …
Results stored in derived databases
Protein databases come in two kinds
The most important protein db is UniProt and contains 'automatic' and manual entries
UniProt Knowledge Base - 'the best annotated protein database of the world'
http://www.uniprot.org/
Refseq - The NCBI way to reduce redundancy in primary sequence data
RefSeq is NCBI 'Reference Sequences' (prot and nuc)
Redundancy from primary sequence data is reduced both automatically and by manual annotation of NA and protein sequences. 'one natural biological molecule = one entry'. Links back to the original primary sequences. Hugely popular and a basis for a lot of analyses.
http://www.ncbi.nlm.nih.gov/RefSeq/
Click to apply refseq filter in entrez search
RefSeq has its own identifiers, not to be mixed up with accession numbers
RefSeq entry codes look similar to ACC numbers (but they are not ACC numbers: note the underscore!), and RefSeq records are also in GenBank format. Note: the 'Features' section lists the raw sequences from which the entry was derived. (Typical mistake: searching UniProt with a RefSeq code.)
NC_* (curated) complete genomic element (chromosome, plasmid, ...)
NT_* (automated) intermediate assembly from BAC
NZ_* (automated) incomplete genomic sequence from WGS
NW_* (automated) intermediate assembly from WGS
NG_* (curated) incomplete genomic element corresponding to gene
NM_* (curated) mRNA
NR_* (curated) non-coding RNA or predicted transcript of pseudogene
NP_* (curated) protein
ZP_* (automated) protein predicted from WGS sequence (NZ_*)
YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
XM_* (automated) mRNA
XR_* (automated) non-coding RNA or predicted transcript of pseudogene
XP_* (automated) protein
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/RefSeq/key.html
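The prefix table can be turned into a small lookup, sketched here in Python. The dictionary copies the prefixes listed on this slide; the function name `classify_refseq` and the identifier NM_000000.1 are made up for the illustration. Identifiers without a known prefix and underscore return None, which is one quick way to tell a RefSeq code from an ordinary accession number.

```python
from typing import Optional, Tuple

# Prefix -> (curation status, molecule), copied from the table on this slide.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequence"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier: str) -> Optional[Tuple[str, str]]:
    """Return (status, molecule) for a RefSeq-style code, else None."""
    prefix, underscore, rest = identifier.strip().partition("_")
    if not underscore or not rest:
        return None  # no underscore: not a RefSeq-style code
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000000.1"))  # made-up RefSeq-style identifier
print(classify_refseq("BC010109.2"))   # primary accession, no underscore
```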
UniRef – UniProt redundancy reducing system for protein sequences
Non redundant protein sequences from UniProt
~ refseq
Hiding redundant sequences by clustering them
• UniRef100 = completely identical sequences
• UniRef90 = sequences at least 90% identical
• UniRef50 = sequences at least 50% identical
See http://www.uniprot.org/help/uniref
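The idea of hiding redundancy by clustering at an identity threshold can be shown with a toy Python sketch. This is not the UniRef procedure: real clustering relies on sequence alignment and dedicated tools, whereas this sketch compares made-up peptides position by position without gaps and assigns each sequence greedily to the first representative it resembles closely enough. All names and sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, ungapped (toy measure, no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster(sequences: Dict[str, str], threshold: float) -> List[List[str]]:
    """Greedy clustering: join the first representative that is similar enough."""
    clusters: List[Tuple[str, List[str]]] = []  # (representative, member names)
    # Longest sequences first, so representatives are the longest members.
    for name, seq in sorted(sequences.items(), key=lambda kv: -len(kv[1])):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


# Made-up peptides of length 10.
TOY = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQR",  # identical to p1
    "p3": "MKTAYIAKQK",  # 9 of 10 positions match p1
    "p4": "MKTAYLLLLL",  # 5 of 10 positions match p1
}

for level in (1.0, 0.9, 0.5):
    print(level, cluster(TOY, level))
```

Lowering the threshold merges more sequences into fewer clusters, which mirrors the step from UniRef100 to UniRef90 to UniRef50.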
NCBI's Gene – summarizes gene information including sequence information from primary dbs
Example of the gene NPR1 from A. thaliana
UniGene – summarizes transcriptomic information around genes
And a lot more derived databases with sequence information exist
Repbase: repeats (Alu, …), maintained by Jerzy Jurka at the Genetic Information Research Institute (Mountain View CA, USA). The CENSOR server allows you to "clean" sequences. http://www.girinst.org/repbase
miRBase: published miRNA sequences - http://www.mirbase.org/
Eukaryotic Promoter Database - http://www.epd.isb-sib.ch/
UniVec: GenBank subset + some sequences from commercial sources - ftp://ftp.ncbi.nih.gov/pub/UniVec/
The most important sequence databases overview
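[Figure: overview of the most important sequence databases]
Primary sequence data: DDBJ, ENA, GenBank (GB)
Derived: TrEMBL, GenPept
Curated: Swiss-Prot, RefSeq
UniProt: TrEMBL and Swiss-Prot
Integrated search portals: Entrez, ENA search, EB-eye, UniProt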
Common gene annotations on sequences
[Figure: genome sequence (e.g. Chr6) → gene sequence with enhancers/promoters, exons, introns and terminator → mRNA with 5'UTR, CDS, 3'UTR and poly(A) tail (AAAAAAAAAAAAA) → protein, via the genetic code tables]
Searching the database for your gene of interest
First you have to determine for yourself which information you want
- NA sequences vs. protein sequences
- If NA: genomic sequences or RNA-derived sequences
- All possible sequences that exist, or only curated ones
- Protein sequences of which quality
- ...
Entrez is a starting point for searches at NCBI - http://www.ncbi.nlm.nih.gov/sites/gquery
Visualising the db_xrefs in records at NCBI
ENA has its text-search portal - http://www.ebi.ac.uk/ena/
Results from an ENA search are organised following the ENA database structure
UniProt has a simple search box leading to a sophisticated search results page
Complex searches can be achieved by using the index codes in the database
e.g.
“oc=Primates and de=complete and de=cds and de=MHC”
This could answer: give me all complete coding sequences of MHC available in primates.
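How such a query combines index codes can be sketched in Python. This is not how MRS, SRS or any real search engine is implemented: it parses the query above into (field, term) pairs and keeps the in-memory records that satisfy all of them, using case-insensitive substring matching. The three records and the helper names are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Made-up records: "oc" = organism classification, "de" = description.
RECORDS = [
    {"id": "toy1", "oc": "Eukaryota; Metazoa; Mammalia; Primates",
     "de": "Made-up MHC class I antigen, complete cds"},
    {"id": "toy2", "oc": "Eukaryota; Metazoa; Mammalia; Rodentia",
     "de": "Made-up MHC class II antigen, complete cds"},
    {"id": "toy3", "oc": "Eukaryota; Metazoa; Mammalia; Primates",
     "de": "Made-up MHC class I antigen, partial cds"},
]


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Turn 'oc=Primates and de=cds' into [('oc', 'primates'), ('de', 'cds')]."""
    conditions = []
    for part in re.split(r"\s+and\s+", query.strip(), flags=re.IGNORECASE):
        field, sep, term = part.partition("=")
        if not sep:
            raise ValueError(f"expected field=term, got {part!r}")
        conditions.append((field.strip().lower(), term.strip().lower()))
    return conditions


def search(query: str, records: List[Dict[str, str]]) -> List[str]:
    """Return the ids of records that satisfy every condition (logical AND)."""
    conditions = parse_query(query)
    return [
        record["id"]
        for record in records
        if all(term in record.get(field, "").lower() for field, term in conditions)
    ]


QUERY = "oc=Primates and de=complete and de=cds and de=MHC"
print(search(QUERY, RECORDS))
```

Only toy1 passes: toy2 fails the organism condition and toy3 is not a complete cds.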
Meta-search tools can search different sequence databases at once.
MRS
Open Source, developed by Maarten Hekkelman at Radboud U. (Nijmegen, the Netherlands). Allows searching in different databases at once, and provides also statistics on the databases.
Alternatives: ACNUC, SRS
Logical operators
Q1 AND Q2 (&)
Q1 OR Q2 (|)
Q1 NOT Q2 (!)
Searching involves making combinations of conditions. The Venn diagrams here explain the difference between a logical AND, OR and NOT.
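The three operators behave like set operations on the hits of each query, which a few lines of Python can show. The record names are made up. Python sets use & and | just like the symbols above, but NOT is written as set difference (-) rather than !.

```python
# Made-up result sets: records matched by query Q1 and by query Q2.
q1_hits = {"recA", "recB", "recC"}
q2_hits = {"recB", "recC", "recD"}

q1_and_q2 = q1_hits & q2_hits  # AND: in both result sets
q1_or_q2 = q1_hits | q2_hits   # OR: in at least one result set
q1_not_q2 = q1_hits - q2_hits  # NOT: in Q1 but not in Q2

print(sorted(q1_and_q2))
print(sorted(q1_or_q2))
print(sorted(q1_not_q2))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits from the first result set.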
Hands-on!
Every module ends with an exercise session.
We will now explore how data is stored in different sequence databases. You get …. minutes for this exercise.
Afterwards, we summarize some of the difficulties some of you might have experienced.
Summary
- This course is organised in several modules
- Module 1: Sequence databases
- Three major nucleotide databanks host primary sequence data
- These databases are filled with NA sequence information by scientists and consortia
- The batch submissions originate mostly from sequencing centers
- Each primary database stores their sequences and batch submissions in their own way...
- Batch submissions are marked and/or stored differently than single submissions
- The 'normal' submissions are a minority in primary sequence databases
- Primary sequence dbs are synchronised and every sequence receives a unique identifier
- One sequence entry contains three categories of different types of information
- This sequence information can be written in different formats
- Degree of annotation differs between entries
- SRA contains batch submitted records of which experiment information is of most importance
- How to get sequences into the db, and back out
- Primary sequence data contains a lot of redundancy!
- The primary sequences are the basis for analyses that generate derived sequence data
- Protein databases come in two kinds
- The most important protein db is UniProt and contains 'automatic' and manual entries
- Refseq - The NCBI way to reduce redundancy in primary sequence data
- RefSeq has its own identifiers, not to be mixed up with accession numbers
- UniRef – UniProt redundancy reducing system for proteins sequences
- NCBI's Gene – summarizes gene information including sequence information from primary dbs
- UniGene – summarizes transcriptomic information around genes
- And a lot more derived databases with sequence information exist
- Searching the database for your gene of interest
- Entrez is a starting point for searches at NCBI
- Visualising the db_xrefs in records at NCBI
- ENA has its text-search portal
- Results from an ENA search are organised following the ENA database structure
- UniProt has a simple search box leading to a sophisticated search results page
- Complex searches can be achieved by using the index codes in the database
- Meta-search tools can search different sequence databases at once.
- Hands-on!