BITS: Basics of sequence databases

Basic bioinformatics concepts, databases and tools

Introduction to the training

and Sequence databases

Joachim Jacob - http://www.bits.vib.be

Updated 22 February 2012 - http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf

Scope

Introductory training to Bioinformatics

Exploring and understanding

databases and software

for everyday bioinformatics use

If there is any term which is unclear, please stop me and ask me!

Bio

all data is derived from living samples

Informatics

that data is stored and analyzed in and with computers to obtain understanding

This is an extremely broad description, but we will extract common principles from it during the course

Bioinformatics ...

Bioinformatics is present in every aspect of life sciences research

Bioinformatics ...

Bio

- different types of living samples

Informatics

- storing and categorizing the information and making it easily accessible

- interpreting that information reliably

Bioinformatics … and its companion

Statistics

- large numbers, observational data

The siblings of Bioinformatics

Based on the biological component extracted from life, the measured properties and the ultimate goal of the analysis, different sub-disciplines of bioinformatics exist.

DNA → Genomics
RNA → Transcriptomics
proteins → Proteomics
metabolites → Metabolomics

Also: Epigenomics, Structural bioinformatics, Systems biology, Microbiomics, Interactomics, Metagenomics, Functional genomics, Comparative genomics

Mere data is worth nothing

Data = symbols

Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata.

Knowledge: application of data and information; answers "how" questions

Understanding: appreciation of "why"

Wisdom

CGCTACGCATATCGCT

- Dasypus novemcinctus
- found in my garden
- part of genome
- sequenced in June 2010

This species seems to be related to my neighbor's pet, because it also has this sequence

Has the same mother

http://www.systems-thinking.org/dikw/dikw.htm

Biology Computer Statistics

Bioinformatics research, as a specific branch on the boundary of life science, mathematics and computer science: the 'tool manufacturer'

Tools and approaches

Life sciences research as major 'end user' of the bioinformatics tools and conclusions: the 'tool user'

From data (?) to knowledge (!)

This course is organised in several modules

Module 1: Sequence databases: what, where, how

Module 2: Sequence comparisons: searching, aligning

Module 3: Sequence analysis – domains in protein sequences and predicting functionality, standardisation and useful links

Module 4: Beyond sequences - additional important data sources

Module 5: Genome Browsers - integrating biological data and performing reproducible bioinformatics research in the Galaxy

Overview of the crash course

One tip for the future

Be prepared for change...

Information is fluid

So are bioinfo tools

Learn how to accommodate change

Major resources are more stable

Important concepts do not change often

Module 1

Sequence databases

Module 1: Sequence databases

Sequence databases store DNA and RNA sequences. In bioinformatics, they are still by far the largest collections of biological data, and they are used by many subdisciplines of bioinformatics.

http://www.ebi.ac.uk/embl/Services/DBStats/


... and growing



Three major nucleotide databanks host primary sequence data

European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/

Division EMBL-bank (European Molecular Biology Laboratory) (single)

Trace Archive

SRA Archive

GenBank at NCBI - http://www.ncbi.nlm.nih.gov/

maintained at NCBI (National Center for Biotechnology Information, USA)

DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/

maintained at NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan)

These databases are filled with NA sequence information by scientists and consortia

Submitters of primary sequence data to the primary sequence database:
- Individual scientists
- Large-scale sequencing projects
- Patent offices

Each primary sequence = one experiment

Basically, all 'source' nucleotide material

Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena

Primary NA sequence can be produced by Sanger-based technologies or NGS technologies

sample → DNA, or RNA → (RT) → cDNA

Sanger
Low output in number of seqs, high quality, 400-850 bp. Read profiles in .abi format. Stored in Trace Archive.

NGS
Different technologies. Extremely high output rate, low quality, 30 bp – 600 bp. Reads in .fastq format, stored in the SRA.

These techniques can only read DNA strands, so RNA needs first to be converted to cDNA with reverse transcriptases prior to loading to the machines.

Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader/sanger_method_page.htm

NGS overview: http://seqanswers.com/forums/showthread.php?t=3561

Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab

Overview major DNA reading technologies

Entries in the primary sequence dbs fall into two major categories

High quality single submissions (Sanger)
- gene sequence (genomic, 'STD' data class)
- mRNA sequence (via cDNA, 'STD')
- BAC/YAC/cosmid sequences
- genome sequencing projects (contigs, assemblies, WGS)
- genome markers, STS (sequence tagged sites, unique short sequences from a genome)

Low quality batch submissions
- Expressed Sequence Tags (EST)
- Genome Survey Sequences (GSS)
- high-throughput sequence data (e.g. NGS)

http://www.ebi.ac.uk/ena/about/formats


The batch submissions originate mostly from sequencing centers

[Figure: in large-scale sequencing projects a chromosome is fragmented into a sequencing library, sequence reads are produced, the sequence is assembled and then annotated (e.g. cyp30, cyp309, insvcg343), with submissions (e.g. whole genome shotgun) along the way.]

Each primary database stores its sequences and batch submissions in its own way...

- NCBI: ESTs are stored in dbEST (separate database)
- ENA: ESTs are part of EMBL-bank in the 'EST' data class

Similar for GSS (see dbGSS at NCBI)

ESTs: expressed sequence tags, often partial sequences derived from RNA and submitted in batch. See example

sample → RNA → RNA-seq

>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC

http://www.ncbi.nlm.nih.gov/dbEST/

http://www.ncbi.nlm.nih.gov/dbGSS/

http://www.ncbi.nlm.nih.gov/nucest/HO850345.1
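The small EST batch on this slide is written in FASTA format: a header line starting with '>' followed by the sequence. A minimal Python sketch of reading such a batch could look as follows. The function name parse_fasta and the variable names are made up for this illustration; they do not come from any of the databases.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# The three ESTs from the slide.
EST_BATCH = """>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC
"""

ests = parse_fasta(EST_BATCH)
for est_id, seq in ests:
    print(est_id, len(seq), seq)
```

Only the first word of the header is kept as identifier; the rest of the header line (the description) is ignored in this sketch.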

Batch submissions are marked and/or stored differently than single submissions

ENA structure (organised by type, tier and class)

Tiers:
- ENA-Reads: sequencing and sampling information
- ENA-Assembly: assembly information
- ENA-Annotation: feature annotation

Types:
1) EMBL-Bank
2) Trace Archive - Raw data (capillary sequencing)
3) Sequence Read Archive - Raw data (Next Gen sequencing)

http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Batch submissions

Data class ESTs are also batch submissions


The 'normal' submissions are a minority in primary sequence databases

http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass

Primary sequence dbs are synchronised and every sequence receives a unique identifier

The database maintainers assign each sequence a unique accession number (AC) that is shared between all three databases, in addition to each database's own ID number (info at NCBI). When a sequence is updated, the accession number stays the same and its version suffix is incremented, e.g. from .1 to .2 (see SVA)

http://www.insdc.org/ - Collaboration on features, taxonomy, ...

Example of acc number: BC010109.2
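An identifier such as BC010109.2 consists of the accession number and a version suffix. As an illustration, a minimal Python sketch that splits the two parts; the function name split_accession is made up for this example.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no version suffix
    is present.
    """
    identifier = identifier.strip()
    accession, dot, version = identifier.rpartition(".")
    if not dot:
        return identifier, None  # no version suffix at all
    if not accession or not version.isdigit():
        raise ValueError("not an accession.version identifier: %r" % identifier)
    return accession, int(version)


print(split_accession("BC010109.2"))
print(split_accession("BC010109"))
```

Comparing the integer versions of two identifiers with the same accession tells which record is the more recent one.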

http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)

International Nucleotide Sequence Database Collaboration (INSDC): ENA, GenBank (+ SRA) and DDBJ, synchronized daily

All use the same
- Accession IDs
- Project IDs
- Feature tables (see later)

http://www.ncbi.nlm.nih.gov/books/NBK21105/

http://www.ebi.ac.uk/cgi-bin/sva/sva.pl?search=Go&query=AJ870305

One sequence entry contains three categories of different types of information

1. Info about sequence, submitters and literature (metadata)

2. Annotations of the sequence (metadata related to the seq)

3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)

• A sequence record is called 'annotated' when biological information is added and linked to a position in the sequence

• Annotations, also called 'features', are abbreviated as codes, which can be found in the Feature Tables

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

http://en.wikipedia.org/wiki/Annotation

http://www.ncbi.nlm.nih.gov/collab/FT/index.html

This sequence information can be written in different formats

(Plain) text format, e.g. GenBank

1. General info

Official shared accession

GenBank-specific identifier (simply incremented with each new record)

A lot of different identifiers, roughly as many as there are databases → conversion tools can translate identifiers when needed (see exercises)

*In humans: HUGO Nomenclature committee determines the right gene name

http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt

http://www.genenames.org/

db_xref = cross references, i.e. links to records in other databases that are related to this record (see later). The format is dbname:identifier
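Because a db_xref always has the form dbname:identifier, it is easy to take apart with a script. A minimal Python sketch, assuming the qualifier values have already been extracted from a record; the function name parse_db_xref is made up, and "ExampleDB:ABC123" is a hypothetical value used only for illustration.

```python
def parse_db_xref(value):
    """Split a db_xref value 'dbname:identifier' into its two parts.

    Only the first colon separates the database name from the
    identifier, since some identifiers contain a colon themselves.
    """
    dbname, sep, identifier = value.strip().partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError("not a dbname:identifier value: %r" % value)
    return dbname, identifier


# 'taxon:9606' points to the human entry in the NCBI Taxonomy;
# 'ExampleDB:ABC123' is a made-up cross reference.
xrefs = ["taxon:9606", "ExampleDB:ABC123"]
links = dict(parse_db_xref(x) for x in xrefs)
print(links)
```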

2. Annotation

Feature name Qualifier name

Each protein sequence also receives an accession number

3. Sequence

Other sequence formats

Fasta (minimal metadata, basically only sequence)

>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGCCGGACAGCTCTCGAGCGCATCGACGACGAC

ASN.1 (Abstract Syntax Notation One)

EMBL: all info as in gb, online referred to as 'plain text'

XML

Fastq: sequence info and base 'call' quality
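To make the difference with Fasta concrete: a Fastq record has four lines (header starting with '@', sequence, a '+' line, and one quality character per base). A minimal Python sketch of reading such records, assuming the common Phred+33 quality encoding (other encodings exist). The function name parse_fastq and the example read are made up for this illustration.

```python
def parse_fastq(text, offset=33):
    """Yield (identifier, sequence, qualities) for each 4-line FASTQ record.

    Qualities are decoded as ord(character) - offset; offset=33 is an
    assumption (Phred+33 encoding).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must consist of 4 lines each")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, [ord(c) - offset for c in qual]


# Made-up read: 'I' decodes to 40, '5' to 20 and '!' to 0.
EXAMPLE = "@read1\nACGT\n+\nII5!\n"
reads = list(parse_fastq(EXAMPLE))
for read_id, seq, quals in reads:
    print(read_id, seq, quals)
```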

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Important

'Format' has nothing to do with the program you use to save your file! You don't have a choice: the file needs to be in 'plain text format' (.txt), not a word-processor file such as .doc or .rtf. Wordpad is a good choice for this, provided you save the file as plain text. 'Format' in bioinfo is all about how the information is structured and written down in the plain text file.

http://pubs.acs.org/subscribe/archive/mdd/v04/i01/html/felton_fig2.htm

Degree of annotation differs between entries

Good seq annotations

Experiment information is of most importance in batch submissions (e.g. which species, which technique, ...)

Batch submitted sequences are poorly annotated; single submissions are better annotated


SRA contains batch submitted records of which experiment information is of most importance

Since the sequences are barely (or not) annotated, the experiment description is important: which machine, which organism, which tissue, which developmental stage, disease, treatment, …

How to get sequences into the db, and back out

Submit
- Sequin (GenBank stand-alone tool)
- BankIt (GenBank web tool)
- Webin (EMBL online submission)

Always submit your sequence data (mostly obliged by journals) and include your ACC number in articles (not any other number).

Retrieve

One or few sequences → use one of the numerous web-based tools
- GenBank: Entrez
- EMBL: EB-eye
- MRS: developed for easy retrieval

Many sequences (batch retrieval)
→ use ftp (file transfer protocol)
→ use perl (flexible programming language)
→ BioMart http://www.biomart.org/
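Batch retrieval is typically scripted. The slides suggest perl; as an illustration, here is a minimal Python sketch instead. It assumes NCBI's E-utilities efetch service, which is not mentioned in these slides, and only builds the request URL (no network access). The function name efetch_url is made up for this example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service (assumption: this
# service is used for retrieval; the slides themselves name ftp, perl
# and BioMart).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL that requests several records at once."""
    params = {
        "db": db,                    # database to query
        "id": ",".join(accessions),  # comma-separated accession list
        "rettype": rettype,          # e.g. 'fasta' or 'gb'
        "retmode": "text",           # plain text output
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url(["BC010109.2"])
print(url)

# To actually download (needs network access):
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       fasta_text = response.read().decode("utf-8")
```

For large batches, check the usage guidelines of the service before sending many requests.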

http://www.ncbi.nlm.nih.gov/Sequin/index.html

http://www.ncbi.nlm.nih.gov/WebSub/?tool=genbank

http://www.ebi.ac.uk/embl/Submission/

Example of a primary NA sequence record (ENA)



Text format: a code usable for searching, followed by the data linked to that code

Primary sequence data contains a lot of redundancy!

- Several gene sequences from different labs
- EST sequences from transcripts
- Chromosome sequence
- cDNA sequence

All match the same gene. Often your database search returns all of these sequences... A lot of redundancy!

The primary sequences are the basis for analyses that generate derived sequence data

Scientists/Consortia → primary databases

– Source for further analyses. Which?

• Create protein sequences

• Curate the sequence database

• Assemble genomes

• Search for similarities

• Aggregate information about one gene

• …

Results stored in derived databases

Protein databases come in two kinds

The most important protein db is UniProt and contains 'automatic' and manual entries

UniProt Knowledge Base - 'the best annotated protein database of the world'

http://www.uniprot.org/

Refseq - The NCBI way to reduce redundancy in primary sequence data

RefSeq is NCBI 'Reference Sequences' (prot and nuc)

Redundancy from primary sequence data is reduced both automatically and by manual annotation of NA and protein sequences. 'one natural biological molecule = one entry'. Links back to the original primary sequences. Hugely popular and a basis for a lot of analyses.

http://www.ncbi.nlm.nih.gov/RefSeq/

Click to apply refseq filter in entrez search


RefSeq has its own identifiers, not to be mixed up with accession numbers

RefSeq entry codes look similar to ACC numbers, but they are not ACC numbers: note the underscore. RefSeq records are also in GenBank format. Note: in the 'Features' section one can find the raw sequences from which the entry was derived. (Typical mistake: searching with a RefSeq code in UniProt.)

NC_* (curated) complete genomic element (chromosome, plasmid, ...)
NT_* (automated) intermediate assembly from BAC
NZ_* (automated) incomplete genomic sequence from WGS
NW_* (automated) intermediate assembly from WGS
NG_* (curated) incomplete genomic element corresponding to gene
NM_* (curated) mRNA
NR_* (curated) non-coding RNA or predicted transcript of pseudogene
NP_* (curated) protein
ZP_* (automated) protein predicted from WGS sequence (NZ_*)
YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
XM_* (automated) mRNA
XR_* (automated) non-coding RNA or predicted transcript of pseudogene
XP_* (automated) protein
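The prefix table above can be turned into a simple lookup, which helps to tell RefSeq codes apart from ordinary accession numbers. A minimal Python sketch, restricted to the prefixes listed on this slide; the function name classify_refseq is made up, and the identifiers in the usage lines serve only as examples.

```python
# Prefix -> (curation status, molecule type), copied from the table above.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequence"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier):
    """Return (status, molecule type) for a RefSeq code, else None."""
    prefix, underscore, _rest = identifier.strip().partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq code
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000546.5"))  # RefSeq mRNA code
print(classify_refseq("BC010109.2"))   # ordinary accession number
```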

http://www.ncbi.nlm.nih.gov/RefSeq/key.html

UniRef – UniProt redundancy reducing system for protein sequences

Non redundant protein sequences from UniProt (comparable to RefSeq)

Hiding redundant sequences by clustering them
• UniRef100 = completely identical sequences
• UniRef90 = 90% identical sequences
• UniRef50 = 50% identical sequences
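The idea behind the 100% level can be sketched in a few lines of Python: group entries whose sequences are exactly the same and keep one representative per group. This is a simplified illustration and not the actual UniRef procedure; the 90% and 50% levels require sequence comparison and are not covered. The function name and the example entries are made up.

```python
from collections import defaultdict


def cluster_identical(records):
    """Group (name, sequence) pairs with identical sequences.

    Returns a list of (representative, members) tuples; the first
    entry seen for a sequence acts as representative.
    """
    clusters = defaultdict(list)
    for name, seq in records:
        clusters[seq.upper()].append(name)  # case-insensitive comparison
    return [(members[0], members) for members in clusters.values()]


# Hypothetical protein entries.
proteins = [("P1", "MKT"), ("P2", "mkt"), ("P3", "MKV")]
clustered = cluster_identical(proteins)
for representative, members in clustered:
    print(representative, members)
```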

See http://www.uniprot.org/help/uniref

NCBI's Gene – summarizes gene information including sequence information from primary dbs

Example of the gene NPR1 from A. thaliana

UniGene – summarizes transcriptomic information around genes

And a lot more derived databases with sequence information exist

Repbase :

repeats (Alu, …), maintained by Jerzy Jurka at the Genetic Information Research Institute (Mountain View CA, USA). CENSOR server allows to "clean" sequences. http://www.girinst.org/repbase

MiRBase → published miRNA sequences

http://www.mirbase.org/

Eukaryotic promoter database

http://www.epd.isb-sib.ch/

UniVec

GenBank subset + some sequences from commercial sources - ftp://ftp.ncbi.nih.gov/pub/UniVec/

The most important sequence databases overview

Primary sequence data: DDBJ, ENA, GenBank (GB)

Derived: TrEMBL, GenPept

Curated: Swiss-Prot, RefSeq

UniProt: TrEMBL and Swiss-Prot

Integrated search portals: Entrez, ENA search, EB-eye, UniProt

Common gene annotations on sequences

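[Figure: a genome sequence (e.g. Chr6) carrying enhancers/promoters, a gene sequence with exons and introns, and a terminator; the mRNA with 5'UTR, CDS, 3'UTR and poly(A) tail (AAAAAAAAAAAAA); and the protein, obtained from the CDS via the genetic code tables.]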

http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
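The step from CDS to protein uses a genetic code table. A minimal Python sketch of that translation, assuming the standard genetic code (table 1 on the NCBI page linked above); other organisms and organelles use other tables. The function name translate and the example sequence are made up for this illustration.

```python
# Standard genetic code (table 1), with bases in the order T, C, A, G
# for the first, second and third codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence up to the first stop codon.

    Accepts DNA or RNA; unknown codons become 'X'; an incomplete
    codon at the end is ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":
            break  # stop codon
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```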

Searching the database for your gene of interest

First you have to determine for yourself which information you want

- NA sequences vs. protein sequences

- If NA, genomic sequences, or RNA derived

- All possible sequences that exist, or curated ones

- Protein sequences of which quality

- ...

Entrez is a starting point for searches at NCBI

http://www.ncbi.nlm.nih.gov/sites/gquery

Visualising the db_xrefs in records at NCBI

ENA has its text-search portal

http://www.ebi.ac.uk/ena/

Results from an ENA search are organised following the ENA database structure

UniProt has a simple search box leading to a sophisticated search results page

Complex searches can be achieved by using the index codes in the database

e.g.

“oc=Primates and de=complete and de=cds and de=MHC”

This could answer: give me all complete coding sequences of MHC available in primates.
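What such a query does can be mimicked on a handful of records. The following Python sketch is only an illustration of the principle: it treats every field=value term as a case-insensitive substring test and combines the terms with 'and'. The real index search of the databases works differently, and the records shown are made up.

```python
import re


def parse_query(query):
    """Turn 'oc=Primates and de=cds' into [('oc', 'primates'), ('de', 'cds')]."""
    terms = []
    for term in re.split(r"\s+and\s+", query.strip(), flags=re.IGNORECASE):
        field, sep, value = term.partition("=")
        if not sep:
            raise ValueError("expected field=value, got %r" % term)
        terms.append((field.strip().lower(), value.strip().lower()))
    return terms


def matches(record, terms):
    """True when every term occurs in the corresponding record field."""
    return all(value in record.get(field, "").lower() for field, value in terms)


# Hypothetical records: oc = organism classification, de = description.
records = [
    {"id": "rec1", "oc": "Eukaryota; Metazoa; Primates; Hominidae",
     "de": "Homo sapiens MHC class I antigen, complete cds"},
    {"id": "rec2", "oc": "Eukaryota; Metazoa; Rodentia; Muridae",
     "de": "Mus musculus MHC class II antigen, complete cds"},
    {"id": "rec3", "oc": "Eukaryota; Metazoa; Primates; Hominidae",
     "de": "Pan troglodytes MHC class I antigen, partial cds"},
]

terms = parse_query("oc=Primates and de=complete and de=cds and de=MHC")
hits = [r["id"] for r in records if matches(r, terms)]
print(hits)
```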

Meta-search tools can search different sequence databases at once.

MRS

Open Source, developed by Maarten Hekkelman at Radboud U. (Nijmegen, the Netherlands). Allows searching in different databases at once, and provides also statistics on the databases.

Alternatives: ACNUC, SRS

http://mrs.cmbi.ru.nl/mrs-web/

http://pbil.univ-lyon1.fr/search/query_fam.php

http://srs.embl.de/srs/

Logical operators

Q1 AND Q2 (&)

Q1 OR Q2 (|)

Q1 NOT Q2 (!)

Searching involves combining conditions. The difference between the logical operators AND, OR and NOT is explained here with Venn diagrams.
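The Venn diagram view translates directly to sets. A minimal Python sketch, in which the two result sets Q1 and Q2 contain made-up record identifiers. Note that Python writes NOT as set difference ('-') rather than '!'.

```python
# Hypothetical hits of two separate queries.
q1 = {"A1", "A2", "A3"}
q2 = {"A2", "A3", "A4"}

q1_and_q2 = q1 & q2  # records found by both queries
q1_or_q2 = q1 | q2   # records found by at least one query
q1_not_q2 = q1 - q2  # records found by Q1 but not by Q2

print(sorted(q1_and_q2))
print(sorted(q1_or_q2))
print(sorted(q1_not_q2))
```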

Hands-on!

Every module ends with an exercise session.

We will now explore how data is stored in different sequence databases. You get …. minutes for this exercise.

Afterwards, we will summarize some of the difficulties you might have experienced.

Summary

- This course is organised in several modules
- Module 1: Sequence databases
- Three major nucleotide databanks host primary sequence data
- These databases are filled with NA sequence information by scientists and consortia
- The batch submissions originate mostly from sequencing centers
- Each primary database stores its sequences and batch submissions in its own way...
- Batch submissions are marked and/or stored differently than single submissions
- The 'normal' submissions are a minority in primary sequence databases
- Primary sequence dbs are synchronised and every sequence receives a unique identifier
- One sequence entry contains three categories of different types of information
- This sequence information can be written in different formats
- Degree of annotation differs between entries
- SRA contains batch submitted records of which experiment information is of most importance
- How to get sequences into the db, and back out
- Primary sequence data contains a lot of redundancy!
- The primary sequences are the basis for analyses that generate derived sequence data
- Protein databases come in two kinds
- The most important protein db is UniProt and contains 'automatic' and manual entries
- Refseq - The NCBI way to reduce redundancy in primary sequence data
- RefSeq has its own identifiers, not to be mixed up with accession numbers
- UniRef – UniProt redundancy reducing system for protein sequences
- NCBI's Gene – summarizes gene information including sequence information from primary dbs
- UniGene – summarizes transcriptomic information around genes
- And a lot more derived databases with sequence information exist
- Searching the database for your gene of interest
- Entrez is a starting point for searches at NCBI
- Visualising the db_xrefs in records at NCBI
- ENA has its text-search portal
- Results from an ENA search are organised following the ENA database structure
- UniProt has a simple search box leading to a sophisticated search results page
- Complex searches can be achieved by using the index codes in the database
- Meta-search tools can search different sequence databases at once.
- Hands-on!
