
Http:// Databases in Biomart format: EnsemblHapMapHTGTHGNCDictybaseWormbaseGramene EurophenomeUniProRat Genome Database DroSpeGeArrayExpress

Page 1:

http://www.biomart.org/

Databases in Biomart format:

Ensembl, HapMap, HTGT, HGNC, Dictybase, Wormbase, Gramene, Europhenome, UniProt, Rat Genome Database, DroSpeGe, ArrayExpress DW, Eurexpress, GermOnLine, PRIDE, PepSeeker, VectorBase, Pancreatic Expression Database, Reactome, EU Rat Mart, Paramecium DB

“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI).”

Open Source – LGPL

* Perl API → Web Interface, Web Services Interface, REST API

* Java API → Mart Explorer GUI, MartShell

* 3rd Party Software → Bioclipse, biomaRt-BioConductor, Cytoscape, Galaxy, Taverna, WebLab

Page 2:

A Mart is a collection of datasets (~=Database).

Marts are optimised for querying.

A Dataset has a main table, with an entry (and Primary Key) for each of the items of interest in that dataset (e.g. Mouse Transcripts).

Related bits of information about these items are hung off the main table in dimension tables (e.g. Affy IDs corresponding to this gene).
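
The main-table / dimension-table layout can be pictured with two small data frames. This is only a toy sketch in base R: the table names, column names and IDs below are made up for illustration and are not the real BioMart schema.

```r
# Hypothetical main table: one row (and one primary key) per transcript
main <- data.frame(
  transcript_key = 1:3,
  transcript_id  = c("T1", "T2", "T3"),
  stringsAsFactors = FALSE
)

# Hypothetical dimension table: zero, one or many rows per main-table key
affy_dm <- data.frame(
  transcript_key = c(1L, 1L, 3L),
  affy_id        = c("probe_a", "probe_b", "probe_c"),
  stringsAsFactors = FALSE
)

# A query joins the dimension table back onto the main table by key.
# all.x = TRUE keeps transcripts with no probe (their affy_id is NA).
result <- merge(main, affy_dm, by = "transcript_key", all.x = TRUE)
print(result)
```

A transcript with two probes appears twice in the result, which is why queries that pull in dimension-table attributes can return more rows than there are items in the main table.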

More Info: http://www.biomart.org/user-docs.pdf

Page 3:

Ensembl annotates everything at the transcript level:

[Diagram: Ensembl_transcript_1, Ensembl_transcript_2 and Ensembl_transcript_3, each linked to an AffyID and a HUGO Symbol]

AffyID    Ensembl transcript   HUGO Symbol
1939_at   ENST000003789        TP53
1939_at   ENST000003790        TP53
1939_at   ENST000003791        TP53

Affy Ids are mapped by Ensembl. If there is no clear match then that probe is not assigned to a gene.
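
Because the annotation lives at the transcript level, a probe-to-gene lookup tends to come back with one row per transcript. A minimal base-R sketch of collapsing such a result to one row per probe/gene pair follows. The data frame is typed in by hand from the example on this slide (with TP53 assumed to apply to every row) and is not the output of a real query; the column names are assumptions.

```r
# Hand-typed, transcript-level rows (one per transcript)
hits <- data.frame(
  affy_id       = rep("1939_at", 3),
  transcript_id = c("ENST000003789", "ENST000003790", "ENST000003791"),
  hugo_symbol   = rep("TP53", 3),
  stringsAsFactors = FALSE
)

# Drop the transcript column and remove the duplicated rows
gene_level <- unique(hits[, c("affy_id", "hugo_symbol")])
print(gene_level)
```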

Page 4:

Web Interface:

http://www.biomart.org/biomart/martview/

Choose a Database (mart) to query (e.g. Ensembl)

Choose a Dataset from that mart to query (e.g. Mus musculus Genes)

Page 5:

Filters

Use filters to select the members of the dataset in which you're interested

e.g. Limit to miRNA genes from Chr1

Page 6:

Attributes

Use attributes to define what bits of information you want to retrieve about the members of the dataset

eg. Gene ID, Transcript ID, Start, End and Status:

Page 7:

Results:

Page 8:

http://www.biomart.org/biomart/martview

Page 9:

www.bioconductor.org

# (current Bioconductor releases use BiocManager::install() instead of biocLite())
source("http://bioconductor.org/biocLite.R")

# Default package set
biocLite()

# OR
biocLite("someBiocPkg")

# OR
biocLite(groupName="pkgGroupName")

“Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.”

Page 10:

Core Packages:

affy, affydata, affyPLM, annaffy, annotate, Biobase, Biostrings, DynDoc, gcrma, genefilter, geneplotter, hgu95av2.db, limma, marray, matchprobes, multtest, ROC, vsn, xtable, affyQCReport.

Alternative Package Groups

lite, affy, graph, all

Full Package Listing (software): http://www.bioconductor.org/packages/release/BiocViews.html

Full Package Listing (annotation): http://www.bioconductor.org/packages/release/data/annotation/

Page 11:

Querying biomart from R:

# Install library
source("http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")

# Load library
library(biomaRt)

listMarts()

# result is just a data.frame, so you can subset it:

listMarts()[1:5,]

# or search it:

grep('ensembl', listMarts()[,1], value=TRUE)

Page 12:

# Select a mart

mart <- useMart('ensembl')

# List the available datasets (returns data.frame)

listDatasets(mart)

# Select a dataset

mart <- useDataset('mmusculus_gene_ensembl', mart=mart)

# Both in one:

mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')

Page 13:

# Available Filters (returns data.frame)
listFilters(mart)

# Available Attributes (returns data.frame)
listAttributes(mart)

# A Simple Query

getBM(filters=c('ensembl_gene_id'),
      values=c('ENSMUSG00000029249','ENSMUSG00000048482'),
      attributes=c('ensembl_gene_id', 'ensembl_transcript_id',
                   'transcript_start', 'transcript_end'),
      mart=mart)

      ensembl_gene_id ensembl_transcript_id transcript_start transcript_end
1  ENSMUSG00000029249    ENSMUST00000113448         77694516       77708955
2  ENSMUSG00000029249    ENSMUST00000113449         77695221       77715457
3  ENSMUSG00000029249    ENSMUST00000080359         77694516       77712009
4  ENSMUSG00000048482    ENSMUST00000053317        109514857      109567200
5  ENSMUSG00000048482    ENSMUST00000111052        109533720      109567200
6  ENSMUSG00000048482    ENSMUST00000111051        109516054      109567200
7  ENSMUSG00000048482    ENSMUST00000111050        109532593      109567200
8  ENSMUSG00000048482    ENSMUST00000111047        109516054      109567163
9  ENSMUSG00000048482    ENSMUST00000111049        109516054      109567163
10 ENSMUSG00000048482    ENSMUST00000111046        109517251      109567163
11 ENSMUSG00000048482    ENSMUST00000111045        109533720      109567163
12 ENSMUSG00000048482    ENSMUST00000111044        109534626      109567163
13 ENSMUSG00000048482    ENSMUST00000111043        109534626      109567163
14 ENSMUSG00000048482    ENSMUST00000111042        109534628      109567204

Page 14:

# If using multiple filters, values should be a list

# If chromosome_name, start and end filters are used they are
# automatically interpreted as 'search within this region'

getBM(filters=c('chromosome_name', 'start', 'end'),
      values=list(10, 80000000, 80050000),
      attributes=c('ensembl_gene_id', 'start_position', 'end_position'),
      mart=mart)

     ensembl_gene_id start_position end_position
1 ENSMUSG00000003346       80046400     80053049
2 ENSMUSG00000035397       80029874     80040066
3 ENSMUSG00000047417       80005138     80024286
4 ENSMUSG00000003341       79982330     80001869
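
One likely reason for the "values should be a list" rule is that c() flattens its arguments into a single vector and coerces mixed types, so the one-to-one pairing between filters and values is lost. The base-R sketch below illustrates this without calling biomaRt; the 'X' and '1' chromosome values are examples chosen for the illustration.

```r
filters <- c("chromosome_name", "start", "end")

# c() coerces everything to one type: the numbers become strings
flat <- c("X", 80000000, 80050000)
print(class(flat))

# list() keeps one element per filter, each with its own type
vals <- list("X", 80000000, 80050000)
print(sapply(vals, class))

# A list element can also carry several values for its filter
multi <- list(c("1", "X"), 80000000, 80050000)

# The same thing built with c() no longer lines up with the filters
flat_multi <- c(c("1", "X"), 80000000, 80050000)
print(length(multi) == length(filters))
print(length(flat_multi) == length(filters))
```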

Page 15:

# Filters can be either numeric, string or boolean.
# Boolean filters need a TRUE or FALSE value

# Determine type of filter with:

filterType('with_unigene', mart)

# Attributes and filters are organised into categories

# To get a list of the categories:
attributeSummary(mart)
filterSummary(mart)

# You can then list attributes and filters limited to a
# specified category:
listAttributes(mart, category='Variations')

Page 16:

# Older versions of Ensembl are archived, useful if you've
# got genome positions relative to a previous build

old.mart <- useMart('ensembl_mart_46', dataset='mmusculus_gene_ensembl', archive=TRUE)

Page 17:

Retrieving Sequences:

# can get complicated with getBM. Use the getSequence wrapper

# Genome Sequences always 5'-3' but...

# Web-Services mode (default): Strand is context dependent
# MySQL mode: Always top strand

# e.g.

# BRCA1 peptide sequence from gene symbol
getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart=mart)

# REST transcript, 20 bases upstream
getSequence(id='ENSMUST00000113448', type='ensembl_transcript_id',
            seqType='transcript_flank', upstream=20, mart=mart)

# Exon sequences for genes on chromosome 4, 10,000,000-11,000,000
getSequence(chromosome=4, start=10000000, end=11000000, mart=mart,
            seqType="gene_exon", type="ensembl_gene_id")

Page 18:

seqTypes:

Note that any of the _flank types need an 'upstream' or 'downstream' argument to determine the size of the flanking region. At the moment, you can't specify both.
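
The "one of upstream or downstream, but not both" rule can be checked before a query is sent. The helper below is a hypothetical sketch in base R: check_flank_args is not part of biomaRt, and it simply assumes that flanking seqTypes are the ones whose names end in "_flank", as in the note above.

```r
# Hypothetical helper (not a biomaRt function): validate flank arguments
check_flank_args <- function(seqType, upstream = NULL, downstream = NULL) {
  # Non-flank seqTypes need no flank size
  if (!grepl("_flank$", seqType)) {
    return(TRUE)
  }
  # Count how many of the two size arguments were supplied
  n_given <- sum(!is.null(upstream), !is.null(downstream))
  if (n_given != 1) {
    stop("flank seqTypes need exactly one of 'upstream' or 'downstream'")
  }
  TRUE
}

# Accepted: a flank type with only 'upstream'
ok <- check_flank_args("transcript_flank", upstream = 20)

# Rejected: both sizes given; capture the message instead of stopping
both <- tryCatch(
  check_flank_args("gene_flank", upstream = 20, downstream = 20),
  error = function(e) conditionMessage(e)
)
print(ok)
print(both)
```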

Page 19:

Exporting Sequences:

# The exportFASTA function provides a quick way of saving
# sequences in FASTA format:

res <- getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart = mart)

exportFASTA(res, file='sequence.fa')

Page 20:

Linking Datasets...

# Make mart connections for each of the datasets:
mouse.mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
people.mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')

# In Ensembl, datasets are made of transcripts
# from a single species.
# Linking datasets amounts to homology

# e.g. Get pos of mouse homolog to human 'TP53' gene

getLDS(attributes = c("hgnc_symbol", "chromosome_name", "start_position"),
       filters = "hgnc_symbol", values = "TP53", mart = people.mart,
       attributesL = c("chromosome_name", "start_position"),
       martL = mouse.mart)

    V1 V2      V3 V4       V5
1 TP53 17 7512445 11 69393861

Page 21:

Pretty HTML Output:

library(annotate)
# Provides the htmlpage function. Salient args are:
# genelist – a list or dataframe of IDs to be made into links
# filename
# title – for the table
# othernames – a list of other things to add to the table as is
# table.head – a character vector of col headers for the table.
# repository – a list of repositories to use for creating links

ids <- c('ENSMUSG00000029249','ENSMUSG00000048482')

genelist <- getBM(attributes=c('uniprot_swissprot_accession', 'entrezgene'),
                  filters='ensembl_gene_id', values=ids,
                  output='list', na.value='&nbsp;', mart=mart)

othernames <- getBM(attributes=c('ensembl_gene_id', 'mgi_symbol', 'description'),
                    filters='ensembl_gene_id', values=ids,
                    output='list', na.value='&nbsp;', mart=mart)

htmlpage(genelist=genelist, othernames=othernames, title='Some Genes',
         table.head=c('Uniprot', 'Entrezgene', 'Ensembl', 'Name', 'Description'),
         repository=list('sp', 'en'), filename='genes.html')

# Note that all the lists are expected to be in the right order
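
If two query results might come back with their rows in different orders, one way to line them up is to reorder each against the original ID vector with match(). This is a base-R sketch on hand-made data: the 'label' column and its values are placeholders, not real annotation, and no biomaRt call is made.

```r
ids <- c('ENSMUSG00000029249', 'ENSMUSG00000048482')

# Pretend result whose rows arrived in the opposite order to 'ids'
res <- data.frame(
  ensembl_gene_id = rev(ids),
  label           = c("row for second id", "row for first id"),
  stringsAsFactors = FALSE
)

# match() gives, for each id, the row of 'res' holding it
res_ordered <- res[match(ids, res$ensembl_gene_id), ]
print(res_ordered)
```

Applying the same reordering to every table before handing it to htmlpage keeps each row of links next to the right names.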

Page 22:

Page 23:

More Info...

Bioconductor Mailing List:

http://www.bioconductor.org/docs/mailList.html

biomaRt Users' Guide:

vignette('biomaRt')

Biomart Website

http://www.biomart.org

Slides & examples:

http://www.cassj.co.uk/biomart_slides.ppt

http://www.cassj.co.uk/worksheet.txt

http://www.cassj.co.uk/worksheet_code.R