Data retrieval

• BioMart

• Data sets on FTP site

• MySQL queries of databases

• Perl API access to databases

• ExportView

ExportView

Data Mining in Ensembl with EnsMart

August 2005

Possible queries…

• All genes from a candidate region

• Genes with a particular protein domain

• Members of a protein family

• Genes associated with SNPs

More specific queries

• Human genes with upstream regions conserved with respect to mouse

• Upstream sequences for all Ensembl genes mapped to the U95A chip (similarly, complete genomic annotation of MG_U74)

• Genomic location and description of all mouse, rat and fugu homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and have non-synonymous SNPs

Ensembl core database

• Normalised

• Each data point stored only once

• Quick updates

• Minimal storage requirements

But:

• Many tables

• Many joins for complicated queries

• Slow for data mining questions

BioMart and EnsMart

• Large-scale data retrieval tool

• Query builder interface

• Databases: Ensembl, SNP, Vega, (MSD, UniProt)

• Associated features or sequences

• Flexible output formats

• http://www.ebi.ac.uk/biomart/

• http://www.ensembl.org/EnsMart/

Mart database

• De-normalised

• Tables with ‘redundant’ information

• Query-optimised

• Fast and flexible

• Designed for data mining
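
The trade-off between a normalised core database and a de-normalised mart can be sketched with a toy example. The table and column names below are hypothetical and greatly simplified (they are not the real Ensembl or EnsMart schema), and SQLite stands in for MySQL so that the sketch runs without a server. The same question needs three joins on the normalised layout and none on the wide "mart" table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout (hypothetical schema): each fact is stored once,
# so answering a question means joining several tables.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, stable_id TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE translation (translation_id INTEGER PRIMARY KEY, transcript_id INTEGER);
CREATE TABLE protein_domain (translation_id INTEGER, domain TEXT);

INSERT INTO gene VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2');
INSERT INTO transcript VALUES (10, 1), (20, 2);
INSERT INTO translation VALUES (100, 10), (200, 20);
INSERT INTO protein_domain VALUES (100, 'kinase'), (200, 'zinc_finger');
""")

JOINS = """
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN translation p ON p.transcript_id = t.transcript_id
JOIN protein_domain d ON d.translation_id = p.translation_id
"""

# "Genes with a particular protein domain" on the normalised tables.
normalised_hits = [
    row[0]
    for row in cur.execute(
        "SELECT g.stable_id " + JOINS + " WHERE d.domain = ?", ("kinase",)
    )
]

# De-normalised layout: pay for the joins once, when the mart is built,
# and keep the 'redundant' wide table for querying.
cur.execute(
    "CREATE TABLE gene_mart AS SELECT g.stable_id, g.chromosome, d.domain " + JOINS
)

# The same question is now a single-table lookup.
mart_hits = [
    row[0]
    for row in cur.execute(
        "SELECT stable_id FROM gene_mart WHERE domain = ?", ("kinase",)
    )
]

print(normalised_hits, mart_hits)
```

Both queries return the same genes; the difference is where the join cost is paid, which is why the wide table has to be rebuilt whenever the underlying data change.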

Primary Data Sets

• Ensembl genes

• SNP

  – Single nucleotide polymorphisms

  – Deletion-insertion polymorphisms

  – Short tandem repeats

• Vega genes

• (MSD protein structures)

• (UniProt proteomes)

Secondary Data Sets

• Markers

• Diseases

• Gene ontology

• Gene expression information

• Homology predictions

• Protein annotation

Information flow

[Diagram: start → filter → output. Labels shown: SPECIES, FOCUS; REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION (this group appears twice); REFSEQ, INTERPRO, GO, SWISSPROT, EMBL, AFFY; FASTA, FILE, EXCEL, TEXT, GTF, HTML.]
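
The three stages named on this slide (start, filter, output) can be illustrated with a small in-memory sketch. Everything here is hypothetical: the `Gene` record, the made-up data and the function names are for illustration only and are not part of EnsMart or BioMart. Only the TEXT and FASTA output formats from the diagram are imitated.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    species: str
    stable_id: str
    chromosome: str
    start: int
    end: int
    sequence: str


# Made-up records, for illustration only.
GENES = [
    Gene("human", "GENE_A", "1", 100, 500, "ATGGCC"),
    Gene("human", "GENE_B", "2", 200, 900, "ATGTTT"),
    Gene("mouse", "GENE_C", "1", 150, 450, "ATGAAA"),
]


def select_start(genes, species):
    """Stage 1 (start): choose the species to work on."""
    return [g for g in genes if g.species == species]


def apply_filter(genes, chromosome=None, region=None):
    """Stage 2 (filter): keep genes on a chromosome that overlap (start, end)."""
    kept = []
    for g in genes:
        if chromosome is not None and g.chromosome != chromosome:
            continue
        if region is not None and not (g.start <= region[1] and g.end >= region[0]):
            continue
        kept.append(g)
    return kept


def write_output(genes, fmt="TEXT"):
    """Stage 3 (output): render the remaining genes as tab-separated TEXT or FASTA."""
    if fmt == "TEXT":
        return "\n".join(
            f"{g.stable_id}\t{g.chromosome}\t{g.start}\t{g.end}" for g in genes
        )
    if fmt == "FASTA":
        return "\n".join(f">{g.stable_id}\n{g.sequence}" for g in genes)
    raise ValueError(f"unsupported format: {fmt}")


selected = apply_filter(select_start(GENES, "human"), chromosome="1", region=(1, 1000))
text_out = write_output(selected, "TEXT")
fasta_out = write_output(selected, "FASTA")
print(text_out)
print(fasta_out)
```

Keeping the stages separate mirrors the diagram: the same filtered set can be sent to any output format without repeating the selection.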

BioMart

http://www.biomart.org/

BioMart - Features

BioMart - Sequences

Output formats

HTML

What about queries not possible to do in EnsMart?

• Direct database access at ensembldb.ensembl.org

• martdb.ebi.ac.uk

• MySQL client

Download MySQL for Windows

http://www.winmysql.com/page4.html

File: wmysr11.zip
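
As a rough sketch of what direct access with a MySQL client looks like, the helper below only assembles the command line for the standard `mysql` client; it does not connect to anything. The host name comes from the slide. The `anonymous` user name and the helper `mysql_command` itself are assumptions made for illustration, so check the current Ensembl documentation for the user and port to use.

```python
import shlex


def mysql_command(host, user="anonymous", database=None, query=None):
    """Build an argument list for the mysql command-line client.

    The default user name is an assumption, not taken from the slides.
    """
    argv = ["mysql", "--host", host, "--user", user]
    if database:
        argv += ["--database", database]
    if query:
        argv += ["--execute", query]
    return argv


# Listing the available databases needs no knowledge of any schema.
cmd = mysql_command("ensembldb.ensembl.org", query="SHOW DATABASES")
command_line = shlex.join(cmd)  # shell-quoted form, for copy and paste
print(command_line)
```

The argument list can be passed to `subprocess.run(cmd)` on a machine where the `mysql` client is installed and the server is reachable.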

Access via Perl object API

• Based on BioPerl

• Ensembl modules

• For an introduction, see the tutorial at: http://www.ensembl.org/info/software/core/

There are other ways…

MartShell

• Command-line interface to Mart, written in Java

• It works with a Mart Query Language

MartExplorer