Ewan Birney (tweetable) Curation

Ewan Birney Biocuration 2013


DESCRIPTION

Ewan Birney's slides from Biocuration 2013

Page 1: Ewan Birney Biocuration 2013

Ewan Birney (tweetable)

Curation

Page 2: Ewan Birney Biocuration 2013

Who am I?

• Associate Director at the European Bioinformatics Institute (EBI)

• Involved in genomics since I was 19 (> 20 years!)

• Trained as a biochemist – most people think I am CS

• Analysed – sometimes led – human/mouse/rat/platypus etc. genomes, ENCODE, others.

EBI is in Hinxton, South Cambridgeshire

EBI is part of EMBL, ~like CERN for molecular biology

Page 3: Ewan Birney Biocuration 2013

Molecular Biology

• The study of how life works – at a molecular level

• Key molecules:

• DNA – Information store (Disk)

• RNA – Key information transformer, also does stuff (RAM)

• Proteins – The business end of life (Chip, robotic arms)

• Metabolites – Fuel and signalling molecules (electricity)

• Theories of how these interact – no theories to predict what they are

• Instead we determine attributes of molecules and store them in globally accessible, open, databases

Page 4: Ewan Birney Biocuration 2013

[Figure: fields placed on an axis from Theory ("Can accurately predict from models") to Observation ("Must directly observe"). Fields shown: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics.]

Page 5: Ewan Birney Biocuration 2013

This ratio is not well correlated with data size

[Figure: ratio of model predictability plotted against data size (labelled values: ~60PB and ~5PB) for Molecular Biology, Astronomy, Climate Models and High Energy Physics.]

Page 6: Ewan Birney Biocuration 2013

“Knowing stuff” is critical to biology…

• The bases of the human genome

• … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow….

• The functions of proteins

• Enzymes, Transcription Factors, Signalling….

• The types of cells, their lineages and organ composition

• …and all the molecular components in each cell

• Small molecules

• … and their conversions, binding partners

• Structures of molecules, complexes and cells

• … at atomic and higher resolution

Page 7: Ewan Birney Biocuration 2013

Two fundamental types of information

• Experimental data

• The result of a specific experiment

• Often an experiment-specific, data-heavy part plus a “metadata” part

• Might be contradictory

• “Primary paper”

• Consensus Knowledge

• Integration of different strands of information on a topic

• Realised as a computationally accessible scheme

• “Review article”
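
As a rough illustration of the two types above (not something taken from the slides), the distinction can be sketched as two record types: one holding a single experiment's data plus its metadata, the other integrating several experiments on a topic. All class and field names here are hypothetical, and the way disagreement is detected (comparing free-text claims) is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    """Result of one specific experiment (the "primary paper" type)."""
    data: list[float]          # experiment-specific, data-heavy part
    metadata: dict[str, str]   # metadata part: sample, method, and so on
    claim: str                 # what this experiment reports; may contradict others


@dataclass
class ConsensusEntry:
    """Integrated view of one topic (the "review article" type)."""
    topic: str
    evidence: list[ExperimentalRecord] = field(default_factory=list)

    def claims(self) -> dict[str, int]:
        """Count how many experiments support each distinct claim."""
        counts: dict[str, int] = {}
        for record in self.evidence:
            counts[record.claim] = counts.get(record.claim, 0) + 1
        return counts

    def is_contested(self) -> bool:
        """True when the underlying experiments disagree."""
        return len(self.claims()) > 1
```

In this sketch a curator's integration work shows up as the step that turns a contested `ConsensusEntry` into a single computationally accessible statement; the code only flags where that work is needed.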

Page 8: Ewan Birney Biocuration 2013

Five types of curation

Page 9: Ewan Birney Biocuration 2013

Experimental Data Entry

• IntAct – Protein:Protein interactions

• GWAS Catalog – extraction of summary statistics

Page 10: Ewan Birney Biocuration 2013

Experimental metadata capture

• Sample, CDS lines in ENA

• Sample in MetaboLights, PRIDE etc.

• Machine and analysis specification in PDB, PRIDE, ENA

Page 11: Ewan Birney Biocuration 2013

Consensus integration of information

• GENCODE gene models in human

• Summaries and GO assignment in UniProt

• Pathway information in Reactome

• GO assignment and summaries in MODs (e.g., PomBase, WormBase, PhytoPathDB etc.)

Page 12: Ewan Birney Biocuration 2013

Knowledge frameworks

• The EC classification

• Cell type ontologies

• Cell lineages – Worms!

• SNOMED, HPO etc.

• GO ontologies

Page 13: Ewan Birney Biocuration 2013

Knowledge management

• Creation of rules representing ENA standards compliance

• Cross-ontology coordination (e.g., EFO) or tying (GO, ChEBI)

• RuleBase / UniRule curation processes
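
One way to picture "rules representing standards compliance" is as a list of named checks applied to a submitted metadata record. The sketch below is only an illustration under that assumption: the field names and the two rules are invented for the example and are not actual ENA, RuleBase or UniRule rules.

```python
from typing import Callable

Record = dict[str, str]
Rule = tuple[str, Callable[[Record], bool]]

# Hypothetical rules for illustration; these are NOT real ENA checks.
RULES: list[Rule] = [
    ("organism is present",
     lambda r: bool(r.get("organism", "").strip())),
    ("collection_date looks like YYYY-MM-DD",
     lambda r: len(r.get("collection_date", "")) == 10
     and r["collection_date"][4] == "-"
     and r["collection_date"][7] == "-"),
]


def failed_rules(record: Record, rules: list[Rule] = RULES) -> list[str]:
    """Return the name of every rule the record does not satisfy."""
    return [name for name, check in rules if not check(record)]
```

Keeping each rule as data (a name plus a check) rather than burying it in a script is what lets curators review, add and retire rules without reprogramming the pipeline.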

Page 14: Ewan Birney Biocuration 2013

Data Entry vs Programming

[Figure: a spectrum from Direct Data Entry to Programmatic Data Entry, with labels: Improved data entry tools; RuleBase, Computationally Accessible Standards; “Messy” Scripting.]

Page 15: Ewan Birney Biocuration 2013

Thank You!

Page 16: Ewan Birney Biocuration 2013

Curation Dilemma

• If you do your job well…

• Everyone assumes it’s easy

• People forget about the complexity

• You are ignored

• If you do your job badly…

• Everyone assumes it’s easy

• People forget about the complexity

• People complain

Page 17: Ewan Birney Biocuration 2013

Why we need an infrastructure…

Page 18: Ewan Birney Biocuration 2013

Infrastructures are critical…

Page 19: Ewan Birney Biocuration 2013

But we only notice them when they go wrong

Page 20: Ewan Birney Biocuration 2013

Biology already needs an information infrastructure

• For the human genome

• (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl

• For the function of genes and proteins

• For all genes, in text and computational form – UniProt and GO

• For all 3D structures

• To understand how proteins work – PDBe

• For where things are expressed

• The differences and functionality of cells - Atlas

Page 21: Ewan Birney Biocuration 2013

..But this keeps on going…

• We have to scale across all of (interesting) life

• There are a lot of species out there!

• We have to handle new areas, in particular medicine

• A set of European haplotypes for good imputation

• A set of actionable variants in germline and cancers

• We have to improve our chemical understanding

• Of biological chemicals

• Of chemicals which interfere with Biology

Page 22: Ewan Birney Biocuration 2013

ELIXIR’s mission

To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:

• medicine

• environment

• bioindustries

• society

Page 23: Ewan Birney Biocuration 2013

How?

Fully Centralised

Pros: Stability, reuse, ease of learning

Cons: Hard to concentrate expertise across all of life science; geographic and language placement; bottlenecks and lack of diversity

Fully Distributed

Pros: Responsive; geographically and linguistically responsive

Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability

Page 24: Ewan Birney Biocuration 2013

International; EBI / ELIXIR; English; low legalities

National; healthcare; national language; complex legalities

Research Healthcare

Page 25: Ewan Birney Biocuration 2013

Other infrastructures needed for biology

• EuroBioImaging

• Cellular and whole organism Imaging

• BioBanks (BBMRI)

• We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease

• Mouse models and phenotypes (Infrafrontier)

• A baseline set of knockouts and phenotypes in our most tractable mammalian model

• (it’s hard to prove something in human)

• Robust molecular assays in a clinical setting (EATRIS)

• The ability to reliably use state of the art molecular techniques in a clinical research setting

Page 26: Ewan Birney Biocuration 2013

Questions?

(You can follow me on Twitter @ewanbirney.) I blog and update this publicly on Google Plus.