Beyond the Tsunami: Dealing with Life Sciences Data


Microsoft E Science 2009


[1]

Beyond the Tsunami: Developing the Infrastructure to Deal with Life Sciences Data

Christopher Southan and Graham Cameron, EMBL-European Bioinformatics Institute (EBI), Cambridge, U.K.

[2]

EBI and Sanger at Hinxton: Engaging with the Data Challenges

• Technology for sequence data generation and reduction
• Repositories, storage, archiving
• Databases, entity linking, infrastructure and utility
• Biocuration, annotation, standards, ontologies
• Experimental biological data from research groups
• Data exploitation, mining and visualisation
• Biological hypothesis iteration

[3]

EMBL-Bank

[Chart: EMBL-Bank growth; axis range 0 to 3E+11]

Release 101, Aug 2009, 163 million entries, 283 billion bases

[4]

10 years of Rapid Growth

GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP.
08-OCT-2009 (Rel. 102, Created)
08-OCT-2009 (Rel. 102, Last updated, Version 1)
Influenza A virus (A/Chengdu/03/2009(H1N1)) segment 4 hemagglutinin (HA)
Jiang T., Qin C., Li X., Zhao H., Yu M., Deng Y., Yu X., Han J., Qin E., Zhu Q.;
"A community transmission of influenza A (H1N1) virus in a boarding school in China, 22-27 July 2009"

AF177758; SV 1; linear; mRNA; STD; HUM; 1868 BP.
10-SEP-1999 (Rel. 61, Created)
07-OCT-2008 (Rel. 97, Last updated, Version 6)
Homo sapiens ubiquitin specific protease 16 (USP16) mRNA, complete cds.
PUBMED; 10786635.
Smith T.S., Southan C.;
"Sequencing, tissue distribution and chromosomal assignment of a novel ubiquitin-specific protease USP23";
Biochim. Biophys. Acta 1490(1-2):184-88(2000).
Ensembl-Gn; ENSG00000143258; Homo_sapiens.
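Both entries above open with the same run of semicolon-separated header fields (accession, sequence version, topology, molecule type, data class, taxonomic division, length). As a minimal sketch of how such a header could be split into named fields, the Python below assumes exactly seven fields in that order; the names `EmblIdLine` and `parse_id_fields` are hypothetical and are not part of any EBI tool.

```python
from dataclasses import dataclass


@dataclass
class EmblIdLine:
    """Fields of an EMBL-style entry header (illustrative names)."""
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    taxonomic_division: str
    length_bp: int


def parse_id_fields(text: str) -> EmblIdLine:
    """Split a header such as
    'GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP.'
    into its seven fields. Assumes the field order shown on the slide."""
    text = text.strip()
    # Tolerate an optional leading "ID" line code.
    if text.startswith("ID "):
        text = text[3:].strip()

    parts = [part.strip() for part in text.split(";")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, found {len(parts)}")

    accession, version, topology, molecule, data_class, division, length = parts

    # The version field looks like "SV 1".
    version_tokens = version.split()
    if len(version_tokens) != 2 or version_tokens[0] != "SV":
        raise ValueError(f"unexpected version field: {version!r}")

    # The length field looks like "1701 BP." (trailing full stop).
    length_tokens = length.rstrip(".").split()
    if len(length_tokens) != 2 or length_tokens[1] != "BP":
        raise ValueError(f"unexpected length field: {length!r}")

    return EmblIdLine(
        accession=accession,
        sequence_version=int(version_tokens[1]),
        topology=topology,
        molecule_type=molecule,
        data_class=data_class,
        taxonomic_division=division,
        length_bp=int(length_tokens[0]),
    )


if __name__ == "__main__":
    print(parse_id_fields("GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP."))
```

The sketch validates only the shape of the version and length fields; real entries carry many more line types (dates, description, references, cross-references) that it does not attempt to handle.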

[5]

New Technology > New Data Archives

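[Chart: archive volume in TB. Values shown: 1.9, 70, 35. Categories shown: Assembled sequence, Capillary traces, Next-gen reads. The pairing of values to categories is not recoverable from this extract.]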

European Nucleotide Archive Snapshot March 2009

[6]

Accelerating Genome Coverage

Jan 2009, 4370 projects

[7]

from EBI/Sanger

[8]

The 1000 Genomes Project: Cataloging Human Genetic Variation

• Initial human genome: 10 years and 40 gigabases
• Over the next two years the equivalent of two human genomes will be produced every 24 hours
• Completed dataset will be 6 trillion DNA bases, 500 TB
• 60-fold more than 28 years of EMBL-Bank
• Expected to cover 1200 genomes

[9]

Data Exploitation: EBI Accesses

Last 4 years of hit-rates for web pages and web services

[Chart: axis range 0 to 1,200,000; series: CGI, API]

[10]

• Genomes
• Nucleotide sequence
• Expression
• Proteomes
• Protein families and domains
• Protein structure
• Protein interactions
• Chemical entities
• Pathways
• Systems
• Literature, ontologies

Towards a sustainable infrastructure for biological information in Europe, to support life science, translation to medicine, the environment, bio-industries and society.

[11]

Conclusions

• The International Nucleotide Sequence Database Collaboration will exceed 300 billion bases in 2009.
• Storage at the EBI has doubled annually and is now 5 petabytes.
• Next-generation sequencing is increasing data production ~10-fold.
• By 2010 the full genomic variation in over 1000 people will be revealed and genomes from over 1000 species completed.
• An increase in data mining is needed to facilitate conversion into knowledge.
• The European ELIXIR project and other global initiatives to enhance the sustainable infrastructure for biological databases are essential.
• The impact of data-intensive computing on the Life Sciences will be profound and transforming.
• Exploitation will bring major benefits for biology, medicine, agriculture, biofuels and environmental science.
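To make the storage figure concrete, here is a rough sketch of what annual doubling from 5 petabytes implies. It assumes the doubling rate simply continues (or held in the past), which the slides do not claim; the function name `projected_storage_pb` is hypothetical.

```python
def projected_storage_pb(current_pb: float, years: float,
                         doubling_period_years: float = 1.0) -> float:
    """Storage after `years`, assuming it doubles every
    `doubling_period_years`. Negative `years` looks backwards."""
    return current_pb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Starting point taken from the slide: 5 PB in 2009.
    for offset in (-2, -1, 0, 1, 2, 3):
        print(f"{2009 + offset}: {projected_storage_pb(5.0, offset):g} PB")
```

Under this assumption, three further doublings would take 5 PB to 40 PB, which illustrates why the conclusions stress sustainable infrastructure.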
