• Big Data: data sets whose size and complexity are beyond the capabilities of commonly used tools to capture, manage, and process the data within a tolerable time frame.
• Big Data: a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set, with different types of data sets potentially deeply intertwined.
- Wikipedia (http://en.wikipedia.org/wiki/Big_data)
Challenges: the scope and scale of life sciences data continue to grow
Working with Big Data
Coming into the Genome Age
For the first time in the history of science, students can work with the same data and tools that are used by researchers.
Learning by posing and answering questions.
Students generate new knowledge.
The iPlant Collaborative: Vision
Enable life science researchers and educators to use and extend iPlant's foundational cyberinfrastructure to understand and ultimately predict the complexity of biological systems and their dynamic nature under various environmental conditions.
The iPlant Collaborative: What is Cyberinfrastructure?
Cyberinfrastructure (CI) is data storage, software, high-performance computing, and people – organized into systems that solve problems of size and scope that would not otherwise be solvable.
• Platforms, tools, datasets
• Storage and compute
• Training and support
The iPlant Collaborative: What problems can iPlant solve?
• Crops and model plant systems
• Animal and livestock
• Agronomic microbes, insects…
“I had the feeling I have been exposed to many bioinformatics tools but I would be unable to use any of them on my own.”
The limitations of any training workshop
3. Keep asking questions
• If iPlant can, we’ll help show you how…
• If iPlant can’t, we’ll find the path that gets you what you need
Don’t hesitate to ask “Can iPlant do this?”
Keep asking at ask.iplantcollaborative.org
Bringing Genomics into the Classroom
Visualization of the Pectobacterium atrosepticum genome
http://www.scri.ac.uk/research/pp/plantpathogengenomics/pathogenbioinformatics
“Essentially, all models are wrong, but some are useful” – George E.P. Box
From This…
• 1866 – Mendel publishes work on inheritance
• 1869 – DNA discovered
• 1915 – Hunt Morgan describes linkage and recombination
• 1953 – Structure of DNA described
• 1956 – Human chromosome number determined
• 1968 – First gene mapped to an autosome
• 1977 – Dideoxy sequencing
• 1983 – PCR
• 1986 – Human Genome Project proposed
• 1993 – First microRNAs described
• 2003 – First ‘Gold Standard’ human genome sequence
• 2005 – First draft of human haplotype map (HapMap)
• 2007 – ENCODE project
Timeline: Wellcome Trust (http://www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtx063807.pdf)
1973: Sharp, Sambrook, Sugden
Gel Electrophoresis Chamber, $250
1958 Matt Meselson &
Ultracentrifuge, $500,000
The Egalitarian Gene: Agarose Gel Electrophoresis, 1973
The Egalitarian Genome: Next Generation Sequencing, 2005
Bacterial colonies PCR colonies (clusters, features)
Hundreds of millions of…
Research Education
For the first time in the history of biology, students can work with the same data at the same time and with the same tools as research scientists.
Educational Challenge
Context of scientific discovery
iPlant Genomics in Education Workshop
Major Workshop Concepts:
• Biology is becoming a “Data Unlimited” science.
• Genomes are dynamic.
• Genomes are more than just protein coding genes.
• DNA sequence is information.
• Gene annotation adds “meaning” to DNA sequence.
• Biological concepts like “genes” and “species” are continually evolving.
• DNA barcoding bridges molecular genetics, evolution, and ecology.
The Problem of Big Data in Biology
The abundance of biological data generated by high-throughput sequencing creates challenges, as well as opportunities:
• How do scientists share their data and make it publicly available?
•How do scientists extract maximum value from the datasets they generate?
•How can students and educators (who will need to come to grips with data-intensive biology) be brought into the fold?
• Majority of genome is transcribed
• ~50% transposons
• ~25% protein coding genes / 1.3% exons
• ~23,700 protein coding genes
• ~160,000 transcripts
• Average gene: ~36,000 bp
• 7 exons @ ~300 bp
• 6 introns @ ~5,700 bp
• 7 alternatively spliced products (95% of genes)
• RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
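The per-gene figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original slides: it rebuilds the average gene length from the exon and intron counts, then estimates what share of the genome the genes and exons would cover. The genome size of roughly 3,100 Mb is an assumption added for illustration; the slide does not state it. Function and constant names are hypothetical.

```python
# Approximate figures taken from the slide.
EXONS_PER_GENE = 7
EXON_LENGTH_BP = 300
INTRONS_PER_GENE = 6
INTRON_LENGTH_BP = 5_700
PROTEIN_CODING_GENES = 23_700

# Assumption (not from the slide): a haploid human genome of about 3,100 Mb.
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def exon_bp_per_gene():
    """Total exon length of one average gene, in base pairs."""
    return EXONS_PER_GENE * EXON_LENGTH_BP


def gene_length_bp():
    """Average gene length: all exons plus all introns, in base pairs."""
    return exon_bp_per_gene() + INTRONS_PER_GENE * INTRON_LENGTH_BP


def genome_fraction(bp_per_gene, genome_size_bp=ASSUMED_GENOME_SIZE_BP):
    """Share of the genome covered if every gene contributes bp_per_gene."""
    return PROTEIN_CODING_GENES * bp_per_gene / genome_size_bp


if __name__ == "__main__":
    print(f"Average gene length: {gene_length_bp():,} bp")
    print(f"Exon share of one gene: {exon_bp_per_gene() / gene_length_bp():.1%}")
    print(f"Genes as share of genome: {genome_fraction(gene_length_bp()):.1%}")
    print(f"Exons as share of genome: {genome_fraction(exon_bp_per_gene()):.1%}")
```

Under these assumptions the gene length comes to 36,300 bp, which matches the slide's ~36,000 bp. The genome-wide shares come out near 28% for genes and 1.6% for exons, in the same range as the slide's ~25% and 1.3%. The gap is expected, since every input is a rounded average.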
Using Plants to Explore Genomics
The “weirdness” of plant genomes on your dinner plate
Triticum aestivum: allohexaploid
[Figure: chromosome numbering for Brachypodium, Sorghum, and Oryza, shown with a phylogeny of monocots and dicots. The tree is annotated with divergence times (9 to 150-300 million years before present), genome sizes (145 Mb to >20,000 Mb, with two marked ?? Mb), and genome duplication events. Taxa shown: Oryza (rice), Avena (oats), Hordeum (barley), Triticum (wheat), Setaria (foxtail millet), Pennisetum (pearl millet), Sorghum, Zea (maize), Arabidopsis, Brachypodium, Glycine max (soy).]