Scientific Databasing with TreeGenes: Genotype, Phenotype, & Environment Jill Wegrzyn Department of Ecology & Evolutionary Biology Institute for Systems Genomics: Computational Biology Core University of Connecticut, Storrs CT treegenesdb.org

Page 1

Scientific Databasing with TreeGenes:
Genotype, Phenotype, & Environment

Jill Wegrzyn
Department of Ecology & Evolutionary Biology
Institute for Systems Genomics: Computational Biology Core
University of Connecticut, Storrs CT

treegenesdb.org

Page 2

Big Data in Genomics

“Compared genomics with three other major generators of Big Data: Astronomy, YouTube, and Twitter... Genomics is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis”

Unit        Size (bytes)
Byte        1
Kilobyte    1,000
Megabyte    1,000,000
Gigabyte    1,000,000,000
Terabyte    1,000,000,000,000
Petabyte    1,000,000,000,000,000
Exabyte     1,000,000,000,000,000,000
Zettabyte   1,000,000,000,000,000,000,000

Mostly Genomic but… Proteomics, Phenomics, Metabolomics…

Page 3

Genomes are vast information repositories

• Kb = 1,000 bp
• Mb = 1 x 10^6 bp
• Gb = 1 x 10^9 bp
• Tb = 1 x 10^12 bp
• Pb = 1 x 10^15 bp

[Figure: genome size scale marked at 1 Gb, 10 Gb, and 100 Gb]

H: 3 Gb
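To relate the base-pair units above to the byte units on the Big Data slide, the following Python sketch estimates the minimum storage for a genome. It assumes 2 bits per base (A, C, G, T only), which is an illustrative assumption and not a figure from the slides; real sequence files also hold ambiguity codes, quality scores, and metadata, so they are considerably larger. The function names are hypothetical.

```python
# Assumption: 2 bits per base (A, C, G, T only). Real files are larger
# because they also store ambiguity codes, quality scores, and metadata.
BITS_PER_BASE = 2

# Decimal units as listed on the Big Data slide, largest first.
UNITS = [
    ("Zettabyte", 10**21),
    ("Exabyte", 10**18),
    ("Petabyte", 10**15),
    ("Terabyte", 10**12),
    ("Gigabyte", 10**9),
    ("Megabyte", 10**6),
    ("Kilobyte", 10**3),
    ("Byte", 1),
]


def genome_storage_bytes(n_bases: int) -> int:
    """Minimum bytes needed to store n_bases at BITS_PER_BASE, rounded up."""
    return (n_bases * BITS_PER_BASE + 7) // 8


def format_size(n_bytes: int) -> str:
    """Express a byte count in the largest unit that does not exceed it."""
    for name, size in UNITS:
        if n_bytes >= size:
            return f"{n_bytes / size:.2f} {name}"
    return "0 Byte"


for label, n_bases in [("Human, 3 Gb", 3 * 10**9), ("100 Gb genome", 100 * 10**9)]:
    print(label, "->", format_size(genome_storage_bytes(n_bases)))
```

Under this assumption a 3 Gb genome packs into 750 Megabytes and a 100 Gb genome into 25 Gigabytes, before any of the raw reads or annotations that dominate real storage needs.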

Page 4

Acquiring Knowledge through Big Data

Sensors & Metadata: Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies
IO Systems: Hard drives, Networking, Databases, Compression, LIMS
Compute Systems: CPU, GPU, Distributed, Clouds
Scalable Algorithms: Streaming, Sampling, Indexing
Machine Learning: classification, modeling, visualization & data integration
Results
Domain Knowledge

Page 5

Gene Conservation of Tree Species – Banking on the Future (2016)

• Survey Conducted
  – Breeders, Geneticists, Land Managers, and Ecologists
  – 31 Questions
    • Trees (greenhouse, plots, landscape, numbers, species)
    • Data collection (devices, software)
    • Analytical tools (statistical, databases)
    • Data storage
    • Challenges
  – 283 Respondents

Page 6

Gene Conservation of Tree Species – Banking on the Future (2016)

[Bar chart, axis scale 0 to 80. Categories: Computational Resources; Formatting Data; Hosting Data on the Web; Accessing Data from Databases; Integrating Data Across Databases; Scripting Support to Extract Information]

Page 7

Motivation (Data Provider)

• Support next-generation data requirements for the biological database
  – Increased quantity and availability of new data
  – Support data integration across resources
  – Support complex data analytics
  – Move data efficiently

Page 8

TreeGenes Database: History

– Began to hold forest tree genetic maps and associated markers
– Expanded to other data types
  • Sequence
    – Resequencing, Large-Scale Genotyping, Transcriptomics/Expression
    – Full Genome Sequences
  • Analysis and Visualization Tools
    – Ability for users to mine the data
  • Resources for the user community
    – Literature, Colleagues

Page 9

TreeGenes Database: Users

[Chart: Unique Web Visitors to TreeGenes Database per month, January-December 2016; axis label: 10,000]

2,086 users from 862 organizations in 94 countries

Page 10

TreeGenes Database: Species

• 1,774 species from 101 genera
  – At least one genetic artifact from each species
• Full genome sequence: 21 species
• Transcriptome/Expression resources: 4,120,817 sequences from 283 species
• 106 genetic maps from 35 species

Page 11

TreeGenes Database: Species

Page 12

TreeGenes Database: Data Sources

Primary data sources (semi-automated)
• Primary databases such as NCBI/EBI
• Appropriate data should be submitted to primary databases
• Consistent with changing standards
  – Currently no repository for non-human SNPs (new!)

User submissions
• For data and metadata not captured well by primary databases (Journals)

Project submissions
• Internal project management (private to public)

Curated Sources
• Phytozome and PlantGDB
• PLAZA (OrthoFinder)
• TRY-DB (Phenotypes)
• Dryad (Flat files)

Page 13

TreeGenes Database: Data Sources

Data that is not collected!

Submit genetic maps, association or population study data

Most submissions result from journal requirements: Tree Genetics and Genomes, New Phytologist, and Forests

Page 14

TreeGenes Database: Data Sources

Population Study
• Publication
• Species

Study Design
• Landscape
• Common Garden
• Greenhouse
• Growth Chamber
• Breeding (Plot)

Phenotype, Genotype, Environment
• Georeferenced

Raw Data
• Trees
• Genotypes
• Phenotypes

Page 15

TreeGenes Database: Data Sources

Metadata on published studies!

Genetic maps, association or population studies

Page 16

TreeGenes Database: Data Sources

Genetic maps, association or population studies

Obtain TGDR accession number!

Page 17

Open source content management system (CMS) and database for biological data

Modules for genetic, genomic, and breeding data generated through a CMS and standardized schema

Benefits:
• Reduces development costs
• Provides an API for complete customization
• Uses GMOD Chado and community ontologies for standardization
• Access control for user/user groups
• Allows for sharing of extensions between sites
  – Implemented in over 30 databases!

Page 18

Current State of Tripal

• http://tripal.info
• Content Management System for Biological Data
• Over 100 Installations
• Current Version 2.0

Page 19

Tripal Gateway Project (Data Provider)

• Support next-generation data requirements for the biological database
• Tripal Gateway Project
  – Increased quantity and availability of new data
  – Support data integration across resources (Web Services) – Tripal Exchange (v3.0)
  – Support complex data analytics (Integration with Galaxy API)
  – Move data efficiently (Software Defined Networking – Tripal Data Transfer BDSS)

Page 20

Tripal Gateway Project: Tree (& Legume) Databases

[Diagram groups: Project PIs; Collaborating Databases; Data Transfer Collaborators; Data Analysis Collaborators]

Alex Feltus, Kuangching Wang, Clemson Univ.: Data Transfer, SDN, SOS

Dorrie Main, Sook Jung, Stephen Ficklin, Washington State University
• Genome Database for Rosaceae
• Cool Season Food Legumes
• Citrus Genome Database

Kirstin Bett, Lacey Sanderson, Univ of Saskatchewan
• KnowPulse

Jill Wegrzyn, University of Connecticut
• TreeGenes

Meg Staton, University of Tennessee
• Hardwood Genomics

Steve Cannon, Ethy Cannon, Iowa State; Andrew Farmer, NCGR
• LegumeInfo, PeanutBase

University of Utah, NSF ACI-REF Collaborators

Galaxy Project, Texas Advanced Computing Center, public Galaxy Server

Page 21

TreeGenes Database: Interfaces

Page 22

Web-based framework (Galaxy) promotes genomics analysis

Page 23

Integrating Galaxy with Tripal

Page 24

Data analysis brought to the user via the database with Galaxy Workflows

DNA Sequence Data
• Re-sequencing alignment
• Variant discovery (against the reference)
• Variant discovery (between samples)
• Prediction of functional genetic variants
• Association Genetics
• Functional Annotation

RNA Sequence Data
• Transcriptome assembly
• Alignment to a reference
• Differential Expression analysis
• Gene co-expression network construction
• miRNA analysis

Page 25

BDSS: Big Data Smart Socket

• Smart Data Transfer
• Standalone client with a metadata repository
• First step is to build an inventory of data sources relevant to a particular user community
  • NCBI (Genbank for Raw Data)
  • Cyverse (iPlant for analytics)
  • Tripal supported websites for supporting data
• Determines optimal method for data transfer for each data source through testing
• Data transfer methodology is encoded into the metadata repository
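The idea described on this slide, testing each data source and recording the best transfer method in a metadata repository, can be sketched in a few lines of Python. This is not the BDSS implementation or its API: the class, the method names ("http", "aspera", "irods"), and the throughput figures are all invented placeholders used only to illustrate the lookup pattern.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRepository:
    """Hypothetical metadata repository: data source -> preferred method."""

    preferred: dict = field(default_factory=dict)

    def record_trials(self, source: str, trials: dict) -> None:
        """Keep the method with the highest measured throughput (MB/s)."""
        if not trials:
            raise ValueError("at least one trial is required")
        self.preferred[source] = max(trials, key=trials.get)

    def method_for(self, source: str, default: str = "http") -> str:
        """Return the recorded method, or a default for untested sources."""
        return self.preferred.get(source, default)


# Placeholder measurements; real values would come from timed test transfers.
repo = TransferRepository()
repo.record_trials("NCBI", {"http": 12.0, "aspera": 85.0})
repo.record_trials("Cyverse", {"http": 20.0, "irods": 60.0})

for source in ("NCBI", "Cyverse", "Tripal site"):
    print(source, "->", repo.method_for(source))
```

Because the choice is stored per data source, a client only has to look up the source to pick a transfer method, and sources that have not been tested fall back to a default.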

Page 26

BDSS: Moving data efficiently

Page 27

Tripal Gateway: Use Cases

1. A user could search across community DBs for their set of SNPs of interest (from a genotyping array) using Tripal Exchange.

2. The probe sequences could be gathered as a list and transferred to the user with the Data Transfer (BDSS) tool.

3. If the user prefers to use Galaxy for analysis, the transfer could load the probes into the Tripal Galaxy module and align them to a recently released genome reference.

4. A basic workflow for alignment could be selected in Galaxy along with the appropriate target.

Page 29

TreeGenes Database: Data Sources

Population Study
• Publication
• Species

Study Design
• Landscape
• Common Garden
• Greenhouse
• Growth Chamber
• Breeding (Plot)

Phenotype, Genotype, Environment
• Georeferenced

Raw Data
• Trees
• Genotypes
• Phenotypes

In addition to:
• Internal projects
• TREESNAP (public)
• DRYAD
• TRY-DB

Page 30

TreeGenes Database: CartograTree

– Providing context to geo-referenced data
– Data from TreeGenes, WorldClim, Ameriflux, TRY-DB

Page 31

TreeGenes Database: Interfaces

– Retrieve genotype, phenotype, environmental, and sequence data
– Further analysis (MUSCLE, TASSEL, PAML) via SSWAP

Page 32

TreeGenes Database: SSWAP

– SSWAP “reasons” over the input data and responds with relevant applications
– Send data through pipeline with selection (parameters)

Page 33

TreeGenes Database: Cyverse (TACC)

– Connect with Cyverse Views
– Download data locally or maintain on cloud-based storage

Page 34

CartograTree: Current Development

• Flexible georeferenced tagging
  • Approximate
  • Exact
  • Obscured (radius)
• Environmental layers (Geoserver)
  • Soil
  • Fire/Drought
  • Climate models
  • LIDAR
• Integration with Tripal
  • User control of workspace
  • Ability to upload their own trees/phenotypes
• Connection with Galaxy framework
  • More analytical options (PLINK, TASSEL, MSA, PAML)
  • Intelligent workflows
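The slides do not say how the "Obscured (radius)" tagging is computed. One possible approach, sketched below in Python under that caveat, is to replace the true coordinate with a point drawn uniformly from a disk of the chosen radius. The function names and the example coordinates are hypothetical, and the degree conversion is a flat-earth approximation that is only reasonable for small radii away from the poles.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE = math.pi * EARTH_RADIUS_KM / 180.0  # length of one degree of latitude


def obscure_location(lat, lon, radius_km, rng=None):
    """Return a point drawn uniformly from a disk of radius_km around (lat, lon).

    Assumption: small radius, not near the poles (local flat-earth offsets).
    """
    if rng is None:
        rng = random.Random()
    # sqrt() makes the samples uniform over the disk's area, not its radius.
    distance = radius_km * math.sqrt(rng.random())
    bearing = 2.0 * math.pi * rng.random()
    d_lat = distance * math.cos(bearing) / KM_PER_DEGREE
    d_lon = distance * math.sin(bearing) / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + d_lat, lon + d_lon


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    d_phi = p2 - p1
    d_lam = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(d_lam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only (roughly Storrs, CT); not a real accession.
true_lat, true_lon = 41.81, -72.25
shown_lat, shown_lon = obscure_location(true_lat, true_lon, radius_km=10.0)
print(round(haversine_km(true_lat, true_lon, shown_lat, shown_lon), 2), "km from true site")
```

Publishing only the displaced point lets a tree appear on the map in roughly the right environmental context while hiding its exact position to within the chosen radius.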

Page 35

CartograTree: TreeSNAP

• Validated accessions from TreeSNAP (obscured)

Page 36

CartograTree: Galaxy Workflows

Transcriptomics
Exome Capture
RNA-Seq
Genotyping Array
Affy
Illumina
Whole Genome Resequencing
GBS
• RAD-Seq
• ddRAD-Seq

Page 37

CartograTree: Advanced Interface

• 142 species
• 27,913 TGDR
• 17,412 Inventory
• 26,332 TRY-DB
• 815 TreeSNAP
• Release Date: December 2017

Page 38

TreeGenes Database: Team

Project Leads: Jill Wegrzyn, Emily Grau, Nic Herndon

Advising: Damian Gessler, Semantic Options

Project Developers: Sean Buehler, Taylor Falk, Peter Richter, Clayton Michael

Collaborators: Stephen Ficklin (Tripal), Alex Feltus (BDSS), Meg Staton (HWG), Dorrie Main (GDR)

[email protected]

@TreeGenes TreeGenes Database