© 2010 LabKey Software www.labkey.com
Managing Next Generation Sequencing and Multiplexed Genotyping Data Using
Open Source LabKey Server
Adam [email protected]
LabKey Software 2010
LabKey Software Company Overview
LabKey Software is a consulting company
• Spun off from the McIntosh Lab (part owned by FHCRC)
• Professional software engineers from Amazon, Microsoft, BEA, etc.
• Work in partnership with scientists
• For-profit fee-for-service contracts
• Non-profit grant sub-awards
  – Co-investigators with a shared research agenda
• All development approved by and relevant to FHCRC
Development & support around LabKey Server
• Extending the base LabKey Server platform
• Creating customized lab-specific solutions
• Hosting LabKey Server
• Support
What Is LabKey Server?
An open-source, web-based platform for organizing, analyzing & sharing scientific data
• Data integration & analysis for assays
  – Proteomics, flow cytometry, plate-based assays, etc.
• Study Data Management
  – Combines demographic, clinical, assay & specimen data
• LabKey Server powers many deployments…
  – CPAS: FHCRC proteomics repository
  – Atlas Science Portal: SCHARP’s HIV vaccine studies
  – AdaptiveTCR: Customer analytics for ImmunoSEQ NGS
  – UW (Katze, Heinecke, et al.), USC, Markey, Harvard, IDRI, TGen, Wisconsin Primate EHR, UC Denver, etc.
Dave O’Connor Lab, University of Wisconsin
• Academic research lab
• Focus: understanding SIV using nonhuman primate models & applying NHP methods to human HIV disease research
O’Connor Lab SIV/HIV Research
[Figure: Host Immune Genetics. Source: modified from Yewdell et al., Nature Reviews Immunology 2003]
[Figure: Virus Genetics. Source: Korber et al., British Medical Bulletin 2001]
Importance of MHC Class I
Host Immune Genetics
• MHC class I molecules dictate immunity to disease
• High degree of polymorphism within the MHC class I peptide-binding domain
• Specific MHC alleles associated with superior control of HIV infection
Source: modified from Yewdell et al., Nature Reviews Immunology 2003
Importance of Viral Variability
Virus Genetics
• HIV has a fast replication cycle and a high mutation rate
• Evolution of the virus causes escape from immune responses
• Specific mutations are associated with resistance to antiretroviral drug therapy
Source: Korber et al., British Medical Bulletin 2001
Sequencing in the O’Connor Lab
• 2005 – 2009: Sanger sequencing
  – “Prohibitively expensive” for most experiments
• 2009: Roche/454 GS FLX at UIUC
• 2010: Roche/454 GS Junior in lab
• Roche/454 GS Junior
  – Long-read instrument, critical for genotyping
  – Identical to GS FLX, but 1/8 throughput & lower cost
  – ~100,000 reads per run (~1¢ per read), average ~560bp read length
  – 115 runs this year
• MID tagging
  – Allows pooling multiple samples (30-100) into a single run
• Galaxy server
  – Open-source sequence analysis tool (Giardine et al., Genome Res 2005)
  – Lab has built custom workflow to match sequences to known MHC alleles
  – Uses BLAT, transitioning to AGILE (Northwestern alignment tool)
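MID tagging can be pictured as a short known tag at the start of each read that identifies the sample it came from. The sketch below is a minimal illustration, not the lab's Galaxy workflow: it assumes exact-prefix matching, and the tag sequences, sample names and reads are made up for the example.

```python
from collections import defaultdict

# Hypothetical MID tags and sample names, for illustration only.
MID_TO_SAMPLE = {
    "ACGAGTGCGT": "sample_A",
    "ACGCTCGACA": "sample_B",
}


def demultiplex(reads, mid_to_sample):
    """Group (name, sequence) reads by the MID tag at the start of each read."""
    by_sample = defaultdict(list)
    for name, sequence in reads:
        for mid, sample in mid_to_sample.items():
            if sequence.startswith(mid):
                # Strip the tag so only the remaining sequence is kept.
                by_sample[sample].append((name, sequence[len(mid):]))
                break
        else:
            # No tag matched exactly (assumption: no mismatch tolerance).
            by_sample["unassigned"].append((name, sequence))
    return dict(by_sample)


# Made-up reads: two carry a known tag, one does not.
reads = [
    ("read1", "ACGAGTGCGTTTGACC"),
    ("read2", "ACGCTCGACAGGCATT"),
    ("read3", "TTTTTTTTTTGGCATT"),
]
result = demultiplex(reads, MID_TO_SAMPLE)
```

If the ~100,000 reads of a run were split evenly, pooling 30-100 samples would leave roughly 1,000 to 3,300 reads per sample; real pools are rarely that even.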
Roche/454 MHC Workflow
• Total RNA isolation and cDNA synthesis
  – RNA isolation ~4 hrs; cDNA synthesis ~2 hrs
• Primary PCR amplification
  – plus SPRI purification, quantification, pooling ~3 hrs
• emPCR
  – set-up ~1 hr, run ~5.5 hrs
• Breaking and enrichment
  – ~3 hrs
• Roche/454 GS Junior run
  – set-up ~1.5 hrs; run time ~10 hrs
• Data processing and analysis
  – run processing ~2 hrs; analysis time varies
www.454.com
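Adding up the approximate step times gives a rough sense of how long one sample pool takes from RNA to processed run. This is a back-of-the-envelope sketch: it assumes the steps run strictly one after another and omits analysis time, for which the slide gives no figure.

```python
# Approximate step durations in hours, copied from the workflow list.
# "Analysis time varies" is left out because no figure is given.
steps = [
    ("RNA isolation", 4.0),
    ("cDNA synthesis", 2.0),
    ("Primary PCR, SPRI purification, quantification, pooling", 3.0),
    ("emPCR set-up", 1.0),
    ("emPCR run", 5.5),
    ("Breaking and enrichment", 3.0),
    ("GS Junior set-up", 1.5),
    ("GS Junior run", 10.0),
    ("Run processing", 2.0),
]

# Total elapsed time, assuming the steps are strictly sequential.
total_hours = sum(hours for _, hours in steps)

# Time spent in the two instrument runs (emPCR run and GS Junior run).
instrument_hours = sum(hours for name, hours in steps if name.endswith("run"))

print(f"Total: {total_hours} hrs, of which {instrument_hours} hrs are instrument run time")
```

Under those assumptions the total is about 32 hours, of which about 15.5 hours are instrument run time.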
PROBLEM: DATA MANAGEMENT!
There is a real disconnect between the ability to collect next-generation sequence data (easy) and the ability to analyze it meaningfully (hard)
Dave O’Connor
Problem: Data Management
As volume has increased, the lab has found it difficult to manage all of its sequencing data & metadata:
• Run metadata
• Run metrics
• Sequencing reads and quality scores
• Sample information and multiplex identifiers (MIDs)
• Reference sequences for genotyping experiments
• Genotyping matches
O’Connor asked LabKey to build a system that can:
• Store sequencing and genotyping data in a single database that links all the tables, allowing arbitrary queries and reports
• Provide tools for analysis, querying, visualization and export
• Automate data workflows for efficiency & consistency
• Eventually, link sequencing results to their primate EHR system
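As one illustration of the export requirement, a stored read and its quality scores can be written out as a single text record. The slides do not say which export formats the system supports. FASTQ with Phred+33 encoding is used here only as a common format that keeps a sequence and its qualities together, and the function name and example values are hypothetical.

```python
def to_fastq(name, sequence, phred_scores):
    """Format one read as a four-line FASTQ record using Phred+33 encoding."""
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33: each integer score becomes the ASCII character at score + 33.
    quality = "".join(chr(score + 33) for score in phred_scores)
    return f"@{name}\n{sequence}\n+\n{quality}\n"


# Made-up read name, bases and scores, for illustration only.
record = to_fastq("read1", "ACGT", [40, 30, 20, 10])
```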
LabKey Sequencing System
[Diagram labels: Reads; Quality Scores; Metrics; Sample Information; Reference Sequences; Sequencing and Genotyping Database; External Tools; Galaxy Genotyping Workflow; Analysis; Reporting; Visualization; Export]
Database Schema
Metrics (genotyping): Run, [...]
Runs (genotyping): RowId, MetaDataId, Container, CreatedBy, Created, Path, FileName, Status
AllelesJunction (genotyping): MatchId, SequenceId
Analyses (genotyping): RowId, Run, CreatedBy, Created, Description, Path, FileName, Status, SequenceDictionary, SequencesView
AnalysisSamples (genotyping): Analysis, SampleId
Reads (genotyping): RowId, Run, Name, Mid, Sequence, Quality
ReadsJunction (genotyping): MatchId, ReadId
Dictionaries (genotyping): RowId, Container, CreatedBy, Created
Matches (genotyping): RowId, Analysis, SampleId, Reads, [Percent], AverageLength, PosReads, NegReads, PosExtReads, NegExtReads
Samples (genotyping): SampleId, [...]
Sequences (genotyping): RowId, Dictionary, Uid, AlleleName, Initials, GenbankId, ExptNumber, Comments, Locus, Species, Origin, Sequence, PreviousName, LastEdit, Version, ModifiedBy, Translation, Type, IpdAccession, Reference, Region, Id, Variant, UploadId, FullLength, AlleleFamily
MetaData (genotyping): Run, [...]
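The junction tables are what let a genotyping match be traced back to the individual reads that support it. The sketch below rebuilds a small subset of the schema in an in-memory SQLite database to show that kind of query. The table and column names come from the schema; the column types, foreign keys, choice of SQLite and all row values are assumptions made for illustration, not the LabKey implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subset of the schema; types and foreign keys are assumptions.
conn.executescript("""
CREATE TABLE Runs (RowId INTEGER PRIMARY KEY, FileName TEXT, Status INTEGER);
CREATE TABLE Reads (RowId INTEGER PRIMARY KEY,
                    Run INTEGER REFERENCES Runs(RowId),
                    Name TEXT, Mid INTEGER, Sequence TEXT, Quality TEXT);
CREATE TABLE Analyses (RowId INTEGER PRIMARY KEY,
                       Run INTEGER REFERENCES Runs(RowId),
                       Description TEXT);
CREATE TABLE Matches (RowId INTEGER PRIMARY KEY,
                      Analysis INTEGER REFERENCES Analyses(RowId),
                      SampleId INTEGER, Reads INTEGER, AverageLength REAL);
CREATE TABLE ReadsJunction (MatchId INTEGER REFERENCES Matches(RowId),
                            ReadId INTEGER REFERENCES Reads(RowId));
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO Runs VALUES (1, 'run1.txt', 1)")
conn.executemany("INSERT INTO Reads VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, "read1", 1, "ACGT", "40 40 40 40"),
    (2, 1, "read2", 1, "ACGA", "40 40 40 38"),
    (3, 1, "read3", 2, "TTGA", "39 40 37 40"),
])
conn.execute("INSERT INTO Analyses VALUES (1, 1, 'demo analysis')")
conn.executemany("INSERT INTO Matches VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 101, 2, 4.0),
    (2, 1, 102, 1, 4.0),
])
conn.executemany("INSERT INTO ReadsJunction VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])

# For every match, list the names of the reads that support it.
rows = conn.execute("""
    SELECT m.RowId, m.SampleId, r.Name
    FROM Matches AS m
    JOIN ReadsJunction AS j ON j.MatchId = m.RowId
    JOIN Reads AS r ON r.RowId = j.ReadId
    ORDER BY m.RowId, r.Name
""").fetchall()
```

AllelesJunction would link each match to its reference sequences in the same way.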
Demo
Possible Future Directions
• Respond to O’Connor lab’s near-term needs
  – Genomics-specific analytics
  – Additional export formats
  – Tighter integration with Galaxy
  – Support for amplicon-designated reads
  – Match combining
  – Simplify configuration and operation
• Integrate with Wisconsin primate EHR
• Better integration with R / Bioconductor
• Visualization
• Other sequencing platforms: Illumina, PacBio…
Acknowledgements
O’Connor Laboratory
• David O’Connor
• Simon Lank
• Julie Karl
• Benjamin Bimber
LabKey Software
• Mark Igra
• Brian Connolly
• Elizabeth Nelson
• Josh Eckels
• Matthew Bellew
• et al.
Questions?