Genomics resources
Feeding your inner bioinformatician
Associate Professor Mik Black
Department of Biochemistry, University of Otago
Brief aside: who am I?
· Background in statistics: my (rather diverse and collaborative) research involves the development and application of statistical methods for problems in human disease genomics.
· Heavily involved in the establishment of two government-funded national infrastructure initiatives in New Zealand:
  - NZGL (New Zealand Genomics Ltd) - inter-university collaboration in genomics and bioinformatics.
  - NeSI (NZ eScience Infrastructure) - cross-institutional (universities and Crown Research Institutes) collaboration in high performance computing and eResearch.
· Formerly the bioinformatics team leader for NZ Genomics Limited (2012), and still a semi-active team member.
Some backstory, or "how did I get this gig?"
· This slot used to be about "Genomics Infrastructure"
  - Using external providers to generate your sequence data
  - Options (and caveats) for outsourcing the bioinformatic and/or statistical analysis of your data
· This year I wanted to focus more on skills development - why the change?
  - We are all "Biological Data Scientists" - genomic data analysis is a core component of modern molecular research.
  - Programming, version control, Open Science, reproducible research: these are core skills and concepts that are relevant for ALL researchers.
BUT: outsourcing some of the analytic workload isn't necessarily bad - let's still consider that as an option.
Overview - this talk will cover...
· Bioinformatics service provision: outsourcing the analysis of genomic data.
· Community resources: doing bioinformatics without (quite) being a bioinformatician.
· Training: growing your computational skill set.
But no Game of Thrones...
First though: I'm a statistician - let's generate some data.
Outsourcing your bioinformatics
· What are you outsourcing?
  - Quality assessment and basic bioinformatics?
  - Generic data analysis?
  - Domain-specific analysis?
  - Tailored analysis for your specific question?
· Make sure a full analysis plan is in place before committing to the work.
  - If possible, have the plan inspected by an independent "expert".
DON'T underestimate the value that an expert team of bioinformaticians can bring to your project, but DO make sure you know what you will be getting from them (and the cost...)
Outsourcing your bioinformatics
· Quality assessment: this will usually be provided with the data. Don't be afraid to ask for more information (and even more QA).
· Basic bioinformatics: e.g., quality trimming/filtering and alignment to a reference genome - make sure you are very clear about what you want (if you know): trim/filter parameters, genome build, organism (!), aligner, parameters...
· Generic data analysis: e.g., variant calling, differential expression, etc.
· Tailored analysis for your specific question: this requires the most specification (and input from you) and should involve a provider with a background in this area.
ALWAYS specify that you require the code used to perform the analysis - you need to know what was done every step of the way.
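One way to be "very clear about what you want" is to write the request down in a structured form before the work starts. The following is only a minimal sketch in Python: the class name `AnalysisSpec`, its field names, and the example values are all hypothetical and simply mirror the items listed on the slide (organism, genome build, aligner, parameters). It is not a format that any provider is known to require.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AnalysisSpec:
    """Hypothetical record of what is being requested from a provider."""
    organism: str = ""
    genome_build: str = ""
    aligner: str = ""
    aligner_params: dict = field(default_factory=dict)
    trim_filter_params: dict = field(default_factory=dict)
    code_required: bool = True  # always ask for the code that was run


def missing_fields(spec):
    """Return the names of the text fields that are still empty."""
    return [name for name, value in asdict(spec).items()
            if isinstance(value, str) and not value.strip()]


# Illustrative values only; substitute the details of your own project.
spec = AnalysisSpec(organism="Homo sapiens", genome_build="GRCh38")
print("Still to be agreed:", missing_fields(spec))
print(json.dumps(asdict(spec), indent=2))
```

Anything reported as "still to be agreed" is a question to settle with the provider (or the independent "expert") before committing to the work, and the JSON output can be kept alongside the project as a record of what was asked for.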
Brief aside - publishing
· When outsourcing genomics and bioinformatics work, discuss publishing expectations up front.
  - some researchers feel that "fee-for-service" work does not constitute a "meaningful contribution" to a paper.
  - other researchers treat "service providers" more like collaborators.
· There are some advantages to including genomics and bioinformatics personnel on publications:
  - access to deep expertise (can be particularly helpful at review time).
  - collaborative approach can lead to greater engagement in the work being done.
Community resources: what is available?
· Databases/Browsers (the big players)
  - NCBI (http://www.ncbi.nlm.nih.gov/)
  - Ensembl (http://www.ensembl.org/)
  - UCSC (https://genome.ucsc.edu/)
  - MANY domain-specific options (check Nat Gen annual DB issue)
· Software tools:
  - GenomeSpace (Galaxy, GenePattern, Cytoscape,...)
  - R/Bioconductor
  - That scary command line thing....
· Generic data sources:
  - GEO, ArrayExpress, inSilicoDB, SRA, dbGaP, EGA...
GenePattern: web-based analysis platform
http://www.broadinstitute.org/cancer/software/genepattern/
Galaxy: web-based analysis platform
https://usegalaxy.org/
GenomeSpace: joining up the cool stuff...
http://www.genomespace.org/
Training - "I'm sure I can do it better..."
· Most researchers aren't looking to outsource the investigative component of their research.
  - Data analysis is a fundamental part of scientific investigation.
  - Many investigators want to "own" the entire data processing/analysis process, others don't.
· Training: there are MANY opportunities available for up-skilling:
  - Bioplatforms Australia: http://www.bioplatforms.com.au
  - Institute based (e.g., IMB Winter School...)
  - Software Carpentry: http://software-carpentry.org/
  - NZGL: http://nzgenomics.co.nz
  - Online courses (what to choose...??)
Training - know what you need
· Bioinformatics is a VERY broad field: what is it that you want to learn?
  - Early-stage analysis: QA/QC, alignment
  - More specialized: variant calling (SNPs, CNV, other SV), assembly, RNA-seq count generation, metagenomics...
  - Further downstream: analysis of "processed" data (clustering, prediction, pathways, network reconstruction, phylogeny...)
· Specialization is a GOOD thing, if you can afford it
  - it's great to be a jack-of-all-trades, but the "master-of-none" trade-off can be a problem.
  - makes sense to invest your time where it will be most effective: learn the skills most relevant to what you are trying to accomplish with your research.
Training - "I'm sure I can do it better..."
DO take advantage of training opportunities, but DON'T overestimate what is being provided.
There is only so much we can teach in a few days...
Learn "enough to be dangerous", and then find a good "bioinformatics buddy" to keep you from going astray - know your limitations.
Upskilling: a case study (my group)
· Small research group of relatively junior graduate students and research assistants
  - Mix of computer science, statistics and biology/genetics backgrounds.
  - Similar/related research projects, and common needs in terms of skills development
· Common requirements across projects
  - programming skills
  - version control
  - collaboration tools
  - reproducible research
· None of these areas is specific to bioinformatics/genetics/genomics.
What were/are our needs?
· Unix shell
  - general usage for data manipulation
  - scripting for basic automation
· R
  - general statistical analysis (esp. linear models)
  - genetics/genomics data analysis techniques
  - data visualisation
  - reproducible research
· HPC cluster access
  - simulations and permutations/resampling
  - embarrassingly parallel...
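To illustrate why permutation/resampling work is "embarrassingly parallel", here is a minimal sketch of a two-group permutation test, written in Python with the standard library only. The function names, the chunking scheme, and the example data are assumptions made for illustration and are not taken from the group's own code. Each chunk of permutations uses its own seed and shares nothing with the others, so on an HPC cluster each chunk could be submitted as a separate job; in this sketch they simply run one after another.

```python
import random
from statistics import mean


def count_extreme(x, y, n_perm, seed):
    """Count permutations whose |mean difference| is >= the observed one.

    Each call has its own seeded generator, so calls are independent
    of one another and could be run as separate jobs.
    """
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the two groups
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    return hits


def permutation_pvalue(x, y, n_chunks=4, perms_per_chunk=250, base_seed=1):
    """Combine the independent chunks into one permutation p-value."""
    hits = sum(count_extreme(x, y, perms_per_chunk, base_seed + i)
               for i in range(n_chunks))
    total = n_chunks * perms_per_chunk
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (total + 1)


# Made-up measurements for two groups, purely for illustration.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.2, 4.4, 4.0, 4.8, 4.5, 4.1]
print(permutation_pvalue(group_a, group_b))
```

Because only the hit counts need to be combined at the end, the same structure scales from a laptop loop to many cluster jobs without changing the statistical logic.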
A non-sustainable training model...
· During the second half of 2014 I prepared training sessions for my research group on:
  - dplyr/tidyr
  - reproducible research
  - ggplot2
  - ggvis
  - shiny
  - Bayesian modelling with JAGS
  - genomic data visualisation
  - linear algebra and linear models
· That was exhausting...
A better approach
· Software Carpentry
  - Unix Shell
  - R/Python
  - Git
  - MySQL
· Regular workshops offered throughout Australia and New Zealand
  - Australia:
    - Belinda Weaver ([email protected])
    - Damien Irving ([email protected]).
  - NZ: NeSI (Aleksandra Pawlik: [email protected])
· Data Carpentry (more domain-focused) now also offered.
Software Carpentry
"Since 1998, Software Carpentry has been teaching researchers in science, engineering, medicine, and related disciplines the computing skills they need to get more done in less time and with less pain."
http://software-carpentry.org
· Trained instructors
· Comprehensive lessons
Data Carpentry
· In May 2014, the first "Data Carpentry bootcamp" was taught:
  - "Data Carpentry develops and teaches workshops on the fundamental data skills needed to conduct research."
  - sibling organisation to Software Carpentry
  - http://www.datacarpentry.org/
· We now use the Data Carpentry material to give our incoming 4th year Biochemistry and Genetics students a two day "crash course" in data analysis with R.
  - Gives them the basic tools needed for the analytic components of their 4th year projects.
  - Prepares them to take a Software Carpentry workshop later in the year.
Software Carpentry
· SWC instructor training, Melbourne, Jan 2015
  - Two group members and I attended.
  - Became certified SWC instructors.
· Inaugural Research Bazaar (ResBaz), Melbourne, Feb 2015
  - postgraduate students and early/mid-career researchers
  - SWC training + many other workshops
Saved by SYSKA
· We now had three trained SWC instructors in our extended research group
  - the students were taking over!
  - the senior students were now able to train others... and so were the junior students
· Time for SYSKA: Sh*t You Should Know About
  - rotating weekly slots
  - short (20-30 minute) presentation (by student) to group on something useful or topical
  - Python Tricks, Vim vs Emacs, NeSI HPC, Shell tricks, RMySQL, LaTeX, dplyr (again)...
Expanding: Mozilla Study Groups
· Announced in April 2015 by Mozilla Science Lab
  - skill sharing and idea discovery
  - community support
  - lots of introductory lessons:
https://github.com/mozillascience/studyGroupLessons
Additional lessons: Bioconductor course material
http://bioconductor.org/help/course-materials/2016/
(Our) Mozilla Study Group format
· Otago-based Mozilla Study group takes the student-led training beyond our immediate research group.
· Fortnightly meetings
  - 4 session rotating format:
  - 2 weeks of nominated topics: hands-on coding
  - 1 week of hacky hour
  - 1 week of 5x5 lightning SYSKA
· Lightning SYSKA!
  - 5 presenters, 5 minutes each
  - Present a cool topic
  - Use topics/interest to decide content for future lessons
Other events: Research Bazaar
· Digital skills training for graduate students and early career researchers.
  - First held at University of Melbourne in February 2015.
  - Sites throughout NZ and Australia (and the Americas) in 2016.
  - Look for "ResBaz Week" in February 2017.
· Site-specific programme
  - Generally a Software Carpentry core, plus more advanced lessons for SWC "graduates"
  - Keynote presentations
  - Modules on a broad range of digital skills and tools.
· Fantastic opportunity for upskilling (especially as a group), and meeting other members of the research community.
My group: where are we now?
· Senior group members are competent SWC instructors or helpers
  - Good study group attendance
  - Extended research group is becoming competent with core digital research tools (Shell, R, Git)
  - We have a solid collection of training materials (both general and domain-specific)
  - Presentations are hands-on: major advantage (and a good step forward)
· Reproducible Research and Open Science concepts/techniques are starting to be used more frequently.
Reproducible research
· We are currently (I hope) in the midst of a "reproducibility revolution"
  - increased emphasis on sharing all aspects of our research.
  - strong emphasis on the use (and development) of open source tools that build on existing frameworks.
  - move (by many) towards the use of frameworks for ensuring that we are doing "reproducible research".
· The R computing environment provides a good example of this, but there are a number of others (e.g., IPython notebooks).
  - RStudio (http://rstudio.com) includes R markdown by default.
  - Facilitates the production of high-quality output (HTML, PDF, even Word!) with embedded analysis and results.
RStudio interface
http://rstudio.com
R markdown output
http://rmarkdown.rstudio.com/
Tools for collaboration
· GenomeSpace (e.g., Galaxy and GenePattern) provides domain-specific tools for the collaborative sharing of data and analyses.
· A number of groups combine cloud-based tools in an ad hoc fashion to generate a collaborative research environment:
  - storage provision (e.g., Dropbox, Google Drive, FigShare)
  - code sharing/editing + version control (e.g., Git/GitHub, Bitbucket)
  - reproducible research (e.g., R markdown, IPython notebooks)
  - shared/collaborative web-based analysis (e.g., RStudio Server, Shiny Server).
Note potential data security/privacy issues with cloud storage provision (e.g., Dropbox, Google Drive, FigShare).
Although seemingly haphazard, this approach provides a lot of flexibility for incorporating new tools as they emerge.
Summary
· Outsourcing your analysis - know what you are getting:
  - Clearly define plans and expectations in terms of the data and analysis that you are paying for.
  - Ensure you have the resources needed to complete the project.
· Shared resources - know what is available:
  - Generic and domain-specific resources exist that can facilitate, streamline and complement your research.
· Personal workflow - know what you are doing:
  - Upskill yourself: interact with your research community.
  - Know your tools, and develop (and follow!) an analysis plan.
  - The "reproducible research" paradigm offers a valuable set of resources to help ensure reproducibility.
A (non-exhaustive) list of useful local links:
· Australia:
  - QFAB: http://qfab.org
  - AGRF: http://agrf.org.au
  - QCIF: http://www.qcif.edu.au
  - Combine: https://combine.org.au
  - Aus. Bioinformatics Network: http://australianbioinformatics.net
  - Bioplatforms Australia: http://bioplatforms.com.au
  - Ensembl/EMBL resources (local): https://www.embl-abr.org.au
· New Zealand:
  - NZGL: http://nzgenomics.co.nz
  - Bioinformatics Institute: http://www.bioinformatics.org.nz
  - NeSI: http://nesi.org.nz