HMS Genetics Department 2010 Retreat Computational Biology breakout session John Aach, PhD, Lecturer, Church Lab, Department of Genetics Mark Borowsky,

HMS Genetics Department 2010 Retreat
Computational Biology breakout session

• John Aach, PhD, Lecturer, Church Lab, Department of Genetics
• Mark Borowsky, PhD, Director, Molecular Biology bioinformatics team & co-Director, Illumina sequencing core
• Peter Park, PhD, Assistant Professor of Pediatrics & Associate Director of Bioinformatics, PCPGM

1

HMS Genetics Computational Biology breakout session

About this session …

• Purpose: Encourage interactions with computationalists; discuss how computational methods can work effectively in research

• Session objectives (reason to be here → what we hope to do):
  – Question about specific tools or specific data in your project → Tell you what we can … but don’t expect Car Talk!


Cars vs Computational biology

3


About this session …

• Purpose: Encourage interactions with computationalists; discuss how computational methods can work effectively in research

• Session objectives (reason to be here → what we hope to do):
  – Question about specific tools or specific data in your project → Tell you what we can … but don’t expect Car Talk!
  – Broader questions about comp. bio. or how to approach larger computational issues → Give guidance on best practices and possible starting points for your thinking
  – Just curious … → Give you an idea of what we do

• What we get out of it: Learn about things important to you, ideas we can develop, possibilities for collaboration

• Feedback: Contact us or Vonda Shannon for suggestions or follow-up on this session

Who we are

5


John Aach, PhD, 617-432-0061

• Currently: Lecturer, HMS Dept. of Genetics, Church Lab (since 1996)
• Background
  – PhD Boston U. 1985 (philosophy / psychology); BA Princeton U. 1975 (music)
  – many years as developer, manager, technology architect in IT
• Focus / interests: Like to be at interface between data and biology.
  – Church Lab comp. bio. requirements for “omics”, synthetic bio:
    • Develop new forms of data by error analysis, integration with other data, etc.
    • modeling / performance assessment / optimization of technology or bio. system
    • fast-moving develop/demonstrate vs. production orientation
    • close interaction with bench; interface with many other fields
• Projects have included
  – next-gen sequencing analysis for Church Lab targeted sequencing methods
  – automated image processing for cell morphology (with Perrimon Lab)
  – mathematical modeling (“polony” formation, metabolic models)
  – (early) computational miRNA search (with Ruvkun Lab)
  – microarray analysis; expression data “time-warping” method

6



Peter Park

• Currently: Assistant Professor (2006-)
• Background
  – Instructor, HMS; Postdoctoral fellow, HSPH (biostatistics)
  – PhD Caltech 1999 (applied mathematics); BA Harvard 1994 (applied mathematics)
• Focus / interests:
  – Microarray-based
    • gene expression, ChIP-chip, copy number, microRNA
    • Platforms: Affymetrix, Agilent, Illumina, etc.
  – Sequence-based
    • ChIP-seq, RNA-seq (SAGE-like vs whole transcript), copy number (whole-genome sequencing/targeted sequencing)
    • Platforms: Illumina, SOLiD, Helicos
• Longwood sequencing facilities: Partners (Landsdowne/MGH), HMS Biopolymer, Children’s (SOLiD), DFCI (Helicos), others?

7


Mark Borowsky, PhD

• Molecular and cell biologist…
  – Developmental gene regulation in flies
  – Cell adhesion in vertebrates
  – Infectious disease
• …turned informaticist
  – Microbial genome sequencing and analysis
  – Human annotation and cDNA sequencing
  – Gene expression analysis
  – Next gen sequence analysis (Illumina)
  – Tools for bench scientists

Make computational analysis available to biologists to answer biological questions.

8

Questions and Discussion

9


Some possible discussion topics

• What are some “best practices” in computational biology?
• How do you design an experiment with a computational component?
• How does one pick a set of tools for a problem?
• When does one write custom software vs rely on pre-existing tools?
• How do you work with a computational collaborator?
• What are some models for computational biology support?
• How do computational biology projects develop?
• How is the field of computational biology evolving?
• Other questions?

10

What are some best practices in computational biology?

11


Computational “best practices” (John Aach)

Pay attention to fundamentals
• Record every bit of analysis in scripts, spreadsheets, etc., and document it internally with comments and externally in a project log or notebook, so that it can be repeated and/or varied.
• Double-check every coded computation
  – Individual computations by working out examples by hand or with other software
  – Whole systems by comparing with other systems or biological expectations
• Don’t re-invent the wheel. If a tool is available that does close to what you need, it’s worth trying to use it.
• Work closely with experimentalist partners to
  – ensure you’re addressing the right problems
  – know what parameters affect computations
  – keep pace with changing protocols
  – provide feedback on experimental controls, performance, and data integration
• Keep your programs and files well organized, and write them to the level of performance needed by the project.
  – See Noble (2009) PLoS Comput Biol e1000424 for one set of recommendations
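As one illustration of double-checking a coded computation against examples worked out by hand, here is a minimal Python sketch. The two functions (`reverse_complement`, `gc_fraction`) are hypothetical stand-ins for whatever small computations a project actually uses; the point is only that each one ships with a few hand-derived checks.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hand-worked examples: reverse "ATGC" to "CGTA", then complement each base.
assert reverse_complement("ATGC") == "GCAT"
assert reverse_complement("AACG") == "CGTT"

# "GGCA" has three G/C bases out of four.
assert gc_fraction("GGCA") == 0.75
assert gc_fraction("ATAT") == 0.0
```

Keeping such checks in the same script as the computation means they are re-run every time the analysis is repeated or varied.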

12

How do you design an experiment with a computational component?

13


Plan your experiment with your computationalist (Mark Borowsky)

1. Define the biological question.

2. Choose metrics to evaluate the quality of your data.

3. Choose metrics to answer your question.

4. Determine how much data you will need to achieve significance.

5. Determine how many replicates you will need.

6. Define sources of bias and necessary controls.

7. Be realistic about yields from high throughput instruments.

8. Estimate a failure rate and plan extra samples.
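Steps 4, 5 and 8 can be roughed out numerically before any samples are prepared. The sketch below is one simple approach, not necessarily the one the presenters use: it assumes a two-group comparison of means with a known, equal standard deviation and applies the standard normal-approximation sample-size formula, then inflates the result for an assumed failure rate. The function names and all input values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def replicates_per_group(effect_size: float, sd: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Replicates per group for a two-sided, two-sample comparison of means
    (normal approximation: n = 2 * ((z_alpha/2 + z_beta) * sd / effect)^2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2
    return math.ceil(n)


def samples_to_start(n_needed: int, failure_rate: float) -> int:
    """Samples to prepare so that about n_needed survive the given failure rate."""
    return math.ceil(n_needed / (1 - failure_rate))


# Assumed inputs: detect a difference of one standard deviation.
n = replicates_per_group(effect_size=1.0, sd=1.0)   # 16 per group
planned = samples_to_start(n, failure_rate=0.25)    # 22 per group
print(n, planned)
```

For real designs (count data, multiple testing across many genes, unequal variances) a statistician or a dedicated power-analysis tool is a better guide than this approximation.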

14

How does one pick a set of tools for a problem?

15


Picking out computational tools (John Aach)

16

1. The central issue
• All algorithms
  o make assumptions about data (e.g., biological source, error models, data content …)
  o generate a computational result (alignment, expression significance, …)
• Research the algorithms and make sure you use ones that generate the results you need and whose assumptions your data meet

2. Routine problems
• Use any tool that is convenient and conforms with “best practices”

3. Complex problems
• Break into main parts and look for solutions to each. Start with the harder parts.

4. Inevitable compromises
• If algorithms don’t do exactly what you need or their data assumptions are not exactly met, consider whether they are close enough to use with suitable adjustments
• Performance and convenience are important; they influence overall research time allocation
• Sometimes you simply can’t get an algorithm to run

5. Other considerations
• Always check the results of the algorithm on data whose results you know
• Try to keep abreast of new tools (difficult…)
• If there is a choice of tools, choose ones that have been shown to perform better on similar data. Otherwise, choose ones that have a better theoretical foundation
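Checking an algorithm on data whose results you know can be as simple as building a positive control with a planted answer. The Python sketch below is an assumption-laden illustration: `most_common_kmer` is a deliberately naive stand-in for a real motif-discovery tool, and `make_positive_control` generates hypothetical random sequences with a known motif inserted into each one. If the tool cannot recover the planted motif here, its output on real data deserves suspicion.

```python
import random
from collections import Counter


def make_positive_control(motif: str, n_seqs: int = 20,
                          length: int = 50, seed: int = 0) -> list[str]:
    """Random DNA sequences, each with one copy of motif planted at a random position."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    sequences = []
    for _ in range(n_seqs):
        bases = [rng.choice("ACGT") for _ in range(length)]
        start = rng.randrange(length - len(motif) + 1)
        bases[start:start + len(motif)] = motif  # overwrite background with the motif
        sequences.append("".join(bases))
    return sequences


def most_common_kmer(sequences: list[str], k: int) -> tuple[str, int]:
    """Naive stand-in for a motif finder: the most frequent k-mer and its count."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(1)[0]


planted = "TTGACAGC"
control = make_positive_control(planted)
found, count = most_common_kmer(control, k=len(planted))
assert found == planted and count >= len(control)
```

The same pattern applies to other tools: swap in the real program for `most_common_kmer` and a control data set that matches the assumptions of your experiment.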

When does one write custom software vs rely on pre-existing tools?

17


When does one write software (John Aach; Mark Borowsky)

• Only when necessary!

• Existing tools don’t work, don’t perform well enough, or don’t integrate well enough to do a task

• You need to run the same processing repeatedly on different data sets or with changes restricted to a fixed set of parameters.

• You have an idea for a new algorithm

18

How do you work with a computational collaborator?

19


Care and handling of your computationalist (Mark Borowsky)

• Contact when planning experiments
• Approach as collaborators
• Educate
  – Us about your biological system and questions
  – Yourself about computational approaches used/accepted in your field (provide references)
• Ask to be educated about assumptions, costs and benefits
• Discuss resources
  – Development time
  – Compute time
  – Hardware and disk space (ours and yours)
• Ask about the queue
• Ask what you can do to facilitate

Don’t get frustrated, get in touch.

20

What are some models for computational biology support?

21


Bioinformatics Support? (Peter Park)

• Software packages can be used to formulate hypotheses or carry out initial analysis
• But manual intervention is necessary to get to publication
• Even miscellaneous tasks like data deposition could take a lot of time when done manually.
• The problem will be more acute in the future with sequencing data.
• Infrastructure will be an issue, given the size of data sets.
• HMS Orchestra cluster is available (~$0.70/GB/year for storage)
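To make the infrastructure point concrete, the quoted storage rate can be turned into a quick budget estimate. This is a back-of-the-envelope Python sketch: only the ~$0.70/GB/year figure comes from the slide, while the function name, the data volume and the retention period are hypothetical.

```python
def storage_cost(size_gb: float, years: float = 1.0,
                 rate_per_gb_year: float = 0.70) -> float:
    """Storage cost in dollars at a flat per-GB, per-year rate."""
    return size_gb * years * rate_per_gb_year


# Hypothetical project: 2,000 GB of sequencing data kept for 3 years.
print(f"${storage_cost(2000, years=3):,.0f}")  # about $4,200
```

The estimate covers storage only; compute time and personnel are separate costs.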

22


Possible models for Bioinformatics Support? (Peter Park)

• Joint grants
  – a considerable lag-time until the grant is funded, if at all
• Fee for service
  – there aren’t many places that offer service (some companies do this)
  – quality is unpredictable
  – investigators generally under-estimate the cost
  – small data sets often require as much work as large data sets
• Institutional core services
  – Institutional grants
• ‘Collaboration’ between labs
  – Enough incentives for the bioinformatician?
• How do other places do it?
  – Not many places do it well
  – How does Broad do it?
  – DFCI: additional computing charges in grants for personnel
  – DFCI: Center for Cancer Computational Biology
  – Harvard Catalyst Genetics and Bioinformatics Consulting Program for “clinical and translational investigators”

23

How do computational biology projects develop?

24


Collaboration life cycle (Mark Borowsky)

25

[Figure: collaboration life cycle diagram. Stage labels: Plan; Generate primary data; Implement analysis methods; Analysis output; Review; Refine algorithm / Generate more data]


Example (John Aach)

26

Need: automated morphology analysis (hard!)

Phases
1. Initial analysis: established need for stochastic labeling (with Amy Kiger and Pam Bradley)
2. Strategy development (with Chris Bakal)
  – different cell line more amenable to analysis
  – smaller no. of perturbations compatible with less than full automation
3. Analysis methods development
  – supervised learning approach
  – statistics
  – integrate with biological knowledge
4. Publish

How is the field of computational biology evolving?

27


How computational biology evolves (John Aach)

1. State of computational biology at any given time = established tools & data + tools in development that make “best guesses” at phenomena currently hard to measure

28

[Figure labels: your data; your analysis; low level data management tools; data analysis & interpretation tools; databases; imputation tools (forefront of research!). Example: DNA motif discovery tools]


How computational biology evolves (John Aach)

29

1. State of computational biology at any given time = established tools & data + tools in development that make “best guesses” at phenomena currently hard to measure


2. New experimental techniques are developed and demonstrated that capture new relevant data

[Figure labels: your data; your analysis; tools (two boxes); databases; imputation tools (still developing!!); new experimental technique; demo analyses. Example: ChIP2]


How computational biology evolves (John Aach)

30

[Figure labels: your data; your analysis; tools (two boxes); databases; old imputation tools; maturing experimental technique]

1. State of computational biology at any given time = established tools & data + tools in development that make “best guesses” at phenomena currently hard to measure


2. New experimental techniques are developed and demonstrated that capture new relevant data

3. The techniques and their tools mature and their data gets put into databases
  – some imputation still needed, but much drops off
  – databasing difficult due to massive data and changing technology

Thank you!

31
