Practically Genomic A hands-on bioinformatics IAP Course Materials: Instructors: Paola Favaretto, Sebastian Hoersch,

Practically Genomic
A hands-on bioinformatics IAP

Course Materials:

http://rous.mit.edu/index.php/IAP_2012

Instructors:

Paola Favaretto, Sebastian Hoersch, Charlie Whittaker and Courtney Crummett

KI for Integrative Cancer Research at MIT and MIT Libraries

Students - Wide range of experience levels
Unix account access information will be provided
Evaluations - Please send comments to

[email protected]

Turning Biologists into Bioinformaticists - A practical approach

The teaching material should:
• be modular and practical
• have obvious contextual relevance
• serve as readily accessible and easily used reference materials

The students should:
• become aware of the contents of a basic bioinformatics toolkit
• learn how to find instructions covering tools and methods
• experiment with different methods covered in classes
• gain familiarity and comfort with command-line computing

Target audience: KI biologists

Turning Biologists into Bioinformaticists - A practical approach – the specifics

1. Theory - Core Bioinformatics Concepts
Important principles required to use bioinformatics

2. Tools - A Basic Bioinformatics Toolkit
The software of bioinformatics

3. Tasks - Bioinformatics Methods
Data analysis with bioinformatics

Under Development!

http://rous.mit.edu/index.php/Teaching



IAP 2012 Agenda (subject to change)

1-23-12
Introduction
Getting more from Excel
Unix Introduction

1-25-12
Next Generation Sequence Analysis with Unix and Galaxy

1-27-12
Visualization and Analysis of Genomics Data

rous.mit.edu

Theory – Genomic Data

All kinds of genomics data are described using at least 4 pieces of information:

1) The name of a DNA sequence
2) A position on that sequence
3) A feature that exists at that position
4) The genome assembly version

Sequence1    Position  Feature
Chromosome1  1314      Mutation

• Sequence 1 is a long block of sequence arranged by a process called genome assembly.

• This is critical because the 3 pieces of information described above are only meaningful for one specific assembly version. A new version of the genome will probably not have this mutation at position 1314. It would be located elsewhere.

BED, GFF, GTF formats
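
As a rough illustration of the four pieces of information above, the sketch below stores the example mutation as a small record and writes it out as a BED line. The names GenomicFeature and to_bed_line, and the assembly label "hg19", are made up for this example and do not come from the course materials. The one format detail relied on is that BED coordinates are 0-based with an exclusive end, so the 1-based position 1314 becomes the interval 1313 to 1314.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicFeature:
    """One genomic feature (hypothetical record type for illustration)."""
    sequence: str   # name of the DNA sequence, e.g. a chromosome
    position: int   # 1-based position on that sequence
    feature: str    # what exists at that position
    assembly: str   # genome assembly version the position refers to


def to_bed_line(f: GenomicFeature) -> str:
    """Format a single-base feature as a 4-column BED line.

    BED start is 0-based and the end is exclusive, so a 1-based
    position p becomes start = p - 1, end = p.
    """
    return f"{f.sequence}\t{f.position - 1}\t{f.position}\t{f.feature}"


# "hg19" is an assumed assembly label; the slide does not name one.
mutation = GenomicFeature("Chromosome1", 1314, "Mutation", "hg19")
bed_line = to_bed_line(mutation)
print(bed_line)
```

The BED line itself carries no assembly version, which is why that fourth piece of information has to be tracked alongside the file.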

http://en.wikipedia.org/wiki/Genome_project

Theory – Microarray Data

1. Target features created on a surface
2. Labeled material hybridized
3. Image analysis

ProbeID    Sample1  Sample2  Sample3  Sample4
1007_s_at  10.93    11.44    11.19    11.64
1053_at    8.28     7.54     8.06     7.32
117_at     3.31     3.41     3.13     3.13
121_at     4.42     4.32     4.46     4.63
1255_g_at  1.8      1.7      1.75     1.81

Used for:
• Gene expression analysis
• Polymorphism detection
• Copy number analysis
• DNA binding studies

Data is gathered about the features present on the array.
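
A probe-by-sample table like the one on this slide can be read with a few lines of Python. The sketch below assumes the table is stored as tab-separated text (the slide does not say how the file is delimited) and uses a made-up helper name, read_expression. It computes the mean value per probe purely as an example of a per-feature summary.

```python
import csv
import io
from statistics import mean

# The example table from the slide, written out as tab-separated text
# (the delimiter is an assumption).
TABLE = (
    "ProbeID\tSample1\tSample2\tSample3\tSample4\n"
    "1007_s_at\t10.93\t11.44\t11.19\t11.64\n"
    "1053_at\t8.28\t7.54\t8.06\t7.32\n"
    "117_at\t3.31\t3.41\t3.13\t3.13\n"
    "121_at\t4.42\t4.32\t4.46\t4.63\n"
    "1255_g_at\t1.8\t1.7\t1.75\t1.81\n"
)


def read_expression(handle):
    """Return (sample_names, {probe_id: [value per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    values = {row[0]: [float(x) for x in row[1:]] for row in reader if row}
    return samples, values


samples, expression = read_expression(io.StringIO(TABLE))

# Average each probe across all samples.
probe_means = {probe: mean(vals) for probe, vals in expression.items()}
for probe, avg in probe_means.items():
    print(f"{probe}\t{avg:.2f}")
```

For a real file, the io.StringIO(TABLE) argument would be replaced by an open file handle.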

Theory – Next Generation Sequencing (NGS)

1. Generate DNA fragments
2. Attach to surface and amplify in situ
3. Subject surface to cycles of imaging/chemistry
4. Image analysis to call base sequences and qualities

Used for:
• Gene expression analysis
• Polymorphism/Mutation detection
• Copy number analysis
• Mixture quantification
• DNA or RNA binding studies
• others…

200+ million clusters per experiment

Data is gathered about everything in the input mixture.

Theory – NGS Alignment Files

SAM Format

Query          Flag  Reference  Position  MapQual  CIGAR  Sequence       Base Quality
2:75:1538:897  16    chr1       8291      0        60M    AGGCCAGGCCCTC  HHHHHGGH@HGHHHHH
4:31:101:1130  16    chr1       8328      1        60M    CACCTACTTGCCA  ################

• Each line has a lot of information (not all columns are shown)
• One experiment = millions of lines = many GB of data
• Scale of the data causes problems with Excel etc.
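
Because a SAM file is too large to open in a spreadsheet, one common approach is to read it one line at a time. The sketch below is a minimal example of that idea, not a replacement for a dedicated tool. The names SamRecord, parse_sam, is_reverse and phred_scores are made up for illustration. The two example records come from the slide, with the three columns it omits (mate reference, mate position, template length) filled with the placeholder values "*", 0 and 0, which is an assumption. The sequence and quality strings are truncated on the slide and are used here as shown.

```python
import io
from typing import Iterator, NamedTuple


class SamRecord(NamedTuple):
    """The eight SAM columns shown on the slide (hypothetical type)."""
    query: str
    flag: int
    reference: str
    position: int
    mapq: int
    cigar: str
    sequence: str
    quality: str


def parse_sam(handle) -> Iterator[SamRecord]:
    """Yield one record per alignment line, never holding the whole file."""
    for line in handle:
        if line.startswith("@"):      # header lines start with '@'
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:               # SAM has 11 mandatory columns
            continue
        # Columns 7-9 (mate reference, mate position, template length)
        # are skipped, as on the slide.
        yield SamRecord(f[0], int(f[1]), f[2], int(f[3]),
                        int(f[4]), f[5], f[9], f[10])


def is_reverse(rec: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


def phred_scores(quality: str) -> list:
    """Decode SAM base qualities, stored as ASCII code minus 33."""
    return [ord(c) - 33 for c in quality]


# Slide records; "*", 0, 0 are assumed placeholders for omitted columns.
EXAMPLE = (
    "2:75:1538:897\t16\tchr1\t8291\t0\t60M\t*\t0\t0\t"
    "AGGCCAGGCCCTC\tHHHHHGGH@HGHHHHH\n"
    "4:31:101:1130\t16\tchr1\t8328\t1\t60M\t*\t0\t0\t"
    "CACCTACTTGCCA\t################\n"
)

records = list(parse_sam(io.StringIO(EXAMPLE)))
for rec in records:
    strand = "-" if is_reverse(rec) else "+"
    print(rec.query, rec.reference, rec.position, strand,
          min(phred_scores(rec.quality)))
```

On a real file, parse_sam would be given an open file handle and the records consumed in a loop rather than collected into a list, so memory use stays small no matter how many lines the file has.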
