Practically Genomic
A hands-on bioinformatics IAP
Course Materials:
http://rous.mit.edu/index.php/IAP_2012
Instructors:
Paola Favaretto, Sebastian Hoersch, Charlie Whittaker and Courtney Crummett
KI for Integrative Cancer Research at MIT and MIT Libraries
Students - Wide range of experience levels
Unix account access information will be provided
Evaluations - Please send comments to
Turning Biologists into Bioinformaticists - A practical approach
The teaching material should:
• be modular and practical
• have obvious contextual relevance
• serve as readily accessible and easily used reference materials
The students should:
• become aware of the contents of a basic bioinformatics toolkit
• learn how to find instructions covering tools and methods
• experiment with different methods covered in classes
• gain familiarity and comfort with command-line computing
The target audience is KI biologists
Turning Biologists into Bioinformaticists - A practical approach – the specifics
1. Theory - Core Bioinformatics Concepts
Important principles required to use bioinformatics
2. Tools - A Basic Bioinformatics Toolkit
The software of bioinformatics
3. Tasks - Bioinformatics Methods
Data analysis with bioinformatics
Under Development!
http://rous.mit.edu/index.php/Teaching
IAP 2012 Agenda (subject to change)
1-23-12  Introduction; Getting more from Excel; Unix Introduction
1-25-12  Next Generation Sequence Analysis with Unix and Galaxy
1-27-12  Visualization and Analysis of Genomics Data
rous.mit.edu
Theory – Genomic Data
All kinds of genomics data are described using at least 4 pieces of information:
1) The name of a DNA sequence
2) A position on that sequence
3) A feature that exists at that position
4) The genome assembly version
Sequence1     Position   Feature
Chromosome1   1314       Mutation
• Sequence 1 is a long block of sequence arranged by a process called genome assembly.
• The assembly version is critical because the first 3 pieces of information described above are only meaningful for one specific assembly version. In a new version of the genome, this mutation will probably not be at position 1314; it will be located elsewhere.
BED, GFF, GTF formats
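As a rough illustration of the four pieces of information above, the following Python sketch stores the example mutation as a record and writes it as a minimal BED-style line. The class name GenomicFeature, the helper to_bed_line, and the assembly label "hg19" are assumptions made for this example only; the slide does not name an assembly. BED intervals are 0-based and half-open, so the 1-based position 1314 becomes start 1313, end 1314.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicFeature:
    """One feature on one sequence of one assembly (hypothetical helper)."""
    sequence: str   # name of the DNA sequence, e.g. a chromosome
    position: int   # 1-based position on that sequence
    feature: str    # what exists at that position
    assembly: str   # genome assembly version the position refers to


def to_bed_line(f: GenomicFeature) -> str:
    """Format a single-base feature as a 4-column BED line.

    BED intervals are 0-based and half-open, so a 1-based position p
    is written as start = p - 1, end = p.
    """
    return "\t".join([f.sequence, str(f.position - 1), str(f.position), f.feature])


# "hg19" is an assumed assembly label used only for illustration.
mutation = GenomicFeature("Chromosome1", 1314, "Mutation", "hg19")
bed_line = to_bed_line(mutation)
print(bed_line)
```

Note that the BED line itself carries no assembly version, so that fourth piece of information has to be tracked alongside the file.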
Theory – Microarray Data
1. Target features created on a surface
2. Labeled material hybridized
3. Image analysis
ProbeID     Sample1  Sample2  Sample3  Sample4
1007_s_at   10.93    11.44    11.19    11.64
1053_at     8.28     7.54     8.06     7.32
117_at      3.31     3.41     3.13     3.13
121_at      4.42     4.32     4.46     4.63
1255_g_at   1.8      1.7      1.75     1.81
Used for:
• Gene expression analysis
• Polymorphism detection
• Copy number analysis
• DNA binding studies
Data is gathered about the features present on the array.
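A table like the one above can also be read outside a spreadsheet. The sketch below is one possible way to do that in Python, assuming the table is stored as tab-separated text with a ProbeID column; the function name read_expression and the per-probe mean are illustrative choices, not part of the course material.

```python
import csv
import io
from statistics import mean

# Values copied from the example table above, tab-separated.
TABLE = (
    "ProbeID\tSample1\tSample2\tSample3\tSample4\n"
    "1007_s_at\t10.93\t11.44\t11.19\t11.64\n"
    "1053_at\t8.28\t7.54\t8.06\t7.32\n"
    "117_at\t3.31\t3.41\t3.13\t3.13\n"
    "121_at\t4.42\t4.32\t4.46\t4.63\n"
    "1255_g_at\t1.8\t1.7\t1.75\t1.81\n"
)


def read_expression(text):
    """Return {probe_id: {sample_name: value}} from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    data = {}
    for row in reader:
        probe = row.pop("ProbeID")  # remaining keys are sample names
        data[probe] = {sample: float(value) for sample, value in row.items()}
    return data


expression = read_expression(TABLE)

# Average signal per probe across the four samples.
probe_means = {probe: mean(values.values()) for probe, values in expression.items()}
for probe, avg in probe_means.items():
    print(f"{probe}\t{avg:.2f}")
```

For a real file, the io.StringIO wrapper would be replaced by an open file handle.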
Theory – Next Generation Sequencing (NGS)
1. Generate DNA fragments
2. Attach to surface and amplify in situ.
3. Subject surface to cycles of imaging/chemistry.
4. Image analysis to call base sequences and qualities
Used for:
• Gene expression analysis
• Polymorphism/Mutation detection
• Copy number analysis
• Mixture Quantization
• DNA or RNA binding studies
• others…
200+ million clusters per experiment
Data is gathered about everything in the input mixture.
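Base qualities are conventionally reported as Phred scores, where Q = -10 * log10(p) and p is the estimated probability that a base call is wrong. The sketch below converts between the two and decodes a quality string, assuming the Phred+33 ASCII encoding that the SAM format uses; the function names are illustrative.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_qualities(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores.

    offset=33 is an assumption matching the SAM convention; older
    sequencer output sometimes used a different offset.
    """
    return [ord(char) - offset for char in qual]


scores = decode_qualities("HG@#")
print(scores)
print([round(phred_to_error_prob(q), 5) for q in scores])
```

Under this encoding, "H" corresponds to Q39 (a very confident call) and "#" to Q2 (essentially no confidence).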
Theory – NGS Alignment Files
Query          Flag  Reference  Position  MapQual  CIGAR  Sequence       Base Quality
2:75:1538:897  16    chr1       8291      0        60M    AGGCCAGGCCCTC  HHHHHGGH@HGHHHHH
4:31:101:1130  16    chr1       8328      1        60M    CACCTACTTGCCA  ################
SAM Format
• Each line has a lot of information (not all columns are shown)
• One experiment = millions of lines = many Gb of data
• Scale of the data causes problems with Excel etc.
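Because files of this size do not fit spreadsheet tools, they are usually processed one line at a time. The following is a minimal Python sketch, assuming tab-separated SAM records with the eleven mandatory columns. The names parse_sam_line and count_by_reference are illustrative, and the example records reuse the values shown above with placeholder values ("*", 0, 0) for the three columns the slide omits and with the shortened sequence and quality strings from the slide. In practice a dedicated tool such as samtools would normally be used.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one SAM alignment line into the columns shown above."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "query": fields[0],
        "flag": flag,
        "reference": fields[2],
        "position": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "sequence": fields[9],
        "base_quality": fields[10],
        # Bit 0x10 of the flag marks a read aligned to the reverse strand.
        "reverse_strand": bool(flag & 0x10),
    }


def count_by_reference(lines: Iterable[str]) -> Counter:
    """Count alignments per reference sequence, reading one line at a time."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):      # header lines start with "@"
            continue
        counts[parse_sam_line(line)["reference"]] += 1
    return counts


# Columns 7-9 ("*", 0, 0) are placeholders; the slide does not show them.
RECORDS = [
    "2:75:1538:897\t16\tchr1\t8291\t0\t60M\t*\t0\t0\tAGGCCAGGCCCTC\tHHHHHGGH@HGHHHHH",
    "4:31:101:1130\t16\tchr1\t8328\t1\t60M\t*\t0\t0\tCACCTACTTGCCA\t################",
]

first = parse_sam_line(RECORDS[0])
reference_counts = count_by_reference(RECORDS)
print(first["reference"], first["position"], first["reverse_strand"])
print(reference_counts)
```

Passing an open file handle instead of RECORDS keeps memory use flat, since only one line is held at a time.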